* [PATCH 00/14] hmm: fixes and documentations v2
From: jglisse @ 2018-03-16 19:14 UTC
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Ralph Campbell, Evgeny Baskakov, Mark Hairgrove, John Hubbard

From: Jérôme Glisse <jglisse@redhat.com>

Removed a pointless VM_BUG_ON(), cc'ed stable when appropriate, and split
the last patch into _many_ smaller patches to make them easier to review.
The end result is the same modulo the comments I have received so far and
the extra documentation I added while splitting things up. Below is the
previous cover letter (everything in it is still true):

----------------------------------------------------------------------

All patches only impact HMM users; there are no implications outside HMM.

The first patch improves the documentation to better reflect what HMM is. The
second patch fixes #if/#else placement in hmm.h. The third patch adds a
callback on mm release, which helps device drivers that use HMM clean up early
when a process quits. Finally, the last patch modifies the CPU snapshot and
page fault helpers to simplify device drivers. The nouveau patchset I posted
last week already depends on all of those patches.
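
As background for the third patch, here is a minimal sketch of what such a
release hook can look like from the driver side. Only the release callback
itself is what the series adds; the my_device_ctx struct and the my_device_*()
helpers are hypothetical driver internals:

 /* Sketch only: clean up device state early when the mm goes away.
  * my_device_ctx and the my_device_*() helpers are illustrative.
  */
 static void my_driver_release(struct hmm_mirror *mirror)
 {
     struct my_device_ctx *ctx =
         container_of(mirror, struct my_device_ctx, mirror);

     /* The process is exiting: stop DMA and drop the device page
      * table now instead of waiting for the device file release.
      */
     my_device_stop_dma(ctx);
     my_device_free_pagetable(ctx);
 }

 static const struct hmm_mirror_ops my_mirror_ops = {
     .release = my_driver_release,
     /* .sync_cpu_device_pagetables = my_sync_cpu_device_pagetables, */
 };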

You can find them in the hmm-for-4.17 branch:

git://people.freedesktop.org/~glisse/linux
https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-for-4.17

Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>

Jérôme Glisse (12):
  mm/hmm: fix header file if/else/endif maze
  mm/hmm: hmm_pfns_bad() was accessing wrong struct
  mm/hmm: use struct for hmm_vma_fault(), hmm_vma_get_pfns() parameters
  mm/hmm: remove HMM_PFN_READ flag and ignore peculiar architecture
  mm/hmm: use uint64_t for HMM pfn instead of defining hmm_pfn_t to
    ulong
  mm/hmm: cleanup special vma handling (VM_SPECIAL)
  mm/hmm: do not differentiate between empty entry or missing directory
  mm/hmm: rename HMM_PFN_DEVICE_UNADDRESSABLE to HMM_PFN_DEVICE_PRIVATE
  mm/hmm: move hmm_pfns_clear() closer to where it is used
  mm/hmm: factor out pte and pmd handling to simplify hmm_vma_walk_pmd()
  mm/hmm: change hmm_vma_fault() to allow write fault on page basis
  mm/hmm: use device driver encoding for HMM pfn

Ralph Campbell (2):
  mm/hmm: documentation editorial update to HMM documentation
  mm/hmm: HMM should have a callback before MM is destroyed v2

 Documentation/vm/hmm.txt | 360 +++++++++++++++++-----------------
 MAINTAINERS              |   1 +
 include/linux/hmm.h      | 156 ++++++++-------
 mm/hmm.c                 | 495 +++++++++++++++++++++++++++++------------------
 4 files changed, 582 insertions(+), 430 deletions(-)

-- 
2.14.3



* [PATCH 01/14] mm/hmm: documentation editorial update to HMM documentation
From: jglisse @ 2018-03-16 19:14 UTC
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Ralph Campbell,
	Jérôme Glisse, Stephen Bates, Jason Gunthorpe,
	Logan Gunthorpe, Evgeny Baskakov, Mark Hairgrove, John Hubbard

From: Ralph Campbell <rcampbell@nvidia.com>

This patch updates the documentation for HMM to fix minor typos and improve
the phrasing to be a bit more readable.

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Stephen Bates <sbates@raithlin.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 Documentation/vm/hmm.txt | 360 ++++++++++++++++++++++++-----------------------
 MAINTAINERS              |   1 +
 2 files changed, 187 insertions(+), 174 deletions(-)

diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.txt
index 4d3aac9f4a5d..c7418c56a0ac 100644
--- a/Documentation/vm/hmm.txt
+++ b/Documentation/vm/hmm.txt
@@ -1,151 +1,159 @@
 Heterogeneous Memory Management (HMM)
 
-Transparently allow any component of a program to use any memory region of said
-program with a device without using device specific memory allocator. This is
-becoming a requirement to simplify the use of advance heterogeneous computing
-where GPU, DSP or FPGA are use to perform various computations.
-
-This document is divided as follow, in the first section i expose the problems
-related to the use of a device specific allocator. The second section i expose
-the hardware limitations that are inherent to many platforms. The third section
-gives an overview of HMM designs. The fourth section explains how CPU page-
-table mirroring works and what is HMM purpose in this context. Fifth section
-deals with how device memory is represented inside the kernel. Finaly the last
-section present the new migration helper that allow to leverage the device DMA
-engine.
-
-
-1) Problems of using device specific memory allocator:
-2) System bus, device memory characteristics
-3) Share address space and migration
+Provide infrastructure and helpers to integrate non-conventional memory
+(device memory like GPU on-board memory) into regular kernel code paths. The
+cornerstone of this is a specialized struct page for such memory (see sections
+5 to 7 of this document).
+
+HMM also provides optional helpers for SVM (Shared Virtual Memory), i.e.,
+allowing a device to transparently access program addresses coherently with
+the CPU, meaning that any valid pointer on the CPU is also a valid pointer for
+the device. This is becoming mandatory to simplify the use of advanced
+heterogeneous computing where GPUs, DSPs, or FPGAs are used to perform various
+computations on behalf of a process.
+
+This document is divided as follows: in the first section I expose the problems
+related to using device specific memory allocators. In the second section, I
+expose the hardware limitations that are inherent to many platforms. The third
+section gives an overview of the HMM design. The fourth section explains how
+CPU page-table mirroring works and what is HMM's purpose in this context. The
+fifth section deals with how device memory is represented inside the kernel.
+Finally, the last section presents a new migration helper that allows
+leveraging the device DMA engine.
+
+
+1) Problems of using a device specific memory allocator:
+2) I/O bus, device memory characteristics
+3) Shared address space and migration
 4) Address space mirroring implementation and API
 5) Represent and manage device memory from core kernel point of view
-6) Migrate to and from device memory
+6) Migration to and from device memory
 7) Memory cgroup (memcg) and rss accounting
 
 
 -------------------------------------------------------------------------------
 
-1) Problems of using device specific memory allocator:
+1) Problems of using a device specific memory allocator:
 
-Device with large amount of on board memory (several giga bytes) like GPU have
-historically manage their memory through dedicated driver specific API. This
-creates a disconnect between memory allocated and managed by device driver and
-regular application memory (private anonymous, share memory or regular file
-back memory). From here on i will refer to this aspect as split address space.
-I use share address space to refer to the opposite situation ie one in which
-any memory region can be use by device transparently.
+Devices with a large amount of on-board memory (several gigabytes) like GPUs
+have historically managed their memory through dedicated driver specific APIs.
+This creates a disconnect between memory allocated and managed by a device
+driver and regular application memory (private anonymous, shared memory, or
+regular file backed memory). From here on I will refer to this aspect as split
+address space. I use shared address space to refer to the opposite situation:
+i.e., one in which any application memory region can be used by a device
+transparently.
 
 Split address space because device can only access memory allocated through the
-device specific API. This imply that all memory object in a program are not
-equal from device point of view which complicate large program that rely on a
-wide set of libraries.
+device specific API. This implies that all memory objects in a program are not
+equal from the device point of view which complicates large programs that rely
+on a wide set of libraries.
 
-Concretly this means that code that wants to leverage device like GPU need to
+Concretely, this means that code that wants to leverage devices like GPUs needs to
 copy object between genericly allocated memory (malloc, mmap private/share/)
 and memory allocated through the device driver API (this still end up with an
 mmap but of the device file).
 
-For flat dataset (array, grid, image, ...) this isn't too hard to achieve but
-complex data-set (list, tree, ...) are hard to get right. Duplicating a complex
-data-set need to re-map all the pointer relations between each of its elements.
-This is error prone and program gets harder to debug because of the duplicate
-data-set.
+For flat data-sets (array, grid, image, ...) this isn't too hard to achieve but
+complex data-sets (list, tree, ...) are hard to get right. Duplicating a
+complex data-set requires re-mapping all the pointer relations between its
+elements. This is error prone, and the program gets harder to debug because of
+the duplicated data-set and addresses.
 
-Split address space also means that library can not transparently use data they
-are getting from core program or other library and thus each library might have
-to duplicate its input data-set using specific memory allocator. Large project
-suffer from this and waste resources because of the various memory copy.
+Split address space also means that libraries can not transparently use data
+they are getting from the core program or another library and thus each library
+might have to duplicate its input data-set using the device specific memory
+allocator. Large projects suffer from this and waste resources because of the
+various memory copies.
 
 Duplicating each library API to accept as input or output memory allocated by
 each device specific allocator is not a viable option. It would lead to a
-combinatorial explosions in the library entry points.
+combinatorial explosion in the library entry points.
 
-Finaly with the advance of high level language constructs (in C++ but in other
-language too) it is now possible for compiler to leverage GPU or other devices
-without even the programmer knowledge. Some of compiler identified patterns are
-only do-able with a share address. It is as well more reasonable to use a share
-address space for all the other patterns.
+Finally, with the advance of high level language constructs (in C++ but in
+other languages too) it is now possible for the compiler to leverage GPUs and
+other devices without programmer knowledge. Some compiler identified patterns
+are only do-able with a shared address space. It is also more reasonable to use
+a shared address space for all other patterns.
 
 
 -------------------------------------------------------------------------------
 
-2) System bus, device memory characteristics
+2) I/O bus, device memory characteristics
 
-System bus cripple share address due to few limitations. Most system bus only
+I/O buses cripple shared address spaces due to a few limitations. Most only
 allow basic memory access from device to main memory, even cache coherency is
-often optional. Access to device memory from CPU is even more limited, most
-often than not it is not cache coherent.
+often optional. Access to device memory from the CPU is even more limited.
+More often than not, it is not cache coherent.
 
-If we only consider the PCIE bus than device can access main memory (often
-through an IOMMU) and be cache coherent with the CPUs. However it only allows
-a limited set of atomic operation from device on main memory. This is worse
-in the other direction the CPUs can only access a limited range of the device
+If we only consider the PCIE bus, then a device can access main memory (often
+through an IOMMU) and be cache coherent with the CPUs. However, it only allows
+a limited set of atomic operations from the device on main memory. This is
+worse in the other direction: the CPU can only access a limited range of device
 memory and can not perform atomic operations on it. Thus device memory can not
-be consider like regular memory from kernel point of view.
+be considered the same as regular memory from the kernel point of view.
 
 Another crippling factor is the limited bandwidth (~32GBytes/s with PCIE 4.0
-and 16 lanes). This is 33 times less that fastest GPU memory (1 TBytes/s).
-The final limitation is latency, access to main memory from the device has an
-order of magnitude higher latency than when the device access its own memory.
+and 16 lanes). This is 33 times less than the fastest GPU memory (1 TBytes/s).
+The final limitation is latency. Access to main memory from the device has an
+order of magnitude higher latency than when the device accesses its own memory.
 
-Some platform are developing new system bus or additions/modifications to PCIE
-to address some of those limitations (OpenCAPI, CCIX). They mainly allow two
+Some platforms are developing new I/O buses or additions/modifications to PCIE
+to address some of these limitations (OpenCAPI, CCIX). They mainly allow two
 way cache coherency between CPU and device and allow all atomic operations the
-architecture supports. Saddly not all platform are following this trends and
-some major architecture are left without hardware solutions to those problems.
+architecture supports. Sadly, not all platforms are following this trend and
+some major architectures are left without hardware solutions to these problems.
 
-So for share address space to make sense not only we must allow device to
+So for shared address space to make sense, not only must we allow the device to
 access any memory, but we must also permit any memory to be migrated to
 device memory while device is using it (blocking CPU access while it happens).
 
 
 -------------------------------------------------------------------------------
 
-3) Share address space and migration
+3) Shared address space and migration
 
 HMM intends to provide two main features. First one is to share the address
-space by duplication the CPU page table into the device page table so same
-address point to same memory and this for any valid main memory address in
+space by duplicating the CPU page table in the device page table so the same
+address points to the same physical memory for any valid main memory address in
 the process address space.
 
-To achieve this, HMM offer a set of helpers to populate the device page table
+To achieve this, HMM offers a set of helpers to populate the device page table
 while keeping track of CPU page table updates. Device page table updates are
-not as easy as CPU page table updates. To update the device page table you must
-allow a buffer (or use a pool of pre-allocated buffer) and write GPU specifics
-commands in it to perform the update (unmap, cache invalidations and flush,
-...). This can not be done through common code for all device. Hence why HMM
-provides helpers to factor out everything that can be while leaving the gory
-details to the device driver.
-
-The second mechanism HMM provide is a new kind of ZONE_DEVICE memory that does
-allow to allocate a struct page for each page of the device memory. Those page
-are special because the CPU can not map them. They however allow to migrate
-main memory to device memory using exhisting migration mechanism and everything
-looks like if page was swap out to disk from CPU point of view. Using a struct
-page gives the easiest and cleanest integration with existing mm mechanisms.
-Again here HMM only provide helpers, first to hotplug new ZONE_DEVICE memory
-for the device memory and second to perform migration. Policy decision of what
-and when to migrate things is left to the device driver.
-
-Note that any CPU access to a device page trigger a page fault and a migration
-back to main memory ie when a page backing an given address A is migrated from
-a main memory page to a device page then any CPU access to address A trigger a
-page fault and initiate a migration back to main memory.
-
-
-With this two features, HMM not only allow a device to mirror a process address
-space and keeps both CPU and device page table synchronize, but also allow to
-leverage device memory by migrating part of data-set that is actively use by a
-device.
+not as easy as CPU page table updates. To update the device page table, you
+must allocate a buffer (or use a pool of pre-allocated buffers) and write
+GPU-specific commands in it to perform the update (unmap, cache invalidations,
+flushes, ...). This can not be done through common code for all devices. This
+is why HMM provides helpers to factor out everything that can be shared, while
+leaving the hardware-specific details to the device driver.
+
+The second mechanism HMM provides is a new kind of ZONE_DEVICE memory that
+allows allocating a struct page for each page of the device memory. Those pages
+are special because the CPU can not map them. However, they allow migrating
+main memory to device memory using existing migration mechanisms; everything
+looks as if the page were swapped out to disk from the CPU point of view. Using
+a struct page gives the easiest and cleanest integration with existing mm
+mechanisms. Here again, HMM only provides helpers, first to hotplug new
+ZONE_DEVICE memory for the device memory and second to perform migration.
+Policy decisions of what and when to migrate things are left to the device
+driver.
+
+Note that any CPU access to a device page triggers a page fault and a migration
+back to main memory. For example, when a page backing a given CPU address A is
+migrated from a main memory page to a device page, then any CPU access to
+address A triggers a page fault and initiates a migration back to main memory.
+
+With these two features, HMM not only allows a device to mirror a process
+address space, keeping both the CPU and device page tables synchronized, but
+also allows leveraging device memory by migrating the part of the data-set
+that is actively being used by the device.
 
 
 -------------------------------------------------------------------------------
 
 4) Address space mirroring implementation and API
 
-Address space mirroring main objective is to allow to duplicate range of CPU
-page table into a device page table and HMM helps keeping both synchronize. A
+Address space mirroring's main objective is to allow duplication of a range of
+CPU page table into a device page table; HMM helps keep both synchronized. A
 device driver that wants to mirror a process address space must start with the
 registration of an hmm_mirror struct:
 
@@ -155,8 +163,8 @@ device driver that want to mirror a process address space must start with the
                                 struct mm_struct *mm);
 
 The locked variant is to be used when the driver is already holding the mmap_sem
-of the mm in write mode. The mirror struct has a set of callback that are use
-to propagate CPU page table:
+of the mm in write mode. The mirror struct has a set of callbacks that are used
+to propagate CPU page tables:
 
  struct hmm_mirror_ops {
      /* sync_cpu_device_pagetables() - synchronize page tables
@@ -181,13 +189,13 @@ of the mm in write mode. The mirror struct has a set of callback that are use
                      unsigned long end);
  };
 
-Device driver must perform update to the range following action (turn range
-read only, or fully unmap, ...). Once driver callback returns the device must
-be done with the update.
+The device driver must perform the update action to the range (mark range
+read only, or fully unmap, ...). The device must be done with the update before
+the driver callback returns.
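
As an illustration of this rule, a driver's callback might look like the
sketch below. The signature is abridged from the struct above; treat the
update type argument as illustrative, and the my_device_*() helpers as
hypothetical driver internals:

 /* Sketch only: invalidate the device page table for [start, end)
  * and wait for the device before returning, as required above.
  */
 static void my_sync_cpu_device_pagetables(struct hmm_mirror *mirror,
                                           enum hmm_update_type update,
                                           unsigned long start,
                                           unsigned long end)
 {
     struct my_device_ctx *ctx =
         container_of(mirror, struct my_device_ctx, mirror);

     /* Queue unmap/invalidation commands for the range... */
     my_device_invalidate_range(ctx, start, end);
     /* ...and block until the device has executed them. */
     my_device_wait_idle(ctx);
 }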
 
 
-When device driver wants to populate a range of virtual address it can use
-either:
+When the device driver wants to populate a range of virtual addresses, it can
+use either:
  int hmm_vma_get_pfns(struct vm_area_struct *vma,
                       struct hmm_range *range,
                       unsigned long start,
@@ -201,17 +209,19 @@ When device driver wants to populate a range of virtual address it can use
                    bool write,
                    bool block);
 
-First one (hmm_vma_get_pfns()) will only fetch present CPU page table entry and
-will not trigger a page fault on missing or non present entry. The second one
-do trigger page fault on missing or read only entry if write parameter is true.
-Page fault use the generic mm page fault code path just like a CPU page fault.
+The first one (hmm_vma_get_pfns()) will only fetch present CPU page table
+entries and will not trigger a page fault on missing or non-present entries.
+The second one does trigger a page fault on missing or read-only entries if the
+write parameter is true. Page faults use the generic mm page fault code path
+just like a CPU page fault.
 
-Both function copy CPU page table into their pfns array argument. Each entry in
-that array correspond to an address in the virtual range. HMM provide a set of
-flags to help driver identify special CPU page table entries.
+Both functions copy CPU page table entries into their pfns array argument. Each
+entry in that array corresponds to an address in the virtual range. HMM
+provides a set of flags to help the driver identify special CPU page table
+entries.
 
 Locking with the update() callback is the most important aspect the driver must
-respect in order to keep things properly synchronize. The usage pattern is :
+respect in order to keep things properly synchronized. The usage pattern is:
 
  int driver_populate_range(...)
  {
@@ -233,43 +243,44 @@ Locking with the update() callback is the most important aspect the driver must
       return 0;
  }
 
-The driver->update lock is the same lock that driver takes inside its update()
-callback. That lock must be call before hmm_vma_range_done() to avoid any race
-with a concurrent CPU page table update.
+The driver->update lock is the same lock that the driver takes inside its
+update() callback. That lock must be held before hmm_vma_range_done() to avoid
+any race with a concurrent CPU page table update.
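
Because the diff context elides the body of driver_populate_range(), here is
a reconstruction of the usage pattern being described; take_lock() and
release_lock() stand in for whatever lock the driver's update() callback
takes:

 int driver_populate_range(...)
 {
      struct hmm_range range;
      ...
 again:
      ret = hmm_vma_get_pfns(vma, &range, start, end, pfns);
      if (ret)
          return ret;
      take_lock(driver->update);
      if (!hmm_vma_range_done(vma, &range)) {
          release_lock(driver->update);
          goto again;
      }
      /* Use pfns array content to update the device page table. */
      release_lock(driver->update);
      return 0;
 }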
 
-HMM implements all this on top of the mmu_notifier API because we wanted to a
-simpler API and also to be able to perform optimization latter own like doing
-concurrent device update in multi-devices scenario.
+HMM implements all this on top of the mmu_notifier API because we wanted a
+simpler API and also to be able to perform optimizations later on, like doing
+concurrent device updates in multi-device scenarios.
 
-HMM also serve as an impedence missmatch between how CPU page table update are
-done (by CPU write to the page table and TLB flushes) from how device update
-their own page table. Device update is a multi-step process, first appropriate
-commands are write to a buffer, then this buffer is schedule for execution on
-the device. It is only once the device has executed commands in the buffer that
-the update is done. Creating and scheduling update command buffer can happen
-concurrently for multiple devices. Waiting for each device to report commands
-as executed is serialize (there is no point in doing this concurrently).
+HMM also serves as an impedance mismatch between how CPU page table updates
+are done (by CPU write to the page table and TLB flushes) and how devices
+update their own page table. Device updates are a multi-step process. First,
+appropriate commands are written to a buffer, then this buffer is scheduled for
+execution on the device. It is only once the device has executed commands in
+the buffer that the update is done. Creating and scheduling the update command
+buffer can happen concurrently for multiple devices. Waiting for each device to
+report commands as executed is serialized (there is no point in doing this
+concurrently).
 
 
 -------------------------------------------------------------------------------
 
 5) Represent and manage device memory from core kernel point of view
 
-Several differents design were try to support device memory. First one use
-device specific data structure to keep information about migrated memory and
-HMM hooked itself in various place of mm code to handle any access to address
-that were back by device memory. It turns out that this ended up replicating
-most of the fields of struct page and also needed many kernel code path to be
-updated to understand this new kind of memory.
+Several different designs were tried to support device memory. The first used
+a device specific data structure to keep information about migrated memory and
+HMM hooked itself in various places of mm code to handle any access to
+addresses that were backed by device memory. It turns out that this ended up
+replicating most of the fields of struct page and also needed many kernel code
+paths to be updated to understand this new kind of memory.
 
-Thing is most kernel code path never try to access the memory behind a page
-but only care about struct page contents. Because of this HMM switchted to
-directly using struct page for device memory which left most kernel code path
-un-aware of the difference. We only need to make sure that no one ever try to
-map those page from the CPU side.
+Most kernel code paths never try to access the memory behind a page
+but only care about struct page contents. Because of this, HMM switched to
+directly using struct page for device memory which left most kernel code paths
+unaware of the difference. We only need to make sure that no one ever tries to
+map those pages from the CPU side.
 
-HMM provide a set of helpers to register and hotplug device memory as a new
-region needing struct page. This is offer through a very simple API:
+HMM provides a set of helpers to register and hotplug device memory as a new
+region needing a struct page. This is offered through a very simple API:
 
  struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
                                    struct device *device,
@@ -289,18 +300,19 @@ HMM provide a set of helpers to register and hotplug device memory as a new
  };
 
 The first callback (free()) happens when the last reference on a device page is
-drop. This means the device page is now free and no longer use by anyone. The
-second callback happens whenever CPU try to access a device page which it can
-not do. This second callback must trigger a migration back to system memory.
+dropped. This means the device page is now free and no longer used by anyone.
+The second callback happens whenever the CPU tries to access a device page
+which it can not do. This second callback must trigger a migration back to
+system memory.
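
A skeleton of these two callbacks might look as follows. The ops layout
follows the struct described above; the my_device_*() and my_migrate_*()
helpers are hypothetical stand-ins for driver internals:

 static void my_devmem_free(struct hmm_devmem *devmem, struct page *page)
 {
     /* Last reference dropped: return the page to the driver's
      * device memory allocator.
      */
     my_device_free_page(devmem, page);
 }

 static int my_devmem_fault(struct hmm_devmem *devmem,
                            struct vm_area_struct *vma,
                            unsigned long addr,
                            const struct page *page,
                            unsigned int flags,
                            pmd_t *pmdp)
 {
     /* The CPU touched a device page it can not map: migrate it
      * back to system memory, typically via migrate_vma().
      */
     return my_migrate_back_to_ram(devmem, vma, addr, page, pmdp);
 }

 static const struct hmm_devmem_ops my_devmem_ops = {
     .free  = my_devmem_free,
     .fault = my_devmem_fault,
 };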
 
 
 -------------------------------------------------------------------------------
 
-6) Migrate to and from device memory
+6) Migration to and from device memory
 
-Because CPU can not access device memory, migration must use device DMA engine
-to perform copy from and to device memory. For this we need a new migration
-helper:
+Because the CPU can not access device memory, migration must use the device DMA
+engine to perform copies to and from device memory. For this we need a new
+migration helper:
 
  int migrate_vma(const struct migrate_vma_ops *ops,
                  struct vm_area_struct *vma,
@@ -311,15 +323,15 @@ to perform copy from and to device memory. For this we need a new migration
                  unsigned long *dst,
                  void *private);
 
-Unlike other migration function it works on a range of virtual address, there
-is two reasons for that. First device DMA copy has a high setup overhead cost
+Unlike other migration functions, it works on a range of virtual addresses;
+there are two reasons for that. First, device DMA copy has a high setup cost
 and thus batching multiple pages is needed as otherwise the migration overhead
-make the whole excersie pointless. The second reason is because driver trigger
-such migration base on range of address the device is actively accessing.
+makes the whole exercise pointless. The second reason is that the
+migration might be for a range of addresses the device is actively accessing.
 
-The migrate_vma_ops struct define two callbacks. First one (alloc_and_copy())
-control destination memory allocation and copy operation. Second one is there
-to allow device driver to perform cleanup operation after migration.
+The migrate_vma_ops struct defines two callbacks. The first one
+(alloc_and_copy()) controls destination memory allocation and the copy
+operation. The second one allows the device driver to perform cleanup
+operations after migration.
 
  struct migrate_vma_ops {
      void (*alloc_and_copy)(struct vm_area_struct *vma,
@@ -336,19 +348,19 @@ to allow device driver to perform cleanup operation after migration.
                               void *private);
  };
 
-It is important to stress that this migration helpers allow for hole in the
+It is important to stress that these migration helpers allow for holes in the
 virtual address range. Some pages in the range might not be migrated for all
-the usual reasons (page is pin, page is lock, ...). This helper does not fail
-but just skip over those pages.
+the usual reasons (page is pinned, page is locked, ...). This helper does not
+fail but just skips over those pages.
 
-The alloc_and_copy() might as well decide to not migrate all pages in the
-range (for reasons under the callback control). For those the callback just
-have to leave the corresponding dst entry empty.
+The alloc_and_copy() callback might decide not to migrate all pages in the
+range (for reasons under the callback's control). For those, the callback just
+has to leave the corresponding dst entry empty.
 
-Finaly the migration of the struct page might fails (for file back page) for
+Finally, the migration of the struct page might fail (for file-backed pages) for
 various reasons (failure to freeze reference, or update page cache, ...). If
-that happens then the finalize_and_map() can catch any pages that was not
-migrated. Note those page were still copied to new page and thus we wasted
+that happens, then the finalize_and_map() can catch any pages that were not
+migrated. Note those pages were still copied to a new page and thus we wasted
 bandwidth but this is considered as a rare event and a price that we are
 willing to pay to keep all the code simpler.
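
To make the "leave the dst entry empty to skip" rule concrete, here is a
sketch of an alloc_and_copy() implementation. The my_device_*() helpers are
hypothetical; the MIGRATE_PFN_* encoding and migrate_pfn()/
migrate_pfn_to_page() helpers are the ones migrate_vma() uses in this era:

 static void my_alloc_and_copy(struct vm_area_struct *vma,
                               const unsigned long *src,
                               unsigned long *dst,
                               unsigned long start,
                               unsigned long end,
                               void *private)
 {
     unsigned long addr, i;

     for (addr = start, i = 0; addr < end; addr += PAGE_SIZE, i++) {
         struct page *spage = migrate_pfn_to_page(src[i]);
         struct page *dpage;

         /* Hole: core mm could not isolate this page (pinned,
          * locked, ...); leave dst[i] empty so it is skipped.
          */
         if (!(src[i] & MIGRATE_PFN_MIGRATE))
             continue;

         /* Assumed to return a locked device page, or NULL. */
         dpage = my_device_alloc_page(private);
         if (!dpage)
             continue; /* skipped for driver reasons */

         /* Copy through the device DMA engine (spage may be NULL
          * for an empty CPU entry; zero-fill dpage in that case).
          */
         my_device_dma_copy(private, spage, dpage);
         dst[i] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
     }
 }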
 
@@ -358,27 +370,27 @@ willing to pay to keep all the code simpler.
 7) Memory cgroup (memcg) and rss accounting
 
 For now device memory is accounted as any regular page in rss counters (either
-anonymous if device page is use for anonymous, file if device page is use for
-file back page or shmem if device page is use for share memory). This is a
-deliberate choice to keep existing application that might start using device
-memory without knowing about it to keep runing unimpacted.
-
-Drawbacks is that OOM killer might kill an application using a lot of device
-memory and not a lot of regular system memory and thus not freeing much system
-memory. We want to gather more real world experience on how application and
-system react under memory pressure in the presence of device memory before
+anonymous if the device page is used for anonymous memory, file if the device
+page is used for a file-backed page, or shmem if the device page is used for
+shared memory). This is a deliberate choice to keep existing applications,
+which might start using device memory without knowing about it, running
+unimpacted.
+
+A drawback is that the OOM killer might kill an application using a lot of
+device memory and not a lot of regular system memory and thus not freeing much
+system memory. We want to gather more real world experience on how applications
+and systems react under memory pressure in the presence of device memory before
 deciding to account device memory differently.
 
 
-Same decision was made for memory cgroup. Device memory page are accounted
+The same decision was made for memory cgroups. Device memory pages are
+accounted against the same memory cgroup a regular page would be accounted to.
+This does
 simplify migration to and from device memory. This also means that migration
 back from device memory to regular memory can not fail because it would
 go above the memory cgroup limit. We might revisit this choice later on once we
-get more experience in how device memory is use and its impact on memory
+get more experience in how device memory is used and its impact on memory
 resource control.
 
 
-Note that device memory can never be pin nor by device driver nor through GUP
+Note that device memory can never be pinned, either by a device driver or GUP,
 and thus such memory is always free upon process exit. Or when last reference
-is drop in case of share memory or file back memory.
+is dropped in the case of shared memory or file-backed memory.
diff --git a/MAINTAINERS b/MAINTAINERS
index 8be7d1382ce9..4db4d4a820b1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6466,6 +6466,7 @@ L:	linux-mm@kvack.org
 S:	Maintained
 F:	mm/hmm*
 F:	include/linux/hmm*
+F:	Documentation/vm/hmm.txt
 
 HOST AP DRIVER
 M:	Jouni Malinen <j@w1.fi>
-- 
2.14.3


* [PATCH 01/14] mm/hmm: documentation editorial update to HMM documentation
@ 2018-03-16 19:14   ` jglisse
  0 siblings, 0 replies; 55+ messages in thread
From: jglisse @ 2018-03-16 19:14 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Ralph Campbell,
	Jérôme Glisse, Stephen Bates, Jason Gunthorpe,
	Logan Gunthorpe, Evgeny Baskakov, Mark Hairgrove, John Hubbard

From: Ralph Campbell <rcampbell@nvidia.com>

This patch updates the documentation for HMM to fix minor typos and
phrasing to be a bit more readable.

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
Cc: Stephen  Bates <sbates@raithlin.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 Documentation/vm/hmm.txt | 360 ++++++++++++++++++++++++-----------------------
 MAINTAINERS              |   1 +
 2 files changed, 187 insertions(+), 174 deletions(-)

diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.txt
index 4d3aac9f4a5d..c7418c56a0ac 100644
--- a/Documentation/vm/hmm.txt
+++ b/Documentation/vm/hmm.txt
@@ -1,151 +1,159 @@
 Heterogeneous Memory Management (HMM)
 
-Transparently allow any component of a program to use any memory region of said
-program with a device without using device specific memory allocator. This is
-becoming a requirement to simplify the use of advance heterogeneous computing
-where GPU, DSP or FPGA are use to perform various computations.
-
-This document is divided as follow, in the first section i expose the problems
-related to the use of a device specific allocator. The second section i expose
-the hardware limitations that are inherent to many platforms. The third section
-gives an overview of HMM designs. The fourth section explains how CPU page-
-table mirroring works and what is HMM purpose in this context. Fifth section
-deals with how device memory is represented inside the kernel. Finaly the last
-section present the new migration helper that allow to leverage the device DMA
-engine.
-
-
-1) Problems of using device specific memory allocator:
-2) System bus, device memory characteristics
-3) Share address space and migration
+Provide infrastructure and helpers to integrate non conventional memory (device
+memory like GPU on board memory) into regular kernel code path. Corner stone of
+this being specialize struct page for such memory (see sections 5 to 7 of this
+document).
+
+HMM also provide optional helpers for SVM (Share Virtual Memory) ie allowing a
+device to transparently access program address coherently with the CPU meaning
+that any valid pointer on the CPU is also a valid pointer for the device. This
+is becoming a mandatory to simplify the use of advance heterogeneous computing
+where GPU, DSP, or FPGA are used to perform various computations on behalf of
+a process.
+
+This document is divided as follows: in the first section I expose the problems
+related to using device specific memory allocators. In the second section, I
+expose the hardware limitations that are inherent to many platforms. The third
+section gives an overview of the HMM design. The fourth section explains how
+CPU page-table mirroring works and what is HMM's purpose in this context. The
+fifth section deals with how device memory is represented inside the kernel.
+Finally, the last section presents a new migration helper that allows lever-
+aging the device DMA engine.
+
+
+1) Problems of using a device specific memory allocator:
+2) I/O bus, device memory characteristics
+3) Shared address space and migration
 4) Address space mirroring implementation and API
 5) Represent and manage device memory from core kernel point of view
-6) Migrate to and from device memory
+6) Migration to and from device memory
 7) Memory cgroup (memcg) and rss accounting
 
 
 -------------------------------------------------------------------------------
 
-1) Problems of using device specific memory allocator:
+1) Problems of using a device specific memory allocator:
 
-Device with large amount of on board memory (several giga bytes) like GPU have
-historically manage their memory through dedicated driver specific API. This
-creates a disconnect between memory allocated and managed by device driver and
-regular application memory (private anonymous, share memory or regular file
-back memory). From here on i will refer to this aspect as split address space.
-I use share address space to refer to the opposite situation ie one in which
-any memory region can be use by device transparently.
+Devices with a large amount of on board memory (several giga bytes) like GPUs
+have historically managed their memory through dedicated driver specific APIs.
+This creates a disconnect between memory allocated and managed by a device
+driver and regular application memory (private anonymous, shared memory, or
+regular file backed memory). From here on I will refer to this aspect as split
+address space. I use shared address space to refer to the opposite situation:
+i.e., one in which any application memory region can be used by a device
+transparently.
 
 Split address space because device can only access memory allocated through the
-device specific API. This imply that all memory object in a program are not
-equal from device point of view which complicate large program that rely on a
-wide set of libraries.
+device specific API. This implies that all memory objects in a program are not
+equal from the device point of view which complicates large programs that rely
+on a wide set of libraries.
 
-Concretly this means that code that wants to leverage device like GPU need to
+Concretly this means that code that wants to leverage devices like GPUs need to
 copy object between genericly allocated memory (malloc, mmap private/share/)
 and memory allocated through the device driver API (this still end up with an
 mmap but of the device file).
 
-For flat dataset (array, grid, image, ...) this isn't too hard to achieve but
-complex data-set (list, tree, ...) are hard to get right. Duplicating a complex
-data-set need to re-map all the pointer relations between each of its elements.
-This is error prone and program gets harder to debug because of the duplicate
-data-set.
+For flat data-sets (array, grid, image, ...) this isn't too hard to achieve but
+complex data-sets (list, tree, ...) are hard to get right. Duplicating a
+complex data-set needs to re-map all the pointer relations between each of its
+elements. This is error prone and program gets harder to debug because of the
+duplicate data-set and addresses.
 
-Split address space also means that library can not transparently use data they
-are getting from core program or other library and thus each library might have
-to duplicate its input data-set using specific memory allocator. Large project
-suffer from this and waste resources because of the various memory copy.
+Split address space also means that libraries can not transparently use data
+they are getting from the core program or another library and thus each library
+might have to duplicate its input data-set using the device specific memory
+allocator. Large projects suffer from this and waste resources because of the
+various memory copies.
 
 Duplicating each library API to accept as input or output memory allocted by
 each device specific allocator is not a viable option. It would lead to a
-combinatorial explosions in the library entry points.
+combinatorial explosion in the library entry points.
 
-Finaly with the advance of high level language constructs (in C++ but in other
-language too) it is now possible for compiler to leverage GPU or other devices
-without even the programmer knowledge. Some of compiler identified patterns are
-only do-able with a share address. It is as well more reasonable to use a share
-address space for all the other patterns.
+Finally, with the advance of high level language constructs (in C++ but in
+other languages too) it is now possible for the compiler to leverage GPUs and
+other devices without programmer knowledge. Some compiler identified patterns
+are only do-able with a shared address space. It is also more reasonable to use
+a shared address space for all other patterns.
 
 
 -------------------------------------------------------------------------------
 
-2) System bus, device memory characteristics
+2) I/O bus, device memory characteristics
 
-System bus cripple share address due to few limitations. Most system bus only
+I/O buses cripple shared address due to few limitations. Most I/O buses only
 allow basic memory access from device to main memory, even cache coherency is
-often optional. Access to device memory from CPU is even more limited, most
-often than not it is not cache coherent.
+often optional. Access to device memory from CPU is even more limited. More
+often than not, it is not cache coherent.
 
-If we only consider the PCIE bus than device can access main memory (often
-through an IOMMU) and be cache coherent with the CPUs. However it only allows
-a limited set of atomic operation from device on main memory. This is worse
-in the other direction the CPUs can only access a limited range of the device
+If we only consider the PCIE bus, then a device can access main memory (often
+through an IOMMU) and be cache coherent with the CPUs. However, it only allows
+a limited set of atomic operations from device on main memory. This is worse
+in the other direction, the CPU can only access a limited range of the device
 memory and can not perform atomic operations on it. Thus device memory can not
-be consider like regular memory from kernel point of view.
+be considered the same as regular memory from the kernel point of view.
 
 Another crippling factor is the limited bandwidth (~32GBytes/s with PCIE 4.0
-and 16 lanes). This is 33 times less that fastest GPU memory (1 TBytes/s).
-The final limitation is latency, access to main memory from the device has an
-order of magnitude higher latency than when the device access its own memory.
+and 16 lanes). This is 33 times less than the fastest GPU memory (1 TBytes/s).
+The final limitation is latency. Access to main memory from the device has an
+order of magnitude higher latency than when the device accesses its own memory.
 
-Some platform are developing new system bus or additions/modifications to PCIE
-to address some of those limitations (OpenCAPI, CCIX). They mainly allow two
+Some platforms are developing new I/O buses or additions/modifications to PCIE
+to address some of these limitations (OpenCAPI, CCIX). They mainly allow two
 way cache coherency between CPU and device and allow all atomic operations the
-architecture supports. Saddly not all platform are following this trends and
-some major architecture are left without hardware solutions to those problems.
+architecture supports. Saddly, not all platforms are following this trend and
+some major architectures are left without hardware solutions to these problems.
 
-So for share address space to make sense not only we must allow device to
+So for shared address space to make sense, not only must we allow device to
 access any memory memory but we must also permit any memory to be migrated to
 device memory while device is using it (blocking CPU access while it happens).
 
 
 -------------------------------------------------------------------------------
 
-3) Share address space and migration
+3) Shared address space and migration
 
 HMM intends to provide two main features. First one is to share the address
-space by duplication the CPU page table into the device page table so same
-address point to same memory and this for any valid main memory address in
+space by duplicating the CPU page table in the device page table so the same
+address points to the same physical memory for any valid main memory address in
 the process address space.
 
-To achieve this, HMM offer a set of helpers to populate the device page table
+To achieve this, HMM offers a set of helpers to populate the device page table
 while keeping track of CPU page table updates. Device page table updates are
-not as easy as CPU page table updates. To update the device page table you must
-allow a buffer (or use a pool of pre-allocated buffer) and write GPU specifics
-commands in it to perform the update (unmap, cache invalidations and flush,
-...). This can not be done through common code for all device. Hence why HMM
-provides helpers to factor out everything that can be while leaving the gory
-details to the device driver.
-
-The second mechanism HMM provide is a new kind of ZONE_DEVICE memory that does
-allow to allocate a struct page for each page of the device memory. Those page
-are special because the CPU can not map them. They however allow to migrate
-main memory to device memory using exhisting migration mechanism and everything
-looks like if page was swap out to disk from CPU point of view. Using a struct
-page gives the easiest and cleanest integration with existing mm mechanisms.
-Again here HMM only provide helpers, first to hotplug new ZONE_DEVICE memory
-for the device memory and second to perform migration. Policy decision of what
-and when to migrate things is left to the device driver.
-
-Note that any CPU access to a device page trigger a page fault and a migration
-back to main memory ie when a page backing an given address A is migrated from
-a main memory page to a device page then any CPU access to address A trigger a
-page fault and initiate a migration back to main memory.
-
-
-With this two features, HMM not only allow a device to mirror a process address
-space and keeps both CPU and device page table synchronize, but also allow to
-leverage device memory by migrating part of data-set that is actively use by a
-device.
+not as easy as CPU page table updates. To update the device page table, you must
+allocate a buffer (or use a pool of pre-allocated buffers) and write GPU
+specific commands in it to perform the update (unmap, cache invalidations, and
+flush, ...). This can not be done through common code for all devices. Hence
+why HMM provides helpers to factor out everything that can be while leaving the
+hardware specific details to the device driver.
+
+The second mechanism HMM provides, is a new kind of ZONE_DEVICE memory that
+allows allocating a struct page for each page of the device memory. Those pages
+are special because the CPU can not map them. However, they allow migrating
+main memory to device memory using existing migration mechanisms and everything
+looks like a page is swapped out to disk from the CPU point of view. Using a
+struct page gives the easiest and cleanest integration with existing mm mech-
+anisms. Here again, HMM only provides helpers, first to hotplug new ZONE_DEVICE
+memory for the device memory and second to perform migration. Policy decisions
+of what and when to migrate things is left to the device driver.
+
+Note that any CPU access to a device page triggers a page fault and a migration
+back to main memory. For example, when a page backing a given CPU address A is
+migrated from a main memory page to a device page, then any CPU access to
+address A triggers a page fault and initiates a migration back to main memory.
+
+With these two features, HMM not only allows a device to mirror process address
+space and keeping both CPU and device page table synchronized, but also lever-
+ages device memory by migrating the part of the data-set that is actively being
+used by the device.
 
 
 -------------------------------------------------------------------------------
 
 4) Address space mirroring implementation and API
 
-Address space mirroring main objective is to allow to duplicate range of CPU
-page table into a device page table and HMM helps keeping both synchronize. A
+Address space mirroring's main objective is to allow duplication of a range of
+CPU page table into a device page table; HMM helps keep both synchronized. A
 device driver that want to mirror a process address space must start with the
 registration of an hmm_mirror struct:
 
@@ -155,8 +163,8 @@ device driver that want to mirror a process address space must start with the
                                 struct mm_struct *mm);
 
 The locked variant is to be use when the driver is already holding the mmap_sem
-of the mm in write mode. The mirror struct has a set of callback that are use
-to propagate CPU page table:
+of the mm in write mode. The mirror struct has a set of callbacks that are used
+to propagate CPU page tables:
 
  struct hmm_mirror_ops {
      /* sync_cpu_device_pagetables() - synchronize page tables
@@ -181,13 +189,13 @@ of the mm in write mode. The mirror struct has a set of callback that are use
                      unsigned long end);
  };
 
-Device driver must perform update to the range following action (turn range
-read only, or fully unmap, ...). Once driver callback returns the device must
-be done with the update.
+The device driver must perform the update action to the range (mark range
+read only, or fully unmap, ...). The device must be done with the update before
+the driver callback returns.
 
 
-When device driver wants to populate a range of virtual address it can use
-either:
+When the device driver wants to populate a range of virtual addresses, it can
+use either:
  int hmm_vma_get_pfns(struct vm_area_struct *vma,
                       struct hmm_range *range,
                       unsigned long start,
@@ -201,17 +209,19 @@ When device driver wants to populate a range of virtual address it can use
                    bool write,
                    bool block);
 
-First one (hmm_vma_get_pfns()) will only fetch present CPU page table entry and
-will not trigger a page fault on missing or non present entry. The second one
-do trigger page fault on missing or read only entry if write parameter is true.
-Page fault use the generic mm page fault code path just like a CPU page fault.
+The first one (hmm_vma_get_pfns()) will only fetch present CPU page table
+entries and will not trigger a page fault on missing or non present entries.
+The second one does trigger a page fault on missing or read only entry if the
+write parameter is true. Page faults use the generic mm page fault code path
+just like a CPU page fault.
 
-Both function copy CPU page table into their pfns array argument. Each entry in
-that array correspond to an address in the virtual range. HMM provide a set of
-flags to help driver identify special CPU page table entries.
+Both functions copy CPU page table entries into their pfns array argument. Each
+entry in that array corresponds to an address in the virtual range. HMM
+provides a set of flags to help the driver identify special CPU page table
+entries.
 
 Locking with the update() callback is the most important aspect the driver must
-respect in order to keep things properly synchronize. The usage pattern is :
+respect in order to keep things properly synchronized. The usage pattern is:
 
  int driver_populate_range(...)
  {
@@ -233,43 +243,44 @@ Locking with the update() callback is the most important aspect the driver must
       return 0;
  }
 
-The driver->update lock is the same lock that driver takes inside its update()
-callback. That lock must be call before hmm_vma_range_done() to avoid any race
-with a concurrent CPU page table update.
+The driver->update lock is the same lock that the driver takes inside its
+update() callback. That lock must be held before hmm_vma_range_done() to avoid
+any race with a concurrent CPU page table update.
 
-HMM implements all this on top of the mmu_notifier API because we wanted to a
-simpler API and also to be able to perform optimization latter own like doing
-concurrent device update in multi-devices scenario.
+HMM implements all this on top of the mmu_notifier API because we wanted a
+simpler API and also to be able to perform optimizations latter on like doing
+concurrent device updates in multi-devices scenario.
 
-HMM also serve as an impedence missmatch between how CPU page table update are
-done (by CPU write to the page table and TLB flushes) from how device update
-their own page table. Device update is a multi-step process, first appropriate
-commands are write to a buffer, then this buffer is schedule for execution on
-the device. It is only once the device has executed commands in the buffer that
-the update is done. Creating and scheduling update command buffer can happen
-concurrently for multiple devices. Waiting for each device to report commands
-as executed is serialize (there is no point in doing this concurrently).
+HMM also serves as an impedance mismatch between how CPU page table updates
+are done (by CPU writes to the page table and TLB flushes) and how devices
+update their own page table. Device updates are a multi-step process. First,
+appropriate commands are written to a buffer, then this buffer is scheduled
+for execution on the device. It is only once the device has executed commands
+in the buffer that the update is done. Creating and scheduling the update
+command buffer can happen concurrently for multiple devices. Waiting for each
+device to report commands as executed is serialized (there is no point in
+doing this concurrently).
 
 
 -------------------------------------------------------------------------------
 
 5) Represent and manage device memory from core kernel point of view
 
-Several differents design were try to support device memory. First one use
-device specific data structure to keep information about migrated memory and
-HMM hooked itself in various place of mm code to handle any access to address
-that were back by device memory. It turns out that this ended up replicating
-most of the fields of struct page and also needed many kernel code path to be
-updated to understand this new kind of memory.
+Several different designs were tried to support device memory. The first one
+used a device specific data structure to keep information about migrated
+memory and HMM hooked itself into various places of the mm code to handle any
+access to addresses that were backed by device memory. It turns out that this
+ended up replicating most of the fields of struct page and also needed many
+kernel code paths to be updated to understand this new kind of memory.
 
-Thing is most kernel code path never try to access the memory behind a page
-but only care about struct page contents. Because of this HMM switchted to
-directly using struct page for device memory which left most kernel code path
-un-aware of the difference. We only need to make sure that no one ever try to
-map those page from the CPU side.
+Most kernel code paths never try to access the memory behind a page
+but only care about struct page contents. Because of this, HMM switched to
+directly using struct page for device memory which left most kernel code paths
+unaware of the difference. We only need to make sure that no one ever tries to
+map those pages from the CPU side.
 
-HMM provide a set of helpers to register and hotplug device memory as a new
-region needing struct page. This is offer through a very simple API:
+HMM provides a set of helpers to register and hotplug device memory as a new
+region needing a struct page. This is offered through a very simple API:
 
  struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
                                    struct device *device,
@@ -289,18 +300,19 @@ HMM provide a set of helpers to register and hotplug device memory as a new
  };
 
 The first callback (free()) happens when the last reference on a device page is
-drop. This means the device page is now free and no longer use by anyone. The
-second callback happens whenever CPU try to access a device page which it can
-not do. This second callback must trigger a migration back to system memory.
+dropped. This means the device page is now free and no longer used by anyone.
+The second callback happens whenever the CPU tries to access a device page,
+which it cannot do. This second callback must trigger a migration back to
+system memory.
 
 
 -------------------------------------------------------------------------------
 
-6) Migrate to and from device memory
+6) Migration to and from device memory
 
-Because CPU can not access device memory, migration must use device DMA engine
-to perform copy from and to device memory. For this we need a new migration
-helper:
+Because the CPU cannot access device memory, migration must use the device DMA
+engine to perform copies to and from device memory. For this we need a new
+migration helper:
 
  int migrate_vma(const struct migrate_vma_ops *ops,
                  struct vm_area_struct *vma,
@@ -311,15 +323,15 @@ to perform copy from and to device memory. For this we need a new migration
                  unsigned long *dst,
                  void *private);
 
-Unlike other migration function it works on a range of virtual address, there
-is two reasons for that. First device DMA copy has a high setup overhead cost
+Unlike other migration functions it works on a range of virtual addresses;
+there are two reasons for that. First, a device DMA copy has a high setup cost
 and thus batching multiple pages is needed as otherwise the migration overhead
-make the whole excersie pointless. The second reason is because driver trigger
-such migration base on range of address the device is actively accessing.
+makes the whole exercise pointless. The second reason is because the
+migration might be for a range of addresses the device is actively accessing.
 
-The migrate_vma_ops struct define two callbacks. First one (alloc_and_copy())
-control destination memory allocation and copy operation. Second one is there
-to allow device driver to perform cleanup operation after migration.
+The migrate_vma_ops struct defines two callbacks. The first (alloc_and_copy())
+controls destination memory allocation and the copy operation. The second
+allows the device driver to perform cleanup operations after migration.
 
  struct migrate_vma_ops {
      void (*alloc_and_copy)(struct vm_area_struct *vma,
@@ -336,19 +348,19 @@ to allow device driver to perform cleanup operation after migration.
                               void *private);
  };
 
-It is important to stress that this migration helpers allow for hole in the
+It is important to stress that these migration helpers allow for holes in the
 virtual address range. Some pages in the range might not be migrated for all
-the usual reasons (page is pin, page is lock, ...). This helper does not fail
-but just skip over those pages.
+the usual reasons (page is pinned, page is locked, ...). This helper does not
+fail but just skips over those pages.
 
-The alloc_and_copy() might as well decide to not migrate all pages in the
-range (for reasons under the callback control). For those the callback just
-have to leave the corresponding dst entry empty.
+The alloc_and_copy() callback might decide not to migrate all pages in the
+range (for reasons under the callback's control). For those, the callback
+just has to leave the corresponding dst entry empty.
 
-Finaly the migration of the struct page might fails (for file back page) for
+Finally, the migration of the struct page might fail (for a file backed page) for
 various reasons (failure to freeze reference, or update page cache, ...). If
-that happens then the finalize_and_map() can catch any pages that was not
-migrated. Note those page were still copied to new page and thus we wasted
+that happens, then the finalize_and_map() can catch any pages that were not
+migrated. Note those pages were still copied to a new page and thus we wasted
 bandwidth but this is considered as a rare event and a price that we are
 willing to pay to keep all the code simpler.
 
@@ -358,27 +370,27 @@ willing to pay to keep all the code simpler.
 7) Memory cgroup (memcg) and rss accounting
 
 For now device memory is accounted as any regular page in rss counters (either
-anonymous if device page is use for anonymous, file if device page is use for
-file back page or shmem if device page is use for share memory). This is a
-deliberate choice to keep existing application that might start using device
-memory without knowing about it to keep runing unimpacted.
-
-Drawbacks is that OOM killer might kill an application using a lot of device
-memory and not a lot of regular system memory and thus not freeing much system
-memory. We want to gather more real world experience on how application and
-system react under memory pressure in the presence of device memory before
+anonymous if the device page is used for anonymous memory, file if it is used
+for a file backed page, or shmem if it is used for shared memory). This is a
+deliberate choice to keep existing applications, which might start using device
+memory without knowing about it, running unimpacted.
+
+A drawback is that the OOM killer might kill an application using a lot of
+device memory and not a lot of regular system memory, and thus not free much
+system memory. We want to gather more real world experience on how applications
+and systems react under memory pressure in the presence of device memory before
 deciding to account device memory differently.
 
 
-Same decision was made for memory cgroup. Device memory page are accounted
+The same decision was made for memory cgroup. Device memory pages are accounted
 against same memory cgroup a regular page would be accounted to. This does
 simplify migration to and from device memory. This also means that migration
 back from device memory to regular memory can not fail because it would
 go above memory cgroup limit. We might revisit this choice latter on once we
-get more experience in how device memory is use and its impact on memory
+get more experience in how device memory is used and its impact on memory
 resource control.
 
 
-Note that device memory can never be pin nor by device driver nor through GUP
+Note that device memory can never be pinned by a device driver or through GUP
 and thus such memory is always free upon process exit. Or when last reference
-is drop in case of share memory or file back memory.
+is dropped in case of shared memory or file backed memory.
diff --git a/MAINTAINERS b/MAINTAINERS
index 8be7d1382ce9..4db4d4a820b1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6466,6 +6466,7 @@ L:	linux-mm@kvack.org
 S:	Maintained
 F:	mm/hmm*
 F:	include/linux/hmm*
+F:	Documentation/vm/hmm.txt
 
 HOST AP DRIVER
 M:	Jouni Malinen <j@w1.fi>
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 55+ messages in thread
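
To make section 6 of the documentation above concrete, here is a hedged
sketch of the migrate_vma() flow. The my_*() callbacks and the fixed-size
arrays are illustrative assumptions, not part of the kernel API:

 static const struct migrate_vma_ops my_migrate_ops = {
 	/* May leave dst[i] empty for pages it chooses not to migrate. */
 	.alloc_and_copy		= my_alloc_and_copy,
 	/* Catches pages that were copied but could not be remapped. */
 	.finalize_and_map	= my_finalize_and_map,
 };

 static int my_migrate_range(struct vm_area_struct *vma,
 			    unsigned long start, unsigned long end)
 {
 	/* Sketch assumes the range covers at most 64 pages. */
 	unsigned long src[64], dst[64];

 	return migrate_vma(&my_migrate_ops, vma, start, end,
 			   src, dst, NULL);
 }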

* [PATCH 02/14] mm/hmm: fix header file if/else/endif maze
  2018-03-16 19:14 ` jglisse
@ 2018-03-16 19:14   ` jglisse
  -1 siblings, 0 replies; 55+ messages in thread
From: jglisse @ 2018-03-16 19:14 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse, stable,
	Ralph Campbell, John Hubbard, Evgeny Baskakov

From: Jérôme Glisse <jglisse@redhat.com>

The #if/#else/#endif for IS_ENABLED(CONFIG_HMM) were wrong.
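
For reference, the corrected layout pairs the !CONFIG_HMM stubs with
their matching #if instead of leaving them stranded after its #endif; a
minimal sketch of the fixed skeleton (declarations elided):

 #if IS_ENABLED(CONFIG_HMM)
 /* ... HMM API declarations ... */
 #else /* IS_ENABLED(CONFIG_HMM) */
 static inline void hmm_mm_destroy(struct mm_struct *mm) {}
 static inline void hmm_mm_init(struct mm_struct *mm) {}
 #endif /* IS_ENABLED(CONFIG_HMM) */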

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
---
 include/linux/hmm.h | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 325017ad9311..ef6044d08cc5 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -498,6 +498,9 @@ struct hmm_device {
 struct hmm_device *hmm_device_new(void *drvdata);
 void hmm_device_put(struct hmm_device *hmm_device);
 #endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
+#else /* IS_ENABLED(CONFIG_HMM) */
+static inline void hmm_mm_destroy(struct mm_struct *mm) {}
+static inline void hmm_mm_init(struct mm_struct *mm) {}
 #endif /* IS_ENABLED(CONFIG_HMM) */
 
 /* Below are for HMM internal use only! Not to be used by device driver! */
@@ -513,8 +516,4 @@ static inline void hmm_mm_destroy(struct mm_struct *mm) {}
 static inline void hmm_mm_init(struct mm_struct *mm) {}
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
-
-#else /* IS_ENABLED(CONFIG_HMM) */
-static inline void hmm_mm_destroy(struct mm_struct *mm) {}
-static inline void hmm_mm_init(struct mm_struct *mm) {}
 #endif /* LINUX_HMM_H */
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 03/14] mm/hmm: HMM should have a callback before MM is destroyed v2
  2018-03-16 19:14 ` jglisse
@ 2018-03-16 19:14   ` jglisse
  -1 siblings, 0 replies; 55+ messages in thread
From: jglisse @ 2018-03-16 19:14 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Ralph Campbell,
	Jérôme Glisse, stable, Evgeny Baskakov, Mark Hairgrove,
	John Hubbard

From: Ralph Campbell <rcampbell@nvidia.com>

The hmm_mirror_register() function registers a callback for when
the CPU pagetable is modified. Normally, the device driver will
call hmm_mirror_unregister() when the process using the device is
finished. However, if the process exits uncleanly, the mm_struct
can be destroyed with no warning to the device driver.
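
As an illustration only (the my_*() names are hypothetical, not part of
the patch), a driver would hook the new callback roughly like this to
tear down its mirror state when the mm goes away:

 static void my_mirror_release(struct hmm_mirror *mirror)
 {
 	/* Assumes the driver embeds struct hmm_mirror in its own state. */
 	struct my_mirror_state *state =
 		container_of(mirror, struct my_mirror_state, mirror);

 	/* No references to the mirror may survive this call. */
 	my_stop_mirroring(state);
 }

 static const struct hmm_mirror_ops my_mirror_ops = {
 	.release			= my_mirror_release,
 	.sync_cpu_device_pagetables	= my_sync_cpu_device_pagetables,
 };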

Changed since v1:
  - dropped VM_BUG_ON()
  - cc stable

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: stable@vger.kernel.org
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/hmm.h | 10 ++++++++++
 mm/hmm.c            | 18 +++++++++++++++++-
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index ef6044d08cc5..61b0e1c05ee1 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -218,6 +218,16 @@ enum hmm_update_type {
  * @update: callback to update range on a device
  */
 struct hmm_mirror_ops {
+	/* release() - release hmm_mirror
+	 *
+	 * @mirror: pointer to struct hmm_mirror
+	 *
+	 * This is called when the mm_struct is being released.
+	 * The callback should make sure no references to the mirror occur
+	 * after the callback returns.
+	 */
+	void (*release)(struct hmm_mirror *mirror);
+
 	/* sync_cpu_device_pagetables() - synchronize page tables
 	 *
 	 * @mirror: pointer to struct hmm_mirror
diff --git a/mm/hmm.c b/mm/hmm.c
index 320545b98ff5..6088fa6ed137 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -160,6 +160,21 @@ static void hmm_invalidate_range(struct hmm *hmm,
 	up_read(&hmm->mirrors_sem);
 }
 
+static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	struct hmm *hmm = mm->hmm;
+	struct hmm_mirror *mirror;
+	struct hmm_mirror *mirror_next;
+
+	down_write(&hmm->mirrors_sem);
+	list_for_each_entry_safe(mirror, mirror_next, &hmm->mirrors, list) {
+		list_del_init(&mirror->list);
+		if (mirror->ops->release)
+			mirror->ops->release(mirror);
+	}
+	up_write(&hmm->mirrors_sem);
+}
+
 static void hmm_invalidate_range_start(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
 				       unsigned long start,
@@ -185,6 +200,7 @@ static void hmm_invalidate_range_end(struct mmu_notifier *mn,
 }
 
 static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
+	.release		= hmm_release,
 	.invalidate_range_start	= hmm_invalidate_range_start,
 	.invalidate_range_end	= hmm_invalidate_range_end,
 };
@@ -230,7 +246,7 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror)
 	struct hmm *hmm = mirror->hmm;
 
 	down_write(&hmm->mirrors_sem);
-	list_del(&mirror->list);
+	list_del_init(&mirror->list);
 	up_write(&hmm->mirrors_sem);
 }
 EXPORT_SYMBOL(hmm_mirror_unregister);
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 04/14] mm/hmm: hmm_pfns_bad() was accessing wrong struct
  2018-03-16 19:14 ` jglisse
@ 2018-03-16 19:14   ` jglisse
  -1 siblings, 0 replies; 55+ messages in thread
From: jglisse @ 2018-03-16 19:14 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse, stable,
	Evgeny Baskakov, Ralph Campbell, Mark Hairgrove, John Hubbard

From: Jérôme Glisse <jglisse@redhat.com>

The private field of the mm_walk struct points to an hmm_vma_walk struct
and not to the desired hmm_range struct. Fix it to get the proper struct
pointer.
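
The walk is set up elsewhere in mm/hmm.c as:

 	hmm_vma_walk.range = range;
 	mm_walk.private = &hmm_vma_walk;

so walk->private points to a struct hmm_vma_walk and the range must be
fetched through it, as the fix below does.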

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: stable@vger.kernel.org
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 mm/hmm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 6088fa6ed137..64d9e7dae712 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -293,7 +293,8 @@ static int hmm_pfns_bad(unsigned long addr,
 			unsigned long end,
 			struct mm_walk *walk)
 {
-	struct hmm_range *range = walk->private;
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
 	hmm_pfn_t *pfns = range->pfns;
 	unsigned long i;
 
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 05/14] mm/hmm: use struct for hmm_vma_fault(), hmm_vma_get_pfns() parameters
  2018-03-16 19:14 ` jglisse
@ 2018-03-16 19:14   ` jglisse
  -1 siblings, 0 replies; 55+ messages in thread
From: jglisse @ 2018-03-16 19:14 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Evgeny Baskakov, Ralph Campbell, Mark Hairgrove, John Hubbard

From: Jérôme Glisse <jglisse@redhat.com>

Both hmm_vma_fault() and hmm_vma_get_pfns() were taking an hmm_range
struct as a parameter and were initializing that struct with their other
parameters. Have the callers of those functions do this instead, as they
are likely to already do, and only pass the struct to both functions.
This shortens the function signatures and makes it easier to add new
parameters in the future by simply adding them to the structure.
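
A sketch of the resulting call-site change (variable names illustrative):

 	struct hmm_range range;

 	/* Before: the helper filled the range from its parameters. */
 	ret = hmm_vma_fault(vma, &range, start, end, pfns, write, block);

 	/* After: the caller initializes the range, then passes only it. */
 	range.vma = vma;
 	range.start = start;
 	range.end = end;
 	range.pfns = pfns;
 	ret = hmm_vma_fault(&range, write, block);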

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/hmm.h | 18 ++++---------
 mm/hmm.c            | 78 +++++++++++++++++++----------------------------------
 2 files changed, 33 insertions(+), 63 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 61b0e1c05ee1..b65e527dd120 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -274,6 +274,7 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
 /*
  * struct hmm_range - track invalidation lock on virtual address range
  *
+ * @vma: the vm area struct for the range
  * @list: all range lock are on a list
  * @start: range virtual start address (inclusive)
  * @end: range virtual end address (exclusive)
@@ -281,6 +282,7 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
  * @valid: pfns array did not change since it has been fill by an HMM function
  */
 struct hmm_range {
+	struct vm_area_struct	*vma;
 	struct list_head	list;
 	unsigned long		start;
 	unsigned long		end;
@@ -301,12 +303,8 @@ struct hmm_range {
  *
  * IF YOU DO NOT FOLLOW THE ABOVE RULE THE SNAPSHOT CONTENT MIGHT BE INVALID !
  */
-int hmm_vma_get_pfns(struct vm_area_struct *vma,
-		     struct hmm_range *range,
-		     unsigned long start,
-		     unsigned long end,
-		     hmm_pfn_t *pfns);
-bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range);
+int hmm_vma_get_pfns(struct hmm_range *range);
+bool hmm_vma_range_done(struct hmm_range *range);
 
 
 /*
@@ -327,13 +325,7 @@ bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range);
  *
  * See the function description in mm/hmm.c for further documentation.
  */
-int hmm_vma_fault(struct vm_area_struct *vma,
-		  struct hmm_range *range,
-		  unsigned long start,
-		  unsigned long end,
-		  hmm_pfn_t *pfns,
-		  bool write,
-		  bool block);
+int hmm_vma_fault(struct hmm_range *range, bool write, bool block);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 64d9e7dae712..49f0f6b337ed 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -490,11 +490,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 
 /*
  * hmm_vma_get_pfns() - snapshot CPU page table for a range of virtual addresses
- * @vma: virtual memory area containing the virtual address range
- * @range: used to track snapshot validity
- * @start: range virtual start address (inclusive)
- * @end: range virtual end address (exclusive)
- * @entries: array of hmm_pfn_t: provided by the caller, filled in by function
+ * @range: range being snapshotted, plus all needed information
  * Returns: -EINVAL if invalid argument, -ENOMEM out of memory, 0 success
  *
  * This snapshots the CPU page table for a range of virtual addresses. Snapshot
@@ -508,26 +504,23 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
  * NOT CALLING hmm_vma_range_done() IF FUNCTION RETURNS 0 WILL LEAD TO SERIOUS
  * MEMORY CORRUPTION ! YOU HAVE BEEN WARNED !
  */
-int hmm_vma_get_pfns(struct vm_area_struct *vma,
-		     struct hmm_range *range,
-		     unsigned long start,
-		     unsigned long end,
-		     hmm_pfn_t *pfns)
+int hmm_vma_get_pfns(struct hmm_range *range)
 {
+	struct vm_area_struct *vma = range->vma;
 	struct hmm_vma_walk hmm_vma_walk;
 	struct mm_walk mm_walk;
 	struct hmm *hmm;
 
 	/* FIXME support hugetlb fs */
 	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
-		hmm_pfns_special(pfns, start, end);
+		hmm_pfns_special(range->pfns, range->start, range->end);
 		return -EINVAL;
 	}
 
 	/* Sanity check, this really should not happen ! */
-	if (start < vma->vm_start || start >= vma->vm_end)
+	if (range->start < vma->vm_start || range->start >= vma->vm_end)
 		return -EINVAL;
-	if (end < vma->vm_start || end > vma->vm_end)
+	if (range->end < vma->vm_start || range->end > vma->vm_end)
 		return -EINVAL;
 
 	hmm = hmm_register(vma->vm_mm);
@@ -538,9 +531,6 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
 		return -EINVAL;
 
 	/* Initialize range to track CPU page table update */
-	range->start = start;
-	range->pfns = pfns;
-	range->end = end;
 	spin_lock(&hmm->lock);
 	range->valid = true;
 	list_add_rcu(&range->list, &hmm->ranges);
@@ -558,14 +548,13 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
 	mm_walk.pmd_entry = hmm_vma_walk_pmd;
 	mm_walk.pte_hole = hmm_vma_walk_hole;
 
-	walk_page_range(start, end, &mm_walk);
+	walk_page_range(range->start, range->end, &mm_walk);
 	return 0;
 }
 EXPORT_SYMBOL(hmm_vma_get_pfns);
 
 /*
  * hmm_vma_range_done() - stop tracking change to CPU page table over a range
- * @vma: virtual memory area containing the virtual address range
  * @range: range being tracked
  * Returns: false if range data has been invalidated, true otherwise
  *
@@ -585,10 +574,10 @@ EXPORT_SYMBOL(hmm_vma_get_pfns);
  *
  * There are two ways to use this :
  * again:
- *   hmm_vma_get_pfns(vma, range, start, end, pfns); or hmm_vma_fault(...);
+ *   hmm_vma_get_pfns(range); or hmm_vma_fault(...);
  *   trans = device_build_page_table_update_transaction(pfns);
  *   device_page_table_lock();
- *   if (!hmm_vma_range_done(vma, range)) {
+ *   if (!hmm_vma_range_done(range)) {
  *     device_page_table_unlock();
  *     goto again;
  *   }
@@ -596,13 +585,13 @@ EXPORT_SYMBOL(hmm_vma_get_pfns);
  *   device_page_table_unlock();
  *
  * Or:
- *   hmm_vma_get_pfns(vma, range, start, end, pfns); or hmm_vma_fault(...);
+ *   hmm_vma_get_pfns(range); or hmm_vma_fault(...);
  *   device_page_table_lock();
- *   hmm_vma_range_done(vma, range);
- *   device_update_page_table(pfns);
+ *   hmm_vma_range_done(range);
+ *   device_update_page_table(range->pfns);
  *   device_page_table_unlock();
  */
-bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range)
+bool hmm_vma_range_done(struct hmm_range *range)
 {
 	unsigned long npages = (range->end - range->start) >> PAGE_SHIFT;
 	struct hmm *hmm;
@@ -612,7 +601,7 @@ bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range)
 		return false;
 	}
 
-	hmm = hmm_register(vma->vm_mm);
+	hmm = hmm_register(range->vma->vm_mm);
 	if (!hmm) {
 		memset(range->pfns, 0, sizeof(*range->pfns) * npages);
 		return false;
@@ -628,11 +617,7 @@ EXPORT_SYMBOL(hmm_vma_range_done);
 
 /*
  * hmm_vma_fault() - try to fault some address in a virtual address range
- * @vma: virtual memory area containing the virtual address range
- * @range: use to track pfns array content validity
- * @start: fault range virtual start address (inclusive)
- * @end: fault range virtual end address (exclusive)
- * @pfns: array of hmm_pfn_t, only entry with fault flag set will be faulted
+ * @range: range being faulted, plus all needed information
  * @write: is it a write fault
  * @block: allow blocking on fault (if true it sleeps and do not drop mmap_sem)
  * Returns: 0 success, error otherwise (-EAGAIN means mmap_sem have been drop)
@@ -648,10 +633,10 @@ EXPORT_SYMBOL(hmm_vma_range_done);
  *   down_read(&mm->mmap_sem);
  *   // Find vma and address device wants to fault, initialize hmm_pfn_t
  *   // array accordingly
- *   ret = hmm_vma_fault(vma, start, end, pfns, allow_retry);
+ *   ret = hmm_vma_fault(range, write, block);
  *   switch (ret) {
  *   case -EAGAIN:
- *     hmm_vma_range_done(vma, range);
+ *     hmm_vma_range_done(range);
  *     // You might want to rate limit or yield to play nicely, you may
  *     // also commit any valid pfn in the array assuming that you are
  *     // getting true from hmm_vma_range_monitor_end()
@@ -665,7 +650,7 @@ EXPORT_SYMBOL(hmm_vma_range_done);
  *   }
  *   // Take device driver lock that serialize device page table update
  *   driver_lock_device_page_table_update();
- *   hmm_vma_range_done(vma, range);
+ *   hmm_vma_range_done(range);
  *   // Commit pfns we got from hmm_vma_fault()
  *   driver_unlock_device_page_table_update();
  *   up_read(&mm->mmap_sem)
@@ -675,28 +660,24 @@ EXPORT_SYMBOL(hmm_vma_range_done);
  *
  * YOU HAVE BEEN WARNED !
  */
-int hmm_vma_fault(struct vm_area_struct *vma,
-		  struct hmm_range *range,
-		  unsigned long start,
-		  unsigned long end,
-		  hmm_pfn_t *pfns,
-		  bool write,
-		  bool block)
+int hmm_vma_fault(struct hmm_range *range, bool write, bool block)
 {
+	struct vm_area_struct *vma = range->vma;
+	unsigned long start = range->start;
 	struct hmm_vma_walk hmm_vma_walk;
 	struct mm_walk mm_walk;
 	struct hmm *hmm;
 	int ret;
 
 	/* Sanity check, this really should not happen ! */
-	if (start < vma->vm_start || start >= vma->vm_end)
+	if (range->start < vma->vm_start || range->start >= vma->vm_end)
 		return -EINVAL;
-	if (end < vma->vm_start || end > vma->vm_end)
+	if (range->end < vma->vm_start || range->end > vma->vm_end)
 		return -EINVAL;
 
 	hmm = hmm_register(vma->vm_mm);
 	if (!hmm) {
-		hmm_pfns_clear(pfns, start, end);
+		hmm_pfns_clear(range->pfns, range->start, range->end);
 		return -ENOMEM;
 	}
 	/* Caller must have registered a mirror using hmm_mirror_register() */
@@ -704,9 +685,6 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 		return -EINVAL;
 
 	/* Initialize range to track CPU page table update */
-	range->start = start;
-	range->pfns = pfns;
-	range->end = end;
 	spin_lock(&hmm->lock);
 	range->valid = true;
 	list_add_rcu(&range->list, &hmm->ranges);
@@ -714,7 +692,7 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 
 	/* FIXME support hugetlb fs */
 	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
-		hmm_pfns_special(pfns, start, end);
+		hmm_pfns_special(range->pfns, range->start, range->end);
 		return 0;
 	}
 
@@ -734,7 +712,7 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 	mm_walk.pte_hole = hmm_vma_walk_hole;
 
 	do {
-		ret = walk_page_range(start, end, &mm_walk);
+		ret = walk_page_range(start, range->end, &mm_walk);
 		start = hmm_vma_walk.last;
 	} while (ret == -EAGAIN);
 
@@ -742,8 +720,8 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 		unsigned long i;
 
 		i = (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
-		hmm_pfns_clear(&pfns[i], hmm_vma_walk.last, end);
-		hmm_vma_range_done(vma, range);
+		hmm_pfns_clear(&range->pfns[i], hmm_vma_walk.last, range->end);
+		hmm_vma_range_done(range);
 	}
 	return ret;
 }
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 06/14] mm/hmm: remove HMM_PFN_READ flag and ignore peculiar architecture
  2018-03-16 19:14 ` jglisse
@ 2018-03-16 19:14   ` jglisse
  -1 siblings, 0 replies; 55+ messages in thread
From: jglisse @ 2018-03-16 19:14 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Evgeny Baskakov, Ralph Campbell, Mark Hairgrove, John Hubbard

From: Jérôme Glisse <jglisse@redhat.com>

Only peculiar architectures allow write without read, thus assume that
any valid pfn does allow for read. Note we do not care about the write
only case as it does not make sense with things like atomic compare and
exchange or any other operations that allow you to get the memory value
through them.
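
For a driver consuming the snapshot this means a valid entry implies
read permission, with write remaining a separate flag. A sketch (the
loop and the map_*() helpers are illustrative):

 	for (i = 0; i < npages; i++) {
 		if (!(pfns[i] & HMM_PFN_VALID))
 			continue;
 		/* Valid implies readable after this patch. */
 		map_readable_on_device(pfns[i]);
 		if (pfns[i] & HMM_PFN_WRITE)
 			map_writable_on_device(pfns[i]);
 	}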

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/hmm.h | 14 ++++++--------
 mm/hmm.c            | 28 ++++++++++++++++++++++++----
 2 files changed, 30 insertions(+), 12 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index b65e527dd120..4bdc58ffe9f3 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -84,7 +84,6 @@ struct hmm;
  *
  * Flags:
  * HMM_PFN_VALID: pfn is valid
- * HMM_PFN_READ:  CPU page table has read permission set
  * HMM_PFN_WRITE: CPU page table has write permission set
  * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned memory
  * HMM_PFN_EMPTY: corresponding CPU page table entry is pte_none()
@@ -97,13 +96,12 @@ struct hmm;
 typedef unsigned long hmm_pfn_t;
 
 #define HMM_PFN_VALID (1 << 0)
-#define HMM_PFN_READ (1 << 1)
-#define HMM_PFN_WRITE (1 << 2)
-#define HMM_PFN_ERROR (1 << 3)
-#define HMM_PFN_EMPTY (1 << 4)
-#define HMM_PFN_SPECIAL (1 << 5)
-#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 6)
-#define HMM_PFN_SHIFT 7
+#define HMM_PFN_WRITE (1 << 1)
+#define HMM_PFN_ERROR (1 << 2)
+#define HMM_PFN_EMPTY (1 << 3)
+#define HMM_PFN_SPECIAL (1 << 4)
+#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 5)
+#define HMM_PFN_SHIFT 6
 
 /*
  * hmm_pfn_t_to_page() - return struct page pointed to by a valid hmm_pfn_t
diff --git a/mm/hmm.c b/mm/hmm.c
index 49f0f6b337ed..fa3c605c4b96 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -374,11 +374,9 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 	hmm_pfn_t *pfns = range->pfns;
 	unsigned long addr = start, i;
 	bool write_fault;
-	hmm_pfn_t flag;
 	pte_t *ptep;
 
 	i = (addr - range->start) >> PAGE_SHIFT;
-	flag = vma->vm_flags & VM_READ ? HMM_PFN_READ : 0;
 	write_fault = hmm_vma_walk->fault & hmm_vma_walk->write;
 
 again:
@@ -390,6 +388,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 
 	if (pmd_devmap(*pmdp) || pmd_trans_huge(*pmdp)) {
 		unsigned long pfn;
+		hmm_pfn_t flag = 0;
 		pmd_t pmd;
 
 		/*
@@ -454,7 +453,6 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 				} else if (write_fault)
 					goto fault;
 				pfns[i] |= HMM_PFN_DEVICE_UNADDRESSABLE;
-				pfns[i] |= flag;
 			} else if (is_migration_entry(entry)) {
 				if (hmm_vma_walk->fault) {
 					pte_unmap(ptep);
@@ -474,7 +472,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		if (write_fault && !pte_write(pte))
 			goto fault;
 
-		pfns[i] = hmm_pfn_t_from_pfn(pte_pfn(pte)) | flag;
+		pfns[i] = hmm_pfn_t_from_pfn(pte_pfn(pte));
 		pfns[i] |= pte_write(pte) ? HMM_PFN_WRITE : 0;
 		continue;
 
@@ -536,6 +534,17 @@ int hmm_vma_get_pfns(struct hmm_range *range)
 	list_add_rcu(&range->list, &hmm->ranges);
 	spin_unlock(&hmm->lock);
 
+	if (!(vma->vm_flags & VM_READ)) {
+		/*
+		 * If the vma does not allow read, assume it does not allow
+		 * write either, as only peculiar architectures allow write
+		 * without read and this is not a case we care about (some
+		 * operations like atomics no longer make sense).
+		 */
+		hmm_pfns_clear(range->pfns, range->start, range->end);
+		return 0;
+	}
+
 	hmm_vma_walk.fault = false;
 	hmm_vma_walk.range = range;
 	mm_walk.private = &hmm_vma_walk;
@@ -690,6 +699,17 @@ int hmm_vma_fault(struct hmm_range *range, bool write, bool block)
 	list_add_rcu(&range->list, &hmm->ranges);
 	spin_unlock(&hmm->lock);
 
+	if (!(vma->vm_flags & VM_READ)) {
+		/*
+		 * If the vma does not allow read, assume it does not allow
+		 * write either, as only peculiar architectures allow write
+		 * without read and this is not a case we care about (some
+		 * operations like atomics no longer make sense).
+		 */
+		hmm_pfns_clear(range->pfns, range->start, range->end);
+		return 0;
+	}
+
 	/* FIXME support hugetlb fs */
 	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
 		hmm_pfns_special(range->pfns, range->start, range->end);
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 07/14] mm/hmm: use uint64_t for HMM pfn instead of defining hmm_pfn_t to ulong
  2018-03-16 19:14 ` jglisse
@ 2018-03-16 19:14   ` jglisse
  -1 siblings, 0 replies; 55+ messages in thread
From: jglisse @ 2018-03-16 19:14 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Evgeny Baskakov, Ralph Campbell, Mark Hairgrove, John Hubbard

From: Jérôme Glisse <jglisse@redhat.com>

All device drivers we care about are using 64bit page table entries. In
order to match this and to avoid a useless define, convert all HMM pfns
to directly use uint64_t. It is a first step on the road to allowing
drivers to directly use the pfn values returned by HMM (saving memory
and CPU cycles used for conversion between the two).
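
With the typedef gone, a driver holds plain uint64_t values and uses the
renamed helpers; a sketch (the entry is assumed to come from
range->pfns):

 	uint64_t entry = range->pfns[i];
 	struct page *page = hmm_pfn_to_page(entry);
 	unsigned long pfn = hmm_pfn_to_pfn(entry);

 	/* Building an entry back from a struct page or a raw pfn: */
 	range->pfns[i] = hmm_pfn_from_page(page);
 	range->pfns[i] = hmm_pfn_from_pfn(pfn);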

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/hmm.h | 46 +++++++++++++++++++++-------------------------
 mm/hmm.c            | 26 +++++++++++++-------------
 2 files changed, 34 insertions(+), 38 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 4bdc58ffe9f3..78b3ed6d7977 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -80,8 +80,6 @@
 struct hmm;
 
 /*
- * hmm_pfn_t - HMM uses its own pfn type to keep several flags per page
- *
  * Flags:
  * HMM_PFN_VALID: pfn is valid
  * HMM_PFN_WRITE: CPU page table has write permission set
@@ -93,8 +91,6 @@ struct hmm;
  *      set and the pfn value is undefined.
  * HMM_PFN_DEVICE_UNADDRESSABLE: unaddressable device memory (ZONE_DEVICE)
  */
-typedef unsigned long hmm_pfn_t;
-
 #define HMM_PFN_VALID (1 << 0)
 #define HMM_PFN_WRITE (1 << 1)
 #define HMM_PFN_ERROR (1 << 2)
@@ -104,14 +100,14 @@ typedef unsigned long hmm_pfn_t;
 #define HMM_PFN_SHIFT 6
 
 /*
- * hmm_pfn_t_to_page() - return struct page pointed to by a valid hmm_pfn_t
- * @pfn: hmm_pfn_t to convert to struct page
- * Returns: struct page pointer if pfn is a valid hmm_pfn_t, NULL otherwise
+ * hmm_pfn_to_page() - return struct page pointed to by a valid HMM pfn
+ * @pfn: HMM pfn value to get corresponding struct page from
+ * Returns: struct page pointer if pfn is a valid HMM pfn, NULL otherwise
  *
- * If the hmm_pfn_t is valid (ie valid flag set) then return the struct page
- * matching the pfn value stored in the hmm_pfn_t. Otherwise return NULL.
+ * If the HMM pfn is valid (ie valid flag set) then return the struct page
+ * matching the pfn value stored in the HMM pfn. Otherwise return NULL.
  */
-static inline struct page *hmm_pfn_t_to_page(hmm_pfn_t pfn)
+static inline struct page *hmm_pfn_to_page(uint64_t pfn)
 {
 	if (!(pfn & HMM_PFN_VALID))
 		return NULL;
@@ -119,11 +115,11 @@ static inline struct page *hmm_pfn_t_to_page(hmm_pfn_t pfn)
 }
 
 /*
- * hmm_pfn_t_to_pfn() - return pfn value store in a hmm_pfn_t
- * @pfn: hmm_pfn_t to extract pfn from
- * Returns: pfn value if hmm_pfn_t is valid, -1UL otherwise
+ * hmm_pfn_to_pfn() - return the pfn value stored in an HMM pfn
+ * @pfn: HMM pfn value to extract pfn from
+ * Returns: pfn value if HMM pfn is valid, -1UL otherwise
  */
-static inline unsigned long hmm_pfn_t_to_pfn(hmm_pfn_t pfn)
+static inline unsigned long hmm_pfn_to_pfn(uint64_t pfn)
 {
 	if (!(pfn & HMM_PFN_VALID))
 		return -1UL;
@@ -131,21 +127,21 @@ static inline unsigned long hmm_pfn_t_to_pfn(hmm_pfn_t pfn)
 }
 
 /*
- * hmm_pfn_t_from_page() - create a valid hmm_pfn_t value from struct page
- * @page: struct page pointer for which to create the hmm_pfn_t
- * Returns: valid hmm_pfn_t for the page
+ * hmm_pfn_from_page() - create a valid HMM pfn value from struct page
+ * @page: struct page pointer for which to create the HMM pfn
+ * Returns: valid HMM pfn for the page
  */
-static inline hmm_pfn_t hmm_pfn_t_from_page(struct page *page)
+static inline uint64_t hmm_pfn_from_page(struct page *page)
 {
 	return (page_to_pfn(page) << HMM_PFN_SHIFT) | HMM_PFN_VALID;
 }
 
 /*
- * hmm_pfn_t_from_pfn() - create a valid hmm_pfn_t value from pfn
- * @pfn: pfn value for which to create the hmm_pfn_t
- * Returns: valid hmm_pfn_t for the pfn
+ * hmm_pfn_from_pfn() - create a valid HMM pfn value from pfn
+ * @pfn: pfn value for which to create the HMM pfn
+ * Returns: valid HMM pfn for the pfn
  */
-static inline hmm_pfn_t hmm_pfn_t_from_pfn(unsigned long pfn)
+static inline uint64_t hmm_pfn_from_pfn(unsigned long pfn)
 {
 	return (pfn << HMM_PFN_SHIFT) | HMM_PFN_VALID;
 }
@@ -284,7 +280,7 @@ struct hmm_range {
 	struct list_head	list;
 	unsigned long		start;
 	unsigned long		end;
-	hmm_pfn_t		*pfns;
+	uint64_t		*pfns;
 	bool			valid;
 };
 
@@ -307,7 +303,7 @@ bool hmm_vma_range_done(struct hmm_range *range);
 
 /*
  * Fault memory on behalf of device driver. Unlike handle_mm_fault(), this will
- * not migrate any device memory back to system memory. The hmm_pfn_t array will
+ * not migrate any device memory back to system memory. The HMM pfn array will
  * be updated with the fault result and current snapshot of the CPU page table
  * for the range.
  *
@@ -316,7 +312,7 @@ bool hmm_vma_range_done(struct hmm_range *range);
  * function returns -EAGAIN.
  *
  * Return value does not reflect if the fault was successful for every single
- * address or not. Therefore, the caller must to inspect the hmm_pfn_t array to
+ * address or not. Therefore, the caller must inspect the HMM pfn array to
  * determine fault status for each address.
  *
  * Trying to fault inside an invalid vma will result in -EINVAL.
diff --git a/mm/hmm.c b/mm/hmm.c
index fa3c605c4b96..f674b73e7f4a 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -261,7 +261,7 @@ struct hmm_vma_walk {
 
 static int hmm_vma_do_fault(struct mm_walk *walk,
 			    unsigned long addr,
-			    hmm_pfn_t *pfn)
+			    uint64_t *pfn)
 {
 	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_REMOTE;
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
@@ -281,7 +281,7 @@ static int hmm_vma_do_fault(struct mm_walk *walk,
 	return -EAGAIN;
 }
 
-static void hmm_pfns_special(hmm_pfn_t *pfns,
+static void hmm_pfns_special(uint64_t *pfns,
 			     unsigned long addr,
 			     unsigned long end)
 {
@@ -295,7 +295,7 @@ static int hmm_pfns_bad(unsigned long addr,
 {
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
 	struct hmm_range *range = hmm_vma_walk->range;
-	hmm_pfn_t *pfns = range->pfns;
+	uint64_t *pfns = range->pfns;
 	unsigned long i;
 
 	i = (addr - range->start) >> PAGE_SHIFT;
@@ -305,7 +305,7 @@ static int hmm_pfns_bad(unsigned long addr,
 	return 0;
 }
 
-static void hmm_pfns_clear(hmm_pfn_t *pfns,
+static void hmm_pfns_clear(uint64_t *pfns,
 			   unsigned long addr,
 			   unsigned long end)
 {
@@ -319,7 +319,7 @@ static int hmm_vma_walk_hole(unsigned long addr,
 {
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
 	struct hmm_range *range = hmm_vma_walk->range;
-	hmm_pfn_t *pfns = range->pfns;
+	uint64_t *pfns = range->pfns;
 	unsigned long i;
 
 	hmm_vma_walk->last = addr;
@@ -344,7 +344,7 @@ static int hmm_vma_walk_clear(unsigned long addr,
 {
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
 	struct hmm_range *range = hmm_vma_walk->range;
-	hmm_pfn_t *pfns = range->pfns;
+	uint64_t *pfns = range->pfns;
 	unsigned long i;
 
 	hmm_vma_walk->last = addr;
@@ -371,7 +371,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
 	struct hmm_range *range = hmm_vma_walk->range;
 	struct vm_area_struct *vma = walk->vma;
-	hmm_pfn_t *pfns = range->pfns;
+	uint64_t *pfns = range->pfns;
 	unsigned long addr = start, i;
 	bool write_fault;
 	pte_t *ptep;
@@ -388,7 +388,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 
 	if (pmd_devmap(*pmdp) || pmd_trans_huge(*pmdp)) {
 		unsigned long pfn;
-		hmm_pfn_t flag = 0;
+		uint64_t flag = 0;
 		pmd_t pmd;
 
 		/*
@@ -413,7 +413,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		pfn = pmd_pfn(pmd) + pte_index(addr);
 		flag |= pmd_write(pmd) ? HMM_PFN_WRITE : 0;
 		for (; addr < end; addr += PAGE_SIZE, i++, pfn++)
-			pfns[i] = hmm_pfn_t_from_pfn(pfn) | flag;
+			pfns[i] = hmm_pfn_from_pfn(pfn) | flag;
 		return 0;
 	}
 
@@ -447,7 +447,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 			 * device and report anything else as error.
 			 */
 			if (is_device_private_entry(entry)) {
-				pfns[i] = hmm_pfn_t_from_pfn(swp_offset(entry));
+				pfns[i] = hmm_pfn_from_pfn(swp_offset(entry));
 				if (is_write_device_private_entry(entry)) {
 					pfns[i] |= HMM_PFN_WRITE;
 				} else if (write_fault)
@@ -472,7 +472,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		if (write_fault && !pte_write(pte))
 			goto fault;
 
-		pfns[i] = hmm_pfn_t_from_pfn(pte_pfn(pte));
+		pfns[i] = hmm_pfn_from_pfn(pte_pfn(pte));
 		pfns[i] |= pte_write(pte) ? HMM_PFN_WRITE : 0;
 		continue;
 
@@ -634,8 +634,8 @@ EXPORT_SYMBOL(hmm_vma_range_done);
  * This is similar to a regular CPU page fault except that it will not trigger
  * any memory migration if the memory being faulted is not accessible by CPUs.
  *
- * On error, for one virtual address in the range, the function will set the
- * hmm_pfn_t error flag for the corresponding pfn entry.
+ * On error, for one virtual address in the range, the function will mark the
+ * corresponding HMM pfn entry with the error flag.
  *
  * Expected use pattern:
  * retry:
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 08/14] mm/hmm: cleanup special vma handling (VM_SPECIAL)
  2018-03-16 19:14 ` jglisse
@ 2018-03-16 19:14   ` jglisse
  -1 siblings, 0 replies; 55+ messages in thread
From: jglisse @ 2018-03-16 19:14 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Evgeny Baskakov, Ralph Campbell, Mark Hairgrove, John Hubbard

From: Jérôme Glisse <jglisse@redhat.com>

A special vma (one with any of the VM_SPECIAL flags) cannot be accessed
by a device because there is no consistent model across device drivers
for those vmas and their backing memory.

This patch uses the hmm_range struct directly as the hmm_pfns_special()
argument, as the helper always affects the whole vma and thus the whole
range.

It also makes behavior consistent: after this patch both hmm_vma_fault()
and hmm_vma_get_pfns() return -EINVAL when facing such a vma. Previously
hmm_vma_fault() returned 0 while hmm_vma_get_pfns() returned -EINVAL,
but both filled the HMM pfn array with special entries.
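
For illustration (a sketch, not part of this patch), a driver can now
use a single error path for both entry points; the drv_ names are
hypothetical:

static int drv_snapshot(struct hmm_range *range)
{
	int ret;

	ret = hmm_vma_get_pfns(range);
	/* After this patch hmm_vma_fault() fails the same way on a
	 * hugetlbfs or VM_SPECIAL vma, so one error path suffices. */
	if (ret == -EINVAL)
		return ret;	/* do not mirror this range */
	if (ret)
		return ret;	/* other errors */
	return drv_program_device(range);	/* hypothetical */
}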

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 mm/hmm.c | 40 ++++++++++++++++++++--------------------
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index f674b73e7f4a..04595a994542 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -281,14 +281,6 @@ static int hmm_vma_do_fault(struct mm_walk *walk,
 	return -EAGAIN;
 }
 
-static void hmm_pfns_special(uint64_t *pfns,
-			     unsigned long addr,
-			     unsigned long end)
-{
-	for (; addr < end; addr += PAGE_SIZE, pfns++)
-		*pfns = HMM_PFN_SPECIAL;
-}
-
 static int hmm_pfns_bad(unsigned long addr,
 			unsigned long end,
 			struct mm_walk *walk)
@@ -486,6 +478,14 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 	return 0;
 }
 
+static void hmm_pfns_special(struct hmm_range *range)
+{
+	unsigned long addr = range->start, i = 0;
+
+	for (; addr < range->end; addr += PAGE_SIZE, i++)
+		range->pfns[i] = HMM_PFN_SPECIAL;
+}
+
 /*
  * hmm_vma_get_pfns() - snapshot CPU page table for a range of virtual addresses
  * @range: range being snapshoted and all needed informations
@@ -509,12 +509,6 @@ int hmm_vma_get_pfns(struct hmm_range *range)
 	struct mm_walk mm_walk;
 	struct hmm *hmm;
 
-	/* FIXME support hugetlb fs */
-	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
-		hmm_pfns_special(range->pfns, range->start, range->end);
-		return -EINVAL;
-	}
-
 	/* Sanity check, this really should not happen ! */
 	if (range->start < vma->vm_start || range->start >= vma->vm_end)
 		return -EINVAL;
@@ -528,6 +522,12 @@ int hmm_vma_get_pfns(struct hmm_range *range)
 	if (!hmm->mmu_notifier.ops)
 		return -EINVAL;
 
+	/* FIXME support hugetlb fs */
+	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
+		hmm_pfns_special(range);
+		return -EINVAL;
+	}
+
 	/* Initialize range to track CPU page table update */
 	spin_lock(&hmm->lock);
 	range->valid = true;
@@ -693,6 +693,12 @@ int hmm_vma_fault(struct hmm_range *range, bool write, bool block)
 	if (!hmm->mmu_notifier.ops)
 		return -EINVAL;
 
+	/* FIXME support hugetlb fs */
+	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
+		hmm_pfns_special(range);
+		return -EINVAL;
+	}
+
 	/* Initialize range to track CPU page table update */
 	spin_lock(&hmm->lock);
 	range->valid = true;
@@ -710,12 +716,6 @@ int hmm_vma_fault(struct hmm_range *range, bool write, bool block)
 		return 0;
 	}
 
-	/* FIXME support hugetlb fs */
-	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
-		hmm_pfns_special(range->pfns, range->start, range->end);
-		return 0;
-	}
-
 	hmm_vma_walk.fault = true;
 	hmm_vma_walk.write = write;
 	hmm_vma_walk.block = block;
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 09/14] mm/hmm: do not differentiate between empty entry or missing directory
  2018-03-16 19:14 ` jglisse
@ 2018-03-16 19:14   ` jglisse
  -1 siblings, 0 replies; 55+ messages in thread
From: jglisse @ 2018-03-16 19:14 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Evgeny Baskakov, Ralph Campbell, Mark Hairgrove, John Hubbard

From: Jérôme Glisse <jglisse@redhat.com>

There is no point in differentiating between a range for which there
is not even a directory (and thus no entries) and a range with empty
entries (pte_none() or pmd_none() returns true).

Simply drop the distinction, ie remove the HMM_PFN_EMPTY flag, and
merge the now-duplicate hmm_vma_walk_hole() and hmm_vma_walk_clear()
functions.
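
For illustration (a sketch, not part of this patch): with HMM_PFN_EMPTY
gone, an empty pte and a missing page table directory both yield a zero
entry, so a driver only needs to test the valid bit:

static inline bool drv_entry_present(uint64_t hmm_pfn)
{
	return (hmm_pfn & HMM_PFN_VALID) != 0;
}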

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/hmm.h |  8 +++-----
 mm/hmm.c            | 45 +++++++++++++++------------------------------
 2 files changed, 18 insertions(+), 35 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 78b3ed6d7977..6d2b6bf6da4b 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -84,7 +84,6 @@ struct hmm;
  * HMM_PFN_VALID: pfn is valid
  * HMM_PFN_WRITE: CPU page table has write permission set
  * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned memory
- * HMM_PFN_EMPTY: corresponding CPU page table entry is pte_none()
  * HMM_PFN_SPECIAL: corresponding CPU page table entry is special; i.e., the
  *      result of vm_insert_pfn() or vm_insert_page(). Therefore, it should not
  *      be mirrored by a device, because the entry will never have HMM_PFN_VALID
@@ -94,10 +93,9 @@ struct hmm;
 #define HMM_PFN_VALID (1 << 0)
 #define HMM_PFN_WRITE (1 << 1)
 #define HMM_PFN_ERROR (1 << 2)
-#define HMM_PFN_EMPTY (1 << 3)
-#define HMM_PFN_SPECIAL (1 << 4)
-#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 5)
-#define HMM_PFN_SHIFT 6
+#define HMM_PFN_SPECIAL (1 << 3)
+#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 4)
+#define HMM_PFN_SHIFT 5
 
 /*
  * hmm_pfn_to_page() - return struct page pointed to by a valid HMM pfn
diff --git a/mm/hmm.c b/mm/hmm.c
index 04595a994542..2118e42cb838 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -305,6 +305,16 @@ static void hmm_pfns_clear(uint64_t *pfns,
 		*pfns = 0;
 }
 
+/*
+ * hmm_vma_walk_hole() - handle a range backed by no pmd or no pte
+ * @start: range virtual start address (inclusive)
+ * @end: range virtual end address (exclusive)
+ * @walk: mm_walk structure
+ * Returns: 0 on success, -EAGAIN after page fault, or page fault error
+ *
+ * This is a helper called whenever pmd_none() or pte_none() returns true
+ * or when there is no directory covering the range.
+ */
 static int hmm_vma_walk_hole(unsigned long addr,
 			     unsigned long end,
 			     struct mm_walk *walk)
@@ -314,31 +324,6 @@ static int hmm_vma_walk_hole(unsigned long addr,
 	uint64_t *pfns = range->pfns;
 	unsigned long i;
 
-	hmm_vma_walk->last = addr;
-	i = (addr - range->start) >> PAGE_SHIFT;
-	for (; addr < end; addr += PAGE_SIZE, i++) {
-		pfns[i] = HMM_PFN_EMPTY;
-		if (hmm_vma_walk->fault) {
-			int ret;
-
-			ret = hmm_vma_do_fault(walk, addr, &pfns[i]);
-			if (ret != -EAGAIN)
-				return ret;
-		}
-	}
-
-	return hmm_vma_walk->fault ? -EAGAIN : 0;
-}
-
-static int hmm_vma_walk_clear(unsigned long addr,
-			      unsigned long end,
-			      struct mm_walk *walk)
-{
-	struct hmm_vma_walk *hmm_vma_walk = walk->private;
-	struct hmm_range *range = hmm_vma_walk->range;
-	uint64_t *pfns = range->pfns;
-	unsigned long i;
-
 	hmm_vma_walk->last = addr;
 	i = (addr - range->start) >> PAGE_SHIFT;
 	for (; addr < end; addr += PAGE_SIZE, i++) {
@@ -397,10 +382,10 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		if (!pmd_devmap(pmd) && !pmd_trans_huge(pmd))
 			goto again;
 		if (pmd_protnone(pmd))
-			return hmm_vma_walk_clear(start, end, walk);
+			return hmm_vma_walk_hole(start, end, walk);
 
 		if (write_fault && !pmd_write(pmd))
-			return hmm_vma_walk_clear(start, end, walk);
+			return hmm_vma_walk_hole(start, end, walk);
 
 		pfn = pmd_pfn(pmd) + pte_index(addr);
 		flag |= pmd_write(pmd) ? HMM_PFN_WRITE : 0;
@@ -419,7 +404,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		pfns[i] = 0;
 
 		if (pte_none(pte)) {
-			pfns[i] = HMM_PFN_EMPTY;
+			pfns[i] = 0;
 			if (hmm_vma_walk->fault)
 				goto fault;
 			continue;
@@ -470,8 +455,8 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 
 fault:
 		pte_unmap(ptep);
-		/* Fault all pages in range */
-		return hmm_vma_walk_clear(start, end, walk);
+		/* Fault all pages in range if asked for */
+		return hmm_vma_walk_hole(start, end, walk);
 	}
 	pte_unmap(ptep - 1);
 
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 02/14] mm/hmm: fix header file if/else/endif maze
  2018-03-16 19:14   ` jglisse
  (?)
@ 2018-03-16 21:09   ` Andrew Morton
  2018-03-16 21:18       ` Jerome Glisse
  -1 siblings, 1 reply; 55+ messages in thread
From: Andrew Morton @ 2018-03-16 21:09 UTC (permalink / raw)
  To: jglisse
  Cc: linux-mm, linux-kernel, stable, Ralph Campbell, John Hubbard,
	Evgeny Baskakov

On Fri, 16 Mar 2018 15:14:07 -0400 jglisse@redhat.com wrote:

> From: Jérôme Glisse <jglisse@redhat.com>
> 
> The #if/#else/#endif for IS_ENABLED(CONFIG_HMM) were wrong.

"were wrong" is not a sufficient explanation of the problem, especially
if we're requesting a -stable backport.  Please fully describe the
effects of a bug when fixing it?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 03/14] mm/hmm: HMM should have a callback before MM is destroyed v2
  2018-03-16 19:14   ` jglisse
  (?)
@ 2018-03-16 21:12   ` Andrew Morton
  2018-03-16 21:26       ` Jerome Glisse
  -1 siblings, 1 reply; 55+ messages in thread
From: Andrew Morton @ 2018-03-16 21:12 UTC (permalink / raw)
  To: jglisse
  Cc: linux-mm, linux-kernel, Ralph Campbell, stable, Evgeny Baskakov,
	Mark Hairgrove, John Hubbard

On Fri, 16 Mar 2018 15:14:08 -0400 jglisse@redhat.com wrote:

> The hmm_mirror_register() function registers a callback for when
> the CPU pagetable is modified. Normally, the device driver will
> call hmm_mirror_unregister() when the process using the device is
> finished. However, if the process exits uncleanly, the struct_mm
> can be destroyed with no warning to the device driver.

Again, what are the user-visible effects of the bug?  Such info is
needed when others review our request for a -stable backport.  And the
many people who review -stable patches for integration into their own
kernel trees will want to understand the benefit of the patch to their
users.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 02/14] mm/hmm: fix header file if/else/endif maze
  2018-03-16 21:09   ` Andrew Morton
  2018-03-16 21:18       ` Jerome Glisse
@ 2018-03-16 21:18       ` Jerome Glisse
  0 siblings, 0 replies; 55+ messages in thread
From: Jerome Glisse @ 2018-03-16 21:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, stable, Ralph Campbell, John Hubbard,
	Evgeny Baskakov

On Fri, Mar 16, 2018 at 02:09:59PM -0700, Andrew Morton wrote:
> On Fri, 16 Mar 2018 15:14:07 -0400 jglisse@redhat.com wrote:
> 
> > From: Jérôme Glisse <jglisse@redhat.com>
> > 
> > The #if/#else/#endif for IS_ENABLED(CONFIG_HMM) were wrong.
> 
> "were wrong" is not a sufficient explanation of the problem, especially
> if we're requesting a -stable backport.  Please fully describe the
> effects of a bug when fixing it?

The most obvious issue is a build failure (compilation error) when
hmm.h is included multiple times through different headers. So it will
show up quickly with any big driver that includes the file from
different headers.
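
To make that concrete (an illustrative reconstruction, not quoted from
the header), the symptom is an ordinary redefinition error once the
stubs become visible twice in one translation unit:

static inline void hmm_mm_init(struct mm_struct *mm)
{
	mm->hmm = NULL;
}
/* ...and later, because of the misplaced #endif: */
static inline void hmm_mm_init(struct mm_struct *mm) {}	/* error: redefinition */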

I can respin with that. Sorry again for not being more explanatory;
it is always hard for me to figure out what is not obvious to others.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 03/14] mm/hmm: HMM should have a callback before MM is destroyed v2
  2018-03-16 21:12   ` Andrew Morton
  2018-03-16 21:26       ` Jerome Glisse
@ 2018-03-16 21:26       ` Jerome Glisse
  0 siblings, 0 replies; 55+ messages in thread
From: Jerome Glisse @ 2018-03-16 21:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Ralph Campbell, stable, Evgeny Baskakov,
	Mark Hairgrove, John Hubbard

On Fri, Mar 16, 2018 at 02:12:21PM -0700, Andrew Morton wrote:
> On Fri, 16 Mar 2018 15:14:08 -0400 jglisse@redhat.com wrote:
> 
> > The hmm_mirror_register() function registers a callback for when
> > the CPU pagetable is modified. Normally, the device driver will
> > call hmm_mirror_unregister() when the process using the device is
> > finished. However, if the process exits uncleanly, the struct_mm
> > can be destroyed with no warning to the device driver.
> 
> Again, what are the user-visible effects of the bug?  Such info is
> needed when others review our request for a -stable backport.  And the
> many people who review -stable patches for integration into their own
> kernel trees will want to understand the benefit of the patch to their
> users.

I have not had any issues in any of my own testing, but the nouveau
driver is not as advanced as the NVidia closed driver with respect to
HMM integration yet.

If there are any issues they will happen between exit_mm() and
exit_files() in do_exit() (kernel/exit.c): exit_mm() tears down the mm
struct, but without this callback the device driver might still be
handling page faults and thus might potentially try to handle them
against a dead mm_struct.
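
A driver-side sketch of the new hook (the drv_ names are hypothetical,
assuming the release() callback this patch adds to struct
hmm_mirror_ops):

static void drv_mirror_release(struct hmm_mirror *mirror)
{
	struct drv_process *p = container_of(mirror, struct drv_process, mirror);

	/* The mm is being torn down: stop servicing device page faults
	 * for this process instead of racing against a dead mm_struct.
	 */
	drv_stop_device_faults(p);
}

static const struct hmm_mirror_ops drv_mirror_ops = {
	.release			= drv_mirror_release,
	.sync_cpu_device_pagetables	= drv_sync_cpu_device_pagetables,
};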

So i am not sure what the symptoms would be. To be fair, there is no
public driver using that part of HMM besides the nouveau RFC patches,
so at this point the impact on anybody is non-existent. Anyone wanting
to backport nouveau HMM support once it makes it upstream will probably
have to backport more things along the way. This is why i am not that
aggressive about ccing stable so far.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 02/14] mm/hmm: fix header file if/else/endif maze
  2018-03-16 21:18       ` Jerome Glisse
  (?)
  (?)
@ 2018-03-16 21:35       ` Andrew Morton
  2018-03-16 21:40           ` John Hubbard
  -1 siblings, 1 reply; 55+ messages in thread
From: Andrew Morton @ 2018-03-16 21:35 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, stable, Ralph Campbell, John Hubbard,
	Evgeny Baskakov

On Fri, 16 Mar 2018 17:18:02 -0400 Jerome Glisse <jglisse@redhat.com> wrote:

> On Fri, Mar 16, 2018 at 02:09:59PM -0700, Andrew Morton wrote:
> > On Fri, 16 Mar 2018 15:14:07 -0400 jglisse@redhat.com wrote:
> > 
> > > From: Jérôme Glisse <jglisse@redhat.com>
> > > 
> > > The #if/#else/#endif for IS_ENABLED(CONFIG_HMM) were wrong.
> > 
> > "were wrong" is not a sufficient explanation of the problem, especially
> > if we're requesting a -stable backport.  Please fully describe the
> > effects of a bug when fixing it?
> 
> Build issue (compilation failure) if you have multiple includes of
> hmm.h through different headers is the most obvious issue. So it
> will be very obvious with any big driver that include the file in
> different headers.

That doesn't seem to warrant a -stable backport?  The developer of such
a driver will simply fix the headers?

> I can respin with that. Sorry again for not being more explanatory
> it is always hard for me to figure what is not obvious to others.

I updated the changelog, no respin needed.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 03/14] mm/hmm: HMM should have a callback before MM is destroyed v2
  2018-03-16 21:26       ` Jerome Glisse
@ 2018-03-16 21:37         ` John Hubbard
  -1 siblings, 0 replies; 55+ messages in thread
From: John Hubbard @ 2018-03-16 21:37 UTC (permalink / raw)
  To: Jerome Glisse, Andrew Morton
  Cc: linux-mm, linux-kernel, Ralph Campbell, stable, Evgeny Baskakov,
	Mark Hairgrove

On 03/16/2018 02:26 PM, Jerome Glisse wrote:
> On Fri, Mar 16, 2018 at 02:12:21PM -0700, Andrew Morton wrote:
>> On Fri, 16 Mar 2018 15:14:08 -0400 jglisse@redhat.com wrote:
>>
>>> The hmm_mirror_register() function registers a callback for when
>>> the CPU pagetable is modified. Normally, the device driver will
>>> call hmm_mirror_unregister() when the process using the device is
>>> finished. However, if the process exits uncleanly, the struct_mm
>>> can be destroyed with no warning to the device driver.
>>
>> Again, what are the user-visible effects of the bug?  Such info is
>> needed when others review our request for a -stable backport.  And the
>> many people who review -stable patches for integration into their own
>> kernel trees will want to understand the benefit of the patch to their
>> users.
> 
> I have not had any issues in any of my own testing but nouveau driver
> is not as advance as the NVidia closed driver in respect to HMM inte-
> gration yet.
> 
> If any issues they will happen between exit_mm() and exit_files() in
> do_exit() (kernel/exit.c) exit_mm() tear down the mm struct but without
> this callback the device driver might still be handling page fault and
> thus might potentialy tries to handle them against a dead mm_struct.
> 
> So i am not sure what are the symptoms. To be fair there is no public
> driver using that part of HMM beside nouveau rfc patches. So at this
> point the impact on anybody is non existent. If anyone want to back-
> port nouveau HMM support once it make it upstream it will probably
> have to backport more things along the way. This is why i am not that
> aggressive on ccing stable so far.

The problem I'd like to avoid is having a version of HMM in stable that
is missing this new callback. Without it, once the driver starts doing
actual concurrent operations, we can expect the race condition to
happen.

It just seems unfortunate to have stable versions out there that would
be exposed to this, when it only requires a small patch to avoid it.

On the other hand, it's also reasonable to claim that this is part of the
evolving HMM feature, and as such, this new feature does not belong in
stable.  I'm not sure which argument carries more weight here.

thanks,
-- 
John Hubbard
NVIDIA

> 
> Cheers,
> Jérôme
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 02/14] mm/hmm: fix header file if/else/endif maze
  2018-03-16 21:35       ` Andrew Morton
@ 2018-03-16 21:40           ` John Hubbard
  0 siblings, 0 replies; 55+ messages in thread
From: John Hubbard @ 2018-03-16 21:40 UTC (permalink / raw)
  To: Andrew Morton, Jerome Glisse
  Cc: linux-mm, linux-kernel, stable, Ralph Campbell, Evgeny Baskakov

On 03/16/2018 02:35 PM, Andrew Morton wrote:
> On Fri, 16 Mar 2018 17:18:02 -0400 Jerome Glisse <jglisse@redhat.com> wrote:
> 
>> On Fri, Mar 16, 2018 at 02:09:59PM -0700, Andrew Morton wrote:
>>> On Fri, 16 Mar 2018 15:14:07 -0400 jglisse@redhat.com wrote:
>>>
>>>> From: Jérôme Glisse <jglisse@redhat.com>
>>>>
>>>> The #if/#else/#endif for IS_ENABLED(CONFIG_HMM) were wrong.
>>>
>>> "were wrong" is not a sufficient explanation of the problem, especially
>>> if we're requesting a -stable backport.  Please fully describe the
>>> effects of a bug when fixing it?
>>
>> Build issue (compilation failure) if you have multiple includes of
>> hmm.h through different headers is the most obvious issue. So it
>> will be very obvious with any big driver that include the file in
>> different headers.
> 
> That doesn't seem to warrant a -stable backport?  The developer of such
> a driver will simply fix the headers?

Right. For this patch, I would strongly request a -stable backport.  It's 
really going to cause problems if anyone tries to use -stable with HMM,
without this fix.

thanks,
-- 
John Hubbard
NVIDIA

> 
>> I can respin with that. Sorry again for not being more explanatory
>> it is always hard for me to figure what is not obvious to others.
> 
> I updated the changelog, no respin needed.
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 02/14] mm/hmm: fix header file if/else/endif maze v2
  2018-03-16 19:14   ` jglisse
@ 2018-03-17  1:20     ` jglisse
  -1 siblings, 0 replies; 55+ messages in thread
From: jglisse @ 2018-03-17  1:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse, stable,
	Ralph Campbell, John Hubbard, Evgeny Baskakov

From: Jérôme Glisse <jglisse@redhat.com>

The #if/#else/#endif placement for IS_ENABLED(CONFIG_HMM) was wrong. Because
of this, after multiple inclusions of the header there were multiple
definitions of both hmm_mm_init() and hmm_mm_destroy(), leading to a build
failure if HMM was enabled (CONFIG_HMM set).
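
To make the failure mode concrete, here is a minimal sketch (hypothetical
header and config names, not the actual hmm.h layout): once the first
#endif closes the conditional too early, a later #else silently pairs
with the #ifndef include guard instead, so a second inclusion of the
header re-emits the stub definitions:

    #ifndef SKETCH_H
    #define SKETCH_H

    #if IS_ENABLED(CONFIG_FOO)
    static inline void foo_init(void) { }  /* defined on 1st include */
    #endif /* IS_ENABLED(CONFIG_FOO) */    /* <-- closed too early */

    #else  /* pairs with #ifndef SKETCH_H, not with CONFIG_FOO! */
    static inline void foo_init(void) { }  /* redefined on 2nd include */
    #endif /* meant for CONFIG_FOO, actually closes the include guard */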

Changed since v1:
  - Fix the maze when CONFIG_HMM is disabled, not just when it is
    enabled. This fixes the bot-reported build failure.
  - Improved commit message.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
---
 include/linux/hmm.h | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 325017ad9311..36dd21fe5caf 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -498,23 +498,16 @@ struct hmm_device {
 struct hmm_device *hmm_device_new(void *drvdata);
 void hmm_device_put(struct hmm_device *hmm_device);
 #endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-#endif /* IS_ENABLED(CONFIG_HMM) */
 
 /* Below are for HMM internal use only! Not to be used by device driver! */
-#if IS_ENABLED(CONFIG_HMM_MIRROR)
 void hmm_mm_destroy(struct mm_struct *mm);
 
 static inline void hmm_mm_init(struct mm_struct *mm)
 {
 	mm->hmm = NULL;
 }
-#else /* IS_ENABLED(CONFIG_HMM_MIRROR) */
-static inline void hmm_mm_destroy(struct mm_struct *mm) {}
-static inline void hmm_mm_init(struct mm_struct *mm) {}
-#endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
-
-
 #else /* IS_ENABLED(CONFIG_HMM) */
 static inline void hmm_mm_destroy(struct mm_struct *mm) {}
 static inline void hmm_mm_init(struct mm_struct *mm) {}
+#endif /* IS_ENABLED(CONFIG_HMM) */
 #endif /* LINUX_HMM_H */
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 04/14] mm/hmm: hmm_pfns_bad() was accessing wrong struct
  2018-03-16 19:14   ` jglisse
@ 2018-03-17  2:04     ` John Hubbard
  -1 siblings, 0 replies; 55+ messages in thread
From: John Hubbard @ 2018-03-17  2:04 UTC (permalink / raw)
  To: jglisse, linux-mm
  Cc: Andrew Morton, linux-kernel, stable, Evgeny Baskakov,
	Ralph Campbell, Mark Hairgrove

On 03/16/2018 12:14 PM, jglisse@redhat.com wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> The private field of the mm_walk struct points to an hmm_vma_walk struct,
> not to the desired hmm_range struct. Fix it to get the proper struct pointer.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: stable@vger.kernel.org
> Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Mark Hairgrove <mhairgrove@nvidia.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> ---
>  mm/hmm.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 6088fa6ed137..64d9e7dae712 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -293,7 +293,8 @@ static int hmm_pfns_bad(unsigned long addr,
>  			unsigned long end,
>  			struct mm_walk *walk)
>  {
> -	struct hmm_range *range = walk->private;
> +	struct hmm_vma_walk *hmm_vma_walk = walk->private;
> +	struct hmm_range *range = hmm_vma_walk->range;
>  	hmm_pfn_t *pfns = range->pfns;
>  	unsigned long i;
>  

This fix looks good. I also checked the other uses of walk->private, of course, 
but it was only this one that was wrong.

I think this patch also belongs in -stable, because it is a simple bug fix.

For the description, well...actually, because ->range is the first element in
struct hmm_vma_walk, you probably end up with the same pointer value, both
before and after this fix. So maybe there are no symptoms to see. Maybe that's
an argument for *not* putting it in -stable, too. I'll leave that question
to more experienced people.
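
(For reference, the aliasing that can make the old code appear to work is
just the C rule that a struct and its first member share an address. A
tiny self-contained sketch, with a made-up layout:)

    #include <assert.h>

    struct hmm_range;                    /* opaque for this sketch */

    struct hmm_vma_walk {
            struct hmm_range *range;     /* first member, offset 0 */
            int other;
    };

    int main(void)
    {
            struct hmm_vma_walk w = { 0 };
            /* the struct and its first member live at the same address */
            assert((void *)&w == (void *)&w.range);
            return 0;
    }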

Either way, you can add: 

Reviewed-by: John Hubbard <jhubbard@nvidia.com>

thanks,
-- 
John Hubbard
NVIDIA
 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 03/14] mm/hmm: HMM should have a callback before MM is destroyed v2
  2018-03-16 19:14   ` jglisse
@ 2018-03-17  2:36     ` John Hubbard
  -1 siblings, 0 replies; 55+ messages in thread
From: John Hubbard @ 2018-03-17  2:36 UTC (permalink / raw)
  To: jglisse, linux-mm
  Cc: Andrew Morton, linux-kernel, Ralph Campbell, stable,
	Evgeny Baskakov, Mark Hairgrove

On 03/16/2018 12:14 PM, jglisse@redhat.com wrote:
> From: Ralph Campbell <rcampbell@nvidia.com>
> 

<snip>

> +static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> +	struct hmm *hmm = mm->hmm;
> +	struct hmm_mirror *mirror;
> +	struct hmm_mirror *mirror_next;
> +
> +	down_write(&hmm->mirrors_sem);
> +	list_for_each_entry_safe(mirror, mirror_next, &hmm->mirrors, list) {
> +		list_del_init(&mirror->list);
> +		if (mirror->ops->release)
> +			mirror->ops->release(mirror);
> +	}
> +	up_write(&hmm->mirrors_sem);
> +}
> +

OK, as for actual code review:

This part of the locking looks good. However, I think it can race against
hmm_mirror_register(), because hmm_mirror_register() will just add a new 
mirror regardless.

So:

thread 1                                      thread 2
--------------                                -----------------
hmm_release                                   hmm_mirror_register 
    down_write(&hmm->mirrors_sem);                <blocked: waiting for sem>
        // deletes all list items
    up_write
                                                  unblocked: adds new mirror
                                              

...so I think we need a way to back out of any pending hmm_mirror_register()
calls, as part of the .release steps, right? It seems hard for the device driver,
which could be inside of hmm_mirror_register(), to handle that. Especially considering
that right now, hmm_mirror_register() will return success in this case--so
there is no indication that anything is wrong.

Maybe hmm_mirror_register() could return an error (and not add to the mirror list),
in such a situation, how's that sound?

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 03/14] mm/hmm: HMM should have a callback before MM is destroyed v2
@ 2018-03-17  2:36     ` John Hubbard
  0 siblings, 0 replies; 55+ messages in thread
From: John Hubbard @ 2018-03-17  2:36 UTC (permalink / raw)
  To: jglisse, linux-mm
  Cc: Andrew Morton, linux-kernel, Ralph Campbell, stable,
	Evgeny Baskakov, Mark Hairgrove

On 03/16/2018 12:14 PM, jglisse@redhat.com wrote:
> From: Ralph Campbell <rcampbell@nvidia.com>
> 

<snip>

> +static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> +	struct hmm *hmm = mm->hmm;
> +	struct hmm_mirror *mirror;
> +	struct hmm_mirror *mirror_next;
> +
> +	down_write(&hmm->mirrors_sem);
> +	list_for_each_entry_safe(mirror, mirror_next, &hmm->mirrors, list) {
> +		list_del_init(&mirror->list);
> +		if (mirror->ops->release)
> +			mirror->ops->release(mirror);
> +	}
> +	up_write(&hmm->mirrors_sem);
> +}
> +

OK, as for actual code review:

This part of the locking looks good. However, I think it can race against
hmm_mirror_register(), because hmm_mirror_register() will just add a new 
mirror regardless.

So:

thread 1                                      thread 2
--------------                                -----------------
hmm_release                                   hmm_mirror_register 
    down_write(&hmm->mirrors_sem);                <blocked: waiting for sem>
        // deletes all list items
    up_write
                                                  unblocked: adds new mirror
                                              

...so I think we need a way to back out of any pending hmm_mirror_register()
calls, as part of the .release steps, right? It seems hard for the device driver,
which could be inside of hmm_mirror_register(), to handle that. Especially considering
that right now, hmm_mirror_register() will return success in this case--so
there is no indication that anything is wrong.

Maybe hmm_mirror_register() could return an error (and not add to the mirror list),
in such a situation, how's that sound?

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 05/14] mm/hmm: use struct for hmm_vma_fault(), hmm_vma_get_pfns() parameters
  2018-03-16 19:14   ` jglisse
@ 2018-03-17  3:08     ` John Hubbard
  -1 siblings, 0 replies; 55+ messages in thread
From: John Hubbard @ 2018-03-17  3:08 UTC (permalink / raw)
  To: jglisse, linux-mm
  Cc: Andrew Morton, linux-kernel, Evgeny Baskakov, Ralph Campbell,
	Mark Hairgrove

On 03/16/2018 12:14 PM, jglisse@redhat.com wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 

Hi Jerome,

I failed to find any problems in this patch, so:

Reviewed-by: John Hubbard <jhubbard@nvidia.com>

There are a couple of recommended typo fixes for the documentation listed
below; they are very minor, but as long as I'm here I'll point them out.

> Both hmm_vma_fault() and hmm_vma_get_pfns() were taking a hmm_range
> struct as parameter and were initializing that struct with others of
> their parameters. Have caller of those function do this as they are
> likely to already do and only pass this struct to both function this
> shorten function signature and make it easiers in the future to add

                                         easier

> new parameters by simply adding them to the structure.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Mark Hairgrove <mhairgrove@nvidia.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> ---
>  include/linux/hmm.h | 18 ++++---------
>  mm/hmm.c            | 78 +++++++++++++++++++----------------------------------
>  2 files changed, 33 insertions(+), 63 deletions(-)


<snip>
>  
>  
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 64d9e7dae712..49f0f6b337ed 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -490,11 +490,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>  
>  /*
>   * hmm_vma_get_pfns() - snapshot CPU page table for a range of virtual addresses
> - * @vma: virtual memory area containing the virtual address range
> - * @range: used to track snapshot validity
> - * @start: range virtual start address (inclusive)
> - * @end: range virtual end address (exclusive)
> - * @entries: array of hmm_pfn_t: provided by the caller, filled in by function
> + * @range: range being snapshoted and all needed informations

Let's just say this:

* @range: range being snapshotted


<snip>

> @@ -628,11 +617,7 @@ EXPORT_SYMBOL(hmm_vma_range_done);
>  
>  /*
>   * hmm_vma_fault() - try to fault some address in a virtual address range
> - * @vma: virtual memory area containing the virtual address range
> - * @range: use to track pfns array content validity
> - * @start: fault range virtual start address (inclusive)
> - * @end: fault range virtual end address (exclusive)
> - * @pfns: array of hmm_pfn_t, only entry with fault flag set will be faulted
> + * @range: range being faulted and all needed informations

Similarly here, let's just write it like this:

* @range: range being faulted
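
One more thought while I'm here: for callers, the reshaped API ends up
looking roughly like this (a sketch only; I'm assuming the vma pointer
moves into struct hmm_range alongside the start/end/pfns fields, per the
commit text):

    struct hmm_range range = {
            .vma   = vma,
            .start = start,
            .end   = end,
            .pfns  = pfns,
    };

    ret = hmm_vma_fault(&range, write, block);  /* was seven parameters */

...so adding a parameter later really does become just adding a field.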


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 06/14] mm/hmm: remove HMM_PFN_READ flag and ignore peculiar architecture
  2018-03-16 19:14   ` jglisse
@ 2018-03-17  3:30     ` John Hubbard
  -1 siblings, 0 replies; 55+ messages in thread
From: John Hubbard @ 2018-03-17  3:30 UTC (permalink / raw)
  To: jglisse, linux-mm
  Cc: Andrew Morton, linux-kernel, Evgeny Baskakov, Ralph Campbell,
	Mark Hairgrove

On 03/16/2018 12:14 PM, jglisse@redhat.com wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> Only peculiar architectures allow write without read, thus assume that
> any valid pfn also allows read. Note we do not care about write-only
> because it does not make sense with things like atomic compare and exchange
> or any other operation that lets you read the memory value through
> them.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Mark Hairgrove <mhairgrove@nvidia.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> ---
>  include/linux/hmm.h | 14 ++++++--------
>  mm/hmm.c            | 28 ++++++++++++++++++++++++----
>  2 files changed, 30 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index b65e527dd120..4bdc58ffe9f3 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -84,7 +84,6 @@ struct hmm;
>   *
>   * Flags:
>   * HMM_PFN_VALID: pfn is valid

Maybe write it like this:

* HMM_PFN_VALID: pfn is valid. This implies that it has, at least, read permission.

> - * HMM_PFN_READ:  CPU page table has read permission set
>   * HMM_PFN_WRITE: CPU page table has write permission set
>   * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned memory
>   * HMM_PFN_EMPTY: corresponding CPU page table entry is pte_none()
> @@ -97,13 +96,12 @@ struct hmm;
>  typedef unsigned long hmm_pfn_t;
>  
>  #define HMM_PFN_VALID (1 << 0)

<snip>

>  
> @@ -536,6 +534,17 @@ int hmm_vma_get_pfns(struct hmm_range *range)
>  	list_add_rcu(&range->list, &hmm->ranges);
>  	spin_unlock(&hmm->lock);
>  
> +	if (!(vma->vm_flags & VM_READ)) {
> +		/*
> +		 * If vma do not allow read assume it does not allow write as
> +		 * only peculiar architecture allow write without read and this
> +		 * is not a case we care about (some operation like atomic no
> +		 * longer make sense).
> +		 */
> +		hmm_pfns_clear(range->pfns, range->start, range->end);
> +		return 0;

1. Shouldn't we return an error here? All is not well. No one has any pfns, even
   though they tried to get some. :)

2. I think this check needs to be done much earlier, right after the "Sanity
   check, this should not happen" code in this routine.

> +	}
> +
>  	hmm_vma_walk.fault = false;
>  	hmm_vma_walk.range = range;
>  	mm_walk.private = &hmm_vma_walk;
> @@ -690,6 +699,17 @@ int hmm_vma_fault(struct hmm_range *range, bool write, bool block)
>  	list_add_rcu(&range->list, &hmm->ranges);
>  	spin_unlock(&hmm->lock);
>  
> +	if (!(vma->vm_flags & VM_READ)) {
> +		/*
> +		 * If vma do not allow read assume it does not allow write as
> +		 * only peculiar architecture allow write without read and this
> +		 * is not a case we care about (some operation like atomic no
> +		 * longer make sense).
> +		 */

For the comment wording (for this one, and the one above), how about:

/*
 * If the vma does not allow read access, then assume that 
 * it does not allow write access, either.
 */

...and then leave the more extensive explanation to the commit log. Or,
if we really want a longer explanation right here, then:

/*
 * If the vma does not allow read access, then assume that 
 * it does not allow write access, either. Architectures that
 * allow write without read access are not supported by HMM,
 * because operations such as atomic access would not work.
 */


> +		hmm_pfns_clear(range->pfns, range->start, range->end);
> +		return 0;
> +	}

Similar points as above: it seems like an error case, and the check should be right near 
the beginning of the function.
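
Concretely, I'm thinking of something along these lines (a sketch only;
-EPERM is just a placeholder for whatever error code gets picked), placed
right after the existing sanity checks:

    if (!(vma->vm_flags & VM_READ)) {
            /*
             * If the vma does not allow read access, then assume that
             * it does not allow write access, either.
             */
            hmm_pfns_clear(range->pfns, range->start, range->end);
            return -EPERM;
    }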

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 03/14] mm/hmm: HMM should have a callback before MM is destroyed v2
  2018-03-17  2:36     ` John Hubbard
@ 2018-03-17  3:47       ` John Hubbard
  -1 siblings, 0 replies; 55+ messages in thread
From: John Hubbard @ 2018-03-17  3:47 UTC (permalink / raw)
  To: jglisse, linux-mm
  Cc: Andrew Morton, linux-kernel, Ralph Campbell, stable,
	Evgeny Baskakov, Mark Hairgrove

On 03/16/2018 07:36 PM, John Hubbard wrote:
> On 03/16/2018 12:14 PM, jglisse@redhat.com wrote:
>> From: Ralph Campbell <rcampbell@nvidia.com>
>>
> 
> <snip>
> 
>> +static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>> +{
>> +	struct hmm *hmm = mm->hmm;
>> +	struct hmm_mirror *mirror;
>> +	struct hmm_mirror *mirror_next;
>> +
>> +	down_write(&hmm->mirrors_sem);
>> +	list_for_each_entry_safe(mirror, mirror_next, &hmm->mirrors, list) {
>> +		list_del_init(&mirror->list);
>> +		if (mirror->ops->release)
>> +			mirror->ops->release(mirror);
>> +	}
>> +	up_write(&hmm->mirrors_sem);
>> +}
>> +
> 
> OK, as for actual code review:
> 
> This part of the locking looks good. However, I think it can race against
> hmm_mirror_register(), because hmm_mirror_register() will just add a new 
> mirror regardless.
> 
> So:
> 
> thread 1                                      thread 2
> --------------                                -----------------
> hmm_release                                   hmm_mirror_register 
>     down_write(&hmm->mirrors_sem);                <blocked: waiting for sem>
>         // deletes all list items
>     up_write
>                                                   unblocked: adds new mirror
>                                               
> 
> ...so I think we need a way to back out of any pending hmm_mirror_register()
> calls, as part of the .release steps, right? It seems hard for the device driver,
> which could be inside of hmm_mirror_register(), to handle that. Especially considering
> that right now, hmm_mirror_register() will return success in this case--so
> there is no indication that anything is wrong.
> 
> Maybe hmm_mirror_register() could return an error (and not add to the mirror list),
> in such a situation, how's that sound?
> 

In other words, I think this would help (not tested yet beyond a quick compile,
but it's pretty simple):

diff --git a/mm/hmm.c b/mm/hmm.c
index 7ccca5478ea1..da39f8522dca 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -66,6 +66,7 @@ struct hmm {
        struct list_head        mirrors;
        struct mmu_notifier     mmu_notifier;
        struct rw_semaphore     mirrors_sem;
+       bool                    shutting_down;
 };
 
 /*
@@ -99,6 +100,7 @@ static struct hmm *hmm_register(struct mm_struct *mm)
        INIT_LIST_HEAD(&hmm->ranges);
        spin_lock_init(&hmm->lock);
        hmm->mm = mm;
+       hmm->shutting_down = false;
 
        /*
         * We should only get here if hold the mmap_sem in write mode ie on
@@ -167,6 +169,7 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
        struct hmm_mirror *mirror_next;
 
        down_write(&hmm->mirrors_sem);
+       hmm->shutting_down = true;
        list_for_each_entry_safe(mirror, mirror_next, &hmm->mirrors, list) {
                list_del_init(&mirror->list);
                if (mirror->ops->release)
@@ -227,6 +230,10 @@ int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
                return -ENOMEM;
 
        down_write(&mirror->hmm->mirrors_sem);
+       if (mirror->hmm->shutting_down) {
+               up_write(&mirror->hmm->mirrors_sem);
+               return -ESRCH;
+       }
        list_add(&mirror->list, &mirror->hmm->mirrors);
        up_write(&mirror->hmm->mirrors_sem);
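
One note on the sketch above: a driver that calls hmm_mirror_register()
while the process is exiting would now get -ESRCH back and can simply
abandon the registration, instead of being silently added to a list that
hmm_release() has already walked.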


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 07/14] mm/hmm: use uint64_t for HMM pfn instead of defining hmm_pfn_t to ulong
  2018-03-16 19:14   ` jglisse
@ 2018-03-17  3:59     ` John Hubbard
  -1 siblings, 0 replies; 55+ messages in thread
From: John Hubbard @ 2018-03-17  3:59 UTC (permalink / raw)
  To: jglisse, linux-mm
  Cc: Andrew Morton, linux-kernel, Evgeny Baskakov, Ralph Campbell,
	Mark Hairgrove

On 03/16/2018 12:14 PM, jglisse@redhat.com wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 

Hi Jerome,

This one looks great. A couple of trivial typo fixes are listed below.

You can add:

Reviewed-by: John Hubbard <jhubbard@nvidia.com>

> All device driver we care about are using 64bits page table entry. In
> order to match this and to avoid useless define convert all HMM pfn to
> directly use uint64_t. It is a first step on the road to allow driver
> to directly use pfn value return by HMM (saving memory and CPU cycles
> use for convertion between the two).

  used for conversion
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Mark Hairgrove <mhairgrove@nvidia.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> ---
>  include/linux/hmm.h | 46 +++++++++++++++++++++-------------------------
>  mm/hmm.c            | 26 +++++++++++++-------------
>  2 files changed, 34 insertions(+), 38 deletions(-)
> 

<snip>

> @@ -104,14 +100,14 @@ typedef unsigned long hmm_pfn_t;
>  #define HMM_PFN_SHIFT 6
>  
>  /*
> - * hmm_pfn_t_to_page() - return struct page pointed to by a valid hmm_pfn_t
> - * @pfn: hmm_pfn_t to convert to struct page
> - * Returns: struct page pointer if pfn is a valid hmm_pfn_t, NULL otherwise
> + * hmm_pfn_to_page() - return struct page pointed to by a valid HMM pfn
> + * @pfn: HMM pfn value to get corresponding struct page from
> + * Returns: struct page pointer if pfn is a valid HMM pfn, NULL otherwise
>   *
> - * If the hmm_pfn_t is valid (ie valid flag set) then return the struct page
> - * matching the pfn value stored in the hmm_pfn_t. Otherwise return NULL.
> + * If the uint64_t is valid (ie valid flag set) then return the struct page

      If the HMM pfn is valid

<snip>

>  
> @@ -634,8 +634,8 @@ EXPORT_SYMBOL(hmm_vma_range_done);
>   * This is similar to a regular CPU page fault except that it will not trigger
>   * any memory migration if the memory being faulted is not accessible by CPUs.
>   *
> - * On error, for one virtual address in the range, the function will set the
> - * hmm_pfn_t error flag for the corresponding pfn entry.
> + * On error, for one virtual address in the range, the function will mark the
> + * correspond HMM pfn entry with error flag.

      corresponding HMM pfn entry with an error flag.
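
To illustrate the intent (my own sketch, not from the patch; the DEV_*
names are made up): a driver whose own page table entries are 64 bits
can now mask and shift the HMM pfn value in place, with no intermediate
hmm_pfn_t round trip:

    uint64_t entry = range->pfns[i];
    uint64_t dev_pte = ((entry >> HMM_PFN_SHIFT) << DEV_PTE_PFN_SHIFT) |
                       DEV_PTE_VALID;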

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/14] mm/hmm: cleanup special vma handling (VM_SPECIAL)
  2018-03-16 19:14   ` jglisse
@ 2018-03-17  4:35     ` John Hubbard
  -1 siblings, 0 replies; 55+ messages in thread
From: John Hubbard @ 2018-03-17  4:35 UTC (permalink / raw)
  To: jglisse, linux-mm
  Cc: Andrew Morton, linux-kernel, Evgeny Baskakov, Ralph Campbell,
	Mark Hairgrove

On 03/16/2018 12:14 PM, jglisse@redhat.com wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> Special vmas (ones with any of the VM_SPECIAL flags) can not be accessed by
> a device because there is no consistent model across device drivers for
> those vmas and their backing memory.
> 
> This patch directly uses the hmm_range struct as the hmm_pfns_special()
> argument, as it always affects the whole vma and thus the whole range.
> 
> It also makes behavior consistent: after this patch, both hmm_vma_fault()
> and hmm_vma_get_pfns() return -EINVAL when facing such a vma. Previously
> hmm_vma_fault() returned 0 and hmm_vma_get_pfns() returned -EINVAL, but
> both filled the HMM pfn array with the special entry.
> 

Hi Jerome,

This seems correct. 

<snip>

> @@ -486,6 +478,14 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>  	return 0;
>  }
>  
> +static void hmm_pfns_special(struct hmm_range *range)
> +{
> +	unsigned long addr = range->start, i = 0;
> +
> +	for (; addr < range->end; addr += PAGE_SIZE, i++)
> +		range->pfns[i] = HMM_PFN_SPECIAL;
> +}

Silly nit: the above would read more naturally, like this:

	unsigned long addr, i = 0;

	for (addr = range->start; addr < range->end; addr += PAGE_SIZE, i++)
		range->pfns[i] = HMM_PFN_SPECIAL;

Either way,

Reviewed-by: John Hubbard <jhubbard@nvidia.com>

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 03/14] mm/hmm: HMM should have a callback before MM is destroyed v2
  2018-03-17  3:47       ` John Hubbard
@ 2018-03-17  4:39         ` John Hubbard
  -1 siblings, 0 replies; 55+ messages in thread
From: John Hubbard @ 2018-03-17  4:39 UTC (permalink / raw)
  To: jglisse, linux-mm
  Cc: Andrew Morton, linux-kernel, Ralph Campbell, stable,
	Evgeny Baskakov, Mark Hairgrove

On 03/16/2018 08:47 PM, John Hubbard wrote:
> On 03/16/2018 07:36 PM, John Hubbard wrote:
>> On 03/16/2018 12:14 PM, jglisse@redhat.com wrote:
>>> From: Ralph Campbell <rcampbell@nvidia.com>
>>>
>>
>> <snip>
>>
>>> +static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>>> +{
>>> +	struct hmm *hmm = mm->hmm;
>>> +	struct hmm_mirror *mirror;
>>> +	struct hmm_mirror *mirror_next;
>>> +
>>> +	down_write(&hmm->mirrors_sem);
>>> +	list_for_each_entry_safe(mirror, mirror_next, &hmm->mirrors, list) {
>>> +		list_del_init(&mirror->list);
>>> +		if (mirror->ops->release)
>>> +			mirror->ops->release(mirror);
>>> +	}
>>> +	up_write(&hmm->mirrors_sem);
>>> +}
>>> +
>>
>> OK, as for actual code review:
>>
>> This part of the locking looks good. However, I think it can race against
>> hmm_mirror_register(), because hmm_mirror_register() will just add a new 
>> mirror regardless.
>>
>> So:
>>
>> thread 1                                      thread 2
>> --------------                                -----------------
>> hmm_release                                   hmm_mirror_register 
>>     down_write(&hmm->mirrors_sem);                <blocked: waiting for sem>
>>         // deletes all list items
>>     up_write
>>                                                   unblocked: adds new mirror
>>                                               
>>

Mark Hairgrove just pointed out some more fun facts:

1. Because hmm_mirror_register() needs to be called with an mm that has a non-zero
refcount, you generally cannot get an hmm_release callback, so the above race should
not happen.

2. We looked around, and the code is missing a call to mmu_notifier_unregister().
That means that it is going to leak memory and not let the mm get released either.

Maybe having each mirror have its own mmu notifier callback is a possible way
to solve this.
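
For reference, the missing piece would be along these lines (a sketch
only; deciding where this can safely run is the hard part, given the
ordering constraints around notifier teardown):

    /* wherever the struct hmm itself is torn down: */
    mmu_notifier_unregister(&hmm->mmu_notifier, hmm->mm);
    kfree(hmm);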

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 09/14] mm/hmm: do not differentiate between empty entry or missing directory
  2018-03-16 19:14   ` jglisse
@ 2018-03-19 23:06     ` John Hubbard
  -1 siblings, 0 replies; 55+ messages in thread
From: John Hubbard @ 2018-03-19 23:06 UTC (permalink / raw)
  To: jglisse, linux-mm
  Cc: Andrew Morton, linux-kernel, Evgeny Baskakov, Ralph Campbell,
	Mark Hairgrove

On 03/16/2018 12:14 PM, jglisse@redhat.com wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> There is no point in differentiating between a range for which there
> is not even a directory (and thus no entries) and an empty entry (pte_none()
> or pmd_none() returns true).
> 
> Simply drop the distinction, i.e. remove the HMM_PFN_EMPTY flag and merge the
> now-duplicate hmm_vma_walk_hole() and hmm_vma_walk_clear() functions.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Mark Hairgrove <mhairgrove@nvidia.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> ---
>  include/linux/hmm.h |  8 +++-----
>  mm/hmm.c            | 45 +++++++++++++++------------------------------
>  2 files changed, 18 insertions(+), 35 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 78b3ed6d7977..6d2b6bf6da4b 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -84,7 +84,6 @@ struct hmm;
>   * HMM_PFN_VALID: pfn is valid
>   * HMM_PFN_WRITE: CPU page table has write permission set
>   * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned memory
> - * HMM_PFN_EMPTY: corresponding CPU page table entry is pte_none()
>   * HMM_PFN_SPECIAL: corresponding CPU page table entry is special; i.e., the
>   *      result of vm_insert_pfn() or vm_insert_page(). Therefore, it should not
>   *      be mirrored by a device, because the entry will never have HMM_PFN_VALID
> @@ -94,10 +93,9 @@ struct hmm;
>  #define HMM_PFN_VALID (1 << 0)
>  #define HMM_PFN_WRITE (1 << 1)
>  #define HMM_PFN_ERROR (1 << 2)
> -#define HMM_PFN_EMPTY (1 << 3)
> -#define HMM_PFN_SPECIAL (1 << 4)
> -#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 5)
> -#define HMM_PFN_SHIFT 6
> +#define HMM_PFN_SPECIAL (1 << 3)
> +#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 4)
> +#define HMM_PFN_SHIFT 5
>  
>  /*
>   * hmm_pfn_to_page() - return struct page pointed to by a valid HMM pfn
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 04595a994542..2118e42cb838 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -305,6 +305,16 @@ static void hmm_pfns_clear(uint64_t *pfns,
>  		*pfns = 0;
>  }
>  
> +/*
> + * hmm_vma_walk_hole() - handle a range back by no pmd or no pte


Maybe write it like this:

 * hmm_vma_walk_hole() - handle a range that is not backed by any pmd or pte
 

> + * @start: range virtual start address (inclusive)
> + * @end: range virtual end address (exclusive)
> + * @walk: mm_walk structure
> + * Returns: 0 on success, -EAGAIN after page fault, or page fault error
> + *
> + * This is an helper call whenever pmd_none() or pte_none() returns true
> + * or when there is no directory covering the range.

Instead of those two lines, how about:

 * This routine will be called whenever pmd_none() or pte_none() returns
 * true, or whenever there is no page directory covering the VA range.


> + */
>  static int hmm_vma_walk_hole(unsigned long addr,
>  			     unsigned long end,
>  			     struct mm_walk *walk)
> @@ -314,31 +324,6 @@ static int hmm_vma_walk_hole(unsigned long addr,
>  	uint64_t *pfns = range->pfns;
>  	unsigned long i;
>  
> -	hmm_vma_walk->last = addr;
> -	i = (addr - range->start) >> PAGE_SHIFT;
> -	for (; addr < end; addr += PAGE_SIZE, i++) {
> -		pfns[i] = HMM_PFN_EMPTY;
> -		if (hmm_vma_walk->fault) {
> -			int ret;
> -
> -			ret = hmm_vma_do_fault(walk, addr, &pfns[i]);
> -			if (ret != -EAGAIN)
> -				return ret;
> -		}
> -	}
> -
> -	return hmm_vma_walk->fault ? -EAGAIN : 0;
> -}
> -
> -static int hmm_vma_walk_clear(unsigned long addr,
> -			      unsigned long end,
> -			      struct mm_walk *walk)
> -{
> -	struct hmm_vma_walk *hmm_vma_walk = walk->private;
> -	struct hmm_range *range = hmm_vma_walk->range;
> -	uint64_t *pfns = range->pfns;
> -	unsigned long i;
> -

Nice consolidation!

>  	hmm_vma_walk->last = addr;
>  	i = (addr - range->start) >> PAGE_SHIFT;
>  	for (; addr < end; addr += PAGE_SIZE, i++) {
> @@ -397,10 +382,10 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>  		if (!pmd_devmap(pmd) && !pmd_trans_huge(pmd))
>  			goto again;
>  		if (pmd_protnone(pmd))
> -			return hmm_vma_walk_clear(start, end, walk);
> +			return hmm_vma_walk_hole(start, end, walk);
>  
>  		if (write_fault && !pmd_write(pmd))
> -			return hmm_vma_walk_clear(start, end, walk);
> +			return hmm_vma_walk_hole(start, end, walk);
>  
>  		pfn = pmd_pfn(pmd) + pte_index(addr);
>  		flag |= pmd_write(pmd) ? HMM_PFN_WRITE : 0;
> @@ -419,7 +404,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>  		pfns[i] = 0;
>  
>  		if (pte_none(pte)) {
> -			pfns[i] = HMM_PFN_EMPTY;
> +			pfns[i] = 0;

Why is this being set to zero? (0 == HMM_PFN_VALID, btw.)
I would have expected HMM_PFN_NONE. Actually, looking through the 
larger patchset, I think there are a couple of big questions about
these HMM_PFN_* flags. Maybe it's just that the comment documentation
has fallen completely behind, but it looks like there is an actual
problem in the code:

1. HMM_PFN_* used to be bit shifts, so setting them directly sometimes
worked. But now they are enum values, so that doesn't work anymore.
Yet they are still being set directly in places: the enum is being
treated as a flag, probably incorrectly.

Previously: 

#define HMM_PFN_VALID (1 << 0)
#define HMM_PFN_WRITE (1 << 1)
#define HMM_PFN_ERROR (1 << 2)
#define HMM_PFN_EMPTY (1 << 3)
...

New:

enum hmm_pfn_flag_e {
	HMM_PFN_VALID = 0,
	HMM_PFN_WRITE,
	HMM_PFN_ERROR,
	HMM_PFN_NONE,
...

Yet we still have, for example:

    pfns = HMM_PFN_ERROR;

This might be accidentally working, because HMM_PFN_ERROR
has a value of 2, so only one bit is set, but...yikes.

2. The hmm_range.flags variable is a uint64_t* (pointer). And then
the patchset uses the HMM_PFN_* enum to *index* into that as an 
array. Something is highly suspect here, because...an array that is
indexed by HMM_PFN_* enums? It's certainly not documented that way.

Examples:
    range->flags[HMM_PFN_ERROR]
 
    range->flags[HMM_PFN_VALID] 

I'll go through and try to point these out right next to the relevant
parts of the patchset, but because I'm taking a little longer than 
I'd hoped to review this, I figured it's best to alert you earlier, as
soon as I spot something.
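
To make the aliasing in point 1 concrete (enum values taken from the
quote above, the usage is mine):

    enum hmm_pfn_flag_e { HMM_PFN_VALID = 0, HMM_PFN_WRITE,
                          HMM_PFN_ERROR, HMM_PFN_NONE };

    uint64_t entry = HMM_PFN_ERROR;  /* == 2, which is (1 << 1) */
    /* under the old bit-shift scheme the value 2 meant HMM_PFN_WRITE,
     * and HMM_PFN_VALID (== 0) can no longer be tested with '&' at all */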

>  			if (hmm_vma_walk->fault)
>  				goto fault;
>  			continue;
> @@ -470,8 +455,8 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>  
>  fault:
>  		pte_unmap(ptep);
> -		/* Fault all pages in range */
> -		return hmm_vma_walk_clear(start, end, walk);
> +		/* Fault all pages in range if ask for */

Maybe this, instead (assuming I understand this correctly):

                /*
                 * Fault in each page in the range, if that page was
                 * requested.
                 */

thanks,
-- 
John Hubbard
NVIDIA

> +		return hmm_vma_walk_hole(start, end, walk);
>  	}
>  	pte_unmap(ptep - 1);
>  
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 09/14] mm/hmm: do not differentiate between empty entry or missing directory
@ 2018-03-19 23:06     ` John Hubbard
  0 siblings, 0 replies; 55+ messages in thread
From: John Hubbard @ 2018-03-19 23:06 UTC (permalink / raw)
  To: jglisse, linux-mm
  Cc: Andrew Morton, linux-kernel, Evgeny Baskakov, Ralph Campbell,
	Mark Hairgrove

On 03/16/2018 12:14 PM, jglisse@redhat.com wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> There is no point in differentiating between a range for which there
> is not even a directory (and thus entries) and empty entry (pte_none()
> or pmd_none() returns true).
> 
> Simply drop the distinction ie remove HMM_PFN_EMPTY flag and merge now
> duplicate hmm_vma_walk_hole() and hmm_vma_walk_clear() functions.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Mark Hairgrove <mhairgrove@nvidia.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> ---
>  include/linux/hmm.h |  8 +++-----
>  mm/hmm.c            | 45 +++++++++++++++------------------------------
>  2 files changed, 18 insertions(+), 35 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 78b3ed6d7977..6d2b6bf6da4b 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -84,7 +84,6 @@ struct hmm;
>   * HMM_PFN_VALID: pfn is valid
>   * HMM_PFN_WRITE: CPU page table has write permission set
>   * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned memory
> - * HMM_PFN_EMPTY: corresponding CPU page table entry is pte_none()
>   * HMM_PFN_SPECIAL: corresponding CPU page table entry is special; i.e., the
>   *      result of vm_insert_pfn() or vm_insert_page(). Therefore, it should not
>   *      be mirrored by a device, because the entry will never have HMM_PFN_VALID
> @@ -94,10 +93,9 @@ struct hmm;
>  #define HMM_PFN_VALID (1 << 0)
>  #define HMM_PFN_WRITE (1 << 1)
>  #define HMM_PFN_ERROR (1 << 2)
> -#define HMM_PFN_EMPTY (1 << 3)
> -#define HMM_PFN_SPECIAL (1 << 4)
> -#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 5)
> -#define HMM_PFN_SHIFT 6
> +#define HMM_PFN_SPECIAL (1 << 3)
> +#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 4)
> +#define HMM_PFN_SHIFT 5
>  
>  /*
>   * hmm_pfn_to_page() - return struct page pointed to by a valid HMM pfn
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 04595a994542..2118e42cb838 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -305,6 +305,16 @@ static void hmm_pfns_clear(uint64_t *pfns,
>  		*pfns = 0;
>  }
>  
> +/*
> + * hmm_vma_walk_hole() - handle a range back by no pmd or no pte


Maybe write it like this:

 * hmm_vma_walk_hole() - handle a range that is not backed by any pmd or pte
 

> + * @start: range virtual start address (inclusive)
> + * @end: range virtual end address (exclusive)
> + * @walk: mm_walk structure
> + * Returns: 0 on success, -EAGAIN after page fault, or page fault error
> + *
> + * This is an helper call whenever pmd_none() or pte_none() returns true
> + * or when there is no directory covering the range.

Instead of those two lines, how about:

 * This routine will be called whenever pmd_none() or pte_none() returns
 * true, or whenever there is no page directory covering the VA range.


> + */
>  static int hmm_vma_walk_hole(unsigned long addr,
>  			     unsigned long end,
>  			     struct mm_walk *walk)
> @@ -314,31 +324,6 @@ static int hmm_vma_walk_hole(unsigned long addr,
>  	uint64_t *pfns = range->pfns;
>  	unsigned long i;
>  
> -	hmm_vma_walk->last = addr;
> -	i = (addr - range->start) >> PAGE_SHIFT;
> -	for (; addr < end; addr += PAGE_SIZE, i++) {
> -		pfns[i] = HMM_PFN_EMPTY;
> -		if (hmm_vma_walk->fault) {
> -			int ret;
> -
> -			ret = hmm_vma_do_fault(walk, addr, &pfns[i]);
> -			if (ret != -EAGAIN)
> -				return ret;
> -		}
> -	}
> -
> -	return hmm_vma_walk->fault ? -EAGAIN : 0;
> -}
> -
> -static int hmm_vma_walk_clear(unsigned long addr,
> -			      unsigned long end,
> -			      struct mm_walk *walk)
> -{
> -	struct hmm_vma_walk *hmm_vma_walk = walk->private;
> -	struct hmm_range *range = hmm_vma_walk->range;
> -	uint64_t *pfns = range->pfns;
> -	unsigned long i;
> -

Nice consolidation!
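
For reference, I believe the surviving helper ends up looking roughly
like this (reconstructed from the removed lines and the context below,
so the exact hole value is an assumption on my part, not verbatim from
the patch):

static int hmm_vma_walk_hole(unsigned long addr,
			     unsigned long end,
			     struct mm_walk *walk)
{
	struct hmm_vma_walk *hmm_vma_walk = walk->private;
	struct hmm_range *range = hmm_vma_walk->range;
	uint64_t *pfns = range->pfns;
	unsigned long i;

	hmm_vma_walk->last = addr;
	i = (addr - range->start) >> PAGE_SHIFT;
	for (; addr < end; addr += PAGE_SIZE, i++) {
		/* Holes now reported as 0 (see the pte_none() hunk below). */
		pfns[i] = 0;
		if (hmm_vma_walk->fault) {
			int ret;

			ret = hmm_vma_do_fault(walk, addr, &pfns[i]);
			if (ret != -EAGAIN)
				return ret;
		}
	}

	return hmm_vma_walk->fault ? -EAGAIN : 0;
}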

>  	hmm_vma_walk->last = addr;
>  	i = (addr - range->start) >> PAGE_SHIFT;
>  	for (; addr < end; addr += PAGE_SIZE, i++) {
> @@ -397,10 +382,10 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>  		if (!pmd_devmap(pmd) && !pmd_trans_huge(pmd))
>  			goto again;
>  		if (pmd_protnone(pmd))
> -			return hmm_vma_walk_clear(start, end, walk);
> +			return hmm_vma_walk_hole(start, end, walk);
>  
>  		if (write_fault && !pmd_write(pmd))
> -			return hmm_vma_walk_clear(start, end, walk);
> +			return hmm_vma_walk_hole(start, end, walk);
>  
>  		pfn = pmd_pfn(pmd) + pte_index(addr);
>  		flag |= pmd_write(pmd) ? HMM_PFN_WRITE : 0;
> @@ -419,7 +404,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>  		pfns[i] = 0;
>  
>  		if (pte_none(pte)) {
> -			pfns[i] = HMM_PFN_EMPTY;
> +			pfns[i] = 0;

Why is this being set to zero? (0 == HMM_PFN_VALID, btw.)
I would have expected HMM_PFN_NONE. Actually, looking through the 
larger patchset, I think there are a couple of big questions about
these HMM_PFN_* flags. Maybe it's just that the comment documentation
has fallen completely behind, but it looks like there is an actual
problem in the code:

1. HMM_PFN_* used to be single-bit masks, so assigning them directly
sometimes worked. But now they are consecutive enum values, so that no
longer works. Yet they are still being assigned directly in places: the
enum is being treated as a bit flag, probably incorrectly.

Previously: 

#define HMM_PFN_VALID (1 << 0)
#define HMM_PFN_WRITE (1 << 1)
#define HMM_PFN_ERROR (1 << 2)
#define HMM_PFN_EMPTY (1 << 3)
...

New:

enum hmm_pfn_flag_e {
	HMM_PFN_VALID = 0,
	HMM_PFN_WRITE,
	HMM_PFN_ERROR,
	HMM_PFN_NONE,
...

Yet we still have, for example:

    pfns = HMM_PFN_ERROR;

This might be accidentally working, because HMM_PFN_ERROR
has a value of 2, so only one bit is set, but...yikes.
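
To make the hazard concrete, here is a toy comparison (the numeric
values follow from the two definitions above; this is my illustration,
not code from the patchset):

/* Old scheme: each name is a dedicated bit, so direct assignment
 * stores a real flag:
 */
pfns[i] = HMM_PFN_ERROR;	/* stores 4 (1 << 2): the ERROR bit */

/* New scheme: the names are consecutive integers (VALID == 0,
 * WRITE == 1, ERROR == 2, ...), so the same statement now stores the
 * plain integer 2, which happens to be the *old* WRITE bit pattern:
 */
pfns[i] = HMM_PFN_ERROR;	/* stores 2: one bit set by luck, wrong meaning */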

2. The hmm_range.flags member is a uint64_t * (a pointer), and the
patchset then uses the HMM_PFN_* enum values to *index* into it as an
array. Something is highly suspect here: an array indexed by HMM_PFN_*
enums? It's certainly not documented that way.

Examples:
    range->flags[HMM_PFN_ERROR]
 
    range->flags[HMM_PFN_VALID] 
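
If the enum really is meant to be an index, then presumably the
intended pattern is an indirection along these lines (my guess at the
reading; the patchset doesn't spell it out):

/* Hypothetical: the driver supplies its own bit for each meaning,
 * and HMM core looks that bit up by enum index:
 */
if (pfns[i] & range->flags[HMM_PFN_VALID])
	; /* page is valid, in the driver's own encoding */

pfns[i] |= range->flags[HMM_PFN_ERROR];	/* mark error, driver encoding */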

I'll go through and try to point these out right next to the relevant
parts of the patchset, but since this review is taking a little longer
than I'd hoped, I figured it's best to alert you early, as soon as I
spot something.

>  			if (hmm_vma_walk->fault)
>  				goto fault;
>  			continue;
> @@ -470,8 +455,8 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>  
>  fault:
>  		pte_unmap(ptep);
> -		/* Fault all pages in range */
> -		return hmm_vma_walk_clear(start, end, walk);
> +		/* Fault all pages in range if ask for */

Maybe this, instead (assuming I understand this correctly):

                /*
                 * Fault in each page in the range, if that page was
                 * requested.
                 */

thanks,
-- 
John Hubbard
NVIDIA

> +		return hmm_vma_walk_hole(start, end, walk);
>  	}
>  	pte_unmap(ptep - 1);
>  
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 09/14] mm/hmm: do not differentiate between empty entry or missing directory
  2018-03-19 23:06     ` John Hubbard
@ 2018-03-20  2:08       ` Jerome Glisse
  -1 siblings, 0 replies; 55+ messages in thread
From: Jerome Glisse @ 2018-03-20  2:08 UTC (permalink / raw)
  To: John Hubbard
  Cc: linux-mm, Andrew Morton, linux-kernel, Evgeny Baskakov,
	Ralph Campbell, Mark Hairgrove

On Mon, Mar 19, 2018 at 04:06:11PM -0700, John Hubbard wrote:

[...]

> > @@ -419,7 +404,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
> >  		pfns[i] = 0;
> >  
> >  		if (pte_none(pte)) {
> > -			pfns[i] = HMM_PFN_EMPTY;
> > +			pfns[i] = 0;
> 
> Why is this being set to zero? (0 == HMM_PFN_VALID, btw.)
> I would have expected HMM_PFN_NONE. Actually, looking through the 
> larger patchset, I think there are a couple of big questions about
> these HMM_PFN_* flags. Maybe it's just that the comment documentation
> has fallen completely behind, but it looks like there is an actual
> problem in the code:
> 
> 1. HMM_PFN_* used to be single-bit masks, so assigning them directly
> sometimes worked. But now they are consecutive enum values, so that no
> longer works. Yet they are still being assigned directly in places: the
> enum is being treated as a bit flag, probably incorrectly.
> 
> Previously: 
> 
> #define HMM_PFN_VALID (1 << 0)
> #define HMM_PFN_WRITE (1 << 1)
> #define HMM_PFN_ERROR (1 << 2)
> #define HMM_PFN_EMPTY (1 << 3)
> ...
> 
> New:
> 
> enum hmm_pfn_flag_e {
> 	HMM_PFN_VALID = 0,
> 	HMM_PFN_WRITE,
> 	HMM_PFN_ERROR,
> 	HMM_PFN_NONE,
> ...
> 
> Yet we still have, for example:
> 
>     pfns = HMM_PFN_ERROR;
> 
> This might be accidentally working, because HMM_PFN_ERROR
> has a value of 2, so only one bit is set, but...yikes.
> 
> 2. The hmm_range.flags member is a uint64_t * (a pointer), and the
> patchset then uses the HMM_PFN_* enum values to *index* into it as an
> array. Something is highly suspect here: an array indexed by HMM_PFN_*
> enums? It's certainly not documented that way.
> 
> Examples:
>     range->flags[HMM_PFN_ERROR]
>  
>     range->flags[HMM_PFN_VALID] 
> 
> I'll go through and try to point these out right next to the relevant
> parts of the patchset, but since this review is taking a little longer
> than I'd hoped, I figured it's best to alert you early, as soon as I
> spot something.
> 

I added more comments in v3 to explain this in the last patch (15), and
I also split values and flags apart, hoping that makes it clearer. It
may help to look at how nouveau uses the NV_HMM_PAGE_FLAG* and
NV_HMM_PAGE_VALUE* macros [1] [2].

It is the same idea as the pgprot_t vm_page_prot field in the vma
struct, except that the encoding is per device driver, and hence not
something I can optimize away at build time for every possible user of
HMM. That is why I use an array provided by each device driver.
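
As a rough sketch of that indirection (the MYDRV_* names below are made
up for illustration, they are not the real nouveau macros):

/* Each driver describes which of its own pfn bits mean what by
 * filling an array indexed by the HMM_PFN_* enum:
 */
#define MYDRV_PFN_VALID	(1ULL << 63)
#define MYDRV_PFN_WRITE	(1ULL << 62)
#define MYDRV_PFN_ERROR	(1ULL << 61)

static uint64_t mydrv_hmm_flags[] = {
	[HMM_PFN_VALID] = MYDRV_PFN_VALID,
	[HMM_PFN_WRITE] = MYDRV_PFN_WRITE,
	[HMM_PFN_ERROR] = MYDRV_PFN_ERROR,
};

/* The driver hands its table to HMM... */
range.flags = mydrv_hmm_flags;

/* ...and HMM core then sets bits in the driver's own encoding: */
pfns[i] |= range->flags[HMM_PFN_WRITE];

This way each driver keeps its own pfn layout, and HMM never hardcodes
one.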

I hope this helps explain it. Note that this would be best discussed in
the review of the last patch, as this patch has nothing to do with that.

Cheers,
Jérôme

[1] https://cgit.freedesktop.org/~glisse/linux/commit/?h=nouveau-hmm&id=b5da479c212d1fe7b6734dd8e69045e23871fdc8
[2] https://cgit.freedesktop.org/~glisse/linux/commit/?h=nouveau-hmm&id=1993d1b09f5941e0fab80c0b485eef296119d393

^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2018-03-20  2:08 UTC | newest]

Thread overview: 55+ messages
2018-03-16 19:14 [PATCH 0/4] hmm: fixes and documentations v2 jglisse
2018-03-16 19:14 ` jglisse
2018-03-16 19:14 ` [PATCH 01/14] mm/hmm: documentation editorial update to HMM documentation jglisse
2018-03-16 19:14   ` jglisse
2018-03-16 19:14 ` [PATCH 02/14] mm/hmm: fix header file if/else/endif maze jglisse
2018-03-16 19:14   ` jglisse
2018-03-16 21:09   ` Andrew Morton
2018-03-16 21:18     ` Jerome Glisse
2018-03-16 21:18       ` Jerome Glisse
2018-03-16 21:18       ` Jerome Glisse
2018-03-16 21:35       ` Andrew Morton
2018-03-16 21:40         ` John Hubbard
2018-03-16 21:40           ` John Hubbard
2018-03-17  1:20   ` [PATCH 02/14] mm/hmm: fix header file if/else/endif maze v2 jglisse
2018-03-17  1:20     ` jglisse
2018-03-16 19:14 ` [PATCH 03/14] mm/hmm: HMM should have a callback before MM is destroyed v2 jglisse
2018-03-16 19:14   ` jglisse
2018-03-16 21:12   ` Andrew Morton
2018-03-16 21:26     ` Jerome Glisse
2018-03-16 21:26       ` Jerome Glisse
2018-03-16 21:26       ` Jerome Glisse
2018-03-16 21:37       ` John Hubbard
2018-03-16 21:37         ` John Hubbard
2018-03-17  2:36   ` John Hubbard
2018-03-17  2:36     ` John Hubbard
2018-03-17  3:47     ` John Hubbard
2018-03-17  3:47       ` John Hubbard
2018-03-17  4:39       ` John Hubbard
2018-03-17  4:39         ` John Hubbard
2018-03-16 19:14 ` [PATCH 04/14] mm/hmm: hmm_pfns_bad() was accessing wrong struct jglisse
2018-03-16 19:14   ` jglisse
2018-03-17  2:04   ` John Hubbard
2018-03-17  2:04     ` John Hubbard
2018-03-16 19:14 ` [PATCH 05/14] mm/hmm: use struct for hmm_vma_fault(), hmm_vma_get_pfns() parameters jglisse
2018-03-16 19:14   ` jglisse
2018-03-17  3:08   ` John Hubbard
2018-03-17  3:08     ` John Hubbard
2018-03-16 19:14 ` [PATCH 06/14] mm/hmm: remove HMM_PFN_READ flag and ignore peculiar architecture jglisse
2018-03-16 19:14   ` jglisse
2018-03-17  3:30   ` John Hubbard
2018-03-17  3:30     ` John Hubbard
2018-03-16 19:14 ` [PATCH 07/14] mm/hmm: use uint64_t for HMM pfn instead of defining hmm_pfn_t to ulong jglisse
2018-03-16 19:14   ` jglisse
2018-03-17  3:59   ` John Hubbard
2018-03-17  3:59     ` John Hubbard
2018-03-16 19:14 ` [PATCH 08/14] mm/hmm: cleanup special vma handling (VM_SPECIAL) jglisse
2018-03-16 19:14   ` jglisse
2018-03-17  4:35   ` John Hubbard
2018-03-17  4:35     ` John Hubbard
2018-03-16 19:14 ` [PATCH 09/14] mm/hmm: do not differentiate between empty entry or missing directory jglisse
2018-03-16 19:14   ` jglisse
2018-03-19 23:06   ` John Hubbard
2018-03-19 23:06     ` John Hubbard
2018-03-20  2:08     ` Jerome Glisse
2018-03-20  2:08       ` Jerome Glisse
