* [PATCH 00/15] hmm: fixes and documentations v3
From: jglisse @ 2018-03-20  2:00 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, linux-kernel, Jérôme Glisse

From: Jérôme Glisse <jglisse@redhat.com>

Added a patch to fix a zombie mm_struct (a missing call to mmu notifier
unregister); this was lost in translation at some point. Incorporated all
typo fixes and comments received so far (and fixed even more typos). Added
more comments. Updated individual patch versions to reflect the changes.

Below are the previous cover letters (everything in them is still true):

----------------------------------------------------------------------
cover letter for v2:

Removed a pointless VM_BUG_ON(), cced stable when appropriate, and split the
last patch into _many_ smaller patches to make it easier to review. The end
result is the same, modulo the comments I received so far and the extra
documentation I added while splitting things up. Below is the previous cover
letter (everything in it is still true):
----------------------------------------------------------------------
cover letter for v1:

All patches only impact HMM users; there are no implications outside HMM.

The first patch improves the documentation to better reflect what HMM is. The
second patch fixes the #if/#else placement in hmm.h. The third patch adds a
call on mm release which helps device drivers that use HMM to clean up early
when a process quits. Finally, the last patch modifies the CPU snapshot and
page fault helpers to simplify device drivers. The nouveau patchset I posted
last week already depends on all of those patches.
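
To make the intended usage concrete, here is a minimal sketch of the
snapshot-and-revalidate pattern the documentation patch below describes. The
drv_*() helpers are hypothetical wrappers around hmm_vma_get_pfns()/
hmm_vma_fault() and hmm_vma_range_done() (whose exact signatures change over
the course of this series), so read it as an illustration of the pattern, not
as the final API:

/*
 * Hedged sketch: mirror a range into a device page table using the
 * snapshot/revalidate pattern from Documentation/vm/hmm.txt.  All drv_*()
 * symbols are hypothetical driver helpers, not part of the HMM API.
 */
#include <linux/hmm.h>
#include <linux/mm.h>

struct drv;                                        /* hypothetical driver state */
int drv_hmm_snapshot(struct vm_area_struct *vma, struct hmm_range *range);
bool drv_hmm_range_valid(struct hmm_range *range); /* wraps hmm_vma_range_done() */
void drv_lock(struct drv *drv);                    /* same lock as update() cb */
void drv_unlock(struct drv *drv);
void drv_update_device_ptes(struct drv *drv, struct hmm_range *range);

static int drv_populate_range(struct drv *drv, struct vm_area_struct *vma,
                              struct hmm_range *range)
{
        int ret;

again:
        /* Snapshot CPU page table entries for the range (no faulting). */
        ret = drv_hmm_snapshot(vma, range);
        if (ret)
                return ret;

        /*
         * Take the same lock the sync_cpu_device_pagetables() callback takes,
         * then check that no CPU page table update raced with the snapshot;
         * if one did, throw the snapshot away and retry.
         */
        drv_lock(drv);
        if (!drv_hmm_range_valid(range)) {
                drv_unlock(drv);
                goto again;
        }

        /* The snapshot is now stable: program the device page table from it. */
        drv_update_device_ptes(drv, range);
        drv_unlock(drv);
        return 0;
}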

You can find them in the hmm-for-4.17 branch:

git://people.freedesktop.org/~glisse/linux
https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-for-4.17

Jérôme Glisse (13):
  mm/hmm: fix header file if/else/endif maze v2
  mm/hmm: unregister mmu_notifier when last HMM client quit
  mm/hmm: hmm_pfns_bad() was accessing wrong struct
  mm/hmm: use struct for hmm_vma_fault(), hmm_vma_get_pfns() parameters
    v2
  mm/hmm: remove HMM_PFN_READ flag and ignore peculiar architecture v2
  mm/hmm: use uint64_t for HMM pfn instead of defining hmm_pfn_t to
    ulong v2
  mm/hmm: cleanup special vma handling (VM_SPECIAL)
  mm/hmm: do not differentiate between empty entry or missing directory
    v2
  mm/hmm: rename HMM_PFN_DEVICE_UNADDRESSABLE to HMM_PFN_DEVICE_PRIVATE
  mm/hmm: move hmm_pfns_clear() closer to where it is use
  mm/hmm: factor out pte and pmd handling to simplify hmm_vma_walk_pmd()
  mm/hmm: change hmm_vma_fault() to allow write fault on page basis
  mm/hmm: use device driver encoding for HMM pfn v2

Ralph Campbell (2):
  mm/hmm: documentation editorial update to HMM documentation
  mm/hmm: HMM should have a callback before MM is destroyed v2

 Documentation/vm/hmm.txt | 360 ++++++++++++++++----------------
 MAINTAINERS              |   1 +
 include/linux/hmm.h      | 201 +++++++++++-------
 mm/hmm.c                 | 526 ++++++++++++++++++++++++++++++-----------------
 4 files changed, 648 insertions(+), 440 deletions(-)

-- 
2.14.3

* [PATCH 01/15] mm/hmm: documentation editorial update to HMM documentation
From: jglisse @ 2018-03-20  2:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Ralph Campbell,
	Jérôme Glisse, Stephen Bates, Jason Gunthorpe,
	Logan Gunthorpe, Evgeny Baskakov, Mark Hairgrove, John Hubbard

From: Ralph Campbell <rcampbell@nvidia.com>

This patch updates the documentation for HMM to fix minor typos and
improve the phrasing to be a bit more readable.

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Stephen  Bates <sbates@raithlin.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 Documentation/vm/hmm.txt | 360 ++++++++++++++++++++++++-----------------------
 MAINTAINERS              |   1 +
 2 files changed, 187 insertions(+), 174 deletions(-)
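
As a companion to section 6 of the updated text, here is a hedged sketch of
how a driver might fill in the two migrate_vma_ops callbacks it describes.
The drv_*() helpers are hypothetical, and the callback prototypes are
reproduced from memory of include/linux/migrate.h of this era, so check them
against the tree; the point is only to illustrate the src/dst pfn array
convention:

/*
 * Hedged sketch of the migrate_vma_ops callbacks described in section 6 of
 * hmm.txt.  drv_alloc_device_page() and drv_dma_copy() are hypothetical.
 */
#include <linux/migrate.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

struct page *drv_alloc_device_page(void *private);
void drv_dma_copy(struct page *dst, struct page *src, void *private);

static void drv_alloc_and_copy(struct vm_area_struct *vma,
                               const unsigned long *src, unsigned long *dst,
                               unsigned long start, unsigned long end,
                               void *private)
{
        unsigned long addr, i;

        for (addr = start, i = 0; addr < end; addr += PAGE_SIZE, i++) {
                struct page *dpage;

                /* Skip holes and pages the core code could not isolate. */
                if (!(src[i] & MIGRATE_PFN_MIGRATE))
                        continue;

                dpage = drv_alloc_device_page(private);
                if (!dpage)
                        continue;       /* an empty dst entry skips this page */

                drv_dma_copy(dpage, migrate_pfn_to_page(src[i]), private);
                lock_page(dpage);
                dst[i] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
        }
}

static void drv_finalize_and_map(struct vm_area_struct *vma,
                                 const unsigned long *src,
                                 const unsigned long *dst,
                                 unsigned long start, unsigned long end,
                                 void *private)
{
        /*
         * Entries that lost MIGRATE_PFN_MIGRATE in src[] were not migrated;
         * update device page tables only for the ones that were.
         */
}

static const struct migrate_vma_ops drv_migrate_ops = {
        .alloc_and_copy         = drv_alloc_and_copy,
        .finalize_and_map       = drv_finalize_and_map,
};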

diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.txt
index 4d3aac9f4a5d..e99b97003982 100644
--- a/Documentation/vm/hmm.txt
+++ b/Documentation/vm/hmm.txt
@@ -1,151 +1,159 @@
 Heterogeneous Memory Management (HMM)
 
-Transparently allow any component of a program to use any memory region of said
-program with a device without using device specific memory allocator. This is
-becoming a requirement to simplify the use of advance heterogeneous computing
-where GPU, DSP or FPGA are use to perform various computations.
-
-This document is divided as follow, in the first section i expose the problems
-related to the use of a device specific allocator. The second section i expose
-the hardware limitations that are inherent to many platforms. The third section
-gives an overview of HMM designs. The fourth section explains how CPU page-
-table mirroring works and what is HMM purpose in this context. Fifth section
-deals with how device memory is represented inside the kernel. Finaly the last
-section present the new migration helper that allow to leverage the device DMA
-engine.
-
-
-1) Problems of using device specific memory allocator:
-2) System bus, device memory characteristics
-3) Share address space and migration
+Provide infrastructure and helpers to integrate non-conventional memory (device
+memory like GPU on-board memory) into regular kernel code paths. The cornerstone
+of this is specialized struct pages for such memory (see sections 5 to 7 of this
+document).
+
+HMM also provides optional helpers for SVM (Shared Virtual Memory), i.e. allowing
+a device to transparently access program addresses coherently with the CPU,
+meaning that any valid pointer on the CPU is also a valid pointer for the device.
+This is becoming mandatory to simplify the use of advanced heterogeneous
+computing where GPUs, DSPs, or FPGAs are used to perform various computations on
+behalf of a process.
+
+This document is divided as follows: in the first section I expose the problems
+related to using device specific memory allocators. In the second section, I
+expose the hardware limitations that are inherent to many platforms. The third
+section gives an overview of the HMM design. The fourth section explains how
+CPU page-table mirroring works and what HMM's purpose is in this context. The
+fifth section deals with how device memory is represented inside the kernel.
+Finally, the last section presents a new migration helper that allows
+leveraging the device DMA engine.
+
+
+1) Problems of using a device specific memory allocator:
+2) I/O bus, device memory characteristics
+3) Shared address space and migration
 4) Address space mirroring implementation and API
 5) Represent and manage device memory from core kernel point of view
-6) Migrate to and from device memory
+6) Migration to and from device memory
 7) Memory cgroup (memcg) and rss accounting
 
 
 -------------------------------------------------------------------------------
 
-1) Problems of using device specific memory allocator:
+1) Problems of using a device specific memory allocator:
 
-Device with large amount of on board memory (several giga bytes) like GPU have
-historically manage their memory through dedicated driver specific API. This
-creates a disconnect between memory allocated and managed by device driver and
-regular application memory (private anonymous, share memory or regular file
-back memory). From here on i will refer to this aspect as split address space.
-I use share address space to refer to the opposite situation ie one in which
-any memory region can be use by device transparently.
+Devices with a large amount of on-board memory (several gigabytes) like GPUs
+have historically managed their memory through dedicated driver specific APIs.
+This creates a disconnect between memory allocated and managed by a device
+driver and regular application memory (private anonymous, shared memory, or
+regular file backed memory). From here on I will refer to this aspect as split
+address space. I use shared address space to refer to the opposite situation:
+i.e., one in which any application memory region can be used by a device
+transparently.
 
 Split address space because device can only access memory allocated through the
-device specific API. This imply that all memory object in a program are not
-equal from device point of view which complicate large program that rely on a
-wide set of libraries.
+device specific API. This implies that all memory objects in a program are not
+equal from the device point of view which complicates large programs that rely
+on a wide set of libraries.
 
-Concretly this means that code that wants to leverage device like GPU need to
+Concretely this means that code that wants to leverage devices like GPUs needs to
 copy object between genericly allocated memory (malloc, mmap private/share/)
 and memory allocated through the device driver API (this still end up with an
 mmap but of the device file).
 
-For flat dataset (array, grid, image, ...) this isn't too hard to achieve but
-complex data-set (list, tree, ...) are hard to get right. Duplicating a complex
-data-set need to re-map all the pointer relations between each of its elements.
-This is error prone and program gets harder to debug because of the duplicate
-data-set.
+For flat data-sets (array, grid, image, ...) this isn't too hard to achieve but
+complex data-sets (list, tree, ...) are hard to get right. Duplicating a
+complex data-set requires re-mapping all the pointer relations between each of
+its elements. This is error prone and programs get harder to debug because of
+the duplicate data-set and addresses.
 
-Split address space also means that library can not transparently use data they
-are getting from core program or other library and thus each library might have
-to duplicate its input data-set using specific memory allocator. Large project
-suffer from this and waste resources because of the various memory copy.
+Split address space also means that libraries can not transparently use data
+they are getting from the core program or another library and thus each library
+might have to duplicate its input data-set using the device specific memory
+allocator. Large projects suffer from this and waste resources because of the
+various memory copies.
 
 Duplicating each library API to accept as input or output memory allocted by
 each device specific allocator is not a viable option. It would lead to a
-combinatorial explosions in the library entry points.
+combinatorial explosion in the library entry points.
 
-Finaly with the advance of high level language constructs (in C++ but in other
-language too) it is now possible for compiler to leverage GPU or other devices
-without even the programmer knowledge. Some of compiler identified patterns are
-only do-able with a share address. It is as well more reasonable to use a share
-address space for all the other patterns.
+Finally, with the advance of high level language constructs (in C++ but in
+other languages too) it is now possible for the compiler to leverage GPUs and
+other devices without programmer knowledge. Some compiler identified patterns
+are only doable with a shared address space. It is also more reasonable to use
+a shared address space for all other patterns.
 
 
 -------------------------------------------------------------------------------
 
-2) System bus, device memory characteristics
+2) I/O bus, device memory characteristics
 
-System bus cripple share address due to few limitations. Most system bus only
+I/O buses cripple shared address spaces due to a few limitations. Most I/O buses only
 allow basic memory access from device to main memory, even cache coherency is
-often optional. Access to device memory from CPU is even more limited, most
-often than not it is not cache coherent.
+often optional. Access to device memory from CPU is even more limited. More
+often than not, it is not cache coherent.
 
-If we only consider the PCIE bus than device can access main memory (often
-through an IOMMU) and be cache coherent with the CPUs. However it only allows
-a limited set of atomic operation from device on main memory. This is worse
-in the other direction the CPUs can only access a limited range of the device
+If we only consider the PCIE bus, then a device can access main memory (often
+through an IOMMU) and be cache coherent with the CPUs. However, it only allows
+a limited set of atomic operations from device on main memory. This is worse
+in the other direction: the CPU can only access a limited range of the device
 memory and can not perform atomic operations on it. Thus device memory can not
-be consider like regular memory from kernel point of view.
+be considered the same as regular memory from the kernel point of view.
 
 Another crippling factor is the limited bandwidth (~32GBytes/s with PCIE 4.0
-and 16 lanes). This is 33 times less that fastest GPU memory (1 TBytes/s).
-The final limitation is latency, access to main memory from the device has an
-order of magnitude higher latency than when the device access its own memory.
+and 16 lanes). This is 33 times less than the fastest GPU memory (1 TBytes/s).
+The final limitation is latency. Access to main memory from the device has an
+order of magnitude higher latency than when the device accesses its own memory.
 
-Some platform are developing new system bus or additions/modifications to PCIE
-to address some of those limitations (OpenCAPI, CCIX). They mainly allow two
+Some platforms are developing new I/O buses or additions/modifications to PCIE
+to address some of these limitations (OpenCAPI, CCIX). They mainly allow two
 way cache coherency between CPU and device and allow all atomic operations the
-architecture supports. Saddly not all platform are following this trends and
-some major architecture are left without hardware solutions to those problems.
+architecture supports. Sadly, not all platforms are following this trend and
+some major architectures are left without hardware solutions to these problems.
 
-So for share address space to make sense not only we must allow device to
+So for a shared address space to make sense, not only must we allow the device to
 access any memory memory but we must also permit any memory to be migrated to
 device memory while device is using it (blocking CPU access while it happens).
 
 
 -------------------------------------------------------------------------------
 
-3) Share address space and migration
+3) Shared address space and migration
 
 HMM intends to provide two main features. First one is to share the address
-space by duplication the CPU page table into the device page table so same
-address point to same memory and this for any valid main memory address in
+space by duplicating the CPU page table in the device page table so the same
+address points to the same physical memory for any valid main memory address in
 the process address space.
 
-To achieve this, HMM offer a set of helpers to populate the device page table
+To achieve this, HMM offers a set of helpers to populate the device page table
 while keeping track of CPU page table updates. Device page table updates are
-not as easy as CPU page table updates. To update the device page table you must
-allow a buffer (or use a pool of pre-allocated buffer) and write GPU specifics
-commands in it to perform the update (unmap, cache invalidations and flush,
-...). This can not be done through common code for all device. Hence why HMM
-provides helpers to factor out everything that can be while leaving the gory
-details to the device driver.
-
-The second mechanism HMM provide is a new kind of ZONE_DEVICE memory that does
-allow to allocate a struct page for each page of the device memory. Those page
-are special because the CPU can not map them. They however allow to migrate
-main memory to device memory using exhisting migration mechanism and everything
-looks like if page was swap out to disk from CPU point of view. Using a struct
-page gives the easiest and cleanest integration with existing mm mechanisms.
-Again here HMM only provide helpers, first to hotplug new ZONE_DEVICE memory
-for the device memory and second to perform migration. Policy decision of what
-and when to migrate things is left to the device driver.
-
-Note that any CPU access to a device page trigger a page fault and a migration
-back to main memory ie when a page backing an given address A is migrated from
-a main memory page to a device page then any CPU access to address A trigger a
-page fault and initiate a migration back to main memory.
-
-
-With this two features, HMM not only allow a device to mirror a process address
-space and keeps both CPU and device page table synchronize, but also allow to
-leverage device memory by migrating part of data-set that is actively use by a
-device.
+not as easy as CPU page table updates. To update the device page table, you must
+allocate a buffer (or use a pool of pre-allocated buffers) and write GPU
+specific commands in it to perform the update (unmap, cache invalidations, and
+flush, ...). This can not be done through common code for all devices. Hence
+HMM provides helpers to factor out everything that can be, while leaving the
+hardware specific details to the device driver.
+
+The second mechanism HMM provides is a new kind of ZONE_DEVICE memory that
+allows allocating a struct page for each page of the device memory. Those pages
+are special because the CPU can not map them. However, they allow migrating
+main memory to device memory using existing migration mechanisms and everything
+looks like a page is swapped out to disk from the CPU point of view. Using a
+struct page gives the easiest and cleanest integration with existing mm
+mechanisms. Here again, HMM only provides helpers, first to hotplug new
+ZONE_DEVICE memory for the device memory and second to perform migration. The
+policy decisions of what and when to migrate things are left to the device driver.
+
+Note that any CPU access to a device page triggers a page fault and a migration
+back to main memory. For example, when a page backing a given CPU address A is
+migrated from a main memory page to a device page, then any CPU access to
+address A triggers a page fault and initiates a migration back to main memory.
+
+With these two features, HMM not only allows a device to mirror a process
+address space and keep both CPU and device page tables synchronized, but also
+leverages device memory by migrating the part of the data-set that is actively
+being used by the device.
 
 
 -------------------------------------------------------------------------------
 
 4) Address space mirroring implementation and API
 
-Address space mirroring main objective is to allow to duplicate range of CPU
-page table into a device page table and HMM helps keeping both synchronize. A
+Address space mirroring's main objective is to allow duplication of a range of
+the CPU page table into a device page table; HMM helps keep both synchronized. A
 device driver that want to mirror a process address space must start with the
 registration of an hmm_mirror struct:
 
@@ -155,8 +163,8 @@ device driver that want to mirror a process address space must start with the
                                 struct mm_struct *mm);
 
 The locked variant is to be use when the driver is already holding the mmap_sem
-of the mm in write mode. The mirror struct has a set of callback that are use
-to propagate CPU page table:
+of the mm in write mode. The mirror struct has a set of callbacks that are used
+to propagate CPU page tables:
 
  struct hmm_mirror_ops {
      /* sync_cpu_device_pagetables() - synchronize page tables
@@ -181,13 +189,13 @@ of the mm in write mode. The mirror struct has a set of callback that are use
                      unsigned long end);
  };
 
-Device driver must perform update to the range following action (turn range
-read only, or fully unmap, ...). Once driver callback returns the device must
-be done with the update.
+The device driver must perform the update action to the range (mark range
+read only, or fully unmap, ...). The device must be done with the update before
+the driver callback returns.
 
 
-When device driver wants to populate a range of virtual address it can use
-either:
+When the device driver wants to populate a range of virtual addresses, it can
+use either:
  int hmm_vma_get_pfns(struct vm_area_struct *vma,
                       struct hmm_range *range,
                       unsigned long start,
@@ -201,17 +209,19 @@ When device driver wants to populate a range of virtual address it can use
                    bool write,
                    bool block);
 
-First one (hmm_vma_get_pfns()) will only fetch present CPU page table entry and
-will not trigger a page fault on missing or non present entry. The second one
-do trigger page fault on missing or read only entry if write parameter is true.
-Page fault use the generic mm page fault code path just like a CPU page fault.
+The first one (hmm_vma_get_pfns()) will only fetch present CPU page table
+entries and will not trigger a page fault on missing or non present entries.
+The second one does trigger a page fault on missing or read only entry if the
+write parameter is true. Page faults use the generic mm page fault code path
+just like a CPU page fault.
 
-Both function copy CPU page table into their pfns array argument. Each entry in
-that array correspond to an address in the virtual range. HMM provide a set of
-flags to help driver identify special CPU page table entries.
+Both functions copy CPU page table entries into their pfns array argument. Each
+entry in that array corresponds to an address in the virtual range. HMM
+provides a set of flags to help the driver identify special CPU page table
+entries.
 
 Locking with the update() callback is the most important aspect the driver must
-respect in order to keep things properly synchronize. The usage pattern is :
+respect in order to keep things properly synchronized. The usage pattern is:
 
  int driver_populate_range(...)
  {
@@ -233,43 +243,44 @@ Locking with the update() callback is the most important aspect the driver must
       return 0;
  }
 
-The driver->update lock is the same lock that driver takes inside its update()
-callback. That lock must be call before hmm_vma_range_done() to avoid any race
-with a concurrent CPU page table update.
+The driver->update lock is the same lock that the driver takes inside its
+update() callback. That lock must be held when calling hmm_vma_range_done() to
+avoid any race with a concurrent CPU page table update.
 
-HMM implements all this on top of the mmu_notifier API because we wanted to a
-simpler API and also to be able to perform optimization latter own like doing
-concurrent device update in multi-devices scenario.
+HMM implements all this on top of the mmu_notifier API because we wanted a
+simpler API and also to be able to perform optimizations later on, like doing
+concurrent device updates in multi-device scenarios.
 
-HMM also serve as an impedence missmatch between how CPU page table update are
-done (by CPU write to the page table and TLB flushes) from how device update
-their own page table. Device update is a multi-step process, first appropriate
-commands are write to a buffer, then this buffer is schedule for execution on
-the device. It is only once the device has executed commands in the buffer that
-the update is done. Creating and scheduling update command buffer can happen
-concurrently for multiple devices. Waiting for each device to report commands
-as executed is serialize (there is no point in doing this concurrently).
+HMM also bridges the impedance mismatch between how CPU page table updates are
+done (by the CPU writing to the page table and flushing TLBs) and how devices
+update their own page tables. Device updates are a multi-step process. First,
+appropriate commands are written to a buffer, then this buffer is scheduled for
+execution on the device. It is only once the device has executed commands in
+the buffer that the update is done. Creating and scheduling the update command
+buffer can happen concurrently for multiple devices. Waiting for each device to
+report commands as executed is serialized (there is no point in doing this
+concurrently).
 
 
 -------------------------------------------------------------------------------
 
 5) Represent and manage device memory from core kernel point of view
 
-Several differents design were try to support device memory. First one use
-device specific data structure to keep information about migrated memory and
-HMM hooked itself in various place of mm code to handle any access to address
-that were back by device memory. It turns out that this ended up replicating
-most of the fields of struct page and also needed many kernel code path to be
-updated to understand this new kind of memory.
+Several different designs were tried to support device memory. The first used
+a device specific data structure to keep information about migrated memory and
+HMM hooked itself in various places of mm code to handle any access to
+addresses that were backed by device memory. It turns out that this ended up
+replicating most of the fields of struct page and also needed many kernel code
+paths to be updated to understand this new kind of memory.
 
-Thing is most kernel code path never try to access the memory behind a page
-but only care about struct page contents. Because of this HMM switchted to
-directly using struct page for device memory which left most kernel code path
-un-aware of the difference. We only need to make sure that no one ever try to
-map those page from the CPU side.
+Most kernel code paths never try to access the memory behind a page
+but only care about struct page contents. Because of this, HMM switched to
+directly using struct page for device memory which left most kernel code paths
+unaware of the difference. We only need to make sure that no one ever tries to
+map those pages from the CPU side.
 
-HMM provide a set of helpers to register and hotplug device memory as a new
-region needing struct page. This is offer through a very simple API:
+HMM provides a set of helpers to register and hotplug device memory as a new
+region needing a struct page. This is offered through a very simple API:
 
  struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
                                    struct device *device,
@@ -289,18 +300,19 @@ HMM provide a set of helpers to register and hotplug device memory as a new
  };
 
 The first callback (free()) happens when the last reference on a device page is
-drop. This means the device page is now free and no longer use by anyone. The
-second callback happens whenever CPU try to access a device page which it can
-not do. This second callback must trigger a migration back to system memory.
+dropped. This means the device page is now free and no longer used by anyone.
+The second callback happens whenever the CPU tries to access a device page,
+which it can not do directly. This second callback must trigger a migration
+back to system memory.
 
 
 -------------------------------------------------------------------------------
 
-6) Migrate to and from device memory
+6) Migration to and from device memory
 
-Because CPU can not access device memory, migration must use device DMA engine
-to perform copy from and to device memory. For this we need a new migration
-helper:
+Because the CPU can not access device memory, migration must use the device DMA
+engine to perform copy from and to device memory. For this we need a new
+migration helper:
 
  int migrate_vma(const struct migrate_vma_ops *ops,
                  struct vm_area_struct *vma,
@@ -311,15 +323,15 @@ to perform copy from and to device memory. For this we need a new migration
                  unsigned long *dst,
                  void *private);
 
-Unlike other migration function it works on a range of virtual address, there
-is two reasons for that. First device DMA copy has a high setup overhead cost
+Unlike other migration functions, it works on a range of virtual addresses; there
+are two reasons for that. First, device DMA copy has a high setup overhead cost
 and thus batching multiple pages is needed as otherwise the migration overhead
-make the whole excersie pointless. The second reason is because driver trigger
-such migration base on range of address the device is actively accessing.
+makes the whole exercise pointless. The second reason is that the
+migration might be for a range of addresses the device is actively accessing.
 
-The migrate_vma_ops struct define two callbacks. First one (alloc_and_copy())
-control destination memory allocation and copy operation. Second one is there
-to allow device driver to perform cleanup operation after migration.
+The migrate_vma_ops struct defines two callbacks. The first (alloc_and_copy())
+controls destination memory allocation and the copy operation. The second is
+there to allow the device driver to perform cleanup operations after migration.
 
  struct migrate_vma_ops {
      void (*alloc_and_copy)(struct vm_area_struct *vma,
@@ -336,19 +348,19 @@ to allow device driver to perform cleanup operation after migration.
                               void *private);
  };
 
-It is important to stress that this migration helpers allow for hole in the
+It is important to stress that these migration helpers allow for holes in the
 virtual address range. Some pages in the range might not be migrated for all
-the usual reasons (page is pin, page is lock, ...). This helper does not fail
-but just skip over those pages.
+the usual reasons (page is pinned, page is locked, ...). This helper does not
+fail but just skips over those pages.
 
-The alloc_and_copy() might as well decide to not migrate all pages in the
-range (for reasons under the callback control). For those the callback just
-have to leave the corresponding dst entry empty.
+The alloc_and_copy() callback might decide not to migrate all pages in the
+range (for reasons under the callback control). For those, the callback just
+has to leave the corresponding dst entry empty.
 
-Finaly the migration of the struct page might fails (for file back page) for
+Finally, the migration of the struct page might fail (for file backed page) for
 various reasons (failure to freeze reference, or update page cache, ...). If
-that happens then the finalize_and_map() can catch any pages that was not
-migrated. Note those page were still copied to new page and thus we wasted
+that happens, then the finalize_and_map() can catch any pages that were not
+migrated. Note those pages were still copied to a new page and thus we wasted
 bandwidth but this is considered as a rare event and a price that we are
 willing to pay to keep all the code simpler.
 
@@ -358,27 +370,27 @@ willing to pay to keep all the code simpler.
 7) Memory cgroup (memcg) and rss accounting
 
 For now device memory is accounted as any regular page in rss counters (either
-anonymous if device page is use for anonymous, file if device page is use for
-file back page or shmem if device page is use for share memory). This is a
-deliberate choice to keep existing application that might start using device
-memory without knowing about it to keep runing unimpacted.
-
-Drawbacks is that OOM killer might kill an application using a lot of device
-memory and not a lot of regular system memory and thus not freeing much system
-memory. We want to gather more real world experience on how application and
-system react under memory pressure in the presence of device memory before
+anonymous if the device page is used for anonymous memory, file if the device
+page is used for a file backed page, or shmem if the device page is used for
+shared memory). This is a deliberate choice to keep existing applications that
+might start using device memory without knowing about it running unimpacted.
+
+A drawback is that the OOM killer might kill an application using a lot of
+device memory and not a lot of regular system memory, and thus not free much
+system memory. We want to gather more real world experience on how applications
+and systems react under memory pressure in the presence of device memory before
 deciding to account device memory differently.
 
 
-Same decision was made for memory cgroup. Device memory page are accounted
+The same decision was made for memory cgroups. Device memory pages are accounted
 against same memory cgroup a regular page would be accounted to. This does
 simplify migration to and from device memory. This also means that migration
 back from device memory to regular memory can not fail because it would
 go above memory cgroup limit. We might revisit this choice latter on once we
-get more experience in how device memory is use and its impact on memory
+get more experience in how device memory is used and its impact on memory
 resource control.
 
 
-Note that device memory can never be pin nor by device driver nor through GUP
+Note that device memory can never be pinned by a device driver nor through GUP
 and thus such memory is always free upon process exit. Or when last reference
-is drop in case of share memory or file back memory.
+is dropped in case of shared memory or file backed memory.
diff --git a/MAINTAINERS b/MAINTAINERS
index 8be7d1382ce9..4db4d4a820b1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6466,6 +6466,7 @@ L:	linux-mm@kvack.org
 S:	Maintained
 F:	mm/hmm*
 F:	include/linux/hmm*
+F:	Documentation/vm/hmm.txt
 
 HOST AP DRIVER
 M:	Jouni Malinen <j@w1.fi>
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH 01/15] mm/hmm: documentation editorial update to HMM documentation
@ 2018-03-20  2:00   ` jglisse
  0 siblings, 0 replies; 99+ messages in thread
From: jglisse @ 2018-03-20  2:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Ralph Campbell,
	Jérôme Glisse, Stephen Bates, Jason Gunthorpe,
	Logan Gunthorpe, Evgeny Baskakov, Mark Hairgrove, John Hubbard

From: Ralph Campbell <rcampbell@nvidia.com>

This patch updates the documentation for HMM to fix minor typos and
phrasing to be a bit more readable.

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
Cc: Stephen  Bates <sbates@raithlin.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 Documentation/vm/hmm.txt | 360 ++++++++++++++++++++++++-----------------------
 MAINTAINERS              |   1 +
 2 files changed, 187 insertions(+), 174 deletions(-)

diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.txt
index 4d3aac9f4a5d..e99b97003982 100644
--- a/Documentation/vm/hmm.txt
+++ b/Documentation/vm/hmm.txt
@@ -1,151 +1,159 @@
 Heterogeneous Memory Management (HMM)
 
-Transparently allow any component of a program to use any memory region of said
-program with a device without using device specific memory allocator. This is
-becoming a requirement to simplify the use of advance heterogeneous computing
-where GPU, DSP or FPGA are use to perform various computations.
-
-This document is divided as follow, in the first section i expose the problems
-related to the use of a device specific allocator. The second section i expose
-the hardware limitations that are inherent to many platforms. The third section
-gives an overview of HMM designs. The fourth section explains how CPU page-
-table mirroring works and what is HMM purpose in this context. Fifth section
-deals with how device memory is represented inside the kernel. Finaly the last
-section present the new migration helper that allow to leverage the device DMA
-engine.
-
-
-1) Problems of using device specific memory allocator:
-2) System bus, device memory characteristics
-3) Share address space and migration
+Provide infrastructure and helpers to integrate non conventional memory (device
+memory like GPU on board memory) into regular kernel code path. Corner stone of
+this being specialize struct page for such memory (see sections 5 to 7 of this
+document).
+
+HMM also provide optional helpers for SVM (Share Virtual Memory) ie allowing a
+device to transparently access program address coherently with the CPU meaning
+that any valid pointer on the CPU is also a valid pointer for the device. This
+is becoming a mandatory to simplify the use of advance heterogeneous computing
+where GPU, DSP, or FPGA are used to perform various computations on behalf of
+a process.
+
+This document is divided as follows: in the first section I expose the problems
+related to using device specific memory allocators. In the second section, I
+expose the hardware limitations that are inherent to many platforms. The third
+section gives an overview of the HMM design. The fourth section explains how
+CPU page-table mirroring works and what is HMM's purpose in this context. The
+fifth section deals with how device memory is represented inside the kernel.
+Finally, the last section presents a new migration helper that allows lever-
+aging the device DMA engine.
+
+
+1) Problems of using a device specific memory allocator:
+2) I/O bus, device memory characteristics
+3) Shared address space and migration
 4) Address space mirroring implementation and API
 5) Represent and manage device memory from core kernel point of view
-6) Migrate to and from device memory
+6) Migration to and from device memory
 7) Memory cgroup (memcg) and rss accounting
 
 
 -------------------------------------------------------------------------------
 
-1) Problems of using device specific memory allocator:
+1) Problems of using a device specific memory allocator:
 
-Device with large amount of on board memory (several giga bytes) like GPU have
-historically manage their memory through dedicated driver specific API. This
-creates a disconnect between memory allocated and managed by device driver and
-regular application memory (private anonymous, share memory or regular file
-back memory). From here on i will refer to this aspect as split address space.
-I use share address space to refer to the opposite situation ie one in which
-any memory region can be use by device transparently.
+Devices with a large amount of on board memory (several giga bytes) like GPUs
+have historically managed their memory through dedicated driver specific APIs.
+This creates a disconnect between memory allocated and managed by a device
+driver and regular application memory (private anonymous, shared memory, or
+regular file backed memory). From here on I will refer to this aspect as split
+address space. I use shared address space to refer to the opposite situation:
+i.e., one in which any application memory region can be used by a device
+transparently.
 
 Split address space because device can only access memory allocated through the
-device specific API. This imply that all memory object in a program are not
-equal from device point of view which complicate large program that rely on a
-wide set of libraries.
+device specific API. This implies that all memory objects in a program are not
+equal from the device point of view which complicates large programs that rely
+on a wide set of libraries.
 
-Concretly this means that code that wants to leverage device like GPU need to
+Concretly this means that code that wants to leverage devices like GPUs need to
 copy object between genericly allocated memory (malloc, mmap private/share/)
 and memory allocated through the device driver API (this still end up with an
 mmap but of the device file).
 
-For flat dataset (array, grid, image, ...) this isn't too hard to achieve but
-complex data-set (list, tree, ...) are hard to get right. Duplicating a complex
-data-set need to re-map all the pointer relations between each of its elements.
-This is error prone and program gets harder to debug because of the duplicate
-data-set.
+For flat data-sets (array, grid, image, ...) this isn't too hard to achieve but
+complex data-sets (list, tree, ...) are hard to get right. Duplicating a
+complex data-set needs to re-map all the pointer relations between each of its
+elements. This is error prone and program gets harder to debug because of the
+duplicate data-set and addresses.
 
-Split address space also means that library can not transparently use data they
-are getting from core program or other library and thus each library might have
-to duplicate its input data-set using specific memory allocator. Large project
-suffer from this and waste resources because of the various memory copy.
+Split address space also means that libraries can not transparently use data
+they are getting from the core program or another library and thus each library
+might have to duplicate its input data-set using the device specific memory
+allocator. Large projects suffer from this and waste resources because of the
+various memory copies.
 
 Duplicating each library API to accept as input or output memory allocted by
 each device specific allocator is not a viable option. It would lead to a
-combinatorial explosions in the library entry points.
+combinatorial explosion in the library entry points.
 
-Finaly with the advance of high level language constructs (in C++ but in other
-language too) it is now possible for compiler to leverage GPU or other devices
-without even the programmer knowledge. Some of compiler identified patterns are
-only do-able with a share address. It is as well more reasonable to use a share
-address space for all the other patterns.
+Finally, with the advance of high level language constructs (in C++ but in
+other languages too) it is now possible for the compiler to leverage GPUs and
+other devices without programmer knowledge. Some compiler identified patterns
+are only do-able with a shared address space. It is also more reasonable to use
+a shared address space for all other patterns.
 
 
 -------------------------------------------------------------------------------
 
-2) System bus, device memory characteristics
+2) I/O bus, device memory characteristics
 
-System bus cripple share address due to few limitations. Most system bus only
+I/O buses cripple shared address due to few limitations. Most I/O buses only
 allow basic memory access from device to main memory, even cache coherency is
-often optional. Access to device memory from CPU is even more limited, most
-often than not it is not cache coherent.
+often optional. Access to device memory from CPU is even more limited. More
+often than not, it is not cache coherent.
 
-If we only consider the PCIE bus than device can access main memory (often
-through an IOMMU) and be cache coherent with the CPUs. However it only allows
-a limited set of atomic operation from device on main memory. This is worse
-in the other direction the CPUs can only access a limited range of the device
+If we only consider the PCIE bus, then a device can access main memory (often
+through an IOMMU) and be cache coherent with the CPUs. However, it only allows
+a limited set of atomic operations from device on main memory. This is worse
+in the other direction, the CPU can only access a limited range of the device
 memory and can not perform atomic operations on it. Thus device memory can not
-be consider like regular memory from kernel point of view.
+be considered the same as regular memory from the kernel point of view.
 
 Another crippling factor is the limited bandwidth (~32GBytes/s with PCIE 4.0
-and 16 lanes). This is 33 times less that fastest GPU memory (1 TBytes/s).
-The final limitation is latency, access to main memory from the device has an
-order of magnitude higher latency than when the device access its own memory.
+and 16 lanes). This is 33 times less than the fastest GPU memory (1 TBytes/s).
+The final limitation is latency. Access to main memory from the device has an
+order of magnitude higher latency than when the device accesses its own memory.
 
-Some platform are developing new system bus or additions/modifications to PCIE
-to address some of those limitations (OpenCAPI, CCIX). They mainly allow two
+Some platforms are developing new I/O buses or additions/modifications to PCIE
+to address some of these limitations (OpenCAPI, CCIX). They mainly allow two
 way cache coherency between CPU and device and allow all atomic operations the
-architecture supports. Saddly not all platform are following this trends and
-some major architecture are left without hardware solutions to those problems.
+architecture supports. Saddly, not all platforms are following this trend and
+some major architectures are left without hardware solutions to these problems.
 
-So for share address space to make sense not only we must allow device to
+So for shared address space to make sense, not only must we allow device to
 access any memory memory but we must also permit any memory to be migrated to
 device memory while device is using it (blocking CPU access while it happens).
 
 
 -------------------------------------------------------------------------------
 
-3) Share address space and migration
+3) Shared address space and migration
 
 HMM intends to provide two main features. First one is to share the address
-space by duplication the CPU page table into the device page table so same
-address point to same memory and this for any valid main memory address in
+space by duplicating the CPU page table in the device page table so the same
+address points to the same physical memory for any valid main memory address in
 the process address space.
 
-To achieve this, HMM offer a set of helpers to populate the device page table
+To achieve this, HMM offers a set of helpers to populate the device page table
 while keeping track of CPU page table updates. Device page table updates are
-not as easy as CPU page table updates. To update the device page table you must
-allow a buffer (or use a pool of pre-allocated buffer) and write GPU specifics
-commands in it to perform the update (unmap, cache invalidations and flush,
-...). This can not be done through common code for all device. Hence why HMM
-provides helpers to factor out everything that can be while leaving the gory
-details to the device driver.
-
-The second mechanism HMM provide is a new kind of ZONE_DEVICE memory that does
-allow to allocate a struct page for each page of the device memory. Those page
-are special because the CPU can not map them. They however allow to migrate
-main memory to device memory using exhisting migration mechanism and everything
-looks like if page was swap out to disk from CPU point of view. Using a struct
-page gives the easiest and cleanest integration with existing mm mechanisms.
-Again here HMM only provide helpers, first to hotplug new ZONE_DEVICE memory
-for the device memory and second to perform migration. Policy decision of what
-and when to migrate things is left to the device driver.
-
-Note that any CPU access to a device page trigger a page fault and a migration
-back to main memory ie when a page backing an given address A is migrated from
-a main memory page to a device page then any CPU access to address A trigger a
-page fault and initiate a migration back to main memory.
-
-
-With this two features, HMM not only allow a device to mirror a process address
-space and keeps both CPU and device page table synchronize, but also allow to
-leverage device memory by migrating part of data-set that is actively use by a
-device.
+not as easy as CPU page table updates. To update the device page table, you must
+allocate a buffer (or use a pool of pre-allocated buffers) and write GPU
+specific commands in it to perform the update (unmap, cache invalidations, and
+flush, ...). This can not be done through common code for all devices. Hence
+why HMM provides helpers to factor out everything that can be while leaving the
+hardware specific details to the device driver.
+
+The second mechanism HMM provides, is a new kind of ZONE_DEVICE memory that
+allows allocating a struct page for each page of the device memory. Those pages
+are special because the CPU can not map them. However, they allow migrating
+main memory to device memory using existing migration mechanisms and everything
+looks like a page is swapped out to disk from the CPU point of view. Using a
+struct page gives the easiest and cleanest integration with existing mm mech-
+anisms. Here again, HMM only provides helpers, first to hotplug new ZONE_DEVICE
+memory for the device memory and second to perform migration. Policy decisions
+of what and when to migrate things is left to the device driver.
+
+Note that any CPU access to a device page triggers a page fault and a migration
+back to main memory. For example, when a page backing a given CPU address A is
+migrated from a main memory page to a device page, then any CPU access to
+address A triggers a page fault and initiates a migration back to main memory.
+
+With these two features, HMM not only allows a device to mirror process address
+space and keeping both CPU and device page table synchronized, but also lever-
+ages device memory by migrating the part of the data-set that is actively being
+used by the device.
 
 
 -------------------------------------------------------------------------------
 
 4) Address space mirroring implementation and API
 
-Address space mirroring main objective is to allow to duplicate range of CPU
-page table into a device page table and HMM helps keeping both synchronize. A
+Address space mirroring's main objective is to allow duplication of a range of
+CPU page table into a device page table; HMM helps keep both synchronized. A
 device driver that want to mirror a process address space must start with the
 registration of an hmm_mirror struct:
 
@@ -155,8 +163,8 @@ device driver that want to mirror a process address space must start with the
                                 struct mm_struct *mm);
 
 The locked variant is to be use when the driver is already holding the mmap_sem
-of the mm in write mode. The mirror struct has a set of callback that are use
-to propagate CPU page table:
+of the mm in write mode. The mirror struct has a set of callbacks that are used
+to propagate CPU page tables:
 
  struct hmm_mirror_ops {
      /* sync_cpu_device_pagetables() - synchronize page tables
@@ -181,13 +189,13 @@ of the mm in write mode. The mirror struct has a set of callback that are use
                      unsigned long end);
  };
 
-Device driver must perform update to the range following action (turn range
-read only, or fully unmap, ...). Once driver callback returns the device must
-be done with the update.
+The device driver must perform the update action to the range (mark range
+read only, or fully unmap, ...). The device must be done with the update before
+the driver callback returns.
 
 
-When device driver wants to populate a range of virtual address it can use
-either:
+When the device driver wants to populate a range of virtual addresses, it can
+use either:
  int hmm_vma_get_pfns(struct vm_area_struct *vma,
                       struct hmm_range *range,
                       unsigned long start,
@@ -201,17 +209,19 @@ When device driver wants to populate a range of virtual address it can use
                    bool write,
                    bool block);
 
-First one (hmm_vma_get_pfns()) will only fetch present CPU page table entry and
-will not trigger a page fault on missing or non present entry. The second one
-do trigger page fault on missing or read only entry if write parameter is true.
-Page fault use the generic mm page fault code path just like a CPU page fault.
+The first one (hmm_vma_get_pfns()) will only fetch present CPU page table
+entries and will not trigger a page fault on missing or non present entries.
+The second one does trigger a page fault on missing or read only entry if the
+write parameter is true. Page faults use the generic mm page fault code path
+just like a CPU page fault.
 
-Both function copy CPU page table into their pfns array argument. Each entry in
-that array correspond to an address in the virtual range. HMM provide a set of
-flags to help driver identify special CPU page table entries.
+Both functions copy CPU page table entries into their pfns array argument. Each
+entry in that array corresponds to an address in the virtual range. HMM
+provides a set of flags to help the driver identify special CPU page table
+entries.
 
 Locking with the update() callback is the most important aspect the driver must
-respect in order to keep things properly synchronize. The usage pattern is :
+respect in order to keep things properly synchronized. The usage pattern is:
 
  int driver_populate_range(...)
  {
@@ -233,43 +243,44 @@ Locking with the update() callback is the most important aspect the driver must
       return 0;
  }
 
-The driver->update lock is the same lock that driver takes inside its update()
-callback. That lock must be call before hmm_vma_range_done() to avoid any race
-with a concurrent CPU page table update.
+The driver->update lock is the same lock that the driver takes inside its
+update() callback. That lock must be held before hmm_vma_range_done() to avoid
+any race with a concurrent CPU page table update.
 
-HMM implements all this on top of the mmu_notifier API because we wanted to a
-simpler API and also to be able to perform optimization latter own like doing
-concurrent device update in multi-devices scenario.
+HMM implements all this on top of the mmu_notifier API because we wanted a
+simpler API and also to be able to perform optimizations later on, like doing
+concurrent device updates in a multi-device scenario.
 
-HMM also serve as an impedence missmatch between how CPU page table update are
-done (by CPU write to the page table and TLB flushes) from how device update
-their own page table. Device update is a multi-step process, first appropriate
-commands are write to a buffer, then this buffer is schedule for execution on
-the device. It is only once the device has executed commands in the buffer that
-the update is done. Creating and scheduling update command buffer can happen
-concurrently for multiple devices. Waiting for each device to report commands
-as executed is serialize (there is no point in doing this concurrently).
+HMM also serves as an impedance mismatch between how CPU page table updates
+are done (by CPU write to the page table and TLB flushes) and how devices
+update their own page table. Device updates are a multi-step process. First,
+appropriate commands are written to a buffer, then this buffer is scheduled for
+execution on the device. It is only once the device has executed commands in
+the buffer that the update is done. Creating and scheduling the update command
+buffer can happen concurrently for multiple devices. Waiting for each device to
+report commands as executed is serialized (there is no point in doing this
+concurrently).
 
 
 -------------------------------------------------------------------------------
 
 5) Represent and manage device memory from core kernel point of view
 
-Several differents design were try to support device memory. First one use
-device specific data structure to keep information about migrated memory and
-HMM hooked itself in various place of mm code to handle any access to address
-that were back by device memory. It turns out that this ended up replicating
-most of the fields of struct page and also needed many kernel code path to be
-updated to understand this new kind of memory.
+Several different designs were tried to support device memory. The first used
+a device specific data structure to keep information about migrated memory and
+HMM hooked itself in various places of mm code to handle any access to
+addresses that were backed by device memory. It turns out that this ended up
+replicating most of the fields of struct page and also needed many kernel code
+paths to be updated to understand this new kind of memory.
 
-Thing is most kernel code path never try to access the memory behind a page
-but only care about struct page contents. Because of this HMM switchted to
-directly using struct page for device memory which left most kernel code path
-un-aware of the difference. We only need to make sure that no one ever try to
-map those page from the CPU side.
+Most kernel code paths never try to access the memory behind a page
+but only care about struct page contents. Because of this, HMM switched to
+directly using struct page for device memory which left most kernel code paths
+unaware of the difference. We only need to make sure that no one ever tries to
+map those pages from the CPU side.
 
-HMM provide a set of helpers to register and hotplug device memory as a new
-region needing struct page. This is offer through a very simple API:
+HMM provides a set of helpers to register and hotplug device memory as a new
+region needing a struct page. This is offered through a very simple API:
 
  struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
                                    struct device *device,
@@ -289,18 +300,19 @@ HMM provide a set of helpers to register and hotplug device memory as a new
  };
 
 The first callback (free()) happens when the last reference on a device page is
-drop. This means the device page is now free and no longer use by anyone. The
-second callback happens whenever CPU try to access a device page which it can
-not do. This second callback must trigger a migration back to system memory.
+dropped. This means the device page is now free and no longer used by anyone.
+The second callback happens whenever the CPU tries to access a device page
+which it can not do. This second callback must trigger a migration back to
+system memory.
 
 
 -------------------------------------------------------------------------------
 
-6) Migrate to and from device memory
+6) Migration to and from device memory
 
-Because CPU can not access device memory, migration must use device DMA engine
-to perform copy from and to device memory. For this we need a new migration
-helper:
+Because the CPU cannot access device memory, migration must use the device DMA
+engine to perform copies to and from device memory. For this we need a new
+migration helper:
 
  int migrate_vma(const struct migrate_vma_ops *ops,
                  struct vm_area_struct *vma,
@@ -311,15 +323,15 @@ to perform copy from and to device memory. For this we need a new migration
                  unsigned long *dst,
                  void *private);
 
-Unlike other migration function it works on a range of virtual address, there
-is two reasons for that. First device DMA copy has a high setup overhead cost
+Unlike other migration functions it works on a range of virtual addresses; there
+are two reasons for that. First, device DMA copy has a high setup overhead cost
 and thus batching multiple pages is needed as otherwise the migration overhead
-make the whole excersie pointless. The second reason is because driver trigger
-such migration base on range of address the device is actively accessing.
+makes the whole exercise pointless. The second reason is because the
+migration might be for a range of addresses the device is actively accessing.
 
-The migrate_vma_ops struct define two callbacks. First one (alloc_and_copy())
-control destination memory allocation and copy operation. Second one is there
-to allow device driver to perform cleanup operation after migration.
+The migrate_vma_ops struct defines two callbacks. The first (alloc_and_copy())
+controls destination memory allocation and the copy operation. The second is there
+to allow the device driver to perform cleanup operations after migration.
 
  struct migrate_vma_ops {
      void (*alloc_and_copy)(struct vm_area_struct *vma,
@@ -336,19 +348,19 @@ to allow device driver to perform cleanup operation after migration.
                               void *private);
  };
 
-It is important to stress that this migration helpers allow for hole in the
+It is important to stress that these migration helpers allow for holes in the
 virtual address range. Some pages in the range might not be migrated for all
-the usual reasons (page is pin, page is lock, ...). This helper does not fail
-but just skip over those pages.
+the usual reasons (page is pinned, page is locked, ...). The helper does not
+fail but just skips over those pages.
 
-The alloc_and_copy() might as well decide to not migrate all pages in the
-range (for reasons under the callback control). For those the callback just
-have to leave the corresponding dst entry empty.
+The alloc_and_copy() might decide to not migrate all pages in the
+range (for reasons under the callback's control). For those, the callback just
+has to leave the corresponding dst entry empty.
 
-Finaly the migration of the struct page might fails (for file back page) for
+Finally, the migration of the struct page might fail (for file backed page) for
 various reasons (failure to freeze reference, or update page cache, ...). If
-that happens then the finalize_and_map() can catch any pages that was not
-migrated. Note those page were still copied to new page and thus we wasted
+that happens, then the finalize_and_map() can catch any pages that were not
+migrated. Note those pages were still copied to a new page and thus we wasted
 bandwidth but this is considered as a rare event and a price that we are
 willing to pay to keep all the code simpler.
 
@@ -358,27 +370,27 @@ willing to pay to keep all the code simpler.
 7) Memory cgroup (memcg) and rss accounting
 
 For now device memory is accounted as any regular page in rss counters (either
-anonymous if device page is use for anonymous, file if device page is use for
-file back page or shmem if device page is use for share memory). This is a
-deliberate choice to keep existing application that might start using device
-memory without knowing about it to keep runing unimpacted.
-
-Drawbacks is that OOM killer might kill an application using a lot of device
-memory and not a lot of regular system memory and thus not freeing much system
-memory. We want to gather more real world experience on how application and
-system react under memory pressure in the presence of device memory before
+anonymous if device page is used for anonymous, file if device page is used for
+file backed page or shmem if device page is used for shared memory). This is a
+deliberate choice to keep existing applications, that might start using device
+memory without knowing about it, running unimpacted.
+
+A drawback is that the OOM killer might kill an application using a lot of
+device memory and not a lot of regular system memory and thus not freeing much
+system memory. We want to gather more real world experience on how applications
+and systems react under memory pressure in the presence of device memory before
 deciding to account device memory differently.
 
 
-Same decision was made for memory cgroup. Device memory page are accounted
+The same decision was made for memory cgroup. Device memory pages are accounted
 against same memory cgroup a regular page would be accounted to. This does
 simplify migration to and from device memory. This also means that migration
 back from device memory to regular memory can not fail because it would
 go above memory cgroup limit. We might revisit this choice latter on once we
-get more experience in how device memory is use and its impact on memory
+get more experience in how device memory is used and its impact on memory
 resource control.
 
 
-Note that device memory can never be pin nor by device driver nor through GUP
+Note that device memory can never be pinned by a device driver nor through GUP
 and thus such memory is always free upon process exit. Or when last reference
-is drop in case of share memory or file back memory.
+is dropped in case of shared memory or file backed memory.
diff --git a/MAINTAINERS b/MAINTAINERS
index 8be7d1382ce9..4db4d4a820b1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6466,6 +6466,7 @@ L:	linux-mm@kvack.org
 S:	Maintained
 F:	mm/hmm*
 F:	include/linux/hmm*
+F:	Documentation/vm/hmm.txt
 
 HOST AP DRIVER
 M:	Jouni Malinen <j@w1.fi>
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH 02/15] mm/hmm: fix header file if/else/endif maze v2
  2018-03-20  2:00 ` jglisse
@ 2018-03-20  2:00   ` jglisse
  -1 siblings, 0 replies; 99+ messages in thread
From: jglisse @ 2018-03-20  2:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse, stable,
	Ralph Campbell, John Hubbard, Evgeny Baskakov

From: Jérôme Glisse <jglisse@redhat.com>

The #if/#else/#endif for IS_ENABLED(CONFIG_HMM) were wrong. Because
of this, after multiple includes there were multiple definitions of both
hmm_mm_init() and hmm_mm_destroy(), leading to a build failure if HMM
was enabled (CONFIG_HMM set).

Changed since v1:
  - Fix the maze when CONFIG_HMM is disabled, not just when it is
    enabled. This fixes the bot-reported build failure.
  - Improved commit message.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
---
 include/linux/hmm.h | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 325017ad9311..36dd21fe5caf 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -498,23 +498,16 @@ struct hmm_device {
 struct hmm_device *hmm_device_new(void *drvdata);
 void hmm_device_put(struct hmm_device *hmm_device);
 #endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-#endif /* IS_ENABLED(CONFIG_HMM) */
 
 /* Below are for HMM internal use only! Not to be used by device driver! */
-#if IS_ENABLED(CONFIG_HMM_MIRROR)
 void hmm_mm_destroy(struct mm_struct *mm);
 
 static inline void hmm_mm_init(struct mm_struct *mm)
 {
 	mm->hmm = NULL;
 }
-#else /* IS_ENABLED(CONFIG_HMM_MIRROR) */
-static inline void hmm_mm_destroy(struct mm_struct *mm) {}
-static inline void hmm_mm_init(struct mm_struct *mm) {}
-#endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
-
-
 #else /* IS_ENABLED(CONFIG_HMM) */
 static inline void hmm_mm_destroy(struct mm_struct *mm) {}
 static inline void hmm_mm_init(struct mm_struct *mm) {}
+#endif /* IS_ENABLED(CONFIG_HMM) */
 #endif /* LINUX_HMM_H */
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH 03/15] mm/hmm: HMM should have a callback before MM is destroyed v2
  2018-03-20  2:00 ` jglisse
@ 2018-03-20  2:00   ` jglisse
  -1 siblings, 0 replies; 99+ messages in thread
From: jglisse @ 2018-03-20  2:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Ralph Campbell,
	Jérôme Glisse, stable, Evgeny Baskakov, Mark Hairgrove,
	John Hubbard

From: Ralph Campbell <rcampbell@nvidia.com>

The hmm_mirror_register() function registers a callback for when
the CPU pagetable is modified. Normally, the device driver will
call hmm_mirror_unregister() when the process using the device is
finished. However, if the process exits uncleanly, the mm_struct
can be destroyed with no warning to the device driver.
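
As an illustration only (all my_* names below are hypothetical, not part
of this series), a driver that embeds the mirror in its own context could
hook the new callback roughly like this:

  static void my_mirror_release(struct hmm_mirror *mirror)
  {
          struct my_context *ctx;

          ctx = container_of(mirror, struct my_context, mirror);
          /* The mm is going away: stop device access and drop mirrored entries. */
          my_device_stop_using_mm(ctx);
          my_device_invalidate_all(ctx);
  }

  static const struct hmm_mirror_ops my_mirror_ops = {
          .release                    = my_mirror_release,
          .sync_cpu_device_pagetables = my_sync_cpu_device_pagetables,
  };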

Changed since v1:
  - dropped VM_BUG_ON()
  - cc stable

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: stable@vger.kernel.org
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/hmm.h | 10 ++++++++++
 mm/hmm.c            | 18 +++++++++++++++++-
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 36dd21fe5caf..fa7b51f65905 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -218,6 +218,16 @@ enum hmm_update_type {
  * @update: callback to update range on a device
  */
 struct hmm_mirror_ops {
+	/* release() - release hmm_mirror
+	 *
+	 * @mirror: pointer to struct hmm_mirror
+	 *
+	 * This is called when the mm_struct is being released.
+	 * The callback should make sure no references to the mirror occur
+	 * after the callback returns.
+	 */
+	void (*release)(struct hmm_mirror *mirror);
+
 	/* sync_cpu_device_pagetables() - synchronize page tables
 	 *
 	 * @mirror: pointer to struct hmm_mirror
diff --git a/mm/hmm.c b/mm/hmm.c
index 320545b98ff5..6088fa6ed137 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -160,6 +160,21 @@ static void hmm_invalidate_range(struct hmm *hmm,
 	up_read(&hmm->mirrors_sem);
 }
 
+static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	struct hmm *hmm = mm->hmm;
+	struct hmm_mirror *mirror;
+	struct hmm_mirror *mirror_next;
+
+	down_write(&hmm->mirrors_sem);
+	list_for_each_entry_safe(mirror, mirror_next, &hmm->mirrors, list) {
+		list_del_init(&mirror->list);
+		if (mirror->ops->release)
+			mirror->ops->release(mirror);
+	}
+	up_write(&hmm->mirrors_sem);
+}
+
 static void hmm_invalidate_range_start(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
 				       unsigned long start,
@@ -185,6 +200,7 @@ static void hmm_invalidate_range_end(struct mmu_notifier *mn,
 }
 
 static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
+	.release		= hmm_release,
 	.invalidate_range_start	= hmm_invalidate_range_start,
 	.invalidate_range_end	= hmm_invalidate_range_end,
 };
@@ -230,7 +246,7 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror)
 	struct hmm *hmm = mirror->hmm;
 
 	down_write(&hmm->mirrors_sem);
-	list_del(&mirror->list);
+	list_del_init(&mirror->list);
 	up_write(&hmm->mirrors_sem);
 }
 EXPORT_SYMBOL(hmm_mirror_unregister);
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH 04/15] mm/hmm: unregister mmu_notifier when last HMM client quit
  2018-03-20  2:00 ` jglisse
@ 2018-03-20  2:00   ` jglisse
  -1 siblings, 0 replies; 99+ messages in thread
From: jglisse @ 2018-03-20  2:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Evgeny Baskakov, Ralph Campbell, Mark Hairgrove, John Hubbard

From: Jérôme Glisse <jglisse@redhat.com>

This code was lost in translation at one point. Properly call
mmu_notifier_unregister_no_release() once the last user is gone. This
fixes the zombie mm_struct problem, as without this patch we do not
drop the refcount we hold on it.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 mm/hmm.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/mm/hmm.c b/mm/hmm.c
index 6088fa6ed137..667944630dc9 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -244,10 +244,29 @@ EXPORT_SYMBOL(hmm_mirror_register);
 void hmm_mirror_unregister(struct hmm_mirror *mirror)
 {
 	struct hmm *hmm = mirror->hmm;
+	struct mm_struct *mm = NULL;
+	bool unregister = false;
 
 	down_write(&hmm->mirrors_sem);
 	list_del_init(&mirror->list);
+	unregister = list_empty(&hmm->mirrors);
 	up_write(&hmm->mirrors_sem);
+
+	if (!unregister)
+		return;
+
+	spin_lock(&hmm->mm->page_table_lock);
+	if (hmm->mm->hmm == hmm) {
+		mm = hmm->mm;
+		mm->hmm = NULL;
+	}
+	spin_unlock(&hmm->mm->page_table_lock);
+
+	if (mm == NULL)
+		return;
+
+	mmu_notifier_unregister_no_release(&hmm->mmu_notifier, mm);
+	kfree(hmm);
 }
 EXPORT_SYMBOL(hmm_mirror_unregister);
 
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH 05/15] mm/hmm: hmm_pfns_bad() was accessing wrong struct
  2018-03-20  2:00 ` jglisse
@ 2018-03-20  2:00   ` jglisse
  -1 siblings, 0 replies; 99+ messages in thread
From: jglisse @ 2018-03-20  2:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse, stable,
	Evgeny Baskakov, Ralph Campbell, Mark Hairgrove, John Hubbard

From: Jérôme Glisse <jglisse@redhat.com>

The private field of the mm_walk struct points to an hmm_vma_walk struct
and not to the desired hmm_range struct. Fix it to get the proper struct
pointer.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: stable@vger.kernel.org
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 mm/hmm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 667944630dc9..f5631e1a7319 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -312,7 +312,8 @@ static int hmm_pfns_bad(unsigned long addr,
 			unsigned long end,
 			struct mm_walk *walk)
 {
-	struct hmm_range *range = walk->private;
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
 	hmm_pfn_t *pfns = range->pfns;
 	unsigned long i;
 
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH 06/15] mm/hmm: use struct for hmm_vma_fault(), hmm_vma_get_pfns() parameters v2
  2018-03-20  2:00 ` jglisse
@ 2018-03-20  2:00   ` jglisse
  -1 siblings, 0 replies; 99+ messages in thread
From: jglisse @ 2018-03-20  2:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Evgeny Baskakov, Ralph Campbell, Mark Hairgrove

From: Jérôme Glisse <jglisse@redhat.com>

Both hmm_vma_fault() and hmm_vma_get_pfns() were taking an hmm_range
struct as a parameter and were initializing that struct from their other
parameters. Have the callers of those functions do this, as they most
likely already do, and only pass this struct to both functions. This
shortens the function signatures and makes it easier to add new
parameters in the future by simply adding them to the structure.
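
For illustration, the caller side now looks roughly like this (sketch
only; vma, start, end, pfns, write and block stand for whatever the
driver already computed):

  struct hmm_range range;

  range.vma = vma;
  range.start = start;
  range.end = end;
  range.pfns = pfns;

  /* Previously: hmm_vma_fault(vma, &range, start, end, pfns, write, block) */
  ret = hmm_vma_fault(&range, write, block);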

Changed since v1:
  - Improved comments

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
---
 include/linux/hmm.h | 18 ++++---------
 mm/hmm.c            | 78 +++++++++++++++++++----------------------------------
 2 files changed, 33 insertions(+), 63 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index fa7b51f65905..d0d6760cdada 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -274,6 +274,7 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
 /*
  * struct hmm_range - track invalidation lock on virtual address range
  *
+ * @vma: the vm area struct for the range
  * @list: all range lock are on a list
  * @start: range virtual start address (inclusive)
  * @end: range virtual end address (exclusive)
@@ -281,6 +282,7 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
  * @valid: pfns array did not change since it has been fill by an HMM function
  */
 struct hmm_range {
+	struct vm_area_struct	*vma;
 	struct list_head	list;
 	unsigned long		start;
 	unsigned long		end;
@@ -301,12 +303,8 @@ struct hmm_range {
  *
  * IF YOU DO NOT FOLLOW THE ABOVE RULE THE SNAPSHOT CONTENT MIGHT BE INVALID !
  */
-int hmm_vma_get_pfns(struct vm_area_struct *vma,
-		     struct hmm_range *range,
-		     unsigned long start,
-		     unsigned long end,
-		     hmm_pfn_t *pfns);
-bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range);
+int hmm_vma_get_pfns(struct hmm_range *range);
+bool hmm_vma_range_done(struct hmm_range *range);
 
 
 /*
@@ -327,13 +325,7 @@ bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range);
  *
  * See the function description in mm/hmm.c for further documentation.
  */
-int hmm_vma_fault(struct vm_area_struct *vma,
-		  struct hmm_range *range,
-		  unsigned long start,
-		  unsigned long end,
-		  hmm_pfn_t *pfns,
-		  bool write,
-		  bool block);
+int hmm_vma_fault(struct hmm_range *range, bool write, bool block);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
diff --git a/mm/hmm.c b/mm/hmm.c
index f5631e1a7319..73d4b33d787f 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -509,11 +509,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 
 /*
  * hmm_vma_get_pfns() - snapshot CPU page table for a range of virtual addresses
- * @vma: virtual memory area containing the virtual address range
- * @range: used to track snapshot validity
- * @start: range virtual start address (inclusive)
- * @end: range virtual end address (exclusive)
- * @entries: array of hmm_pfn_t: provided by the caller, filled in by function
+ * @range: range being snapshotted
  * Returns: -EINVAL if invalid argument, -ENOMEM out of memory, 0 success
  *
  * This snapshots the CPU page table for a range of virtual addresses. Snapshot
@@ -527,26 +523,23 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
  * NOT CALLING hmm_vma_range_done() IF FUNCTION RETURNS 0 WILL LEAD TO SERIOUS
  * MEMORY CORRUPTION ! YOU HAVE BEEN WARNED !
  */
-int hmm_vma_get_pfns(struct vm_area_struct *vma,
-		     struct hmm_range *range,
-		     unsigned long start,
-		     unsigned long end,
-		     hmm_pfn_t *pfns)
+int hmm_vma_get_pfns(struct hmm_range *range)
 {
+	struct vm_area_struct *vma = range->vma;
 	struct hmm_vma_walk hmm_vma_walk;
 	struct mm_walk mm_walk;
 	struct hmm *hmm;
 
 	/* FIXME support hugetlb fs */
 	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
-		hmm_pfns_special(pfns, start, end);
+		hmm_pfns_special(range->pfns, range->start, range->end);
 		return -EINVAL;
 	}
 
 	/* Sanity check, this really should not happen ! */
-	if (start < vma->vm_start || start >= vma->vm_end)
+	if (range->start < vma->vm_start || range->start >= vma->vm_end)
 		return -EINVAL;
-	if (end < vma->vm_start || end > vma->vm_end)
+	if (range->end < vma->vm_start || range->end > vma->vm_end)
 		return -EINVAL;
 
 	hmm = hmm_register(vma->vm_mm);
@@ -557,9 +550,6 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
 		return -EINVAL;
 
 	/* Initialize range to track CPU page table update */
-	range->start = start;
-	range->pfns = pfns;
-	range->end = end;
 	spin_lock(&hmm->lock);
 	range->valid = true;
 	list_add_rcu(&range->list, &hmm->ranges);
@@ -577,14 +567,13 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
 	mm_walk.pmd_entry = hmm_vma_walk_pmd;
 	mm_walk.pte_hole = hmm_vma_walk_hole;
 
-	walk_page_range(start, end, &mm_walk);
+	walk_page_range(range->start, range->end, &mm_walk);
 	return 0;
 }
 EXPORT_SYMBOL(hmm_vma_get_pfns);
 
 /*
  * hmm_vma_range_done() - stop tracking change to CPU page table over a range
- * @vma: virtual memory area containing the virtual address range
  * @range: range being tracked
  * Returns: false if range data has been invalidated, true otherwise
  *
@@ -604,10 +593,10 @@ EXPORT_SYMBOL(hmm_vma_get_pfns);
  *
  * There are two ways to use this :
  * again:
- *   hmm_vma_get_pfns(vma, range, start, end, pfns); or hmm_vma_fault(...);
+ *   hmm_vma_get_pfns(range); or hmm_vma_fault(...);
  *   trans = device_build_page_table_update_transaction(pfns);
  *   device_page_table_lock();
- *   if (!hmm_vma_range_done(vma, range)) {
+ *   if (!hmm_vma_range_done(range)) {
  *     device_page_table_unlock();
  *     goto again;
  *   }
@@ -615,13 +604,13 @@ EXPORT_SYMBOL(hmm_vma_get_pfns);
  *   device_page_table_unlock();
  *
  * Or:
- *   hmm_vma_get_pfns(vma, range, start, end, pfns); or hmm_vma_fault(...);
+ *   hmm_vma_get_pfns(range); or hmm_vma_fault(...);
  *   device_page_table_lock();
- *   hmm_vma_range_done(vma, range);
- *   device_update_page_table(pfns);
+ *   hmm_vma_range_done(range);
+ *   device_update_page_table(range->pfns);
  *   device_page_table_unlock();
  */
-bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range)
+bool hmm_vma_range_done(struct hmm_range *range)
 {
 	unsigned long npages = (range->end - range->start) >> PAGE_SHIFT;
 	struct hmm *hmm;
@@ -631,7 +620,7 @@ bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range)
 		return false;
 	}
 
-	hmm = hmm_register(vma->vm_mm);
+	hmm = hmm_register(range->vma->vm_mm);
 	if (!hmm) {
 		memset(range->pfns, 0, sizeof(*range->pfns) * npages);
 		return false;
@@ -647,11 +636,7 @@ EXPORT_SYMBOL(hmm_vma_range_done);
 
 /*
  * hmm_vma_fault() - try to fault some address in a virtual address range
- * @vma: virtual memory area containing the virtual address range
- * @range: use to track pfns array content validity
- * @start: fault range virtual start address (inclusive)
- * @end: fault range virtual end address (exclusive)
- * @pfns: array of hmm_pfn_t, only entry with fault flag set will be faulted
+ * @range: range being faulted
  * @write: is it a write fault
  * @block: allow blocking on fault (if true it sleeps and do not drop mmap_sem)
  * Returns: 0 success, error otherwise (-EAGAIN means mmap_sem have been drop)
@@ -667,10 +652,10 @@ EXPORT_SYMBOL(hmm_vma_range_done);
  *   down_read(&mm->mmap_sem);
  *   // Find vma and address device wants to fault, initialize hmm_pfn_t
  *   // array accordingly
- *   ret = hmm_vma_fault(vma, start, end, pfns, allow_retry);
+ *   ret = hmm_vma_fault(range, write, block);
  *   switch (ret) {
  *   case -EAGAIN:
- *     hmm_vma_range_done(vma, range);
+ *     hmm_vma_range_done(range);
  *     // You might want to rate limit or yield to play nicely, you may
  *     // also commit any valid pfn in the array assuming that you are
  *     // getting true from hmm_vma_range_monitor_end()
@@ -684,7 +669,7 @@ EXPORT_SYMBOL(hmm_vma_range_done);
  *   }
  *   // Take device driver lock that serialize device page table update
  *   driver_lock_device_page_table_update();
- *   hmm_vma_range_done(vma, range);
+ *   hmm_vma_range_done(range);
  *   // Commit pfns we got from hmm_vma_fault()
  *   driver_unlock_device_page_table_update();
  *   up_read(&mm->mmap_sem)
@@ -694,28 +679,24 @@ EXPORT_SYMBOL(hmm_vma_range_done);
  *
  * YOU HAVE BEEN WARNED !
  */
-int hmm_vma_fault(struct vm_area_struct *vma,
-		  struct hmm_range *range,
-		  unsigned long start,
-		  unsigned long end,
-		  hmm_pfn_t *pfns,
-		  bool write,
-		  bool block)
+int hmm_vma_fault(struct hmm_range *range, bool write, bool block)
 {
+	struct vm_area_struct *vma = range->vma;
+	unsigned long start = range->start;
 	struct hmm_vma_walk hmm_vma_walk;
 	struct mm_walk mm_walk;
 	struct hmm *hmm;
 	int ret;
 
 	/* Sanity check, this really should not happen ! */
-	if (start < vma->vm_start || start >= vma->vm_end)
+	if (range->start < vma->vm_start || range->start >= vma->vm_end)
 		return -EINVAL;
-	if (end < vma->vm_start || end > vma->vm_end)
+	if (range->end < vma->vm_start || range->end > vma->vm_end)
 		return -EINVAL;
 
 	hmm = hmm_register(vma->vm_mm);
 	if (!hmm) {
-		hmm_pfns_clear(pfns, start, end);
+		hmm_pfns_clear(range->pfns, range->start, range->end);
 		return -ENOMEM;
 	}
 	/* Caller must have registered a mirror using hmm_mirror_register() */
@@ -723,9 +704,6 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 		return -EINVAL;
 
 	/* Initialize range to track CPU page table update */
-	range->start = start;
-	range->pfns = pfns;
-	range->end = end;
 	spin_lock(&hmm->lock);
 	range->valid = true;
 	list_add_rcu(&range->list, &hmm->ranges);
@@ -733,7 +711,7 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 
 	/* FIXME support hugetlb fs */
 	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
-		hmm_pfns_special(pfns, start, end);
+		hmm_pfns_special(range->pfns, range->start, range->end);
 		return 0;
 	}
 
@@ -753,7 +731,7 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 	mm_walk.pte_hole = hmm_vma_walk_hole;
 
 	do {
-		ret = walk_page_range(start, end, &mm_walk);
+		ret = walk_page_range(start, range->end, &mm_walk);
 		start = hmm_vma_walk.last;
 	} while (ret == -EAGAIN);
 
@@ -761,8 +739,8 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 		unsigned long i;
 
 		i = (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
-		hmm_pfns_clear(&pfns[i], hmm_vma_walk.last, end);
-		hmm_vma_range_done(vma, range);
+		hmm_pfns_clear(&range->pfns[i], hmm_vma_walk.last, range->end);
+		hmm_vma_range_done(range);
 	}
 	return ret;
 }
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH 07/15] mm/hmm: remove HMM_PFN_READ flag and ignore peculiar architecture v2
  2018-03-20  2:00 ` jglisse
@ 2018-03-20  2:00   ` jglisse
  -1 siblings, 0 replies; 99+ messages in thread
From: jglisse @ 2018-03-20  2:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Evgeny Baskakov, Ralph Campbell, Mark Hairgrove

From: Jérôme Glisse <jglisse@redhat.com>

Only peculiar architectures allow write without read, thus assume that
any valid pfn also allows read. Note that we do not care about write-only
because it does not make sense with things like atomic compare and
exchange, or any other operation that lets you get the memory value
through it.
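
From the device driver side the check becomes something like this (sketch
only, the my_* helpers and ctx/addr are hypothetical):

  if (pfns[i] & HMM_PFN_VALID) {
          /* A valid entry now implies at least read permission. */
          my_device_map_read(ctx, addr, hmm_pfn_t_to_page(pfns[i]));
          if (pfns[i] & HMM_PFN_WRITE)
                  my_device_map_write(ctx, addr);
  }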

Changed since v1:
  - Fail early on vma without read permission and return an error
  - Improve comments

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
---
 include/linux/hmm.h | 16 +++++++---------
 mm/hmm.c            | 44 ++++++++++++++++++++++++++++++++++----------
 2 files changed, 41 insertions(+), 19 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index d0d6760cdada..dd907f614dfe 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -83,8 +83,7 @@ struct hmm;
  * hmm_pfn_t - HMM uses its own pfn type to keep several flags per page
  *
  * Flags:
- * HMM_PFN_VALID: pfn is valid
- * HMM_PFN_READ:  CPU page table has read permission set
+ * HMM_PFN_VALID: pfn is valid. It has, at least, read permission.
  * HMM_PFN_WRITE: CPU page table has write permission set
  * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned memory
  * HMM_PFN_EMPTY: corresponding CPU page table entry is pte_none()
@@ -97,13 +96,12 @@ struct hmm;
 typedef unsigned long hmm_pfn_t;
 
 #define HMM_PFN_VALID (1 << 0)
-#define HMM_PFN_READ (1 << 1)
-#define HMM_PFN_WRITE (1 << 2)
-#define HMM_PFN_ERROR (1 << 3)
-#define HMM_PFN_EMPTY (1 << 4)
-#define HMM_PFN_SPECIAL (1 << 5)
-#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 6)
-#define HMM_PFN_SHIFT 7
+#define HMM_PFN_WRITE (1 << 1)
+#define HMM_PFN_ERROR (1 << 2)
+#define HMM_PFN_EMPTY (1 << 3)
+#define HMM_PFN_SPECIAL (1 << 4)
+#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 5)
+#define HMM_PFN_SHIFT 6
 
 /*
  * hmm_pfn_t_to_page() - return struct page pointed to by a valid hmm_pfn_t
diff --git a/mm/hmm.c b/mm/hmm.c
index 73d4b33d787f..3e61c5af1f98 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -393,11 +393,9 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 	hmm_pfn_t *pfns = range->pfns;
 	unsigned long addr = start, i;
 	bool write_fault;
-	hmm_pfn_t flag;
 	pte_t *ptep;
 
 	i = (addr - range->start) >> PAGE_SHIFT;
-	flag = vma->vm_flags & VM_READ ? HMM_PFN_READ : 0;
 	write_fault = hmm_vma_walk->fault & hmm_vma_walk->write;
 
 again:
@@ -409,6 +407,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 
 	if (pmd_devmap(*pmdp) || pmd_trans_huge(*pmdp)) {
 		unsigned long pfn;
+		hmm_pfn_t flag = 0;
 		pmd_t pmd;
 
 		/*
@@ -473,7 +472,6 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 				} else if (write_fault)
 					goto fault;
 				pfns[i] |= HMM_PFN_DEVICE_UNADDRESSABLE;
-				pfns[i] |= flag;
 			} else if (is_migration_entry(entry)) {
 				if (hmm_vma_walk->fault) {
 					pte_unmap(ptep);
@@ -493,7 +491,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		if (write_fault && !pte_write(pte))
 			goto fault;
 
-		pfns[i] = hmm_pfn_t_from_pfn(pte_pfn(pte)) | flag;
+		pfns[i] = hmm_pfn_t_from_pfn(pte_pfn(pte));
 		pfns[i] |= pte_write(pte) ? HMM_PFN_WRITE : 0;
 		continue;
 
@@ -510,7 +508,8 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 /*
  * hmm_vma_get_pfns() - snapshot CPU page table for a range of virtual addresses
  * @range: range being snapshotted
- * Returns: -EINVAL if invalid argument, -ENOMEM out of memory, 0 success
+ * Returns: -EINVAL if invalid argument, -ENOMEM out of memory, -EPERM invalid
+ *          vma permission, 0 success
  *
  * This snapshots the CPU page table for a range of virtual addresses. Snapshot
  * validity is tracked by range struct. See hmm_vma_range_done() for further
@@ -549,6 +548,17 @@ int hmm_vma_get_pfns(struct hmm_range *range)
 	if (!hmm->mmu_notifier.ops)
 		return -EINVAL;
 
+	if (!(vma->vm_flags & VM_READ)) {
+		/*
+		 * If vma do not allow read access, then assume that it does
+		 * not allow write access, either. Architecture that allow
+		 * write without read access are not supported by HMM, because
+		 * operations such has atomic access would not work.
+		 */
+		hmm_pfns_clear(range->pfns, range->start, range->end);
+		return -EPERM;
+	}
+
 	/* Initialize range to track CPU page table update */
 	spin_lock(&hmm->lock);
 	range->valid = true;
@@ -662,6 +672,9 @@ EXPORT_SYMBOL(hmm_vma_range_done);
  *     goto retry;
  *   case 0:
  *     break;
+ *   case -ENOMEM:
+ *   case -EINVAL:
+ *   case -EPERM:
  *   default:
  *     // Handle error !
  *     up_read(&mm->mmap_sem)
@@ -703,11 +716,16 @@ int hmm_vma_fault(struct hmm_range *range, bool write, bool block)
 	if (!hmm->mmu_notifier.ops)
 		return -EINVAL;
 
-	/* Initialize range to track CPU page table update */
-	spin_lock(&hmm->lock);
-	range->valid = true;
-	list_add_rcu(&range->list, &hmm->ranges);
-	spin_unlock(&hmm->lock);
+	if (!(vma->vm_flags & VM_READ)) {
+		/*
+		 * If vma do not allow read access, then assume that it does
+		 * not allow write access, either. Architecture that allow
+		 * write without read access are not supported by HMM, because
+		 * operations such has atomic access would not work.
+		 */
+		hmm_pfns_clear(range->pfns, range->start, range->end);
+		return -EPERM;
+	}
 
 	/* FIXME support hugetlb fs */
 	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
@@ -715,6 +733,12 @@ int hmm_vma_fault(struct hmm_range *range, bool write, bool block)
 		return 0;
 	}
 
+	/* Initialize range to track CPU page table update */
+	spin_lock(&hmm->lock);
+	range->valid = true;
+	list_add_rcu(&range->list, &hmm->ranges);
+	spin_unlock(&hmm->lock);
+
 	hmm_vma_walk.fault = true;
 	hmm_vma_walk.write = write;
 	hmm_vma_walk.block = block;
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 99+ messages in thread
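
As an illustration of the return values documented above, here is a
hypothetical driver-side sketch (not part of the patch): the driver_*()
helpers are placeholder names for driver specific locking and device
page table update, error handling is simplified, and the caller is
assumed to have filled range->start, range->end and range->pfns.

#include <linux/hmm.h>
#include <linux/mm.h>

static int driver_snapshot_range(struct mm_struct *mm,
				 struct hmm_range *range)
{
	int ret;

again:
	down_read(&mm->mmap_sem);
	range->vma = find_vma(mm, range->start);
	if (!range->vma) {
		up_read(&mm->mmap_sem);
		return -EFAULT;
	}
	ret = hmm_vma_get_pfns(range);
	if (ret) {
		/* -EPERM: the vma has no read permission at all. */
		up_read(&mm->mmap_sem);
		return ret;
	}

	driver_lock_device_page_table_update();
	if (!hmm_vma_range_done(range)) {
		/* CPU page table changed since the snapshot, redo it. */
		driver_unlock_device_page_table_update();
		up_read(&mm->mmap_sem);
		goto again;
	}
	driver_commit_pfns_to_device_page_table(range);
	driver_unlock_device_page_table_update();
	up_read(&mm->mmap_sem);
	return 0;
}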

* [PATCH 08/15] mm/hmm: use uint64_t for HMM pfn instead of defining hmm_pfn_t to ulong v2
  2018-03-20  2:00 ` jglisse
@ 2018-03-20  2:00   ` jglisse
  -1 siblings, 0 replies; 99+ messages in thread
From: jglisse @ 2018-03-20  2:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Evgeny Baskakov, Ralph Campbell, Mark Hairgrove

From: Jérôme Glisse <jglisse@redhat.com>

All the device drivers we care about use 64-bit page table entries. In
order to match this, and to avoid a useless typedef, convert all HMM
pfns to plain uint64_t. This is a first step toward letting drivers use
the pfn values returned by HMM directly (saving the memory and CPU
cycles spent converting between the two).
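
For illustration, a hypothetical driver-side helper (the MY_DEV_PTE_*
bits are an invented device page table format, not anything defined by
this series) showing how such a 64-bit entry can be consumed directly:

#include <linux/hmm.h>

#define MY_DEV_PTE_VALID (1ULL << 0)	/* hypothetical device PTE bit */
#define MY_DEV_PTE_WRITE (1ULL << 1)	/* hypothetical device PTE bit */

static uint64_t my_dev_pte_from_hmm_pfn(uint64_t entry)
{
	uint64_t dpte;

	if (!(entry & HMM_PFN_VALID))
		return 0;	/* nothing mapped, leave device entry invalid */

	dpte = ((uint64_t)hmm_pfn_to_pfn(entry) << PAGE_SHIFT) | MY_DEV_PTE_VALID;
	if (entry & HMM_PFN_WRITE)
		dpte |= MY_DEV_PTE_WRITE;
	return dpte;
}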

Changed since v1:
  - Fix comments typos

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
---
 include/linux/hmm.h | 46 +++++++++++++++++++++-------------------------
 mm/hmm.c            | 26 +++++++++++++-------------
 2 files changed, 34 insertions(+), 38 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index dd907f614dfe..54d684fe3b90 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -80,8 +80,6 @@
 struct hmm;
 
 /*
- * hmm_pfn_t - HMM uses its own pfn type to keep several flags per page
- *
  * Flags:
  * HMM_PFN_VALID: pfn is valid. It has, at least, read permission.
  * HMM_PFN_WRITE: CPU page table has write permission set
@@ -93,8 +91,6 @@ struct hmm;
  *      set and the pfn value is undefined.
  * HMM_PFN_DEVICE_UNADDRESSABLE: unaddressable device memory (ZONE_DEVICE)
  */
-typedef unsigned long hmm_pfn_t;
-
 #define HMM_PFN_VALID (1 << 0)
 #define HMM_PFN_WRITE (1 << 1)
 #define HMM_PFN_ERROR (1 << 2)
@@ -104,14 +100,14 @@ typedef unsigned long hmm_pfn_t;
 #define HMM_PFN_SHIFT 6
 
 /*
- * hmm_pfn_t_to_page() - return struct page pointed to by a valid hmm_pfn_t
- * @pfn: hmm_pfn_t to convert to struct page
- * Returns: struct page pointer if pfn is a valid hmm_pfn_t, NULL otherwise
+ * hmm_pfn_to_page() - return struct page pointed to by a valid HMM pfn
+ * @pfn: HMM pfn value to get corresponding struct page from
+ * Returns: struct page pointer if pfn is a valid HMM pfn, NULL otherwise
  *
- * If the hmm_pfn_t is valid (ie valid flag set) then return the struct page
- * matching the pfn value stored in the hmm_pfn_t. Otherwise return NULL.
+ * If the HMM pfn is valid (ie valid flag set) then return the struct page
+ * matching the pfn value stored in the HMM pfn. Otherwise return NULL.
  */
-static inline struct page *hmm_pfn_t_to_page(hmm_pfn_t pfn)
+static inline struct page *hmm_pfn_to_page(uint64_t pfn)
 {
 	if (!(pfn & HMM_PFN_VALID))
 		return NULL;
@@ -119,11 +115,11 @@ static inline struct page *hmm_pfn_t_to_page(hmm_pfn_t pfn)
 }
 
 /*
- * hmm_pfn_t_to_pfn() - return pfn value store in a hmm_pfn_t
- * @pfn: hmm_pfn_t to extract pfn from
- * Returns: pfn value if hmm_pfn_t is valid, -1UL otherwise
+ * hmm_pfn_to_pfn() - return the pfn value stored in an HMM pfn
+ * @pfn: HMM pfn value to extract pfn from
+ * Returns: pfn value if HMM pfn is valid, -1UL otherwise
  */
-static inline unsigned long hmm_pfn_t_to_pfn(hmm_pfn_t pfn)
+static inline unsigned long hmm_pfn_to_pfn(uint64_t pfn)
 {
 	if (!(pfn & HMM_PFN_VALID))
 		return -1UL;
@@ -131,21 +127,21 @@ static inline unsigned long hmm_pfn_t_to_pfn(hmm_pfn_t pfn)
 }
 
 /*
- * hmm_pfn_t_from_page() - create a valid hmm_pfn_t value from struct page
- * @page: struct page pointer for which to create the hmm_pfn_t
- * Returns: valid hmm_pfn_t for the page
+ * hmm_pfn_from_page() - create a valid HMM pfn value from struct page
+ * @page: struct page pointer for which to create the HMM pfn
+ * Returns: valid HMM pfn for the page
  */
-static inline hmm_pfn_t hmm_pfn_t_from_page(struct page *page)
+static inline uint64_t hmm_pfn_from_page(struct page *page)
 {
 	return (page_to_pfn(page) << HMM_PFN_SHIFT) | HMM_PFN_VALID;
 }
 
 /*
- * hmm_pfn_t_from_pfn() - create a valid hmm_pfn_t value from pfn
- * @pfn: pfn value for which to create the hmm_pfn_t
- * Returns: valid hmm_pfn_t for the pfn
+ * hmm_pfn_from_pfn() - create a valid HMM pfn value from pfn
+ * @pfn: pfn value for which to create the HMM pfn
+ * Returns: valid HMM pfn for the pfn
  */
-static inline hmm_pfn_t hmm_pfn_t_from_pfn(unsigned long pfn)
+static inline uint64_t hmm_pfn_from_pfn(unsigned long pfn)
 {
 	return (pfn << HMM_PFN_SHIFT) | HMM_PFN_VALID;
 }
@@ -284,7 +280,7 @@ struct hmm_range {
 	struct list_head	list;
 	unsigned long		start;
 	unsigned long		end;
-	hmm_pfn_t		*pfns;
+	uint64_t		*pfns;
 	bool			valid;
 };
 
@@ -307,7 +303,7 @@ bool hmm_vma_range_done(struct hmm_range *range);
 
 /*
  * Fault memory on behalf of device driver. Unlike handle_mm_fault(), this will
- * not migrate any device memory back to system memory. The hmm_pfn_t array will
+ * not migrate any device memory back to system memory. The HMM pfn array will
  * be updated with the fault result and current snapshot of the CPU page table
  * for the range.
  *
@@ -316,7 +312,7 @@ bool hmm_vma_range_done(struct hmm_range *range);
  * function returns -EAGAIN.
  *
  * Return value does not reflect if the fault was successful for every single
- * address or not. Therefore, the caller must to inspect the hmm_pfn_t array to
+ * address or not. Therefore, the caller must inspect the HMM pfn array to
  * determine fault status for each address.
  *
  * Trying to fault inside an invalid vma will result in -EINVAL.
diff --git a/mm/hmm.c b/mm/hmm.c
index 3e61c5af1f98..b4db0b1b709a 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -280,7 +280,7 @@ struct hmm_vma_walk {
 
 static int hmm_vma_do_fault(struct mm_walk *walk,
 			    unsigned long addr,
-			    hmm_pfn_t *pfn)
+			    uint64_t *pfn)
 {
 	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_REMOTE;
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
@@ -300,7 +300,7 @@ static int hmm_vma_do_fault(struct mm_walk *walk,
 	return -EAGAIN;
 }
 
-static void hmm_pfns_special(hmm_pfn_t *pfns,
+static void hmm_pfns_special(uint64_t *pfns,
 			     unsigned long addr,
 			     unsigned long end)
 {
@@ -314,7 +314,7 @@ static int hmm_pfns_bad(unsigned long addr,
 {
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
 	struct hmm_range *range = hmm_vma_walk->range;
-	hmm_pfn_t *pfns = range->pfns;
+	uint64_t *pfns = range->pfns;
 	unsigned long i;
 
 	i = (addr - range->start) >> PAGE_SHIFT;
@@ -324,7 +324,7 @@ static int hmm_pfns_bad(unsigned long addr,
 	return 0;
 }
 
-static void hmm_pfns_clear(hmm_pfn_t *pfns,
+static void hmm_pfns_clear(uint64_t *pfns,
 			   unsigned long addr,
 			   unsigned long end)
 {
@@ -338,7 +338,7 @@ static int hmm_vma_walk_hole(unsigned long addr,
 {
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
 	struct hmm_range *range = hmm_vma_walk->range;
-	hmm_pfn_t *pfns = range->pfns;
+	uint64_t *pfns = range->pfns;
 	unsigned long i;
 
 	hmm_vma_walk->last = addr;
@@ -363,7 +363,7 @@ static int hmm_vma_walk_clear(unsigned long addr,
 {
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
 	struct hmm_range *range = hmm_vma_walk->range;
-	hmm_pfn_t *pfns = range->pfns;
+	uint64_t *pfns = range->pfns;
 	unsigned long i;
 
 	hmm_vma_walk->last = addr;
@@ -390,7 +390,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
 	struct hmm_range *range = hmm_vma_walk->range;
 	struct vm_area_struct *vma = walk->vma;
-	hmm_pfn_t *pfns = range->pfns;
+	uint64_t *pfns = range->pfns;
 	unsigned long addr = start, i;
 	bool write_fault;
 	pte_t *ptep;
@@ -407,7 +407,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 
 	if (pmd_devmap(*pmdp) || pmd_trans_huge(*pmdp)) {
 		unsigned long pfn;
-		hmm_pfn_t flag = 0;
+		uint64_t flag = 0;
 		pmd_t pmd;
 
 		/*
@@ -432,7 +432,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		pfn = pmd_pfn(pmd) + pte_index(addr);
 		flag |= pmd_write(pmd) ? HMM_PFN_WRITE : 0;
 		for (; addr < end; addr += PAGE_SIZE, i++, pfn++)
-			pfns[i] = hmm_pfn_t_from_pfn(pfn) | flag;
+			pfns[i] = hmm_pfn_from_pfn(pfn) | flag;
 		return 0;
 	}
 
@@ -466,7 +466,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 			 * device and report anything else as error.
 			 */
 			if (is_device_private_entry(entry)) {
-				pfns[i] = hmm_pfn_t_from_pfn(swp_offset(entry));
+				pfns[i] = hmm_pfn_from_pfn(swp_offset(entry));
 				if (is_write_device_private_entry(entry)) {
 					pfns[i] |= HMM_PFN_WRITE;
 				} else if (write_fault)
@@ -491,7 +491,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		if (write_fault && !pte_write(pte))
 			goto fault;
 
-		pfns[i] = hmm_pfn_t_from_pfn(pte_pfn(pte));
+		pfns[i] = hmm_pfn_from_pfn(pte_pfn(pte));
 		pfns[i] |= pte_write(pte) ? HMM_PFN_WRITE : 0;
 		continue;
 
@@ -654,8 +654,8 @@ EXPORT_SYMBOL(hmm_vma_range_done);
  * This is similar to a regular CPU page fault except that it will not trigger
  * any memory migration if the memory being faulted is not accessible by CPUs.
  *
- * On error, for one virtual address in the range, the function will set the
- * hmm_pfn_t error flag for the corresponding pfn entry.
+ * On error, for one virtual address in the range, the function will mark the
+ * corresponding HMM pfn entry with an error flag.
  *
  * Expected use pattern:
  * retry:
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH 09/15] mm/hmm: cleanup special vma handling (VM_SPECIAL)
  2018-03-20  2:00 ` jglisse
@ 2018-03-20  2:00   ` jglisse
  -1 siblings, 0 replies; 99+ messages in thread
From: jglisse @ 2018-03-20  2:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Evgeny Baskakov, Ralph Campbell, Mark Hairgrove

From: Jérôme Glisse <jglisse@redhat.com>

Special vmas (those with any of the VM_SPECIAL flags) cannot be accessed
by a device because there is no consistent model across device drivers
for such vmas and their backing memory.

This patch passes the hmm_range struct directly to hmm_pfns_special(),
as the special case always affects the whole vma and thus the whole
range.

It also makes the behavior consistent: after this patch both
hmm_vma_fault() and hmm_vma_get_pfns() return -EINVAL when facing such a
vma. Previously hmm_vma_fault() returned 0 while hmm_vma_get_pfns()
returned -EINVAL, but both filled the HMM pfn array with the special
entry.
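
A hypothetical driver-side sketch of what callers see after this change
(locking and range setup omitted; range->pfns is assumed to have been
zero initialized by the driver):

#include <linux/hmm.h>

static int driver_mirror_vma(struct hmm_range *range)
{
	int ret = hmm_vma_get_pfns(range);

	if (ret == -EINVAL && (range->pfns[0] & HMM_PFN_SPECIAL)) {
		/*
		 * Special or hugetlbfs vma: every entry was filled with
		 * HMM_PFN_SPECIAL, there is nothing here a device can
		 * mirror, so just skip the vma.
		 */
		return 0;
	}
	return ret;
}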

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
---
 mm/hmm.c | 40 ++++++++++++++++++++--------------------
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index b4db0b1b709a..2df69a95c5ab 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -300,14 +300,6 @@ static int hmm_vma_do_fault(struct mm_walk *walk,
 	return -EAGAIN;
 }
 
-static void hmm_pfns_special(uint64_t *pfns,
-			     unsigned long addr,
-			     unsigned long end)
-{
-	for (; addr < end; addr += PAGE_SIZE, pfns++)
-		*pfns = HMM_PFN_SPECIAL;
-}
-
 static int hmm_pfns_bad(unsigned long addr,
 			unsigned long end,
 			struct mm_walk *walk)
@@ -505,6 +497,14 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 	return 0;
 }
 
+static void hmm_pfns_special(struct hmm_range *range)
+{
+	unsigned long addr = range->start, i = 0;
+
+	for (; addr < range->end; addr += PAGE_SIZE, i++)
+		range->pfns[i] = HMM_PFN_SPECIAL;
+}
+
 /*
  * hmm_vma_get_pfns() - snapshot CPU page table for a range of virtual addresses
  * @range: range being snapshotted
@@ -529,12 +529,6 @@ int hmm_vma_get_pfns(struct hmm_range *range)
 	struct mm_walk mm_walk;
 	struct hmm *hmm;
 
-	/* FIXME support hugetlb fs */
-	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
-		hmm_pfns_special(range->pfns, range->start, range->end);
-		return -EINVAL;
-	}
-
 	/* Sanity check, this really should not happen ! */
 	if (range->start < vma->vm_start || range->start >= vma->vm_end)
 		return -EINVAL;
@@ -548,6 +542,12 @@ int hmm_vma_get_pfns(struct hmm_range *range)
 	if (!hmm->mmu_notifier.ops)
 		return -EINVAL;
 
+	/* FIXME support hugetlb fs */
+	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
+		hmm_pfns_special(range);
+		return -EINVAL;
+	}
+
 	if (!(vma->vm_flags & VM_READ)) {
 		/*
 		 * If the vma does not allow read access, then assume that it
@@ -716,6 +716,12 @@ int hmm_vma_fault(struct hmm_range *range, bool write, bool block)
 	if (!hmm->mmu_notifier.ops)
 		return -EINVAL;
 
+	/* FIXME support hugetlb fs */
+	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
+		hmm_pfns_special(range);
+		return -EINVAL;
+	}
+
 	if (!(vma->vm_flags & VM_READ)) {
 		/*
 		 * If the vma does not allow read access, then assume that it
 		 * does not allow write access either. Architectures that allow
 		 * write without read access are not supported by HMM, because
 		 * operations such as atomic access would not work.
 
-	/* FIXME support hugetlb fs */
-	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
-		hmm_pfns_special(range->pfns, range->start, range->end);
-		return 0;
-	}
-
 	/* Initialize range to track CPU page table update */
 	spin_lock(&hmm->lock);
 	range->valid = true;
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH 10/15] mm/hmm: do not differentiate between empty entry or missing directory v2
  2018-03-20  2:00 ` jglisse
@ 2018-03-20  2:00   ` jglisse
  -1 siblings, 0 replies; 99+ messages in thread
From: jglisse @ 2018-03-20  2:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Evgeny Baskakov, Ralph Campbell, Mark Hairgrove, John Hubbard

From: Jérôme Glisse <jglisse@redhat.com>

There is no point in differentiating between a range for which there is
not even a page directory (and thus no entries) and an empty entry
(pte_none() or pmd_none() returns true).

Simply drop the distinction, i.e. remove the HMM_PFN_EMPTY flag and
merge the now duplicate hmm_vma_walk_hole() and hmm_vma_walk_clear()
functions.
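
A minimal hypothetical sketch of what the change means for a consumer of
the pfns array (an address with nothing behind it is now reported as a
plain 0 entry, there is no dedicated empty flag anymore):

#include <linux/hmm.h>

static unsigned long driver_count_mirrorable(struct hmm_range *range)
{
	unsigned long i, npages, count = 0;

	npages = (range->end - range->start) >> PAGE_SHIFT;
	for (i = 0; i < npages; i++) {
		if (range->pfns[i] == 0)
			continue;	/* empty entry or missing directory */
		if (range->pfns[i] & HMM_PFN_ERROR)
			continue;	/* poisoned memory, do not mirror */
		if (range->pfns[i] & HMM_PFN_VALID)
			count++;	/* something the device can map */
	}
	return count;
}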

Changed since v1:
  - Improved comments

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/hmm.h |  8 +++-----
 mm/hmm.c            | 45 +++++++++++++++------------------------------
 2 files changed, 18 insertions(+), 35 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 54d684fe3b90..cf283db22106 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -84,7 +84,6 @@ struct hmm;
  * HMM_PFN_VALID: pfn is valid. It has, at least, read permission.
  * HMM_PFN_WRITE: CPU page table has write permission set
  * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned memory
- * HMM_PFN_EMPTY: corresponding CPU page table entry is pte_none()
  * HMM_PFN_SPECIAL: corresponding CPU page table entry is special; i.e., the
  *      result of vm_insert_pfn() or vm_insert_page(). Therefore, it should not
  *      be mirrored by a device, because the entry will never have HMM_PFN_VALID
@@ -94,10 +93,9 @@ struct hmm;
 #define HMM_PFN_VALID (1 << 0)
 #define HMM_PFN_WRITE (1 << 1)
 #define HMM_PFN_ERROR (1 << 2)
-#define HMM_PFN_EMPTY (1 << 3)
-#define HMM_PFN_SPECIAL (1 << 4)
-#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 5)
-#define HMM_PFN_SHIFT 6
+#define HMM_PFN_SPECIAL (1 << 3)
+#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 4)
+#define HMM_PFN_SHIFT 5
 
 /*
  * hmm_pfn_to_page() - return struct page pointed to by a valid HMM pfn
diff --git a/mm/hmm.c b/mm/hmm.c
index 2df69a95c5ab..52204037ad84 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -324,6 +324,16 @@ static void hmm_pfns_clear(uint64_t *pfns,
 		*pfns = 0;
 }
 
+/*
+ * hmm_vma_walk_hole() - handle a range lacking valid pmd or pte(s)
+ * @start: range virtual start address (inclusive)
+ * @end: range virtual end address (exclusive)
+ * @walk: mm_walk structure
+ * Returns: 0 on success, -EAGAIN after page fault, or page fault error
+ *
+ * This function will be called whenever pmd_none() or pte_none() returns true,
+ * or whenever there is no page directory covering the virtual address range.
+ */
 static int hmm_vma_walk_hole(unsigned long addr,
 			     unsigned long end,
 			     struct mm_walk *walk)
@@ -333,31 +343,6 @@ static int hmm_vma_walk_hole(unsigned long addr,
 	uint64_t *pfns = range->pfns;
 	unsigned long i;
 
-	hmm_vma_walk->last = addr;
-	i = (addr - range->start) >> PAGE_SHIFT;
-	for (; addr < end; addr += PAGE_SIZE, i++) {
-		pfns[i] = HMM_PFN_EMPTY;
-		if (hmm_vma_walk->fault) {
-			int ret;
-
-			ret = hmm_vma_do_fault(walk, addr, &pfns[i]);
-			if (ret != -EAGAIN)
-				return ret;
-		}
-	}
-
-	return hmm_vma_walk->fault ? -EAGAIN : 0;
-}
-
-static int hmm_vma_walk_clear(unsigned long addr,
-			      unsigned long end,
-			      struct mm_walk *walk)
-{
-	struct hmm_vma_walk *hmm_vma_walk = walk->private;
-	struct hmm_range *range = hmm_vma_walk->range;
-	uint64_t *pfns = range->pfns;
-	unsigned long i;
-
 	hmm_vma_walk->last = addr;
 	i = (addr - range->start) >> PAGE_SHIFT;
 	for (; addr < end; addr += PAGE_SIZE, i++) {
@@ -416,10 +401,10 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		if (!pmd_devmap(pmd) && !pmd_trans_huge(pmd))
 			goto again;
 		if (pmd_protnone(pmd))
-			return hmm_vma_walk_clear(start, end, walk);
+			return hmm_vma_walk_hole(start, end, walk);
 
 		if (write_fault && !pmd_write(pmd))
-			return hmm_vma_walk_clear(start, end, walk);
+			return hmm_vma_walk_hole(start, end, walk);
 
 		pfn = pmd_pfn(pmd) + pte_index(addr);
 		flag |= pmd_write(pmd) ? HMM_PFN_WRITE : 0;
@@ -438,7 +423,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		pfns[i] = 0;
 
 		if (pte_none(pte)) {
-			pfns[i] = HMM_PFN_EMPTY;
+			pfns[i] = 0;
 			if (hmm_vma_walk->fault)
 				goto fault;
 			continue;
@@ -489,8 +474,8 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 
 fault:
 		pte_unmap(ptep);
-		/* Fault all pages in range */
-		return hmm_vma_walk_clear(start, end, walk);
+		/* Fault any virtual address we were ask to fault */
+		return hmm_vma_walk_hole(start, end, walk);
 	}
 	pte_unmap(ptep - 1);
 
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH 11/15] mm/hmm: rename HMM_PFN_DEVICE_UNADDRESSABLE to HMM_PFN_DEVICE_PRIVATE
  2018-03-20  2:00 ` jglisse
@ 2018-03-20  2:00   ` jglisse
  -1 siblings, 0 replies; 99+ messages in thread
From: jglisse @ 2018-03-20  2:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Evgeny Baskakov, Ralph Campbell, Mark Hairgrove

From: Jérôme Glisse <jglisse@redhat.com>

Make naming consistent across the code: DEVICE_PRIVATE is the name used
outside of the HMM code, so use that one.
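
For illustration only, a trivial hypothetical driver helper using the
renamed flag:

#include <linux/types.h>
#include <linux/hmm.h>

/* True when @entry refers to unaddressable device private memory. */
static bool driver_entry_is_device_private(uint64_t entry)
{
	return (entry & HMM_PFN_VALID) && (entry & HMM_PFN_DEVICE_PRIVATE);
}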

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
---
 include/linux/hmm.h | 4 ++--
 mm/hmm.c            | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index cf283db22106..e8515cad5a00 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -88,13 +88,13 @@ struct hmm;
  *      result of vm_insert_pfn() or vm_insert_page(). Therefore, it should not
  *      be mirrored by a device, because the entry will never have HMM_PFN_VALID
  *      set and the pfn value is undefined.
- * HMM_PFN_DEVICE_UNADDRESSABLE: unaddressable device memory (ZONE_DEVICE)
+ * HMM_PFN_DEVICE_PRIVATE: unaddressable device memory (ZONE_DEVICE)
  */
 #define HMM_PFN_VALID (1 << 0)
 #define HMM_PFN_WRITE (1 << 1)
 #define HMM_PFN_ERROR (1 << 2)
 #define HMM_PFN_SPECIAL (1 << 3)
-#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 4)
+#define HMM_PFN_DEVICE_PRIVATE (1 << 4)
 #define HMM_PFN_SHIFT 5
 
 /*
diff --git a/mm/hmm.c b/mm/hmm.c
index 52204037ad84..b4b8b89ee7bb 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -448,7 +448,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 					pfns[i] |= HMM_PFN_WRITE;
 				} else if (write_fault)
 					goto fault;
-				pfns[i] |= HMM_PFN_DEVICE_UNADDRESSABLE;
+				pfns[i] |= HMM_PFN_DEVICE_PRIVATE;
 			} else if (is_migration_entry(entry)) {
 				if (hmm_vma_walk->fault) {
 					pte_unmap(ptep);
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH 12/15] mm/hmm: move hmm_pfns_clear() closer to where it is use
  2018-03-20  2:00 ` jglisse
@ 2018-03-20  2:00   ` jglisse
  -1 siblings, 0 replies; 99+ messages in thread
From: jglisse @ 2018-03-20  2:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Evgeny Baskakov, Ralph Campbell, Mark Hairgrove

From: Jérôme Glisse <jglisse@redhat.com>

Move hmm_pfns_clear() closer to where it is used, to make it clear that
it is not used by the page table walkers.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
---
 mm/hmm.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index b4b8b89ee7bb..52cdceb35733 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -316,14 +316,6 @@ static int hmm_pfns_bad(unsigned long addr,
 	return 0;
 }
 
-static void hmm_pfns_clear(uint64_t *pfns,
-			   unsigned long addr,
-			   unsigned long end)
-{
-	for (; addr < end; addr += PAGE_SIZE, pfns++)
-		*pfns = 0;
-}
-
 /*
  * hmm_vma_walk_hole() - handle a range lacking valid pmd or pte(s)
  * @start: range virtual start address (inclusive)
@@ -482,6 +474,14 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 	return 0;
 }
 
+static void hmm_pfns_clear(uint64_t *pfns,
+			   unsigned long addr,
+			   unsigned long end)
+{
+	for (; addr < end; addr += PAGE_SIZE, pfns++)
+		*pfns = 0;
+}
+
 static void hmm_pfns_special(struct hmm_range *range)
 {
 	unsigned long addr = range->start, i = 0;
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH 13/15] mm/hmm: factor out pte and pmd handling to simplify hmm_vma_walk_pmd()
  2018-03-20  2:00 ` jglisse
@ 2018-03-20  2:00   ` jglisse
  -1 siblings, 0 replies; 99+ messages in thread
From: jglisse @ 2018-03-20  2:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Evgeny Baskakov, Ralph Campbell, Mark Hairgrove, John Hubbard

From: Jérôme Glisse <jglisse@redhat.com>

No functional change, just create one function to handle pmd and one
to handle pte (hmm_vma_handle_pmd() and hmm_vma_handle_pte()).

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 mm/hmm.c | 174 +++++++++++++++++++++++++++++++++++++--------------------------
 1 file changed, 102 insertions(+), 72 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 52cdceb35733..dc703e9c3a95 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -351,6 +351,99 @@ static int hmm_vma_walk_hole(unsigned long addr,
 	return hmm_vma_walk->fault ? -EAGAIN : 0;
 }
 
+static int hmm_vma_handle_pmd(struct mm_walk *walk,
+			      unsigned long addr,
+			      unsigned long end,
+			      uint64_t *pfns,
+			      pmd_t pmd)
+{
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	unsigned long pfn, i;
+	uint64_t flag = 0;
+
+	if (pmd_protnone(pmd))
+		return hmm_vma_walk_hole(addr, end, walk);
+
+	if ((hmm_vma_walk->fault & hmm_vma_walk->write) && !pmd_write(pmd))
+		return hmm_vma_walk_hole(addr, end, walk);
+
+	pfn = pmd_pfn(pmd) + pte_index(addr);
+	flag |= pmd_write(pmd) ? HMM_PFN_WRITE : 0;
+	for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++)
+		pfns[i] = hmm_pfn_from_pfn(pfn) | flag;
+	hmm_vma_walk->last = end;
+	return 0;
+}
+
+static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
+			      unsigned long end, pmd_t *pmdp, pte_t *ptep,
+			      uint64_t *pfns)
+{
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	pte_t pte = *ptep;
+
+	*pfns = 0;
+
+	if (pte_none(pte)) {
+		*pfns = 0;
+		if (hmm_vma_walk->fault)
+			goto fault;
+		return 0;
+	}
+
+	if (!pte_present(pte)) {
+		swp_entry_t entry = pte_to_swp_entry(pte);
+
+		if (!non_swap_entry(entry)) {
+			if (hmm_vma_walk->fault)
+				goto fault;
+			return 0;
+		}
+
+		/*
+		 * This is a special swap entry, ignore migration, use
+		 * device and report anything else as error.
+		 */
+		if (is_device_private_entry(entry)) {
+			*pfns = hmm_pfn_from_pfn(swp_offset(entry));
+			if (is_write_device_private_entry(entry)) {
+				*pfns |= HMM_PFN_WRITE;
+			} else if ((hmm_vma_walk->fault & hmm_vma_walk->write))
+				goto fault;
+			*pfns |= HMM_PFN_DEVICE_PRIVATE;
+			return 0;
+		}
+
+		if (is_migration_entry(entry)) {
+			if (hmm_vma_walk->fault) {
+				pte_unmap(ptep);
+				hmm_vma_walk->last = addr;
+				migration_entry_wait(vma->vm_mm,
+						pmdp, addr);
+				return -EAGAIN;
+			}
+			return 0;
+		}
+
+		/* Report error for everything else */
+		*pfns = HMM_PFN_ERROR;
+		return -EFAULT;
+	}
+
+	if ((hmm_vma_walk->fault & hmm_vma_walk->write) && !pte_write(pte))
+		goto fault;
+
+	*pfns = hmm_pfn_from_pfn(pte_pfn(pte));
+	*pfns |= pte_write(pte) ? HMM_PFN_WRITE : 0;
+	return 0;
+
+fault:
+	pte_unmap(ptep);
+	/* Fault any virtual address we were ask to fault */
+	return hmm_vma_walk_hole(addr, end, walk);
+}
+
 static int hmm_vma_walk_pmd(pmd_t *pmdp,
 			    unsigned long start,
 			    unsigned long end,
@@ -358,25 +451,20 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 {
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
 	struct hmm_range *range = hmm_vma_walk->range;
-	struct vm_area_struct *vma = walk->vma;
 	uint64_t *pfns = range->pfns;
 	unsigned long addr = start, i;
-	bool write_fault;
 	pte_t *ptep;
 
 	i = (addr - range->start) >> PAGE_SHIFT;
-	write_fault = hmm_vma_walk->fault & hmm_vma_walk->write;
 
 again:
 	if (pmd_none(*pmdp))
 		return hmm_vma_walk_hole(start, end, walk);
 
-	if (pmd_huge(*pmdp) && vma->vm_flags & VM_HUGETLB)
+	if (pmd_huge(*pmdp) && (range->vma->vm_flags & VM_HUGETLB))
 		return hmm_pfns_bad(start, end, walk);
 
 	if (pmd_devmap(*pmdp) || pmd_trans_huge(*pmdp)) {
-		unsigned long pfn;
-		uint64_t flag = 0;
 		pmd_t pmd;
 
 		/*
@@ -392,17 +480,8 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		barrier();
 		if (!pmd_devmap(pmd) && !pmd_trans_huge(pmd))
 			goto again;
-		if (pmd_protnone(pmd))
-			return hmm_vma_walk_hole(start, end, walk);
 
-		if (write_fault && !pmd_write(pmd))
-			return hmm_vma_walk_hole(start, end, walk);
-
-		pfn = pmd_pfn(pmd) + pte_index(addr);
-		flag |= pmd_write(pmd) ? HMM_PFN_WRITE : 0;
-		for (; addr < end; addr += PAGE_SIZE, i++, pfn++)
-			pfns[i] = hmm_pfn_from_pfn(pfn) | flag;
-		return 0;
+		return hmm_vma_handle_pmd(walk, addr, end, &pfns[i], pmd);
 	}
 
 	if (pmd_bad(*pmdp))
@@ -410,67 +489,18 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 
 	ptep = pte_offset_map(pmdp, addr);
 	for (; addr < end; addr += PAGE_SIZE, ptep++, i++) {
-		pte_t pte = *ptep;
+		int r;
 
-		pfns[i] = 0;
-
-		if (pte_none(pte)) {
-			pfns[i] = 0;
-			if (hmm_vma_walk->fault)
-				goto fault;
-			continue;
-		}
-
-		if (!pte_present(pte)) {
-			swp_entry_t entry = pte_to_swp_entry(pte);
-
-			if (!non_swap_entry(entry)) {
-				if (hmm_vma_walk->fault)
-					goto fault;
-				continue;
-			}
-
-			/*
-			 * This is a special swap entry, ignore migration, use
-			 * device and report anything else as error.
-			 */
-			if (is_device_private_entry(entry)) {
-				pfns[i] = hmm_pfn_from_pfn(swp_offset(entry));
-				if (is_write_device_private_entry(entry)) {
-					pfns[i] |= HMM_PFN_WRITE;
-				} else if (write_fault)
-					goto fault;
-				pfns[i] |= HMM_PFN_DEVICE_PRIVATE;
-			} else if (is_migration_entry(entry)) {
-				if (hmm_vma_walk->fault) {
-					pte_unmap(ptep);
-					hmm_vma_walk->last = addr;
-					migration_entry_wait(vma->vm_mm,
-							     pmdp, addr);
-					return -EAGAIN;
-				}
-				continue;
-			} else {
-				/* Report error for everything else */
-				pfns[i] = HMM_PFN_ERROR;
-			}
-			continue;
+		r = hmm_vma_handle_pte(walk, addr, end, pmdp, ptep, &pfns[i]);
+		if (r) {
+			/* hmm_vma_handle_pte() did unmap pte directory */
+			hmm_vma_walk->last = addr;
+			return r;
 		}
-
-		if (write_fault && !pte_write(pte))
-			goto fault;
-
-		pfns[i] = hmm_pfn_from_pfn(pte_pfn(pte));
-		pfns[i] |= pte_write(pte) ? HMM_PFN_WRITE : 0;
-		continue;
-
-fault:
-		pte_unmap(ptep);
-		/* Fault any virtual address we were ask to fault */
-		return hmm_vma_walk_hole(start, end, walk);
 	}
 	pte_unmap(ptep - 1);
 
+	hmm_vma_walk->last = addr;
 	return 0;
 }
 
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH 14/15] mm/hmm: change hmm_vma_fault() to allow write fault on page basis
  2018-03-20  2:00 ` jglisse
@ 2018-03-20  2:00   ` jglisse
  -1 siblings, 0 replies; 99+ messages in thread
From: jglisse @ 2018-03-20  2:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Evgeny Baskakov, Ralph Campbell, Mark Hairgrove, John Hubbard

From: Jérôme Glisse <jglisse@redhat.com>

This changes hmm_vma_fault() to no longer take a global write fault
flag for a range. Instead it relies on the caller to populate the HMM
pfns array with the proper fault flags, i.e. HMM_PFN_VALID if the
driver wants a read fault for that address, or HMM_PFN_VALID and
HMM_PFN_WRITE if it wants a write fault.

Moreover, by setting HMM_PFN_DEVICE_PRIVATE the device driver can ask
for device private memory to be migrated back to system memory through
a page fault.

This is a more flexible API and it better reflects how devices handle
and report faults.
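
For illustration, here is a minimal driver-side sketch of the new
interface; the dummy_mirror_fault_range() helper and the write_bitmap
parameter are hypothetical names, not part of this patch:

/* Hypothetical helper; HMM itself does not provide this. */
static int dummy_mirror_fault_range(struct hmm_range *range,
				    const unsigned long *write_bitmap)
{
	unsigned long i, npages;
	int ret;

	npages = (range->end - range->start) >> PAGE_SHIFT;
	for (i = 0; i < npages; i++) {
		/* Request at least read access for every page ... */
		range->pfns[i] = HMM_PFN_VALID;
		/* ... and write access only where the device needs it. */
		if (test_bit(i, write_bitmap))
			range->pfns[i] |= HMM_PFN_WRITE;
	}

	/* Fault the pages in, sleeping if necessary. */
	ret = hmm_vma_fault(range, true);
	if (ret)
		return ret;

	/*
	 * Update the device page table under the driver lock, then
	 * check the snapshot is still valid (see hmm_vma_range_done()).
	 */
	if (!hmm_vma_range_done(range))
		ret = -EAGAIN;
	return ret;
}

Only the pages the device actually writes need HMM_PFN_WRITE, so a
read-mostly range no longer forces a write fault on every page.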

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/hmm.h |   2 +-
 mm/hmm.c            | 150 +++++++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 119 insertions(+), 33 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index e8515cad5a00..0f7ea3074175 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -317,7 +317,7 @@ bool hmm_vma_range_done(struct hmm_range *range);
  *
  * See the function description in mm/hmm.c for further documentation.
  */
-int hmm_vma_fault(struct hmm_range *range, bool write, bool block);
+int hmm_vma_fault(struct hmm_range *range, bool block);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
diff --git a/mm/hmm.c b/mm/hmm.c
index dc703e9c3a95..956689804600 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -275,12 +275,10 @@ struct hmm_vma_walk {
 	unsigned long		last;
 	bool			fault;
 	bool			block;
-	bool			write;
 };
 
-static int hmm_vma_do_fault(struct mm_walk *walk,
-			    unsigned long addr,
-			    uint64_t *pfn)
+static int hmm_vma_do_fault(struct mm_walk *walk, unsigned long addr,
+			    bool write_fault, uint64_t *pfn)
 {
 	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_REMOTE;
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
@@ -288,7 +286,7 @@ static int hmm_vma_do_fault(struct mm_walk *walk,
 	int r;
 
 	flags |= hmm_vma_walk->block ? 0 : FAULT_FLAG_ALLOW_RETRY;
-	flags |= hmm_vma_walk->write ? FAULT_FLAG_WRITE : 0;
+	flags |= write_fault ? FAULT_FLAG_WRITE : 0;
 	r = handle_mm_fault(vma, addr, flags);
 	if (r & VM_FAULT_RETRY)
 		return -EBUSY;
@@ -320,15 +318,17 @@ static int hmm_pfns_bad(unsigned long addr,
  * hmm_vma_walk_hole() - handle a range lacking valid pmd or pte(s)
  * @start: range virtual start address (inclusive)
  * @end: range virtual end address (exclusive)
+ * @fault: should we fault or not ?
+ * @write_fault: write fault ?
  * @walk: mm_walk structure
  * Returns: 0 on success, -EAGAIN after page fault, or page fault error
  *
  * This function will be called whenever pmd_none() or pte_none() returns true,
  * or whenever there is no page directory covering the virtual address range.
  */
-static int hmm_vma_walk_hole(unsigned long addr,
-			     unsigned long end,
-			     struct mm_walk *walk)
+static int hmm_vma_walk_hole_(unsigned long addr, unsigned long end,
+			      bool fault, bool write_fault,
+			      struct mm_walk *walk)
 {
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
 	struct hmm_range *range = hmm_vma_walk->range;
@@ -339,16 +339,89 @@ static int hmm_vma_walk_hole(unsigned long addr,
 	i = (addr - range->start) >> PAGE_SHIFT;
 	for (; addr < end; addr += PAGE_SIZE, i++) {
 		pfns[i] = 0;
-		if (hmm_vma_walk->fault) {
+		if (fault || write_fault) {
 			int ret;
 
-			ret = hmm_vma_do_fault(walk, addr, &pfns[i]);
+			ret = hmm_vma_do_fault(walk, addr, write_fault,
+					       &pfns[i]);
 			if (ret != -EAGAIN)
 				return ret;
 		}
 	}
 
-	return hmm_vma_walk->fault ? -EAGAIN : 0;
+	return (fault || write_fault) ? -EAGAIN : 0;
+}
+
+static inline void hmm_pte_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
+				      uint64_t pfns, uint64_t cpu_flags,
+				      bool *fault, bool *write_fault)
+{
+	*fault = *write_fault = false;
+	if (!hmm_vma_walk->fault)
+		return;
+
+	/* We aren't ask to do anything ... */
+	if (!(pfns & HMM_PFN_VALID))
+		return;
+	/* If CPU page table is not valid then we need to fault */
+	*fault = cpu_flags & HMM_PFN_VALID;
+	/* Need to write fault ? */
+	if ((pfns & HMM_PFN_WRITE) && !(cpu_flags & HMM_PFN_WRITE)) {
+		*fault = *write_fault = false;
+		return;
+	}
+	/* Do we fault on device memory ? */
+	if ((pfns & HMM_PFN_DEVICE_PRIVATE) &&
+	    (cpu_flags & HMM_PFN_DEVICE_PRIVATE)) {
+		*write_fault = pfns & HMM_PFN_WRITE;
+		*fault = true;
+	}
+}
+
+static void hmm_range_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
+				 const uint64_t *pfns, unsigned long npages,
+				 uint64_t cpu_flags, bool *fault,
+				 bool *write_fault)
+{
+	unsigned long i;
+
+	if (!hmm_vma_walk->fault) {
+		*fault = *write_fault = false;
+		return;
+	}
+
+	for (i = 0; i < npages; ++i) {
+		hmm_pte_need_fault(hmm_vma_walk, pfns[i], cpu_flags,
+				   fault, write_fault);
+		if ((*fault) || (*write_fault))
+			return;
+	}
+}
+
+static int hmm_vma_walk_hole(unsigned long addr, unsigned long end,
+			     struct mm_walk *walk)
+{
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
+	bool fault, write_fault;
+	unsigned long i, npages;
+	uint64_t *pfns;
+
+	i = (addr - range->start) >> PAGE_SHIFT;
+	npages = (end - addr) >> PAGE_SHIFT;
+	pfns = &range->pfns[i];
+	hmm_range_need_fault(hmm_vma_walk, pfns, npages,
+			     0, &fault, &write_fault);
+	return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
+}
+
+static inline uint64_t pmd_to_hmm_pfn_flags(pmd_t pmd)
+{
+	if (pmd_protnone(pmd))
+		return 0;
+	return pmd_write(pmd) ? HMM_PFN_VALID |
+				HMM_PFN_WRITE :
+				HMM_PFN_VALID;
 }
 
 static int hmm_vma_handle_pmd(struct mm_walk *walk,
@@ -358,14 +431,17 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk,
 			      pmd_t pmd)
 {
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
-	unsigned long pfn, i;
-	uint64_t flag = 0;
+	unsigned long pfn, npages, i;
+	uint64_t flag = 0, cpu_flags;
+	bool fault, write_fault;
 
-	if (pmd_protnone(pmd))
-		return hmm_vma_walk_hole(addr, end, walk);
+	npages = (end - addr) >> PAGE_SHIFT;
+	cpu_flags = pmd_to_hmm_pfn_flags(pmd);
+	hmm_range_need_fault(hmm_vma_walk, pfns, npages, cpu_flags,
+			     &fault, &write_fault);
 
-	if ((hmm_vma_walk->fault & hmm_vma_walk->write) && !pmd_write(pmd))
-		return hmm_vma_walk_hole(addr, end, walk);
+	if (pmd_protnone(pmd) || fault || write_fault)
+		return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
 
 	pfn = pmd_pfn(pmd) + pte_index(addr);
 	flag |= pmd_write(pmd) ? HMM_PFN_WRITE : 0;
@@ -375,19 +451,33 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk,
 	return 0;
 }
 
+static inline uint64_t pte_to_hmm_pfn_flags(pte_t pte)
+{
+	if (pte_none(pte) || !pte_present(pte))
+		return 0;
+	return pte_write(pte) ? HMM_PFN_VALID |
+				HMM_PFN_WRITE :
+				HMM_PFN_VALID;
+}
+
 static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 			      unsigned long end, pmd_t *pmdp, pte_t *ptep,
 			      uint64_t *pfns)
 {
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
 	struct vm_area_struct *vma = walk->vma;
+	bool fault, write_fault;
+	uint64_t cpu_flags;
 	pte_t pte = *ptep;
 
 	*pfns = 0;
+	cpu_flags = pte_to_hmm_pfn_flags(pte);
+	hmm_pte_need_fault(hmm_vma_walk, *pfns, cpu_flags,
+			   &fault, &write_fault);
 
 	if (pte_none(pte)) {
 		*pfns = 0;
-		if (hmm_vma_walk->fault)
+		if (fault || write_fault)
 			goto fault;
 		return 0;
 	}
@@ -396,7 +486,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 		swp_entry_t entry = pte_to_swp_entry(pte);
 
 		if (!non_swap_entry(entry)) {
-			if (hmm_vma_walk->fault)
+			if (fault || write_fault)
 				goto fault;
 			return 0;
 		}
@@ -406,21 +496,20 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 		 * device and report anything else as error.
 		 */
 		if (is_device_private_entry(entry)) {
+			cpu_flags = HMM_PFN_VALID | HMM_PFN_DEVICE_PRIVATE;
+			cpu_flags |= is_write_device_private_entry(entry) ?
+					HMM_PFN_WRITE : 0;
 			*pfns = hmm_pfn_from_pfn(swp_offset(entry));
-			if (is_write_device_private_entry(entry)) {
-				*pfns |= HMM_PFN_WRITE;
-			} else if ((hmm_vma_walk->fault & hmm_vma_walk->write))
-				goto fault;
 			*pfns |= HMM_PFN_DEVICE_PRIVATE;
 			return 0;
 		}
 
 		if (is_migration_entry(entry)) {
-			if (hmm_vma_walk->fault) {
+			if (fault || write_fault) {
 				pte_unmap(ptep);
 				hmm_vma_walk->last = addr;
 				migration_entry_wait(vma->vm_mm,
-						pmdp, addr);
+						     pmdp, addr);
 				return -EAGAIN;
 			}
 			return 0;
@@ -431,17 +520,16 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 		return -EFAULT;
 	}
 
-	if ((hmm_vma_walk->fault & hmm_vma_walk->write) && !pte_write(pte))
+	if (fault || write_fault)
 		goto fault;
 
-	*pfns = hmm_pfn_from_pfn(pte_pfn(pte));
-	*pfns |= pte_write(pte) ? HMM_PFN_WRITE : 0;
+	*pfns = hmm_pfn_from_pfn(pte_pfn(pte)) | cpu_flags;
 	return 0;
 
 fault:
 	pte_unmap(ptep);
 	/* Fault any virtual address we were ask to fault */
-	return hmm_vma_walk_hole(addr, end, walk);
+	return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
 }
 
 static int hmm_vma_walk_pmd(pmd_t *pmdp,
@@ -662,7 +750,6 @@ EXPORT_SYMBOL(hmm_vma_range_done);
 /*
  * hmm_vma_fault() - try to fault some address in a virtual address range
  * @range: range being faulted
- * @write: is it a write fault
  * @block: allow blocking on fault (if true it sleeps and do not drop mmap_sem)
  * Returns: 0 success, error otherwise (-EAGAIN means mmap_sem have been drop)
  *
@@ -707,7 +794,7 @@ EXPORT_SYMBOL(hmm_vma_range_done);
  *
  * YOU HAVE BEEN WARNED !
  */
-int hmm_vma_fault(struct hmm_range *range, bool write, bool block)
+int hmm_vma_fault(struct hmm_range *range, bool block)
 {
 	struct vm_area_struct *vma = range->vma;
 	unsigned long start = range->start;
@@ -755,7 +842,6 @@ int hmm_vma_fault(struct hmm_range *range, bool write, bool block)
 	spin_unlock(&hmm->lock);
 
 	hmm_vma_walk.fault = true;
-	hmm_vma_walk.write = write;
 	hmm_vma_walk.block = block;
 	hmm_vma_walk.range = range;
 	mm_walk.private = &hmm_vma_walk;
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH 15/15] mm/hmm: use device driver encoding for HMM pfn v2
  2018-03-20  2:00 ` jglisse
@ 2018-03-20  2:00   ` jglisse
  -1 siblings, 0 replies; 99+ messages in thread
From: jglisse @ 2018-03-20  2:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Evgeny Baskakov, Ralph Campbell, Mark Hairgrove, John Hubbard

From: Jérôme Glisse <jglisse@redhat.com>

Users of hmm_vma_fault() and hmm_vma_get_pfns() now provide a flags
array and a pfn shift value, allowing them to define their own encoding
for the HMM pfns that are filled into the pfns array of the hmm_range
struct. With this, device drivers can get pfns that match their own
private encoding out of HMM without having to do any conversion.
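
For illustration, a hypothetical encoding where the driver valid bit is
bit 3, matching the example in the hmm.h comment; the DRIVER_PFN_* and
dummy_* names are made up, not part of this patch:

/* Hypothetical device page table entry encoding. */
#define DRIVER_PFN_VALID	(1ULL << 3)
#define DRIVER_PFN_WRITE	(1ULL << 4)
#define DRIVER_PFN_DEV_PRIVATE	(1ULL << 5)
#define DRIVER_PFN_SHIFT	6

static const uint64_t dummy_flags[HMM_PFN_FLAG_MAX] = {
	[HMM_PFN_VALID]		 = DRIVER_PFN_VALID,
	[HMM_PFN_WRITE]		 = DRIVER_PFN_WRITE,
	[HMM_PFN_DEVICE_PRIVATE] = DRIVER_PFN_DEV_PRIVATE,
};

/* Distinct special values, chosen without the valid bit set. */
static const uint64_t dummy_values[HMM_PFN_VALUE_MAX] = {
	[HMM_PFN_ERROR]	  = 0x1,
	[HMM_PFN_NONE]	  = 0x0,
	[HMM_PFN_SPECIAL] = 0x2,
};

static void dummy_range_init(struct hmm_range *range,
			     struct vm_area_struct *vma,
			     unsigned long start, unsigned long end,
			     uint64_t *pfns)
{
	range->vma = vma;
	range->start = start;
	range->end = end;
	range->pfns = pfns;
	range->flags = dummy_flags;
	range->values = dummy_values;
	range->pfn_shift = DRIVER_PFN_SHIFT;
}

With this, the values that hmm_vma_get_pfns() or hmm_vma_fault() write
into range->pfns[] already match the device encoding, and
hmm_pfn_to_page(range, pfn) decodes them back when needed.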

Changed since v1:
  - Split flags and special values for clarification
  - Improved comments and provide examples

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/hmm.h | 130 +++++++++++++++++++++++++++++++++++++---------------
 mm/hmm.c            |  85 +++++++++++++++++++---------------
 2 files changed, 142 insertions(+), 73 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 0f7ea3074175..5d26e0a223d9 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -80,68 +80,145 @@
 struct hmm;
 
 /*
+ * hmm_pfn_flag_e - HMM flag enums
+ *
  * Flags:
  * HMM_PFN_VALID: pfn is valid. It has, at least, read permission.
  * HMM_PFN_WRITE: CPU page table has write permission set
+ * HMM_PFN_DEVICE_PRIVATE: private device memory (ZONE_DEVICE)
+ *
+ * The driver provide a flags array, if driver valid bit for an entry is bit
+ * 3 ie (entry & (1 << 3)) is true if entry is valid then driver must provide
+ * an array in hmm_range.flags with hmm_range.flags[HMM_PFN_VALID] == 1 << 3.
+ * Same logic apply to all flags. This is same idea as vm_page_prot in vma
+ * except that this is per device driver rather than per architecture.
+ */
+enum hmm_pfn_flag_e {
+	HMM_PFN_VALID = 0,
+	HMM_PFN_WRITE,
+	HMM_PFN_DEVICE_PRIVATE,
+	HMM_PFN_FLAG_MAX
+};
+
+/*
+ * hmm_pfn_value_e - HMM pfn special value
+ *
+ * Flags:
  * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned memory
+ * HMM_PFN_NONE: corresponding CPU page table entry is pte_none()
  * HMM_PFN_SPECIAL: corresponding CPU page table entry is special; i.e., the
  *      result of vm_insert_pfn() or vm_insert_page(). Therefore, it should not
  *      be mirrored by a device, because the entry will never have HMM_PFN_VALID
  *      set and the pfn value is undefined.
- * HMM_PFN_DEVICE_PRIVATE: unaddressable device memory (ZONE_DEVICE)
+ *
+ * Driver provide entry value for none entry, error entry and special entry,
+ * driver can alias (ie use same value for error and special for instance). It
+ * should not alias none and error or special.
+ *
+ * HMM pfn value returned by hmm_vma_get_pfns() or hmm_vma_fault() will be:
+ * hmm_range.values[HMM_PFN_ERROR] if CPU page table entry is poisonous,
+ * hmm_range.values[HMM_PFN_NONE] if there is no CPU page table
+ * hmm_range.values[HMM_PFN_SPECIAL] if CPU page table entry is a special one
  */
-#define HMM_PFN_VALID (1 << 0)
-#define HMM_PFN_WRITE (1 << 1)
-#define HMM_PFN_ERROR (1 << 2)
-#define HMM_PFN_SPECIAL (1 << 3)
-#define HMM_PFN_DEVICE_PRIVATE (1 << 4)
-#define HMM_PFN_SHIFT 5
+enum hmm_pfn_value_e {
+	HMM_PFN_ERROR,
+	HMM_PFN_NONE,
+	HMM_PFN_SPECIAL,
+	HMM_PFN_VALUE_MAX
+};
+
+/*
+ * struct hmm_range - track invalidation lock on virtual address range
+ *
+ * @vma: the vm area struct for the range
+ * @list: all range lock are on a list
+ * @start: range virtual start address (inclusive)
+ * @end: range virtual end address (exclusive)
+ * @pfns: array of pfns (big enough for the range)
+ * @flags: pfn flags to match device driver page table
+ * @values: pfn value for some special case (none, special, error, ...)
+ * @pfn_shifts: pfn shift value (should be <= PAGE_SHIFT)
+ * @valid: pfns array did not change since it has been fill by an HMM function
+ */
+struct hmm_range {
+	struct vm_area_struct	*vma;
+	struct list_head	list;
+	unsigned long		start;
+	unsigned long		end;
+	uint64_t		*pfns;
+	const uint64_t		*flags;
+	const uint64_t		*values;
+	uint8_t			pfn_shift;
+	bool			valid;
+};
 
 /*
  * hmm_pfn_to_page() - return struct page pointed to by a valid HMM pfn
+ * @range: range use to decode HMM pfn value
  * @pfn: HMM pfn value to get corresponding struct page from
  * Returns: struct page pointer if pfn is a valid HMM pfn, NULL otherwise
  *
  * If the HMM pfn is valid (ie valid flag set) then return the struct page
  * matching the pfn value stored in the HMM pfn. Otherwise return NULL.
  */
-static inline struct page *hmm_pfn_to_page(uint64_t pfn)
+static inline struct page *hmm_pfn_to_page(const struct hmm_range *range,
+					   uint64_t pfn)
 {
-	if (!(pfn & HMM_PFN_VALID))
+	if (pfn == range->values[HMM_PFN_NONE])
+		return NULL;
+	if (pfn == range->values[HMM_PFN_ERROR])
+		return NULL;
+	if (pfn == range->values[HMM_PFN_SPECIAL])
 		return NULL;
-	return pfn_to_page(pfn >> HMM_PFN_SHIFT);
+	if (!(pfn & range->flags[HMM_PFN_VALID]))
+		return NULL;
+	return pfn_to_page(pfn >> range->pfn_shift);
 }
 
 /*
  * hmm_pfn_to_pfn() - return pfn value store in a HMM pfn
+ * @range: range use to decode HMM pfn value
  * @pfn: HMM pfn value to extract pfn from
  * Returns: pfn value if HMM pfn is valid, -1UL otherwise
  */
-static inline unsigned long hmm_pfn_to_pfn(uint64_t pfn)
+static inline unsigned long hmm_pfn_to_pfn(const struct hmm_range *range,
+					   uint64_t pfn)
 {
-	if (!(pfn & HMM_PFN_VALID))
+	if (pfn == range->values[HMM_PFN_NONE])
+		return -1UL;
+	if (pfn == range->values[HMM_PFN_ERROR])
+		return -1UL;
+	if (pfn == range->values[HMM_PFN_SPECIAL])
 		return -1UL;
-	return (pfn >> HMM_PFN_SHIFT);
+	if (!(pfn & range->flags[HMM_PFN_VALID]))
+		return -1UL;
+	return (pfn >> range->pfn_shift);
 }
 
 /*
  * hmm_pfn_from_page() - create a valid HMM pfn value from struct page
+ * @range: range use to encode HMM pfn value
  * @page: struct page pointer for which to create the HMM pfn
  * Returns: valid HMM pfn for the page
  */
-static inline uint64_t hmm_pfn_from_page(struct page *page)
+static inline uint64_t hmm_pfn_from_page(const struct hmm_range *range,
+					 struct page *page)
 {
-	return (page_to_pfn(page) << HMM_PFN_SHIFT) | HMM_PFN_VALID;
+	return (page_to_pfn(page) << range->pfn_shift) |
+		range->flags[HMM_PFN_VALID];
 }
 
 /*
  * hmm_pfn_from_pfn() - create a valid HMM pfn value from pfn
+ * @range: range use to encode HMM pfn value
  * @pfn: pfn value for which to create the HMM pfn
  * Returns: valid HMM pfn for the pfn
  */
-static inline uint64_t hmm_pfn_from_pfn(unsigned long pfn)
+static inline uint64_t hmm_pfn_from_pfn(const struct hmm_range *range,
+					unsigned long pfn)
 {
-	return (pfn << HMM_PFN_SHIFT) | HMM_PFN_VALID;
+	return (pfn << range->pfn_shift) |
+		range->flags[HMM_PFN_VALID];
 }
 
 
@@ -263,25 +340,6 @@ int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
 void hmm_mirror_unregister(struct hmm_mirror *mirror);
 
 
-/*
- * struct hmm_range - track invalidation lock on virtual address range
- *
- * @vma: the vm area struct for the range
- * @list: all range lock are on a list
- * @start: range virtual start address (inclusive)
- * @end: range virtual end address (exclusive)
- * @pfns: array of pfns (big enough for the range)
- * @valid: pfns array did not change since it has been fill by an HMM function
- */
-struct hmm_range {
-	struct vm_area_struct	*vma;
-	struct list_head	list;
-	unsigned long		start;
-	unsigned long		end;
-	uint64_t		*pfns;
-	bool			valid;
-};
-
 /*
  * To snapshot the CPU page table, call hmm_vma_get_pfns(), then take a device
  * driver lock that serializes device page table updates, then call
diff --git a/mm/hmm.c b/mm/hmm.c
index 956689804600..99473ca37b4c 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -282,6 +282,7 @@ static int hmm_vma_do_fault(struct mm_walk *walk, unsigned long addr,
 {
 	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_REMOTE;
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
 	struct vm_area_struct *vma = walk->vma;
 	int r;
 
@@ -291,7 +292,7 @@ static int hmm_vma_do_fault(struct mm_walk *walk, unsigned long addr,
 	if (r & VM_FAULT_RETRY)
 		return -EBUSY;
 	if (r & VM_FAULT_ERROR) {
-		*pfn = HMM_PFN_ERROR;
+		*pfn = range->values[HMM_PFN_ERROR];
 		return -EFAULT;
 	}
 
@@ -309,7 +310,7 @@ static int hmm_pfns_bad(unsigned long addr,
 
 	i = (addr - range->start) >> PAGE_SHIFT;
 	for (; addr < end; addr += PAGE_SIZE, i++)
-		pfns[i] = HMM_PFN_ERROR;
+		pfns[i] = range->values[HMM_PFN_ERROR];
 
 	return 0;
 }
@@ -338,7 +339,7 @@ static int hmm_vma_walk_hole_(unsigned long addr, unsigned long end,
 	hmm_vma_walk->last = addr;
 	i = (addr - range->start) >> PAGE_SHIFT;
 	for (; addr < end; addr += PAGE_SIZE, i++) {
-		pfns[i] = 0;
+		pfns[i] = range->values[HMM_PFN_NONE];
 		if (fault || write_fault) {
 			int ret;
 
@@ -356,24 +357,27 @@ static inline void hmm_pte_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
 				      uint64_t pfns, uint64_t cpu_flags,
 				      bool *fault, bool *write_fault)
 {
+	struct hmm_range *range = hmm_vma_walk->range;
+
 	*fault = *write_fault = false;
 	if (!hmm_vma_walk->fault)
 		return;
 
 	/* We aren't ask to do anything ... */
-	if (!(pfns & HMM_PFN_VALID))
+	if (!(pfns & range->flags[HMM_PFN_VALID]))
 		return;
 	/* If CPU page table is not valid then we need to fault */
-	*fault = cpu_flags & HMM_PFN_VALID;
+	*fault = cpu_flags & range->flags[HMM_PFN_VALID];
 	/* Need to write fault ? */
-	if ((pfns & HMM_PFN_WRITE) && !(cpu_flags & HMM_PFN_WRITE)) {
+	if ((pfns & range->flags[HMM_PFN_WRITE]) &&
+	    !(cpu_flags & range->flags[HMM_PFN_WRITE])) {
 		*fault = *write_fault = false;
 		return;
 	}
 	/* Do we fault on device memory ? */
-	if ((pfns & HMM_PFN_DEVICE_PRIVATE) &&
-	    (cpu_flags & HMM_PFN_DEVICE_PRIVATE)) {
-		*write_fault = pfns & HMM_PFN_WRITE;
+	if ((pfns & range->flags[HMM_PFN_DEVICE_PRIVATE]) &&
+	    (cpu_flags & range->flags[HMM_PFN_DEVICE_PRIVATE])) {
+		*write_fault = pfns & range->flags[HMM_PFN_WRITE];
 		*fault = true;
 	}
 }
@@ -415,13 +419,13 @@ static int hmm_vma_walk_hole(unsigned long addr, unsigned long end,
 	return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
 }
 
-static inline uint64_t pmd_to_hmm_pfn_flags(pmd_t pmd)
+static inline uint64_t pmd_to_hmm_pfn_flags(struct hmm_range *range, pmd_t pmd)
 {
 	if (pmd_protnone(pmd))
 		return 0;
-	return pmd_write(pmd) ? HMM_PFN_VALID |
-				HMM_PFN_WRITE :
-				HMM_PFN_VALID;
+	return pmd_write(pmd) ? range->flags[HMM_PFN_VALID] |
+				range->flags[HMM_PFN_WRITE] :
+				range->flags[HMM_PFN_VALID];
 }
 
 static int hmm_vma_handle_pmd(struct mm_walk *walk,
@@ -431,12 +435,13 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk,
 			      pmd_t pmd)
 {
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
 	unsigned long pfn, npages, i;
-	uint64_t flag = 0, cpu_flags;
 	bool fault, write_fault;
+	uint64_t cpu_flags;
 
 	npages = (end - addr) >> PAGE_SHIFT;
-	cpu_flags = pmd_to_hmm_pfn_flags(pmd);
+	cpu_flags = pmd_to_hmm_pfn_flags(range, pmd);
 	hmm_range_need_fault(hmm_vma_walk, pfns, npages, cpu_flags,
 			     &fault, &write_fault);
 
@@ -444,20 +449,19 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk,
 		return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
 
 	pfn = pmd_pfn(pmd) + pte_index(addr);
-	flag |= pmd_write(pmd) ? HMM_PFN_WRITE : 0;
 	for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++)
-		pfns[i] = hmm_pfn_from_pfn(pfn) | flag;
+		pfns[i] = hmm_pfn_from_pfn(range, pfn) | cpu_flags;
 	hmm_vma_walk->last = end;
 	return 0;
 }
 
-static inline uint64_t pte_to_hmm_pfn_flags(pte_t pte)
+static inline uint64_t pte_to_hmm_pfn_flags(struct hmm_range *range, pte_t pte)
 {
 	if (pte_none(pte) || !pte_present(pte))
 		return 0;
-	return pte_write(pte) ? HMM_PFN_VALID |
-				HMM_PFN_WRITE :
-				HMM_PFN_VALID;
+	return pte_write(pte) ? range->flags[HMM_PFN_VALID] |
+				range->flags[HMM_PFN_WRITE] :
+				range->flags[HMM_PFN_VALID];
 }
 
 static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
@@ -465,18 +469,18 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 			      uint64_t *pfns)
 {
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
 	struct vm_area_struct *vma = walk->vma;
 	bool fault, write_fault;
 	uint64_t cpu_flags;
 	pte_t pte = *ptep;
 
-	*pfns = 0;
-	cpu_flags = pte_to_hmm_pfn_flags(pte);
+	*pfns = range->values[HMM_PFN_NONE];
+	cpu_flags = pte_to_hmm_pfn_flags(range, pte);
 	hmm_pte_need_fault(hmm_vma_walk, *pfns, cpu_flags,
 			   &fault, &write_fault);
 
 	if (pte_none(pte)) {
-		*pfns = 0;
 		if (fault || write_fault)
 			goto fault;
 		return 0;
@@ -496,11 +500,16 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 		 * device and report anything else as error.
 		 */
 		if (is_device_private_entry(entry)) {
-			cpu_flags = HMM_PFN_VALID | HMM_PFN_DEVICE_PRIVATE;
+			cpu_flags = range->flags[HMM_PFN_VALID] |
+				range->flags[HMM_PFN_DEVICE_PRIVATE];
 			cpu_flags |= is_write_device_private_entry(entry) ?
-					HMM_PFN_WRITE : 0;
-			*pfns = hmm_pfn_from_pfn(swp_offset(entry));
-			*pfns |= HMM_PFN_DEVICE_PRIVATE;
+				range->flags[HMM_PFN_WRITE] : 0;
+			hmm_pte_need_fault(hmm_vma_walk, *pfns, cpu_flags,
+					   &fault, &write_fault);
+			if (fault || write_fault)
+				goto fault;
+			*pfns = hmm_pfn_from_pfn(range, swp_offset(entry));
+			*pfns |= cpu_flags;
 			return 0;
 		}
 
@@ -516,14 +525,14 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 		}
 
 		/* Report error for everything else */
-		*pfns = HMM_PFN_ERROR;
+		*pfns = range->values[HMM_PFN_ERROR];
 		return -EFAULT;
 	}
 
 	if (fault || write_fault)
 		goto fault;
 
-	*pfns = hmm_pfn_from_pfn(pte_pfn(pte)) | cpu_flags;
+	*pfns = hmm_pfn_from_pfn(range, pte_pfn(pte)) | cpu_flags;
 	return 0;
 
 fault:
@@ -592,12 +601,13 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 	return 0;
 }
 
-static void hmm_pfns_clear(uint64_t *pfns,
+static void hmm_pfns_clear(struct hmm_range *range,
+			   uint64_t *pfns,
 			   unsigned long addr,
 			   unsigned long end)
 {
 	for (; addr < end; addr += PAGE_SIZE, pfns++)
-		*pfns = 0;
+		*pfns = range->values[HMM_PFN_NONE];
 }
 
 static void hmm_pfns_special(struct hmm_range *range)
@@ -605,7 +615,7 @@ static void hmm_pfns_special(struct hmm_range *range)
 	unsigned long addr = range->start, i = 0;
 
 	for (; addr < range->end; addr += PAGE_SIZE, i++)
-		range->pfns[i] = HMM_PFN_SPECIAL;
+		range->pfns[i] = range->values[HMM_PFN_SPECIAL];
 }
 
 /*
@@ -658,7 +668,7 @@ int hmm_vma_get_pfns(struct hmm_range *range)
 		 * write without read access are not supported by HMM, because
 		 * operations such has atomic access would not work.
 		 */
-		hmm_pfns_clear(range->pfns, range->start, range->end);
+		hmm_pfns_clear(range, range->pfns, range->start, range->end);
 		return -EPERM;
 	}
 
@@ -811,7 +821,7 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
 
 	hmm = hmm_register(vma->vm_mm);
 	if (!hmm) {
-		hmm_pfns_clear(range->pfns, range->start, range->end);
+		hmm_pfns_clear(range, range->pfns, range->start, range->end);
 		return -ENOMEM;
 	}
 	/* Caller must have registered a mirror using hmm_mirror_register() */
@@ -831,7 +841,7 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
 		 * write without read access are not supported by HMM, because
 		 * operations such has atomic access would not work.
 		 */
-		hmm_pfns_clear(range->pfns, range->start, range->end);
+		hmm_pfns_clear(range, range->pfns, range->start, range->end);
 		return -EPERM;
 	}
 
@@ -864,7 +874,8 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
 		unsigned long i;
 
 		i = (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
-		hmm_pfns_clear(&range->pfns[i], hmm_vma_walk.last, range->end);
+		hmm_pfns_clear(range, &range->pfns[i], hmm_vma_walk.last,
+			       range->end);
 		hmm_vma_range_done(range);
 	}
 	return ret;
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH 15/15] mm/hmm: use device driver encoding for HMM pfn v2
@ 2018-03-20  2:00   ` jglisse
  0 siblings, 0 replies; 99+ messages in thread
From: jglisse @ 2018-03-20  2:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Evgeny Baskakov, Ralph Campbell, Mark Hairgrove, John Hubbard

From: Jérôme Glisse <jglisse@redhat.com>

Users of hmm_vma_fault() and hmm_vma_get_pfns() now provide a flags
array and a pfn shift value, allowing them to define their own encoding
for the HMM pfns that are filled into the pfns array of the hmm_range
struct. With this, device drivers can get pfns that match their own
private encoding out of HMM without having to do any conversion.

Changed since v1:
  - Split flags and special values for clarification
  - Improved comments and provide examples

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/hmm.h | 130 +++++++++++++++++++++++++++++++++++++---------------
 mm/hmm.c            |  85 +++++++++++++++++++---------------
 2 files changed, 142 insertions(+), 73 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 0f7ea3074175..5d26e0a223d9 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -80,68 +80,145 @@
 struct hmm;
 
 /*
+ * hmm_pfn_flag_e - HMM flag enums
+ *
  * Flags:
  * HMM_PFN_VALID: pfn is valid. It has, at least, read permission.
  * HMM_PFN_WRITE: CPU page table has write permission set
+ * HMM_PFN_DEVICE_PRIVATE: private device memory (ZONE_DEVICE)
+ *
+ * The driver provide a flags array, if driver valid bit for an entry is bit
+ * 3 ie (entry & (1 << 3)) is true if entry is valid then driver must provide
+ * an array in hmm_range.flags with hmm_range.flags[HMM_PFN_VALID] == 1 << 3.
+ * Same logic apply to all flags. This is same idea as vm_page_prot in vma
+ * except that this is per device driver rather than per architecture.
+ */
+enum hmm_pfn_flag_e {
+	HMM_PFN_VALID = 0,
+	HMM_PFN_WRITE,
+	HMM_PFN_DEVICE_PRIVATE,
+	HMM_PFN_FLAG_MAX
+};
+
+/*
+ * hmm_pfn_value_e - HMM pfn special value
+ *
+ * Flags:
  * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned memory
+ * HMM_PFN_NONE: corresponding CPU page table entry is pte_none()
  * HMM_PFN_SPECIAL: corresponding CPU page table entry is special; i.e., the
  *      result of vm_insert_pfn() or vm_insert_page(). Therefore, it should not
  *      be mirrored by a device, because the entry will never have HMM_PFN_VALID
  *      set and the pfn value is undefined.
- * HMM_PFN_DEVICE_PRIVATE: unaddressable device memory (ZONE_DEVICE)
+ *
+ * Driver provide entry value for none entry, error entry and special entry,
+ * driver can alias (ie use same value for error and special for instance). It
+ * should not alias none and error or special.
+ *
+ * HMM pfn value returned by hmm_vma_get_pfns() or hmm_vma_fault() will be:
+ * hmm_range.values[HMM_PFN_ERROR] if CPU page table entry is poisonous,
+ * hmm_range.values[HMM_PFN_NONE] if there is no CPU page table
+ * hmm_range.values[HMM_PFN_SPECIAL] if CPU page table entry is a special one
  */
-#define HMM_PFN_VALID (1 << 0)
-#define HMM_PFN_WRITE (1 << 1)
-#define HMM_PFN_ERROR (1 << 2)
-#define HMM_PFN_SPECIAL (1 << 3)
-#define HMM_PFN_DEVICE_PRIVATE (1 << 4)
-#define HMM_PFN_SHIFT 5
+enum hmm_pfn_value_e {
+	HMM_PFN_ERROR,
+	HMM_PFN_NONE,
+	HMM_PFN_SPECIAL,
+	HMM_PFN_VALUE_MAX
+};
+
+/*
+ * struct hmm_range - track invalidation lock on virtual address range
+ *
+ * @vma: the vm area struct for the range
+ * @list: all range lock are on a list
+ * @start: range virtual start address (inclusive)
+ * @end: range virtual end address (exclusive)
+ * @pfns: array of pfns (big enough for the range)
+ * @flags: pfn flags to match device driver page table
+ * @values: pfn value for some special case (none, special, error, ...)
+ * @pfn_shifts: pfn shift value (should be <= PAGE_SHIFT)
+ * @valid: pfns array did not change since it has been fill by an HMM function
+ */
+struct hmm_range {
+	struct vm_area_struct	*vma;
+	struct list_head	list;
+	unsigned long		start;
+	unsigned long		end;
+	uint64_t		*pfns;
+	const uint64_t		*flags;
+	const uint64_t		*values;
+	uint8_t			pfn_shift;
+	bool			valid;
+};
 
 /*
  * hmm_pfn_to_page() - return struct page pointed to by a valid HMM pfn
+ * @range: range use to decode HMM pfn value
  * @pfn: HMM pfn value to get corresponding struct page from
  * Returns: struct page pointer if pfn is a valid HMM pfn, NULL otherwise
  *
  * If the HMM pfn is valid (ie valid flag set) then return the struct page
  * matching the pfn value stored in the HMM pfn. Otherwise return NULL.
  */
-static inline struct page *hmm_pfn_to_page(uint64_t pfn)
+static inline struct page *hmm_pfn_to_page(const struct hmm_range *range,
+					   uint64_t pfn)
 {
-	if (!(pfn & HMM_PFN_VALID))
+	if (pfn == range->values[HMM_PFN_NONE])
+		return NULL;
+	if (pfn == range->values[HMM_PFN_ERROR])
+		return NULL;
+	if (pfn == range->values[HMM_PFN_SPECIAL])
 		return NULL;
-	return pfn_to_page(pfn >> HMM_PFN_SHIFT);
+	if (!(pfn & range->flags[HMM_PFN_VALID]))
+		return NULL;
+	return pfn_to_page(pfn >> range->pfn_shift);
 }
 
 /*
  * hmm_pfn_to_pfn() - return pfn value store in a HMM pfn
+ * @range: range use to decode HMM pfn value
  * @pfn: HMM pfn value to extract pfn from
  * Returns: pfn value if HMM pfn is valid, -1UL otherwise
  */
-static inline unsigned long hmm_pfn_to_pfn(uint64_t pfn)
+static inline unsigned long hmm_pfn_to_pfn(const struct hmm_range *range,
+					   uint64_t pfn)
 {
-	if (!(pfn & HMM_PFN_VALID))
+	if (pfn == range->values[HMM_PFN_NONE])
+		return -1UL;
+	if (pfn == range->values[HMM_PFN_ERROR])
+		return -1UL;
+	if (pfn == range->values[HMM_PFN_SPECIAL])
 		return -1UL;
-	return (pfn >> HMM_PFN_SHIFT);
+	if (!(pfn & range->flags[HMM_PFN_VALID]))
+		return -1UL;
+	return (pfn >> range->pfn_shift);
 }
 
 /*
  * hmm_pfn_from_page() - create a valid HMM pfn value from struct page
+ * @range: range use to encode HMM pfn value
  * @page: struct page pointer for which to create the HMM pfn
  * Returns: valid HMM pfn for the page
  */
-static inline uint64_t hmm_pfn_from_page(struct page *page)
+static inline uint64_t hmm_pfn_from_page(const struct hmm_range *range,
+					 struct page *page)
 {
-	return (page_to_pfn(page) << HMM_PFN_SHIFT) | HMM_PFN_VALID;
+	return (page_to_pfn(page) << range->pfn_shift) |
+		range->flags[HMM_PFN_VALID];
 }
 
 /*
  * hmm_pfn_from_pfn() - create a valid HMM pfn value from pfn
+ * @range: range use to encode HMM pfn value
  * @pfn: pfn value for which to create the HMM pfn
  * Returns: valid HMM pfn for the pfn
  */
-static inline uint64_t hmm_pfn_from_pfn(unsigned long pfn)
+static inline uint64_t hmm_pfn_from_pfn(const struct hmm_range *range,
+					unsigned long pfn)
 {
-	return (pfn << HMM_PFN_SHIFT) | HMM_PFN_VALID;
+	return (pfn << range->pfn_shift) |
+		range->flags[HMM_PFN_VALID];
 }
 
 
@@ -263,25 +340,6 @@ int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
 void hmm_mirror_unregister(struct hmm_mirror *mirror);
 
 
-/*
- * struct hmm_range - track invalidation lock on virtual address range
- *
- * @vma: the vm area struct for the range
- * @list: all range lock are on a list
- * @start: range virtual start address (inclusive)
- * @end: range virtual end address (exclusive)
- * @pfns: array of pfns (big enough for the range)
- * @valid: pfns array did not change since it has been fill by an HMM function
- */
-struct hmm_range {
-	struct vm_area_struct	*vma;
-	struct list_head	list;
-	unsigned long		start;
-	unsigned long		end;
-	uint64_t		*pfns;
-	bool			valid;
-};
-
 /*
  * To snapshot the CPU page table, call hmm_vma_get_pfns(), then take a device
  * driver lock that serializes device page table updates, then call
diff --git a/mm/hmm.c b/mm/hmm.c
index 956689804600..99473ca37b4c 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -282,6 +282,7 @@ static int hmm_vma_do_fault(struct mm_walk *walk, unsigned long addr,
 {
 	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_REMOTE;
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
 	struct vm_area_struct *vma = walk->vma;
 	int r;
 
@@ -291,7 +292,7 @@ static int hmm_vma_do_fault(struct mm_walk *walk, unsigned long addr,
 	if (r & VM_FAULT_RETRY)
 		return -EBUSY;
 	if (r & VM_FAULT_ERROR) {
-		*pfn = HMM_PFN_ERROR;
+		*pfn = range->values[HMM_PFN_ERROR];
 		return -EFAULT;
 	}
 
@@ -309,7 +310,7 @@ static int hmm_pfns_bad(unsigned long addr,
 
 	i = (addr - range->start) >> PAGE_SHIFT;
 	for (; addr < end; addr += PAGE_SIZE, i++)
-		pfns[i] = HMM_PFN_ERROR;
+		pfns[i] = range->values[HMM_PFN_ERROR];
 
 	return 0;
 }
@@ -338,7 +339,7 @@ static int hmm_vma_walk_hole_(unsigned long addr, unsigned long end,
 	hmm_vma_walk->last = addr;
 	i = (addr - range->start) >> PAGE_SHIFT;
 	for (; addr < end; addr += PAGE_SIZE, i++) {
-		pfns[i] = 0;
+		pfns[i] = range->values[HMM_PFN_NONE];
 		if (fault || write_fault) {
 			int ret;
 
@@ -356,24 +357,27 @@ static inline void hmm_pte_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
 				      uint64_t pfns, uint64_t cpu_flags,
 				      bool *fault, bool *write_fault)
 {
+	struct hmm_range *range = hmm_vma_walk->range;
+
 	*fault = *write_fault = false;
 	if (!hmm_vma_walk->fault)
 		return;
 
 	/* We aren't ask to do anything ... */
-	if (!(pfns & HMM_PFN_VALID))
+	if (!(pfns & range->flags[HMM_PFN_VALID]))
 		return;
 	/* If CPU page table is not valid then we need to fault */
-	*fault = cpu_flags & HMM_PFN_VALID;
+	*fault = cpu_flags & range->flags[HMM_PFN_VALID];
 	/* Need to write fault ? */
-	if ((pfns & HMM_PFN_WRITE) && !(cpu_flags & HMM_PFN_WRITE)) {
+	if ((pfns & range->flags[HMM_PFN_WRITE]) &&
+	    !(cpu_flags & range->flags[HMM_PFN_WRITE])) {
 		*fault = *write_fault = false;
 		return;
 	}
 	/* Do we fault on device memory ? */
-	if ((pfns & HMM_PFN_DEVICE_PRIVATE) &&
-	    (cpu_flags & HMM_PFN_DEVICE_PRIVATE)) {
-		*write_fault = pfns & HMM_PFN_WRITE;
+	if ((pfns & range->flags[HMM_PFN_DEVICE_PRIVATE]) &&
+	    (cpu_flags & range->flags[HMM_PFN_DEVICE_PRIVATE])) {
+		*write_fault = pfns & range->flags[HMM_PFN_WRITE];
 		*fault = true;
 	}
 }
@@ -415,13 +419,13 @@ static int hmm_vma_walk_hole(unsigned long addr, unsigned long end,
 	return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
 }
 
-static inline uint64_t pmd_to_hmm_pfn_flags(pmd_t pmd)
+static inline uint64_t pmd_to_hmm_pfn_flags(struct hmm_range *range, pmd_t pmd)
 {
 	if (pmd_protnone(pmd))
 		return 0;
-	return pmd_write(pmd) ? HMM_PFN_VALID |
-				HMM_PFN_WRITE :
-				HMM_PFN_VALID;
+	return pmd_write(pmd) ? range->flags[HMM_PFN_VALID] |
+				range->flags[HMM_PFN_WRITE] :
+				range->flags[HMM_PFN_VALID];
 }
 
 static int hmm_vma_handle_pmd(struct mm_walk *walk,
@@ -431,12 +435,13 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk,
 			      pmd_t pmd)
 {
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
 	unsigned long pfn, npages, i;
-	uint64_t flag = 0, cpu_flags;
 	bool fault, write_fault;
+	uint64_t cpu_flags;
 
 	npages = (end - addr) >> PAGE_SHIFT;
-	cpu_flags = pmd_to_hmm_pfn_flags(pmd);
+	cpu_flags = pmd_to_hmm_pfn_flags(range, pmd);
 	hmm_range_need_fault(hmm_vma_walk, pfns, npages, cpu_flags,
 			     &fault, &write_fault);
 
@@ -444,20 +449,19 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk,
 		return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
 
 	pfn = pmd_pfn(pmd) + pte_index(addr);
-	flag |= pmd_write(pmd) ? HMM_PFN_WRITE : 0;
 	for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++)
-		pfns[i] = hmm_pfn_from_pfn(pfn) | flag;
+		pfns[i] = hmm_pfn_from_pfn(range, pfn) | cpu_flags;
 	hmm_vma_walk->last = end;
 	return 0;
 }
 
-static inline uint64_t pte_to_hmm_pfn_flags(pte_t pte)
+static inline uint64_t pte_to_hmm_pfn_flags(struct hmm_range *range, pte_t pte)
 {
 	if (pte_none(pte) || !pte_present(pte))
 		return 0;
-	return pte_write(pte) ? HMM_PFN_VALID |
-				HMM_PFN_WRITE :
-				HMM_PFN_VALID;
+	return pte_write(pte) ? range->flags[HMM_PFN_VALID] |
+				range->flags[HMM_PFN_WRITE] :
+				range->flags[HMM_PFN_VALID];
 }
 
 static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
@@ -465,18 +469,18 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 			      uint64_t *pfns)
 {
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
 	struct vm_area_struct *vma = walk->vma;
 	bool fault, write_fault;
 	uint64_t cpu_flags;
 	pte_t pte = *ptep;
 
-	*pfns = 0;
-	cpu_flags = pte_to_hmm_pfn_flags(pte);
+	*pfns = range->values[HMM_PFN_NONE];
+	cpu_flags = pte_to_hmm_pfn_flags(range, pte);
 	hmm_pte_need_fault(hmm_vma_walk, *pfns, cpu_flags,
 			   &fault, &write_fault);
 
 	if (pte_none(pte)) {
-		*pfns = 0;
 		if (fault || write_fault)
 			goto fault;
 		return 0;
@@ -496,11 +500,16 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 		 * device and report anything else as error.
 		 */
 		if (is_device_private_entry(entry)) {
-			cpu_flags = HMM_PFN_VALID | HMM_PFN_DEVICE_PRIVATE;
+			cpu_flags = range->flags[HMM_PFN_VALID] |
+				range->flags[HMM_PFN_DEVICE_PRIVATE];
 			cpu_flags |= is_write_device_private_entry(entry) ?
-					HMM_PFN_WRITE : 0;
-			*pfns = hmm_pfn_from_pfn(swp_offset(entry));
-			*pfns |= HMM_PFN_DEVICE_PRIVATE;
+				range->flags[HMM_PFN_WRITE] : 0;
+			hmm_pte_need_fault(hmm_vma_walk, *pfns, cpu_flags,
+					   &fault, &write_fault);
+			if (fault || write_fault)
+				goto fault;
+			*pfns = hmm_pfn_from_pfn(range, swp_offset(entry));
+			*pfns |= cpu_flags;
 			return 0;
 		}
 
@@ -516,14 +525,14 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 		}
 
 		/* Report error for everything else */
-		*pfns = HMM_PFN_ERROR;
+		*pfns = range->values[HMM_PFN_ERROR];
 		return -EFAULT;
 	}
 
 	if (fault || write_fault)
 		goto fault;
 
-	*pfns = hmm_pfn_from_pfn(pte_pfn(pte)) | cpu_flags;
+	*pfns = hmm_pfn_from_pfn(range, pte_pfn(pte)) | cpu_flags;
 	return 0;
 
 fault:
@@ -592,12 +601,13 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 	return 0;
 }
 
-static void hmm_pfns_clear(uint64_t *pfns,
+static void hmm_pfns_clear(struct hmm_range *range,
+			   uint64_t *pfns,
 			   unsigned long addr,
 			   unsigned long end)
 {
 	for (; addr < end; addr += PAGE_SIZE, pfns++)
-		*pfns = 0;
+		*pfns = range->values[HMM_PFN_NONE];
 }
 
 static void hmm_pfns_special(struct hmm_range *range)
@@ -605,7 +615,7 @@ static void hmm_pfns_special(struct hmm_range *range)
 	unsigned long addr = range->start, i = 0;
 
 	for (; addr < range->end; addr += PAGE_SIZE, i++)
-		range->pfns[i] = HMM_PFN_SPECIAL;
+		range->pfns[i] = range->values[HMM_PFN_SPECIAL];
 }
 
 /*
@@ -658,7 +668,7 @@ int hmm_vma_get_pfns(struct hmm_range *range)
 		 * write without read access are not supported by HMM, because
 		 * operations such has atomic access would not work.
 		 */
-		hmm_pfns_clear(range->pfns, range->start, range->end);
+		hmm_pfns_clear(range, range->pfns, range->start, range->end);
 		return -EPERM;
 	}
 
@@ -811,7 +821,7 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
 
 	hmm = hmm_register(vma->vm_mm);
 	if (!hmm) {
-		hmm_pfns_clear(range->pfns, range->start, range->end);
+		hmm_pfns_clear(range, range->pfns, range->start, range->end);
 		return -ENOMEM;
 	}
 	/* Caller must have registered a mirror using hmm_mirror_register() */
@@ -831,7 +841,7 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
 		 * write without read access are not supported by HMM, because
 		 * operations such has atomic access would not work.
 		 */
-		hmm_pfns_clear(range->pfns, range->start, range->end);
+		hmm_pfns_clear(range, range->pfns, range->start, range->end);
 		return -EPERM;
 	}
 
@@ -864,7 +874,8 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
 		unsigned long i;
 
 		i = (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
-		hmm_pfns_clear(&range->pfns[i], hmm_vma_walk.last, range->end);
+		hmm_pfns_clear(range, &range->pfns[i], hmm_vma_walk.last,
+			       range->end);
 		hmm_vma_range_done(range);
 	}
 	return ret;
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* Re: [PATCH 03/15] mm/hmm: HMM should have a callback before MM is destroyed v2
  2018-03-20  2:00   ` jglisse
@ 2018-03-21  4:14     ` John Hubbard
  -1 siblings, 0 replies; 99+ messages in thread
From: John Hubbard @ 2018-03-21  4:14 UTC (permalink / raw)
  To: jglisse, linux-mm
  Cc: Andrew Morton, linux-kernel, Ralph Campbell, stable,
	Evgeny Baskakov, Mark Hairgrove

On 03/19/2018 07:00 PM, jglisse@redhat.com wrote:
> From: Ralph Campbell <rcampbell@nvidia.com>
> 
> The hmm_mirror_register() function registers a callback for when
> the CPU pagetable is modified. Normally, the device driver will
> call hmm_mirror_unregister() when the process using the device is
> finished. However, if the process exits uncleanly, the struct_mm
> can be destroyed with no warning to the device driver.
> 
> Changed since v1:
>   - dropped VM_BUG_ON()
>   - cc stable
> 
> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: stable@vger.kernel.org
> Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
> Cc: Mark Hairgrove <mhairgrove@nvidia.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> ---
>  include/linux/hmm.h | 10 ++++++++++
>  mm/hmm.c            | 18 +++++++++++++++++-
>  2 files changed, 27 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 36dd21fe5caf..fa7b51f65905 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -218,6 +218,16 @@ enum hmm_update_type {
>   * @update: callback to update range on a device
>   */
>  struct hmm_mirror_ops {
> +	/* release() - release hmm_mirror
> +	 *
> +	 * @mirror: pointer to struct hmm_mirror
> +	 *
> +	 * This is called when the mm_struct is being released.
> +	 * The callback should make sure no references to the mirror occur
> +	 * after the callback returns.
> +	 */
> +	void (*release)(struct hmm_mirror *mirror);
> +
>  	/* sync_cpu_device_pagetables() - synchronize page tables
>  	 *
>  	 * @mirror: pointer to struct hmm_mirror
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 320545b98ff5..6088fa6ed137 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -160,6 +160,21 @@ static void hmm_invalidate_range(struct hmm *hmm,
>  	up_read(&hmm->mirrors_sem);
>  }
>  
> +static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> +	struct hmm *hmm = mm->hmm;
> +	struct hmm_mirror *mirror;
> +	struct hmm_mirror *mirror_next;
> +
> +	down_write(&hmm->mirrors_sem);
> +	list_for_each_entry_safe(mirror, mirror_next, &hmm->mirrors, list) {
> +		list_del_init(&mirror->list);
> +		if (mirror->ops->release)
> +			mirror->ops->release(mirror);

Hi Jerome,

This presents a deadlock problem (details below). As for solution ideas, 
Mark Hairgrove points out that the MMU notifiers had to solve the
same sort of problem, and part of the solution involves "avoid
holding locks when issuing these callbacks". That's not an entire 
solution description, of course, but it seems like a good start.

Anyway, for the deadlock problem:

Each of these ->release callbacks potentially has to wait for the
hmm_invalidate_range() callbacks to finish. That is not shown directly
in any code, but it follows from the fact that when a device driver is
processing the above ->release callback, it has to allow any in-progress
operations to finish up (as specified clearly in your comment
documentation above).

Some of those operations will invariably need to do things that result 
in page invalidations, thus triggering the hmm_invalidate_range() callback.
Then, the hmm_invalidate_range() callback tries to acquire the same 
hmm->mirrors_sem lock, thus leading to deadlock:

hmm_invalidate_range():
// ...
	down_read(&hmm->mirrors_sem);
	list_for_each_entry(mirror, &hmm->mirrors, list)
		mirror->ops->sync_cpu_device_pagetables(mirror, action,
							start, end);
	up_read(&hmm->mirrors_sem);
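
To make the cycle explicit, here is how I read the call chain (the
driver-side step in the middle is hypothetical, just a stand-in for
whatever in-flight work the driver has to flush):

	hmm_release()                          /* mmu_notifier ->release */
	  down_write(&hmm->mirrors_sem)
	  mirror->ops->release(mirror)         /* driver callback */
	    /* driver waits for in-flight work; that work unmaps pages */
	      hmm_invalidate_range()
	        down_read(&hmm->mirrors_sem)   /* blocks: writer holds it */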

thanks,
--
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 04/15] mm/hmm: unregister mmu_notifier when last HMM client quit
  2018-03-20  2:00   ` jglisse
@ 2018-03-21  4:24     ` John Hubbard
  -1 siblings, 0 replies; 99+ messages in thread
From: John Hubbard @ 2018-03-21  4:24 UTC (permalink / raw)
  To: jglisse, linux-mm
  Cc: Andrew Morton, linux-kernel, Evgeny Baskakov, Ralph Campbell,
	Mark Hairgrove

On 03/19/2018 07:00 PM, jglisse@redhat.com wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> This code was lost in translation at one point. This properly call
> mmu_notifier_unregister_no_release() once last user is gone. This
> fix the zombie mm_struct as without this patch we do not drop the
> refcount we have on it.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Mark Hairgrove <mhairgrove@nvidia.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> ---
>  mm/hmm.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 6088fa6ed137..667944630dc9 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -244,10 +244,29 @@ EXPORT_SYMBOL(hmm_mirror_register);
>  void hmm_mirror_unregister(struct hmm_mirror *mirror)
>  {
>  	struct hmm *hmm = mirror->hmm;
> +	struct mm_struct *mm = NULL;
> +	bool unregister = false;
>  
>  	down_write(&hmm->mirrors_sem);
>  	list_del_init(&mirror->list);
> +	unregister = list_empty(&hmm->mirrors);

Hi Jerome,

This first minor point may be irrelevant, depending on how you fix
the other problem below, but here is a tiny naming idea: rename
unregister to either "should_unregister" or "mirror_snapshot_empty";
the latter helps show that this is stale information once the lock is
dropped.

>  	up_write(&hmm->mirrors_sem);
> +
> +	if (!unregister)
> +		return;

Whee, here I am, lock-free, in the middle of a race condition
window. :)  Right here, someone (hmm_mirror_register) could be adding
another mirror.

It's not immediately clear to me what the best solution is.
I'd be happier if we didn't have to drop one lock and take
another like this, but if we do, then maybe rechecking that
the list hasn't changed...safely, somehow, is a way forward here.
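
Just to sketch the shape of that re-check (untested, and it assumes
hmm_mirror_register() keeps taking mirrors_sem for write; where exactly
the teardown then has to happen is precisely the open question):

	down_write(&hmm->mirrors_sem);
	if (!list_empty(&hmm->mirrors)) {
		/* a new mirror showed up in the unlocked window: keep hmm */
		up_write(&hmm->mirrors_sem);
		return;
	}
	/*
	 * Still empty here. The mm->hmm detach (and probably the
	 * unregister/kfree as well) would have to happen before this
	 * semaphore is dropped, or the same window reopens.
	 */
	up_write(&hmm->mirrors_sem);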


> +
> +	spin_lock(&hmm->mm->page_table_lock);
> +	if (hmm->mm->hmm == hmm) {
> +		mm = hmm->mm;
> +		mm->hmm = NULL;
> +	}
> +	spin_unlock(&hmm->mm->page_table_lock);
> +
> +	if (mm == NULL)
> +		return;
> +
> +	mmu_notifier_unregister_no_release(&hmm->mmu_notifier, mm);
> +	kfree(hmm);
>  }
>  EXPORT_SYMBOL(hmm_mirror_unregister);
>  

thanks,
-- 
John Hubbard
NVIDIA
 

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 15/15] mm/hmm: use device driver encoding for HMM pfn v2
  2018-03-20  2:00   ` jglisse
@ 2018-03-21  4:39     ` John Hubbard
  -1 siblings, 0 replies; 99+ messages in thread
From: John Hubbard @ 2018-03-21  4:39 UTC (permalink / raw)
  To: jglisse, linux-mm
  Cc: Andrew Morton, linux-kernel, Evgeny Baskakov, Ralph Campbell,
	Mark Hairgrove

On 03/19/2018 07:00 PM, jglisse@redhat.com wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> User of hmm_vma_fault() and hmm_vma_get_pfns() provide a flags array
> and pfn shift value allowing them to define their own encoding for HMM
> pfn that are fill inside the pfns array of the hmm_range struct. With
> this device driver can get pfn that match their own private encoding
> out of HMM without having to do any conversion.
> 
> Changed since v1:
>   - Split flags and special values for clarification
>   - Improved comments and provide examples
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Mark Hairgrove <mhairgrove@nvidia.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> ---
>  include/linux/hmm.h | 130 +++++++++++++++++++++++++++++++++++++---------------
>  mm/hmm.c            |  85 +++++++++++++++++++---------------
>  2 files changed, 142 insertions(+), 73 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 0f7ea3074175..5d26e0a223d9 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -80,68 +80,145 @@
>  struct hmm;
>  
>  /*
> + * hmm_pfn_flag_e - HMM flag enums
> + *
>   * Flags:
>   * HMM_PFN_VALID: pfn is valid. It has, at least, read permission.
>   * HMM_PFN_WRITE: CPU page table has write permission set
> + * HMM_PFN_DEVICE_PRIVATE: private device memory (ZONE_DEVICE)
> + *
> + * The driver provide a flags array, if driver valid bit for an entry is bit
> + * 3 ie (entry & (1 << 3)) is true if entry is valid then driver must provide
> + * an array in hmm_range.flags with hmm_range.flags[HMM_PFN_VALID] == 1 << 3.
> + * Same logic apply to all flags. This is same idea as vm_page_prot in vma
> + * except that this is per device driver rather than per architecture.

Hi Jerome,

If we go with this approach--and I hope not, I'll try to talk you down from the
ledge, in a moment--then maybe we should add the following to the comments: 

"There is only one bit ever set in each hmm_range.flags[entry]." 

Or maybe we'll get pushback that the code shows that already, but IMHO
this is a strange way to do things (especially when there is a much
easier way), and it deserves that extra bit of helpful documentation.
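
To make that constraint concrete, the driver-side table under this
scheme would presumably look something like the following (the "mydrv"
prefix and bit positions are made up for illustration):

	static const uint64_t mydrv_hmm_flags[HMM_PFN_FLAG_MAX] = {
		[HMM_PFN_VALID]          = 1UL << 0,
		[HMM_PFN_WRITE]          = 1UL << 1,
		[HMM_PFN_DEVICE_PRIVATE] = 1UL << 2,
	};

with the generic code then testing entry & range->flags[HMM_PFN_WRITE]
and so on, exactly one bit per slot.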

More below...

> + */
> +enum hmm_pfn_flag_e {
> +	HMM_PFN_VALID = 0,
> +	HMM_PFN_WRITE,
> +	HMM_PFN_DEVICE_PRIVATE,
> +	HMM_PFN_FLAG_MAX
> +};
> +
> +/*
> + * hmm_pfn_value_e - HMM pfn special value
> + *
> + * Flags:
>   * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned memory
> + * HMM_PFN_NONE: corresponding CPU page table entry is pte_none()
>   * HMM_PFN_SPECIAL: corresponding CPU page table entry is special; i.e., the
>   *      result of vm_insert_pfn() or vm_insert_page(). Therefore, it should not
>   *      be mirrored by a device, because the entry will never have HMM_PFN_VALID
>   *      set and the pfn value is undefined.
> - * HMM_PFN_DEVICE_PRIVATE: unaddressable device memory (ZONE_DEVICE)
> + *
> + * Driver provide entry value for none entry, error entry and special entry,
> + * driver can alias (ie use same value for error and special for instance). It
> + * should not alias none and error or special.
> + *
> + * HMM pfn value returned by hmm_vma_get_pfns() or hmm_vma_fault() will be:
> + * hmm_range.values[HMM_PFN_ERROR] if CPU page table entry is poisonous,
> + * hmm_range.values[HMM_PFN_NONE] if there is no CPU page table
> + * hmm_range.values[HMM_PFN_SPECIAL] if CPU page table entry is a special one
>   */
> -#define HMM_PFN_VALID (1 << 0)
> -#define HMM_PFN_WRITE (1 << 1)
> -#define HMM_PFN_ERROR (1 << 2)
> -#define HMM_PFN_SPECIAL (1 << 3)
> -#define HMM_PFN_DEVICE_PRIVATE (1 << 4)
> -#define HMM_PFN_SHIFT 5
> +enum hmm_pfn_value_e {
> +	HMM_PFN_ERROR,
> +	HMM_PFN_NONE,
> +	HMM_PFN_SPECIAL,
> +	HMM_PFN_VALUE_MAX
> +};

I can think of perhaps two good solid ways to get what you want, without
moving to what I consider an unnecessary excursion into arrays of flags. 
If I understand correctly, you want to let each architecture
specify which bit to use for each of the above HMM_PFN_* flags. 

The way you have it now, the code does things like this:

        cpu_flags & range->flags[HMM_PFN_WRITE]

but that array entry is mostly empty space, and it's confusing. It would
be nicer to see:

        cpu_flags & HMM_PFN_WRITE

...which you can easily do, by defining HMM_PFN_WRITE and friends in an
arch-specific header file.
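
i.e. something along these lines (purely illustrative, the header name
and bit positions are invented):

	/* hypothetical <asm/hmm_pfn.h> */
	#define HMM_PFN_VALID           (1UL << 0)
	#define HMM_PFN_WRITE           (1UL << 1)
	#define HMM_PFN_DEVICE_PRIVATE  (1UL << 2)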

The other way to make this more readable would be to use helper routines
similar to what the vm_pgprot* routines do:

static pgprot_t vm_pgprot_modify(pgprot_t oldprot, unsigned long vm_flags)
{
	return pgprot_modify(oldprot, vm_get_page_prot(vm_flags));
}

...but that's also unnecessary.

Let's just keep it simple, and go back to the bitmap flags!

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 13/15] mm/hmm: factor out pte and pmd handling to simplify hmm_vma_walk_pmd()
  2018-03-20  2:00   ` jglisse
@ 2018-03-21  5:07     ` John Hubbard
  -1 siblings, 0 replies; 99+ messages in thread
From: John Hubbard @ 2018-03-21  5:07 UTC (permalink / raw)
  To: jglisse, linux-mm
  Cc: Andrew Morton, linux-kernel, Evgeny Baskakov, Ralph Campbell,
	Mark Hairgrove

On 03/19/2018 07:00 PM, jglisse@redhat.com wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> No functional change, just create one function to handle pmd and one
> to handle pte (hmm_vma_handle_pmd() and hmm_vma_handle_pte()).
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Mark Hairgrove <mhairgrove@nvidia.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> ---
>  mm/hmm.c | 174 +++++++++++++++++++++++++++++++++++++--------------------------
>  1 file changed, 102 insertions(+), 72 deletions(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 52cdceb35733..dc703e9c3a95 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -351,6 +351,99 @@ static int hmm_vma_walk_hole(unsigned long addr,
>  	return hmm_vma_walk->fault ? -EAGAIN : 0;
>  }
>  
> +static int hmm_vma_handle_pmd(struct mm_walk *walk,
> +			      unsigned long addr,
> +			      unsigned long end,
> +			      uint64_t *pfns,

Hi Jerome,

Nice cleanup, it makes it much easier to follow the code now.

Let's please rename the pfns argument above to "pfn", because in this
helper (and the _pte helper too), there is only one pfn involved, rather
than an array of them.

> +			      pmd_t pmd)
> +{
> +	struct hmm_vma_walk *hmm_vma_walk = walk->private;
> +	unsigned long pfn, i;
> +	uint64_t flag = 0;
> +
> +	if (pmd_protnone(pmd))
> +		return hmm_vma_walk_hole(addr, end, walk);
> +
> +	if ((hmm_vma_walk->fault & hmm_vma_walk->write) && !pmd_write(pmd))
> +		return hmm_vma_walk_hole(addr, end, walk);
> +
> +	pfn = pmd_pfn(pmd) + pte_index(addr);
> +	flag |= pmd_write(pmd) ? HMM_PFN_WRITE : 0;
> +	for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++)
> +		pfns[i] = hmm_pfn_from_pfn(pfn) | flag;
> +	hmm_vma_walk->last = end;
> +	return 0;
> +}
> +
> +static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
> +			      unsigned long end, pmd_t *pmdp, pte_t *ptep,
> +			      uint64_t *pfns)

Same thing here: rename pfns --> pfn.

I moved diffs around to attempt to confirm that this is just a refactoring,
and it does look the same. It's easy to overlook things here, but:

Reviewed-by: John Hubbard <jhubbard@nvidia.com>

thanks,
-- 
John Hubbard
NVIDIA

> +{
> +	struct hmm_vma_walk *hmm_vma_walk = walk->private;
> +	struct vm_area_struct *vma = walk->vma;
> +	pte_t pte = *ptep;
> +
> +	*pfns = 0;
> +
> +	if (pte_none(pte)) {
> +		*pfns = 0;
> +		if (hmm_vma_walk->fault)
> +			goto fault;
> +		return 0;
> +	}
> +
> +	if (!pte_present(pte)) {
> +		swp_entry_t entry = pte_to_swp_entry(pte);
> +
> +		if (!non_swap_entry(entry)) {
> +			if (hmm_vma_walk->fault)
> +				goto fault;
> +			return 0;
> +		}
> +
> +		/*
> +		 * This is a special swap entry, ignore migration, use
> +		 * device and report anything else as error.
> +		 */
> +		if (is_device_private_entry(entry)) {
> +			*pfns = hmm_pfn_from_pfn(swp_offset(entry));
> +			if (is_write_device_private_entry(entry)) {
> +				*pfns |= HMM_PFN_WRITE;
> +			} else if ((hmm_vma_walk->fault & hmm_vma_walk->write))
> +				goto fault;
> +			*pfns |= HMM_PFN_DEVICE_PRIVATE;
> +			return 0;
> +		}
> +
> +		if (is_migration_entry(entry)) {
> +			if (hmm_vma_walk->fault) {
> +				pte_unmap(ptep);
> +				hmm_vma_walk->last = addr;
> +				migration_entry_wait(vma->vm_mm,
> +						pmdp, addr);
> +				return -EAGAIN;
> +			}
> +			return 0;
> +		}
> +
> +		/* Report error for everything else */
> +		*pfns = HMM_PFN_ERROR;
> +		return -EFAULT;
> +	}
> +
> +	if ((hmm_vma_walk->fault & hmm_vma_walk->write) && !pte_write(pte))
> +		goto fault;
> +
> +	*pfns = hmm_pfn_from_pfn(pte_pfn(pte));
> +	*pfns |= pte_write(pte) ? HMM_PFN_WRITE : 0;
> +	return 0;
> +
> +fault:
> +	pte_unmap(ptep);
> +	/* Fault any virtual address we were ask to fault */
> +	return hmm_vma_walk_hole(addr, end, walk);
> +}
> +
>  static int hmm_vma_walk_pmd(pmd_t *pmdp,
>  			    unsigned long start,
>  			    unsigned long end,
> @@ -358,25 +451,20 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>  {
>  	struct hmm_vma_walk *hmm_vma_walk = walk->private;
>  	struct hmm_range *range = hmm_vma_walk->range;
> -	struct vm_area_struct *vma = walk->vma;
>  	uint64_t *pfns = range->pfns;
>  	unsigned long addr = start, i;
> -	bool write_fault;
>  	pte_t *ptep;
>  
>  	i = (addr - range->start) >> PAGE_SHIFT;
> -	write_fault = hmm_vma_walk->fault & hmm_vma_walk->write;
>  
>  again:
>  	if (pmd_none(*pmdp))
>  		return hmm_vma_walk_hole(start, end, walk);
>  
> -	if (pmd_huge(*pmdp) && vma->vm_flags & VM_HUGETLB)
> +	if (pmd_huge(*pmdp) && (range->vma->vm_flags & VM_HUGETLB))
>  		return hmm_pfns_bad(start, end, walk);
>  
>  	if (pmd_devmap(*pmdp) || pmd_trans_huge(*pmdp)) {
> -		unsigned long pfn;
> -		uint64_t flag = 0;
>  		pmd_t pmd;
>  
>  		/*
> @@ -392,17 +480,8 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>  		barrier();
>  		if (!pmd_devmap(pmd) && !pmd_trans_huge(pmd))
>  			goto again;
> -		if (pmd_protnone(pmd))
> -			return hmm_vma_walk_hole(start, end, walk);
>  
> -		if (write_fault && !pmd_write(pmd))
> -			return hmm_vma_walk_hole(start, end, walk);
> -
> -		pfn = pmd_pfn(pmd) + pte_index(addr);
> -		flag |= pmd_write(pmd) ? HMM_PFN_WRITE : 0;
> -		for (; addr < end; addr += PAGE_SIZE, i++, pfn++)
> -			pfns[i] = hmm_pfn_from_pfn(pfn) | flag;
> -		return 0;
> +		return hmm_vma_handle_pmd(walk, addr, end, &pfns[i], pmd);
>  	}
>  
>  	if (pmd_bad(*pmdp))
> @@ -410,67 +489,18 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>  
>  	ptep = pte_offset_map(pmdp, addr);
>  	for (; addr < end; addr += PAGE_SIZE, ptep++, i++) {
> -		pte_t pte = *ptep;
> +		int r;
>  
> -		pfns[i] = 0;
> -
> -		if (pte_none(pte)) {
> -			pfns[i] = 0;
> -			if (hmm_vma_walk->fault)
> -				goto fault;
> -			continue;
> -		}
> -
> -		if (!pte_present(pte)) {
> -			swp_entry_t entry = pte_to_swp_entry(pte);
> -
> -			if (!non_swap_entry(entry)) {
> -				if (hmm_vma_walk->fault)
> -					goto fault;
> -				continue;
> -			}
> -
> -			/*
> -			 * This is a special swap entry, ignore migration, use
> -			 * device and report anything else as error.
> -			 */
> -			if (is_device_private_entry(entry)) {
> -				pfns[i] = hmm_pfn_from_pfn(swp_offset(entry));
> -				if (is_write_device_private_entry(entry)) {
> -					pfns[i] |= HMM_PFN_WRITE;
> -				} else if (write_fault)
> -					goto fault;
> -				pfns[i] |= HMM_PFN_DEVICE_PRIVATE;
> -			} else if (is_migration_entry(entry)) {
> -				if (hmm_vma_walk->fault) {
> -					pte_unmap(ptep);
> -					hmm_vma_walk->last = addr;
> -					migration_entry_wait(vma->vm_mm,
> -							     pmdp, addr);
> -					return -EAGAIN;
> -				}
> -				continue;
> -			} else {
> -				/* Report error for everything else */
> -				pfns[i] = HMM_PFN_ERROR;
> -			}
> -			continue;
> +		r = hmm_vma_handle_pte(walk, addr, end, pmdp, ptep, &pfns[i]);
> +		if (r) {
> +			/* hmm_vma_handle_pte() did unmap pte directory */
> +			hmm_vma_walk->last = addr;
> +			return r;
>  		}
> -
> -		if (write_fault && !pte_write(pte))
> -			goto fault;
> -
> -		pfns[i] = hmm_pfn_from_pfn(pte_pfn(pte));
> -		pfns[i] |= pte_write(pte) ? HMM_PFN_WRITE : 0;
> -		continue;
> -
> -fault:
> -		pte_unmap(ptep);
> -		/* Fault any virtual address we were ask to fault */
> -		return hmm_vma_walk_hole(start, end, walk);
>  	}
>  	pte_unmap(ptep - 1);
>  
> +	hmm_vma_walk->last = addr;
>  	return 0;
>  }
>  
> 

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 10/15] mm/hmm: do not differentiate between empty entry or missing directory v2
  2018-03-20  2:00   ` jglisse
@ 2018-03-21  5:24     ` John Hubbard
  -1 siblings, 0 replies; 99+ messages in thread
From: John Hubbard @ 2018-03-21  5:24 UTC (permalink / raw)
  To: jglisse, linux-mm
  Cc: Andrew Morton, linux-kernel, Evgeny Baskakov, Ralph Campbell,
	Mark Hairgrove

On 03/19/2018 07:00 PM, jglisse@redhat.com wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> There is no point in differentiating between a range for which there
> is not even a directory (and thus entries) and empty entry (pte_none()
> or pmd_none() returns true).
> 
> Simply drop the distinction ie remove HMM_PFN_EMPTY flag and merge now
> duplicate hmm_vma_walk_hole() and hmm_vma_walk_clear() functions.
> 
> Changed since v1:
>   - Improved comments
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Mark Hairgrove <mhairgrove@nvidia.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> ---
>  include/linux/hmm.h |  8 +++-----
>  mm/hmm.c            | 45 +++++++++++++++------------------------------
>  2 files changed, 18 insertions(+), 35 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 54d684fe3b90..cf283db22106 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -84,7 +84,6 @@ struct hmm;
>   * HMM_PFN_VALID: pfn is valid. It has, at least, read permission.
>   * HMM_PFN_WRITE: CPU page table has write permission set
>   * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned memory
> - * HMM_PFN_EMPTY: corresponding CPU page table entry is pte_none()
>   * HMM_PFN_SPECIAL: corresponding CPU page table entry is special; i.e., the
>   *      result of vm_insert_pfn() or vm_insert_page(). Therefore, it should not
>   *      be mirrored by a device, because the entry will never have HMM_PFN_VALID
> @@ -94,10 +93,9 @@ struct hmm;
>  #define HMM_PFN_VALID (1 << 0)
>  #define HMM_PFN_WRITE (1 << 1)
>  #define HMM_PFN_ERROR (1 << 2)
> -#define HMM_PFN_EMPTY (1 << 3)

Hi Jerome,

Nearly done with this one...see below for a bit more detail, but I think if we did this:

    #define HMM_PFN_EMPTY (0)

...it would work out nicely.

> -#define HMM_PFN_SPECIAL (1 << 4)
> -#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 5)
> -#define HMM_PFN_SHIFT 6
> +#define HMM_PFN_SPECIAL (1 << 3)
> +#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 4)
> +#define HMM_PFN_SHIFT 5
>  

<snip>

> @@ -438,7 +423,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>  		pfns[i] = 0;
>  
>  		if (pte_none(pte)) {
> -			pfns[i] = HMM_PFN_EMPTY;
> +			pfns[i] = 0;

This works, but why not keep HMM_PFN_EMPTY, and just define it as zero?
Symbols are better than raw numbers here.


>  			if (hmm_vma_walk->fault)
>  				goto fault;
>  			continue;
> @@ -489,8 +474,8 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>  
>  fault:
>  		pte_unmap(ptep);
> -		/* Fault all pages in range */
> -		return hmm_vma_walk_clear(start, end, walk);
> +		/* Fault any virtual address we were ask to fault */

                                                     asked to fault

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 10/15] mm/hmm: do not differentiate between empty entry or missing directory v2
  2018-03-21  5:24     ` John Hubbard
@ 2018-03-21 14:48       ` Jerome Glisse
  -1 siblings, 0 replies; 99+ messages in thread
From: Jerome Glisse @ 2018-03-21 14:48 UTC (permalink / raw)
  To: John Hubbard
  Cc: linux-mm, Andrew Morton, linux-kernel, Evgeny Baskakov,
	Ralph Campbell, Mark Hairgrove

On Tue, Mar 20, 2018 at 10:24:34PM -0700, John Hubbard wrote:
> On 03/19/2018 07:00 PM, jglisse@redhat.com wrote:
> > From: Jérôme Glisse <jglisse@redhat.com>
> > 
> > There is no point in differentiating between a range for which there
> > is not even a directory (and thus entries) and empty entry (pte_none()
> > or pmd_none() returns true).
> > 
> > Simply drop the distinction ie remove HMM_PFN_EMPTY flag and merge now
> > duplicate hmm_vma_walk_hole() and hmm_vma_walk_clear() functions.
> > 
> > Changed since v1:
> >   - Improved comments
> > 
> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
> > Cc: Ralph Campbell <rcampbell@nvidia.com>
> > Cc: Mark Hairgrove <mhairgrove@nvidia.com>
> > Cc: John Hubbard <jhubbard@nvidia.com>
> > ---
> >  include/linux/hmm.h |  8 +++-----
> >  mm/hmm.c            | 45 +++++++++++++++------------------------------
> >  2 files changed, 18 insertions(+), 35 deletions(-)
> > 
> > diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> > index 54d684fe3b90..cf283db22106 100644
> > --- a/include/linux/hmm.h
> > +++ b/include/linux/hmm.h
> > @@ -84,7 +84,6 @@ struct hmm;
> >   * HMM_PFN_VALID: pfn is valid. It has, at least, read permission.
> >   * HMM_PFN_WRITE: CPU page table has write permission set
> >   * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned memory
> > - * HMM_PFN_EMPTY: corresponding CPU page table entry is pte_none()
> >   * HMM_PFN_SPECIAL: corresponding CPU page table entry is special; i.e., the
> >   *      result of vm_insert_pfn() or vm_insert_page(). Therefore, it should not
> >   *      be mirrored by a device, because the entry will never have HMM_PFN_VALID
> > @@ -94,10 +93,9 @@ struct hmm;
> >  #define HMM_PFN_VALID (1 << 0)
> >  #define HMM_PFN_WRITE (1 << 1)
> >  #define HMM_PFN_ERROR (1 << 2)
> > -#define HMM_PFN_EMPTY (1 << 3)
> 
> Hi Jerome,
> 
> Nearly done with this one...see below for a bit more detail, but I think if we did this:
> 
>     #define HMM_PFN_EMPTY (0)
> 
> ...it would work out nicely.
> 
> > -#define HMM_PFN_SPECIAL (1 << 4)
> > -#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 5)
> > -#define HMM_PFN_SHIFT 6
> > +#define HMM_PFN_SPECIAL (1 << 3)
> > +#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 4)
> > +#define HMM_PFN_SHIFT 5
> >  
> 
> <snip>
> 
> > @@ -438,7 +423,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
> >  		pfns[i] = 0;
> >  
> >  		if (pte_none(pte)) {
> > -			pfns[i] = HMM_PFN_EMPTY;
> > +			pfns[i] = 0;
> 
> This works, but why not keep HMM_PFN_EMPTY, and just define it as zero?
> Symbols are better than raw numbers here.
> 

The last patch does that, so I don't think it is worth respinning
just to make this intermediate state prettier.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 13/15] mm/hmm: factor out pte and pmd handling to simplify hmm_vma_walk_pmd()
  2018-03-21  5:07     ` John Hubbard
@ 2018-03-21 15:08       ` Jerome Glisse
  -1 siblings, 0 replies; 99+ messages in thread
From: Jerome Glisse @ 2018-03-21 15:08 UTC (permalink / raw)
  To: John Hubbard
  Cc: linux-mm, Andrew Morton, linux-kernel, Evgeny Baskakov,
	Ralph Campbell, Mark Hairgrove

On Tue, Mar 20, 2018 at 10:07:29PM -0700, John Hubbard wrote:
> On 03/19/2018 07:00 PM, jglisse@redhat.com wrote:
> > From: Jérôme Glisse <jglisse@redhat.com>
> > 
> > No functional change, just create one function to handle pmd and one
> > to handle pte (hmm_vma_handle_pmd() and hmm_vma_handle_pte()).
> > 
> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
> > Cc: Ralph Campbell <rcampbell@nvidia.com>
> > Cc: Mark Hairgrove <mhairgrove@nvidia.com>
> > Cc: John Hubbard <jhubbard@nvidia.com>
> > ---
> >  mm/hmm.c | 174 +++++++++++++++++++++++++++++++++++++--------------------------
> >  1 file changed, 102 insertions(+), 72 deletions(-)
> > 
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index 52cdceb35733..dc703e9c3a95 100644
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -351,6 +351,99 @@ static int hmm_vma_walk_hole(unsigned long addr,
> >  	return hmm_vma_walk->fault ? -EAGAIN : 0;
> >  }
> >  
> > +static int hmm_vma_handle_pmd(struct mm_walk *walk,
> > +			      unsigned long addr,
> > +			      unsigned long end,
> > +			      uint64_t *pfns,
> 
> Hi Jerome,
> 
> Nice cleanup, it makes it much easier to follow the code now.
> 
> Let's please rename the pfns argument above to "pfn", because in this
> helper (and the _pte helper too), there is only one pfn involved, rather
> than an array of them.

This is only true for handle_pte; handle_pmd goes over several pfn
entries. But they all get filled with the same flags (the pmd
permissions apply to all of them); only the pfn part increases
monotonically.
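
To illustrate (a rough sketch only, not the exact code in the patch,
and assuming the pmd value is passed in alongside addr/end/pfns), the
pmd case boils down to:

    /* Compute the flags once, then fill every entry covered by the pmd. */
    unsigned long pfn = pmd_pfn(pmd) + pte_index(addr);
    uint64_t flag = pmd_write(pmd) ? HMM_PFN_WRITE : 0;
    unsigned long i;

    for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++)
        pfns[i] = ((uint64_t)pfn << HMM_PFN_SHIFT) | HMM_PFN_VALID | flag;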

Not sure s/pfns/pfn for hmm_vma_handle_pte() warrants a respin.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 15/15] mm/hmm: use device driver encoding for HMM pfn v2
  2018-03-21  4:39     ` John Hubbard
@ 2018-03-21 15:52       ` Jerome Glisse
  -1 siblings, 0 replies; 99+ messages in thread
From: Jerome Glisse @ 2018-03-21 15:52 UTC (permalink / raw)
  To: John Hubbard
  Cc: linux-mm, Andrew Morton, linux-kernel, Evgeny Baskakov,
	Ralph Campbell, Mark Hairgrove

On Tue, Mar 20, 2018 at 09:39:27PM -0700, John Hubbard wrote:
> On 03/19/2018 07:00 PM, jglisse@redhat.com wrote:
> > From: Jérôme Glisse <jglisse@redhat.com>

[...]

> > diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> > index 0f7ea3074175..5d26e0a223d9 100644
> > --- a/include/linux/hmm.h
> > +++ b/include/linux/hmm.h
> > @@ -80,68 +80,145 @@
> >  struct hmm;
> >  
> >  /*
> > + * hmm_pfn_flag_e - HMM flag enums
> > + *
> >   * Flags:
> >   * HMM_PFN_VALID: pfn is valid. It has, at least, read permission.
> >   * HMM_PFN_WRITE: CPU page table has write permission set
> > + * HMM_PFN_DEVICE_PRIVATE: private device memory (ZONE_DEVICE)
> > + *
> > + * The driver provide a flags array, if driver valid bit for an entry is bit
> > + * 3 ie (entry & (1 << 3)) is true if entry is valid then driver must provide
> > + * an array in hmm_range.flags with hmm_range.flags[HMM_PFN_VALID] == 1 << 3.
> > + * Same logic apply to all flags. This is same idea as vm_page_prot in vma
> > + * except that this is per device driver rather than per architecture.
> 
> Hi Jerome,
> 
> If we go with this approach--and I hope not, I'll try to talk you down from the
> ledge, in a moment--then maybe we should add the following to the comments: 
> 
> "There is only one bit ever set in each hmm_range.flags[entry]." 

This is not necessarily true: a driver can have multiple bits set
matching one HMM flag. Actually I do expect to see that (in nouveau).


> Or maybe we'll get pushback, that the code shows that already, but IMHO this is 
> strange way to do things (especially when there is a much easier way), and deserves 
> that extra bit of helpful documentation.
> 
> More below...
> 
> > + */
> > +enum hmm_pfn_flag_e {
> > +	HMM_PFN_VALID = 0,
> > +	HMM_PFN_WRITE,
> > +	HMM_PFN_DEVICE_PRIVATE,
> > +	HMM_PFN_FLAG_MAX
> > +};
> > +
> > +/*
> > + * hmm_pfn_value_e - HMM pfn special value
> > + *
> > + * Flags:
> >   * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned memory
> > + * HMM_PFN_NONE: corresponding CPU page table entry is pte_none()
> >   * HMM_PFN_SPECIAL: corresponding CPU page table entry is special; i.e., the
> >   *      result of vm_insert_pfn() or vm_insert_page(). Therefore, it should not
> >   *      be mirrored by a device, because the entry will never have HMM_PFN_VALID
> >   *      set and the pfn value is undefined.
> > - * HMM_PFN_DEVICE_PRIVATE: unaddressable device memory (ZONE_DEVICE)
> > + *
> > + * Driver provide entry value for none entry, error entry and special entry,
> > + * driver can alias (ie use same value for error and special for instance). It
> > + * should not alias none and error or special.
> > + *
> > + * HMM pfn value returned by hmm_vma_get_pfns() or hmm_vma_fault() will be:
> > + * hmm_range.values[HMM_PFN_ERROR] if CPU page table entry is poisonous,
> > + * hmm_range.values[HMM_PFN_NONE] if there is no CPU page table
> > + * hmm_range.values[HMM_PFN_SPECIAL] if CPU page table entry is a special one
> >   */
> > -#define HMM_PFN_VALID (1 << 0)
> > -#define HMM_PFN_WRITE (1 << 1)
> > -#define HMM_PFN_ERROR (1 << 2)
> > -#define HMM_PFN_SPECIAL (1 << 3)
> > -#define HMM_PFN_DEVICE_PRIVATE (1 << 4)
> > -#define HMM_PFN_SHIFT 5
> > +enum hmm_pfn_value_e {
> > +	HMM_PFN_ERROR,
> > +	HMM_PFN_NONE,
> > +	HMM_PFN_SPECIAL,
> > +	HMM_PFN_VALUE_MAX
> > +};
> 
> I can think of perhaps two good solid ways to get what you want, without
> moving to what I consider an unnecessary excursion into arrays of flags. 
> If I understand correctly, you want to let each architecture
> specify which bit to use for each of the above HMM_PFN_* flags. 
> 
> The way you have it now, the code does things like this:
> 
>         cpu_flags & range->flags[HMM_PFN_WRITE]
> 
> but that array entry is mostly empty space, and it's confusing. It would
> be nicer to see:
> 
>         cpu_flags & HMM_PFN_WRITE
> 
> ...which you can easily do, by defining HMM_PFN_WRITE and friends in an
> arch-specific header file.

An arch-specific header does not work: HMM can be used in the same
kernel by different device drivers (AMD and NVIDIA for instance), and
each has its own page table entry format, so they won't match.


> The other way to make this more readable would be to use helper routines
> similar to what the vm_pgprot* routines do:
> 
> static pgprot_t vm_pgprot_modify(pgprot_t oldprot, unsigned long vm_flags)
> {
> 	return pgprot_modify(oldprot, vm_get_page_prot(vm_flags));
> }
> 
> ...but that's also unnecessary.
> 
> Let's just keep it simple, and go back to the bitmap flags!

This simplifies the nouveau code and it is the reason why I did that
patch. I am sure it can simplify the NVIDIA uvm code too; I can look
into it if you want to give pointers. The idea here is that HMM can
fill the array with something that matches the device driver's internal
format and avoid the conversion step from HMM format to driver format
(saving CPU cycles and memory in doing so). I am open to alternatives
that give the same end result.

[Just because code is worth 2^32 words :)

Without this patch:
    int nouveau_do_fault(..., ulong addr, unsigned npages, ...)
    {
        uint64_t *hmm_pfns, *nouveau_pfns;

        hmm_pfns = kmalloc(sizeof(uint64_t) * npages, GFP_KERNEL);
        nouveau_pfns = kmalloc(sizeof(uint64_t) * npages, GFP_KERNEL);
        hmm_vma_fault(..., hmm_pfns, ...);

        for (i = 0; i < npages; ++i) {
            nouveau_pfns[i] = nouveau_pfn_from_hmm_pfn(hmm_pfns[i]);
        }
        ...
    }

With this patch:
    int nouveau_do_fault(..., ulong addr, unsigned npages, ...)
    {
        uint64_t *nouveau_pfns;

        nouveau_pfns = kmalloc(sizeof(uint64_t) * npages, GFP_KERNEL);
        hmm_vma_fault(..., nouveau_pfns, ...);

        ...
    }

The benefit from this patch is quite obvious to me. Down the road,
with a bit more integration between HMM and IOMMU/DMA, this can turn
into something directly ready for hardware consumption.

Note that you could argue that I could convert nouveau to use the HMM
format, but this would not work: first because it requires a lot of
changes in nouveau, second because HMM does not have all the flags
needed by the drivers (nor does HMM need them). HMM being the helper
here, I feel it is up to HMM to adapt to the drivers rather than the
other way around.]
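
To make the array idea a bit more concrete, here is a rough sketch of
the driver side (all xdrv_* names are made up for illustration, this
is not actual nouveau code, and it assumes a struct hmm_range named
range as introduced by this patch):

    /* Hypothetical driver pfn encoding (invented bit layout). */
    #define XDRV_PFN_VALID  (1ULL << 0)
    #define XDRV_PFN_WRITE  (1ULL << 1)
    #define XDRV_PFN_VRAM   (1ULL << 2)    /* device private memory */
    #define XDRV_PFN_ERROR  (1ULL << 3)

    static const uint64_t xdrv_hmm_flags[HMM_PFN_FLAG_MAX] = {
        [HMM_PFN_VALID]          = XDRV_PFN_VALID,
        [HMM_PFN_WRITE]          = XDRV_PFN_WRITE,
        [HMM_PFN_DEVICE_PRIVATE] = XDRV_PFN_VRAM,
    };

    static const uint64_t xdrv_hmm_values[HMM_PFN_VALUE_MAX] = {
        [HMM_PFN_ERROR]   = XDRV_PFN_ERROR,
        [HMM_PFN_NONE]    = 0,
        [HMM_PFN_SPECIAL] = XDRV_PFN_ERROR, /* aliasing error and special */
    };

    /* ... in the fault path, before calling hmm_vma_fault(): */
    range.flags = xdrv_hmm_flags;
    range.values = xdrv_hmm_values;

With that, the pfns[] array HMM fills is already in the device's own
format, which is the whole point.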

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 03/15] mm/hmm: HMM should have a callback before MM is destroyed v2
  2018-03-21  4:14     ` John Hubbard
  (?)
@ 2018-03-21 18:03       ` Jerome Glisse
  -1 siblings, 0 replies; 99+ messages in thread
From: Jerome Glisse @ 2018-03-21 18:03 UTC (permalink / raw)
  To: John Hubbard
  Cc: linux-mm, Andrew Morton, linux-kernel, Ralph Campbell, stable,
	Evgeny Baskakov, Mark Hairgrove

On Tue, Mar 20, 2018 at 09:14:34PM -0700, John Hubbard wrote:
> On 03/19/2018 07:00 PM, jglisse@redhat.com wrote:
> > From: Ralph Campbell <rcampbell@nvidia.com>
> > 
> > The hmm_mirror_register() function registers a callback for when
> > the CPU pagetable is modified. Normally, the device driver will
> > call hmm_mirror_unregister() when the process using the device is
> > finished. However, if the process exits uncleanly, the struct_mm
> > can be destroyed with no warning to the device driver.
> > 
> > Changed since v1:
> >   - dropped VM_BUG_ON()
> >   - cc stable
> > 
> > Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > Cc: stable@vger.kernel.org
> > Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
> > Cc: Mark Hairgrove <mhairgrove@nvidia.com>
> > Cc: John Hubbard <jhubbard@nvidia.com>
> > ---
> >  include/linux/hmm.h | 10 ++++++++++
> >  mm/hmm.c            | 18 +++++++++++++++++-
> >  2 files changed, 27 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> > index 36dd21fe5caf..fa7b51f65905 100644
> > --- a/include/linux/hmm.h
> > +++ b/include/linux/hmm.h
> > @@ -218,6 +218,16 @@ enum hmm_update_type {
> >   * @update: callback to update range on a device
> >   */
> >  struct hmm_mirror_ops {
> > +	/* release() - release hmm_mirror
> > +	 *
> > +	 * @mirror: pointer to struct hmm_mirror
> > +	 *
> > +	 * This is called when the mm_struct is being released.
> > +	 * The callback should make sure no references to the mirror occur
> > +	 * after the callback returns.
> > +	 */
> > +	void (*release)(struct hmm_mirror *mirror);
> > +
> >  	/* sync_cpu_device_pagetables() - synchronize page tables
> >  	 *
> >  	 * @mirror: pointer to struct hmm_mirror
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index 320545b98ff5..6088fa6ed137 100644
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -160,6 +160,21 @@ static void hmm_invalidate_range(struct hmm *hmm,
> >  	up_read(&hmm->mirrors_sem);
> >  }
> >  
> > +static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
> > +{
> > +	struct hmm *hmm = mm->hmm;
> > +	struct hmm_mirror *mirror;
> > +	struct hmm_mirror *mirror_next;
> > +
> > +	down_write(&hmm->mirrors_sem);
> > +	list_for_each_entry_safe(mirror, mirror_next, &hmm->mirrors, list) {
> > +		list_del_init(&mirror->list);
> > +		if (mirror->ops->release)
> > +			mirror->ops->release(mirror);
> 
> Hi Jerome,
> 
> This presents a deadlock problem (details below). As for solution ideas, 
> Mark Hairgrove points out that the MMU notifiers had to solve the
> same sort of problem, and part of the solution involves "avoid
> holding locks when issuing these callbacks". That's not an entire 
> solution description, of course, but it seems like a good start.
> 
> Anyway, for the deadlock problem:
> 
> Each of these ->release callbacks potentially has to wait for the 
> hmm_invalidate_range() callbacks to finish. That is not shown in any
> code directly, but it's because: when a device driver is processing 
> the above ->release callback, it has to allow any in-progress operations 
> to finish up (as specified clearly in your comment documentation above). 
> 
> Some of those operations will invariably need to do things that result 
> in page invalidations, thus triggering the hmm_invalidate_range() callback.
> Then, the hmm_invalidate_range() callback tries to acquire the same 
> hmm->mirrors_sem lock, thus leading to deadlock:
> 
> hmm_invalidate_range():
> // ...
> 	down_read(&hmm->mirrors_sem);
> 	list_for_each_entry(mirror, &hmm->mirrors, list)
> 		mirror->ops->sync_cpu_device_pagetables(mirror, action,
> 							start, end);
> 	up_read(&hmm->mirrors_sem);

That is just illegal: the release callback is not allowed to trigger
invalidation. All it does is kill all of the device's threads and stop
device page faults from happening. So there is no deadlock issue. I can
reinforce the comment some more (see [1] for an example of what it
should be).
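
In driver pseudo-code, what I expect from a release callback is roughly
the following (the xdrv_* names are invented, this is only a sketch;
see [1] for the real nouveau example):

    static void xdrv_mirror_release(struct hmm_mirror *mirror)
    {
        struct xdrv_svmm *svmm;

        svmm = container_of(mirror, struct xdrv_svmm, mirror);

        /* Stop servicing new device page faults for this mm. */
        xdrv_fault_ignore(svmm);

        /*
         * Kill the device threads/channels that use this mm. Do not
         * trigger invalidation or migration, do not call back into
         * HMM or mmu_notifier, and do not wait on anything here.
         */
        xdrv_kill_channels(svmm);
    }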

Also, it is illegal for the sync callback to trigger any mmu_notifier
callback. I thought this was obvious. The sync callback should only
update the device page table and do _nothing else_. There is no way to
make this re-entrant.

For anonymous private memory migrated to device memory, it is freed
shortly after the release callback (see exit_mmap()). For shared memory
you might want to migrate back to regular memory, but that will be fine
as you will not get mmu_notifier callbacks any more.

So I don't see any deadlock here.

Cheers,
Jérôme

[1] https://cgit.freedesktop.org/~glisse/linux/commit/?h=nouveau-hmm&id=93adb3e6b4f39d5d146b6a8afb4175d37bdd4890

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 04/15] mm/hmm: unregister mmu_notifier when last HMM client quit
  2018-03-21  4:24     ` John Hubbard
@ 2018-03-21 18:12       ` Jerome Glisse
  -1 siblings, 0 replies; 99+ messages in thread
From: Jerome Glisse @ 2018-03-21 18:12 UTC (permalink / raw)
  To: John Hubbard
  Cc: linux-mm, Andrew Morton, linux-kernel, Evgeny Baskakov,
	Ralph Campbell, Mark Hairgrove

On Tue, Mar 20, 2018 at 09:24:41PM -0700, John Hubbard wrote:
> On 03/19/2018 07:00 PM, jglisse@redhat.com wrote:
> > From: Jérôme Glisse <jglisse@redhat.com>
> > 
> > This code was lost in translation at one point. This properly call
> > mmu_notifier_unregister_no_release() once last user is gone. This
> > fix the zombie mm_struct as without this patch we do not drop the
> > refcount we have on it.
> > 
> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
> > Cc: Ralph Campbell <rcampbell@nvidia.com>
> > Cc: Mark Hairgrove <mhairgrove@nvidia.com>
> > Cc: John Hubbard <jhubbard@nvidia.com>
> > ---
> >  mm/hmm.c | 19 +++++++++++++++++++
> >  1 file changed, 19 insertions(+)
> > 
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index 6088fa6ed137..667944630dc9 100644
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -244,10 +244,29 @@ EXPORT_SYMBOL(hmm_mirror_register);
> >  void hmm_mirror_unregister(struct hmm_mirror *mirror)
> >  {
> >  	struct hmm *hmm = mirror->hmm;
> > +	struct mm_struct *mm = NULL;
> > +	bool unregister = false;
> >  
> >  	down_write(&hmm->mirrors_sem);
> >  	list_del_init(&mirror->list);
> > +	unregister = list_empty(&hmm->mirrors);
> 
> Hi Jerome,
> 
> This first minor point may be irrelevant, depending on how you fix 
> the other problem below, but: tiny naming idea: rename unregister 
> to either "should_unregister", or "mirror_snapshot_empty"...the 
> latter helps show that this is stale information, once the lock is 
> dropped. 

The first name makes sense, the second doesn't (at least to me): the
mirror is dead at this point, it has no implication with respect to any
snapshot (the mm might still very well be alive and in active use).


> >  	up_write(&hmm->mirrors_sem);
> > +
> > +	if (!unregister)
> > +		return;
> 
> Whee, here I am, lock-free, in the middle of a race condition
> window. :)  Right here, someone (hmm_mirror_register) could be adding
> another mirror.
> 
> It's not immediately clear to me what the best solution is.
> I'd be happier if we didn't have to drop one lock and take
> another like this, but if we do, then maybe rechecking that
> the list hasn't changed...safely, somehow, is a way forward here.
> 

First I want to stress this is a very unlikely race: it can only happen
if one hmm_mirror unregisters, and is the last one, while another new
one tries to register. Highly unlikely, but I am sending a v2 which
fixes that anyway; better safe than sorry. It makes the register side a
bit ugly.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 99+ messages in thread

* [PATCH 04/15] mm/hmm: unregister mmu_notifier when last HMM client quit v2
  2018-03-20  2:00   ` jglisse
@ 2018-03-21 18:16     ` jglisse
  -1 siblings, 0 replies; 99+ messages in thread
From: jglisse @ 2018-03-21 18:16 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Evgeny Baskakov, Ralph Campbell, Mark Hairgrove, John Hubbard

From: Jérôme Glisse <jglisse@redhat.com>

This code was lost in translation at one point. This properly calls
mmu_notifier_unregister_no_release() once the last user is gone. This
fixes the zombie mm_struct issue: without this patch we do not drop the
refcount we hold on it.

Changed since v1:
  - close race window between a last mirror unregistering and a new
    mirror registering, which could have led to a use-after-free kind
    of bug

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 mm/hmm.c | 35 +++++++++++++++++++++++++++++++++--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 6088fa6ed137..f75aa8df6e97 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -222,13 +222,24 @@ int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
 	if (!mm || !mirror || !mirror->ops)
 		return -EINVAL;
 
+again:
 	mirror->hmm = hmm_register(mm);
 	if (!mirror->hmm)
 		return -ENOMEM;
 
 	down_write(&mirror->hmm->mirrors_sem);
-	list_add(&mirror->list, &mirror->hmm->mirrors);
-	up_write(&mirror->hmm->mirrors_sem);
+	if (mirror->hmm->mm == NULL) {
+		/*
+		 * A racing hmm_mirror_unregister() is about to destroy the hmm
+		 * struct. Try again to allocate a new one.
+		 */
+		up_write(&mirror->hmm->mirrors_sem);
+		mirror->hmm = NULL;
+		goto again;
+	} else {
+		list_add(&mirror->list, &mirror->hmm->mirrors);
+		up_write(&mirror->hmm->mirrors_sem);
+	}
 
 	return 0;
 }
@@ -244,10 +255,30 @@ EXPORT_SYMBOL(hmm_mirror_register);
 void hmm_mirror_unregister(struct hmm_mirror *mirror)
 {
 	struct hmm *hmm = mirror->hmm;
+	bool should_unregister = false;
+	struct mm_struct *mm;
+
+	if (list_empty(&mirror->list))
+		return;
 
 	down_write(&hmm->mirrors_sem);
 	list_del_init(&mirror->list);
+	should_unregister = list_empty(&hmm->mirrors);
+	mm = hmm->mm;
+	hmm->mm = NULL;
 	up_write(&hmm->mirrors_sem);
+
+	if (!should_unregister || mm == NULL)
+		return;
+
+	spin_lock(&mm->page_table_lock);
+	if (mm->hmm == hmm) {
+		mm->hmm = NULL;
+	}
+	spin_unlock(&mm->page_table_lock);
+
+	mmu_notifier_unregister_no_release(&hmm->mmu_notifier, mm);
+	kfree(hmm);
 }
 EXPORT_SYMBOL(hmm_mirror_unregister);
 
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* Re: [PATCH 03/15] mm/hmm: HMM should have a callback before MM is destroyed v2
  2018-03-21 18:03       ` Jerome Glisse
@ 2018-03-21 22:16         ` John Hubbard
  -1 siblings, 0 replies; 99+ messages in thread
From: John Hubbard @ 2018-03-21 22:16 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, Andrew Morton, linux-kernel, Ralph Campbell, stable,
	Evgeny Baskakov, Mark Hairgrove

On 03/21/2018 11:03 AM, Jerome Glisse wrote:
> On Tue, Mar 20, 2018 at 09:14:34PM -0700, John Hubbard wrote:
>> On 03/19/2018 07:00 PM, jglisse@redhat.com wrote:
>>> From: Ralph Campbell <rcampbell@nvidia.com>

<snip>

>> Hi Jerome,
>>
>> This presents a deadlock problem (details below). As for solution ideas, 
>> Mark Hairgrove points out that the MMU notifiers had to solve the
>> same sort of problem, and part of the solution involves "avoid
>> holding locks when issuing these callbacks". That's not an entire 
>> solution description, of course, but it seems like a good start.
>>
>> Anyway, for the deadlock problem:
>>
>> Each of these ->release callbacks potentially has to wait for the 
>> hmm_invalidate_range() callbacks to finish. That is not shown in any
>> code directly, but it's because: when a device driver is processing 
>> the above ->release callback, it has to allow any in-progress operations 
>> to finish up (as specified clearly in your comment documentation above). 
>>
>> Some of those operations will invariably need to do things that result 
>> in page invalidations, thus triggering the hmm_invalidate_range() callback.
>> Then, the hmm_invalidate_range() callback tries to acquire the same 
>> hmm->mirrors_sem lock, thus leading to deadlock:
>>
>> hmm_invalidate_range():
>> // ...
>> 	down_read(&hmm->mirrors_sem);
>> 	list_for_each_entry(mirror, &hmm->mirrors, list)
>> 		mirror->ops->sync_cpu_device_pagetables(mirror, action,
>> 							start, end);
>> 	up_read(&hmm->mirrors_sem);
> 
> That is just illegal, the release callback is not allowed to trigger
> invalidation all it does is kill all device's threads and stop device
> page fault from happening. So there is no deadlock issues. I can re-
> inforce the comment some more (see [1] for example on what it should
> be).

That rule is fine, and it is true that the .release callback will not 
directly trigger any invalidations. However, the problem is in letting 
any *existing* outstanding operations finish up. We have to let 
existing operations "drain", in order to meet the requirement that 
everything is done when .release returns.

For example, if a device driver thread is in the middle of working through
its fault buffer, it will call migrate_vma(), which will in turn unmap
pages. That will cause an hmm_invalidate_range() callback, which tries
to take hmm->mirrors_sem, and we deadlock.

There's no way to "kill" such a thread while it's in the middle of
migrate_vma(); you have to let it finish up.

> 
> Also it is illegal for the sync callback to trigger any mmu_notifier
> callback. I thought this was obvious. The sync callback should only
> update device page table and do _nothing else_. No way to make this
> re-entrant.

That is obvious, yes. I am not trying to say there is any problem with
that rule. It's the "drain outstanding operations during .release", 
above, that is the real problem.

thanks,
-- 
John Hubbard
NVIDIA

> 
> For anonymous private memory migrated to device memory it is freed
> shortly after the release callback (see exit_mmap()). For share memory
> you might want to migrate back to regular memory but that will be fine
> as you will not get mmu_notifier callback any more.
> 
> So i don't see any deadlock here.
> 
> Cheers,
> Jérôme
> 
> [1] https://cgit.freedesktop.org/~glisse/linux/commit/?h=nouveau-hmm&id=93adb3e6b4f39d5d146b6a8afb4175d37bdd4890
> 

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 13/15] mm/hmm: factor out pte and pmd handling to simplify hmm_vma_walk_pmd()
  2018-03-21 15:08       ` Jerome Glisse
@ 2018-03-21 22:36         ` John Hubbard
  -1 siblings, 0 replies; 99+ messages in thread
From: John Hubbard @ 2018-03-21 22:36 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, Andrew Morton, linux-kernel, Evgeny Baskakov,
	Ralph Campbell, Mark Hairgrove

On 03/21/2018 08:08 AM, Jerome Glisse wrote:
> On Tue, Mar 20, 2018 at 10:07:29PM -0700, John Hubbard wrote:
>> On 03/19/2018 07:00 PM, jglisse@redhat.com wrote:
>>> From: Jérôme Glisse <jglisse@redhat.com>

<snip>
 
>>> +static int hmm_vma_handle_pmd(struct mm_walk *walk,
>>> +			      unsigned long addr,
>>> +			      unsigned long end,
>>> +			      uint64_t *pfns,
>>
>> Hi Jerome,
>>
>> Nice cleanup, it makes it much easier to follow the code now.
>>
>> Let's please rename the pfns argument above to "pfn", because in this
>> helper (and the _pte helper too), there is only one pfn involved, rather
>> than an array of them.
> 
> This is only true to handle_pte, for handle_pmd it will go over several
> pfn entries. But they will all get fill with same value modulo pfn which
> will increase monotically (ie same flag as pmd permissions apply to all
> entries).

oops, yes you are right about handle_pmd.

> 
> Note sure s/pfns/pfn for hmm_vma_handle_pte() warrant a respin.

Probably not, unless there is some other reason to respin. Anyway, this patch
looks good either way, I think, so you can still add:

Reviewed-by: John Hubbard <jhubbard@nvidia.com>

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 03/15] mm/hmm: HMM should have a callback before MM is destroyed v2
  2018-03-21 22:16         ` John Hubbard
  (?)
@ 2018-03-21 22:46           ` Jerome Glisse
  -1 siblings, 0 replies; 99+ messages in thread
From: Jerome Glisse @ 2018-03-21 22:46 UTC (permalink / raw)
  To: John Hubbard
  Cc: linux-mm, Andrew Morton, linux-kernel, Ralph Campbell, stable,
	Evgeny Baskakov, Mark Hairgrove

On Wed, Mar 21, 2018 at 03:16:04PM -0700, John Hubbard wrote:
> On 03/21/2018 11:03 AM, Jerome Glisse wrote:
> > On Tue, Mar 20, 2018 at 09:14:34PM -0700, John Hubbard wrote:
> >> On 03/19/2018 07:00 PM, jglisse@redhat.com wrote:
> >>> From: Ralph Campbell <rcampbell@nvidia.com>
> 
> <snip>
> 
> >> Hi Jerome,
> >>
> >> This presents a deadlock problem (details below). As for solution ideas, 
> >> Mark Hairgrove points out that the MMU notifiers had to solve the
> >> same sort of problem, and part of the solution involves "avoid
> >> holding locks when issuing these callbacks". That's not an entire 
> >> solution description, of course, but it seems like a good start.
> >>
> >> Anyway, for the deadlock problem:
> >>
> >> Each of these ->release callbacks potentially has to wait for the 
> >> hmm_invalidate_range() callbacks to finish. That is not shown in any
> >> code directly, but it's because: when a device driver is processing 
> >> the above ->release callback, it has to allow any in-progress operations 
> >> to finish up (as specified clearly in your comment documentation above). 
> >>
> >> Some of those operations will invariably need to do things that result 
> >> in page invalidations, thus triggering the hmm_invalidate_range() callback.
> >> Then, the hmm_invalidate_range() callback tries to acquire the same 
> >> hmm->mirrors_sem lock, thus leading to deadlock:
> >>
> >> hmm_invalidate_range():
> >> // ...
> >> 	down_read(&hmm->mirrors_sem);
> >> 	list_for_each_entry(mirror, &hmm->mirrors, list)
> >> 		mirror->ops->sync_cpu_device_pagetables(mirror, action,
> >> 							start, end);
> >> 	up_read(&hmm->mirrors_sem);
> > 
> > That is just illegal, the release callback is not allowed to trigger
> > invalidation all it does is kill all device's threads and stop device
> > page fault from happening. So there is no deadlock issues. I can re-
> > inforce the comment some more (see [1] for example on what it should
> > be).
> 
> That rule is fine, and it is true that the .release callback will not 
> directly trigger any invalidations. However, the problem is in letting 
> any *existing* outstanding operations finish up. We have to let 
> existing operations "drain", in order to meet the requirement that 
> everything is done when .release returns.
> 
> For example, if a device driver thread is in the middle of working through
> its fault buffer, it will call migrate_vma(), which will in turn unmap
> pages. That will cause an hmm_invalidate_range() callback, which tries
> to take hmm->mirrors_sems, and we deadlock.
> 
> There's no way to "kill" such a thread while it's in the middle of
> migrate_vma(), you have to let it finish up.
>
> > Also it is illegal for the sync callback to trigger any mmu_notifier
> > callback. I thought this was obvious. The sync callback should only
> > update device page table and do _nothing else_. No way to make this
> > re-entrant.
> 
> That is obvious, yes. I am not trying to say there is any problem with
> that rule. It's the "drain outstanding operations during .release", 
> above, that is the real problem.

Maybe just relax the release callback wording, it should stop any
more processing of fault buffer but not wait for it to finish. In
nouveau code i kill thing but i do not wait hence i don't deadlock.

What matter is to stop any further processing. Yes some fault might
be in flight but they will serialize on various lock. So just do not
wait in the release callback, kill thing. I might have a bug where i
still fill in GPU page table in nouveau, i will check nouveau code
for that.

Kill thing should also kill the channel (i don't do that in nouveau
because i am waiting on some channel patchset) but i am not sure if
hardware like it if we kill channel before stoping fault notification.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 03/15] mm/hmm: HMM should have a callback before MM is destroyed v2
  2018-03-21 22:46           ` Jerome Glisse
@ 2018-03-21 23:10             ` John Hubbard
  -1 siblings, 0 replies; 99+ messages in thread
From: John Hubbard @ 2018-03-21 23:10 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, Andrew Morton, linux-kernel, Ralph Campbell, stable,
	Evgeny Baskakov, Mark Hairgrove

On 03/21/2018 03:46 PM, Jerome Glisse wrote:
> On Wed, Mar 21, 2018 at 03:16:04PM -0700, John Hubbard wrote:
>> On 03/21/2018 11:03 AM, Jerome Glisse wrote:
>>> On Tue, Mar 20, 2018 at 09:14:34PM -0700, John Hubbard wrote:
>>>> On 03/19/2018 07:00 PM, jglisse@redhat.com wrote:
>>>>> From: Ralph Campbell <rcampbell@nvidia.com>
>>
>> <snip>
>>
>>>> Hi Jerome,
>>>>
>>>> This presents a deadlock problem (details below). As for solution ideas, 
>>>> Mark Hairgrove points out that the MMU notifiers had to solve the
>>>> same sort of problem, and part of the solution involves "avoid
>>>> holding locks when issuing these callbacks". That's not an entire 
>>>> solution description, of course, but it seems like a good start.
>>>>
>>>> Anyway, for the deadlock problem:
>>>>
>>>> Each of these ->release callbacks potentially has to wait for the 
>>>> hmm_invalidate_range() callbacks to finish. That is not shown in any
>>>> code directly, but it's because: when a device driver is processing 
>>>> the above ->release callback, it has to allow any in-progress operations 
>>>> to finish up (as specified clearly in your comment documentation above). 
>>>>
>>>> Some of those operations will invariably need to do things that result 
>>>> in page invalidations, thus triggering the hmm_invalidate_range() callback.
>>>> Then, the hmm_invalidate_range() callback tries to acquire the same 
>>>> hmm->mirrors_sem lock, thus leading to deadlock:
>>>>
>>>> hmm_invalidate_range():
>>>> // ...
>>>> 	down_read(&hmm->mirrors_sem);
>>>> 	list_for_each_entry(mirror, &hmm->mirrors, list)
>>>> 		mirror->ops->sync_cpu_device_pagetables(mirror, action,
>>>> 							start, end);
>>>> 	up_read(&hmm->mirrors_sem);
>>>
>>> That is just illegal, the release callback is not allowed to trigger
>>> invalidation all it does is kill all device's threads and stop device
>>> page fault from happening. So there is no deadlock issues. I can re-
>>> inforce the comment some more (see [1] for example on what it should
>>> be).
>>
>> That rule is fine, and it is true that the .release callback will not 
>> directly trigger any invalidations. However, the problem is in letting 
>> any *existing* outstanding operations finish up. We have to let 
>> existing operations "drain", in order to meet the requirement that 
>> everything is done when .release returns.
>>
>> For example, if a device driver thread is in the middle of working through
>> its fault buffer, it will call migrate_vma(), which will in turn unmap
>> pages. That will cause an hmm_invalidate_range() callback, which tries
>> to take hmm->mirrors_sems, and we deadlock.
>>
>> There's no way to "kill" such a thread while it's in the middle of
>> migrate_vma(), you have to let it finish up.
>>
>>> Also it is illegal for the sync callback to trigger any mmu_notifier
>>> callback. I thought this was obvious. The sync callback should only
>>> update device page table and do _nothing else_. No way to make this
>>> re-entrant.
>>
>> That is obvious, yes. I am not trying to say there is any problem with
>> that rule. It's the "drain outstanding operations during .release", 
>> above, that is the real problem.
> 
> Maybe just relax the release callback wording, it should stop any
> more processing of fault buffer but not wait for it to finish. In
> nouveau code i kill thing but i do not wait hence i don't deadlock.

But you may crash, because that approach allows .release to finish
up, thus removing the mm entirely, out from under (for example)
a migrate_vma call--or any other call that refers to the mm.

It doesn't seem too hard to avoid the problem, though: maybe we
can just drop the lock while doing the mirror->ops->release callback.
There are a few ways to do this, but one example is: 

    -- take the lock,
        -- copy the list to a local list, deleting entries as you go,
    -- drop the lock, 
    -- iterate through the local list copy and 
        -- issue the mirror->ops->release callbacks.

At this point, more items could have been added to the list, so repeat
the above until the original list is empty. 

This is subject to a limited starvation case if mirror keep getting 
registered, but I think we can ignore that, because it only lasts as long as 
mirrors keep getting added, and then it finishes up.
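
In code, something like this (a rough sketch only; it glosses over the
interaction with hmm_mirror_unregister(), so treat it as the shape of the
idea rather than a finished patch):

	static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
	{
		struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
		struct hmm_mirror *mirror;
		LIST_HEAD(local);

		down_write(&hmm->mirrors_sem);
		while (!list_empty(&hmm->mirrors)) {
			/* Move the current batch of mirrors to a local list... */
			list_splice_init(&hmm->mirrors, &local);
			up_write(&hmm->mirrors_sem);

			/*
			 * ...and issue the callbacks without holding the lock,
			 * so a driver that has to wait for in-flight work
			 * (which may take mirrors_sem via an invalidation)
			 * cannot deadlock against us.
			 */
			while (!list_empty(&local)) {
				mirror = list_first_entry(&local,
							  struct hmm_mirror,
							  list);
				list_del_init(&mirror->list);
				mirror->ops->release(mirror);
			}

			down_write(&hmm->mirrors_sem);
		}
		up_write(&hmm->mirrors_sem);
	}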

> 
> What matter is to stop any further processing. Yes some fault might
> be in flight but they will serialize on various lock. 

Those faults in flight could already be at a point where they have taken
whatever locks they need, so we don't dare let the mm get destroyed while
such fault handling is in progress.


So just do not
> wait in the release callback, kill thing. I might have a bug where i
> still fill in GPU page table in nouveau, i will check nouveau code
> for that.

Again, we can't "kill" a thread of execution (this would often be an
interrupt bottom half context, btw) while it is, for example,
in the middle of migrate_vma.

I really don't believe there is a safe way to do this without draining
the existing operations before .release returns, and for that, we'll need to 
issue the .release callbacks while not holding locks.

thanks,
-- 
John Hubbard
NVIDIA

> 
> Kill thing should also kill the channel (i don't do that in nouveau
> because i am waiting on some channel patchset) but i am not sure if
> hardware like it if we kill channel before stoping fault notification.
> 
> Cheers,
> Jérôme
> 

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 10/15] mm/hmm: do not differentiate between empty entry or missing directory v2
  2018-03-21 14:48       ` Jerome Glisse
@ 2018-03-21 23:16         ` John Hubbard
  -1 siblings, 0 replies; 99+ messages in thread
From: John Hubbard @ 2018-03-21 23:16 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, Andrew Morton, linux-kernel, Evgeny Baskakov,
	Ralph Campbell, Mark Hairgrove

On 03/21/2018 07:48 AM, Jerome Glisse wrote:
> On Tue, Mar 20, 2018 at 10:24:34PM -0700, John Hubbard wrote:
>> On 03/19/2018 07:00 PM, jglisse@redhat.com wrote:
>>> From: Jérôme Glisse <jglisse@redhat.com>
>>>

<snip>

>>
>> <snip>
>>
>>> @@ -438,7 +423,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>>>  		pfns[i] = 0;
>>>  
>>>  		if (pte_none(pte)) {
>>> -			pfns[i] = HMM_PFN_EMPTY;
>>> +			pfns[i] = 0;
>>
>> This works, but why not keep HMM_PFN_EMPTY, and just define it as zero?
>> Symbols are better than raw numbers here.
>>
> 
> The last patch does that, so i don't think it is worth respinning
> just to make this intermediate state prettier.
> 

Yes, you're right, of course. And, no other problems found, so:

Reviewed-by: John Hubbard <jhubbard@nvidia.com>

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 15/15] mm/hmm: use device driver encoding for HMM pfn v2
  2018-03-21 15:52       ` Jerome Glisse
@ 2018-03-21 23:19         ` John Hubbard
  -1 siblings, 0 replies; 99+ messages in thread
From: John Hubbard @ 2018-03-21 23:19 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, Andrew Morton, linux-kernel, Evgeny Baskakov,
	Ralph Campbell, Mark Hairgrove

On 03/21/2018 08:52 AM, Jerome Glisse wrote:
> On Tue, Mar 20, 2018 at 09:39:27PM -0700, John Hubbard wrote:
>> On 03/19/2018 07:00 PM, jglisse@redhat.com wrote:
>>> From: Jérôme Glisse <jglisse@redhat.com>
> 
> [...]
> 

<snip>

>>
>> Let's just keep it simple, and go back to the bitmap flags!
> 
> This simplifies the nouveau code and it is the reason why i did that patch.
> I am sure it can simplify the NVidia uvm code, i can look into it if you
> want to give pointers. The idea here is that HMM can fill the array with
> something that matches the device driver's internal format and avoid the
> conversion step from HMM format to driver format (saving CPU cycles and
> memory doing so). I am open to alternatives that give the same end result.
> 
> [Just because code is worth 2^32 words :)
> 
> Without this patch:
>     int nouveau_do_fault(..., ulong addr, unsigned npages, ...)
>     {
>         uint64_t *hmm_pfns, *nouveau_pfns;
> 
>         hmm_pfns = kmalloc(sizeof(uint64_t) * npages, GFP_KERNEL);
>         nouveau_pfns = kmalloc(sizeof(uint64_t) * npages, GFP_KERNEL);
>         hmm_vma_fault(..., hmm_pfns, ...);
> 
>         for (i = 0; i < npages; ++i) {
>             nouveau_pfns[i] = nouveau_pfn_from_hmm_pfn(hmm_pfns[i]);
>         }
>         ...
>     }
> 
> With this patch:
>     int nouveau_do_fault(..., ulong addr, unsigned npages, ...)
>     {
>         uint64_t *nouveau_pfns;
> 
>         nouveau_pfns = kmalloc(sizeof(uint64_t) * npages, GFP_KERNEL);
>         hmm_vma_fault(..., nouveau_pfns, ...);
> 
>         ...
>     }
> 
> The benefit from this patch is quite obvious to me. Down the road, with a
> bit more integration between HMM and IOMMU/DMA, this can turn into something
> directly ready for hardware consumption.
> 
> Note that you could argue that i can convert nouveau to use the HMM format,
> but this would not work: first because it requires a lot of changes in
> nouveau, second because HMM does not have all the flags needed by the
> drivers (nor does HMM need them). HMM being the helper here, i feel it is
> up to HMM to adapt to the drivers rather than the other way around.]
> 

OK, if this simplifies Nouveau and potentially other drivers, then I'll 
drop my earlier objections! Thanks for explaining what's going on, in detail.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 04/15] mm/hmm: unregister mmu_notifier when last HMM client quit v2
  2018-03-21 18:16     ` jglisse
@ 2018-03-21 23:22       ` John Hubbard
  -1 siblings, 0 replies; 99+ messages in thread
From: John Hubbard @ 2018-03-21 23:22 UTC (permalink / raw)
  To: jglisse, linux-mm
  Cc: Andrew Morton, linux-kernel, Evgeny Baskakov, Ralph Campbell,
	Mark Hairgrove

On 03/21/2018 11:16 AM, jglisse@redhat.com wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> This code was lost in translation at one point. This properly call
> mmu_notifier_unregister_no_release() once last user is gone. This
> fix the zombie mm_struct as without this patch we do not drop the
> refcount we have on it.
> 
> Changed since v1:
>   - close race window between a last mirror unregistering and a new
>     mirror registering, which could have lead to use after free()
>     kind of bug
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Mark Hairgrove <mhairgrove@nvidia.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> ---
>  mm/hmm.c | 35 +++++++++++++++++++++++++++++++++--
>  1 file changed, 33 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 6088fa6ed137..f75aa8df6e97 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -222,13 +222,24 @@ int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
>  	if (!mm || !mirror || !mirror->ops)
>  		return -EINVAL;
>  
> +again:
>  	mirror->hmm = hmm_register(mm);
>  	if (!mirror->hmm)
>  		return -ENOMEM;
>  
>  	down_write(&mirror->hmm->mirrors_sem);
> -	list_add(&mirror->list, &mirror->hmm->mirrors);
> -	up_write(&mirror->hmm->mirrors_sem);
> +	if (mirror->hmm->mm == NULL) {
> +		/*
> +		 * A racing hmm_mirror_unregister() is about to destroy the hmm
> +		 * struct. Try again to allocate a new one.
> +		 */
> +		up_write(&mirror->hmm->mirrors_sem);
> +		mirror->hmm = NULL;

This is being set outside of locks, so now there is another race with
another hmm_mirror_register...

I'll take a moment and draft up what I have in mind here, which is a more
symmetrical locking scheme for these routines.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 03/15] mm/hmm: HMM should have a callback before MM is destroyed v2
  2018-03-21 23:10             ` John Hubbard
@ 2018-03-21 23:37               ` Jerome Glisse
  -1 siblings, 0 replies; 99+ messages in thread
From: Jerome Glisse @ 2018-03-21 23:37 UTC (permalink / raw)
  To: John Hubbard
  Cc: linux-mm, Andrew Morton, linux-kernel, Ralph Campbell, stable,
	Evgeny Baskakov, Mark Hairgrove

On Wed, Mar 21, 2018 at 04:10:32PM -0700, John Hubbard wrote:
> On 03/21/2018 03:46 PM, Jerome Glisse wrote:
> > On Wed, Mar 21, 2018 at 03:16:04PM -0700, John Hubbard wrote:
> >> On 03/21/2018 11:03 AM, Jerome Glisse wrote:
> >>> On Tue, Mar 20, 2018 at 09:14:34PM -0700, John Hubbard wrote:
> >>>> On 03/19/2018 07:00 PM, jglisse@redhat.com wrote:
> >>>>> From: Ralph Campbell <rcampbell@nvidia.com>

[...]

> >>> That is just illegal, the release callback is not allowed to trigger
> >>> invalidation all it does is kill all device's threads and stop device
> >>> page fault from happening. So there is no deadlock issues. I can re-
> >>> inforce the comment some more (see [1] for example on what it should
> >>> be).
> >>
> >> That rule is fine, and it is true that the .release callback will not 
> >> directly trigger any invalidations. However, the problem is in letting 
> >> any *existing* outstanding operations finish up. We have to let 
> >> existing operations "drain", in order to meet the requirement that 
> >> everything is done when .release returns.
> >>
> >> For example, if a device driver thread is in the middle of working through
> >> its fault buffer, it will call migrate_vma(), which will in turn unmap
> >> pages. That will cause an hmm_invalidate_range() callback, which tries
> >> to take hmm->mirrors_sems, and we deadlock.
> >>
> >> There's no way to "kill" such a thread while it's in the middle of
> >> migrate_vma(), you have to let it finish up.
> >>
> >>> Also it is illegal for the sync callback to trigger any mmu_notifier
> >>> callback. I thought this was obvious. The sync callback should only
> >>> update device page table and do _nothing else_. No way to make this
> >>> re-entrant.
> >>
> >> That is obvious, yes. I am not trying to say there is any problem with
> >> that rule. It's the "drain outstanding operations during .release", 
> >> above, that is the real problem.
> > 
> > Maybe just relax the release callback wording, it should stop any
> > more processing of fault buffer but not wait for it to finish. In
> > nouveau code i kill thing but i do not wait hence i don't deadlock.
> 
> But you may crash, because that approach allows .release to finish
> up, thus removing the mm entirely, out from under (for example)
> a migrate_vma call--or any other call that refers to the mm.

No, you can not crash on the mm, as it will not vanish before you are done
with it: the mm will not be freed before you call hmm_unregister(), and
you should not call that from release, nor should you call it before
everything is flushed. However the vma struct might vanish ... i might have
assumed wrongly that the down_write() always happens in exit_mmap().
This might be a solution to force serialization.


> 
> It doesn't seem too hard to avoid the problem, though: maybe we
> can just drop the lock while doing the mirror->ops->release callback.
> There are a few ways to do this, but one example is: 
> 
>     -- take the lock,
>         -- copy the list to a local list, deleting entries as you go,
>     -- drop the lock, 
>     -- iterate through the local list copy and 
>         -- issue the mirror->ops->release callbacks.
> 
> At this point, more items could have been added to the list, so repeat
> the above until the original list is empty. 
> 
> This is subject to a limited starvation case if mirror keep getting 
> registered, but I think we can ignore that, because it only lasts as long as 
> mirrors keep getting added, and then it finishes up.

The down_write is a better solution and easier, just 2 lines of code.

> 
> > 
> > What matter is to stop any further processing. Yes some fault might
> > be in flight but they will serialize on various lock. 
> 
> Those faults in flight could already be at a point where they have taken
> whatever locks they need, so we don't dare let the mm get destroyed while
> such fault handling is in progress.

The mm can not vanish until hmm_unregister() is called; the vma will vanish before.

> So just do not
> > wait in the release callback, kill thing. I might have a bug where i
> > still fill in GPU page table in nouveau, i will check nouveau code
> > for that.
> 
> Again, we can't "kill" a thread of execution (this would often be an
> interrupt bottom half context, btw) while it is, for example,
> in the middle of migrate_vma.

You should not call migrate from a bottom half! Only call this from a
workqueue, like nouveau does.

> 
> I really don't believe there is a safe way to do this without draining
> the existing operations before .release returns, and for that, we'll need to 
> issue the .release callbacks while not holding locks.

down_write on mmap_sem would force serialization. I am not sure we want
to do this change now. It can wait, as it is definitely not an issue for
nouveau yet. Taking mmap_sem in write (see oom in exit_mmap()) in release
makes me nervous.
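
(For completeness, the 2 lines i am thinking of would just wrap the existing
walk in hmm_release(), roughly:

	down_write(&mm->mmap_sem);
	/* existing mirrors_sem walk issuing the ->release() callbacks */
	up_write(&mm->mmap_sem);

which would serialize the teardown against any fault path that holds
mmap_sem for read.)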

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 04/15] mm/hmm: unregister mmu_notifier when last HMM client quit v2
  2018-03-21 23:22       ` John Hubbard
@ 2018-03-21 23:41         ` Jerome Glisse
  -1 siblings, 0 replies; 99+ messages in thread
From: Jerome Glisse @ 2018-03-21 23:41 UTC (permalink / raw)
  To: John Hubbard
  Cc: linux-mm, Andrew Morton, linux-kernel, Evgeny Baskakov,
	Ralph Campbell, Mark Hairgrove

On Wed, Mar 21, 2018 at 04:22:49PM -0700, John Hubbard wrote:
> On 03/21/2018 11:16 AM, jglisse@redhat.com wrote:
> > From: Jérôme Glisse <jglisse@redhat.com>
> > 
> > This code was lost in translation at one point. This properly call
> > mmu_notifier_unregister_no_release() once last user is gone. This
> > fix the zombie mm_struct as without this patch we do not drop the
> > refcount we have on it.
> > 
> > Changed since v1:
> >   - close race window between a last mirror unregistering and a new
> >     mirror registering, which could have lead to use after free()
> >     kind of bug
> > 
> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
> > Cc: Ralph Campbell <rcampbell@nvidia.com>
> > Cc: Mark Hairgrove <mhairgrove@nvidia.com>
> > Cc: John Hubbard <jhubbard@nvidia.com>
> > ---
> >  mm/hmm.c | 35 +++++++++++++++++++++++++++++++++--
> >  1 file changed, 33 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index 6088fa6ed137..f75aa8df6e97 100644
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -222,13 +222,24 @@ int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
> >  	if (!mm || !mirror || !mirror->ops)
> >  		return -EINVAL;
> >  
> > +again:
> >  	mirror->hmm = hmm_register(mm);
> >  	if (!mirror->hmm)
> >  		return -ENOMEM;
> >  
> >  	down_write(&mirror->hmm->mirrors_sem);
> > -	list_add(&mirror->list, &mirror->hmm->mirrors);
> > -	up_write(&mirror->hmm->mirrors_sem);
> > +	if (mirror->hmm->mm == NULL) {
> > +		/*
> > +		 * A racing hmm_mirror_unregister() is about to destroy the hmm
> > +		 * struct. Try again to allocate a new one.
> > +		 */
> > +		up_write(&mirror->hmm->mirrors_sem);
> > +		mirror->hmm = NULL;
> 
> This is being set outside of locks, so now there is another race with
> another hmm_mirror_register...
> 
> I'll take a moment and draft up what I have in mind here, which is a more
> symmetrical locking scheme for these routines.
> 

No, this code is correct. hmm->mm is set after the hmm struct is allocated
and before it is public, so no one can race with that. It is cleared in
hmm_mirror_unregister() under the write lock, hence checking it here
under that same lock is correct.
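
Ie the pairing looks like this (simplified sketch of the unregister side,
not the exact diff):

	void hmm_mirror_unregister(struct hmm_mirror *mirror)
	{
		struct hmm *hmm = mirror->hmm;

		down_write(&hmm->mirrors_sem);
		list_del_init(&mirror->list);
		if (list_empty(&hmm->mirrors)) {
			/*
			 * Last mirror: mark the hmm struct as going away under
			 * the same lock hmm_mirror_register() takes before it
			 * re-checks hmm->mm.
			 */
			hmm->mm = NULL;
		}
		up_write(&hmm->mirrors_sem);

		/*
		 * The mmu notifier unregister and the freeing of the hmm
		 * struct happen after this point, once no new mirror can
		 * pick this hmm struct up anymore.
		 */
	}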

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 03/15] mm/hmm: HMM should have a callback before MM is destroyed v2
  2018-03-21 23:37               ` Jerome Glisse
@ 2018-03-22  0:11                 ` John Hubbard
  -1 siblings, 0 replies; 99+ messages in thread
From: John Hubbard @ 2018-03-22  0:11 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, Andrew Morton, linux-kernel, Ralph Campbell, stable,
	Evgeny Baskakov, Mark Hairgrove

On 03/21/2018 04:37 PM, Jerome Glisse wrote:
> On Wed, Mar 21, 2018 at 04:10:32PM -0700, John Hubbard wrote:
>> On 03/21/2018 03:46 PM, Jerome Glisse wrote:
>>> On Wed, Mar 21, 2018 at 03:16:04PM -0700, John Hubbard wrote:
>>>> On 03/21/2018 11:03 AM, Jerome Glisse wrote:
>>>>> On Tue, Mar 20, 2018 at 09:14:34PM -0700, John Hubbard wrote:
>>>>>> On 03/19/2018 07:00 PM, jglisse@redhat.com wrote:
>>>>>>> From: Ralph Campbell <rcampbell@nvidia.com>
> 
> [...]
> 
>>>>> That is just illegal, the release callback is not allowed to trigger
>>>>> invalidation all it does is kill all device's threads and stop device
>>>>> page fault from happening. So there is no deadlock issues. I can re-
>>>>> inforce the comment some more (see [1] for example on what it should
>>>>> be).
>>>>
>>>> That rule is fine, and it is true that the .release callback will not 
>>>> directly trigger any invalidations. However, the problem is in letting 
>>>> any *existing* outstanding operations finish up. We have to let 
>>>> existing operations "drain", in order to meet the requirement that 
>>>> everything is done when .release returns.
>>>>
>>>> For example, if a device driver thread is in the middle of working through
>>>> its fault buffer, it will call migrate_vma(), which will in turn unmap
>>>> pages. That will cause an hmm_invalidate_range() callback, which tries
>>>> to take hmm->mirrors_sems, and we deadlock.
>>>>
>>>> There's no way to "kill" such a thread while it's in the middle of
>>>> migrate_vma(), you have to let it finish up.
>>>>
>>>>> Also it is illegal for the sync callback to trigger any mmu_notifier
>>>>> callback. I thought this was obvious. The sync callback should only
>>>>> update device page table and do _nothing else_. No way to make this
>>>>> re-entrant.
>>>>
>>>> That is obvious, yes. I am not trying to say there is any problem with
>>>> that rule. It's the "drain outstanding operations during .release", 
>>>> above, that is the real problem.
>>>
>>> Maybe just relax the release callback wording, it should stop any
>>> more processing of fault buffer but not wait for it to finish. In
>>> nouveau code i kill thing but i do not wait hence i don't deadlock.
>>
>> But you may crash, because that approach allows .release to finish
>> up, thus removing the mm entirely, out from under (for example)
>> a migrate_vma call--or any other call that refers to the mm.
> 
> No you can not crash on mm as it will not vanish before you are done
> with it as mm will not be freed before you call hmm_unregister() and
> you should not call that from release, nor should you call it before
> everything is flush. However vma struct might vanish ... i might have
> assume wrongly about the down_write() always happening in exit_mmap()
> This might be a solution to force serialization.
> 
 
OK. My details on mm destruction were inaccurate, but we do agree now
that the whole virtual address space is being torn down at the same
time as we're trying to use it, so I think we're on the same page now.

>>
>> It doesn't seem too hard to avoid the problem, though: maybe we
>> can just drop the lock while doing the mirror->ops->release callback.
>> There are a few ways to do this, but one example is: 
>>
>>     -- take the lock,
>>         -- copy the list to a local list, deleting entries as you go,
>>     -- drop the lock, 
>>     -- iterate through the local list copy and 
>>         -- issue the mirror->ops->release callbacks.
>>
>> At this point, more items could have been added to the list, so repeat
>> the above until the original list is empty. 
>>
>> This is subject to a limited starvation case if mirror keep getting 
>> registered, but I think we can ignore that, because it only lasts as long as 
>> mirrors keep getting added, and then it finishes up.
> 
> The down_write is better solution and easier just 2 line of code.

OK. I'll have a better idea when I see it.
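
For reference, a minimal sketch of the copy-the-list approach quoted above
might look like the following. It is only an illustration of that idea, not
the patch that was eventually posted; the hmm/mirror field names are taken
from mm/hmm.c and the function name is made up:

/* Sketch only: splice the mirror list onto a private list, drop the lock,
 * issue the release callbacks, and repeat until no new mirrors show up. */
static void hmm_release_all_mirrors(struct hmm *hmm)
{
        struct hmm_mirror *mirror, *next;
        LIST_HEAD(local);

        down_write(&hmm->mirrors_sem);
        while (!list_empty(&hmm->mirrors)) {
                /* Move everything registered so far onto the private list. */
                list_splice_init(&hmm->mirrors, &local);
                up_write(&hmm->mirrors_sem);

                /* Callbacks run without mirrors_sem held, so they may wait. */
                list_for_each_entry_safe(mirror, next, &local, list) {
                        list_del_init(&mirror->list);
                        if (mirror->ops->release)
                                mirror->ops->release(mirror);
                }

                /* New mirrors may have registered meanwhile; go around again. */
                down_write(&hmm->mirrors_sem);
        }
        up_write(&hmm->mirrors_sem);
}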

> 
>>
>>>
>>> What matter is to stop any further processing. Yes some fault might
>>> be in flight but they will serialize on various lock. 
>>
>> Those faults in flight could already be at a point where they have taken
>> whatever locks they need, so we don't dare let the mm get destroyed while
>> such fault handling is in progress.
> 
> mm can not vanish until hmm_unregister() is call, vma will vanish before.

OK, yes. And we agree that vma vanishing is a problem. 

> 
>> So just do not
>>> wait in the release callback, kill thing. I might have a bug where i
>>> still fill in GPU page table in nouveau, i will check nouveau code
>>> for that.
>>
>> Again, we can't "kill" a thread of execution (this would often be an
>> interrupt bottom half context, btw) while it is, for example,
>> in the middle of migrate_vma.
> 
> You should not call migrate from bottom half ! Only call this from work
> queue like nouveau.

By "bottom half", I mean the kthread that we have running to handle work
that was handed off from the top half ISR. So we are in process context.
And we will need to do migrate_vma() from there.
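
To illustrate that split, a hypothetical driver would only queue work from
the interrupt path and do the actual migration later in process context,
roughly as below. Everything here (my_gpu, my_fault_work, the handler names)
is made up for illustration, and the migrate_vma() call itself is elided:

#include <linux/interrupt.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct my_gpu;                          /* hypothetical device state */

struct my_fault_work {
        struct work_struct work;
        struct my_gpu *gpu;
        unsigned long start, end;       /* range taken from the fault buffer */
};

static void my_fault_worker(struct work_struct *work)
{
        struct my_fault_work *w = container_of(work, struct my_fault_work, work);

        /*
         * Process context: it is safe to sleep here, take mmap_sem and
         * call migrate_vma() for the [w->start, w->end) range.
         */
        /* ... migrate_vma(...) for w->gpu would go here ... */
        kfree(w);
}

static irqreturn_t my_gpu_isr(int irq, void *data)
{
        struct my_fault_work *w = kzalloc(sizeof(*w), GFP_ATOMIC);

        if (!w)
                return IRQ_HANDLED;
        w->gpu = data;
        /* w->start/w->end would be filled from the device fault buffer. */
        INIT_WORK(&w->work, my_fault_worker);
        schedule_work(&w->work);        /* migration happens later, not here */
        return IRQ_HANDLED;
}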

> 
>>
>> I really don't believe there is a safe way to do this without draining
>> the existing operations before .release returns, and for that, we'll need to 
>> issue the .release callbacks while not holding locks.
> 
> down_write on mmap_sem would force serialization. I am not sure we want
> to do this change now. It can wait as it is definitly not an issue for
> nouveau yet. Taking mmap_sem in write (see oom in exit_mmap()) in release
> make me nervous.
> 

I'm not going to lose any sleep over when the various fixes are made, as long as
we agree on the problems and solution approaches, and fix them at some point.
I will note that our downstream driver will not be...well, completely usable,
until we fix this, though.

thanks,
-- 
John Hubbard
NVIDIA
 

^ permalink raw reply	[flat|nested] 99+ messages in thread

* [PATCH 03/15] mm/hmm: HMM should have a callback before MM is destroyed v3
  2018-03-20  2:00   ` jglisse
@ 2018-03-22  1:28     ` jglisse
  -1 siblings, 0 replies; 99+ messages in thread
From: jglisse @ 2018-03-22  1:28 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Ralph Campbell,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	John Hubbard

From: Ralph Campbell <rcampbell@nvidia.com>

The hmm_mirror_register() function registers a callback for when
the CPU pagetable is modified. Normally, the device driver will
call hmm_mirror_unregister() when the process using the device is
finished. However, if the process exits uncleanly, the mm_struct
can be destroyed with no warning to the device driver.

Changed since v1:
  - dropped VM_BUG_ON()
  - cc stable
Changed since v2:
  - drop stable
  - Split list removal and call to driver release callback. This
    allows the release callback to wait on any pending fault handler
    without deadlocking.

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/hmm.h | 10 ++++++++++
 mm/hmm.c            | 29 ++++++++++++++++++++++++++++-
 2 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 36dd21fe5caf..fa7b51f65905 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -218,6 +218,16 @@ enum hmm_update_type {
  * @update: callback to update range on a device
  */
 struct hmm_mirror_ops {
+	/* release() - release hmm_mirror
+	 *
+	 * @mirror: pointer to struct hmm_mirror
+	 *
+	 * This is called when the mm_struct is being released.
+	 * The callback should make sure no references to the mirror occur
+	 * after the callback returns.
+	 */
+	void (*release)(struct hmm_mirror *mirror);
+
 	/* sync_cpu_device_pagetables() - synchronize page tables
 	 *
 	 * @mirror: pointer to struct hmm_mirror
diff --git a/mm/hmm.c b/mm/hmm.c
index 320545b98ff5..34c16297f65e 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -160,6 +160,32 @@ static void hmm_invalidate_range(struct hmm *hmm,
 	up_read(&hmm->mirrors_sem);
 }
 
+static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	struct hmm_mirror *mirror;
+	struct hmm *hmm = mm->hmm;
+
+	down_write(&hmm->mirrors_sem);
+	mirror = list_first_entry_or_null(&hmm->mirrors, struct hmm_mirror,
+					  list);
+	while (mirror) {
+		list_del_init(&mirror->list);
+		if (mirror->ops->release) {
+			/*
+			 * Drop mirrors_sem so callback can wait on any pending
+			 * work that might itself trigger mmu_notifier callback
+			 * and thus would deadlock with us.
+			 */
+			up_write(&hmm->mirrors_sem);
+			mirror->ops->release(mirror);
+			down_write(&hmm->mirrors_sem);
+		}
+		mirror = list_first_entry_or_null(&hmm->mirrors, struct hmm_mirror,
+						  list);
+	}
+	up_write(&hmm->mirrors_sem);
+}
+
 static void hmm_invalidate_range_start(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
 				       unsigned long start,
@@ -185,6 +211,7 @@ static void hmm_invalidate_range_end(struct mmu_notifier *mn,
 }
 
 static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
+	.release		= hmm_release,
 	.invalidate_range_start	= hmm_invalidate_range_start,
 	.invalidate_range_end	= hmm_invalidate_range_end,
 };
@@ -230,7 +257,7 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror)
 	struct hmm *hmm = mirror->hmm;
 
 	down_write(&hmm->mirrors_sem);
-	list_del(&mirror->list);
+	list_del_init(&mirror->list);
 	up_write(&hmm->mirrors_sem);
 }
 EXPORT_SYMBOL(hmm_mirror_unregister);
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 99+ messages in thread
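
Under the contract this patch documents (the release callback stops device
faulting, may wait for pending work because mirrors_sem is dropped around the
call, and must not call back into HMM), a driver-side implementation might
look roughly like the sketch below. The my_device structure and the
my_dev_*() helpers are hypothetical; only hmm_mirror_ops.release comes from
the patch:

#include <linux/hmm.h>

struct my_device {
        struct hmm_mirror mirror;       /* embedded so container_of() works */
        /* ... fault workqueue, channels, ... */
};

/* Hypothetical helpers, device-specific in a real driver. */
static void my_dev_stop_faulting(struct my_device *dev) { /* mask faults */ }
static void my_dev_flush_fault_work(struct my_device *dev) { /* flush work */ }

static void my_driver_release(struct hmm_mirror *mirror)
{
        struct my_device *dev = container_of(mirror, struct my_device, mirror);

        /* Stop the device from raising new page faults against this mm. */
        my_dev_stop_faulting(dev);

        /*
         * Waiting here is allowed: hmm_release() dropped mirrors_sem before
         * calling us, so in-flight fault work that ends up in mmu_notifier
         * paths (e.g. via migrate_vma()) can still finish without deadlock.
         */
        my_dev_flush_fault_work(dev);

        /* Do not call back into HMM or trigger invalidations from here. */
}

static const struct hmm_mirror_ops my_driver_mirror_ops = {
        .release = my_driver_release,
        /* .sync_cpu_device_pagetables = ..., as before */
};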

* [PATCH 04/15] mm/hmm: unregister mmu_notifier when last HMM client quit v3
  2018-03-21 18:16     ` jglisse
@ 2018-03-22  1:30       ` jglisse
  -1 siblings, 0 replies; 99+ messages in thread
From: jglisse @ 2018-03-22  1:30 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Evgeny Baskakov, Ralph Campbell, Mark Hairgrove, John Hubbard

From: Jérôme Glisse <jglisse@redhat.com>

This code was lost in translation at one point. This properly calls
mmu_notifier_unregister_no_release() once the last user is gone. This
fixes the zombie mm_struct: without this patch we do not drop the
refcount we hold on it.

Changed since v1:
  - close a race window between the last mirror unregistering and a new
    mirror registering, which could have led to a use-after-free kind
    of bug
Changed since v2:
  - Avoid an issue if there are multiple calls to hmm_mirror_register()
    (which led to a use-after-free).

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 mm/hmm.c | 39 ++++++++++++++++++++++++++++++++++++---
 1 file changed, 36 insertions(+), 3 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 34c16297f65e..5b21d458db60 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -233,13 +233,24 @@ int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
 	if (!mm || !mirror || !mirror->ops)
 		return -EINVAL;
 
+again:
 	mirror->hmm = hmm_register(mm);
 	if (!mirror->hmm)
 		return -ENOMEM;
 
 	down_write(&mirror->hmm->mirrors_sem);
-	list_add(&mirror->list, &mirror->hmm->mirrors);
-	up_write(&mirror->hmm->mirrors_sem);
+	if (mirror->hmm->mm == NULL) {
+		/*
+		 * A racing hmm_mirror_unregister() is about to destroy the hmm
+		 * struct. Try again to allocate a new one.
+		 */
+		up_write(&mirror->hmm->mirrors_sem);
+		mirror->hmm = NULL;
+		goto again;
+	} else {
+		list_add(&mirror->list, &mirror->hmm->mirrors);
+		up_write(&mirror->hmm->mirrors_sem);
+	}
 
 	return 0;
 }
@@ -254,11 +265,33 @@ EXPORT_SYMBOL(hmm_mirror_register);
  */
 void hmm_mirror_unregister(struct hmm_mirror *mirror)
 {
-	struct hmm *hmm = mirror->hmm;
+	bool should_unregister = false;
+	struct mm_struct *mm;
+	struct hmm *hmm;
+
+	if (mirror->hmm == NULL)
+		return;
 
+	hmm = mirror->hmm;
 	down_write(&hmm->mirrors_sem);
 	list_del_init(&mirror->list);
+	should_unregister = list_empty(&hmm->mirrors);
+	mirror->hmm = NULL;
+	mm = hmm->mm;
+	hmm->mm = NULL;
 	up_write(&hmm->mirrors_sem);
+
+	if (!should_unregister || mm == NULL)
+		return;
+
+	spin_lock(&mm->page_table_lock);
+	if (mm->hmm == hmm) {
+		mm->hmm = NULL;
+	}
+	spin_unlock(&mm->page_table_lock);
+
+	mmu_notifier_unregister_no_release(&hmm->mmu_notifier, mm);
+	kfree(hmm);
 }
 EXPORT_SYMBOL(hmm_mirror_unregister);
 
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 99+ messages in thread
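
For context, the way a driver is expected to pair these calls once this patch
is applied might look like the sketch below. The my_context structure and
my_driver_mirror_ops (e.g. the release sketch shown earlier) are hypothetical;
only hmm_mirror_register()/hmm_mirror_unregister() are the functions touched
here:

struct my_context {
        struct hmm_mirror mirror;       /* one mirror per (device, mm) pair */
        /* ... per-process device state ... */
};

static int my_context_bind(struct my_context *ctx, struct mm_struct *mm)
{
        ctx->mirror.ops = &my_driver_mirror_ops;        /* hypothetical ops */

        /*
         * May retry internally if it races with the last
         * hmm_mirror_unregister() tearing down the struct hmm for this mm.
         */
        return hmm_mirror_register(&ctx->mirror, mm);
}

static void my_context_unbind(struct my_context *ctx)
{
        /*
         * With this patch, the last unregister on an mm also calls
         * mmu_notifier_unregister_no_release() and frees struct hmm,
         * which is where the mm_struct refcount finally gets dropped.
         */
        hmm_mirror_unregister(&ctx->mirror);
}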

* Re: [PATCH 03/15] mm/hmm: HMM should have a callback before MM is destroyed v2
  2018-03-22  0:11                 ` John Hubbard
  (?)
@ 2018-03-22  1:32                   ` Jerome Glisse
  -1 siblings, 0 replies; 99+ messages in thread
From: Jerome Glisse @ 2018-03-22  1:32 UTC (permalink / raw)
  To: John Hubbard
  Cc: linux-mm, Andrew Morton, linux-kernel, Ralph Campbell, stable,
	Evgeny Baskakov, Mark Hairgrove

On Wed, Mar 21, 2018 at 05:11:10PM -0700, John Hubbard wrote:
> On 03/21/2018 04:37 PM, Jerome Glisse wrote:
> > On Wed, Mar 21, 2018 at 04:10:32PM -0700, John Hubbard wrote:
> >> On 03/21/2018 03:46 PM, Jerome Glisse wrote:
> >>> On Wed, Mar 21, 2018 at 03:16:04PM -0700, John Hubbard wrote:
> >>>> On 03/21/2018 11:03 AM, Jerome Glisse wrote:
> >>>>> On Tue, Mar 20, 2018 at 09:14:34PM -0700, John Hubbard wrote:
> >>>>>> On 03/19/2018 07:00 PM, jglisse@redhat.com wrote:
> >>>>>>> From: Ralph Campbell <rcampbell@nvidia.com>
> > 
> > [...]
> > 
> >>>>> That is just illegal, the release callback is not allowed to trigger
> >>>>> invalidation all it does is kill all device's threads and stop device
> >>>>> page fault from happening. So there is no deadlock issues. I can re-
> >>>>> inforce the comment some more (see [1] for example on what it should
> >>>>> be).
> >>>>
> >>>> That rule is fine, and it is true that the .release callback will not 
> >>>> directly trigger any invalidations. However, the problem is in letting 
> >>>> any *existing* outstanding operations finish up. We have to let 
> >>>> existing operations "drain", in order to meet the requirement that 
> >>>> everything is done when .release returns.
> >>>>
> >>>> For example, if a device driver thread is in the middle of working through
> >>>> its fault buffer, it will call migrate_vma(), which will in turn unmap
> >>>> pages. That will cause an hmm_invalidate_range() callback, which tries
> >>>> to take hmm->mirrors_sems, and we deadlock.
> >>>>
> >>>> There's no way to "kill" such a thread while it's in the middle of
> >>>> migrate_vma(), you have to let it finish up.
> >>>>
> >>>>> Also it is illegal for the sync callback to trigger any mmu_notifier
> >>>>> callback. I thought this was obvious. The sync callback should only
> >>>>> update device page table and do _nothing else_. No way to make this
> >>>>> re-entrant.
> >>>>
> >>>> That is obvious, yes. I am not trying to say there is any problem with
> >>>> that rule. It's the "drain outstanding operations during .release", 
> >>>> above, that is the real problem.
> >>>
> >>> Maybe just relax the release callback wording, it should stop any
> >>> more processing of fault buffer but not wait for it to finish. In
> >>> nouveau code i kill thing but i do not wait hence i don't deadlock.
> >>
> >> But you may crash, because that approach allows .release to finish
> >> up, thus removing the mm entirely, out from under (for example)
> >> a migrate_vma call--or any other call that refers to the mm.
> > 
> > No you can not crash on mm as it will not vanish before you are done
> > with it as mm will not be freed before you call hmm_unregister() and
> > you should not call that from release, nor should you call it before
> > everything is flush. However vma struct might vanish ... i might have
> > assume wrongly about the down_write() always happening in exit_mmap()
> > This might be a solution to force serialization.
> > 
>  
> OK. My details on mm destruction were inaccurate, but we do agree now
> that that the whole virtual address space is being torn down at the same 
> time as we're trying to use it, so I think we're on the same page now.
> 
> >>
> >> It doesn't seem too hard to avoid the problem, though: maybe we
> >> can just drop the lock while doing the mirror->ops->release callback.
> >> There are a few ways to do this, but one example is: 
> >>
> >>     -- take the lock,
> >>         -- copy the list to a local list, deleting entries as you go,
> >>     -- drop the lock, 
> >>     -- iterate through the local list copy and 
> >>         -- issue the mirror->ops->release callbacks.
> >>
> >> At this point, more items could have been added to the list, so repeat
> >> the above until the original list is empty. 
> >>
> >> This is subject to a limited starvation case if mirror keep getting 
> >> registered, but I think we can ignore that, because it only lasts as long as 
> >> mirrors keep getting added, and then it finishes up.
> > 
> > The down_write is better solution and easier just 2 line of code.
> 
> OK. I'll have a better idea when I see it.
> 
> > 
> >>
> >>>
> >>> What matter is to stop any further processing. Yes some fault might
> >>> be in flight but they will serialize on various lock. 
> >>
> >> Those faults in flight could already be at a point where they have taken
> >> whatever locks they need, so we don't dare let the mm get destroyed while
> >> such fault handling is in progress.
> > 
> > mm can not vanish until hmm_unregister() is call, vma will vanish before.
> 
> OK, yes. And we agree that vma vanishing is a problem. 
> 
> > 
> >> So just do not
> >>> wait in the release callback, kill thing. I might have a bug where i
> >>> still fill in GPU page table in nouveau, i will check nouveau code
> >>> for that.
> >>
> >> Again, we can't "kill" a thread of execution (this would often be an
> >> interrupt bottom half context, btw) while it is, for example,
> >> in the middle of migrate_vma.
> > 
> > You should not call migrate from bottom half ! Only call this from work
> > queue like nouveau.
> 
> By "bottom half", I mean the kthread that we have running to handle work
> that was handed off from the top half ISR. So we are in process context.
> And we will need to do migrate_vma() from there.
> 
> > 
> >>
> >> I really don't believe there is a safe way to do this without draining
> >> the existing operations before .release returns, and for that, we'll need to 
> >> issue the .release callbacks while not holding locks.
> > 
> > down_write on mmap_sem would force serialization. I am not sure we want
> > to do this change now. It can wait as it is definitly not an issue for
> > nouveau yet. Taking mmap_sem in write (see oom in exit_mmap()) in release
> > make me nervous.
> > 
> 
> I'm not going to lose any sleep about when various fixes are made, as long as
> we agree on problems and solution approaches, and fix them at some point.
> I will note that our downstreamdriver will not be...well, completely usable, 
> until we fix this, though.
> 

So I posted updated patches for 3 and 4 that should address your concern.
Testing was done with them and nouveau seems to work OK. I am hoping this
addresses all your concerns.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 03/15] mm/hmm: HMM should have a callback before MM is destroyed v3
  2018-03-22  1:28     ` jglisse
@ 2018-03-22  6:58       ` John Hubbard
  -1 siblings, 0 replies; 99+ messages in thread
From: John Hubbard @ 2018-03-22  6:58 UTC (permalink / raw)
  To: jglisse, linux-mm
  Cc: Andrew Morton, linux-kernel, Ralph Campbell, Evgeny Baskakov,
	Mark Hairgrove

On 03/21/2018 06:28 PM, jglisse@redhat.com wrote:
> From: Ralph Campbell <rcampbell@nvidia.com>
> 
> The hmm_mirror_register() function registers a callback for when
> the CPU pagetable is modified. Normally, the device driver will
> call hmm_mirror_unregister() when the process using the device is
> finished. However, if the process exits uncleanly, the struct_mm
> can be destroyed with no warning to the device driver.
> 
> Changed since v1:
>   - dropped VM_BUG_ON()
>   - cc stable
> Changed since v2:
>   - drop stable
>   - Split list removale and call to driver release callback. This
>     allow the release callback to wait on any pending fault handler
>     without deadlock.
> 
> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
> Cc: Mark Hairgrove <mhairgrove@nvidia.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> ---
>  include/linux/hmm.h | 10 ++++++++++
>  mm/hmm.c            | 29 ++++++++++++++++++++++++++++-
>  2 files changed, 38 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 36dd21fe5caf..fa7b51f65905 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -218,6 +218,16 @@ enum hmm_update_type {
>   * @update: callback to update range on a device
>   */
>  struct hmm_mirror_ops {
> +	/* release() - release hmm_mirror
> +	 *
> +	 * @mirror: pointer to struct hmm_mirror
> +	 *
> +	 * This is called when the mm_struct is being released.
> +	 * The callback should make sure no references to the mirror occur
> +	 * after the callback returns.
> +	 */
> +	void (*release)(struct hmm_mirror *mirror);
> +
>  	/* sync_cpu_device_pagetables() - synchronize page tables
>  	 *
>  	 * @mirror: pointer to struct hmm_mirror
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 320545b98ff5..34c16297f65e 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -160,6 +160,32 @@ static void hmm_invalidate_range(struct hmm *hmm,
>  	up_read(&hmm->mirrors_sem);
>  }
>  
> +static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> +	struct hmm_mirror *mirror;
> +	struct hmm *hmm = mm->hmm;
> +
> +	down_write(&hmm->mirrors_sem);
> +	mirror = list_first_entry_or_null(&hmm->mirrors, struct hmm_mirror,
> +					  list);
> +	while (mirror) {
> +		list_del_init(&mirror->list);
> +		if (mirror->ops->release) {
> +			/*
> +			 * Drop mirrors_sem so callback can wait on any pending
> +			 * work that might itself trigger mmu_notifier callback
> +			 * and thus would deadlock with us.
> +			 */
> +			up_write(&hmm->mirrors_sem);
> +			mirror->ops->release(mirror);
> +			down_write(&hmm->mirrors_sem);
> +		}
> +		mirror = list_first_entry_or_null(&hmm->mirrors, struct hmm_mirror,
> +						  list);
> +	}
> +	up_write(&hmm->mirrors_sem);
> +}
> +

Hi Jerome,

This looks good (and the list handling is way better than my horrible 
copy-the-list idea)!

Reviewed-by: John Hubbard <jhubbard@nvidia.com>

thanks,
-- 
John Hubbard
NVIDIA

>  static void hmm_invalidate_range_start(struct mmu_notifier *mn,
>  				       struct mm_struct *mm,
>  				       unsigned long start,
> @@ -185,6 +211,7 @@ static void hmm_invalidate_range_end(struct mmu_notifier *mn,
>  }
>  
>  static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
> +	.release		= hmm_release,
>  	.invalidate_range_start	= hmm_invalidate_range_start,
>  	.invalidate_range_end	= hmm_invalidate_range_end,
>  };
> @@ -230,7 +257,7 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror)
>  	struct hmm *hmm = mirror->hmm;
>  
>  	down_write(&hmm->mirrors_sem);
> -	list_del(&mirror->list);
> +	list_del_init(&mirror->list);
>  	up_write(&hmm->mirrors_sem);
>  }
>  EXPORT_SYMBOL(hmm_mirror_unregister);
> 

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 04/15] mm/hmm: unregister mmu_notifier when last HMM client quit v3
  2018-03-22  1:30       ` jglisse
  (?)
@ 2018-03-22 22:36       ` Andrew Morton
  -1 siblings, 0 replies; 99+ messages in thread
From: Andrew Morton @ 2018-03-22 22:36 UTC (permalink / raw)
  To: jglisse
  Cc: linux-mm, linux-kernel, Evgeny Baskakov, Ralph Campbell,
	Mark Hairgrove, John Hubbard

On Wed, 21 Mar 2018 21:30:25 -0400 jglisse@redhat.com wrote:

> This code was lost in translation at one point. This properly call
> mmu_notifier_unregister_no_release() once last user is gone. This
> fix the zombie mm_struct as without this patch we do not drop the
> refcount we have on it.

OK, I'm officially all messed up and confused ;) I'll await the next
full resend, please.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 04/15] mm/hmm: unregister mmu_notifier when last HMM client quit v2
  2018-03-21 23:41         ` Jerome Glisse
@ 2018-03-22 22:47           ` John Hubbard
  -1 siblings, 0 replies; 99+ messages in thread
From: John Hubbard @ 2018-03-22 22:47 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, Andrew Morton, linux-kernel, Evgeny Baskakov,
	Ralph Campbell, Mark Hairgrove

On 03/21/2018 04:41 PM, Jerome Glisse wrote:
> On Wed, Mar 21, 2018 at 04:22:49PM -0700, John Hubbard wrote:
>> On 03/21/2018 11:16 AM, jglisse@redhat.com wrote:
>>> From: Jérôme Glisse <jglisse@redhat.com>
>>>
>>> This code was lost in translation at one point. This properly call
>>> mmu_notifier_unregister_no_release() once last user is gone. This
>>> fix the zombie mm_struct as without this patch we do not drop the
>>> refcount we have on it.
>>>
>>> Changed since v1:
>>>   - close race window between a last mirror unregistering and a new
>>>     mirror registering, which could have lead to use after free()
>>>     kind of bug
>>>
>>> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
>>> Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>> Cc: Mark Hairgrove <mhairgrove@nvidia.com>
>>> Cc: John Hubbard <jhubbard@nvidia.com>
>>> ---
>>>  mm/hmm.c | 35 +++++++++++++++++++++++++++++++++--
>>>  1 file changed, 33 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/hmm.c b/mm/hmm.c
>>> index 6088fa6ed137..f75aa8df6e97 100644
>>> --- a/mm/hmm.c
>>> +++ b/mm/hmm.c
>>> @@ -222,13 +222,24 @@ int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
>>>  	if (!mm || !mirror || !mirror->ops)
>>>  		return -EINVAL;
>>>  
>>> +again:
>>>  	mirror->hmm = hmm_register(mm);
>>>  	if (!mirror->hmm)
>>>  		return -ENOMEM;
>>>  
>>>  	down_write(&mirror->hmm->mirrors_sem);
>>> -	list_add(&mirror->list, &mirror->hmm->mirrors);
>>> -	up_write(&mirror->hmm->mirrors_sem);
>>> +	if (mirror->hmm->mm == NULL) {
>>> +		/*
>>> +		 * A racing hmm_mirror_unregister() is about to destroy the hmm
>>> +		 * struct. Try again to allocate a new one.
>>> +		 */
>>> +		up_write(&mirror->hmm->mirrors_sem);
>>> +		mirror->hmm = NULL;
>>
>> This is being set outside of locks, so now there is another race with
>> another hmm_mirror_register...
>>
>> I'll take a moment and draft up what I have in mind here, which is a more
>> symmetrical locking scheme for these routines.
>>
> 
> No this code is correct. hmm->mm is set after hmm struct is allocated
> and before it is public so no one can race with that. It is clear in
> hmm_mirror_unregister() under the write lock hence checking it here
> under that same lock is correct.

Are you implying that code that calls hmm_mirror_register() should do
its own locking, to prevent simultaneous calls to that function? Because
as things are right now, multiple threads can arrive at this point. The
fact that mirror->hmm is not "public" is irrelevant; what matters is that
more than one thread can change it simultaneously.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 04/15] mm/hmm: unregister mmu_notifier when last HMM client quit v2
  2018-03-22 22:47           ` John Hubbard
@ 2018-03-22 23:37             ` Jerome Glisse
  -1 siblings, 0 replies; 99+ messages in thread
From: Jerome Glisse @ 2018-03-22 23:37 UTC (permalink / raw)
  To: John Hubbard
  Cc: linux-mm, Andrew Morton, linux-kernel, Evgeny Baskakov,
	Ralph Campbell, Mark Hairgrove

On Thu, Mar 22, 2018 at 03:47:16PM -0700, John Hubbard wrote:
> On 03/21/2018 04:41 PM, Jerome Glisse wrote:
> > On Wed, Mar 21, 2018 at 04:22:49PM -0700, John Hubbard wrote:
> >> On 03/21/2018 11:16 AM, jglisse@redhat.com wrote:
> >>> From: Jérôme Glisse <jglisse@redhat.com>
> >>>
> >>> This code was lost in translation at one point. This properly call
> >>> mmu_notifier_unregister_no_release() once last user is gone. This
> >>> fix the zombie mm_struct as without this patch we do not drop the
> >>> refcount we have on it.
> >>>
> >>> Changed since v1:
> >>>   - close race window between a last mirror unregistering and a new
> >>>     mirror registering, which could have lead to use after free()
> >>>     kind of bug
> >>>
> >>> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> >>> Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
> >>> Cc: Ralph Campbell <rcampbell@nvidia.com>
> >>> Cc: Mark Hairgrove <mhairgrove@nvidia.com>
> >>> Cc: John Hubbard <jhubbard@nvidia.com>
> >>> ---
> >>>  mm/hmm.c | 35 +++++++++++++++++++++++++++++++++--
> >>>  1 file changed, 33 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/mm/hmm.c b/mm/hmm.c
> >>> index 6088fa6ed137..f75aa8df6e97 100644
> >>> --- a/mm/hmm.c
> >>> +++ b/mm/hmm.c
> >>> @@ -222,13 +222,24 @@ int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
> >>>  	if (!mm || !mirror || !mirror->ops)
> >>>  		return -EINVAL;
> >>>  
> >>> +again:
> >>>  	mirror->hmm = hmm_register(mm);
> >>>  	if (!mirror->hmm)
> >>>  		return -ENOMEM;
> >>>  
> >>>  	down_write(&mirror->hmm->mirrors_sem);
> >>> -	list_add(&mirror->list, &mirror->hmm->mirrors);
> >>> -	up_write(&mirror->hmm->mirrors_sem);
> >>> +	if (mirror->hmm->mm == NULL) {
> >>> +		/*
> >>> +		 * A racing hmm_mirror_unregister() is about to destroy the hmm
> >>> +		 * struct. Try again to allocate a new one.
> >>> +		 */
> >>> +		up_write(&mirror->hmm->mirrors_sem);
> >>> +		mirror->hmm = NULL;
> >>
> >> This is being set outside of locks, so now there is another race with
> >> another hmm_mirror_register...
> >>
> >> I'll take a moment and draft up what I have in mind here, which is a more
> >> symmetrical locking scheme for these routines.
> >>
> > 
> > No this code is correct. hmm->mm is set after hmm struct is allocated
> > and before it is public so no one can race with that. It is clear in
> > hmm_mirror_unregister() under the write lock hence checking it here
> > under that same lock is correct.
> 
> Are you implying that code that calls hmm_mirror_register() should do 
> it's own locking, to prevent simultaneous calls to that function? Because
> as things are right now, multiple threads can arrive at this point. The
> fact that mirror->hmm is not "public" is irrelevant; what matters is that
> >1 thread can change it simultaneously.

The content of struct hmm_mirror should not be modified by code outside
HMM after hmm_mirror_register() and before hmm_mirror_unregister(). It
is a structure private to HMM and the driver should not touch it, ie it
should be considered read-only/const from the driver's point of view.

It is also expected (which was obvious to me) that a driver calls
hmm_mirror_register() once and only once, and hmm_mirror_unregister()
once and only once, for any given hmm_mirror struct. Note that a driver
can register multiple _different_ mirror structs to the same mm or to
different mms.

There is no need for locking on the driver side whatsoever as long as
the above rules are respected. I am puzzled that they were not obvious :)

Note that the above rule means that for any given struct hmm_mirror
there can be one and only one call to hmm_mirror_register(), with no
concurrent calls. If you are doing the latter then something is
seriously wrong in your design.
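
For illustration, here is a minimal driver-side sketch that follows those
rules (the my_* names are made up for this example; only struct hmm_mirror,
struct hmm_mirror_ops, hmm_mirror_register() and hmm_mirror_unregister()
are existing HMM API):

#include <linux/hmm.h>

/* One mirror per driver context, registered exactly once. */
struct my_ctx {
	struct hmm_mirror mirror;	/* private to HMM once registered */
	/* ... other driver state ... */
};

static const struct hmm_mirror_ops my_mirror_ops = {
	/* .sync_cpu_device_pagetables (and .release) set up by the driver */
};

static int my_ctx_mirror_init(struct my_ctx *ctx, struct mm_struct *mm)
{
	ctx->mirror.ops = &my_mirror_ops;
	/* Only call site for this mirror, so no extra locking is needed. */
	return hmm_mirror_register(&ctx->mirror, mm);
}

static void my_ctx_mirror_fini(struct my_ctx *ctx)
{
	/* Matching, single unregister for the same mirror. */
	hmm_mirror_unregister(&ctx->mirror);
}

The point of the design is that registration has exactly one call site per
mirror, so the register path itself never races on the same hmm_mirror.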

So, to be clear: on which variable are you claiming a race?
  mirror->hmm ?
  mirror->hmm->mm, which is really hmm->mm (the mirror part does not matter) ?

I will hold resending v4 until tomorrow morning (eastern time) so that
you can convince yourself that this code is right or prove me wrong.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 04/15] mm/hmm: unregister mmu_notifier when last HMM client quit v2
  2018-03-22 23:37             ` Jerome Glisse
@ 2018-03-23  0:13               ` John Hubbard
  -1 siblings, 0 replies; 99+ messages in thread
From: John Hubbard @ 2018-03-23  0:13 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, Andrew Morton, linux-kernel, Evgeny Baskakov,
	Ralph Campbell, Mark Hairgrove

On 03/22/2018 04:37 PM, Jerome Glisse wrote:
> On Thu, Mar 22, 2018 at 03:47:16PM -0700, John Hubbard wrote:
>> On 03/21/2018 04:41 PM, Jerome Glisse wrote:
>>> On Wed, Mar 21, 2018 at 04:22:49PM -0700, John Hubbard wrote:
>>>> On 03/21/2018 11:16 AM, jglisse@redhat.com wrote:
>>>>> From: Jérôme Glisse <jglisse@redhat.com>

<snip>

>>>
>>> No this code is correct. hmm->mm is set after hmm struct is allocated
>>> and before it is public so no one can race with that. It is clear in
>>> hmm_mirror_unregister() under the write lock hence checking it here
>>> under that same lock is correct.
>>
>> Are you implying that code that calls hmm_mirror_register() should do 
>> it's own locking, to prevent simultaneous calls to that function? Because
>> as things are right now, multiple threads can arrive at this point. The
>> fact that mirror->hmm is not "public" is irrelevant; what matters is that
>>> 1 thread can change it simultaneously.
> 
> The content of struct hmm_mirror should not be modified by code outside
> HMM after hmm_mirror_register() and before hmm_mirror_unregister(). This
> is a private structure to HMM and the driver should not touch it, ie it
> should be considered as read only/const from driver code point of view.

Yes, that point is clear and obvious.

> 
> It is also expected (which was obvious to me) that driver only call once
> and only once hmm_mirror_register(), and only once hmm_mirror_unregister()
> for any given hmm_mirror struct. Note that driver can register multiple
> _different_ mirror struct to same mm or differents mm.
> 
> There is no need of locking on the driver side whatsoever as long as the
> above rules are respected. I am puzzle if they were not obvious :)

Those rules were not obvious. It's unusual to claim that register and unregister
can run concurrently, but register and register cannot. Let's please document
the rules a bit in the comments.

> 
> Note that the above rule means that for any given struct hmm_mirror their
> can only be one and only one call to hmm_mirror_register() happening, no
> concurrent call. If you are doing the latter then something is seriously
> wrong in your design.
> 
> So to be clear on what variable are you claiming race ?
>   mirror->hmm ?
>   mirror->hmm->mm which is really hmm->mm (mirror part does not matter) ?
> 
> I will hold resending v4 until tomorrow morning (eastern time) so that
> you can convince yourself that this code is right or prove me wrong.

No need to wait. The documentation request above is a minor point, and
we're OK with you resending v4 whenever you're ready.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 04/15] mm/hmm: unregister mmu_notifier when last HMM client quit v2
  2018-03-23  0:13               ` John Hubbard
@ 2018-03-23  0:50                 ` Jerome Glisse
  -1 siblings, 0 replies; 99+ messages in thread
From: Jerome Glisse @ 2018-03-23  0:50 UTC (permalink / raw)
  To: John Hubbard
  Cc: linux-mm, Andrew Morton, linux-kernel, Evgeny Baskakov,
	Ralph Campbell, Mark Hairgrove

On Thu, Mar 22, 2018 at 05:13:14PM -0700, John Hubbard wrote:
> On 03/22/2018 04:37 PM, Jerome Glisse wrote:
> > On Thu, Mar 22, 2018 at 03:47:16PM -0700, John Hubbard wrote:
> >> On 03/21/2018 04:41 PM, Jerome Glisse wrote:
> >>> On Wed, Mar 21, 2018 at 04:22:49PM -0700, John Hubbard wrote:
> >>>> On 03/21/2018 11:16 AM, jglisse@redhat.com wrote:
> >>>>> From: Jérôme Glisse <jglisse@redhat.com>
> 
> <snip>
> 
> >>>
> >>> No this code is correct. hmm->mm is set after hmm struct is allocated
> >>> and before it is public so no one can race with that. It is clear in
> >>> hmm_mirror_unregister() under the write lock hence checking it here
> >>> under that same lock is correct.
> >>
> >> Are you implying that code that calls hmm_mirror_register() should do 
> >> it's own locking, to prevent simultaneous calls to that function? Because
> >> as things are right now, multiple threads can arrive at this point. The
> >> fact that mirror->hmm is not "public" is irrelevant; what matters is that
> >>> 1 thread can change it simultaneously.
> > 
> > The content of struct hmm_mirror should not be modified by code outside
> > HMM after hmm_mirror_register() and before hmm_mirror_unregister(). This
> > is a private structure to HMM and the driver should not touch it, ie it
> > should be considered as read only/const from driver code point of view.
> 
> Yes, that point is clear and obvious.
> 
> > 
> > It is also expected (which was obvious to me) that driver only call once
> > and only once hmm_mirror_register(), and only once hmm_mirror_unregister()
> > for any given hmm_mirror struct. Note that driver can register multiple
> > _different_ mirror struct to same mm or differents mm.
> > 
> > There is no need of locking on the driver side whatsoever as long as the
> > above rules are respected. I am puzzle if they were not obvious :)
> 
> Those rules were not obvious. It's unusual to claim that register and unregister
> can run concurrently, but regiser and register cannot. Let's please document
> the rules a bit in the comments.

I am really surprised this was not obvious. All existing _register APIs
in the kernel follow this. You register something once only, and doing
it twice for the same structure (ie the same unique struct hmm_mirror
*mirror pointer value) leads to serious bugs, whether done concurrently
or not.

For instance, if you call mmu_notifier_register() twice (concurrently
or not) with the same pointer value for struct mmu_notifier *mn then
bad things will happen. The same goes for driver_register(), but that
one actually has a sanity check and complains loudly if it happens. I
doubt there is a single *_register/unregister() in the kernel that does
not follow this.

Note that doing register/unregister concurrently for the same unique
hmm_mirror struct is also illegal. However, concurrent register and
unregister of different hmm_mirror structs is legal, and this is the
reason for the races we were discussing.
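
Or, as a made-up sketch (assuming a driver context struct my_ctx that
embeds its struct hmm_mirror as the 'mirror' member; return values
ignored for brevity):

/* Two *different* mirrors: these two threads may run concurrently. */
static void thread_a(struct my_ctx *ctx_a, struct mm_struct *mm)
{
	hmm_mirror_register(&ctx_a->mirror, mm);
}

static void thread_b(struct my_ctx *ctx_b, struct mm_struct *mm)
{
	hmm_mirror_register(&ctx_b->mirror, mm);
}

/* The *same* mirror registered twice, concurrently or not: driver bug. */
static void buggy(struct my_ctx *ctx, struct mm_struct *mm)
{
	hmm_mirror_register(&ctx->mirror, mm);
	hmm_mirror_register(&ctx->mirror, mm);
}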

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 04/15] mm/hmm: unregister mmu_notifier when last HMM client quit v2
  2018-03-23  0:50                 ` Jerome Glisse
@ 2018-03-23  0:56                   ` John Hubbard
  -1 siblings, 0 replies; 99+ messages in thread
From: John Hubbard @ 2018-03-23  0:56 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, Andrew Morton, linux-kernel, Evgeny Baskakov,
	Ralph Campbell, Mark Hairgrove

On 03/22/2018 05:50 PM, Jerome Glisse wrote:
> On Thu, Mar 22, 2018 at 05:13:14PM -0700, John Hubbard wrote:
>> On 03/22/2018 04:37 PM, Jerome Glisse wrote:
>>> On Thu, Mar 22, 2018 at 03:47:16PM -0700, John Hubbard wrote:
>>>> On 03/21/2018 04:41 PM, Jerome Glisse wrote:
>>>>> On Wed, Mar 21, 2018 at 04:22:49PM -0700, John Hubbard wrote:
>>>>>> On 03/21/2018 11:16 AM, jglisse@redhat.com wrote:
>>>>>>> From: Jérôme Glisse <jglisse@redhat.com>
>>
>> <snip>
>>
>>>>>
>>>>> No this code is correct. hmm->mm is set after hmm struct is allocated
>>>>> and before it is public so no one can race with that. It is clear in
>>>>> hmm_mirror_unregister() under the write lock hence checking it here
>>>>> under that same lock is correct.
>>>>
>>>> Are you implying that code that calls hmm_mirror_register() should do 
>>>> it's own locking, to prevent simultaneous calls to that function? Because
>>>> as things are right now, multiple threads can arrive at this point. The
>>>> fact that mirror->hmm is not "public" is irrelevant; what matters is that
>>>>> 1 thread can change it simultaneously.
>>>
>>> The content of struct hmm_mirror should not be modified by code outside
>>> HMM after hmm_mirror_register() and before hmm_mirror_unregister(). This
>>> is a private structure to HMM and the driver should not touch it, ie it
>>> should be considered as read only/const from driver code point of view.
>>
>> Yes, that point is clear and obvious.
>>
>>>
>>> It is also expected (which was obvious to me) that driver only call once
>>> and only once hmm_mirror_register(), and only once hmm_mirror_unregister()
>>> for any given hmm_mirror struct. Note that driver can register multiple
>>> _different_ mirror struct to same mm or differents mm.
>>>
>>> There is no need of locking on the driver side whatsoever as long as the
>>> above rules are respected. I am puzzle if they were not obvious :)
>>
>> Those rules were not obvious. It's unusual to claim that register and unregister
>> can run concurrently, but regiser and register cannot. Let's please document
>> the rules a bit in the comments.
> 
> I am really surprise this was not obvious. All existing _register API
> in the kernel follow this. You register something once only and doing
> it twice for same structure (ie unique struct hmm_mirror *mirror pointer
> value) leads to serious bugs (doing so concurently or not).
> 
> For instance if you call mmu_notifier_register() twice (concurrently
> or not) with same pointer value for struct mmu_notifier *mn then bad
> thing will happen. Same for driver_register() but this one actualy
> have sanity check and complain loudly if that happens. I doubt there
> is any single *_register/unregister() in the kernel that does not
> follow this.

OK, then I guess no need to document it. In any case we know what to
expect here, so no problem there.

thanks,
-- 
John Hubbard
NVIDIA

> 
> Note that doing register/unregister concurrently for the same unique
> hmm_mirror struct is also illegal. However concurrent register and
> unregister of different hmm_mirror struct is legal and this is the
> reasons for races we were discussing.
> 
> Cheers,
> Jérôme
> 

^ permalink raw reply	[flat|nested] 99+ messages in thread

* [PATCH 07/15] mm/hmm: remove HMM_PFN_READ flag and ignore peculiar architecture v2
  2018-03-23  0:55 [PATCH 00/15] hmm: fixes and documentations v4 jglisse
@ 2018-03-23  0:55   ` jglisse
  0 siblings, 0 replies; 99+ messages in thread
From: jglisse @ 2018-03-23  0:55 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Evgeny Baskakov, Ralph Campbell, Mark Hairgrove

From: Jérôme Glisse <jglisse@redhat.com>

Only peculiar architectures allow write without read, thus assume that
any valid pfn also allows read. Note that we do not care about write-only
because it does not make sense with things like atomic compare and
exchange, or any other operation that lets you get the memory value
through it.
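
With this change a driver consuming a snapshot only has to test
HMM_PFN_VALID and HMM_PFN_WRITE. A rough sketch of such a loop, assuming
range is the struct hmm_range handed to hmm_vma_get_pfns() and the
my_map_*() helpers stand in for whatever the driver does with the page:

	unsigned long i, npages = (range->end - range->start) >> PAGE_SHIFT;

	for (i = 0; i < npages; i++) {
		hmm_pfn_t entry = range->pfns[i];

		if (!(entry & HMM_PFN_VALID))
			continue;	/* empty, error, special, ... */

		/* HMM_PFN_VALID now implies at least read permission. */
		my_map_read(hmm_pfn_t_to_page(entry));
		if (entry & HMM_PFN_WRITE)
			my_map_write(hmm_pfn_t_to_page(entry));
	}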

Changed since v1:
  - Fail early on vma without read permission and return an error
  - Improve comments

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
---
 include/linux/hmm.h | 16 +++++++---------
 mm/hmm.c            | 44 ++++++++++++++++++++++++++++++++++----------
 2 files changed, 41 insertions(+), 19 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index d0d6760cdada..dd907f614dfe 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -83,8 +83,7 @@ struct hmm;
  * hmm_pfn_t - HMM uses its own pfn type to keep several flags per page
  *
  * Flags:
- * HMM_PFN_VALID: pfn is valid
- * HMM_PFN_READ:  CPU page table has read permission set
+ * HMM_PFN_VALID: pfn is valid. It has, at least, read permission.
  * HMM_PFN_WRITE: CPU page table has write permission set
  * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned memory
  * HMM_PFN_EMPTY: corresponding CPU page table entry is pte_none()
@@ -97,13 +96,12 @@ struct hmm;
 typedef unsigned long hmm_pfn_t;
 
 #define HMM_PFN_VALID (1 << 0)
-#define HMM_PFN_READ (1 << 1)
-#define HMM_PFN_WRITE (1 << 2)
-#define HMM_PFN_ERROR (1 << 3)
-#define HMM_PFN_EMPTY (1 << 4)
-#define HMM_PFN_SPECIAL (1 << 5)
-#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 6)
-#define HMM_PFN_SHIFT 7
+#define HMM_PFN_WRITE (1 << 1)
+#define HMM_PFN_ERROR (1 << 2)
+#define HMM_PFN_EMPTY (1 << 3)
+#define HMM_PFN_SPECIAL (1 << 4)
+#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 5)
+#define HMM_PFN_SHIFT 6
 
 /*
  * hmm_pfn_t_to_page() - return struct page pointed to by a valid hmm_pfn_t
diff --git a/mm/hmm.c b/mm/hmm.c
index fc5057d7aa05..5da0f852a7aa 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -417,11 +417,9 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 	hmm_pfn_t *pfns = range->pfns;
 	unsigned long addr = start, i;
 	bool write_fault;
-	hmm_pfn_t flag;
 	pte_t *ptep;
 
 	i = (addr - range->start) >> PAGE_SHIFT;
-	flag = vma->vm_flags & VM_READ ? HMM_PFN_READ : 0;
 	write_fault = hmm_vma_walk->fault & hmm_vma_walk->write;
 
 again:
@@ -433,6 +431,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 
 	if (pmd_devmap(*pmdp) || pmd_trans_huge(*pmdp)) {
 		unsigned long pfn;
+		hmm_pfn_t flag = 0;
 		pmd_t pmd;
 
 		/*
@@ -497,7 +496,6 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 				} else if (write_fault)
 					goto fault;
 				pfns[i] |= HMM_PFN_DEVICE_UNADDRESSABLE;
-				pfns[i] |= flag;
 			} else if (is_migration_entry(entry)) {
 				if (hmm_vma_walk->fault) {
 					pte_unmap(ptep);
@@ -517,7 +515,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		if (write_fault && !pte_write(pte))
 			goto fault;
 
-		pfns[i] = hmm_pfn_t_from_pfn(pte_pfn(pte)) | flag;
+		pfns[i] = hmm_pfn_t_from_pfn(pte_pfn(pte));
 		pfns[i] |= pte_write(pte) ? HMM_PFN_WRITE : 0;
 		continue;
 
@@ -534,7 +532,8 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 /*
  * hmm_vma_get_pfns() - snapshot CPU page table for a range of virtual addresses
  * @range: range being snapshotted
- * Returns: -EINVAL if invalid argument, -ENOMEM out of memory, 0 success
+ * Returns: -EINVAL if invalid argument, -ENOMEM out of memory, -EPERM invalid
+ *          vma permission, 0 success
  *
  * This snapshots the CPU page table for a range of virtual addresses. Snapshot
  * validity is tracked by range struct. See hmm_vma_range_done() for further
@@ -573,6 +572,17 @@ int hmm_vma_get_pfns(struct hmm_range *range)
 	if (!hmm->mmu_notifier.ops)
 		return -EINVAL;
 
+	if (!(vma->vm_flags & VM_READ)) {
+		/*
+		 * If the vma does not allow read access, then assume that it
+		 * does not allow write access either. Architectures that allow
+		 * write without read access are not supported by HMM, because
+		 * operations such as atomic access would not work.
+		 */
+		hmm_pfns_clear(range->pfns, range->start, range->end);
+		return -EPERM;
+	}
+
 	/* Initialize range to track CPU page table update */
 	spin_lock(&hmm->lock);
 	range->valid = true;
@@ -686,6 +696,9 @@ EXPORT_SYMBOL(hmm_vma_range_done);
  *     goto retry;
  *   case 0:
  *     break;
+ *   case -ENOMEM:
+ *   case -EINVAL:
+ *   case -EPERM:
  *   default:
  *     // Handle error !
  *     up_read(&mm->mmap_sem)
@@ -727,11 +740,16 @@ int hmm_vma_fault(struct hmm_range *range, bool write, bool block)
 	if (!hmm->mmu_notifier.ops)
 		return -EINVAL;
 
-	/* Initialize range to track CPU page table update */
-	spin_lock(&hmm->lock);
-	range->valid = true;
-	list_add_rcu(&range->list, &hmm->ranges);
-	spin_unlock(&hmm->lock);
+	if (!(vma->vm_flags & VM_READ)) {
+		/*
+		 * If the vma does not allow read access, then assume that it
+		 * does not allow write access either. Architectures that allow
+		 * write without read access are not supported by HMM, because
+		 * operations such as atomic access would not work.
+		 */
+		hmm_pfns_clear(range->pfns, range->start, range->end);
+		return -EPERM;
+	}
 
 	/* FIXME support hugetlb fs */
 	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
@@ -739,6 +757,12 @@ int hmm_vma_fault(struct hmm_range *range, bool write, bool block)
 		return 0;
 	}
 
+	/* Initialize range to track CPU page table update */
+	spin_lock(&hmm->lock);
+	range->valid = true;
+	list_add_rcu(&range->list, &hmm->ranges);
+	spin_unlock(&hmm->lock);
+
 	hmm_vma_walk.fault = true;
 	hmm_vma_walk.write = write;
 	hmm_vma_walk.block = block;
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 99+ messages in thread
