* Re: [PATCH v11 05/15] HMM: introduce heterogeneous memory management v5.
  2015-10-21 21:00   ` Jérôme Glisse
@ 2015-10-21 20:18     ` Randy Dunlap
  -1 siblings, 0 replies; 42+ messages in thread
From: Randy Dunlap @ 2015-10-21 20:18 UTC (permalink / raw)
  To: Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jatin Kumar

On 10/21/15 14:00, Jérôme Glisse wrote:

> diff --git a/mm/Kconfig b/mm/Kconfig
> index 0d9fdcd..10ed2de 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -680,3 +680,15 @@ config ZONE_DEVICE
>  
>  config FRAME_VECTOR
>  	bool
> +
> +config HMM
> +	bool "Enable heterogeneous memory management (HMM)"
> +	depends on MMU
> +	select MMU_NOTIFIER
> +	default n
> +	help
> +	  Heterogeneous memory management provide infrastructure for a device

	                                  provides

> +	  to mirror a process address space into an hardware mmu or into any

	                                    into a hardware MMU

> +	  things supporting pagefault like event.
> +
> +	  If unsure, say N to disable hmm.

	                              HMM.
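
With those fixes folded in (and the "pagefault like event" wording tidied
up as well), the help text would read something like:

config HMM
	bool "Enable heterogeneous memory management (HMM)"
	depends on MMU
	select MMU_NOTIFIER
	default n
	help
	  Heterogeneous memory management provides infrastructure for a device
	  to mirror a process address space into a hardware MMU or into
	  anything supporting pagefault-like events.

	  If unsure, say N to disable HMM.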


-- 
~Randy


* [PATCH v11 00/15] HMM (Heterogeneous Memory Management)
@ 2015-10-21 20:59 ` Jérôme Glisse
  0 siblings, 0 replies; 42+ messages in thread
From: Jérôme Glisse @ 2015-10-21 20:59 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Linda Wang, Kevin E Martin, Jeff Law,
	Or Gerlitz, Sagi Grimberg, Aneesh Kumar K.V

Minor fixes since the last post (1); applies on top of 4.3-rc6.
Please consider applying. Tree with the patchset:

git://people.freedesktop.org/~glisse/linux hmm-v11 branch

HMM (Heterogeneous Memory Management) is a helper layer for device
drivers; its main features are:
   - Shadow the CPU page table of a process into a device-specific
     page table format and keep both page tables synchronized.
   - Handle DMA mapping of system RAM pages on behalf of the device
     (for shadowed page table entries).
   - Migrate private anonymous memory to private device memory
     and handle CPU page faults (which trigger a migration back
     to system memory so the CPU can access it).

Benefits of HMM:
   - Avoids the current model where device drivers have to pin pages,
     which blocks several kernel features (KSM, migration, ...).
   - No impact on existing workloads that do not use HMM (it only
     adds a couple more if()s to common code paths).
   - Intended as common infrastructure for several different kinds
     of hardware, as of today Mellanox and NVidia.
   - Allows userspace APIs to move away from the explicit copy code
     path, where the application programmer has to manually manage
     memcpy to and from device memory (see the illustration after
     this list).
   - Transparent to userspace, for instance allowing a library to
     use the GPU without involving the application linked against it.
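
To make the last two points concrete, here is roughly what the userspace
side looks like with and without such support, using OpenCL as the example
API (an illustrative sketch only, error handling omitted; the SVM path
assumes fine-grained system SVM support underneath, i.e. HMM or equivalent
platform hardware):

#include <CL/cl.h>

/* Explicit copy model: the programmer stages every transfer by hand. */
static void run_with_copies(cl_context ctx, cl_command_queue q,
                            cl_kernel k, void *host_ptr, size_t nbytes)
{
        size_t gws = nbytes;    /* one work-item per byte, example only */
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, nbytes, NULL, NULL);

        clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, nbytes, host_ptr, 0, NULL, NULL);
        clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
        clEnqueueNDRangeKernel(q, k, 1, NULL, &gws, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, nbytes, host_ptr, 0, NULL, NULL);
        clReleaseMemObject(buf);
}

/* Fine-grained system SVM: the same malloc()ed pointer is visible to both
 * CPU and GPU, so there are no staging copies to manage by hand. */
static void run_with_svm(cl_command_queue q, cl_kernel k,
                         void *host_ptr, size_t nbytes)
{
        size_t gws = nbytes;    /* one work-item per byte, example only */

        clSetKernelArgSVMPointer(k, 0, host_ptr);
        clEnqueueNDRangeKernel(q, k, 1, NULL, &gws, NULL, 0, NULL, NULL);
        clFinish(q);
}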

I expect other hardware companies to express interest in HMM and
eventually start using it with their new hardware. A more in-depth
motivation follows the change log.


Change log:

v11:
  - Fix PROT_NONE case.
  - Fix missing page table walk callback.
  - Add support for hugetlbfs.

v10:
  - Minor fixes here and there.

v9:
  - Added new device driver helpers.
  - Added documentation.
  - Improved page table code clarity (minor architectural changes
    and better names).

v8:
  - Removed currently unused fence code.
  - Added DMA mapping on behalf of device.

v7:
  - Redid and simplified the page table code to match Linus' suggestion
    http://article.gmane.org/gmane.linux.kernel.mm/125257

... Lost in translation ...


Why do this?

Mirroring a process address space is mandatory with OpenCL 2.0 and
with other GPU compute APIs. OpenCL 2.0 allows different levels of
implementation and currently only the lowest two are supported on
Linux. To implement the highest level, where CPU and GPU accesses
can happen concurrently and are cache coherent, HMM is needed, or
something providing the same functionality, for instance through
platform hardware.

Hardware solutions such as PCIe ATS/PASID are limited to mirroring
system memory and do not provide a way to migrate memory to device
memory (which offers significantly more bandwidth, up to 10 times
faster than regular system memory with a discrete GPU, and also has
lower latency than PCIe transactions).

Current CPUs with a GPU on the same die (AMD or Intel) use ATS/PASID,
and in Intel's case a special level of cache (backed by a large pool
of fast memory).

For the foreseeable future, discrete GPUs will remain relevant as
they can have a larger quantity of faster memory than integrated
GPUs.

Thus we believe HMM will allow us to leverage discrete GPU memory in
a fashion transparent to the application, with minimal disruption to
the Linux kernel mm code. HMM can also work alongside hardware
solutions such as PCIe ATS/PASID (leaving the regular case to
ATS/PASID while HMM handles the migrated memory case).


Design:

Patches 1, 2, 3 and 4 augment the mmu_notifier API with new
information to more efficiently mirror CPU page table updates.

The first side of HMM, process address space mirroring, is
implemented in patches 5 through 14. This uses a secondary page
table, in which HMM mirrors memory actively used by the device.
HMM does not take a reference on any of the pages; it uses the
mmu_notifier API to track changes to the CPU page table and to
update the mirror page table, all while providing a simple API to
device drivers.
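
As a rough sketch of what the driver side of this looks like (the my_*
names below are placeholders for this illustration, not HMM's actual
helpers), a mirror is tied to a process with the core mmu_notifier API
and simply drops its shadowed entries when the core mm invalidates a
range:

/* Illustration only: not the real HMM driver-facing API. */
struct my_mirror {
	struct mmu_notifier mn;
	/* device page table, locks, ... */
};

/* Device-specific, placeholder for this sketch. */
static void my_device_pt_clear(struct my_mirror *m,
			       unsigned long start, unsigned long end);

static void my_invalidate_range_start(struct mmu_notifier *mn,
				      struct mm_struct *mm,
				      unsigned long start,
				      unsigned long end,
				      enum mmu_event event)
{
	struct my_mirror *m = container_of(mn, struct my_mirror, mn);

	/* Clear the shadowed range; the device refaults it later. */
	my_device_pt_clear(m, start, end);
}

static const struct mmu_notifier_ops my_mirror_ops = {
	.invalidate_range_start = my_invalidate_range_start,
};

static int my_mirror_register(struct my_mirror *m, struct mm_struct *mm)
{
	m->mn.ops = &my_mirror_ops;
	return mmu_notifier_register(&m->mn, mm);
}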

To implement this we use a "generic" page table and not a radix
tree, because we need to store more flags than the radix tree allows
and we need to store DMA addresses (sizeof(dma_addr_t) > sizeof(long)
on some platforms).
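
For illustration, each mirrored entry conceptually has to carry
something like the following (not the actual HMM layout), which does
not fit in a single radix tree slot (one unsigned long) on 32-bit
platforms with a 64-bit dma_addr_t:

struct mirror_entry_example {
	dma_addr_t	dma;	/* bus address of the backing page */
	unsigned long	flags;	/* valid, write, dirty, ... */
};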


(1) Previous patchset postings:
    v1 http://lwn.net/Articles/597289/
    v2 https://lkml.org/lkml/2014/6/12/559
    v3 https://lkml.org/lkml/2014/6/13/633
    v4 https://lkml.org/lkml/2014/8/29/423
    v5 https://lkml.org/lkml/2014/11/3/759
    v6 http://lwn.net/Articles/619737/
    v7 http://lwn.net/Articles/627316/
    v8 https://lwn.net/Articles/645515/
    v9 https://lwn.net/Articles/651553/
    v10 https://lwn.net/Articles/654430/

Cheers,
Jérôme

To: "Andrew Morton" <akpm@linux-foundation.org>,
To: <linux-kernel@vger.kernel.org>,
To: linux-mm <linux-mm@kvack.org>,
Cc: "Linus Torvalds" <torvalds@linux-foundation.org>,
Cc: "Mel Gorman" <mgorman@suse.de>,
Cc: "H. Peter Anvin" <hpa@zytor.com>,
Cc: "Peter Zijlstra" <peterz@infradead.org>,
Cc: "Linda Wang" <lwang@redhat.com>,
Cc: "Kevin E Martin" <kem@redhat.com>,
Cc: "Andrea Arcangeli" <aarcange@redhat.com>,
Cc: "Johannes Weiner" <jweiner@redhat.com>,
Cc: "Larry Woodman" <lwoodman@redhat.com>,
Cc: "Rik van Riel" <riel@redhat.com>,
Cc: "Dave Airlie" <airlied@redhat.com>,
Cc: "Jeff Law" <law@redhat.com>,
Cc: "Brendan Conoboy" <blc@redhat.com>,
Cc: "Joe Donohue" <jdonohue@redhat.com>,
Cc: "Christophe Harle" <charle@nvidia.com>,
Cc: "Duncan Poole" <dpoole@nvidia.com>,
Cc: "Sherry Cheung" <SCheung@nvidia.com>,
Cc: "Subhash Gutti" <sgutti@nvidia.com>,
Cc: "John Hubbard" <jhubbard@nvidia.com>,
Cc: "Mark Hairgrove" <mhairgrove@nvidia.com>,
Cc: "Lucien Dunning" <ldunning@nvidia.com>,
Cc: "Cameron Buschardt" <cabuschardt@nvidia.com>,
Cc: "Arvind Gopalakrishnan" <arvindg@nvidia.com>,
Cc: "Haggai Eran" <haggaie@mellanox.com>,
Cc: "Or Gerlitz" <ogerlitz@mellanox.com>,
Cc: "Sagi Grimberg" <sagig@mellanox.com>
Cc: "Shachar Raindel" <raindel@mellanox.com>,
Cc: "Liran Liss" <liranl@mellanox.com>,
Cc: "Roland Dreier" <roland@purestorage.com>,
Cc: "Sander, Ben" <ben.sander@amd.com>,
Cc: "Stoner, Greg" <Greg.Stoner@amd.com>,
Cc: "Bridgman, John" <John.Bridgman@amd.com>,
Cc: "Mantor, Michael" <Michael.Mantor@amd.com>,
Cc: "Blinzer, Paul" <Paul.Blinzer@amd.com>,
Cc: "Morichetti, Laurent" <Laurent.Morichetti@amd.com>,
Cc: "Deucher, Alexander" <Alexander.Deucher@amd.com>,
Cc: "Leonid Shamis" <Leonid.Shamis@amd.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>



* [PATCH v11 01/15] mmu_notifier: add event information to address invalidation v8
  2015-10-21 20:59 ` Jérôme Glisse
@ 2015-10-21 20:59   ` Jérôme Glisse
  -1 siblings, 0 replies; 42+ messages in thread
From: Jérôme Glisse @ 2015-10-21 20:59 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse

The event information will be useful to new users of the mmu_notifier API.
The event argument differentiates between a vma disappearing, a page
being write protected or simply a page being unmapped. This allows new
users to take different paths for different events; for instance, on unmap
the resources used to track a vma are still valid and should stay around,
while if the event says that a vma is being destroyed it means that any
resources used to track this vma can be freed.
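
For instance (illustrative only, the my_* names are placeholders and not
part of this patch), inside a listener's invalidate_range_start callback
the new argument can be used like this:

	switch (event) {
	case MMU_MUNMAP:
		/* Range is going away: per-vma tracking can be freed. */
		my_free_range_tracking(mirror, start, end);
		break;
	default:
		/* Pages change but the range stays valid: keep the
		 * tracking structures, only drop the stale mappings.
		 */
		my_clear_range(mirror, start, end);
		break;
	}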

Changed since v1:
  - renamed action into event (updated commit message too).
  - simplified the event names and clarified their usage,
    also documenting what expectations the listener can have
    with respect to each event.

Changed since v2:
  - Avoid crazy name.
  - Do not move code that does not need to move.

Changed since v3:
  - Separate huge page split from mlock/munlock and softdirty.

Changed since v4:
  - Rebase (no other changes).

Changed since v5:
  - Typo fix.
  - Changed zap_page_range from MMU_MUNMAP to MMU_MIGRATE to reflect
    the fact that the address range is still valid, just the pages
    backing it no longer are.

Changed since v6:
  - try_to_unmap_one() only invalidates when doing migration.
  - Differentiate fork from other cases.

Changed since v7:
  - Renamed MMU_HUGE_PAGE_SPLIT to MMU_HUGE_PAGE_SPLIT.
  - Renamed MMU_ISDIRTY to MMU_CLEAR_SOFT_DIRTY.
  - Renamed MMU_WRITE_PROTECT to MMU_KSM_WRITE_PROTECT.
  - English syntax fixes.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  |   3 +-
 drivers/gpu/drm/i915/i915_gem_userptr.c |   3 +-
 drivers/gpu/drm/radeon/radeon_mn.c      |   3 +-
 drivers/infiniband/core/umem_odp.c      |   9 ++-
 drivers/iommu/amd_iommu_v2.c            |   3 +-
 drivers/misc/sgi-gru/grutlbpurge.c      |   9 ++-
 drivers/xen/gntdev.c                    |   9 ++-
 fs/proc/task_mmu.c                      |   6 +-
 include/linux/mmu_notifier.h            | 132 ++++++++++++++++++++++++++------
 kernel/events/uprobes.c                 |  10 ++-
 mm/huge_memory.c                        |  33 +++++---
 mm/hugetlb.c                            |  23 +++---
 mm/ksm.c                                |  18 +++--
 mm/memory.c                             |  27 ++++---
 mm/migrate.c                            |   9 ++-
 mm/mmu_notifier.c                       |  28 ++++---
 mm/mprotect.c                           |   6 +-
 mm/mremap.c                             |   6 +-
 mm/rmap.c                               |   4 +-
 virt/kvm/kvm_main.c                     |  12 ++-
 20 files changed, 254 insertions(+), 99 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index b1969f2..7ca805c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -121,7 +121,8 @@ static void amdgpu_mn_release(struct mmu_notifier *mn,
 static void amdgpu_mn_invalidate_range_start(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
 					     unsigned long start,
-					     unsigned long end)
+					     unsigned long end,
+					     enum mmu_event event)
 {
 	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
 	struct interval_tree_node *it;
diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 8fd431b..adc5480 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -132,7 +132,8 @@ restart:
 static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 						       struct mm_struct *mm,
 						       unsigned long start,
-						       unsigned long end)
+						       unsigned long end,
+						       enum mmu_event event)
 {
 	struct i915_mmu_notifier *mn = container_of(_mn, struct i915_mmu_notifier, mn);
 	struct interval_tree_node *it = NULL;
diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c
index eef006c..3a9615b 100644
--- a/drivers/gpu/drm/radeon/radeon_mn.c
+++ b/drivers/gpu/drm/radeon/radeon_mn.c
@@ -121,7 +121,8 @@ static void radeon_mn_release(struct mmu_notifier *mn,
 static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
 					     unsigned long start,
-					     unsigned long end)
+					     unsigned long end,
+					     enum mmu_event event)
 {
 	struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
 	struct interval_tree_node *it;
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 40becdb..6ed69fa 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -165,7 +165,8 @@ static int invalidate_page_trampoline(struct ib_umem *item, u64 start,
 
 static void ib_umem_notifier_invalidate_page(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
-					     unsigned long address)
+					     unsigned long address,
+					     enum mmu_event event)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
 
@@ -192,7 +193,8 @@ static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
 static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
 						    unsigned long start,
-						    unsigned long end)
+						    unsigned long end,
+						    enum mmu_event event)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
 
@@ -217,7 +219,8 @@ static int invalidate_range_end_trampoline(struct ib_umem *item, u64 start,
 static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
 						  struct mm_struct *mm,
 						  unsigned long start,
-						  unsigned long end)
+						  unsigned long end,
+						  enum mmu_event event)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
 
diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 1131664..52f7d64 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -392,7 +392,8 @@ static int mn_clear_flush_young(struct mmu_notifier *mn,
 
 static void mn_invalidate_page(struct mmu_notifier *mn,
 			       struct mm_struct *mm,
-			       unsigned long address)
+			       unsigned long address,
+			       enum mmu_event event)
 {
 	__mn_flush_page(mn, address);
 }
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index 2129274..e67fed1 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -221,7 +221,8 @@ void gru_flush_all_tlb(struct gru_state *gru)
  */
 static void gru_invalidate_range_start(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end)
+				       unsigned long start, unsigned long end,
+				       enum mmu_event event)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
@@ -235,7 +236,8 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
 
 static void gru_invalidate_range_end(struct mmu_notifier *mn,
 				     struct mm_struct *mm, unsigned long start,
-				     unsigned long end)
+				     unsigned long end,
+				     enum mmu_event event)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
@@ -248,7 +250,8 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
 }
 
 static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
-				unsigned long address)
+				unsigned long address,
+				enum mmu_event event)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 2ea0b3b..60491fc 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -467,7 +467,9 @@ static void unmap_if_in_range(struct grant_map *map,
 
 static void mn_invl_range_start(struct mmu_notifier *mn,
 				struct mm_struct *mm,
-				unsigned long start, unsigned long end)
+				unsigned long start,
+				unsigned long end,
+				enum mmu_event event)
 {
 	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
 	struct grant_map *map;
@@ -484,9 +486,10 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
 
 static void mn_invl_page(struct mmu_notifier *mn,
 			 struct mm_struct *mm,
-			 unsigned long address)
+			 unsigned long address,
+			 enum mmu_event event)
 {
-	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE);
+	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, event);
 }
 
 static void mn_release(struct mmu_notifier *mn,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e2d46ad..a3b15d4 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -929,11 +929,13 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 				downgrade_write(&mm->mmap_sem);
 				break;
 			}
-			mmu_notifier_invalidate_range_start(mm, 0, -1);
+			mmu_notifier_invalidate_range_start(mm, 0, -1,
+							MMU_CLEAR_SOFT_DIRTY);
 		}
 		walk_page_range(0, ~0UL, &clear_refs_walk);
 		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_end(mm, 0, -1);
+			mmu_notifier_invalidate_range_end(mm, 0, -1,
+							MMU_CLEAR_SOFT_DIRTY);
 		flush_tlb_mm(mm);
 		up_read(&mm->mmap_sem);
 out_mm:
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index a1a210d..e92c52e 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -9,6 +9,67 @@
 struct mmu_notifier;
 struct mmu_notifier_ops;
 
+/* MMU Events report fine-grained information to the callback routine, allowing
+ * the event listener to make a more informed decision as to what action to
+ * take. The event types are:
+ *
+ *   - MMU_FORK a process is forking. This will lead to vmas getting
+ *     write-protected, in order to set up COW
+ *
+ *   - MMU_HUGE_PAGE_SPLIT the pages don't move, nor does their content change,
+ *     but the page table structure is updated (levels added or removed).
+ *
+ *   - MMU_CLEAR_SOFT_DIRTY need to write protect so write properly update the
+ *     soft dirty bit of page table entry.
+ *
+ *   - MMU_MIGRATE: memory is migrating from one page to another, thus all write
+ *     access must stop after invalidate_range_start callback returns.
+ *     Furthermore, no read access should be allowed either, as a new page can
+ *     be remapped with write access before the invalidate_range_end callback
+ *     happens and thus any read access to old page might read stale data. There
+ *     are several sources for this event, including:
+ *
+ *         - A page moving to swap (various reasons, including page reclaim),
+ *         - An mremap syscall,
+ *         - migration for NUMA reasons,
+ *         - balancing the memory pool,
+ *         - write fault on COW page,
+ *         - and more that are not listed here.
+ *
+ *   - MMU_MPROT: memory access protection is changing. Refer to the vma to get
+ *     the new access protection. All memory access are still valid until the
+ *     invalidate_range_end callback.
+ *
+ *   - MMU_MUNLOCK: unlock memory. Content of page table stays the same but
+ *     page are unlocked.
+ *
+ *   - MMU_MUNMAP: the range is being unmapped (outcome of a munmap syscall or
+ *     process destruction). However, access is still allowed, up until the
+ *     invalidate_range_free_pages callback. This also implies that secondary
+ *     page table can be trimmed, because the address range is no longer valid.
+ *
+ *   - MMU_WRITE_BACK: memory is being written back to disk, all write accesses
+ *     must stop after invalidate_range_start callback returns. Read access are
+ *     still allowed.
+ *
+ *   - MMU_KSM_WRITE_PROTECT: memory is being write protected for KSM.
+ *
+ * If in doubt when adding a new notifier caller, please use MMU_MIGRATE,
+ * because it will always lead to reasonable behavior, but will not allow the
+ * listener a chance to optimize its events.
+ */
+enum mmu_event {
+	MMU_FORK = 0,
+	MMU_HUGE_PAGE_SPLIT,
+	MMU_CLEAR_SOFT_DIRTY,
+	MMU_MIGRATE,
+	MMU_MPROT,
+	MMU_MUNLOCK,
+	MMU_MUNMAP,
+	MMU_WRITE_BACK,
+	MMU_KSM_WRITE_PROTECT,
+};
+
 #ifdef CONFIG_MMU_NOTIFIER
 
 /*
@@ -92,7 +153,8 @@ struct mmu_notifier_ops {
 	void (*change_pte)(struct mmu_notifier *mn,
 			   struct mm_struct *mm,
 			   unsigned long address,
-			   pte_t pte);
+			   pte_t pte,
+			   enum mmu_event event);
 
 	/*
 	 * Before this is invoked any secondary MMU is still ok to
@@ -103,7 +165,8 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_page)(struct mmu_notifier *mn,
 				struct mm_struct *mm,
-				unsigned long address);
+				unsigned long address,
+				enum mmu_event event);
 
 	/*
 	 * invalidate_range_start() and invalidate_range_end() must be
@@ -150,10 +213,14 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_range_start)(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end);
+				       unsigned long start,
+				       unsigned long end,
+				       enum mmu_event event);
 	void (*invalidate_range_end)(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
-				     unsigned long start, unsigned long end);
+				     unsigned long start,
+				     unsigned long end,
+				     enum mmu_event event);
 
 	/*
 	 * invalidate_range() is either called between
@@ -219,13 +286,20 @@ extern int __mmu_notifier_clear_young(struct mm_struct *mm,
 extern int __mmu_notifier_test_young(struct mm_struct *mm,
 				     unsigned long address);
 extern void __mmu_notifier_change_pte(struct mm_struct *mm,
-				      unsigned long address, pte_t pte);
+				      unsigned long address,
+				      pte_t pte,
+				      enum mmu_event event);
 extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address);
+					  unsigned long address,
+					  enum mmu_event event);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end);
+						  unsigned long start,
+						  unsigned long end,
+						  enum mmu_event event);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end);
+						unsigned long start,
+						unsigned long end,
+						enum mmu_event event);
 extern void __mmu_notifier_invalidate_range(struct mm_struct *mm,
 				  unsigned long start, unsigned long end);
 
@@ -262,31 +336,38 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_change_pte(struct mm_struct *mm,
-					   unsigned long address, pte_t pte)
+					   unsigned long address,
+					   pte_t pte,
+					   enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_change_pte(mm, address, pte);
+		__mmu_notifier_change_pte(mm, address, pte, event);
 }
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address)
+						unsigned long address,
+						enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_page(mm, address);
+		__mmu_notifier_invalidate_page(mm, address, event);
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						       unsigned long start,
+						       unsigned long end,
+						       enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_start(mm, start, end);
+		__mmu_notifier_invalidate_range_start(mm, start, end, event);
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						     unsigned long start,
+						     unsigned long end,
+						     enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_end(mm, start, end);
+		__mmu_notifier_invalidate_range_end(mm, start, end, event);
 }
 
 static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
@@ -403,13 +484,13 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
  * old page would remain mapped readonly in the secondary MMUs after the new
  * page is already writable by some CPU through the primary MMU.
  */
-#define set_pte_at_notify(__mm, __address, __ptep, __pte)		\
+#define set_pte_at_notify(__mm, __address, __ptep, __pte, __event)	\
 ({									\
 	struct mm_struct *___mm = __mm;					\
 	unsigned long ___address = __address;				\
 	pte_t ___pte = __pte;						\
 									\
-	mmu_notifier_change_pte(___mm, ___address, ___pte);		\
+	mmu_notifier_change_pte(___mm, ___address, ___pte, __event);	\
 	set_pte_at(___mm, ___address, __ptep, ___pte);			\
 })
 
@@ -437,22 +518,29 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_change_pte(struct mm_struct *mm,
-					   unsigned long address, pte_t pte)
+					   unsigned long address,
+					   pte_t pte,
+					   enum mmu_event event)
 {
 }
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address)
+						unsigned long address,
+						enum mmu_event event)
 {
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						       unsigned long start,
+						       unsigned long end,
+						       enum mmu_event event)
 {
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						     unsigned long start,
+						     unsigned long end,
+						     enum mmu_event event)
 {
 }
 
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 4e5e979..eafa177 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -168,7 +168,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	/* For try_to_free_swap() and munlock_vma_page() below */
 	lock_page(page);
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	err = -EAGAIN;
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -186,7 +187,9 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	flush_cache_page(vma, addr, pte_pfn(*ptep));
 	ptep_clear_flush_notify(vma, addr, ptep);
-	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+	set_pte_at_notify(mm, addr, ptep,
+			  mk_pte(kpage, vma->vm_page_prot),
+			  MMU_MIGRATE);
 
 	page_remove_rmap(page);
 	if (!page_mapped(page))
@@ -200,7 +203,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	err = 0;
  unlock:
 	mem_cgroup_cancel_charge(kpage, memcg);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 	unlock_page(page);
 	return err;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4b06b8d..2e1e746 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1093,7 +1093,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+					    MMU_MIGRATE);
 
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, orig_pmd)))
@@ -1127,7 +1128,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	page_remove_rmap(page);
 	spin_unlock(ptl);
 
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	ret |= VM_FAULT_WRITE;
 	put_page(page);
@@ -1137,7 +1139,8 @@ out:
 
 out_free_pages:
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		memcg = (void *)page_private(pages[i]);
 		set_page_private(pages[i], 0);
@@ -1229,7 +1232,8 @@ alloc:
 
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+					    MMU_MIGRATE);
 
 	spin_lock(ptl);
 	if (page)
@@ -1261,7 +1265,8 @@ alloc:
 	}
 	spin_unlock(ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 out:
 	return ret;
 out_unlock:
@@ -1680,7 +1685,8 @@ static int __split_huge_page_splitting(struct page *page,
 	const unsigned long mmun_start = address;
 	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_HUGE_PAGE_SPLIT);
 	pmd = page_check_address_pmd(page, mm, address,
 			PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
 	if (pmd) {
@@ -1696,7 +1702,8 @@ static int __split_huge_page_splitting(struct page *page,
 		ret = 1;
 		spin_unlock(ptl);
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_HUGE_PAGE_SPLIT);
 
 	return ret;
 }
@@ -2566,7 +2573,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 
 	mmun_start = address;
 	mmun_end   = address + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
 	/*
 	 * After this gup_fast can't run anymore. This also removes
@@ -2576,7 +2584,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 */
 	_pmd = pmdp_collapse_flush(vma, address, pmd);
 	spin_unlock(pmd_ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	spin_lock(pte_ptl);
 	isolated = __collapse_huge_page_isolate(vma, address, pte);
@@ -2975,7 +2984,8 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
 again:
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_trans_huge(*pmd)))
 		goto unlock;
@@ -2992,7 +3002,8 @@ again:
 	}
  unlock:
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	if (!page)
 		return;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9cc7734..62c3ad8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2977,7 +2977,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	mmun_start = vma->vm_start;
 	mmun_end = vma->vm_end;
 	if (cow)
-		mmu_notifier_invalidate_range_start(src, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_start(src, mmun_start,
+						    mmun_end, MMU_MIGRATE);
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
 		spinlock_t *src_ptl, *dst_ptl;
@@ -3031,7 +3032,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	}
 
 	if (cow)
-		mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(src, mmun_start,
+						  mmun_end, MMU_MIGRATE);
 
 	return ret;
 }
@@ -3057,7 +3059,8 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	BUG_ON(end & ~huge_page_mask(h));
 
 	tlb_start_vma(tlb, vma);
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	address = start;
 again:
 	for (; address < end; address += sz) {
@@ -3131,7 +3134,8 @@ unlock:
 		if (address < end && !ref_page)
 			goto again;
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 	tlb_end_vma(tlb, vma);
 }
 
@@ -3318,8 +3322,8 @@ retry_avoidcopy:
 
 	mmun_start = address & huge_page_mask(h);
 	mmun_end = mmun_start + huge_page_size(h);
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
-
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+					    MMU_MIGRATE);
 	/*
 	 * Retake the page table lock to check for racing updates
 	 * before the page tables are altered
@@ -3340,7 +3344,8 @@ retry_avoidcopy:
 		new_page = old_page;
 	}
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
+					  MMU_MIGRATE);
 out_release_all:
 	page_cache_release(new_page);
 out_release_old:
@@ -3822,7 +3827,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	BUG_ON(address >= end);
 	flush_cache_range(vma, address, end);
 
-	mmu_notifier_invalidate_range_start(mm, start, end);
+	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MPROT);
 	i_mmap_lock_write(vma->vm_file->f_mapping);
 	for (; address < end; address += huge_page_size(h)) {
 		spinlock_t *ptl;
@@ -3872,7 +3877,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	flush_tlb_range(vma, start, end);
 	mmu_notifier_invalidate_range(mm, start, end);
 	i_mmap_unlock_write(vma->vm_file->f_mapping);
-	mmu_notifier_invalidate_range_end(mm, start, end);
+	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MPROT);
 
 	return pages << h->order;
 }
diff --git a/mm/ksm.c b/mm/ksm.c
index 7ee101e..eb1b2b5 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -872,7 +872,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 
 	mmun_start = addr;
 	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+					    MMU_KSM_WRITE_PROTECT);
 
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -904,7 +905,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 		if (pte_dirty(entry))
 			set_page_dirty(page);
 		entry = pte_mkclean(pte_wrprotect(entry));
-		set_pte_at_notify(mm, addr, ptep, entry);
+		set_pte_at_notify(mm, addr, ptep, entry, MMU_KSM_WRITE_PROTECT);
 	}
 	*orig_pte = *ptep;
 	err = 0;
@@ -912,7 +913,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 out_unlock:
 	pte_unmap_unlock(ptep, ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
+					  MMU_KSM_WRITE_PROTECT);
 out:
 	return err;
 }
@@ -948,7 +950,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 
 	mmun_start = addr;
 	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+					    MMU_MIGRATE);
 
 	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	if (!pte_same(*ptep, orig_pte)) {
@@ -961,7 +964,9 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 
 	flush_cache_page(vma, addr, pte_pfn(*ptep));
 	ptep_clear_flush_notify(vma, addr, ptep);
-	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+	set_pte_at_notify(mm, addr, ptep,
+			  mk_pte(kpage, vma->vm_page_prot),
+			  MMU_MIGRATE);
 
 	page_remove_rmap(page);
 	if (!page_mapped(page))
@@ -971,7 +976,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	pte_unmap_unlock(ptep, ptl);
 	err = 0;
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
+					  MMU_MIGRATE);
 out:
 	return err;
 }
diff --git a/mm/memory.c b/mm/memory.c
index deb679c..8281b4b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1049,7 +1049,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	mmun_end   = end;
 	if (is_cow)
 		mmu_notifier_invalidate_range_start(src_mm, mmun_start,
-						    mmun_end);
+						    mmun_end, MMU_FORK);
 
 	ret = 0;
 	dst_pgd = pgd_offset(dst_mm, addr);
@@ -1066,7 +1066,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
 
 	if (is_cow)
-		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(src_mm, mmun_start,
+						  mmun_end, MMU_FORK);
 	return ret;
 }
 
@@ -1336,10 +1337,12 @@ void unmap_vmas(struct mmu_gather *tlb,
 {
 	struct mm_struct *mm = vma->vm_mm;
 
-	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
+	mmu_notifier_invalidate_range_start(mm, start_addr,
+					    end_addr, MMU_MUNMAP);
 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
 		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
-	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
+	mmu_notifier_invalidate_range_end(mm, start_addr,
+					  end_addr, MMU_MUNMAP);
 }
 
 /**
@@ -1361,10 +1364,10 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, start, end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, start, end);
+	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MIGRATE);
 	for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
 		unmap_single_vma(&tlb, vma, start, end, details);
-	mmu_notifier_invalidate_range_end(mm, start, end);
+	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MIGRATE);
 	tlb_finish_mmu(&tlb, start, end);
 }
 
@@ -1387,9 +1390,9 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, address, end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, address, end);
+	mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
 	unmap_single_vma(&tlb, vma, address, end, details);
-	mmu_notifier_invalidate_range_end(mm, address, end);
+	mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
 	tlb_finish_mmu(&tlb, address, end);
 }
 
@@ -2088,7 +2091,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	__SetPageUptodate(new_page);
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 
 	/*
 	 * Re-check the pte - we dropped the lock
@@ -2121,7 +2125,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * mmu page tables (such as kvm shadow page tables), we want the
 		 * new page to be mapped directly into the secondary page table.
 		 */
-		set_pte_at_notify(mm, address, page_table, entry);
+		set_pte_at_notify(mm, address, page_table, entry, MMU_MIGRATE);
 		update_mmu_cache(vma, address, page_table);
 		if (old_page) {
 			/*
@@ -2160,7 +2164,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 		page_cache_release(new_page);
 
 	pte_unmap_unlock(page_table, ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 	if (old_page) {
 		/*
 		 * Don't let another task, with possibly unlocked vma,
diff --git a/mm/migrate.c b/mm/migrate.c
index 842ecd7..d49a3af 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1774,12 +1774,14 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	WARN_ON(PageLRU(new_page));
 
 	/* Recheck the target PMD */
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
 fail_putback:
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(mm, mmun_start,
+						  mmun_end, MMU_MIGRATE);
 
 		/* Reverse changes made by migrate_page_copy() */
 		if (TestClearPageActive(new_page))
@@ -1833,7 +1835,8 @@ fail_putback:
 	page_remove_rmap(page);
 
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	/* Take an "isolate" reference and put new page on the LRU. */
 	get_page(new_page);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 5fbdd36..b806bdb 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -159,8 +159,10 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
 	return young;
 }
 
-void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
-			       pte_t pte)
+void __mmu_notifier_change_pte(struct mm_struct *mm,
+			       unsigned long address,
+			       pte_t pte,
+			       enum mmu_event event)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -168,13 +170,14 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->change_pte)
-			mn->ops->change_pte(mn, mm, address, pte);
+			mn->ops->change_pte(mn, mm, address, pte, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 
 void __mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address)
+				    unsigned long address,
+				    enum mmu_event event)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -182,13 +185,16 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_page)
-			mn->ops->invalidate_page(mn, mm, address);
+			mn->ops->invalidate_page(mn, mm, address, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 
 void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+					   unsigned long start,
+					   unsigned long end,
+					   enum mmu_event event)
+
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -196,14 +202,17 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_start)
-			mn->ops->invalidate_range_start(mn, mm, start, end);
+			mn->ops->invalidate_range_start(mn, mm, start,
+							end, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
 
 void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+					 unsigned long start,
+					 unsigned long end,
+					 enum mmu_event event)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -221,7 +230,8 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 		if (mn->ops->invalidate_range)
 			mn->ops->invalidate_range(mn, mm, start, end);
 		if (mn->ops->invalidate_range_end)
-			mn->ops->invalidate_range_end(mn, mm, start, end);
+			mn->ops->invalidate_range_end(mn, mm, start,
+						      end, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ef5be8e..f63b022 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -155,7 +155,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		/* invoke the mmu notifier if the pmd is populated */
 		if (!mni_start) {
 			mni_start = addr;
-			mmu_notifier_invalidate_range_start(mm, mni_start, end);
+			mmu_notifier_invalidate_range_start(mm, mni_start,
+							    end, MMU_MPROT);
 		}
 
 		if (pmd_trans_huge(*pmd)) {
@@ -183,7 +184,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	} while (pmd++, addr = next, addr != end);
 
 	if (mni_start)
-		mmu_notifier_invalidate_range_end(mm, mni_start, end);
+		mmu_notifier_invalidate_range_end(mm, mni_start, end,
+						  MMU_MPROT);
 
 	if (nr_huge_updates)
 		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
diff --git a/mm/mremap.c b/mm/mremap.c
index 5a71cce..eea73a3 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -176,7 +176,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 
 	mmun_start = old_addr;
 	mmun_end   = old_end;
-	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
 		cond_resched();
@@ -228,7 +229,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	if (likely(need_flush))
 		flush_tlb_range(vma, old_end-len, old_addr);
 
-	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	return len + old_addr - old_end;	/* how much done */
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index f5b5c1f..8ff1e3b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1000,7 +1000,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 	pte_unmap_unlock(pte, ptl);
 
 	if (ret) {
-		mmu_notifier_invalidate_page(mm, address);
+		mmu_notifier_invalidate_page(mm, address, MMU_WRITE_BACK);
 		(*cleaned)++;
 	}
 out:
@@ -1420,7 +1420,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret != SWAP_FAIL && !(flags & TTU_MUNLOCK))
-		mmu_notifier_invalidate_page(mm, address);
+		mmu_notifier_invalidate_page(mm, address, MMU_MIGRATE);
 out:
 	return ret;
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8db1d93..9f6acd0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -269,7 +269,8 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 
 static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
-					     unsigned long address)
+					     unsigned long address,
+					     enum mmu_event event)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush, idx;
@@ -311,7 +312,8 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 					struct mm_struct *mm,
 					unsigned long address,
-					pte_t pte)
+					pte_t pte,
+					enum mmu_event event)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int idx;
@@ -327,7 +329,8 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
 						    unsigned long start,
-						    unsigned long end)
+						    unsigned long end,
+						    enum mmu_event event)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush = 0, idx;
@@ -353,7 +356,8 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 						  struct mm_struct *mm,
 						  unsigned long start,
-						  unsigned long end)
+						  unsigned long end,
+						  enum mmu_event event)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v11 01/15] mmu_notifier: add event information to address invalidation v8
@ 2015-10-21 20:59   ` Jérôme Glisse
  0 siblings, 0 replies; 42+ messages in thread
From: Jérôme Glisse @ 2015-10-21 20:59 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse

The event information will be useful to new users of the mmu_notifier API.
The event argument differentiates between a vma disappearing, a page
being write protected or simply a page being unmapped. This allows a new
user to take a different path for each event: for instance, on unmap the
resources used to track a vma are still valid and should stay around,
while if the event says that a vma is being destroyed, any resources used
to track this vma can be freed.
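
For illustration only (none of the names below are part of this patch),
a listener could switch on the new event argument in its callback to pick
a cheaper action per event. This is a minimal sketch against the updated
invalidate_range_start() signature; example_mirror and its helpers are
hypothetical:

struct example_mirror {
	struct mmu_notifier mn;
	/* device specific mirror state would live here */
};

static void example_invalidate_range_start(struct mmu_notifier *mn,
					   struct mm_struct *mm,
					   unsigned long start,
					   unsigned long end,
					   enum mmu_event event)
{
	struct example_mirror *mirror;

	mirror = container_of(mn, struct example_mirror, mn);

	switch (event) {
	case MMU_WRITE_BACK:
	case MMU_KSM_WRITE_PROTECT:
		/* Only writes must stop, read-only mappings can stay. */
		example_mirror_wrprotect(mirror, start, end);
		break;
	case MMU_MUNMAP:
		/* The range goes away, per-range tracking can be freed. */
		example_mirror_unmap(mirror, start, end);
		example_mirror_free_tracking(mirror, start, end);
		break;
	default:
		/* MMU_MIGRATE and anything else: tear down all access. */
		example_mirror_unmap(mirror, start, end);
		break;
	}
}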

Changed since v1:
  - renamed action to event (updated commit message too).
  - simplified the event names and clarified their usage,
    also documenting what expectations the listener can have with
    respect to each event.

Changed since v2:
  - Avoid crazy name.
  - Do not move code that does not need to move.

Changed since v3:
  - Separate huge page split from mlock/munlock and softdirty.

Changed since v4:
  - Rebase (no other changes).

Changed since v5:
  - Typo fix.
  - Changed zap_page_range from MMU_MUNMAP to MMU_MIGRATE to reflect the
    fact that the address range is still valid; only the pages backing it
    are no longer.

Changed since v6:
  - try_to_unmap_one() only invalidates when doing migration.
  - Differentiate fork from other cases.

Changed since v7:
  - Renamed the huge page split event to MMU_HUGE_PAGE_SPLIT.
  - Renamed MMU_ISDIRTY to MMU_CLEAR_SOFT_DIRTY.
  - Renamed MMU_WRITE_PROTECT to MMU_KSM_WRITE_PROTECT.
  - English syntax fixes.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  |   3 +-
 drivers/gpu/drm/i915/i915_gem_userptr.c |   3 +-
 drivers/gpu/drm/radeon/radeon_mn.c      |   3 +-
 drivers/infiniband/core/umem_odp.c      |   9 ++-
 drivers/iommu/amd_iommu_v2.c            |   3 +-
 drivers/misc/sgi-gru/grutlbpurge.c      |   9 ++-
 drivers/xen/gntdev.c                    |   9 ++-
 fs/proc/task_mmu.c                      |   6 +-
 include/linux/mmu_notifier.h            | 132 ++++++++++++++++++++++++++------
 kernel/events/uprobes.c                 |  10 ++-
 mm/huge_memory.c                        |  33 +++++---
 mm/hugetlb.c                            |  23 +++---
 mm/ksm.c                                |  18 +++--
 mm/memory.c                             |  27 ++++---
 mm/migrate.c                            |   9 ++-
 mm/mmu_notifier.c                       |  28 ++++---
 mm/mprotect.c                           |   6 +-
 mm/mremap.c                             |   6 +-
 mm/rmap.c                               |   4 +-
 virt/kvm/kvm_main.c                     |  12 ++-
 20 files changed, 254 insertions(+), 99 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index b1969f2..7ca805c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -121,7 +121,8 @@ static void amdgpu_mn_release(struct mmu_notifier *mn,
 static void amdgpu_mn_invalidate_range_start(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
 					     unsigned long start,
-					     unsigned long end)
+					     unsigned long end,
+					     enum mmu_event event)
 {
 	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
 	struct interval_tree_node *it;
diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 8fd431b..adc5480 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -132,7 +132,8 @@ restart:
 static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 						       struct mm_struct *mm,
 						       unsigned long start,
-						       unsigned long end)
+						       unsigned long end,
+						       enum mmu_event event)
 {
 	struct i915_mmu_notifier *mn = container_of(_mn, struct i915_mmu_notifier, mn);
 	struct interval_tree_node *it = NULL;
diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c
index eef006c..3a9615b 100644
--- a/drivers/gpu/drm/radeon/radeon_mn.c
+++ b/drivers/gpu/drm/radeon/radeon_mn.c
@@ -121,7 +121,8 @@ static void radeon_mn_release(struct mmu_notifier *mn,
 static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
 					     unsigned long start,
-					     unsigned long end)
+					     unsigned long end,
+					     enum mmu_event event)
 {
 	struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
 	struct interval_tree_node *it;
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 40becdb..6ed69fa 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -165,7 +165,8 @@ static int invalidate_page_trampoline(struct ib_umem *item, u64 start,
 
 static void ib_umem_notifier_invalidate_page(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
-					     unsigned long address)
+					     unsigned long address,
+					     enum mmu_event event)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
 
@@ -192,7 +193,8 @@ static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
 static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
 						    unsigned long start,
-						    unsigned long end)
+						    unsigned long end,
+						    enum mmu_event event)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
 
@@ -217,7 +219,8 @@ static int invalidate_range_end_trampoline(struct ib_umem *item, u64 start,
 static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
 						  struct mm_struct *mm,
 						  unsigned long start,
-						  unsigned long end)
+						  unsigned long end,
+						  enum mmu_event event)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
 
diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 1131664..52f7d64 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -392,7 +392,8 @@ static int mn_clear_flush_young(struct mmu_notifier *mn,
 
 static void mn_invalidate_page(struct mmu_notifier *mn,
 			       struct mm_struct *mm,
-			       unsigned long address)
+			       unsigned long address,
+			       enum mmu_event event)
 {
 	__mn_flush_page(mn, address);
 }
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index 2129274..e67fed1 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -221,7 +221,8 @@ void gru_flush_all_tlb(struct gru_state *gru)
  */
 static void gru_invalidate_range_start(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end)
+				       unsigned long start, unsigned long end,
+				       enum mmu_event event)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
@@ -235,7 +236,8 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
 
 static void gru_invalidate_range_end(struct mmu_notifier *mn,
 				     struct mm_struct *mm, unsigned long start,
-				     unsigned long end)
+				     unsigned long end,
+				     enum mmu_event event)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
@@ -248,7 +250,8 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
 }
 
 static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
-				unsigned long address)
+				unsigned long address,
+				enum mmu_event event)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 2ea0b3b..60491fc 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -467,7 +467,9 @@ static void unmap_if_in_range(struct grant_map *map,
 
 static void mn_invl_range_start(struct mmu_notifier *mn,
 				struct mm_struct *mm,
-				unsigned long start, unsigned long end)
+				unsigned long start,
+				unsigned long end,
+				enum mmu_event event)
 {
 	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
 	struct grant_map *map;
@@ -484,9 +486,10 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
 
 static void mn_invl_page(struct mmu_notifier *mn,
 			 struct mm_struct *mm,
-			 unsigned long address)
+			 unsigned long address,
+			 enum mmu_event event)
 {
-	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE);
+	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, event);
 }
 
 static void mn_release(struct mmu_notifier *mn,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e2d46ad..a3b15d4 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -929,11 +929,13 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 				downgrade_write(&mm->mmap_sem);
 				break;
 			}
-			mmu_notifier_invalidate_range_start(mm, 0, -1);
+			mmu_notifier_invalidate_range_start(mm, 0, -1,
+							MMU_CLEAR_SOFT_DIRTY);
 		}
 		walk_page_range(0, ~0UL, &clear_refs_walk);
 		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_end(mm, 0, -1);
+			mmu_notifier_invalidate_range_end(mm, 0, -1,
+							MMU_CLEAR_SOFT_DIRTY);
 		flush_tlb_mm(mm);
 		up_read(&mm->mmap_sem);
 out_mm:
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index a1a210d..e92c52e 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -9,6 +9,67 @@
 struct mmu_notifier;
 struct mmu_notifier_ops;
 
+/* MMU Events report fine-grained information to the callback routine, allowing
+ * the event listener to make a more informed decision as to what action to
+ * take. The event types are:
+ *
+ *   - MMU_FORK: a process is forking. This will lead to vmas getting
+ *     write-protected, in order to set up COW.
+ *
+ *   - MMU_HUGE_PAGE_SPLIT: the pages don't move, nor does their content
+ *     change, but the page table structure is updated (levels added or removed).
+ *
+ *   - MMU_CLEAR_SOFT_DIRTY: the pages need to be write protected so that
+ *     writes properly update the soft dirty bit of the page table entries.
+ *
+ *   - MMU_MIGRATE: memory is migrating from one page to another, thus all
+ *     write accesses must stop after the invalidate_range_start callback
+ *     returns. Furthermore, no read access should be allowed either, as a new
+ *     page can be remapped with write access before the invalidate_range_end
+ *     callback happens, and thus any read access to the old page might read
+ *     stale data. There are several sources for this event, including:
+ *
+ *         - A page moving to swap (various reasons, including page reclaim),
+ *         - An mremap syscall,
+ *         - migration for NUMA reasons,
+ *         - balancing the memory pool,
+ *         - write fault on COW page,
+ *         - and more that are not listed here.
+ *
+ *   - MMU_MPROT: memory access protection is changing. Refer to the vma to get
+ *     the new access protection. All memory accesses are still valid until the
+ *     invalidate_range_end callback.
+ *
+ *   - MMU_MUNLOCK: unlock memory. The content of the page table stays the
+ *     same, but the pages are unlocked.
+ *
+ *   - MMU_MUNMAP: the range is being unmapped (outcome of a munmap syscall
+ *     or process destruction). However, access is still allowed up until the
+ *     invalidate_range_free_pages callback. This also implies that the secondary
+ *     page table can be trimmed, because the address range is no longer valid.
+ *
+ *   - MMU_WRITE_BACK: memory is being written back to disk, all write accesses
+ *     must stop after the invalidate_range_start callback returns. Read
+ *     accesses are still allowed.
+ *
+ *   - MMU_KSM_WRITE_PROTECT: memory is being write protected for KSM.
+ *
+ * If in doubt when adding a new notifier caller, please use MMU_MIGRATE,
+ * because it will always lead to reasonable behavior, but will not give the
+ * listener a chance to optimize its handling of the event.
+ */
+enum mmu_event {
+	MMU_FORK = 0,
+	MMU_HUGE_PAGE_SPLIT,
+	MMU_CLEAR_SOFT_DIRTY,
+	MMU_MIGRATE,
+	MMU_MPROT,
+	MMU_MUNLOCK,
+	MMU_MUNMAP,
+	MMU_WRITE_BACK,
+	MMU_KSM_WRITE_PROTECT,
+};
+
 #ifdef CONFIG_MMU_NOTIFIER
 
 /*
@@ -92,7 +153,8 @@ struct mmu_notifier_ops {
 	void (*change_pte)(struct mmu_notifier *mn,
 			   struct mm_struct *mm,
 			   unsigned long address,
-			   pte_t pte);
+			   pte_t pte,
+			   enum mmu_event event);
 
 	/*
 	 * Before this is invoked any secondary MMU is still ok to
@@ -103,7 +165,8 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_page)(struct mmu_notifier *mn,
 				struct mm_struct *mm,
-				unsigned long address);
+				unsigned long address,
+				enum mmu_event event);
 
 	/*
 	 * invalidate_range_start() and invalidate_range_end() must be
@@ -150,10 +213,14 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_range_start)(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end);
+				       unsigned long start,
+				       unsigned long end,
+				       enum mmu_event event);
 	void (*invalidate_range_end)(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
-				     unsigned long start, unsigned long end);
+				     unsigned long start,
+				     unsigned long end,
+				     enum mmu_event event);
 
 	/*
 	 * invalidate_range() is either called between
@@ -219,13 +286,20 @@ extern int __mmu_notifier_clear_young(struct mm_struct *mm,
 extern int __mmu_notifier_test_young(struct mm_struct *mm,
 				     unsigned long address);
 extern void __mmu_notifier_change_pte(struct mm_struct *mm,
-				      unsigned long address, pte_t pte);
+				      unsigned long address,
+				      pte_t pte,
+				      enum mmu_event event);
 extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address);
+					  unsigned long address,
+					  enum mmu_event event);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end);
+						  unsigned long start,
+						  unsigned long end,
+						  enum mmu_event event);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end);
+						unsigned long start,
+						unsigned long end,
+						enum mmu_event event);
 extern void __mmu_notifier_invalidate_range(struct mm_struct *mm,
 				  unsigned long start, unsigned long end);
 
@@ -262,31 +336,38 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_change_pte(struct mm_struct *mm,
-					   unsigned long address, pte_t pte)
+					   unsigned long address,
+					   pte_t pte,
+					   enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_change_pte(mm, address, pte);
+		__mmu_notifier_change_pte(mm, address, pte, event);
 }
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address)
+						unsigned long address,
+						enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_page(mm, address);
+		__mmu_notifier_invalidate_page(mm, address, event);
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						       unsigned long start,
+						       unsigned long end,
+						       enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_start(mm, start, end);
+		__mmu_notifier_invalidate_range_start(mm, start, end, event);
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						     unsigned long start,
+						     unsigned long end,
+						     enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_end(mm, start, end);
+		__mmu_notifier_invalidate_range_end(mm, start, end, event);
 }
 
 static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
@@ -403,13 +484,13 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
  * old page would remain mapped readonly in the secondary MMUs after the new
  * page is already writable by some CPU through the primary MMU.
  */
-#define set_pte_at_notify(__mm, __address, __ptep, __pte)		\
+#define set_pte_at_notify(__mm, __address, __ptep, __pte, __event)	\
 ({									\
 	struct mm_struct *___mm = __mm;					\
 	unsigned long ___address = __address;				\
 	pte_t ___pte = __pte;						\
 									\
-	mmu_notifier_change_pte(___mm, ___address, ___pte);		\
+	mmu_notifier_change_pte(___mm, ___address, ___pte, __event);	\
 	set_pte_at(___mm, ___address, __ptep, ___pte);			\
 })
 
@@ -437,22 +518,29 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_change_pte(struct mm_struct *mm,
-					   unsigned long address, pte_t pte)
+					   unsigned long address,
+					   pte_t pte,
+					   enum mmu_event event)
 {
 }
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address)
+						unsigned long address,
+						enum mmu_event event)
 {
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						       unsigned long start,
+						       unsigned long end,
+						       enum mmu_event event)
 {
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						     unsigned long start,
+						     unsigned long end,
+						     enum mmu_event event)
 {
 }
 
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 4e5e979..eafa177 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -168,7 +168,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	/* For try_to_free_swap() and munlock_vma_page() below */
 	lock_page(page);
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	err = -EAGAIN;
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -186,7 +187,9 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	flush_cache_page(vma, addr, pte_pfn(*ptep));
 	ptep_clear_flush_notify(vma, addr, ptep);
-	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+	set_pte_at_notify(mm, addr, ptep,
+			  mk_pte(kpage, vma->vm_page_prot),
+			  MMU_MIGRATE);
 
 	page_remove_rmap(page);
 	if (!page_mapped(page))
@@ -200,7 +203,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	err = 0;
  unlock:
 	mem_cgroup_cancel_charge(kpage, memcg);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 	unlock_page(page);
 	return err;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4b06b8d..2e1e746 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1093,7 +1093,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+					    MMU_MIGRATE);
 
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, orig_pmd)))
@@ -1127,7 +1128,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	page_remove_rmap(page);
 	spin_unlock(ptl);
 
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	ret |= VM_FAULT_WRITE;
 	put_page(page);
@@ -1137,7 +1139,8 @@ out:
 
 out_free_pages:
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		memcg = (void *)page_private(pages[i]);
 		set_page_private(pages[i], 0);
@@ -1229,7 +1232,8 @@ alloc:
 
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+					    MMU_MIGRATE);
 
 	spin_lock(ptl);
 	if (page)
@@ -1261,7 +1265,8 @@ alloc:
 	}
 	spin_unlock(ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 out:
 	return ret;
 out_unlock:
@@ -1680,7 +1685,8 @@ static int __split_huge_page_splitting(struct page *page,
 	const unsigned long mmun_start = address;
 	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_HUGE_PAGE_SPLIT);
 	pmd = page_check_address_pmd(page, mm, address,
 			PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
 	if (pmd) {
@@ -1696,7 +1702,8 @@ static int __split_huge_page_splitting(struct page *page,
 		ret = 1;
 		spin_unlock(ptl);
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_HUGE_PAGE_SPLIT);
 
 	return ret;
 }
@@ -2566,7 +2573,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 
 	mmun_start = address;
 	mmun_end   = address + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
 	/*
 	 * After this gup_fast can't run anymore. This also removes
@@ -2576,7 +2584,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 */
 	_pmd = pmdp_collapse_flush(vma, address, pmd);
 	spin_unlock(pmd_ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	spin_lock(pte_ptl);
 	isolated = __collapse_huge_page_isolate(vma, address, pte);
@@ -2975,7 +2984,8 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
 again:
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_trans_huge(*pmd)))
 		goto unlock;
@@ -2992,7 +3002,8 @@ again:
 	}
  unlock:
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	if (!page)
 		return;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9cc7734..62c3ad8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2977,7 +2977,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	mmun_start = vma->vm_start;
 	mmun_end = vma->vm_end;
 	if (cow)
-		mmu_notifier_invalidate_range_start(src, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_start(src, mmun_start,
+						    mmun_end, MMU_MIGRATE);
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
 		spinlock_t *src_ptl, *dst_ptl;
@@ -3031,7 +3032,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	}
 
 	if (cow)
-		mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(src, mmun_start,
+						  mmun_end, MMU_MIGRATE);
 
 	return ret;
 }
@@ -3057,7 +3059,8 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	BUG_ON(end & ~huge_page_mask(h));
 
 	tlb_start_vma(tlb, vma);
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	address = start;
 again:
 	for (; address < end; address += sz) {
@@ -3131,7 +3134,8 @@ unlock:
 		if (address < end && !ref_page)
 			goto again;
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 	tlb_end_vma(tlb, vma);
 }
 
@@ -3318,8 +3322,8 @@ retry_avoidcopy:
 
 	mmun_start = address & huge_page_mask(h);
 	mmun_end = mmun_start + huge_page_size(h);
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
-
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+					    MMU_MIGRATE);
 	/*
 	 * Retake the page table lock to check for racing updates
 	 * before the page tables are altered
@@ -3340,7 +3344,8 @@ retry_avoidcopy:
 		new_page = old_page;
 	}
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
+					  MMU_MIGRATE);
 out_release_all:
 	page_cache_release(new_page);
 out_release_old:
@@ -3822,7 +3827,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	BUG_ON(address >= end);
 	flush_cache_range(vma, address, end);
 
-	mmu_notifier_invalidate_range_start(mm, start, end);
+	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MPROT);
 	i_mmap_lock_write(vma->vm_file->f_mapping);
 	for (; address < end; address += huge_page_size(h)) {
 		spinlock_t *ptl;
@@ -3872,7 +3877,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	flush_tlb_range(vma, start, end);
 	mmu_notifier_invalidate_range(mm, start, end);
 	i_mmap_unlock_write(vma->vm_file->f_mapping);
-	mmu_notifier_invalidate_range_end(mm, start, end);
+	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MPROT);
 
 	return pages << h->order;
 }
diff --git a/mm/ksm.c b/mm/ksm.c
index 7ee101e..eb1b2b5 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -872,7 +872,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 
 	mmun_start = addr;
 	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+					    MMU_KSM_WRITE_PROTECT);
 
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -904,7 +905,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 		if (pte_dirty(entry))
 			set_page_dirty(page);
 		entry = pte_mkclean(pte_wrprotect(entry));
-		set_pte_at_notify(mm, addr, ptep, entry);
+		set_pte_at_notify(mm, addr, ptep, entry, MMU_KSM_WRITE_PROTECT);
 	}
 	*orig_pte = *ptep;
 	err = 0;
@@ -912,7 +913,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 out_unlock:
 	pte_unmap_unlock(ptep, ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
+					  MMU_KSM_WRITE_PROTECT);
 out:
 	return err;
 }
@@ -948,7 +950,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 
 	mmun_start = addr;
 	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+					    MMU_MIGRATE);
 
 	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	if (!pte_same(*ptep, orig_pte)) {
@@ -961,7 +964,9 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 
 	flush_cache_page(vma, addr, pte_pfn(*ptep));
 	ptep_clear_flush_notify(vma, addr, ptep);
-	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+	set_pte_at_notify(mm, addr, ptep,
+			  mk_pte(kpage, vma->vm_page_prot),
+			  MMU_MIGRATE);
 
 	page_remove_rmap(page);
 	if (!page_mapped(page))
@@ -971,7 +976,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	pte_unmap_unlock(ptep, ptl);
 	err = 0;
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
+					  MMU_MIGRATE);
 out:
 	return err;
 }
diff --git a/mm/memory.c b/mm/memory.c
index deb679c..8281b4b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1049,7 +1049,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	mmun_end   = end;
 	if (is_cow)
 		mmu_notifier_invalidate_range_start(src_mm, mmun_start,
-						    mmun_end);
+						    mmun_end, MMU_FORK);
 
 	ret = 0;
 	dst_pgd = pgd_offset(dst_mm, addr);
@@ -1066,7 +1066,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
 
 	if (is_cow)
-		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(src_mm, mmun_start,
+						  mmun_end, MMU_FORK);
 	return ret;
 }
 
@@ -1336,10 +1337,12 @@ void unmap_vmas(struct mmu_gather *tlb,
 {
 	struct mm_struct *mm = vma->vm_mm;
 
-	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
+	mmu_notifier_invalidate_range_start(mm, start_addr,
+					    end_addr, MMU_MUNMAP);
 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
 		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
-	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
+	mmu_notifier_invalidate_range_end(mm, start_addr,
+					  end_addr, MMU_MUNMAP);
 }
 
 /**
@@ -1361,10 +1364,10 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, start, end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, start, end);
+	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MIGRATE);
 	for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
 		unmap_single_vma(&tlb, vma, start, end, details);
-	mmu_notifier_invalidate_range_end(mm, start, end);
+	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MIGRATE);
 	tlb_finish_mmu(&tlb, start, end);
 }
 
@@ -1387,9 +1390,9 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, address, end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, address, end);
+	mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
 	unmap_single_vma(&tlb, vma, address, end, details);
-	mmu_notifier_invalidate_range_end(mm, address, end);
+	mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
 	tlb_finish_mmu(&tlb, address, end);
 }
 
@@ -2088,7 +2091,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	__SetPageUptodate(new_page);
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 
 	/*
 	 * Re-check the pte - we dropped the lock
@@ -2121,7 +2125,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * mmu page tables (such as kvm shadow page tables), we want the
 		 * new page to be mapped directly into the secondary page table.
 		 */
-		set_pte_at_notify(mm, address, page_table, entry);
+		set_pte_at_notify(mm, address, page_table, entry, MMU_MIGRATE);
 		update_mmu_cache(vma, address, page_table);
 		if (old_page) {
 			/*
@@ -2160,7 +2164,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 		page_cache_release(new_page);
 
 	pte_unmap_unlock(page_table, ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 	if (old_page) {
 		/*
 		 * Don't let another task, with possibly unlocked vma,
diff --git a/mm/migrate.c b/mm/migrate.c
index 842ecd7..d49a3af 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1774,12 +1774,14 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	WARN_ON(PageLRU(new_page));
 
 	/* Recheck the target PMD */
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
 fail_putback:
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(mm, mmun_start,
+						  mmun_end, MMU_MIGRATE);
 
 		/* Reverse changes made by migrate_page_copy() */
 		if (TestClearPageActive(new_page))
@@ -1833,7 +1835,8 @@ fail_putback:
 	page_remove_rmap(page);
 
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	/* Take an "isolate" reference and put new page on the LRU. */
 	get_page(new_page);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 5fbdd36..b806bdb 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -159,8 +159,10 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
 	return young;
 }
 
-void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
-			       pte_t pte)
+void __mmu_notifier_change_pte(struct mm_struct *mm,
+			       unsigned long address,
+			       pte_t pte,
+			       enum mmu_event event)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -168,13 +170,14 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->change_pte)
-			mn->ops->change_pte(mn, mm, address, pte);
+			mn->ops->change_pte(mn, mm, address, pte, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 
 void __mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address)
+				    unsigned long address,
+				    enum mmu_event event)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -182,13 +185,16 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_page)
-			mn->ops->invalidate_page(mn, mm, address);
+			mn->ops->invalidate_page(mn, mm, address, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 
 void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+					   unsigned long start,
+					   unsigned long end,
+					   enum mmu_event event)
+
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -196,14 +202,17 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_start)
-			mn->ops->invalidate_range_start(mn, mm, start, end);
+			mn->ops->invalidate_range_start(mn, mm, start,
+							end, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
 
 void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+					 unsigned long start,
+					 unsigned long end,
+					 enum mmu_event event)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -221,7 +230,8 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 		if (mn->ops->invalidate_range)
 			mn->ops->invalidate_range(mn, mm, start, end);
 		if (mn->ops->invalidate_range_end)
-			mn->ops->invalidate_range_end(mn, mm, start, end);
+			mn->ops->invalidate_range_end(mn, mm, start,
+						      end, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ef5be8e..f63b022 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -155,7 +155,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		/* invoke the mmu notifier if the pmd is populated */
 		if (!mni_start) {
 			mni_start = addr;
-			mmu_notifier_invalidate_range_start(mm, mni_start, end);
+			mmu_notifier_invalidate_range_start(mm, mni_start,
+							    end, MMU_MPROT);
 		}
 
 		if (pmd_trans_huge(*pmd)) {
@@ -183,7 +184,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	} while (pmd++, addr = next, addr != end);
 
 	if (mni_start)
-		mmu_notifier_invalidate_range_end(mm, mni_start, end);
+		mmu_notifier_invalidate_range_end(mm, mni_start, end,
+						  MMU_MPROT);
 
 	if (nr_huge_updates)
 		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
diff --git a/mm/mremap.c b/mm/mremap.c
index 5a71cce..eea73a3 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -176,7 +176,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 
 	mmun_start = old_addr;
 	mmun_end   = old_end;
-	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
 		cond_resched();
@@ -228,7 +229,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	if (likely(need_flush))
 		flush_tlb_range(vma, old_end-len, old_addr);
 
-	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	return len + old_addr - old_end;	/* how much done */
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index f5b5c1f..8ff1e3b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1000,7 +1000,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 	pte_unmap_unlock(pte, ptl);
 
 	if (ret) {
-		mmu_notifier_invalidate_page(mm, address);
+		mmu_notifier_invalidate_page(mm, address, MMU_WRITE_BACK);
 		(*cleaned)++;
 	}
 out:
@@ -1420,7 +1420,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret != SWAP_FAIL && !(flags & TTU_MUNLOCK))
-		mmu_notifier_invalidate_page(mm, address);
+		mmu_notifier_invalidate_page(mm, address, MMU_MIGRATE);
 out:
 	return ret;
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8db1d93..9f6acd0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -269,7 +269,8 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 
 static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
-					     unsigned long address)
+					     unsigned long address,
+					     enum mmu_event event)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush, idx;
@@ -311,7 +312,8 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 					struct mm_struct *mm,
 					unsigned long address,
-					pte_t pte)
+					pte_t pte,
+					enum mmu_event event)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int idx;
@@ -327,7 +329,8 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
 						    unsigned long start,
-						    unsigned long end)
+						    unsigned long end,
+						    enum mmu_event event)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush = 0, idx;
@@ -353,7 +356,8 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 						  struct mm_struct *mm,
 						  unsigned long start,
-						  unsigned long end)
+						  unsigned long end,
+						  enum mmu_event event)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v11 02/15] mmu_notifier: keep track of active invalidation ranges v5
  2015-10-21 20:59 ` Jérôme Glisse
@ 2015-10-21 20:59   ` Jérôme Glisse
  -1 siblings, 0 replies; 42+ messages in thread
From: Jérôme Glisse @ 2015-10-21 20:59 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse

The invalidate_range_start() and invalidate_range_end() callbacks can
be considered as forming an "atomic" section from the CPU page table
update point of view. Between these two functions the CPU page table
content is unreliable for the address range being invalidated.

This patch uses a structure, defined at every place doing a range
invalidation, to describe the range. The structure is added to a list
for the duration of the update, i.e. added by invalidate_range_start()
and removed by invalidate_range_end().

Helpers allow querying whether a range is valid (no invalidation in
progress) and waiting for it to become valid if necessary.

For proper synchronization, users must block any new range invalidation
from inside their invalidate_range_start() callback. Otherwise there is
no guarantee that a new range invalidation will not be added after the
call to the helper function that queries for existing ranges.
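
As an illustration only (the foo_* names are made up and the helper
semantics are assumed from the description above, not taken from this
patch), a range invalidation now carries a struct mmu_notifier_range,
and a mirror user can check for overlapping invalidations before
touching its secondary page table:

static void foo_invalidate(struct mm_struct *mm,
			   unsigned long start, unsigned long end)
{
	struct mmu_notifier_range range;

	range.start = start;
	range.end = end;
	/* MMU_MIGRATE is the conservative default event. */
	range.event = MMU_MIGRATE;

	mmu_notifier_invalidate_range_start(mm, &range);
	/* ... update the CPU page table for [start, end) ... */
	mmu_notifier_invalidate_range_end(mm, &range);
}

static int foo_device_fault(struct mm_struct *mm, unsigned long addr)
{
	/*
	 * Assumed semantics: range_inactive() returns true when no
	 * invalidation overlaps [addr, addr + PAGE_SIZE), and
	 * range_wait_active() waits for overlapping invalidations to
	 * finish. Per the note above, the driver must block new
	 * invalidations from its own invalidate_range_start() callback
	 * for this check to remain meaningful.
	 */
	if (!mmu_notifier_range_inactive(mm, addr, addr + PAGE_SIZE))
		mmu_notifier_range_wait_active(mm, addr, addr + PAGE_SIZE);

	/* ... now safe to (re)build the device page table entry ... */
	return 0;
}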

Changed since v1:
  - Fix a possible deadlock in mmu_notifier_range_wait_active()

Changed since v2:
  - Add the range to invalid range list before calling ->range_start().
  - Del the range from invalid range list after calling ->range_end().
  - Remove useless list initialization.

Changed since v3:
  - Improved commit message.
  - Added comments to explain how the helper functions are supposed to be used.
  - English syntax fixes.

Changed since v4:
  - Syntax fixes.
  - Renamed range_*_valid to range_*active|inactive.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Haggai Eran <haggaie@mellanox.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  |  13 ++--
 drivers/gpu/drm/i915/i915_gem_userptr.c |  10 +--
 drivers/gpu/drm/radeon/radeon_mn.c      |  16 ++--
 drivers/infiniband/core/umem_odp.c      |  20 ++---
 drivers/misc/sgi-gru/grutlbpurge.c      |  15 ++--
 drivers/xen/gntdev.c                    |  15 ++--
 fs/proc/task_mmu.c                      |  11 ++-
 include/linux/mmu_notifier.h            |  55 +++++++-------
 kernel/events/uprobes.c                 |  13 ++--
 mm/huge_memory.c                        |  72 ++++++++----------
 mm/hugetlb.c                            |  55 +++++++-------
 mm/ksm.c                                |  28 +++----
 mm/memory.c                             |  72 ++++++++++--------
 mm/migrate.c                            |  36 ++++-----
 mm/mmu_notifier.c                       | 128 +++++++++++++++++++++++++++++---
 mm/mprotect.c                           |  18 +++--
 mm/mremap.c                             |  14 ++--
 virt/kvm/kvm_main.c                     |  14 ++--
 18 files changed, 350 insertions(+), 255 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index 7ca805c..7c9eb1b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -119,27 +119,24 @@ static void amdgpu_mn_release(struct mmu_notifier *mn,
  * unmap them by move them into system domain again.
  */
 static void amdgpu_mn_invalidate_range_start(struct mmu_notifier *mn,
-					     struct mm_struct *mm,
-					     unsigned long start,
-					     unsigned long end,
-					     enum mmu_event event)
+					struct mm_struct *mm,
+					const struct mmu_notifier_range *range)
 {
 	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
 	struct interval_tree_node *it;
-
 	/* notification is exclusive, but interval is inclusive */
-	end -= 1;
+	unsigned long end = range->end - 1;
 
 	mutex_lock(&rmn->lock);
 
-	it = interval_tree_iter_first(&rmn->objects, start, end);
+	it = interval_tree_iter_first(&rmn->objects, range->start, end);
 	while (it) {
 		struct amdgpu_mn_node *node;
 		struct amdgpu_bo *bo;
 		long r;
 
 		node = container_of(it, struct amdgpu_mn_node, it);
-		it = interval_tree_iter_next(it, start, end);
+		it = interval_tree_iter_next(it, range->start, end);
 
 		list_for_each_entry(bo, &node->bos, mn_list) {
 
diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index adc5480..40ae9c1 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -130,17 +130,17 @@ restart:
 }
 
 static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
-						       struct mm_struct *mm,
-						       unsigned long start,
-						       unsigned long end,
-						       enum mmu_event event)
+					struct mm_struct *mm,
+					const struct mmu_notifier_range *range)
 {
 	struct i915_mmu_notifier *mn = container_of(_mn, struct i915_mmu_notifier, mn);
 	struct interval_tree_node *it = NULL;
+	unsigned long start = range->start;
 	unsigned long next = start;
+	/* interval ranges are inclusive, but invalidate range is exclusive */
+	unsigned long end = range->end - 1;
 	unsigned long serial = 0;
 
-	end--; /* interval ranges are inclusive, but invalidate range is exclusive */
 	while (next < end) {
 		struct drm_i915_gem_object *obj = NULL;
 
diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c
index 3a9615b..5276f01 100644
--- a/drivers/gpu/drm/radeon/radeon_mn.c
+++ b/drivers/gpu/drm/radeon/radeon_mn.c
@@ -112,34 +112,30 @@ static void radeon_mn_release(struct mmu_notifier *mn,
  *
  * @mn: our notifier
  * @mn: the mm this callback is about
- * @start: start of updated range
- * @end: end of updated range
+ * @range: Address range information.
  *
  * We block for all BOs between start and end to be idle and
  * unmap them by move them into system domain again.
  */
 static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
-					     struct mm_struct *mm,
-					     unsigned long start,
-					     unsigned long end,
-					     enum mmu_event event)
+					struct mm_struct *mm,
+					const struct mmu_notifier_range *range)
 {
 	struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
 	struct interval_tree_node *it;
-
 	/* notification is exclusive, but interval is inclusive */
-	end -= 1;
+	unsigned long end = range->end - 1;
 
 	mutex_lock(&rmn->lock);
 
-	it = interval_tree_iter_first(&rmn->objects, start, end);
+	it = interval_tree_iter_first(&rmn->objects, range->start, end);
 	while (it) {
 		struct radeon_mn_node *node;
 		struct radeon_bo *bo;
 		long r;
 
 		node = container_of(it, struct radeon_mn_node, it);
-		it = interval_tree_iter_next(it, start, end);
+		it = interval_tree_iter_next(it, range->start, end);
 
 		list_for_each_entry(bo, &node->bos, mn_list) {
 
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 6ed69fa..58d9a00 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -191,10 +191,8 @@ static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
 }
 
 static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
-						    struct mm_struct *mm,
-						    unsigned long start,
-						    unsigned long end,
-						    enum mmu_event event)
+					struct mm_struct *mm,
+					const struct mmu_notifier_range *range)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
 
@@ -203,8 +201,8 @@ static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
 
 	ib_ucontext_notifier_start_account(context);
 	down_read(&context->umem_rwsem);
-	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
-				      end,
+	rbt_ib_umem_for_each_in_range(&context->umem_tree, range->start,
+				      range->end,
 				      invalidate_range_start_trampoline, NULL);
 	up_read(&context->umem_rwsem);
 }
@@ -217,10 +215,8 @@ static int invalidate_range_end_trampoline(struct ib_umem *item, u64 start,
 }
 
 static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
-						  struct mm_struct *mm,
-						  unsigned long start,
-						  unsigned long end,
-						  enum mmu_event event)
+					struct mm_struct *mm,
+					const struct mmu_notifier_range *range)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
 
@@ -228,8 +224,8 @@ static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
 		return;
 
 	down_read(&context->umem_rwsem);
-	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
-				      end,
+	rbt_ib_umem_for_each_in_range(&context->umem_tree, range->start,
+				      range->end,
 				      invalidate_range_end_trampoline, NULL);
 	up_read(&context->umem_rwsem);
 	ib_ucontext_notifier_end_account(context);
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index e67fed1..44b41b7 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -221,8 +221,7 @@ void gru_flush_all_tlb(struct gru_state *gru)
  */
 static void gru_invalidate_range_start(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end,
-				       enum mmu_event event)
+				       const struct mmu_notifier_range *range)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
@@ -230,14 +229,13 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
 	STAT(mmu_invalidate_range);
 	atomic_inc(&gms->ms_range_active);
 	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx, act %d\n", gms,
-		start, end, atomic_read(&gms->ms_range_active));
-	gru_flush_tlb_range(gms, start, end - start);
+		range->start, range->end, atomic_read(&gms->ms_range_active));
+	gru_flush_tlb_range(gms, range->start, range->end - range->start);
 }
 
 static void gru_invalidate_range_end(struct mmu_notifier *mn,
-				     struct mm_struct *mm, unsigned long start,
-				     unsigned long end,
-				     enum mmu_event event)
+				     struct mm_struct *mm,
+				     const struct mmu_notifier_range *range)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
@@ -246,7 +244,8 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
 	(void)atomic_dec_and_test(&gms->ms_range_active);
 
 	wake_up_all(&gms->ms_wait_queue);
-	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx\n", gms, start, end);
+	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx\n", gms,
+		range->start, range->end);
 }
 
 static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 60491fc..71c526c 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -467,19 +467,17 @@ static void unmap_if_in_range(struct grant_map *map,
 
 static void mn_invl_range_start(struct mmu_notifier *mn,
 				struct mm_struct *mm,
-				unsigned long start,
-				unsigned long end,
-				enum mmu_event event)
+				const struct mmu_notifier_range *range)
 {
 	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
 	struct grant_map *map;
 
 	mutex_lock(&priv->lock);
 	list_for_each_entry(map, &priv->maps, next) {
-		unmap_if_in_range(map, start, end);
+		unmap_if_in_range(map, range->start, range->end);
 	}
 	list_for_each_entry(map, &priv->freeable_maps, next) {
-		unmap_if_in_range(map, start, end);
+		unmap_if_in_range(map, range->start, range->end);
 	}
 	mutex_unlock(&priv->lock);
 }
@@ -489,7 +487,12 @@ static void mn_invl_page(struct mmu_notifier *mn,
 			 unsigned long address,
 			 enum mmu_event event)
 {
-	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, event);
+	struct mmu_notifier_range range;
+
+	range.start = address;
+	range.end = address + PAGE_SIZE;
+	range.event = event;
+	mn_invl_range_start(mn, mm, &range);
 }
 
 static void mn_release(struct mmu_notifier *mn,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index a3b15d4..65ef71f 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -903,6 +903,11 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 			.mm = mm,
 			.private = &cp,
 		};
+		struct mmu_notifier_range range = {
+			.start = 0,
+			.end = ~0UL,
+			.event = MMU_CLEAR_SOFT_DIRTY,
+		};
 
 		if (type == CLEAR_REFS_MM_HIWATER_RSS) {
 			/*
@@ -929,13 +934,11 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 				downgrade_write(&mm->mmap_sem);
 				break;
 			}
-			mmu_notifier_invalidate_range_start(mm, 0, -1,
-							MMU_CLEAR_SOFT_DIRTY);
+			mmu_notifier_invalidate_range_start(mm, &range);
 		}
 		walk_page_range(0, ~0UL, &clear_refs_walk);
 		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_end(mm, 0, -1,
-							MMU_CLEAR_SOFT_DIRTY);
+			mmu_notifier_invalidate_range_end(mm, &range);
 		flush_tlb_mm(mm);
 		up_read(&mm->mmap_sem);
 out_mm:
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index e92c52e..4ac1930 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -70,6 +70,13 @@ enum mmu_event {
 	MMU_KSM_WRITE_PROTECT,
 };
 
+struct mmu_notifier_range {
+	struct list_head list;
+	unsigned long start;
+	unsigned long end;
+	enum mmu_event event;
+};
+
 #ifdef CONFIG_MMU_NOTIFIER
 
 /*
@@ -83,6 +90,12 @@ struct mmu_notifier_mm {
 	struct hlist_head list;
 	/* to serialize the list modifications and hlist_unhashed */
 	spinlock_t lock;
+	/* List of all active range invalidations. */
+	struct list_head ranges;
+	/* Number of active range invalidations. */
+	int nranges;
+	/* For threads waiting on range invalidations. */
+	wait_queue_head_t wait_queue;
 };
 
 struct mmu_notifier_ops {
@@ -213,14 +226,10 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_range_start)(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start,
-				       unsigned long end,
-				       enum mmu_event event);
+				       const struct mmu_notifier_range *range);
 	void (*invalidate_range_end)(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
-				     unsigned long start,
-				     unsigned long end,
-				     enum mmu_event event);
+				     const struct mmu_notifier_range *range);
 
 	/*
 	 * invalidate_range() is either called between
@@ -293,15 +302,17 @@ extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 					  unsigned long address,
 					  enum mmu_event event);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-						  unsigned long start,
-						  unsigned long end,
-						  enum mmu_event event);
+					struct mmu_notifier_range *range);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-						unsigned long start,
-						unsigned long end,
-						enum mmu_event event);
+					struct mmu_notifier_range *range);
 extern void __mmu_notifier_invalidate_range(struct mm_struct *mm,
 				  unsigned long start, unsigned long end);
+extern bool mmu_notifier_range_inactive(struct mm_struct *mm,
+					unsigned long start,
+					unsigned long end);
+extern void mmu_notifier_range_wait_active(struct mm_struct *mm,
+					  unsigned long start,
+					  unsigned long end);
 
 static inline void mmu_notifier_release(struct mm_struct *mm)
 {
@@ -353,21 +364,17 @@ static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-						       unsigned long start,
-						       unsigned long end,
-						       enum mmu_event event)
+					struct mmu_notifier_range *range)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_start(mm, start, end, event);
+		__mmu_notifier_invalidate_range_start(mm, range);
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-						     unsigned long start,
-						     unsigned long end,
-						     enum mmu_event event)
+					struct mmu_notifier_range *range)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_end(mm, start, end, event);
+		__mmu_notifier_invalidate_range_end(mm, range);
 }
 
 static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
@@ -531,16 +538,12 @@ static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-						       unsigned long start,
-						       unsigned long end,
-						       enum mmu_event event)
+					struct mmu_notifier_range *range)
 {
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-						     unsigned long start,
-						     unsigned long end,
-						     enum mmu_event event)
+					struct mmu_notifier_range *range)
 {
 }
 
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index eafa177..60d3d3c 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -156,9 +156,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	spinlock_t *ptl;
 	pte_t *ptep;
 	int err;
-	/* For mmu_notifiers */
-	const unsigned long mmun_start = addr;
-	const unsigned long mmun_end   = addr + PAGE_SIZE;
+	struct mmu_notifier_range range;
 	struct mem_cgroup *memcg;
 
 	err = mem_cgroup_try_charge(kpage, vma->vm_mm, GFP_KERNEL, &memcg);
@@ -168,8 +166,10 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	/* For try_to_free_swap() and munlock_vma_page() below */
 	lock_page(page);
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	range.start = addr;
+	range.end = addr + PAGE_SIZE;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
 	err = -EAGAIN;
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -203,8 +203,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	err = 0;
  unlock:
 	mem_cgroup_cancel_charge(kpage, memcg);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 	unlock_page(page);
 	return err;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2e1e746..e73c84c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1052,8 +1052,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	pmd_t _pmd;
 	int ret = 0, i;
 	struct page **pages;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 
 	pages = kmalloc(sizeof(struct page *) * HPAGE_PMD_NR,
 			GFP_KERNEL);
@@ -1091,10 +1090,10 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 		cond_resched();
 	}
 
-	mmun_start = haddr;
-	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
-					    MMU_MIGRATE);
+	range.start = haddr;
+	range.end = haddr + HPAGE_PMD_SIZE;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
 
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, orig_pmd)))
@@ -1128,8 +1127,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	page_remove_rmap(page);
 	spin_unlock(ptl);
 
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 
 	ret |= VM_FAULT_WRITE;
 	put_page(page);
@@ -1139,8 +1137,7 @@ out:
 
 out_free_pages:
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		memcg = (void *)page_private(pages[i]);
 		set_page_private(pages[i], 0);
@@ -1159,9 +1156,8 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page = NULL, *new_page;
 	struct mem_cgroup *memcg;
 	unsigned long haddr;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
 	gfp_t huge_gfp;			/* for allocation and charge */
+	struct mmu_notifier_range range;
 
 	ptl = pmd_lockptr(mm, pmd);
 	VM_BUG_ON_VMA(!vma->anon_vma, vma);
@@ -1230,10 +1226,10 @@ alloc:
 		copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
 	__SetPageUptodate(new_page);
 
-	mmun_start = haddr;
-	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
-					    MMU_MIGRATE);
+	range.start = haddr;
+	range.end = haddr + HPAGE_PMD_SIZE;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
 
 	spin_lock(ptl);
 	if (page)
@@ -1265,8 +1261,7 @@ alloc:
 	}
 	spin_unlock(ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 out:
 	return ret;
 out_unlock:
@@ -1681,12 +1676,12 @@ static int __split_huge_page_splitting(struct page *page,
 	spinlock_t *ptl;
 	pmd_t *pmd;
 	int ret = 0;
-	/* For mmu_notifiers */
-	const unsigned long mmun_start = address;
-	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
+	struct mmu_notifier_range range;
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_HUGE_PAGE_SPLIT);
+	range.start = address;
+	range.end = address + HPAGE_PMD_SIZE;
+	range.event = MMU_HUGE_PAGE_SPLIT;
+	mmu_notifier_invalidate_range_start(mm, &range);
 	pmd = page_check_address_pmd(page, mm, address,
 			PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
 	if (pmd) {
@@ -1702,8 +1697,7 @@ static int __split_huge_page_splitting(struct page *page,
 		ret = 1;
 		spin_unlock(ptl);
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_HUGE_PAGE_SPLIT);
+	mmu_notifier_invalidate_range_end(mm, &range);
 
 	return ret;
 }
@@ -2525,8 +2519,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	int isolated;
 	unsigned long hstart, hend;
 	struct mem_cgroup *memcg;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 	gfp_t gfp;
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
@@ -2571,10 +2564,10 @@ static void collapse_huge_page(struct mm_struct *mm,
 	pte = pte_offset_map(pmd, address);
 	pte_ptl = pte_lockptr(mm, pmd);
 
-	mmun_start = address;
-	mmun_end   = address + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	range.start = address;
+	range.end = address + HPAGE_PMD_SIZE;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
 	/*
 	 * After this gup_fast can't run anymore. This also removes
@@ -2584,8 +2577,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 */
 	_pmd = pmdp_collapse_flush(vma, address, pmd);
 	spin_unlock(pmd_ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 
 	spin_lock(pte_ptl);
 	isolated = __collapse_huge_page_isolate(vma, address, pte);
@@ -2976,16 +2968,15 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
 	struct page *page = NULL;
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long haddr = address & HPAGE_PMD_MASK;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 
 	BUG_ON(vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE);
 
-	mmun_start = haddr;
-	mmun_end   = haddr + HPAGE_PMD_SIZE;
+	range.start = haddr;
+	range.end = haddr + HPAGE_PMD_SIZE;
+	range.event = MMU_MIGRATE;
 again:
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_start(mm, &range);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_trans_huge(*pmd)))
 		goto unlock;
@@ -3002,8 +2993,7 @@ again:
 	}
  unlock:
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 
 	if (!page)
 		return;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 62c3ad8..dae64fd 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2968,17 +2968,16 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	int cow;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 	int ret = 0;
 
 	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
 
-	mmun_start = vma->vm_start;
-	mmun_end = vma->vm_end;
+	range.start = vma->vm_start;
+	range.end = vma->vm_end;
+	range.event = MMU_MIGRATE;
 	if (cow)
-		mmu_notifier_invalidate_range_start(src, mmun_start,
-						    mmun_end, MMU_MIGRATE);
+		mmu_notifier_invalidate_range_start(src, &range);
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
 		spinlock_t *src_ptl, *dst_ptl;
@@ -3018,8 +3017,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		} else {
 			if (cow) {
 				huge_ptep_set_wrprotect(src, addr, src_pte);
-				mmu_notifier_invalidate_range(src, mmun_start,
-								   mmun_end);
+				mmu_notifier_invalidate_range(src, range.start,
+								   range.end);
 			}
 			entry = huge_ptep_get(src_pte);
 			ptepage = pte_page(entry);
@@ -3032,8 +3031,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	}
 
 	if (cow)
-		mmu_notifier_invalidate_range_end(src, mmun_start,
-						  mmun_end, MMU_MIGRATE);
+		mmu_notifier_invalidate_range_end(src, &range);
 
 	return ret;
 }
@@ -3051,16 +3049,17 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	struct page *page;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
-	const unsigned long mmun_start = start;	/* For mmu_notifiers */
-	const unsigned long mmun_end   = end;	/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 
 	WARN_ON(!is_vm_hugetlb_page(vma));
 	BUG_ON(start & ~huge_page_mask(h));
 	BUG_ON(end & ~huge_page_mask(h));
 
+	range.start = start;
+	range.end = end;
+	range.event = MMU_MIGRATE;
 	tlb_start_vma(tlb, vma);
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_start(mm, &range);
 	address = start;
 again:
 	for (; address < end; address += sz) {
@@ -3134,8 +3133,7 @@ unlock:
 		if (address < end && !ref_page)
 			goto again;
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 	tlb_end_vma(tlb, vma);
 }
 
@@ -3240,8 +3238,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct hstate *h = hstate_vma(vma);
 	struct page *old_page, *new_page;
 	int ret = 0, outside_reserve = 0;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 
 	old_page = pte_page(pte);
 
@@ -3320,10 +3317,11 @@ retry_avoidcopy:
 	__SetPageUptodate(new_page);
 	set_page_huge_active(new_page);
 
-	mmun_start = address & huge_page_mask(h);
-	mmun_end = mmun_start + huge_page_size(h);
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
-					    MMU_MIGRATE);
+	range.start = address & huge_page_mask(h);
+	range.end = range.start + huge_page_size(h);
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
+
 	/*
 	 * Retake the page table lock to check for racing updates
 	 * before the page tables are altered
@@ -3335,7 +3333,7 @@ retry_avoidcopy:
 
 		/* Break COW */
 		huge_ptep_clear_flush(vma, address, ptep);
-		mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range(mm, range.start, range.end);
 		set_huge_pte_at(mm, address, ptep,
 				make_huge_pte(vma, new_page, 1));
 		page_remove_rmap(old_page);
@@ -3344,8 +3342,7 @@ retry_avoidcopy:
 		new_page = old_page;
 	}
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
-					  MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 out_release_all:
 	page_cache_release(new_page);
 out_release_old:
@@ -3823,11 +3820,15 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	pte_t pte;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long pages = 0;
+	struct mmu_notifier_range range;
 
 	BUG_ON(address >= end);
 	flush_cache_range(vma, address, end);
 
-	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MPROT);
+	range.start = start;
+	range.end = end;
+	range.event = MMU_MPROT;
+	mmu_notifier_invalidate_range_start(mm, &range);
 	i_mmap_lock_write(vma->vm_file->f_mapping);
 	for (; address < end; address += huge_page_size(h)) {
 		spinlock_t *ptl;
@@ -3877,7 +3878,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	flush_tlb_range(vma, start, end);
 	mmu_notifier_invalidate_range(mm, start, end);
 	i_mmap_unlock_write(vma->vm_file->f_mapping);
-	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MPROT);
+	mmu_notifier_invalidate_range_end(mm, &range);
 
 	return pages << h->order;
 }
diff --git a/mm/ksm.c b/mm/ksm.c
index eb1b2b5..e384a97 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -855,14 +855,13 @@ static inline int pages_identical(struct page *page1, struct page *page2)
 static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 			      pte_t *orig_pte)
 {
+	struct mmu_notifier_range range;
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long addr;
 	pte_t *ptep;
 	spinlock_t *ptl;
 	int swapped;
 	int err = -EFAULT;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
 
 	addr = page_address_in_vma(page, vma);
 	if (addr == -EFAULT)
@@ -870,10 +869,10 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 
 	BUG_ON(PageTransCompound(page));
 
-	mmun_start = addr;
-	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
-					    MMU_KSM_WRITE_PROTECT);
+	range.start = addr;
+	range.end = addr + PAGE_SIZE;
+	range.event = MMU_KSM_WRITE_PROTECT;
+	mmu_notifier_invalidate_range_start(mm, &range);
 
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -913,8 +912,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 out_unlock:
 	pte_unmap_unlock(ptep, ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
-					  MMU_KSM_WRITE_PROTECT);
+	mmu_notifier_invalidate_range_end(mm, &range);
 out:
 	return err;
 }
@@ -937,8 +935,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	spinlock_t *ptl;
 	unsigned long addr;
 	int err = -EFAULT;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 
 	addr = page_address_in_vma(page, vma);
 	if (addr == -EFAULT)
@@ -948,10 +945,10 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	if (!pmd)
 		goto out;
 
-	mmun_start = addr;
-	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
-					    MMU_MIGRATE);
+	range.start = addr;
+	range.end = addr + PAGE_SIZE;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
 
 	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	if (!pte_same(*ptep, orig_pte)) {
@@ -976,8 +973,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	pte_unmap_unlock(ptep, ptl);
 	err = 0;
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
-					  MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 out:
 	return err;
 }
diff --git a/mm/memory.c b/mm/memory.c
index 8281b4b..77bbbf3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1010,8 +1010,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	unsigned long next;
 	unsigned long addr = vma->vm_start;
 	unsigned long end = vma->vm_end;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 	bool is_cow;
 	int ret;
 
@@ -1045,11 +1044,11 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 * is_cow_mapping() returns true.
 	 */
 	is_cow = is_cow_mapping(vma->vm_flags);
-	mmun_start = addr;
-	mmun_end   = end;
+	range.start = addr;
+	range.end = end;
+	range.event = MMU_FORK;
 	if (is_cow)
-		mmu_notifier_invalidate_range_start(src_mm, mmun_start,
-						    mmun_end, MMU_FORK);
+		mmu_notifier_invalidate_range_start(src_mm, &range);
 
 	ret = 0;
 	dst_pgd = pgd_offset(dst_mm, addr);
@@ -1066,8 +1065,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
 
 	if (is_cow)
-		mmu_notifier_invalidate_range_end(src_mm, mmun_start,
-						  mmun_end, MMU_FORK);
+		mmu_notifier_invalidate_range_end(src_mm, &range);
 	return ret;
 }
 
@@ -1336,13 +1334,16 @@ void unmap_vmas(struct mmu_gather *tlb,
 		unsigned long end_addr)
 {
 	struct mm_struct *mm = vma->vm_mm;
+	struct mmu_notifier_range range = {
+		.start = start_addr,
+		.end = end_addr,
+		.event = MMU_MUNMAP,
+	};
 
-	mmu_notifier_invalidate_range_start(mm, start_addr,
-					    end_addr, MMU_MUNMAP);
+	mmu_notifier_invalidate_range_start(mm, &range);
 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
 		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
-	mmu_notifier_invalidate_range_end(mm, start_addr,
-					  end_addr, MMU_MUNMAP);
+	mmu_notifier_invalidate_range_end(mm, &range);
 }
 
 /**
@@ -1359,16 +1360,20 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct mmu_gather tlb;
-	unsigned long end = start + size;
+	struct mmu_notifier_range range = {
+		.start = start,
+		.end = start + size,
+		.event = MMU_MIGRATE,
+	};
 
 	lru_add_drain();
-	tlb_gather_mmu(&tlb, mm, start, end);
+	tlb_gather_mmu(&tlb, mm, start, range.end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MIGRATE);
-	for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
-		unmap_single_vma(&tlb, vma, start, end, details);
-	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MIGRATE);
-	tlb_finish_mmu(&tlb, start, end);
+	mmu_notifier_invalidate_range_start(mm, &range);
+	for ( ; vma && vma->vm_start < range.end; vma = vma->vm_next)
+		unmap_single_vma(&tlb, vma, start, range.end, details);
+	mmu_notifier_invalidate_range_end(mm, &range);
+	tlb_finish_mmu(&tlb, start, range.end);
 }
 
 /**
@@ -1385,15 +1390,19 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct mmu_gather tlb;
-	unsigned long end = address + size;
+	struct mmu_notifier_range range = {
+		.start = address,
+		.end = address + size,
+		.event = MMU_MUNMAP,
+	};
 
 	lru_add_drain();
-	tlb_gather_mmu(&tlb, mm, address, end);
+	tlb_gather_mmu(&tlb, mm, address, range.end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
-	unmap_single_vma(&tlb, vma, address, end, details);
-	mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
-	tlb_finish_mmu(&tlb, address, end);
+	mmu_notifier_invalidate_range_start(mm, &range);
+	unmap_single_vma(&tlb, vma, address, range.end, details);
+	mmu_notifier_invalidate_range_end(mm, &range);
+	tlb_finish_mmu(&tlb, address, range.end);
 }
 
 /**
@@ -2001,6 +2010,7 @@ static inline int wp_page_reuse(struct mm_struct *mm,
 	__releases(ptl)
 {
 	pte_t entry;
+
 	/*
 	 * Clear the pages cpupid information as the existing
 	 * information potentially belongs to a now completely
@@ -2068,9 +2078,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 	spinlock_t *ptl = NULL;
 	pte_t entry;
 	int page_copied = 0;
-	const unsigned long mmun_start = address & PAGE_MASK;	/* For mmu_notifiers */
-	const unsigned long mmun_end = mmun_start + PAGE_SIZE;	/* For mmu_notifiers */
 	struct mem_cgroup *memcg;
+	struct mmu_notifier_range range;
 
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
@@ -2091,8 +2100,10 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	__SetPageUptodate(new_page);
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	range.start = address & PAGE_MASK;
+	range.end = range.start + PAGE_SIZE;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
 
 	/*
 	 * Re-check the pte - we dropped the lock
@@ -2164,8 +2175,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 		page_cache_release(new_page);
 
 	pte_unmap_unlock(page_table, ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 	if (old_page) {
 		/*
 		 * Don't let another task, with possibly unlocked vma,
diff --git a/mm/migrate.c b/mm/migrate.c
index d49a3af..144a19c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1736,10 +1736,13 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	int isolated = 0;
 	struct page *new_page = NULL;
 	int page_lru = page_is_file_cache(page);
-	unsigned long mmun_start = address & HPAGE_PMD_MASK;
-	unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
+	struct mmu_notifier_range range;
 	pmd_t orig_entry;
 
+	range.start = address & HPAGE_PMD_MASK;
+	range.end = range.start + HPAGE_PMD_SIZE;
+	range.event = MMU_MIGRATE;
+
 	/*
 	 * Rate-limit the amount of data that is being migrated to a node.
 	 * Optimal placement is no good if the memory bus is saturated and
@@ -1761,7 +1764,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	}
 
 	if (mm_tlb_flush_pending(mm))
-		flush_tlb_range(vma, mmun_start, mmun_end);
+		flush_tlb_range(vma, range.start, range.end);
 
 	/* Prepare a page as a migration target */
 	__set_page_locked(new_page);
@@ -1774,14 +1777,12 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	WARN_ON(PageLRU(new_page));
 
 	/* Recheck the target PMD */
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_start(mm, &range);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
 fail_putback:
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start,
-						  mmun_end, MMU_MIGRATE);
+		mmu_notifier_invalidate_range_end(mm, &range);
 
 		/* Reverse changes made by migrate_page_copy() */
 		if (TestClearPageActive(new_page))
@@ -1814,17 +1815,17 @@ fail_putback:
 	 * The SetPageUptodate on the new page and page_add_new_anon_rmap
 	 * guarantee the copy is visible before the pagetable update.
 	 */
-	flush_cache_range(vma, mmun_start, mmun_end);
-	page_add_anon_rmap(new_page, vma, mmun_start);
-	pmdp_huge_clear_flush_notify(vma, mmun_start, pmd);
-	set_pmd_at(mm, mmun_start, pmd, entry);
-	flush_tlb_range(vma, mmun_start, mmun_end);
+	flush_cache_range(vma, range.start, range.end);
+	page_add_anon_rmap(new_page, vma, range.start);
+	pmdp_huge_clear_flush_notify(vma, range.start, pmd);
+	set_pmd_at(mm, range.start, pmd, entry);
+	flush_tlb_range(vma, range.start, range.end);
 	update_mmu_cache_pmd(vma, address, &entry);
 
 	if (page_count(page) != 2) {
-		set_pmd_at(mm, mmun_start, pmd, orig_entry);
-		flush_tlb_range(vma, mmun_start, mmun_end);
-		mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
+		set_pmd_at(mm, range.start, pmd, orig_entry);
+		flush_tlb_range(vma, range.start, range.end);
+		mmu_notifier_invalidate_range(mm, range.start, range.end);
 		update_mmu_cache_pmd(vma, address, &entry);
 		page_remove_rmap(new_page);
 		goto fail_putback;
@@ -1835,8 +1836,7 @@ fail_putback:
 	page_remove_rmap(page);
 
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 
 	/* Take an "isolate" reference and put new page on the LRU. */
 	get_page(new_page);
@@ -1861,7 +1861,7 @@ out_dropref:
 	ptl = pmd_lock(mm, pmd);
 	if (pmd_same(*pmd, entry)) {
 		entry = pmd_modify(entry, vma->vm_page_prot);
-		set_pmd_at(mm, mmun_start, pmd, entry);
+		set_pmd_at(mm, range.start, pmd, entry);
 		update_mmu_cache_pmd(vma, address, &entry);
 	}
 	spin_unlock(ptl);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index b806bdb..c43c851 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -191,28 +191,28 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 }
 
 void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-					   unsigned long start,
-					   unsigned long end,
-					   enum mmu_event event)
+					   struct mmu_notifier_range *range)
 
 {
 	struct mmu_notifier *mn;
 	int id;
 
+	spin_lock(&mm->mmu_notifier_mm->lock);
+	list_add_tail(&range->list, &mm->mmu_notifier_mm->ranges);
+	mm->mmu_notifier_mm->nranges++;
+	spin_unlock(&mm->mmu_notifier_mm->lock);
+
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_start)
-			mn->ops->invalidate_range_start(mn, mm, start,
-							end, event);
+			mn->ops->invalidate_range_start(mn, mm, range);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
 
 void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-					 unsigned long start,
-					 unsigned long end,
-					 enum mmu_event event)
+					 struct mmu_notifier_range *range)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -228,12 +228,23 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 		 * (besides the pointer check).
 		 */
 		if (mn->ops->invalidate_range)
-			mn->ops->invalidate_range(mn, mm, start, end);
+			mn->ops->invalidate_range(mn, mm,
+						  range->start, range->end);
 		if (mn->ops->invalidate_range_end)
-			mn->ops->invalidate_range_end(mn, mm, start,
-						      end, event);
+			mn->ops->invalidate_range_end(mn, mm, range);
 	}
 	srcu_read_unlock(&srcu, id);
+
+	spin_lock(&mm->mmu_notifier_mm->lock);
+	list_del_init(&range->list);
+	mm->mmu_notifier_mm->nranges--;
+	spin_unlock(&mm->mmu_notifier_mm->lock);
+
+	/*
+	 * Wake up waiters only after the callbacks have run, so the callbacks
+	 * can do their job before any of the waiters resume.
+	 */
+	wake_up(&mm->mmu_notifier_mm->wait_queue);
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_end);
 
@@ -252,6 +263,98 @@ void __mmu_notifier_invalidate_range(struct mm_struct *mm,
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range);
 
+/* mmu_notifier_range_inactive_locked() - test if range overlaps with active
+ * invalidation.
+ *
+ * @mm: The mm struct.
+ * @start: Start address of the range (inclusive).
+ * @end: End address of the range (exclusive).
+ * Returns: false if the range overlaps an active invalidation, true otherwise.
+ *
+ * This function tests whether any active invalidation conflicts with the
+ * given range ([start, end)). Active invalidations are added to a list inside
+ * __mmu_notifier_invalidate_range_start() and removed from that list inside
+ * __mmu_notifier_invalidate_range_end().
+ */
+static bool mmu_notifier_range_inactive_locked(struct mm_struct *mm,
+					       unsigned long start,
+					       unsigned long end)
+{
+	struct mmu_notifier_range *range;
+
+	list_for_each_entry(range, &mm->mmu_notifier_mm->ranges, list) {
+		if (range->end > start && range->start < end)
+			return false;
+	}
+	return true;
+}
+
+/* mmu_notifier_range_inactive() - test if range overlaps with active
+ * invalidation.
+ *
+ * @mm: The mm struct.
+ * @start: Start address of the range (inclusive).
+ * @end: End address of the range (exclusive).
+ *
+ * Same as mmu_notifier_range_inactive_locked() but takes the mmu_notifier lock.
+ */
+bool mmu_notifier_range_inactive(struct mm_struct *mm,
+				 unsigned long start,
+				 unsigned long end)
+{
+	bool valid;
+
+	spin_lock(&mm->mmu_notifier_mm->lock);
+	valid = mmu_notifier_range_inactive_locked(mm, start, end);
+	spin_unlock(&mm->mmu_notifier_mm->lock);
+	return valid;
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_range_inactive);
+
+/* mmu_notifier_range_wait_active() - wait for a range to have no conflict with
+ * active invalidation.
+ *
+ * @mm: The mm struct.
+ * @start: Start address of the range (inclusive).
+ * @end: End address of the range (exclusive).
+ *
+ * This function waits for any active range invalidation that conflicts with
+ * the given range to end.
+ *
+ * Note that by the time this function returns, a new conflicting range
+ * invalidation might have started. So you need to atomically block new
+ * invalidations and query again with mmu_notifier_range_inactive(). The call
+ * sequence should be:
+ *
+ * again:
+ * mmu_notifier_range_wait_active()
+ * // Stop new invalidations using a lock shared with your range_start callback.
+ * lock_block_new_invalidation()
+ * if (!mmu_notifier_range_inactive()) {
+ *     unlock_block_new_invalidation();
+ *     goto again;
+ * }
+ * // Here you can safely access the CPU page table for the range, knowing
+ * // that you will see valid entries and that no one can change them.
+ * unlock_block_new_invalidation()
+ */
+void mmu_notifier_range_wait_active(struct mm_struct *mm,
+				    unsigned long start,
+				    unsigned long end)
+{
+	spin_lock(&mm->mmu_notifier_mm->lock);
+	while (!mmu_notifier_range_inactive_locked(mm, start, end)) {
+		int nranges = mm->mmu_notifier_mm->nranges;
+
+		spin_unlock(&mm->mmu_notifier_mm->lock);
+		wait_event(mm->mmu_notifier_mm->wait_queue,
+			   nranges != mm->mmu_notifier_mm->nranges);
+		spin_lock(&mm->mmu_notifier_mm->lock);
+	}
+	spin_unlock(&mm->mmu_notifier_mm->lock);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_range_wait_active);
+
 static int do_mmu_notifier_register(struct mmu_notifier *mn,
 				    struct mm_struct *mm,
 				    int take_mmap_sem)
@@ -281,6 +384,9 @@ static int do_mmu_notifier_register(struct mmu_notifier *mn,
 	if (!mm_has_notifiers(mm)) {
 		INIT_HLIST_HEAD(&mmu_notifier_mm->list);
 		spin_lock_init(&mmu_notifier_mm->lock);
+		INIT_LIST_HEAD(&mmu_notifier_mm->ranges);
+		mmu_notifier_mm->nranges = 0;
+		init_waitqueue_head(&mmu_notifier_mm->wait_queue);
 
 		mm->mmu_notifier_mm = mmu_notifier_mm;
 		mmu_notifier_mm = NULL;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index f63b022..d1b6f87 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -142,7 +142,9 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	unsigned long next;
 	unsigned long pages = 0;
 	unsigned long nr_huge_updates = 0;
-	unsigned long mni_start = 0;
+	struct mmu_notifier_range range = {
+		.start = 0,
+	};
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -153,10 +155,11 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 			continue;
 
 		/* invoke the mmu notifier if the pmd is populated */
-		if (!mni_start) {
-			mni_start = addr;
-			mmu_notifier_invalidate_range_start(mm, mni_start,
-							    end, MMU_MPROT);
+		if (!range.start) {
+			range.start = addr;
+			range.end = end;
+			range.event = MMU_MPROT;
+			mmu_notifier_invalidate_range_start(mm, &range);
 		}
 
 		if (pmd_trans_huge(*pmd)) {
@@ -183,9 +186,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		pages += this_pages;
 	} while (pmd++, addr = next, addr != end);
 
-	if (mni_start)
-		mmu_notifier_invalidate_range_end(mm, mni_start, end,
-						  MMU_MPROT);
+	if (range.start)
+		mmu_notifier_invalidate_range_end(mm, &range);
 
 	if (nr_huge_updates)
 		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
diff --git a/mm/mremap.c b/mm/mremap.c
index eea73a3..eb1a43f 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -166,18 +166,17 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 		bool need_rmap_locks)
 {
 	unsigned long extent, next, old_end;
+	struct mmu_notifier_range range;
 	pmd_t *old_pmd, *new_pmd;
 	bool need_flush = false;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
 
 	old_end = old_addr + len;
 	flush_cache_range(vma, old_addr, old_end);
 
-	mmun_start = old_addr;
-	mmun_end   = old_end;
-	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	range.start = old_addr;
+	range.end = old_end;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(vma->vm_mm, &range);
 
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
 		cond_resched();
@@ -229,8 +228,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	if (likely(need_flush))
 		flush_tlb_range(vma, old_end-len, old_addr);
 
-	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(vma->vm_mm, &range);
 
 	return len + old_addr - old_end;	/* how much done */
 }
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9f6acd0..fa2418f3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -327,10 +327,8 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 }
 
 static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
-						    struct mm_struct *mm,
-						    unsigned long start,
-						    unsigned long end,
-						    enum mmu_event event)
+					struct mm_struct *mm,
+					const struct mmu_notifier_range *range)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush = 0, idx;
@@ -343,7 +341,7 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	 * count is also read inside the mmu_lock critical section.
 	 */
 	kvm->mmu_notifier_count++;
-	need_tlb_flush = kvm_unmap_hva_range(kvm, start, end);
+	need_tlb_flush = kvm_unmap_hva_range(kvm, range->start, range->end);
 	need_tlb_flush |= kvm->tlbs_dirty;
 	/* we've to flush the tlb before the pages can be freed */
 	if (need_tlb_flush)
@@ -354,10 +352,8 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 }
 
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
-						  struct mm_struct *mm,
-						  unsigned long start,
-						  unsigned long end,
-						  enum mmu_event event)
+					struct mm_struct *mm,
+					const struct mmu_notifier_range *range)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v11 02/15] mmu_notifier: keep track of active invalidation ranges v5
@ 2015-10-21 20:59   ` Jérôme Glisse
  0 siblings, 0 replies; 42+ messages in thread
From: Jérôme Glisse @ 2015-10-21 20:59 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse

The invalidate_range_start() and invalidate_range_end() callbacks can
be considered as forming an "atomic" section from the CPU page table
update point of view. Between these two functions the CPU page table
content is unreliable for the address range being invalidated.

This patch uses a structure, defined at every place doing range
invalidation. This structure is added to a list for the duration of the
update, i.e. added with invalidate_range_start() and removed with
invalidate_range_end().
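
To make this concrete, the conversion pattern applied throughout this
patch looks roughly like the sketch below (my_update_range() is a
made-up placeholder and MMU_MIGRATE is just one of the possible
events):

static void my_update_range(struct mm_struct *mm,
                            unsigned long start, unsigned long end)
{
        struct mmu_notifier_range range;

        range.start = start;
        range.end = end;
        range.event = MMU_MIGRATE;
        mmu_notifier_invalidate_range_start(mm, &range);
        /* ... update the CPU page table for [start, end) here ... */
        mmu_notifier_invalidate_range_end(mm, &range);
}

Note the range structure must stay valid until invalidate_range_end()
returns, as it sits on the mm's list of active invalidations; a stack
variable spanning both calls, as above, is enough.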

Helpers allow querying whether a range is free of active invalidations
and waiting for conflicting invalidations to finish if necessary.

For proper synchronization, users must block any new range
invalidation from inside their invalidate_range_start() callback.
Otherwise there is no guarantee that a new range invalidation will
not be added after the call to the helper function that queries for
existing ranges.
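
As a minimal driver-side sketch of that sequence (my_mirror,
my_mirror_update() and my_update_device_page_table() are hypothetical
names; only the two helpers come from this patch), assuming the
driver's invalidate_range_start() callback takes the same mirror->lock:

#include <linux/mmu_notifier.h>
#include <linux/mutex.h>

struct my_mirror {
        struct mmu_notifier mn;   /* registered against mirror->mm */
        struct mm_struct *mm;
        struct mutex lock;        /* also taken in invalidate_range_start() */
};

/* Hypothetical hook that would update the device's page table. */
static void my_update_device_page_table(struct my_mirror *mirror,
                                        unsigned long start,
                                        unsigned long end)
{
        /* device-specific work goes here */
}

static void my_mirror_update(struct my_mirror *mirror,
                             unsigned long start, unsigned long end)
{
again:
        /* Wait for conflicting invalidations already in flight to end. */
        mmu_notifier_range_wait_active(mirror->mm, start, end);

        /* Block new invalidations, then re-check in case one raced in. */
        mutex_lock(&mirror->lock);
        if (!mmu_notifier_range_inactive(mirror->mm, start, end)) {
                mutex_unlock(&mirror->lock);
                goto again;
        }

        /* CPU page table entries for [start, end) are stable here. */
        my_update_device_page_table(mirror, start, end);
        mutex_unlock(&mirror->lock);
}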

Changed since v1:
  - Fix a possible deadlock in mmu_notifier_range_wait_active()

Changed since v2:
  - Add the range to invalid range list before calling ->range_start().
  - Del the range from invalid range list after calling ->range_end().
  - Remove useless list initialization.

Changed since v3:
  - Improved commit message.
  - Added comment to explain how the helper functions are supposed to be used.
  - English syntax fixes.

Changed since v4:
  - Syntax fixes.
  - Rename from range_*_valid to range_*active|inactive.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Haggai Eran <haggaie@mellanox.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  |  13 ++--
 drivers/gpu/drm/i915/i915_gem_userptr.c |  10 +--
 drivers/gpu/drm/radeon/radeon_mn.c      |  16 ++--
 drivers/infiniband/core/umem_odp.c      |  20 ++---
 drivers/misc/sgi-gru/grutlbpurge.c      |  15 ++--
 drivers/xen/gntdev.c                    |  15 ++--
 fs/proc/task_mmu.c                      |  11 ++-
 include/linux/mmu_notifier.h            |  55 +++++++-------
 kernel/events/uprobes.c                 |  13 ++--
 mm/huge_memory.c                        |  72 ++++++++----------
 mm/hugetlb.c                            |  55 +++++++-------
 mm/ksm.c                                |  28 +++----
 mm/memory.c                             |  72 ++++++++++--------
 mm/migrate.c                            |  36 ++++-----
 mm/mmu_notifier.c                       | 128 +++++++++++++++++++++++++++++---
 mm/mprotect.c                           |  18 +++--
 mm/mremap.c                             |  14 ++--
 virt/kvm/kvm_main.c                     |  14 ++--
 18 files changed, 350 insertions(+), 255 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index 7ca805c..7c9eb1b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -119,27 +119,24 @@ static void amdgpu_mn_release(struct mmu_notifier *mn,
  * unmap them by move them into system domain again.
  */
 static void amdgpu_mn_invalidate_range_start(struct mmu_notifier *mn,
-					     struct mm_struct *mm,
-					     unsigned long start,
-					     unsigned long end,
-					     enum mmu_event event)
+					struct mm_struct *mm,
+					const struct mmu_notifier_range *range)
 {
 	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
 	struct interval_tree_node *it;
-
 	/* notification is exclusive, but interval is inclusive */
-	end -= 1;
+	unsigned long end = range->end - 1;
 
 	mutex_lock(&rmn->lock);
 
-	it = interval_tree_iter_first(&rmn->objects, start, end);
+	it = interval_tree_iter_first(&rmn->objects, range->start, end);
 	while (it) {
 		struct amdgpu_mn_node *node;
 		struct amdgpu_bo *bo;
 		long r;
 
 		node = container_of(it, struct amdgpu_mn_node, it);
-		it = interval_tree_iter_next(it, start, end);
+		it = interval_tree_iter_next(it, range->start, end);
 
 		list_for_each_entry(bo, &node->bos, mn_list) {
 
diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index adc5480..40ae9c1 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -130,17 +130,17 @@ restart:
 }
 
 static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
-						       struct mm_struct *mm,
-						       unsigned long start,
-						       unsigned long end,
-						       enum mmu_event event)
+					struct mm_struct *mm,
+					const struct mmu_notifier_range *range)
 {
 	struct i915_mmu_notifier *mn = container_of(_mn, struct i915_mmu_notifier, mn);
 	struct interval_tree_node *it = NULL;
+	unsigned long start = range->start;
 	unsigned long next = start;
+	/* interval ranges are inclusive, but invalidate range is exclusive */
+	unsigned long end = range->end - 1;
 	unsigned long serial = 0;
 
-	end--; /* interval ranges are inclusive, but invalidate range is exclusive */
 	while (next < end) {
 		struct drm_i915_gem_object *obj = NULL;
 
diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c
index 3a9615b..5276f01 100644
--- a/drivers/gpu/drm/radeon/radeon_mn.c
+++ b/drivers/gpu/drm/radeon/radeon_mn.c
@@ -112,34 +112,30 @@ static void radeon_mn_release(struct mmu_notifier *mn,
  *
  * @mn: our notifier
  * @mn: the mm this callback is about
- * @start: start of updated range
- * @end: end of updated range
+ * @range: Address range information.
  *
  * We block for all BOs between start and end to be idle and
  * unmap them by move them into system domain again.
  */
 static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
-					     struct mm_struct *mm,
-					     unsigned long start,
-					     unsigned long end,
-					     enum mmu_event event)
+					struct mm_struct *mm,
+					const struct mmu_notifier_range *range)
 {
 	struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
 	struct interval_tree_node *it;
-
 	/* notification is exclusive, but interval is inclusive */
-	end -= 1;
+	unsigned long end = range->end - 1;
 
 	mutex_lock(&rmn->lock);
 
-	it = interval_tree_iter_first(&rmn->objects, start, end);
+	it = interval_tree_iter_first(&rmn->objects, range->start, end);
 	while (it) {
 		struct radeon_mn_node *node;
 		struct radeon_bo *bo;
 		long r;
 
 		node = container_of(it, struct radeon_mn_node, it);
-		it = interval_tree_iter_next(it, start, end);
+		it = interval_tree_iter_next(it, range->start, end);
 
 		list_for_each_entry(bo, &node->bos, mn_list) {
 
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 6ed69fa..58d9a00 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -191,10 +191,8 @@ static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
 }
 
 static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
-						    struct mm_struct *mm,
-						    unsigned long start,
-						    unsigned long end,
-						    enum mmu_event event)
+					struct mm_struct *mm,
+					const struct mmu_notifier_range *range)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
 
@@ -203,8 +201,8 @@ static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
 
 	ib_ucontext_notifier_start_account(context);
 	down_read(&context->umem_rwsem);
-	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
-				      end,
+	rbt_ib_umem_for_each_in_range(&context->umem_tree, range->start,
+				      range->end,
 				      invalidate_range_start_trampoline, NULL);
 	up_read(&context->umem_rwsem);
 }
@@ -217,10 +215,8 @@ static int invalidate_range_end_trampoline(struct ib_umem *item, u64 start,
 }
 
 static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
-						  struct mm_struct *mm,
-						  unsigned long start,
-						  unsigned long end,
-						  enum mmu_event event)
+					struct mm_struct *mm,
+					const struct mmu_notifier_range *range)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
 
@@ -228,8 +224,8 @@ static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
 		return;
 
 	down_read(&context->umem_rwsem);
-	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
-				      end,
+	rbt_ib_umem_for_each_in_range(&context->umem_tree, range->start,
+				      range->end,
 				      invalidate_range_end_trampoline, NULL);
 	up_read(&context->umem_rwsem);
 	ib_ucontext_notifier_end_account(context);
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index e67fed1..44b41b7 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -221,8 +221,7 @@ void gru_flush_all_tlb(struct gru_state *gru)
  */
 static void gru_invalidate_range_start(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end,
-				       enum mmu_event event)
+				       const struct mmu_notifier_range *range)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
@@ -230,14 +229,13 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
 	STAT(mmu_invalidate_range);
 	atomic_inc(&gms->ms_range_active);
 	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx, act %d\n", gms,
-		start, end, atomic_read(&gms->ms_range_active));
-	gru_flush_tlb_range(gms, start, end - start);
+		range->start, range->end, atomic_read(&gms->ms_range_active));
+	gru_flush_tlb_range(gms, range->start, range->end - range->start);
 }
 
 static void gru_invalidate_range_end(struct mmu_notifier *mn,
-				     struct mm_struct *mm, unsigned long start,
-				     unsigned long end,
-				     enum mmu_event event)
+				     struct mm_struct *mm,
+				     const struct mmu_notifier_range *range)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
@@ -246,7 +244,8 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
 	(void)atomic_dec_and_test(&gms->ms_range_active);
 
 	wake_up_all(&gms->ms_wait_queue);
-	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx\n", gms, start, end);
+	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx\n", gms,
+		range->start, range->end);
 }
 
 static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 60491fc..71c526c 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -467,19 +467,17 @@ static void unmap_if_in_range(struct grant_map *map,
 
 static void mn_invl_range_start(struct mmu_notifier *mn,
 				struct mm_struct *mm,
-				unsigned long start,
-				unsigned long end,
-				enum mmu_event event)
+				const struct mmu_notifier_range *range)
 {
 	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
 	struct grant_map *map;
 
 	mutex_lock(&priv->lock);
 	list_for_each_entry(map, &priv->maps, next) {
-		unmap_if_in_range(map, start, end);
+		unmap_if_in_range(map, range->start, range->end);
 	}
 	list_for_each_entry(map, &priv->freeable_maps, next) {
-		unmap_if_in_range(map, start, end);
+		unmap_if_in_range(map, range->start, range->end);
 	}
 	mutex_unlock(&priv->lock);
 }
@@ -489,7 +487,12 @@ static void mn_invl_page(struct mmu_notifier *mn,
 			 unsigned long address,
 			 enum mmu_event event)
 {
-	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, event);
+	struct mmu_notifier_range range;
+
+	range.start = address;
+	range.end = address + PAGE_SIZE;
+	range.event = event;
+	mn_invl_range_start(mn, mm, &range);
 }
 
 static void mn_release(struct mmu_notifier *mn,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index a3b15d4..65ef71f 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -903,6 +903,11 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 			.mm = mm,
 			.private = &cp,
 		};
+		struct mmu_notifier_range range = {
+			.start = 0,
+			.end = ~0UL,
+			.event = MMU_CLEAR_SOFT_DIRTY,
+		};
 
 		if (type == CLEAR_REFS_MM_HIWATER_RSS) {
 			/*
@@ -929,13 +934,11 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 				downgrade_write(&mm->mmap_sem);
 				break;
 			}
-			mmu_notifier_invalidate_range_start(mm, 0, -1,
-							MMU_CLEAR_SOFT_DIRTY);
+			mmu_notifier_invalidate_range_start(mm, &range);
 		}
 		walk_page_range(0, ~0UL, &clear_refs_walk);
 		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_end(mm, 0, -1,
-							MMU_CLEAR_SOFT_DIRTY);
+			mmu_notifier_invalidate_range_end(mm, &range);
 		flush_tlb_mm(mm);
 		up_read(&mm->mmap_sem);
 out_mm:
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index e92c52e..4ac1930 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -70,6 +70,13 @@ enum mmu_event {
 	MMU_KSM_WRITE_PROTECT,
 };
 
+struct mmu_notifier_range {
+	struct list_head list;
+	unsigned long start;
+	unsigned long end;
+	enum mmu_event event;
+};
+
 #ifdef CONFIG_MMU_NOTIFIER
 
 /*
@@ -83,6 +90,12 @@ struct mmu_notifier_mm {
 	struct hlist_head list;
 	/* to serialize the list modifications and hlist_unhashed */
 	spinlock_t lock;
+	/* List of all active range invalidations. */
+	struct list_head ranges;
+	/* Number of active range invalidations. */
+	int nranges;
+	/* For threads waiting on range invalidations. */
+	wait_queue_head_t wait_queue;
 };
 
 struct mmu_notifier_ops {
@@ -213,14 +226,10 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_range_start)(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start,
-				       unsigned long end,
-				       enum mmu_event event);
+				       const struct mmu_notifier_range *range);
 	void (*invalidate_range_end)(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
-				     unsigned long start,
-				     unsigned long end,
-				     enum mmu_event event);
+				     const struct mmu_notifier_range *range);
 
 	/*
 	 * invalidate_range() is either called between
@@ -293,15 +302,17 @@ extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 					  unsigned long address,
 					  enum mmu_event event);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-						  unsigned long start,
-						  unsigned long end,
-						  enum mmu_event event);
+					struct mmu_notifier_range *range);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-						unsigned long start,
-						unsigned long end,
-						enum mmu_event event);
+					struct mmu_notifier_range *range);
 extern void __mmu_notifier_invalidate_range(struct mm_struct *mm,
 				  unsigned long start, unsigned long end);
+extern bool mmu_notifier_range_inactive(struct mm_struct *mm,
+					unsigned long start,
+					unsigned long end);
+extern void mmu_notifier_range_wait_active(struct mm_struct *mm,
+					  unsigned long start,
+					  unsigned long end);
 
 static inline void mmu_notifier_release(struct mm_struct *mm)
 {
@@ -353,21 +364,17 @@ static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-						       unsigned long start,
-						       unsigned long end,
-						       enum mmu_event event)
+					struct mmu_notifier_range *range)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_start(mm, start, end, event);
+		__mmu_notifier_invalidate_range_start(mm, range);
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-						     unsigned long start,
-						     unsigned long end,
-						     enum mmu_event event)
+					struct mmu_notifier_range *range)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_end(mm, start, end, event);
+		__mmu_notifier_invalidate_range_end(mm, range);
 }
 
 static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
@@ -531,16 +538,12 @@ static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-						       unsigned long start,
-						       unsigned long end,
-						       enum mmu_event event)
+					struct mmu_notifier_range *range)
 {
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-						     unsigned long start,
-						     unsigned long end,
-						     enum mmu_event event)
+					struct mmu_notifier_range *range)
 {
 }
 
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index eafa177..60d3d3c 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -156,9 +156,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	spinlock_t *ptl;
 	pte_t *ptep;
 	int err;
-	/* For mmu_notifiers */
-	const unsigned long mmun_start = addr;
-	const unsigned long mmun_end   = addr + PAGE_SIZE;
+	struct mmu_notifier_range range;
 	struct mem_cgroup *memcg;
 
 	err = mem_cgroup_try_charge(kpage, vma->vm_mm, GFP_KERNEL, &memcg);
@@ -168,8 +166,10 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	/* For try_to_free_swap() and munlock_vma_page() below */
 	lock_page(page);
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	range.start = addr;
+	range.end = addr + PAGE_SIZE;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
 	err = -EAGAIN;
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -203,8 +203,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	err = 0;
  unlock:
 	mem_cgroup_cancel_charge(kpage, memcg);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 	unlock_page(page);
 	return err;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2e1e746..e73c84c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1052,8 +1052,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	pmd_t _pmd;
 	int ret = 0, i;
 	struct page **pages;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 
 	pages = kmalloc(sizeof(struct page *) * HPAGE_PMD_NR,
 			GFP_KERNEL);
@@ -1091,10 +1090,10 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 		cond_resched();
 	}
 
-	mmun_start = haddr;
-	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
-					    MMU_MIGRATE);
+	range.start = haddr;
+	range.end = haddr + HPAGE_PMD_SIZE;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
 
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, orig_pmd)))
@@ -1128,8 +1127,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	page_remove_rmap(page);
 	spin_unlock(ptl);
 
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 
 	ret |= VM_FAULT_WRITE;
 	put_page(page);
@@ -1139,8 +1137,7 @@ out:
 
 out_free_pages:
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		memcg = (void *)page_private(pages[i]);
 		set_page_private(pages[i], 0);
@@ -1159,9 +1156,8 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page = NULL, *new_page;
 	struct mem_cgroup *memcg;
 	unsigned long haddr;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
 	gfp_t huge_gfp;			/* for allocation and charge */
+	struct mmu_notifier_range range;
 
 	ptl = pmd_lockptr(mm, pmd);
 	VM_BUG_ON_VMA(!vma->anon_vma, vma);
@@ -1230,10 +1226,10 @@ alloc:
 		copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
 	__SetPageUptodate(new_page);
 
-	mmun_start = haddr;
-	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
-					    MMU_MIGRATE);
+	range.start = haddr;
+	range.end = haddr + HPAGE_PMD_SIZE;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
 
 	spin_lock(ptl);
 	if (page)
@@ -1265,8 +1261,7 @@ alloc:
 	}
 	spin_unlock(ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 out:
 	return ret;
 out_unlock:
@@ -1681,12 +1676,12 @@ static int __split_huge_page_splitting(struct page *page,
 	spinlock_t *ptl;
 	pmd_t *pmd;
 	int ret = 0;
-	/* For mmu_notifiers */
-	const unsigned long mmun_start = address;
-	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
+	struct mmu_notifier_range range;
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_HUGE_PAGE_SPLIT);
+	range.start = address;
+	range.end = address + HPAGE_PMD_SIZE;
+	range.event = MMU_HUGE_PAGE_SPLIT;
+	mmu_notifier_invalidate_range_start(mm, &range);
 	pmd = page_check_address_pmd(page, mm, address,
 			PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
 	if (pmd) {
@@ -1702,8 +1697,7 @@ static int __split_huge_page_splitting(struct page *page,
 		ret = 1;
 		spin_unlock(ptl);
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_HUGE_PAGE_SPLIT);
+	mmu_notifier_invalidate_range_end(mm, &range);
 
 	return ret;
 }
@@ -2525,8 +2519,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	int isolated;
 	unsigned long hstart, hend;
 	struct mem_cgroup *memcg;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 	gfp_t gfp;
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
@@ -2571,10 +2564,10 @@ static void collapse_huge_page(struct mm_struct *mm,
 	pte = pte_offset_map(pmd, address);
 	pte_ptl = pte_lockptr(mm, pmd);
 
-	mmun_start = address;
-	mmun_end   = address + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	range.start = address;
+	range.end = address + HPAGE_PMD_SIZE;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
 	/*
 	 * After this gup_fast can't run anymore. This also removes
@@ -2584,8 +2577,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 */
 	_pmd = pmdp_collapse_flush(vma, address, pmd);
 	spin_unlock(pmd_ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 
 	spin_lock(pte_ptl);
 	isolated = __collapse_huge_page_isolate(vma, address, pte);
@@ -2976,16 +2968,15 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
 	struct page *page = NULL;
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long haddr = address & HPAGE_PMD_MASK;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 
 	BUG_ON(vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE);
 
-	mmun_start = haddr;
-	mmun_end   = haddr + HPAGE_PMD_SIZE;
+	range.start = haddr;
+	range.end = haddr + HPAGE_PMD_SIZE;
+	range.event = MMU_MIGRATE;
 again:
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_start(mm, &range);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_trans_huge(*pmd)))
 		goto unlock;
@@ -3002,8 +2993,7 @@ again:
 	}
  unlock:
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 
 	if (!page)
 		return;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 62c3ad8..dae64fd 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2968,17 +2968,16 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	int cow;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 	int ret = 0;
 
 	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
 
-	mmun_start = vma->vm_start;
-	mmun_end = vma->vm_end;
+	range.start = vma->vm_start;
+	range.end = vma->vm_end;
+	range.event = MMU_MIGRATE;
 	if (cow)
-		mmu_notifier_invalidate_range_start(src, mmun_start,
-						    mmun_end, MMU_MIGRATE);
+		mmu_notifier_invalidate_range_start(src, &range);
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
 		spinlock_t *src_ptl, *dst_ptl;
@@ -3018,8 +3017,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		} else {
 			if (cow) {
 				huge_ptep_set_wrprotect(src, addr, src_pte);
-				mmu_notifier_invalidate_range(src, mmun_start,
-								   mmun_end);
+				mmu_notifier_invalidate_range(src, range.start,
+								   range.end);
 			}
 			entry = huge_ptep_get(src_pte);
 			ptepage = pte_page(entry);
@@ -3032,8 +3031,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	}
 
 	if (cow)
-		mmu_notifier_invalidate_range_end(src, mmun_start,
-						  mmun_end, MMU_MIGRATE);
+		mmu_notifier_invalidate_range_end(src, &range);
 
 	return ret;
 }
@@ -3051,16 +3049,17 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	struct page *page;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
-	const unsigned long mmun_start = start;	/* For mmu_notifiers */
-	const unsigned long mmun_end   = end;	/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 
 	WARN_ON(!is_vm_hugetlb_page(vma));
 	BUG_ON(start & ~huge_page_mask(h));
 	BUG_ON(end & ~huge_page_mask(h));
 
+	range.start = start;
+	range.end = end;
+	range.event = MMU_MIGRATE;
 	tlb_start_vma(tlb, vma);
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_start(mm, &range);
 	address = start;
 again:
 	for (; address < end; address += sz) {
@@ -3134,8 +3133,7 @@ unlock:
 		if (address < end && !ref_page)
 			goto again;
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 	tlb_end_vma(tlb, vma);
 }
 
@@ -3240,8 +3238,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct hstate *h = hstate_vma(vma);
 	struct page *old_page, *new_page;
 	int ret = 0, outside_reserve = 0;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 
 	old_page = pte_page(pte);
 
@@ -3320,10 +3317,11 @@ retry_avoidcopy:
 	__SetPageUptodate(new_page);
 	set_page_huge_active(new_page);
 
-	mmun_start = address & huge_page_mask(h);
-	mmun_end = mmun_start + huge_page_size(h);
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
-					    MMU_MIGRATE);
+	range.start = address & huge_page_mask(h);
+	range.end = range.start + huge_page_size(h);
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
+
 	/*
 	 * Retake the page table lock to check for racing updates
 	 * before the page tables are altered
@@ -3335,7 +3333,7 @@ retry_avoidcopy:
 
 		/* Break COW */
 		huge_ptep_clear_flush(vma, address, ptep);
-		mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range(mm, range.start, range.end);
 		set_huge_pte_at(mm, address, ptep,
 				make_huge_pte(vma, new_page, 1));
 		page_remove_rmap(old_page);
@@ -3344,8 +3342,7 @@ retry_avoidcopy:
 		new_page = old_page;
 	}
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
-					  MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 out_release_all:
 	page_cache_release(new_page);
 out_release_old:
@@ -3823,11 +3820,15 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	pte_t pte;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long pages = 0;
+	struct mmu_notifier_range range;
 
 	BUG_ON(address >= end);
 	flush_cache_range(vma, address, end);
 
-	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MPROT);
+	range.start = start;
+	range.end = end;
+	range.event = MMU_MPROT;
+	mmu_notifier_invalidate_range_start(mm, &range);
 	i_mmap_lock_write(vma->vm_file->f_mapping);
 	for (; address < end; address += huge_page_size(h)) {
 		spinlock_t *ptl;
@@ -3877,7 +3878,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	flush_tlb_range(vma, start, end);
 	mmu_notifier_invalidate_range(mm, start, end);
 	i_mmap_unlock_write(vma->vm_file->f_mapping);
-	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MPROT);
+	mmu_notifier_invalidate_range_end(mm, &range);
 
 	return pages << h->order;
 }
diff --git a/mm/ksm.c b/mm/ksm.c
index eb1b2b5..e384a97 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -855,14 +855,13 @@ static inline int pages_identical(struct page *page1, struct page *page2)
 static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 			      pte_t *orig_pte)
 {
+	struct mmu_notifier_range range;
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long addr;
 	pte_t *ptep;
 	spinlock_t *ptl;
 	int swapped;
 	int err = -EFAULT;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
 
 	addr = page_address_in_vma(page, vma);
 	if (addr == -EFAULT)
@@ -870,10 +869,10 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 
 	BUG_ON(PageTransCompound(page));
 
-	mmun_start = addr;
-	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
-					    MMU_KSM_WRITE_PROTECT);
+	range.start = addr;
+	range.end = addr + PAGE_SIZE;
+	range.event = MMU_KSM_WRITE_PROTECT;
+	mmu_notifier_invalidate_range_start(mm, &range);
 
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -913,8 +912,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 out_unlock:
 	pte_unmap_unlock(ptep, ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
-					  MMU_KSM_WRITE_PROTECT);
+	mmu_notifier_invalidate_range_end(mm, &range);
 out:
 	return err;
 }
@@ -937,8 +935,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	spinlock_t *ptl;
 	unsigned long addr;
 	int err = -EFAULT;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 
 	addr = page_address_in_vma(page, vma);
 	if (addr == -EFAULT)
@@ -948,10 +945,10 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	if (!pmd)
 		goto out;
 
-	mmun_start = addr;
-	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
-					    MMU_MIGRATE);
+	range.start = addr;
+	range.end = addr + PAGE_SIZE;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
 
 	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	if (!pte_same(*ptep, orig_pte)) {
@@ -976,8 +973,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	pte_unmap_unlock(ptep, ptl);
 	err = 0;
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
-					  MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 out:
 	return err;
 }
diff --git a/mm/memory.c b/mm/memory.c
index 8281b4b..77bbbf3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1010,8 +1010,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	unsigned long next;
 	unsigned long addr = vma->vm_start;
 	unsigned long end = vma->vm_end;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 	bool is_cow;
 	int ret;
 
@@ -1045,11 +1044,11 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 * is_cow_mapping() returns true.
 	 */
 	is_cow = is_cow_mapping(vma->vm_flags);
-	mmun_start = addr;
-	mmun_end   = end;
+	range.start = addr;
+	range.end = end;
+	range.event = MMU_FORK;
 	if (is_cow)
-		mmu_notifier_invalidate_range_start(src_mm, mmun_start,
-						    mmun_end, MMU_FORK);
+		mmu_notifier_invalidate_range_start(src_mm, &range);
 
 	ret = 0;
 	dst_pgd = pgd_offset(dst_mm, addr);
@@ -1066,8 +1065,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
 
 	if (is_cow)
-		mmu_notifier_invalidate_range_end(src_mm, mmun_start,
-						  mmun_end, MMU_FORK);
+		mmu_notifier_invalidate_range_end(src_mm, &range);
 	return ret;
 }
 
@@ -1336,13 +1334,16 @@ void unmap_vmas(struct mmu_gather *tlb,
 		unsigned long end_addr)
 {
 	struct mm_struct *mm = vma->vm_mm;
+	struct mmu_notifier_range range = {
+		.start = start_addr,
+		.end = end_addr,
+		.event = MMU_MUNMAP,
+	};
 
-	mmu_notifier_invalidate_range_start(mm, start_addr,
-					    end_addr, MMU_MUNMAP);
+	mmu_notifier_invalidate_range_start(mm, &range);
 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
 		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
-	mmu_notifier_invalidate_range_end(mm, start_addr,
-					  end_addr, MMU_MUNMAP);
+	mmu_notifier_invalidate_range_end(mm, &range);
 }
 
 /**
@@ -1359,16 +1360,20 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct mmu_gather tlb;
-	unsigned long end = start + size;
+	struct mmu_notifier_range range = {
+		.start = start,
+		.end = start + size,
+		.event = MMU_MIGRATE,
+	};
 
 	lru_add_drain();
-	tlb_gather_mmu(&tlb, mm, start, end);
+	tlb_gather_mmu(&tlb, mm, start, range.end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MIGRATE);
-	for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
-		unmap_single_vma(&tlb, vma, start, end, details);
-	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MIGRATE);
-	tlb_finish_mmu(&tlb, start, end);
+	mmu_notifier_invalidate_range_start(mm, &range);
+	for ( ; vma && vma->vm_start < range.end; vma = vma->vm_next)
+		unmap_single_vma(&tlb, vma, start, range.end, details);
+	mmu_notifier_invalidate_range_end(mm, &range);
+	tlb_finish_mmu(&tlb, start, range.end);
 }
 
 /**
@@ -1385,15 +1390,19 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct mmu_gather tlb;
-	unsigned long end = address + size;
+	struct mmu_notifier_range range = {
+		.start = address,
+		.end = address + size,
+		.event = MMU_MUNMAP,
+	};
 
 	lru_add_drain();
-	tlb_gather_mmu(&tlb, mm, address, end);
+	tlb_gather_mmu(&tlb, mm, address, range.end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
-	unmap_single_vma(&tlb, vma, address, end, details);
-	mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
-	tlb_finish_mmu(&tlb, address, end);
+	mmu_notifier_invalidate_range_start(mm, &range);
+	unmap_single_vma(&tlb, vma, address, range.end, details);
+	mmu_notifier_invalidate_range_end(mm, &range);
+	tlb_finish_mmu(&tlb, address, range.end);
 }
 
 /**
@@ -2001,6 +2010,7 @@ static inline int wp_page_reuse(struct mm_struct *mm,
 	__releases(ptl)
 {
 	pte_t entry;
+
 	/*
 	 * Clear the pages cpupid information as the existing
 	 * information potentially belongs to a now completely
@@ -2068,9 +2078,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 	spinlock_t *ptl = NULL;
 	pte_t entry;
 	int page_copied = 0;
-	const unsigned long mmun_start = address & PAGE_MASK;	/* For mmu_notifiers */
-	const unsigned long mmun_end = mmun_start + PAGE_SIZE;	/* For mmu_notifiers */
 	struct mem_cgroup *memcg;
+	struct mmu_notifier_range range;
 
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
@@ -2091,8 +2100,10 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	__SetPageUptodate(new_page);
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	range.start = address & PAGE_MASK;
+	range.end = range.start + PAGE_SIZE;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
 
 	/*
 	 * Re-check the pte - we dropped the lock
@@ -2164,8 +2175,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 		page_cache_release(new_page);
 
 	pte_unmap_unlock(page_table, ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 	if (old_page) {
 		/*
 		 * Don't let another task, with possibly unlocked vma,
diff --git a/mm/migrate.c b/mm/migrate.c
index d49a3af..144a19c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1736,10 +1736,13 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	int isolated = 0;
 	struct page *new_page = NULL;
 	int page_lru = page_is_file_cache(page);
-	unsigned long mmun_start = address & HPAGE_PMD_MASK;
-	unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
+	struct mmu_notifier_range range;
 	pmd_t orig_entry;
 
+	range.start = address & HPAGE_PMD_MASK;
+	range.end = range.start + HPAGE_PMD_SIZE;
+	range.event = MMU_MIGRATE;
+
 	/*
 	 * Rate-limit the amount of data that is being migrated to a node.
 	 * Optimal placement is no good if the memory bus is saturated and
@@ -1761,7 +1764,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	}
 
 	if (mm_tlb_flush_pending(mm))
-		flush_tlb_range(vma, mmun_start, mmun_end);
+		flush_tlb_range(vma, range.start, range.end);
 
 	/* Prepare a page as a migration target */
 	__set_page_locked(new_page);
@@ -1774,14 +1777,12 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	WARN_ON(PageLRU(new_page));
 
 	/* Recheck the target PMD */
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_start(mm, &range);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
 fail_putback:
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start,
-						  mmun_end, MMU_MIGRATE);
+		mmu_notifier_invalidate_range_end(mm, &range);
 
 		/* Reverse changes made by migrate_page_copy() */
 		if (TestClearPageActive(new_page))
@@ -1814,17 +1815,17 @@ fail_putback:
 	 * The SetPageUptodate on the new page and page_add_new_anon_rmap
 	 * guarantee the copy is visible before the pagetable update.
 	 */
-	flush_cache_range(vma, mmun_start, mmun_end);
-	page_add_anon_rmap(new_page, vma, mmun_start);
-	pmdp_huge_clear_flush_notify(vma, mmun_start, pmd);
-	set_pmd_at(mm, mmun_start, pmd, entry);
-	flush_tlb_range(vma, mmun_start, mmun_end);
+	flush_cache_range(vma, range.start, range.end);
+	page_add_anon_rmap(new_page, vma, range.start);
+	pmdp_huge_clear_flush_notify(vma, range.start, pmd);
+	set_pmd_at(mm, range.start, pmd, entry);
+	flush_tlb_range(vma, range.start, range.end);
 	update_mmu_cache_pmd(vma, address, &entry);
 
 	if (page_count(page) != 2) {
-		set_pmd_at(mm, mmun_start, pmd, orig_entry);
-		flush_tlb_range(vma, mmun_start, mmun_end);
-		mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
+		set_pmd_at(mm, range.start, pmd, orig_entry);
+		flush_tlb_range(vma, range.start, range.end);
+		mmu_notifier_invalidate_range(mm, range.start, range.end);
 		update_mmu_cache_pmd(vma, address, &entry);
 		page_remove_rmap(new_page);
 		goto fail_putback;
@@ -1835,8 +1836,7 @@ fail_putback:
 	page_remove_rmap(page);
 
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 
 	/* Take an "isolate" reference and put new page on the LRU. */
 	get_page(new_page);
@@ -1861,7 +1861,7 @@ out_dropref:
 	ptl = pmd_lock(mm, pmd);
 	if (pmd_same(*pmd, entry)) {
 		entry = pmd_modify(entry, vma->vm_page_prot);
-		set_pmd_at(mm, mmun_start, pmd, entry);
+		set_pmd_at(mm, range.start, pmd, entry);
 		update_mmu_cache_pmd(vma, address, &entry);
 	}
 	spin_unlock(ptl);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index b806bdb..c43c851 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -191,28 +191,28 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 }
 
 void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-					   unsigned long start,
-					   unsigned long end,
-					   enum mmu_event event)
+					   struct mmu_notifier_range *range)
 
 {
 	struct mmu_notifier *mn;
 	int id;
 
+	spin_lock(&mm->mmu_notifier_mm->lock);
+	list_add_tail(&range->list, &mm->mmu_notifier_mm->ranges);
+	mm->mmu_notifier_mm->nranges++;
+	spin_unlock(&mm->mmu_notifier_mm->lock);
+
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_start)
-			mn->ops->invalidate_range_start(mn, mm, start,
-							end, event);
+			mn->ops->invalidate_range_start(mn, mm, range);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
 
 void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-					 unsigned long start,
-					 unsigned long end,
-					 enum mmu_event event)
+					 struct mmu_notifier_range *range)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -228,12 +228,23 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 		 * (besides the pointer check).
 		 */
 		if (mn->ops->invalidate_range)
-			mn->ops->invalidate_range(mn, mm, start, end);
+			mn->ops->invalidate_range(mn, mm,
+						  range->start, range->end);
 		if (mn->ops->invalidate_range_end)
-			mn->ops->invalidate_range_end(mn, mm, start,
-						      end, event);
+			mn->ops->invalidate_range_end(mn, mm, range);
 	}
 	srcu_read_unlock(&srcu, id);
+
+	spin_lock(&mm->mmu_notifier_mm->lock);
+	list_del_init(&range->list);
+	mm->mmu_notifier_mm->nranges--;
+	spin_unlock(&mm->mmu_notifier_mm->lock);
+
+	/*
+	 * Wake up waiters after the callbacks have run, so listeners can
+	 * finish their work before any waiter resumes.
+	 */
+	wake_up(&mm->mmu_notifier_mm->wait_queue);
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_end);
 
@@ -252,6 +263,98 @@ void __mmu_notifier_invalidate_range(struct mm_struct *mm,
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range);
 
+/* mmu_notifier_range_inactive_locked() - test if range overlaps with active
+ * invalidation.
+ *
+ * @mm: The mm struct.
+ * @start: Start address of the range (inclusive).
+ * @end: End address of the range (exclusive).
+ * Returns: false if overlaps with an active invalidation, true otherwise.
+ *
+ * This function tests whether any active invalidation range conflicts with a
+ * given range [start, end). Active invalidations are added to a list inside
+ * __mmu_notifier_invalidate_range_start() and removed from that list inside
+ * __mmu_notifier_invalidate_range_end().
+ */
+static bool mmu_notifier_range_inactive_locked(struct mm_struct *mm,
+					       unsigned long start,
+					       unsigned long end)
+{
+	struct mmu_notifier_range *range;
+
+	list_for_each_entry(range, &mm->mmu_notifier_mm->ranges, list) {
+		if (range->end > start && range->start < end)
+			return false;
+	}
+	return true;
+}
+
+/* mmu_notifier_range_inactive() - test if range overlaps with active
+ * invalidation.
+ *
+ * @mm: The mm struct.
+ * @start: Start address of the range (inclusive).
+ * @end: End address of the range (exclusive).
+ *
+ * Same as mmu_notifier_range_inactive_locked() but takes the mmu_notifier lock.
+ */
+bool mmu_notifier_range_inactive(struct mm_struct *mm,
+				 unsigned long start,
+				 unsigned long end)
+{
+	bool valid;
+
+	spin_lock(&mm->mmu_notifier_mm->lock);
+	valid = mmu_notifier_range_inactive_locked(mm, start, end);
+	spin_unlock(&mm->mmu_notifier_mm->lock);
+	return valid;
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_range_inactive);
+
+/* mmu_notifier_range_wait_active() - wait for a range to have no conflict with
+ * active invalidation.
+ *
+ * @mm: The mm struct.
+ * @start: Start address of the range (inclusive).
+ * @end: End address of the range (exclusive).
+ *
+ * This function waits for any active range invalidation that conflicts with
+ * the given range to end.
+ *
+ * Note that by the time this function returns, a new conflicting range
+ * invalidation might have started. So you need to atomically block new
+ * invalidations and check again that the range is inactive with
+ * mmu_notifier_range_inactive(). The call sequence should be:
+ *
+ * again:
+ * mmu_notifier_range_wait_active()
+ * // Stop new invalidation using common lock with your range_start callback.
+ * lock_block_new_invalidation()
+ * if (!mmu_notifier_range_inactive()) {
+ *     unlock_block_new_invalidation();
+ *     goto again;
+ * }
+ * // Here you can safely access the CPU page table for the range, knowing that
+ * // you will see valid entries and no one can change them.
+ * unlock_block_new_invalidation()
+ */
+void mmu_notifier_range_wait_active(struct mm_struct *mm,
+				    unsigned long start,
+				    unsigned long end)
+{
+	spin_lock(&mm->mmu_notifier_mm->lock);
+	while (!mmu_notifier_range_inactive_locked(mm, start, end)) {
+		int nranges = mm->mmu_notifier_mm->nranges;
+
+		spin_unlock(&mm->mmu_notifier_mm->lock);
+		wait_event(mm->mmu_notifier_mm->wait_queue,
+			   nranges != mm->mmu_notifier_mm->nranges);
+		spin_lock(&mm->mmu_notifier_mm->lock);
+	}
+	spin_unlock(&mm->mmu_notifier_mm->lock);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_range_wait_active);
+
 static int do_mmu_notifier_register(struct mmu_notifier *mn,
 				    struct mm_struct *mm,
 				    int take_mmap_sem)
@@ -281,6 +384,9 @@ static int do_mmu_notifier_register(struct mmu_notifier *mn,
 	if (!mm_has_notifiers(mm)) {
 		INIT_HLIST_HEAD(&mmu_notifier_mm->list);
 		spin_lock_init(&mmu_notifier_mm->lock);
+		INIT_LIST_HEAD(&mmu_notifier_mm->ranges);
+		mmu_notifier_mm->nranges = 0;
+		init_waitqueue_head(&mmu_notifier_mm->wait_queue);
 
 		mm->mmu_notifier_mm = mmu_notifier_mm;
 		mmu_notifier_mm = NULL;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index f63b022..d1b6f87 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -142,7 +142,9 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	unsigned long next;
 	unsigned long pages = 0;
 	unsigned long nr_huge_updates = 0;
-	unsigned long mni_start = 0;
+	struct mmu_notifier_range range = {
+		.start = 0,
+	};
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -153,10 +155,11 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 			continue;
 
 		/* invoke the mmu notifier if the pmd is populated */
-		if (!mni_start) {
-			mni_start = addr;
-			mmu_notifier_invalidate_range_start(mm, mni_start,
-							    end, MMU_MPROT);
+		if (!range.start) {
+			range.start = addr;
+			range.end = end;
+			range.event = MMU_MPROT;
+			mmu_notifier_invalidate_range_start(mm, &range);
 		}
 
 		if (pmd_trans_huge(*pmd)) {
@@ -183,9 +186,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		pages += this_pages;
 	} while (pmd++, addr = next, addr != end);
 
-	if (mni_start)
-		mmu_notifier_invalidate_range_end(mm, mni_start, end,
-						  MMU_MPROT);
+	if (range.start)
+		mmu_notifier_invalidate_range_end(mm, &range);
 
 	if (nr_huge_updates)
 		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
diff --git a/mm/mremap.c b/mm/mremap.c
index eea73a3..eb1a43f 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -166,18 +166,17 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 		bool need_rmap_locks)
 {
 	unsigned long extent, next, old_end;
+	struct mmu_notifier_range range;
 	pmd_t *old_pmd, *new_pmd;
 	bool need_flush = false;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
 
 	old_end = old_addr + len;
 	flush_cache_range(vma, old_addr, old_end);
 
-	mmun_start = old_addr;
-	mmun_end   = old_end;
-	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	range.start = old_addr;
+	range.end = old_end;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(vma->vm_mm, &range);
 
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
 		cond_resched();
@@ -229,8 +228,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	if (likely(need_flush))
 		flush_tlb_range(vma, old_end-len, old_addr);
 
-	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(vma->vm_mm, &range);
 
 	return len + old_addr - old_end;	/* how much done */
 }
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9f6acd0..fa2418f3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -327,10 +327,8 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 }
 
 static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
-						    struct mm_struct *mm,
-						    unsigned long start,
-						    unsigned long end,
-						    enum mmu_event event)
+					struct mm_struct *mm,
+					const struct mmu_notifier_range *range)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush = 0, idx;
@@ -343,7 +341,7 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	 * count is also read inside the mmu_lock critical section.
 	 */
 	kvm->mmu_notifier_count++;
-	need_tlb_flush = kvm_unmap_hva_range(kvm, start, end);
+	need_tlb_flush = kvm_unmap_hva_range(kvm, range->start, range->end);
 	need_tlb_flush |= kvm->tlbs_dirty;
 	/* we've to flush the tlb before the pages can be freed */
 	if (need_tlb_flush)
@@ -354,10 +352,8 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 }
 
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
-						  struct mm_struct *mm,
-						  unsigned long start,
-						  unsigned long end,
-						  enum mmu_event event)
+					struct mm_struct *mm,
+					const struct mmu_notifier_range *range)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v11 03/15] mmu_notifier: pass page pointer to mmu_notifier_invalidate_page() v2
  2015-10-21 20:59 ` Jérôme Glisse
@ 2015-10-21 20:59   ` Jérôme Glisse
  -1 siblings, 0 replies; 42+ messages in thread
From: Jérôme Glisse @ 2015-10-21 20:59 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse

A listener of mm events might not have an easy way to get the struct page
behind an address invalidated with the mmu_notifier_invalidate_page()
function, as the callback happens after the CPU page table has been
cleared or updated. This is the case, for instance, when the listener
stores a DMA mapping inside its secondary page table. To avoid a complex
reverse DMA-mapping lookup, just pass along a pointer to the page being
invalidated.
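
As an illustration only (a sketch, not part of this patch), a listener that
caches one DMA mapping per page could use the new argument roughly as below.
The my_mirror structure and its helpers are hypothetical:

	/* Hypothetical listener; assumes a struct my_mirror embedding its mmu_notifier as "mn". */
	static void my_mirror_invalidate_page(struct mmu_notifier *mn,
					      struct mm_struct *mm,
					      unsigned long address,
					      struct page *page,
					      enum mmu_event event)
	{
		struct my_mirror *mirror = container_of(mn, struct my_mirror, mn);

		/* The page pointer directly identifies the DMA mapping to drop. */
		my_mirror_unmap_dma(mirror, page);
		/* Invalidate the matching entry in the secondary page table. */
		my_mirror_flush_secondary_tlb(mirror, address, address + PAGE_SIZE);
	}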

Changed since v1:
  - English syntax fixes.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 drivers/infiniband/core/umem_odp.c | 1 +
 drivers/iommu/amd_iommu_v2.c       | 1 +
 drivers/misc/sgi-gru/grutlbpurge.c | 1 +
 drivers/xen/gntdev.c               | 1 +
 include/linux/mmu_notifier.h       | 6 +++++-
 mm/mmu_notifier.c                  | 3 ++-
 mm/rmap.c                          | 4 ++--
 virt/kvm/kvm_main.c                | 1 +
 8 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 58d9a00..0541761 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -166,6 +166,7 @@ static int invalidate_page_trampoline(struct ib_umem *item, u64 start,
 static void ib_umem_notifier_invalidate_page(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
 					     unsigned long address,
+					     struct page *page,
 					     enum mmu_event event)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 52f7d64..69f0f7c 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -393,6 +393,7 @@ static int mn_clear_flush_young(struct mmu_notifier *mn,
 static void mn_invalidate_page(struct mmu_notifier *mn,
 			       struct mm_struct *mm,
 			       unsigned long address,
+			       struct page *page,
 			       enum mmu_event event)
 {
 	__mn_flush_page(mn, address);
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index 44b41b7..c7659b76 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -250,6 +250,7 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
 
 static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
 				unsigned long address,
+				struct page *page,
 				enum mmu_event event)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 71c526c..2782c7c 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -485,6 +485,7 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
 static void mn_invl_page(struct mmu_notifier *mn,
 			 struct mm_struct *mm,
 			 unsigned long address,
+			 struct page *page,
 			 enum mmu_event event)
 {
 	struct mmu_notifier_range range;
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 4ac1930..d9b3cf1 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -179,6 +179,7 @@ struct mmu_notifier_ops {
 	void (*invalidate_page)(struct mmu_notifier *mn,
 				struct mm_struct *mm,
 				unsigned long address,
+				struct page *page,
 				enum mmu_event event);
 
 	/*
@@ -300,6 +301,7 @@ extern void __mmu_notifier_change_pte(struct mm_struct *mm,
 				      enum mmu_event event);
 extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 					  unsigned long address,
+					  struct page *page,
 					  enum mmu_event event);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 					struct mmu_notifier_range *range);
@@ -357,10 +359,11 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm,
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
 						unsigned long address,
+						struct page *page,
 						enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_page(mm, address, event);
+		__mmu_notifier_invalidate_page(mm, address, page, event);
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
@@ -533,6 +536,7 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm,
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
 						unsigned long address,
+						struct page *page,
 						enum mmu_event event)
 {
 }
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index c43c851..316e4a9 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -177,6 +177,7 @@ void __mmu_notifier_change_pte(struct mm_struct *mm,
 
 void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 				    unsigned long address,
+				    struct page *page,
 				    enum mmu_event event)
 {
 	struct mmu_notifier *mn;
@@ -185,7 +186,7 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_page)
-			mn->ops->invalidate_page(mn, mm, address, event);
+			mn->ops->invalidate_page(mn, mm, address, page, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index 8ff1e3b..c26b76a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1000,7 +1000,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 	pte_unmap_unlock(pte, ptl);
 
 	if (ret) {
-		mmu_notifier_invalidate_page(mm, address, MMU_WRITE_BACK);
+		mmu_notifier_invalidate_page(mm, address, page, MMU_WRITE_BACK);
 		(*cleaned)++;
 	}
 out:
@@ -1420,7 +1420,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret != SWAP_FAIL && !(flags & TTU_MUNLOCK))
-		mmu_notifier_invalidate_page(mm, address, MMU_MIGRATE);
+		mmu_notifier_invalidate_page(mm, address, page, MMU_MIGRATE);
 out:
 	return ret;
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fa2418f3..8164ce5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -270,6 +270,7 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
 					     unsigned long address,
+					     struct page *page,
 					     enum mmu_event event)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v11 04/15] mmu_notifier: allow range invalidation to exclude a specific mmu_notifier
  2015-10-21 20:59 ` Jérôme Glisse
@ 2015-10-21 20:59   ` Jérôme Glisse
  -1 siblings, 0 replies; 42+ messages in thread
From: Jérôme Glisse @ 2015-10-21 20:59 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse

This patch allows invalidating a range while excluding the callback to a
specific mmu_notifier, so that a subsystem can invalidate a range for
everyone but itself.
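
For illustration only (a sketch, not part of this patch), a subsystem that
has already updated its own mirror could invalidate a range for every other
listener while skipping its own notifier. The my_mirror structure below,
which embeds a struct mmu_notifier as its "mn" member, is hypothetical:

	static void my_mirror_sync_range(struct my_mirror *mirror,
					 struct mm_struct *mm,
					 unsigned long start,
					 unsigned long end)
	{
		struct mmu_notifier_range range = {
			.start = start,
			.end = end,
			.event = MMU_MIGRATE,
		};

		/* Notify every listener except our own mmu_notifier. */
		mmu_notifier_invalidate_range_start_excluding(mm, &range,
							      &mirror->mn);
		/* ... update the device page table for [start, end) ... */
		mmu_notifier_invalidate_range_end_excluding(mm, &range,
							    &mirror->mn);
	}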

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 include/linux/mmu_notifier.h | 66 ++++++++++++++++++++++++++++++++++++++++----
 mm/mmu_notifier.c            | 16 +++++++++--
 2 files changed, 73 insertions(+), 9 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index d9b3cf1..42cb4ef 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -304,11 +304,15 @@ extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 					  struct page *page,
 					  enum mmu_event event);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-					struct mmu_notifier_range *range);
+					struct mmu_notifier_range *range,
+					const struct mmu_notifier *exclude);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-					struct mmu_notifier_range *range);
+					struct mmu_notifier_range *range,
+					const struct mmu_notifier *exclude);
 extern void __mmu_notifier_invalidate_range(struct mm_struct *mm,
-				  unsigned long start, unsigned long end);
+					    unsigned long start,
+					    unsigned long end,
+					    const struct mmu_notifier *exclude);
 extern bool mmu_notifier_range_inactive(struct mm_struct *mm,
 					unsigned long start,
 					unsigned long end);
@@ -370,21 +374,49 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 					struct mmu_notifier_range *range)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_start(mm, range);
+		__mmu_notifier_invalidate_range_start(mm, range, NULL);
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 					struct mmu_notifier_range *range)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_end(mm, range);
+		__mmu_notifier_invalidate_range_end(mm, range, NULL);
 }
 
 static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
 				  unsigned long start, unsigned long end)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range(mm, start, end);
+		__mmu_notifier_invalidate_range(mm, start, end, NULL);
+}
+
+static inline void mmu_notifier_invalidate_range_start_excluding(
+					struct mm_struct *mm,
+					struct mmu_notifier_range *range,
+					const struct mmu_notifier *exclude)
+{
+	if (mm_has_notifiers(mm))
+		__mmu_notifier_invalidate_range_start(mm, range, exclude);
+}
+
+static inline void mmu_notifier_invalidate_range_end_excluding(
+					struct mm_struct *mm,
+					struct mmu_notifier_range *range,
+					const struct mmu_notifier *exclude)
+{
+	if (mm_has_notifiers(mm))
+		__mmu_notifier_invalidate_range_end(mm, range, exclude);
+}
+
+static inline void mmu_notifier_invalidate_range_excluding(
+					struct mm_struct *mm,
+					unsigned long start,
+					unsigned long end,
+					const struct mmu_notifier *exclude)
+{
+	if (mm_has_notifiers(mm))
+		__mmu_notifier_invalidate_range(mm, start, end, exclude);
 }
 
 static inline void mmu_notifier_mm_init(struct mm_struct *mm)
@@ -556,6 +588,28 @@ static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
 {
 }
 
+static inline void mmu_notifier_invalidate_range_start_excluding(
+					struct mm_struct *mm,
+					struct mmu_notifier_range *range,
+					const struct mmu_notifier *exclude)
+{
+}
+
+static inline void mmu_notifier_invalidate_range_end_excluding(
+					struct mm_struct *mm,
+					struct mmu_notifier_range *range,
+					const struct mmu_notifier *exclude)
+{
+}
+
+static inline void mmu_notifier_invalidate_range_excluding(
+					struct mm_struct *mm,
+					unsigned long start,
+					unsigned long end,
+					const struct mmu_notifier *exclude)
+{
+}
+
 static inline void mmu_notifier_mm_init(struct mm_struct *mm)
 {
 }
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 316e4a9..651246f 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -192,7 +192,8 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 }
 
 void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-					   struct mmu_notifier_range *range)
+					   struct mmu_notifier_range *range,
+					   const struct mmu_notifier *exclude)
 
 {
 	struct mmu_notifier *mn;
@@ -205,6 +206,8 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
+		if (mn == exclude)
+			continue;
 		if (mn->ops->invalidate_range_start)
 			mn->ops->invalidate_range_start(mn, mm, range);
 	}
@@ -213,13 +216,16 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
 
 void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-					 struct mmu_notifier_range *range)
+					 struct mmu_notifier_range *range,
+					 const struct mmu_notifier *exclude)
 {
 	struct mmu_notifier *mn;
 	int id;
 
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
+		if (mn == exclude)
+			continue;
 		/*
 		 * Call invalidate_range here too to avoid the need for the
 		 * subsystem of having to register an invalidate_range_end
@@ -250,13 +256,17 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_end);
 
 void __mmu_notifier_invalidate_range(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+				     unsigned long start,
+				     unsigned long end,
+				     const struct mmu_notifier *exclude)
 {
 	struct mmu_notifier *mn;
 	int id;
 
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
+		if (mn == exclude)
+			continue;
 		if (mn->ops->invalidate_range)
 			mn->ops->invalidate_range(mn, mm, start, end);
 	}
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v11 04/15] mmu_notifier: allow range invalidation to exclude a specific mmu_notifier
@ 2015-10-21 20:59   ` Jérôme Glisse
  0 siblings, 0 replies; 42+ messages in thread
From: Jérôme Glisse @ 2015-10-21 20:59 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse

This patch allow to invalidate a range while excluding call to a specific
mmu_notifier which allow for a subsystem to invalidate a range for everyone
but itself.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
---
 include/linux/mmu_notifier.h | 66 ++++++++++++++++++++++++++++++++++++++++----
 mm/mmu_notifier.c            | 16 +++++++++--
 2 files changed, 73 insertions(+), 9 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index d9b3cf1..42cb4ef 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -304,11 +304,15 @@ extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 					  struct page *page,
 					  enum mmu_event event);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-					struct mmu_notifier_range *range);
+					struct mmu_notifier_range *range,
+					const struct mmu_notifier *exclude);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-					struct mmu_notifier_range *range);
+					struct mmu_notifier_range *range,
+					const struct mmu_notifier *exclude);
 extern void __mmu_notifier_invalidate_range(struct mm_struct *mm,
-				  unsigned long start, unsigned long end);
+					    unsigned long start,
+					    unsigned long end,
+					    const struct mmu_notifier *exclude);
 extern bool mmu_notifier_range_inactive(struct mm_struct *mm,
 					unsigned long start,
 					unsigned long end);
@@ -370,21 +374,49 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 					struct mmu_notifier_range *range)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_start(mm, range);
+		__mmu_notifier_invalidate_range_start(mm, range, NULL);
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 					struct mmu_notifier_range *range)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_end(mm, range);
+		__mmu_notifier_invalidate_range_end(mm, range, NULL);
 }
 
 static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
 				  unsigned long start, unsigned long end)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range(mm, start, end);
+		__mmu_notifier_invalidate_range(mm, start, end, NULL);
+}
+
+static inline void mmu_notifier_invalidate_range_start_excluding(
+					struct mm_struct *mm,
+					struct mmu_notifier_range *range,
+					const struct mmu_notifier *exclude)
+{
+	if (mm_has_notifiers(mm))
+		__mmu_notifier_invalidate_range_start(mm, range, exclude);
+}
+
+static inline void mmu_notifier_invalidate_range_end_excluding(
+					struct mm_struct *mm,
+					struct mmu_notifier_range *range,
+					const struct mmu_notifier *exclude)
+{
+	if (mm_has_notifiers(mm))
+		__mmu_notifier_invalidate_range_end(mm, range, exclude);
+}
+
+static inline void mmu_notifier_invalidate_range_excluding(
+					struct mm_struct *mm,
+					unsigned long start,
+					unsigned long end,
+					const struct mmu_notifier *exclude)
+{
+	if (mm_has_notifiers(mm))
+		__mmu_notifier_invalidate_range(mm, start, end, exclude);
 }
 
 static inline void mmu_notifier_mm_init(struct mm_struct *mm)
@@ -556,6 +588,28 @@ static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
 {
 }
 
+static inline void mmu_notifier_invalidate_range_start_excluding(
+					struct mm_struct *mm,
+					struct mmu_notifier_range *range,
+					const struct mmu_notifier *exclude)
+{
+}
+
+static inline void mmu_notifier_invalidate_range_end_excluding(
+					struct mm_struct *mm,
+					struct mmu_notifier_range *range,
+					const struct mmu_notifier *exclude)
+{
+}
+
+static inline void mmu_notifier_invalidate_range_excluding(
+					struct mm_struct *mm,
+					unsigned long start,
+					unsigned long end,
+					const struct mmu_notifier *exclude)
+{
+}
+
 static inline void mmu_notifier_mm_init(struct mm_struct *mm)
 {
 }
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 316e4a9..651246f 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -192,7 +192,8 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 }
 
 void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-					   struct mmu_notifier_range *range)
+					   struct mmu_notifier_range *range,
+					   const struct mmu_notifier *exclude)
 
 {
 	struct mmu_notifier *mn;
@@ -205,6 +206,8 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
+		if (mn == exclude)
+			continue;
 		if (mn->ops->invalidate_range_start)
 			mn->ops->invalidate_range_start(mn, mm, range);
 	}
@@ -213,13 +216,16 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
 
 void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-					 struct mmu_notifier_range *range)
+					 struct mmu_notifier_range *range,
+					 const struct mmu_notifier *exclude)
 {
 	struct mmu_notifier *mn;
 	int id;
 
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
+		if (mn == exclude)
+			continue;
 		/*
 		 * Call invalidate_range here too to avoid the need for the
 		 * subsystem of having to register an invalidate_range_end
@@ -250,13 +256,17 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_end);
 
 void __mmu_notifier_invalidate_range(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+				     unsigned long start,
+				     unsigned long end,
+				     const struct mmu_notifier *exclude)
 {
 	struct mmu_notifier *mn;
 	int id;
 
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
+		if (mn == exclude)
+			continue;
 		if (mn->ops->invalidate_range)
 			mn->ops->invalidate_range(mn, mm, start, end);
 	}
-- 
2.4.3
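
For illustration, a caller that is itself the origin of an invalidation can
use the new *_excluding variants above to skip its own notifier while every
other registered listener is still called. This is only a sketch; the names
my_update_range and my_notifier are hypothetical and not part of the patch:

static void my_update_range(struct mm_struct *mm,
			    struct mmu_notifier *my_notifier,
			    struct mmu_notifier_range *range)
{
	/* my_notifier is hypothetical: the notifier owned by this caller. */
	mmu_notifier_invalidate_range_start_excluding(mm, range, my_notifier);

	/* ... update the CPU page tables for the range here ... */

	mmu_notifier_invalidate_range_end_excluding(mm, range, my_notifier);
}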

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v11 05/15] HMM: introduce heterogeneous memory management v5.
  2015-10-21 20:59 ` Jérôme Glisse
@ 2015-10-21 21:00   ` Jérôme Glisse
  -1 siblings, 0 replies; 42+ messages in thread
From: Jérôme Glisse @ 2015-10-21 21:00 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse, Jatin Kumar

This patch only introduces the core HMM functions for registering a new
mirror and stopping a mirror, as well as registering and unregistering
an HMM device.

The lifecycle of the HMM object is handled differently than that of the
mmu_notifier because, unlike mmu_notifier, there can be concurrent calls
into HMM code from both the mm code and the device driver code. Moreover
the lifetime of HMM can be uncorrelated from the lifetime of the process
that is being mirrored (the GPU might take longer to clean up).
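
As a rough sketch of the intended usage (illustration only: the my_* names
and the my_mirror_ctx container are hypothetical and error handling is
elided), a driver registers its hmm_device once and then one hmm_mirror per
process it mirrors, from within that process:

/* Hypothetical driver glue, for illustration only. */
static const struct hmm_device_ops my_hmm_ops = {
	.release = my_release,	/* stop all device work on this mm */
	.free    = my_free,	/* mirror memory can now be freed */
};

static struct hmm_device my_hmm_device = {
	.ops = &my_hmm_ops,
};

static int __init my_driver_init(void)
{
	/* One and only one hmm_device per device driver. */
	return hmm_device_register(&my_hmm_device);
}

int my_bind_current_process(struct my_mirror_ctx *ctx)
{
	ctx->mirror.device = &my_hmm_device;
	/* Must be called from the task whose mm is to be mirrored. */
	return hmm_mirror_register(&ctx->mirror);
}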

Changed since v1:
  - Updated comment of hmm_device_register().

Changed since v2:
  - Expose struct hmm for easy access to mm struct.
  - Simplify hmm_mirror_register() arguments.
  - Removed the device name.
  - Refcount the mirror struct internally to HMM, allowing to get
    rid of the srcu and making the device driver callback error
    handling simpler.
  - Safe to call hmm_mirror_unregister() several times.
  - Rework the mmu_notifier unregistration and release callback.

Changed since v3:
  - Rework hmm_mirror lifetime rules.
  - Synchronize with the mmu_notifier srcu before dropping the mirror's
    last reference in hmm_mirror_unregister().
  - Use spinlock for device's mirror list.
  - Export mirror ref/unref functions.
  - English syntax fixes.

Changed since v4:
  - Properly reference existing hmm struct if any.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 MAINTAINERS              |   7 +
 include/linux/hmm.h      | 173 +++++++++++++++++++++
 include/linux/mm.h       |  11 ++
 include/linux/mm_types.h |  14 ++
 kernel/fork.c            |   2 +
 mm/Kconfig               |  12 ++
 mm/Makefile              |   1 +
 mm/hmm.c                 | 381 +++++++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 601 insertions(+)
 create mode 100644 include/linux/hmm.h
 create mode 100644 mm/hmm.c

diff --git a/MAINTAINERS b/MAINTAINERS
index fb7d2e4..85a8dd0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4950,6 +4950,13 @@ F:	include/uapi/linux/if_hippi.h
 F:	net/802/hippi.c
 F:	drivers/net/hippi/
 
+HMM - Heterogeneous Memory Management
+M:	Jérôme Glisse <jglisse@redhat.com>
+L:	linux-mm@kvack.org
+S:	Maintained
+F:	mm/hmm.c
+F:	include/linux/hmm.h
+
 HOST AP DRIVER
 M:	Jouni Malinen <j@w1.fi>
 L:	hostap@shmoo.com (subscribers-only)
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
new file mode 100644
index 0000000..b559c0b
--- /dev/null
+++ b/include/linux/hmm.h
@@ -0,0 +1,173 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/* This is heterogeneous memory management (HMM). In a nutshell it provides
+ * an API to mirror a process address space on a device which has its own MMU,
+ * using its own page table for the process. It supports everything except
+ * special vma.
+ *
+ * Mandatory hardware features :
+ *   - An mmu with pagetable.
+ *   - Read only flag per cpu page.
+ *   - Page fault, i.e. hardware must stop and wait for the kernel to service the fault.
+ *
+ * Optional hardware features :
+ *   - Dirty bit per cpu page.
+ *   - Access bit per cpu page.
+ *
+ * The hmm code handles all the interfacing with the core kernel mm code and
+ * provides a simple API. It supports migrating system memory to device
+ * memory and handles migration back to system memory on cpu page fault.
+ *
+ * Migrated memory is considered as swapped out from the cpu and core mm code
+ * point of view.
+ */
+#ifndef _HMM_H
+#define _HMM_H
+
+#ifdef CONFIG_HMM
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/atomic.h>
+#include <linux/mm_types.h>
+#include <linux/mmu_notifier.h>
+#include <linux/workqueue.h>
+#include <linux/mman.h>
+
+
+struct hmm_device;
+struct hmm_mirror;
+struct hmm;
+
+
+/* hmm_device - Each device must register one and only one hmm_device.
+ *
+ * The hmm_device is the link between HMM and each device driver.
+ */
+
+/* struct hmm_device_ops - HMM device operation callbacks
+ */
+struct hmm_device_ops {
+	/* release() - mirror must stop using the address space.
+	 *
+	 * @mirror: The mirror that links the process address space with the device.
+	 *
+	 * When this is called, the device driver must kill all device threads
+	 * using this mirror. It is called either from:
+	 *   - mm dying (all processes using this mm exiting).
+	 *   - hmm_mirror_unregister() (if no other thread holds a reference).
+	 *   - the outcome of some device error reported by any of the device
+	 *     callbacks against that mirror.
+	 */
+	void (*release)(struct hmm_mirror *mirror);
+
+	/* free() - mirror can be freed.
+	 *
+	 * @mirror: The mirror that links the process address space with the device.
+	 *
+	 * When this is called, the device driver can free the underlying memory
+	 * associated with that mirror. Note this is called from atomic context
+	 * so the device driver callback can not sleep.
+	 */
+	void (*free)(struct hmm_mirror *mirror);
+};
+
+
+/* struct hmm - per mm_struct HMM states.
+ *
+ * @mm: The mm struct this hmm is associated with.
+ * @mirrors: List of all mirrors for this mm (one per device).
+ * @vm_end: Last valid address for this mm (exclusive).
+ * @kref: Reference counter.
+ * @rwsem: Serialize the mirror list modifications.
+ * @mmu_notifier: The mmu_notifier of this mm.
+ * @rcu: For delayed cleanup call from mmu_notifier.release() callback.
+ *
+ * For each process address space (mm_struct) there is one and only one hmm
+ * struct. hmm functions will redispatch to each device the changes made to
+ * the process address space.
+ *
+ * Device driver must not access this structure other than for getting the
+ * mm pointer.
+ */
+struct hmm {
+	struct mm_struct	*mm;
+	struct hlist_head	mirrors;
+	unsigned long		vm_end;
+	struct kref		kref;
+	struct rw_semaphore	rwsem;
+	struct mmu_notifier	mmu_notifier;
+	struct rcu_head		rcu;
+};
+
+
+/* struct hmm_device - per device HMM structure
+ *
+ * @dev: Linux device structure pointer.
+ * @ops: The hmm operations callback.
+ * @mirrors: List of all active mirrors for the device.
+ * @lock: Lock protecting mirrors list.
+ *
+ * Each device that wants to mirror an address space must register one of
+ * these structs (only once per Linux device).
+ */
+struct hmm_device {
+	struct device			*dev;
+	const struct hmm_device_ops	*ops;
+	struct list_head		mirrors;
+	spinlock_t			lock;
+};
+
+int hmm_device_register(struct hmm_device *device);
+int hmm_device_unregister(struct hmm_device *device);
+
+
+/* hmm_mirror - device specific mirroring functions.
+ *
+ * Each device that mirrors a process has a unique hmm_mirror struct
+ * associating the process address space with the device. The same process can
+ * be mirrored by several different devices at the same time.
+ */
+
+/* struct hmm_mirror - per device and per mm HMM structure
+ *
+ * @device: The hmm_device struct this hmm_mirror is associated to.
+ * @hmm: The hmm struct this hmm_mirror is associated to.
+ * @kref: Reference counter (private to HMM do not use).
+ * @dlist: List of all hmm_mirror for same device.
+ * @mlist: List of all hmm_mirror for same process.
+ *
+ * Each device that wants to mirror an address space must register one of
+ * these structs for each address space it wants to mirror. The same device
+ * can mirror several different address spaces, and the same address space can
+ * be mirrored by different devices.
+ */
+struct hmm_mirror {
+	struct hmm_device	*device;
+	struct hmm		*hmm;
+	struct kref		kref;
+	struct list_head	dlist;
+	struct hlist_node	mlist;
+};
+
+int hmm_mirror_register(struct hmm_mirror *mirror);
+void hmm_mirror_unregister(struct hmm_mirror *mirror);
+struct hmm_mirror *hmm_mirror_ref(struct hmm_mirror *mirror);
+void hmm_mirror_unref(struct hmm_mirror **mirror);
+
+
+#endif /* CONFIG_HMM */
+#endif
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 80001de..6f967a1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2338,5 +2338,16 @@ void __init setup_nr_node_ids(void);
 static inline void setup_nr_node_ids(void) {}
 #endif
 
+#ifdef CONFIG_HMM
+static inline void hmm_mm_init(struct mm_struct *mm)
+{
+	mm->hmm = NULL;
+}
+#else /* !CONFIG_HMM */
+static inline void hmm_mm_init(struct mm_struct *mm)
+{
+}
+#endif /* !CONFIG_HMM */
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3d6baa7..993ac90 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -15,6 +15,10 @@
 #include <asm/page.h>
 #include <asm/mmu.h>
 
+#ifdef CONFIG_HMM
+struct hmm;
+#endif
+
 #ifndef AT_VECTOR_SIZE_ARCH
 #define AT_VECTOR_SIZE_ARCH 0
 #endif
@@ -453,6 +457,16 @@ struct mm_struct {
 #ifdef CONFIG_MMU_NOTIFIER
 	struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
+#ifdef CONFIG_HMM
+	/*
+	 * hmm always registers an mmu_notifier; we rely on the mmu notifier to
+	 * keep a refcount on the mm struct as well as to forbid registering hmm
+	 * on a dying mm.
+	 *
+	 * This field is set with mmap_sem held in write mode.
+	 */
+	struct hmm *hmm;
+#endif
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	pgtable_t pmd_huge_pte; /* protected by page_table_lock */
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 2845623..631c398 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -27,6 +27,7 @@
 #include <linux/binfmts.h>
 #include <linux/mman.h>
 #include <linux/mmu_notifier.h>
+#include <linux/hmm.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/vmacache.h>
@@ -603,6 +604,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 	mm_init_aio(mm);
 	mm_init_owner(mm, p);
 	mmu_notifier_mm_init(mm);
+	hmm_mm_init(mm);
 	clear_tlb_flush_pending(mm);
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	mm->pmd_huge_pte = NULL;
diff --git a/mm/Kconfig b/mm/Kconfig
index 0d9fdcd..10ed2de 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -680,3 +680,15 @@ config ZONE_DEVICE
 
 config FRAME_VECTOR
 	bool
+
+config HMM
+	bool "Enable heterogeneous memory management (HMM)"
+	depends on MMU
+	select MMU_NOTIFIER
+	default n
+	help
+	  Heterogeneous memory management provides infrastructure for a device
+	  to mirror a process address space into a hardware MMU or into anything
+	  that supports pagefault-like events.
+
+	  If unsure, say N to disable HMM.
diff --git a/mm/Makefile b/mm/Makefile
index 2ed4319..f291178 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -81,3 +81,4 @@ obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
 obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
 obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
 obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
+obj-$(CONFIG_HMM) += hmm.o
diff --git a/mm/hmm.c b/mm/hmm.c
new file mode 100644
index 0000000..8d861c4
--- /dev/null
+++ b/mm/hmm.c
@@ -0,0 +1,381 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/* This is the core code for heterogeneous memory management (HMM). HMM intends
+ * to provide helpers for mirroring a process address space on a device, as well
+ * as allowing migration of data between system memory and device memory, which
+ * is referred to as remote memory from here on out.
+ *
+ * Refer to include/linux/hmm.h for further information on general design.
+ */
+#include <linux/export.h>
+#include <linux/bitmap.h>
+#include <linux/list.h>
+#include <linux/rculist.h>
+#include <linux/slab.h>
+#include <linux/mmu_notifier.h>
+#include <linux/mm.h>
+#include <linux/hugetlb.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/ksm.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/mmu_context.h>
+#include <linux/memcontrol.h>
+#include <linux/hmm.h>
+#include <linux/wait.h>
+#include <linux/mman.h>
+#include <linux/delay.h>
+#include <linux/workqueue.h>
+
+#include "internal.h"
+
+static struct mmu_notifier_ops hmm_notifier_ops;
+
+
+/* hmm - core HMM functions.
+ *
+ * Core HMM functions that deal with all the process mm activities.
+ */
+
+static int hmm_init(struct hmm *hmm)
+{
+	hmm->mm = current->mm;
+	hmm->vm_end = TASK_SIZE;
+	kref_init(&hmm->kref);
+	INIT_HLIST_HEAD(&hmm->mirrors);
+	init_rwsem(&hmm->rwsem);
+
+	/* register notifier */
+	hmm->mmu_notifier.ops = &hmm_notifier_ops;
+	return __mmu_notifier_register(&hmm->mmu_notifier, current->mm);
+}
+
+static int hmm_add_mirror(struct hmm *hmm, struct hmm_mirror *mirror)
+{
+	struct hmm_mirror *tmp;
+
+	down_write(&hmm->rwsem);
+	hlist_for_each_entry(tmp, &hmm->mirrors, mlist)
+		if (tmp->device == mirror->device) {
+			/* Same device can mirror only once. */
+			up_write(&hmm->rwsem);
+			return -EINVAL;
+		}
+	hlist_add_head(&mirror->mlist, &hmm->mirrors);
+	hmm_mirror_ref(mirror);
+	up_write(&hmm->rwsem);
+
+	return 0;
+}
+
+static inline struct hmm *hmm_ref(struct hmm *hmm)
+{
+	if (!hmm || !kref_get_unless_zero(&hmm->kref))
+		return NULL;
+	return hmm;
+}
+
+static void hmm_destroy_delayed(struct rcu_head *rcu)
+{
+	struct hmm *hmm;
+
+	hmm = container_of(rcu, struct hmm, rcu);
+	kfree(hmm);
+}
+
+static void hmm_destroy(struct kref *kref)
+{
+	struct hmm *hmm;
+
+	hmm = container_of(kref, struct hmm, kref);
+	BUG_ON(!hlist_empty(&hmm->mirrors));
+
+	down_write(&hmm->mm->mmap_sem);
+	/* A new hmm might have been registered before reaching this point. */
+	if (hmm->mm->hmm == hmm)
+		hmm->mm->hmm = NULL;
+	up_write(&hmm->mm->mmap_sem);
+
+	mmu_notifier_unregister_no_release(&hmm->mmu_notifier, hmm->mm);
+
+	mmu_notifier_call_srcu(&hmm->rcu, &hmm_destroy_delayed);
+}
+
+static inline struct hmm *hmm_unref(struct hmm *hmm)
+{
+	if (hmm)
+		kref_put(&hmm->kref, hmm_destroy);
+	return NULL;
+}
+
+
+/* hmm_notifier - HMM callbacks for mmu_notifier tracking changes to process mm.
+ *
+ * HMM uses the mmu notifier to track changes made to the process address space.
+ */
+static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	struct hmm *hmm;
+
+	hmm = hmm_ref(container_of(mn, struct hmm, mmu_notifier));
+	if (!hmm)
+		return;
+
+	down_write(&hmm->rwsem);
+	while (hmm->mirrors.first) {
+		struct hmm_mirror *mirror;
+
+		/*
+		 * Here we are holding the mirror reference from the mirror
+		 * list. As list removal is synchronized through rwsem, no
+		 * other thread can assume it holds that reference.
+		 */
+		mirror = hlist_entry(hmm->mirrors.first,
+				     struct hmm_mirror,
+				     mlist);
+		hlist_del_init(&mirror->mlist);
+		up_write(&hmm->rwsem);
+
+		mirror->device->ops->release(mirror);
+		hmm_mirror_unref(&mirror);
+
+		down_write(&hmm->rwsem);
+	}
+	up_write(&hmm->rwsem);
+
+	hmm_unref(hmm);
+}
+
+static struct mmu_notifier_ops hmm_notifier_ops = {
+	.release		= hmm_notifier_release,
+};
+
+
+/* hmm_mirror - per device mirroring functions.
+ *
+ * Each device that mirrors a process has a unique hmm_mirror struct. A
+ * process can be mirrored by several devices at the same time.
+ *
+ * Below are all the functions and their helpers used by device drivers to
+ * mirror the process address space. Those functions either deal with updating
+ * the device page table (through hmm callbacks) or provide helper functions
+ * used by the device driver to fault in a range of memory in the device page
+ * table.
+ */
+struct hmm_mirror *hmm_mirror_ref(struct hmm_mirror *mirror)
+{
+	if (!mirror || !kref_get_unless_zero(&mirror->kref))
+		return NULL;
+	return mirror;
+}
+EXPORT_SYMBOL(hmm_mirror_ref);
+
+static void hmm_mirror_destroy(struct kref *kref)
+{
+	struct hmm_device *device;
+	struct hmm_mirror *mirror;
+
+	mirror = container_of(kref, struct hmm_mirror, kref);
+	device = mirror->device;
+
+	hmm_unref(mirror->hmm);
+
+	spin_lock(&device->lock);
+	list_del_init(&mirror->dlist);
+	device->ops->free(mirror);
+	spin_unlock(&device->lock);
+}
+
+void hmm_mirror_unref(struct hmm_mirror **mirror)
+{
+	struct hmm_mirror *tmp = mirror ? *mirror : NULL;
+
+	if (tmp) {
+		*mirror = NULL;
+		kref_put(&tmp->kref, hmm_mirror_destroy);
+	}
+}
+EXPORT_SYMBOL(hmm_mirror_unref);
+
+/* hmm_mirror_register() - register mirror against current process for a device.
+ *
+ * @mirror: The mirror struct being registered.
+ * Returns: 0 on success or -ENOMEM, -EINVAL on error.
+ *
+ * Call this when the device driver wants to start mirroring a process address
+ * space. The HMM shim will register an mmu_notifier and start monitoring
+ * process address space changes. Hence callbacks to the device driver might
+ * happen even before this function returns.
+ *
+ * The task the device driver wants to mirror must be current!
+ *
+ * Only one mirror per mm and hmm_device can be created; registration fails
+ * with -EINVAL if the hmm_device already has an hmm_mirror for the mm.
+ */
+int hmm_mirror_register(struct hmm_mirror *mirror)
+{
+	struct mm_struct *mm = current->mm;
+	struct hmm *hmm = NULL;
+	int ret = 0;
+
+	/* Sanity checks. */
+	BUG_ON(!mirror);
+	BUG_ON(!mirror->device);
+	BUG_ON(!mm);
+
+	/*
+	 * Initialize the mirror struct fields, the mlist init and del dance is
+	 * necessary to make the error path easier for driver and for hmm.
+	 */
+	kref_init(&mirror->kref);
+	INIT_HLIST_NODE(&mirror->mlist);
+	INIT_LIST_HEAD(&mirror->dlist);
+	spin_lock(&mirror->device->lock);
+	list_add(&mirror->dlist, &mirror->device->mirrors);
+	spin_unlock(&mirror->device->lock);
+
+	down_write(&mm->mmap_sem);
+
+	hmm = hmm_ref(mm->hmm);
+	if (hmm == NULL) {
+		/* no hmm registered yet so register one */
+		hmm = kzalloc(sizeof(*mm->hmm), GFP_KERNEL);
+		if (hmm == NULL) {
+			up_write(&mm->mmap_sem);
+			ret = -ENOMEM;
+			goto error;
+		}
+
+		ret = hmm_init(hmm);
+		if (ret) {
+			up_write(&mm->mmap_sem);
+			kfree(hmm);
+			goto error;
+		}
+
+		mm->hmm = hmm;
+	}
+
+	mirror->hmm = hmm;
+	ret = hmm_add_mirror(hmm, mirror);
+	up_write(&mm->mmap_sem);
+	if (ret) {
+		mirror->hmm = NULL;
+		hmm_unref(hmm);
+		goto error;
+	}
+	return 0;
+
+error:
+	spin_lock(&mirror->device->lock);
+	list_del_init(&mirror->dlist);
+	spin_unlock(&mirror->device->lock);
+	return ret;
+}
+EXPORT_SYMBOL(hmm_mirror_register);
+
+static void hmm_mirror_kill(struct hmm_mirror *mirror)
+{
+	struct hmm_device *device = mirror->device;
+	struct hmm *hmm = hmm_ref(mirror->hmm);
+
+	if (!hmm)
+		return;
+
+	down_write(&hmm->rwsem);
+	if (!hlist_unhashed(&mirror->mlist)) {
+		hlist_del_init(&mirror->mlist);
+		up_write(&hmm->rwsem);
+		device->ops->release(mirror);
+		hmm_mirror_unref(&mirror);
+	} else
+		up_write(&hmm->rwsem);
+
+	hmm_unref(hmm);
+}
+
+/* hmm_mirror_unregister() - unregister a mirror.
+ *
+ * @mirror: The mirror that links the process address space with the device.
+ *
+ * A driver can call this function when it wants to stop mirroring a process.
+ * This will trigger a call to the ->release() callback if it did not already
+ * happen.
+ *
+ * Note that the caller must hold a reference on the mirror.
+ *
+ * THIS MUST NOT BE CALLED FROM THE device->release() CALLBACK OR IT WILL
+ * DEADLOCK.
+ */
+void hmm_mirror_unregister(struct hmm_mirror *mirror)
+{
+	if (mirror == NULL)
+		return;
+
+	hmm_mirror_kill(mirror);
+	mmu_notifier_synchronize();
+	hmm_mirror_unref(&mirror);
+}
+EXPORT_SYMBOL(hmm_mirror_unregister);
+
+
+/* hmm_device - Each device driver must register one and only one hmm_device
+ *
+ * The hmm_device is the link between HMM and each device driver.
+ */
+
+/* hmm_device_register() - register a device with HMM.
+ *
+ * @device: The hmm_device struct.
+ * Returns: 0 on success or -EINVAL otherwise.
+ *
+ *
+ * Call this when the device driver wants to register itself with HMM. A device
+ * driver must only register once.
+ */
+int hmm_device_register(struct hmm_device *device)
+{
+	/* sanity check */
+	BUG_ON(!device);
+	BUG_ON(!device->ops);
+	BUG_ON(!device->ops->release);
+
+	spin_lock_init(&device->lock);
+	INIT_LIST_HEAD(&device->mirrors);
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_device_register);
+
+/* hmm_device_unregister() - unregister a device with HMM.
+ *
+ * @device: The hmm_device struct.
+ * Returns: 0 on success or -EBUSY otherwise.
+ *
+ * Call this when the device driver wants to unregister itself with HMM. This
+ * will check that there is no active mirror left and return -EBUSY if there
+ * is.
+ */
+int hmm_device_unregister(struct hmm_device *device)
+{
+	spin_lock(&device->lock);
+	if (!list_empty(&device->mirrors)) {
+		spin_unlock(&device->lock);
+		return -EBUSY;
+	}
+	spin_unlock(&device->lock);
+	return 0;
+}
+EXPORT_SYMBOL(hmm_device_unregister);
-- 
2.4.3
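
To make the ->release()/->free() contract concrete, the device side of the
sketch above could look as follows (my_mirror_ctx, which embeds the
hmm_mirror, and my_stop_device_work() are hypothetical):

/* Hypothetical callbacks, for illustration only. */
static void my_release(struct hmm_mirror *mirror)
{
	struct my_mirror_ctx *ctx;

	ctx = container_of(mirror, struct my_mirror_ctx, mirror);
	/* Kill all device work that still uses this address space. */
	my_stop_device_work(ctx);
}

static void my_free(struct hmm_mirror *mirror)
{
	struct my_mirror_ctx *ctx;

	ctx = container_of(mirror, struct my_mirror_ctx, mirror);
	/* Called from atomic context: do not sleep here. */
	kfree(ctx);
}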


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v11 06/15] HMM: add HMM page table v4.
  2015-10-21 20:59 ` Jérôme Glisse
@ 2015-10-21 21:00   ` Jérôme Glisse
  -1 siblings, 0 replies; 42+ messages in thread
From: Jérôme Glisse @ 2015-10-21 21:00 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse, Jatin Kumar

Heterogeneous memory management's main purpose is to mirror a process
address space. To do so it must maintain a secondary page table that is
used by the device driver to program the device or build a device
specific page table.

A radix tree can't be used to create this secondary page table because
HMM needs more flags than RADIX_TREE_MAX_TAGS (while this can be
increased, we believe HMM will require so many flags that the cost would
become prohibitive for other users of the radix tree).

Moreover the radix tree is built around long, but for HMM we need to
store dma addresses and on some platforms sizeof(dma_addr_t) is bigger
than sizeof(long). Thus the radix tree is unsuitable to fulfill HMM's
requirements, hence why we introduce this code which allows creating a
page table that can grow and shrink dynamically.

The design is very close to the CPU page table as it reuses some of its
features, such as the spinlock embedded in struct page.
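
As a rough illustration of how such a page table is meant to be consumed (a
sketch only: my_dump_range is hypothetical and relies solely on the iterator
and pte helpers declared in hmm_pt.h below):

/* Hypothetical helper, for illustration only. */
static void my_dump_range(struct hmm_pt *pt,
			  unsigned long start, unsigned long end)
{
	struct hmm_pt_iter iter;
	unsigned long addr, next;
	dma_addr_t *ptep;

	hmm_pt_iter_init(&iter, pt);
	for (addr = start; addr < end; addr += PAGE_SIZE) {
		next = end;
		ptep = hmm_pt_iter_lookup(&iter, addr, &next);
		/* No directory populated for this address yet. */
		if (!ptep)
			continue;
		if (hmm_pte_test_valid_pfn(ptep))
			pr_info("0x%lx -> pfn 0x%lx\n",
				addr, hmm_pte_pfn(*ptep));
	}
	hmm_pt_iter_fini(&iter);
}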

Changed since v1:
  - Use PAGE_SHIFT as shift value to reserve low bits for private
    device specific flags. This is to allow the device driver to use
    some of the lower bits for its own device specific purpose.
  - Add a set of helpers for atomically clearing, setting and testing
    bits on a dma_addr_t pointer. Atomicity being useful only for the
    dirty bit.
  - Differentiate between DMA mapped entries and non mapped entries (pfn).
  - Split page directory entry and page table entry helpers.

Changed since v2:
  - Rename hmm_pt_iter_update() -> hmm_pt_iter_lookup().
  - Rename hmm_pt_iter_fault() -> hmm_pt_iter_populate().
  - Add hmm_pt_iter_walk()
  - Remove hmm_pt_iter_next() (useless now).
  - Code simplification and improved comments.
  - Fix hmm_pt_fini_directory().

Changed since v3:
  - Fix hmm_pt_iter_directory_unref_safe().

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 MAINTAINERS            |   2 +
 include/linux/hmm_pt.h | 342 ++++++++++++++++++++++++++++
 mm/Makefile            |   2 +-
 mm/hmm_pt.c            | 603 +++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 948 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/hmm_pt.h
 create mode 100644 mm/hmm_pt.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 85a8dd0..0e3f980 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4956,6 +4956,8 @@ L:	linux-mm@kvack.org
 S:	Maintained
 F:	mm/hmm.c
 F:	include/linux/hmm.h
+F:	mm/hmm_pt.c
+F:	include/linux/hmm_pt.h
 
 HOST AP DRIVER
 M:	Jouni Malinen <j@w1.fi>
diff --git a/include/linux/hmm_pt.h b/include/linux/hmm_pt.h
new file mode 100644
index 0000000..4a8beb1
--- /dev/null
+++ b/include/linux/hmm_pt.h
@@ -0,0 +1,342 @@
+/*
+ * Copyright 2014 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/*
+ * This provides a set of helpers for the HMM page table. See
+ * include/linux/hmm.h for a description of what HMM is.
+ *
+ * The HMM page table relies on a locking mechanism similar to the CPU page
+ * table for page table updates. It uses the spinlock embedded inside the
+ * struct page to protect changes to a page table directory, which should
+ * minimize lock contention for concurrent updates.
+ *
+ * It also provides a directory tree protection mechanism. Unlike the CPU page
+ * table there is no mmap semaphore to protect the directory tree from removal,
+ * and this is done intentionally so that concurrent removal/insertion of
+ * directories inside the tree can happen.
+ *
+ * So anyone walking down the page table must protect the directories it
+ * traverses so they are not freed by some other thread. This is done by using
+ * a reference counter for each directory. Before traversing a directory a
+ * reference is taken and once traversal is done the reference is dropped.
+ *
+ * A directory entry dereference and refcount increment of sub-directory page
+ * must happen in a critical rcu section so that directory page removal can
+ * gracefully wait for all possible other threads that might have dereferenced
+ * the directory.
+ */
+#ifndef _HMM_PT_H
+#define _HMM_PT_H
+
+/*
+ * The HMM page table entry does not reflect any specific hardware. It is just
+ * a common entry format used by HMM internally and exposed to HMM users so
+ * they can extract information out of the HMM page table.
+ *
+ * Device drivers should only rely on the helpers and should not traverse the
+ * page table themselves.
+ */
+#define HMM_PT_MAX_LEVEL	6
+
+#define HMM_PDE_VALID_BIT	0
+#define HMM_PDE_VALID		(1 << HMM_PDE_VALID_BIT)
+#define HMM_PDE_PFN_MASK	(~((dma_addr_t)((1 << PAGE_SHIFT) - 1)))
+
+static inline dma_addr_t hmm_pde_from_pfn(dma_addr_t pfn)
+{
+	return (pfn << PAGE_SHIFT) | HMM_PDE_VALID;
+}
+
+static inline unsigned long hmm_pde_pfn(dma_addr_t pde)
+{
+	return (pde & HMM_PDE_VALID) ? pde >> PAGE_SHIFT : 0;
+}
+
+
+/*
+ * The HMM_PTE_VALID_DMA_BIT is set for a valid DMA mapped entry, while for a
+ * pfn entry the HMM_PTE_VALID_PFN_BIT is set. If the hmm_device is associated
+ * with a valid struct device then the device driver will be supplied with DMA
+ * mapped entries, otherwise it will be supplied with pfn entries.
+ *
+ * In the first case the device driver must ignore any pfn entry as it might
+ * show up as a transient state while HMM is mapping the page.
+ */
+#define HMM_PTE_VALID_DMA_BIT	0
+#define HMM_PTE_VALID_PFN_BIT	1
+#define HMM_PTE_WRITE_BIT	2
+#define HMM_PTE_DIRTY_BIT	3
+/*
+ * Reserve some bits for device driver private flags. Note that these can only
+ * be manipulated using the hmm_pte_*_bit() set of helpers.
+ *
+ * WARNING: ONLY SET/CLEAR THOSE FLAGS ON PTE ENTRIES THAT HAVE THE VALID BIT SET,
+ * AS OTHERWISE ANY BIT SET BY THE DRIVER WILL BE OVERWRITTEN BY HMM.
+ */
+#define HMM_PTE_HW_SHIFT	4
+
+#define HMM_PTE_PFN_MASK	(~((dma_addr_t)((1 << PAGE_SHIFT) - 1)))
+#define HMM_PTE_DMA_MASK	(~((dma_addr_t)((1 << PAGE_SHIFT) - 1)))
+
+
+#ifdef __BIG_ENDIAN
+/*
+ * The dma_addr_t casting we do on little endian does not work on big endian.
+ * It would require some macro trickery to adjust the bit value depending on
+ * the number of bits unsigned long has in comparison to dma_addr_t. This is
+ * low on the todo list for now.
+ */
+#error "HMM not supported on BIG_ENDIAN architecture.\n"
+#else /* __BIG_ENDIAN */
+static inline void hmm_pte_clear_bit(dma_addr_t *ptep, unsigned char bit)
+{
+	clear_bit(bit, (unsigned long *)ptep);
+}
+
+static inline void hmm_pte_set_bit(dma_addr_t *ptep, unsigned char bit)
+{
+	set_bit(bit, (unsigned long *)ptep);
+}
+
+static inline bool hmm_pte_test_bit(dma_addr_t *ptep, unsigned char bit)
+{
+	return !!test_bit(bit, (unsigned long *)ptep);
+}
+
+static inline bool hmm_pte_test_and_clear_bit(dma_addr_t *ptep,
+					      unsigned char bit)
+{
+	return !!test_and_clear_bit(bit, (unsigned long *)ptep);
+}
+
+static inline bool hmm_pte_test_and_set_bit(dma_addr_t *ptep,
+					    unsigned char bit)
+{
+	return !!test_and_set_bit(bit, (unsigned long *)ptep);
+}
+#endif /* __BIG_ENDIAN */
+
+
+#define HMM_PTE_CLEAR_BIT(name, bit)\
+	static inline void hmm_pte_clear_##name(dma_addr_t *ptep)\
+	{\
+		return hmm_pte_clear_bit(ptep, bit);\
+	}
+
+#define HMM_PTE_SET_BIT(name, bit)\
+	static inline void hmm_pte_set_##name(dma_addr_t *ptep)\
+	{\
+		return hmm_pte_set_bit(ptep, bit);\
+	}
+
+#define HMM_PTE_TEST_BIT(name, bit)\
+	static inline bool hmm_pte_test_##name(dma_addr_t *ptep)\
+	{\
+		return hmm_pte_test_bit(ptep, bit);\
+	}
+
+#define HMM_PTE_TEST_AND_CLEAR_BIT(name, bit)\
+	static inline bool hmm_pte_test_and_clear_##name(dma_addr_t *ptep)\
+	{\
+		return hmm_pte_test_and_clear_bit(ptep, bit);\
+	}
+
+#define HMM_PTE_TEST_AND_SET_BIT(name, bit)\
+	static inline bool hmm_pte_test_and_set_##name(dma_addr_t *ptep)\
+	{\
+		return hmm_pte_test_and_set_bit(ptep, bit);\
+	}
+
+#define HMM_PTE_BIT_HELPER(name, bit)\
+	HMM_PTE_CLEAR_BIT(name, bit)\
+	HMM_PTE_SET_BIT(name, bit)\
+	HMM_PTE_TEST_BIT(name, bit)\
+	HMM_PTE_TEST_AND_CLEAR_BIT(name, bit)\
+	HMM_PTE_TEST_AND_SET_BIT(name, bit)
+
+HMM_PTE_BIT_HELPER(valid_dma, HMM_PTE_VALID_DMA_BIT)
+HMM_PTE_BIT_HELPER(valid_pfn, HMM_PTE_VALID_PFN_BIT)
+HMM_PTE_BIT_HELPER(dirty, HMM_PTE_DIRTY_BIT)
+HMM_PTE_BIT_HELPER(write, HMM_PTE_WRITE_BIT)
+
+static inline dma_addr_t hmm_pte_from_pfn(dma_addr_t pfn)
+{
+	return (pfn << PAGE_SHIFT) | (1 << HMM_PTE_VALID_PFN_BIT);
+}
+
+static inline unsigned long hmm_pte_pfn(dma_addr_t pte)
+{
+	return hmm_pte_test_valid_pfn(&pte) ? pte >> PAGE_SHIFT : 0;
+}
+
+
+/* struct hmm_pt - HMM page table structure.
+ *
+ * @mask: Array of address mask value of each level.
+ * @directory_mask: Mask for directory index (see below).
+ * @last: Last valid address (inclusive).
+ * @pgd: page global directory (top first level of the directory tree).
+ * @lock: Share lock if spinlock_t does not fit in struct page.
+ * @shift: Array of address shift value of each level.
+ * @llevel: Last level.
+ *
+ * The index into each directory for a given address and level is :
+ *   (address >> shift[level]) & directory_mask
+ *
+ * Only hmm_pt.last field needs to be set before calling hmm_pt_init().
+ */
+struct hmm_pt {
+	unsigned long		mask[HMM_PT_MAX_LEVEL];
+	unsigned long		directory_mask;
+	unsigned long		last;
+	dma_addr_t		*pgd;
+	spinlock_t		lock;
+	unsigned char		shift[HMM_PT_MAX_LEVEL];
+	unsigned char		llevel;
+};
+
+int hmm_pt_init(struct hmm_pt *pt);
+void hmm_pt_fini(struct hmm_pt *pt);
+
+static inline unsigned hmm_pt_index(struct hmm_pt *pt,
+				    unsigned long addr,
+				    unsigned level)
+{
+	return (addr >> pt->shift[level]) & pt->directory_mask;
+}
+
+#if USE_SPLIT_PTE_PTLOCKS && !ALLOC_SPLIT_PTLOCKS
+static inline void hmm_pt_directory_lock(struct hmm_pt *pt,
+					 struct page *ptd,
+					 unsigned level)
+{
+	if (level)
+		spin_lock(&ptd->ptl);
+	else
+		spin_lock(&pt->lock);
+}
+
+static inline void hmm_pt_directory_unlock(struct hmm_pt *pt,
+					   struct page *ptd,
+					   unsigned level)
+{
+	if (level)
+		spin_unlock(&ptd->ptl);
+	else
+		spin_unlock(&pt->lock);
+}
+#else /* USE_SPLIT_PTE_PTLOCKS && !ALLOC_SPLIT_PTLOCKS */
+static inline void hmm_pt_directory_lock(struct hmm_pt *pt,
+					 struct page *ptd,
+					 unsigned level)
+{
+	spin_lock(&pt->lock);
+}
+
+static inline void hmm_pt_directory_unlock(struct hmm_pt *pt,
+					   struct page *ptd,
+					   unsigned level)
+{
+	spin_unlock(&pt->lock);
+}
+#endif
+
+static inline void hmm_pt_directory_ref(struct hmm_pt *pt,
+					struct page *ptd)
+{
+	if (!atomic_inc_not_zero(&ptd->_mapcount))
+		/* Illegal this should not happen. */
+		BUG();
+}
+
+static inline void hmm_pt_directory_unref(struct hmm_pt *pt,
+					  struct page *ptd)
+{
+	if (atomic_dec_and_test(&ptd->_mapcount))
+		/* Illegal this should not happen. */
+		BUG();
+
+}
+
+
+/* struct hmm_pt_iter - page table iterator states.
+ *
+ * @ptd: Array of directory struct page pointer for each levels.
+ * @ptdp: Array of pointer to mapped directory levels.
+ * @dead_directories: List of directories that died while walking page table.
+ * @cur: Current address.
+ */
+struct hmm_pt_iter {
+	struct page		*ptd[HMM_PT_MAX_LEVEL - 1];
+	dma_addr_t		*ptdp[HMM_PT_MAX_LEVEL - 1];
+	struct hmm_pt		*pt;
+	struct list_head	dead_directories;
+	unsigned long		cur;
+};
+
+void hmm_pt_iter_init(struct hmm_pt_iter *iter, struct hmm_pt *pt);
+void hmm_pt_iter_fini(struct hmm_pt_iter *iter);
+dma_addr_t *hmm_pt_iter_walk(struct hmm_pt_iter *iter,
+			     unsigned long *addr,
+			     unsigned long *next);
+dma_addr_t *hmm_pt_iter_lookup(struct hmm_pt_iter *iter,
+			       unsigned long addr,
+			       unsigned long *next);
+dma_addr_t *hmm_pt_iter_populate(struct hmm_pt_iter *iter,
+				 unsigned long addr,
+				 unsigned long *next);
+
+/* hmm_pt_protect_directory_ref() - reference current entry directory.
+ *
+ * @iter: Iterator states that currently protect the entry directory.
+ *
+ * This function will reference the current entry directory. Call this when
+ * you add a new valid entry to the entry directory.
+ */
+static inline void hmm_pt_iter_directory_ref(struct hmm_pt_iter *iter)
+{
+	BUG_ON(!iter->ptd[iter->pt->llevel - 1]);
+	hmm_pt_directory_ref(iter->pt, iter->ptd[iter->pt->llevel - 1]);
+}
+
+/* hmm_pt_protect_directory_unref() - unreference current entry directory.
+ *
+ * @iter: Iterator states that currently protect the entry directory.
+ *
+ * This function will unreference the current entry directory. Call this when
+ * you remove a valid entry from the entry directory.
+ */
+static inline void hmm_pt_iter_directory_unref(struct hmm_pt_iter *iter)
+{
+	BUG_ON(!iter->ptd[iter->pt->llevel - 1]);
+	hmm_pt_directory_unref(iter->pt, iter->ptd[iter->pt->llevel - 1]);
+}
+
+static inline void hmm_pt_iter_directory_lock(struct hmm_pt_iter *iter)
+{
+	struct hmm_pt *pt = iter->pt;
+
+	hmm_pt_directory_lock(pt, iter->ptd[pt->llevel - 1], pt->llevel);
+}
+
+static inline void hmm_pt_iter_directory_unlock(struct hmm_pt_iter *iter)
+{
+	struct hmm_pt *pt = iter->pt;
+
+	hmm_pt_directory_unlock(pt, iter->ptd[pt->llevel - 1], pt->llevel);
+}
+
+
+#endif /* _HMM_PT_H */
diff --git a/mm/Makefile b/mm/Makefile
index f291178..b60ab0e 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -81,4 +81,4 @@ obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
 obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
 obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
 obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
-obj-$(CONFIG_HMM) += hmm.o
+obj-$(CONFIG_HMM) += hmm.o hmm_pt.o
diff --git a/mm/hmm_pt.c b/mm/hmm_pt.c
new file mode 100644
index 0000000..ed766a0
--- /dev/null
+++ b/mm/hmm_pt.c
@@ -0,0 +1,603 @@
+/*
+ * Copyright 2014 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/*
+ * This provides a set of helpers for the HMM page table. See
+ * include/linux/hmm.h for a description of what HMM is, and
+ * include/linux/hmm_pt.h for the page table interface itself.
+ */
+#include <linux/highmem.h>
+#include <linux/slab.h>
+#include <linux/hmm_pt.h>
+
+/* hmm_pt_init() - initialize HMM page table.
+ *
+ * @pt: HMM page table to initialize.
+ *
+ * This function will initialize the HMM page table and allocate memory for
+ * the global directory. Only the hmm_pt.last field needs to be set prior to
+ * calling this function.
+ */
+int hmm_pt_init(struct hmm_pt *pt)
+{
+	unsigned directory_shift, i = 0, npgd;
+
+	/* Align end address with end of page for current arch. */
+	pt->last |= (PAGE_SIZE - 1);
+	spin_lock_init(&pt->lock);
+	/*
+	 * Directory shift is the number of bits that a single directory level
+	 * represents. For instance if PAGE_SIZE is 4096 and each entry takes 8
+	 * bytes (sizeof(dma_addr_t) == 8) then directory_shift = 9.
+	 */
+	directory_shift = PAGE_SHIFT - ilog2(sizeof(dma_addr_t));
+	/*
+	 * Level 0 is the root level of the page table. It might use fewer
+	 * bits than directory_shift but all sub-directory levels will use all
+	 * directory_shift bits.
+	 *
+	 * For instance if hmm_pt.last == (1 << 48) - 1, PAGE_SHIFT == 12 and
+	 * sizeof(dma_addr_t) == 8 then :
+	 *   directory_shift = 9
+	 *   shift[0] = 39
+	 *   shift[1] = 30
+	 *   shift[2] = 21
+	 *   shift[3] = 12
+	 *   llevel = 3
+	 *
+	 * Note that shift[llevel] == PAGE_SHIFT because the last level
+	 * corresponds to the page table entry level (ignoring the case of huge
+	 * pages).
+	 */
+	pt->shift[0] = ((__fls(pt->last >> PAGE_SHIFT) / directory_shift) *
+			directory_shift) + PAGE_SHIFT;
+	while (pt->shift[i++] > PAGE_SHIFT)
+		pt->shift[i] = pt->shift[i - 1] - directory_shift;
+	pt->llevel = i - 1;
+	pt->directory_mask = (1 << directory_shift) - 1;
+
+	for (i = 0; i <= pt->llevel; ++i)
+		pt->mask[i] = ~((1UL << pt->shift[i]) - 1);
+
+	npgd = (pt->last >> pt->shift[0]) + 1;
+	pt->pgd = kcalloc(npgd, sizeof(dma_addr_t), GFP_KERNEL);
+	if (!pt->pgd)
+		return -ENOMEM;
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_pt_init);
+
+static void hmm_pt_fini_directory(struct hmm_pt *pt,
+				  struct page *ptd,
+				  unsigned level)
+{
+	dma_addr_t *ptdp;
+	unsigned i;
+
+	if (level == pt->llevel)
+		return;
+
+	ptdp = kmap(ptd);
+	for (i = 0; i <= pt->directory_mask; ++i) {
+		struct page *lptd;
+
+		if (!(ptdp[i] & HMM_PDE_VALID))
+			continue;
+		lptd = pfn_to_page(hmm_pde_pfn(ptdp[i]));
+		ptdp[i] = 0;
+		hmm_pt_fini_directory(pt, lptd, level + 1);
+		atomic_set(&lptd->_mapcount, -1);
+		__free_page(lptd);
+	}
+	kunmap(ptd);
+}
+
+/* hmm_pt_fini() - finalize HMM page table.
+ *
+ * @pt: HMM page table to finalize.
+ *
+ * This function will free all resources of a directory page table.
+ */
+void hmm_pt_fini(struct hmm_pt *pt)
+{
+	unsigned i;
+
+	/* Free all directories. */
+	for (i = 0; i <= (pt->last >> pt->shift[0]); ++i) {
+		struct page *ptd;
+
+		if (!(pt->pgd[i] & HMM_PDE_VALID))
+			continue;
+		ptd = pfn_to_page(hmm_pde_pfn(pt->pgd[i]));
+		pt->pgd[i] = 0;
+		hmm_pt_fini_directory(pt, ptd, 1);
+		atomic_set(&ptd->_mapcount, -1);
+		__free_page(ptd);
+	}
+
+	kfree(pt->pgd);
+	pt->pgd = NULL;
+}
+EXPORT_SYMBOL(hmm_pt_fini);
+
+/* hmm_pt_level_start() - Start (inclusive) address of directory at given level
+ *
+ * @pt: HMM page table.
+ * @addr: Address for which to get the directory start address.
+ * @level: Directory level.
+ *
+ * This returns the start address of the directory at the given level for a
+ * given address. So using the usual x86-64 example with :
+ *   (hmm_pt.last == (1 << 48) - 1, PAGE_SHIFT == 12, sizeof(dma_addr_t) == 8)
+ * We have :
+ *   llevel = 3 (which is the page table entry level)
+ *   shift[0] = 39  mask[0] = ~((1 << 39) - 1)
+ *   shift[1] = 30  mask[1] = ~((1 << 30) - 1)
+ *   shift[2] = 21  mask[2] = ~((1 << 21) - 1)
+ *   shift[3] = 12  mask[3] = ~((1 << 12) - 1)
+ * Which gives :
+ *   start = hmm_pt_level_start(pt, addr, 3)
+ *         = addr & pt->mask[3 - 1]
+ *         = addr & ~((1 << 21) - 1)
+ */
+static inline unsigned long hmm_pt_level_start(struct hmm_pt *pt,
+					       unsigned long addr,
+					       unsigned level)
+{
+	return level ? addr & pt->mask[level - 1] : 0;
+}
+
+/* hmm_pt_level_end() - End address (inclusive) of directory at given level.
+ *
+ * @pt: HMM page table.
+ * @addr: Address for which to get the directory end address.
+ * @level: Directory level.
+ *
+ * This returns the end address of the directory at the given level for a
+ * given address. So using the usual x86-64 example with :
+ *   (hmm_pt.last == (1 << 48) - 1, PAGE_SHIFT == 12, sizeof(dma_addr_t) == 8)
+ * We have :
+ *   llevel = 3 (which is the page table entry level)
+ *   shift[0] = 39  mask[0] = ~((1 << 39) - 1)
+ *   shift[1] = 30  mask[1] = ~((1 << 30) - 1)
+ *   shift[2] = 21  mask[2] = ~((1 << 21) - 1)
+ *   shift[3] = 12  mask[3] = ~((1 << 12) - 1)
+ * Which gives :
+ *   start = hmm_pt_level_end(pt, addr, 3)
+ *         = addr | ~pt->mask[3 - 1]
+ *         = addr | ((1 << 21) - 1)
+ */
+static inline unsigned long hmm_pt_level_end(struct hmm_pt *pt,
+					     unsigned long addr,
+					     unsigned level)
+{
+	return level ? (addr | (~pt->mask[level - 1])) : pt->last;
+}
+
+static inline dma_addr_t *hmm_pt_iter_ptdp(struct hmm_pt_iter *iter,
+					   unsigned long addr)
+{
+	struct hmm_pt *pt = iter->pt;
+
+	BUG_ON(!iter->ptd[pt->llevel - 1] ||
+	       addr < hmm_pt_level_start(pt, iter->cur, pt->llevel) ||
+	       addr > hmm_pt_level_end(pt, iter->cur, pt->llevel));
+	return &iter->ptdp[pt->llevel - 1][hmm_pt_index(pt, addr, pt->llevel)];
+}
+
+/* hmm_pt_iter_init() - initialize iterator states.
+ *
+ * @iter: Iterator states.
+ *
+ * This function will initialize the iterator state. It must always be paired
+ * with a call to hmm_pt_iter_fini().
+ */
+void hmm_pt_iter_init(struct hmm_pt_iter *iter, struct hmm_pt *pt)
+{
+	iter->pt = pt;
+	memset(iter->ptd, 0, sizeof(iter->ptd));
+	memset(iter->ptdp, 0, sizeof(iter->ptdp));
+	INIT_LIST_HEAD(&iter->dead_directories);
+}
+EXPORT_SYMBOL(hmm_pt_iter_init);
+
+/* hmm_pt_iter_directory_unref_safe() - unref a directory that is safe to free.
+ *
+ * @iter: Iterator states.
+ * @pt: HMM page table.
+ * @level: Level of the directory to unref.
+ *
+ * This function will unreference a directory and add it to the dead list if
+ * the directory no longer has any reference. It will also clear the entry to
+ * that directory in the upper level directory as well as drop the reference
+ * on the upper directory.
+ */
+static void hmm_pt_iter_directory_unref_safe(struct hmm_pt_iter *iter,
+					     unsigned level)
+{
+	struct page *upper_ptd;
+	dma_addr_t *upper_ptdp;
+
+	/* Nothing to do for root level. */
+	if (!level)
+		return;
+
+	if (!atomic_dec_and_test(&iter->ptd[level - 1]->_mapcount))
+		return;
+
+	upper_ptd = level > 1 ? iter->ptd[level - 2] : NULL;
+	upper_ptdp = level > 1 ? iter->ptdp[level - 2] : iter->pt->pgd;
+	upper_ptdp = &upper_ptdp[hmm_pt_index(iter->pt, iter->cur, level - 1)];
+	hmm_pt_directory_lock(iter->pt, upper_ptd, level - 1);
+	/*
+	 * There might be a race between decrementing the reference count on a
+	 * directory and another thread trying to fault in a new directory. To
+	 * avoid erasing the new directory entry we need to check that the
+	 * entry still corresponds to the directory we are removing.
+	 */
+	if (hmm_pde_pfn(*upper_ptdp) == page_to_pfn(iter->ptd[level - 1]))
+		*upper_ptdp = 0;
+	hmm_pt_directory_unlock(iter->pt, upper_ptd, level - 1);
+
+	/* Add it to delayed free list. */
+	list_add_tail(&iter->ptd[level - 1]->lru, &iter->dead_directories);
+
+	/*
+	 * The upper directory is now safe to unref as we have an extra ref and
+	 * thus refcount should not reach 0.
+	 */
+	if (upper_ptd)
+		hmm_pt_directory_unref(iter->pt, upper_ptd);
+}
+
+static void hmm_pt_iter_unprotect_directory(struct hmm_pt_iter *iter,
+					    unsigned level)
+{
+	if (!iter->ptd[level - 1])
+		return;
+	kunmap(iter->ptd[level - 1]);
+	hmm_pt_iter_directory_unref_safe(iter, level);
+	iter->ptd[level - 1] = NULL;
+}
+
+/* hmm_pt_iter_protect_directory() - protect a directory.
+ *
+ * @iter: Iterator states.
+ * @ptd: directory struct page to protect.
+ * @addr: Address of the directory.
+ * @level: Level of this directory (> 0).
+ * Returns -EINVAL on error, 1 if protection succeeded, 0 otherwise.
+ *
+ * This function will protect a directory by taking a reference. It will also
+ * map the directory to allow cpu access.
+ *
+ * Calls to this function must be made from inside the rcu read critical
+ * section that converts the table entry to the directory struct page. Doing
+ * so allows supporting concurrent removal of a directory because this
+ * function takes the reference inside the rcu critical section, and thus rcu
+ * synchronization guarantees that we can safely free the directory.
+ */
+static int hmm_pt_iter_protect_directory(struct hmm_pt_iter *iter,
+					 struct page *ptd,
+					 unsigned long addr,
+					 unsigned level)
+{
+	/* This must be called inside an rcu read section. */
+	BUG_ON(!rcu_read_lock_held());
+
+	if (!level || iter->ptd[level - 1]) {
+		rcu_read_unlock();
+		return -EINVAL;
+	}
+
+	if (!atomic_inc_not_zero(&ptd->_mapcount)) {
+		rcu_read_unlock();
+		return 0;
+	}
+
+	rcu_read_unlock();
+
+	iter->ptd[level - 1] = ptd;
+	iter->ptdp[level - 1] = kmap(ptd);
+	iter->cur = addr;
+
+	return 1;
+}
+
+/* hmm_pt_iter_walk() - Walk page table for a valid entry directory.
+ *
+ * @iter: Iterator states.
+ * @addr: Start address of the range, return address of the entry directory.
+ * @next: End address of the range, return address of next directory.
+ * Returns Entry directory pointer and associated address if a valid entry
+ * directory exists in the range, or NULL and an empty (*addr=*next) range
+ * otherwise.
+ *
+ * This function will return the first valid entry directory over a range of
+ * addresses. It updates the addr parameter with the entry address and the
+ * next parameter with the address of the end of that directory. So the
+ * device driver can do :
+ *
+ * for (addr = start; addr < end;) {
+ *   unsigned long next = end;
+ *
+ *   for (ptep=hmm_pt_iter_walk(iter, &addr, &next); ptep; addr += PAGE_SIZE) {
+ *     // Use ptep
+ *     ptep++;
+ *   }
+ * }
+ */
+dma_addr_t *hmm_pt_iter_walk(struct hmm_pt_iter *iter,
+			     unsigned long *addr,
+			     unsigned long *next)
+{
+	struct hmm_pt *pt = iter->pt;
+	int i;
+
+	*addr &= PAGE_MASK;
+
+	if (iter->ptd[pt->llevel - 1] &&
+	    *addr >= hmm_pt_level_start(pt, iter->cur, pt->llevel) &&
+	    *addr <= hmm_pt_level_end(pt, iter->cur, pt->llevel)) {
+		*next = min(*next, hmm_pt_level_end(pt, *addr, pt->llevel)+1);
+		return hmm_pt_iter_ptdp(iter, *addr);
+	}
+
+again:
+	/* First unprotect any directory that does not cover the address. */
+	for (i = pt->llevel; i >= 1; --i) {
+		if (!iter->ptd[i - 1])
+			continue;
+		if (*addr >= hmm_pt_level_start(pt, iter->cur, i) &&
+		    *addr <= hmm_pt_level_end(pt, iter->cur, i))
+			break;
+		hmm_pt_iter_unprotect_directory(iter, i);
+	}
+
+	/* Walk down to last level of the directory tree. */
+	for (; i < pt->llevel; ++i) {
+		struct page *ptd;
+		dma_addr_t pte, *ptdp;
+
+		rcu_read_lock();
+		ptdp = i ? iter->ptdp[i - 1] : pt->pgd;
+		pte = ACCESS_ONCE(ptdp[hmm_pt_index(pt, *addr, i)]);
+		if (!(pte & HMM_PDE_VALID)) {
+			rcu_read_unlock();
+			*addr = hmm_pt_level_end(pt, iter->cur, i) + 1;
+			if (*addr > *next) {
+				*addr = *next;
+				return NULL;
+			}
+			goto again;
+		}
+		ptd = pfn_to_page(hmm_pde_pfn(pte));
+		/* RCU read unlock inside hmm_pt_iter_protect_directory(). */
+		if (hmm_pt_iter_protect_directory(iter, ptd,
+						  *addr, i + 1) != 1) {
+			if (*addr > *next) {
+				*addr = *next;
+				return NULL;
+			}
+			goto again;
+		}
+	}
+
+	*next = min(*next, hmm_pt_level_end(pt, *addr, pt->llevel) + 1);
+	return hmm_pt_iter_ptdp(iter, *addr);
+}
+EXPORT_SYMBOL(hmm_pt_iter_walk);
+
+/* hmm_pt_iter_lookup() - Lookup entry directory for an address.
+ *
+ * @iter: Iterator states.
+ * @addr: Address of the entry directory to lookup.
+ * @next: End address up to which the entry directory is valid.
+ * Returns Entry directory pointer and its end address.
+ *
+ * This function will return the entry directory pointer for a given address as
+ * well as the end address of that directory (address of the next directory).
+ * The usage pattern is :
+ *
+ * for (addr = start; addr < end;) {
+ *   unsigned long next;
+ *
+ *   for (ptep=hmm_pt_iter_lookup(iter, addr, &next); ptep; addr+=PAGE_SIZE) {
+ *     // Use ptep
+ *     ptep++;
+ *   }
+ * }
+ */
+dma_addr_t *hmm_pt_iter_lookup(struct hmm_pt_iter *iter,
+			       unsigned long addr,
+			       unsigned long *next)
+{
+	struct hmm_pt *pt = iter->pt;
+	int i;
+
+	addr &= PAGE_MASK;
+
+	if (iter->ptd[pt->llevel - 1] &&
+	    addr >= hmm_pt_level_start(pt, iter->cur, pt->llevel) &&
+	    addr <= hmm_pt_level_end(pt, iter->cur, pt->llevel)) {
+		*next = min(*next, hmm_pt_level_end(pt, addr, pt->llevel) + 1);
+		return hmm_pt_iter_ptdp(iter, addr);
+	}
+
+	/* First unprotect any directory that does not cover the address. */
+	for (i = pt->llevel; i >= 1; --i) {
+		if (!iter->ptd[i - 1])
+			continue;
+		if (addr >= hmm_pt_level_start(pt, iter->cur, i) &&
+		    addr <= hmm_pt_level_end(pt, iter->cur, i))
+			break;
+		hmm_pt_iter_unprotect_directory(iter, i);
+	}
+
+	/* Walk down to last level of the directory tree. */
+	for (; i < pt->llevel; ++i) {
+		struct page *ptd;
+		dma_addr_t pte, *ptdp;
+
+		rcu_read_lock();
+		ptdp = i ? iter->ptdp[i - 1] : pt->pgd;
+		pte = ACCESS_ONCE(ptdp[hmm_pt_index(pt, addr, i)]);
+		if (!(pte & HMM_PDE_VALID)) {
+			rcu_read_unlock();
+			*next = min(*next,
+				    hmm_pt_level_end(pt, iter->cur, i) + 1);
+			return NULL;
+		}
+		ptd = pfn_to_page(hmm_pde_pfn(pte));
+		/* RCU read unlock inside hmm_pt_iter_protect_directory(). */
+		if (hmm_pt_iter_protect_directory(iter, ptd, addr, i + 1) != 1) {
+			*next = min(*next,
+				    hmm_pt_level_end(pt, iter->cur, i) + 1);
+			return NULL;
+		}
+	}
+
+	*next = min(*next, hmm_pt_level_end(pt, addr, pt->llevel) + 1);
+	return hmm_pt_iter_ptdp(iter, addr);
+}
+EXPORT_SYMBOL(hmm_pt_iter_lookup);
+
+/* hmm_pt_iter_populate() - Allocate entry directory for an address.
+ *
+ * @iter: Iterator states.
+ * @addr: Address of the entry directory to lookup.
+ * @next: End address up to which the entry directory is valid.
+ * Returns Entry directory pointer and its end address.
+ *
+ * This function will return the entry directory pointer (and allocate a new
+ * one if none exists) for a given address as well as the end address of that
+ * directory (address of the next directory). The usage pattern is :
+ *
+ * for (addr = start; addr < end;) {
+ *   unsigned long next;
+ *
+ *   ptep = hmm_pt_iter_populate(iter,addr,&next);
+ *   if (!ptep) {
+ *     // error handling.
+ *   }
+ *   for (; addr < next; addr += PAGE_SIZE, ptep++) {
+ *     // Use ptep
+ *   }
+ * }
+ */
+dma_addr_t *hmm_pt_iter_populate(struct hmm_pt_iter *iter,
+				 unsigned long addr,
+				 unsigned long *next)
+{
+	dma_addr_t *ptdp = hmm_pt_iter_lookup(iter, addr, next);
+	struct hmm_pt *pt = iter->pt;
+	struct page *new = NULL;
+	int i;
+
+	if (ptdp)
+		return ptdp;
+
+	/* Populate directory tree structures. */
+	for (i = 1, iter->cur = addr; i <= pt->llevel; ++i) {
+		struct page *upper_ptd;
+		dma_addr_t *upper_ptdp;
+
+		if (iter->ptd[i - 1])
+			continue;
+
+		new = new ? new : alloc_page(GFP_HIGHUSER | __GFP_ZERO);
+		if (!new)
+			return NULL;
+
+		upper_ptd = i > 1 ? iter->ptd[i - 2] : NULL;
+		upper_ptdp = i > 1 ? iter->ptdp[i - 2] : pt->pgd;
+		upper_ptdp = &upper_ptdp[hmm_pt_index(pt, addr, i - 1)];
+		hmm_pt_directory_lock(pt, upper_ptd, i - 1);
+		if (((*upper_ptdp) & HMM_PDE_VALID)) {
+			struct page *ptd;
+
+			ptd = pfn_to_page(hmm_pde_pfn(*upper_ptdp));
+			if (atomic_inc_not_zero(&ptd->_mapcount)) {
+				/* Already allocated by another thread. */
+				iter->ptd[i - 1] = ptd;
+				hmm_pt_directory_unlock(pt, upper_ptd, i - 1);
+				iter->ptdp[i - 1] = kmap(ptd);
+				continue;
+			}
+			/*
+			 * Means we raced with removal of a dead directory; it
+			 * is safe to overwrite the *upper_ptdp entry with the
+			 * new entry.
+			 */
+		}
+		/* Initialize struct page field for the directory. */
+		atomic_set(&new->_mapcount, 1);
+#if USE_SPLIT_PTE_PTLOCKS && !ALLOC_SPLIT_PTLOCKS
+		spin_lock_init(&new->ptl);
+#endif
+		*upper_ptdp = hmm_pde_from_pfn(page_to_pfn(new));
+		/* The pgd level is not refcounted. */
+		if (i > 1)
+			hmm_pt_directory_ref(pt, iter->ptd[i - 2]);
+		/* Unlock upper directory and map the new directory. */
+		hmm_pt_directory_unlock(pt, upper_ptd, i - 1);
+		iter->ptd[i - 1] = new;
+		iter->ptdp[i - 1] = kmap(new);
+		new = NULL;
+	}
+	if (new)
+		__free_page(new);
+	*next = min(*next, hmm_pt_level_end(pt, addr, pt->llevel) + 1);
+	return hmm_pt_iter_ptdp(iter, addr);
+}
+EXPORT_SYMBOL(hmm_pt_iter_populate);
+
+/* hmm_pt_iter_fini() - finalize iterator.
+ *
+ * @iter: Iterator states.
+ * @pt: HMM page table.
+ *
+ * This function will clean up the iterator by unmapping and unreferencing any
+ * directory still mapped and referenced. It will also free any dead directory.
+ */
+void hmm_pt_iter_fini(struct hmm_pt_iter *iter)
+{
+	struct page *ptd, *tmp;
+	unsigned i;
+
+	for (i = iter->pt->llevel; i >= 1; --i) {
+		if (!iter->ptd[i - 1])
+			continue;
+		hmm_pt_iter_unprotect_directory(iter, i);
+	}
+
+	/* Avoid useless synchronize_rcu() if there is no directory to free. */
+	if (list_empty(&iter->dead_directories))
+		return;
+
+	/*
+	 * Some iterator may have dereferenced a dead directory entry and looked
+	 * up the struct page but not yet checked the reference count. As all
+	 * of the above happens in an rcu read critical section we know that we
+	 * need to wait for a grace period before being able to free any of the
+	 * dead directory pages.
+	 */
+	synchronize_rcu();
+	list_for_each_entry_safe(ptd, tmp, &iter->dead_directories, lru) {
+		list_del(&ptd->lru);
+		atomic_set(&ptd->_mapcount, -1);
+		__free_page(ptd);
+	}
+}
+EXPORT_SYMBOL(hmm_pt_iter_fini);
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v11 06/15] HMM: add HMM page table v4.
@ 2015-10-21 21:00   ` Jérôme Glisse
  0 siblings, 0 replies; 42+ messages in thread
From: Jérôme Glisse @ 2015-10-21 21:00 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse, Jatin Kumar

Heterogeneous memory management's main purpose is to mirror a process
address space. To do so it must maintain a secondary page table that is
used by the device driver to program the device or build a device
specific page table.

A radix tree can't be used to create this secondary page table because
HMM needs more flags than RADIX_TREE_MAX_TAGS (while this could be
increased, we believe HMM will require so many flags that the cost
would become prohibitive for other users of the radix tree).

Moreover the radix tree is built around long but for HMM we need to
store dma addresses and on some platforms sizeof(dma_addr_t) is bigger
than sizeof(long). Thus the radix tree is unsuitable to fulfill HMM's
requirements, hence why we introduce this code which allows creating a
page table that can grow and shrink dynamically.

The design is very close to the CPU page table as it reuses some of its
features, such as the spinlock embedded in struct page.
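
To make the intended use a bit more concrete, here is a purely
illustrative sketch of how a driver side could fill a range of its
mirror page table with the helpers added below (the function name and
the pfns[] array are made up for this example and are not part of the
patch):

static int example_mirror_range(struct hmm_pt *pt, unsigned long start,
				unsigned long end, unsigned long *pfns)
{
	struct hmm_pt_iter iter;
	unsigned long addr;
	int ret = 0;

	hmm_pt_iter_init(&iter, pt);
	for (addr = start; addr < end;) {
		unsigned long next = end;
		dma_addr_t *ptep;

		/* Allocate (if needed) the directory covering addr. */
		ptep = hmm_pt_iter_populate(&iter, addr, &next);
		if (!ptep) {
			ret = -ENOMEM;
			break;
		}
		hmm_pt_iter_directory_lock(&iter);
		for (; addr < next; addr += PAGE_SIZE, ptep++) {
			if (hmm_pte_test_valid_pfn(ptep))
				continue;
			*ptep = hmm_pte_from_pfn(pfns[(addr - start) >> PAGE_SHIFT]);
			/* Each new valid entry references its directory. */
			hmm_pt_iter_directory_ref(&iter);
		}
		hmm_pt_iter_directory_unlock(&iter);
	}
	hmm_pt_iter_fini(&iter);
	return ret;
}

Tearing the whole thing down is then a matter of calling hmm_pt_fini()
once no iterator is live anymore.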

Changed since v1:
  - Use PAGE_SHIFT as shift value to reserve low bits for private
    device specific flags. This is to allow the device driver to use
    some of the lower bits for its own device specific purpose.
  - Add a set of helpers for atomically clearing, setting and testing
    bits on a dma_addr_t pointer. Atomicity is only useful for the
    dirty bit.
  - Differentiate between DMA mapped entries and non mapped entries (pfn).
  - Split page directory entry and page table entry helpers.

Changed since v2:
  - Rename hmm_pt_iter_update() -> hmm_pt_iter_lookup().
  - Rename hmm_pt_iter_fault() -> hmm_pt_iter_populate().
  - Add hmm_pt_iter_walk()
  - Remove hmm_pt_iter_next() (useless now).
  - Code simplification and improved comments.
  - Fix hmm_pt_fini_directory().

Changed since v3:
  - Fix hmm_pt_iter_directory_unref_safe().

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 MAINTAINERS            |   2 +
 include/linux/hmm_pt.h | 342 ++++++++++++++++++++++++++++
 mm/Makefile            |   2 +-
 mm/hmm_pt.c            | 603 +++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 948 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/hmm_pt.h
 create mode 100644 mm/hmm_pt.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 85a8dd0..0e3f980 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4956,6 +4956,8 @@ L:	linux-mm@kvack.org
 S:	Maintained
 F:	mm/hmm.c
 F:	include/linux/hmm.h
+F:	mm/hmm_pt.c
+F:	include/linux/hmm_pt.h
 
 HOST AP DRIVER
 M:	Jouni Malinen <j@w1.fi>
diff --git a/include/linux/hmm_pt.h b/include/linux/hmm_pt.h
new file mode 100644
index 0000000..4a8beb1
--- /dev/null
+++ b/include/linux/hmm_pt.h
@@ -0,0 +1,342 @@
+/*
+ * Copyright 2014 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/*
+ * This provides a set of helpers for the HMM page table. See
+ * include/linux/hmm.h for a description of what HMM is.
+ *
+ * The HMM page table relies on a locking mechanism similar to the CPU page
+ * table for page table updates. It uses the spinlock embedded inside the
+ * struct page to protect changes to a page table directory, which should
+ * minimize lock contention for concurrent updates.
+ *
+ * It also provides a directory tree protection mechanism. Unlike the CPU
+ * page table there is no mmap semaphore to protect the directory tree from
+ * removal, and this is done intentionally so that concurrent removal and
+ * insertion of directories inside the tree can happen.
+ *
+ * So anyone walking down the page table must protect the directories it
+ * traverses so they are not freed by some other thread. This is done by
+ * using a reference counter for each directory. Before traversing a
+ * directory a reference is taken and once traversal is done the reference
+ * is dropped.
+ *
+ * A directory entry dereference and refcount increment of the sub-directory
+ * page must happen in an rcu read critical section so that directory page
+ * removal can gracefully wait for all possible other threads that might
+ * have dereferenced the directory.
+ */
+#ifndef _HMM_PT_H
+#define _HMM_PT_H
+
+/*
+ * The HMM page table entry does not reflect any specific hardware. It is just
+ * a common entry format used by HMM internally and exposed to HMM users so
+ * they can extract information out of the HMM page table.
+ *
+ * Device drivers should only rely on the helpers and should not traverse the
+ * page table themselves.
+ */
+#define HMM_PT_MAX_LEVEL	6
+
+#define HMM_PDE_VALID_BIT	0
+#define HMM_PDE_VALID		(1 << HMM_PDE_VALID_BIT)
+#define HMM_PDE_PFN_MASK	(~((dma_addr_t)((1 << PAGE_SHIFT) - 1)))
+
+static inline dma_addr_t hmm_pde_from_pfn(dma_addr_t pfn)
+{
+	return (pfn << PAGE_SHIFT) | HMM_PDE_VALID;
+}
+
+static inline unsigned long hmm_pde_pfn(dma_addr_t pde)
+{
+	return (pde & HMM_PDE_VALID) ? pde >> PAGE_SHIFT : 0;
+}
+
+
+/*
+ * The HMM_PTE_VALID_DMA_BIT is set for a valid DMA mapped entry, while for a
+ * pfn entry the HMM_PTE_VALID_PFN_BIT is set. If the hmm_device is associated
+ * with a valid struct device then the device driver will be supplied with DMA
+ * mapped entries, otherwise it will be supplied with pfn entries.
+ *
+ * In the first case the device driver must ignore any pfn entry as it might
+ * show up as a transient state while HMM is mapping the page.
+ */
+#define HMM_PTE_VALID_DMA_BIT	0
+#define HMM_PTE_VALID_PFN_BIT	1
+#define HMM_PTE_WRITE_BIT	2
+#define HMM_PTE_DIRTY_BIT	3
+/*
+ * Reserve some bits for device driver private flags. Note that these can only
+ * be manipulated using the hmm_pte_*_bit() set of helpers.
+ *
+ * WARNING: ONLY SET/CLEAR THESE FLAGS ON PTE ENTRIES THAT HAVE THE VALID BIT
+ * SET, AS OTHERWISE ANY BIT SET BY THE DRIVER WILL BE OVERWRITTEN BY HMM.
+ */
+#define HMM_PTE_HW_SHIFT	4
+
+#define HMM_PTE_PFN_MASK	(~((dma_addr_t)((1 << PAGE_SHIFT) - 1)))
+#define HMM_PTE_DMA_MASK	(~((dma_addr_t)((1 << PAGE_SHIFT) - 1)))
+
+
+#ifdef __BIG_ENDIAN
+/*
+ * The dma_addr_t casting we do on little endian does not work on big endian.
+ * It would require some macro trickery to adjust the bit value depending on
+ * the number of bits unsigned long has in comparison to dma_addr_t. This is
+ * just low on the todo list for now.
+ */
+#error "HMM not supported on BIG_ENDIAN architecture.\n"
+#else /* __BIG_ENDIAN */
+static inline void hmm_pte_clear_bit(dma_addr_t *ptep, unsigned char bit)
+{
+	clear_bit(bit, (unsigned long *)ptep);
+}
+
+static inline void hmm_pte_set_bit(dma_addr_t *ptep, unsigned char bit)
+{
+	set_bit(bit, (unsigned long *)ptep);
+}
+
+static inline bool hmm_pte_test_bit(dma_addr_t *ptep, unsigned char bit)
+{
+	return !!test_bit(bit, (unsigned long *)ptep);
+}
+
+static inline bool hmm_pte_test_and_clear_bit(dma_addr_t *ptep,
+					      unsigned char bit)
+{
+	return !!test_and_clear_bit(bit, (unsigned long *)ptep);
+}
+
+static inline bool hmm_pte_test_and_set_bit(dma_addr_t *ptep,
+					    unsigned char bit)
+{
+	return !!test_and_set_bit(bit, (unsigned long *)ptep);
+}
+#endif /* __BIG_ENDIAN */
+
+
+#define HMM_PTE_CLEAR_BIT(name, bit)\
+	static inline void hmm_pte_clear_##name(dma_addr_t *ptep)\
+	{\
+		return hmm_pte_clear_bit(ptep, bit);\
+	}
+
+#define HMM_PTE_SET_BIT(name, bit)\
+	static inline void hmm_pte_set_##name(dma_addr_t *ptep)\
+	{\
+		return hmm_pte_set_bit(ptep, bit);\
+	}
+
+#define HMM_PTE_TEST_BIT(name, bit)\
+	static inline bool hmm_pte_test_##name(dma_addr_t *ptep)\
+	{\
+		return hmm_pte_test_bit(ptep, bit);\
+	}
+
+#define HMM_PTE_TEST_AND_CLEAR_BIT(name, bit)\
+	static inline bool hmm_pte_test_and_clear_##name(dma_addr_t *ptep)\
+	{\
+		return hmm_pte_test_and_clear_bit(ptep, bit);\
+	}
+
+#define HMM_PTE_TEST_AND_SET_BIT(name, bit)\
+	static inline bool hmm_pte_test_and_set_##name(dma_addr_t *ptep)\
+	{\
+		return hmm_pte_test_and_set_bit(ptep, bit);\
+	}
+
+#define HMM_PTE_BIT_HELPER(name, bit)\
+	HMM_PTE_CLEAR_BIT(name, bit)\
+	HMM_PTE_SET_BIT(name, bit)\
+	HMM_PTE_TEST_BIT(name, bit)\
+	HMM_PTE_TEST_AND_CLEAR_BIT(name, bit)\
+	HMM_PTE_TEST_AND_SET_BIT(name, bit)
+
+HMM_PTE_BIT_HELPER(valid_dma, HMM_PTE_VALID_DMA_BIT)
+HMM_PTE_BIT_HELPER(valid_pfn, HMM_PTE_VALID_PFN_BIT)
+HMM_PTE_BIT_HELPER(dirty, HMM_PTE_DIRTY_BIT)
+HMM_PTE_BIT_HELPER(write, HMM_PTE_WRITE_BIT)
+
+static inline dma_addr_t hmm_pte_from_pfn(dma_addr_t pfn)
+{
+	return (pfn << PAGE_SHIFT) | (1 << HMM_PTE_VALID_PFN_BIT);
+}
+
+static inline unsigned long hmm_pte_pfn(dma_addr_t pte)
+{
+	return hmm_pte_test_valid_pfn(&pte) ? pte >> PAGE_SHIFT : 0;
+}
+
+
+/* struct hmm_pt - HMM page table structure.
+ *
+ * @mask: Array of address mask value of each level.
+ * @directory_mask: Mask for directory index (see below).
+ * @last: Last valid address (inclusive).
+ * @pgd: page global directory (top first level of the directory tree).
+ * @lock: Share lock if spinlock_t does not fit in struct page.
+ * @shift: Array of address shift value of each level.
+ * @llevel: Last level.
+ *
+ * The index into each directory for a given address and level is :
+ *   (address >> shift[level]) & directory_mask
+ *
+ * Only hmm_pt.last field needs to be set before calling hmm_pt_init().
+ */
+struct hmm_pt {
+	unsigned long		mask[HMM_PT_MAX_LEVEL];
+	unsigned long		directory_mask;
+	unsigned long		last;
+	dma_addr_t		*pgd;
+	spinlock_t		lock;
+	unsigned char		shift[HMM_PT_MAX_LEVEL];
+	unsigned char		llevel;
+};
+
+int hmm_pt_init(struct hmm_pt *pt);
+void hmm_pt_fini(struct hmm_pt *pt);
+
+static inline unsigned hmm_pt_index(struct hmm_pt *pt,
+				    unsigned long addr,
+				    unsigned level)
+{
+	return (addr >> pt->shift[level]) & pt->directory_mask;
+}
+
+#if USE_SPLIT_PTE_PTLOCKS && !ALLOC_SPLIT_PTLOCKS
+static inline void hmm_pt_directory_lock(struct hmm_pt *pt,
+					 struct page *ptd,
+					 unsigned level)
+{
+	if (level)
+		spin_lock(&ptd->ptl);
+	else
+		spin_lock(&pt->lock);
+}
+
+static inline void hmm_pt_directory_unlock(struct hmm_pt *pt,
+					   struct page *ptd,
+					   unsigned level)
+{
+	if (level)
+		spin_unlock(&ptd->ptl);
+	else
+		spin_unlock(&pt->lock);
+}
+#else /* USE_SPLIT_PTE_PTLOCKS && !ALLOC_SPLIT_PTLOCKS */
+static inline void hmm_pt_directory_lock(struct hmm_pt *pt,
+					 struct page *ptd,
+					 unsigned level)
+{
+	spin_lock(&pt->lock);
+}
+
+static inline void hmm_pt_directory_unlock(struct hmm_pt *pt,
+					   struct page *ptd,
+					   unsigned level)
+{
+	spin_unlock(&pt->lock);
+}
+#endif
+
+static inline void hmm_pt_directory_ref(struct hmm_pt *pt,
+					struct page *ptd)
+{
+	if (!atomic_inc_not_zero(&ptd->_mapcount))
+		/* Illegal this should not happen. */
+		BUG();
+}
+
+static inline void hmm_pt_directory_unref(struct hmm_pt *pt,
+					  struct page *ptd)
+{
+	if (atomic_dec_and_test(&ptd->_mapcount))
+		/* Illegal this should not happen. */
+		BUG();
+
+}
+
+
+/* struct hmm_pt_iter - page table iterator states.
+ *
+ * @ptd: Array of directory struct page pointer for each levels.
+ * @ptdp: Array of pointer to mapped directory levels.
+ * @dead_directories: List of directories that died while walking page table.
+ * @cur: Current address.
+ */
+struct hmm_pt_iter {
+	struct page		*ptd[HMM_PT_MAX_LEVEL - 1];
+	dma_addr_t		*ptdp[HMM_PT_MAX_LEVEL - 1];
+	struct hmm_pt		*pt;
+	struct list_head	dead_directories;
+	unsigned long		cur;
+};
+
+void hmm_pt_iter_init(struct hmm_pt_iter *iter, struct hmm_pt *pt);
+void hmm_pt_iter_fini(struct hmm_pt_iter *iter);
+dma_addr_t *hmm_pt_iter_walk(struct hmm_pt_iter *iter,
+			     unsigned long *addr,
+			     unsigned long *next);
+dma_addr_t *hmm_pt_iter_lookup(struct hmm_pt_iter *iter,
+			       unsigned long addr,
+			       unsigned long *next);
+dma_addr_t *hmm_pt_iter_populate(struct hmm_pt_iter *iter,
+				 unsigned long addr,
+				 unsigned long *next);
+
+/* hmm_pt_protect_directory_ref() - reference current entry directory.
+ *
+ * @iter: Iterator states that currently protect the entry directory.
+ *
+ * This function will reference the current entry directory. Call this when
+ * you add a new valid entry to the entry directory.
+ */
+static inline void hmm_pt_iter_directory_ref(struct hmm_pt_iter *iter)
+{
+	BUG_ON(!iter->ptd[iter->pt->llevel - 1]);
+	hmm_pt_directory_ref(iter->pt, iter->ptd[iter->pt->llevel - 1]);
+}
+
+/* hmm_pt_protect_directory_unref() - unreference current entry directory.
+ *
+ * @iter: Iterator states that currently protect the entry directory.
+ *
+ * This function will unreference the current entry directory. Call this when
+ * you remove a valid entry from the entry directory.
+ */
+static inline void hmm_pt_iter_directory_unref(struct hmm_pt_iter *iter)
+{
+	BUG_ON(!iter->ptd[iter->pt->llevel - 1]);
+	hmm_pt_directory_unref(iter->pt, iter->ptd[iter->pt->llevel - 1]);
+}
+
+static inline void hmm_pt_iter_directory_lock(struct hmm_pt_iter *iter)
+{
+	struct hmm_pt *pt = iter->pt;
+
+	hmm_pt_directory_lock(pt, iter->ptd[pt->llevel - 1], pt->llevel);
+}
+
+static inline void hmm_pt_iter_directory_unlock(struct hmm_pt_iter *iter)
+{
+	struct hmm_pt *pt = iter->pt;
+
+	hmm_pt_directory_unlock(pt, iter->ptd[pt->llevel - 1], pt->llevel);
+}
+
+
+#endif /* _HMM_PT_H */
diff --git a/mm/Makefile b/mm/Makefile
index f291178..b60ab0e 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -81,4 +81,4 @@ obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
 obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
 obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
 obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
-obj-$(CONFIG_HMM) += hmm.o
+obj-$(CONFIG_HMM) += hmm.o hmm_pt.o
diff --git a/mm/hmm_pt.c b/mm/hmm_pt.c
new file mode 100644
index 0000000..ed766a0
--- /dev/null
+++ b/mm/hmm_pt.c
@@ -0,0 +1,603 @@
+/*
+ * Copyright 2014 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/*
+ * This provides a set of helpers for the HMM page table. See
+ * include/linux/hmm.h for a description of what HMM is, and
+ * include/linux/hmm_pt.h for the page table interface itself.
+ */
+#include <linux/highmem.h>
+#include <linux/slab.h>
+#include <linux/hmm_pt.h>
+
+/* hmm_pt_init() - initialize HMM page table.
+ *
+ * @pt: HMM page table to initialize.
+ *
+ * This function will initialize the HMM page table and allocate memory for
+ * the global directory. Only the hmm_pt.last field needs to be set prior to
+ * calling this function.
+ */
+int hmm_pt_init(struct hmm_pt *pt)
+{
+	unsigned directory_shift, i = 0, npgd;
+
+	/* Align end address with end of page for current arch. */
+	pt->last |= (PAGE_SIZE - 1);
+	spin_lock_init(&pt->lock);
+	/*
+	 * Directory shift is the number of bits that a single directory level
+	 * represents. For instance if PAGE_SIZE is 4096 and each entry takes 8
+	 * bytes (sizeof(dma_addr_t) == 8) then directory_shift = 9.
+	 */
+	directory_shift = PAGE_SHIFT - ilog2(sizeof(dma_addr_t));
+	/*
+	 * Level 0 is the root level of the page table. It might use fewer
+	 * bits than directory_shift but all sub-directory levels will use all
+	 * directory_shift bits.
+	 *
+	 * For instance if hmm_pt.last == (1 << 48) - 1, PAGE_SHIFT == 12 and
+	 * sizeof(dma_addr_t) == 8 then :
+	 *   directory_shift = 9
+	 *   shift[0] = 39
+	 *   shift[1] = 30
+	 *   shift[2] = 21
+	 *   shift[3] = 12
+	 *   llevel = 3
+	 *
+	 * Note that shift[llevel] == PAGE_SHIFT because the last level
+	 * corresponds to the page table entry level (ignoring the case of huge
+	 * pages).
+	 */
+	pt->shift[0] = ((__fls(pt->last >> PAGE_SHIFT) / directory_shift) *
+			directory_shift) + PAGE_SHIFT;
+	while (pt->shift[i++] > PAGE_SHIFT)
+		pt->shift[i] = pt->shift[i - 1] - directory_shift;
+	pt->llevel = i - 1;
+	pt->directory_mask = (1 << directory_shift) - 1;
+
+	for (i = 0; i <= pt->llevel; ++i)
+		pt->mask[i] = ~((1UL << pt->shift[i]) - 1);
+
+	npgd = (pt->last >> pt->shift[0]) + 1;
+	pt->pgd = kcalloc(npgd, sizeof(dma_addr_t), GFP_KERNEL);
+	if (!pt->pgd)
+		return -ENOMEM;
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_pt_init);
+
+static void hmm_pt_fini_directory(struct hmm_pt *pt,
+				  struct page *ptd,
+				  unsigned level)
+{
+	dma_addr_t *ptdp;
+	unsigned i;
+
+	if (level == pt->llevel)
+		return;
+
+	ptdp = kmap(ptd);
+	for (i = 0; i <= pt->directory_mask; ++i) {
+		struct page *lptd;
+
+		if (!(ptdp[i] & HMM_PDE_VALID))
+			continue;
+		lptd = pfn_to_page(hmm_pde_pfn(ptdp[i]));
+		ptdp[i] = 0;
+		hmm_pt_fini_directory(pt, lptd, level + 1);
+		atomic_set(&lptd->_mapcount, -1);
+		__free_page(lptd);
+	}
+	kunmap(ptd);
+}
+
+/* hmm_pt_fini() - finalize HMM page table.
+ *
+ * @pt: HMM page table to finalize.
+ *
+ * This function will free all resources of a directory page table.
+ */
+void hmm_pt_fini(struct hmm_pt *pt)
+{
+	unsigned i;
+
+	/* Free all directories. */
+	for (i = 0; i <= (pt->last >> pt->shift[0]); ++i) {
+		struct page *ptd;
+
+		if (!(pt->pgd[i] & HMM_PDE_VALID))
+			continue;
+		ptd = pfn_to_page(hmm_pde_pfn(pt->pgd[i]));
+		pt->pgd[i] = 0;
+		hmm_pt_fini_directory(pt, ptd, 1);
+		atomic_set(&ptd->_mapcount, -1);
+		__free_page(ptd);
+	}
+
+	kfree(pt->pgd);
+	pt->pgd = NULL;
+}
+EXPORT_SYMBOL(hmm_pt_fini);
+
+/* hmm_pt_level_start() - Start (inclusive) address of directory at given level
+ *
+ * @pt: HMM page table.
+ * @addr: Address for which to get the directory start address.
+ * @level: Directory level.
+ *
+ * This returns the start address of the directory at the given level for a
+ * given address. So using the usual x86-64 example with :
+ *   (hmm_pt.last == (1 << 48) - 1, PAGE_SHIFT == 12, sizeof(dma_addr_t) == 8)
+ * We have :
+ *   llevel = 3 (which is the page table entry level)
+ *   shift[0] = 39  mask[0] = ~((1 << 39) - 1)
+ *   shift[1] = 30  mask[1] = ~((1 << 30) - 1)
+ *   shift[2] = 21  mask[2] = ~((1 << 21) - 1)
+ *   shift[3] = 12  mask[3] = ~((1 << 12) - 1)
+ * Which gives :
+ *   start = hmm_pt_level_start(pt, addr, 3)
+ *         = addr & pt->mask[3 - 1]
+ *         = addr & ~((1 << 21) - 1)
+ */
+static inline unsigned long hmm_pt_level_start(struct hmm_pt *pt,
+					       unsigned long addr,
+					       unsigned level)
+{
+	return level ? addr & pt->mask[level - 1] : 0;
+}
+
+/* hmm_pt_level_end() - End address (inclusive) of directory at given level.
+ *
+ * @pt: HMM page table.
+ * @addr: Address for which to get the directory end address.
+ * @level: Directory level.
+ *
+ * This returns the end address of the directory at the given level for a
+ * given address. So using the usual x86-64 example with :
+ *   (hmm_pt.last == (1 << 48) - 1, PAGE_SHIFT == 12, sizeof(dma_addr_t) == 8)
+ * We have :
+ *   llevel = 3 (which is the page table entry level)
+ *   shift[0] = 39  mask[0] = ~((1 << 39) - 1)
+ *   shift[1] = 30  mask[1] = ~((1 << 30) - 1)
+ *   shift[2] = 21  mask[2] = ~((1 << 21) - 1)
+ *   shift[3] = 12  mask[3] = ~((1 << 12) - 1)
+ * Which gives :
+ *   start = hmm_pt_level_end(pt, addr, 3)
+ *         = addr | ~pt->mask[3 - 1]
+ *         = addr | ((1 << 21) - 1)
+ */
+static inline unsigned long hmm_pt_level_end(struct hmm_pt *pt,
+					     unsigned long addr,
+					     unsigned level)
+{
+	return level ? (addr | (~pt->mask[level - 1])) : pt->last;
+}
+
+static inline dma_addr_t *hmm_pt_iter_ptdp(struct hmm_pt_iter *iter,
+					   unsigned long addr)
+{
+	struct hmm_pt *pt = iter->pt;
+
+	BUG_ON(!iter->ptd[pt->llevel - 1] ||
+	       addr < hmm_pt_level_start(pt, iter->cur, pt->llevel) ||
+	       addr > hmm_pt_level_end(pt, iter->cur, pt->llevel));
+	return &iter->ptdp[pt->llevel - 1][hmm_pt_index(pt, addr, pt->llevel)];
+}
+
+/* hmm_pt_iter_init() - initialize iterator states.
+ *
+ * @iter: Iterator states.
+ *
+ * This function will initialize the iterator state. It must always be paired
+ * with a call to hmm_pt_iter_fini().
+ */
+void hmm_pt_iter_init(struct hmm_pt_iter *iter, struct hmm_pt *pt)
+{
+	iter->pt = pt;
+	memset(iter->ptd, 0, sizeof(iter->ptd));
+	memset(iter->ptdp, 0, sizeof(iter->ptdp));
+	INIT_LIST_HEAD(&iter->dead_directories);
+}
+EXPORT_SYMBOL(hmm_pt_iter_init);
+
+/* hmm_pt_iter_directory_unref_safe() - unref a directory that is safe to free.
+ *
+ * @iter: Iterator states.
+ * @pt: HMM page table.
+ * @level: Level of the directory to unref.
+ *
+ * This function will unreference a directory and add it to the dead list if
+ * the directory no longer has any reference. It will also clear the entry to
+ * that directory in the upper level directory as well as drop the reference
+ * on the upper directory.
+ */
+static void hmm_pt_iter_directory_unref_safe(struct hmm_pt_iter *iter,
+					     unsigned level)
+{
+	struct page *upper_ptd;
+	dma_addr_t *upper_ptdp;
+
+	/* Nothing to do for root level. */
+	if (!level)
+		return;
+
+	if (!atomic_dec_and_test(&iter->ptd[level - 1]->_mapcount))
+		return;
+
+	upper_ptd = level > 1 ? iter->ptd[level - 2] : NULL;
+	upper_ptdp = level > 1 ? iter->ptdp[level - 2] : iter->pt->pgd;
+	upper_ptdp = &upper_ptdp[hmm_pt_index(iter->pt, iter->cur, level - 1)];
+	hmm_pt_directory_lock(iter->pt, upper_ptd, level - 1);
+	/*
+	 * There might be a race between decrementing the reference count on a
+	 * directory and another thread trying to fault in a new directory. To
+	 * avoid erasing the new directory entry we need to check that the
+	 * entry still corresponds to the directory we are removing.
+	 */
+	if (hmm_pde_pfn(*upper_ptdp) == page_to_pfn(iter->ptd[level - 1]))
+		*upper_ptdp = 0;
+	hmm_pt_directory_unlock(iter->pt, upper_ptd, level - 1);
+
+	/* Add it to delayed free list. */
+	list_add_tail(&iter->ptd[level - 1]->lru, &iter->dead_directories);
+
+	/*
+	 * The upper directory is now safe to unref as we have an extra ref and
+	 * thus refcount should not reach 0.
+	 */
+	if (upper_ptd)
+		hmm_pt_directory_unref(iter->pt, upper_ptd);
+}
+
+static void hmm_pt_iter_unprotect_directory(struct hmm_pt_iter *iter,
+					    unsigned level)
+{
+	if (!iter->ptd[level - 1])
+		return;
+	kunmap(iter->ptd[level - 1]);
+	hmm_pt_iter_directory_unref_safe(iter, level);
+	iter->ptd[level - 1] = NULL;
+}
+
+/* hmm_pt_iter_protect_directory() - protect a directory.
+ *
+ * @iter: Iterator states.
+ * @ptd: directory struct page to protect.
+ * @addr: Address of the directory.
+ * @level: Level of this directory (> 0).
+ * Returns -EINVAL on error, 1 if protection succeeded, 0 otherwise.
+ *
+ * This function will protect a directory by taking a reference. It will also
+ * map the directory to allow cpu access.
+ *
+ * Calls to this function must be made from inside the rcu read critical
+ * section that converts the table entry to the directory struct page. Doing
+ * so allows supporting concurrent removal of a directory because this
+ * function takes the reference inside the rcu critical section, and thus rcu
+ * synchronization guarantees that we can safely free the directory.
+ */
+static int hmm_pt_iter_protect_directory(struct hmm_pt_iter *iter,
+					 struct page *ptd,
+					 unsigned long addr,
+					 unsigned level)
+{
+	/* This must be called inside an rcu read section. */
+	BUG_ON(!rcu_read_lock_held());
+
+	if (!level || iter->ptd[level - 1]) {
+		rcu_read_unlock();
+		return -EINVAL;
+	}
+
+	if (!atomic_inc_not_zero(&ptd->_mapcount)) {
+		rcu_read_unlock();
+		return 0;
+	}
+
+	rcu_read_unlock();
+
+	iter->ptd[level - 1] = ptd;
+	iter->ptdp[level - 1] = kmap(ptd);
+	iter->cur = addr;
+
+	return 1;
+}
+
+/* hmm_pt_iter_walk() - Walk page table for a valid entry directory.
+ *
+ * @iter: Iterator states.
+ * @addr: Start address of the range, return address of the entry directory.
+ * @next: End address of the range, return address of next directory.
+ * Returns Entry directory pointer and associated address if a valid entry
+ * directory exists in the range, or NULL and an empty (*addr=*next) range
+ * otherwise.
+ *
+ * This function will return the first valid entry directory over a range of
+ * addresses. It updates the addr parameter with the entry address and the
+ * next parameter with the address of the end of that directory. So the
+ * device driver can do :
+ *
+ * for (addr = start; addr < end;) {
+ *   unsigned long next = end;
+ *
+ *   for (ptep=hmm_pt_iter_walk(iter, &addr, &next); ptep; addr += PAGE_SIZE) {
+ *     // Use ptep
+ *     ptep++;
+ *   }
+ * }
+ */
+dma_addr_t *hmm_pt_iter_walk(struct hmm_pt_iter *iter,
+			     unsigned long *addr,
+			     unsigned long *next)
+{
+	struct hmm_pt *pt = iter->pt;
+	int i;
+
+	*addr &= PAGE_MASK;
+
+	if (iter->ptd[pt->llevel - 1] &&
+	    *addr >= hmm_pt_level_start(pt, iter->cur, pt->llevel) &&
+	    *addr <= hmm_pt_level_end(pt, iter->cur, pt->llevel)) {
+		*next = min(*next, hmm_pt_level_end(pt, *addr, pt->llevel)+1);
+		return hmm_pt_iter_ptdp(iter, *addr);
+	}
+
+again:
+	/* First unprotect any directory that does not cover the address. */
+	for (i = pt->llevel; i >= 1; --i) {
+		if (!iter->ptd[i - 1])
+			continue;
+		if (*addr >= hmm_pt_level_start(pt, iter->cur, i) &&
+		    *addr <= hmm_pt_level_end(pt, iter->cur, i))
+			break;
+		hmm_pt_iter_unprotect_directory(iter, i);
+	}
+
+	/* Walk down to last level of the directory tree. */
+	for (; i < pt->llevel; ++i) {
+		struct page *ptd;
+		dma_addr_t pte, *ptdp;
+
+		rcu_read_lock();
+		ptdp = i ? iter->ptdp[i - 1] : pt->pgd;
+		pte = ACCESS_ONCE(ptdp[hmm_pt_index(pt, *addr, i)]);
+		if (!(pte & HMM_PDE_VALID)) {
+			rcu_read_unlock();
+			*addr = hmm_pt_level_end(pt, iter->cur, i) + 1;
+			if (*addr > *next) {
+				*addr = *next;
+				return NULL;
+			}
+			goto again;
+		}
+		ptd = pfn_to_page(hmm_pde_pfn(pte));
+		/* RCU read unlock inside hmm_pt_iter_protect_directory(). */
+		if (hmm_pt_iter_protect_directory(iter, ptd,
+						  *addr, i + 1) != 1) {
+			if (*addr > *next) {
+				*addr = *next;
+				return NULL;
+			}
+			goto again;
+		}
+	}
+
+	*next = min(*next, hmm_pt_level_end(pt, *addr, pt->llevel) + 1);
+	return hmm_pt_iter_ptdp(iter, *addr);
+}
+EXPORT_SYMBOL(hmm_pt_iter_walk);
+
+/* hmm_pt_iter_lookup() - Lookup entry directory for an address.
+ *
+ * @iter: Iterator states.
+ * @addr: Address of the entry directory to lookup.
+ * @next: End address up to which the entry directory is valid.
+ * Returns Entry directory pointer and its end address.
+ *
+ * This function will return the entry directory pointer for a given address as
+ * well as the end address of that directory (address of the next directory).
+ * The usage pattern is :
+ *
+ * for (addr = start; addr < end;) {
+ *   unsigned long next;
+ *
+ *   for (ptep=hmm_pt_iter_lookup(iter, addr, &next); ptep; addr+=PAGE_SIZE) {
+ *     // Use ptep
+ *     ptep++;
+ *   }
+ * }
+ */
+dma_addr_t *hmm_pt_iter_lookup(struct hmm_pt_iter *iter,
+			       unsigned long addr,
+			       unsigned long *next)
+{
+	struct hmm_pt *pt = iter->pt;
+	int i;
+
+	addr &= PAGE_MASK;
+
+	if (iter->ptd[pt->llevel - 1] &&
+	    addr >= hmm_pt_level_start(pt, iter->cur, pt->llevel) &&
+	    addr <= hmm_pt_level_end(pt, iter->cur, pt->llevel)) {
+		*next = min(*next, hmm_pt_level_end(pt, addr, pt->llevel) + 1);
+		return hmm_pt_iter_ptdp(iter, addr);
+	}
+
+	/* First unprotect any directory that does not cover the address. */
+	for (i = pt->llevel; i >= 1; --i) {
+		if (!iter->ptd[i - 1])
+			continue;
+		if (addr >= hmm_pt_level_start(pt, iter->cur, i) &&
+		    addr <= hmm_pt_level_end(pt, iter->cur, i))
+			break;
+		hmm_pt_iter_unprotect_directory(iter, i);
+	}
+
+	/* Walk down to last level of the directory tree. */
+	for (; i < pt->llevel; ++i) {
+		struct page *ptd;
+		dma_addr_t pte, *ptdp;
+
+		rcu_read_lock();
+		ptdp = i ? iter->ptdp[i - 1] : pt->pgd;
+		pte = ACCESS_ONCE(ptdp[hmm_pt_index(pt, addr, i)]);
+		if (!(pte & HMM_PDE_VALID)) {
+			rcu_read_unlock();
+			*next = min(*next,
+				    hmm_pt_level_end(pt, iter->cur, i) + 1);
+			return NULL;
+		}
+		ptd = pfn_to_page(hmm_pde_pfn(pte));
+		/* RCU read unlock inside hmm_pt_iter_protect_directory(). */
+		if (hmm_pt_iter_protect_directory(iter, ptd, addr, i + 1) != 1) {
+			*next = min(*next,
+				    hmm_pt_level_end(pt, iter->cur, i) + 1);
+			return NULL;
+		}
+	}
+
+	*next = min(*next, hmm_pt_level_end(pt, addr, pt->llevel) + 1);
+	return hmm_pt_iter_ptdp(iter, addr);
+}
+EXPORT_SYMBOL(hmm_pt_iter_lookup);
+
+/* hmm_pt_iter_populate() - Allocate entry directory for an address.
+ *
+ * @iter: Iterator states.
+ * @addr: Address of the entry directory to lookup.
+ * @next: End address up to which the entry directory is valid.
+ * Returns Entry directory pointer and its end address.
+ *
+ * This function will return the entry directory pointer (and allocate a new
+ * one if none exists) for a given address as well as the end address of that
+ * directory (address of the next directory). The usage pattern is :
+ *
+ * for (addr = start; addr < end;) {
+ *   unsigned long next;
+ *
+ *   ptep = hmm_pt_iter_populate(iter,addr,&next);
+ *   if (!ptep) {
+ *     // error handling.
+ *   }
+ *   for (; addr < next; addr += PAGE_SIZE, ptep++) {
+ *     // Use ptep
+ *   }
+ * }
+ */
+dma_addr_t *hmm_pt_iter_populate(struct hmm_pt_iter *iter,
+				 unsigned long addr,
+				 unsigned long *next)
+{
+	dma_addr_t *ptdp = hmm_pt_iter_lookup(iter, addr, next);
+	struct hmm_pt *pt = iter->pt;
+	struct page *new = NULL;
+	int i;
+
+	if (ptdp)
+		return ptdp;
+
+	/* Populate directory tree structures. */
+	for (i = 1, iter->cur = addr; i <= pt->llevel; ++i) {
+		struct page *upper_ptd;
+		dma_addr_t *upper_ptdp;
+
+		if (iter->ptd[i - 1])
+			continue;
+
+		new = new ? new : alloc_page(GFP_HIGHUSER | __GFP_ZERO);
+		if (!new)
+			return NULL;
+
+		upper_ptd = i > 1 ? iter->ptd[i - 2] : NULL;
+		upper_ptdp = i > 1 ? iter->ptdp[i - 2] : pt->pgd;
+		upper_ptdp = &upper_ptdp[hmm_pt_index(pt, addr, i - 1)];
+		hmm_pt_directory_lock(pt, upper_ptd, i - 1);
+		if (((*upper_ptdp) & HMM_PDE_VALID)) {
+			struct page *ptd;
+
+			ptd = pfn_to_page(hmm_pde_pfn(*upper_ptdp));
+			if (atomic_inc_not_zero(&ptd->_mapcount)) {
+				/* Already allocated by another thread. */
+				iter->ptd[i - 1] = ptd;
+				hmm_pt_directory_unlock(pt, upper_ptd, i - 1);
+				iter->ptdp[i - 1] = kmap(ptd);
+				continue;
+			}
+			/*
+			 * Means we raced with removal of a dead directory; it
+			 * is safe to overwrite the *upper_ptdp entry with the
+			 * new entry.
+			 */
+		}
+		/* Initialize struct page field for the directory. */
+		atomic_set(&new->_mapcount, 1);
+#if USE_SPLIT_PTE_PTLOCKS && !ALLOC_SPLIT_PTLOCKS
+		spin_lock_init(&new->ptl);
+#endif
+		*upper_ptdp = hmm_pde_from_pfn(page_to_pfn(new));
+		/* The pgd level is not refcounted. */
+		if (i > 1)
+			hmm_pt_directory_ref(pt, iter->ptd[i - 2]);
+		/* Unlock upper directory and map the new directory. */
+		hmm_pt_directory_unlock(pt, upper_ptd, i - 1);
+		iter->ptd[i - 1] = new;
+		iter->ptdp[i - 1] = kmap(new);
+		new = NULL;
+	}
+	if (new)
+		__free_page(new);
+	*next = min(*next, hmm_pt_level_end(pt, addr, pt->llevel) + 1);
+	return hmm_pt_iter_ptdp(iter, addr);
+}
+EXPORT_SYMBOL(hmm_pt_iter_populate);
+
+/* hmm_pt_iter_fini() - finalize iterator.
+ *
+ * @iter: Iterator states.
+ * @pt: HMM page table.
+ *
+ * This function will clean up the iterator by unmapping and unreferencing any
+ * directory still mapped and referenced. It will also free any dead directory.
+ */
+void hmm_pt_iter_fini(struct hmm_pt_iter *iter)
+{
+	struct page *ptd, *tmp;
+	unsigned i;
+
+	for (i = iter->pt->llevel; i >= 1; --i) {
+		if (!iter->ptd[i - 1])
+			continue;
+		hmm_pt_iter_unprotect_directory(iter, i);
+	}
+
+	/* Avoid useless synchronize_rcu() if there is no directory to free. */
+	if (list_empty(&iter->dead_directories))
+		return;
+
+	/*
+	 * Some iterator may have dereferenced a dead directory entry and looked
+	 * up the struct page but not yet checked the reference count. As all
+	 * of the above happens inside an RCU read critical section, we need to
+	 * wait for a grace period before any of the dead directory pages can
+	 * be freed.
+	 */
+	synchronize_rcu();
+	list_for_each_entry_safe(ptd, tmp, &iter->dead_directories, lru) {
+		list_del(&ptd->lru);
+		atomic_set(&ptd->_mapcount, -1);
+		__free_page(ptd);
+	}
+}
+EXPORT_SYMBOL(hmm_pt_iter_fini);
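
For reference, here is a minimal sketch (not part of the patch) of how a
caller is expected to drive the iterator over a range, following the usage
pattern documented above for hmm_pt_iter_lookup(); hmm_pt_iter_init() comes
with the rest of the iterator API and the inner loop body is intentionally
left empty:

static void example_pt_walk(struct hmm_pt *pt,
			    unsigned long start,
			    unsigned long end)
{
	struct hmm_pt_iter iter;
	unsigned long addr;

	hmm_pt_iter_init(&iter, pt);
	for (addr = start; addr < end;) {
		unsigned long next = end;
		dma_addr_t *hmm_pte;

		hmm_pte = hmm_pt_iter_lookup(&iter, addr, &next);
		if (!hmm_pte) {
			/* No directory for this range, skip ahead. */
			addr = next;
			continue;
		}
		for (; addr < next; addr += PAGE_SIZE, hmm_pte++) {
			/* Inspect or update *hmm_pte here. */
		}
	}
	hmm_pt_iter_fini(&iter);
}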
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v11 07/15] HMM: add per mirror page table v4.
  2015-10-21 20:59 ` Jérôme Glisse
@ 2015-10-21 21:00   ` Jérôme Glisse
  -1 siblings, 0 replies; 42+ messages in thread
From: Jérôme Glisse @ 2015-10-21 21:00 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse, Jatin Kumar

This patch adds the per mirror page table. It also propagates CPU page
table updates to this per mirror page table using the mmu_notifier callback.
All updates are contextualized with an HMM event structure that conveys
all the information needed by the device driver to take proper action (update
its own MMU to reflect changes and schedule proper flushing).

Core HMM is responsible for updating the per mirror page table once
the device driver is done with its update. Most importantly, HMM will
properly propagate the HMM page table dirty bit to the underlying page.

Changed since v1:
  - Removed unused fence code to defer it to later patches.

Changed since v2:
  - Use new bit flag helper for mirror page table manipulation.
  - Differentiate fork event with HMM_FORK from other events.

Changed since v3:
  - Get rid of HMM_ISDIRTY and rely on write protect instead.
  - Adapt to HMM page table changes

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 include/linux/hmm.h |  83 ++++++++++++++++++++
 mm/hmm.c            | 218 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 301 insertions(+)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index b559c0b..5488fa9 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -46,6 +46,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/workqueue.h>
 #include <linux/mman.h>
+#include <linux/hmm_pt.h>
 
 
 struct hmm_device;
@@ -53,6 +54,38 @@ struct hmm_mirror;
 struct hmm;
 
 
+/*
+ * hmm_event - each event is described by a type associated with a struct.
+ */
+enum hmm_etype {
+	HMM_NONE = 0,
+	HMM_FORK,
+	HMM_MIGRATE,
+	HMM_MUNMAP,
+	HMM_DEVICE_RFAULT,
+	HMM_DEVICE_WFAULT,
+	HMM_WRITE_PROTECT,
+};
+
+/* struct hmm_event - memory event information.
+ *
+ * @list: So HMM can keep track of all active events.
+ * @start: First address (inclusive).
+ * @end: Last address (exclusive).
+ * @pte_mask: HMM pte update mask (bit(s) that are still valid).
+ * @etype: Event type (munmap, migrate, truncate, ...).
+ * @backoff: Only meaningful for device page fault.
+ */
+struct hmm_event {
+	struct list_head	list;
+	unsigned long		start;
+	unsigned long		end;
+	dma_addr_t		pte_mask;
+	enum hmm_etype		etype;
+	bool			backoff;
+};
+
+
 /* hmm_device - Each device must register one and only one hmm_device.
  *
  * The hmm_device is the link btw HMM and each device driver.
@@ -83,6 +116,54 @@ struct hmm_device_ops {
 	 * so device driver callback can not sleep.
 	 */
 	void (*free)(struct hmm_mirror *mirror);
+
+	/* update() - update device MMU following an event.
+	 *
+	 * @mirror: The mirror that links the process address space with the device.
+	 * @event: The event that triggered the update.
+	 * Returns: 0 on success or error code {-EIO, -ENOMEM}.
+	 *
+	 * Called to update the device page table for a range of addresses.
+	 * The event type provides the nature of the update:
+	 *   - Range is no longer valid (munmap).
+	 *   - Range protection changes (mprotect, COW, ...).
+	 *   - Range is unmapped (swap, reclaim, page migration, ...).
+	 *   - Device page fault.
+	 *   - ...
+	 *
+	 * Most device drivers only need to use pte_mask, though, as it
+	 * reflects the change that will happen to the HMM page table, ie:
+	 *   new_pte = old_pte & event->pte_mask;
+	 *
+	 * The device driver must not update the HMM mirror page table (except
+	 * the dirty bit, see below). Core HMM will update the HMM page table
+	 * after the update is done.
+	 *
+	 * Note that the device must be cache coherent with system memory
+	 * (snooping in the case of PCIe devices) so there should be no need
+	 * for the device to flush anything.
+	 *
+	 * When write protection is turned on, the device driver must make
+	 * sure the hardware will no longer be able to write to the page,
+	 * otherwise file system corruption may occur.
+	 *
+	 * The device must properly set the dirty bit using hmm_pte_set_bit()
+	 * on each page table entry for memory that was written by the device.
+	 * If the device can not properly account for write access then the
+	 * dirty bit must be set unconditionally so that proper write back of
+	 * file backed pages can happen.
+	 *
+	 * The device driver must not fail lightly, any failure results in the
+	 * device process being killed.
+	 *
+	 * Returns 0 on success, error value otherwise:
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * All other return values trigger a warning and are transformed to -EIO.
+	 */
+	int (*update)(struct hmm_mirror *mirror,
+		      struct hmm_event *event);
 };
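
As an illustration (not part of the patch), a driver-side update() callback
could look like the following sketch given the contract documented above;
everything prefixed with my_ is a hypothetical driver helper, not provided by
this series:

static int my_mirror_update(struct hmm_mirror *mirror,
			    struct hmm_event *event)
{
	/* my_device and the my_device_*() helpers are driver specific. */
	struct my_device *mydev = my_mirror_to_device(mirror);
	unsigned long addr;

	for (addr = event->start; addr < event->end; addr += PAGE_SIZE) {
		/*
		 * Apply to the device MMU the change HMM is about to make to
		 * its mirror page table, ie new_pte = old_pte & event->pte_mask.
		 * If the device wrote to this page, report it by setting the
		 * dirty bit on the corresponding HMM page table entry.
		 */
		my_device_apply_pte_mask(mydev, addr, event->pte_mask);
	}

	/* Make sure the device can no longer use the old entries. */
	my_device_tlb_flush(mydev, event->start, event->end);
	return 0;
}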
 
 
@@ -149,6 +230,7 @@ int hmm_device_unregister(struct hmm_device *device);
  * @kref: Reference counter (private to HMM do not use).
  * @dlist: List of all hmm_mirror for same device.
  * @mlist: List of all hmm_mirror for same process.
+ * @pt: Mirror page table.
  *
  * Each device that want to mirror an address space must register one of this
  * struct for each of the address space it wants to mirror. Same device can
@@ -161,6 +243,7 @@ struct hmm_mirror {
 	struct kref		kref;
 	struct list_head	dlist;
 	struct hlist_node	mlist;
+	struct hmm_pt		pt;
 };
 
 int hmm_mirror_register(struct hmm_mirror *mirror);
diff --git a/mm/hmm.c b/mm/hmm.c
index 8d861c4..ef94e2a 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -45,6 +45,50 @@
 #include "internal.h"
 
 static struct mmu_notifier_ops hmm_notifier_ops;
+static void hmm_mirror_kill(struct hmm_mirror *mirror);
+static inline int hmm_mirror_update(struct hmm_mirror *mirror,
+				    struct hmm_event *event);
+static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
+				 struct hmm_event *event);
+
+
+/* hmm_event - used to track information relating to an event.
+ *
+ * Each change to the CPU page table, or fault from a device, is considered an
+ * event by HMM. For each event there is a common set of things that need to
+ * be tracked. The hmm_event struct centralizes those, and the helper functions
+ * help deal with all of this.
+ */
+
+static inline int hmm_event_init(struct hmm_event *event,
+				 struct hmm *hmm,
+				 unsigned long start,
+				 unsigned long end,
+				 enum hmm_etype etype)
+{
+	event->start = start & PAGE_MASK;
+	event->end = min(end, hmm->vm_end);
+	if (event->start >= event->end)
+		return -EINVAL;
+	event->etype = etype;
+	event->pte_mask = (dma_addr_t)-1ULL;
+	switch (etype) {
+	case HMM_DEVICE_RFAULT:
+	case HMM_DEVICE_WFAULT:
+		break;
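+	/* Keep the entries valid but drop the device write permission. */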
+	case HMM_FORK:
+	case HMM_WRITE_PROTECT:
+		event->pte_mask ^= (1 << HMM_PTE_WRITE_BIT);
+		break;
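+	/* The range is going away, clear every mirror entry covering it. */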
+	case HMM_MIGRATE:
+	case HMM_MUNMAP:
+		event->pte_mask = 0;
+		break;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
 
 
 /* hmm - core HMM functions.
@@ -123,6 +167,27 @@ static inline struct hmm *hmm_unref(struct hmm *hmm)
 	return NULL;
 }
 
+static void hmm_update(struct hmm *hmm, struct hmm_event *event)
+{
+	struct hmm_mirror *mirror;
+
+	/* Is this hmm already fully stopped? */
+	if (hmm->mm->hmm != hmm)
+		return;
+
+again:
+	down_read(&hmm->rwsem);
+	hlist_for_each_entry(mirror, &hmm->mirrors, mlist)
+		if (hmm_mirror_update(mirror, event)) {
+			mirror = hmm_mirror_ref(mirror);
+			up_read(&hmm->rwsem);
+			hmm_mirror_kill(mirror);
+			hmm_mirror_unref(&mirror);
+			goto again;
+		}
+	up_read(&hmm->rwsem);
+}
+
 
 /* hmm_notifier - HMM callback for mmu_notifier tracking change to process mm.
  *
@@ -139,6 +204,7 @@ static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm)
 	down_write(&hmm->rwsem);
 	while (hmm->mirrors.first) {
 		struct hmm_mirror *mirror;
+		struct hmm_event event;
 
 		/*
 		 * Here we are holding the mirror reference from the mirror
@@ -151,6 +217,10 @@ static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm)
 		hlist_del_init(&mirror->mlist);
 		up_write(&hmm->rwsem);
 
+		/* Make sure everything is unmapped. */
+		hmm_event_init(&event, mirror->hmm, 0, -1UL, HMM_MUNMAP);
+		hmm_mirror_update(mirror, &event);
+
 		mirror->device->ops->release(mirror);
 		hmm_mirror_unref(&mirror);
 
@@ -161,8 +231,89 @@ static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm)
 	hmm_unref(hmm);
 }
 
+static void hmm_mmu_mprot_to_etype(struct mm_struct *mm,
+				   unsigned long addr,
+				   enum mmu_event mmu_event,
+				   enum hmm_etype *etype)
+{
+	struct vm_area_struct *vma;
+
+	vma = find_vma(mm, addr);
+	if (!vma || vma->vm_start > addr || !(vma->vm_flags & VM_READ)) {
+		*etype = HMM_MUNMAP;
+		return;
+	}
+
+	if (!(vma->vm_flags & VM_WRITE)) {
+		*etype = HMM_WRITE_PROTECT;
+		return;
+	}
+
+	*etype = HMM_NONE;
+}
+
+static void hmm_notifier_invalidate_range_start(struct mmu_notifier *mn,
+					struct mm_struct *mm,
+					const struct mmu_notifier_range *range)
+{
+	struct hmm_event event;
+	unsigned long start = range->start, end = range->end;
+	struct hmm *hmm;
+
+	hmm = container_of(mn, struct hmm, mmu_notifier);
+	if (start >= hmm->vm_end)
+		return;
+
+	switch (range->event) {
+	case MMU_FORK:
+		event.etype = HMM_FORK;
+		break;
+	case MMU_MUNLOCK:
+		/* Still same physical ram backing same address. */
+		return;
+	case MMU_MPROT:
+		hmm_mmu_mprot_to_etype(mm, start, range->event, &event.etype);
+		if (event.etype == HMM_NONE)
+			return;
+		break;
+	case MMU_CLEAR_SOFT_DIRTY:
+	case MMU_WRITE_BACK:
+	case MMU_KSM_WRITE_PROTECT:
+		event.etype = HMM_WRITE_PROTECT;
+		break;
+	case MMU_HUGE_PAGE_SPLIT:
+	case MMU_MUNMAP:
+		event.etype = HMM_MUNMAP;
+		break;
+	case MMU_MIGRATE:
+	default:
+		event.etype = HMM_MIGRATE;
+		break;
+	}
+
+	hmm_event_init(&event, hmm, start, end, event.etype);
+
+	hmm_update(hmm, &event);
+}
+
+static void hmm_notifier_invalidate_page(struct mmu_notifier *mn,
+					 struct mm_struct *mm,
+					 unsigned long addr,
+					 struct page *page,
+					 enum mmu_event mmu_event)
+{
+	struct mmu_notifier_range range;
+
+	range.start = addr & PAGE_MASK;
+	range.end = range.start + PAGE_SIZE;
+	range.event = mmu_event;
+	hmm_notifier_invalidate_range_start(mn, mm, &range);
+}
+
 static struct mmu_notifier_ops hmm_notifier_ops = {
 	.release		= hmm_notifier_release,
+	.invalidate_page	= hmm_notifier_invalidate_page,
+	.invalidate_range_start	= hmm_notifier_invalidate_range_start,
 };
 
 
@@ -192,6 +343,7 @@ static void hmm_mirror_destroy(struct kref *kref)
 	mirror = container_of(kref, struct hmm_mirror, kref);
 	device = mirror->device;
 
+	hmm_pt_fini(&mirror->pt);
 	hmm_unref(mirror->hmm);
 
 	spin_lock(&device->lock);
@@ -211,6 +363,59 @@ void hmm_mirror_unref(struct hmm_mirror **mirror)
 }
 EXPORT_SYMBOL(hmm_mirror_unref);
 
+static inline int hmm_mirror_update(struct hmm_mirror *mirror,
+				    struct hmm_event *event)
+{
+	struct hmm_device *device = mirror->device;
+	int ret = 0;
+
+	ret = device->ops->update(mirror, event);
+	hmm_mirror_update_pt(mirror, event);
+	return ret;
+}
+
+static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
+				 struct hmm_event *event)
+{
+	unsigned long addr;
+	struct hmm_pt_iter iter;
+
+	hmm_pt_iter_init(&iter, &mirror->pt);
+	for (addr = event->start; addr != event->end;) {
+		unsigned long next = event->end;
+		dma_addr_t *hmm_pte;
+
+		hmm_pte = hmm_pt_iter_lookup(&iter, addr, &next);
+		if (!hmm_pte) {
+			addr = next;
+			continue;
+		}
+		/*
+		 * The directory lock protects against concurrent clearing of
+		 * page table bit flags. The exceptions are the dirty bit and
+		 * the device driver private flags.
+		 */
+		hmm_pt_iter_directory_lock(&iter);
+		do {
+			if (!hmm_pte_test_valid_pfn(hmm_pte))
+				continue;
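+			/*
+			 * Transfer the device dirty bit to the struct page
+			 * before the entry is masked, so that writeback of
+			 * file backed pages does not miss device writes.
+			 */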
+			if (hmm_pte_test_and_clear_dirty(hmm_pte) &&
+			    hmm_pte_test_write(hmm_pte)) {
+				struct page *page;
+
+				page = pfn_to_page(hmm_pte_pfn(*hmm_pte));
+				set_page_dirty(page);
+			}
+			*hmm_pte &= event->pte_mask;
+			if (hmm_pte_test_valid_pfn(hmm_pte))
+				continue;
+			hmm_pt_iter_directory_unref(&iter);
+		} while (addr += PAGE_SIZE, hmm_pte++, addr != next);
+		hmm_pt_iter_directory_unlock(&iter);
+	}
+	hmm_pt_iter_fini(&iter);
+}
+
 /* hmm_mirror_register() - register mirror against current process for a device.
  *
  * @mirror: The mirror struct being registered.
@@ -242,6 +447,11 @@ int hmm_mirror_register(struct hmm_mirror *mirror)
 	 * necessary to make the error path easier for driver and for hmm.
 	 */
 	kref_init(&mirror->kref);
+	mirror->pt.last = TASK_SIZE - 1;
+	if (hmm_pt_init(&mirror->pt)) {
+		kfree(mirror);
+		return -ENOMEM;
+	}
 	INIT_HLIST_NODE(&mirror->mlist);
 	INIT_LIST_HEAD(&mirror->dlist);
 	spin_lock(&mirror->device->lock);
@@ -278,6 +488,7 @@ int hmm_mirror_register(struct hmm_mirror *mirror)
 		hmm_unref(hmm);
 		goto error;
 	}
+	BUG_ON(mirror->pt.last >= hmm->vm_end);
 	return 0;
 
 error:
@@ -298,8 +509,15 @@ static void hmm_mirror_kill(struct hmm_mirror *mirror)
 
 	down_write(&hmm->rwsem);
 	if (!hlist_unhashed(&mirror->mlist)) {
+		struct hmm_event event;
+
 		hlist_del_init(&mirror->mlist);
 		up_write(&hmm->rwsem);
+
+		/* Make sure everything is unmapped. */
+		hmm_event_init(&event, mirror->hmm, 0, -1UL, HMM_MUNMAP);
+		hmm_mirror_update(mirror, &event);
+
 		device->ops->release(mirror);
 		hmm_mirror_unref(&mirror);
 	} else
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v11 08/15] HMM: add device page fault support v6.
  2015-10-21 20:59 ` Jérôme Glisse
@ 2015-10-21 21:00   ` Jérôme Glisse
  -1 siblings, 0 replies; 42+ messages in thread
From: Jérôme Glisse @ 2015-10-21 21:00 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse, Jatin Kumar

This patch adds helpers for device page faults. These helpers fill
the mirror page table using the CPU page table while synchronizing
with any update to the CPU page table.

Changed since v1:
  - Add comment about directory lock.

Changed since v2:
  - Check for mirror->hmm in hmm_mirror_fault()

Changed since v3:
  - Adapt to HMM page table changes.

Changed since v4:
  - Fix PROT_NONE, ie do not populate from protnone pte.
  - Fix huge pmd handling (start address may != pmd start address)
  - Fix missing entry case.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 include/linux/hmm.h |  15 ++
 mm/hmm.c            | 391 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 405 insertions(+), 1 deletion(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 5488fa9..d819ec9 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -85,6 +85,12 @@ struct hmm_event {
 	bool			backoff;
 };
 
+static inline bool hmm_event_overlap(const struct hmm_event *a,
+				     const struct hmm_event *b)
+{
+	return !((a->end <= b->start) || (a->start >= b->end));
+}
+
 
 /* hmm_device - Each device must register one and only one hmm_device.
  *
@@ -176,6 +182,10 @@ struct hmm_device_ops {
  * @rwsem: Serialize the mirror list modifications.
  * @mmu_notifier: The mmu_notifier of this mm.
  * @rcu: For delayed cleanup call from mmu_notifier.release() callback.
+ * @device_faults: List of all active device page faults.
+ * @ndevice_faults: Number of active device page faults.
+ * @wait_queue: Wait queue for event synchronization.
+ * @lock: Serialize device_faults list modification.
  *
  * For each process address space (mm_struct) there is one and only one hmm
  * struct. hmm functions will redispatch to each devices the change made to
@@ -192,6 +202,10 @@ struct hmm {
 	struct rw_semaphore	rwsem;
 	struct mmu_notifier	mmu_notifier;
 	struct rcu_head		rcu;
+	struct list_head	device_faults;
+	unsigned		ndevice_faults;
+	wait_queue_head_t	wait_queue;
+	spinlock_t		lock;
 };
 
 
@@ -250,6 +264,7 @@ int hmm_mirror_register(struct hmm_mirror *mirror);
 void hmm_mirror_unregister(struct hmm_mirror *mirror);
 struct hmm_mirror *hmm_mirror_ref(struct hmm_mirror *mirror);
 void hmm_mirror_unref(struct hmm_mirror **mirror);
+int hmm_mirror_fault(struct hmm_mirror *mirror, struct hmm_event *event);
 
 
 #endif /* CONFIG_HMM */
diff --git a/mm/hmm.c b/mm/hmm.c
index ef94e2a..cad25d9 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -67,7 +67,7 @@ static inline int hmm_event_init(struct hmm_event *event,
 				 enum hmm_etype etype)
 {
 	event->start = start & PAGE_MASK;
-	event->end = min(end, hmm->vm_end);
+	event->end = PAGE_ALIGN(min(end, hmm->vm_end));
 	if (event->start >= event->end)
 		return -EINVAL;
 	event->etype = etype;
@@ -103,6 +103,10 @@ static int hmm_init(struct hmm *hmm)
 	kref_init(&hmm->kref);
 	INIT_HLIST_HEAD(&hmm->mirrors);
 	init_rwsem(&hmm->rwsem);
+	INIT_LIST_HEAD(&hmm->device_faults);
+	hmm->ndevice_faults = 0;
+	init_waitqueue_head(&hmm->wait_queue);
+	spin_lock_init(&hmm->lock);
 
 	/* register notifier */
 	hmm->mmu_notifier.ops = &hmm_notifier_ops;
@@ -167,6 +171,58 @@ static inline struct hmm *hmm_unref(struct hmm *hmm)
 	return NULL;
 }
 
+static int hmm_device_fault_start(struct hmm *hmm, struct hmm_event *event)
+{
+	int ret = 0;
+
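+	/*
+	 * Wait for any CPU page table invalidation overlapping the faulted
+	 * range to finish, then register the fault so that a later
+	 * invalidation can ask us to back off (see hmm_wait_device_fault()).
+	 */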
+	mmu_notifier_range_wait_active(hmm->mm, event->start, event->end);
+
+	spin_lock(&hmm->lock);
+	if (mmu_notifier_range_inactive(hmm->mm, event->start, event->end)) {
+		list_add_tail(&event->list, &hmm->device_faults);
+		hmm->ndevice_faults++;
+		event->backoff = false;
+	} else
+		ret = -EAGAIN;
+	spin_unlock(&hmm->lock);
+
+	wake_up(&hmm->wait_queue);
+
+	return ret;
+}
+
+static void hmm_device_fault_end(struct hmm *hmm, struct hmm_event *event)
+{
+	spin_lock(&hmm->lock);
+	list_del_init(&event->list);
+	hmm->ndevice_faults--;
+	spin_unlock(&hmm->lock);
+
+	wake_up(&hmm->wait_queue);
+}
+
+static void hmm_wait_device_fault(struct hmm *hmm, struct hmm_event *ievent)
+{
+	struct hmm_event *fevent;
+	unsigned long wait_for = 0;
+
+again:
+	spin_lock(&hmm->lock);
+	list_for_each_entry(fevent, &hmm->device_faults, list) {
+		if (!hmm_event_overlap(fevent, ievent))
+			continue;
+		fevent->backoff = true;
+		wait_for = hmm->ndevice_faults;
+	}
+	spin_unlock(&hmm->lock);
+
+	if (wait_for > 0) {
+		wait_event(hmm->wait_queue, wait_for != hmm->ndevice_faults);
+		wait_for = 0;
+		goto again;
+	}
+}
+
 static void hmm_update(struct hmm *hmm, struct hmm_event *event)
 {
 	struct hmm_mirror *mirror;
@@ -175,6 +231,8 @@ static void hmm_update(struct hmm *hmm, struct hmm_event *event)
 	if (hmm->mm->hmm != hmm)
 		return;
 
+	hmm_wait_device_fault(hmm, event);
+
 again:
 	down_read(&hmm->rwsem);
 	hlist_for_each_entry(mirror, &hmm->mirrors, mlist)
@@ -186,6 +244,33 @@ again:
 			goto again;
 		}
 	up_read(&hmm->rwsem);
+
+	wake_up(&hmm->wait_queue);
+}
+
+static int hmm_mm_fault(struct hmm *hmm,
+			struct hmm_event *event,
+			struct vm_area_struct *vma,
+			unsigned long addr)
+{
+	unsigned flags = FAULT_FLAG_ALLOW_RETRY;
+	struct mm_struct *mm = vma->vm_mm;
+	int r;
+
+	flags |= (event->etype == HMM_DEVICE_WFAULT) ? FAULT_FLAG_WRITE : 0;
+	for (addr &= PAGE_MASK; addr < event->end; addr += PAGE_SIZE) {
+
+		r = handle_mm_fault(mm, vma, addr, flags);
+		if (r & VM_FAULT_RETRY)
+			return -EBUSY;
+		if (r & VM_FAULT_ERROR) {
+			if (r & VM_FAULT_OOM)
+				return -ENOMEM;
+			/* Same error code for all other cases. */
+			return -EFAULT;
+		}
+	}
+	return 0;
 }
 
 
@@ -228,6 +313,7 @@ static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm)
 	}
 	up_write(&hmm->rwsem);
 
+	wake_up(&hmm->wait_queue);
 	hmm_unref(hmm);
 }
 
@@ -416,6 +502,309 @@ static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
 	hmm_pt_iter_fini(&iter);
 }
 
+static inline bool hmm_mirror_is_dead(struct hmm_mirror *mirror)
+{
+	if (hlist_unhashed(&mirror->mlist) || list_empty(&mirror->dlist))
+		return true;
+	return false;
+}
+
+struct hmm_mirror_fault {
+	struct hmm_mirror	*mirror;
+	struct hmm_event	*event;
+	struct vm_area_struct	*vma;
+	unsigned long		addr;
+	struct hmm_pt_iter	*iter;
+};
+
+static int hmm_mirror_fault_hpmd(struct hmm_mirror *mirror,
+				 struct hmm_event *event,
+				 struct vm_area_struct *vma,
+				 struct hmm_pt_iter *iter,
+				 pmd_t *pmdp,
+				 struct hmm_mirror_fault *mirror_fault,
+				 unsigned long start,
+				 unsigned long end)
+{
+	struct page *page;
+	unsigned long addr, pfn;
+	unsigned flags = FOLL_TOUCH;
+	spinlock_t *ptl;
+	int ret;
+
+	ptl = pmd_lock(mirror->hmm->mm, pmdp);
+	if (unlikely(!pmd_trans_huge(*pmdp))) {
+		spin_unlock(ptl);
+		return -EAGAIN;
+	}
+	if (unlikely(pmd_trans_splitting(*pmdp))) {
+		spin_unlock(ptl);
+		wait_split_huge_page(vma->anon_vma, pmdp);
+		return -EAGAIN;
+	}
+	flags |= event->etype == HMM_DEVICE_WFAULT ? FOLL_WRITE : 0;
+	page = follow_trans_huge_pmd(vma, start, pmdp, flags);
+	pfn = page_to_pfn(page);
+	spin_unlock(ptl);
+
+	/* Just fault in the whole PMD. */
+	start &= PMD_MASK;
+	end = start + PMD_SIZE - 1;
+
+	if (!pmd_write(*pmdp) && event->etype == HMM_DEVICE_WFAULT)
+			return -ENOENT;
+
+	for (ret = 0, addr = start; !ret && addr < end;) {
+		unsigned long i, next = end;
+		dma_addr_t *hmm_pte;
+
+		hmm_pte = hmm_pt_iter_populate(iter, addr, &next);
+		if (!hmm_pte)
+			return -ENOMEM;
+
+		i = hmm_pt_index(&mirror->pt, addr, mirror->pt.llevel);
+
+		/*
+		 * The directory lock protects against concurrent clearing of
+		 * page table bit flags. The exceptions are the dirty bit and
+		 * the device driver private flags.
+		 */
+		hmm_pt_iter_directory_lock(iter);
+		do {
+			if (!hmm_pte_test_valid_pfn(&hmm_pte[i])) {
+				hmm_pte[i] = hmm_pte_from_pfn(pfn);
+				hmm_pt_iter_directory_ref(iter);
+			}
+			BUG_ON(hmm_pte_pfn(hmm_pte[i]) != pfn);
+			if (pmd_write(*pmdp))
+				hmm_pte_set_write(&hmm_pte[i]);
+		} while (addr += PAGE_SIZE, pfn++, i++, addr != next);
+		hmm_pt_iter_directory_unlock(iter);
+		mirror_fault->addr = addr;
+	}
+
+	return 0;
+}
+
+static int hmm_pte_hole(unsigned long addr,
+			unsigned long next,
+			struct mm_walk *walk)
+{
+	return -ENOENT;
+}
+
+static int hmm_mirror_fault_pmd(pmd_t *pmdp,
+				unsigned long start,
+				unsigned long end,
+				struct mm_walk *walk)
+{
+	struct hmm_mirror_fault *mirror_fault = walk->private;
+	struct hmm_mirror *mirror = mirror_fault->mirror;
+	struct hmm_event *event = mirror_fault->event;
+	struct hmm_pt_iter *iter = mirror_fault->iter;
+	bool write = (event->etype == HMM_DEVICE_WFAULT);
+	unsigned long addr;
+	int ret = 0;
+
+	/* Make sure there was no gap. */
+	if (start != mirror_fault->addr)
+		return -ENOENT;
+
+	if (event->backoff)
+		return -EAGAIN;
+
+	if (pmd_none(*pmdp))
+		return -ENOENT;
+
+	if (pmd_trans_huge(*pmdp))
+		return hmm_mirror_fault_hpmd(mirror, event, mirror_fault->vma,
+					     iter, pmdp, mirror_fault, start,
+					     end);
+
+	if (pmd_none_or_trans_huge_or_clear_bad(pmdp))
+		return -EFAULT;
+
+	for (ret = 0, addr = start; !ret && addr < end;) {
+		unsigned long i = 0, next = end;
+		dma_addr_t *hmm_pte;
+		pte_t *ptep;
+
+		hmm_pte = hmm_pt_iter_populate(iter, addr, &next);
+		if (!hmm_pte)
+			return -ENOMEM;
+
+		ptep = pte_offset_map(pmdp, start);
+		hmm_pt_iter_directory_lock(iter);
+		do {
+			if (!pte_present(*ptep) ||
+			    (write && !pte_write(*ptep)) ||
+			    pte_protnone(*ptep)) {
+				ret = -ENOENT;
+				ptep++;
+				break;
+			}
+
+			if (!hmm_pte_test_valid_pfn(&hmm_pte[i])) {
+				hmm_pte[i] = hmm_pte_from_pfn(pte_pfn(*ptep));
+				hmm_pt_iter_directory_ref(iter);
+			}
+			BUG_ON(hmm_pte_pfn(hmm_pte[i]) != pte_pfn(*ptep));
+			if (pte_write(*ptep))
+				hmm_pte_set_write(&hmm_pte[i]);
+		} while (addr += PAGE_SIZE, ptep++, i++, addr != next);
+		hmm_pt_iter_directory_unlock(iter);
+		pte_unmap(ptep - 1);
+		mirror_fault->addr = addr;
+	}
+
+	return ret;
+}
+
+static int hmm_mirror_handle_fault(struct hmm_mirror *mirror,
+				   struct hmm_event *event,
+				   struct vm_area_struct *vma,
+				   struct hmm_pt_iter *iter)
+{
+	struct hmm_mirror_fault mirror_fault;
+	unsigned long addr = event->start;
+	struct mm_walk walk = {0};
+	int ret = 0;
+
+	if ((event->etype == HMM_DEVICE_WFAULT) && !(vma->vm_flags & VM_WRITE))
+		return -EACCES;
+
+	ret = hmm_device_fault_start(mirror->hmm, event);
+	if (ret)
+		return ret;
+
+again:
+	if (event->backoff) {
+		ret = -EAGAIN;
+		goto out;
+	}
+	if (addr >= event->end)
+		goto out;
+
+	mirror_fault.event = event;
+	mirror_fault.mirror = mirror;
+	mirror_fault.vma = vma;
+	mirror_fault.addr = addr;
+	mirror_fault.iter = iter;
+	walk.mm = mirror->hmm->mm;
+	walk.private = &mirror_fault;
+	walk.pmd_entry = hmm_mirror_fault_pmd;
+	walk.pte_hole = hmm_pte_hole;
+	ret = walk_page_range(addr, event->end, &walk);
+	if (!ret) {
+		ret = mirror->device->ops->update(mirror, event);
+		if (!ret) {
+			addr = mirror_fault.addr;
+			goto again;
+		}
+	}
+
+out:
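+	/*
+	 * Always unregister the fault first. -ENOENT means the CPU page
+	 * table has no suitable entry (not present, or read-only on a write
+	 * fault): fault it in through handle_mm_fault() and let the caller
+	 * retry the mirror fault.
+	 */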
+	hmm_device_fault_end(mirror->hmm, event);
+	if (ret == -ENOENT) {
+		ret = hmm_mm_fault(mirror->hmm, event, vma, addr);
+		ret = ret ? ret : -EAGAIN;
+	}
+	return ret;
+}
+
+int hmm_mirror_fault(struct hmm_mirror *mirror, struct hmm_event *event)
+{
+	struct vm_area_struct *vma;
+	struct hmm_pt_iter iter;
+	int ret = 0;
+
+	mirror = hmm_mirror_ref(mirror);
+	if (!mirror)
+		return -ENODEV;
+	if (event->start >= mirror->hmm->vm_end) {
+		hmm_mirror_unref(&mirror);
+		return -EINVAL;
+	}
+	if (hmm_event_init(event, mirror->hmm, event->start,
+			   event->end, event->etype)) {
+		hmm_mirror_unref(&mirror);
+		return -EINVAL;
+	}
+	hmm_pt_iter_init(&iter, &mirror->pt);
+
+retry:
+	if (hmm_mirror_is_dead(mirror)) {
+		hmm_mirror_unref(&mirror);
+		return -ENODEV;
+	}
+
+	/*
+	 * Synchronization with the CPU page table is the most important and
+	 * tedious aspect of device page faults. There must be a strong
+	 * ordering between the device->update() calls for device page faults
+	 * and the device->update() calls for CPU page table
+	 * invalidation/update.
+	 *
+	 * Pages that are exposed to the device driver must stay valid while
+	 * the callback is in progress, ie any CPU page table invalidation
+	 * that renders those pages obsolete must call device->update() after
+	 * the device->update() call that faulted those pages.
+	 *
+	 * To achieve this we rely on a few things. First, the mmap_sem
+	 * ensures that any munmap() syscall will serialize with us. So the
+	 * remaining issues are unmap_mapping_range() and page migration or
+	 * merging. For those, HMM keeps track of the affected ranges of
+	 * addresses and blocks any device page fault that hits an
+	 * overlapping range.
+	 */
+	down_read(&mirror->hmm->mm->mmap_sem);
+	vma = find_vma_intersection(mirror->hmm->mm, event->start, event->end);
+	if (!vma) {
+		ret = -EFAULT;
+		goto out;
+	}
+	if (vma->vm_start > event->start) {
+		event->end = vma->vm_start;
+		ret = -EFAULT;
+		goto out;
+	}
+	event->end = min(event->end, vma->vm_end) & PAGE_MASK;
+	if ((vma->vm_flags & (VM_IO | VM_PFNMAP | VM_MIXEDMAP | VM_HUGETLB))) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	switch (event->etype) {
+	case HMM_DEVICE_WFAULT:
+		if (!(vma->vm_flags & VM_WRITE)) {
+			ret = -EFAULT;
+			goto out;
+		}
+		/* fallthrough */
+	case HMM_DEVICE_RFAULT:
+		/* Handle the PROT_NONE case early on. */
+		if (!(vma->vm_flags & (VM_WRITE | VM_READ))) {
+			ret = -EFAULT;
+			goto out;
+		}
+		ret = hmm_mirror_handle_fault(mirror, event, vma, &iter);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+out:
+	/* Drop the mmap_sem so anyone waiting on it has a chance. */
+	if (ret != -EBUSY)
+		up_read(&mirror->hmm->mm->mmap_sem);
+	wake_up(&mirror->hmm->wait_queue);
+	if (ret == -EAGAIN)
+		goto retry;
+	hmm_pt_iter_fini(&iter);
+	hmm_mirror_unref(&mirror);
+	return ret;
+}
+EXPORT_SYMBOL(hmm_mirror_fault);
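
As an illustration (not part of the patch), a driver's device page fault
handler could call into this helper roughly as follows; the faulting address
and the surrounding my_ naming are hypothetical driver-side details:

static int my_handle_device_fault(struct hmm_mirror *mirror,
				  unsigned long faulting_addr,
				  bool is_write)
{
	struct hmm_event event = {
		.start	= faulting_addr & PAGE_MASK,
		.end	= (faulting_addr & PAGE_MASK) + PAGE_SIZE,
		.etype	= is_write ? HMM_DEVICE_WFAULT : HMM_DEVICE_RFAULT,
	};
	int ret;

	/*
	 * hmm_mirror_fault() fills the mirror page table for the range and
	 * calls back into ops->update() so the driver can program its own
	 * MMU; it retries internally on -EAGAIN.
	 */
	ret = hmm_mirror_fault(mirror, &event);
	if (ret)
		return ret;

	/* Driver specific: replay/resume the faulting device access here. */
	return 0;
}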
+
 /* hmm_mirror_register() - register mirror against current process for a device.
  *
  * @mirror: The mirror struct being registered.
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v11 08/15] HMM: add device page fault support v6.
@ 2015-10-21 21:00   ` Jérôme Glisse
  0 siblings, 0 replies; 42+ messages in thread
From: Jérôme Glisse @ 2015-10-21 21:00 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse, Jatin Kumar

This patch add helper for device page fault. Thus helpers will fill
the mirror page table using the CPU page table and synchronizing
with any update to CPU page table.

Changed since v1:
  - Add comment about directory lock.

Changed since v2:
  - Check for mirror->hmm in hmm_mirror_fault()

Changed since v3:
  - Adapt to HMM page table changes.

Changed since v4:
  - Fix PROT_NONE, ie do not populate from protnone pte.
  - Fix huge pmd handling (start address may != pmd start address)
  - Fix missing entry case.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 include/linux/hmm.h |  15 ++
 mm/hmm.c            | 391 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 405 insertions(+), 1 deletion(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 5488fa9..d819ec9 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -85,6 +85,12 @@ struct hmm_event {
 	bool			backoff;
 };
 
+static inline bool hmm_event_overlap(const struct hmm_event *a,
+				     const struct hmm_event *b)
+{
+	return !((a->end <= b->start) || (a->start >= b->end));
+}
+
 
 /* hmm_device - Each device must register one and only one hmm_device.
  *
@@ -176,6 +182,10 @@ struct hmm_device_ops {
  * @rwsem: Serialize the mirror list modifications.
  * @mmu_notifier: The mmu_notifier of this mm.
  * @rcu: For delayed cleanup call from mmu_notifier.release() callback.
+ * @device_faults: List of all active device page faults.
+ * @ndevice_faults: Number of active device page faults.
+ * @wait_queue: Wait queue for event synchronization.
+ * @lock: Serialize device_faults list modification.
  *
  * For each process address space (mm_struct) there is one and only one hmm
  * struct. hmm functions will redispatch to each devices the change made to
@@ -192,6 +202,10 @@ struct hmm {
 	struct rw_semaphore	rwsem;
 	struct mmu_notifier	mmu_notifier;
 	struct rcu_head		rcu;
+	struct list_head	device_faults;
+	unsigned		ndevice_faults;
+	wait_queue_head_t	wait_queue;
+	spinlock_t		lock;
 };
 
 
@@ -250,6 +264,7 @@ int hmm_mirror_register(struct hmm_mirror *mirror);
 void hmm_mirror_unregister(struct hmm_mirror *mirror);
 struct hmm_mirror *hmm_mirror_ref(struct hmm_mirror *mirror);
 void hmm_mirror_unref(struct hmm_mirror **mirror);
+int hmm_mirror_fault(struct hmm_mirror *mirror, struct hmm_event *event);
 
 
 #endif /* CONFIG_HMM */
diff --git a/mm/hmm.c b/mm/hmm.c
index ef94e2a..cad25d9 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -67,7 +67,7 @@ static inline int hmm_event_init(struct hmm_event *event,
 				 enum hmm_etype etype)
 {
 	event->start = start & PAGE_MASK;
-	event->end = min(end, hmm->vm_end);
+	event->end = PAGE_ALIGN(min(end, hmm->vm_end));
 	if (event->start >= event->end)
 		return -EINVAL;
 	event->etype = etype;
@@ -103,6 +103,10 @@ static int hmm_init(struct hmm *hmm)
 	kref_init(&hmm->kref);
 	INIT_HLIST_HEAD(&hmm->mirrors);
 	init_rwsem(&hmm->rwsem);
+	INIT_LIST_HEAD(&hmm->device_faults);
+	hmm->ndevice_faults = 0;
+	init_waitqueue_head(&hmm->wait_queue);
+	spin_lock_init(&hmm->lock);
 
 	/* register notifier */
 	hmm->mmu_notifier.ops = &hmm_notifier_ops;
@@ -167,6 +171,58 @@ static inline struct hmm *hmm_unref(struct hmm *hmm)
 	return NULL;
 }
 
+static int hmm_device_fault_start(struct hmm *hmm, struct hmm_event *event)
+{
+	int ret = 0;
+
+	mmu_notifier_range_wait_active(hmm->mm, event->start, event->end);
+
+	spin_lock(&hmm->lock);
+	if (mmu_notifier_range_inactive(hmm->mm, event->start, event->end)) {
+		list_add_tail(&event->list, &hmm->device_faults);
+		hmm->ndevice_faults++;
+		event->backoff = false;
+	} else
+		ret = -EAGAIN;
+	spin_unlock(&hmm->lock);
+
+	wake_up(&hmm->wait_queue);
+
+	return ret;
+}
+
+static void hmm_device_fault_end(struct hmm *hmm, struct hmm_event *event)
+{
+	spin_lock(&hmm->lock);
+	list_del_init(&event->list);
+	hmm->ndevice_faults--;
+	spin_unlock(&hmm->lock);
+
+	wake_up(&hmm->wait_queue);
+}
+
+static void hmm_wait_device_fault(struct hmm *hmm, struct hmm_event *ievent)
+{
+	struct hmm_event *fevent;
+	unsigned long wait_for = 0;
+
+again:
+	spin_lock(&hmm->lock);
+	list_for_each_entry(fevent, &hmm->device_faults, list) {
+		if (!hmm_event_overlap(fevent, ievent))
+			continue;
+		fevent->backoff = true;
+		wait_for = hmm->ndevice_faults;
+	}
+	spin_unlock(&hmm->lock);
+
+	if (wait_for > 0) {
+		wait_event(hmm->wait_queue, wait_for != hmm->ndevice_faults);
+		wait_for = 0;
+		goto again;
+	}
+}
+
 static void hmm_update(struct hmm *hmm, struct hmm_event *event)
 {
 	struct hmm_mirror *mirror;
@@ -175,6 +231,8 @@ static void hmm_update(struct hmm *hmm, struct hmm_event *event)
 	if (hmm->mm->hmm != hmm)
 		return;
 
+	hmm_wait_device_fault(hmm, event);
+
 again:
 	down_read(&hmm->rwsem);
 	hlist_for_each_entry(mirror, &hmm->mirrors, mlist)
@@ -186,6 +244,33 @@ again:
 			goto again;
 		}
 	up_read(&hmm->rwsem);
+
+	wake_up(&hmm->wait_queue);
+}
+
+static int hmm_mm_fault(struct hmm *hmm,
+			struct hmm_event *event,
+			struct vm_area_struct *vma,
+			unsigned long addr)
+{
+	unsigned flags = FAULT_FLAG_ALLOW_RETRY;
+	struct mm_struct *mm = vma->vm_mm;
+	int r;
+
+	flags |= (event->etype == HMM_DEVICE_WFAULT) ? FAULT_FLAG_WRITE : 0;
+	for (addr &= PAGE_MASK; addr < event->end; addr += PAGE_SIZE) {
+
+		r = handle_mm_fault(mm, vma, addr, flags);
+		if (r & VM_FAULT_RETRY)
+			return -EBUSY;
+		if (r & VM_FAULT_ERROR) {
+			if (r & VM_FAULT_OOM)
+				return -ENOMEM;
+			/* Same error code for all other cases. */
+			return -EFAULT;
+		}
+	}
+	return 0;
 }
 
 
@@ -228,6 +313,7 @@ static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm)
 	}
 	up_write(&hmm->rwsem);
 
+	wake_up(&hmm->wait_queue);
 	hmm_unref(hmm);
 }
 
@@ -416,6 +502,309 @@ static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
 	hmm_pt_iter_fini(&iter);
 }
 
+static inline bool hmm_mirror_is_dead(struct hmm_mirror *mirror)
+{
+	if (hlist_unhashed(&mirror->mlist) || list_empty(&mirror->dlist))
+		return true;
+	return false;
+}
+
+struct hmm_mirror_fault {
+	struct hmm_mirror	*mirror;
+	struct hmm_event	*event;
+	struct vm_area_struct	*vma;
+	unsigned long		addr;
+	struct hmm_pt_iter	*iter;
+};
+
+static int hmm_mirror_fault_hpmd(struct hmm_mirror *mirror,
+				 struct hmm_event *event,
+				 struct vm_area_struct *vma,
+				 struct hmm_pt_iter *iter,
+				 pmd_t *pmdp,
+				 struct hmm_mirror_fault *mirror_fault,
+				 unsigned long start,
+				 unsigned long end)
+{
+	struct page *page;
+	unsigned long addr, pfn;
+	unsigned flags = FOLL_TOUCH;
+	spinlock_t *ptl;
+	int ret;
+
+	ptl = pmd_lock(mirror->hmm->mm, pmdp);
+	if (unlikely(!pmd_trans_huge(*pmdp))) {
+		spin_unlock(ptl);
+		return -EAGAIN;
+	}
+	if (unlikely(pmd_trans_splitting(*pmdp))) {
+		spin_unlock(ptl);
+		wait_split_huge_page(vma->anon_vma, pmdp);
+		return -EAGAIN;
+	}
+	flags |= event->etype == HMM_DEVICE_WFAULT ? FOLL_WRITE : 0;
+	page = follow_trans_huge_pmd(vma, start, pmdp, flags);
+	pfn = page_to_pfn(page);
+	spin_unlock(ptl);
+
+	/* Just fault in the whole PMD. */
+	start &= PMD_MASK;
+	end = start + PMD_SIZE - 1;
+
+	if (!pmd_write(*pmdp) && event->etype == HMM_DEVICE_WFAULT)
+			return -ENOENT;
+
+	for (ret = 0, addr = start; !ret && addr < end;) {
+		unsigned long i, next = end;
+		dma_addr_t *hmm_pte;
+
+		hmm_pte = hmm_pt_iter_populate(iter, addr, &next);
+		if (!hmm_pte)
+			return -ENOMEM;
+
+		i = hmm_pt_index(&mirror->pt, addr, mirror->pt.llevel);
+
+		/*
+		 * The directory lock protect against concurrent clearing of
+		 * page table bit flags. Exceptions being the dirty bit and
+		 * the device driver private flags.
+		 */
+		hmm_pt_iter_directory_lock(iter);
+		do {
+			if (!hmm_pte_test_valid_pfn(&hmm_pte[i])) {
+				hmm_pte[i] = hmm_pte_from_pfn(pfn);
+				hmm_pt_iter_directory_ref(iter);
+			}
+			BUG_ON(hmm_pte_pfn(hmm_pte[i]) != pfn);
+			if (pmd_write(*pmdp))
+				hmm_pte_set_write(&hmm_pte[i]);
+		} while (addr += PAGE_SIZE, pfn++, i++, addr != next);
+		hmm_pt_iter_directory_unlock(iter);
+		mirror_fault->addr = addr;
+	}
+
+	return 0;
+}
+
+static int hmm_pte_hole(unsigned long addr,
+			unsigned long next,
+			struct mm_walk *walk)
+{
+	return -ENOENT;
+}
+
+static int hmm_mirror_fault_pmd(pmd_t *pmdp,
+				unsigned long start,
+				unsigned long end,
+				struct mm_walk *walk)
+{
+	struct hmm_mirror_fault *mirror_fault = walk->private;
+	struct hmm_mirror *mirror = mirror_fault->mirror;
+	struct hmm_event *event = mirror_fault->event;
+	struct hmm_pt_iter *iter = mirror_fault->iter;
+	bool write = (event->etype == HMM_DEVICE_WFAULT);
+	unsigned long addr;
+	int ret = 0;
+
+	/* Make sure there was no gap. */
+	if (start != mirror_fault->addr)
+		return -ENOENT;
+
+	if (event->backoff)
+		return -EAGAIN;
+
+	if (pmd_none(*pmdp))
+		return -ENOENT;
+
+	if (pmd_trans_huge(*pmdp))
+		return hmm_mirror_fault_hpmd(mirror, event, mirror_fault->vma,
+					     iter, pmdp, mirror_fault, start,
+					     end);
+
+	if (pmd_none_or_trans_huge_or_clear_bad(pmdp))
+		return -EFAULT;
+
+	for (ret = 0, addr = start; !ret && addr < end;) {
+		unsigned long i = 0, next = end;
+		dma_addr_t *hmm_pte;
+		pte_t *ptep;
+
+		hmm_pte = hmm_pt_iter_populate(iter, addr, &next);
+		if (!hmm_pte)
+			return -ENOMEM;
+
+		ptep = pte_offset_map(pmdp, start);
+		hmm_pt_iter_directory_lock(iter);
+		do {
+			if (!pte_present(*ptep) ||
+			    (write && !pte_write(*ptep)) ||
+			    pte_protnone(*ptep)) {
+				ret = -ENOENT;
+				ptep++;
+				break;
+			}
+
+			if (!hmm_pte_test_valid_pfn(&hmm_pte[i])) {
+				hmm_pte[i] = hmm_pte_from_pfn(pte_pfn(*ptep));
+				hmm_pt_iter_directory_ref(iter);
+			}
+			BUG_ON(hmm_pte_pfn(hmm_pte[i]) != pte_pfn(*ptep));
+			if (pte_write(*ptep))
+				hmm_pte_set_write(&hmm_pte[i]);
+		} while (addr += PAGE_SIZE, ptep++, i++, addr != next);
+		hmm_pt_iter_directory_unlock(iter);
+		pte_unmap(ptep - 1);
+		mirror_fault->addr = addr;
+	}
+
+	return ret;
+}
+
+static int hmm_mirror_handle_fault(struct hmm_mirror *mirror,
+				   struct hmm_event *event,
+				   struct vm_area_struct *vma,
+				   struct hmm_pt_iter *iter)
+{
+	struct hmm_mirror_fault mirror_fault;
+	unsigned long addr = event->start;
+	struct mm_walk walk = {0};
+	int ret = 0;
+
+	if ((event->etype == HMM_DEVICE_WFAULT) && !(vma->vm_flags & VM_WRITE))
+		return -EACCES;
+
+	ret = hmm_device_fault_start(mirror->hmm, event);
+	if (ret)
+		return ret;
+
+again:
+	if (event->backoff) {
+		ret = -EAGAIN;
+		goto out;
+	}
+	if (addr >= event->end)
+		goto out;
+
+	mirror_fault.event = event;
+	mirror_fault.mirror = mirror;
+	mirror_fault.vma = vma;
+	mirror_fault.addr = addr;
+	mirror_fault.iter = iter;
+	walk.mm = mirror->hmm->mm;
+	walk.private = &mirror_fault;
+	walk.pmd_entry = hmm_mirror_fault_pmd;
+	walk.pte_hole = hmm_pte_hole;
+	ret = walk_page_range(addr, event->end, &walk);
+	if (!ret) {
+		ret = mirror->device->ops->update(mirror, event);
+		if (!ret) {
+			addr = mirror_fault.addr;
+			goto again;
+		}
+	}
+
+out:
+	hmm_device_fault_end(mirror->hmm, event);
+	if (ret == -ENOENT) {
+		ret = hmm_mm_fault(mirror->hmm, event, vma, addr);
+		ret = ret ? ret : -EAGAIN;
+	}
+	return ret;
+}
+
+int hmm_mirror_fault(struct hmm_mirror *mirror, struct hmm_event *event)
+{
+	struct vm_area_struct *vma;
+	struct hmm_pt_iter iter;
+	int ret = 0;
+
+	mirror = hmm_mirror_ref(mirror);
+	if (!mirror)
+		return -ENODEV;
+	if (event->start >= mirror->hmm->vm_end) {
+		hmm_mirror_unref(&mirror);
+		return -EINVAL;
+	}
+	if (hmm_event_init(event, mirror->hmm, event->start,
+			   event->end, event->etype)) {
+		hmm_mirror_unref(&mirror);
+		return -EINVAL;
+	}
+	hmm_pt_iter_init(&iter, &mirror->pt);
+
+retry:
+	if (hmm_mirror_is_dead(mirror)) {
+		hmm_mirror_unref(&mirror);
+		return -ENODEV;
+	}
+
+	/*
+	 * Synchronization with the cpu page table is the most important and
+	 * tedious aspect of device page faults. There must be a strong
+	 * ordering between device->update() calls for device page faults and
+	 * device->update() calls for cpu page table invalidation/update.
+	 *
+	 * Pages that are exposed to the device driver must stay valid while
+	 * the callback is in progress, i.e. any cpu page table invalidation
+	 * that renders those pages obsolete must call device->update() after
+	 * the device->update() call that faulted those pages.
+	 *
+	 * To achieve this we rely on a few things. First, the mmap_sem
+	 * ensures that any munmap() syscall serializes with us. The remaining
+	 * issues are unmap_mapping_range() and page migration or merging. For
+	 * those, hmm keeps track of the affected range of addresses and
+	 * blocks device page faults that hit an overlapping range.
+	 */
+	down_read(&mirror->hmm->mm->mmap_sem);
+	vma = find_vma_intersection(mirror->hmm->mm, event->start, event->end);
+	if (!vma) {
+		ret = -EFAULT;
+		goto out;
+	}
+	if (vma->vm_start > event->start) {
+		event->end = vma->vm_start;
+		ret = -EFAULT;
+		goto out;
+	}
+	event->end = min(event->end, vma->vm_end) & PAGE_MASK;
+	if ((vma->vm_flags & (VM_IO | VM_PFNMAP | VM_MIXEDMAP | VM_HUGETLB))) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	switch (event->etype) {
+	case HMM_DEVICE_WFAULT:
+		if (!(vma->vm_flags & VM_WRITE)) {
+			ret = -EFAULT;
+			goto out;
+		}
+		/* fallthrough */
+	case HMM_DEVICE_RFAULT:
+		/* Handle the PROT_NONE case early on. */
+		if (!(vma->vm_flags & (VM_WRITE | VM_READ))) {
+			ret = -EFAULT;
+			goto out;
+		}
+		ret = hmm_mirror_handle_fault(mirror, event, vma, &iter);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+out:
+	/* Drop the mmap_sem so anyone waiting on it has a chance. */
+	if (ret != -EBUSY)
+		up_read(&mirror->hmm->mm->mmap_sem);
+	wake_up(&mirror->hmm->wait_queue);
+	if (ret == -EAGAIN)
+		goto retry;
+	hmm_pt_iter_fini(&iter);
+	hmm_mirror_unref(&mirror);
+	return ret;
+}
+EXPORT_SYMBOL(hmm_mirror_fault);
+
 /* hmm_mirror_register() - register mirror against current process for a device.
  *
  * @mirror: The mirror struct being registered.
-- 
2.4.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v11 09/15] HMM: add mm page table iterator helpers.
  2015-10-21 20:59 ` Jérôme Glisse
@ 2015-10-21 21:00   ` Jérôme Glisse
  -1 siblings, 0 replies; 42+ messages in thread
From: Jérôme Glisse @ 2015-10-21 21:00 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse

Inside the mmu_notifier callback we do not have access to the vma, nor
do we know which lock we are holding (the mmap semaphore or the
i_mmap_lock), so we cannot rely on the regular page table walk (nor do
we want to, as we have to be careful not to split huge pages).

So this patch introduces a helper to iterate over the cpu page table
content in an efficient way for the situation we are in, which is that
we know none of the page table entries can vanish from below us and
thus it is safe to walk the page table.

The only added value of the iterator is that it keeps the page table
entry level map across calls, which fits well with the HMM mirror page
table update code.
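
For context, a minimal sketch of how the iterator is meant to be used
from inside mm/hmm.c (the caller below is hypothetical; only the
mm_pt_iter_* helpers themselves are introduced by this patch):

	static void example_walk(struct mm_struct *mm,
				 unsigned long start, unsigned long end)
	{
		struct mm_pt_iter iter;
		unsigned long addr;

		/* Caller must hold mmap_sem or the relevant i_mmap_lock. */
		mm_pt_iter_init(&iter, mm);
		for (addr = start; addr < end; addr += PAGE_SIZE) {
			struct page *page = mm_pt_iter_page(&iter, addr);

			/* NULL means hole, special mapping or zero page. */
			if (!page)
				continue;
			/* ... use page, eg set_page_dirty(page) ... */
		}
		mm_pt_iter_fini(&iter);
	}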

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 mm/hmm.c | 95 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 95 insertions(+)

diff --git a/mm/hmm.c b/mm/hmm.c
index cad25d9..a3815bd 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -403,6 +403,101 @@ static struct mmu_notifier_ops hmm_notifier_ops = {
 };
 
 
+struct mm_pt_iter {
+	struct mm_struct	*mm;
+	pte_t			*ptep;
+	unsigned long		addr;
+};
+
+static void mm_pt_iter_init(struct mm_pt_iter *pt_iter, struct mm_struct *mm)
+{
+	pt_iter->mm = mm;
+	pt_iter->ptep = NULL;
+	pt_iter->addr = -1UL;
+}
+
+static void mm_pt_iter_fini(struct mm_pt_iter *pt_iter)
+{
+	pte_unmap(pt_iter->ptep);
+	pt_iter->ptep = NULL;
+	pt_iter->addr = -1UL;
+	pt_iter->mm = NULL;
+}
+
+static inline bool mm_pt_iter_in_range(struct mm_pt_iter *pt_iter,
+				       unsigned long addr)
+{
+	return (addr >= pt_iter->addr && addr < (pt_iter->addr + PMD_SIZE));
+}
+
+static struct page *mm_pt_iter_page(struct mm_pt_iter *pt_iter,
+				    unsigned long addr)
+{
+	pgd_t *pgdp;
+	pud_t *pudp;
+	pmd_t *pmdp;
+
+again:
+	/*
+	 * What we are doing here is only valid if we hold either the mmap
+	 * semaphore or the i_mmap_lock of the vma->address_space the address
+	 * belongs to. Sadly, because we can not easily get the vma struct,
+	 * we can not sanity check that either of those locks is taken.
+	 *
+	 * We have to rely on people using this code knowing what they do.
+	 */
+	if (mm_pt_iter_in_range(pt_iter, addr) && likely(pt_iter->ptep)) {
+		pte_t pte = *(pt_iter->ptep + pte_index(addr));
+		unsigned long pfn;
+
+		if (pte_none(pte) || !pte_present(pte))
+			return NULL;
+		if (unlikely(pte_special(pte)))
+			return NULL;
+
+		pfn = pte_pfn(pte);
+		if (is_zero_pfn(pfn))
+			return NULL;
+		return pfn_to_page(pfn);
+	}
+
+	if (pt_iter->ptep) {
+		pte_unmap(pt_iter->ptep);
+		pt_iter->ptep = NULL;
+		pt_iter->addr = -1UL;
+	}
+
+	pgdp = pgd_offset(pt_iter->mm, addr);
+	if (pgd_none_or_clear_bad(pgdp))
+		return NULL;
+	pudp = pud_offset(pgdp, addr);
+	if (pud_none_or_clear_bad(pudp))
+		return NULL;
+	pmdp = pmd_offset(pudp, addr);
+	/*
+	 * Because we hold either the mmap semaphore or the i_mmap_lock we know
+	 * that the pmd can not vanish from under us, thus if the pmd exists it
+	 * is either a huge page or a valid pmd. It might also be in the
+	 * transitory splitting state.
+	 */
+	if (pmd_none(*pmdp) || unlikely(pmd_bad(*pmdp)))
+		return NULL;
+	if (pmd_trans_splitting(*pmdp))
+		/*
+		 * FIXME: ideally we would wait, but we have no easy means to
+		 * get hold of the vma. So for now busy loop until the
+		 * splitting is done.
+		 */
+		goto again;
+	if (pmd_huge(*pmdp))
+		return pmd_page(*pmdp) + pte_index(addr);
+	/* Regular pmd and it can not morph. */
+	pt_iter->ptep = pte_offset_map(pmdp, addr & PMD_MASK);
+	pt_iter->addr = addr & PMD_MASK;
+	goto again;
+}
+
+
 /* hmm_mirror - per device mirroring functions.
  *
  * Each device that mirror a process has a uniq hmm_mirror struct. A process
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v11 10/15] HMM: use CPU page table during invalidation.
  2015-10-21 20:59 ` Jérôme Glisse
@ 2015-10-21 21:00   ` Jérôme Glisse
  -1 siblings, 0 replies; 42+ messages in thread
From: Jérôme Glisse @ 2015-10-21 21:00 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse

Once we store the dma mapping inside the secondary page table we can
no longer easily find the page backing an address. Instead, use the
cpu page table, which still has the proper information, except for the
invalidate_page() case, which is handled by using the page passed by
the mmu_notifier layer.
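
The lookup rule this change implements boils down to the following
(shown here as a hypothetical helper; the patch open-codes the same
logic inside hmm_mirror_update_pt()):

	/*
	 * Return the page backing @addr for dirty accounting. @page is the
	 * page handed to us by invalidate_page(), or NULL for range
	 * invalidations, in which case the cpu page table is consulted.
	 * May return NULL if the address is no longer backed by a page.
	 */
	static struct page *hmm_backing_page(struct mm_pt_iter *mm_iter,
					     struct page *page,
					     unsigned long addr)
	{
		return page ? page : mm_pt_iter_page(mm_iter, addr);
	}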

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 mm/hmm.c | 53 +++++++++++++++++++++++++++++++++++------------------
 1 file changed, 35 insertions(+), 18 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index a3815bd..86b8bc2 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -47,9 +47,11 @@
 static struct mmu_notifier_ops hmm_notifier_ops;
 static void hmm_mirror_kill(struct hmm_mirror *mirror);
 static inline int hmm_mirror_update(struct hmm_mirror *mirror,
-				    struct hmm_event *event);
+				    struct hmm_event *event,
+				    struct page *page);
 static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
-				 struct hmm_event *event);
+				 struct hmm_event *event,
+				 struct page *page);
 
 
 /* hmm_event - use to track information relating to an event.
@@ -223,7 +225,9 @@ again:
 	}
 }
 
-static void hmm_update(struct hmm *hmm, struct hmm_event *event)
+static void hmm_update(struct hmm *hmm,
+		       struct hmm_event *event,
+		       struct page *page)
 {
 	struct hmm_mirror *mirror;
 
@@ -236,7 +240,7 @@ static void hmm_update(struct hmm *hmm, struct hmm_event *event)
 again:
 	down_read(&hmm->rwsem);
 	hlist_for_each_entry(mirror, &hmm->mirrors, mlist)
-		if (hmm_mirror_update(mirror, event)) {
+		if (hmm_mirror_update(mirror, event, page)) {
 			mirror = hmm_mirror_ref(mirror);
 			up_read(&hmm->rwsem);
 			hmm_mirror_kill(mirror);
@@ -304,7 +308,7 @@ static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm)
 
 		/* Make sure everything is unmapped. */
 		hmm_event_init(&event, mirror->hmm, 0, -1UL, HMM_MUNMAP);
-		hmm_mirror_update(mirror, &event);
+		hmm_mirror_update(mirror, &event, NULL);
 
 		mirror->device->ops->release(mirror);
 		hmm_mirror_unref(&mirror);
@@ -338,9 +342,10 @@ static void hmm_mmu_mprot_to_etype(struct mm_struct *mm,
 	*etype = HMM_NONE;
 }
 
-static void hmm_notifier_invalidate_range_start(struct mmu_notifier *mn,
-					struct mm_struct *mm,
-					const struct mmu_notifier_range *range)
+static void hmm_notifier_invalidate(struct mmu_notifier *mn,
+				    struct mm_struct *mm,
+				    struct page *page,
+				    const struct mmu_notifier_range *range)
 {
 	struct hmm_event event;
 	unsigned long start = range->start, end = range->end;
@@ -379,7 +384,14 @@ static void hmm_notifier_invalidate_range_start(struct mmu_notifier *mn,
 
 	hmm_event_init(&event, hmm, start, end, event.etype);
 
-	hmm_update(hmm, &event);
+	hmm_update(hmm, &event, page);
+}
+
+static void hmm_notifier_invalidate_range_start(struct mmu_notifier *mn,
+					struct mm_struct *mm,
+					const struct mmu_notifier_range *range)
+{
+	hmm_notifier_invalidate(mn, mm, NULL, range);
 }
 
 static void hmm_notifier_invalidate_page(struct mmu_notifier *mn,
@@ -393,7 +405,7 @@ static void hmm_notifier_invalidate_page(struct mmu_notifier *mn,
 	range.start = addr & PAGE_MASK;
 	range.end = range.start + PAGE_SIZE;
 	range.event = mmu_event;
-	hmm_notifier_invalidate_range_start(mn, mm, &range);
+	hmm_notifier_invalidate(mn, mm, page, &range);
 }
 
 static struct mmu_notifier_ops hmm_notifier_ops = {
@@ -545,23 +557,27 @@ void hmm_mirror_unref(struct hmm_mirror **mirror)
 EXPORT_SYMBOL(hmm_mirror_unref);
 
 static inline int hmm_mirror_update(struct hmm_mirror *mirror,
-				    struct hmm_event *event)
+				    struct hmm_event *event,
+				    struct page *page)
 {
 	struct hmm_device *device = mirror->device;
 	int ret = 0;
 
 	ret = device->ops->update(mirror, event);
-	hmm_mirror_update_pt(mirror, event);
+	hmm_mirror_update_pt(mirror, event, page);
 	return ret;
 }
 
 static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
-				 struct hmm_event *event)
+				 struct hmm_event *event,
+				 struct page *page)
 {
 	unsigned long addr;
 	struct hmm_pt_iter iter;
+	struct mm_pt_iter mm_iter;
 
 	hmm_pt_iter_init(&iter, &mirror->pt);
+	mm_pt_iter_init(&mm_iter, mirror->hmm->mm);
 	for (addr = event->start; addr != event->end;) {
 		unsigned long next = event->end;
 		dma_addr_t *hmm_pte;
@@ -582,10 +598,10 @@ static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
 				continue;
 			if (hmm_pte_test_and_clear_dirty(hmm_pte) &&
 			    hmm_pte_test_write(hmm_pte)) {
-				struct page *page;
-
-				page = pfn_to_page(hmm_pte_pfn(*hmm_pte));
-				set_page_dirty(page);
+				page = page ? : mm_pt_iter_page(&mm_iter, addr);
+				if (page)
+					set_page_dirty(page);
+				page = NULL;
 			}
 			*hmm_pte &= event->pte_mask;
 			if (hmm_pte_test_valid_pfn(hmm_pte))
@@ -595,6 +611,7 @@ static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
 		hmm_pt_iter_directory_unlock(&iter);
 	}
 	hmm_pt_iter_fini(&iter);
+	mm_pt_iter_fini(&mm_iter);
 }
 
 static inline bool hmm_mirror_is_dead(struct hmm_mirror *mirror)
@@ -1000,7 +1017,7 @@ static void hmm_mirror_kill(struct hmm_mirror *mirror)
 
 		/* Make sure everything is unmapped. */
 		hmm_event_init(&event, mirror->hmm, 0, -1UL, HMM_MUNMAP);
-		hmm_mirror_update(mirror, &event);
+		hmm_mirror_update(mirror, &event, NULL);
 
 		device->ops->release(mirror);
 		hmm_mirror_unref(&mirror);
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v11 11/15] HMM: add discard range helper (to clear and free resources for a range).
  2015-10-21 20:59 ` Jérôme Glisse
@ 2015-10-21 21:00   ` Jérôme Glisse
  -1 siblings, 0 replies; 42+ messages in thread
From: Jérôme Glisse @ 2015-10-21 21:00 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse

A common use case is for a device driver to stop caring about a range
of addresses long before said range is munmapped by the userspace
program. To avoid having to keep track of such ranges, provide a helper
function that will free HMM resources for a range of addresses.

NOTE THAT THE DEVICE DRIVER MUST MAKE SURE THE HARDWARE WILL NO LONGER
ACCESS THE RANGE BEFORE CALLING THIS HELPER!
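
A sketch of the intended driver-side usage (the my_driver_* names are
hypothetical; only hmm_mirror_range_discard() is provided by this
patch):

	static void my_driver_drop_range(struct my_driver_mirror *dmirror,
					 unsigned long start,
					 unsigned long end)
	{
		/*
		 * 1) Quiesce the hardware so it can no longer access
		 *    [start, end) through its mirror of the address space.
		 */
		my_driver_invalidate_device_tlb(dmirror, start, end);

		/*
		 * 2) Only then is it safe to free the HMM resources for the
		 *    range (including any dma mapping).
		 */
		hmm_mirror_range_discard(&dmirror->mirror, start, end);
	}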

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 include/linux/hmm.h |  3 +++
 mm/hmm.c            | 24 ++++++++++++++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index d819ec9..10e1558 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -265,6 +265,9 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
 struct hmm_mirror *hmm_mirror_ref(struct hmm_mirror *mirror);
 void hmm_mirror_unref(struct hmm_mirror **mirror);
 int hmm_mirror_fault(struct hmm_mirror *mirror, struct hmm_event *event);
+void hmm_mirror_range_discard(struct hmm_mirror *mirror,
+			      unsigned long start,
+			      unsigned long end);
 
 
 #endif /* CONFIG_HMM */
diff --git a/mm/hmm.c b/mm/hmm.c
index 86b8bc2..9f88df8 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -917,6 +917,30 @@ out:
 }
 EXPORT_SYMBOL(hmm_mirror_fault);
 
+/* hmm_mirror_range_discard() - discard a range of addresses.
+ *
+ * @mirror: The mirror struct.
+ * @start: Start address of the range to discard (inclusive).
+ * @end: End address of the range to discard (exclusive).
+ *
+ * Call when the device driver wants to stop mirroring a range of addresses and
+ * free any HMM resources associated with that range (including dma mapping if any).
+ *
+ * THIS FUNCTION ASSUMES THAT THE DRIVER ALREADY STOPPED USING THE RANGE AND
+ * THUS DOES NOT PERFORM ANY SYNCHRONIZATION OR UPDATE WITH THE DRIVER TO
+ * INVALIDATE SAID RANGE.
+ */
+void hmm_mirror_range_discard(struct hmm_mirror *mirror,
+			      unsigned long start,
+			      unsigned long end)
+{
+	struct hmm_event event;
+
+	hmm_event_init(&event, mirror->hmm, start, end, HMM_MUNMAP);
+	hmm_mirror_update_pt(mirror, &event, NULL);
+}
+EXPORT_SYMBOL(hmm_mirror_range_discard);
+
 /* hmm_mirror_register() - register mirror against current process for a device.
  *
  * @mirror: The mirror struct being registered.
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v11 12/15] HMM: add dirty range helper (toggle dirty bit inside mirror page table) v2.
  2015-10-21 20:59 ` Jérôme Glisse
@ 2015-10-21 21:00   ` Jérôme Glisse
  -1 siblings, 0 replies; 42+ messages in thread
From: Jérôme Glisse @ 2015-10-21 21:00 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse

A device driver must properly toggle the dirty bit inside the mirror page
table so that dirtiness is properly accounted for when the core mm code
needs to know. Provide a simple helper to toggle that bit for a range of
addresses.
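
A sketch of the intended driver-side usage (the driver function is
hypothetical; only hmm_mirror_range_dirty() comes from this patch):

	/*
	 * Called once the device signals that a job which may have written
	 * through its mirror of [start, end) has completed. Only entries
	 * that are valid and writable actually get the dirty bit set.
	 */
	static void my_driver_job_done(struct hmm_mirror *mirror,
				       unsigned long start,
				       unsigned long end)
	{
		hmm_mirror_range_dirty(mirror, start, end);
	}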

Changed since v1:
  - Adapt to HMM page table changes.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 include/linux/hmm.h |  3 +++
 mm/hmm.c            | 38 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 10e1558..4bc132a 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -268,6 +268,9 @@ int hmm_mirror_fault(struct hmm_mirror *mirror, struct hmm_event *event);
 void hmm_mirror_range_discard(struct hmm_mirror *mirror,
 			      unsigned long start,
 			      unsigned long end);
+void hmm_mirror_range_dirty(struct hmm_mirror *mirror,
+			    unsigned long start,
+			    unsigned long end);
 
 
 #endif /* CONFIG_HMM */
diff --git a/mm/hmm.c b/mm/hmm.c
index 9f88df8..36cc506 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -941,6 +941,44 @@ void hmm_mirror_range_discard(struct hmm_mirror *mirror,
 }
 EXPORT_SYMBOL(hmm_mirror_range_discard);
 
+/* hmm_mirror_range_dirty() - toggle the dirty bit for a range of addresses.
+ *
+ * @mirror: The mirror struct.
+ * @start: Start address of the range to dirty (inclusive).
+ * @end: End address of the range to dirty (exclusive).
+ *
+ * Call when the device driver wants to toggle the dirty bit for a range of
+ * addresses. Useful when the device driver just wants to toggle the bit for a
+ * whole range without walking the mirror page table itself.
+ *
+ * Note this function does not directly dirty the page behind an address, but
+ * this will happen once the address is invalidated or discarded by the device
+ * driver or the core mm code.
+ */
+void hmm_mirror_range_dirty(struct hmm_mirror *mirror,
+			    unsigned long start,
+			    unsigned long end)
+{
+	struct hmm_pt_iter iter;
+	unsigned long addr;
+
+	hmm_pt_iter_init(&iter, &mirror->pt);
+	for (addr = start; addr != end;) {
+		unsigned long next = end;
+		dma_addr_t *hmm_pte;
+
+		hmm_pte = hmm_pt_iter_walk(&iter, &addr, &next);
+		for (; hmm_pte && addr != next; hmm_pte++, addr += PAGE_SIZE) {
+			if (!hmm_pte_test_valid_pfn(hmm_pte) ||
+			    !hmm_pte_test_write(hmm_pte))
+				continue;
+			hmm_pte_set_dirty(hmm_pte);
+		}
+	}
+	hmm_pt_iter_fini(&iter);
+}
+EXPORT_SYMBOL(hmm_mirror_range_dirty);
+
 /* hmm_mirror_register() - register mirror against current process for a device.
  *
  * @mirror: The mirror struct being registered.
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v11 13/15] HMM: DMA map memory on behalf of device driver v2.
  2015-10-21 20:59 ` Jérôme Glisse
@ 2015-10-21 21:00   ` Jérôme Glisse
  -1 siblings, 0 replies; 42+ messages in thread
From: Jérôme Glisse @ 2015-10-21 21:00 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse

Do the DMA mapping on behalf of the device, as HMM is a good place to
perform this common task. Moreover, in the future we hope to add new
infrastructure that would make DMA mapping more efficient (lower
overhead per page) by leveraging the HMM data structures.
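
For illustration, a sketch of the fault-path side of a driver's
->update() callback once HMM does the DMA mapping. This assumes the
hmm_pt_iter helpers are usable from driver code, and my_device_map_page()
is hypothetical; error handling and invalidation events are omitted:

	static int my_driver_update(struct hmm_mirror *mirror,
				    struct hmm_event *event)
	{
		struct hmm_pt_iter iter;
		unsigned long addr;

		hmm_pt_iter_init(&iter, &mirror->pt);
		for (addr = event->start; addr != event->end;) {
			unsigned long next = event->end;
			dma_addr_t *hmm_pte;

			hmm_pte = hmm_pt_iter_walk(&iter, &addr, &next);
			for (; hmm_pte && addr != next;
			     hmm_pte++, addr += PAGE_SIZE) {
				if (!hmm_pte_test_valid_dma(hmm_pte))
					continue;
				/* Program the device mmu with the dma address. */
				my_device_map_page(mirror, addr,
						   hmm_pte_dma_addr(*hmm_pte),
						   hmm_pte_test_write(hmm_pte));
			}
		}
		hmm_pt_iter_fini(&iter);
		return 0;
	}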

Changed since v1:
  - Adapt to HMM page table changes.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 include/linux/hmm_pt.h |  11 +++
 mm/hmm.c               | 202 +++++++++++++++++++++++++++++++++++++++----------
 2 files changed, 174 insertions(+), 39 deletions(-)

diff --git a/include/linux/hmm_pt.h b/include/linux/hmm_pt.h
index 4a8beb1..8a59a75 100644
--- a/include/linux/hmm_pt.h
+++ b/include/linux/hmm_pt.h
@@ -176,6 +176,17 @@ static inline dma_addr_t hmm_pte_from_pfn(dma_addr_t pfn)
 	return (pfn << PAGE_SHIFT) | (1 << HMM_PTE_VALID_PFN_BIT);
 }
 
+static inline dma_addr_t hmm_pte_from_dma_addr(dma_addr_t dma_addr)
+{
+	return (dma_addr & HMM_PTE_DMA_MASK) | (1 << HMM_PTE_VALID_DMA_BIT);
+}
+
+static inline dma_addr_t hmm_pte_dma_addr(dma_addr_t pte)
+{
+	/* FIXME Use max dma addr instead of 0 ? */
+	return hmm_pte_test_valid_dma(&pte) ? (pte & HMM_PTE_DMA_MASK) : 0;
+}
+
 static inline unsigned long hmm_pte_pfn(dma_addr_t pte)
 {
 	return hmm_pte_test_valid_pfn(&pte) ? pte >> PAGE_SHIFT : 0;
diff --git a/mm/hmm.c b/mm/hmm.c
index 36cc506..6ed1081 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -41,6 +41,7 @@
 #include <linux/mman.h>
 #include <linux/delay.h>
 #include <linux/workqueue.h>
+#include <linux/dma-mapping.h>
 
 #include "internal.h"
 
@@ -568,6 +569,46 @@ static inline int hmm_mirror_update(struct hmm_mirror *mirror,
 	return ret;
 }
 
+static void hmm_mirror_update_pte(struct hmm_mirror *mirror,
+				  struct hmm_event *event,
+				  struct hmm_pt_iter *iter,
+				  struct mm_pt_iter *mm_iter,
+				  struct page *page,
+				  dma_addr_t *hmm_pte,
+				  unsigned long addr)
+{
+	bool dirty = hmm_pte_test_and_clear_dirty(hmm_pte);
+
+	if (hmm_pte_test_valid_pfn(hmm_pte)) {
+		*hmm_pte &= event->pte_mask;
+		if (!hmm_pte_test_valid_pfn(hmm_pte))
+			hmm_pt_iter_directory_unref(iter);
+		goto out;
+	}
+
+	if (!hmm_pte_test_valid_dma(hmm_pte))
+		return;
+
+	if (!hmm_pte_test_valid_dma(&event->pte_mask)) {
+		struct device *dev = mirror->device->dev;
+		dma_addr_t dma_addr;
+
+		dma_addr = hmm_pte_dma_addr(*hmm_pte);
+		dma_unmap_page(dev, dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+	}
+
+	*hmm_pte &= event->pte_mask;
+	if (!hmm_pte_test_valid_dma(hmm_pte))
+		hmm_pt_iter_directory_unref(iter);
+
+out:
+	if (dirty) {
+		page = page ? : mm_pt_iter_page(mm_iter, addr);
+		if (page)
+			set_page_dirty(page);
+	}
+}
+
 static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
 				 struct hmm_event *event,
 				 struct page *page)
@@ -594,19 +635,9 @@ static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
 		 */
 		hmm_pt_iter_directory_lock(&iter);
 		do {
-			if (!hmm_pte_test_valid_pfn(hmm_pte))
-				continue;
-			if (hmm_pte_test_and_clear_dirty(hmm_pte) &&
-			    hmm_pte_test_write(hmm_pte)) {
-				page = page ? : mm_pt_iter_page(&mm_iter, addr);
-				if (page)
-					set_page_dirty(page);
-				page = NULL;
-			}
-			*hmm_pte &= event->pte_mask;
-			if (hmm_pte_test_valid_pfn(hmm_pte))
-				continue;
-			hmm_pt_iter_directory_unref(&iter);
+			hmm_mirror_update_pte(mirror, event, &iter, &mm_iter,
+					      page, hmm_pte, addr);
+			page = NULL;
 		} while (addr += PAGE_SIZE, hmm_pte++, addr != next);
 		hmm_pt_iter_directory_unlock(&iter);
 	}
@@ -683,6 +714,9 @@ static int hmm_mirror_fault_hpmd(struct hmm_mirror *mirror,
 		 */
 		hmm_pt_iter_directory_lock(iter);
 		do {
+			if (hmm_pte_test_valid_dma(&hmm_pte[i]))
+				continue;
+
 			if (!hmm_pte_test_valid_pfn(&hmm_pte[i])) {
 				hmm_pte[i] = hmm_pte_from_pfn(pfn);
 				hmm_pt_iter_directory_ref(iter);
@@ -756,6 +790,9 @@ static int hmm_mirror_fault_pmd(pmd_t *pmdp,
 				break;
 			}
 
+			if (hmm_pte_test_valid_dma(&hmm_pte[i]))
+				continue;
+
 			if (!hmm_pte_test_valid_pfn(&hmm_pte[i])) {
 				hmm_pte[i] = hmm_pte_from_pfn(pte_pfn(*ptep));
 				hmm_pt_iter_directory_ref(iter);
@@ -772,6 +809,80 @@ static int hmm_mirror_fault_pmd(pmd_t *pmdp,
 	return ret;
 }
 
+static int hmm_mirror_dma_map(struct hmm_mirror *mirror,
+			      struct hmm_pt_iter *iter,
+			      unsigned long start,
+			      unsigned long end)
+{
+	struct device *dev = mirror->device->dev;
+	unsigned long addr;
+	int ret;
+
+	for (ret = 0, addr = start; !ret && addr < end;) {
+		unsigned long i = 0, next = end;
+		dma_addr_t *hmm_pte;
+
+		hmm_pte = hmm_pt_iter_populate(iter, addr, &next);
+		if (!hmm_pte)
+			return -ENOENT;
+
+		do {
+			dma_addr_t dma_addr, pte;
+			struct page *page;
+
+again:
+			pte = ACCESS_ONCE(hmm_pte[i]);
+			if (!hmm_pte_test_valid_pfn(&pte)) {
+				if (!hmm_pte_test_valid_dma(&pte)) {
+					ret = -ENOENT;
+					break;
+				}
+				continue;
+			}
+
+			page = pfn_to_page(hmm_pte_pfn(pte));
+			VM_BUG_ON(!page);
+			dma_addr = dma_map_page(dev, page, 0, PAGE_SIZE,
+						DMA_BIDIRECTIONAL);
+			if (dma_mapping_error(dev, dma_addr)) {
+				ret = -ENOMEM;
+				break;
+			}
+
+			hmm_pt_iter_directory_lock(iter);
+			/*
+			 * Make sure we transfer the dirty bit. Note that there
+			 * might still be a window for another thread to set
+			 * the dirty bit before we check for pte equality. This
+			 * will just lead to a useless retry so it is not the
+			 * end of the world here.
+			 */
+			if (hmm_pte_test_dirty(&hmm_pte[i]))
+				hmm_pte_set_dirty(&pte);
+			if (ACCESS_ONCE(hmm_pte[i]) != pte) {
+				hmm_pt_iter_directory_unlock(iter);
+				dma_unmap_page(dev, dma_addr, PAGE_SIZE,
+					       DMA_BIDIRECTIONAL);
+				if (hmm_pte_test_valid_pfn(&pte))
+					goto again;
+				if (!hmm_pte_test_valid_dma(&pte)) {
+					ret = -ENOENT;
+					break;
+				}
+			} else {
+				hmm_pte[i] = hmm_pte_from_dma_addr(dma_addr);
+				if (hmm_pte_test_write(&pte))
+					hmm_pte_set_write(&hmm_pte[i]);
+				if (hmm_pte_test_dirty(&pte))
+					hmm_pte_set_dirty(&hmm_pte[i]);
+				hmm_pt_iter_directory_unlock(iter);
+			}
+		} while (addr += PAGE_SIZE, i++, addr != next && !ret);
+	}
+
+	return ret;
+}
+
 static int hmm_mirror_handle_fault(struct hmm_mirror *mirror,
 				   struct hmm_event *event,
 				   struct vm_area_struct *vma,
@@ -780,7 +891,7 @@ static int hmm_mirror_handle_fault(struct hmm_mirror *mirror,
 	struct hmm_mirror_fault mirror_fault;
 	unsigned long addr = event->start;
 	struct mm_walk walk = {0};
-	int ret = 0;
+	int ret;
 
 	if ((event->etype == HMM_DEVICE_WFAULT) && !(vma->vm_flags & VM_WRITE))
 		return -EACCES;
@@ -789,33 +900,45 @@ static int hmm_mirror_handle_fault(struct hmm_mirror *mirror,
 	if (ret)
 		return ret;
 
-again:
-	if (event->backoff) {
-		ret = -EAGAIN;
-		goto out;
-	}
-	if (addr >= event->end)
-		goto out;
+	do {
+		if (event->backoff) {
+			ret = -EAGAIN;
+			break;
+		}
+		if (addr >= event->end)
+			break;
+
+		mirror_fault.event = event;
+		mirror_fault.mirror = mirror;
+		mirror_fault.vma = vma;
+		mirror_fault.addr = addr;
+		mirror_fault.iter = iter;
+		walk.mm = mirror->hmm->mm;
+		walk.private = &mirror_fault;
+		walk.pmd_entry = hmm_mirror_fault_pmd;
+		walk.pte_hole = hmm_pte_hole;
+		ret = walk_page_range(addr, event->end, &walk);
+		if (ret)
+			break;
+
+		if (event->backoff) {
+			ret = -EAGAIN;
+			break;
+		}
 
-	mirror_fault.event = event;
-	mirror_fault.mirror = mirror;
-	mirror_fault.vma = vma;
-	mirror_fault.addr = addr;
-	mirror_fault.iter = iter;
-	walk.mm = mirror->hmm->mm;
-	walk.private = &mirror_fault;
-	walk.pmd_entry = hmm_mirror_fault_pmd;
-	walk.pte_hole = hmm_pte_hole;
-	ret = walk_page_range(addr, event->end, &walk);
-	if (!ret) {
-		ret = mirror->device->ops->update(mirror, event);
-		if (!ret) {
-			addr = mirror_fault.addr;
-			goto again;
+		if (mirror->device->dev) {
+			ret = hmm_mirror_dma_map(mirror, iter,
+						 addr, event->end);
+			if (ret)
+				break;
 		}
-	}
 
-out:
+		ret = mirror->device->ops->update(mirror, event);
+		if (ret)
+			break;
+		addr = mirror_fault.addr;
+	} while (1);
+
 	hmm_device_fault_end(mirror->hmm, event);
 	if (ret == -ENOENT) {
 		ret = hmm_mm_fault(mirror->hmm, event, vma, addr);
@@ -969,7 +1092,8 @@ void hmm_mirror_range_dirty(struct hmm_mirror *mirror,
 
 		hmm_pte = hmm_pt_iter_walk(&iter, &addr, &next);
 		for (; hmm_pte && addr != next; hmm_pte++, addr += PAGE_SIZE) {
-			if (!hmm_pte_test_valid_pfn(hmm_pte) ||
+			if ((!hmm_pte_test_valid_pfn(hmm_pte) &&
+			     !hmm_pte_test_valid_dma(hmm_pte)) ||
 			    !hmm_pte_test_write(hmm_pte))
 				continue;
 			hmm_pte_set_dirty(hmm_pte);
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v11 13/15] HMM: DMA map memory on behalf of device driver v2.
@ 2015-10-21 21:00   ` Jérôme Glisse
  0 siblings, 0 replies; 42+ messages in thread
From: Jérôme Glisse @ 2015-10-21 21:00 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse

Do the DMA mapping on behalf of the device as HMM is a good place
to perform this common task. Moreover in the future we hope to
add new infrastructure that would make DMA mapping more efficient
(lower overhead per page) by leveraging HMM data structure.

Changed since v1:
  - Adapt to HMM page table changes.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
---
 include/linux/hmm_pt.h |  11 +++
 mm/hmm.c               | 202 +++++++++++++++++++++++++++++++++++++++----------
 2 files changed, 174 insertions(+), 39 deletions(-)

diff --git a/include/linux/hmm_pt.h b/include/linux/hmm_pt.h
index 4a8beb1..8a59a75 100644
--- a/include/linux/hmm_pt.h
+++ b/include/linux/hmm_pt.h
@@ -176,6 +176,17 @@ static inline dma_addr_t hmm_pte_from_pfn(dma_addr_t pfn)
 	return (pfn << PAGE_SHIFT) | (1 << HMM_PTE_VALID_PFN_BIT);
 }
 
+static inline dma_addr_t hmm_pte_from_dma_addr(dma_addr_t dma_addr)
+{
+	return (dma_addr & HMM_PTE_DMA_MASK) | (1 << HMM_PTE_VALID_DMA_BIT);
+}
+
+static inline dma_addr_t hmm_pte_dma_addr(dma_addr_t pte)
+{
+	/* FIXME Use max dma addr instead of 0 ? */
+	return hmm_pte_test_valid_dma(&pte) ? (pte & HMM_PTE_DMA_MASK) : 0;
+}
+
 static inline unsigned long hmm_pte_pfn(dma_addr_t pte)
 {
 	return hmm_pte_test_valid_pfn(&pte) ? pte >> PAGE_SHIFT : 0;
diff --git a/mm/hmm.c b/mm/hmm.c
index 36cc506..6ed1081 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -41,6 +41,7 @@
 #include <linux/mman.h>
 #include <linux/delay.h>
 #include <linux/workqueue.h>
+#include <linux/dma-mapping.h>
 
 #include "internal.h"
 
@@ -568,6 +569,46 @@ static inline int hmm_mirror_update(struct hmm_mirror *mirror,
 	return ret;
 }
 
+static void hmm_mirror_update_pte(struct hmm_mirror *mirror,
+				  struct hmm_event *event,
+				  struct hmm_pt_iter *iter,
+				  struct mm_pt_iter *mm_iter,
+				  struct page *page,
+				  dma_addr_t *hmm_pte,
+				  unsigned long addr)
+{
+	bool dirty = hmm_pte_test_and_clear_dirty(hmm_pte);
+
+	if (hmm_pte_test_valid_pfn(hmm_pte)) {
+		*hmm_pte &= event->pte_mask;
+		if (!hmm_pte_test_valid_pfn(hmm_pte))
+			hmm_pt_iter_directory_unref(iter);
+		goto out;
+	}
+
+	if (!hmm_pte_test_valid_dma(hmm_pte))
+		return;
+
+	if (!hmm_pte_test_valid_dma(&event->pte_mask)) {
+		struct device *dev = mirror->device->dev;
+		dma_addr_t dma_addr;
+
+		dma_addr = hmm_pte_dma_addr(*hmm_pte);
+		dma_unmap_page(dev, dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+	}
+
+	*hmm_pte &= event->pte_mask;
+	if (!hmm_pte_test_valid_dma(hmm_pte))
+		hmm_pt_iter_directory_unref(iter);
+
+out:
+	if (dirty) {
+		page = page ? : mm_pt_iter_page(mm_iter, addr);
+		if (page)
+			set_page_dirty(page);
+	}
+}
+
 static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
 				 struct hmm_event *event,
 				 struct page *page)
@@ -594,19 +635,9 @@ static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
 		 */
 		hmm_pt_iter_directory_lock(&iter);
 		do {
-			if (!hmm_pte_test_valid_pfn(hmm_pte))
-				continue;
-			if (hmm_pte_test_and_clear_dirty(hmm_pte) &&
-			    hmm_pte_test_write(hmm_pte)) {
-				page = page ? : mm_pt_iter_page(&mm_iter, addr);
-				if (page)
-					set_page_dirty(page);
-				page = NULL;
-			}
-			*hmm_pte &= event->pte_mask;
-			if (hmm_pte_test_valid_pfn(hmm_pte))
-				continue;
-			hmm_pt_iter_directory_unref(&iter);
+			hmm_mirror_update_pte(mirror, event, &iter, &mm_iter,
+					      page, hmm_pte, addr);
+			page = NULL;
 		} while (addr += PAGE_SIZE, hmm_pte++, addr != next);
 		hmm_pt_iter_directory_unlock(&iter);
 	}
@@ -683,6 +714,9 @@ static int hmm_mirror_fault_hpmd(struct hmm_mirror *mirror,
 		 */
 		hmm_pt_iter_directory_lock(iter);
 		do {
+			if (hmm_pte_test_valid_dma(&hmm_pte[i]))
+				continue;
+
 			if (!hmm_pte_test_valid_pfn(&hmm_pte[i])) {
 				hmm_pte[i] = hmm_pte_from_pfn(pfn);
 				hmm_pt_iter_directory_ref(iter);
@@ -756,6 +790,9 @@ static int hmm_mirror_fault_pmd(pmd_t *pmdp,
 				break;
 			}
 
+			if (hmm_pte_test_valid_dma(&hmm_pte[i]))
+				continue;
+
 			if (!hmm_pte_test_valid_pfn(&hmm_pte[i])) {
 				hmm_pte[i] = hmm_pte_from_pfn(pte_pfn(*ptep));
 				hmm_pt_iter_directory_ref(iter);
@@ -772,6 +809,80 @@ static int hmm_mirror_fault_pmd(pmd_t *pmdp,
 	return ret;
 }
 
+static int hmm_mirror_dma_map(struct hmm_mirror *mirror,
+			      struct hmm_pt_iter *iter,
+			      unsigned long start,
+			      unsigned long end)
+{
+	struct device *dev = mirror->device->dev;
+	unsigned long addr;
+	int ret;
+
+	for (ret = 0, addr = start; !ret && addr < end;) {
+		unsigned long i = 0, next = end;
+		dma_addr_t *hmm_pte;
+
+		hmm_pte = hmm_pt_iter_populate(iter, addr, &next);
+		if (!hmm_pte)
+			return -ENOENT;
+
+		do {
+			dma_addr_t dma_addr, pte;
+			struct page *page;
+
+again:
+			pte = ACCESS_ONCE(hmm_pte[i]);
+			if (!hmm_pte_test_valid_pfn(&pte)) {
+				if (!hmm_pte_test_valid_dma(&pte)) {
+					ret = -ENOENT;
+					break;
+				}
+				continue;
+			}
+
+			page = pfn_to_page(hmm_pte_pfn(pte));
+			VM_BUG_ON(!page);
+			dma_addr = dma_map_page(dev, page, 0, PAGE_SIZE,
+						DMA_BIDIRECTIONAL);
+			if (dma_mapping_error(dev, dma_addr)) {
+				ret = -ENOMEM;
+				break;
+			}
+
+			hmm_pt_iter_directory_lock(iter);
+			/*
+			 * Make sure we transfer the dirty bit. Note that there
+			 * might still be a window for another thread to set
+			 * the dirty bit before we check for pte equality. This
+			 * will just lead to a useless retry so it is not the
+			 * end of the world here.
+			 */
+			if (hmm_pte_test_dirty(&hmm_pte[i]))
+				hmm_pte_set_dirty(&pte);
+			if (ACCESS_ONCE(hmm_pte[i]) != pte) {
+				hmm_pt_iter_directory_unlock(iter);
+				dma_unmap_page(dev, dma_addr, PAGE_SIZE,
+					       DMA_BIDIRECTIONAL);
+				if (hmm_pte_test_valid_pfn(&pte))
+					goto again;
+				if (!hmm_pte_test_valid_dma(&pte)) {
+					ret = -ENOENT;
+					break;
+				}
+			} else {
+				hmm_pte[i] = hmm_pte_from_dma_addr(dma_addr);
+				if (hmm_pte_test_write(&pte))
+					hmm_pte_set_write(&hmm_pte[i]);
+				if (hmm_pte_test_dirty(&pte))
+					hmm_pte_set_dirty(&hmm_pte[i]);
+				hmm_pt_iter_directory_unlock(iter);
+			}
+		} while (addr += PAGE_SIZE, i++, addr != next && !ret);
+	}
+
+	return ret;
+}
+
 static int hmm_mirror_handle_fault(struct hmm_mirror *mirror,
 				   struct hmm_event *event,
 				   struct vm_area_struct *vma,
@@ -780,7 +891,7 @@ static int hmm_mirror_handle_fault(struct hmm_mirror *mirror,
 	struct hmm_mirror_fault mirror_fault;
 	unsigned long addr = event->start;
 	struct mm_walk walk = {0};
-	int ret = 0;
+	int ret;
 
 	if ((event->etype == HMM_DEVICE_WFAULT) && !(vma->vm_flags & VM_WRITE))
 		return -EACCES;
@@ -789,33 +900,45 @@ static int hmm_mirror_handle_fault(struct hmm_mirror *mirror,
 	if (ret)
 		return ret;
 
-again:
-	if (event->backoff) {
-		ret = -EAGAIN;
-		goto out;
-	}
-	if (addr >= event->end)
-		goto out;
+	do {
+		if (event->backoff) {
+			ret = -EAGAIN;
+			break;
+		}
+		if (addr >= event->end)
+			break;
+
+		mirror_fault.event = event;
+		mirror_fault.mirror = mirror;
+		mirror_fault.vma = vma;
+		mirror_fault.addr = addr;
+		mirror_fault.iter = iter;
+		walk.mm = mirror->hmm->mm;
+		walk.private = &mirror_fault;
+		walk.pmd_entry = hmm_mirror_fault_pmd;
+		walk.pte_hole = hmm_pte_hole;
+		ret = walk_page_range(addr, event->end, &walk);
+		if (ret)
+			break;
+
+		if (event->backoff) {
+			ret = -EAGAIN;
+			break;
+		}
 
-	mirror_fault.event = event;
-	mirror_fault.mirror = mirror;
-	mirror_fault.vma = vma;
-	mirror_fault.addr = addr;
-	mirror_fault.iter = iter;
-	walk.mm = mirror->hmm->mm;
-	walk.private = &mirror_fault;
-	walk.pmd_entry = hmm_mirror_fault_pmd;
-	walk.pte_hole = hmm_pte_hole;
-	ret = walk_page_range(addr, event->end, &walk);
-	if (!ret) {
-		ret = mirror->device->ops->update(mirror, event);
-		if (!ret) {
-			addr = mirror_fault.addr;
-			goto again;
+		if (mirror->device->dev) {
+			ret = hmm_mirror_dma_map(mirror, iter,
+						 addr, event->end);
+			if (ret)
+				break;
 		}
-	}
 
-out:
+		ret = mirror->device->ops->update(mirror, event);
+		if (ret)
+			break;
+		addr = mirror_fault.addr;
+	} while (1);
+
 	hmm_device_fault_end(mirror->hmm, event);
 	if (ret == -ENOENT) {
 		ret = hmm_mm_fault(mirror->hmm, event, vma, addr);
@@ -969,7 +1092,8 @@ void hmm_mirror_range_dirty(struct hmm_mirror *mirror,
 
 		hmm_pte = hmm_pt_iter_walk(&iter, &addr, &next);
 		for (; hmm_pte && addr != next; hmm_pte++, addr += PAGE_SIZE) {
-			if (!hmm_pte_test_valid_pfn(hmm_pte) ||
+			if ((!hmm_pte_test_valid_pfn(hmm_pte) &&
+			     !hmm_pte_test_valid_dma(hmm_pte)) ||
 			    !hmm_pte_test_write(hmm_pte))
 				continue;
 			hmm_pte_set_dirty(hmm_pte);
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v11 14/15] HMM: Add support for hugetlb.
  2015-10-21 20:59 ` Jérôme Glisse
@ 2015-10-21 21:00   ` Jérôme Glisse
  -1 siblings, 0 replies; 42+ messages in thread
From: Jérôme Glisse @ 2015-10-21 21:00 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse

Support hugetlb vma almost like any other vma. The exception is that we
will not support migration of hugetlb memory.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 mm/hmm.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 61 insertions(+), 1 deletion(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 6ed1081..9e5017a 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -809,6 +809,65 @@ static int hmm_mirror_fault_pmd(pmd_t *pmdp,
 	return ret;
 }
 
+static int hmm_mirror_fault_hugetlb_entry(pte_t *ptep,
+					  unsigned long hmask,
+					  unsigned long addr,
+					  unsigned long end,
+					  struct mm_walk *walk)
+{
+#ifdef CONFIG_HUGETLB_PAGE
+	struct hmm_mirror_fault *mirror_fault = walk->private;
+	struct hmm_event *event = mirror_fault->event;
+	struct hmm_pt_iter *iter = mirror_fault->iter;
+	bool write = (event->etype == HMM_DEVICE_WFAULT);
+	unsigned long pfn, next;
+	dma_addr_t *hmm_pte;
+	pte_t pte;
+
+	/*
+	 * Hugepages under user process are always in RAM and never
+	 * swapped out, but theoretically it needs to be checked.
+	 */
+	if (!ptep)
+		return -ENOENT;
+
+	pte = huge_ptep_get(ptep);
+	pfn = pte_pfn(pte);
+	if (huge_pte_none(pte) || (write && !huge_pte_write(pte)))
+		return -ENOENT;
+
+	hmm_pte = hmm_pt_iter_populate(iter, addr, &next);
+	if (!hmm_pte)
+		return -ENOMEM;
+	hmm_pt_iter_directory_lock(iter);
+	for (; addr != end; addr += PAGE_SIZE, ++pfn, ++hmm_pte) {
+		/* Switch to another HMM page table directory ? */
+		if (addr == next) {
+			hmm_pt_iter_directory_unlock(iter);
+			hmm_pte = hmm_pt_iter_populate(iter, addr, &next);
+			if (!hmm_pte)
+				return -ENOMEM;
+			hmm_pt_iter_directory_lock(iter);
+		}
+
+		if (hmm_pte_test_valid_dma(hmm_pte))
+			continue;
+
+		if (!hmm_pte_test_valid_pfn(hmm_pte)) {
+			*hmm_pte = hmm_pte_from_pfn(pfn);
+			hmm_pt_iter_directory_ref(iter);
+		}
+		BUG_ON(hmm_pte_pfn(*hmm_pte) != pfn);
+		if (write)
+			hmm_pte_set_write(hmm_pte);
+	}
+	hmm_pt_iter_directory_unlock(iter);
+#else
+	BUG();
+#endif
+	return 0;
+}
+
 static int hmm_mirror_dma_map(struct hmm_mirror *mirror,
 			      struct hmm_pt_iter *iter,
 			      unsigned long start,
@@ -916,6 +975,7 @@ static int hmm_mirror_handle_fault(struct hmm_mirror *mirror,
 		walk.mm = mirror->hmm->mm;
 		walk.private = &mirror_fault;
 		walk.pmd_entry = hmm_mirror_fault_pmd;
+		walk.hugetlb_entry = hmm_mirror_fault_hugetlb_entry;
 		walk.pte_hole = hmm_pte_hole;
 		ret = walk_page_range(addr, event->end, &walk);
 		if (ret)
@@ -1002,7 +1062,7 @@ retry:
 		goto out;
 	}
 	event->end = min(event->end, vma->vm_end) & PAGE_MASK;
-	if ((vma->vm_flags & (VM_IO | VM_PFNMAP | VM_MIXEDMAP | VM_HUGETLB))) {
+	if ((vma->vm_flags & (VM_IO | VM_PFNMAP | VM_MIXEDMAP))) {
 		ret = -EFAULT;
 		goto out;
 	}
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v11 14/15] HMM: Add support for hugetlb.
@ 2015-10-21 21:00   ` Jérôme Glisse
  0 siblings, 0 replies; 42+ messages in thread
From: Jérôme Glisse @ 2015-10-21 21:00 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse

Support hugetlb vma almost like any other vma. The exception is that we
will not support migration of hugetlb memory.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 mm/hmm.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 61 insertions(+), 1 deletion(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 6ed1081..9e5017a 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -809,6 +809,65 @@ static int hmm_mirror_fault_pmd(pmd_t *pmdp,
 	return ret;
 }
 
+static int hmm_mirror_fault_hugetlb_entry(pte_t *ptep,
+					  unsigned long hmask,
+					  unsigned long addr,
+					  unsigned long end,
+					  struct mm_walk *walk)
+{
+#ifdef CONFIG_HUGETLB_PAGE
+	struct hmm_mirror_fault *mirror_fault = walk->private;
+	struct hmm_event *event = mirror_fault->event;
+	struct hmm_pt_iter *iter = mirror_fault->iter;
+	bool write = (event->etype == HMM_DEVICE_WFAULT);
+	unsigned long pfn, next;
+	dma_addr_t *hmm_pte;
+	pte_t pte;
+
+	/*
+	 * Hugepages under user process are always in RAM and never
+	 * swapped out, but theoretically it needs to be checked.
+	 */
+	if (!ptep)
+		return -ENOENT;
+
+	pte = huge_ptep_get(ptep);
+	pfn = pte_pfn(pte);
+	if (huge_pte_none(pte) || (write && !huge_pte_write(pte)))
+		return -ENOENT;
+
+	hmm_pte = hmm_pt_iter_populate(iter, addr, &next);
+	if (!hmm_pte)
+		return -ENOMEM;
+	hmm_pt_iter_directory_lock(iter);
+	for (; addr != end; addr += PAGE_SIZE, ++pfn, ++hmm_pte) {
+		/* Switch to another HMM page table directory ? */
+		if (addr == next) {
+			hmm_pt_iter_directory_unlock(iter);
+			hmm_pte = hmm_pt_iter_populate(iter, addr, &next);
+			if (!hmm_pte)
+				return -ENOMEM;
+			hmm_pt_iter_directory_lock(iter);
+		}
+
+		if (hmm_pte_test_valid_dma(hmm_pte))
+			continue;
+
+		if (!hmm_pte_test_valid_pfn(hmm_pte)) {
+			*hmm_pte = hmm_pte_from_pfn(pfn);
+			hmm_pt_iter_directory_ref(iter);
+		}
+		BUG_ON(hmm_pte_pfn(*hmm_pte) != pfn);
+		if (write)
+			hmm_pte_set_write(hmm_pte);
+	}
+	hmm_pt_iter_directory_unlock(iter);
+#else
+	BUG();
+#endif
+	return 0;
+}
+
 static int hmm_mirror_dma_map(struct hmm_mirror *mirror,
 			      struct hmm_pt_iter *iter,
 			      unsigned long start,
@@ -916,6 +975,7 @@ static int hmm_mirror_handle_fault(struct hmm_mirror *mirror,
 		walk.mm = mirror->hmm->mm;
 		walk.private = &mirror_fault;
 		walk.pmd_entry = hmm_mirror_fault_pmd;
+		walk.hugetlb_entry = hmm_mirror_fault_hugetlb_entry;
 		walk.pte_hole = hmm_pte_hole;
 		ret = walk_page_range(addr, event->end, &walk);
 		if (ret)
@@ -1002,7 +1062,7 @@ retry:
 		goto out;
 	}
 	event->end = min(event->end, vma->vm_end) & PAGE_MASK;
-	if ((vma->vm_flags & (VM_IO | VM_PFNMAP | VM_MIXEDMAP | VM_HUGETLB))) {
+	if ((vma->vm_flags & (VM_IO | VM_PFNMAP | VM_MIXEDMAP))) {
 		ret = -EFAULT;
 		goto out;
 	}
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v11 15/15] HMM: add documentation explaining HMM internals and how to use it.
  2015-10-21 20:59 ` Jérôme Glisse
@ 2015-10-21 21:00   ` Jérôme Glisse
  -1 siblings, 0 replies; 42+ messages in thread
From: Jérôme Glisse @ 2015-10-21 21:00 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse

This adds documentation on how HMM works and gives a more in-depth view of
how it should be used by device driver writers.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 Documentation/vm/hmm.txt | 219 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 219 insertions(+)
 create mode 100644 Documentation/vm/hmm.txt

diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.txt
new file mode 100644
index 0000000..febed50
--- /dev/null
+++ b/Documentation/vm/hmm.txt
@@ -0,0 +1,219 @@
+Heterogeneous Memory Management (HMM)
+-------------------------------------
+
+The raison d'être of HMM is to provide a common API for device driver that
+wants to mirror a process address space on there device and/or migrate system
+memory to device memory. Device driver can decide to only use one aspect of
+HMM (mirroring or memory migration), for instance some device can directly
+access process address space through hardware (for instance PCIe ATS/PASID),
+but still want to benefit from memory migration capabilities that HMM offer.
+
+While HMM rely on existing kernel infrastructure (namely mmu_notifier) some
+of its features (memory migration, atomic access) require integration with
+core mm kernel code. Having HMM as the common intermediary is more appealing
+than having each device driver hooking itself inside the common mm code.
+
+Moreover HMM as a layer allows integration with DMA API or page reclaimation.
+
+
+Mirroring address space on the device:
+--------------------------------------
+
+Device that can't directly access transparently the process address space, need
+to mirror the CPU page table into there own page table. HMM helps to keep the
+device page table synchronize with the CPU page table. It is not expected that
+the device will fully mirror the CPU page table but only mirror region that are
+actively accessed by the device. For that reasons HMM only helps populating and
+synchronizing device page table for range that the device driver explicitly ask
+for.
+
+Mirroring address space inside the device page table is easy with HMM :
+
+  /* Create a mirror for the current process for your device. */
+  your_hmm_mirror->hmm_mirror.device = your_hmm_device;
+  hmm_mirror_register(&your_hmm_mirror->hmm_mirror);
+
+  ...
+
+  /* Mirror memory (in read mode) between addressA and addressB */
+  your_hmm_event->hmm_event.start = addressA;
+  your_hmm_event->hmm_event.end = addressB;
+  your_hmm_event->hmm_event.etype = HMM_DEVICE_RFAULT;
+  hmm_mirror_fault(&your_hmm_mirror->hmm_mirror, &your_hmm_event->hmm_event);
+    /* HMM callback into your driver with the >update() callback. During the
+     * callback use the HMM page table to populate the device page table. You
+     * can only use the HMM page table to populate the device page table for
+     * the specified range during the >update() callback, at any other point in
+     * time the HMM page table content should be assume to be undefined.
+     */
+    your_hmm_device->update(mirror, event);
+
+  ...
+
+  /* Process is quiting or device done stop the mirroring and cleanup. */
+  hmm_mirror_unregister(&your_hmm_mirror->hmm_mirror);
+  /* Device driver can free your_hmm_mirror */
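+
+The your_hmm_device used above must already be registered with HMM, once per
+physical device, before any mirror can point to it. A minimal sketch of that
+one time setup follows. It assumes the registration entry point is
+hmm_device_register(), the counterpart of hmm_device_unregister() described
+below, that the callback table is named hmm_device_ops, and it uses
+placeholder your_*() names :
+
+  /* Assumed ops table name; see hmm.h of this series for the real layout. */
+  static const struct hmm_device_ops your_hmm_ops = {
+    .update         = your_update,
+    .copy_to_device = your_copy_to_device,
+  };
+
+  /* Leave dev NULL if HMM should not DMA map pages on your behalf. */
+  your_hmm_device->hmm_device.dev = your_struct_device;
+  your_hmm_device->hmm_device.ops = &your_hmm_ops;
+  hmm_device_register(&your_hmm_device->hmm_device);
+
+  ...
+
+  /* On driver teardown, after every mirror has been unregistered. */
+  hmm_device_unregister(&your_hmm_device->hmm_device);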
+
+
+HMM mirror page table:
+----------------------
+
+Each hmm_mirror object is associated with a mirror page table that HMM keeps
+synchronize with the CPU page table by using the mmu_notifier API. HMM is using
+its own generic page table format because it needs to store DMA address, which
+are bigger than long on some architecture, and have more flags per entry than
+radix tree allows.
+
+The HMM page table mostly mirror x86 page table layout. A page holds a global
+directory and each entry points to a lower level directory. Unlike regular CPU
+page table, directory level are more aggressively freed and remove from the HMM
+mirror page table. This means device driver needs to use the HMM helpers and to
+follow directive on when and how to access the mirror page table. HMM use the
+per page spinlock of directory page to synchronize update of directory ie update
+can happen on different directory concurently.
+
+As a rules the mirror page table can only be accessed by device driver from one
+of the HMM device callback. Any access from outside a callback is illegal and
+gives undertimed result.
+
+Accessing the mirror page table from a device callback needs to use the HMM
+page table helpers. A loop to access entry for a range of address looks like :
+
+  /* Initialize a HMM page table iterator. */
+  struct hmm_pt_iter iter;
+  hmm_pt_iter_init(&iter, &mirror->pt)
+
+  /* Get pointer to HMM page table entry for a given address. */
+  dma_addr_t *hmm_pte;
+  hmm_pte = hmm_pt_iter_walk(&iter, &addr, &next);
+
+If there is no valid entry directory for given range address then hmm_pte is
+NULL. If there is a valid entry directory then you can access the hmm_pte and
+the pointer will stay valid as long as you do not call hmm_pt_iter_walk() with
+the same iter struct for a different address or call hmm_pt_iter_fini().
+
+While the HMM page table entry pointer stays valid you can only modify the
+value it is pointing to by using one of HMM helpers (hmm_pte_*()) as other
+threads might be updating the same entry concurrently. The device driver only
+need to update an HMM page table entry to set the dirty bit, so driver should
+only be using hmm_pte_set_dirty().
+
+Similarly to extract information the device driver should use one of the helper
+like hmm_pte_dma_addr() or hmm_pte_pfn() (if HMM is not doing DMA mapping which
+is a device driver at initialization parameter).
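+
+Putting these helpers together, a sketch of a loop that walks a whole range
+and only sets the dirty bit on valid writable entries could look like the
+following. This is illustrative usage only : it assumes addr and next are
+unsigned long, that hmm_pt_iter_walk() advances next past holes, and that the
+iterator is torn down with hmm_pt_iter_fini().
+
+  struct hmm_pt_iter iter;
+  unsigned long addr, next;
+  dma_addr_t *hmm_pte;
+
+  hmm_pt_iter_init(&iter, &mirror->pt);
+  for (addr = start; addr < end; addr = next) {
+    /* NULL means there is no valid directory for this chunk, skip ahead. */
+    hmm_pte = hmm_pt_iter_walk(&iter, &addr, &next);
+    for (; hmm_pte && addr != next; hmm_pte++, addr += PAGE_SIZE) {
+      if ((!hmm_pte_test_valid_pfn(hmm_pte) &&
+           !hmm_pte_test_valid_dma(hmm_pte)) ||
+          !hmm_pte_test_write(hmm_pte))
+        continue;
+      /* Only ever update an entry through the hmm_pte_*() helpers. */
+      hmm_pte_set_dirty(hmm_pte);
+    }
+  }
+  /* Assumed teardown counterpart of hmm_pt_iter_init(). */
+  hmm_pt_iter_fini(&iter);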
+
+
+Migrating system memory to device memory:
+-----------------------------------------
+
+Device like discret GPU often have there own local memory which offer bigger
+bandwidth and smaller latency than access to system memory for the GPU. This
+local memory is not necessarily accessible by the CPU. Device local memory will
+remain revealent for the foreseeable future as bandwidth of GPU memory keep
+increasing faster than bandwidth of system memory and as latency of PCIe does
+not decrease.
+
+Thus to maximize use of device like GPU, program need to use the device memory.
+Userspace API wants to make this as transparent as it can be, so that there is
+no need for complex modification of applications.
+
+Transparent use of device memory for range of address of a process require core
+mm code modifications. Adding a new memory zone for devices memory did not make
+sense given that such memory is often only accessible by the device only. This
+is why we decided to use a special kind of swap, migrated memory is mark as a
+special swap entry inside the CPU page table.
+
+While HMM handles the migration process, it does not decide what range or when
+to migrate memory. The decision to perform such migration is under the control
+of the device driver. Migration back to system memory happens either because
+the CPU try to access the memory or because device driver decided to migrate
+the memory back.
+
+
+  /* Migrate system memory between addressA and addressB to device memory. */
+  your_hmm_event->hmm_event.start = addressA;
+  your_hmm_event->hmm_event.end = addressB;
+  your_hmm_event->hmm_event.etype = HMM_COPY_TO_DEVICE;
+  hmm_mirror_fault(&your_hmm_mirror->hmm_mirror, &your_hmm_event->hmm_event);
+    /* HMM callback into your driver with the >copy_to_device() callback.
+     * Device driver must allocate device memory, DMA system memory to device
+     * memory, update the device page table to point to device memory and
+     * return. See hmm.h for details instructions and how failure are handled.
+     */
+    your_hmm_device->copy_to_device(mirror, event, dst, addressA, addressB);
+
+
+Right now HMM only support migrating anonymous private memory. Migration of
+share memory and more generaly file mapped memory is on the road map.
+
+
+Locking consideration and overall design:
+-----------------------------------------
+
+As a rule HMM will handle proper locking on the behalf of the device driver,
+as such device driver does not need to take any mm lock before calling into
+the HMM code.
+
+HMM is also responsible of the hmm_device and hmm_mirror object lifetime. The
+device driver can only free those after calling hmm_device_unregister() or
+hmm_mirror_unregister() respectively.
+
+All the lock inside any of the HMM structure should never be use by the device
+driver. They are intended to be use only and only by HMM code. Below is short
+description of the 3 main locks that exist for HMM internal use. Educational
+purpose only.
+
+Each process mm has one and only one struct hmm associated with it. Each hmm
+struct can be use by several different mirror. There is one and only one mirror
+per mm and device pair. So in essence the hmm struct is the core that dispatch
+everything to every single mirror, each of them corresponding to a specific
+device. The list of mirror for an hmm struct is protected by a semaphore as it
+sees mostly read access.
+
+Each time a device fault a range of address it calls hmm_mirror_fault(), HMM
+keeps track, inside the hmm struct, of each range currently being faulted. It
+does that so it can synchronize with any CPU page table update. If there is a
+CPU page table update then a callback through mmu_notifier will happen and HMM
+will try to interrupt the device page fault that conflict (ie address range
+overlap with the range being updated) and wait for them to back off. This
+insure that at no point in time the device driver see transient page table
+information. The list of active fault is protected by a spinlock, query on
+that list should be short and quick (we haven't gather enough statistic on
+that side yet to have a good idea of the average access pattern).
+
+Each device driver wanting to use HMM must register one and only one hmm_device
+struct per physical device with HMM. The hmm_device struct have pointer to the
+device driver call back and keeps track of active mirrors for a given device.
+The active mirrors list is protected by a spinlock.
+
+
+Future work:
+------------
+
+Improved atomic access by the device to system memory. Some platform bus (PCIe)
+offer limited number of atomic memory operations, some platform do not even
+have any kind of atomic memory operations by a device. In order to allow such
+atomic operation we want to map page read only the CPU while the device perform
+its operation. For this we need a new case inside the CPU write fault code path
+to synchronize with the device.
+
+We want to allow program to lock a range of memory inside device memory and
+forbid CPU access while the memory is lock inside the device. Any CPU access
+to locked range would result in SIGBUS. We think that madvise() would be the
+right syscall into which we could plug that feature.
+
+In order to minimize kernel memory consumption and overhead of DMA mapping, we
+want to introduce new DMA API that allows to manage mapping on IOMMU directory
+page basis. This would allow to map/unmap/update DMA mapping in bulk and
+minimize IOMMU update and flushing overhead. Moreover this would allow to
+improve IOMMU bad access reporting for DMA address inside those directory.
+
+Because update to the device page table might require "heavy" synchronization
+with the device, the mmu_notifier callback might have to sleep while HMM is
+waiting for the device driver to report device page table update completion.
+This is especialy bad if this happens during page reclaimation, this might
+bring the system to pause. We want to mitigate this, either by maintaining a
+new intermediate lru level in which we put pages actively mirrored by a device
+or by some other mecanism. For time being we advice that device driver that
+use HMM explicitly explain this corner case so that user are aware that this
+can happens if there is memory pressure.
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v11 15/15] HMM: add documentation explaining HMM internals and how to use it.
@ 2015-10-21 21:00   ` Jérôme Glisse
  0 siblings, 0 replies; 42+ messages in thread
From: Jérôme Glisse @ 2015-10-21 21:00 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Jérôme Glisse

This adds documentation on how HMM works and gives a more in-depth view of
how it should be used by device driver writers.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 Documentation/vm/hmm.txt | 219 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 219 insertions(+)
 create mode 100644 Documentation/vm/hmm.txt

diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.txt
new file mode 100644
index 0000000..febed50
--- /dev/null
+++ b/Documentation/vm/hmm.txt
@@ -0,0 +1,219 @@
+Heterogeneous Memory Management (HMM)
+-------------------------------------
+
+The raison d'être of HMM is to provide a common API for device driver that
+wants to mirror a process address space on there device and/or migrate system
+memory to device memory. Device driver can decide to only use one aspect of
+HMM (mirroring or memory migration), for instance some device can directly
+access process address space through hardware (for instance PCIe ATS/PASID),
+but still want to benefit from memory migration capabilities that HMM offer.
+
+While HMM rely on existing kernel infrastructure (namely mmu_notifier) some
+of its features (memory migration, atomic access) require integration with
+core mm kernel code. Having HMM as the common intermediary is more appealing
+than having each device driver hooking itself inside the common mm code.
+
+Moreover HMM as a layer allows integration with DMA API or page reclaimation.
+
+
+Mirroring address space on the device:
+--------------------------------------
+
+Device that can't directly access transparently the process address space, need
+to mirror the CPU page table into there own page table. HMM helps to keep the
+device page table synchronize with the CPU page table. It is not expected that
+the device will fully mirror the CPU page table but only mirror region that are
+actively accessed by the device. For that reasons HMM only helps populating and
+synchronizing device page table for range that the device driver explicitly ask
+for.
+
+Mirroring address space inside the device page table is easy with HMM :
+
+  /* Create a mirror for the current process for your device. */
+  your_hmm_mirror->hmm_mirror.device = your_hmm_device;
+  hmm_mirror_register(&your_hmm_mirror->hmm_mirror);
+
+  ...
+
+  /* Mirror memory (in read mode) between addressA and addressB */
+  your_hmm_event->hmm_event.start = addressA;
+  your_hmm_event->hmm_event.end = addressB;
+  your_hmm_event->hmm_event.etype = HMM_DEVICE_RFAULT;
+  hmm_mirror_fault(&your_hmm_mirror->hmm_mirror, &your_hmm_event->hmm_event);
+    /* HMM callback into your driver with the >update() callback. During the
+     * callback use the HMM page table to populate the device page table. You
+     * can only use the HMM page table to populate the device page table for
+     * the specified range during the >update() callback, at any other point in
+     * time the HMM page table content should be assume to be undefined.
+     */
+    your_hmm_device->update(mirror, event);
+
+  ...
+
+  /* Process is quiting or device done stop the mirroring and cleanup. */
+  hmm_mirror_unregister(&your_hmm_mirror->hmm_mirror);
+  /* Device driver can free your_hmm_mirror */
+
+
+HMM mirror page table:
+----------------------
+
+Each hmm_mirror object is associated with a mirror page table that HMM keeps
+synchronize with the CPU page table by using the mmu_notifier API. HMM is using
+its own generic page table format because it needs to store DMA address, which
+are bigger than long on some architecture, and have more flags per entry than
+radix tree allows.
+
+The HMM page table mostly mirror x86 page table layout. A page holds a global
+directory and each entry points to a lower level directory. Unlike regular CPU
+page table, directory level are more aggressively freed and remove from the HMM
+mirror page table. This means device driver needs to use the HMM helpers and to
+follow directive on when and how to access the mirror page table. HMM use the
+per page spinlock of directory page to synchronize update of directory ie update
+can happen on different directory concurently.
+
+As a rules the mirror page table can only be accessed by device driver from one
+of the HMM device callback. Any access from outside a callback is illegal and
+gives undertimed result.
+
+Accessing the mirror page table from a device callback needs to use the HMM
+page table helpers. A loop to access entry for a range of address looks like :
+
+  /* Initialize a HMM page table iterator. */
+  struct hmm_pt_iter iter;
+  hmm_pt_iter_init(&iter, &mirror->pt)
+
+  /* Get pointer to HMM page table entry for a given address. */
+  dma_addr_t *hmm_pte;
+  hmm_pte = hmm_pt_iter_walk(&iter, &addr, &next);
+
+If there is no valid entry directory for given range address then hmm_pte is
+NULL. If there is a valid entry directory then you can access the hmm_pte and
+the pointer will stay valid as long as you do not call hmm_pt_iter_walk() with
+the same iter struct for a different address or call hmm_pt_iter_fini().
+
+While the HMM page table entry pointer stays valid you can only modify the
+value it is pointing to by using one of HMM helpers (hmm_pte_*()) as other
+threads might be updating the same entry concurrently. The device driver only
+need to update an HMM page table entry to set the dirty bit, so driver should
+only be using hmm_pte_set_dirty().
+
+Similarly to extract information the device driver should use one of the helper
+like hmm_pte_dma_addr() or hmm_pte_pfn() (if HMM is not doing DMA mapping which
+is a device driver at initialization parameter).
+
+
+Migrating system memory to device memory:
+-----------------------------------------
+
+Device like discret GPU often have there own local memory which offer bigger
+bandwidth and smaller latency than access to system memory for the GPU. This
+local memory is not necessarily accessible by the CPU. Device local memory will
+remain revealent for the foreseeable future as bandwidth of GPU memory keep
+increasing faster than bandwidth of system memory and as latency of PCIe does
+not decrease.
+
+Thus to maximize use of device like GPU, program need to use the device memory.
+Userspace API wants to make this as transparent as it can be, so that there is
+no need for complex modification of applications.
+
+Transparent use of device memory for range of address of a process require core
+mm code modifications. Adding a new memory zone for devices memory did not make
+sense given that such memory is often only accessible by the device only. This
+is why we decided to use a special kind of swap, migrated memory is mark as a
+special swap entry inside the CPU page table.
+
+While HMM handles the migration process, it does not decide what range or when
+to migrate memory. The decision to perform such migration is under the control
+of the device driver. Migration back to system memory happens either because
+the CPU try to access the memory or because device driver decided to migrate
+the memory back.
+
+
+  /* Migrate system memory between addressA and addressB to device memory. */
+  your_hmm_event->hmm_event.start = addressA;
+  your_hmm_event->hmm_event.end = addressB;
+  your_hmm_event->hmm_event.etype = HMM_COPY_TO_DEVICE;
+  hmm_mirror_fault(&your_hmm_mirror->hmm_mirror, &your_hmm_event->hmm_event);
+    /* HMM callback into your driver with the >copy_to_device() callback.
+     * Device driver must allocate device memory, DMA system memory to device
+     * memory, update the device page table to point to device memory and
+     * return. See hmm.h for details instructions and how failure are handled.
+     */
+    your_hmm_device->copy_to_device(mirror, event, dst, addressA, addressB);
+
+
+Right now HMM only support migrating anonymous private memory. Migration of
+share memory and more generaly file mapped memory is on the road map.
+
+
+Locking consideration and overall design:
+-----------------------------------------
+
+As a rule HMM will handle proper locking on the behalf of the device driver,
+as such device driver does not need to take any mm lock before calling into
+the HMM code.
+
+HMM is also responsible of the hmm_device and hmm_mirror object lifetime. The
+device driver can only free those after calling hmm_device_unregister() or
+hmm_mirror_unregister() respectively.
+
+All the lock inside any of the HMM structure should never be use by the device
+driver. They are intended to be use only and only by HMM code. Below is short
+description of the 3 main locks that exist for HMM internal use. Educational
+purpose only.
+
+Each process mm has one and only one struct hmm associated with it. Each hmm
+struct can be use by several different mirror. There is one and only one mirror
+per mm and device pair. So in essence the hmm struct is the core that dispatch
+everything to every single mirror, each of them corresponding to a specific
+device. The list of mirror for an hmm struct is protected by a semaphore as it
+sees mostly read access.
+
+Each time a device fault a range of address it calls hmm_mirror_fault(), HMM
+keeps track, inside the hmm struct, of each range currently being faulted. It
+does that so it can synchronize with any CPU page table update. If there is a
+CPU page table update then a callback through mmu_notifier will happen and HMM
+will try to interrupt the device page fault that conflict (ie address range
+overlap with the range being updated) and wait for them to back off. This
+insure that at no point in time the device driver see transient page table
+information. The list of active fault is protected by a spinlock, query on
+that list should be short and quick (we haven't gather enough statistic on
+that side yet to have a good idea of the average access pattern).
+
+Each device driver wanting to use HMM must register one and only one hmm_device
+struct per physical device with HMM. The hmm_device struct have pointer to the
+device driver call back and keeps track of active mirrors for a given device.
+The active mirrors list is protected by a spinlock.
+
+
+Future work:
+------------
+
+Improved atomic access by the device to system memory. Some platform bus (PCIe)
+offer limited number of atomic memory operations, some platform do not even
+have any kind of atomic memory operations by a device. In order to allow such
+atomic operation we want to map page read only the CPU while the device perform
+its operation. For this we need a new case inside the CPU write fault code path
+to synchronize with the device.
+
+We want to allow program to lock a range of memory inside device memory and
+forbid CPU access while the memory is lock inside the device. Any CPU access
+to locked range would result in SIGBUS. We think that madvise() would be the
+right syscall into which we could plug that feature.
+
+In order to minimize kernel memory consumption and overhead of DMA mapping, we
+want to introduce new DMA API that allows to manage mapping on IOMMU directory
+page basis. This would allow to map/unmap/update DMA mapping in bulk and
+minimize IOMMU update and flushing overhead. Moreover this would allow to
+improve IOMMU bad access reporting for DMA address inside those directory.
+
+Because update to the device page table might require "heavy" synchronization
+with the device, the mmu_notifier callback might have to sleep while HMM is
+waiting for the device driver to report device page table update completion.
+This is especialy bad if this happens during page reclaimation, this might
+bring the system to pause. We want to mitigate this, either by maintaining a
+new intermediate lru level in which we put pages actively mirrored by a device
+or by some other mecanism. For time being we advice that device driver that
+use HMM explicitly explain this corner case so that user are aware that this
+can happens if there is memory pressure.
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v11 15/15] HMM: add documentation explaining HMM internals and how to use it.
  2015-10-21 21:00   ` Jérôme Glisse
@ 2015-10-22  3:23     ` Randy Dunlap
  -1 siblings, 0 replies; 42+ messages in thread
From: Randy Dunlap @ 2015-10-22  3:23 UTC (permalink / raw)
  To: Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher

Hi,

Some corrections and a few questions...

On 10/21/15 14:00, Jérôme Glisse wrote:
> This adds documentation on how HMM works and gives a more in-depth view of
> how it should be used by device driver writers.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> ---
>  Documentation/vm/hmm.txt | 219 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 219 insertions(+)
>  create mode 100644 Documentation/vm/hmm.txt
> 
> diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.txt
> new file mode 100644
> index 0000000..febed50
> --- /dev/null
> +++ b/Documentation/vm/hmm.txt
> @@ -0,0 +1,219 @@
> +Heterogeneous Memory Management (HMM)
> +-------------------------------------
> +
> +The raison d'être of HMM is to provide a common API for device driver that

                                                                    drivers

> +wants to mirror a process address space on there device and/or migrate system

   want                                       their

> +memory to device memory. Device driver can decide to only use one aspect of

                                   drivers

> +HMM (mirroring or memory migration), for instance some device can directly
> +access process address space through hardware (for instance PCIe ATS/PASID),
> +but still want to benefit from memory migration capabilities that HMM offer.
> +
> +While HMM rely on existing kernel infrastructure (namely mmu_notifier) some

             relies

> +of its features (memory migration, atomic access) require integration with
> +core mm kernel code. Having HMM as the common intermediary is more appealing

        MM

> +than having each device driver hooking itself inside the common mm code.

                                                                   MM

> +
> +Moreover HMM as a layer allows integration with DMA API or page reclaimation.

                                                                   reclamation.

> +
> +
> +Mirroring address space on the device:
> +--------------------------------------
> +
> +Device that can't directly access transparently the process address space, need
> +to mirror the CPU page table into there own page table. HMM helps to keep the

                                     their

> +device page table synchronize with the CPU page table. It is not expected that

                     synchronized

> +the device will fully mirror the CPU page table but only mirror region that are

                                                                   regions

> +actively accessed by the device. For that reasons HMM only helps populating and

                                             reason

> +synchronizing device page table for range that the device driver explicitly ask

                                       ranges                                  asks

or is only one range supported?


> +for.
> +
> +Mirroring address space inside the device page table is easy with HMM :

                                                                     HMM:

> +
> +  /* Create a mirror for the current process for your device. */
> +  your_hmm_mirror->hmm_mirror.device = your_hmm_device;
> +  hmm_mirror_register(&your_hmm_mirror->hmm_mirror);
> +
> +  ...
> +
> +  /* Mirror memory (in read mode) between addressA and addressB */
> +  your_hmm_event->hmm_event.start = addressA;
> +  your_hmm_event->hmm_event.end = addressB;

Multiple events (ranges) can be specified?
Is hmm_event.end (addressB) included or excluded from the range?

> +  your_hmm_event->hmm_event.etype = HMM_DEVICE_RFAULT;
> +  hmm_mirror_fault(&your_hmm_mirror->hmm_mirror, &your_hmm_event->hmm_event);
> +    /* HMM callback into your driver with the >update() callback. During the
> +     * callback use the HMM page table to populate the device page table. You
> +     * can only use the HMM page table to populate the device page table for
> +     * the specified range during the >update() callback, at any other point in
> +     * time the HMM page table content should be assume to be undefined.

                                                    assumed

> +     */
> +    your_hmm_device->update(mirror, event);
> +
> +  ...
> +
> +  /* Process is quiting or device done stop the mirroring and cleanup. */

                   quitting or device done; stop

> +  hmm_mirror_unregister(&your_hmm_mirror->hmm_mirror);
> +  /* Device driver can free your_hmm_mirror */
> +
> +
> +HMM mirror page table:
> +----------------------
> +
> +Each hmm_mirror object is associated with a mirror page table that HMM keeps
> +synchronize with the CPU page table by using the mmu_notifier API. HMM is using

   synchronized

> +its own generic page table format because it needs to store DMA address, which

                                                                   adresses,

> +are bigger than long on some architecture, and have more flags per entry than

                                architectures,

> +radix tree allows.
> +
> +The HMM page table mostly mirror x86 page table layout. A page holds a global

                             mirrors

> +directory and each entry points to a lower level directory. Unlike regular CPU
> +page table, directory level are more aggressively freed and remove from the HMM

        tables,          levels                                removed

> +mirror page table. This means device driver needs to use the HMM helpers and to

                                        drivers need

> +follow directive on when and how to access the mirror page table. HMM use the

                                                                         uses

> +per page spinlock of directory page to synchronize update of directory ie update

                                  pages                         directory, i.e.,

> +can happen on different directory concurently.

                                     concurrently.

> +
> +As a rules the mirror page table can only be accessed by device driver from one

        rule                                             by a device driver

> +of the HMM device callback. Any access from outside a callback is illegal and

                     callbacks.

> +gives undertimed result.

         undetermined
or       undefined

> +
> +Accessing the mirror page table from a device callback needs to use the HMM
> +page table helpers. A loop to access entry for a range of address looks like :

                                        entries              addresses looks like:

> +
> +  /* Initialize a HMM page table iterator. */

                   an HMM

> +  struct hmm_pt_iter iter;
> +  hmm_pt_iter_init(&iter, &mirror->pt)
> +
> +  /* Get pointer to HMM page table entry for a given address. */
> +  dma_addr_t *hmm_pte;
> +  hmm_pte = hmm_pt_iter_walk(&iter, &addr, &next);

what are 'addr' and 'next'? (types)

> +
> +If there is no valid entry directory for given range address then hmm_pte is
> +NULL. If there is a valid entry directory then you can access the hmm_pte and
> +the pointer will stay valid as long as you do not call hmm_pt_iter_walk() with
> +the same iter struct for a different address or call hmm_pt_iter_fini().
> +
> +While the HMM page table entry pointer stays valid you can only modify the
> +value it is pointing to by using one of HMM helpers (hmm_pte_*()) as other
> +threads might be updating the same entry concurrently. The device driver only
> +need to update an HMM page table entry to set the dirty bit, so driver should

   needs                                                           drivers

> +only be using hmm_pte_set_dirty().
> +
> +Similarly to extract information the device driver should use one of the helper

                                                                            helpers

> +like hmm_pte_dma_addr() or hmm_pte_pfn() (if HMM is not doing DMA mapping which
> +is a device driver at initialization parameter).
> +
> +
> +Migrating system memory to device memory:
> +-----------------------------------------
> +
> +Device like discret GPU often have there own local memory which offer bigger

   Devices     discrete GPUs          their

> +bandwidth and smaller latency than access to system memory for the GPU. This
> +local memory is not necessarily accessible by the CPU. Device local memory will
> +remain revealent for the foreseeable future as bandwidth of GPU memory keep

          relevant                                                        keeps

> +increasing faster than bandwidth of system memory and as latency of PCIe does
> +not decrease.
> +
> +Thus to maximize use of device like GPU, program need to use the device memory.

                           devices like GPUs, programs

> +Userspace API wants to make this as transparent as it can be, so that there is
> +no need for complex modification of applications.
> +
> +Transparent use of device memory for range of address of a process require core

                                                                      requires

> +mm code modifications. Adding a new memory zone for devices memory did not make

   MM                                                  device

> +sense given that such memory is often only accessible by the device only. This
> +is why we decided to use a special kind of swap, migrated memory is mark as a

                                              swap;                    marked

> +special swap entry inside the CPU page table.
> +
> +While HMM handles the migration process, it does not decide what range or when
> +to migrate memory. The decision to perform such migration is under the control
> +of the device driver. Migration back to system memory happens either because
> +the CPU try to access the memory or because device driver decided to migrate

           tries

> +the memory back.
> +
> +
> +  /* Migrate system memory between addressA and addressB to device memory. */
> +  your_hmm_event->hmm_event.start = addressA;
> +  your_hmm_event->hmm_event.end = addressB;

is hmm_event.end (addressB) inclusive or exclusive?
i.e., is it end_of_copy + 1?
i.e., is the size of the copy addressB - addressA or
      addressB - addressA + 1?
i.e., is addressB = addressA + size
or is    addressB = addressA + size - 1

In my experience it is usually better to have a start_address and size
instead of start_address and end_address.

> +  your_hmm_event->hmm_event.etype = HMM_COPY_TO_DEVICE;
> +  hmm_mirror_fault(&your_hmm_mirror->hmm_mirror, &your_hmm_event->hmm_event);
> +    /* HMM callback into your driver with the >copy_to_device() callback.
> +     * Device driver must allocate device memory, DMA system memory to device
> +     * memory, update the device page table to point to device memory and
> +     * return. See hmm.h for details instructions and how failure are handled.

                                detailed                     failures

> +     */
> +    your_hmm_device->copy_to_device(mirror, event, dst, addressA, addressB);
> +
> +
> +Right now HMM only support migrating anonymous private memory. Migration of

                      supports

> +share memory and more generaly file mapped memory is on the road map.

   shared                generally

> +
> +
> +Locking consideration and overall design:
> +-----------------------------------------
> +
> +As a rule HMM will handle proper locking on the behalf of the device driver,
> +as such device driver does not need to take any mm lock before calling into

                                                   MM

> +the HMM code.
> +
> +HMM is also responsible of the hmm_device and hmm_mirror object lifetime. The

                           for

> +device driver can only free those after calling hmm_device_unregister() or
> +hmm_mirror_unregister() respectively.
> +
> +All the lock inside any of the HMM structure should never be use by the device

           locks                      structures

> +driver. They are intended to be use only and only by HMM code. Below is short

                                   used only by the HMM code.

> +description of the 3 main locks that exist for HMM internal use. Educational
> +purpose only.
> +
> +Each process mm has one and only one struct hmm associated with it. Each hmm

                MM

> +struct can be use by several different mirror. There is one and only one mirror

                                          mirrors.

> +per mm and device pair. So in essence the hmm struct is the core that dispatch

       MM                                                                dispatches

> +everything to every single mirror, each of them corresponding to a specific
> +device. The list of mirror for an hmm struct is protected by a semaphore as it

                       mirrors
> +sees mostly read access.
> +
> +Each time a device fault a range of address it calls hmm_mirror_fault(), HMM

                      faults

> +keeps track, inside the hmm struct, of each range currently being faulted. It
> +does that so it can synchronize with any CPU page table update. If there is a
> +CPU page table update then a callback through mmu_notifier will happen and HMM
> +will try to interrupt the device page fault that conflict (ie address range

                                                    conflicts (i.e.,

> +overlap with the range being updated) and wait for them to back off. This
> +insure that at no point in time the device driver see transient page table

   insures                                           sees

> +information. The list of active fault is protected by a spinlock, query on

                                   faults                  spinlock;

> +that list should be short and quick (we haven't gather enough statistic on

                                                   gathered      statistics

> +that side yet to have a good idea of the average access pattern).
> +
> +Each device driver wanting to use HMM must register one and only one hmm_device
> +struct per physical device with HMM. The hmm_device struct have pointer to the

                                                              has

> +device driver call back and keeps track of active mirrors for a given device.

                 callback

> +The active mirrors list is protected by a spinlock.
> +
> +
> +Future work:
> +------------
> +
> +Improved atomic access by the device to system memory. Some platform bus (PCIe)

                                                                        busses

> +offer limited number of atomic memory operations, some platform do not even

                                         operations;      platforms

> +have any kind of atomic memory operations by a device. In order to allow such
> +atomic operation we want to map page read only the CPU while the device perform

          operations               pages read-only in the CPU              performs

> +its operation. For this we need a new case inside the CPU write fault code path
> +to synchronize with the device.
> +
> +We want to allow program to lock a range of memory inside device memory and

              allow a program

> +forbid CPU access while the memory is lock inside the device. Any CPU access

                                         locked

> +to locked range would result in SIGBUS. We think that madvise() would be the
> +right syscall into which we could plug that feature.
> +
> +In order to minimize kernel memory consumption and overhead of DMA mapping, we
> +want to introduce new DMA API that allows to manage mapping on IOMMU directory
> +page basis. This would allow to map/unmap/update DMA mapping in bulk and
> +minimize IOMMU update and flushing overhead. Moreover this would allow to
> +improve IOMMU bad access reporting for DMA address inside those directory.
> +
> +Because update to the device page table might require "heavy" synchronization
> +with the device, the mmu_notifier callback might have to sleep while HMM is
> +waiting for the device driver to report device page table update completion.
> +This is especialy bad if this happens during page reclaimation, this might

           especially                                reclamation;

> +bring the system to pause. We want to mitigate this, either by maintaining a
> +new intermediate lru level in which we put pages actively mirrored by a device

                    LRU

> +or by some other mecanism. For time being we advice that device driver that

                    mechanism.                  advise

> +use HMM explicitly explain this corner case so that user are aware that this

                                                       users

> +can happens if there is memory pressure.

       happen
> 


-- 
~Randy

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v11 15/15] HMM: add documentation explaining HMM internals and how to use it.
@ 2015-10-22  3:23     ` Randy Dunlap
  0 siblings, 0 replies; 42+ messages in thread
From: Randy Dunlap @ 2015-10-22  3:23 UTC (permalink / raw)
  To: Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher

Hi,

Some corrections and a few questions...

On 10/21/15 14:00, Jérôme Glisse wrote:
> This adds documentation on how HMM works and gives a more in-depth view of
> how it should be used by device driver writers.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> ---
>  Documentation/vm/hmm.txt | 219 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 219 insertions(+)
>  create mode 100644 Documentation/vm/hmm.txt
> 
> diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.txt
> new file mode 100644
> index 0000000..febed50
> --- /dev/null
> +++ b/Documentation/vm/hmm.txt
> @@ -0,0 +1,219 @@
> +Heterogeneous Memory Management (HMM)
> +-------------------------------------
> +
> +The raison d'être of HMM is to provide a common API for device driver that

                                                                    drivers

> +wants to mirror a process address space on there device and/or migrate system

   want                                       their

> +memory to device memory. Device driver can decide to only use one aspect of

                                   drivers

> +HMM (mirroring or memory migration), for instance some device can directly
> +access process address space through hardware (for instance PCIe ATS/PASID),
> +but still want to benefit from memory migration capabilities that HMM offer.
> +
> +While HMM rely on existing kernel infrastructure (namely mmu_notifier) some

             relies

> +of its features (memory migration, atomic access) require integration with
> +core mm kernel code. Having HMM as the common intermediary is more appealing

        MM

> +than having each device driver hooking itself inside the common mm code.

                                                                   MM

> +
> +Moreover HMM as a layer allows integration with DMA API or page reclaimation.

                                                                   reclamation.

> +
> +
> +Mirroring address space on the device:
> +--------------------------------------
> +
> +Device that can't directly access transparently the process address space, need
> +to mirror the CPU page table into there own page table. HMM helps to keep the

                                     their

> +device page table synchronize with the CPU page table. It is not expected that

                     synchronized

> +the device will fully mirror the CPU page table but only mirror region that are

                                                                   regions

> +actively accessed by the device. For that reasons HMM only helps populating and

                                             reason

> +synchronizing device page table for range that the device driver explicitly ask

                                       ranges                                  asks

or is only one range supported?


> +for.
> +
> +Mirroring address space inside the device page table is easy with HMM :

                                                                     HMM:

> +
> +  /* Create a mirror for the current process for your device. */
> +  your_hmm_mirror->hmm_mirror.device = your_hmm_device;
> +  hmm_mirror_register(&your_hmm_mirror->hmm_mirror);
> +
> +  ...
> +
> +  /* Mirror memory (in read mode) between addressA and addressB */
> +  your_hmm_event->hmm_event.start = addressA;
> +  your_hmm_event->hmm_event.end = addressB;

Multiple events (ranges) can be specified?
Is hmm_event.end (addressB) included or excluded from the range?

> +  your_hmm_event->hmm_event.etype = HMM_DEVICE_RFAULT;
> +  hmm_mirror_fault(&your_hmm_mirror->hmm_mirror, &your_hmm_event->hmm_event);
> +    /* HMM callback into your driver with the >update() callback. During the
> +     * callback use the HMM page table to populate the device page table. You
> +     * can only use the HMM page table to populate the device page table for
> +     * the specified range during the >update() callback, at any other point in
> +     * time the HMM page table content should be assume to be undefined.

                                                    assumed

> +     */
> +    your_hmm_device->update(mirror, event);
> +
> +  ...
> +
> +  /* Process is quiting or device done stop the mirroring and cleanup. */

                   quitting or device done; stop

> +  hmm_mirror_unregister(&your_hmm_mirror->hmm_mirror);
> +  /* Device driver can free your_hmm_mirror */
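
BTW it might also help readers to show the embedding these snippets assume,
e.g. something like the following (illustration only, field names guessed
here; the authoritative layouts and callbacks are in hmm.h):

  struct your_hmm_mirror {
          /* HMM's mirror object, one per (process mm, device) pair. */
          struct hmm_mirror hmm_mirror;
          /* ... driver private state for this process ... */
  };

with your_hmm_device being the hmm_device registered for the physical device,
which carries the driver callbacks HMM invokes (->update(), ->copy_to_device()).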
> +
> +
> +HMM mirror page table:
> +----------------------
> +
> +Each hmm_mirror object is associated with a mirror page table that HMM keeps
> +synchronize with the CPU page table by using the mmu_notifier API. HMM is using

   synchronized

> +its own generic page table format because it needs to store DMA address, which

                                                                   addresses,

> +are bigger than long on some architecture, and have more flags per entry than

                                architectures,

> +radix tree allows.
> +
> +The HMM page table mostly mirror x86 page table layout. A page holds a global

                             mirrors

> +directory and each entry points to a lower level directory. Unlike regular CPU
> +page table, directory level are more aggressively freed and remove from the HMM

        tables,          levels                                removed

> +mirror page table. This means device driver needs to use the HMM helpers and to

                                        drivers need

> +follow directive on when and how to access the mirror page table. HMM use the

                                                                         uses

> +per page spinlock of directory page to synchronize update of directory ie update

                                  pages                         directory, i.e.,

> +can happen on different directory concurently.

                                     concurrently.

> +
> +As a rules the mirror page table can only be accessed by device driver from one

        rule                                             by a device driver

> +of the HMM device callback. Any access from outside a callback is illegal and

                     callbacks.

> +gives undertimed result.

         undetermined
or       undefined

> +
> +Accessing the mirror page table from a device callback needs to use the HMM
> +page table helpers. A loop to access entry for a range of address looks like :

                                        entries              addresses looks like:

> +
> +  /* Initialize a HMM page table iterator. */

                   an HMM

> +  struct hmm_pt_iter iter;
> +  hmm_pt_iter_init(&iter, &mirror->pt)
> +
> +  /* Get pointer to HMM page table entry for a given address. */
> +  dma_addr_t *hmm_pte;
> +  hmm_pte = hmm_pt_iter_walk(&iter, &addr, &next);

what are 'addr' and 'next'? (types)

> +
> +If there is no valid entry directory for given range address then hmm_pte is
> +NULL. If there is a valid entry directory then you can access the hmm_pte and
> +the pointer will stay valid as long as you do not call hmm_pt_iter_walk() with
> +the same iter struct for a different address or call hmm_pt_iter_fini().
> +
> +While the HMM page table entry pointer stays valid you can only modify the
> +value it is pointing to by using one of HMM helpers (hmm_pte_*()) as other
> +threads might be updating the same entry concurrently. The device driver only
> +need to update an HMM page table entry to set the dirty bit, so driver should

   needs                                                           drivers

> +only be using hmm_pte_set_dirty().
> +
> +Similarly to extract information the device driver should use one of the helper

                                                                            helpers

> +like hmm_pte_dma_addr() or hmm_pte_pfn() (if HMM is not doing DMA mapping which
> +is a device driver at initialization parameter).
> +
> +
> +Migrating system memory to device memory:
> +-----------------------------------------
> +
> +Device like discret GPU often have there own local memory which offer bigger

   Devices     discrete GPUs          their

> +bandwidth and smaller latency than access to system memory for the GPU. This
> +local memory is not necessarily accessible by the CPU. Device local memory will
> +remain revealent for the foreseeable future as bandwidth of GPU memory keep

          relevant                                                        keeps

> +increasing faster than bandwidth of system memory and as latency of PCIe does
> +not decrease.
> +
> +Thus to maximize use of device like GPU, program need to use the device memory.

                           devices like GPUs, programs

> +Userspace API wants to make this as transparent as it can be, so that there is
> +no need for complex modification of applications.
> +
> +Transparent use of device memory for range of address of a process require core

                                                                      requires

> +mm code modifications. Adding a new memory zone for devices memory did not make

   MM                                                  device

> +sense given that such memory is often only accessible by the device only. This
> +is why we decided to use a special kind of swap, migrated memory is mark as a

                                              swap;                    marked

> +special swap entry inside the CPU page table.
> +
> +While HMM handles the migration process, it does not decide what range or when
> +to migrate memory. The decision to perform such migration is under the control
> +of the device driver. Migration back to system memory happens either because
> +the CPU try to access the memory or because device driver decided to migrate

           tries

> +the memory back.
> +
> +
> +  /* Migrate system memory between addressA and addressB to device memory. */
> +  your_hmm_event->hmm_event.start = addressA;
> +  your_hmm_event->hmm_event.end = addressB;

is hmm_event.end (addressB) inclusive or exclusive?
i.e., is it end_of_copy + 1?
i.e., is the size of the copy addressB - addressA or
      addressB - addressA + 1?
i.e., is addressB = addressA + size
or is    addressB = addressA + size - 1

In my experience it is usually better to have a start_address and size
instead of start_address and end_address.

> +  your_hmm_event->hmm_event.etype = HMM_COPY_TO_DEVICE;
> +  hmm_mirror_fault(&your_hmm_mirror->hmm_mirror, &your_hmm_event->hmm_event);
> +    /* HMM callback into your driver with the >copy_to_device() callback.
> +     * Device driver must allocate device memory, DMA system memory to device
> +     * memory, update the device page table to point to device memory and
> +     * return. See hmm.h for details instructions and how failure are handled.

                                detailed                     failures

> +     */
> +    your_hmm_device->copy_to_device(mirror, event, dst, addressA, addressB);
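
A skeleton of that callback might be worth adding here too; roughly (the
prototype below just mirrors the argument list above, the parameter types are
guesses, and the real contract for 'dst' and the return value is in hmm.h):

  static int your_copy_to_device(struct hmm_mirror *mirror,
                                 struct hmm_event *event,
                                 dma_addr_t *dst,
                                 unsigned long start,
                                 unsigned long end)
  {
          /* 1. Allocate device memory covering [start, end). */
          /* 2. DMA the current system pages into that device memory. */
          /* 3. Update the device page table to point at the device memory. */
          /* 4. Report success/failure through dst and the return value so
           *    HMM can finish (or unwind) the migration.
           */
          return 0;
  }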
> +
> +
> +Right now HMM only support migrating anonymous private memory. Migration of

                      supports

> +share memory and more generaly file mapped memory is on the road map.

   shared                generally

> +
> +
> +Locking consideration and overall design:
> +-----------------------------------------
> +
> +As a rule HMM will handle proper locking on the behalf of the device driver,
> +as such device driver does not need to take any mm lock before calling into

                                                   MM

> +the HMM code.
> +
> +HMM is also responsible of the hmm_device and hmm_mirror object lifetime. The

                           for

> +device driver can only free those after calling hmm_device_unregister() or
> +hmm_mirror_unregister() respectively.
> +
> +All the lock inside any of the HMM structure should never be use by the device

           locks                      structures

> +driver. They are intended to be use only and only by HMM code. Below is short

                                   used only by the HMM code.

> +description of the 3 main locks that exist for HMM internal use. Educational
> +purpose only.
> +
> +Each process mm has one and only one struct hmm associated with it. Each hmm

                MM

> +struct can be use by several different mirror. There is one and only one mirror

                                          mirrors.

> +per mm and device pair. So in essence the hmm struct is the core that dispatch

       MM                                                                dispatches

> +everything to every single mirror, each of them corresponding to a specific
> +device. The list of mirror for an hmm struct is protected by a semaphore as it

                       mirrors
> +sees mostly read access.
> +
> +Each time a device fault a range of address it calls hmm_mirror_fault(), HMM

                      faults

> +keeps track, inside the hmm struct, of each range currently being faulted. It
> +does that so it can synchronize with any CPU page table update. If there is a
> +CPU page table update then a callback through mmu_notifier will happen and HMM
> +will try to interrupt the device page fault that conflict (ie address range

                                                    conflicts (i.e.,

> +overlap with the range being updated) and wait for them to back off. This
> +insure that at no point in time the device driver see transient page table

   ensures                                           sees

> +information. The list of active fault is protected by a spinlock, query on

                                   faults                  spinlock;

> +that list should be short and quick (we haven't gather enough statistic on

                                                   gathered      statistics

> +that side yet to have a good idea of the average access pattern).
> +
> +Each device driver wanting to use HMM must register one and only one hmm_device
> +struct per physical device with HMM. The hmm_device struct have pointer to the

                                                              has

> +device driver call back and keeps track of active mirrors for a given device.

                 callback

> +The active mirrors list is protected by a spinlock.
> +
> +
> +Future work:
> +------------
> +
> +Improved atomic access by the device to system memory. Some platform bus (PCIe)

                                                                        busses

> +offer limited number of atomic memory operations, some platform do not even

                                         operations;      platforms

> +have any kind of atomic memory operations by a device. In order to allow such
> +atomic operation we want to map page read only the CPU while the device perform

          operations               pages read-only in the CPU              performs

> +its operation. For this we need a new case inside the CPU write fault code path
> +to synchronize with the device.
> +
> +We want to allow program to lock a range of memory inside device memory and

              allow a program

> +forbid CPU access while the memory is lock inside the device. Any CPU access

                                         locked

> +to locked range would result in SIGBUS. We think that madvise() would be the
> +right syscall into which we could plug that feature.
> +
> +In order to minimize kernel memory consumption and overhead of DMA mapping, we
> +want to introduce new DMA API that allows to manage mapping on IOMMU directory
> +page basis. This would allow to map/unmap/update DMA mapping in bulk and
> +minimize IOMMU update and flushing overhead. Moreover this would allow to
> +improve IOMMU bad access reporting for DMA address inside those directory.
> +
> +Because update to the device page table might require "heavy" synchronization
> +with the device, the mmu_notifier callback might have to sleep while HMM is
> +waiting for the device driver to report device page table update completion.
> +This is especialy bad if this happens during page reclaimation, this might

           especially                                reclamation;

> +bring the system to pause. We want to mitigate this, either by maintaining a
> +new intermediate lru level in which we put pages actively mirrored by a device

                    LRU

> +or by some other mecanism. For time being we advice that device driver that

                    mechanism.                  advise

> +use HMM explicitly explain this corner case so that user are aware that this

                                                       users

> +can happens if there is memory pressure.

       happen
> 


-- 
~Randy


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v11 15/15] HMM: add documentation explaining HMM internals and how to use it.
  2015-10-22  3:23     ` Randy Dunlap
@ 2015-10-22 14:11       ` Jerome Glisse
  -1 siblings, 0 replies; 42+ messages in thread
From: Jerome Glisse @ 2015-10-22 14:11 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Christophe Harle, Duncan Poole,
	Sherry Cheung, Subhash Gutti, John Hubbard, Mark Hairgrove,
	Lucien Dunning, Cameron Buschardt, Arvind Gopalakrishnan,
	Haggai Eran, Shachar Raindel, Liran Liss, Roland Dreier,
	Ben Sander, Greg Stoner, John Bridgman, Michael Mantor,
	Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher

On Wed, Oct 21, 2015 at 08:23:41PM -0700, Randy Dunlap wrote:
> Hi,
> 
> Some corrections and a few questions...

Thanks for the corrections. Answers below.

> On 10/21/15 14:00, Jérôme Glisse wrote:
> > This add documentation on how HMM works and a more in depth view of how it
> > should be use by device driver writers.
> > 
> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>

[...]

> > +synchronizing device page table for range that the device driver explicitly ask
> 
>                                        ranges                                  asks
> 
> or is only one range supported?

Several ranges are supported.


[...]

> > +  /* Mirror memory (in read mode) between addressA and addressB */
> > +  your_hmm_event->hmm_event.start = addressA;
> > +  your_hmm_event->hmm_event.end = addressB;
> 
> Multiple events (ranges) can be specified?

Device drivers have to make one call per range, but multiple threads can make
concurrent calls for different ranges.

> Is hmm_event.end (addressB) included or excluded from the range?

Forgot to copy the comment from the header file: start is inclusive, end is exclusive.
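
So for a range of npages pages the caller side looks like this (sketch with
made-up addr/npages, field names as in the doc example above):

  /* [start, end) : start is inclusive, end is exclusive. */
  your_hmm_event->hmm_event.start = addr;
  your_hmm_event->hmm_event.end = addr + npages * PAGE_SIZE;
  your_hmm_event->hmm_event.etype = HMM_DEVICE_RFAULT;
  hmm_mirror_fault(&your_hmm_mirror->hmm_mirror, &your_hmm_event->hmm_event);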


[...]

> > +  struct hmm_pt_iter iter;
> > +  hmm_pt_iter_init(&iter, &mirror->pt)
> > +
> > +  /* Get pointer to HMM page table entry for a given address. */
> > +  dma_addr_t *hmm_pte;
> > +  hmm_pte = hmm_pt_iter_walk(&iter, &addr, &next);
> 
> what are 'addr' and 'next'? (types)

They are unsigned long; will add them to the doc, good point.
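
So a full walk over [start, end) looks roughly like this (sketch only; the
exact walk semantics, the hmm_pte_*() prototypes and the callback/locking
rules are in hmm.h, and mirror/start/end are just placeholder names here):

  struct hmm_pt_iter iter;
  unsigned long addr, next;
  dma_addr_t *hmm_pte;

  hmm_pt_iter_init(&iter, &mirror->pt);
  for (addr = start; addr < end; addr = next) {
          hmm_pte = hmm_pt_iter_walk(&iter, &addr, &next);
          if (!hmm_pte)
                  break;  /* no populated directory for this stretch */
          /* Entries may only be modified through the hmm_pte_*() helpers, eg: */
          hmm_pte_set_dirty(hmm_pte);
  }
  hmm_pt_iter_fini(&iter);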

[...]


> > +  /* Migrate system memory between addressA and addressB to device memory. */
> > +  your_hmm_event->hmm_event.start = addressA;
> > +  your_hmm_event->hmm_event.end = addressB;
> 
> is hmm_event.end (addressB) inclusive or exclusive?
> i.e., is it end_of_copy + 1?
> i.e., is the size of the copy addressB - addressA or
>       addressB - addressA + 1?
> i.e., is addressB = addressA + size
> or is    addressB = addressA + size - 1

Exclusive, so the size of the copy is addressB - addressA (addressB = addressA + size).
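
ie with 4K pages:

  addressA = 0x100000;
  addressB = 0x102000;  /* exclusive: copies 2 pages, 0x100000-0x101fff */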


> In my experience it is usually better to have a start_address and size
> instead of start_address and end_address.

I switched several times between the two over different versions of the
patchset; it is something that can be changed down the road unless you have
strong feelings about it.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v11 00/15] HMM (Heterogeneous Memory Management)
  2015-10-21 20:59 ` Jérôme Glisse
                   ` (15 preceding siblings ...)
  (?)
@ 2015-10-25 10:00 ` Haggai Eran
  -1 siblings, 0 replies; 42+ messages in thread
From: Haggai Eran @ 2015-10-25 10:00 UTC (permalink / raw)
  To: Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher, Linda Wang, Kevin E Martin, Jeff Law,
	Or Gerlitz, Sagi Grimberg, Aneesh Kumar K.V

On 21/10/2015 23:59, Jérôme Glisse wrote:
> HMM (Heterogeneous Memory Management) is a helper layer
> for device drivers; its main features are:
>    - Shadow the CPU page table of a process into a device-specific
>      page table format and keep both page tables synchronized.
>    - Handle DMA mapping of system RAM pages on behalf of the device
>      (for shadowed page table entries).
>    - Migrate private anonymous memory to private device memory
>      and handle CPU page faults (which trigger a migration back
>      to system memory so the CPU can access it).
> 
> Benefits of HMM:
>    - Avoid the current model where device drivers have to pin pages,
>      which blocks several kernel features (KSM, migration, ...).
>    - No impact on existing workloads that do not use HMM (it only
>      adds a couple more if() to common code paths).
>    - Intended as common infrastructure for several different
>      hardware vendors, as of today Mellanox and NVidia.
>    - Allow userspace APIs to move away from the explicit copy code
>      path where the application programmer has to manually manage
>      memcpy to and from device memory.
>    - Transparent to userspace, for instance allowing a library to
>      use the GPU without involving the application linked against it.
> 
> I expect other hardware companies to express interest in HMM and
> eventually start using it with their new hardware. I give a more
> in-depth motivation after the change log.

The RDMA stack has had IO paging support since kernel v4.0, using the
mmu_notifier APIs to interface with the mm subsystem. As one may expect,
it allows RDMA applications to decrease the amount of memory that needs
to be pinned, and allows the kernel to better allocate physical memory.
HMM looks like a better API than mmu_notifiers for that purpose, as it
allows sharing more code. It handles internally things that any similar
driver or subsystem would need to do, such as synchronization between
page fault events and invalidations, and DMA-mapping pages for device
use. It looks like it can be extended to also assist in device peer to
peer memory mapping, to allow capable devices to transfer data directly
without CPU intervention.

Regards,
Haggai


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v11 15/15] HMM: add documentation explaining HMM internals and how to use it.
  2015-10-22  3:23     ` Randy Dunlap
  (?)
  (?)
@ 2015-10-28  1:19     ` David Woodhouse
  2015-10-28 17:10         ` Randy Dunlap
  -1 siblings, 1 reply; 42+ messages in thread
From: David Woodhouse @ 2015-10-28  1:19 UTC (permalink / raw)
  To: Randy Dunlap, Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher

On Wed, 2015-10-21 at 20:23 -0700, Randy Dunlap wrote:
> On 10/21/15 14:00, Jérôme Glisse wrote:
...
> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>

Not sure how Randy's email screwed that one up; it was a perfectly fine
instance of your name.

> > +++ b/Documentation/vm/hmm.txt
> > @@ -0,0 +1,219 @@
> > +Heterogeneous Memory Management (HMM)
> > +-------------------------------------
> > +
> > +The raison d'�tre of HMM is to provide a common API...

That one, on the other hand, was yours. Your original commit in your
git tree has that 'ê' character in some legacy 8-bit character set
instead of UTF-8. Did it fall through a time-warp from the 20th
century? :)

-- 
dwmw2




^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v11 15/15] HMM: add documentation explaining HMM internals and how to use it.
  2015-10-28  1:19     ` David Woodhouse
@ 2015-10-28 17:10         ` Randy Dunlap
  0 siblings, 0 replies; 42+ messages in thread
From: Randy Dunlap @ 2015-10-28 17:10 UTC (permalink / raw)
  To: David Woodhouse, Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: Linus Torvalds, joro, Mel Gorman, H. Peter Anvin, Peter Zijlstra,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Rik van Riel,
	Dave Airlie, Brendan Conoboy, Joe Donohue, Christophe Harle,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Shachar Raindel, Liran Liss,
	Roland Dreier, Ben Sander, Greg Stoner, John Bridgman,
	Michael Mantor, Paul Blinzer, Leonid Shamis, Laurent Morichetti,
	Alexander Deucher

On 10/27/15 18:19, David Woodhouse wrote:
> On Wed, 2015-10-21 at 20:23 -0700, Randy Dunlap wrote:
>> On 10/21/15 14:00, Jérôme Glisse wrote:
> ...
>>> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> 
> Not sure how Randy's email screwed that one up; it was a perfectly fine
> instance of your name.
> 

I'm probably missing some config setting in Thunderbird.


-- 
~Randy

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2015-10-28 17:10 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-21 20:59 [PATCH v11 00/15] HMM (Heterogeneous Memory Management) Jérôme Glisse
2015-10-21 20:59 ` Jérôme Glisse
2015-10-21 20:59 ` [PATCH v11 01/15] mmu_notifier: add event information to address invalidation v8 Jérôme Glisse
2015-10-21 20:59   ` Jérôme Glisse
2015-10-21 20:59 ` [PATCH v11 02/15] mmu_notifier: keep track of active invalidation ranges v5 Jérôme Glisse
2015-10-21 20:59   ` Jérôme Glisse
2015-10-21 20:59 ` [PATCH v11 03/15] mmu_notifier: pass page pointer to mmu_notifier_invalidate_page() v2 Jérôme Glisse
2015-10-21 20:59   ` Jérôme Glisse
2015-10-21 20:59 ` [PATCH v11 04/15] mmu_notifier: allow range invalidation to exclude a specific mmu_notifier Jérôme Glisse
2015-10-21 20:59   ` Jérôme Glisse
2015-10-21 21:00 ` [PATCH v11 05/15] HMM: introduce heterogeneous memory management v5 Jérôme Glisse
2015-10-21 21:00   ` Jérôme Glisse
2015-10-21 20:18   ` Randy Dunlap
2015-10-21 20:18     ` Randy Dunlap
2015-10-21 21:00 ` [PATCH v11 06/15] HMM: add HMM page table v4 Jérôme Glisse
2015-10-21 21:00   ` Jérôme Glisse
2015-10-21 21:00 ` [PATCH v11 07/15] HMM: add per mirror " Jérôme Glisse
2015-10-21 21:00   ` Jérôme Glisse
2015-10-21 21:00 ` [PATCH v11 08/15] HMM: add device page fault support v6 Jérôme Glisse
2015-10-21 21:00   ` Jérôme Glisse
2015-10-21 21:00 ` [PATCH v11 09/15] HMM: add mm page table iterator helpers Jérôme Glisse
2015-10-21 21:00   ` Jérôme Glisse
2015-10-21 21:00 ` [PATCH v11 10/15] HMM: use CPU page table during invalidation Jérôme Glisse
2015-10-21 21:00   ` Jérôme Glisse
2015-10-21 21:00 ` [PATCH v11 11/15] HMM: add discard range helper (to clear and free resources for a range) Jérôme Glisse
2015-10-21 21:00   ` Jérôme Glisse
2015-10-21 21:00 ` [PATCH v11 12/15] HMM: add dirty range helper (toggle dirty bit inside mirror page table) v2 Jérôme Glisse
2015-10-21 21:00   ` Jérôme Glisse
2015-10-21 21:00 ` [PATCH v11 13/15] HMM: DMA map memory on behalf of device driver v2 Jérôme Glisse
2015-10-21 21:00   ` Jérôme Glisse
2015-10-21 21:00 ` [PATCH v11 14/15] HMM: Add support for hugetlb Jérôme Glisse
2015-10-21 21:00   ` Jérôme Glisse
2015-10-21 21:00 ` [PATCH v11 15/15] HMM: add documentation explaining HMM internals and how to use it Jérôme Glisse
2015-10-21 21:00   ` Jérôme Glisse
2015-10-22  3:23   ` Randy Dunlap
2015-10-22  3:23     ` Randy Dunlap
2015-10-22 14:11     ` Jerome Glisse
2015-10-22 14:11       ` Jerome Glisse
2015-10-28  1:19     ` David Woodhouse
2015-10-28 17:10       ` Randy Dunlap
2015-10-28 17:10         ` Randy Dunlap
2015-10-25 10:00 ` [PATCH v11 00/15] HMM (Heterogeneous Memory Management) Haggai Eran
