* HMM (Heterogeneous Memory Management) v7
From: j.glisse @ 2014-12-22 16:48 UTC
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Shachar Raindel,
	Liran Liss, Roland Dreier, Ben Sander, Greg Stoner,
	John Bridgman, Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, linux-fsdevel, Linda Wang,
	Kevin E Martin, Jerome Glisse, Jeff Law, Haggai Eran, Or Gerlitz,
	Sagi Grimberg

So after PTO and before the end-of-year frenzy, here is an updated HMM patchset.
While not reusing Linus' page table design, I use something that is, in my
view at least, close to it. I also avoid pretending that this will be useful
to others and have moved it into HMM-specific code. A longer justification for
implementing new page table code, instead of using a radix tree or another
existing kernel structure, is part of the commit message.

Everything else is pretty much the same, i.e. this patchset is just the
groundwork on top of which we want to build our feature set, the main feature
being migration of memory to device memory. The very first version of this
patchset already showcased a proof of concept of most of these features.

Below is the previous patchset's cover letter, pretty much unchanged, as the
background and motivation have not changed.


What is it?

In a nutshell, HMM is a subsystem that provides an easy-to-use API to mirror a
process address space on a device with minimal hardware requirements (mainly
device page faults and read-only page mappings). It does not rely on the ATS
and PASID PCIe extensions. It intends to supersede those extensions by allowing
system memory to be moved to device memory in a fashion that is transparent to
core kernel mm code (i.e. a CPU page fault on a page residing in device memory
triggers migration back to system memory).
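
As a rough illustration of the migrate-back-on-CPU-fault behaviour described
above, here is a minimal userspace sketch in C. It is a toy model, not kernel
code and not the patchset's implementation; every name in it (toy_page,
toy_migrate_to_device, toy_cpu_read) is made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: where the backing storage of a page currently lives. */
enum toy_location { TOY_SYSTEM_MEMORY, TOY_DEVICE_MEMORY };

/* Hypothetical stand-in for one page of a mirrored address space. */
struct toy_page {
	enum toy_location where;
	int data;                     /* page content, reduced to one int */
	unsigned int migrations_back; /* how often a CPU access pulled it back */
};

/* Move the page to device memory, e.g. because the device works on it. */
static void toy_migrate_to_device(struct toy_page *p)
{
	assert(p != NULL);
	p->where = TOY_DEVICE_MEMORY;
}

/*
 * CPU access. If the page sits in device memory, the (toy) fault path first
 * migrates it back to system memory, then the access proceeds as usual. The
 * caller never has to know where the page was.
 */
static int toy_cpu_read(struct toy_page *p)
{
	assert(p != NULL);
	if (p->where == TOY_DEVICE_MEMORY) {
		p->where = TOY_SYSTEM_MEMORY;
		p->migrations_back++;
	}
	return p->data;
}
```

The point of the sketch is only that the caller of toy_cpu_read() sees the
same data regardless of where the page lived, which is what "transparent to
core kernel mm code" is taken to mean here.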


Why do this?

We want to be able to mirror a process address space so that compute APIs such
as OpenCL can start using the exact same address space on the GPU as on the
CPU. This will greatly simplify the use of those APIs. Moreover, we believe
that we will see more and more specialized function units that will want to
mirror a process address space using their own MMU.

The migration side exists simply because GPU memory bandwidth is far higher
than system memory bandwidth, and there is no sign that this gap is closing
(quite the opposite).


Current status and future features:

None of this code changes core kernel mm code in any major way. This is
simple groundwork with no impact on existing code paths. Features that
will be implemented on top of this are:
  1 - Transparently handle page mapping on behalf of device drivers (DMA).
  2 - Improve the DMA API to better match the new usage pattern of HMM.
  3 - Migration of anonymous memory to device memory.
  4 - Locking memory to remote memory (CPU access triggers SIGBUS).
  5 - Access exclusion between CPU and device for atomic operations.
  6 - Migration of file-backed memory to device memory.


How future features will be implemented:
1 - Simply use the existing DMA API to map pages on behalf of a device.
2 - Introduce a new DMA API to match the new semantics of HMM. It is no
    longer a page that we map but an address range, and updating which page
    effectively backs an address should be easy. I gave a presentation
    about this at this year's LPC.
3 - Requires a change to the CPU page fault code path to handle migration
    back to system memory on CPU access. An implementation of this was
    already sent as part of v1. It will be low impact and only adds handling
    of a new special swap type to the existing fault code.
4 - Requires a new syscall, as I cannot see which current syscall would be
    appropriate for this. My first feeling was to use mbind, as it has the
    right semantics (binding a range of addresses to a device), but mbind is
    too NUMA-centric.

    The second candidate was madvise, but its semantics do not match: madvise
    allows the kernel to ignore the advice, while we do want to block CPU
    access for as long as the range is bound to a device.

    So I do not think any existing syscall can be extended with new flags,
    but maybe I am wrong.
5 - Allow mapping a page as read-only on the CPU while a device performs
    some atomic operation on it (this is mainly to work around system buses
    that do not support atomic memory access, and sadly there is a large
    base of hardware without that feature).

    The easiest implementation would use some page flags, but there are none
    left. So it must be some flag in the vma that tells whether there is a
    need to query HMM for write protection.

6 - This is the trickiest one to implement, and while I showed a proof of
    concept with v1, I still have a lot of conflicting feelings about how
    to achieve this.
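
To make the idea in item 5 a little more concrete, the following is a small
userspace sketch in C of "a flag in the vma decides whether HMM must be
queried before a CPU write". It is purely an assumption about how such a check
could look: TOY_VM_DEVICE_ATOMIC, struct toy_vma and toy_cpu_write_allowed()
are hypothetical names and do not come from the patchset or from the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical vma flag: this range may be used for device atomics. */
#define TOY_VM_DEVICE_ATOMIC 0x1UL

/* Hypothetical, heavily reduced stand-in for a vma. */
struct toy_vma {
	unsigned long flags;
	/* Stand-in for state that HMM would track for this range. */
	bool device_atomic_in_progress;
};

/*
 * Decide whether a CPU write to the range may proceed.
 * Fast path: without the flag nothing needs to be queried at all.
 * Slow path: with the flag set, the (toy) HMM state is consulted and the
 * write is refused while the device performs an atomic operation.
 */
static bool toy_cpu_write_allowed(const struct toy_vma *vma)
{
	assert(vma != NULL);
	if (!(vma->flags & TOY_VM_DEVICE_ATOMIC))
		return true;
	return !vma->device_atomic_in_progress;
}
```

The design point the sketch tries to capture is that ranges not involved in
device atomics pay only for a single flag test.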


As usual, comments are more than welcome. Thanks in advance to anyone who
takes a look at this code.

Previous patchset postings:
  v1 http://lwn.net/Articles/597289/
  v2 https://lkml.org/lkml/2014/6/12/559 (cover letter did not make it to ml)
  v3 https://lkml.org/lkml/2014/6/13/633
  v4 https://lkml.org/lkml/2014/8/29/423
  v5 https://lkml.org/lkml/2014/11/3/759
  v6 http://lwn.net/Articles/619737/

Cheers,
Jérôme

To: "Andrew Morton" <akpm@linux-foundation.org>,
Cc: <linux-kernel@vger.kernel.org>,
Cc: linux-mm <linux-mm@kvack.org>,
Cc: <linux-fsdevel@vger.kernel.org>,
Cc: "Linus Torvalds" <torvalds@linux-foundation.org>,
Cc: "Mel Gorman" <mgorman@suse.de>,
Cc: "H. Peter Anvin" <hpa@zytor.com>,
Cc: "Peter Zijlstra" <peterz@infradead.org>,
Cc: "Linda Wang" <lwang@redhat.com>,
Cc: "Kevin E Martin" <kem@redhat.com>,
Cc: "Jerome Glisse" <jglisse@redhat.com>,
Cc: "Andrea Arcangeli" <aarcange@redhat.com>,
Cc: "Johannes Weiner" <jweiner@redhat.com>,
Cc: "Larry Woodman" <lwoodman@redhat.com>,
Cc: "Rik van Riel" <riel@redhat.com>,
Cc: "Dave Airlie" <airlied@redhat.com>,
Cc: "Jeff Law" <law@redhat.com>,
Cc: "Brendan Conoboy" <blc@redhat.com>,
Cc: "Joe Donohue" <jdonohue@redhat.com>,
Cc: "Duncan Poole" <dpoole@nvidia.com>,
Cc: "Sherry Cheung" <SCheung@nvidia.com>,
Cc: "Subhash Gutti" <sgutti@nvidia.com>,
Cc: "John Hubbard" <jhubbard@nvidia.com>,
Cc: "Mark Hairgrove" <mhairgrove@nvidia.com>,
Cc: "Lucien Dunning" <ldunning@nvidia.com>,
Cc: "Cameron Buschardt" <cabuschardt@nvidia.com>,
Cc: "Arvind Gopalakrishnan" <arvindg@nvidia.com>,
Cc: "Haggai Eran" <haggaie@mellanox.com>,
Cc: "Or Gerlitz" <ogerlitz@mellanox.com>,
Cc: "Sagi Grimberg" <sagig@mellanox.com>
Cc: "Shachar Raindel" <raindel@mellanox.com>,
Cc: "Liran Liss" <liranl@mellanox.com>,
Cc: "Roland Dreier" <roland@purestorage.com>,
Cc: "Sander, Ben" <ben.sander@amd.com>,
Cc: "Stoner, Greg" <Greg.Stoner@amd.com>,
Cc: "Bridgman, John" <John.Bridgman@amd.com>,
Cc: "Mantor, Michael" <Michael.Mantor@amd.com>,
Cc: "Blinzer, Paul" <Paul.Blinzer@amd.com>,
Cc: "Morichetti, Laurent" <Laurent.Morichetti@amd.com>,
Cc: "Deucher, Alexander" <Alexander.Deucher@amd.com>,
Cc: "Gabbay, Oded" <Oded.Gabbay@amd.com>,


* [PATCH 1/7] mmu_notifier: add event information to address invalidation v6
From: j.glisse @ 2014-12-22 16:48 UTC
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Shachar Raindel,
	Liran Liss, Roland Dreier, Ben Sander, Greg Stoner,
	John Bridgman, Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse

From: Jérôme Glisse <jglisse@redhat.com>

The event information will be useful for new users of the mmu_notifier API.
The event argument differentiates between a vma disappearing, a page
being write protected, or simply a page being unmapped. This allows new
users to take different paths for different events. For instance, on unmap
the resources used to track a vma are still valid and should stay around,
while if the event says that a vma is being destroyed, any resources used
to track this vma can be freed.
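
A listener taking different paths per event might look roughly like the
following userspace sketch in C. The event names are the ones listed in this
patch's mmu_notifier.h comment; everything else (struct toy_range,
toy_invalidate_range_start(), the numeric enum values, and the conservative
default branch) is an assumption made for illustration, not code from the
patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Event names from the patch; the values are arbitrary in this sketch. */
enum mmu_event {
	MMU_HSPLIT,
	MMU_ISDIRTY,
	MMU_MIGRATE,
	MMU_MPROT,
	MMU_MUNLOCK,
	MMU_MUNMAP,
	MMU_WRITE_BACK,
	MMU_WRITE_PROTECT,
};

/* Hypothetical per-range state a listener could keep for its device. */
struct toy_range {
	unsigned long start;
	unsigned long end;
	bool mapped;   /* device currently maps pages of this range */
	bool writable; /* device may write through its mapping */
	bool tracked;  /* tracking structure for the range is still needed */
};

/* Hypothetical listener: pick the cheapest safe reaction for each event. */
static void toy_invalidate_range_start(struct toy_range *r,
				       enum mmu_event event)
{
	assert(r != NULL);
	switch (event) {
	case MMU_MUNMAP:
		/* The address range goes away: tracking can be released. */
		r->mapped = false;
		r->writable = false;
		r->tracked = false;
		break;
	case MMU_WRITE_BACK:
		/* Writes must stop, read access is still allowed. */
		r->writable = false;
		break;
	default:
		/*
		 * Conservative fallback used by this sketch for all other
		 * events (MMU_MIGRATE among them): drop the device mapping
		 * but keep tracking, since the range itself stays valid.
		 */
		r->mapped = false;
		r->writable = false;
		break;
	}
}
```

Without the event argument, such a listener would have to treat every
invalidation like the default branch, or worse, like MMU_MUNMAP.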

Changed since v1:
  - renamed action into event (updated commit message too).
  - simplified the event names and clarified their usage,
    also documenting what expectations the listener can have with
    respect to each event.

Changed since v2:
  - Avoid crazy name.
  - Do not move code that does not need to move.

Changed since v3:
  - Separate huge page split from mlock/munlock and softdirty.

Changed since v4:
  - Rebase (no other changes).

Changed since v5:
  - Typo fix.
  - Changed zap_page_range from MMU_MUNMAP to MMU_MIGRATE to reflect the
    fact that the address range is still valid; only the pages backing it
    no longer are.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 drivers/gpu/drm/i915/i915_gem_userptr.c |   3 +-
 drivers/gpu/drm/radeon/radeon_mn.c      |   3 +-
 drivers/infiniband/core/umem_odp.c      |   9 ++-
 drivers/iommu/amd_iommu_v2.c            |   3 +-
 drivers/misc/sgi-gru/grutlbpurge.c      |   9 ++-
 drivers/xen/gntdev.c                    |   9 ++-
 fs/proc/task_mmu.c                      |   6 +-
 include/linux/mmu_notifier.h            | 131 ++++++++++++++++++++++++++------
 kernel/events/uprobes.c                 |  10 ++-
 mm/filemap_xip.c                        |   2 +-
 mm/huge_memory.c                        |  39 ++++++----
 mm/hugetlb.c                            |  23 +++---
 mm/ksm.c                                |  18 +++--
 mm/madvise.c                            |   4 +-
 mm/memory.c                             |  27 ++++---
 mm/migrate.c                            |   9 ++-
 mm/mmu_notifier.c                       |  28 ++++---
 mm/mprotect.c                           |   6 +-
 mm/mremap.c                             |   6 +-
 mm/rmap.c                               |  24 ++++--
 virt/kvm/kvm_main.c                     |  12 ++-
 21 files changed, 274 insertions(+), 107 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index d182058..20dbd26 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -129,7 +129,8 @@ restart:
 static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 						       struct mm_struct *mm,
 						       unsigned long start,
-						       unsigned long end)
+						       unsigned long end,
+						       enum mmu_event event)
 {
 	struct i915_mmu_notifier *mn = container_of(_mn, struct i915_mmu_notifier, mn);
 	struct interval_tree_node *it = NULL;
diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c
index a69bd44..daf53d3 100644
--- a/drivers/gpu/drm/radeon/radeon_mn.c
+++ b/drivers/gpu/drm/radeon/radeon_mn.c
@@ -109,7 +109,8 @@ static void radeon_mn_release(struct mmu_notifier *mn,
 static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
 					     unsigned long start,
-					     unsigned long end)
+					     unsigned long end,
+					     enum mmu_event event)
 {
 	struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
 	struct interval_tree_node *it;
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 6095872..bc36e8c 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -165,7 +165,8 @@ static int invalidate_page_trampoline(struct ib_umem *item, u64 start,
 
 static void ib_umem_notifier_invalidate_page(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
-					     unsigned long address)
+					     unsigned long address,
+					     enum mmu_event event)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
 
@@ -192,7 +193,8 @@ static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
 static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
 						    unsigned long start,
-						    unsigned long end)
+						    unsigned long end,
+						    enum mmu_event event)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
 
@@ -217,7 +219,8 @@ static int invalidate_range_end_trampoline(struct ib_umem *item, u64 start,
 static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
 						  struct mm_struct *mm,
 						  unsigned long start,
-						  unsigned long end)
+						  unsigned long end,
+						  enum mmu_event event)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
 
diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 90f70d0..31acb07 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -402,7 +402,8 @@ static int mn_clear_flush_young(struct mmu_notifier *mn,
 
 static void mn_invalidate_page(struct mmu_notifier *mn,
 			       struct mm_struct *mm,
-			       unsigned long address)
+			       unsigned long address,
+			       enum mmu_event event)
 {
 	__mn_flush_page(mn, address);
 }
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index 2129274..e67fed1 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -221,7 +221,8 @@ void gru_flush_all_tlb(struct gru_state *gru)
  */
 static void gru_invalidate_range_start(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end)
+				       unsigned long start, unsigned long end,
+				       enum mmu_event event)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
@@ -235,7 +236,8 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
 
 static void gru_invalidate_range_end(struct mmu_notifier *mn,
 				     struct mm_struct *mm, unsigned long start,
-				     unsigned long end)
+				     unsigned long end,
+				     enum mmu_event event)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
@@ -248,7 +250,8 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
 }
 
 static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
-				unsigned long address)
+				unsigned long address,
+				enum mmu_event event)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 073b4a1..fe9da94 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -428,7 +428,9 @@ static void unmap_if_in_range(struct grant_map *map,
 
 static void mn_invl_range_start(struct mmu_notifier *mn,
 				struct mm_struct *mm,
-				unsigned long start, unsigned long end)
+				unsigned long start,
+				unsigned long end,
+				enum mmu_event event)
 {
 	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
 	struct grant_map *map;
@@ -445,9 +447,10 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
 
 static void mn_invl_page(struct mmu_notifier *mn,
 			 struct mm_struct *mm,
-			 unsigned long address)
+			 unsigned long address,
+			 enum mmu_event event)
 {
-	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE);
+	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, event);
 }
 
 static void mn_release(struct mmu_notifier *mn,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 246eae8..8a79a74 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -875,7 +875,8 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 				downgrade_write(&mm->mmap_sem);
 				break;
 			}
-			mmu_notifier_invalidate_range_start(mm, 0, -1);
+			mmu_notifier_invalidate_range_start(mm, 0,
+							    -1, MMU_ISDIRTY);
 		}
 		for (vma = mm->mmap; vma; vma = vma->vm_next) {
 			cp.vma = vma;
@@ -900,7 +901,8 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 					&clear_refs_walk);
 		}
 		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_end(mm, 0, -1);
+			mmu_notifier_invalidate_range_end(mm, 0,
+							  -1, MMU_ISDIRTY);
 		flush_tlb_mm(mm);
 		up_read(&mm->mmap_sem);
 		mmput(mm);
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 95243d2..ac2a121 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -9,6 +9,66 @@
 struct mmu_notifier;
 struct mmu_notifier_ops;
 
+/* MMU Events report fine-grained information to the callback routine, allowing
+ * the event listener to make a more informed decision as to what action to
+ * take. The event types are:
+ *
+ *   - MMU_HSPLIT: huge page split. The memory is the same; only the page table
+ *     structure is updated (level added or removed).
+ *
+ *   - MMU_ISDIRTY: the dirty bit of the page table needs updating so proper
+ *     dirty accounting can happen.
+ *
+ *   - MMU_MIGRATE: memory is migrating from one page to another, thus all write
+ *     access must stop after invalidate_range_start callback returns.
+ *     Furthermore, no read access should be allowed either, as a new page can
+ *     be remapped with write access before the invalidate_range_end callback
+ *     happens and thus any read access to old page might read stale data. There
+ *     are several sources for this event, including:
+ *
+ *         - A page moving to swap (various reasons, including page reclaim),
+ *         - An mremap syscall,
+ *         - migration for NUMA reasons,
+ *         - balancing the memory pool,
+ *         - write fault on COW page,
+ *         - and more that are not listed here.
+ *
+ *   - MMU_MPROT: memory access protection is changing. Refer to the vma to get
+ *     the new access protection. All memory accesses are still valid until the
+ *     invalidate_range_end callback.
+ *
+ *   - MMU_MUNLOCK: unlock memory. Content of page table stays the same but
+ *     pages are unlocked.
+ *
+ *   - MMU_MUNMAP: the range is being unmapped (outcome of a munmap syscall or
+ *     process destruction). However, access is still allowed, up until the
+ *     invalidate_range_free_pages callback. This also implies that secondary
+ *     page table can be trimmed, because the address range is no longer valid.
+ *
+ *   - MMU_WRITE_BACK: memory is being written back to disk, all write accesses
+ *     must stop after invalidate_range_start callback returns. Read accesses
+ *     are still allowed.
+ *
+ *   - MMU_WRITE_PROTECT: memory is being write protected (ie should be mapped
+ *     read only no matter what the vma memory protection allows). All write
+ *     accesses must stop after invalidate_range_start callback returns. Read
+ *     accesses are still allowed.
+ *
+ * If in doubt when adding a new notifier caller, please use MMU_MIGRATE,
+ * because it will always lead to reasonable behavior, but will not allow the
+ * listener a chance to optimize its events.
+ */
+enum mmu_event {
+	MMU_HSPLIT = 0,
+	MMU_ISDIRTY,
+	MMU_MIGRATE,
+	MMU_MPROT,
+	MMU_MUNLOCK,
+	MMU_MUNMAP,
+	MMU_WRITE_BACK,
+	MMU_WRITE_PROTECT,
+};
+
 #ifdef CONFIG_MMU_NOTIFIER
 
 /*
@@ -82,7 +142,8 @@ struct mmu_notifier_ops {
 	void (*change_pte)(struct mmu_notifier *mn,
 			   struct mm_struct *mm,
 			   unsigned long address,
-			   pte_t pte);
+			   pte_t pte,
+			   enum mmu_event event);
 
 	/*
 	 * Before this is invoked any secondary MMU is still ok to
@@ -93,7 +154,8 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_page)(struct mmu_notifier *mn,
 				struct mm_struct *mm,
-				unsigned long address);
+				unsigned long address,
+				enum mmu_event event);
 
 	/*
 	 * invalidate_range_start() and invalidate_range_end() must be
@@ -140,10 +202,14 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_range_start)(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end);
+				       unsigned long start,
+				       unsigned long end,
+				       enum mmu_event event);
 	void (*invalidate_range_end)(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
-				     unsigned long start, unsigned long end);
+				     unsigned long start,
+				     unsigned long end,
+				     enum mmu_event event);
 
 	/*
 	 * invalidate_range() is either called between
@@ -206,13 +272,20 @@ extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
 extern int __mmu_notifier_test_young(struct mm_struct *mm,
 				     unsigned long address);
 extern void __mmu_notifier_change_pte(struct mm_struct *mm,
-				      unsigned long address, pte_t pte);
+				      unsigned long address,
+				      pte_t pte,
+				      enum mmu_event event);
 extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address);
+					  unsigned long address,
+					  enum mmu_event event);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end);
+						  unsigned long start,
+						  unsigned long end,
+						  enum mmu_event event);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end);
+						unsigned long start,
+						unsigned long end,
+						enum mmu_event event);
 extern void __mmu_notifier_invalidate_range(struct mm_struct *mm,
 				  unsigned long start, unsigned long end);
 
@@ -240,31 +313,38 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_change_pte(struct mm_struct *mm,
-					   unsigned long address, pte_t pte)
+					   unsigned long address,
+					   pte_t pte,
+					   enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_change_pte(mm, address, pte);
+		__mmu_notifier_change_pte(mm, address, pte, event);
 }
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address)
+						unsigned long address,
+						enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_page(mm, address);
+		__mmu_notifier_invalidate_page(mm, address, event);
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						       unsigned long start,
+						       unsigned long end,
+						       enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_start(mm, start, end);
+		__mmu_notifier_invalidate_range_start(mm, start, end, event);
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						     unsigned long start,
+						     unsigned long end,
+						     enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_end(mm, start, end);
+		__mmu_notifier_invalidate_range_end(mm, start, end, event);
 }
 
 static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
@@ -359,13 +439,13 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
  * old page would remain mapped readonly in the secondary MMUs after the new
  * page is already writable by some CPU through the primary MMU.
  */
-#define set_pte_at_notify(__mm, __address, __ptep, __pte)		\
+#define set_pte_at_notify(__mm, __address, __ptep, __pte, __event)	\
 ({									\
 	struct mm_struct *___mm = __mm;					\
 	unsigned long ___address = __address;				\
 	pte_t ___pte = __pte;						\
 									\
-	mmu_notifier_change_pte(___mm, ___address, ___pte);		\
+	mmu_notifier_change_pte(___mm, ___address, ___pte, __event);	\
 	set_pte_at(___mm, ___address, __ptep, ___pte);			\
 })
 
@@ -393,22 +473,29 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_change_pte(struct mm_struct *mm,
-					   unsigned long address, pte_t pte)
+					   unsigned long address,
+					   pte_t pte,
+					   enum mmu_event event)
 {
 }
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address)
+						unsigned long address,
+						enum mmu_event event)
 {
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						       unsigned long start,
+						       unsigned long end,
+						       enum mmu_event event)
 {
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						     unsigned long start,
+						     unsigned long end,
+						     enum mmu_event event)
 {
 }
 
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index cb346f2..802828a 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -176,7 +176,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	/* For try_to_free_swap() and munlock_vma_page() below */
 	lock_page(page);
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	err = -EAGAIN;
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -194,7 +195,9 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	flush_cache_page(vma, addr, pte_pfn(*ptep));
 	ptep_clear_flush_notify(vma, addr, ptep);
-	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+	set_pte_at_notify(mm, addr, ptep,
+			  mk_pte(kpage, vma->vm_page_prot),
+			  MMU_MIGRATE);
 
 	page_remove_rmap(page);
 	if (!page_mapped(page))
@@ -208,7 +211,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	err = 0;
  unlock:
 	mem_cgroup_cancel_charge(kpage, memcg);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 	unlock_page(page);
 	return err;
 }
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index 0d105ae..fb97c7c 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -193,7 +193,7 @@ retry:
 			BUG_ON(pte_dirty(pteval));
 			pte_unmap_unlock(pte, ptl);
 			/* must invalidate_page _before_ freeing the page */
-			mmu_notifier_invalidate_page(mm, address);
+			mmu_notifier_invalidate_page(mm, address, MMU_MIGRATE);
 			page_cache_release(page);
 		}
 	}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cf3b67b..75eb651 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1028,7 +1028,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+					    MMU_MIGRATE);
 
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, orig_pmd)))
@@ -1062,7 +1063,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	page_remove_rmap(page);
 	spin_unlock(ptl);
 
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	ret |= VM_FAULT_WRITE;
 	put_page(page);
@@ -1072,7 +1074,8 @@ out:
 
 out_free_pages:
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		memcg = (void *)page_private(pages[i]);
 		set_page_private(pages[i], 0);
@@ -1164,7 +1167,8 @@ alloc:
 
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+					    MMU_MIGRATE);
 
 	spin_lock(ptl);
 	if (page)
@@ -1196,7 +1200,8 @@ alloc:
 	}
 	spin_unlock(ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 out:
 	return ret;
 out_unlock:
@@ -1667,7 +1672,8 @@ static int __split_huge_page_splitting(struct page *page,
 	const unsigned long mmun_start = address;
 	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_HSPLIT);
 	pmd = page_check_address_pmd(page, mm, address,
 			PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
 	if (pmd) {
@@ -1683,7 +1689,8 @@ static int __split_huge_page_splitting(struct page *page,
 		ret = 1;
 		spin_unlock(ptl);
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_HSPLIT);
 
 	return ret;
 }
@@ -2504,7 +2511,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 
 	mmun_start = address;
 	mmun_end   = address + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
 	/*
 	 * After this gup_fast can't run anymore. This also removes
@@ -2514,7 +2522,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 */
 	_pmd = pmdp_clear_flush(vma, address, pmd);
 	spin_unlock(pmd_ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	spin_lock(pte_ptl);
 	isolated = __collapse_huge_page_isolate(vma, address, pte);
@@ -2905,24 +2914,28 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
 again:
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_trans_huge(*pmd))) {
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(mm, mmun_start,
+						  mmun_end, MMU_MIGRATE);
 		return;
 	}
 	if (is_huge_zero_pmd(*pmd)) {
 		__split_huge_zero_page_pmd(vma, haddr, pmd);
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(mm, mmun_start,
+						  mmun_end, MMU_MIGRATE);
 		return;
 	}
 	page = pmd_page(*pmd);
 	VM_BUG_ON_PAGE(!page_count(page), page);
 	get_page(page);
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	split_huge_page(page);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 85032de..b4770c4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2560,7 +2560,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	mmun_start = vma->vm_start;
 	mmun_end = vma->vm_end;
 	if (cow)
-		mmu_notifier_invalidate_range_start(src, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_start(src, mmun_start,
+						    mmun_end, MMU_MIGRATE);
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
 		spinlock_t *src_ptl, *dst_ptl;
@@ -2614,7 +2615,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	}
 
 	if (cow)
-		mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(src, mmun_start,
+						  mmun_end, MMU_MIGRATE);
 
 	return ret;
 }
@@ -2640,7 +2642,8 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	BUG_ON(end & ~huge_page_mask(h));
 
 	tlb_start_vma(tlb, vma);
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	address = start;
 again:
 	for (; address < end; address += sz) {
@@ -2713,7 +2716,8 @@ unlock:
 		if (address < end && !ref_page)
 			goto again;
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 	tlb_end_vma(tlb, vma);
 }
 
@@ -2891,8 +2895,8 @@ retry_avoidcopy:
 
 	mmun_start = address & huge_page_mask(h);
 	mmun_end = mmun_start + huge_page_size(h);
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
-
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+					    MMU_MIGRATE);
 	/*
 	 * Retake the page table lock to check for racing updates
 	 * before the page tables are altered
@@ -2913,7 +2917,8 @@ retry_avoidcopy:
 		new_page = old_page;
 	}
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
+					  MMU_MIGRATE);
 out_release_all:
 	page_cache_release(new_page);
 out_release_old:
@@ -3351,7 +3356,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	BUG_ON(address >= end);
 	flush_cache_range(vma, address, end);
 
-	mmu_notifier_invalidate_range_start(mm, start, end);
+	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MPROT);
 	i_mmap_lock_write(vma->vm_file->f_mapping);
 	for (; address < end; address += huge_page_size(h)) {
 		spinlock_t *ptl;
@@ -3382,7 +3387,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	flush_tlb_range(vma, start, end);
 	mmu_notifier_invalidate_range(mm, start, end);
 	i_mmap_unlock_write(vma->vm_file->f_mapping);
-	mmu_notifier_invalidate_range_end(mm, start, end);
+	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MPROT);
 
 	return pages << h->order;
 }
diff --git a/mm/ksm.c b/mm/ksm.c
index d247efa..8c3a892 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -872,7 +872,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 
 	mmun_start = addr;
 	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+					    MMU_WRITE_PROTECT);
 
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -904,7 +905,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 		if (pte_dirty(entry))
 			set_page_dirty(page);
 		entry = pte_mkclean(pte_wrprotect(entry));
-		set_pte_at_notify(mm, addr, ptep, entry);
+		set_pte_at_notify(mm, addr, ptep, entry, MMU_WRITE_PROTECT);
 	}
 	*orig_pte = *ptep;
 	err = 0;
@@ -912,7 +913,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 out_unlock:
 	pte_unmap_unlock(ptep, ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
+					  MMU_WRITE_PROTECT);
 out:
 	return err;
 }
@@ -948,7 +950,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 
 	mmun_start = addr;
 	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+					    MMU_MIGRATE);
 
 	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	if (!pte_same(*ptep, orig_pte)) {
@@ -961,7 +964,9 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 
 	flush_cache_page(vma, addr, pte_pfn(*ptep));
 	ptep_clear_flush_notify(vma, addr, ptep);
-	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+	set_pte_at_notify(mm, addr, ptep,
+			  mk_pte(kpage, vma->vm_page_prot),
+			  MMU_MIGRATE);
 
 	page_remove_rmap(page);
 	if (!page_mapped(page))
@@ -971,7 +976,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	pte_unmap_unlock(ptep, ptl);
 	err = 0;
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
+					  MMU_MIGRATE);
 out:
 	return err;
 }
diff --git a/mm/madvise.c b/mm/madvise.c
index 6fc9b82..d7ac37a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -376,9 +376,9 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
 	tlb_gather_mmu(&tlb, mm, start, end);
 	update_hiwater_rss(mm);
 
-	mmu_notifier_invalidate_range_start(mm, start, end);
+	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MUNMAP);
 	madvise_free_page_range(&tlb, vma, start, end);
-	mmu_notifier_invalidate_range_end(mm, start, end);
+	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MUNMAP);
 	tlb_finish_mmu(&tlb, start, end);
 
 	return 0;
diff --git a/mm/memory.c b/mm/memory.c
index 33f7370..ffca25f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1050,7 +1050,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	mmun_end   = end;
 	if (is_cow)
 		mmu_notifier_invalidate_range_start(src_mm, mmun_start,
-						    mmun_end);
+						    mmun_end, MMU_MIGRATE);
 
 	ret = 0;
 	dst_pgd = pgd_offset(dst_mm, addr);
@@ -1067,7 +1067,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
 
 	if (is_cow)
-		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end,
+						  MMU_MIGRATE);
 	return ret;
 }
 
@@ -1360,10 +1361,12 @@ void unmap_vmas(struct mmu_gather *tlb,
 {
 	struct mm_struct *mm = vma->vm_mm;
 
-	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
+	mmu_notifier_invalidate_range_start(mm, start_addr,
+					    end_addr, MMU_MUNMAP);
 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
 		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
-	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
+	mmu_notifier_invalidate_range_end(mm, start_addr,
+					  end_addr, MMU_MUNMAP);
 }
 
 /**
@@ -1385,10 +1388,10 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, start, end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, start, end);
+	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MIGRATE);
 	for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
 		unmap_single_vma(&tlb, vma, start, end, details);
-	mmu_notifier_invalidate_range_end(mm, start, end);
+	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MIGRATE);
 	tlb_finish_mmu(&tlb, start, end);
 }
 
@@ -1411,9 +1414,9 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, address, end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, address, end);
+	mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
 	unmap_single_vma(&tlb, vma, address, end, details);
-	mmu_notifier_invalidate_range_end(mm, address, end);
+	mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
 	tlb_finish_mmu(&tlb, address, end);
 }
 
@@ -2198,7 +2201,8 @@ gotten:
 
 	mmun_start  = address & PAGE_MASK;
 	mmun_end    = mmun_start + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 
 	/*
 	 * Re-check the pte - we dropped the lock
@@ -2230,7 +2234,7 @@ gotten:
 		 * mmu page tables (such as kvm shadow page tables), we want the
 		 * new page to be mapped directly into the secondary page table.
 		 */
-		set_pte_at_notify(mm, address, page_table, entry);
+		set_pte_at_notify(mm, address, page_table, entry, MMU_MIGRATE);
 		update_mmu_cache(vma, address, page_table);
 		if (old_page) {
 			/*
@@ -2269,7 +2273,8 @@ gotten:
 unlock:
 	pte_unmap_unlock(page_table, ptl);
 	if (mmun_end > mmun_start)
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(mm, mmun_start,
+						  mmun_end, MMU_MIGRATE);
 	if (old_page) {
 		/*
 		 * Don't let another task, with possibly unlocked vma,
diff --git a/mm/migrate.c b/mm/migrate.c
index 344cdf6..254d5bf 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1801,12 +1801,14 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	WARN_ON(PageLRU(new_page));
 
 	/* Recheck the target PMD */
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
 fail_putback:
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(mm, mmun_start,
+						  mmun_end, MMU_MIGRATE);
 
 		/* Reverse changes made by migrate_page_copy() */
 		if (TestClearPageActive(new_page))
@@ -1860,7 +1862,8 @@ fail_putback:
 	page_remove_rmap(page);
 
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	/* Take an "isolate" reference and put new page on the LRU. */
 	get_page(new_page);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 3b9b3d0..e51ea02 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -142,8 +142,10 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
 	return young;
 }
 
-void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
-			       pte_t pte)
+void __mmu_notifier_change_pte(struct mm_struct *mm,
+			       unsigned long address,
+			       pte_t pte,
+			       enum mmu_event event)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -151,13 +153,14 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->change_pte)
-			mn->ops->change_pte(mn, mm, address, pte);
+			mn->ops->change_pte(mn, mm, address, pte, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 
 void __mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address)
+				    unsigned long address,
+				    enum mmu_event event)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -165,13 +168,16 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_page)
-			mn->ops->invalidate_page(mn, mm, address);
+			mn->ops->invalidate_page(mn, mm, address, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 
 void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+					   unsigned long start,
+					   unsigned long end,
+					   enum mmu_event event)
+
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -179,14 +185,17 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_start)
-			mn->ops->invalidate_range_start(mn, mm, start, end);
+			mn->ops->invalidate_range_start(mn, mm, start,
+							end, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
 
 void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+					 unsigned long start,
+					 unsigned long end,
+					 enum mmu_event event)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -204,7 +213,8 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 		if (mn->ops->invalidate_range)
 			mn->ops->invalidate_range(mn, mm, start, end);
 		if (mn->ops->invalidate_range_end)
-			mn->ops->invalidate_range_end(mn, mm, start, end);
+			mn->ops->invalidate_range_end(mn, mm, start,
+						      end, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ace9345..0f5dbfe 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -152,7 +152,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		/* invoke the mmu notifier if the pmd is populated */
 		if (!mni_start) {
 			mni_start = addr;
-			mmu_notifier_invalidate_range_start(mm, mni_start, end);
+			mmu_notifier_invalidate_range_start(mm, mni_start,
+							    end, MMU_MPROT);
 		}
 
 		if (pmd_trans_huge(*pmd)) {
@@ -180,7 +181,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	} while (pmd++, addr = next, addr != end);
 
 	if (mni_start)
-		mmu_notifier_invalidate_range_end(mm, mni_start, end);
+		mmu_notifier_invalidate_range_end(mm, mni_start, end,
+						  MMU_MPROT);
 
 	if (nr_huge_updates)
 		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
diff --git a/mm/mremap.c b/mm/mremap.c
index 17fa018..1ede220 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -177,7 +177,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 
 	mmun_start = old_addr;
 	mmun_end   = old_end;
-	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
 		cond_resched();
@@ -229,7 +230,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	if (likely(need_flush))
 		flush_tlb_range(vma, old_end-len, old_addr);
 
-	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	return len + old_addr - old_end;	/* how much done */
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index b404783..1d96644 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -869,7 +869,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 	pte_unmap_unlock(pte, ptl);
 
 	if (ret) {
-		mmu_notifier_invalidate_page(mm, address);
+		mmu_notifier_invalidate_page(mm, address, MMU_WRITE_BACK);
 		(*cleaned)++;
 	}
 out:
@@ -1171,8 +1171,12 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	int ret = SWAP_AGAIN;
 	enum ttu_flags flags = (enum ttu_flags)arg;
+	enum mmu_event event = MMU_MIGRATE;
 	int dirty = 0;
 
+	if (flags & TTU_MUNLOCK)
+		event = MMU_MUNLOCK;
+
 	pte = page_check_address(page, mm, address, &ptl, 0);
 	if (!pte)
 		goto out;
@@ -1292,7 +1296,7 @@ discard:
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret != SWAP_FAIL && !(flags & TTU_MUNLOCK))
-		mmu_notifier_invalidate_page(mm, address);
+		mmu_notifier_invalidate_page(mm, address, event);
 out:
 	return ret;
 
@@ -1346,7 +1350,9 @@ out_mlock:
 #define CLUSTER_MASK	(~(CLUSTER_SIZE - 1))
 
 static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
-		struct vm_area_struct *vma, struct page *check_page)
+				struct vm_area_struct *vma,
+				struct page *check_page,
+				enum ttu_flags flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pmd_t *pmd;
@@ -1360,6 +1366,10 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 	unsigned long end;
 	int ret = SWAP_AGAIN;
 	int locked_vma = 0;
+	enum mmu_event event = MMU_MIGRATE;
+
+	if (flags & TTU_MUNLOCK)
+		event = MMU_MUNLOCK;
 
 	address = (vma->vm_start + cursor) & CLUSTER_MASK;
 	end = address + CLUSTER_SIZE;
@@ -1374,7 +1384,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 
 	mmun_start = address;
 	mmun_end   = end;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, event);
 
 	/*
 	 * If we can acquire the mmap_sem for read, and vma is VM_LOCKED,
@@ -1443,7 +1453,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 		(*mapcount)--;
 	}
 	pte_unmap_unlock(pte - 1, ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, event);
 	if (locked_vma)
 		up_read(&vma->vm_mm->mmap_sem);
 	return ret;
@@ -1499,7 +1509,9 @@ static int try_to_unmap_nonlinear(struct page *page,
 			while (cursor < max_nl_cursor &&
 				cursor < vma->vm_end - vma->vm_start) {
 				if (try_to_unmap_cluster(cursor, &mapcount,
-						vma, page) == SWAP_MLOCK)
+							 vma, page,
+							 (enum ttu_flags)arg)
+							 == SWAP_MLOCK)
 					ret = SWAP_MLOCK;
 				cursor += CLUSTER_SIZE;
 				vma->vm_private_data = (void *) cursor;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f528343..be2f937 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -254,7 +254,8 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 
 static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
-					     unsigned long address)
+					     unsigned long address,
+					     enum mmu_event event)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush, idx;
@@ -296,7 +297,8 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 					struct mm_struct *mm,
 					unsigned long address,
-					pte_t pte)
+					pte_t pte,
+					enum mmu_event event)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int idx;
@@ -312,7 +314,8 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
 						    unsigned long start,
-						    unsigned long end)
+						    unsigned long end,
+						    enum mmu_event event)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush = 0, idx;
@@ -338,7 +341,8 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 						  struct mm_struct *mm,
 						  unsigned long start,
-						  unsigned long end)
+						  unsigned long end,
+						  enum mmu_event event)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 
-- 
1.9.3



* [PATCH 1/7] mmu_notifier: add event information to address invalidation v6
@ 2014-12-22 16:48   ` j.glisse
  0 siblings, 0 replies; 21+ messages in thread
From: j.glisse @ 2014-12-22 16:48 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Shachar Raindel,
	Liran Liss, Roland Dreier, Ben Sander, Greg Stoner,
	John Bridgman, Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse

From: Jérôme Glisse <jglisse@redhat.com>

The event information will be useful for new users of the mmu_notifier API.
The event argument differentiates between a vma disappearing, a page
being write protected, or simply a page being unmapped. This allows new
users to take different paths for different events. For instance, on unmap
the resources used to track a vma are still valid and should stay around,
while if the event says that a vma is being destroyed, it means that any
resources used to track this vma can be freed.
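
As a rough illustration of the paragraph above, here is a userspace sketch
of how a listener might map an event to an action. The enum mmu_event values
mirror the ones added by this patch; enum mirror_action and
mirror_action_for_event() are hypothetical names invented for this example,
and the policy itself is an assumption, not something the patch implements.

```c
#include <assert.h>

/* Userspace copy of the enum added by this patch, for illustration only. */
enum mmu_event {
	MMU_HSPLIT = 0,
	MMU_ISDIRTY,
	MMU_MIGRATE,
	MMU_MPROT,
	MMU_MUNLOCK,
	MMU_MUNMAP,
	MMU_WRITE_BACK,
	MMU_WRITE_PROTECT,
};

/* Hypothetical listener-side actions; not part of the patch. */
enum mirror_action {
	MIRROR_NOTHING,		/* leave the secondary mapping alone */
	MIRROR_WRITE_PROTECT,	/* drop write access, keep read access */
	MIRROR_INVALIDATE,	/* drop the mapping, keep vma tracking */
	MIRROR_FREE_TRACKING,	/* drop the mapping and the vma tracking */
};

/* Assumed policy: pick the cheapest action each event seems to permit. */
enum mirror_action mirror_action_for_event(enum mmu_event event)
{
	assert(event <= MMU_WRITE_PROTECT);

	switch (event) {
	case MMU_MUNMAP:
		/* The address range is going away: tracking can be freed. */
		return MIRROR_FREE_TRACKING;
	case MMU_WRITE_BACK:
	case MMU_WRITE_PROTECT:
		/* Writes must stop, reads are still allowed. */
		return MIRROR_WRITE_PROTECT;
	case MMU_MUNLOCK:
		/* Page table content stays the same. */
		return MIRROR_NOTHING;
	case MMU_HSPLIT:
	case MMU_ISDIRTY:
	case MMU_MIGRATE:
	case MMU_MPROT:
	default:
		/* Conservative choice: behave as for MMU_MIGRATE. */
		return MIRROR_INVALIDATE;
	}
}
```

An invalidate_range_start() implementation could call such a helper with the
event it receives and keep its vma tracking structures unless the result is
MIRROR_FREE_TRACKING. Falling back to the MMU_MIGRATE behavior for anything
the listener does not special-case is the safe default.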

Changed since v1:
  - renamed action into event (updated commit message too).
  - simplified the event names and clarified their usage,
    also documenting what expectations the listener can have
    with respect to each event.

Changed since v2:
  - Avoid crazy name.
  - Do not move code that does not need to move.

Changed since v3:
  - Separate huge page split from mlock/munlock and softdirty.

Changed since v4:
  - Rebase (no other changes).

Changed since v5:
  - Typo fix.
  - Changed zap_page_range from MMU_MUNMAP to MMU_MIGRATE to reflect the
    fact that the address range is still valid, only the pages backing it
    are no longer.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 drivers/gpu/drm/i915/i915_gem_userptr.c |   3 +-
 drivers/gpu/drm/radeon/radeon_mn.c      |   3 +-
 drivers/infiniband/core/umem_odp.c      |   9 ++-
 drivers/iommu/amd_iommu_v2.c            |   3 +-
 drivers/misc/sgi-gru/grutlbpurge.c      |   9 ++-
 drivers/xen/gntdev.c                    |   9 ++-
 fs/proc/task_mmu.c                      |   6 +-
 include/linux/mmu_notifier.h            | 131 ++++++++++++++++++++++++++------
 kernel/events/uprobes.c                 |  10 ++-
 mm/filemap_xip.c                        |   2 +-
 mm/huge_memory.c                        |  39 ++++++----
 mm/hugetlb.c                            |  23 +++---
 mm/ksm.c                                |  18 +++--
 mm/madvise.c                            |   4 +-
 mm/memory.c                             |  27 ++++---
 mm/migrate.c                            |   9 ++-
 mm/mmu_notifier.c                       |  28 ++++---
 mm/mprotect.c                           |   6 +-
 mm/mremap.c                             |   6 +-
 mm/rmap.c                               |  24 ++++--
 virt/kvm/kvm_main.c                     |  12 ++-
 21 files changed, 274 insertions(+), 107 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index d182058..20dbd26 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -129,7 +129,8 @@ restart:
 static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 						       struct mm_struct *mm,
 						       unsigned long start,
-						       unsigned long end)
+						       unsigned long end,
+						       enum mmu_event event)
 {
 	struct i915_mmu_notifier *mn = container_of(_mn, struct i915_mmu_notifier, mn);
 	struct interval_tree_node *it = NULL;
diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c
index a69bd44..daf53d3 100644
--- a/drivers/gpu/drm/radeon/radeon_mn.c
+++ b/drivers/gpu/drm/radeon/radeon_mn.c
@@ -109,7 +109,8 @@ static void radeon_mn_release(struct mmu_notifier *mn,
 static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
 					     unsigned long start,
-					     unsigned long end)
+					     unsigned long end,
+					     enum mmu_event event)
 {
 	struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
 	struct interval_tree_node *it;
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 6095872..bc36e8c 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -165,7 +165,8 @@ static int invalidate_page_trampoline(struct ib_umem *item, u64 start,
 
 static void ib_umem_notifier_invalidate_page(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
-					     unsigned long address)
+					     unsigned long address,
+					     enum mmu_event event)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
 
@@ -192,7 +193,8 @@ static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
 static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
 						    unsigned long start,
-						    unsigned long end)
+						    unsigned long end,
+						    enum mmu_event event)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
 
@@ -217,7 +219,8 @@ static int invalidate_range_end_trampoline(struct ib_umem *item, u64 start,
 static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
 						  struct mm_struct *mm,
 						  unsigned long start,
-						  unsigned long end)
+						  unsigned long end,
+						  enum mmu_event event)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
 
diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 90f70d0..31acb07 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -402,7 +402,8 @@ static int mn_clear_flush_young(struct mmu_notifier *mn,
 
 static void mn_invalidate_page(struct mmu_notifier *mn,
 			       struct mm_struct *mm,
-			       unsigned long address)
+			       unsigned long address,
+			       enum mmu_event event)
 {
 	__mn_flush_page(mn, address);
 }
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index 2129274..e67fed1 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -221,7 +221,8 @@ void gru_flush_all_tlb(struct gru_state *gru)
  */
 static void gru_invalidate_range_start(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end)
+				       unsigned long start, unsigned long end,
+				       enum mmu_event event)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
@@ -235,7 +236,8 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
 
 static void gru_invalidate_range_end(struct mmu_notifier *mn,
 				     struct mm_struct *mm, unsigned long start,
-				     unsigned long end)
+				     unsigned long end,
+				     enum mmu_event event)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
@@ -248,7 +250,8 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
 }
 
 static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
-				unsigned long address)
+				unsigned long address,
+				enum mmu_event event)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 073b4a1..fe9da94 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -428,7 +428,9 @@ static void unmap_if_in_range(struct grant_map *map,
 
 static void mn_invl_range_start(struct mmu_notifier *mn,
 				struct mm_struct *mm,
-				unsigned long start, unsigned long end)
+				unsigned long start,
+				unsigned long end,
+				enum mmu_event event)
 {
 	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
 	struct grant_map *map;
@@ -445,9 +447,10 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
 
 static void mn_invl_page(struct mmu_notifier *mn,
 			 struct mm_struct *mm,
-			 unsigned long address)
+			 unsigned long address,
+			 enum mmu_event event)
 {
-	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE);
+	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, event);
 }
 
 static void mn_release(struct mmu_notifier *mn,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 246eae8..8a79a74 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -875,7 +875,8 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 				downgrade_write(&mm->mmap_sem);
 				break;
 			}
-			mmu_notifier_invalidate_range_start(mm, 0, -1);
+			mmu_notifier_invalidate_range_start(mm, 0,
+							    -1, MMU_ISDIRTY);
 		}
 		for (vma = mm->mmap; vma; vma = vma->vm_next) {
 			cp.vma = vma;
@@ -900,7 +901,8 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 					&clear_refs_walk);
 		}
 		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_end(mm, 0, -1);
+			mmu_notifier_invalidate_range_end(mm, 0,
+							  -1, MMU_ISDIRTY);
 		flush_tlb_mm(mm);
 		up_read(&mm->mmap_sem);
 		mmput(mm);
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 95243d2..ac2a121 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -9,6 +9,66 @@
 struct mmu_notifier;
 struct mmu_notifier_ops;
 
+/* MMU Events report fine-grained information to the callback routine, allowing
+ * the event listener to make a more informed decision as to what action to
+ * take. The event types are:
+ *
+ *   - MMU_HSPLIT: huge page split, the memory is the same, only the page table
+ *     structure is updated (level added or removed).
+ *
+ *   - MMU_ISDIRTY: need to update the dirty bit of the page table so proper
+ *     dirty accounting can happen.
+ *
+ *   - MMU_MIGRATE: memory is migrating from one page to another, thus all write
+ *     access must stop after invalidate_range_start callback returns.
+ *     Furthermore, no read access should be allowed either, as a new page can
+ *     be remapped with write access before the invalidate_range_end callback
+ *     happens and thus any read access to old page might read stale data. There
+ *     are several sources for this event, including:
+ *
+ *         - A page moving to swap (various reasons, including page reclaim),
+ *         - An mremap syscall,
+ *         - migration for NUMA reasons,
+ *         - balancing the memory pool,
+ *         - write fault on COW page,
+ *         - and more that are not listed here.
+ *
+ *   - MMU_MPROT: memory access protection is changing. Refer to the vma to get
+ *     the new access protection. All memory accesses are still valid until the
+ *     invalidate_range_end callback.
+ *
+ *   - MMU_MUNLOCK: unlock memory. Content of page table stays the same but
+ *     pages are unlocked.
+ *
+ *   - MMU_MUNMAP: the range is being unmapped (outcome of a munmap syscall or
+ *     process destruction). However, access is still allowed, up until the
+ *     invalidate_range_free_pages callback. This also implies that secondary
+ *     page table can be trimmed, because the address range is no longer valid.
+ *
+ *   - MMU_WRITE_BACK: memory is being written back to disk, all write accesses
+ *     must stop after invalidate_range_start callback returns. Read access is
+ *     still allowed.
+ *
+ *   - MMU_WRITE_PROTECT: memory is being write protected (ie should be mapped
+ *     read only no matter what the vma memory protection allows). All write
+ *     accesses must stop after invalidate_range_start callback returns. Read
+ *     access is still allowed.
+ *
+ * If in doubt when adding a new notifier caller, please use MMU_MIGRATE,
+ * because it will always lead to reasonable behavior, but will not allow the
+ * listener a chance to optimize its events.
+ */
+enum mmu_event {
+	MMU_HSPLIT = 0,
+	MMU_ISDIRTY,
+	MMU_MIGRATE,
+	MMU_MPROT,
+	MMU_MUNLOCK,
+	MMU_MUNMAP,
+	MMU_WRITE_BACK,
+	MMU_WRITE_PROTECT,
+};
+
 #ifdef CONFIG_MMU_NOTIFIER
 
 /*
@@ -82,7 +142,8 @@ struct mmu_notifier_ops {
 	void (*change_pte)(struct mmu_notifier *mn,
 			   struct mm_struct *mm,
 			   unsigned long address,
-			   pte_t pte);
+			   pte_t pte,
+			   enum mmu_event event);
 
 	/*
 	 * Before this is invoked any secondary MMU is still ok to
@@ -93,7 +154,8 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_page)(struct mmu_notifier *mn,
 				struct mm_struct *mm,
-				unsigned long address);
+				unsigned long address,
+				enum mmu_event event);
 
 	/*
 	 * invalidate_range_start() and invalidate_range_end() must be
@@ -140,10 +202,14 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_range_start)(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end);
+				       unsigned long start,
+				       unsigned long end,
+				       enum mmu_event event);
 	void (*invalidate_range_end)(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
-				     unsigned long start, unsigned long end);
+				     unsigned long start,
+				     unsigned long end,
+				     enum mmu_event event);
 
 	/*
 	 * invalidate_range() is either called between
@@ -206,13 +272,20 @@ extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
 extern int __mmu_notifier_test_young(struct mm_struct *mm,
 				     unsigned long address);
 extern void __mmu_notifier_change_pte(struct mm_struct *mm,
-				      unsigned long address, pte_t pte);
+				      unsigned long address,
+				      pte_t pte,
+				      enum mmu_event event);
 extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address);
+					  unsigned long address,
+					  enum mmu_event event);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end);
+						  unsigned long start,
+						  unsigned long end,
+						  enum mmu_event event);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end);
+						unsigned long start,
+						unsigned long end,
+						enum mmu_event event);
 extern void __mmu_notifier_invalidate_range(struct mm_struct *mm,
 				  unsigned long start, unsigned long end);
 
@@ -240,31 +313,38 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_change_pte(struct mm_struct *mm,
-					   unsigned long address, pte_t pte)
+					   unsigned long address,
+					   pte_t pte,
+					   enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_change_pte(mm, address, pte);
+		__mmu_notifier_change_pte(mm, address, pte, event);
 }
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address)
+						unsigned long address,
+						enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_page(mm, address);
+		__mmu_notifier_invalidate_page(mm, address, event);
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						       unsigned long start,
+						       unsigned long end,
+						       enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_start(mm, start, end);
+		__mmu_notifier_invalidate_range_start(mm, start, end, event);
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						     unsigned long start,
+						     unsigned long end,
+						     enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_end(mm, start, end);
+		__mmu_notifier_invalidate_range_end(mm, start, end, event);
 }
 
 static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
@@ -359,13 +439,13 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
  * old page would remain mapped readonly in the secondary MMUs after the new
  * page is already writable by some CPU through the primary MMU.
  */
-#define set_pte_at_notify(__mm, __address, __ptep, __pte)		\
+#define set_pte_at_notify(__mm, __address, __ptep, __pte, __event)	\
 ({									\
 	struct mm_struct *___mm = __mm;					\
 	unsigned long ___address = __address;				\
 	pte_t ___pte = __pte;						\
 									\
-	mmu_notifier_change_pte(___mm, ___address, ___pte);		\
+	mmu_notifier_change_pte(___mm, ___address, ___pte, __event);	\
 	set_pte_at(___mm, ___address, __ptep, ___pte);			\
 })
 
@@ -393,22 +473,29 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_change_pte(struct mm_struct *mm,
-					   unsigned long address, pte_t pte)
+					   unsigned long address,
+					   pte_t pte,
+					   enum mmu_event event)
 {
 }
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address)
+						unsigned long address,
+						enum mmu_event event)
 {
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						       unsigned long start,
+						       unsigned long end,
+						       enum mmu_event event)
 {
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						     unsigned long start,
+						     unsigned long end,
+						     enum mmu_event event)
 {
 }
 
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index cb346f2..802828a 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -176,7 +176,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	/* For try_to_free_swap() and munlock_vma_page() below */
 	lock_page(page);
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	err = -EAGAIN;
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -194,7 +195,9 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	flush_cache_page(vma, addr, pte_pfn(*ptep));
 	ptep_clear_flush_notify(vma, addr, ptep);
-	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+	set_pte_at_notify(mm, addr, ptep,
+			  mk_pte(kpage, vma->vm_page_prot),
+			  MMU_MIGRATE);
 
 	page_remove_rmap(page);
 	if (!page_mapped(page))
@@ -208,7 +211,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	err = 0;
  unlock:
 	mem_cgroup_cancel_charge(kpage, memcg);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 	unlock_page(page);
 	return err;
 }
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index 0d105ae..fb97c7c 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -193,7 +193,7 @@ retry:
 			BUG_ON(pte_dirty(pteval));
 			pte_unmap_unlock(pte, ptl);
 			/* must invalidate_page _before_ freeing the page */
-			mmu_notifier_invalidate_page(mm, address);
+			mmu_notifier_invalidate_page(mm, address, MMU_MIGRATE);
 			page_cache_release(page);
 		}
 	}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cf3b67b..75eb651 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1028,7 +1028,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+					    MMU_MIGRATE);
 
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, orig_pmd)))
@@ -1062,7 +1063,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	page_remove_rmap(page);
 	spin_unlock(ptl);
 
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	ret |= VM_FAULT_WRITE;
 	put_page(page);
@@ -1072,7 +1074,8 @@ out:
 
 out_free_pages:
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		memcg = (void *)page_private(pages[i]);
 		set_page_private(pages[i], 0);
@@ -1164,7 +1167,8 @@ alloc:
 
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+					    MMU_MIGRATE);
 
 	spin_lock(ptl);
 	if (page)
@@ -1196,7 +1200,8 @@ alloc:
 	}
 	spin_unlock(ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 out:
 	return ret;
 out_unlock:
@@ -1667,7 +1672,8 @@ static int __split_huge_page_splitting(struct page *page,
 	const unsigned long mmun_start = address;
 	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_HSPLIT);
 	pmd = page_check_address_pmd(page, mm, address,
 			PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
 	if (pmd) {
@@ -1683,7 +1689,8 @@ static int __split_huge_page_splitting(struct page *page,
 		ret = 1;
 		spin_unlock(ptl);
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_HSPLIT);
 
 	return ret;
 }
@@ -2504,7 +2511,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 
 	mmun_start = address;
 	mmun_end   = address + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
 	/*
 	 * After this gup_fast can't run anymore. This also removes
@@ -2514,7 +2522,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 */
 	_pmd = pmdp_clear_flush(vma, address, pmd);
 	spin_unlock(pmd_ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	spin_lock(pte_ptl);
 	isolated = __collapse_huge_page_isolate(vma, address, pte);
@@ -2905,24 +2914,28 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
 again:
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_trans_huge(*pmd))) {
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(mm, mmun_start,
+						  mmun_end, MMU_MIGRATE);
 		return;
 	}
 	if (is_huge_zero_pmd(*pmd)) {
 		__split_huge_zero_page_pmd(vma, haddr, pmd);
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(mm, mmun_start,
+						  mmun_end, MMU_MIGRATE);
 		return;
 	}
 	page = pmd_page(*pmd);
 	VM_BUG_ON_PAGE(!page_count(page), page);
 	get_page(page);
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	split_huge_page(page);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 85032de..b4770c4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2560,7 +2560,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	mmun_start = vma->vm_start;
 	mmun_end = vma->vm_end;
 	if (cow)
-		mmu_notifier_invalidate_range_start(src, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_start(src, mmun_start,
+						    mmun_end, MMU_MIGRATE);
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
 		spinlock_t *src_ptl, *dst_ptl;
@@ -2614,7 +2615,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	}
 
 	if (cow)
-		mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(src, mmun_start,
+						  mmun_end, MMU_MIGRATE);
 
 	return ret;
 }
@@ -2640,7 +2642,8 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	BUG_ON(end & ~huge_page_mask(h));
 
 	tlb_start_vma(tlb, vma);
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	address = start;
 again:
 	for (; address < end; address += sz) {
@@ -2713,7 +2716,8 @@ unlock:
 		if (address < end && !ref_page)
 			goto again;
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 	tlb_end_vma(tlb, vma);
 }
 
@@ -2891,8 +2895,8 @@ retry_avoidcopy:
 
 	mmun_start = address & huge_page_mask(h);
 	mmun_end = mmun_start + huge_page_size(h);
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
-
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+					    MMU_MIGRATE);
 	/*
 	 * Retake the page table lock to check for racing updates
 	 * before the page tables are altered
@@ -2913,7 +2917,8 @@ retry_avoidcopy:
 		new_page = old_page;
 	}
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
+					  MMU_MIGRATE);
 out_release_all:
 	page_cache_release(new_page);
 out_release_old:
@@ -3351,7 +3356,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	BUG_ON(address >= end);
 	flush_cache_range(vma, address, end);
 
-	mmu_notifier_invalidate_range_start(mm, start, end);
+	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MPROT);
 	i_mmap_lock_write(vma->vm_file->f_mapping);
 	for (; address < end; address += huge_page_size(h)) {
 		spinlock_t *ptl;
@@ -3382,7 +3387,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	flush_tlb_range(vma, start, end);
 	mmu_notifier_invalidate_range(mm, start, end);
 	i_mmap_unlock_write(vma->vm_file->f_mapping);
-	mmu_notifier_invalidate_range_end(mm, start, end);
+	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MPROT);
 
 	return pages << h->order;
 }
diff --git a/mm/ksm.c b/mm/ksm.c
index d247efa..8c3a892 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -872,7 +872,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 
 	mmun_start = addr;
 	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+					    MMU_WRITE_PROTECT);
 
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -904,7 +905,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 		if (pte_dirty(entry))
 			set_page_dirty(page);
 		entry = pte_mkclean(pte_wrprotect(entry));
-		set_pte_at_notify(mm, addr, ptep, entry);
+		set_pte_at_notify(mm, addr, ptep, entry, MMU_WRITE_PROTECT);
 	}
 	*orig_pte = *ptep;
 	err = 0;
@@ -912,7 +913,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 out_unlock:
 	pte_unmap_unlock(ptep, ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
+					  MMU_WRITE_PROTECT);
 out:
 	return err;
 }
@@ -948,7 +950,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 
 	mmun_start = addr;
 	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
+					    MMU_MIGRATE);
 
 	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	if (!pte_same(*ptep, orig_pte)) {
@@ -961,7 +964,9 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 
 	flush_cache_page(vma, addr, pte_pfn(*ptep));
 	ptep_clear_flush_notify(vma, addr, ptep);
-	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+	set_pte_at_notify(mm, addr, ptep,
+			  mk_pte(kpage, vma->vm_page_prot),
+			  MMU_MIGRATE);
 
 	page_remove_rmap(page);
 	if (!page_mapped(page))
@@ -971,7 +976,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	pte_unmap_unlock(ptep, ptl);
 	err = 0;
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
+					  MMU_MIGRATE);
 out:
 	return err;
 }
diff --git a/mm/madvise.c b/mm/madvise.c
index 6fc9b82..d7ac37a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -376,9 +376,9 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
 	tlb_gather_mmu(&tlb, mm, start, end);
 	update_hiwater_rss(mm);
 
-	mmu_notifier_invalidate_range_start(mm, start, end);
+	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MUNMAP);
 	madvise_free_page_range(&tlb, vma, start, end);
-	mmu_notifier_invalidate_range_end(mm, start, end);
+	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MUNMAP);
 	tlb_finish_mmu(&tlb, start, end);
 
 	return 0;
diff --git a/mm/memory.c b/mm/memory.c
index 33f7370..ffca25f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1050,7 +1050,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	mmun_end   = end;
 	if (is_cow)
 		mmu_notifier_invalidate_range_start(src_mm, mmun_start,
-						    mmun_end);
+						    mmun_end, MMU_MIGRATE);
 
 	ret = 0;
 	dst_pgd = pgd_offset(dst_mm, addr);
@@ -1067,7 +1067,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
 
 	if (is_cow)
-		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end,
+						  MMU_MIGRATE);
 	return ret;
 }
 
@@ -1360,10 +1361,12 @@ void unmap_vmas(struct mmu_gather *tlb,
 {
 	struct mm_struct *mm = vma->vm_mm;
 
-	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
+	mmu_notifier_invalidate_range_start(mm, start_addr,
+					    end_addr, MMU_MUNMAP);
 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
 		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
-	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
+	mmu_notifier_invalidate_range_end(mm, start_addr,
+					  end_addr, MMU_MUNMAP);
 }
 
 /**
@@ -1385,10 +1388,10 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, start, end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, start, end);
+	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MIGRATE);
 	for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
 		unmap_single_vma(&tlb, vma, start, end, details);
-	mmu_notifier_invalidate_range_end(mm, start, end);
+	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MIGRATE);
 	tlb_finish_mmu(&tlb, start, end);
 }
 
@@ -1411,9 +1414,9 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, address, end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, address, end);
+	mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
 	unmap_single_vma(&tlb, vma, address, end, details);
-	mmu_notifier_invalidate_range_end(mm, address, end);
+	mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
 	tlb_finish_mmu(&tlb, address, end);
 }
 
@@ -2198,7 +2201,8 @@ gotten:
 
 	mmun_start  = address & PAGE_MASK;
 	mmun_end    = mmun_start + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 
 	/*
 	 * Re-check the pte - we dropped the lock
@@ -2230,7 +2234,7 @@ gotten:
 		 * mmu page tables (such as kvm shadow page tables), we want the
 		 * new page to be mapped directly into the secondary page table.
 		 */
-		set_pte_at_notify(mm, address, page_table, entry);
+		set_pte_at_notify(mm, address, page_table, entry, MMU_MIGRATE);
 		update_mmu_cache(vma, address, page_table);
 		if (old_page) {
 			/*
@@ -2269,7 +2273,8 @@ gotten:
 unlock:
 	pte_unmap_unlock(page_table, ptl);
 	if (mmun_end > mmun_start)
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(mm, mmun_start,
+						  mmun_end, MMU_MIGRATE);
 	if (old_page) {
 		/*
 		 * Don't let another task, with possibly unlocked vma,
diff --git a/mm/migrate.c b/mm/migrate.c
index 344cdf6..254d5bf 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1801,12 +1801,14 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	WARN_ON(PageLRU(new_page));
 
 	/* Recheck the target PMD */
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
 fail_putback:
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(mm, mmun_start,
+						  mmun_end, MMU_MIGRATE);
 
 		/* Reverse changes made by migrate_page_copy() */
 		if (TestClearPageActive(new_page))
@@ -1860,7 +1862,8 @@ fail_putback:
 	page_remove_rmap(page);
 
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	/* Take an "isolate" reference and put new page on the LRU. */
 	get_page(new_page);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 3b9b3d0..e51ea02 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -142,8 +142,10 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
 	return young;
 }
 
-void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
-			       pte_t pte)
+void __mmu_notifier_change_pte(struct mm_struct *mm,
+			       unsigned long address,
+			       pte_t pte,
+			       enum mmu_event event)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -151,13 +153,14 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->change_pte)
-			mn->ops->change_pte(mn, mm, address, pte);
+			mn->ops->change_pte(mn, mm, address, pte, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 
 void __mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address)
+				    unsigned long address,
+				    enum mmu_event event)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -165,13 +168,16 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_page)
-			mn->ops->invalidate_page(mn, mm, address);
+			mn->ops->invalidate_page(mn, mm, address, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 
 void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+					   unsigned long start,
+					   unsigned long end,
+					   enum mmu_event event)
+
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -179,14 +185,17 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_start)
-			mn->ops->invalidate_range_start(mn, mm, start, end);
+			mn->ops->invalidate_range_start(mn, mm, start,
+							end, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
 
 void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+					 unsigned long start,
+					 unsigned long end,
+					 enum mmu_event event)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -204,7 +213,8 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 		if (mn->ops->invalidate_range)
 			mn->ops->invalidate_range(mn, mm, start, end);
 		if (mn->ops->invalidate_range_end)
-			mn->ops->invalidate_range_end(mn, mm, start, end);
+			mn->ops->invalidate_range_end(mn, mm, start,
+						      end, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ace9345..0f5dbfe 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -152,7 +152,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		/* invoke the mmu notifier if the pmd is populated */
 		if (!mni_start) {
 			mni_start = addr;
-			mmu_notifier_invalidate_range_start(mm, mni_start, end);
+			mmu_notifier_invalidate_range_start(mm, mni_start,
+							    end, MMU_MPROT);
 		}
 
 		if (pmd_trans_huge(*pmd)) {
@@ -180,7 +181,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	} while (pmd++, addr = next, addr != end);
 
 	if (mni_start)
-		mmu_notifier_invalidate_range_end(mm, mni_start, end);
+		mmu_notifier_invalidate_range_end(mm, mni_start, end,
+						  MMU_MPROT);
 
 	if (nr_huge_updates)
 		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
diff --git a/mm/mremap.c b/mm/mremap.c
index 17fa018..1ede220 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -177,7 +177,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 
 	mmun_start = old_addr;
 	mmun_end   = old_end;
-	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
 		cond_resched();
@@ -229,7 +230,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	if (likely(need_flush))
 		flush_tlb_range(vma, old_end-len, old_addr);
 
-	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	return len + old_addr - old_end;	/* how much done */
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index b404783..1d96644 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -869,7 +869,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 	pte_unmap_unlock(pte, ptl);
 
 	if (ret) {
-		mmu_notifier_invalidate_page(mm, address);
+		mmu_notifier_invalidate_page(mm, address, MMU_WRITE_BACK);
 		(*cleaned)++;
 	}
 out:
@@ -1171,8 +1171,12 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	int ret = SWAP_AGAIN;
 	enum ttu_flags flags = (enum ttu_flags)arg;
+	enum mmu_event event = MMU_MIGRATE;
 	int dirty = 0;
 
+	if (flags & TTU_MUNLOCK)
+		event = MMU_MUNLOCK;
+
 	pte = page_check_address(page, mm, address, &ptl, 0);
 	if (!pte)
 		goto out;
@@ -1292,7 +1296,7 @@ discard:
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret != SWAP_FAIL && !(flags & TTU_MUNLOCK))
-		mmu_notifier_invalidate_page(mm, address);
+		mmu_notifier_invalidate_page(mm, address, event);
 out:
 	return ret;
 
@@ -1346,7 +1350,9 @@ out_mlock:
 #define CLUSTER_MASK	(~(CLUSTER_SIZE - 1))
 
 static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
-		struct vm_area_struct *vma, struct page *check_page)
+				struct vm_area_struct *vma,
+				struct page *check_page,
+				enum ttu_flags flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pmd_t *pmd;
@@ -1360,6 +1366,10 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 	unsigned long end;
 	int ret = SWAP_AGAIN;
 	int locked_vma = 0;
+	enum mmu_event event = MMU_MIGRATE;
+
+	if (flags & TTU_MUNLOCK)
+		event = MMU_MUNLOCK;
 
 	address = (vma->vm_start + cursor) & CLUSTER_MASK;
 	end = address + CLUSTER_SIZE;
@@ -1374,7 +1384,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 
 	mmun_start = address;
 	mmun_end   = end;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, event);
 
 	/*
 	 * If we can acquire the mmap_sem for read, and vma is VM_LOCKED,
@@ -1443,7 +1453,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 		(*mapcount)--;
 	}
 	pte_unmap_unlock(pte - 1, ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, event);
 	if (locked_vma)
 		up_read(&vma->vm_mm->mmap_sem);
 	return ret;
@@ -1499,7 +1509,9 @@ static int try_to_unmap_nonlinear(struct page *page,
 			while (cursor < max_nl_cursor &&
 				cursor < vma->vm_end - vma->vm_start) {
 				if (try_to_unmap_cluster(cursor, &mapcount,
-						vma, page) == SWAP_MLOCK)
+							 vma, page,
+							 (enum ttu_flags)arg)
+							 == SWAP_MLOCK)
 					ret = SWAP_MLOCK;
 				cursor += CLUSTER_SIZE;
 				vma->vm_private_data = (void *) cursor;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f528343..be2f937 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -254,7 +254,8 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 
 static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
-					     unsigned long address)
+					     unsigned long address,
+					     enum mmu_event event)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush, idx;
@@ -296,7 +297,8 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 					struct mm_struct *mm,
 					unsigned long address,
-					pte_t pte)
+					pte_t pte,
+					enum mmu_event event)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int idx;
@@ -312,7 +314,8 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
 						    unsigned long start,
-						    unsigned long end)
+						    unsigned long end,
+						    enum mmu_event event)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush = 0, idx;
@@ -338,7 +341,8 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 						  struct mm_struct *mm,
 						  unsigned long start,
-						  unsigned long end)
+						  unsigned long end,
+						  enum mmu_event event)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/7] mmu_notifier: keep track of active invalidation ranges v2
  2014-12-22 16:48 ` j.glisse
@ 2014-12-22 16:48   ` j.glisse
  -1 siblings, 0 replies; 21+ messages in thread
From: j.glisse @ 2014-12-22 16:48 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Shachar Raindel,
	Liran Liss, Roland Dreier, Ben Sander, Greg Stoner,
	John Bridgman, Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse

From: Jérôme Glisse <jglisse@redhat.com>

mmu_notifier_invalidate_range_start() and mmu_notifier_invalidate_range_end()
can be considered as forming an "atomic" section from the point of view of cpu
page table updates. Between these two functions the cpu page table content is
unreliable for the address range being invalidated.

Current users such as kvm need to know when they can trust the content of the
cpu page table. This becomes even more important for new users of the
mmu_notifier api (such as HMM or ODP).

This patch uses a structure defined at every call site of
invalidate_range_start() that is added to a list for the duration of the
invalidation. It adds two new helpers to query whether a range is being
invalidated or to wait for a range to become valid.
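
As a rough illustration of the tracking described above, here is a small
userspace model. It is only a sketch: the structure and field names follow the
patch, but the list handling, the overlap test and the helper names
(active_ranges, range_start, range_end, range_is_valid, run_demo) are
assumptions made for this example and are not the kernel code. The end address
is treated as exclusive, as the i915 hunk comments on.

```c
/* Userspace model only; not the kernel implementation. */
#include <assert.h>
#include <stdbool.h>

enum mmu_event { MMU_MIGRATE, MMU_MUNMAP };

struct mmu_notifier_range {
	struct mmu_notifier_range *prev, *next;	/* stand-in for list_head */
	unsigned long start;
	unsigned long end;			/* exclusive */
	enum mmu_event event;
};

/* Stand-in for the "ranges" list and "nranges" counter. */
struct active_ranges {
	struct mmu_notifier_range head;		/* circular list sentinel */
	int nranges;
};

static void active_ranges_init(struct active_ranges *ar)
{
	ar->head.prev = ar->head.next = &ar->head;
	ar->nranges = 0;
}

/* Model of range_start: publish the caller's range on the list. */
static void range_start(struct active_ranges *ar, struct mmu_notifier_range *r)
{
	r->next = ar->head.next;
	r->prev = &ar->head;
	ar->head.next->prev = r;
	ar->head.next = r;
	ar->nranges++;
}

/* Model of range_end: take the range off the list again. */
static void range_end(struct active_ranges *ar, struct mmu_notifier_range *r)
{
	r->prev->next = r->next;
	r->next->prev = r->prev;
	r->prev = r->next = r;
	ar->nranges--;
}

/* True when no active invalidation overlaps [start, end). */
static bool range_is_valid(const struct active_ranges *ar,
			   unsigned long start, unsigned long end)
{
	const struct mmu_notifier_range *r;

	for (r = ar->head.next; r != &ar->head; r = r->next)
		if (r->start < end && start < r->end)
			return false;
	return true;
}

/* Walk through one invalidation; returns the number of ranges left active. */
int run_demo(void)
{
	struct active_ranges ar;
	struct mmu_notifier_range range = {
		.start = 0x1000, .end = 0x3000, .event = MMU_MIGRATE,
	};

	active_ranges_init(&ar);
	assert(range_is_valid(&ar, 0x2000, 0x4000));

	range_start(&ar, &range);
	assert(!range_is_valid(&ar, 0x2000, 0x4000));	/* overlaps */
	assert(range_is_valid(&ar, 0x3000, 0x4000));	/* end is exclusive */

	range_end(&ar, &range);
	assert(range_is_valid(&ar, 0x2000, 0x4000));
	return ar.nranges;
}
```

The model has no locking and no wait queue, so it only shows the bookkeeping,
not the synchronization the real helpers need. Calling run_demo() from a small
main() exercises the sequence.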

For proper synchronization, users must block new range invalidations from
inside their invalidate_range_start() callback before calling the helper
functions. Otherwise there is no guarantee that a new range invalidation will
not be added after the call to the helper function that queries for existing
ranges.

Changed since v1:
  - Fix a possible deadlock in mmu_notifier_range_wait_valid()

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 drivers/gpu/drm/i915/i915_gem_userptr.c |  9 ++--
 drivers/gpu/drm/radeon/radeon_mn.c      | 14 ++----
 drivers/infiniband/core/umem_odp.c      | 16 +++---
 drivers/misc/sgi-gru/grutlbpurge.c      | 15 +++---
 drivers/xen/gntdev.c                    | 15 +++---
 fs/proc/task_mmu.c                      | 12 +++--
 include/linux/mmu_notifier.h            | 60 +++++++++++++----------
 kernel/events/uprobes.c                 | 13 +++--
 mm/huge_memory.c                        | 78 +++++++++++++----------------
 mm/hugetlb.c                            | 55 +++++++++++----------
 mm/ksm.c                                | 28 +++++------
 mm/madvise.c                            |  8 ++-
 mm/memory.c                             | 78 ++++++++++++++++-------------
 mm/migrate.c                            | 36 +++++++-------
 mm/mmu_notifier.c                       | 87 ++++++++++++++++++++++++++++-----
 mm/mprotect.c                           | 18 ++++---
 mm/mremap.c                             | 14 +++---
 mm/rmap.c                               | 15 +++---
 virt/kvm/kvm_main.c                     | 10 ++--
 19 files changed, 322 insertions(+), 259 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 20dbd26..a78eede 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -128,16 +128,15 @@ restart:
 
 static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 						       struct mm_struct *mm,
-						       unsigned long start,
-						       unsigned long end,
-						       enum mmu_event event)
+						       const struct mmu_notifier_range *range)
 {
 	struct i915_mmu_notifier *mn = container_of(_mn, struct i915_mmu_notifier, mn);
 	struct interval_tree_node *it = NULL;
-	unsigned long next = start;
+	unsigned long next = range->start;
 	unsigned long serial = 0;
+	/* interval ranges are inclusive, but invalidate range is exclusive */
+	unsigned long end = range->end - 1, start = range->start;
 
-	end--; /* interval ranges are inclusive, but invalidate range is exclusive */
 	while (next < end) {
 		struct drm_i915_gem_object *obj = NULL;
 
diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c
index daf53d3..63e6936 100644
--- a/drivers/gpu/drm/radeon/radeon_mn.c
+++ b/drivers/gpu/drm/radeon/radeon_mn.c
@@ -100,34 +100,30 @@ static void radeon_mn_release(struct mmu_notifier *mn,
  *
  * @mn: our notifier
  * @mn: the mm this callback is about
- * @start: start of updated range
- * @end: end of updated range
+ * @range: Address range information.
  *
  * We block for all BOs between start and end to be idle and
  * unmap them by move them into system domain again.
  */
 static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
-					     unsigned long start,
-					     unsigned long end,
-					     enum mmu_event event)
+					     const struct mmu_notifier_range *range)
 {
 	struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
 	struct interval_tree_node *it;
-
 	/* notification is exclusive, but interval is inclusive */
-	end -= 1;
+	unsigned long end = range->end - 1;
 
 	mutex_lock(&rmn->lock);
 
-	it = interval_tree_iter_first(&rmn->objects, start, end);
+	it = interval_tree_iter_first(&rmn->objects, range->start, end);
 	while (it) {
 		struct radeon_bo *bo;
 		struct fence *fence;
 		int r;
 
 		bo = container_of(it, struct radeon_bo, mn_it);
-		it = interval_tree_iter_next(it, start, end);
+		it = interval_tree_iter_next(it, range->start, end);
 
 		r = radeon_bo_reserve(bo, true);
 		if (r) {
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index bc36e8c..097f1d1 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -192,9 +192,7 @@ static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
 
 static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
-						    unsigned long start,
-						    unsigned long end,
-						    enum mmu_event event)
+						    const struct mmu_notifier_range *range)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
 
@@ -203,8 +201,8 @@ static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
 
 	ib_ucontext_notifier_start_account(context);
 	down_read(&context->umem_rwsem);
-	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
-				      end,
+	rbt_ib_umem_for_each_in_range(&context->umem_tree, range->start,
+				      range->end,
 				      invalidate_range_start_trampoline, NULL);
 	up_read(&context->umem_rwsem);
 }
@@ -218,9 +216,7 @@ static int invalidate_range_end_trampoline(struct ib_umem *item, u64 start,
 
 static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
 						  struct mm_struct *mm,
-						  unsigned long start,
-						  unsigned long end,
-						  enum mmu_event event)
+						  const struct mmu_notifier_range *range)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
 
@@ -228,8 +224,8 @@ static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
 		return;
 
 	down_read(&context->umem_rwsem);
-	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
-				      end,
+	rbt_ib_umem_for_each_in_range(&context->umem_tree, range->start,
+				      range->end,
 				      invalidate_range_end_trampoline, NULL);
 	up_read(&context->umem_rwsem);
 	ib_ucontext_notifier_end_account(context);
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index e67fed1..44b41b7 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -221,8 +221,7 @@ void gru_flush_all_tlb(struct gru_state *gru)
  */
 static void gru_invalidate_range_start(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end,
-				       enum mmu_event event)
+				       const struct mmu_notifier_range *range)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
@@ -230,14 +229,13 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
 	STAT(mmu_invalidate_range);
 	atomic_inc(&gms->ms_range_active);
 	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx, act %d\n", gms,
-		start, end, atomic_read(&gms->ms_range_active));
-	gru_flush_tlb_range(gms, start, end - start);
+		range->start, range->end, atomic_read(&gms->ms_range_active));
+	gru_flush_tlb_range(gms, range->start, range->end - range->start);
 }
 
 static void gru_invalidate_range_end(struct mmu_notifier *mn,
-				     struct mm_struct *mm, unsigned long start,
-				     unsigned long end,
-				     enum mmu_event event)
+				     struct mm_struct *mm,
+				     const struct mmu_notifier_range *range)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
@@ -246,7 +244,8 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
 	(void)atomic_dec_and_test(&gms->ms_range_active);
 
 	wake_up_all(&gms->ms_wait_queue);
-	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx\n", gms, start, end);
+	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx\n", gms,
+		range->start, range->end);
 }
 
 static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index fe9da94..db5c2cad 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -428,19 +428,17 @@ static void unmap_if_in_range(struct grant_map *map,
 
 static void mn_invl_range_start(struct mmu_notifier *mn,
 				struct mm_struct *mm,
-				unsigned long start,
-				unsigned long end,
-				enum mmu_event event)
+				const struct mmu_notifier_range *range)
 {
 	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
 	struct grant_map *map;
 
 	spin_lock(&priv->lock);
 	list_for_each_entry(map, &priv->maps, next) {
-		unmap_if_in_range(map, start, end);
+		unmap_if_in_range(map, range->start, range->end);
 	}
 	list_for_each_entry(map, &priv->freeable_maps, next) {
-		unmap_if_in_range(map, start, end);
+		unmap_if_in_range(map, range->start, range->end);
 	}
 	spin_unlock(&priv->lock);
 }
@@ -450,7 +448,12 @@ static void mn_invl_page(struct mmu_notifier *mn,
 			 unsigned long address,
 			 enum mmu_event event)
 {
-	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, event);
+	struct mmu_notifier_range range;
+
+	range.start = address;
+	range.end = address + PAGE_SIZE;
+	range.event = event;
+	mn_invl_range_start(mn, mm, &range);
 }
 
 static void mn_release(struct mmu_notifier *mn,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 8a79a74..eb9f931 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -861,6 +861,12 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 			.mm = mm,
 			.private = &cp,
 		};
+		struct mmu_notifier_range range = {
+			.start = 0,
+			.end = -1UL,
+			.event = MMU_ISDIRTY,
+		};
+
 		down_read(&mm->mmap_sem);
 		if (type == CLEAR_REFS_SOFT_DIRTY) {
 			for (vma = mm->mmap; vma; vma = vma->vm_next) {
@@ -875,8 +881,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 				downgrade_write(&mm->mmap_sem);
 				break;
 			}
-			mmu_notifier_invalidate_range_start(mm, 0,
-							    -1, MMU_ISDIRTY);
+			mmu_notifier_invalidate_range_start(mm, &range);
 		}
 		for (vma = mm->mmap; vma; vma = vma->vm_next) {
 			cp.vma = vma;
@@ -901,8 +906,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 					&clear_refs_walk);
 		}
 		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_end(mm, 0,
-							  -1, MMU_ISDIRTY);
+			mmu_notifier_invalidate_range_end(mm, &range);
 		flush_tlb_mm(mm);
 		up_read(&mm->mmap_sem);
 		mmput(mm);
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index ac2a121..d20eeb1 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -69,6 +69,13 @@ enum mmu_event {
 	MMU_WRITE_PROTECT,
 };
 
+struct mmu_notifier_range {
+	struct list_head list;
+	unsigned long start;
+	unsigned long end;
+	enum mmu_event event;
+};
+
 #ifdef CONFIG_MMU_NOTIFIER
 
 /*
@@ -82,6 +89,12 @@ struct mmu_notifier_mm {
 	struct hlist_head list;
 	/* to serialize the list modifications and hlist_unhashed */
 	spinlock_t lock;
+	/* List of all active range invalidations. */
+	struct list_head ranges;
+	/* Number of active range invalidations. */
+	int nranges;
+	/* For threads waiting on range invalidations. */
+	wait_queue_head_t wait_queue;
 };
 
 struct mmu_notifier_ops {
@@ -202,14 +215,10 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_range_start)(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start,
-				       unsigned long end,
-				       enum mmu_event event);
+				       const struct mmu_notifier_range *range);
 	void (*invalidate_range_end)(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
-				     unsigned long start,
-				     unsigned long end,
-				     enum mmu_event event);
+				     const struct mmu_notifier_range *range);
 
 	/*
 	 * invalidate_range() is either called between
@@ -279,15 +288,17 @@ extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 					  unsigned long address,
 					  enum mmu_event event);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-						  unsigned long start,
-						  unsigned long end,
-						  enum mmu_event event);
+						  struct mmu_notifier_range *range);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-						unsigned long start,
-						unsigned long end,
-						enum mmu_event event);
+						struct mmu_notifier_range *range);
 extern void __mmu_notifier_invalidate_range(struct mm_struct *mm,
 				  unsigned long start, unsigned long end);
+extern bool mmu_notifier_range_is_valid(struct mm_struct *mm,
+					unsigned long start,
+					unsigned long end);
+extern void mmu_notifier_range_wait_valid(struct mm_struct *mm,
+					  unsigned long start,
+					  unsigned long end);
 
 static inline void mmu_notifier_release(struct mm_struct *mm)
 {
@@ -330,21 +341,22 @@ static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-						       unsigned long start,
-						       unsigned long end,
-						       enum mmu_event event)
+						       struct mmu_notifier_range *range)
 {
+	/*
+	 * Initialize list no matter what in case a mmu_notifier register after
+	 * a range_start but before matching range_end.
+	 */
+	INIT_LIST_HEAD(&range->list);
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_start(mm, start, end, event);
+		__mmu_notifier_invalidate_range_start(mm, range);
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-						     unsigned long start,
-						     unsigned long end,
-						     enum mmu_event event)
+						     struct mmu_notifier_range *range)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_end(mm, start, end, event);
+		__mmu_notifier_invalidate_range_end(mm, range);
 }
 
 static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
@@ -486,16 +498,12 @@ static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-						       unsigned long start,
-						       unsigned long end,
-						       enum mmu_event event)
+						       struct mmu_notifier_range *range)
 {
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-						     unsigned long start,
-						     unsigned long end,
-						     enum mmu_event event)
+						     struct mmu_notifier_range *range)
 {
 }
 
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 802828a..b7f7f6b 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -164,9 +164,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	spinlock_t *ptl;
 	pte_t *ptep;
 	int err;
-	/* For mmu_notifiers */
-	const unsigned long mmun_start = addr;
-	const unsigned long mmun_end   = addr + PAGE_SIZE;
+	struct mmu_notifier_range range;
 	struct mem_cgroup *memcg;
 
 	err = mem_cgroup_try_charge(kpage, vma->vm_mm, GFP_KERNEL, &memcg);
@@ -176,8 +174,10 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	/* For try_to_free_swap() and munlock_vma_page() below */
 	lock_page(page);
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	range.start = addr;
+	range.end = addr + PAGE_SIZE;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
 	err = -EAGAIN;
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -211,8 +211,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	err = 0;
  unlock:
 	mem_cgroup_cancel_charge(kpage, memcg);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 	unlock_page(page);
 	return err;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 75eb651..30db47f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -987,8 +987,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	pmd_t _pmd;
 	int ret = 0, i;
 	struct page **pages;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 
 	pages = kmalloc(sizeof(struct page *) * HPAGE_PMD_NR,
 			GFP_KERNEL);
@@ -1026,10 +1025,10 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 		cond_resched();
 	}
 
-	mmun_start = haddr;
-	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
-					    MMU_MIGRATE);
+	range.start = haddr;
+	range.end = haddr + HPAGE_PMD_SIZE;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
 
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, orig_pmd)))
@@ -1063,8 +1062,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	page_remove_rmap(page);
 	spin_unlock(ptl);
 
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 
 	ret |= VM_FAULT_WRITE;
 	put_page(page);
@@ -1074,8 +1072,7 @@ out:
 
 out_free_pages:
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		memcg = (void *)page_private(pages[i]);
 		set_page_private(pages[i], 0);
@@ -1094,8 +1091,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page = NULL, *new_page;
 	struct mem_cgroup *memcg;
 	unsigned long haddr;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 
 	ptl = pmd_lockptr(mm, pmd);
 	VM_BUG_ON_VMA(!vma->anon_vma, vma);
@@ -1165,10 +1161,10 @@ alloc:
 		copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
 	__SetPageUptodate(new_page);
 
-	mmun_start = haddr;
-	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
-					    MMU_MIGRATE);
+	range.start = haddr;
+	range.end = haddr + HPAGE_PMD_SIZE;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
 
 	spin_lock(ptl);
 	if (page)
@@ -1200,8 +1196,7 @@ alloc:
 	}
 	spin_unlock(ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 out:
 	return ret;
 out_unlock:
@@ -1668,12 +1663,12 @@ static int __split_huge_page_splitting(struct page *page,
 	spinlock_t *ptl;
 	pmd_t *pmd;
 	int ret = 0;
-	/* For mmu_notifiers */
-	const unsigned long mmun_start = address;
-	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
+	struct mmu_notifier_range range;
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_HSPLIT);
+	range.start = address;
+	range.end = address + HPAGE_PMD_SIZE;
+	range.event = MMU_HSPLIT;
+	mmu_notifier_invalidate_range_start(mm, &range);
 	pmd = page_check_address_pmd(page, mm, address,
 			PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
 	if (pmd) {
@@ -1689,8 +1684,7 @@ static int __split_huge_page_splitting(struct page *page,
 		ret = 1;
 		spin_unlock(ptl);
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_HSPLIT);
+	mmu_notifier_invalidate_range_end(mm, &range);
 
 	return ret;
 }
@@ -2468,8 +2462,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	int isolated;
 	unsigned long hstart, hend;
 	struct mem_cgroup *memcg;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
@@ -2509,10 +2502,10 @@ static void collapse_huge_page(struct mm_struct *mm,
 	pte = pte_offset_map(pmd, address);
 	pte_ptl = pte_lockptr(mm, pmd);
 
-	mmun_start = address;
-	mmun_end   = address + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	range.start = address;
+	range.end = address + HPAGE_PMD_SIZE;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
 	/*
 	 * After this gup_fast can't run anymore. This also removes
@@ -2522,8 +2515,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 */
 	_pmd = pmdp_clear_flush(vma, address, pmd);
 	spin_unlock(pmd_ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 
 	spin_lock(pte_ptl);
 	isolated = __collapse_huge_page_isolate(vma, address, pte);
@@ -2906,36 +2898,32 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
 	struct page *page;
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long haddr = address & HPAGE_PMD_MASK;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 
 	BUG_ON(vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE);
 
-	mmun_start = haddr;
-	mmun_end   = haddr + HPAGE_PMD_SIZE;
+	range.start = haddr;
+	range.end = haddr + HPAGE_PMD_SIZE;
+	range.event = MMU_MIGRATE;
 again:
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_start(mm, &range);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_trans_huge(*pmd))) {
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start,
-						  mmun_end, MMU_MIGRATE);
+		mmu_notifier_invalidate_range_end(mm, &range);
 		return;
 	}
 	if (is_huge_zero_pmd(*pmd)) {
 		__split_huge_zero_page_pmd(vma, haddr, pmd);
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start,
-						  mmun_end, MMU_MIGRATE);
+		mmu_notifier_invalidate_range_end(mm, &range);
 		return;
 	}
 	page = pmd_page(*pmd);
 	VM_BUG_ON_PAGE(!page_count(page), page);
 	get_page(page);
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 
 	split_huge_page(page);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b4770c4..a19abdb 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2551,17 +2551,16 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	int cow;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 	int ret = 0;
 
 	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
 
-	mmun_start = vma->vm_start;
-	mmun_end = vma->vm_end;
+	range.start = vma->vm_start;
+	range.end = vma->vm_end;
+	range.event = MMU_MIGRATE;
 	if (cow)
-		mmu_notifier_invalidate_range_start(src, mmun_start,
-						    mmun_end, MMU_MIGRATE);
+		mmu_notifier_invalidate_range_start(src, &range);
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
 		spinlock_t *src_ptl, *dst_ptl;
@@ -2601,8 +2600,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		} else {
 			if (cow) {
 				huge_ptep_set_wrprotect(src, addr, src_pte);
-				mmu_notifier_invalidate_range(src, mmun_start,
-								   mmun_end);
+				mmu_notifier_invalidate_range(src, range.start,
+								   range.end);
 			}
 			entry = huge_ptep_get(src_pte);
 			ptepage = pte_page(entry);
@@ -2615,8 +2614,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	}
 
 	if (cow)
-		mmu_notifier_invalidate_range_end(src, mmun_start,
-						  mmun_end, MMU_MIGRATE);
+		mmu_notifier_invalidate_range_end(src, &range);
 
 	return ret;
 }
@@ -2634,16 +2632,17 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	struct page *page;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
-	const unsigned long mmun_start = start;	/* For mmu_notifiers */
-	const unsigned long mmun_end   = end;	/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 
 	WARN_ON(!is_vm_hugetlb_page(vma));
 	BUG_ON(start & ~huge_page_mask(h));
 	BUG_ON(end & ~huge_page_mask(h));
 
+	range.start = start;
+	range.end = end;
+	range.event = MMU_MIGRATE;
 	tlb_start_vma(tlb, vma);
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_start(mm, &range);
 	address = start;
 again:
 	for (; address < end; address += sz) {
@@ -2716,8 +2715,7 @@ unlock:
 		if (address < end && !ref_page)
 			goto again;
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 	tlb_end_vma(tlb, vma);
 }
 
@@ -2814,8 +2812,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct hstate *h = hstate_vma(vma);
 	struct page *old_page, *new_page;
 	int ret = 0, outside_reserve = 0;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 
 	old_page = pte_page(pte);
 
@@ -2893,10 +2890,11 @@ retry_avoidcopy:
 			    pages_per_huge_page(h));
 	__SetPageUptodate(new_page);
 
-	mmun_start = address & huge_page_mask(h);
-	mmun_end = mmun_start + huge_page_size(h);
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
-					    MMU_MIGRATE);
+	range.start = address & huge_page_mask(h);
+	range.end = range.start + huge_page_size(h);
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
+
 	/*
 	 * Retake the page table lock to check for racing updates
 	 * before the page tables are altered
@@ -2908,7 +2906,7 @@ retry_avoidcopy:
 
 		/* Break COW */
 		huge_ptep_clear_flush(vma, address, ptep);
-		mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range(mm, range.start, range.end);
 		set_huge_pte_at(mm, address, ptep,
 				make_huge_pte(vma, new_page, 1));
 		page_remove_rmap(old_page);
@@ -2917,8 +2915,7 @@ retry_avoidcopy:
 		new_page = old_page;
 	}
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
-					  MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 out_release_all:
 	page_cache_release(new_page);
 out_release_old:
@@ -3352,11 +3349,15 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	pte_t pte;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long pages = 0;
+	struct mmu_notifier_range range;
 
 	BUG_ON(address >= end);
 	flush_cache_range(vma, address, end);
 
-	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MPROT);
+	range.start = start;
+	range.end = end;
+	range.event = MMU_MPROT;
+	mmu_notifier_invalidate_range_start(mm, &range);
 	i_mmap_lock_write(vma->vm_file->f_mapping);
 	for (; address < end; address += huge_page_size(h)) {
 		spinlock_t *ptl;
@@ -3387,7 +3388,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	flush_tlb_range(vma, start, end);
 	mmu_notifier_invalidate_range(mm, start, end);
 	i_mmap_unlock_write(vma->vm_file->f_mapping);
-	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MPROT);
+	mmu_notifier_invalidate_range_end(mm, &range);
 
 	return pages << h->order;
 }
diff --git a/mm/ksm.c b/mm/ksm.c
index 8c3a892..3667d98 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -855,14 +855,13 @@ static inline int pages_identical(struct page *page1, struct page *page2)
 static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 			      pte_t *orig_pte)
 {
+	struct mmu_notifier_range range;
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long addr;
 	pte_t *ptep;
 	spinlock_t *ptl;
 	int swapped;
 	int err = -EFAULT;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
 
 	addr = page_address_in_vma(page, vma);
 	if (addr == -EFAULT)
@@ -870,10 +869,10 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 
 	BUG_ON(PageTransCompound(page));
 
-	mmun_start = addr;
-	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
-					    MMU_WRITE_PROTECT);
+	range.start = addr;
+	range.end = addr + PAGE_SIZE;
+	range.event = MMU_WRITE_PROTECT;
+	mmu_notifier_invalidate_range_start(mm, &range);
 
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -913,8 +912,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 out_unlock:
 	pte_unmap_unlock(ptep, ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
-					  MMU_WRITE_PROTECT);
+	mmu_notifier_invalidate_range_end(mm, &range);
 out:
 	return err;
 }
@@ -937,8 +935,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	spinlock_t *ptl;
 	unsigned long addr;
 	int err = -EFAULT;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 
 	addr = page_address_in_vma(page, vma);
 	if (addr == -EFAULT)
@@ -948,10 +945,10 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	if (!pmd)
 		goto out;
 
-	mmun_start = addr;
-	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
-					    MMU_MIGRATE);
+	range.start = addr;
+	range.end = addr + PAGE_SIZE;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
 
 	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	if (!pte_same(*ptep, orig_pte)) {
@@ -976,8 +973,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	pte_unmap_unlock(ptep, ptl);
 	err = 0;
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
-					  MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 out:
 	return err;
 }
diff --git a/mm/madvise.c b/mm/madvise.c
index d7ac37a..9e91bcf 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -357,6 +357,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
 	unsigned long start, end;
 	struct mm_struct *mm = vma->vm_mm;
 	struct mmu_gather tlb;
+	struct mmu_notifier_range range;
 
 	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
 		return -EINVAL;
@@ -376,9 +377,12 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
 	tlb_gather_mmu(&tlb, mm, start, end);
 	update_hiwater_rss(mm);
 
-	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MUNMAP);
+	range.start = start;
+	range.end = end;
+	range.event = MMU_MUNMAP;
+	mmu_notifier_invalidate_range_start(mm, &range);
 	madvise_free_page_range(&tlb, vma, start, end);
-	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MUNMAP);
+	mmu_notifier_invalidate_range_end(mm, &range);
 	tlb_finish_mmu(&tlb, start, end);
 
 	return 0;
diff --git a/mm/memory.c b/mm/memory.c
index ffca25f..0d93542 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1009,8 +1009,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	unsigned long next;
 	unsigned long addr = vma->vm_start;
 	unsigned long end = vma->vm_end;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 	bool is_cow;
 	int ret;
 
@@ -1046,11 +1045,11 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 * is_cow_mapping() returns true.
 	 */
 	is_cow = is_cow_mapping(vma->vm_flags);
-	mmun_start = addr;
-	mmun_end   = end;
+	range.start = addr;
+	range.end = end;
+	range.event = MMU_MIGRATE;
 	if (is_cow)
-		mmu_notifier_invalidate_range_start(src_mm, mmun_start,
-						    mmun_end, MMU_MIGRATE);
+		mmu_notifier_invalidate_range_start(src_mm, &range);
 
 	ret = 0;
 	dst_pgd = pgd_offset(dst_mm, addr);
@@ -1067,8 +1066,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
 
 	if (is_cow)
-		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end,
-						  MMU_MIGRATE);
+		mmu_notifier_invalidate_range_end(src_mm, &range);
 	return ret;
 }
 
@@ -1360,13 +1358,16 @@ void unmap_vmas(struct mmu_gather *tlb,
 		unsigned long end_addr)
 {
 	struct mm_struct *mm = vma->vm_mm;
+	struct mmu_notifier_range range = {
+		.start = start_addr,
+		.end = end_addr,
+		.event = MMU_MUNMAP,
+	};
 
-	mmu_notifier_invalidate_range_start(mm, start_addr,
-					    end_addr, MMU_MUNMAP);
+	mmu_notifier_invalidate_range_start(mm, &range);
 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
 		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
-	mmu_notifier_invalidate_range_end(mm, start_addr,
-					  end_addr, MMU_MUNMAP);
+	mmu_notifier_invalidate_range_end(mm, &range);
 }
 
 /**
@@ -1383,16 +1384,20 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct mmu_gather tlb;
-	unsigned long end = start + size;
+	struct mmu_notifier_range range = {
+		.start = start,
+		.end = start + size,
+		.event = MMU_MIGRATE,
+	};
 
 	lru_add_drain();
-	tlb_gather_mmu(&tlb, mm, start, end);
+	tlb_gather_mmu(&tlb, mm, start, range.end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MIGRATE);
-	for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
-		unmap_single_vma(&tlb, vma, start, end, details);
-	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MIGRATE);
-	tlb_finish_mmu(&tlb, start, end);
+	mmu_notifier_invalidate_range_start(mm, &range);
+	for ( ; vma && vma->vm_start < range.end; vma = vma->vm_next)
+		unmap_single_vma(&tlb, vma, start, range.end, details);
+	mmu_notifier_invalidate_range_end(mm, &range);
+	tlb_finish_mmu(&tlb, start, range.end);
 }
 
 /**
@@ -1409,15 +1414,19 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct mmu_gather tlb;
-	unsigned long end = address + size;
+	struct mmu_notifier_range range = {
+		.start = address,
+		.end = address + size,
+		.event = MMU_MUNMAP,
+	};
 
 	lru_add_drain();
-	tlb_gather_mmu(&tlb, mm, address, end);
+	tlb_gather_mmu(&tlb, mm, address, range.end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
-	unmap_single_vma(&tlb, vma, address, end, details);
-	mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
-	tlb_finish_mmu(&tlb, address, end);
+	mmu_notifier_invalidate_range_start(mm, &range);
+	unmap_single_vma(&tlb, vma, address, range.end, details);
+	mmu_notifier_invalidate_range_end(mm, &range);
+	tlb_finish_mmu(&tlb, address, range.end);
 }
 
 /**
@@ -2037,10 +2046,12 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	int ret = 0;
 	int page_mkwrite = 0;
 	struct page *dirty_page = NULL;
-	unsigned long mmun_start = 0;	/* For mmu_notifiers */
-	unsigned long mmun_end = 0;	/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 	struct mem_cgroup *memcg;
 
+	range.start = 0;
+	range.end = 0;
+
 	old_page = vm_normal_page(vma, address, orig_pte);
 	if (!old_page) {
 		/*
@@ -2199,10 +2210,10 @@ gotten:
 	if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg))
 		goto oom_free_new;
 
-	mmun_start  = address & PAGE_MASK;
-	mmun_end    = mmun_start + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	range.start = address & PAGE_MASK;
+	range.end = range.start + PAGE_SIZE;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
 
 	/*
 	 * Re-check the pte - we dropped the lock
@@ -2272,9 +2283,8 @@ gotten:
 		page_cache_release(new_page);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
-	if (mmun_end > mmun_start)
-		mmu_notifier_invalidate_range_end(mm, mmun_start,
-						  mmun_end, MMU_MIGRATE);
+	if (range.end > range.start)
+		mmu_notifier_invalidate_range_end(mm, &range);
 	if (old_page) {
 		/*
 		 * Don't let another task, with possibly unlocked vma,
diff --git a/mm/migrate.c b/mm/migrate.c
index 254d5bf..e9858e4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1763,10 +1763,13 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	int isolated = 0;
 	struct page *new_page = NULL;
 	int page_lru = page_is_file_cache(page);
-	unsigned long mmun_start = address & HPAGE_PMD_MASK;
-	unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
+	struct mmu_notifier_range range;
 	pmd_t orig_entry;
 
+	range.start = address & HPAGE_PMD_MASK;
+	range.end = range.start + HPAGE_PMD_SIZE;
+	range.event = MMU_MIGRATE;
+
 	/*
 	 * Rate-limit the amount of data that is being migrated to a node.
 	 * Optimal placement is no good if the memory bus is saturated and
@@ -1788,7 +1791,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	}
 
 	if (mm_tlb_flush_pending(mm))
-		flush_tlb_range(vma, mmun_start, mmun_end);
+		flush_tlb_range(vma, range.start, range.end);
 
 	/* Prepare a page as a migration target */
 	__set_page_locked(new_page);
@@ -1801,14 +1804,12 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	WARN_ON(PageLRU(new_page));
 
 	/* Recheck the target PMD */
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_start(mm, &range);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
 fail_putback:
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start,
-						  mmun_end, MMU_MIGRATE);
+		mmu_notifier_invalidate_range_end(mm, &range);
 
 		/* Reverse changes made by migrate_page_copy() */
 		if (TestClearPageActive(new_page))
@@ -1841,17 +1842,17 @@ fail_putback:
 	 * The SetPageUptodate on the new page and page_add_new_anon_rmap
 	 * guarantee the copy is visible before the pagetable update.
 	 */
-	flush_cache_range(vma, mmun_start, mmun_end);
-	page_add_anon_rmap(new_page, vma, mmun_start);
-	pmdp_clear_flush_notify(vma, mmun_start, pmd);
-	set_pmd_at(mm, mmun_start, pmd, entry);
-	flush_tlb_range(vma, mmun_start, mmun_end);
+	flush_cache_range(vma, range.start, range.end);
+	page_add_anon_rmap(new_page, vma, range.start);
+	pmdp_clear_flush_notify(vma, range.start, pmd);
+	set_pmd_at(mm, range.start, pmd, entry);
+	flush_tlb_range(vma, range.start, range.end);
 	update_mmu_cache_pmd(vma, address, &entry);
 
 	if (page_count(page) != 2) {
-		set_pmd_at(mm, mmun_start, pmd, orig_entry);
-		flush_tlb_range(vma, mmun_start, mmun_end);
-		mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
+		set_pmd_at(mm, range.start, pmd, orig_entry);
+		flush_tlb_range(vma, range.start, range.end);
+		mmu_notifier_invalidate_range(mm, range.start, range.end);
 		update_mmu_cache_pmd(vma, address, &entry);
 		page_remove_rmap(new_page);
 		goto fail_putback;
@@ -1862,8 +1863,7 @@ fail_putback:
 	page_remove_rmap(page);
 
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 
 	/* Take an "isolate" reference and put new page on the LRU. */
 	get_page(new_page);
@@ -1888,7 +1888,7 @@ out_dropref:
 	ptl = pmd_lock(mm, pmd);
 	if (pmd_same(*pmd, entry)) {
 		entry = pmd_mknonnuma(entry);
-		set_pmd_at(mm, mmun_start, pmd, entry);
+		set_pmd_at(mm, range.start, pmd, entry);
 		update_mmu_cache_pmd(vma, address, &entry);
 	}
 	spin_unlock(ptl);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index e51ea02..8f6f994 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -174,9 +174,7 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 }
 
 void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-					   unsigned long start,
-					   unsigned long end,
-					   enum mmu_event event)
+					   struct mmu_notifier_range *range)
 
 {
 	struct mmu_notifier *mn;
@@ -185,21 +183,36 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_start)
-			mn->ops->invalidate_range_start(mn, mm, start,
-							end, event);
+			mn->ops->invalidate_range_start(mn, mm, range);
 	}
 	srcu_read_unlock(&srcu, id);
+
+	/*
+	 * This must happen after the callback so that subsystem can block on
+	 * new invalidation range to synchronize itself.
+	 */
+	spin_lock(&mm->mmu_notifier_mm->lock);
+	list_add_tail(&range->list, &mm->mmu_notifier_mm->ranges);
+	mm->mmu_notifier_mm->nranges++;
+	spin_unlock(&mm->mmu_notifier_mm->lock);
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
 
 void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-					 unsigned long start,
-					 unsigned long end,
-					 enum mmu_event event)
+					 struct mmu_notifier_range *range)
 {
 	struct mmu_notifier *mn;
 	int id;
 
+	/*
+	 * This must happen before the callback so that subsystem can unblock
+	 * when the range invalidation ends.
+	 */
+	spin_lock(&mm->mmu_notifier_mm->lock);
+	list_del_init(&range->list);
+	mm->mmu_notifier_mm->nranges--;
+	spin_unlock(&mm->mmu_notifier_mm->lock);
+
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		/*
@@ -211,12 +224,18 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 		 * (besides the pointer check).
 		 */
 		if (mn->ops->invalidate_range)
-			mn->ops->invalidate_range(mn, mm, start, end);
+			mn->ops->invalidate_range(mn, mm,
+						  range->start, range->end);
 		if (mn->ops->invalidate_range_end)
-			mn->ops->invalidate_range_end(mn, mm, start,
-						      end, event);
+			mn->ops->invalidate_range_end(mn, mm, range);
 	}
 	srcu_read_unlock(&srcu, id);
+
+	/*
+	 * Wakeup after callback so they can do their job before any of the
+	 * waiters resume.
+	 */
+	wake_up(&mm->mmu_notifier_mm->wait_queue);
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_end);
 
@@ -235,6 +254,49 @@ void __mmu_notifier_invalidate_range(struct mm_struct *mm,
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range);
 
+static bool mmu_notifier_range_is_valid_locked(struct mm_struct *mm,
+					       unsigned long start,
+					       unsigned long end)
+{
+	struct mmu_notifier_range *range;
+
+	list_for_each_entry(range, &mm->mmu_notifier_mm->ranges, list) {
+		if (!(range->end <= start || range->start >= end))
+			return false;
+	}
+	return true;
+}
+
+bool mmu_notifier_range_is_valid(struct mm_struct *mm,
+				 unsigned long start,
+				 unsigned long end)
+{
+	bool valid;
+
+	spin_lock(&mm->mmu_notifier_mm->lock);
+	valid = mmu_notifier_range_is_valid_locked(mm, start, end);
+	spin_unlock(&mm->mmu_notifier_mm->lock);
+	return valid;
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_range_is_valid);
+
+void mmu_notifier_range_wait_valid(struct mm_struct *mm,
+				   unsigned long start,
+				   unsigned long end)
+{
+	spin_lock(&mm->mmu_notifier_mm->lock);
+	while (!mmu_notifier_range_is_valid_locked(mm, start, end)) {
+		int nranges = mm->mmu_notifier_mm->nranges;
+
+		spin_unlock(&mm->mmu_notifier_mm->lock);
+		wait_event(mm->mmu_notifier_mm->wait_queue,
+			   nranges != mm->mmu_notifier_mm->nranges);
+		spin_lock(&mm->mmu_notifier_mm->lock);
+	}
+	spin_unlock(&mm->mmu_notifier_mm->lock);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_range_wait_valid);
+
 static int do_mmu_notifier_register(struct mmu_notifier *mn,
 				    struct mm_struct *mm,
 				    int take_mmap_sem)
@@ -264,6 +326,9 @@ static int do_mmu_notifier_register(struct mmu_notifier *mn,
 	if (!mm_has_notifiers(mm)) {
 		INIT_HLIST_HEAD(&mmu_notifier_mm->list);
 		spin_lock_init(&mmu_notifier_mm->lock);
+		INIT_LIST_HEAD(&mmu_notifier_mm->ranges);
+		mmu_notifier_mm->nranges = 0;
+		init_waitqueue_head(&mmu_notifier_mm->wait_queue);
 
 		mm->mmu_notifier_mm = mmu_notifier_mm;
 		mmu_notifier_mm = NULL;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 0f5dbfe..c88f770 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -139,7 +139,9 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	unsigned long next;
 	unsigned long pages = 0;
 	unsigned long nr_huge_updates = 0;
-	unsigned long mni_start = 0;
+	struct mmu_notifier_range range = {
+		.start = 0,
+	};
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -150,10 +152,11 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 			continue;
 
 		/* invoke the mmu notifier if the pmd is populated */
-		if (!mni_start) {
-			mni_start = addr;
-			mmu_notifier_invalidate_range_start(mm, mni_start,
-							    end, MMU_MPROT);
+		if (!range.start) {
+			range.start = addr;
+			range.end = end;
+			range.event = MMU_MPROT;
+			mmu_notifier_invalidate_range_start(mm, &range);
 		}
 
 		if (pmd_trans_huge(*pmd)) {
@@ -180,9 +183,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		pages += this_pages;
 	} while (pmd++, addr = next, addr != end);
 
-	if (mni_start)
-		mmu_notifier_invalidate_range_end(mm, mni_start, end,
-						  MMU_MPROT);
+	if (range.start)
+		mmu_notifier_invalidate_range_end(mm, &range);
 
 	if (nr_huge_updates)
 		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
diff --git a/mm/mremap.c b/mm/mremap.c
index 1ede220..5556f51 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -167,18 +167,17 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 		bool need_rmap_locks)
 {
 	unsigned long extent, next, old_end;
+	struct mmu_notifier_range range;
 	pmd_t *old_pmd, *new_pmd;
 	bool need_flush = false;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
 
 	old_end = old_addr + len;
 	flush_cache_range(vma, old_addr, old_end);
 
-	mmun_start = old_addr;
-	mmun_end   = old_end;
-	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	range.start = old_addr;
+	range.end = old_end;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(vma->vm_mm, &range);
 
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
 		cond_resched();
@@ -230,8 +229,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	if (likely(need_flush))
 		flush_tlb_range(vma, old_end-len, old_addr);
 
-	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(vma->vm_mm, &range);
 
 	return len + old_addr - old_end;	/* how much done */
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index 1d96644..611a640 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1361,15 +1361,14 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 	spinlock_t *ptl;
 	struct page *page;
 	unsigned long address;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 	unsigned long end;
 	int ret = SWAP_AGAIN;
 	int locked_vma = 0;
-	enum mmu_event event = MMU_MIGRATE;
 
+	range.event = MMU_MIGRATE;
 	if (flags & TTU_MUNLOCK)
-		event = MMU_MUNLOCK;
+		range.event = MMU_MUNLOCK;
 
 	address = (vma->vm_start + cursor) & CLUSTER_MASK;
 	end = address + CLUSTER_SIZE;
@@ -1382,9 +1381,9 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 	if (!pmd)
 		return ret;
 
-	mmun_start = address;
-	mmun_end   = end;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, event);
+	range.start = address;
+	range.end = end;
+	mmu_notifier_invalidate_range_start(mm, &range);
 
 	/*
 	 * If we can acquire the mmap_sem for read, and vma is VM_LOCKED,
@@ -1453,7 +1452,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 		(*mapcount)--;
 	}
 	pte_unmap_unlock(pte - 1, ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, event);
+	mmu_notifier_invalidate_range_end(mm, &range);
 	if (locked_vma)
 		up_read(&vma->vm_mm->mmap_sem);
 	return ret;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index be2f937..684af9e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -313,9 +313,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 
 static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
-						    unsigned long start,
-						    unsigned long end,
-						    enum mmu_event event)
+						    const struct mmu_notifier_range *range)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush = 0, idx;
@@ -328,7 +326,7 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	 * count is also read inside the mmu_lock critical section.
 	 */
 	kvm->mmu_notifier_count++;
-	need_tlb_flush = kvm_unmap_hva_range(kvm, start, end);
+	need_tlb_flush = kvm_unmap_hva_range(kvm, range->start, range->end);
 	need_tlb_flush |= kvm->tlbs_dirty;
 	/* we've to flush the tlb before the pages can be freed */
 	if (need_tlb_flush)
@@ -340,9 +338,7 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 						  struct mm_struct *mm,
-						  unsigned long start,
-						  unsigned long end,
-						  enum mmu_event event)
+						  const struct mmu_notifier_range *range)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 
-- 
1.9.3



* [PATCH 2/7] mmu_notifier: keep track of active invalidation ranges v2
@ 2014-12-22 16:48   ` j.glisse
  0 siblings, 0 replies; 21+ messages in thread
From: j.glisse @ 2014-12-22 16:48 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Shachar Raindel,
	Liran Liss, Roland Dreier, Ben Sander, Greg Stoner,
	John Bridgman, Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse

From: Jérôme Glisse <jglisse@redhat.com>

mmu_notifier_invalidate_range_start() and mmu_notifier_invalidate_range_end()
can be considered as forming an "atomic" section from the point of view of cpu
page table updates. Between these two functions, the cpu page table content is
unreliable for the address range being invalidated.

Current users such as kvm need to know when they can trust the content of the
cpu page table. This becomes even more important for new users of the
mmu_notifier api (such as HMM or ODP).

This patch uses a structure, defined at every call site of
invalidate_range_start(), that is added to a list for the duration of the
invalidation. It adds two new helpers that allow querying whether a range is
being invalidated, or waiting for a range to become valid.
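
As a rough illustration of the tracking scheme described above, here is a
minimal userspace sketch, not the kernel implementation. The names struct range,
struct tracker, tracker_init(), range_start(), range_end() and range_is_valid()
are hypothetical stand-ins for struct mmu_notifier_range, the per-mm notifier
state and the new helpers. It assumes ranges are half-open (end is exclusive)
and leaves out all locking as well as the waiting helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct mmu_notifier_range. */
struct range {
	struct range *prev, *next;	/* links in the list of active ranges */
	unsigned long start;		/* first address of the range */
	unsigned long end;		/* one past the last address (exclusive) */
};

/* Hypothetical stand-in for the per-mm tracking state. */
struct tracker {
	struct range head;	/* sentinel of a circular doubly linked list */
	int nranges;		/* number of active range invalidations */
};

static void tracker_init(struct tracker *t)
{
	t->head.prev = t->head.next = &t->head;
	t->nranges = 0;
}

/* Models invalidate_range_start(): publish the range as being invalidated. */
static void range_start(struct tracker *t, struct range *r)
{
	assert(r->start < r->end);
	r->prev = t->head.prev;
	r->next = &t->head;
	t->head.prev->next = r;
	t->head.prev = r;
	t->nranges++;
}

/* Models invalidate_range_end(): drop the range from the active list. */
static void range_end(struct tracker *t, struct range *r)
{
	r->prev->next = r->next;
	r->next->prev = r->prev;
	r->prev = r->next = r;
	t->nranges--;
}

/* Returns true if no active invalidation overlaps [start, end). */
static bool range_is_valid(const struct tracker *t,
			   unsigned long start, unsigned long end)
{
	const struct range *r;

	for (r = t->head.next; r != &t->head; r = r->next) {
		if (r->start < end && start < r->end)
			return false;
	}
	return true;
}
```

In this sketch a caller fills in a struct range on its stack, calls
range_start(), updates its page table, then calls range_end(). Any query made
through range_is_valid() for an overlapping address range returns false in
between. The real helpers also need the synchronization described below.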

For proper synchronization, users must block new range invalidations from
inside their invalidate_range_start() callback before calling the helper
functions. Otherwise there is no guarantee that a new range invalidation will
not be added after the helper function has queried the existing ranges.

Changed since v1:
  - Fix a possible deadlock in mmu_notifier_range_wait_valid()

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 drivers/gpu/drm/i915/i915_gem_userptr.c |  9 ++--
 drivers/gpu/drm/radeon/radeon_mn.c      | 14 ++----
 drivers/infiniband/core/umem_odp.c      | 16 +++---
 drivers/misc/sgi-gru/grutlbpurge.c      | 15 +++---
 drivers/xen/gntdev.c                    | 15 +++---
 fs/proc/task_mmu.c                      | 12 +++--
 include/linux/mmu_notifier.h            | 60 +++++++++++++----------
 kernel/events/uprobes.c                 | 13 +++--
 mm/huge_memory.c                        | 78 +++++++++++++----------------
 mm/hugetlb.c                            | 55 +++++++++++----------
 mm/ksm.c                                | 28 +++++------
 mm/madvise.c                            |  8 ++-
 mm/memory.c                             | 78 ++++++++++++++++-------------
 mm/migrate.c                            | 36 +++++++-------
 mm/mmu_notifier.c                       | 87 ++++++++++++++++++++++++++++-----
 mm/mprotect.c                           | 18 ++++---
 mm/mremap.c                             | 14 +++---
 mm/rmap.c                               | 15 +++---
 virt/kvm/kvm_main.c                     | 10 ++--
 19 files changed, 322 insertions(+), 259 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 20dbd26..a78eede 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -128,16 +128,15 @@ restart:
 
 static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 						       struct mm_struct *mm,
-						       unsigned long start,
-						       unsigned long end,
-						       enum mmu_event event)
+						       const struct mmu_notifier_range *range)
 {
 	struct i915_mmu_notifier *mn = container_of(_mn, struct i915_mmu_notifier, mn);
 	struct interval_tree_node *it = NULL;
-	unsigned long next = start;
+	unsigned long next = range->start;
 	unsigned long serial = 0;
+	/* interval ranges are inclusive, but invalidate range is exclusive */
+	unsigned long end = range->end - 1, start = range->start;
 
-	end--; /* interval ranges are inclusive, but invalidate range is exclusive */
 	while (next < end) {
 		struct drm_i915_gem_object *obj = NULL;
 
diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c
index daf53d3..63e6936 100644
--- a/drivers/gpu/drm/radeon/radeon_mn.c
+++ b/drivers/gpu/drm/radeon/radeon_mn.c
@@ -100,34 +100,30 @@ static void radeon_mn_release(struct mmu_notifier *mn,
  *
  * @mn: our notifier
  * @mn: the mm this callback is about
- * @start: start of updated range
- * @end: end of updated range
+ * @range: Address range information.
  *
  * We block for all BOs between start and end to be idle and
  * unmap them by move them into system domain again.
  */
 static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
-					     unsigned long start,
-					     unsigned long end,
-					     enum mmu_event event)
+					     const struct mmu_notifier_range *range)
 {
 	struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
 	struct interval_tree_node *it;
-
 	/* notification is exclusive, but interval is inclusive */
-	end -= 1;
+	unsigned long end = range->end - 1;
 
 	mutex_lock(&rmn->lock);
 
-	it = interval_tree_iter_first(&rmn->objects, start, end);
+	it = interval_tree_iter_first(&rmn->objects, range->start, end);
 	while (it) {
 		struct radeon_bo *bo;
 		struct fence *fence;
 		int r;
 
 		bo = container_of(it, struct radeon_bo, mn_it);
-		it = interval_tree_iter_next(it, start, end);
+		it = interval_tree_iter_next(it, range->start, end);
 
 		r = radeon_bo_reserve(bo, true);
 		if (r) {
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index bc36e8c..097f1d1 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -192,9 +192,7 @@ static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
 
 static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
-						    unsigned long start,
-						    unsigned long end,
-						    enum mmu_event event)
+						    const struct mmu_notifier_range *range)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
 
@@ -203,8 +201,8 @@ static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
 
 	ib_ucontext_notifier_start_account(context);
 	down_read(&context->umem_rwsem);
-	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
-				      end,
+	rbt_ib_umem_for_each_in_range(&context->umem_tree, range->start,
+				      range->end,
 				      invalidate_range_start_trampoline, NULL);
 	up_read(&context->umem_rwsem);
 }
@@ -218,9 +216,7 @@ static int invalidate_range_end_trampoline(struct ib_umem *item, u64 start,
 
 static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
 						  struct mm_struct *mm,
-						  unsigned long start,
-						  unsigned long end,
-						  enum mmu_event event)
+						  const struct mmu_notifier_range *range)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
 
@@ -228,8 +224,8 @@ static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
 		return;
 
 	down_read(&context->umem_rwsem);
-	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
-				      end,
+	rbt_ib_umem_for_each_in_range(&context->umem_tree, range->start,
+				      range->end,
 				      invalidate_range_end_trampoline, NULL);
 	up_read(&context->umem_rwsem);
 	ib_ucontext_notifier_end_account(context);
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index e67fed1..44b41b7 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -221,8 +221,7 @@ void gru_flush_all_tlb(struct gru_state *gru)
  */
 static void gru_invalidate_range_start(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end,
-				       enum mmu_event event)
+				       const struct mmu_notifier_range *range)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
@@ -230,14 +229,13 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
 	STAT(mmu_invalidate_range);
 	atomic_inc(&gms->ms_range_active);
 	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx, act %d\n", gms,
-		start, end, atomic_read(&gms->ms_range_active));
-	gru_flush_tlb_range(gms, start, end - start);
+		range->start, range->end, atomic_read(&gms->ms_range_active));
+	gru_flush_tlb_range(gms, range->start, range->end - range->start);
 }
 
 static void gru_invalidate_range_end(struct mmu_notifier *mn,
-				     struct mm_struct *mm, unsigned long start,
-				     unsigned long end,
-				     enum mmu_event event)
+				     struct mm_struct *mm,
+				     const struct mmu_notifier_range *range)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
@@ -246,7 +244,8 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
 	(void)atomic_dec_and_test(&gms->ms_range_active);
 
 	wake_up_all(&gms->ms_wait_queue);
-	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx\n", gms, start, end);
+	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx\n", gms,
+		range->start, range->end);
 }
 
 static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index fe9da94..db5c2cad 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -428,19 +428,17 @@ static void unmap_if_in_range(struct grant_map *map,
 
 static void mn_invl_range_start(struct mmu_notifier *mn,
 				struct mm_struct *mm,
-				unsigned long start,
-				unsigned long end,
-				enum mmu_event event)
+				const struct mmu_notifier_range *range)
 {
 	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
 	struct grant_map *map;
 
 	spin_lock(&priv->lock);
 	list_for_each_entry(map, &priv->maps, next) {
-		unmap_if_in_range(map, start, end);
+		unmap_if_in_range(map, range->start, range->end);
 	}
 	list_for_each_entry(map, &priv->freeable_maps, next) {
-		unmap_if_in_range(map, start, end);
+		unmap_if_in_range(map, range->start, range->end);
 	}
 	spin_unlock(&priv->lock);
 }
@@ -450,7 +448,12 @@ static void mn_invl_page(struct mmu_notifier *mn,
 			 unsigned long address,
 			 enum mmu_event event)
 {
-	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, event);
+	struct mmu_notifier_range range;
+
+	range.start = address;
+	range.end = address + PAGE_SIZE;
+	range.event = event;
+	mn_invl_range_start(mn, mm, &range);
 }
 
 static void mn_release(struct mmu_notifier *mn,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 8a79a74..eb9f931 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -861,6 +861,12 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 			.mm = mm,
 			.private = &cp,
 		};
+		struct mmu_notifier_range range = {
+			.start = 0,
+			.end = -1UL,
+			.event = MMU_ISDIRTY,
+		};
+
 		down_read(&mm->mmap_sem);
 		if (type == CLEAR_REFS_SOFT_DIRTY) {
 			for (vma = mm->mmap; vma; vma = vma->vm_next) {
@@ -875,8 +881,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 				downgrade_write(&mm->mmap_sem);
 				break;
 			}
-			mmu_notifier_invalidate_range_start(mm, 0,
-							    -1, MMU_ISDIRTY);
+			mmu_notifier_invalidate_range_start(mm, &range);
 		}
 		for (vma = mm->mmap; vma; vma = vma->vm_next) {
 			cp.vma = vma;
@@ -901,8 +906,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 					&clear_refs_walk);
 		}
 		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_end(mm, 0,
-							  -1, MMU_ISDIRTY);
+			mmu_notifier_invalidate_range_end(mm, &range);
 		flush_tlb_mm(mm);
 		up_read(&mm->mmap_sem);
 		mmput(mm);
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index ac2a121..d20eeb1 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -69,6 +69,13 @@ enum mmu_event {
 	MMU_WRITE_PROTECT,
 };
 
+struct mmu_notifier_range {
+	struct list_head list;
+	unsigned long start;
+	unsigned long end;
+	enum mmu_event event;
+};
+
 #ifdef CONFIG_MMU_NOTIFIER
 
 /*
@@ -82,6 +89,12 @@ struct mmu_notifier_mm {
 	struct hlist_head list;
 	/* to serialize the list modifications and hlist_unhashed */
 	spinlock_t lock;
+	/* List of all active range invalidations. */
+	struct list_head ranges;
+	/* Number of active range invalidations. */
+	int nranges;
+	/* For threads waiting on range invalidations. */
+	wait_queue_head_t wait_queue;
 };
 
 struct mmu_notifier_ops {
@@ -202,14 +215,10 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_range_start)(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start,
-				       unsigned long end,
-				       enum mmu_event event);
+				       const struct mmu_notifier_range *range);
 	void (*invalidate_range_end)(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
-				     unsigned long start,
-				     unsigned long end,
-				     enum mmu_event event);
+				     const struct mmu_notifier_range *range);
 
 	/*
 	 * invalidate_range() is either called between
@@ -279,15 +288,17 @@ extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 					  unsigned long address,
 					  enum mmu_event event);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-						  unsigned long start,
-						  unsigned long end,
-						  enum mmu_event event);
+						  struct mmu_notifier_range *range);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-						unsigned long start,
-						unsigned long end,
-						enum mmu_event event);
+						struct mmu_notifier_range *range);
 extern void __mmu_notifier_invalidate_range(struct mm_struct *mm,
 				  unsigned long start, unsigned long end);
+extern bool mmu_notifier_range_is_valid(struct mm_struct *mm,
+					unsigned long start,
+					unsigned long end);
+extern void mmu_notifier_range_wait_valid(struct mm_struct *mm,
+					  unsigned long start,
+					  unsigned long end);
 
 static inline void mmu_notifier_release(struct mm_struct *mm)
 {
@@ -330,21 +341,22 @@ static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-						       unsigned long start,
-						       unsigned long end,
-						       enum mmu_event event)
+						       struct mmu_notifier_range *range)
 {
+	/*
+	 * Initialize list no matter what in case a mmu_notifier register after
+	 * a range_start but before matching range_end.
+	 */
+	INIT_LIST_HEAD(&range->list);
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_start(mm, start, end, event);
+		__mmu_notifier_invalidate_range_start(mm, range);
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-						     unsigned long start,
-						     unsigned long end,
-						     enum mmu_event event)
+						     struct mmu_notifier_range *range)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_end(mm, start, end, event);
+		__mmu_notifier_invalidate_range_end(mm, range);
 }
 
 static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
@@ -486,16 +498,12 @@ static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-						       unsigned long start,
-						       unsigned long end,
-						       enum mmu_event event)
+						       struct mmu_notifier_range *range)
 {
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-						     unsigned long start,
-						     unsigned long end,
-						     enum mmu_event event)
+						     struct mmu_notifier_range *range)
 {
 }
 
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 802828a..b7f7f6b 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -164,9 +164,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	spinlock_t *ptl;
 	pte_t *ptep;
 	int err;
-	/* For mmu_notifiers */
-	const unsigned long mmun_start = addr;
-	const unsigned long mmun_end   = addr + PAGE_SIZE;
+	struct mmu_notifier_range range;
 	struct mem_cgroup *memcg;
 
 	err = mem_cgroup_try_charge(kpage, vma->vm_mm, GFP_KERNEL, &memcg);
@@ -176,8 +174,10 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	/* For try_to_free_swap() and munlock_vma_page() below */
 	lock_page(page);
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	range.start = addr;
+	range.end = addr + PAGE_SIZE;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
 	err = -EAGAIN;
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -211,8 +211,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	err = 0;
  unlock:
 	mem_cgroup_cancel_charge(kpage, memcg);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 	unlock_page(page);
 	return err;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 75eb651..30db47f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -987,8 +987,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	pmd_t _pmd;
 	int ret = 0, i;
 	struct page **pages;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 
 	pages = kmalloc(sizeof(struct page *) * HPAGE_PMD_NR,
 			GFP_KERNEL);
@@ -1026,10 +1025,10 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 		cond_resched();
 	}
 
-	mmun_start = haddr;
-	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
-					    MMU_MIGRATE);
+	range.start = haddr;
+	range.end = haddr + HPAGE_PMD_SIZE;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
 
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, orig_pmd)))
@@ -1063,8 +1062,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	page_remove_rmap(page);
 	spin_unlock(ptl);
 
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 
 	ret |= VM_FAULT_WRITE;
 	put_page(page);
@@ -1074,8 +1072,7 @@ out:
 
 out_free_pages:
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		memcg = (void *)page_private(pages[i]);
 		set_page_private(pages[i], 0);
@@ -1094,8 +1091,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page = NULL, *new_page;
 	struct mem_cgroup *memcg;
 	unsigned long haddr;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 
 	ptl = pmd_lockptr(mm, pmd);
 	VM_BUG_ON_VMA(!vma->anon_vma, vma);
@@ -1165,10 +1161,10 @@ alloc:
 		copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
 	__SetPageUptodate(new_page);
 
-	mmun_start = haddr;
-	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
-					    MMU_MIGRATE);
+	range.start = haddr;
+	range.end = haddr + HPAGE_PMD_SIZE;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
 
 	spin_lock(ptl);
 	if (page)
@@ -1200,8 +1196,7 @@ alloc:
 	}
 	spin_unlock(ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 out:
 	return ret;
 out_unlock:
@@ -1668,12 +1663,12 @@ static int __split_huge_page_splitting(struct page *page,
 	spinlock_t *ptl;
 	pmd_t *pmd;
 	int ret = 0;
-	/* For mmu_notifiers */
-	const unsigned long mmun_start = address;
-	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
+	struct mmu_notifier_range range;
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_HSPLIT);
+	range.start = address;
+	range.end = address + HPAGE_PMD_SIZE;
+	range.event = MMU_HSPLIT;
+	mmu_notifier_invalidate_range_start(mm, &range);
 	pmd = page_check_address_pmd(page, mm, address,
 			PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
 	if (pmd) {
@@ -1689,8 +1684,7 @@ static int __split_huge_page_splitting(struct page *page,
 		ret = 1;
 		spin_unlock(ptl);
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_HSPLIT);
+	mmu_notifier_invalidate_range_end(mm, &range);
 
 	return ret;
 }
@@ -2468,8 +2462,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	int isolated;
 	unsigned long hstart, hend;
 	struct mem_cgroup *memcg;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
@@ -2509,10 +2502,10 @@ static void collapse_huge_page(struct mm_struct *mm,
 	pte = pte_offset_map(pmd, address);
 	pte_ptl = pte_lockptr(mm, pmd);
 
-	mmun_start = address;
-	mmun_end   = address + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	range.start = address;
+	range.end = address + HPAGE_PMD_SIZE;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
 	/*
 	 * After this gup_fast can't run anymore. This also removes
@@ -2522,8 +2515,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 */
 	_pmd = pmdp_clear_flush(vma, address, pmd);
 	spin_unlock(pmd_ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 
 	spin_lock(pte_ptl);
 	isolated = __collapse_huge_page_isolate(vma, address, pte);
@@ -2906,36 +2898,32 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
 	struct page *page;
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long haddr = address & HPAGE_PMD_MASK;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 
 	BUG_ON(vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE);
 
-	mmun_start = haddr;
-	mmun_end   = haddr + HPAGE_PMD_SIZE;
+	range.start = haddr;
+	range.end = haddr + HPAGE_PMD_SIZE;
+	range.event = MMU_MIGRATE;
 again:
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_start(mm, &range);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_trans_huge(*pmd))) {
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start,
-						  mmun_end, MMU_MIGRATE);
+		mmu_notifier_invalidate_range_end(mm, &range);
 		return;
 	}
 	if (is_huge_zero_pmd(*pmd)) {
 		__split_huge_zero_page_pmd(vma, haddr, pmd);
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start,
-						  mmun_end, MMU_MIGRATE);
+		mmu_notifier_invalidate_range_end(mm, &range);
 		return;
 	}
 	page = pmd_page(*pmd);
 	VM_BUG_ON_PAGE(!page_count(page), page);
 	get_page(page);
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 
 	split_huge_page(page);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b4770c4..a19abdb 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2551,17 +2551,16 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	int cow;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 	int ret = 0;
 
 	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
 
-	mmun_start = vma->vm_start;
-	mmun_end = vma->vm_end;
+	range.start = vma->vm_start;
+	range.end = vma->vm_end;
+	range.event = MMU_MIGRATE;
 	if (cow)
-		mmu_notifier_invalidate_range_start(src, mmun_start,
-						    mmun_end, MMU_MIGRATE);
+		mmu_notifier_invalidate_range_start(src, &range);
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
 		spinlock_t *src_ptl, *dst_ptl;
@@ -2601,8 +2600,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		} else {
 			if (cow) {
 				huge_ptep_set_wrprotect(src, addr, src_pte);
-				mmu_notifier_invalidate_range(src, mmun_start,
-								   mmun_end);
+				mmu_notifier_invalidate_range(src, range.start,
+								   range.end);
 			}
 			entry = huge_ptep_get(src_pte);
 			ptepage = pte_page(entry);
@@ -2615,8 +2614,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	}
 
 	if (cow)
-		mmu_notifier_invalidate_range_end(src, mmun_start,
-						  mmun_end, MMU_MIGRATE);
+		mmu_notifier_invalidate_range_end(src, &range);
 
 	return ret;
 }
@@ -2634,16 +2632,17 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	struct page *page;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
-	const unsigned long mmun_start = start;	/* For mmu_notifiers */
-	const unsigned long mmun_end   = end;	/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 
 	WARN_ON(!is_vm_hugetlb_page(vma));
 	BUG_ON(start & ~huge_page_mask(h));
 	BUG_ON(end & ~huge_page_mask(h));
 
+	range.start = start;
+	range.end = end;
+	range.event = MMU_MIGRATE;
 	tlb_start_vma(tlb, vma);
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_start(mm, &range);
 	address = start;
 again:
 	for (; address < end; address += sz) {
@@ -2716,8 +2715,7 @@ unlock:
 		if (address < end && !ref_page)
 			goto again;
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 	tlb_end_vma(tlb, vma);
 }
 
@@ -2814,8 +2812,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct hstate *h = hstate_vma(vma);
 	struct page *old_page, *new_page;
 	int ret = 0, outside_reserve = 0;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 
 	old_page = pte_page(pte);
 
@@ -2893,10 +2890,11 @@ retry_avoidcopy:
 			    pages_per_huge_page(h));
 	__SetPageUptodate(new_page);
 
-	mmun_start = address & huge_page_mask(h);
-	mmun_end = mmun_start + huge_page_size(h);
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
-					    MMU_MIGRATE);
+	range.start = address & huge_page_mask(h);
+	range.end = range.start + huge_page_size(h);
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
+
 	/*
 	 * Retake the page table lock to check for racing updates
 	 * before the page tables are altered
@@ -2908,7 +2906,7 @@ retry_avoidcopy:
 
 		/* Break COW */
 		huge_ptep_clear_flush(vma, address, ptep);
-		mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range(mm, range.start, range.end);
 		set_huge_pte_at(mm, address, ptep,
 				make_huge_pte(vma, new_page, 1));
 		page_remove_rmap(old_page);
@@ -2917,8 +2915,7 @@ retry_avoidcopy:
 		new_page = old_page;
 	}
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
-					  MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 out_release_all:
 	page_cache_release(new_page);
 out_release_old:
@@ -3352,11 +3349,15 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	pte_t pte;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long pages = 0;
+	struct mmu_notifier_range range;
 
 	BUG_ON(address >= end);
 	flush_cache_range(vma, address, end);
 
-	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MPROT);
+	range.start = start;
+	range.end = end;
+	range.event = MMU_MPROT;
+	mmu_notifier_invalidate_range_start(mm, &range);
 	i_mmap_lock_write(vma->vm_file->f_mapping);
 	for (; address < end; address += huge_page_size(h)) {
 		spinlock_t *ptl;
@@ -3387,7 +3388,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	flush_tlb_range(vma, start, end);
 	mmu_notifier_invalidate_range(mm, start, end);
 	i_mmap_unlock_write(vma->vm_file->f_mapping);
-	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MPROT);
+	mmu_notifier_invalidate_range_end(mm, &range);
 
 	return pages << h->order;
 }
diff --git a/mm/ksm.c b/mm/ksm.c
index 8c3a892..3667d98 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -855,14 +855,13 @@ static inline int pages_identical(struct page *page1, struct page *page2)
 static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 			      pte_t *orig_pte)
 {
+	struct mmu_notifier_range range;
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long addr;
 	pte_t *ptep;
 	spinlock_t *ptl;
 	int swapped;
 	int err = -EFAULT;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
 
 	addr = page_address_in_vma(page, vma);
 	if (addr == -EFAULT)
@@ -870,10 +869,10 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 
 	BUG_ON(PageTransCompound(page));
 
-	mmun_start = addr;
-	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
-					    MMU_WRITE_PROTECT);
+	range.start = addr;
+	range.end = addr + PAGE_SIZE;
+	range.event = MMU_WRITE_PROTECT;
+	mmu_notifier_invalidate_range_start(mm, &range);
 
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -913,8 +912,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 out_unlock:
 	pte_unmap_unlock(ptep, ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
-					  MMU_WRITE_PROTECT);
+	mmu_notifier_invalidate_range_end(mm, &range);
 out:
 	return err;
 }
@@ -937,8 +935,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	spinlock_t *ptl;
 	unsigned long addr;
 	int err = -EFAULT;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 
 	addr = page_address_in_vma(page, vma);
 	if (addr == -EFAULT)
@@ -948,10 +945,10 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	if (!pmd)
 		goto out;
 
-	mmun_start = addr;
-	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end,
-					    MMU_MIGRATE);
+	range.start = addr;
+	range.end = addr + PAGE_SIZE;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
 
 	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	if (!pte_same(*ptep, orig_pte)) {
@@ -976,8 +973,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	pte_unmap_unlock(ptep, ptl);
 	err = 0;
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end,
-					  MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 out:
 	return err;
 }
diff --git a/mm/madvise.c b/mm/madvise.c
index d7ac37a..9e91bcf 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -357,6 +357,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
 	unsigned long start, end;
 	struct mm_struct *mm = vma->vm_mm;
 	struct mmu_gather tlb;
+	struct mmu_notifier_range range;
 
 	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
 		return -EINVAL;
@@ -376,9 +377,12 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
 	tlb_gather_mmu(&tlb, mm, start, end);
 	update_hiwater_rss(mm);
 
-	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MUNMAP);
+	range.start = start;
+	range.end = end;
+	range.event = MMU_MUNMAP;
+	mmu_notifier_invalidate_range_start(mm, &range);
 	madvise_free_page_range(&tlb, vma, start, end);
-	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MUNMAP);
+	mmu_notifier_invalidate_range_end(mm, &range);
 	tlb_finish_mmu(&tlb, start, end);
 
 	return 0;
diff --git a/mm/memory.c b/mm/memory.c
index ffca25f..0d93542 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1009,8 +1009,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	unsigned long next;
 	unsigned long addr = vma->vm_start;
 	unsigned long end = vma->vm_end;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 	bool is_cow;
 	int ret;
 
@@ -1046,11 +1045,11 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 * is_cow_mapping() returns true.
 	 */
 	is_cow = is_cow_mapping(vma->vm_flags);
-	mmun_start = addr;
-	mmun_end   = end;
+	range.start = addr;
+	range.end = end;
+	range.event = MMU_MIGRATE;
 	if (is_cow)
-		mmu_notifier_invalidate_range_start(src_mm, mmun_start,
-						    mmun_end, MMU_MIGRATE);
+		mmu_notifier_invalidate_range_start(src_mm, &range);
 
 	ret = 0;
 	dst_pgd = pgd_offset(dst_mm, addr);
@@ -1067,8 +1066,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
 
 	if (is_cow)
-		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end,
-						  MMU_MIGRATE);
+		mmu_notifier_invalidate_range_end(src_mm, &range);
 	return ret;
 }
 
@@ -1360,13 +1358,16 @@ void unmap_vmas(struct mmu_gather *tlb,
 		unsigned long end_addr)
 {
 	struct mm_struct *mm = vma->vm_mm;
+	struct mmu_notifier_range range = {
+		.start = start_addr,
+		.end = end_addr,
+		.event = MMU_MUNMAP,
+	};
 
-	mmu_notifier_invalidate_range_start(mm, start_addr,
-					    end_addr, MMU_MUNMAP);
+	mmu_notifier_invalidate_range_start(mm, &range);
 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
 		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
-	mmu_notifier_invalidate_range_end(mm, start_addr,
-					  end_addr, MMU_MUNMAP);
+	mmu_notifier_invalidate_range_end(mm, &range);
 }
 
 /**
@@ -1383,16 +1384,20 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct mmu_gather tlb;
-	unsigned long end = start + size;
+	struct mmu_notifier_range range = {
+		.start = start,
+		.end = start + size,
+		.event = MMU_MIGRATE,
+	};
 
 	lru_add_drain();
-	tlb_gather_mmu(&tlb, mm, start, end);
+	tlb_gather_mmu(&tlb, mm, start, range.end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MIGRATE);
-	for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
-		unmap_single_vma(&tlb, vma, start, end, details);
-	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MIGRATE);
-	tlb_finish_mmu(&tlb, start, end);
+	mmu_notifier_invalidate_range_start(mm, &range);
+	for ( ; vma && vma->vm_start < range.end; vma = vma->vm_next)
+		unmap_single_vma(&tlb, vma, start, range.end, details);
+	mmu_notifier_invalidate_range_end(mm, &range);
+	tlb_finish_mmu(&tlb, start, range.end);
 }
 
 /**
@@ -1409,15 +1414,19 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct mmu_gather tlb;
-	unsigned long end = address + size;
+	struct mmu_notifier_range range = {
+		.start = address,
+		.end = address + size,
+		.event = MMU_MUNMAP,
+	};
 
 	lru_add_drain();
-	tlb_gather_mmu(&tlb, mm, address, end);
+	tlb_gather_mmu(&tlb, mm, address, range.end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
-	unmap_single_vma(&tlb, vma, address, end, details);
-	mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
-	tlb_finish_mmu(&tlb, address, end);
+	mmu_notifier_invalidate_range_start(mm, &range);
+	unmap_single_vma(&tlb, vma, address, range.end, details);
+	mmu_notifier_invalidate_range_end(mm, &range);
+	tlb_finish_mmu(&tlb, address, range.end);
 }
 
 /**
@@ -2037,10 +2046,12 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	int ret = 0;
 	int page_mkwrite = 0;
 	struct page *dirty_page = NULL;
-	unsigned long mmun_start = 0;	/* For mmu_notifiers */
-	unsigned long mmun_end = 0;	/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 	struct mem_cgroup *memcg;
 
+	range.start = 0;
+	range.end = 0;
+
 	old_page = vm_normal_page(vma, address, orig_pte);
 	if (!old_page) {
 		/*
@@ -2199,10 +2210,10 @@ gotten:
 	if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg))
 		goto oom_free_new;
 
-	mmun_start  = address & PAGE_MASK;
-	mmun_end    = mmun_start + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	range.start = address & PAGE_MASK;
+	range.end = range.start + PAGE_SIZE;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(mm, &range);
 
 	/*
 	 * Re-check the pte - we dropped the lock
@@ -2272,9 +2283,8 @@ gotten:
 		page_cache_release(new_page);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
-	if (mmun_end > mmun_start)
-		mmu_notifier_invalidate_range_end(mm, mmun_start,
-						  mmun_end, MMU_MIGRATE);
+	if (range.end > range.start)
+		mmu_notifier_invalidate_range_end(mm, &range);
 	if (old_page) {
 		/*
 		 * Don't let another task, with possibly unlocked vma,
diff --git a/mm/migrate.c b/mm/migrate.c
index 254d5bf..e9858e4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1763,10 +1763,13 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	int isolated = 0;
 	struct page *new_page = NULL;
 	int page_lru = page_is_file_cache(page);
-	unsigned long mmun_start = address & HPAGE_PMD_MASK;
-	unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
+	struct mmu_notifier_range range;
 	pmd_t orig_entry;
 
+	range.start = address & HPAGE_PMD_MASK;
+	range.end = range.start + HPAGE_PMD_SIZE;
+	range.event = MMU_MIGRATE;
+
 	/*
 	 * Rate-limit the amount of data that is being migrated to a node.
 	 * Optimal placement is no good if the memory bus is saturated and
@@ -1788,7 +1791,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	}
 
 	if (mm_tlb_flush_pending(mm))
-		flush_tlb_range(vma, mmun_start, mmun_end);
+		flush_tlb_range(vma, range.start, range.end);
 
 	/* Prepare a page as a migration target */
 	__set_page_locked(new_page);
@@ -1801,14 +1804,12 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	WARN_ON(PageLRU(new_page));
 
 	/* Recheck the target PMD */
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_start(mm, &range);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
 fail_putback:
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start,
-						  mmun_end, MMU_MIGRATE);
+		mmu_notifier_invalidate_range_end(mm, &range);
 
 		/* Reverse changes made by migrate_page_copy() */
 		if (TestClearPageActive(new_page))
@@ -1841,17 +1842,17 @@ fail_putback:
 	 * The SetPageUptodate on the new page and page_add_new_anon_rmap
 	 * guarantee the copy is visible before the pagetable update.
 	 */
-	flush_cache_range(vma, mmun_start, mmun_end);
-	page_add_anon_rmap(new_page, vma, mmun_start);
-	pmdp_clear_flush_notify(vma, mmun_start, pmd);
-	set_pmd_at(mm, mmun_start, pmd, entry);
-	flush_tlb_range(vma, mmun_start, mmun_end);
+	flush_cache_range(vma, range.start, range.end);
+	page_add_anon_rmap(new_page, vma, range.start);
+	pmdp_clear_flush_notify(vma, range.start, pmd);
+	set_pmd_at(mm, range.start, pmd, entry);
+	flush_tlb_range(vma, range.start, range.end);
 	update_mmu_cache_pmd(vma, address, &entry);
 
 	if (page_count(page) != 2) {
-		set_pmd_at(mm, mmun_start, pmd, orig_entry);
-		flush_tlb_range(vma, mmun_start, mmun_end);
-		mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
+		set_pmd_at(mm, range.start, pmd, orig_entry);
+		flush_tlb_range(vma, range.start, range.end);
+		mmu_notifier_invalidate_range(mm, range.start, range.end);
 		update_mmu_cache_pmd(vma, address, &entry);
 		page_remove_rmap(new_page);
 		goto fail_putback;
@@ -1862,8 +1863,7 @@ fail_putback:
 	page_remove_rmap(page);
 
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, &range);
 
 	/* Take an "isolate" reference and put new page on the LRU. */
 	get_page(new_page);
@@ -1888,7 +1888,7 @@ out_dropref:
 	ptl = pmd_lock(mm, pmd);
 	if (pmd_same(*pmd, entry)) {
 		entry = pmd_mknonnuma(entry);
-		set_pmd_at(mm, mmun_start, pmd, entry);
+		set_pmd_at(mm, range.start, pmd, entry);
 		update_mmu_cache_pmd(vma, address, &entry);
 	}
 	spin_unlock(ptl);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index e51ea02..8f6f994 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -174,9 +174,7 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 }
 
 void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-					   unsigned long start,
-					   unsigned long end,
-					   enum mmu_event event)
+					   struct mmu_notifier_range *range)
 
 {
 	struct mmu_notifier *mn;
@@ -185,21 +183,36 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_start)
-			mn->ops->invalidate_range_start(mn, mm, start,
-							end, event);
+			mn->ops->invalidate_range_start(mn, mm, range);
 	}
 	srcu_read_unlock(&srcu, id);
+
+	/*
+	 * This must happen after the callback so that a subsystem can block on
+	 * a new invalidation range to synchronize itself.
+	 */
+	spin_lock(&mm->mmu_notifier_mm->lock);
+	list_add_tail(&range->list, &mm->mmu_notifier_mm->ranges);
+	mm->mmu_notifier_mm->nranges++;
+	spin_unlock(&mm->mmu_notifier_mm->lock);
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
 
 void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-					 unsigned long start,
-					 unsigned long end,
-					 enum mmu_event event)
+					 struct mmu_notifier_range *range)
 {
 	struct mmu_notifier *mn;
 	int id;
 
+	/*
+	 * This must happen before the callback so that a subsystem can unblock
+	 * when the range invalidation ends.
+	 */
+	spin_lock(&mm->mmu_notifier_mm->lock);
+	list_del_init(&range->list);
+	mm->mmu_notifier_mm->nranges--;
+	spin_unlock(&mm->mmu_notifier_mm->lock);
+
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		/*
@@ -211,12 +224,18 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 		 * (besides the pointer check).
 		 */
 		if (mn->ops->invalidate_range)
-			mn->ops->invalidate_range(mn, mm, start, end);
+			mn->ops->invalidate_range(mn, mm,
+						  range->start, range->end);
 		if (mn->ops->invalidate_range_end)
-			mn->ops->invalidate_range_end(mn, mm, start,
-						      end, event);
+			mn->ops->invalidate_range_end(mn, mm, range);
 	}
 	srcu_read_unlock(&srcu, id);
+
+	/*
+	 * Wake up after the callbacks so they can do their job before any of
+	 * the waiters resume.
+	 */
+	wake_up(&mm->mmu_notifier_mm->wait_queue);
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_end);
 
@@ -235,6 +254,49 @@ void __mmu_notifier_invalidate_range(struct mm_struct *mm,
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range);
 
+static bool mmu_notifier_range_is_valid_locked(struct mm_struct *mm,
+					       unsigned long start,
+					       unsigned long end)
+{
+	struct mmu_notifier_range *range;
+
+	list_for_each_entry(range, &mm->mmu_notifier_mm->ranges, list) {
+		if (!(range->end <= start || range->start >= end))
+			return false;
+	}
+	return true;
+}
+
+bool mmu_notifier_range_is_valid(struct mm_struct *mm,
+				 unsigned long start,
+				 unsigned long end)
+{
+	bool valid;
+
+	spin_lock(&mm->mmu_notifier_mm->lock);
+	valid = mmu_notifier_range_is_valid_locked(mm, start, end);
+	spin_unlock(&mm->mmu_notifier_mm->lock);
+	return valid;
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_range_is_valid);
+
+void mmu_notifier_range_wait_valid(struct mm_struct *mm,
+				   unsigned long start,
+				   unsigned long end)
+{
+	spin_lock(&mm->mmu_notifier_mm->lock);
+	while (!mmu_notifier_range_is_valid_locked(mm, start, end)) {
+		int nranges = mm->mmu_notifier_mm->nranges;
+
+		spin_unlock(&mm->mmu_notifier_mm->lock);
+		wait_event(mm->mmu_notifier_mm->wait_queue,
+			   nranges != mm->mmu_notifier_mm->nranges);
+		spin_lock(&mm->mmu_notifier_mm->lock);
+	}
+	spin_unlock(&mm->mmu_notifier_mm->lock);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_range_wait_valid);
+
 static int do_mmu_notifier_register(struct mmu_notifier *mn,
 				    struct mm_struct *mm,
 				    int take_mmap_sem)
@@ -264,6 +326,9 @@ static int do_mmu_notifier_register(struct mmu_notifier *mn,
 	if (!mm_has_notifiers(mm)) {
 		INIT_HLIST_HEAD(&mmu_notifier_mm->list);
 		spin_lock_init(&mmu_notifier_mm->lock);
+		INIT_LIST_HEAD(&mmu_notifier_mm->ranges);
+		mmu_notifier_mm->nranges = 0;
+		init_waitqueue_head(&mmu_notifier_mm->wait_queue);
 
 		mm->mmu_notifier_mm = mmu_notifier_mm;
 		mmu_notifier_mm = NULL;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 0f5dbfe..c88f770 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -139,7 +139,9 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	unsigned long next;
 	unsigned long pages = 0;
 	unsigned long nr_huge_updates = 0;
-	unsigned long mni_start = 0;
+	struct mmu_notifier_range range = {
+		.start = 0,
+	};
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -150,10 +152,11 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 			continue;
 
 		/* invoke the mmu notifier if the pmd is populated */
-		if (!mni_start) {
-			mni_start = addr;
-			mmu_notifier_invalidate_range_start(mm, mni_start,
-							    end, MMU_MPROT);
+		if (!range.start) {
+			range.start = addr;
+			range.end = end;
+			range.event = MMU_MPROT;
+			mmu_notifier_invalidate_range_start(mm, &range);
 		}
 
 		if (pmd_trans_huge(*pmd)) {
@@ -180,9 +183,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		pages += this_pages;
 	} while (pmd++, addr = next, addr != end);
 
-	if (mni_start)
-		mmu_notifier_invalidate_range_end(mm, mni_start, end,
-						  MMU_MPROT);
+	if (range.start)
+		mmu_notifier_invalidate_range_end(mm, &range);
 
 	if (nr_huge_updates)
 		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
diff --git a/mm/mremap.c b/mm/mremap.c
index 1ede220..5556f51 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -167,18 +167,17 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 		bool need_rmap_locks)
 {
 	unsigned long extent, next, old_end;
+	struct mmu_notifier_range range;
 	pmd_t *old_pmd, *new_pmd;
 	bool need_flush = false;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
 
 	old_end = old_addr + len;
 	flush_cache_range(vma, old_addr, old_end);
 
-	mmun_start = old_addr;
-	mmun_end   = old_end;
-	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start,
-					    mmun_end, MMU_MIGRATE);
+	range.start = old_addr;
+	range.end = old_end;
+	range.event = MMU_MIGRATE;
+	mmu_notifier_invalidate_range_start(vma->vm_mm, &range);
 
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
 		cond_resched();
@@ -230,8 +229,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	if (likely(need_flush))
 		flush_tlb_range(vma, old_end-len, old_addr);
 
-	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start,
-					  mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(vma->vm_mm, &range);
 
 	return len + old_addr - old_end;	/* how much done */
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index 1d96644..611a640 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1361,15 +1361,14 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 	spinlock_t *ptl;
 	struct page *page;
 	unsigned long address;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	struct mmu_notifier_range range;
 	unsigned long end;
 	int ret = SWAP_AGAIN;
 	int locked_vma = 0;
-	enum mmu_event event = MMU_MIGRATE;
 
+	range.event = MMU_MIGRATE;
 	if (flags & TTU_MUNLOCK)
-		event = MMU_MUNLOCK;
+		range.event = MMU_MUNLOCK;
 
 	address = (vma->vm_start + cursor) & CLUSTER_MASK;
 	end = address + CLUSTER_SIZE;
@@ -1382,9 +1381,9 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 	if (!pmd)
 		return ret;
 
-	mmun_start = address;
-	mmun_end   = end;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, event);
+	range.start = address;
+	range.end = end;
+	mmu_notifier_invalidate_range_start(mm, &range);
 
 	/*
 	 * If we can acquire the mmap_sem for read, and vma is VM_LOCKED,
@@ -1453,7 +1452,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 		(*mapcount)--;
 	}
 	pte_unmap_unlock(pte - 1, ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, event);
+	mmu_notifier_invalidate_range_end(mm, &range);
 	if (locked_vma)
 		up_read(&vma->vm_mm->mmap_sem);
 	return ret;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index be2f937..684af9e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -313,9 +313,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 
 static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
-						    unsigned long start,
-						    unsigned long end,
-						    enum mmu_event event)
+						    const struct mmu_notifier_range *range)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush = 0, idx;
@@ -328,7 +326,7 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	 * count is also read inside the mmu_lock critical section.
 	 */
 	kvm->mmu_notifier_count++;
-	need_tlb_flush = kvm_unmap_hva_range(kvm, start, end);
+	need_tlb_flush = kvm_unmap_hva_range(kvm, range->start, range->end);
 	need_tlb_flush |= kvm->tlbs_dirty;
 	/* we've to flush the tlb before the pages can be freed */
 	if (need_tlb_flush)
@@ -340,9 +338,7 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 						  struct mm_struct *mm,
-						  unsigned long start,
-						  unsigned long end,
-						  enum mmu_event event)
+						  const struct mmu_notifier_range *range)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 
-- 
1.9.3
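
For illustration only, the overlap test used by
mmu_notifier_range_is_valid_locked() in the patch above can be exercised in
user space. The sketch below is a stand-in under stated assumptions: struct
range, ranges_overlap() and range_is_valid() are hypothetical names, a plain
array replaces the kernel list, and there is no locking. It shows that ranges
are treated as half-open [start, end) intervals, so ranges that merely touch
do not conflict.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for struct mmu_notifier_range. */
struct range {
	unsigned long start;	/* inclusive */
	unsigned long end;	/* exclusive */
};

/* Same predicate as the patch: two half-open ranges overlap unless one ends
 * at or before the point where the other starts. */
static bool ranges_overlap(const struct range *r,
			   unsigned long start, unsigned long end)
{
	return !(r->end <= start || r->start >= end);
}

/* A range is "valid" when no active invalidation overlaps it. */
static bool range_is_valid(const struct range *active, size_t n,
			   unsigned long start, unsigned long end)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (ranges_overlap(&active[i], start, end))
			return false;
	return true;
}
```

With active ranges [0x1000, 0x2000) and [0x4000, 0x5000), a query for
[0x2000, 0x4000) is reported valid, while [0x1800, 0x2800) is not.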


* [PATCH 3/7] HMM: introduce heterogeneous memory management.
  2014-12-22 16:48 ` j.glisse
@ 2014-12-22 16:48   ` j.glisse
  -1 siblings, 0 replies; 21+ messages in thread
From: j.glisse @ 2014-12-22 16:48 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Shachar Raindel,
	Liran Liss, Roland Dreier, Ben Sander, Greg Stoner,
	John Bridgman, Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse,
	Jatin Kumar

From: Jérôme Glisse <jglisse@redhat.com>

This patch introduces only the core HMM functions: registering a new mirror,
stopping a mirror, and registering and unregistering a device.

The lifecycle of an HMM object is handled differently than that of an
mmu_notifier because, unlike mmu_notifier, there can be concurrent calls from
mm code to HMM code and/or from device driver code to HMM code. Moreover, the
lifetime of HMM can be uncorrelated with the lifetime of the process that is
being mirrored.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
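
Not part of the commit: a rough user-space sketch of the reference pattern
behind hmm_ref()/hmm_unref() in mm/hmm.c below, which rely on
kref_get_unless_zero() so that a concurrent caller never revives an object
whose last reference is already gone. The names struct obj, obj_init(),
obj_get_unless_zero() and obj_put() are hypothetical, and C11 atomics stand in
for the kernel's kref.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcounted object; stands in for struct hmm and its kref. */
struct obj {
	atomic_int refcount;
	bool destroyed;		/* set by the release path, for illustration */
};

static void obj_init(struct obj *o)
{
	atomic_init(&o->refcount, 1);
	o->destroyed = false;
}

/* Take a reference only if the count has not already dropped to zero. */
static bool obj_get_unless_zero(struct obj *o)
{
	int old = atomic_load(&o->refcount);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&o->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; the caller that drops the last one runs the release. */
static void obj_put(struct obj *o)
{
	if (atomic_fetch_sub(&o->refcount, 1) == 1)
		o->destroyed = true;	/* the kernel code would free here */
}
```

Once the last obj_put() has run, obj_get_unless_zero() fails instead of
handing out a reference to a dying object, which is the property the
concurrent mm-side and driver-side callers depend on.
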
 MAINTAINERS              |   7 +
 include/linux/hmm.h      | 129 ++++++++++++++++
 include/linux/mm.h       |  11 ++
 include/linux/mm_types.h |  14 ++
 kernel/fork.c            |   2 +
 mm/Kconfig               |  15 ++
 mm/Makefile              |   1 +
 mm/hmm.c                 | 374 +++++++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 553 insertions(+)
 create mode 100644 include/linux/hmm.h
 create mode 100644 mm/hmm.c

diff --git a/MAINTAINERS b/MAINTAINERS
index a442d32..c5bb62e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4533,6 +4533,13 @@ F:	include/uapi/linux/if_hippi.h
 F:	net/802/hippi.c
 F:	drivers/net/hippi/
 
+HMM - Heterogeneous Memory Management
+M:	Jérôme Glisse <jglisse@redhat.com>
+L:	linux-mm@kvack.org
+S:	Maintained
+F:	mm/hmm.c
+F:	include/linux/hmm.h
+
 HOST AP DRIVER
 M:	Jouni Malinen <j@w1.fi>
 L:	hostap@shmoo.com (subscribers-only)
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
new file mode 100644
index 0000000..8eddc15
--- /dev/null
+++ b/include/linux/hmm.h
@@ -0,0 +1,129 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/* This is heterogeneous memory management (HMM). In a nutshell, it provides
+ * an API to mirror a process address space on a device which has its own mmu
+ * using its own page table for the process. It supports everything except
+ * special vma.
+ *
+ * Mandatory hardware features :
+ *   - An mmu with pagetable.
+ *   - Read only flag per cpu page.
+ *   - Page fault ie hardware must stop and wait for kernel to service fault.
+ *
+ * Optional hardware features :
+ *   - Dirty bit per cpu page.
+ *   - Access bit per cpu page.
+ *
+ * The hmm code handles all the interfacing with the core kernel mm code and
+ * provides a simple API. It does support migrating system memory to device
+ * memory and handles migration back to system memory on cpu page fault.
+ *
+ * Migrated memory is considered as swapped from the cpu and core mm code
+ * point of view.
+ */
+#ifndef _HMM_H
+#define _HMM_H
+
+#ifdef CONFIG_HMM
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/atomic.h>
+#include <linux/mm_types.h>
+#include <linux/mmu_notifier.h>
+#include <linux/workqueue.h>
+#include <linux/mman.h>
+
+
+struct hmm_device;
+struct hmm_mirror;
+struct hmm;
+
+
+/* hmm_device - Each device must register one and only one hmm_device.
+ *
+ * The hmm_device is the link between HMM and each device driver.
+ */
+
+/* struct hmm_device_ops - HMM device operation callbacks
+ */
+struct hmm_device_ops {
+	/* release() - mirror must stop using the address space.
+	 *
+	 * @mirror: The mirror that links process address space with the device.
+	 *
+	 * This callback is called either on mm destruction or as a result of a
+	 * call to hmm_mirror_release(). The device driver has to stop all hw
+	 * threads and all usage of the address space, and it has to dirty all
+	 * pages that have been dirtied by the device. But it must not clear any
+	 * entry from the mirror page table.
+	 */
+	void (*release)(struct hmm_mirror *mirror);
+};
+
+/* struct hmm_device - per device HMM structure
+ *
+ * @name: Device name (uniquely identify the device on the system).
+ * @ops: The hmm operations callback.
+ * @mirrors: List of all active mirrors for the device.
+ * @mutex: Mutex protecting mirrors list.
+ *
+ * Each device that wants to mirror an address space must register one of
+ * these structs (only once per linux device).
+ */
+struct hmm_device {
+	const char			*name;
+	const struct hmm_device_ops	*ops;
+	struct list_head		mirrors;
+	struct mutex			mutex;
+};
+
+int hmm_device_register(struct hmm_device *device);
+int hmm_device_unregister(struct hmm_device *device);
+
+
+/* hmm_mirror - device specific mirroring functions.
+ *
+ * Each device that mirrors a process has a unique hmm_mirror struct
+ * associating the process address space with the device. The same process can
+ * be mirrored by several different devices at the same time.
+ */
+
+/* struct hmm_mirror - per device and per mm HMM structure
+ *
+ * @device: The hmm_device struct this hmm_mirror is associated to.
+ * @hmm: The hmm struct this hmm_mirror is associated to.
+ * @dlist: List of all hmm_mirror for same device.
+ * @mlist: List of all hmm_mirror for same process.
+ *
+ * Each device that wants to mirror an address space must register one of
+ * these structs for each address space it wants to mirror. The same device
+ * can mirror several different address spaces, and the same address space
+ * can be mirrored by different devices.
+ */
+struct hmm_mirror {
+	struct hmm_device	*device;
+	struct hmm		*hmm;
+	struct list_head	dlist;
+	struct hlist_node	mlist;
+};
+
+int hmm_mirror_register(struct hmm_mirror *mirror, struct hmm_device *device);
+void hmm_mirror_unregister(struct hmm_mirror *mirror);
+
+
+#endif /* CONFIG_HMM */
+#endif
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f80d019..0e054f9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2208,5 +2208,16 @@ void __init setup_nr_node_ids(void);
 static inline void setup_nr_node_ids(void) {}
 #endif
 
+#ifdef CONFIG_HMM
+static inline void hmm_mm_init(struct mm_struct *mm)
+{
+	mm->hmm = NULL;
+}
+#else /* !CONFIG_HMM */
+static inline void hmm_mm_init(struct mm_struct *mm)
+{
+}
+#endif /* !CONFIG_HMM */
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6d34aa2..57a3e1c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -15,6 +15,10 @@
 #include <asm/page.h>
 #include <asm/mmu.h>
 
+#ifdef CONFIG_HMM
+struct hmm;
+#endif
+
 #ifndef AT_VECTOR_SIZE_ARCH
 #define AT_VECTOR_SIZE_ARCH 0
 #endif
@@ -426,6 +430,16 @@ struct mm_struct {
 #ifdef CONFIG_MMU_NOTIFIER
 	struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
+#ifdef CONFIG_HMM
+	/*
+	 * hmm always registers an mmu_notifier; we rely on mmu notifier to keep
+	 * a refcount on mm struct as well as forbidding registering hmm on a
+	 * dying mm.
+	 *
+	 * This field is set with mmap_sem held in write mode.
+	 */
+	struct hmm *hmm;
+#endif
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	pgtable_t pmd_huge_pte; /* protected by page_table_lock */
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 4dc2dda..0bb9dc4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -27,6 +27,7 @@
 #include <linux/binfmts.h>
 #include <linux/mman.h>
 #include <linux/mmu_notifier.h>
+#include <linux/hmm.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/vmacache.h>
@@ -568,6 +569,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 	mm_init_aio(mm);
 	mm_init_owner(mm, p);
 	mmu_notifier_mm_init(mm);
+	hmm_mm_init(mm);
 	clear_tlb_flush_pending(mm);
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	mm->pmd_huge_pte = NULL;
diff --git a/mm/Kconfig b/mm/Kconfig
index 1d1ae6b..b249db0 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -618,3 +618,18 @@ config MAX_STACK_SIZE_MB
 	  changed to a smaller value in which case that is used.
 
 	  A sane initial value is 80 MB.
+
+if STAGING
+config HMM
+	bool "Enable heterogeneous memory management (HMM)"
+	depends on MMU
+	select MMU_NOTIFIER
+	select GENERIC_PAGE_TABLE
+	default n
+	help
+	  Heterogeneous memory management provides infrastructure for a device
+	  to mirror a process address space into a hardware mmu or into
+	  anything supporting pagefault-like events.
+
+	  If unsure, say N to disable hmm.
+endif # STAGING
diff --git a/mm/Makefile b/mm/Makefile
index 3548460..cb2f9ed 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -73,3 +73,4 @@ obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
 obj-$(CONFIG_CMA)	+= cma.o
 obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
 obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
+obj-$(CONFIG_HMM) += hmm.o
diff --git a/mm/hmm.c b/mm/hmm.c
new file mode 100644
index 0000000..55afec0
--- /dev/null
+++ b/mm/hmm.c
@@ -0,0 +1,374 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/* This is the core code for heterogeneous memory management (HMM). HMM intends
+ * to provide helpers for mirroring a process address space on a device as well
+ * as allowing migration of data between system memory and device memory,
+ * referred to as remote memory from here on out.
+ *
+ * Refer to include/linux/hmm.h for further information on general design.
+ */
+#include <linux/export.h>
+#include <linux/bitmap.h>
+#include <linux/list.h>
+#include <linux/rculist.h>
+#include <linux/slab.h>
+#include <linux/mmu_notifier.h>
+#include <linux/mm.h>
+#include <linux/hugetlb.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/ksm.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/mmu_context.h>
+#include <linux/memcontrol.h>
+#include <linux/hmm.h>
+#include <linux/wait.h>
+#include <linux/mman.h>
+#include <linux/delay.h>
+#include <linux/workqueue.h>
+
+#include "internal.h"
+
+static struct srcu_struct srcu;
+
+
+/* struct hmm - per mm_struct HMM states.
+ *
+ * @mirrors: List of all mirrors for this mm (one per device).
+ * @mm: The mm struct this hmm is associated with.
+ * @vm_end: Last valid address for this mm (exclusive).
+ * @lock: Serialize the mirror list modifications.
+ * @kref: Reference counter
+ * @mmu_notifier: The mmu_notifier of this mm.
+ *
+ * For each process address space (mm_struct) there is one and only one hmm
+ * struct. hmm functions will redispatch to each device the changes made to
+ * the process address space.
+ */
+struct hmm {
+	struct hlist_head	mirrors;
+	struct mm_struct	*mm;
+	unsigned long		vm_end;
+	spinlock_t		lock;
+	struct kref		kref;
+	struct mmu_notifier	mmu_notifier;
+};
+
+static struct mmu_notifier_ops hmm_notifier_ops;
+
+
+/* hmm - core HMM functions.
+ *
+ * Core HMM functions that deal with all the process mm activities.
+ */
+
+static int hmm_init(struct hmm *hmm)
+{
+	hmm->mm = current->mm;
+	hmm->vm_end = TASK_SIZE;
+	kref_init(&hmm->kref);
+	INIT_HLIST_HEAD(&hmm->mirrors);
+	spin_lock_init(&hmm->lock);
+
+	/* register notifier */
+	hmm->mmu_notifier.ops = &hmm_notifier_ops;
+	return __mmu_notifier_register(&hmm->mmu_notifier, current->mm);
+}
+
+static int hmm_add_mirror(struct hmm *hmm, struct hmm_mirror *mirror)
+{
+	struct hmm_mirror *tmp;
+
+	spin_lock(&hmm->lock);
+	hlist_for_each_entry_rcu(tmp, &hmm->mirrors, mlist)
+		if (tmp->device == mirror->device) {
+			/* Same device can mirror only once. */
+			spin_unlock(&hmm->lock);
+			return -EINVAL;
+		}
+	hlist_add_head(&mirror->mlist, &hmm->mirrors);
+	spin_unlock(&hmm->lock);
+
+	return 0;
+}
+
+static inline struct hmm *hmm_ref(struct hmm *hmm)
+{
+	if (!hmm || !kref_get_unless_zero(&hmm->kref))
+		return NULL;
+	return hmm;
+}
+
+static void hmm_destroy(struct kref *kref)
+{
+	struct hmm *hmm;
+
+	hmm = container_of(kref, struct hmm, kref);
+	BUG_ON(!hlist_empty(&hmm->mirrors));
+
+	down_write(&hmm->mm->mmap_sem);
+	/* A new hmm might have been registered before reaching that point. */
+	if (hmm->mm->hmm == hmm)
+		hmm->mm->hmm = NULL;
+	up_write(&hmm->mm->mmap_sem);
+
+	mmu_notifier_unregister(&hmm->mmu_notifier, hmm->mm);
+
+	kfree(hmm);
+}
+
+static inline struct hmm *hmm_unref(struct hmm *hmm)
+{
+	if (hmm)
+		kref_put(&hmm->kref, hmm_destroy);
+	return NULL;
+}
+
+
+/* hmm_notifier - HMM callback for mmu_notifier tracking change to process mm.
+ *
+ * HMM uses mmu notifier to track changes made to the process address space.
+ */
+static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	struct hmm *hmm;
+	int id;
+
+	/*
+	 * The hmm structure cannot be freed here because the mmu_notifier srcu
+	 * is read locked, thus any concurrent hmm_mirror_unregister() that
+	 * would free hmm has to wait on the mmu_notifier srcu.
+	 */
+	hmm = container_of(mn, struct hmm, mmu_notifier);
+
+	id = srcu_read_lock(&srcu);
+	spin_lock(&hmm->lock);
+	while (hmm->mirrors.first) {
+		struct hmm_mirror *mirror;
+
+		mirror = hlist_entry(hmm->mirrors.first,
+				     struct hmm_mirror,
+				     mlist);
+		hlist_del_init_rcu(&mirror->mlist);
+		spin_unlock(&hmm->lock);
+
+		mirror->device->ops->release(mirror);
+
+		mutex_lock(&mirror->device->mutex);
+		list_del_init(&mirror->dlist);
+		mutex_unlock(&mirror->device->mutex);
+
+		spin_lock(&hmm->lock);
+	}
+	spin_unlock(&hmm->lock);
+
+	srcu_read_unlock(&srcu, id);
+}
+
+static struct mmu_notifier_ops hmm_notifier_ops = {
+	.release		= hmm_notifier_release,
+};
+
+
+/* hmm_mirror - per device mirroring functions.
+ *
+ * Each device that mirrors a process has a unique hmm_mirror struct. A
+ * process can be mirrored by several devices at the same time.
+ *
+ * Below are all the functions and their helpers used by device drivers to
+ * mirror the process address space. Those functions either deal with updating
+ * the device page table (through hmm callbacks) or provide helpers used by the
+ * device driver to fault in a range of memory in the device page table.
+ */
+
+/* hmm_mirror_register() - register mirror against current process for a device.
+ *
+ * @mirror: The mirror struct being registered.
+ * @device: The device struct against which the mirror is registered.
+ * Returns: 0 on success or -ENOMEM, -EINVAL on error.
+ *
+ * Called when a device driver wants to start mirroring a process address
+ * space. The HMM shim will register an mmu_notifier and start monitoring
+ * process address space changes. Hence callbacks to the device driver might
+ * happen even before this function returns.
+ *
+ * The task the device driver wants to mirror must be current!
+ *
+ * Only one mirror per mm and hmm_device can be created; -EINVAL is returned if
+ * the hmm_device already has an hmm_mirror for the mm.
+ */
+int hmm_mirror_register(struct hmm_mirror *mirror, struct hmm_device *device)
+{
+	struct mm_struct *mm = current->mm;
+	struct hmm *hmm = NULL;
+	int ret = 0;
+
+	/* Sanity checks. */
+	BUG_ON(!device);
+	BUG_ON(!mm);
+
+	/*
+	 * Initialize the mirror struct fields, the mlist init and del dance is
+	 * necessary to make the error path easier for driver and for hmm.
+	 */
+	INIT_HLIST_NODE(&mirror->mlist);
+	INIT_LIST_HEAD(&mirror->dlist);
+	mutex_lock(&device->mutex);
+	mirror->device = device;
+	list_add(&mirror->dlist, &device->mirrors);
+	mutex_unlock(&device->mutex);
+
+	down_write(&mm->mmap_sem);
+
+	hmm = hmm_ref(mm->hmm);
+	if (hmm == NULL) {
+		/* no hmm registered yet so register one */
+		hmm = kzalloc(sizeof(*mm->hmm), GFP_KERNEL);
+		if (hmm == NULL) {
+			up_write(&mm->mmap_sem);
+			ret = -ENOMEM;
+			goto error;
+		}
+
+		ret = hmm_init(hmm);
+		if (ret) {
+			up_write(&mm->mmap_sem);
+			kfree(hmm);
+			goto error;
+		}
+
+		mm->hmm = hmm;
+	}
+
+	mirror->hmm = hmm;
+	ret = hmm_add_mirror(hmm, mirror);
+	up_write(&mm->mmap_sem);
+	if (ret) {
+		mirror->hmm = NULL;
+		hmm_unref(hmm);
+		goto error;
+	}
+	return 0;
+
+error:
+	mutex_lock(&device->mutex);
+	list_del_init(&mirror->dlist);
+	mutex_unlock(&device->mutex);
+	return ret;
+}
+EXPORT_SYMBOL(hmm_mirror_register);
+
+static void hmm_mirror_release(struct hmm_mirror *mirror)
+{
+	spin_lock(&mirror->hmm->lock);
+	if (!hlist_unhashed(&mirror->mlist)) {
+		hlist_del_init_rcu(&mirror->mlist);
+		spin_unlock(&mirror->hmm->lock);
+		mirror->device->ops->release(mirror);
+
+		mutex_lock(&mirror->device->mutex);
+		list_del_init(&mirror->dlist);
+		mutex_unlock(&mirror->device->mutex);
+	} else
+		spin_unlock(&mirror->hmm->lock);
+}
+
+/* hmm_mirror_unregister() - unregister a mirror.
+ *
+ * @mirror: The mirror that link process address space with the device.
+ *
+ * A driver can call this function when it wants to stop mirroring a process.
+ * This will trigger a call to the ->release() callback if it did not already
+ * happen.
+ */
+void hmm_mirror_unregister(struct hmm_mirror *mirror)
+{
+	int id;
+
+	id = srcu_read_lock(&srcu);
+	hmm_mirror_release(mirror);
+	srcu_read_unlock(&srcu, id);
+
+	/*
+	 * Wait for any running method to finish, of course including
+	 * ->release() if it was run by hmm_notifier_release instead of us.
+	 */
+	synchronize_srcu(&srcu);
+
+	mirror->hmm = hmm_unref(mirror->hmm);
+}
+EXPORT_SYMBOL(hmm_mirror_unregister);
+
+
+/* hmm_device - Each device driver must register one and only one hmm_device
+ *
+ * The hmm_device is the link between HMM and each device driver.
+ */
+
+/* hmm_device_register() - register a device with HMM.
+ *
+ * @device: The hmm_device struct.
+ * Returns: 0 on success (this version performs no check that can fail).
+ *
+ *
+ * Called when a device driver wants to register itself with HMM. A device
+ * driver can only register once. To release the device, the driver must call
+ * hmm_device_unregister() once it has no active mirror left.
+ */
+int hmm_device_register(struct hmm_device *device)
+{
+	/* sanity check */
+	BUG_ON(!device);
+	BUG_ON(!device->name);
+	BUG_ON(!device->ops);
+	BUG_ON(!device->ops->release);
+
+	mutex_init(&device->mutex);
+	INIT_LIST_HEAD(&device->mirrors);
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_device_register);
+
+/* hmm_device_unregister() - unregister a device with HMM.
+ *
+ * @device: The hmm_device struct.
+ * Returns: 0 on success or -EBUSY otherwise.
+ *
+ * Called when a device driver wants to unregister itself from HMM. This checks
+ * that no mirror is still active and returns -EBUSY otherwise.
+ */
+int hmm_device_unregister(struct hmm_device *device)
+{
+	mutex_lock(&device->mutex);
+	if (!list_empty(&device->mirrors)) {
+		mutex_unlock(&device->mutex);
+		return -EBUSY;
+	}
+	mutex_unlock(&device->mutex);
+	synchronize_srcu(&srcu);
+	return 0;
+}
+EXPORT_SYMBOL(hmm_device_unregister);
+
+
+static int __init hmm_subsys_init(void)
+{
+	return init_srcu_struct(&srcu);
+}
+subsys_initcall(hmm_subsys_init);
-- 
1.9.3



* [PATCH 3/7] HMM: introduce heterogeneous memory management.
@ 2014-12-22 16:48   ` j.glisse
  0 siblings, 0 replies; 21+ messages in thread
From: j.glisse @ 2014-12-22 16:48 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Shachar Raindel,
	Liran Liss, Roland Dreier, Ben Sander, Greg Stoner,
	John Bridgman, Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse,
	Jatin Kumar

From: Jérôme Glisse <jglisse@redhat.com>

This patch only introduces the core HMM functions for registering a new mirror
and stopping a mirror, as well as for registering and unregistering a device.

The lifecycle of the HMM object is handled differently than that of
mmu_notifier because, unlike mmu_notifier, there can be concurrent calls from
both mm code to HMM code and/or from device driver code to HMM code. Moreover,
the lifetime of HMM can be uncorrelated with the lifetime of the process that
is being mirrored.
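
As a rough illustration of the reference-counted lifetime described above,
here is a minimal userspace sketch of the "take a reference unless the object
is already dead" pattern that hmm_ref() builds on through
kref_get_unless_zero(). It is not the kernel implementation: struct ref,
ref_init(), ref_get_unless_zero() and ref_put() are hypothetical names, and
C11 atomics stand in for struct kref.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct kref. */
struct ref {
	atomic_int count;
};

/* A freshly created object starts with one reference held by its creator. */
static void ref_init(struct ref *r)
{
	atomic_init(&r->count, 1);
}

/* Take a reference only if the object is still alive (count > 0). */
static bool ref_get_unless_zero(struct ref *r)
{
	int old = atomic_load(&r->count);

	while (old != 0) {
		/* On failure, old is reloaded with the current count. */
		if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true when the caller dropped the last one. */
static bool ref_put(struct ref *r)
{
	int old = atomic_fetch_sub(&r->count, 1);

	/* Dropping a reference that is not held is a caller bug. */
	assert(old > 0);
	return old == 1;
}
```

A caller that gets false back from ref_get_unless_zero() has to treat the
object as already gone. That appears to be the case hmm_mirror_register() is
meant to handle by allocating a fresh struct hmm.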

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 MAINTAINERS              |   7 +
 include/linux/hmm.h      | 129 ++++++++++++++++
 include/linux/mm.h       |  11 ++
 include/linux/mm_types.h |  14 ++
 kernel/fork.c            |   2 +
 mm/Kconfig               |  15 ++
 mm/Makefile              |   1 +
 mm/hmm.c                 | 374 +++++++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 553 insertions(+)
 create mode 100644 include/linux/hmm.h
 create mode 100644 mm/hmm.c

diff --git a/MAINTAINERS b/MAINTAINERS
index a442d32..c5bb62e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4533,6 +4533,13 @@ F:	include/uapi/linux/if_hippi.h
 F:	net/802/hippi.c
 F:	drivers/net/hippi/
 
+HMM - Heterogeneous Memory Management
+M:	Jérôme Glisse <jglisse@redhat.com>
+L:	linux-mm@kvack.org
+S:	Maintained
+F:	mm/hmm.c
+F:	include/linux/hmm.h
+
 HOST AP DRIVER
 M:	Jouni Malinen <j@w1.fi>
 L:	hostap@shmoo.com (subscribers-only)
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
new file mode 100644
index 0000000..8eddc15
--- /dev/null
+++ b/include/linux/hmm.h
@@ -0,0 +1,129 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/* This is heterogeneous memory management (hmm). In a nutshell this provides
+ * an API to mirror a process address space on a device which has its own mmu
+ * using its own page table for the process. It supports everything except
+ * special vma.
+ *
+ * Mandatory hardware features :
+ *   - An mmu with pagetable.
+ *   - Read only flag per cpu page.
+ *   - Page fault ie hardware must stop and wait for kernel to service fault.
+ *
+ * Optional hardware features :
+ *   - Dirty bit per cpu page.
+ *   - Access bit per cpu page.
+ *
+ * The hmm code handles all the interfacing with the core kernel mm code and
+ * provides a simple API. It does support migrating system memory to device
+ * memory and handles migration back to system memory on cpu page fault.
+ *
+ * Migrated memory is considered as swapped from cpu and core mm code point of
+ * view.
+ */
+#ifndef _HMM_H
+#define _HMM_H
+
+#ifdef CONFIG_HMM
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/atomic.h>
+#include <linux/mm_types.h>
+#include <linux/mmu_notifier.h>
+#include <linux/workqueue.h>
+#include <linux/mman.h>
+
+
+struct hmm_device;
+struct hmm_mirror;
+struct hmm;
+
+
+/* hmm_device - Each device must register one and only one hmm_device.
+ *
+ * The hmm_device is the link between HMM and each device driver.
+ */
+
+/* struct hmm_device_ops - HMM device operation callbacks
+ */
+struct hmm_device_ops {
+	/* release() - mirror must stop using the address space.
+	 *
+	 * @mirror: The mirror that links process address space with the device.
+	 *
+	 * This callback is called either on mm destruction or as a result of a
+	 * call to hmm_mirror_unregister(). The device driver has to stop all hw
+	 * threads and all usage of the address space, and it has to dirty all
+	 * pages that have been dirtied by the device. But it must not clear any
+	 * entry from the mirror page table.
+	 */
+	void (*release)(struct hmm_mirror *mirror);
+};
+
+/* struct hmm_device - per device HMM structure
+ *
+ * @name: Device name (uniquely identify the device on the system).
+ * @ops: The hmm operations callback.
+ * @mirrors: List of all active mirrors for the device.
+ * @mutex: Mutex protecting mirrors list.
+ *
+ * Each device that wants to mirror an address space must register one of
+ * these structs (only once per linux device).
+ */
+struct hmm_device {
+	const char			*name;
+	const struct hmm_device_ops	*ops;
+	struct list_head		mirrors;
+	struct mutex			mutex;
+};
+
+int hmm_device_register(struct hmm_device *device);
+int hmm_device_unregister(struct hmm_device *device);
+
+
+/* hmm_mirror - device specific mirroring functions.
+ *
+ * Each device that mirrors a process has a unique hmm_mirror struct
+ * associating the process address space with the device. The same process can
+ * be mirrored by several different devices at the same time.
+ */
+
+/* struct hmm_mirror - per device and per mm HMM structure
+ *
+ * @device: The hmm_device struct this hmm_mirror is associated to.
+ * @hmm: The hmm struct this hmm_mirror is associated to.
+ * @dlist: List of all hmm_mirror for same device.
+ * @mlist: List of all hmm_mirror for same process.
+ *
+ * Each device that wants to mirror an address space must register one of
+ * these structs for each address space it wants to mirror. The same device
+ * can mirror several different address spaces, and the same address space can
+ * be mirrored by different devices.
+ */
+struct hmm_mirror {
+	struct hmm_device	*device;
+	struct hmm		*hmm;
+	struct list_head	dlist;
+	struct hlist_node	mlist;
+};
+
+int hmm_mirror_register(struct hmm_mirror *mirror, struct hmm_device *device);
+void hmm_mirror_unregister(struct hmm_mirror *mirror);
+
+
+#endif /* CONFIG_HMM */
+#endif
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f80d019..0e054f9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2208,5 +2208,16 @@ void __init setup_nr_node_ids(void);
 static inline void setup_nr_node_ids(void) {}
 #endif
 
+#ifdef CONFIG_HMM
+static inline void hmm_mm_init(struct mm_struct *mm)
+{
+	mm->hmm = NULL;
+}
+#else /* !CONFIG_HMM */
+static inline void hmm_mm_init(struct mm_struct *mm)
+{
+}
+#endif /* !CONFIG_HMM */
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6d34aa2..57a3e1c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -15,6 +15,10 @@
 #include <asm/page.h>
 #include <asm/mmu.h>
 
+#ifdef CONFIG_HMM
+struct hmm;
+#endif
+
 #ifndef AT_VECTOR_SIZE_ARCH
 #define AT_VECTOR_SIZE_ARCH 0
 #endif
@@ -426,6 +430,16 @@ struct mm_struct {
 #ifdef CONFIG_MMU_NOTIFIER
 	struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
+#ifdef CONFIG_HMM
+	/*
+	 * hmm always registers an mmu_notifier; we rely on the mmu notifier to
+	 * keep a refcount on the mm struct as well as to forbid registering hmm
+	 * on a dying mm.
+	 *
+	 * This field is set with mmap_sem held in write mode.
+	 */
+	struct hmm *hmm;
+#endif
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	pgtable_t pmd_huge_pte; /* protected by page_table_lock */
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 4dc2dda..0bb9dc4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -27,6 +27,7 @@
 #include <linux/binfmts.h>
 #include <linux/mman.h>
 #include <linux/mmu_notifier.h>
+#include <linux/hmm.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/vmacache.h>
@@ -568,6 +569,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 	mm_init_aio(mm);
 	mm_init_owner(mm, p);
 	mmu_notifier_mm_init(mm);
+	hmm_mm_init(mm);
 	clear_tlb_flush_pending(mm);
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	mm->pmd_huge_pte = NULL;
diff --git a/mm/Kconfig b/mm/Kconfig
index 1d1ae6b..b249db0 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -618,3 +618,18 @@ config MAX_STACK_SIZE_MB
 	  changed to a smaller value in which case that is used.
 
 	  A sane initial value is 80 MB.
+
+if STAGING
+config HMM
+	bool "Enable heterogeneous memory management (HMM)"
+	depends on MMU
+	select MMU_NOTIFIER
+	select GENERIC_PAGE_TABLE
+	default n
+	help
+	  Heterogeneous memory management provides infrastructure for a device
+	  to mirror a process address space into a hardware mmu or into
+	  anything supporting page-fault-like events.
+
+	  If unsure, say N to disable hmm.
+endif # STAGING
diff --git a/mm/Makefile b/mm/Makefile
index 3548460..cb2f9ed 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -73,3 +73,4 @@ obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
 obj-$(CONFIG_CMA)	+= cma.o
 obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
 obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
+obj-$(CONFIG_HMM) += hmm.o
diff --git a/mm/hmm.c b/mm/hmm.c
new file mode 100644
index 0000000..55afec0
--- /dev/null
+++ b/mm/hmm.c
@@ -0,0 +1,374 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/* This is the core code for heterogeneous memory management (HMM). HMM intends
+ * to provide helpers for mirroring a process address space on a device as well
+ * as allowing migration of data between system memory and device memory,
+ * referred to as remote memory from here on out.
+ *
+ * Refer to include/linux/hmm.h for further information on general design.
+ */
+#include <linux/export.h>
+#include <linux/bitmap.h>
+#include <linux/list.h>
+#include <linux/rculist.h>
+#include <linux/slab.h>
+#include <linux/mmu_notifier.h>
+#include <linux/mm.h>
+#include <linux/hugetlb.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/ksm.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/mmu_context.h>
+#include <linux/memcontrol.h>
+#include <linux/hmm.h>
+#include <linux/wait.h>
+#include <linux/mman.h>
+#include <linux/delay.h>
+#include <linux/workqueue.h>
+
+#include "internal.h"
+
+static struct srcu_struct srcu;
+
+
+/* struct hmm - per mm_struct HMM states.
+ *
+ * @mirrors: List of all mirrors for this mm (one per device).
+ * @mm: The mm struct this hmm is associated with.
+ * @vm_end: Last valid address for this mm (exclusive).
+ * @lock: Serialize the mirror list modifications.
+ * @kref: Reference counter
+ * @mmu_notifier: The mmu_notifier of this mm.
+ *
+ * For each process address space (mm_struct) there is one and only one hmm
+ * struct. hmm functions will redispatch to each device the changes made to
+ * the process address space.
+ */
+struct hmm {
+	struct hlist_head	mirrors;
+	struct mm_struct	*mm;
+	unsigned long		vm_end;
+	spinlock_t		lock;
+	struct kref		kref;
+	struct mmu_notifier	mmu_notifier;
+};
+
+static struct mmu_notifier_ops hmm_notifier_ops;
+
+
+/* hmm - core HMM functions.
+ *
+ * Core HMM functions that deal with all the process mm activities.
+ */
+
+static int hmm_init(struct hmm *hmm)
+{
+	hmm->mm = current->mm;
+	hmm->vm_end = TASK_SIZE;
+	kref_init(&hmm->kref);
+	INIT_HLIST_HEAD(&hmm->mirrors);
+	spin_lock_init(&hmm->lock);
+
+	/* register notifier */
+	hmm->mmu_notifier.ops = &hmm_notifier_ops;
+	return __mmu_notifier_register(&hmm->mmu_notifier, current->mm);
+}
+
+static int hmm_add_mirror(struct hmm *hmm, struct hmm_mirror *mirror)
+{
+	struct hmm_mirror *tmp;
+
+	spin_lock(&hmm->lock);
+	hlist_for_each_entry_rcu(tmp, &hmm->mirrors, mlist)
+		if (tmp->device == mirror->device) {
+			/* Same device can mirror only once. */
+			spin_unlock(&hmm->lock);
+			return -EINVAL;
+		}
+	hlist_add_head(&mirror->mlist, &hmm->mirrors);
+	spin_unlock(&hmm->lock);
+
+	return 0;
+}
+
+static inline struct hmm *hmm_ref(struct hmm *hmm)
+{
+	if (!hmm || !kref_get_unless_zero(&hmm->kref))
+		return NULL;
+	return hmm;
+}
+
+static void hmm_destroy(struct kref *kref)
+{
+	struct hmm *hmm;
+
+	hmm = container_of(kref, struct hmm, kref);
+	BUG_ON(!hlist_empty(&hmm->mirrors));
+
+	down_write(&hmm->mm->mmap_sem);
+	/* A new hmm might have been registered before reaching this point. */
+	if (hmm->mm->hmm == hmm)
+		hmm->mm->hmm = NULL;
+	up_write(&hmm->mm->mmap_sem);
+
+	mmu_notifier_unregister(&hmm->mmu_notifier, hmm->mm);
+
+	kfree(hmm);
+}
+
+static inline struct hmm *hmm_unref(struct hmm *hmm)
+{
+	if (hmm)
+		kref_put(&hmm->kref, hmm_destroy);
+	return NULL;
+}
+
+
+/* hmm_notifier - HMM callback for mmu_notifier tracking change to process mm.
+ *
+ * HMM uses mmu notifier to track changes made to the process address space.
+ */
+static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	struct hmm *hmm;
+	int id;
+
+	/*
+	 * The hmm structure cannot be freed here because the mmu_notifier srcu
+	 * is read locked, thus any concurrent hmm_mirror_unregister() that
+	 * would free hmm has to wait on the mmu_notifier srcu.
+	 */
+	hmm = container_of(mn, struct hmm, mmu_notifier);
+
+	id = srcu_read_lock(&srcu);
+	spin_lock(&hmm->lock);
+	while (hmm->mirrors.first) {
+		struct hmm_mirror *mirror;
+
+		mirror = hlist_entry(hmm->mirrors.first,
+				     struct hmm_mirror,
+				     mlist);
+		hlist_del_init_rcu(&mirror->mlist);
+		spin_unlock(&hmm->lock);
+
+		mirror->device->ops->release(mirror);
+
+		mutex_lock(&mirror->device->mutex);
+		list_del_init(&mirror->dlist);
+		mutex_unlock(&mirror->device->mutex);
+
+		spin_lock(&hmm->lock);
+	}
+	spin_unlock(&hmm->lock);
+
+	srcu_read_unlock(&srcu, id);
+}
+
+static struct mmu_notifier_ops hmm_notifier_ops = {
+	.release		= hmm_notifier_release,
+};
+
+
+/* hmm_mirror - per device mirroring functions.
+ *
+ * Each device that mirrors a process has a unique hmm_mirror struct. A
+ * process can be mirrored by several devices at the same time.
+ *
+ * Below are all the functions and their helpers used by device drivers to
+ * mirror the process address space. Those functions either deal with updating
+ * the device page table (through hmm callbacks) or provide helpers used by the
+ * device driver to fault in a range of memory in the device page table.
+ */
+
+/* hmm_mirror_register() - register mirror against current process for a device.
+ *
+ * @mirror: The mirror struct being registered.
+ * @device: The device struct against which the mirror is registered.
+ * Returns: 0 on success or -ENOMEM, -EINVAL on error.
+ *
+ * Called when a device driver wants to start mirroring a process address
+ * space. The HMM shim will register an mmu_notifier and start monitoring
+ * process address space changes. Hence callbacks to the device driver might
+ * happen even before this function returns.
+ *
+ * The task the device driver wants to mirror must be current!
+ *
+ * Only one mirror per mm and hmm_device can be created; -EINVAL is returned if
+ * the hmm_device already has an hmm_mirror for the mm.
+ */
+int hmm_mirror_register(struct hmm_mirror *mirror, struct hmm_device *device)
+{
+	struct mm_struct *mm = current->mm;
+	struct hmm *hmm = NULL;
+	int ret = 0;
+
+	/* Sanity checks. */
+	BUG_ON(!device);
+	BUG_ON(!mm);
+
+	/*
+	 * Initialize the mirror struct fields, the mlist init and del dance is
+	 * necessary to make the error path easier for driver and for hmm.
+	 */
+	INIT_HLIST_NODE(&mirror->mlist);
+	INIT_LIST_HEAD(&mirror->dlist);
+	mutex_lock(&device->mutex);
+	mirror->device = device;
+	list_add(&mirror->dlist, &device->mirrors);
+	mutex_unlock(&device->mutex);
+
+	down_write(&mm->mmap_sem);
+
+	hmm = hmm_ref(mm->hmm);
+	if (hmm == NULL) {
+		/* no hmm registered yet so register one */
+		hmm = kzalloc(sizeof(*mm->hmm), GFP_KERNEL);
+		if (hmm == NULL) {
+			up_write(&mm->mmap_sem);
+			ret = -ENOMEM;
+			goto error;
+		}
+
+		ret = hmm_init(hmm);
+		if (ret) {
+			up_write(&mm->mmap_sem);
+			kfree(hmm);
+			goto error;
+		}
+
+		mm->hmm = hmm;
+	}
+
+	mirror->hmm = hmm;
+	ret = hmm_add_mirror(hmm, mirror);
+	up_write(&mm->mmap_sem);
+	if (ret) {
+		mirror->hmm = NULL;
+		hmm_unref(hmm);
+		goto error;
+	}
+	return 0;
+
+error:
+	mutex_lock(&device->mutex);
+	list_del_init(&mirror->dlist);
+	mutex_unlock(&device->mutex);
+	return ret;
+}
+EXPORT_SYMBOL(hmm_mirror_register);
+
+static void hmm_mirror_release(struct hmm_mirror *mirror)
+{
+	spin_lock(&mirror->hmm->lock);
+	if (!hlist_unhashed(&mirror->mlist)) {
+		hlist_del_init_rcu(&mirror->mlist);
+		spin_unlock(&mirror->hmm->lock);
+		mirror->device->ops->release(mirror);
+
+		mutex_lock(&mirror->device->mutex);
+		list_del_init(&mirror->dlist);
+		mutex_unlock(&mirror->device->mutex);
+	} else
+		spin_unlock(&mirror->hmm->lock);
+}
+
+/* hmm_mirror_unregister() - unregister a mirror.
+ *
+ * @mirror: The mirror that link process address space with the device.
+ *
+ * A driver can call this function when it wants to stop mirroring a process.
+ * This will trigger a call to the ->release() callback if it did not already
+ * happen.
+ */
+void hmm_mirror_unregister(struct hmm_mirror *mirror)
+{
+	int id;
+
+	id = srcu_read_lock(&srcu);
+	hmm_mirror_release(mirror);
+	srcu_read_unlock(&srcu, id);
+
+	/*
+	 * Wait for any running method to finish, of course including
+	 * ->release() if it was run by hmm_notifier_release instead of us.
+	 */
+	synchronize_srcu(&srcu);
+
+	mirror->hmm = hmm_unref(mirror->hmm);
+}
+EXPORT_SYMBOL(hmm_mirror_unregister);
+
+
+/* hmm_device - Each device driver must register one and only one hmm_device
+ *
+ * The hmm_device is the link between HMM and each device driver.
+ */
+
+/* hmm_device_register() - register a device with HMM.
+ *
+ * @device: The hmm_device struct.
+ * Returns: 0 on success (this version performs no check that can fail).
+ *
+ *
+ * Called when a device driver wants to register itself with HMM. A device
+ * driver can only register once. To release the device, the driver must call
+ * hmm_device_unregister() once it has no active mirror left.
+ */
+int hmm_device_register(struct hmm_device *device)
+{
+	/* sanity check */
+	BUG_ON(!device);
+	BUG_ON(!device->name);
+	BUG_ON(!device->ops);
+	BUG_ON(!device->ops->release);
+
+	mutex_init(&device->mutex);
+	INIT_LIST_HEAD(&device->mirrors);
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_device_register);
+
+/* hmm_device_unregister() - unregister a device with HMM.
+ *
+ * @device: The hmm_device struct.
+ * Returns: 0 on success or -EBUSY otherwise.
+ *
+ * Called when a device driver wants to unregister itself from HMM. This checks
+ * that no mirror is still active and returns -EBUSY otherwise.
+ */
+int hmm_device_unregister(struct hmm_device *device)
+{
+	mutex_lock(&device->mutex);
+	if (!list_empty(&device->mirrors)) {
+		mutex_unlock(&device->mutex);
+		return -EBUSY;
+	}
+	mutex_unlock(&device->mutex);
+	synchronize_srcu(&srcu);
+	return 0;
+}
+EXPORT_SYMBOL(hmm_device_unregister);
+
+
+static int __init hmm_subsys_init(void)
+{
+	return init_srcu_struct(&srcu);
+}
+subsys_initcall(hmm_subsys_init);
-- 
1.9.3



* [PATCH 4/7] HMM: add HMM page table.
  2014-12-22 16:48 ` j.glisse
@ 2014-12-22 16:48   ` j.glisse
  -1 siblings, 0 replies; 21+ messages in thread
From: j.glisse @ 2014-12-22 16:48 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Shachar Raindel,
	Liran Liss, Roland Dreier, Ben Sander, Greg Stoner,
	John Bridgman, Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse,
	Jatin Kumar

From: Jérôme Glisse <jglisse@redhat.com>

The main purpose of heterogeneous memory management is to mirror a process
address space. To do so it must maintain a secondary page table that is used
by the device driver to program the device or build a device specific page
table.

A radix tree cannot be used to create this secondary page table because HMM
needs more flags than RADIX_TREE_MAX_TAGS (while this can be increased, we
believe HMM will require so many flags that the cost will become prohibitive
to other users of the radix tree).

Moreover the radix tree is built around long, but for HMM we need to store dma
addresses and on some platforms sizeof(dma_addr_t) > sizeof(long). Thus the
radix tree is unsuitable to fulfill HMM requirements, which is why we
introduce this code, which allows creating a page table that can grow and
shrink dynamically.

The design is very close to the CPU page table as it reuses some of its
features, such as the spinlock embedded in struct page.
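
To make the entry format concrete, here is a minimal userspace sketch of how a
page frame number and flag bits can share one dma_addr_t-sized word, and of
the per-level directory index computation. The flag values and the pfn shift
mirror the definitions this patch adds in include/linux/hmm_pt.h; the sk_
prefixed names, the 64-bit stand-in for dma_addr_t, and the shift and mask
values passed to sk_index() are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for dma_addr_t; assumed to be 64 bits wide in this sketch. */
typedef uint64_t sk_dma_addr_t;

/* Same bit layout as HMM_PTE_* and HMM_PFN_SHIFT in the patch. */
#define SK_PTE_VALID	(1u << 0)
#define SK_PTE_WRITE	(1u << 1)
#define SK_PTE_DIRTY	(1u << 2)
#define SK_PFN_SHIFT	4

/* Build a valid entry: pfn in the high bits, flags in the low bits. */
static sk_dma_addr_t sk_pte_from_pfn(sk_dma_addr_t pfn)
{
	return (pfn << SK_PFN_SHIFT) | SK_PTE_VALID;
}

/* Recover the pfn by shifting the flag bits away. */
static sk_dma_addr_t sk_pte_pfn(sk_dma_addr_t pte)
{
	return pte >> SK_PFN_SHIFT;
}

/* Directory index for one level: (address >> shift[level]) & directory_mask */
static unsigned sk_index(uint64_t addr, unsigned shift, uint64_t directory_mask)
{
	return (unsigned)((addr >> shift) & directory_mask);
}
```

With 4 KiB pages and 512 entries per directory (both assumed here),
sk_index(addr, 12, 0x1ff) selects the entry in the last-level directory and
sk_index(addr, 21, 0x1ff) the entry one level up.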

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 MAINTAINERS            |   2 +
 include/linux/hmm_pt.h | 261 ++++++++++++++++++++++++++++++
 mm/Makefile            |   2 +-
 mm/hmm_pt.c            | 425 +++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 689 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/hmm_pt.h
 create mode 100644 mm/hmm_pt.c

diff --git a/MAINTAINERS b/MAINTAINERS
index c5bb62e..02f9f29 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4539,6 +4539,8 @@ L:	linux-mm@kvack.org
 S:	Maintained
 F:	mm/hmm.c
 F:	include/linux/hmm.h
+F:	mm/hmm_pt.c
+F:	include/linux/hmm_pt.h
 
 HOST AP DRIVER
 M:	Jouni Malinen <j@w1.fi>
diff --git a/include/linux/hmm_pt.h b/include/linux/hmm_pt.h
new file mode 100644
index 0000000..88fc519
--- /dev/null
+++ b/include/linux/hmm_pt.h
@@ -0,0 +1,261 @@
+/*
+ * Copyright 2014 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/*
+ * This provides a set of helpers for the HMM page table. See
+ * include/linux/hmm.h for a description of what HMM is.
+ *
+ * The HMM page table relies on a locking mechanism similar to the CPU page
+ * table for page table updates. It uses the spinlock embedded inside the
+ * struct page to protect changes to a page table directory, which should
+ * minimize lock contention for concurrent updates.
+ *
+ * It also provides a directory tree protection mechanism. Unlike the CPU page
+ * table there is no mmap semaphore to protect the directory tree from removal,
+ * and this is done intentionally so that concurrent removal/insertion of
+ * directories inside the tree can happen.
+ *
+ * So anyone walking down the page table must protect the directories it
+ * traverses so they are not freed by some other thread. This is done by using
+ * a reference counter for each directory. Before traversing a directory a
+ * reference is taken and once traversal is done the reference is dropped.
+ *
+ * A directory entry dereference and refcount increment of sub-directory page
+ * must happen in a critical rcu section so that directory page removal can
+ * gracefully wait for all possible other threads that might have dereferenced
+ * the directory.
+ */
+#ifndef _HMM_PT_H
+#define _HMM_PT_H
+
+/*
+ * The HMM page table entry does not reflect any specific hardware. It is just
+ * a common entry format used internally by HMM and exposed to HMM users so
+ * they can extract information from the HMM page table.
+ */
+#define HMM_PTE_VALID		(1 << 0)
+#define HMM_PTE_WRITE		(1 << 1)
+#define HMM_PTE_DIRTY		(1 << 2)
+#define HMM_PFN_SHIFT		4
+#define HMM_PFN_MASK		(~((dma_addr_t)((1 << HMM_PFN_SHIFT) - 1)))
+
+static inline dma_addr_t hmm_pte_from_pfn(dma_addr_t pfn)
+{
+	return (pfn << HMM_PFN_SHIFT) | HMM_PTE_VALID;
+}
+
+static inline unsigned long hmm_pte_pfn(dma_addr_t pte)
+{
+	return pte >> HMM_PFN_SHIFT;
+}
+
+#define HMM_PT_MAX_LEVEL	6
+
+/* struct hmm_pt - HMM page table structure.
+ *
+ * @mask: Array of address mask value of each level.
+ * @directory_mask: Mask for directory index (see below).
+ * @last: Last valid address (inclusive).
+ * @pgd: Page global directory (top level of the directory tree).
+ * @lock: Shared lock, used if spinlock_t does not fit in struct page.
+ * @shift: Array of address shift value of each level.
+ * @llevel: Last level.
+ *
+ * The index into each directory for a given address and level is :
+ *   (address >> shift[level]) & directory_mask
+ *
+ * Only hmm_pt.last field needs to be set before calling hmm_pt_init().
+ */
+struct hmm_pt {
+	unsigned long		mask[HMM_PT_MAX_LEVEL];
+	unsigned long		directory_mask;
+	unsigned long		last;
+	dma_addr_t		*pgd;
+	spinlock_t		lock;
+	unsigned char		shift[HMM_PT_MAX_LEVEL];
+	unsigned char		llevel;
+};
+
+int hmm_pt_init(struct hmm_pt *pt);
+void hmm_pt_fini(struct hmm_pt *pt);
+
+static inline unsigned hmm_pt_index(struct hmm_pt *pt,
+				    unsigned long addr,
+				    unsigned level)
+{
+	return (addr >> pt->shift[level]) & pt->directory_mask;
+}
+
+#if USE_SPLIT_PTE_PTLOCKS && !ALLOC_SPLIT_PTLOCKS
+static inline void hmm_pt_directory_lock(struct hmm_pt *pt,
+					 struct page *ptd,
+					 unsigned level)
+{
+	if (level)
+		spin_lock(&ptd->ptl);
+	else
+		spin_lock(&pt->lock);
+}
+
+static inline void hmm_pt_directory_unlock(struct hmm_pt *pt,
+					   struct page *ptd,
+					   unsigned level)
+{
+	if (level)
+		spin_unlock(&ptd->ptl);
+	else
+		spin_unlock(&pt->lock);
+}
+#else /* USE_SPLIT_PTE_PTLOCKS && !ALLOC_SPLIT_PTLOCKS */
+static inline void hmm_pt_directory_lock(struct hmm_pt *pt,
+					 struct page *ptd,
+					 unsigned level)
+{
+	spin_lock(&pt->lock);
+}
+
+static inline void hmm_pt_directory_unlock(struct hmm_pt *pt,
+					   struct page *ptd,
+					   unsigned level)
+{
+	spin_unlock(&pt->lock);
+}
+#endif
+
+static inline unsigned long hmm_pt_level_start(struct hmm_pt *pt,
+					       unsigned long addr,
+					       unsigned level)
+{
+	return addr & pt->mask[level];
+}
+
+static inline unsigned long hmm_pt_level_end(struct hmm_pt *pt,
+					     unsigned long addr,
+					     unsigned level)
+{
+	return (addr | (~pt->mask[level])) + 1UL;
+}
+
+static inline unsigned long hmm_pt_level_next(struct hmm_pt *pt,
+					      unsigned long addr,
+					      unsigned long end,
+					      unsigned level)
+{
+	addr = (addr | (~pt->mask[level])) + 1UL;
+	return (addr - 1 < end - 1) ? addr : end;
+}
+
+
+/* struct hmm_pt_iter - page table iterator states.
+ *
+ * @ptd: Array of directory struct page pointers, one per level.
+ * @ptdp: Array of pointers to the mapped directory of each level.
+ * @dead_directories: List of directories that died while walking page table.
+ * @cur: Current address.
+ */
+struct hmm_pt_iter {
+	struct page		*ptd[HMM_PT_MAX_LEVEL - 1];
+	dma_addr_t		*ptdp[HMM_PT_MAX_LEVEL - 1];
+	struct list_head	dead_directories;
+	unsigned long		cur;
+};
+
+void hmm_pt_iter_init(struct hmm_pt_iter *iter);
+void hmm_pt_iter_fini(struct hmm_pt_iter *iter, struct hmm_pt *pt);
+unsigned long hmm_pt_iter_next(struct hmm_pt_iter *iter,
+			       struct hmm_pt *pt,
+			       unsigned long addr,
+			       unsigned long end);
+dma_addr_t *hmm_pt_iter_update(struct hmm_pt_iter *iter,
+			       struct hmm_pt *pt,
+			       unsigned long addr);
+dma_addr_t *hmm_pt_iter_fault(struct hmm_pt_iter *iter,
+			      struct hmm_pt *pt,
+			      unsigned long addr);
+
+/* hmm_pt_iter_directory_ref() - reference a directory.
+ *
+ * @iter: Iterator states that currently protect the directory.
+ * @level: Level of the directory to reference.
+ *
+ * This function takes a reference on a directory. It is illegal for the
+ * refcount to be 0, as this helper should only be called while the iterator
+ * is protecting the directory (ie the iterator holds a reference on it).
+ *
+ * HMM users will call this with level = pt.llevel; any other value is
+ * suspicious outside of hmm_pt code.
+ */
+static inline void hmm_pt_iter_directory_ref(struct hmm_pt_iter *iter,
+					     char level)
+{
+	/* Nothing to do for root level. */
+	if (!level)
+		return;
+
+	if (!atomic_inc_not_zero(&iter->ptd[level - 1]->_mapcount))
+		/* Illegal this should not happen. */
+		BUG();
+}
+
+/* hmm_pt_iter_directory_unref() - unreference a directory.
+ *
+ * @iter: Iterator states that currently protect the directory.
+ * @level: Level of the directory to unreference.
+ *
+ * This function drops a reference on a directory. It is illegal for the
+ * refcount to reach 0 here, as this helper should only be called while the
+ * iterator is protecting the directory (ie the iterator holds a reference).
+ *
+ * HMM users will call this with level = pt.llevel; any other value is
+ * suspicious outside of hmm_pt code.
+ */
+static inline void hmm_pt_iter_directory_unref(struct hmm_pt_iter *iter,
+					       char level)
+{
+	/* Nothing to do for root level. */
+	if (!level)
+		return;
+
+	if (!atomic_dec_and_test(&iter->ptd[level - 1]->_mapcount))
+		return;
+
+	/* Illegal this should not happen. */
+	BUG();
+}
+
+static inline dma_addr_t *hmm_pt_iter_ptdp(struct hmm_pt_iter *iter,
+					   struct hmm_pt *pt,
+					   unsigned long addr)
+{
+	BUG_ON(!iter->ptd[pt->llevel - 1] ||
+	       addr < hmm_pt_level_start(pt, iter->cur, pt->llevel) ||
+	       addr >= hmm_pt_level_end(pt, iter->cur, pt->llevel));
+	return &iter->ptdp[pt->llevel - 1][hmm_pt_index(pt, addr, pt->llevel)];
+}
+
+static inline void hmm_pt_iter_directory_lock(struct hmm_pt_iter *iter,
+					      struct hmm_pt *pt)
+{
+	hmm_pt_directory_lock(pt, iter->ptd[pt->llevel - 1], pt->llevel);
+}
+
+static inline void hmm_pt_iter_directory_unlock(struct hmm_pt_iter *iter,
+						struct hmm_pt *pt)
+{
+	hmm_pt_directory_unlock(pt, iter->ptd[pt->llevel - 1], pt->llevel);
+}
+
+
+#endif /* _HMM_PT_H */
diff --git a/mm/Makefile b/mm/Makefile
index cb2f9ed..d2e50f2 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -73,4 +73,4 @@ obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
 obj-$(CONFIG_CMA)	+= cma.o
 obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
 obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
-obj-$(CONFIG_HMM) += hmm.o
+obj-$(CONFIG_HMM) += hmm.o hmm_pt.o
diff --git a/mm/hmm_pt.c b/mm/hmm_pt.c
new file mode 100644
index 0000000..4af7ca8
--- /dev/null
+++ b/mm/hmm_pt.c
@@ -0,0 +1,425 @@
+/*
+ * Copyright 2014 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/*
+ * Helpers for the HMM page table. See include/linux/hmm.h for a description of
+ * what HMM is, and include/linux/hmm_pt.h for the page table design.
+ */
+#include <linux/highmem.h>
+#include <linux/slab.h>
+#include <linux/hmm_pt.h>
+
+/* hmm_pt_init() - initialize HMM page table.
+ *
+ * @pt: HMM page table to initialize.
+ *
+ * This function initializes the HMM page table and allocates memory for the
+ * global directory. Only the hmm_pt.last field needs to be set prior to
+ * calling this function.
+ */
+int hmm_pt_init(struct hmm_pt *pt)
+{
+	unsigned directory_shift, i = 0, npgd;
+
+	pt->last &= PAGE_MASK;
+	spin_lock_init(&pt->lock);
+	/* Directory shift is the number of bits that a single directory level
+	 * represents. For instance if PAGE_SIZE is 4096 and each entry takes 8
+	 * bytes (sizeof(dma_addr_t) == 8) then directory_shift = 9.
+	 */
+	directory_shift = PAGE_SHIFT - ilog2(sizeof(dma_addr_t));
+	/* Level 0 is the root level of the page table. It might use fewer
+	 * bits than directory_shift but every sub-directory level uses all
+	 * directory_shift bits.
+	 *
+	 * For instance if hmm_pt.last == (1UL << 48) - 1, PAGE_SHIFT == 12 and
+	 * sizeof(dma_addr_t) == 8 then :
+	 *   directory_shift = 9
+	 *   shift[0] = 39
+	 *   shift[1] = 30
+	 *   shift[2] = 21
+	 *   shift[3] = 12
+	 *   llevel = 3
+	 *
+	 * Note that shift[llevel] == PAGE_SHIFT because the last level
+	 * corresponds to the page table entry level (ignoring the case of huge
+	 * pages).
+	 */
+	pt->shift[0] = ((__fls(pt->last >> PAGE_SHIFT) / directory_shift) *
+			directory_shift) + PAGE_SHIFT;
+	while (pt->shift[i++] > PAGE_SHIFT)
+		pt->shift[i] = pt->shift[i - 1] - directory_shift;
+	pt->llevel = i - 1;
+	pt->directory_mask = (1 << directory_shift) - 1;
+
+	for (i = 0; i <= pt->llevel; ++i)
+		pt->mask[i] = ~((1UL << pt->shift[i]) - 1);
+
+	npgd = (pt->last >> pt->shift[0]) + 1;
+	pt->pgd = kzalloc(npgd * sizeof(dma_addr_t), GFP_KERNEL);
+	if (!pt->pgd)
+		return -ENOMEM;
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_pt_init);
+
+static void hmm_pt_fini_directory(struct hmm_pt *pt,
+				  struct page *ptd,
+				  unsigned level)
+{
+	dma_addr_t *ptdp;
+	unsigned i;
+
+	if (level == pt->llevel)
+		return;
+
+	ptdp = kmap(ptd);
+	for (i = 0; i <= pt->directory_mask; ++i) {
+		struct page *lptd;
+
+		if (!(ptdp[i] & HMM_PTE_VALID))
+			continue;
+		lptd = pfn_to_page(hmm_pte_pfn(ptdp[i]));
+		ptdp[i] = 0;
+		hmm_pt_fini_directory(pt, lptd, level + 1);
+		atomic_set(&lptd->_mapcount, -1);
+		__free_page(lptd);
+	}
+	kunmap(ptd);
+}
+
+/* hmm_pt_fini() - finalize HMM page table.
+ *
+ * @pt: HMM page table to finalize.
+ *
+ * This function frees all resources of an HMM page table.
+ */
+void hmm_pt_fini(struct hmm_pt *pt)
+{
+	unsigned i;
+
+	/* Free all directories. */
+	for (i = 0; i <= (pt->last >> pt->shift[0]); ++i) {
+		struct page *ptd;
+
+		if (!(pt->pgd[i] & HMM_PTE_VALID))
+			continue;
+		ptd = pfn_to_page(hmm_pte_pfn(pt->pgd[i]));
+		pt->pgd[i] = 0;
+		hmm_pt_fini_directory(pt, ptd, 1);
+		atomic_set(&ptd->_mapcount, -1);
+		__free_page(ptd);
+	}
+
+	kfree(pt->pgd);
+	pt->pgd = NULL;
+}
+EXPORT_SYMBOL(hmm_pt_fini);
+
+
+/* hmm_pt_iter_init() - initialize iterator states.
+ *
+ * @iter: Iterator states.
+ *
+ * This function initializes the iterator states. It must always be paired
+ * with a call to hmm_pt_iter_fini().
+ */
+void hmm_pt_iter_init(struct hmm_pt_iter *iter)
+{
+	memset(iter->ptd, 0, sizeof(void *) * (HMM_PT_MAX_LEVEL - 1));
+	memset(iter->ptdp, 0, sizeof(void *) * (HMM_PT_MAX_LEVEL - 1));
+	INIT_LIST_HEAD(&iter->dead_directories);
+}
+EXPORT_SYMBOL(hmm_pt_iter_init);
+
+/* hmm_pt_iter_directory_unref_safe() - unref a directory that is safe to free.
+ *
+ * @iter: Iterator states.
+ * @pt: HMM page table.
+ * @level: Level of the directory to unref.
+ *
+ * This function drops a reference on a directory and adds it to the dead list
+ * if the directory no longer has any reference. It also clears the entry
+ * pointing to that directory in the upper level directory and drops a
+ * reference on the upper directory.
+ */
+static void hmm_pt_iter_directory_unref_safe(struct hmm_pt_iter *iter,
+					     struct hmm_pt *pt,
+					     unsigned level)
+{
+	struct page *upper_ptd;
+	dma_addr_t *upper_ptdp;
+
+	/* Nothing to do for root level. */
+	if (!level)
+		return;
+
+	if (!atomic_dec_and_test(&iter->ptd[level - 1]->_mapcount))
+		return;
+
+	upper_ptd = level > 1 ? iter->ptd[level - 2] : NULL;
+	upper_ptdp = level > 1 ? iter->ptdp[level - 2] : pt->pgd;
+	upper_ptdp = &upper_ptdp[hmm_pt_index(pt, iter->cur, level - 1)];
+	hmm_pt_directory_lock(pt, upper_ptd, level - 1);
+	/*
+	 * There might be a race between decrementing the reference count on a
+	 * directory and another thread faulting in a new directory. To avoid
+	 * erasing the new directory entry we need to check that the entry
+	 * still corresponds to the directory we are removing.
+	 */
+	if (hmm_pte_pfn(*upper_ptdp) == page_to_pfn(iter->ptd[level - 1]))
+		*upper_ptdp = 0;
+	hmm_pt_directory_unlock(pt, upper_ptd, level - 1);
+
+	/* Add it to delayed free list. */
+	list_add_tail(&iter->ptd[level - 1]->lru, &iter->dead_directories);
+
+	/*
+	 * Use the plain unref for the upper directory: the iterator still
+	 * holds its own reference, so the refcount should not reach 0 here.
+	 */
+	hmm_pt_iter_directory_unref(iter, level - 1);
+}
+
+static void hmm_pt_iter_unprotect_directory(struct hmm_pt_iter *iter,
+					    struct hmm_pt *pt,
+					    unsigned level)
+{
+	if (!iter->ptd[level - 1])
+		return;
+	kunmap(iter->ptd[level - 1]);
+	hmm_pt_iter_directory_unref_safe(iter, pt, level);
+	iter->ptd[level - 1] = NULL;
+}
+
+/* hmm_pt_iter_protect_directory() - protect a directory.
+ *
+ * @iter: Iterator states.
+ * @ptd: directory struct page to protect.
+ * @addr: Address of the directory.
+ * @level: Level of this directory (> 0).
+ * Returns -EINVAL on error, 1 if protection succeeded, 0 otherwise.
+ *
+ * This function protects a directory by taking a reference. It also maps
+ * the directory to allow CPU access.
+ *
+ * This function must be called from inside the RCU read-side critical section
+ * in which the table entry was converted to the directory struct page, and it
+ * drops that RCU read lock before returning. Taking the reference inside the
+ * critical section supports concurrent directory removal, because RCU
+ * synchronization then guarantees that the directory can be freed safely.
+ */
+int hmm_pt_iter_protect_directory(struct hmm_pt_iter *iter,
+				  struct page *ptd,
+				  unsigned long addr,
+				  unsigned level)
+{
+	/* This must be called inside an RCU read-side critical section. */
+	BUG_ON(!rcu_read_lock_held());
+
+	if (!level || iter->ptd[level - 1]) {
+		rcu_read_unlock();
+		return -EINVAL;
+	}
+
+	if (!atomic_inc_not_zero(&ptd->_mapcount)) {
+		rcu_read_unlock();
+		return 0;
+	}
+
+	rcu_read_unlock();
+
+	iter->ptd[level - 1] = ptd;
+	iter->ptdp[level - 1] = kmap(ptd);
+	iter->cur = addr;
+
+	return 1;
+}
+
+unsigned long hmm_pt_iter_next(struct hmm_pt_iter *iter,
+			       struct hmm_pt *pt,
+			       unsigned long addr,
+			       unsigned long end)
+{
+	unsigned i;
+
+	for (i = pt->llevel; i >= 1; --i) {
+		if (!iter->ptd[i - 1])
+			continue;
+		if (addr >= hmm_pt_level_start(pt, iter->cur, i) &&
+		    addr < hmm_pt_level_end(pt, iter->cur, i))
+			return hmm_pt_level_next(pt, iter->cur, end, i);
+	}
+
+	/*
+	 * No need for RCU protection: the worst case is that we return a now
+	 * dead address.
+	 */
+	if (pt->pgd[hmm_pt_index(pt, addr, 0)] & HMM_PTE_VALID)
+		return hmm_pt_level_next(pt, addr, end, pt->llevel);
+	for (; addr < end; addr = hmm_pt_level_next(pt, addr, end, 0))
+		if (pt->pgd[hmm_pt_index(pt, addr, 0)] & HMM_PTE_VALID)
+			return addr;
+	return end;
+}
+EXPORT_SYMBOL(hmm_pt_iter_next);
+
+dma_addr_t *hmm_pt_iter_update(struct hmm_pt_iter *iter,
+			       struct hmm_pt *pt,
+			       unsigned long addr)
+{
+	int i;
+
+	addr &= PAGE_MASK;
+
+	if (iter->ptd[pt->llevel - 1] &&
+	    addr >= hmm_pt_level_start(pt, iter->cur, pt->llevel) &&
+	    addr < hmm_pt_level_end(pt, iter->cur, pt->llevel))
+		return hmm_pt_iter_ptdp(iter, pt, addr);
+
+	/* First unprotect any directory that does not cover the address. */
+	for (i = pt->llevel; i >= 1; --i) {
+		if (!iter->ptd[i - 1])
+			continue;
+		if (addr >= hmm_pt_level_start(pt, iter->cur, i) &&
+		    addr < hmm_pt_level_end(pt, iter->cur, i))
+			break;
+		hmm_pt_iter_unprotect_directory(iter, pt, i);
+	}
+
+	/* Walk down to last level of the directory tree. */
+	for (; i < pt->llevel; ++i) {
+		struct page *ptd;
+		dma_addr_t pte, *ptdp;
+
+		rcu_read_lock();
+		ptdp = i ? iter->ptdp[i - 1] : pt->pgd;
+		pte = ACCESS_ONCE(ptdp[hmm_pt_index(pt, addr, i)]);
+		if (!(pte & HMM_PTE_VALID)) {
+			rcu_read_unlock();
+			return NULL;
+		}
+		ptd = pfn_to_page(hmm_pte_pfn(pte));
+		/* RCU read unlock inside hmm_pt_iter_protect_directory(). */
+		if (hmm_pt_iter_protect_directory(iter, ptd, addr, i + 1) != 1)
+			return NULL;
+	}
+
+	return hmm_pt_iter_ptdp(iter, pt, addr);
+}
+EXPORT_SYMBOL(hmm_pt_iter_update);
+
+dma_addr_t *hmm_pt_iter_fault(struct hmm_pt_iter *iter,
+			      struct hmm_pt *pt,
+			      unsigned long addr)
+{
+	dma_addr_t *ptdp = hmm_pt_iter_update(iter, pt, addr);
+	struct page *new = NULL;
+	int i;
+
+	if (ptdp)
+		return ptdp;
+
+	/* Populate directory tree structures. */
+	for (i = 1; i <= pt->llevel; ++i) {
+		struct page *upper_ptd;
+		dma_addr_t *upper_ptdp;
+
+		if (iter->ptd[i - 1])
+			continue;
+
+		new = new ? new : alloc_page(GFP_HIGHUSER | __GFP_ZERO);
+		if (!new)
+			return NULL;
+
+		upper_ptd = i > 1 ? iter->ptd[i - 2] : NULL;
+		upper_ptdp = i > 1 ? iter->ptdp[i - 2] : pt->pgd;
+		upper_ptdp = &upper_ptdp[hmm_pt_index(pt, addr, i - 1)];
+		hmm_pt_directory_lock(pt, upper_ptd, i - 1);
+		if (((*upper_ptdp) & HMM_PTE_VALID)) {
+			struct page *ptd;
+
+			ptd = pfn_to_page(hmm_pte_pfn(*upper_ptdp));
+			if (atomic_inc_not_zero(&ptd->_mapcount)) {
+				/* Already allocated by another thread. */
+				iter->ptd[i - 1] = ptd;
+				hmm_pt_directory_unlock(pt, upper_ptd, i - 1);
+				iter->ptdp[i - 1] = kmap(ptd);
+				iter->cur = hmm_pt_level_start(pt, addr, i);
+				continue;
+			}
+			/*
+			 * We raced with the removal of a dead directory, so it is
+			 * safe to overwrite the *upper_ptdp entry with the new entry.
+			 */
+		}
+		/* Initialize struct page field for the directory. */
+		atomic_set(&new->_mapcount, 1);
+#if USE_SPLIT_PTE_PTLOCKS && !ALLOC_SPLIT_PTLOCKS
+		spin_lock_init(&new->ptl);
+#endif
+		*upper_ptdp = hmm_pte_from_pfn(page_to_pfn(new));
+		hmm_pt_iter_directory_ref(iter, i - 1);
+		/* Unlock upper directory and map the new directory. */
+		hmm_pt_directory_unlock(pt, upper_ptd, i - 1);
+		iter->ptd[i - 1] = new;
+		iter->ptdp[i - 1] = kmap(new);
+		iter->cur = hmm_pt_level_start(pt, addr, i);
+		new = NULL;
+	}
+	if (new)
+		__free_page(new);
+	return hmm_pt_iter_ptdp(iter, pt, addr);
+}
+
+/* hmm_pt_iter_fini() - finalize iterator.
+ *
+ * @iter: Iterator states.
+ * @pt: HMM page table.
+ *
+ * This function cleans up the iterator by unmapping and unreferencing any
+ * directory still mapped and referenced. It also frees any dead directory.
+ */
+void hmm_pt_iter_fini(struct hmm_pt_iter *iter, struct hmm_pt *pt)
+{
+	struct page *ptd, *tmp;
+	unsigned i;
+
+	for (i = pt->llevel; i >= 1; --i) {
+		if (!iter->ptd[i - 1])
+			continue;
+		hmm_pt_iter_unprotect_directory(iter, pt, i);
+	}
+
+	/* Avoid useless synchronize_rcu() if there is no directory to free. */
+	if (list_empty(&iter->dead_directories))
+		return;
+
+	/*
+	 * Another iterator may have dereferenced a dead directory entry and
+	 * looked up the struct page without having checked the reference count
+	 * yet. As all of that happens inside an RCU read-side critical section,
+	 * we must wait for a grace period before freeing any of the dead
+	 * directory pages.
+	 */
+	synchronize_rcu();
+	list_for_each_entry_safe(ptd, tmp, &iter->dead_directories, lru) {
+		list_del(&ptd->lru);
+		atomic_set(&ptd->_mapcount, -1);
+		__free_page(ptd);
+	}
+}
+EXPORT_SYMBOL(hmm_pt_iter_fini);
-- 
1.9.3
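
For readers following the directory protection scheme described in the
hmm_pt.h header comment, here is a minimal userspace sketch of the "take a
reference unless the count is already zero" step. It is only an illustration
under stated assumptions: struct dir, dir_get_unless_zero() and dir_put() are
hypothetical names, C11 atomics stand in for the kernel's atomic_t, and the
RCU read-side section and the deferred free after synchronize_rcu() that the
patch relies on are not modelled at all.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a directory page; the patch uses struct page
 * and its _mapcount field instead. */
struct dir {
	atomic_int refcount;
};

/* Take a reference unless the count already dropped to zero, in which case
 * the directory is dying and must not be used. Userspace analogue of the
 * kernel's atomic_inc_not_zero(). */
static bool dir_get_unless_zero(struct dir *d)
{
	int old = atomic_load(&d->refcount);

	while (old != 0) {
		/* On failure 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&d->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference. Returns true when the caller dropped the last one and
 * is therefore responsible for queueing the directory for freeing. */
static bool dir_put(struct dir *d)
{
	return atomic_fetch_sub(&d->refcount, 1) == 1;
}
```

A walker that gets false back from dir_get_unless_zero() treats the entry as
absent, which is roughly what hmm_pt_iter_update() does when
hmm_pt_iter_protect_directory() returns 0.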


* [PATCH 4/7] HMM: add HMM page table.
@ 2014-12-22 16:48   ` j.glisse
  0 siblings, 0 replies; 21+ messages in thread
From: j.glisse @ 2014-12-22 16:48 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Shachar Raindel,
	Liran Liss, Roland Dreier, Ben Sander, Greg Stoner,
	John Bridgman, Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse,
	Jatin Kumar

From: Jérôme Glisse <jglisse@redhat.com>

The main purpose of heterogeneous memory management is to mirror a process
address space. To do so it must maintain a secondary page table that is used
by the device driver to program the device or to build a device specific page
table.
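
As a rough illustration of how a driver might consume such a secondary page
table, the sketch below converts an HMM entry into a device entry. The
HMM_PTE_* values and the two hmm_pte_*() helpers are copied from hmm_pt.h in
this patch. Everything prefixed DEV_ or dev_ is a made-up device format for
the example only, and dma_addr_t is assumed to be 64 bits wide.

```c
#include <stdint.h>

/* Stand-in for the kernel's dma_addr_t (assumed 64-bit here). */
typedef uint64_t dma_addr_t;

/* Flag and shift values copied from hmm_pt.h in this patch. */
#define HMM_PTE_VALID	(1 << 0)
#define HMM_PTE_WRITE	(1 << 1)
#define HMM_PTE_DIRTY	(1 << 2)
#define HMM_PFN_SHIFT	4

static inline dma_addr_t hmm_pte_from_pfn(dma_addr_t pfn)
{
	return (pfn << HMM_PFN_SHIFT) | HMM_PTE_VALID;
}

static inline unsigned long hmm_pte_pfn(dma_addr_t pte)
{
	return pte >> HMM_PFN_SHIFT;
}

/* Hypothetical device PTE layout, invented for this example:
 * bit 0 = present, bit 1 = writable, bits 12 and up = page frame number. */
#define DEV_PTE_PRESENT		(1ULL << 0)
#define DEV_PTE_WRITABLE	(1ULL << 1)
#define DEV_PFN_SHIFT		12

/* Translate one HMM entry into the hypothetical device format. */
static uint64_t dev_pte_from_hmm(dma_addr_t pte)
{
	uint64_t dev;

	if (!(pte & HMM_PTE_VALID))
		return 0;
	dev = ((uint64_t)hmm_pte_pfn(pte) << DEV_PFN_SHIFT) | DEV_PTE_PRESENT;
	if (pte & HMM_PTE_WRITE)
		dev |= DEV_PTE_WRITABLE;
	return dev;
}
```

The point of the common format is that only dev_pte_from_hmm() would be
device specific; the walk over the HMM page table stays the same for every
driver.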

A radix tree can not be used to create this secondary page table because HMM
needs more flags than RADIX_TREE_MAX_TAGS (while this can be increased, we
believe HMM will require so many flags that the cost would become prohibitive
for other users of the radix tree).

Moreover the radix tree is built around long, but HMM needs to store DMA
addresses and on some platforms sizeof(dma_addr_t) > sizeof(long). The radix
tree is thus unsuitable for HMM, which is why we introduce this code to
create a page table that can grow and shrink dynamically.
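
To make the dependence on the entry size concrete, here is a small userspace
sketch of the level geometry that hmm_pt_init() computes. It assumes 4 KiB
pages and 8-byte entries; struct geometry, fls_index() and geometry_init()
are names invented for the sketch, with fls_index() standing in for the
kernel's __fls().

```c
#include <stdint.h>

#define PAGE_SHIFT		12	/* assumed 4 KiB pages */
#define ENTRY_SIZE_LOG2		3	/* assumed sizeof(dma_addr_t) == 8 */
#define HMM_PT_MAX_LEVEL	6

struct geometry {
	unsigned char shift[HMM_PT_MAX_LEVEL];
	unsigned char llevel;
};

/* Index of the most significant set bit (returns 0 for x == 0 or x == 1).
 * Userspace analogue of the kernel's __fls(). */
static unsigned fls_index(uint64_t x)
{
	unsigned bit = 0;

	while (x >>= 1)
		bit++;
	return bit;
}

/* Same arithmetic as hmm_pt_init(): 'last' is the last valid address
 * (inclusive). */
static void geometry_init(struct geometry *g, uint64_t last)
{
	unsigned directory_shift = PAGE_SHIFT - ENTRY_SIZE_LOG2;
	unsigned i = 0;

	last &= ~(((uint64_t)1 << PAGE_SHIFT) - 1);
	g->shift[0] = (fls_index(last >> PAGE_SHIFT) / directory_shift) *
		      directory_shift + PAGE_SHIFT;
	while (g->shift[i++] > PAGE_SHIFT)
		g->shift[i] = g->shift[i - 1] - directory_shift;
	g->llevel = i - 1;
}
```

With last = (1 << 48) - 1 this yields four levels with shifts 39, 30, 21 and
12. Passing (1 << 48) itself needs one more address bit and yields five
levels starting at shift 48, so the table only grows a root level when the
covered range requires it.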

The design is very close to the CPU page table as it reuses some of its
features, such as the spinlock embedded in struct page.
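
One more place where the similarity shows is range stepping. The sketch
below restates the arithmetic of hmm_pt_level_next() from this patch in
userspace; as far as I can tell it is the same comparison idiom that
pgd_addr_end() uses for CPU page tables. The function name level_next() and
the 2 MiB mask used in the comment are assumptions for the example.

```c
/* 'mask' is the address mask of one directory level, e.g.
 * ~((1UL << 21) - 1) for a level whose entries each cover 2 MiB. */
static unsigned long level_next(unsigned long mask, unsigned long addr,
				unsigned long end)
{
	/* Round up to the first address past the entry containing addr.
	 * This wraps to 0 for the last entry of the address space. */
	addr = (addr | ~mask) + 1UL;
	/* Comparing the (x - 1) values keeps the test correct when addr
	 * wrapped to 0, or when end == 0 stands for the top of the address
	 * space. */
	return (addr - 1 < end - 1) ? addr : end;
}
```

A range walk then looks like "for (; addr < end; addr = level_next(mask,
addr, end))", and the step never goes past end even in the last entry of
the address space.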

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 MAINTAINERS            |   2 +
 include/linux/hmm_pt.h | 261 ++++++++++++++++++++++++++++++
 mm/Makefile            |   2 +-
 mm/hmm_pt.c            | 425 +++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 689 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/hmm_pt.h
 create mode 100644 mm/hmm_pt.c

diff --git a/MAINTAINERS b/MAINTAINERS
index c5bb62e..02f9f29 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4539,6 +4539,8 @@ L:	linux-mm@kvack.org
 S:	Maintained
 F:	mm/hmm.c
 F:	include/linux/hmm.h
+F:	mm/hmm_pt.c
+F:	include/linux/hmm_pt.h
 
 HOST AP DRIVER
 M:	Jouni Malinen <j@w1.fi>
diff --git a/include/linux/hmm_pt.h b/include/linux/hmm_pt.h
new file mode 100644
index 0000000..88fc519
--- /dev/null
+++ b/include/linux/hmm_pt.h
@@ -0,0 +1,261 @@
+/*
+ * Copyright 2014 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/*
+ * This provides a set of helpers for the HMM page table. See
+ * include/linux/hmm.h for a description of what HMM is.
+ *
+ * The HMM page table relies on a locking mechanism similar to the CPU page
+ * table for updates. It uses the spinlock embedded inside struct page to
+ * protect changes to a page table directory, which should minimize lock
+ * contention for concurrent updates.
+ *
+ * It also provides a directory tree protection mechanism. Unlike the CPU page
+ * table, no mmap semaphore protects the directory tree from removal. This is
+ * intentional, so that concurrent removal and insertion of directories inside
+ * the tree can happen.
+ *
+ * Anyone walking down the page table must therefore protect each directory it
+ * traverses so that it is not freed by another thread. This is done with a
+ * per-directory reference counter: a reference is taken before traversing a
+ * directory and dropped once the traversal is done.
+ *
+ * Dereferencing a directory entry and incrementing the refcount of the
+ * sub-directory page must happen inside an RCU read-side critical section so
+ * that directory page removal can wait for every thread that might have
+ * dereferenced the directory.
+ */
+#ifndef _HMM_PT_H
+#define _HMM_PT_H
+
+/*
+ * The HMM page table entry does not reflect any specific hardware. It is just
+ * a common entry format used internally by HMM and exposed to HMM users so
+ * they can extract information from the HMM page table.
+ */
+#define HMM_PTE_VALID		(1 << 0)
+#define HMM_PTE_WRITE		(1 << 1)
+#define HMM_PTE_DIRTY		(1 << 2)
+#define HMM_PFN_SHIFT		4
+#define HMM_PFN_MASK		(~((dma_addr_t)((1 << HMM_PFN_SHIFT) - 1)))
+
+static inline dma_addr_t hmm_pte_from_pfn(dma_addr_t pfn)
+{
+	return (pfn << HMM_PFN_SHIFT) | HMM_PTE_VALID;
+}
+
+static inline unsigned long hmm_pte_pfn(dma_addr_t pte)
+{
+	return pte >> HMM_PFN_SHIFT;
+}
+
+#define HMM_PT_MAX_LEVEL	6
+
+/* struct hmm_pt - HMM page table structure.
+ *
+ * @mask: Array of address mask value of each level.
+ * @directory_mask: Mask for directory index (see below).
+ * @last: Last valid address (inclusive).
+ * @pgd: Page global directory (top level of the directory tree).
+ * @lock: Shared lock, used if spinlock_t does not fit in struct page.
+ * @shift: Array of address shift value of each level.
+ * @llevel: Last level.
+ *
+ * The index into each directory for a given address and level is :
+ *   (address >> shift[level]) & directory_mask
+ *
+ * Only hmm_pt.last field needs to be set before calling hmm_pt_init().
+ */
+struct hmm_pt {
+	unsigned long		mask[HMM_PT_MAX_LEVEL];
+	unsigned long		directory_mask;
+	unsigned long		last;
+	dma_addr_t		*pgd;
+	spinlock_t		lock;
+	unsigned char		shift[HMM_PT_MAX_LEVEL];
+	unsigned char		llevel;
+};
+
+int hmm_pt_init(struct hmm_pt *pt);
+void hmm_pt_fini(struct hmm_pt *pt);
+
+static inline unsigned hmm_pt_index(struct hmm_pt *pt,
+				    unsigned long addr,
+				    unsigned level)
+{
+	return (addr >> pt->shift[level]) & pt->directory_mask;
+}
+
+#if USE_SPLIT_PTE_PTLOCKS && !ALLOC_SPLIT_PTLOCKS
+static inline void hmm_pt_directory_lock(struct hmm_pt *pt,
+					 struct page *ptd,
+					 unsigned level)
+{
+	if (level)
+		spin_lock(&ptd->ptl);
+	else
+		spin_lock(&pt->lock);
+}
+
+static inline void hmm_pt_directory_unlock(struct hmm_pt *pt,
+					   struct page *ptd,
+					   unsigned level)
+{
+	if (level)
+		spin_unlock(&ptd->ptl);
+	else
+		spin_unlock(&pt->lock);
+}
+#else /* USE_SPLIT_PTE_PTLOCKS && !ALLOC_SPLIT_PTLOCKS */
+static inline void hmm_pt_directory_lock(struct hmm_pt *pt,
+					 struct page *ptd,
+					 unsigned level)
+{
+	spin_lock(&pt->lock);
+}
+
+static inline void hmm_pt_directory_unlock(struct hmm_pt *pt,
+					   struct page *ptd,
+					   unsigned level)
+{
+	spin_unlock(&pt->lock);
+}
+#endif
+
+static inline unsigned long hmm_pt_level_start(struct hmm_pt *pt,
+					       unsigned long addr,
+					       unsigned level)
+{
+	return addr & pt->mask[level];
+}
+
+static inline unsigned long hmm_pt_level_end(struct hmm_pt *pt,
+					     unsigned long addr,
+					     unsigned level)
+{
+	return (addr | (~pt->mask[level])) + 1UL;
+}
+
+static inline unsigned long hmm_pt_level_next(struct hmm_pt *pt,
+					      unsigned long addr,
+					      unsigned long end,
+					      unsigned level)
+{
+	addr = (addr | (~pt->mask[level])) + 1UL;
+	return (addr - 1 < end - 1) ? addr : end;
+}
+
+
+/* struct hmm_pt_iter - page table iterator states.
+ *
+ * @ptd: Array of directory struct page pointers, one per level.
+ * @ptdp: Array of pointers to the mapped directory of each level.
+ * @dead_directories: List of directories that died while walking page table.
+ * @cur: Current address.
+ */
+struct hmm_pt_iter {
+	struct page		*ptd[HMM_PT_MAX_LEVEL - 1];
+	dma_addr_t		*ptdp[HMM_PT_MAX_LEVEL - 1];
+	struct list_head	dead_directories;
+	unsigned long		cur;
+};
+
+void hmm_pt_iter_init(struct hmm_pt_iter *iter);
+void hmm_pt_iter_fini(struct hmm_pt_iter *iter, struct hmm_pt *pt);
+unsigned long hmm_pt_iter_next(struct hmm_pt_iter *iter,
+			       struct hmm_pt *pt,
+			       unsigned long addr,
+			       unsigned long end);
+dma_addr_t *hmm_pt_iter_update(struct hmm_pt_iter *iter,
+			       struct hmm_pt *pt,
+			       unsigned long addr);
+dma_addr_t *hmm_pt_iter_fault(struct hmm_pt_iter *iter,
+			      struct hmm_pt *pt,
+			      unsigned long addr);
+
+/* hmm_pt_iter_directory_ref() - reference a directory.
+ *
+ * @iter: Iterator states that currently protect the directory.
+ * @level: Level of the directory to reference.
+ *
+ * This function takes a reference on a directory. It is illegal for the
+ * refcount to be 0, as this helper should only be called while the iterator
+ * is protecting the directory (ie the iterator holds a reference on it).
+ *
+ * HMM users will call this with level = pt.llevel; any other value is
+ * suspicious outside of hmm_pt code.
+ */
+static inline void hmm_pt_iter_directory_ref(struct hmm_pt_iter *iter,
+					     char level)
+{
+	/* Nothing to do for root level. */
+	if (!level)
+		return;
+
+	if (!atomic_inc_not_zero(&iter->ptd[level - 1]->_mapcount))
+		/* Illegal this should not happen. */
+		BUG();
+}
+
+/* hmm_pt_iter_directory_unref() - unreference a directory.
+ *
+ * @iter: Iterator states that currently protect the directory.
+ * @level: Level of the directory to unreference.
+ *
+ * This function drops a reference on a directory. It is illegal for the
+ * refcount to reach 0 here, as this helper should only be called while the
+ * iterator is protecting the directory (ie the iterator holds a reference).
+ *
+ * HMM users will call this with level = pt.llevel; any other value is
+ * suspicious outside of hmm_pt code.
+ */
+static inline void hmm_pt_iter_directory_unref(struct hmm_pt_iter *iter,
+					       char level)
+{
+	/* Nothing to do for root level. */
+	if (!level)
+		return;
+
+	if (!atomic_dec_and_test(&iter->ptd[level - 1]->_mapcount))
+		return;
+
+	/* Illegal this should not happen. */
+	BUG();
+}
+
+static inline dma_addr_t *hmm_pt_iter_ptdp(struct hmm_pt_iter *iter,
+					   struct hmm_pt *pt,
+					   unsigned long addr)
+{
+	BUG_ON(!iter->ptd[pt->llevel - 1] ||
+	       addr < hmm_pt_level_start(pt, iter->cur, pt->llevel) ||
+	       addr >= hmm_pt_level_end(pt, iter->cur, pt->llevel));
+	return &iter->ptdp[pt->llevel - 1][hmm_pt_index(pt, addr, pt->llevel)];
+}
+
+static inline void hmm_pt_iter_directory_lock(struct hmm_pt_iter *iter,
+					      struct hmm_pt *pt)
+{
+	hmm_pt_directory_lock(pt, iter->ptd[pt->llevel - 1], pt->llevel);
+}
+
+static inline void hmm_pt_iter_directory_unlock(struct hmm_pt_iter *iter,
+						struct hmm_pt *pt)
+{
+	hmm_pt_directory_unlock(pt, iter->ptd[pt->llevel - 1], pt->llevel);
+}
+
+
+#endif /* _HMM_PT_H */
diff --git a/mm/Makefile b/mm/Makefile
index cb2f9ed..d2e50f2 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -73,4 +73,4 @@ obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
 obj-$(CONFIG_CMA)	+= cma.o
 obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
 obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
-obj-$(CONFIG_HMM) += hmm.o
+obj-$(CONFIG_HMM) += hmm.o hmm_pt.o
diff --git a/mm/hmm_pt.c b/mm/hmm_pt.c
new file mode 100644
index 0000000..4af7ca8
--- /dev/null
+++ b/mm/hmm_pt.c
@@ -0,0 +1,425 @@
+/*
+ * Copyright 2014 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/*
+ * Helpers for the HMM page table. See include/linux/hmm.h for a description
+ * of what HMM is, and include/linux/hmm_pt.h for the page table itself.
+ */
+#include <linux/highmem.h>
+#include <linux/slab.h>
+#include <linux/hmm_pt.h>
+
+/* hmm_pt_init() - initialize HMM page table.
+ *
+ * @pt: HMM page table to initialize.
+ *
+ * This function will initialize the HMM page table and allocate memory for the
+ * global directory. Only the hmm_pt.last field needs to be set prior to calling
+ * this function.
+ */
+int hmm_pt_init(struct hmm_pt *pt)
+{
+	unsigned directory_shift, i = 0, npgd;
+
+	pt->last &= PAGE_MASK;
+	spin_lock_init(&pt->lock);
+	/* Directory shift is the number of bits that a single directory level
+	 * represents. For instance if PAGE_SIZE is 4096 and each entry takes 8
+	 * bytes (sizeof(dma_addr_t) == 8) then directory_shift = 9.
+	 */
+	directory_shift = PAGE_SHIFT - ilog2(sizeof(dma_addr_t));
+	/* Level 0 is the root level of the page table. It might use fewer
+	 * bits than directory_shift but all sub-directory levels will use all
+	 * directory_shift bits.
+	 *
+	 * For instance if hmm_pt.last == (1UL << 48) - 1, PAGE_SHIFT == 12 and
+	 * sizeof(dma_addr_t) == 8 then :
+	 *   directory_shift = 9
+	 *   shift[0] = 39
+	 *   shift[1] = 30
+	 *   shift[2] = 21
+	 *   shift[3] = 12
+	 *   llevel = 3
+	 *
+	 * Note that shift[llevel] == PAGE_SHIFT because the last level
+	 * corresponds to the page table entry level (ignoring the case of huge
+	 * pages).
+	 */
+	pt->shift[0] = ((__fls(pt->last >> PAGE_SHIFT) / directory_shift) *
+			directory_shift) + PAGE_SHIFT;
+	while (pt->shift[i++] > PAGE_SHIFT)
+		pt->shift[i] = pt->shift[i - 1] - directory_shift;
+	pt->llevel = i - 1;
+	pt->directory_mask = (1 << directory_shift) - 1;
+
+	for (i = 0; i <= pt->llevel; ++i)
+		pt->mask[i] = ~((1UL << pt->shift[i]) - 1);
+
+	npgd = (pt->last >> pt->shift[0]) + 1;
+	pt->pgd = kzalloc(npgd * sizeof(dma_addr_t), GFP_KERNEL);
+	if (!pt->pgd)
+		return -ENOMEM;
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_pt_init);
+
+static void hmm_pt_fini_directory(struct hmm_pt *pt,
+				  struct page *ptd,
+				  unsigned level)
+{
+	dma_addr_t *ptdp;
+	unsigned i;
+
+	if (level == pt->llevel)
+		return;
+
+	ptdp = kmap(ptd);
+	for (i = 0; i <= pt->directory_mask; ++i) {
+		struct page *lptd;
+
+		if (!(ptdp[i] & HMM_PTE_VALID))
+			continue;
+		lptd = pfn_to_page(hmm_pte_pfn(ptdp[i]));
+		ptdp[i] = 0;
+		hmm_pt_fini_directory(pt, lptd, level + 1);
+		atomic_set(&lptd->_mapcount, -1);
+		__free_page(lptd);
+	}
+	kunmap(ptd);
+}
+
+/* hmm_pt_fini() - finalize HMM page table.
+ *
+ * @pt: HMM page table to finalize.
+ *
+ * This function will free all resources of a directory page table.
+ */
+void hmm_pt_fini(struct hmm_pt *pt)
+{
+	unsigned i;
+
+	/* Free all directory. */
+	for (i = 0; i <= (pt->last >> pt->shift[0]); ++i) {
+		struct page *ptd;
+
+		if (!(pt->pgd[i] & HMM_PTE_VALID))
+			continue;
+		ptd = pfn_to_page(hmm_pte_pfn(pt->pgd[i]));
+		pt->pgd[i] = 0;
+		hmm_pt_fini_directory(pt, ptd, 1);
+		atomic_set(&ptd->_mapcount, -1);
+		__free_page(ptd);
+	}
+
+	kfree(pt->pgd);
+	pt->pgd = NULL;
+}
+EXPORT_SYMBOL(hmm_pt_fini);
+
+
+/* hmm_pt_iter_init() - initialize iterator states.
+ *
+ * @iter: Iterator states.
+ *
+ * This function will initialize iterator states. It must always be paired with
+ * a call to hmm_pt_iter_fini().
+ */
+void hmm_pt_iter_init(struct hmm_pt_iter *iter)
+{
+	memset(iter->ptd, 0, sizeof(void *) * (HMM_PT_MAX_LEVEL - 1));
+	memset(iter->ptdp, 0, sizeof(void *) * (HMM_PT_MAX_LEVEL - 1));
+	INIT_LIST_HEAD(&iter->dead_directories);
+}
+EXPORT_SYMBOL(hmm_pt_iter_init);
+
+/* hmm_pt_iter_directory_unref_safe() - unref a directory that is safe to free.
+ *
+ * @iter: Iterator states.
+ * @pt: HMM page table.
+ * @level: Level of the directory to unref.
+ *
+ * This function will unreference a directory and add it to the dead list if
+ * the directory no longer has any reference. It will also clear the entry for
+ * that directory in the upper level directory as well as drop the reference
+ * on the upper directory.
+ */
+static void hmm_pt_iter_directory_unref_safe(struct hmm_pt_iter *iter,
+					     struct hmm_pt *pt,
+					     unsigned level)
+{
+	struct page *upper_ptd;
+	dma_addr_t *upper_ptdp;
+
+	/* Nothing to do for root level. */
+	if (!level)
+		return;
+
+	if (!atomic_dec_and_test(&iter->ptd[level - 1]->_mapcount))
+		return;
+
+	upper_ptd = level > 1 ? iter->ptd[level - 2] : NULL;
+	upper_ptdp = level > 1 ? iter->ptdp[level - 2] : pt->pgd;
+	upper_ptdp = &upper_ptdp[hmm_pt_index(pt, iter->cur, level - 1)];
+	hmm_pt_directory_lock(pt, upper_ptd, level - 1);
+	/*
+	 * There might be a race between decrementing the reference count on a
+	 * directory and another thread trying to fault in a new directory. To
+	 * avoid erasing the new directory entry we need to check that the entry
+	 * still corresponds to the directory we are removing.
+	 */
+	if (hmm_pte_pfn(*upper_ptdp) == page_to_pfn(iter->ptd[level - 1]))
+		*upper_ptdp = 0;
+	hmm_pt_directory_unlock(pt, upper_ptd, level - 1);
+
+	/* Add it to delayed free list. */
+	list_add_tail(&iter->ptd[level - 1]->lru, &iter->dead_directories);
+
+	/*
+	 * The upper directory is not safe to unref as we have an extra ref and
+	 * thus refcount should not reach 0.
+	 */
+	hmm_pt_iter_directory_unref(iter, level - 1);
+}
+
+static void hmm_pt_iter_unprotect_directory(struct hmm_pt_iter *iter,
+					    struct hmm_pt *pt,
+					    unsigned level)
+{
+	if (!iter->ptd[level - 1])
+		return;
+	kunmap(iter->ptd[level - 1]);
+	hmm_pt_iter_directory_unref_safe(iter, pt, level);
+	iter->ptd[level - 1] = NULL;
+}
+
+/* hmm_pt_iter_protect_directory() - protect a directory.
+ *
+ * @iter: Iterator states.
+ * @ptd: directory struct page to protect.
+ * @addr: Address of the directory.
+ * @level: Level of this directory (> 0).
+ * Returns -EINVAL on error, 1 if protection succeeded, 0 otherwise.
+ *
+ * This function will protect a directory by taking a reference. It will also
+ * map the directory to allow cpu access.
+ *
+ * This function must be called from inside the rcu read critical section that
+ * converts the table entry to the directory struct page. Doing so supports
+ * concurrent removal of a directory, because the reference is taken inside the
+ * rcu critical section and thus rcu synchronization guarantees that we can
+ * safely free the directory. The rcu read lock is dropped before returning.
+ */
+int hmm_pt_iter_protect_directory(struct hmm_pt_iter *iter,
+				  struct page *ptd,
+				  unsigned long addr,
+				  unsigned level)
+{
+	/* This must be called inside an rcu read section. */
+	BUG_ON(!rcu_read_lock_held());
+
+	if (!level || iter->ptd[level - 1]) {
+		rcu_read_unlock();
+		return -EINVAL;
+	}
+
+	if (!atomic_inc_not_zero(&ptd->_mapcount)) {
+		rcu_read_unlock();
+		return 0;
+	}
+
+	rcu_read_unlock();
+
+	iter->ptd[level - 1] = ptd;
+	iter->ptdp[level - 1] = kmap(ptd);
+	iter->cur = addr;
+
+	return 1;
+}
+
+unsigned long hmm_pt_iter_next(struct hmm_pt_iter *iter,
+			       struct hmm_pt *pt,
+			       unsigned long addr,
+			       unsigned long end)
+{
+	unsigned i;
+
+	for (i = pt->llevel; i >= 1; --i) {
+		if (!iter->ptd[i - 1])
+			continue;
+		if (addr >= hmm_pt_level_start(pt, iter->cur, i) &&
+		    addr < hmm_pt_level_end(pt, iter->cur, i))
+			return hmm_pt_level_next(pt, iter->cur, end, i);
+	}
+
+	/*
+	 * No need for rcu protection, the worst case is that we return a now
+	 * dead address.
+	 */
+	if (pt->pgd[hmm_pt_index(pt, addr, 0)] & HMM_PTE_VALID)
+		return hmm_pt_level_next(pt, addr, end, pt->llevel);
+	for (; addr < end; addr = hmm_pt_level_next(pt, addr, end, 0))
+		if (pt->pgd[hmm_pt_index(pt, addr, 0)] & HMM_PTE_VALID)
+			return addr;
+	return end;
+}
+EXPORT_SYMBOL(hmm_pt_iter_next);
+
+dma_addr_t *hmm_pt_iter_update(struct hmm_pt_iter *iter,
+			       struct hmm_pt *pt,
+			       unsigned long addr)
+{
+	int i;
+
+	addr &= PAGE_MASK;
+
+	if (iter->ptd[pt->llevel - 1] &&
+	    addr >= hmm_pt_level_start(pt, iter->cur, pt->llevel) &&
+	    addr < hmm_pt_level_end(pt, iter->cur, pt->llevel))
+		return hmm_pt_iter_ptdp(iter, pt, addr);
+
+	/* First unprotect any directory that does not cover the address. */
+	for (i = pt->llevel; i >= 1; --i) {
+		if (!iter->ptd[i - 1])
+			continue;
+		if (addr >= hmm_pt_level_start(pt, iter->cur, i) &&
+		    addr < hmm_pt_level_end(pt, iter->cur, i))
+			break;
+		hmm_pt_iter_unprotect_directory(iter, pt, i);
+	}
+
+	/* Walk down to last level of the directory tree. */
+	for (; i < pt->llevel; ++i) {
+		struct page *ptd;
+		dma_addr_t pte, *ptdp;
+
+		rcu_read_lock();
+		ptdp = i ? iter->ptdp[i - 1] : pt->pgd;
+		pte = ACCESS_ONCE(ptdp[hmm_pt_index(pt, addr, i)]);
+		if (!(pte & HMM_PTE_VALID)) {
+			rcu_read_unlock();
+			return NULL;
+		}
+		ptd = pfn_to_page(hmm_pte_pfn(pte));
+		/* RCU read unlock inside hmm_pt_iter_protect_directory(). */
+		if (hmm_pt_iter_protect_directory(iter, ptd, addr, i + 1) != 1)
+			return NULL;
+	}
+
+	return hmm_pt_iter_ptdp(iter, pt, addr);
+}
+EXPORT_SYMBOL(hmm_pt_iter_update);
+
+dma_addr_t *hmm_pt_iter_fault(struct hmm_pt_iter *iter,
+			      struct hmm_pt *pt,
+			      unsigned long addr)
+{
+	dma_addr_t *ptdp = hmm_pt_iter_update(iter, pt, addr);
+	struct page *new = NULL;
+	int i;
+
+	if (ptdp)
+		return ptdp;
+
+	/* Populate directory tree structures. */
+	for (i = 1; i <= pt->llevel; ++i) {
+		struct page *upper_ptd;
+		dma_addr_t *upper_ptdp;
+
+		if (iter->ptd[i - 1])
+			continue;
+
+		new = new ? new : alloc_page(GFP_HIGHUSER | __GFP_ZERO);
+		if (!new)
+			return NULL;
+
+		upper_ptd = i > 1 ? iter->ptd[i - 2] : NULL;
+		upper_ptdp = i > 1 ? iter->ptdp[i - 2] : pt->pgd;
+		upper_ptdp = &upper_ptdp[hmm_pt_index(pt, addr, i - 1)];
+		hmm_pt_directory_lock(pt, upper_ptd, i - 1);
+		if (((*upper_ptdp) & HMM_PTE_VALID)) {
+			struct page *ptd;
+
+			ptd = pfn_to_page(hmm_pte_pfn(*upper_ptdp));
+			if (atomic_inc_not_zero(&ptd->_mapcount)) {
+				/* Already allocated by another thread. */
+				iter->ptd[i - 1] = ptd;
+				hmm_pt_directory_unlock(pt, upper_ptd, i - 1);
+				iter->ptdp[i - 1] = kmap(ptd);
+				iter->cur = hmm_pt_level_start(pt, addr, i);
+				continue;
+			}
+			/*
+			 * We raced with removal of a dead directory, so it is
+			 * safe to overwrite the *upper_ptdp entry with new entry.
+			 */
+		}
+		/* Initialize struct page field for the directory. */
+		atomic_set(&new->_mapcount, 1);
+#if USE_SPLIT_PTE_PTLOCKS && !ALLOC_SPLIT_PTLOCKS
+		spin_lock_init(&new->ptl);
+#endif
+		*upper_ptdp = hmm_pte_from_pfn(page_to_pfn(new));
+		hmm_pt_iter_directory_ref(iter, i - 1);
+		/* Unlock upper directory and map the new directory. */
+		hmm_pt_directory_unlock(pt, upper_ptd, i - 1);
+		iter->ptd[i - 1] = new;
+		iter->ptdp[i - 1] = kmap(new);
+		iter->cur = hmm_pt_level_start(pt, addr, i);
+		new = NULL;
+	}
+	if (new)
+		__free_page(new);
+	return hmm_pt_iter_ptdp(iter, pt, addr);
+}
+
+/* hmm_pt_iter_fini() - finalize iterator.
+ *
+ * @iter: Iterator states.
+ * @pt: HMM page table.
+ *
+ * This function will clean up the iterator by unmapping and unreferencing any
+ * directory still mapped and referenced. It will also free any dead directory.
+ */
+void hmm_pt_iter_fini(struct hmm_pt_iter *iter, struct hmm_pt *pt)
+{
+	struct page *ptd, *tmp;
+	unsigned i;
+
+	for (i = pt->llevel; i >= 1; --i) {
+		if (!iter->ptd[i - 1])
+			continue;
+		hmm_pt_iter_unprotect_directory(iter, pt, i);
+	}
+
+	/* Avoid useless synchronize_rcu() if there is no directory to free. */
+	if (list_empty(&iter->dead_directories))
+		return;
+
+	/*
+	 * Some iterator may have dereferenced a dead directory entry and looked
+	 * up the struct page but not yet checked the reference count. As all
+	 * of the above happens in an rcu read critical section, we know that
+	 * we need to wait for a grace period before being able to free any of
+	 * the dead directory pages.
+	 */
+	synchronize_rcu();
+	list_for_each_entry_safe(ptd, tmp, &iter->dead_directories, lru) {
+		list_del(&ptd->lru);
+		atomic_set(&ptd->_mapcount, -1);
+		__free_page(ptd);
+	}
+}
+EXPORT_SYMBOL(hmm_pt_iter_fini);
-- 
1.9.3
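
For readers who want to check the level arithmetic in hmm_pt_init() by hand,
here is a small user-space sketch (not part of the patch). It assumes 4096-byte
pages and 8-byte entries; the model_* names, MODEL_MAX_LEVEL and the index
helper are hypothetical stand-ins. In particular model_pt_index() is only a
guess at what hmm_pt_index() does, inferred from how the pgd and directories
are indexed in the code above.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12u	/* assumption: 4096-byte pages */
#define MODEL_ENTRY_SIZE 8u	/* assumption: sizeof(dma_addr_t) == 8 */
#define MODEL_MAX_LEVEL  6u	/* hypothetical bound, enough for 64-bit addresses */

struct model_pt {
	uint64_t last;			/* last address covered, inclusive */
	unsigned shift[MODEL_MAX_LEVEL];
	unsigned llevel;
	uint64_t directory_mask;
};

/* Index of the most significant set bit, like the kernel's __fls(). */
static unsigned model_fls(uint64_t x)
{
	unsigned bit = 0;

	assert(x != 0);
	while (x >>= 1)
		bit++;
	return bit;
}

/* Same arithmetic as hmm_pt_init(), minus locking and the pgd allocation. */
static void model_pt_init(struct model_pt *pt)
{
	unsigned directory_shift = MODEL_PAGE_SHIFT - model_fls(MODEL_ENTRY_SIZE);
	unsigned i = 0;

	pt->last &= ~((UINT64_C(1) << MODEL_PAGE_SHIFT) - 1);
	pt->shift[0] = (model_fls(pt->last >> MODEL_PAGE_SHIFT) /
			directory_shift) * directory_shift + MODEL_PAGE_SHIFT;
	while (pt->shift[i] > MODEL_PAGE_SHIFT) {
		pt->shift[i + 1] = pt->shift[i] - directory_shift;
		i++;
	}
	pt->llevel = i;
	pt->directory_mask = (UINT64_C(1) << directory_shift) - 1;
}

/* Assumed behaviour of hmm_pt_index(): root index unmasked, others masked. */
static uint64_t model_pt_index(const struct model_pt *pt, uint64_t addr,
			       unsigned level)
{
	uint64_t idx = addr >> pt->shift[level];

	return level ? (idx & pt->directory_mask) : idx;
}
```

With last set to (1 << 48) - 1 this gives shift[] = 39, 30, 21, 12 and
llevel = 3. With last set to exactly 1 << 48 the same formula yields
shift[0] = 48 and one more level, so the value of last matters at power of
two boundaries.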


* [PATCH 5/7] HMM: add per mirror page table.
  2014-12-22 16:48 ` j.glisse
@ 2014-12-22 16:48   ` j.glisse
  -1 siblings, 0 replies; 21+ messages in thread
From: j.glisse @ 2014-12-22 16:48 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Shachar Raindel,
	Liran Liss, Roland Dreier, Ben Sander, Greg Stoner,
	John Bridgman, Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse,
	Jatin Kumar

From: Jérôme Glisse <jglisse@redhat.com>

This patch adds the per-mirror page table. It also propagates CPU page table
updates to this per-mirror page table using mmu_notifier callbacks. All updates
are contextualized with an HMM event structure that conveys all the information
needed by the device driver to take proper action (update its own MMU to
reflect changes and schedule proper flushing).

Core HMM is responsible for updating the per-mirror page table once the device
driver is done with its update. Most importantly, HMM will properly propagate
the HMM page table dirty bit to the underlying page.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 include/linux/hmm.h | 136 +++++++++++++++++++++++++++
 mm/hmm.c            | 263 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 399 insertions(+)
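
As an illustration only (not part of the patch), the per-entry update that
core HMM applies once the driver is done can be modelled in user space as
below. The MODEL_* bit positions are hypothetical; the real HMM_PTE_* values
live in hmm_pt.h. The page_dirty[] array stands in for set_page_dirty() on
the backing page. Only three of the event types are modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit layout, chosen for the sketch only. */
#define MODEL_PTE_VALID	(UINT64_C(1) << 0)
#define MODEL_PTE_WRITE	(UINT64_C(1) << 1)
#define MODEL_PTE_DIRTY	(UINT64_C(1) << 2)
#define MODEL_PFN_MASK	(~UINT64_C(0xfff))

enum model_etype { MODEL_ISDIRTY, MODEL_WRITE_PROTECT, MODEL_MUNMAP };

/* Bits that survive each event type, mirroring hmm_event_init(). */
static uint64_t model_pte_mask(enum model_etype etype)
{
	switch (etype) {
	case MODEL_ISDIRTY:
		return MODEL_PTE_VALID | MODEL_PTE_WRITE |
		       MODEL_PTE_DIRTY | MODEL_PFN_MASK;
	case MODEL_WRITE_PROTECT:
		return MODEL_PTE_VALID | MODEL_PFN_MASK;
	case MODEL_MUNMAP:
	default:
		return 0;
	}
}

/*
 * Apply one event to n entries: transfer the dirty bit to the backing page,
 * then keep only the bits allowed by the mask. Returns how many entries
 * stopped being valid (each one drops a reference on its directory in HMM).
 */
static size_t model_update_pt(uint64_t *pte, bool *page_dirty, size_t n,
			      uint64_t mask)
{
	size_t i, dropped = 0;

	for (i = 0; i < n; i++) {
		if (!(pte[i] & MODEL_PTE_VALID))
			continue;
		if (pte[i] & MODEL_PTE_DIRTY) {
			page_dirty[i] = true;
			pte[i] &= ~MODEL_PTE_DIRTY;
		}
		pte[i] &= mask;
		if (!(pte[i] & MODEL_PTE_VALID))
			dropped++;
	}
	return dropped;
}
```

A write-protect event therefore leaves the entry valid but clears the write
bit, while munmap and migrate clear the entry entirely. In both cases the
dirty state has been handed to the page first.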

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 8eddc15..dd34572 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -46,12 +46,65 @@
 #include <linux/mmu_notifier.h>
 #include <linux/workqueue.h>
 #include <linux/mman.h>
+#include <linux/hmm_pt.h>
 
 
 struct hmm_device;
 struct hmm_mirror;
+struct hmm_fence;
 struct hmm;
 
+/* hmm_fence - Device driver fence to batch updates and delay the wait.
+ *
+ * @mirror: The HMM mirror this fence is associated with.
+ * @list: List of fences.
+ *
+ * Each time HMM calls back into a device driver for an update, the driver can
+ * return a fence which core HMM will wait on. This allows HMM to batch updates
+ * to several different device drivers and then wait for each to complete.
+ *
+ * The hmm_fence structure is intended to be embedded inside a device driver
+ * specific fence structure.
+ */
+struct hmm_fence {
+	struct hmm_mirror	*mirror;
+	struct list_head	list;
+};
+
+
+/*
+ * hmm_event - each event is described by a type associated with a struct.
+ */
+enum hmm_etype {
+	HMM_NONE = 0,
+	HMM_ISDIRTY,
+	HMM_MIGRATE,
+	HMM_MUNMAP,
+	HMM_DEVICE_RFAULT,
+	HMM_DEVICE_WFAULT,
+	HMM_WRITE_PROTECT,
+};
+
+/* struct hmm_event - memory event information.
+ *
+ * @list: So HMM can keep track of all active events.
+ * @start: First address (inclusive).
+ * @end: Last address (exclusive).
+ * @fences: List of device fences associated with this event.
+ * @pte_mask: HMM pte update mask (bit(s) that are still valid).
+ * @etype: Event type (munmap, migrate, truncate, ...).
+ * @backoff: Only meaningful for device page fault.
+ */
+struct hmm_event {
+	struct list_head	list;
+	unsigned long		start;
+	unsigned long		end;
+	struct list_head	fences;
+	dma_addr_t		pte_mask;
+	enum hmm_etype		etype;
+	bool			backoff;
+};
+
 
 /* hmm_device - Each device must register one and only one hmm_device.
  *
@@ -72,6 +125,87 @@ struct hmm_device_ops {
 	 * from the mirror page table.
 	 */
 	void (*release)(struct hmm_mirror *mirror);
+
+	/* fence_wait() - wait on a device driver fence.
+	 *
+	 * @fence: The device driver fence struct.
+	 * Returns: 0 on success, -EIO on error, -EAGAIN to wait again.
+	 *
+	 * Called when hmm wants to wait for all operations associated with a
+	 * fence to complete (including device cache flush if the event mandates
+	 * it).
+	 *
+	 * Device driver must free fence and associated resources if it returns
+	 * something other than -EAGAIN. On -EAGAIN the fence must not be freed
+	 * as hmm will call back again.
+	 *
+	 * Return error if scheduled operation failed or if need to wait again.
+	 * -EIO Some input/output error with the device.
+	 * -EAGAIN The fence is not yet signaled, hmm reschedules waiting thread.
+	 *
+	 * All other return values trigger a warning and are transformed to -EIO.
+	 */
+	int (*fence_wait)(struct hmm_fence *fence);
+
+	/* fence_ref() - take a reference on the fence structure.
+	 *
+	 * @fence: Fence structure hmm is referencing.
+	 */
+	void (*fence_ref)(struct hmm_fence *fence);
+
+	/* fence_unref() - drop a reference on the fence structure.
+	 *
+	 * @fence: Fence structure hmm is dereferencing.
+	 */
+	void (*fence_unref)(struct hmm_fence *fence);
+
+	/* update() - update device mmu following an event.
+	 *
+	 * @mirror: The mirror that links process address space with the device.
+	 * @event: The event that triggered the update.
+	 * Returns: Valid fence ptr or NULL on success otherwise ERR_PTR.
+	 *
+	 * Called to update device page table for a range of addresses.
+	 * The event type provides the nature of the update :
+	 *   - Range is no longer valid (munmap).
+	 *   - Range protection changes (mprotect, COW, ...).
+	 *   - Range is unmapped (swap, reclaim, page migration, ...).
+	 *   - Device page fault.
+	 *   - ...
+	 *
+	 * Most device drivers only need to use pte_mask, as it reflects the
+	 * change that will happen to the HMM page table, ie :
+	 *   new_pte = old_pte & event->pte_mask;
+	 *
+	 * Device driver must not update the HMM mirror page table. Core HMM
+	 * will update HMM page table after the update is done (ie if a fence
+	 * is returned, after ->fence_wait() reports the fence is done).
+	 *
+	 * Any event that blocks further writes to the memory must also trigger
+	 * a device cache flush and everything has to be flushed to local memory
+	 * by the time the wait callback returns (if this callback returned a
+	 * fence, otherwise by the time this callback returns).
+	 *
+	 * Device must properly set the dirty bit using hmm_pte_mk_dirty helper
+	 * on each HMM page table entry.
+	 *
+	 * The driver should return a fence pointer or NULL on success. Device
+	 * driver should return a fence and delay the wait for the operation to
+	 * the fence wait callback. Returning a fence allows hmm to batch updates
+	 * to several devices and delay the wait on those until they all have
+	 * scheduled the update.
+	 *
+	 * Device driver must not fail lightly, any failure results in the
+	 * device process being killed.
+	 *
+	 * Return fence or NULL on success, error value otherwise :
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * All other return values trigger a warning and are transformed to -EIO.
+	 */
+	struct hmm_fence *(*update)(struct hmm_mirror *mirror,
+				    const struct hmm_event *event);
 };
 
 /* struct hmm_device - per device HMM structure
@@ -108,6 +242,7 @@ int hmm_device_unregister(struct hmm_device *device);
  * @hmm: The hmm struct this hmm_mirror is associated to.
  * @dlist: List of all hmm_mirror for same device.
  * @mlist: List of all hmm_mirror for same process.
+ * @pt: Mirror page table.
  *
  * Each device that want to mirror an address space must register one of this
  * struct for each of the address space it wants to mirror. Same device can
@@ -119,6 +254,7 @@ struct hmm_mirror {
 	struct hmm		*hmm;
 	struct list_head	dlist;
 	struct hlist_node	mlist;
+	struct hmm_pt		pt;
 };
 
 int hmm_mirror_register(struct hmm_mirror *mirror, struct hmm_device *device);
diff --git a/mm/hmm.c b/mm/hmm.c
index 55afec0..90ebe75 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -71,6 +71,72 @@ struct hmm {
 
 static struct mmu_notifier_ops hmm_notifier_ops;
 
+static void hmm_device_fence_wait(struct hmm_device *device,
+				  struct hmm_fence *fence);
+static void hmm_mirror_release(struct hmm_mirror *mirror);
+static inline int hmm_mirror_update(struct hmm_mirror *mirror,
+				    struct hmm_event *event);
+static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
+				 struct hmm_event *event);
+
+
+/* hmm_event - used to track information relating to an event.
+ *
+ * Each change to the cpu page table or fault from a device is considered an
+ * event by hmm. For each event there is a common set of things that need to
+ * be tracked. The hmm_event struct centralizes those and the helper functions
+ * help deal with all this.
+ */
+
+static inline int hmm_event_init(struct hmm_event *event,
+				 struct hmm *hmm,
+				 unsigned long start,
+				 unsigned long end,
+				 enum hmm_etype etype)
+{
+	event->start = start & PAGE_MASK;
+	event->end = min(end, hmm->vm_end);
+	if (event->start >= event->end)
+		return -EINVAL;
+	event->etype = etype;
+	INIT_LIST_HEAD(&event->fences);
+	switch (etype) {
+	case HMM_ISDIRTY:
+		event->pte_mask = HMM_PTE_VALID | HMM_PTE_WRITE |
+				  HMM_PTE_DIRTY | HMM_PFN_MASK;
+		break;
+	case HMM_DEVICE_RFAULT:
+	case HMM_DEVICE_WFAULT:
+		event->pte_mask = HMM_PTE_VALID | HMM_PTE_WRITE |
+				  HMM_PFN_MASK;
+		break;
+	case HMM_WRITE_PROTECT:
+		event->pte_mask = HMM_PTE_VALID | HMM_PFN_MASK;
+		break;
+	case HMM_MIGRATE:
+	case HMM_MUNMAP:
+		event->pte_mask = 0;
+		break;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static inline void hmm_event_wait(struct hmm_event *event)
+{
+	struct hmm_fence *fence, *tmp;
+
+	if (list_empty(&event->fences))
+		/* Nothing to wait for. */
+		return;
+
+	io_schedule();
+
+	list_for_each_entry_safe(fence, tmp, &event->fences, list)
+		hmm_device_fence_wait(fence->mirror->device, fence);
+}
+
 
 /* hmm - core HMM functions.
  *
@@ -139,6 +205,29 @@ static inline struct hmm *hmm_unref(struct hmm *hmm)
 	return NULL;
 }
 
+static void hmm_update(struct hmm *hmm, struct hmm_event *event)
+{
+	struct hmm_mirror *mirror;
+	int id;
+
+	/* Is this hmm already fully stopped? */
+	if (hmm->mm->hmm != hmm)
+		return;
+
+	id = srcu_read_lock(&srcu);
+
+	hlist_for_each_entry_rcu(mirror, &hmm->mirrors, mlist)
+		if (hmm_mirror_update(mirror, event))
+			hmm_mirror_release(mirror);
+
+	hmm_event_wait(event);
+
+	hlist_for_each_entry_rcu(mirror, &hmm->mirrors, mlist)
+		hmm_mirror_update_pt(mirror, event);
+
+	srcu_read_unlock(&srcu, id);
+}
+
 
 /* hmm_notifier - HMM callback for mmu_notifier tracking change to process mm.
  *
@@ -180,8 +269,87 @@ static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm)
 	srcu_read_unlock(&srcu, id);
 }
 
+static void hmm_mmu_mprot_to_etype(struct mm_struct *mm,
+				   unsigned long addr,
+				   enum mmu_event mmu_event,
+				   enum hmm_etype *etype)
+{
+	struct vm_area_struct *vma;
+
+	vma = find_vma(mm, addr);
+	if (!vma || vma->vm_start > addr || !(vma->vm_flags & VM_READ)) {
+		*etype = HMM_MUNMAP;
+		return;
+	}
+
+	if (!(vma->vm_flags & VM_WRITE)) {
+		*etype = HMM_WRITE_PROTECT;
+		return;
+	}
+
+	*etype = HMM_NONE;
+}
+
+static void hmm_notifier_invalidate_range_start(struct mmu_notifier *mn,
+						struct mm_struct *mm,
+						const struct mmu_notifier_range *range)
+{
+	struct hmm_event event;
+	unsigned long start = range->start, end = range->end;
+	struct hmm *hmm;
+
+	hmm = container_of(mn, struct hmm, mmu_notifier);
+	if (start >= hmm->vm_end)
+		return;
+
+	switch (range->event) {
+	case MMU_MUNLOCK:
+		/* Still same physical ram backing same address. */
+		return;
+	case MMU_MPROT:
+		hmm_mmu_mprot_to_etype(mm, start, range->event, &event.etype);
+		if (event.etype == HMM_NONE)
+			return;
+		break;
+	case MMU_WRITE_BACK:
+	case MMU_WRITE_PROTECT:
+		event.etype = HMM_WRITE_PROTECT;
+		break;
+	case MMU_ISDIRTY:
+		event.etype = HMM_ISDIRTY;
+		break;
+	case MMU_HSPLIT:
+	case MMU_MUNMAP:
+		event.etype = HMM_MUNMAP;
+		break;
+	case MMU_MIGRATE:
+	default:
+		event.etype = HMM_MIGRATE;
+		break;
+	}
+
+	if (hmm_event_init(&event, hmm, start, end, event.etype))
+		return;
+	hmm_update(hmm, &event);
+}
+
+static void hmm_notifier_invalidate_page(struct mmu_notifier *mn,
+					 struct mm_struct *mm,
+					 unsigned long addr,
+					 enum mmu_event mmu_event)
+{
+	struct mmu_notifier_range range;
+
+	range.start = addr & PAGE_MASK;
+	range.end = range.start + PAGE_SIZE;
+	range.event = mmu_event;
+	hmm_notifier_invalidate_range_start(mn, mm, &range);
+}
+
 static struct mmu_notifier_ops hmm_notifier_ops = {
 	.release		= hmm_notifier_release,
+	.invalidate_page	= hmm_notifier_invalidate_page,
+	.invalidate_range_start	= hmm_notifier_invalidate_range_start,
 };
 
 
@@ -196,6 +364,64 @@ static struct mmu_notifier_ops hmm_notifier_ops = {
  * the device driver to fault in range of memory in the device page table.
  */
 
+static inline int hmm_mirror_update(struct hmm_mirror *mirror,
+				    struct hmm_event *event)
+{
+	struct hmm_device *device = mirror->device;
+	struct hmm_fence *fence;
+
+	fence = device->ops->update(mirror, event);
+	if (fence) {
+		if (IS_ERR(fence))
+			return PTR_ERR(fence);
+		fence->mirror = mirror;
+		list_add_tail(&fence->list, &event->fences);
+	}
+	return 0;
+}
+
+static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
+				 struct hmm_event *event)
+{
+	unsigned long addr;
+	struct hmm_pt_iter iter;
+
+	hmm_pt_iter_init(&iter);
+	for (addr = event->start; addr != event->end;) {
+		unsigned long end, next;
+		dma_addr_t *hmm_pte;
+
+		hmm_pte = hmm_pt_iter_update(&iter, &mirror->pt, addr);
+		if (!hmm_pte) {
+			addr = hmm_pt_iter_next(&iter, &mirror->pt,
+						addr, event->end);
+			continue;
+		}
+		end = hmm_pt_level_next(&mirror->pt, addr, event->end,
+					 mirror->pt.llevel - 1);
+		hmm_pt_iter_directory_lock(&iter, &mirror->pt);
+		do {
+			next = hmm_pt_level_next(&mirror->pt, addr, end,
+						 mirror->pt.llevel);
+			if (!((*hmm_pte) & HMM_PTE_VALID))
+				continue;
+			if ((*hmm_pte) & HMM_PTE_DIRTY) {
+				struct page *page;
+
+				page = pfn_to_page(hmm_pte_pfn(*hmm_pte));
+				set_page_dirty(page);
+				*hmm_pte &= ~HMM_PTE_DIRTY;
+			}
+			*hmm_pte &= event->pte_mask;
+			if (((*hmm_pte) & HMM_PTE_VALID))
+				continue;
+			hmm_pt_iter_directory_unref(&iter, mirror->pt.llevel);
+		} while (addr = next, hmm_pte++, addr != end);
+		hmm_pt_iter_directory_unlock(&iter, &mirror->pt);
+	}
+	hmm_pt_iter_fini(&iter, &mirror->pt);
+}
+
 /* hmm_mirror_register() - register mirror against current process for a device.
  *
  * @mirror: The mirror struct being registered.
@@ -226,6 +452,11 @@ int hmm_mirror_register(struct hmm_mirror *mirror, struct hmm_device *device)
 	 * Initialize the mirror struct fields, the mlist init and del dance is
 	 * necessary to make the error path easier for driver and for hmm.
 	 */
+	mirror->pt.last = TASK_SIZE - 1;
+	if (hmm_pt_init(&mirror->pt)) {
+		kfree(mirror);
+		return -ENOMEM;
+	}
 	INIT_HLIST_NODE(&mirror->mlist);
 	INIT_LIST_HEAD(&mirror->dlist);
 	mutex_lock(&device->mutex);
@@ -263,6 +494,7 @@ int hmm_mirror_register(struct hmm_mirror *mirror, struct hmm_device *device)
 		hmm_unref(hmm);
 		goto error;
 	}
+	BUG_ON(mirror->pt.last >= hmm->vm_end);
 	return 0;
 
 error:
@@ -275,6 +507,14 @@ EXPORT_SYMBOL(hmm_mirror_register);
 
 static void hmm_mirror_release(struct hmm_mirror *mirror)
 {
+	struct hmm_event event;
+
+	/* Make sure everything is unmapped. */
+	hmm_event_init(&event, mirror->hmm, 0, -1UL, HMM_MUNMAP);
+	hmm_mirror_update(mirror, &event);
+	hmm_event_wait(&event);
+	hmm_mirror_update_pt(mirror, &event);
+
 	spin_lock(&mirror->hmm->lock);
 	if (!hlist_unhashed(&mirror->mlist)) {
 		hlist_del_init_rcu(&mirror->mlist);
@@ -310,6 +550,7 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror)
 	 */
 	synchronize_srcu(&srcu);
 
+	hmm_pt_fini(&mirror->pt);
 	mirror->hmm = hmm_unref(mirror->hmm);
 }
 EXPORT_SYMBOL(hmm_mirror_unregister);
@@ -366,6 +607,28 @@ int hmm_device_unregister(struct hmm_device *device)
 }
 EXPORT_SYMBOL(hmm_device_unregister);
 
+static void hmm_device_fence_wait(struct hmm_device *device,
+				  struct hmm_fence *fence)
+{
+	struct hmm_mirror *mirror;
+	int r;
+
+	if (fence == NULL)
+		return;
+
+	list_del_init(&fence->list);
+	do {
+		r = device->ops->fence_wait(fence);
+		if (r == -EAGAIN)
+			io_schedule();
+	} while (r == -EAGAIN);
+
+	mirror = fence->mirror;
+	device->ops->fence_unref(fence);
+	if (r)
+		hmm_mirror_release(mirror);
+}
+
 
 static int __init hmm_subsys_init(void)
 {
-- 
1.9.3
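
To make the fence_wait() contract easier to follow, here is a user-space
sketch (not part of the patch) of the retry loop in hmm_device_fence_wait().
All model_* names are hypothetical. The fake fence simply reports -EAGAIN a
fixed number of times before completing, and the kernel's io_schedule() call
between polls is left out. The final clamp to -EIO follows the callback
documentation; the posted hmm_device_fence_wait() does not do that
normalization itself, so that line is an addition of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Fake fence: not signaled for pending_polls calls, then reports error. */
struct model_fence {
	int pending_polls;
	int error;		/* 0 or a negative errno once signaled */
	int refcount;
};

struct model_fence_ops {
	int (*fence_wait)(struct model_fence *fence);
	void (*fence_unref)(struct model_fence *fence);
};

static int model_fence_wait_once(struct model_fence *fence)
{
	if (fence->pending_polls > 0) {
		fence->pending_polls--;
		return -EAGAIN;
	}
	return fence->error;
}

static void model_fence_unref(struct model_fence *fence)
{
	fence->refcount--;
}

/*
 * Poll the fence until it stops returning -EAGAIN, drop the reference, and
 * report the final status. *polls counts how often fence_wait() was called.
 */
static int model_device_fence_wait(const struct model_fence_ops *ops,
				   struct model_fence *fence, unsigned *polls)
{
	int r;

	if (fence == NULL)
		return 0;

	do {
		r = ops->fence_wait(fence);
		(*polls)++;
		/* The kernel version calls io_schedule() here on -EAGAIN. */
	} while (r == -EAGAIN);

	ops->fence_unref(fence);

	/* Documented contract: anything but 0 or -EIO is treated as -EIO. */
	if (r && r != -EIO)
		r = -EIO;
	return r;
}
```

In the patch a non-zero final status leads to hmm_mirror_release() for the
mirror that owns the fence; the sketch just returns the status to the caller.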



* [PATCH 5/7] HMM: add per mirror page table.
@ 2014-12-22 16:48   ` j.glisse
  0 siblings, 0 replies; 21+ messages in thread
From: j.glisse @ 2014-12-22 16:48 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Shachar Raindel,
	Liran Liss, Roland Dreier, Ben Sander, Greg Stoner,
	John Bridgman, Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse,
	Jatin Kumar

From: Jérôme Glisse <jglisse@redhat.com>

This patch adds the per-mirror page table. It also propagates CPU page table
updates to this per-mirror page table using mmu_notifier callbacks. All updates
are contextualized with an HMM event structure that conveys all the information
needed by the device driver to take proper action (update its own MMU to
reflect changes and schedule proper flushing).

Core HMM is responsible for updating the per-mirror page table once the device
driver is done with its update. Most importantly, HMM will properly propagate
the HMM page table dirty bit to the underlying page.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 include/linux/hmm.h | 136 +++++++++++++++++++++++++++
 mm/hmm.c            | 263 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 399 insertions(+)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 8eddc15..dd34572 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -46,12 +46,65 @@
 #include <linux/mmu_notifier.h>
 #include <linux/workqueue.h>
 #include <linux/mman.h>
+#include <linux/hmm_pt.h>
 
 
 struct hmm_device;
 struct hmm_mirror;
+struct hmm_fence;
 struct hmm;
 
+/* hmm_fence - Device driver fence to batch updates and delay the wait.
+ *
+ * @mirror: The HMM mirror this fence is associated with.
+ * @list: List of fences.
+ *
+ * Each time HMM calls back into a device driver for an update, the driver can
+ * return a fence which core HMM will wait on. This allows HMM to batch updates
+ * to several different device drivers and then wait for each to complete.
+ *
+ * The hmm_fence structure is intended to be embedded inside a device driver
+ * specific fence structure.
+ */
+struct hmm_fence {
+	struct hmm_mirror	*mirror;
+	struct list_head	list;
+};
+
+
+/*
+ * hmm_event - each event is described by a type associated with a struct.
+ */
+enum hmm_etype {
+	HMM_NONE = 0,
+	HMM_ISDIRTY,
+	HMM_MIGRATE,
+	HMM_MUNMAP,
+	HMM_DEVICE_RFAULT,
+	HMM_DEVICE_WFAULT,
+	HMM_WRITE_PROTECT,
+};
+
+/* struct hmm_event - memory event information.
+ *
+ * @list: So HMM can keep track of all active events.
+ * @start: First address (inclusive).
+ * @end: Last address (exclusive).
+ * @fences: List of device fences associated with this event.
+ * @pte_mask: HMM pte update mask (bit(s) that are still valid).
+ * @etype: Event type (munmap, migrate, truncate, ...).
+ * @backoff: Only meaningful for device page fault.
+ */
+struct hmm_event {
+	struct list_head	list;
+	unsigned long		start;
+	unsigned long		end;
+	struct list_head	fences;
+	dma_addr_t		pte_mask;
+	enum hmm_etype		etype;
+	bool			backoff;
+};
+
 
 /* hmm_device - Each device must register one and only one hmm_device.
  *
@@ -72,6 +125,87 @@ struct hmm_device_ops {
 	 * from the mirror page table.
 	 */
 	void (*release)(struct hmm_mirror *mirror);
+
+	/* fence_wait() - to wait on device driver fence.
+	 *
+	 * @fence: The device driver fence struct.
+	 * Returns: 0 on success, -EIO on error, -EAGAIN to wait again.
+	 *
+	 * Called when hmm wants to wait for all operations associated with a
+	 * fence to complete (including device cache flush if the event mandates
+	 * it).
+	 *
+	 * Device driver must free fence and associated resources if it returns
+	 * something other than -EAGAIN. On -EAGAIN the fence must not be freed
+	 * as hmm will call back again.
+	 *
+	 * Return error if scheduled operation failed or if need to wait again.
+	 * -EIO Some input/output error with the device.
+	 * -EAGAIN The fence is not yet signaled, hmm reschedules waiting thread.
+	 *
+	 * All other return values trigger a warning and are transformed to -EIO.
+	 */
+	int (*fence_wait)(struct hmm_fence *fence);
+
+	/* fence_ref() - take a reference on the fence structure.
+	 *
+	 * @fence: Fence structure hmm is referencing.
+	 */
+	void (*fence_ref)(struct hmm_fence *fence);
+
+	/* fence_unref() - drop a reference on the fence structure.
+	 *
+	 * @fence: Fence structure hmm is dereferencing.
+	 */
+	void (*fence_unref)(struct hmm_fence *fence);
+
+	/* update() - update device mmu following an event.
+	 *
+	 * @mirror: The mirror that links process address space with the device.
+	 * @event: The event that triggered the update.
+	 * Returns: Valid fence ptr or NULL on success, otherwise ERR_PTR.
+	 *
+	 * Called to update device page table for a range of addresses.
+	 * The event type provides the nature of the update:
+	 *   - Range is no longer valid (munmap).
+	 *   - Range protection changes (mprotect, COW, ...).
+	 *   - Range is unmapped (swap, reclaim, page migration, ...).
+	 *   - Device page fault.
+	 *   - ...
+	 *
+	 * Though most device drivers only need to use pte_mask as it reflects
+	 * the change that will happen to the HMM page table, ie:
+	 *   new_pte = old_pte & event->pte_mask;
+	 *
+	 * Device driver must not update the HMM mirror page table. Core HMM
+	 * will update HMM page table after the update is done (ie if a fence
+	 * is returned, after ->fence_wait() reports the fence is done).
+	 *
+	 * Any event that blocks further writes to the memory must also trigger
+	 * a device cache flush and everything has to be flushed to local memory
+	 * by the time the wait callback returns (if this callback returned a
+	 * fence, otherwise by the time this callback returns).
+	 *
+	 * Device must properly set the dirty bit using hmm_pte_mk_dirty helper
+	 * on each HMM page table entry.
+	 *
+	 * The driver should return a fence pointer or NULL on success. Device
+	 * driver should return a fence and defer waiting for the operation to
+	 * the fence wait callback. Returning a fence allows hmm to batch updates
+	 * to several devices and delay waiting on those until they all have
+	 * scheduled the update.
+	 *
+	 * Device driver must not fail lightly, any failure results in the
+	 * device process being killed.
+	 *
+	 * Return fence or NULL on success, error value otherwise :
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * All other return values trigger a warning and are transformed to -EIO.
+	 */
+	struct hmm_fence *(*update)(struct hmm_mirror *mirror,
+				    const struct hmm_event *event);
 };
 
 /* struct hmm_device - per device HMM structure
@@ -108,6 +242,7 @@ int hmm_device_unregister(struct hmm_device *device);
  * @hmm: The hmm struct this hmm_mirror is associated to.
  * @dlist: List of all hmm_mirror for same device.
  * @mlist: List of all hmm_mirror for same process.
+ * @pt: Mirror page table.
  *
  * Each device that want to mirror an address space must register one of this
  * struct for each of the address space it wants to mirror. Same device can
@@ -119,6 +254,7 @@ struct hmm_mirror {
 	struct hmm		*hmm;
 	struct list_head	dlist;
 	struct hlist_node	mlist;
+	struct hmm_pt		pt;
 };
 
 int hmm_mirror_register(struct hmm_mirror *mirror, struct hmm_device *device);
diff --git a/mm/hmm.c b/mm/hmm.c
index 55afec0..90ebe75 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -71,6 +71,72 @@ struct hmm {
 
 static struct mmu_notifier_ops hmm_notifier_ops;
 
+static void hmm_device_fence_wait(struct hmm_device *device,
+				  struct hmm_fence *fence);
+static void hmm_mirror_release(struct hmm_mirror *mirror);
+static inline int hmm_mirror_update(struct hmm_mirror *mirror,
+				    struct hmm_event *event);
+static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
+				 struct hmm_event *event);
+
+
+/* hmm_event - used to track information relating to an event.
+ *
+ * Each change to cpu page table or fault from a device is considered as an
+ * event by hmm. For each event there is a common set of things that need to
+ * be tracked. The hmm_event struct centralizes those and the helper functions
+ * help in dealing with all this.
+ */
+
+static inline int hmm_event_init(struct hmm_event *event,
+				 struct hmm *hmm,
+				 unsigned long start,
+				 unsigned long end,
+				 enum hmm_etype etype)
+{
+	event->start = start & PAGE_MASK;
+	event->end = min(end, hmm->vm_end);
+	if (event->start >= event->end)
+		return -EINVAL;
+	event->etype = etype;
+	INIT_LIST_HEAD(&event->fences);
+	switch (etype) {
+	case HMM_ISDIRTY:
+		event->pte_mask = HMM_PTE_VALID | HMM_PTE_WRITE |
+				  HMM_PTE_DIRTY | HMM_PFN_MASK;
+		break;
+	case HMM_DEVICE_RFAULT:
+	case HMM_DEVICE_WFAULT:
+		event->pte_mask = HMM_PTE_VALID | HMM_PTE_WRITE |
+				  HMM_PFN_MASK;
+		break;
+	case HMM_WRITE_PROTECT:
+		event->pte_mask = HMM_PTE_VALID | HMM_PFN_MASK;
+		break;
+	case HMM_MIGRATE:
+	case HMM_MUNMAP:
+		event->pte_mask = 0;
+		break;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static inline void hmm_event_wait(struct hmm_event *event)
+{
+	struct hmm_fence *fence, *tmp;
+
+	if (list_empty(&event->fences))
+		/* Nothing to wait for. */
+		return;
+
+	io_schedule();
+
+	list_for_each_entry_safe(fence, tmp, &event->fences, list)
+		hmm_device_fence_wait(fence->mirror->device, fence);
+}
+
 
 /* hmm - core HMM functions.
  *
@@ -139,6 +205,29 @@ static inline struct hmm *hmm_unref(struct hmm *hmm)
 	return NULL;
 }
 
+static void hmm_update(struct hmm *hmm, struct hmm_event *event)
+{
+	struct hmm_mirror *mirror;
+	int id;
+
+	/* Is this hmm already fully stopped? */
+	if (hmm->mm->hmm != hmm)
+		return;
+
+	id = srcu_read_lock(&srcu);
+
+	hlist_for_each_entry_rcu(mirror, &hmm->mirrors, mlist)
+		if (hmm_mirror_update(mirror, event))
+			hmm_mirror_release(mirror);
+
+	hmm_event_wait(event);
+
+	hlist_for_each_entry_rcu(mirror, &hmm->mirrors, mlist)
+		hmm_mirror_update_pt(mirror, event);
+
+	srcu_read_unlock(&srcu, id);
+}
+
 
 /* hmm_notifier - HMM callback for mmu_notifier tracking change to process mm.
  *
@@ -180,8 +269,87 @@ static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm)
 	srcu_read_unlock(&srcu, id);
 }
 
+static void hmm_mmu_mprot_to_etype(struct mm_struct *mm,
+				   unsigned long addr,
+				   enum mmu_event mmu_event,
+				   enum hmm_etype *etype)
+{
+	struct vm_area_struct *vma;
+
+	vma = find_vma(mm, addr);
+	if (!vma || vma->vm_start > addr || !(vma->vm_flags & VM_READ)) {
+		*etype = HMM_MUNMAP;
+		return;
+	}
+
+	if (!(vma->vm_flags & VM_WRITE)) {
+		*etype = HMM_WRITE_PROTECT;
+		return;
+	}
+
+	*etype = HMM_NONE;
+}
+
+static void hmm_notifier_invalidate_range_start(struct mmu_notifier *mn,
+						struct mm_struct *mm,
+						const struct mmu_notifier_range *range)
+{
+	struct hmm_event event;
+	unsigned long start = range->start, end = range->end;
+	struct hmm *hmm;
+
+	hmm = container_of(mn, struct hmm, mmu_notifier);
+	if (start >= hmm->vm_end)
+		return;
+
+	switch (range->event) {
+	case MMU_MUNLOCK:
+		/* Still same physical ram backing same address. */
+		return;
+	case MMU_MPROT:
+		hmm_mmu_mprot_to_etype(mm, start, range->event, &event.etype);
+		if (event.etype == HMM_NONE)
+			return;
+		break;
+	case MMU_WRITE_BACK:
+	case MMU_WRITE_PROTECT:
+		event.etype = HMM_WRITE_PROTECT;
+		break;
+	case MMU_ISDIRTY:
+		event.etype = HMM_ISDIRTY;
+		break;
+	case MMU_HSPLIT:
+	case MMU_MUNMAP:
+		event.etype = HMM_MUNMAP;
+		break;
+	case MMU_MIGRATE:
+	default:
+		event.etype = HMM_MIGRATE;
+		break;
+	}
+
+	hmm_event_init(&event, hmm, start, end, event.etype);
+
+	hmm_update(hmm, &event);
+}
+
+static void hmm_notifier_invalidate_page(struct mmu_notifier *mn,
+					 struct mm_struct *mm,
+					 unsigned long addr,
+					 enum mmu_event mmu_event)
+{
+	struct mmu_notifier_range range;
+
+	range.start = addr & PAGE_MASK;
+	range.end = range.start + PAGE_SIZE;
+	range.event = mmu_event;
+	hmm_notifier_invalidate_range_start(mn, mm, &range);
+}
+
 static struct mmu_notifier_ops hmm_notifier_ops = {
 	.release		= hmm_notifier_release,
+	.invalidate_page	= hmm_notifier_invalidate_page,
+	.invalidate_range_start	= hmm_notifier_invalidate_range_start,
 };
 
 
@@ -196,6 +364,64 @@ static struct mmu_notifier_ops hmm_notifier_ops = {
  * the device driver to fault in range of memory in the device page table.
  */
 
+static inline int hmm_mirror_update(struct hmm_mirror *mirror,
+				    struct hmm_event *event)
+{
+	struct hmm_device *device = mirror->device;
+	struct hmm_fence *fence;
+
+	fence = device->ops->update(mirror, event);
+	if (fence) {
+		if (IS_ERR(fence))
+			return PTR_ERR(fence);
+		fence->mirror = mirror;
+		list_add_tail(&fence->list, &event->fences);
+	}
+	return 0;
+}
+
+static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
+				 struct hmm_event *event)
+{
+	unsigned long addr;
+	struct hmm_pt_iter iter;
+
+	hmm_pt_iter_init(&iter);
+	for (addr = event->start; addr != event->end;) {
+		unsigned long end, next;
+		dma_addr_t *hmm_pte;
+
+		hmm_pte = hmm_pt_iter_update(&iter, &mirror->pt, addr);
+		if (!hmm_pte) {
+			addr = hmm_pt_iter_next(&iter, &mirror->pt,
+						addr, event->end);
+			continue;
+		}
+		end = hmm_pt_level_next(&mirror->pt, addr, event->end,
+					 mirror->pt.llevel - 1);
+		hmm_pt_iter_directory_lock(&iter, &mirror->pt);
+		do {
+			next = hmm_pt_level_next(&mirror->pt, addr, end,
+						 mirror->pt.llevel);
+			if (!((*hmm_pte) & HMM_PTE_VALID))
+				continue;
+			if ((*hmm_pte) & HMM_PTE_DIRTY) {
+				struct page *page;
+
+				page = pfn_to_page(hmm_pte_pfn(*hmm_pte));
+				set_page_dirty(page);
+				*hmm_pte &= ~HMM_PTE_DIRTY;
+			}
+			*hmm_pte &= event->pte_mask;
+			if (((*hmm_pte) & HMM_PTE_VALID))
+				continue;
+			hmm_pt_iter_directory_unref(&iter, mirror->pt.llevel);
+		} while (addr = next, hmm_pte++, addr != end);
+		hmm_pt_iter_directory_unlock(&iter, &mirror->pt);
+	}
+	hmm_pt_iter_fini(&iter, &mirror->pt);
+}
+
 /* hmm_mirror_register() - register mirror against current process for a device.
  *
  * @mirror: The mirror struct being registered.
@@ -226,6 +452,11 @@ int hmm_mirror_register(struct hmm_mirror *mirror, struct hmm_device *device)
 	 * Initialize the mirror struct fields, the mlist init and del dance is
 	 * necessary to make the error path easier for driver and for hmm.
 	 */
+	mirror->pt.last = TASK_SIZE - 1;
+	if (hmm_pt_init(&mirror->pt)) {
+		kfree(mirror);
+		return -ENOMEM;
+	}
 	INIT_HLIST_NODE(&mirror->mlist);
 	INIT_LIST_HEAD(&mirror->dlist);
 	mutex_lock(&device->mutex);
@@ -263,6 +494,7 @@ int hmm_mirror_register(struct hmm_mirror *mirror, struct hmm_device *device)
 		hmm_unref(hmm);
 		goto error;
 	}
+	BUG_ON(mirror->pt.last >= hmm->vm_end);
 	return 0;
 
 error:
@@ -275,6 +507,14 @@ EXPORT_SYMBOL(hmm_mirror_register);
 
 static void hmm_mirror_release(struct hmm_mirror *mirror)
 {
+	struct hmm_event event;
+
+	/* Make sure everything is unmapped. */
+	hmm_event_init(&event, mirror->hmm, 0, -1UL, HMM_MUNMAP);
+	hmm_mirror_update(mirror, &event);
+	hmm_event_wait(&event);
+	hmm_mirror_update_pt(mirror, &event);
+
 	spin_lock(&mirror->hmm->lock);
 	if (!hlist_unhashed(&mirror->mlist)) {
 		hlist_del_init_rcu(&mirror->mlist);
@@ -310,6 +550,7 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror)
 	 */
 	synchronize_srcu(&srcu);
 
+	hmm_pt_fini(&mirror->pt);
 	mirror->hmm = hmm_unref(mirror->hmm);
 }
 EXPORT_SYMBOL(hmm_mirror_unregister);
@@ -366,6 +607,28 @@ int hmm_device_unregister(struct hmm_device *device)
 }
 EXPORT_SYMBOL(hmm_device_unregister);
 
+static void hmm_device_fence_wait(struct hmm_device *device,
+				  struct hmm_fence *fence)
+{
+	struct hmm_mirror *mirror;
+	int r;
+
+	if (fence == NULL)
+		return;
+
+	list_del_init(&fence->list);
+	do {
+		r = device->ops->fence_wait(fence);
+		if (r == -EAGAIN)
+			io_schedule();
+	} while (r == -EAGAIN);
+
+	mirror = fence->mirror;
+	device->ops->fence_unref(fence);
+	if (r)
+		hmm_mirror_release(mirror);
+}
+
 
 static int __init hmm_subsys_init(void)
 {
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 6/7] HMM: add device page fault support.
  2014-12-22 16:48 ` j.glisse
@ 2014-12-22 16:49   ` j.glisse
  -1 siblings, 0 replies; 21+ messages in thread
From: j.glisse @ 2014-12-22 16:49 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Shachar Raindel,
	Liran Liss, Roland Dreier, Ben Sander, Greg Stoner,
	John Bridgman, Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse,
	Jatin Kumar

From: Jérôme Glisse <jglisse@redhat.com>

This patch adds a helper for device page faults. The helper fills the mirror
page table from the CPU page table, with all of this synchronized against any
update to the CPU page table.
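
The synchronization hinges on detecting when a CPU page table invalidation
overlaps an in-flight device fault and asking that fault to back off. A minimal
userspace sketch of that check is given below; struct demo_event and
demo_flag_overlapping() are hypothetical names for this example, the fault list
is a plain array rather than a list protected by hmm->lock, and the wait queue
handling of hmm_wait_device_fault() is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the start/end/backoff fields of hmm_event. */
struct demo_event {
	unsigned long start;	/* first address, inclusive */
	unsigned long end;	/* last address, exclusive */
	bool backoff;
};

/* Half-open ranges overlap unless one ends at or before the other starts. */
static bool demo_event_overlap(const struct demo_event *a,
			       const struct demo_event *b)
{
	return !((a->end <= b->start) || (a->start >= b->end));
}

/*
 * Ask every in-flight fault that overlaps the invalidation to back off.
 * Returns the number of faults flagged; the caller would then sleep until
 * those faults have finished before updating the device.
 */
static size_t demo_flag_overlapping(struct demo_event *faults, size_t n,
				    const struct demo_event *inval)
{
	size_t i, flagged = 0;

	assert(inval->start < inval->end);
	for (i = 0; i < n; i++) {
		if (!demo_event_overlap(&faults[i], inval))
			continue;
		faults[i].backoff = true;
		flagged++;
	}
	return flagged;
}
```

Because the end address is exclusive, a fault covering [0x3000, 0x5000) is not
disturbed by an invalidation of [0x2000, 0x3000); only faults sharing at least
one byte of address range are told to retry.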

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 include/linux/hmm.h |   1 +
 mm/hmm.c            | 384 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 385 insertions(+)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index dd34572..72e168b 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -259,6 +259,7 @@ struct hmm_mirror {
 
 int hmm_mirror_register(struct hmm_mirror *mirror, struct hmm_device *device);
 void hmm_mirror_unregister(struct hmm_mirror *mirror);
+int hmm_mirror_fault(struct hmm_mirror *mirror, struct hmm_event *event);
 
 
 #endif /* CONFIG_HMM */
diff --git a/mm/hmm.c b/mm/hmm.c
index 90ebe75..5fb7e19 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -55,6 +55,9 @@ static struct srcu_struct srcu;
  * @lock: Serialize the mirror list modifications.
  * @kref: Reference counter
  * @mmu_notifier: The mmu_notifier of this mm.
+ * @device_faults: List of all active device page faults.
+ * @ndevice_faults: Number of active device page faults.
+ * @wait_queue: Wait queue for event synchronization.
  *
  * For each process address space (mm_struct) there is one and only one hmm
  * struct. hmm functions will redispatch to each devices the change made to
@@ -67,6 +70,9 @@ struct hmm {
 	spinlock_t		lock;
 	struct kref		kref;
 	struct mmu_notifier	mmu_notifier;
+	struct list_head	device_faults;
+	unsigned		ndevice_faults;
+	wait_queue_head_t	wait_queue;
 };
 
 static struct mmu_notifier_ops hmm_notifier_ops;
@@ -88,6 +94,11 @@ static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
  * help dealing with all this.
  */
 
+static inline bool hmm_event_overlap(struct hmm_event *a, struct hmm_event *b)
+{
+	return !((a->end <= b->start) || (a->start >= b->end));
+}
+
 static inline int hmm_event_init(struct hmm_event *event,
 				 struct hmm *hmm,
 				 unsigned long start,
@@ -149,6 +160,9 @@ static int hmm_init(struct hmm *hmm)
 	hmm->vm_end = TASK_SIZE;
 	kref_init(&hmm->kref);
 	INIT_HLIST_HEAD(&hmm->mirrors);
+	INIT_LIST_HEAD(&hmm->device_faults);
+	hmm->ndevice_faults = 0;
+	init_waitqueue_head(&hmm->wait_queue);
 	spin_lock_init(&hmm->lock);
 
 	/* register notifier */
@@ -205,6 +219,60 @@ static inline struct hmm *hmm_unref(struct hmm *hmm)
 	return NULL;
 }
 
+static int hmm_device_fault_start(struct hmm *hmm, struct hmm_event *event)
+{
+	int ret = 0;
+
+	mmu_notifier_range_wait_valid(hmm->mm, event->start, event->end);
+
+	spin_lock(&hmm->lock);
+	if (mmu_notifier_range_is_valid(hmm->mm, event->start, event->end)) {
+		list_add_tail(&event->list, &hmm->device_faults);
+		hmm->ndevice_faults++;
+		event->backoff = false;
+	} else
+		ret = -EAGAIN;
+	spin_unlock(&hmm->lock);
+
+	wake_up(&hmm->wait_queue);
+
+	return ret;
+}
+
+static void hmm_device_fault_end(struct hmm *hmm, struct hmm_event *event)
+{
+	hmm_event_wait(event);
+
+	spin_lock(&hmm->lock);
+	list_del_init(&event->list);
+	hmm->ndevice_faults--;
+	spin_unlock(&hmm->lock);
+
+	wake_up(&hmm->wait_queue);
+}
+
+static void hmm_wait_device_fault(struct hmm *hmm, struct hmm_event *ievent)
+{
+	struct hmm_event *fevent;
+	unsigned long wait_for = 0;
+
+again:
+	spin_lock(&hmm->lock);
+	list_for_each_entry(fevent, &hmm->device_faults, list) {
+		if (!hmm_event_overlap(fevent, ievent))
+			continue;
+		fevent->backoff = true;
+		wait_for = hmm->ndevice_faults;
+	}
+	spin_unlock(&hmm->lock);
+
+	if (wait_for > 0) {
+		wait_event(hmm->wait_queue, wait_for != hmm->ndevice_faults);
+		wait_for = 0;
+		goto again;
+	}
+}
+
 static void hmm_update(struct hmm *hmm, struct hmm_event *event)
 {
 	struct hmm_mirror *mirror;
@@ -214,6 +282,8 @@ static void hmm_update(struct hmm *hmm, struct hmm_event *event)
 	if (hmm->mm->hmm != hmm)
 		return;
 
+	hmm_wait_device_fault(hmm, event);
+
 	id = srcu_read_lock(&srcu);
 
 	hlist_for_each_entry_rcu(mirror, &hmm->mirrors, mlist)
@@ -226,6 +296,35 @@ static void hmm_update(struct hmm *hmm, struct hmm_event *event)
 		hmm_mirror_update_pt(mirror, event);
 
 	srcu_read_unlock(&srcu, id);
+
+	wake_up(&hmm->wait_queue);
+}
+
+static int hmm_mm_fault(struct hmm *hmm,
+			struct hmm_event *event,
+			struct vm_area_struct *vma,
+			unsigned long addr)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned flags;
+	int r;
+
+	flags = (event->etype == HMM_DEVICE_WFAULT) ? FAULT_FLAG_WRITE : 0;
+	for (addr &= PAGE_MASK; addr < event->end; addr += PAGE_SIZE) {
+
+		flags |= FAULT_FLAG_ALLOW_RETRY;
+		do {
+			r = handle_mm_fault(mm, vma, addr, flags);
+			if (!(r & VM_FAULT_RETRY) && (r & VM_FAULT_ERROR)) {
+				if (r & VM_FAULT_OOM)
+					return -ENOMEM;
+				/* Same error code for all other cases. */
+				return -EFAULT;
+			}
+			flags &= ~FAULT_FLAG_ALLOW_RETRY;
+		} while (r & VM_FAULT_RETRY);
+	}
+	return 0;
 }
 
 
@@ -267,6 +366,8 @@ static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm)
 	spin_unlock(&hmm->lock);
 
 	srcu_read_unlock(&srcu, id);
+
+	wake_up(&hmm->wait_queue);
 }
 
 static void hmm_mmu_mprot_to_etype(struct mm_struct *mm,
@@ -422,6 +523,289 @@ static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
 	hmm_pt_iter_fini(&iter, &mirror->pt);
 }
 
+static inline bool hmm_mirror_is_stopped(struct hmm_mirror *mirror)
+{
+	if (hlist_unhashed(&mirror->mlist) || list_empty(&mirror->dlist))
+		return true;
+	return false;
+}
+
+struct hmm_mirror_fault {
+	struct hmm_mirror	*mirror;
+	struct hmm_event	*event;
+	struct vm_area_struct	*vma;
+	unsigned long		addr;
+	struct hmm_pt_iter	*iter;
+};
+
+static int hmm_mirror_fault_hpmd(struct hmm_mirror *mirror,
+				 struct hmm_event *event,
+				 struct vm_area_struct *vma,
+				 struct hmm_pt_iter *iter,
+				 pmd_t *pmdp,
+				 struct hmm_mirror_fault *mirror_fault,
+				 unsigned long start,
+				 unsigned long end)
+{
+	struct page *page;
+	unsigned long addr, pfn;
+	unsigned flags = FOLL_TOUCH;
+	spinlock_t *ptl;
+	int ret;
+
+	ptl = pmd_lock(mirror->hmm->mm, pmdp);
+	if (unlikely(!pmd_trans_huge(*pmdp))) {
+		spin_unlock(ptl);
+		return -EAGAIN;
+	}
+	if (unlikely(pmd_trans_splitting(*pmdp))) {
+		spin_unlock(ptl);
+		wait_split_huge_page(vma->anon_vma, pmdp);
+		return -EAGAIN;
+	}
+	flags |= event->etype == HMM_DEVICE_WFAULT ? FOLL_WRITE : 0;
+	page = follow_trans_huge_pmd(vma, start, pmdp, flags);
+	pfn = page_to_pfn(page);
+	spin_unlock(ptl);
+
+	/* Just fault in the whole PMD. */
+	start &= PMD_MASK;
+	end = start + PMD_SIZE - 1;
+
+	if (!pmd_write(*pmdp) && event->etype == HMM_DEVICE_WFAULT)
+		return -ENOENT;
+
+	for (ret = 0, addr = start; !ret && addr < end;) {
+		unsigned long i = 0, hmm_end, next;
+		dma_addr_t *hmm_pte;
+
+		hmm_pte = hmm_pt_iter_fault(iter, &mirror->pt, addr);
+		if (!hmm_pte)
+			return -ENOMEM;
+
+		hmm_end = hmm_pt_level_next(&mirror->pt, addr, end,
+					    mirror->pt.llevel - 1);
+		hmm_pt_iter_directory_lock(iter, &mirror->pt);
+		do {
+			next = hmm_pt_level_next(&mirror->pt, addr, hmm_end,
+						 mirror->pt.llevel);
+
+			if (!(hmm_pte[i] & HMM_PTE_VALID)) {
+				hmm_pte[i] = hmm_pte_from_pfn(pfn);
+				hmm_pt_iter_directory_ref(iter,
+							  mirror->pt.llevel);
+			}
+			BUG_ON(hmm_pte_pfn(hmm_pte[i]) != pfn);
+			hmm_pte[i] |= pmd_write(*pmdp) ? HMM_PTE_WRITE : 0;
+		} while (addr = next, pfn++, i++, addr != hmm_end);
+		hmm_pt_iter_directory_unlock(iter, &mirror->pt);
+		mirror_fault->addr = addr;
+	}
+
+	return 0;
+}
+
+static int hmm_mirror_fault_pmd(pmd_t *pmdp,
+				unsigned long start,
+				unsigned long end,
+				struct mm_walk *walk)
+{
+	struct hmm_mirror_fault *mirror_fault = walk->private;
+	struct hmm_mirror *mirror = mirror_fault->mirror;
+	struct hmm_event *event = mirror_fault->event;
+	struct hmm_pt_iter *iter = mirror_fault->iter;
+	bool write = (event->etype == HMM_DEVICE_WFAULT);
+	unsigned long addr;
+	int ret = 0;
+
+	/* Make sure there was no gap. */
+	if (start != mirror_fault->addr)
+		return -ENOENT;
+
+	if (event->backoff)
+		return -EAGAIN;
+
+	if (pmd_none(*pmdp))
+		return -ENOENT;
+
+	if (pmd_trans_huge(*pmdp))
+		return hmm_mirror_fault_hpmd(mirror, event, mirror_fault->vma,
+					     iter, pmdp, mirror_fault, start,
+					     end);
+
+	if (pmd_none_or_trans_huge_or_clear_bad(pmdp))
+		return -EFAULT;
+
+	for (ret = 0, addr = start; !ret && addr < end;) {
+		unsigned long i = 0, hmm_end, next;
+		dma_addr_t *hmm_pte;
+		pte_t *ptep;
+
+		hmm_pte = hmm_pt_iter_fault(iter, &mirror->pt, addr);
+		if (!hmm_pte)
+			return -ENOMEM;
+
+		hmm_end = hmm_pt_level_next(&mirror->pt, addr, end,
+					    mirror->pt.llevel - 1);
+		ptep = pte_offset_map(pmdp, start);
+		hmm_pt_iter_directory_lock(iter, &mirror->pt);
+		do {
+			next = hmm_pt_level_next(&mirror->pt, addr, hmm_end,
+						 mirror->pt.llevel);
+			if (!pte_present(*ptep) || (write && !pte_write(*ptep))) {
+				ret = -ENOENT;
+				ptep++;
+				break;
+			}
+
+			if (!(hmm_pte[i] & HMM_PTE_VALID)) {
+				hmm_pte[i] = hmm_pte_from_pfn(pte_pfn(*ptep));
+				hmm_pt_iter_directory_ref(iter,
+							  mirror->pt.llevel);
+			}
+			BUG_ON(hmm_pte_pfn(hmm_pte[i]) != pte_pfn(*ptep));
+			hmm_pte[i] |= pte_write(*ptep) ? HMM_PTE_WRITE : 0;
+		} while (addr = next, ptep++, i++, addr != hmm_end);
+		hmm_pt_iter_directory_unlock(iter, &mirror->pt);
+		pte_unmap(ptep - 1);
+		mirror_fault->addr = addr;
+	}
+
+	return ret;
+}
+
+static int hmm_mirror_handle_fault(struct hmm_mirror *mirror,
+				   struct hmm_event *event,
+				   struct vm_area_struct *vma,
+				   struct hmm_pt_iter *iter)
+{
+	struct hmm_mirror_fault mirror_fault;
+	unsigned long addr = event->start;
+	struct mm_walk walk = {0};
+	int ret = 0;
+
+	if ((event->etype == HMM_DEVICE_WFAULT) && !(vma->vm_flags & VM_WRITE))
+		return -EACCES;
+
+	ret = hmm_device_fault_start(mirror->hmm, event);
+	if (ret)
+		return ret;
+
+again:
+	if (event->backoff) {
+		ret = -EAGAIN;
+		goto out;
+	}
+	if (addr >= event->end)
+		goto out;
+
+	mirror_fault.event = event;
+	mirror_fault.mirror = mirror;
+	mirror_fault.vma = vma;
+	mirror_fault.addr = addr;
+	mirror_fault.iter = iter;
+	walk.mm = mirror->hmm->mm;
+	walk.private = &mirror_fault;
+	walk.pmd_entry = hmm_mirror_fault_pmd;
+	ret = walk_page_range(addr, event->end, &walk);
+	if (!ret) {
+		struct hmm_fence *fence;
+
+		fence = mirror->device->ops->update(mirror, event);
+		if (fence) {
+			if (IS_ERR(fence)) {
+				ret = PTR_ERR(fence);
+				goto out;
+			} else {
+				fence->mirror = mirror;
+				list_add_tail(&fence->list, &event->fences);
+			}
+		}
+		addr = mirror_fault.addr;
+		goto again;
+	}
+
+out:
+	hmm_device_fault_end(mirror->hmm, event);
+	if (ret == -ENOENT) {
+		ret = hmm_mm_fault(mirror->hmm, event, vma, addr);
+		ret = ret ? ret : -EAGAIN;
+	}
+	return ret;
+}
+
+int hmm_mirror_fault(struct hmm_mirror *mirror, struct hmm_event *event)
+{
+	struct vm_area_struct *vma;
+	struct hmm_pt_iter iter;
+	int ret = 0;
+
+	if (!mirror || event->start >= mirror->hmm->vm_end)
+		return -EINVAL;
+	if (hmm_event_init(event, mirror->hmm, event->start,
+			   event->end, event->etype))
+		return -EINVAL;
+	hmm_pt_iter_init(&iter);
+
+retry:
+	if (hmm_mirror_is_stopped(mirror))
+		return -ENODEV;
+
+	/*
+	 * Synchronization with the cpu page table is the most important
+	 * and tedious aspect of device page fault. There must be a strong
+	 * ordering between calls to device->update() for device page fault
+	 * and device->update() for cpu page table invalidation/update.
+	 *
+	 * Pages that are exposed to the device driver must stay valid while
+	 * the callback is in progress, ie any cpu page table invalidation
+	 * that renders those pages obsolete must call device->update() after
+	 * the device->update() call that faulted those pages.
+	 *
+	 * To achieve this we rely on a few things. First, the mmap_sem ensures
+	 * that any munmap() syscall will serialize with us. So the issues are
+	 * with unmap_mapping_range() and with page migration or merging. For
+	 * these, hmm keeps track of the affected address ranges and blocks
+	 * device page faults that hit an overlapping range.
+	 */
+	down_read(&mirror->hmm->mm->mmap_sem);
+	vma = find_vma_intersection(mirror->hmm->mm, event->start, event->end);
+	if (!vma) {
+		ret = -EFAULT;
+		goto out;
+	}
+	if (vma->vm_start > event->start) {
+		event->end = vma->vm_start;
+		ret = -EFAULT;
+		goto out;
+	}
+	event->end = min(event->end, vma->vm_end) & PAGE_MASK;
+	if ((vma->vm_flags & (VM_IO | VM_PFNMAP | VM_MIXEDMAP | VM_HUGETLB))) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	switch (event->etype) {
+	case HMM_DEVICE_RFAULT:
+	case HMM_DEVICE_WFAULT:
+		ret = hmm_mirror_handle_fault(mirror, event, vma, &iter);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+out:
+	/* Drop the mmap_sem so anyone waiting on it has a chance. */
+	up_read(&mirror->hmm->mm->mmap_sem);
+	wake_up(&mirror->hmm->wait_queue);
+	if (ret == -EAGAIN)
+		goto retry;
+	hmm_pt_iter_fini(&iter, &mirror->pt);
+	return ret;
+}
+EXPORT_SYMBOL(hmm_mirror_fault);
+
 /* hmm_mirror_register() - register mirror against current process for a device.
  *
  * @mirror: The mirror struct being registered.
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 6/7] HMM: add device page fault support.
@ 2014-12-22 16:49   ` j.glisse
  0 siblings, 0 replies; 21+ messages in thread
From: j.glisse @ 2014-12-22 16:49 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Shachar Raindel,
	Liran Liss, Roland Dreier, Ben Sander, Greg Stoner,
	John Bridgman, Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse,
	Jatin Kumar

From: Jérôme Glisse <jglisse@redhat.com>

This patch adds a helper for device page faults. The helper fills the mirror
page table from the CPU page table, with all of this synchronized against any
update to the CPU page table.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 include/linux/hmm.h |   1 +
 mm/hmm.c            | 384 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 385 insertions(+)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index dd34572..72e168b 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -259,6 +259,7 @@ struct hmm_mirror {
 
 int hmm_mirror_register(struct hmm_mirror *mirror, struct hmm_device *device);
 void hmm_mirror_unregister(struct hmm_mirror *mirror);
+int hmm_mirror_fault(struct hmm_mirror *mirror, struct hmm_event *event);
 
 
 #endif /* CONFIG_HMM */
diff --git a/mm/hmm.c b/mm/hmm.c
index 90ebe75..5fb7e19 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -55,6 +55,9 @@ static struct srcu_struct srcu;
  * @lock: Serialize the mirror list modifications.
  * @kref: Reference counter
  * @mmu_notifier: The mmu_notifier of this mm.
+ * @device_faults: List of all active device page faults.
+ * @ndevice_faults: Number of active device page faults.
+ * @wait_queue: Wait queue for event synchronization.
  *
  * For each process address space (mm_struct) there is one and only one hmm
  * struct. hmm functions will redispatch to each devices the change made to
@@ -67,6 +70,9 @@ struct hmm {
 	spinlock_t		lock;
 	struct kref		kref;
 	struct mmu_notifier	mmu_notifier;
+	struct list_head	device_faults;
+	unsigned		ndevice_faults;
+	wait_queue_head_t	wait_queue;
 };
 
 static struct mmu_notifier_ops hmm_notifier_ops;
@@ -88,6 +94,11 @@ static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
  * help dealing with all this.
  */
 
+static inline bool hmm_event_overlap(struct hmm_event *a, struct hmm_event *b)
+{
+	return !((a->end <= b->start) || (a->start >= b->end));
+}
+
 static inline int hmm_event_init(struct hmm_event *event,
 				 struct hmm *hmm,
 				 unsigned long start,
@@ -149,6 +160,9 @@ static int hmm_init(struct hmm *hmm)
 	hmm->vm_end = TASK_SIZE;
 	kref_init(&hmm->kref);
 	INIT_HLIST_HEAD(&hmm->mirrors);
+	INIT_LIST_HEAD(&hmm->device_faults);
+	hmm->ndevice_faults = 0;
+	init_waitqueue_head(&hmm->wait_queue);
 	spin_lock_init(&hmm->lock);
 
 	/* register notifier */
@@ -205,6 +219,60 @@ static inline struct hmm *hmm_unref(struct hmm *hmm)
 	return NULL;
 }
 
+static int hmm_device_fault_start(struct hmm *hmm, struct hmm_event *event)
+{
+	int ret = 0;
+
+	mmu_notifier_range_wait_valid(hmm->mm, event->start, event->end);
+
+	spin_lock(&hmm->lock);
+	if (mmu_notifier_range_is_valid(hmm->mm, event->start, event->end)) {
+		list_add_tail(&event->list, &hmm->device_faults);
+		hmm->ndevice_faults++;
+		event->backoff = false;
+	} else
+		ret = -EAGAIN;
+	spin_unlock(&hmm->lock);
+
+	wake_up(&hmm->wait_queue);
+
+	return ret;
+}
+
+static void hmm_device_fault_end(struct hmm *hmm, struct hmm_event *event)
+{
+	hmm_event_wait(event);
+
+	spin_lock(&hmm->lock);
+	list_del_init(&event->list);
+	hmm->ndevice_faults--;
+	spin_unlock(&hmm->lock);
+
+	wake_up(&hmm->wait_queue);
+}
+
+static void hmm_wait_device_fault(struct hmm *hmm, struct hmm_event *ievent)
+{
+	struct hmm_event *fevent;
+	unsigned long wait_for = 0;
+
+again:
+	spin_lock(&hmm->lock);
+	list_for_each_entry(fevent, &hmm->device_faults, list) {
+		if (!hmm_event_overlap(fevent, ievent))
+			continue;
+		fevent->backoff = true;
+		wait_for = hmm->ndevice_faults;
+	}
+	spin_unlock(&hmm->lock);
+
+	if (wait_for > 0) {
+		wait_event(hmm->wait_queue, wait_for != hmm->ndevice_faults);
+		wait_for = 0;
+		goto again;
+	}
+}
+
 static void hmm_update(struct hmm *hmm, struct hmm_event *event)
 {
 	struct hmm_mirror *mirror;
@@ -214,6 +282,8 @@ static void hmm_update(struct hmm *hmm, struct hmm_event *event)
 	if (hmm->mm->hmm != hmm)
 		return;
 
+	hmm_wait_device_fault(hmm, event);
+
 	id = srcu_read_lock(&srcu);
 
 	hlist_for_each_entry_rcu(mirror, &hmm->mirrors, mlist)
@@ -226,6 +296,35 @@ static void hmm_update(struct hmm *hmm, struct hmm_event *event)
 		hmm_mirror_update_pt(mirror, event);
 
 	srcu_read_unlock(&srcu, id);
+
+	wake_up(&hmm->wait_queue);
+}
+
+static int hmm_mm_fault(struct hmm *hmm,
+			struct hmm_event *event,
+			struct vm_area_struct *vma,
+			unsigned long addr)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned flags;
+	int r;
+
+	flags = (event->etype == HMM_DEVICE_WFAULT) ? FAULT_FLAG_WRITE : 0;
+	for (addr &= PAGE_MASK; addr < event->end; addr += PAGE_SIZE) {
+
+		flags |= FAULT_FLAG_ALLOW_RETRY;
+		do {
+			r = handle_mm_fault(mm, vma, addr, flags);
+			if (!(r & VM_FAULT_RETRY) && (r & VM_FAULT_ERROR)) {
+				if (r & VM_FAULT_OOM)
+					return -ENOMEM;
+				/* Same error code for all other cases. */
+				return -EFAULT;
+			}
+			flags &= ~FAULT_FLAG_ALLOW_RETRY;
+		} while (r & VM_FAULT_RETRY);
+	}
+	return 0;
 }
 
 
@@ -267,6 +366,8 @@ static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm)
 	spin_unlock(&hmm->lock);
 
 	srcu_read_unlock(&srcu, id);
+
+	wake_up(&hmm->wait_queue);
 }
 
 static void hmm_mmu_mprot_to_etype(struct mm_struct *mm,
@@ -422,6 +523,289 @@ static void hmm_mirror_update_pt(struct hmm_mirror *mirror,
 	hmm_pt_iter_fini(&iter, &mirror->pt);
 }
 
+static inline bool hmm_mirror_is_stopped(struct hmm_mirror *mirror)
+{
+	if (hlist_unhashed(&mirror->mlist) || list_empty(&mirror->dlist))
+		return true;
+	return false;
+}
+
+struct hmm_mirror_fault {
+	struct hmm_mirror	*mirror;
+	struct hmm_event	*event;
+	struct vm_area_struct	*vma;
+	unsigned long		addr;
+	struct hmm_pt_iter	*iter;
+};
+
+static int hmm_mirror_fault_hpmd(struct hmm_mirror *mirror,
+				 struct hmm_event *event,
+				 struct vm_area_struct *vma,
+				 struct hmm_pt_iter *iter,
+				 pmd_t *pmdp,
+				 struct hmm_mirror_fault *mirror_fault,
+				 unsigned long start,
+				 unsigned long end)
+{
+	struct page *page;
+	unsigned long addr, pfn;
+	unsigned flags = FOLL_TOUCH;
+	spinlock_t *ptl;
+	int ret;
+
+	ptl = pmd_lock(mirror->hmm->mm, pmdp);
+	if (unlikely(!pmd_trans_huge(*pmdp))) {
+		spin_unlock(ptl);
+		return -EAGAIN;
+	}
+	if (unlikely(pmd_trans_splitting(*pmdp))) {
+		spin_unlock(ptl);
+		wait_split_huge_page(vma->anon_vma, pmdp);
+		return -EAGAIN;
+	}
+	flags |= event->etype == HMM_DEVICE_WFAULT ? FOLL_WRITE : 0;
+	page = follow_trans_huge_pmd(vma, start, pmdp, flags);
+	pfn = page_to_pfn(page);
+	spin_unlock(ptl);
+
+	/* Just fault in the whole PMD. */
+	start &= PMD_MASK;
+	end = start + PMD_SIZE - 1;
+
+	if (!pmd_write(*pmdp) && event->etype == HMM_DEVICE_WFAULT)
+		return -ENOENT;
+
+	for (ret = 0, addr = start; !ret && addr < end;) {
+		unsigned long i = 0, hmm_end, next;
+		dma_addr_t *hmm_pte;
+
+		hmm_pte = hmm_pt_iter_fault(iter, &mirror->pt, addr);
+		if (!hmm_pte)
+			return -ENOMEM;
+
+		hmm_end = hmm_pt_level_next(&mirror->pt, addr, end,
+					    mirror->pt.llevel - 1);
+		hmm_pt_iter_directory_lock(iter, &mirror->pt);
+		do {
+			next = hmm_pt_level_next(&mirror->pt, addr, hmm_end,
+						 mirror->pt.llevel);
+
+			if (!(hmm_pte[i] & HMM_PTE_VALID)) {
+				hmm_pte[i] = hmm_pte_from_pfn(pfn);
+				hmm_pt_iter_directory_ref(iter,
+							  mirror->pt.llevel);
+			}
+			BUG_ON(hmm_pte_pfn(hmm_pte[i]) != pfn);
+			hmm_pte[i] |= pmd_write(*pmdp) ? HMM_PTE_WRITE : 0;
+		} while (addr = next, pfn++, i++, addr != hmm_end);
+		hmm_pt_iter_directory_unlock(iter, &mirror->pt);
+		mirror_fault->addr = addr;
+	}
+
+	return 0;
+}
+
+static int hmm_mirror_fault_pmd(pmd_t *pmdp,
+				unsigned long start,
+				unsigned long end,
+				struct mm_walk *walk)
+{
+	struct hmm_mirror_fault *mirror_fault = walk->private;
+	struct hmm_mirror *mirror = mirror_fault->mirror;
+	struct hmm_event *event = mirror_fault->event;
+	struct hmm_pt_iter *iter = mirror_fault->iter;
+	bool write = (event->etype == HMM_DEVICE_WFAULT);
+	unsigned long addr;
+	int ret = 0;
+
+	/* Make sure there was no gap. */
+	if (start != mirror_fault->addr)
+		return -ENOENT;
+
+	if (event->backoff)
+		return -EAGAIN;
+
+	if (pmd_none(*pmdp))
+		return -ENOENT;
+
+	if (pmd_trans_huge(*pmdp))
+		return hmm_mirror_fault_hpmd(mirror, event, mirror_fault->vma,
+					     iter, pmdp, mirror_fault, start,
+					     end);
+
+	if (pmd_none_or_trans_huge_or_clear_bad(pmdp))
+		return -EFAULT;
+
+	for (ret = 0, addr = start; !ret && addr < end;) {
+		unsigned long i = 0, hmm_end, next;
+		dma_addr_t *hmm_pte;
+		pte_t *ptep;
+
+		hmm_pte = hmm_pt_iter_fault(iter, &mirror->pt, addr);
+		if (!hmm_pte)
+			return -ENOMEM;
+
+		hmm_end = hmm_pt_level_next(&mirror->pt, addr, end,
+					    mirror->pt.llevel - 1);
+		ptep = pte_offset_map(pmdp, start);
+		hmm_pt_iter_directory_lock(iter, &mirror->pt);
+		do {
+			next = hmm_pt_level_next(&mirror->pt, addr, hmm_end,
+						 mirror->pt.llevel);
+			if (!pte_present(*ptep) || (write && !pte_write(*ptep))) {
+				ret = -ENOENT;
+				ptep++;
+				break;
+			}
+
+			if (!(hmm_pte[i] & HMM_PTE_VALID)) {
+				hmm_pte[i] = hmm_pte_from_pfn(pte_pfn(*ptep));
+				hmm_pt_iter_directory_ref(iter,
+							  mirror->pt.llevel);
+			}
+			BUG_ON(hmm_pte_pfn(hmm_pte[i]) != pte_pfn(*ptep));
+			hmm_pte[i] |= pte_write(*ptep) ? HMM_PTE_WRITE : 0;
+		} while (addr = next, ptep++, i++, addr != hmm_end);
+		hmm_pt_iter_directory_unlock(iter, &mirror->pt);
+		pte_unmap(ptep - 1);
+		mirror_fault->addr = addr;
+	}
+
+	return ret;
+}
+
+static int hmm_mirror_handle_fault(struct hmm_mirror *mirror,
+				   struct hmm_event *event,
+				   struct vm_area_struct *vma,
+				   struct hmm_pt_iter *iter)
+{
+	struct hmm_mirror_fault mirror_fault;
+	unsigned long addr = event->start;
+	struct mm_walk walk = {0};
+	int ret = 0;
+
+	if ((event->etype == HMM_DEVICE_WFAULT) && !(vma->vm_flags & VM_WRITE))
+		return -EACCES;
+
+	ret = hmm_device_fault_start(mirror->hmm, event);
+	if (ret)
+		return ret;
+
+again:
+	if (event->backoff) {
+		ret = -EAGAIN;
+		goto out;
+	}
+	if (addr >= event->end)
+		goto out;
+
+	mirror_fault.event = event;
+	mirror_fault.mirror = mirror;
+	mirror_fault.vma = vma;
+	mirror_fault.addr = addr;
+	mirror_fault.iter = iter;
+	walk.mm = mirror->hmm->mm;
+	walk.private = &mirror_fault;
+	walk.pmd_entry = hmm_mirror_fault_pmd;
+	ret = walk_page_range(addr, event->end, &walk);
+	if (!ret) {
+		struct hmm_fence *fence;
+
+		fence = mirror->device->ops->update(mirror, event);
+		if (fence) {
+			if (IS_ERR(fence)) {
+				ret = PTR_ERR(fence);
+				goto out;
+			} else {
+				fence->mirror = mirror;
+				list_add_tail(&fence->list, &event->fences);
+			}
+		}
+		addr = mirror_fault.addr;
+		goto again;
+	}
+
+out:
+	hmm_device_fault_end(mirror->hmm, event);
+	if (ret == -ENOENT) {
+		ret = hmm_mm_fault(mirror->hmm, event, vma, addr);
+		ret = ret ? ret : -EAGAIN;
+	}
+	return ret;
+}
+
+int hmm_mirror_fault(struct hmm_mirror *mirror, struct hmm_event *event)
+{
+	struct vm_area_struct *vma;
+	struct hmm_pt_iter iter;
+	int ret = 0;
+
+	if (!mirror || event->start >= mirror->hmm->vm_end)
+		return -EINVAL;
+	if (hmm_event_init(event, mirror->hmm, event->start,
+			   event->end, event->etype))
+		return -EINVAL;
+	hmm_pt_iter_init(&iter);
+
+retry:
+	if (hmm_mirror_is_stopped(mirror))
+		return -ENODEV;
+
+	/*
+	 * Synchronization with the cpu page table is the most important and
+	 * tedious aspect of device page fault. There must be a strong
+	 * ordering between the call to device->update() for a device page
+	 * fault and device->update() for cpu page table invalidation/update.
+	 *
+	 * Pages that are exposed to the device driver must stay valid while
+	 * the callback is in progress, ie any cpu page table invalidation
+	 * that renders those pages obsolete must call device->update() after
+	 * the device->update() call that faulted those pages.
+	 *
+	 * To achieve this we rely on a few things. First, the mmap_sem ensures
+	 * that any munmap() syscall will serialize with us. So the issues are
+	 * with unmap_mapping_range() and with page migration or merging. For
+	 * these, hmm keeps track of the affected address ranges and blocks
+	 * device page faults that hit an overlapping range.
+	 */
+	down_read(&mirror->hmm->mm->mmap_sem);
+	vma = find_vma_intersection(mirror->hmm->mm, event->start, event->end);
+	if (!vma) {
+		ret = -EFAULT;
+		goto out;
+	}
+	if (vma->vm_start > event->start) {
+		event->end = vma->vm_start;
+		ret = -EFAULT;
+		goto out;
+	}
+	event->end = min(event->end, vma->vm_end) & PAGE_MASK;
+	if ((vma->vm_flags & (VM_IO | VM_PFNMAP | VM_MIXEDMAP | VM_HUGETLB))) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	switch (event->etype) {
+	case HMM_DEVICE_RFAULT:
+	case HMM_DEVICE_WFAULT:
+		ret = hmm_mirror_handle_fault(mirror, event, vma, &iter);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+out:
+	/* Drop the mmap_sem so anyone waiting on it has a chance. */
+	up_read(&mirror->hmm->mm->mmap_sem);
+	wake_up(&mirror->hmm->wait_queue);
+	if (ret == -EAGAIN)
+		goto retry;
+	hmm_pt_iter_fini(&iter, &mirror->pt);
+	return ret;
+}
+EXPORT_SYMBOL(hmm_mirror_fault);
+
 /* hmm_mirror_register() - register mirror against current process for a device.
  *
  * @mirror: The mirror struct being registered.
-- 
1.9.3
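
The overlap test and the backoff marking in the patch above can be sketched in plain userspace C. This is a minimal model under stated assumptions, not the kernel code: `struct event_model` and `flag_overlapping_faults()` are hypothetical names, ranges are treated as half-open [start, end) as `hmm_event_overlap()` implies, and the spinlock and wait queue are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct hmm_event: half-open range [start, end). */
struct event_model {
	unsigned long start;
	unsigned long end;
	bool backoff;
};

/* Same predicate as hmm_event_overlap() in the patch. */
static bool event_overlap(const struct event_model *a,
			  const struct event_model *b)
{
	assert(a->start <= a->end && b->start <= b->end);
	return !((a->end <= b->start) || (a->start >= b->end));
}

/*
 * Flag every in-flight fault that overlaps the invalidation, as
 * hmm_wait_device_fault() does while holding hmm->lock. Returns the number
 * of faults flagged. Locking and waiting are not modelled here.
 */
static size_t flag_overlapping_faults(struct event_model *faults, size_t n,
				      const struct event_model *inval)
{
	size_t i, flagged = 0;

	for (i = 0; i < n; i++) {
		if (!event_overlap(&faults[i], inval))
			continue;
		faults[i].backoff = true;
		flagged++;
	}
	return flagged;
}
```

Because the ranges are half-open, a fault that starts exactly where the invalidation ends is not flagged, so adjacent faults are not forced to back off.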

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 21+ messages in thread
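
The range trimming done at the top of `hmm_mirror_fault()` in the patch above can be illustrated with a small userspace sketch. The names `vma_model`, `event_model`, `clamp_event_to_vma()` and the `MODEL_PAGE_*` macros are hypothetical, a 4096-byte page is assumed, and the model assumes the VMA already intersects the event range (which `find_vma_intersection()` guarantees in the kernel).

```c
#include <assert.h>
#include <errno.h>

#define MODEL_PAGE_SIZE 4096UL
#define MODEL_PAGE_MASK (~(MODEL_PAGE_SIZE - 1))

/* Hypothetical stand-ins for the kernel structures; only bounds matter. */
struct vma_model {
	unsigned long vm_start;
	unsigned long vm_end;
};

struct event_model {
	unsigned long start;
	unsigned long end;
};

/*
 * Trim a fault range to the VMA that intersects it, mirroring the checks in
 * hmm_mirror_fault(). Returns 0 on success, or -EFAULT when the start
 * address falls in a hole before the VMA; in that case event->end is set to
 * the start of the VMA, as in the patch.
 */
static int clamp_event_to_vma(struct event_model *event,
			      const struct vma_model *vma)
{
	assert(event->start < event->end);

	if (vma->vm_start > event->start) {
		event->end = vma->vm_start;
		return -EFAULT;
	}
	if (event->end > vma->vm_end)
		event->end = vma->vm_end;
	event->end &= MODEL_PAGE_MASK;
	return 0;
}
```

A caller that gets 0 back can fault the trimmed range and then continue from `event->end` with the next VMA.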

* Re: [PATCH 2/7] mmu_notifier: keep track of active invalidation ranges v2
  2014-12-22 16:48   ` j.glisse
@ 2014-12-25  8:29     ` Haggai Eran
  -1 siblings, 0 replies; 21+ messages in thread
From: Haggai Eran @ 2014-12-25  8:29 UTC (permalink / raw)
  To: j.glisse, akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Shachar Raindel,
	Liran Liss, Roland Dreier, Ben Sander, Greg Stoner,
	John Bridgman, Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse

On 22/12/2014 18:48, j.glisse@gmail.com wrote:
>  static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -						       unsigned long start,
> -						       unsigned long end,
> -						       enum mmu_event event)
> +						       struct mmu_notifier_range *range)
>  {
> +	/*
> +	 * Initialize list no matter what in case a mmu_notifier register after
> +	 * a range_start but before matching range_end.
> +	 */
> +	INIT_LIST_HEAD(&range->list);

I don't see how an mmu_notifier can register after a range_start but
before a matching range_end. The mmu_notifier registration locks all mm
locks, and that should prevent any invalidation from running, right?

>  	if (mm_has_notifiers(mm))
> -		__mmu_notifier_invalidate_range_start(mm, start, end, event);
> +		__mmu_notifier_invalidate_range_start(mm, range);
>  }

...

>  void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -					   unsigned long start,
> -					   unsigned long end,
> -					   enum mmu_event event)
> +					   struct mmu_notifier_range *range)
>  
>  {
>  	struct mmu_notifier *mn;
> @@ -185,21 +183,36 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
>  	id = srcu_read_lock(&srcu);
>  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
>  		if (mn->ops->invalidate_range_start)
> -			mn->ops->invalidate_range_start(mn, mm, start,
> -							end, event);
> +			mn->ops->invalidate_range_start(mn, mm, range);
>  	}
>  	srcu_read_unlock(&srcu, id);
> +
> +	/*
> +	 * This must happen after the callback so that subsystem can block on
> +	 * new invalidation range to synchronize itself.
> +	 */
> +	spin_lock(&mm->mmu_notifier_mm->lock);
> +	list_add_tail(&range->list, &mm->mmu_notifier_mm->ranges);
> +	mm->mmu_notifier_mm->nranges++;
> +	spin_unlock(&mm->mmu_notifier_mm->lock);
>  }
>  EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);

Don't you have a race here because you add the range struct after the
callback?

-------------------------------------------------------------------------
Thread A                    | Thread B
-------------------------------------------------------------------------
call mmu notifier callback  |
  clear SPTE                |
                            | device page fault
                            |   mmu_notifier_range_is_valid returns true
                            |   install new SPTE
add event struct to list    |
mm clears/modifies the PTE  |
-------------------------------------------------------------------------

So we are left with different entries in the host page table and the
secondary page table.

I would think you'd want the event struct to be added to the list before
the callback is run.

Best regards,
Haggai



^ permalink raw reply	[flat|nested] 21+ messages in thread
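
The interleaving in the table above can be replayed deterministically in userspace to show why the ordering matters. This is a single-threaded sketch under stated assumptions: `mm_model`, `range_register()`, `range_is_valid()`, `device_fault()` and `replay()` are hypothetical names that only imitate the role of the mmu_notifier range list, and no real locking is involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_RANGES 8

struct range_model {
	unsigned long start;
	unsigned long end;
};

/* Hypothetical model of one mm: active invalidations plus one PTE/SPTE. */
struct mm_model {
	struct range_model active[MAX_RANGES];
	size_t nranges;
	bool spte_present;	/* secondary (device) page table entry */
	bool pte_present;	/* host page table entry */
};

static void range_register(struct mm_model *mm, struct range_model r)
{
	assert(mm->nranges < MAX_RANGES);
	mm->active[mm->nranges++] = r;
}

/* True when no active invalidation overlaps [start, end). */
static bool range_is_valid(const struct mm_model *mm,
			   unsigned long start, unsigned long end)
{
	size_t i;

	for (i = 0; i < mm->nranges; i++)
		if (start < mm->active[i].end && end > mm->active[i].start)
			return false;
	return true;
}

/* Device fault: install the SPTE only if the range is currently valid. */
static bool device_fault(struct mm_model *mm,
			 unsigned long start, unsigned long end)
{
	if (!range_is_valid(mm, start, end))
		return false;	/* a real caller would wait and retry */
	mm->spte_present = true;
	return true;
}

/*
 * Replay the table: thread A invalidates, thread B faults in the window.
 * Returns true when host and secondary page tables end up consistent.
 */
static bool replay(bool register_before_callback)
{
	struct mm_model mm = { .nranges = 0, .spte_present = true,
			       .pte_present = true };
	struct range_model r = { 0x1000, 0x2000 };

	if (register_before_callback)
		range_register(&mm, r);
	mm.spte_present = false;		/* A: callback clears SPTE */
	device_fault(&mm, r.start, r.end);	/* B: device page fault */
	if (!register_before_callback)
		range_register(&mm, r);
	mm.pte_present = false;			/* A: mm clears the PTE */

	return mm.spte_present == mm.pte_present;
}
```

In this model `replay(false)` leaves a stale SPTE behind, while `replay(true)` makes the device fault back off, which matches the suggestion to add the range to the list before running the callback.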

* Re: [PATCH 2/7] mmu_notifier: keep track of active invalidation ranges v2
  2014-12-25  8:29     ` Haggai Eran
@ 2014-12-26  7:20       ` Jerome Glisse
  -1 siblings, 0 replies; 21+ messages in thread
From: Jerome Glisse @ 2014-12-26  7:20 UTC (permalink / raw)
  To: Haggai Eran
  Cc: akpm, linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Shachar Raindel,
	Liran Liss, Roland Dreier, Ben Sander, Greg Stoner,
	John Bridgman, Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse

On Thu, Dec 25, 2014 at 10:29:44AM +0200, Haggai Eran wrote:
> On 22/12/2014 18:48, j.glisse@gmail.com wrote:
> >  static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> > -						       unsigned long start,
> > -						       unsigned long end,
> > -						       enum mmu_event event)
> > +						       struct mmu_notifier_range *range)
> >  {
> > +	/*
> > +	 * Initialize list no matter what in case a mmu_notifier register after
> > +	 * a range_start but before matching range_end.
> > +	 */
> > +	INIT_LIST_HEAD(&range->list);
> 
> I don't see how an mmu_notifier can register after a range_start but
> before a matching range_end. The mmu_notifier registration locks all mm
> locks, and that should prevent any invalidation from running, right?

File invalidation (like truncation) can lead to this case.

> 
> >  	if (mm_has_notifiers(mm))
> > -		__mmu_notifier_invalidate_range_start(mm, start, end, event);
> > +		__mmu_notifier_invalidate_range_start(mm, range);
> >  }
> 
> ...
> 
> >  void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> > -					   unsigned long start,
> > -					   unsigned long end,
> > -					   enum mmu_event event)
> > +					   struct mmu_notifier_range *range)
> >  
> >  {
> >  	struct mmu_notifier *mn;
> > @@ -185,21 +183,36 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> >  	id = srcu_read_lock(&srcu);
> >  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
> >  		if (mn->ops->invalidate_range_start)
> > -			mn->ops->invalidate_range_start(mn, mm, start,
> > -							end, event);
> > +			mn->ops->invalidate_range_start(mn, mm, range);
> >  	}
> >  	srcu_read_unlock(&srcu, id);
> > +
> > +	/*
> > +	 * This must happen after the callback so that subsystem can block on
> > +	 * new invalidation range to synchronize itself.
> > +	 */
> > +	spin_lock(&mm->mmu_notifier_mm->lock);
> > +	list_add_tail(&range->list, &mm->mmu_notifier_mm->ranges);
> > +	mm->mmu_notifier_mm->nranges++;
> > +	spin_unlock(&mm->mmu_notifier_mm->lock);
> >  }
> >  EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
> 
> Don't you have a race here because you add the range struct after the
> callback?
> 
> -------------------------------------------------------------------------
> Thread A                    | Thread B
> -------------------------------------------------------------------------
> call mmu notifier callback  |
>   clear SPTE                |
>                             | device page fault
>                             |   mmu_notifier_range_is_valid returns true
>                             |   install new SPTE
> add event struct to list    |
> mm clears/modifies the PTE  |
> -------------------------------------------------------------------------
> 
> So we are left with different entries in the host page table and the
> secondary page table.
> 
> I would think you'd want the event struct to be added to the list before
> the callback is run.
> 

Yes, you are right, but the comment I left triggers a memory that I did that
on purpose at one point, probably with a different synchronization mechanism
inside hmm. I will try to meditate a bit and see if I can recall why I did it
that way with respect to the previous design.

In any case I will respin with that order modified. Can I add your
Reviewed-by after doing so?

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/7] HMM: introduce heterogeneous memory management.
  2014-12-22 16:48   ` j.glisse
@ 2014-12-31 15:46     ` Haggai Eran
  -1 siblings, 0 replies; 21+ messages in thread
From: Haggai Eran @ 2014-12-31 15:46 UTC (permalink / raw)
  To: j.glisse, akpm
  Cc: linux-kernel, linux-mm, Linus Torvalds, joro, Mel Gorman,
	H. Peter Anvin, Peter Zijlstra, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, Shachar Raindel,
	Liran Liss, Roland Dreier, Ben Sander, Greg Stoner,
	John Bridgman, Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Jérôme Glisse,
	Jatin Kumar

Hi,

On 22/12/2014 18:48, j.glisse@gmail.com wrote:
> +/* hmm_device_register() - register a device with HMM.
> + *
> + * @device: The hmm_device struct.
> + * Returns: 0 on success or -EINVAL otherwise.
> + *
> + *
> + * Call when device driver want to register itself with HMM. Device driver can
> + * only register once. It will return a reference on the device thus to release
> + * a device the driver must unreference the device.

I see that the code doesn't actually have a reference count on the
hmm_device, but just registers and unregisters it through the
hmm_device_register/hmm_device_unregister functions. Perhaps you should
update the comment here to say so.

> + */
> +int hmm_device_register(struct hmm_device *device)
> +{
> +	/* sanity check */
> +	BUG_ON(!device);
> +	BUG_ON(!device->name);
> +	BUG_ON(!device->ops);
> +	BUG_ON(!device->ops->release);
> +
> +	mutex_init(&device->mutex);
> +	INIT_LIST_HEAD(&device->mirrors);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(hmm_device_register);

Regards,
Haggai

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2014-12-31 15:47 UTC | newest]

Thread overview: 21+ messages
2014-12-22 16:48 HMM (Heterogeneous Memory Management) v7 j.glisse
2014-12-22 16:48 ` j.glisse
2014-12-22 16:48 ` j.glisse
2014-12-22 16:48 ` [PATCH 1/7] mmu_notifier: add event information to address invalidation v6 j.glisse
2014-12-22 16:48   ` j.glisse
2014-12-22 16:48 ` [PATCH 2/7] mmu_notifier: keep track of active invalidation ranges v2 j.glisse
2014-12-22 16:48   ` j.glisse
2014-12-25  8:29   ` Haggai Eran
2014-12-25  8:29     ` Haggai Eran
2014-12-26  7:20     ` Jerome Glisse
2014-12-26  7:20       ` Jerome Glisse
2014-12-22 16:48 ` [PATCH 3/7] HMM: introduce heterogeneous memory management j.glisse
2014-12-22 16:48   ` j.glisse
2014-12-31 15:46   ` Haggai Eran
2014-12-31 15:46     ` Haggai Eran
2014-12-22 16:48 ` [PATCH 4/7] HMM: add HMM page table j.glisse
2014-12-22 16:48   ` j.glisse
2014-12-22 16:48 ` [PATCH 5/7] HMM: add per mirror " j.glisse
2014-12-22 16:48   ` j.glisse
2014-12-22 16:49 ` [PATCH 6/7] HMM: add device page fault support j.glisse
2014-12-22 16:49   ` j.glisse
