* [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
@ 2014-05-02 13:51 ` j.glisse
  0 siblings, 0 replies; 107+ messages in thread
From: j.glisse @ 2014-05-02 13:51 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel
  Cc: Mel Gorman, H. Peter Anvin, Peter Zijlstra, Andrew Morton,
	Linda Wang, Kevin E Martin, Jerome Glisse, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Jeff Law, Brendan Conoboy, Joe Donohue, Duncan Poole,
	Sherry Cheung, Subhash Gutti, John Hubbard, Mark Hairgrove,
	Lucien Dunning, Cameron Buschardt, Arvind Gopalakrishnan

In a nutshell:

The heterogeneous memory management (hmm) patchset implements a new api that
sits on top of the mmu notifier api. It provides a simple api for device
drivers to mirror a process address space without having to lock or take
references on pages, which would block them from being reclaimed or migrated.
Any change to a process address space is mirrored to the device page table by
the hmm code. To achieve this, not only does each driver need to implement a
set of callback functions, but hmm also hooks itself into many key locations
of the mm and fs code. Moreover, hmm allows migrating ranges of memory to the
device's remote memory to take advantage of its lower latency and higher
bandwidth.

The why:

We want to be able to mirror a process address space so that compute apis such
as OpenCL, or other similar apis, can use the exact same address space on the
GPU as on the CPU. This will greatly simplify use of those apis. Moreover, we
believe that we will see more and more specialized functional units that will
want to mirror a process address space using their own mmu.

To achieve this hmm requires :
 A.1 - Hardware requirements
 A.2 - sleeping inside mmu_notifier
 A.3 - context information for mmu_notifier callback (patch 1 and 2)
 A.4 - new helper function for memcg (patch 5)
 A.5 - special swap type and fault handling code
 A.6 - file backed memory and filesystem changes
 A.7 - The write back expectation

While avoiding :
 B.1 - No new page flag
 B.2 - No special page reclamation code

Finally the rest of this email deals with :
 C.1 - Alternative designs
 C.2 - Hardware solution
 C.3 - Routines marked EXPORT_SYMBOL
 C.4 - Planned features
 C.5 - Getting upstream

But first, the patch list :

 0001 - Clarify the use of TTU_UNMAP as being done for VMSCAN or POISONING.
 0002 - Give context information to mmu_notifier callbacks, i.e. why the
        callback is being made (because of a munmap call, page migration, ...).
 0003 - Provide the vma for which the invalidation is happening to the
        mmu_notifier callback. This is mostly an optimization to avoid looking
        up the vma again inside the mmu_notifier callback.
 0004 - Add new helpers to the generic interval tree (which uses an rb tree).
 0005 - Add new helpers to memcg so that anonymous pages can be accounted and
        unaccounted without a struct page. Also add a new helper function to
        transfer a charge to a page (a charge which was accounted without a
        struct page in the first place).
 0006 - Introduce the basic hmm code to support simple device mirroring of the
        address space. It is fully functional modulo some missing bits (guard
        and huge pages and a few other small corner cases).
 0007 - Introduce support for migrating anonymous memory to device memory.
        This involves introducing a new special swap type and teaching the mm
        page fault code about hmm.
 0008 - Introduce support for migrating shared or private memory that is
        backed by a file. This is far more complex than the anonymous case as
        it needs to synchronize with, and exclude, other kernel code paths
        that might try to access those pages.
 0009 - Add hmm support to the ext4 filesystem.
 0010 - Introduce a simple dummy driver that showcases use of the hmm api.
 0011 - Add support for remote memory to the dummy driver.

I believe that patches 1, 2 and 3 are useful on their own as they could help
fix some kvm issues (see https://lkml.org/lkml/2014/1/15/125) and they do not
modify the behavior of any current code (except that patch 3 might result in a
larger number of mmu_notifier calls, as many as there are different vmas in a
range).

Other patches have many rough edges but we would like to validate our design
and see what we need to change before smoothing any of them out.


A.1 - Hardware requirements :

The hardware must have its own mmu with a page table per process it wants to
mirror. The mandatory device mmu features are :
  - a per page read only flag.
  - page fault support that stops/suspends hardware threads and supports
    resuming those hardware threads once the page fault has been serviced.
  - the same number of bits for the virtual address as the target architecture
    (for instance 48 bits on current AMD64).

Advanced optional features :
  - a per page dirty bit (indicating the hardware wrote to the page).
  - a per page access bit (indicating the hardware accessed the page).


A.2 - Sleeping in mmu notifier callback :

Updating the device mmu might need to sleep, either to take a device driver
lock (which might be considered fixable) or simply because invalidating the
mmu might take several hundred milliseconds and might involve allocating
device or driver resources to perform the operation, any of which might
require sleeping.

Thus we need to be able to sleep inside mmu_notifier_invalidate_range_start at
the very least. We also need calls to mmu_notifier_change_pte to be bracketed
by mmu_notifier_invalidate_range_start and mmu_notifier_invalidate_range_end.
We need this because mmu_notifier_change_pte is called with the anon vma lock
held (and this is a non sleepable lock).
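
To make the sleeping requirement concrete, here is a minimal sketch of a
hypothetical mirroring driver (the foo_* names and the foo_mirror structure
are made up; the callback signature is the current upstream one, before patch
2 adds the action argument):

  #include <linux/mmu_notifier.h>
  #include <linux/mutex.h>

  struct foo_mirror {
          struct mmu_notifier mn;
          struct mutex mutex;     /* protects the device page table */
  };

  static void foo_device_invalidate(struct foo_mirror *m,
                                    unsigned long start, unsigned long end)
  {
          /* Submit an invalidation command to the device and wait for it
           * to complete; this can take hundreds of milliseconds. */
  }

  /* Called before the cpu page table changes: this is where the sleeping
   * work (locking, device commands, resource allocation) has to happen. */
  static void foo_invalidate_range_start(struct mmu_notifier *mn,
                                         struct mm_struct *mm,
                                         unsigned long start,
                                         unsigned long end)
  {
          struct foo_mirror *m = container_of(mn, struct foo_mirror, mn);

          mutex_lock(&m->mutex);                  /* may sleep */
          foo_device_invalidate(m, start, end);   /* may sleep for a while */
          mutex_unlock(&m->mutex);
  }

Because all the sleeping work happens in invalidate_range_start/_end, the
change_pte callback (which runs with the non sleepable anon vma lock held) can
stay lightweight, provided it is always bracketed by those two calls.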


A.3 - Context information for mmu_notifier callback :

There is a need to provide more context information on why an mmu_notifier
callback happens. Was it because userspace called munmap? Or because the
kernel is trying to free some memory? Or because a page is being migrated?

The context is provided by using a unique enum value associated with the call
site of the mmu_notifier functions. The patch here just adds the enum values
and modifies each call site to pass along the proper value.

The context information is important for management of the secondary mmu. For
instance, on munmap the device driver will want to free all resources used by
that range (device page table memory). This could also solve the issue
discussed in this thread: https://lkml.org/lkml/2014/1/15/125 (kvm could
ignore some invalidation callbacks based on the enum value).
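
As an illustration, with the enum from patch 2 a mirroring driver could
dispatch on the action like this (sketch only; the foo_* helpers are
hypothetical, only enum mmu_action and MMU_MUNMAP come from the patch):

  static void foo_invalidate_range_start(struct mmu_notifier *mn,
                                         struct mm_struct *mm,
                                         unsigned long start,
                                         unsigned long end,
                                         enum mmu_action action)
  {
          switch (action) {
          case MMU_MUNMAP:
                  /* The vma is going away: free the device page table
                   * pages backing [start, end) as well as the mappings. */
                  foo_free_range_resources(mn, start, end);
                  break;
          default:
                  /* Pages change or go away but the vma stays: only the
                   * device mappings need to be invalidated. */
                  foo_invalidate_mappings(mn, start, end);
                  break;
          }
  }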


A.4 - New helper function for memcg :

To keep the memory controller working as expected with the introduction of
remote memory we need to add new helper functions so that we can account
anonymous remote memory as if it were backed by a page. We also need to be
able to transfer a charge from remote memory to pages, and we need to be able
to clear a page cgroup without side effects on the memcg.

The patchset currently does not add a new type of memory resource but instead
just accounts remote memory the way local memory (struct page) is accounted.
This is done with the minimum amount of change to the memcg code. I believe
those changes are correct.

It might make sense to introduce a new sub-type of memory down the road so
that device memory can be included inside the memcg accounting, but we chose
not to do so at first.
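
A rough sketch of what such helpers could look like; the names and signatures
below are placeholders of mine, the real ones live in patch 5:

  struct mm_struct;
  struct page;

  /* Account nr_pages of anonymous remote memory against the memcg of mm,
   * exactly as local anonymous pages would be (may fail at the limit). */
  int hmm_memcg_charge_rmem(struct mm_struct *mm, unsigned long nr_pages);

  /* Return a charge that was taken without any struct page behind it. */
  void hmm_memcg_uncharge_rmem(struct mm_struct *mm, unsigned long nr_pages);

  /* Move a charge taken for remote memory onto a freshly allocated local
   * page, e.g. when migrating data back from the device. */
  void hmm_memcg_transfer_charge(struct mm_struct *mm, struct page *page);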


A.5 - Special swap type and fault handling code :

When some range of addresses is backed by device memory we need the cpu fault
path to be aware of it so it can ask hmm to trigger a migration back to local
memory. To avoid too much code disruption we do so by adding a new special hmm
swap type that is special cased in various places inside the mm page fault
code. Refer to patch 7 for details.
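
Conceptually the extra check in the swap fault path looks like this (the
is_hmm_entry() and hmm_handle_cpu_fault() names are placeholders for whatever
patch 7 actually calls them):

  /* In the do_swap_page() style path: */
  swp_entry_t entry = pte_to_swp_entry(orig_pte);

  if (unlikely(is_hmm_entry(entry))) {
          /* The data lives in device memory: ask hmm to migrate it back
           * to a regular page (possibly waiting on the device), then let
           * the fault be retried against the restored pte. */
          return hmm_handle_cpu_fault(mm, vma, address, flags);
  }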


A.6 - File backed memory and filesystem changes :

Using remote memory for a range of addresses backed by a file is more complex
than for anonymous memory. There are a lot more code paths that might want to
access the pages that cache a file (read, write, splice, ...). To avoid
disrupting the code too much, and to avoid sleeping inside page cache lookups,
we decided to add hmm support on a per filesystem basis, so that each
filesystem can be taught about hmm and how to interact with it correctly.

The design is relatively simple: the radix tree is updated to use a special
hmm swap entry for any page which is in remote memory. Thus any radix tree
lookup will find the special entry and will know it needs to synchronize
itself with hmm to access the file.
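
For example, a lookup in an hmm aware filesystem path could look roughly like
this (hmm_wait_for_page() is a placeholder name):

  page = find_get_page(mapping, index);
  if (radix_tree_exceptional_entry(page)) {
          /* The data for this index is in device memory: synchronize with
           * hmm (migrate or copy back) before touching the page cache. */
          page = hmm_wait_for_page(mapping, index);
  }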

There are however subtleties. Updating the radix tree does not guarantee that
hmm is the sole user of the page; another kernel or user thread might have
done a radix tree lookup before the radix tree update.

The solution to this issue is to first update the radix tree, then lock each
page we are migrating, then unmap it from all the processes using it and set
its mapping field to NULL, so that once we unlock the page all existing code
will think that the page was either truncated or reclaimed. In both cases all
existing kernel code paths will either perform a new lookup, and see the hmm
special entry, or just skip the page. Those code paths were audited to ensure
that their behavior and expected results are not modified by this.

However this does not give us exclusive access to the page. So at first, when
migrating such a page to remote memory, we map it read only inside the device
and keep the page around so that both the device copy and the page copy
contain the same data. If the device wishes to write to this remote memory
then it calls the hmm fault code.

To allow writes to remote memory, hmm will try to free the page; if the page
can be freed then it means hmm is the unique user of the page and the remote
memory can safely be written to. If not, then the page content might still be
in use by some other process and the device driver has to choose either to
wait or to use the local memory instead. So local memory pages are kept as
long as there are other users for them. We likely need to hook up some special
page reclamation code to force reclaiming those pages after a while.
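
Putting the steps above together, migrating one file backed page to remote
memory looks roughly like this (all hmm_* helpers are placeholder names):

  hmm_radix_tree_replace(mapping, index, page); /* install hmm swap entry */
  lock_page(page);
  try_to_unmap(page, TTU_MIGRATION);    /* unmap from every process */
  page->mapping = NULL;                 /* looks truncated/reclaimed now */
  hmm_copy_to_device(rmem, page);       /* device mapping starts read only */
  unlock_page(page);
  /* The local page is kept around; the device mapping is only upgraded to
   * read-write once hmm manages to free the page, i.e. once hmm is the
   * sole remaining user of its content. */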


A.7 - The write back expectation :

We also wanted to preserve writeback and dirty balancing as we believe this is
important behavior (avoiding dirty content staying for too long inside remote
memory without being written back to disk). To avoid constantly migrating
memory back and forth we decided to use the existing page (hmm keeps all
shared pages around and never frees them for the lifetime of the rmem object
they are associated with) as a temporary writeback source. On writeback the
remote memory is mapped read only on the device and copied back to local
memory, which is used as the source for the disk write.

This design choice can however be seen as counter productive, as it means that
a device using hmm will see its rmem mapped read only for writeback and will
then have to wait for writeback to go through. Another choice would be to
forgo writeback while memory is on the device and pretend the pages are clean,
but this would break fsync and similar APIs for a file that has part of its
content inside some device memory.

A middle ground might be to keep fsync and the like working but to ignore any
other writeback.
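
In pseudo code, writeback of a page whose up to date data currently lives in
remote memory would then be (hmm_* helpers are placeholder names):

  /* In the filesystem's writepage path, before issuing the disk write: */
  hmm_device_map_read_only(rmem, index);   /* stop further device writes */
  hmm_copy_from_device(page, rmem, index); /* refresh the local copy */
  /* ... the normal writepage code then writes the local page to disk;
   * the device regains write access once writeback completes. */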


B.1 - No new page flag :

While adding a new page flag would certainly help in finding a different
design to implement the hmm feature set, we tried to only consider designs
that do not require such a new flag.


B.2 - No special page reclamation code :

This is one of the big issues: should we isolate pages that are actively used
by a device from the regular lru onto a specific lru managed by the hmm code?
In this patchset we decided not to do so, as it would just add complexity to
already complex code.

The current code will sleep inside vmscan when trying to reclaim pages that
belong to a process which is mirrored on a device. Is this acceptable, or
should we add a new hmm lru list that would handle all pages used by devices
in a special way, so that those pages are isolated from the regular page
reclamation code?


C.1 - Alternative designs :

The current design is the one we believe provides enough ground to support all
the necessary features while keeping complexity as low as possible. However I
think it is important to state that several other designs were tested and to
explain why they were discarded.

D1) One of the first designs introduced a secondary page table directly
  updated by hmm helper functions. The hope was that this secondary page table
  could in some way be used directly by the device. That was naive ... to say
  the least.

D2) A secondary page table with an hmm specific format was another design that
  we tested. In this one the secondary page table was not intended to be used
  by the device but was intended to serve as a buffer between the cpu page
  table and the device page table: updates to the device page table would use
  the hmm page table.

  While this secondary page table allows tracking what is actively used, and
  also gathering statistics about it, it does require memory, in the worst
  case as much as the cpu page table.

  Another issue is that synchronization between cpu updates and the device
  trying to access this secondary page table was either prone to lock
  contention, or got awfully complex when trying to avoid locking, all while
  duplicating complexity inside each of the device drivers.

  The killing bullet was however the fact that the code was littered with BUG
  conditions about discrepancies between the cpu and the hmm page table.

D3) Use a structure to track all actively mirrored ranges per process and per
  device. This allows an exact view of which range of memory is in use by
  which device.

  Again this needs a lot of memory to track each of the active ranges, and the
  worst case would need more memory than a secondary page table (one struct
  range per page).

  The issue here was the complexity of merging and splitting ranges on address
  space changes.

D4) Use a structure to track all actively mirrored ranges per process (shared
  by all the devices that mirror the same process). This partially addresses
  the memory requirement of D3 but leaves the complexity of range merging and
  splitting intact.

The current design is a simplification of D4 in which we only track ranges of
memory that have been migrated to device memory. For any other operation hmm
directly accesses the cpu page table and forwards the appropriate information
to the device driver through the hmm api. We might need to go back to the D4
design, or a variation of it, for some of the features we want to add.
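
The per range tracking can be pictured like this; only the use of the generic
interval tree from patch 4 and the hmm_rmem name are implied by the patchset,
the actual layout below is guesswork:

  #include <linux/interval_tree.h>

  struct hmm_device;

  /* One object per range of a process address space whose memory currently
   * lives in device memory, kept in a per mm interval tree so cpu faults
   * and mmu_notifier events can quickly find the ranges they intersect. */
  struct hmm_rmem {
          struct interval_tree_node node;   /* [start, last] in the mm */
          struct hmm_device *device;        /* owner of the remote memory */
          /* ... device pages, dirty tracking, ... */
  };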


C.2 - Hardware solution :

What hmm tries to achieve can be partially achieved using a hardware solution.
Such a hardware solution is part of the PCIE specification with PASID (process
address space id) and ATS (address translation service). With both of these
PCIE features a device can ask for a virtual address of a given process to be
translated into its corresponding physical address. To achieve this the IOMMU
bridge is capable of understanding and walking the cpu page table of a
process. See the IOMMUv2 implementation inside the linux kernel for reference.

There are two huge restrictions with a hardware solution to this problem. The
first, obvious one is that you need hardware support. While HMM also requires
hardware support on the GPU side, it does not on the architecture side (no
requirement on the IOMMU, or on any bridges that are between the GPU and the
system memory). This is a strong advantage of HMM: it only requires hardware
support in one specific part.

The second restriction is that a hardware solution like IOMMUv2 does not
permit migrating chunks of memory to the device local memory, which means
under-using hardware resources (all discrete GPUs come with fast local memory
that can have more than ten times the bandwidth of system memory).

These two reasons alone are, we believe, enough to justify hmm's usefulness.

Moreover hmm can work in a hybrid solution where non migrated chunks of memory
go through the hardware solution (IOMMUv2 for instance) and only the memory
that is migrated to the device is handled by the hmm code. The hardware
requirement for this is minimal: the hardware needs to support PASID & ATS (or
any other hardware implementation of the same idea) on a page granularity
basis (it could be at the granularity of any level of the device page table,
so there is no need to populate all levels of the device page table). This is
the best solution for the problem.


C.3 - Routines marked EXPORT_SYMBOL

As these routines are intended to be referenced in device drivers, they
are marked EXPORT_SYMBOL as is common practice. This encourages adoption
of HMM in both GPL and non-GPL drivers, and allows ongoing collaboration
with one of the primary authors of this idea.
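
Concretely the driver facing entry points are exported the usual way, e.g.
(the function name below is illustrative, the real entry points are defined
in patch 6):

  #include <linux/export.h>

  struct hmm_device;

  int hmm_device_register(struct hmm_device *device)
  {
          /* ... register the device mmu with hmm ... */
          return 0;
  }
  EXPORT_SYMBOL(hmm_device_register);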

I think it would be beneficial to include this feature as soon as possible.
Early collaborators can go to the trouble of fixing and polishing the HMM
implementation, allowing it to fully bake by the time other drivers start
implementing features requiring it. We are confident that this API will be
useful to others as they catch up with supporting hardware.


C.4 - Planned features :

We are planning to add various features down the road once we have settled the
basic design. The most important ones are :
  - Allowing inter-device migration for compatible devices.
  - Allowing hmm_rmem without backing storage (simplifies some drivers).
  - Device specific memcg.
  - Improvements to allow APUs to take advantage of rmem: by hiding the page
    from the cpu, the gpu can use a different memory controller link that does
    not require cache coherency with the cpu and thus provides higher
    bandwidth.
  - Atomic device memory operations, by unmapping on the cpu while the device
    is performing the atomic operation (this requires the hardware mmu to
    differentiate between regular memory accesses and atomic memory accesses
    and to have a flag that allows atomic memory access on a per page basis).
  - Pinning private memory to rmem. This would be a useful feature to add and
    would require the addition of a new flag to madvise; any cpu access would
    then result in SIGBUS for the cpu process (see the sketch after this
    list).
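
For the madvise based pinning, usage from an application could look like the
following; MADV_HMM_PIN is a made up name and value, no such flag exists yet:

  #include <sys/mman.h>

  #ifndef MADV_HMM_PIN
  #define MADV_HMM_PIN 0x100      /* placeholder value, illustration only */
  #endif

  static int pin_to_device(void *buf, size_t len)
  {
          /* After this any cpu access to [buf, buf + len) raises SIGBUS
           * until the range is unpinned or unmapped. */
          return madvise(buf, len, MADV_HMM_PIN);
  }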


C.5 - Getting upstream :

So what should I do to get this patchset into a mergeable form, at least at
first as a staging feature? Right now the patchset has a few rough edges
around huge page support and other smaller issues. But as said above, I
believe that patches 1, 2, 3 and 4 can be merged as is, as they do not modify
current behavior while being useful to others.

Should I implement a secondary hmm specific lru and an associated worker
thread, to avoid having the regular reclaim code end up sleeping while waiting
for a device to update its page table?

Should I go for a totally different design? If so, in what direction? As
stated above, we explored other designs and I listed their flaws.

Are there any other things that I need to fix/address/change/improve?

Comments and flames are welcome.

Cheers,
Jérôme Glisse

To: <linux-kernel@vger.kernel.org>,
To: linux-mm <linux-mm@kvack.org>,
To: <linux-fsdevel@vger.kernel.org>,
Cc: "Mel Gorman" <mgorman@suse.de>,
Cc: "H. Peter Anvin" <hpa@zytor.com>,
Cc: "Peter Zijlstra" <peterz@infradead.org>,
Cc: "Andrew Morton" <akpm@linux-foundation.org>,
Cc: "Linda Wang" <lwang@redhat.com>,
Cc: "Kevin E Martin" <kem@redhat.com>,
Cc: "Jerome Glisse" <jglisse@redhat.com>,
Cc: "Andrea Arcangeli" <aarcange@redhat.com>,
Cc: "Johannes Weiner" <jweiner@redhat.com>,
Cc: "Larry Woodman" <lwoodman@redhat.com>,
Cc: "Rik van Riel" <riel@redhat.com>,
Cc: "Dave Airlie" <airlied@redhat.com>,
Cc: "Jeff Law" <law@redhat.com>,
Cc: "Brendan Conoboy" <blc@redhat.com>,
Cc: "Joe Donohue" <jdonohue@redhat.com>,
Cc: "Duncan Poole" <dpoole@nvidia.com>,
Cc: "Sherry Cheung" <SCheung@nvidia.com>,
Cc: "Subhash Gutti" <sgutti@nvidia.com>,
Cc: "John Hubbard" <jhubbard@nvidia.com>,
Cc: "Mark Hairgrove" <mhairgrove@nvidia.com>,
Cc: "Lucien Dunning" <ldunning@nvidia.com>,
Cc: "Cameron Buschardt" <cabuschardt@nvidia.com>,
Cc: "Arvind Gopalakrishnan" <arvindg@nvidia.com>,
Cc: "Haggai Eran" <haggaie@mellanox.com>,
Cc: "Or Gerlitz" <ogerlitz@mellanox.com>,
Cc: "Sagi Grimberg" <sagig@mellanox.com>
Cc: "Shachar Raindel" <raindel@mellanox.com>,
Cc: "Liran Liss" <liranl@mellanox.com>,
Cc: "Roland Dreier" <roland@purestorage.com>,
Cc: "Sander, Ben" <ben.sander@amd.com>,
Cc: "Stoner, Greg" <Greg.Stoner@amd.com>,
Cc: "Bridgman, John" <John.Bridgman@amd.com>,
Cc: "Mantor, Michael" <Michael.Mantor@amd.com>,
Cc: "Blinzer, Paul" <Paul.Blinzer@amd.com>,
Cc: "Morichetti, Laurent" <Laurent.Morichetti@amd.com>,
Cc: "Deucher, Alexander" <Alexander.Deucher@amd.com>,
Cc: "Gabbay, Oded" <Oded.Gabbay@amd.com>,


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH 01/11] mm: differentiate unmap for vmscan from other unmap.
  2014-05-02 13:51 ` j.glisse
  (?)
@ 2014-05-02 13:52   ` j.glisse
  -1 siblings, 0 replies; 107+ messages in thread
From: j.glisse @ 2014-05-02 13:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel; +Cc: Jérôme Glisse

From: Jérôme Glisse <jglisse@redhat.com>

New code will need to be able to differentiate between a regular unmap and an
unmap triggered by vmscan, in which case we want to be as quick as possible.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 include/linux/rmap.h | 7 ++++---
 mm/memory-failure.c  | 2 +-
 mm/vmscan.c          | 4 ++--
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index b66c211..575851f 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -72,9 +72,10 @@ struct anon_vma_chain {
 };
 
 enum ttu_flags {
-	TTU_UNMAP = 0,			/* unmap mode */
-	TTU_MIGRATION = 1,		/* migration mode */
-	TTU_MUNLOCK = 2,		/* munlock mode */
+	TTU_VMSCAN = 0,			/* unmap for vmscan mode */
+	TTU_POISON = 1,			/* unmap mode */
+	TTU_MIGRATION = 2,		/* migration mode */
+	TTU_MUNLOCK = 3,		/* munlock mode */
 	TTU_ACTION_MASK = 0xff,
 
 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index efb55b3..c61722b 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -854,7 +854,7 @@ static int page_action(struct page_state *ps, struct page *p,
 static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
 				  int trapno, int flags, struct page **hpagep)
 {
-	enum ttu_flags ttu = TTU_UNMAP | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
+	enum ttu_flags ttu = TTU_POISON | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
 	struct address_space *mapping;
 	LIST_HEAD(tokill);
 	int ret;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 049f324..e261fc5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1158,7 +1158,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 	}
 
 	ret = shrink_page_list(&clean_pages, zone, &sc,
-			TTU_UNMAP|TTU_IGNORE_ACCESS,
+			TTU_VMSCAN|TTU_IGNORE_ACCESS,
 			&dummy1, &dummy2, &dummy3, &dummy4, &dummy5, true);
 	list_splice(&clean_pages, page_list);
 	mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret);
@@ -1511,7 +1511,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	if (nr_taken == 0)
 		return 0;
 
-	nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP,
+	nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_VMSCAN,
 				&nr_dirty, &nr_unqueued_dirty, &nr_congested,
 				&nr_writeback, &nr_immediate,
 				false);
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 02/11] mmu_notifier: add action information to address invalidation.
  2014-05-02 13:51 ` j.glisse
  (?)
@ 2014-05-02 13:52   ` j.glisse
  -1 siblings, 0 replies; 107+ messages in thread
From: j.glisse @ 2014-05-02 13:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel; +Cc: Jérôme Glisse

From: Jérôme Glisse <jglisse@redhat.com>

The action information will be useful to new users of the mmu_notifier API.
The action argument differentiates between a vma disappearing, a page being
write protected or simply a page being unmapped. This allows new users to
take different actions; for instance, on unmap the resources used to track a
vma are still valid and should stay around if need be, while if the action
says that a vma is being destroyed, any resources used to track this vma can
be freed.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 drivers/iommu/amd_iommu_v2.c |  14 ++++--
 drivers/xen/gntdev.c         |   9 ++--
 fs/proc/task_mmu.c           |   4 +-
 include/linux/hugetlb.h      |   4 +-
 include/linux/mmu_notifier.h | 108 ++++++++++++++++++++++++++++++++++---------
 kernel/events/uprobes.c      |   6 +--
 mm/filemap_xip.c             |   2 +-
 mm/fremap.c                  |   8 +++-
 mm/huge_memory.c             |  26 +++++------
 mm/hugetlb.c                 |  19 ++++----
 mm/ksm.c                     |  12 ++---
 mm/memory.c                  |  23 ++++-----
 mm/mempolicy.c               |   2 +-
 mm/migrate.c                 |   6 +--
 mm/mmu_notifier.c            |  26 +++++++----
 mm/mprotect.c                |  30 ++++++++----
 mm/mremap.c                  |   4 +-
 mm/rmap.c                    |  55 +++++++++++++++++++---
 virt/kvm/kvm_main.c          |  12 +++--
 19 files changed, 258 insertions(+), 112 deletions(-)

diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 5208828..71f8a1c 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -421,21 +421,25 @@ static int mn_clear_flush_young(struct mmu_notifier *mn,
 static void mn_change_pte(struct mmu_notifier *mn,
 			  struct mm_struct *mm,
 			  unsigned long address,
-			  pte_t pte)
+			  pte_t pte,
+			  enum mmu_action action)
 {
 	__mn_flush_page(mn, address);
 }
 
 static void mn_invalidate_page(struct mmu_notifier *mn,
 			       struct mm_struct *mm,
-			       unsigned long address)
+			       unsigned long address,
+			       enum mmu_action action)
 {
 	__mn_flush_page(mn, address);
 }
 
 static void mn_invalidate_range_start(struct mmu_notifier *mn,
 				      struct mm_struct *mm,
-				      unsigned long start, unsigned long end)
+				      unsigned long start,
+				      unsigned long end,
+				      enum mmu_action action)
 {
 	struct pasid_state *pasid_state;
 	struct device_state *dev_state;
@@ -449,7 +453,9 @@ static void mn_invalidate_range_start(struct mmu_notifier *mn,
 
 static void mn_invalidate_range_end(struct mmu_notifier *mn,
 				    struct mm_struct *mm,
-				    unsigned long start, unsigned long end)
+				    unsigned long start,
+				    unsigned long end,
+				    enum mmu_action action)
 {
 	struct pasid_state *pasid_state;
 	struct device_state *dev_state;
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 073b4a1..84aa5a7 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -428,7 +428,9 @@ static void unmap_if_in_range(struct grant_map *map,
 
 static void mn_invl_range_start(struct mmu_notifier *mn,
 				struct mm_struct *mm,
-				unsigned long start, unsigned long end)
+				unsigned long start,
+				unsigned long end,
+				enum mmu_action action)
 {
 	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
 	struct grant_map *map;
@@ -445,9 +447,10 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
 
 static void mn_invl_page(struct mmu_notifier *mn,
 			 struct mm_struct *mm,
-			 unsigned long address)
+			 unsigned long address,
+			 enum mmu_action action)
 {
-	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE);
+	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, action);
 }
 
 static void mn_release(struct mmu_notifier *mn,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index fa6d6a4..3c571ea 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -818,11 +818,11 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 		};
 		down_read(&mm->mmap_sem);
 		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_start(mm, 0, -1);
+			mmu_notifier_invalidate_range_start(mm, 0, -1, MMU_SOFT_DIRTY);
 		for (vma = mm->mmap; vma; vma = vma->vm_next)
 			walk_page_vma(vma, &clear_refs_walk);
 		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_end(mm, 0, -1);
+			mmu_notifier_invalidate_range_end(mm, 0, -1, MMU_SOFT_DIRTY);
 		flush_tlb_mm(mm);
 		up_read(&mm->mmap_sem);
 		mmput(mm);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 0683f55..1c36581 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -6,6 +6,7 @@
 #include <linux/fs.h>
 #include <linux/hugetlb_inline.h>
 #include <linux/cgroup.h>
+#include <linux/mmu_notifier.h>
 #include <linux/list.h>
 #include <linux/kref.h>
 
@@ -103,7 +104,8 @@ struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
 int pmd_huge(pmd_t pmd);
 int pud_huge(pud_t pmd);
 unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
-		unsigned long address, unsigned long end, pgprot_t newprot);
+		unsigned long address, unsigned long end, pgprot_t newprot,
+		enum mmu_action action);
 
 #else /* !CONFIG_HUGETLB_PAGE */
 
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index deca874..90b9105 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -9,6 +9,41 @@
 struct mmu_notifier;
 struct mmu_notifier_ops;
 
+/* The action reports finer information to the callback, allowing the event
+ * listener to take better action. For instance WP means that the pages are
+ * still valid and can be used as read only.
+ *
+ * UNMAP means the vma is still valid and that only pages are unmapped; thus
+ * they should no longer be read or written to.
+ *
+ * ZAP means the vma is disappearing and that any resources that were used to
+ * track this vma can be freed.
+ *
+ * When in doubt, new notifier callers should use ZAP: it will always trigger
+ * the right behavior but won't be optimal.
+ */
+enum mmu_action {
+	MMU_MPROT_NONE = 0,
+	MMU_MPROT_RONLY,
+	MMU_MPROT_RANDW,
+	MMU_MPROT_WONLY,
+	MMU_COW,
+	MMU_KSM,
+	MMU_KSM_RONLY,
+	MMU_SOFT_DIRTY,
+	MMU_UNMAP,
+	MMU_VMSCAN,
+	MMU_POISON,
+	MMU_MREMAP,
+	MMU_MUNMAP,
+	MMU_MUNLOCK,
+	MMU_MIGRATE,
+	MMU_FILE_WB,
+	MMU_FAULT_WP,
+	MMU_THP_SPLIT,
+	MMU_THP_FAULT_WP,
+};
+
 #ifdef CONFIG_MMU_NOTIFIER
 
 /*
@@ -79,7 +114,8 @@ struct mmu_notifier_ops {
 	void (*change_pte)(struct mmu_notifier *mn,
 			   struct mm_struct *mm,
 			   unsigned long address,
-			   pte_t pte);
+			   pte_t pte,
+			   enum mmu_action action);
 
 	/*
 	 * Before this is invoked any secondary MMU is still ok to
@@ -90,7 +126,8 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_page)(struct mmu_notifier *mn,
 				struct mm_struct *mm,
-				unsigned long address);
+				unsigned long address,
+				enum mmu_action action);
 
 	/*
 	 * invalidate_range_start() and invalidate_range_end() must be
@@ -137,10 +174,14 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_range_start)(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end);
+				       unsigned long start,
+				       unsigned long end,
+				       enum mmu_action action);
 	void (*invalidate_range_end)(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
-				     unsigned long start, unsigned long end);
+				     unsigned long start,
+				     unsigned long end,
+				     enum mmu_action action);
 };
 
 /*
@@ -177,13 +218,20 @@ extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
 extern int __mmu_notifier_test_young(struct mm_struct *mm,
 				     unsigned long address);
 extern void __mmu_notifier_change_pte(struct mm_struct *mm,
-				      unsigned long address, pte_t pte);
+				      unsigned long address,
+				      pte_t pte,
+				      enum mmu_action action);
 extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address);
+					   unsigned long address,
+					   enum mmu_action action);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end);
+						  unsigned long start,
+						  unsigned long end,
+						  enum mmu_action action);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end);
+						unsigned long start,
+						unsigned long end,
+						enum mmu_action action);
 
 static inline void mmu_notifier_release(struct mm_struct *mm)
 {
@@ -208,31 +256,38 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_change_pte(struct mm_struct *mm,
-					   unsigned long address, pte_t pte)
+					   unsigned long address,
+					   pte_t pte,
+					   enum mmu_action action)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_change_pte(mm, address, pte);
+		__mmu_notifier_change_pte(mm, address, pte, action);
 }
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address)
+						unsigned long address,
+						enum mmu_action action)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_page(mm, address);
+		__mmu_notifier_invalidate_page(mm, address, action);
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						       unsigned long start,
+						       unsigned long end,
+						       enum mmu_action action)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_start(mm, start, end);
+		__mmu_notifier_invalidate_range_start(mm, start, end, action);
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						     unsigned long start,
+						     unsigned long end,
+						     enum mmu_action action)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_end(mm, start, end);
+		__mmu_notifier_invalidate_range_end(mm, start, end, action);
 }
 
 static inline void mmu_notifier_mm_init(struct mm_struct *mm)
@@ -278,13 +333,13 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
  * old page would remain mapped readonly in the secondary MMUs after the new
  * page is already writable by some CPU through the primary MMU.
  */
-#define set_pte_at_notify(__mm, __address, __ptep, __pte)		\
+#define set_pte_at_notify(__mm, __address, __ptep, __pte, __action)	\
 ({									\
 	struct mm_struct *___mm = __mm;					\
 	unsigned long ___address = __address;				\
 	pte_t ___pte = __pte;						\
 									\
-	mmu_notifier_change_pte(___mm, ___address, ___pte);		\
+	mmu_notifier_change_pte(___mm, ___address, ___pte, __action);	\
 	set_pte_at(___mm, ___address, __ptep, ___pte);			\
 })
 
@@ -307,22 +362,29 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_change_pte(struct mm_struct *mm,
-					   unsigned long address, pte_t pte)
+					   unsigned long address,
+					   pte_t pte,
+					   enum mmu_action action)
 {
 }
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address)
+						unsigned long address,
+						enum mmu_action action)
 {
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						       unsigned long start,
+						       unsigned long end,
+						       enum mmu_action action)
 {
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						     unsigned long start,
+						     unsigned long end,
+						     enum mmu_action action)
 {
 }
 
@@ -336,7 +398,7 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
 
 #define ptep_clear_flush_young_notify ptep_clear_flush_young
 #define pmdp_clear_flush_young_notify pmdp_clear_flush_young
-#define set_pte_at_notify set_pte_at
+#define set_pte_at_notify(__mm, __address, __ptep, __pte, __action) set_pte_at(__mm, __address, __ptep, __pte)
 
 #endif /* CONFIG_MMU_NOTIFIER */
 
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index d1edc5e..9acd357 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -170,7 +170,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	/* For try_to_free_swap() and munlock_vma_page() below */
 	lock_page(page);
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_UNMAP);
 	err = -EAGAIN;
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -186,7 +186,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	flush_cache_page(vma, addr, pte_pfn(*ptep));
 	ptep_clear_flush(vma, addr, ptep);
-	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot), MMU_UNMAP);
 
 	page_remove_rmap(page);
 	if (!page_mapped(page))
@@ -199,7 +199,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	err = 0;
  unlock:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_UNMAP);
 	unlock_page(page);
 	return err;
 }
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index d8d9fe3..d529ab9 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -198,7 +198,7 @@ retry:
 			BUG_ON(pte_dirty(pteval));
 			pte_unmap_unlock(pte, ptl);
 			/* must invalidate_page _before_ freeing the page */
-			mmu_notifier_invalidate_page(mm, address);
+			mmu_notifier_invalidate_page(mm, address, MMU_UNMAP);
 			page_cache_release(page);
 		}
 	}
diff --git a/mm/fremap.c b/mm/fremap.c
index 2c5646f..f4a67e0 100644
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -254,9 +254,13 @@ get_write_lock:
 		vma->vm_flags = vm_flags;
 	}
 
-	mmu_notifier_invalidate_range_start(mm, start, start + size);
+	/* Consider it a ZAP operation for now. It could be seen as an unmap,
+	 * but remapping is trickier as it can change the vma to non linear and
+	 * thus trigger side effects.
+	 */
+	mmu_notifier_invalidate_range_start(mm, start, start + size, MMU_MUNMAP);
 	err = vma->vm_ops->remap_pages(vma, start, size, pgoff);
-	mmu_notifier_invalidate_range_end(mm, start, start + size);
+	mmu_notifier_invalidate_range_end(mm, start, start + size, MMU_MUNMAP);
 
 	/*
 	 * We can't clear VM_NONLINEAR because we'd have to do
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c5ff461..4ad9b73 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -993,7 +993,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_THP_FAULT_WP);
 
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, orig_pmd)))
@@ -1023,7 +1023,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	page_remove_rmap(page);
 	spin_unlock(ptl);
 
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_FAULT_WP);
 
 	ret |= VM_FAULT_WRITE;
 	put_page(page);
@@ -1033,7 +1033,7 @@ out:
 
 out_free_pages:
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_FAULT_WP);
 	mem_cgroup_uncharge_start();
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		mem_cgroup_uncharge_page(pages[i]);
@@ -1123,7 +1123,7 @@ alloc:
 
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_THP_FAULT_WP);
 
 	spin_lock(ptl);
 	if (page)
@@ -1153,7 +1153,7 @@ alloc:
 	}
 	spin_unlock(ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_FAULT_WP);
 out:
 	return ret;
 out_unlock:
@@ -1588,7 +1588,7 @@ static int __split_huge_page_splitting(struct page *page,
 	const unsigned long mmun_start = address;
 	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
 	pmd = page_check_address_pmd(page, mm, address,
 			PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
 	if (pmd) {
@@ -1603,7 +1603,7 @@ static int __split_huge_page_splitting(struct page *page,
 		ret = 1;
 		spin_unlock(ptl);
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
 
 	return ret;
 }
@@ -2402,7 +2402,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 
 	mmun_start = address;
 	mmun_end   = address + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
 	/*
 	 * After this gup_fast can't run anymore. This also removes
@@ -2412,7 +2412,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 */
 	_pmd = pmdp_clear_flush(vma, address, pmd);
 	spin_unlock(pmd_ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
 
 	spin_lock(pte_ptl);
 	isolated = __collapse_huge_page_isolate(vma, address, pte);
@@ -2801,24 +2801,24 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
 again:
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_trans_huge(*pmd))) {
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
 		return;
 	}
 	if (is_huge_zero_pmd(*pmd)) {
 		__split_huge_zero_page_pmd(vma, haddr, pmd);
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
 		return;
 	}
 	page = pmd_page(*pmd);
 	VM_BUG_ON_PAGE(!page_count(page), page);
 	get_page(page);
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
 
 	split_huge_page(page);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e73f7bc..8006472 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2540,7 +2540,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	mmun_start = vma->vm_start;
 	mmun_end = vma->vm_end;
 	if (cow)
-		mmu_notifier_invalidate_range_start(src, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_start(src, mmun_start, mmun_end, MMU_COW);
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
 		spinlock_t *src_ptl, *dst_ptl;
@@ -2574,7 +2574,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	}
 
 	if (cow)
-		mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end, MMU_COW);
 
 	return ret;
 }
@@ -2626,7 +2626,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	BUG_ON(end & ~huge_page_mask(h));
 
 	tlb_start_vma(tlb, vma);
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_UNMAP);
 again:
 	for (address = start; address < end; address += sz) {
 		ptep = huge_pte_offset(mm, address);
@@ -2697,7 +2697,7 @@ unlock:
 		if (address < end && !ref_page)
 			goto again;
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_UNMAP);
 	tlb_end_vma(tlb, vma);
 }
 
@@ -2884,7 +2884,7 @@ retry_avoidcopy:
 
 	mmun_start = address & huge_page_mask(h);
 	mmun_end = mmun_start + huge_page_size(h);
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_UNMAP);
 	/*
 	 * Retake the page table lock to check for racing updates
 	 * before the page tables are altered
@@ -2904,7 +2904,7 @@ retry_avoidcopy:
 		new_page = old_page;
 	}
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_UNMAP);
 	page_cache_release(new_page);
 	page_cache_release(old_page);
 
@@ -3329,7 +3329,8 @@ same_page:
 }
 
 unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
-		unsigned long address, unsigned long end, pgprot_t newprot)
+		unsigned long address, unsigned long end, pgprot_t newprot,
+		enum mmu_action action)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long start = address;
@@ -3341,7 +3342,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	BUG_ON(address >= end);
 	flush_cache_range(vma, address, end);
 
-	mmu_notifier_invalidate_range_start(mm, start, end);
+	mmu_notifier_invalidate_range_start(mm, start, end, action);
 	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
 	for (; address < end; address += huge_page_size(h)) {
 		spinlock_t *ptl;
@@ -3371,7 +3372,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	 */
 	flush_tlb_range(vma, start, end);
 	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
-	mmu_notifier_invalidate_range_end(mm, start, end);
+	mmu_notifier_invalidate_range_end(mm, start, end, action);
 
 	return pages << h->order;
 }
diff --git a/mm/ksm.c b/mm/ksm.c
index 68710e8..6a32bc4 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -872,7 +872,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 
 	mmun_start = addr;
 	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_KSM_RONLY);
 
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -904,7 +904,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 		if (pte_dirty(entry))
 			set_page_dirty(page);
 		entry = pte_mkclean(pte_wrprotect(entry));
-		set_pte_at_notify(mm, addr, ptep, entry);
+		set_pte_at_notify(mm, addr, ptep, entry, MMU_KSM_RONLY);
 	}
 	*orig_pte = *ptep;
 	err = 0;
@@ -912,7 +912,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 out_unlock:
 	pte_unmap_unlock(ptep, ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_KSM_RONLY);
 out:
 	return err;
 }
@@ -949,7 +949,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 
 	mmun_start = addr;
 	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_KSM);
 
 	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	if (!pte_same(*ptep, orig_pte)) {
@@ -962,7 +962,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 
 	flush_cache_page(vma, addr, pte_pfn(*ptep));
 	ptep_clear_flush(vma, addr, ptep);
-	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot), MMU_KSM);
 
 	page_remove_rmap(page);
 	if (!page_mapped(page))
@@ -972,7 +972,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	pte_unmap_unlock(ptep, ptl);
 	err = 0;
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_KSM);
 out:
 	return err;
 }
diff --git a/mm/memory.c b/mm/memory.c
index b6b9c6e..69286e2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1055,7 +1055,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	mmun_end   = end;
 	if (is_cow)
 		mmu_notifier_invalidate_range_start(src_mm, mmun_start,
-						    mmun_end);
+						    mmun_end, MMU_COW);
 
 	ret = 0;
 	dst_pgd = pgd_offset(dst_mm, addr);
@@ -1072,7 +1072,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
 
 	if (is_cow)
-		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end,
+						  MMU_COW);
 	return ret;
 }
 
@@ -1378,10 +1379,10 @@ void unmap_vmas(struct mmu_gather *tlb,
 {
 	struct mm_struct *mm = vma->vm_mm;
 
-	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
+	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr, MMU_MUNMAP);
 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
 		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
-	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
+	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr, MMU_MUNMAP);
 }
 
 /**
@@ -1403,10 +1404,10 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, start, end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, start, end);
+	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MUNMAP);
 	for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
 		unmap_single_vma(&tlb, vma, start, end, details);
-	mmu_notifier_invalidate_range_end(mm, start, end);
+	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MUNMAP);
 	tlb_finish_mmu(&tlb, start, end);
 }
 
@@ -1429,9 +1430,9 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, address, end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, address, end);
+	mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
 	unmap_single_vma(&tlb, vma, address, end, details);
-	mmu_notifier_invalidate_range_end(mm, address, end);
+	mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
 	tlb_finish_mmu(&tlb, address, end);
 }
 
@@ -2850,7 +2851,7 @@ gotten:
 
 	mmun_start  = address & PAGE_MASK;
 	mmun_end    = mmun_start + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_FAULT_WP);
 
 	/*
 	 * Re-check the pte - we dropped the lock
@@ -2880,7 +2881,7 @@ gotten:
 		 * mmu page tables (such as kvm shadow page tables), we want the
 		 * new page to be mapped directly into the secondary page table.
 		 */
-		set_pte_at_notify(mm, address, page_table, entry);
+		set_pte_at_notify(mm, address, page_table, entry, MMU_FAULT_WP);
 		update_mmu_cache(vma, address, page_table);
 		if (old_page) {
 			/*
@@ -2919,7 +2920,7 @@ gotten:
 unlock:
 	pte_unmap_unlock(page_table, ptl);
 	if (mmun_end > mmun_start)
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_FAULT_WP);
 	if (old_page) {
 		/*
 		 * Don't let another task, with possibly unlocked vma,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index ac621fa..e42f4b7 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -561,7 +561,7 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
 {
 	int nr_updated;
 
-	nr_updated = change_protection(vma, addr, end, vma->vm_page_prot, 0, 1);
+	nr_updated = change_protection(vma, addr, end, vma->vm_page_prot, 0, 1, 0);
 	if (nr_updated)
 		count_vm_numa_events(NUMA_PTE_UPDATES, nr_updated);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 6247be7..1accb9b 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1804,12 +1804,12 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	WARN_ON(PageLRU(new_page));
 
 	/* Recheck the target PMD */
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_MIGRATE);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
 fail_putback:
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_MIGRATE);
 
 		/* Reverse changes made by migrate_page_copy() */
 		if (TestClearPageActive(new_page))
@@ -1875,7 +1875,7 @@ fail_putback:
 	 */
 	mem_cgroup_end_migration(memcg, page, new_page, true);
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_MIGRATE);
 
 	/* Take an "isolate" reference and put new page on the LRU. */
 	get_page(new_page);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 41cefdf..a906744 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -122,8 +122,10 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
 	return young;
 }
 
-void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
-			       pte_t pte)
+void __mmu_notifier_change_pte(struct mm_struct *mm,
+			       unsigned long address,
+			       pte_t pte,
+			       enum mmu_action action)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -131,13 +133,14 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->change_pte)
-			mn->ops->change_pte(mn, mm, address, pte);
+			mn->ops->change_pte(mn, mm, address, pte, action);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 
 void __mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address)
+				    unsigned long address,
+				    enum mmu_action action)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -145,13 +148,16 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_page)
-			mn->ops->invalidate_page(mn, mm, address);
+			mn->ops->invalidate_page(mn, mm, address, action);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 
 void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+					   unsigned long start,
+					   unsigned long end,
+					   enum mmu_action action)
+
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -159,14 +165,16 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_start)
-			mn->ops->invalidate_range_start(mn, mm, start, end);
+			mn->ops->invalidate_range_start(mn, mm, start, end, action);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
 
 void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+					 unsigned long start,
+					 unsigned long end,
+					 enum mmu_action action)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -174,7 +182,7 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_end)
-			mn->ops->invalidate_range_end(mn, mm, start, end);
+			mn->ops->invalidate_range_end(mn, mm, start, end, action);
 	}
 	srcu_read_unlock(&srcu, id);
 }
diff --git a/mm/mprotect.c b/mm/mprotect.c
index c43d557..6c2846f 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -137,7 +137,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		pud_t *pud, unsigned long addr, unsigned long end,
-		pgprot_t newprot, int dirty_accountable, int prot_numa)
+		pgprot_t newprot, int dirty_accountable, int prot_numa,
+		enum mmu_action action)
 {
 	pmd_t *pmd;
 	struct mm_struct *mm = vma->vm_mm;
@@ -157,7 +158,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		/* invoke the mmu notifier if the pmd is populated */
 		if (!mni_start) {
 			mni_start = addr;
-			mmu_notifier_invalidate_range_start(mm, mni_start, end);
+			mmu_notifier_invalidate_range_start(mm, mni_start, end, action);
 		}
 
 		if (pmd_trans_huge(*pmd)) {
@@ -185,7 +186,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	} while (pmd++, addr = next, addr != end);
 
 	if (mni_start)
-		mmu_notifier_invalidate_range_end(mm, mni_start, end);
+		mmu_notifier_invalidate_range_end(mm, mni_start, end, action);
 
 	if (nr_huge_updates)
 		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
@@ -194,7 +195,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 
 static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 		pgd_t *pgd, unsigned long addr, unsigned long end,
-		pgprot_t newprot, int dirty_accountable, int prot_numa)
+		pgprot_t newprot, int dirty_accountable, int prot_numa,
+		enum mmu_action action)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -206,7 +208,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		pages += change_pmd_range(vma, pud, addr, next, newprot,
-				 dirty_accountable, prot_numa);
+				 dirty_accountable, prot_numa, action);
 	} while (pud++, addr = next, addr != end);
 
 	return pages;
@@ -214,7 +216,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 
 static unsigned long change_protection_range(struct vm_area_struct *vma,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
-		int dirty_accountable, int prot_numa)
+		int dirty_accountable, int prot_numa, enum mmu_action action)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
@@ -231,7 +233,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		pages += change_pud_range(vma, pgd, addr, next, newprot,
-				 dirty_accountable, prot_numa);
+				 dirty_accountable, prot_numa, action);
 	} while (pgd++, addr = next, addr != end);
 
 	/* Only flush the TLB if we actually modified any entries: */
@@ -247,11 +249,21 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 		       int dirty_accountable, int prot_numa)
 {
 	unsigned long pages;
+	enum mmu_action action = MMU_MPROT_NONE;
+
+	/* At this points vm_flags is updated. */
+	if ((vma->vm_flags & VM_READ) && (vma->vm_flags & VM_WRITE)) {
+		action = MMU_MPROT_RANDW;
+	} else if (vma->vm_flags & VM_WRITE) {
+		action = MMU_MPROT_WONLY;
+	} else if (vma->vm_flags & VM_READ) {
+		action = MMU_MPROT_RONLY;
+	}
 
 	if (is_vm_hugetlb_page(vma))
-		pages = hugetlb_change_protection(vma, start, end, newprot);
+		pages = hugetlb_change_protection(vma, start, end, newprot, action);
 	else
-		pages = change_protection_range(vma, start, end, newprot, dirty_accountable, prot_numa);
+		pages = change_protection_range(vma, start, end, newprot, dirty_accountable, prot_numa, action);
 
 	return pages;
 }
diff --git a/mm/mremap.c b/mm/mremap.c
index 0843feb..8c00e98 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -177,7 +177,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 
 	mmun_start = old_addr;
 	mmun_end   = old_end;
-	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end, MMU_MREMAP);
 
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
 		cond_resched();
@@ -221,7 +221,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	if (likely(need_flush))
 		flush_tlb_range(vma, old_end-len, old_addr);
 
-	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end, MMU_MREMAP);
 
 	return len + old_addr - old_end;	/* how much done */
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index 1c08cbd..5504e31 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -834,7 +834,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 	pte_unmap_unlock(pte, ptl);
 
 	if (ret) {
-		mmu_notifier_invalidate_page(mm, address);
+		mmu_notifier_invalidate_page(mm, address, MMU_FILE_WB);
 		(*cleaned)++;
 	}
 out:
@@ -1117,6 +1117,27 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	int ret = SWAP_AGAIN;
 	enum ttu_flags flags = (enum ttu_flags)arg;
+	enum mmu_action action;
+
+	switch (TTU_ACTION(flags)) {
+	case TTU_VMSCAN:
+		action = MMU_VMSCAN;
+		break;
+	case TTU_POISON:
+		action = MMU_POISON;
+		break;
+	case TTU_MIGRATION:
+		action = MMU_MIGRATE;
+		break;
+	case TTU_MUNLOCK:
+		action = MMU_MUNLOCK;
+		break;
+	default:
+		/* Please report this ! */
+		BUG();
+		action = MMU_UNMAP;
+		break;
+	}
 
 	pte = page_check_address(page, mm, address, &ptl, 0);
 	if (!pte)
@@ -1222,7 +1243,7 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret != SWAP_FAIL)
-		mmu_notifier_invalidate_page(mm, address);
+		mmu_notifier_invalidate_page(mm, address, action);
 out:
 	return ret;
 
@@ -1276,7 +1297,8 @@ out_mlock:
 #define CLUSTER_MASK	(~(CLUSTER_SIZE - 1))
 
 static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
-		struct vm_area_struct *vma, struct page *check_page)
+				struct vm_area_struct *vma, struct page *check_page,
+				enum ttu_flags flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pmd_t *pmd;
@@ -1290,6 +1312,27 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 	unsigned long end;
 	int ret = SWAP_AGAIN;
 	int locked_vma = 0;
+	enum mmu_action action;
+
+	switch (TTU_ACTION(flags)) {
+	case TTU_VMSCAN:
+		action = MMU_VMSCAN;
+		break;
+	case TTU_POISON:
+		action = MMU_POISON;
+		break;
+	case TTU_MIGRATION:
+		action = MMU_MIGRATE;
+		break;
+	case TTU_MUNLOCK:
+		action = MMU_MUNLOCK;
+		break;
+	default:
+		/* Please report this ! */
+		BUG();
+		action = MMU_UNMAP;
+		break;
+	}
 
 	address = (vma->vm_start + cursor) & CLUSTER_MASK;
 	end = address + CLUSTER_SIZE;
@@ -1304,7 +1347,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 
 	mmun_start = address;
 	mmun_end   = end;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, action);
 
 	/*
 	 * If we can acquire the mmap_sem for read, and vma is VM_LOCKED,
@@ -1369,7 +1412,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 		(*mapcount)--;
 	}
 	pte_unmap_unlock(pte - 1, ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, action);
 	if (locked_vma)
 		up_read(&vma->vm_mm->mmap_sem);
 	return ret;
@@ -1425,7 +1468,7 @@ static int try_to_unmap_nonlinear(struct page *page,
 			while (cursor < max_nl_cursor &&
 				cursor < vma->vm_end - vma->vm_start) {
 				if (try_to_unmap_cluster(cursor, &mapcount,
-						vma, page) == SWAP_MLOCK)
+						vma, page, (enum ttu_flags)arg) == SWAP_MLOCK)
 					ret = SWAP_MLOCK;
 				cursor += CLUSTER_SIZE;
 				vma->vm_private_data = (void *) cursor;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fa70c6e..483f2e6 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -262,7 +262,8 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 
 static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
-					     unsigned long address)
+					     unsigned long address,
+					     enum mmu_action action)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush, idx;
@@ -301,7 +302,8 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 					struct mm_struct *mm,
 					unsigned long address,
-					pte_t pte)
+					pte_t pte,
+					enum mmu_action action)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int idx;
@@ -317,7 +319,8 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
 						    unsigned long start,
-						    unsigned long end)
+						    unsigned long end,
+						    enum mmu_action action)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush = 0, idx;
@@ -343,7 +346,8 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 						  struct mm_struct *mm,
 						  unsigned long start,
-						  unsigned long end)
+						  unsigned long end,
+						  enum mmu_action action)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 02/11] mmu_notifier: add action information to address invalidation.
@ 2014-05-02 13:52   ` j.glisse
  0 siblings, 0 replies; 107+ messages in thread
From: j.glisse @ 2014-05-02 13:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel; +Cc: Jérôme Glisse

From: Jérôme Glisse <jglisse@redhat.com>

The action information will be useful for new users of the mmu_notifier API.
The action argument differentiates between a vma disappearing, a page being
write protected or simply a page being unmapped. This allows new users to
take different actions: for instance, on unmap the resources used to track a
vma are still valid and should stay around if need be, while if the action
says that the vma is being destroyed, any resources used to track that vma
can be freed.
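
To illustrate the intent (this sketch is not part of the patch; the
my_mirror structure and its helpers are made-up names), a listener that
mirrors a process address space on a device mmu could dispatch on the
action in its range invalidation callback roughly like this:

	static void my_mirror_invalidate_range_start(struct mmu_notifier *mn,
						     struct mm_struct *mm,
						     unsigned long start,
						     unsigned long end,
						     enum mmu_action action)
	{
		struct my_mirror *mirror = container_of(mn, struct my_mirror, mn);

		switch (action) {
		case MMU_MPROT_RONLY:
		case MMU_KSM_RONLY:
			/* Pages stay valid, only downgrade the device mapping
			 * to read only.
			 */
			my_mirror_wprotect_range(mirror, start, end);
			break;
		case MMU_MUNMAP:
			/* The vma is going away, per-vma tracking can go too. */
			my_mirror_free_vma_tracking(mirror, start, end);
			/* fall through */
		default:
			/* Safe default: unmap the range from the device mmu. */
			my_mirror_unmap_range(mirror, start, end);
			break;
		}
	}

A listener that does not care about the extra information can simply
ignore the new argument, which is what the existing users converted by
this patch do.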

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 drivers/iommu/amd_iommu_v2.c |  14 ++++--
 drivers/xen/gntdev.c         |   9 ++--
 fs/proc/task_mmu.c           |   4 +-
 include/linux/hugetlb.h      |   4 +-
 include/linux/mmu_notifier.h | 108 ++++++++++++++++++++++++++++++++++---------
 kernel/events/uprobes.c      |   6 +--
 mm/filemap_xip.c             |   2 +-
 mm/fremap.c                  |   8 +++-
 mm/huge_memory.c             |  26 +++++------
 mm/hugetlb.c                 |  19 ++++----
 mm/ksm.c                     |  12 ++---
 mm/memory.c                  |  23 ++++-----
 mm/mempolicy.c               |   2 +-
 mm/migrate.c                 |   6 +--
 mm/mmu_notifier.c            |  26 +++++++----
 mm/mprotect.c                |  30 ++++++++----
 mm/mremap.c                  |   4 +-
 mm/rmap.c                    |  55 +++++++++++++++++++---
 virt/kvm/kvm_main.c          |  12 +++--
 19 files changed, 258 insertions(+), 112 deletions(-)

diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 5208828..71f8a1c 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -421,21 +421,25 @@ static int mn_clear_flush_young(struct mmu_notifier *mn,
 static void mn_change_pte(struct mmu_notifier *mn,
 			  struct mm_struct *mm,
 			  unsigned long address,
-			  pte_t pte)
+			  pte_t pte,
+			  enum mmu_action action)
 {
 	__mn_flush_page(mn, address);
 }
 
 static void mn_invalidate_page(struct mmu_notifier *mn,
 			       struct mm_struct *mm,
-			       unsigned long address)
+			       unsigned long address,
+			       enum mmu_action action)
 {
 	__mn_flush_page(mn, address);
 }
 
 static void mn_invalidate_range_start(struct mmu_notifier *mn,
 				      struct mm_struct *mm,
-				      unsigned long start, unsigned long end)
+				      unsigned long start,
+				      unsigned long end,
+				      enum mmu_action action)
 {
 	struct pasid_state *pasid_state;
 	struct device_state *dev_state;
@@ -449,7 +453,9 @@ static void mn_invalidate_range_start(struct mmu_notifier *mn,
 
 static void mn_invalidate_range_end(struct mmu_notifier *mn,
 				    struct mm_struct *mm,
-				    unsigned long start, unsigned long end)
+				    unsigned long start,
+				    unsigned long end,
+				    enum mmu_action action)
 {
 	struct pasid_state *pasid_state;
 	struct device_state *dev_state;
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 073b4a1..84aa5a7 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -428,7 +428,9 @@ static void unmap_if_in_range(struct grant_map *map,
 
 static void mn_invl_range_start(struct mmu_notifier *mn,
 				struct mm_struct *mm,
-				unsigned long start, unsigned long end)
+				unsigned long start,
+				unsigned long end,
+				enum mmu_action action)
 {
 	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
 	struct grant_map *map;
@@ -445,9 +447,10 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
 
 static void mn_invl_page(struct mmu_notifier *mn,
 			 struct mm_struct *mm,
-			 unsigned long address)
+			 unsigned long address,
+			 enum mmu_action action)
 {
-	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE);
+	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, action);
 }
 
 static void mn_release(struct mmu_notifier *mn,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index fa6d6a4..3c571ea 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -818,11 +818,11 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 		};
 		down_read(&mm->mmap_sem);
 		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_start(mm, 0, -1);
+			mmu_notifier_invalidate_range_start(mm, 0, -1, MMU_SOFT_DIRTY);
 		for (vma = mm->mmap; vma; vma = vma->vm_next)
 			walk_page_vma(vma, &clear_refs_walk);
 		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_end(mm, 0, -1);
+			mmu_notifier_invalidate_range_end(mm, 0, -1, MMU_SOFT_DIRTY);
 		flush_tlb_mm(mm);
 		up_read(&mm->mmap_sem);
 		mmput(mm);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 0683f55..1c36581 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -6,6 +6,7 @@
 #include <linux/fs.h>
 #include <linux/hugetlb_inline.h>
 #include <linux/cgroup.h>
+#include <linux/mmu_notifier.h>
 #include <linux/list.h>
 #include <linux/kref.h>
 
@@ -103,7 +104,8 @@ struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
 int pmd_huge(pmd_t pmd);
 int pud_huge(pud_t pmd);
 unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
-		unsigned long address, unsigned long end, pgprot_t newprot);
+		unsigned long address, unsigned long end, pgprot_t newprot,
+		enum mmu_action action);
 
 #else /* !CONFIG_HUGETLB_PAGE */
 
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index deca874..90b9105 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -9,6 +9,41 @@
 struct mmu_notifier;
 struct mmu_notifier_ops;
 
+/* The action reports finer information to the callback, allowing the event
+ * listener to take better action. For instance WP means that the pages are
+ * still valid and can be used as read only.
+ *
+ * UNMAP means the vma is still valid and that only pages are unmapped, thus
+ * they should no longer be read or written to.
+ *
+ * ZAP means the vma is disappearing and that any resources that were used to
+ * track this vma can be freed.
+ *
+ * If in doubt when adding a new notifier caller, use ZAP; it will always
+ * trigger the right thing but won't be optimal.
+ */
+enum mmu_action {
+	MMU_MPROT_NONE = 0,
+	MMU_MPROT_RONLY,
+	MMU_MPROT_RANDW,
+	MMU_MPROT_WONLY,
+	MMU_COW,
+	MMU_KSM,
+	MMU_KSM_RONLY,
+	MMU_SOFT_DIRTY,
+	MMU_UNMAP,
+	MMU_VMSCAN,
+	MMU_POISON,
+	MMU_MREMAP,
+	MMU_MUNMAP,
+	MMU_MUNLOCK,
+	MMU_MIGRATE,
+	MMU_FILE_WB,
+	MMU_FAULT_WP,
+	MMU_THP_SPLIT,
+	MMU_THP_FAULT_WP,
+};
+
 #ifdef CONFIG_MMU_NOTIFIER
 
 /*
@@ -79,7 +114,8 @@ struct mmu_notifier_ops {
 	void (*change_pte)(struct mmu_notifier *mn,
 			   struct mm_struct *mm,
 			   unsigned long address,
-			   pte_t pte);
+			   pte_t pte,
+			   enum mmu_action action);
 
 	/*
 	 * Before this is invoked any secondary MMU is still ok to
@@ -90,7 +126,8 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_page)(struct mmu_notifier *mn,
 				struct mm_struct *mm,
-				unsigned long address);
+				unsigned long address,
+				enum mmu_action action);
 
 	/*
 	 * invalidate_range_start() and invalidate_range_end() must be
@@ -137,10 +174,14 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_range_start)(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end);
+				       unsigned long start,
+				       unsigned long end,
+				       enum mmu_action action);
 	void (*invalidate_range_end)(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
-				     unsigned long start, unsigned long end);
+				     unsigned long start,
+				     unsigned long end,
+				     enum mmu_action action);
 };
 
 /*
@@ -177,13 +218,20 @@ extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
 extern int __mmu_notifier_test_young(struct mm_struct *mm,
 				     unsigned long address);
 extern void __mmu_notifier_change_pte(struct mm_struct *mm,
-				      unsigned long address, pte_t pte);
+				      unsigned long address,
+				      pte_t pte,
+				      enum mmu_action action);
 extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address);
+					   unsigned long address,
+					   enum mmu_action action);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end);
+						  unsigned long start,
+						  unsigned long end,
+						  enum mmu_action action);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end);
+						unsigned long start,
+						unsigned long end,
+						enum mmu_action action);
 
 static inline void mmu_notifier_release(struct mm_struct *mm)
 {
@@ -208,31 +256,38 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_change_pte(struct mm_struct *mm,
-					   unsigned long address, pte_t pte)
+					   unsigned long address,
+					   pte_t pte,
+					   enum mmu_action action)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_change_pte(mm, address, pte);
+		__mmu_notifier_change_pte(mm, address, pte, action);
 }
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address)
+						unsigned long address,
+						enum mmu_action action)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_page(mm, address);
+		__mmu_notifier_invalidate_page(mm, address, action);
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						       unsigned long start,
+						       unsigned long end,
+						       enum mmu_action action)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_start(mm, start, end);
+		__mmu_notifier_invalidate_range_start(mm, start, end, action);
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						     unsigned long start,
+						     unsigned long end,
+						     enum mmu_action action)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_end(mm, start, end);
+		__mmu_notifier_invalidate_range_end(mm, start, end, action);
 }
 
 static inline void mmu_notifier_mm_init(struct mm_struct *mm)
@@ -278,13 +333,13 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
  * old page would remain mapped readonly in the secondary MMUs after the new
  * page is already writable by some CPU through the primary MMU.
  */
-#define set_pte_at_notify(__mm, __address, __ptep, __pte)		\
+#define set_pte_at_notify(__mm, __address, __ptep, __pte, __action)	\
 ({									\
 	struct mm_struct *___mm = __mm;					\
 	unsigned long ___address = __address;				\
 	pte_t ___pte = __pte;						\
 									\
-	mmu_notifier_change_pte(___mm, ___address, ___pte);		\
+	mmu_notifier_change_pte(___mm, ___address, ___pte, __action);	\
 	set_pte_at(___mm, ___address, __ptep, ___pte);			\
 })
 
@@ -307,22 +362,29 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_change_pte(struct mm_struct *mm,
-					   unsigned long address, pte_t pte)
+					   unsigned long address,
+					   pte_t pte,
+					   enum mmu_action action)
 {
 }
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address)
+						unsigned long address,
+						enum mmu_action action)
 {
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						       unsigned long start,
+						       unsigned long end,
+						       enum mmu_action action)
 {
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						     unsigned long start,
+						     unsigned long end,
+						     enum mmu_action action)
 {
 }
 
@@ -336,7 +398,7 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
 
 #define ptep_clear_flush_young_notify ptep_clear_flush_young
 #define pmdp_clear_flush_young_notify pmdp_clear_flush_young
-#define set_pte_at_notify set_pte_at
+#define set_pte_at_notify(__mm, __address, __ptep, __pte, __action) set_pte_at(__mm, __address, __ptep, __pte)
 
 #endif /* CONFIG_MMU_NOTIFIER */
 
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index d1edc5e..9acd357 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -170,7 +170,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	/* For try_to_free_swap() and munlock_vma_page() below */
 	lock_page(page);
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_UNMAP);
 	err = -EAGAIN;
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -186,7 +186,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	flush_cache_page(vma, addr, pte_pfn(*ptep));
 	ptep_clear_flush(vma, addr, ptep);
-	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot), MMU_UNMAP);
 
 	page_remove_rmap(page);
 	if (!page_mapped(page))
@@ -199,7 +199,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	err = 0;
  unlock:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_UNMAP);
 	unlock_page(page);
 	return err;
 }
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index d8d9fe3..d529ab9 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -198,7 +198,7 @@ retry:
 			BUG_ON(pte_dirty(pteval));
 			pte_unmap_unlock(pte, ptl);
 			/* must invalidate_page _before_ freeing the page */
-			mmu_notifier_invalidate_page(mm, address);
+			mmu_notifier_invalidate_page(mm, address, MMU_UNMAP);
 			page_cache_release(page);
 		}
 	}
diff --git a/mm/fremap.c b/mm/fremap.c
index 2c5646f..f4a67e0 100644
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -254,9 +254,13 @@ get_write_lock:
 		vma->vm_flags = vm_flags;
 	}
 
-	mmu_notifier_invalidate_range_start(mm, start, start + size);
+	/* Consider it a ZAP operation for now. It could be seen as an unmap,
+	 * but remapping is trickier as it can change the vma to non linear and
+	 * thus trigger side effects.
+	 */
+	mmu_notifier_invalidate_range_start(mm, start, start + size, MMU_MUNMAP);
 	err = vma->vm_ops->remap_pages(vma, start, size, pgoff);
-	mmu_notifier_invalidate_range_end(mm, start, start + size);
+	mmu_notifier_invalidate_range_end(mm, start, start + size, MMU_MUNMAP);
 
 	/*
 	 * We can't clear VM_NONLINEAR because we'd have to do
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c5ff461..4ad9b73 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -993,7 +993,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_THP_FAULT_WP);
 
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, orig_pmd)))
@@ -1023,7 +1023,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	page_remove_rmap(page);
 	spin_unlock(ptl);
 
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_FAULT_WP);
 
 	ret |= VM_FAULT_WRITE;
 	put_page(page);
@@ -1033,7 +1033,7 @@ out:
 
 out_free_pages:
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_FAULT_WP);
 	mem_cgroup_uncharge_start();
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		mem_cgroup_uncharge_page(pages[i]);
@@ -1123,7 +1123,7 @@ alloc:
 
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_THP_FAULT_WP);
 
 	spin_lock(ptl);
 	if (page)
@@ -1153,7 +1153,7 @@ alloc:
 	}
 	spin_unlock(ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_FAULT_WP);
 out:
 	return ret;
 out_unlock:
@@ -1588,7 +1588,7 @@ static int __split_huge_page_splitting(struct page *page,
 	const unsigned long mmun_start = address;
 	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
 	pmd = page_check_address_pmd(page, mm, address,
 			PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
 	if (pmd) {
@@ -1603,7 +1603,7 @@ static int __split_huge_page_splitting(struct page *page,
 		ret = 1;
 		spin_unlock(ptl);
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
 
 	return ret;
 }
@@ -2402,7 +2402,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 
 	mmun_start = address;
 	mmun_end   = address + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
 	/*
 	 * After this gup_fast can't run anymore. This also removes
@@ -2412,7 +2412,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 */
 	_pmd = pmdp_clear_flush(vma, address, pmd);
 	spin_unlock(pmd_ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
 
 	spin_lock(pte_ptl);
 	isolated = __collapse_huge_page_isolate(vma, address, pte);
@@ -2801,24 +2801,24 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
 again:
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_trans_huge(*pmd))) {
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
 		return;
 	}
 	if (is_huge_zero_pmd(*pmd)) {
 		__split_huge_zero_page_pmd(vma, haddr, pmd);
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
 		return;
 	}
 	page = pmd_page(*pmd);
 	VM_BUG_ON_PAGE(!page_count(page), page);
 	get_page(page);
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
 
 	split_huge_page(page);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e73f7bc..8006472 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2540,7 +2540,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	mmun_start = vma->vm_start;
 	mmun_end = vma->vm_end;
 	if (cow)
-		mmu_notifier_invalidate_range_start(src, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_start(src, mmun_start, mmun_end, MMU_COW);
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
 		spinlock_t *src_ptl, *dst_ptl;
@@ -2574,7 +2574,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	}
 
 	if (cow)
-		mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end, MMU_COW);
 
 	return ret;
 }
@@ -2626,7 +2626,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	BUG_ON(end & ~huge_page_mask(h));
 
 	tlb_start_vma(tlb, vma);
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_UNMAP);
 again:
 	for (address = start; address < end; address += sz) {
 		ptep = huge_pte_offset(mm, address);
@@ -2697,7 +2697,7 @@ unlock:
 		if (address < end && !ref_page)
 			goto again;
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_UNMAP);
 	tlb_end_vma(tlb, vma);
 }
 
@@ -2884,7 +2884,7 @@ retry_avoidcopy:
 
 	mmun_start = address & huge_page_mask(h);
 	mmun_end = mmun_start + huge_page_size(h);
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_UNMAP);
 	/*
 	 * Retake the page table lock to check for racing updates
 	 * before the page tables are altered
@@ -2904,7 +2904,7 @@ retry_avoidcopy:
 		new_page = old_page;
 	}
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_UNMAP);
 	page_cache_release(new_page);
 	page_cache_release(old_page);
 
@@ -3329,7 +3329,8 @@ same_page:
 }
 
 unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
-		unsigned long address, unsigned long end, pgprot_t newprot)
+		unsigned long address, unsigned long end, pgprot_t newprot,
+		enum mmu_action action)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long start = address;
@@ -3341,7 +3342,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	BUG_ON(address >= end);
 	flush_cache_range(vma, address, end);
 
-	mmu_notifier_invalidate_range_start(mm, start, end);
+	mmu_notifier_invalidate_range_start(mm, start, end, action);
 	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
 	for (; address < end; address += huge_page_size(h)) {
 		spinlock_t *ptl;
@@ -3371,7 +3372,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	 */
 	flush_tlb_range(vma, start, end);
 	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
-	mmu_notifier_invalidate_range_end(mm, start, end);
+	mmu_notifier_invalidate_range_end(mm, start, end, action);
 
 	return pages << h->order;
 }
diff --git a/mm/ksm.c b/mm/ksm.c
index 68710e8..6a32bc4 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -872,7 +872,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 
 	mmun_start = addr;
 	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_KSM_RONLY);
 
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -904,7 +904,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 		if (pte_dirty(entry))
 			set_page_dirty(page);
 		entry = pte_mkclean(pte_wrprotect(entry));
-		set_pte_at_notify(mm, addr, ptep, entry);
+		set_pte_at_notify(mm, addr, ptep, entry, MMU_KSM_RONLY);
 	}
 	*orig_pte = *ptep;
 	err = 0;
@@ -912,7 +912,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 out_unlock:
 	pte_unmap_unlock(ptep, ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_KSM_RONLY);
 out:
 	return err;
 }
@@ -949,7 +949,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 
 	mmun_start = addr;
 	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_KSM);
 
 	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	if (!pte_same(*ptep, orig_pte)) {
@@ -962,7 +962,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 
 	flush_cache_page(vma, addr, pte_pfn(*ptep));
 	ptep_clear_flush(vma, addr, ptep);
-	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot), MMU_KSM);
 
 	page_remove_rmap(page);
 	if (!page_mapped(page))
@@ -972,7 +972,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	pte_unmap_unlock(ptep, ptl);
 	err = 0;
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_KSM);
 out:
 	return err;
 }
diff --git a/mm/memory.c b/mm/memory.c
index b6b9c6e..69286e2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1055,7 +1055,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	mmun_end   = end;
 	if (is_cow)
 		mmu_notifier_invalidate_range_start(src_mm, mmun_start,
-						    mmun_end);
+						    mmun_end, MMU_COW);
 
 	ret = 0;
 	dst_pgd = pgd_offset(dst_mm, addr);
@@ -1072,7 +1072,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
 
 	if (is_cow)
-		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end,
+						  MMU_COW);
 	return ret;
 }
 
@@ -1378,10 +1379,10 @@ void unmap_vmas(struct mmu_gather *tlb,
 {
 	struct mm_struct *mm = vma->vm_mm;
 
-	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
+	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr, MMU_MUNMAP);
 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
 		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
-	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
+	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr, MMU_MUNMAP);
 }
 
 /**
@@ -1403,10 +1404,10 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, start, end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, start, end);
+	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MUNMAP);
 	for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
 		unmap_single_vma(&tlb, vma, start, end, details);
-	mmu_notifier_invalidate_range_end(mm, start, end);
+	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MUNMAP);
 	tlb_finish_mmu(&tlb, start, end);
 }
 
@@ -1429,9 +1430,9 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, address, end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, address, end);
+	mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
 	unmap_single_vma(&tlb, vma, address, end, details);
-	mmu_notifier_invalidate_range_end(mm, address, end);
+	mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
 	tlb_finish_mmu(&tlb, address, end);
 }
 
@@ -2850,7 +2851,7 @@ gotten:
 
 	mmun_start  = address & PAGE_MASK;
 	mmun_end    = mmun_start + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_FAULT_WP);
 
 	/*
 	 * Re-check the pte - we dropped the lock
@@ -2880,7 +2881,7 @@ gotten:
 		 * mmu page tables (such as kvm shadow page tables), we want the
 		 * new page to be mapped directly into the secondary page table.
 		 */
-		set_pte_at_notify(mm, address, page_table, entry);
+		set_pte_at_notify(mm, address, page_table, entry, MMU_FAULT_WP);
 		update_mmu_cache(vma, address, page_table);
 		if (old_page) {
 			/*
@@ -2919,7 +2920,7 @@ gotten:
 unlock:
 	pte_unmap_unlock(page_table, ptl);
 	if (mmun_end > mmun_start)
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_FAULT_WP);
 	if (old_page) {
 		/*
 		 * Don't let another task, with possibly unlocked vma,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index ac621fa..e42f4b7 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -561,7 +561,7 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
 {
 	int nr_updated;
 
-	nr_updated = change_protection(vma, addr, end, vma->vm_page_prot, 0, 1);
+	nr_updated = change_protection(vma, addr, end, vma->vm_page_prot, 0, 1, 0);
 	if (nr_updated)
 		count_vm_numa_events(NUMA_PTE_UPDATES, nr_updated);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 6247be7..1accb9b 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1804,12 +1804,12 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	WARN_ON(PageLRU(new_page));
 
 	/* Recheck the target PMD */
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_MIGRATE);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
 fail_putback:
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_MIGRATE);
 
 		/* Reverse changes made by migrate_page_copy() */
 		if (TestClearPageActive(new_page))
@@ -1875,7 +1875,7 @@ fail_putback:
 	 */
 	mem_cgroup_end_migration(memcg, page, new_page, true);
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_MIGRATE);
 
 	/* Take an "isolate" reference and put new page on the LRU. */
 	get_page(new_page);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 41cefdf..a906744 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -122,8 +122,10 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
 	return young;
 }
 
-void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
-			       pte_t pte)
+void __mmu_notifier_change_pte(struct mm_struct *mm,
+			       unsigned long address,
+			       pte_t pte,
+			       enum mmu_action action)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -131,13 +133,14 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->change_pte)
-			mn->ops->change_pte(mn, mm, address, pte);
+			mn->ops->change_pte(mn, mm, address, pte, action);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 
 void __mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address)
+				    unsigned long address,
+				    enum mmu_action action)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -145,13 +148,16 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_page)
-			mn->ops->invalidate_page(mn, mm, address);
+			mn->ops->invalidate_page(mn, mm, address, action);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 
 void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+					   unsigned long start,
+					   unsigned long end,
+					   enum mmu_action action)
+
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -159,14 +165,16 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_start)
-			mn->ops->invalidate_range_start(mn, mm, start, end);
+			mn->ops->invalidate_range_start(mn, mm, start, end, action);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
 
 void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+					 unsigned long start,
+					 unsigned long end,
+					 enum mmu_action action)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -174,7 +182,7 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_end)
-			mn->ops->invalidate_range_end(mn, mm, start, end);
+			mn->ops->invalidate_range_end(mn, mm, start, end, action);
 	}
 	srcu_read_unlock(&srcu, id);
 }
diff --git a/mm/mprotect.c b/mm/mprotect.c
index c43d557..6c2846f 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -137,7 +137,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		pud_t *pud, unsigned long addr, unsigned long end,
-		pgprot_t newprot, int dirty_accountable, int prot_numa)
+		pgprot_t newprot, int dirty_accountable, int prot_numa,
+		enum mmu_action action)
 {
 	pmd_t *pmd;
 	struct mm_struct *mm = vma->vm_mm;
@@ -157,7 +158,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		/* invoke the mmu notifier if the pmd is populated */
 		if (!mni_start) {
 			mni_start = addr;
-			mmu_notifier_invalidate_range_start(mm, mni_start, end);
+			mmu_notifier_invalidate_range_start(mm, mni_start, end, action);
 		}
 
 		if (pmd_trans_huge(*pmd)) {
@@ -185,7 +186,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	} while (pmd++, addr = next, addr != end);
 
 	if (mni_start)
-		mmu_notifier_invalidate_range_end(mm, mni_start, end);
+		mmu_notifier_invalidate_range_end(mm, mni_start, end, action);
 
 	if (nr_huge_updates)
 		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
@@ -194,7 +195,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 
 static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 		pgd_t *pgd, unsigned long addr, unsigned long end,
-		pgprot_t newprot, int dirty_accountable, int prot_numa)
+		pgprot_t newprot, int dirty_accountable, int prot_numa,
+		enum mmu_action action)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -206,7 +208,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		pages += change_pmd_range(vma, pud, addr, next, newprot,
-				 dirty_accountable, prot_numa);
+				 dirty_accountable, prot_numa, action);
 	} while (pud++, addr = next, addr != end);
 
 	return pages;
@@ -214,7 +216,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 
 static unsigned long change_protection_range(struct vm_area_struct *vma,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
-		int dirty_accountable, int prot_numa)
+		int dirty_accountable, int prot_numa, enum mmu_action action)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
@@ -231,7 +233,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		pages += change_pud_range(vma, pgd, addr, next, newprot,
-				 dirty_accountable, prot_numa);
+				 dirty_accountable, prot_numa, action);
 	} while (pgd++, addr = next, addr != end);
 
 	/* Only flush the TLB if we actually modified any entries: */
@@ -247,11 +249,21 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 		       int dirty_accountable, int prot_numa)
 {
 	unsigned long pages;
+	enum mmu_action action = MMU_MPROT_NONE;
+
+	/* At this points vm_flags is updated. */
+	if ((vma->vm_flags & VM_READ) && (vma->vm_flags & VM_WRITE)) {
+		action = MMU_MPROT_RANDW;
+	} else if (vma->vm_flags & VM_WRITE) {
+		action = MMU_MPROT_WONLY;
+	} else if (vma->vm_flags & VM_READ) {
+		action = MMU_MPROT_RONLY;
+	}
 
 	if (is_vm_hugetlb_page(vma))
-		pages = hugetlb_change_protection(vma, start, end, newprot);
+		pages = hugetlb_change_protection(vma, start, end, newprot, action);
 	else
-		pages = change_protection_range(vma, start, end, newprot, dirty_accountable, prot_numa);
+		pages = change_protection_range(vma, start, end, newprot, dirty_accountable, prot_numa, action);
 
 	return pages;
 }
diff --git a/mm/mremap.c b/mm/mremap.c
index 0843feb..8c00e98 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -177,7 +177,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 
 	mmun_start = old_addr;
 	mmun_end   = old_end;
-	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end, MMU_MREMAP);
 
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
 		cond_resched();
@@ -221,7 +221,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	if (likely(need_flush))
 		flush_tlb_range(vma, old_end-len, old_addr);
 
-	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end, MMU_MREMAP);
 
 	return len + old_addr - old_end;	/* how much done */
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index 1c08cbd..5504e31 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -834,7 +834,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 	pte_unmap_unlock(pte, ptl);
 
 	if (ret) {
-		mmu_notifier_invalidate_page(mm, address);
+		mmu_notifier_invalidate_page(mm, address, MMU_FILE_WB);
 		(*cleaned)++;
 	}
 out:
@@ -1117,6 +1117,27 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	int ret = SWAP_AGAIN;
 	enum ttu_flags flags = (enum ttu_flags)arg;
+	enum mmu_action action;
+
+	switch (TTU_ACTION(flags)) {
+	case TTU_VMSCAN:
+		action = MMU_VMSCAN;
+		break;
+	case TTU_POISON:
+		action = MMU_POISON;
+		break;
+	case TTU_MIGRATION:
+		action = MMU_MIGRATE;
+		break;
+	case TTU_MUNLOCK:
+		action = MMU_MUNLOCK;
+		break;
+	default:
+		/* Please report this ! */
+		BUG();
+		action = MMU_UNMAP;
+		break;
+	}
 
 	pte = page_check_address(page, mm, address, &ptl, 0);
 	if (!pte)
@@ -1222,7 +1243,7 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret != SWAP_FAIL)
-		mmu_notifier_invalidate_page(mm, address);
+		mmu_notifier_invalidate_page(mm, address, action);
 out:
 	return ret;
 
@@ -1276,7 +1297,8 @@ out_mlock:
 #define CLUSTER_MASK	(~(CLUSTER_SIZE - 1))
 
 static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
-		struct vm_area_struct *vma, struct page *check_page)
+				struct vm_area_struct *vma, struct page *check_page,
+				enum ttu_flags flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pmd_t *pmd;
@@ -1290,6 +1312,27 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 	unsigned long end;
 	int ret = SWAP_AGAIN;
 	int locked_vma = 0;
+	enum mmu_action action;
+
+	switch (TTU_ACTION(flags)) {
+	case TTU_VMSCAN:
+		action = MMU_VMSCAN;
+		break;
+	case TTU_POISON:
+		action = MMU_POISON;
+		break;
+	case TTU_MIGRATION:
+		action = MMU_MIGRATE;
+		break;
+	case TTU_MUNLOCK:
+		action = MMU_MUNLOCK;
+		break;
+	default:
+		/* Please report this ! */
+		BUG();
+		action = MMU_UNMAP;
+		break;
+	}
 
 	address = (vma->vm_start + cursor) & CLUSTER_MASK;
 	end = address + CLUSTER_SIZE;
@@ -1304,7 +1347,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 
 	mmun_start = address;
 	mmun_end   = end;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, action);
 
 	/*
 	 * If we can acquire the mmap_sem for read, and vma is VM_LOCKED,
@@ -1369,7 +1412,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 		(*mapcount)--;
 	}
 	pte_unmap_unlock(pte - 1, ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, action);
 	if (locked_vma)
 		up_read(&vma->vm_mm->mmap_sem);
 	return ret;
@@ -1425,7 +1468,7 @@ static int try_to_unmap_nonlinear(struct page *page,
 			while (cursor < max_nl_cursor &&
 				cursor < vma->vm_end - vma->vm_start) {
 				if (try_to_unmap_cluster(cursor, &mapcount,
-						vma, page) == SWAP_MLOCK)
+						vma, page, (enum ttu_flags)arg) == SWAP_MLOCK)
 					ret = SWAP_MLOCK;
 				cursor += CLUSTER_SIZE;
 				vma->vm_private_data = (void *) cursor;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fa70c6e..483f2e6 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -262,7 +262,8 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 
 static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
-					     unsigned long address)
+					     unsigned long address,
+					     enum mmu_action action)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush, idx;
@@ -301,7 +302,8 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 					struct mm_struct *mm,
 					unsigned long address,
-					pte_t pte)
+					pte_t pte,
+					enum mmu_action action)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int idx;
@@ -317,7 +319,8 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
 						    unsigned long start,
-						    unsigned long end)
+						    unsigned long end,
+						    enum mmu_action action)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush = 0, idx;
@@ -343,7 +346,8 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 						  struct mm_struct *mm,
 						  unsigned long start,
-						  unsigned long end)
+						  unsigned long end,
+						  enum mmu_action action)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 02/11] mmu_notifier: add action information to address invalidation.
@ 2014-05-02 13:52   ` j.glisse
  0 siblings, 0 replies; 107+ messages in thread
From: j.glisse @ 2014-05-02 13:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel; +Cc: Jérôme Glisse

From: Jérôme Glisse <jglisse@redhat.com>

The action information will be useful to new users of the mmu_notifier
API. The action argument differentiates between a vma disappearing, a
page being write protected, or simply a page being unmapped. This
allows a new user to take a different action in each case: on unmap,
the resources used to track a vma are still valid and should stay
around if need be, while if the action says the vma is being destroyed,
any resources used to track that vma can be freed.
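
As an illustration, here is a minimal sketch of a listener keying off the
action (not part of this patch). It assumes the callback signature and the
enum mmu_action values introduced by this series, while struct my_mirror
and the my_mirror_*_range() helpers are invented placeholders standing in
for driver specific code:

#include <linux/kernel.h>
#include <linux/mmu_notifier.h>

/* Hypothetical per-device state wrapping the notifier. */
struct my_mirror {
	struct mmu_notifier mn;
	/* device page table handle, locks, ... would live here */
};

/* Stubs standing in for driver specific work (invented for this sketch). */
static void my_mirror_ro_range(struct my_mirror *mirror,
			       unsigned long start, unsigned long end)
{
}

static void my_mirror_zap_range(struct my_mirror *mirror,
				unsigned long start, unsigned long end)
{
}

static void my_invalidate_range_start(struct mmu_notifier *mn,
				      struct mm_struct *mm,
				      unsigned long start,
				      unsigned long end,
				      enum mmu_action action)
{
	struct my_mirror *mirror = container_of(mn, struct my_mirror, mn);

	switch (action) {
	case MMU_MPROT_RONLY:
	case MMU_KSM_RONLY:
	case MMU_FILE_WB:
		/* Pages stay valid, the device only loses write access. */
		my_mirror_ro_range(mirror, start, end);
		break;
	default:
		/* Unmap, munmap, migrate, ...: drop the mirrored range. */
		my_mirror_zap_range(mirror, start, end);
		break;
	}
}

static const struct mmu_notifier_ops my_mirror_ops = {
	.invalidate_range_start = my_invalidate_range_start,
};

Registering such a listener would still go through mmu_notifier_register()
as today; only the callback signatures change.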

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 drivers/iommu/amd_iommu_v2.c |  14 ++++--
 drivers/xen/gntdev.c         |   9 ++--
 fs/proc/task_mmu.c           |   4 +-
 include/linux/hugetlb.h      |   4 +-
 include/linux/mmu_notifier.h | 108 ++++++++++++++++++++++++++++++++++---------
 kernel/events/uprobes.c      |   6 +--
 mm/filemap_xip.c             |   2 +-
 mm/fremap.c                  |   8 +++-
 mm/huge_memory.c             |  26 +++++------
 mm/hugetlb.c                 |  19 ++++----
 mm/ksm.c                     |  12 ++---
 mm/memory.c                  |  23 ++++-----
 mm/mempolicy.c               |   2 +-
 mm/migrate.c                 |   6 +--
 mm/mmu_notifier.c            |  26 +++++++----
 mm/mprotect.c                |  30 ++++++++----
 mm/mremap.c                  |   4 +-
 mm/rmap.c                    |  55 +++++++++++++++++++---
 virt/kvm/kvm_main.c          |  12 +++--
 19 files changed, 258 insertions(+), 112 deletions(-)

diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 5208828..71f8a1c 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -421,21 +421,25 @@ static int mn_clear_flush_young(struct mmu_notifier *mn,
 static void mn_change_pte(struct mmu_notifier *mn,
 			  struct mm_struct *mm,
 			  unsigned long address,
-			  pte_t pte)
+			  pte_t pte,
+			  enum mmu_action action)
 {
 	__mn_flush_page(mn, address);
 }
 
 static void mn_invalidate_page(struct mmu_notifier *mn,
 			       struct mm_struct *mm,
-			       unsigned long address)
+			       unsigned long address,
+			       enum mmu_action action)
 {
 	__mn_flush_page(mn, address);
 }
 
 static void mn_invalidate_range_start(struct mmu_notifier *mn,
 				      struct mm_struct *mm,
-				      unsigned long start, unsigned long end)
+				      unsigned long start,
+				      unsigned long end,
+				      enum mmu_action action)
 {
 	struct pasid_state *pasid_state;
 	struct device_state *dev_state;
@@ -449,7 +453,9 @@ static void mn_invalidate_range_start(struct mmu_notifier *mn,
 
 static void mn_invalidate_range_end(struct mmu_notifier *mn,
 				    struct mm_struct *mm,
-				    unsigned long start, unsigned long end)
+				    unsigned long start,
+				    unsigned long end,
+				    enum mmu_action action)
 {
 	struct pasid_state *pasid_state;
 	struct device_state *dev_state;
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 073b4a1..84aa5a7 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -428,7 +428,9 @@ static void unmap_if_in_range(struct grant_map *map,
 
 static void mn_invl_range_start(struct mmu_notifier *mn,
 				struct mm_struct *mm,
-				unsigned long start, unsigned long end)
+				unsigned long start,
+				unsigned long end,
+				enum mmu_action action)
 {
 	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
 	struct grant_map *map;
@@ -445,9 +447,10 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
 
 static void mn_invl_page(struct mmu_notifier *mn,
 			 struct mm_struct *mm,
-			 unsigned long address)
+			 unsigned long address,
+			 enum mmu_action action)
 {
-	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE);
+	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, action);
 }
 
 static void mn_release(struct mmu_notifier *mn,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index fa6d6a4..3c571ea 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -818,11 +818,11 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 		};
 		down_read(&mm->mmap_sem);
 		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_start(mm, 0, -1);
+			mmu_notifier_invalidate_range_start(mm, 0, -1, MMU_SOFT_DIRTY);
 		for (vma = mm->mmap; vma; vma = vma->vm_next)
 			walk_page_vma(vma, &clear_refs_walk);
 		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_end(mm, 0, -1);
+			mmu_notifier_invalidate_range_end(mm, 0, -1, MMU_SOFT_DIRTY);
 		flush_tlb_mm(mm);
 		up_read(&mm->mmap_sem);
 		mmput(mm);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 0683f55..1c36581 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -6,6 +6,7 @@
 #include <linux/fs.h>
 #include <linux/hugetlb_inline.h>
 #include <linux/cgroup.h>
+#include <linux/mmu_notifier.h>
 #include <linux/list.h>
 #include <linux/kref.h>
 
@@ -103,7 +104,8 @@ struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
 int pmd_huge(pmd_t pmd);
 int pud_huge(pud_t pmd);
 unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
-		unsigned long address, unsigned long end, pgprot_t newprot);
+		unsigned long address, unsigned long end, pgprot_t newprot,
+		enum mmu_action action);
 
 #else /* !CONFIG_HUGETLB_PAGE */
 
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index deca874..90b9105 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -9,6 +9,41 @@
 struct mmu_notifier;
 struct mmu_notifier_ops;
 
+/* The action reports finer information to the callback, allowing the listener
+ * to take a better decision. For instance WP means that the pages are still
+ * valid and can be used as read only.
+ *
+ * UNMAP means the vma is still valid and that only pages are unmapped; thus
+ * they should no longer be read or written to.
+ *
+ * ZAP means the vma is disappearing and that any resources that were used to
+ * track this vma can be freed.
+ *
+ * When in doubt, a new notifier caller should use ZAP: it will always trigger
+ * the right thing but won't be optimal.
+ */
+enum mmu_action {
+	MMU_MPROT_NONE = 0,
+	MMU_MPROT_RONLY,
+	MMU_MPROT_RANDW,
+	MMU_MPROT_WONLY,
+	MMU_COW,
+	MMU_KSM,
+	MMU_KSM_RONLY,
+	MMU_SOFT_DIRTY,
+	MMU_UNMAP,
+	MMU_VMSCAN,
+	MMU_POISON,
+	MMU_MREMAP,
+	MMU_MUNMAP,
+	MMU_MUNLOCK,
+	MMU_MIGRATE,
+	MMU_FILE_WB,
+	MMU_FAULT_WP,
+	MMU_THP_SPLIT,
+	MMU_THP_FAULT_WP,
+};
+
 #ifdef CONFIG_MMU_NOTIFIER
 
 /*
@@ -79,7 +114,8 @@ struct mmu_notifier_ops {
 	void (*change_pte)(struct mmu_notifier *mn,
 			   struct mm_struct *mm,
 			   unsigned long address,
-			   pte_t pte);
+			   pte_t pte,
+			   enum mmu_action action);
 
 	/*
 	 * Before this is invoked any secondary MMU is still ok to
@@ -90,7 +126,8 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_page)(struct mmu_notifier *mn,
 				struct mm_struct *mm,
-				unsigned long address);
+				unsigned long address,
+				enum mmu_action action);
 
 	/*
 	 * invalidate_range_start() and invalidate_range_end() must be
@@ -137,10 +174,14 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_range_start)(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end);
+				       unsigned long start,
+				       unsigned long end,
+				       enum mmu_action action);
 	void (*invalidate_range_end)(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
-				     unsigned long start, unsigned long end);
+				     unsigned long start,
+				     unsigned long end,
+				     enum mmu_action action);
 };
 
 /*
@@ -177,13 +218,20 @@ extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
 extern int __mmu_notifier_test_young(struct mm_struct *mm,
 				     unsigned long address);
 extern void __mmu_notifier_change_pte(struct mm_struct *mm,
-				      unsigned long address, pte_t pte);
+				      unsigned long address,
+				      pte_t pte,
+				      enum mmu_action action);
 extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address);
+					   unsigned long address,
+					   enum mmu_action action);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end);
+						  unsigned long start,
+						  unsigned long end,
+						  enum mmu_action action);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end);
+						unsigned long start,
+						unsigned long end,
+						enum mmu_action action);
 
 static inline void mmu_notifier_release(struct mm_struct *mm)
 {
@@ -208,31 +256,38 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_change_pte(struct mm_struct *mm,
-					   unsigned long address, pte_t pte)
+					   unsigned long address,
+					   pte_t pte,
+					   enum mmu_action action)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_change_pte(mm, address, pte);
+		__mmu_notifier_change_pte(mm, address, pte, action);
 }
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address)
+						unsigned long address,
+						enum mmu_action action)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_page(mm, address);
+		__mmu_notifier_invalidate_page(mm, address, action);
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						       unsigned long start,
+						       unsigned long end,
+						       enum mmu_action action)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_start(mm, start, end);
+		__mmu_notifier_invalidate_range_start(mm, start, end, action);
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						     unsigned long start,
+						     unsigned long end,
+						     enum mmu_action action)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_end(mm, start, end);
+		__mmu_notifier_invalidate_range_end(mm, start, end, action);
 }
 
 static inline void mmu_notifier_mm_init(struct mm_struct *mm)
@@ -278,13 +333,13 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
  * old page would remain mapped readonly in the secondary MMUs after the new
  * page is already writable by some CPU through the primary MMU.
  */
-#define set_pte_at_notify(__mm, __address, __ptep, __pte)		\
+#define set_pte_at_notify(__mm, __address, __ptep, __pte, __action)	\
 ({									\
 	struct mm_struct *___mm = __mm;					\
 	unsigned long ___address = __address;				\
 	pte_t ___pte = __pte;						\
 									\
-	mmu_notifier_change_pte(___mm, ___address, ___pte);		\
+	mmu_notifier_change_pte(___mm, ___address, ___pte, __action);	\
 	set_pte_at(___mm, ___address, __ptep, ___pte);			\
 })
 
@@ -307,22 +362,29 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_change_pte(struct mm_struct *mm,
-					   unsigned long address, pte_t pte)
+					   unsigned long address,
+					   pte_t pte,
+					   enum mmu_action action)
 {
 }
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address)
+						unsigned long address,
+						enum mmu_action action)
 {
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						       unsigned long start,
+						       unsigned long end,
+						       enum mmu_action action)
 {
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						     unsigned long start,
+						     unsigned long end,
+						     enum mmu_action action)
 {
 }
 
@@ -336,7 +398,7 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
 
 #define ptep_clear_flush_young_notify ptep_clear_flush_young
 #define pmdp_clear_flush_young_notify pmdp_clear_flush_young
-#define set_pte_at_notify set_pte_at
+#define set_pte_at_notify(__mm, __address, __ptep, __pte, __action) set_pte_at(__mm, __address, __ptep, __pte)
 
 #endif /* CONFIG_MMU_NOTIFIER */
 
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index d1edc5e..9acd357 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -170,7 +170,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	/* For try_to_free_swap() and munlock_vma_page() below */
 	lock_page(page);
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_UNMAP);
 	err = -EAGAIN;
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -186,7 +186,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	flush_cache_page(vma, addr, pte_pfn(*ptep));
 	ptep_clear_flush(vma, addr, ptep);
-	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot), MMU_UNMAP);
 
 	page_remove_rmap(page);
 	if (!page_mapped(page))
@@ -199,7 +199,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	err = 0;
  unlock:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_UNMAP);
 	unlock_page(page);
 	return err;
 }
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index d8d9fe3..d529ab9 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -198,7 +198,7 @@ retry:
 			BUG_ON(pte_dirty(pteval));
 			pte_unmap_unlock(pte, ptl);
 			/* must invalidate_page _before_ freeing the page */
-			mmu_notifier_invalidate_page(mm, address);
+			mmu_notifier_invalidate_page(mm, address, MMU_UNMAP);
 			page_cache_release(page);
 		}
 	}
diff --git a/mm/fremap.c b/mm/fremap.c
index 2c5646f..f4a67e0 100644
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -254,9 +254,13 @@ get_write_lock:
 		vma->vm_flags = vm_flags;
 	}
 
-	mmu_notifier_invalidate_range_start(mm, start, start + size);
+	/* Consider it a ZAP operation for now; it could be seen as an unmap,
+	 * but remapping is trickier as it can change the vma to non-linear
+	 * and thus trigger side effects.
+	 */
+	mmu_notifier_invalidate_range_start(mm, start, start + size, MMU_MUNMAP);
 	err = vma->vm_ops->remap_pages(vma, start, size, pgoff);
-	mmu_notifier_invalidate_range_end(mm, start, start + size);
+	mmu_notifier_invalidate_range_end(mm, start, start + size, MMU_MUNMAP);
 
 	/*
 	 * We can't clear VM_NONLINEAR because we'd have to do
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c5ff461..4ad9b73 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -993,7 +993,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_THP_FAULT_WP);
 
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, orig_pmd)))
@@ -1023,7 +1023,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	page_remove_rmap(page);
 	spin_unlock(ptl);
 
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_FAULT_WP);
 
 	ret |= VM_FAULT_WRITE;
 	put_page(page);
@@ -1033,7 +1033,7 @@ out:
 
 out_free_pages:
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_FAULT_WP);
 	mem_cgroup_uncharge_start();
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		mem_cgroup_uncharge_page(pages[i]);
@@ -1123,7 +1123,7 @@ alloc:
 
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_THP_FAULT_WP);
 
 	spin_lock(ptl);
 	if (page)
@@ -1153,7 +1153,7 @@ alloc:
 	}
 	spin_unlock(ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_FAULT_WP);
 out:
 	return ret;
 out_unlock:
@@ -1588,7 +1588,7 @@ static int __split_huge_page_splitting(struct page *page,
 	const unsigned long mmun_start = address;
 	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
 	pmd = page_check_address_pmd(page, mm, address,
 			PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
 	if (pmd) {
@@ -1603,7 +1603,7 @@ static int __split_huge_page_splitting(struct page *page,
 		ret = 1;
 		spin_unlock(ptl);
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
 
 	return ret;
 }
@@ -2402,7 +2402,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 
 	mmun_start = address;
 	mmun_end   = address + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
 	/*
 	 * After this gup_fast can't run anymore. This also removes
@@ -2412,7 +2412,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 */
 	_pmd = pmdp_clear_flush(vma, address, pmd);
 	spin_unlock(pmd_ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
 
 	spin_lock(pte_ptl);
 	isolated = __collapse_huge_page_isolate(vma, address, pte);
@@ -2801,24 +2801,24 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
 again:
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_trans_huge(*pmd))) {
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
 		return;
 	}
 	if (is_huge_zero_pmd(*pmd)) {
 		__split_huge_zero_page_pmd(vma, haddr, pmd);
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
 		return;
 	}
 	page = pmd_page(*pmd);
 	VM_BUG_ON_PAGE(!page_count(page), page);
 	get_page(page);
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
 
 	split_huge_page(page);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e73f7bc..8006472 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2540,7 +2540,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	mmun_start = vma->vm_start;
 	mmun_end = vma->vm_end;
 	if (cow)
-		mmu_notifier_invalidate_range_start(src, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_start(src, mmun_start, mmun_end, MMU_COW);
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
 		spinlock_t *src_ptl, *dst_ptl;
@@ -2574,7 +2574,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	}
 
 	if (cow)
-		mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end, MMU_COW);
 
 	return ret;
 }
@@ -2626,7 +2626,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	BUG_ON(end & ~huge_page_mask(h));
 
 	tlb_start_vma(tlb, vma);
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_UNMAP);
 again:
 	for (address = start; address < end; address += sz) {
 		ptep = huge_pte_offset(mm, address);
@@ -2697,7 +2697,7 @@ unlock:
 		if (address < end && !ref_page)
 			goto again;
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_UNMAP);
 	tlb_end_vma(tlb, vma);
 }
 
@@ -2884,7 +2884,7 @@ retry_avoidcopy:
 
 	mmun_start = address & huge_page_mask(h);
 	mmun_end = mmun_start + huge_page_size(h);
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_UNMAP);
 	/*
 	 * Retake the page table lock to check for racing updates
 	 * before the page tables are altered
@@ -2904,7 +2904,7 @@ retry_avoidcopy:
 		new_page = old_page;
 	}
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_UNMAP);
 	page_cache_release(new_page);
 	page_cache_release(old_page);
 
@@ -3329,7 +3329,8 @@ same_page:
 }
 
 unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
-		unsigned long address, unsigned long end, pgprot_t newprot)
+		unsigned long address, unsigned long end, pgprot_t newprot,
+		enum mmu_action action)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long start = address;
@@ -3341,7 +3342,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	BUG_ON(address >= end);
 	flush_cache_range(vma, address, end);
 
-	mmu_notifier_invalidate_range_start(mm, start, end);
+	mmu_notifier_invalidate_range_start(mm, start, end, action);
 	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
 	for (; address < end; address += huge_page_size(h)) {
 		spinlock_t *ptl;
@@ -3371,7 +3372,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	 */
 	flush_tlb_range(vma, start, end);
 	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
-	mmu_notifier_invalidate_range_end(mm, start, end);
+	mmu_notifier_invalidate_range_end(mm, start, end, action);
 
 	return pages << h->order;
 }
diff --git a/mm/ksm.c b/mm/ksm.c
index 68710e8..6a32bc4 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -872,7 +872,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 
 	mmun_start = addr;
 	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_KSM_RONLY);
 
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -904,7 +904,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 		if (pte_dirty(entry))
 			set_page_dirty(page);
 		entry = pte_mkclean(pte_wrprotect(entry));
-		set_pte_at_notify(mm, addr, ptep, entry);
+		set_pte_at_notify(mm, addr, ptep, entry, MMU_KSM_RONLY);
 	}
 	*orig_pte = *ptep;
 	err = 0;
@@ -912,7 +912,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 out_unlock:
 	pte_unmap_unlock(ptep, ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_KSM_RONLY);
 out:
 	return err;
 }
@@ -949,7 +949,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 
 	mmun_start = addr;
 	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_KSM);
 
 	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	if (!pte_same(*ptep, orig_pte)) {
@@ -962,7 +962,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 
 	flush_cache_page(vma, addr, pte_pfn(*ptep));
 	ptep_clear_flush(vma, addr, ptep);
-	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot), MMU_KSM);
 
 	page_remove_rmap(page);
 	if (!page_mapped(page))
@@ -972,7 +972,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	pte_unmap_unlock(ptep, ptl);
 	err = 0;
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_KSM);
 out:
 	return err;
 }
diff --git a/mm/memory.c b/mm/memory.c
index b6b9c6e..69286e2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1055,7 +1055,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	mmun_end   = end;
 	if (is_cow)
 		mmu_notifier_invalidate_range_start(src_mm, mmun_start,
-						    mmun_end);
+						    mmun_end, MMU_COW);
 
 	ret = 0;
 	dst_pgd = pgd_offset(dst_mm, addr);
@@ -1072,7 +1072,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
 
 	if (is_cow)
-		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end,
+						  MMU_COW);
 	return ret;
 }
 
@@ -1378,10 +1379,10 @@ void unmap_vmas(struct mmu_gather *tlb,
 {
 	struct mm_struct *mm = vma->vm_mm;
 
-	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
+	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr, MMU_MUNMAP);
 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
 		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
-	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
+	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr, MMU_MUNMAP);
 }
 
 /**
@@ -1403,10 +1404,10 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, start, end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, start, end);
+	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MUNMAP);
 	for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
 		unmap_single_vma(&tlb, vma, start, end, details);
-	mmu_notifier_invalidate_range_end(mm, start, end);
+	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MUNMAP);
 	tlb_finish_mmu(&tlb, start, end);
 }
 
@@ -1429,9 +1430,9 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, address, end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, address, end);
+	mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
 	unmap_single_vma(&tlb, vma, address, end, details);
-	mmu_notifier_invalidate_range_end(mm, address, end);
+	mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
 	tlb_finish_mmu(&tlb, address, end);
 }
 
@@ -2850,7 +2851,7 @@ gotten:
 
 	mmun_start  = address & PAGE_MASK;
 	mmun_end    = mmun_start + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_FAULT_WP);
 
 	/*
 	 * Re-check the pte - we dropped the lock
@@ -2880,7 +2881,7 @@ gotten:
 		 * mmu page tables (such as kvm shadow page tables), we want the
 		 * new page to be mapped directly into the secondary page table.
 		 */
-		set_pte_at_notify(mm, address, page_table, entry);
+		set_pte_at_notify(mm, address, page_table, entry, MMU_FAULT_WP);
 		update_mmu_cache(vma, address, page_table);
 		if (old_page) {
 			/*
@@ -2919,7 +2920,7 @@ gotten:
 unlock:
 	pte_unmap_unlock(page_table, ptl);
 	if (mmun_end > mmun_start)
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_FAULT_WP);
 	if (old_page) {
 		/*
 		 * Don't let another task, with possibly unlocked vma,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index ac621fa..e42f4b7 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -561,7 +561,7 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
 {
 	int nr_updated;
 
-	nr_updated = change_protection(vma, addr, end, vma->vm_page_prot, 0, 1);
+	nr_updated = change_protection(vma, addr, end, vma->vm_page_prot, 0, 1, 0);
 	if (nr_updated)
 		count_vm_numa_events(NUMA_PTE_UPDATES, nr_updated);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 6247be7..1accb9b 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1804,12 +1804,12 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	WARN_ON(PageLRU(new_page));
 
 	/* Recheck the target PMD */
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_MIGRATE);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
 fail_putback:
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_MIGRATE);
 
 		/* Reverse changes made by migrate_page_copy() */
 		if (TestClearPageActive(new_page))
@@ -1875,7 +1875,7 @@ fail_putback:
 	 */
 	mem_cgroup_end_migration(memcg, page, new_page, true);
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_MIGRATE);
 
 	/* Take an "isolate" reference and put new page on the LRU. */
 	get_page(new_page);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 41cefdf..a906744 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -122,8 +122,10 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
 	return young;
 }
 
-void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
-			       pte_t pte)
+void __mmu_notifier_change_pte(struct mm_struct *mm,
+			       unsigned long address,
+			       pte_t pte,
+			       enum mmu_action action)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -131,13 +133,14 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->change_pte)
-			mn->ops->change_pte(mn, mm, address, pte);
+			mn->ops->change_pte(mn, mm, address, pte, action);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 
 void __mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address)
+				    unsigned long address,
+				    enum mmu_action action)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -145,13 +148,16 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_page)
-			mn->ops->invalidate_page(mn, mm, address);
+			mn->ops->invalidate_page(mn, mm, address, action);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 
 void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+					   unsigned long start,
+					   unsigned long end,
+					   enum mmu_action action)
+
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -159,14 +165,16 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_start)
-			mn->ops->invalidate_range_start(mn, mm, start, end);
+			mn->ops->invalidate_range_start(mn, mm, start, end, action);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
 
 void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+					 unsigned long start,
+					 unsigned long end,
+					 enum mmu_action action)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -174,7 +182,7 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_end)
-			mn->ops->invalidate_range_end(mn, mm, start, end);
+			mn->ops->invalidate_range_end(mn, mm, start, end, action);
 	}
 	srcu_read_unlock(&srcu, id);
 }
diff --git a/mm/mprotect.c b/mm/mprotect.c
index c43d557..6c2846f 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -137,7 +137,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		pud_t *pud, unsigned long addr, unsigned long end,
-		pgprot_t newprot, int dirty_accountable, int prot_numa)
+		pgprot_t newprot, int dirty_accountable, int prot_numa,
+		enum mmu_action action)
 {
 	pmd_t *pmd;
 	struct mm_struct *mm = vma->vm_mm;
@@ -157,7 +158,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		/* invoke the mmu notifier if the pmd is populated */
 		if (!mni_start) {
 			mni_start = addr;
-			mmu_notifier_invalidate_range_start(mm, mni_start, end);
+			mmu_notifier_invalidate_range_start(mm, mni_start, end, action);
 		}
 
 		if (pmd_trans_huge(*pmd)) {
@@ -185,7 +186,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	} while (pmd++, addr = next, addr != end);
 
 	if (mni_start)
-		mmu_notifier_invalidate_range_end(mm, mni_start, end);
+		mmu_notifier_invalidate_range_end(mm, mni_start, end, action);
 
 	if (nr_huge_updates)
 		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
@@ -194,7 +195,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 
 static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 		pgd_t *pgd, unsigned long addr, unsigned long end,
-		pgprot_t newprot, int dirty_accountable, int prot_numa)
+		pgprot_t newprot, int dirty_accountable, int prot_numa,
+		enum mmu_action action)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -206,7 +208,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		pages += change_pmd_range(vma, pud, addr, next, newprot,
-				 dirty_accountable, prot_numa);
+				 dirty_accountable, prot_numa, action);
 	} while (pud++, addr = next, addr != end);
 
 	return pages;
@@ -214,7 +216,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 
 static unsigned long change_protection_range(struct vm_area_struct *vma,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
-		int dirty_accountable, int prot_numa)
+		int dirty_accountable, int prot_numa, enum mmu_action action)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
@@ -231,7 +233,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		pages += change_pud_range(vma, pgd, addr, next, newprot,
-				 dirty_accountable, prot_numa);
+				 dirty_accountable, prot_numa, action);
 	} while (pgd++, addr = next, addr != end);
 
 	/* Only flush the TLB if we actually modified any entries: */
@@ -247,11 +249,21 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 		       int dirty_accountable, int prot_numa)
 {
 	unsigned long pages;
+	enum mmu_action action = MMU_MPROT_NONE;
+
+	/* At this points vm_flags is updated. */
+	if ((vma->vm_flags & VM_READ) && (vma->vm_flags & VM_WRITE)) {
+		action = MMU_MPROT_RANDW;
+	} else if (vma->vm_flags & VM_WRITE) {
+		action = MMU_MPROT_WONLY;
+	} else if (vma->vm_flags & VM_READ) {
+		action = MMU_MPROT_RONLY;
+	}
 
 	if (is_vm_hugetlb_page(vma))
-		pages = hugetlb_change_protection(vma, start, end, newprot);
+		pages = hugetlb_change_protection(vma, start, end, newprot, action);
 	else
-		pages = change_protection_range(vma, start, end, newprot, dirty_accountable, prot_numa);
+		pages = change_protection_range(vma, start, end, newprot, dirty_accountable, prot_numa, action);
 
 	return pages;
 }
diff --git a/mm/mremap.c b/mm/mremap.c
index 0843feb..8c00e98 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -177,7 +177,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 
 	mmun_start = old_addr;
 	mmun_end   = old_end;
-	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end, MMU_MREMAP);
 
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
 		cond_resched();
@@ -221,7 +221,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	if (likely(need_flush))
 		flush_tlb_range(vma, old_end-len, old_addr);
 
-	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end, MMU_MREMAP);
 
 	return len + old_addr - old_end;	/* how much done */
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index 1c08cbd..5504e31 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -834,7 +834,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 	pte_unmap_unlock(pte, ptl);
 
 	if (ret) {
-		mmu_notifier_invalidate_page(mm, address);
+		mmu_notifier_invalidate_page(mm, address, MMU_FILE_WB);
 		(*cleaned)++;
 	}
 out:
@@ -1117,6 +1117,27 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	int ret = SWAP_AGAIN;
 	enum ttu_flags flags = (enum ttu_flags)arg;
+	enum mmu_action action;
+
+	switch (TTU_ACTION(flags)) {
+	case TTU_VMSCAN:
+		action = MMU_VMSCAN;
+		break;
+	case TTU_POISON:
+		action = MMU_POISON;
+		break;
+	case TTU_MIGRATION:
+		action = MMU_MIGRATE;
+		break;
+	case TTU_MUNLOCK:
+		action = MMU_MUNLOCK;
+		break;
+	default:
+		/* Please report this! */
+		BUG();
+		action = MMU_UNMAP;
+		break;
+	}
 
 	pte = page_check_address(page, mm, address, &ptl, 0);
 	if (!pte)
@@ -1222,7 +1243,7 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret != SWAP_FAIL)
-		mmu_notifier_invalidate_page(mm, address);
+		mmu_notifier_invalidate_page(mm, address, action);
 out:
 	return ret;
 
@@ -1276,7 +1297,8 @@ out_mlock:
 #define CLUSTER_MASK	(~(CLUSTER_SIZE - 1))
 
 static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
-		struct vm_area_struct *vma, struct page *check_page)
+				struct vm_area_struct *vma, struct page *check_page,
+				enum ttu_flags flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pmd_t *pmd;
@@ -1290,6 +1312,27 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 	unsigned long end;
 	int ret = SWAP_AGAIN;
 	int locked_vma = 0;
+	enum mmu_action action;
+
+	switch (TTU_ACTION(flags)) {
+	case TTU_VMSCAN:
+		action = MMU_VMSCAN;
+		break;
+	case TTU_POISON:
+		action = MMU_POISON;
+		break;
+	case TTU_MIGRATION:
+		action = MMU_MIGRATE;
+		break;
+	case TTU_MUNLOCK:
+		action = MMU_MUNLOCK;
+		break;
+	default:
+		/* Please report this! */
+		BUG();
+		action = MMU_UNMAP;
+		break;
+	}
 
 	address = (vma->vm_start + cursor) & CLUSTER_MASK;
 	end = address + CLUSTER_SIZE;
@@ -1304,7 +1347,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 
 	mmun_start = address;
 	mmun_end   = end;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, action);
 
 	/*
 	 * If we can acquire the mmap_sem for read, and vma is VM_LOCKED,
@@ -1369,7 +1412,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 		(*mapcount)--;
 	}
 	pte_unmap_unlock(pte - 1, ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, action);
 	if (locked_vma)
 		up_read(&vma->vm_mm->mmap_sem);
 	return ret;
@@ -1425,7 +1468,7 @@ static int try_to_unmap_nonlinear(struct page *page,
 			while (cursor < max_nl_cursor &&
 				cursor < vma->vm_end - vma->vm_start) {
 				if (try_to_unmap_cluster(cursor, &mapcount,
-						vma, page) == SWAP_MLOCK)
+						vma, page, (enum ttu_flags)arg) == SWAP_MLOCK)
 					ret = SWAP_MLOCK;
 				cursor += CLUSTER_SIZE;
 				vma->vm_private_data = (void *) cursor;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fa70c6e..483f2e6 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -262,7 +262,8 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 
 static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
-					     unsigned long address)
+					     unsigned long address,
+					     enum mmu_action action)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush, idx;
@@ -301,7 +302,8 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 					struct mm_struct *mm,
 					unsigned long address,
-					pte_t pte)
+					pte_t pte,
+					enum mmu_action action)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int idx;
@@ -317,7 +319,8 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
 						    unsigned long start,
-						    unsigned long end)
+						    unsigned long end,
+						    enum mmu_action action)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush = 0, idx;
@@ -343,7 +346,8 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 						  struct mm_struct *mm,
 						  unsigned long start,
-						  unsigned long end)
+						  unsigned long end,
+						  enum mmu_action action)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 03/11] mmu_notifier: pass through vma to invalidate_range and invalidate_page
@ 2014-05-02 13:52   ` j.glisse
  0 siblings, 0 replies; 107+ messages in thread
From: j.glisse @ 2014-05-02 13:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel; +Cc: Jérôme Glisse

From: Jérôme Glisse <jglisse@redhat.com>

New users of the mmu_notifier interface need to look up the vma in order
to perform their invalidation operation. Instead of redoing a vma lookup
inside the callback, just pass the vma through from the call site, where
it is already available.

This requires a small refactoring in memory.c so that invalidate_range is
called on vma boundaries; the overhead should be low enough.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
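For illustration, below is a minimal sketch of how a mirroring driver's
invalidate_range_start() callback can consume the vma that is now passed
through, instead of doing its own find_vma() under mmap_sem. The struct
my_mirror type and the device_mmu_*() helpers are placeholders standing in
for driver-side code and are not part of this series; only the callback
signature (vma plus the mmu_action from the earlier patches) comes from it.

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

struct my_mirror {
	struct mmu_notifier mn;
	/* ... device page table state would live here ... */
};

/* Placeholder: drop the device mappings covering [start, end). */
static void device_mmu_invalidate(struct my_mirror *mirror,
				  unsigned long start, unsigned long end)
{
}

/* Placeholder: make the device mappings covering [start, end) read-only. */
static void device_mmu_downgrade(struct my_mirror *mirror,
				 unsigned long start, unsigned long end)
{
}

static void my_mirror_invalidate_range_start(struct mmu_notifier *mn,
					     struct mm_struct *mm,
					     struct vm_area_struct *vma,
					     unsigned long start,
					     unsigned long end,
					     enum mmu_action action)
{
	struct my_mirror *mirror = container_of(mn, struct my_mirror, mn);

	/*
	 * The vma is available directly, so the driver can look at
	 * vm_flags (or vma->vm_file) without a second lookup. For an
	 * mprotect() that leaves the range readable, downgrading the
	 * device entries is enough; anything else drops them.
	 */
	if (action == MMU_MPROT_RONLY && (vma->vm_flags & VM_READ))
		device_mmu_downgrade(mirror, start, end);
	else
		device_mmu_invalidate(mirror, start, end);
}

static const struct mmu_notifier_ops my_mirror_ops = {
	.invalidate_range_start	= my_mirror_invalidate_range_start,
};

A driver would set mirror->mn.ops = &my_mirror_ops and call
mmu_notifier_register(&mirror->mn, mm) to start receiving these per-vma
callbacks without any extra locking on its side.
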
 drivers/xen/gntdev.c         |  4 +++-
 fs/proc/task_mmu.c           | 17 ++++++++++++-----
 include/linux/mmu_notifier.h | 18 +++++++++++++++---
 kernel/events/uprobes.c      |  4 ++--
 mm/filemap_xip.c             |  2 +-
 mm/fremap.c                  |  4 ++--
 mm/huge_memory.c             | 26 +++++++++++++-------------
 mm/hugetlb.c                 | 16 ++++++++--------
 mm/ksm.c                     |  8 ++++----
 mm/memory.c                  | 38 ++++++++++++++++++++++++++------------
 mm/migrate.c                 |  6 +++---
 mm/mmu_notifier.c            |  9 ++++++---
 mm/mprotect.c                |  4 ++--
 mm/mremap.c                  |  4 ++--
 mm/rmap.c                    |  8 ++++----
 virt/kvm/kvm_main.c          |  3 +++
 16 files changed, 106 insertions(+), 65 deletions(-)

diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 84aa5a7..447c3fb 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -428,6 +428,7 @@ static void unmap_if_in_range(struct grant_map *map,
 
 static void mn_invl_range_start(struct mmu_notifier *mn,
 				struct mm_struct *mm,
+				struct vm_area_struct *vma,
 				unsigned long start,
 				unsigned long end,
 				enum mmu_action action)
@@ -447,10 +448,11 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
 
 static void mn_invl_page(struct mmu_notifier *mn,
 			 struct mm_struct *mm,
+			 struct vm_area_struct *vma,
 			 unsigned long address,
 			 enum mmu_action action)
 {
-	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, action);
+	mn_invl_range_start(mn, mm, vma, address, address + PAGE_SIZE, action);
 }
 
 static void mn_release(struct mmu_notifier *mn,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3c571ea..7fd911b 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -817,12 +817,19 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 			.private = &cp,
 		};
 		down_read(&mm->mmap_sem);
-		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_start(mm, 0, -1, MMU_SOFT_DIRTY);
-		for (vma = mm->mmap; vma; vma = vma->vm_next)
+		for (vma = mm->mmap; vma; vma = vma->vm_next) {
+			if (type == CLEAR_REFS_SOFT_DIRTY)
+				mmu_notifier_invalidate_range_start(mm, vma,
+								    vma->vm_start,
+								    vma->vm_end,
+								    MMU_SOFT_DIRTY);
 			walk_page_vma(vma, &clear_refs_walk);
-		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_end(mm, 0, -1, MMU_SOFT_DIRTY);
+			if (type == CLEAR_REFS_SOFT_DIRTY)
+				mmu_notifier_invalidate_range_end(mm, vma,
+								  vma->vm_start,
+								  vma->vm_end,
+								  MMU_SOFT_DIRTY);
+		}
 		flush_tlb_mm(mm);
 		up_read(&mm->mmap_sem);
 		mmput(mm);
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 90b9105..0794a73b 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -126,6 +126,7 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_page)(struct mmu_notifier *mn,
 				struct mm_struct *mm,
+				struct vm_area_struct *vma,
 				unsigned long address,
 				enum mmu_action action);
 
@@ -174,11 +175,13 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_range_start)(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
+				       struct vm_area_struct *vma,
 				       unsigned long start,
 				       unsigned long end,
 				       enum mmu_action action);
 	void (*invalidate_range_end)(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
+				     struct vm_area_struct *vma,
 				     unsigned long start,
 				     unsigned long end,
 				     enum mmu_action action);
@@ -222,13 +225,16 @@ extern void __mmu_notifier_change_pte(struct mm_struct *mm,
 				      pte_t pte,
 				      enum mmu_action action);
 extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
+					   struct vm_area_struct *vma,
 					   unsigned long address,
 					   enum mmu_action action);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+						  struct vm_area_struct *vma,
 						  unsigned long start,
 						  unsigned long end,
 						  enum mmu_action action);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+						struct vm_area_struct *vma,
 						unsigned long start,
 						unsigned long end,
 						enum mmu_action action);
@@ -265,29 +271,32 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
+						struct vm_area_struct *vma,
 						unsigned long address,
 						enum mmu_action action)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_page(mm, address, action);
+		__mmu_notifier_invalidate_page(mm, vma, address, action);
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+						       struct vm_area_struct *vma,
 						       unsigned long start,
 						       unsigned long end,
 						       enum mmu_action action)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_start(mm, start, end, action);
+		__mmu_notifier_invalidate_range_start(mm, vma, start, end, action);
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+						     struct vm_area_struct *vma,
 						     unsigned long start,
 						     unsigned long end,
 						     enum mmu_action action)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_end(mm, start, end, action);
+		__mmu_notifier_invalidate_range_end(mm, vma, start, end, action);
 }
 
 static inline void mmu_notifier_mm_init(struct mm_struct *mm)
@@ -369,12 +378,14 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
+						struct vm_area_struct *vma,
 						unsigned long address,
 						enum mmu_action action)
 {
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+						       struct vm_area_struct *vma,
 						       unsigned long start,
 						       unsigned long end,
 						       enum mmu_action action)
@@ -382,6 +393,7 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+						     struct vm_area_struct *vma,
 						     unsigned long start,
 						     unsigned long end,
 						     enum mmu_action action)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 9acd357..ed9b095 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -170,7 +170,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	/* For try_to_free_swap() and munlock_vma_page() below */
 	lock_page(page);
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_UNMAP);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, MMU_UNMAP);
 	err = -EAGAIN;
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -199,7 +199,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	err = 0;
  unlock:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_UNMAP);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_UNMAP);
 	unlock_page(page);
 	return err;
 }
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index d529ab9..e01c68b 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -198,7 +198,7 @@ retry:
 			BUG_ON(pte_dirty(pteval));
 			pte_unmap_unlock(pte, ptl);
 			/* must invalidate_page _before_ freeing the page */
-			mmu_notifier_invalidate_page(mm, address, MMU_UNMAP);
+			mmu_notifier_invalidate_page(mm, vma, address, MMU_UNMAP);
 			page_cache_release(page);
 		}
 	}
diff --git a/mm/fremap.c b/mm/fremap.c
index f4a67e0..92ac1df 100644
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -258,9 +258,9 @@ get_write_lock:
 	 * remapping is trickier as it can change the vma to non linear and thus
 	 * trigger side effect.
 	 */
-	mmu_notifier_invalidate_range_start(mm, start, start + size, MMU_MUNMAP);
+	mmu_notifier_invalidate_range_start(mm, vma, start, start + size, MMU_MUNMAP);
 	err = vma->vm_ops->remap_pages(vma, start, size, pgoff);
-	mmu_notifier_invalidate_range_end(mm, start, start + size, MMU_MUNMAP);
+	mmu_notifier_invalidate_range_end(mm, vma, start, start + size, MMU_MUNMAP);
 
 	/*
 	 * We can't clear VM_NONLINEAR because we'd have to do
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4ad9b73..05688b0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -993,7 +993,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_THP_FAULT_WP);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, MMU_THP_FAULT_WP);
 
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, orig_pmd)))
@@ -1023,7 +1023,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	page_remove_rmap(page);
 	spin_unlock(ptl);
 
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_FAULT_WP);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_THP_FAULT_WP);
 
 	ret |= VM_FAULT_WRITE;
 	put_page(page);
@@ -1033,7 +1033,7 @@ out:
 
 out_free_pages:
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_FAULT_WP);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_THP_FAULT_WP);
 	mem_cgroup_uncharge_start();
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		mem_cgroup_uncharge_page(pages[i]);
@@ -1123,7 +1123,7 @@ alloc:
 
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_THP_FAULT_WP);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, MMU_THP_FAULT_WP);
 
 	spin_lock(ptl);
 	if (page)
@@ -1153,7 +1153,7 @@ alloc:
 	}
 	spin_unlock(ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_FAULT_WP);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_THP_FAULT_WP);
 out:
 	return ret;
 out_unlock:
@@ -1588,7 +1588,7 @@ static int __split_huge_page_splitting(struct page *page,
 	const unsigned long mmun_start = address;
 	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, MMU_THP_SPLIT);
 	pmd = page_check_address_pmd(page, mm, address,
 			PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
 	if (pmd) {
@@ -1603,7 +1603,7 @@ static int __split_huge_page_splitting(struct page *page,
 		ret = 1;
 		spin_unlock(ptl);
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_THP_SPLIT);
 
 	return ret;
 }
@@ -2402,7 +2402,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 
 	mmun_start = address;
 	mmun_end   = address + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, MMU_THP_SPLIT);
 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
 	/*
 	 * After this gup_fast can't run anymore. This also removes
@@ -2412,7 +2412,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 */
 	_pmd = pmdp_clear_flush(vma, address, pmd);
 	spin_unlock(pmd_ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_THP_SPLIT);
 
 	spin_lock(pte_ptl);
 	isolated = __collapse_huge_page_isolate(vma, address, pte);
@@ -2801,24 +2801,24 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
 again:
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, MMU_THP_SPLIT);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_trans_huge(*pmd))) {
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
+		mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_THP_SPLIT);
 		return;
 	}
 	if (is_huge_zero_pmd(*pmd)) {
 		__split_huge_zero_page_pmd(vma, haddr, pmd);
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
+		mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_THP_SPLIT);
 		return;
 	}
 	page = pmd_page(*pmd);
 	VM_BUG_ON_PAGE(!page_count(page), page);
 	get_page(page);
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_THP_SPLIT);
 
 	split_huge_page(page);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8006472..a05709a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2540,7 +2540,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	mmun_start = vma->vm_start;
 	mmun_end = vma->vm_end;
 	if (cow)
-		mmu_notifier_invalidate_range_start(src, mmun_start, mmun_end, MMU_COW);
+		mmu_notifier_invalidate_range_start(src, vma, mmun_start, mmun_end, MMU_COW);
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
 		spinlock_t *src_ptl, *dst_ptl;
@@ -2574,7 +2574,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	}
 
 	if (cow)
-		mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end, MMU_COW);
+		mmu_notifier_invalidate_range_end(src, vma, mmun_start, mmun_end, MMU_COW);
 
 	return ret;
 }
@@ -2626,7 +2626,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	BUG_ON(end & ~huge_page_mask(h));
 
 	tlb_start_vma(tlb, vma);
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_UNMAP);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, MMU_UNMAP);
 again:
 	for (address = start; address < end; address += sz) {
 		ptep = huge_pte_offset(mm, address);
@@ -2697,7 +2697,7 @@ unlock:
 		if (address < end && !ref_page)
 			goto again;
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_UNMAP);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_UNMAP);
 	tlb_end_vma(tlb, vma);
 }
 
@@ -2884,7 +2884,7 @@ retry_avoidcopy:
 
 	mmun_start = address & huge_page_mask(h);
 	mmun_end = mmun_start + huge_page_size(h);
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_UNMAP);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, MMU_UNMAP);
 	/*
 	 * Retake the page table lock to check for racing updates
 	 * before the page tables are altered
@@ -2904,7 +2904,7 @@ retry_avoidcopy:
 		new_page = old_page;
 	}
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_UNMAP);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_UNMAP);
 	page_cache_release(new_page);
 	page_cache_release(old_page);
 
@@ -3342,7 +3342,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	BUG_ON(address >= end);
 	flush_cache_range(vma, address, end);
 
-	mmu_notifier_invalidate_range_start(mm, start, end, action);
+	mmu_notifier_invalidate_range_start(mm, vma, start, end, action);
 	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
 	for (; address < end; address += huge_page_size(h)) {
 		spinlock_t *ptl;
@@ -3372,7 +3372,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	 */
 	flush_tlb_range(vma, start, end);
 	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
-	mmu_notifier_invalidate_range_end(mm, start, end, action);
+	mmu_notifier_invalidate_range_end(mm, vma, start, end, action);
 
 	return pages << h->order;
 }
diff --git a/mm/ksm.c b/mm/ksm.c
index 6a32bc4..3752820 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -872,7 +872,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 
 	mmun_start = addr;
 	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_KSM_RONLY);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, MMU_KSM_RONLY);
 
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -912,7 +912,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 out_unlock:
 	pte_unmap_unlock(ptep, ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_KSM_RONLY);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_KSM_RONLY);
 out:
 	return err;
 }
@@ -949,7 +949,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 
 	mmun_start = addr;
 	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_KSM);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, MMU_KSM);
 
 	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	if (!pte_same(*ptep, orig_pte)) {
@@ -972,7 +972,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	pte_unmap_unlock(ptep, ptl);
 	err = 0;
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_KSM);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_KSM);
 out:
 	return err;
 }
diff --git a/mm/memory.c b/mm/memory.c
index 69286e2..1e164a1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1054,7 +1054,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	mmun_start = addr;
 	mmun_end   = end;
 	if (is_cow)
-		mmu_notifier_invalidate_range_start(src_mm, mmun_start,
+		mmu_notifier_invalidate_range_start(src_mm, vma, mmun_start,
 						    mmun_end, MMU_COW);
 
 	ret = 0;
@@ -1072,7 +1072,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
 
 	if (is_cow)
-		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end,
+		mmu_notifier_invalidate_range_end(src_mm, vma, mmun_start, mmun_end,
 						  MMU_COW);
 	return ret;
 }
@@ -1379,10 +1379,17 @@ void unmap_vmas(struct mmu_gather *tlb,
 {
 	struct mm_struct *mm = vma->vm_mm;
 
-	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr, MMU_MUNMAP);
-	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
+	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
+		mmu_notifier_invalidate_range_start(mm, vma,
+						    max(start_addr, vma->vm_start),
+						    min(end_addr, vma->vm_end),
+						    MMU_MUNMAP);
 		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
-	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr, MMU_MUNMAP);
+		mmu_notifier_invalidate_range_end(mm, vma,
+						  max(start_addr, vma->vm_start),
+						  min(end_addr, vma->vm_end),
+						  MMU_MUNMAP);
+	}
 }
 
 /**
@@ -1404,10 +1411,17 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, start, end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MUNMAP);
-	for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
+	for ( ; vma && vma->vm_start < end; vma = vma->vm_next) {
+		mmu_notifier_invalidate_range_start(mm, vma,
+						    max(start, vma->vm_start),
+						    min(end, vma->vm_end),
+						    MMU_MUNMAP);
 		unmap_single_vma(&tlb, vma, start, end, details);
-	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MUNMAP);
+		mmu_notifier_invalidate_range_end(mm, vma,
+						  max(start, vma->vm_start),
+						  min(end, vma->vm_end),
+						  MMU_MUNMAP);
+	}
 	tlb_finish_mmu(&tlb, start, end);
 }
 
@@ -1430,9 +1444,9 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, address, end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
+	mmu_notifier_invalidate_range_start(mm, vma, address, end, MMU_MUNMAP);
 	unmap_single_vma(&tlb, vma, address, end, details);
-	mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
+	mmu_notifier_invalidate_range_end(mm, vma, address, end, MMU_MUNMAP);
 	tlb_finish_mmu(&tlb, address, end);
 }
 
@@ -2851,7 +2865,7 @@ gotten:
 
 	mmun_start  = address & PAGE_MASK;
 	mmun_end    = mmun_start + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_FAULT_WP);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, MMU_FAULT_WP);
 
 	/*
 	 * Re-check the pte - we dropped the lock
@@ -2920,7 +2934,7 @@ gotten:
 unlock:
 	pte_unmap_unlock(page_table, ptl);
 	if (mmun_end > mmun_start)
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_FAULT_WP);
+		mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_FAULT_WP);
 	if (old_page) {
 		/*
 		 * Don't let another task, with possibly unlocked vma,
diff --git a/mm/migrate.c b/mm/migrate.c
index 1accb9b..4b426d1 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1804,12 +1804,12 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	WARN_ON(PageLRU(new_page));
 
 	/* Recheck the target PMD */
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, MMU_MIGRATE);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
 fail_putback:
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_MIGRATE);
+		mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_MIGRATE);
 
 		/* Reverse changes made by migrate_page_copy() */
 		if (TestClearPageActive(new_page))
@@ -1875,7 +1875,7 @@ fail_putback:
 	 */
 	mem_cgroup_end_migration(memcg, page, new_page, true);
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_MIGRATE);
 
 	/* Take an "isolate" reference and put new page on the LRU. */
 	get_page(new_page);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index a906744..0b0e1ca 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -139,6 +139,7 @@ void __mmu_notifier_change_pte(struct mm_struct *mm,
 }
 
 void __mmu_notifier_invalidate_page(struct mm_struct *mm,
+				    struct vm_area_struct *vma,
 				    unsigned long address,
 				    enum mmu_action action)
 {
@@ -148,12 +149,13 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_page)
-			mn->ops->invalidate_page(mn, mm, address, action);
+			mn->ops->invalidate_page(mn, mm, vma, address, action);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 
 void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+					   struct vm_area_struct *vma,
 					   unsigned long start,
 					   unsigned long end,
 					   enum mmu_action action)
@@ -165,13 +167,14 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_start)
-			mn->ops->invalidate_range_start(mn, mm, start, end, action);
+			mn->ops->invalidate_range_start(mn, mm, vma, start, end, action);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
 
 void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+					 struct vm_area_struct *vma,
 					 unsigned long start,
 					 unsigned long end,
 					 enum mmu_action action)
@@ -182,7 +185,7 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_end)
-			mn->ops->invalidate_range_end(mn, mm, start, end, action);
+			mn->ops->invalidate_range_end(mn, mm, vma, start, end, action);
 	}
 	srcu_read_unlock(&srcu, id);
 }
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 6c2846f..ebe92d1 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -158,7 +158,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		/* invoke the mmu notifier if the pmd is populated */
 		if (!mni_start) {
 			mni_start = addr;
-			mmu_notifier_invalidate_range_start(mm, mni_start, end, action);
+			mmu_notifier_invalidate_range_start(mm, vma, mni_start, end, action);
 		}
 
 		if (pmd_trans_huge(*pmd)) {
@@ -186,7 +186,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	} while (pmd++, addr = next, addr != end);
 
 	if (mni_start)
-		mmu_notifier_invalidate_range_end(mm, mni_start, end, action);
+		mmu_notifier_invalidate_range_end(mm, vma, mni_start, end, action);
 
 	if (nr_huge_updates)
 		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
diff --git a/mm/mremap.c b/mm/mremap.c
index 8c00e98..eb3f0f4 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -177,7 +177,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 
 	mmun_start = old_addr;
 	mmun_end   = old_end;
-	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end, MMU_MREMAP);
+	mmu_notifier_invalidate_range_start(vma->vm_mm, vma, mmun_start, mmun_end, MMU_MREMAP);
 
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
 		cond_resched();
@@ -221,7 +221,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	if (likely(need_flush))
 		flush_tlb_range(vma, old_end-len, old_addr);
 
-	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end, MMU_MREMAP);
+	mmu_notifier_invalidate_range_end(vma->vm_mm, vma, mmun_start, mmun_end, MMU_MREMAP);
 
 	return len + old_addr - old_end;	/* how much done */
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index 5504e31..e07450c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -834,7 +834,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 	pte_unmap_unlock(pte, ptl);
 
 	if (ret) {
-		mmu_notifier_invalidate_page(mm, address, MMU_FILE_WB);
+		mmu_notifier_invalidate_page(mm, vma, address, MMU_FILE_WB);
 		(*cleaned)++;
 	}
 out:
@@ -1243,7 +1243,7 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret != SWAP_FAIL)
-		mmu_notifier_invalidate_page(mm, address, action);
+		mmu_notifier_invalidate_page(mm, vma, address, action);
 out:
 	return ret;
 
@@ -1347,7 +1347,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 
 	mmun_start = address;
 	mmun_end   = end;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, action);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, action);
 
 	/*
 	 * If we can acquire the mmap_sem for read, and vma is VM_LOCKED,
@@ -1412,7 +1412,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 		(*mapcount)--;
 	}
 	pte_unmap_unlock(pte - 1, ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, action);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, action);
 	if (locked_vma)
 		up_read(&vma->vm_mm->mmap_sem);
 	return ret;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 483f2e6..e6dab1a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -262,6 +262,7 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 
 static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
+					     struct vm_area_struct *vma,
 					     unsigned long address,
 					     enum mmu_action action)
 {
@@ -318,6 +319,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 
 static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
+						    struct vm_area_struct *vma,
 						    unsigned long start,
 						    unsigned long end,
 						    enum mmu_action action)
@@ -345,6 +347,7 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 						  struct mm_struct *mm,
+						  struct vm_area_struct *vma,
 						  unsigned long start,
 						  unsigned long end,
 						  enum mmu_action action)
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 03/11] mmu_notifier: pass through vma to invalidate_range and invalidate_page
@ 2014-05-02 13:52   ` j.glisse
  0 siblings, 0 replies; 107+ messages in thread
From: j.glisse @ 2014-05-02 13:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel; +Cc: Jérôme Glisse

From: Jérôme Glisse <jglisse@redhat.com>

New users of the mmu_notifier interface need to look up the vma in order
to perform their invalidation operation. Instead of redoing a vma lookup
inside the callback, just pass the vma through from the call site, where
it is already available.

This requires a small refactoring in memory.c so that invalidate_range is
called on vma boundaries; the overhead should be low enough.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 drivers/xen/gntdev.c         |  4 +++-
 fs/proc/task_mmu.c           | 17 ++++++++++++-----
 include/linux/mmu_notifier.h | 18 +++++++++++++++---
 kernel/events/uprobes.c      |  4 ++--
 mm/filemap_xip.c             |  2 +-
 mm/fremap.c                  |  4 ++--
 mm/huge_memory.c             | 26 +++++++++++++-------------
 mm/hugetlb.c                 | 16 ++++++++--------
 mm/ksm.c                     |  8 ++++----
 mm/memory.c                  | 38 ++++++++++++++++++++++++++------------
 mm/migrate.c                 |  6 +++---
 mm/mmu_notifier.c            |  9 ++++++---
 mm/mprotect.c                |  4 ++--
 mm/mremap.c                  |  4 ++--
 mm/rmap.c                    |  8 ++++----
 virt/kvm/kvm_main.c          |  3 +++
 16 files changed, 106 insertions(+), 65 deletions(-)

diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 84aa5a7..447c3fb 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -428,6 +428,7 @@ static void unmap_if_in_range(struct grant_map *map,
 
 static void mn_invl_range_start(struct mmu_notifier *mn,
 				struct mm_struct *mm,
+				struct vm_area_struct *vma,
 				unsigned long start,
 				unsigned long end,
 				enum mmu_action action)
@@ -447,10 +448,11 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
 
 static void mn_invl_page(struct mmu_notifier *mn,
 			 struct mm_struct *mm,
+			 struct vm_area_struct *vma,
 			 unsigned long address,
 			 enum mmu_action action)
 {
-	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, action);
+	mn_invl_range_start(mn, mm, vma, address, address + PAGE_SIZE, action);
 }
 
 static void mn_release(struct mmu_notifier *mn,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3c571ea..7fd911b 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -817,12 +817,19 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 			.private = &cp,
 		};
 		down_read(&mm->mmap_sem);
-		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_start(mm, 0, -1, MMU_SOFT_DIRTY);
-		for (vma = mm->mmap; vma; vma = vma->vm_next)
+		for (vma = mm->mmap; vma; vma = vma->vm_next) {
+			if (type == CLEAR_REFS_SOFT_DIRTY)
+				mmu_notifier_invalidate_range_start(mm, vma,
+								    vma->vm_start,
+								    vma->vm_end,
+								    MMU_SOFT_DIRTY);
 			walk_page_vma(vma, &clear_refs_walk);
-		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_end(mm, 0, -1, MMU_SOFT_DIRTY);
+			if (type == CLEAR_REFS_SOFT_DIRTY)
+				mmu_notifier_invalidate_range_end(mm, vma,
+								  vma->vm_start,
+								  vma->vm_end,
+								  MMU_SOFT_DIRTY);
+		}
 		flush_tlb_mm(mm);
 		up_read(&mm->mmap_sem);
 		mmput(mm);
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 90b9105..0794a73b 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -126,6 +126,7 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_page)(struct mmu_notifier *mn,
 				struct mm_struct *mm,
+				struct vm_area_struct *vma,
 				unsigned long address,
 				enum mmu_action action);
 
@@ -174,11 +175,13 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_range_start)(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
+				       struct vm_area_struct *vma,
 				       unsigned long start,
 				       unsigned long end,
 				       enum mmu_action action);
 	void (*invalidate_range_end)(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
+				     struct vm_area_struct *vma,
 				     unsigned long start,
 				     unsigned long end,
 				     enum mmu_action action);
@@ -222,13 +225,16 @@ extern void __mmu_notifier_change_pte(struct mm_struct *mm,
 				      pte_t pte,
 				      enum mmu_action action);
 extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
+					   struct vm_area_struct *vma,
 					   unsigned long address,
 					   enum mmu_action action);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+						  struct vm_area_struct *vma,
 						  unsigned long start,
 						  unsigned long end,
 						  enum mmu_action action);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+						struct vm_area_struct *vma,
 						unsigned long start,
 						unsigned long end,
 						enum mmu_action action);
@@ -265,29 +271,32 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
+						struct vm_area_struct *vma,
 						unsigned long address,
 						enum mmu_action action)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_page(mm, address, action);
+		__mmu_notifier_invalidate_page(mm, vma, address, action);
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+						       struct vm_area_struct *vma,
 						       unsigned long start,
 						       unsigned long end,
 						       enum mmu_action action)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_start(mm, start, end, action);
+		__mmu_notifier_invalidate_range_start(mm, vma, start, end, action);
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+						     struct vm_area_struct *vma,
 						     unsigned long start,
 						     unsigned long end,
 						     enum mmu_action action)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_end(mm, start, end, action);
+		__mmu_notifier_invalidate_range_end(mm, vma, start, end, action);
 }
 
 static inline void mmu_notifier_mm_init(struct mm_struct *mm)
@@ -369,12 +378,14 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
+						struct vm_area_struct *vma,
 						unsigned long address,
 						enum mmu_action action)
 {
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+						       struct vm_area_struct *vma,
 						       unsigned long start,
 						       unsigned long end,
 						       enum mmu_action action)
@@ -382,6 +393,7 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+						     struct vm_area_struct *vma,
 						     unsigned long start,
 						     unsigned long end,
 						     enum mmu_action action)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 9acd357..ed9b095 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -170,7 +170,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	/* For try_to_free_swap() and munlock_vma_page() below */
 	lock_page(page);
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_UNMAP);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, MMU_UNMAP);
 	err = -EAGAIN;
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -199,7 +199,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	err = 0;
  unlock:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_UNMAP);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_UNMAP);
 	unlock_page(page);
 	return err;
 }
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index d529ab9..e01c68b 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -198,7 +198,7 @@ retry:
 			BUG_ON(pte_dirty(pteval));
 			pte_unmap_unlock(pte, ptl);
 			/* must invalidate_page _before_ freeing the page */
-			mmu_notifier_invalidate_page(mm, address, MMU_UNMAP);
+			mmu_notifier_invalidate_page(mm, vma, address, MMU_UNMAP);
 			page_cache_release(page);
 		}
 	}
diff --git a/mm/fremap.c b/mm/fremap.c
index f4a67e0..92ac1df 100644
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -258,9 +258,9 @@ get_write_lock:
 	 * remapping is trickier as it can change the vma to non linear and thus
 	 * trigger side effect.
 	 */
-	mmu_notifier_invalidate_range_start(mm, start, start + size, MMU_MUNMAP);
+	mmu_notifier_invalidate_range_start(mm, vma, start, start + size, MMU_MUNMAP);
 	err = vma->vm_ops->remap_pages(vma, start, size, pgoff);
-	mmu_notifier_invalidate_range_end(mm, start, start + size, MMU_MUNMAP);
+	mmu_notifier_invalidate_range_end(mm, vma, start, start + size, MMU_MUNMAP);
 
 	/*
 	 * We can't clear VM_NONLINEAR because we'd have to do
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4ad9b73..05688b0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -993,7 +993,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_THP_FAULT_WP);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, MMU_THP_FAULT_WP);
 
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, orig_pmd)))
@@ -1023,7 +1023,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	page_remove_rmap(page);
 	spin_unlock(ptl);
 
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_FAULT_WP);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_THP_FAULT_WP);
 
 	ret |= VM_FAULT_WRITE;
 	put_page(page);
@@ -1033,7 +1033,7 @@ out:
 
 out_free_pages:
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_FAULT_WP);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_THP_FAULT_WP);
 	mem_cgroup_uncharge_start();
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		mem_cgroup_uncharge_page(pages[i]);
@@ -1123,7 +1123,7 @@ alloc:
 
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_THP_FAULT_WP);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, MMU_THP_FAULT_WP);
 
 	spin_lock(ptl);
 	if (page)
@@ -1153,7 +1153,7 @@ alloc:
 	}
 	spin_unlock(ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_FAULT_WP);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_THP_FAULT_WP);
 out:
 	return ret;
 out_unlock:
@@ -1588,7 +1588,7 @@ static int __split_huge_page_splitting(struct page *page,
 	const unsigned long mmun_start = address;
 	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, MMU_THP_SPLIT);
 	pmd = page_check_address_pmd(page, mm, address,
 			PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
 	if (pmd) {
@@ -1603,7 +1603,7 @@ static int __split_huge_page_splitting(struct page *page,
 		ret = 1;
 		spin_unlock(ptl);
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_THP_SPLIT);
 
 	return ret;
 }
@@ -2402,7 +2402,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 
 	mmun_start = address;
 	mmun_end   = address + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, MMU_THP_SPLIT);
 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
 	/*
 	 * After this gup_fast can't run anymore. This also removes
@@ -2412,7 +2412,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 */
 	_pmd = pmdp_clear_flush(vma, address, pmd);
 	spin_unlock(pmd_ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_THP_SPLIT);
 
 	spin_lock(pte_ptl);
 	isolated = __collapse_huge_page_isolate(vma, address, pte);
@@ -2801,24 +2801,24 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
 again:
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, MMU_THP_SPLIT);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_trans_huge(*pmd))) {
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
+		mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_THP_SPLIT);
 		return;
 	}
 	if (is_huge_zero_pmd(*pmd)) {
 		__split_huge_zero_page_pmd(vma, haddr, pmd);
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
+		mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_THP_SPLIT);
 		return;
 	}
 	page = pmd_page(*pmd);
 	VM_BUG_ON_PAGE(!page_count(page), page);
 	get_page(page);
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_THP_SPLIT);
 
 	split_huge_page(page);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8006472..a05709a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2540,7 +2540,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	mmun_start = vma->vm_start;
 	mmun_end = vma->vm_end;
 	if (cow)
-		mmu_notifier_invalidate_range_start(src, mmun_start, mmun_end, MMU_COW);
+		mmu_notifier_invalidate_range_start(src, vma, mmun_start, mmun_end, MMU_COW);
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
 		spinlock_t *src_ptl, *dst_ptl;
@@ -2574,7 +2574,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	}
 
 	if (cow)
-		mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end, MMU_COW);
+		mmu_notifier_invalidate_range_end(src, vma, mmun_start, mmun_end, MMU_COW);
 
 	return ret;
 }
@@ -2626,7 +2626,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	BUG_ON(end & ~huge_page_mask(h));
 
 	tlb_start_vma(tlb, vma);
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_UNMAP);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, MMU_UNMAP);
 again:
 	for (address = start; address < end; address += sz) {
 		ptep = huge_pte_offset(mm, address);
@@ -2697,7 +2697,7 @@ unlock:
 		if (address < end && !ref_page)
 			goto again;
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_UNMAP);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_UNMAP);
 	tlb_end_vma(tlb, vma);
 }
 
@@ -2884,7 +2884,7 @@ retry_avoidcopy:
 
 	mmun_start = address & huge_page_mask(h);
 	mmun_end = mmun_start + huge_page_size(h);
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_UNMAP);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, MMU_UNMAP);
 	/*
 	 * Retake the page table lock to check for racing updates
 	 * before the page tables are altered
@@ -2904,7 +2904,7 @@ retry_avoidcopy:
 		new_page = old_page;
 	}
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_UNMAP);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_UNMAP);
 	page_cache_release(new_page);
 	page_cache_release(old_page);
 
@@ -3342,7 +3342,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	BUG_ON(address >= end);
 	flush_cache_range(vma, address, end);
 
-	mmu_notifier_invalidate_range_start(mm, start, end, action);
+	mmu_notifier_invalidate_range_start(mm, vma, start, end, action);
 	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
 	for (; address < end; address += huge_page_size(h)) {
 		spinlock_t *ptl;
@@ -3372,7 +3372,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	 */
 	flush_tlb_range(vma, start, end);
 	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
-	mmu_notifier_invalidate_range_end(mm, start, end, action);
+	mmu_notifier_invalidate_range_end(mm, vma, start, end, action);
 
 	return pages << h->order;
 }
diff --git a/mm/ksm.c b/mm/ksm.c
index 6a32bc4..3752820 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -872,7 +872,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 
 	mmun_start = addr;
 	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_KSM_RONLY);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, MMU_KSM_RONLY);
 
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -912,7 +912,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 out_unlock:
 	pte_unmap_unlock(ptep, ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_KSM_RONLY);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_KSM_RONLY);
 out:
 	return err;
 }
@@ -949,7 +949,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 
 	mmun_start = addr;
 	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_KSM);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, MMU_KSM);
 
 	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	if (!pte_same(*ptep, orig_pte)) {
@@ -972,7 +972,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	pte_unmap_unlock(ptep, ptl);
 	err = 0;
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_KSM);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_KSM);
 out:
 	return err;
 }
diff --git a/mm/memory.c b/mm/memory.c
index 69286e2..1e164a1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1054,7 +1054,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	mmun_start = addr;
 	mmun_end   = end;
 	if (is_cow)
-		mmu_notifier_invalidate_range_start(src_mm, mmun_start,
+		mmu_notifier_invalidate_range_start(src_mm, vma, mmun_start,
 						    mmun_end, MMU_COW);
 
 	ret = 0;
@@ -1072,7 +1072,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
 
 	if (is_cow)
-		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end,
+		mmu_notifier_invalidate_range_end(src_mm, vma, mmun_start, mmun_end,
 						  MMU_COW);
 	return ret;
 }
@@ -1379,10 +1379,17 @@ void unmap_vmas(struct mmu_gather *tlb,
 {
 	struct mm_struct *mm = vma->vm_mm;
 
-	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr, MMU_MUNMAP);
-	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
+	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
+		mmu_notifier_invalidate_range_start(mm, vma,
+						    max(start_addr, vma->vm_start),
+						    min(end_addr, vma->vm_end),
+						    MMU_MUNMAP);
 		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
-	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr, MMU_MUNMAP);
+		mmu_notifier_invalidate_range_end(mm, vma,
+						  max(start_addr, vma->vm_start),
+						  min(end_addr, vma->vm_end),
+						  MMU_MUNMAP);
+	}
 }
 
 /**
@@ -1404,10 +1411,17 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, start, end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MUNMAP);
-	for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
+	for ( ; vma && vma->vm_start < end; vma = vma->vm_next) {
+		mmu_notifier_invalidate_range_start(mm, vma,
+						    max(start, vma->vm_start),
+						    min(end, vma->vm_end),
+						    MMU_MUNMAP);
 		unmap_single_vma(&tlb, vma, start, end, details);
-	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MUNMAP);
+		mmu_notifier_invalidate_range_end(mm, vma,
+						  max(start, vma->vm_start),
+						  min(end, vma->vm_end),
+						  MMU_MUNMAP);
+	}
 	tlb_finish_mmu(&tlb, start, end);
 }
 
@@ -1430,9 +1444,9 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, address, end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
+	mmu_notifier_invalidate_range_start(mm, vma, address, end, MMU_MUNMAP);
 	unmap_single_vma(&tlb, vma, address, end, details);
-	mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
+	mmu_notifier_invalidate_range_end(mm, vma, address, end, MMU_MUNMAP);
 	tlb_finish_mmu(&tlb, address, end);
 }
 
@@ -2851,7 +2865,7 @@ gotten:
 
 	mmun_start  = address & PAGE_MASK;
 	mmun_end    = mmun_start + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_FAULT_WP);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, MMU_FAULT_WP);
 
 	/*
 	 * Re-check the pte - we dropped the lock
@@ -2920,7 +2934,7 @@ gotten:
 unlock:
 	pte_unmap_unlock(page_table, ptl);
 	if (mmun_end > mmun_start)
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_FAULT_WP);
+		mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_FAULT_WP);
 	if (old_page) {
 		/*
 		 * Don't let another task, with possibly unlocked vma,
diff --git a/mm/migrate.c b/mm/migrate.c
index 1accb9b..4b426d1 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1804,12 +1804,12 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	WARN_ON(PageLRU(new_page));
 
 	/* Recheck the target PMD */
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, MMU_MIGRATE);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
 fail_putback:
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_MIGRATE);
+		mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_MIGRATE);
 
 		/* Reverse changes made by migrate_page_copy() */
 		if (TestClearPageActive(new_page))
@@ -1875,7 +1875,7 @@ fail_putback:
 	 */
 	mem_cgroup_end_migration(memcg, page, new_page, true);
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_MIGRATE);
 
 	/* Take an "isolate" reference and put new page on the LRU. */
 	get_page(new_page);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index a906744..0b0e1ca 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -139,6 +139,7 @@ void __mmu_notifier_change_pte(struct mm_struct *mm,
 }
 
 void __mmu_notifier_invalidate_page(struct mm_struct *mm,
+				    struct vm_area_struct *vma,
 				    unsigned long address,
 				    enum mmu_action action)
 {
@@ -148,12 +149,13 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_page)
-			mn->ops->invalidate_page(mn, mm, address, action);
+			mn->ops->invalidate_page(mn, mm, vma, address, action);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 
 void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+					   struct vm_area_struct *vma,
 					   unsigned long start,
 					   unsigned long end,
 					   enum mmu_action action)
@@ -165,13 +167,14 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_start)
-			mn->ops->invalidate_range_start(mn, mm, start, end, action);
+			mn->ops->invalidate_range_start(mn, mm, vma, start, end, action);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
 
 void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+					 struct vm_area_struct *vma,
 					 unsigned long start,
 					 unsigned long end,
 					 enum mmu_action action)
@@ -182,7 +185,7 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_end)
-			mn->ops->invalidate_range_end(mn, mm, start, end, action);
+			mn->ops->invalidate_range_end(mn, mm, vma, start, end, action);
 	}
 	srcu_read_unlock(&srcu, id);
 }
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 6c2846f..ebe92d1 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -158,7 +158,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		/* invoke the mmu notifier if the pmd is populated */
 		if (!mni_start) {
 			mni_start = addr;
-			mmu_notifier_invalidate_range_start(mm, mni_start, end, action);
+			mmu_notifier_invalidate_range_start(mm, vma, mni_start, end, action);
 		}
 
 		if (pmd_trans_huge(*pmd)) {
@@ -186,7 +186,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	} while (pmd++, addr = next, addr != end);
 
 	if (mni_start)
-		mmu_notifier_invalidate_range_end(mm, mni_start, end, action);
+		mmu_notifier_invalidate_range_end(mm, vma, mni_start, end, action);
 
 	if (nr_huge_updates)
 		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
diff --git a/mm/mremap.c b/mm/mremap.c
index 8c00e98..eb3f0f4 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -177,7 +177,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 
 	mmun_start = old_addr;
 	mmun_end   = old_end;
-	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end, MMU_MREMAP);
+	mmu_notifier_invalidate_range_start(vma->vm_mm, vma, mmun_start, mmun_end, MMU_MREMAP);
 
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
 		cond_resched();
@@ -221,7 +221,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	if (likely(need_flush))
 		flush_tlb_range(vma, old_end-len, old_addr);
 
-	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end, MMU_MREMAP);
+	mmu_notifier_invalidate_range_end(vma->vm_mm, vma, mmun_start, mmun_end, MMU_MREMAP);
 
 	return len + old_addr - old_end;	/* how much done */
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index 5504e31..e07450c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -834,7 +834,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 	pte_unmap_unlock(pte, ptl);
 
 	if (ret) {
-		mmu_notifier_invalidate_page(mm, address, MMU_FILE_WB);
+		mmu_notifier_invalidate_page(mm, vma, address, MMU_FILE_WB);
 		(*cleaned)++;
 	}
 out:
@@ -1243,7 +1243,7 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret != SWAP_FAIL)
-		mmu_notifier_invalidate_page(mm, address, action);
+		mmu_notifier_invalidate_page(mm, vma, address, action);
 out:
 	return ret;
 
@@ -1347,7 +1347,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 
 	mmun_start = address;
 	mmun_end   = end;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, action);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, action);
 
 	/*
 	 * If we can acquire the mmap_sem for read, and vma is VM_LOCKED,
@@ -1412,7 +1412,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 		(*mapcount)--;
 	}
 	pte_unmap_unlock(pte - 1, ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, action);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, action);
 	if (locked_vma)
 		up_read(&vma->vm_mm->mmap_sem);
 	return ret;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 483f2e6..e6dab1a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -262,6 +262,7 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 
 static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
+					     struct vm_area_struct *vma,
 					     unsigned long address,
 					     enum mmu_action action)
 {
@@ -318,6 +319,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 
 static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
+						    struct vm_area_struct *vma,
 						    unsigned long start,
 						    unsigned long end,
 						    enum mmu_action action)
@@ -345,6 +347,7 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 						  struct mm_struct *mm,
+						  struct vm_area_struct *vma,
 						  unsigned long start,
 						  unsigned long end,
 						  enum mmu_action action)
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 03/11] mmu_notifier: pass through vma to invalidate_range and invalidate_page
@ 2014-05-02 13:52   ` j.glisse
  0 siblings, 0 replies; 107+ messages in thread
From: j.glisse @ 2014-05-02 13:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel; +Cc: Jérôme Glisse

From: Jérôme Glisse <jglisse@redhat.com>

New users of the mmu_notifier interface need to look up the vma in order
to perform their invalidation operation. Instead of redoing a vma lookup
inside the callback, just pass the vma through from the call site, where
it is already available.

This requires a small refactoring in memory.c so that invalidate_range is
called on vma boundaries; the overhead should be low enough.
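As an illustration only, here is a minimal sketch of the kind of listener this
change enables (hypothetical driver code, not part of this patch; the helper
example_device_invalidate() is assumed). With the vma passed through, the
callback can inspect vma->vm_flags directly instead of doing its own
find_vma() under mmap_sem:

static void example_invalidate_range_start(struct mmu_notifier *mn,
					   struct mm_struct *mm,
					   struct vm_area_struct *vma,
					   unsigned long start,
					   unsigned long end,
					   enum mmu_action action)
{
	/* Skip ranges the device never mirrors, e.g. I/O mappings. */
	if (vma->vm_flags & VM_IO)
		return;

	/* Tear down the device page table entries covering [start, end). */
	example_device_invalidate(mn, start, end);
}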

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 drivers/xen/gntdev.c         |  4 +++-
 fs/proc/task_mmu.c           | 17 ++++++++++++-----
 include/linux/mmu_notifier.h | 18 +++++++++++++++---
 kernel/events/uprobes.c      |  4 ++--
 mm/filemap_xip.c             |  2 +-
 mm/fremap.c                  |  4 ++--
 mm/huge_memory.c             | 26 +++++++++++++-------------
 mm/hugetlb.c                 | 16 ++++++++--------
 mm/ksm.c                     |  8 ++++----
 mm/memory.c                  | 38 ++++++++++++++++++++++++++------------
 mm/migrate.c                 |  6 +++---
 mm/mmu_notifier.c            |  9 ++++++---
 mm/mprotect.c                |  4 ++--
 mm/mremap.c                  |  4 ++--
 mm/rmap.c                    |  8 ++++----
 virt/kvm/kvm_main.c          |  3 +++
 16 files changed, 106 insertions(+), 65 deletions(-)

diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 84aa5a7..447c3fb 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -428,6 +428,7 @@ static void unmap_if_in_range(struct grant_map *map,
 
 static void mn_invl_range_start(struct mmu_notifier *mn,
 				struct mm_struct *mm,
+				struct vm_area_struct *vma,
 				unsigned long start,
 				unsigned long end,
 				enum mmu_action action)
@@ -447,10 +448,11 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
 
 static void mn_invl_page(struct mmu_notifier *mn,
 			 struct mm_struct *mm,
+			 struct vm_area_struct *vma,
 			 unsigned long address,
 			 enum mmu_action action)
 {
-	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, action);
+	mn_invl_range_start(mn, mm, vma, address, address + PAGE_SIZE, action);
 }
 
 static void mn_release(struct mmu_notifier *mn,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3c571ea..7fd911b 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -817,12 +817,19 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 			.private = &cp,
 		};
 		down_read(&mm->mmap_sem);
-		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_start(mm, 0, -1, MMU_SOFT_DIRTY);
-		for (vma = mm->mmap; vma; vma = vma->vm_next)
+		for (vma = mm->mmap; vma; vma = vma->vm_next) {
+			if (type == CLEAR_REFS_SOFT_DIRTY)
+				mmu_notifier_invalidate_range_start(mm, vma,
+								    vma->vm_start,
+								    vma->vm_end,
+								    MMU_SOFT_DIRTY);
 			walk_page_vma(vma, &clear_refs_walk);
-		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_end(mm, 0, -1, MMU_SOFT_DIRTY);
+			if (type == CLEAR_REFS_SOFT_DIRTY)
+				mmu_notifier_invalidate_range_end(mm, vma,
+								  vma->vm_start,
+								  vma->vm_end,
+								  MMU_SOFT_DIRTY);
+		}
 		flush_tlb_mm(mm);
 		up_read(&mm->mmap_sem);
 		mmput(mm);
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 90b9105..0794a73b 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -126,6 +126,7 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_page)(struct mmu_notifier *mn,
 				struct mm_struct *mm,
+				struct vm_area_struct *vma,
 				unsigned long address,
 				enum mmu_action action);
 
@@ -174,11 +175,13 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_range_start)(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
+				       struct vm_area_struct *vma,
 				       unsigned long start,
 				       unsigned long end,
 				       enum mmu_action action);
 	void (*invalidate_range_end)(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
+				     struct vm_area_struct *vma,
 				     unsigned long start,
 				     unsigned long end,
 				     enum mmu_action action);
@@ -222,13 +225,16 @@ extern void __mmu_notifier_change_pte(struct mm_struct *mm,
 				      pte_t pte,
 				      enum mmu_action action);
 extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
+					   struct vm_area_struct *vma,
 					   unsigned long address,
 					   enum mmu_action action);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+						  struct vm_area_struct *vma,
 						  unsigned long start,
 						  unsigned long end,
 						  enum mmu_action action);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+						struct vm_area_struct *vma,
 						unsigned long start,
 						unsigned long end,
 						enum mmu_action action);
@@ -265,29 +271,32 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
+						struct vm_area_struct *vma,
 						unsigned long address,
 						enum mmu_action action)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_page(mm, address, action);
+		__mmu_notifier_invalidate_page(mm, vma, address, action);
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+						       struct vm_area_struct *vma,
 						       unsigned long start,
 						       unsigned long end,
 						       enum mmu_action action)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_start(mm, start, end, action);
+		__mmu_notifier_invalidate_range_start(mm, vma, start, end, action);
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+						     struct vm_area_struct *vma,
 						     unsigned long start,
 						     unsigned long end,
 						     enum mmu_action action)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_end(mm, start, end, action);
+		__mmu_notifier_invalidate_range_end(mm, vma, start, end, action);
 }
 
 static inline void mmu_notifier_mm_init(struct mm_struct *mm)
@@ -369,12 +378,14 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
+						struct vm_area_struct *vma,
 						unsigned long address,
 						enum mmu_action action)
 {
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+						       struct vm_area_struct *vma,
 						       unsigned long start,
 						       unsigned long end,
 						       enum mmu_action action)
@@ -382,6 +393,7 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+						     struct vm_area_struct *vma,
 						     unsigned long start,
 						     unsigned long end,
 						     enum mmu_action action)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 9acd357..ed9b095 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -170,7 +170,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	/* For try_to_free_swap() and munlock_vma_page() below */
 	lock_page(page);
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_UNMAP);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, MMU_UNMAP);
 	err = -EAGAIN;
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -199,7 +199,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	err = 0;
  unlock:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_UNMAP);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_UNMAP);
 	unlock_page(page);
 	return err;
 }
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index d529ab9..e01c68b 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -198,7 +198,7 @@ retry:
 			BUG_ON(pte_dirty(pteval));
 			pte_unmap_unlock(pte, ptl);
 			/* must invalidate_page _before_ freeing the page */
-			mmu_notifier_invalidate_page(mm, address, MMU_UNMAP);
+			mmu_notifier_invalidate_page(mm, vma, address, MMU_UNMAP);
 			page_cache_release(page);
 		}
 	}
diff --git a/mm/fremap.c b/mm/fremap.c
index f4a67e0..92ac1df 100644
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -258,9 +258,9 @@ get_write_lock:
 	 * remapping is trickier as it can change the vma to non linear and thus
 	 * trigger side effect.
 	 */
-	mmu_notifier_invalidate_range_start(mm, start, start + size, MMU_MUNMAP);
+	mmu_notifier_invalidate_range_start(mm, vma, start, start + size, MMU_MUNMAP);
 	err = vma->vm_ops->remap_pages(vma, start, size, pgoff);
-	mmu_notifier_invalidate_range_end(mm, start, start + size, MMU_MUNMAP);
+	mmu_notifier_invalidate_range_end(mm, vma, start, start + size, MMU_MUNMAP);
 
 	/*
 	 * We can't clear VM_NONLINEAR because we'd have to do
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4ad9b73..05688b0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -993,7 +993,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_THP_FAULT_WP);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, MMU_THP_FAULT_WP);
 
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, orig_pmd)))
@@ -1023,7 +1023,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	page_remove_rmap(page);
 	spin_unlock(ptl);
 
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_FAULT_WP);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_THP_FAULT_WP);
 
 	ret |= VM_FAULT_WRITE;
 	put_page(page);
@@ -1033,7 +1033,7 @@ out:
 
 out_free_pages:
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_FAULT_WP);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_THP_FAULT_WP);
 	mem_cgroup_uncharge_start();
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		mem_cgroup_uncharge_page(pages[i]);
@@ -1123,7 +1123,7 @@ alloc:
 
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_THP_FAULT_WP);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, MMU_THP_FAULT_WP);
 
 	spin_lock(ptl);
 	if (page)
@@ -1153,7 +1153,7 @@ alloc:
 	}
 	spin_unlock(ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_FAULT_WP);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_THP_FAULT_WP);
 out:
 	return ret;
 out_unlock:
@@ -1588,7 +1588,7 @@ static int __split_huge_page_splitting(struct page *page,
 	const unsigned long mmun_start = address;
 	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, MMU_THP_SPLIT);
 	pmd = page_check_address_pmd(page, mm, address,
 			PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
 	if (pmd) {
@@ -1603,7 +1603,7 @@ static int __split_huge_page_splitting(struct page *page,
 		ret = 1;
 		spin_unlock(ptl);
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_THP_SPLIT);
 
 	return ret;
 }
@@ -2402,7 +2402,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 
 	mmun_start = address;
 	mmun_end   = address + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, MMU_THP_SPLIT);
 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
 	/*
 	 * After this gup_fast can't run anymore. This also removes
@@ -2412,7 +2412,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 */
 	_pmd = pmdp_clear_flush(vma, address, pmd);
 	spin_unlock(pmd_ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_THP_SPLIT);
 
 	spin_lock(pte_ptl);
 	isolated = __collapse_huge_page_isolate(vma, address, pte);
@@ -2801,24 +2801,24 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
 again:
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, MMU_THP_SPLIT);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_trans_huge(*pmd))) {
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
+		mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_THP_SPLIT);
 		return;
 	}
 	if (is_huge_zero_pmd(*pmd)) {
 		__split_huge_zero_page_pmd(vma, haddr, pmd);
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
+		mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_THP_SPLIT);
 		return;
 	}
 	page = pmd_page(*pmd);
 	VM_BUG_ON_PAGE(!page_count(page), page);
 	get_page(page);
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_THP_SPLIT);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_THP_SPLIT);
 
 	split_huge_page(page);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8006472..a05709a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2540,7 +2540,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	mmun_start = vma->vm_start;
 	mmun_end = vma->vm_end;
 	if (cow)
-		mmu_notifier_invalidate_range_start(src, mmun_start, mmun_end, MMU_COW);
+		mmu_notifier_invalidate_range_start(src, vma, mmun_start, mmun_end, MMU_COW);
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
 		spinlock_t *src_ptl, *dst_ptl;
@@ -2574,7 +2574,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	}
 
 	if (cow)
-		mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end, MMU_COW);
+		mmu_notifier_invalidate_range_end(src, vma, mmun_start, mmun_end, MMU_COW);
 
 	return ret;
 }
@@ -2626,7 +2626,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	BUG_ON(end & ~huge_page_mask(h));
 
 	tlb_start_vma(tlb, vma);
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_UNMAP);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, MMU_UNMAP);
 again:
 	for (address = start; address < end; address += sz) {
 		ptep = huge_pte_offset(mm, address);
@@ -2697,7 +2697,7 @@ unlock:
 		if (address < end && !ref_page)
 			goto again;
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_UNMAP);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_UNMAP);
 	tlb_end_vma(tlb, vma);
 }
 
@@ -2884,7 +2884,7 @@ retry_avoidcopy:
 
 	mmun_start = address & huge_page_mask(h);
 	mmun_end = mmun_start + huge_page_size(h);
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_UNMAP);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, MMU_UNMAP);
 	/*
 	 * Retake the page table lock to check for racing updates
 	 * before the page tables are altered
@@ -2904,7 +2904,7 @@ retry_avoidcopy:
 		new_page = old_page;
 	}
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_UNMAP);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_UNMAP);
 	page_cache_release(new_page);
 	page_cache_release(old_page);
 
@@ -3342,7 +3342,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	BUG_ON(address >= end);
 	flush_cache_range(vma, address, end);
 
-	mmu_notifier_invalidate_range_start(mm, start, end, action);
+	mmu_notifier_invalidate_range_start(mm, vma, start, end, action);
 	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
 	for (; address < end; address += huge_page_size(h)) {
 		spinlock_t *ptl;
@@ -3372,7 +3372,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	 */
 	flush_tlb_range(vma, start, end);
 	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
-	mmu_notifier_invalidate_range_end(mm, start, end, action);
+	mmu_notifier_invalidate_range_end(mm, vma, start, end, action);
 
 	return pages << h->order;
 }
diff --git a/mm/ksm.c b/mm/ksm.c
index 6a32bc4..3752820 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -872,7 +872,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 
 	mmun_start = addr;
 	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_KSM_RONLY);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, MMU_KSM_RONLY);
 
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -912,7 +912,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 out_unlock:
 	pte_unmap_unlock(ptep, ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_KSM_RONLY);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_KSM_RONLY);
 out:
 	return err;
 }
@@ -949,7 +949,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 
 	mmun_start = addr;
 	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_KSM);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, MMU_KSM);
 
 	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	if (!pte_same(*ptep, orig_pte)) {
@@ -972,7 +972,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	pte_unmap_unlock(ptep, ptl);
 	err = 0;
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_KSM);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_KSM);
 out:
 	return err;
 }
diff --git a/mm/memory.c b/mm/memory.c
index 69286e2..1e164a1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1054,7 +1054,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	mmun_start = addr;
 	mmun_end   = end;
 	if (is_cow)
-		mmu_notifier_invalidate_range_start(src_mm, mmun_start,
+		mmu_notifier_invalidate_range_start(src_mm, vma, mmun_start,
 						    mmun_end, MMU_COW);
 
 	ret = 0;
@@ -1072,7 +1072,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
 
 	if (is_cow)
-		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end,
+		mmu_notifier_invalidate_range_end(src_mm, vma, mmun_start, mmun_end,
 						  MMU_COW);
 	return ret;
 }
@@ -1379,10 +1379,17 @@ void unmap_vmas(struct mmu_gather *tlb,
 {
 	struct mm_struct *mm = vma->vm_mm;
 
-	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr, MMU_MUNMAP);
-	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
+	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
+		mmu_notifier_invalidate_range_start(mm, vma,
+						    max(start_addr, vma->vm_start),
+						    min(end_addr, vma->vm_end),
+						    MMU_MUNMAP);
 		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
-	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr, MMU_MUNMAP);
+		mmu_notifier_invalidate_range_end(mm, vma,
+						  max(start_addr, vma->vm_start),
+						  min(end_addr, vma->vm_end),
+						  MMU_MUNMAP);
+	}
 }
 
 /**
@@ -1404,10 +1411,17 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, start, end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MUNMAP);
-	for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
+	for ( ; vma && vma->vm_start < end; vma = vma->vm_next) {
+		mmu_notifier_invalidate_range_start(mm, vma,
+						    max(start, vma->vm_start),
+						    min(end, vma->vm_end),
+						    MMU_MUNMAP);
 		unmap_single_vma(&tlb, vma, start, end, details);
-	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MUNMAP);
+		mmu_notifier_invalidate_range_end(mm, vma,
+						  max(start, vma->vm_start),
+						  min(end, vma->vm_end),
+						  MMU_MUNMAP);
+	}
 	tlb_finish_mmu(&tlb, start, end);
 }
 
@@ -1430,9 +1444,9 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, address, end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
+	mmu_notifier_invalidate_range_start(mm, vma, address, end, MMU_MUNMAP);
 	unmap_single_vma(&tlb, vma, address, end, details);
-	mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
+	mmu_notifier_invalidate_range_end(mm, vma, address, end, MMU_MUNMAP);
 	tlb_finish_mmu(&tlb, address, end);
 }
 
@@ -2851,7 +2865,7 @@ gotten:
 
 	mmun_start  = address & PAGE_MASK;
 	mmun_end    = mmun_start + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_FAULT_WP);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, MMU_FAULT_WP);
 
 	/*
 	 * Re-check the pte - we dropped the lock
@@ -2920,7 +2934,7 @@ gotten:
 unlock:
 	pte_unmap_unlock(page_table, ptl);
 	if (mmun_end > mmun_start)
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_FAULT_WP);
+		mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_FAULT_WP);
 	if (old_page) {
 		/*
 		 * Don't let another task, with possibly unlocked vma,
diff --git a/mm/migrate.c b/mm/migrate.c
index 1accb9b..4b426d1 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1804,12 +1804,12 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	WARN_ON(PageLRU(new_page));
 
 	/* Recheck the target PMD */
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, MMU_MIGRATE);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
 fail_putback:
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_MIGRATE);
+		mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_MIGRATE);
 
 		/* Reverse changes made by migrate_page_copy() */
 		if (TestClearPageActive(new_page))
@@ -1875,7 +1875,7 @@ fail_putback:
 	 */
 	mem_cgroup_end_migration(memcg, page, new_page, true);
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, MMU_MIGRATE);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, MMU_MIGRATE);
 
 	/* Take an "isolate" reference and put new page on the LRU. */
 	get_page(new_page);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index a906744..0b0e1ca 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -139,6 +139,7 @@ void __mmu_notifier_change_pte(struct mm_struct *mm,
 }
 
 void __mmu_notifier_invalidate_page(struct mm_struct *mm,
+				    struct vm_area_struct *vma,
 				    unsigned long address,
 				    enum mmu_action action)
 {
@@ -148,12 +149,13 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_page)
-			mn->ops->invalidate_page(mn, mm, address, action);
+			mn->ops->invalidate_page(mn, mm, vma, address, action);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 
 void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+					   struct vm_area_struct *vma,
 					   unsigned long start,
 					   unsigned long end,
 					   enum mmu_action action)
@@ -165,13 +167,14 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_start)
-			mn->ops->invalidate_range_start(mn, mm, start, end, action);
+			mn->ops->invalidate_range_start(mn, mm, vma, start, end, action);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
 
 void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+					 struct vm_area_struct *vma,
 					 unsigned long start,
 					 unsigned long end,
 					 enum mmu_action action)
@@ -182,7 +185,7 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_end)
-			mn->ops->invalidate_range_end(mn, mm, start, end, action);
+			mn->ops->invalidate_range_end(mn, mm, vma, start, end, action);
 	}
 	srcu_read_unlock(&srcu, id);
 }
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 6c2846f..ebe92d1 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -158,7 +158,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		/* invoke the mmu notifier if the pmd is populated */
 		if (!mni_start) {
 			mni_start = addr;
-			mmu_notifier_invalidate_range_start(mm, mni_start, end, action);
+			mmu_notifier_invalidate_range_start(mm, vma, mni_start, end, action);
 		}
 
 		if (pmd_trans_huge(*pmd)) {
@@ -186,7 +186,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	} while (pmd++, addr = next, addr != end);
 
 	if (mni_start)
-		mmu_notifier_invalidate_range_end(mm, mni_start, end, action);
+		mmu_notifier_invalidate_range_end(mm, vma, mni_start, end, action);
 
 	if (nr_huge_updates)
 		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
diff --git a/mm/mremap.c b/mm/mremap.c
index 8c00e98..eb3f0f4 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -177,7 +177,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 
 	mmun_start = old_addr;
 	mmun_end   = old_end;
-	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end, MMU_MREMAP);
+	mmu_notifier_invalidate_range_start(vma->vm_mm, vma, mmun_start, mmun_end, MMU_MREMAP);
 
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
 		cond_resched();
@@ -221,7 +221,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	if (likely(need_flush))
 		flush_tlb_range(vma, old_end-len, old_addr);
 
-	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end, MMU_MREMAP);
+	mmu_notifier_invalidate_range_end(vma->vm_mm, vma, mmun_start, mmun_end, MMU_MREMAP);
 
 	return len + old_addr - old_end;	/* how much done */
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index 5504e31..e07450c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -834,7 +834,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 	pte_unmap_unlock(pte, ptl);
 
 	if (ret) {
-		mmu_notifier_invalidate_page(mm, address, MMU_FILE_WB);
+		mmu_notifier_invalidate_page(mm, vma, address, MMU_FILE_WB);
 		(*cleaned)++;
 	}
 out:
@@ -1243,7 +1243,7 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret != SWAP_FAIL)
-		mmu_notifier_invalidate_page(mm, address, action);
+		mmu_notifier_invalidate_page(mm, vma, address, action);
 out:
 	return ret;
 
@@ -1347,7 +1347,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 
 	mmun_start = address;
 	mmun_end   = end;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, action);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start, mmun_end, action);
 
 	/*
 	 * If we can acquire the mmap_sem for read, and vma is VM_LOCKED,
@@ -1412,7 +1412,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 		(*mapcount)--;
 	}
 	pte_unmap_unlock(pte - 1, ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, action);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, action);
 	if (locked_vma)
 		up_read(&vma->vm_mm->mmap_sem);
 	return ret;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 483f2e6..e6dab1a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -262,6 +262,7 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 
 static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
+					     struct vm_area_struct *vma,
 					     unsigned long address,
 					     enum mmu_action action)
 {
@@ -318,6 +319,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 
 static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
+						    struct vm_area_struct *vma,
 						    unsigned long start,
 						    unsigned long end,
 						    enum mmu_action action)
@@ -345,6 +347,7 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 						  struct mm_struct *mm,
+						  struct vm_area_struct *vma,
 						  unsigned long start,
 						  unsigned long end,
 						  enum mmu_action action)
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 04/11] interval_tree: helper to find previous item of a node in rb interval tree
@ 2014-05-02 13:52   ` j.glisse
  0 siblings, 0 replies; 107+ messages in thread
From: j.glisse @ 2014-05-02 13:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel; +Cc: Jérôme Glisse

From: Jérôme Glisse <jglisse@redhat.com>

It is often useful to find the entry right before a given one in an rb
interval tree. Add an _iter_prev() helper (built on an internal
_subtree_rmost() helper) to the generic interval tree template for that
purpose.
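For illustration only, a sketch of a backward walk over every node overlapping
[start, last]. It assumes a tree instantiated through INTERVAL_TREE_DEFINE()
(as lib/interval_tree.c does) so that an interval_tree_iter_prev() variant of
this helper is generated; rightmost_match() is a hypothetical stand-in for
however the caller found the last overlapping node (for instance by
remembering it during a forward walk):

static void walk_overlaps_backward(struct rb_root *root,
				   unsigned long start, unsigned long last)
{
	struct interval_tree_node *node = rightmost_match(root, start, last);

	/* _iter_prev() returns the previous overlapping node, or NULL. */
	for (; node; node = interval_tree_iter_prev(node, start, last))
		pr_debug("overlap [%lu, %lu]\n", node->start, node->last);
}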

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 include/linux/interval_tree_generic.h | 79 +++++++++++++++++++++++++++++++++++
 1 file changed, 79 insertions(+)

diff --git a/include/linux/interval_tree_generic.h b/include/linux/interval_tree_generic.h
index 58370e1..97dd71b 100644
--- a/include/linux/interval_tree_generic.h
+++ b/include/linux/interval_tree_generic.h
@@ -188,4 +188,83 @@ ITPREFIX ## _iter_next(ITSTRUCT *node, ITTYPE start, ITTYPE last)	      \
 		else if (start <= ITLAST(node))		/* Cond2 */	      \
 			return node;					      \
 	}								      \
+}									      \
+									      \
+static ITSTRUCT *							      \
+ITPREFIX ## _subtree_rmost(ITSTRUCT *node, ITTYPE start, ITTYPE last)	      \
+{									      \
+	while (true) {							      \
+		/*							      \
+		 * Loop invariant: last >= ITSTART(node)		      \
+		 * (Cond1 is satisfied)					      \
+		 */							      \
+		if (node->ITRB.rb_right) {				      \
+			ITSTRUCT *right = rb_entry(node->ITRB.rb_right,	      \
+						   ITSTRUCT, ITRB);	      \
+			if (last >= ITSTART(right)) {			      \
+				/*					      \
+				 * Some nodes in right subtree satisfy Cond1. \
+				 * Iterate to find the rightmost such node N. \
+				 * If it also satisfies Cond2, that's the     \
+				 * match we are looking for.		      \
+				 */					      \
+				node = right;				      \
+				continue;				      \
+			}						      \
+			/* Left branch might still have a candidate. */	      \
+			if (right->ITRB.rb_left) {			      \
+				right = rb_entry(right->ITRB.rb_left,	      \
+						 ITSTRUCT, ITRB);	      \
+				if (last >= ITSTART(right)) {		      \
+					node = right;			      \
+					continue;			      \
+				}					      \
+			}						      \
+		}							      \
+		/* At this point node is the rightmost candidate. */	      \
+		if (last >= ITSTART(node)) {		/* Cond1 */	      \
+			if (start <= ITLAST(node))	/* Cond2 */	      \
+				return node;	/* node is rightmost match */ \
+		}							      \
+		return NULL;	/* No match */				      \
+	}								      \
+}									      \
+									      \
+ITSTATIC ITSTRUCT *							      \
+ITPREFIX ## _iter_prev(ITSTRUCT *node, ITTYPE start, ITTYPE last)	      \
+{									      \
+	struct rb_node *rb = node->ITRB.rb_left, *prev;			      \
+									      \
+	while (true) {							      \
+		/*							      \
+		 * Loop invariants:					      \
+		 *   Cond2: start <= ITLAST(node)			      \
+		 *   rb == node->ITRB.rb_left				      \
+		 *							      \
+		 * First, search left subtree if suitable		      \
+		 */							      \
+		if (rb) {						      \
+			ITSTRUCT *left = rb_entry(rb, ITSTRUCT, ITRB);	      \
+			if (start <= left->ITSUBTREE)			      \
+				return ITPREFIX ## _subtree_rmost(left,       \
+								  start,      \
+								  last);      \
+		}							      \
+									      \
+		/* Move up the tree until we come from a node's right child */\
+		do {							      \
+			rb = rb_parent(&node->ITRB);			      \
+			if (!rb)					      \
+				return NULL;				      \
+			prev = &node->ITRB;				      \
+			node = rb_entry(rb, ITSTRUCT, ITRB);		      \
+			rb = node->ITRB.rb_left;			      \
+		} while (prev == rb);					      \
+									      \
+		/* Check if the node intersects [start;last] */		      \
+		if (start > ITLAST(node))		/* !Cond2 */	      \
+			return NULL;					      \
+		else if (ITSTART(node) <= last)		/* Cond1 */	      \
+			return node;					      \
+	}								      \
 }
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 05/11] mm/memcg: support accounting null page and transferring null charge to new page.
@ 2014-05-02 13:52   ` j.glisse
  0 siblings, 0 replies; 107+ messages in thread
From: j.glisse @ 2014-05-02 13:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel; +Cc: Jérôme Glisse

From: Jérôme Glisse <jglisse@redhat.com>

When migrating memory to some device-specific memory we still want to properly
account memcg memory usage. To do so we need to be able to account for pages
that are not allocated in system memory. We also need to be able to transfer a
previous charge from device memory to a page atomically from the memcg point
of view.

Also introduce a helper function to clear a page's memcg.
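For illustration, a rough sketch of the intended call flow from a device
driver's point of view (the example_* functions are hypothetical; only the
mem_cgroup_* helpers come from this patch):

/* Memory migrated to the device: account one anonymous page, no struct page. */
static int example_charge_device_page(struct mm_struct *mm)
{
	return mem_cgroup_charge_anon(NULL, mm, GFP_KERNEL);
}

/* Memory migrated back: hand the charge over to the new system page. */
static void example_migrate_back(struct mm_struct *mm, struct page *new_page)
{
	mem_cgroup_transfer_charge_anon(new_page, mm);
}

/* Memory freed while still on the device: drop the pageless charge. */
static void example_free_device_page(struct mm_struct *mm)
{
	mem_cgroup_uncharge_mm(mm);
}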

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 include/linux/memcontrol.h |  17 +++++
 mm/memcontrol.c            | 161 +++++++++++++++++++++++++++++++++++++++------
 2 files changed, 159 insertions(+), 19 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1fa2324..1737323 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -67,6 +67,8 @@ struct mem_cgroup_reclaim_cookie {
 
 extern int mem_cgroup_charge_anon(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask);
+extern void mem_cgroup_transfer_charge_anon(struct page *page,
+					    struct mm_struct *mm);
 /* for swap handling */
 extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 		struct page *page, gfp_t mask, struct mem_cgroup **memcgp);
@@ -85,6 +87,8 @@ extern void mem_cgroup_uncharge_start(void);
 extern void mem_cgroup_uncharge_end(void);
 
 extern void mem_cgroup_uncharge_page(struct page *page);
+extern void mem_cgroup_uncharge_mm(struct mm_struct *mm);
+extern void mem_cgroup_clear_page(struct page *page);
 extern void mem_cgroup_uncharge_cache_page(struct page *page);
 
 bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
@@ -245,6 +249,11 @@ static inline int mem_cgroup_charge_file(struct page *page,
 	return 0;
 }
 
+static inline void mem_cgroup_transfer_charge_anon(struct page *page,
+						   struct mm_struct *mm)
+{
+}
+
 static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 		struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp)
 {
@@ -272,6 +281,14 @@ static inline void mem_cgroup_uncharge_page(struct page *page)
 {
 }
 
+static inline void mem_cgroup_uncharge_mm(struct mm_struct *mm)
+{
+}
+
+static inline void mem_cgroup_clear_page(struct page *page)
+{
+}
+
 static inline void mem_cgroup_uncharge_cache_page(struct page *page)
 {
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 19d620b..ceaf4d7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -940,7 +940,7 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
 		__this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_CACHE],
 				nr_pages);
 
-	if (PageTransHuge(page))
+	if (page && PageTransHuge(page))
 		__this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE],
 				nr_pages);
 
@@ -2842,12 +2842,17 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
 				       enum charge_type ctype,
 				       bool lrucare)
 {
-	struct page_cgroup *pc = lookup_page_cgroup(page);
+	struct page_cgroup *pc;
 	struct zone *uninitialized_var(zone);
 	struct lruvec *lruvec;
 	bool was_on_lru = false;
 	bool anon;
 
+	if (page == NULL) {
+		goto charge;
+	}
+
+	pc = lookup_page_cgroup(page);
 	lock_page_cgroup(pc);
 	VM_BUG_ON_PAGE(PageCgroupUsed(pc), page);
 	/*
@@ -2891,20 +2896,24 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
 		spin_unlock_irq(&zone->lru_lock);
 	}
 
+charge:
 	if (ctype == MEM_CGROUP_CHARGE_TYPE_ANON)
 		anon = true;
 	else
 		anon = false;
 
 	mem_cgroup_charge_statistics(memcg, page, anon, nr_pages);
-	unlock_page_cgroup(pc);
 
-	/*
-	 * "charge_statistics" updated event counter. Then, check it.
-	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
-	 * if they exceeds softlimit.
-	 */
-	memcg_check_events(memcg, page);
+	if (page) {
+		unlock_page_cgroup(pc);
+
+		/*
+		 * "charge_statistics" updated event counter. Then, check it.
+		 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
+		 * if they exceeds softlimit.
+		 */
+		memcg_check_events(memcg, page);
+	}
 }
 
 static DEFINE_MUTEX(set_limit_mutex);
@@ -3745,20 +3754,23 @@ int mem_cgroup_charge_anon(struct page *page,
 	if (mem_cgroup_disabled())
 		return 0;
 
-	VM_BUG_ON_PAGE(page_mapped(page), page);
-	VM_BUG_ON_PAGE(page->mapping && !PageAnon(page), page);
 	VM_BUG_ON(!mm);
+	if (page) {
+		VM_BUG_ON_PAGE(page_mapped(page), page);
+		VM_BUG_ON_PAGE(page->mapping && !PageAnon(page), page);
 
-	if (PageTransHuge(page)) {
-		nr_pages <<= compound_order(page);
-		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-		/*
-		 * Never OOM-kill a process for a huge page.  The
-		 * fault handler will fall back to regular pages.
-		 */
-		oom = false;
+		if (PageTransHuge(page)) {
+			nr_pages <<= compound_order(page);
+			VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+			/*
+			 * Never OOM-kill a process for a huge page.  The
+			 * fault handler will fall back to regular pages.
+			 */
+			oom = false;
+		}
 	}
 
+
 	memcg = mem_cgroup_try_charge_mm(mm, gfp_mask, nr_pages, oom);
 	if (!memcg)
 		return -ENOMEM;
@@ -3767,6 +3779,60 @@ int mem_cgroup_charge_anon(struct page *page,
 	return 0;
 }
 
+void mem_cgroup_transfer_charge_anon(struct page *page, struct mm_struct *mm)
+{
+	struct page_cgroup *pc;
+	struct task_struct *task;
+	struct mem_cgroup *memcg;
+	struct zone *uninitialized_var(zone);
+
+	if (mem_cgroup_disabled())
+		return;
+
+	VM_BUG_ON(page->mapping && !PageAnon(page));
+	VM_BUG_ON(!mm);
+
+	rcu_read_lock();
+	task = rcu_dereference(mm->owner);
+	/*
+	 * Because we don't have task_lock(), "p" can exit.
+	 * In that case, "memcg" can point to root or p can be NULL with
+	 * race with swapoff. Then, we have a small risk of mis-accounting.
+	 * But such mis-accounting by race always happens because
+	 * we don't have cgroup_mutex(). It's overkill and we allow that
+	 * small race, here.
+	 * (*) swapoff et al. will charge against mm-struct not against
+	 * task-struct. So, mm->owner can be NULL.
+	 */
+	memcg = mem_cgroup_from_task(task);
+	if (!memcg) {
+		memcg = root_mem_cgroup;
+	}
+	rcu_read_unlock();
+
+	pc = lookup_page_cgroup(page);
+	lock_page_cgroup(pc);
+	VM_BUG_ON(PageCgroupUsed(pc));
+	/*
+	 * we don't need page_cgroup_lock about tail pages, because they are not
+	 * accessed by any other context at this point.
+	 */
+
+	pc->mem_cgroup = memcg;
+	/*
+	 * We access a page_cgroup asynchronously without lock_page_cgroup().
+	 * Especially when a page_cgroup is taken from a page, pc->mem_cgroup
+	 * is accessed after testing USED bit. To make pc->mem_cgroup visible
+	 * before USED bit, we need memory barrier here.
+	 * See mem_cgroup_add_lru_list(), etc.
+	 */
+	smp_wmb();
+	SetPageCgroupUsed(pc);
+
+	unlock_page_cgroup(pc);
+	memcg_check_events(memcg, page);
+}
+
 /*
  * While swap-in, try_charge -> commit or cancel, the page is locked.
  * And when try_charge() successfully returns, one refcnt to memcg without
@@ -4087,6 +4153,63 @@ void mem_cgroup_uncharge_page(struct page *page)
 	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_ANON, false);
 }
 
+void mem_cgroup_uncharge_mm(struct mm_struct *mm)
+{
+	struct mem_cgroup *memcg;
+	struct task_struct *task;
+
+	if (mem_cgroup_disabled())
+		return;
+
+	VM_BUG_ON(!mm);
+
+	rcu_read_lock();
+	task = rcu_dereference(mm->owner);
+	/*
+	 * Because we don't have task_lock(), "task" can exit.
+	 * In that case, "memcg" can point to root or task can be NULL with
+	 * a race with swapoff. Then, we have a small risk of mis-accounting.
+	 * But such kind of mis-accounting by race always happens because
+	 * we don't have cgroup_mutex(). It's overkill and we allow that
+	 * small race, here.
+	 * (*) swapoff et al. will charge against the mm-struct not against
+	 * the task-struct. So, mm->owner can be NULL.
+	 */
+	memcg = mem_cgroup_from_task(task);
+	if (!memcg) {
+		memcg = root_mem_cgroup;
+	}
+	rcu_read_unlock();
+
+	mem_cgroup_charge_statistics(memcg, NULL, true, -1);
+	if (!mem_cgroup_is_root(memcg))
+		mem_cgroup_do_uncharge(memcg, 1, MEM_CGROUP_CHARGE_TYPE_ANON);
+}
+
+void mem_cgroup_clear_page(struct page *page)
+{
+	struct page_cgroup *pc;
+
+	if (mem_cgroup_disabled())
+		return;
+
+	/*
+	 * Check if our page_cgroup is valid
+	 */
+	pc = lookup_page_cgroup(page);
+	if (unlikely(!PageCgroupUsed(pc)))
+		return;
+	lock_page_cgroup(pc);
+	ClearPageCgroupUsed(pc);
+	/*
+	 * pc->mem_cgroup is not cleared here. It will be accessed when it's
+	 * freed from LRU. This is safe because uncharged page is expected not
+	 * to be reused (freed soon). Exception is SwapCache, it's handled by
+	 * special functions.
+	 */
+	unlock_page_cgroup(pc);
+}
+
 void mem_cgroup_uncharge_cache_page(struct page *page)
 {
 	VM_BUG_ON_PAGE(page_mapped(page), page);
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 05/11] mm/memcg: support accounting null page and transferring null charge to new page.
@ 2014-05-02 13:52   ` j.glisse
  0 siblings, 0 replies; 107+ messages in thread
From: j.glisse @ 2014-05-02 13:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel; +Cc: Jérôme Glisse

From: Jérôme Glisse <jglisse@redhat.com>

When migrating memory to some device specific memory we still want to properly
account memcg memory usage. To do so we need to be able to account a charge for
memory that is not backed by a page allocated in system memory (a NULL page).
We also need to be able to transfer such a previous charge from device memory
to a page in a way that is atomic from the memcg point of view.

Also introduce a helper function to clear a page's memcg.
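
To make the intended use of the new helpers concrete, here is a minimal sketch
of how a migration path could combine them. This is illustration only and not
part of the patch: the hmm_*() wrappers below are hypothetical names, only the
mem_cgroup_*() calls are introduced here.

#include <linux/memcontrol.h>
#include <linux/mm_types.h>
#include <linux/gfp.h>

/* Device memory is allocated for the process: account it against the mm's
 * memcg even though there is no struct page behind it (page == NULL). */
static int hmm_account_rmem_page(struct mm_struct *mm, gfp_t gfp_mask)
{
	return mem_cgroup_charge_anon(NULL, mm, gfp_mask);
}

/* Device memory is migrated back: the page-less charge taken above is now
 * represented by the freshly allocated system page, atomically from the
 * memcg point of view (no uncharge/recharge window). */
static void hmm_rmem_to_new_page(struct page *new_page, struct mm_struct *mm)
{
	mem_cgroup_transfer_charge_anon(new_page, mm);
}

/* Device memory is freed without being migrated back: release the
 * page-less charge. */
static void hmm_release_rmem_page(struct mm_struct *mm)
{
	mem_cgroup_uncharge_mm(mm);
}

mem_cgroup_clear_page() looks like the counterpart on the system page side: it
drops the page's memcg link without uncharging, leaving the charge accounted
without a struct page.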

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 include/linux/memcontrol.h |  17 +++++
 mm/memcontrol.c            | 161 +++++++++++++++++++++++++++++++++++++++------
 2 files changed, 159 insertions(+), 19 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1fa2324..1737323 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -67,6 +67,8 @@ struct mem_cgroup_reclaim_cookie {
 
 extern int mem_cgroup_charge_anon(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask);
+extern void mem_cgroup_transfer_charge_anon(struct page *page,
+					    struct mm_struct *mm);
 /* for swap handling */
 extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 		struct page *page, gfp_t mask, struct mem_cgroup **memcgp);
@@ -85,6 +87,8 @@ extern void mem_cgroup_uncharge_start(void);
 extern void mem_cgroup_uncharge_end(void);
 
 extern void mem_cgroup_uncharge_page(struct page *page);
+extern void mem_cgroup_uncharge_mm(struct mm_struct *mm);
+extern void mem_cgroup_clear_page(struct page *page);
 extern void mem_cgroup_uncharge_cache_page(struct page *page);
 
 bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
@@ -245,6 +249,11 @@ static inline int mem_cgroup_charge_file(struct page *page,
 	return 0;
 }
 
+static inline void mem_cgroup_transfer_charge_anon(struct page *page,
+						   struct mm_struct *mm)
+{
+}
+
 static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 		struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp)
 {
@@ -272,6 +281,14 @@ static inline void mem_cgroup_uncharge_page(struct page *page)
 {
 }
 
+static inline void mem_cgroup_uncharge_mm(struct mm_struct *mm)
+{
+}
+
+static inline void mem_cgroup_clear_page(struct page *page)
+{
+}
+
 static inline void mem_cgroup_uncharge_cache_page(struct page *page)
 {
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 19d620b..ceaf4d7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -940,7 +940,7 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
 		__this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_CACHE],
 				nr_pages);
 
-	if (PageTransHuge(page))
+	if (page && PageTransHuge(page))
 		__this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE],
 				nr_pages);
 
@@ -2842,12 +2842,17 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
 				       enum charge_type ctype,
 				       bool lrucare)
 {
-	struct page_cgroup *pc = lookup_page_cgroup(page);
+	struct page_cgroup *pc;
 	struct zone *uninitialized_var(zone);
 	struct lruvec *lruvec;
 	bool was_on_lru = false;
 	bool anon;
 
+	if (page == NULL) {
+		goto charge;
+	}
+
+	pc = lookup_page_cgroup(page);
 	lock_page_cgroup(pc);
 	VM_BUG_ON_PAGE(PageCgroupUsed(pc), page);
 	/*
@@ -2891,20 +2896,24 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
 		spin_unlock_irq(&zone->lru_lock);
 	}
 
+charge:
 	if (ctype == MEM_CGROUP_CHARGE_TYPE_ANON)
 		anon = true;
 	else
 		anon = false;
 
 	mem_cgroup_charge_statistics(memcg, page, anon, nr_pages);
-	unlock_page_cgroup(pc);
 
-	/*
-	 * "charge_statistics" updated event counter. Then, check it.
-	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
-	 * if they exceeds softlimit.
-	 */
-	memcg_check_events(memcg, page);
+	if (page) {
+		unlock_page_cgroup(pc);
+
+		/*
+		 * "charge_statistics" updated the event counter. Then, check it.
+		 * Insert the ancestor (and the ancestor's ancestors) into the
+		 * softlimit RB-tree if they exceed the softlimit.
+		 */
+		memcg_check_events(memcg, page);
+	}
 }
 
 static DEFINE_MUTEX(set_limit_mutex);
@@ -3745,20 +3754,23 @@ int mem_cgroup_charge_anon(struct page *page,
 	if (mem_cgroup_disabled())
 		return 0;
 
-	VM_BUG_ON_PAGE(page_mapped(page), page);
-	VM_BUG_ON_PAGE(page->mapping && !PageAnon(page), page);
 	VM_BUG_ON(!mm);
+	if (page) {
+		VM_BUG_ON_PAGE(page_mapped(page), page);
+		VM_BUG_ON_PAGE(page->mapping && !PageAnon(page), page);
 
-	if (PageTransHuge(page)) {
-		nr_pages <<= compound_order(page);
-		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-		/*
-		 * Never OOM-kill a process for a huge page.  The
-		 * fault handler will fall back to regular pages.
-		 */
-		oom = false;
+		if (PageTransHuge(page)) {
+			nr_pages <<= compound_order(page);
+			VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+			/*
+			 * Never OOM-kill a process for a huge page.  The
+			 * fault handler will fall back to regular pages.
+			 */
+			oom = false;
+		}
 	}
 
+
 	memcg = mem_cgroup_try_charge_mm(mm, gfp_mask, nr_pages, oom);
 	if (!memcg)
 		return -ENOMEM;
@@ -3767,6 +3779,60 @@ int mem_cgroup_charge_anon(struct page *page,
 	return 0;
 }
 
+void mem_cgroup_transfer_charge_anon(struct page *page, struct mm_struct *mm)
+{
+	struct page_cgroup *pc;
+	struct task_struct *task;
+	struct mem_cgroup *memcg;
+	struct zone *uninitialized_var(zone);
+
+	if (mem_cgroup_disabled())
+		return;
+
+	VM_BUG_ON(page->mapping && !PageAnon(page));
+	VM_BUG_ON(!mm);
+
+	rcu_read_lock();
+	task = rcu_dereference(mm->owner);
+	/*
+	 * Because we don't have task_lock(), "task" can exit.
+	 * In that case, "memcg" can point to root or task can be NULL with
+	 * a race with swapoff. Then, we have a small risk of mis-accounting.
+	 * But such kind of mis-accounting by race always happens because
+	 * we don't have cgroup_mutex(). It's overkill and we allow that
+	 * small race, here.
+	 * (*) swapoff et al. will charge against the mm-struct not against
+	 * the task-struct. So, mm->owner can be NULL.
+	 */
+	memcg = mem_cgroup_from_task(task);
+	if (!memcg) {
+		memcg = root_mem_cgroup;
+	}
+	rcu_read_unlock();
+
+	pc = lookup_page_cgroup(page);
+	lock_page_cgroup(pc);
+	VM_BUG_ON(PageCgroupUsed(pc));
+	/*
+	 * we don't need page_cgroup_lock about tail pages, becase they are not
+	 * we don't need page_cgroup_lock about tail pages, because they are not
+	 */
+
+	pc->mem_cgroup = memcg;
+	/*
+	 * We access a page_cgroup asynchronously without lock_page_cgroup().
+	 * Especially when a page_cgroup is taken from a page, pc->mem_cgroup
+	 * is accessed after testing USED bit. To make pc->mem_cgroup visible
+	 * before USED bit, we need memory barrier here.
+	 * See mem_cgroup_add_lru_list(), etc.
+	 */
+	smp_wmb();
+	SetPageCgroupUsed(pc);
+
+	unlock_page_cgroup(pc);
+	memcg_check_events(memcg, page);
+}
+
 /*
  * While swap-in, try_charge -> commit or cancel, the page is locked.
  * And when try_charge() successfully returns, one refcnt to memcg without
@@ -4087,6 +4153,63 @@ void mem_cgroup_uncharge_page(struct page *page)
 	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_ANON, false);
 }
 
+void mem_cgroup_uncharge_mm(struct mm_struct *mm)
+{
+	struct mem_cgroup *memcg;
+	struct task_struct *task;
+
+	if (mem_cgroup_disabled())
+		return;
+
+	VM_BUG_ON(!mm);
+
+	rcu_read_lock();
+	task = rcu_dereference(mm->owner);
+	/*
+	/*
+	 * Because we don't have task_lock(), "task" can exit.
+	 * In that case, "memcg" can point to root or task can be NULL with
+	 * a race with swapoff. Then, we have a small risk of mis-accounting.
+	 * But such kind of mis-accounting by race always happens because
+	 * we don't have cgroup_mutex(). It's overkill and we allow that
+	 * small race, here.
+	 * (*) swapoff et al. will charge against the mm-struct not against
+	 * the task-struct. So, mm->owner can be NULL.
+	 */
+	if (!memcg) {
+		memcg = root_mem_cgroup;
+	}
+	rcu_read_unlock();
+
+	mem_cgroup_charge_statistics(memcg, NULL, true, -1);
+	if (!mem_cgroup_is_root(memcg))
+		mem_cgroup_do_uncharge(memcg, 1, MEM_CGROUP_CHARGE_TYPE_ANON);
+}
+
+void mem_cgroup_clear_page(struct page *page)
+{
+	struct page_cgroup *pc;
+
+	if (mem_cgroup_disabled())
+		return;
+
+	/*
+	 * Check if our page_cgroup is valid
+	 */
+	pc = lookup_page_cgroup(page);
+	if (unlikely(!PageCgroupUsed(pc)))
+		return;
+	lock_page_cgroup(pc);
+	ClearPageCgroupUsed(pc);
+	/*
+	 * pc->mem_cgroup is not cleared here. It will be accessed when it's
+	 * freed from LRU. This is safe because uncharged page is expected not
+	 * to be reused (freed soon). Exception is SwapCache, it's handled by
+	 * special functions.
+	 */
+	unlock_page_cgroup(pc);
+}
+
 void mem_cgroup_uncharge_cache_page(struct page *page)
 {
 	VM_BUG_ON_PAGE(page_mapped(page), page);
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 06/11] hmm: heterogeneous memory management
  2014-05-02 13:51 ` j.glisse
  (?)
@ 2014-05-02 13:52   ` j.glisse
  -1 siblings, 0 replies; 107+ messages in thread
From: j.glisse @ 2014-05-02 13:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel
  Cc: Jérôme Glisse, Sherry Cheung, Subhash Gutti,
	Mark Hairgrove, John Hubbard, Jatin Kumar

From: Jérôme Glisse <jglisse@redhat.com>

Motivation:

Heterogeneous memory management is intended to allow a device to transparently
access a process address space without having to lock pages of the process or
take references on them. In other words it mirrors a process address space
while allowing regular memory management events, such as page reclamation or
page migration, to happen seamlessly.

Recent years have seen a surge in the number of specialized devices that are
part of a computer platform (from desktop to phone). So far each of those
devices has operated on its own private address space that is neither linked
to nor exposed in the address space of the process using it. This separation
often leads to multiple memory copies between the device owned memory and the
process memory. This of course is a waste of both cpu cycles and memory.

Over the last few years most of those devices have gained a full mmu allowing
them to support multiple page tables, page faults and other features that are
found in a cpu mmu. There is now a strong incentive to start leveraging the
capabilities of such devices and to share the process address space, avoiding
any unnecessary memory copy as well as simplifying the programming model of
those devices by sharing a unique and common address space with the process
that uses them.

The aim of heterogeneous memory management is to provide a common API that
can be used by any such device in order to mirror a process address space. The
hmm code provides a unique entry point and interfaces itself with the core mm
code of the linux kernel, avoiding duplicate implementations and shielding
device driver code from core mm code.

Moreover, hmm also intends to provide support for migrating memory to device
private memory, allowing the device to work on its own fast local memory. The
hmm code would be responsible for intercepting cpu page faults on a migrated
range and migrating it back to system memory, allowing the cpu to resume its
access to the memory.

Another feature hmm intends to provide is support for atomic operations from
the device even if the bus linking the device and the cpu does not have any
such capability.

We expect graphics processing units and network interfaces to be among the
first prominent users of such an api.

Hardware requirement:

Because hmm is intended to be used by device drivers there are minimum feature
requirements for the hardware mmu :
  - hardware has its own page table per process (can be shared between
    different devices).
  - hardware mmu supports page faults and suspends execution until the page
    fault is serviced by hmm code. The page fault must also trigger some form
    of interrupt so that hmm code can be called by the device driver.
  - hardware must support at least read only mappings (otherwise it can not
    access read only ranges of the process address space).

For better memory management it is highly recommended that the device also
support the following features :
  - hardware mmu sets the access bit in its page table on memory access (like
    the cpu).
  - hardware page table can be updated from the cpu or through a fast path.
  - hardware provides advanced statistics about which ranges of memory it
    accesses the most.
  - hardware differentiates atomic memory accesses from regular accesses,
    allowing atomic operations to be supported even on platforms whose bus
    link with the device has no atomic support.

Implementation:

The hmm layer provides a simple API to the device driver. Each device driver
has to register an hmm device that holds pointers to all the callbacks the hmm
code will make to synchronize the device page table with the cpu page table of
a given process.

For each process it wants to mirror, the device driver must register an hmm
mirror structure that holds all the information specific to the process being
mirrored. Each hmm mirror uniquely links an hmm device with a process address
space (the mm struct).

This design allows several different device drivers to concurrently mirror the
same process. The hmm layer will dispatch to each device driver, as
appropriate, the modifications that are happening to the process address space.

The hmm layer relies on the mmu notifier api to monitor changes to the process
address space. Because updates to a device page table can have unbounded
completion time, the hmm layer needs the capability to sleep during mmu
notifier callbacks.

This patch only implements the core of the hmm layer and does not support
features such as migration to device memory.
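
To make the callback flow concrete, below is a rough sketch of what the driver
side could look like with this API. Everything prefixed my_ is hypothetical
and the actual device page table work is elided; only the hmm_* types and
functions come from this patch.

#include <linux/hmm.h>
#include <linux/sched.h>

/* Invalidate [faddr, laddr) in the device page table. Returning NULL means
 * the update is already complete; a real driver would return a fence and do
 * the work asynchronously. */
static struct hmm_fence *my_lmem_update(struct hmm_mirror *mirror,
					unsigned long faddr,
					unsigned long laddr,
					enum hmm_etype etype,
					bool dirty)
{
	return NULL;
}

/* Fill the device page table for [faddr, laddr) from the hmm pfns array. */
static int my_lmem_fault(struct hmm_mirror *mirror,
			 unsigned long faddr,
			 unsigned long laddr,
			 unsigned long *pfns,
			 struct hmm_fault *fault)
{
	return 0;
}

static int my_fence_wait(struct hmm_fence *fence)
{
	return 0;
}

static void my_mirror_release(struct hmm_mirror *mirror)
{
	/* Stop all device threads using this address space. */
}

static void my_mirror_destroy(struct hmm_mirror *mirror)
{
	/* Free the driver structure embedding the mirror. */
}

static void my_device_destroy(struct hmm_device *device)
{
}

static const struct hmm_device_ops my_hmm_ops = {
	.device_destroy	= my_device_destroy,
	.mirror_release	= my_mirror_release,
	.mirror_destroy	= my_mirror_destroy,
	.fence_wait	= my_fence_wait,
	.lmem_update	= my_lmem_update,
	.lmem_fault	= my_lmem_fault,
};

static struct hmm_device my_device = { .ops = &my_hmm_ops };

/* At driver load time. */
static int my_driver_init(void)
{
	return hmm_device_register(&my_device, "my-device");
}

/* When a process starts using the device: mirror its address space. */
static int my_bind_process(struct hmm_mirror *mirror)
{
	return hmm_mirror_register(mirror, &my_device, current->mm);
}

/* From the device fault interrupt handler: resolve one faulting page.
 * Returns the number of pages faulted or a negative error. */
static int my_handle_device_fault(struct hmm_mirror *mirror,
				  unsigned long addr,
				  unsigned long *pfns,
				  bool write)
{
	struct hmm_fault fault = {
		.faddr	= addr & PAGE_MASK,
		.laddr	= (addr & PAGE_MASK) + PAGE_SIZE,
		.pfns	= pfns,
		.flags	= write ? HMM_FAULT_WRITE : 0,
	};

	return hmm_mirror_fault(mirror, &fault);
}

The mirror_release/mirror_destroy split above follows the rule documented in
hmm.h: callbacks can still arrive after mirror_release, and only stop once
mirror_destroy is called.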

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 include/linux/hmm.h      |  470 ++++++++++++++++++
 include/linux/mm_types.h |   14 +
 kernel/fork.c            |    6 +
 mm/Kconfig               |   12 +
 mm/Makefile              |    1 +
 mm/hmm.c                 | 1194 ++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 1697 insertions(+)
 create mode 100644 include/linux/hmm.h
 create mode 100644 mm/hmm.c

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
new file mode 100644
index 0000000..e9c7722
--- /dev/null
+++ b/include/linux/hmm.h
@@ -0,0 +1,470 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/* This is heterogeneous memory management (hmm). In a nutshell it provides
+ * an API to mirror a process address space on a device which has its own mmu
+ * and its own page table for the process. It supports everything except
+ * special/mixed vma.
+ *
+ * To use this the hardware must have :
+ *   - mmu with pagetable
+ *   - pagetable must support read only (supporting dirtiness accounting is
+ *     preferable but is not mandatory).
+ *   - support for page faults, ie hardware threads should stop on fault and
+ *     resume once hmm has provided valid memory to use.
+ *   - some way to report faults.
+ *
+ * The hmm code handles all the interfacing with the core kernel mm code and
+ * provides a simple API. It does support migrating system memory to device
+ * memory and handles migration back to system memory on cpu page fault.
+ *
+ * Migrated memory is considered as swapped out from the cpu and core mm code
+ * point of view.
+ */
+#ifndef _HMM_H
+#define _HMM_H
+
+#ifdef CONFIG_HMM
+
+#include <linux/list.h>
+#include <linux/rwsem.h>
+#include <linux/spinlock.h>
+#include <linux/atomic.h>
+#include <linux/mm_types.h>
+#include <linux/mmu_notifier.h>
+#include <linux/swap.h>
+#include <linux/kref.h>
+#include <linux/swapops.h>
+#include <linux/mman.h>
+
+
+struct hmm_device;
+struct hmm_device_ops;
+struct hmm_migrate;
+struct hmm_mirror;
+struct hmm_fault;
+struct hmm_event;
+struct hmm;
+
+/* The hmm provides page information to the device using hmm pfn values. Below
+ * are the various flags that define the current state the pfn is in (valid,
+ * type of page, dirty page, page is locked or not, ...).
+ *
+ *   HMM_PFN_VALID_PAGE this means the pfn corresponds to a valid page.
+ *   HMM_PFN_VALID_ZERO this means the pfn is the special zero page.
+ *   HMM_PFN_DIRTY set when the page is dirty.
+ *   HMM_PFN_WRITE is set if there is no need to call page_mkwrite.
+ */
+#define HMM_PFN_SHIFT		(PAGE_SHIFT)
+#define HMM_PFN_VALID_PAGE	(0UL)
+#define HMM_PFN_VALID_ZERO	(1UL)
+#define HMM_PFN_DIRTY		(2UL)
+#define HMM_PFN_WRITE		(3UL)
+
+static inline struct page *hmm_pfn_to_page(unsigned long pfn)
+{
+	/* Ok to test one bit after the other as they can not flip from one to
+	 * the other. Both bits are constant for the lifetime of an rmem
+	 * object.
+	 */
+	if (!test_bit(HMM_PFN_VALID_PAGE, &pfn) &&
+	    !test_bit(HMM_PFN_VALID_ZERO, &pfn)) {
+		return NULL;
+	}
+	return pfn_to_page(pfn >> HMM_PFN_SHIFT);
+}
+
+static inline void hmm_pfn_set_dirty(unsigned long *pfn)
+{
+	set_bit(HMM_PFN_DIRTY, pfn);
+}
+
+
+/* hmm_fence - device driver fence to wait for device driver operations.
+ *
+ * In order to concurrently update several different device mmus the hmm
+ * relies on device driver fences to wait for operations hmm has scheduled to
+ * complete on the device. It is strongly recommended to implement fences and
+ * have the hmm callbacks do as little as possible (just scheduling the
+ * update). Moreover the hmm code will reschedule the current process for i/o
+ * if necessary once it has scheduled all updates on all devices.
+ *
+ * Each fence is created as a result of either an update to a range of memory
+ * or a remote memory to/from local memory dma.
+ *
+ * An update to a range of memory corresponds to a specific event type. For
+ * instance a range of memory is unmapped for page reclamation, or unmapped
+ * from the process address space as a result of a munmap syscall
+ * (HMM_RANGE_FINI), or there is a memory protection change on the range.
+ * There is one hmm_etype for each of those events, allowing the device driver
+ * to take appropriate action, like for instance freeing the device page table
+ * on HMM_RANGE_FINI but keeping it if it is HMM_RANGE_UNMAP (which means that
+ * the range is unmapped but still valid).
+ */
+enum hmm_etype {
+	HMM_NONE = 0,
+	HMM_UNREGISTER,
+	HMM_DEVICE_FAULT,
+	HMM_MPROT_RONLY,
+	HMM_MPROT_RANDW,
+	HMM_MPROT_WONLY,
+	HMM_COW,
+	HMM_MUNMAP,
+	HMM_UNMAP,
+	HMM_MIGRATE_TO_LMEM,
+	HMM_MIGRATE_TO_RMEM,
+};
+
+struct hmm_fence {
+	struct list_head	list;
+	struct hmm_mirror	*mirror;
+};
+
+
+
+
+/* hmm_device - Each device driver must register one and only one hmm_device.
+ *
+ * The hmm_device is the link btw hmm and each device driver.
+ */
+
+/* struct hmm_device_operations - hmm device operation callback
+ */
+struct hmm_device_ops {
+	/* device_destroy - free hmm_device (called when the refcount drops to 0).
+	 *
+	 * @device: The device hmm specific structure.
+	 */
+	void (*device_destroy)(struct hmm_device *device);
+
+	/* mirror_release() - device must stop using the address space.
+	 *
+	 * @mirror: The mirror that link process address space with the device.
+	 *
+	 * Called as a result of hmm_mirror_unregister or when the mm is being
+	 * destroyed.
+	 *
+	 * It's illegal for the device to call any hmm helper function after
+	 * this callback. The device driver must kill any pending device
+	 * threads and wait for completion of all of them.
+	 *
+	 * Note that even after this callback returns the device driver might
+	 * still get callbacks from hmm. Callbacks stop only once mirror_destroy
+	 * is called.
+	 */
+	void (*mirror_release)(struct hmm_mirror *hmm_mirror);
+
+	/* mirror_destroy - free hmm_mirror (called when the refcount drops to 0).
+	 *
+	 * @mirror: The mirror that link process address space with the device.
+	 */
+	void (*mirror_destroy)(struct hmm_mirror *mirror);
+
+	/* fence_wait() - to wait on device driver fence.
+	 *
+	 * @fence:      The device driver fence struct.
+	 * Returns:     0 on success,-EIO on error, -EAGAIN to wait again.
+	 *
+	 * Called when hmm wants to wait for all operations associated with a
+	 * fence to complete (including a device cache flush if the event
+	 * mandates it).
+	 *
+	 * The device driver must free the fence and associated resources if it
+	 * returns something other than -EAGAIN. On -EAGAIN the fence must not
+	 * be freed as hmm will call back again.
+	 *
+	 * Return an error if the scheduled operation failed or needs more time.
+	 * -EIO    Some input/output error with the device.
+	 * -EAGAIN The fence is not yet signaled, hmm reschedules the waiting thread.
+	 *
+	 * All other return values trigger a warning and are transformed to -EIO.
+	 */
+	int (*fence_wait)(struct hmm_fence *fence);
+
+	/* lmem_update() - update device mmu for a range of local memory.
+	 *
+	 * @mirror: The mirror that link process address space with the device.
+	 * @faddr:  First address in range (inclusive).
+	 * @laddr:  Last address in range (exclusive).
+	 * @etype:  The type of memory event (unmap, fini, read only, ...).
+	 * @dirty:  Device driver should call set_page_dirty_lock.
+	 * Returns: Valid fence ptr or NULL on success otherwise ERR_PTR.
+	 *
+	 * Called to update device mmu permission/usage for a range of local
+	 * memory. The event type provide the nature of the update :
+	 *   - range is no longer valid (munmap).
+	 *   - range protection changes (mprotect, COW, ...).
+	 *   - range is unmapped (swap, reclaim, page migration, ...).
+	 *   - ...
+	 *
+	 * Any event that blocks further writes to the memory must also trigger
+	 * a device cache flush, and everything has to be flushed to local
+	 * memory by the time the wait callback returns (if this callback
+	 * returned a fence, otherwise by the time this callback returns).
+	 *
+	 * The device must properly call set_page_dirty on any page the device
+	 * wrote to since the last call to lmem_update. This is only needed if
+	 * the dirty parameter is true.
+	 *
+	 * The driver should return a fence pointer or NULL on success. It is
+	 * advised to return a fence and delay waiting for the operation to
+	 * complete until the wait callback. Returning a fence allows hmm to
+	 * batch updates to several devices and wait on them once they all have
+	 * scheduled the update.
+	 *
+	 * The device driver must not fail lightly; any failure results in the
+	 * device process being killed.
+	 *
+	 * IMPORTANT IF DEVICE DRIVER GET HMM_MPROT_RANDW or HMM_MPROT_WONLY IT
+	 * MUST NOT MAP SPECIAL ZERO PFN WITH WRITE PERMISSION. SPECIAL ZERO
+	 * PFN IS SET THROUGH lmem_fault WITH THE HMM_PFN_VALID_ZERO BIT FLAG
+	 * SET.
+	 *
+	 * Return fence or NULL on success, error value otherwise :
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * All other return values trigger a warning and are transformed to -EIO.
+	 */
+	struct hmm_fence *(*lmem_update)(struct hmm_mirror *mirror,
+					 unsigned long faddr,
+					 unsigned long laddr,
+					 enum hmm_etype etype,
+					 bool dirty);
+
+	/* lmem_fault() - fault range of lmem on the device mmu.
+	 *
+	 * @mirror: The mirror that link process address space with the device.
+	 * @faddr:  First address in range (inclusive).
+	 * @laddr:  Last address in range (exclusive).
+	 * @pfns:   Array of pfn for the range (each of the pfn is valid).
+	 * @fault:  The fault structure provided by device driver.
+	 * Returns: 0 on success, error value otherwise.
+	 *
+	 * Called to give the device driver each of the pfns backing a range of
+	 * memory. It is only called as a result of a call to hmm_mirror_fault.
+	 *
+	 * Note that the pfns array content is only valid for the duration of
+	 * the callback. Once the device driver callback returns, further memory
+	 * activity might invalidate the values in the pfns array. The device
+	 * driver will be informed of such changes through the update callback.
+	 *
+	 * Allowed return values are :
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * The device driver must not fail lightly; any failure results in the
+	 * device process being killed.
+	 *
+	 * Return an error if the scheduled operation failed. Valid values :
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * All other return values trigger a warning and are transformed to -EIO.
+	 */
+	int (*lmem_fault)(struct hmm_mirror *mirror,
+			  unsigned long faddr,
+			  unsigned long laddr,
+			  unsigned long *pfns,
+			  struct hmm_fault *fault);
+};
+
+/* struct hmm_device - per device hmm structure
+ *
+ * @kref:       Reference count.
+ * @mirrors:    List of all active mirrors for the device.
+ * @mutex:      Mutex protecting mirrors list.
+ * @ops:        The hmm operations callback.
+ * @name:       Device name (uniquely identify the device on the system).
+ *
+ * Each device that wants to mirror an address space must register one of
+ * these structs (only once).
+ */
+struct hmm_device {
+	struct kref			kref;
+	struct list_head		mirrors;
+	struct mutex			mutex;
+	const struct hmm_device_ops	*ops;
+	const char			*name;
+};
+
+/* hmm_device_register() - register a device with hmm.
+ *
+ * @device: The hmm_device struct.
+ * @name:   A unique name string for the device (use in error messages).
+ * Returns: 0 on success, -EINVAL otherwise.
+ *
+ * Called when the device driver wants to register itself with hmm. A device
+ * driver can only register once. It returns a reference on the device, thus to
+ * release the device the driver must drop that reference.
+ */
+int hmm_device_register(struct hmm_device *device, const char *name);
+
+struct hmm_device *hmm_device_ref(struct hmm_device *device);
+struct hmm_device *hmm_device_unref(struct hmm_device *device);
+
+
+
+
+/* hmm_mirror - device specific mirroring functions.
+ *
+ * Each device that mirrors a process has a unique hmm_mirror struct associating
+ * the process address space with the device. A process can be mirrored by
+ * several different devices at the same time.
+ */
+
+/* struct hmm_mirror - per device and per mm hmm structure
+ *
+ * @kref:       Reference count.
+ * @dlist:      List of all hmm_mirror for same device.
+ * @mlist:      List of all hmm_mirror for same mm.
+ * @device:     The hmm_device struct this hmm_mirror is associated to.
+ * @hmm:        The hmm struct this hmm_mirror is associated to.
+ * @dead:       The hmm_mirror is dead and should no longer be use.
+ *
+ * Each device that wants to mirror an address space must register one of
+ * these structs for each address space it wants to mirror. The same device
+ * can mirror several different address spaces, and the same address space
+ * can be mirrored by several different devices.
+ */
+struct hmm_mirror {
+	struct kref		kref;
+	struct list_head	dlist;
+	struct list_head	mlist;
+	struct hmm_device	*device;
+	struct hmm		*hmm;
+	bool			dead;
+};
+
+/* hmm_mirror_register() - register a device mirror against an mm struct
+ *
+ * @mirror: The mirror that link process address space with the device.
+ * @device: The device struct to associate this mirror with.
+ * @mm:     The mm struct of the process.
+ * Returns: 0 success, -ENOMEM, -EBUSY or -EINVAL if process already mirrored.
+ *
+ * Called when the device driver wants to start mirroring a process address
+ * space. The hmm shim will register an mmu_notifier and start monitoring
+ * process address space changes. Hence callbacks to the device driver might
+ * happen even before this function returns.
+ *
+ * The mm pin must also be held (either the task is current or use get_task_mm).
+ *
+ * Only one mirror per mm and hmm_device can be created; this will return
+ * -EINVAL if the hmm_device already has an hmm_mirror for the mm.
+ *
+ * If the mm or a previous hmm is in a transient state then this will return
+ * -EBUSY and the device driver must retry the call after unpinning the mm and
+ * checking again that the mm is valid.
+ *
+ * On success the mirror is returned with one reference for the caller, thus to
+ * release the mirror call hmm_mirror_unref.
+ */
+int hmm_mirror_register(struct hmm_mirror *mirror,
+			struct hmm_device *device,
+			struct mm_struct *mm);
+
+/* hmm_mirror_unregister() - unregister an hmm_mirror.
+ *
+ * @mirror: The mirror that link process address space with the device.
+ *
+ * Called when the device driver wants to stop mirroring a process address space.
+ */
+void hmm_mirror_unregister(struct hmm_mirror *mirror);
+
+/* struct hmm_fault - device mirror fault information
+ *
+ * @vma:    The vma into which the fault range is (set by hmm).
+ * @faddr:  First address of the range device want to fault (set by driver and
+ *          updated by hmm to the actual first faulted address).
+ * @laddr:  Last address of the range device want to fault (set by driver and
+ *          updated by hmm to the actual last faulted address).
+ * @pfns:   Array to hold the pfn value of each page in the range (provided by
+ *          device driver, big enough to hold (laddr - faddr) >> PAGE_SHIFT).
+ * @flags:  Fault flags (set by driver).
+ *
+ * This structure is given by the device driver to hmm_mirror_fault. The device
+ * driver can encapsulate the hmm_fault struct into its own fault structure and
+ * use that to provide private device driver information to the lmem_fault
+ * callback.
+ */
+struct hmm_fault {
+	struct vm_area_struct	*vma;
+	unsigned long		faddr;
+	unsigned long		laddr;
+	unsigned long		*pfns;
+	unsigned long		flags;
+};
+
+#define HMM_FAULT_WRITE		(1 << 0)
+
+/* hmm_mirror_fault() - called by the device driver on a device memory fault.
+ *
+ * @mirror:     The mirror that links the process address space with the device.
+ * @fault:      The mirror fault struct holding fault range information.
+ *
+ * Called when the device is trying to access an invalid address in the device
+ * page table. The hmm shim will call lmem_fault with strong ordering with
+ * respect to calls to lmem_update (ie any information provided to lmem_fault
+ * is valid until the device callback returns).
+ *
+ * It will try to fault all pages in the range and give their pfns. If the vma
+ * covering the range needs to grow then it will.
+ *
+ * Also the fault will clamp the requested range to the valid vma range (unless
+ * the vma into which fault->faddr falls can grow).
+ *
+ * All errors must be handled by the device driver and most likely result in
+ * the process device tasks being killed by the device driver.
+ *
+ * Returns:
+ * > 0 Number of pages faulted.
+ * -EINVAL if invalid argument.
+ * -ENOMEM if failing to allocate memory.
+ * -EACCES if trying to write to read only address (only for faddr).
+ * -EFAULT if trying to access an invalid address (only for faddr).
+ * -ENODEV if mirror is in process of being destroy.
+ */
+int hmm_mirror_fault(struct hmm_mirror *mirror,
+		     struct hmm_fault *fault);
+
+struct hmm_mirror *hmm_mirror_ref(struct hmm_mirror *mirror);
+struct hmm_mirror *hmm_mirror_unref(struct hmm_mirror *mirror);
+
+
+
+
+/* Functions used by core mm code. Device driver should not use any of them. */
+void __hmm_destroy(struct mm_struct *mm);
+static inline void hmm_destroy(struct mm_struct *mm)
+{
+	if (mm->hmm) {
+		__hmm_destroy(mm);
+	}
+}
+
+#else /* !CONFIG_HMM */
+
+static inline void hmm_destroy(struct mm_struct *mm)
+{
+}
+
+#endif /* !CONFIG_HMM */
+
+#endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index de16272..8fa66cc 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -16,6 +16,10 @@
 #include <asm/page.h>
 #include <asm/mmu.h>
 
+#ifdef CONFIG_HMM
+struct hmm;
+#endif
+
 #ifndef AT_VECTOR_SIZE_ARCH
 #define AT_VECTOR_SIZE_ARCH 0
 #endif
@@ -425,6 +429,16 @@ struct mm_struct {
 #ifdef CONFIG_MMU_NOTIFIER
 	struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
+#ifdef CONFIG_HMM
+	/*
+	 * hmm always registers an mmu_notifier; we rely on the mmu notifier to
+	 * keep a refcount on the mm struct as well as to forbid registering
+	 * hmm on a dying mm.
+	 *
+	 * This field is set with mmap_sem held in write mode.
+	 */
+	struct hmm *hmm;
+#endif
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	pgtable_t pmd_huge_pte; /* protected by page_table_lock */
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 0d53eb0..56fce77 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -27,6 +27,7 @@
 #include <linux/binfmts.h>
 #include <linux/mman.h>
 #include <linux/mmu_notifier.h>
+#include <linux/hmm.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/vmacache.h>
@@ -602,6 +603,8 @@ void __mmdrop(struct mm_struct *mm)
 	mm_free_pgd(mm);
 	destroy_context(mm);
 	mmu_notifier_mm_destroy(mm);
+	/* hmm_destroy needs to be called after mmu_notifier_mm_destroy */
+	hmm_destroy(mm);
 	check_mm(mm);
 	free_mm(mm);
 }
@@ -820,6 +823,9 @@ static struct mm_struct *dup_mm(struct task_struct *tsk)
 
 	memcpy(mm, oldmm, sizeof(*mm));
 	mm_init_cpumask(mm);
+#ifdef CONFIG_HMM
+	mm->hmm = NULL;
+#endif
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	mm->pmd_huge_pte = NULL;
diff --git a/mm/Kconfig b/mm/Kconfig
index 30cb6cb..7836f17 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -584,3 +584,15 @@ config PGTABLE_MAPPING
 
 config GENERIC_EARLY_IOREMAP
 	bool
+
+config HMM
+	bool "Enable heterogeneous memory management (HMM)"
+	depends on MMU
+	select MMU_NOTIFIER
+	default n
+	help
+	  Heterogeneous memory management provides infrastructure for a device
+	  to mirror a process address space into a hardware mmu or into
+	  anything supporting page-fault-like events.
+
+	  If unsure, say N to disable hmm.
diff --git a/mm/Makefile b/mm/Makefile
index b484452..d231646 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -63,3 +63,4 @@ obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
 obj-$(CONFIG_ZBUD)	+= zbud.o
 obj-$(CONFIG_ZSMALLOC)	+= zsmalloc.o
 obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
+obj-$(CONFIG_HMM) += hmm.o
diff --git a/mm/hmm.c b/mm/hmm.c
new file mode 100644
index 0000000..2b8986c
--- /dev/null
+++ b/mm/hmm.c
@@ -0,0 +1,1194 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/* This is the core code for heterogeneous memory management (HMM). HMM intends
+ * to provide helpers for mirroring a process address space on a device as well
+ * as allowing migration of data between local memory and device memory.
+ *
+ * Refer to include/linux/hmm.h for further information on the general design.
+ */
+/* Locking :
+ *
+ *   To synchronize with various mm events there is a simple serialization of
+ *   events touching overlapping ranges of addresses. Each mm event is
+ *   associated with an hmm_event structure which stores its address range.
+ *
+ *   When a new mm event calls into hmm (most calls come through the
+ *   mmu_notifier callbacks) hmm allocates an hmm_event structure and waits
+ *   for all pending events that overlap with the new event.
+ *
+ *   To avoid deadlock with mmap_sem the rule is to always allocate the new
+ *   hmm event after taking the mmap_sem lock. In the case of an mmu_notifier
+ *   call we do not take the mmap_sem lock, as if it were needed it would
+ *   have been taken by the caller of the mmu_notifier API.
+ *
+ *   Hence hmm only needs to make sure to allocate the new hmm event after
+ *   taking the mmap_sem.
+ */
+#include <linux/export.h>
+#include <linux/bitmap.h>
+#include <linux/srcu.h>
+#include <linux/rculist.h>
+#include <linux/slab.h>
+#include <linux/mmu_notifier.h>
+#include <linux/mm.h>
+#include <linux/hugetlb.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/ksm.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/mmu_context.h>
+#include <linux/memcontrol.h>
+#include <linux/hmm.h>
+#include <linux/wait.h>
+#include <linux/interval_tree_generic.h>
+#include <linux/mman.h>
+#include <asm/tlb.h>
+#include <asm/tlbflush.h>
+#include <linux/delay.h>
+
+#include "internal.h"
+
+#define HMM_MAX_RANGE_BITS	(PAGE_SHIFT + 3UL)
+#define HMM_MAX_RANGE_SIZE	(PAGE_SIZE << HMM_MAX_RANGE_BITS)
+#define MM_MAX_SWAP_PAGES (swp_offset(pte_to_swp_entry(swp_entry_to_pte(swp_entry(0, ~0UL)))) + 1UL)
+#define HMM_MAX_ADDR		(((unsigned long)PTRS_PER_PGD) << ((unsigned long)PGDIR_SHIFT))
+
+#define HMM_MAX_EVENTS		16
+
+/* global SRCU for all MMs */
+static struct srcu_struct srcu;
+
+
+
+
+/* struct hmm_event - used to serialize change to overlapping range of address.
+ *
+ * @list:       Current event list for the corresponding hmm.
+ * @faddr:      First address (inclusive) for the range this event affect.
+ * @laddr:      Last address (exclusive) for the range this event affect.
+ * @fences:     List of device fences associated with this event.
+ * @etype:      Event type (munmap, migrate, truncate, ...).
+ * @backoff:    Should this event backoff ie a new event render it obsolete.
+ */
+struct hmm_event {
+	struct list_head	list;
+	unsigned long		faddr;
+	unsigned long		laddr;
+	struct list_head	fences;
+	enum hmm_etype		etype;
+	bool			backoff;
+};
+
+/* struct hmm - per mm_struct hmm structure
+ *
+ * @mm:             The mm struct.
+ * @kref:           Reference counter
+ * @lock:           Serialize the mirror list modifications.
+ * @mirrors:        List of all mirror for this mm (one per device)
+ * @mmu_notifier:   The mmu_notifier of this mm
+ * @wait_queue:     Wait queue for synchronization btw cpu and device
+ * @events:         Events.
+ * @nevents:        Number of events currently happening.
+ * @dead:           The mm is being destroyed.
+ *
+ * For each process address space (mm_struct) there is one and only one hmm
+ * struct. hmm functions will redispatch to each device the changes made to
+ * the process address space.
+ */
+struct hmm {
+	struct mm_struct 	*mm;
+	struct kref		kref;
+	spinlock_t		lock;
+	struct list_head	mirrors;
+	struct list_head	pending;
+	struct mmu_notifier	mmu_notifier;
+	wait_queue_head_t	wait_queue;
+	struct hmm_event	events[HMM_MAX_EVENTS];
+	int			nevents;
+	bool			dead;
+};
+
+static struct mmu_notifier_ops hmm_notifier_ops;
+
+static inline struct hmm *hmm_ref(struct hmm *hmm);
+static inline struct hmm *hmm_unref(struct hmm *hmm);
+
+static int hmm_mirror_update(struct hmm_mirror *mirror,
+			     struct vm_area_struct *vma,
+			     unsigned long faddr,
+			     unsigned long laddr,
+			     struct hmm_event *event);
+static void hmm_mirror_cleanup(struct hmm_mirror *mirror);
+
+static int hmm_device_fence_wait(struct hmm_device *device,
+				 struct hmm_fence *fence);
+
+
+
+
+/* hmm_event - used to synchronize various mm events with each other.
+ *
+ * During the lifetime of a process various mm events will happen; hmm
+ * serializes events that affect overlapping ranges of addresses. The
+ * hmm_event structures are used for that purpose.
+ */
+
+static inline bool hmm_event_overlap(struct hmm_event *a, struct hmm_event *b)
+{
+	return !((a->laddr <= b->faddr) || (a->faddr >= b->laddr));
+}
+
+static inline unsigned long hmm_event_size(struct hmm_event *event)
+{
+	return (event->laddr - event->faddr);
+}
+
+
+
+
+/* hmm_fault_mm - used for reading the cpu page table on a device fault.
+ *
+ * This code deals with reading the cpu page table to find the pages that are
+ * backing a range of addresses. It is used as a helper by the device page
+ * fault code.
+ */
+
+/* struct hmm_fault_mm - used for reading cpu page table on device fault.
+ *
+ * @mm:     The mm of the process the device fault is happening in.
+ * @vma:    The vma in which the fault is happening.
+ * @faddr:  The first address for the range the device want to fault.
+ * @laddr:  The last address for the range the device want to fault.
+ * @pfns:   Array of hmm pfns (contains the result of the fault).
+ * @write:  Is this write fault.
+ */
+struct hmm_fault_mm {
+	struct mm_struct	*mm;
+	struct vm_area_struct	*vma;
+	unsigned long		faddr;
+	unsigned long		laddr;
+	unsigned long		*pfns;
+	bool			write;
+};
+
+static int hmm_fault_mm_fault_pmd(pmd_t *pmdp,
+				  unsigned long faddr,
+				  unsigned long laddr,
+				  struct mm_walk *walk)
+{
+	struct hmm_fault_mm *fault_mm = walk->private;
+	unsigned long idx, *pfns;
+	pte_t *ptep;
+
+	idx = (faddr - fault_mm->faddr) >> PAGE_SHIFT;
+	pfns = &fault_mm->pfns[idx];
+	memset(pfns, 0, ((laddr - faddr) >> PAGE_SHIFT) * sizeof(long));
+	if (pmd_none(*pmdp)) {
+		return -ENOENT;
+	}
+
+	if (pmd_trans_huge(*pmdp)) {
+		/* FIXME */
+		return -EINVAL;
+	}
+
+	if (pmd_none_or_trans_huge_or_clear_bad(pmdp)) {
+		return -EINVAL;
+	}
+
+	ptep = pte_offset_map(pmdp, faddr);
+	for (; faddr != laddr; ++ptep, ++pfns, faddr += PAGE_SIZE) {
+		pte_t pte = *ptep;
+
+		if (pte_none(pte)) {
+			if (fault_mm->write) {
+				ptep++;
+				break;
+			}
+			*pfns = my_zero_pfn(faddr) << HMM_PFN_SHIFT;
+			set_bit(HMM_PFN_VALID_ZERO, pfns);
+			continue;
+		}
+		if (!pte_present(pte) || (fault_mm->write && !pte_write(pte))) {
+			/* Need to inc ptep so unmap unlock on right pmd. */
+			ptep++;
+			break;
+		}
+
+		*pfns = pte_pfn(pte) << HMM_PFN_SHIFT;
+		set_bit(HMM_PFN_VALID_PAGE, pfns);
+		if (pte_write(pte)) {
+			set_bit(HMM_PFN_WRITE, pfns);
+		}
+		/* Consider the page as hot as a device want to use it. */
+		mark_page_accessed(pfn_to_page(pte_pfn(pte)));
+		fault_mm->laddr = faddr + PAGE_SIZE;
+	}
+	pte_unmap(ptep - 1);
+
+	return (faddr == laddr) ? 0 : -ENOENT;
+}
+
+static int hmm_fault_mm_fault(struct hmm_fault_mm *fault_mm)
+{
+	struct mm_walk walk = {0};
+	unsigned long faddr, laddr;
+	int ret;
+
+	faddr = fault_mm->faddr;
+	laddr = fault_mm->laddr;
+	fault_mm->laddr = faddr;
+
+	walk.pmd_entry = hmm_fault_mm_fault_pmd;
+	walk.mm = fault_mm->mm;
+	walk.private = fault_mm;
+
+	ret = walk_page_range(faddr, laddr, &walk);
+	return ret;
+}
+
+
+
+
+/* hmm - core hmm functions.
+ *
+ * Core hmm functions that deal with all the process mm activities and use
+ * events for synchronization. Those functions are used mostly as a result of
+ * cpu mm events.
+ */
+
+static int hmm_init(struct hmm *hmm, struct mm_struct *mm)
+{
+	int i, ret;
+
+	hmm->mm = mm;
+	kref_init(&hmm->kref);
+	INIT_LIST_HEAD(&hmm->mirrors);
+	INIT_LIST_HEAD(&hmm->pending);
+	spin_lock_init(&hmm->lock);
+	init_waitqueue_head(&hmm->wait_queue);
+
+	for (i = 0; i < HMM_MAX_EVENTS; ++i) {
+		hmm->events[i].etype = HMM_NONE;
+		INIT_LIST_HEAD(&hmm->events[i].fences);
+	}
+
+	/* register notifier */
+	hmm->mmu_notifier.ops = &hmm_notifier_ops;
+	ret = __mmu_notifier_register(&hmm->mmu_notifier, mm);
+	return ret;
+}
+
+static enum hmm_etype hmm_event_mmu(enum mmu_action action)
+{
+	switch (action) {
+	case MMU_MPROT_RONLY:
+		return HMM_MPROT_RONLY;
+	case MMU_MPROT_RANDW:
+		return HMM_MPROT_RANDW;
+	case MMU_MPROT_WONLY:
+		return HMM_MPROT_WONLY;
+	case MMU_COW:
+		return HMM_COW;
+	case MMU_MPROT_NONE:
+	case MMU_KSM:
+	case MMU_KSM_RONLY:
+	case MMU_UNMAP:
+	case MMU_VMSCAN:
+	case MMU_MUNLOCK:
+	case MMU_MIGRATE:
+	case MMU_FILE_WB:
+	case MMU_FAULT_WP:
+	case MMU_THP_SPLIT:
+	case MMU_THP_FAULT_WP:
+		return HMM_UNMAP;
+	case MMU_POISON:
+	case MMU_MREMAP:
+	case MMU_MUNMAP:
+		return HMM_MUNMAP;
+	case MMU_SOFT_DIRTY:
+	default:
+		return HMM_NONE;
+	}
+}
+
+static void hmm_event_unqueue_locked(struct hmm *hmm, struct hmm_event *event)
+{
+	list_del_init(&event->list);
+	event->etype = HMM_NONE;
+	hmm->nevents--;
+}
+
+static void hmm_event_unqueue(struct hmm *hmm, struct hmm_event *event)
+{
+	spin_lock(&hmm->lock);
+	list_del_init(&event->list);
+	event->etype = HMM_NONE;
+	hmm->nevents--;
+	spin_unlock(&hmm->lock);
+}
+
+static void hmm_destroy_kref(struct kref *kref)
+{
+	struct hmm *hmm;
+	struct mm_struct *mm;
+
+	hmm = container_of(kref, struct hmm, kref);
+	mm = hmm->mm;
+	mm->hmm = NULL;
+	mmu_notifier_unregister(&hmm->mmu_notifier, mm);
+
+	if (!list_empty(&hmm->mirrors)) {
+		BUG();
+		printk(KERN_ERR "destroying an hmm with still active mirror\n"
+		       "Leaking memory instead to avoid something worse.\n");
+		return;
+	}
+	kfree(hmm);
+}
+
+static inline struct hmm *hmm_ref(struct hmm *hmm)
+{
+	if (hmm) {
+		kref_get(&hmm->kref);
+		return hmm;
+	}
+	return NULL;
+}
+
+static inline struct hmm *hmm_unref(struct hmm *hmm)
+{
+	if (hmm) {
+		kref_put(&hmm->kref, hmm_destroy_kref);
+	}
+	return NULL;
+}
+
+static struct hmm_event *hmm_event_get(struct hmm *hmm,
+				       unsigned long faddr,
+				       unsigned long laddr,
+				       enum hmm_etype etype)
+{
+	struct hmm_event *event, *wait = NULL;
+	enum hmm_etype wait_type;
+	unsigned id;
+
+	do {
+		wait_event(hmm->wait_queue, hmm->nevents < HMM_MAX_EVENTS);
+		spin_lock(&hmm->lock);
+		for (id = 0; id < HMM_MAX_EVENTS; ++id) {
+			if (hmm->events[id].etype == HMM_NONE) {
+				event = &hmm->events[id];
+				goto out;
+			}
+		}
+		spin_unlock(&hmm->lock);
+	} while (1);
+
+out:
+	event->etype = etype;
+	event->faddr = faddr;
+	event->laddr = laddr;
+	event->backoff = false;
+	INIT_LIST_HEAD(&event->fences);
+	hmm->nevents++;
+	list_add_tail(&event->list, &hmm->pending);
+
+retry_wait:
+	wait = event;
+	list_for_each_entry_continue_reverse (wait, &hmm->pending, list) {
+		if (!hmm_event_overlap(event, wait)) {
+			continue;
+		}
+		switch (event->etype) {
+		case HMM_UNMAP:
+		case HMM_MUNMAP:
+			switch (wait->etype) {
+			case HMM_DEVICE_FAULT:
+			case HMM_MIGRATE_TO_RMEM:
+				wait->backoff = true;
+				/* fall through */
+			default:
+				wait_type = wait->etype;
+				goto wait;
+			}
+		default:
+			wait_type = wait->etype;
+			goto wait;
+		}
+	}
+	spin_unlock(&hmm->lock);
+
+	return event;
+
+wait:
+	spin_unlock(&hmm->lock);
+	wait_event(hmm->wait_queue, wait->etype != wait_type);
+	spin_lock(&hmm->lock);
+	goto retry_wait;
+}
+
+static void hmm_update_mirrors(struct hmm *hmm,
+			       struct vm_area_struct *vma,
+			       struct hmm_event *event)
+{
+	unsigned long faddr, laddr;
+
+	for (faddr = event->faddr; faddr < event->laddr; faddr = laddr) {
+		struct hmm_mirror *mirror;
+		struct hmm_fence *fence = NULL, *tmp;
+		int ticket;
+
+		laddr = event->laddr;
+
+retry_ranges:
+		ticket = srcu_read_lock(&srcu);
+		/* Because of a retry we might already have scheduled some
+		 * mirrors; skip those.
+		 */
+		mirror = list_first_entry(&hmm->mirrors,
+					  struct hmm_mirror,
+					  mlist);
+		mirror = fence ? fence->mirror : mirror;
+		list_for_each_entry_continue (mirror, &hmm->mirrors, mlist) {
+			int r;
+
+			r = hmm_mirror_update(mirror,vma,faddr,laddr,event);
+			if (r) {
+				srcu_read_unlock(&srcu, ticket);
+				hmm_mirror_cleanup(mirror);
+				goto retry_ranges;
+			}
+		}
+		srcu_read_unlock(&srcu, ticket);
+
+		list_for_each_entry_safe (fence, tmp, &event->fences, list) {
+			struct hmm_device *device;
+			int r;
+
+			mirror = fence->mirror;
+			device = mirror->device;
+
+			r = hmm_device_fence_wait(device, fence);
+			if (r) {
+				hmm_mirror_cleanup(mirror);
+			}
+		}
+	}
+}
+
+static int hmm_fault_mm(struct hmm *hmm,
+			struct vm_area_struct *vma,
+			unsigned long faddr,
+			unsigned long laddr,
+			bool write)
+{
+	int r;
+
+	if (laddr <= faddr) {
+		return -EINVAL;
+	}
+
+	for (; faddr < laddr; faddr += PAGE_SIZE) {
+		unsigned flags = 0;
+
+		flags |= write ? FAULT_FLAG_WRITE : 0;
+		flags |= FAULT_FLAG_ALLOW_RETRY;
+		do {
+			r = handle_mm_fault(hmm->mm, vma, faddr, flags);
+			if (!(r & VM_FAULT_RETRY) && (r & VM_FAULT_ERROR)) {
+				if (r & VM_FAULT_OOM) {
+					return -ENOMEM;
+				}
+				/* Same error code for all other cases. */
+				return -EFAULT;
+			}
+			flags &= ~FAULT_FLAG_ALLOW_RETRY;
+		} while (r & VM_FAULT_RETRY);
+	}
+
+	return 0;
+}
+
+
+
+
+/* hmm_notifier - mmu_notifier hmm funcs tracking change to process mm.
+ *
+ * Callbacks for mmu notifier. We use the mmu notifier to track changes made
+ * to the process address space.
+ *
+ * Note that none of these callbacks needs to take a reference, as we are sure
+ * that the mm won't be destroyed, thus hmm won't be destroyed either, and it
+ * is fine if some hmm_mirror/hmm_device are destroyed during those callbacks
+ * because this is serialized through either the hmm lock or the device lock.
+ */
+
+static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	struct hmm *hmm;
+
+	if (!(hmm = hmm_ref(mm->hmm)) || hmm->dead) {
+		/* Already clean. */
+		hmm_unref(hmm);
+		return;
+	}
+
+	hmm->dead = true;
+
+	/*
+	 * hmm->lock allows synchronization with hmm_mirror_unregister() so an
+	 * hmm_mirror can be removed only once.
+	 */
+	spin_lock(&hmm->lock);
+	while (unlikely(!list_empty(&hmm->mirrors))) {
+		struct hmm_mirror *mirror;
+		struct hmm_device *device;
+
+		mirror = list_first_entry(&hmm->mirrors,
+					  struct hmm_mirror,
+					  mlist);
+		device = mirror->device;
+		if (!mirror->dead) {
+			/* Update mirror as being dead and remove it from the
+			 * mirror list before freeing up any of its resources.
+			 */
+			mirror->dead = true;
+			list_del_init(&mirror->mlist);
+			spin_unlock(&hmm->lock);
+
+			synchronize_srcu(&srcu);
+
+			device->ops->mirror_release(mirror);
+			hmm_mirror_cleanup(mirror);
+			spin_lock(&hmm->lock);
+		}
+	}
+	spin_unlock(&hmm->lock);
+	hmm_unref(hmm);
+}
+
+static void hmm_notifier_invalidate_range_start(struct mmu_notifier *mn,
+						struct mm_struct *mm,
+						struct vm_area_struct *vma,
+						unsigned long faddr,
+						unsigned long laddr,
+						enum mmu_action action)
+{
+	struct hmm_event *event;
+	enum hmm_etype etype;
+	struct hmm *hmm;
+
+	if (!(hmm = hmm_ref(mm->hmm))) {
+		return;
+	}
+
+	etype = hmm_event_mmu(action);
+	switch (etype) {
+	case HMM_NONE:
+		hmm_unref(hmm);
+		return;
+	default:
+		break;
+	}
+
+	faddr = faddr & PAGE_MASK;
+	laddr = PAGE_ALIGN(laddr);
+
+	event = hmm_event_get(hmm, faddr, laddr, etype);
+	hmm_update_mirrors(hmm, vma, event);
+	/* Do not drop hmm reference here but in the range_end instead. */
+}
+
+static void hmm_notifier_invalidate_range_end(struct mmu_notifier *mn,
+					      struct mm_struct *mm,
+					      struct vm_area_struct *vma,
+					      unsigned long faddr,
+					      unsigned long laddr,
+					      enum mmu_action action)
+{
+	struct hmm_event *event = NULL;
+	enum hmm_etype etype;
+	struct hmm *hmm;
+	int i;
+
+	if (!(hmm = mm->hmm)) {
+		return;
+	}
+
+	etype = hmm_event_mmu(action);
+	switch (etype) {
+	case HMM_NONE:
+		return;
+	default:
+		break;
+	}
+
+	faddr = faddr & PAGE_MASK;
+	laddr = PAGE_ALIGN(laddr);
+
+	spin_lock(&hmm->lock);
+	for (i = 0; i < HMM_MAX_EVENTS; ++i, event = NULL) {
+		event = &hmm->events[i];
+		if (event->etype == etype &&
+		    event->faddr == faddr &&
+		    event->laddr == laddr &&
+		    !list_empty(&event->list)) {
+			hmm_event_unqueue_locked(hmm, event);
+			break;
+		}
+	}
+	spin_unlock(&hmm->lock);
+
+	/* Drop reference from invalidate_range_start. */
+	hmm_unref(hmm);
+}
+
+static void hmm_notifier_invalidate_page(struct mmu_notifier *mn,
+					 struct mm_struct *mm,
+					 struct vm_area_struct *vma,
+					 unsigned long faddr,
+					 enum mmu_action action)
+{
+	unsigned long laddr;
+	struct hmm_event *event;
+	enum hmm_etype etype;
+	struct hmm *hmm;
+
+	if (!(hmm = hmm_ref(mm->hmm))) {
+		return;
+	}
+
+	etype = hmm_event_mmu(action);
+	switch (etype) {
+	case HMM_NONE:
+		return;
+	default:
+		break;
+	}
+
+	faddr = faddr & PAGE_MASK;
+	laddr = faddr + PAGE_SIZE;
+
+	event = hmm_event_get(hmm, faddr, laddr, etype);
+	hmm_update_mirrors(hmm, vma, event);
+	hmm_event_unqueue(hmm, event);
+	hmm_unref(hmm);
+}
+
+static struct mmu_notifier_ops hmm_notifier_ops = {
+	.release		= hmm_notifier_release,
+	/* .clear_flush_young FIXME we probably want to do something. */
+	/* .test_young FIXME we probably want to do something. */
+	/* WARNING: .change_pte must always be bracketed by range_start/end.
+	 * There were patches to remove that behavior; we must make sure those
+	 * patches are not included, as an alternative solution to the issue
+	 * they are trying to solve can be used.
+	 *
+	 * hmm itself cannot use the change_pte callback as non-sleeping locks
+	 * are held during the change_pte callback.
+	 */
+	.change_pte		= NULL,
+	.invalidate_page	= hmm_notifier_invalidate_page,
+	.invalidate_range_start	= hmm_notifier_invalidate_range_start,
+	.invalidate_range_end	= hmm_notifier_invalidate_range_end,
+};
+
+
+
+
+/* hmm_mirror - per device mirroring functions.
+ *
+ * Each device that mirrors a process has a unique hmm_mirror struct. A process
+ * can be mirrored by several devices at the same time.
+ *
+ * Below are all the functions and their helpers used by device drivers to
+ * mirror the process address space. Those functions either deal with updating
+ * the device page table (through the hmm callbacks) or provide helper
+ * functions used by the device driver to fault in a range of memory in the
+ * device page table.
+ */
+
+static int hmm_mirror_update(struct hmm_mirror *mirror,
+			     struct vm_area_struct *vma,
+			     unsigned long faddr,
+			     unsigned long laddr,
+			     struct hmm_event *event)
+{
+	struct hmm_device *device = mirror->device;
+	struct hmm_fence *fence;
+	bool dirty = !!(vma->vm_file);
+
+	fence = device->ops->lmem_update(mirror, faddr, laddr,
+					 event->etype, dirty);
+	if (fence) {
+		if (IS_ERR(fence)) {
+			return PTR_ERR(fence);
+		}
+		fence->mirror = mirror;
+		list_add_tail(&fence->list, &event->fences);
+	}
+	return 0;
+}
+
+static void hmm_mirror_cleanup(struct hmm_mirror *mirror)
+{
+	struct vm_area_struct *vma;
+	struct hmm_device *device = mirror->device;
+	struct hmm_event *event;
+	unsigned long faddr, laddr;
+	struct hmm *hmm = mirror->hmm;
+
+	spin_lock(&hmm->lock);
+	if (mirror->dead) {
+		spin_unlock(&hmm->lock);
+		return;
+	}
+	mirror->dead = true;
+	list_del(&mirror->mlist);
+	spin_unlock(&hmm->lock);
+	synchronize_srcu(&srcu);
+	INIT_LIST_HEAD(&mirror->mlist);
+
+
+	event = hmm_event_get(hmm, 0UL, HMM_MAX_ADDR, HMM_UNREGISTER);
+	faddr = 0UL;
+	vma = find_vma(hmm->mm, faddr);
+	for (; vma && (faddr < HMM_MAX_ADDR); faddr = laddr) {
+		struct hmm_fence *fence, *next;
+
+		faddr = max(faddr, vma->vm_start);
+		laddr = vma->vm_end;
+
+		hmm_mirror_update(mirror, vma, faddr, laddr, event);
+		list_for_each_entry_safe (fence, next, &event->fences, list) {
+			hmm_device_fence_wait(device, fence);
+		}
+
+		if (laddr >= vma->vm_end) {
+			vma = vma->vm_next;
+		}
+	}
+	hmm_event_unqueue(hmm, event);
+
+	mutex_lock(&device->mutex);
+	list_del_init(&mirror->dlist);
+	mutex_unlock(&device->mutex);
+
+	mirror->hmm = hmm_unref(hmm);
+	hmm_mirror_unref(mirror);
+}
+
+static void hmm_mirror_destroy(struct kref *kref)
+{
+	struct hmm_mirror *mirror;
+	struct hmm_device *device;
+
+	mirror = container_of(kref, struct hmm_mirror, kref);
+	device = mirror->device;
+
+	BUG_ON(!list_empty(&mirror->mlist));
+	BUG_ON(!list_empty(&mirror->dlist));
+
+	device->ops->mirror_destroy(mirror);
+	hmm_device_unref(device);
+}
+
+struct hmm_mirror *hmm_mirror_ref(struct hmm_mirror *mirror)
+{
+	if (mirror) {
+		kref_get(&mirror->kref);
+		return mirror;
+	}
+	return NULL;
+}
+EXPORT_SYMBOL(hmm_mirror_ref);
+
+struct hmm_mirror *hmm_mirror_unref(struct hmm_mirror *mirror)
+{
+	if (mirror) {
+		kref_put(&mirror->kref, hmm_mirror_destroy);
+	}
+	return NULL;
+}
+EXPORT_SYMBOL(hmm_mirror_unref);
+
+int hmm_mirror_register(struct hmm_mirror *mirror,
+			struct hmm_device *device,
+			struct mm_struct *mm)
+{
+	struct hmm *hmm = NULL;
+	int ret = 0;
+
+	/* Sanity checks. */
+	BUG_ON(!mirror);
+	BUG_ON(!device);
+	BUG_ON(!mm);
+
+	/* Take reference on device only on success. */
+	kref_init(&mirror->kref);
+	mirror->device = device;
+	mirror->dead = false;
+	INIT_LIST_HEAD(&mirror->mlist);
+	INIT_LIST_HEAD(&mirror->dlist);
+
+	down_write(&mm->mmap_sem);
+	if (mm->hmm == NULL) {
+		/* no hmm registered yet so register one */
+		hmm = kzalloc(sizeof(*mm->hmm), GFP_KERNEL);
+		if (hmm == NULL) {
+			ret = -ENOMEM;
+			goto out_cleanup;
+		}
+
+		ret = hmm_init(hmm, mm);
+		if (ret) {
+			kfree(hmm);
+			hmm = NULL;
+			goto out_cleanup;
+		}
+
+		/* now set hmm, make sure no mmu notifier callback can be called */
+		ret = mm_take_all_locks(mm);
+		if (unlikely(ret)) {
+			goto out_cleanup;
+		}
+		mm->hmm = hmm;
+		mirror->hmm = hmm;
+		hmm = NULL;
+	} else {
+		struct hmm_mirror *tmp;
+		int id;
+
+		id = srcu_read_lock(&srcu);
+		list_for_each_entry(tmp, &mm->hmm->mirrors, mlist) {
+			if (tmp->device == mirror->device) {
+				/* A process can be mirrored only once by same
+				 * device.
+				 */
+				srcu_read_unlock(&srcu, id);
+				ret = -EINVAL;
+				goto out_cleanup;
+			}
+		}
+		srcu_read_unlock(&srcu, id);
+
+		ret = mm_take_all_locks(mm);
+		if (unlikely(ret)) {
+			goto out_cleanup;
+		}
+		mirror->hmm = hmm_ref(mm->hmm);
+	}
+
+	/*
+	 * A side note: hmm_notifier_release() can't run concurrently with
+	 * us because we hold the mm_users pin (either implicitly as
+	 * current->mm or explicitly with get_task_mm() or similar).
+	 *
+	 * We can't race against any other mmu notifier method either
+	 * thanks to mm_take_all_locks().
+	 */
+	spin_lock(&mm->hmm->lock);
+	list_add_rcu(&mirror->mlist, &mm->hmm->mirrors);
+	spin_unlock(&mm->hmm->lock);
+	mm_drop_all_locks(mm);
+
+out_cleanup:
+	if (hmm) {
+		mmu_notifier_unregister(&hmm->mmu_notifier, mm);
+		kfree(hmm);
+	}
+	up_write(&mm->mmap_sem);
+
+	if (!ret) {
+		struct hmm_device *device = mirror->device;
+
+		hmm_device_ref(device);
+		mutex_lock(&device->mutex);
+		list_add(&mirror->dlist, &device->mirrors);
+		mutex_unlock(&device->mutex);
+	}
+	return ret;
+}
+EXPORT_SYMBOL(hmm_mirror_register);
+
+void hmm_mirror_unregister(struct hmm_mirror *mirror)
+{
+	struct hmm *hmm;
+
+	if (!mirror) {
+		return;
+	}
+	hmm = hmm_ref(mirror->hmm);
+	if (!hmm) {
+		return;
+	}
+
+	down_read(&hmm->mm->mmap_sem);
+	hmm_mirror_cleanup(mirror);
+	up_read(&hmm->mm->mmap_sem);
+	hmm_unref(hmm);
+}
+EXPORT_SYMBOL(hmm_mirror_unregister);
+
+static int hmm_mirror_lmem_fault(struct hmm_mirror *mirror,
+				 struct hmm_fault *fault,
+				 unsigned long faddr,
+				 unsigned long laddr,
+				 unsigned long *pfns)
+{
+	struct hmm_device *device = mirror->device;
+	int ret;
+
+	ret = device->ops->lmem_fault(mirror, faddr, laddr, pfns, fault);
+	return ret;
+}
+
+/* see include/linux/hmm.h */
+int hmm_mirror_fault(struct hmm_mirror *mirror,
+		     struct hmm_fault *fault)
+{
+	struct vm_area_struct *vma;
+	struct hmm_event *event;
+	unsigned long caddr, naddr, vm_flags;
+	struct hmm *hmm;
+	bool do_fault = false, write;
+	int ret = 0;
+
+	if (!mirror || !fault || fault->faddr >= fault->laddr) {
+		return -EINVAL;
+	}
+	if (mirror->dead) {
+		return -ENODEV;
+	}
+	hmm = mirror->hmm;
+
+	write = !!(fault->flags & HMM_FAULT_WRITE);
+	fault->faddr = fault->faddr & PAGE_MASK;
+	fault->laddr = PAGE_ALIGN(fault->laddr);
+	caddr = fault->faddr;
+	naddr = fault->laddr;
+	/* FIXME arbitrary value clamp fault to 4M at a time. */
+	if ((fault->laddr - fault->faddr) > (4UL << 20UL)) {
+		fault->laddr = fault->faddr + (4UL << 20UL);
+	}
+	hmm_mirror_ref(mirror);
+
+retry:
+	down_read(&hmm->mm->mmap_sem);
+	event = hmm_event_get(hmm, caddr, naddr, HMM_DEVICE_FAULT);
+	/* FIXME handle gate area ? and guard page */
+	vma = find_extend_vma(hmm->mm, caddr);
+	if (!vma) {
+		if (caddr > fault->faddr) {
+			/* Fault succeed up to addr. */
+			fault->laddr = caddr;
+			ret = 0;
+			goto out;
+		}
+		/* Allow device driver to learn about first valid address in
+		 * the range it was trying to fault in so it can restart the
+		 * fault at this address.
+		 */
+		vma = find_vma_intersection(hmm->mm,event->faddr,event->laddr);
+		if (vma) {
+			fault->laddr = vma->vm_start;
+		}
+		ret = -EFAULT;
+		goto out;
+	}
+	/* FIXME support HUGETLB */
+	if ((vma->vm_flags & (VM_IO | VM_PFNMAP | VM_MIXEDMAP | VM_HUGETLB))) {
+		ret = -EFAULT;
+		goto out;
+	}
+	vm_flags = write ? VM_WRITE : VM_READ;
+	if (!(vma->vm_flags & vm_flags)) {
+		ret = -EACCES;
+		goto out;
+	}
+	/* Adjust range to this vma only. */
+	fault->laddr = naddr = event->laddr = min(event->laddr, vma->vm_end);
+	fault->vma = vma;
+
+	for (; caddr < event->laddr;) {
+		struct hmm_fault_mm fault_mm;
+
+		fault_mm.mm = vma->vm_mm;
+		fault_mm.vma = vma;
+		fault_mm.faddr = caddr;
+		fault_mm.laddr = naddr;
+		fault_mm.pfns = fault->pfns;
+		fault_mm.write = write;
+		ret = hmm_fault_mm_fault(&fault_mm);
+		if (ret == -ENOENT && fault_mm.laddr == caddr) {
+			do_fault = true;
+			goto out;
+		}
+		if (ret && ret != -ENOENT) {
+			goto out;
+		}
+		if (mirror->dead) {
+			ret = -ENODEV;
+			goto out;
+		}
+		if (event->backoff) {
+			ret = -EAGAIN;
+			goto out;
+		}
+
+		ret = hmm_mirror_lmem_fault(mirror, fault,
+					    fault_mm.faddr,
+					    fault_mm.laddr,
+					    fault_mm.pfns);
+		if (ret) {
+			goto out;
+		}
+		caddr = fault_mm.laddr;
+		naddr = event->laddr;
+	}
+
+out:
+	hmm_event_unqueue(hmm, event);
+	if (do_fault && !event->backoff && !mirror->dead) {
+		do_fault = false;
+		ret = hmm_fault_mm(hmm, vma, caddr, naddr, write);
+		if (!ret) {
+			ret = -ENOENT;
+		}
+	}
+	wake_up(&hmm->wait_queue);
+	up_read(&hmm->mm->mmap_sem);
+	if (ret == -ENOENT) {
+		if (!mirror->dead) {
+			naddr = fault->laddr;
+			goto retry;
+		}
+		ret = -ENODEV;
+	}
+	hmm_mirror_unref(mirror);
+	return ret;
+}
+EXPORT_SYMBOL(hmm_mirror_fault);
+
+
+
+
+/* hmm_device - Each device driver must register one and only one hmm_device
+ *
+ * The hmm_device is the link between hmm and each device driver.
+ */
+
+static void hmm_device_destroy(struct kref *kref)
+{
+	struct hmm_device *device;
+
+	device = container_of(kref, struct hmm_device, kref);
+	BUG_ON(!list_empty(&device->mirrors));
+
+	device->ops->device_destroy(device);
+}
+
+struct hmm_device *hmm_device_ref(struct hmm_device *device)
+{
+	if (device) {
+		kref_get(&device->kref);
+		return device;
+	}
+	return NULL;
+}
+EXPORT_SYMBOL(hmm_device_ref);
+
+struct hmm_device *hmm_device_unref(struct hmm_device *device)
+{
+	if (device) {
+		kref_put(&device->kref, hmm_device_destroy);
+	}
+	return NULL;
+}
+EXPORT_SYMBOL(hmm_device_unref);
+
+/* see include/linux/hmm.h */
+int hmm_device_register(struct hmm_device *device, const char *name)
+{
+	/* sanity check */
+	BUG_ON(!device);
+	BUG_ON(!device->ops);
+	BUG_ON(!device->ops->device_destroy);
+	BUG_ON(!device->ops->mirror_release);
+	BUG_ON(!device->ops->mirror_destroy);
+	BUG_ON(!device->ops->fence_wait);
+	BUG_ON(!device->ops->lmem_update);
+	BUG_ON(!device->ops->lmem_fault);
+
+	kref_init(&device->kref);
+	device->name = name;
+	mutex_init(&device->mutex);
+	INIT_LIST_HEAD(&device->mirrors);
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_device_register);
+
+static int hmm_device_fence_wait(struct hmm_device *device,
+				 struct hmm_fence *fence)
+{
+	int ret;
+
+	if (fence == NULL) {
+		return 0;
+	}
+
+	list_del_init(&fence->list);
+	do {
+		io_schedule();
+		ret = device->ops->fence_wait(fence);
+	} while (ret == -EAGAIN);
+
+	return ret;
+}
+
+
+
+
+/* This is called after the last hmm_notifier_release() returned */
+void __hmm_destroy(struct mm_struct *mm)
+{
+	kref_put(&mm->hmm->kref, hmm_destroy_kref);
+}
+
+static int __init hmm_module_init(void)
+{
+	int ret;
+
+	ret = init_srcu_struct(&srcu);
+	if (ret) {
+		return ret;
+	}
+	return 0;
+}
+module_init(hmm_module_init);
+
+static void __exit hmm_module_exit(void)
+{
+	cleanup_srcu_struct(&srcu);
+}
+module_exit(hmm_module_exit);
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 06/11] hmm: heterogeneous memory management
@ 2014-05-02 13:52   ` j.glisse
  0 siblings, 0 replies; 107+ messages in thread
From: j.glisse @ 2014-05-02 13:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel
  Cc: Jérôme Glisse, Sherry Cheung, Subhash Gutti,
	Mark Hairgrove, John Hubbard, Jatin Kumar

From: Jérôme Glisse <jglisse@redhat.com>

Motivation:

Heterogeneous memory management is intended to allow a device to transparently
access a process address space without having to lock pages of the process or
take references on them. In other words, it mirrors a process address space
while allowing regular memory management events, such as page reclamation or
page migration, to happen seamlessly.

Recent years have seen a surge in the number of specialized devices that are
part of a computer platform (from desktop to phone). So far each of those
devices has operated on its own private address space that is not linked or
exposed to the address space of the process using it. This separation often
leads to multiple memory copies between the device-owned memory and the process
memory, which is a waste of both cpu cycles and memory.

Over the last few years most of those devices have gained a full mmu allowing
them to support multiple page tables, page faults and other features that are
found inside a cpu mmu. There is now a strong incentive to start leveraging the
capabilities of such devices and to start sharing the process address space, to
avoid any unnecessary memory copy as well as to simplify the programming model
of those devices by sharing a unique and common address space with the process
that uses them.

The aim of heterogeneous memory management is to provide a common API that can
be used by any such device in order to mirror a process address space. The hmm
code provides a unique entry point and interfaces itself with the core mm code
of the linux kernel, avoiding duplicate implementations and shielding device
driver code from core mm code.

Moreover, hmm also intends to provide support for migrating memory to device
private memory, allowing the device to work on its own fast local memory. The
hmm code would be responsible for intercepting cpu page faults on migrated
ranges and for migrating them back to system memory, allowing the cpu to resume
its access to the memory.

Another feature hmm intends to provide is support for atomic operations from
the device even if the bus linking the device and the cpu does not have any
such capability.

We expect graphics processing units and network interfaces to be among the
first prominent users of such an api.

Hardware requirement:

Because hmm is intended to be used by device drivers there are minimum feature
requirements for the hardware mmu :
  - hardware has its own page table per process (can be shared between
    different devices).
  - hardware mmu supports page faults and suspends execution until the page
    fault is serviced by hmm code. The page fault must also trigger some form
    of interrupt so that hmm code can be called by the device driver.
  - hardware must support at least read only mappings (otherwise it can not
    access read only ranges of the process address space).

For better memory management it is highly recommended that the device also
support the following features :
  - hardware mmu sets the access bit in its page table on memory access (like
    the cpu).
  - hardware page table can be updated from the cpu or through a fast path.
  - hardware provides advanced statistics on which ranges of memory it accesses
    the most.
  - hardware differentiates atomic memory accesses from regular accesses,
    allowing atomic operations to be supported even on platforms whose bus link
    with the device does not have atomic support.

Implementation:

The hmm layer provides a simple API to the device driver. Each device driver
has to register an hmm device that holds pointers to all the callbacks the hmm
code will invoke to synchronize the device page table with the cpu page table
of a given process.

For each process it wants to mirror, the device driver must register a mirror
hmm structure that holds all the information specific to the process being
mirrored. Each hmm mirror uniquely links an hmm device with a process address
space (the mm struct).
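
As an illustration of that flow, below is a minimal, hypothetical sketch of the
driver side. The foo_* structures and functions are invented for this example
and elide all locking and error handling; only hmm_device_register(),
hmm_mirror_register() and the hmm_device_ops callbacks come from this patch:

#include <linux/hmm.h>
#include <linux/sched.h>
#include <linux/slab.h>

/* Hypothetical driver structures embedding the hmm objects. */
struct foo_device {
	struct hmm_device	hdev;
	/* ... hardware state ... */
};

struct foo_mirror {
	struct hmm_mirror	mirror;
	/* ... per process device page table, fault queue, ... */
};

static void foo_device_destroy(struct hmm_device *hdev)
{
	/* Last reference dropped, free the driver structure. */
	kfree(container_of(hdev, struct foo_device, hdev));
}

static void foo_mirror_release(struct hmm_mirror *mirror)
{
	/* Stop all device threads still using this address space. */
}

static void foo_mirror_destroy(struct hmm_mirror *mirror)
{
	kfree(container_of(mirror, struct foo_mirror, mirror));
}

static int foo_fence_wait(struct hmm_fence *fence)
{
	/* This sketch never returns fences, so there is nothing to wait on. */
	return 0;
}

static struct hmm_fence *foo_lmem_update(struct hmm_mirror *mirror,
					 unsigned long faddr,
					 unsigned long laddr,
					 enum hmm_etype etype,
					 bool dirty)
{
	/* Invalidate or update the device page table for [faddr, laddr)
	 * according to etype, synchronously in this sketch (hence NULL).
	 */
	return NULL;
}

static int foo_lmem_fault(struct hmm_mirror *mirror,
			  unsigned long faddr,
			  unsigned long laddr,
			  unsigned long *pfns,
			  struct hmm_fault *fault)
{
	/* Program the device page table with the pfns provided by hmm. */
	return 0;
}

static const struct hmm_device_ops foo_hmm_ops = {
	.device_destroy	= foo_device_destroy,
	.mirror_release	= foo_mirror_release,
	.mirror_destroy	= foo_mirror_destroy,
	.fence_wait	= foo_fence_wait,
	.lmem_update	= foo_lmem_update,
	.lmem_fault	= foo_lmem_fault,
};

/* Called once at driver load time. */
static int foo_hmm_init(struct foo_device *fdev)
{
	fdev->hdev.ops = &foo_hmm_ops;
	return hmm_device_register(&fdev->hdev, "foo");
}

/* Called from process context when a process asks for mirroring. */
static int foo_bind_current(struct foo_device *fdev)
{
	struct foo_mirror *fm = kzalloc(sizeof(*fm), GFP_KERNEL);
	int ret;

	if (!fm)
		return -ENOMEM;
	ret = hmm_mirror_register(&fm->mirror, &fdev->hdev, current->mm);
	if (ret)
		kfree(fm);
	return ret;
}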

This design allows several different device drivers to concurrently mirror the
same process. The hmm layer will dispatch to each device driver the
modifications that are happening to the process address space.

The hmm layer relies on the mmu notifier api to monitor changes to the process
address space. Because updates to the device page table can have unbounded
completion time, the hmm layer needs the capability to sleep during mmu
notifier callbacks.

This patch only implements the core of the hmm layer and does not support
features such as migration to device memory.
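
In the other direction, a device page fault would be serviced through
hmm_mirror_fault(). Again a purely hypothetical sketch, reusing the foo_mirror
structure from the sketch above; the 16 page window is an arbitrary choice for
the example, not something mandated by the API:

/* Hypothetical bottom half servicing one device page fault at addr. */
static int foo_service_fault(struct foo_mirror *fm, unsigned long addr,
			     bool write)
{
	unsigned long pfns[16];
	struct hmm_fault fault = {
		.faddr	= addr & PAGE_MASK,
		.laddr	= (addr & PAGE_MASK) + 16 * PAGE_SIZE,
		.pfns	= pfns,
		.flags	= write ? HMM_FAULT_WRITE : 0,
	};
	int ret;

	/* hmm faults the range in the cpu page table if needed and hands the
	 * resulting pfns to the driver's lmem_fault callback before returning.
	 * On return [fault.faddr, fault.laddr) reflects what was actually
	 * faulted (the range may have been clamped to the vma).
	 */
	ret = hmm_mirror_fault(&fm->mirror, &fault);
	if (ret < 0)
		return ret;	/* driver should kill the faulting device thread */

	/* The device page table now covers the faulted range, resume the
	 * hardware thread that triggered the fault.
	 */
	return 0;
}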

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 include/linux/hmm.h      |  470 ++++++++++++++++++
 include/linux/mm_types.h |   14 +
 kernel/fork.c            |    6 +
 mm/Kconfig               |   12 +
 mm/Makefile              |    1 +
 mm/hmm.c                 | 1194 ++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 1697 insertions(+)
 create mode 100644 include/linux/hmm.h
 create mode 100644 mm/hmm.c

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
new file mode 100644
index 0000000..e9c7722
--- /dev/null
+++ b/include/linux/hmm.h
@@ -0,0 +1,470 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/* This is heterogeneous memory management (hmm). In a nutshell it provides an
+ * API to mirror a process address space on a device which has its own mmu and
+ * its own page table for the process. It supports everything except special
+ * and mixed vma.
+ *
+ * To use this the hardware must have :
+ *   - mmu with pagetable.
+ *   - pagetable must support read only (supporting dirtiness accounting is
+ *     preferable but is not mandatory).
+ *   - support pagefault ie hardware threads should stop on fault and resume
+ *     once hmm has provided valid memory to use.
+ *   - some way to report faults.
+ *
+ * The hmm code handles all the interfacing with the core kernel mm code and
+ * provides a simple API. It does support migrating system memory to device
+ * memory and handles migration back to system memory on cpu page fault.
+ *
+ * Migrated memory is considered as swapped out from the cpu and core mm code
+ * point of view.
+ */
+#ifndef _HMM_H
+#define _HMM_H
+
+#ifdef CONFIG_HMM
+
+#include <linux/list.h>
+#include <linux/rwsem.h>
+#include <linux/spinlock.h>
+#include <linux/atomic.h>
+#include <linux/mm_types.h>
+#include <linux/mmu_notifier.h>
+#include <linux/swap.h>
+#include <linux/kref.h>
+#include <linux/swapops.h>
+#include <linux/mman.h>
+
+
+struct hmm_device;
+struct hmm_device_ops;
+struct hmm_migrate;
+struct hmm_mirror;
+struct hmm_fault;
+struct hmm_event;
+struct hmm;
+
+/* The hmm provides page information to the device using hmm pfn values. Below
+ * are the various flags that define the current state the pfn is in (valid,
+ * type of page, dirty page, page is locked or not, ...).
+ *
+ *   HMM_PFN_VALID_PAGE this means the pfn corresponds to a valid page.
+ *   HMM_PFN_VALID_ZERO this means the pfn is the special zero page.
+ *   HMM_PFN_DIRTY set when the page is dirty.
+ *   HMM_PFN_WRITE is set if there is no need to call page_mkwrite.
+ */
+#define HMM_PFN_SHIFT		(PAGE_SHIFT)
+#define HMM_PFN_VALID_PAGE	(0UL)
+#define HMM_PFN_VALID_ZERO	(1UL)
+#define HMM_PFN_DIRTY		(2UL)
+#define HMM_PFN_WRITE		(3UL)
+
+static inline struct page *hmm_pfn_to_page(unsigned long pfn)
+{
+	/* Ok to test one bit after the other as it can not flip from one to
+	 * the other. Both bits are constant for the lifetime of an rmem
+	 * object.
+	 */
+	if (!test_bit(HMM_PFN_VALID_PAGE, &pfn) &&
+	    !test_bit(HMM_PFN_VALID_ZERO, &pfn)) {
+		return NULL;
+	}
+	return pfn_to_page(pfn >> HMM_PFN_SHIFT);
+}
+
+static inline void hmm_pfn_set_dirty(unsigned long *pfn)
+{
+	set_bit(HMM_PFN_DIRTY, pfn);
+}
+
+
+/* hmm_fence - device driver fence to wait for device driver operations.
+ *
+ * In order to concurrently update the mmu of several different devices, hmm
+ * relies on device driver fences to wait for the operations hmm has scheduled
+ * to complete on the device. It is strongly recommended to implement fences
+ * and have the hmm callbacks do as little as possible (just scheduling the
+ * update). Moreover the hmm code will reschedule the current process for i/o
+ * if necessary once it has scheduled all updates on all devices.
+ *
+ * Each fence is created as a result of either an update to a range of memory
+ * or a dma of remote memory to/from local memory.
+ *
+ * An update to a range of memory corresponds to a specific event type. For
+ * instance a range of memory is unmapped for page reclamation, or a range of
+ * memory is unmapped from the process address space as a result of the munmap
+ * syscall (HMM_RANGE_FINI), or there is a memory protection change on the
+ * range. There is one hmm_etype for each of those events, allowing the device
+ * driver to take the appropriate action, like for instance freeing the device
+ * page table on HMM_RANGE_FINI but keeping it if it is HMM_RANGE_UNMAP (which
+ * means that the range is unmapped but the range is still valid).
+ */
+enum hmm_etype {
+	HMM_NONE = 0,
+	HMM_UNREGISTER,
+	HMM_DEVICE_FAULT,
+	HMM_MPROT_RONLY,
+	HMM_MPROT_RANDW,
+	HMM_MPROT_WONLY,
+	HMM_COW,
+	HMM_MUNMAP,
+	HMM_UNMAP,
+	HMM_MIGRATE_TO_LMEM,
+	HMM_MIGRATE_TO_RMEM,
+};
+
+struct hmm_fence {
+	struct list_head	list;
+	struct hmm_mirror	*mirror;
+};
+
+
+
+
+/* hmm_device - Each device driver must register one and only one hmm_device.
+ *
+ * The hmm_device is the link between hmm and each device driver.
+ */
+
+/* struct hmm_device_operations - hmm device operation callback
+ */
+struct hmm_device_ops {
+	/* device_destroy - free hmm_device (called when refcount drops to 0).
+	 *
+	 * @device: The device hmm specific structure.
+	 */
+	void (*device_destroy)(struct hmm_device *device);
+
+	/* mirror_release() - device must stop using the address space.
+	 *
+	 * @mirror: The mirror that links the process address space with the device.
+	 *
+	 * Called as a result of hmm_mirror_unregister or when the mm is being
+	 * destroyed.
+	 *
+	 * It's illegal for the device to call any hmm helper function after
+	 * this callback. The device driver must kill any pending device
+	 * threads and wait for completion of all of them.
+	 *
+	 * Note that even after this callback returns the device driver might
+	 * still get callbacks from hmm. Callbacks will stop only once
+	 * mirror_destroy is called.
+	 */
+	void (*mirror_release)(struct hmm_mirror *hmm_mirror);
+
+	/* mirror_destroy - free hmm_mirror (called when refcount drops to 0).
+	 *
+	 * @mirror: The mirror that links the process address space with the device.
+	 */
+	void (*mirror_destroy)(struct hmm_mirror *mirror);
+
+	/* fence_wait() - to wait on device driver fence.
+	 *
+	 * @fence:      The device driver fence struct.
+	 * Returns:     0 on success, -EIO on error, -EAGAIN to wait again.
+	 *
+	 * Called when hmm wants to wait for all operations associated with a
+	 * fence to complete (including a device cache flush if the event
+	 * mandates it).
+	 *
+	 * The device driver must free the fence and associated resources if it
+	 * returns something else than -EAGAIN. On -EAGAIN the fence must not be
+	 * freed as hmm will call back again.
+	 *
+	 * Return an error if the scheduled operation failed or if it needs to
+	 * wait again.
+	 * -EIO    Some input/output error with the device.
+	 * -EAGAIN The fence is not yet signaled, hmm reschedules the waiting
+	 *         thread.
+	 *
+	 * All other return values trigger a warning and are transformed to -EIO.
+	 */
+	int (*fence_wait)(struct hmm_fence *fence);
+
+	/* lmem_update() - update device mmu for a range of local memory.
+	 *
+	 * @mirror: The mirror that links the process address space with the device.
+	 * @faddr:  First address in range (inclusive).
+	 * @laddr:  Last address in range (exclusive).
+	 * @etype:  The type of memory event (unmap, fini, read only, ...).
+	 * @dirty:  Device driver should call set_page_dirty_lock.
+	 * Returns: Valid fence ptr or NULL on success otherwise ERR_PTR.
+	 *
+	 * Called to update device mmu permissions/usage for a range of local
+	 * memory. The event type provides the nature of the update :
+	 *   - range is no longer valid (munmap).
+	 *   - range protection changes (mprotect, COW, ...).
+	 *   - range is unmapped (swap, reclaim, page migration, ...).
+	 *   - ...
+	 *
+	 * Any event that blocks further writes to the memory must also trigger
+	 * a device cache flush and everything has to be flushed to local memory
+	 * by the time the wait callback returns (if this callback returned a
+	 * fence, otherwise everything must be flushed by the time the callback
+	 * returns).
+	 *
+	 * The device must properly call set_page_dirty on any page the device
+	 * did write to since the last call to lmem_update. This is only needed
+	 * if the dirty parameter is true.
+	 *
+	 * The driver should return a fence pointer or NULL on success. It is
+	 * advised to return a fence and delay waiting for the operation to
+	 * complete to the wait callback. Returning a fence allows hmm to batch
+	 * updates to several devices and to wait on them only once they all
+	 * have scheduled the update.
+	 *
+	 * The device driver must not fail lightly, any failure results in the
+	 * device process being killed.
+	 *
+	 * IMPORTANT IF DEVICE DRIVER GET HMM_MPROT_RANDW or HMM_MPROT_WONLY IT
+	 * MUST NOT MAP SPECIAL ZERO PFN WITH WRITE PERMISSION. SPECIAL ZERO
+	 * PFN IS SET THROUGH lmem_fault WITH THE HMM_PFN_VALID_ZERO BIT FLAG
+	 * SET.
+	 *
+	 * Return a fence or NULL on success, an error value otherwise :
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * All other return values trigger a warning and are transformed to -EIO.
+	 */
+	struct hmm_fence *(*lmem_update)(struct hmm_mirror *mirror,
+					 unsigned long faddr,
+					 unsigned long laddr,
+					 enum hmm_etype etype,
+					 bool dirty);
+
+	/* lmem_fault() - fault range of lmem on the device mmu.
+	 *
+	 * @mirror: The mirror that links the process address space with the device.
+	 * @faddr:  First address in range (inclusive).
+	 * @laddr:  Last address in range (exclusive).
+	 * @pfns:   Array of pfn for the range (each of the pfn is valid).
+	 * @fault:  The fault structure provided by device driver.
+	 * Returns: 0 on success, error value otherwise.
+	 *
+	 * Called to give the device driver each of the pfns backing a range of
+	 * memory. It is only called as a result of a call to hmm_mirror_fault.
+	 *
+	 * Note that the pfns array content is only valid for the duration of
+	 * the callback. Once the device driver callback returns, further memory
+	 * activities might invalidate the values in the pfns array. The device
+	 * driver will be informed of such changes through the update callback.
+	 *
+	 * The device driver must not fail lightly, any failure results in the
+	 * device process being killed.
+	 *
+	 * Return an error if the scheduled operation failed. Valid values :
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * All other return values trigger a warning and are transformed to -EIO.
+	 */
+	int (*lmem_fault)(struct hmm_mirror *mirror,
+			  unsigned long faddr,
+			  unsigned long laddr,
+			  unsigned long *pfns,
+			  struct hmm_fault *fault);
+};
+
+/* struct hmm_device - per device hmm structure
+ *
+ * @kref:       Reference count.
+ * @mirrors:    List of all active mirrors for the device.
+ * @mutex:      Mutex protecting mirrors list.
+ * @ops:        The hmm operation callbacks.
+ * @name:       Device name (uniquely identifies the device on the system).
+ *
+ * Each device that wants to mirror an address space must register one of these
+ * structs (only once).
+ */
+struct hmm_device {
+	struct kref			kref;
+	struct list_head		mirrors;
+	struct mutex			mutex;
+	const struct hmm_device_ops	*ops;
+	const char			*name;
+};
+
+/* hmm_device_register() - register a device with hmm.
+ *
+ * @device: The hmm_device struct.
+ * @name:   A unique name string for the device (used in error messages).
+ * Returns: 0 on success, -EINVAL otherwise.
+ *
+ * Called when the device driver wants to register itself with hmm. A device
+ * driver can only register once. It will return with a reference on the
+ * device, thus to release a device the driver must unreference the device.
+ */
+int hmm_device_register(struct hmm_device *device, const char *name);
+
+struct hmm_device *hmm_device_ref(struct hmm_device *device);
+struct hmm_device *hmm_device_unref(struct hmm_device *device);
+
+
+
+
+/* hmm_mirror - device specific mirroring functions.
+ *
+ * Each device that mirrors a process has a unique hmm_mirror struct associating
+ * the process address space with the device. A process can be mirrored by
+ * several different devices at the same time.
+ */
+
+/* struct hmm_mirror - per device and per mm hmm structure
+ *
+ * @kref:       Reference count.
+ * @dlist:      List of all hmm_mirror for same device.
+ * @mlist:      List of all hmm_mirror for same mm.
+ * @device:     The hmm_device struct this hmm_mirror is associated to.
+ * @hmm:        The hmm struct this hmm_mirror is associated to.
+ * @dead:       The hmm_mirror is dead and should no longer be used.
+ *
+ * Each device that wants to mirror an address space must register one of these
+ * structs for each of the address spaces it wants to mirror. The same device
+ * can mirror several different address spaces and the same address space can
+ * be mirrored by different devices.
+ */
+struct hmm_mirror {
+	struct kref		kref;
+	struct list_head	dlist;
+	struct list_head	mlist;
+	struct hmm_device	*device;
+	struct hmm		*hmm;
+	bool			dead;
+};
+
+/* hmm_mirror_register() - register a device mirror against an mm struct
+ *
+ * @mirror: The mirror that links the process address space with the device.
+ * @device: The device struct to associate this mirror with.
+ * @mm:     The mm struct of the process.
+ * Returns: 0 success, -ENOMEM, -EBUSY or -EINVAL if process already mirrored.
+ *
+ * Called when the device driver wants to start mirroring a process address
+ * space. The hmm shim will register an mmu_notifier and start monitoring
+ * process address space changes. Hence callbacks to the device driver might
+ * happen even before this function returns.
+ *
+ * The mm pin must also be held (either the task is current or using
+ * get_task_mm).
+ *
+ * Only one mirror per mm and hmm_device can be created, it will return -EINVAL
+ * if the hmm_device already has an hmm_mirror for the mm.
+ *
+ * If the mm or previous hmm is in a transient state then this will return
+ * -EBUSY and the device driver must retry the call after unpinning the mm and
+ * checking again that the mm is valid.
+ *
+ * On success the mirror is returned with one reference for the caller, thus to
+ * release the mirror call hmm_mirror_unref.
+ */
+int hmm_mirror_register(struct hmm_mirror *mirror,
+			struct hmm_device *device,
+			struct mm_struct *mm);
+
+/* hmm_mirror_unregister() - unregister an hmm_mirror.
+ *
+ * @mirror: The mirror that link process address space with the device.
+ *
+ * Called when the device driver wants to stop mirroring a process address
+ * space.
+ */
+void hmm_mirror_unregister(struct hmm_mirror *mirror);
+
+/* struct hmm_fault - device mirror fault information
+ *
+ * @vma:    The vma into which the fault range is (set by hmm).
+ * @faddr:  First address of the range device want to fault (set by driver and
+ *          updated by hmm to the actual first faulted address).
+ * @laddr:  Last address of the range device want to fault (set by driver and
+ *          updated by hmm to the actual last faulted address).
+ * @pfns:   Array to hold the pfn value of each page in the range (provided by
+ *          device driver, big enough to hold (laddr - faddr) >> PAGE_SHIFT).
+ * @flags:  Fault flags (set by driver).
+ *
+ * This structure is given by the device driver to hmm_mirror_fault. The device
+ * driver can encapsulate the hmm_fault struct into its own fault structure and
+ * use that to provide private device driver information to the lmem_fault
+ * callback.
+ */
+struct hmm_fault {
+	struct vm_area_struct	*vma;
+	unsigned long		faddr;
+	unsigned long		laddr;
+	unsigned long		*pfns;
+	unsigned long		flags;
+};
+
+#define HMM_FAULT_WRITE		(1 << 0)
+
+/* hmm_mirror_fault() - called by the device driver on a device memory fault.
+ *
+ * @mirror:     The mirror that links the process address space with the device.
+ * @fault:      The mirror fault struct holding fault range information.
+ *
+ * Called when the device is trying to access an invalid address in the device
+ * page table. The hmm shim will call lmem_fault with strong ordering with
+ * respect to calls to lmem_update (ie any information provided to lmem_fault
+ * is valid until the device callback returns).
+ *
+ * It will try to fault all pages in the range and give their pfns. If the vma
+ * covering the range needs to grow then it will.
+ *
+ * Also the fault will clamp the requested range to a valid vma range (unless
+ * the vma into which fault->faddr falls can grow).
+ *
+ * All errors must be handled by the device driver and most likely result in
+ * the process device tasks being killed by the device driver.
+ *
+ * Returns:
+ * > 0 Number of pages faulted.
+ * -EINVAL if invalid argument.
+ * -ENOMEM if failing to allocate memory.
+ * -EACCES if trying to write to read only address (only for faddr).
+ * -EFAULT if trying to access an invalid address (only for faddr).
+ * -ENODEV if mirror is in process of being destroy.
+ */
+int hmm_mirror_fault(struct hmm_mirror *mirror,
+		     struct hmm_fault *fault);
+
+struct hmm_mirror *hmm_mirror_ref(struct hmm_mirror *mirror);
+struct hmm_mirror *hmm_mirror_unref(struct hmm_mirror *mirror);
+
+
+
+
+/* Functions used by core mm code. Device driver should not use any of them. */
+void __hmm_destroy(struct mm_struct *mm);
+static inline void hmm_destroy(struct mm_struct *mm)
+{
+	if (mm->hmm) {
+		__hmm_destroy(mm);
+	}
+}
+
+#else /* !CONFIG_HMM */
+
+static inline void hmm_destroy(struct mm_struct *mm)
+{
+}
+
+#endif /* !CONFIG_HMM */
+
+#endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index de16272..8fa66cc 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -16,6 +16,10 @@
 #include <asm/page.h>
 #include <asm/mmu.h>
 
+#ifdef CONFIG_HMM
+struct hmm;
+#endif
+
 #ifndef AT_VECTOR_SIZE_ARCH
 #define AT_VECTOR_SIZE_ARCH 0
 #endif
@@ -425,6 +429,16 @@ struct mm_struct {
 #ifdef CONFIG_MMU_NOTIFIER
 	struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
+#ifdef CONFIG_HMM
+	/*
+	 * hmm always registers an mmu_notifier; we rely on the mmu notifier to
+	 * keep a refcount on the mm struct as well as to forbid registering hmm
+	 * on a dying mm.
+	 *
+	 * This field is set with mmap_sem held in write mode.
+	 */
+	struct hmm *hmm;
+#endif
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	pgtable_t pmd_huge_pte; /* protected by page_table_lock */
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 0d53eb0..56fce77 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -27,6 +27,7 @@
 #include <linux/binfmts.h>
 #include <linux/mman.h>
 #include <linux/mmu_notifier.h>
+#include <linux/hmm.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/vmacache.h>
@@ -602,6 +603,8 @@ void __mmdrop(struct mm_struct *mm)
 	mm_free_pgd(mm);
 	destroy_context(mm);
 	mmu_notifier_mm_destroy(mm);
+	/* hmm_destroy needs to be called after mmu_notifier_mm_destroy */
+	hmm_destroy(mm);
 	check_mm(mm);
 	free_mm(mm);
 }
@@ -820,6 +823,9 @@ static struct mm_struct *dup_mm(struct task_struct *tsk)
 
 	memcpy(mm, oldmm, sizeof(*mm));
 	mm_init_cpumask(mm);
+#ifdef CONFIG_HMM
+	mm->hmm = NULL;
+#endif
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	mm->pmd_huge_pte = NULL;
diff --git a/mm/Kconfig b/mm/Kconfig
index 30cb6cb..7836f17 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -584,3 +584,15 @@ config PGTABLE_MAPPING
 
 config GENERIC_EARLY_IOREMAP
 	bool
+
+config HMM
+	bool "Enable heterogeneous memory management (HMM)"
+	depends on MMU
+	select MMU_NOTIFIER
+	default n
+	help
+	  Heterogeneous memory management provides infrastructure for a device
+	  to mirror a process address space into a hardware mmu or into anything
+	  supporting pagefault-like events.
+
+	  If unsure, say N to disable hmm.
diff --git a/mm/Makefile b/mm/Makefile
index b484452..d231646 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -63,3 +63,4 @@ obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
 obj-$(CONFIG_ZBUD)	+= zbud.o
 obj-$(CONFIG_ZSMALLOC)	+= zsmalloc.o
 obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
+obj-$(CONFIG_HMM) += hmm.o
diff --git a/mm/hmm.c b/mm/hmm.c
new file mode 100644
index 0000000..2b8986c
--- /dev/null
+++ b/mm/hmm.c
@@ -0,0 +1,1194 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/* This is the core code for heterogeneous memory management (HMM). HMM intends
+ * to provide helpers for mirroring a process address space on a device as well
+ * as allowing migration of data between local memory and device memory.
+ *
+ * Refer to include/linux/hmm.h for further information on the general design.
+ */
+/* Locking :
+ *
+ *   To synchronize with various mm events there is a simple serialization of
+ *   events touching overlapping ranges of addresses. Each mm event is
+ *   associated with an hmm_event structure which stores the address range of
+ *   the event.
+ *
+ *   When a new mm event calls into hmm (most calls come through the
+ *   mmu_notifier callbacks) hmm allocates an hmm_event structure and waits for
+ *   all pending events that overlap with the new event.
+ *
+ *   To avoid deadlock with mmap_sem the rule is to always allocate a new hmm
+ *   event after taking the mmap_sem lock. In case of an mmu_notifier call we
+ *   do not take the mmap_sem lock, as if it was needed it would have been
+ *   taken by the caller of the mmu_notifier API.
+ *
+ *   Hence hmm only needs to make sure to allocate a new hmm event after taking
+ *   the mmap_sem.
+ */
+#include <linux/export.h>
+#include <linux/bitmap.h>
+#include <linux/srcu.h>
+#include <linux/rculist.h>
+#include <linux/slab.h>
+#include <linux/mmu_notifier.h>
+#include <linux/mm.h>
+#include <linux/hugetlb.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/ksm.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/mmu_context.h>
+#include <linux/memcontrol.h>
+#include <linux/hmm.h>
+#include <linux/wait.h>
+#include <linux/interval_tree_generic.h>
+#include <linux/mman.h>
+#include <asm/tlb.h>
+#include <asm/tlbflush.h>
+#include <linux/delay.h>
+
+#include "internal.h"
+
+#define HMM_MAX_RANGE_BITS	(PAGE_SHIFT + 3UL)
+#define HMM_MAX_RANGE_SIZE	(PAGE_SIZE << HMM_MAX_RANGE_BITS)
+#define MM_MAX_SWAP_PAGES (swp_offset(pte_to_swp_entry(swp_entry_to_pte(swp_entry(0, ~0UL)))) + 1UL)
+#define HMM_MAX_ADDR		(((unsigned long)PTRS_PER_PGD) << ((unsigned long)PGDIR_SHIFT))
+
+#define HMM_MAX_EVENTS		16
+
+/* global SRCU for all MMs */
+static struct srcu_struct srcu;
+
+
+
+
+/* struct hmm_event - used to serialize changes to overlapping address ranges.
+ *
+ * @list:       Current event list for the corresponding hmm.
+ * @faddr:      First address (inclusive) for the range this event affect.
+ * @laddr:      Last address (exclusive) for the range this event affect.
+ * @fences:     List of device fences associated with this event.
+ * @etype:      Event type (munmap, migrate, truncate, ...).
+ * @backoff:    Should this event backoff ie a new event render it obsolete.
+ */
+struct hmm_event {
+	struct list_head	list;
+	unsigned long		faddr;
+	unsigned long		laddr;
+	struct list_head	fences;
+	enum hmm_etype		etype;
+	bool			backoff;
+};
+
+/* struct hmm - per mm_struct hmm structure
+ *
+ * @mm:             The mm struct.
+ * @kref:           Reference counter
+ * @lock:           Serialize the mirror list modifications.
+ * @mirrors:        List of all mirror for this mm (one per device)
+ * @mmu_notifier:   The mmu_notifier of this mm
+ * @wait_queue:     Wait queue for synchronization btw cpu and device
+ * @events:         Events.
+ * @nevents:        Number of events currently happening.
+ * @dead:           The mm is being destroyed.
+ *
+ * For each process address space (mm_struct) there is one and only one hmm
+ * struct. hmm functions will redispatch to each device the changes to the
+ * process address space.
+ */
+struct hmm {
+	struct mm_struct 	*mm;
+	struct kref		kref;
+	spinlock_t		lock;
+	struct list_head	mirrors;
+	struct list_head	pending;
+	struct mmu_notifier	mmu_notifier;
+	wait_queue_head_t	wait_queue;
+	struct hmm_event	events[HMM_MAX_EVENTS];
+	int			nevents;
+	bool			dead;
+};
+
+static struct mmu_notifier_ops hmm_notifier_ops;
+
+static inline struct hmm *hmm_ref(struct hmm *hmm);
+static inline struct hmm *hmm_unref(struct hmm *hmm);
+
+static int hmm_mirror_update(struct hmm_mirror *mirror,
+			     struct vm_area_struct *vma,
+			     unsigned long faddr,
+			     unsigned long laddr,
+			     struct hmm_event *event);
+static void hmm_mirror_cleanup(struct hmm_mirror *mirror);
+
+static int hmm_device_fence_wait(struct hmm_device *device,
+				 struct hmm_fence *fence);
+
+
+
+
+/* hmm_event - used to synchronize various mm events with each other.
+ *
+ * During the lifetime of a process various mm events will happen, hmm
+ * serializes events that affect overlapping ranges of addresses. The hmm_event
+ * structures are used for that purpose.
+ */
+
+static inline bool hmm_event_overlap(struct hmm_event *a, struct hmm_event *b)
+{
+	return !((a->laddr <= b->faddr) || (a->faddr >= b->laddr));
+}
+
+static inline unsigned long hmm_event_size(struct hmm_event *event)
+{
+	return (event->laddr - event->faddr);
+}
+
+
+
+
+/* hmm_fault_mm - used for reading cpu page table on device fault.
+ *
+ * This code deals with reading the cpu page table to find the pages that are
+ * backing a range of addresses. It is used as a helper by the device page fault
+ * code.
+ */
+
+/* struct hmm_fault_mm - used for reading cpu page table on device fault.
+ *
+ * @mm:     The mm of the process the device fault is happening in.
+ * @vma:    The vma in which the fault is happening.
+ * @faddr:  The first address for the range the device want to fault.
+ * @laddr:  The last address for the range the device want to fault.
+ * @pfns:   Array of hmm pfns (contains the result of the fault).
+ * @write:  Is this write fault.
+ */
+struct hmm_fault_mm {
+	struct mm_struct	*mm;
+	struct vm_area_struct	*vma;
+	unsigned long		faddr;
+	unsigned long		laddr;
+	unsigned long		*pfns;
+	bool			write;
+};
+
+static int hmm_fault_mm_fault_pmd(pmd_t *pmdp,
+				  unsigned long faddr,
+				  unsigned long laddr,
+				  struct mm_walk *walk)
+{
+	struct hmm_fault_mm *fault_mm = walk->private;
+	unsigned long idx, *pfns;
+	pte_t *ptep;
+
+	idx = (faddr - fault_mm->faddr) >> PAGE_SHIFT;
+	pfns = &fault_mm->pfns[idx];
+	memset(pfns, 0, ((laddr - faddr) >> PAGE_SHIFT) * sizeof(long));
+	if (pmd_none(*pmdp)) {
+		return -ENOENT;
+	}
+
+	if (pmd_trans_huge(*pmdp)) {
+		/* FIXME */
+		return -EINVAL;
+	}
+
+	if (pmd_none_or_trans_huge_or_clear_bad(pmdp)) {
+		return -EINVAL;
+	}
+
+	ptep = pte_offset_map(pmdp, faddr);
+	for (; faddr != laddr; ++ptep, ++pfns, faddr += PAGE_SIZE) {
+		pte_t pte = *ptep;
+
+		if (pte_none(pte)) {
+			if (fault_mm->write) {
+				ptep++;
+				break;
+			}
+			*pfns = my_zero_pfn(faddr) << HMM_PFN_SHIFT;
+			set_bit(HMM_PFN_VALID_ZERO, pfns);
+			continue;
+		}
+		if (!pte_present(pte) || (fault_mm->write && !pte_write(pte))) {
+			/* Need to inc ptep so unmap unlock on right pmd. */
+			ptep++;
+			break;
+		}
+
+		*pfns = pte_pfn(pte) << HMM_PFN_SHIFT;
+		set_bit(HMM_PFN_VALID_PAGE, pfns);
+		if (pte_write(pte)) {
+			set_bit(HMM_PFN_WRITE, pfns);
+		}
+		/* Consider the page as hot since a device wants to use it. */
+		mark_page_accessed(pfn_to_page(pte_pfn(pte)));
+		fault_mm->laddr = faddr + PAGE_SIZE;
+	}
+	pte_unmap(ptep - 1);
+
+	return (faddr == laddr) ? 0 : -ENOENT;
+}
+
+static int hmm_fault_mm_fault(struct hmm_fault_mm *fault_mm)
+{
+	struct mm_walk walk = {0};
+	unsigned long faddr, laddr;
+	int ret;
+
+	faddr = fault_mm->faddr;
+	laddr = fault_mm->laddr;
+	fault_mm->laddr = faddr;
+
+	walk.pmd_entry = hmm_fault_mm_fault_pmd;
+	walk.mm = fault_mm->mm;
+	walk.private = fault_mm;
+
+	ret = walk_page_range(faddr, laddr, &walk);
+	return ret;
+}
+
+
+
+
+/* hmm - core hmm functions.
+ *
+ * Core hmm functions that deal with all the process mm activities and use
+ * events for synchronization. Those functions are used mostly as a result of
+ * cpu mm events.
+ */
+
+static int hmm_init(struct hmm *hmm, struct mm_struct *mm)
+{
+	int i, ret;
+
+	hmm->mm = mm;
+	kref_init(&hmm->kref);
+	INIT_LIST_HEAD(&hmm->mirrors);
+	INIT_LIST_HEAD(&hmm->pending);
+	spin_lock_init(&hmm->lock);
+	init_waitqueue_head(&hmm->wait_queue);
+
+	for (i = 0; i < HMM_MAX_EVENTS; ++i) {
+		hmm->events[i].etype = HMM_NONE;
+		INIT_LIST_HEAD(&hmm->events[i].fences);
+	}
+
+	/* register notifier */
+	hmm->mmu_notifier.ops = &hmm_notifier_ops;
+	ret = __mmu_notifier_register(&hmm->mmu_notifier, mm);
+	return ret;
+}
+
+static enum hmm_etype hmm_event_mmu(enum mmu_action action)
+{
+	switch (action) {
+	case MMU_MPROT_RONLY:
+		return HMM_MPROT_RONLY;
+	case MMU_MPROT_RANDW:
+		return HMM_MPROT_RANDW;
+	case MMU_MPROT_WONLY:
+		return HMM_MPROT_WONLY;
+	case MMU_COW:
+		return HMM_COW;
+	case MMU_MPROT_NONE:
+	case MMU_KSM:
+	case MMU_KSM_RONLY:
+	case MMU_UNMAP:
+	case MMU_VMSCAN:
+	case MMU_MUNLOCK:
+	case MMU_MIGRATE:
+	case MMU_FILE_WB:
+	case MMU_FAULT_WP:
+	case MMU_THP_SPLIT:
+	case MMU_THP_FAULT_WP:
+		return HMM_UNMAP;
+	case MMU_POISON:
+	case MMU_MREMAP:
+	case MMU_MUNMAP:
+		return HMM_MUNMAP;
+	case MMU_SOFT_DIRTY:
+	default:
+		return HMM_NONE;
+	}
+}
+
+static void hmm_event_unqueue_locked(struct hmm *hmm, struct hmm_event *event)
+{
+	list_del_init(&event->list);
+	event->etype = HMM_NONE;
+	hmm->nevents--;
+}
+
+static void hmm_event_unqueue(struct hmm *hmm, struct hmm_event *event)
+{
+	spin_lock(&hmm->lock);
+	list_del_init(&event->list);
+	event->etype = HMM_NONE;
+	hmm->nevents--;
+	spin_unlock(&hmm->lock);
+}
+
+static void hmm_destroy_kref(struct kref *kref)
+{
+	struct hmm *hmm;
+	struct mm_struct *mm;
+
+	hmm = container_of(kref, struct hmm, kref);
+	mm = hmm->mm;
+	mm->hmm = NULL;
+	mmu_notifier_unregister(&hmm->mmu_notifier, mm);
+
+	if (!list_empty(&hmm->mirrors)) {
+		BUG();
+		printk(KERN_ERR "destroying an hmm with still active mirrors\n"
+		       "Leaking memory instead to avoid something worse.\n");
+		return;
+	}
+	kfree(hmm);
+}
+
+static inline struct hmm *hmm_ref(struct hmm *hmm)
+{
+	if (hmm) {
+		kref_get(&hmm->kref);
+		return hmm;
+	}
+	return NULL;
+}
+
+static inline struct hmm *hmm_unref(struct hmm *hmm)
+{
+	if (hmm) {
+		kref_put(&hmm->kref, hmm_destroy_kref);
+	}
+	return NULL;
+}
+
+static struct hmm_event *hmm_event_get(struct hmm *hmm,
+				       unsigned long faddr,
+				       unsigned long laddr,
+				       enum hmm_etype etype)
+{
+	struct hmm_event *event, *wait = NULL;
+	enum hmm_etype wait_type;
+	unsigned id;
+
+	do {
+		wait_event(hmm->wait_queue, hmm->nevents < HMM_MAX_EVENTS);
+		spin_lock(&hmm->lock);
+		for (id = 0; id < HMM_MAX_EVENTS; ++id) {
+			if (hmm->events[id].etype == HMM_NONE) {
+				event = &hmm->events[id];
+				goto out;
+			}
+		}
+		spin_unlock(&hmm->lock);
+	} while (1);
+
+out:
+	event->etype = etype;
+	event->faddr = faddr;
+	event->laddr = laddr;
+	event->backoff = false;
+	INIT_LIST_HEAD(&event->fences);
+	hmm->nevents++;
+	list_add_tail(&event->list, &hmm->pending);
+
+retry_wait:
+	wait = event;
+	list_for_each_entry_continue_reverse (wait, &hmm->pending, list) {
+		if (!hmm_event_overlap(event, wait)) {
+			continue;
+		}
+		switch (event->etype) {
+		case HMM_UNMAP:
+		case HMM_MUNMAP:
+			switch (wait->etype) {
+			case HMM_DEVICE_FAULT:
+			case HMM_MIGRATE_TO_RMEM:
+				wait->backoff = true;
+				/* fall through */
+			default:
+				wait_type = wait->etype;
+				goto wait;
+			}
+		default:
+			wait_type = wait->etype;
+			goto wait;
+		}
+	}
+	spin_unlock(&hmm->lock);
+
+	return event;
+
+wait:
+	spin_unlock(&hmm->lock);
+	wait_event(hmm->wait_queue, wait->etype != wait_type);
+	spin_lock(&hmm->lock);
+	goto retry_wait;
+}
+
+static void hmm_update_mirrors(struct hmm *hmm,
+			       struct vm_area_struct *vma,
+			       struct hmm_event *event)
+{
+	unsigned long faddr, laddr;
+
+	for (faddr = event->faddr; faddr < event->laddr; faddr = laddr) {
+		struct hmm_mirror *mirror;
+		struct hmm_fence *fence = NULL, *tmp;
+		int ticket;
+
+		laddr = event->laddr;
+
+retry_ranges:
+		ticket = srcu_read_lock(&srcu);
+		/* Because of a retry we might already have scheduled some
+		 * mirrors; skip those.
+		 */
+		mirror = list_first_entry(&hmm->mirrors,
+					  struct hmm_mirror,
+					  mlist);
+		mirror = fence ? fence->mirror : mirror;
+		list_for_each_entry_continue (mirror, &hmm->mirrors, mlist) {
+			int r;
+
+			r = hmm_mirror_update(mirror,vma,faddr,laddr,event);
+			if (r) {
+				srcu_read_unlock(&srcu, ticket);
+				hmm_mirror_cleanup(mirror);
+				goto retry_ranges;
+			}
+		}
+		srcu_read_unlock(&srcu, ticket);
+
+		list_for_each_entry_safe (fence, tmp, &event->fences, list) {
+			struct hmm_device *device;
+			int r;
+
+			mirror = fence->mirror;
+			device = mirror->device;
+
+			r = hmm_device_fence_wait(device, fence);
+			if (r) {
+				hmm_mirror_cleanup(mirror);
+			}
+		}
+	}
+}
+
+static int hmm_fault_mm(struct hmm *hmm,
+			struct vm_area_struct *vma,
+			unsigned long faddr,
+			unsigned long laddr,
+			bool write)
+{
+	int r;
+
+	if (laddr <= faddr) {
+		return -EINVAL;
+	}
+
+	for (; faddr < laddr; faddr += PAGE_SIZE) {
+		unsigned flags = 0;
+
+		flags |= write ? FAULT_FLAG_WRITE : 0;
+		flags |= FAULT_FLAG_ALLOW_RETRY;
+		do {
+			r = handle_mm_fault(hmm->mm, vma, faddr, flags);
+			if (!(r & VM_FAULT_RETRY) && (r & VM_FAULT_ERROR)) {
+				if (r & VM_FAULT_OOM) {
+					return -ENOMEM;
+				}
+				/* Same error code for all other cases. */
+				return -EFAULT;
+			}
+			flags &= ~FAULT_FLAG_ALLOW_RETRY;
+		} while (r & VM_FAULT_RETRY);
+	}
+
+	return 0;
+}
+
+
+
+
+/* hmm_notifier - mmu_notifier hmm funcs tracking changes to the process mm.
+ *
+ * Callbacks for mmu notifier. We use the mmu notifier to track changes made
+ * to the process address space.
+ *
+ * Note that none of these callbacks needs to take a reference, as we are sure
+ * that the mm won't be destroyed and thus hmm won't be destroyed either. It is
+ * fine if some hmm_mirror/hmm_device are destroyed during those callbacks
+ * because this is serialized through either the hmm lock or the device lock.
+ */
+
+static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	struct hmm *hmm;
+
+	if (!(hmm = hmm_ref(mm->hmm)) || hmm->dead) {
+		/* Already clean. */
+		hmm_unref(hmm);
+		return;
+	}
+
+	hmm->dead = true;
+
+	/*
+	 * hmm->lock allows synchronization with hmm_mirror_unregister() so
+	 * that an hmm_mirror can be removed only once.
+	 */
+	spin_lock(&hmm->lock);
+	while (unlikely(!list_empty(&hmm->mirrors))) {
+		struct hmm_mirror *mirror;
+		struct hmm_device *device;
+
+		mirror = list_first_entry(&hmm->mirrors,
+					  struct hmm_mirror,
+					  mlist);
+		device = mirror->device;
+		if (!mirror->dead) {
+			/* Update mirror as being dead and remove it from the
+			 * mirror list before freeing up any of its resources.
+			 */
+			mirror->dead = true;
+			list_del_init(&mirror->mlist);
+			spin_unlock(&hmm->lock);
+
+			synchronize_srcu(&srcu);
+
+			device->ops->mirror_release(mirror);
+			hmm_mirror_cleanup(mirror);
+			spin_lock(&hmm->lock);
+		}
+	}
+	spin_unlock(&hmm->lock);
+	hmm_unref(hmm);
+}
+
+static void hmm_notifier_invalidate_range_start(struct mmu_notifier *mn,
+						struct mm_struct *mm,
+						struct vm_area_struct *vma,
+						unsigned long faddr,
+						unsigned long laddr,
+						enum mmu_action action)
+{
+	struct hmm_event *event;
+	enum hmm_etype etype;
+	struct hmm *hmm;
+
+	if (!(hmm = hmm_ref(mm->hmm))) {
+		return;
+	}
+
+	etype = hmm_event_mmu(action);
+	switch (etype) {
+	case HMM_NONE:
+		hmm_unref(hmm);
+		return;
+	default:
+		break;
+	}
+
+	faddr = faddr & PAGE_MASK;
+	laddr = PAGE_ALIGN(laddr);
+
+	event = hmm_event_get(hmm, faddr, laddr, etype);
+	hmm_update_mirrors(hmm, vma, event);
+	/* Do not drop hmm reference here but in the range_end instead. */
+}
+
+static void hmm_notifier_invalidate_range_end(struct mmu_notifier *mn,
+					      struct mm_struct *mm,
+					      struct vm_area_struct *vma,
+					      unsigned long faddr,
+					      unsigned long laddr,
+					      enum mmu_action action)
+{
+	struct hmm_event *event = NULL;
+	enum hmm_etype etype;
+	struct hmm *hmm;
+	int i;
+
+	if (!(hmm = mm->hmm)) {
+		return;
+	}
+
+	etype = hmm_event_mmu(action);
+	switch (etype) {
+	case HMM_NONE:
+		return;
+	default:
+		break;
+	}
+
+	faddr = faddr & PAGE_MASK;
+	laddr = PAGE_ALIGN(laddr);
+
+	spin_lock(&hmm->lock);
+	for (i = 0; i < HMM_MAX_EVENTS; ++i, event = NULL) {
+		event = &hmm->events[i];
+		if (event->etype == etype &&
+		    event->faddr == faddr &&
+		    event->laddr == laddr &&
+		    !list_empty(&event->list)) {
+			hmm_event_unqueue_locked(hmm, event);
+			break;
+		}
+	}
+	spin_unlock(&hmm->lock);
+
+	/* Drop reference from invalidate_range_start. */
+	hmm_unref(hmm);
+}
+
+static void hmm_notifier_invalidate_page(struct mmu_notifier *mn,
+					 struct mm_struct *mm,
+					 struct vm_area_struct *vma,
+					 unsigned long faddr,
+					 enum mmu_action action)
+{
+	unsigned long laddr;
+	struct hmm_event *event;
+	enum hmm_etype etype;
+	struct hmm *hmm;
+
+	if (!(hmm = hmm_ref(mm->hmm))) {
+		return;
+	}
+
+	etype = hmm_event_mmu(action);
+	switch (etype) {
+	case HMM_NONE:
+		return;
+	default:
+		break;
+	}
+
+	faddr = faddr & PAGE_MASK;
+	laddr = faddr + PAGE_SIZE;
+
+	event = hmm_event_get(hmm, faddr, laddr, etype);
+	hmm_update_mirrors(hmm, vma, event);
+	hmm_event_unqueue(hmm, event);
+	hmm_unref(hmm);
+}
+
+static struct mmu_notifier_ops hmm_notifier_ops = {
+	.release		= hmm_notifier_release,
+	/* .clear_flush_young FIXME we probably want to do something. */
+	/* .test_young FIXME we probably want to do something. */
+	/* WARNING: .change_pte must always be bracketed by range_start/end.
+	 * There were patches to remove that behavior; we must make sure those
+	 * patches are not included, as an alternative solution to the issue
+	 * they are trying to solve can be used instead.
+	 *
+	 * In any case hmm can not use the change_pte callback as non sleeping
+	 * locks are held during the change_pte callback.
+	 */
+	.change_pte		= NULL,
+	.invalidate_page	= hmm_notifier_invalidate_page,
+	.invalidate_range_start	= hmm_notifier_invalidate_range_start,
+	.invalidate_range_end	= hmm_notifier_invalidate_range_end,
+};
+
+
+
+
+/* hmm_mirror - per device mirroring functions.
+ *
+ * Each device that mirrors a process has a unique hmm_mirror struct. A process
+ * can be mirrored by several devices at the same time.
+ *
+ * Below are all the functions and their helpers used by device drivers to
+ * mirror the process address space. Those functions either deal with updating
+ * the device page table (through the hmm callbacks), or provide helpers used
+ * by the device driver to fault in a range of memory in the device page table.
+ */
+
+static int hmm_mirror_update(struct hmm_mirror *mirror,
+			     struct vm_area_struct *vma,
+			     unsigned long faddr,
+			     unsigned long laddr,
+			     struct hmm_event *event)
+{
+	struct hmm_device *device = mirror->device;
+	struct hmm_fence *fence;
+	bool dirty = !!(vma->vm_file);
+
+	fence = device->ops->lmem_update(mirror, faddr, laddr,
+					 event->etype, dirty);
+	if (fence) {
+		if (IS_ERR(fence)) {
+			return PTR_ERR(fence);
+		}
+		fence->mirror = mirror;
+		list_add_tail(&fence->list, &event->fences);
+	}
+	return 0;
+}
+
+static void hmm_mirror_cleanup(struct hmm_mirror *mirror)
+{
+	struct vm_area_struct *vma;
+	struct hmm_device *device = mirror->device;
+	struct hmm_event *event;
+	unsigned long faddr, laddr;
+	struct hmm *hmm = mirror->hmm;
+
+	spin_lock(&hmm->lock);
+	if (mirror->dead) {
+		spin_unlock(&hmm->lock);
+		return;
+	}
+	mirror->dead = true;
+	list_del(&mirror->mlist);
+	spin_unlock(&hmm->lock);
+	synchronize_srcu(&srcu);
+	INIT_LIST_HEAD(&mirror->mlist);
+
+
+	event = hmm_event_get(hmm, 0UL, HMM_MAX_ADDR, HMM_UNREGISTER);
+	faddr = 0UL;
+	vma = find_vma(hmm->mm, faddr);
+	for (; vma && (faddr < HMM_MAX_ADDR); faddr = laddr) {
+		struct hmm_fence *fence, *next;
+
+		faddr = max(faddr, vma->vm_start);
+		laddr = vma->vm_end;
+
+		hmm_mirror_update(mirror, vma, faddr, laddr, event);
+		list_for_each_entry_safe (fence, next, &event->fences, list) {
+			hmm_device_fence_wait(device, fence);
+		}
+
+		if (laddr >= vma->vm_end) {
+			vma = vma->vm_next;
+		}
+	}
+	hmm_event_unqueue(hmm, event);
+
+	mutex_lock(&device->mutex);
+	list_del_init(&mirror->dlist);
+	mutex_unlock(&device->mutex);
+
+	mirror->hmm = hmm_unref(hmm);
+	hmm_mirror_unref(mirror);
+}
+
+static void hmm_mirror_destroy(struct kref *kref)
+{
+	struct hmm_mirror *mirror;
+	struct hmm_device *device;
+
+	mirror = container_of(kref, struct hmm_mirror, kref);
+	device = mirror->device;
+
+	BUG_ON(!list_empty(&mirror->mlist));
+	BUG_ON(!list_empty(&mirror->dlist));
+
+	device->ops->mirror_destroy(mirror);
+	hmm_device_unref(device);
+}
+
+struct hmm_mirror *hmm_mirror_ref(struct hmm_mirror *mirror)
+{
+	if (mirror) {
+		kref_get(&mirror->kref);
+		return mirror;
+	}
+	return NULL;
+}
+EXPORT_SYMBOL(hmm_mirror_ref);
+
+struct hmm_mirror *hmm_mirror_unref(struct hmm_mirror *mirror)
+{
+	if (mirror) {
+		kref_put(&mirror->kref, hmm_mirror_destroy);
+	}
+	return NULL;
+}
+EXPORT_SYMBOL(hmm_mirror_unref);
+
+int hmm_mirror_register(struct hmm_mirror *mirror,
+			struct hmm_device *device,
+			struct mm_struct *mm)
+{
+	struct hmm *hmm = NULL;
+	int ret = 0;
+
+	/* Sanity checks. */
+	BUG_ON(!mirror);
+	BUG_ON(!device);
+	BUG_ON(!mm);
+
+	/* Take reference on device only on success. */
+	kref_init(&mirror->kref);
+	mirror->device = device;
+	mirror->dead = false;
+	INIT_LIST_HEAD(&mirror->mlist);
+	INIT_LIST_HEAD(&mirror->dlist);
+
+	down_write(&mm->mmap_sem);
+	if (mm->hmm == NULL) {
+		/* no hmm registered yet so register one */
+		hmm = kzalloc(sizeof(*mm->hmm), GFP_KERNEL);
+		if (hmm == NULL) {
+			ret = -ENOMEM;
+			goto out_cleanup;
+		}
+
+		ret = hmm_init(hmm, mm);
+		if (ret) {
+			kfree(hmm);
+			hmm = NULL;
+			goto out_cleanup;
+		}
+
+		/* Now set hmm; make sure no mmu notifier callback can be called. */
+		ret = mm_take_all_locks(mm);
+		if (unlikely(ret)) {
+			goto out_cleanup;
+		}
+		mm->hmm = hmm;
+		mirror->hmm = hmm;
+		hmm = NULL;
+	} else {
+		struct hmm_mirror *tmp;
+		int id;
+
+		id = srcu_read_lock(&srcu);
+		list_for_each_entry(tmp, &mm->hmm->mirrors, mlist) {
+			if (tmp->device == mirror->device) {
+				/* A process can be mirrored only once by the
+				 * same device.
+				 */
+				srcu_read_unlock(&srcu, id);
+				ret = -EINVAL;
+				goto out_cleanup;
+			}
+		}
+		srcu_read_unlock(&srcu, id);
+
+		ret = mm_take_all_locks(mm);
+		if (unlikely(ret)) {
+			goto out_cleanup;
+		}
+		mirror->hmm = hmm_ref(mm->hmm);
+	}
+
+	/*
+	 * A side note: hmm_notifier_release() can't run concurrently with
+	 * us because we hold the mm_users pin (either implicitly as
+	 * current->mm or explicitly with get_task_mm() or similar).
+	 *
+	 * We can't race against any other mmu notifier method either
+	 * thanks to mm_take_all_locks().
+	 */
+	spin_lock(&mm->hmm->lock);
+	list_add_rcu(&mirror->mlist, &mm->hmm->mirrors);
+	spin_unlock(&mm->hmm->lock);
+	mm_drop_all_locks(mm);
+
+out_cleanup:
+	if (hmm) {
+		mmu_notifier_unregister(&hmm->mmu_notifier, mm);
+		kfree(hmm);
+	}
+	up_write(&mm->mmap_sem);
+
+	if (!ret) {
+		struct hmm_device *device = mirror->device;
+
+		hmm_device_ref(device);
+		mutex_lock(&device->mutex);
+		list_add(&mirror->dlist, &device->mirrors);
+		mutex_unlock(&device->mutex);
+	}
+	return ret;
+}
+EXPORT_SYMBOL(hmm_mirror_register);
+
+void hmm_mirror_unregister(struct hmm_mirror *mirror)
+{
+	struct hmm *hmm;
+
+	if (!mirror) {
+		return;
+	}
+	hmm = hmm_ref(mirror->hmm);
+	if (!hmm) {
+		return;
+	}
+
+	down_read(&hmm->mm->mmap_sem);
+	hmm_mirror_cleanup(mirror);
+	up_read(&hmm->mm->mmap_sem);
+	hmm_unref(hmm);
+}
+EXPORT_SYMBOL(hmm_mirror_unregister);
+
+static int hmm_mirror_lmem_fault(struct hmm_mirror *mirror,
+				 struct hmm_fault *fault,
+				 unsigned long faddr,
+				 unsigned long laddr,
+				 unsigned long *pfns)
+{
+	struct hmm_device *device = mirror->device;
+	int ret;
+
+	ret = device->ops->lmem_fault(mirror, faddr, laddr, pfns, fault);
+	return ret;
+}
+
+/* see include/linux/hmm.h */
+int hmm_mirror_fault(struct hmm_mirror *mirror,
+		     struct hmm_fault *fault)
+{
+	struct vm_area_struct *vma;
+	struct hmm_event *event;
+	unsigned long caddr, naddr, vm_flags;
+	struct hmm *hmm;
+	bool do_fault = false, write;
+	int ret = 0;
+
+	if (!mirror || !fault || fault->faddr >= fault->laddr) {
+		return -EINVAL;
+	}
+	if (mirror->dead) {
+		return -ENODEV;
+	}
+	hmm = mirror->hmm;
+
+	write = !!(fault->flags & HMM_FAULT_WRITE);
+	fault->faddr = fault->faddr & PAGE_MASK;
+	fault->laddr = PAGE_ALIGN(fault->laddr);
+	caddr = fault->faddr;
+	naddr = fault->laddr;
+	/* FIXME arbitrary value clamp fault to 4M at a time. */
+	if ((fault->laddr - fault->faddr) > (4UL << 20UL)) {
+		fault->laddr = fault->faddr + (4UL << 20UL);
+	}
+	hmm_mirror_ref(mirror);
+
+retry:
+	down_read(&hmm->mm->mmap_sem);
+	event = hmm_event_get(hmm, caddr, naddr, HMM_DEVICE_FAULT);
+	/* FIXME handle gate area ? and guard page */
+	vma = find_extend_vma(hmm->mm, caddr);
+	if (!vma) {
+		if (caddr > fault->faddr) {
+			/* Fault succeed up to addr. */
+			fault->laddr = caddr;
+			ret = 0;
+			goto out;
+		}
+		/* Allow device driver to learn about first valid address in
+		 * the range it was trying to fault in so it can restart the
+		 * fault at this address.
+		 */
+		vma = find_vma_intersection(hmm->mm,event->faddr,event->laddr);
+		if (vma) {
+			fault->laddr = vma->vm_start;
+		}
+		ret = -EFAULT;
+		goto out;
+	}
+	/* FIXME support HUGETLB */
+	if ((vma->vm_flags & (VM_IO | VM_PFNMAP | VM_MIXEDMAP | VM_HUGETLB))) {
+		ret = -EFAULT;
+		goto out;
+	}
+	vm_flags = write ? VM_WRITE : VM_READ;
+	if (!(vma->vm_flags & vm_flags)) {
+		ret = -EACCES;
+		goto out;
+	}
+	/* Adjust range to this vma only. */
+	fault->laddr = naddr = event->laddr = min(event->laddr, vma->vm_end);
+	fault->vma = vma;
+
+	for (; caddr < event->laddr;) {
+		struct hmm_fault_mm fault_mm;
+
+		fault_mm.mm = vma->vm_mm;
+		fault_mm.vma = vma;
+		fault_mm.faddr = caddr;
+		fault_mm.laddr = naddr;
+		fault_mm.pfns = fault->pfns;
+		fault_mm.write = write;
+		ret = hmm_fault_mm_fault(&fault_mm);
+		if (ret == -ENOENT && fault_mm.laddr == caddr) {
+			do_fault = true;
+			goto out;
+		}
+		if (ret && ret != -ENOENT) {
+			goto out;
+		}
+		if (mirror->dead) {
+			ret = -ENODEV;
+			goto out;
+		}
+		if (event->backoff) {
+			ret = -EAGAIN;
+			goto out;
+		}
+
+		ret = hmm_mirror_lmem_fault(mirror, fault,
+					    fault_mm.faddr,
+					    fault_mm.laddr,
+					    fault_mm.pfns);
+		if (ret) {
+			goto out;
+		}
+		caddr = fault_mm.laddr;
+		naddr = event->laddr;
+	}
+
+out:
+	hmm_event_unqueue(hmm, event);
+	if (do_fault && !event->backoff && !mirror->dead) {
+		do_fault = false;
+		ret = hmm_fault_mm(hmm, vma, caddr, naddr, write);
+		if (!ret) {
+			ret = -ENOENT;
+		}
+	}
+	wake_up(&hmm->wait_queue);
+	up_read(&hmm->mm->mmap_sem);
+	if (ret == -ENOENT) {
+		if (!mirror->dead) {
+			naddr = fault->laddr;
+			goto retry;
+		}
+		ret = -ENODEV;
+	}
+	hmm_mirror_unref(mirror);
+	return ret;
+}
+EXPORT_SYMBOL(hmm_mirror_fault);
+
+
+
+
+/* hmm_device - Each device driver must register one and only one hmm_device.
+ *
+ * The hmm_device is the link between hmm and each device driver.
+ */
+
+static void hmm_device_destroy(struct kref *kref)
+{
+	struct hmm_device *device;
+
+	device = container_of(kref, struct hmm_device, kref);
+	BUG_ON(!list_empty(&device->mirrors));
+
+	device->ops->device_destroy(device);
+}
+
+struct hmm_device *hmm_device_ref(struct hmm_device *device)
+{
+	if (device) {
+		kref_get(&device->kref);
+		return device;
+	}
+	return NULL;
+}
+EXPORT_SYMBOL(hmm_device_ref);
+
+struct hmm_device *hmm_device_unref(struct hmm_device *device)
+{
+	if (device) {
+		kref_put(&device->kref, hmm_device_destroy);
+	}
+	return NULL;
+}
+EXPORT_SYMBOL(hmm_device_unref);
+
+/* see include/linux/hmm.h */
+int hmm_device_register(struct hmm_device *device, const char *name)
+{
+	/* sanity check */
+	BUG_ON(!device);
+	BUG_ON(!device->ops);
+	BUG_ON(!device->ops->device_destroy);
+	BUG_ON(!device->ops->mirror_release);
+	BUG_ON(!device->ops->mirror_destroy);
+	BUG_ON(!device->ops->fence_wait);
+	BUG_ON(!device->ops->lmem_update);
+	BUG_ON(!device->ops->lmem_fault);
+
+	kref_init(&device->kref);
+	device->name = name;
+	mutex_init(&device->mutex);
+	INIT_LIST_HEAD(&device->mirrors);
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_device_register);
+
+static int hmm_device_fence_wait(struct hmm_device *device,
+				 struct hmm_fence *fence)
+{
+	int ret;
+
+	if (fence == NULL) {
+		return 0;
+	}
+
+	list_del_init(&fence->list);
+	do {
+		io_schedule();
+		ret = device->ops->fence_wait(fence);
+	} while (ret == -EAGAIN);
+
+	return ret;
+}
+
+
+
+
+/* This is called after the last hmm_notifier_release() returned */
+void __hmm_destroy(struct mm_struct *mm)
+{
+	kref_put(&mm->hmm->kref, hmm_destroy_kref);
+}
+
+static int __init hmm_module_init(void)
+{
+	int ret;
+
+	ret = init_srcu_struct(&srcu);
+	if (ret) {
+		return ret;
+	}
+	return 0;
+}
+module_init(hmm_module_init);
+
+static void __exit hmm_module_exit(void)
+{
+	cleanup_srcu_struct(&srcu);
+}
+module_exit(hmm_module_exit);
-- 
1.9.0


* [PATCH 06/11] hmm: heterogeneous memory management
@ 2014-05-02 13:52   ` j.glisse
  0 siblings, 0 replies; 107+ messages in thread
From: j.glisse @ 2014-05-02 13:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel
  Cc: Jérôme Glisse, Sherry Cheung, Subhash Gutti,
	Mark Hairgrove, John Hubbard, Jatin Kumar

From: Jérôme Glisse <jglisse@redhat.com>

Motivation:

Heterogeneous memory management is intended to allow a device to transparently
access a process address space without having to lock pages of the process or
take references on them. In other words, it mirrors a process address space
while allowing regular memory management events, such as page reclamation or
page migration, to happen seamlessly.

Recent years have seen a surge in the number of specialized devices that are
part of a computer platform (from desktop to phone). So far each of those
devices has operated on its own private address space that is not linked or
exposed to the address space of the process using it. This separation often
leads to multiple memory copies between the device owned memory and the
process memory, which is a waste of both cpu cycles and memory.

Over the last few years most of those devices have gained a full mmu, allowing
them to support multiple page tables, page faults and other features found in
a cpu mmu. There is now a strong incentive to start leveraging the capabilities
of such devices and to share the process address space, avoiding unnecessary
memory copies and simplifying the programming model of those devices by giving
them a single, common address space with the process that uses them.

The aim of heterogeneous memory management is to provide a common API that can
be used by any such device in order to mirror a process address space. The hmm
code provides a single entry point and interfaces itself with the core mm code
of the linux kernel, avoiding duplicate implementations and shielding device
driver code from core mm code.

Moreover, hmm also intends to provide support for migrating memory to device
private memory, allowing the device to work on its own fast local memory. The
hmm code would be responsible for intercepting cpu page faults on migrated
ranges and for migrating them back to system memory, allowing the cpu to
resume its access to the memory.

Another feature hmm intends to provide is support for atomic operations from
the device even if the bus linking the device and the cpu does not have such
a capability.

We expect graphics processing units and network interfaces to be among the
first prominent users of such an API.

Hardware requirement:

Because hmm is intended to be used by device drivers there are minimum feature
requirements for the hardware mmu :
  - hardware has its own page table per process (can be shared between
    different devices).
  - hardware mmu supports page faults and suspends execution until the page
    fault is serviced by hmm code. The page fault must also trigger some form
    of interrupt so that hmm code can be called by the device driver.
  - hardware must support at least read only mappings (otherwise it can not
    access read only ranges of the process address space).

For better memory management it is highly recommended that the device also
support the following features :
  - hardware mmu sets the access bit in its page table on memory access (like
    the cpu).
  - hardware page table can be updated from the cpu or through a fast path.
  - hardware provides advanced statistics on which ranges of memory it
    accesses the most.
  - hardware differentiates atomic memory accesses from regular accesses,
    allowing atomic operations to be supported even on platforms whose bus
    link with the device has no atomic support.

Implementation:

The hmm layer provides a simple API to the device driver. Each device driver
has to register an hmm_device that holds pointers to all the callbacks the hmm
code will make to synchronize the device page table with the cpu page table of
a given process.

For each process it wants to mirror, the device driver must register an
hmm_mirror structure that holds all the information specific to the process
being mirrored. Each hmm mirror uniquely links an hmm device with a process
address space (the mm struct).

This design allows several different device drivers to mirror the same process
concurrently. The hmm layer will dispatch to each device driver the
modifications that are happening to the process address space.

The hmm layer relies on the mmu notifier api to monitor changes to the process
address space. Because updates to a device page table can have an unbounded
completion time, the hmm layer needs the capability to sleep during mmu
notifier callbacks.

This patch only implements the core of the hmm layer and does not support
features such as migration to device memory.
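
To make the registration flow concrete, here is a rough driver-side sketch
(not part of this patch; the foo_* names, structures and callbacks are
hypothetical and assumed to be implemented elsewhere in the driver, only the
hmm_* symbols come from the API added below):

/* Assumed driver structures embedding the hmm objects. */
struct foo_device {
	struct hmm_device	hmm_device;
};

struct foo_context {
	struct hmm_mirror	mirror;
};

/* The six mandatory callbacks, implemented by the driver elsewhere. */
static const struct hmm_device_ops foo_hmm_ops = {
	.device_destroy	= foo_device_destroy,
	.mirror_release	= foo_mirror_release,
	.mirror_destroy	= foo_mirror_destroy,
	.fence_wait	= foo_fence_wait,
	.lmem_update	= foo_lmem_update,
	.lmem_fault	= foo_lmem_fault,
};

/* Called once at driver load time. */
static int foo_hmm_init(struct foo_device *fdev)
{
	fdev->hmm_device.ops = &foo_hmm_ops;
	return hmm_device_register(&fdev->hmm_device, "foo");
}

/* Called from process context (eg an ioctl) to start mirroring. */
static int foo_bind_current(struct foo_device *fdev, struct foo_context *ctx)
{
	/* One mirror per (device, mm) pair; callbacks may fire from here on. */
	return hmm_mirror_register(&ctx->mirror, &fdev->hmm_device,
				   current->mm);
}

When the driver is done with the address space it calls hmm_mirror_unregister()
and drops its references with hmm_mirror_unref() and hmm_device_unref().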

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 include/linux/hmm.h      |  470 ++++++++++++++++++
 include/linux/mm_types.h |   14 +
 kernel/fork.c            |    6 +
 mm/Kconfig               |   12 +
 mm/Makefile              |    1 +
 mm/hmm.c                 | 1194 ++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 1697 insertions(+)
 create mode 100644 include/linux/hmm.h
 create mode 100644 mm/hmm.c

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
new file mode 100644
index 0000000..e9c7722
--- /dev/null
+++ b/include/linux/hmm.h
@@ -0,0 +1,470 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/* This is heterogeneous memory management (hmm). In a nutshell it provides
+ * an API to mirror a process address space on a device which has its own mmu
+ * and its own page table for the process. It supports everything except
+ * special/mixed vma.
+ *
+ * To use this the hardware must have :
+ *   - mmu with pagetable.
+ *   - pagetable must support read only mappings (supporting dirtyness
+ *     accounting is preferable but not mandatory).
+ *   - pagefault support, ie the hardware thread should stop on fault and
+ *     resume once hmm has provided valid memory to use.
+ *   - some way to report faults.
+ *
+ * The hmm code handles all the interfacing with the core kernel mm code and
+ * provides a simple API. It supports migrating system memory to device
+ * memory and handles migration back to system memory on cpu page fault.
+ *
+ * Migrated memory is considered as swapped out from the cpu and core mm code
+ * point of view.
+ */
+#ifndef _HMM_H
+#define _HMM_H
+
+#ifdef CONFIG_HMM
+
+#include <linux/list.h>
+#include <linux/rwsem.h>
+#include <linux/spinlock.h>
+#include <linux/atomic.h>
+#include <linux/mm_types.h>
+#include <linux/mmu_notifier.h>
+#include <linux/swap.h>
+#include <linux/kref.h>
+#include <linux/swapops.h>
+#include <linux/mman.h>
+
+
+struct hmm_device;
+struct hmm_device_ops;
+struct hmm_migrate;
+struct hmm_mirror;
+struct hmm_fault;
+struct hmm_event;
+struct hmm;
+
+/* The hmm provides page information to the device using hmm pfn values. Below
+ * are the various flags that define the current state the pfn is in (valid,
+ * type of page, dirty page, page is locked or not, ...).
+ *
+ *   HMM_PFN_VALID_PAGE means the pfn corresponds to a valid page.
+ *   HMM_PFN_VALID_ZERO means the pfn is the special zero page.
+ *   HMM_PFN_DIRTY is set when the page is dirty.
+ *   HMM_PFN_WRITE is set if there is no need to call page_mkwrite.
+ */
+#define HMM_PFN_SHIFT		(PAGE_SHIFT)
+#define HMM_PFN_VALID_PAGE	(0UL)
+#define HMM_PFN_VALID_ZERO	(1UL)
+#define HMM_PFN_DIRTY		(2UL)
+#define HMM_PFN_WRITE		(3UL)
+
+static inline struct page *hmm_pfn_to_page(unsigned long pfn)
+{
+	/* Ok to test one bit after the other as they can not flip from one to
+	 * the other. Both bits are constant for the lifetime of an rmem
+	 * object.
+	 */
+	if (!test_bit(HMM_PFN_VALID_PAGE, &pfn) &&
+	    !test_bit(HMM_PFN_VALID_ZERO, &pfn)) {
+		return NULL;
+	}
+	return pfn_to_page(pfn >> HMM_PFN_SHIFT);
+}
+
+static inline void hmm_pfn_set_dirty(unsigned long *pfn)
+{
+	set_bit(HMM_PFN_DIRTY, pfn);
+}
+
+
+/* hmm_fence - device driver fence to wait for device driver operations.
+ *
+ * In order to concurrently update the mmu of several different devices, hmm
+ * relies on device driver fences to wait for the operations hmm has scheduled
+ * to complete on the device. It is strongly recommended to implement fences
+ * and have the hmm callbacks do as little as possible (just scheduling the
+ * update). Moreover the hmm code will reschedule the current process for i/o
+ * if necessary once it has scheduled all updates on all devices.
+ *
+ * Each fence is created as a result of either an update to a range of memory
+ * or a dma of remote memory to/from local memory.
+ *
+ * An update to a range of memory corresponds to a specific event type. For
+ * instance a range of memory is unmapped for page reclamation, or unmapped
+ * from the process address space as a result of the munmap syscall
+ * (HMM_MUNMAP), or there is a memory protection change on the range. There
+ * is one hmm_etype for each of those events, allowing the device driver to
+ * take appropriate action, for instance freeing the device page table on
+ * HMM_MUNMAP but keeping it on HMM_UNMAP (which means that the range is
+ * unmapped but still valid).
+ */
+enum hmm_etype {
+	HMM_NONE = 0,
+	HMM_UNREGISTER,
+	HMM_DEVICE_FAULT,
+	HMM_MPROT_RONLY,
+	HMM_MPROT_RANDW,
+	HMM_MPROT_WONLY,
+	HMM_COW,
+	HMM_MUNMAP,
+	HMM_UNMAP,
+	HMM_MIGRATE_TO_LMEM,
+	HMM_MIGRATE_TO_RMEM,
+};
+
+struct hmm_fence {
+	struct list_head	list;
+	struct hmm_mirror	*mirror;
+};
+
+
+
+
+/* hmm_device - Each device driver must register one and only one hmm_device.
+ *
+ * The hmm_device is the link between hmm and each device driver.
+ */
+
+/* struct hmm_device_operations - hmm device operation callback
+ */
+struct hmm_device_ops {
+	/* device_destroy - free hmm_device (call when refcount drop to 0).
+	 *
+	 * @device: The device hmm specific structure.
+	 */
+	void (*device_destroy)(struct hmm_device *device);
+
+	/* mirror_release() - device must stop using the address space.
+	 *
+	 * @mirror: The mirror linking the process address space with the device.
+	 *
+	 * Called as a result of hmm_mirror_unregister or when the mm is being
+	 * destroyed.
+	 *
+	 * It's illegal for the device to call any hmm helper function after
+	 * this callback. The device driver must kill any pending device
+	 * threads and wait for completion of all of them.
+	 *
+	 * Note that even after this callback returns the device driver might
+	 * get callbacks from hmm. Callbacks will stop only once mirror_destroy
+	 * is called.
+	 */
+	void (*mirror_release)(struct hmm_mirror *hmm_mirror);
+
+	/* mirror_destroy - free hmm_mirror (call when refcount drop to 0).
+	 *
+	 * @mirror: The mirror that link process address space with the device.
+	 */
+	void (*mirror_destroy)(struct hmm_mirror *mirror);
+
+	/* fence_wait() - to wait on device driver fence.
+	 *
+	 * @fence:      The device driver fence struct.
+	 * Returns:     0 on success, -EIO on error, -EAGAIN to wait again.
+	 *
+	 * Called when hmm wants to wait for all operations associated with a
+	 * fence to complete (including a device cache flush if the event
+	 * mandates it).
+	 *
+	 * The device driver must free the fence and associated resources if it
+	 * returns something other than -EAGAIN. On -EAGAIN the fence must not
+	 * be freed as hmm will call back again.
+	 *
+	 * Return an error if the scheduled operation failed or if it needs to
+	 * wait again.
+	 * -EIO    Some input/output error with the device.
+	 * -EAGAIN The fence is not yet signaled, hmm reschedules the waiting
+	 *         thread.
+	 *
+	 * All other return values trigger a warning and are transformed to
+	 * -EIO.
+	 */
+	int (*fence_wait)(struct hmm_fence *fence);
+
+	/* lmem_update() - update device mmu for a range of local memory.
+	 *
+	 * @mirror: The mirror that link process address space with the device.
+	 * @faddr:  First address in range (inclusive).
+	 * @laddr:  Last address in range (exclusive).
+	 * @etype:  The type of memory event (unmap, fini, read only, ...).
+	 * @dirty:  Device driver should call set_page_dirty_lock.
+	 * Returns: Valid fence ptr or NULL on success otherwise ERR_PTR.
+	 *
+	 * Called to update device mmu permissions/usage for a range of local
+	 * memory. The event type provides the nature of the update :
+	 *   - range is no longer valid (munmap).
+	 *   - range protection changes (mprotect, COW, ...).
+	 *   - range is unmapped (swap, reclaim, page migration, ...).
+	 *   - ...
+	 *
+	 * Any event that blocks further writes to the memory must also trigger
+	 * a device cache flush, and everything has to be flushed to local
+	 * memory by the time the wait callback returns (if this callback
+	 * returned a fence, otherwise everything must be flushed by the time
+	 * this callback returns).
+	 *
+	 * The device must properly call set_page_dirty on any page the device
+	 * did write to since the last call to lmem_update. This is only needed
+	 * if the dirty parameter is true.
+	 *
+	 * The driver should return a fence pointer or NULL on success. It is
+	 * advised to return a fence and delay waiting for the operation to
+	 * complete to the wait callback. Returning a fence allows hmm to batch
+	 * updates to several devices and to wait on them only once they all
+	 * have scheduled the update.
+	 *
+	 * The device driver must not fail lightly, any failure results in the
+	 * device process being killed.
+	 *
+	 * IMPORTANT IF DEVICE DRIVER GET HMM_MPROT_RANDW or HMM_MPROT_WONLY IT
+	 * MUST NOT MAP SPECIAL ZERO PFN WITH WRITE PERMISSION. SPECIAL ZERO
+	 * PFN IS SET THROUGH lmem_fault WITH THE HMM_PFN_VALID_ZERO BIT FLAG
+	 * SET.
+	 *
+	 * Return fence or NULL on success, error value otherwise :
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * All other return value trigger warning and are transformed to -EIO.
+	 */
+	struct hmm_fence *(*lmem_update)(struct hmm_mirror *mirror,
+					 unsigned long faddr,
+					 unsigned long laddr,
+					 enum hmm_etype etype,
+					 bool dirty);
+
+	/* lmem_fault() - fault range of lmem on the device mmu.
+	 *
+	 * @mirror: The mirror that link process address space with the device.
+	 * @faddr:  First address in range (inclusive).
+	 * @laddr:  Last address in range (exclusive).
+	 * @pfns:   Array of pfn for the range (each of the pfn is valid).
+	 * @fault:  The fault structure provided by device driver.
+	 * Returns: 0 on success, error value otherwise.
+	 *
+	 * Called to give the device driver each of the pfns backing a range of
+	 * memory. It is only called as a result of a call to hmm_mirror_fault.
+	 *
+	 * Note that the pfns array content is only valid for the duration of
+	 * the callback. Once the device driver callback returns, further memory
+	 * activities might invalidate the values in the pfns array. The device
+	 * driver will be informed of such changes through the update callback.
+	 *
+	 * The device driver must not fail lightly, any failure results in the
+	 * device process being killed.
+	 *
+	 * Return an error if the scheduled operation failed. Valid values are :
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * All other return values trigger a warning and are transformed to
+	 * -EIO.
+	 */
+	int (*lmem_fault)(struct hmm_mirror *mirror,
+			  unsigned long faddr,
+			  unsigned long laddr,
+			  unsigned long *pfns,
+			  struct hmm_fault *fault);
+};
+
+/* struct hmm_device - per device hmm structure
+ *
+ * @kref:       Reference count.
+ * @mirrors:    List of all active mirrors for the device.
+ * @mutex:      Mutex protecting mirrors list.
+ * @ops:        The hmm operations callback.
+ * @name:       Device name (uniquely identify the device on the system).
+ *
+ * Each device that wants to mirror an address space must register one of
+ * these structs (only once).
+ */
+struct hmm_device {
+	struct kref			kref;
+	struct list_head		mirrors;
+	struct mutex			mutex;
+	const struct hmm_device_ops	*ops;
+	const char			*name;
+};
+
+/* hmm_device_register() - register a device with hmm.
+ *
+ * @device: The hmm_device struct.
+ * @name:   A unique name string for the device (use in error messages).
+ * Returns: 0 on success, -EINVAL otherwise.
+ *
+ * Called when a device driver wants to register itself with hmm. A device
+ * driver can only register once. Registration takes a reference on the
+ * device, thus to release a device the driver must unreference it.
+ */
+int hmm_device_register(struct hmm_device *device, const char *name);
+
+struct hmm_device *hmm_device_ref(struct hmm_device *device);
+struct hmm_device *hmm_device_unref(struct hmm_device *device);
+
+
+
+
+/* hmm_mirror - device specific mirroring functions.
+ *
+ * Each device that mirrors a process has a unique hmm_mirror struct
+ * associating the process address space with the device. A process can be
+ * mirrored by several different devices at the same time.
+ */
+
+/* struct hmm_mirror - per device and per mm hmm structure
+ *
+ * @kref:       Reference count.
+ * @dlist:      List of all hmm_mirror for same device.
+ * @mlist:      List of all hmm_mirror for same mm.
+ * @device:     The hmm_device struct this hmm_mirror is associated to.
+ * @hmm:        The hmm struct this hmm_mirror is associated to.
+ * @dead:       The hmm_mirror is dead and should no longer be used.
+ *
+ * Each device that wants to mirror an address space must register one of
+ * these structs for each address space it wants to mirror. The same device
+ * can mirror several different address spaces, and the same address space can
+ * be mirrored by different devices.
+ */
+struct hmm_mirror {
+	struct kref		kref;
+	struct list_head	dlist;
+	struct list_head	mlist;
+	struct hmm_device	*device;
+	struct hmm		*hmm;
+	bool			dead;
+};
+
+/* hmm_mirror_register() - register a device mirror against an mm struct
+ *
+ * @mirror: The mirror that link process address space with the device.
+ * @device: The device struct to associate this mirror with.
+ * @mm:     The mm struct of the process.
+ * Returns: 0 success, -ENOMEM, -EBUSY or -EINVAL if process already mirrored.
+ *
+ * Called when a device driver wants to start mirroring a process address
+ * space. The hmm shim will register an mmu_notifier and start monitoring
+ * process address space changes. Hence callbacks to the device driver might
+ * happen even before this function returns.
+ *
+ * The mm pin must also be held (either the task is current or the caller used
+ * get_task_mm).
+ *
+ * Only one mirror per mm and hmm_device can be created, it will return -EINVAL
+ * if the hmm_device already has an hmm_mirror for the mm.
+ *
+ * If the mm or the previous hmm is in a transient state then this will return
+ * -EBUSY and the device driver must retry the call after unpinning the mm and
+ * checking again that the mm is valid.
+ *
+ * On success the mirror is returned with one reference for the caller, thus to
+ * release the mirror call hmm_mirror_unref.
+ */
+int hmm_mirror_register(struct hmm_mirror *mirror,
+			struct hmm_device *device,
+			struct mm_struct *mm);
+
+/* hmm_mirror_unregister() - unregister an hmm_mirror.
+ *
+ * @mirror: The mirror that link process address space with the device.
+ *
+ * Called when a device driver wants to stop mirroring a process address space.
+ */
+void hmm_mirror_unregister(struct hmm_mirror *mirror);
+
+/* struct hmm_fault - device mirror fault information
+ *
+ * @vma:    The vma into which the fault range is (set by hmm).
+ * @faddr:  First address of the range device want to fault (set by driver and
+ *          updated by hmm to the actual first faulted address).
+ * @laddr:  Last address of the range device want to fault (set by driver and
+ *          updated by hmm to the actual last faulted address).
+ * @pfns:   Array to hold the pfn value of each page in the range (provided by
+ *          device driver, big enough to hold (laddr - faddr) >> PAGE_SHIFT).
+ * @flags:  Fault flags (set by driver).
+ *
+ * This structure is given by the device driver to hmm_mirror_fault. The device
+ * driver can encapsulate the hmm_fault struct into its own fault structure and
+ * use that to provide private device driver information to the lmem_fault
+ * callback.
+ */
+struct hmm_fault {
+	struct vm_area_struct	*vma;
+	unsigned long		faddr;
+	unsigned long		laddr;
+	unsigned long		*pfns;
+	unsigned long		flags;
+};
+
+#define HMM_FAULT_WRITE		(1 << 0)
+
+/* hmm_mirror_fault() - called by the device driver on a device memory fault.
+ *
+ * @mirror:     The mirror linking the process address space with the device.
+ * @fault:      The mirror fault struct holding the fault range information.
+ *
+ * Called when the device is trying to access an invalid address in the device
+ * page table. The hmm shim will call lmem_fault with strong ordering with
+ * respect to calls to lmem_update (ie any information provided to lmem_fault
+ * is valid until the device callback returns).
+ *
+ * It will try to fault all pages in the range and give their pfns. If the vma
+ * covering the range needs to grow then it will.
+ *
+ * The fault will also clamp the requested range to the valid vma range (unless
+ * the vma into which fault->faddr falls can grow).
+ *
+ * All errors must be handled by the device driver and most likely result in
+ * the process device tasks being killed by the device driver.
+ *
+ * Returns:
+ * 0 on success (fault->laddr is updated to the end of the faulted range).
+ * -EINVAL if invalid argument.
+ * -ENOMEM if failing to allocate memory.
+ * -EACCES if trying to write to read only address (only for faddr).
+ * -EFAULT if trying to access an invalid address (only for faddr).
+ * -ENODEV if mirror is in process of being destroy.
+ */
+int hmm_mirror_fault(struct hmm_mirror *mirror,
+		     struct hmm_fault *fault);
+
+struct hmm_mirror *hmm_mirror_ref(struct hmm_mirror *mirror);
+struct hmm_mirror *hmm_mirror_unref(struct hmm_mirror *mirror);
+
+
+
+
+/* Functions used by core mm code. Device driver should not use any of them. */
+void __hmm_destroy(struct mm_struct *mm);
+static inline void hmm_destroy(struct mm_struct *mm)
+{
+	if (mm->hmm) {
+		__hmm_destroy(mm);
+	}
+}
+
+#else /* !CONFIG_HMM */
+
+static inline void hmm_destroy(struct mm_struct *mm)
+{
+}
+
+#endif /* !CONFIG_HMM */
+
+#endif
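
To illustrate how a driver consumes the pfns array described above, here is a
rough, hypothetical sketch of an lmem_fault implementation (not part of this
patch; foo_dev_map_page is an invented driver helper, only the HMM_PFN_* bits
and hmm_pfn_to_page come from this header):

static int foo_lmem_fault(struct hmm_mirror *mirror,
			  unsigned long faddr,
			  unsigned long laddr,
			  unsigned long *pfns,
			  struct hmm_fault *fault)
{
	unsigned long addr, i;

	for (addr = faddr, i = 0; addr < laddr; addr += PAGE_SIZE, ++i) {
		struct page *page = hmm_pfn_to_page(pfns[i]);
		bool write = test_bit(HMM_PFN_WRITE, &pfns[i]);

		/* Defensive: every pfn handed to lmem_fault should be valid. */
		if (!page)
			continue;

		/* Never map the special zero page with write permission. */
		if (test_bit(HMM_PFN_VALID_ZERO, &pfns[i]))
			write = false;

		/* Invented helper: write one entry of the device page table
		 * for this mirror, pointing at the given page.
		 */
		foo_dev_map_page(mirror, addr, page_to_pfn(page), write);
	}
	return 0;
}

Any entries written this way remain valid only until hmm calls lmem_update on
an overlapping range, at which point the driver must invalidate them.
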
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index de16272..8fa66cc 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -16,6 +16,10 @@
 #include <asm/page.h>
 #include <asm/mmu.h>
 
+#ifdef CONFIG_HMM
+struct hmm;
+#endif
+
 #ifndef AT_VECTOR_SIZE_ARCH
 #define AT_VECTOR_SIZE_ARCH 0
 #endif
@@ -425,6 +429,16 @@ struct mm_struct {
 #ifdef CONFIG_MMU_NOTIFIER
 	struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
+#ifdef CONFIG_HMM
+	/*
+	 * hmm always registers an mmu_notifier; we rely on the mmu notifier to
+	 * keep a refcount on the mm struct as well as to forbid registering
+	 * hmm on a dying mm.
+	 *
+	 * This field is set with mmap_sem held in write mode.
+	 */
+	struct hmm *hmm;
+#endif
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	pgtable_t pmd_huge_pte; /* protected by page_table_lock */
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 0d53eb0..56fce77 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -27,6 +27,7 @@
 #include <linux/binfmts.h>
 #include <linux/mman.h>
 #include <linux/mmu_notifier.h>
+#include <linux/hmm.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/vmacache.h>
@@ -602,6 +603,8 @@ void __mmdrop(struct mm_struct *mm)
 	mm_free_pgd(mm);
 	destroy_context(mm);
 	mmu_notifier_mm_destroy(mm);
+	/* hmm_destroy needs to be called after mmu_notifier_mm_destroy */
+	hmm_destroy(mm);
 	check_mm(mm);
 	free_mm(mm);
 }
@@ -820,6 +823,9 @@ static struct mm_struct *dup_mm(struct task_struct *tsk)
 
 	memcpy(mm, oldmm, sizeof(*mm));
 	mm_init_cpumask(mm);
+#ifdef CONFIG_HMM
+	mm->hmm = NULL;
+#endif
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	mm->pmd_huge_pte = NULL;
diff --git a/mm/Kconfig b/mm/Kconfig
index 30cb6cb..7836f17 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -584,3 +584,15 @@ config PGTABLE_MAPPING
 
 config GENERIC_EARLY_IOREMAP
 	bool
+
+config HMM
+	bool "Enable heterogeneous memory management (HMM)"
+	depends on MMU
+	select MMU_NOTIFIER
+	default n
+	help
+	  Heterogeneous memory management provides infrastructure for a device
+	  to mirror a process address space into a hardware mmu or into
+	  anything supporting pagefault-like events.
+
+	  If unsure, say N to disable hmm.
diff --git a/mm/Makefile b/mm/Makefile
index b484452..d231646 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -63,3 +63,4 @@ obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
 obj-$(CONFIG_ZBUD)	+= zbud.o
 obj-$(CONFIG_ZSMALLOC)	+= zsmalloc.o
 obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
+obj-$(CONFIG_HMM) += hmm.o
diff --git a/mm/hmm.c b/mm/hmm.c
new file mode 100644
index 0000000..2b8986c
--- /dev/null
+++ b/mm/hmm.c
@@ -0,0 +1,1194 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/* This is the core code for heterogeneous memory management (HMM). HMM intends
+ * to provide helpers for mirroring a process address space on a device as well
+ * as allowing migration of data between local memory and device memory.
+ *
+ * Refer to include/linux/hmm.h for further information on the general design.
+ */
+/* Locking :
+ *
+ *   To synchronize with various mm events there is a simple serialization of
+ *   events touching overlapping ranges of addresses. Each mm event is
+ *   associated with an hmm_event structure which stores the address range of
+ *   the event.
+ *
+ *   When a new mm event calls into hmm (most calls come through the
+ *   mmu_notifier callbacks) hmm allocates an hmm_event structure and waits for
+ *   all pending events that overlap with the new event.
+ *
+ *   To avoid deadlocks with mmap_sem the rule is to always allocate a new hmm
+ *   event after taking the mmap_sem lock. In the case of an mmu_notifier call
+ *   we do not take the mmap_sem lock as, if it was needed, it would have been
+ *   taken by the caller of the mmu_notifier API.
+ *
+ *   Hence hmm only needs to make sure to allocate a new hmm event after taking
+ *   the mmap_sem.
+ */
+#include <linux/export.h>
+#include <linux/bitmap.h>
+#include <linux/srcu.h>
+#include <linux/rculist.h>
+#include <linux/slab.h>
+#include <linux/mmu_notifier.h>
+#include <linux/mm.h>
+#include <linux/hugetlb.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/ksm.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/mmu_context.h>
+#include <linux/memcontrol.h>
+#include <linux/hmm.h>
+#include <linux/wait.h>
+#include <linux/interval_tree_generic.h>
+#include <linux/mman.h>
+#include <asm/tlb.h>
+#include <asm/tlbflush.h>
+#include <linux/delay.h>
+
+#include "internal.h"
+
+#define HMM_MAX_RANGE_BITS	(PAGE_SHIFT + 3UL)
+#define HMM_MAX_RANGE_SIZE	(PAGE_SIZE << HMM_MAX_RANGE_BITS)
+#define MM_MAX_SWAP_PAGES (swp_offset(pte_to_swp_entry(swp_entry_to_pte(swp_entry(0, ~0UL)))) + 1UL)
+#define HMM_MAX_ADDR		(((unsigned long)PTRS_PER_PGD) << ((unsigned long)PGDIR_SHIFT))
+
+#define HMM_MAX_EVENTS		16
+
+/* global SRCU for all MMs */
+static struct srcu_struct srcu;
+
+
+
+
+/* struct hmm_event - used to serialize changes to overlapping address ranges.
+ *
+ * @list:       Current event list for the corresponding hmm.
+ * @faddr:      First address (inclusive) for the range this event affects.
+ * @laddr:      Last address (exclusive) for the range this event affects.
+ * @fences:     List of device fences associated with this event.
+ * @etype:      Event type (munmap, migrate, truncate, ...).
+ * @backoff:    Should this event back off, ie a new event renders it obsolete.
+ */
+struct hmm_event {
+	struct list_head	list;
+	unsigned long		faddr;
+	unsigned long		laddr;
+	struct list_head	fences;
+	enum hmm_etype		etype;
+	bool			backoff;
+};
+
+/* struct hmm - per mm_struct hmm structure
+ *
+ * @mm:             The mm struct.
+ * @kref:           Reference counter
+ * @lock:           Serialize the mirror list modifications.
+ * @mirrors:        List of all mirror for this mm (one per device)
+ * @mmu_notifier:   The mmu_notifier of this mm
+ * @wait_queue:     Wait queue for synchronization btw cpu and device
+ * @events:         Events.
+ * @nevents:        Number of events currently happening.
+ * @dead:           The mm is being destroyed.
+ *
+ * For each process address space (mm_struct) there is one and only one hmm
+ * struct. hmm functions will redispatch to each device the changes made to
+ * the process address space.
+ */
+struct hmm {
+	struct mm_struct 	*mm;
+	struct kref		kref;
+	spinlock_t		lock;
+	struct list_head	mirrors;
+	struct list_head	pending;
+	struct mmu_notifier	mmu_notifier;
+	wait_queue_head_t	wait_queue;
+	struct hmm_event	events[HMM_MAX_EVENTS];
+	int			nevents;
+	bool			dead;
+};
+
+static struct mmu_notifier_ops hmm_notifier_ops;
+
+static inline struct hmm *hmm_ref(struct hmm *hmm);
+static inline struct hmm *hmm_unref(struct hmm *hmm);
+
+static int hmm_mirror_update(struct hmm_mirror *mirror,
+			     struct vm_area_struct *vma,
+			     unsigned long faddr,
+			     unsigned long laddr,
+			     struct hmm_event *event);
+static void hmm_mirror_cleanup(struct hmm_mirror *mirror);
+
+static int hmm_device_fence_wait(struct hmm_device *device,
+				 struct hmm_fence *fence);
+
+
+
+
+/* hmm_event - used to synchronize various mm events with each other.
+ *
+ * During the lifetime of a process various mm events will happen, hmm
+ * serializes events that affect overlapping ranges of addresses. The hmm_event
+ * structures are used for that purpose.
+ */
+
+static inline bool hmm_event_overlap(struct hmm_event *a, struct hmm_event *b)
+{
+	return !((a->laddr <= b->faddr) || (a->faddr >= b->laddr));
+}
+
+static inline unsigned long hmm_event_size(struct hmm_event *event)
+{
+	return (event->laddr - event->faddr);
+}
+
+
+
+
+/* hmm_fault_mm - used for reading the cpu page table on device fault.
+ *
+ * This code deals with reading the cpu page table to find the pages that are
+ * backing a range of addresses. It is used as a helper by the device page
+ * fault code.
+ */
+
+/* struct hmm_fault_mm - used for reading cpu page table on device fault.
+ *
+ * @mm:     The mm of the process the device fault is happening in.
+ * @vma:    The vma in which the fault is happening.
+ * @faddr:  The first address of the range the device wants to fault.
+ * @laddr:  The last address of the range the device wants to fault.
+ * @pfns:   Array of hmm pfns (contains the result of the fault).
+ * @write:  Is this a write fault.
+ */
+struct hmm_fault_mm {
+	struct mm_struct	*mm;
+	struct vm_area_struct	*vma;
+	unsigned long		faddr;
+	unsigned long		laddr;
+	unsigned long		*pfns;
+	bool			write;
+};
+
+static int hmm_fault_mm_fault_pmd(pmd_t *pmdp,
+				  unsigned long faddr,
+				  unsigned long laddr,
+				  struct mm_walk *walk)
+{
+	struct hmm_fault_mm *fault_mm = walk->private;
+	unsigned long idx, *pfns;
+	pte_t *ptep;
+
+	idx = (faddr - fault_mm->faddr) >> PAGE_SHIFT;
+	pfns = &fault_mm->pfns[idx];
+	memset(pfns, 0, ((laddr - faddr) >> PAGE_SHIFT) * sizeof(long));
+	if (pmd_none(*pmdp)) {
+		return -ENOENT;
+	}
+
+	if (pmd_trans_huge(*pmdp)) {
+		/* FIXME */
+		return -EINVAL;
+	}
+
+	if (pmd_none_or_trans_huge_or_clear_bad(pmdp)) {
+		return -EINVAL;
+	}
+
+	ptep = pte_offset_map(pmdp, faddr);
+	for (; faddr != laddr; ++ptep, ++pfns, faddr += PAGE_SIZE) {
+		pte_t pte = *ptep;
+
+		if (pte_none(pte)) {
+			if (fault_mm->write) {
+				ptep++;
+				break;
+			}
+			*pfns = my_zero_pfn(faddr) << HMM_PFN_SHIFT;
+			set_bit(HMM_PFN_VALID_ZERO, pfns);
+			continue;
+		}
+		if (!pte_present(pte) || (fault_mm->write && !pte_write(pte))) {
+			/* Need to inc ptep so unmap unlock on right pmd. */
+			ptep++;
+			break;
+		}
+
+		*pfns = pte_pfn(pte) << HMM_PFN_SHIFT;
+		set_bit(HMM_PFN_VALID_PAGE, pfns);
+		if (pte_write(pte)) {
+			set_bit(HMM_PFN_WRITE, pfns);
+		}
+		/* Consider the page as hot as a device wants to use it. */
+		mark_page_accessed(pfn_to_page(pte_pfn(pte)));
+		fault_mm->laddr = faddr + PAGE_SIZE;
+	}
+	pte_unmap(ptep - 1);
+
+	return (faddr == laddr) ? 0 : -ENOENT;
+}
+
+static int hmm_fault_mm_fault(struct hmm_fault_mm *fault_mm)
+{
+	struct mm_walk walk = {0};
+	unsigned long faddr, laddr;
+	int ret;
+
+	faddr = fault_mm->faddr;
+	laddr = fault_mm->laddr;
+	fault_mm->laddr = faddr;
+
+	walk.pmd_entry = hmm_fault_mm_fault_pmd;
+	walk.mm = fault_mm->mm;
+	walk.private = fault_mm;
+
+	ret = walk_page_range(faddr, laddr, &walk);
+	return ret;
+}
+
+
+
+
+/* hmm - core hmm functions.
+ *
+ * Core hmm functions that deal with all the process mm activities and use
+ * events for synchronization. Those functions are used mostly as a result of
+ * cpu mm events.
+ */
+
+static int hmm_init(struct hmm *hmm, struct mm_struct *mm)
+{
+	int i, ret;
+
+	hmm->mm = mm;
+	kref_init(&hmm->kref);
+	INIT_LIST_HEAD(&hmm->mirrors);
+	INIT_LIST_HEAD(&hmm->pending);
+	spin_lock_init(&hmm->lock);
+	init_waitqueue_head(&hmm->wait_queue);
+
+	for (i = 0; i < HMM_MAX_EVENTS; ++i) {
+		hmm->events[i].etype = HMM_NONE;
+		INIT_LIST_HEAD(&hmm->events[i].fences);
+	}
+
+	/* register notifier */
+	hmm->mmu_notifier.ops = &hmm_notifier_ops;
+	ret = __mmu_notifier_register(&hmm->mmu_notifier, mm);
+	return ret;
+}
+
+static enum hmm_etype hmm_event_mmu(enum mmu_action action)
+{
+	switch (action) {
+	case MMU_MPROT_RONLY:
+		return HMM_MPROT_RONLY;
+	case MMU_MPROT_RANDW:
+		return HMM_MPROT_RANDW;
+	case MMU_MPROT_WONLY:
+		return HMM_MPROT_WONLY;
+	case MMU_COW:
+		return HMM_COW;
+	case MMU_MPROT_NONE:
+	case MMU_KSM:
+	case MMU_KSM_RONLY:
+	case MMU_UNMAP:
+	case MMU_VMSCAN:
+	case MMU_MUNLOCK:
+	case MMU_MIGRATE:
+	case MMU_FILE_WB:
+	case MMU_FAULT_WP:
+	case MMU_THP_SPLIT:
+	case MMU_THP_FAULT_WP:
+		return HMM_UNMAP;
+	case MMU_POISON:
+	case MMU_MREMAP:
+	case MMU_MUNMAP:
+		return HMM_MUNMAP;
+	case MMU_SOFT_DIRTY:
+	default:
+		return HMM_NONE;
+	}
+}
+
+static void hmm_event_unqueue_locked(struct hmm *hmm, struct hmm_event *event)
+{
+	list_del_init(&event->list);
+	event->etype = HMM_NONE;
+	hmm->nevents--;
+}
+
+static void hmm_event_unqueue(struct hmm *hmm, struct hmm_event *event)
+{
+	spin_lock(&hmm->lock);
+	list_del_init(&event->list);
+	event->etype = HMM_NONE;
+	hmm->nevents--;
+	spin_unlock(&hmm->lock);
+}
+
+static void hmm_destroy_kref(struct kref *kref)
+{
+	struct hmm *hmm;
+	struct mm_struct *mm;
+
+	hmm = container_of(kref, struct hmm, kref);
+	mm = hmm->mm;
+	mm->hmm = NULL;
+	mmu_notifier_unregister(&hmm->mmu_notifier, mm);
+
+	if (!list_empty(&hmm->mirrors)) {
+		WARN(1, "destroying an hmm with still active mirror\n"
+		     "Leaking memory instead to avoid something worse.\n");
+		return;
+	}
+	kfree(hmm);
+}
+
+static inline struct hmm *hmm_ref(struct hmm *hmm)
+{
+	if (hmm) {
+		kref_get(&hmm->kref);
+		return hmm;
+	}
+	return NULL;
+}
+
+static inline struct hmm *hmm_unref(struct hmm *hmm)
+{
+	if (hmm) {
+		kref_put(&hmm->kref, hmm_destroy_kref);
+	}
+	return NULL;
+}
+
+static struct hmm_event *hmm_event_get(struct hmm *hmm,
+				       unsigned long faddr,
+				       unsigned long laddr,
+				       enum hmm_etype etype)
+{
+	struct hmm_event *event, *wait = NULL;
+	enum hmm_etype wait_type;
+	unsigned id;
+
+	do {
+		wait_event(hmm->wait_queue, hmm->nevents < HMM_MAX_EVENTS);
+		spin_lock(&hmm->lock);
+		for (id = 0; id < HMM_MAX_EVENTS; ++id) {
+			if (hmm->events[id].etype == HMM_NONE) {
+				event = &hmm->events[id];
+				goto out;
+			}
+		}
+		spin_unlock(&hmm->lock);
+	} while (1);
+
+out:
+	event->etype = etype;
+	event->faddr = faddr;
+	event->laddr = laddr;
+	event->backoff = false;
+	INIT_LIST_HEAD(&event->fences);
+	hmm->nevents++;
+	list_add_tail(&event->list, &hmm->pending);
+
+retry_wait:
+	wait = event;
+	list_for_each_entry_continue_reverse (wait, &hmm->pending, list) {
+		if (!hmm_event_overlap(event, wait)) {
+			continue;
+		}
+		switch (event->etype) {
+		case HMM_UNMAP:
+		case HMM_MUNMAP:
+			switch (wait->etype) {
+			case HMM_DEVICE_FAULT:
+			case HMM_MIGRATE_TO_RMEM:
+				wait->backoff = true;
+				/* fall through */
+			default:
+				wait_type = wait->etype;
+				goto wait;
+			}
+		default:
+			wait_type = wait->etype;
+			goto wait;
+		}
+	}
+	spin_unlock(&hmm->lock);
+
+	return event;
+
+wait:
+	spin_unlock(&hmm->lock);
+	wait_event(hmm->wait_queue, wait->etype != wait_type);
+	spin_lock(&hmm->lock);
+	goto retry_wait;
+}
+
+static void hmm_update_mirrors(struct hmm *hmm,
+			       struct vm_area_struct *vma,
+			       struct hmm_event *event)
+{
+	unsigned long faddr, laddr;
+
+	for (faddr = event->faddr; faddr < event->laddr; faddr = laddr) {
+		struct hmm_mirror *mirror;
+		struct hmm_fence *fence = NULL, *tmp;
+		int ticket;
+
+		laddr = event->laddr;
+
+retry_ranges:
+		ticket = srcu_read_lock(&srcu);
+		/* Because of a retry we might already have scheduled updates
+		 * for some mirrors; skip those.
+		 */
+		mirror = list_first_entry(&hmm->mirrors,
+					  struct hmm_mirror,
+					  mlist);
+		mirror = fence ? fence->mirror : mirror;
+		list_for_each_entry_continue (mirror, &hmm->mirrors, mlist) {
+			int r;
+
+			r = hmm_mirror_update(mirror, vma, faddr, laddr, event);
+			if (r) {
+				srcu_read_unlock(&srcu, ticket);
+				hmm_mirror_cleanup(mirror);
+				goto retry_ranges;
+			}
+		}
+		srcu_read_unlock(&srcu, ticket);
+
+		list_for_each_entry_safe (fence, tmp, &event->fences, list) {
+			struct hmm_device *device;
+			int r;
+
+			mirror = fence->mirror;
+			device = mirror->device;
+
+			r = hmm_device_fence_wait(device, fence);
+			if (r) {
+				hmm_mirror_cleanup(mirror);
+			}
+		}
+	}
+}
+
+static int hmm_fault_mm(struct hmm *hmm,
+			struct vm_area_struct *vma,
+			unsigned long faddr,
+			unsigned long laddr,
+			bool write)
+{
+	int r;
+
+	if (laddr <= faddr) {
+		return -EINVAL;
+	}
+
+	for (; faddr < laddr; faddr += PAGE_SIZE) {
+		unsigned flags = 0;
+
+		flags |= write ? FAULT_FLAG_WRITE : 0;
+		flags |= FAULT_FLAG_ALLOW_RETRY;
+		do {
+			r = handle_mm_fault(hmm->mm, vma, faddr, flags);
+			if (!(r & VM_FAULT_RETRY) && (r & VM_FAULT_ERROR)) {
+				if (r & VM_FAULT_OOM) {
+					return -ENOMEM;
+				}
+				/* Same error code for all other cases. */
+				return -EFAULT;
+			}
+			flags &= ~FAULT_FLAG_ALLOW_RETRY;
+		} while (r & VM_FAULT_RETRY);
+	}
+
+	return 0;
+}
+
+
+
+
+/* hmm_notifier - mmu_notifier callbacks tracking changes to the process mm.
+ *
+ * Callbacks for the mmu notifier. We use the mmu notifier to track changes
+ * made to the process address space.
+ *
+ * Note that none of these callbacks needs to take a reference, as we are sure
+ * that the mm won't be destroyed, thus hmm won't be destroyed either, and it
+ * is fine if some hmm_mirror/hmm_device are destroyed during those callbacks
+ * because this is serialized through either the hmm lock or the device lock.
+ */
+
+static void hmm_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	struct hmm *hmm;
+
+	if (!(hmm = hmm_ref(mm->hmm)) || hmm->dead) {
+		/* Already clean. */
+		hmm_unref(hmm);
+		return;
+	}
+
+	hmm->dead = true;
+
+	/*
+	 * hmm->lock allows synchronization with hmm_mirror_unregister() so
+	 * that an hmm_mirror can be removed only once.
+	 */
+	spin_lock(&hmm->lock);
+	while (unlikely(!list_empty(&hmm->mirrors))) {
+		struct hmm_mirror *mirror;
+		struct hmm_device *device;
+
+		mirror = list_first_entry(&hmm->mirrors,
+					  struct hmm_mirror,
+					  mlist);
+		device = mirror->device;
+		if (!mirror->dead) {
+			/* Mark the mirror as dead and remove it from the
+			 * mirror list before freeing up any of its resources.
+			 */
+			mirror->dead = true;
+			list_del_init(&mirror->mlist);
+			spin_unlock(&hmm->lock);
+
+			synchronize_srcu(&srcu);
+
+			device->ops->mirror_release(mirror);
+			hmm_mirror_cleanup(mirror);
+			spin_lock(&hmm->lock);
+		}
+	}
+	spin_unlock(&hmm->lock);
+	hmm_unref(hmm);
+}
+
+static void hmm_notifier_invalidate_range_start(struct mmu_notifier *mn,
+						struct mm_struct *mm,
+						struct vm_area_struct *vma,
+						unsigned long faddr,
+						unsigned long laddr,
+						enum mmu_action action)
+{
+	struct hmm_event *event;
+	enum hmm_etype etype;
+	struct hmm *hmm;
+
+	if (!(hmm = hmm_ref(mm->hmm))) {
+		return;
+	}
+
+	etype = hmm_event_mmu(action);
+	switch (etype) {
+	case HMM_NONE:
+		hmm_unref(hmm);
+		return;
+	default:
+		break;
+	}
+
+	faddr = faddr & PAGE_MASK;
+	laddr = PAGE_ALIGN(laddr);
+
+	event = hmm_event_get(hmm, faddr, laddr, etype);
+	hmm_update_mirrors(hmm, vma, event);
+	/* Do not drop hmm reference here but in the range_end instead. */
+}
+
+static void hmm_notifier_invalidate_range_end(struct mmu_notifier *mn,
+					      struct mm_struct *mm,
+					      struct vm_area_struct *vma,
+					      unsigned long faddr,
+					      unsigned long laddr,
+					      enum mmu_action action)
+{
+	struct hmm_event *event = NULL;
+	enum hmm_etype etype;
+	struct hmm *hmm;
+	int i;
+
+	if (!(hmm = mm->hmm)) {
+		return;
+	}
+
+	etype = hmm_event_mmu(action);
+	switch (etype) {
+	case HMM_NONE:
+		return;
+	default:
+		break;
+	}
+
+	faddr = faddr & PAGE_MASK;
+	laddr = PAGE_ALIGN(laddr);
+
+	spin_lock(&hmm->lock);
+	for (i = 0; i < HMM_MAX_EVENTS; ++i, event = NULL) {
+		event = &hmm->events[i];
+		if (event->etype == etype &&
+		    event->faddr == faddr &&
+		    event->laddr == laddr &&
+		    !list_empty(&event->list)) {
+			hmm_event_unqueue_locked(hmm, event);
+			break;
+		}
+	}
+	spin_unlock(&hmm->lock);
+
+	/* Drop reference from invalidate_range_start. */
+	hmm_unref(hmm);
+}
+
+static void hmm_notifier_invalidate_page(struct mmu_notifier *mn,
+					 struct mm_struct *mm,
+					 struct vm_area_struct *vma,
+					 unsigned long faddr,
+					 enum mmu_action action)
+{
+	unsigned long laddr;
+	struct hmm_event *event;
+	enum hmm_etype etype;
+	struct hmm *hmm;
+
+	if (!(hmm = hmm_ref(mm->hmm))) {
+		return;
+	}
+
+	etype = hmm_event_mmu(action);
+	switch (etype) {
+	case HMM_NONE:
+		return;
+	default:
+		break;
+	}
+
+	faddr = faddr & PAGE_MASK;
+	laddr = faddr + PAGE_SIZE;
+
+	event = hmm_event_get(hmm, faddr, laddr, etype);
+	hmm_update_mirrors(hmm, vma, event);
+	hmm_event_unqueue(hmm, event);
+	hmm_unref(hmm);
+}
+
+static struct mmu_notifier_ops hmm_notifier_ops = {
+	.release		= hmm_notifier_release,
+	/* .clear_flush_young FIXME we probably want to do something. */
+	/* .test_young FIXME we probably want to do something. */
+	/* WARNING: .change_pte must always be bracketed by range_start/end.
+	 * There were patches to remove that behavior; we must make sure those
+	 * patches are not included, as an alternative solution to the issue
+	 * they are trying to solve can be used.
+	 *
+	 * In any case hmm can not use the change_pte callback as non sleeping
+	 * locks are held during the change_pte callback.
+	 */
+	.change_pte		= NULL,
+	.invalidate_page	= hmm_notifier_invalidate_page,
+	.invalidate_range_start	= hmm_notifier_invalidate_range_start,
+	.invalidate_range_end	= hmm_notifier_invalidate_range_end,
+};
+
+
+
+
+/* hmm_mirror - per device mirroring functions.
+ *
+ * Each device that mirrors a process has a unique hmm_mirror struct. A process
+ * can be mirrored by several devices at the same time.
+ *
+ * Below are all the functions and their helpers used by device drivers to
+ * mirror the process address space. Those functions either deal with updating
+ * the device page table (through the hmm callbacks) or provide helpers used
+ * by the device driver to fault in a range of memory in the device page table.
+ */
+
+static int hmm_mirror_update(struct hmm_mirror *mirror,
+			     struct vm_area_struct *vma,
+			     unsigned long faddr,
+			     unsigned long laddr,
+			     struct hmm_event *event)
+{
+	struct hmm_device *device = mirror->device;
+	struct hmm_fence *fence;
+	bool dirty = !!(vma->vm_file);
+
+	fence = device->ops->lmem_update(mirror, faddr, laddr,
+					 event->etype, dirty);
+	if (fence) {
+		if (IS_ERR(fence)) {
+			return PTR_ERR(fence);
+		}
+		fence->mirror = mirror;
+		list_add_tail(&fence->list, &event->fences);
+	}
+	return 0;
+}
+
+static void hmm_mirror_cleanup(struct hmm_mirror *mirror)
+{
+	struct vm_area_struct *vma;
+	struct hmm_device *device = mirror->device;
+	struct hmm_event *event;
+	unsigned long faddr, laddr;
+	struct hmm *hmm = mirror->hmm;
+
+	spin_lock(&hmm->lock);
+	if (mirror->dead) {
+		spin_unlock(&hmm->lock);
+		return;
+	}
+	mirror->dead = true;
+	list_del(&mirror->mlist);
+	spin_unlock(&hmm->lock);
+	synchronize_srcu(&srcu);
+	INIT_LIST_HEAD(&mirror->mlist);
+
+
+	event = hmm_event_get(hmm, 0UL, HMM_MAX_ADDR, HMM_UNREGISTER);
+	faddr = 0UL;
+	vma = find_vma(hmm->mm, faddr);
+	for (; vma && (faddr < HMM_MAX_ADDR); faddr = laddr) {
+		struct hmm_fence *fence, *next;
+
+		faddr = max(faddr, vma->vm_start);
+		laddr = vma->vm_end;
+
+		hmm_mirror_update(mirror, vma, faddr, laddr, event);
+		list_for_each_entry_safe (fence, next, &event->fences, list) {
+			hmm_device_fence_wait(device, fence);
+		}
+
+		if (laddr >= vma->vm_end) {
+			vma = vma->vm_next;
+		}
+	}
+	hmm_event_unqueue(hmm, event);
+
+	mutex_lock(&device->mutex);
+	list_del_init(&mirror->dlist);
+	mutex_unlock(&device->mutex);
+
+	mirror->hmm = hmm_unref(hmm);
+	hmm_mirror_unref(mirror);
+}
+
+static void hmm_mirror_destroy(struct kref *kref)
+{
+	struct hmm_mirror *mirror;
+	struct hmm_device *device;
+
+	mirror = container_of(kref, struct hmm_mirror, kref);
+	device = mirror->device;
+
+	BUG_ON(!list_empty(&mirror->mlist));
+	BUG_ON(!list_empty(&mirror->dlist));
+
+	device->ops->mirror_destroy(mirror);
+	hmm_device_unref(device);
+}
+
+struct hmm_mirror *hmm_mirror_ref(struct hmm_mirror *mirror)
+{
+	if (mirror) {
+		kref_get(&mirror->kref);
+		return mirror;
+	}
+	return NULL;
+}
+EXPORT_SYMBOL(hmm_mirror_ref);
+
+struct hmm_mirror *hmm_mirror_unref(struct hmm_mirror *mirror)
+{
+	if (mirror) {
+		kref_put(&mirror->kref, hmm_mirror_destroy);
+	}
+	return NULL;
+}
+EXPORT_SYMBOL(hmm_mirror_unref);
+
+int hmm_mirror_register(struct hmm_mirror *mirror,
+			struct hmm_device *device,
+			struct mm_struct *mm)
+{
+	struct hmm *hmm = NULL;
+	int ret = 0;
+
+	/* Sanity checks. */
+	BUG_ON(!mirror);
+	BUG_ON(!device);
+	BUG_ON(!mm);
+
+	/* Take reference on device only on success. */
+	kref_init(&mirror->kref);
+	mirror->device = device;
+	mirror->dead = false;
+	INIT_LIST_HEAD(&mirror->mlist);
+	INIT_LIST_HEAD(&mirror->dlist);
+
+	down_write(&mm->mmap_sem);
+	if (mm->hmm == NULL) {
+		/* no hmm registered yet so register one */
+		hmm = kzalloc(sizeof(*mm->hmm), GFP_KERNEL);
+		if (hmm == NULL) {
+			ret = -ENOMEM;
+			goto out_cleanup;
+		}
+
+		ret = hmm_init(hmm, mm);
+		if (ret) {
+			kfree(hmm);
+			hmm = NULL;
+			goto out_cleanup;
+		}
+
+		/* Now set hmm; make sure no mmu notifier callback can be called. */
+		ret = mm_take_all_locks(mm);
+		if (unlikely(ret)) {
+			goto out_cleanup;
+		}
+		mm->hmm = hmm;
+		mirror->hmm = hmm;
+		hmm = NULL;
+	} else {
+		struct hmm_mirror *tmp;
+		int id;
+
+		id = srcu_read_lock(&srcu);
+		list_for_each_entry(tmp, &mm->hmm->mirrors, mlist) {
+			if (tmp->device == mirror->device) {
+				/* A process can be mirrored only once by the
+				 * same device.
+				 */
+				srcu_read_unlock(&srcu, id);
+				ret = -EINVAL;
+				goto out_cleanup;
+			}
+		}
+		srcu_read_unlock(&srcu, id);
+
+		ret = mm_take_all_locks(mm);
+		if (unlikely(ret)) {
+			goto out_cleanup;
+		}
+		mirror->hmm = hmm_ref(mm->hmm);
+	}
+
+	/*
+	 * A side note: hmm_notifier_release() can't run concurrently with
+	 * us because we hold the mm_users pin (either implicitly as
+	 * current->mm or explicitly with get_task_mm() or similar).
+	 *
+	 * We can't race against any other mmu notifier method either
+	 * thanks to mm_take_all_locks().
+	 */
+	spin_lock(&mm->hmm->lock);
+	list_add_rcu(&mirror->mlist, &mm->hmm->mirrors);
+	spin_unlock(&mm->hmm->lock);
+	mm_drop_all_locks(mm);
+
+out_cleanup:
+	if (hmm) {
+		mmu_notifier_unregister(&hmm->mmu_notifier, mm);
+		kfree(hmm);
+	}
+	up_write(&mm->mmap_sem);
+
+	if (!ret) {
+		struct hmm_device *device = mirror->device;
+
+		hmm_device_ref(device);
+		mutex_lock(&device->mutex);
+		list_add(&mirror->dlist, &device->mirrors);
+		mutex_unlock(&device->mutex);
+	}
+	return ret;
+}
+EXPORT_SYMBOL(hmm_mirror_register);
+
+void hmm_mirror_unregister(struct hmm_mirror *mirror)
+{
+	struct hmm *hmm;
+
+	if (!mirror) {
+		return;
+	}
+	hmm = hmm_ref(mirror->hmm);
+	if (!hmm) {
+		return;
+	}
+
+	down_read(&hmm->mm->mmap_sem);
+	hmm_mirror_cleanup(mirror);
+	up_read(&hmm->mm->mmap_sem);
+	hmm_unref(hmm);
+}
+EXPORT_SYMBOL(hmm_mirror_unregister);
+
+static int hmm_mirror_lmem_fault(struct hmm_mirror *mirror,
+				 struct hmm_fault *fault,
+				 unsigned long faddr,
+				 unsigned long laddr,
+				 unsigned long *pfns)
+{
+	struct hmm_device *device = mirror->device;
+	int ret;
+
+	ret = device->ops->lmem_fault(mirror, faddr, laddr, pfns, fault);
+	return ret;
+}
+
+/* see include/linux/hmm.h */
+int hmm_mirror_fault(struct hmm_mirror *mirror,
+		     struct hmm_fault *fault)
+{
+	struct vm_area_struct *vma;
+	struct hmm_event *event;
+	unsigned long caddr, naddr, vm_flags;
+	struct hmm *hmm;
+	bool do_fault = false, write;
+	int ret = 0;
+
+	if (!mirror || !fault || fault->faddr >= fault->laddr) {
+		return -EINVAL;
+	}
+	if (mirror->dead) {
+		return -ENODEV;
+	}
+	hmm = mirror->hmm;
+
+	write = !!(fault->flags & HMM_FAULT_WRITE);
+	fault->faddr = fault->faddr & PAGE_MASK;
+	fault->laddr = PAGE_ALIGN(fault->laddr);
+	caddr = fault->faddr;
+	naddr = fault->laddr;
+	/* FIXME: arbitrary value; clamp faults to 4MB at a time. */
+	if ((fault->laddr - fault->faddr) > (4UL << 20UL)) {
+		fault->laddr = fault->faddr + (4UL << 20UL);
+	}
+	hmm_mirror_ref(mirror);
+
+retry:
+	down_read(&hmm->mm->mmap_sem);
+	event = hmm_event_get(hmm, caddr, naddr, HMM_DEVICE_FAULT);
+	/* FIXME handle gate area ? and guard page */
+	vma = find_extend_vma(hmm->mm, caddr);
+	if (!vma) {
+		if (caddr > fault->faddr) {
+			/* Fault succeeded up to caddr. */
+			fault->laddr = caddr;
+			ret = 0;
+			goto out;
+		}
+		/* Allow the device driver to learn the first valid address in
+		 * the range it was trying to fault in, so it can restart the
+		 * fault at this address.
+		 */
+		vma = find_vma_intersection(hmm->mm, event->faddr, event->laddr);
+		if (vma) {
+			fault->laddr = vma->vm_start;
+		}
+		ret = -EFAULT;
+		goto out;
+	}
+	/* FIXME support HUGETLB */
+	if ((vma->vm_flags & (VM_IO | VM_PFNMAP | VM_MIXEDMAP | VM_HUGETLB))) {
+		ret = -EFAULT;
+		goto out;
+	}
+	vm_flags = write ? VM_WRITE : VM_READ;
+	if (!(vma->vm_flags & vm_flags)) {
+		ret = -EACCES;
+		goto out;
+	}
+	/* Adjust range to this vma only. */
+	fault->laddr = naddr = event->laddr = min(event->laddr, vma->vm_end);
+	fault->vma = vma;
+
+	for (; caddr < event->laddr;) {
+		struct hmm_fault_mm fault_mm;
+
+		fault_mm.mm = vma->vm_mm;
+		fault_mm.vma = vma;
+		fault_mm.faddr = caddr;
+		fault_mm.laddr = naddr;
+		fault_mm.pfns = fault->pfns;
+		fault_mm.write = write;
+		ret = hmm_fault_mm_fault(&fault_mm);
+		if (ret == -ENOENT && fault_mm.laddr == caddr) {
+			do_fault = true;
+			goto out;
+		}
+		if (ret && ret != -ENOENT) {
+			goto out;
+		}
+		if (mirror->dead) {
+			ret = -ENODEV;
+			goto out;
+		}
+		if (event->backoff) {
+			ret = -EAGAIN;
+			goto out;
+		}
+
+		ret = hmm_mirror_lmem_fault(mirror, fault,
+					    fault_mm.faddr,
+					    fault_mm.laddr,
+					    fault_mm.pfns);
+		if (ret) {
+			goto out;
+		}
+		caddr = fault_mm.laddr;
+		naddr = event->laddr;
+	}
+
+out:
+	hmm_event_unqueue(hmm, event);
+	if (do_fault && !event->backoff && !mirror->dead) {
+		do_fault = false;
+		ret = hmm_fault_mm(hmm, vma, caddr, naddr, write);
+		if (!ret) {
+			ret = -ENOENT;
+		}
+	}
+	wake_up(&hmm->wait_queue);
+	up_read(&hmm->mm->mmap_sem);
+	if (ret == -ENOENT) {
+		if (!mirror->dead) {
+			naddr = fault->laddr;
+			goto retry;
+		}
+		ret = -ENODEV;
+	}
+	hmm_mirror_unref(mirror);
+	return ret;
+}
+EXPORT_SYMBOL(hmm_mirror_fault);
+
+
+
+
+/* hmm_device - Each device driver must register one and only one hmm_device
+ *
+ * The hmm_device is the link between hmm and each device driver.
+ */
+
+static void hmm_device_destroy(struct kref *kref)
+{
+	struct hmm_device *device;
+
+	device = container_of(kref, struct hmm_device, kref);
+	BUG_ON(!list_empty(&device->mirrors));
+
+	device->ops->device_destroy(device);
+}
+
+struct hmm_device *hmm_device_ref(struct hmm_device *device)
+{
+	if (device) {
+		kref_get(&device->kref);
+		return device;
+	}
+	return NULL;
+}
+EXPORT_SYMBOL(hmm_device_ref);
+
+struct hmm_device *hmm_device_unref(struct hmm_device *device)
+{
+	if (device) {
+		kref_put(&device->kref, hmm_device_destroy);
+	}
+	return NULL;
+}
+EXPORT_SYMBOL(hmm_device_unref);
+
+/* see include/linux/hmm.h */
+int hmm_device_register(struct hmm_device *device, const char *name)
+{
+	/* sanity check */
+	BUG_ON(!device);
+	BUG_ON(!device->ops);
+	BUG_ON(!device->ops->device_destroy);
+	BUG_ON(!device->ops->mirror_release);
+	BUG_ON(!device->ops->mirror_destroy);
+	BUG_ON(!device->ops->fence_wait);
+	BUG_ON(!device->ops->lmem_update);
+	BUG_ON(!device->ops->lmem_fault);
+
+	kref_init(&device->kref);
+	device->name = name;
+	mutex_init(&device->mutex);
+	INIT_LIST_HEAD(&device->mirrors);
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_device_register);
+
+static int hmm_device_fence_wait(struct hmm_device *device,
+				 struct hmm_fence *fence)
+{
+	int ret;
+
+	if (fence == NULL) {
+		return 0;
+	}
+
+	list_del_init(&fence->list);
+	do {
+		io_schedule();
+		ret = device->ops->fence_wait(fence);
+	} while (ret == -EAGAIN);
+
+	return ret;
+}
+
+
+
+
+/* This is called after the last hmm_notifier_release() has returned. */
+void __hmm_destroy(struct mm_struct *mm)
+{
+	kref_put(&mm->hmm->kref, hmm_destroy_kref);
+}
+
+static int __init hmm_module_init(void)
+{
+	int ret;
+
+	ret = init_srcu_struct(&srcu);
+	if (ret) {
+		return ret;
+	}
+	return 0;
+}
+module_init(hmm_module_init);
+
+static void __exit hmm_module_exit(void)
+{
+	cleanup_srcu_struct(&srcu);
+}
+module_exit(hmm_module_exit);
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 07/11] hmm: support moving anonymous page to remote memory
  2014-05-02 13:51 ` j.glisse
  (?)
@ 2014-05-02 13:52   ` j.glisse
  -1 siblings, 0 replies; 107+ messages in thread
From: j.glisse @ 2014-05-02 13:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel
  Cc: Jérôme Glisse, Sherry Cheung, Subhash Gutti,
	Mark Hairgrove, John Hubbard, Jatin Kumar

From: Jérôme Glisse <jglisse@redhat.com>

Motivation:

Migrating to device memory allows the device to access memory through a link
with far greater bandwidth as well as lower latency. Migration to device
memory is of course only meaningful if the memory will only be accessed by the
device over a long period of time.

Because hmm aims only to provide an API to facilitate such use, it does not
deal with policy on when, what and where to migrate to remote memory. It is
expected that device drivers using hmm will have the information to make such
choices.

Implementation:

This uses a two level structure to track remote memory. The first level is a
range structure that matches a range of addresses with a specific remote
memory object. This allows different ranges of addresses to point to the same
remote memory object (useful for shared memory).

The second level is a structure holding hmm-specific information about the
remote memory. This remote memory structure is allocated by the device driver
and thus can be embedded inside the remote memory structure that is specific
to the device driver, as sketched below.
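
A rough sketch (not part of this patch; every foo_* name is hypothetical) of a
driver-side object embedding struct hmm_rmem together with the matching
rmem_alloc and rmem_destroy callbacks added by this patch. Since
hmm_rmem_init() is internal to mm/hmm.c, the embedded structure is expected to
be initialized by the hmm core rather than by the driver:

struct foo_rmem {
	struct hmm_rmem	rmem;		/* hmm bookkeeping (uid range, pfns, ...) */
	void		*dev_private;	/* driver specific device memory handle */
};

static struct hmm_rmem *foo_rmem_alloc(struct hmm_device *device,
				       struct hmm_fault *fault)
{
	struct foo_rmem *frmem = kzalloc(sizeof(*frmem), GFP_KERNEL);

	if (!frmem)
		return ERR_PTR(-ENOMEM);	/* no memory, migration will not proceed */
	return &frmem->rmem;
}

static void foo_rmem_destroy(struct hmm_rmem *rmem)
{
	/* Called once all references on the rmem are gone. */
	kfree(container_of(rmem, struct foo_rmem, rmem));
}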

Each remote memory object is given a range of unique ids. Those unique ids are
used to create special hmm swap entries. For anonymous memory the cpu page
table entries are set to this hmm swap entry, and on cpu page fault the unique
id is used to find the remote memory and migrate it back to system memory.
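
For illustration only (the function below is hypothetical, not part of the
patch), decoding such an entry relies solely on the helpers this patch adds to
include/linux/swapops.h:

static bool example_entry_is_remote(swp_entry_t entry)
{
	if (!is_hmm_entry(entry))
		return false;
	/* The swap offset carries the rmem unique id; mm/hmm.c uses it to
	 * look up the hmm_rmem object and migrate it back on cpu fault.
	 */
	pr_debug("hmm swap entry, uid %lu\n", hmm_entry_uid(entry));
	return true;
}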

Events other than a cpu page fault can also trigger migration back to system
memory. For instance on fork, to simplify things, the remote memory is
migrated back to system memory.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 include/linux/hmm.h          |  469 ++++++++-
 include/linux/mmu_notifier.h |    1 +
 include/linux/swap.h         |   12 +-
 include/linux/swapops.h      |   33 +-
 mm/hmm.c                     | 2307 ++++++++++++++++++++++++++++++++++++++++--
 mm/memcontrol.c              |   46 +
 mm/memory.c                  |    7 +
 7 files changed, 2768 insertions(+), 107 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index e9c7722..96f41c4 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -56,10 +56,10 @@
 
 struct hmm_device;
 struct hmm_device_ops;
-struct hmm_migrate;
 struct hmm_mirror;
 struct hmm_fault;
 struct hmm_event;
+struct hmm_rmem;
 struct hmm;
 
 /* The hmm provide page informations to the device using hmm pfn value. Below
@@ -67,15 +67,34 @@ struct hmm;
  * type of page, dirty page, page is locked or not, ...).
  *
  *   HMM_PFN_VALID_PAGE this means the pfn correspond to valid page.
- *   HMM_PFN_VALID_ZERO this means the pfn is the special zero page.
+ *   HMM_PFN_VALID_ZERO this means the pfn is the special zero page; either use
+ *     it or directly clear the rmem with zeros, whatever is the fastest method
+ *     for the device.
  *   HMM_PFN_DIRTY set when the page is dirty.
  *   HMM_PFN_WRITE is set if there is no need to call page_mkwrite
+ *   HMM_PFN_LOCK is only set while the rmem object is undergoing migration.
+ *   HMM_PFN_LMEM_UPTODATE the page in the rmem pfn array is up to date.
+ *   HMM_PFN_RMEM_UPTODATE the rmem copy of the page is up to date.
+ *
+ * Device drivers only need to worry about :
+ *   HMM_PFN_VALID_PAGE
+ *   HMM_PFN_VALID_ZERO
+ *   HMM_PFN_DIRTY
+ *   HMM_PFN_WRITE
+ * Device drivers must set/clear the following flags after successful dma :
+ *   HMM_PFN_LMEM_UPTODATE
+ *   HMM_PFN_RMEM_UPTODATE
+ * All the other flags are for hmm internal use only.
  */
 #define HMM_PFN_SHIFT		(PAGE_SHIFT)
+#define HMM_PFN_CLEAR		(((1UL << HMM_PFN_SHIFT) - 1UL) & ~0x3UL)
 #define HMM_PFN_VALID_PAGE	(0UL)
 #define HMM_PFN_VALID_ZERO	(1UL)
 #define HMM_PFN_DIRTY		(2UL)
 #define HMM_PFN_WRITE		(3UL)
+#define HMM_PFN_LOCK		(4UL)
+#define HMM_PFN_LMEM_UPTODATE	(5UL)
+#define HMM_PFN_RMEM_UPTODATE	(6UL)
 
 static inline struct page *hmm_pfn_to_page(unsigned long pfn)
 {
@@ -95,6 +114,28 @@ static inline void hmm_pfn_set_dirty(unsigned long *pfn)
 	set_bit(HMM_PFN_DIRTY, pfn);
 }
 
+static inline void hmm_pfn_set_lmem_uptodate(unsigned long *pfn)
+{
+	set_bit(HMM_PFN_LMEM_UPTODATE, pfn);
+}
+
+static inline void hmm_pfn_set_rmem_uptodate(unsigned long *pfn)
+{
+	set_bit(HMM_PFN_RMEM_UPTODATE, pfn);
+}
+
+static inline void hmm_pfn_clear_lmem_uptodate(unsigned long *pfn)
+{
+	clear_bit(HMM_PFN_LMEM_UPTODATE, pfn);
+}
+
+static inline void hmm_pfn_clear_rmem_uptodate(unsigned long *pfn)
+{
+	clear_bit(HMM_PFN_RMEM_UPTODATE, pfn);
+}
+
+
+
 
 /* hmm_fence - device driver fence to wait for device driver operations.
  *
@@ -283,6 +324,255 @@ struct hmm_device_ops {
 			  unsigned long laddr,
 			  unsigned long *pfns,
 			  struct hmm_fault *fault);
+
+	/* rmem_alloc - allocate a new rmem object.
+	 *
+	 * @device: Device into which to allocate the remote memory.
+	 * @fault:  The fault for which this remote memory is allocated.
+	 * Returns: Valid rmem ptr on success, NULL or ERR_PTR otherwise.
+	 *
+	 * This allows migration to remote memory to operate in several steps.
+	 * First the hmm code will clamp the range that can be migrated and
+	 * will unmap pages and prepare them for migration.
+	 *
+	 * It is only once all of the above steps are done that we know how
+	 * much memory can be migrated, which is when rmem_alloc is called to
+	 * allocate the device rmem object to which memory should be migrated.
+	 *
+	 * Device driver can decide through this callback to abort migration
+	 * by returning NULL, or it can decide to continue with migration by
+	 * returning a properly allocated rmem object.
+	 *
+	 * Return rmem or NULL on success, error value otherwise :
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * All other return values trigger a warning and are transformed to -EIO.
+	 */
+	struct hmm_rmem *(*rmem_alloc)(struct hmm_device *device,
+				       struct hmm_fault *fault);
+
+	/* rmem_update() - update device mmu for a range of remote memory.
+	 *
+	 * @mirror: The mirror that link process address space with the device.
+	 * @rmem:   The remote memory under update.
+	 * @faddr:  First address in range (inclusive).
+	 * @laddr:  Last address in range (exclusive).
+	 * @fuid:   First uid of the remote memory at which the update begin.
+	 * @etype:  The type of memory event (unmap, fini, read only, ...).
+	 * @dirty:  Device driver should call hmm_pfn_set_dirty.
+	 * Returns: Valid fence ptr or NULL on success otherwise ERR_PTR.
+	 *
+	 * Called to update device mmu permission/usage for a range of remote
+	 * memory. The event type provides the nature of the update :
+	 *   - range is no longer valid (munmap).
+	 *   - range protection changes (mprotect, COW, ...).
+	 *   - range is unmapped (swap, reclaim, page migration, ...).
+	 *   - ...
+	 *
+	 * Any event that blocks further writes to the memory must also trigger
+	 * a device cache flush, and everything has to be flushed to remote
+	 * memory by the time the wait callback returns (if this callback
+	 * returned a fence, otherwise everything must be flushed by the time
+	 * this callback returns).
+	 *
+	 * The device must properly call hmm_pfn_set_dirty on any page the
+	 * device did write to since the last call to rmem_update. This is only
+	 * needed if the dirty parameter is true.
+	 *
+	 * The driver should return a fence pointer or NULL on success. It is
+	 * advised to return a fence and delay waiting for the operation to
+	 * complete to the wait callback. Returning a fence allows hmm to batch
+	 * updates to several devices and delay waiting on those once they all
+	 * have scheduled the update.
+	 *
+	 * Device drivers must not fail lightly, any failure results in the
+	 * device process being killed.
+	 *
+	 * Return fence or NULL on success, error value otherwise :
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * All other return values trigger a warning and are transformed to -EIO.
+	 */
+	struct hmm_fence *(*rmem_update)(struct hmm_mirror *mirror,
+					 struct hmm_rmem *rmem,
+					 unsigned long faddr,
+					 unsigned long laddr,
+					 unsigned long fuid,
+					 enum hmm_etype etype,
+					 bool dirty);
+
+	/* rmem_fault() - fault range of rmem on the device mmu.
+	 *
+	 * @mirror: The mirror that link process address space with the device.
+	 * @rmem:   The rmem backing this range.
+	 * @faddr:  First address in range (inclusive).
+	 * @laddr:  Last address in range (exclusive).
+	 * @fuid:   First rmem unique id (inclusive).
+	 * @fault:  The fault structure provided by device driver.
+	 * Returns: 0 on success, error value otherwise.
+	 *
+	 * Called to give the device driver the remote memory that is backing a
+	 * range of memory. The device driver can map an rmem page with write
+	 * permission only if the HMM_PFN_WRITE bit is set. If the device wants
+	 * to write to this range of rmem it can call hmm_mirror_fault.
+	 *
+	 * Return error if the scheduled operation failed. Valid values :
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * All other return values trigger a warning and are transformed to -EIO.
+	 */
+	int (*rmem_fault)(struct hmm_mirror *mirror,
+			  struct hmm_rmem *rmem,
+			  unsigned long faddr,
+			  unsigned long laddr,
+			  unsigned long fuid,
+			  struct hmm_fault *fault);
+
+	/* rmem_to_lmem - copy remote memory to local memory.
+	 *
+	 * @rmem:   The remote memory structure.
+	 * @fuid:   First rmem unique id (inclusive) of range to copy.
+	 * @luid:   Last rmem unique id (exclusive) of range to copy.
+	 * Returns: Valid fence ptr or NULL on success otherwise ERR_PTR.
+	 *
+	 * This is called to copy remote memory back to local memory. The device
+	 * driver needs to schedule the dma to copy the remote memory to the
+	 * pages given by the pfns array. The device driver should return a
+	 * fence or an error pointer.
+	 *
+	 * If the device driver does not return a fence then it must wait until
+	 * the dma is done and all device caches are flushed. Moreover the
+	 * device driver must set HMM_PFN_LMEM_UPTODATE on all successfully
+	 * copied pages (setting this flag can be delayed to the fence_wait
+	 * callback).
+	 *
+	 * If a valid fence is returned then hmm will wait on it and reschedule
+	 * any thread that needs rescheduling.
+	 *
+	 * DEVICE DRIVER MUST ABSOLUTELY TRY TO MAKE THIS CALL WORK, OTHERWISE
+	 * THE CPU THREAD WILL GET A SIGBUS.
+	 *
+	 * DEVICE DRIVER MUST SET HMM_PFN_LMEM_UPTODATE ON ALL SUCCESSFULLY
+	 * COPIED PAGES.
+	 *
+	 * Return fence or NULL on success, error value otherwise :
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * All other return values trigger a warning and are transformed to -EIO.
+	 */
+	struct hmm_fence *(*rmem_to_lmem)(struct hmm_rmem *rmem,
+					  unsigned long fuid,
+					  unsigned long luid);
+
+	/* lmem_to_rmem - copy local memory to remote memory.
+	 *
+	 * @rmem:   The remote memory structure.
+	 * @fuid:   First rmem unique id (inclusive) of range to copy.
+	 * @luid:   Last rmem unique id (exclusive) of range to copy.
+	 * Returns: Valid fence ptr or NULL on success otherwise ERR_PTR.
+	 *
+	 * This is called to copy local memory to remote memory. The driver
+	 * needs to schedule the dma to copy the local memory, from the pages
+	 * given by the pfns array, to the remote memory.
+	 *
+	 * The device driver should return a fence or an error pointer. If the
+	 * device driver does not return a fence then it must wait until the dma
+	 * is done. The device driver must set HMM_PFN_RMEM_UPTODATE on all
+	 * successfully copied pages.
+	 *
+	 * If a valid fence is returned then hmm will wait on it and reschedule
+	 * any thread that needs rescheduling.
+	 *
+	 * Failure will result in aborting migration to remote memory.
+	 *
+	 * DEVICE DRIVER MUST SET HMM_PFN_RMEM_UPTODATE ON ALL SUCCESSFULLY
+	 * COPIED PAGES.
+	 *
+	 * Return fence or NULL on success, error value otherwise :
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * All other return values trigger a warning and are transformed to -EIO.
+	 */
+	struct hmm_fence *(*lmem_to_rmem)(struct hmm_rmem *rmem,
+					  unsigned long fuid,
+					  unsigned long luid);
+
+	/* rmem_split - split rmem.
+	 *
+	 * @rmem:   The remote memory to split.
+	 * @fuid:   First rmem unique id (inclusive) of range to split.
+	 * @luid:   Last rmem unique id (exclusive) of range to split.
+	 * Returns: 0 on success, error value otherwise.
+	 *
+	 * Split remote memory: first the device driver must allocate a new
+	 * remote memory struct, second it must call hmm_rmem_split_new and
+	 * last it must transfer private driver resources from the split rmem
+	 * to the new remote memory struct.
+	 *
+	 * The device driver _can not_ adjust either the fuid or the luid.
+	 *
+	 * Failure should be forwarded if any of the steps fails. The device
+	 * driver does not need to worry about freeing the new remote memory
+	 * object once hmm_rmem_split_new is called as it will be freed through
+	 * the rmem_destroy callback if anything fails.
+	 *
+	 * DEVICE DRIVER MUST ABSOLUTELY TRY TO MAKE THIS CALL WORK OTHERWISE
+	 * THE WHOLE RMEM WILL BE MIGRATED BACK TO LMEM.
+	 *
+	 * Return error if the operation failed. Valid values :
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * All other return values trigger a warning and are transformed to -EIO.
+	 */
+	int (*rmem_split)(struct hmm_rmem *rmem,
+			  unsigned long fuid,
+			  unsigned long luid);
+
+	/* rmem_split_adjust - split rmem.
+	 *
+	 * @rmem:   The remote memory to split.
+	 * @fuid:   First rmem unique id (inclusive) of range to split.
+	 * @luid:   Last rmem unique id (exclusive) of range to split.
+	 * Returns: 0 on success, error value otherwise.
+	 *
+	 * Split remote memory: first the device driver must allocate a new
+	 * remote memory struct, second it must call hmm_rmem_split_new and
+	 * last it must transfer private driver resources from the split rmem
+	 * to the new remote memory struct.
+	 *
+	 * The device driver _can_ adjust the fuid or the luid with the
+	 * constraint that adjusted_fuid <= fuid and adjusted_luid >= luid.
+	 *
+	 * Failure should be forwarded if any of the steps fails. The device
+	 * driver does not need to worry about freeing the new remote memory
+	 * object once hmm_rmem_split_new is called as it will be freed through
+	 * the rmem_destroy callback if anything fails.
+	 *
+	 * DEVICE DRIVER MUST ABSOLUTELY TRY TO MAKE THIS CALL WORK OTHERWISE
+	 * THE WHOLE RMEM WILL BE MIGRATED BACK TO LMEM.
+	 *
+	 * Return error if the operation failed. Valid values :
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * All other return values trigger a warning and are transformed to -EIO.
+	 */
+	int (*rmem_split_adjust)(struct hmm_rmem *rmem,
+				 unsigned long fuid,
+				 unsigned long luid);
+
+	/* rmem_destroy - destroy rmem.
+	 *
+	 * @rmem:   The remote memory to destroy.
+	 *
+	 * Destroy the remote memory structure once all refs are gone.
+	 */
+	void (*rmem_destroy)(struct hmm_rmem *rmem);
 };
 
 /* struct hmm_device - per device hmm structure
@@ -292,6 +582,7 @@ struct hmm_device_ops {
  * @mutex:      Mutex protecting mirrors list.
  * @ops:        The hmm operations callback.
  * @name:       Device name (uniquely identify the device on the system).
+ * @wait_queue: Wait queue for remote memory operations.
  *
  * Each device that want to mirror an address space must register one of this
  * struct (only once).
@@ -302,6 +593,8 @@ struct hmm_device {
 	struct mutex			mutex;
 	const struct hmm_device_ops	*ops;
 	const char			*name;
+	wait_queue_head_t		wait_queue;
+	bool				rmem;
 };
 
 /* hmm_device_register() - register a device with hmm.
@@ -322,6 +615,88 @@ struct hmm_device *hmm_device_unref(struct hmm_device *device);
 
 
 
+/* hmm_rmem - The rmem struct holds hmm information about a remote memory block.
+ *
+ * The device driver should derive its remote memory tracking structure from
+ * the hmm_rmem structure. The hmm_rmem structure does not hold any information
+ * about the specifics of the remote memory block (device address or anything
+ * else). It solely stores the information needed to find the rmem when the cpu
+ * tries to access it.
+ */
+
+/* struct hmm_rmem - remote memory block
+ *
+ * @kref:           Reference count.
+ * @device:         The hmm device the remote memory is allocated on.
+ * @event:          The event currently associated with the rmem.
+ * @lock:           Lock protecting the ranges list and event field.
+ * @ranges:         The list of address ranges that point to this rmem.
+ * @node:           Node for rmem unique id tree.
+ * @pgoff:          Page offset into file (in PAGE_SIZE not PAGE_CACHE_SIZE).
+ * @fuid:           First unique id associated with this specific hmm_rmem.
+ * @luid:           Last unique id associated with this specific hmm_rmem.
+ * @subtree_luid:   Optimization for red-black interval tree.
+ * @pfns:           Array of pfns for local memory when some is attached.
+ * @dead:           The remote memory is no longer valid; restart lookup.
+ *
+ * Each hmm_rmem has a unique range of ids that is used to uniquely identify
+ * remote memory on the cpu side. Those unique ids do not relate in any way to
+ * the device physical address at which the remote memory is located.
+ */
+struct hmm_rmem {
+	struct kref		kref;
+	struct hmm_device	*device;
+	struct hmm_event	*event;
+	spinlock_t		lock;
+	struct list_head	ranges;
+	struct rb_node		node;
+	unsigned long		pgoff;
+	unsigned long		fuid;
+	unsigned long		luid;
+	unsigned long		subtree_luid;
+	unsigned long		*pfns;
+	bool			dead;
+};
+
+struct hmm_rmem *hmm_rmem_ref(struct hmm_rmem *rmem);
+struct hmm_rmem *hmm_rmem_unref(struct hmm_rmem *rmem);
+
+/* hmm_rmem_split_new - helper to split rmem.
+ *
+ * @rmem:   The remote memory to split.
+ * @new:    The new remote memory struct.
+ * Returns: 0 on success, error value otherwise.
+ *
+ * The new remote memory struct must be allocated by the device driver and its
+ * fuid and luid fields must be set to the range the device wishes the new rmem
+ * to cover.
+ *
+ * Moreover all below conditions must be true :
+ *   (new->fuid < new->luid)
+ *   (new->fuid >= rmem->fuid && new->luid <= rmem->luid)
+ *   (new->fuid == rmem->fuid || new->luid == rmem->luid)
+ *
+ * This hmm helper function will split the range and perform the internal hmm
+ * update on behalf of the device driver.
+ *
+ * Note that this function must be called from the rmem_split and
+ * rmem_split_adjust callbacks.
+ *
+ * Once this function is called the device driver should not try to free the
+ * new rmem structure no matter what the return value is. Moreover if the
+ * function returns 0 then the device driver should properly update the new
+ * rmem struct.
+ *
+ * Return error if operation failed. Valid value :
+ * -EINVAL If one of the above condition is false.
+ * -ENOMEM If it failed to allocate memory.
+ * 0 on success.
+ */
+int hmm_rmem_split_new(struct hmm_rmem *rmem,
+		       struct hmm_rmem *new);
+
+
+
+
 /* hmm_mirror - device specific mirroring functions.
  *
  * Each device that mirror a process has a uniq hmm_mirror struct associating
@@ -406,6 +781,7 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
  */
 struct hmm_fault {
 	struct vm_area_struct	*vma;
+	struct hmm_rmem		*rmem;
 	unsigned long		faddr;
 	unsigned long		laddr;
 	unsigned long		*pfns;
@@ -450,6 +826,56 @@ struct hmm_mirror *hmm_mirror_unref(struct hmm_mirror *mirror);
 
 
 
+/* hmm_migrate - Memory migration from local memory to remote memory.
+ *
+ * Below are functions that handle migration from local memory to remote memory
+ * (represented by the hmm_rmem struct). This is a multi-step process: first
+ * the range is unmapped, then the device driver, depending on the size of the
+ * unmapped range, can decide to proceed with or abort the migration.
+ */
+
+/* hmm_migrate_rmem_to_lmem() - force migration of some rmem to lmem.
+ *
+ * @mirror: The mirror that link process address space with the device.
+ * @faddr:  First address of the range to migrate to lmem.
+ * @laddr:  Last address of the range to migrate to lmem.
+ * Returns: 0 on success, -EIO or -EINVAL.
+ *
+ * This migrates any remote memory behind a range of addresses to local memory.
+ *
+ * Returns:
+ * 0 success.
+ * -EINVAL if invalid argument.
+ * -EIO if one of the device driver returned this error.
+ */
+int hmm_migrate_rmem_to_lmem(struct hmm_mirror *mirror,
+			     unsigned long faddr,
+			     unsigned long laddr);
+
+/* hmm_migrate_lmem_to_rmem() - call to migrate lmem to rmem.
+ *
+ * @migrate:    The migration temporary struct.
+ * @mirror:     The mirror that link process address space with the device.
+ * Returns:     0, -EINVAL, -ENOMEM, -EFAULT, -EACCES, -ENODEV, -EBUSY, -EIO.
+ *
+ * On success the migrate struct is updated with the range that was migrated.
+ *
+ * Returns:
+ * 0 success.
+ * -EINVAL if invalid argument.
+ * -ENOMEM if failing to allocate memory.
+ * -EFAULT if range of address is invalid (no vma backing any of the range).
+ * -EACCES if vma backing the range is special vma.
+ * -ENODEV if mirror is in process of being destroy.
+ * -EBUSY if range can not be migrated (many different reasons).
+ * -EIO if one of the device driver returned this error.
+ */
+int hmm_migrate_lmem_to_rmem(struct hmm_fault *fault,
+			     struct hmm_mirror *mirror);
+
+
+
+
 /* Functions used by core mm code. Device driver should not use any of them. */
 void __hmm_destroy(struct mm_struct *mm);
 static inline void hmm_destroy(struct mm_struct *mm)
@@ -459,12 +885,51 @@ static inline void hmm_destroy(struct mm_struct *mm)
 	}
 }
 
+/* hmm_mm_fault() - call when cpu pagefault on special hmm pte entry.
+ *
+ * @mm:             The mm of the thread triggering the fault.
+ * @vma:            The vma in which the fault happen.
+ * @addr:           The address of the fault.
+ * @pte:            Pointer to the pte entry inside the cpu page table.
+ * @pmd:            Pointer to the pmd entry into which the pte is.
+ * @fault_flags:    Fault flags (read, write, ...).
+ * @orig_pte:       The original pte value when this fault happened.
+ *
+ * When the cpu tries to access a range of memory that is in remote memory, it
+ * faults on the hmm special swap pte, which ends up calling this function,
+ * which should trigger the appropriate memory migration.
+ *
+ * Returns:
+ *   0 if someone else already migrated the rmem back.
+ *   VM_FAULT_SIGBUS on any i/o error during migration.
+ *   VM_FAULT_OOM if it fails to allocate memory for migration.
+ *   VM_FAULT_MAJOR on successful migration.
+ */
+int hmm_mm_fault(struct mm_struct *mm,
+		 struct vm_area_struct *vma,
+		 unsigned long addr,
+		 pte_t *pte,
+		 pmd_t *pmd,
+		 unsigned int fault_flags,
+		 pte_t orig_pte);
+
 #else /* !CONFIG_HMM */
 
 static inline void hmm_destroy(struct mm_struct *mm)
 {
 }
 
+static inline int hmm_mm_fault(struct mm_struct *mm,
+			       struct vm_area_struct *vma,
+			       unsigned long addr,
+			       pte_t *pte,
+			       pmd_t *pmd,
+			       unsigned int fault_flags,
+			       pte_t orig_pte)
+{
+	return VM_FAULT_SIGBUS;
+}
+
 #endif /* !CONFIG_HMM */
 
 #endif
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 0794a73b..bb2c23f 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -42,6 +42,7 @@ enum mmu_action {
 	MMU_FAULT_WP,
 	MMU_THP_SPLIT,
 	MMU_THP_FAULT_WP,
+	MMU_HMM,
 };
 
 #ifdef CONFIG_MMU_NOTIFIER
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 5a14b92..0739b32 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -70,8 +70,18 @@ static inline int current_is_kswapd(void)
 #define SWP_HWPOISON_NUM 0
 #endif
 
+/*
+ * HMM (heterogeneous memory management) used when data is in remote memory.
+ */
+#ifdef CONFIG_HMM
+#define SWP_HMM_NUM 1
+#define SWP_HMM			(MAX_SWAPFILES + SWP_MIGRATION_NUM + SWP_HWPOISON_NUM)
+#else
+#define SWP_HMM_NUM 0
+#endif
+
 #define MAX_SWAPFILES \
-	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
+	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM - SWP_HMM_NUM)
 
 /*
  * Magic header for a swap area. The first part of the union is
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 6adfb7b..9a490d3 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -188,7 +188,38 @@ static inline int is_hwpoison_entry(swp_entry_t swp)
 }
 #endif
 
-#if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION)
+#ifdef CONFIG_HMM
+
+static inline swp_entry_t make_hmm_entry(unsigned long pgoff)
+{
+	/* We do not need to keep the page pfn, so the swap offset is used to
+	 * store the rmem unique id (pgoff).
+	 */
+	return swp_entry(SWP_HMM, pgoff);
+}
+
+static inline unsigned long hmm_entry_uid(swp_entry_t entry)
+{
+	return swp_offset(entry);
+}
+
+static inline int is_hmm_entry(swp_entry_t entry)
+{
+	return unlikely(swp_type(entry) == SWP_HMM);
+}
+#else /* !CONFIG_HMM */
+#define make_hmm_entry(page, write) swp_entry(0, 0)
+static inline int is_hmm_entry(swp_entry_t swp)
+{
+	return 0;
+}
+
+static inline void make_hmm_entry_read(swp_entry_t *entry)
+{
+}
+#endif /* !CONFIG_HMM */
+
+#if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION) || defined(CONFIG_HMM)
 static inline int non_swap_entry(swp_entry_t entry)
 {
 	return swp_type(entry) >= MAX_SWAPFILES;
diff --git a/mm/hmm.c b/mm/hmm.c
index 2b8986c..599d4f6 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -77,6 +77,9 @@
 /* global SRCU for all MMs */
 static struct srcu_struct srcu;
 
+static spinlock_t _hmm_rmems_lock;
+static struct rb_root _hmm_rmems = RB_ROOT;
+
 
 
 
@@ -94,6 +97,7 @@ struct hmm_event {
 	unsigned long		faddr;
 	unsigned long		laddr;
 	struct list_head	fences;
+	struct list_head	ranges;
 	enum hmm_etype		etype;
 	bool			backoff;
 };
@@ -106,6 +110,7 @@ struct hmm_event {
  * @mirrors:        List of all mirror for this mm (one per device)
  * @mmu_notifier:   The mmu_notifier of this mm
  * @wait_queue:     Wait queue for synchronization btw cpu and device
+ * @ranges:         Tree of rmem ranges (sorted by address).
  * @events:         Events.
  * @nevents:        Number of events currently happening.
  * @dead:           The mm is being destroy.
@@ -122,6 +127,7 @@ struct hmm {
 	struct list_head	pending;
 	struct mmu_notifier	mmu_notifier;
 	wait_queue_head_t	wait_queue;
+	struct rb_root		ranges;
 	struct hmm_event	events[HMM_MAX_EVENTS];
 	int			nevents;
 	bool			dead;
@@ -132,137 +138,1456 @@ static struct mmu_notifier_ops hmm_notifier_ops;
 static inline struct hmm *hmm_ref(struct hmm *hmm);
 static inline struct hmm *hmm_unref(struct hmm *hmm);
 
-static int hmm_mirror_update(struct hmm_mirror *mirror,
-			     struct vm_area_struct *vma,
-			     unsigned long faddr,
-			     unsigned long laddr,
-			     struct hmm_event *event);
-static void hmm_mirror_cleanup(struct hmm_mirror *mirror);
+static void hmm_rmem_clear_range(struct hmm_rmem *rmem,
+				 struct vm_area_struct *vma,
+				 unsigned long faddr,
+				 unsigned long laddr,
+				 unsigned long fuid);
+static void hmm_rmem_poison_range(struct hmm_rmem *rmem,
+				  struct mm_struct *mm,
+				  struct vm_area_struct *vma,
+				  unsigned long faddr,
+				  unsigned long laddr,
+				  unsigned long fuid);
+
+static int hmm_mirror_rmem_update(struct hmm_mirror *mirror,
+				  struct hmm_rmem *rmem,
+				  unsigned long faddr,
+				  unsigned long laddr,
+				  unsigned long fuid,
+				  struct hmm_event *event,
+				  bool dirty);
+static int hmm_mirror_update(struct hmm_mirror *mirror,
+			     struct vm_area_struct *vma,
+			     unsigned long faddr,
+			     unsigned long laddr,
+			     struct hmm_event *event);
+static void hmm_mirror_cleanup(struct hmm_mirror *mirror);
+
+static int hmm_device_fence_wait(struct hmm_device *device,
+				 struct hmm_fence *fence);
+
+
+
+
+/* hmm_event - used to synchronize various mm events with each other.
+ *
+ * During the lifetime of a process various mm events will happen, and hmm
+ * serializes the events that affect overlapping ranges of addresses. The
+ * hmm_event structures are used for that purpose.
+ */
+
+static inline bool hmm_event_overlap(struct hmm_event *a, struct hmm_event *b)
+{
+	return !((a->laddr <= b->faddr) || (a->faddr >= b->laddr));
+}
+
+static inline unsigned long hmm_event_size(struct hmm_event *event)
+{
+	return (event->laddr - event->faddr);
+}
+
+
+
+
+/* hmm_fault_mm - used for reading cpu page table on device fault.
+ *
+ * This code deals with reading the cpu page table to find the pages that are
+ * backing a range of addresses. It is used as a helper by the device page
+ * fault code.
+ */
+
+/* struct hmm_fault_mm - used for reading cpu page table on device fault.
+ *
+ * @mm:     The mm of the process the device fault is happening in.
+ * @vma:    The vma in which the fault is happening.
+ * @faddr:  The first address for the range the device want to fault.
+ * @laddr:  The last address for the range the device want to fault.
+ * @pfns:   Array of hmm pfns (contains the result of the fault).
+ * @write:  Is this write fault.
+ */
+struct hmm_fault_mm {
+	struct mm_struct	*mm;
+	struct vm_area_struct	*vma;
+	unsigned long		faddr;
+	unsigned long		laddr;
+	unsigned long		*pfns;
+	bool			write;
+};
+
+static int hmm_fault_mm_fault_pmd(pmd_t *pmdp,
+				  unsigned long faddr,
+				  unsigned long laddr,
+				  struct mm_walk *walk)
+{
+	struct hmm_fault_mm *fault_mm = walk->private;
+	unsigned long idx, *pfns;
+	pte_t *ptep;
+
+	idx = (faddr - fault_mm->faddr) >> PAGE_SHIFT;
+	pfns = &fault_mm->pfns[idx];
+	memset(pfns, 0, ((laddr - faddr) >> PAGE_SHIFT) * sizeof(long));
+	if (pmd_none(*pmdp)) {
+		return -ENOENT;
+	}
+
+	if (pmd_trans_huge(*pmdp)) {
+		/* FIXME */
+		return -EINVAL;
+	}
+
+	if (pmd_none_or_trans_huge_or_clear_bad(pmdp)) {
+		return -EINVAL;
+	}
+
+	ptep = pte_offset_map(pmdp, faddr);
+	for (; faddr != laddr; ++ptep, ++pfns, faddr += PAGE_SIZE) {
+		pte_t pte = *ptep;
+
+		if (pte_none(pte)) {
+			if (fault_mm->write) {
+				ptep++;
+				break;
+			}
+			*pfns = my_zero_pfn(faddr) << HMM_PFN_SHIFT;
+			set_bit(HMM_PFN_VALID_ZERO, pfns);
+			continue;
+		}
+		if (!pte_present(pte) || (fault_mm->write && !pte_write(pte))) {
+			/* Need to inc ptep so unmap unlock on right pmd. */
+			ptep++;
+			break;
+		}
+		if (fault_mm->write && !pte_write(pte)) {
+			/* Need to inc ptep so unmap unlock on right pmd. */
+			ptep++;
+			break;
+		}
+
+		*pfns = pte_pfn(pte) << HMM_PFN_SHIFT;
+		set_bit(HMM_PFN_VALID_PAGE, pfns);
+		if (pte_write(pte)) {
+			set_bit(HMM_PFN_WRITE, pfns);
+		}
+		/* Consider the page as hot since a device wants to use it. */
+		mark_page_accessed(pfn_to_page(pte_pfn(pte)));
+		fault_mm->laddr = faddr + PAGE_SIZE;
+	}
+	pte_unmap(ptep - 1);
+
+	return (faddr == laddr) ? 0 : -ENOENT;
+}
+
+static int hmm_fault_mm_fault(struct hmm_fault_mm *fault_mm)
+{
+	struct mm_walk walk = {0};
+	unsigned long faddr, laddr;
+	int ret;
+
+	faddr = fault_mm->faddr;
+	laddr = fault_mm->laddr;
+	fault_mm->laddr = faddr;
+
+	walk.pmd_entry = hmm_fault_mm_fault_pmd;
+	walk.mm = fault_mm->mm;
+	walk.private = fault_mm;
+
+	ret = walk_page_range(faddr, laddr, &walk);
+	return ret;
+}
+
+
+
+
+/* hmm_range - address range backed by remote memory.
+ *
+ * Each address range backed by remote memory is tracked so that on cpu page
+ * fault for a given address we can find the corresponding remote memory. We
+ * use a separate structure from remote memory as several different address
+ * ranges can point to the same remote memory (in the case of shared mappings).
+ */
+
+/* struct hmm_range - address range backed by remote memory.
+ *
+ * @kref:           Reference count.
+ * @rmem:           Remote memory that back this address range.
+ * @mirror:         Mirror with which this range is associated.
+ * @fuid:           First unique id of rmem for this range.
+ * @faddr:          First address (inclusive) of the range.
+ * @laddr:          Last address (exclusive) of the range.
+ * @subtree_laddr:  Optimization for red black interval tree.
+ * @rlist:          List of all range associated with same rmem.
+ * @elist:          List of all range associated with an event.
+ */
+struct hmm_range {
+	struct kref		kref;
+	struct hmm_rmem		*rmem;
+	struct hmm_mirror	*mirror;
+	unsigned long		fuid;
+	unsigned long		faddr;
+	unsigned long		laddr;
+	unsigned long		subtree_laddr;
+	struct rb_node		node;
+	struct list_head	rlist;
+	struct list_head	elist;
+};
+
+static inline unsigned long hmm_range_faddr(struct hmm_range *range)
+{
+	return range->faddr;
+}
+
+static inline unsigned long hmm_range_laddr(struct hmm_range *range)
+{
+	return range->laddr - 1UL;
+}
+
+INTERVAL_TREE_DEFINE(struct hmm_range,
+		     node,
+		     unsigned long,
+		     subtree_laddr,
+		     hmm_range_faddr,
+		     hmm_range_laddr,,
+		     hmm_range_tree)
+
+static inline unsigned long hmm_range_npages(struct hmm_range *range)
+{
+	return (range->laddr - range->faddr) >> PAGE_SHIFT;
+}
+
+static inline unsigned long hmm_range_fuid(struct hmm_range *range)
+{
+	return range->fuid;
+}
+
+static inline unsigned long hmm_range_luid(struct hmm_range *range)
+{
+	return range->fuid + hmm_range_npages(range);
+}
+
+static void hmm_range_destroy(struct kref *kref)
+{
+	struct hmm_range *range;
+
+	range = container_of(kref, struct hmm_range, kref);
+	BUG_ON(!list_empty(&range->elist));
+	BUG_ON(!list_empty(&range->rlist));
+	BUG_ON(!RB_EMPTY_NODE(&range->node));
+
+	range->rmem = hmm_rmem_unref(range->rmem);
+	range->mirror = hmm_mirror_unref(range->mirror);
+	kfree(range);
+}
+
+static struct hmm_range *hmm_range_unref(struct hmm_range *range)
+{
+	if (range) {
+		kref_put(&range->kref, hmm_range_destroy);
+	}
+	return NULL;
+}
+
+static void hmm_range_init(struct hmm_range *range,
+			   struct hmm_mirror *mirror,
+			   struct hmm_rmem *rmem,
+			   unsigned long faddr,
+			   unsigned long laddr,
+			   unsigned long fuid)
+{
+	kref_init(&range->kref);
+	range->mirror = hmm_mirror_ref(mirror);
+	range->rmem = hmm_rmem_ref(rmem);
+	range->fuid = fuid;
+	range->faddr = faddr;
+	range->laddr = laddr;
+	RB_CLEAR_NODE(&range->node);
+
+	spin_lock(&rmem->lock);
+	list_add_tail(&range->rlist, &rmem->ranges);
+	if (rmem->event) {
+		list_add_tail(&range->elist, &rmem->event->ranges);
+	}
+	spin_unlock(&rmem->lock);
+}
+
+static void hmm_range_insert(struct hmm_range *range)
+{
+	struct hmm_mirror *mirror = range->mirror;
+
+	spin_lock(&mirror->hmm->lock);
+	if (RB_EMPTY_NODE(&range->node)) {
+		hmm_range_tree_insert(range, &mirror->hmm->ranges);
+	}
+	spin_unlock(&mirror->hmm->lock);
+}
+
+static inline void hmm_range_adjust_locked(struct hmm_range *range,
+					   unsigned long faddr,
+					   unsigned long laddr)
+{
+	if (!RB_EMPTY_NODE(&range->node)) {
+		hmm_range_tree_remove(range, &range->mirror->hmm->ranges);
+	}
+	if (faddr < range->faddr) {
+		range->fuid -= ((range->faddr - faddr) >> PAGE_SHIFT);
+	} else {
+		range->fuid += ((faddr - range->faddr) >> PAGE_SHIFT);
+	}
+	range->faddr = faddr;
+	range->laddr = laddr;
+	hmm_range_tree_insert(range, &range->mirror->hmm->ranges);
+}
+
+static int hmm_range_split(struct hmm_range *range,
+			   unsigned long saddr)
+{
+	struct hmm_mirror *mirror = range->mirror;
+	struct hmm_range *new;
+
+	if (range->faddr >= saddr) {
+		BUG();
+		return -EINVAL;
+	}
+
+	new = kmalloc(sizeof(struct hmm_range), GFP_KERNEL);
+	if (new == NULL) {
+		return -ENOMEM;
+	}
+
+	hmm_range_init(new, mirror, range->rmem, range->faddr,
+		       saddr, range->fuid);
+	spin_lock(&mirror->hmm->lock);
+	hmm_range_adjust_locked(range, saddr, range->laddr);
+	hmm_range_tree_insert(new, &mirror->hmm->ranges);
+	spin_unlock(&mirror->hmm->lock);
+	return 0;
+}
+
+static void hmm_range_fini(struct hmm_range *range)
+{
+	struct hmm_rmem *rmem = range->rmem;
+	struct hmm *hmm = range->mirror->hmm;
+
+	spin_lock(&hmm->lock);
+	if (!RB_EMPTY_NODE(&range->node)) {
+		hmm_range_tree_remove(range, &hmm->ranges);
+		RB_CLEAR_NODE(&range->node);
+	}
+	spin_unlock(&hmm->lock);
+
+	spin_lock(&rmem->lock);
+	list_del_init(&range->elist);
+	list_del_init(&range->rlist);
+	spin_unlock(&rmem->lock);
+
+	hmm_range_unref(range);
+}
+
+static void hmm_range_fini_clear(struct hmm_range *range,
+				 struct vm_area_struct *vma)
+{
+	hmm_rmem_clear_range(range->rmem, vma, range->faddr,
+			     range->laddr, range->fuid);
+	hmm_range_fini(range);
+}
+
+static inline bool hmm_range_reserve(struct hmm_range *range,
+				     struct hmm_event *event)
+{
+	bool reserved = false;
+
+	spin_lock(&range->rmem->lock);
+	if (range->rmem->event == NULL || range->rmem->event == event) {
+		range->rmem->event = event;
+		list_add_tail(&range->elist, &range->rmem->event->ranges);
+		reserved = true;
+	}
+	spin_unlock(&range->rmem->lock);
+	return reserved;
+}
+
+static inline void hmm_range_release(struct hmm_range *range,
+				     struct hmm_event *event)
+{
+	struct hmm_device *device = NULL;
+	spin_lock(&range->rmem->lock);
+	if (range->rmem->event != event) {
+		spin_unlock(&range->rmem->lock);
+		WARN_ONCE(1,"hmm: trying to release range from wrong event.\n");
+		return;
+	}
+	list_del_init(&range->elist);
+	if (list_empty(&range->rmem->event->ranges)) {
+		range->rmem->event = NULL;
+		device = range->rmem->device;
+	}
+	spin_unlock(&range->rmem->lock);
+
+	if (device) {
+		wake_up(&device->wait_queue);
+	}
+}
+
+
+
+
+/* hmm_rmem - The remote memory.
+ *
+ * Below are functions that deal with remote memory.
+ */
+
+/* struct hmm_rmem_mm - used during memory migration from/to rmem.
+ *
+ * @vma:            The vma that cover the range.
+ * @rmem:           The remote memory object.
+ * @faddr:          The first address in the range.
+ * @laddr:          The last address in the range.
+ * @fuid:           The first uid for the range.
+ * @remap_pages:    List of pages to remap.
+ * @tlb:            For gathering cpu tlb flushes.
+ * @force_flush:    Force cpu tlb flush.
+ */
+struct hmm_rmem_mm {
+	struct vm_area_struct	*vma;
+	struct hmm_rmem		*rmem;
+	unsigned long		faddr;
+	unsigned long		laddr;
+	unsigned long		fuid;
+	struct list_head	remap_pages;
+	struct mmu_gather	tlb;
+	int			force_flush;
+};
+
+/* Interval tree for the hmm_rmem object. Providing the following functions :
+ * hmm_rmem_tree_insert(struct hmm_rmem *, struct rb_root *)
+ * hmm_rmem_tree_remove(struct hmm_rmem *, struct rb_root *)
+ * hmm_rmem_tree_iter_first(struct rb_root *, fpgoff, lpgoff)
+ * hmm_rmem_tree_iter_next(struct hmm_rmem *, fpgoff, lpgoff)
+ */
+static inline unsigned long hmm_rmem_fuid(struct hmm_rmem *rmem)
+{
+	return rmem->fuid;
+}
+
+static inline unsigned long hmm_rmem_luid(struct hmm_rmem *rmem)
+{
+	return rmem->luid - 1UL;
+}
+
+INTERVAL_TREE_DEFINE(struct hmm_rmem,
+		     node,
+		     unsigned long,
+		     subtree_luid,
+		     hmm_rmem_fuid,
+		     hmm_rmem_luid,,
+		     hmm_rmem_tree)
+
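+/* Usage note (illustrative): the helpers generated above provide stabbing
+ * queries on the rmem uid space. For instance, looking up the rmem backing a
+ * given uid, as hmm_rmem_find() does below, is simply:
+ *
+ *	rmem = hmm_rmem_tree_iter_first(&_hmm_rmems, uid, uid);
+ *
+ * done while holding _hmm_rmems_lock.
+ */
+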
+static inline unsigned long hmm_rmem_npages(struct hmm_rmem *rmem)
+{
+	return (rmem->luid - rmem->fuid);
+}
+
+static inline unsigned long hmm_rmem_size(struct hmm_rmem *rmem)
+{
+	return hmm_rmem_npages(rmem) << PAGE_SHIFT;
+}
+
+static void hmm_rmem_free(struct hmm_rmem *rmem)
+{
+	unsigned long i;
+
+	for (i = 0; i < hmm_rmem_npages(rmem); ++i) {
+		struct page *page;
+
+		page = hmm_pfn_to_page(rmem->pfns[i]);
+		if (!page || test_bit(HMM_PFN_VALID_ZERO, &rmem->pfns[i])) {
+			continue;
+		}
+		/* Fake mapping so that page_remove_rmap behaves as we want. */
+		VM_BUG_ON(page_mapcount(page));
+		atomic_set(&page->_mapcount, 0);
+		page_remove_rmap(page);
+		page_cache_release(page);
+		rmem->pfns[i] = 0;
+	}
+	kfree(rmem->pfns);
+	rmem->pfns = NULL;
+
+	spin_lock(&_hmm_rmems_lock);
+	if (!RB_EMPTY_NODE(&rmem->node)) {
+		hmm_rmem_tree_remove(rmem, &_hmm_rmems);
+		RB_CLEAR_NODE(&rmem->node);
+	}
+	spin_unlock(&_hmm_rmems_lock);
+}
+
+static void hmm_rmem_destroy(struct kref *kref)
+{
+	struct hmm_device *device;
+	struct hmm_rmem *rmem;
+
+	rmem = container_of(kref, struct hmm_rmem, kref);
+	device = rmem->device;
+	BUG_ON(!list_empty(&rmem->ranges));
+	hmm_rmem_free(rmem);
+	device->ops->rmem_destroy(rmem);
+}
+
+struct hmm_rmem *hmm_rmem_ref(struct hmm_rmem *rmem)
+{
+	if (rmem) {
+		kref_get(&rmem->kref);
+		return rmem;
+	}
+	return NULL;
+}
+EXPORT_SYMBOL(hmm_rmem_ref);
+
+struct hmm_rmem *hmm_rmem_unref(struct hmm_rmem *rmem)
+{
+	if (rmem) {
+		kref_put(&rmem->kref, hmm_rmem_destroy);
+	}
+	return NULL;
+}
+EXPORT_SYMBOL(hmm_rmem_unref);
+
+static void hmm_rmem_init(struct hmm_rmem *rmem,
+			  struct hmm_device *device)
+{
+	kref_init(&rmem->kref);
+	rmem->device = device;
+	rmem->fuid = 0;
+	rmem->luid = 0;
+	rmem->pfns = NULL;
+	rmem->dead = false;
+	INIT_LIST_HEAD(&rmem->ranges);
+	spin_lock_init(&rmem->lock);
+	RB_CLEAR_NODE(&rmem->node);
+}
+
+static int hmm_rmem_alloc(struct hmm_rmem *rmem, unsigned long npages)
+{
+	rmem->pfns = kzalloc(sizeof(long) * npages, GFP_KERNEL);
+	if (rmem->pfns == NULL) {
+		return -ENOMEM;
+	}
+
+	spin_lock(&_hmm_rmems_lock);
+	if (_hmm_rmems.rb_node == NULL) {
+		rmem->fuid = 1;
+		rmem->luid = 1 + npages;
+	} else {
+		struct hmm_rmem *head;
+
+		head = container_of(_hmm_rmems.rb_node,struct hmm_rmem,node);
+		/* The subtree_luid of root node is the current luid. */
+		rmem->fuid = head->subtree_luid;
+		rmem->luid = head->subtree_luid + npages;
+	}
+	/* The rmem uid value must fit into a swap entry. FIXME can we please
+	 * have an ARCH define for the maximum swap entry value!
+	 */
+	if (rmem->luid < MM_MAX_SWAP_PAGES) {
+		hmm_rmem_tree_insert(rmem, &_hmm_rmems);
+		spin_unlock(&_hmm_rmems_lock);
+		return 0;
+	}
+	spin_unlock(&_hmm_rmems_lock);
+	rmem->fuid = 0;
+	rmem->luid = 0;
+	return -ENOSPC;
+}
+
+static struct hmm_rmem *hmm_rmem_find(unsigned long uid)
+{
+	struct hmm_rmem *rmem;
+
+	spin_lock(&_hmm_rmems_lock);
+	rmem = hmm_rmem_tree_iter_first(&_hmm_rmems, uid, uid);
+	hmm_rmem_ref(rmem);
+	spin_unlock(&_hmm_rmems_lock);
+	return rmem;
+}
+
+int hmm_rmem_split_new(struct hmm_rmem *rmem,
+		       struct hmm_rmem *new)
+{
+	struct hmm_range *range, *next;
+	unsigned long i, pgoff, npages;
+
+	hmm_rmem_init(new, rmem->device);
+
+	/* Sanity check: the new rmem is either at the beginning or at the end
+	 * of the old rmem, it can not be in the middle.
+	 */
+	if (!(new->fuid < new->luid)) {
+		hmm_rmem_unref(new);
+		return -EINVAL;
+	}
+	if (!(new->fuid >= rmem->fuid && new->luid <= rmem->luid)) {
+		hmm_rmem_unref(new);
+		return -EINVAL;
+	}
+	if (!(new->fuid == rmem->fuid || new->luid == rmem->luid)) {
+		hmm_rmem_unref(new);
+		return -EINVAL;
+	}
+
+	npages = hmm_rmem_npages(new);
+	new->pfns = kzalloc(sizeof(long) * npages, GFP_KERNEL);
+	if (new->pfns == NULL) {
+		hmm_rmem_unref(new);
+		return -ENOMEM;
+	}
+
+retry:
+	spin_lock(&rmem->lock);
+	list_for_each_entry (range, &rmem->ranges, rlist) {
+		if (hmm_range_fuid(range) < new->fuid &&
+		    hmm_range_luid(range) > new->fuid) {
+			unsigned long soff;
+			int ret;
+
+			soff = ((new->fuid - range->fuid) << PAGE_SHIFT);
+			spin_unlock(&rmem->lock);
+			ret = hmm_range_split(range, soff + range->faddr);
+			if (ret) {
+				hmm_rmem_unref(new);
+				return ret;
+			}
+			goto retry;
+		}
+		if (hmm_range_fuid(range) < new->luid &&
+		    hmm_range_luid(range) > new->luid) {
+			unsigned long soff;
+			int ret;
+
+			soff = ((new->luid - range->fuid) << PAGE_SHIFT);
+			spin_unlock(&rmem->lock);
+			ret = hmm_range_split(range, soff + range->faddr);
+			if (ret) {
+				hmm_rmem_unref(new);
+				return ret;
+			}
+			goto retry;
+		}
+	}
+	spin_unlock(&rmem->lock);
+
+	spin_lock(&_hmm_rmems_lock);
+	hmm_rmem_tree_remove(rmem, &_hmm_rmems);
+	if (new->fuid != rmem->fuid) {
+		for (i = 0, pgoff = (new->fuid-rmem->fuid); i < npages; ++i) {
+			new->pfns[i] = rmem->pfns[i + pgoff];
+		}
+		rmem->luid = new->fuid;
+	} else {
+		for (i = 0; i < npages; ++i) {
+			new->pfns[i] = rmem->pfns[i];
+		}
+		rmem->fuid = new->luid;
+		for (i = 0, pgoff = npages; i < hmm_rmem_npages(rmem); ++i) {
+			rmem->pfns[i] = rmem->pfns[i + pgoff];
+		}
+	}
+	hmm_rmem_tree_insert(rmem, &_hmm_rmems);
+	hmm_rmem_tree_insert(new, &_hmm_rmems);
+
+	/* No need to lock the new ranges list as we are holding the
+	 * rmem uid tree lock and thus no one can find out about the new
+	 * rmem yet.
+	 */
+	spin_lock(&rmem->lock);
+	list_for_each_entry_safe (range, next, &rmem->ranges, rlist) {
+		if (range->fuid >= rmem->fuid) {
+			continue;
+		}
+		list_del(&range->rlist);
+		list_add_tail(&range->rlist, &new->ranges);
+	}
+	spin_unlock(&rmem->lock);
+	spin_unlock(&_hmm_rmems_lock);
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_rmem_split_new);
+
+static int hmm_rmem_split(struct hmm_rmem *rmem,
+			  unsigned long fuid,
+			  unsigned long luid,
+			  bool adjust)
+{
+	struct hmm_device *device = rmem->device;
+	int ret;
+
+	if (fuid < rmem->fuid || luid > rmem->luid) {
+		WARN_ONCE(1, "hmm: rmem split received invalid range.\n");
+		return -EINVAL;
+	}
+
+	if (fuid == rmem->fuid && luid == rmem->luid) {
+		return 0;
+	}
+
+	if (adjust) {
+		ret = device->ops->rmem_split_adjust(rmem, fuid, luid);
+	} else {
+		ret = device->ops->rmem_split(rmem, fuid, luid);
+	}
+	return ret;
+}
+
+static void hmm_rmem_clear_range_page(struct hmm_rmem_mm *rmem_mm,
+				      unsigned long addr,
+				      pte_t *ptep,
+				      pmd_t *pmdp)
+{
+	struct vm_area_struct *vma = rmem_mm->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long uid;
+	pte_t pte;
+
+	uid = ((addr - rmem_mm->faddr) >> PAGE_SHIFT) + rmem_mm->fuid;
+	pte = ptep_get_and_clear(mm, addr, ptep);
+	if (!pte_same(pte, swp_entry_to_pte(make_hmm_entry(uid)))) {
+//		print_bad_pte(vma, addr, ptep, NULL);
+		set_pte_at(mm, addr, ptep, pte);
+	}
+}
+
+static int hmm_rmem_clear_range_pmd(pmd_t *pmdp,
+				    unsigned long addr,
+				    unsigned long next,
+				    struct mm_walk *walk)
+{
+	struct hmm_rmem_mm *rmem_mm = walk->private;
+	struct vm_area_struct *vma = rmem_mm->vma;
+	spinlock_t *ptl;
+	pte_t *ptep;
+
+	if (pmd_none(*pmdp)) {
+		return 0;
+	}
+
+	if (pmd_trans_huge(*pmdp)) {
+		/* This can not happen, we split huge pages during unmap. */
+		BUG();
+		return 0;
+	}
+
+	if (pmd_none_or_trans_huge_or_clear_bad(pmdp)) {
+		/* FIXME I do not think this can happen at this point given
+		 * that during unmap all thp pmds were split.
+		 */
+		BUG();
+		return 0;
+	}
+
+	ptep = pte_offset_map_lock(vma->vm_mm, pmdp, addr, &ptl);
+	for (; addr != next; ++ptep, addr += PAGE_SIZE) {
+		hmm_rmem_clear_range_page(rmem_mm, addr, ptep, pmdp);
+	}
+	pte_unmap_unlock(ptep - 1, ptl);
+
+	return 0;
+}
+
+static void hmm_rmem_clear_range(struct hmm_rmem *rmem,
+				 struct vm_area_struct *vma,
+				 unsigned long faddr,
+				 unsigned long laddr,
+				 unsigned long fuid)
+{
+	struct hmm_rmem_mm rmem_mm;
+	struct mm_walk walk = {0};
+	unsigned long i, idx, npages;
+
+	rmem_mm.vma = vma;
+	rmem_mm.rmem = rmem;
+	rmem_mm.faddr = faddr;
+	rmem_mm.laddr = laddr;
+	rmem_mm.fuid = fuid;
+	walk.pmd_entry = hmm_rmem_clear_range_pmd;
+	walk.mm = vma->vm_mm;
+	walk.private = &rmem_mm;
+
+	/* No need to call the mmu notifier: the range was either unmapped or
+	 * inside video memory. In the latter case the invalidation must have
+	 * happened prior to this function being called.
+	 */
+	walk_page_range(faddr, laddr, &walk);
+
+	npages = (laddr - faddr) >> PAGE_SHIFT;
+	for (i = 0, idx = fuid - rmem->fuid; i < npages; ++i, ++idx) {
+		if (current->mm == vma->vm_mm) {
+			sync_mm_rss(vma->vm_mm);
+		}
+
+		/* Properly uncharge memory. */
+		mem_cgroup_uncharge_mm(vma->vm_mm);
+		add_mm_counter(vma->vm_mm, MM_ANONPAGES, -1);
+	}
+}
+
+static void hmm_rmem_poison_range_page(struct hmm_rmem_mm *rmem_mm,
+				       struct vm_area_struct *vma,
+				       unsigned long addr,
+				       pte_t *ptep,
+				       pmd_t *pmdp)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long uid;
+	pte_t pte;
+
+	uid = ((addr - rmem_mm->faddr) >> PAGE_SHIFT) + rmem_mm->fuid;
+	pte = ptep_get_and_clear(mm, addr, ptep);
+	if (!pte_same(pte, swp_entry_to_pte(make_hmm_entry(uid)))) {
+//		print_bad_pte(vma, addr, ptep, NULL);
+		set_pte_at(mm, addr, ptep, pte);
+	} else {
+		/* The 0 fuid is a special poison value. */
+		pte = swp_entry_to_pte(make_hmm_entry(0));
+		set_pte_at(mm, addr, ptep, pte);
+	}
+}
+
+static int hmm_rmem_poison_range_pmd(pmd_t *pmdp,
+				     unsigned long addr,
+				     unsigned long next,
+				     struct mm_walk *walk)
+{
+	struct hmm_rmem_mm *rmem_mm = walk->private;
+	struct vm_area_struct *vma = rmem_mm->vma;
+	spinlock_t *ptl;
+	pte_t *ptep;
+
+	if (!vma) {
+		vma = find_vma(walk->mm, addr);
+	}
+
+	if (pmd_none(*pmdp)) {
+		return 0;
+	}
+
+	if (pmd_trans_huge(*pmdp)) {
+		/* This can not happen, we split huge pages during unmap. */
+		BUG();
+		return 0;
+	}
+
+	if (pmd_none_or_trans_huge_or_clear_bad(pmdp)) {
+		/* FIXME I do not think this can happen at this point given
+		 * that during unmap all thp pmds were split.
+		 */
+		BUG();
+		return 0;
+	}
+
+	ptep = pte_offset_map_lock(vma->vm_mm, pmdp, addr, &ptl);
+	for (; addr != next; ++ptep, addr += PAGE_SIZE) {
+		hmm_rmem_poison_range_page(rmem_mm, vma, addr, ptep, pmdp);
+	}
+	pte_unmap_unlock(ptep - 1, ptl);
+
+	return 0;
+}
+
+static void hmm_rmem_poison_range(struct hmm_rmem *rmem,
+				  struct mm_struct *mm,
+				  struct vm_area_struct *vma,
+				  unsigned long faddr,
+				  unsigned long laddr,
+				  unsigned long fuid)
+{
+	struct hmm_rmem_mm rmem_mm;
+	struct mm_walk walk = {0};
+
+	rmem_mm.vma = vma;
+	rmem_mm.rmem = rmem;
+	rmem_mm.faddr = faddr;
+	rmem_mm.laddr = laddr;
+	rmem_mm.fuid = fuid;
+	walk.pmd_entry = hmm_rmem_poison_range_pmd;
+	walk.mm = mm;
+	walk.private = &rmem_mm;
+
+	/* No need to call the mmu notifier: the range was either unmapped or
+	 * inside video memory. In the latter case the invalidation must have
+	 * happened prior to this function being called.
+	 */
+	walk_page_range(faddr, laddr, &walk);
+}
+
+static int hmm_rmem_remap_page(struct hmm_rmem_mm *rmem_mm,
+			       unsigned long addr,
+			       pte_t *ptep,
+			       pmd_t *pmdp)
+{
+	struct vm_area_struct *vma = rmem_mm->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	struct hmm_rmem *rmem = rmem_mm->rmem;
+	unsigned long idx, uid;
+	struct page *page;
+	pte_t pte;
+
+	uid = rmem_mm->fuid + ((addr - rmem_mm->faddr) >> PAGE_SHIFT);
+	idx = (uid - rmem_mm->fuid);
+	pte = ptep_get_and_clear(mm, addr, ptep);
+	if (!pte_same(pte,swp_entry_to_pte(make_hmm_entry(uid)))) {
+		set_pte_at(mm, addr, ptep, pte);
+		if (vma->vm_file) {
+			/* Just ignore it, it might mean that the shared page
+			 * backing this address was remapped right after being
+			 * added to the pagecache.
+			 */
+			return 0;
+		} else {
+//			print_bad_pte(vma, addr, ptep, NULL);
+			return -EFAULT;
+		}
+	}
+	page = hmm_pfn_to_page(rmem->pfns[idx]);
+	if (!page) {
+		/* Nothing to do. */
+		return 0;
+	}
+
+	/* The remap code must lock the page prior to remapping. */
+	BUG_ON(PageHuge(page));
+	if (test_bit(HMM_PFN_VALID_PAGE, &rmem->pfns[idx])) {
+		BUG_ON(!PageLocked(page));
+		pte = mk_pte(page, vma->vm_page_prot);
+		if (test_bit(HMM_PFN_WRITE, &rmem->pfns[idx])) {
+			pte = pte_mkwrite(pte);
+		}
+		if (test_bit(HMM_PFN_DIRTY, &rmem->pfns[idx])) {
+			pte = pte_mkdirty(pte);
+		}
+		get_page(page);
+		/* Private anonymous page. */
+		page_add_anon_rmap(page, vma, addr);
+		/* FIXME is this necessary ? I do not think so. */
+		if (!reuse_swap_page(page)) {
+			/* Page is still mapped in another process. */
+			pte = pte_wrprotect(pte);
+		}
+	} else {
+		/* Special zero page. */
+		pte = pte_mkspecial(pfn_pte(page_to_pfn(page),
+				    vma->vm_page_prot));
+	}
+	set_pte_at(mm, addr, ptep, pte);
+
+	return 0;
+}
+
+static int hmm_rmem_remap_pmd(pmd_t *pmdp,
+			      unsigned long addr,
+			      unsigned long next,
+			      struct mm_walk *walk)
+{
+	struct hmm_rmem_mm *rmem_mm = walk->private;
+	struct vm_area_struct *vma = rmem_mm->vma;
+	spinlock_t *ptl;
+	pte_t *ptep;
+	int ret = 0;
+
+	if (pmd_none(*pmdp)) {
+		return 0;
+	}
+
+	if (pmd_trans_huge(*pmdp)) {
+		/* This can not happen, we split huge pages during unmap. */
+		BUG();
+		return -EINVAL;
+	}
+
+	if (pmd_none_or_trans_huge_or_clear_bad(pmdp)) {
+		/* No pmd here. */
+		return 0;
+	}
+
+	ptep = pte_offset_map_lock(vma->vm_mm, pmdp, addr, &ptl);
+	for (; addr != next; ++ptep, addr += PAGE_SIZE) {
+		ret = hmm_rmem_remap_page(rmem_mm, addr, ptep, pmdp);
+		if (ret) {
+			/* Increment ptep so unlock works on correct pte. */
+			ptep++;
+			break;
+		}
+	}
+	pte_unmap_unlock(ptep - 1, ptl);
+
+	return ret;
+}
+
+static int hmm_rmem_remap_anon(struct hmm_rmem *rmem,
+			       struct vm_area_struct *vma,
+			       unsigned long faddr,
+			       unsigned long laddr,
+			       unsigned long fuid)
+{
+	struct hmm_rmem_mm rmem_mm;
+	struct mm_walk walk = {0};
+	int ret;
+
+	rmem_mm.vma = vma;
+	rmem_mm.rmem = rmem;
+	rmem_mm.faddr = faddr;
+	rmem_mm.laddr = laddr;
+	rmem_mm.fuid = fuid;
+	walk.pmd_entry = hmm_rmem_remap_pmd;
+	walk.mm = vma->vm_mm;
+	walk.private = &rmem_mm;
+
+	/* No need to call the mmu notifier: the range was either unmapped or
+	 * inside video memory. In the latter case the invalidation must have
+	 * happened prior to this function being called.
+	 */
+	ret = walk_page_range(faddr, laddr, &walk);
+
+	return ret;
+}
+
+static int hmm_rmem_unmap_anon_page(struct hmm_rmem_mm *rmem_mm,
+				    unsigned long addr,
+				    pte_t *ptep,
+				    pmd_t *pmdp)
+{
+	struct vm_area_struct *vma = rmem_mm->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	struct hmm_rmem *rmem = rmem_mm->rmem;
+	unsigned long idx, uid;
+	struct page *page;
+	pte_t pte;
+
+	/* New pte value. */
+	uid = ((addr - rmem_mm->faddr) >> PAGE_SHIFT) + rmem_mm->fuid;
+	idx = uid - rmem->fuid;
+	pte = ptep_get_and_clear_full(mm, addr, ptep, rmem_mm->tlb.fullmm);
+	tlb_remove_tlb_entry((&rmem_mm->tlb), ptep, addr);
+	rmem->pfns[idx] = 0;
+
+	if (pte_none(pte)) {
+		if (mem_cgroup_charge_anon(NULL, mm, GFP_KERNEL)) {
+			return -ENOMEM;
+		}
+		add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
+		/* Zero pte means nothing is there and thus nothing to copy. */
+		pte = swp_entry_to_pte(make_hmm_entry(uid));
+		set_pte_at(mm, addr, ptep, pte);
+		rmem->pfns[idx] = my_zero_pfn(addr) << HMM_PFN_SHIFT;
+		set_bit(HMM_PFN_VALID_ZERO, &rmem->pfns[idx]);
+		if (vma->vm_flags & VM_WRITE) {
+			set_bit(HMM_PFN_WRITE, &rmem->pfns[idx]);
+		}
+		set_bit(HMM_PFN_LMEM_UPTODATE, &rmem->pfns[idx]);
+		rmem_mm->laddr = addr + PAGE_SIZE;
+		return 0;
+	}
+	if (!pte_present(pte)) {
+		/* Page is not present, it must be faulted in; restore the pte. */
+		set_pte_at(mm, addr, ptep, pte);
+		return -ENOENT;
+	}
+
+	page = pfn_to_page(pte_pfn(pte));
+	/* FIXME do we want to be able to unmap mlocked pages? */
+	if (PageMlocked(page)) {
+		set_pte_at(mm, addr, ptep, pte);
+		return -EBUSY;
+	}
+
+	rmem->pfns[idx] = pte_pfn(pte) << HMM_PFN_SHIFT;
+	if (is_zero_pfn(pte_pfn(pte))) {
+		set_bit(HMM_PFN_VALID_ZERO, &rmem->pfns[idx]);
+		set_bit(HMM_PFN_LMEM_UPTODATE, &rmem->pfns[idx]);
+	} else {
+		flush_cache_page(vma, addr, pte_pfn(pte));
+		set_bit(HMM_PFN_VALID_PAGE, &rmem->pfns[idx]);
+		set_bit(HMM_PFN_LMEM_UPTODATE, &rmem->pfns[idx]);
+		/* Anonymous private memory is always writable. */
+		if (pte_dirty(pte)) {
+			set_bit(HMM_PFN_DIRTY, &rmem->pfns[idx]);
+		}
+		if (trylock_page(page)) {
+			set_bit(HMM_PFN_LOCK, &rmem->pfns[idx]);
+		}
+		rmem_mm->force_flush=!__tlb_remove_page(&rmem_mm->tlb,page);
+
+		/* tlb_flush_mmu drops one ref so take an extra ref here. */
+		get_page(page);
+	}
+	if (vma->vm_flags & VM_WRITE) {
+		set_bit(HMM_PFN_WRITE, &rmem->pfns[idx]);
+	}
+	rmem_mm->laddr = addr + PAGE_SIZE;
+
+	pte = swp_entry_to_pte(make_hmm_entry(uid));
+	set_pte_at(mm, addr, ptep, pte);
+
+	/* What a journey ! */
+	return 0;
+}
+
+static int hmm_rmem_unmap_pmd(pmd_t *pmdp,
+			      unsigned long addr,
+			      unsigned long next,
+			      struct mm_walk *walk)
+{
+	struct hmm_rmem_mm *rmem_mm = walk->private;
+	struct vm_area_struct *vma = rmem_mm->vma;
+	spinlock_t *ptl;
+	pte_t *ptep;
+	int ret = 0;
+
+	if (pmd_none(*pmdp)) {
+		if (unlikely(__pte_alloc(vma->vm_mm, vma, pmdp, addr))) {
+			return -ENOENT;
+		}
+	}
+
+	if (pmd_trans_huge(*pmdp)) {
+		/* FIXME this will deadlock because it does mmu_notifier_range_invalidate */
+		split_huge_page_pmd(vma, addr, pmdp);
+		return -EAGAIN;
+	}
+
+	if (pmd_none_or_trans_huge_or_clear_bad(pmdp)) {
+		/* It was already handled above. */
+		BUG();
+		return -EINVAL;
+	}
+
+again:
+	ptep = pte_offset_map_lock(vma->vm_mm, pmdp, addr, &ptl);
+	arch_enter_lazy_mmu_mode();
+	for (; addr != next; ++ptep, addr += PAGE_SIZE) {
+		ret = hmm_rmem_unmap_anon_page(rmem_mm, addr,
+					       ptep, pmdp);
+		if (ret || rmem_mm->force_flush) {
+			/* Increment ptep so unlock works on correct
+			 * pte.
+			 */
+			ptep++;
+			break;
+		}
+	}
+	arch_leave_lazy_mmu_mode();
+	pte_unmap_unlock(ptep - 1, ptl);
+
+	/* mmu_gather ran out of room to batch pages, we break out of the PTE
+	 * lock to avoid doing the potentially expensive TLB invalidate and
+	 * page-free while holding it.
+	 */
+	if (rmem_mm->force_flush) {
+		unsigned long old_end;
+
+		rmem_mm->force_flush = 0;
+		/*
+		 * Flush the TLB just for the previous segment,
+		 * then update the range to be the remaining
+		 * TLB range.
+		 */
+		old_end = rmem_mm->tlb.end;
+		rmem_mm->tlb.end = addr;
+
+		tlb_flush_mmu(&rmem_mm->tlb);
+
+		rmem_mm->tlb.start = addr;
+		rmem_mm->tlb.end = old_end;
+
+		if (!ret && addr != next) {
+			goto again;
+		}
+	}
+
+	return ret;
+}
+
+static int hmm_rmem_unmap_anon(struct hmm_rmem *rmem,
+			       struct vm_area_struct *vma,
+			       unsigned long faddr,
+			       unsigned long laddr)
+{
+	struct hmm_rmem_mm rmem_mm;
+	struct mm_walk walk = {0};
+	unsigned long i, npages;
+	int ret;
+
+	if (vma->vm_file) {
+		return -EINVAL;
+	}
 
-static int hmm_device_fence_wait(struct hmm_device *device,
-				 struct hmm_fence *fence);
+	npages = (laddr - faddr) >> PAGE_SHIFT;
+	rmem->pgoff = faddr;
+	rmem_mm.vma = vma;
+	rmem_mm.rmem = rmem;
+	rmem_mm.faddr = faddr;
+	rmem_mm.laddr = faddr;
+	rmem_mm.fuid = rmem->fuid;
+	memset(rmem->pfns, 0, sizeof(long) * npages);
+
+	rmem_mm.force_flush = 0;
+	walk.pmd_entry = hmm_rmem_unmap_pmd;
+	walk.mm = vma->vm_mm;
+	walk.private = &rmem_mm;
+
+	mmu_notifier_invalidate_range_start(walk.mm,vma,faddr,laddr,MMU_HMM);
+	tlb_gather_mmu(&rmem_mm.tlb, walk.mm, faddr, laddr);
+	tlb_start_vma(&rmem_mm.tlb, rmem_mm.vma);
+	ret = walk_page_range(faddr, laddr, &walk);
+	tlb_end_vma(&rmem_mm.tlb, rmem_mm.vma);
+	tlb_finish_mmu(&rmem_mm.tlb, faddr, laddr);
+	mmu_notifier_invalidate_range_end(walk.mm, vma, faddr, laddr, MMU_HMM);
 
+	/* Before migrating pages we must lock them. Here we lock all pages we
+	 * could not lock while holding the pte lock.
+	 */
+	npages = (rmem_mm.laddr - faddr) >> PAGE_SHIFT;
+	for (i = 0; i < npages; ++i) {
+		struct page *page;
 
+		if (test_bit(HMM_PFN_VALID_ZERO, &rmem->pfns[i])) {
+			continue;
+		}
 
+		page = hmm_pfn_to_page(rmem->pfns[i]);
+		if (!test_bit(HMM_PFN_LOCK, &rmem->pfns[i])) {
+			lock_page(page);
+			set_bit(HMM_PFN_LOCK, &rmem->pfns[i]);
+		}
+	}
 
-/* hmm_event - use to synchronize various mm events with each others.
- *
- * During life time of process various mm events will happen, hmm serialize
- * event that affect overlapping range of address. The hmm_event are use for
- * that purpose.
- */
+	return ret;
+}
 
-static inline bool hmm_event_overlap(struct hmm_event *a, struct hmm_event *b)
+static inline int hmm_rmem_unmap(struct hmm_rmem *rmem,
+				 struct vm_area_struct *vma,
+				 unsigned long faddr,
+				 unsigned long laddr)
 {
-	return !((a->laddr <= b->faddr) || (a->faddr >= b->laddr));
+	if (vma->vm_file) {
+		return -EBUSY;
+	} else {
+		return hmm_rmem_unmap_anon(rmem, vma, faddr, laddr);
+	}
 }
 
-static inline unsigned long hmm_event_size(struct hmm_event *event)
+static int hmm_rmem_alloc_pages(struct hmm_rmem *rmem,
+				struct vm_area_struct *vma,
+				unsigned long addr)
 {
-	return (event->laddr - event->faddr);
-}
+	unsigned long i, npages = hmm_rmem_npages(rmem);
+	unsigned long *pfns = rmem->pfns;
+	struct mm_struct *mm = vma ? vma->vm_mm : NULL;
+	int ret = 0;
 
+	if (vma && !(vma->vm_file)) {
+		if (unlikely(anon_vma_prepare(vma))) {
+			return -ENOMEM;
+		}
+	}
 
+	for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
+		struct page *page;
 
+		/* (i) This does happen if the vma is being split and the rmem
+		 * split failed, thus we are falling back to full rmem migration
+		 * and there might not be a vma covering all the addresses (ie
+		 * some of the migration is useless but to keep the code simple
+		 * we just copy more than necessary).
+		 */
+		if (vma && addr >= vma->vm_end) {
+			vma = mm ? find_vma(mm, addr) : NULL;
+		}
 
-/* hmm_fault_mm - used for reading cpu page table on device fault.
- *
- * This code deals with reading the cpu page table to find the pages that are
- * backing a range of address. It is use as an helper to the device page fault
- * code.
- */
+		/* No need to clear the pages, they will be dma targets; of
+		 * course this means we trust the device driver.
+		 */
+		if (!vma) {
+			/* See above (i) for when this does happen. */
+			page = alloc_page(GFP_HIGHUSER_MOVABLE);
+		} else {
+			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, addr);
+		}
+		if (!page) {
+			ret = ret ? ret : -ENOMEM;
+			continue;
+		}
+		lock_page(page);
+		pfns[i] = page_to_pfn(page) << HMM_PFN_SHIFT;
+		set_bit(HMM_PFN_WRITE, &pfns[i]);
+		set_bit(HMM_PFN_LOCK, &pfns[i]);
+		set_bit(HMM_PFN_VALID_PAGE, &pfns[i]);
+		page_add_new_anon_rmap(page, vma, addr);
+	}
 
-/* struct hmm_fault_mm - used for reading cpu page table on device fault.
- *
- * @mm:     The mm of the process the device fault is happening in.
- * @vma:    The vma in which the fault is happening.
- * @faddr:  The first address for the range the device want to fault.
- * @laddr:  The last address for the range the device want to fault.
- * @pfns:   Array of hmm pfns (contains the result of the fault).
- * @write:  Is this write fault.
- */
-struct hmm_fault_mm {
-	struct mm_struct	*mm;
-	struct vm_area_struct	*vma;
-	unsigned long		faddr;
-	unsigned long		laddr;
-	unsigned long		*pfns;
-	bool			write;
-};
+	return ret;
+}
 
-static int hmm_fault_mm_fault_pmd(pmd_t *pmdp,
-				  unsigned long faddr,
-				  unsigned long laddr,
-				  struct mm_walk *walk)
+int hmm_rmem_migrate_to_lmem(struct hmm_rmem *rmem,
+			     struct vm_area_struct *vma,
+			     unsigned long addr,
+			     unsigned long fuid,
+			     unsigned long luid,
+			     bool adjust)
 {
-	struct hmm_fault_mm *fault_mm = walk->private;
-	unsigned long idx, *pfns;
-	pte_t *ptep;
+	struct hmm_device *device = rmem->device;
+	struct hmm_range *range, *next;
+	struct hmm_fence *fence, *tmp;
+	struct mm_struct *mm = vma ? vma->vm_mm : NULL;
+	struct list_head fences;
+	unsigned long i;
+	int ret = 0;
 
-	idx = (faddr - fault_mm->faddr) >> PAGE_SHIFT;
-	pfns = &fault_mm->pfns[idx];
-	memset(pfns, 0, ((laddr - faddr) >> PAGE_SHIFT) * sizeof(long));
-	if (pmd_none(*pmdp)) {
-		return -ENOENT;
+	BUG_ON(vma && ((addr < vma->vm_start) || (addr >= vma->vm_end)));
+
+	/* Ignore split errors, we will fall back to full migration. */
+	hmm_rmem_split(rmem, fuid, luid, adjust);
+
+	if (rmem->fuid > fuid || rmem->luid < luid) {
+		WARN_ONCE(1, "hmm: rmem split out of constraint.\n");
+		ret = -EINVAL;
+		goto error;
 	}
 
-	if (pmd_trans_huge(*pmdp)) {
-		/* FIXME */
-		return -EINVAL;
+	/* Adjust start address for page allocation if necessary. */
+	if (vma && (rmem->fuid < fuid)) {
+		if (((addr-vma->vm_start)>>PAGE_SHIFT) < (fuid-rmem->fuid)) {
+			/* FIXME can this happen? I would say no, but right now
+			 * I can not hold in my head all the code paths that
+			 * lead to this place.
+			 */
+			vma = NULL;
+		} else {
+			addr -= ((fuid - rmem->fuid) << PAGE_SHIFT);
+		}
 	}
 
-	if (pmd_none_or_trans_huge_or_clear_bad(pmdp)) {
-		return -EINVAL;
+	ret = hmm_rmem_alloc_pages(rmem, vma, addr);
+	if (ret) {
+		goto error;
 	}
 
-	ptep = pte_offset_map(pmdp, faddr);
-	for (; faddr != laddr; ++ptep, ++pfns, faddr += PAGE_SIZE) {
-		pte_t pte = *ptep;
+	INIT_LIST_HEAD(&fences);
 
-		if (pte_none(pte)) {
-			if (fault_mm->write) {
-				ptep++;
-				break;
-			}
-			*pfns = my_zero_pfn(faddr) << HMM_PFN_SHIFT;
-			set_bit(HMM_PFN_VALID_ZERO, pfns);
-			continue;
+	/* No need to lock because at this point no one else can modify the
+	 * ranges list.
+	 */
+	list_for_each_entry (range, &rmem->ranges, rlist) {
+		fence = device->ops->rmem_update(range->mirror,
+						 range->rmem,
+						 range->faddr,
+						 range->laddr,
+						 range->fuid,
+						 HMM_MIGRATE_TO_LMEM,
+						 false);
+		if (IS_ERR(fence)) {
+			ret = PTR_ERR(fence);
+			goto error;
 		}
-		if (!pte_present(pte) || (fault_mm->write && !pte_write(pte))) {
-			/* Need to inc ptep so unmap unlock on right pmd. */
-			ptep++;
-			break;
+		if (fence) {
+			list_add_tail(&fence->list, &fences);
 		}
+	}
 
-		*pfns = pte_pfn(pte) << HMM_PFN_SHIFT;
-		set_bit(HMM_PFN_VALID_PAGE, pfns);
-		if (pte_write(pte)) {
-			set_bit(HMM_PFN_WRITE, pfns);
+	list_for_each_entry_safe (fence, tmp, &fences, list) {
+		int r;
+
+		r = hmm_device_fence_wait(device, fence);
+		ret = ret ? min(ret, r) : r;
+	}
+	if (ret) {
+		goto error;
+	}
+
+	fence = device->ops->rmem_to_lmem(rmem, rmem->fuid, rmem->luid);
+	if (IS_ERR(fence)) {
+		/* FIXME Check return value. */
+		ret = PTR_ERR(fence);
+		goto error;
+	}
+
+	if (fence) {
+		INIT_LIST_HEAD(&fence->list);
+		ret = hmm_device_fence_wait(device, fence);
+		if (ret) {
+			goto error;
 		}
-		/* Consider the page as hot as a device want to use it. */
-		mark_page_accessed(pfn_to_page(pte_pfn(pte)));
-		fault_mm->laddr = faddr + PAGE_SIZE;
 	}
-	pte_unmap(ptep - 1);
 
-	return (faddr == laddr) ? 0 : -ENOENT;
-}
+	/* Now the remote memory is officially dead and nothing below can fail
+	 * badly.
+	 */
+	rmem->dead = true;
 
-static int hmm_fault_mm_fault(struct hmm_fault_mm *fault_mm)
-{
-	struct mm_walk walk = {0};
-	unsigned long faddr, laddr;
-	int ret;
+	/* No need to lock because at this point no one else can modify the
+	 * ranges list.
+	 */
+	list_for_each_entry_safe (range, next, &rmem->ranges, rlist) {
+		VM_BUG_ON(!vma);
+		VM_BUG_ON(range->faddr < vma->vm_start);
+		VM_BUG_ON(range->laddr > vma->vm_end);
+
+		/* The remapping fails only if something goes terribly wrong. */
+		ret = hmm_rmem_remap_anon(rmem, vma, range->faddr,
+					  range->laddr, range->fuid);
+		if (ret) {
+			WARN_ONCE(1, "hmm: something is terribly wrong.\n");
+			hmm_rmem_poison_range(rmem, mm, vma, range->faddr,
+					      range->laddr, range->fuid);
+		}
+		hmm_range_fini(range);
+	}
 
-	faddr = fault_mm->faddr;
-	laddr = fault_mm->laddr;
-	fault_mm->laddr = faddr;
+	for (i = 0; i < hmm_rmem_npages(rmem); ++i) {
+		struct page *page = hmm_pfn_to_page(rmem->pfns[i]);
 
-	walk.pmd_entry = hmm_fault_mm_fault_pmd;
-	walk.mm = fault_mm->mm;
-	walk.private = fault_mm;
+		unlock_page(page);
+		mem_cgroup_transfer_charge_anon(page, mm);
+		page_remove_rmap(page);
+		page_cache_release(page);
+		rmem->pfns[i] = 0UL;
+	}
+	return 0;
 
-	ret = walk_page_range(faddr, laddr, &walk);
+error:
+	/* No need to lock because at this point no one else can modify the
+	 * ranges list.
+	 */
+	/* There are two cases here:
+	 * (1) rmem is mirroring shared memory, in which case we are facing the
+	 *     issue of poisoning all the mappings in all the processes for
+	 *     that file.
+	 * (2) rmem is mirroring private memory, the easy case: poison all
+	 *     ranges referencing the rmem.
+	 */
+	for (i = 0; i < hmm_rmem_npages(rmem); ++i) {
+		struct page *page = hmm_pfn_to_page(rmem->pfns[i]);
+
+		if (!page) {
+			if (vma && !(vma->vm_flags & VM_SHARED)) {
+				/* Properly uncharge memory. */
+				mem_cgroup_uncharge_mm(mm);
+			}
+			continue;
+		}
+		/* Properly uncharge memory. */
+		mem_cgroup_transfer_charge_anon(page, mm);
+		if (!test_bit(HMM_PFN_LOCK, &rmem->pfns[i])) {
+			unlock_page(page);
+		}
+		page_remove_rmap(page);
+		page_cache_release(page);
+		rmem->pfns[i] = 0UL;
+	}
+	list_for_each_entry_safe (range, next, &rmem->ranges, rlist) {
+		mm = range->mirror->hmm->mm;
+		hmm_rmem_poison_range(rmem, mm, NULL, range->faddr,
+				      range->laddr, range->fuid);
+		hmm_range_fini(range);
+	}
 	return ret;
 }
 
@@ -285,6 +1610,7 @@ static int hmm_init(struct hmm *hmm, struct mm_struct *mm)
 	INIT_LIST_HEAD(&hmm->mirrors);
 	INIT_LIST_HEAD(&hmm->pending);
 	spin_lock_init(&hmm->lock);
+	hmm->ranges = RB_ROOT;
 	init_waitqueue_head(&hmm->wait_queue);
 
 	for (i = 0; i < HMM_MAX_EVENTS; ++i) {
@@ -298,6 +1624,12 @@ static int hmm_init(struct hmm *hmm, struct mm_struct *mm)
 	return ret;
 }
 
+static inline bool hmm_event_cover_range(struct hmm_event *a,
+					 struct hmm_range *b)
+{
+	return ((a->faddr <= b->faddr) && (a->laddr >= b->laddr));
+}
+
 static enum hmm_etype hmm_event_mmu(enum mmu_action action)
 {
 	switch (action) {
@@ -326,6 +1658,7 @@ static enum hmm_etype hmm_event_mmu(enum mmu_action action)
 	case MMU_MUNMAP:
 		return HMM_MUNMAP;
 	case MMU_SOFT_DIRTY:
+	case MMU_HMM:
 	default:
 		return HMM_NONE;
 	}
@@ -357,6 +1690,8 @@ static void hmm_destroy_kref(struct kref *kref)
 	mm->hmm = NULL;
 	mmu_notifier_unregister(&hmm->mmu_notifier, mm);
 
+	BUG_ON(!RB_EMPTY_ROOT(&hmm->ranges));
+
 	if (!list_empty(&hmm->mirrors)) {
 		BUG();
 		printk(KERN_ERR "destroying an hmm with still active mirror\n"
@@ -410,6 +1745,7 @@ out:
 	event->laddr = laddr;
 	event->backoff = false;
 	INIT_LIST_HEAD(&event->fences);
+	INIT_LIST_HEAD(&event->ranges);
 	hmm->nevents++;
 	list_add_tail(&event->list, &hmm->pending);
 
@@ -447,11 +1783,116 @@ wait:
 	goto retry_wait;
 }
 
+static int hmm_migrate_to_lmem(struct hmm *hmm,
+			       struct vm_area_struct *vma,
+			       unsigned long faddr,
+			       unsigned long laddr,
+			       bool adjust)
+{
+	struct hmm_range *range;
+	struct hmm_rmem *rmem;
+	int ret = 0;
+
+	if (unlikely(anon_vma_prepare(vma))) {
+		return -ENOMEM;
+	}
+
+retry:
+	spin_lock(&hmm->lock);
+	range = hmm_range_tree_iter_first(&hmm->ranges, faddr, laddr - 1);
+	while (range && faddr < laddr) {
+		struct hmm_device *device;
+		unsigned long fuid, luid, cfaddr, claddr;
+		int r;
+
+		cfaddr = max(faddr, range->faddr);
+		claddr = min(laddr, range->laddr);
+		fuid = range->fuid + ((cfaddr - range->faddr) >> PAGE_SHIFT);
+		luid = fuid + ((claddr - cfaddr) >> PAGE_SHIFT);
+		faddr = min(range->laddr, laddr);
+		rmem = hmm_rmem_ref(range->rmem);
+		device = rmem->device;
+		spin_unlock(&hmm->lock);
+
+		r = hmm_rmem_migrate_to_lmem(rmem, vma, cfaddr, fuid,
+					     luid, adjust);
+		hmm_rmem_unref(rmem);
+		if (r) {
+			ret = ret ? ret : r;
+			hmm_mirror_cleanup(range->mirror);
+			goto retry;
+		}
+
+		spin_lock(&hmm->lock);
+		range = hmm_range_tree_iter_first(&hmm->ranges,faddr,laddr-1);
+	}
+	spin_unlock(&hmm->lock);
+
+	return ret;
+}
+
+static unsigned long hmm_ranges_reserve(struct hmm *hmm, struct hmm_event *event)
+{
+	struct hmm_range *range;
+	unsigned long faddr, laddr, count = 0;
+
+	faddr = event->faddr;
+	laddr = event->laddr;
+
+retry:
+	spin_lock(&hmm->lock);
+	range = hmm_range_tree_iter_first(&hmm->ranges, faddr, laddr - 1);
+	while (range) {
+		if (!hmm_range_reserve(range, event)) {
+			struct hmm_rmem *rmem = hmm_rmem_ref(range->rmem);
+			spin_unlock(&hmm->lock);
+			wait_event(hmm->wait_queue, rmem->event == NULL);
+			hmm_rmem_unref(rmem);
+			goto retry;
+		}
+
+		if (list_empty(&range->elist)) {
+			list_add_tail(&range->elist, &event->ranges);
+			count++;
+		}
+
+		range = hmm_range_tree_iter_next(range, faddr, laddr - 1);
+	}
+	spin_unlock(&hmm->lock);
+
+	return count;
+}
+
+static void hmm_ranges_release(struct hmm *hmm, struct hmm_event *event)
+{
+	struct hmm_range *range, *next;
+
+	list_for_each_entry_safe (range, next, &event->ranges, elist) {
+		hmm_range_release(range, event);
+	}
+}
+
 static void hmm_update_mirrors(struct hmm *hmm,
 			       struct vm_area_struct *vma,
 			       struct hmm_event *event)
 {
 	unsigned long faddr, laddr;
+	bool migrate = false;
+
+	switch (event->etype) {
+	case HMM_COW:
+		migrate = true;
+		break;
+	case HMM_MUNMAP:
+		migrate = vma->vm_file ? true : false;
+		break;
+	default:
+		break;
+	}
+
+	if (hmm_ranges_reserve(hmm, event) && migrate) {
+		hmm_migrate_to_lmem(hmm,vma,event->faddr,event->laddr,false);
+	}
 
 	for (faddr = event->faddr; faddr < event->laddr; faddr = laddr) {
 		struct hmm_mirror *mirror;
@@ -494,6 +1935,7 @@ retry_ranges:
 			}
 		}
 	}
+	hmm_ranges_release(hmm, event);
 }
 
 static int hmm_fault_mm(struct hmm *hmm,
@@ -529,6 +1971,98 @@ static int hmm_fault_mm(struct hmm *hmm,
 	return 0;
 }
 
+/* see include/linux/hmm.h */
+int hmm_mm_fault(struct mm_struct *mm,
+		 struct vm_area_struct *vma,
+		 unsigned long addr,
+		 pte_t *pte,
+		 pmd_t *pmd,
+		 unsigned int fault_flags,
+		 pte_t opte)
+{
+	struct hmm_mirror *mirror = NULL;
+	struct hmm_device *device;
+	struct hmm_event *event;
+	struct hmm_range *range;
+	struct hmm_rmem *rmem = NULL;
+	unsigned long uid, faddr, laddr;
+	swp_entry_t entry;
+	struct hmm *hmm = hmm_ref(mm->hmm);
+	int ret;
+
+	if (!hmm) {
+		BUG();
+		return VM_FAULT_SIGBUS;
+	}
+
+	/* Find the corresponding rmem. */
+	entry = pte_to_swp_entry(opte);
+	if (!is_hmm_entry(entry)) {
+		//print_bad_pte(vma, addr, opte, NULL);
+		hmm_unref(hmm);
+		return VM_FAULT_SIGBUS;
+	}
+	uid = hmm_entry_uid(entry);
+	if (!uid) {
+		/* Poisonous hmm swap entry. */
+		hmm_unref(hmm);
+		return VM_FAULT_SIGBUS;
+	}
+
+	rmem = hmm_rmem_find(uid);
+	if (!rmem) {
+		hmm_unref(hmm);
+		if (pte_same(*pte, opte)) {
+			//print_bad_pte(vma, addr, opte, NULL);
+			return VM_FAULT_SIGBUS;
+		}
+		return 0;
+	}
+
+	faddr = addr & PAGE_MASK;
+	/* FIXME use the readahead value as a hint on how much to migrate. */
+	laddr = min(faddr + (16 << PAGE_SHIFT), vma->vm_end);
+	spin_lock(&rmem->lock);
+	list_for_each_entry (range, &rmem->ranges, rlist) {
+		if (faddr < range->faddr || faddr >= range->laddr) {
+			continue;
+		}
+		if (range->mirror->hmm == hmm) {
+			laddr = min(laddr, range->laddr);
+			mirror = hmm_mirror_ref(range->mirror);
+			break;
+		}
+	}
+	spin_unlock(&rmem->lock);
+	hmm_rmem_unref(rmem);
+	hmm_unref(hmm);
+	if (mirror == NULL) {
+		if (pte_same(*pte, opte)) {
+			//print_bad_pte(vma, addr, opte, NULL);
+			return VM_FAULT_SIGBUS;
+		}
+		return 0;
+	}
+
+	device = rmem->device;
+	event = hmm_event_get(hmm, faddr, laddr, HMM_MIGRATE_TO_LMEM);
+	hmm_ranges_reserve(hmm, event);
+	ret = hmm_migrate_to_lmem(hmm, vma, faddr, laddr, true);
+	hmm_ranges_release(hmm, event);
+	hmm_event_unqueue(hmm, event);
+	hmm_mirror_unref(mirror);
+	switch (ret) {
+	case 0:
+		break;
+	case -ENOMEM:
+		return VM_FAULT_OOM;
+	default:
+		return VM_FAULT_SIGBUS;
+	}
+
+	return VM_FAULT_MAJOR;
+}
+
 
 
 
@@ -726,16 +2260,15 @@ static struct mmu_notifier_ops hmm_notifier_ops = {
  * device page table (through hmm callback). Or provide helper functions use by
  * the device driver to fault in range of memory in the device page table.
  */
-
-static int hmm_mirror_update(struct hmm_mirror *mirror,
-			     struct vm_area_struct *vma,
-			     unsigned long faddr,
-			     unsigned long laddr,
-			     struct hmm_event *event)
+
+static int hmm_mirror_lmem_update(struct hmm_mirror *mirror,
+				  unsigned long faddr,
+				  unsigned long laddr,
+				  struct hmm_event *event,
+				  bool dirty)
 {
 	struct hmm_device *device = mirror->device;
 	struct hmm_fence *fence;
-	bool dirty = !!(vma->vm_file);
 
 	fence = device->ops->lmem_update(mirror, faddr, laddr,
 					 event->etype, dirty);
@@ -749,6 +2282,175 @@ static int hmm_mirror_update(struct hmm_mirror *mirror,
 	return 0;
 }
 
+static int hmm_mirror_rmem_update(struct hmm_mirror *mirror,
+				  struct hmm_rmem *rmem,
+				  unsigned long faddr,
+				  unsigned long laddr,
+				  unsigned long fuid,
+				  struct hmm_event *event,
+				  bool dirty)
+{
+	struct hmm_device *device = mirror->device;
+	struct hmm_fence *fence;
+
+	fence = device->ops->rmem_update(mirror, rmem, faddr, laddr,
+					 fuid, event->etype, dirty);
+	if (fence) {
+		if (IS_ERR(fence)) {
+			return PTR_ERR(fence);
+		}
+		fence->mirror = mirror;
+		list_add_tail(&fence->list, &event->fences);
+	}
+	return 0;
+}
+
+static int hmm_mirror_update(struct hmm_mirror *mirror,
+			     struct vm_area_struct *vma,
+			     unsigned long faddr,
+			     unsigned long laddr,
+			     struct hmm_event *event)
+{
+	struct hmm *hmm = mirror->hmm;
+	unsigned long caddr = faddr;
+	bool free = false, dirty = !!(vma->vm_flags & VM_SHARED);
+	int ret;
+
+	switch (event->etype) {
+	case HMM_MUNMAP:
+		free = true;
+		break;
+	default:
+		break;
+	}
+
+	for (; caddr < laddr;) {
+		struct hmm_range *range;
+		unsigned long naddr;
+
+		spin_lock(&hmm->lock);
+		range = hmm_range_tree_iter_first(&hmm->ranges,caddr,laddr-1);
+		if (range && range->mirror != mirror) {
+			range = NULL;
+		}
+		spin_unlock(&hmm->lock);
+
+		/* At this point the range is on the event list and thus it can
+		 * not disappear.
+		 */
+		BUG_ON(range && list_empty(&range->elist));
+
+		if (!range || (range->faddr > caddr)) {
+			naddr = range ? range->faddr : laddr;
+			ret = hmm_mirror_lmem_update(mirror, caddr, naddr,
+						     event, dirty);
+			if (ret) {
+				return ret;
+			}
+			caddr = naddr;
+		}
+		if (range) {
+			unsigned long fuid;
+
+			naddr = min(range->laddr, laddr);
+			fuid = range->fuid+((caddr-range->faddr)>>PAGE_SHIFT);
+			ret = hmm_mirror_rmem_update(mirror,range->rmem,caddr,
+						     naddr,fuid,event,dirty);
+			caddr = naddr;
+			if (ret) {
+				return ret;
+			}
+			if (free) {
+				BUG_ON((caddr > range->faddr) ||
+				       (naddr < range->laddr));
+				hmm_range_fini_clear(range, vma);
+			}
+		}
+	}
+	return 0;
+}
+
+static unsigned long hmm_mirror_ranges_reserve(struct hmm_mirror *mirror,
+					       struct hmm_event *event)
+{
+	struct hmm_range *range;
+	unsigned long faddr, laddr, count = 0;
+	struct hmm *hmm = mirror->hmm;
+
+	faddr = event->faddr;
+	laddr = event->laddr;
+
+retry:
+	spin_lock(&hmm->lock);
+	range = hmm_range_tree_iter_first(&hmm->ranges, faddr, laddr - 1);
+	while (range) {
+		if (range->mirror == mirror) {
+			if (!hmm_range_reserve(range, event)) {
+				struct hmm_rmem *rmem;
+
+				rmem = hmm_rmem_ref(range->rmem);
+				spin_unlock(&hmm->lock);
+				wait_event(hmm->wait_queue, rmem->event == NULL);
+				hmm_rmem_unref(rmem);
+				goto retry;
+			}
+			if (list_empty(&range->elist)) {
+				list_add_tail(&range->elist, &event->ranges);
+				count++;
+			}
+		}
+		range = hmm_range_tree_iter_next(range, faddr, laddr - 1);
+	}
+	spin_unlock(&hmm->lock);
+
+	return count;
+}
+
+static void hmm_mirror_ranges_migrate(struct hmm_mirror *mirror,
+				      struct vm_area_struct *vma,
+				      struct hmm_event *event)
+{
+	struct hmm_range *range;
+	struct hmm *hmm = mirror->hmm;
+
+	spin_lock(&hmm->lock);
+	range = hmm_range_tree_iter_first(&hmm->ranges,
+					  vma->vm_start,
+					  vma->vm_end - 1);
+	while (range) {
+		struct hmm_rmem *rmem;
+
+		if (range->mirror != mirror) {
+			goto next;
+		}
+		rmem = hmm_rmem_ref(range->rmem);
+		spin_unlock(&hmm->lock);
+
+		hmm_rmem_migrate_to_lmem(rmem, vma, range->faddr,
+					 hmm_range_fuid(range),
+					 hmm_range_luid(range),
+					 true);
+		hmm_rmem_unref(rmem);
+
+		spin_lock(&hmm->lock);
+	next:
+		range = hmm_range_tree_iter_first(&hmm->ranges,
+						  vma->vm_start,
+						  vma->vm_end - 1);
+	}
+	spin_unlock(&hmm->lock);
+}
+
+static void hmm_mirror_ranges_release(struct hmm_mirror *mirror,
+				      struct hmm_event *event)
+{
+	struct hmm_range *range, *next;
+
+	list_for_each_entry_safe (range, next, &event->ranges, elist) {
+		hmm_range_release(range, event);
+	}
+}
+
 static void hmm_mirror_cleanup(struct hmm_mirror *mirror)
 {
 	struct vm_area_struct *vma;
@@ -778,11 +2480,16 @@ static void hmm_mirror_cleanup(struct hmm_mirror *mirror)
 		faddr = max(faddr, vma->vm_start);
 		laddr = vma->vm_end;
 
+		hmm_mirror_ranges_reserve(mirror, event);
+
 		hmm_mirror_update(mirror, vma, faddr, laddr, event);
 		list_for_each_entry_safe (fence, next, &event->fences, list) {
 			hmm_device_fence_wait(device, fence);
 		}
 
+		hmm_mirror_ranges_migrate(mirror, vma, event);
+		hmm_mirror_ranges_release(mirror, event);
+
 		if (laddr >= vma->vm_end) {
 			vma = vma->vm_next;
 		}
@@ -949,6 +2656,33 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror)
 }
 EXPORT_SYMBOL(hmm_mirror_unregister);
 
+static int hmm_mirror_rmem_fault(struct hmm_mirror *mirror,
+				 struct hmm_fault *fault,
+				 struct vm_area_struct *vma,
+				 struct hmm_range *range,
+				 struct hmm_event *event,
+				 unsigned long faddr,
+				 unsigned long laddr,
+				 bool write)
+{
+	struct hmm_device *device = mirror->device;
+	struct hmm_rmem *rmem = range->rmem;
+	unsigned long fuid, luid, npages;
+	int ret;
+
+	if (range->mirror != mirror) {
+		/* Returning -EAGAIN will force cpu page fault path. */
+		return -EAGAIN;
+	}
+
+	npages = (range->laddr - range->faddr) >> PAGE_SHIFT;
+	fuid = range->fuid + ((faddr - range->faddr) >> PAGE_SHIFT);
+	luid = fuid + npages;
+
+	ret = device->ops->rmem_fault(mirror, rmem, faddr, laddr, fuid, fault);
+	return ret;
+}
+
 static int hmm_mirror_lmem_fault(struct hmm_mirror *mirror,
 				 struct hmm_fault *fault,
 				 unsigned long faddr,
@@ -995,6 +2729,7 @@ int hmm_mirror_fault(struct hmm_mirror *mirror,
 retry:
 	down_read(&hmm->mm->mmap_sem);
 	event = hmm_event_get(hmm, caddr, naddr, HMM_DEVICE_FAULT);
+	hmm_ranges_reserve(hmm, event);
 	/* FIXME handle gate area ? and guard page */
 	vma = find_extend_vma(hmm->mm, caddr);
 	if (!vma) {
@@ -1031,6 +2766,29 @@ retry:
 
 	for (; caddr < event->laddr;) {
 		struct hmm_fault_mm fault_mm;
+		struct hmm_range *range;
+
+		spin_lock(&hmm->lock);
+		range = hmm_range_tree_iter_first(&hmm->ranges,
+						  caddr,
+						  naddr - 1);
+		if (range && range->faddr > caddr) {
+			naddr = range->faddr;
+			range = NULL;
+		}
+		spin_unlock(&hmm->lock);
+		if (range) {
+			naddr = min(range->laddr, event->laddr);
+			ret = hmm_mirror_rmem_fault(mirror,fault,vma,range,
+						    event,caddr,naddr,write);
+			if (ret) {
+				do_fault = (ret == -EAGAIN);
+				goto out;
+			}
+			caddr = naddr;
+			naddr = event->laddr;
+			continue;
+		}
 
 		fault_mm.mm = vma->vm_mm;
 		fault_mm.vma = vma;
@@ -1067,6 +2825,7 @@ retry:
 	}
 
 out:
+	hmm_ranges_release(hmm, event);
 	hmm_event_unqueue(hmm, event);
 	if (do_fault && !event->backoff && !mirror->dead) {
 		do_fault = false;
@@ -1092,6 +2851,334 @@ EXPORT_SYMBOL(hmm_mirror_fault);
 
 
 
+/* hmm_migrate - Memory migration to/from local memory from/to remote memory.
+ *
+ * Below are functions that handle migration to/from local memory from/to
+ * remote memory (rmem).
+ *
+ * Migration to remote memory is a multi-step process: first pages are
+ * unmapped and missing pages are either allocated or accounted as new
+ * allocations. Then pages are copied to remote memory. Finally the remote
+ * memory is faulted so that the device driver updates the device page table.
+ *
+ * The device driver can decide to abort migration to remote memory at any
+ * step of the process by returning a special value from the callback
+ * corresponding to that step.
+ *
+ * Migration to local memory is simpler. First pages are allocated, then the
+ * remote memory is copied into those pages. Once dma is done the pages are
+ * remapped inside the cpu page table or inside the page cache (for shared
+ * memory) and finally the rmem is freed.
+ */
+
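+/* As a rough summary (illustrative only, see hmm_migrate_lmem_to_rmem() and
+ * hmm_rmem_migrate_to_lmem() below for the authoritative sequence), the device
+ * driver callbacks are invoked as follows:
+ *
+ *   to rmem: ->rmem_alloc(), then ->lmem_to_rmem() to copy the pages, then
+ *            ->rmem_fault() so the device page table points to the rmem.
+ *   to lmem: ->rmem_update(..., HMM_MIGRATE_TO_LMEM, ...) to invalidate the
+ *            device mapping, then ->rmem_to_lmem() to copy back, then the cpu
+ *            page table is remapped (hmm_rmem_remap_anon()).
+ */
+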
+/* see include/linux/hmm.h */
+int hmm_migrate_rmem_to_lmem(struct hmm_mirror *mirror,
+			     unsigned long faddr,
+			     unsigned long laddr)
+{
+	struct hmm *hmm = mirror->hmm;
+	struct vm_area_struct *vma;
+	struct hmm_event *event;
+	unsigned long next;
+	int ret = 0;
+
+	event = hmm_event_get(hmm, faddr, laddr, HMM_MIGRATE_TO_LMEM);
+	if (!hmm_ranges_reserve(hmm, event)) {
+		hmm_event_unqueue(hmm, event);
+		return 0;
+	}
+
+	hmm_mirror_ref(mirror);
+	down_read(&hmm->mm->mmap_sem);
+	vma = find_vma(hmm->mm, faddr);
+	faddr = max(vma->vm_start, faddr);
+	for (; vma && (faddr < laddr); faddr = next) {
+		next = min(laddr, vma->vm_end);
+
+		ret = hmm_migrate_to_lmem(hmm, vma, faddr, next, true);
+		if (ret) {
+			break;
+		}
+
+		vma = vma->vm_next;
+		next = max(vma->vm_start, next);
+	}
+	up_read(&hmm->mm->mmap_sem);
+	hmm_ranges_release(hmm, event);
+	hmm_event_unqueue(hmm, event);
+	hmm_mirror_unref(mirror);
+	return ret;
+}
+EXPORT_SYMBOL(hmm_migrate_rmem_to_lmem);
+
+static void hmm_migrate_abort(struct hmm_mirror *mirror,
+			      struct hmm_fault *fault,
+			      unsigned long *pfns,
+			      unsigned long fuid)
+{
+	struct vm_area_struct *vma = fault->vma;
+	struct hmm_rmem rmem;
+	unsigned long i, npages;
+
+	npages = (fault->laddr - fault->faddr) >> PAGE_SHIFT;
+	for (i = npages - 1; i > 0; --i) {
+		if (pfns[i]) {
+			break;
+		}
+		npages = i;
+	}
+	if (!npages) {
+		return;
+	}
+
+	/* Fake temporary rmem object. */
+	hmm_rmem_init(&rmem, mirror->device);
+	rmem.fuid = fuid;
+	rmem.luid = fuid + npages;
+	rmem.pfns = pfns;
+
+	if (!(vma->vm_file)) {
+		unsigned long faddr, laddr;
+
+		faddr = fault->faddr;
+		laddr = faddr + (npages << PAGE_SHIFT);
+
+		/* The remapping fails only if something goes terribly wrong. */
+		if (hmm_rmem_remap_anon(&rmem, vma, faddr, laddr, fuid)) {
+
+			WARN_ONCE(1, "hmm: something is terribly wrong.\n");
+			hmm_rmem_poison_range(&rmem, vma->vm_mm, vma,
+					      faddr, laddr, fuid);
+		}
+	} else {
+		BUG();
+	}
+
+	/* Ok, officially dead. */
+	if (fault->rmem) {
+		fault->rmem->dead = true;
+	}
+
+	for (i = 0; i < npages; ++i) {
+		struct page *page = hmm_pfn_to_page(pfns[i]);
+
+		if (!page) {
+			pfns[i] = 0;
+			continue;
+		}
+		if (test_bit(HMM_PFN_VALID_ZERO, &pfns[i])) {
+			/* Properly uncharge memory. */
+			add_mm_counter(vma->vm_mm, MM_ANONPAGES, -1);
+			mem_cgroup_uncharge_mm(vma->vm_mm);
+			pfns[i] = 0;
+			continue;
+		}
+		if (test_bit(HMM_PFN_LOCK, &pfns[i])) {
+			unlock_page(page);
+			clear_bit(HMM_PFN_LOCK, &pfns[i]);
+		}
+		page_remove_rmap(page);
+		page_cache_release(page);
+		pfns[i] = 0;
+	}
+}
+
+/* see include/linux/hmm.h */
+int hmm_migrate_lmem_to_rmem(struct hmm_fault *fault,
+			     struct hmm_mirror *mirror)
+{
+	struct vm_area_struct *vma;
+	struct hmm_device *device;
+	struct hmm_range *range;
+	struct hmm_fence *fence;
+	struct hmm_event *event;
+	struct hmm_rmem rmem;
+	unsigned long i, npages;
+	struct hmm *hmm;
+	int ret;
+
+	mirror = hmm_mirror_ref(mirror);
+	if (!fault || !mirror || fault->faddr > fault->laddr) {
+		return -EINVAL;
+	}
+	if (mirror->dead) {
+		hmm_mirror_unref(mirror);
+		return -ENODEV;
+	}
+	hmm = mirror->hmm;
+	device = mirror->device;
+	if (!device->rmem) {
+		hmm_mirror_unref(mirror);
+		return -EINVAL;
+	}
+	fault->rmem = NULL;
+	fault->faddr = fault->faddr & PAGE_MASK;
+	fault->laddr = PAGE_ALIGN(fault->laddr);
+	hmm_rmem_init(&rmem, mirror->device);
+	event = hmm_event_get(hmm, fault->faddr, fault->laddr,
+			      HMM_MIGRATE_TO_RMEM);
+	rmem.event = event;
+	hmm = mirror->hmm;
+
+	range = kmalloc(sizeof(struct hmm_range), GFP_KERNEL);
+	if (range == NULL) {
+		hmm_event_unqueue(hmm, event);
+		hmm_mirror_unref(mirror);
+		return -ENOMEM;
+	}
+
+	down_read(&hmm->mm->mmap_sem);
+	vma = find_vma_intersection(hmm->mm, fault->faddr, fault->laddr);
+	if (!vma) {
+		kfree(range);
+		range = NULL;
+		ret = -EFAULT;
+		goto out;
+	}
+	/* FIXME support HUGETLB */
+	if ((vma->vm_flags & (VM_IO | VM_PFNMAP | VM_MIXEDMAP | VM_HUGETLB))) {
+		kfree(range);
+		range = NULL;
+		ret = -EACCES;
+		goto out;
+	}
+	if (vma->vm_file) {
+		kfree(range);
+		range = NULL;
+		ret = -EBUSY;
+		goto out;
+	}
+	/* Adjust range to this vma only. */
+	event->faddr = fault->faddr = max(fault->faddr, vma->vm_start);
+	event->laddr = fault->laddr = min(fault->laddr, vma->vm_end);
+	npages = (fault->laddr - fault->faddr) >> PAGE_SHIFT;
+	fault->vma = vma;
+
+	ret = hmm_rmem_alloc(&rmem, npages);
+	if (ret) {
+		kfree(range);
+		range = NULL;
+		goto out;
+	}
+
+	/* Prior to unmapping add to the hmm range tree so any pagefault can
+	 * find the proper range.
+	 */
+	hmm_range_init(range, mirror, &rmem, fault->faddr,
+		       fault->laddr, rmem.fuid);
+	hmm_range_insert(range);
+
+	ret = hmm_rmem_unmap(&rmem, vma, fault->faddr, fault->laddr);
+	if (ret) {
+		hmm_migrate_abort(mirror, fault, rmem.pfns, rmem.fuid);
+		goto out;
+	}
+
+	fault->rmem = device->ops->rmem_alloc(device, fault);
+	if (IS_ERR(fault->rmem)) {
+		ret = PTR_ERR(fault->rmem);
+		hmm_migrate_abort(mirror, fault, rmem.pfns, rmem.fuid);
+		goto out;
+	}
+	if (fault->rmem == NULL) {
+		hmm_migrate_abort(mirror, fault, rmem.pfns, rmem.fuid);
+		ret = 0;
+		goto out;
+	}
+	if (event->backoff) {
+		ret = -EBUSY;
+		hmm_migrate_abort(mirror, fault, rmem.pfns, rmem.fuid);
+		goto out;
+	}
+
+	hmm_rmem_init(fault->rmem, mirror->device);
+	spin_lock(&_hmm_rmems_lock);
+	fault->rmem->event = event;
+	hmm_rmem_tree_remove(&rmem, &_hmm_rmems);
+	fault->rmem->fuid = rmem.fuid;
+	fault->rmem->luid = rmem.luid;
+	hmm_rmem_tree_insert(fault->rmem, &_hmm_rmems);
+	fault->rmem->pfns = rmem.pfns;
+	range->rmem = fault->rmem;
+	list_del_init(&range->rlist);
+	list_add_tail(&range->rlist, &fault->rmem->ranges);
+	rmem.event = NULL;
+	spin_unlock(&_hmm_rmems_lock);
+
+	fence = device->ops->lmem_to_rmem(fault->rmem,rmem.fuid,rmem.luid);
+	if (IS_ERR(fence)) {
+		hmm_migrate_abort(mirror, fault, rmem.pfns, rmem.fuid);
+		goto out;
+	}
+
+	ret = hmm_device_fence_wait(device, fence);
+	if (ret) {
+		hmm_migrate_abort(mirror, fault, rmem.pfns, rmem.fuid);
+		goto out;
+	}
+
+	ret = device->ops->rmem_fault(mirror, range->rmem, range->faddr,
+				      range->laddr, range->fuid, NULL);
+	if (ret) {
+		hmm_migrate_abort(mirror, fault, rmem.pfns, rmem.fuid);
+		goto out;
+	}
+
+	for (i = 0; i < npages; ++i) {
+		struct page *page = hmm_pfn_to_page(rmem.pfns[i]);
+
+		if (test_bit(HMM_PFN_VALID_ZERO, &rmem.pfns[i])) {
+			rmem.pfns[i] = rmem.pfns[i] & HMM_PFN_CLEAR;
+			continue;
+		}
+		/* We only decrement the page count now so that cow happens
+		 * properly while the page is in flight.
+		 */
+		if (PageAnon(page)) {
+			unlock_page(page);
+			page_remove_rmap(page);
+			page_cache_release(page);
+			rmem.pfns[i] &= HMM_PFN_CLEAR;
+		} else {
+			/* Otherwise this means the page is in the pagecache.
+			 * Keep a reference and the page count elevated.
+			 */
+			clear_bit(HMM_PFN_LOCK, &rmem.pfns[i]);
+			/* We do not want the side effects of page_remove_rmap
+			 * (ie the zone page accounting update) but we do want a
+			 * zero mapcount so writeback works properly.
+			 */
+			atomic_add(-1, &page->_mapcount);
+			unlock_page(page);
+		}
+	}
+
+	hmm_mirror_ranges_release(mirror, event);
+	hmm_event_unqueue(hmm, event);
+	up_read(&hmm->mm->mmap_sem);
+	hmm_mirror_unref(mirror);
+	return 0;
+
+out:
+	if (!fault->rmem) {
+		kfree(rmem.pfns);
+		spin_lock(&_hmm_rmems_lock);
+		hmm_rmem_tree_remove(&rmem, &_hmm_rmems);
+		spin_unlock(&_hmm_rmems_lock);
+	}
+	hmm_mirror_ranges_release(mirror, event);
+	hmm_event_unqueue(hmm, event);
+	up_read(&hmm->mm->mmap_sem);
+	hmm_range_unref(range);
+	hmm_rmem_unref(fault->rmem);
+	hmm_mirror_unref(mirror);
+	return ret;
+}
+EXPORT_SYMBOL(hmm_migrate_lmem_to_rmem);
+
+
+
+
 /* hmm_device - Each device driver must register one and only one hmm_device
  *
  * The hmm_device is the link btw hmm and each device driver.
@@ -1140,9 +3227,22 @@ int hmm_device_register(struct hmm_device *device, const char *name)
 	BUG_ON(!device->ops->lmem_fault);
 
 	kref_init(&device->kref);
+	device->rmem = false;
 	device->name = name;
 	mutex_init(&device->mutex);
 	INIT_LIST_HEAD(&device->mirrors);
+	init_waitqueue_head(&device->wait_queue);
+
+	if (device->ops->rmem_alloc &&
+	    device->ops->rmem_update &&
+	    device->ops->rmem_fault &&
+	    device->ops->rmem_to_lmem &&
+	    device->ops->lmem_to_rmem &&
+	    device->ops->rmem_split &&
+	    device->ops->rmem_split_adjust &&
+	    device->ops->rmem_destroy) {
+		device->rmem = true;
+	}
 
 	return 0;
 }
@@ -1179,6 +3279,7 @@ static int __init hmm_module_init(void)
 {
 	int ret;
 
+	spin_lock_init(&_hmm_rmems_lock);
 	ret = init_srcu_struct(&srcu);
 	if (ret) {
 		return ret;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ceaf4d7..88e4acd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -56,6 +56,7 @@
 #include <linux/oom.h>
 #include <linux/lockdep.h>
 #include <linux/file.h>
+#include <linux/hmm.h>
 #include "internal.h"
 #include <net/sock.h>
 #include <net/ip.h>
@@ -6649,6 +6650,8 @@ one_by_one:
  *   2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
  *     target for charge migration. if @target is not NULL, the entry is stored
  *     in target->ent.
+ *   3(MC_TARGET_HMM): if it is an hmm entry, target->page is either NULL or
+ *     points to the page whose charge should be moved.
  *
  * Called with pte lock held.
  */
@@ -6661,6 +6664,7 @@ enum mc_target_type {
 	MC_TARGET_NONE = 0,
 	MC_TARGET_PAGE,
 	MC_TARGET_SWAP,
+	MC_TARGET_HMM,
 };
 
 static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
@@ -6690,6 +6694,9 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
 	struct page *page = NULL;
 	swp_entry_t ent = pte_to_swp_entry(ptent);
 
+	if (is_hmm_entry(ent)) {
+		return swp_to_radix_entry(ent);
+	}
 	if (!move_anon() || non_swap_entry(ent))
 		return NULL;
 	/*
@@ -6764,6 +6771,10 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
 
 	if (!page && !ent.val)
 		return ret;
+	if (radix_tree_exceptional_entry(page)) {
+		ret = MC_TARGET_HMM;
+		return ret;
+	}
 	if (page) {
 		pc = lookup_page_cgroup(page);
 		/*
@@ -7077,6 +7088,41 @@ put:			/* get_mctgt_type() gets the page */
 				mc.moved_swap++;
 			}
 			break;
+		case MC_TARGET_HMM:
+			if (target.page) {
+				page = target.page;
+				pc = lookup_page_cgroup(page);
+				if (!mem_cgroup_move_account(page, 1, pc,
+							     mc.from, mc.to)) {
+					mc.precharge--;
+					/* we uncharge from mc.from later. */
+					mc.moved_charge++;
+				}
+				put_page(page);
+			} else if (vma->vm_flags & VM_SHARED) {
+				/* Someone migrated the memory after we did
+				 * the pagecache lookup.
+				 */
+				/* FIXME can the precharge/moved_charge then
+				 * become wrong?
+				 */
+				pte_unmap_unlock(pte - 1, ptl);
+				cond_resched();
+				goto retry;
+			} else {
+				unsigned long flags;
+
+				move_lock_mem_cgroup(mc.from, &flags);
+				move_lock_mem_cgroup(mc.to, &flags);
+				mem_cgroup_charge_statistics(mc.from, NULL, true, -1);
+				mem_cgroup_charge_statistics(mc.to, NULL, true, 1);
+				move_unlock_mem_cgroup(mc.to, &flags);
+				move_unlock_mem_cgroup(mc.from, &flags);
+				mc.precharge--;
+				/* we uncharge from mc.from later. */
+				mc.moved_charge++;
+			}
+			break;
 		default:
 			break;
 		}
diff --git a/mm/memory.c b/mm/memory.c
index 1e164a1..d35bc65 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -53,6 +53,7 @@
 #include <linux/writeback.h>
 #include <linux/memcontrol.h>
 #include <linux/mmu_notifier.h>
+#include <linux/hmm.h>
 #include <linux/kallsyms.h>
 #include <linux/swapops.h>
 #include <linux/elf.h>
@@ -851,6 +852,9 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 					if (pte_swp_soft_dirty(*src_pte))
 						pte = pte_swp_mksoft_dirty(pte);
 					set_pte_at(src_mm, addr, src_pte, pte);
+				} else if (is_hmm_entry(entry)) {
+					/* FIXME do we want to handle rblk fork, just mapcount rblk if so. */
+					BUG_ON(1);
 				}
 			}
 		}
@@ -3079,6 +3083,9 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			migration_entry_wait(mm, pmd, address);
 		} else if (is_hwpoison_entry(entry)) {
 			ret = VM_FAULT_HWPOISON;
+		} else if (is_hmm_entry(entry)) {
+			ret = hmm_mm_fault(mm, vma, address, page_table,
+					   pmd, flags, orig_pte);
 		} else {
 			print_bad_pte(vma, address, orig_pte, NULL);
 			ret = VM_FAULT_SIGBUS;
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 07/11] hmm: support moving anonymous page to remote memory
@ 2014-05-02 13:52   ` j.glisse
  0 siblings, 0 replies; 107+ messages in thread
From: j.glisse @ 2014-05-02 13:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel
  Cc: Jérôme Glisse, Sherry Cheung, Subhash Gutti,
	Mark Hairgrove, John Hubbard, Jatin Kumar

From: Jérôme Glisse <jglisse@redhat.com>

Motivation:

Migrating to device memory allows the device to access that memory through a
link with far greater bandwidth as well as lower latency. Migration to device
memory is of course only meaningful if the memory will mostly be accessed by
the device over a long period of time.

Because hmm only aims to provide an API that facilitates such use, it does not
deal with policy on when and what to migrate to remote memory. It is expected
that device drivers that use hmm will have the information to make such choices.

Implementation:

This uses a two level structure to track remote memory. The first level is a
range structure that matches a range of addresses with a specific remote memory
object. This allows different ranges of addresses to point to the same remote
memory object (useful for shared memory).

The second level is a structure holding the hmm specific information about the
remote memory. This remote memory structure is allocated by the device driver
and thus can be embedded inside the remote memory structure that is specific to
the device driver.

Each remote memory object is given a range of unique ids. Those unique ids are
used to create special hmm swap entries. For anonymous memory the cpu page
table entries are set to such an hmm swap entry, and on cpu page fault the
unique id is used to find the remote memory and migrate it back to system
memory.

Events other than a cpu page fault can also trigger migration back to system
memory. For instance on fork, to simplify things, the remote memory is migrated
back to system memory.
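
A minimal sketch of the second level described above, assuming a hypothetical
"foo" driver (none of the foo_* names below are part of this patch; only
struct hmm_rmem and the rmem_destroy callback are):

/* Hypothetical driver object embedding the hmm_rmem second level. */
struct foo_rmem {
	struct hmm_rmem	rmem;		/* hmm's view of the remote memory */
	u64		dev_addr;	/* driver private device address */
};

static void foo_rmem_destroy(struct hmm_rmem *rmem)
{
	/* Recover the driver object from the embedded hmm_rmem. */
	struct foo_rmem *frmem = container_of(rmem, struct foo_rmem, rmem);

	/* Release device pages backing frmem->dev_addr, then the struct. */
	kfree(frmem);
}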

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 include/linux/hmm.h          |  469 ++++++++-
 include/linux/mmu_notifier.h |    1 +
 include/linux/swap.h         |   12 +-
 include/linux/swapops.h      |   33 +-
 mm/hmm.c                     | 2307 ++++++++++++++++++++++++++++++++++++++++--
 mm/memcontrol.c              |   46 +
 mm/memory.c                  |    7 +
 7 files changed, 2768 insertions(+), 107 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index e9c7722..96f41c4 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -56,10 +56,10 @@
 
 struct hmm_device;
 struct hmm_device_ops;
-struct hmm_migrate;
 struct hmm_mirror;
 struct hmm_fault;
 struct hmm_event;
+struct hmm_rmem;
 struct hmm;
 
 /* The hmm provide page informations to the device using hmm pfn value. Below
@@ -67,15 +67,34 @@ struct hmm;
  * type of page, dirty page, page is locked or not, ...).
  *
  *   HMM_PFN_VALID_PAGE this means the pfn correspond to valid page.
- *   HMM_PFN_VALID_ZERO this means the pfn is the special zero page.
+ *   HMM_PFN_VALID_ZERO this means the pfn is the special zero page; either
+ *     use it or directly clear the rmem with zeroes, whichever is the fastest
+ *     method for the device.
  *   HMM_PFN_DIRTY set when the page is dirty.
  *   HMM_PFN_WRITE is set if there is no need to call page_mkwrite
+ *   HMM_PFN_LOCK is only set while the rmem object is undergoing migration.
+ *   HMM_PFN_LMEM_UPTODATE the page that is in the rmem pfn array is uptodate.
+ *   HMM_PFN_RMEM_UPTODATE the rmem copy of the page is uptodate.
+ *
+ * Device drivers only need to worry about :
+ *   HMM_PFN_VALID_PAGE
+ *   HMM_PFN_VALID_ZERO
+ *   HMM_PFN_DIRTY
+ *   HMM_PFN_WRITE
+ * Device drivers must set/clear the following flags after a successful dma :
+ *   HMM_PFN_LMEM_UPTODATE
+ *   HMM_PFN_RMEM_UPTODATE
+ * All the other flags are for hmm internal use only.
  */
 #define HMM_PFN_SHIFT		(PAGE_SHIFT)
+#define HMM_PFN_CLEAR		(((1UL << HMM_PFN_SHIFT) - 1UL) & ~0x3UL)
 #define HMM_PFN_VALID_PAGE	(0UL)
 #define HMM_PFN_VALID_ZERO	(1UL)
 #define HMM_PFN_DIRTY		(2UL)
 #define HMM_PFN_WRITE		(3UL)
+#define HMM_PFN_LOCK		(4UL)
+#define HMM_PFN_LMEM_UPTODATE	(5UL)
+#define HMM_PFN_RMEM_UPTODATE	(6UL)
 
 static inline struct page *hmm_pfn_to_page(unsigned long pfn)
 {
@@ -95,6 +114,28 @@ static inline void hmm_pfn_set_dirty(unsigned long *pfn)
 	set_bit(HMM_PFN_DIRTY, pfn);
 }
 
+static inline void hmm_pfn_set_lmem_uptodate(unsigned long *pfn)
+{
+	set_bit(HMM_PFN_LMEM_UPTODATE, pfn);
+}
+
+static inline void hmm_pfn_set_rmem_uptodate(unsigned long *pfn)
+{
+	set_bit(HMM_PFN_RMEM_UPTODATE, pfn);
+}
+
+static inline void hmm_pfn_clear_lmem_uptodate(unsigned long *pfn)
+{
+	clear_bit(HMM_PFN_LMEM_UPTODATE, pfn);
+}
+
+static inline void hmm_pfn_clear_rmem_uptodate(unsigned long *pfn)
+{
+	clear_bit(HMM_PFN_RMEM_UPTODATE, pfn);
+}
+
+
+
 
 /* hmm_fence - device driver fence to wait for device driver operations.
  *
@@ -283,6 +324,255 @@ struct hmm_device_ops {
 			  unsigned long laddr,
 			  unsigned long *pfns,
 			  struct hmm_fault *fault);
+
+	/* rmem_alloc - allocate a new rmem object.
+	 *
+	 * @device: Device into which to allocate the remote memory.
+	 * @fault:  The fault for which this remote memory is allocated.
+	 * Returns: Valid rmem ptr on success, NULL or ERR_PTR otherwise.
+	 *
+	 * This allows migration to remote memory to operate in several steps.
+	 * First the hmm code will clamp the range that can be migrated and
+	 * will unmap pages and prepare them for migration.
+	 *
+	 * It is only once all of the above steps are done that we know how
+	 * much memory can be migrated, which is when rmem_alloc is called to
+	 * allocate the device rmem object to which memory should be migrated.
+	 *
+	 * The device driver can decide through this callback to abort the
+	 * migration by returning NULL, or it can decide to continue with the
+	 * migration by returning a properly allocated rmem object.
+	 *
+	 * Return rmem or NULL on success, error value otherwise :
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * All other return values trigger a warning and are transformed to -EIO.
+	 */
+	struct hmm_rmem *(*rmem_alloc)(struct hmm_device *device,
+				       struct hmm_fault *fault);
+
+	/* rmem_update() - update device mmu for a range of remote memory.
+	 *
+	 * @mirror: The mirror that link process address space with the device.
+	 * @rmem:   The remote memory under update.
+	 * @faddr:  First address in range (inclusive).
+	 * @laddr:  Last address in range (exclusive).
+	 * @fuid:   First uid of the remote memory at which the update begin.
+	 * @etype:  The type of memory event (unmap, fini, read only, ...).
+	 * @dirty:  Device driver should call hmm_pfn_set_dirty.
+	 * Returns: Valid fence ptr or NULL on success otherwise ERR_PTR.
+	 *
+	 * Called to update device mmu permission/usage for a range of remote
+	 * memory. The event type provide the nature of the update :
+	 *   - range is no longer valid (munmap).
+	 *   - range protection changes (mprotect, COW, ...).
+	 *   - range is unmapped (swap, reclaim, page migration, ...).
+	 *   - ...
+	 *
+	 * Any event that blocks further writes to the memory must also trigger
+	 * a device cache flush, and everything has to be flushed to remote
+	 * memory by the time the wait callback returns (if this callback
+	 * returned a fence; otherwise everything must be flushed by the time
+	 * this callback returns).
+	 *
+	 * The device must properly call hmm_pfn_set_dirty on any page the
+	 * device wrote to since the last call to rmem_update. This is only
+	 * needed if the dirty parameter is true.
+	 *
+	 * The driver should return a fence pointer or NULL on success. It is
+	 * advised to return a fence and to delay waiting for the operation to
+	 * complete until the wait callback. Returning a fence allows hmm to
+	 * batch updates to several devices and to wait on those only once they
+	 * all have scheduled the update.
+	 *
+	 * The device driver must not fail lightly; any failure results in the
+	 * device process being killed.
+	 *
+	 * Return fence or NULL on success, error value otherwise :
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * All other return values trigger a warning and are transformed to -EIO.
+	 */
+	struct hmm_fence *(*rmem_update)(struct hmm_mirror *mirror,
+					 struct hmm_rmem *rmem,
+					 unsigned long faddr,
+					 unsigned long laddr,
+					 unsigned long fuid,
+					 enum hmm_etype etype,
+					 bool dirty);
+
+	/* rmem_fault() - fault range of rmem on the device mmu.
+	 *
+	 * @mirror: The mirror that link process address space with the device.
+	 * @rmem:   The rmem backing this range.
+	 * @faddr:  First address in range (inclusive).
+	 * @laddr:  Last address in range (exclusive).
+	 * @fuid:   First rmem unique id (inclusive).
+	 * @fault:  The fault structure provided by device driver.
+	 * Returns: 0 on success, error value otherwise.
+	 *
+	 * Called to give the device driver the remote memory that is backing a
+	 * range of memory. The device driver can map an rmem page with write
+	 * permission only if the HMM_PFN_WRITE bit is set. If the device wants
+	 * to write to this range of rmem it can call hmm_mirror_fault.
+	 *
+	 * Return error if the scheduled operation failed. Valid values :
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * All other return values trigger a warning and are transformed to -EIO.
+	 */
+	int (*rmem_fault)(struct hmm_mirror *mirror,
+			  struct hmm_rmem *rmem,
+			  unsigned long faddr,
+			  unsigned long laddr,
+			  unsigned long fuid,
+			  struct hmm_fault *fault);
+
+	/* rmem_to_lmem - copy remote memory to local memory.
+	 *
+	 * @rmem:   The remote memory structure.
+	 * @fuid:   First rmem unique id (inclusive) of range to copy.
+	 * @luid:   Last rmem unique id (exclusive) of range to copy.
+	 * Returns: Valid fence ptr or NULL on success otherwise ERR_PTR.
+	 *
+	 * This is called to copy remote memory back to local memory. The device
+	 * driver needs to schedule the dma to copy the remote memory to the
+	 * pages given by the pfns array. The device driver should return a
+	 * fence or an error pointer.
+	 *
+	 * If the device driver does not return a fence then it must wait until
+	 * the dma is done and all device caches are flushed. Moreover the
+	 * device driver must set HMM_PFN_LMEM_UPTODATE on all successfully
+	 * copied pages (setting this flag can be delayed to the fence_wait
+	 * callback).
+	 *
+	 * If a valid fence is returned then hmm will wait on it and reschedule
+	 * any thread that needs rescheduling.
+	 *
+	 * DEVICE DRIVER MUST ABSOLUTELY TRY TO MAKE THIS CALL WORK OTHERWISE
+	 * CPU THREAD WILL GET A SIGBUS.
+	 *
+	 * DEVICE DRIVER MUST SET HMM_PFN_LMEM_UPTODATE ON ALL SUCCESSFULLY
+	 * COPIED PAGES.
+	 *
+	 * Return fence or NULL on success, error value otherwise :
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * All other return values trigger a warning and are transformed to -EIO.
+	 */
+	struct hmm_fence *(*rmem_to_lmem)(struct hmm_rmem *rmem,
+					  unsigned long fuid,
+					  unsigned long luid);
+
+	/* lmem_to_rmem - copy local memory to remote memory.
+	 *
+	 * @rmem:   The remote memory structure.
+	 * @fuid:   First rmem unique id (inclusive) of range to copy.
+	 * @luid:   Last rmem unique id (exclusive) of range to copy.
+	 * Returns: Valid fence ptr or NULL on success otherwise ERR_PTR.
+	 *
+	 * This is called to copy local memory to remote memory. The driver
+	 * needs to schedule the dma to copy the local memory, from the pages
+	 * given by the pfns array, to the remote memory.
+	 *
+	 * The device driver should return a fence or an error pointer. If the
+	 * device driver does not return a fence then it must wait until the
+	 * dma is done. The device driver must set HMM_PFN_RMEM_UPTODATE on all
+	 * successfully copied pages.
+	 *
+	 * If a valid fence is returned then hmm will wait on it and reschedule
+	 * any thread that needs rescheduling.
+	 *
+	 * Failure will result in aborting the migration to remote memory.
+	 *
+	 * DEVICE DRIVER MUST SET HMM_PFN_RMEM_UPTODATE ON ALL SUCCESSFULLY
+	 * COPIED PAGES.
+	 *
+	 * Return fence or NULL on success, error value otherwise :
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * All other return values trigger a warning and are transformed to -EIO.
+	 */
+	struct hmm_fence *(*lmem_to_rmem)(struct hmm_rmem *rmem,
+					  unsigned long fuid,
+					  unsigned long luid);
+
+	/* rmem_split - split rmem.
+	 *
+	 * @rmem:   The remote memory to split.
+	 * @fuid:   First rmem unique id (inclusive) of range to split.
+	 * @luid:   Last rmem unique id (exclusive) of range to split.
+	 * Returns: 0 on success, error value otherwise.
+	 *
+	 * Split remote memory: first the device driver must allocate a new
+	 * remote memory struct, second it must call hmm_rmem_split_new and
+	 * last it must transfer private driver resources from the split rmem
+	 * to the new remote memory struct.
+	 *
+	 * The device driver _can not_ adjust either the fuid or the luid.
+	 *
+	 * Failure should be forwarded if any of the steps fails. The device
+	 * driver does not need to worry about freeing the new remote memory
+	 * object once hmm_rmem_split_new is called, as it will be freed through
+	 * the rmem_destroy callback if anything fails.
+	 *
+	 * DEVICE DRIVER MUST ABSOLUTELY TRY TO MAKE THIS CALL WORK OTHERWISE
+	 * THE WHOLE RMEM WILL BE MIGRATED BACK TO LMEM.
+	 *
+	 * Return error if the operation failed. Valid values :
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * All other return values trigger a warning and are transformed to -EIO.
+	 */
+	int (*rmem_split)(struct hmm_rmem *rmem,
+			  unsigned long fuid,
+			  unsigned long luid);
+
+	/* rmem_split_adjust - split rmem.
+	 *
+	 * @rmem:   The remote memory to split.
+	 * @fuid:   First rmem unique id (inclusive) of range to split.
+	 * @luid:   Last rmem unique id (exclusive) of range to split.
+	 * Returns: 0 on success, error value otherwise.
+	 *
+	 * Split remote memory: first the device driver must allocate a new
+	 * remote memory struct, second it must call hmm_rmem_split_new and
+	 * last it must transfer private driver resources from the split rmem
+	 * to the new remote memory struct.
+	 *
+	 * The device driver _can_ adjust the fuid or the luid with the
+	 * constraint that adjusted_fuid <= fuid and adjusted_luid >= luid.
+	 *
+	 * Failure should be forwarded if any of the steps fails. The device
+	 * driver does not need to worry about freeing the new remote memory
+	 * object once hmm_rmem_split_new is called, as it will be freed through
+	 * the rmem_destroy callback if anything fails.
+	 *
+	 * DEVICE DRIVER MUST ABSOLUTELY TRY TO MAKE THIS CALL WORK OTHERWISE
+	 * THE WHOLE RMEM WILL BE MIGRATED BACK TO LMEM.
+	 *
+	 * Return error if the operation failed. Valid values :
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * All other return values trigger a warning and are transformed to -EIO.
+	 */
+	int (*rmem_split_adjust)(struct hmm_rmem *rmem,
+				 unsigned long fuid,
+				 unsigned long luid);
+
+	/* rmem_destroy - destroy rmem.
+	 *
+	 * @rmem:   The remote memory to destroy.
+	 *
+	 * Destroys the remote memory structure once all references are gone.
+	 */
+	void (*rmem_destroy)(struct hmm_rmem *rmem);
 };
 
 /* struct hmm_device - per device hmm structure
@@ -292,6 +582,7 @@ struct hmm_device_ops {
  * @mutex:      Mutex protecting mirrors list.
  * @ops:        The hmm operations callback.
  * @name:       Device name (uniquely identify the device on the system).
+ * @wait_queue: Wait queue for remote memory operations.
  *
  * Each device that want to mirror an address space must register one of this
  * struct (only once).
@@ -302,6 +593,8 @@ struct hmm_device {
 	struct mutex			mutex;
 	const struct hmm_device_ops	*ops;
 	const char			*name;
+	wait_queue_head_t		wait_queue;
+	bool				rmem;
 };
 
 /* hmm_device_register() - register a device with hmm.
@@ -322,6 +615,88 @@ struct hmm_device *hmm_device_unref(struct hmm_device *device);
 
 
 
+/* hmm_rmem - The rmem struct holds the hmm information of a remote memory block.
+ *
+ * The device driver should derive its remote memory tracking structure from
+ * the hmm_rmem structure. The hmm_rmem structure does not hold any information
+ * about the specifics of the remote memory block (device address or anything
+ * else). It solely stores the information needed to find the rmem when the
+ * cpu tries to access it.
+ */
+
+/* struct hmm_rmem - remote memory block
+ *
+ * @kref:           Reference count.
+ * @device:         The hmm device the remote memory is allocated on.
+ * @event:          The event currently associated with the rmem.
+ * @lock:           Lock protecting the ranges list and event field.
+ * @ranges:         The list of address ranges that point to this rmem.
+ * @node:           Node for rmem unique id tree.
+ * @pgoff:          Page offset into file (in PAGE_SIZE not PAGE_CACHE_SIZE).
+ * @fuid:           First unique id associated with this specific hmm_rmem.
+ * @luid:           Last unique id associated with this specific hmm_rmem.
+ * @subtree_luid:   Optimization for red and black interval tree.
+ * @pfns:           Array of pfn for local memory when some is attached.
+ * @dead:           The remote memory is no longer valid; restart the lookup.
+ *
+ * Each hmm_rmem has a unique range of ids that is used to uniquely identify
+ * remote memory on the cpu side. Those unique ids do not relate in any way to
+ * the device physical address at which the remote memory is located.
+ */
+struct hmm_rmem {
+	struct kref		kref;
+	struct hmm_device	*device;
+	struct hmm_event	*event;
+	spinlock_t		lock;
+	struct list_head	ranges;
+	struct rb_node		node;
+	unsigned long		pgoff;
+	unsigned long		fuid;
+	unsigned long		luid;
+	unsigned long		subtree_luid;
+	unsigned long		*pfns;
+	bool			dead;
+};
+
+struct hmm_rmem *hmm_rmem_ref(struct hmm_rmem *rmem);
+struct hmm_rmem *hmm_rmem_unref(struct hmm_rmem *rmem);
+
+/* hmm_rmem_split_new - helper to split rmem.
+ *
+ * @rmem:   The remote memory to split.
+ * @new:    The new remote memory struct.
+ * Returns: 0 on success, error value otherwise.
+ *
+ * The new remote memory struct must be allocated by the device driver and its
+ * fuid and luid fields must be set to the range the device wishes the new rmem
+ * to cover.
+ *
+ * Moreover all of the below conditions must be true :
+ *   (new->fuid < new->luid)
+ *   (new->fuid >= rmem->fuid && new->luid <= rmem->luid)
+ *   (new->fuid == rmem->fuid || new->luid == rmem->luid)
+ *
+ * This hmm helper function will split the range and perform internal hmm
+ * updates on behalf of the device driver.
+ *
+ * Note that this function must be called by the rmem_split and
+ * rmem_split_adjust callbacks.
+ *
+ * Once this function is called the device driver should not try to free the
+ * new rmem structure no matter what the return value is. Moreover if the
+ * function returns 0 then the device driver should properly update the new
+ * rmem struct.
+ *
+ * Return error if operation failed. Valid value :
+ * -EINVAL If one of the above condition is false.
+ * -ENOMEM If it failed to allocate memory.
+ * 0 on success.
+ */
+int hmm_rmem_split_new(struct hmm_rmem *rmem,
+		       struct hmm_rmem *new);
+
+
+
+
 /* hmm_mirror - device specific mirroring functions.
  *
  * Each device that mirror a process has a uniq hmm_mirror struct associating
@@ -406,6 +781,7 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
  */
 struct hmm_fault {
 	struct vm_area_struct	*vma;
+	struct hmm_rmem		*rmem;
 	unsigned long		faddr;
 	unsigned long		laddr;
 	unsigned long		*pfns;
@@ -450,6 +826,56 @@ struct hmm_mirror *hmm_mirror_unref(struct hmm_mirror *mirror);
 
 
 
+/* hmm_migrate - Memory migration from local memory to remote memory.
+ *
+ * Below are functions that handle migration from local memory to remote memory
+ * (represented by the hmm_rmem struct). This is a multi-step process: first
+ * the range is unmapped, then the device driver, depending on the size of the
+ * unmapped range, can decide to proceed with or abort the migration.
+ */
+
+/* hmm_migrate_rmem_to_lmem() - force migration of some rmem to lmem.
+ *
+ * @mirror: The mirror that link process address space with the device.
+ * @faddr:  First address of the range to migrate to lmem.
+ * @laddr:  Last address of the range to migrate to lmem.
+ * Returns: 0 on success, -EIO or -EINVAL.
+ *
+ * This migrates any remote memory behind a range of addresses to local memory.
+ *
+ * Returns:
+ * 0 success.
+ * -EINVAL if invalid argument.
+ * -EIO if one of the device driver returned this error.
+ */
+int hmm_migrate_rmem_to_lmem(struct hmm_mirror *mirror,
+			     unsigned long faddr,
+			     unsigned long laddr);
+
+/* hmm_migrate_lmem_to_rmem() - call to migrate lmem to rmem.
+ *
+ * @migrate:    The migration temporary struct.
+ * @mirror:     The mirror that link process address space with the device.
+ * Returns:     0, -EINVAL, -ENOMEM, -EFAULT, -EACCES, -ENODEV, -EBUSY, -EIO.
+ *
+ * On success the migrate struct is updated with the range that was migrated.
+ *
+ * Returns:
+ * 0 success.
+ * -EINVAL if invalid argument.
+ * -ENOMEM if failing to allocate memory.
+ * -EFAULT if range of address is invalid (no vma backing any of the range).
+ * -EACCES if vma backing the range is special vma.
+ * -ENODEV if the mirror is in the process of being destroyed.
+ * -EBUSY if range can not be migrated (many different reasons).
+ * -EIO if one of the device driver returned this error.
+ */
+int hmm_migrate_lmem_to_rmem(struct hmm_fault *fault,
+			     struct hmm_mirror *mirror);
+
+
+
+
 /* Functions used by core mm code. Device driver should not use any of them. */
 void __hmm_destroy(struct mm_struct *mm);
 static inline void hmm_destroy(struct mm_struct *mm)
@@ -459,12 +885,51 @@ static inline void hmm_destroy(struct mm_struct *mm)
 	}
 }
 
+/* hmm_mm_fault() - called on cpu page fault on a special hmm pte entry.
+ *
+ * @mm:             The mm of the thread triggering the fault.
+ * @vma:            The vma in which the fault happen.
+ * @addr:           The address of the fault.
+ * @pte:            Pointer to the pte entry inside the cpu page table.
+ * @pmd:            Pointer to the pmd entry into which the pte is.
+ * @fault_flags:    Fault flags (read, write, ...).
+ * @orig_pte:       The original pte value when this fault happened.
+ *
+ * When the cpu tries to access a range of memory that is in remote memory, it
+ * faults on the special hmm swap pte, which ends up calling this function,
+ * which should trigger the appropriate memory migration.
+ *
+ * Returns:
+ *   0 if someone else already migrated the rmem back.
+ *   VM_FAULT_SIGBUS on any i/o error during migration.
+ *   VM_FAULT_OOM if it fails to allocate memory for migration.
+ *   VM_FAULT_MAJOR on successful migration.
+ */
+int hmm_mm_fault(struct mm_struct *mm,
+		 struct vm_area_struct *vma,
+		 unsigned long addr,
+		 pte_t *pte,
+		 pmd_t *pmd,
+		 unsigned int fault_flags,
+		 pte_t orig_pte);
+
 #else /* !CONFIG_HMM */
 
 static inline void hmm_destroy(struct mm_struct *mm)
 {
 }
 
+static inline int hmm_mm_fault(struct mm_struct *mm,
+			       struct vm_area_struct *vma,
+			       unsigned long addr,
+			       pte_t *pte,
+			       pmd_t *pmd,
+			       unsigned int fault_flags,
+			       pte_t orig_pte)
+{
+	return VM_FAULT_SIGBUS;
+}
+
 #endif /* !CONFIG_HMM */
 
 #endif
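
To make the HMM_PFN_LMEM_UPTODATE contract above concrete, here is a minimal,
synchronous sketch of an rmem_to_lmem implementation for a hypothetical "foo"
device (foo_rmem and foo_copy_page_from_device() are made-up driver internals;
the hmm_* helpers and the pfns array are the ones declared above):

static struct hmm_fence *foo_rmem_to_lmem(struct hmm_rmem *rmem,
					  unsigned long fuid,
					  unsigned long luid)
{
	struct foo_rmem *frmem = container_of(rmem, struct foo_rmem, rmem);
	unsigned long uid;

	for (uid = fuid; uid < luid; ++uid) {
		unsigned long idx = uid - rmem->fuid;
		struct page *page = hmm_pfn_to_page(rmem->pfns[idx]);

		if (!page)
			continue;
		/* Synchronous copy from device memory into the system page. */
		if (foo_copy_page_from_device(frmem, uid, page))
			return ERR_PTR(-EIO);
		/* Contract: mark every successfully copied page uptodate. */
		hmm_pfn_set_lmem_uptodate(&rmem->pfns[idx]);
	}
	/* No fence returned: the copy is already done and flushed here. */
	return NULL;
}

A driver with an asynchronous copy engine would instead return a fence and set
the uptodate bits from its fence_wait callback, as the comments above allow.
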
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 0794a73b..bb2c23f 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -42,6 +42,7 @@ enum mmu_action {
 	MMU_FAULT_WP,
 	MMU_THP_SPLIT,
 	MMU_THP_FAULT_WP,
+	MMU_HMM,
 };
 
 #ifdef CONFIG_MMU_NOTIFIER
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 5a14b92..0739b32 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -70,8 +70,18 @@ static inline int current_is_kswapd(void)
 #define SWP_HWPOISON_NUM 0
 #endif
 
+/*
+ * HMM (heterogeneous memory management) used when data is in remote memory.
+ */
+#ifdef CONFIG_HMM
+#define SWP_HMM_NUM 1
+#define SWP_HMM			(MAX_SWAPFILES + SWP_MIGRATION_NUM + SWP_HWPOISON_NUM)
+#else
+#define SWP_HMM_NUM 0
+#endif
+
 #define MAX_SWAPFILES \
-	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
+	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM - SWP_HMM_NUM)
 
 /*
  * Magic header for a swap area. The first part of the union is
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 6adfb7b..9a490d3 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -188,7 +188,38 @@ static inline int is_hwpoison_entry(swp_entry_t swp)
 }
 #endif
 
-#if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION)
+#ifdef CONFIG_HMM
+
+static inline swp_entry_t make_hmm_entry(unsigned long pgoff)
+{
+	/* We don't need to keep the page pfn, so use offset to store writeable
+	 * flag.
+	 */
+	return swp_entry(SWP_HMM, pgoff);
+}
+
+static inline unsigned long hmm_entry_uid(swp_entry_t entry)
+{
+	return swp_offset(entry);
+}
+
+static inline int is_hmm_entry(swp_entry_t entry)
+{
+	return unlikely(swp_type(entry) == SWP_HMM);
+}
+#else /* !CONFIG_HMM */
+#define make_hmm_entry(pgoff) swp_entry(0, 0)
+static inline int is_hmm_entry(swp_entry_t swp)
+{
+	return 0;
+}
+
+static inline void make_hmm_entry_read(swp_entry_t *entry)
+{
+}
+#endif /* !CONFIG_HMM */
+
+#if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION) || defined(CONFIG_HMM)
 static inline int non_swap_entry(swp_entry_t entry)
 {
 	return swp_type(entry) >= MAX_SWAPFILES;
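
A small sketch of how the entry encoding above round-trips through a cpu page
table entry (the foo_* names are hypothetical; the pte/swp helpers are existing
kernel functions and the hmm ones are introduced above):

static bool foo_pte_is_remote(pte_t pte, unsigned long *uid)
{
	swp_entry_t entry;

	if (pte_none(pte) || pte_present(pte))
		return false;
	entry = pte_to_swp_entry(pte);
	if (!is_hmm_entry(entry))
		return false;
	/* Recover the rmem unique id used to look up the hmm_rmem object. */
	*uid = hmm_entry_uid(entry);
	return true;
}

/* Encoding side, as done by the unmap path below: */
static pte_t foo_make_remote_pte(unsigned long uid)
{
	return swp_entry_to_pte(make_hmm_entry(uid));
}
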
diff --git a/mm/hmm.c b/mm/hmm.c
index 2b8986c..599d4f6 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -77,6 +77,9 @@
 /* global SRCU for all MMs */
 static struct srcu_struct srcu;
 
+static spinlock_t _hmm_rmems_lock;
+static struct rb_root _hmm_rmems = RB_ROOT;
+
 
 
 
@@ -94,6 +97,7 @@ struct hmm_event {
 	unsigned long		faddr;
 	unsigned long		laddr;
 	struct list_head	fences;
+	struct list_head	ranges;
 	enum hmm_etype		etype;
 	bool			backoff;
 };
@@ -106,6 +110,7 @@ struct hmm_event {
  * @mirrors:        List of all mirror for this mm (one per device)
  * @mmu_notifier:   The mmu_notifier of this mm
  * @wait_queue:     Wait queue for synchronization btw cpu and device
+ * @ranges:         Tree of rmem ranges (sorted by address).
  * @events:         Events.
  * @nevents:        Number of events currently happening.
  * @dead:           The mm is being destroy.
@@ -122,6 +127,7 @@ struct hmm {
 	struct list_head	pending;
 	struct mmu_notifier	mmu_notifier;
 	wait_queue_head_t	wait_queue;
+	struct rb_root		ranges;
 	struct hmm_event	events[HMM_MAX_EVENTS];
 	int			nevents;
 	bool			dead;
@@ -132,137 +138,1456 @@ static struct mmu_notifier_ops hmm_notifier_ops;
 static inline struct hmm *hmm_ref(struct hmm *hmm);
 static inline struct hmm *hmm_unref(struct hmm *hmm);
 
-static int hmm_mirror_update(struct hmm_mirror *mirror,
-			     struct vm_area_struct *vma,
-			     unsigned long faddr,
-			     unsigned long laddr,
-			     struct hmm_event *event);
-static void hmm_mirror_cleanup(struct hmm_mirror *mirror);
+static void hmm_rmem_clear_range(struct hmm_rmem *rmem,
+				 struct vm_area_struct *vma,
+				 unsigned long faddr,
+				 unsigned long laddr,
+				 unsigned long fuid);
+static void hmm_rmem_poison_range(struct hmm_rmem *rmem,
+				  struct mm_struct *mm,
+				  struct vm_area_struct *vma,
+				  unsigned long faddr,
+				  unsigned long laddr,
+				  unsigned long fuid);
+
+static int hmm_mirror_rmem_update(struct hmm_mirror *mirror,
+				  struct hmm_rmem *rmem,
+				  unsigned long faddr,
+				  unsigned long laddr,
+				  unsigned long fuid,
+				  struct hmm_event *event,
+				  bool dirty);
+static int hmm_mirror_update(struct hmm_mirror *mirror,
+			     struct vm_area_struct *vma,
+			     unsigned long faddr,
+			     unsigned long laddr,
+			     struct hmm_event *event);
+static void hmm_mirror_cleanup(struct hmm_mirror *mirror);
+
+static int hmm_device_fence_wait(struct hmm_device *device,
+				 struct hmm_fence *fence);
+
+
+
+
+/* hmm_event - used to synchronize various mm events with each other.
+ *
+ * During the lifetime of a process various mm events will happen; hmm
+ * serializes events that affect overlapping ranges of addresses. The
+ * hmm_event structs are used for that purpose.
+ */
+
+static inline bool hmm_event_overlap(struct hmm_event *a, struct hmm_event *b)
+{
+	return !((a->laddr <= b->faddr) || (a->faddr >= b->laddr));
+}
+
+static inline unsigned long hmm_event_size(struct hmm_event *event)
+{
+	return (event->laddr - event->faddr);
+}
+
+
+
+
+/* hmm_fault_mm - used for reading the cpu page table on device fault.
+ *
+ * This code deals with reading the cpu page table to find the pages that are
+ * backing a range of addresses. It is used as a helper by the device page
+ * fault code.
+ */
+
+/* struct hmm_fault_mm - used for reading cpu page table on device fault.
+ *
+ * @mm:     The mm of the process the device fault is happening in.
+ * @vma:    The vma in which the fault is happening.
+ * @faddr:  The first address for the range the device want to fault.
+ * @laddr:  The last address for the range the device want to fault.
+ * @pfns:   Array of hmm pfns (contains the result of the fault).
+ * @write:  Is this write fault.
+ */
+struct hmm_fault_mm {
+	struct mm_struct	*mm;
+	struct vm_area_struct	*vma;
+	unsigned long		faddr;
+	unsigned long		laddr;
+	unsigned long		*pfns;
+	bool			write;
+};
+
+static int hmm_fault_mm_fault_pmd(pmd_t *pmdp,
+				  unsigned long faddr,
+				  unsigned long laddr,
+				  struct mm_walk *walk)
+{
+	struct hmm_fault_mm *fault_mm = walk->private;
+	unsigned long idx, *pfns;
+	pte_t *ptep;
+
+	idx = (faddr - fault_mm->faddr) >> PAGE_SHIFT;
+	pfns = &fault_mm->pfns[idx];
+	memset(pfns, 0, ((laddr - faddr) >> PAGE_SHIFT) * sizeof(long));
+	if (pmd_none(*pmdp)) {
+		return -ENOENT;
+	}
+
+	if (pmd_trans_huge(*pmdp)) {
+		/* FIXME */
+		return -EINVAL;
+	}
+
+	if (pmd_none_or_trans_huge_or_clear_bad(pmdp)) {
+		return -EINVAL;
+	}
+
+	ptep = pte_offset_map(pmdp, faddr);
+	for (; faddr != laddr; ++ptep, ++pfns, faddr += PAGE_SIZE) {
+		pte_t pte = *ptep;
+
+		if (pte_none(pte)) {
+			if (fault_mm->write) {
+				ptep++;
+				break;
+			}
+			*pfns = my_zero_pfn(faddr) << HMM_PFN_SHIFT;
+			set_bit(HMM_PFN_VALID_ZERO, pfns);
+			continue;
+		}
+		if (!pte_present(pte) || (fault_mm->write && !pte_write(pte))) {
+			/* Need to inc ptep so unmap unlock on right pmd. */
+			ptep++;
+			break;
+		}
+		if (fault_mm->write && !pte_write(pte)) {
+			/* Need to inc ptep so unmap unlock on right pmd. */
+			ptep++;
+			break;
+		}
+
+		*pfns = pte_pfn(pte) << HMM_PFN_SHIFT;
+		set_bit(HMM_PFN_VALID_PAGE, pfns);
+		if (pte_write(pte)) {
+			set_bit(HMM_PFN_WRITE, pfns);
+		}
+		/* Consider the page as hot since a device wants to use it. */
+		mark_page_accessed(pfn_to_page(pte_pfn(pte)));
+		fault_mm->laddr = faddr + PAGE_SIZE;
+	}
+	pte_unmap(ptep - 1);
+
+	return (faddr == laddr) ? 0 : -ENOENT;
+}
+
+static int hmm_fault_mm_fault(struct hmm_fault_mm *fault_mm)
+{
+	struct mm_walk walk = {0};
+	unsigned long faddr, laddr;
+	int ret;
+
+	faddr = fault_mm->faddr;
+	laddr = fault_mm->laddr;
+	fault_mm->laddr = faddr;
+
+	walk.pmd_entry = hmm_fault_mm_fault_pmd;
+	walk.mm = fault_mm->mm;
+	walk.private = fault_mm;
+
+	ret = walk_page_range(faddr, laddr, &walk);
+	return ret;
+}
+
+
+
+
+/* hmm_range - address range backed by remote memory.
+ *
+ * Each address range backed by remote memory is tracked so that on a cpu page
+ * fault for a given address we can find the corresponding remote memory. We
+ * use a structure separate from the remote memory as several different address
+ * ranges can point to the same remote memory (in the case of a shared mapping).
+ */
+
+/* struct hmm_range - address range backed by remote memory.
+ *
+ * @kref:           Reference count.
+ * @rmem:           Remote memory that back this address range.
+ * @mirror:         Mirror with which this range is associated.
+ * @fuid:           First unique id of rmem for this range.
+ * @faddr:          First address (inclusive) of the range.
+ * @laddr:          Last address (exclusive) of the range.
+ * @subtree_laddr:  Optimization for red black interval tree.
+ * @rlist:          List of all range associated with same rmem.
+ * @elist:          List of all range associated with an event.
+ */
+struct hmm_range {
+	struct kref		kref;
+	struct hmm_rmem		*rmem;
+	struct hmm_mirror	*mirror;
+	unsigned long		fuid;
+	unsigned long		faddr;
+	unsigned long		laddr;
+	unsigned long		subtree_laddr;
+	struct rb_node		node;
+	struct list_head	rlist;
+	struct list_head	elist;
+};
+
+static inline unsigned long hmm_range_faddr(struct hmm_range *range)
+{
+	return range->faddr;
+}
+
+static inline unsigned long hmm_range_laddr(struct hmm_range *range)
+{
+	return range->laddr - 1UL;
+}
+
+INTERVAL_TREE_DEFINE(struct hmm_range,
+		     node,
+		     unsigned long,
+		     subtree_laddr,
+		     hmm_range_faddr,
+		     hmm_range_laddr,,
+		     hmm_range_tree)
+
+static inline unsigned long hmm_range_npages(struct hmm_range *range)
+{
+	return (range->laddr - range->faddr) >> PAGE_SHIFT;
+}
+
+static inline unsigned long hmm_range_fuid(struct hmm_range *range)
+{
+	return range->fuid;
+}
+
+static inline unsigned long hmm_range_luid(struct hmm_range *range)
+{
+	return range->fuid + hmm_range_npages(range);
+}
+
+static void hmm_range_destroy(struct kref *kref)
+{
+	struct hmm_range *range;
+
+	range = container_of(kref, struct hmm_range, kref);
+	BUG_ON(!list_empty(&range->elist));
+	BUG_ON(!list_empty(&range->rlist));
+	BUG_ON(!RB_EMPTY_NODE(&range->node));
+
+	range->rmem = hmm_rmem_unref(range->rmem);
+	range->mirror = hmm_mirror_unref(range->mirror);
+	kfree(range);
+}
+
+static struct hmm_range *hmm_range_unref(struct hmm_range *range)
+{
+	if (range) {
+		kref_put(&range->kref, hmm_range_destroy);
+	}
+	return NULL;
+}
+
+static void hmm_range_init(struct hmm_range *range,
+			   struct hmm_mirror *mirror,
+			   struct hmm_rmem *rmem,
+			   unsigned long faddr,
+			   unsigned long laddr,
+			   unsigned long fuid)
+{
+	kref_init(&range->kref);
+	range->mirror = hmm_mirror_ref(mirror);
+	range->rmem = hmm_rmem_ref(rmem);
+	range->fuid = fuid;
+	range->faddr = faddr;
+	range->laddr = laddr;
+	RB_CLEAR_NODE(&range->node);
+
+	spin_lock(&rmem->lock);
+	list_add_tail(&range->rlist, &rmem->ranges);
+	if (rmem->event) {
+		list_add_tail(&range->elist, &rmem->event->ranges);
+	}
+	spin_unlock(&rmem->lock);
+}
+
+static void hmm_range_insert(struct hmm_range *range)
+{
+	struct hmm_mirror *mirror = range->mirror;
+
+	spin_lock(&mirror->hmm->lock);
+	if (RB_EMPTY_NODE(&range->node)) {
+		hmm_range_tree_insert(range, &mirror->hmm->ranges);
+	}
+	spin_unlock(&mirror->hmm->lock);
+}
+
+static inline void hmm_range_adjust_locked(struct hmm_range *range,
+					   unsigned long faddr,
+					   unsigned long laddr)
+{
+	if (!RB_EMPTY_NODE(&range->node)) {
+		hmm_range_tree_remove(range, &range->mirror->hmm->ranges);
+	}
+	if (faddr < range->faddr) {
+		range->fuid -= ((range->faddr - faddr) >> PAGE_SHIFT);
+	} else {
+		range->fuid += ((faddr - range->faddr) >> PAGE_SHIFT);
+	}
+	range->faddr = faddr;
+	range->laddr = laddr;
+	hmm_range_tree_insert(range, &range->mirror->hmm->ranges);
+}
+
+static int hmm_range_split(struct hmm_range *range,
+			   unsigned long saddr)
+{
+	struct hmm_mirror *mirror = range->mirror;
+	struct hmm_range *new;
+
+	if (range->faddr >= saddr) {
+		BUG();
+		return -EINVAL;
+	}
+
+	new = kmalloc(sizeof(struct hmm_range), GFP_KERNEL);
+	if (new == NULL) {
+		return -ENOMEM;
+	}
+
+	hmm_range_init(new,mirror,range->rmem,range->faddr,saddr,range->fuid);
+	spin_lock(&mirror->hmm->lock);
+	hmm_range_adjust_locked(range, saddr, range->laddr);
+	hmm_range_tree_insert(new, &mirror->hmm->ranges);
+	spin_unlock(&mirror->hmm->lock);
+	return 0;
+}
+
+static void hmm_range_fini(struct hmm_range *range)
+{
+	struct hmm_rmem *rmem = range->rmem;
+	struct hmm *hmm = range->mirror->hmm;
+
+	spin_lock(&hmm->lock);
+	if (!RB_EMPTY_NODE(&range->node)) {
+		hmm_range_tree_remove(range, &hmm->ranges);
+		RB_CLEAR_NODE(&range->node);
+	}
+	spin_unlock(&hmm->lock);
+
+	spin_lock(&rmem->lock);
+	list_del_init(&range->elist);
+	list_del_init(&range->rlist);
+	spin_unlock(&rmem->lock);
+
+	hmm_range_unref(range);
+}
+
+static void hmm_range_fini_clear(struct hmm_range *range,
+				 struct vm_area_struct *vma)
+{
+	hmm_rmem_clear_range(range->rmem, vma, range->faddr,
+			     range->laddr, range->fuid);
+	hmm_range_fini(range);
+}
+
+static inline bool hmm_range_reserve(struct hmm_range *range,
+				     struct hmm_event *event)
+{
+	bool reserved = false;
+
+	spin_lock(&range->rmem->lock);
+	if (range->rmem->event == NULL || range->rmem->event == event) {
+		range->rmem->event = event;
+		list_add_tail(&range->elist, &range->rmem->event->ranges);
+		reserved = true;
+	}
+	spin_unlock(&range->rmem->lock);
+	return reserved;
+}
+
+static inline void hmm_range_release(struct hmm_range *range,
+				     struct hmm_event *event)
+{
+	struct hmm_device *device = NULL;
+	spin_lock(&range->rmem->lock);
+	if (range->rmem->event != event) {
+		spin_unlock(&range->rmem->lock);
+		WARN_ONCE(1,"hmm: trying to release range from wrong event.\n");
+		return;
+	}
+	list_del_init(&range->elist);
+	if (list_empty(&range->rmem->event->ranges)) {
+		range->rmem->event = NULL;
+		device = range->rmem->device;
+	}
+	spin_unlock(&range->rmem->lock);
+
+	if (device) {
+		wake_up(&device->wait_queue);
+	}
+}
+
+
+
+
+/* hmm_rmem - The remote memory.
+ *
+ * Below are functions that deals with remote memory.
+ */
+
+/* struct hmm_rmem_mm - used during memory migration from/to rmem.
+ *
+ * @vma:            The vma that cover the range.
+ * @rmem:           The remote memory object.
+ * @faddr:          The first address in the range.
+ * @laddr:          The last address in the range.
+ * @fuid:           The first uid for the range.
+ * @rmeap_pages:    List of page to remap.
+ * @tlb:            For gathering cpu tlb flushes.
+ * @force_flush:    Force cpu tlb flush.
+ */
+struct hmm_rmem_mm {
+	struct vm_area_struct	*vma;
+	struct hmm_rmem		*rmem;
+	unsigned long		faddr;
+	unsigned long		laddr;
+	unsigned long		fuid;
+	struct list_head	remap_pages;
+	struct mmu_gather	tlb;
+	int			force_flush;
+};
+
+/* Interval tree for the hmm_rmem object. Providing the following functions :
+ * hmm_rmem_tree_insert(struct hmm_rmem *, struct rb_root *)
+ * hmm_rmem_tree_remove(struct hmm_rmem *, struct rb_root *)
+ * hmm_rmem_tree_iter_first(struct rb_root *, fpgoff, lpgoff)
+ * hmm_rmem_tree_iter_next(struct hmm_rmem *, fpgoff, lpgoff)
+ */
+static inline unsigned long hmm_rmem_fuid(struct hmm_rmem *rmem)
+{
+	return rmem->fuid;
+}
+
+static inline unsigned long hmm_rmem_luid(struct hmm_rmem *rmem)
+{
+	return rmem->luid - 1UL;
+}
+
+INTERVAL_TREE_DEFINE(struct hmm_rmem,
+		     node,
+		     unsigned long,
+		     subtree_luid,
+		     hmm_rmem_fuid,
+		     hmm_rmem_luid,,
+		     hmm_rmem_tree)
+
+static inline unsigned long hmm_rmem_npages(struct hmm_rmem *rmem)
+{
+	return (rmem->luid - rmem->fuid);
+}
+
+static inline unsigned long hmm_rmem_size(struct hmm_rmem *rmem)
+{
+	return hmm_rmem_npages(rmem) << PAGE_SHIFT;
+}
+
+static void hmm_rmem_free(struct hmm_rmem *rmem)
+{
+	unsigned long i;
+
+	for (i = 0; i < hmm_rmem_npages(rmem); ++i) {
+		struct page *page;
+
+		page = hmm_pfn_to_page(rmem->pfns[i]);
+		if (!page || test_bit(HMM_PFN_VALID_ZERO, &rmem->pfns[i])) {
+			continue;
+		}
+		/* Fake mapping so that page_remove_rmap behave as we want. */
+		VM_BUG_ON(page_mapcount(page));
+		atomic_set(&page->_mapcount, 0);
+		page_remove_rmap(page);
+		page_cache_release(page);
+		rmem->pfns[i] = 0;
+	}
+	kfree(rmem->pfns);
+	rmem->pfns = NULL;
+
+	spin_lock(&_hmm_rmems_lock);
+	if (!RB_EMPTY_NODE(&rmem->node)) {
+		hmm_rmem_tree_remove(rmem, &_hmm_rmems);
+		RB_CLEAR_NODE(&rmem->node);
+	}
+	spin_unlock(&_hmm_rmems_lock);
+}
+
+static void hmm_rmem_destroy(struct kref *kref)
+{
+	struct hmm_device *device;
+	struct hmm_rmem *rmem;
+
+	rmem = container_of(kref, struct hmm_rmem, kref);
+	device = rmem->device;
+	BUG_ON(!list_empty(&rmem->ranges));
+	hmm_rmem_free(rmem);
+	device->ops->rmem_destroy(rmem);
+}
+
+struct hmm_rmem *hmm_rmem_ref(struct hmm_rmem *rmem)
+{
+	if (rmem) {
+		kref_get(&rmem->kref);
+		return rmem;
+	}
+	return NULL;
+}
+EXPORT_SYMBOL(hmm_rmem_ref);
+
+struct hmm_rmem *hmm_rmem_unref(struct hmm_rmem *rmem)
+{
+	if (rmem) {
+		kref_put(&rmem->kref, hmm_rmem_destroy);
+	}
+	return NULL;
+}
+EXPORT_SYMBOL(hmm_rmem_unref);
+
+static void hmm_rmem_init(struct hmm_rmem *rmem,
+			  struct hmm_device *device)
+{
+	kref_init(&rmem->kref);
+	rmem->device = device;
+	rmem->fuid = 0;
+	rmem->luid = 0;
+	rmem->pfns = NULL;
+	rmem->dead = false;
+	INIT_LIST_HEAD(&rmem->ranges);
+	spin_lock_init(&rmem->lock);
+	RB_CLEAR_NODE(&rmem->node);
+}
+
+static int hmm_rmem_alloc(struct hmm_rmem *rmem, unsigned long npages)
+{
+	rmem->pfns = kzalloc(sizeof(long) * npages, GFP_KERNEL);
+	if (rmem->pfns == NULL) {
+		return -ENOMEM;
+	}
+
+	spin_lock(&_hmm_rmems_lock);
+	if (_hmm_rmems.rb_node == NULL) {
+		rmem->fuid = 1;
+		rmem->luid = 1 + npages;
+	} else {
+		struct hmm_rmem *head;
+
+		head = container_of(_hmm_rmems.rb_node,struct hmm_rmem,node);
+		/* The subtree_luid of root node is the current luid. */
+		rmem->fuid = head->subtree_luid;
+		rmem->luid = head->subtree_luid + npages;
+	}
+	/* The rmem uid value must fit into swap entry. FIXME can we please
+	 * have an ARCH define for the maximum swap entry value !
+	 */
+	if (rmem->luid < MM_MAX_SWAP_PAGES) {
+		hmm_rmem_tree_insert(rmem, &_hmm_rmems);
+		spin_unlock(&_hmm_rmems_lock);
+		return 0;
+	}
+	spin_unlock(&_hmm_rmems_lock);
+	rmem->fuid = 0;
+	rmem->luid = 0;
+	return -ENOSPC;
+}
+
+static struct hmm_rmem *hmm_rmem_find(unsigned long uid)
+{
+	struct hmm_rmem *rmem;
+
+	spin_lock(&_hmm_rmems_lock);
+	rmem = hmm_rmem_tree_iter_first(&_hmm_rmems, uid, uid);
+	hmm_rmem_ref(rmem);
+	spin_unlock(&_hmm_rmems_lock);
+	return rmem;
+}
+
+int hmm_rmem_split_new(struct hmm_rmem *rmem,
+		       struct hmm_rmem *new)
+{
+	struct hmm_range *range, *next;
+	unsigned long i, pgoff, npages;
+
+	hmm_rmem_init(new, rmem->device);
+
+	/* Sanity check: the new rmem is either at the beginning or at the end
+	 * of the old rmem; it can not be in the middle.
+	 */
+	if (!(new->fuid < new->luid)) {
+		hmm_rmem_unref(new);
+		return -EINVAL;
+	}
+	if (!(new->fuid >= rmem->fuid && new->luid <= rmem->luid)) {
+		hmm_rmem_unref(new);
+		return -EINVAL;
+	}
+	if (!(new->fuid == rmem->fuid || new->luid == rmem->luid)) {
+		hmm_rmem_unref(new);
+		return -EINVAL;
+	}
+
+	npages = hmm_rmem_npages(new);
+	new->pfns = kzalloc(sizeof(long) * npages, GFP_KERNEL);
+	if (new->pfns == NULL) {
+		hmm_rmem_unref(new);
+		return -ENOMEM;
+	}
+
+retry:
+	spin_lock(&rmem->lock);
+	list_for_each_entry (range, &rmem->ranges, rlist) {
+		if (hmm_range_fuid(range) < new->fuid &&
+		    hmm_range_luid(range) > new->fuid) {
+			unsigned long soff;
+			int ret;
+
+			soff = ((new->fuid - range->fuid) << PAGE_SHIFT);
+			spin_unlock(&rmem->lock);
+			ret = hmm_range_split(range, soff + range->faddr);
+			if (ret) {
+				hmm_rmem_unref(new);
+				return ret;
+			}
+			goto retry;
+		}
+		if (hmm_range_fuid(range) < new->luid &&
+		    hmm_range_luid(range) > new->luid) {
+			unsigned long soff;
+			int ret;
+
+			soff = ((new->luid - range->fuid) << PAGE_SHIFT);
+			spin_unlock(&rmem->lock);
+			ret = hmm_range_split(range, soff + range->faddr);
+			if (ret) {
+				hmm_rmem_unref(new);
+				return ret;
+			}
+			goto retry;
+		}
+	}
+	spin_unlock(&rmem->lock);
+
+	spin_lock(&_hmm_rmems_lock);
+	hmm_rmem_tree_remove(rmem, &_hmm_rmems);
+	if (new->fuid != rmem->fuid) {
+		for (i = 0, pgoff = (new->fuid-rmem->fuid); i < npages; ++i) {
+			new->pfns[i] = rmem->pfns[i + pgoff];
+		}
+		rmem->luid = new->fuid;
+	} else {
+		for (i = 0; i < npages; ++i) {
+			new->pfns[i] = rmem->pfns[i];
+		}
+		rmem->fuid = new->luid;
+		for (i = 0, pgoff = npages; i < hmm_rmem_npages(rmem); ++i) {
+			rmem->pfns[i] = rmem->pfns[i + pgoff];
+		}
+	}
+	hmm_rmem_tree_insert(rmem, &_hmm_rmems);
+	hmm_rmem_tree_insert(new, &_hmm_rmems);
+
+	/* No need to lock the new ranges list as we are holding the
+	 * rmem uid tree lock and thus no one can find about the new
+	 * rmem uid tree lock and thus no one can find out about the new
+	 */
+	spin_lock(&rmem->lock);
+	list_for_each_entry_safe (range, next, &rmem->ranges, rlist) {
+		if (range->fuid >= rmem->fuid) {
+			continue;
+		}
+		list_del(&range->rlist);
+		list_add_tail(&range->rlist, &new->ranges);
+	}
+	spin_unlock(&rmem->lock);
+	spin_unlock(&_hmm_rmems_lock);
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_rmem_split_new);
+
+static int hmm_rmem_split(struct hmm_rmem *rmem,
+			  unsigned long fuid,
+			  unsigned long luid,
+			  bool adjust)
+{
+	struct hmm_device *device = rmem->device;
+	int ret;
+
+	if (fuid < rmem->fuid || luid > rmem->luid) {
+		WARN_ONCE(1, "hmm: rmem split received invalid range.\n");
+		return -EINVAL;
+	}
+
+	if (fuid == rmem->fuid && luid == rmem->luid) {
+		return 0;
+	}
+
+	if (adjust) {
+		ret = device->ops->rmem_split_adjust(rmem, fuid, luid);
+	} else {
+		ret = device->ops->rmem_split(rmem, fuid, luid);
+	}
+	return ret;
+}
+
+static void hmm_rmem_clear_range_page(struct hmm_rmem_mm *rmem_mm,
+				      unsigned long addr,
+				      pte_t *ptep,
+				      pmd_t *pmdp)
+{
+	struct vm_area_struct *vma = rmem_mm->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long uid;
+	pte_t pte;
+
+	uid = ((addr - rmem_mm->faddr) >> PAGE_SHIFT) + rmem_mm->fuid;
+	pte = ptep_get_and_clear(mm, addr, ptep);
+	if (!pte_same(pte, swp_entry_to_pte(make_hmm_entry(uid)))) {
+//		print_bad_pte(vma, addr, ptep, NULL);
+		set_pte_at(mm, addr, ptep, pte);
+	}
+}
+
+static int hmm_rmem_clear_range_pmd(pmd_t *pmdp,
+				    unsigned long addr,
+				    unsigned long next,
+				    struct mm_walk *walk)
+{
+	struct hmm_rmem_mm *rmem_mm = walk->private;
+	struct vm_area_struct *vma = rmem_mm->vma;
+	spinlock_t *ptl;
+	pte_t *ptep;
+
+	if (pmd_none(*pmdp)) {
+		return 0;
+	}
+
+	if (pmd_trans_huge(*pmdp)) {
+		/* This can not happen we do split huge page during unmap. */
+		BUG();
+		return 0;
+	}
+
+	if (pmd_none_or_trans_huge_or_clear_bad(pmdp)) {
+		/* FIXME I do not think this can happen at this point given
+		 * that during unmap all thp pmd were split.
+		 */
+		BUG();
+		return 0;
+	}
+
+	ptep = pte_offset_map_lock(vma->vm_mm, pmdp, addr, &ptl);
+	for (; addr != next; ++ptep, addr += PAGE_SIZE) {
+		hmm_rmem_clear_range_page(rmem_mm, addr, ptep, pmdp);
+	}
+	pte_unmap_unlock(ptep - 1, ptl);
+
+	return 0;
+}
+
+static void hmm_rmem_clear_range(struct hmm_rmem *rmem,
+				 struct vm_area_struct *vma,
+				 unsigned long faddr,
+				 unsigned long laddr,
+				 unsigned long fuid)
+{
+	struct hmm_rmem_mm rmem_mm;
+	struct mm_walk walk = {0};
+	unsigned long i, idx, npages;
+
+	rmem_mm.vma = vma;
+	rmem_mm.rmem = rmem;
+	rmem_mm.faddr = faddr;
+	rmem_mm.laddr = laddr;
+	rmem_mm.fuid = fuid;
+	walk.pmd_entry = hmm_rmem_clear_range_pmd;
+	walk.mm = vma->vm_mm;
+	walk.private = &rmem_mm;
+
+	/* No need to call the mmu notifier: the range was either unmapped or
+	 * inside video memory. In the latter case invalidation must have
+	 * happened prior to this function being called.
+	 */
+	walk_page_range(faddr, laddr, &walk);
+
+	npages = (laddr - faddr) >> PAGE_SHIFT;
+	for (i = 0, idx = fuid - rmem->fuid; i < npages; ++i, ++idx) {
+		if (current->mm == vma->vm_mm) {
+			sync_mm_rss(vma->vm_mm);
+		}
+
+		/* Properly uncharge memory. */
+		mem_cgroup_uncharge_mm(vma->vm_mm);
+		add_mm_counter(vma->vm_mm, MM_ANONPAGES, -1);
+	}
+}
+
+static void hmm_rmem_poison_range_page(struct hmm_rmem_mm *rmem_mm,
+				       struct vm_area_struct *vma,
+				       unsigned long addr,
+				       pte_t *ptep,
+				       pmd_t *pmdp)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long uid;
+	pte_t pte;
+
+	uid = ((addr - rmem_mm->faddr) >> PAGE_SHIFT) + rmem_mm->fuid;
+	pte = ptep_get_and_clear(mm, addr, ptep);
+	if (!pte_same(pte, swp_entry_to_pte(make_hmm_entry(uid)))) {
+//		print_bad_pte(vma, addr, ptep, NULL);
+		set_pte_at(mm, addr, ptep, pte);
+	} else {
+		/* The 0 fuid is special poison value. */
+		pte = swp_entry_to_pte(make_hmm_entry(0));
+		set_pte_at(mm, addr, ptep, pte);
+	}
+}
+
+static int hmm_rmem_poison_range_pmd(pmd_t *pmdp,
+				     unsigned long addr,
+				     unsigned long next,
+				     struct mm_walk *walk)
+{
+	struct hmm_rmem_mm *rmem_mm = walk->private;
+	struct vm_area_struct *vma = rmem_mm->vma;
+	spinlock_t *ptl;
+	pte_t *ptep;
+
+	if (!vma) {
+		vma = find_vma(walk->mm, addr);
+	}
+
+	if (pmd_none(*pmdp)) {
+		return 0;
+	}
+
+	if (pmd_trans_huge(*pmdp)) {
+		/* This can not happen we do split huge page during unmap. */
+		BUG();
+		return 0;
+	}
+
+	if (pmd_none_or_trans_huge_or_clear_bad(pmdp)) {
+		/* FIXME I do not think this can happen at this point given
+		 * that during unmap all thp pmd were split.
+		 */
+		BUG();
+		return 0;
+	}
+
+	ptep = pte_offset_map_lock(vma->vm_mm, pmdp, addr, &ptl);
+	for (; addr != next; ++ptep, addr += PAGE_SIZE) {
+		hmm_rmem_poison_range_page(rmem_mm, vma, addr, ptep, pmdp);
+	}
+	pte_unmap_unlock(ptep - 1, ptl);
+
+	return 0;
+}
+
+static void hmm_rmem_poison_range(struct hmm_rmem *rmem,
+				  struct mm_struct *mm,
+				  struct vm_area_struct *vma,
+				  unsigned long faddr,
+				  unsigned long laddr,
+				  unsigned long fuid)
+{
+	struct hmm_rmem_mm rmem_mm;
+	struct mm_walk walk = {0};
+
+	rmem_mm.vma = vma;
+	rmem_mm.rmem = rmem;
+	rmem_mm.faddr = faddr;
+	rmem_mm.laddr = laddr;
+	rmem_mm.fuid = fuid;
+	walk.pmd_entry = hmm_rmem_poison_range_pmd;
+	walk.mm = mm;
+	walk.private = &rmem_mm;
+
+	/* No need to call the mmu notifier: the range was either unmapped or
+	 * inside video memory. In the latter case invalidation must have
+	 * happened prior to this function being called.
+	 */
+	walk_page_range(faddr, laddr, &walk);
+}
+
+static int hmm_rmem_remap_page(struct hmm_rmem_mm *rmem_mm,
+			       unsigned long addr,
+			       pte_t *ptep,
+			       pmd_t *pmdp)
+{
+	struct vm_area_struct *vma = rmem_mm->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	struct hmm_rmem *rmem = rmem_mm->rmem;
+	unsigned long idx, uid;
+	struct page *page;
+	pte_t pte;
+
+	uid = rmem_mm->fuid + ((addr - rmem_mm->faddr) >> PAGE_SHIFT);
+	idx = (uid - rmem_mm->fuid);
+	pte = ptep_get_and_clear(mm, addr, ptep);
+	if (!pte_same(pte,swp_entry_to_pte(make_hmm_entry(uid)))) {
+		set_pte_at(mm, addr, ptep, pte);
+		if (vma->vm_file) {
+			/* Just ignore it, it might mean that the shared page
+			 * backing this address was remapped right after being
+			 * added to the pagecache.
+			 */
+			return 0;
+		} else {
+//			print_bad_pte(vma, addr, ptep, NULL);
+			return -EFAULT;
+		}
+	}
+	page = hmm_pfn_to_page(rmem->pfns[idx]);
+	if (!page) {
+		/* Nothing to do. */
+		return 0;
+	}
+
+	/* The remap code must lock page prior to remapping. */
+	BUG_ON(PageHuge(page));
+	if (test_bit(HMM_PFN_VALID_PAGE, &rmem->pfns[idx])) {
+		BUG_ON(!PageLocked(page));
+		pte = mk_pte(page, vma->vm_page_prot);
+		if (test_bit(HMM_PFN_WRITE, &rmem->pfns[idx])) {
+			pte = pte_mkwrite(pte);
+		}
+		if (test_bit(HMM_PFN_DIRTY, &rmem->pfns[idx])) {
+			pte = pte_mkdirty(pte);
+		}
+		get_page(page);
+		/* Private anonymous page. */
+		page_add_anon_rmap(page, vma, addr);
+		/* FIXME is this necessary ? I do not think so. */
+		if (!reuse_swap_page(page)) {
+			/* Page is still mapped in another process. */
+			pte = pte_wrprotect(pte);
+		}
+	} else {
+		/* Special zero page. */
+		pte = pte_mkspecial(pfn_pte(page_to_pfn(page),
+				    vma->vm_page_prot));
+	}
+	set_pte_at(mm, addr, ptep, pte);
+
+	return 0;
+}
+
+static int hmm_rmem_remap_pmd(pmd_t *pmdp,
+			      unsigned long addr,
+			      unsigned long next,
+			      struct mm_walk *walk)
+{
+	struct hmm_rmem_mm *rmem_mm = walk->private;
+	struct vm_area_struct *vma = rmem_mm->vma;
+	spinlock_t *ptl;
+	pte_t *ptep;
+	int ret = 0;
+
+	if (pmd_none(*pmdp)) {
+		return 0;
+	}
+
+	if (pmd_trans_huge(*pmdp)) {
+		/* This can not happen we do split huge page during unmap. */
+		BUG();
+		return -EINVAL;
+	}
+
+	if (pmd_none_or_trans_huge_or_clear_bad(pmdp)) {
+		/* No pmd here. */
+		return 0;
+	}
+
+	ptep = pte_offset_map_lock(vma->vm_mm, pmdp, addr, &ptl);
+	for (; addr != next; ++ptep, addr += PAGE_SIZE) {
+		ret = hmm_rmem_remap_page(rmem_mm, addr, ptep, pmdp);
+		if (ret) {
+			/* Increment ptep so unlock works on correct pte. */
+			ptep++;
+			break;
+		}
+	}
+	pte_unmap_unlock(ptep - 1, ptl);
+
+	return ret;
+}
+
+static int hmm_rmem_remap_anon(struct hmm_rmem *rmem,
+			       struct vm_area_struct *vma,
+			       unsigned long faddr,
+			       unsigned long laddr,
+			       unsigned long fuid)
+{
+	struct hmm_rmem_mm rmem_mm;
+	struct mm_walk walk = {0};
+	int ret;
+
+	rmem_mm.vma = vma;
+	rmem_mm.rmem = rmem;
+	rmem_mm.faddr = faddr;
+	rmem_mm.laddr = laddr;
+	rmem_mm.fuid = fuid;
+	walk.pmd_entry = hmm_rmem_remap_pmd;
+	walk.mm = vma->vm_mm;
+	walk.private = &rmem_mm;
+
+	/* No need to call the mmu notifier: the range was either unmapped or
+	 * inside video memory. In the latter case invalidation must have
+	 * happened prior to this function being called.
+	 */
+	ret = walk_page_range(faddr, laddr, &walk);
+
+	return ret;
+}
+
+static int hmm_rmem_unmap_anon_page(struct hmm_rmem_mm *rmem_mm,
+				    unsigned long addr,
+				    pte_t *ptep,
+				    pmd_t *pmdp)
+{
+	struct vm_area_struct *vma = rmem_mm->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	struct hmm_rmem *rmem = rmem_mm->rmem;
+	unsigned long idx, uid;
+	struct page *page;
+	pte_t pte;
+
+	/* New pte value. */
+	uid = ((addr - rmem_mm->faddr) >> PAGE_SHIFT) + rmem_mm->fuid;
+	idx = uid - rmem->fuid;
+	pte = ptep_get_and_clear_full(mm, addr, ptep, rmem_mm->tlb.fullmm);
+	tlb_remove_tlb_entry((&rmem_mm->tlb), ptep, addr);
+	rmem->pfns[idx] = 0;
+
+	if (pte_none(pte)) {
+		if (mem_cgroup_charge_anon(NULL, mm, GFP_KERNEL)) {
+			return -ENOMEM;
+		}
+		add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
+		/* Zero pte means nothing is there and thus nothing to copy. */
+		pte = swp_entry_to_pte(make_hmm_entry(uid));
+		set_pte_at(mm, addr, ptep, pte);
+		rmem->pfns[idx] = my_zero_pfn(addr) << HMM_PFN_SHIFT;
+		set_bit(HMM_PFN_VALID_ZERO, &rmem->pfns[idx]);
+		if (vma->vm_flags & VM_WRITE) {
+			set_bit(HMM_PFN_WRITE, &rmem->pfns[idx]);
+		}
+		set_bit(HMM_PFN_LMEM_UPTODATE, &rmem->pfns[idx]);
+		rmem_mm->laddr = addr + PAGE_SIZE;
+		return 0;
+	}
+	if (!pte_present(pte)) {
+		/* Page is not present it must be faulted, restore pte. */
+		set_pte_at(mm, addr, ptep, pte);
+		return -ENOENT;
+	}
+
+	page = pfn_to_page(pte_pfn(pte));
+	/* FIXME do we want to be able to unmap mlocked page ? */
+	if (PageMlocked(page)) {
+		set_pte_at(mm, addr, ptep, pte);
+		return -EBUSY;
+	}
+
+	rmem->pfns[idx] = pte_pfn(pte) << HMM_PFN_SHIFT;
+	if (is_zero_pfn(pte_pfn(pte))) {
+		set_bit(HMM_PFN_VALID_ZERO, &rmem->pfns[idx]);
+		set_bit(HMM_PFN_LMEM_UPTODATE, &rmem->pfns[idx]);
+	} else {
+		flush_cache_page(vma, addr, pte_pfn(pte));
+		set_bit(HMM_PFN_VALID_PAGE, &rmem->pfns[idx]);
+		set_bit(HMM_PFN_LMEM_UPTODATE, &rmem->pfns[idx]);
+		/* Anonymous private memory always writeable. */
+		if (pte_dirty(pte)) {
+			set_bit(HMM_PFN_DIRTY, &rmem->pfns[idx]);
+		}
+		if (trylock_page(page)) {
+			set_bit(HMM_PFN_LOCK, &rmem->pfns[idx]);
+		}
+		rmem_mm->force_flush=!__tlb_remove_page(&rmem_mm->tlb,page);
+
+		/* tlb_flush_mmu drops one ref so take an extra ref here. */
+		get_page(page);
+	}
+	if (vma->vm_flags & VM_WRITE) {
+		set_bit(HMM_PFN_WRITE, &rmem->pfns[idx]);
+	}
+	rmem_mm->laddr = addr + PAGE_SIZE;
+
+	pte = swp_entry_to_pte(make_hmm_entry(uid));
+	set_pte_at(mm, addr, ptep, pte);
+
+	/* What a journey ! */
+	return 0;
+}
+
+static int hmm_rmem_unmap_pmd(pmd_t *pmdp,
+			      unsigned long addr,
+			      unsigned long next,
+			      struct mm_walk *walk)
+{
+	struct hmm_rmem_mm *rmem_mm = walk->private;
+	struct vm_area_struct *vma = rmem_mm->vma;
+	spinlock_t *ptl;
+	pte_t *ptep;
+	int ret = 0;
+
+	if (pmd_none(*pmdp)) {
+		if (unlikely(__pte_alloc(vma->vm_mm, vma, pmdp, addr))) {
+			return -ENOENT;
+		}
+	}
+
+	if (pmd_trans_huge(*pmdp)) {
+		/* FIXME this will deadlock because it does mmu_notifier_range_invalidate */
+		split_huge_page_pmd(vma, addr, pmdp);
+		return -EAGAIN;
+	}
+
+	if (pmd_none_or_trans_huge_or_clear_bad(pmdp)) {
+		/* It has already been handled above. */
+		BUG();
+		return -EINVAL;
+	}
+
+again:
+	ptep = pte_offset_map_lock(vma->vm_mm, pmdp, addr, &ptl);
+	arch_enter_lazy_mmu_mode();
+	for (; addr != next; ++ptep, addr += PAGE_SIZE) {
+		ret = hmm_rmem_unmap_anon_page(rmem_mm, addr,
+					       ptep, pmdp);
+		if (ret || rmem_mm->force_flush) {
+			/* Increment ptep so unlock works on correct
+			 * pte.
+			 */
+			ptep++;
+			break;
+		}
+	}
+	arch_leave_lazy_mmu_mode();
+	pte_unmap_unlock(ptep - 1, ptl);
+
+	/* mmu_gather ran out of room to batch pages, we break out of the PTE
+	 * lock to avoid doing the potentially expensive TLB invalidate and
+	 * page-free while holding it.
+	 */
+	if (rmem_mm->force_flush) {
+		unsigned long old_end;
+
+		rmem_mm->force_flush = 0;
+		/*
+		 * Flush the TLB just for the previous segment,
+		 * then update the range to be the remaining
+		 * TLB range.
+		 */
+		old_end = rmem_mm->tlb.end;
+		rmem_mm->tlb.end = addr;
+
+		tlb_flush_mmu(&rmem_mm->tlb);
+
+		rmem_mm->tlb.start = addr;
+		rmem_mm->tlb.end = old_end;
+
+		if (!ret && addr != next) {
+			goto again;
+		}
+	}
+
+	return ret;
+}
+
+static int hmm_rmem_unmap_anon(struct hmm_rmem *rmem,
+			       struct vm_area_struct *vma,
+			       unsigned long faddr,
+			       unsigned long laddr)
+{
+	struct hmm_rmem_mm rmem_mm;
+	struct mm_walk walk = {0};
+	unsigned long i, npages;
+	int ret;
+
+	if (vma->vm_file) {
+		return -EINVAL;
+	}
 
-static int hmm_device_fence_wait(struct hmm_device *device,
-				 struct hmm_fence *fence);
+	npages = (laddr - faddr) >> PAGE_SHIFT;
+	rmem->pgoff = faddr;
+	rmem_mm.vma = vma;
+	rmem_mm.rmem = rmem;
+	rmem_mm.faddr = faddr;
+	rmem_mm.laddr = faddr;
+	rmem_mm.fuid = rmem->fuid;
+	memset(rmem->pfns, 0, sizeof(long) * npages);
+
+	rmem_mm.force_flush = 0;
+	walk.pmd_entry = hmm_rmem_unmap_pmd;
+	walk.mm = vma->vm_mm;
+	walk.private = &rmem_mm;
+
+	mmu_notifier_invalidate_range_start(walk.mm,vma,faddr,laddr,MMU_HMM);
+	tlb_gather_mmu(&rmem_mm.tlb, walk.mm, faddr, laddr);
+	tlb_start_vma(&rmem_mm.tlb, rmem_mm.vma);
+	ret = walk_page_range(faddr, laddr, &walk);
+	tlb_end_vma(&rmem_mm.tlb, rmem_mm.vma);
+	tlb_finish_mmu(&rmem_mm.tlb, faddr, laddr);
+	mmu_notifier_invalidate_range_end(walk.mm, vma, faddr, laddr, MMU_HMM);
 
+	/* Before migrating pages we must lock them. Here we lock all the pages
+	 * we could not lock while holding the pte lock.
+	 */
+	npages = (rmem_mm.laddr - faddr) >> PAGE_SHIFT;
+	for (i = 0; i < npages; ++i) {
+		struct page *page;
 
+		if (test_bit(HMM_PFN_VALID_ZERO, &rmem->pfns[i])) {
+			continue;
+		}
 
+		page = hmm_pfn_to_page(rmem->pfns[i]);
+		if (!test_bit(HMM_PFN_LOCK, &rmem->pfns[i])) {
+			lock_page(page);
+			set_bit(HMM_PFN_LOCK, &rmem->pfns[i]);
+		}
+	}
 
-/* hmm_event - use to synchronize various mm events with each others.
- *
- * During life time of process various mm events will happen, hmm serialize
- * event that affect overlapping range of address. The hmm_event are use for
- * that purpose.
- */
+	return ret;
+}
 
-static inline bool hmm_event_overlap(struct hmm_event *a, struct hmm_event *b)
+static inline int hmm_rmem_unmap(struct hmm_rmem *rmem,
+				 struct vm_area_struct *vma,
+				 unsigned long faddr,
+				 unsigned long laddr)
 {
-	return !((a->laddr <= b->faddr) || (a->faddr >= b->laddr));
+	if (vma->vm_file) {
+		return -EBUSY;
+	} else {
+		return hmm_rmem_unmap_anon(rmem, vma, faddr, laddr);
+	}
 }
 
-static inline unsigned long hmm_event_size(struct hmm_event *event)
+static int hmm_rmem_alloc_pages(struct hmm_rmem *rmem,
+				struct vm_area_struct *vma,
+				unsigned long addr)
 {
-	return (event->laddr - event->faddr);
-}
+	unsigned long i, npages = hmm_rmem_npages(rmem);
+	unsigned long *pfns = rmem->pfns;
+	struct mm_struct *mm = vma ? vma->vm_mm : NULL;
+	int ret = 0;
 
+	if (vma && !(vma->vm_file)) {
+		if (unlikely(anon_vma_prepare(vma))) {
+			return -ENOMEM;
+		}
+	}
 
+	for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
+		struct page *page;
 
+		/* (i) This does happen if the vma is being split and the rmem
+		 * split failed: we then fall back to full rmem migration and
+		 * there might not be a vma covering all the addresses (ie some
+		 * of the migration is useless but, to keep the code simpler,
+		 * we just copy more than necessary).
+		 */
+		if (vma && addr >= vma->vm_end) {
+			vma = mm ? find_vma(mm, addr) : NULL;
+		}
 
-/* hmm_fault_mm - used for reading cpu page table on device fault.
- *
- * This code deals with reading the cpu page table to find the pages that are
- * backing a range of address. It is use as an helper to the device page fault
- * code.
- */
+		/* No need to clear the pages, they will be the target of dma;
+		 * of course this means we trust the device driver.
+		 */
+		if (!vma) {
+			/* See above (i) for when this does happen. */
+			page = alloc_page(GFP_HIGHUSER_MOVABLE);
+		} else {
+			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, addr);
+		}
+		if (!page) {
+			ret = ret ? ret : -ENOMEM;
+			continue;
+		}
+		lock_page(page);
+		pfns[i] = page_to_pfn(page) << HMM_PFN_SHIFT;
+		set_bit(HMM_PFN_WRITE, &pfns[i]);
+		set_bit(HMM_PFN_LOCK, &pfns[i]);
+		set_bit(HMM_PFN_VALID_PAGE, &pfns[i]);
+		page_add_new_anon_rmap(page, vma, addr);
+	}
 
-/* struct hmm_fault_mm - used for reading cpu page table on device fault.
- *
- * @mm:     The mm of the process the device fault is happening in.
- * @vma:    The vma in which the fault is happening.
- * @faddr:  The first address for the range the device want to fault.
- * @laddr:  The last address for the range the device want to fault.
- * @pfns:   Array of hmm pfns (contains the result of the fault).
- * @write:  Is this write fault.
- */
-struct hmm_fault_mm {
-	struct mm_struct	*mm;
-	struct vm_area_struct	*vma;
-	unsigned long		faddr;
-	unsigned long		laddr;
-	unsigned long		*pfns;
-	bool			write;
-};
+	return ret;
+}
 
-static int hmm_fault_mm_fault_pmd(pmd_t *pmdp,
-				  unsigned long faddr,
-				  unsigned long laddr,
-				  struct mm_walk *walk)
+int hmm_rmem_migrate_to_lmem(struct hmm_rmem *rmem,
+			     struct vm_area_struct *vma,
+			     unsigned long addr,
+			     unsigned long fuid,
+			     unsigned long luid,
+			     bool adjust)
 {
-	struct hmm_fault_mm *fault_mm = walk->private;
-	unsigned long idx, *pfns;
-	pte_t *ptep;
+	struct hmm_device *device = rmem->device;
+	struct hmm_range *range, *next;
+	struct hmm_fence *fence, *tmp;
+	struct mm_struct *mm = vma ? vma->vm_mm : NULL;
+	struct list_head fences;
+	unsigned long i;
+	int ret = 0;
 
-	idx = (faddr - fault_mm->faddr) >> PAGE_SHIFT;
-	pfns = &fault_mm->pfns[idx];
-	memset(pfns, 0, ((laddr - faddr) >> PAGE_SHIFT) * sizeof(long));
-	if (pmd_none(*pmdp)) {
-		return -ENOENT;
+	BUG_ON(vma && ((addr < vma->vm_start) || (addr >= vma->vm_end)));
+
+	/* Ignore split error, we will fall back to full migration. */
+	hmm_rmem_split(rmem, fuid, luid, adjust);
+
+	if (rmem->fuid > fuid || rmem->luid < luid) {
+		WARN_ONCE(1, "hmm: rmem split out of constraint.\n");
+		ret = -EINVAL;
+		goto error;
 	}
 
-	if (pmd_trans_huge(*pmdp)) {
-		/* FIXME */
-		return -EINVAL;
+	/* Adjust start address for page allocation if necessary. */
+	if (vma && (rmem->fuid < fuid)) {
+		if (((addr-vma->vm_start)>>PAGE_SHIFT) < (fuid-rmem->fuid)) {
+			/* FIXME can this happen ? I would say no, but right
+			 * now i can not hold in my brain all the code paths
+			 * that lead to this place.
+			 */
+			vma = NULL;
+		} else {
+			addr -= ((fuid - rmem->fuid) << PAGE_SHIFT);
+		}
 	}
 
-	if (pmd_none_or_trans_huge_or_clear_bad(pmdp)) {
-		return -EINVAL;
+	ret = hmm_rmem_alloc_pages(rmem, vma, addr);
+	if (ret) {
+		goto error;
 	}
 
-	ptep = pte_offset_map(pmdp, faddr);
-	for (; faddr != laddr; ++ptep, ++pfns, faddr += PAGE_SIZE) {
-		pte_t pte = *ptep;
+	INIT_LIST_HEAD(&fences);
 
-		if (pte_none(pte)) {
-			if (fault_mm->write) {
-				ptep++;
-				break;
-			}
-			*pfns = my_zero_pfn(faddr) << HMM_PFN_SHIFT;
-			set_bit(HMM_PFN_VALID_ZERO, pfns);
-			continue;
+	/* No need to lock because at this point no one else can modify the
+	 * ranges list.
+	 */
+	list_for_each_entry (range, &rmem->ranges, rlist) {
+		fence = device->ops->rmem_update(range->mirror,
+						 range->rmem,
+						 range->faddr,
+						 range->laddr,
+						 range->fuid,
+						 HMM_MIGRATE_TO_LMEM,
+						 false);
+		if (IS_ERR(fence)) {
+			ret = PTR_ERR(fence);
+			goto error;
 		}
-		if (!pte_present(pte) || (fault_mm->write && !pte_write(pte))) {
-			/* Need to inc ptep so unmap unlock on right pmd. */
-			ptep++;
-			break;
+		if (fence) {
+			list_add_tail(&fence->list, &fences);
 		}
+	}
 
-		*pfns = pte_pfn(pte) << HMM_PFN_SHIFT;
-		set_bit(HMM_PFN_VALID_PAGE, pfns);
-		if (pte_write(pte)) {
-			set_bit(HMM_PFN_WRITE, pfns);
+	list_for_each_entry_safe (fence, tmp, &fences, list) {
+		int r;
+
+		r = hmm_device_fence_wait(device, fence);
+		ret = ret ? min(ret, r) : r;
+	}
+	if (ret) {
+		goto error;
+	}
+
+	fence = device->ops->rmem_to_lmem(rmem, rmem->fuid, rmem->luid);
+	if (IS_ERR(fence)) {
+		/* FIXME Check return value. */
+		ret = PTR_ERR(fence);
+		goto error;
+	}
+
+	if (fence) {
+		INIT_LIST_HEAD(&fence->list);
+		ret = hmm_device_fence_wait(device, fence);
+		if (ret) {
+			goto error;
 		}
-		/* Consider the page as hot as a device want to use it. */
-		mark_page_accessed(pfn_to_page(pte_pfn(pte)));
-		fault_mm->laddr = faddr + PAGE_SIZE;
 	}
-	pte_unmap(ptep - 1);
 
-	return (faddr == laddr) ? 0 : -ENOENT;
-}
+	/* Now the remote memory is officially dead and nothing below can fail
+	 * badly.
+	 */
+	rmem->dead = true;
 
-static int hmm_fault_mm_fault(struct hmm_fault_mm *fault_mm)
-{
-	struct mm_walk walk = {0};
-	unsigned long faddr, laddr;
-	int ret;
+	/* No need to lock because at this point no one else can modify the
+	 * ranges list.
+	 */
+	list_for_each_entry_safe (range, next, &rmem->ranges, rlist) {
+		VM_BUG_ON(!vma);
+		VM_BUG_ON(range->faddr < vma->vm_start);
+		VM_BUG_ON(range->laddr > vma->vm_end);
+
+		/* The remapping fails only if something goes terribly wrong. */
+		ret = hmm_rmem_remap_anon(rmem, vma, range->faddr,
+					  range->laddr, range->fuid);
+		if (ret) {
+			WARN_ONCE(1, "hmm: something is terribly wrong.\n");
+			hmm_rmem_poison_range(rmem, mm, vma, range->faddr,
+					      range->laddr, range->fuid);
+		}
+		hmm_range_fini(range);
+	}
 
-	faddr = fault_mm->faddr;
-	laddr = fault_mm->laddr;
-	fault_mm->laddr = faddr;
+	for (i = 0; i < hmm_rmem_npages(rmem); ++i) {
+		struct page *page = hmm_pfn_to_page(rmem->pfns[i]);
 
-	walk.pmd_entry = hmm_fault_mm_fault_pmd;
-	walk.mm = fault_mm->mm;
-	walk.private = fault_mm;
+		unlock_page(page);
+		mem_cgroup_transfer_charge_anon(page, mm);
+		page_remove_rmap(page);
+		page_cache_release(page);
+		rmem->pfns[i] = 0UL;
+	}
+	return 0;
 
-	ret = walk_page_range(faddr, laddr, &walk);
+error:
+	/* No need to lock because at this point no one else can modify the
+	 * ranges list.
+	 */
+	/* There are two cases here :
+	 * (1) rmem is mirroring shared memory, in which case we are facing the
+	 *     issue of poisoning all the mappings in all the processes for
+	 *     that file.
+	 * (2) rmem is mirroring private memory, the easy case: poison all the
+	 *     ranges referencing the rmem.
+	 */
+	for (i = 0; i < hmm_rmem_npages(rmem); ++i) {
+		struct page *page = hmm_pfn_to_page(rmem->pfns[i]);
+
+		if (!page) {
+			if (vma && !(vma->vm_flags & VM_SHARED)) {
+				/* Properly uncharge memory. */
+				mem_cgroup_uncharge_mm(mm);
+			}
+			continue;
+		}
+		/* Properly uncharge memory. */
+		mem_cgroup_transfer_charge_anon(page, mm);
+		if (!test_bit(HMM_PFN_LOCK, &rmem->pfns[i])) {
+			unlock_page(page);
+		}
+		page_remove_rmap(page);
+		page_cache_release(page);
+		rmem->pfns[i] = 0UL;
+	}
+	list_for_each_entry_safe (range, next, &rmem->ranges, rlist) {
+		mm = range->mirror->hmm->mm;
+		hmm_rmem_poison_range(rmem, mm, NULL, range->faddr,
+				      range->laddr, range->fuid);
+		hmm_range_fini(range);
+	}
 	return ret;
 }
 
@@ -285,6 +1610,7 @@ static int hmm_init(struct hmm *hmm, struct mm_struct *mm)
 	INIT_LIST_HEAD(&hmm->mirrors);
 	INIT_LIST_HEAD(&hmm->pending);
 	spin_lock_init(&hmm->lock);
+	hmm->ranges = RB_ROOT;
 	init_waitqueue_head(&hmm->wait_queue);
 
 	for (i = 0; i < HMM_MAX_EVENTS; ++i) {
@@ -298,6 +1624,12 @@ static int hmm_init(struct hmm *hmm, struct mm_struct *mm)
 	return ret;
 }
 
+static inline bool hmm_event_cover_range(struct hmm_event *a,
+					 struct hmm_range *b)
+{
+	return ((a->faddr <= b->faddr) && (a->laddr >= b->laddr));
+}
+
 static enum hmm_etype hmm_event_mmu(enum mmu_action action)
 {
 	switch (action) {
@@ -326,6 +1658,7 @@ static enum hmm_etype hmm_event_mmu(enum mmu_action action)
 	case MMU_MUNMAP:
 		return HMM_MUNMAP;
 	case MMU_SOFT_DIRTY:
+	case MMU_HMM:
 	default:
 		return HMM_NONE;
 	}
@@ -357,6 +1690,8 @@ static void hmm_destroy_kref(struct kref *kref)
 	mm->hmm = NULL;
 	mmu_notifier_unregister(&hmm->mmu_notifier, mm);
 
+	BUG_ON(!RB_EMPTY_ROOT(&hmm->ranges));
+
 	if (!list_empty(&hmm->mirrors)) {
 		BUG();
 		printk(KERN_ERR "destroying an hmm with still active mirror\n"
@@ -410,6 +1745,7 @@ out:
 	event->laddr = laddr;
 	event->backoff = false;
 	INIT_LIST_HEAD(&event->fences);
+	INIT_LIST_HEAD(&event->ranges);
 	hmm->nevents++;
 	list_add_tail(&event->list, &hmm->pending);
 
@@ -447,11 +1783,116 @@ wait:
 	goto retry_wait;
 }
 
+static int hmm_migrate_to_lmem(struct hmm *hmm,
+			       struct vm_area_struct *vma,
+			       unsigned long faddr,
+			       unsigned long laddr,
+			       bool adjust)
+{
+	struct hmm_range *range;
+	struct hmm_rmem *rmem;
+	int ret = 0;
+
+	if (unlikely(anon_vma_prepare(vma))) {
+		return -ENOMEM;
+	}
+
+retry:
+	spin_lock(&hmm->lock);
+	range = hmm_range_tree_iter_first(&hmm->ranges, faddr, laddr - 1);
+	while (range && faddr < laddr) {
+		struct hmm_device *device;
+		unsigned long fuid, luid, cfaddr, claddr;
+		int r;
+
+		cfaddr = max(faddr, range->faddr);
+		claddr = min(laddr, range->laddr);
+		fuid = range->fuid + ((cfaddr - range->faddr) >> PAGE_SHIFT);
+		luid = fuid + ((claddr - cfaddr) >> PAGE_SHIFT);
+		faddr = min(range->laddr, laddr);
+		rmem = hmm_rmem_ref(range->rmem);
+		device = rmem->device;
+		spin_unlock(&hmm->lock);
+
+		r = hmm_rmem_migrate_to_lmem(rmem, vma, cfaddr, fuid,
+					     luid, adjust);
+		hmm_rmem_unref(rmem);
+		if (r) {
+			ret = ret ? ret : r;
+			hmm_mirror_cleanup(range->mirror);
+			goto retry;
+		}
+
+		spin_lock(&hmm->lock);
+		range = hmm_range_tree_iter_first(&hmm->ranges,faddr,laddr-1);
+	}
+	spin_unlock(&hmm->lock);
+
+	return ret;
+}
+
+static unsigned long hmm_ranges_reserve(struct hmm *hmm, struct hmm_event *event)
+{
+	struct hmm_range *range;
+	unsigned long faddr, laddr, count = 0;
+
+	faddr = event->faddr;
+	laddr = event->laddr;
+
+retry:
+	spin_lock(&hmm->lock);
+	range = hmm_range_tree_iter_first(&hmm->ranges, faddr, laddr - 1);
+	while (range) {
+		if (!hmm_range_reserve(range, event)) {
+			struct hmm_rmem *rmem = hmm_rmem_ref(range->rmem);
+			spin_unlock(&hmm->lock);
+			wait_event(hmm->wait_queue, rmem->event != NULL);
+			hmm_rmem_unref(rmem);
+			goto retry;
+		}
+
+		if (list_empty(&range->elist)) {
+			list_add_tail(&range->elist, &event->ranges);
+			count++;
+		}
+
+		range = hmm_range_tree_iter_next(range, faddr, laddr - 1);
+	}
+	spin_unlock(&hmm->lock);
+
+	return count;
+}
+
+static void hmm_ranges_release(struct hmm *hmm, struct hmm_event *event)
+{
+	struct hmm_range *range, *next;
+
+	list_for_each_entry_safe (range, next, &event->ranges, elist) {
+		hmm_range_release(range, event);
+	}
+}
+
 static void hmm_update_mirrors(struct hmm *hmm,
 			       struct vm_area_struct *vma,
 			       struct hmm_event *event)
 {
 	unsigned long faddr, laddr;
+	bool migrate = false;
+
+	switch (event->etype) {
+	case HMM_COW:
+		migrate = true;
+		break;
+	case HMM_MUNMAP:
+		migrate = vma->vm_file ? true : false;
+		break;
+	default:
+		break;
+	}
+
+	if (hmm_ranges_reserve(hmm, event) && migrate) {
+		hmm_migrate_to_lmem(hmm,vma,event->faddr,event->laddr,false);
+	}
 
 	for (faddr = event->faddr; faddr < event->laddr; faddr = laddr) {
 		struct hmm_mirror *mirror;
@@ -494,6 +1935,7 @@ retry_ranges:
 			}
 		}
 	}
+	hmm_ranges_release(hmm, event);
 }
 
 static int hmm_fault_mm(struct hmm *hmm,
@@ -529,6 +1971,98 @@ static int hmm_fault_mm(struct hmm *hmm,
 	return 0;
 }
 
+/* see include/linux/hmm.h */
+int hmm_mm_fault(struct mm_struct *mm,
+		 struct vm_area_struct *vma,
+		 unsigned long addr,
+		 pte_t *pte,
+		 pmd_t *pmd,
+		 unsigned int fault_flags,
+		 pte_t opte)
+{
+	struct hmm_mirror *mirror = NULL;
+	struct hmm_device *device;
+	struct hmm_event *event;
+	struct hmm_range *range;
+	struct hmm_rmem *rmem = NULL;
+	unsigned long uid, faddr, laddr;
+	swp_entry_t entry;
+	struct hmm *hmm = hmm_ref(mm->hmm);
+	int ret;
+
+	if (!hmm) {
+		BUG();
+		return VM_FAULT_SIGBUS;
+	}
+
+	/* Find the corresponding rmem. */
+	entry = pte_to_swp_entry(opte);
+	if (!is_hmm_entry(entry)) {
+		//print_bad_pte(vma, addr, opte, NULL);
+		hmm_unref(hmm);
+		return VM_FAULT_SIGBUS;
+	}
+	uid = hmm_entry_uid(entry);
+	if (!uid) {
+		/* Poisonous hmm swap entry. */
+		hmm_unref(hmm);
+		return VM_FAULT_SIGBUS;
+	}
+
+	rmem = hmm_rmem_find(uid);
+	if (!rmem) {
+		hmm_unref(hmm);
+		if (pte_same(*pte, opte)) {
+			//print_bad_pte(vma, addr, opte, NULL);
+			return VM_FAULT_SIGBUS;
+		}
+		return 0;
+	}
+
+	faddr = addr & PAGE_MASK;
+	/* FIXME use the readahead value as a hint on how much to migrate. */
+	laddr = min(faddr + (16 << PAGE_SHIFT), vma->vm_end);
+	spin_lock(&rmem->lock);
+	list_for_each_entry (range, &rmem->ranges, rlist) {
+		if (faddr < range->faddr || faddr >= range->laddr) {
+			continue;
+		}
+		if (range->mirror->hmm == hmm) {
+			laddr = min(laddr, range->laddr);
+			mirror = hmm_mirror_ref(range->mirror);
+			break;
+		}
+	}
+	spin_unlock(&rmem->lock);
+	hmm_rmem_unref(rmem);
+	hmm_unref(hmm);
+	if (mirror == NULL) {
+		if (pte_same(*pte, opte)) {
+			//print_bad_pte(vma, addr, opte, NULL);
+			return VM_FAULT_SIGBUS;
+		}
+		return 0;
+	}
+
+	device = rmem->device;
+	event = hmm_event_get(hmm, faddr, laddr, HMM_MIGRATE_TO_LMEM);
+	hmm_ranges_reserve(hmm, event);
+	ret = hmm_migrate_to_lmem(hmm, vma, faddr, laddr, true);
+	hmm_ranges_release(hmm, event);
+	hmm_event_unqueue(hmm, event);
+	hmm_mirror_unref(mirror);
+	switch (ret) {
+	case 0:
+		break;
+	case -ENOMEM:
+		return VM_FAULT_OOM;
+	default:
+		return VM_FAULT_SIGBUS;
+	}
+
+	return VM_FAULT_MAJOR;
+}
+
 
 
 
@@ -726,16 +2260,15 @@ static struct mmu_notifier_ops hmm_notifier_ops = {
  * device page table (through hmm callback). Or provide helper functions use by
  * the device driver to fault in range of memory in the device page table.
  */
-
-static int hmm_mirror_update(struct hmm_mirror *mirror,
-			     struct vm_area_struct *vma,
-			     unsigned long faddr,
-			     unsigned long laddr,
-			     struct hmm_event *event)
+
+static int hmm_mirror_lmem_update(struct hmm_mirror *mirror,
+				  unsigned long faddr,
+				  unsigned long laddr,
+				  struct hmm_event *event,
+				  bool dirty)
 {
 	struct hmm_device *device = mirror->device;
 	struct hmm_fence *fence;
-	bool dirty = !!(vma->vm_file);
 
 	fence = device->ops->lmem_update(mirror, faddr, laddr,
 					 event->etype, dirty);
@@ -749,6 +2282,175 @@ static int hmm_mirror_update(struct hmm_mirror *mirror,
 	return 0;
 }
 
+static int hmm_mirror_rmem_update(struct hmm_mirror *mirror,
+				  struct hmm_rmem *rmem,
+				  unsigned long faddr,
+				  unsigned long laddr,
+				  unsigned long fuid,
+				  struct hmm_event *event,
+				  bool dirty)
+{
+	struct hmm_device *device = mirror->device;
+	struct hmm_fence *fence;
+
+	fence = device->ops->rmem_update(mirror, rmem, faddr, laddr,
+					 fuid, event->etype, dirty);
+	if (fence) {
+		if (IS_ERR(fence)) {
+			return PTR_ERR(fence);
+		}
+		fence->mirror = mirror;
+		list_add_tail(&fence->list, &event->fences);
+	}
+	return 0;
+}
+
+static int hmm_mirror_update(struct hmm_mirror *mirror,
+			     struct vm_area_struct *vma,
+			     unsigned long faddr,
+			     unsigned long laddr,
+			     struct hmm_event *event)
+{
+	struct hmm *hmm = mirror->hmm;
+	unsigned long caddr = faddr;
+	bool free = false, dirty = !!(vma->vm_flags & VM_SHARED);
+	int ret;
+
+	switch (event->etype) {
+	case HMM_MUNMAP:
+		free = true;
+		break;
+	default:
+		break;
+	}
+
+	for (; caddr < laddr;) {
+		struct hmm_range *range;
+		unsigned long naddr;
+
+		spin_lock(&hmm->lock);
+		range = hmm_range_tree_iter_first(&hmm->ranges,caddr,laddr-1);
+		if (range && range->mirror != mirror) {
+			range = NULL;
+		}
+		spin_unlock(&hmm->lock);
+
+		/* At this point the range is on the event list and thus it can
+		 * not disappear.
+		 */
+		BUG_ON(range && list_empty(&range->elist));
+
+		if (!range || (range->faddr > caddr)) {
+			naddr = range ? range->faddr : laddr;
+			ret = hmm_mirror_lmem_update(mirror, caddr, naddr,
+						     event, dirty);
+			if (ret) {
+				return ret;
+			}
+			caddr = naddr;
+		}
+		if (range) {
+			unsigned long fuid;
+
+			naddr = min(range->laddr, laddr);
+			fuid = range->fuid+((caddr-range->faddr)>>PAGE_SHIFT);
+			ret = hmm_mirror_rmem_update(mirror,range->rmem,caddr,
+						     naddr,fuid,event,dirty);
+			caddr = naddr;
+			if (ret) {
+				return ret;
+			}
+			if (free) {
+				BUG_ON((caddr > range->faddr) ||
+				       (naddr < range->laddr));
+				hmm_range_fini_clear(range, vma);
+			}
+		}
+	}
+	return 0;
+}
+
+static unsigned long hmm_mirror_ranges_reserve(struct hmm_mirror *mirror,
+					       struct hmm_event *event)
+{
+	struct hmm_range *range;
+	unsigned long faddr, laddr, count = 0;
+	struct hmm *hmm = mirror->hmm;
+
+	faddr = event->faddr;
+	laddr = event->laddr;
+
+retry:
+	spin_lock(&hmm->lock);
+	range = hmm_range_tree_iter_first(&hmm->ranges, faddr, laddr - 1);
+	while (range) {
+		if (range->mirror == mirror) {
+			if (!hmm_range_reserve(range, event)) {
+				struct hmm_rmem *rmem;
+
+				rmem = hmm_rmem_ref(range->rmem);
+				spin_unlock(&hmm->lock);
+				wait_event(hmm->wait_queue, rmem->event!=NULL);
+				hmm_rmem_unref(rmem);
+				goto retry;
+			}
+			if (list_empty(&range->elist)) {
+				list_add_tail(&range->elist, &event->ranges);
+				count++;
+			}
+		}
+		range = hmm_range_tree_iter_next(range, faddr, laddr - 1);
+	}
+	spin_unlock(&hmm->lock);
+
+	return count;
+}
+
+static void hmm_mirror_ranges_migrate(struct hmm_mirror *mirror,
+				      struct vm_area_struct *vma,
+				      struct hmm_event *event)
+{
+	struct hmm_range *range;
+	struct hmm *hmm = mirror->hmm;
+
+	spin_lock(&hmm->lock);
+	range = hmm_range_tree_iter_first(&hmm->ranges,
+					  vma->vm_start,
+					  vma->vm_end - 1);
+	while (range) {
+		struct hmm_rmem *rmem;
+
+		if (range->mirror != mirror) {
+			goto next;
+		}
+		rmem = hmm_rmem_ref(range->rmem);
+		spin_unlock(&hmm->lock);
+
+		hmm_rmem_migrate_to_lmem(rmem, vma, range->faddr,
+					 hmm_range_fuid(range),
+					 hmm_range_luid(range),
+					 true);
+		hmm_rmem_unref(rmem);
+
+		spin_lock(&hmm->lock);
+	next:
+		range = hmm_range_tree_iter_first(&hmm->ranges,
+						  vma->vm_start,
+						  vma->vm_end - 1);
+	}
+	spin_unlock(&hmm->lock);
+}
+
+static void hmm_mirror_ranges_release(struct hmm_mirror *mirror,
+				      struct hmm_event *event)
+{
+	struct hmm_range *range, *next;
+
+	list_for_each_entry_safe (range, next, &event->ranges, elist) {
+		hmm_range_release(range, event);
+	}
+}
+
 static void hmm_mirror_cleanup(struct hmm_mirror *mirror)
 {
 	struct vm_area_struct *vma;
@@ -778,11 +2480,16 @@ static void hmm_mirror_cleanup(struct hmm_mirror *mirror)
 		faddr = max(faddr, vma->vm_start);
 		laddr = vma->vm_end;
 
+		hmm_mirror_ranges_reserve(mirror, event);
+
 		hmm_mirror_update(mirror, vma, faddr, laddr, event);
 		list_for_each_entry_safe (fence, next, &event->fences, list) {
 			hmm_device_fence_wait(device, fence);
 		}
 
+		hmm_mirror_ranges_migrate(mirror, vma, event);
+		hmm_mirror_ranges_release(mirror, event);
+
 		if (laddr >= vma->vm_end) {
 			vma = vma->vm_next;
 		}
@@ -949,6 +2656,33 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror)
 }
 EXPORT_SYMBOL(hmm_mirror_unregister);
 
+static int hmm_mirror_rmem_fault(struct hmm_mirror *mirror,
+				 struct hmm_fault *fault,
+				 struct vm_area_struct *vma,
+				 struct hmm_range *range,
+				 struct hmm_event *event,
+				 unsigned long faddr,
+				 unsigned long laddr,
+				 bool write)
+{
+	struct hmm_device *device = mirror->device;
+	struct hmm_rmem *rmem = range->rmem;
+	unsigned long fuid, luid, npages;
+	int ret;
+
+	if (range->mirror != mirror) {
+		/* Returning -EAGAIN will force cpu page fault path. */
+		return -EAGAIN;
+	}
+
+	npages = (range->laddr - range->faddr) >> PAGE_SHIFT;
+	fuid = range->fuid + ((faddr - range->faddr) >> PAGE_SHIFT);
+	luid = fuid + npages;
+
+	ret = device->ops->rmem_fault(mirror, rmem, faddr, laddr, fuid, fault);
+	return ret;
+}
+
 static int hmm_mirror_lmem_fault(struct hmm_mirror *mirror,
 				 struct hmm_fault *fault,
 				 unsigned long faddr,
@@ -995,6 +2729,7 @@ int hmm_mirror_fault(struct hmm_mirror *mirror,
 retry:
 	down_read(&hmm->mm->mmap_sem);
 	event = hmm_event_get(hmm, caddr, naddr, HMM_DEVICE_FAULT);
+	hmm_ranges_reserve(hmm, event);
 	/* FIXME handle gate area ? and guard page */
 	vma = find_extend_vma(hmm->mm, caddr);
 	if (!vma) {
@@ -1031,6 +2766,29 @@ retry:
 
 	for (; caddr < event->laddr;) {
 		struct hmm_fault_mm fault_mm;
+		struct hmm_range *range;
+
+		spin_lock(&hmm->lock);
+		range = hmm_range_tree_iter_first(&hmm->ranges,
+						  caddr,
+						  naddr - 1);
+		if (range && range->faddr > caddr) {
+			naddr = range->faddr;
+			range = NULL;
+		}
+		spin_unlock(&hmm->lock);
+		if (range) {
+			naddr = min(range->laddr, event->laddr);
+			ret = hmm_mirror_rmem_fault(mirror,fault,vma,range,
+						    event,caddr,naddr,write);
+			if (ret) {
+				do_fault = (ret == -EAGAIN);
+				goto out;
+			}
+			caddr = naddr;
+			naddr = event->laddr;
+			continue;
+		}
 
 		fault_mm.mm = vma->vm_mm;
 		fault_mm.vma = vma;
@@ -1067,6 +2825,7 @@ retry:
 	}
 
 out:
+	hmm_ranges_release(hmm, event);
 	hmm_event_unqueue(hmm, event);
 	if (do_fault && !event->backoff && !mirror->dead) {
 		do_fault = false;
@@ -1092,6 +2851,334 @@ EXPORT_SYMBOL(hmm_mirror_fault);
 
 
 
+/* hmm_migrate - Memory migration to/from local memory from/to remote memory.
+ *
+ * Below are functions that handle migration to/from local memory from/to
+ * remote memory (rmem).
+ *
+ * Migration to remote memory is a multi-step process: first pages are unmapped
+ * and missing pages are either allocated or accounted as new allocations. Then
+ * pages are copied to remote memory. Finally the remote memory is faulted so
+ * that the device driver updates the device page table.
+ *
+ * The device driver can decide to abort migration to remote memory at any step
+ * of the process by returning a special value from the callback corresponding
+ * to that step.
+ *
+ * Migration to local memory is simpler. First pages are allocated, then remote
+ * memory is copied into those pages. Once the dma is done the pages are
+ * remapped inside the cpu page table or inside the page cache (for shared
+ * memory) and finally the rmem is freed.
+ */
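+
+/* For reference, the driver callback order used by the code below :
+ *   lmem -> rmem : pages are unmapped (hmm_rmem_unmap), then rmem_alloc(),
+ *                  then lmem_to_rmem() followed by a fence wait, and finally
+ *                  rmem_fault() so the device can map the new rmem.
+ *   rmem -> lmem : pages are allocated (hmm_rmem_alloc_pages), then
+ *                  rmem_update(HMM_MIGRATE_TO_LMEM) and rmem_to_lmem(), each
+ *                  followed by a fence wait, and finally the pages are
+ *                  remapped into the cpu page table (hmm_rmem_remap_anon).
+ */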
+
+/* see include/linux/hmm.h */
+int hmm_migrate_rmem_to_lmem(struct hmm_mirror *mirror,
+			     unsigned long faddr,
+			     unsigned long laddr)
+{
+	struct hmm *hmm = mirror->hmm;
+	struct vm_area_struct *vma;
+	struct hmm_event *event;
+	unsigned long next;
+	int ret = 0;
+
+	event = hmm_event_get(hmm, faddr, laddr, HMM_MIGRATE_TO_LMEM);
+	if (!hmm_ranges_reserve(hmm, event)) {
+		hmm_event_unqueue(hmm, event);
+		return 0;
+	}
+
+	hmm_mirror_ref(mirror);
+	down_read(&hmm->mm->mmap_sem);
+	vma = find_vma(hmm->mm, faddr);
+	if (vma) {
+		faddr = max(vma->vm_start, faddr);
+	}
+	for (; vma && (faddr < laddr); faddr = next) {
+		next = min(laddr, vma->vm_end);
+
+		ret = hmm_migrate_to_lmem(hmm, vma, faddr, next, true);
+		if (ret) {
+			break;
+		}
+
+		vma = vma->vm_next;
+		if (vma) {
+			next = max(vma->vm_start, next);
+		}
+	}
+	up_read(&hmm->mm->mmap_sem);
+	hmm_ranges_release(hmm, event);
+	hmm_event_unqueue(hmm, event);
+	hmm_mirror_unref(mirror);
+	return ret;
+}
+EXPORT_SYMBOL(hmm_migrate_rmem_to_lmem);
+
+static void hmm_migrate_abort(struct hmm_mirror *mirror,
+			      struct hmm_fault *fault,
+			      unsigned long *pfns,
+			      unsigned long fuid)
+{
+	struct vm_area_struct *vma = fault->vma;
+	struct hmm_rmem rmem;
+	unsigned long i, npages;
+
+	npages = (fault->laddr - fault->faddr) >> PAGE_SHIFT;
+	for (i = npages - 1; i > 0; --i) {
+		if (pfns[i]) {
+			break;
+		}
+		npages = i;
+	}
+	if (!npages) {
+		return;
+	}
+
+	/* Fake temporary rmem object. */
+	hmm_rmem_init(&rmem, mirror->device);
+	rmem.fuid = fuid;
+	rmem.luid = fuid + npages;
+	rmem.pfns = pfns;
+
+	if (!(vma->vm_file)) {
+		unsigned long faddr, laddr;
+
+		faddr = fault->faddr;
+		laddr = faddr + (npages << PAGE_SHIFT);
+
+		/* The remapping fails only if something goes terribly wrong. */
+		if (hmm_rmem_remap_anon(&rmem, vma, faddr, laddr, fuid)) {
+
+			WARN_ONCE(1, "hmm: something is terribly wrong.\n");
+			hmm_rmem_poison_range(&rmem, vma->vm_mm, vma,
+					      faddr, laddr, fuid);
+		}
+	} else {
+		BUG();
+	}
+
+	/* Ok, officially dead. */
+	if (fault->rmem) {
+		fault->rmem->dead = true;
+	}
+
+	for (i = 0; i < npages; ++i) {
+		struct page *page = hmm_pfn_to_page(pfns[i]);
+
+		if (!page) {
+			pfns[i] = 0;
+			continue;
+		}
+		if (test_bit(HMM_PFN_VALID_ZERO, &pfns[i])) {
+			/* Properly uncharge memory. */
+			add_mm_counter(vma->vm_mm, MM_ANONPAGES, -1);
+			mem_cgroup_uncharge_mm(vma->vm_mm);
+			pfns[i] = 0;
+			continue;
+		}
+		if (test_bit(HMM_PFN_LOCK, &pfns[i])) {
+			unlock_page(page);
+			clear_bit(HMM_PFN_LOCK, &pfns[i]);
+		}
+		page_remove_rmap(page);
+		page_cache_release(page);
+		pfns[i] = 0;
+	}
+}
+
+/* see include/linux/hmm.h */
+int hmm_migrate_lmem_to_rmem(struct hmm_fault *fault,
+			     struct hmm_mirror *mirror)
+{
+	struct vm_area_struct *vma;
+	struct hmm_device *device;
+	struct hmm_range *range;
+	struct hmm_fence *fence;
+	struct hmm_event *event;
+	struct hmm_rmem rmem;
+	unsigned long i, npages;
+	struct hmm *hmm;
+	int ret;
+
+	mirror = hmm_mirror_ref(mirror);
+	if (!fault || !mirror || fault->faddr > fault->laddr) {
+		return -EINVAL;
+	}
+	if (mirror->dead) {
+		hmm_mirror_unref(mirror);
+		return -ENODEV;
+	}
+	hmm = mirror->hmm;
+	device = mirror->device;
+	if (!device->rmem) {
+		hmm_mirror_unref(mirror);
+		return -EINVAL;
+	}
+	fault->rmem = NULL;
+	fault->faddr = fault->faddr & PAGE_MASK;
+	fault->laddr = PAGE_ALIGN(fault->laddr);
+	hmm_rmem_init(&rmem, mirror->device);
+	event = hmm_event_get(hmm, fault->faddr, fault->laddr,
+			      HMM_MIGRATE_TO_RMEM);
+	rmem.event = event;
+	hmm = mirror->hmm;
+
+	range = kmalloc(sizeof(struct hmm_range), GFP_KERNEL);
+	if (range == NULL) {
+		hmm_event_unqueue(hmm, event);
+		hmm_mirror_unref(mirror);
+		return -ENOMEM;
+	}
+
+	down_read(&hmm->mm->mmap_sem);
+	vma = find_vma_intersection(hmm->mm, fault->faddr, fault->laddr);
+	if (!vma) {
+		kfree(range);
+		range = NULL;
+		ret = -EFAULT;
+		goto out;
+	}
+	/* FIXME support HUGETLB */
+	if ((vma->vm_flags & (VM_IO | VM_PFNMAP | VM_MIXEDMAP | VM_HUGETLB))) {
+		kfree(range);
+		range = NULL;
+		ret = -EACCES;
+		goto out;
+	}
+	if (vma->vm_file) {
+		kfree(range);
+		range = NULL;
+		ret = -EBUSY;
+		goto out;
+	}
+	/* Adjust range to this vma only. */
+	event->faddr = fault->faddr = max(fault->faddr, vma->vm_start);
+	event->laddr  =fault->laddr = min(fault->laddr, vma->vm_end);
+	npages = (fault->laddr - fault->faddr) >> PAGE_SHIFT;
+	fault->vma = vma;
+
+	ret = hmm_rmem_alloc(&rmem, npages);
+	if (ret) {
+		kfree(range);
+		range = NULL;
+		goto out;
+	}
+
+	/* Prior to unmapping, add the range to the hmm range tree so any page
+	 * fault can find the proper range.
+	 */
+	hmm_range_init(range, mirror, &rmem, fault->faddr,
+		       fault->laddr, rmem.fuid);
+	hmm_range_insert(range);
+
+	ret = hmm_rmem_unmap(&rmem, vma, fault->faddr, fault->laddr);
+	if (ret) {
+		hmm_migrate_abort(mirror, fault, rmem.pfns, rmem.fuid);
+		goto out;
+	}
+
+	fault->rmem = device->ops->rmem_alloc(device, fault);
+	if (IS_ERR(fault->rmem)) {
+		ret = PTR_ERR(fault->rmem);
+		hmm_migrate_abort(mirror, fault, rmem.pfns, rmem.fuid);
+		goto out;
+	}
+	if (fault->rmem == NULL) {
+		hmm_migrate_abort(mirror, fault, rmem.pfns, rmem.fuid);
+		ret = 0;
+		goto out;
+	}
+	if (event->backoff) {
+		ret = -EBUSY;
+		hmm_migrate_abort(mirror, fault, rmem.pfns, rmem.fuid);
+		goto out;
+	}
+
+	hmm_rmem_init(fault->rmem, mirror->device);
+	spin_lock(&_hmm_rmems_lock);
+	fault->rmem->event = event;
+	hmm_rmem_tree_remove(&rmem, &_hmm_rmems);
+	fault->rmem->fuid = rmem.fuid;
+	fault->rmem->luid = rmem.luid;
+	hmm_rmem_tree_insert(fault->rmem, &_hmm_rmems);
+	fault->rmem->pfns = rmem.pfns;
+	range->rmem = fault->rmem;
+	list_del_init(&range->rlist);
+	list_add_tail(&range->rlist, &fault->rmem->ranges);
+	rmem.event = NULL;
+	spin_unlock(&_hmm_rmems_lock);
+
+	fence = device->ops->lmem_to_rmem(fault->rmem,rmem.fuid,rmem.luid);
+	if (IS_ERR(fence)) {
+		hmm_migrate_abort(mirror, fault, rmem.pfns, rmem.fuid);
+		goto out;
+	}
+
+	ret = hmm_device_fence_wait(device, fence);
+	if (ret) {
+		hmm_migrate_abort(mirror, fault, rmem.pfns, rmem.fuid);
+		goto out;
+	}
+
+	ret = device->ops->rmem_fault(mirror, range->rmem, range->faddr,
+				      range->laddr, range->fuid, NULL);
+	if (ret) {
+		hmm_migrate_abort(mirror, fault, rmem.pfns, rmem.fuid);
+		goto out;
+	}
+
+	for (i = 0; i < npages; ++i) {
+		struct page *page = hmm_pfn_to_page(rmem.pfns[i]);
+
+		if (test_bit(HMM_PFN_VALID_ZERO, &rmem.pfns[i])) {
+			rmem.pfns[i] = rmem.pfns[i] & HMM_PFN_CLEAR;
+			continue;
+		}
+		/* We only decrement the page count now so that cow happens
+		 * properly while the page is in flight.
+		 */
+		if (PageAnon(page)) {
+			unlock_page(page);
+			page_remove_rmap(page);
+			page_cache_release(page);
+			rmem.pfns[i] &= HMM_PFN_CLEAR;
+		} else {
+			/* Otherwise this means the page is in pagecache. Keep
+			 * a reference and page count elevated.
+			 */
+			clear_bit(HMM_PFN_LOCK, &rmem.pfns[i]);
+			/* We do not want side effect of page_remove_rmap ie
+			 * zone page accounting udpate but we do want zero
+			 * mapcount so writeback works properly.
+			 */
+			atomic_add(-1, &page->_mapcount);
+			unlock_page(page);
+		}
+	}
+
+	hmm_mirror_ranges_release(mirror, event);
+	hmm_event_unqueue(hmm, event);
+	up_read(&hmm->mm->mmap_sem);
+	hmm_mirror_unref(mirror);
+	return 0;
+
+out:
+	if (!fault->rmem) {
+		kfree(rmem.pfns);
+		spin_lock(&_hmm_rmems_lock);
+		hmm_rmem_tree_remove(&rmem, &_hmm_rmems);
+		spin_unlock(&_hmm_rmems_lock);
+	}
+	hmm_mirror_ranges_release(mirror, event);
+	hmm_event_unqueue(hmm, event);
+	up_read(&hmm->mm->mmap_sem);
+	hmm_range_unref(range);
+	hmm_rmem_unref(fault->rmem);
+	hmm_mirror_unref(mirror);
+	return ret;
+}
+EXPORT_SYMBOL(hmm_migrate_lmem_to_rmem);
+
+
+
+
 /* hmm_device - Each device driver must register one and only one hmm_device
  *
  * The hmm_device is the link btw hmm and each device driver.
@@ -1140,9 +3227,22 @@ int hmm_device_register(struct hmm_device *device, const char *name)
 	BUG_ON(!device->ops->lmem_fault);
 
 	kref_init(&device->kref);
+	device->rmem = false;
 	device->name = name;
 	mutex_init(&device->mutex);
 	INIT_LIST_HEAD(&device->mirrors);
+	init_waitqueue_head(&device->wait_queue);
+
+	if (device->ops->rmem_alloc &&
+	    device->ops->rmem_update &&
+	    device->ops->rmem_fault &&
+	    device->ops->rmem_to_lmem &&
+	    device->ops->lmem_to_rmem &&
+	    device->ops->rmem_split &&
+	    device->ops->rmem_split_adjust &&
+	    device->ops->rmem_destroy) {
+		device->rmem = true;
+	}
 
 	return 0;
 }
@@ -1179,6 +3279,7 @@ static int __init hmm_module_init(void)
 {
 	int ret;
 
+	spin_lock_init(&_hmm_rmems_lock);
 	ret = init_srcu_struct(&srcu);
 	if (ret) {
 		return ret;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ceaf4d7..88e4acd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -56,6 +56,7 @@
 #include <linux/oom.h>
 #include <linux/lockdep.h>
 #include <linux/file.h>
+#include <linux/hmm.h>
 #include "internal.h"
 #include <net/sock.h>
 #include <net/ip.h>
@@ -6649,6 +6650,8 @@ one_by_one:
  *   2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
  *     target for charge migration. if @target is not NULL, the entry is stored
  *     in target->ent.
+ *   3(MC_TARGET_HMM): if it is an hmm entry, target->page is either NULL or
+ *     points to the page whose charge should be moved.
  *
  * Called with pte lock held.
  */
@@ -6661,6 +6664,7 @@ enum mc_target_type {
 	MC_TARGET_NONE = 0,
 	MC_TARGET_PAGE,
 	MC_TARGET_SWAP,
+	MC_TARGET_HMM,
 };
 
 static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
@@ -6690,6 +6694,9 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
 	struct page *page = NULL;
 	swp_entry_t ent = pte_to_swp_entry(ptent);
 
+	if (is_hmm_entry(ent)) {
+		return swp_to_radix_entry(ent);
+	}
 	if (!move_anon() || non_swap_entry(ent))
 		return NULL;
 	/*
@@ -6764,6 +6771,10 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
 
 	if (!page && !ent.val)
 		return ret;
+	if (radix_tree_exceptional_entry(page)) {
+		ret = MC_TARGET_HMM;
+		return ret;
+	}
 	if (page) {
 		pc = lookup_page_cgroup(page);
 		/*
@@ -7077,6 +7088,41 @@ put:			/* get_mctgt_type() gets the page */
 				mc.moved_swap++;
 			}
 			break;
+		case MC_TARGET_HMM:
+			if (target.page) {
+				page = target.page;
+				pc = lookup_page_cgroup(page);
+				if (!mem_cgroup_move_account(page, 1, pc,
+							     mc.from, mc.to)) {
+					mc.precharge--;
+					/* we uncharge from mc.from later. */
+					mc.moved_charge++;
+				}
+				put_page(page);
+			} else if (vma->vm_flags & VM_SHARED) {
+				/* Someone migrated the memory after we did
+				 * the pagecache lookup.
+				 */
+				/* FIXME can the precharge/moved_charge then
+				 * become wrong ?
+				 */
+				pte_unmap_unlock(pte - 1, ptl);
+				cond_resched();
+				goto retry;
+			} else {
+				unsigned long flags;
+
+				move_lock_mem_cgroup(mc.from, &flags);
+				move_lock_mem_cgroup(mc.to, &flags);
+				mem_cgroup_charge_statistics(mc.from, NULL, true, -1);
+				mem_cgroup_charge_statistics(mc.to, NULL, true, 1);
+				move_unlock_mem_cgroup(mc.to, &flags);
+				move_unlock_mem_cgroup(mc.from, &flags);
+				mc.precharge--;
+				/* we uncharge from mc.from later. */
+				mc.moved_charge++;
+			}
+			break;
 		default:
 			break;
 		}
diff --git a/mm/memory.c b/mm/memory.c
index 1e164a1..d35bc65 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -53,6 +53,7 @@
 #include <linux/writeback.h>
 #include <linux/memcontrol.h>
 #include <linux/mmu_notifier.h>
+#include <linux/hmm.h>
 #include <linux/kallsyms.h>
 #include <linux/swapops.h>
 #include <linux/elf.h>
@@ -851,6 +852,9 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 					if (pte_swp_soft_dirty(*src_pte))
 						pte = pte_swp_mksoft_dirty(pte);
 					set_pte_at(src_mm, addr, src_pte, pte);
+				} else if (is_hmm_entry(entry)) {
+					/* FIXME do we want to handle rblk fork, just mapcount rblk if so. */
+					BUG_ON(1);
 				}
 			}
 		}
@@ -3079,6 +3083,9 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			migration_entry_wait(mm, pmd, address);
 		} else if (is_hwpoison_entry(entry)) {
 			ret = VM_FAULT_HWPOISON;
+		} else if (is_hmm_entry(entry)) {
+			ret = hmm_mm_fault(mm, vma, address, page_table,
+					   pmd, flags, orig_pte);
 		} else {
 			print_bad_pte(vma, address, orig_pte, NULL);
 			ret = VM_FAULT_SIGBUS;
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 07/11] hmm: support moving anonymous page to remote memory
@ 2014-05-02 13:52   ` j.glisse
  0 siblings, 0 replies; 107+ messages in thread
From: j.glisse @ 2014-05-02 13:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel
  Cc: Jérôme Glisse, Sherry Cheung, Subhash Gutti,
	Mark Hairgrove, John Hubbard, Jatin Kumar

From: Jérôme Glisse <jglisse@redhat.com>

Motivation:

Migrating to device memory can allow a device to access memory through a link
with far greater bandwidth as well as lower latency. Migration to device
memory is of course only meaningful if the memory will only be accessed by the
device over a long period of time.

Because hmm aims only to provide an API to facilitate such use, it does not
deal with policy on when and what to migrate to remote memory. It is expected
that device drivers using hmm will have the information to make such choices.

Implementation:

This uses a two level structure to track remote memory. The first level is a
range structure that matches a range of addresses with a specific remote memory
object. This allows different ranges of addresses to point to the same remote
memory object (useful for shared memory).

The second level is a structure holding information specific to hmm about the
remote memory. These remote memory structures are allocated by the device
driver and thus can be embedded inside the remote memory structure that is
specific to the device driver.

Each remote memory object is given a range of unique ids. Those unique ids are
used to create special hmm swap entries. For anonymous memory the cpu page
table entries are set to this hmm swap entry and, on cpu page fault, the unique
id is used to find the remote memory and migrate it back to system memory.

Events other than a cpu page fault can trigger migration back to system memory.
For instance on fork, to simplify things, the remote memory is migrated back
to system memory.
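
The per-page arithmetic behind this is small: a range maps an address to a
unique id (uid = range->fuid + ((addr - range->faddr) >> PAGE_SHIFT)) and a
unique id indexes the rmem pfns array (idx = uid - rmem->fuid). Below is a
minimal userspace sketch of just that arithmetic; the demo_* types are
simplified stand-ins for struct hmm_range and struct hmm_rmem and the numbers
are made up:

#include <stdio.h>

#define DEMO_PAGE_SHIFT	12

/* Simplified stand-ins for struct hmm_rmem and struct hmm_range. */
struct demo_rmem {
	unsigned long fuid, luid;	/* first/last unique id */
};

struct demo_range {
	unsigned long faddr, laddr;	/* first/last address covered */
	unsigned long fuid;		/* unique id backing faddr */
	struct demo_rmem *rmem;
};

/* Mirrors uid = range->fuid + ((addr - range->faddr) >> PAGE_SHIFT). */
static unsigned long addr_to_uid(struct demo_range *range, unsigned long addr)
{
	return range->fuid + ((addr - range->faddr) >> DEMO_PAGE_SHIFT);
}

/* Mirrors idx = uid - rmem->fuid, the index into rmem->pfns[]. */
static unsigned long uid_to_idx(struct demo_rmem *rmem, unsigned long uid)
{
	return uid - rmem->fuid;
}

int main(void)
{
	struct demo_rmem rmem = { .fuid = 1000, .luid = 1016 };
	struct demo_range range = {
		.faddr = 0x400000, .laddr = 0x410000,
		.fuid = 1000, .rmem = &rmem,
	};
	unsigned long addr = 0x403000;
	unsigned long uid = addr_to_uid(&range, addr);

	printf("addr 0x%lx -> uid %lu -> pfns[%lu]\n",
	       addr, uid, uid_to_idx(&rmem, uid));
	return 0;
}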

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 include/linux/hmm.h          |  469 ++++++++-
 include/linux/mmu_notifier.h |    1 +
 include/linux/swap.h         |   12 +-
 include/linux/swapops.h      |   33 +-
 mm/hmm.c                     | 2307 ++++++++++++++++++++++++++++++++++++++++--
 mm/memcontrol.c              |   46 +
 mm/memory.c                  |    7 +
 7 files changed, 2768 insertions(+), 107 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index e9c7722..96f41c4 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -56,10 +56,10 @@
 
 struct hmm_device;
 struct hmm_device_ops;
-struct hmm_migrate;
 struct hmm_mirror;
 struct hmm_fault;
 struct hmm_event;
+struct hmm_rmem;
 struct hmm;
 
 /* The hmm provide page informations to the device using hmm pfn value. Below
@@ -67,15 +67,34 @@ struct hmm;
  * type of page, dirty page, page is locked or not, ...).
  *
  *   HMM_PFN_VALID_PAGE this means the pfn correspond to valid page.
- *   HMM_PFN_VALID_ZERO this means the pfn is the special zero page.
+ *   HMM_PFN_VALID_ZERO this means the pfn is the special zero page; either use
+ *     it or directly clear the rmem with zeros, whichever is the fastest
+ *     method for the device.
  *   HMM_PFN_DIRTY set when the page is dirty.
  *   HMM_PFN_WRITE is set if there is no need to call page_mkwrite
+ *   HMM_PFN_LOCK is only set while the rmem object is undergoing migration.
+ *   HMM_PFN_LMEM_UPTODATE the page in the rmem pfn array holds uptodate data.
+ *   HMM_PFN_RMEM_UPTODATE the rmem copy of the page is uptodate.
+ *
+ * Device drivers only need to worry about :
+ *   HMM_PFN_VALID_PAGE
+ *   HMM_PFN_VALID_ZERO
+ *   HMM_PFN_DIRTY
+ *   HMM_PFN_WRITE
+ * Device driver must set/clear the following flags after successful dma :
+ *   HMM_PFN_LMEM_UPTODATE
+ *   HMM_PFN_RMEM_UPTODATE
+ * All the other flags are for hmm internal use only.
  */
 #define HMM_PFN_SHIFT		(PAGE_SHIFT)
+#define HMM_PFN_CLEAR		(((1UL << HMM_PFN_SHIFT) - 1UL) & ~0x3UL)
 #define HMM_PFN_VALID_PAGE	(0UL)
 #define HMM_PFN_VALID_ZERO	(1UL)
 #define HMM_PFN_DIRTY		(2UL)
 #define HMM_PFN_WRITE		(3UL)
+#define HMM_PFN_LOCK		(4UL)
+#define HMM_PFN_LMEM_UPTODATE	(5UL)
+#define HMM_PFN_RMEM_UPTODATE	(6UL)
 
 static inline struct page *hmm_pfn_to_page(unsigned long pfn)
 {
@@ -95,6 +114,28 @@ static inline void hmm_pfn_set_dirty(unsigned long *pfn)
 	set_bit(HMM_PFN_DIRTY, pfn);
 }
 
+static inline void hmm_pfn_set_lmem_uptodate(unsigned long *pfn)
+{
+	set_bit(HMM_PFN_LMEM_UPTODATE, pfn);
+}
+
+static inline void hmm_pfn_set_rmem_uptodate(unsigned long *pfn)
+{
+	set_bit(HMM_PFN_RMEM_UPTODATE, pfn);
+}
+
+static inline void hmm_pfn_clear_lmem_uptodate(unsigned long *pfn)
+{
+	clear_bit(HMM_PFN_LMEM_UPTODATE, pfn);
+}
+
+static inline void hmm_pfn_clear_rmem_uptodate(unsigned long *pfn)
+{
+	clear_bit(HMM_PFN_RMEM_UPTODATE, pfn);
+}
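+
+/* For example, after a successful rmem_to_lmem dma the device driver is
+ * expected to call hmm_pfn_set_lmem_uptodate(&rmem->pfns[i]) on each copied
+ * page, and hmm_pfn_set_rmem_uptodate() after a successful lmem_to_rmem dma.
+ */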
+
+
+
 
 /* hmm_fence - device driver fence to wait for device driver operations.
  *
@@ -283,6 +324,255 @@ struct hmm_device_ops {
 			  unsigned long laddr,
 			  unsigned long *pfns,
 			  struct hmm_fault *fault);
+
+	/* rmem_alloc - allocate a new rmem object.
+	 *
+	 * @device: Device into which to allocate the remote memory.
+	 * @fault:  The fault for which this remote memory is allocated.
+	 * Returns: Valid rmem ptr on success, NULL or ERR_PTR otherwise.
+	 *
+	 * This allows migration to remote memory to operate in several steps.
+	 * First the hmm code will clamp the range that can be migrated and will
+	 * unmap pages and prepare them for migration.
+	 *
+	 * It is only once all the above steps are done that we know how much
+	 * memory can be migrated, which is when rmem_alloc is called to
+	 * allocate the device rmem object to which memory should be migrated.
+	 *
+	 * Device driver can decide through this callback to abort migration
+	 * by returning NULL, or it can decide to continue with migration by
+	 * returning a properly allocated rmem object.
+	 *
+	 * Return rmem or NULL on success, error value otherwise :
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * All other return value trigger warning and are transformed to -EIO.
+	 */
+	struct hmm_rmem *(*rmem_alloc)(struct hmm_device *device,
+				       struct hmm_fault *fault);
+
+	/* rmem_update() - update device mmu for a range of remote memory.
+	 *
+	 * @mirror: The mirror that link process address space with the device.
+	 * @rmem:   The remote memory under update.
+	 * @faddr:  First address in range (inclusive).
+	 * @laddr:  Last address in range (exclusive).
+	 * @fuid:   First uid of the remote memory at which the update begin.
+	 * @etype:  The type of memory event (unmap, fini, read only, ...).
+	 * @dirty:  Device driver should call hmm_pfn_set_dirty.
+	 * Returns: Valid fence ptr or NULL on success otherwise ERR_PTR.
+	 *
+	 * Called to update device mmu permission/usage for a range of remote
+	 * memory. The event type provide the nature of the update :
+	 *   - range is no longer valid (munmap).
+	 *   - range protection changes (mprotect, COW, ...).
+	 *   - range is unmapped (swap, reclaim, page migration, ...).
+	 *   - ...
+	 *
+	 * Any event that blocks further writes to the memory must also trigger
+	 * a device cache flush, and everything has to be flushed to remote
+	 * memory by the time the wait callback returns (if this callback
+	 * returned a fence, otherwise everything must be flushed by the time
+	 * the callback returns).
+	 *
+	 * Device must properly call hmm_pfn_set_dirty on any page the device
+	 * did write to since last call to update_rmem. This is only needed if
+	 * the dirty parameter is true.
+	 *
+	 * The driver should return a fence pointer or NULL on success. It is
+	 * advised to return a fence and delay waiting for the operation to
+	 * complete until the wait callback. Returning a fence allows hmm to
+	 * batch updates to several devices and delay the wait until they all
+	 * have scheduled the update.
+	 *
+	 * Device driver must not fail lightly, any failure results in the
+	 * device process being killed.
+	 *
+	 * Return fence or NULL on success, error value otherwise :
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * All other return value trigger warning and are transformed to -EIO.
+	 */
+	struct hmm_fence *(*rmem_update)(struct hmm_mirror *mirror,
+					 struct hmm_rmem *rmem,
+					 unsigned long faddr,
+					 unsigned long laddr,
+					 unsigned long fuid,
+					 enum hmm_etype etype,
+					 bool dirty);
+
+	/* rmem_fault() - fault range of rmem on the device mmu.
+	 *
+	 * @mirror: The mirror that link process address space with the device.
+	 * @rmem:   The rmem backing this range.
+	 * @faddr:  First address in range (inclusive).
+	 * @laddr:  Last address in range (exclusive).
+	 * @fuid:   First rmem unique id (inclusive).
+	 * @fault:  The fault structure provided by device driver.
+	 * Returns: 0 on success, error value otherwise.
+	 *
+	 * Called to give the device driver the remote memory that is backing a
+	 * range of memory. The device driver can map rmem pages with write
+	 * permission only if the HMM_PFN_WRITE bit is set. If the device wants
+	 * to write to this range of rmem it can call hmm_mirror_fault.
+	 *
+	 * Return error if scheduled operation failed. Valid value :
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * All other return value trigger warning and are transformed to -EIO.
+	 */
+	int (*rmem_fault)(struct hmm_mirror *mirror,
+			  struct hmm_rmem *rmem,
+			  unsigned long faddr,
+			  unsigned long laddr,
+			  unsigned long fuid,
+			  struct hmm_fault *fault);
+
+	/* rmem_to_lmem - copy remote memory to local memory.
+	 *
+	 * @rmem:   The remote memory structure.
+	 * @fuid:   First rmem unique id (inclusive) of range to copy.
+	 * @luid:   Last rmem unique id (exclusive) of range to copy.
+	 * Returns: Valid fence ptr or NULL on success otherwise ERR_PTR.
+	 *
+	 * This is called to copy remote memory back to local memory. The device
+	 * driver needs to schedule the dma to copy the remote memory to the
+	 * pages given by the pfns array. The device driver should return a
+	 * fence or an error pointer.
+	 *
+	 * If the device driver does not return a fence then it must wait until
+	 * the dma is done and all device caches are flushed. Moreover the
+	 * device driver must set HMM_PFN_LMEM_UPTODATE on all successfully
+	 * copied pages (setting this flag can be delayed to the fence_wait
+	 * callback).
+	 *
+	 * If a valid fence is returned then hmm will wait on it and reschedule
+	 * any thread that need rescheduling.
+	 *
+	 * DEVICE DRIVER MUST ABSOLUTELY TRY TO MAKE THIS CALL WORK OTHERWISE
+	 * CPU THREAD WILL GET A SIGBUS.
+	 *
+	 * DEVICE DRIVER MUST SET HMM_PFN_LMEM_UPTODATE ON ALL SUCCESSFULLY
+	 * COPIED PAGES.
+	 *
+	 * Return fence or NULL on success, error value otherwise :
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * All other return value trigger warning and are transformed to -EIO.
+	 */
+	struct hmm_fence *(*rmem_to_lmem)(struct hmm_rmem *rmem,
+					  unsigned long fuid,
+					  unsigned long luid);
+
+	/* lmem_to_rmem - copy local memory to remote memory.
+	 *
+	 * @rmem:   The remote memory structure.
+	 * @fuid:   First rmem unique id (inclusive) of range to copy.
+	 * @luid:   Last rmem unique id (exclusive) of range to copy.
+	 * Returns: Valid fence ptr or NULL on success otherwise ERR_PTR.
+	 *
+	 * This is called to copy local memory to remote memory. The driver
+	 * needs to schedule the dma to copy the local memory from the pages
+	 * given by the pfns array to the remote memory.
+	 *
+	 * The device driver should return a fence or an error pointer. If the
+	 * device driver does not return a fence then it must wait until the
+	 * dma is done. The device driver must set HMM_PFN_RMEM_UPTODATE on all
+	 * successfully copied pages.
+	 *
+	 * If a valid fence is returned then hmm will wait on it and reschedule
+	 * any thread that need rescheduling.
+	 *
+	 * Failure will result in aborting migration to remote memory.
+	 *
+	 * DEVICE DRIVER MUST SET HMM_PFN_RMEM_UPTODATE ON ALL SUCCESSFULLY
+	 * COPIED PAGES.
+	 *
+	 * Return fence or NULL on success, error value otherwise :
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * All other return value trigger warning and are transformed to -EIO.
+	 */
+	struct hmm_fence *(*lmem_to_rmem)(struct hmm_rmem *rmem,
+					  unsigned long fuid,
+					  unsigned long luid);
+
+	/* rmem_split - split rmem.
+	 *
+	 * @rmem:   The remote memory to split.
+	 * @fuid:   First rmem unique id (inclusive) of range to split.
+	 * @luid:   Last rmem unique id (exclusive) of range to split.
+	 * Returns: 0 on success, error value otherwise.
+	 *
+	 * Split remote memory, first the device driver must allocate a new
+	 * remote memory struct, second it must call hmm_rmem_split_new and
+	 * last it must transfer the private driver resources from the split
+	 * rmem to the new remote memory struct.
+	 *
+	 * Device driver _can not_ adjust nor the fuid nor the luid.
+	 *
+	 * Failure should be forwarded if any of the step fails. The device
+	 * driver does not need to worry about freeing the new remote memory
+	 * object once hmm_rmem_split_new is call as it will be freed through
+	 * the rmem_destroy callback if anything fails.
+	 *
+	 * DEVICE DRIVER MUST ABSOLUTELY TRY TO MAKE THIS CALL WORK OTHERWISE
+	 * THE WHOLE RMEM WILL BE MIGRATED BACK TO LMEM.
+	 *
+	 * Return error if operation failed. Valid value :
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * All other return value trigger warning and are transformed to -EIO.
+	 */
+	int (*rmem_split)(struct hmm_rmem *rmem,
+			  unsigned long fuid,
+			  unsigned long luid);
+
+	/* rmem_split_adjust - split rmem.
+	 *
+	 * @rmem:   The remote memory to split.
+	 * @fuid:   First rmem unique id (inclusive) of range to split.
+	 * @luid:   Last rmem unique id (exclusive) of range to split.
+	 * Returns: 0 on success, error value otherwise.
+	 *
+	 * Split remote memory, first the device driver must allocate a new
+	 * remote memory struct, second it must call hmm_rmem_split_new and
+	 * last it must transfer the private driver resources from the split
+	 * rmem to the new remote memory struct.
+	 *
+	 * Device driver _can_ adjust the fuid or the luid with constraint that
+	 * adjusted_fuid <= fuid and adjusted_luid >= luid.
+	 *
+	 * Failure should be forwarded if any of the step fails. The device
+	 * driver does not need to worry about freeing the new remote memory
+	 * object once hmm_rmem_split_new is call as it will be freed through
+	 * the rmem_destroy callback if anything fails.
+	 *
+	 * DEVICE DRIVER MUST ABSOLUTELY TRY TO MAKE THIS CALL WORK OTHERWISE
+	 * THE WHOLE RMEM WILL BE MIGRATED BACK TO LMEM.
+	 *
+	 * Return error if operation failed. Valid value :
+	 * -ENOMEM Not enough memory for performing the operation.
+	 * -EIO    Some input/output error with the device.
+	 *
+	 * All other return value trigger warning and are transformed to -EIO.
+	 */
+	int (*rmem_split_adjust)(struct hmm_rmem *rmem,
+				 unsigned long fuid,
+				 unsigned long luid);
+
+	/* rmem_destroy - destroy rmem.
+	 *
+	 * @rmem:   The remote memory to destroy.
+	 *
+	 * Destroy the remote memory structure once all references are gone.
+	 */
+	void (*rmem_destroy)(struct hmm_rmem *rmem);
 };
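
A hedged sketch of what a driver-side lmem_to_rmem implementation could look
like under the contract above; struct my_rmem, my_fence_alloc() and
my_dma_copy_to_device() are hypothetical driver helpers, not part of this
patchset:

/* Illustration only: queue dma copies of the local pages to device memory,
 * set HMM_PFN_RMEM_UPTODATE for every page queued, and hand a fence back to
 * hmm which will wait on it before considering the copy done.
 */
static struct hmm_fence *my_lmem_to_rmem(struct hmm_rmem *rmem,
					 unsigned long fuid,
					 unsigned long luid)
{
	struct my_rmem *mrmem = container_of(rmem, struct my_rmem, rmem);
	struct hmm_fence *fence;
	unsigned long i;

	fence = my_fence_alloc(mrmem);	/* hypothetical driver helper */
	if (!fence)
		return ERR_PTR(-ENOMEM);

	for (i = fuid; i < luid; ++i) {
		unsigned long idx = i - rmem->fuid;

		if (my_dma_copy_to_device(mrmem, fence, idx))
			return ERR_PTR(-EIO);
		set_bit(HMM_PFN_RMEM_UPTODATE, &rmem->pfns[idx]);
	}
	return fence;
}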
 
 /* struct hmm_device - per device hmm structure
@@ -292,6 +582,7 @@ struct hmm_device_ops {
  * @mutex:      Mutex protecting mirrors list.
  * @ops:        The hmm operations callback.
  * @name:       Device name (uniquely identify the device on the system).
+ * @wait_queue: Wait queue for remote memory operations.
  *
  * Each device that want to mirror an address space must register one of this
  * struct (only once).
@@ -302,6 +593,8 @@ struct hmm_device {
 	struct mutex			mutex;
 	const struct hmm_device_ops	*ops;
 	const char			*name;
+	wait_queue_head_t		wait_queue;
+	bool				rmem;
 };
 
 /* hmm_device_register() - register a device with hmm.
@@ -322,6 +615,88 @@ struct hmm_device *hmm_device_unref(struct hmm_device *device);
 
 
 
+/* hmm_rmem - The rmem struct hold hmm infos of a remote memory block.
+ *
+ * The device driver should derive its remote memory tracking structure from
+ * the hmm_rmem structure. The hmm_rmem structure does not hold any information
+ * about the specifics of the remote memory block (device address or anything
+ * else). It solely stores the information needed to find the rmem when the cpu
+ * tries to access it.
+ */
+
+/* struct hmm_rmem - remote memory block
+ *
+ * @kref:           Reference count.
+ * @device:         The hmm device the remote memory is allocated on.
+ * @event:          The event currently associated with the rmem.
+ * @lock:           Lock protecting the ranges list and event field.
+ * @ranges:         The list of address ranges that point to this rmem.
+ * @node:           Node for rmem unique id tree.
+ * @pgoff:          Page offset into file (in PAGE_SIZE not PAGE_CACHE_SIZE).
+ * @fuid:           First unique id associated with this specific hmm_rmem.
+ * @luid:           Last unique id associated with this specific hmm_rmem.
+ * @subtree_luid:   Optimization for the red-black interval tree.
+ * @pfns:           Array of pfn for local memory when some is attached.
+ * @dead:           The remote memory is no longer valid, restart lookup.
+ *
+ * Each hmm_rmem has a unique range of ids used to uniquely identify the remote
+ * memory on the cpu side. Those unique ids do not relate in any way to the
+ * device physical address at which the remote memory is located.
+ */
+struct hmm_rmem {
+	struct kref		kref;
+	struct hmm_device	*device;
+	struct hmm_event	*event;
+	spinlock_t		lock;
+	struct list_head	ranges;
+	struct rb_node		node;
+	unsigned long		pgoff;
+	unsigned long		fuid;
+	unsigned long		luid;
+	unsigned long		subtree_luid;
+	unsigned long		*pfns;
+	bool			dead;
+};
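
As a hedged illustration of the "derive from hmm_rmem" guideline above, a
device driver would typically embed the structure and recover it with
container_of() in its callbacks; struct my_rmem, its fields and
my_free_device_memory() are hypothetical:

struct my_rmem {
	struct hmm_rmem		rmem;		/* embedded, not a pointer */
	u64			dev_addr;	/* device address of the block */
	struct list_head	lru;		/* driver private eviction list */
};

static void my_rmem_destroy(struct hmm_rmem *rmem)
{
	struct my_rmem *mrmem = container_of(rmem, struct my_rmem, rmem);

	/* Called once the last reference on the rmem is gone. */
	my_free_device_memory(mrmem->dev_addr);	/* hypothetical helper */
	kfree(mrmem);
}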
+
+struct hmm_rmem *hmm_rmem_ref(struct hmm_rmem *rmem);
+struct hmm_rmem *hmm_rmem_unref(struct hmm_rmem *rmem);
+
+/* hmm_rmem_split_new - helper to split rmem.
+ *
+ * @rmem:   The remote memory to split.
+ * @new:    The new remote memory struct.
+ * Returns: 0 on success, error value otherwise.
+ *
+ * The new remote memory struct must be allocated by the device driver and its
+ * fuid and luid fields must be set to the range the device wishes the new rmem
+ * to cover.
+ *
+ * Moreover all below conditions must be true :
+ *   (new->fuid < new->luid)
+ *   (new->fuid >= rmem->fuid && new->luid <= rmem->luid)
+ *   (new->fuid == rmem->fuid || new->luid == rmem->luid)
+ *
+ * This hmm helper function will split range and perform internal hmm update on
+ * behalf of the device driver.
+ *
+ * Note that this function must be called from the rmem_split and
+ * rmem_split_adjust callbacks.
+ *
+ * Once this function is called the device driver should not try to free the
+ * new rmem structure no matter what the return value is. Moreover if the
+ * function returns 0 then the device driver should properly update the new
+ * rmem struct.
+ *
+ * Return error if the operation failed. Valid values :
+ * -EINVAL If one of the above conditions is false.
+ * -ENOMEM If it failed to allocate memory.
+ * 0 on success.
+ */
+int hmm_rmem_split_new(struct hmm_rmem *rmem,
+		       struct hmm_rmem *new);
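
A minimal sketch of an rmem_split_adjust callback following the three steps
documented for hmm_rmem_split_new(); struct my_rmem and my_rmem_transfer()
are hypothetical, and the choice of which side becomes the new rmem is
simplified here:

static int my_rmem_split_adjust(struct hmm_rmem *rmem,
				unsigned long fuid,
				unsigned long luid)
{
	struct my_rmem *mnew;
	int ret;

	/* Step 1: allocate the new driver side rmem struct. */
	mnew = kzalloc(sizeof(*mnew), GFP_KERNEL);
	if (!mnew)
		return -ENOMEM;

	/* The new rmem covers the range being split away. */
	mnew->rmem.fuid = fuid;
	mnew->rmem.luid = luid;

	/* Step 2: let hmm update its own tracking. On failure the new rmem
	 * is released through the rmem_destroy callback, do not free it here.
	 */
	ret = hmm_rmem_split_new(rmem, &mnew->rmem);
	if (ret)
		return ret;

	/* Step 3: transfer driver resources backing [fuid, luid) to mnew. */
	my_rmem_transfer(container_of(rmem, struct my_rmem, rmem), mnew);
	return 0;
}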
+
+
+
+
 /* hmm_mirror - device specific mirroring functions.
  *
  * Each device that mirror a process has a uniq hmm_mirror struct associating
@@ -406,6 +781,7 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
  */
 struct hmm_fault {
 	struct vm_area_struct	*vma;
+	struct hmm_rmem		*rmem;
 	unsigned long		faddr;
 	unsigned long		laddr;
 	unsigned long		*pfns;
@@ -450,6 +826,56 @@ struct hmm_mirror *hmm_mirror_unref(struct hmm_mirror *mirror);
 
 
 
+/* hmm_migrate - Memory migration from local memory to remote memory.
+ *
+ * Below are functions that handle migration from local memory to remote memory
+ * (represented by the hmm_rmem struct). This is a multi-step process: first
+ * the range is unmapped, then the device driver, depending on the size of the
+ * unmapped range, can decide to proceed with or abort the migration.
+ */
+
+/* hmm_migrate_rmem_to_lmem() - force migration of some rmem to lmem.
+ *
+ * @mirror: The mirror that links the process address space with the device.
+ * @faddr:  First address of the range to migrate to lmem.
+ * @laddr:  Last address of the range to migrate to lmem.
+ * Returns: 0 on success, -EIO or -EINVAL.
+ *
+ * This migrates any remote memory behind a range of addresses to local memory.
+ *
+ * Returns:
+ * 0 success.
+ * -EINVAL if invalid argument.
+ * -EIO if one of the device driver returned this error.
+ */
+int hmm_migrate_rmem_to_lmem(struct hmm_mirror *mirror,
+			     unsigned long faddr,
+			     unsigned long laddr);
+
+/* hmm_migrate_lmem_to_rmem() - call to migrate lmem to rmem.
+ *
+ * @fault:      The temporary fault struct describing the range to migrate.
+ * @mirror:     The mirror that links the process address space with the device.
+ * Returns:     0, -EINVAL, -ENOMEM, -EFAULT, -EACCES, -ENODEV, -EBUSY, -EIO.
+ *
+ * On success the fault struct is updated with the range that was migrated.
+ *
+ * Returns:
+ * 0 success.
+ * -EINVAL if invalid argument.
+ * -ENOMEM if failing to allocate memory.
+ * -EFAULT if the range of addresses is invalid (no vma backing any of the range).
+ * -EACCES if the vma backing the range is a special vma.
+ * -ENODEV if the mirror is in the process of being destroyed.
+ * -EBUSY if the range can not be migrated (many different reasons).
+ * -EIO if one of the device drivers returned this error.
+ */
+int hmm_migrate_lmem_to_rmem(struct hmm_fault *fault,
+			     struct hmm_mirror *mirror);
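
A hedged usage sketch of the migration entry point above, as a driver might
call it from its device fault handler; the pfns sizing and the assumption
that only rmem, faddr and laddr need to be pre-filled are mine, not the
patchset's:

static int my_migrate_range(struct hmm_mirror *mirror,
			    struct hmm_rmem *rmem,
			    unsigned long faddr,
			    unsigned long laddr)
{
	struct hmm_fault fault = {
		.rmem	= rmem,
		.faddr	= faddr & PAGE_MASK,
		.laddr	= PAGE_ALIGN(laddr),
	};
	int ret;

	/* One pfn slot per page of the candidate range. */
	fault.pfns = kcalloc((fault.laddr - fault.faddr) >> PAGE_SHIFT,
			     sizeof(*fault.pfns), GFP_KERNEL);
	if (!fault.pfns)
		return -ENOMEM;

	ret = hmm_migrate_lmem_to_rmem(&fault, mirror);
	/* On success fault.faddr/fault.laddr describe what was actually
	 * migrated, which may be smaller than what was asked for.
	 */
	kfree(fault.pfns);
	return ret;
}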
+
+
+
+
 /* Functions used by core mm code. Device driver should not use any of them. */
 void __hmm_destroy(struct mm_struct *mm);
 static inline void hmm_destroy(struct mm_struct *mm)
@@ -459,12 +885,51 @@ static inline void hmm_destroy(struct mm_struct *mm)
 	}
 }
 
+/* hmm_mm_fault() - call when cpu pagefault on special hmm pte entry.
+ *
+ * @mm:             The mm of the thread triggering the fault.
+ * @vma:            The vma in which the fault happens.
+ * @addr:           The address of the fault.
+ * @pte:            Pointer to the pte entry inside the cpu page table.
+ * @pmd:            Pointer to the pmd entry into which the pte is.
+ * @fault_flags:    Fault flags (read, write, ...).
+ * @orig_pte:       The original pte value when this fault happened.
+ *
+ * When the cpu tries to access a range of memory that is in remote memory, it
+ * faults on the hmm special swap pte, which ends up calling this function,
+ * which should trigger the appropriate memory migration.
+ *
+ * Returns:
+ *   0 if someone else already migrated the rmem back.
+ *   VM_FAULT_SIGBUS on any i/o error during migration.
+ *   VM_FAULT_OOM if it fails to allocate memory for migration.
+ *   VM_FAULT_MAJOR on successful migration.
+ */
+int hmm_mm_fault(struct mm_struct *mm,
+		 struct vm_area_struct *vma,
+		 unsigned long addr,
+		 pte_t *pte,
+		 pmd_t *pmd,
+		 unsigned int fault_flags,
+		 pte_t orig_pte);
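
For context, a hedged sketch of how the core fault path is expected to
dispatch to hmm_mm_fault() when it hits the hmm special swap entry; the real
hook lives in the mm/memory.c part of this patch and may differ:

static int do_hmm_entry(struct mm_struct *mm, struct vm_area_struct *vma,
			unsigned long addr, pte_t *pte, pmd_t *pmd,
			unsigned int flags, pte_t orig_pte)
{
	swp_entry_t entry = pte_to_swp_entry(orig_pte);

	if (!is_hmm_entry(entry))
		return VM_FAULT_SIGBUS;

	/* Migrates the remote memory back and remaps it before returning. */
	return hmm_mm_fault(mm, vma, addr, pte, pmd, flags, orig_pte);
}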
+
 #else /* !CONFIG_HMM */
 
 static inline void hmm_destroy(struct mm_struct *mm)
 {
 }
 
+static inline int hmm_mm_fault(struct mm_struct *mm,
+			       struct vm_area_struct *vma,
+			       unsigned long addr,
+			       pte_t *pte,
+			       pmd_t *pmd,
+			       unsigned int fault_flags,
+			       pte_t orig_pte)
+{
+	return VM_FAULT_SIGBUS;
+}
+
 #endif /* !CONFIG_HMM */
 
 #endif
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 0794a73b..bb2c23f 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -42,6 +42,7 @@ enum mmu_action {
 	MMU_FAULT_WP,
 	MMU_THP_SPLIT,
 	MMU_THP_FAULT_WP,
+	MMU_HMM,
 };
 
 #ifdef CONFIG_MMU_NOTIFIER
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 5a14b92..0739b32 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -70,8 +70,18 @@ static inline int current_is_kswapd(void)
 #define SWP_HWPOISON_NUM 0
 #endif
 
+/*
+ * HMM (heterogeneous memory management) used when data is in remote memory.
+ */
+#ifdef CONFIG_HMM
+#define SWP_HMM_NUM 1
+#define SWP_HMM			(MAX_SWAPFILES + SWP_MIGRATION_NUM + SWP_HWPOISON_NUM)
+#else
+#define SWP_HMM_NUM 0
+#endif
+
 #define MAX_SWAPFILES \
-	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
+	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM - SWP_HMM_NUM)
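
A hedged worked example of the swap-type arithmetic above, assuming the usual
MAX_SWAPFILES_SHIFT of 5, two migration entry types and one hwpoison type:

    1 << MAX_SWAPFILES_SHIFT                                       = 32 swap types
    MAX_SWAPFILES = 32 - 2 (migration) - 1 (hwpoison) - 1 (hmm)    = 28
    SWP_HMM = MAX_SWAPFILES + SWP_MIGRATION_NUM + SWP_HWPOISON_NUM = 31

so the hmm entries take the topmost type and non_swap_entry(), which tests
swp_type() >= MAX_SWAPFILES, keeps treating them as special entries.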
 
 /*
  * Magic header for a swap area. The first part of the union is
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 6adfb7b..9a490d3 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -188,7 +188,38 @@ static inline int is_hwpoison_entry(swp_entry_t swp)
 }
 #endif
 
-#if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION)
+#ifdef CONFIG_HMM
+
+static inline swp_entry_t make_hmm_entry(unsigned long pgoff)
+{
+	/* We don't need to keep the page pfn, the offset stores the rmem
+	 * unique id (uid) of the page instead.
+	 */
+	return swp_entry(SWP_HMM, pgoff);
+}
+
+static inline unsigned long hmm_entry_uid(swp_entry_t entry)
+{
+	return swp_offset(entry);
+}
+
+static inline int is_hmm_entry(swp_entry_t entry)
+{
+	return unlikely(swp_type(entry) == SWP_HMM);
+}
+#else /* !CONFIG_HMM */
+#define make_hmm_entry(pgoff) swp_entry(0, 0)
+static inline int is_hmm_entry(swp_entry_t swp)
+{
+	return 0;
+}
+
+static inline void make_hmm_entry_read(swp_entry_t *entry)
+{
+}
+#endif /* !CONFIG_HMM */
+
+#if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION) || defined(CONFIG_HMM)
 static inline int non_swap_entry(swp_entry_t entry)
 {
 	return swp_type(entry) >= MAX_SWAPFILES;
diff --git a/mm/hmm.c b/mm/hmm.c
index 2b8986c..599d4f6 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -77,6 +77,9 @@
 /* global SRCU for all MMs */
 static struct srcu_struct srcu;
 
+static DEFINE_SPINLOCK(_hmm_rmems_lock);
+static struct rb_root _hmm_rmems = RB_ROOT;
+
 
 
 
@@ -94,6 +97,7 @@ struct hmm_event {
 	unsigned long		faddr;
 	unsigned long		laddr;
 	struct list_head	fences;
+	struct list_head	ranges;
 	enum hmm_etype		etype;
 	bool			backoff;
 };
@@ -106,6 +110,7 @@ struct hmm_event {
  * @mirrors:        List of all mirror for this mm (one per device)
  * @mmu_notifier:   The mmu_notifier of this mm
  * @wait_queue:     Wait queue for synchronization btw cpu and device
+ * @ranges:         Tree of rmem ranges (sorted by address).
  * @events:         Events.
  * @nevents:        Number of events currently happening.
  * @dead:           The mm is being destroy.
@@ -122,6 +127,7 @@ struct hmm {
 	struct list_head	pending;
 	struct mmu_notifier	mmu_notifier;
 	wait_queue_head_t	wait_queue;
+	struct rb_root		ranges;
 	struct hmm_event	events[HMM_MAX_EVENTS];
 	int			nevents;
 	bool			dead;
@@ -132,137 +138,1456 @@ static struct mmu_notifier_ops hmm_notifier_ops;
 static inline struct hmm *hmm_ref(struct hmm *hmm);
 static inline struct hmm *hmm_unref(struct hmm *hmm);
 
-static int hmm_mirror_update(struct hmm_mirror *mirror,
-			     struct vm_area_struct *vma,
-			     unsigned long faddr,
-			     unsigned long laddr,
-			     struct hmm_event *event);
-static void hmm_mirror_cleanup(struct hmm_mirror *mirror);
+static void hmm_rmem_clear_range(struct hmm_rmem *rmem,
+				 struct vm_area_struct *vma,
+				 unsigned long faddr,
+				 unsigned long laddr,
+				 unsigned long fuid);
+static void hmm_rmem_poison_range(struct hmm_rmem *rmem,
+				  struct mm_struct *mm,
+				  struct vm_area_struct *vma,
+				  unsigned long faddr,
+				  unsigned long laddr,
+				  unsigned long fuid);
+
+static int hmm_mirror_rmem_update(struct hmm_mirror *mirror,
+				  struct hmm_rmem *rmem,
+				  unsigned long faddr,
+				  unsigned long laddr,
+				  unsigned long fuid,
+				  struct hmm_event *event,
+				  bool dirty);
+static int hmm_mirror_update(struct hmm_mirror *mirror,
+			     struct vm_area_struct *vma,
+			     unsigned long faddr,
+			     unsigned long laddr,
+			     struct hmm_event *event);
+static void hmm_mirror_cleanup(struct hmm_mirror *mirror);
+
+static int hmm_device_fence_wait(struct hmm_device *device,
+				 struct hmm_fence *fence);
+
+
+
+
+/* hmm_event - used to synchronize various mm events with each other.
+ *
+ * During the lifetime of a process various mm events will happen, hmm
+ * serializes events that affect overlapping ranges of addresses. The
+ * hmm_event structs are used for that purpose.
+ */
+
+static inline bool hmm_event_overlap(struct hmm_event *a, struct hmm_event *b)
+{
+	return !((a->laddr <= b->faddr) || (a->faddr >= b->laddr));
+}
+
+static inline unsigned long hmm_event_size(struct hmm_event *event)
+{
+	return (event->laddr - event->faddr);
+}
+
+
+
+
+/* hmm_fault_mm - used for reading the cpu page table on device fault.
+ *
+ * This code deals with reading the cpu page table to find the pages that are
+ * backing a range of addresses. It is used as a helper by the device page
+ * fault code.
+ */
+
+/* struct hmm_fault_mm - used for reading cpu page table on device fault.
+ *
+ * @mm:     The mm of the process the device fault is happening in.
+ * @vma:    The vma in which the fault is happening.
+ * @faddr:  The first address for the range the device want to fault.
+ * @laddr:  The last address for the range the device want to fault.
+ * @pfns:   Array of hmm pfns (contains the result of the fault).
+ * @write:  Is this write fault.
+ */
+struct hmm_fault_mm {
+	struct mm_struct	*mm;
+	struct vm_area_struct	*vma;
+	unsigned long		faddr;
+	unsigned long		laddr;
+	unsigned long		*pfns;
+	bool			write;
+};
+
+static int hmm_fault_mm_fault_pmd(pmd_t *pmdp,
+				  unsigned long faddr,
+				  unsigned long laddr,
+				  struct mm_walk *walk)
+{
+	struct hmm_fault_mm *fault_mm = walk->private;
+	unsigned long idx, *pfns;
+	pte_t *ptep;
+
+	idx = (faddr - fault_mm->faddr) >> PAGE_SHIFT;
+	pfns = &fault_mm->pfns[idx];
+	memset(pfns, 0, ((laddr - faddr) >> PAGE_SHIFT) * sizeof(long));
+	if (pmd_none(*pmdp)) {
+		return -ENOENT;
+	}
+
+	if (pmd_trans_huge(*pmdp)) {
+		/* FIXME */
+		return -EINVAL;
+	}
+
+	if (pmd_none_or_trans_huge_or_clear_bad(pmdp)) {
+		return -EINVAL;
+	}
+
+	ptep = pte_offset_map(pmdp, faddr);
+	for (; faddr != laddr; ++ptep, ++pfns, faddr += PAGE_SIZE) {
+		pte_t pte = *ptep;
+
+		if (pte_none(pte)) {
+			if (fault_mm->write) {
+				ptep++;
+				break;
+			}
+			*pfns = my_zero_pfn(faddr) << HMM_PFN_SHIFT;
+			set_bit(HMM_PFN_VALID_ZERO, pfns);
+			continue;
+		}
+		if (!pte_present(pte) || (fault_mm->write && !pte_write(pte))) {
+			/* Need to inc ptep so unmap unlocks on the right pmd. */
+			ptep++;
+			break;
+		}
+
+		*pfns = pte_pfn(pte) << HMM_PFN_SHIFT;
+		set_bit(HMM_PFN_VALID_PAGE, pfns);
+		if (pte_write(pte)) {
+			set_bit(HMM_PFN_WRITE, pfns);
+		}
+		/* Consider the page as hot since a device wants to use it. */
+		mark_page_accessed(pfn_to_page(pte_pfn(pte)));
+		fault_mm->laddr = faddr + PAGE_SIZE;
+	}
+	pte_unmap(ptep - 1);
+
+	return (faddr == laddr) ? 0 : -ENOENT;
+}
+
+static int hmm_fault_mm_fault(struct hmm_fault_mm *fault_mm)
+{
+	struct mm_walk walk = {0};
+	unsigned long faddr, laddr;
+	int ret;
+
+	faddr = fault_mm->faddr;
+	laddr = fault_mm->laddr;
+	fault_mm->laddr = faddr;
+
+	walk.pmd_entry = hmm_fault_mm_fault_pmd;
+	walk.mm = fault_mm->mm;
+	walk.private = fault_mm;
+
+	ret = walk_page_range(faddr, laddr, &walk);
+	return ret;
+}
+
+
+
+
+/* hmm_range - address range backed by remote memory.
+ *
+ * Each address range backed by remote memory is tracked so that on a cpu page
+ * fault for a given address we can find the corresponding remote memory. We
+ * use a structure separate from the remote memory as several different address
+ * ranges can point to the same remote memory (in case of shared mapping).
+ */
+
+/* struct hmm_range - address range backed by remote memory.
+ *
+ * @kref:           Reference count.
+ * @rmem:           Remote memory that back this address range.
+ * @mirror:         Mirror with which this range is associated.
+ * @fuid:           First unique id of rmem for this range.
+ * @faddr:          First address (inclusive) of the range.
+ * @laddr:          Last address (exclusive) of the range.
+ * @subtree_laddr:  Optimization for red black interval tree.
+ * @rlist:          List of all range associated with same rmem.
+ * @elist:          List of all range associated with an event.
+ */
+struct hmm_range {
+	struct kref		kref;
+	struct hmm_rmem		*rmem;
+	struct hmm_mirror	*mirror;
+	unsigned long		fuid;
+	unsigned long		faddr;
+	unsigned long		laddr;
+	unsigned long		subtree_laddr;
+	struct rb_node		node;
+	struct list_head	rlist;
+	struct list_head	elist;
+};
+
+static inline unsigned long hmm_range_faddr(struct hmm_range *range)
+{
+	return range->faddr;
+}
+
+static inline unsigned long hmm_range_laddr(struct hmm_range *range)
+{
+	return range->laddr - 1UL;
+}
+
+INTERVAL_TREE_DEFINE(struct hmm_range,
+		     node,
+		     unsigned long,
+		     subtree_laddr,
+		     hmm_range_faddr,
+		     hmm_range_laddr,,
+		     hmm_range_tree)
+
+static inline unsigned long hmm_range_npages(struct hmm_range *range)
+{
+	return (range->laddr - range->faddr) >> PAGE_SHIFT;
+}
+
+static inline unsigned long hmm_range_fuid(struct hmm_range *range)
+{
+	return range->fuid;
+}
+
+static inline unsigned long hmm_range_luid(struct hmm_range *range)
+{
+	return range->fuid + hmm_range_npages(range);
+}
+
+static void hmm_range_destroy(struct kref *kref)
+{
+	struct hmm_range *range;
+
+	range = container_of(kref, struct hmm_range, kref);
+	BUG_ON(!list_empty(&range->elist));
+	BUG_ON(!list_empty(&range->rlist));
+	BUG_ON(!RB_EMPTY_NODE(&range->node));
+
+	range->rmem = hmm_rmem_unref(range->rmem);
+	range->mirror = hmm_mirror_unref(range->mirror);
+	kfree(range);
+}
+
+static struct hmm_range *hmm_range_unref(struct hmm_range *range)
+{
+	if (range) {
+		kref_put(&range->kref, hmm_range_destroy);
+	}
+	return NULL;
+}
+
+static void hmm_range_init(struct hmm_range *range,
+			   struct hmm_mirror *mirror,
+			   struct hmm_rmem *rmem,
+			   unsigned long faddr,
+			   unsigned long laddr,
+			   unsigned long fuid)
+{
+	kref_init(&range->kref);
+	range->mirror = hmm_mirror_ref(mirror);
+	range->rmem = hmm_rmem_ref(rmem);
+	range->fuid = fuid;
+	range->faddr = faddr;
+	range->laddr = laddr;
+	RB_CLEAR_NODE(&range->node);
+
+	spin_lock(&rmem->lock);
+	list_add_tail(&range->rlist, &rmem->ranges);
+	if (rmem->event) {
+		list_add_tail(&range->elist, &rmem->event->ranges);
+	}
+	spin_unlock(&rmem->lock);
+}
+
+static void hmm_range_insert(struct hmm_range *range)
+{
+	struct hmm_mirror *mirror = range->mirror;
+
+	spin_lock(&mirror->hmm->lock);
+	if (RB_EMPTY_NODE(&range->node)) {
+		hmm_range_tree_insert(range, &mirror->hmm->ranges);
+	}
+	spin_unlock(&mirror->hmm->lock);
+}
+
+static inline void hmm_range_adjust_locked(struct hmm_range *range,
+					   unsigned long faddr,
+					   unsigned long laddr)
+{
+	if (!RB_EMPTY_NODE(&range->node)) {
+		hmm_range_tree_remove(range, &range->mirror->hmm->ranges);
+	}
+	if (faddr < range->faddr) {
+		range->fuid -= ((range->faddr - faddr) >> PAGE_SHIFT);
+	} else {
+		range->fuid += ((faddr - range->faddr) >> PAGE_SHIFT);
+	}
+	range->faddr = faddr;
+	range->laddr = laddr;
+	hmm_range_tree_insert(range, &range->mirror->hmm->ranges);
+}
+
+static int hmm_range_split(struct hmm_range *range,
+			   unsigned long saddr)
+{
+	struct hmm_mirror *mirror = range->mirror;
+	struct hmm_range *new;
+
+	if (range->faddr >= saddr) {
+		BUG();
+		return -EINVAL;
+	}
+
+	new = kmalloc(sizeof(struct hmm_range), GFP_KERNEL);
+	if (new == NULL) {
+		return -ENOMEM;
+	}
+
+	hmm_range_init(new,mirror,range->rmem,range->faddr,saddr,range->fuid);
+	spin_lock(&mirror->hmm->lock);
+	hmm_range_adjust_locked(range, saddr, range->laddr);
+	hmm_range_tree_insert(new, &mirror->hmm->ranges);
+	spin_unlock(&mirror->hmm->lock);
+	return 0;
+}
+
+static void hmm_range_fini(struct hmm_range *range)
+{
+	struct hmm_rmem *rmem = range->rmem;
+	struct hmm *hmm = range->mirror->hmm;
+
+	spin_lock(&hmm->lock);
+	if (!RB_EMPTY_NODE(&range->node)) {
+		hmm_range_tree_remove(range, &hmm->ranges);
+		RB_CLEAR_NODE(&range->node);
+	}
+	spin_unlock(&hmm->lock);
+
+	spin_lock(&rmem->lock);
+	list_del_init(&range->elist);
+	list_del_init(&range->rlist);
+	spin_unlock(&rmem->lock);
+
+	hmm_range_unref(range);
+}
+
+static void hmm_range_fini_clear(struct hmm_range *range,
+				 struct vm_area_struct *vma)
+{
+	hmm_rmem_clear_range(range->rmem, vma, range->faddr,
+			     range->laddr, range->fuid);
+	hmm_range_fini(range);
+}
+
+static inline bool hmm_range_reserve(struct hmm_range *range,
+				     struct hmm_event *event)
+{
+	bool reserved = false;
+
+	spin_lock(&range->rmem->lock);
+	if (range->rmem->event == NULL || range->rmem->event == event) {
+		range->rmem->event = event;
+		list_add_tail(&range->elist, &range->rmem->event->ranges);
+		reserved = true;
+	}
+	spin_unlock(&range->rmem->lock);
+	return reserved;
+}
+
+static inline void hmm_range_release(struct hmm_range *range,
+				     struct hmm_event *event)
+{
+	struct hmm_device *device = NULL;
+	spin_lock(&range->rmem->lock);
+	if (range->rmem->event != event) {
+		spin_unlock(&range->rmem->lock);
+		WARN_ONCE(1,"hmm: trying to release range from wrong event.\n");
+		return;
+	}
+	list_del_init(&range->elist);
+	if (list_empty(&range->rmem->event->ranges)) {
+		range->rmem->event = NULL;
+		device = range->rmem->device;
+	}
+	spin_unlock(&range->rmem->lock);
+
+	if (device) {
+		wake_up(&device->wait_queue);
+	}
+}
+
+
+
+
+/* hmm_rmem - The remote memory.
+ *
+ * Below are functions that deals with remote memory.
+ */
+
+/* struct hmm_rmem_mm - used during memory migration from/to rmem.
+ *
+ * @vma:            The vma that cover the range.
+ * @rmem:           The remote memory object.
+ * @faddr:          The first address in the range.
+ * @laddr:          The last address in the range.
+ * @fuid:           The first uid for the range.
+ * @remap_pages:    List of pages to remap.
+ * @tlb:            For gathering cpu tlb flushes.
+ * @force_flush:    Force cpu tlb flush.
+ */
+struct hmm_rmem_mm {
+	struct vm_area_struct	*vma;
+	struct hmm_rmem		*rmem;
+	unsigned long		faddr;
+	unsigned long		laddr;
+	unsigned long		fuid;
+	struct list_head	remap_pages;
+	struct mmu_gather	tlb;
+	int			force_flush;
+};
+
+/* Interval tree for the hmm_rmem object. Providing the following functions :
+ * hmm_rmem_tree_insert(struct hmm_rmem *, struct rb_root *)
+ * hmm_rmem_tree_remove(struct hmm_rmem *, struct rb_root *)
+ * hmm_rmem_tree_iter_first(struct rb_root *, fpgoff, lpgoff)
+ * hmm_rmem_tree_iter_next(struct hmm_rmem *, fpgoff, lpgoff)
+ */
+static inline unsigned long hmm_rmem_fuid(struct hmm_rmem *rmem)
+{
+	return rmem->fuid;
+}
+
+static inline unsigned long hmm_rmem_luid(struct hmm_rmem *rmem)
+{
+	return rmem->luid - 1UL;
+}
+
+INTERVAL_TREE_DEFINE(struct hmm_rmem,
+		     node,
+		     unsigned long,
+		     subtree_luid,
+		     hmm_rmem_fuid,
+		     hmm_rmem_luid,,
+		     hmm_rmem_tree)
+
+static inline unsigned long hmm_rmem_npages(struct hmm_rmem *rmem)
+{
+	return (rmem->luid - rmem->fuid);
+}
+
+static inline unsigned long hmm_rmem_size(struct hmm_rmem *rmem)
+{
+	return hmm_rmem_npages(rmem) << PAGE_SHIFT;
+}
+
+static void hmm_rmem_free(struct hmm_rmem *rmem)
+{
+	unsigned long i;
+
+	for (i = 0; i < hmm_rmem_npages(rmem); ++i) {
+		struct page *page;
+
+		page = hmm_pfn_to_page(rmem->pfns[i]);
+		if (!page || test_bit(HMM_PFN_VALID_ZERO, &rmem->pfns[i])) {
+			continue;
+		}
+		/* Fake mapping so that page_remove_rmap behaves as we want. */
+		VM_BUG_ON(page_mapcount(page));
+		atomic_set(&page->_mapcount, 0);
+		page_remove_rmap(page);
+		page_cache_release(page);
+		rmem->pfns[i] = 0;
+	}
+	kfree(rmem->pfns);
+	rmem->pfns = NULL;
+
+	spin_lock(&_hmm_rmems_lock);
+	if (!RB_EMPTY_NODE(&rmem->node)) {
+		hmm_rmem_tree_remove(rmem, &_hmm_rmems);
+		RB_CLEAR_NODE(&rmem->node);
+	}
+	spin_unlock(&_hmm_rmems_lock);
+}
+
+static void hmm_rmem_destroy(struct kref *kref)
+{
+	struct hmm_device *device;
+	struct hmm_rmem *rmem;
+
+	rmem = container_of(kref, struct hmm_rmem, kref);
+	device = rmem->device;
+	BUG_ON(!list_empty(&rmem->ranges));
+	hmm_rmem_free(rmem);
+	device->ops->rmem_destroy(rmem);
+}
+
+struct hmm_rmem *hmm_rmem_ref(struct hmm_rmem *rmem)
+{
+	if (rmem) {
+		kref_get(&rmem->kref);
+		return rmem;
+	}
+	return NULL;
+}
+EXPORT_SYMBOL(hmm_rmem_ref);
+
+struct hmm_rmem *hmm_rmem_unref(struct hmm_rmem *rmem)
+{
+	if (rmem) {
+		kref_put(&rmem->kref, hmm_rmem_destroy);
+	}
+	return NULL;
+}
+EXPORT_SYMBOL(hmm_rmem_unref);
+
+static void hmm_rmem_init(struct hmm_rmem *rmem,
+			  struct hmm_device *device)
+{
+	kref_init(&rmem->kref);
+	rmem->device = device;
+	rmem->fuid = 0;
+	rmem->luid = 0;
+	rmem->pfns = NULL;
+	rmem->dead = false;
+	INIT_LIST_HEAD(&rmem->ranges);
+	spin_lock_init(&rmem->lock);
+	RB_CLEAR_NODE(&rmem->node);
+}
+
+static int hmm_rmem_alloc(struct hmm_rmem *rmem, unsigned long npages)
+{
+	rmem->pfns = kzalloc(sizeof(long) * npages, GFP_KERNEL);
+	if (rmem->pfns == NULL) {
+		return -ENOMEM;
+	}
+
+	spin_lock(&_hmm_rmems_lock);
+	if (_hmm_rmems.rb_node == NULL) {
+		rmem->fuid = 1;
+		rmem->luid = 1 + npages;
+	} else {
+		struct hmm_rmem *head;
+
+		head = container_of(_hmm_rmems.rb_node,struct hmm_rmem,node);
+		/* The subtree_luid of root node is the current luid. */
+		rmem->fuid = head->subtree_luid;
+		rmem->luid = head->subtree_luid + npages;
+	}
+	/* The rmem uid value must fit into swap entry. FIXME can we please
+	 * have an ARCH define for the maximum swap entry value !
+	 */
+	if (rmem->luid < MM_MAX_SWAP_PAGES) {
+		hmm_rmem_tree_insert(rmem, &_hmm_rmems);
+		spin_unlock(&_hmm_rmems_lock);
+		return 0;
+	}
+	spin_unlock(&_hmm_rmems_lock);
+	rmem->fuid = 0;
+	rmem->luid = 0;
+	return -ENOSPC;
+}
+
+static struct hmm_rmem *hmm_rmem_find(unsigned long uid)
+{
+	struct hmm_rmem *rmem;
+
+	spin_lock(&_hmm_rmems_lock);
+	rmem = hmm_rmem_tree_iter_first(&_hmm_rmems, uid, uid);
+	hmm_rmem_ref(rmem);
+	spin_unlock(&_hmm_rmems_lock);
+	return rmem;
+}
+
+int hmm_rmem_split_new(struct hmm_rmem *rmem,
+		       struct hmm_rmem *new)
+{
+	struct hmm_range *range, *next;
+	unsigned long i, pgoff, npages;
+	unsigned long fuid = new->fuid, luid = new->luid;
+
+	/* hmm_rmem_init() clears fuid and luid, restore the values that the
+	 * device driver set to describe the range the new rmem must cover.
+	 */
+	hmm_rmem_init(new, rmem->device);
+	new->fuid = fuid;
+	new->luid = luid;
+
+	/* Sanity check, the new rmem is either at the beginning or at the end
+	 * of the old rmem, it can not be in the middle.
+	 */
+	if (!(new->fuid < new->luid)) {
+		hmm_rmem_unref(new);
+		return -EINVAL;
+	}
+	if (!(new->fuid >= rmem->fuid && new->luid <= rmem->luid)) {
+		hmm_rmem_unref(new);
+		return -EINVAL;
+	}
+	if (!(new->fuid == rmem->fuid || new->luid == rmem->luid)) {
+		hmm_rmem_unref(new);
+		return -EINVAL;
+	}
+
+	npages = hmm_rmem_npages(new);
+	new->pfns = kzalloc(sizeof(long) * npages, GFP_KERNEL);
+	if (new->pfns == NULL) {
+		hmm_rmem_unref(new);
+		return -ENOMEM;
+	}
+
+retry:
+	spin_lock(&rmem->lock);
+	list_for_each_entry (range, &rmem->ranges, rlist) {
+		if (hmm_range_fuid(range) < new->fuid &&
+		    hmm_range_luid(range) > new->fuid) {
+			unsigned long soff;
+			int ret;
+
+			soff = ((new->fuid - range->fuid) << PAGE_SHIFT);
+			spin_unlock(&rmem->lock);
+			ret = hmm_range_split(range, soff + range->faddr);
+			if (ret) {
+				hmm_rmem_unref(new);
+				return ret;
+			}
+			goto retry;
+		}
+		if (hmm_range_fuid(range) < new->luid &&
+		    hmm_range_luid(range) > new->luid) {
+			unsigned long soff;
+			int ret;
+
+			soff = ((new->luid - range->fuid) << PAGE_SHIFT);
+			spin_unlock(&rmem->lock);
+			ret = hmm_range_split(range, soff + range->faddr);
+			if (ret) {
+				hmm_rmem_unref(new);
+				return ret;
+			}
+			goto retry;
+		}
+	}
+	spin_unlock(&rmem->lock);
+
+	spin_lock(&_hmm_rmems_lock);
+	hmm_rmem_tree_remove(rmem, &_hmm_rmems);
+	if (new->fuid != rmem->fuid) {
+		for (i = 0, pgoff = (new->fuid-rmem->fuid); i < npages; ++i) {
+			new->pfns[i] = rmem->pfns[i + pgoff];
+		}
+		rmem->luid = new->fuid;
+	} else {
+		for (i = 0; i < npages; ++i) {
+			new->pfns[i] = rmem->pfns[i];
+		}
+		rmem->fuid = new->luid;
+		for (i = 0, pgoff = npages; i < hmm_rmem_npages(rmem); ++i) {
+			rmem->pfns[i] = rmem->pfns[i + pgoff];
+		}
+	}
+	hmm_rmem_tree_insert(rmem, &_hmm_rmems);
+	hmm_rmem_tree_insert(new, &_hmm_rmems);
+
+	/* No need to lock the new ranges list as we are holding the
+	 * rmem uid tree lock and thus no one can find out about the new
+	 * rmem yet.
+	 */
+	spin_lock(&rmem->lock);
+	list_for_each_entry_safe (range, next, &rmem->ranges, rlist) {
+		if (range->fuid >= rmem->fuid) {
+			continue;
+		}
+		list_del(&range->rlist);
+		list_add_tail(&range->rlist, &new->ranges);
+	}
+	spin_unlock(&rmem->lock);
+	spin_unlock(&_hmm_rmems_lock);
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_rmem_split_new);
+
+static int hmm_rmem_split(struct hmm_rmem *rmem,
+			  unsigned long fuid,
+			  unsigned long luid,
+			  bool adjust)
+{
+	struct hmm_device *device = rmem->device;
+	int ret;
+
+	if (fuid < rmem->fuid || luid > rmem->luid) {
+		WARN_ONCE(1, "hmm: rmem split received invalid range.\n");
+		return -EINVAL;
+	}
+
+	if (fuid == rmem->fuid && luid == rmem->luid) {
+		return 0;
+	}
+
+	if (adjust) {
+		ret = device->ops->rmem_split_adjust(rmem, fuid, luid);
+	} else {
+		ret = device->ops->rmem_split(rmem, fuid, luid);
+	}
+	return ret;
+}
+
+static void hmm_rmem_clear_range_page(struct hmm_rmem_mm *rmem_mm,
+				      unsigned long addr,
+				      pte_t *ptep,
+				      pmd_t *pmdp)
+{
+	struct vm_area_struct *vma = rmem_mm->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long uid;
+	pte_t pte;
+
+	uid = ((addr - rmem_mm->faddr) >> PAGE_SHIFT) + rmem_mm->fuid;
+	pte = ptep_get_and_clear(mm, addr, ptep);
+	if (!pte_same(pte, swp_entry_to_pte(make_hmm_entry(uid)))) {
+//		print_bad_pte(vma, addr, ptep, NULL);
+		set_pte_at(mm, addr, ptep, pte);
+	}
+}
+
+static int hmm_rmem_clear_range_pmd(pmd_t *pmdp,
+				    unsigned long addr,
+				    unsigned long next,
+				    struct mm_walk *walk)
+{
+	struct hmm_rmem_mm *rmem_mm = walk->private;
+	struct vm_area_struct *vma = rmem_mm->vma;
+	spinlock_t *ptl;
+	pte_t *ptep;
+
+	if (pmd_none(*pmdp)) {
+		return 0;
+	}
+
+	if (pmd_trans_huge(*pmdp)) {
+		/* This can not happen, we do split huge pages during unmap. */
+		BUG();
+		return 0;
+	}
+
+	if (pmd_none_or_trans_huge_or_clear_bad(pmdp)) {
+		/* FIXME I do not think this can happen at this point given
+		 * that during unmap all thp pmd were split.
+		 */
+		BUG();
+		return 0;
+	}
+
+	ptep = pte_offset_map_lock(vma->vm_mm, pmdp, addr, &ptl);
+	for (; addr != next; ++ptep, addr += PAGE_SIZE) {
+		hmm_rmem_clear_range_page(rmem_mm, addr, ptep, pmdp);
+	}
+	pte_unmap_unlock(ptep - 1, ptl);
+
+	return 0;
+}
+
+static void hmm_rmem_clear_range(struct hmm_rmem *rmem,
+				 struct vm_area_struct *vma,
+				 unsigned long faddr,
+				 unsigned long laddr,
+				 unsigned long fuid)
+{
+	struct hmm_rmem_mm rmem_mm;
+	struct mm_walk walk = {0};
+	unsigned long i, idx, npages;
+
+	rmem_mm.vma = vma;
+	rmem_mm.rmem = rmem;
+	rmem_mm.faddr = faddr;
+	rmem_mm.laddr = laddr;
+	rmem_mm.fuid = fuid;
+	walk.pmd_entry = hmm_rmem_clear_range_pmd;
+	walk.mm = vma->vm_mm;
+	walk.private = &rmem_mm;
+
+	/* No need to call the mmu notifier, the range was either unmapped or
+	 * inside video memory. In the latter case invalidation must have
+	 * happened prior to this function being called.
+	 */
+	walk_page_range(faddr, laddr, &walk);
+
+	npages = (laddr - faddr) >> PAGE_SHIFT;
+	for (i = 0, idx = fuid - rmem->fuid; i < npages; ++i, ++idx) {
+		if (current->mm == vma->vm_mm) {
+			sync_mm_rss(vma->vm_mm);
+		}
+
+		/* Properly uncharge memory. */
+		mem_cgroup_uncharge_mm(vma->vm_mm);
+		add_mm_counter(vma->vm_mm, MM_ANONPAGES, -1);
+	}
+}
+
+static void hmm_rmem_poison_range_page(struct hmm_rmem_mm *rmem_mm,
+				       struct vm_area_struct *vma,
+				       unsigned long addr,
+				       pte_t *ptep,
+				       pmd_t *pmdp)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long uid;
+	pte_t pte;
+
+	uid = ((addr - rmem_mm->faddr) >> PAGE_SHIFT) + rmem_mm->fuid;
+	pte = ptep_get_and_clear(mm, addr, ptep);
+	if (!pte_same(pte, swp_entry_to_pte(make_hmm_entry(uid)))) {
+//		print_bad_pte(vma, addr, ptep, NULL);
+		set_pte_at(mm, addr, ptep, pte);
+	} else {
+		/* The 0 fuid is special poison value. */
+		pte = swp_entry_to_pte(make_hmm_entry(0));
+		set_pte_at(mm, addr, ptep, pte);
+	}
+}
+
+static int hmm_rmem_poison_range_pmd(pmd_t *pmdp,
+				     unsigned long addr,
+				     unsigned long next,
+				     struct mm_walk *walk)
+{
+	struct hmm_rmem_mm *rmem_mm = walk->private;
+	struct vm_area_struct *vma = rmem_mm->vma;
+	spinlock_t *ptl;
+	pte_t *ptep;
+
+	if (!vma) {
+		vma = find_vma(walk->mm, addr);
+	}
+
+	if (pmd_none(*pmdp)) {
+		return 0;
+	}
+
+	if (pmd_trans_huge(*pmdp)) {
+		/* This can not happen, we do split huge pages during unmap. */
+		BUG();
+		return 0;
+	}
+
+	if (pmd_none_or_trans_huge_or_clear_bad(pmdp)) {
+		/* FIXME I do not think this can happen at this point given
+		 * that during unmap all thp pmd were split.
+		 */
+		BUG();
+		return 0;
+	}
+
+	ptep = pte_offset_map_lock(vma->vm_mm, pmdp, addr, &ptl);
+	for (; addr != next; ++ptep, addr += PAGE_SIZE) {
+		hmm_rmem_poison_range_page(rmem_mm, vma, addr, ptep, pmdp);
+	}
+	pte_unmap_unlock(ptep - 1, ptl);
+
+	return 0;
+}
+
+static void hmm_rmem_poison_range(struct hmm_rmem *rmem,
+				  struct mm_struct *mm,
+				  struct vm_area_struct *vma,
+				  unsigned long faddr,
+				  unsigned long laddr,
+				  unsigned long fuid)
+{
+	struct hmm_rmem_mm rmem_mm;
+	struct mm_walk walk = {0};
+
+	rmem_mm.vma = vma;
+	rmem_mm.rmem = rmem;
+	rmem_mm.faddr = faddr;
+	rmem_mm.laddr = laddr;
+	rmem_mm.fuid = fuid;
+	walk.pmd_entry = hmm_rmem_poison_range_pmd;
+	walk.mm = mm;
+	walk.private = &rmem_mm;
+
+	/* No need to call the mmu notifier, the range was either unmapped or
+	 * inside video memory. In the latter case invalidation must have
+	 * happened prior to this function being called.
+	 */
+	walk_page_range(faddr, laddr, &walk);
+}
+
+static int hmm_rmem_remap_page(struct hmm_rmem_mm *rmem_mm,
+			       unsigned long addr,
+			       pte_t *ptep,
+			       pmd_t *pmdp)
+{
+	struct vm_area_struct *vma = rmem_mm->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	struct hmm_rmem *rmem = rmem_mm->rmem;
+	unsigned long idx, uid;
+	struct page *page;
+	pte_t pte;
+
+	uid = rmem_mm->fuid + ((addr - rmem_mm->faddr) >> PAGE_SHIFT);
+	idx = (uid - rmem->fuid);
+	pte = ptep_get_and_clear(mm, addr, ptep);
+	if (!pte_same(pte,swp_entry_to_pte(make_hmm_entry(uid)))) {
+		set_pte_at(mm, addr, ptep, pte);
+		if (vma->vm_file) {
+			/* Just ignore it, it might mean that the shared page
+			 * backing this address was remapped right after being
+			 * added to the pagecache.
+			 */
+			return 0;
+		} else {
+//			print_bad_pte(vma, addr, ptep, NULL);
+			return -EFAULT;
+		}
+	}
+	page = hmm_pfn_to_page(rmem->pfns[idx]);
+	if (!page) {
+		/* Nothing to do. */
+		return 0;
+	}
+
+	/* The remap code must lock the page prior to remapping. */
+	BUG_ON(PageHuge(page));
+	if (test_bit(HMM_PFN_VALID_PAGE, &rmem->pfns[idx])) {
+		BUG_ON(!PageLocked(page));
+		pte = mk_pte(page, vma->vm_page_prot);
+		if (test_bit(HMM_PFN_WRITE, &rmem->pfns[idx])) {
+			pte = pte_mkwrite(pte);
+		}
+		if (test_bit(HMM_PFN_DIRTY, &rmem->pfns[idx])) {
+			pte = pte_mkdirty(pte);
+		}
+		get_page(page);
+		/* Private anonymous page. */
+		page_add_anon_rmap(page, vma, addr);
+		/* FIXME is this necessary ? I do not think so. */
+		if (!reuse_swap_page(page)) {
+			/* Page is still mapped in another process. */
+			pte = pte_wrprotect(pte);
+		}
+	} else {
+		/* Special zero page. */
+		pte = pte_mkspecial(pfn_pte(page_to_pfn(page),
+				    vma->vm_page_prot));
+	}
+	set_pte_at(mm, addr, ptep, pte);
+
+	return 0;
+}
+
+static int hmm_rmem_remap_pmd(pmd_t *pmdp,
+			      unsigned long addr,
+			      unsigned long next,
+			      struct mm_walk *walk)
+{
+	struct hmm_rmem_mm *rmem_mm = walk->private;
+	struct vm_area_struct *vma = rmem_mm->vma;
+	spinlock_t *ptl;
+	pte_t *ptep;
+	int ret = 0;
+
+	if (pmd_none(*pmdp)) {
+		return 0;
+	}
+
+	if (pmd_trans_huge(*pmdp)) {
+		/* This can not happen, we do split huge pages during unmap. */
+		BUG();
+		return -EINVAL;
+	}
+
+	if (pmd_none_or_trans_huge_or_clear_bad(pmdp)) {
+		/* No pmd here. */
+		return 0;
+	}
+
+	ptep = pte_offset_map_lock(vma->vm_mm, pmdp, addr, &ptl);
+	for (; addr != next; ++ptep, addr += PAGE_SIZE) {
+		ret = hmm_rmem_remap_page(rmem_mm, addr, ptep, pmdp);
+		if (ret) {
+			/* Increment ptep so unlock works on correct pte. */
+			ptep++;
+			break;
+		}
+	}
+	pte_unmap_unlock(ptep - 1, ptl);
+
+	return ret;
+}
+
+static int hmm_rmem_remap_anon(struct hmm_rmem *rmem,
+			       struct vm_area_struct *vma,
+			       unsigned long faddr,
+			       unsigned long laddr,
+			       unsigned long fuid)
+{
+	struct hmm_rmem_mm rmem_mm;
+	struct mm_walk walk = {0};
+	int ret;
+
+	rmem_mm.vma = vma;
+	rmem_mm.rmem = rmem;
+	rmem_mm.faddr = faddr;
+	rmem_mm.laddr = laddr;
+	rmem_mm.fuid = fuid;
+	walk.pmd_entry = hmm_rmem_remap_pmd;
+	walk.mm = vma->vm_mm;
+	walk.private = &rmem_mm;
+
+	/* No need to call the mmu notifier, the range was either unmapped or
+	 * inside video memory. In the latter case invalidation must have
+	 * happened prior to this function being called.
+	 */
+	ret = walk_page_range(faddr, laddr, &walk);
+
+	return ret;
+}
+
+static int hmm_rmem_unmap_anon_page(struct hmm_rmem_mm *rmem_mm,
+				    unsigned long addr,
+				    pte_t *ptep,
+				    pmd_t *pmdp)
+{
+	struct vm_area_struct *vma = rmem_mm->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	struct hmm_rmem *rmem = rmem_mm->rmem;
+	unsigned long idx, uid;
+	struct page *page;
+	pte_t pte;
+
+	/* New pte value. */
+	uid = ((addr - rmem_mm->faddr) >> PAGE_SHIFT) + rmem_mm->fuid;
+	idx = uid - rmem->fuid;
+	pte = ptep_get_and_clear_full(mm, addr, ptep, rmem_mm->tlb.fullmm);
+	tlb_remove_tlb_entry((&rmem_mm->tlb), ptep, addr);
+	rmem->pfns[idx] = 0;
+
+	if (pte_none(pte)) {
+		if (mem_cgroup_charge_anon(NULL, mm, GFP_KERNEL)) {
+			return -ENOMEM;
+		}
+		add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
+		/* Zero pte means nothing is there and thus nothing to copy. */
+		pte = swp_entry_to_pte(make_hmm_entry(uid));
+		set_pte_at(mm, addr, ptep, pte);
+		rmem->pfns[idx] = my_zero_pfn(addr) << HMM_PFN_SHIFT;
+		set_bit(HMM_PFN_VALID_ZERO, &rmem->pfns[idx]);
+		if (vma->vm_flags & VM_WRITE) {
+			set_bit(HMM_PFN_WRITE, &rmem->pfns[idx]);
+		}
+		set_bit(HMM_PFN_LMEM_UPTODATE, &rmem->pfns[idx]);
+		rmem_mm->laddr = addr + PAGE_SIZE;
+		return 0;
+	}
+	if (!pte_present(pte)) {
+		/* Page is not present it must be faulted, restore pte. */
+		set_pte_at(mm, addr, ptep, pte);
+		return -ENOENT;
+	}
+
+	page = pfn_to_page(pte_pfn(pte));
+	/* FIXME do we want to be able to unmap mlocked page ? */
+	if (PageMlocked(page)) {
+		set_pte_at(mm, addr, ptep, pte);
+		return -EBUSY;
+	}
+
+	rmem->pfns[idx] = pte_pfn(pte) << HMM_PFN_SHIFT;
+	if (is_zero_pfn(pte_pfn(pte))) {
+		set_bit(HMM_PFN_VALID_ZERO, &rmem->pfns[idx]);
+		set_bit(HMM_PFN_LMEM_UPTODATE, &rmem->pfns[idx]);
+	} else {
+		flush_cache_page(vma, addr, pte_pfn(pte));
+		set_bit(HMM_PFN_VALID_PAGE, &rmem->pfns[idx]);
+		set_bit(HMM_PFN_LMEM_UPTODATE, &rmem->pfns[idx]);
+		/* Anonymous private memory is always writeable. */
+		if (pte_dirty(pte)) {
+			set_bit(HMM_PFN_DIRTY, &rmem->pfns[idx]);
+		}
+		if (trylock_page(page)) {
+			set_bit(HMM_PFN_LOCK, &rmem->pfns[idx]);
+		}
+		rmem_mm->force_flush=!__tlb_remove_page(&rmem_mm->tlb,page);
+
+		/* tlb_flush_mmu drops one ref so take an extra ref here. */
+		get_page(page);
+	}
+	if (vma->vm_flags & VM_WRITE) {
+		set_bit(HMM_PFN_WRITE, &rmem->pfns[idx]);
+	}
+	rmem_mm->laddr = addr + PAGE_SIZE;
+
+	pte = swp_entry_to_pte(make_hmm_entry(uid));
+	set_pte_at(mm, addr, ptep, pte);
+
+	/* What a journey ! */
+	return 0;
+}
+
+static int hmm_rmem_unmap_pmd(pmd_t *pmdp,
+			      unsigned long addr,
+			      unsigned long next,
+			      struct mm_walk *walk)
+{
+	struct hmm_rmem_mm *rmem_mm = walk->private;
+	struct vm_area_struct *vma = rmem_mm->vma;
+	spinlock_t *ptl;
+	pte_t *ptep;
+	int ret = 0;
+
+	if (pmd_none(*pmdp)) {
+		if (unlikely(__pte_alloc(vma->vm_mm, vma, pmdp, addr))) {
+			return -ENOENT;
+		}
+	}
+
+	if (pmd_trans_huge(*pmdp)) {
+		/* FIXME this will deadlock because it does mmu_notifier_range_invalidate */
+		split_huge_page_pmd(vma, addr, pmdp);
+		return -EAGAIN;
+	}
+
+	if (pmd_none_or_trans_huge_or_clear_bad(pmdp)) {
+		/* It was already handled above. */
+		BUG();
+		return -EINVAL;
+	}
+
+again:
+	ptep = pte_offset_map_lock(vma->vm_mm, pmdp, addr, &ptl);
+	arch_enter_lazy_mmu_mode();
+	for (; addr != next; ++ptep, addr += PAGE_SIZE) {
+		ret = hmm_rmem_unmap_anon_page(rmem_mm, addr,
+					       ptep, pmdp);
+		if (ret || rmem_mm->force_flush) {
+			/* Increment ptep so unlock works on correct
+			 * pte.
+			 */
+			ptep++;
+			break;
+		}
+	}
+	arch_leave_lazy_mmu_mode();
+	pte_unmap_unlock(ptep - 1, ptl);
+
+	/* mmu_gather ran out of room to batch pages, we break out of the PTE
+	 * lock to avoid doing the potentially expensive TLB invalidate and
+	 * page-free while holding it.
+	 */
+	if (rmem_mm->force_flush) {
+		unsigned long old_end;
+
+		rmem_mm->force_flush = 0;
+		/*
+		 * Flush the TLB just for the previous segment,
+		 * then update the range to be the remaining
+		 * TLB range.
+		 */
+		old_end = rmem_mm->tlb.end;
+		rmem_mm->tlb.end = addr;
+
+		tlb_flush_mmu(&rmem_mm->tlb);
+
+		rmem_mm->tlb.start = addr;
+		rmem_mm->tlb.end = old_end;
+
+		if (!ret && addr != next) {
+			goto again;
+		}
+	}
+
+	return ret;
+}
+
+static int hmm_rmem_unmap_anon(struct hmm_rmem *rmem,
+			       struct vm_area_struct *vma,
+			       unsigned long faddr,
+			       unsigned long laddr)
+{
+	struct hmm_rmem_mm rmem_mm;
+	struct mm_walk walk = {0};
+	unsigned long i, npages;
+	int ret;
+
+	if (vma->vm_file) {
+		return -EINVAL;
+	}
 
-static int hmm_device_fence_wait(struct hmm_device *device,
-				 struct hmm_fence *fence);
+	npages = (laddr - faddr) >> PAGE_SHIFT;
+	rmem->pgoff = faddr;
+	rmem_mm.vma = vma;
+	rmem_mm.rmem = rmem;
+	rmem_mm.faddr = faddr;
+	rmem_mm.laddr = faddr;
+	rmem_mm.fuid = rmem->fuid;
+	memset(rmem->pfns, 0, sizeof(long) * npages);
+
+	rmem_mm.force_flush = 0;
+	walk.pmd_entry = hmm_rmem_unmap_pmd;
+	walk.mm = vma->vm_mm;
+	walk.private = &rmem_mm;
+
+	mmu_notifier_invalidate_range_start(walk.mm,vma,faddr,laddr,MMU_HMM);
+	tlb_gather_mmu(&rmem_mm.tlb, walk.mm, faddr, laddr);
+	tlb_start_vma(&rmem_mm.tlb, rmem_mm.vma);
+	ret = walk_page_range(faddr, laddr, &walk);
+	tlb_end_vma(&rmem_mm.tlb, rmem_mm.vma);
+	tlb_finish_mmu(&rmem_mm.tlb, faddr, laddr);
+	mmu_notifier_invalidate_range_end(walk.mm, vma, faddr, laddr, MMU_HMM);
 
+	/* Before migrating pages we must lock them. Here we lock all pages we
+	 * could not lock while holding the pte lock.
+	 */
+	npages = (rmem_mm.laddr - faddr) >> PAGE_SHIFT;
+	for (i = 0; i < npages; ++i) {
+		struct page *page;
 
+		if (test_bit(HMM_PFN_VALID_ZERO, &rmem->pfns[i])) {
+			continue;
+		}
 
+		page = hmm_pfn_to_page(rmem->pfns[i]);
+		if (!test_bit(HMM_PFN_LOCK, &rmem->pfns[i])) {
+			lock_page(page);
+			set_bit(HMM_PFN_LOCK, &rmem->pfns[i]);
+		}
+	}
 
-/* hmm_event - use to synchronize various mm events with each others.
- *
- * During life time of process various mm events will happen, hmm serialize
- * event that affect overlapping range of address. The hmm_event are use for
- * that purpose.
- */
+	return ret;
+}
 
-static inline bool hmm_event_overlap(struct hmm_event *a, struct hmm_event *b)
+static inline int hmm_rmem_unmap(struct hmm_rmem *rmem,
+				 struct vm_area_struct *vma,
+				 unsigned long faddr,
+				 unsigned long laddr)
 {
-	return !((a->laddr <= b->faddr) || (a->faddr >= b->laddr));
+	if (vma->vm_file) {
+		return -EBUSY;
+	} else {
+		return hmm_rmem_unmap_anon(rmem, vma, faddr, laddr);
+	}
 }
 
-static inline unsigned long hmm_event_size(struct hmm_event *event)
+static int hmm_rmem_alloc_pages(struct hmm_rmem *rmem,
+				struct vm_area_struct *vma,
+				unsigned long addr)
 {
-	return (event->laddr - event->faddr);
-}
+	unsigned long i, npages = hmm_rmem_npages(rmem);
+	unsigned long *pfns = rmem->pfns;
+	struct mm_struct *mm = vma ? vma->vm_mm : NULL;
+	int ret = 0;
 
+	if (vma && !(vma->vm_file)) {
+		if (unlikely(anon_vma_prepare(vma))) {
+			return -ENOMEM;
+		}
+	}
 
+	for (i = 0; i < npages; ++i, addr += PAGE_SIZE) {
+		struct page *page;
 
+		/* (i) This does happen if the vma is being split and the rmem
+		 * split failed, thus we are falling back to full rmem migration
+		 * and there might not be a vma covering all the addresses (ie
+		 * some of the migration is useless but to make the code simpler
+		 * we just copy more stuff than necessary).
+		 */
+		if (vma && addr >= vma->vm_end) {
+			vma = mm ? find_vma(mm, addr) : NULL;
+		}
 
-/* hmm_fault_mm - used for reading cpu page table on device fault.
- *
- * This code deals with reading the cpu page table to find the pages that are
- * backing a range of address. It is use as an helper to the device page fault
- * code.
- */
+		/* No need to clear the pages, they will be dma'ed to; of course
+		 * this means we trust the device driver.
+		 */
+		if (!vma) {
+			/* See above (i) for when this does happen. */
+			page = alloc_page(GFP_HIGHUSER_MOVABLE);
+		} else {
+			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, addr);
+		}
+		if (!page) {
+			ret = ret ? ret : -ENOMEM;
+			continue;
+		}
+		lock_page(page);
+		pfns[i] = page_to_pfn(page) << HMM_PFN_SHIFT;
+		set_bit(HMM_PFN_WRITE, &pfns[i]);
+		set_bit(HMM_PFN_LOCK, &pfns[i]);
+		set_bit(HMM_PFN_VALID_PAGE, &pfns[i]);
+		page_add_new_anon_rmap(page, vma, addr);
+	}
 
-/* struct hmm_fault_mm - used for reading cpu page table on device fault.
- *
- * @mm:     The mm of the process the device fault is happening in.
- * @vma:    The vma in which the fault is happening.
- * @faddr:  The first address for the range the device want to fault.
- * @laddr:  The last address for the range the device want to fault.
- * @pfns:   Array of hmm pfns (contains the result of the fault).
- * @write:  Is this write fault.
- */
-struct hmm_fault_mm {
-	struct mm_struct	*mm;
-	struct vm_area_struct	*vma;
-	unsigned long		faddr;
-	unsigned long		laddr;
-	unsigned long		*pfns;
-	bool			write;
-};
+	return ret;
+}
 
-static int hmm_fault_mm_fault_pmd(pmd_t *pmdp,
-				  unsigned long faddr,
-				  unsigned long laddr,
-				  struct mm_walk *walk)
+int hmm_rmem_migrate_to_lmem(struct hmm_rmem *rmem,
+			     struct vm_area_struct *vma,
+			     unsigned long addr,
+			     unsigned long fuid,
+			     unsigned long luid,
+			     bool adjust)
 {
-	struct hmm_fault_mm *fault_mm = walk->private;
-	unsigned long idx, *pfns;
-	pte_t *ptep;
+	struct hmm_device *device = rmem->device;
+	struct hmm_range *range, *next;
+	struct hmm_fence *fence, *tmp;
+	struct mm_struct *mm = vma ? vma->vm_mm : NULL;
+	struct list_head fences;
+	unsigned long i;
+	int ret = 0;
 
-	idx = (faddr - fault_mm->faddr) >> PAGE_SHIFT;
-	pfns = &fault_mm->pfns[idx];
-	memset(pfns, 0, ((laddr - faddr) >> PAGE_SHIFT) * sizeof(long));
-	if (pmd_none(*pmdp)) {
-		return -ENOENT;
+	BUG_ON(vma && ((addr < vma->vm_start) || (addr >= vma->vm_end)));
+
+	/* Ignore split errors, we will fall back to full migration. */
+	hmm_rmem_split(rmem, fuid, luid, adjust);
+
+	if (rmem->fuid > fuid || rmem->luid < luid) {
+		WARN_ONCE(1, "hmm: rmem split out of constraint.\n");
+		ret = -EINVAL;
+		goto error;
 	}
 
-	if (pmd_trans_huge(*pmdp)) {
-		/* FIXME */
-		return -EINVAL;
+	/* Adjust start address for page allocation if necessary. */
+	if (vma && (rmem->fuid < fuid)) {
+		if (((addr-vma->vm_start)>>PAGE_SHIFT) < (fuid-rmem->fuid)) {
+			/* FIXME can this happen ? I would say no but right
+			 * now i can not hold in my brain all the code paths
+			 * that lead to this place.
+			 */
+			vma = NULL;
+		} else {
+			addr -= ((fuid - rmem->fuid) << PAGE_SHIFT);
+		}
 	}
 
-	if (pmd_none_or_trans_huge_or_clear_bad(pmdp)) {
-		return -EINVAL;
+	ret = hmm_rmem_alloc_pages(rmem, vma, addr);
+	if (ret) {
+		goto error;
 	}
 
-	ptep = pte_offset_map(pmdp, faddr);
-	for (; faddr != laddr; ++ptep, ++pfns, faddr += PAGE_SIZE) {
-		pte_t pte = *ptep;
+	INIT_LIST_HEAD(&fences);
 
-		if (pte_none(pte)) {
-			if (fault_mm->write) {
-				ptep++;
-				break;
-			}
-			*pfns = my_zero_pfn(faddr) << HMM_PFN_SHIFT;
-			set_bit(HMM_PFN_VALID_ZERO, pfns);
-			continue;
+	/* No need to lock because at this point no one else can modify the
+	 * ranges list.
+	 */
+	list_for_each_entry (range, &rmem->ranges, rlist) {
+		fence = device->ops->rmem_update(range->mirror,
+						 range->rmem,
+						 range->faddr,
+						 range->laddr,
+						 range->fuid,
+						 HMM_MIGRATE_TO_LMEM,
+						 false);
+		if (IS_ERR(fence)) {
+			ret = PTR_ERR(fence);
+			goto error;
 		}
-		if (!pte_present(pte) || (fault_mm->write && !pte_write(pte))) {
-			/* Need to inc ptep so unmap unlock on right pmd. */
-			ptep++;
-			break;
+		if (fence) {
+			list_add_tail(&fence->list, &fences);
 		}
+	}
 
-		*pfns = pte_pfn(pte) << HMM_PFN_SHIFT;
-		set_bit(HMM_PFN_VALID_PAGE, pfns);
-		if (pte_write(pte)) {
-			set_bit(HMM_PFN_WRITE, pfns);
+	list_for_each_entry_safe (fence, tmp, &fences, list) {
+		int r;
+
+		r = hmm_device_fence_wait(device, fence);
+		ret = ret ? min(ret, r) : r;
+	}
+	if (ret) {
+		goto error;
+	}
+
+	fence = device->ops->rmem_to_lmem(rmem, rmem->fuid, rmem->luid);
+	if (IS_ERR(fence)) {
+		/* FIXME Check return value. */
+		ret = PTR_ERR(fence);
+		goto error;
+	}
+
+	if (fence) {
+		INIT_LIST_HEAD(&fence->list);
+		ret = hmm_device_fence_wait(device, fence);
+		if (ret) {
+			goto error;
 		}
-		/* Consider the page as hot as a device want to use it. */
-		mark_page_accessed(pfn_to_page(pte_pfn(pte)));
-		fault_mm->laddr = faddr + PAGE_SIZE;
 	}
-	pte_unmap(ptep - 1);
 
-	return (faddr == laddr) ? 0 : -ENOENT;
-}
+	/* Now the remote memory is officially dead and nothing below can fail
+	 * badly.
+	 */
+	rmem->dead = true;
 
-static int hmm_fault_mm_fault(struct hmm_fault_mm *fault_mm)
-{
-	struct mm_walk walk = {0};
-	unsigned long faddr, laddr;
-	int ret;
+	/* No need to lock because at this point no one else can modify the
+	 * ranges list.
+	 */
+	list_for_each_entry_safe (range, next, &rmem->ranges, rlist) {
+		VM_BUG_ON(!vma);
+		VM_BUG_ON(range->faddr < vma->vm_start);
+		VM_BUG_ON(range->laddr > vma->vm_end);
+
+		/* The remapping fails only if something goes terribly wrong. */
+		ret = hmm_rmem_remap_anon(rmem, vma, range->faddr,
+					  range->laddr, range->fuid);
+		if (ret) {
+			WARN_ONCE(1, "hmm: something is terribly wrong.\n");
+			hmm_rmem_poison_range(rmem, mm, vma, range->faddr,
+					      range->laddr, range->fuid);
+		}
+		hmm_range_fini(range);
+	}
 
-	faddr = fault_mm->faddr;
-	laddr = fault_mm->laddr;
-	fault_mm->laddr = faddr;
+	for (i = 0; i < hmm_rmem_npages(rmem); ++i) {
+		struct page *page = hmm_pfn_to_page(rmem->pfns[i]);
 
-	walk.pmd_entry = hmm_fault_mm_fault_pmd;
-	walk.mm = fault_mm->mm;
-	walk.private = fault_mm;
+		unlock_page(page);
+		mem_cgroup_transfer_charge_anon(page, mm);
+		page_remove_rmap(page);
+		page_cache_release(page);
+		rmem->pfns[i] = 0UL;
+	}
+	return 0;
 
-	ret = walk_page_range(faddr, laddr, &walk);
+error:
+	/* No need to lock because at this point no one else can modify the
+	 * ranges list.
+	 */
+	/* There are two cases here :
+	 * (1) rmem is mirroring shared memory, in which case we are facing the
+	 *     issue of poisoning all the mappings in all the processes for
+	 *     that file.
+	 * (2) rmem is mirroring private memory, the easy case, poison all
+	 *     ranges referencing the rmem.
+	 */
+	for (i = 0; i < hmm_rmem_npages(rmem); ++i) {
+		struct page *page = hmm_pfn_to_page(rmem->pfns[i]);
+
+		if (!page) {
+			if (vma && !(vma->vm_flags & VM_SHARED)) {
+				/* Properly uncharge memory. */
+				mem_cgroup_uncharge_mm(mm);
+			}
+			continue;
+		}
+		/* Properly uncharge memory. */
+		mem_cgroup_transfer_charge_anon(page, mm);
+		if (!test_bit(HMM_PFN_LOCK, &rmem->pfns[i])) {
+			unlock_page(page);
+		}
+		page_remove_rmap(page);
+		page_cache_release(page);
+		rmem->pfns[i] = 0UL;
+	}
+	list_for_each_entry_safe (range, next, &rmem->ranges, rlist) {
+		mm = range->mirror->hmm->mm;
+		hmm_rmem_poison_range(rmem, mm, NULL, range->faddr,
+				      range->laddr, range->fuid);
+		hmm_range_fini(range);
+	}
 	return ret;
 }
 
@@ -285,6 +1610,7 @@ static int hmm_init(struct hmm *hmm, struct mm_struct *mm)
 	INIT_LIST_HEAD(&hmm->mirrors);
 	INIT_LIST_HEAD(&hmm->pending);
 	spin_lock_init(&hmm->lock);
+	hmm->ranges = RB_ROOT;
 	init_waitqueue_head(&hmm->wait_queue);
 
 	for (i = 0; i < HMM_MAX_EVENTS; ++i) {
@@ -298,6 +1624,12 @@ static int hmm_init(struct hmm *hmm, struct mm_struct *mm)
 	return ret;
 }
 
+static inline bool hmm_event_cover_range(struct hmm_event *a,
+					 struct hmm_range *b)
+{
+	return ((a->faddr <= b->faddr) && (a->laddr >= b->laddr));
+}
+
 static enum hmm_etype hmm_event_mmu(enum mmu_action action)
 {
 	switch (action) {
@@ -326,6 +1658,7 @@ static enum hmm_etype hmm_event_mmu(enum mmu_action action)
 	case MMU_MUNMAP:
 		return HMM_MUNMAP;
 	case MMU_SOFT_DIRTY:
+	case MMU_HMM:
 	default:
 		return HMM_NONE;
 	}
@@ -357,6 +1690,8 @@ static void hmm_destroy_kref(struct kref *kref)
 	mm->hmm = NULL;
 	mmu_notifier_unregister(&hmm->mmu_notifier, mm);
 
+	BUG_ON(!RB_EMPTY_ROOT(&hmm->ranges));
+
 	if (!list_empty(&hmm->mirrors)) {
 		BUG();
 		printk(KERN_ERR "destroying an hmm with still active mirror\n"
@@ -410,6 +1745,7 @@ out:
 	event->laddr = laddr;
 	event->backoff = false;
 	INIT_LIST_HEAD(&event->fences);
+	INIT_LIST_HEAD(&event->ranges);
 	hmm->nevents++;
 	list_add_tail(&event->list, &hmm->pending);
 
@@ -447,11 +1783,116 @@ wait:
 	goto retry_wait;
 }
 
+static int hmm_migrate_to_lmem(struct hmm *hmm,
+			       struct vm_area_struct *vma,
+			       unsigned long faddr,
+			       unsigned long laddr,
+			       bool adjust)
+{
+	struct hmm_range *range;
+	struct hmm_rmem *rmem;
+	int ret = 0;
+
+	if (unlikely(anon_vma_prepare(vma))) {
+		return -ENOMEM;
+	}
+
+retry:
+	spin_lock(&hmm->lock);
+	range = hmm_range_tree_iter_first(&hmm->ranges, faddr, laddr - 1);
+	while (range && faddr < laddr) {
+		struct hmm_device *device;
+		unsigned long fuid, luid, cfaddr, claddr;
+		int r;
+
+		cfaddr = max(faddr, range->faddr);
+		claddr = min(laddr, range->laddr);
+		fuid = range->fuid + ((cfaddr - range->faddr) >> PAGE_SHIFT);
+		luid = fuid + ((claddr - cfaddr) >> PAGE_SHIFT);
+		faddr = min(range->laddr, laddr);
+		rmem = hmm_rmem_ref(range->rmem);
+		device = rmem->device;
+		spin_unlock(&hmm->lock);
+
+		r = hmm_rmem_migrate_to_lmem(rmem, vma, cfaddr, fuid,
+					     luid, adjust);
+		hmm_rmem_unref(rmem);
+		if (r) {
+			ret = ret ? ret : r;
+			hmm_mirror_cleanup(range->mirror);
+			goto retry;
+		}
+
+		spin_lock(&hmm->lock);
+		range = hmm_range_tree_iter_first(&hmm->ranges,faddr,laddr-1);
+	}
+	spin_unlock(&hmm->lock);
+
+	return ret;
+}
+
+static unsigned long hmm_ranges_reserve(struct hmm *hmm, struct hmm_event *event)
+{
+	struct hmm_range *range;
+	unsigned long faddr, laddr, count = 0;
+
+	faddr = event->faddr;
+	laddr = event->laddr;
+
+retry:
+	spin_lock(&hmm->lock);
+	range = hmm_range_tree_iter_first(&hmm->ranges, faddr, laddr - 1);
+	while (range) {
+		if (!hmm_range_reserve(range, event)) {
+			struct hmm_rmem *rmem = hmm_rmem_ref(range->rmem);
+			spin_unlock(&hmm->lock);
+			wait_event(hmm->wait_queue, rmem->event == NULL);
+			hmm_rmem_unref(rmem);
+			goto retry;
+		}
+
+		if (list_empty(&range->elist)) {
+			list_add_tail(&range->elist, &event->ranges);
+			count++;
+		}
+
+		range = hmm_range_tree_iter_next(range, faddr, laddr - 1);
+	}
+	spin_unlock(&hmm->lock);
+
+	return count;
+}
+
+static void hmm_ranges_release(struct hmm *hmm, struct hmm_event *event)
+{
+	struct hmm_range *range, *next;
+
+	list_for_each_entry_safe (range, next, &event->ranges, elist) {
+		hmm_range_release(range, event);
+	}
+}
+
 static void hmm_update_mirrors(struct hmm *hmm,
 			       struct vm_area_struct *vma,
 			       struct hmm_event *event)
 {
 	unsigned long faddr, laddr;
+	bool migrate = false;
+
+	switch (event->etype) {
+	case HMM_COW:
+		migrate = true;
+		break;
+	case HMM_MUNMAP:
+		migrate = vma->vm_file ? true : false;
+		break;
+	default:
+		break;
+	}
+
+	if (hmm_ranges_reserve(hmm, event) && migrate) {
+		hmm_migrate_to_lmem(hmm,vma,event->faddr,event->laddr,false);
+	}
 
 	for (faddr = event->faddr; faddr < event->laddr; faddr = laddr) {
 		struct hmm_mirror *mirror;
@@ -494,6 +1935,7 @@ retry_ranges:
 			}
 		}
 	}
+	hmm_ranges_release(hmm, event);
 }
 
 static int hmm_fault_mm(struct hmm *hmm,
@@ -529,6 +1971,98 @@ static int hmm_fault_mm(struct hmm *hmm,
 	return 0;
 }
 
+/* see include/linux/hmm.h */
+int hmm_mm_fault(struct mm_struct *mm,
+		 struct vm_area_struct *vma,
+		 unsigned long addr,
+		 pte_t *pte,
+		 pmd_t *pmd,
+		 unsigned int fault_flags,
+		 pte_t opte)
+{
+	struct hmm_mirror *mirror = NULL;
+	struct hmm_device *device;
+	struct hmm_event *event;
+	struct hmm_range *range;
+	struct hmm_rmem *rmem = NULL;
+	unsigned long uid, faddr, laddr;
+	swp_entry_t entry;
+	struct hmm *hmm = hmm_ref(mm->hmm);
+	int ret;
+
+	if (!hmm) {
+		BUG();
+		return VM_FAULT_SIGBUS;
+	}
+
+	/* Find the corresponding rmem. */
+	entry = pte_to_swp_entry(opte);
+	if (!is_hmm_entry(entry)) {
+		//print_bad_pte(vma, addr, opte, NULL);
+		hmm_unref(hmm);
+		return VM_FAULT_SIGBUS;
+	}
+	uid = hmm_entry_uid(entry);
+	if (!uid) {
+		/* Poisonous hmm swap entry. */
+		hmm_unref(hmm);
+		return VM_FAULT_SIGBUS;
+	}
+
+	rmem = hmm_rmem_find(uid);
+	if (!rmem) {
+		hmm_unref(hmm);
+		if (pte_same(*pte, opte)) {
+			//print_bad_pte(vma, addr, opte, NULL);
+			return VM_FAULT_SIGBUS;
+		}
+		return 0;
+	}
+
+	faddr = addr & PAGE_MASK;
+	/* FIXME use the readahead value as a hint on how much to migrate. */
+	laddr = min(faddr + (16 << PAGE_SHIFT), vma->vm_end);
+	spin_lock(&rmem->lock);
+	list_for_each_entry (range, &rmem->ranges, rlist) {
+		if (faddr < range->faddr || faddr >= range->laddr) {
+			continue;
+		}
+		if (range->mirror->hmm == hmm) {
+			laddr = min(laddr, range->laddr);
+			mirror = hmm_mirror_ref(range->mirror);
+			break;
+		}
+	}
+	spin_unlock(&rmem->lock);
+	hmm_rmem_unref(rmem);
+	hmm_unref(hmm);
+	if (mirror == NULL) {
+		if (pte_same(*pte, opte)) {
+			//print_bad_pte(vma, addr, opte, NULL);
+			return VM_FAULT_SIGBUS;
+		}
+		return 0;
+	}
+
+	device = rmem->device;
+	event = hmm_event_get(hmm, faddr, laddr, HMM_MIGRATE_TO_LMEM);
+	hmm_ranges_reserve(hmm, event);
+	ret = hmm_migrate_to_lmem(hmm, vma, faddr, laddr, true);
+	hmm_ranges_release(hmm, event);
+	hmm_event_unqueue(hmm, event);
+	hmm_mirror_unref(mirror);
+	switch (ret) {
+	case 0:
+		break;
+	case -ENOMEM:
+		return VM_FAULT_OOM;
+	default:
+		return VM_FAULT_SIGBUS;
+	}
+
+	return VM_FAULT_MAJOR;
+}
+
 
 
 
@@ -726,16 +2260,15 @@ static struct mmu_notifier_ops hmm_notifier_ops = {
  * device page table (through hmm callback). Or provide helper functions use by
  * the device driver to fault in range of memory in the device page table.
  */
-
-static int hmm_mirror_update(struct hmm_mirror *mirror,
-			     struct vm_area_struct *vma,
-			     unsigned long faddr,
-			     unsigned long laddr,
-			     struct hmm_event *event)
+
+static int hmm_mirror_lmem_update(struct hmm_mirror *mirror,
+				  unsigned long faddr,
+				  unsigned long laddr,
+				  struct hmm_event *event,
+				  bool dirty)
 {
 	struct hmm_device *device = mirror->device;
 	struct hmm_fence *fence;
-	bool dirty = !!(vma->vm_file);
 
 	fence = device->ops->lmem_update(mirror, faddr, laddr,
 					 event->etype, dirty);
@@ -749,6 +2282,175 @@ static int hmm_mirror_update(struct hmm_mirror *mirror,
 	return 0;
 }
 
+static int hmm_mirror_rmem_update(struct hmm_mirror *mirror,
+				  struct hmm_rmem *rmem,
+				  unsigned long faddr,
+				  unsigned long laddr,
+				  unsigned long fuid,
+				  struct hmm_event *event,
+				  bool dirty)
+{
+	struct hmm_device *device = mirror->device;
+	struct hmm_fence *fence;
+
+	fence = device->ops->rmem_update(mirror, rmem, faddr, laddr,
+					 fuid, event->etype, dirty);
+	if (fence) {
+		if (IS_ERR(fence)) {
+			return PTR_ERR(fence);
+		}
+		fence->mirror = mirror;
+		list_add_tail(&fence->list, &event->fences);
+	}
+	return 0;
+}
+
+static int hmm_mirror_update(struct hmm_mirror *mirror,
+			     struct vm_area_struct *vma,
+			     unsigned long faddr,
+			     unsigned long laddr,
+			     struct hmm_event *event)
+{
+	struct hmm *hmm = mirror->hmm;
+	unsigned long caddr = faddr;
+	bool free = false, dirty = !!(vma->vm_flags & VM_SHARED);
+	int ret;
+
+	switch (event->etype) {
+	case HMM_MUNMAP:
+		free = true;
+		break;
+	default:
+		break;
+	}
+
+	for (; caddr < laddr;) {
+		struct hmm_range *range;
+		unsigned long naddr;
+
+		spin_lock(&hmm->lock);
+		range = hmm_range_tree_iter_first(&hmm->ranges,caddr,laddr-1);
+		if (range && range->mirror != mirror) {
+			range = NULL;
+		}
+		spin_unlock(&hmm->lock);
+
+		/* At this point the range is on the event list and thus it can
+		 * not disappear.
+		 */
+		BUG_ON(range && list_empty(&range->elist));
+
+		if (!range || (range->faddr > caddr)) {
+			naddr = range ? range->faddr : laddr;
+			ret = hmm_mirror_lmem_update(mirror, caddr, naddr,
+						     event, dirty);
+			if (ret) {
+				return ret;
+			}
+			caddr = naddr;
+		}
+		if (range) {
+			unsigned long fuid;
+
+			naddr = min(range->laddr, laddr);
+			fuid = range->fuid+((caddr-range->faddr)>>PAGE_SHIFT);
+			ret = hmm_mirror_rmem_update(mirror,range->rmem,caddr,
+						     naddr,fuid,event,dirty);
+			caddr = naddr;
+			if (ret) {
+				return ret;
+			}
+			if (free) {
+				BUG_ON((caddr > range->faddr) ||
+				       (naddr < range->laddr));
+				hmm_range_fini_clear(range, vma);
+			}
+		}
+	}
+	return 0;
+}
+
+static unsigned long hmm_mirror_ranges_reserve(struct hmm_mirror *mirror,
+					       struct hmm_event *event)
+{
+	struct hmm_range *range;
+	unsigned long faddr, laddr, count = 0;
+	struct hmm *hmm = mirror->hmm;
+
+	faddr = event->faddr;
+	laddr = event->laddr;
+
+retry:
+	spin_lock(&hmm->lock);
+	range = hmm_range_tree_iter_first(&hmm->ranges, faddr, laddr - 1);
+	while (range) {
+		if (range->mirror == mirror) {
+			if (!hmm_range_reserve(range, event)) {
+				struct hmm_rmem *rmem;
+
+				rmem = hmm_rmem_ref(range->rmem);
+				spin_unlock(&hmm->lock);
+				wait_event(hmm->wait_queue, rmem->event!=NULL);
+				hmm_rmem_unref(rmem);
+				goto retry;
+			}
+			if (list_empty(&range->elist)) {
+				list_add_tail(&range->elist, &event->ranges);
+				count++;
+			}
+		}
+		range = hmm_range_tree_iter_next(range, faddr, laddr - 1);
+	}
+	spin_unlock(&hmm->lock);
+
+	return count;
+}
+
+static void hmm_mirror_ranges_migrate(struct hmm_mirror *mirror,
+				      struct vm_area_struct *vma,
+				      struct hmm_event *event)
+{
+	struct hmm_range *range;
+	struct hmm *hmm = mirror->hmm;
+
+	spin_lock(&hmm->lock);
+	range = hmm_range_tree_iter_first(&hmm->ranges,
+					  vma->vm_start,
+					  vma->vm_end - 1);
+	while (range) {
+		struct hmm_rmem *rmem;
+
+		if (range->mirror != mirror) {
+			goto next;
+		}
+		rmem = hmm_rmem_ref(range->rmem);
+		spin_unlock(&hmm->lock);
+
+		hmm_rmem_migrate_to_lmem(rmem, vma, range->faddr,
+					 hmm_range_fuid(range),
+					 hmm_range_luid(range),
+					 true);
+		hmm_rmem_unref(rmem);
+
+		spin_lock(&hmm->lock);
+	next:
+		range = hmm_range_tree_iter_first(&hmm->ranges,
+						  vma->vm_start,
+						  vma->vm_end - 1);
+	}
+	spin_unlock(&hmm->lock);
+}
+
+static void hmm_mirror_ranges_release(struct hmm_mirror *mirror,
+				      struct hmm_event *event)
+{
+	struct hmm_range *range, *next;
+
+	list_for_each_entry_safe (range, next, &event->ranges, elist) {
+		hmm_range_release(range, event);
+	}
+}
+
 static void hmm_mirror_cleanup(struct hmm_mirror *mirror)
 {
 	struct vm_area_struct *vma;
@@ -778,11 +2480,16 @@ static void hmm_mirror_cleanup(struct hmm_mirror *mirror)
 		faddr = max(faddr, vma->vm_start);
 		laddr = vma->vm_end;
 
+		hmm_mirror_ranges_reserve(mirror, event);
+
 		hmm_mirror_update(mirror, vma, faddr, laddr, event);
 		list_for_each_entry_safe (fence, next, &event->fences, list) {
 			hmm_device_fence_wait(device, fence);
 		}
 
+		hmm_mirror_ranges_migrate(mirror, vma, event);
+		hmm_mirror_ranges_release(mirror, event);
+
 		if (laddr >= vma->vm_end) {
 			vma = vma->vm_next;
 		}
@@ -949,6 +2656,33 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror)
 }
 EXPORT_SYMBOL(hmm_mirror_unregister);
 
+static int hmm_mirror_rmem_fault(struct hmm_mirror *mirror,
+				 struct hmm_fault *fault,
+				 struct vm_area_struct *vma,
+				 struct hmm_range *range,
+				 struct hmm_event *event,
+				 unsigned long faddr,
+				 unsigned long laddr,
+				 bool write)
+{
+	struct hmm_device *device = mirror->device;
+	struct hmm_rmem *rmem = range->rmem;
+	unsigned long fuid, luid, npages;
+	int ret;
+
+	if (range->mirror != mirror) {
+		/* Returning -EAGAIN will force cpu page fault path. */
+		return -EAGAIN;
+	}
+
+	npages = (range->laddr - range->faddr) >> PAGE_SHIFT;
+	fuid = range->fuid + ((faddr - range->faddr) >> PAGE_SHIFT);
+	luid = fuid + npages;
+
+	ret = device->ops->rmem_fault(mirror, rmem, faddr, laddr, fuid, fault);
+	return ret;
+}
+
 static int hmm_mirror_lmem_fault(struct hmm_mirror *mirror,
 				 struct hmm_fault *fault,
 				 unsigned long faddr,
@@ -995,6 +2729,7 @@ int hmm_mirror_fault(struct hmm_mirror *mirror,
 retry:
 	down_read(&hmm->mm->mmap_sem);
 	event = hmm_event_get(hmm, caddr, naddr, HMM_DEVICE_FAULT);
+	hmm_ranges_reserve(hmm, event);
 	/* FIXME handle gate area ? and guard page */
 	vma = find_extend_vma(hmm->mm, caddr);
 	if (!vma) {
@@ -1031,6 +2766,29 @@ retry:
 
 	for (; caddr < event->laddr;) {
 		struct hmm_fault_mm fault_mm;
+		struct hmm_range *range;
+
+		spin_lock(&hmm->lock);
+		range = hmm_range_tree_iter_first(&hmm->ranges,
+						  caddr,
+						  naddr - 1);
+		if (range && range->faddr > caddr) {
+			naddr = range->faddr;
+			range = NULL;
+		}
+		spin_unlock(&hmm->lock);
+		if (range) {
+			naddr = min(range->laddr, event->laddr);
+			ret = hmm_mirror_rmem_fault(mirror,fault,vma,range,
+						    event,caddr,naddr,write);
+			if (ret) {
+				do_fault = (ret == -EAGAIN);
+				goto out;
+			}
+			caddr = naddr;
+			naddr = event->laddr;
+			continue;
+		}
 
 		fault_mm.mm = vma->vm_mm;
 		fault_mm.vma = vma;
@@ -1067,6 +2825,7 @@ retry:
 	}
 
 out:
+	hmm_ranges_release(hmm, event);
 	hmm_event_unqueue(hmm, event);
 	if (do_fault && !event->backoff && !mirror->dead) {
 		do_fault = false;
@@ -1092,6 +2851,334 @@ EXPORT_SYMBOL(hmm_mirror_fault);
 
 
 
+/* hmm_migrate - Memory migration to/from local memory from/to remote memory.
+ *
+ * Below are functions that handle migration to/from local memory from/to
+ * remote memory (rmem).
+ *
+ * Migration to remote memory is a multi-step process: first pages are unmapped
+ * and missing pages are either allocated or accounted as new allocations. Then
+ * pages are copied to remote memory. Finally the remote memory is faulted so
+ * that the device driver updates the device page table.
+ *
+ * The device driver can decide to abort migration to remote memory at any step
+ * of the process by returning a special value from the callback corresponding
+ * to that step.
+ *
+ * Migration to local memory is simpler. First pages are allocated, then remote
+ * memory is copied into those pages. Once dma is done the pages are remapped
+ * inside the cpu page table or inside the page cache (for shared memory) and
+ * finally the rmem is freed.
+ */
+
+/* see include/linux/hmm.h */
+int hmm_migrate_rmem_to_lmem(struct hmm_mirror *mirror,
+			     unsigned long faddr,
+			     unsigned long laddr)
+{
+	struct hmm *hmm = mirror->hmm;
+	struct vm_area_struct *vma;
+	struct hmm_event *event;
+	unsigned long next;
+	int ret = 0;
+
+	event = hmm_event_get(hmm, faddr, laddr, HMM_MIGRATE_TO_LMEM);
+	if (!hmm_ranges_reserve(hmm, event)) {
+		hmm_event_unqueue(hmm, event);
+		return 0;
+	}
+
+	hmm_mirror_ref(mirror);
+	down_read(&hmm->mm->mmap_sem);
+	vma = find_vma(hmm->mm, faddr);
+	faddr = max(vma->vm_start, faddr);
+	for (; vma && (faddr < laddr); faddr = next) {
+		next = min(laddr, vma->vm_end);
+
+		ret = hmm_migrate_to_lmem(hmm, vma, faddr, next, true);
+		if (ret) {
+			break;
+		}
+
+		vma = vma->vm_next;
+		next = max(vma->vm_start, next);
+	}
+	up_read(&hmm->mm->mmap_sem);
+	hmm_ranges_release(hmm, event);
+	hmm_event_unqueue(hmm, event);
+	hmm_mirror_unref(mirror);
+	return ret;
+}
+EXPORT_SYMBOL(hmm_migrate_rmem_to_lmem);
+
+static void hmm_migrate_abort(struct hmm_mirror *mirror,
+			      struct hmm_fault *fault,
+			      unsigned long *pfns,
+			      unsigned long fuid)
+{
+	struct vm_area_struct *vma = fault->vma;
+	struct hmm_rmem rmem;
+	unsigned long i, npages;
+
+	npages = (fault->laddr - fault->faddr) >> PAGE_SHIFT;
+	for (i = npages - 1; i > 0; --i) {
+		if (pfns[i]) {
+			break;
+		}
+		npages = i;
+	}
+	if (!npages) {
+		return;
+	}
+
+	/* Fake temporary rmem object. */
+	hmm_rmem_init(&rmem, mirror->device);
+	rmem.fuid = fuid;
+	rmem.luid = fuid + npages;
+	rmem.pfns = pfns;
+
+	if (!(vma->vm_file)) {
+		unsigned long faddr, laddr;
+
+		faddr = fault->faddr;
+		laddr = faddr + (npages << PAGE_SHIFT);
+
+		/* The remapping fails only if something goes terribly wrong. */
+		if (hmm_rmem_remap_anon(&rmem, vma, faddr, laddr, fuid)) {
+
+			WARN_ONCE(1, "hmm: something is terribly wrong.\n");
+			hmm_rmem_poison_range(&rmem, vma->vm_mm, vma,
+					      faddr, laddr, fuid);
+		}
+	} else {
+		BUG();
+	}
+
+	/* Ok, officially dead. */
+	if (fault->rmem) {
+		fault->rmem->dead = true;
+	}
+
+	for (i = 0; i < npages; ++i) {
+		struct page *page = hmm_pfn_to_page(pfns[i]);
+
+		if (!page) {
+			pfns[i] = 0;
+			continue;
+		}
+		if (test_bit(HMM_PFN_VALID_ZERO, &pfns[i])) {
+			/* Properly uncharge memory. */
+			add_mm_counter(vma->vm_mm, MM_ANONPAGES, -1);
+			mem_cgroup_uncharge_mm(vma->vm_mm);
+			pfns[i] = 0;
+			continue;
+		}
+		if (test_bit(HMM_PFN_LOCK, &pfns[i])) {
+			unlock_page(page);
+			clear_bit(HMM_PFN_LOCK, &pfns[i]);
+		}
+		page_remove_rmap(page);
+		page_cache_release(page);
+		pfns[i] = 0;
+	}
+}
+
+/* see include/linux/hmm.h */
+int hmm_migrate_lmem_to_rmem(struct hmm_fault *fault,
+			     struct hmm_mirror *mirror)
+{
+	struct vm_area_struct *vma;
+	struct hmm_device *device;
+	struct hmm_range *range;
+	struct hmm_fence *fence;
+	struct hmm_event *event;
+	struct hmm_rmem rmem;
+	unsigned long i, npages;
+	struct hmm *hmm;
+	int ret;
+
+	mirror = hmm_mirror_ref(mirror);
+	if (!fault || !mirror || fault->faddr > fault->laddr) {
+		return -EINVAL;
+	}
+	if (mirror->dead) {
+		hmm_mirror_unref(mirror);
+		return -ENODEV;
+	}
+	hmm = mirror->hmm;
+	device = mirror->device;
+	if (!device->rmem) {
+		hmm_mirror_unref(mirror);
+		return -EINVAL;
+	}
+	fault->rmem = NULL;
+	fault->faddr = fault->faddr & PAGE_MASK;
+	fault->laddr = PAGE_ALIGN(fault->laddr);
+	hmm_rmem_init(&rmem, mirror->device);
+	event = hmm_event_get(hmm, fault->faddr, fault->laddr,
+			      HMM_MIGRATE_TO_RMEM);
+	rmem.event = event;
+	hmm = mirror->hmm;
+
+	range = kmalloc(sizeof(struct hmm_range), GFP_KERNEL);
+	if (range == NULL) {
+		hmm_event_unqueue(hmm, event);
+		hmm_mirror_unref(mirror);
+		return -ENOMEM;
+	}
+
+	down_read(&hmm->mm->mmap_sem);
+	vma = find_vma_intersection(hmm->mm, fault->faddr, fault->laddr);
+	if (!vma) {
+		kfree(range);
+		range = NULL;
+		ret = -EFAULT;
+		goto out;
+	}
+	/* FIXME support HUGETLB */
+	if ((vma->vm_flags & (VM_IO | VM_PFNMAP | VM_MIXEDMAP | VM_HUGETLB))) {
+		kfree(range);
+		range = NULL;
+		ret = -EACCES;
+		goto out;
+	}
+	if (vma->vm_file) {
+		kfree(range);
+		range = NULL;
+		ret = -EBUSY;
+		goto out;
+	}
+	/* Adjust range to this vma only. */
+	event->faddr = fault->faddr = max(fault->faddr, vma->vm_start);
+	event->laddr = fault->laddr = min(fault->laddr, vma->vm_end);
+	npages = (fault->laddr - fault->faddr) >> PAGE_SHIFT;
+	fault->vma = vma;
+
+	ret = hmm_rmem_alloc(&rmem, npages);
+	if (ret) {
+		kfree(range);
+		range = NULL;
+		goto out;
+	}
+
+	/* Prior to unmapping, add the range to the hmm range tree so any
+	 * page fault can find the proper range.
+	 */
+	hmm_range_init(range, mirror, &rmem, fault->faddr,
+		       fault->laddr, rmem.fuid);
+	hmm_range_insert(range);
+
+	ret = hmm_rmem_unmap(&rmem, vma, fault->faddr, fault->laddr);
+	if (ret) {
+		hmm_migrate_abort(mirror, fault, rmem.pfns, rmem.fuid);
+		goto out;
+	}
+
+	fault->rmem = device->ops->rmem_alloc(device, fault);
+	if (IS_ERR(fault->rmem)) {
+		ret = PTR_ERR(fault->rmem);
+		hmm_migrate_abort(mirror, fault, rmem.pfns, rmem.fuid);
+		goto out;
+	}
+	if (fault->rmem == NULL) {
+		hmm_migrate_abort(mirror, fault, rmem.pfns, rmem.fuid);
+		ret = 0;
+		goto out;
+	}
+	if (event->backoff) {
+		ret = -EBUSY;
+		hmm_migrate_abort(mirror, fault, rmem.pfns, rmem.fuid);
+		goto out;
+	}
+
+	hmm_rmem_init(fault->rmem, mirror->device);
+	spin_lock(&_hmm_rmems_lock);
+	fault->rmem->event = event;
+	hmm_rmem_tree_remove(&rmem, &_hmm_rmems);
+	fault->rmem->fuid = rmem.fuid;
+	fault->rmem->luid = rmem.luid;
+	hmm_rmem_tree_insert(fault->rmem, &_hmm_rmems);
+	fault->rmem->pfns = rmem.pfns;
+	range->rmem = fault->rmem;
+	list_del_init(&range->rlist);
+	list_add_tail(&range->rlist, &fault->rmem->ranges);
+	rmem.event = NULL;
+	spin_unlock(&_hmm_rmems_lock);
+
+	fence = device->ops->lmem_to_rmem(fault->rmem,rmem.fuid,rmem.luid);
+	if (IS_ERR(fence)) {
+		hmm_migrate_abort(mirror, fault, rmem.pfns, rmem.fuid);
+		goto out;
+	}
+
+	ret = hmm_device_fence_wait(device, fence);
+	if (ret) {
+		hmm_migrate_abort(mirror, fault, rmem.pfns, rmem.fuid);
+		goto out;
+	}
+
+	ret = device->ops->rmem_fault(mirror, range->rmem, range->faddr,
+				      range->laddr, range->fuid, NULL);
+	if (ret) {
+		hmm_migrate_abort(mirror, fault, rmem.pfns, rmem.fuid);
+		goto out;
+	}
+
+	for (i = 0; i < npages; ++i) {
+		struct page *page = hmm_pfn_to_page(rmem.pfns[i]);
+
+		if (test_bit(HMM_PFN_VALID_ZERO, &rmem.pfns[i])) {
+			rmem.pfns[i] = rmem.pfns[i] & HMM_PFN_CLEAR;
+			continue;
+		}
+		/* Only decrement the page count now so that cow happens
+		 * properly while the page is in flight.
+		 */
+		if (PageAnon(page)) {
+			unlock_page(page);
+			page_remove_rmap(page);
+			page_cache_release(page);
+			rmem.pfns[i] &= HMM_PFN_CLEAR;
+		} else {
+			/* Otherwise the page is in the pagecache. Keep a
+			 * reference and the page count elevated.
+			 */
+			clear_bit(HMM_PFN_LOCK, &rmem.pfns[i]);
+			/* We do not want the side effects of page_remove_rmap,
+			 * i.e. the zone page accounting update, but we do want
+			 * a zero mapcount so writeback works properly.
+			 */
+			atomic_add(-1, &page->_mapcount);
+			unlock_page(page);
+		}
+	}
+
+	hmm_mirror_ranges_release(mirror, event);
+	hmm_event_unqueue(hmm, event);
+	up_read(&hmm->mm->mmap_sem);
+	hmm_mirror_unref(mirror);
+	return 0;
+
+out:
+	if (!fault->rmem) {
+		kfree(rmem.pfns);
+		spin_lock(&_hmm_rmems_lock);
+		hmm_rmem_tree_remove(&rmem, &_hmm_rmems);
+		spin_unlock(&_hmm_rmems_lock);
+	}
+	hmm_mirror_ranges_release(mirror, event);
+	hmm_event_unqueue(hmm, event);
+	up_read(&hmm->mm->mmap_sem);
+	hmm_range_unref(range);
+	hmm_rmem_unref(fault->rmem);
+	hmm_mirror_unref(mirror);
+	return ret;
+}
+EXPORT_SYMBOL(hmm_migrate_lmem_to_rmem);
+
+
+
+
 /* hmm_device - Each device driver must register one and only one hmm_device
  *
  * The hmm_device is the link btw hmm and each device driver.
@@ -1140,9 +3227,22 @@ int hmm_device_register(struct hmm_device *device, const char *name)
 	BUG_ON(!device->ops->lmem_fault);
 
 	kref_init(&device->kref);
+	device->rmem = false;
 	device->name = name;
 	mutex_init(&device->mutex);
 	INIT_LIST_HEAD(&device->mirrors);
+	init_waitqueue_head(&device->wait_queue);
+
+	if (device->ops->rmem_alloc &&
+	    device->ops->rmem_update &&
+	    device->ops->rmem_fault &&
+	    device->ops->rmem_to_lmem &&
+	    device->ops->lmem_to_rmem &&
+	    device->ops->rmem_split &&
+	    device->ops->rmem_split_adjust &&
+	    device->ops->rmem_destroy) {
+		device->rmem = true;
+	}
 
 	return 0;
 }
@@ -1179,6 +3279,7 @@ static int __init hmm_module_init(void)
 {
 	int ret;
 
+	spin_lock_init(&_hmm_rmems_lock);
 	ret = init_srcu_struct(&srcu);
 	if (ret) {
 		return ret;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ceaf4d7..88e4acd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -56,6 +56,7 @@
 #include <linux/oom.h>
 #include <linux/lockdep.h>
 #include <linux/file.h>
+#include <linux/hmm.h>
 #include "internal.h"
 #include <net/sock.h>
 #include <net/ip.h>
@@ -6649,6 +6650,8 @@ one_by_one:
  *   2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
  *     target for charge migration. if @target is not NULL, the entry is stored
  *     in target->ent.
+ *   3(MC_TARGET_HMM): if it is an hmm entry, target->page is either NULL or
+ *     points to the page whose charge should be moved.
  *
  * Called with pte lock held.
  */
@@ -6661,6 +6664,7 @@ enum mc_target_type {
 	MC_TARGET_NONE = 0,
 	MC_TARGET_PAGE,
 	MC_TARGET_SWAP,
+	MC_TARGET_HMM,
 };
 
 static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
@@ -6690,6 +6694,9 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
 	struct page *page = NULL;
 	swp_entry_t ent = pte_to_swp_entry(ptent);
 
+	if (is_hmm_entry(ent)) {
+		return swp_to_radix_entry(ent);
+	}
 	if (!move_anon() || non_swap_entry(ent))
 		return NULL;
 	/*
@@ -6764,6 +6771,10 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
 
 	if (!page && !ent.val)
 		return ret;
+	if (radix_tree_exceptional_entry(page)) {
+		ret = MC_TARGET_HMM;
+		return ret;
+	}
 	if (page) {
 		pc = lookup_page_cgroup(page);
 		/*
@@ -7077,6 +7088,41 @@ put:			/* get_mctgt_type() gets the page */
 				mc.moved_swap++;
 			}
 			break;
+		case MC_TARGET_HMM:
+			if (target.page) {
+				page = target.page;
+				pc = lookup_page_cgroup(page);
+				if (!mem_cgroup_move_account(page, 1, pc,
+							     mc.from, mc.to)) {
+					mc.precharge--;
+					/* we uncharge from mc.from later. */
+					mc.moved_charge++;
+				}
+				put_page(page);
+			} else if (vma->vm_flags & VM_SHARED) {
+				/* Someone migrated the memory after we did
+				 * the pagecache lookup.
+				 */
+				/* FIXME can the precharge/moved_charge then
+				 * become wrong ?
+				 */
+				pte_unmap_unlock(pte - 1, ptl);
+				cond_resched();
+				goto retry;
+			} else {
+				unsigned long flags;
+
+				move_lock_mem_cgroup(mc.from, &flags);
+				move_lock_mem_cgroup(mc.to, &flags);
+				mem_cgroup_charge_statistics(mc.from, NULL, true, -1);
+				mem_cgroup_charge_statistics(mc.to, NULL, true, 1);
+				move_unlock_mem_cgroup(mc.to, &flags);
+				move_unlock_mem_cgroup(mc.from, &flags);
+				mc.precharge--;
+				/* we uncharge from mc.from later. */
+				mc.moved_charge++;
+			}
+			break;
 		default:
 			break;
 		}
diff --git a/mm/memory.c b/mm/memory.c
index 1e164a1..d35bc65 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -53,6 +53,7 @@
 #include <linux/writeback.h>
 #include <linux/memcontrol.h>
 #include <linux/mmu_notifier.h>
+#include <linux/hmm.h>
 #include <linux/kallsyms.h>
 #include <linux/swapops.h>
 #include <linux/elf.h>
@@ -851,6 +852,9 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 					if (pte_swp_soft_dirty(*src_pte))
 						pte = pte_swp_mksoft_dirty(pte);
 					set_pte_at(src_mm, addr, src_pte, pte);
+				} else if (is_hmm_entry(entry)) {
+					/* FIXME do we want to handle rblk fork, just mapcount rblk if so. */
+					BUG_ON(1);
 				}
 			}
 		}
@@ -3079,6 +3083,9 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			migration_entry_wait(mm, pmd, address);
 		} else if (is_hwpoison_entry(entry)) {
 			ret = VM_FAULT_HWPOISON;
+		} else if (is_hmm_entry(entry)) {
+			ret = hmm_mm_fault(mm, vma, address, page_table,
+					   pmd, flags, orig_pte);
 		} else {
 			print_bad_pte(vma, address, orig_pte, NULL);
 			ret = VM_FAULT_SIGBUS;
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 08/11] hmm: support for migrate file backed pages to remote memory
  2014-05-02 13:51 ` j.glisse
  (?)
@ 2014-05-02 13:52   ` j.glisse
  -1 siblings, 0 replies; 107+ messages in thread
From: j.glisse @ 2014-05-02 13:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel
  Cc: Jérôme Glisse, Sherry Cheung, Subhash Gutti,
	Mark Hairgrove, John Hubbard, Jatin Kumar

From: Jérôme Glisse <jglisse@redhat.com>

Motivation:

Same as for migrating private anonymous memory, i.e. device local memory has
higher bandwidth and lower latency.

Implementation:

Migrated ranges are tracked exactly like private anonymous memory; refer to
the commit adding support for migrating private anonymous memory.

Migrating file backed pages is more complex than private anonymous memory
as those pages might be involved in various filesystem events, from
writeback to splice or truncation.

This patchset uses a special hmm swap value that is stored inside the radix
tree for pages that are migrated to remote memory. Any code that needs to do
a radix tree lookup is updated to understand those special hmm swap entries
and to call the hmm helper functions to perform the appropriate operation.
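
For illustration, the pattern such lookup sites end up with looks roughly like
the following. This is a minimal sketch distilled from the filemap.c and
splice.c hunks below, with error handling and the migration-failure FIXME
elided; it is not a literal excerpt:

	find_page:
		page = find_get_page(mapping, index);
		if (radix_tree_exceptional_entry(page)) {
			swp_entry_t swap = radix_to_swp_entry(page);

			/* The content lives in remote memory: migrate it
			 * back to local memory and redo the lookup.
			 */
			hmm_pagecache_migrate(mapping, swap);
			goto find_page;
		}
		/* Here page is either NULL or a regular struct page. */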

For most operations (file read, splice, truncate, ...) the end result is
simply to migrate back to local memory. It is expected that users of hmm
will not perform such operations on file backed memory that was migrated
to remote memory.

Writeback is different as we preserve the ability to write back dirtied
memory from remote memory (using local system memory as a bounce buffer).
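
To make the bounce buffer idea more concrete, a writeback path is expected to
use the new helper roughly as follows. This is a hedged sketch only: the real
call sites are in the mm/page-writeback.c changes and differ in detail, and
the redo_lookup label is made up for illustration:

	page = hmm_pagecache_writeback(mapping, swap);
	if (!page) {
		/* Copying the remote memory failed: the owning process is
		 * poisoned (SIGBUS on further access) and the caller must
		 * redo the radix tree lookup.
		 */
		goto redo_lookup;
	}
	/* page now holds the latest rmem content and can be written back
	 * while the device mappings for it stay read only.
	 */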

Each filesystem must be modified to support hmm. This patchset only modifies
common helper code and adds the core set of helpers needed for this feature.
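
As an example of how a filesystem would opt in, it presumably advertises hmm
support through the new features field added to struct
address_space_operations by this patch. The field and the AOPS_FEATURE_HMM
flag are real; the aops instance below is hypothetical and only illustrates
the wiring:

	static const struct address_space_operations myfs_aops = {
		.readpage	= myfs_readpage,
		.writepage	= myfs_writepage,
		/* Advertise that this mapping understands hmm entries. */
		.features	= AOPS_FEATURE_HMM,
	};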

Issues:

The big issue here is how to handle failure to migrate remote memory back
to local memory. Should all processes trying further access to the file get
SIGBUS? Should only the process that migrated memory to remote memory get
SIGBUS? ...

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 fs/aio.c             |    9 +
 fs/buffer.c          |    3 +
 fs/splice.c          |   38 +-
 include/linux/fs.h   |    4 +
 include/linux/hmm.h  |   72 +++-
 include/linux/rmap.h |    1 +
 mm/filemap.c         |   99 ++++-
 mm/hmm.c             | 1094 ++++++++++++++++++++++++++++++++++++++++++++++++--
 mm/madvise.c         |    4 +
 mm/mincore.c         |   11 +
 mm/page-writeback.c  |  131 ++++--
 mm/rmap.c            |   17 +-
 mm/swap.c            |    9 +
 mm/truncate.c        |  103 ++++-
 14 files changed, 1524 insertions(+), 71 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 0bf693f..0ec9f16 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -40,6 +40,7 @@
 #include <linux/ramfs.h>
 #include <linux/percpu-refcount.h>
 #include <linux/mount.h>
+#include <linux/hmm.h>
 
 #include <asm/kmap_types.h>
 #include <asm/uaccess.h>
@@ -405,10 +406,18 @@ static int aio_setup_ring(struct kioctx *ctx)
 
 	for (i = 0; i < nr_pages; i++) {
 		struct page *page;
+
+	repeat:
 		page = find_or_create_page(file->f_inode->i_mapping,
 					   i, GFP_HIGHUSER | __GFP_ZERO);
 		if (!page)
 			break;
+		if (radix_tree_exceptional_entry(page)) {
+			swp_entry_t swap = radix_to_swp_entry(page);
+
+			hmm_pagecache_migrate(file->f_inode->i_mapping, swap);
+			goto repeat;
+		}
 		pr_debug("pid(%d) page[%d]->count=%d\n",
 			 current->pid, i, page_count(page));
 		SetPageUptodate(page);
diff --git a/fs/buffer.c b/fs/buffer.c
index e33f8d5..2be2a04 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -40,6 +40,7 @@
 #include <linux/cpu.h>
 #include <linux/bitops.h>
 #include <linux/mpage.h>
+#include <linux/hmm.h>
 #include <linux/bit_spinlock.h>
 #include <trace/events/block.h>
 
@@ -1023,6 +1024,8 @@ grow_dev_page(struct block_device *bdev, sector_t block,
 	if (!page)
 		return ret;
 
+	/* This can not happen ! */
+	BUG_ON(radix_tree_exceptional_entry(page));
 	BUG_ON(!PageLocked(page));
 
 	if (page_has_buffers(page)) {
diff --git a/fs/splice.c b/fs/splice.c
index 9dc23de..175f80c 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -33,6 +33,7 @@
 #include <linux/socket.h>
 #include <linux/compat.h>
 #include <linux/aio.h>
+#include <linux/hmm.h>
 #include "internal.h"
 
 /*
@@ -334,6 +335,20 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
 	 * Lookup the (hopefully) full range of pages we need.
 	 */
 	spd.nr_pages = find_get_pages_contig(mapping, index, nr_pages, spd.pages);
+	/* Handle hmm entries, i.e. migrate remote memory back to local memory. */
+	for (page_nr = 0; page_nr < spd.nr_pages;) {
+		page = spd.pages[page_nr];
+		if (radix_tree_exceptional_entry(page)) {
+			swp_entry_t swap = radix_to_swp_entry(page);
+
+			/* FIXME How to handle hmm migration failure ? */
+			hmm_pagecache_migrate(mapping, swap);
+			spd.pages[page_nr] = find_get_page(mapping, index + page_nr);
+			continue;
+		} else {
+			page_nr++;
+		}
+	}
 	index += spd.nr_pages;
 
 	/*
@@ -351,6 +366,14 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
 		 * the first hole.
 		 */
 		page = find_get_page(mapping, index);
+
+		if (radix_tree_exceptional_entry(page)) {
+			swp_entry_t swap = radix_to_swp_entry(page);
+
+			/* FIXME How to handle hmm migration failure ? */
+			hmm_pagecache_migrate(mapping, swap);
+			continue;
+		}
 		if (!page) {
 			/*
 			 * page didn't exist, allocate one.
@@ -373,7 +396,6 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
 			 */
 			unlock_page(page);
 		}
-
 		spd.pages[spd.nr_pages++] = page;
 		index++;
 	}
@@ -415,6 +437,7 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
 			 */
 			if (!page->mapping) {
 				unlock_page(page);
+retry:
 				page = find_or_create_page(mapping, index,
 						mapping_gfp_mask(mapping));
 
@@ -422,8 +445,17 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
 					error = -ENOMEM;
 					break;
 				}
-				page_cache_release(spd.pages[page_nr]);
-				spd.pages[page_nr] = page;
+				/* If this is an exceptional entry it can only be an hmm entry. */
+				if (radix_tree_exceptional_entry(page)) {
+					swp_entry_t swap = radix_to_swp_entry(page);
+
+					/* FIXME How to handle hmm migration failure ? */
+					hmm_pagecache_migrate(mapping, swap);
+					goto retry;
+				} else {
+					page_cache_release(spd.pages[page_nr]);
+					spd.pages[page_nr] = page;
+				}
 			}
 			/*
 			 * page was already under io and is now done, great
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4e92d55..149a73e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -366,8 +366,12 @@ struct address_space_operations {
 	int (*swap_activate)(struct swap_info_struct *sis, struct file *file,
 				sector_t *span);
 	void (*swap_deactivate)(struct file *file);
+
+	int features;
 };
 
+#define AOPS_FEATURE_HMM	(1 << 0)
+
 extern const struct address_space_operations empty_aops;
 
 /*
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 96f41c4..9d232c1 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -53,7 +53,6 @@
 #include <linux/swapops.h>
 #include <linux/mman.h>
 
-
 struct hmm_device;
 struct hmm_device_ops;
 struct hmm_mirror;
@@ -75,6 +74,14 @@ struct hmm;
  *   HMM_PFN_LOCK is only set while the rmem object is under going migration.
  *   HMM_PFN_LMEM_UPTODATE the page that is in the rmem pfn array has uptodate.
  *   HMM_PFN_RMEM_UPTODATE the rmem copy of the page is uptodate.
+ *   HMM_PFN_FILE is set for pages that are part of the pagecache.
+ *   HMM_PFN_WRITEBACK is set when the page is undergoing writeback, meaning
+ *     that the page is locked and all device mappings to rmem for this page
+ *     are set to read only. It is only cleared if the device does a write
+ *     fault on the page or on migration back to lmem.
+ *   HMM_PFN_FS_WRITEABLE the rmem can be written to without calling mkwrite.
+ *     This is for hmm internal use only, to know whether hmm needs to call
+ *     the fs mkwrite callback or not.
  *
  * Device driver only need to worry about :
  *   HMM_PFN_VALID_PAGE
@@ -95,6 +102,9 @@ struct hmm;
 #define HMM_PFN_LOCK		(4UL)
 #define HMM_PFN_LMEM_UPTODATE	(5UL)
 #define HMM_PFN_RMEM_UPTODATE	(6UL)
+#define HMM_PFN_FILE		(7UL)
+#define HMM_PFN_WRITEBACK	(8UL)
+#define HMM_PFN_FS_WRITEABLE	(9UL)
 
 static inline struct page *hmm_pfn_to_page(unsigned long pfn)
 {
@@ -170,6 +180,7 @@ enum hmm_etype {
 	HMM_UNMAP,
 	HMM_MIGRATE_TO_LMEM,
 	HMM_MIGRATE_TO_RMEM,
+	HMM_WRITEBACK,
 };
 
 struct hmm_fence {
@@ -628,6 +639,7 @@ struct hmm_device *hmm_device_unref(struct hmm_device *device);
  *
  * @kref:           Reference count.
  * @device:         The hmm device the remote memory is allocated on.
+ * @mapping:        If rmem backing shared mapping.
  * @event:          The event currently associated with the rmem.
  * @lock:           Lock protecting the ranges list and event field.
  * @ranges:         The list of address ranges that point to this rmem.
@@ -646,6 +658,7 @@ struct hmm_device *hmm_device_unref(struct hmm_device *device);
 struct hmm_rmem {
 	struct kref		kref;
 	struct hmm_device	*device;
+	struct address_space	*mapping;
 	struct hmm_event	*event;
 	spinlock_t		lock;
 	struct list_head	ranges;
@@ -913,6 +926,42 @@ int hmm_mm_fault(struct mm_struct *mm,
 		 unsigned int fault_flags,
 		 pte_t orig_pte);
 
+/* hmm_pagecache_migrate - migrate remote memory to local memory.
+ *
+ * @mapping:    The address space in which the rmem was found.
+ * @swap:       The hmm special swap entry that needs to be migrated.
+ *
+ * When the fs code needs to migrate remote memory to local memory it calls
+ * this function. From the caller's point of view this function cannot fail.
+ * If it does fail internally, processes that were using the rmem get SIGBUS
+ * when accessing the pages whose migration failed. Other processes just get
+ * the latest content we had for the page. Hence from the pagecache point of
+ * view it never fails.
+ */
+void hmm_pagecache_migrate(struct address_space *mapping,
+			   swp_entry_t swap);
+
+/* hmm_pagecache_writeback - temporary copy of rmem for writeback.
+ *
+ * @mapping:    The address space in which the rmem was found.
+ * @swap:       The hmm special swap entry that needs a temporary copy.
+ * Return:      Page pointer or NULL on failure.
+ *
+ * When the fs code needs to write back remote memory to backing storage it
+ * calls this function. The function returns a pointer to a temporary page
+ * holding the latest copy of the remote memory. The remote memory is marked
+ * read only for the duration of the writeback.
+ *
+ * On failure this returns NULL and poisons any mapping of the process that
+ * was responsible for the remote memory, thus triggering a SIGBUS for this
+ * process. It also kills the mirror that was using this remote memory.
+ *
+ * When NULL is returned the caller should perform a new radix tree lookup.
+ */
+struct page *hmm_pagecache_writeback(struct address_space *mapping,
+				     swp_entry_t swap);
+struct page *hmm_pagecache_page(struct address_space *mapping,
+				swp_entry_t swap);
+
 #else /* !CONFIG_HMM */
 
 static inline void hmm_destroy(struct mm_struct *mm)
@@ -930,6 +979,27 @@ static inline int hmm_mm_fault(struct mm_struct *mm,
 	return VM_FAULT_SIGBUS;
 }
 
+static inline void hmm_pagecache_migrate(struct address_space *mapping,
+					 swp_entry_t swap)
+{
+	/* This can not happen ! */
+	BUG();
+}
+
+static inline struct page *hmm_pagecache_writeback(struct address_space *mapping,
+						   swp_entry_t swap)
+{
+	BUG();
+	return NULL;
+}
+
+static inline struct page *hmm_pagecache_page(struct address_space *mapping,
+					      swp_entry_t swap)
+{
+	BUG();
+	return NULL;
+}
+
 #endif /* !CONFIG_HMM */
 
 #endif
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 575851f..0641ccf 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -76,6 +76,7 @@ enum ttu_flags {
 	TTU_POISON = 1,			/* unmap mode */
 	TTU_MIGRATION = 2,		/* migration mode */
 	TTU_MUNLOCK = 3,		/* munlock mode */
+	TTU_HMM = 4,			/* hmm mode */
 	TTU_ACTION_MASK = 0xff,
 
 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
diff --git a/mm/filemap.c b/mm/filemap.c
index 067c3c0..686f46b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -34,6 +34,7 @@
 #include <linux/memcontrol.h>
 #include <linux/cleancache.h>
 #include <linux/rmap.h>
+#include <linux/hmm.h>
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
@@ -343,6 +344,7 @@ int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
 {
 	pgoff_t index = start_byte >> PAGE_CACHE_SHIFT;
 	pgoff_t end = end_byte >> PAGE_CACHE_SHIFT;
+	pgoff_t last_index = index;
 	struct pagevec pvec;
 	int nr_pages;
 	int ret2, ret = 0;
@@ -360,6 +362,19 @@ int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];
 
+			if (radix_tree_exceptional_entry(page)) {
+				swp_entry_t swap = radix_to_swp_entry(page);
+
+				/* FIXME How to handle hmm migration failure ? */
+				hmm_pagecache_migrate(mapping, swap);
+				pvec.pages[i] = NULL;
+				/* Force the range to be examined again in case
+				 * the migration triggered page writeback.
+				 */
+				index = last_index;
+				continue;
+			}
+
 			/* until radix tree lookup accepts end_index */
 			if (page->index > end)
 				continue;
@@ -369,6 +384,7 @@ int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
 				ret = -EIO;
 		}
 		pagevec_release(&pvec);
+		last_index = index;
 		cond_resched();
 	}
 out:
@@ -987,14 +1003,21 @@ EXPORT_SYMBOL(find_get_entry);
  * Looks up the page cache slot at @mapping & @offset.  If there is a
  * page cache page, it is returned with an increased refcount.
  *
+ * Note that this will also return hmm special entries.
+ *
  * Otherwise, %NULL is returned.
  */
 struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
 {
 	struct page *page = find_get_entry(mapping, offset);
 
-	if (radix_tree_exceptional_entry(page))
-		page = NULL;
+	if (radix_tree_exceptional_entry(page)) {
+		swp_entry_t swap = radix_to_swp_entry(page);
+
+		if (!is_hmm_entry(swap)) {
+			page = NULL;
+		}
+	}
 	return page;
 }
 EXPORT_SYMBOL(find_get_page);
@@ -1044,6 +1067,8 @@ EXPORT_SYMBOL(find_lock_entry);
  * page cache page, it is returned locked and with an increased
  * refcount.
  *
+ * Note that this will also return hmm special entries.
+ *
  * Otherwise, %NULL is returned.
  *
  * find_lock_page() may sleep.
@@ -1052,8 +1077,13 @@ struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
 {
 	struct page *page = find_lock_entry(mapping, offset);
 
-	if (radix_tree_exceptional_entry(page))
-		page = NULL;
+	if (radix_tree_exceptional_entry(page)) {
+		swp_entry_t swap = radix_to_swp_entry(page);
+
+		if (!is_hmm_entry(swap)) {
+			page = NULL;
+		}
+	}
 	return page;
 }
 EXPORT_SYMBOL(find_lock_page);
@@ -1222,6 +1252,12 @@ repeat:
 				WARN_ON(iter.index);
 				goto restart;
 			}
+			if (is_hmm_entry(radix_to_swp_entry(page))) {
+				/* This is an hmm special entry, the page has
+				 * been migrated to some device memory.
+				 */
+				goto export;
+			}
 			/*
 			 * A shadow entry of a recently evicted page,
 			 * or a swap entry from shmem/tmpfs.  Skip
@@ -1239,6 +1275,7 @@ repeat:
 			goto repeat;
 		}
 
+export:
 		pages[ret] = page;
 		if (++ret == nr_pages)
 			break;
@@ -1289,6 +1326,12 @@ repeat:
 				 */
 				goto restart;
 			}
+			if (is_hmm_entry(radix_to_swp_entry(page))) {
+				/* This is an hmm special entry, the page has
+				 * been migrated to some device memory.
+				 */
+				goto export;
+			}
 			/*
 			 * A shadow entry of a recently evicted page,
 			 * or a swap entry from shmem/tmpfs.  Stop
@@ -1316,6 +1359,7 @@ repeat:
 			break;
 		}
 
+export:
 		pages[ret] = page;
 		if (++ret == nr_pages)
 			break;
@@ -1342,6 +1386,7 @@ unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 	struct radix_tree_iter iter;
 	void **slot;
 	unsigned ret = 0;
+	pgoff_t index_last = *index;
 
 	if (unlikely(!nr_pages))
 		return 0;
@@ -1365,6 +1410,12 @@ repeat:
 				 */
 				goto restart;
 			}
+			if (is_hmm_entry(radix_to_swp_entry(page))) {
+				/* This is an hmm special entry, the page has
+				 * been migrated to some device memory.
+				 */
+				goto export;
+			}
 			/*
 			 * A shadow entry of a recently evicted page.
 			 *
@@ -1388,6 +1439,8 @@ repeat:
 			goto repeat;
 		}
 
+export:
+		index_last = iter.index;
 		pages[ret] = page;
 		if (++ret == nr_pages)
 			break;
@@ -1396,7 +1449,7 @@ repeat:
 	rcu_read_unlock();
 
 	if (ret)
-		*index = pages[ret - 1]->index + 1;
+		*index = index_last + 1;
 
 	return ret;
 }
@@ -1420,6 +1473,13 @@ grab_cache_page_nowait(struct address_space *mapping, pgoff_t index)
 {
 	struct page *page = find_get_page(mapping, index);
 
+	if (radix_tree_exceptional_entry(page)) {
+		/* This only happens if the page was migrated to remote memory
+		 * and the fs code knows how to handle the case, thus it is
+		 * safe to return the special entry.
+		 */
+		return page;
+	}
 	if (page) {
 		if (trylock_page(page))
 			return page;
@@ -1497,6 +1557,13 @@ static ssize_t do_generic_file_read(struct file *filp, loff_t *ppos,
 		cond_resched();
 find_page:
 		page = find_get_page(mapping, index);
+		if (radix_tree_exceptional_entry(page)) {
+			swp_entry_t swap = radix_to_swp_entry(page);
+
+			/* FIXME How to handle hmm migration failure ? */
+			hmm_pagecache_migrate(mapping, swap);
+			goto find_page;
+		}
 		if (!page) {
 			page_cache_sync_readahead(mapping,
 					ra, filp,
@@ -1879,7 +1946,15 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	/*
 	 * Do we have something in the page cache already?
 	 */
+find_page:
 	page = find_get_page(mapping, offset);
+	if (radix_tree_exceptional_entry(page)) {
+		swp_entry_t swap = radix_to_swp_entry(page);
+
+		/* FIXME How to handle hmm migration failure ? */
+		hmm_pagecache_migrate(mapping, swap);
+		goto find_page;
+	}
 	if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) {
 		/*
 		 * We found the page, so try async readahead before
@@ -2145,6 +2220,13 @@ static struct page *__read_cache_page(struct address_space *mapping,
 	int err;
 repeat:
 	page = find_get_page(mapping, index);
+	if (radix_tree_exceptional_entry(page)) {
+		swp_entry_t swap = radix_to_swp_entry(page);
+
+		/* FIXME How to handle hmm migration failure ? */
+		hmm_pagecache_migrate(mapping, swap);
+		goto repeat;
+	}
 	if (!page) {
 		page = __page_cache_alloc(gfp | __GFP_COLD);
 		if (!page)
@@ -2442,6 +2524,13 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
 		gfp_notmask = __GFP_FS;
 repeat:
 	page = find_lock_page(mapping, index);
+	if (radix_tree_exceptional_entry(page)) {
+		swp_entry_t swap = radix_to_swp_entry(page);
+
+		/* FIXME How to handle hmm migration failure ? */
+		hmm_pagecache_migrate(mapping, swap);
+		goto repeat;
+	}
 	if (page)
 		goto found;
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 599d4f6..0d97762 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -61,6 +61,7 @@
 #include <linux/wait.h>
 #include <linux/interval_tree_generic.h>
 #include <linux/mman.h>
+#include <linux/buffer_head.h>
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
 #include <linux/delay.h>
@@ -656,6 +657,7 @@ static void hmm_rmem_init(struct hmm_rmem *rmem,
 {
 	kref_init(&rmem->kref);
 	rmem->device = device;
+	rmem->mapping = NULL;
 	rmem->fuid = 0;
 	rmem->luid = 0;
 	rmem->pfns = NULL;
@@ -923,9 +925,13 @@ static void hmm_rmem_clear_range(struct hmm_rmem *rmem,
 			sync_mm_rss(vma->vm_mm);
 		}
 
-		/* Properly uncharge memory. */
-		mem_cgroup_uncharge_mm(vma->vm_mm);
-		add_mm_counter(vma->vm_mm, MM_ANONPAGES, -1);
+		if (!test_bit(HMM_PFN_FILE, &rmem->pfns[idx])) {
+			/* Properly uncharge memory. */
+			mem_cgroup_uncharge_mm(vma->vm_mm);
+			add_mm_counter(vma->vm_mm, MM_ANONPAGES, -1);
+		} else {
+			add_mm_counter(vma->vm_mm, MM_FILEPAGES, -1);
+		}
 	}
 }
 
@@ -1064,8 +1070,10 @@ static int hmm_rmem_remap_page(struct hmm_rmem_mm *rmem_mm,
 			pte = pte_mkdirty(pte);
 		}
 		get_page(page);
-		/* Private anonymous page. */
-		page_add_anon_rmap(page, vma, addr);
+		if (!test_bit(HMM_PFN_FILE, &rmem->pfns[idx])) {
+			/* Private anonymous page. */
+			page_add_anon_rmap(page, vma, addr);
+		}
 		/* FIXME is this necessary ? I do not think so. */
 		if (!reuse_swap_page(page)) {
 			/* Page is still mapped in another process. */
@@ -1149,6 +1157,87 @@ static int hmm_rmem_remap_anon(struct hmm_rmem *rmem,
 	return ret;
 }
 
+static void hmm_rmem_remap_file_single_page(struct hmm_rmem *rmem,
+					    struct page *page)
+{
+	struct address_space *mapping = rmem->mapping;
+	void **slotp;
+
+	list_del_init(&page->lru);
+	spin_lock_irq(&mapping->tree_lock);
+	slotp = radix_tree_lookup_slot(&mapping->page_tree, page->index);
+	if (slotp) {
+		radix_tree_replace_slot(slotp, page);
+		get_page(page);
+	} else {
+		/* This should never happen. */
+		WARN_ONCE(1, "hmm: null slot while remapping !\n");
+	}
+	spin_unlock_irq(&mapping->tree_lock);
+
+	page->mapping = mapping;
+	unlock_page(page);
+	/* To balance putback_lru_page and isolate_lru_page. */
+	get_page(page);
+	putback_lru_page(page);
+	page_remove_rmap(page);
+	page_cache_release(page);
+}
+
+static void hmm_rmem_remap_file(struct hmm_rmem *rmem)
+{
+	struct address_space *mapping = rmem->mapping;
+	unsigned long i, index, uid;
+
+	/* This part is a lot easier than the unmap one. */
+	uid = rmem->fuid;
+	index = rmem->pgoff >> (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	spin_lock_irq(&mapping->tree_lock);
+	for (i = 0; i < hmm_rmem_npages(rmem); ++i, ++uid, ++index) {
+		void *expected, *item, **slotp;
+		struct page *page;
+
+		page = hmm_pfn_to_page(rmem->pfns[i]);
+		if (!page || !test_bit(HMM_PFN_FILE, &rmem->pfns[i])) {
+			continue;
+		}
+		slotp = radix_tree_lookup_slot(&mapping->page_tree, index);
+		if (!slotp) {
+			/* This should never happen. */
+			WARN_ONCE(1, "hmm: null slot while remapping !\n");
+			continue;
+		}
+		item = radix_tree_deref_slot_protected(slotp,
+						       &mapping->tree_lock);
+		expected = swp_to_radix_entry(make_hmm_entry(uid));
+		if (item == expected) {
+			if (!test_bit(HMM_PFN_LMEM_UPTODATE, &rmem->pfns[i])) {
+				/* FIXME Something was wrong for read back. */
+				ClearPageUptodate(page);
+			}
+			page->mapping = mapping;
+			get_page(page);
+			radix_tree_replace_slot(slotp, page);
+		} else {
+			WARN_ONCE(1, "hmm: expect 0x%p got 0x%p\n",
+				  expected, item);
+		}
+	}
+	spin_unlock_irq(&mapping->tree_lock);
+
+	for (i = 0; i < hmm_rmem_npages(rmem); ++i, ++uid, ++index) {
+		struct page *page;
+
+		page = hmm_pfn_to_page(rmem->pfns[i]);
+		page->mapping = mapping;
+		if (test_bit(HMM_PFN_DIRTY, &rmem->pfns[i])) {
+			set_page_dirty(page);
+		}
+		unlock_page(page);
+		clear_bit(HMM_PFN_LOCK, &rmem->pfns[i]);
+	}
+}
+
 static int hmm_rmem_unmap_anon_page(struct hmm_rmem_mm *rmem_mm,
 				    unsigned long addr,
 				    pte_t *ptep,
@@ -1230,6 +1319,94 @@ static int hmm_rmem_unmap_anon_page(struct hmm_rmem_mm *rmem_mm,
 	return 0;
 }
 
+static int hmm_rmem_unmap_file_page(struct hmm_rmem_mm *rmem_mm,
+				    unsigned long addr,
+				    pte_t *ptep,
+				    pmd_t *pmdp)
+{
+	struct vm_area_struct *vma = rmem_mm->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	struct hmm_rmem *rmem = rmem_mm->rmem;
+	unsigned long idx, uid;
+	struct page *page;
+	pte_t pte;
+
+	/* New pte value. */
+	uid = rmem_mm->fuid + ((addr - rmem_mm->faddr) >> PAGE_SHIFT);
+	idx = uid - rmem->fuid;
+	pte = ptep_get_and_clear_full(mm, addr, ptep, rmem_mm->tlb.fullmm);
+	tlb_remove_tlb_entry((&rmem_mm->tlb), ptep, addr);
+
+	if (pte_none(pte)) {
+		rmem_mm->laddr = addr + PAGE_SIZE;
+		return 0;
+	}
+	if (!pte_present(pte)) {
+		swp_entry_t entry;
+
+		if (pte_file(pte)) {
+			/* Definitely a fault as we do not support migrating
+			 * non-linear vmas to remote memory.
+			 */
+			WARN_ONCE(1, "hmm: was trying to migrate non linear vma.\n");
+			return -EBUSY;
+		}
+		entry = pte_to_swp_entry(pte);
+		if (unlikely(non_swap_entry(entry))) {
+			/* This cannot happen! At this point no other process
+			 * knows about this page or has pending operations on
+			 * it besides read operations.
+			 *
+			 * There can be no mm event happening (no migration or
+			 * anything else) that would set a special pte.
+			 */
+			WARN_ONCE(1, "hmm: unhandled pte value 0x%016llx.\n",
+				  (long long)pte_val(pte));
+			return -EBUSY;
+		}
+		/* FIXME free swap ? This was pointing to swap entry of shmem shared memory. */
+		return 0;
+	}
+
+	flush_cache_page(vma, addr, pte_pfn(pte));
+	page = pfn_to_page(pte_pfn(pte));
+	if (PageAnon(page)) {
+		page = hmm_pfn_to_page(rmem->pfns[idx]);
+		list_add_tail(&page->lru, &rmem_mm->remap_pages);
+		rmem->pfns[idx] = pte_pfn(pte);
+		set_bit(HMM_PFN_VALID_PAGE, &rmem->pfns[idx]);
+		set_bit(HMM_PFN_WRITE, &rmem->pfns[idx]);
+		if (pte_dirty(pte)) {
+			set_bit(HMM_PFN_DIRTY, &rmem->pfns[idx]);
+		}
+		page = pfn_to_page(pte_pfn(pte));
+		pte = swp_entry_to_pte(make_hmm_entry(uid));
+		set_pte_at(mm, addr, ptep, pte);
+		/* tlb_flush_mmu drops one ref so take an extra ref here. */
+		get_page(page);
+	} else {
+		VM_BUG_ON(page != hmm_pfn_to_page(rmem->pfns[idx]));
+		set_bit(HMM_PFN_VALID_PAGE, &rmem->pfns[idx]);
+		if (pte_write(pte)) {
+			set_bit(HMM_PFN_FS_WRITEABLE, &rmem->pfns[idx]);
+		}
+		if (pte_dirty(pte)) {
+			set_bit(HMM_PFN_DIRTY, &rmem->pfns[idx]);
+		}
+		set_bit(HMM_PFN_FILE, &rmem->pfns[idx]);
+		add_mm_counter(mm, MM_FILEPAGES, -1);
+		page_remove_rmap(page);
+		/* Unlike for anonymous pages, do not take an extra reference
+		 * as we are already holding one.
+		 */
+	}
+
+	rmem_mm->force_flush = !__tlb_remove_page(&rmem_mm->tlb, page);
+	rmem_mm->laddr = addr + PAGE_SIZE;
+
+	return 0;
+}
+
 static int hmm_rmem_unmap_pmd(pmd_t *pmdp,
 			      unsigned long addr,
 			      unsigned long next,
@@ -1262,15 +1439,29 @@ static int hmm_rmem_unmap_pmd(pmd_t *pmdp,
 again:
 	ptep = pte_offset_map_lock(vma->vm_mm, pmdp, addr, &ptl);
 	arch_enter_lazy_mmu_mode();
-	for (; addr != next; ++ptep, addr += PAGE_SIZE) {
-		ret = hmm_rmem_unmap_anon_page(rmem_mm, addr,
-					       ptep, pmdp);
-		if (ret || rmem_mm->force_flush) {
-			/* Increment ptep so unlock works on correct
-			 * pte.
-			 */
-			ptep++;
-			break;
+	if (vma->vm_file) {
+		for (; addr != next; ++ptep, addr += PAGE_SIZE) {
+			ret = hmm_rmem_unmap_file_page(rmem_mm, addr,
+						       ptep, pmdp);
+			if (ret || rmem_mm->force_flush) {
+				/* Increment ptep so unlock works on correct
+				 * pte.
+				 */
+				ptep++;
+				break;
+			}
+		}
+	} else {
+		for (; addr != next; ++ptep, addr += PAGE_SIZE) {
+			ret = hmm_rmem_unmap_anon_page(rmem_mm, addr,
+						       ptep, pmdp);
+			if (ret || rmem_mm->force_flush) {
+				/* Increment ptep so unlock works on correct
+				 * pte.
+				 */
+				ptep++;
+				break;
+			}
 		}
 	}
 	arch_leave_lazy_mmu_mode();
@@ -1321,6 +1512,7 @@ static int hmm_rmem_unmap_anon(struct hmm_rmem *rmem,
 
 	npages = (laddr - faddr) >> PAGE_SHIFT;
 	rmem->pgoff = faddr;
+	rmem->mapping = NULL;
 	rmem_mm.vma = vma;
 	rmem_mm.rmem = rmem;
 	rmem_mm.faddr = faddr;
@@ -1362,13 +1554,433 @@ static int hmm_rmem_unmap_anon(struct hmm_rmem *rmem,
 	return ret;
 }
 
+static int hmm_rmem_unmap_file(struct hmm_rmem *rmem,
+			       struct vm_area_struct *vma,
+			       unsigned long faddr,
+			       unsigned long laddr)
+{
+	struct address_space *mapping;
+	struct hmm_rmem_mm rmem_mm;
+	struct mm_walk walk = {0};
+	unsigned long addr, i, index, npages, uid;
+	struct page *page, *tmp;
+	int ret;
+
+	npages = hmm_rmem_npages(rmem);
+	rmem->pgoff = vma->vm_pgoff + ((faddr - vma->vm_start) >> PAGE_SHIFT);
+	rmem->mapping = vma->vm_file->f_mapping;
+	rmem_mm.vma = vma;
+	rmem_mm.rmem = rmem;
+	rmem_mm.faddr = faddr;
+	rmem_mm.laddr = faddr;
+	rmem_mm.fuid = rmem->fuid;
+	INIT_LIST_HEAD(&rmem_mm.remap_pages);
+	memset(rmem->pfns, 0, sizeof(long) * npages);
+
+	i = 0;
+	uid = rmem->fuid;
+	addr = faddr;
+	index = rmem->pgoff >> (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	mapping = rmem->mapping;
+
+	/* Probably the most complex part of the code as it needs to serialize
+	 * against various memory and filesystem events. The range we are
+	 * trying to migrate can be undergoing writeback, direct_IO, read,
+	 * write or simply mm events such as page reclamation, page
+	 * migration, ...
+	 *
+	 * We need to get exclusive access to all the pages in the range so
+	 * that no other process accesses them or tries to do anything with
+	 * them. The trick is to set page->mapping to NULL so that anyone with
+	 * a reference on the page will think that the page was either
+	 * reclaimed, migrated or truncated. Any code that sees that either
+	 * skips the page or retries a find_get_page, which will result in
+	 * getting the hmm special swap value.
+	 *
+	 * This is a multistep process. First we update the pagecache to point
+	 * to the special hmm swap entry so that any new event coming in sees
+	 * it and can block the migration. While updating the pagecache we
+	 * also make sure it is fully populated. We also try to lock all the
+	 * pages we can so that no other process can lock them for write,
+	 * direct_IO or anything else that requires the page lock.
+	 *
+	 * Once the pagecache is updated we proceed to lock all the unlocked
+	 * pages and to isolate them from the lru as we do not want any of
+	 * them to be reclaimed while doing the migration. We also make sure
+	 * each page is Uptodate and read it back from the disk if not.
+	 *
+	 * The next step is to unmap the range from the process address space
+	 * for which the migration is happening. We do so because we need to
+	 * account all the pages against this process so that on migration
+	 * back unaccounting can be done consistently.
+	 *
+	 * Finally the last step is to unmap for all other processes; after
+	 * this the only thing that can still be happening is that some pages
+	 * are undergoing read or writeback, both of which are fine.
+	 *
+	 * To know up to which step exactly each page went, we use various hmm
+	 * pfn flags so that the error handling code can take proper action to
+	 * restore each page to its original state.
+	 */
+
+retry:
+	if (rmem->event->backoff) {
+		npages = i;
+		ret = -EBUSY;
+		goto out;
+	}
+	spin_lock_irq(&mapping->tree_lock);
+	for (; i < npages; ++i, ++uid, ++index, addr += PAGE_SIZE){
+		void *item, **slotp;
+		int error;
+
+		slotp = radix_tree_lookup_slot(&mapping->page_tree, index);
+		if (!slotp) {
+			spin_unlock_irq(&mapping->tree_lock);
+			page = page_cache_alloc_cold(mapping);
+			if (!page) {
+				npages = i;
+				ret = -ENOMEM;
+				goto out;
+			}
+			ret = add_to_page_cache_lru(page, mapping,
+						    index, GFP_KERNEL);
+			if (ret) {
+				page_cache_release(page);
+				if (ret == -EEXIST) {
+					goto retry;
+				}
+				npages = i;
+				goto out;
+			}
+			/* A previous I/O error may have been due to temporary
+			 * failures, eg. multipath errors. PG_error will be set
+			 * again if readpage fails.
+			 *
+			 * FIXME I do not think this is necessary.
+			 */
+			ClearPageError(page);
+			/* Start the read. The read will unlock the page. */
+			error = mapping->a_ops->readpage(vma->vm_file, page);
+			page_cache_release(page);
+			if (error) {
+				npages = i;
+				ret = -EBUSY;
+				goto out;
+			}
+			goto retry;
+		}
+		item = radix_tree_deref_slot_protected(slotp,
+						       &mapping->tree_lock);
+		if (radix_tree_exceptional_entry(item)) {
+			swp_entry_t entry = radix_to_swp_entry(item);
+
+			/* The case of a private mapping of a file makes things
+			 * interesting as both shared and private anonymous
+			 * pages can exist in such an rmem object.
+			 *
+			 * For now we just force them to go back to lmem;
+			 * supporting this would require another level of
+			 * indirection.
+			 */
+			if (!is_hmm_entry(entry)) {
+				spin_unlock_irq(&mapping->tree_lock);
+				npages = i;
+				ret = -EBUSY;
+				goto out;
+			}
+			/* FIXME handle shmem swap entry or some other device
+			 */
+			spin_unlock_irq(&mapping->tree_lock);
+			npages = i;
+			ret = -EBUSY;
+			goto out;
+		}
+		page = item;
+		if (unlikely(PageMlocked(page))) {
+			spin_unlock_irq(&mapping->tree_lock);
+			npages = i;
+			ret = -EBUSY;
+			goto out;
+		}
+		item = swp_to_radix_entry(make_hmm_entry(uid));
+		radix_tree_replace_slot(slotp, item);
+		rmem->pfns[i] = page_to_pfn(page) << HMM_PFN_SHIFT;
+		set_bit(HMM_PFN_VALID_PAGE, &rmem->pfns[i]);
+		set_bit(HMM_PFN_FILE, &rmem->pfns[i]);
+		rmem_mm.laddr = addr + PAGE_SIZE;
+
+		/* Pretend the page is being mapped, this makes the error code
+		 * handling a lot simpler and cleaner.
+		 */
+		page_add_file_rmap(page);
+		add_mm_counter(vma->vm_mm, MM_FILEPAGES, 1);
+
+		if (trylock_page(page)) {
+			set_bit(HMM_PFN_LOCK, &rmem->pfns[i]);
+			if (page->mapping != mapping) {
+				/* Page has been truncated. */
+				spin_unlock_irq(&mapping->tree_lock);
+				npages = i;
+				ret = -EBUSY;
+				goto out;
+			}
+		}
+		if (PageWriteback(page)) {
+			set_bit(HMM_PFN_WRITEBACK, &rmem->pfns[i]);
+		}
+	}
+	spin_unlock_irq(&mapping->tree_lock);
+
+	/* At this point any unlocked page can still be referenced by various
+	 * file activities (read, write, splice, ...). But no new mapping can
+	 * be instantiated as the pagecache is now updated with the special
+	 * entries.
+	 */
+
+	if (rmem->event->backoff) {
+		ret = -EBUSY;
+		goto out;
+	}
+
+	for (i = 0; i < npages; ++i) {
+		page = hmm_pfn_to_page(rmem->pfns[i]);
+		ret = isolate_lru_page(page);
+		if (ret) {
+			goto out;
+		}
+		/* isolate_lru_page takes an extra reference which we do not
+		 * want, as we are already holding a reference on the page.
+		 * Holding only one reference simplifies the error code path,
+		 * which then knows that we hold exactly one reference for
+		 * each page and does not need to track whether an extra
+		 * reference from isolate_lru_page is held or not.
+		 */
+		put_page(page);
+		if (!test_bit(HMM_PFN_LOCK, &rmem->pfns[i])) {
+			lock_page(page);
+			set_bit(HMM_PFN_LOCK, &rmem->pfns[i]);
+			/* Has the page been truncated ? */
+			if (page->mapping != mapping) {
+				ret = -EBUSY;
+				goto out;
+			}
+		}
+		if (unlikely(!PageUptodate(page))) {
+			int error;
+
+			/* A previous I/O error may have been due to temporary
+			 * failures, eg. multipath errors. PG_error will be set
+			 * again if readpage fails.
+			 */
+			ClearPageError(page);
+			/* The read will unlock the page which is ok because no
+			 * one else knows about this page at this point.
+			 */
+			error = mapping->a_ops->readpage(vma->vm_file, page);
+			if (error) {
+				ret = -EBUSY;
+				goto out;
+			}
+			lock_page(page);
+		}
+		set_bit(HMM_PFN_LMEM_UPTODATE, &rmem->pfns[i]);
+	}
+
+	/* At this point all pages are locked, which means that their content
+	 * is stable. Because we will reset the page->mapping field we also
+	 * know that anyone holding a reference on a page will retry the
+	 * lookup or skip the current operation.
+	 *
+	 * Also at this point no one can be unmapping those pages from the vma
+	 * as the hmm event prevents any mmu_notifier invalidation from
+	 * proceeding until we are done.
+	 *
+	 * We need to unmap the pages from the vma ourselves so we can
+	 * properly update the mm counters.
+	 */
+
+	if (rmem->event->backoff) {
+		ret = -EBUSY;
+		goto out;
+	}
+
+	if (current->mm == vma->vm_mm) {
+		sync_mm_rss(vma->vm_mm);
+	}
+	rmem_mm.force_flush = 0;
+	walk.pmd_entry = hmm_rmem_unmap_pmd;
+	walk.mm = vma->vm_mm;
+	walk.private = &rmem_mm;
+
+	mmu_notifier_invalidate_range_start(walk.mm,vma,faddr,laddr,MMU_HMM);
+	tlb_gather_mmu(&rmem_mm.tlb, walk.mm, faddr, laddr);
+	tlb_start_vma(&rmem_mm.tlb, rmem_mm.vma);
+	ret = walk_page_range(faddr, laddr, &walk);
+	tlb_end_vma(&rmem_mm.tlb, rmem_mm.vma);
+	tlb_finish_mmu(&rmem_mm.tlb, faddr, laddr);
+	mmu_notifier_invalidate_range_end(walk.mm, vma, faddr, laddr, MMU_HMM);
+
+	/* Remap any pages that were replaced by anonymous page. */
+	list_for_each_entry_safe (page, tmp, &rmem_mm.remap_pages, lru) {
+		hmm_rmem_remap_file_single_page(rmem, page);
+	}
+
+	if (ret) {
+		npages = (rmem_mm.laddr - rmem_mm.faddr) >> PAGE_SHIFT;
+		goto out;
+	}
+
+	/* Now unmap from all other processes. */
+
+	if (rmem->event->backoff) {
+		ret = -EBUSY;
+		goto out;
+	}
+
+	for (i = 0, ret = 0; i < npages; ++i) {
+		page = hmm_pfn_to_page(rmem->pfns[i]);
+
+		if (!test_bit(HMM_PFN_FILE, &rmem->pfns[i])) {
+			continue;
+		}
+
+		/* Because we did call page_add_file_rmap the mapcount must be
+		 * at least one. This was done to avoid page_remove_rmap
+		 * updating the memcg and mm statistics.
+		 */
+		BUG_ON(page_mapcount(page) <= 0);
+		if (page_mapcount(page) > 1) {
+			try_to_unmap(page,
+					 TTU_HMM |
+					 TTU_IGNORE_MLOCK |
+					 TTU_IGNORE_ACCESS);
+			if (page_mapcount(page) > 1) {
+				ret = ret ? ret : -EBUSY;
+			} else {
+				/* Everyone will think the page has been
+				 * migrated, truncated or reclaimed.
+				 */
+				page->mapping = NULL;
+			}
+		} else {
+			/* Everyone will think the page has been migrated,
+			 * truncated or reclaimed.
+			 */
+			page->mapping = NULL;
+		}
+		/* At this point no one else can write to the page. Save the
+		 * dirty bit and check it when handling faults.
+		 */
+		if (PageDirty(page)) {
+			set_bit(HMM_PFN_DIRTY, &rmem->pfns[i]);
+			ClearPageDirty(page);
+		}
+	}
+
+	/* This was a long journey but at this point hmm has exclusive
+	 * ownership of all the pages, all of them are accounted against the
+	 * process mm, and all are Uptodate and ready to be copied to remote
+	 * memory.
+	 */
+out:
+	if (ret) {
+		/* Unaccount any unmapped pages. */
+		for (i = 0; i < npages; ++i) {
+			if (test_bit(HMM_PFN_FILE, &rmem->pfns[i])) {
+				add_mm_counter(walk.mm, MM_FILEPAGES, -1);
+			}
+		}
+	}
+	return ret;
+}
+
+static int hmm_rmem_file_mkwrite(struct hmm_rmem *rmem,
+				 struct vm_area_struct *vma,
+				 unsigned long addr,
+				 unsigned long uid)
+{
+	struct vm_fault vmf;
+	unsigned long idx = uid - rmem->fuid;
+	struct page *page;
+	int r;
+
+	page = hmm_pfn_to_page(rmem->pfns[idx]);
+	if (test_bit(HMM_PFN_FS_WRITEABLE, &rmem->pfns[idx])) {
+		lock_page(page);
+		page->mapping = rmem->mapping;
+		goto release;
+	}
+
+	vmf.virtual_address = (void __user *)(addr & PAGE_MASK);
+	vmf.pgoff = page->index;
+	vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
+	vmf.page = page;
+	page->mapping = rmem->mapping;
+	page_cache_get(page);
+
+	r = vma->vm_ops->page_mkwrite(vma, &vmf);
+	if (unlikely(r & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
+		page_cache_release(page);
+		return -EFAULT;
+	}
+	if (unlikely(!(r & VM_FAULT_LOCKED))) {
+		lock_page(page);
+		if (!page->mapping) {
+
+			WARN_ONCE(1, "hmm: page can not be truncated while in rmem !\n");
+			unlock_page(page);
+			page_cache_release(page);
+			return -EFAULT;
+		}
+	}
+	set_bit(HMM_PFN_FS_WRITEABLE, &rmem->pfns[idx]);
+	/* Ok to put_page here as we hold another reference. */
+	page_cache_release(page);
+
+release:
+	/* We clear the writeback flag now to forbid any new writeback. The
+	 * writeback code will need to go through its slow path to set the
+	 * writeback flag again.
+	 */
+	clear_bit(HMM_PFN_WRITEBACK, &rmem->pfns[idx]);
+	/* Now wait for any in progress writeback. */
+	if (PageWriteback(page)) {
+		wait_on_page_writeback(page);
+	}
+	/* The page count is what we use to synchronize with writeback. The
+	 * writeback code takes an extra reference on the page before handing
+	 * it to the fs writeback code, so if we see that extra reference here
+	 * we forbid the change.
+	 *
+	 * However, as we just waited for pending writeback above, if a
+	 * writeback was already scheduled then at this point it is done and
+	 * it should have dropped the extra reference, thus the rmem can be
+	 * written to again.
+	 */
+	if (page_count(page) > (1 + page_has_private(page))) {
+		page->mapping = NULL;
+		unlock_page(page);
+		return -EBUSY;
+	}
+	/* Nobody should have written to that page, thus nobody should have
+	 * set the dirty bit.
+	 */
+	BUG_ON(PageDirty(page));
+
+	/* Restore page count. */
+	page->mapping = NULL;
+	clear_bit(HMM_PFN_LMEM_UPTODATE, &rmem->pfns[idx]);
+	/* Ok now device can write to rmem. */
+	set_bit(HMM_PFN_WRITE, &rmem->pfns[idx]);
+	unlock_page(page);
+
+	return 0;
+}
+
 static inline int hmm_rmem_unmap(struct hmm_rmem *rmem,
 				 struct vm_area_struct *vma,
 				 unsigned long faddr,
 				 unsigned long laddr)
 {
 	if (vma->vm_file) {
-		return -EBUSY;
+		return hmm_rmem_unmap_file(rmem, vma, faddr, laddr);
 	} else {
 		return hmm_rmem_unmap_anon(rmem, vma, faddr, laddr);
 	}
@@ -1402,6 +2014,34 @@ static int hmm_rmem_alloc_pages(struct hmm_rmem *rmem,
 			vma = mm ? find_vma(mm, addr) : NULL;
 		}
 
+		page = hmm_pfn_to_page(pfns[i]);
+		if (page && test_bit(HMM_PFN_VALID_PAGE, &pfns[i])) {
+			BUG_ON(test_bit(HMM_PFN_LOCK, &pfns[i]));
+			lock_page(page);
+			set_bit(HMM_PFN_LOCK, &pfns[i]);
+
+			/* Fake one mapping so that page_remove_rmap behaves as
+			 * we want.
+			 */
+			BUG_ON(page_mapcount(page));
+			atomic_set(&page->_mapcount, 0);
+
+			spin_lock(&rmem->lock);
+			if (test_bit(HMM_PFN_WRITEBACK, &pfns[i])) {
+				/* Clear the bit first, it is fine because any
+				 * thread that will test the bit will first
+				 * check the rmem->event and at this point it
+				 * is set to the migration event.
+				 */
+				clear_bit(HMM_PFN_WRITEBACK, &pfns[i]);
+				spin_unlock(&rmem->lock);
+				wait_on_page_writeback(page);
+			} else {
+				spin_unlock(&rmem->lock);
+			}
+			continue;
+		}
+
 		/* No need to clear page they will be dma to of course this does
 		 * means we trust the device driver.
 		 */
@@ -1482,7 +2122,7 @@ int hmm_rmem_migrate_to_lmem(struct hmm_rmem *rmem,
 						 range->laddr,
 						 range->fuid,
 						 HMM_MIGRATE_TO_LMEM,
-						 false);
+						 !!(range->rmem->mapping));
 		if (IS_ERR(fence)) {
 			ret = PTR_ERR(fence);
 			goto error;
@@ -1517,6 +2157,19 @@ int hmm_rmem_migrate_to_lmem(struct hmm_rmem *rmem,
 		}
 	}
 
+	/* Sanity check the driver. */
+	for (i = 0; i < hmm_rmem_npages(rmem); ++i) {
+		if (!test_bit(HMM_PFN_LMEM_UPTODATE, &rmem->pfns[i])) {
+			WARN_ONCE(1, "hmm: driver failed to set HMM_PFN_LMEM_UPTODATE.\n");
+			ret = -EINVAL;
+			goto error;
+		}
+	}
+
+	if (rmem->mapping) {
+		hmm_rmem_remap_file(rmem);
+	}
+
 	/* Now the remote memory is officialy dead and nothing below can fails
 	 * badly.
 	 */
@@ -1526,6 +2179,13 @@ int hmm_rmem_migrate_to_lmem(struct hmm_rmem *rmem,
 	 * ranges list.
 	 */
 	list_for_each_entry_safe (range, next, &rmem->ranges, rlist) {
+		if (rmem->mapping) {
+			add_mm_counter(range->mirror->hmm->mm, MM_FILEPAGES,
+				       -hmm_range_npages(range));
+			hmm_range_fini(range);
+			continue;
+		}
+
 		VM_BUG_ON(!vma);
 		VM_BUG_ON(range->faddr < vma->vm_start);
 		VM_BUG_ON(range->laddr > vma->vm_end);
@@ -1544,8 +2204,20 @@ int hmm_rmem_migrate_to_lmem(struct hmm_rmem *rmem,
 	for (i = 0; i < hmm_rmem_npages(rmem); ++i) {
 		struct page *page = hmm_pfn_to_page(rmem->pfns[i]);
 
-		unlock_page(page);
-		mem_cgroup_transfer_charge_anon(page, mm);
+		/* The HMM_PFN_FILE bit is only set for pages that are in the
+		 * pagecache and thus are already accounted properly. So when
+		 * it is unset this is a private anonymous page for which we
+		 * need to transfer the charge.
+		 *
+		 * If remapping failed then page_remove_rmap below will update
+		 * the memcg and mm properly.
+		 */
+		if (mm && !test_bit(HMM_PFN_FILE, &rmem->pfns[i])) {
+			mem_cgroup_transfer_charge_anon(page, mm);
+		}
+		if (test_bit(HMM_PFN_LOCK, &rmem->pfns[i])) {
+			unlock_page(page);
+		}
 		page_remove_rmap(page);
 		page_cache_release(page);
 		rmem->pfns[i] = 0UL;
@@ -1563,6 +2235,19 @@ error:
 	 * (2) rmem is mirroring private memory, easy case poison all ranges
 	 *     referencing the rmem.
 	 */
+	if (rmem->mapping) {
+		/* No matter what, try to copy back the data. The driver
+		 * should be clever and not copy over pages that have the
+		 * HMM_PFN_LMEM_UPTODATE bit set.
+		 */
+		fence = device->ops->rmem_to_lmem(rmem, rmem->fuid, rmem->luid);
+		if (fence && !IS_ERR(fence)) {
+			INIT_LIST_HEAD(&fence->list);
+			ret = hmm_device_fence_wait(device, fence);
+		}
+		/* FIXME how to handle error ? Mark page with error ? */
+		hmm_rmem_remap_file(rmem);
+	}
 	for (i = 0; i < hmm_rmem_npages(rmem); ++i) {
 		struct page *page = hmm_pfn_to_page(rmem->pfns[i]);
 
@@ -1573,9 +2258,11 @@ error:
 			}
 			continue;
 		}
-		/* Properly uncharge memory. */
-		mem_cgroup_transfer_charge_anon(page, mm);
-		if (!test_bit(HMM_PFN_LOCK, &rmem->pfns[i])) {
+		if (!test_bit(HMM_PFN_FILE, &rmem->pfns[i])) {
+			/* Properly uncharge memory. */
+			mem_cgroup_transfer_charge_anon(page, mm);
+		}
+		if (test_bit(HMM_PFN_LOCK, &rmem->pfns[i])) {
 			unlock_page(page);
 		}
 		page_remove_rmap(page);
@@ -1583,6 +2270,15 @@ error:
 		rmem->pfns[i] = 0UL;
 	}
 	list_for_each_entry_safe (range, next, &rmem->ranges, rlist) {
+		/* FIXME Philosophical question: should we poison other
+		 * processes that access this shared file ?
+		 */
+		if (rmem->mapping) {
+			add_mm_counter(range->mirror->hmm->mm, MM_FILEPAGES,
+				       -hmm_range_npages(range));
+			/* Case (1) FIXME implement ! */
+			hmm_range_fini(range);
+			continue;
+		}
+
 		mm = range->mirror->hmm->mm;
 		hmm_rmem_poison_range(rmem, mm, NULL, range->faddr,
 				      range->laddr, range->fuid);
@@ -2063,6 +2759,268 @@ int hmm_mm_fault(struct mm_struct *mm,
 	return VM_FAULT_MAJOR;
 }
 
+/* see include/linux/hmm.h */
+void hmm_pagecache_migrate(struct address_space *mapping,
+			   swp_entry_t swap)
+{
+	struct hmm_rmem *rmem = NULL;
+	unsigned long fuid, luid, npages;
+
+	/* This can not happen ! */
+	VM_BUG_ON(!is_hmm_entry(swap));
+
+	fuid = hmm_entry_uid(swap);
+	VM_BUG_ON(!fuid);
+
+	rmem = hmm_rmem_find(fuid);
+	if (!rmem || rmem->dead) {
+		hmm_rmem_unref(rmem);
+		return;
+	}
+
+	/* FIXME use something else than 16 pages. Readahead ? Or just the
+	 * whole range of dirty pages.
+	 */
+	npages = 16;
+	luid = min((fuid - rmem->fuid), (npages >> 2));
+	fuid = fuid - luid;
+	luid = min(fuid + npages, rmem->luid);
+
+	hmm_rmem_migrate_to_lmem(rmem, NULL, 0, fuid, luid, true);
+	hmm_rmem_unref(rmem);
+}
+EXPORT_SYMBOL(hmm_pagecache_migrate);
+
+/* see include/linux/hmm.h */
+struct page *hmm_pagecache_writeback(struct address_space *mapping,
+				     swp_entry_t swap)
+{
+	struct hmm_device *device;
+	struct hmm_range *range, *nrange;
+	struct hmm_fence *fence, *nfence;
+	struct hmm_event event;
+	struct hmm_rmem *rmem = NULL;
+	unsigned long i, uid, idx, npages;
+	/* FIXME hardcoded 16 */
+	struct page *pages[16];
+	bool dirty = false;
+	int ret;
+
+	/* Find the corresponding rmem. */
+	if (!is_hmm_entry(swap)) {
+		BUG();
+		return NULL;
+	}
+	uid = hmm_entry_uid(swap);
+	if (!uid) {
+		/* Poisonous hmm swap entry this can not happen. */
+		BUG();
+		return NULL;
+	}
+
+retry:
+	rmem = hmm_rmem_find(uid);
+	if (!rmem) {
+		/* Someone likely migrated it back to lmem. By returning NULL
+		 * the caller will perform a new lookup.
+		 */
+		return NULL;
+	}
+
+	if (rmem->dead) {
+		/* When dead is set everything is done. */
+		hmm_rmem_unref(rmem);
+		return NULL;
+	}
+
+	idx = uid - rmem->fuid;
+	device = rmem->device;
+	spin_lock(&rmem->lock);
+	if (rmem->event) {
+		if (rmem->event->etype == HMM_MIGRATE_TO_RMEM) {
+			rmem->event->backoff = true;
+		}
+		spin_unlock(&rmem->lock);
+		wait_event(device->wait_queue, rmem->event == NULL);
+		hmm_rmem_unref(rmem);
+		goto retry;
+	}
+	pages[0] = hmm_pfn_to_page(rmem->pfns[idx]);
+	if (!pages[0]) {
+		spin_unlock(&rmem->lock);
+		hmm_rmem_unref(rmem);
+		goto retry;
+	}
+	get_page(pages[0]);
+	if (!trylock_page(pages[0])) {
+		unsigned long orig = rmem->pfns[idx];
+
+		spin_unlock(&rmem->lock);
+		lock_page(pages[0]);
+		spin_lock(&rmem->lock);
+		if (rmem->pfns[idx] != orig) {
+			spin_unlock(&rmem->lock);
+			unlock_page(pages[0]);
+			page_cache_release(pages[0]);
+			hmm_rmem_unref(rmem);
+			goto retry;
+		}
+	}
+	if (test_bit(HMM_PFN_LMEM_UPTODATE, &rmem->pfns[idx])) {
+		dirty = test_bit(HMM_PFN_DIRTY, &rmem->pfns[idx]);
+		set_bit(HMM_PFN_WRITEBACK, &rmem->pfns[idx]);
+		spin_unlock(&rmem->lock);
+		hmm_rmem_unref(rmem);
+		if (dirty) {
+			set_page_dirty(pages[0]);
+		}
+		return pages[0];
+	}
+
+	if (rmem->event) {
+		spin_unlock(&rmem->lock);
+		unlock_page(pages[0]);
+		page_cache_release(pages[0]);
+		wait_event(device->wait_queue, rmem->event == NULL);
+		hmm_rmem_unref(rmem);
+		goto retry;
+	}
+
+	/* Try to batch a few pages.
+	 *
+	 * FIXME use something else than 16 pages. Readahead ? Or just the
+	 * whole range of dirty pages.
+	 */
+	npages = 16;
+	set_bit(HMM_PFN_WRITEBACK, &rmem->pfns[idx]);
+	for (i = 1; i < npages; ++i) {
+		pages[i] = hmm_pfn_to_page(rmem->pfns[idx + i]);
+		if (!trylock_page(pages[i])) {
+			npages = i;
+			break;
+		}
+		if (test_bit(HMM_PFN_LMEM_UPTODATE, &rmem->pfns[idx + i])) {
+			unlock_page(pages[i]);
+			npages = i;
+			break;
+		}
+		set_bit(HMM_PFN_WRITEBACK, &rmem->pfns[idx + i]);
+		get_page(pages[i]);
+	}
+
+	event.etype = HMM_WRITEBACK;
+	event.faddr = uid;
+	event.laddr = uid + npages;
+	rmem->event = &event;
+	INIT_LIST_HEAD(&event.ranges);
+	list_for_each_entry (range, &rmem->ranges, rlist) {
+		list_add_tail(&range->elist, &event.ranges);
+	}
+	spin_unlock(&rmem->lock);
+
+	list_for_each_entry (range, &event.ranges, elist) {
+		unsigned long fuid, faddr, laddr;
+
+		if (event.laddr <  hmm_range_fuid(range) ||
+		    event.faddr >= hmm_range_luid(range)) {
+			continue;
+		}
+		fuid  = max(event.faddr, hmm_range_fuid(range));
+		faddr = fuid - hmm_range_fuid(range);
+		laddr = min(event.laddr, hmm_range_luid(range)) - fuid;
+		faddr = range->faddr + (faddr << PAGE_SHIFT);
+		laddr = range->faddr + (laddr << PAGE_SHIFT);
+		ret = hmm_mirror_rmem_update(range->mirror, rmem, faddr,
+					     laddr, fuid, &event, true);
+		if (ret) {
+			goto error;
+		}
+	}
+
+	list_for_each_entry_safe (fence, nfence, &event.fences, list) {
+		hmm_device_fence_wait(device, fence);
+	}
+
+	/* Event faddr is fuid and laddr is luid. */
+	fence = device->ops->rmem_to_lmem(rmem, event.faddr, event.laddr);
+	if (IS_ERR(fence)) {
+		goto error;
+	}
+	INIT_LIST_HEAD(&fence->list);
+	ret = hmm_device_fence_wait(device, fence);
+	if (ret) {
+		goto error;
+	}
+
+	spin_lock(&rmem->lock);
+	if (!test_bit(HMM_PFN_LMEM_UPTODATE, &rmem->pfns[idx + i])) {
+		/* This should not happen, the driver must set the bit. */
+		WARN_ONCE(1, "hmm: driver failed to set HMM_PFN_LMEM_UPTODATE.\n");
+		goto error;
+	}
+	rmem->event = NULL;
+	dirty = test_bit(HMM_PFN_DIRTY, &rmem->pfns[idx]);
+	list_for_each_entry_safe (range, nrange, &event.ranges, elist) {
+		list_del_init(&range->elist);
+	}
+	spin_unlock(&rmem->lock);
+	/* Do not unlock first page, return it locked. */
+	for (i = 1; i < npages; ++i) {
+		unlock_page(pages[i]);
+		page_cache_release(pages[i]);
+	}
+	wake_up(&device->wait_queue);
+	hmm_rmem_unref(rmem);
+	if (dirty) {
+		set_page_dirty(pages[0]);
+	}
+	return pages[0];
+
+error:
+	for (i = 0; i < npages; ++i) {
+		unlock_page(pages[i]);
+		page_cache_release(pages[i]);
+	}
+	spin_lock(&rmem->lock);
+	rmem->event = NULL;
+	list_for_each_entry_safe (range, nrange, &event.ranges, elist) {
+		list_del_init(&range->elist);
+	}
+	spin_unlock(&rmem->lock);
+	hmm_rmem_unref(rmem);
+	hmm_pagecache_migrate(mapping, swap);
+	return NULL;
+}
+EXPORT_SYMBOL(hmm_pagecache_writeback);
+
+struct page *hmm_pagecache_page(struct address_space *mapping,
+				swp_entry_t swap)
+{
+	struct hmm_rmem *rmem = NULL;
+	struct page *page;
+	unsigned long uid;
+
+	/* Find the corresponding rmem. */
+	if (!is_hmm_entry(swap)) {
+		BUG();
+		return NULL;
+	}
+	uid = hmm_entry_uid(swap);
+	if (!uid) {
+		/* Poisonous hmm swap entry this can not happen. */
+		BUG();
+		return NULL;
+	}
+
+	rmem = hmm_rmem_find(uid);
+	if (!rmem) {
+		/* Someone likely migrated it back to lmem. By returning NULL
+		 * the caller will perform a new lookup.
+		 */
+		return NULL;
+	}
+
+	page = hmm_pfn_to_page(rmem->pfns[uid - rmem->fuid]);
+	get_page(page);
+	hmm_rmem_unref(rmem);
+	return page;
+}
+
 
 
 
@@ -2667,7 +3625,7 @@ static int hmm_mirror_rmem_fault(struct hmm_mirror *mirror,
 {
 	struct hmm_device *device = mirror->device;
 	struct hmm_rmem *rmem = range->rmem;
-	unsigned long fuid, luid, npages;
+	unsigned long i, fuid, luid, npages, uid;
 	int ret;
 
 	if (range->mirror != mirror) {
@@ -2679,6 +3637,77 @@ static int hmm_mirror_rmem_fault(struct hmm_mirror *mirror,
 	fuid = range->fuid + ((faddr - range->faddr) >> PAGE_SHIFT);
 	luid = fuid + npages;
 
+	/* The rmem might not be uptodate so synchronize again. The only way
+	 * this might be the case is if a previous mkwrite failed and the
+	 * device decided to use the local memory copy.
+	 */
+	i = fuid - rmem->fuid;
+	for (uid = fuid; uid < luid; ++uid, ++i) {
+		if (!test_bit(HMM_PFN_RMEM_UPTODATE, &rmem->pfns[i])) {
+			struct hmm_fence *fence, *nfence;
+			enum hmm_etype etype = event->etype;
+
+			event->etype = HMM_UNMAP;
+			ret = hmm_mirror_rmem_update(mirror, rmem, range->faddr,
+						     range->laddr, range->fuid,
+						     event, true);
+			event->etype = etype;
+			if (ret) {
+				return ret;
+			}
+			list_for_each_entry_safe (fence, nfence,
+						  &event->fences, list) {
+				hmm_device_fence_wait(device, fence);
+			}
+			fence = device->ops->lmem_to_rmem(rmem, range->fuid,
+							  hmm_range_luid(range));
+			if (IS_ERR(fence)) {
+				return PTR_ERR(fence);
+			}
+			ret = hmm_device_fence_wait(device, fence);
+			if (ret) {
+				return ret;
+			}
+			break;
+		}
+	}
+
+	if (write && rmem->mapping) {
+		unsigned long addr;
+
+		if (current->mm == vma->vm_mm) {
+			sync_mm_rss(vma->vm_mm);
+		}
+		i = fuid - rmem->fuid;
+		addr = faddr;
+		for (uid = fuid; uid < luid; ++uid, ++i, addr += PAGE_SIZE) {
+			if (test_bit(HMM_PFN_WRITE, &rmem->pfns[i])) {
+				continue;
+			}
+			if (vma->vm_flags & VM_SHARED) {
+				ret = hmm_rmem_file_mkwrite(rmem,vma,addr,uid);
+				if (ret && ret != -EBUSY) {
+					return ret;
+				}
+			} else {
+				struct mm_struct *mm = vma->vm_mm;
+				struct page *page;
+
+				/* COW */
+				if(mem_cgroup_charge_anon(NULL,mm,GFP_KERNEL)){
+					return -ENOMEM;
+				}
+				add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
+				spin_lock(&rmem->lock);
+				page = hmm_pfn_to_page(rmem->pfns[i]);
+				rmem->pfns[i] = 0;
+				set_bit(HMM_PFN_WRITE, &rmem->pfns[i]);
+				spin_unlock(&rmem->lock);
+				hmm_rmem_remap_file_single_page(rmem, page);
+			}
+		}
+	}
+
 	ret = device->ops->rmem_fault(mirror, rmem, faddr, laddr, fuid, fault);
 	return ret;
 }
@@ -2951,7 +3980,10 @@ static void hmm_migrate_abort(struct hmm_mirror *mirror,
 					      faddr, laddr, fuid);
 		}
 	} else {
-		BUG();
+		rmem.pgoff = vma->vm_pgoff;
+		rmem.pgoff += ((fault->faddr - vma->vm_start) >> PAGE_SHIFT);
+		rmem.mapping = vma->vm_file->f_mapping;
+		hmm_rmem_remap_file(&rmem);
 	}
 
 	/* Ok officialy dead. */
@@ -2977,6 +4009,15 @@ static void hmm_migrate_abort(struct hmm_mirror *mirror,
 			unlock_page(page);
 			clear_bit(HMM_PFN_LOCK, &pfns[i]);
 		}
+		if (test_bit(HMM_PFN_FILE, &pfns[i]) && !PageLRU(page)) {
+			/* To balance putback_lru_page and isolate_lru_page. As
+			 * a simplification we dropped the extra reference taken
+			 * by isolate_lru_page. This is why we need to take an
+			 * extra reference here for putback_lru_page.
+			 */
+			get_page(page);
+			putback_lru_page(page);
+		}
 		page_remove_rmap(page);
 		page_cache_release(page);
 		pfns[i] = 0;
@@ -2988,6 +4029,7 @@ int hmm_migrate_lmem_to_rmem(struct hmm_fault *fault,
 			     struct hmm_mirror *mirror)
 {
 	struct vm_area_struct *vma;
+	struct address_space *mapping;
 	struct hmm_device *device;
 	struct hmm_range *range;
 	struct hmm_fence *fence;
@@ -3042,7 +4084,8 @@ int hmm_migrate_lmem_to_rmem(struct hmm_fault *fault,
 		ret = -EACCES;
 		goto out;
 	}
-	if (vma->vm_file) {
+	mapping = vma->vm_file ? vma->vm_file->f_mapping : NULL;
+	if (vma->vm_file && !(mapping->a_ops->features & AOPS_FEATURE_HMM)) {
 		kfree(range);
 		range = NULL;
 		ret = -EBUSY;
@@ -3053,6 +4096,7 @@ int hmm_migrate_lmem_to_rmem(struct hmm_fault *fault,
 	event->laddr  =fault->laddr = min(fault->laddr, vma->vm_end);
 	npages = (fault->laddr - fault->faddr) >> PAGE_SHIFT;
 	fault->vma = vma;
+	rmem.mapping = (vma->vm_flags & VM_SHARED) ? mapping : NULL;
 
 	ret = hmm_rmem_alloc(&rmem, npages);
 	if (ret) {
@@ -3100,6 +4144,7 @@ int hmm_migrate_lmem_to_rmem(struct hmm_fault *fault,
 	hmm_rmem_tree_insert(fault->rmem, &_hmm_rmems);
 	fault->rmem->pfns = rmem.pfns;
 	range->rmem = fault->rmem;
+	fault->rmem->mapping = rmem.mapping;
 	list_del_init(&range->rlist);
 	list_add_tail(&range->rlist, &fault->rmem->ranges);
 	rmem.event = NULL;
@@ -3128,7 +4173,6 @@ int hmm_migrate_lmem_to_rmem(struct hmm_fault *fault,
 		struct page *page = hmm_pfn_to_page(rmem.pfns[i]);
 
 		if (test_bit(HMM_PFN_VALID_ZERO, &rmem.pfns[i])) {
-			rmem.pfns[i] = rmem.pfns[i] & HMM_PFN_CLEAR;
 			continue;
 		}
 		/* We only decrement now the page count so that cow happen
diff --git a/mm/madvise.c b/mm/madvise.c
index 539eeb9..7c13f8d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -202,6 +202,10 @@ static void force_shm_swapin_readahead(struct vm_area_struct *vma,
 			continue;
 		}
 		swap = radix_to_swp_entry(page);
+		if (is_hmm_entry(swap)) {
+			/* FIXME start migration here ? */
+			continue;
+		}
 		page = read_swap_cache_async(swap, GFP_HIGHUSER_MOVABLE,
 								NULL, 0);
 		if (page)
diff --git a/mm/mincore.c b/mm/mincore.c
index 725c809..107b870 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -79,6 +79,10 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff)
 		 */
 		if (radix_tree_exceptional_entry(page)) {
 			swp_entry_t swp = radix_to_swp_entry(page);
+
+			if (is_hmm_entry(swp)) {
+				return 1;
+			}
 			page = find_get_page(swap_address_space(swp), swp.val);
 		}
 	} else
@@ -86,6 +90,13 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff)
 #else
 	page = find_get_page(mapping, pgoff);
 #endif
+	if (radix_tree_exceptional_entry(page)) {
+		swp_entry_t swap = radix_to_swp_entry(page);
+
+		if (is_hmm_entry(swap)) {
+			return 1;
+		}
+	}
 	if (page) {
 		present = PageUptodate(page);
 		page_cache_release(page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 023cf08..b6dcf80 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -37,6 +37,7 @@
 #include <linux/timer.h>
 #include <linux/sched/rt.h>
 #include <linux/mm_inline.h>
+#include <linux/hmm.h>
 #include <trace/events/writeback.h>
 
 #include "internal.h"
@@ -1900,6 +1901,8 @@ retry:
 		tag_pages_for_writeback(mapping, index, end);
 	done_index = index;
 	while (!done && (index <= end)) {
+		pgoff_t save_index = index;
+		bool migrated = false;
 		int i;
 
 		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
@@ -1907,58 +1910,106 @@ retry:
 		if (nr_pages == 0)
 			break;
 
+		for (i = 0, migrated = false; i < nr_pages; i++) {
+			struct page *page = pvec.pages[i];
+
+			if (radix_tree_exceptional_entry(page)) {
+				swp_entry_t swap = radix_to_swp_entry(page);
+
+				/* This can not happen ! */
+				BUG_ON(!is_hmm_entry(swap));
+				page = hmm_pagecache_writeback(mapping, swap);
+				if (page == NULL) {
+					migrated = true;
+					pvec.pages[i] = NULL;
+				}
+			}
+		}
+
+		/* Some rmem was migrated, we need to redo the page cache
+		 * lookup.
+		 */
+		if (migrated) {
+			for (i = 0; i < nr_pages; i++) {
+				struct page *page = pvec.pages[i];
+
+				if (page && radix_tree_exceptional_entry(page)) {
+					swp_entry_t swap = radix_to_swp_entry(page);
+
+					page = hmm_pagecache_page(mapping, swap);
+					unlock_page(page);
+					page_cache_release(page);
+					pvec.pages[i] = page;
+				}
+			}
+			pagevec_release(&pvec);
+			cond_resched();
+			index = save_index;
+			goto retry;
+		}
+
 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];
 
-			/*
-			 * At this point, the page may be truncated or
-			 * invalidated (changing page->mapping to NULL), or
-			 * even swizzled back from swapper_space to tmpfs file
-			 * mapping. However, page->index will not change
-			 * because we have a reference on the page.
-			 */
-			if (page->index > end) {
+			if (radix_tree_exceptional_entry(page)) {
+				swp_entry_t swap = radix_to_swp_entry(page);
+
+				pvec.pages[i] = page = hmm_pagecache_page(mapping, swap);
+				page_cache_release(page);
+				done_index = page->index;
+			} else {
 				/*
-				 * can't be range_cyclic (1st pass) because
-				 * end == -1 in that case.
+				 * At this point, the page may be truncated or
+				 * invalidated (changing page->mapping to NULL), or
+				 * even swizzled back from swapper_space to tmpfs file
+				 * mapping. However, page->index will not change
+				 * because we have a reference on the page.
 				 */
-				done = 1;
-				break;
-			}
+				if (page->index > end) {
+					/*
+					 * can't be range_cyclic (1st pass) because
+					 * end == -1 in that case.
+					 */
+					done = 1;
+					break;
+				}
 
-			done_index = page->index;
+				done_index = page->index;
 
-			lock_page(page);
+				lock_page(page);
 
-			/*
-			 * Page truncated or invalidated. We can freely skip it
-			 * then, even for data integrity operations: the page
-			 * has disappeared concurrently, so there could be no
-			 * real expectation of this data interity operation
-			 * even if there is now a new, dirty page at the same
-			 * pagecache address.
-			 */
-			if (unlikely(page->mapping != mapping)) {
-continue_unlock:
-				unlock_page(page);
-				continue;
+				/*
+				 * Page truncated or invalidated. We can freely skip it
+				 * then, even for data integrity operations: the page
+				 * has disappeared concurrently, so there could be no
+				 * real expectation of this data interity operation
+				 * even if there is now a new, dirty page at the same
+				 * pagecache address.
+				 */
+				if (unlikely(page->mapping != mapping)) {
+					unlock_page(page);
+					continue;
+				}
 			}
 
 			if (!PageDirty(page)) {
 				/* someone wrote it for us */
-				goto continue_unlock;
+				unlock_page(page);
+				continue;
 			}
 
 			if (PageWriteback(page)) {
-				if (wbc->sync_mode != WB_SYNC_NONE)
+				if (wbc->sync_mode != WB_SYNC_NONE) {
 					wait_on_page_writeback(page);
-				else
-					goto continue_unlock;
+				} else {
+					unlock_page(page);
+					continue;
+				}
 			}
 
 			BUG_ON(PageWriteback(page));
-			if (!clear_page_dirty_for_io(page))
-				goto continue_unlock;
+			if (!clear_page_dirty_for_io(page)) {
+				unlock_page(page);
+				continue;
+			}
 
 			trace_wbc_writepage(wbc, mapping->backing_dev_info);
 			ret = (*writepage)(page, wbc, data);
@@ -1994,6 +2045,20 @@ continue_unlock:
 				break;
 			}
 		}
+
+		/* Some entries of pvec might still be exceptional ! */
+		for (i = 0; i < nr_pages; i++) {
+			struct page *page = pvec.pages[i];
+
+			if (radix_tree_exceptional_entry(page)) {
+				swp_entry_t swap = radix_to_swp_entry(page);
+
+				page = hmm_pagecache_page(mapping, swap);
+				unlock_page(page);
+				page_cache_release(page);
+				pvec.pages[i] = page;
+			}
+		}
 		pagevec_release(&pvec);
 		cond_resched();
 	}
diff --git a/mm/rmap.c b/mm/rmap.c
index e07450c..3b7fbd3c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1132,6 +1132,9 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	case TTU_MUNLOCK:
 		action = MMU_MUNLOCK;
 		break;
+	case TTU_HMM:
+		action = MMU_HMM;
+		break;
 	default:
 		/* Please report this ! */
 		BUG();
@@ -1327,6 +1330,9 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 	case TTU_MUNLOCK:
 		action = MMU_MUNLOCK;
 		break;
+	case TTU_HMM:
+		action = MMU_HMM;
+		break;
 	default:
 		/* Please report this ! */
 		BUG();
@@ -1426,7 +1432,12 @@ static int try_to_unmap_nonlinear(struct page *page,
 	unsigned long cursor;
 	unsigned long max_nl_cursor = 0;
 	unsigned long max_nl_size = 0;
-	unsigned int mapcount;
+	unsigned int mapcount, min_mapcount = 0;
+
+	/* The hmm code keeps the mapcount elevated by 1 to avoid updating the
+	 * mm and memcg. If we are called on behalf of hmm just ignore this
+	 * extra 1.
+	 */
+	min_mapcount = (TTU_ACTION((enum ttu_flags)arg) == TTU_HMM) ? 1 : 0;
 
 	list_for_each_entry(vma,
 		&mapping->i_mmap_nonlinear, shared.nonlinear) {
@@ -1449,8 +1460,10 @@ static int try_to_unmap_nonlinear(struct page *page,
 	 * just walk the nonlinear vmas trying to age and unmap some.
 	 * The mapcount of the page we came in with is irrelevant,
 	 * but even so use it as a guide to how hard we should try?
+	 *
+	 * See comment about hmm above for min_mapcount.
 	 */
-	mapcount = page_mapcount(page);
+	mapcount = page_mapcount(page) - min_mapcount;
 	if (!mapcount)
 		return ret;
 
diff --git a/mm/swap.c b/mm/swap.c
index c0ed4d6..426fede 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -839,6 +839,15 @@ void release_pages(struct page **pages, int nr, int cold)
 	for (i = 0; i < nr; i++) {
 		struct page *page = pages[i];
 
+		if (!page) {
+			continue;
+		}
+		if (radix_tree_exceptional_entry(page)) {
+			/* This should really not happen, tell us about it ! */
+			WARN_ONCE(1, "hmm exceptional entry left\n");
+			continue;
+		}
+
 		if (unlikely(PageCompound(page))) {
 			if (zone) {
 				spin_unlock_irqrestore(&zone->lru_lock, flags);
diff --git a/mm/truncate.c b/mm/truncate.c
index 6a78c81..c979fd6 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -20,6 +20,7 @@
 #include <linux/buffer_head.h>	/* grr. try_to_release_page,
 				   do_invalidatepage */
 #include <linux/cleancache.h>
+#include <linux/hmm.h>
 #include "internal.h"
 
 static void clear_exceptional_entry(struct address_space *mapping,
@@ -281,6 +282,32 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE),
 			indices)) {
+		bool migrated = false;
+
+		for (i = 0; i < pagevec_count(&pvec); ++i) {
+			struct page *page = pvec.pages[i];
+
+			if (radix_tree_exceptional_entry(page)) {
+				swp_entry_t swap = radix_to_swp_entry(page);
+
+				/* FIXME How to handle hmm migration failure ? */
+				hmm_pagecache_migrate(mapping, swap);
+				for (; i < pagevec_count(&pvec); ++i) {
+					page = pvec.pages[i];
+					if (radix_tree_exceptional_entry(page)) {
+						pvec.pages[i] = NULL;
+					}
+				}
+				migrated = true;
+				break;
+			}
+		}
+
+		if (migrated) {
+			pagevec_release(&pvec);
+			cond_resched();
+			continue;
+		}
+
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
@@ -313,7 +340,16 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	}
 
 	if (partial_start) {
-		struct page *page = find_lock_page(mapping, start - 1);
+		struct page *page;
+
+	repeat_start:
+		page = find_lock_page(mapping, start - 1);
+		if (radix_tree_exceptional_entry(page)) {
+			swp_entry_t swap = radix_to_swp_entry(page);
+
+			hmm_pagecache_migrate(mapping, swap);
+			goto repeat_start;
+		}
 		if (page) {
 			unsigned int top = PAGE_CACHE_SIZE;
 			if (start > end) {
@@ -332,7 +368,15 @@ void truncate_inode_pages_range(struct address_space *mapping,
 		}
 	}
 	if (partial_end) {
-		struct page *page = find_lock_page(mapping, end);
+		struct page *page;
+	repeat_end:
+		page = find_lock_page(mapping, end);
+		if (radix_tree_exceptional_entry(page)) {
+			swp_entry_t swap = radix_to_swp_entry(page);
+
+			hmm_pagecache_migrate(mapping, swap);
+			goto repeat_end;
+		}
 		if (page) {
 			wait_on_page_writeback(page);
 			zero_user_segment(page, 0, partial_end);
@@ -371,6 +415,9 @@ void truncate_inode_pages_range(struct address_space *mapping,
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
+			/* FIXME Find a way to block rmem migration on truncate. */
+			BUG_ON(radix_tree_exceptional_entry(page));
+
 			/* We rely upon deletion not changing page->index */
 			index = indices[i];
 			if (index >= end)
@@ -488,6 +535,32 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 	while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
 			indices)) {
+		bool migrated = false;
+
+		for (i = 0; i < pagevec_count(&pvec); ++i) {
+			struct page *page = pvec.pages[i];
+
+			if (radix_tree_exceptional_entry(page)) {
+				swp_entry_t swap = radix_to_swp_entry(page);
+
+				/* FIXME How to handle hmm migration failure ? */
+				hmm_pagecache_migrate(mapping, swap);
+				for (; i < pagevec_count(&pvec); ++i) {
+					page = pvec.pages[i];
+					if (radix_tree_exceptional_entry(page)) {
+						pvec.pages[i] = NULL;
+					}
+				}
+				migrated = true;
+				break;
+			}
+		}
+
+		if (migrated) {
+			pagevec_release(&pvec);
+			cond_resched();
+			continue;
+		}
+
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
@@ -597,6 +670,32 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 	while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
 			indices)) {
+		bool migrated = false;
+
+		for (i = 0; i < pagevec_count(&pvec); ++i) {
+			struct page *page = pvec.pages[i];
+
+			if (radix_tree_exceptional_entry(page)) {
+				swp_entry_t swap = radix_to_swp_entry(page);
+
+				/* FIXME How to handle hmm migration failure ? */
+				hmm_pagecache_migrate(mapping, swap);
+				for (; i < pagevec_count(&pvec); ++i) {
+					page = pvec.pages[i];
+					if (radix_tree_exceptional_entry(page)) {
+						pvec.pages[i] = NULL;
+					}
+				}
+				migrated = true;
+				break;
+			}
+		}
+
+		if (migrated) {
+			pagevec_release(&pvec);
+			cond_resched();
+			continue;
+		}
+
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 08/11] hmm: support for migrate file backed pages to remote memory
@ 2014-05-02 13:52   ` j.glisse
  0 siblings, 0 replies; 107+ messages in thread
From: j.glisse @ 2014-05-02 13:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel
  Cc: Jérôme Glisse, Sherry Cheung, Subhash Gutti,
	Mark Hairgrove, John Hubbard, Jatin Kumar

From: Jérôme Glisse <jglisse@redhat.com>

Motivation:

Same as for migrating anonymous private memory, i.e. device local memory has
higher bandwidth and lower latency.

Implementation:

Migrated ranges are tracked exactly as private anonymous memory; refer to
the commit adding support for migrating private anonymous memory.

Migrating file backed pages is more complex than private anonymous memory
as those pages might be involved in various filesystem events, from write
back to splice or truncation.

This patchset uses a special hmm swap value that is stored inside the radix
tree for pages that are migrated to remote memory. Any code that needs to do
a radix tree lookup is updated to understand those special hmm swap entries
and to call the hmm helper functions to perform the appropriate operation.
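
To make the pattern concrete, here is roughly how such an entry is built and
decoded. The two wrapper functions below are only for this description and
are not part of the patch; make_hmm_entry(), is_hmm_entry() and
hmm_entry_uid() are the hmm helpers used by this series, uid being the unique
id of the remote memory page:

	static void *hmm_encode_radix_entry(unsigned long uid)
	{
		/* What hmm_rmem_unmap_file() stores in the pagecache slot. */
		return swp_to_radix_entry(make_hmm_entry(uid));
	}

	static bool hmm_decode_radix_entry(void *item, unsigned long *uid)
	{
		/* What every radix tree lookup user must now handle. */
		swp_entry_t entry;

		if (!radix_tree_exceptional_entry(item))
			return false;
		entry = radix_to_swp_entry(item);
		if (!is_hmm_entry(entry))
			return false;
		*uid = hmm_entry_uid(entry);
		return true;
	}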

For most operations (file read, splice, truncate, ...) the end result is
simply to migrate back to local memory. It is expected that users of hmm
will not perform such operations on file backed memory that was migrated
to remote memory.
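
For instance a lookup site that only ever wants a local page ends up looking
like this (a simplified sketch of what the mm/filemap.c changes below do, the
function name is just for illustration):

	static struct page *example_get_local_page(struct address_space *mapping,
						   pgoff_t index)
	{
		struct page *page;

	repeat:
		page = find_get_page(mapping, index);
		if (radix_tree_exceptional_entry(page)) {
			swp_entry_t swap = radix_to_swp_entry(page);

			/* The data lives in remote memory: migrate it back
			 * and retry, the slot then holds a regular page.
			 */
			hmm_pagecache_migrate(mapping, swap);
			goto repeat;
		}
		return page;
	}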

Write back is different as we preserve the capability of writing back
dirtied remote memory (using local system memory as a bounce buffer).
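
A rough sketch of what a writeback path does with such an entry follows; the
function name is only for illustration and the error handling is simplified
compared to the write_cache_pages() change below:

	static int example_writeback_one(struct address_space *mapping,
					 swp_entry_t swap,
					 struct writeback_control *wbc)
	{
		struct page *page;
		int ret;

		/* Returns a locked local page holding the latest rmem
		 * content (dirty if the rmem copy was dirtied), the rmem
		 * being kept read only for the duration of the writeback.
		 * NULL means the rmem was migrated back to lmem and the
		 * caller must redo the pagecache lookup.
		 */
		page = hmm_pagecache_writeback(mapping, swap);
		if (!page)
			return -EAGAIN;

		if (!clear_page_dirty_for_io(page)) {
			/* Someone cleaned it for us, nothing to write. */
			unlock_page(page);
			page_cache_release(page);
			return 0;
		}

		/* ->writepage() unlocks the page once the io is submitted. */
		ret = mapping->a_ops->writepage(page, wbc);
		if (ret == AOP_WRITEPAGE_ACTIVATE) {
			unlock_page(page);
			ret = 0;
		}
		page_cache_release(page);
		return ret;
	}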

Each filesystem must be modified to support hmm. This patchset only modifies
common helper code and adds the core set of helpers needed for this feature.
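
A filesystem advertises that its code has been converted by setting the new
features flag in its address_space_operations, for instance (foofs is a
placeholder, no filesystem is actually converted by this patchset):

	static const struct address_space_operations foofs_aops = {
		/* ... the usual readpage/writepage/... callbacks ... */
		.features	= AOPS_FEATURE_HMM,
	};

Without that flag set, hmm simply refuses to migrate pagecache pages of the
mapping to remote memory (see the check in hmm_migrate_lmem_to_rmem()).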

Issues:

The big issue here is how to handle failure to migrate the remote memory back
to local memory. Should all the processes trying further access to the file
get SIGBUS ? Should only the process that migrated memory to remote memory
get SIGBUS ? ...

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 fs/aio.c             |    9 +
 fs/buffer.c          |    3 +
 fs/splice.c          |   38 +-
 include/linux/fs.h   |    4 +
 include/linux/hmm.h  |   72 +++-
 include/linux/rmap.h |    1 +
 mm/filemap.c         |   99 ++++-
 mm/hmm.c             | 1094 ++++++++++++++++++++++++++++++++++++++++++++++++--
 mm/madvise.c         |    4 +
 mm/mincore.c         |   11 +
 mm/page-writeback.c  |  131 ++++--
 mm/rmap.c            |   17 +-
 mm/swap.c            |    9 +
 mm/truncate.c        |  103 ++++-
 14 files changed, 1524 insertions(+), 71 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 0bf693f..0ec9f16 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -40,6 +40,7 @@
 #include <linux/ramfs.h>
 #include <linux/percpu-refcount.h>
 #include <linux/mount.h>
+#include <linux/hmm.h>
 
 #include <asm/kmap_types.h>
 #include <asm/uaccess.h>
@@ -405,10 +406,18 @@ static int aio_setup_ring(struct kioctx *ctx)
 
 	for (i = 0; i < nr_pages; i++) {
 		struct page *page;
+
+	repeat:
 		page = find_or_create_page(file->f_inode->i_mapping,
 					   i, GFP_HIGHUSER | __GFP_ZERO);
 		if (!page)
 			break;
+		if (radix_tree_exceptional_entry(page)) {
+			swp_entry_t swap = radix_to_swp_entry(page);
+
+			hmm_pagecache_migrate(file->f_inode->i_mapping, swap);
+			goto repeat;
+		}
 		pr_debug("pid(%d) page[%d]->count=%d\n",
 			 current->pid, i, page_count(page));
 		SetPageUptodate(page);
diff --git a/fs/buffer.c b/fs/buffer.c
index e33f8d5..2be2a04 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -40,6 +40,7 @@
 #include <linux/cpu.h>
 #include <linux/bitops.h>
 #include <linux/mpage.h>
+#include <linux/hmm.h>
 #include <linux/bit_spinlock.h>
 #include <trace/events/block.h>
 
@@ -1023,6 +1024,8 @@ grow_dev_page(struct block_device *bdev, sector_t block,
 	if (!page)
 		return ret;
 
+	/* This can not happen ! */
+	BUG_ON(radix_tree_exceptional_entry(page));
 	BUG_ON(!PageLocked(page));
 
 	if (page_has_buffers(page)) {
diff --git a/fs/splice.c b/fs/splice.c
index 9dc23de..175f80c 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -33,6 +33,7 @@
 #include <linux/socket.h>
 #include <linux/compat.h>
 #include <linux/aio.h>
+#include <linux/hmm.h>
 #include "internal.h"
 
 /*
@@ -334,6 +335,20 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
 	 * Lookup the (hopefully) full range of pages we need.
 	 */
 	spd.nr_pages = find_get_pages_contig(mapping, index, nr_pages, spd.pages);
+	/* Handle hmm entry, ie migrate remote memory back to local memory. */
+	for (page_nr = 0; page_nr < spd.nr_pages;) {
+		page = spd.pages[page_nr];
+		if (radix_tree_exceptional_entry(page)) {
+			swp_entry_t swap = radix_to_swp_entry(page);
+
+			/* FIXME How to handle hmm migration failure ? */
+			hmm_pagecache_migrate(mapping, swap);
+			spd.pages[page_nr] = find_get_page(mapping, index + page_nr);
+			continue;
+		} else {
+			page_nr++;
+		}
+	}
 	index += spd.nr_pages;
 
 	/*
@@ -351,6 +366,14 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
 		 * the first hole.
 		 */
 		page = find_get_page(mapping, index);
+
+		if (radix_tree_exceptional_entry(page)) {
+			swp_entry_t swap = radix_to_swp_entry(page);
+
+			/* FIXME How to handle hmm migration failure ? */
+			hmm_pagecache_migrate(mapping, swap);
+			continue;
+		}
 		if (!page) {
 			/*
 			 * page didn't exist, allocate one.
@@ -373,7 +396,6 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
 			 */
 			unlock_page(page);
 		}
-
 		spd.pages[spd.nr_pages++] = page;
 		index++;
 	}
@@ -415,6 +437,7 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
 			 */
 			if (!page->mapping) {
 				unlock_page(page);
+retry:
 				page = find_or_create_page(mapping, index,
 						mapping_gfp_mask(mapping));
 
@@ -422,8 +445,17 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
 					error = -ENOMEM;
 					break;
 				}
-				page_cache_release(spd.pages[page_nr]);
-				spd.pages[page_nr] = page;
+				/* This can now be an exceptional hmm entry. */
+				if (radix_tree_exceptional_entry(page)) {
+					swp_entry_t swap = radix_to_swp_entry(page);
+
+					/* FIXME How to handle hmm migration failure ? */
+					hmm_pagecache_migrate(mapping, swap);
+					goto retry;
+				} else {
+					page_cache_release(spd.pages[page_nr]);
+					spd.pages[page_nr] = page;
+				}
 			}
 			/*
 			 * page was already under io and is now done, great
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4e92d55..149a73e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -366,8 +366,12 @@ struct address_space_operations {
 	int (*swap_activate)(struct swap_info_struct *sis, struct file *file,
 				sector_t *span);
 	void (*swap_deactivate)(struct file *file);
+
+	int features;
 };
 
+#define AOPS_FEATURE_HMM	(1 << 0)
+
 extern const struct address_space_operations empty_aops;
 
 /*
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 96f41c4..9d232c1 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -53,7 +53,6 @@
 #include <linux/swapops.h>
 #include <linux/mman.h>
 
-
 struct hmm_device;
 struct hmm_device_ops;
 struct hmm_mirror;
@@ -75,6 +74,14 @@ struct hmm;
  *   HMM_PFN_LOCK is only set while the rmem object is under going migration.
  *   HMM_PFN_LMEM_UPTODATE the page that is in the rmem pfn array has uptodate.
  *   HMM_PFN_RMEM_UPTODATE the rmem copy of the page is uptodate.
+ *   HMM_PFN_FILE is set for pages that are part of the pagecache.
+ *   HMM_PFN_WRITEBACK is set when the page is undergoing writeback, meaning
+ *     that the page is locked and all device mappings to rmem for this page
+ *     are set to read only. It is only cleared if the device does a write
+ *     fault on the page or on migration back to lmem.
+ *   HMM_PFN_FS_WRITEABLE the rmem can be written to without calling mkwrite.
+ *     This is for hmm internal use only, to know if hmm needs to call the fs
+ *     mkwrite callback or not.
  *
  * Device driver only need to worry about :
  *   HMM_PFN_VALID_PAGE
@@ -95,6 +102,9 @@ struct hmm;
 #define HMM_PFN_LOCK		(4UL)
 #define HMM_PFN_LMEM_UPTODATE	(5UL)
 #define HMM_PFN_RMEM_UPTODATE	(6UL)
+#define HMM_PFN_FILE		(7UL)
+#define HMM_PFN_WRITEBACK	(8UL)
+#define HMM_PFN_FS_WRITEABLE	(9UL)
 
 static inline struct page *hmm_pfn_to_page(unsigned long pfn)
 {
@@ -170,6 +180,7 @@ enum hmm_etype {
 	HMM_UNMAP,
 	HMM_MIGRATE_TO_LMEM,
 	HMM_MIGRATE_TO_RMEM,
+	HMM_WRITEBACK,
 };
 
 struct hmm_fence {
@@ -628,6 +639,7 @@ struct hmm_device *hmm_device_unref(struct hmm_device *device);
  *
  * @kref:           Reference count.
  * @device:         The hmm device the remote memory is allocated on.
+ * @mapping:        The shared file mapping backed by the rmem, NULL otherwise.
  * @event:          The event currently associated with the rmem.
  * @lock:           Lock protecting the ranges list and event field.
  * @ranges:         The list of address ranges that point to this rmem.
@@ -646,6 +658,7 @@ struct hmm_device *hmm_device_unref(struct hmm_device *device);
 struct hmm_rmem {
 	struct kref		kref;
 	struct hmm_device	*device;
+	struct address_space	*mapping;
 	struct hmm_event	*event;
 	spinlock_t		lock;
 	struct list_head	ranges;
@@ -913,6 +926,42 @@ int hmm_mm_fault(struct mm_struct *mm,
 		 unsigned int fault_flags,
 		 pte_t orig_pte);
 
+/* hmm_pagecache_migrate - migrate remote memory to local memory.
+ *
+ * @mapping:    The address space in which the rmem was found.
+ * @swap:       The hmm special swap entry that needs to be migrated.
+ *
+ * When the fs code needs to migrate remote memory to local memory it calls
+ * this function. From the caller point of view this function can not fail.
+ * If it does fail it will trigger SIGBUS for any process that was using the
+ * rmem and tries to access the page whose migration failed. Other processes
+ * will just get the latest content we had for the page. Hence from the
+ * pagecache point of view it never fails.
+ */
+void hmm_pagecache_migrate(struct address_space *mapping,
+			   swp_entry_t swap);
+
+/* hmm_pagecache_writeback - temporary copy of rmem for writeback.
+ *
+ * @mapping:    The address space in which the rmem was found.
+ * @swap:       The hmm special swap entry that needs a temporary copy.
+ * Return:      Page pointer or NULL on failure.
+ *
+ * When the fs code needs to write back remote memory to backing storage it
+ * calls this function. The function returns a pointer to a temporary page
+ * that holds the latest copy of the remote memory. The remote memory will be
+ * marked as read only for the duration of the writeback.
+ *
+ * On failure this will return NULL and will poison any mapping of the process
+ * that was responsible for the remote memory, thus triggering a SIGBUS for
+ * this process. It will also kill the mirror that was using this remote
+ * memory.
+ *
+ * When NULL is returned the caller should perform a new radix tree lookup.
+ */
+struct page *hmm_pagecache_writeback(struct address_space *mapping,
+				     swp_entry_t swap);
+struct page *hmm_pagecache_page(struct address_space *mapping,
+				swp_entry_t swap);
+
 #else /* !CONFIG_HMM */
 
 static inline void hmm_destroy(struct mm_struct *mm)
@@ -930,6 +979,27 @@ static inline int hmm_mm_fault(struct mm_struct *mm,
 	return VM_FAULT_SIGBUS;
 }
 
+static inline void hmm_pagecache_migrate(struct address_space *mapping,
+					 swp_entry_t swap)
+{
+	/* This can not happen ! */
+	BUG();
+}
+
+static inline struct page *hmm_pagecache_writeback(struct address_space *mapping,
+						   swp_entry_t swap)
+{
+	BUG();
+	return NULL;
+}
+
+static inline struct page *hmm_pagecache_page(struct address_space *mapping,
+					      swp_entry_t swap)
+{
+	BUG();
+	return NULL;
+}
+
 #endif /* !CONFIG_HMM */
 
 #endif
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 575851f..0641ccf 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -76,6 +76,7 @@ enum ttu_flags {
 	TTU_POISON = 1,			/* unmap mode */
 	TTU_MIGRATION = 2,		/* migration mode */
 	TTU_MUNLOCK = 3,		/* munlock mode */
+	TTU_HMM = 4,			/* hmm mode */
 	TTU_ACTION_MASK = 0xff,
 
 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
diff --git a/mm/filemap.c b/mm/filemap.c
index 067c3c0..686f46b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -34,6 +34,7 @@
 #include <linux/memcontrol.h>
 #include <linux/cleancache.h>
 #include <linux/rmap.h>
+#include <linux/hmm.h>
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
@@ -343,6 +344,7 @@ int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
 {
 	pgoff_t index = start_byte >> PAGE_CACHE_SHIFT;
 	pgoff_t end = end_byte >> PAGE_CACHE_SHIFT;
+	pgoff_t last_index = index;
 	struct pagevec pvec;
 	int nr_pages;
 	int ret2, ret = 0;
@@ -360,6 +362,19 @@ int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];
 
+			if (radix_tree_exceptional_entry(page)) {
+				swp_entry_t swap = radix_to_swp_entry(page);
+
+				/* FIXME How to handle hmm migration failure ? */
+				hmm_pagecache_migrate(mapping, swap);
+				pvec.pages[i] = NULL;
+				/* Force to examine the range again in case
+				 * the migration triggered page writeback.
+				 */
+				index = last_index;
+				continue;
+			}
+
 			/* until radix tree lookup accepts end_index */
 			if (page->index > end)
 				continue;
@@ -369,6 +384,7 @@ int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
 				ret = -EIO;
 		}
 		pagevec_release(&pvec);
+		last_index = index;
 		cond_resched();
 	}
 out:
@@ -987,14 +1003,21 @@ EXPORT_SYMBOL(find_get_entry);
  * Looks up the page cache slot at @mapping & @offset.  If there is a
  * page cache page, it is returned with an increased refcount.
  *
+ * Note that this will also return hmm special entries.
+ *
  * Otherwise, %NULL is returned.
  */
 struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
 {
 	struct page *page = find_get_entry(mapping, offset);
 
-	if (radix_tree_exceptional_entry(page))
-		page = NULL;
+	if (radix_tree_exceptional_entry(page)) {
+		swp_entry_t swap = radix_to_swp_entry(page);
+
+		if (!is_hmm_entry(swap)) {
+			page = NULL;
+		}
+	}
 	return page;
 }
 EXPORT_SYMBOL(find_get_page);
@@ -1044,6 +1067,8 @@ EXPORT_SYMBOL(find_lock_entry);
  * page cache page, it is returned locked and with an increased
  * refcount.
  *
+ * Note that this will also return hmm special entries.
+ *
  * Otherwise, %NULL is returned.
  *
  * find_lock_page() may sleep.
@@ -1052,8 +1077,13 @@ struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
 {
 	struct page *page = find_lock_entry(mapping, offset);
 
-	if (radix_tree_exceptional_entry(page))
-		page = NULL;
+	if (radix_tree_exceptional_entry(page)) {
+		swp_entry_t swap = radix_to_swp_entry(page);
+
+		if (!is_hmm_entry(swap)) {
+			page = NULL;
+		}
+	}
 	return page;
 }
 EXPORT_SYMBOL(find_lock_page);
@@ -1222,6 +1252,12 @@ repeat:
 				WARN_ON(iter.index);
 				goto restart;
 			}
+			if (is_hmm_entry(radix_to_swp_entry(page))) {
+				/* This is an hmm special entry, the page has
+				 * been migrated to some device memory.
+				 */
+				goto export;
+			}
 			/*
 			 * A shadow entry of a recently evicted page,
 			 * or a swap entry from shmem/tmpfs.  Skip
@@ -1239,6 +1275,7 @@ repeat:
 			goto repeat;
 		}
 
+export:
 		pages[ret] = page;
 		if (++ret == nr_pages)
 			break;
@@ -1289,6 +1326,12 @@ repeat:
 				 */
 				goto restart;
 			}
+			if (is_hmm_entry(radix_to_swp_entry(page))) {
+				/* This is an hmm special entry, the page has
+				 * been migrated to some device memory.
+				 */
+				goto export;
+			}
 			/*
 			 * A shadow entry of a recently evicted page,
 			 * or a swap entry from shmem/tmpfs.  Stop
@@ -1316,6 +1359,7 @@ repeat:
 			break;
 		}
 
+export:
 		pages[ret] = page;
 		if (++ret == nr_pages)
 			break;
@@ -1342,6 +1386,7 @@ unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 	struct radix_tree_iter iter;
 	void **slot;
 	unsigned ret = 0;
+	pgoff_t index_last = *index;
 
 	if (unlikely(!nr_pages))
 		return 0;
@@ -1365,6 +1410,12 @@ repeat:
 				 */
 				goto restart;
 			}
+			if (is_hmm_entry(radix_to_swp_entry(page))) {
+				/* This is an hmm special entry, the page has
+				 * been migrated to some device memory.
+				 */
+				goto export;
+			}
 			/*
 			 * A shadow entry of a recently evicted page.
 			 *
@@ -1388,6 +1439,8 @@ repeat:
 			goto repeat;
 		}
 
+export:
+		index_last = iter.index;
 		pages[ret] = page;
 		if (++ret == nr_pages)
 			break;
@@ -1396,7 +1449,7 @@ repeat:
 	rcu_read_unlock();
 
 	if (ret)
-		*index = pages[ret - 1]->index + 1;
+		*index = index_last + 1;
 
 	return ret;
 }
@@ -1420,6 +1473,13 @@ grab_cache_page_nowait(struct address_space *mapping, pgoff_t index)
 {
 	struct page *page = find_get_page(mapping, index);
 
+	if (radix_tree_exceptional_entry(page)) {
+		/* This only happens if the page is migrated to remote memory
+		 * and the fs code knows how to handle the case, thus it is
+		 * safe to return the special entry.
+		 */
+		return page;
+	}
 	if (page) {
 		if (trylock_page(page))
 			return page;
@@ -1497,6 +1557,13 @@ static ssize_t do_generic_file_read(struct file *filp, loff_t *ppos,
 		cond_resched();
 find_page:
 		page = find_get_page(mapping, index);
+		if (radix_tree_exceptional_entry(page)) {
+			swp_entry_t swap = radix_to_swp_entry(page);
+
+			/* FIXME How to handle hmm migration failure ? */
+			hmm_pagecache_migrate(mapping, swap);
+			goto find_page;
+		}
 		if (!page) {
 			page_cache_sync_readahead(mapping,
 					ra, filp,
@@ -1879,7 +1946,15 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	/*
 	 * Do we have something in the page cache already?
 	 */
+find_page:
 	page = find_get_page(mapping, offset);
+	if (radix_tree_exceptional_entry(page)) {
+		swp_entry_t swap = radix_to_swp_entry(page);
+
+		/* FIXME How to handle hmm migration failure ? */
+		hmm_pagecache_migrate(mapping, swap);
+		goto find_page;
+	}
 	if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) {
 		/*
 		 * We found the page, so try async readahead before
@@ -2145,6 +2220,13 @@ static struct page *__read_cache_page(struct address_space *mapping,
 	int err;
 repeat:
 	page = find_get_page(mapping, index);
+	if (radix_tree_exceptional_entry(page)) {
+		swp_entry_t swap = radix_to_swp_entry(page);
+
+		/* FIXME How to handle hmm migration failure ? */
+		hmm_pagecache_migrate(mapping, swap);
+		goto repeat;
+	}
 	if (!page) {
 		page = __page_cache_alloc(gfp | __GFP_COLD);
 		if (!page)
@@ -2442,6 +2524,13 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
 		gfp_notmask = __GFP_FS;
 repeat:
 	page = find_lock_page(mapping, index);
+	if (radix_tree_exceptional_entry(page)) {
+		swp_entry_t swap = radix_to_swp_entry(page);
+
+		/* FIXME How to handle hmm migration failure ? */
+		hmm_pagecache_migrate(mapping, swap);
+		goto repeat;
+	}
 	if (page)
 		goto found;
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 599d4f6..0d97762 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -61,6 +61,7 @@
 #include <linux/wait.h>
 #include <linux/interval_tree_generic.h>
 #include <linux/mman.h>
+#include <linux/buffer_head.h>
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
 #include <linux/delay.h>
@@ -656,6 +657,7 @@ static void hmm_rmem_init(struct hmm_rmem *rmem,
 {
 	kref_init(&rmem->kref);
 	rmem->device = device;
+	rmem->mapping = NULL;
 	rmem->fuid = 0;
 	rmem->luid = 0;
 	rmem->pfns = NULL;
@@ -923,9 +925,13 @@ static void hmm_rmem_clear_range(struct hmm_rmem *rmem,
 			sync_mm_rss(vma->vm_mm);
 		}
 
-		/* Properly uncharge memory. */
-		mem_cgroup_uncharge_mm(vma->vm_mm);
-		add_mm_counter(vma->vm_mm, MM_ANONPAGES, -1);
+		if (!test_bit(HMM_PFN_FILE, &rmem->pfns[idx])) {
+			/* Properly uncharge memory. */
+			mem_cgroup_uncharge_mm(vma->vm_mm);
+			add_mm_counter(vma->vm_mm, MM_ANONPAGES, -1);
+		} else {
+			add_mm_counter(vma->vm_mm, MM_FILEPAGES, -1);
+		}
 	}
 }
 
@@ -1064,8 +1070,10 @@ static int hmm_rmem_remap_page(struct hmm_rmem_mm *rmem_mm,
 			pte = pte_mkdirty(pte);
 		}
 		get_page(page);
-		/* Private anonymous page. */
-		page_add_anon_rmap(page, vma, addr);
+		if (!test_bit(HMM_PFN_FILE, &rmem->pfns[idx])) {
+			/* Private anonymous page. */
+			page_add_anon_rmap(page, vma, addr);
+		}
 		/* FIXME is this necessary ? I do not think so. */
 		if (!reuse_swap_page(page)) {
 			/* Page is still mapped in another process. */
@@ -1149,6 +1157,87 @@ static int hmm_rmem_remap_anon(struct hmm_rmem *rmem,
 	return ret;
 }
 
+static void hmm_rmem_remap_file_single_page(struct hmm_rmem *rmem,
+					    struct page *page)
+{
+	struct address_space *mapping = rmem->mapping;
+	void **slotp;
+
+	list_del_init(&page->lru);
+	spin_lock_irq(&mapping->tree_lock);
+	slotp = radix_tree_lookup_slot(&mapping->page_tree, page->index);
+	if (slotp) {
+		radix_tree_replace_slot(slotp, page);
+		get_page(page);
+	} else {
+		/* This should never happen. */
+		WARN_ONCE(1, "hmm: null slot while remapping !\n");
+	}
+	spin_unlock_irq(&mapping->tree_lock);
+
+	page->mapping = mapping;
+	unlock_page(page);
+	/* To balance putback_lru_page and isolate_lru_page. */
+	get_page(page);
+	putback_lru_page(page);
+	page_remove_rmap(page);
+	page_cache_release(page);
+}
+
+static void hmm_rmem_remap_file(struct hmm_rmem *rmem)
+{
+	struct address_space *mapping = rmem->mapping;
+	unsigned long i, index, uid;
+
+	/* This part is a lot easier than the unmap one. */
+	uid = rmem->fuid;
+	index = rmem->pgoff >> (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	spin_lock_irq(&mapping->tree_lock);
+	for (i = 0; i < hmm_rmem_npages(rmem); ++i, ++uid, ++index) {
+		void *expected, *item, **slotp;
+		struct page *page;
+
+		page = hmm_pfn_to_page(rmem->pfns[i]);
+		if (!page || !test_bit(HMM_PFN_FILE, &rmem->pfns[i])) {
+			continue;
+		}
+		slotp = radix_tree_lookup_slot(&mapping->page_tree, index);
+		if (!slotp) {
+			/* This should never happen. */
+			WARN_ONCE(1, "hmm: null slot while remapping !\n");
+			continue;
+		}
+		item = radix_tree_deref_slot_protected(slotp,
+						       &mapping->tree_lock);
+		expected = swp_to_radix_entry(make_hmm_entry(uid));
+		if (item == expected) {
+			if (!test_bit(HMM_PFN_LMEM_UPTODATE, &rmem->pfns[i])) {
+				/* FIXME Something was wrong for read back. */
+				ClearPageUptodate(page);
+			}
+			page->mapping = mapping;
+			get_page(page);
+			radix_tree_replace_slot(slotp, page);
+		} else {
+			WARN_ONCE(1, "hmm: expect 0x%p got 0x%p\n",
+				  expected, item);
+		}
+	}
+	spin_unlock_irq(&mapping->tree_lock);
+
+	for (i = 0; i < hmm_rmem_npages(rmem); ++i, ++uid, ++index) {
+		struct page *page;
+
+		page = hmm_pfn_to_page(rmem->pfns[i]);
+		page->mapping = mapping;
+		if (test_bit(HMM_PFN_DIRTY, &rmem->pfns[i])) {
+			set_page_dirty(page);
+		}
+		unlock_page(page);
+		clear_bit(HMM_PFN_LOCK, &rmem->pfns[i]);
+	}
+}
+
 static int hmm_rmem_unmap_anon_page(struct hmm_rmem_mm *rmem_mm,
 				    unsigned long addr,
 				    pte_t *ptep,
@@ -1230,6 +1319,94 @@ static int hmm_rmem_unmap_anon_page(struct hmm_rmem_mm *rmem_mm,
 	return 0;
 }
 
+static int hmm_rmem_unmap_file_page(struct hmm_rmem_mm *rmem_mm,
+				    unsigned long addr,
+				    pte_t *ptep,
+				    pmd_t *pmdp)
+{
+	struct vm_area_struct *vma = rmem_mm->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	struct hmm_rmem *rmem = rmem_mm->rmem;
+	unsigned long idx, uid;
+	struct page *page;
+	pte_t pte;
+
+	/* New pte value. */
+	uid = rmem_mm->fuid + ((addr - rmem_mm->faddr) >> PAGE_SHIFT);
+	idx = uid - rmem->fuid;
+	pte = ptep_get_and_clear_full(mm, addr, ptep, rmem_mm->tlb.fullmm);
+	tlb_remove_tlb_entry((&rmem_mm->tlb), ptep, addr);
+
+	if (pte_none(pte)) {
+		rmem_mm->laddr = addr + PAGE_SIZE;
+		return 0;
+	}
+	if (!pte_present(pte)) {
+		swp_entry_t entry;
+
+		if (pte_file(pte)) {
+			/* Definitely a fault as we do not support migrating
+			 * non-linear vmas to remote memory.
+			 */
+			WARN_ONCE(1, "hmm: was trying to migrate non linear vma.\n");
+			return -EBUSY;
+		}
+		entry = pte_to_swp_entry(pte);
+		if (unlikely(non_swap_entry(entry))) {
+			/* This cannot happen! At this point no other process
+			 * knows about this page or has pending operations on
+			 * it besides read operations.
+			 *
+			 * There can be no mm event happening (no migration or
+			 * anything else) that would set a special pte.
+			 */
+			WARN_ONCE(1, "hmm: unhandled pte value 0x%016llx.\n",
+				  (long long)pte_val(pte));
+			return -EBUSY;
+		}
+		/* FIXME free swap ? This was pointing to swap entry of shmem shared memory. */
+		return 0;
+	}
+
+	flush_cache_page(vma, addr, pte_pfn(pte));
+	page = pfn_to_page(pte_pfn(pte));
+	if (PageAnon(page)) {
+		page = hmm_pfn_to_page(rmem->pfns[idx]);
+		list_add_tail(&page->lru, &rmem_mm->remap_pages);
+		rmem->pfns[idx] = pte_pfn(pte);
+		set_bit(HMM_PFN_VALID_PAGE, &rmem->pfns[idx]);
+		set_bit(HMM_PFN_WRITE, &rmem->pfns[idx]);
+		if (pte_dirty(pte)) {
+			set_bit(HMM_PFN_DIRTY, &rmem->pfns[idx]);
+		}
+		page = pfn_to_page(pte_pfn(pte));
+		pte = swp_entry_to_pte(make_hmm_entry(uid));
+		set_pte_at(mm, addr, ptep, pte);
+		/* tlb_flush_mmu drop one ref so take an extra ref here. */
+		get_page(page);
+	} else {
+		VM_BUG_ON(page != hmm_pfn_to_page(rmem->pfns[idx]));
+		set_bit(HMM_PFN_VALID_PAGE, &rmem->pfns[idx]);
+		if (pte_write(pte)) {
+			set_bit(HMM_PFN_FS_WRITEABLE, &rmem->pfns[idx]);
+		}
+		if (pte_dirty(pte)) {
+			set_bit(HMM_PFN_DIRTY, &rmem->pfns[idx]);
+		}
+		set_bit(HMM_PFN_FILE, &rmem->pfns[idx]);
+		add_mm_counter(mm, MM_FILEPAGES, -1);
+		page_remove_rmap(page);
+		/* Unlike the anonymous page case do not take an extra
+		 * reference as we are already holding one.
+		 */
+	}
+
+	rmem_mm->force_flush = !__tlb_remove_page(&rmem_mm->tlb, page);
+	rmem_mm->laddr = addr + PAGE_SIZE;
+
+	return 0;
+}
+
 static int hmm_rmem_unmap_pmd(pmd_t *pmdp,
 			      unsigned long addr,
 			      unsigned long next,
@@ -1262,15 +1439,29 @@ static int hmm_rmem_unmap_pmd(pmd_t *pmdp,
 again:
 	ptep = pte_offset_map_lock(vma->vm_mm, pmdp, addr, &ptl);
 	arch_enter_lazy_mmu_mode();
-	for (; addr != next; ++ptep, addr += PAGE_SIZE) {
-		ret = hmm_rmem_unmap_anon_page(rmem_mm, addr,
-					       ptep, pmdp);
-		if (ret || rmem_mm->force_flush) {
-			/* Increment ptep so unlock works on correct
-			 * pte.
-			 */
-			ptep++;
-			break;
+	if (vma->vm_file) {
+		for (; addr != next; ++ptep, addr += PAGE_SIZE) {
+			ret = hmm_rmem_unmap_file_page(rmem_mm, addr,
+						       ptep, pmdp);
+			if (ret || rmem_mm->force_flush) {
+				/* Increment ptep so unlock works on correct
+				 * pte.
+				 */
+				ptep++;
+				break;
+			}
+		}
+	} else {
+		for (; addr != next; ++ptep, addr += PAGE_SIZE) {
+			ret = hmm_rmem_unmap_anon_page(rmem_mm, addr,
+						       ptep, pmdp);
+			if (ret || rmem_mm->force_flush) {
+				/* Increment ptep so unlock works on correct
+				 * pte.
+				 */
+				ptep++;
+				break;
+			}
 		}
 	}
 	arch_leave_lazy_mmu_mode();
@@ -1321,6 +1512,7 @@ static int hmm_rmem_unmap_anon(struct hmm_rmem *rmem,
 
 	npages = (laddr - faddr) >> PAGE_SHIFT;
 	rmem->pgoff = faddr;
+	rmem->mapping = NULL;
 	rmem_mm.vma = vma;
 	rmem_mm.rmem = rmem;
 	rmem_mm.faddr = faddr;
@@ -1362,13 +1554,433 @@ static int hmm_rmem_unmap_anon(struct hmm_rmem *rmem,
 	return ret;
 }
 
+static int hmm_rmem_unmap_file(struct hmm_rmem *rmem,
+			       struct vm_area_struct *vma,
+			       unsigned long faddr,
+			       unsigned long laddr)
+{
+	struct address_space *mapping;
+	struct hmm_rmem_mm rmem_mm;
+	struct mm_walk walk = {0};
+	unsigned long addr, i, index, npages, uid;
+	struct page *page, *tmp;
+	int ret;
+
+	npages = hmm_rmem_npages(rmem);
+	rmem->pgoff = vma->vm_pgoff + ((faddr - vma->vm_start) >> PAGE_SHIFT);
+	rmem->mapping = vma->vm_file->f_mapping;
+	rmem_mm.vma = vma;
+	rmem_mm.rmem = rmem;
+	rmem_mm.faddr = faddr;
+	rmem_mm.laddr = faddr;
+	rmem_mm.fuid = rmem->fuid;
+	INIT_LIST_HEAD(&rmem_mm.remap_pages);
+	memset(rmem->pfns, 0, sizeof(long) * npages);
+
+	i = 0;
+	uid = rmem->fuid;
+	addr = faddr;
+	index = rmem->pgoff >> (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	mapping = rmem->mapping;
+
+	/* Probably the most complex part of the code as it needs to serialize
+	 * against various memory and filesystem events. The range we are
+	 * trying to migrate can be undergoing writeback, direct_IO, read,
+	 * write, or simply mm events such as page reclaim, page migration, ...
+	 *
+	 * We need to get exclusive access to all the pages in the range so
+	 * that no other process accesses them or tries to do anything with
+	 * them. The trick is to set page->mapping to NULL so that anyone with
+	 * a reference on the page will think that the page was either
+	 * reclaimed, migrated or truncated. Any code that sees that either
+	 * skips the page or retries a find_get_page, which will return the
+	 * hmm special swap value.
+	 *
+	 * This is a multistep process. First we update the pagecache to point
+	 * to the special hmm swap entry so that any new event coming in sees
+	 * it and can block the migration. While updating the pagecache we
+	 * also make sure it is fully populated. We also trylock every page we
+	 * can so that no other process can lock them for write, direct_IO or
+	 * anything else that requires the page lock.
+	 *
+	 * Once the pagecache is updated we proceed to lock all the unlocked
+	 * pages and to isolate them from the lru, as we do not want any of
+	 * them to be reclaimed while doing the migration. We also make sure
+	 * each page is Uptodate and read it back from disk if not.
+	 *
+	 * The next step is to unmap the range from the process address space
+	 * for which the migration is happening. We do so because we need to
+	 * account all the pages against this process so that on migration
+	 * back the unaccounting can be done consistently.
+	 *
+	 * Finally the last step is to unmap from all other processes. After
+	 * this the only thing that can still be happening is that some pages
+	 * are undergoing read or writeback, both of which are fine.
+	 *
+	 * To know up to which step exactly each page went we use various hmm
+	 * pfn flags so that the error handling code can take proper action to
+	 * restore each page to its original state.
+	 */
+
+retry:
+	if (rmem->event->backoff) {
+		npages = i;
+		ret = -EBUSY;
+		goto out;
+	}
+	spin_lock_irq(&mapping->tree_lock);
+	for (; i < npages; ++i, ++uid, ++index, addr += PAGE_SIZE){
+		void *item, **slotp;
+		int error;
+
+		slotp = radix_tree_lookup_slot(&mapping->page_tree, index);
+		if (!slotp) {
+			spin_unlock_irq(&mapping->tree_lock);
+			page = page_cache_alloc_cold(mapping);
+			if (!page) {
+				npages = i;
+				ret = -ENOMEM;
+				goto out;
+			}
+			ret = add_to_page_cache_lru(page, mapping,
+						    index, GFP_KERNEL);
+			if (ret) {
+				page_cache_release(page);
+				if (ret == -EEXIST) {
+					goto retry;
+				}
+				npages = i;
+				goto out;
+			}
+			/* A previous I/O error may have been due to temporary
+			 * failures, eg. multipath errors. PG_error will be set
+			 * again if readpage fails.
+			 *
+			 * FIXME I do not think this is necessary.
+			 */
+			ClearPageError(page);
+			/* Start the read. The read will unlock the page. */
+			error = mapping->a_ops->readpage(vma->vm_file, page);
+			page_cache_release(page);
+			if (error) {
+				npages = i;
+				ret = -EBUSY;
+				goto out;
+			}
+			goto retry;
+		}
+		item = radix_tree_deref_slot_protected(slotp,
+						       &mapping->tree_lock);
+		if (radix_tree_exceptional_entry(item)) {
+			swp_entry_t entry = radix_to_swp_entry(item);
+
+			/* The case of a private mapping of a file makes things
+			 * interesting as both shared and private anonymous
+			 * pages can exist in such an rmem object.
+			 *
+			 * For now we just force them back to lmem; supporting
+			 * it would require another level of indirection.
+			 */
+			if (!is_hmm_entry(entry)) {
+				spin_unlock_irq(&mapping->tree_lock);
+				npages = i;
+				ret = -EBUSY;
+				goto out;
+			}
+			/* FIXME handle shmem swap entry or some other device
+			 */
+			spin_unlock_irq(&mapping->tree_lock);
+			npages = i;
+			ret = -EBUSY;
+			goto out;
+		}
+		page = item;
+		if (unlikely(PageMlocked(page))) {
+			spin_unlock_irq(&mapping->tree_lock);
+			npages = i;
+			ret = -EBUSY;
+			goto out;
+		}
+		item = swp_to_radix_entry(make_hmm_entry(uid));
+		radix_tree_replace_slot(slotp, item);
+		rmem->pfns[i] = page_to_pfn(page) << HMM_PFN_SHIFT;
+		set_bit(HMM_PFN_VALID_PAGE, &rmem->pfns[i]);
+		set_bit(HMM_PFN_FILE, &rmem->pfns[i]);
+		rmem_mm.laddr = addr + PAGE_SIZE;
+
+		/* Pretend the page is mapped, it makes the error handling
+		 * code a lot simpler and cleaner.
+		 */
+		page_add_file_rmap(page);
+		add_mm_counter(vma->vm_mm, MM_FILEPAGES, 1);
+
+		if (trylock_page(page)) {
+			set_bit(HMM_PFN_LOCK, &rmem->pfns[i]);
+			if (page->mapping != mapping) {
+				/* Page has been truncated. */
+				spin_unlock_irq(&mapping->tree_lock);
+				npages = i;
+				ret = -EBUSY;
+				goto out;
+			}
+		}
+		if (PageWriteback(page)) {
+			set_bit(HMM_PFN_WRITEBACK, &rmem->pfns[i]);
+		}
+	}
+	spin_unlock_irq(&mapping->tree_lock);
+
+	/* At this point any unlocked page can still be referenced by various
+	 * file activities (read, write, splice, ...). But no new mapping can
+	 * be instantiated as the pagecache now holds the special entry.
+	 */
+
+	if (rmem->event->backoff) {
+		ret = -EBUSY;
+		goto out;
+	}
+
+	for (i = 0; i < npages; ++i) {
+		page = hmm_pfn_to_page(rmem->pfns[i]);
+		ret = isolate_lru_page(page);
+		if (ret) {
+			goto out;
+		}
+		/* Isolate takes an extra reference which we do not want, as
+		 * we are already holding a reference on the page. Holding only
+		 * one reference simplifies the error code path, which then
+		 * knows that we hold exactly one reference for each page and
+		 * does not need to know whether an extra reference from
+		 * isolate_lru_page is held or not.
+		 */
+		put_page(page);
+		if (!test_bit(HMM_PFN_LOCK, &rmem->pfns[i])) {
+			lock_page(page);
+			set_bit(HMM_PFN_LOCK, &rmem->pfns[i]);
+			/* Has the page been truncated ? */
+			if (page->mapping != mapping) {
+				ret = -EBUSY;
+				goto out;
+			}
+		}
+		if (unlikely(!PageUptodate(page))) {
+			int error;
+
+			/* A previous I/O error may have been due to temporary
+			 * failures, eg. multipath errors. PG_error will be set
+			 * again if readpage fails.
+			 */
+			ClearPageError(page);
+			/* The read will unlock the page which is ok because no
+			 * one else knows about this page at this point.
+			 */
+			error = mapping->a_ops->readpage(vma->vm_file, page);
+			if (error) {
+				ret = -EBUSY;
+				goto out;
+			}
+			lock_page(page);
+		}
+		set_bit(HMM_PFN_LMEM_UPTODATE, &rmem->pfns[i]);
+	}
+
+	/* At this point all page are lock which means that the page content is
+	 * stable. Because we will reset the page->mapping field we also know
+	 * that anyone holding a reference on the page will retry to find the
+	 * page or skip current operations.
+	 *
+	 * Also at this point no one can be unmapping those pages from the vma
+	 * as the hmm event prevent any mmu_notifier invalidation to proceed
+	 * until we are done.
+	 *
+	 * We need to unmap page from the vma ourself so we can properly update
+	 * the mm counter.
+	 */
+
+	if (rmem->event->backoff) {
+		ret = -EBUSY;
+		goto out;
+	}
+
+	if (current->mm == vma->vm_mm) {
+		sync_mm_rss(vma->vm_mm);
+	}
+	rmem_mm.force_flush = 0;
+	walk.pmd_entry = hmm_rmem_unmap_pmd;
+	walk.mm = vma->vm_mm;
+	walk.private = &rmem_mm;
+
+	mmu_notifier_invalidate_range_start(walk.mm, vma, faddr, laddr, MMU_HMM);
+	tlb_gather_mmu(&rmem_mm.tlb, walk.mm, faddr, laddr);
+	tlb_start_vma(&rmem_mm.tlb, rmem_mm.vma);
+	ret = walk_page_range(faddr, laddr, &walk);
+	tlb_end_vma(&rmem_mm.tlb, rmem_mm.vma);
+	tlb_finish_mmu(&rmem_mm.tlb, faddr, laddr);
+	mmu_notifier_invalidate_range_end(walk.mm, vma, faddr, laddr, MMU_HMM);
+
+	/* Remap any pages that were replaced by anonymous page. */
+	list_for_each_entry_safe (page, tmp, &rmem_mm.remap_pages, lru) {
+		hmm_rmem_remap_file_single_page(rmem, page);
+	}
+
+	if (ret) {
+		npages = (rmem_mm.laddr - rmem_mm.faddr) >> PAGE_SHIFT;
+		goto out;
+	}
+
+	/* Now unmap from all other process. */
+
+	if (rmem->event->backoff) {
+		ret = -EBUSY;
+		goto out;
+	}
+
+	for (i = 0, ret = 0; i < npages; ++i) {
+		page = hmm_pfn_to_page(rmem->pfns[i]);
+
+		if (!test_bit(HMM_PFN_FILE, &rmem->pfns[i])) {
+			continue;
+		}
+
+		/* Because we did call page_add_file_rmap the mapcount must be
+		 * at least one. This was done to avoid page_remove_rmap
+		 * updating the memcg and mm statistics.
+		 */
+		BUG_ON(page_mapcount(page) <= 0);
+		if (page_mapcount(page) > 1) {
+			try_to_unmap(page,
+					 TTU_HMM |
+					 TTU_IGNORE_MLOCK |
+					 TTU_IGNORE_ACCESS);
+			if (page_mapcount(page) > 1) {
+				ret = ret ? ret : -EBUSY;
+			} else {
+				/* Everyone will think the page has been
+				 * migrated, truncated or reclaimed.
+				 */
+				page->mapping = NULL;
+			}
+		} else {
+			/* Everyone will think the page has been migrated,
+			 * truncated or reclaimed.
+			 */
+			page->mapping = NULL;
+		}
+		/* At this point no one else can write to the page. Save the
+		 * dirty bit and check it when handling faults.
+		 */
+		if (PageDirty(page)) {
+			set_bit(HMM_PFN_DIRTY, &rmem->pfns[i]);
+			ClearPageDirty(page);
+		}
+	}
+
+	/* This was a long journey but at this point hmm has exclusive
+	 * ownership of all the pages, all of them are accounted against the
+	 * process mm, and all are Uptodate and ready to be copied to remote
+	 * memory.
+	 */
+out:
+	if (ret) {
+		/* Unaccount any unmapped pages. */
+		for (i = 0; i < npages; ++i) {
+			if (test_bit(HMM_PFN_FILE, &rmem->pfns[i])) {
+				add_mm_counter(walk.mm, MM_FILEPAGES, -1);
+			}
+		}
+	}
+	return ret;
+}
+
+static int hmm_rmem_file_mkwrite(struct hmm_rmem *rmem,
+				 struct vm_area_struct *vma,
+				 unsigned long addr,
+				 unsigned long uid)
+{
+	struct vm_fault vmf;
+	unsigned long idx = uid - rmem->fuid;
+	struct page *page;
+	int r;
+
+	page = hmm_pfn_to_page(rmem->pfns[idx]);
+	if (test_bit(HMM_PFN_FS_WRITEABLE, &rmem->pfns[idx])) {
+		lock_page(page);
+		page->mapping = rmem->mapping;
+		goto release;
+	}
+
+	vmf.virtual_address = (void __user *)(addr & PAGE_MASK);
+	vmf.pgoff = page->index;
+	vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
+	vmf.page = page;
+	page->mapping = rmem->mapping;
+	page_cache_get(page);
+
+	r = vma->vm_ops->page_mkwrite(vma, &vmf);
+	if (unlikely(r & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
+		page_cache_release(page);
+		return -EFAULT;
+	}
+	if (unlikely(!(r & VM_FAULT_LOCKED))) {
+		lock_page(page);
+		if (!page->mapping) {
+			WARN_ONCE(1, "hmm: page cannot be truncated while in rmem!\n");
+			unlock_page(page);
+			page_cache_release(page);
+			return -EFAULT;
+		}
+	}
+	set_bit(HMM_PFN_FS_WRITEABLE, &rmem->pfns[idx]);
+	/* Ok to put_page here as we hold another reference. */
+	page_cache_release(page);
+
+release:
+	/* We clear the writeback bit now to forbid any new writeback. The
+	 * writeback code will need to go through its slow path to set the
+	 * writeback flag again.
+	 */
+	clear_bit(HMM_PFN_WRITEBACK, &rmem->pfns[idx]);
+	/* Now wait for any in progress writeback. */
+	if (PageWriteback(page)) {
+		wait_on_page_writeback(page);
+	}
+	/* The page count is what we use to synchronize with writeback. The
+	 * writeback code takes an extra reference on pages before handing
+	 * them to the fs writeback code, so if we see that extra reference
+	 * here we forbid the change.
+	 *
+	 * However, as we just waited for pending writeback above, if a
+	 * writeback was already scheduled it is done by now and it should
+	 * have dropped the extra reference, thus the rmem can be written to
+	 * again.
+	 */
+	if (page_count(page) > (1 + page_has_private(page))) {
+		page->mapping = NULL;
+		unlock_page(page);
+		return -EBUSY;
+	}
+	/* Nobody should have written to that page, thus nobody should have
+	 * set the dirty bit.
+	 */
+	BUG_ON(PageDirty(page));
+
+	/* Restore page count. */
+	page->mapping = NULL;
+	clear_bit(HMM_PFN_LMEM_UPTODATE, &rmem->pfns[idx]);
+	/* Ok now device can write to rmem. */
+	set_bit(HMM_PFN_WRITE, &rmem->pfns[idx]);
+	unlock_page(page);
+
+	return 0;
+}
+
 static inline int hmm_rmem_unmap(struct hmm_rmem *rmem,
 				 struct vm_area_struct *vma,
 				 unsigned long faddr,
 				 unsigned long laddr)
 {
 	if (vma->vm_file) {
-		return -EBUSY;
+		return hmm_rmem_unmap_file(rmem, vma, faddr, laddr);
 	} else {
 		return hmm_rmem_unmap_anon(rmem, vma, faddr, laddr);
 	}
@@ -1402,6 +2014,34 @@ static int hmm_rmem_alloc_pages(struct hmm_rmem *rmem,
 			vma = mm ? find_vma(mm, addr) : NULL;
 		}
 
+		page = hmm_pfn_to_page(pfns[i]);
+		if (page && test_bit(HMM_PFN_VALID_PAGE, &pfns[i])) {
+			BUG_ON(test_bit(HMM_PFN_LOCK, &pfns[i]));
+			lock_page(page);
+			set_bit(HMM_PFN_LOCK, &pfns[i]);
+
+			/* Fake one mapping so that page_remove_rmap behaves as
+			 * we want.
+			 */
+			BUG_ON(page_mapcount(page));
+			atomic_set(&page->_mapcount, 0);
+
+			spin_lock(&rmem->lock);
+			if (test_bit(HMM_PFN_WRITEBACK, &pfns[i])) {
+				/* Clear the bit first, it is fine because any
+				 * thread that will test the bit will first
+				 * check the rmem->event and at this point it
+				 * is set to the migration event.
+				 */
+				clear_bit(HMM_PFN_WRITEBACK, &pfns[i]);
+				spin_unlock(&rmem->lock);
+				wait_on_page_writeback(page);
+			} else {
+				spin_unlock(&rmem->lock);
+			}
+			continue;
+		}
+
 		/* No need to clear page they will be dma to of course this does
 		 * means we trust the device driver.
 		 */
@@ -1482,7 +2122,7 @@ int hmm_rmem_migrate_to_lmem(struct hmm_rmem *rmem,
 						 range->laddr,
 						 range->fuid,
 						 HMM_MIGRATE_TO_LMEM,
-						 false);
+						 !!(range->rmem->mapping));
 		if (IS_ERR(fence)) {
 			ret = PTR_ERR(fence);
 			goto error;
@@ -1517,6 +2157,19 @@ int hmm_rmem_migrate_to_lmem(struct hmm_rmem *rmem,
 		}
 	}
 
+	/* Sanity check the driver. */
+	for (i = 0; i < hmm_rmem_npages(rmem); ++i) {
+		if (!test_bit(HMM_PFN_LMEM_UPTODATE, &rmem->pfns[i])) {
+			WARN_ONCE(1, "hmm: driver failed to set HMM_PFN_LMEM_UPTODATE.\n");
+			ret = -EINVAL;
+			goto error;
+		}
+	}
+
+	if (rmem->mapping) {
+		hmm_rmem_remap_file(rmem);
+	}
+
 	/* Now the remote memory is officialy dead and nothing below can fails
 	 * badly.
 	 */
@@ -1526,6 +2179,13 @@ int hmm_rmem_migrate_to_lmem(struct hmm_rmem *rmem,
 	 * ranges list.
 	 */
 	list_for_each_entry_safe (range, next, &rmem->ranges, rlist) {
+		if (rmem->mapping) {
+			add_mm_counter(range->mirror->hmm->mm, MM_FILEPAGES,
+				       -hmm_range_npages(range));
+			hmm_range_fini(range);
+			continue;
+		}
+
 		VM_BUG_ON(!vma);
 		VM_BUG_ON(range->faddr < vma->vm_start);
 		VM_BUG_ON(range->laddr > vma->vm_end);
@@ -1544,8 +2204,20 @@ int hmm_rmem_migrate_to_lmem(struct hmm_rmem *rmem,
 	for (i = 0; i < hmm_rmem_npages(rmem); ++i) {
 		struct page *page = hmm_pfn_to_page(rmem->pfns[i]);
 
-		unlock_page(page);
-		mem_cgroup_transfer_charge_anon(page, mm);
+		/* The HMM_PFN_FILE bit is only set for pages that are in the
+		 * pagecache and thus are already accounted properly. So when
+		 * it is unset this is a private anonymous page for which we
+		 * need to transfer the charge.
+		 *
+		 * If remapping failed then the page_remove_rmap below will
+		 * update the memcg and mm properly.
+		 */
+		if (mm && !test_bit(HMM_PFN_FILE, &rmem->pfns[i])) {
+			mem_cgroup_transfer_charge_anon(page, mm);
+		}
+		if (test_bit(HMM_PFN_LOCK, &rmem->pfns[i])) {
+			unlock_page(page);
+		}
 		page_remove_rmap(page);
 		page_cache_release(page);
 		rmem->pfns[i] = 0UL;
@@ -1563,6 +2235,19 @@ error:
 	 * (2) rmem is mirroring private memory, easy case poison all ranges
 	 *     referencing the rmem.
 	 */
+	if (rmem->mapping) {
+		/* No matter what, try to copy the data back; the driver should
+		 * be clever and not copy over pages that have the
+		 * HMM_PFN_LMEM_UPTODATE bit set.
+		 */
+		fence = device->ops->rmem_to_lmem(rmem, rmem->fuid, rmem->luid);
+		if (fence && !IS_ERR(fence)) {
+			INIT_LIST_HEAD(&fence->list);
+			ret = hmm_device_fence_wait(device, fence);
+		}
+		/* FIXME how to handle error ? Mark page with error ? */
+		hmm_rmem_remap_file(rmem);
+	}
 	for (i = 0; i < hmm_rmem_npages(rmem); ++i) {
 		struct page *page = hmm_pfn_to_page(rmem->pfns[i]);
 
@@ -1573,9 +2258,11 @@ error:
 			}
 			continue;
 		}
-		/* Properly uncharge memory. */
-		mem_cgroup_transfer_charge_anon(page, mm);
-		if (!test_bit(HMM_PFN_LOCK, &rmem->pfns[i])) {
+		if (!test_bit(HMM_PFN_FILE, &rmem->pfns[i])) {
+			/* Properly uncharge memory. */
+			mem_cgroup_transfer_charge_anon(page, mm);
+		}
+		if (test_bit(HMM_PFN_LOCK, &rmem->pfns[i])) {
 			unlock_page(page);
 		}
 		page_remove_rmap(page);
@@ -1583,6 +2270,15 @@ error:
 		rmem->pfns[i] = 0UL;
 	}
 	list_for_each_entry_safe (range, next, &rmem->ranges, rlist) {
+		/* FIXME Philosophical question: should we poison other processes that access this shared file? */
+		if (rmem->mapping) {
+			add_mm_counter(range->mirror->hmm->mm, MM_FILEPAGES,
+				       -hmm_range_npages(range));
+			/* Case (1) FIXME implement ! */
+			hmm_range_fini(range);
+			continue;
+		}
+
 		mm = range->mirror->hmm->mm;
 		hmm_rmem_poison_range(rmem, mm, NULL, range->faddr,
 				      range->laddr, range->fuid);
@@ -2063,6 +2759,268 @@ int hmm_mm_fault(struct mm_struct *mm,
 	return VM_FAULT_MAJOR;
 }
 
+/* see include/linux/hmm.h */
+void hmm_pagecache_migrate(struct address_space *mapping,
+			   swp_entry_t swap)
+{
+	struct hmm_rmem *rmem = NULL;
+	unsigned long fuid, luid, npages;
+
+	/* This can not happen ! */
+	VM_BUG_ON(!is_hmm_entry(swap));
+
+	fuid = hmm_entry_uid(swap);
+	VM_BUG_ON(!fuid);
+
+	rmem = hmm_rmem_find(fuid);
+	if (!rmem || rmem->dead) {
+		hmm_rmem_unref(rmem);
+		return;
+	}
+
+	/* FIXME use something other than 16 pages. Readahead? Or just the whole range of dirty pages? */
+	npages = 16;
+	luid = min((fuid - rmem->fuid), (npages >> 2));
+	fuid = fuid - luid;
+	luid = min(fuid + npages, rmem->luid);
+
+	hmm_rmem_migrate_to_lmem(rmem, NULL, 0, fuid, luid, true);
+	hmm_rmem_unref(rmem);
+}
+EXPORT_SYMBOL(hmm_pagecache_migrate);
+
+/* see include/linux/hmm.h */
+struct page *hmm_pagecache_writeback(struct address_space *mapping,
+				     swp_entry_t swap)
+{
+	struct hmm_device *device;
+	struct hmm_range *range, *nrange;
+	struct hmm_fence *fence, *nfence;
+	struct hmm_event event;
+	struct hmm_rmem *rmem = NULL;
+	unsigned long i, uid, idx, npages;
+	/* FIXME hardcoded 16 */
+	struct page *pages[16];
+	bool dirty = false;
+	int ret;
+
+	/* Find the corresponding rmem. */
+	if (!is_hmm_entry(swap)) {
+		BUG();
+		return NULL;
+	}
+	uid = hmm_entry_uid(swap);
+	if (!uid) {
+		/* Poisonous hmm swap entry this can not happen. */
+		BUG();
+		return NULL;
+	}
+
+retry:
+	rmem = hmm_rmem_find(uid);
+	if (!rmem) {
+		/* Someone likely migrated it back to lmem; by returning NULL
+		 * we make the caller perform a new lookup.
+		 */
+		return NULL;
+	}
+
+	if (rmem->dead) {
+		/* When dead is set everything is done. */
+		hmm_rmem_unref(rmem);
+		return NULL;
+	}
+
+	idx = uid - rmem->fuid;
+	device = rmem->device;
+	spin_lock(&rmem->lock);
+	if (rmem->event) {
+		if (rmem->event->etype == HMM_MIGRATE_TO_RMEM) {
+			rmem->event->backoff = true;
+		}
+		spin_unlock(&rmem->lock);
+		wait_event(device->wait_queue, rmem->event == NULL);
+		hmm_rmem_unref(rmem);
+		goto retry;
+	}
+	pages[0] =  hmm_pfn_to_page(rmem->pfns[idx]);
+	if (!pages[0]) {
+		spin_unlock(&rmem->lock);
+		hmm_rmem_unref(rmem);
+		goto retry;
+	}
+	get_page(pages[0]);
+	if (!trylock_page(pages[0])) {
+		unsigned long orig = rmem->pfns[idx];
+
+		spin_unlock(&rmem->lock);
+		lock_page(pages[0]);
+		spin_lock(&rmem->lock);
+		if (rmem->pfns[idx] != orig) {
+			spin_unlock(&rmem->lock);
+			unlock_page(pages[0]);
+			page_cache_release(pages[0]);
+			hmm_rmem_unref(rmem);
+			goto retry;
+		}
+	}
+	if (test_bit(HMM_PFN_LMEM_UPTODATE, &rmem->pfns[idx])) {
+		dirty = test_bit(HMM_PFN_DIRTY, &rmem->pfns[idx]);
+		set_bit(HMM_PFN_WRITEBACK, &rmem->pfns[idx]);
+		spin_unlock(&rmem->lock);
+		hmm_rmem_unref(rmem);
+		if (dirty) {
+			set_page_dirty(pages[0]);
+		}
+		return pages[0];
+	}
+
+	if (rmem->event) {
+		spin_unlock(&rmem->lock);
+		unlock_page(pages[0]);
+		page_cache_release(pages[0]);
+		wait_event(device->wait_queue, rmem->event == NULL);
+		hmm_rmem_unref(rmem);
+		goto retry;
+	}
+
+	/* Try to batch a few pages. */
+	/* FIXME use something other than 16 pages. Readahead? Or just the whole range of dirty pages? */
+	npages = 16;
+	set_bit(HMM_PFN_WRITEBACK, &rmem->pfns[idx]);
+	for (i = 1; i < npages; ++i) {
+		pages[i] = hmm_pfn_to_page(rmem->pfns[idx + i]);
+		if (!trylock_page(pages[i])) {
+			npages = i;
+			break;
+		}
+		if (test_bit(HMM_PFN_LMEM_UPTODATE, &rmem->pfns[idx + i])) {
+			unlock_page(pages[i]);
+			npages = i;
+			break;
+		}
+		set_bit(HMM_PFN_WRITEBACK, &rmem->pfns[idx + i]);
+		get_page(pages[i]);
+	}
+
+	event.etype = HMM_WRITEBACK;
+	event.faddr = uid;
+	event.laddr = uid + npages;
+	rmem->event = &event;
+	INIT_LIST_HEAD(&event.ranges);
+	list_for_each_entry (range, &rmem->ranges, rlist) {
+		list_add_tail(&range->elist, &event.ranges);
+	}
+	spin_unlock(&rmem->lock);
+
+	list_for_each_entry (range, &event.ranges, elist) {
+		unsigned long fuid, faddr, laddr;
+
+		if (event.laddr <  hmm_range_fuid(range) ||
+		    event.faddr >= hmm_range_luid(range)) {
+			continue;
+		}
+		fuid  = max(event.faddr, hmm_range_fuid(range));
+		faddr = fuid - hmm_range_fuid(range);
+		laddr = min(event.laddr, hmm_range_luid(range)) - fuid;
+		faddr = range->faddr + (faddr << PAGE_SHIFT);
+		laddr = range->faddr + (laddr << PAGE_SHIFT);
+		ret = hmm_mirror_rmem_update(range->mirror, rmem, faddr,
+					     laddr, fuid, &event, true);
+		if (ret) {
+			goto error;
+		}
+	}
+
+	list_for_each_entry_safe (fence, nfence, &event.fences, list) {
+		hmm_device_fence_wait(device, fence);
+	}
+
+	/* Event faddr is fuid and laddr is luid. */
+	fence = device->ops->rmem_to_lmem(rmem, event.faddr, event.laddr);
+	if (IS_ERR(fence)) {
+		goto error;
+	}
+	INIT_LIST_HEAD(&fence->list);
+	ret = hmm_device_fence_wait(device, fence);
+	if (ret) {
+		goto error;
+	}
+
+	spin_lock(&rmem->lock);
+	if (!test_bit(HMM_PFN_LMEM_UPTODATE, &rmem->pfns[idx + i])) {
+		/* This should not happen, the driver must set the bit. */
+		WARN_ONCE(1, "hmm: driver failed to set HMM_PFN_LMEM_UPTODATE.\n");
+		goto error;
+	}
+	rmem->event = NULL;
+	dirty = test_bit(HMM_PFN_DIRTY, &rmem->pfns[idx]);
+	list_for_each_entry_safe (range, nrange, &event.ranges, elist) {
+		list_del_init(&range->elist);
+	}
+	spin_unlock(&rmem->lock);
+	/* Do not unlock first page, return it locked. */
+	for (i = 1; i < npages; ++i) {
+		unlock_page(pages[i]);
+		page_cache_release(pages[i]);
+	}
+	wake_up(&device->wait_queue);
+	hmm_rmem_unref(rmem);
+	if (dirty) {
+		set_page_dirty(pages[0]);
+	}
+	return pages[0];
+
+error:
+	for (i = 0; i < npages; ++i) {
+		unlock_page(pages[i]);
+		page_cache_release(pages[i]);
+	}
+	spin_lock(&rmem->lock);
+	rmem->event = NULL;
+	list_for_each_entry_safe (range, nrange, &event.ranges, elist) {
+		list_del_init(&range->elist);
+	}
+	spin_unlock(&rmem->lock);
+	hmm_rmem_unref(rmem);
+	hmm_pagecache_migrate(mapping, swap);
+	return NULL;
+}
+EXPORT_SYMBOL(hmm_pagecache_writeback);
+
+struct page *hmm_pagecache_page(struct address_space *mapping,
+				swp_entry_t swap)
+{
+	struct hmm_rmem *rmem = NULL;
+	struct page *page;
+	unsigned long uid;
+
+	/* Find the corresponding rmem. */
+	if (!is_hmm_entry(swap)) {
+		BUG();
+		return NULL;
+	}
+	uid = hmm_entry_uid(swap);
+	if (!uid) {
+		/* Poisonous hmm swap entry this can not happen. */
+		BUG();
+		return NULL;
+	}
+
+	rmem = hmm_rmem_find(uid);
+	if (!rmem) {
+		/* Someone likely migrated it back to lmem; by returning NULL
+		 * we make the caller perform a new lookup.
+		 */
+		return NULL;
+	}
+
+	page = hmm_pfn_to_page(rmem->pfns[uid - rmem->fuid]);
+	get_page(page);
+	hmm_rmem_unref(rmem);
+	return page;
+}
+
 
 
 
@@ -2667,7 +3625,7 @@ static int hmm_mirror_rmem_fault(struct hmm_mirror *mirror,
 {
 	struct hmm_device *device = mirror->device;
 	struct hmm_rmem *rmem = range->rmem;
-	unsigned long fuid, luid, npages;
+	unsigned long i, fuid, luid, npages, uid;
 	int ret;
 
 	if (range->mirror != mirror) {
@@ -2679,6 +3637,77 @@ static int hmm_mirror_rmem_fault(struct hmm_mirror *mirror,
 	fuid = range->fuid + ((faddr - range->faddr) >> PAGE_SHIFT);
 	luid = fuid + npages;
 
+	/* The rmem might not be uptodate so synchronize again. The only way
+	 * this might be the case is if a previous mkwrite failed and the
+	 * device decided to use the local memory copy.
+	 */
+	i = fuid - rmem->fuid;
+	for (uid = fuid; uid < luid; ++uid, ++i) {
+		if (!test_bit(HMM_PFN_RMEM_UPTODATE, &rmem->pfns[i])) {
+			struct hmm_fence *fence, *nfence;
+			enum hmm_etype etype = event->etype;
+
+			event->etype = HMM_UNMAP;
+			ret = hmm_mirror_rmem_update(mirror, rmem, range->faddr,
+						     range->laddr, range->fuid,
+						     event, true);
+			event->etype = etype;
+			if (ret) {
+				return ret;
+			}
+			list_for_each_entry_safe (fence, nfence,
+						  &event->fences, list) {
+				hmm_device_fence_wait(device, fence);
+			}
+			fence = device->ops->lmem_to_rmem(rmem, range->fuid,
+							  hmm_range_luid(range));
+			if (IS_ERR(fence)) {
+				return PTR_ERR(fence);
+			}
+			ret = hmm_device_fence_wait(device, fence);
+			if (ret) {
+				return ret;
+			}
+			break;
+		}
+	}
+
+	if (write && rmem->mapping) {
+		unsigned long addr;
+
+		if (current->mm == vma->vm_mm) {
+			sync_mm_rss(vma->vm_mm);
+		}
+		i = fuid - rmem->fuid;
+		addr = faddr;
+		for (uid = fuid; uid < luid; ++uid, ++i, addr += PAGE_SIZE) {
+			if (test_bit(HMM_PFN_WRITE, &rmem->pfns[i])) {
+				continue;
+			}
+			if (vma->vm_flags & VM_SHARED) {
+				ret = hmm_rmem_file_mkwrite(rmem, vma, addr, uid);
+				if (ret && ret != -EBUSY) {
+					return ret;
+				}
+			} else {
+				struct mm_struct *mm = vma->vm_mm;
+				struct page *page;
+
+				/* COW */
+				if (mem_cgroup_charge_anon(NULL, mm, GFP_KERNEL)) {
+					return -ENOMEM;
+				}
+				add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
+				spin_lock(&rmem->lock);
+				page = hmm_pfn_to_page(rmem->pfns[i]);
+				rmem->pfns[i] = 0;
+				set_bit(HMM_PFN_WRITE, &rmem->pfns[i]);
+				spin_unlock(&rmem->lock);
+				hmm_rmem_remap_file_single_page(rmem, page);
+			}
+		}
+	}
+
 	ret = device->ops->rmem_fault(mirror, rmem, faddr, laddr, fuid, fault);
 	return ret;
 }
@@ -2951,7 +3980,10 @@ static void hmm_migrate_abort(struct hmm_mirror *mirror,
 					      faddr, laddr, fuid);
 		}
 	} else {
-		BUG();
+		rmem.pgoff = vma->vm_pgoff;
+		rmem.pgoff += ((fault->faddr - vma->vm_start) >> PAGE_SHIFT);
+		rmem.mapping = vma->vm_file->f_mapping;
+		hmm_rmem_remap_file(&rmem);
 	}
 
 	/* Ok officialy dead. */
@@ -2977,6 +4009,15 @@ static void hmm_migrate_abort(struct hmm_mirror *mirror,
 			unlock_page(page);
 			clear_bit(HMM_PFN_LOCK, &pfns[i]);
 		}
+		if (test_bit(HMM_PFN_FILE, &pfns[i]) && !PageLRU(page)) {
+			/* To balance putback_lru_page and isolate_lru_page. As
+			 * a simplification we dropped the extra reference taken
+			 * by isolate_lru_page. This is why we need to take an
+			 * extra reference here for putback_lru_page.
+			 */
+			get_page(page);
+			putback_lru_page(page);
+		}
 		page_remove_rmap(page);
 		page_cache_release(page);
 		pfns[i] = 0;
@@ -2988,6 +4029,7 @@ int hmm_migrate_lmem_to_rmem(struct hmm_fault *fault,
 			     struct hmm_mirror *mirror)
 {
 	struct vm_area_struct *vma;
+	struct address_space *mapping;
 	struct hmm_device *device;
 	struct hmm_range *range;
 	struct hmm_fence *fence;
@@ -3042,7 +4084,8 @@ int hmm_migrate_lmem_to_rmem(struct hmm_fault *fault,
 		ret = -EACCES;
 		goto out;
 	}
-	if (vma->vm_file) {
+	mapping = vma->vm_file ? vma->vm_file->f_mapping : NULL;
+	if (vma->vm_file && !(mapping->a_ops->features & AOPS_FEATURE_HMM)) {
 		kfree(range);
 		range = NULL;
 		ret = -EBUSY;
@@ -3053,6 +4096,7 @@ int hmm_migrate_lmem_to_rmem(struct hmm_fault *fault,
 	event->laddr  =fault->laddr = min(fault->laddr, vma->vm_end);
 	npages = (fault->laddr - fault->faddr) >> PAGE_SHIFT;
 	fault->vma = vma;
+	rmem.mapping = (vma->vm_flags & VM_SHARED) ? mapping : NULL;
 
 	ret = hmm_rmem_alloc(&rmem, npages);
 	if (ret) {
@@ -3100,6 +4144,7 @@ int hmm_migrate_lmem_to_rmem(struct hmm_fault *fault,
 	hmm_rmem_tree_insert(fault->rmem, &_hmm_rmems);
 	fault->rmem->pfns = rmem.pfns;
 	range->rmem = fault->rmem;
+	fault->rmem->mapping = rmem.mapping;
 	list_del_init(&range->rlist);
 	list_add_tail(&range->rlist, &fault->rmem->ranges);
 	rmem.event = NULL;
@@ -3128,7 +4173,6 @@ int hmm_migrate_lmem_to_rmem(struct hmm_fault *fault,
 		struct page *page = hmm_pfn_to_page(rmem.pfns[i]);
 
 		if (test_bit(HMM_PFN_VALID_ZERO, &rmem.pfns[i])) {
-			rmem.pfns[i] = rmem.pfns[i] & HMM_PFN_CLEAR;
 			continue;
 		}
 		/* We only decrement now the page count so that cow happen
diff --git a/mm/madvise.c b/mm/madvise.c
index 539eeb9..7c13f8d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -202,6 +202,10 @@ static void force_shm_swapin_readahead(struct vm_area_struct *vma,
 			continue;
 		}
 		swap = radix_to_swp_entry(page);
+		if (is_hmm_entry(swap)) {
+			/* FIXME start migration here ? */
+			continue;
+		}
 		page = read_swap_cache_async(swap, GFP_HIGHUSER_MOVABLE,
 								NULL, 0);
 		if (page)
diff --git a/mm/mincore.c b/mm/mincore.c
index 725c809..107b870 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -79,6 +79,10 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff)
 		 */
 		if (radix_tree_exceptional_entry(page)) {
 			swp_entry_t swp = radix_to_swp_entry(page);
+
+			if (is_hmm_entry(swp)) {
+				return 1;
+			}
 			page = find_get_page(swap_address_space(swp), swp.val);
 		}
 	} else
@@ -86,6 +90,13 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff)
 #else
 	page = find_get_page(mapping, pgoff);
 #endif
+	if (radix_tree_exceptional_entry(page)) {
+		swp_entry_t swap = radix_to_swp_entry(page);
+
+		if (is_hmm_entry(swap)) {
+			return 1;
+		}
+	}
 	if (page) {
 		present = PageUptodate(page);
 		page_cache_release(page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 023cf08..b6dcf80 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -37,6 +37,7 @@
 #include <linux/timer.h>
 #include <linux/sched/rt.h>
 #include <linux/mm_inline.h>
+#include <linux/hmm.h>
 #include <trace/events/writeback.h>
 
 #include "internal.h"
@@ -1900,6 +1901,8 @@ retry:
 		tag_pages_for_writeback(mapping, index, end);
 	done_index = index;
 	while (!done && (index <= end)) {
+		pgoff_t save_index = index;
+		bool migrated = false;
 		int i;
 
 		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
@@ -1907,58 +1910,106 @@ retry:
 		if (nr_pages == 0)
 			break;
 
+		for (i = 0, migrated = false; i < nr_pages; i++) {
+			struct page *page = pvec.pages[i];
+
+			if (radix_tree_exceptional_entry(page)) {
+				swp_entry_t swap = radix_to_swp_entry(page);
+
+				/* This can not happen ! */
+				BUG_ON(!is_hmm_entry(swap));
+				page = hmm_pagecache_writeback(mapping, swap);
+				if (page == NULL) {
+					migrated = true;
+					pvec.pages[i] = NULL;
+				}
+			}
+		}
+
+		/* Some rmem was migrated, we need to redo the page cache lookup. */
+		if (migrated) {
+			for (i = 0; i < nr_pages; i++) {
+				struct page *page = pvec.pages[i];
+
+				if (page && radix_tree_exceptional_entry(page)) {
+					swp_entry_t swap = radix_to_swp_entry(page);
+
+					page = hmm_pagecache_page(mapping, swap);
+					unlock_page(page);
+					page_cache_release(page);
+					pvec.pages[i] = page;
+				}
+			}
+			pagevec_release(&pvec);
+			cond_resched();
+			index = save_index;
+			goto retry;
+		}
+
 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];
 
-			/*
-			 * At this point, the page may be truncated or
-			 * invalidated (changing page->mapping to NULL), or
-			 * even swizzled back from swapper_space to tmpfs file
-			 * mapping. However, page->index will not change
-			 * because we have a reference on the page.
-			 */
-			if (page->index > end) {
+			if (radix_tree_exceptional_entry(page)) {
+				swp_entry_t swap = radix_to_swp_entry(page);
+
+				pvec.pages[i] = page = hmm_pagecache_page(mapping, swap);
+				page_cache_release(page);
+				done_index = page->index;
+			} else {
 				/*
-				 * can't be range_cyclic (1st pass) because
-				 * end == -1 in that case.
+				 * At this point, the page may be truncated or
+				 * invalidated (changing page->mapping to NULL), or
+				 * even swizzled back from swapper_space to tmpfs file
+				 * mapping. However, page->index will not change
+				 * because we have a reference on the page.
 				 */
-				done = 1;
-				break;
-			}
+				if (page->index > end) {
+					/*
+					 * can't be range_cyclic (1st pass) because
+					 * end == -1 in that case.
+					 */
+					done = 1;
+					break;
+				}
 
-			done_index = page->index;
+				done_index = page->index;
 
-			lock_page(page);
+				lock_page(page);
 
-			/*
-			 * Page truncated or invalidated. We can freely skip it
-			 * then, even for data integrity operations: the page
-			 * has disappeared concurrently, so there could be no
-			 * real expectation of this data interity operation
-			 * even if there is now a new, dirty page at the same
-			 * pagecache address.
-			 */
-			if (unlikely(page->mapping != mapping)) {
-continue_unlock:
-				unlock_page(page);
-				continue;
+				/*
+				 * Page truncated or invalidated. We can freely skip it
+				 * then, even for data integrity operations: the page
+				 * has disappeared concurrently, so there could be no
+				 * real expectation of this data interity operation
+				 * even if there is now a new, dirty page at the same
+				 * pagecache address.
+				 */
+				if (unlikely(page->mapping != mapping)) {
+					unlock_page(page);
+					continue;
+				}
 			}
 
 			if (!PageDirty(page)) {
 				/* someone wrote it for us */
-				goto continue_unlock;
+				unlock_page(page);
+				continue;
 			}
 
 			if (PageWriteback(page)) {
-				if (wbc->sync_mode != WB_SYNC_NONE)
+				if (wbc->sync_mode != WB_SYNC_NONE) {
 					wait_on_page_writeback(page);
-				else
-					goto continue_unlock;
+				} else {
+					unlock_page(page);
+					continue;
+				}
 			}
 
 			BUG_ON(PageWriteback(page));
-			if (!clear_page_dirty_for_io(page))
-				goto continue_unlock;
+			if (!clear_page_dirty_for_io(page)) {
+				unlock_page(page);
+				continue;
+			}
 
 			trace_wbc_writepage(wbc, mapping->backing_dev_info);
 			ret = (*writepage)(page, wbc, data);
@@ -1994,6 +2045,20 @@ continue_unlock:
 				break;
 			}
 		}
+
+		/* Some entries of pvec might still be exceptional! */
+		for (i = 0; i < nr_pages; i++) {
+			struct page *page = pvec.pages[i];
+
+			if (radix_tree_exceptional_entry(page)) {
+				swp_entry_t swap = radix_to_swp_entry(page);
+
+				page = hmm_pagecache_page(mapping, swap);
+				unlock_page(page);
+				page_cache_release(page);
+				pvec.pages[i] = page;
+			}
+		}
 		pagevec_release(&pvec);
 		cond_resched();
 	}
diff --git a/mm/rmap.c b/mm/rmap.c
index e07450c..3b7fbd3c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1132,6 +1132,9 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	case TTU_MUNLOCK:
 		action = MMU_MUNLOCK;
 		break;
+	case TTU_HMM:
+		action = MMU_HMM;
+		break;
 	default:
 		/* Please report this ! */
 		BUG();
@@ -1327,6 +1330,9 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 	case TTU_MUNLOCK:
 		action = MMU_MUNLOCK;
 		break;
+	case TTU_HMM:
+		action = MMU_HMM;
+		break;
 	default:
 		/* Please report this ! */
 		BUG();
@@ -1426,7 +1432,12 @@ static int try_to_unmap_nonlinear(struct page *page,
 	unsigned long cursor;
 	unsigned long max_nl_cursor = 0;
 	unsigned long max_nl_size = 0;
-	unsigned int mapcount;
+	unsigned int mapcount, min_mapcount = 0;
+
+	/* The hmm code keeps the mapcount elevated by 1 to avoid updating the
+	 * mm and memcg. If we are called on behalf of hmm just ignore this
+	 * extra 1.
+	 */
+	min_mapcount = (TTU_ACTION((enum ttu_flags)arg) == TTU_HMM) ? 1 : 0;
 
 	list_for_each_entry(vma,
 		&mapping->i_mmap_nonlinear, shared.nonlinear) {
@@ -1449,8 +1460,10 @@ static int try_to_unmap_nonlinear(struct page *page,
 	 * just walk the nonlinear vmas trying to age and unmap some.
 	 * The mapcount of the page we came in with is irrelevant,
 	 * but even so use it as a guide to how hard we should try?
+	 *
+	 * See comment about hmm above for min_mapcount.
 	 */
-	mapcount = page_mapcount(page);
+	mapcount = page_mapcount(page) - min_mapcount;
 	if (!mapcount)
 		return ret;
 
diff --git a/mm/swap.c b/mm/swap.c
index c0ed4d6..426fede 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -839,6 +839,15 @@ void release_pages(struct page **pages, int nr, int cold)
 	for (i = 0; i < nr; i++) {
 		struct page *page = pages[i];
 
+		if (!page) {
+			continue;
+		}
+		if (radix_tree_exceptional_entry(page)) {
+			/* This should really not happen, tell us about it! */
+			WARN_ONCE(1, "hmm exceptional entry left\n");
+			continue;
+		}
+
 		if (unlikely(PageCompound(page))) {
 			if (zone) {
 				spin_unlock_irqrestore(&zone->lru_lock, flags);
diff --git a/mm/truncate.c b/mm/truncate.c
index 6a78c81..c979fd6 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -20,6 +20,7 @@
 #include <linux/buffer_head.h>	/* grr. try_to_release_page,
 				   do_invalidatepage */
 #include <linux/cleancache.h>
+#include <linux/hmm.h>
 #include "internal.h"
 
 static void clear_exceptional_entry(struct address_space *mapping,
@@ -281,6 +282,32 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE),
 			indices)) {
+		bool migrated = false;
+
+		for (i = 0; i < pagevec_count(&pvec); ++i) {
+			struct page *page = pvec.pages[i];
+
+			if (radix_tree_exceptional_entry(page)) {
+				swp_entry_t swap = radix_to_swp_entry(page);
+
+				/* FIXME How to handle hmm migration failure ? */
+				hmm_pagecache_migrate(mapping, swap);
+				for (; i < pagevec_count(&pvec); ++i) {
+					if (radix_tree_exceptional_entry(pvec.pages[i])) {
+						pvec.pages[i] = NULL;
+					}
+				}
+				migrated = true;
+				break;
+			}
+		}
+
+		if (migrated) {
+			pagevec_release(&pvec);
+			cond_resched();
+			continue;
+		}
+
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
@@ -313,7 +340,16 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	}
 
 	if (partial_start) {
-		struct page *page = find_lock_page(mapping, start - 1);
+		struct page *page;
+
+	repeat_start:
+		page = find_lock_page(mapping, start - 1);
+		if (radix_tree_exceptional_entry(page)) {
+			swp_entry_t swap = radix_to_swp_entry(page);
+
+			hmm_pagecache_migrate(mapping, swap);
+			goto repeat_start;
+		}
 		if (page) {
 			unsigned int top = PAGE_CACHE_SIZE;
 			if (start > end) {
@@ -332,7 +368,15 @@ void truncate_inode_pages_range(struct address_space *mapping,
 		}
 	}
 	if (partial_end) {
-		struct page *page = find_lock_page(mapping, end);
+		struct page *page;
+	repeat_end:
+		page = find_lock_page(mapping, end);
+		if (radix_tree_exceptional_entry(page)) {
+			swp_entry_t swap = radix_to_swp_entry(page);
+
+			hmm_pagecache_migrate(mapping, swap);
+			goto repeat_end;
+		}
 		if (page) {
 			wait_on_page_writeback(page);
 			zero_user_segment(page, 0, partial_end);
@@ -371,6 +415,9 @@ void truncate_inode_pages_range(struct address_space *mapping,
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
+			/* FIXME Find a way to block rmem migration on truncate. */
+			BUG_ON(radix_tree_exceptional_entry(page));
+
 			/* We rely upon deletion not changing page->index */
 			index = indices[i];
 			if (index >= end)
@@ -488,6 +535,32 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 	while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
 			indices)) {
+		bool migrated = false;
+
+		for (i = 0; i < pagevec_count(&pvec); ++i) {
+			struct page *page = pvec.pages[i];
+
+			if (radix_tree_exceptional_entry(page)) {
+				swp_entry_t swap = radix_to_swp_entry(page);
+
+				/* FIXME How to handle hmm migration failure ? */
+				hmm_pagecache_migrate(mapping, swap);
+				for (; i < pagevec_count(&pvec); ++i) {
+					if (radix_tree_exceptional_entry(pvec.pages[i])) {
+						pvec.pages[i] = NULL;
+					}
+				}
+				migrated = true;
+				break;
+			}
+		}
+
+		if (migrated) {
+			pagevec_release(&pvec);
+			cond_resched();
+			continue;
+		}
+
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
@@ -597,6 +670,32 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 	while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
 			indices)) {
+		bool migrated = false;
+
+		for (i = 0; i < pagevec_count(&pvec); ++i) {
+			struct page *page = pvec.pages[i];
+
+			if (radix_tree_exceptional_entry(page)) {
+				swp_entry_t swap = radix_to_swp_entry(page);
+
+				/* FIXME How to handle hmm migration failure ? */
+				hmm_pagecache_migrate(mapping, swap);
+				for (; i < pagevec_count(&pvec); ++i) {
+					if (radix_tree_exceptional_entry(pvec.pages[i])) {
+						pvec.pages[i] = NULL;
+					}
+				}
+				migrated = true;
+				break;
+			}
+		}
+
+		if (migrated) {
+			pagevec_release(&pvec);
+			cond_resched();
+			continue;
+		}
+
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 08/11] hmm: support for migrate file backed pages to remote memory
@ 2014-05-02 13:52   ` j.glisse
  0 siblings, 0 replies; 107+ messages in thread
From: j.glisse @ 2014-05-02 13:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel
  Cc: Jérôme Glisse, Sherry Cheung, Subhash Gutti,
	Mark Hairgrove, John Hubbard, Jatin Kumar

From: Jérôme Glisse <jglisse@redhat.com>

Motivation:

Same as for migrating private anonymous memory, i.e. device local memory has
higher bandwidth and lower latency.

Implementation:

Migrated ranges are tracked exactly like private anonymous memory; refer to
the commit adding support for migrating private anonymous memory.

Migrating file backed pages is more complex than private anonymous memory,
as those pages might be involved in various filesystem events, from write
back to splice or truncation.

This patchset uses a special hmm swap value that it stores inside the radix
tree for pages that are migrated to remote memory. Any code that needs to do
a radix tree lookup is updated to understand those special hmm swap entries
and to call the hmm helper functions to perform the appropriate operation, as
sketched below.
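
A rough sketch of that lookup pattern, modeled on the mm/filemap.c hunks in
this patch (lookup_page_hmm is only an illustrative name, not a helper added
by the patchset, and error handling plus the shadow/shmem entry cases are
elided):

static struct page *lookup_page_hmm(struct address_space *mapping,
				    pgoff_t index)
{
	struct page *page;

repeat:
	page = find_get_page(mapping, index);
	if (radix_tree_exceptional_entry(page)) {
		swp_entry_t swap = radix_to_swp_entry(page);

		if (is_hmm_entry(swap)) {
			/* Content lives in device memory, migrate it back
			 * and redo the lookup to get a struct page.
			 */
			hmm_pagecache_migrate(mapping, swap);
			goto repeat;
		}
		/* Shadow or shmem swap entry, let the caller handle it. */
		return NULL;
	}
	return page;
}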

For most operations (file read, splice, truncate, ...) the end result is
simply to migrate back to local memory. It is expected that users of hmm
will not perform such operations on file backed memory that was migrated
to remote memory.

Write back is different, as we preserve the ability to write back dirtied
memory from remote memory (using local system memory as a bounce buffer); a
sketch of the pagevec pass follows.
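
Below is a rough sketch of the pagevec pass that write_cache_pages() gains in
this patch; hmm_bounce_pagevec is only an illustrative wrapper (not a helper
the patchset adds) and the locking details plus the later lookup retry are
simplified:

static bool hmm_bounce_pagevec(struct address_space *mapping,
			       struct pagevec *pvec, unsigned nr_pages)
{
	bool migrated = false;
	unsigned i;

	for (i = 0; i < nr_pages; i++) {
		struct page *page = pvec->pages[i];

		if (!radix_tree_exceptional_entry(page))
			continue;
		/* Copy the dirty remote range back into local bounce pages. */
		page = hmm_pagecache_writeback(mapping,
					       radix_to_swp_entry(page));
		if (!page) {
			/* The range was migrated back to lmem instead, the
			 * caller must redo the page cache lookup.
			 */
			pvec->pages[i] = NULL;
			migrated = true;
		} else {
			/* Page comes back locked, and dirty if the rmem was. */
			pvec->pages[i] = page;
		}
	}
	return migrated;
}

When this returns true the caller releases the pagevec and restarts the tag
lookup, much as the real hunk does with its save_index bookkeeping.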

Each filesystem must be modified to support hmm and must advertise it through
the new AOPS_FEATURE_HMM flag. This patchset only modifies common helper code
and adds the core set of helpers needed for this feature; an example opt-in is
sketched below.
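
A hypothetical opt-in for a filesystem that has been audited for hmm; the
example_* callbacks are placeholders, only the features field and the
AOPS_FEATURE_HMM flag come from this patchset:

static const struct address_space_operations example_aops = {
	.readpage	= example_readpage,	/* placeholder fs callbacks */
	.writepage	= example_writepage,
	.write_begin	= example_write_begin,
	.write_end	= example_write_end,
	/* Allow hmm to migrate pages of this mapping to remote memory. */
	.features	= AOPS_FEATURE_HMM,
};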

Issues:

The big issue here is how to handle failure to migrate the remote memory back
to local memory. Should all processes trying to further access the file get
SIGBUS? Should only the process that migrated memory to remote memory get
SIGBUS? ...

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
---
 fs/aio.c             |    9 +
 fs/buffer.c          |    3 +
 fs/splice.c          |   38 +-
 include/linux/fs.h   |    4 +
 include/linux/hmm.h  |   72 +++-
 include/linux/rmap.h |    1 +
 mm/filemap.c         |   99 ++++-
 mm/hmm.c             | 1094 ++++++++++++++++++++++++++++++++++++++++++++++++--
 mm/madvise.c         |    4 +
 mm/mincore.c         |   11 +
 mm/page-writeback.c  |  131 ++++--
 mm/rmap.c            |   17 +-
 mm/swap.c            |    9 +
 mm/truncate.c        |  103 ++++-
 14 files changed, 1524 insertions(+), 71 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 0bf693f..0ec9f16 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -40,6 +40,7 @@
 #include <linux/ramfs.h>
 #include <linux/percpu-refcount.h>
 #include <linux/mount.h>
+#include <linux/hmm.h>
 
 #include <asm/kmap_types.h>
 #include <asm/uaccess.h>
@@ -405,10 +406,18 @@ static int aio_setup_ring(struct kioctx *ctx)
 
 	for (i = 0; i < nr_pages; i++) {
 		struct page *page;
+
+	repeat:
 		page = find_or_create_page(file->f_inode->i_mapping,
 					   i, GFP_HIGHUSER | __GFP_ZERO);
 		if (!page)
 			break;
+		if (radix_tree_exceptional_entry(page)) {
+			swp_entry_t swap = radix_to_swp_entry(page);
+
+			hmm_pagecache_migrate(file->f_inode->i_mapping, swap);
+			goto repeat;
+		}
 		pr_debug("pid(%d) page[%d]->count=%d\n",
 			 current->pid, i, page_count(page));
 		SetPageUptodate(page);
diff --git a/fs/buffer.c b/fs/buffer.c
index e33f8d5..2be2a04 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -40,6 +40,7 @@
 #include <linux/cpu.h>
 #include <linux/bitops.h>
 #include <linux/mpage.h>
+#include <linux/hmm.h>
 #include <linux/bit_spinlock.h>
 #include <trace/events/block.h>
 
@@ -1023,6 +1024,8 @@ grow_dev_page(struct block_device *bdev, sector_t block,
 	if (!page)
 		return ret;
 
+	/* This can not happen ! */
+	BUG_ON(radix_tree_exceptional_entry(page));
 	BUG_ON(!PageLocked(page));
 
 	if (page_has_buffers(page)) {
diff --git a/fs/splice.c b/fs/splice.c
index 9dc23de..175f80c 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -33,6 +33,7 @@
 #include <linux/socket.h>
 #include <linux/compat.h>
 #include <linux/aio.h>
+#include <linux/hmm.h>
 #include "internal.h"
 
 /*
@@ -334,6 +335,20 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
 	 * Lookup the (hopefully) full range of pages we need.
 	 */
 	spd.nr_pages = find_get_pages_contig(mapping, index, nr_pages, spd.pages);
+	/* Handle hmm entry, ie migrate remote memory back to local memory. */
+	for (page_nr = 0; page_nr < spd.nr_pages;) {
+		page = spd.pages[page_nr];
+		if (radix_tree_exceptional_entry(page)) {
+			swp_entry_t swap = radix_to_swp_entry(page);
+
+			/* FIXME How to handle hmm migration failure ? */
+			hmm_pagecache_migrate(mapping, swap);
+			spd.pages[page_nr] = find_get_page(mapping, index + page_nr);
+			continue;
+		} else {
+			page_nr++;
+		}
+	}
 	index += spd.nr_pages;
 
 	/*
@@ -351,6 +366,14 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
 		 * the first hole.
 		 */
 		page = find_get_page(mapping, index);
+
+		if (radix_tree_exceptional_entry(page)) {
+			swp_entry_t swap = radix_to_swp_entry(page);
+
+			/* FIXME How to handle hmm migration failure ? */
+			hmm_pagecache_migrate(mapping, swap);
+			continue;
+		}
 		if (!page) {
 			/*
 			 * page didn't exist, allocate one.
@@ -373,7 +396,6 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
 			 */
 			unlock_page(page);
 		}
-
 		spd.pages[spd.nr_pages++] = page;
 		index++;
 	}
@@ -415,6 +437,7 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
 			 */
 			if (!page->mapping) {
 				unlock_page(page);
+retry:
 				page = find_or_create_page(mapping, index,
 						mapping_gfp_mask(mapping));
 
@@ -422,8 +445,17 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
 					error = -ENOMEM;
 					break;
 				}
-				page_cache_release(spd.pages[page_nr]);
-				spd.pages[page_nr] = page;
+				/* find_or_create_page() may still return an
+				 * hmm special entry, migrate it back and retry.
+				 */
+				if (radix_tree_exceptional_entry(page)) {
+					swp_entry_t swap = radix_to_swp_entry(page);
+
+					/* FIXME How to handle hmm migration failure ? */
+					hmm_pagecache_migrate(mapping, swap);
+					goto retry;
+				} else {
+					page_cache_release(spd.pages[page_nr]);
+					spd.pages[page_nr] = page;
+				}
 			}
 			/*
 			 * page was already under io and is now done, great
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4e92d55..149a73e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -366,8 +366,12 @@ struct address_space_operations {
 	int (*swap_activate)(struct swap_info_struct *sis, struct file *file,
 				sector_t *span);
 	void (*swap_deactivate)(struct file *file);
+
+	int features;
 };
 
+#define AOPS_FEATURE_HMM	(1 << 0)
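+/* Filesystems set AOPS_FEATURE_HMM in address_space_operations.features when
+ * their pagecache can cope with hmm special entries (ie they use the
+ * hmm_pagecache_* helpers); hmm refuses to migrate file backed memory to
+ * remote memory otherwise.
+ */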
+
 extern const struct address_space_operations empty_aops;
 
 /*
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 96f41c4..9d232c1 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -53,7 +53,6 @@
 #include <linux/swapops.h>
 #include <linux/mman.h>
 
-
 struct hmm_device;
 struct hmm_device_ops;
 struct hmm_mirror;
@@ -75,6 +74,14 @@ struct hmm;
  *   HMM_PFN_LOCK is only set while the rmem object is under going migration.
  *   HMM_PFN_LMEM_UPTODATE the page that is in the rmem pfn array has uptodate.
  *   HMM_PFN_RMEM_UPTODATE the rmem copy of the page is uptodate.
+ *   HMM_PFN_FILE is set for pages that are part of the pagecache.
+ *   HMM_PFN_WRITEBACK is set when the page is undergoing writeback, meaning
+ *     that the page is locked and all device mappings to the rmem for this
+ *     page are set to read only. It is only cleared if the device does a
+ *     write fault on the page or on migration back to lmem.
+ *   HMM_PFN_FS_WRITEABLE the rmem can be written to without calling mkwrite.
+ *     This is for hmm internal use only, to know if hmm needs to call the fs
+ *     mkwrite callback or not.
  *
  * Device driver only need to worry about :
  *   HMM_PFN_VALID_PAGE
@@ -95,6 +102,9 @@ struct hmm;
 #define HMM_PFN_LOCK		(4UL)
 #define HMM_PFN_LMEM_UPTODATE	(5UL)
 #define HMM_PFN_RMEM_UPTODATE	(6UL)
+#define HMM_PFN_FILE		(7UL)
+#define HMM_PFN_WRITEBACK	(8UL)
+#define HMM_PFN_FS_WRITEABLE	(9UL)
 
 static inline struct page *hmm_pfn_to_page(unsigned long pfn)
 {
@@ -170,6 +180,7 @@ enum hmm_etype {
 	HMM_UNMAP,
 	HMM_MIGRATE_TO_LMEM,
 	HMM_MIGRATE_TO_RMEM,
+	HMM_WRITEBACK,
 };
 
 struct hmm_fence {
@@ -628,6 +639,7 @@ struct hmm_device *hmm_device_unref(struct hmm_device *device);
  *
  * @kref:           Reference count.
  * @device:         The hmm device the remote memory is allocated on.
+ * @mapping:        If rmem backing shared mapping.
  * @event:          The event currently associated with the rmem.
  * @lock:           Lock protecting the ranges list and event field.
  * @ranges:         The list of address ranges that point to this rmem.
@@ -646,6 +658,7 @@ struct hmm_device *hmm_device_unref(struct hmm_device *device);
 struct hmm_rmem {
 	struct kref		kref;
 	struct hmm_device	*device;
+	struct address_space	*mapping;
 	struct hmm_event	*event;
 	spinlock_t		lock;
 	struct list_head	ranges;
@@ -913,6 +926,42 @@ int hmm_mm_fault(struct mm_struct *mm,
 		 unsigned int fault_flags,
 		 pte_t orig_pte);
 
+/* hmm_pagecache_migrate - migrate remote memory to local memory.
+ *
+ * @mapping:    The address space into which the rmem was found.
+ * @swap:       The hmm special swap entry that needs to be migrated.
+ *
+ * When the fs code needs to migrate remote memory back to local memory it
+ * calls this function. From the caller's point of view this function cannot
+ * fail. If it does fail, processes that were using the rmem get a SIGBUS when
+ * they access the page whose migration failed, while other processes simply
+ * see the latest content we had for the page. Hence, from the pagecache point
+ * of view, it never fails.
+ */
+void hmm_pagecache_migrate(struct address_space *mapping,
+			   swp_entry_t swap);
+
+/* hmm_pagecache_writeback - temporary copy of rmem for writeback.
+ *
+ * @mapping:    The address space into which the rmem was found.
+ * @swap:       The hmm special swap entry that needs temporary copy.
+ * Return:      Page pointer or NULL on failure.
+ *
+ * When the fs code needs to write remote memory back to backing storage it
+ * calls this function. The function returns a pointer to a temporary page
+ * holding the latest copy of the remote memory. The remote memory is marked
+ * read only for the duration of the writeback.
+ *
+ * On failure this returns NULL and poisons any mapping of the process that was
+ * responsible for the remote memory, thus triggering a SIGBUS for this
+ * process. It also kills the mirror that was using this remote memory.
+ *
+ * When NULL is returned the caller should perform a new radix tree lookup.
+ */
+struct page *hmm_pagecache_writeback(struct address_space *mapping,
+				     swp_entry_t swap);
+struct page *hmm_pagecache_page(struct address_space *mapping,
+				swp_entry_t swap);
+
 #else /* !CONFIG_HMM */
 
 static inline void hmm_destroy(struct mm_struct *mm)
@@ -930,6 +979,27 @@ static inline int hmm_mm_fault(struct mm_struct *mm,
 	return VM_FAULT_SIGBUS;
 }
 
+static inline void hmm_pagecache_migrate(struct address_space *mapping,
+					 swp_entry_t swap)
+{
+	/* This can not happen ! */
+	BUG();
+}
+
+static inline struct page *hmm_pagecache_writeback(struct address_space *mapping,
+						   swp_entry_t swap)
+{
+	BUG();
+	return NULL;
+}
+
+static inline struct page *hmm_pagecache_page(struct address_space *mapping,
+					      swp_entry_t swap)
+{
+	BUG();
+	return NULL;
+}
+
 #endif /* !CONFIG_HMM */
 
 #endif
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 575851f..0641ccf 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -76,6 +76,7 @@ enum ttu_flags {
 	TTU_POISON = 1,			/* unmap mode */
 	TTU_MIGRATION = 2,		/* migration mode */
 	TTU_MUNLOCK = 3,		/* munlock mode */
+	TTU_HMM = 4,			/* hmm mode */
 	TTU_ACTION_MASK = 0xff,
 
 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
diff --git a/mm/filemap.c b/mm/filemap.c
index 067c3c0..686f46b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -34,6 +34,7 @@
 #include <linux/memcontrol.h>
 #include <linux/cleancache.h>
 #include <linux/rmap.h>
+#include <linux/hmm.h>
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
@@ -343,6 +344,7 @@ int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
 {
 	pgoff_t index = start_byte >> PAGE_CACHE_SHIFT;
 	pgoff_t end = end_byte >> PAGE_CACHE_SHIFT;
+	pgoff_t last_index = index;
 	struct pagevec pvec;
 	int nr_pages;
 	int ret2, ret = 0;
@@ -360,6 +362,19 @@ int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];
 
+			if (radix_tree_exceptional_entry(page)) {
+				swp_entry_t swap = radix_to_swp_entry(page);
+
+				/* FIXME How to handle hmm migration failure ? */
+				hmm_pagecache_migrate(mapping, swap);
+				pvec.pages[i] = NULL;
+				/* Force re-examination of the range in case
+				 * the migration triggered page writeback.
+				 */
+				index = last_index;
+				continue;
+			}
+
 			/* until radix tree lookup accepts end_index */
 			if (page->index > end)
 				continue;
@@ -369,6 +384,7 @@ int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
 				ret = -EIO;
 		}
 		pagevec_release(&pvec);
+		last_index = index;
 		cond_resched();
 	}
 out:
@@ -987,14 +1003,21 @@ EXPORT_SYMBOL(find_get_entry);
  * Looks up the page cache slot at @mapping & @offset.  If there is a
  * page cache page, it is returned with an increased refcount.
  *
+ * Note that this will also return an hmm special entry if there is one.
+ *
  * Otherwise, %NULL is returned.
  */
 struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
 {
 	struct page *page = find_get_entry(mapping, offset);
 
-	if (radix_tree_exceptional_entry(page))
-		page = NULL;
+	if (radix_tree_exceptional_entry(page)) {
+		swp_entry_t swap = radix_to_swp_entry(page);
+
+		if (!is_hmm_entry(swap)) {
+			page = NULL;
+		}
+	}
 	return page;
 }
 EXPORT_SYMBOL(find_get_page);
@@ -1044,6 +1067,8 @@ EXPORT_SYMBOL(find_lock_entry);
  * page cache page, it is returned locked and with an increased
  * refcount.
  *
+ * Note that this will also return an hmm special entry if there is one.
+ *
  * Otherwise, %NULL is returned.
  *
  * find_lock_page() may sleep.
@@ -1052,8 +1077,13 @@ struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
 {
 	struct page *page = find_lock_entry(mapping, offset);
 
-	if (radix_tree_exceptional_entry(page))
-		page = NULL;
+	if (radix_tree_exceptional_entry(page)) {
+		swp_entry_t swap = radix_to_swp_entry(page);
+
+		if (!is_hmm_entry(swap)) {
+			page = NULL;
+		}
+	}
 	return page;
 }
 EXPORT_SYMBOL(find_lock_page);
@@ -1222,6 +1252,12 @@ repeat:
 				WARN_ON(iter.index);
 				goto restart;
 			}
+			if (is_hmm_entry(radix_to_swp_entry(page))) {
+				/* This is an hmm special entry, the page has
+				 * been migrated to some device memory.
+				 */
+				goto export;
+			}
 			/*
 			 * A shadow entry of a recently evicted page,
 			 * or a swap entry from shmem/tmpfs.  Skip
@@ -1239,6 +1275,7 @@ repeat:
 			goto repeat;
 		}
 
+export:
 		pages[ret] = page;
 		if (++ret == nr_pages)
 			break;
@@ -1289,6 +1326,12 @@ repeat:
 				 */
 				goto restart;
 			}
+			if (is_hmm_entry(radix_to_swp_entry(page))) {
+				/* This is an hmm special entry, the page has
+				 * been migrated to some device memory.
+				 */
+				goto export;
+			}
 			/*
 			 * A shadow entry of a recently evicted page,
 			 * or a swap entry from shmem/tmpfs.  Stop
@@ -1316,6 +1359,7 @@ repeat:
 			break;
 		}
 
+export:
 		pages[ret] = page;
 		if (++ret == nr_pages)
 			break;
@@ -1342,6 +1386,7 @@ unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 	struct radix_tree_iter iter;
 	void **slot;
 	unsigned ret = 0;
+	pgoff_t index_last = *index;
 
 	if (unlikely(!nr_pages))
 		return 0;
@@ -1365,6 +1410,12 @@ repeat:
 				 */
 				goto restart;
 			}
+			if (is_hmm_entry(radix_to_swp_entry(page))) {
+				/* This is an hmm special entry, the page has
+				 * been migrated to some device memory.
+				 */
+				goto export;
+			}
 			/*
 			 * A shadow entry of a recently evicted page.
 			 *
@@ -1388,6 +1439,8 @@ repeat:
 			goto repeat;
 		}
 
+export:
+		index_last = iter.index;
 		pages[ret] = page;
 		if (++ret == nr_pages)
 			break;
@@ -1396,7 +1449,7 @@ repeat:
 	rcu_read_unlock();
 
 	if (ret)
-		*index = pages[ret - 1]->index + 1;
+		*index = index_last + 1;
 
 	return ret;
 }
@@ -1420,6 +1473,13 @@ grab_cache_page_nowait(struct address_space *mapping, pgoff_t index)
 {
 	struct page *page = find_get_page(mapping, index);
 
+	if (radix_tree_exceptional_entry(page)) {
+		/* This only happens if the page was migrated to remote memory;
+		 * the fs code knows how to handle the case, thus it is safe to
+		 * return the special entry.
+		 */
+		return page;
+	}
 	if (page) {
 		if (trylock_page(page))
 			return page;
@@ -1497,6 +1557,13 @@ static ssize_t do_generic_file_read(struct file *filp, loff_t *ppos,
 		cond_resched();
 find_page:
 		page = find_get_page(mapping, index);
+		if (radix_tree_exceptional_entry(page)) {
+			swp_entry_t swap = radix_to_swp_entry(page);
+
+			/* FIXME How to handle hmm migration failure ? */
+			hmm_pagecache_migrate(mapping, swap);
+			goto find_page;
+		}
 		if (!page) {
 			page_cache_sync_readahead(mapping,
 					ra, filp,
@@ -1879,7 +1946,15 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	/*
 	 * Do we have something in the page cache already?
 	 */
+find_page:
 	page = find_get_page(mapping, offset);
+	if (radix_tree_exceptional_entry(page)) {
+		swp_entry_t swap = radix_to_swp_entry(page);
+
+		/* FIXME How to handle hmm migration failure ? */
+		hmm_pagecache_migrate(mapping, swap);
+		goto find_page;
+	}
 	if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) {
 		/*
 		 * We found the page, so try async readahead before
@@ -2145,6 +2220,13 @@ static struct page *__read_cache_page(struct address_space *mapping,
 	int err;
 repeat:
 	page = find_get_page(mapping, index);
+	if (radix_tree_exceptional_entry(page)) {
+		swp_entry_t swap = radix_to_swp_entry(page);
+
+		/* FIXME How to handle hmm migration failure ? */
+		hmm_pagecache_migrate(mapping, swap);
+		goto repeat;
+	}
 	if (!page) {
 		page = __page_cache_alloc(gfp | __GFP_COLD);
 		if (!page)
@@ -2442,6 +2524,13 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
 		gfp_notmask = __GFP_FS;
 repeat:
 	page = find_lock_page(mapping, index);
+	if (radix_tree_exceptional_entry(page)) {
+		swp_entry_t swap = radix_to_swp_entry(page);
+
+		/* FIXME How to handle hmm migration failure ? */
+		hmm_pagecache_migrate(mapping, swap);
+		goto repeat;
+	}
 	if (page)
 		goto found;
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 599d4f6..0d97762 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -61,6 +61,7 @@
 #include <linux/wait.h>
 #include <linux/interval_tree_generic.h>
 #include <linux/mman.h>
+#include <linux/buffer_head.h>
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
 #include <linux/delay.h>
@@ -656,6 +657,7 @@ static void hmm_rmem_init(struct hmm_rmem *rmem,
 {
 	kref_init(&rmem->kref);
 	rmem->device = device;
+	rmem->mapping = NULL;
 	rmem->fuid = 0;
 	rmem->luid = 0;
 	rmem->pfns = NULL;
@@ -923,9 +925,13 @@ static void hmm_rmem_clear_range(struct hmm_rmem *rmem,
 			sync_mm_rss(vma->vm_mm);
 		}
 
-		/* Properly uncharge memory. */
-		mem_cgroup_uncharge_mm(vma->vm_mm);
-		add_mm_counter(vma->vm_mm, MM_ANONPAGES, -1);
+		if (!test_bit(HMM_PFN_FILE, &rmem->pfns[idx])) {
+			/* Properly uncharge memory. */
+			mem_cgroup_uncharge_mm(vma->vm_mm);
+			add_mm_counter(vma->vm_mm, MM_ANONPAGES, -1);
+		} else {
+			add_mm_counter(vma->vm_mm, MM_FILEPAGES, -1);
+		}
 	}
 }
 
@@ -1064,8 +1070,10 @@ static int hmm_rmem_remap_page(struct hmm_rmem_mm *rmem_mm,
 			pte = pte_mkdirty(pte);
 		}
 		get_page(page);
-		/* Private anonymous page. */
-		page_add_anon_rmap(page, vma, addr);
+		if (!test_bit(HMM_PFN_FILE, &rmem->pfns[idx])) {
+			/* Private anonymous page. */
+			page_add_anon_rmap(page, vma, addr);
+		}
 		/* FIXME is this necessary ? I do not think so. */
 		if (!reuse_swap_page(page)) {
 			/* Page is still mapped in another process. */
@@ -1149,6 +1157,87 @@ static int hmm_rmem_remap_anon(struct hmm_rmem *rmem,
 	return ret;
 }
 
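+/* Put one page back in place of the hmm special entry in the pagecache radix
+ * tree, return it to the lru and drop the rmap and reference that hmm was
+ * holding on it.
+ */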
+static void hmm_rmem_remap_file_single_page(struct hmm_rmem *rmem,
+					    struct page *page)
+{
+	struct address_space *mapping = rmem->mapping;
+	void **slotp;
+
+	list_del_init(&page->lru);
+	spin_lock_irq(&mapping->tree_lock);
+	slotp = radix_tree_lookup_slot(&mapping->page_tree, page->index);
+	if (slotp) {
+		radix_tree_replace_slot(slotp, page);
+		get_page(page);
+	} else {
+		/* This should never happen. */
+		WARN_ONCE(1, "hmm: null slot while remapping !\n");
+	}
+	spin_unlock_irq(&mapping->tree_lock);
+
+	page->mapping = mapping;
+	unlock_page(page);
+	/* To balance putback_lru_page and isolate_lru_page. */
+	get_page(page);
+	putback_lru_page(page);
+	page_remove_rmap(page);
+	page_cache_release(page);
+}
+
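+/* Swap the hmm special entries in the pagecache back to the real pages for
+ * every file page of the rmem, transfer the dirty state and unlock the pages.
+ */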
+static void hmm_rmem_remap_file(struct hmm_rmem *rmem)
+{
+	struct address_space *mapping = rmem->mapping;
+	unsigned long i, index, uid;
+
+	/* This part is a lot easier than the unmap one. */
+	uid = rmem->fuid;
+	index = rmem->pgoff >> (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	spin_lock_irq(&mapping->tree_lock);
+	for (i = 0; i < hmm_rmem_npages(rmem); ++i, ++uid, ++index) {
+		void *expected, *item, **slotp;
+		struct page *page;
+
+		page = hmm_pfn_to_page(rmem->pfns[i]);
+		if (!page || !test_bit(HMM_PFN_FILE, &rmem->pfns[i])) {
+			continue;
+		}
+		slotp = radix_tree_lookup_slot(&mapping->page_tree, index);
+		if (!slotp) {
+			/* This should never happen. */
+			WARN_ONCE(1, "hmm: null slot while remapping !\n");
+			continue;
+		}
+		item = radix_tree_deref_slot_protected(slotp,
+						       &mapping->tree_lock);
+		expected = swp_to_radix_entry(make_hmm_entry(uid));
+		if (item == expected) {
+			if (!test_bit(HMM_PFN_LMEM_UPTODATE, &rmem->pfns[i])) {
+				/* FIXME Something went wrong with the read back. */
+				ClearPageUptodate(page);
+			}
+			page->mapping = mapping;
+			get_page(page);
+			radix_tree_replace_slot(slotp, page);
+		} else {
+			WARN_ONCE(1, "hmm: expect 0x%p got 0x%p\n",
+				  expected, item);
+		}
+	}
+	spin_unlock_irq(&mapping->tree_lock);
+
+	for (i = 0; i < hmm_rmem_npages(rmem); ++i, ++uid, ++index) {
+		struct page *page;
+
+		page = hmm_pfn_to_page(rmem->pfns[i]);
+		page->mapping = mapping;
+		if (test_bit(HMM_PFN_DIRTY, &rmem->pfns[i])) {
+			set_page_dirty(page);
+		}
+		unlock_page(page);
+		clear_bit(HMM_PFN_LOCK, &rmem->pfns[i]);
+	}
+}
+
 static int hmm_rmem_unmap_anon_page(struct hmm_rmem_mm *rmem_mm,
 				    unsigned long addr,
 				    pte_t *ptep,
@@ -1230,6 +1319,94 @@ static int hmm_rmem_unmap_anon_page(struct hmm_rmem_mm *rmem_mm,
 	return 0;
 }
 
+static int hmm_rmem_unmap_file_page(struct hmm_rmem_mm *rmem_mm,
+				    unsigned long addr,
+				    pte_t *ptep,
+				    pmd_t *pmdp)
+{
+	struct vm_area_struct *vma = rmem_mm->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	struct hmm_rmem *rmem = rmem_mm->rmem;
+	unsigned long idx, uid;
+	struct page *page;
+	pte_t pte;
+
+	/* New pte value. */
+	uid = rmem_mm->fuid + ((addr - rmem_mm->faddr) >> PAGE_SHIFT);
+	idx = uid - rmem->fuid;
+	pte = ptep_get_and_clear_full(mm, addr, ptep, rmem_mm->tlb.fullmm);
+	tlb_remove_tlb_entry((&rmem_mm->tlb), ptep, addr);
+
+	if (pte_none(pte)) {
+		rmem_mm->laddr = addr + PAGE_SIZE;
+		return 0;
+	}
+	if (!pte_present(pte)) {
+		swp_entry_t entry;
+
+		if (pte_file(pte)) {
+			/* Definitely a fault, as we do not support migrating
+			 * non-linear vmas to remote memory.
+			 */
+			WARN_ONCE(1, "hmm: was trying to migrate non linear vma.\n");
+			return -EBUSY;
+		}
+		entry = pte_to_swp_entry(pte);
+		if (unlikely(non_swap_entry(entry))) {
+			/* This cannot happen! At this point no other process
+			 * knows about this page or has pending operations on
+			 * it besides read operations.
+			 *
+			 * There can be no mm event happening (no migration or
+			 * anything else) that would set a special pte.
+			 */
+			WARN_ONCE(1, "hmm: unhandled pte value 0x%016llx.\n",
+				  (long long)pte_val(pte));
+			return -EBUSY;
+		}
+		/* FIXME free swap ? This was pointing to a swap entry of shmem
+		 * shared memory.
+		 */
+		return 0;
+	}
+
+	flush_cache_page(vma, addr, pte_pfn(pte));
+	page = pfn_to_page(pte_pfn(pte));
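+	/* A private mapping may have COWed the file page: the pte then points
+	 * to an anonymous page which takes the place of the pagecache page in
+	 * the rmem, while the pagecache page itself is queued on remap_pages
+	 * to be put back into the pagecache. Otherwise this is the shared
+	 * file page and we keep the reference we already hold on it.
+	 */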
+	if (PageAnon(page)) {
+		page = hmm_pfn_to_page(rmem->pfns[idx]);
+		list_add_tail(&page->lru, &rmem_mm->remap_pages);
+		rmem->pfns[idx] = pte_pfn(pte);
+		set_bit(HMM_PFN_VALID_PAGE, &rmem->pfns[idx]);
+		set_bit(HMM_PFN_WRITE, &rmem->pfns[idx]);
+		if (pte_dirty(pte)) {
+			set_bit(HMM_PFN_DIRTY, &rmem->pfns[idx]);
+		}
+		page = pfn_to_page(pte_pfn(pte));
+		pte = swp_entry_to_pte(make_hmm_entry(uid));
+		set_pte_at(mm, addr, ptep, pte);
+		/* tlb_flush_mmu drops one ref so take an extra ref here. */
+		get_page(page);
+	} else {
+		VM_BUG_ON(page != hmm_pfn_to_page(rmem->pfns[idx]));
+		set_bit(HMM_PFN_VALID_PAGE, &rmem->pfns[idx]);
+		if (pte_write(pte)) {
+			set_bit(HMM_PFN_FS_WRITEABLE, &rmem->pfns[idx]);
+		}
+		if (pte_dirty(pte)) {
+			set_bit(HMM_PFN_DIRTY, &rmem->pfns[idx]);
+		}
+		set_bit(HMM_PFN_FILE, &rmem->pfns[idx]);
+		add_mm_counter(mm, MM_FILEPAGES, -1);
+		page_remove_rmap(page);
+		/* Unlike the anonymous page case, do not take an extra
+		 * reference as we are already holding one.
+		 */
+	}
+
+	rmem_mm->force_flush = !__tlb_remove_page(&rmem_mm->tlb, page);
+	rmem_mm->laddr = addr + PAGE_SIZE;
+
+	return 0;
+}
+
 static int hmm_rmem_unmap_pmd(pmd_t *pmdp,
 			      unsigned long addr,
 			      unsigned long next,
@@ -1262,15 +1439,29 @@ static int hmm_rmem_unmap_pmd(pmd_t *pmdp,
 again:
 	ptep = pte_offset_map_lock(vma->vm_mm, pmdp, addr, &ptl);
 	arch_enter_lazy_mmu_mode();
-	for (; addr != next; ++ptep, addr += PAGE_SIZE) {
-		ret = hmm_rmem_unmap_anon_page(rmem_mm, addr,
-					       ptep, pmdp);
-		if (ret || rmem_mm->force_flush) {
-			/* Increment ptep so unlock works on correct
-			 * pte.
-			 */
-			ptep++;
-			break;
+	if (vma->vm_file) {
+		for (; addr != next; ++ptep, addr += PAGE_SIZE) {
+			ret = hmm_rmem_unmap_file_page(rmem_mm, addr,
+						       ptep, pmdp);
+			if (ret || rmem_mm->force_flush) {
+				/* Increment ptep so unlock works on correct
+				 * pte.
+				 */
+				ptep++;
+				break;
+			}
+		}
+	} else {
+		for (; addr != next; ++ptep, addr += PAGE_SIZE) {
+			ret = hmm_rmem_unmap_anon_page(rmem_mm, addr,
+						       ptep, pmdp);
+			if (ret || rmem_mm->force_flush) {
+				/* Increment ptep so unlock works on correct
+				 * pte.
+				 */
+				ptep++;
+				break;
+			}
 		}
 	}
 	arch_leave_lazy_mmu_mode();
@@ -1321,6 +1512,7 @@ static int hmm_rmem_unmap_anon(struct hmm_rmem *rmem,
 
 	npages = (laddr - faddr) >> PAGE_SHIFT;
 	rmem->pgoff = faddr;
+	rmem->mapping = NULL;
 	rmem_mm.vma = vma;
 	rmem_mm.rmem = rmem;
 	rmem_mm.faddr = faddr;
@@ -1362,13 +1554,433 @@ static int hmm_rmem_unmap_anon(struct hmm_rmem *rmem,
 	return ret;
 }
 
+static int hmm_rmem_unmap_file(struct hmm_rmem *rmem,
+			       struct vm_area_struct *vma,
+			       unsigned long faddr,
+			       unsigned long laddr)
+{
+	struct address_space *mapping;
+	struct hmm_rmem_mm rmem_mm;
+	struct mm_walk walk = {0};
+	unsigned long addr, i, index, npages, uid;
+	struct page *page, *tmp;
+	int ret;
+
+	npages = hmm_rmem_npages(rmem);
+	rmem->pgoff = vma->vm_pgoff + ((faddr - vma->vm_start) >> PAGE_SHIFT);
+	rmem->mapping = vma->vm_file->f_mapping;
+	rmem_mm.vma = vma;
+	rmem_mm.rmem = rmem;
+	rmem_mm.faddr = faddr;
+	rmem_mm.laddr = faddr;
+	rmem_mm.fuid = rmem->fuid;
+	INIT_LIST_HEAD(&rmem_mm.remap_pages);
+	memset(rmem->pfns, 0, sizeof(long) * npages);
+
+	i = 0;
+	uid = rmem->fuid;
+	addr = faddr;
+	index = rmem->pgoff >> (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	mapping = rmem->mapping;
+
+	/* Probably the most complex part of the code, as it needs to serialize
+	 * against various memory and filesystem events. The range we are
+	 * trying to migrate can be undergoing writeback, direct_IO, read,
+	 * write, or simply mm events such as page reclamation, page migration,
+	 * ...
+	 *
+	 * We need to get exclusive access to all the pages in the range so
+	 * that no other process accesses them or tries to do anything with
+	 * them. The trick is to set page->mapping to NULL so that anyone with
+	 * a reference on a page will think that the page was either reclaimed,
+	 * migrated or truncated. Any code that sees that will either skip the
+	 * page or retry a find_get_page, which will return the hmm special
+	 * swap value.
+	 *
+	 * This is a multistep process. First we update the pagecache to point
+	 * to the special hmm swap entry so that any new event coming in sees
+	 * it and can block the migration. While updating the pagecache we also
+	 * make sure it is fully populated, and we trylock every page we can so
+	 * that no other process can lock them for write, direct_IO or anything
+	 * else that requires the page lock.
+	 *
+	 * Once the pagecache is updated we proceed to lock all the still
+	 * unlocked pages and to isolate them from the lru, as we do not want
+	 * any of them to be reclaimed while doing the migration. We also make
+	 * sure each page is Uptodate and read it back from disk if not.
+	 *
+	 * The next step is to unmap the range from the process address space
+	 * for which the migration is happening. We do so because we need to
+	 * account all the pages against this process so that on migration back
+	 * the unaccounting can be done consistently.
+	 *
+	 * Finally the last step is to unmap from all other processes; after
+	 * this the only thing that can still be happening is that some pages
+	 * are undergoing read or writeback, both of which are fine.
+	 *
+	 * To know up to which step exactly each page went we use various hmm
+	 * pfn flags, so that the error handling code can take proper action to
+	 * restore each page to its original state.
+	 */
+
+retry:
+	if (rmem->event->backoff) {
+		npages = i;
+		ret = -EBUSY;
+		goto out;
+	}
+	spin_lock_irq(&mapping->tree_lock);
+	for (; i < npages; ++i, ++uid, ++index, addr += PAGE_SIZE) {
+		void *item, **slotp;
+		int error;
+
+		slotp = radix_tree_lookup_slot(&mapping->page_tree, index);
+		if (!slotp) {
+			spin_unlock_irq(&mapping->tree_lock);
+			page = page_cache_alloc_cold(mapping);
+			if (!page) {
+				npages = i;
+				ret = -ENOMEM;
+				goto out;
+			}
+			ret = add_to_page_cache_lru(page, mapping,
+						    index, GFP_KERNEL);
+			if (ret) {
+				page_cache_release(page);
+				if (ret == -EEXIST) {
+					goto retry;
+				}
+				npages = i;
+				goto out;
+			}
+			/* A previous I/O error may have been due to temporary
+			 * failures, eg. multipath errors. PG_error will be set
+			 * again if readpage fails.
+			 *
+			 * FIXME I do not think this is necessary.
+			 */
+			ClearPageError(page);
+			/* Start the read. The read will unlock the page. */
+			error = mapping->a_ops->readpage(vma->vm_file, page);
+			page_cache_release(page);
+			if (error) {
+				npages = i;
+				ret = -EBUSY;
+				goto out;
+			}
+			goto retry;
+		}
+		item = radix_tree_deref_slot_protected(slotp,
+						       &mapping->tree_lock);
+		if (radix_tree_exceptional_entry(item)) {
+			swp_entry_t entry = radix_to_swp_entry(item);
+
+			/* The case of a private mapping of a file makes things
+			 * interesting, as both shared and private anonymous
+			 * pages can exist in such an rmem object.
+			 *
+			 * For now we just force them to go back to lmem;
+			 * supporting this would require another level of
+			 * indirection.
+			 */
+			if (!is_hmm_entry(entry)) {
+				spin_unlock_irq(&mapping->tree_lock);
+				npages = i;
+				ret = -EBUSY;
+				goto out;
+			}
+			/* FIXME handle shmem swap entry or some other device
+			 */
+			spin_unlock_irq(&mapping->tree_lock);
+			npages = i;
+			ret = -EBUSY;
+			goto out;
+		}
+		page = item;
+		if (unlikely(PageMlocked(page))) {
+			spin_unlock_irq(&mapping->tree_lock);
+			npages = i;
+			ret = -EBUSY;
+			goto out;
+		}
+		item = swp_to_radix_entry(make_hmm_entry(uid));
+		radix_tree_replace_slot(slotp, item);
+		rmem->pfns[i] = page_to_pfn(page) << HMM_PFN_SHIFT;
+		set_bit(HMM_PFN_VALID_PAGE, &rmem->pfns[i]);
+		set_bit(HMM_PFN_FILE, &rmem->pfns[i]);
+		rmem_mm.laddr = addr + PAGE_SIZE;
+
+		/* Pretend the page is being mapped; this makes the error
+		 * handling code a lot simpler and cleaner.
+		 */
+		page_add_file_rmap(page);
+		add_mm_counter(vma->vm_mm, MM_FILEPAGES, 1);
+
+		if (trylock_page(page)) {
+			set_bit(HMM_PFN_LOCK, &rmem->pfns[i]);
+			if (page->mapping != mapping) {
+				/* Page has been truncated. */
+				spin_unlock_irq(&mapping->tree_lock);
+				npages = i;
+				ret = -EBUSY;
+				goto out;
+			}
+		}
+		if (PageWriteback(page)) {
+			set_bit(HMM_PFN_WRITEBACK, &rmem->pfns[i]);
+		}
+	}
+	spin_unlock_irq(&mapping->tree_lock);
+
+	/* At this point any unlocked page can still be referenced by various
+	 * file activities (read, write, splice, ...). But no new mapping can
+	 * be instantiated, as the pagecache now holds the special entry.
+	 */
+
+	if (rmem->event->backoff) {
+		ret = -EBUSY;
+		goto out;
+	}
+
+	for (i = 0; i < npages; ++i) {
+		page = hmm_pfn_to_page(rmem->pfns[i]);
+		ret = isolate_lru_page(page);
+		if (ret) {
+			goto out;
+		}
+		/* isolate_lru_page takes an extra reference which we do not
+		 * want, as we are already holding one on the page. Holding
+		 * only one reference simplifies the error code path, which
+		 * then knows that there is exactly one reference per page and
+		 * does not need to track whether an extra reference from
+		 * isolate_lru_page is being held or not.
+		 */
+		put_page(page);
+		if (!test_bit(HMM_PFN_LOCK, &rmem->pfns[i])) {
+			lock_page(page);
+			set_bit(HMM_PFN_LOCK, &rmem->pfns[i]);
+			/* Has the page been truncated ? */
+			if (page->mapping != mapping) {
+				ret = -EBUSY;
+				goto out;
+			}
+		}
+		if (unlikely(!PageUptodate(page))) {
+			int error;
+
+			/* A previous I/O error may have been due to temporary
+			 * failures, eg. multipath errors. PG_error will be set
+			 * again if readpage fails.
+			 */
+			ClearPageError(page);
+			/* The read will unlock the page which is ok because no
+			 * one else knows about this page at this point.
+			 */
+			error = mapping->a_ops->readpage(vma->vm_file, page);
+			if (error) {
+				ret = -EBUSY;
+				goto out;
+			}
+			lock_page(page);
+		}
+		set_bit(HMM_PFN_LMEM_UPTODATE, &rmem->pfns[i]);
+	}
+
+	/* At this point all pages are locked, which means that the page
+	 * content is stable. Because we will reset the page->mapping field we
+	 * also know that anyone holding a reference on a page will retry the
+	 * lookup or skip the current operation.
+	 *
+	 * Also, at this point no one can be unmapping those pages from the
+	 * vma, as the hmm event prevents any mmu_notifier invalidation from
+	 * proceeding until we are done.
+	 *
+	 * We need to unmap the pages from the vma ourselves so we can properly
+	 * update the mm counters.
+	 */
+
+	if (rmem->event->backoff) {
+		ret = -EBUSY;
+		goto out;
+	}
+
+	if (current->mm == vma->vm_mm) {
+		sync_mm_rss(vma->vm_mm);
+	}
+	rmem_mm.force_flush = 0;
+	walk.pmd_entry = hmm_rmem_unmap_pmd;
+	walk.mm = vma->vm_mm;
+	walk.private = &rmem_mm;
+
+	mmu_notifier_invalidate_range_start(walk.mm,vma,faddr,laddr,MMU_HMM);
+	tlb_gather_mmu(&rmem_mm.tlb, walk.mm, faddr, laddr);
+	tlb_start_vma(&rmem_mm.tlb, rmem_mm.vma);
+	ret = walk_page_range(faddr, laddr, &walk);
+	tlb_end_vma(&rmem_mm.tlb, rmem_mm.vma);
+	tlb_finish_mmu(&rmem_mm.tlb, faddr, laddr);
+	mmu_notifier_invalidate_range_end(walk.mm, vma, faddr, laddr, MMU_HMM);
+
+	/* Remap any pages that were replaced by an anonymous page. */
+	list_for_each_entry_safe (page, tmp, &rmem_mm.remap_pages, lru) {
+		hmm_rmem_remap_file_single_page(rmem, page);
+	}
+
+	if (ret) {
+		npages = (rmem_mm.laddr - rmem_mm.faddr) >> PAGE_SHIFT;
+		goto out;
+	}
+
+	/* Now unmap from all other process. */
+
+	if (rmem->event->backoff) {
+		ret = -EBUSY;
+		goto out;
+	}
+
+	for (i = 0, ret = 0; i < npages; ++i) {
+		page = hmm_pfn_to_page(rmem->pfns[i]);
+
+		if (!test_bit(HMM_PFN_FILE, &rmem->pfns[i])) {
+			continue;
+		}
+
+		/* Because we did call page_add_file_rmap the mapcount must be
+		 * at least one. This was done to avoid page_remove_rmap
+		 * updating the memcg and mm statistics.
+		 */
+		BUG_ON(page_mapcount(page) <= 0);
+		if (page_mapcount(page) > 1) {
+			try_to_unmap(page,
+					 TTU_HMM |
+					 TTU_IGNORE_MLOCK |
+					 TTU_IGNORE_ACCESS);
+			if (page_mapcount(page) > 1) {
+				ret = ret ? ret : -EBUSY;
+			} else {
+				/* Everyone will think the page has been migrated,
+				 * truncated or reclaimed.
+				 */
+				page->mapping = NULL;
+			}
+		} else {
+			/* Everyone will think the page has been migrated,
+			 * truncated or reclaimed.
+			 */
+			page->mapping = NULL;
+		}
+		/* At this point no one else can write to the page. Save the
+		 * dirty bit and check it when handling faults.
+		 */
+		if (PageDirty(page)) {
+			set_bit(HMM_PFN_DIRTY, &rmem->pfns[i]);
+			ClearPageDirty(page);
+		}
+	}
+
+	/* This was a long journey but at this point hmm has exclusive
+	 * ownership of all the pages, all of them are accounted against the
+	 * process mm, and all are Uptodate and ready to be copied to remote
+	 * memory.
+	 */
+out:
+	if (ret) {
+		/* Unaccount any unmapped pages. */
+		for (i = 0; i < npages; ++i) {
+			if (test_bit(HMM_PFN_FILE, &rmem->pfns[i])) {
+				add_mm_counter(walk.mm, MM_FILEPAGES, -1);
+			}
+		}
+	}
+	return ret;
+}
+
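+/* Make the rmem backing a shared file mapping writeable by the device: call
+ * the filesystem page_mkwrite() callback if it has not been called yet and
+ * make sure no writeback is pending on the local copy before setting the
+ * HMM_PFN_WRITE bit.
+ */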
+static int hmm_rmem_file_mkwrite(struct hmm_rmem *rmem,
+				 struct vm_area_struct *vma,
+				 unsigned long addr,
+				 unsigned long uid)
+{
+	struct vm_fault vmf;
+	unsigned long idx = uid - rmem->fuid;
+	struct page *page;
+	int r;
+
+	page = hmm_pfn_to_page(rmem->pfns[idx]);
+	if (test_bit(HMM_PFN_FS_WRITEABLE, &rmem->pfns[idx])) {
+		lock_page(page);
+		page->mapping = rmem->mapping;
+		goto release;
+	}
+
+	vmf.virtual_address = (void __user *)(addr & PAGE_MASK);
+	vmf.pgoff = page->index;
+	vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
+	vmf.page = page;
+	page->mapping = rmem->mapping;
+	page_cache_get(page);
+
+	r = vma->vm_ops->page_mkwrite(vma, &vmf);
+	if (unlikely(r & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
+		page_cache_release(page);
+		return -EFAULT;
+	}
+	if (unlikely(!(r & VM_FAULT_LOCKED))) {
+		lock_page(page);
+		if (!page->mapping) {
+
+			WARN_ONCE(1, "hmm: page can not be truncated while in rmem !\n");
+			unlock_page(page);
+			page_cache_release(page);
+			return -EFAULT;
+		}
+	}
+	set_bit(HMM_PFN_FS_WRITEABLE, &rmem->pfns[idx]);
+	/* Ok to put_page here as we hold another reference. */
+	page_cache_release(page);
+
+release:
+	/* We clear the writeback bit now to forbid any new writeback. The
+	 * writeback code will need to go through its slow path to set the
+	 * writeback flag again.
+	 */
+	clear_bit(HMM_PFN_WRITEBACK, &rmem->pfns[idx]);
+	/* Now wait for any in progress writeback. */
+	if (PageWriteback(page)) {
+		wait_on_page_writeback(page);
+	}
+	/* The page count is what we use to synchronize with writeback. The
+	 * writeback code takes an extra reference on the page before returning
+	 * it to the fs writeback code, so at this point we would see that
+	 * extra reference and forbid the change.
+	 *
+	 * However, as we just waited for pending writeback above, if writeback
+	 * was already scheduled it is done by now and it should have dropped
+	 * the extra reference, thus the rmem can be written to again.
+	 */
+	if (page_count(page) > (1 + page_has_private(page))) {
+		page->mapping = NULL;
+		unlock_page(page);
+		return -EBUSY;
+	}
+	/* Nobody should have written to that page, thus nobody should have
+	 * set the dirty bit.
+	 */
+	BUG_ON(PageDirty(page));
+
+	/* Restore page count. */
+	page->mapping = NULL;
+	clear_bit(HMM_PFN_LMEM_UPTODATE, &rmem->pfns[idx]);
+	/* Ok now device can write to rmem. */
+	set_bit(HMM_PFN_WRITE, &rmem->pfns[idx]);
+	unlock_page(page);
+
+	return 0;
+}
+
 static inline int hmm_rmem_unmap(struct hmm_rmem *rmem,
 				 struct vm_area_struct *vma,
 				 unsigned long faddr,
 				 unsigned long laddr)
 {
 	if (vma->vm_file) {
-		return -EBUSY;
+		return hmm_rmem_unmap_file(rmem, vma, faddr, laddr);
 	} else {
 		return hmm_rmem_unmap_anon(rmem, vma, faddr, laddr);
 	}
@@ -1402,6 +2014,34 @@ static int hmm_rmem_alloc_pages(struct hmm_rmem *rmem,
 			vma = mm ? find_vma(mm, addr) : NULL;
 		}
 
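+		/* Entries that already have a valid page only need to be
+		 * locked, with any pending writeback waited on; do not
+		 * allocate a new page for them.
+		 */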
+		page = hmm_pfn_to_page(pfns[i]);
+		if (page && test_bit(HMM_PFN_VALID_PAGE, &pfns[i])) {
+			BUG_ON(test_bit(HMM_PFN_LOCK, &pfns[i]));
+			lock_page(page);
+			set_bit(HMM_PFN_LOCK, &pfns[i]);
+
+			/* Fake one mapping so that page_remove_rmap behaves as
+			 * we want.
+			 */
+			BUG_ON(page_mapcount(page));
+			atomic_set(&page->_mapcount, 0);
+
+			spin_lock(&rmem->lock);
+			if (test_bit(HMM_PFN_WRITEBACK, &pfns[i])) {
+				/* Clear the bit first; this is fine because
+				 * any thread that tests the bit will first
+				 * check rmem->event, which at this point is
+				 * set to the migration event.
+				 */
+				clear_bit(HMM_PFN_WRITEBACK, &pfns[i]);
+				spin_unlock(&rmem->lock);
+				wait_on_page_writeback(page);
+			} else {
+				spin_unlock(&rmem->lock);
+			}
+			continue;
+		}
+
 		/* No need to clear page they will be dma to of course this does
 		 * means we trust the device driver.
 		 */
@@ -1482,7 +2122,7 @@ int hmm_rmem_migrate_to_lmem(struct hmm_rmem *rmem,
 						 range->laddr,
 						 range->fuid,
 						 HMM_MIGRATE_TO_LMEM,
-						 false);
+						 !!(range->rmem->mapping));
 		if (IS_ERR(fence)) {
 			ret = PTR_ERR(fence);
 			goto error;
@@ -1517,6 +2157,19 @@ int hmm_rmem_migrate_to_lmem(struct hmm_rmem *rmem,
 		}
 	}
 
+	/* Sanity check the driver. */
+	for (i = 0; i < hmm_rmem_npages(rmem); ++i) {
+		if (!test_bit(HMM_PFN_LMEM_UPTODATE, &rmem->pfns[i])) {
+			WARN_ONCE(1, "hmm: driver failed to set HMM_PFN_LMEM_UPTODATE.\n");
+			ret = -EINVAL;
+			goto error;
+		}
+	}
+
+	if (rmem->mapping) {
+		hmm_rmem_remap_file(rmem);
+	}
+
 	/* Now the remote memory is officialy dead and nothing below can fails
 	 * badly.
 	 */
@@ -1526,6 +2179,13 @@ int hmm_rmem_migrate_to_lmem(struct hmm_rmem *rmem,
 	 * ranges list.
 	 */
 	list_for_each_entry_safe (range, next, &rmem->ranges, rlist) {
+		if (rmem->mapping) {
+			add_mm_counter(range->mirror->hmm->mm, MM_FILEPAGES,
+				       -hmm_range_npages(range));
+			hmm_range_fini(range);
+			continue;
+		}
+
 		VM_BUG_ON(!vma);
 		VM_BUG_ON(range->faddr < vma->vm_start);
 		VM_BUG_ON(range->laddr > vma->vm_end);
@@ -1544,8 +2204,20 @@ int hmm_rmem_migrate_to_lmem(struct hmm_rmem *rmem,
 	for (i = 0; i < hmm_rmem_npages(rmem); ++i) {
 		struct page *page = hmm_pfn_to_page(rmem->pfns[i]);
 
-		unlock_page(page);
-		mem_cgroup_transfer_charge_anon(page, mm);
+		/* The HMM_PFN_FILE bit is only set for pages that are in the
+		 * pagecache and thus are already accounted properly. So when
+		 * it is unset this is a private anonymous page for which we
+		 * need to transfer the charge.
+		 *
+		 * If remapping failed then page_remove_rmap below will update
+		 * the memcg and mm properly.
+		 */
+		if (mm && !test_bit(HMM_PFN_FILE, &rmem->pfns[i])) {
+			mem_cgroup_transfer_charge_anon(page, mm);
+		}
+		if (test_bit(HMM_PFN_LOCK, &rmem->pfns[i])) {
+			unlock_page(page);
+		}
 		page_remove_rmap(page);
 		page_cache_release(page);
 		rmem->pfns[i] = 0UL;
@@ -1563,6 +2235,19 @@ error:
 	 * (2) rmem is mirroring private memory, easy case poison all ranges
 	 *     referencing the rmem.
 	 */
+	if (rmem->mapping) {
+		/* No matter what, try to copy back the data; the driver should
+		 * be clever and not copy over pages with the
+		 * HMM_PFN_LMEM_UPTODATE bit set.
+		 */
+		fence = device->ops->rmem_to_lmem(rmem, rmem->fuid, rmem->luid);
+		if (fence && !IS_ERR(fence)) {
+			INIT_LIST_HEAD(&fence->list);
+			ret = hmm_device_fence_wait(device, fence);
+		}
+		/* FIXME how to handle error ? Mark page with error ? */
+		hmm_rmem_remap_file(rmem);
+	}
 	for (i = 0; i < hmm_rmem_npages(rmem); ++i) {
 		struct page *page = hmm_pfn_to_page(rmem->pfns[i]);
 
@@ -1573,9 +2258,11 @@ error:
 			}
 			continue;
 		}
-		/* Properly uncharge memory. */
-		mem_cgroup_transfer_charge_anon(page, mm);
-		if (!test_bit(HMM_PFN_LOCK, &rmem->pfns[i])) {
+		if (!test_bit(HMM_PFN_FILE, &rmem->pfns[i])) {
+			/* Properly uncharge memory. */
+			mem_cgroup_transfer_charge_anon(page, mm);
+		}
+		if (test_bit(HMM_PFN_LOCK, &rmem->pfns[i])) {
 			unlock_page(page);
 		}
 		page_remove_rmap(page);
@@ -1583,6 +2270,15 @@ error:
 		rmem->pfns[i] = 0UL;
 	}
 	list_for_each_entry_safe (range, next, &rmem->ranges, rlist) {
+		/* FIXME Philosophical question: should we poison other
+		 * processes that access this shared file ?
+		 */
+		if (rmem->mapping) {
+			add_mm_counter(range->mirror->hmm->mm, MM_FILEPAGES,
+				       -hmm_range_npages(range));
+			/* Case (1) FIXME implement ! */
+			hmm_range_fini(range);
+			continue;
+		}
+
 		mm = range->mirror->hmm->mm;
 		hmm_rmem_poison_range(rmem, mm, NULL, range->faddr,
 				      range->laddr, range->fuid);
@@ -2063,6 +2759,268 @@ int hmm_mm_fault(struct mm_struct *mm,
 	return VM_FAULT_MAJOR;
 }
 
+/* see include/linux/hmm.h */
+void hmm_pagecache_migrate(struct address_space *mapping,
+			   swp_entry_t swap)
+{
+	struct hmm_rmem *rmem = NULL;
+	unsigned long fuid, luid, npages;
+
+	/* This can not happen ! */
+	VM_BUG_ON(!is_hmm_entry(swap));
+
+	fuid = hmm_entry_uid(swap);
+	VM_BUG_ON(!fuid);
+
+	rmem = hmm_rmem_find(fuid);
+	if (!rmem || rmem->dead) {
+		hmm_rmem_unref(rmem);
+		return;
+	}
+
+	/* FIXME use something other than 16 pages. Readahead ? Or just the
+	 * whole range of dirty pages.
+	 */
+	npages = 16;
+	luid = min((fuid - rmem->fuid), (npages >> 2));
+	fuid = fuid - luid;
+	luid = min(fuid + npages, rmem->luid);
+
+	hmm_rmem_migrate_to_lmem(rmem, NULL, 0, fuid, luid, true);
+	hmm_rmem_unref(rmem);
+}
+EXPORT_SYMBOL(hmm_pagecache_migrate);
+
+/* see include/linux/hmm.h */
+struct page *hmm_pagecache_writeback(struct address_space *mapping,
+				     swp_entry_t swap)
+{
+	struct hmm_device *device;
+	struct hmm_range *range, *nrange;
+	struct hmm_fence *fence, *nfence;
+	struct hmm_event event;
+	struct hmm_rmem *rmem = NULL;
+	unsigned long i, uid, idx, npages;
+	/* FIXME hardcoded 16 */
+	struct page *pages[16];
+	bool dirty = false;
+	int ret;
+
+	/* Find the corresponding rmem. */
+	if (!is_hmm_entry(swap)) {
+		BUG();
+		return NULL;
+	}
+	uid = hmm_entry_uid(swap);
+	if (!uid) {
+		/* Poisonous hmm swap entry, this can not happen. */
+		BUG();
+		return NULL;
+	}
+
+retry:
+	rmem = hmm_rmem_find(uid);
+	if (!rmem) {
+		/* Someone likely migrated it back to lmem; by returning NULL
+		 * we make the caller perform a new lookup.
+		 */
+		return NULL;
+	}
+
+	if (rmem->dead) {
+		/* When dead is set everything is done. */
+		hmm_rmem_unref(rmem);
+		return NULL;
+	}
+
+	idx = uid - rmem->fuid;
+	device = rmem->device;
+	spin_lock(&rmem->lock);
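+	/* Another event is in progress on this rmem: ask a migration to rmem
+	 * to back off, then wait for the event to complete before retrying.
+	 */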
+	if (rmem->event) {
+		if (rmem->event->etype == HMM_MIGRATE_TO_RMEM) {
+			rmem->event->backoff = true;
+		}
+		spin_unlock(&rmem->lock);
+		wait_event(device->wait_queue, rmem->event == NULL);
+		hmm_rmem_unref(rmem);
+		goto retry;
+	}
+	pages[0] = hmm_pfn_to_page(rmem->pfns[idx]);
+	if (!pages[0]) {
+		spin_unlock(&rmem->lock);
+		hmm_rmem_unref(rmem);
+		goto retry;
+	}
+	get_page(pages[0]);
+	if (!trylock_page(pages[0])) {
+		unsigned long orig = rmem->pfns[idx];
+
+		spin_unlock(&rmem->lock);
+		lock_page(pages[0]);
+		spin_lock(&rmem->lock);
+		if (rmem->pfns[idx] != orig) {
+			spin_unlock(&rmem->lock);
+			unlock_page(pages[0]);
+			page_cache_release(pages[0]);
+			hmm_rmem_unref(rmem);
+			goto retry;
+		}
+	}
+	if (test_bit(HMM_PFN_LMEM_UPTODATE, &rmem->pfns[idx])) {
+		dirty = test_bit(HMM_PFN_DIRTY, &rmem->pfns[idx]);
+		set_bit(HMM_PFN_WRITEBACK, &rmem->pfns[idx]);
+		spin_unlock(&rmem->lock);
+		hmm_rmem_unref(rmem);
+		if (dirty) {
+			set_page_dirty(pages[0]);
+		}
+		return pages[0];
+	}
+
+	if (rmem->event) {
+		spin_unlock(&rmem->lock);
+		unlock_page(pages[0]);
+		page_cache_release(pages[0]);
+		wait_event(device->wait_queue, rmem->event == NULL);
+		hmm_rmem_unref(rmem);
+		goto retry;
+	}
+
+	/* Try to batch a few pages. */
+	/* FIXME use something other than 16 pages. Readahead ? Or just the
+	 * whole range of dirty pages.
+	 */
+	npages = 16;
+	set_bit(HMM_PFN_WRITEBACK, &rmem->pfns[idx]);
+	for (i = 1; i < npages; ++i) {
+		pages[i] = hmm_pfn_to_page(rmem->pfns[idx + i]);
+		if (!trylock_page(pages[i])) {
+			npages = i;
+			break;
+		}
+		if (test_bit(HMM_PFN_LMEM_UPTODATE, &rmem->pfns[idx + i])) {
+			unlock_page(pages[i]);
+			npages = i;
+			break;
+		}
+		set_bit(HMM_PFN_WRITEBACK, &rmem->pfns[idx + i]);
+		get_page(pages[i]);
+	}
+
+	event.etype = HMM_WRITEBACK;
+	event.faddr = uid;
+	event.laddr = uid + npages;
+	rmem->event = &event;
+	INIT_LIST_HEAD(&event.ranges);
+	list_for_each_entry (range, &rmem->ranges, rlist) {
+		list_add_tail(&range->elist, &event.ranges);
+	}
+	spin_unlock(&rmem->lock);
+
+	list_for_each_entry (range, &event.ranges, elist) {
+		unsigned long fuid, faddr, laddr;
+
+		if (event.laddr <  hmm_range_fuid(range) ||
+		    event.faddr >= hmm_range_luid(range)) {
+			continue;
+		}
+		fuid  = max(event.faddr, hmm_range_fuid(range));
+		faddr = fuid - hmm_range_fuid(range);
+		laddr = min(event.laddr, hmm_range_luid(range)) - fuid;
+		faddr = range->faddr + (faddr << PAGE_SHIFT);
+		laddr = range->faddr + (laddr << PAGE_SHIFT);
+		ret = hmm_mirror_rmem_update(range->mirror, rmem, faddr,
+					     laddr, fuid, &event, true);
+		if (ret) {
+			goto error;
+		}
+	}
+
+	list_for_each_entry_safe (fence, nfence, &event.fences, list) {
+		hmm_device_fence_wait(device, fence);
+	}
+
+	/* Event faddr is fuid and laddr is luid. */
+	fence = device->ops->rmem_to_lmem(rmem, event.faddr, event.laddr);
+	if (IS_ERR(fence)) {
+		goto error;
+	}
+	INIT_LIST_HEAD(&fence->list);
+	ret = hmm_device_fence_wait(device, fence);
+	if (ret) {
+		goto error;
+	}
+
+	spin_lock(&rmem->lock);
+	if (!test_bit(HMM_PFN_LMEM_UPTODATE, &rmem->pfns[idx + i])) {
+		/* This should not happen, the driver must set the bit. */
+		WARN_ONCE(1, "hmm: driver failed to set HMM_PFN_LMEM_UPTODATE.\n");
+		goto error;
+	}
+	rmem->event = NULL;
+	dirty = test_bit(HMM_PFN_DIRTY, &rmem->pfns[idx]);
+	list_for_each_entry_safe (range, nrange, &event.ranges, elist) {
+		list_del_init(&range->elist);
+	}
+	spin_unlock(&rmem->lock);
+	/* Do not unlock first page, return it locked. */
+	for (i = 1; i < npages; ++i) {
+		unlock_page(pages[i]);
+		page_cache_release(pages[i]);
+	}
+	wake_up(&device->wait_queue);
+	hmm_rmem_unref(rmem);
+	if (dirty) {
+		set_page_dirty(pages[0]);
+	}
+	return pages[0];
+
+error:
+	for (i = 0; i < npages; ++i) {
+		unlock_page(pages[i]);
+		page_cache_release(pages[i]);
+	}
+	spin_lock(&rmem->lock);
+	rmem->event = NULL;
+	list_for_each_entry_safe (range, nrange, &event.ranges, elist) {
+		list_del_init(&range->elist);
+	}
+	spin_unlock(&rmem->lock);
+	hmm_rmem_unref(rmem);
+	hmm_pagecache_migrate(mapping, swap);
+	return NULL;
+}
+EXPORT_SYMBOL(hmm_pagecache_writeback);
+
+struct page *hmm_pagecache_page(struct address_space *mapping,
+				swp_entry_t swap)
+{
+	struct hmm_rmem *rmem = NULL;
+	struct page *page;
+	unsigned long uid;
+
+	/* Find the corresponding rmem. */
+	if (!is_hmm_entry(swap)) {
+		BUG();
+		return NULL;
+	}
+	uid = hmm_entry_uid(swap);
+	if (!uid) {
+		/* Poisonous hmm swap entry, this can not happen. */
+		BUG();
+		return NULL;
+	}
+
+	rmem = hmm_rmem_find(uid);
+	if (!rmem) {
+		/* Someone likely migrated it back to lmem; by returning NULL
+		 * we make the caller perform a new lookup.
+		 */
+		return NULL;
+	}
+
+	page = hmm_pfn_to_page(rmem->pfns[uid - rmem->fuid]);
+	get_page(page);
+	hmm_rmem_unref(rmem);
+	return page;
+}
+
 
 
 
@@ -2667,7 +3625,7 @@ static int hmm_mirror_rmem_fault(struct hmm_mirror *mirror,
 {
 	struct hmm_device *device = mirror->device;
 	struct hmm_rmem *rmem = range->rmem;
-	unsigned long fuid, luid, npages;
+	unsigned long i, fuid, luid, npages, uid;
 	int ret;
 
 	if (range->mirror != mirror) {
@@ -2679,6 +3637,77 @@ static int hmm_mirror_rmem_fault(struct hmm_mirror *mirror,
 	fuid = range->fuid + ((faddr - range->faddr) >> PAGE_SHIFT);
 	luid = fuid + npages;
 
+	/* The rmem might not be uptodate so synchronize again. The only way
+	 * this might be the case is if a previous mkwrite failed and the
+	 * device decided to use the local memory copy.
+	 */
+	i = fuid - rmem->fuid;
+	for (uid = fuid; uid < luid; ++uid, ++i) {
+		if (!test_bit(HMM_PFN_RMEM_UPTODATE, &rmem->pfns[i])) {
+			struct hmm_fence *fence, *nfence;
+			enum hmm_etype etype = event->etype;
+
+			event->etype = HMM_UNMAP;
+			ret = hmm_mirror_rmem_update(mirror, rmem, range->faddr,
+						     range->laddr, range->fuid,
+						     event, true);
+			event->etype = etype;
+			if (ret) {
+				return ret;
+			}
+			list_for_each_entry_safe (fence, nfence,
+						  &event->fences, list) {
+				hmm_device_fence_wait(device, fence);
+			}
+			fence = device->ops->lmem_to_rmem(rmem, range->fuid,
+							  hmm_range_luid(range));
+			if (IS_ERR(fence)) {
+				return PTR_ERR(fence);
+			}
+			ret = hmm_device_fence_wait(device, fence);
+			if (ret) {
+				return ret;
+			}
+			break;
+		}
+	}
+
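+	/* Write fault on a file backed rmem: shared mappings must go through
+	 * the fs mkwrite path, while private mappings COW by handing the
+	 * pagecache page back and charging a new anonymous copy to the mm.
+	 */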
+	if (write && rmem->mapping) {
+		unsigned long addr;
+
+		if (current->mm == vma->vm_mm) {
+			sync_mm_rss(vma->vm_mm);
+		}
+		i = fuid - rmem->fuid;
+		addr = faddr;
+		for (uid = fuid; uid < luid; ++uid, ++i, addr += PAGE_SIZE) {
+			if (test_bit(HMM_PFN_WRITE, &rmem->pfns[i])) {
+				continue;
+			}
+			if (vma->vm_flags & VM_SHARED) {
+				ret = hmm_rmem_file_mkwrite(rmem, vma, addr, uid);
+				if (ret && ret != -EBUSY) {
+					return ret;
+				}
+			} else {
+				struct mm_struct *mm = vma->vm_mm;
+				struct page *page;
+
+				/* COW */
+				if (mem_cgroup_charge_anon(NULL, mm, GFP_KERNEL)) {
+					return -ENOMEM;
+				}
+				add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
+				spin_lock(&rmem->lock);
+				page = hmm_pfn_to_page(rmem->pfns[i]);
+				rmem->pfns[i] = 0;
+				set_bit(HMM_PFN_WRITE, &rmem->pfns[i]);
+				spin_unlock(&rmem->lock);
+				hmm_rmem_remap_file_single_page(rmem, page);
+			}
+		}
+	}
+
 	ret = device->ops->rmem_fault(mirror, rmem, faddr, laddr, fuid, fault);
 	return ret;
 }
@@ -2951,7 +3980,10 @@ static void hmm_migrate_abort(struct hmm_mirror *mirror,
 					      faddr, laddr, fuid);
 		}
 	} else {
-		BUG();
+		rmem.pgoff = vma->vm_pgoff;
+		rmem.pgoff += ((fault->faddr - vma->vm_start) >> PAGE_SHIFT);
+		rmem.mapping = vma->vm_file->f_mapping;
+		hmm_rmem_remap_file(&rmem);
 	}
 
 	/* Ok officialy dead. */
@@ -2977,6 +4009,15 @@ static void hmm_migrate_abort(struct hmm_mirror *mirror,
 			unlock_page(page);
 			clear_bit(HMM_PFN_LOCK, &pfns[i]);
 		}
+		if (test_bit(HMM_PFN_FILE, &pfns[i]) && !PageLRU(page)) {
+			/* To balance putback_lru_page and isolate_lru_page. As
+			 * a simplification we dropped the extra reference taken
+			 * by isolate_lru_page. This is why we need to take an
+			 * extra reference here for putback_lru_page.
+			 */
+			get_page(page);
+			putback_lru_page(page);
+		}
 		page_remove_rmap(page);
 		page_cache_release(page);
 		pfns[i] = 0;
@@ -2988,6 +4029,7 @@ int hmm_migrate_lmem_to_rmem(struct hmm_fault *fault,
 			     struct hmm_mirror *mirror)
 {
 	struct vm_area_struct *vma;
+	struct address_space *mapping;
 	struct hmm_device *device;
 	struct hmm_range *range;
 	struct hmm_fence *fence;
@@ -3042,7 +4084,8 @@ int hmm_migrate_lmem_to_rmem(struct hmm_fault *fault,
 		ret = -EACCES;
 		goto out;
 	}
-	if (vma->vm_file) {
+	mapping = vma->vm_file ? vma->vm_file->f_mapping : NULL;
+	if (vma->vm_file && !(mapping->a_ops->features & AOPS_FEATURE_HMM)) {
 		kfree(range);
 		range = NULL;
 		ret = -EBUSY;
@@ -3053,6 +4096,7 @@ int hmm_migrate_lmem_to_rmem(struct hmm_fault *fault,
 	event->laddr  =fault->laddr = min(fault->laddr, vma->vm_end);
 	npages = (fault->laddr - fault->faddr) >> PAGE_SHIFT;
 	fault->vma = vma;
+	rmem.mapping = (vma->vm_flags & VM_SHARED) ? mapping : NULL;
 
 	ret = hmm_rmem_alloc(&rmem, npages);
 	if (ret) {
@@ -3100,6 +4144,7 @@ int hmm_migrate_lmem_to_rmem(struct hmm_fault *fault,
 	hmm_rmem_tree_insert(fault->rmem, &_hmm_rmems);
 	fault->rmem->pfns = rmem.pfns;
 	range->rmem = fault->rmem;
+	fault->rmem->mapping = rmem.mapping;
 	list_del_init(&range->rlist);
 	list_add_tail(&range->rlist, &fault->rmem->ranges);
 	rmem.event = NULL;
@@ -3128,7 +4173,6 @@ int hmm_migrate_lmem_to_rmem(struct hmm_fault *fault,
 		struct page *page = hmm_pfn_to_page(rmem.pfns[i]);
 
 		if (test_bit(HMM_PFN_VALID_ZERO, &rmem.pfns[i])) {
-			rmem.pfns[i] = rmem.pfns[i] & HMM_PFN_CLEAR;
 			continue;
 		}
 		/* We only decrement now the page count so that cow happen
diff --git a/mm/madvise.c b/mm/madvise.c
index 539eeb9..7c13f8d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -202,6 +202,10 @@ static void force_shm_swapin_readahead(struct vm_area_struct *vma,
 			continue;
 		}
 		swap = radix_to_swp_entry(page);
+		if (is_hmm_entry(swap)) {
+			/* FIXME start migration here ? */
+			continue;
+		}
 		page = read_swap_cache_async(swap, GFP_HIGHUSER_MOVABLE,
 								NULL, 0);
 		if (page)
diff --git a/mm/mincore.c b/mm/mincore.c
index 725c809..107b870 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -79,6 +79,10 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff)
 		 */
 		if (radix_tree_exceptional_entry(page)) {
 			swp_entry_t swp = radix_to_swp_entry(page);
+
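+			/* Memory migrated to a device is still resident from
+			 * the process point of view.
+			 */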
+			if (is_hmm_entry(swp)) {
+				return 1;
+			}
 			page = find_get_page(swap_address_space(swp), swp.val);
 		}
 	} else
@@ -86,6 +90,13 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff)
 #else
 	page = find_get_page(mapping, pgoff);
 #endif
+	if (radix_tree_exceptional_entry(page)) {
+		swp_entry_t swap = radix_to_swp_entry(page);
+
+		if (is_hmm_entry(swap)) {
+			return 1;
+		}
+	}
 	if (page) {
 		present = PageUptodate(page);
 		page_cache_release(page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 023cf08..b6dcf80 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -37,6 +37,7 @@
 #include <linux/timer.h>
 #include <linux/sched/rt.h>
 #include <linux/mm_inline.h>
+#include <linux/hmm.h>
 #include <trace/events/writeback.h>
 
 #include "internal.h"
@@ -1900,6 +1901,8 @@ retry:
 		tag_pages_for_writeback(mapping, index, end);
 	done_index = index;
 	while (!done && (index <= end)) {
+		pgoff_t save_index = index;
+		bool migrated = false;
 		int i;
 
 		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
@@ -1907,58 +1910,106 @@ retry:
 		if (nr_pages == 0)
 			break;
 
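+		/* Pages migrated to remote memory show up as exceptional hmm
+		 * entries: get a stable local copy to write back, or note that
+		 * a migration back to lmem happened and redo the lookup.
+		 */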
+		for (i = 0, migrated = false; i < nr_pages; i++) {
+			struct page *page = pvec.pages[i];
+
+			if (radix_tree_exceptional_entry(page)) {
+				swp_entry_t swap = radix_to_swp_entry(page);
+
+				/* This can not happen ! */
+				BUG_ON(!is_hmm_entry(swap));
+				page = hmm_pagecache_writeback(mapping, swap);
+				if (page == NULL) {
+					migrated = true;
+					pvec.pages[i] = NULL;
+				}
+			}
+		}
+
+		/* Some rmem was migrated, we need to redo the pagecache lookup. */
+		if (migrated) {
+			for (i = 0; i < nr_pages; i++) {
+				struct page *page = pvec.pages[i];
+
+				if (page && radix_tree_exceptional_entry(page)) {
+					swp_entry_t swap = radix_to_swp_entry(page);
+
+					page = hmm_pagecache_page(mapping, swap);
+					unlock_page(page);
+					page_cache_release(page);
+					pvec.pages[i] = page;
+				}
+			}
+			pagevec_release(&pvec);
+			cond_resched();
+			index = save_index;
+			goto retry;
+		}
+
 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];
 
-			/*
-			 * At this point, the page may be truncated or
-			 * invalidated (changing page->mapping to NULL), or
-			 * even swizzled back from swapper_space to tmpfs file
-			 * mapping. However, page->index will not change
-			 * because we have a reference on the page.
-			 */
-			if (page->index > end) {
+			if (radix_tree_exceptional_entry(page)) {
+				swp_entry_t swap = radix_to_swp_entry(page);
+
+				pvec.pages[i] = page = hmm_pagecache_page(mapping, swap);
+				page_cache_release(page);
+				done_index = page->index;
+			} else {
 				/*
-				 * can't be range_cyclic (1st pass) because
-				 * end == -1 in that case.
+				 * At this point, the page may be truncated or
+				 * invalidated (changing page->mapping to NULL), or
+				 * even swizzled back from swapper_space to tmpfs file
+				 * mapping. However, page->index will not change
+				 * because we have a reference on the page.
 				 */
-				done = 1;
-				break;
-			}
+				if (page->index > end) {
+					/*
+					 * can't be range_cyclic (1st pass) because
+					 * end == -1 in that case.
+					 */
+					done = 1;
+					break;
+				}
 
-			done_index = page->index;
+				done_index = page->index;
 
-			lock_page(page);
+				lock_page(page);
 
-			/*
-			 * Page truncated or invalidated. We can freely skip it
-			 * then, even for data integrity operations: the page
-			 * has disappeared concurrently, so there could be no
-			 * real expectation of this data interity operation
-			 * even if there is now a new, dirty page at the same
-			 * pagecache address.
-			 */
-			if (unlikely(page->mapping != mapping)) {
-continue_unlock:
-				unlock_page(page);
-				continue;
+				/*
+				 * Page truncated or invalidated. We can freely skip it
+				 * then, even for data integrity operations: the page
+				 * has disappeared concurrently, so there could be no
+				 * real expectation of this data interity operation
+				 * even if there is now a new, dirty page at the same
+				 * pagecache address.
+				 */
+				if (unlikely(page->mapping != mapping)) {
+					unlock_page(page);
+					continue;
+				}
 			}
 
 			if (!PageDirty(page)) {
 				/* someone wrote it for us */
-				goto continue_unlock;
+				unlock_page(page);
+				continue;
 			}
 
 			if (PageWriteback(page)) {
-				if (wbc->sync_mode != WB_SYNC_NONE)
+				if (wbc->sync_mode != WB_SYNC_NONE) {
 					wait_on_page_writeback(page);
-				else
-					goto continue_unlock;
+				} else {
+					unlock_page(page);
+					continue;
+				}
 			}
 
 			BUG_ON(PageWriteback(page));
-			if (!clear_page_dirty_for_io(page))
-				goto continue_unlock;
+			if (!clear_page_dirty_for_io(page)) {
+				unlock_page(page);
+				continue;
+			}
 
 			trace_wbc_writepage(wbc, mapping->backing_dev_info);
 			ret = (*writepage)(page, wbc, data);
@@ -1994,6 +2045,20 @@ continue_unlock:
 				break;
 			}
 		}
+
+		/* Some entries of pvec might still be exceptional! */
+		for (i = 0; i < nr_pages; i++) {
+			struct page *page = pvec.pages[i];
+
+			if (radix_tree_exceptional_entry(page)) {
+				swp_entry_t swap = radix_to_swp_entry(page);
+
+				page = hmm_pagecache_page(mapping, swap);
+				unlock_page(page);
+				page_cache_release(page);
+				pvec.pages[i] = page;
+			}
+		}
 		pagevec_release(&pvec);
 		cond_resched();
 	}
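
(Aside, not part of the patch: every writeback path touched by this series follows
the same shape as the hunk above, so here is a minimal standalone sketch of it. The
helper name is made up for illustration; hmm_pagecache_writeback(),
radix_tree_exceptional_entry() and radix_to_swp_entry() are the helpers the patch
itself relies on.)

#include <linux/mm.h>
#include <linux/fs.h>
#include <linux/pagevec.h>
#include <linux/radix-tree.h>
#include <linux/swap.h>
#include <linux/swapops.h>
#include <linux/hmm.h>

/*
 * Sketch only: scan a pagevec returned by pagevec_lookup_tag() and ask hmm to
 * start writeback for every hmm exceptional entry. Returns true when some
 * remote memory had to be migrated back, in which case the caller must drop
 * the pagevec and redo the lookup from its saved index.
 */
static bool hmm_prepare_pvec_for_writeback(struct address_space *mapping,
					   struct pagevec *pvec, int nr_pages)
{
	bool need_retry = false;
	int i;

	for (i = 0; i < nr_pages; i++) {
		struct page *page = pvec->pages[i];

		if (!radix_tree_exceptional_entry(page))
			continue;

		/* Ask hmm for a (locked) system page backing the remote copy. */
		page = hmm_pagecache_writeback(mapping, radix_to_swp_entry(page));
		if (page == NULL) {
			/* Remote memory is being migrated back, retry later. */
			need_retry = true;
			pvec->pages[i] = NULL;
		}
	}
	return need_retry;
}

write_cache_pages() additionally remembers save_index so that, after such a
migration, it can redo the tagged lookup without losing its position.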
diff --git a/mm/rmap.c b/mm/rmap.c
index e07450c..3b7fbd3c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1132,6 +1132,9 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	case TTU_MUNLOCK:
 		action = MMU_MUNLOCK;
 		break;
+	case TTU_HMM:
+		action = MMU_HMM;
+		break;
 	default:
 		/* Please report this ! */
 		BUG();
@@ -1327,6 +1330,9 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 	case TTU_MUNLOCK:
 		action = MMU_MUNLOCK;
 		break;
+	case TTU_HMM:
+		action = MMU_HMM;
+		break;
 	default:
 		/* Please report this ! */
 		BUG();
@@ -1426,7 +1432,12 @@ static int try_to_unmap_nonlinear(struct page *page,
 	unsigned long cursor;
 	unsigned long max_nl_cursor = 0;
 	unsigned long max_nl_size = 0;
-	unsigned int mapcount;
+	unsigned int mapcount, min_mapcount = 0;
+
+	/* The hmm code keeps mapcount elevated to 1 to avoid updating mm and
+	 * memcg. If we are called on behalf of hmm, just ignore this extra 1.
+	 */
+	min_mapcount = (TTU_ACTION((enum ttu_flags)arg) == TTU_HMM) ? 1 : 0;
 
 	list_for_each_entry(vma,
 		&mapping->i_mmap_nonlinear, shared.nonlinear) {
@@ -1449,8 +1460,10 @@ static int try_to_unmap_nonlinear(struct page *page,
 	 * just walk the nonlinear vmas trying to age and unmap some.
 	 * The mapcount of the page we came in with is irrelevant,
 	 * but even so use it as a guide to how hard we should try?
+	 *
+	 * See comment about hmm above for min_mapcount.
 	 */
-	mapcount = page_mapcount(page);
+	mapcount = page_mapcount(page) - min_mapcount;
 	if (!mapcount)
 		return ret;
 
diff --git a/mm/swap.c b/mm/swap.c
index c0ed4d6..426fede 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -839,6 +839,15 @@ void release_pages(struct page **pages, int nr, int cold)
 	for (i = 0; i < nr; i++) {
 		struct page *page = pages[i];
 
+		if (!page) {
+			continue;
+		}
+		if (radix_tree_exceptional_entry(page)) {
+			/* This should really not happen, tell us about it! */
+			WARN_ONCE(1, "hmm exceptional entry left\n");
+			continue;
+		}
+
 		if (unlikely(PageCompound(page))) {
 			if (zone) {
 				spin_unlock_irqrestore(&zone->lru_lock, flags);
diff --git a/mm/truncate.c b/mm/truncate.c
index 6a78c81..c979fd6 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -20,6 +20,7 @@
 #include <linux/buffer_head.h>	/* grr. try_to_release_page,
 				   do_invalidatepage */
 #include <linux/cleancache.h>
+#include <linux/hmm.h>
 #include "internal.h"
 
 static void clear_exceptional_entry(struct address_space *mapping,
@@ -281,6 +282,32 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE),
 			indices)) {
+		bool migrated = false;
+
+		for (i = 0; i < pagevec_count(&pvec); ++i) {
+			struct page *page = pvec.pages[i];
+
+			if (radix_tree_exceptional_entry(page)) {
+				swp_entry_t swap = radix_to_swp_entry(page);
+
+				/* FIXME How to handle hmm migration failure ? */
+				hmm_pagecache_migrate(mapping, swap);
+				for (; i < pagevec_count(&pvec); ++i) {
+					if (radix_tree_exceptional_entry(pvec.pages[i])) {
+						pvec.pages[i] = NULL;
+					}
+				}
+				migrated = true;
+				break;
+			}
+		}
+
+		if (migrated) {
+			pagevec_release(&pvec);
+			cond_resched();
+			continue;
+		}
+
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
@@ -313,7 +340,16 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	}
 
 	if (partial_start) {
-		struct page *page = find_lock_page(mapping, start - 1);
+		struct page *page;
+
+	repeat_start:
+		page = find_lock_page(mapping, start - 1);
+		if (radix_tree_exceptional_entry(page)) {
+			swp_entry_t swap = radix_to_swp_entry(page);
+
+			hmm_pagecache_migrate(mapping, swap);
+			goto repeat_start;
+		}
 		if (page) {
 			unsigned int top = PAGE_CACHE_SIZE;
 			if (start > end) {
@@ -332,7 +368,15 @@ void truncate_inode_pages_range(struct address_space *mapping,
 		}
 	}
 	if (partial_end) {
-		struct page *page = find_lock_page(mapping, end);
+		struct page *page;
+	repeat_end:
+		page = find_lock_page(mapping, end);
+		if (radix_tree_exceptional_entry(page)) {
+			swp_entry_t swap = radix_to_swp_entry(page);
+
+			hmm_pagecache_migrate(mapping, swap);
+			goto repeat_end;
+		}
 		if (page) {
 			wait_on_page_writeback(page);
 			zero_user_segment(page, 0, partial_end);
@@ -371,6 +415,9 @@ void truncate_inode_pages_range(struct address_space *mapping,
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
+			/* FIXME Find a way to block rmem migration on truncate. */
+			BUG_ON(radix_tree_exceptional_entry(page));
+
 			/* We rely upon deletion not changing page->index */
 			index = indices[i];
 			if (index >= end)
@@ -488,6 +535,32 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 	while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
 			indices)) {
+		bool migrated = false;
+
+		for (i = 0; i < pagevec_count(&pvec); ++i) {
+			struct page *page = pvec.pages[i];
+
+			if (radix_tree_exceptional_entry(page)) {
+				swp_entry_t swap = radix_to_swp_entry(page);
+
+				/* FIXME How to handle hmm migration failure ? */
+				hmm_pagecache_migrate(mapping, swap);
+				for (; i < pagevec_count(&pvec); ++i) {
+					if (radix_tree_exceptional_entry(pvec.pages[i])) {
+						pvec.pages[i] = NULL;
+					}
+				}
+				migrated = true;
+				break;
+			}
+		}
+
+		if (migrated) {
+			pagevec_release(&pvec);
+			cond_resched();
+			continue;
+		}
+
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
@@ -597,6 +670,32 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 	while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
 			indices)) {
+		bool migrated = false;
+
+		for (i = 0; i < pagevec_count(&pvec); ++i) {
+			struct page *page = pvec.pages[i];
+
+			if (radix_tree_exceptional_entry(page)) {
+				swp_entry_t swap = radix_to_swp_entry(page);
+
+				/* FIXME How to handle hmm migration failure ? */
+				hmm_pagecache_migrate(mapping, swap);
+				for (; i < pagevec_count(&pvec); ++i) {
+					if (radix_tree_exceptional_entry(pvec.pages[i])) {
+						pvec.pages[i] = NULL;
+					}
+				}
+				migrated = true;
+				break;
+			}
+		}
+
+		if (migrated) {
+			pagevec_release(&pvec);
+			cond_resched();
+			continue;
+		}
+
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 09/11] fs/ext4: add support for hmm migration to remote memory of pagecache.
  2014-05-02 13:51 ` j.glisse
  (?)
@ 2014-05-02 13:52   ` j.glisse
  -1 siblings, 0 replies; 107+ messages in thread
From: j.glisse @ 2014-05-02 13:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel; +Cc: Jérôme Glisse

From: Jérôme Glisse <jglisse@redhat.com>

This adds support for migrating pages of the ext4 filesystem to remote
device memory using the hmm infrastructure. Writeback needs special
handling as we want to keep the content inside remote memory.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 fs/ext4/file.c  |  20 +++++++
 fs/ext4/inode.c | 175 +++++++++++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 174 insertions(+), 21 deletions(-)
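
(Aside, not part of the patch: the ext4 hunks below reuse one retry pattern whenever
a page cache lookup returns an hmm exceptional entry in a path that cannot operate
on remote memory. A minimal sketch of that pattern follows, with a made-up helper
name and assuming, as the series does, that find_lock_page() can return an hmm
exceptional entry.)

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/radix-tree.h>
#include <linux/swap.h>
#include <linux/swapops.h>
#include <linux/hmm.h>

/*
 * Sketch only: look up and lock a page, forcing any hmm remote memory at
 * that index to be migrated back to system memory first. This mirrors the
 * FIXME-marked retry loops in the hunks below.
 */
static struct page *lookup_page_force_lmem(struct address_space *mapping,
					   pgoff_t index)
{
	struct page *page;

retry:
	page = find_lock_page(mapping, index);
	if (radix_tree_exception(page)) {
		/* Remote memory: migrate it back, then redo the lookup. */
		hmm_pagecache_migrate(mapping, radix_to_swp_entry(page));
		goto retry;
	}
	/* May be NULL when there is no page cached at @index. */
	return page;
}

The important property is that the caller never sees remote memory: by the time
the lookup succeeds, the data is backed by a regular system page again.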

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 708aad7..7c787d5 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -26,6 +26,7 @@
 #include <linux/aio.h>
 #include <linux/quotaops.h>
 #include <linux/pagevec.h>
+#include <linux/hmm.h>
 #include "ext4.h"
 #include "ext4_jbd2.h"
 #include "xattr.h"
@@ -304,6 +305,7 @@ static int ext4_find_unwritten_pgoff(struct inode *inode,
 		unsigned long nr_pages;
 
 		num = min_t(pgoff_t, end - index, PAGEVEC_SIZE);
+retry:
 		nr_pages = pagevec_lookup(&pvec, inode->i_mapping, index,
 					  (pgoff_t)num);
 		if (nr_pages == 0) {
@@ -321,6 +323,24 @@ static int ext4_find_unwritten_pgoff(struct inode *inode,
 			break;
 		}
 
+		for (i = 0; i < nr_pages; i++) {
+			struct page *page = pvec.pages[i];
+
+			if (radix_tree_exception(page)) {
+				swp_entry_t swap = radix_to_swp_entry(page);
+
+				/* FIXME How to handle hmm migration failure ? */
+				hmm_pagecache_migrate(inode->i_mapping, swap);
+				for (; i < nr_pages; i++) {
+					if (radix_tree_exception(pvec.pages[i])) {
+						pvec.pages[i] = NULL;
+					}
+				}
+				pagevec_release(&pvec);
+				goto retry;
+			}
+		}
+
 		/*
 		 * If this is the first time to go into the loop and
 		 * offset is smaller than the first page offset, it will be a
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b1dc334..f2558e2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -39,6 +39,7 @@
 #include <linux/ratelimit.h>
 #include <linux/aio.h>
 #include <linux/bitops.h>
+#include <linux/hmm.h>
 
 #include "ext4_jbd2.h"
 #include "xattr.h"
@@ -1462,16 +1463,37 @@ static void mpage_release_unused_pages(struct mpage_da_data *mpd,
 			break;
 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];
-			if (page->index > end)
-				break;
-			BUG_ON(!PageLocked(page));
-			BUG_ON(PageWriteback(page));
-			if (invalidate) {
-				block_invalidatepage(page, 0, PAGE_CACHE_SIZE);
-				ClearPageUptodate(page);
+
+			if (radix_tree_exceptional_entry(page)) {
+				swp_entry_t swap = radix_to_swp_entry(page);
+
+				page = hmm_pagecache_page(mapping, swap);
+				pvec.pages[i] = page;
+				if (page->index > end)
+					break;
+			} else {
+				if (page->index > end)
+					break;
+				BUG_ON(!PageLocked(page));
+				BUG_ON(PageWriteback(page));
+				if (invalidate) {
+					block_invalidatepage(page, 0, PAGE_CACHE_SIZE);
+					ClearPageUptodate(page);
+				}
 			}
 			unlock_page(page);
 		}
+		for (; i < nr_pages; i++) {
+			struct page *page = pvec.pages[i];
+
+			if (radix_tree_exceptional_entry(page)) {
+				swp_entry_t swap = radix_to_swp_entry(page);
+
+				page = hmm_pagecache_page(mapping, swap);
+				unlock_page(page);
+				pvec.pages[i] = page;
+			}
+		}
 		index = pvec.pages[nr_pages - 1]->index + 1;
 		pagevec_release(&pvec);
 	}
@@ -2060,6 +2082,20 @@ static int mpage_map_and_submit_buffers(struct mpage_da_data *mpd)
 					  PAGEVEC_SIZE);
 		if (nr_pages == 0)
 			break;
+
+		/* Replace hmm entry with the page backing it. At this point
+		 * they are uptodate and locked.
+		 */
+		for (i = 0; i < nr_pages; i++) {
+			struct page *page = pvec.pages[i];
+
+			if (radix_tree_exceptional_entry(page)) {
+				swp_entry_t swap = radix_to_swp_entry(page);
+
+				 pvec.pages[i] = hmm_pagecache_page(inode->i_mapping, swap);
+			}
+		}
+
 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];
 
@@ -2331,13 +2367,61 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
 	mpd->map.m_len = 0;
 	mpd->next_page = index;
 	while (index <= end) {
+		pgoff_t save_index = index;
+		bool migrated;
+
 		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
 			      min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
 		if (nr_pages == 0)
 			goto out;
 
+		for (i = 0, migrated = false; i < nr_pages; i++) {
+			struct page *page = pvec.pages[i];
+
+			if (radix_tree_exceptional_entry(page)) {
+				swp_entry_t swap = radix_to_swp_entry(page);
+
+				/* This can not happen ! */
+				VM_BUG_ON(!is_hmm_entry(swap));
+				page = hmm_pagecache_writeback(mapping, swap);
+				if (page == NULL) {
+					migrated = true;
+					pvec.pages[i] = NULL;
+				}
+			}
+		}
+
+		/* Some rmem was migrated, we need to redo the page cache lookup. */
+		if (migrated) {
+			for (i = 0; i < nr_pages; i++) {
+				struct page *page = pvec.pages[i];
+
+				if (page && radix_tree_exceptional_entry(page)) {
+					swp_entry_t swap = radix_to_swp_entry(page);
+
+					page = hmm_pagecache_page(mapping, swap);
+					unlock_page(page);
+					page_cache_release(page);
+					pvec.pages[i] = page;
+				}
+			}
+			pagevec_release(&pvec);
+			cond_resched();
+			index = save_index;
+			continue;
+		}
+
 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];
+			struct page *hmm_page = NULL;
+
+			if (radix_tree_exceptional_entry(page)) {
+				swp_entry_t swap = radix_to_swp_entry(page);
+
+				pvec.pages[i] = hmm_pagecache_page(mapping, swap);
+				hmm_page = page = pvec.pages[i];
+				page_cache_release(hmm_page);
+			}
 
 			/*
 			 * At this point, the page may be truncated or
@@ -2364,20 +2448,24 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
 			if (mpd->map.m_len > 0 && mpd->next_page != page->index)
 				goto out;
 
-			lock_page(page);
-			/*
-			 * If the page is no longer dirty, or its mapping no
-			 * longer corresponds to inode we are writing (which
-			 * means it has been truncated or invalidated), or the
-			 * page is already under writeback and we are not doing
-			 * a data integrity writeback, skip the page
-			 */
-			if (!PageDirty(page) ||
-			    (PageWriteback(page) &&
-			     (mpd->wbc->sync_mode == WB_SYNC_NONE)) ||
-			    unlikely(page->mapping != mapping)) {
-				unlock_page(page);
-				continue;
+			if (!hmm_page) {
+				lock_page(page);
+
+				/* If the page is no longer dirty, or its
+				 * mapping no longer corresponds to inode
+				 * we are writing (which means it has been
+				 * truncated or invalidated), or the page
+				 * is already under writeback and we are
+				 * not doing a data integrity writeback,
+				 * skip the page
+				 */
+				if (!PageDirty(page) ||
+				    (PageWriteback(page) &&
+				     (mpd->wbc->sync_mode == WB_SYNC_NONE)) ||
+				    unlikely(page->mapping != mapping)) {
+					unlock_page(page);
+					continue;
+				}
 			}
 
 			wait_on_page_writeback(page);
@@ -2396,11 +2484,37 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
 			err = 0;
 			left--;
 		}
+		/* Some entries of pvec might still be exceptional! */
+		for (i = 0; i < nr_pages; i++) {
+			struct page *page = pvec.pages[i];
+
+			if (radix_tree_exceptional_entry(page)) {
+				swp_entry_t swap = radix_to_swp_entry(page);
+
+				page = hmm_pagecache_page(mapping, swap);
+				unlock_page(page);
+				page_cache_release(page);
+				pvec.pages[i] = page;
+			}
+		}
 		pagevec_release(&pvec);
 		cond_resched();
 	}
 	return 0;
 out:
+	/* Some entries of pvec might still be exceptional! */
+	for (i = 0; i < nr_pages; i++) {
+		struct page *page = pvec.pages[i];
+
+		if (radix_tree_exceptional_entry(page)) {
+			swp_entry_t swap = radix_to_swp_entry(page);
+
+			page = hmm_pagecache_page(mapping, swap);
+			unlock_page(page);
+			page_cache_release(page);
+			pvec.pages[i] = page;
+		}
+	}
 	pagevec_release(&pvec);
 	return err;
 }
@@ -3281,6 +3395,7 @@ static const struct address_space_operations ext4_aops = {
 	.migratepage		= buffer_migrate_page,
 	.is_partially_uptodate  = block_is_partially_uptodate,
 	.error_remove_page	= generic_error_remove_page,
+	.features		= AOPS_FEATURE_HMM,
 };
 
 static const struct address_space_operations ext4_journalled_aops = {
@@ -3297,6 +3412,7 @@ static const struct address_space_operations ext4_journalled_aops = {
 	.direct_IO		= ext4_direct_IO,
 	.is_partially_uptodate  = block_is_partially_uptodate,
 	.error_remove_page	= generic_error_remove_page,
+	.features		= AOPS_FEATURE_HMM,
 };
 
 static const struct address_space_operations ext4_da_aops = {
@@ -3313,6 +3429,7 @@ static const struct address_space_operations ext4_da_aops = {
 	.migratepage		= buffer_migrate_page,
 	.is_partially_uptodate  = block_is_partially_uptodate,
 	.error_remove_page	= generic_error_remove_page,
+	.features		= AOPS_FEATURE_HMM,
 };
 
 void ext4_set_aops(struct inode *inode)
@@ -3355,11 +3472,20 @@ static int ext4_block_zero_page_range(handle_t *handle,
 	struct page *page;
 	int err = 0;
 
+retry:
 	page = find_or_create_page(mapping, from >> PAGE_CACHE_SHIFT,
 				   mapping_gfp_mask(mapping) & ~__GFP_FS);
 	if (!page)
 		return -ENOMEM;
 
+	if (radix_tree_exception(page)) {
+		swp_entry_t swap = radix_to_swp_entry(page);
+
+		/* FIXME How to handle hmm migration failure ? */
+		hmm_pagecache_migrate(mapping, swap);
+		goto retry;
+	}
+
 	blocksize = inode->i_sb->s_blocksize;
 	max = blocksize - (offset & (blocksize - 1));
 
@@ -4529,6 +4655,13 @@ static void ext4_wait_for_tail_page_commit(struct inode *inode)
 				      inode->i_size >> PAGE_CACHE_SHIFT);
 		if (!page)
 			return;
+		if (radix_tree_exception(page)) {
+			swp_entry_t swap = radix_to_swp_entry(page);
+
+			/* FIXME How to handle hmm migration failure ? */
+			hmm_pagecache_migrate(inode->i_mapping, swap);
+			continue;
+		}
 		ret = __ext4_journalled_invalidatepage(page, offset,
 						PAGE_CACHE_SIZE - offset);
 		unlock_page(page);
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 10/11] hmm/dummy: dummy driver to showcase the hmm api.
  2014-05-02 13:51 ` j.glisse
  (?)
@ 2014-05-02 13:52   ` j.glisse
  -1 siblings, 0 replies; 107+ messages in thread
From: j.glisse @ 2014-05-02 13:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel; +Cc: Jérôme Glisse

From: Jérôme Glisse <jglisse@redhat.com>

This is a dummy driver which fulfills two purposes:
  - showcase the hmm api and give a reference on how to use it.
  - provide an extensive user space api to stress test hmm.

This is a particularly dangerous module as it allows access to a
mirror of a process address space through its device file. Hence
it should not be enabled by default and only people actively
developing for hmm should use it.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 drivers/char/Kconfig           |    9 +
 drivers/char/Makefile          |    1 +
 drivers/char/hmm_dummy.c       | 1128 ++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/hmm_dummy.h |   34 ++
 4 files changed, 1172 insertions(+)
 create mode 100644 drivers/char/hmm_dummy.c
 create mode 100644 include/uapi/linux/hmm_dummy.h
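
(Aside, not part of the patch: to make the "user space api to stress test hmm" point
concrete, a test program would drive the dummy driver roughly as below. The device
node name, the HMM_DUMMY_MIRROR request and the offset-equals-mirror-address
convention are placeholders and assumptions for illustration only; the real request
codes live in the include/uapi/linux/hmm_dummy.h header added by this patch.)

/* Hypothetical user space sketch, not part of the patch. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Placeholder: the real ioctl numbers are defined in the uapi header. */
#define HMM_DUMMY_MIRROR	_IO('H', 0x00)

int main(void)
{
	char local[64] = "hello through the mirror";
	char copy[64] = "";
	int fd;

	fd = open("/dev/hmm_dummy_device0", O_RDWR);	/* assumed node name */
	if (fd < 0)
		return 1;

	/* Assumed request: mirror the calling process address space. */
	if (ioctl(fd, HMM_DUMMY_MIRROR, 0) < 0)
		return 1;

	/*
	 * Reads on the device file are assumed to go through the dummy
	 * mirror page table, with the file offset being the address in
	 * the mirrored process.
	 */
	if (pread(fd, copy, sizeof(copy), (off_t)(unsigned long)local) < 0)
		return 1;

	printf("read back: %s\n", copy);
	close(fd);
	return 0;
}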

diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
index 6e9f74a..199e111 100644
--- a/drivers/char/Kconfig
+++ b/drivers/char/Kconfig
@@ -600,5 +600,14 @@ config TILE_SROM
 	  device appear much like a simple EEPROM, and knows
 	  how to partition a single ROM for multiple purposes.
 
+config HMM_DUMMY
+	tristate "hmm dummy driver to test hmm."
+	depends on HMM
+	default n
+	help
+	  Say Y here if you want to build the hmm dummy driver that allows you
+	  to test the hmm infrastructure by mapping a process address space
+	  through the hmm dummy driver device file. When in doubt, say "N".
+
 endmenu
 
diff --git a/drivers/char/Makefile b/drivers/char/Makefile
index a324f93..83d89b8 100644
--- a/drivers/char/Makefile
+++ b/drivers/char/Makefile
@@ -61,3 +61,4 @@ obj-$(CONFIG_JS_RTC)		+= js-rtc.o
 js-rtc-y = rtc.o
 
 obj-$(CONFIG_TILE_SROM)		+= tile-srom.o
+obj-$(CONFIG_HMM_DUMMY)		+= hmm_dummy.o
diff --git a/drivers/char/hmm_dummy.c b/drivers/char/hmm_dummy.c
new file mode 100644
index 0000000..e87dc7c
--- /dev/null
+++ b/drivers/char/hmm_dummy.c
@@ -0,0 +1,1128 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/* This is a dummy driver made to exercise the HMM (heterogeneous memory
+ * management) API of the kernel. It allows a userspace program to map its
+ * whole address space through the hmm dummy driver file.
+ *
+ * In here mirror addresses are addresses in the process address space that
+ * is being mirrored, while virtual addresses are addresses in the current
+ * process that has the hmm dummy dev file mapped (address of the file
+ * mapping).
+ *
+ * You must be careful not to mix one with the other.
+ */
+#include <linux/init.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/hmm.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/major.h>
+#include <linux/cdev.h>
+#include <linux/device.h>
+#include <linux/mutex.h>
+#include <linux/rwsem.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/highmem.h>
+#include <linux/delay.h>
+
+#include <uapi/linux/hmm_dummy.h>
+
+#define HMM_DUMMY_DEVICE_NAME		"hmm_dummy_device"
+#define HMM_DUMMY_DEVICE_MAX_MIRRORS	4
+
+struct hmm_dummy_device;
+
+struct hmm_dummy_mirror {
+	struct file		*filp;
+	struct hmm_dummy_device	*ddevice;
+	struct hmm_mirror	mirror;
+	unsigned		minor;
+	pid_t			pid;
+	struct mm_struct	*mm;
+	unsigned long		*pgdp;
+	struct mutex		mutex;
+	bool			stop;
+};
+
+struct hmm_dummy_device {
+	struct cdev		cdev;
+	struct hmm_device	device;
+	dev_t			dev;
+	int			major;
+	struct mutex		mutex;
+	char			name[32];
+	/* device file mapping tracking (keep track of all vma) */
+	struct hmm_dummy_mirror	*dmirrors[HMM_DUMMY_DEVICE_MAX_MIRRORS];
+	struct address_space	*fmapping[HMM_DUMMY_DEVICE_MAX_MIRRORS];
+};
+
+
+/* We only create 2 devices to show the inter-device rmem sharing/migration
+ * capabilities.
+ */
+static struct hmm_dummy_device ddevices[2];
+
+static void hmm_dummy_device_print(struct hmm_dummy_device *device,
+				   unsigned minor,
+				   const char *format,
+				   ...)
+{
+	va_list args;
+
+	printk(KERN_INFO "[%s:%d] ", device->name, minor);
+	va_start(args, format);
+	vprintk(format, args);
+	va_end(args);
+}
+
+
+/* hmm_dummy_pt - dummy page table, the dummy device fakes its own page table.
+ *
+ * Helper functions to manage the dummy device page table.
+ */
+#define HMM_DUMMY_PTE_VALID_PAGE	(1UL << 0UL)
+#define HMM_DUMMY_PTE_VALID_ZERO	(1UL << 1UL)
+#define HMM_DUMMY_PTE_READ		(1UL << 2UL)
+#define HMM_DUMMY_PTE_WRITE		(1UL << 3UL)
+#define HMM_DUMMY_PTE_DIRTY		(1UL << 4UL)
+#define HMM_DUMMY_PFN_SHIFT		(PAGE_SHIFT)
+
+#define ARCH_PAGE_SIZE			((unsigned long)PAGE_SIZE)
+#define ARCH_PAGE_SHIFT			((unsigned long)PAGE_SHIFT)
+
+#define HMM_DUMMY_PTRS_PER_LEVEL	(ARCH_PAGE_SIZE / sizeof(long))
+#ifdef CONFIG_64BIT
+#define HMM_DUMMY_BITS_PER_LEVEL	(ARCH_PAGE_SHIFT - 3UL)
+#else
+#define HMM_DUMMY_BITS_PER_LEVEL	(ARCH_PAGE_SHIFT - 2UL)
+#endif
+#define HMM_DUMMY_PLD_SHIFT		(ARCH_PAGE_SHIFT)
+#define HMM_DUMMY_PMD_SHIFT		(HMM_DUMMY_PLD_SHIFT + HMM_DUMMY_BITS_PER_LEVEL)
+#define HMM_DUMMY_PUD_SHIFT		(HMM_DUMMY_PMD_SHIFT + HMM_DUMMY_BITS_PER_LEVEL)
+#define HMM_DUMMY_PGD_SHIFT		(HMM_DUMMY_PUD_SHIFT + HMM_DUMMY_BITS_PER_LEVEL)
+#define HMM_DUMMY_PGD_NPTRS		(1UL << HMM_DUMMY_BITS_PER_LEVEL)
+#define HMM_DUMMY_PMD_NPTRS		(1UL << HMM_DUMMY_BITS_PER_LEVEL)
+#define HMM_DUMMY_PUD_NPTRS		(1UL << HMM_DUMMY_BITS_PER_LEVEL)
+#define HMM_DUMMY_PLD_NPTRS		(1UL << HMM_DUMMY_BITS_PER_LEVEL)
+#define HMM_DUMMY_PLD_SIZE		(1UL << (HMM_DUMMY_PLD_SHIFT + HMM_DUMMY_BITS_PER_LEVEL))
+#define HMM_DUMMY_PMD_SIZE		(1UL << (HMM_DUMMY_PMD_SHIFT + HMM_DUMMY_BITS_PER_LEVEL))
+#define HMM_DUMMY_PUD_SIZE		(1UL << (HMM_DUMMY_PUD_SHIFT + HMM_DUMMY_BITS_PER_LEVEL))
+#define HMM_DUMMY_PGD_SIZE		(1UL << (HMM_DUMMY_PGD_SHIFT + HMM_DUMMY_BITS_PER_LEVEL))
+#define HMM_DUMMY_PLD_MASK		(~(HMM_DUMMY_PLD_SIZE - 1UL))
+#define HMM_DUMMY_PMD_MASK		(~(HMM_DUMMY_PMD_SIZE - 1UL))
+#define HMM_DUMMY_PUD_MASK		(~(HMM_DUMMY_PUD_SIZE - 1UL))
+#define HMM_DUMMY_PGD_MASK		(~(HMM_DUMMY_PGD_SIZE - 1UL))
+#define HMM_DUMMY_MAX_ADDR		(1UL << (HMM_DUMMY_PGD_SHIFT + HMM_DUMMY_BITS_PER_LEVEL))
+
+static inline unsigned long hmm_dummy_pld_index(unsigned long addr)
+{
+	return (addr >> HMM_DUMMY_PLD_SHIFT) & (HMM_DUMMY_PLD_NPTRS - 1UL);
+}
+
+static inline unsigned long hmm_dummy_pmd_index(unsigned long addr)
+{
+	return (addr >> HMM_DUMMY_PMD_SHIFT) & (HMM_DUMMY_PMD_NPTRS - 1UL);
+}
+
+static inline unsigned long hmm_dummy_pud_index(unsigned long addr)
+{
+	return (addr >> HMM_DUMMY_PUD_SHIFT) & (HMM_DUMMY_PUD_NPTRS - 1UL);
+}
+
+static inline unsigned long hmm_dummy_pgd_index(unsigned long addr)
+{
+	return (addr >> HMM_DUMMY_PGD_SHIFT) & (HMM_DUMMY_PGD_NPTRS - 1UL);
+}
+
+static inline unsigned long hmm_dummy_pld_base(unsigned long addr)
+{
+	return (addr & HMM_DUMMY_PLD_MASK);
+}
+
+static inline unsigned long hmm_dummy_pmd_base(unsigned long addr)
+{
+	return (addr & HMM_DUMMY_PMD_MASK);
+}
+
+static inline unsigned long hmm_dummy_pud_base(unsigned long addr)
+{
+	return (addr & HMM_DUMMY_PUD_MASK);
+}
+
+static inline unsigned long hmm_dummy_pgd_base(unsigned long addr)
+{
+	return (addr & HMM_DUMMY_PGD_MASK);
+}
+
+static inline unsigned long hmm_dummy_pld_next(unsigned long addr)
+{
+	return (addr & HMM_DUMMY_PLD_MASK) + HMM_DUMMY_PLD_SIZE;
+}
+
+static inline unsigned long hmm_dummy_pmd_next(unsigned long addr)
+{
+	return (addr & HMM_DUMMY_PMD_MASK) + HMM_DUMMY_PMD_SIZE;
+}
+
+static inline unsigned long hmm_dummy_pud_next(unsigned long addr)
+{
+	return (addr & HMM_DUMMY_PUD_MASK) + HMM_DUMMY_PUD_SIZE;
+}
+
+static inline unsigned long hmm_dummy_pgd_next(unsigned long addr)
+{
+	return (addr & HMM_DUMMY_PGD_MASK) + HMM_DUMMY_PGD_SIZE;
+}
+
+static inline struct page *hmm_dummy_pte_to_page(unsigned long pte)
+{
+	if (!(pte & (HMM_DUMMY_PTE_VALID_PAGE | HMM_DUMMY_PTE_VALID_ZERO))) {
+		return NULL;
+	}
+	return pfn_to_page((pte >> HMM_DUMMY_PFN_SHIFT));
+}
+
+struct hmm_dummy_pt_map {
+	struct hmm_dummy_mirror	*dmirror;
+	struct page		*pud_page;
+	struct page		*pmd_page;
+	struct page		*pld_page;
+	unsigned long		pgd_idx;
+	unsigned long		pud_idx;
+	unsigned long		pmd_idx;
+	unsigned long		*pudp;
+	unsigned long		*pmdp;
+	unsigned long		*pldp;
+};
+
+static inline unsigned long *hmm_dummy_pt_pud_map(struct hmm_dummy_pt_map *pt_map,
+						  unsigned long addr)
+{
+	struct hmm_dummy_mirror *dmirror = pt_map->dmirror;
+	unsigned long *pdep;
+
+	if (!dmirror->pgdp) {
+		return NULL;
+	}
+
+	if (!pt_map->pud_page || pt_map->pgd_idx != hmm_dummy_pgd_index(addr)) {
+		if (pt_map->pud_page) {
+			kunmap(pt_map->pud_page);
+			pt_map->pud_page = NULL;
+			pt_map->pudp = NULL;
+		}
+		pt_map->pgd_idx = hmm_dummy_pgd_index(addr);
+		pdep = &dmirror->pgdp[pt_map->pgd_idx];
+		if (!((*pdep) & HMM_DUMMY_PTE_VALID_PAGE)) {
+			return NULL;
+		}
+		pt_map->pud_page = pfn_to_page((*pdep) >> HMM_DUMMY_PFN_SHIFT);
+		pt_map->pudp = kmap(pt_map->pud_page);
+	}
+	return pt_map->pudp;
+}
+
+static inline unsigned long *hmm_dummy_pt_pmd_map(struct hmm_dummy_pt_map *pt_map,
+						  unsigned long addr)
+{
+	unsigned long *pdep;
+
+	if (!hmm_dummy_pt_pud_map(pt_map, addr)) {
+		return NULL;
+	}
+
+	if (!pt_map->pmd_page || pt_map->pud_idx != hmm_dummy_pud_index(addr)) {
+		if (pt_map->pmd_page) {
+			kunmap(pt_map->pmd_page);
+			pt_map->pmd_page = NULL;
+			pt_map->pmdp = NULL;
+		}
+		pt_map->pud_idx = hmm_dummy_pud_index(addr);
+		pdep = &pt_map->pudp[pt_map->pud_idx];
+		if (!((*pdep) & HMM_DUMMY_PTE_VALID_PAGE)) {
+			return NULL;
+		}
+		pt_map->pmd_page = pfn_to_page((*pdep) >> HMM_DUMMY_PFN_SHIFT);
+		pt_map->pmdp = kmap(pt_map->pmd_page);
+	}
+	return pt_map->pmdp;
+}
+
+static inline unsigned long *hmm_dummy_pt_pld_map(struct hmm_dummy_pt_map *pt_map,
+						  unsigned long addr)
+{
+	unsigned long *pdep;
+
+	if (!hmm_dummy_pt_pmd_map(pt_map, addr)) {
+		return NULL;
+	}
+
+	if (!pt_map->pld_page || pt_map->pmd_idx != hmm_dummy_pmd_index(addr)) {
+		if (pt_map->pld_page) {
+			kunmap(pt_map->pld_page);
+			pt_map->pld_page = NULL;
+			pt_map->pldp = NULL;
+		}
+		pt_map->pmd_idx = hmm_dummy_pmd_index(addr);
+		pdep = &pt_map->pmdp[pt_map->pmd_idx];
+		if (!((*pdep) & HMM_DUMMY_PTE_VALID_PAGE)) {
+			return NULL;
+		}
+		pt_map->pld_page = pfn_to_page((*pdep) >> HMM_DUMMY_PFN_SHIFT);
+		pt_map->pldp = kmap(pt_map->pld_page);
+	}
+	return pt_map->pldp;
+}
+
+static inline void hmm_dummy_pt_pld_unmap(struct hmm_dummy_pt_map *pt_map)
+{
+	if (pt_map->pld_page) {
+		kunmap(pt_map->pld_page);
+		pt_map->pld_page = NULL;
+		pt_map->pldp = NULL;
+	}
+}
+
+static inline void hmm_dummy_pt_pmd_unmap(struct hmm_dummy_pt_map *pt_map)
+{
+	hmm_dummy_pt_pld_unmap(pt_map);
+	if (pt_map->pmd_page) {
+		kunmap(pt_map->pmd_page);
+		pt_map->pmd_page = NULL;
+		pt_map->pmdp = NULL;
+	}
+}
+
+static inline void hmm_dummy_pt_pud_unmap(struct hmm_dummy_pt_map *pt_map)
+{
+	hmm_dummy_pt_pmd_unmap(pt_map);
+	if (pt_map->pud_page) {
+		kunmap(pt_map->pud_page);
+		pt_map->pud_page = NULL;
+		pt_map->pudp = NULL;
+	}
+}
+
+static inline void hmm_dummy_pt_unmap(struct hmm_dummy_pt_map *pt_map)
+{
+	hmm_dummy_pt_pud_unmap(pt_map);
+}
+
+static int hmm_dummy_pt_alloc(struct hmm_dummy_mirror *dmirror,
+			      unsigned long faddr,
+			      unsigned long laddr)
+{
+	unsigned long *pgdp, *pudp, *pmdp;
+
+	if (dmirror->stop) {
+		return -EINVAL;
+	}
+
+	if (dmirror->pgdp == NULL) {
+		dmirror->pgdp = kzalloc(PAGE_SIZE, GFP_KERNEL);
+		if (dmirror->pgdp == NULL) {
+			return -ENOMEM;
+		}
+	}
+
+	for (; faddr < laddr; faddr = hmm_dummy_pld_next(faddr)) {
+		struct page *pud_page, *pmd_page;
+
+		pgdp = &dmirror->pgdp[hmm_dummy_pgd_index(faddr)];
+		if (!((*pgdp) & HMM_DUMMY_PTE_VALID_PAGE)) {
+			pud_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+			if (!pud_page) {
+				return -ENOMEM;
+			}
+			*pgdp  = (page_to_pfn(pud_page)<<HMM_DUMMY_PFN_SHIFT);
+			*pgdp |= HMM_DUMMY_PTE_VALID_PAGE;
+		}
+
+		pud_page = pfn_to_page((*pgdp) >> HMM_DUMMY_PFN_SHIFT);
+		pudp = kmap(pud_page);
+		pudp = &pudp[hmm_dummy_pud_index(faddr)];
+		if (!((*pudp) & HMM_DUMMY_PTE_VALID_PAGE)) {
+			pmd_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+			if (!pmd_page) {
+				kunmap(pud_page);
+				return -ENOMEM;
+			}
+			*pudp  = (page_to_pfn(pmd_page)<<HMM_DUMMY_PFN_SHIFT);
+			*pudp |= HMM_DUMMY_PTE_VALID_PAGE;
+		}
+
+		pmd_page = pfn_to_page((*pudp) >> HMM_DUMMY_PFN_SHIFT);
+		pmdp = kmap(pmd_page);
+		pmdp = &pmdp[hmm_dummy_pmd_index(faddr)];
+		if (!((*pmdp) & HMM_DUMMY_PTE_VALID_PAGE)) {
+			struct page *page;
+
+			page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+			if (!page) {
+				kunmap(pmd_page);
+				kunmap(pud_page);
+				return -ENOMEM;
+			}
+			*pmdp  = (page_to_pfn(page) << HMM_DUMMY_PFN_SHIFT);
+			*pmdp |= HMM_DUMMY_PTE_VALID_PAGE;
+		}
+
+		kunmap(pmd_page);
+		kunmap(pud_page);
+	}
+
+	return 0;
+}
+
+static void hmm_dummy_pt_free_pmd(struct hmm_dummy_pt_map *pt_map,
+				  unsigned long faddr,
+				  unsigned long laddr)
+{
+	for (; faddr < laddr; faddr = hmm_dummy_pld_next(faddr)) {
+		unsigned long pfn, *pmdp, next;
+		struct page *page;
+
+		next = min(hmm_dummy_pld_next(faddr), laddr);
+		if (faddr > hmm_dummy_pld_base(faddr) || laddr < next) {
+			continue;
+		}
+		pmdp = hmm_dummy_pt_pmd_map(pt_map, faddr);
+		if (!pmdp) {
+			continue;
+		}
+		if (!(pmdp[hmm_dummy_pmd_index(faddr)] & HMM_DUMMY_PTE_VALID_PAGE)) {
+			continue;
+		}
+		pfn = pmdp[hmm_dummy_pmd_index(faddr)] >> HMM_DUMMY_PFN_SHIFT;
+		page = pfn_to_page(pfn);
+		pmdp[hmm_dummy_pmd_index(faddr)] = 0;
+		__free_page(page);
+	}
+}
+
+static void hmm_dummy_pt_free_pud(struct hmm_dummy_pt_map *pt_map,
+				  unsigned long faddr,
+				  unsigned long laddr)
+{
+	for (; faddr < laddr; faddr = hmm_dummy_pmd_next(faddr)) {
+		unsigned long pfn, *pudp, next;
+		struct page *page;
+
+		next = min(hmm_dummy_pmd_next(faddr), laddr);
+		hmm_dummy_pt_free_pmd(pt_map, faddr, next);
+		hmm_dummy_pt_pmd_unmap(pt_map);
+		if (faddr > hmm_dummy_pmd_base(faddr) || laddr < next) {
+			continue;
+		}
+		pudp = hmm_dummy_pt_pud_map(pt_map, faddr);
+		if (!pudp) {
+			continue;
+		}
+		if (!(pudp[hmm_dummy_pud_index(faddr)] & HMM_DUMMY_PTE_VALID_PAGE)) {
+			continue;
+		}
+		pfn = pudp[hmm_dummy_pud_index(faddr)] >> HMM_DUMMY_PFN_SHIFT;
+		page = pfn_to_page(pfn);
+		pudp[hmm_dummy_pud_index(faddr)] = 0;
+		__free_page(page);
+	}
+}
+
+static void hmm_dummy_pt_free(struct hmm_dummy_mirror *dmirror,
+			      unsigned long faddr,
+			      unsigned long laddr)
+{
+	struct hmm_dummy_pt_map pt_map = {0};
+
+	if (!dmirror->pgdp || (laddr - faddr) < HMM_DUMMY_PLD_SIZE) {
+		return;
+	}
+
+	pt_map.dmirror = dmirror;
+
+	for (; faddr < laddr; faddr = hmm_dummy_pud_next(faddr)) {
+		unsigned long pfn, *pgdp, next;
+		struct page *page;
+
+		next = min(hmm_dummy_pud_next(faddr), laddr);
+		pgdp = dmirror->pgdp;
+		hmm_dummy_pt_free_pud(&pt_map, faddr, next);
+		hmm_dummy_pt_pud_unmap(&pt_map);
+		if (faddr > hmm_dummy_pud_base(faddr) || laddr < next) {
+			continue;
+		}
+		if (!(pgdp[hmm_dummy_pgd_index(faddr)] & HMM_DUMMY_PTE_VALID_PAGE)) {
+			continue;
+		}
+		pfn = pgdp[hmm_dummy_pgd_index(faddr)] >> HMM_DUMMY_PFN_SHIFT;
+		page = pfn_to_page(pfn);
+		pgdp[hmm_dummy_pgd_index(faddr)] = 0;
+		__free_page(page);
+	}
+	hmm_dummy_pt_unmap(&pt_map);
+}
+
+
+/* hmm_ops - hmm callbacks for the hmm dummy driver.
+ *
+ * Below are the various callbacks that the hmm api requires for a device.
+ * The implementation of the dummy device driver is necessarily simpler than
+ * what a real device driver would do. We do not have interrupts nor any kind
+ * of command buffer onto which to schedule memory invalidations and updates.
+ */
+static void hmm_dummy_device_destroy(struct hmm_device *device)
+{
+	/* No-op for the dummy driver. */
+}
+
+static void hmm_dummy_mirror_release(struct hmm_mirror *mirror)
+{
+	struct hmm_dummy_mirror *dmirror;
+
+	dmirror = container_of(mirror, struct hmm_dummy_mirror, mirror);
+	dmirror->stop = true;
+	mutex_lock(&dmirror->mutex);
+	hmm_dummy_pt_free(dmirror, 0, HMM_DUMMY_MAX_ADDR);
+	if (dmirror->pgdp) {
+		kfree(dmirror->pgdp);
+		dmirror->pgdp = NULL;
+	}
+	mutex_unlock(&dmirror->mutex);
+}
+
+static void hmm_dummy_mirror_destroy(struct hmm_mirror *mirror)
+{
+	/* No-op for the dummy driver. */
+	// FIXME print that the pid is no longer mirrored
+}
+
+static int hmm_dummy_fence_wait(struct hmm_fence *fence)
+{
+	/* FIXME use some kind of fake event and delay dirty and dummy page
+	 * clearing to this function.
+	 */
+	return 0;
+}
+
+static struct hmm_fence *hmm_dummy_lmem_update(struct hmm_mirror *mirror,
+					       unsigned long faddr,
+					       unsigned long laddr,
+					       enum hmm_etype etype,
+					       bool dirty)
+{
+	struct hmm_dummy_mirror *dmirror;
+	struct hmm_dummy_pt_map pt_map = {0};
+	unsigned long addr, i, mask, or;
+
+	dmirror = container_of(mirror, struct hmm_dummy_mirror, mirror);
+	pt_map.dmirror = dmirror;
+
+	/* Sanity check for debugging hmm, a real device driver does not have to do that. */
+	switch (etype) {
+	case HMM_UNREGISTER:
+	case HMM_UNMAP:
+	case HMM_MUNMAP:
+	case HMM_MPROT_WONLY:
+	case HMM_MIGRATE_TO_RMEM:
+	case HMM_MIGRATE_TO_LMEM:
+		mask = 0;
+		or = 0;
+		break;
+	case HMM_MPROT_RONLY:
+		mask = ~HMM_DUMMY_PTE_WRITE;
+		or = 0;
+		break;
+	case HMM_MPROT_RANDW:
+		mask = -1L;
+		or = HMM_DUMMY_PTE_WRITE;
+		break;
+	default:
+		printk(KERN_ERR "%4d:%s invalid event type %d\n",
+		       __LINE__, __func__, etype);
+		return ERR_PTR(-EIO);
+	}
+
+	mutex_lock(&dmirror->mutex);
+	for (i = 0, addr = faddr; addr < laddr; ++i, addr += PAGE_SIZE) {
+		unsigned long *pldp;
+
+		pldp = hmm_dummy_pt_pld_map(&pt_map, addr);
+		if (!pldp) {
+			continue;
+		}
+		if (dirty && ((*pldp) & HMM_DUMMY_PTE_DIRTY)) {
+			struct page *page;
+
+			page = hmm_dummy_pte_to_page(*pldp);
+			if (page) {
+				set_page_dirty(page);
+			}
+		}
+		*pldp &= ~HMM_DUMMY_PTE_DIRTY;
+		*pldp &= mask;
+		*pldp |= or;
+		if ((*pldp) & HMM_DUMMY_PTE_VALID_ZERO) {
+			*pldp &= ~HMM_DUMMY_PTE_WRITE;
+		}
+	}
+	hmm_dummy_pt_unmap(&pt_map);
+
+	switch (etype) {
+	case HMM_UNREGISTER:
+	case HMM_MUNMAP:
+		hmm_dummy_pt_free(dmirror, faddr, laddr);
+		break;
+	default:
+		break;
+	}
+	mutex_unlock(&dmirror->mutex);
+	return NULL;
+}
+
+static int hmm_dummy_lmem_fault(struct hmm_mirror *mirror,
+				unsigned long faddr,
+				unsigned long laddr,
+				unsigned long *pfns,
+				struct hmm_fault *fault)
+{
+	struct hmm_dummy_mirror *dmirror;
+	struct hmm_dummy_pt_map pt_map = {0};
+	unsigned long i;
+
+	dmirror = container_of(mirror, struct hmm_dummy_mirror, mirror);
+	pt_map.dmirror = dmirror;
+
+	mutex_lock(&dmirror->mutex);
+	for (i = 0; faddr < laddr; ++i, faddr += PAGE_SIZE) {
+		unsigned long *pldp, pld_idx;
+
+		pldp = hmm_dummy_pt_pld_map(&pt_map, faddr);
+		if (!pldp || !hmm_pfn_to_page(pfns[i])) {
+			continue;
+		}
+		pld_idx = hmm_dummy_pld_index(faddr);
+		pldp[pld_idx]  = ((pfns[i] >> HMM_PFN_SHIFT) << HMM_DUMMY_PFN_SHIFT);
+		pldp[pld_idx] |= test_bit(HMM_PFN_WRITE, &pfns[i]) ? HMM_DUMMY_PTE_WRITE : 0;
+		pldp[pld_idx] |= test_bit(HMM_PFN_VALID_PAGE, &pfns[i]) ?
+			HMM_DUMMY_PTE_VALID_PAGE : HMM_DUMMY_PTE_VALID_ZERO;
+		pldp[pld_idx] |= HMM_DUMMY_PTE_READ;
+	}
+	hmm_dummy_pt_unmap(&pt_map);
+	mutex_unlock(&dmirror->mutex);
+	return 0;
+}
+
+static const struct hmm_device_ops hmm_dummy_ops = {
+	.device_destroy		= &hmm_dummy_device_destroy,
+	.mirror_release		= &hmm_dummy_mirror_release,
+	.mirror_destroy		= &hmm_dummy_mirror_destroy,
+	.fence_wait		= &hmm_dummy_fence_wait,
+	.lmem_update		= &hmm_dummy_lmem_update,
+	.lmem_fault		= &hmm_dummy_lmem_fault,
+};
+
+
+/* hmm_dummy_mmap - hmm dummy device file mmap operations.
+ *
+ * The hmm dummy driver does not allow mmap of its device file. The main
+ * reason is that the kernel lacks the ability to insert pages with specific
+ * custom protections inside a vma.
+ */
+static int hmm_dummy_mmap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	return VM_FAULT_SIGBUS;
+}
+
+static void hmm_dummy_mmap_open(struct vm_area_struct *vma)
+{
+	/* nop */
+}
+
+static void hmm_dummy_mmap_close(struct vm_area_struct *vma)
+{
+	/* nop */
+}
+
+static const struct vm_operations_struct mmap_mem_ops = {
+	.fault			= hmm_dummy_mmap_fault,
+	.open			= hmm_dummy_mmap_open,
+	.close			= hmm_dummy_mmap_close,
+};
+
+
+/* hmm_dummy_fops - hmm dummy device file operations.
+ *
+ * The hmm dummy driver allows reading from and writing to the mirrored
+ * process through the device file. Below are the read, write and other
+ * device file callbacks that implement access to the mirrored address space.
+ */
+static int hmm_dummy_mirror_fault(struct hmm_dummy_mirror *dmirror,
+				  unsigned long addr,
+				  bool write)
+{
+	struct hmm_mirror *mirror = &dmirror->mirror;
+	struct hmm_fault fault;
+	unsigned long faddr, laddr, npages = 4;
+	int ret;
+
+	fault.pfns = kmalloc(npages * sizeof(long), GFP_KERNEL);
+	if (!fault.pfns) {
+		return -ENOMEM;
+	}
+	fault.flags = write ? HMM_FAULT_WRITE : 0;
+
+	/* Showcase the hmm api: fault a small range centered on the address. */
+	fault.faddr = faddr = addr > (npages << 8) ? addr - (npages << 8) : 0;
+	fault.laddr = laddr = faddr + (npages << 10);
+
+	/* Pre-allocate device page table. */
+	mutex_lock(&dmirror->mutex);
+	ret = hmm_dummy_pt_alloc(dmirror, faddr, laddr);
+	mutex_unlock(&dmirror->mutex);
+	if (ret) {
+		goto out;
+	}
+
+	for (; faddr < laddr; faddr = fault.faddr) {
+		ret = hmm_mirror_fault(mirror, &fault);
+		/* Ignore any error that does not concern the fault address. */
+		if (addr >= fault.laddr) {
+			fault.faddr = fault.laddr;
+			fault.laddr = laddr;
+			continue;
+		}
+		if (addr < fault.faddr) {
+			/* The address was faulted successfully; ignore errors
+			 * for addresses above the one we were interested in.
+			 */
+			ret = 0;
+		}
+		goto out;
+	}
+
+out:
+	kfree(fault.pfns);
+	return ret;
+}
+
+static ssize_t hmm_dummy_fops_read(struct file *filp,
+				   char __user *buf,
+				   size_t count,
+				   loff_t *ppos)
+{
+	struct hmm_dummy_device *ddevice;
+	struct hmm_dummy_mirror *dmirror;
+	struct hmm_dummy_pt_map pt_map = {0};
+	unsigned long faddr, laddr, offset;
+	unsigned minor;
+	ssize_t retval = 0;
+	void *tmp;
+	long r;
+
+	tmp = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!tmp) {
+		return -ENOMEM;
+	}
+
+	/* Check if we are mirroring anything */
+	minor = iminor(file_inode(filp));
+	ddevice = filp->private_data;
+	mutex_lock(&ddevice->mutex);
+	if (ddevice->dmirrors[minor] == NULL) {
+		mutex_unlock(&ddevice->mutex);
+		kfree(tmp);
+		return 0;
+	}
+	dmirror = ddevice->dmirrors[minor];
+	mutex_unlock(&ddevice->mutex);
+	if (dmirror->stop) {
+		kfree(tmp);
+		return 0;
+	}
+
+	/* The range of addresses to look up. */
+	faddr = (*ppos) & PAGE_MASK;
+	offset = (*ppos) - faddr;
+	laddr = PAGE_ALIGN(faddr + count);
+	BUG_ON(faddr == laddr);
+	pt_map.dmirror = dmirror;
+
+	for (; count; faddr += PAGE_SIZE, offset = 0) {
+		unsigned long *pldp, pld_idx;
+		unsigned long size = min(PAGE_SIZE - offset, count);
+		struct page *page;
+		char *ptr;
+
+		mutex_lock(&dmirror->mutex);
+		pldp = hmm_dummy_pt_pld_map(&pt_map, faddr);
+		pld_idx = hmm_dummy_pld_index(faddr);
+		if (!pldp || !(pldp[pld_idx] & (HMM_DUMMY_PTE_VALID_PAGE | HMM_DUMMY_PTE_VALID_ZERO))) {
+			hmm_dummy_pt_unmap(&pt_map);
+			mutex_unlock(&dmirror->mutex);
+			goto fault;
+		}
+		page = hmm_dummy_pte_to_page(pldp[pld_idx]);
+		if (!page) {
+			mutex_unlock(&dmirror->mutex);
+			BUG();
+			kfree(tmp);
+			return -EFAULT;
+		}
+		ptr = kmap(page);
+		memcpy(tmp, ptr + offset, size);
+		kunmap(page);
+		hmm_dummy_pt_unmap(&pt_map);
+		mutex_unlock(&dmirror->mutex);
+
+		r = copy_to_user(buf, tmp, size);
+		if (r) {
+			kfree(tmp);
+			return -EFAULT;
+		}
+		retval += size;
+		*ppos += size;
+		count -= size;
+		buf += size;
+	}
+
+	kfree(tmp);
+	return retval;
+
+fault:
+	kfree(tmp);
+	r = hmm_dummy_mirror_fault(dmirror, faddr, false);
+	if (r) {
+		return r;
+	}
+
+	/* Force userspace to retry read if nothing was read. */
+	return retval ? retval : -EINTR;
+}
+
+static ssize_t hmm_dummy_fops_write(struct file *filp,
+				    const char __user *buf,
+				    size_t count,
+				    loff_t *ppos)
+{
+	struct hmm_dummy_device *ddevice;
+	struct hmm_dummy_mirror *dmirror;
+	struct hmm_dummy_pt_map pt_map = {0};
+	unsigned long faddr, laddr, offset;
+	unsigned minor;
+	ssize_t retval = 0;
+	void *tmp;
+	long r;
+
+	tmp = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!tmp) {
+		return -ENOMEM;
+	}
+
+	/* Check if we are mirroring anything */
+	minor = iminor(file_inode(filp));
+	ddevice = filp->private_data;
+	mutex_lock(&ddevice->mutex);
+	if (ddevice->dmirrors[minor] == NULL) {
+		mutex_unlock(&ddevice->mutex);
+		kfree(tmp);
+		return 0;
+	}
+	dmirror = ddevice->dmirrors[minor];
+	mutex_unlock(&ddevice->mutex);
+	if (dmirror->stop) {
+		kfree(tmp);
+		return 0;
+	}
+
+	/* The range of addresses to look up. */
+	faddr = (*ppos) & PAGE_MASK;
+	offset = (*ppos) - faddr;
+	laddr = PAGE_ALIGN(faddr + count);
+	BUG_ON(faddr == laddr);
+	pt_map.dmirror = dmirror;
+
+	for (; count; faddr += PAGE_SIZE, offset = 0) {
+		unsigned long *pldp, pld_idx;
+		unsigned long size = min(PAGE_SIZE - offset, count);
+		struct page *page;
+		char *ptr;
+
+		r = copy_from_user(tmp, buf, size);
+		if (r) {
+			kfree(tmp);
+			return -EFAULT;
+		}
+
+		mutex_lock(&dmirror->mutex);
+
+		pldp = hmm_dummy_pt_pld_map(&pt_map, faddr);
+		pld_idx = hmm_dummy_pld_index(faddr);
+		if (!pldp || !(pldp[pld_idx] & HMM_DUMMY_PTE_VALID_PAGE)) {
+			hmm_dummy_pt_unmap(&pt_map);
+			mutex_unlock(&dmirror->mutex);
+			goto fault;
+		}
+		if (!(pldp[pld_idx] & HMM_DUMMY_PTE_WRITE)) {
+			hmm_dummy_pt_unmap(&pt_map);
+			mutex_unlock(&dmirror->mutex);
+			goto fault;
+		}
+		pldp[pld_idx] |= HMM_DUMMY_PTE_DIRTY;
+		page = hmm_dummy_pte_to_page(pldp[pld_idx]);
+		if (!page) {
+			mutex_unlock(&dmirror->mutex);
+			BUG();
+			kfree(tmp);
+			return -EFAULT;
+		}
+		ptr = kmap(page);
+		memcpy(ptr + offset, tmp, size);
+		kunmap(page);
+		hmm_dummy_pt_unmap(&pt_map);
+		mutex_unlock(&dmirror->mutex);
+
+		retval += size;
+		*ppos += size;
+		count -= size;
+		buf += size;
+	}
+
+	kfree(tmp);
+	return retval;
+
+fault:
+	kfree(tmp);
+	r = hmm_dummy_mirror_fault(dmirror, faddr, true);
+	if (r) {
+		return r;
+	}
+
+	/* Force userspace to retry write if nothing was written. */
+	return retval ? retval : -EINTR;
+}
+
+static int hmm_dummy_fops_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+	return -EINVAL;
+}
+
+static int hmm_dummy_fops_open(struct inode *inode, struct file *filp)
+{
+	struct hmm_dummy_device *ddevice;
+	struct cdev *cdev = inode->i_cdev;
+	const int minor = iminor(inode);
+
+	/* No exclusive opens */
+	if (filp->f_flags & O_EXCL) {
+		return -EINVAL;
+	}
+
+	ddevice = container_of(cdev, struct hmm_dummy_device, cdev);
+	filp->private_data = ddevice;
+	ddevice->fmapping[minor] = &inode->i_data;
+
+	return 0;
+}
+
+static int hmm_dummy_fops_release(struct inode *inode,
+				  struct file *filp)
+{
+	struct hmm_dummy_device *ddevice;
+	struct hmm_dummy_mirror *dmirror;
+	struct cdev *cdev = inode->i_cdev;
+	const int minor = iminor(inode);
+
+	ddevice = container_of(cdev, struct hmm_dummy_device, cdev);
+	dmirror = ddevice->dmirrors[minor];
+	if (dmirror && dmirror->filp == filp) {
+		if (!dmirror->stop) {
+			hmm_mirror_unregister(&dmirror->mirror);
+		}
+		ddevice->dmirrors[minor] = NULL;
+		kfree(dmirror);
+	}
+
+	return 0;
+}
+
+static long hmm_dummy_fops_unlocked_ioctl(struct file *filp,
+					  unsigned int command,
+					  unsigned long arg)
+{
+	struct hmm_dummy_device *ddevice;
+	struct hmm_dummy_mirror *dmirror;
+	unsigned minor;
+	int ret;
+
+	minor = iminor(file_inode(filp));
+	ddevice = filp->private_data;
+	switch (command) {
+	case HMM_DUMMY_EXPOSE_MM:
+		mutex_lock(&ddevice->mutex);
+		dmirror = ddevice->dmirrors[minor];
+		if (dmirror) {
+			mutex_unlock(&ddevice->mutex);
+			return -EBUSY;
+		}
+		/* Mirror this process address space */
+		dmirror = kzalloc(sizeof(*dmirror), GFP_KERNEL);
+		if (dmirror == NULL) {
+			mutex_unlock(&ddevice->mutex);
+			return -ENOMEM;
+		}
+		dmirror->mm = NULL;
+		dmirror->stop = false;
+		dmirror->pid = task_pid_nr(current);
+		dmirror->ddevice = ddevice;
+		dmirror->minor = minor;
+		dmirror->filp = filp;
+		dmirror->pgdp = NULL;
+		mutex_init(&dmirror->mutex);
+		ddevice->dmirrors[minor] = dmirror;
+		mutex_unlock(&ddevice->mutex);
+
+		ret = hmm_mirror_register(&dmirror->mirror,
+					  &ddevice->device,
+					  current->mm);
+		if (ret) {
+			mutex_lock(&ddevice->mutex);
+			ddevice->dmirrors[minor] = NULL;
+			mutex_unlock(&ddevice->mutex);
+			kfree(dmirror);
+			return ret;
+		}
+		/* Success. */
+		hmm_dummy_device_print(ddevice, dmirror->minor,
+				       "mirroring address space of %d\n",
+				       dmirror->pid);
+		return 0;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static const struct file_operations hmm_dummy_fops = {
+	.read		= hmm_dummy_fops_read,
+	.write		= hmm_dummy_fops_write,
+	.mmap		= hmm_dummy_fops_mmap,
+	.open		= hmm_dummy_fops_open,
+	.release	= hmm_dummy_fops_release,
+	.unlocked_ioctl = hmm_dummy_fops_unlocked_ioctl,
+	.llseek		= default_llseek,
+	.owner		= THIS_MODULE,
+};
+
+
+/*
+ * char device driver
+ */
+static int hmm_dummy_device_init(struct hmm_dummy_device *ddevice)
+{
+	int ret, i;
+
+	ret = alloc_chrdev_region(&ddevice->dev, 0,
+				  HMM_DUMMY_DEVICE_MAX_MIRRORS,
+				  ddevice->name);
+	if (ret < 0) {
+		printk(KERN_ERR "alloc_chrdev_region() failed for hmm_dummy\n");
+		goto error;
+	}
+	ddevice->major = MAJOR(ddevice->dev);
+
+	cdev_init(&ddevice->cdev, &hmm_dummy_fops);
+	ret = cdev_add(&ddevice->cdev, ddevice->dev, HMM_DUMMY_DEVICE_MAX_MIRRORS);
+	if (ret) {
+		unregister_chrdev_region(ddevice->dev, HMM_DUMMY_DEVICE_MAX_MIRRORS);
+		goto error;
+	}
+
+	/* Register the hmm device. */
+	for (i = 0; i < HMM_DUMMY_DEVICE_MAX_MIRRORS; i++) {
+		ddevice->dmirrors[i] = NULL;
+	}
+	mutex_init(&ddevice->mutex);
+	ddevice->device.ops = &hmm_dummy_ops;
+
+	ret = hmm_device_register(&ddevice->device, ddevice->name);
+	if (ret) {
+		cdev_del(&ddevice->cdev);
+		unregister_chrdev_region(ddevice->dev, HMM_DUMMY_DEVICE_MAX_MIRRORS);
+		goto error;
+	}
+
+	return 0;
+
+error:
+	return ret;
+}
+
+static void hmm_dummy_device_fini(struct hmm_dummy_device *ddevice)
+{
+	unsigned i;
+
+	/* First finish hmm. */
+	for (i = 0; i < HMM_DUMMY_DEVICE_MAX_MIRRORS; i++) {
+		struct hmm_dummy_mirror *dmirror;
+
+		dmirror = ddevice->dmirrors[i];
+		if (!dmirror) {
+			continue;
+		}
+		hmm_mirror_unregister(&dmirror->mirror);
+		kfree(dmirror);
+	}
+	hmm_device_unref(&ddevice->device);
+
+	cdev_del(&ddevice->cdev);
+	unregister_chrdev_region(ddevice->dev,
+				 HMM_DUMMY_DEVICE_MAX_MIRRORS);
+}
+
+static int __init hmm_dummy_init(void)
+{
+	int ret;
+
+	snprintf(ddevices[0].name, sizeof(ddevices[0].name),
+		 "%s%d", HMM_DUMMY_DEVICE_NAME, 0);
+	ret = hmm_dummy_device_init(&ddevices[0]);
+	if (ret) {
+		return ret;
+	}
+
+	snprintf(ddevices[1].name, sizeof(ddevices[1].name),
+		 "%s%d", HMM_DUMMY_DEVICE_NAME, 1);
+	ret = hmm_dummy_device_init(&ddevices[1]);
+	if (ret) {
+		hmm_dummy_device_fini(&ddevices[0]);
+		return ret;
+	}
+
+	printk(KERN_INFO "hmm_dummy loaded THIS IS A DANGEROUS MODULE !!!\n");
+	return 0;
+}
+
+static void __exit hmm_dummy_exit(void)
+{
+	hmm_dummy_device_fini(&ddevices[1]);
+	hmm_dummy_device_fini(&ddevices[0]);
+}
+
+module_init(hmm_dummy_init);
+module_exit(hmm_dummy_exit);
+MODULE_LICENSE("GPL");
diff --git a/include/uapi/linux/hmm_dummy.h b/include/uapi/linux/hmm_dummy.h
new file mode 100644
index 0000000..16ae0d3
--- /dev/null
+++ b/include/uapi/linux/hmm_dummy.h
@@ -0,0 +1,34 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/* This is a dummy driver made to exercise the HMM (heterogeneous memory
+ * management) API of the kernel. It allows a userspace program to expose its
+ * whole address space through the hmm dummy driver file.
+ */
+#ifndef _UAPI_LINUX_HMM_DUMMY_H
+#define _UAPI_LINUX_HMM_DUMMY_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+#include <linux/irqnr.h>
+
+/* Expose the address space of the calling process through hmm dummy dev file */
+#define HMM_DUMMY_EXPOSE_MM	_IO('R', 0x00)
+
+#endif /* _UAPI_LINUX_HMM_DUMMY_H */
-- 
1.9.0
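
For reference, here is a minimal userspace sketch of how the patch above is
meant to be exercised: open the dummy device, ask it to mirror the calling
process with HMM_DUMMY_EXPOSE_MM, then access the mirror through the device
file, where the read/write offset is an address in the mirrored address
space. This is only a sketch under assumptions: the device node path is an
assumption (the module only allocates a chrdev region named
"hmm_dummy_device0", so the node has to be created by hand from the major
number listed in /proc/devices), and it assumes the uapi header is installed
as <linux/hmm_dummy.h>. Error handling is kept to the bare minimum.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/hmm_dummy.h>	/* HMM_DUMMY_EXPOSE_MM */

int main(void)
{
	/* A buffer in our own address space that we read back through the
	 * mirror exposed by the dummy driver.
	 */
	char pattern[64] = "hello through the hmm dummy mirror";
	char readback[64];
	ssize_t r;
	int fd;

	/* Assumed device node, created by hand for minor 0. */
	fd = open("/dev/hmm_dummy_device0", O_RDWR);
	if (fd < 0) {
		perror("open");
		return EXIT_FAILURE;
	}

	/* Ask the dummy driver to mirror this process address space. */
	if (ioctl(fd, HMM_DUMMY_EXPOSE_MM) < 0) {
		perror("ioctl(HMM_DUMMY_EXPOSE_MM)");
		return EXIT_FAILURE;
	}

	/* The file offset is interpreted as an address in the mirrored
	 * address space. The driver returns -EINTR when it first had to
	 * fault the range into its dummy page table, so retry on EINTR.
	 */
	do {
		r = pread(fd, readback, sizeof(readback),
			  (off_t)(uintptr_t)pattern);
	} while (r < 0 && errno == EINTR);

	if (r != (ssize_t)sizeof(readback) ||
	    memcmp(pattern, readback, sizeof(pattern))) {
		fprintf(stderr, "mirror read mismatch (r=%zd)\n", r);
		return EXIT_FAILURE;
	}

	printf("mirrored read ok\n");
	close(fd);
	return EXIT_SUCCESS;
}

Writes go the other way: pwrite() at a mirrored address is serviced by
hmm_dummy_fops_write() above, which also marks the touched dummy page table
entries dirty.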


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 10/11] hmm/dummy: dummy driver to showcase the hmm api.
@ 2014-05-02 13:52   ` j.glisse
  0 siblings, 0 replies; 107+ messages in thread
From: j.glisse @ 2014-05-02 13:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel; +Cc: Jérôme Glisse

From: Jérôme Glisse <jglisse@redhat.com>

This is a dummy driver which fulfills two purposes:
  - showcase the hmm api and give a reference on how to use it.
  - provide an extensive user space api to stress test hmm.

This is a particularly dangerous module as it allows access to a
mirror of a process address space through its device file. Hence
it should not be enabled by default, and only people actively
developing for hmm should use it.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 drivers/char/Kconfig           |    9 +
 drivers/char/Makefile          |    1 +
 drivers/char/hmm_dummy.c       | 1128 ++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/hmm_dummy.h |   34 ++
 4 files changed, 1172 insertions(+)
 create mode 100644 drivers/char/hmm_dummy.c
 create mode 100644 include/uapi/linux/hmm_dummy.h

diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
index 6e9f74a..199e111 100644
--- a/drivers/char/Kconfig
+++ b/drivers/char/Kconfig
@@ -600,5 +600,14 @@ config TILE_SROM
 	  device appear much like a simple EEPROM, and knows
 	  how to partition a single ROM for multiple purposes.
 
+config HMM_DUMMY
+	tristate "hmm dummy driver to test hmm."
+	depends on HMM
+	default n
+	help
+	  Say Y here if you want to build the hmm dummy driver that allows you
+	  to test the hmm infrastructure by exposing a process address space
+	  through the hmm dummy driver device file. When in doubt, say "N".
+
 endmenu
 
diff --git a/drivers/char/Makefile b/drivers/char/Makefile
index a324f93..83d89b8 100644
--- a/drivers/char/Makefile
+++ b/drivers/char/Makefile
@@ -61,3 +61,4 @@ obj-$(CONFIG_JS_RTC)		+= js-rtc.o
 js-rtc-y = rtc.o
 
 obj-$(CONFIG_TILE_SROM)		+= tile-srom.o
+obj-$(CONFIG_HMM_DUMMY)		+= hmm_dummy.o
diff --git a/drivers/char/hmm_dummy.c b/drivers/char/hmm_dummy.c
new file mode 100644
index 0000000..e87dc7c
--- /dev/null
+++ b/drivers/char/hmm_dummy.c
@@ -0,0 +1,1128 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/* This is a dummy driver made to exercise the HMM (heterogeneous memory
+ * management) API of the kernel. It allows a userspace program to expose its
+ * whole address space through the hmm dummy driver file.
+ *
+ * In here, mirror addresses are addresses in the process address space that
+ * is being mirrored, while virtual addresses are addresses in the current
+ * process that has the hmm dummy dev file mapped (addresses of the file
+ * mapping). For instance, reading the dev file at offset X returns the bytes
+ * at mirror address X in the mirrored process.
+ *
+ * You must be careful not to mix one with the other.
+ */
+#include <linux/init.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/hmm.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/major.h>
+#include <linux/cdev.h>
+#include <linux/device.h>
+#include <linux/mutex.h>
+#include <linux/rwsem.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/highmem.h>
+#include <linux/delay.h>
+
+#include <uapi/linux/hmm_dummy.h>
+
+#define HMM_DUMMY_DEVICE_NAME		"hmm_dummy_device"
+#define HMM_DUMMY_DEVICE_MAX_MIRRORS	4
+
+struct hmm_dummy_device;
+
+struct hmm_dummy_mirror {
+	struct file		*filp;
+	struct hmm_dummy_device	*ddevice;
+	struct hmm_mirror	mirror;
+	unsigned		minor;
+	pid_t			pid;
+	struct mm_struct	*mm;
+	unsigned long		*pgdp;
+	struct mutex		mutex;
+	bool			stop;
+};
+
+struct hmm_dummy_device {
+	struct cdev		cdev;
+	struct hmm_device	device;
+	dev_t			dev;
+	int			major;
+	struct mutex		mutex;
+	char			name[32];
+	/* device file mapping tracking (keep track of all vma) */
+	struct hmm_dummy_mirror	*dmirrors[HMM_DUMMY_DEVICE_MAX_MIRRORS];
+	struct address_space	*fmapping[HMM_DUMMY_DEVICE_MAX_MIRRORS];
+};
+
+
+/* We only create 2 devices to show the inter-device rmem sharing/migration
+ * capabilities.
+ */
+static struct hmm_dummy_device ddevices[2];
+
+static void hmm_dummy_device_print(struct hmm_dummy_device *device,
+				   unsigned minor,
+				   const char *format,
+				   ...)
+{
+	va_list args;
+
+	printk(KERN_INFO "[%s:%d] ", device->name, minor);
+	va_start(args, format);
+	vprintk(format, args);
+	va_end(args);
+}
+
+
+/* hmm_dummy_pt - dummy page table, the dummy device fakes its own page table.
+ *
+ * Helper functions to manage the dummy device page table.
+ */
+#define HMM_DUMMY_PTE_VALID_PAGE	(1UL << 0UL)
+#define HMM_DUMMY_PTE_VALID_ZERO	(1UL << 1UL)
+#define HMM_DUMMY_PTE_READ		(1UL << 2UL)
+#define HMM_DUMMY_PTE_WRITE		(1UL << 3UL)
+#define HMM_DUMMY_PTE_DIRTY		(1UL << 4UL)
+#define HMM_DUMMY_PFN_SHIFT		(PAGE_SHIFT)
+
+#define ARCH_PAGE_SIZE			((unsigned long)PAGE_SIZE)
+#define ARCH_PAGE_SHIFT			((unsigned long)PAGE_SHIFT)
+
+#define HMM_DUMMY_PTRS_PER_LEVEL	(ARCH_PAGE_SIZE / sizeof(long))
+#ifdef CONFIG_64BIT
+#define HMM_DUMMY_BITS_PER_LEVEL	(ARCH_PAGE_SHIFT - 3UL)
+#else
+#define HMM_DUMMY_BITS_PER_LEVEL	(ARCH_PAGE_SHIFT - 2UL)
+#endif
+#define HMM_DUMMY_PLD_SHIFT		(ARCH_PAGE_SHIFT)
+#define HMM_DUMMY_PMD_SHIFT		(HMM_DUMMY_PLD_SHIFT + HMM_DUMMY_BITS_PER_LEVEL)
+#define HMM_DUMMY_PUD_SHIFT		(HMM_DUMMY_PMD_SHIFT + HMM_DUMMY_BITS_PER_LEVEL)
+#define HMM_DUMMY_PGD_SHIFT		(HMM_DUMMY_PUD_SHIFT + HMM_DUMMY_BITS_PER_LEVEL)
+#define HMM_DUMMY_PGD_NPTRS		(1UL << HMM_DUMMY_BITS_PER_LEVEL)
+#define HMM_DUMMY_PMD_NPTRS		(1UL << HMM_DUMMY_BITS_PER_LEVEL)
+#define HMM_DUMMY_PUD_NPTRS		(1UL << HMM_DUMMY_BITS_PER_LEVEL)
+#define HMM_DUMMY_PLD_NPTRS		(1UL << HMM_DUMMY_BITS_PER_LEVEL)
+#define HMM_DUMMY_PLD_SIZE		(1UL << (HMM_DUMMY_PLD_SHIFT + HMM_DUMMY_BITS_PER_LEVEL))
+#define HMM_DUMMY_PMD_SIZE		(1UL << (HMM_DUMMY_PMD_SHIFT + HMM_DUMMY_BITS_PER_LEVEL))
+#define HMM_DUMMY_PUD_SIZE		(1UL << (HMM_DUMMY_PUD_SHIFT + HMM_DUMMY_BITS_PER_LEVEL))
+#define HMM_DUMMY_PGD_SIZE		(1UL << (HMM_DUMMY_PGD_SHIFT + HMM_DUMMY_BITS_PER_LEVEL))
+#define HMM_DUMMY_PLD_MASK		(~(HMM_DUMMY_PLD_SIZE - 1UL))
+#define HMM_DUMMY_PMD_MASK		(~(HMM_DUMMY_PMD_SIZE - 1UL))
+#define HMM_DUMMY_PUD_MASK		(~(HMM_DUMMY_PUD_SIZE - 1UL))
+#define HMM_DUMMY_PGD_MASK		(~(HMM_DUMMY_PGD_SIZE - 1UL))
+#define HMM_DUMMY_MAX_ADDR		(1UL << (HMM_DUMMY_PGD_SHIFT + HMM_DUMMY_BITS_PER_LEVEL))
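+
+/* For example, on a 64-bit arch with 4KB pages (an assumption, the layout is
+ * entirely derived from ARCH_PAGE_SHIFT), HMM_DUMMY_BITS_PER_LEVEL is 9 and
+ * thus PLD_SHIFT = 12, PMD_SHIFT = 21, PUD_SHIFT = 30, PGD_SHIFT = 39 and
+ * HMM_DUMMY_MAX_ADDR = 1UL << 48, i.e. a 4 level, 48-bit dummy page table.
+ */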
+
+static inline unsigned long hmm_dummy_pld_index(unsigned long addr)
+{
+	return (addr >> HMM_DUMMY_PLD_SHIFT) & (HMM_DUMMY_PLD_NPTRS - 1UL);
+}
+
+static inline unsigned long hmm_dummy_pmd_index(unsigned long addr)
+{
+	return (addr >> HMM_DUMMY_PMD_SHIFT) & (HMM_DUMMY_PMD_NPTRS - 1UL);
+}
+
+static inline unsigned long hmm_dummy_pud_index(unsigned long addr)
+{
+	return (addr >> HMM_DUMMY_PUD_SHIFT) & (HMM_DUMMY_PUD_NPTRS - 1UL);
+}
+
+static inline unsigned long hmm_dummy_pgd_index(unsigned long addr)
+{
+	return (addr >> HMM_DUMMY_PGD_SHIFT) & (HMM_DUMMY_PGD_NPTRS - 1UL);
+}
+
+static inline unsigned long hmm_dummy_pld_base(unsigned long addr)
+{
+	return (addr & HMM_DUMMY_PLD_MASK);
+}
+
+static inline unsigned long hmm_dummy_pmd_base(unsigned long addr)
+{
+	return (addr & HMM_DUMMY_PMD_MASK);
+}
+
+static inline unsigned long hmm_dummy_pud_base(unsigned long addr)
+{
+	return (addr & HMM_DUMMY_PUD_MASK);
+}
+
+static inline unsigned long hmm_dummy_pgd_base(unsigned long addr)
+{
+	return (addr & HMM_DUMMY_PGD_MASK);
+}
+
+static inline unsigned long hmm_dummy_pld_next(unsigned long addr)
+{
+	return (addr & HMM_DUMMY_PLD_MASK) + HMM_DUMMY_PLD_SIZE;
+}
+
+static inline unsigned long hmm_dummy_pmd_next(unsigned long addr)
+{
+	return (addr & HMM_DUMMY_PMD_MASK) + HMM_DUMMY_PMD_SIZE;
+}
+
+static inline unsigned long hmm_dummy_pud_next(unsigned long addr)
+{
+	return (addr & HMM_DUMMY_PUD_MASK) + HMM_DUMMY_PUD_SIZE;
+}
+
+static inline unsigned long hmm_dummy_pgd_next(unsigned long addr)
+{
+	return (addr & HMM_DUMMY_PGD_MASK) + HMM_DUMMY_PGD_SIZE;
+}
+
+static inline struct page *hmm_dummy_pte_to_page(unsigned long pte)
+{
+	if (!(pte & (HMM_DUMMY_PTE_VALID_PAGE | HMM_DUMMY_PTE_VALID_ZERO))) {
+		return NULL;
+	}
+	return pfn_to_page((pte >> HMM_DUMMY_PFN_SHIFT));
+}
+
+struct hmm_dummy_pt_map {
+	struct hmm_dummy_mirror	*dmirror;
+	struct page		*pud_page;
+	struct page		*pmd_page;
+	struct page		*pld_page;
+	unsigned long		pgd_idx;
+	unsigned long		pud_idx;
+	unsigned long		pmd_idx;
+	unsigned long		*pudp;
+	unsigned long		*pmdp;
+	unsigned long		*pldp;
+};
+
+static inline unsigned long *hmm_dummy_pt_pud_map(struct hmm_dummy_pt_map *pt_map,
+						  unsigned long addr)
+{
+	struct hmm_dummy_mirror *dmirror = pt_map->dmirror;
+	unsigned long *pdep;
+
+	if (!dmirror->pgdp) {
+		return NULL;
+	}
+
+	if (!pt_map->pud_page || pt_map->pgd_idx != hmm_dummy_pgd_index(addr)) {
+		if (pt_map->pud_page) {
+			kunmap(pt_map->pud_page);
+			pt_map->pud_page = NULL;
+			pt_map->pudp = NULL;
+		}
+		pt_map->pgd_idx = hmm_dummy_pgd_index(addr);
+		pdep = &dmirror->pgdp[pt_map->pgd_idx];
+		if (!((*pdep) & HMM_DUMMY_PTE_VALID_PAGE)) {
+			return NULL;
+		}
+		pt_map->pud_page = pfn_to_page((*pdep) >> HMM_DUMMY_PFN_SHIFT);
+		pt_map->pudp = kmap(pt_map->pud_page);
+	}
+	return pt_map->pudp;
+}
+
+static inline unsigned long *hmm_dummy_pt_pmd_map(struct hmm_dummy_pt_map *pt_map,
+						  unsigned long addr)
+{
+	unsigned long *pdep;
+
+	if (!hmm_dummy_pt_pud_map(pt_map, addr)) {
+		return NULL;
+	}
+
+	if (!pt_map->pmd_page || pt_map->pud_idx != hmm_dummy_pud_index(addr)) {
+		if (pt_map->pmd_page) {
+			kunmap(pt_map->pmd_page);
+			pt_map->pmd_page = NULL;
+			pt_map->pmdp = NULL;
+		}
+		pt_map->pud_idx = hmm_dummy_pud_index(addr);
+		pdep = &pt_map->pudp[pt_map->pud_idx];
+		if (!((*pdep) & HMM_DUMMY_PTE_VALID_PAGE)) {
+			return NULL;
+		}
+		pt_map->pmd_page = pfn_to_page((*pdep) >> HMM_DUMMY_PFN_SHIFT);
+		pt_map->pmdp = kmap(pt_map->pmd_page);
+	}
+	return pt_map->pmdp;
+}
+
+static inline unsigned long *hmm_dummy_pt_pld_map(struct hmm_dummy_pt_map *pt_map,
+						  unsigned long addr)
+{
+	unsigned long *pdep;
+
+	if (!hmm_dummy_pt_pmd_map(pt_map, addr)) {
+		return NULL;
+	}
+
+	if (!pt_map->pld_page || pt_map->pmd_idx != hmm_dummy_pmd_index(addr)) {
+		if (pt_map->pld_page) {
+			kunmap(pt_map->pld_page);
+			pt_map->pld_page = NULL;
+			pt_map->pldp = NULL;
+		}
+		pt_map->pmd_idx = hmm_dummy_pmd_index(addr);
+		pdep = &pt_map->pmdp[pt_map->pmd_idx];
+		if (!((*pdep) & HMM_DUMMY_PTE_VALID_PAGE)) {
+			return NULL;
+		}
+		pt_map->pld_page = pfn_to_page((*pdep) >> HMM_DUMMY_PFN_SHIFT);
+		pt_map->pldp = kmap(pt_map->pld_page);
+	}
+	return pt_map->pldp;
+}
+
+static inline void hmm_dummy_pt_pld_unmap(struct hmm_dummy_pt_map *pt_map)
+{
+	if (pt_map->pld_page) {
+		kunmap(pt_map->pld_page);
+		pt_map->pld_page = NULL;
+		pt_map->pldp = NULL;
+	}
+}
+
+static inline void hmm_dummy_pt_pmd_unmap(struct hmm_dummy_pt_map *pt_map)
+{
+	hmm_dummy_pt_pld_unmap(pt_map);
+	if (pt_map->pmd_page) {
+		kunmap(pt_map->pmd_page);
+		pt_map->pmd_page = NULL;
+		pt_map->pmdp = NULL;
+	}
+}
+
+static inline void hmm_dummy_pt_pud_unmap(struct hmm_dummy_pt_map *pt_map)
+{
+	hmm_dummy_pt_pmd_unmap(pt_map);
+	if (pt_map->pud_page) {
+		kunmap(pt_map->pud_page);
+		pt_map->pud_page = NULL;
+		pt_map->pudp = NULL;
+	}
+}
+
+static inline void hmm_dummy_pt_unmap(struct hmm_dummy_pt_map *pt_map)
+{
+	hmm_dummy_pt_pud_unmap(pt_map);
+}
+
+static int hmm_dummy_pt_alloc(struct hmm_dummy_mirror *dmirror,
+			      unsigned long faddr,
+			      unsigned long laddr)
+{
+	unsigned long *pgdp, *pudp, *pmdp;
+
+	if (dmirror->stop) {
+		return -EINVAL;
+	}
+
+	if (dmirror->pgdp == NULL) {
+		dmirror->pgdp = kzalloc(PAGE_SIZE, GFP_KERNEL);
+		if (dmirror->pgdp == NULL) {
+			return -ENOMEM;
+		}
+	}
+
+	for (; faddr < laddr; faddr = hmm_dummy_pld_next(faddr)) {
+		struct page *pud_page, *pmd_page;
+
+		pgdp = &dmirror->pgdp[hmm_dummy_pgd_index(faddr)];
+		if (!((*pgdp) & HMM_DUMMY_PTE_VALID_PAGE)) {
+			pud_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+			if (!pud_page) {
+				return -ENOMEM;
+			}
+			*pgdp  = (page_to_pfn(pud_page)<<HMM_DUMMY_PFN_SHIFT);
+			*pgdp |= HMM_DUMMY_PTE_VALID_PAGE;
+		}
+
+		pud_page = pfn_to_page((*pgdp) >> HMM_DUMMY_PFN_SHIFT);
+		pudp = kmap(pud_page);
+		pudp = &pudp[hmm_dummy_pud_index(faddr)];
+		if (!((*pudp) & HMM_DUMMY_PTE_VALID_PAGE)) {
+			pmd_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+			if (!pmd_page) {
+				kunmap(pud_page);
+				return -ENOMEM;
+			}
+			*pudp  = (page_to_pfn(pmd_page)<<HMM_DUMMY_PFN_SHIFT);
+			*pudp |= HMM_DUMMY_PTE_VALID_PAGE;
+		}
+
+		pmd_page = pfn_to_page((*pudp) >> HMM_DUMMY_PFN_SHIFT);
+		pmdp = kmap(pmd_page);
+		pmdp = &pmdp[hmm_dummy_pmd_index(faddr)];
+		if (!((*pmdp) & HMM_DUMMY_PTE_VALID_PAGE)) {
+			struct page *page;
+
+			page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+			if (!page) {
+				kunmap(pmd_page);
+				kunmap(pud_page);
+				return -ENOMEM;
+			}
+			*pmdp  = (page_to_pfn(page) << HMM_DUMMY_PFN_SHIFT);
+			*pmdp |= HMM_DUMMY_PTE_VALID_PAGE;
+		}
+
+		kunmap(pmd_page);
+		kunmap(pud_page);
+	}
+
+	return 0;
+}
+
+static void hmm_dummy_pt_free_pmd(struct hmm_dummy_pt_map *pt_map,
+				  unsigned long faddr,
+				  unsigned long laddr)
+{
+	for (; faddr < laddr; faddr = hmm_dummy_pld_next(faddr)) {
+		unsigned long pfn, *pmdp, next;
+		struct page *page;
+
+		next = min(hmm_dummy_pld_next(faddr), laddr);
+		if (faddr > hmm_dummy_pld_base(faddr) || laddr < next) {
+			continue;
+		}
+		pmdp = hmm_dummy_pt_pmd_map(pt_map, faddr);
+		if (!pmdp) {
+			continue;
+		}
+		if (!(pmdp[hmm_dummy_pmd_index(faddr)] & HMM_DUMMY_PTE_VALID_PAGE)) {
+			continue;
+		}
+		pfn = pmdp[hmm_dummy_pmd_index(faddr)] >> HMM_DUMMY_PFN_SHIFT;
+		page = pfn_to_page(pfn);
+		pmdp[hmm_dummy_pmd_index(faddr)] = 0;
+		__free_page(page);
+	}
+}
+
+static void hmm_dummy_pt_free_pud(struct hmm_dummy_pt_map *pt_map,
+				  unsigned long faddr,
+				  unsigned long laddr)
+{
+	for (; faddr < laddr; faddr = hmm_dummy_pmd_next(faddr)) {
+		unsigned long pfn, *pudp, next;
+		struct page *page;
+
+		next = min(hmm_dummy_pmd_next(faddr), laddr);
+		hmm_dummy_pt_free_pmd(pt_map, faddr, next);
+		hmm_dummy_pt_pmd_unmap(pt_map);
+		if (faddr > hmm_dummy_pmd_base(faddr) || laddr < next) {
+			continue;
+		}
+		pudp = hmm_dummy_pt_pud_map(pt_map, faddr);
+		if (!pudp) {
+			continue;
+		}
+		if (!(pudp[hmm_dummy_pud_index(faddr)] & HMM_DUMMY_PTE_VALID_PAGE)) {
+			continue;
+		}
+		pfn = pudp[hmm_dummy_pud_index(faddr)] >> HMM_DUMMY_PFN_SHIFT;
+		page = pfn_to_page(pfn);
+		pudp[hmm_dummy_pud_index(faddr)] = 0;
+		__free_page(page);
+	}
+}
+
+static void hmm_dummy_pt_free(struct hmm_dummy_mirror *dmirror,
+			      unsigned long faddr,
+			      unsigned long laddr)
+{
+	struct hmm_dummy_pt_map pt_map = {0};
+
+	if (!dmirror->pgdp || (laddr - faddr) < HMM_DUMMY_PLD_SIZE) {
+		return;
+	}
+
+	pt_map.dmirror = dmirror;
+
+	for (; faddr < laddr; faddr = hmm_dummy_pud_next(faddr)) {
+		unsigned long pfn, *pgdp, next;
+		struct page *page;
+
+		next = min(hmm_dummy_pud_next(faddr), laddr);
+		pgdp = dmirror->pgdp;
+		hmm_dummy_pt_free_pud(&pt_map, faddr, next);
+		hmm_dummy_pt_pud_unmap(&pt_map);
+		if (faddr > hmm_dummy_pud_base(faddr) || laddr < next) {
+			continue;
+		}
+		if (!(pgdp[hmm_dummy_pgd_index(faddr)] & HMM_DUMMY_PTE_VALID_PAGE)) {
+			continue;
+		}
+		pfn = pgdp[hmm_dummy_pgd_index(faddr)] >> HMM_DUMMY_PFN_SHIFT;
+		page = pfn_to_page(pfn);
+		pgdp[hmm_dummy_pgd_index(faddr)] = 0;
+		__free_page(page);
+	}
+	hmm_dummy_pt_unmap(&pt_map);
+}
+
+
+/* hmm_ops - hmm callbacks for the hmm dummy driver.
+ *
+ * Below are the various callbacks that the hmm api requires for a device. The
+ * implementation of the dummy device driver is necessarily simpler than what
+ * a real device driver would do. We have neither interrupts nor any kind of
+ * command buffer onto which memory invalidations and updates could be
+ * scheduled.
+ */
+static void hmm_dummy_device_destroy(struct hmm_device *device)
+{
+	/* No-op for the dummy driver. */
+}
+
+static void hmm_dummy_mirror_release(struct hmm_mirror *mirror)
+{
+	struct hmm_dummy_mirror *dmirror;
+
+	dmirror = container_of(mirror, struct hmm_dummy_mirror, mirror);
+	dmirror->stop = true;
+	mutex_lock(&dmirror->mutex);
+	hmm_dummy_pt_free(dmirror, 0, HMM_DUMMY_MAX_ADDR);
+	if (dmirror->pgdp) {
+		kfree(dmirror->pgdp);
+		dmirror->pgdp = NULL;
+	}
+	mutex_unlock(&dmirror->mutex);
+}
+
+static void hmm_dummy_mirror_destroy(struct hmm_mirror *mirror)
+{
+	/* No-op for the dummy driver. */
+	/* FIXME: print that the pid is no longer mirrored. */
+}
+
+static int hmm_dummy_fence_wait(struct hmm_fence *fence)
+{
+	/* FIXME use some kind of fake event and delay dirty and dummy page
+	 * clearing to this function.
+	 */
+	return 0;
+}
+
+static struct hmm_fence *hmm_dummy_lmem_update(struct hmm_mirror *mirror,
+					       unsigned long faddr,
+					       unsigned long laddr,
+					       enum hmm_etype etype,
+					       bool dirty)
+{
+	struct hmm_dummy_mirror *dmirror;
+	struct hmm_dummy_pt_map pt_map = {0};
+	unsigned long addr, i, mask, or;
+
+	dmirror = container_of(mirror, struct hmm_dummy_mirror, mirror);
+	pt_map.dmirror = dmirror;
+
+	/* Sanity check for debugging hmm; a real device driver does not have to do this. */
+	switch (etype) {
+	case HMM_UNREGISTER:
+	case HMM_UNMAP:
+	case HMM_MUNMAP:
+	case HMM_MPROT_WONLY:
+	case HMM_MIGRATE_TO_RMEM:
+	case HMM_MIGRATE_TO_LMEM:
+		mask = 0;
+		or = 0;
+		break;
+	case HMM_MPROT_RONLY:
+		mask = ~HMM_DUMMY_PTE_WRITE;
+		or = 0;
+		break;
+	case HMM_MPROT_RANDW:
+		mask = -1L;
+		or = HMM_DUMMY_PTE_WRITE;
+		break;
+	default:
+		printk(KERN_ERR "%4d:%s invalid event type %d\n",
+		       __LINE__, __func__, etype);
+		return ERR_PTR(-EIO);
+	}
+
+	mutex_lock(&dmirror->mutex);
+	for (i = 0, addr = faddr; addr < laddr; ++i, addr += PAGE_SIZE) {
+		unsigned long *pldp;
+
+		pldp = hmm_dummy_pt_pld_map(&pt_map, addr);
+		if (!pldp) {
+			continue;
+		}
+		if (dirty && ((*pldp) & HMM_DUMMY_PTE_DIRTY)) {
+			struct page *page;
+
+			page = hmm_dummy_pte_to_page(*pldp);
+			if (page) {
+				set_page_dirty(page);
+			}
+		}
+		*pldp &= ~HMM_DUMMY_PTE_DIRTY;
+		*pldp &= mask;
+		*pldp |= or;
+		if ((*pldp) & HMM_DUMMY_PTE_VALID_ZERO) {
+			*pldp &= ~HMM_DUMMY_PTE_WRITE;
+		}
+	}
+	hmm_dummy_pt_unmap(&pt_map);
+
+	switch (etype) {
+	case HMM_UNREGISTER:
+	case HMM_MUNMAP:
+		hmm_dummy_pt_free(dmirror, faddr, laddr);
+		break;
+	default:
+		break;
+	}
+	mutex_unlock(&dmirror->mutex);
+	return NULL;
+}
+
+static int hmm_dummy_lmem_fault(struct hmm_mirror *mirror,
+				unsigned long faddr,
+				unsigned long laddr,
+				unsigned long *pfns,
+				struct hmm_fault *fault)
+{
+	struct hmm_dummy_mirror *dmirror;
+	struct hmm_dummy_pt_map pt_map = {0};
+	unsigned long i;
+
+	dmirror = container_of(mirror, struct hmm_dummy_mirror, mirror);
+	pt_map.dmirror = dmirror;
+
+	mutex_lock(&dmirror->mutex);
+	for (i = 0; faddr < laddr; ++i, faddr += PAGE_SIZE) {
+		unsigned long *pldp, pld_idx;
+
+		pldp = hmm_dummy_pt_pld_map(&pt_map, faddr);
+		if (!pldp || !hmm_pfn_to_page(pfns[i])) {
+			continue;
+		}
+		pld_idx = hmm_dummy_pld_index(faddr);
+		pldp[pld_idx]  = ((pfns[i] >> HMM_PFN_SHIFT) << HMM_DUMMY_PFN_SHIFT);
+		pldp[pld_idx] |= test_bit(HMM_PFN_WRITE, &pfns[i]) ? HMM_DUMMY_PTE_WRITE : 0;
+		pldp[pld_idx] |= test_bit(HMM_PFN_VALID_PAGE, &pfns[i]) ?
+			HMM_DUMMY_PTE_VALID_PAGE : HMM_DUMMY_PTE_VALID_ZERO;
+		pldp[pld_idx] |= HMM_DUMMY_PTE_READ;
+	}
+	hmm_dummy_pt_unmap(&pt_map);
+	mutex_unlock(&dmirror->mutex);
+	return 0;
+}
+
+static const struct hmm_device_ops hmm_dummy_ops = {
+	.device_destroy		= &hmm_dummy_device_destroy,
+	.mirror_release		= &hmm_dummy_mirror_release,
+	.mirror_destroy		= &hmm_dummy_mirror_destroy,
+	.fence_wait		= &hmm_dummy_fence_wait,
+	.lmem_update		= &hmm_dummy_lmem_update,
+	.lmem_fault		= &hmm_dummy_lmem_fault,
+};
+
+
+/* hmm_dummy_mmap - hmm dummy device file mmap operations.
+ *
+ * The hmm dummy driver does not allow mmap of its device file. The main reason
+ * is that the kernel lacks the ability to insert pages with specific custom
+ * protections inside a vma.
+ */
+static int hmm_dummy_mmap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	return VM_FAULT_SIGBUS;
+}
+
+static void hmm_dummy_mmap_open(struct vm_area_struct *vma)
+{
+	/* nop */
+}
+
+static void hmm_dummy_mmap_close(struct vm_area_struct *vma)
+{
+	/* nop */
+}
+
+static const struct vm_operations_struct mmap_mem_ops = {
+	.fault			= hmm_dummy_mmap_fault,
+	.open			= hmm_dummy_mmap_open,
+	.close			= hmm_dummy_mmap_close,
+};
+
+
+/* hmm_dummy_fops - hmm dummy device file operations.
+ *
+ * The hmm dummy driver allows reading from and writing to the mirrored process
+ * through the device file. Below are the read, write and other device file
+ * callbacks that implement access to the mirrored address space.
+ */
+static int hmm_dummy_mirror_fault(struct hmm_dummy_mirror *dmirror,
+				  unsigned long addr,
+				  bool write)
+{
+	struct hmm_mirror *mirror = &dmirror->mirror;
+	struct hmm_fault fault;
+	unsigned long faddr, laddr, npages = 4;
+	int ret;
+
+	fault.pfns = kmalloc(npages * sizeof(long), GFP_KERNEL);
+	fault.flags = write ? HMM_FAULT_WRITE : 0;
+
+	/* Showcase the hmm api: fault a small range around the given address. */
+	fault.faddr = faddr = addr > (npages << 8) ? addr - (npages << 8) : 0;
+	fault.laddr = laddr = faddr + (npages << 10);
+
+	/* Pre-allocate device page table. */
+	mutex_lock(&dmirror->mutex);
+	ret = hmm_dummy_pt_alloc(dmirror, faddr, laddr);
+	mutex_unlock(&dmirror->mutex);
+	if (ret) {
+		goto out;
+	}
+
+	for (; faddr < laddr; faddr = fault.faddr) {
+		ret = hmm_mirror_fault(mirror, &fault);
+		/* Ignore any error that does not concern the fault address. */
+		if (addr >= fault.laddr) {
+			fault.faddr = fault.laddr;
+			fault.laddr = laddr;
+			continue;
+		}
+		if (addr < fault.faddr) {
+			/* The address was faulted successfully; ignore errors
+			 * for addresses above the one we were interested in.
+			 */
+			ret = 0;
+		}
+		goto out;
+	}
+
+out:
+	kfree(fault.pfns);
+	return ret;
+}
+
+static ssize_t hmm_dummy_fops_read(struct file *filp,
+				   char __user *buf,
+				   size_t count,
+				   loff_t *ppos)
+{
+	struct hmm_dummy_device *ddevice;
+	struct hmm_dummy_mirror *dmirror;
+	struct hmm_dummy_pt_map pt_map = {0};
+	unsigned long faddr, laddr, offset;
+	unsigned minor;
+	ssize_t retval = 0;
+	void *tmp;
+	long r;
+
+	tmp = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!tmp) {
+		return -ENOMEM;
+	}
+
+	/* Check if we are mirroring anything */
+	minor = iminor(file_inode(filp));
+	ddevice = filp->private_data;
+	mutex_lock(&ddevice->mutex);
+	if (ddevice->dmirrors[minor] == NULL) {
+		mutex_unlock(&ddevice->mutex);
+		kfree(tmp);
+		return 0;
+	}
+	dmirror = ddevice->dmirrors[minor];
+	mutex_unlock(&ddevice->mutex);
+	if (dmirror->stop) {
+		kfree(tmp);
+		return 0;
+	}
+
+	/* The range of addresses to look up. */
+	faddr = (*ppos) & PAGE_MASK;
+	offset = (*ppos) - faddr;
+	laddr = PAGE_ALIGN(faddr + count);
+	BUG_ON(faddr == laddr);
+	pt_map.dmirror = dmirror;
+
+	for (; count; faddr += PAGE_SIZE, offset = 0) {
+		unsigned long *pldp, pld_idx;
+		unsigned long size = min(PAGE_SIZE - offset, count);
+		struct page *page;
+		char *ptr;
+
+		mutex_lock(&dmirror->mutex);
+		pldp = hmm_dummy_pt_pld_map(&pt_map, faddr);
+		pld_idx = hmm_dummy_pld_index(faddr);
+		if (!pldp || !(pldp[pld_idx] & (HMM_DUMMY_PTE_VALID_PAGE | HMM_DUMMY_PTE_VALID_ZERO))) {
+			hmm_dummy_pt_unmap(&pt_map);
+			mutex_unlock(&dmirror->mutex);
+			goto fault;
+		}
+		page = hmm_dummy_pte_to_page(pldp[pld_idx]);
+		if (!page) {
+			mutex_unlock(&dmirror->mutex);
+			BUG();
+			kfree(tmp);
+			return -EFAULT;
+		}
+		ptr = kmap(page);
+		memcpy(tmp, ptr + offset, size);
+		kunmap(page);
+		hmm_dummy_pt_unmap(&pt_map);
+		mutex_unlock(&dmirror->mutex);
+
+		r = copy_to_user(buf, tmp, size);
+		if (r) {
+			kfree(tmp);
+			return -EFAULT;
+		}
+		retval += size;
+		*ppos += size;
+		count -= size;
+		buf += size;
+	}
+
+	kfree(tmp);
+	return retval;
+
+fault:
+	kfree(tmp);
+	r = hmm_dummy_mirror_fault(dmirror, faddr, false);
+	if (r) {
+		return r;
+	}
+
+	/* Force userspace to retry read if nothing was read. */
+	return retval ? retval : -EINTR;
+}
+
+static ssize_t hmm_dummy_fops_write(struct file *filp,
+				    const char __user *buf,
+				    size_t count,
+				    loff_t *ppos)
+{
+	struct hmm_dummy_device *ddevice;
+	struct hmm_dummy_mirror *dmirror;
+	struct hmm_dummy_pt_map pt_map = {0};
+	unsigned long faddr, laddr, offset;
+	unsigned minor;
+	ssize_t retval = 0;
+	void *tmp;
+	long r;
+
+	tmp = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!tmp) {
+		return -ENOMEM;
+	}
+
+	/* Check if we are mirroring anything */
+	minor = iminor(file_inode(filp));
+	ddevice = filp->private_data;
+	mutex_lock(&ddevice->mutex);
+	if (ddevice->dmirrors[minor] == NULL) {
+		mutex_unlock(&ddevice->mutex);
+		kfree(tmp);
+		return 0;
+	}
+	dmirror = ddevice->dmirrors[minor];
+	mutex_unlock(&ddevice->mutex);
+	if (dmirror->stop) {
+		kfree(tmp);
+		return 0;
+	}
+
+	/* The range of addresses to look up. */
+	faddr = (*ppos) & PAGE_MASK;
+	offset = (*ppos) - faddr;
+	laddr = PAGE_ALIGN(faddr + count);
+	BUG_ON(faddr == laddr);
+	pt_map.dmirror = dmirror;
+
+	for (; count; faddr += PAGE_SIZE, offset = 0) {
+		unsigned long *pldp, pld_idx;
+		unsigned long size = min(PAGE_SIZE - offset, count);
+		struct page *page;
+		char *ptr;
+
+		r = copy_from_user(tmp, buf, size);
+		if (r) {
+			kfree(tmp);
+			return -EFAULT;
+		}
+
+		mutex_lock(&dmirror->mutex);
+
+		pldp = hmm_dummy_pt_pld_map(&pt_map, faddr);
+		pld_idx = hmm_dummy_pld_index(faddr);
+		if (!pldp || !(pldp[pld_idx] & HMM_DUMMY_PTE_VALID_PAGE)) {
+			hmm_dummy_pt_unmap(&pt_map);
+			mutex_unlock(&dmirror->mutex);
+			goto fault;
+		}
+		if (!(pldp[pld_idx] & HMM_DUMMY_PTE_WRITE)) {
+			hmm_dummy_pt_unmap(&pt_map);
+			mutex_unlock(&dmirror->mutex);
+				goto fault;
+		}
+		pldp[pld_idx] |= HMM_DUMMY_PTE_DIRTY;
+		page = hmm_dummy_pte_to_page(pldp[pld_idx]);
+		if (!page) {
+			mutex_unlock(&dmirror->mutex);
+			BUG();
+			kfree(tmp);
+			return -EFAULT;
+		}
+		ptr = kmap(page);
+		memcpy(ptr + offset, tmp, size);
+		kunmap(page);
+		hmm_dummy_pt_unmap(&pt_map);
+		mutex_unlock(&dmirror->mutex);
+
+		retval += size;
+		*ppos += size;
+		count -= size;
+		buf += size;
+	}
+
+	kfree(tmp);
+	return retval;
+
+fault:
+	kfree(tmp);
+	r = hmm_dummy_mirror_fault(dmirror, faddr, true);
+	if (r) {
+		return r;
+	}
+
+	/* Force userspace to retry write if nothing was written. */
+	return retval ? retval : -EINTR;
+}
+
+static int hmm_dummy_fops_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+	return -EINVAL;
+}
+
+static int hmm_dummy_fops_open(struct inode *inode, struct file *filp)
+{
+	struct hmm_dummy_device *ddevice;
+	struct cdev *cdev = inode->i_cdev;
+	const int minor = iminor(inode);
+
+	/* No exclusive opens */
+	if (filp->f_flags & O_EXCL) {
+		return -EINVAL;
+	}
+
+	ddevice = container_of(cdev, struct hmm_dummy_device, cdev);
+	filp->private_data = ddevice;
+	ddevice->fmapping[minor] = &inode->i_data;
+
+	return 0;
+}
+
+static int hmm_dummy_fops_release(struct inode *inode,
+				  struct file *filp)
+{
+	struct hmm_dummy_device *ddevice;
+	struct hmm_dummy_mirror *dmirror;
+	struct cdev *cdev = inode->i_cdev;
+	const int minor = iminor(inode);
+
+	ddevice = container_of(cdev, struct hmm_dummy_device, cdev);
+	dmirror = ddevice->dmirrors[minor];
+	if (dmirror && dmirror->filp == filp) {
+		if (!dmirror->stop) {
+			hmm_mirror_unregister(&dmirror->mirror);
+		}
+		ddevice->dmirrors[minor] = NULL;
+		kfree(dmirror);
+	}
+
+	return 0;
+}
+
+static long hmm_dummy_fops_unlocked_ioctl(struct file *filp,
+					  unsigned int command,
+					  unsigned long arg)
+{
+	struct hmm_dummy_device *ddevice;
+	struct hmm_dummy_mirror *dmirror;
+	unsigned minor;
+	int ret;
+
+	minor = iminor(file_inode(filp));
+	ddevice = filp->private_data;
+	switch (command) {
+	case HMM_DUMMY_EXPOSE_MM:
+		mutex_lock(&ddevice->mutex);
+		dmirror = ddevice->dmirrors[minor];
+		if (dmirror) {
+			mutex_unlock(&ddevice->mutex);
+			return -EBUSY;
+		}
+		/* Mirror this process address space */
+		dmirror = kzalloc(sizeof(*dmirror), GFP_KERNEL);
+		if (dmirror == NULL) {
+			mutex_unlock(&ddevice->mutex);
+			return -ENOMEM;
+		}
+		dmirror->mm = NULL;
+		dmirror->stop = false;
+		dmirror->pid = task_pid_nr(current);
+		dmirror->ddevice = ddevice;
+		dmirror->minor = minor;
+		dmirror->filp = filp;
+		dmirror->pgdp = NULL;
+		mutex_init(&dmirror->mutex);
+		ddevice->dmirrors[minor] = dmirror;
+		mutex_unlock(&ddevice->mutex);
+
+		ret = hmm_mirror_register(&dmirror->mirror,
+					  &ddevice->device,
+					  current->mm);
+		if (ret) {
+			mutex_lock(&ddevice->mutex);
+			ddevice->dmirrors[minor] = NULL;
+			mutex_unlock(&ddevice->mutex);
+			kfree(dmirror);
+			return ret;
+		}
+		/* Success. */
+		hmm_dummy_device_print(ddevice, dmirror->minor,
+				       "mirroring address space of %d\n",
+				       dmirror->pid);
+		return 0;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static const struct file_operations hmm_dummy_fops = {
+	.read		= hmm_dummy_fops_read,
+	.write		= hmm_dummy_fops_write,
+	.mmap		= hmm_dummy_fops_mmap,
+	.open		= hmm_dummy_fops_open,
+	.release	= hmm_dummy_fops_release,
+	.unlocked_ioctl = hmm_dummy_fops_unlocked_ioctl,
+	.llseek		= default_llseek,
+	.owner		= THIS_MODULE,
+};
+
+
+/*
+ * char device driver
+ */
+static int hmm_dummy_device_init(struct hmm_dummy_device *ddevice)
+{
+	int ret, i;
+
+	ret = alloc_chrdev_region(&ddevice->dev, 0,
+				  HMM_DUMMY_DEVICE_MAX_MIRRORS,
+				  ddevice->name);
+	if (ret < 0) {
+		printk(KERN_ERR "alloc_chrdev_region() failed for hmm_dummy\n");
+		goto error;
+	}
+	ddevice->major = MAJOR(ddevice->dev);
+
+	cdev_init(&ddevice->cdev, &hmm_dummy_fops);
+	ret = cdev_add(&ddevice->cdev, ddevice->dev, HMM_DUMMY_DEVICE_MAX_MIRRORS);
+	if (ret) {
+		unregister_chrdev_region(ddevice->dev, HMM_DUMMY_DEVICE_MAX_MIRRORS);
+		goto error;
+	}
+
+	/* Register the hmm device. */
+	for (i = 0; i < HMM_DUMMY_DEVICE_MAX_MIRRORS; i++) {
+		ddevice->dmirrors[i] = NULL;
+	}
+	mutex_init(&ddevice->mutex);
+	ddevice->device.ops = &hmm_dummy_ops;
+
+	ret = hmm_device_register(&ddevice->device, ddevice->name);
+	if (ret) {
+		cdev_del(&ddevice->cdev);
+		unregister_chrdev_region(ddevice->dev, HMM_DUMMY_DEVICE_MAX_MIRRORS);
+		goto error;
+	}
+
+	return 0;
+
+error:
+	return ret;
+}
+
+static void hmm_dummy_device_fini(struct hmm_dummy_device *ddevice)
+{
+	unsigned i;
+
+	/* First finish hmm. */
+	for (i = 0; i < HMM_DUMMY_DEVICE_MAX_MIRRORS; i++) {
+		struct hmm_dummy_mirror *dmirror;
+
+		dmirror = ddevice->dmirrors[i];
+		if (!dmirror) {
+			continue;
+		}
+		hmm_mirror_unregister(&dmirror->mirror);
+		kfree(dmirror);
+	}
+	hmm_device_unref(&ddevice->device);
+
+	cdev_del(&ddevice->cdev);
+	unregister_chrdev_region(ddevice->dev,
+				 HMM_DUMMY_DEVICE_MAX_MIRRORS);
+}
+
+static int __init hmm_dummy_init(void)
+{
+	int ret;
+
+	snprintf(ddevices[0].name, sizeof(ddevices[0].name),
+		 "%s%d", HMM_DUMMY_DEVICE_NAME, 0);
+	ret = hmm_dummy_device_init(&ddevices[0]);
+	if (ret) {
+		return ret;
+	}
+
+	snprintf(ddevices[1].name, sizeof(ddevices[1].name),
+		 "%s%d", HMM_DUMMY_DEVICE_NAME, 1);
+	ret = hmm_dummy_device_init(&ddevices[1]);
+	if (ret) {
+		hmm_dummy_device_fini(&ddevices[0]);
+		return ret;
+	}
+
+	printk(KERN_INFO "hmm_dummy loaded THIS IS A DANGEROUS MODULE !!!\n");
+	return 0;
+}
+
+static void __exit hmm_dummy_exit(void)
+{
+	hmm_dummy_device_fini(&ddevices[1]);
+	hmm_dummy_device_fini(&ddevices[0]);
+}
+
+module_init(hmm_dummy_init);
+module_exit(hmm_dummy_exit);
+MODULE_LICENSE("GPL");
diff --git a/include/uapi/linux/hmm_dummy.h b/include/uapi/linux/hmm_dummy.h
new file mode 100644
index 0000000..16ae0d3
--- /dev/null
+++ b/include/uapi/linux/hmm_dummy.h
@@ -0,0 +1,34 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/* This is a dummy driver made to exercise the HMM (hardware memory management)
+ * API of the kernel. It allows a userspace program to map its whole address
+ * space through the hmm dummy driver file.
+ */
+#ifndef _UAPI_LINUX_HMM_DUMMY_H
+#define _UAPI_LINUX_HMM_DUMMY_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+#include <linux/irqnr.h>
+
+/* Expose the address space of the calling process through hmm dummy dev file */
+#define HMM_DUMMY_EXPOSE_MM	_IO( 'R', 0x00 )
+
+#endif /* _UAPI_LINUX_HMM_DUMMY_H */
-- 
1.9.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 11/11] hmm/dummy_driver: add support for fake remote memory using pages.
  2014-05-02 13:51 ` j.glisse
  (?)
@ 2014-05-02 13:52   ` j.glisse
  -1 siblings, 0 replies; 107+ messages in thread
From: j.glisse @ 2014-05-02 13:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel; +Cc: Jérôme Glisse

From: Jérôme Glisse <jglisse@redhat.com>

Fake the existence of remote memory using preallocated pages and
demonstrate how to use the hmm api related to remote memory.
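
As an illustration, a userspace program could drive the dummy driver roughly
as follows (a minimal sketch with error handling omitted; the device node
path is an assumption, only the two ioctls and struct hmm_dummy_migrate come
from this patch series):

    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/hmm_dummy.h>

    static char data[4096] = "hello";

    int main(void)
    {
        /* Assumed node for minor 0 of the first dummy device. */
        int fd = open("/dev/hmm_dummy_device0", O_RDWR);
        struct hmm_dummy_migrate m;
        char buf[6];

        /* Mirror the calling process address space. */
        ioctl(fd, HMM_DUMMY_EXPOSE_MM);

        /* Migrate the pages backing 'data' to the fake remote memory. */
        m.faddr = (uint64_t)(uintptr_t)data;
        m.laddr = m.faddr + sizeof(data);
        ioctl(fd, HMM_DUMMY_MIGRATE_TO_RMEM, &m);

        /* The file offset is the mirror address: read 'data' back through
         * the device file (retry on -EINTR while the mirror page table is
         * being populated).
         */
        pread(fd, buf, sizeof(buf), (off_t)(uintptr_t)data);
        return 0;
    }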

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 drivers/char/hmm_dummy.c       | 450 ++++++++++++++++++++++++++++++++++++++++-
 include/uapi/linux/hmm_dummy.h |   8 +-
 2 files changed, 453 insertions(+), 5 deletions(-)

diff --git a/drivers/char/hmm_dummy.c b/drivers/char/hmm_dummy.c
index e87dc7c..2443374 100644
--- a/drivers/char/hmm_dummy.c
+++ b/drivers/char/hmm_dummy.c
@@ -48,6 +48,8 @@
 
 #define HMM_DUMMY_DEVICE_NAME		"hmm_dummy_device"
 #define HMM_DUMMY_DEVICE_MAX_MIRRORS	4
+#define HMM_DUMMY_DEVICE_RMEM_SIZE	(32UL << 20UL)
+#define HMM_DUMMY_DEVICE_RMEM_NBITS	(HMM_DUMMY_DEVICE_RMEM_SIZE >> PAGE_SHIFT)
 
 struct hmm_dummy_device;
 
@@ -73,8 +75,16 @@ struct hmm_dummy_device {
 	/* device file mapping tracking (keep track of all vma) */
 	struct hmm_dummy_mirror	*dmirrors[HMM_DUMMY_DEVICE_MAX_MIRRORS];
 	struct address_space	*fmapping[HMM_DUMMY_DEVICE_MAX_MIRRORS];
+	struct page		**rmem_pages;
+	unsigned long		*rmem_bitmap;
 };
 
+struct hmm_dummy_rmem {
+	struct hmm_rmem		rmem;
+	unsigned long		fuid;
+	unsigned long		luid;
+	uint16_t		*rmem_idx;
+};
 
 /* We only create 2 devices to show the inter-device rmem sharing/migration
  * capabilities.
@@ -482,6 +492,51 @@ static void hmm_dummy_pt_free(struct hmm_dummy_mirror *dmirror,
 }
 
 
+/* hmm_dummy_rmem - dummy remote memory using system memory pages
+ *
+ * Helper functions to allocate fake remote memory out of the device rmem_pages.
+ */
+static void hmm_dummy_rmem_free(struct hmm_dummy_rmem *drmem)
+{
+	struct hmm_dummy_device *ddevice;
+	struct hmm_rmem *rmem = &drmem->rmem;
+	unsigned long i, npages;
+
+	npages = (rmem->luid - rmem->fuid);
+	ddevice = container_of(rmem->device, struct hmm_dummy_device, device);
+	mutex_lock(&ddevice->mutex);
+	for (i = 0; i < npages; ++i) {
+		clear_bit(drmem->rmem_idx[i], ddevice->rmem_bitmap);
+	}
+	mutex_unlock(&ddevice->mutex);
+
+	kfree(drmem->rmem_idx);
+	drmem->rmem_idx = NULL;
+}
+
+static struct hmm_dummy_rmem *hmm_dummy_rmem_new(void)
+{
+	struct hmm_dummy_rmem *drmem;
+
+	drmem = kzalloc(sizeof(*drmem), GFP_KERNEL);
+	return drmem;
+}
+
+static int hmm_dummy_mirror_lmem_to_rmem(struct hmm_dummy_mirror *dmirror,
+					 unsigned long faddr,
+					 unsigned long laddr)
+{
+	struct hmm_mirror *mirror = &dmirror->mirror;
+	struct hmm_fault fault;
+	int ret;
+
+	fault.faddr = faddr & PAGE_MASK;
+	fault.laddr = PAGE_ALIGN(laddr);
+	ret = hmm_migrate_lmem_to_rmem(&fault, mirror);
+	return ret;
+}
+
+
 /* hmm_ops - hmm callbacks for the hmm dummy driver.
  *
  * Below are the various callbacks that the hmm api requires for a device. The
@@ -574,7 +629,7 @@ static struct hmm_fence *hmm_dummy_lmem_update(struct hmm_mirror *mirror,
 
 			page = hmm_dummy_pte_to_page(*pldp);
 			if (page) {
-				set_page_dirty(page);
+				set_page_dirty_lock(page);
 			}
 		}
 		*pldp &= ~HMM_DUMMY_PTE_DIRTY;
@@ -631,6 +686,318 @@ static int hmm_dummy_lmem_fault(struct hmm_mirror *mirror,
 	return 0;
 }
 
+static struct hmm_rmem *hmm_dummy_rmem_alloc(struct hmm_device *device,
+					     struct hmm_fault *fault)
+{
+	struct hmm_dummy_device *ddevice;
+	struct hmm_dummy_rmem *drmem;
+	struct hmm_rmem *rmem;
+	unsigned long i, npages;
+
+	ddevice = container_of(device, struct hmm_dummy_device, device);
+
+	drmem = hmm_dummy_rmem_new();
+	if (drmem == NULL) {
+		return ERR_PTR(-ENOMEM);
+	}
+	rmem = &drmem->rmem;
+
+	npages = (fault->laddr - fault->faddr) >> PAGE_SHIFT;
+	drmem->rmem_idx = kmalloc(npages * sizeof(uint16_t), GFP_KERNEL);
+	if (drmem->rmem_idx == NULL) {
+		kfree(drmem);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	mutex_lock(&ddevice->mutex);
+	for (i = 0; i < npages; ++i) {
+		int r;
+
+		r = find_first_zero_bit(ddevice->rmem_bitmap,
+					HMM_DUMMY_DEVICE_RMEM_NBITS);
+		if (r < 0) {
+			while ((--i)) {
+				clear_bit(drmem->rmem_idx[i],
+					  ddevice->rmem_bitmap);
+			}
+			kfree(drmem->rmem_idx);
+			kfree(drmem);
+			mutex_unlock(&ddevice->mutex);
+			return ERR_PTR(-ENOMEM);
+		}
+		drmem->rmem_idx[i] = r;
+	}
+	mutex_unlock(&ddevice->mutex);
+
+	return rmem;
+}
+
+static struct hmm_fence *hmm_dummy_rmem_update(struct hmm_mirror *mirror,
+					       struct hmm_rmem *rmem,
+					       unsigned long faddr,
+					       unsigned long laddr,
+					       unsigned long fuid,
+					       enum hmm_etype etype,
+					       bool dirty)
+{
+	struct hmm_dummy_mirror *dmirror;
+	struct hmm_dummy_pt_map pt_map = {0};
+	unsigned long addr, i, mask, or, idx;
+
+	dmirror = container_of(mirror, struct hmm_dummy_mirror, mirror);
+	pt_map.dmirror = dmirror;
+	idx = fuid - rmem->fuid;
+
+	/* Sanity check for debugging hmm; a real device driver does not have to do this. */
+	switch (etype) {
+	case HMM_UNREGISTER:
+	case HMM_UNMAP:
+	case HMM_MUNMAP:
+	case HMM_MPROT_WONLY:
+	case HMM_MIGRATE_TO_RMEM:
+	case HMM_MIGRATE_TO_LMEM:
+		mask = 0;
+		or = 0;
+		break;
+	case HMM_MPROT_RONLY:
+	case HMM_WRITEBACK:
+		mask = ~HMM_DUMMY_PTE_WRITE;
+		or = 0;
+		break;
+	case HMM_MPROT_RANDW:
+		mask = -1L;
+		or = HMM_DUMMY_PTE_WRITE;
+		break;
+	default:
+		printk(KERN_ERR "%4d:%s invalid event type %d\n",
+		       __LINE__, __func__, etype);
+		return ERR_PTR(-EIO);
+	}
+
+	mutex_lock(&dmirror->mutex);
+	for (i = 0, addr = faddr; addr < laddr; ++i, addr += PAGE_SIZE, ++idx) {
+		unsigned long *pldp;
+
+		pldp = hmm_dummy_pt_pld_map(&pt_map, addr);
+		if (!pldp) {
+			continue;
+		}
+		if (dirty && ((*pldp) & HMM_DUMMY_PTE_DIRTY)) {
+			hmm_pfn_set_dirty(&rmem->pfns[idx]);
+		}
+		*pldp &= ~HMM_DUMMY_PTE_DIRTY;
+		*pldp &= mask;
+		*pldp |= or;
+	}
+	hmm_dummy_pt_unmap(&pt_map);
+
+	switch (etype) {
+	case HMM_UNREGISTER:
+	case HMM_MUNMAP:
+		hmm_dummy_pt_free(dmirror, faddr, laddr);
+		break;
+	default:
+		break;
+	}
+	mutex_unlock(&dmirror->mutex);
+	return NULL;
+}
+
+static int hmm_dummy_rmem_fault(struct hmm_mirror *mirror,
+				struct hmm_rmem *rmem,
+				unsigned long faddr,
+				unsigned long laddr,
+				unsigned long fuid,
+				struct hmm_fault *fault)
+{
+	struct hmm_dummy_mirror *dmirror;
+	struct hmm_dummy_device *ddevice;
+	struct hmm_dummy_pt_map pt_map = {0};
+	struct hmm_dummy_rmem *drmem;
+	unsigned long i;
+	bool write = fault ? !!(fault->flags & HMM_FAULT_WRITE) : false;
+
+	dmirror = container_of(mirror, struct hmm_dummy_mirror, mirror);
+	drmem = container_of(rmem, struct hmm_dummy_rmem, rmem);
+	ddevice = dmirror->ddevice;
+	pt_map.dmirror = dmirror;
+
+	mutex_lock(&dmirror->mutex);
+	for (i = fuid; faddr < laddr; ++i, faddr += PAGE_SIZE) {
+		unsigned long *pldp, pld_idx, pfn, idx = i - rmem->fuid;
+
+		pldp = hmm_dummy_pt_pld_map(&pt_map, faddr);
+		if (!pldp) {
+			continue;
+		}
+		pfn = page_to_pfn(ddevice->rmem_pages[drmem->rmem_idx[idx]]);
+		pld_idx = hmm_dummy_pld_index(faddr);
+		pldp[pld_idx]  = (pfn << HMM_DUMMY_PFN_SHIFT);
+		if (test_bit(HMM_PFN_WRITE, &rmem->pfns[idx])) {
+			pldp[pld_idx] |=  HMM_DUMMY_PTE_WRITE;
+			hmm_pfn_clear_lmem_uptodate(&rmem->pfns[idx]);
+		}
+		pldp[pld_idx] |= HMM_DUMMY_PTE_VALID_PAGE;
+		if (write && !test_bit(HMM_PFN_WRITE, &rmem->pfns[idx])) {
+			/* Fall back to using system memory. Another solution
+			 * would be to migrate back to system memory.
+			 */
+			hmm_pfn_clear_rmem_uptodate(&rmem->pfns[idx]);
+			if (!test_bit(HMM_PFN_LMEM_UPTODATE, &rmem->pfns[idx])) {
+				struct page *spage, *dpage;
+
+				dpage = hmm_pfn_to_page(rmem->pfns[idx]);
+				spage = ddevice->rmem_pages[drmem->rmem_idx[idx]];
+				copy_highpage(dpage, spage);
+				hmm_pfn_set_lmem_uptodate(&rmem->pfns[idx]);
+			}
+			pfn = rmem->pfns[idx] >> HMM_PFN_SHIFT;
+			pldp[pld_idx]  = (pfn << HMM_DUMMY_PFN_SHIFT);
+			pldp[pld_idx] |= HMM_DUMMY_PTE_WRITE;
+			pldp[pld_idx] |= HMM_DUMMY_PTE_VALID_PAGE;
+		}
+	}
+	hmm_dummy_pt_unmap(&pt_map);
+	mutex_unlock(&dmirror->mutex);
+	return 0;
+}
+
+struct hmm_fence *hmm_dummy_rmem_to_lmem(struct hmm_rmem *rmem,
+					 unsigned long fuid,
+					 unsigned long luid)
+{
+	struct hmm_dummy_device *ddevice;
+	struct hmm_dummy_rmem *drmem;
+	unsigned long i;
+
+	ddevice = container_of(rmem->device, struct hmm_dummy_device, device);
+	drmem = container_of(rmem, struct hmm_dummy_rmem, rmem);
+
+	for (i = fuid; i < luid; ++i) {
+		unsigned long idx = i - rmem->fuid;
+		struct page *spage, *dpage;
+
+		if (test_bit(HMM_PFN_LMEM_UPTODATE, &rmem->pfns[idx])) {
+			/* This lmem page is already uptodate. */
+			continue;
+		}
+		spage = ddevice->rmem_pages[drmem->rmem_idx[idx]];
+		dpage = hmm_pfn_to_page(rmem->pfns[idx]);
+		if (!dpage) {
+			return ERR_PTR(-EINVAL);
+		}
+		copy_highpage(dpage, spage);
+		hmm_pfn_set_lmem_uptodate(&rmem->pfns[idx]);
+	}
+
+	return NULL;
+}
+
+struct hmm_fence *hmm_dummy_lmem_to_rmem(struct hmm_rmem *rmem,
+					 unsigned long fuid,
+					 unsigned long luid)
+{
+	struct hmm_dummy_device *ddevice;
+	struct hmm_dummy_rmem *drmem;
+	unsigned long i;
+
+	ddevice = container_of(rmem->device, struct hmm_dummy_device, device);
+	drmem = container_of(rmem, struct hmm_dummy_rmem, rmem);
+
+	for (i = fuid; i < luid; ++i) {
+		unsigned long idx = i - rmem->fuid;
+		struct page *spage, *dpage;
+
+		if (test_bit(HMM_PFN_RMEM_UPTODATE, &rmem->pfns[idx])) {
+			/* This rmem page is already uptodate. */
+			continue;
+		}
+		dpage = ddevice->rmem_pages[drmem->rmem_idx[idx]];
+		spage = hmm_pfn_to_page(rmem->pfns[idx]);
+		if (!spage) {
+			return ERR_PTR(-EINVAL);
+		}
+		copy_highpage(dpage, spage);
+		hmm_pfn_set_rmem_uptodate(&rmem->pfns[idx]);
+	}
+
+	return NULL;
+}
+
+static int hmm_dummy_rmem_do_split(struct hmm_rmem *rmem,
+				   unsigned long fuid,
+				   unsigned long luid)
+{
+	struct hmm_dummy_rmem *drmem, *dnew;
+	struct hmm_fault fault;
+	struct hmm_rmem *new;
+	unsigned long i, pgoff, npages;
+	int ret;
+
+	drmem = container_of(rmem, struct hmm_dummy_rmem, rmem);
+	npages = (luid - fuid);
+	pgoff = (fuid == rmem->fuid) ? 0 : fuid - rmem->fuid;
+	fault.faddr = 0;
+	fault.laddr = npages << PAGE_SHIFT;
+	new = hmm_dummy_rmem_alloc(rmem->device, &fault);
+	if (IS_ERR(new)) {
+		return PTR_ERR(new);
+	}
+	dnew = container_of(new, struct hmm_dummy_rmem, rmem);
+
+	new->fuid = fuid;
+	new->luid = luid;
+	ret = hmm_rmem_split_new(rmem, new);
+	if (ret) {
+		return ret;
+	}
+
+	/* Update the rmem. It is fine to hold no lock as no one else can access
+	 * either of these rmem objects as long as the ranges are reserved.
+	 */
+	for (i = 0; i < npages; ++i) {
+		dnew->rmem_idx[i] = drmem->rmem_idx[i + pgoff];
+	}
+	if (!pgoff) {
+		for (i = 0; i < (rmem->luid - rmem->fuid); ++i) {
+			drmem->rmem_idx[i] = drmem->rmem_idx[i + npages];
+		}
+	}
+
+	return 0;
+}
+
+static int hmm_dummy_rmem_split(struct hmm_rmem *rmem,
+				unsigned long fuid,
+				unsigned long luid)
+{
+	int ret;
+
+	if (fuid > rmem->fuid) {
+		ret = hmm_dummy_rmem_do_split(rmem, rmem->fuid, fuid);
+		if (ret) {
+			return ret;
+		}
+	}
+	if (luid < rmem->luid) {
+		ret = hmm_dummy_rmem_do_split(rmem, luid, rmem->luid);
+		if (ret) {
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
+static void hmm_dummy_rmem_destroy(struct hmm_rmem *rmem)
+{
+	struct hmm_dummy_rmem *drmem;
+
+	drmem = container_of(rmem, struct hmm_dummy_rmem, rmem);
+	hmm_dummy_rmem_free(drmem);
+	kfree(drmem);
+}
+
 static const struct hmm_device_ops hmm_dummy_ops = {
 	.device_destroy		= &hmm_dummy_device_destroy,
 	.mirror_release		= &hmm_dummy_mirror_release,
@@ -638,6 +1005,14 @@ static const struct hmm_device_ops hmm_dummy_ops = {
 	.fence_wait		= &hmm_dummy_fence_wait,
 	.lmem_update		= &hmm_dummy_lmem_update,
 	.lmem_fault		= &hmm_dummy_lmem_fault,
+	.rmem_alloc		= &hmm_dummy_rmem_alloc,
+	.rmem_update		= &hmm_dummy_rmem_update,
+	.rmem_fault		= &hmm_dummy_rmem_fault,
+	.rmem_to_lmem		= &hmm_dummy_rmem_to_lmem,
+	.lmem_to_rmem		= &hmm_dummy_lmem_to_rmem,
+	.rmem_split		= &hmm_dummy_rmem_split,
+	.rmem_split_adjust	= &hmm_dummy_rmem_split,
+	.rmem_destroy		= &hmm_dummy_rmem_destroy,
 };
 
 
@@ -880,7 +1255,7 @@ static ssize_t hmm_dummy_fops_write(struct file *filp,
 		if (!(pldp[pld_idx] & HMM_DUMMY_PTE_WRITE)) {
 			hmm_dummy_pt_unmap(&pt_map);
 			mutex_unlock(&dmirror->mutex);
-				goto fault;
+			goto fault;
 		}
 		pldp[pld_idx] |= HMM_DUMMY_PTE_DIRTY;
 		page = hmm_dummy_pte_to_page(pldp[pld_idx]);
@@ -964,8 +1339,11 @@ static long hmm_dummy_fops_unlocked_ioctl(struct file *filp,
 					  unsigned int command,
 					  unsigned long arg)
 {
+	struct hmm_dummy_migrate dmigrate;
 	struct hmm_dummy_device *ddevice;
 	struct hmm_dummy_mirror *dmirror;
+	struct hmm_mirror *mirror;
+	void __user *uarg = (void __user *)arg;
 	unsigned minor;
 	int ret;
 
@@ -1011,6 +1389,31 @@ static long hmm_dummy_fops_unlocked_ioctl(struct file *filp,
 				       "mirroring address space of %d\n",
 				       dmirror->pid);
 		return 0;
+	case HMM_DUMMY_MIGRATE_TO_RMEM:
+		mutex_lock(&ddevice->mutex);
+		dmirror = ddevice->dmirrors[minor];
+		if (!dmirror) {
+			mutex_unlock(&ddevice->mutex);
+			return -EINVAL;
+		}
+		mirror = &dmirror->mirror;
+		mutex_unlock(&ddevice->mutex);
+
+		if (copy_from_user(&dmigrate, uarg, sizeof(dmigrate))) {
+			return -EFAULT;
+		}
+
+		ret = hmm_dummy_pt_alloc(dmirror,
+					 dmigrate.faddr,
+					 dmigrate.laddr);
+		if (ret) {
+			return ret;
+		}
+
+		ret = hmm_dummy_mirror_lmem_to_rmem(dmirror,
+						    dmigrate.faddr,
+						    dmigrate.laddr);
+		return ret;
 	default:
 		return -EINVAL;
 	}
@@ -1034,7 +1437,31 @@ static const struct file_operations hmm_dummy_fops = {
  */
 static int hmm_dummy_device_init(struct hmm_dummy_device *ddevice)
 {
-	int ret, i;
+	struct page **pages;
+	unsigned long *bitmap;
+	int ret, i, npages;
+
+	npages = HMM_DUMMY_DEVICE_RMEM_SIZE >> PAGE_SHIFT;
+	bitmap = kzalloc(BITS_TO_LONGS(npages) * sizeof(long), GFP_KERNEL);
+	if (!bitmap) {
+		return -ENOMEM;
+	}
+	pages = kzalloc(npages * sizeof(void*), GFP_KERNEL);
+	if (!pages) {
+		kfree(bitmap);
+		return -ENOMEM;
+	}
+	for (i = 0; i < npages; ++i) {
+		pages[i] = alloc_page(GFP_KERNEL);
+		if (!pages[i]) {
+			while ((--i)) {
+				__free_page(pages[i]);
+			}
+			kfree(bitmap);
+			kfree(pages);
+			return -ENOMEM;
+		}
+	}
 
 	ret = alloc_chrdev_region(&ddevice->dev, 0,
 				  HMM_DUMMY_DEVICE_MAX_MIRRORS,
@@ -1066,15 +1493,23 @@ static int hmm_dummy_device_init(struct hmm_dummy_device *ddevice)
 		goto error;
 	}
 
+	ddevice->rmem_bitmap = bitmap;
+	ddevice->rmem_pages = pages;
+
 	return 0;
 
 error:
+	for (i = 0; i < npages; ++i) {
+		__free_page(pages[i]);
+	}
+	kfree(bitmap);
+	kfree(pages);
 	return ret;
 }
 
 static void hmm_dummy_device_fini(struct hmm_dummy_device *ddevice)
 {
-	unsigned i;
+	unsigned i, npages;
 
 	/* First finish hmm. */
 	for (i = 0; i < HMM_DUMMY_DEVICE_MAX_MIRRORS; i++) {
@@ -1092,6 +1527,13 @@ static void hmm_dummy_device_fini(struct hmm_dummy_device *ddevice)
 	cdev_del(&ddevice->cdev);
 	unregister_chrdev_region(ddevice->dev,
 				 HMM_DUMMY_DEVICE_MAX_MIRRORS);
+
+	npages = HMM_DUMMY_DEVICE_RMEM_SIZE >> PAGE_SHIFT;
+	for (i = 0; i < npages; ++i) {
+		__free_page(ddevice->rmem_pages[i]);
+	}
+	kfree(ddevice->rmem_bitmap);
+	kfree(ddevice->rmem_pages);
 }
 
 static int __init hmm_dummy_init(void)
diff --git a/include/uapi/linux/hmm_dummy.h b/include/uapi/linux/hmm_dummy.h
index 16ae0d3..027c453 100644
--- a/include/uapi/linux/hmm_dummy.h
+++ b/include/uapi/linux/hmm_dummy.h
@@ -29,6 +29,12 @@
 #include <linux/irqnr.h>
 
 /* Expose the address space of the calling process through hmm dummy dev file */
-#define HMM_DUMMY_EXPOSE_MM	_IO( 'R', 0x00 )
+#define HMM_DUMMY_EXPOSE_MM		_IO( 'R', 0x00 )
+#define HMM_DUMMY_MIGRATE_TO_RMEM	_IO( 'R', 0x01 )
+
+struct hmm_dummy_migrate {
+	uint64_t		faddr;
+	uint64_t		laddr;
+};
 
 #endif /* _UAPI_LINUX_HMM_DUMMY_H */
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 11/11] hmm/dummy_driver: add support for fake remote memory using pages.
@ 2014-05-02 13:52   ` j.glisse
  0 siblings, 0 replies; 107+ messages in thread
From: j.glisse @ 2014-05-02 13:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel; +Cc: Jérôme Glisse

From: Jérôme Glisse <jglisse@redhat.com>

Fake the existent of remote memory using preallocated pages and
demonstrate how to use the hmm api related to remote memory.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 drivers/char/hmm_dummy.c       | 450 ++++++++++++++++++++++++++++++++++++++++-
 include/uapi/linux/hmm_dummy.h |   8 +-
 2 files changed, 453 insertions(+), 5 deletions(-)

diff --git a/drivers/char/hmm_dummy.c b/drivers/char/hmm_dummy.c
index e87dc7c..2443374 100644
--- a/drivers/char/hmm_dummy.c
+++ b/drivers/char/hmm_dummy.c
@@ -48,6 +48,8 @@
 
 #define HMM_DUMMY_DEVICE_NAME		"hmm_dummy_device"
 #define HMM_DUMMY_DEVICE_MAX_MIRRORS	4
+#define HMM_DUMMY_DEVICE_RMEM_SIZE	(32UL << 20UL)
+#define HMM_DUMMY_DEVICE_RMEM_NBITS	(HMM_DUMMY_DEVICE_RMEM_SIZE >> PAGE_SHIFT)
 
 struct hmm_dummy_device;
 
@@ -73,8 +75,16 @@ struct hmm_dummy_device {
 	/* device file mapping tracking (keep track of all vma) */
 	struct hmm_dummy_mirror	*dmirrors[HMM_DUMMY_DEVICE_MAX_MIRRORS];
 	struct address_space	*fmapping[HMM_DUMMY_DEVICE_MAX_MIRRORS];
+	struct page		**rmem_pages;
+	unsigned long		*rmem_bitmap;
 };
 
+struct hmm_dummy_rmem {
+	struct hmm_rmem		rmem;
+	unsigned long		fuid;
+	unsigned long		luid;
+	uint16_t		*rmem_idx;
+};
 
 /* We only create 2 device to show the inter device rmem sharing/migration
  * capabilities.
@@ -482,6 +492,51 @@ static void hmm_dummy_pt_free(struct hmm_dummy_mirror *dmirror,
 }
 
 
+/* hmm_dummy_rmem - dummy remote memory using system memory pages
+ *
+ * Helper function to allocate fake remote memory out of the device rmem_pages.
+ */
+static void hmm_dummy_rmem_free(struct hmm_dummy_rmem *drmem)
+{
+	struct hmm_dummy_device *ddevice;
+	struct hmm_rmem *rmem = &drmem->rmem;
+	unsigned long i, npages;
+
+	npages = (rmem->luid - rmem->fuid);
+	ddevice = container_of(rmem->device, struct hmm_dummy_device, device);
+	mutex_lock(&ddevice->mutex);
+	for (i = 0; i < npages; ++i) {
+		clear_bit(drmem->rmem_idx[i], ddevice->rmem_bitmap);
+	}
+	mutex_unlock(&ddevice->mutex);
+
+	kfree(drmem->rmem_idx);
+	drmem->rmem_idx = NULL;
+}
+
+static struct hmm_dummy_rmem *hmm_dummy_rmem_new(void)
+{
+	struct hmm_dummy_rmem *drmem;
+
+	drmem = kzalloc(sizeof(*drmem), GFP_KERNEL);
+	return drmem;
+}
+
+static int hmm_dummy_mirror_lmem_to_rmem(struct hmm_dummy_mirror *dmirror,
+					 unsigned long faddr,
+					 unsigned long laddr)
+{
+	struct hmm_mirror *mirror = &dmirror->mirror;
+	struct hmm_fault fault;
+	int ret;
+
+	fault.faddr = faddr & PAGE_MASK;
+	fault.laddr = PAGE_ALIGN(laddr);
+	ret = hmm_migrate_lmem_to_rmem(&fault, mirror);
+	return ret;
+}
+
+
 /* hmm_ops - hmm callback for the hmm dummy driver.
  *
  * Below are the various callback that the hmm api require for a device. The
@@ -574,7 +629,7 @@ static struct hmm_fence *hmm_dummy_lmem_update(struct hmm_mirror *mirror,
 
 			page = hmm_dummy_pte_to_page(*pldp);
 			if (page) {
-				set_page_dirty(page);
+				set_page_dirty_lock(page);
 			}
 		}
 		*pldp &= ~HMM_DUMMY_PTE_DIRTY;
@@ -631,6 +686,318 @@ static int hmm_dummy_lmem_fault(struct hmm_mirror *mirror,
 	return 0;
 }
 
+static struct hmm_rmem *hmm_dummy_rmem_alloc(struct hmm_device *device,
+					     struct hmm_fault *fault)
+{
+	struct hmm_dummy_device *ddevice;
+	struct hmm_dummy_rmem *drmem;
+	struct hmm_rmem *rmem;
+	unsigned long i, npages;
+
+	ddevice = container_of(device, struct hmm_dummy_device, device);
+
+	drmem = hmm_dummy_rmem_new();
+	if (drmem == NULL) {
+		return ERR_PTR(-ENOMEM);
+	}
+	rmem = &drmem->rmem;
+
+	npages = (fault->laddr - fault->faddr) >> PAGE_SHIFT;
+	drmem->rmem_idx = kmalloc(npages * sizeof(uint16_t), GFP_KERNEL);
+	if (drmem->rmem_idx == NULL) {
+		kfree(drmem);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	mutex_lock(&ddevice->mutex);
+	for (i = 0; i < npages; ++i) {
+		int r;
+
+		r = find_first_zero_bit(ddevice->rmem_bitmap,
+					HMM_DUMMY_DEVICE_RMEM_NBITS);
+		if (r < 0) {
+			while ((--i)) {
+				clear_bit(drmem->rmem_idx[i],
+					  ddevice->rmem_bitmap);
+			}
+			kfree(drmem->rmem_idx);
+			kfree(drmem);
+			mutex_unlock(&ddevice->mutex);
+			return ERR_PTR(-ENOMEM);
+		}
+		drmem->rmem_idx[i] = r;
+	}
+	mutex_unlock(&ddevice->mutex);
+
+	return rmem;
+}
+
+static struct hmm_fence *hmm_dummy_rmem_update(struct hmm_mirror *mirror,
+					       struct hmm_rmem *rmem,
+					       unsigned long faddr,
+					       unsigned long laddr,
+					       unsigned long fuid,
+					       enum hmm_etype etype,
+					       bool dirty)
+{
+	struct hmm_dummy_mirror *dmirror;
+	struct hmm_dummy_pt_map pt_map = {0};
+	unsigned long addr, i, mask, or, idx;
+
+	dmirror = container_of(mirror, struct hmm_dummy_mirror, mirror);
+	pt_map.dmirror = dmirror;
+	idx = fuid - rmem->fuid;
+
+	/* Sanity check for debugging hmm real device driver do not have to do that. */
+	switch (etype) {
+	case HMM_UNREGISTER:
+	case HMM_UNMAP:
+	case HMM_MUNMAP:
+	case HMM_MPROT_WONLY:
+	case HMM_MIGRATE_TO_RMEM:
+	case HMM_MIGRATE_TO_LMEM:
+		mask = 0;
+		or = 0;
+		break;
+	case HMM_MPROT_RONLY:
+	case HMM_WRITEBACK:
+		mask = ~HMM_DUMMY_PTE_WRITE;
+		or = 0;
+		break;
+	case HMM_MPROT_RANDW:
+		mask = -1L;
+		or = HMM_DUMMY_PTE_WRITE;
+		break;
+	default:
+		printk(KERN_ERR "%4d:%s invalid event type %d\n",
+		       __LINE__, __func__, etype);
+		return ERR_PTR(-EIO);
+	}
+
+	mutex_lock(&dmirror->mutex);
+	for (i = 0, addr = faddr; addr < laddr; ++i, addr += PAGE_SIZE, ++idx) {
+		unsigned long *pldp;
+
+		pldp = hmm_dummy_pt_pld_map(&pt_map, addr);
+		if (!pldp) {
+			continue;
+		}
+		if (dirty && ((*pldp) & HMM_DUMMY_PTE_DIRTY)) {
+			hmm_pfn_set_dirty(&rmem->pfns[idx]);
+		}
+		*pldp &= ~HMM_DUMMY_PTE_DIRTY;
+		*pldp &= mask;
+		*pldp |= or;
+	}
+	hmm_dummy_pt_unmap(&pt_map);
+
+	switch (etype) {
+	case HMM_UNREGISTER:
+	case HMM_MUNMAP:
+		hmm_dummy_pt_free(dmirror, faddr, laddr);
+		break;
+	default:
+		break;
+	}
+	mutex_unlock(&dmirror->mutex);
+	return NULL;
+}
+
+static int hmm_dummy_rmem_fault(struct hmm_mirror *mirror,
+				struct hmm_rmem *rmem,
+				unsigned long faddr,
+				unsigned long laddr,
+				unsigned long fuid,
+				struct hmm_fault *fault)
+{
+	struct hmm_dummy_mirror *dmirror;
+	struct hmm_dummy_device *ddevice;
+	struct hmm_dummy_pt_map pt_map = {0};
+	struct hmm_dummy_rmem *drmem;
+	unsigned long i;
+	bool write = fault ? !!(fault->flags & HMM_FAULT_WRITE) : false;
+
+	dmirror = container_of(mirror, struct hmm_dummy_mirror, mirror);
+	drmem = container_of(rmem, struct hmm_dummy_rmem, rmem);
+	ddevice = dmirror->ddevice;
+	pt_map.dmirror = dmirror;
+
+	mutex_lock(&dmirror->mutex);
+	for (i = fuid; faddr < laddr; ++i, faddr += PAGE_SIZE) {
+		unsigned long *pldp, pld_idx, pfn, idx = i - rmem->fuid;
+
+		pldp = hmm_dummy_pt_pld_map(&pt_map, faddr);
+		if (!pldp) {
+			continue;
+		}
+		pfn = page_to_pfn(ddevice->rmem_pages[drmem->rmem_idx[idx]]);
+		pld_idx = hmm_dummy_pld_index(faddr);
+		pldp[pld_idx]  = (pfn << HMM_DUMMY_PFN_SHIFT);
+		if (test_bit(HMM_PFN_WRITE, &rmem->pfns[idx])) {
+			pldp[pld_idx] |=  HMM_DUMMY_PTE_WRITE;
+			hmm_pfn_clear_lmem_uptodate(&rmem->pfns[idx]);
+		}
+		pldp[pld_idx] |= HMM_DUMMY_PTE_VALID_PAGE;
+		if (write && !test_bit(HMM_PFN_WRITE, &rmem->pfns[idx])) {
+			/* Fallback to use system memory. Other solution would be
+			 * to migrate back to system memory.
+			 */
+			hmm_pfn_clear_rmem_uptodate(&rmem->pfns[idx]);
+			if (!test_bit(HMM_PFN_LMEM_UPTODATE, &rmem->pfns[idx])) {
+				struct page *spage, *dpage;
+
+				dpage = hmm_pfn_to_page(rmem->pfns[idx]);
+				spage = ddevice->rmem_pages[drmem->rmem_idx[idx]];
+				copy_highpage(dpage, spage);
+				hmm_pfn_set_lmem_uptodate(&rmem->pfns[idx]);
+			}
+			pfn = rmem->pfns[idx] >> HMM_PFN_SHIFT;
+			pldp[pld_idx]  = (pfn << HMM_DUMMY_PFN_SHIFT);
+			pldp[pld_idx] |= HMM_DUMMY_PTE_WRITE;
+			pldp[pld_idx] |= HMM_DUMMY_PTE_VALID_PAGE;
+		}
+	}
+	hmm_dummy_pt_unmap(&pt_map);
+	mutex_unlock(&dmirror->mutex);
+	return 0;
+}
+
+struct hmm_fence *hmm_dummy_rmem_to_lmem(struct hmm_rmem *rmem,
+					 unsigned long fuid,
+					 unsigned long luid)
+{
+	struct hmm_dummy_device *ddevice;
+	struct hmm_dummy_rmem *drmem;
+	unsigned long i;
+
+	ddevice = container_of(rmem->device, struct hmm_dummy_device, device);
+	drmem = container_of(rmem, struct hmm_dummy_rmem, rmem);
+
+	for (i = fuid; i < luid; ++i) {
+		unsigned long idx = i - rmem->fuid;
+		struct page *spage, *dpage;
+
+		if (test_bit(HMM_PFN_LMEM_UPTODATE, &rmem->pfns[idx])) {
+			/* This lmem page is already uptodate. */
+			continue;
+		}
+		spage = ddevice->rmem_pages[drmem->rmem_idx[idx]];
+		dpage = hmm_pfn_to_page(rmem->pfns[idx]);
+		if (!dpage) {
+			return ERR_PTR(-EINVAL);
+		}
+		copy_highpage(dpage, spage);
+		hmm_pfn_set_lmem_uptodate(&rmem->pfns[idx]);
+	}
+
+	return NULL;
+}
+
+struct hmm_fence *hmm_dummy_lmem_to_rmem(struct hmm_rmem *rmem,
+					 unsigned long fuid,
+					 unsigned long luid)
+{
+	struct hmm_dummy_device *ddevice;
+	struct hmm_dummy_rmem *drmem;
+	unsigned long i;
+
+	ddevice = container_of(rmem->device, struct hmm_dummy_device, device);
+	drmem = container_of(rmem, struct hmm_dummy_rmem, rmem);
+
+	for (i = fuid; i < luid; ++i) {
+		unsigned long idx = i - rmem->fuid;
+		struct page *spage, *dpage;
+
+		if (test_bit(HMM_PFN_RMEM_UPTODATE, &rmem->pfns[idx])) {
+			/* This rmem page is already uptodate. */
+			continue;
+		}
+		dpage = ddevice->rmem_pages[drmem->rmem_idx[idx]];
+		spage = hmm_pfn_to_page(rmem->pfns[idx]);
+		if (!spage) {
+			return ERR_PTR(-EINVAL);
+		}
+		copy_highpage(dpage, spage);
+		hmm_pfn_set_rmem_uptodate(&rmem->pfns[idx]);
+	}
+
+	return NULL;
+}
+
+static int hmm_dummy_rmem_do_split(struct hmm_rmem *rmem,
+				   unsigned long fuid,
+				   unsigned long luid)
+{
+	struct hmm_dummy_rmem *drmem, *dnew;
+	struct hmm_fault fault;
+	struct hmm_rmem *new;
+	unsigned long i, pgoff, npages;
+	int ret;
+
+	drmem = container_of(rmem, struct hmm_dummy_rmem, rmem);
+	npages = (luid - fuid);
+	pgoff = (fuid == rmem->fuid) ? 0 : fuid - rmem->fuid;
+	fault.faddr = 0;
+	fault.laddr = npages << PAGE_SHIFT;
+	new = hmm_dummy_rmem_alloc(rmem->device, &fault);
+	if (IS_ERR(new)) {
+		return PTR_ERR(new);
+	}
+	dnew = container_of(new, struct hmm_dummy_rmem, rmem);
+
+	new->fuid = fuid;
+	new->luid = luid;
+	ret = hmm_rmem_split_new(rmem, new);
+	if (ret) {
+		return ret;
+	}
+
+	/* Update the rmem it is fine to hold no lock as no one else can access
+	 * both of this rmem object as long as the range are reserved.
+	 */
+	for (i = 0; i < npages; ++i) {
+		dnew->rmem_idx[i] = drmem->rmem_idx[i + pgoff];
+	}
+	if (!pgoff) {
+		for (i = 0; i < (rmem->luid - rmem->fuid); ++i) {
+			drmem->rmem_idx[i] = drmem->rmem_idx[i + npages];
+		}
+	}
+
+	return 0;
+}
+
+static int hmm_dummy_rmem_split(struct hmm_rmem *rmem,
+				unsigned long fuid,
+				unsigned long luid)
+{
+	int ret;
+
+	if (fuid > rmem->fuid) {
+		ret = hmm_dummy_rmem_do_split(rmem, rmem->fuid, fuid);
+		if (ret) {
+			return ret;
+		}
+	}
+	if (luid < rmem->luid) {
+		ret = hmm_dummy_rmem_do_split(rmem, luid, rmem->luid);
+		if (ret) {
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
+static void hmm_dummy_rmem_destroy(struct hmm_rmem *rmem)
+{
+	struct hmm_dummy_rmem *drmem;
+
+	drmem = container_of(rmem, struct hmm_dummy_rmem, rmem);
+	hmm_dummy_rmem_free(drmem);
+	kfree(drmem);
+}
+
 static const struct hmm_device_ops hmm_dummy_ops = {
 	.device_destroy		= &hmm_dummy_device_destroy,
 	.mirror_release		= &hmm_dummy_mirror_release,
@@ -638,6 +1005,14 @@ static const struct hmm_device_ops hmm_dummy_ops = {
 	.fence_wait		= &hmm_dummy_fence_wait,
 	.lmem_update		= &hmm_dummy_lmem_update,
 	.lmem_fault		= &hmm_dummy_lmem_fault,
+	.rmem_alloc		= &hmm_dummy_rmem_alloc,
+	.rmem_update		= &hmm_dummy_rmem_update,
+	.rmem_fault		= &hmm_dummy_rmem_fault,
+	.rmem_to_lmem		= &hmm_dummy_rmem_to_lmem,
+	.lmem_to_rmem		= &hmm_dummy_lmem_to_rmem,
+	.rmem_split		= &hmm_dummy_rmem_split,
+	.rmem_split_adjust	= &hmm_dummy_rmem_split,
+	.rmem_destroy		= &hmm_dummy_rmem_destroy,
 };
 
 
@@ -880,7 +1255,7 @@ static ssize_t hmm_dummy_fops_write(struct file *filp,
 		if (!(pldp[pld_idx] & HMM_DUMMY_PTE_WRITE)) {
 			hmm_dummy_pt_unmap(&pt_map);
 			mutex_unlock(&dmirror->mutex);
-				goto fault;
+			goto fault;
 		}
 		pldp[pld_idx] |= HMM_DUMMY_PTE_DIRTY;
 		page = hmm_dummy_pte_to_page(pldp[pld_idx]);
@@ -964,8 +1339,11 @@ static long hmm_dummy_fops_unlocked_ioctl(struct file *filp,
 					  unsigned int command,
 					  unsigned long arg)
 {
+	struct hmm_dummy_migrate dmigrate;
 	struct hmm_dummy_device *ddevice;
 	struct hmm_dummy_mirror *dmirror;
+	struct hmm_mirror *mirror;
+	void __user *uarg = (void __user *)arg;
 	unsigned minor;
 	int ret;
 
@@ -1011,6 +1389,31 @@ static long hmm_dummy_fops_unlocked_ioctl(struct file *filp,
 				       "mirroring address space of %d\n",
 				       dmirror->pid);
 		return 0;
+	case HMM_DUMMY_MIGRATE_TO_RMEM:
+		mutex_lock(&ddevice->mutex);
+		dmirror = ddevice->dmirrors[minor];
+		if (!dmirror) {
+			mutex_unlock(&ddevice->mutex);
+			return -EINVAL;
+		}
+		mirror = &dmirror->mirror;
+		mutex_unlock(&ddevice->mutex);
+
+		if (copy_from_user(&dmigrate, uarg, sizeof(dmigrate))) {
+			return -EFAULT;
+		}
+
+		ret = hmm_dummy_pt_alloc(dmirror,
+					 dmigrate.faddr,
+					 dmigrate.laddr);
+		if (ret) {
+			return ret;
+		}
+
+		ret = hmm_dummy_mirror_lmem_to_rmem(dmirror,
+						    dmigrate.faddr,
+						    dmigrate.laddr);
+		return ret;
 	default:
 		return -EINVAL;
 	}
@@ -1034,7 +1437,31 @@ static const struct file_operations hmm_dummy_fops = {
  */
 static int hmm_dummy_device_init(struct hmm_dummy_device *ddevice)
 {
-	int ret, i;
+	struct page **pages;
+	unsigned long *bitmap;
+	int ret, i, npages;
+
+	npages = HMM_DUMMY_DEVICE_RMEM_SIZE >> PAGE_SHIFT;
+	bitmap = kzalloc(BITS_TO_LONGS(npages) * sizeof(long), GFP_KERNEL);
+	if (!bitmap) {
+		return -ENOMEM;
+	}
+	pages = kzalloc(npages * sizeof(void*), GFP_KERNEL);
+	if (!pages) {
+		kfree(bitmap);
+		return -ENOMEM;
+	}
+	for (i = 0; i < npages; ++i) {
+		pages[i] = alloc_page(GFP_KERNEL);
+		if (!pages[i]) {
+			while (i--) {
+				__free_page(pages[i]);
+			}
+			kfree(bitmap);
+			kfree(pages);
+			return -ENOMEM;
+		}
+	}
 
 	ret = alloc_chrdev_region(&ddevice->dev, 0,
 				  HMM_DUMMY_DEVICE_MAX_MIRRORS,
@@ -1066,15 +1493,23 @@ static int hmm_dummy_device_init(struct hmm_dummy_device *ddevice)
 		goto error;
 	}
 
+	ddevice->rmem_bitmap = bitmap;
+	ddevice->rmem_pages = pages;
+
 	return 0;
 
 error:
+	for (i = 0; i < npages; ++i) {
+		__free_page(pages[i]);
+	}
+	kfree(bitmap);
+	kfree(pages);
 	return ret;
 }
 
 static void hmm_dummy_device_fini(struct hmm_dummy_device *ddevice)
 {
-	unsigned i;
+	unsigned i, npages;
 
 	/* First finish hmm. */
 	for (i = 0; i < HMM_DUMMY_DEVICE_MAX_MIRRORS; i++) {
@@ -1092,6 +1527,13 @@ static void hmm_dummy_device_fini(struct hmm_dummy_device *ddevice)
 	cdev_del(&ddevice->cdev);
 	unregister_chrdev_region(ddevice->dev,
 				 HMM_DUMMY_DEVICE_MAX_MIRRORS);
+
+	npages = HMM_DUMMY_DEVICE_RMEM_SIZE >> PAGE_SHIFT;
+	for (i = 0; i < npages; ++i) {
+		__free_page(ddevice->rmem_pages[i]);
+	}
+	kfree(ddevice->rmem_bitmap);
+	kfree(ddevice->rmem_pages);
 }
 
 static int __init hmm_dummy_init(void)
diff --git a/include/uapi/linux/hmm_dummy.h b/include/uapi/linux/hmm_dummy.h
index 16ae0d3..027c453 100644
--- a/include/uapi/linux/hmm_dummy.h
+++ b/include/uapi/linux/hmm_dummy.h
@@ -29,6 +29,12 @@
 #include <linux/irqnr.h>
 
 /* Expose the address space of the calling process through hmm dummy dev file */
-#define HMM_DUMMY_EXPOSE_MM	_IO( 'R', 0x00 )
+#define HMM_DUMMY_EXPOSE_MM		_IO( 'R', 0x00 )
+#define HMM_DUMMY_MIGRATE_TO_RMEM	_IO( 'R', 0x01 )
+
+struct hmm_dummy_migrate {
+	uint64_t		faddr;
+	uint64_t		laddr;
+};
 
 #endif /* _UAPI_LINUX_RANDOM_H */
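
A minimal user-space sketch of how the two ioctls above could be exercised
(illustrative only: the device node path is an assumption, the driver
page-aligns the range itself, and cleanup on error is elided):

	#include <fcntl.h>
	#include <stdint.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/ioctl.h>
	#include <unistd.h>
	#include <linux/hmm_dummy.h>

	int main(void)
	{
		size_t len = 4UL << 20;			/* 4MB test range */
		char *buf = aligned_alloc(4096, len);
		struct hmm_dummy_migrate dmigrate;
		int fd;

		/* Device node name is an assumption, not set by this patch. */
		fd = open("/dev/hmm_dummy_device0", O_RDWR);
		if (fd < 0 || !buf)
			return 1;

		/* Start mirroring the calling process address space. */
		if (ioctl(fd, HMM_DUMMY_EXPOSE_MM))
			return 1;

		memset(buf, 0xab, len);

		/* Ask the dummy device to migrate the buffer to its fake rmem. */
		dmigrate.faddr = (uintptr_t)buf;
		dmigrate.laddr = (uintptr_t)buf + len;
		if (ioctl(fd, HMM_DUMMY_MIGRATE_TO_RMEM, &dmigrate))
			return 1;

		/* A later CPU touch should fault the range back to local memory. */
		buf[0] = 0;
		close(fd);
		return 0;
	}
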
-- 
1.9.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 11/11] hmm/dummy_driver: add support for fake remote memory using pages.
@ 2014-05-02 13:52   ` j.glisse
  0 siblings, 0 replies; 107+ messages in thread
From: j.glisse @ 2014-05-02 13:52 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel; +Cc: Jérôme Glisse

From: Jérôme Glisse <jglisse@redhat.com>

Fake the existence of remote memory using preallocated pages and
demonstrate how to use the hmm api related to remote memory.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 drivers/char/hmm_dummy.c       | 450 ++++++++++++++++++++++++++++++++++++++++-
 include/uapi/linux/hmm_dummy.h |   8 +-
 2 files changed, 453 insertions(+), 5 deletions(-)

diff --git a/drivers/char/hmm_dummy.c b/drivers/char/hmm_dummy.c
index e87dc7c..2443374 100644
--- a/drivers/char/hmm_dummy.c
+++ b/drivers/char/hmm_dummy.c
@@ -48,6 +48,8 @@
 
 #define HMM_DUMMY_DEVICE_NAME		"hmm_dummy_device"
 #define HMM_DUMMY_DEVICE_MAX_MIRRORS	4
+#define HMM_DUMMY_DEVICE_RMEM_SIZE	(32UL << 20UL)
+#define HMM_DUMMY_DEVICE_RMEM_NBITS	(HMM_DUMMY_DEVICE_RMEM_SIZE >> PAGE_SHIFT)
 
 struct hmm_dummy_device;
 
@@ -73,8 +75,16 @@ struct hmm_dummy_device {
 	/* device file mapping tracking (keep track of all vma) */
 	struct hmm_dummy_mirror	*dmirrors[HMM_DUMMY_DEVICE_MAX_MIRRORS];
 	struct address_space	*fmapping[HMM_DUMMY_DEVICE_MAX_MIRRORS];
+	struct page		**rmem_pages;
+	unsigned long		*rmem_bitmap;
 };
 
+struct hmm_dummy_rmem {
+	struct hmm_rmem		rmem;
+	unsigned long		fuid;
+	unsigned long		luid;
+	uint16_t		*rmem_idx;
+};
 
 /* We only create 2 device to show the inter device rmem sharing/migration
  * capabilities.
@@ -482,6 +492,51 @@ static void hmm_dummy_pt_free(struct hmm_dummy_mirror *dmirror,
 }
 
 
+/* hmm_dummy_rmem - dummy remote memory using system memory pages
+ *
+ * Helper function to allocate fake remote memory out of the device rmem_pages.
+ */
+static void hmm_dummy_rmem_free(struct hmm_dummy_rmem *drmem)
+{
+	struct hmm_dummy_device *ddevice;
+	struct hmm_rmem *rmem = &drmem->rmem;
+	unsigned long i, npages;
+
+	npages = (rmem->luid - rmem->fuid);
+	ddevice = container_of(rmem->device, struct hmm_dummy_device, device);
+	mutex_lock(&ddevice->mutex);
+	for (i = 0; i < npages; ++i) {
+		clear_bit(drmem->rmem_idx[i], ddevice->rmem_bitmap);
+	}
+	mutex_unlock(&ddevice->mutex);
+
+	kfree(drmem->rmem_idx);
+	drmem->rmem_idx = NULL;
+}
+
+static struct hmm_dummy_rmem *hmm_dummy_rmem_new(void)
+{
+	struct hmm_dummy_rmem *drmem;
+
+	drmem = kzalloc(sizeof(*drmem), GFP_KERNEL);
+	return drmem;
+}
+
+static int hmm_dummy_mirror_lmem_to_rmem(struct hmm_dummy_mirror *dmirror,
+					 unsigned long faddr,
+					 unsigned long laddr)
+{
+	struct hmm_mirror *mirror = &dmirror->mirror;
+	struct hmm_fault fault;
+	int ret;
+
+	fault.faddr = faddr & PAGE_MASK;
+	fault.laddr = PAGE_ALIGN(laddr);
+	ret = hmm_migrate_lmem_to_rmem(&fault, mirror);
+	return ret;
+}
+
+
 /* hmm_ops - hmm callback for the hmm dummy driver.
  *
  * Below are the various callback that the hmm api require for a device. The
@@ -574,7 +629,7 @@ static struct hmm_fence *hmm_dummy_lmem_update(struct hmm_mirror *mirror,
 
 			page = hmm_dummy_pte_to_page(*pldp);
 			if (page) {
-				set_page_dirty(page);
+				set_page_dirty_lock(page);
 			}
 		}
 		*pldp &= ~HMM_DUMMY_PTE_DIRTY;
@@ -631,6 +686,318 @@ static int hmm_dummy_lmem_fault(struct hmm_mirror *mirror,
 	return 0;
 }
 
+static struct hmm_rmem *hmm_dummy_rmem_alloc(struct hmm_device *device,
+					     struct hmm_fault *fault)
+{
+	struct hmm_dummy_device *ddevice;
+	struct hmm_dummy_rmem *drmem;
+	struct hmm_rmem *rmem;
+	unsigned long i, npages;
+
+	ddevice = container_of(device, struct hmm_dummy_device, device);
+
+	drmem = hmm_dummy_rmem_new();
+	if (drmem == NULL) {
+		return ERR_PTR(-ENOMEM);
+	}
+	rmem = &drmem->rmem;
+
+	npages = (fault->laddr - fault->faddr) >> PAGE_SHIFT;
+	drmem->rmem_idx = kmalloc(npages * sizeof(uint16_t), GFP_KERNEL);
+	if (drmem->rmem_idx == NULL) {
+		kfree(drmem);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	mutex_lock(&ddevice->mutex);
+	for (i = 0; i < npages; ++i) {
+		unsigned long r;
+
+		r = find_first_zero_bit(ddevice->rmem_bitmap,
+					HMM_DUMMY_DEVICE_RMEM_NBITS);
+		if (r >= HMM_DUMMY_DEVICE_RMEM_NBITS) {
+			while (i--) {
+				clear_bit(drmem->rmem_idx[i],
+					  ddevice->rmem_bitmap);
+			}
+			kfree(drmem->rmem_idx);
+			kfree(drmem);
+			mutex_unlock(&ddevice->mutex);
+			return ERR_PTR(-ENOMEM);
+		}
+		set_bit(r, ddevice->rmem_bitmap);
+		drmem->rmem_idx[i] = r;
+	}
+	mutex_unlock(&ddevice->mutex);
+
+	return rmem;
+}
+
+static struct hmm_fence *hmm_dummy_rmem_update(struct hmm_mirror *mirror,
+					       struct hmm_rmem *rmem,
+					       unsigned long faddr,
+					       unsigned long laddr,
+					       unsigned long fuid,
+					       enum hmm_etype etype,
+					       bool dirty)
+{
+	struct hmm_dummy_mirror *dmirror;
+	struct hmm_dummy_pt_map pt_map = {0};
+	unsigned long addr, i, mask, or, idx;
+
+	dmirror = container_of(mirror, struct hmm_dummy_mirror, mirror);
+	pt_map.dmirror = dmirror;
+	idx = fuid - rmem->fuid;
+
+	/* Sanity check for debugging; a real device driver does not have to do this. */
+	switch (etype) {
+	case HMM_UNREGISTER:
+	case HMM_UNMAP:
+	case HMM_MUNMAP:
+	case HMM_MPROT_WONLY:
+	case HMM_MIGRATE_TO_RMEM:
+	case HMM_MIGRATE_TO_LMEM:
+		mask = 0;
+		or = 0;
+		break;
+	case HMM_MPROT_RONLY:
+	case HMM_WRITEBACK:
+		mask = ~HMM_DUMMY_PTE_WRITE;
+		or = 0;
+		break;
+	case HMM_MPROT_RANDW:
+		mask = -1L;
+		or = HMM_DUMMY_PTE_WRITE;
+		break;
+	default:
+		printk(KERN_ERR "%4d:%s invalid event type %d\n",
+		       __LINE__, __func__, etype);
+		return ERR_PTR(-EIO);
+	}
+
+	mutex_lock(&dmirror->mutex);
+	for (i = 0, addr = faddr; addr < laddr; ++i, addr += PAGE_SIZE, ++idx) {
+		unsigned long *pldp;
+
+		pldp = hmm_dummy_pt_pld_map(&pt_map, addr);
+		if (!pldp) {
+			continue;
+		}
+		if (dirty && ((*pldp) & HMM_DUMMY_PTE_DIRTY)) {
+			hmm_pfn_set_dirty(&rmem->pfns[idx]);
+		}
+		*pldp &= ~HMM_DUMMY_PTE_DIRTY;
+		*pldp &= mask;
+		*pldp |= or;
+	}
+	hmm_dummy_pt_unmap(&pt_map);
+
+	switch (etype) {
+	case HMM_UNREGISTER:
+	case HMM_MUNMAP:
+		hmm_dummy_pt_free(dmirror, faddr, laddr);
+		break;
+	default:
+		break;
+	}
+	mutex_unlock(&dmirror->mutex);
+	return NULL;
+}
+
+static int hmm_dummy_rmem_fault(struct hmm_mirror *mirror,
+				struct hmm_rmem *rmem,
+				unsigned long faddr,
+				unsigned long laddr,
+				unsigned long fuid,
+				struct hmm_fault *fault)
+{
+	struct hmm_dummy_mirror *dmirror;
+	struct hmm_dummy_device *ddevice;
+	struct hmm_dummy_pt_map pt_map = {0};
+	struct hmm_dummy_rmem *drmem;
+	unsigned long i;
+	bool write = fault ? !!(fault->flags & HMM_FAULT_WRITE) : false;
+
+	dmirror = container_of(mirror, struct hmm_dummy_mirror, mirror);
+	drmem = container_of(rmem, struct hmm_dummy_rmem, rmem);
+	ddevice = dmirror->ddevice;
+	pt_map.dmirror = dmirror;
+
+	mutex_lock(&dmirror->mutex);
+	for (i = fuid; faddr < laddr; ++i, faddr += PAGE_SIZE) {
+		unsigned long *pldp, pld_idx, pfn, idx = i - rmem->fuid;
+
+		pldp = hmm_dummy_pt_pld_map(&pt_map, faddr);
+		if (!pldp) {
+			continue;
+		}
+		pfn = page_to_pfn(ddevice->rmem_pages[drmem->rmem_idx[idx]]);
+		pld_idx = hmm_dummy_pld_index(faddr);
+		pldp[pld_idx]  = (pfn << HMM_DUMMY_PFN_SHIFT);
+		if (test_bit(HMM_PFN_WRITE, &rmem->pfns[idx])) {
+			pldp[pld_idx] |=  HMM_DUMMY_PTE_WRITE;
+			hmm_pfn_clear_lmem_uptodate(&rmem->pfns[idx]);
+		}
+		pldp[pld_idx] |= HMM_DUMMY_PTE_VALID_PAGE;
+		if (write && !test_bit(HMM_PFN_WRITE, &rmem->pfns[idx])) {
+			/* Fall back to using the system memory copy. Another
+			 * solution would be to migrate back to system memory.
+			 */
+			hmm_pfn_clear_rmem_uptodate(&rmem->pfns[idx]);
+			if (!test_bit(HMM_PFN_LMEM_UPTODATE, &rmem->pfns[idx])) {
+				struct page *spage, *dpage;
+
+				dpage = hmm_pfn_to_page(rmem->pfns[idx]);
+				spage = ddevice->rmem_pages[drmem->rmem_idx[idx]];
+				copy_highpage(dpage, spage);
+				hmm_pfn_set_lmem_uptodate(&rmem->pfns[idx]);
+			}
+			pfn = rmem->pfns[idx] >> HMM_PFN_SHIFT;
+			pldp[pld_idx]  = (pfn << HMM_DUMMY_PFN_SHIFT);
+			pldp[pld_idx] |= HMM_DUMMY_PTE_WRITE;
+			pldp[pld_idx] |= HMM_DUMMY_PTE_VALID_PAGE;
+		}
+	}
+	hmm_dummy_pt_unmap(&pt_map);
+	mutex_unlock(&dmirror->mutex);
+	return 0;
+}
+
+struct hmm_fence *hmm_dummy_rmem_to_lmem(struct hmm_rmem *rmem,
+					 unsigned long fuid,
+					 unsigned long luid)
+{
+	struct hmm_dummy_device *ddevice;
+	struct hmm_dummy_rmem *drmem;
+	unsigned long i;
+
+	ddevice = container_of(rmem->device, struct hmm_dummy_device, device);
+	drmem = container_of(rmem, struct hmm_dummy_rmem, rmem);
+
+	for (i = fuid; i < luid; ++i) {
+		unsigned long idx = i - rmem->fuid;
+		struct page *spage, *dpage;
+
+		if (test_bit(HMM_PFN_LMEM_UPTODATE, &rmem->pfns[idx])) {
+			/* This lmem page is already uptodate. */
+			continue;
+		}
+		spage = ddevice->rmem_pages[drmem->rmem_idx[idx]];
+		dpage = hmm_pfn_to_page(rmem->pfns[idx]);
+		if (!dpage) {
+			return ERR_PTR(-EINVAL);
+		}
+		copy_highpage(dpage, spage);
+		hmm_pfn_set_lmem_uptodate(&rmem->pfns[idx]);
+	}
+
+	return NULL;
+}
+
+struct hmm_fence *hmm_dummy_lmem_to_rmem(struct hmm_rmem *rmem,
+					 unsigned long fuid,
+					 unsigned long luid)
+{
+	struct hmm_dummy_device *ddevice;
+	struct hmm_dummy_rmem *drmem;
+	unsigned long i;
+
+	ddevice = container_of(rmem->device, struct hmm_dummy_device, device);
+	drmem = container_of(rmem, struct hmm_dummy_rmem, rmem);
+
+	for (i = fuid; i < luid; ++i) {
+		unsigned long idx = i - rmem->fuid;
+		struct page *spage, *dpage;
+
+		if (test_bit(HMM_PFN_RMEM_UPTODATE, &rmem->pfns[idx])) {
+			/* This rmem page is already uptodate. */
+			continue;
+		}
+		dpage = ddevice->rmem_pages[drmem->rmem_idx[idx]];
+		spage = hmm_pfn_to_page(rmem->pfns[idx]);
+		if (!spage) {
+			return ERR_PTR(-EINVAL);
+		}
+		copy_highpage(dpage, spage);
+		hmm_pfn_set_rmem_uptodate(&rmem->pfns[idx]);
+	}
+
+	return NULL;
+}
+
+static int hmm_dummy_rmem_do_split(struct hmm_rmem *rmem,
+				   unsigned long fuid,
+				   unsigned long luid)
+{
+	struct hmm_dummy_rmem *drmem, *dnew;
+	struct hmm_fault fault;
+	struct hmm_rmem *new;
+	unsigned long i, pgoff, npages;
+	int ret;
+
+	drmem = container_of(rmem, struct hmm_dummy_rmem, rmem);
+	npages = (luid - fuid);
+	pgoff = (fuid == rmem->fuid) ? 0 : fuid - rmem->fuid;
+	fault.faddr = 0;
+	fault.laddr = npages << PAGE_SHIFT;
+	new = hmm_dummy_rmem_alloc(rmem->device, &fault);
+	if (IS_ERR(new)) {
+		return PTR_ERR(new);
+	}
+	dnew = container_of(new, struct hmm_dummy_rmem, rmem);
+
+	new->fuid = fuid;
+	new->luid = luid;
+	ret = hmm_rmem_split_new(rmem, new);
+	if (ret) {
+		return ret;
+	}
+
+	/* Update the rmem. It is fine to hold no lock here as no one else can
+	 * access either of these rmem objects as long as the ranges are reserved.
+	 */
+	for (i = 0; i < npages; ++i) {
+		dnew->rmem_idx[i] = drmem->rmem_idx[i + pgoff];
+	}
+	if (!pgoff) {
+		for (i = 0; i < (rmem->luid - rmem->fuid); ++i) {
+			drmem->rmem_idx[i] = drmem->rmem_idx[i + npages];
+		}
+	}
+
+	return 0;
+}
+
+static int hmm_dummy_rmem_split(struct hmm_rmem *rmem,
+				unsigned long fuid,
+				unsigned long luid)
+{
+	int ret;
+
+	if (fuid > rmem->fuid) {
+		ret = hmm_dummy_rmem_do_split(rmem, rmem->fuid, fuid);
+		if (ret) {
+			return ret;
+		}
+	}
+	if (luid < rmem->luid) {
+		ret = hmm_dummy_rmem_do_split(rmem, luid, rmem->luid);
+		if (ret) {
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
+static void hmm_dummy_rmem_destroy(struct hmm_rmem *rmem)
+{
+	struct hmm_dummy_rmem *drmem;
+
+	drmem = container_of(rmem, struct hmm_dummy_rmem, rmem);
+	hmm_dummy_rmem_free(drmem);
+	kfree(drmem);
+}
+
 static const struct hmm_device_ops hmm_dummy_ops = {
 	.device_destroy		= &hmm_dummy_device_destroy,
 	.mirror_release		= &hmm_dummy_mirror_release,
@@ -638,6 +1005,14 @@ static const struct hmm_device_ops hmm_dummy_ops = {
 	.fence_wait		= &hmm_dummy_fence_wait,
 	.lmem_update		= &hmm_dummy_lmem_update,
 	.lmem_fault		= &hmm_dummy_lmem_fault,
+	.rmem_alloc		= &hmm_dummy_rmem_alloc,
+	.rmem_update		= &hmm_dummy_rmem_update,
+	.rmem_fault		= &hmm_dummy_rmem_fault,
+	.rmem_to_lmem		= &hmm_dummy_rmem_to_lmem,
+	.lmem_to_rmem		= &hmm_dummy_lmem_to_rmem,
+	.rmem_split		= &hmm_dummy_rmem_split,
+	.rmem_split_adjust	= &hmm_dummy_rmem_split,
+	.rmem_destroy		= &hmm_dummy_rmem_destroy,
 };
 
 
@@ -880,7 +1255,7 @@ static ssize_t hmm_dummy_fops_write(struct file *filp,
 		if (!(pldp[pld_idx] & HMM_DUMMY_PTE_WRITE)) {
 			hmm_dummy_pt_unmap(&pt_map);
 			mutex_unlock(&dmirror->mutex);
-				goto fault;
+			goto fault;
 		}
 		pldp[pld_idx] |= HMM_DUMMY_PTE_DIRTY;
 		page = hmm_dummy_pte_to_page(pldp[pld_idx]);
@@ -964,8 +1339,11 @@ static long hmm_dummy_fops_unlocked_ioctl(struct file *filp,
 					  unsigned int command,
 					  unsigned long arg)
 {
+	struct hmm_dummy_migrate dmigrate;
 	struct hmm_dummy_device *ddevice;
 	struct hmm_dummy_mirror *dmirror;
+	struct hmm_mirror *mirror;
+	void __user *uarg = (void __user *)arg;
 	unsigned minor;
 	int ret;
 
@@ -1011,6 +1389,31 @@ static long hmm_dummy_fops_unlocked_ioctl(struct file *filp,
 				       "mirroring address space of %d\n",
 				       dmirror->pid);
 		return 0;
+	case HMM_DUMMY_MIGRATE_TO_RMEM:
+		mutex_lock(&ddevice->mutex);
+		dmirror = ddevice->dmirrors[minor];
+		if (!dmirror) {
+			mutex_unlock(&ddevice->mutex);
+			return -EINVAL;
+		}
+		mirror = &dmirror->mirror;
+		mutex_unlock(&ddevice->mutex);
+
+		if (copy_from_user(&dmigrate, uarg, sizeof(dmigrate))) {
+			return -EFAULT;
+		}
+
+		ret = hmm_dummy_pt_alloc(dmirror,
+					 dmigrate.faddr,
+					 dmigrate.laddr);
+		if (ret) {
+			return ret;
+		}
+
+		ret = hmm_dummy_mirror_lmem_to_rmem(dmirror,
+						    dmigrate.faddr,
+						    dmigrate.laddr);
+		return ret;
 	default:
 		return -EINVAL;
 	}
@@ -1034,7 +1437,31 @@ static const struct file_operations hmm_dummy_fops = {
  */
 static int hmm_dummy_device_init(struct hmm_dummy_device *ddevice)
 {
-	int ret, i;
+	struct page **pages;
+	unsigned long *bitmap;
+	int ret, i, npages;
+
+	npages = HMM_DUMMY_DEVICE_RMEM_SIZE >> PAGE_SHIFT;
+	bitmap = kzalloc(BITS_TO_LONGS(npages) * sizeof(long), GFP_KERNEL);
+	if (!bitmap) {
+		return -ENOMEM;
+	}
+	pages = kzalloc(npages * sizeof(void*), GFP_KERNEL);
+	if (!pages) {
+		kfree(bitmap);
+		return -ENOMEM;
+	}
+	for (i = 0; i < npages; ++i) {
+		pages[i] = alloc_page(GFP_KERNEL);
+		if (!pages[i]) {
+			while (i--) {
+				__free_page(pages[i]);
+			}
+			kfree(bitmap);
+			kfree(pages);
+			return -ENOMEM;
+		}
+	}
 
 	ret = alloc_chrdev_region(&ddevice->dev, 0,
 				  HMM_DUMMY_DEVICE_MAX_MIRRORS,
@@ -1066,15 +1493,23 @@ static int hmm_dummy_device_init(struct hmm_dummy_device *ddevice)
 		goto error;
 	}
 
+	ddevice->rmem_bitmap = bitmap;
+	ddevice->rmem_pages = pages;
+
 	return 0;
 
 error:
+	for (i = 0; i < npages; ++i) {
+		__free_page(pages[i]);
+	}
+	kfree(bitmap);
+	kfree(pages);
 	return ret;
 }
 
 static void hmm_dummy_device_fini(struct hmm_dummy_device *ddevice)
 {
-	unsigned i;
+	unsigned i, npages;
 
 	/* First finish hmm. */
 	for (i = 0; i < HMM_DUMMY_DEVICE_MAX_MIRRORS; i++) {
@@ -1092,6 +1527,13 @@ static void hmm_dummy_device_fini(struct hmm_dummy_device *ddevice)
 	cdev_del(&ddevice->cdev);
 	unregister_chrdev_region(ddevice->dev,
 				 HMM_DUMMY_DEVICE_MAX_MIRRORS);
+
+	npages = HMM_DUMMY_DEVICE_RMEM_SIZE >> PAGE_SHIFT;
+	for (i = 0; i < npages; ++i) {
+		__free_page(ddevice->rmem_pages[i]);
+	}
+	kfree(ddevice->rmem_bitmap);
+	kfree(ddevice->rmem_pages);
 }
 
 static int __init hmm_dummy_init(void)
diff --git a/include/uapi/linux/hmm_dummy.h b/include/uapi/linux/hmm_dummy.h
index 16ae0d3..027c453 100644
--- a/include/uapi/linux/hmm_dummy.h
+++ b/include/uapi/linux/hmm_dummy.h
@@ -29,6 +29,12 @@
 #include <linux/irqnr.h>
 
 /* Expose the address space of the calling process through hmm dummy dev file */
-#define HMM_DUMMY_EXPOSE_MM	_IO( 'R', 0x00 )
+#define HMM_DUMMY_EXPOSE_MM		_IO( 'R', 0x00 )
+#define HMM_DUMMY_MIGRATE_TO_RMEM	_IO( 'R', 0x01 )
+
+struct hmm_dummy_migrate {
+	uint64_t		faddr;
+	uint64_t		laddr;
+};
 
 #endif /* _UAPI_LINUX_RANDOM_H */
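
A quick sanity check of the pool geometry used by hmm_dummy_device_init()
above, assuming 4KB pages (the figures below are derived, not taken from the
patch):

	/* HMM_DUMMY_DEVICE_RMEM_SIZE = 32UL << 20        -> 32MB pool          */
	/* npages = 32MB >> PAGE_SHIFT(12)                -> 8192 backing pages */
	/* bitmap = BITS_TO_LONGS(8192) * sizeof(long)    -> 1KB of bits        */
	/* pages  = 8192 * sizeof(struct page *)          -> 64KB of pointers   */
	/* 8192 also fits in a uint16_t, hence the rmem_idx[] element type.     */
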
-- 
1.9.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply related	[flat|nested] 107+ messages in thread

* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
  2014-05-02 13:51 ` j.glisse
@ 2014-05-06 10:29   ` Peter Zijlstra
  -1 siblings, 0 replies; 107+ messages in thread
From: Peter Zijlstra @ 2014-05-06 10:29 UTC (permalink / raw)
  To: j.glisse
  Cc: linux-mm, linux-kernel, linux-fsdevel, Mel Gorman,
	H. Peter Anvin, Andrew Morton, Linda Wang, Kevin E Martin,
	Jerome Glisse, Andrea Arcangeli, Johannes Weiner, Larry Woodman,
	Rik van Riel, Dave Airlie, Jeff Law, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Hagga

[-- Attachment #1: Type: text/plain, Size: 21640 bytes --]


So you forgot to CC Linus; Linus has expressed some dislike for
preemptible mmu_notifiers in the recent past:

  https://lkml.org/lkml/2013/9/30/385

And here you're proposing to add dependencies on it.

Left the original msg intact for the new Cc's.

On Fri, May 02, 2014 at 09:51:59AM -0400, j.glisse@gmail.com wrote:
> In a nutshell:
> 
> The heterogeneous memory management (hmm) patchset implement a new api that
> sit on top of the mmu notifier api. It provides a simple api to device driver
> to mirror a process address space without having to lock or take reference on
> page and block them from being reclam or migrated. Any changes on a process
> address space is mirrored to the device page table by the hmm code. To achieve
> this not only we need each driver to implement a set of callback functions but
> hmm also interface itself in many key location of the mm code and fs code.
> Moreover hmm allow to migrate range of memory to the device remote memory to
> take advantages of its lower latency and higher bandwidth.
> 
> The why:
> 
> We want to be able to mirror a process address space so that compute api such
> as OpenCL or other similar api can start using the exact same address space on
> the GPU as on the CPU. This will greatly simplify usages of those api. Moreover
> we believe that we will see more and more specialize unit functions that will
> want to mirror process address using their own mmu.
> 
> To achieve this hmm requires :
>  A.1 - Hardware requirements
>  A.2 - sleeping inside mmu_notifier
>  A.3 - context information for mmu_notifier callback (patch 1 and 2)
>  A.4 - new helper function for memcg (patch 5)
>  A.5 - special swap type and fault handling code
>  A.6 - file backed memory and filesystem changes
>  A.7 - The write back expectation
> 
> While avoiding :
>  B.1 - No new page flag
>  B.2 - No special page reclamation code
> 
> Finally the rest of this email deals with :
>  C.1 - Alternative designs
>  C.2 - Hardware solution
>  C.3 - Routines marked EXPORT_SYMBOL
>  C.4 - Planned features
>  C.5 - Getting upstream
> 
> But first patchlist :
> 
>  0001 - Clarify the use of TTU_UNMAP as being done for VMSCAN or POISONING
>  0002 - Give context information to mmu_notifier callback ie why the callback
>         is made for (because of munmap call, or page migration, ...).
>  0003 - Provide the vma for which the invalidation is happening to mmu_notifier
>         callback. This is mostly and optimization to avoid looking up again the
>         vma inside the mmu_notifier callback.
>  0004 - Add new helper to the generic interval tree (which use rb tree).
>  0005 - Add new helper to memcg so that anonymous page can be accounted as well
>         as unaccounted without a page struct. Also add a new helper function to
>         transfer a charge to a page (charge which have been accounted without a
>         struct page in the first place).
>  0006 - Introduce the hmm basic code to support simple device mirroring of the
>         address space. It is fully functional modulo some missing bit (guard or
>         huge page and few other small corner cases).
>  0007 - Introduce support for migrating anonymous memory to device memory. This
>         involve introducing a new special swap type and teach the mm page fault
>         code about hmm.
>  0008 - Introduce support for migrating shared or private memory that is backed
>         by a file. This is way more complex than anonymous case as it needs to
>         synchronize with and exclude other kernel code path that might try to
>         access those pages.
>  0009 - Add hmm support to ext4 filesystem.
>  0010 - Introduce a simple dummy driver that showcase use of the hmm api.
>  0011 - Add support for remote memory to the dummy driver.
> 
> I believe that patch 1, 2, 3 are use full on their own as they could help fix
> some kvm issues (see https://lkml.org/lkml/2014/1/15/125) and they do not
> modify behavior of any current code (except that patch 3 might result in a
> larger number of call to mmu_notifier as many as there is different vma for
> a range).
> 
> Other patches have many rough edges but we would like to validate our design
> and see what we need to change before smoothing out any of them.
> 
> 
> A.1 - Hardware requirements :
> 
> The hardware must have its own mmu with a page table per process it wants to
> mirror. The device mmu mandatory features are :
>   - per page read only flag.
>   - page fault support that stop/suspend hardware thread and support resuming
>     those hardware thread once the page fault have been serviced.
>   - same number of bits for the virtual address as the target architecture (for
>     instance 48 bits on current AMD 64).
> 
> Advanced optional features :
>   - per page dirty bit (indicating the hardware did write to the page).
>   - per page access bit (indicating the hardware did access the page).
> 
> 
> A.2 - Sleeping in mmu notifier callback :
> 
> Because update device mmu might need to sleep, either for taking device driver
> lock (which might be consider fixable) or simply because invalidating the mmu
> might take several hundred millisecond and might involve allocating device or
> driver resources to perform the operation any of which might require to sleep.
> 
> Thus we need to be able to sleep inside mmu_notifier_invalidate_range_start at
> the very least. Also we need to call to mmu_notifier_change_pte to be bracketed
> by mmu_notifier_invalidate_range_start and mmu_notifier_invalidate_range_end.
> We need this because mmu_notifier_change_pte is call with the anon vma lock
> held (and this is a non sleepable lock).
> 
> 
> A.3 - Context information for mmu_notifier callback :
> 
> There is a need to provide more context information on why a mmu_notifier call
> back does happen. Was it because userspace call munmap ? Or was it because the
> kernel is trying to free some memory ? Or because page is being migrated ?
> 
> The context is provided by using unique enum value associated with call site of
> mmu_notifier functions. The patch here just add the enum value and modify each
> call site to pass along the proper value.
> 
> The context information is important for management of the secondary mmu. For
> instance on a munmap the device driver will want to free all resources used by
> that range (device page table memory). This could as well solve the issue that
> was discussed in this thread https://lkml.org/lkml/2014/1/15/125 kvm can ignore
> mmu_notifier_invalidate_range based on the enum value.
> 
> 
> A.4 - New helper function for memcg :
> 
> To keep memory control working as expect with the introduction of remote memory
> we need to add new helper function so we can account anonymous remote memory as
> if it was backed by a page. We also need to be able to transfer charge from the
> remote memory to pages and we need to be able clear a page cgroup without side
> effect to the memcg.
> 
> The patchset currently does add a new type of memory resource but instead just
> account remote memory as local memory (struct page) is. This is done with the
> minimum amount of change to the memcg code. I believe they are correct.
> 
> It might make sense to introduce a new sub-type of memory down the road so that
> device memory can be included inside the memcg accounting but we choose to not
> do so at first.
> 
> 
> A.5 - Special swap type and fault handling code :
> 
> When some range of address is backed by device memory we need cpu fault to be
> aware of that so it can ask hmm to trigger migration back to local memory. To
> avoid too much code disruption we do so by adding a new special hmm swap type
> that is special cased in various place inside the mm page fault code. Refer to
> patch 7 for details.
> 
> 
> A.6 - File backed memory and filesystem changes :
> 
> Using remote memory for range of address backed by a file is more complex than
> anonymous memory. There are lot more code path that might want to access pages
> that cache a file (for read, write, splice, ...). To avoid disrupting the code
> too much and sleeping inside page cache look up we decided to add hmm support
> on a per filesystem basis. So each filesystem can be teach about hmm and how to
> interact with it correctly.
> 
> The design is relatively simple, the radix tree is updated to use special hmm
> swap entry for any page which is in remote memory. Thus any radix tree look up
> will find the special entry and will know it needs to synchronize itself with
> hmm to access the file.
> 
> There is however subtleties. Updating the radix tree does not guarantee that
> hmm is the sole user of the page, another kernel/user thread might have done a
> radix look up before the radix tree update.
> 
> The solution to this issue is to first update the radix tree, then lock each
> page we are migrating, then unmap it from all the process using it and setting
> its mapping field to NULL so that once we unlock the page all existing code
> will thought that the page was either truncated or reclaimed in both cases all
> existing kernel code path will eith perform new look and see the hmm special
> entry or will just skip the page. Those code path were audited to insure that
> their behavior and expected result are not modified by this.
> 
> However this does not insure us exclusive access to the page. So at first when
> migrating such page to remote memory we map it read only inside the device and
> keep the page around so that both the device copy and the page copy contain the
> same data. If the device wishes to write to this remote memory then it call hmm
> fault code.
> 
> To allow write on remote memory hmm will try to free the page, if the page can
> be free then it means hmm is the unique user of the page and the remote memory
> can safely be written to. If not then this means that the page content might
> still be in use by some other process and the device driver have to choose to
> either wait or use the local memory instead. So local memory page are kept as
> long as there are other user for them. We likely need to hookup some special
> page reclamation code to force reclaiming those pages after a while.
> 
> 
> A.7 - The write back expectation :
> 
> We also wanted to preserve the writeback and dirty balancing as we believe this
> is an important behavior (avoiding dirty content to stay for too long inside
> remote memory without being write back to disk). To avoid constantly migrating
> memory back and forth we decided to use existing page (hmm keep all shared page
> around and never free them for the lifetime of rmem object they are associated
> with) as temporary writeback source. On writeback the remote memory is mapped
> read only on the device and copied back to local memory which is use as source
> for the disk write.
> 
> This design choice can however be seen as counter productive as it means that
> the device using hmm will see its rmem map read only for writeback and then
> will have to wait for writeback to go through. Another choice would be to
> forget writeback while memory is on the device and pretend page are clear but
> this would break fsync and similar API for file that does have part of its
> content inside some device memory.
> 
> Middle ground might be to keep fsync and alike working but to ignore any other
> writeback.
> 
> 
> B.1 - No new page flag :
> 
> While adding a new page flag would certainly help to find a different design to
> implement the hmm feature set. We tried to only think about design that do not
> require such a new flag.
> 
> 
> B.2 - No special page reclamation code :
> 
> This is one of the big issue, should be isolate pages that are actively use
> by a device from the regular lru to a specific lru managed by the hmm code.
> In this patchset we decided to avoid doing so as it would just add complexity
> to already complex code.
> 
> Current code will trigger sleep inside vmscan when trying to reclaim page that
> belong to a process which is mirrored on a device. Is this acceptable or should
> we add a new hmm lru list that would handle all pages used by device in special
> way so that those pages are isolated from the regular page reclamation code.
> 
> 
> C.1 - Alternative designs :
> 
> The current design is the one we believe provide enough ground to support all
> necessary features while keeping complexity as low as possible. However i think
> it is important to state that several others designs were tested and to explain
> why they were discarded.
> 
> D1) One of the first design introduced a secondary page table directly updated
>   by hmm helper functions. Hope was that this secondary page table could be in
>   some way directly use by the device. That was naive ... to say the least.
> 
> D2) The secondary page table with hmm specific format, was another design that
>   we tested. In this one the secondary page table was not intended to be use by
>   the device but was intended to serve as a buffer btw the cpu page table and
>   the device page table. Update to the device page table would use the hmm page
>   table.
> 
>   While this secondary page table allow to track what is actively use and also
>   gather statistics about it. It does require memory, in worst case as much as
>   the cpu page table.
> 
>   Another issue is that synchronization between cpu update and device trying to
>   access this secondary page table was either prone to lock contention. Or was
>   getting awfully complex to avoid locking all while duplicating complexity
>   inside each of the device driver.
> 
>   The killing bullet was however the fact that the code was littered with bug
>   condition about discrepancy between the cpu and the hmm page table.
> 
> D3) Use a structure to track all actively mirrored range per process and per
>   device. This allow to have an exact view of which range of memory is in use
>   by which device.
> 
>   Again this need a lot of memory to track each of the active range and worst
>   case would need more memory than a secondary page table (one struct range per
>   page).
> 
>   Issue here was with the complexity or merging and splitting range on address
>   space changes.
> 
> D4) Use a structure to track all active mirrored range per process (shared by
>   all the devices that mirror the same process). This partially address the
>   memory requirement of D3 but this leave the complexity of range merging and
>   splitting intact.
> 
> The current design is a simplification of D4 in which we only track range of
> memory for memory that have been migrated to device memory. So for any others
> operations hmm directly access the cpu page table and forward the appropriate
> information to the device driver through the hmm api. We might need to go back
> to D4 design or a variation of it for some of the features we want add.
> 
> 
> C.2 - Hardware solution :
> 
> What hmm try to achieve can be partially achieved using hardware solution. Such
> hardware solution is part of PCIE specification with the PASID (process address
> space id) and ATS (address translation service). With both of this PCIE feature
> a device can ask for a virtual address of a given process to be translated into
> its corresponding physical address. To achieve this the IOMMU bridge is capable
> of understanding and walking the cpu page table of a process. See the IOMMUv2
> implementation inside the linux kernel for reference.
> 
> There is two huge restriction with hardware solution to this problem. First an
> obvious one is that you need hardware support. While HMM also require hardware
> support on the GPU side it does not on the architecture side (no requirement on
> IOMMU, or any bridges that are between the GPU and the system memory). This is
> a strong advantages to HMM it only require hardware support to one specific
> part.
> 
> The second restriction is that hardware solution like IOMMUv2 does not permit
> migrating chunk of memory to the device local memory which means under-using
> hardware resources (all discrete GPU comes with fast local memory that can
> have more than ten times the bandwidth of system memory).
> 
> This two reasons alone, are we believe enough to justify hmm usefulness.
> 
> Moreover hmm can work in a hybrid solution where non migrated chunk of memory
> goes through the hardware solution (IOMMUv2 for instance) and only the memory
> that is migrated to the device is handled by the hmm code. The requirement for
> the hardware is minimal, the hardware need to support the PASID & ATS (or any
> other hardware implementation of the same idea) on page granularity basis (it
> could be on the granularity of any level of the device page table so no need
> to populate all levels of the device page table). Which is the best solution
> for the problem.
> 
> 
> C.3 - Routines marked EXPORT_SYMBOL
> 
> As these routines are intended to be referenced in device drivers, they
> are marked EXPORT_SYMBOL as is common practice. This encourages adoption
> of HMM in both GPL and non-GPL drivers, and allows ongoing collaboration
> with one of the primary authors of this idea.
> 
> I think it would be beneficial to include this feature as soon as possible.
> Early collaborators can go to the trouble of fixing and polishing the HMM
> implementation, allowing it to fully bake by the time other drivers start
> implementing features requiring it. We are confident that this API will be
> useful to others as they catch up with supporting hardware.
> 
> 
> C.4 - Planned features :
> 
> We are planning to add various features down the road once we can clear the
> basic design. Most important ones are :
>   - Allowing inter-device migration for compatible devices.
>   - Allowing hmm_rmem without backing storage (simplify some of the driver).
>   - Device specific memcg.
>   - Improvement to allow APU to take advantages of rmem, by hiding the page
>     from the cpu the gpu can use a different memory controller link that do
>     not require cache coherency with the cpu and thus provide higher bandwidth.
>   - Atomic device memory operation by unmapping on the cpu while the device is
>     performing atomic operation (this require hardware mmu to differentiate
>     between regular memory access and atomic memory access and to have a flag
>     that allow atomic memory access on per page basis).
>   - Pining private memory to rmem this would be a useful feature to add and
>     would require addition of a new flag to madvise. Any cpu access would
>     result in SIGBUS for the cpu process.
> 
> 
> C.5 - Getting upstream :
> 
> So what should i do to get this patchset in a mergeable form at least at first
> as a staging feature ? Right now the patchset has few rough edges around huge
> page support and other smaller issues. But as said above i believe that patch
> 1, 2, 3 and 4 can be merge as is as they do not modify current behavior while
> being useful to other.
> 
> Should i implement some secondary hmm specific lru and their associated worker
> thread to avoid having the regular reclaim code to end up sleeping waiting for
> a device to update its page table ?
> 
> Should i go for a totaly different design ? If so what direction ? As stated
> above we explored other design and i listed there flaws.
> 
> Any others things that i need to fix/address/change/improve ?
> 
> Comments and flames are welcome.
> 
> Cheers,
> Jérôme Glisse
> 
> To: <linux-kernel@vger.kernel.org>,
> To: linux-mm <linux-mm@kvack.org>,
> To: <linux-fsdevel@vger.kernel.org>,
> Cc: "Mel Gorman" <mgorman@suse.de>,
> Cc: "H. Peter Anvin" <hpa@zytor.com>,
> Cc: "Peter Zijlstra" <peterz@infradead.org>,
> Cc: "Andrew Morton" <akpm@linux-foundation.org>,
> Cc: "Linda Wang" <lwang@redhat.com>,
> Cc: "Kevin E Martin" <kem@redhat.com>,
> Cc: "Jerome Glisse" <jglisse@redhat.com>,
> Cc: "Andrea Arcangeli" <aarcange@redhat.com>,
> Cc: "Johannes Weiner" <jweiner@redhat.com>,
> Cc: "Larry Woodman" <lwoodman@redhat.com>,
> Cc: "Rik van Riel" <riel@redhat.com>,
> Cc: "Dave Airlie" <airlied@redhat.com>,
> Cc: "Jeff Law" <law@redhat.com>,
> Cc: "Brendan Conoboy" <blc@redhat.com>,
> Cc: "Joe Donohue" <jdonohue@redhat.com>,
> Cc: "Duncan Poole" <dpoole@nvidia.com>,
> Cc: "Sherry Cheung" <SCheung@nvidia.com>,
> Cc: "Subhash Gutti" <sgutti@nvidia.com>,
> Cc: "John Hubbard" <jhubbard@nvidia.com>,
> Cc: "Mark Hairgrove" <mhairgrove@nvidia.com>,
> Cc: "Lucien Dunning" <ldunning@nvidia.com>,
> Cc: "Cameron Buschardt" <cabuschardt@nvidia.com>,
> Cc: "Arvind Gopalakrishnan" <arvindg@nvidia.com>,
> Cc: "Haggai Eran" <haggaie@mellanox.com>,
> Cc: "Or Gerlitz" <ogerlitz@mellanox.com>,
> Cc: "Sagi Grimberg" <sagig@mellanox.com>
> Cc: "Shachar Raindel" <raindel@mellanox.com>,
> Cc: "Liran Liss" <liranl@mellanox.com>,
> Cc: "Roland Dreier" <roland@purestorage.com>,
> Cc: "Sander, Ben" <ben.sander@amd.com>,
> Cc: "Stoner, Greg" <Greg.Stoner@amd.com>,
> Cc: "Bridgman, John" <John.Bridgman@amd.com>,
> Cc: "Mantor, Michael" <Michael.Mantor@amd.com>,
> Cc: "Blinzer, Paul" <Paul.Blinzer@amd.com>,
> Cc: "Morichetti, Laurent" <Laurent.Morichetti@amd.com>,
> Cc: "Deucher, Alexander" <Alexander.Deucher@amd.com>,
> Cc: "Gabbay, Oded" <Oded.Gabbay@amd.com>,
> 

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 107+ messages in thread
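
To make the context-information argument (A.3 above) concrete, here is a
sketch of how a secondary-mmu driver can use the event code in its update
callback. The event values are the hmm_etype values used by the dummy driver
in patch 11; the callback shape and the empty case bodies are assumptions for
illustration, not code from the patches:

	static void example_mirror_update(struct hmm_mirror *mirror,
					  unsigned long faddr,
					  unsigned long laddr,
					  enum hmm_etype etype)
	{
		switch (etype) {
		case HMM_MUNMAP:
		case HMM_UNREGISTER:
			/* The range is gone for good: free the device page
			 * table pages backing it, not only the translations
			 * (the dummy driver does this via hmm_dummy_pt_free()).
			 */
			break;
		case HMM_MPROT_RONLY:
		case HMM_WRITEBACK:
			/* Keep the translations but clear the device write bit. */
			break;
		default:
			/* Reclaim, migration, ...: invalidate [faddr, laddr)
			 * and flush the device tlb.
			 */
			break;
		}
	}
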

* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
@ 2014-05-06 10:29   ` Peter Zijlstra
  0 siblings, 0 replies; 107+ messages in thread
From: Peter Zijlstra @ 2014-05-06 10:29 UTC (permalink / raw)
  To: j.glisse
  Cc: linux-mm, linux-kernel, linux-fsdevel, Mel Gorman,
	H. Peter Anvin, Andrew Morton, Linda Wang, Kevin E Martin,
	Jerome Glisse, Andrea Arcangeli, Johannes Weiner, Larry Woodman,
	Rik van Riel, Dave Airlie, Jeff Law, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Or Gerlitz, Sagi Grimberg,
	Shachar Raindel, Liran Liss, Roland Dreier, Sander, Ben, Stoner,
	Greg, Bridgman, John, Mantor, Michael, Blinzer, Paul, Morichetti,
	Laurent, Deucher, Alexander, Gabbay, Oded, Linus Torvalds,
	Davidlohr Bueso

[-- Attachment #1: Type: text/plain, Size: 21640 bytes --]


So you forgot to CC Linus; Linus has expressed some dislike for
preemptible mmu_notifiers in the recent past:

  https://lkml.org/lkml/2013/9/30/385

And here you're proposing to add dependencies on it.

Left the original msg intact for the new Cc's.

On Fri, May 02, 2014 at 09:51:59AM -0400, j.glisse@gmail.com wrote:
> In a nutshell:
> 
> The heterogeneous memory management (hmm) patchset implement a new api that
> sit on top of the mmu notifier api. It provides a simple api to device driver
> to mirror a process address space without having to lock or take reference on
> page and block them from being reclam or migrated. Any changes on a process
> address space is mirrored to the device page table by the hmm code. To achieve
> this not only we need each driver to implement a set of callback functions but
> hmm also interface itself in many key location of the mm code and fs code.
> Moreover hmm allow to migrate range of memory to the device remote memory to
> take advantages of its lower latency and higher bandwidth.
> 
> The why:
> 
> We want to be able to mirror a process address space so that compute api such
> as OpenCL or other similar api can start using the exact same address space on
> the GPU as on the CPU. This will greatly simplify usages of those api. Moreover
> we believe that we will see more and more specialize unit functions that will
> want to mirror process address using their own mmu.
> 
> To achieve this hmm requires :
>  A.1 - Hardware requirements
>  A.2 - sleeping inside mmu_notifier
>  A.3 - context information for mmu_notifier callback (patch 1 and 2)
>  A.4 - new helper function for memcg (patch 5)
>  A.5 - special swap type and fault handling code
>  A.6 - file backed memory and filesystem changes
>  A.7 - The write back expectation
> 
> While avoiding :
>  B.1 - No new page flag
>  B.2 - No special page reclamation code
> 
> Finally the rest of this email deals with :
>  C.1 - Alternative designs
>  C.2 - Hardware solution
>  C.3 - Routines marked EXPORT_SYMBOL
>  C.4 - Planned features
>  C.5 - Getting upstream
> 
> But first patchlist :
> 
>  0001 - Clarify the use of TTU_UNMAP as being done for VMSCAN or POISONING
>  0002 - Give context information to mmu_notifier callback ie why the callback
>         is made for (because of munmap call, or page migration, ...).
>  0003 - Provide the vma for which the invalidation is happening to mmu_notifier
>         callback. This is mostly and optimization to avoid looking up again the
>         vma inside the mmu_notifier callback.
>  0004 - Add new helper to the generic interval tree (which use rb tree).
>  0005 - Add new helper to memcg so that anonymous page can be accounted as well
>         as unaccounted without a page struct. Also add a new helper function to
>         transfer a charge to a page (charge which have been accounted without a
>         struct page in the first place).
>  0006 - Introduce the hmm basic code to support simple device mirroring of the
>         address space. It is fully functional modulo some missing bit (guard or
>         huge page and few other small corner cases).
>  0007 - Introduce support for migrating anonymous memory to device memory. This
>         involve introducing a new special swap type and teach the mm page fault
>         code about hmm.
>  0008 - Introduce support for migrating shared or private memory that is backed
>         by a file. This is way more complex than anonymous case as it needs to
>         synchronize with and exclude other kernel code path that might try to
>         access those pages.
>  0009 - Add hmm support to ext4 filesystem.
>  0010 - Introduce a simple dummy driver that showcase use of the hmm api.
>  0011 - Add support for remote memory to the dummy driver.
> 
> I believe that patch 1, 2, 3 are use full on their own as they could help fix
> some kvm issues (see https://lkml.org/lkml/2014/1/15/125) and they do not
> modify behavior of any current code (except that patch 3 might result in a
> larger number of call to mmu_notifier as many as there is different vma for
> a range).
> 
> Other patches have many rough edges but we would like to validate our design
> and see what we need to change before smoothing out any of them.
> 
> 
> A.1 - Hardware requirements :
> 
> The hardware must have its own mmu with a page table per process it wants to
> mirror. The device mmu mandatory features are :
>   - per page read only flag.
>   - page fault support that stop/suspend hardware thread and support resuming
>     those hardware thread once the page fault have been serviced.
>   - same number of bits for the virtual address as the target architecture (for
>     instance 48 bits on current AMD 64).
> 
> Advanced optional features :
>   - per page dirty bit (indicating the hardware did write to the page).
>   - per page access bit (indicating the hardware did access the page).
> 
> 
> A.2 - Sleeping in mmu notifier callback :
> 
> Because update device mmu might need to sleep, either for taking device driver
> lock (which might be consider fixable) or simply because invalidating the mmu
> might take several hundred millisecond and might involve allocating device or
> driver resources to perform the operation any of which might require to sleep.
> 
> Thus we need to be able to sleep inside mmu_notifier_invalidate_range_start at
> the very least. Also we need to call to mmu_notifier_change_pte to be bracketed
> by mmu_notifier_invalidate_range_start and mmu_notifier_invalidate_range_end.
> We need this because mmu_notifier_change_pte is call with the anon vma lock
> held (and this is a non sleepable lock).
> 
> 
> A.3 - Context information for mmu_notifier callback :
> 
> There is a need to provide more context information on why a mmu_notifier call
> back does happen. Was it because userspace call munmap ? Or was it because the
> kernel is trying to free some memory ? Or because page is being migrated ?
> 
> The context is provided by using unique enum value associated with call site of
> mmu_notifier functions. The patch here just add the enum value and modify each
> call site to pass along the proper value.
> 
> The context information is important for management of the secondary mmu. For
> instance on a munmap the device driver will want to free all resources used by
> that range (device page table memory). This could as well solve the issue that
> was discussed in this thread https://lkml.org/lkml/2014/1/15/125 kvm can ignore
> mmu_notifier_invalidate_range based on the enum value.
> 
> 
> A.4 - New helper function for memcg :
> 
> To keep memory control working as expect with the introduction of remote memory
> we need to add new helper function so we can account anonymous remote memory as
> if it was backed by a page. We also need to be able to transfer charge from the
> remote memory to pages and we need to be able clear a page cgroup without side
> effect to the memcg.
> 
> The patchset currently does add a new type of memory resource but instead just
> account remote memory as local memory (struct page) is. This is done with the
> minimum amount of change to the memcg code. I believe they are correct.
> 
> It might make sense to introduce a new sub-type of memory down the road so that
> device memory can be included inside the memcg accounting but we choose to not
> do so at first.
> 
> 
> A.5 - Special swap type and fault handling code :
> 
> When some range of address is backed by device memory we need cpu fault to be
> aware of that so it can ask hmm to trigger migration back to local memory. To
> avoid too much code disruption we do so by adding a new special hmm swap type
> that is special cased in various place inside the mm page fault code. Refer to
> patch 7 for details.
> 
> 
> A.6 - File backed memory and filesystem changes :
> 
> Using remote memory for range of address backed by a file is more complex than
> anonymous memory. There are lot more code path that might want to access pages
> that cache a file (for read, write, splice, ...). To avoid disrupting the code
> too much and sleeping inside page cache look up we decided to add hmm support
> on a per filesystem basis. So each filesystem can be teach about hmm and how to
> interact with it correctly.
> 
> The design is relatively simple, the radix tree is updated to use special hmm
> swap entry for any page which is in remote memory. Thus any radix tree look up
> will find the special entry and will know it needs to synchronize itself with
> hmm to access the file.
> 
> There is however subtleties. Updating the radix tree does not guarantee that
> hmm is the sole user of the page, another kernel/user thread might have done a
> radix look up before the radix tree update.
> 
> The solution to this issue is to first update the radix tree, then lock each
> page we are migrating, then unmap it from all the process using it and setting
> its mapping field to NULL so that once we unlock the page all existing code
> will thought that the page was either truncated or reclaimed in both cases all
> existing kernel code path will eith perform new look and see the hmm special
> entry or will just skip the page. Those code path were audited to insure that
> their behavior and expected result are not modified by this.
> 
> However this does not insure us exclusive access to the page. So at first when
> migrating such page to remote memory we map it read only inside the device and
> keep the page around so that both the device copy and the page copy contain the
> same data. If the device wishes to write to this remote memory then it call hmm
> fault code.
> 
> To allow write on remote memory hmm will try to free the page, if the page can
> be free then it means hmm is the unique user of the page and the remote memory
> can safely be written to. If not then this means that the page content might
> still be in use by some other process and the device driver have to choose to
> either wait or use the local memory instead. So local memory page are kept as
> long as there are other user for them. We likely need to hookup some special
> page reclamation code to force reclaiming those pages after a while.
> 
> 
> A.7 - The write back expectation :
> 
> We also wanted to preserve the writeback and dirty balancing as we believe this
> is an important behavior (avoiding dirty content to stay for too long inside
> remote memory without being write back to disk). To avoid constantly migrating
> memory back and forth we decided to use existing page (hmm keep all shared page
> around and never free them for the lifetime of rmem object they are associated
> with) as temporary writeback source. On writeback the remote memory is mapped
> read only on the device and copied back to local memory which is use as source
> for the disk write.
> 
> This design choice can however be seen as counter productive as it means that
> the device using hmm will see its rmem map read only for writeback and then
> will have to wait for writeback to go through. Another choice would be to
> forget writeback while memory is on the device and pretend page are clear but
> this would break fsync and similar API for file that does have part of its
> content inside some device memory.
> 
> Middle ground might be to keep fsync and alike working but to ignore any other
> writeback.
> 
> 
> B.1 - No new page flag :
> 
> While adding a new page flag would certainly help to find a different design to
> implement the hmm feature set. We tried to only think about design that do not
> require such a new flag.
> 
> 
> B.2 - No special page reclamation code :
> 
> This is one of the big issue, should be isolate pages that are actively use
> by a device from the regular lru to a specific lru managed by the hmm code.
> In this patchset we decided to avoid doing so as it would just add complexity
> to already complex code.
> 
> Current code will trigger sleep inside vmscan when trying to reclaim page that
> belong to a process which is mirrored on a device. Is this acceptable or should
> we add a new hmm lru list that would handle all pages used by device in special
> way so that those pages are isolated from the regular page reclamation code.
> 
> 
> C.1 - Alternative designs :
> 
> The current design is the one we believe provide enough ground to support all
> necessary features while keeping complexity as low as possible. However i think
> it is important to state that several others designs were tested and to explain
> why they were discarded.
> 
> D1) One of the first design introduced a secondary page table directly updated
>   by hmm helper functions. Hope was that this secondary page table could be in
>   some way directly use by the device. That was naive ... to say the least.
> 
> D2) The secondary page table with hmm specific format, was another design that
>   we tested. In this one the secondary page table was not intended to be use by
>   the device but was intended to serve as a buffer btw the cpu page table and
>   the device page table. Update to the device page table would use the hmm page
>   table.
> 
>   While this secondary page table allow to track what is actively use and also
>   gather statistics about it. It does require memory, in worst case as much as
>   the cpu page table.
> 
>   Another issue is that synchronization between cpu update and device trying to
>   access this secondary page table was either prone to lock contention. Or was
>   getting awfully complex to avoid locking all while duplicating complexity
>   inside each of the device driver.
> 
>   The killing bullet was however the fact that the code was littered with bug
>   condition about discrepancy between the cpu and the hmm page table.
> 
> D3) Use a structure to track all actively mirrored range per process and per
>   device. This allow to have an exact view of which range of memory is in use
>   by which device.
> 
>   Again this need a lot of memory to track each of the active range and worst
>   case would need more memory than a secondary page table (one struct range per
>   page).
> 
>   Issue here was with the complexity or merging and splitting range on address
>   space changes.
> 
> D4) Use a structure to track all active mirrored range per process (shared by
>   all the devices that mirror the same process). This partially address the
>   memory requirement of D3 but this leave the complexity of range merging and
>   splitting intact.
> 
> The current design is a simplification of D4 in which we only track ranges
> for memory that has been migrated to device memory. For any other operation
> hmm directly accesses the cpu page table and forwards the appropriate
> information to the device driver through the hmm api. We might need to go
> back to the D4 design, or a variation of it, for some of the features we want
> to add.
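> 
> As a minimal user-space sketch of why range tracking is painful (purely
> illustrative, not patchset code, every name below is made up), consider a
> sorted list of mirrored ranges that has to merge on insert and split on
> unmap :
> 
>   /* Illustration only: per process mirrored-range bookkeeping. */
>   #include <stdio.h>
>   #include <stdlib.h>
> 
>   struct mirror_range {
>       unsigned long start, end;          /* [start, end) */
>       struct mirror_range *next;         /* sorted, non overlapping */
>   };
> 
>   /* Insert [start, end), merging with any range it touches or overlaps. */
>   static void range_insert(struct mirror_range **head,
>                            unsigned long start, unsigned long end)
>   {
>       struct mirror_range **p = head, *r;
> 
>       while (*p && (*p)->end < start)
>           p = &(*p)->next;
>       while ((r = *p) && r->start <= end) {
>           if (r->start < start)
>               start = r->start;
>           if (r->end > end)
>               end = r->end;
>           *p = r->next;
>           free(r);
>       }
>       r = malloc(sizeof(*r));
>       r->start = start;
>       r->end = end;
>       r->next = *p;
>       *p = r;
>   }
> 
>   /* Remove [start, end), possibly splitting one range in two (munmap). */
>   static void range_remove(struct mirror_range **head,
>                            unsigned long start, unsigned long end)
>   {
>       struct mirror_range **p = head, *r;
> 
>       while ((r = *p)) {
>           if (r->end <= start) {
>               p = &r->next;
>           } else if (r->start >= end) {
>               break;
>           } else if (r->start < start && r->end > end) {
>               /* Hole in the middle: split the range in two. */
>               struct mirror_range *tail = malloc(sizeof(*tail));
>               tail->start = end;
>               tail->end = r->end;
>               tail->next = r->next;
>               r->end = start;
>               r->next = tail;
>               break;
>           } else if (r->start < start) {
>               r->end = start;
>               p = &r->next;
>           } else if (r->end > end) {
>               r->start = end;
>               break;
>           } else {
>               *p = r->next;
>               free(r);
>           }
>       }
>   }
> 
>   int main(void)
>   {
>       struct mirror_range *head = NULL, *r;
> 
>       range_insert(&head, 0x1000, 0x4000);
>       range_insert(&head, 0x4000, 0x8000);  /* merges into [0x1000,0x8000) */
>       range_remove(&head, 0x2000, 0x3000);  /* splits it back in two */
>       for (r = head; r; r = r->next)
>           printf("[%#lx, %#lx)\n", r->start, r->end);
>       return 0;
>   }
> 
> Every mmap, munmap, mremap or vma split has to go through this kind of
> bookkeeping, per device for D3 or once per process for D4, which is the
> complexity the current design avoids for non migrated memory.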
> 
> 
> C.2 - Hardware solution :
> 
> What hmm tries to achieve can be partially achieved with a hardware solution.
> Such a hardware solution is part of the PCIE specification, with PASID
> (process address space id) and ATS (address translation service). With both
> of these PCIE features a device can ask for a virtual address of a given
> process to be translated into its corresponding physical address. To achieve
> this the IOMMU bridge is capable of understanding and walking the cpu page
> table of a process. See the IOMMUv2 implementation inside the linux kernel
> for reference.
> 
> There are two huge restrictions with a hardware solution to this problem. The
> first, obvious one is that you need hardware support. While HMM also requires
> hardware support on the GPU side, it does not on the architecture side (no
> requirement on the IOMMU, or on any bridge that sits between the GPU and
> system memory). This is a strong advantage of HMM: it only requires hardware
> support in one specific component.
> 
> The second restriction is that a hardware solution like IOMMUv2 does not
> permit migrating chunks of memory to the device local memory, which means
> under-using hardware resources (all discrete GPUs come with fast local memory
> that can have more than ten times the bandwidth of system memory).
> 
> These two reasons alone are, we believe, enough to justify hmm's usefulness.
> 
> Moreover hmm can work in a hybrid solution where non migrated chunks of
> memory go through the hardware solution (IOMMUv2 for instance) and only the
> memory that is migrated to the device is handled by the hmm code. The
> requirement on the hardware is minimal: it needs to support PASID & ATS (or
> any other hardware implementation of the same idea) on a per page granularity
> (it could be on the granularity of any level of the device page table, so
> there is no need to populate all levels of the device page table). This
> hybrid approach is the best solution to the problem.
> 
> 
> C.3 - Routines marked EXPORT_SYMBOL
> 
> As these routines are intended to be referenced in device drivers, they
> are marked EXPORT_SYMBOL as is common practice. This encourages adoption
> of HMM in both GPL and non-GPL drivers, and allows ongoing collaboration
> with one of the primary authors of this idea.
> 
> I think it would be beneficial to include this feature as soon as possible.
> Early collaborators can go to the trouble of fixing and polishing the HMM
> implementation, allowing it to fully bake by the time other drivers start
> implementing features requiring it. We are confident that this API will be
> useful to others as they catch up with supporting hardware.
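> 
> As a purely illustrative fragment (the function name below is hypothetical,
> the real symbol names are the ones introduced by the patchset), the export
> follows the usual kernel pattern :
> 
>   #include <linux/export.h>
> 
>   /* Hypothetical example, not an actual patchset symbol. */
>   int hmm_example_mirror_register(void *mirror)
>   {
>       /* ... hook the device mirror into hmm ... */
>       return 0;
>   }
>   EXPORT_SYMBOL(hmm_example_mirror_register);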
> 
> 
> C.4 - Planned features :
> 
> We are planning to add various features down the road once we have settled
> the basic design. The most important ones are :
>   - Allowing inter-device migration for compatible devices.
>   - Allowing hmm_rmem without backing storage (simplifies some drivers).
>   - Device specific memcg.
>   - Improvements to allow an APU to take advantage of rmem: by hiding the
>     page from the cpu, the gpu can use a different memory controller link
>     that does not require cache coherency with the cpu and thus provides
>     higher bandwidth.
>   - Atomic device memory operations, by unmapping on the cpu while the device
>     is performing the atomic operation (this requires the hardware mmu to
>     differentiate between regular and atomic memory accesses, and to have a
>     flag that allows atomic memory access on a per page basis).
>   - Pinning private memory to rmem. This would be a useful feature to add and
>     would require the addition of a new flag to madvise; any cpu access would
>     then result in SIGBUS for the cpu process (see the sketch after this
>     list).
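> 
> As a user space sketch of what the madvise based pinning could look like (the
> flag name and value below are purely hypothetical, no such flag exists today,
> so on current kernels the call simply fails with EINVAL) :
> 
>   #define _GNU_SOURCE
>   #include <errno.h>
>   #include <stdio.h>
>   #include <string.h>
>   #include <sys/mman.h>
> 
>   /* Hypothetical flag: not part of any kernel ABI, value picked arbitrarily. */
>   #ifndef MADV_PIN_TO_RMEM
>   #define MADV_PIN_TO_RMEM 0x1000
>   #endif
> 
>   int main(void)
>   {
>       size_t len = 1 << 20;
>       void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
>                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>       if (buf == MAP_FAILED)
>           return 1;
> 
>       /* Ask for this range to be migrated to, and pinned in, device memory.
>        * Under the proposal any later cpu access to buf would get SIGBUS. */
>       if (madvise(buf, len, MADV_PIN_TO_RMEM) != 0)
>           fprintf(stderr, "madvise: %s (expected on current kernels)\n",
>                   strerror(errno));
> 
>       munmap(buf, len);
>       return 0;
>   }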
> 
> 
> C.5 - Getting upstream :
> 
> So what should i do to get this patchset into a mergeable form, at least at
> first as a staging feature ? Right now the patchset has a few rough edges
> around huge page support and other smaller issues. But as said above i
> believe that patches 1, 2, 3 and 4 can be merged as is, as they do not modify
> current behavior while being useful to others.
> 
> Should i implement a secondary hmm specific lru and its associated worker
> thread to avoid having the regular reclaim code end up sleeping while waiting
> for a device to update its page table ?
> 
> Should i go for a totally different design ? If so, in what direction ? As
> stated above we explored other designs and i listed their flaws.
> 
> Any other things that i need to fix/address/change/improve ?
> 
> Comments and flames are welcome.
> 
> Cheers,
> Jérôme Glisse
> 
> To: <linux-kernel@vger.kernel.org>,
> To: linux-mm <linux-mm@kvack.org>,
> To: <linux-fsdevel@vger.kernel.org>,
> Cc: "Mel Gorman" <mgorman@suse.de>,
> Cc: "H. Peter Anvin" <hpa@zytor.com>,
> Cc: "Peter Zijlstra" <peterz@infradead.org>,
> Cc: "Andrew Morton" <akpm@linux-foundation.org>,
> Cc: "Linda Wang" <lwang@redhat.com>,
> Cc: "Kevin E Martin" <kem@redhat.com>,
> Cc: "Jerome Glisse" <jglisse@redhat.com>,
> Cc: "Andrea Arcangeli" <aarcange@redhat.com>,
> Cc: "Johannes Weiner" <jweiner@redhat.com>,
> Cc: "Larry Woodman" <lwoodman@redhat.com>,
> Cc: "Rik van Riel" <riel@redhat.com>,
> Cc: "Dave Airlie" <airlied@redhat.com>,
> Cc: "Jeff Law" <law@redhat.com>,
> Cc: "Brendan Conoboy" <blc@redhat.com>,
> Cc: "Joe Donohue" <jdonohue@redhat.com>,
> Cc: "Duncan Poole" <dpoole@nvidia.com>,
> Cc: "Sherry Cheung" <SCheung@nvidia.com>,
> Cc: "Subhash Gutti" <sgutti@nvidia.com>,
> Cc: "John Hubbard" <jhubbard@nvidia.com>,
> Cc: "Mark Hairgrove" <mhairgrove@nvidia.com>,
> Cc: "Lucien Dunning" <ldunning@nvidia.com>,
> Cc: "Cameron Buschardt" <cabuschardt@nvidia.com>,
> Cc: "Arvind Gopalakrishnan" <arvindg@nvidia.com>,
> Cc: "Haggai Eran" <haggaie@mellanox.com>,
> Cc: "Or Gerlitz" <ogerlitz@mellanox.com>,
> Cc: "Sagi Grimberg" <sagig@mellanox.com>
> Cc: "Shachar Raindel" <raindel@mellanox.com>,
> Cc: "Liran Liss" <liranl@mellanox.com>,
> Cc: "Roland Dreier" <roland@purestorage.com>,
> Cc: "Sander, Ben" <ben.sander@amd.com>,
> Cc: "Stoner, Greg" <Greg.Stoner@amd.com>,
> Cc: "Bridgman, John" <John.Bridgman@amd.com>,
> Cc: "Mantor, Michael" <Michael.Mantor@amd.com>,
> Cc: "Blinzer, Paul" <Paul.Blinzer@amd.com>,
> Cc: "Morichetti, Laurent" <Laurent.Morichetti@amd.com>,
> Cc: "Deucher, Alexander" <Alexander.Deucher@amd.com>,
> Cc: "Gabbay, Oded" <Oded.Gabbay@amd.com>,
> 


* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
  2014-05-06 10:29   ` Peter Zijlstra
@ 2014-05-06 14:57     ` Linus Torvalds
  -1 siblings, 0 replies; 107+ messages in thread
From: Linus Torvalds @ 2014-05-06 14:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: j.glisse, linux-mm, Linux Kernel Mailing List, linux-fsdevel,
	Mel Gorman, H. Peter Anvin, Andrew Morton, Linda Wang,
	Kevin E Martin, Jerome Glisse, Andrea Arcangeli, Johannes Weiner,
	Larry Woodman, Rik van Riel, Dave Airlie, Jeff Law,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove, Lucien Dunning,
	Cameron

On Tue, May 6, 2014 at 3:29 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> So you forgot to CC Linus, Linus has expressed some dislike for
> preemptible mmu_notifiers in the recent past:

Indeed. I think we *really* should change that anonvma rwsem into an
rwlock. We had performance numbers that showed it needs to be done.

The *last* thing we want is to have random callbacks that can block in
this critical region. So now I think making it an rwlock is a good
idea just to make sure that never happens.

Seriously, the mmu_notifiers were misdesigned to begin with, and much
too deep. We're not screwing up the VM any more because of them.

                 Linus


* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
  2014-05-06 14:57     ` Linus Torvalds
@ 2014-05-06 15:00       ` Jerome Glisse
  -1 siblings, 0 replies; 107+ messages in thread
From: Jerome Glisse @ 2014-05-06 15:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, linux-mm, Linux Kernel Mailing List,
	linux-fsdevel, Mel Gorman, H. Peter Anvin, Andrew Morton,
	Linda Wang, Kevin E Martin, Jerome Glisse, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Jeff Law, Brendan Conoboy, Joe Donohue, Duncan Poole,
	Sherry Cheung, Subhash Gutti, John Hubbard, Mark Hairgrove,
	Lucien Dunning

On Tue, May 06, 2014 at 07:57:02AM -0700, Linus Torvalds wrote:
> On Tue, May 6, 2014 at 3:29 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > So you forgot to CC Linus, Linus has expressed some dislike for
> > preemptible mmu_notifiers in the recent past:
> 
> Indeed. I think we *really* should change that anonvma rwsem into an
> rwlock. We had performance numbers that showed it needs to be done.
> 
> The *last* thing we want is to have random callbacks that can block in
> this critical region. So now I think making it an rwlock is a good
> idea just to make sure that never happens.
> 
> Seriously, the mmu_notifiers were misdesigned to begin with, and much
> too deep. We're not screwing up the VM any more because of them.
> 
>                  Linus

So the question becomes how to implement process address space mirroring
without pinning memory, while tracking cpu page table updates, knowing that
a device page table update is unbounded in time and can not be atomic from
the cpu point of view.

Cheers,
Jérôme


* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
  2014-05-06 15:00       ` Jerome Glisse
@ 2014-05-06 15:18         ` Linus Torvalds
  -1 siblings, 0 replies; 107+ messages in thread
From: Linus Torvalds @ 2014-05-06 15:18 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Peter Zijlstra, linux-mm, Linux Kernel Mailing List,
	linux-fsdevel, Mel Gorman, H. Peter Anvin, Andrew Morton,
	Linda Wang, Kevin E Martin, Jerome Glisse, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Jeff Law, Brendan Conoboy, Joe Donohue, Duncan Poole,
	Sherry Cheung, Subhash Gutti, John Hubbard, Mark Hairgrove,
	Lucien Dunning

On Tue, May 6, 2014 at 8:00 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
>
> So question becomes how to implement process address space mirroring
> without pinning memory and track cpu page table update knowing that
> device page table update is unbound can not be atomic from cpu point
> of view.

Perhaps as a fake TLB and interacting with the TLB shootdown? And
making sure that everything is atomic?

Some of these devices are going to actually *share* the real page
tables. Not "cache" them. Actually use the page tables directly.
That's where all these on-die APU things are going, where the device
really ends up being something much more like ASMP (asymmetric
multi-processing) than a traditional external device.

So we *will* have to extend our notion of TLB shootdown to have not
just a mask of possible active CPU's, but possible active devices. No
question about that.

But doing this with sleeping in some stupid VM notifier is completely
out of the question, because it *CANNOT EVEN WORK* for that eventual
real goal of sharing the physical page tables where the device can do
things like atomic dirty/accessed bit settings etc. It can only work
for crappy sh*t that does the half-way thing. It's completely racy wrt
the actual page table updates. That kind of page table sharing needs
true atomicity for exactly the same reason we need it for our current
SMP. So it needs to have all the same page table locking rules etc.
Not that shitty notifier callback.

As I said, the VM notifiers were misdesigned to begin with. They are
an abomination. We're not going to extend on that and make it worse.
We are *certainly* not going to make them blocking and screwing our
core VM that way. And that's doubly and triply true when it cannot
work for the generic case _anyway_.

              Linus


* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
  2014-05-06 15:18         ` Linus Torvalds
@ 2014-05-06 15:33           ` Jerome Glisse
  -1 siblings, 0 replies; 107+ messages in thread
From: Jerome Glisse @ 2014-05-06 15:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, linux-mm, Linux Kernel Mailing List,
	linux-fsdevel, Mel Gorman, H. Peter Anvin, Andrew Morton,
	Linda Wang, Kevin E Martin, Jerome Glisse, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Jeff Law, Brendan Conoboy, Joe Donohue, Duncan Poole,
	Sherry Cheung, Subhash Gutti, John Hubbard, Mark Hairgrove,
	Lucien Dunning

On Tue, May 06, 2014 at 08:18:34AM -0700, Linus Torvalds wrote:
> On Tue, May 6, 2014 at 8:00 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
> >
> > So question becomes how to implement process address space mirroring
> > without pinning memory and track cpu page table update knowing that
> > device page table update is unbound can not be atomic from cpu point
> > of view.
> 
> Perhaps as a fake TLB and interacting with the TLB shootdown? And
> making sure that everything is atomic?
> 
> Some of these devices are going to actually *share* the real page
> tables. Not "cache" them. Actually use the page tables directly.
> That's where all these on-die APU things are going, where the device
> really ends up being something much more like ASMP (asymmetric
> multi-processing) than a traditional external device.
> 
> So we *will* have to extend our notion of TLB shootdown to have not
> just a mask of possible active CPU's, but possible active devices. No
> question about that.

Well no. As i said and explained in my mail, APU and IOMMUv2 are only one
side of the coin, and you can not use the device memory with such a solution.
So yes, there is interest from many players in mirroring the cpu page table
by other means than having the IOMMU walk the cpu page table (this includes
AMD).

> 
> But doing this with sleeping in some stupid VM notifier is completely
> out of the question, because it *CANNOT EVEN WORK* for that eventual
> real goal of sharing the physical page tables where the device can do
> things like atomic dirty/accessed bit settings etc. It can only work
> for crappy sh*t that does the half-way thing. It's completely racy wrt
> the actual page table updates. That kind of page table sharing needs
> true atomicity for exactly the same reason we need it for our current
> SMP. So it needs to have all the same page table locking rules etc.
> Not that shitty notifier callback.
> 
> As I said, the VM notifiers were misdesigned to begin with. They are
> an abomination. We're not going to extend on that and make it worse.
> We are *certainly* not going to make them blocking and screwing our
> core VM that way. And that's doubly and triply true when it cannot
> work for the generic case _anyway_.
> 
>               Linus

So how can i solve the issue at hand: a device that has its own page
table and can not mirror the cpu page table, nor can the device page
table be updated atomically from the cpu. Yes, such devices will exist,
and IOMMUv2 walking the cpu page table is not capable of supporting
GPU memory, which is a hugely needed feature. Compare 20Gb/s vs
300Gb/s for GPU memory.

I understand that we do not want to sleep when updating a process cpu
page table, but note that only processes that use the gpu would have to
sleep. So only processes that can actually benefit from using the GPU
will suffer the consequences.

That said, it also plays a role in page reclamation, hence why i am
proposing to have a separate lru for pages involved with a GPU.

So having the hardware walk the cpu page table is out of the
question.

Cheers,
Jérôme


* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
  2014-05-06 15:33           ` Jerome Glisse
@ 2014-05-06 15:42             ` Rik van Riel
  -1 siblings, 0 replies; 107+ messages in thread
From: Rik van Riel @ 2014-05-06 15:42 UTC (permalink / raw)
  To: Jerome Glisse, Linus Torvalds
  Cc: Peter Zijlstra, linux-mm, Linux Kernel Mailing List,
	linux-fsdevel, Mel Gorman, H. Peter Anvin, Andrew Morton,
	Linda Wang, Kevin E Martin, Jerome Glisse, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Dave Airlie, Jeff Law,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove

On 05/06/2014 11:33 AM, Jerome Glisse wrote:

> So how can i solve the issue at hand. A device that has its own page
> table and can not mirror the cpu page table, nor can the device page
> table be updated atomicly from the cpu. Yes such device will exist

TLB invalidation on very large systems can already take
essentially forever.

Are we OK with extending that "forever" period for heterogeneous
memory management with crappy devices, or is this something we
could/should look into improving in the general case?

-- 
All rights reversed


* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
  2014-05-06 15:33           ` Jerome Glisse
@ 2014-05-06 15:47             ` Linus Torvalds
  -1 siblings, 0 replies; 107+ messages in thread
From: Linus Torvalds @ 2014-05-06 15:47 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Peter Zijlstra, linux-mm, Linux Kernel Mailing List,
	linux-fsdevel, Mel Gorman, H. Peter Anvin, Andrew Morton,
	Linda Wang, Kevin E Martin, Jerome Glisse, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Jeff Law, Brendan Conoboy, Joe Donohue, Duncan Poole,
	Sherry Cheung, Subhash Gutti, John Hubbard, Mark Hairgrove,
	Lucien Dunning

On Tue, May 6, 2014 at 8:33 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
>
> So how can i solve the issue at hand. A device that has its own page
> table and can not mirror the cpu page table, nor can the device page
> table be updated atomicly from the cpu.

So? Just model it as a TLB.

Sure, the TLB is slow and crappy and is in external memory rather than
on-die, but it's still a TLB.

We have CPU's that do that kind of crazy thing (powerpc and sparc both
have these kinds of "in-memory TLB extensions" in addition to the
on-die TLB, they just call them "inverse page tables" to try to fool
people about what they are).

> I understand that we do not want to sleep when updating process cpu
> page table but note that only process that use the gpu would have to
> sleep. So only process that can actually benefit from the using GPU
> will suffer the consequences.

NO!

You don't get it. If a callback can sleep, then we cannot protect it
with a spinlock.

It doesn't matter if it only sleeps once in a millennium. It still
forces its crap on the rest of the system.

So there is no way in hell that we will allow that VM notifier crap. None.

And as I've mentioned, there is a correct place to slot this in, and
that correct way is the _only_ way to ever support future GPU's that
_do_ share direct access to the page tables.

So trying to do it any other way is broken _anyway_.

           Linus


* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
  2014-05-06 15:47             ` Linus Torvalds
@ 2014-05-06 16:18               ` Jerome Glisse
  -1 siblings, 0 replies; 107+ messages in thread
From: Jerome Glisse @ 2014-05-06 16:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, linux-mm, Linux Kernel Mailing List,
	linux-fsdevel, Mel Gorman, H. Peter Anvin, Andrew Morton,
	Linda Wang, Kevin E Martin, Jerome Glisse, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Jeff Law, Brendan Conoboy, Joe Donohue, Duncan Poole,
	Sherry Cheung, Subhash Gutti, John Hubbard, Mark Hairgrove,
	Lucien Dunning

On Tue, May 06, 2014 at 08:47:48AM -0700, Linus Torvalds wrote:
> On Tue, May 6, 2014 at 8:33 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
> >
> > So how can i solve the issue at hand. A device that has its own page
> > table and can not mirror the cpu page table, nor can the device page
> > table be updated atomicly from the cpu.
> 
> So? Just model it as a TLB.
> 
> Sure, the TLB is slow and crappy and is in external memory rather than
> on-die, but it's still a TLB.
> 
> We have CPU's that do that kind of crazy thing (powerpc and sparc both
> have these kinds of "in-memory TLB extensions" in addition to the
> on-die TLB, they just call them "inverse page tables" to try to fool
> people about what they are).
> 
> > I understand that we do not want to sleep when updating process cpu
> > page table but note that only process that use the gpu would have to
> > sleep. So only process that can actually benefit from the using GPU
> > will suffer the consequences.
> 
> NO!
> 
> You don't get it. If a callback can sleep, then we cannot protect it
> with a spinlock.
> 
> It doesn't matter if it only sleeps once in a millennium. It still
> forces its crap on the rest of the system.

I do understand that; i was pointing out that if i move to the tlb model,
which i am fine with, i will still need to sleep there. That's all i wanted
to stress. I did not want to force using mmu_notifier; i am fine with it
becoming atomic as long as i have a place where i can intercept cpu page
table updates and propagate them to the device mmu.

> 
> So there is no way in hell that we will allow that VM notifier crap. None.
> 
> And as I've mentioned, there is a correct place to slot this in, and
> that correct way is the _only_ way to ever support future GPU's that
> _do_ share direct access to the page tables.

This work was done in cooperation with NVidia, and we discussed it with AMD
too, so i am very much aware of what is coming next on the hardware front.
Being able to have GPUs use their own GPU page table, ie not walking the CPU
one, is something of interest to the people who design those future
generations of GPU.

> 
> So trying to do it any other way is broken _anyway_.
> 
>            Linus

I will respin without using mmu_notifier, hooking into the tlb shootdown
instead. But it will still need to sleep during the device tlb shootdown, and
that's the point i want to make sure is clear.

Cheers,
Jérôme


* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
  2014-05-06 15:47             ` Linus Torvalds
@ 2014-05-06 16:30               ` Rik van Riel
  -1 siblings, 0 replies; 107+ messages in thread
From: Rik van Riel @ 2014-05-06 16:30 UTC (permalink / raw)
  To: Linus Torvalds, Jerome Glisse
  Cc: Peter Zijlstra, linux-mm, Linux Kernel Mailing List,
	linux-fsdevel, Mel Gorman, H. Peter Anvin, Andrew Morton,
	Linda Wang, Kevin E Martin, Jerome Glisse, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Dave Airlie, Jeff Law,
	Brendan Conoboy, Joe Donohue, Duncan Poole, Sherry Cheung,
	Subhash Gutti, John Hubbard, Mark Hairgrove

On 05/06/2014 11:47 AM, Linus Torvalds wrote:

> And as I've mentioned, there is a correct place to slot this in, and
> that correct way is the _only_ way to ever support future GPU's that
> _do_ share direct access to the page tables.

The GPU runs a lot faster when using video memory, instead
of system memory, on the other side of the PCIe bus.

The CPU cannot directly address video memory.

How can shared page table access work, given these constraints?

-- 
All rights reversed


* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
  2014-05-06 16:18               ` Jerome Glisse
@ 2014-05-06 16:32                 ` Linus Torvalds
  -1 siblings, 0 replies; 107+ messages in thread
From: Linus Torvalds @ 2014-05-06 16:32 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Peter Zijlstra, linux-mm, Linux Kernel Mailing List,
	linux-fsdevel, Mel Gorman, H. Peter Anvin, Andrew Morton,
	Linda Wang, Kevin E Martin, Jerome Glisse, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Jeff Law, Brendan Conoboy, Joe Donohue, Duncan Poole,
	Sherry Cheung, Subhash Gutti, John Hubbard, Mark Hairgrove,
	Lucien Dunning

On Tue, May 6, 2014 at 9:18 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
>
> I do understand that i was pointing out that if i move to, tlb which i
> am fine with, i will still need to sleep there.

No can do. The TLB flushing itself is called with a spinlock held, and
we need to continue to do that.

Why do you really need to sleep? Because that sounds bogus.

What you *should* do is send the flush message, and not wait for any
reply. You can then possibly wait for the result later on: we already
have this multi-stage TLB flush model (using the "mmu_gather"
structure) that has three phases:

 - create mmu_gather (allocate space for batching etc). This can sleep.
 - do the actual flushing (possibly multiple times). This is the
"synchronous with the VM" part and cannot sleep.
 - tear down the mmu_gather data structures and actually free the
pages we batched. This can sleep.

and what I think a GPU flush has to do is to do the actual flushes
when asked to (because that's what it will need to do to work with a
real TLB eventually), but if there's some crazy asynchronous
acknowledge thing from hardware, it's possible to perhaps wait for
that in the final phase (*before* we free the pages we gathered).

Now, such an asynchronous model had better not mark page tables dirty
after we flushed (we'd lose that information), but quite frankly,
anything that is remote enough to need some async flush thing cannot
sanely be close enough to be closely tied to the actual real page
tables, so I don't think we need to care.

Anyway, I really think that the existing mmu_gather model *should*
work fine for this all. It may be a bit inconvenient for crazy
hardware, but the important part is that it definitely should work for
any future hardware that actually gets this right.

It does likely involve adding some kind of device list to "struct
mm_struct", and I'm sure there's some extension to "struct mmu_gather"
too, but _conceptually_ it should all be reasonably non-invasive.
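
As a rough, self-contained sketch of that shape (all names below are made up
for illustration; this is not the real mmu_gather API): the device
invalidation is posted in the non-sleeping middle phase and only waited for
in the final phase, before the batched pages would be freed.

  /* Illustration only: three phase flush with an asynchronous device TLB. */
  #include <stdio.h>
  #include <unistd.h>

  struct fake_gather {
      unsigned long flush_start, flush_end;  /* batched range to invalidate */
      int device_flush_pending;              /* async device ack outstanding */
  };

  /* Phase 1: set up batching state. May sleep (allocations etc). */
  static void gather_begin(struct fake_gather *g)
  {
      g->flush_start = ~0UL;
      g->flush_end = 0;
      g->device_flush_pending = 0;
  }

  /* Phase 2: runs synchronous with the VM, must not sleep. Post the device
   * invalidation but do not wait for the acknowledgement here. */
  static void gather_flush_range(struct fake_gather *g,
                                 unsigned long start, unsigned long end)
  {
      if (start < g->flush_start)
          g->flush_start = start;
      if (end > g->flush_end)
          g->flush_end = end;
      g->device_flush_pending = 1;   /* pretend we wrote to the device ring */
      printf("posted device invalidation for [%#lx, %#lx)\n", start, end);
  }

  /* Phase 3: tear down. Allowed to sleep, so wait for the device ack here,
   * before the batched pages get freed. */
  static void gather_finish(struct fake_gather *g)
  {
      if (g->device_flush_pending) {
          usleep(1000);              /* stand-in for waiting on the ack */
          g->device_flush_pending = 0;
      }
      printf("device acked; safe to free pages in [%#lx, %#lx)\n",
             g->flush_start, g->flush_end);
  }

  int main(void)
  {
      struct fake_gather g;

      gather_begin(&g);
      gather_flush_range(&g, 0x100000, 0x180000);
      gather_flush_range(&g, 0x200000, 0x210000);
      gather_finish(&g);
      return 0;
  }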

Knock wood.

            Linus


* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
  2014-05-06 16:30               ` Rik van Riel
@ 2014-05-06 16:34                 ` Linus Torvalds
  -1 siblings, 0 replies; 107+ messages in thread
From: Linus Torvalds @ 2014-05-06 16:34 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Jerome Glisse, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, linux-fsdevel, Mel Gorman,
	H. Peter Anvin, Andrew Morton, Linda Wang, Kevin E Martin,
	Jerome Glisse, Andrea Arcangeli, Johannes Weiner, Larry Woodman,
	Dave Airlie, Jeff Law, Brendan Conoboy, Joe Donohue,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning

On Tue, May 6, 2014 at 9:30 AM, Rik van Riel <riel@redhat.com> wrote:
>
> The GPU runs a lot faster when using video memory, instead
> of system memory, on the other side of the PCIe bus.

The nineties called, and they want their old broken model back.

Get with the times. No high-performance future GPU will ever run
behind the PCIe bus. We still have a few straggling historical
artifacts, but everybody knows where the future is headed.

They are already cache-coherent because flushing caches etc was too
damn expensive. They're getting more so.

               Linus


* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
  2014-05-06 16:34                 ` Linus Torvalds
@ 2014-05-06 16:47                   ` Rik van Riel
  -1 siblings, 0 replies; 107+ messages in thread
From: Rik van Riel @ 2014-05-06 16:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jerome Glisse, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, linux-fsdevel, Mel Gorman,
	H. Peter Anvin, Andrew Morton, Linda Wang, Kevin E Martin,
	Jerome Glisse, Andrea Arcangeli, Johannes Weiner, Larry Woodman,
	Dave Airlie, Jeff Law, Brendan Conoboy, Joe Donohue,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard

On 05/06/2014 12:34 PM, Linus Torvalds wrote:
> On Tue, May 6, 2014 at 9:30 AM, Rik van Riel <riel@redhat.com> wrote:
>>
>> The GPU runs a lot faster when using video memory, instead
>> of system memory, on the other side of the PCIe bus.
> 
> The nineties called, and they want their old broken model back.
> 
> Get with the times. No high-performance future GPU will ever run
> behind the PCIe bus. We still have a few straggling historical
> artifacts, but everybody knows where the future is headed.
> 
> They are already cache-coherent because flushing caches etc was too
> damn expensive. They're getting more so.

I suppose that VRAM could simply be turned into a very high
capacity CPU cache for the GPU, for the case where people
want/need an add-on card.

With a few hundred MB of "CPU cache" on the video card, we
could offload processing to the GPU very easily, without
having to worry about multiple address or page table formats
on the CPU side.

A new generation of GPU hardware seems to come out every
six months or so, so I guess we could live with TLB
invalidations to the first generations of hardware being
comically slow :)

-- 
All rights reversed

* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
  2014-05-06 16:32                 ` Linus Torvalds
@ 2014-05-06 16:49                   ` Jerome Glisse
  -1 siblings, 0 replies; 107+ messages in thread
From: Jerome Glisse @ 2014-05-06 16:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, linux-mm, Linux Kernel Mailing List,
	linux-fsdevel, Mel Gorman, H. Peter Anvin, Andrew Morton,
	Linda Wang, Kevin E Martin, Jerome Glisse, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Jeff Law, Brendan Conoboy, Joe Donohue, Duncan Poole,
	Sherry Cheung, Subhash Gutti, John Hubbard, Mark Hairgrove,
	Lucien Dunning

On Tue, May 06, 2014 at 09:32:16AM -0700, Linus Torvalds wrote:
> On Tue, May 6, 2014 at 9:18 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
> >
> > I do understand that i was pointing out that if i move to, tlb which i
> > am fine with, i will still need to sleep there.
> 
> No can do. The TLB flushing itself is called with a spinlock held, and
> we need to continue to do that.
> 
> Why do you really need to sleep? Because that sounds bogus.
> 
> What you *should* do is send the flush message, and not wait for any
> reply. You can then possibly wait for the result later on: we already
> have this multi-stage TLB flush model (using the "mmu_gather"
> structure) that has three phases:
> 
>  - create mmu_gather (allocate space for batching etc). This can sleep.
>  - do the actual flushing (possibly multiple times). This is the
> "synchronous with the VM" part and cannot sleep.
>  - tear down the mmu_gather data structures and actually free the
> pages we batched. This can sleep.
> 
> and what I think a GPU flush has to do is to do the actual flushes
> when asked to (because that's what it will need to do to work with a
> real TLB eventually), but if there's some crazy asynchronous
> acknowledge thing from hardware, it's possible to perhaps wait for
> that in the final phase (*before* we free the pages we gathered).

The plan I had in mind was to atomically add an item inside the mmu
notifier to schedule work on the GPU, and have the TLB code wait for the
GPU to acknowledge that it has updated its page table and is done using
those pages. This would happen in tlb_flush_mmu.
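
To make that concrete, here is a rough userspace sketch of the shape I
have in mind (not the actual patch code; dev_tlb, dev_tlb_post_invalidate
and dev_tlb_wait_ack are names invented just for the illustration):

#include <stdio.h>
#include <stdbool.h>

/* Stand-in for the device mmu: it only remembers that an invalidation
 * request has been posted and not yet acknowledged. */
struct dev_tlb {
    bool invalidate_posted;
};

/* "Flush" phase: runs synchronous with the VM, spinlock held, so it
 * must not sleep; it only posts the request to the device. */
static void dev_tlb_post_invalidate(struct dev_tlb *dev)
{
    dev->invalidate_posted = true;
    printf("posted invalidation to device, not waiting\n");
}

/* "Teardown" phase: sleepable, runs before the batched pages are freed;
 * this is where we would block until the device acknowledges that it
 * updated its page table and stopped using the pages. */
static void dev_tlb_wait_ack(struct dev_tlb *dev)
{
    if (dev->invalidate_posted) {
        printf("waiting for device ack before freeing pages\n");
        dev->invalidate_posted = false;
    }
}

int main(void)
{
    struct dev_tlb dev = { .invalidate_posted = false };
    int nr_batched_pages = 32;

    /* phase 1 (allocate the gather structure) omitted: it can sleep */
    dev_tlb_post_invalidate(&dev);  /* phase 2: cannot sleep           */
    dev_tlb_wait_ack(&dev);         /* phase 3: can sleep, before free */
    printf("freeing %d batched pages\n", nr_batched_pages);
    return 0;
}

The point is that posting the invalidation is cheap and non-blocking,
while the actual wait happens in a phase that is already allowed to
sleep, before the batched pages are freed.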

> 
> Now, such an asynchronous model had better not mark page tables dirty
> after we flushed (we'd lose that information), but quite frankly,
> anything that is remote enough to need some async flush thing cannot
> sanely be close enough to be closely tied to the actual real page
> tables, so I don't think we need to care.

That's an issue: as soon as I schedule the work (read: as early as I
can) the GPU can report any of the pages as dirty, and it may only be
able to do so once we wait for it in tlb_flush_mmu.

> 
> Anyway, I really think that the existing mmu_gather model *should*
> work fine for this all. It may be a bit inconvenient for crazy
> hardware, but the important part is that it definitely should work for
> any future hardware that actually gets this right.
> 

I stress again that the GPU with dedicated memory is not going away; on
the contrary, you might see more dedicated memory that is not accessible
from the CPU.

> It does likely involve adding some kind of device list to "struct
> mm_struct", and I'm sure there's some extension to "struct mmu_gather"
> too, but _conceptually_ it should all be reasonably non-invasive.
> 
> Knock wood.
> 
>             Linus

I will port this over to piggyback on mmu_gather and the other TLB flush
paths. I will post as soon as I have something that works with the
features this patchset has.

Cheers,
Jérôme

* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
  2014-05-06 16:47                   ` Rik van Riel
@ 2014-05-06 16:54                     ` Jerome Glisse
  -1 siblings, 0 replies; 107+ messages in thread
From: Jerome Glisse @ 2014-05-06 16:54 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, linux-fsdevel, Mel Gorman,
	H. Peter Anvin, Andrew Morton, Linda Wang, Kevin E Martin,
	Jerome Glisse, Andrea Arcangeli, Johannes Weiner, Larry Woodman,
	Dave Airlie, Jeff Law, Brendan Conoboy, Joe Donohue,
	Duncan Poole, Sherry Cheung, Subhash Gutti, John Hubbard,
	Mark Hairgrove, Lucien Dunning

On Tue, May 06, 2014 at 12:47:16PM -0400, Rik van Riel wrote:
> On 05/06/2014 12:34 PM, Linus Torvalds wrote:
> > On Tue, May 6, 2014 at 9:30 AM, Rik van Riel <riel@redhat.com> wrote:
> >>
> >> The GPU runs a lot faster when using video memory, instead
> >> of system memory, on the other side of the PCIe bus.
> > 
> > The nineties called, and they want their old broken model back.
> > 
> > Get with the times. No high-performance future GPU will ever run
> > behind the PCIe bus. We still have a few straggling historical
> > artifacts, but everybody knows where the future is headed.
> > 
> > They are already cache-coherent because flushing caches etc was too
> > damn expensive. They're getting more so.
> 
> I suppose that VRAM could simply be turned into a very high
> capacity CPU cache for the GPU, for the case where people
> want/need an add-on card.
> 
> With a few hundred MB of "CPU cache" on the video card, we
> could offload processing to the GPU very easily, without
> having to worry about multiple address or page table formats
> on the CPU side.
> 
> A new generation of GPU hardware seems to come out every
> six months or so, so I guess we could live with TLB
> invalidations to the first generations of hardware being
> comically slow :)
> 

I do not want to speak for any GPU manufacturer, but I think it is safe
to say that there will be dedicated GPU memory for a long time. It is
not going anywhere soon, and it is a lot more than a few hundred MB;
think several GB. If you think about 4k and 8k screens you are really
going to want at least 8GB on a desktop computer, and for compute you
will likely see 16GB or 32GB as common sizes.

Again, I stress that there is nothing on the horizon that leads me to
believe that regular memory attached to the CPU will ever come close to
the bandwidth of memory attached to the GPU. It is already more than 10
times faster on the GPU, and as far as I know the gap will grow even
more in the next generation.

So dedicated GPU memory should not be dismissed as something that is
vanishing; quite the contrary, it should be acknowledged as something
that is here to stay a lot longer, afaict.

Cheers,
Jérôme

* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
  2014-05-06 16:32                 ` Linus Torvalds
@ 2014-05-06 17:28                   ` Jerome Glisse
  -1 siblings, 0 replies; 107+ messages in thread
From: Jerome Glisse @ 2014-05-06 17:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, linux-mm, Linux Kernel Mailing List,
	linux-fsdevel, Mel Gorman, H. Peter Anvin, Andrew Morton,
	Linda Wang, Kevin E Martin, Jerome Glisse, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Jeff Law, Brendan Conoboy, Joe Donohue, Duncan Poole,
	Sherry Cheung, Subhash Gutti, John Hubbard, Mark Hairgrove,
	Lucien Dunning

On Tue, May 06, 2014 at 09:32:16AM -0700, Linus Torvalds wrote:
> On Tue, May 6, 2014 at 9:18 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
> >
> > I do understand that i was pointing out that if i move to, tlb which i
> > am fine with, i will still need to sleep there.
> 
> No can do. The TLB flushing itself is called with a spinlock held, and
> we need to continue to do that.
> 
> Why do you really need to sleep? Because that sounds bogus.
> 
> What you *should* do is send the flush message, and not wait for any
> reply. You can then possibly wait for the result later on: we already
> have this multi-stage TLB flush model (using the "mmu_gather"
> structure) that has three phases:
> 
>  - create mmu_gather (allocate space for batching etc). This can sleep.
>  - do the actual flushing (possibly multiple times). This is the
> "synchronous with the VM" part and cannot sleep.
>  - tear down the mmu_gather data structures and actually free the
> pages we batched. This can sleep.
> 
> and what I think a GPU flush has to do is to do the actual flushes
> when asked to (because that's what it will need to do to work with a
> real TLB eventually), but if there's some crazy asynchronous
> acknowledge thing from hardware, it's possible to perhaps wait for
> that in the final phase (*before* we free the pages we gathered).
> 
> Now, such an asynchronous model had better not mark page tables dirty
> after we flushed (we'd lose that information), but quite frankly,
> anything that is remote enough to need some async flush thing cannot
> sanely be close enough to be closely tied to the actual real page
> tables, so I don't think we need to care.
> 
> Anyway, I really think that the existing mmu_gather model *should*
> work fine for this all. It may be a bit inconvenient for crazy
> hardware, but the important part is that it definitely should work for
> any future hardware that actually gets this right.
> 
> It does likely involve adding some kind of device list to "struct
> mm_struct", and I'm sure there's some extension to "struct mmu_gather"
> too, but _conceptually_ it should all be reasonably non-invasive.
> 
> Knock wood.
> 
>             Linus

Also, just to be sure, are my changes to the radix tree otherwise
acceptable, at least in principle? As explained in my long email, when
migrating a file backed page to device memory we want to make sure that
no one else tries to use that page.

The way I have done it is described in my long email, but in a nutshell
it uses a special swap entry inside the radix tree; the filesystem code
knows about that entry and calls into hmm to migrate the data back to
system memory if needed. Writeback uses a temporary bounce page (ie once
writeback is done the gpu page can be remapped writeable and the bounce
page forgotten).
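
To give a rough idea of how such a special entry behaves, here is a
simplified userspace model (this is not the patch itself; the tagging
scheme and the names entry_is_device, migrate_back_to_system and
find_get_page_model are invented just for the example):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* model of a page cache slot: either a pointer to a page, or a tagged
 * "special entry" meaning the data currently lives in device memory */
#define DEVICE_ENTRY_TAG 0x1UL

static bool entry_is_device(void *entry)
{
    return ((uintptr_t)entry & DEVICE_ENTRY_TAG) != 0;
}

/* what the filesystem path would do when it hits a special entry: ask
 * the device to copy the data back and put a real page in the slot */
static void *migrate_back_to_system(void **slot)
{
    printf("migrating slot %p back to system memory\n", (void *)slot);
    *slot = NULL;               /* stand-in for the restored struct page */
    return *slot;
}

/* model of a radix tree lookup in a path that needs the real page */
static void *find_get_page_model(void **slots, size_t index)
{
    void *entry = slots[index];

    if (entry && entry_is_device(entry))
        return migrate_back_to_system(&slots[index]);
    return entry;
}

int main(void)
{
    void *slots[4] = { NULL, (void *)(0x1000 | DEVICE_ENTRY_TAG), NULL, NULL };

    find_get_page_model(slots, 1);      /* triggers the migrate back   */
    find_get_page_model(slots, 0);      /* ordinary case, nothing special */
    return 0;
}

In the real code the entry would of course live in the mapping's radix
tree and the copy back would be done by the device driver through hmm;
the sketch only shows the lookup-side behaviour.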

Cheers,
Jérôme

* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
  2014-05-06 17:28                   ` Jerome Glisse
@ 2014-05-06 17:43                     ` Linus Torvalds
  -1 siblings, 0 replies; 107+ messages in thread
From: Linus Torvalds @ 2014-05-06 17:43 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Peter Zijlstra, linux-mm, Linux Kernel Mailing List,
	linux-fsdevel, Mel Gorman, H. Peter Anvin, Andrew Morton,
	Linda Wang, Kevin E Martin, Jerome Glisse, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Jeff Law, Brendan Conoboy, Joe Donohue, Duncan Poole,
	Sherry Cheung, Subhash Gutti, John Hubbard, Mark Hairgrove,
	Lucien Dunning

On Tue, May 6, 2014 at 10:28 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
>
> Also, just to be sure, are my changes to the radix tree otherwise
> acceptable at least in principle.

It looks like it adds several "loop over each page" cases just to
check whether each page might be on remote memory.

Which seems a complete waste of time 99.99% of the time.

But maybe I'm looking at the wrong patch.

          Linus

* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
  2014-05-06 16:54                     ` Jerome Glisse
@ 2014-05-06 18:02                       ` H. Peter Anvin
  -1 siblings, 0 replies; 107+ messages in thread
From: H. Peter Anvin @ 2014-05-06 18:02 UTC (permalink / raw)
  To: Jerome Glisse, Rik van Riel
  Cc: Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, linux-fsdevel, Mel Gorman,
	Andrew Morton, Linda Wang, Kevin E Martin, Jerome Glisse,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Dave Airlie,
	Jeff Law, Brendan Conoboy, Joe Donohue, Duncan Poole,
	Sherry Cheung, Subhash Gutti, John Hubbard

<ogerlitz@mellanox.com>,Sagi Grimberg <sagig@mellanox.com>,Shachar Raindel <raindel@mellanox.com>,Liran Liss <liranl@mellanox.com>,Roland Dreier <roland@purestorage.com>,"Sander, Ben" <ben.sander@amd.com>,"Stoner, Greg" <Greg.Stoner@amd.com>,"Bridgman, John" <John.Bridgman@amd.com>,"Mantor, Michael" <Michael.Mantor@amd.com>,"Blinzer, Paul" <Paul.Blinzer@amd.com>,"Morichetti, Laurent" <Laurent.Morichetti@amd.com>,"Deucher, Alexander" <Alexander.Deucher@amd.com>,"Gabbay, Oded" <Oded.Gabbay@amd.com>,Davidlohr Bueso <davidlohr@hp.com>
Message-ID: <0bf54468-3ed1-4cd4-b771-4836c78dde14@email.android.com>

Nothing wrong with device-side memory, but not having it accessible by the CPU seems fundamentally brown from the point of view of unified memory addressing.

On May 6, 2014 9:54:08 AM PDT, Jerome Glisse <j.glisse@gmail.com> wrote:
>On Tue, May 06, 2014 at 12:47:16PM -0400, Rik van Riel wrote:
>> On 05/06/2014 12:34 PM, Linus Torvalds wrote:
>> > On Tue, May 6, 2014 at 9:30 AM, Rik van Riel <riel@redhat.com>
>wrote:
>> >>
>> >> The GPU runs a lot faster when using video memory, instead
>> >> of system memory, on the other side of the PCIe bus.
>> > 
>> > The nineties called, and they want their old broken model back.
>> > 
>> > Get with the times. No high-performance future GPU will ever run
>> > behind the PCIe bus. We still have a few straggling historical
>> > artifacts, but everybody knows where the future is headed.
>> > 
>> > They are already cache-coherent because flushing caches etc was too
>> > damn expensive. They're getting more so.
>> 
>> I suppose that VRAM could simply be turned into a very high
>> capacity CPU cache for the GPU, for the case where people
>> want/need an add-on card.
>> 
>> With a few hundred MB of "CPU cache" on the video card, we
>> could offload processing to the GPU very easily, without
>> having to worry about multiple address or page table formats
>> on the CPU side.
>> 
>> A new generation of GPU hardware seems to come out every
>> six months or so, so I guess we could live with TLB
>> invalidations to the first generations of hardware being
>> comically slow :)
>> 
>
>I do not want to speak for any GPU manufacturer but i think it is safe
>to say that there will be dedicated memory for GPU for a long time. It
>is not going anywhere soon and it is a lot more than few hundred MB,
>think several GB. If you think about 4k, 8k screen you really gonna
>want
>8GB at least on desktop computer and for compute you will likely see
>16GB or 32GB as common size.
>
>Again i stress that there is nothing on the horizon that let me believe
>that regular memory associated to CPU will ever come close to the
>bandwith
>that exist with memory associated to GPU. It is already more than 10
>times
>faster on GPU and as far as i know the gap will grow even more in the
>next
>generation.
>
>So dedicated memory to gpu should not be discarded as something that is
>vanishing quite the contrary it should be acknowledge as something that
>is
>here to stay a lot longer afaict.
>
>Cheers,
>Jérôme

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.

* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
  2014-05-06 17:43                     ` Linus Torvalds
@ 2014-05-06 18:13                       ` Jerome Glisse
  -1 siblings, 0 replies; 107+ messages in thread
From: Jerome Glisse @ 2014-05-06 18:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, linux-mm, Linux Kernel Mailing List,
	linux-fsdevel, Mel Gorman, H. Peter Anvin, Andrew Morton,
	Linda Wang, Kevin E Martin, Jerome Glisse, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Jeff Law, Brendan Conoboy, Joe Donohue, Duncan Poole,
	Sherry Cheung, Subhash Gutti, John Hubbard, Mark Hairgrove,
	Lucien Dunning

On Tue, May 06, 2014 at 10:43:22AM -0700, Linus Torvalds wrote:
> On Tue, May 6, 2014 at 10:28 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
> >
> > Also, just to be sure, are my changes to the radix tree otherwise
> > acceptable at least in principle.
> 
> It looks like it adds several "loop over each page" cases just to
> check whether each page might be on remote memory.
> 
> Which seems a complete waste of time 99.99% of the time.
> 
> But maybe I'm looking at the wrong patch.
> 
>           Linus

It is patch 8 for the core changes and patch 9 to demonstrate the per
filesystem changes.

So yes, each place that does a radix tree lookup needs to check that the
entries it got out of the radix tree are not special ones, and because
many places in the code do a gang lookup through pagevec_lookup, they
need to go over the entries that were looked up and take the appropriate
action.

The other design is to do the migration inside the various radix tree
lookup functions, but that means leaving the rcu section and possibly
sleeping while waiting for the GPU to copy things back into system
memory.

I could grow the radix tree functions to return a bool so that callers
can avoid looping over the results when there is no special entry.
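
Roughly what I mean by growing the lookup, as a simplified sketch
(invented names again, and a plain array standing in for the radix
tree):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define DEVICE_ENTRY_TAG 0x1UL

static bool entry_is_device(void *entry)
{
    return ((uintptr_t)entry & DEVICE_ENTRY_TAG) != 0;
}

/* gang lookup that also reports whether any special entry was seen, so
 * callers can skip the fixup loop in the overwhelmingly common case
 * where there is none */
static size_t gang_lookup_model(void **slots, size_t nr_slots,
                                void **out, size_t max, bool *has_device)
{
    size_t found = 0;

    *has_device = false;
    for (size_t i = 0; i < nr_slots && found < max; i++) {
        if (!slots[i])
            continue;
        if (entry_is_device(slots[i]))
            *has_device = true;
        out[found++] = slots[i];
    }
    return found;
}

int main(void)
{
    void *slots[4] = { (void *)0x2000, NULL,
                       (void *)(0x3000 | DEVICE_ENTRY_TAG), NULL };
    void *pages[4];
    bool has_device;
    size_t n = gang_lookup_model(slots, 4, pages, 4, &has_device);

    printf("found %zu entries, special present: %d\n", n, has_device);
    if (!has_device)
        return 0;   /* common case: no extra pass over the entries */
    /* otherwise go over the entries and migrate the special ones back */
    return 0;
}

With something like this, callers that today do pagevec_lookup and then
loop over every entry could skip the extra pass whenever has_device
comes back false.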

Cheers,
Jérôme

* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
  2014-05-06 18:13                       ` Jerome Glisse
@ 2014-05-06 18:22                         ` Linus Torvalds
  -1 siblings, 0 replies; 107+ messages in thread
From: Linus Torvalds @ 2014-05-06 18:22 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Peter Zijlstra, linux-mm, Linux Kernel Mailing List,
	linux-fsdevel, Mel Gorman, H. Peter Anvin, Andrew Morton,
	Linda Wang, Kevin E Martin, Jerome Glisse, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Jeff Law, Brendan Conoboy, Joe Donohue, Duncan Poole,
	Sherry Cheung, Subhash Gutti, John Hubbard, Mark Hairgrove,
	Lucien Dunning

On Tue, May 6, 2014 at 11:13 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
>
> I could grow the radix function to return some bool to avoid looping over for
> case where there is no special entry.

.. or even just a bool (or counter) associated with the mapping to
mark whether any special entries exist at all.

Also, the code to turn special entries back into pages is duplicated
over and over again, usually together with a "FIXME - what about
migration failure", so it would make sense to do that as its own
function.

But conceptually I don't hate it. I didn't much like having random
hmm_pagecache_migrate() calls in core vm code, and code like this

+                       hmm_pagecache_migrate(mapping, swap);
+                       spd.pages[page_nr] = find_get_page(mapping,
index + page_nr);

looks fundamentally racy, and in other places you seemed to assume
that all exceptional entries are always about hmm, which looked
questionable. But those are details.  The concept of putting a special
swap entry in the mapping radix trees I don't necessarily find
objectionable per se.
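
To sketch the shape of that (a toy model only, not kernel code;
mapping_model, nr_device_entries and fixup_device_entries are names made
up for the example):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define DEVICE_ENTRY_TAG 0x1UL

/* toy address_space: the slots plus a counter that is bumped whenever a
 * special entry is installed and dropped when one is removed */
struct mapping_model {
    void *slots[8];
    size_t nr_slots;
    unsigned long nr_device_entries;
};

static bool entry_is_device(void *entry)
{
    return ((uintptr_t)entry & DEVICE_ENTRY_TAG) != 0;
}

/* the one shared helper that turns special entries back into pages,
 * instead of duplicating that loop at every call site */
static void fixup_device_entries(struct mapping_model *m)
{
    if (!m->nr_device_entries)      /* 99.99% case: nothing to do */
        return;

    for (size_t i = 0; i < m->nr_slots; i++) {
        if (!m->slots[i] || !entry_is_device(m->slots[i]))
            continue;
        printf("migrating slot %zu back to a real page\n", i);
        m->slots[i] = NULL;         /* stand-in for the restored page */
        m->nr_device_entries--;
    }
}

int main(void)
{
    struct mapping_model m = { .nr_slots = 8 };

    m.slots[3] = (void *)(0x4000 | DEVICE_ENTRY_TAG);
    m.nr_device_entries++;          /* bumped when the entry is installed */

    fixup_device_entries(&m);       /* does the loop, exactly once      */
    fixup_device_entries(&m);       /* counter is zero, returns at once */
    return 0;
}

The counter keeps the common case to a single test, and the
migration-failure handling ends up in exactly one place instead of being
copy-pasted with a FIXME.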

           Linus

* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
  2014-05-06 18:02                       ` H. Peter Anvin
@ 2014-05-06 18:26                         ` Jerome Glisse
  -1 siblings, 0 replies; 107+ messages in thread
From: Jerome Glisse @ 2014-05-06 18:26 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Rik van Riel, Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, linux-fsdevel, Mel Gorman,
	Andrew Morton, Linda Wang, Kevin E Martin, Jerome Glisse,
	Andrea Arcangeli, Johannes Weiner, Larry Woodman, Dave Airlie,
	Jeff Law, Brendan Conoboy, Joe Donohue, Duncan Poole,
	Sherry Cheung, Subhash Gutti, John Hubbard, Mark Hairgrove,
	Lucien Dunning

On Tue, May 06, 2014 at 11:02:33AM -0700, H. Peter Anvin wrote:
> <ogerlitz@mellanox.com>,Sagi Grimberg <sagig@mellanox.com>,Shachar Raindel <raindel@mellanox.com>,Liran Liss <liranl@mellanox.com>,Roland Dreier <roland@purestorage.com>,"Sander, Ben" <ben.sander@amd.com>,"Stoner, Greg" <Greg.Stoner@amd.com>,"Bridgman, John" <John.Bridgman@amd.com>,"Mantor, Michael" <Michael.Mantor@amd.com>,"Blinzer, Paul" <Paul.Blinzer@amd.com>,"Morichetti, Laurent" <Laurent.Morichetti@amd.com>,"Deucher, Alexander" <Alexander.Deucher@amd.com>,"Gabbay, Oded" <Oded.Gabbay@amd.com>,Davidlohr Bueso <davidlohr@hp.com>
> Message-ID: <0bf54468-3ed1-4cd4-b771-4836c78dde14@email.android.com>
> 
> Nothing wrong with device-side memory, but not having it accessible by
> the CPU seems fundamentally brown from the point of view of unified
> memory addressing.

Unified memory addressing does not imply the CPU and GPU working on the
same set of data at the same time. So having part of the address space
accessible only by the GPU while it is actively working on it makes
sense. The GPU then gets low latency (no pcie bus) and enormous
bandwidth, and can thus perform the computation a lot faster.

Note that my patchset handles a cpu page fault while the data is inside
GPU memory and migrates the data back to system memory. So from the CPU
point of view it is just as if things were in some kind of swap device,
except that the swap device is actually doing some useful computation.
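
As a purely illustrative model of that "swap device" view (the names
hmm_device_copy_back and fault_in_model are made up here, this is not
the patchset code):

#include <stdio.h>

/* toy page table entry: either present in system memory, or marked as
 * "resident in device memory", which plays the role of a swap entry */
enum pte_state { PTE_NONE, PTE_PRESENT, PTE_IN_DEVICE };

struct pte_model {
    enum pte_state state;
};

/* ask the device to copy the data back; in the real patchset this is
 * where the cpu thread would block while the GPU does the copy */
static void hmm_device_copy_back(struct pte_model *pte)
{
    printf("copying data back from device memory\n");
    pte->state = PTE_PRESENT;
}

/* toy cpu fault path: a fault on a device-resident page behaves like a
 * swap-in, except the "swap device" was busy doing useful computation */
static void fault_in_model(struct pte_model *pte)
{
    switch (pte->state) {
    case PTE_PRESENT:
        return;                         /* nothing to do */
    case PTE_IN_DEVICE:
        hmm_device_copy_back(pte);      /* migrate back, then retry */
        return;
    case PTE_NONE:
        pte->state = PTE_PRESENT;       /* ordinary anonymous fault */
        return;
    }
}

int main(void)
{
    struct pte_model pte = { .state = PTE_IN_DEVICE };

    fault_in_model(&pte);   /* cpu touches data the GPU was using */
    fault_in_model(&pte);   /* now present, no migration needed   */
    return 0;
}

The real fault path of course has to deal with locking and with the
possibility that the copy back fails; the model only shows why this
looks like a swap-in from the CPU side.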

Also, on the cache coherency front: cache coherency has a cost, a very
high cost. This is why even on an APU (where the GPU and CPU are on the
same die and the mmus of the GPU and CPU have a privileged link; think
today's AMD APUs or next year's intel skylake) you have two memory
links, one cache coherent with the CPU and another that is not cache
coherent with the CPU. The latter link is much faster, and my patchset
is also intended to help take advantage of this second link
(http://developer.amd.com/wordpress/media/2013/06/1004_final.pdf).

Cheers,
Jérôme

> 
> On May 6, 2014 9:54:08 AM PDT, Jerome Glisse <j.glisse@gmail.com> wrote:
> >On Tue, May 06, 2014 at 12:47:16PM -0400, Rik van Riel wrote:
> >> On 05/06/2014 12:34 PM, Linus Torvalds wrote:
> >> > On Tue, May 6, 2014 at 9:30 AM, Rik van Riel <riel@redhat.com>
> >wrote:
> >> >>
> >> >> The GPU runs a lot faster when using video memory, instead
> >> >> of system memory, on the other side of the PCIe bus.
> >> > 
> >> > The nineties called, and they want their old broken model back.
> >> > 
> >> > Get with the times. No high-performance future GPU will ever run
> >> > behind the PCIe bus. We still have a few straggling historical
> >> > artifacts, but everybody knows where the future is headed.
> >> > 
> >> > They are already cache-coherent because flushing caches etc was too
> >> > damn expensive. They're getting more so.
> >> 
> >> I suppose that VRAM could simply be turned into a very high
> >> capacity CPU cache for the GPU, for the case where people
> >> want/need an add-on card.
> >> 
> >> With a few hundred MB of "CPU cache" on the video card, we
> >> could offload processing to the GPU very easily, without
> >> having to worry about multiple address or page table formats
> >> on the CPU side.
> >> 
> >> A new generation of GPU hardware seems to come out every
> >> six months or so, so I guess we could live with TLB
> >> invalidations to the first generations of hardware being
> >> comically slow :)
> >> 
> >
> >I do not want to speak for any GPU manufacturer but i think it is safe
> >to say that there will be dedicated memory for GPU for a long time. It
> >is not going anywhere soon and it is a lot more than few hundred MB,
> >think several GB. If you think about 4k, 8k screen you really gonna
> >want
> >8GB at least on desktop computer and for compute you will likely see
> >16GB or 32GB as common size.
> >
> >Again i stress that there is nothing on the horizon that let me believe
> >that regular memory associated to CPU will ever come close to the
> >bandwith
> >that exist with memory associated to GPU. It is already more than 10
> >times
> >faster on GPU and as far as i know the gap will grow even more in the
> >next
> >generation.
> >
> >So dedicated memory to gpu should not be discarded as something that is
> >vanishing quite the contrary it should be acknowledge as something that
> >is
> >here to stay a lot longer afaict.
> >
> >Cheers,
> >Jérôme
> 
> -- 
> Sent from my mobile phone.  Please pardon brevity and lack of formatting.

* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
  2014-05-06 18:22                         ` Linus Torvalds
@ 2014-05-06 18:38                           ` Jerome Glisse
  -1 siblings, 0 replies; 107+ messages in thread
From: Jerome Glisse @ 2014-05-06 18:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, linux-mm, Linux Kernel Mailing List,
	linux-fsdevel, Mel Gorman, H. Peter Anvin, Andrew Morton,
	Linda Wang, Kevin E Martin, Jerome Glisse, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Jeff Law, Brendan Conoboy, Joe Donohue, Duncan Poole,
	Sherry Cheung, Subhash Gutti, John Hubbard, Mark Hairgrove,
	Lucien Dunning

On Tue, May 06, 2014 at 11:22:48AM -0700, Linus Torvalds wrote:
> On Tue, May 6, 2014 at 11:13 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
> >
> > I could grow the radix function to return some bool to avoid looping over for
> > case where there is no special entry.
> 
> .. or even just a bool (or counter) associated with the mapping to
> mark whether any special entries exist at all.
> 
> Also, the code to turn special entries is duplicated over and over
> again, usually together with a "FIXME - what about migration failure",
> so it would make sense to do that as it's own function.
> 

Migration failure is when something goes horribly wrong and the GPU cannot
copy the page back to system memory. The associated philosophical question
is what to do about the other processes. Make them SIGBUS?

The answer so far is to treat this like any CPU thread that crashes after
only half writing the content it wanted into the page. So other threads
will use the latest version of the data we have. The thread that triggered
the migration to GPU memory would see a SIGBUS (those threads are GPU
aware as they use some form of GPU api such as OpenCL).

> But conceptually I don't hate it. I didn't much like having random
> hmm_pagecache_migrate() calls in core vm code, and code like this
> 
> +                       hmm_pagecache_migrate(mapping, swap);
> +                       spd.pages[page_nr] = find_get_page(mapping,
> index + page_nr);
> 
> looks fundamentally racy, and in other places you seemed to assume
> that all exceptional entries are always about hmm, which looked
> questionable. But those are details.  The concept of putting a special
> swap entry in the mapping radix trees I don't necessarily find
> objectionable per se.
> 
>            Linus

So far only shmem uses special entries, and my patchset did not support it
as I wanted to vet the design first.

hmm_pagecache_migrate is the function that triggers migration back to
system memory. Once again the expectation is that such a code path will
never be called: only the process that uses the GPU and the mmaped file will
ever access those pages, and this process knows that it should not access
them while they are on the GPU, so if it does it has to suffer the
consequences.
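
For reference, a minimal sketch of the per-mapping counter plus single-helper
idea you suggest; all names below are made up, this is not code from the
patchset:

/*
 * Minimal sketch, made-up names.  A per-mapping counter lets callers skip
 * the radix walk entirely when no special entry exists, and the "turn a
 * special entry back into a page" logic (including what to do on migration
 * failure) lives in exactly one helper instead of being open coded at each
 * call site.
 */
struct mapping_stub {
	unsigned long nr_hmm_entries;	/* bumped when hmm inserts a special
					 * entry, dropped when it removes one */
};

struct page_stub;

/* single place that knows how to migrate the data back; stubbed out here */
static struct page_stub *hmm_migrate_entry_back(struct mapping_stub *mapping,
						unsigned long index)
{
	(void)mapping;
	(void)index;
	return NULL;
}

static struct page_stub *hmm_find_get_page(struct mapping_stub *mapping,
					   unsigned long index)
{
	if (!mapping->nr_hmm_entries)
		return NULL;	/* fast path: nothing special in this mapping,
				 * caller falls back to the normal lookup */
	return hmm_migrate_entry_back(mapping, index);
}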

Thanks a lot for all the feedback, much appreciated.

Cheers,
Jérôme

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
@ 2014-05-06 18:38                           ` Jerome Glisse
  0 siblings, 0 replies; 107+ messages in thread
From: Jerome Glisse @ 2014-05-06 18:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, linux-mm, Linux Kernel Mailing List,
	linux-fsdevel, Mel Gorman, H. Peter Anvin, Andrew Morton,
	Linda Wang, Kevin E Martin, Jerome Glisse, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Jeff Law, Brendan Conoboy, Joe Donohue, Duncan Poole,
	Sherry Cheung, Subhash Gutti, John Hubbard, Mark Hairgrove,
	Lucien Dunning, Cameron Buschardt, Arvind Gopalakrishnan,
	Haggai Eran, Or Gerlitz, Sagi Grimberg, Shachar Raindel,
	Liran Liss, Roland Dreier, Sander, Ben, Stoner, Greg, Bridgman,
	John, Mantor, Michael, Blinzer, Paul, Morichetti, Laurent,
	Deucher, Alexander, Gabbay, Oded, Davidlohr Bueso

On Tue, May 06, 2014 at 11:22:48AM -0700, Linus Torvalds wrote:
> On Tue, May 6, 2014 at 11:13 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
> >
> > I could grow the radix function to return some bool to avoid looping over for
> > case where there is no special entry.
> 
> .. or even just a bool (or counter) associated with the mapping to
> mark whether any special entries exist at all.
> 
> Also, the code to turn special entries is duplicated over and over
> again, usually together with a "FIXME - what about migration failure",
> so it would make sense to do that as it's own function.
> 

Migration failure is when something goes horribly wrong and the GPU cannot
copy the page back to system memory. The associated philosophical question
is what to do about the other processes. Make them SIGBUS?

The answer so far is to treat this like any CPU thread that crashes after
only half writing the content it wanted into the page. So other threads
will use the latest version of the data we have. The thread that triggered
the migration to GPU memory would see a SIGBUS (those threads are GPU
aware as they use some form of GPU api such as OpenCL).

> But conceptually I don't hate it. I didn't much like having random
> hmm_pagecache_migrate() calls in core vm code, and code like this
> 
> +                       hmm_pagecache_migrate(mapping, swap);
> +                       spd.pages[page_nr] = find_get_page(mapping,
> index + page_nr);
> 
> looks fundamentally racy, and in other places you seemed to assume
> that all exceptional entries are always about hmm, which looked
> questionable. But those are details.  The concept of putting a special
> swap entry in the mapping radix trees I don't necessarily find
> objectionable per se.
> 
>            Linus

So far only shmem uses special entries, and my patchset did not support it
as I wanted to vet the design first.

hmm_pagecache_migrate is the function that triggers migration back to
system memory. Once again the expectation is that such a code path will
never be called: only the process that uses the GPU and the mmaped file will
ever access those pages, and this process knows that it should not access
them while they are on the GPU, so if it does it has to suffer the
consequences.

Thanks a lot for all the feedback, much appreciated.

Cheers,
Jerome

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
  2014-05-06 16:34                 ` Linus Torvalds
@ 2014-05-06 22:44                   ` David Airlie
  -1 siblings, 0 replies; 107+ messages in thread
From: David Airlie @ 2014-05-06 22:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, Jerome Glisse, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, linux-fsdevel, Mel Gorman,
	H. Peter Anvin, Andrew Morton, Linda Wang, Kevin E Martin,
	Jerome Glisse, Andrea Arcangeli, Johannes Weiner, Larry Woodman,
	Jeff Law, Brendan Conoboy, Joe Donohue, Duncan Poole,
	Sherry Cheung, Subhash Gutti, John Hubbard


> 
> On Tue, May 6, 2014 at 9:30 AM, Rik van Riel <riel@redhat.com> wrote:
> >
> > The GPU runs a lot faster when using video memory, instead
> > of system memory, on the other side of the PCIe bus.
> 
> The nineties called, and they want their old broken model back.
> 
> Get with the times. No high-performance future GPU will ever run
> behind the PCIe bus. We still have a few straggling historical
> artifacts, but everybody knows where the future is headed.
> 
> They are already cache-coherent because flushing caches etc was too
> damn expensive. They're getting more so.

The future might be more closely coupled, but it still might not be cache coherent; it might also just be a faster PCIe, considering the current one is a lot faster than the 90s PCI you talk about.

No current high-performance GPU runs in front of the PCIe bus. Intel is still catching up to the performance level of everyone else, and others remain ahead. Even Intel makes MIC cards for compute that put stuff on the other side of the PCIe divide.

Dave.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
@ 2014-05-06 22:44                   ` David Airlie
  0 siblings, 0 replies; 107+ messages in thread
From: David Airlie @ 2014-05-06 22:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, Jerome Glisse, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, linux-fsdevel, Mel Gorman,
	H. Peter Anvin, Andrew Morton, Linda Wang, Kevin E Martin,
	Jerome Glisse, Andrea Arcangeli, Johannes Weiner, Larry Woodman,
	Jeff Law, Brendan Conoboy, Joe Donohue, Duncan Poole,
	Sherry Cheung, Subhash Gutti, John Hubbard, Mark Hairgrove,
	Lucien Dunning, Cameron Buschardt, Arvind Gopalakrishnan,
	Haggai Eran, Or Gerlitz, Sagi Grimberg, Shachar Raindel,
	Liran Liss, Roland Dreier, Ben Sander, Greg Stoner,
	John Bridgman, Michael Mantor, Paul Blinzer, Laurent Morichetti,
	Alexander Deucher, Oded Gabbay, Davidlohr Bueso


> 
> On Tue, May 6, 2014 at 9:30 AM, Rik van Riel <riel@redhat.com> wrote:
> >
> > The GPU runs a lot faster when using video memory, instead
> > of system memory, on the other side of the PCIe bus.
> 
> The nineties called, and they want their old broken model back.
> 
> Get with the times. No high-performance future GPU will ever run
> behind the PCIe bus. We still have a few straggling historical
> artifacts, but everybody knows where the future is headed.
> 
> They are already cache-coherent because flushing caches etc was too
> damn expensive. They're getting more so.

The future might be more closely coupled, but it still might not be cache coherent; it might also just be a faster PCIe, considering the current one is a lot faster than the 90s PCI you talk about.

No current high-performance GPU runs in front of the PCIe bus. Intel is still catching up to the performance level of everyone else, and others remain ahead. Even Intel makes MIC cards for compute that put stuff on the other side of the PCIe divide.

Dave.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
  2014-05-06 10:29   ` Peter Zijlstra
@ 2014-05-07  2:33     ` Davidlohr Bueso
  -1 siblings, 0 replies; 107+ messages in thread
From: Davidlohr Bueso @ 2014-05-07  2:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: j.glisse, linux-mm, linux-kernel, linux-fsdevel, Mel Gorman,
	H. Peter Anvin, Andrew Morton, Linda Wang, Kevin E Martin,
	Jerome Glisse, Andrea Arcangeli, Johannes Weiner, Larry Woodman,
	Rik van Riel, Dave Airlie, Jeff Law, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan

On Tue, 2014-05-06 at 12:29 +0200, Peter Zijlstra wrote:
> So you forgot to CC Linus, Linus has expressed some dislike for
> preemptible mmu_notifiers in the recent past:
> 
>   https://lkml.org/lkml/2013/9/30/385

I'm glad this came up again.

So I've been running benchmarks (mostly aim7, which nicely exercises our
locks) comparing my recent v4 for rwsem optimistic spinning against
previous implementation ideas for the anon-vma lock, mostly:

- rwsem (currently)
- rwlock_t
- qrwlock_t
- rwsem+optspin

Of course, *any* change provides significant improvement in throughput
for several workloads, by avoiding blocking -- there are more
performance numbers in the different patches. This is fairly obvious.

What is perhaps not so obvious is that rwsem+optimistic spinning beats
all others, including the improved qrwlock from Waiman and Peter. This
is mostly because of the idea of cancelable MCS, which was mimic'ed from
mutexes. The delta in most cases is around +10-15%, which is
non-trivial.

I mention this because, from a performance PoV, we'll stop caring so much
about the type of lock we require in the notifier-related code. So while
this is not conclusive, I'm not as opposed to keeping the locks blocking
as I once was. Now this might still imply things like poor design
choices, but that's neither here nor there.

/me sees Sagi smiling ;)

Thanks,
Davidlohr

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
@ 2014-05-07  2:33     ` Davidlohr Bueso
  0 siblings, 0 replies; 107+ messages in thread
From: Davidlohr Bueso @ 2014-05-07  2:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: j.glisse, linux-mm, linux-kernel, linux-fsdevel, Mel Gorman,
	H. Peter Anvin, Andrew Morton, Linda Wang, Kevin E Martin,
	Jerome Glisse, Andrea Arcangeli, Johannes Weiner, Larry Woodman,
	Rik van Riel, Dave Airlie, Jeff Law, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Or Gerlitz, Sagi Grimberg,
	Shachar Raindel, Liran Liss, Roland Dreier, Sander, Ben, Stoner,
	Greg, Bridgman, John, Mantor, Michael, Blinzer, Paul, Morichetti,
	Laurent, Deucher, Alexander, Gabbay, Oded, Linus Torvalds

On Tue, 2014-05-06 at 12:29 +0200, Peter Zijlstra wrote:
> So you forgot to CC Linus, Linus has expressed some dislike for
> preemptible mmu_notifiers in the recent past:
> 
>   https://lkml.org/lkml/2013/9/30/385

I'm glad this came up again.

So I've been running benchmarks (mostly aim7, which nicely exercises our
locks) comparing my recent v4 for rwsem optimistic spinning against
previous implementation ideas for the anon-vma lock, mostly:

- rwsem (currently)
- rwlock_t
- qrwlock_t
- rwsem+optspin

Of course, *any* change provides significant improvement in throughput
for several workloads, by avoiding blocking -- there are more
performance numbers in the different patches. This is fairly obvious.

What is perhaps not so obvious is that rwsem+optimistic spinning beats
all others, including the improved qrwlock from Waiman and Peter. This
is mostly because of the idea of cancelable MCS, which was mimic'ed from
mutexes. The delta in most cases is around +10-15%, which is
non-trivial.

I mention this because, from a performance PoV, we'll stop caring so much
about the type of lock we require in the notifier-related code. So while
this is not conclusive, I'm not as opposed to keeping the locks blocking
as I once was. Now this might still imply things like poor design
choices, but that's neither here nor there.

/me sees Sagi smiling ;)

Thanks,
Davidlohr

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
  2014-05-06 16:18               ` Jerome Glisse
@ 2014-05-07  7:14                 ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 107+ messages in thread
From: Benjamin Herrenschmidt @ 2014-05-07  7:14 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, linux-fsdevel, Mel Gorman,
	H. Peter Anvin, Andrew Morton, Linda Wang, Kevin E Martin,
	Jerome Glisse, Andrea Arcangeli, Johannes Weiner, Larry Woodman,
	Rik van Riel, Dave Airlie, Jeff Law, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung

On Tue, 2014-05-06 at 12:18 -0400, Jerome Glisse wrote:
> 
> I do understand that i was pointing out that if i move to, tlb which i
> am fine with, i will still need to sleep there. That's all i wanted to
> stress, i did not wanted force using mmu_notifier, i am fine with them
> becoming atomic as long as i have a place where i can intercept cpu
> page table update and propagate them to device mmu.

Your MMU notifier can maintain a map of "dirty" PTEs and do the
actual synchronization in the subsequent flush_tlb_*; you need to add
hooks there, but it's much less painful than in the notifiers.
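
Something along these lines, purely as a sketch with made-up names (nothing
here is real kernel or driver code):

/*
 * Sketch only.  The notifier callback just logs the range that changed;
 * the device invalidation is issued later, from the flush_tlb_* hook,
 * outside the atomic notifier context.
 */
#define HMM_LOG_SIZE 16

struct hmm_dirty_log {
	unsigned long start[HMM_LOG_SIZE];
	unsigned long end[HMM_LOG_SIZE];
	unsigned int  n;
};

/* called from the (atomic) mmu notifier: only record the range */
static void hmm_note_dirty(struct hmm_dirty_log *log,
			   unsigned long start, unsigned long end)
{
	if (log->n < HMM_LOG_SIZE) {
		log->start[log->n] = start;
		log->end[log->n] = end;
		log->n++;
	}
	/* a full log would force a synchronous flush in real code */
}

/* pretend device invalidation, stubbed out */
static void hmm_device_invalidate(unsigned long start, unsigned long end)
{
	(void)start;
	(void)end;
}

/* called from the flush_tlb_* hook, where sleeping may be possible */
static void hmm_flush_dirty(struct hmm_dirty_log *log)
{
	unsigned int i;

	for (i = 0; i < log->n; i++)
		hmm_device_invalidate(log->start[i], log->end[i]);
	log->n = 0;
}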

*However* Linus, even then we can't sleep. We do things like
ptep_clear_flush() that need the PTL and have the synchronous flush
semantics.

Sure, today we wait, possibly for a long time, with IPIs, but we do not
sleep. Jerome would have to operate within a similar context. No sleep
for you :)

Cheers,
Ben.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
@ 2014-05-07  7:14                 ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 107+ messages in thread
From: Benjamin Herrenschmidt @ 2014-05-07  7:14 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, linux-fsdevel, Mel Gorman,
	H. Peter Anvin, Andrew Morton, Linda Wang, Kevin E Martin,
	Jerome Glisse, Andrea Arcangeli, Johannes Weiner, Larry Woodman,
	Rik van Riel, Dave Airlie, Jeff Law, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Or Gerlitz, Sagi Grimberg,
	Shachar Raindel, Liran Liss, Roland Dreier, Sander, Ben, Stoner,
	Greg, Bridgman, John, Mantor, Michael, Blinzer, Paul, Morichetti,
	Laurent, Deucher, Alexander, Gabbay, Oded, Davidlohr Bueso

On Tue, 2014-05-06 at 12:18 -0400, Jerome Glisse wrote:
> 
> I do understand that i was pointing out that if i move to, tlb which i
> am fine with, i will still need to sleep there. That's all i wanted to
> stress, i did not wanted force using mmu_notifier, i am fine with them
> becoming atomic as long as i have a place where i can intercept cpu
> page table update and propagate them to device mmu.

Your MMU notifier can maintain a map of "dirty" PTEs and do the
actual synchronization in the subsequent flush_tlb_*; you need to add
hooks there, but it's much less painful than in the notifiers.

*However* Linus, even then we can't sleep. We do things like
ptep_clear_flush() that need the PTL and have the synchronous flush
semantics.

Sure, today we wait, possibly for a long time, with IPIs, but we do not
sleep. Jerome would have to operate within a similar context. No sleep
for you :)

Cheers,
Ben.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
  2014-05-06 16:32                 ` Linus Torvalds
@ 2014-05-07  7:18                   ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 107+ messages in thread
From: Benjamin Herrenschmidt @ 2014-05-07  7:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jerome Glisse, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, linux-fsdevel, Mel Gorman,
	H. Peter Anvin, Andrew Morton, Linda Wang, Kevin E Martin,
	Jerome Glisse, Andrea Arcangeli, Johannes Weiner, Larry Woodman,
	Rik van Riel, Dave Airlie, Jeff Law, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash

On Tue, 2014-05-06 at 09:32 -0700, Linus Torvalds wrote:
> and what I think a GPU flush has to do is to do the actual flushes
> when asked to (because that's what it will need to do to work with a
> real TLB eventually), but if there's some crazy asynchronous
> acknowledge thing from hardware, it's possible to perhaps wait for
> that in the final phase (*before* we free the pages we gathered).

Hrm, difficult. We have some pretty strong assumptions that
ptep_clear_flush() is fully synchronous as far as I can tell... ie, your
trick would work for the unmap case but everything else is still
problematic.

Ben.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
@ 2014-05-07  7:18                   ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 107+ messages in thread
From: Benjamin Herrenschmidt @ 2014-05-07  7:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jerome Glisse, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, linux-fsdevel, Mel Gorman,
	H. Peter Anvin, Andrew Morton, Linda Wang, Kevin E Martin,
	Jerome Glisse, Andrea Arcangeli, Johannes Weiner, Larry Woodman,
	Rik van Riel, Dave Airlie, Jeff Law, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Or Gerlitz, Sagi Grimberg,
	Shachar Raindel, Liran Liss, Roland Dreier, Sander, Ben, Stoner,
	Greg, Bridgman, John, Mantor, Michael, Blinzer, Paul, Morichetti,
	Laurent, Deucher, Alexander, Gabbay, Oded, Davidlohr Bueso

On Tue, 2014-05-06 at 09:32 -0700, Linus Torvalds wrote:
> and what I think a GPU flush has to do is to do the actual flushes
> when asked to (because that's what it will need to do to work with a
> real TLB eventually), but if there's some crazy asynchronous
> acknowledge thing from hardware, it's possible to perhaps wait for
> that in the final phase (*before* we free the pages we gathered).

Hrm, difficult. We have some pretty strong assumptions that
ptep_clear_flush() is fully synchronous as far as I can tell... ie, your
trick would work for the unmap case but everything else is still
problematic.

Ben.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
  2014-05-07  7:14                 ` Benjamin Herrenschmidt
@ 2014-05-07 12:39                   ` Jerome Glisse
  -1 siblings, 0 replies; 107+ messages in thread
From: Jerome Glisse @ 2014-05-07 12:39 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, linux-fsdevel, Mel Gorman,
	H. Peter Anvin, Andrew Morton, Linda Wang, Kevin E Martin,
	Jerome Glisse, Andrea Arcangeli, Johannes Weiner, Larry Woodman,
	Rik van Riel, Dave Airlie, Jeff Law, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Mark Hairgrove

On Wed, May 07, 2014 at 05:14:52PM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2014-05-06 at 12:18 -0400, Jerome Glisse wrote:
> > 
> > I do understand that i was pointing out that if i move to, tlb which i
> > am fine with, i will still need to sleep there. That's all i wanted to
> > stress, i did not wanted force using mmu_notifier, i am fine with them
> > becoming atomic as long as i have a place where i can intercept cpu
> > page table update and propagate them to device mmu.
> 
> Your MMU notifier can maintain a map of "dirty" PTEs and you do the
> actual synchronization in the subsequent flush_tlb_* , you need to add
> hooks there but it's much less painful than in the notifiers.

Well, getting the dirty info back from the GPU also requires sleeping. Maybe
I should explain how it is supposed to work. A GPU has several command buffers
and executes the instructions inside those command buffers in sequential order.
To update the GPU mmu you need to schedule a command into one of those command
buffers, but when you do so you do not know how many commands are in front of
you and how long it will take the GPU to get to your command.

Yes, the GPUs this patchset targets have preemption, but it is not as flexible
as CPU preemption: there is no kernel thread running and scheduling, all the
scheduling is done in hardware. So the preemption is more limited than on the
CPU.

That is why any update or information retrieval from the GPU needs to go
through some command buffer, and no matter how high priority the command
buffer for mmu updates is, it can still take a long time (think flushing
thousands of GPU threads and saving their context).
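
To make that concrete, a toy model of why the mmu update ends up sleeping;
every name below is made up, this is not driver code:

/*
 * Toy model.  A GPU mmu invalidation is just one more command in a ring
 * buffer; the CPU queues it, gets a fence sequence number back, and has no
 * choice but to wait until the GPU has executed everything in front of it.
 */
struct gpu_ring {
	unsigned int head;		/* advanced by the hardware */
	unsigned int tail;		/* advanced by the CPU when queueing */
	unsigned long long fence_seq;	/* last fence sequence handed out */
};

static unsigned long long gpu_queue_invalidate(struct gpu_ring *ring,
					       unsigned long start,
					       unsigned long end)
{
	(void)start;
	(void)end;
	/* real code writes an INVALIDATE(start, end) packet at ring->tail */
	ring->tail++;
	return ++ring->fence_seq;	/* signaled once the packet executes */
}

static void gpu_wait_fence(struct gpu_ring *ring, unsigned long long seq)
{
	(void)seq;
	/*
	 * This wait is the expensive part (possibly flushing thousands of
	 * GPU threads and saving their context first), which is why it has
	 * to be able to sleep instead of spinning.
	 */
	while (ring->head != ring->tail)
		ring->head++;		/* placeholder for a real wait_event() */
}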

> 
> *However* Linus, even then we can't sleep. We do things like
> ptep_clear_flush() that need the PTL and have the synchronous flush
> semantics.
> 
> Sure, today we wait, possibly for a long time, with IPIs, but we do not
> sleep. Jerome would have to operate within a similar context. No sleep
> for you :)
> 
> Cheers,
> Ben.

So for ptep_clear_flush my idea is to have a special lru for pages that
are in use by the GPU. This will prevent page reclamation's try_to_unmap
and thus the ptep_clear_flush. I would block ksm, so that is again another
user that would not do ptep_clear_flush. I would need to fix remap_file_pages,
either adding some callback there or refactoring the unmap and tlb flushing.

Finally for page migration I see several solutions: forbid it (easy for me
but likely not what we want), have special code inside the migrate code to
handle pages in use by a device, or have special code inside try_to_unmap to
handle it.

I think these are all the current users of ptep_clear_flush and derivatives
that flush the tlb while holding a spinlock.

Note that for a special lru, or even special handling of pages in use by a
device, I need a new page flag. Would this be acceptable?

For the special lru I was thinking of doing it per device, as each device
is unlikely to constantly address all the pages it has mapped anyway. A simple
lru list would do, probably with some helper for the device driver to mark a
page accessed so that frequently used pages are not reclaimed.

But a global list is fine as well and simplifies the case where different
devices use the same pages.
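
Roughly what I have in mind for the per-device flavour, with made-up names
and an open-coded list purely for illustration:

/*
 * Sketch only.  Pages mapped by a device sit on the device's own lru so
 * normal reclaim never reaches try_to_unmap() / ptep_clear_flush() for
 * them; the driver bumps a page to the front when the device touches it,
 * so hot pages stay away from the cold end.
 */
struct dev_page {
	struct dev_page *prev, *next;	/* stand-in for a list_head lru link */
};

struct hmm_device {
	struct dev_page lru;		/* circular list, hot end at lru.next */
};

static void hmm_device_init(struct hmm_device *dev)
{
	dev->lru.next = dev->lru.prev = &dev->lru;	/* empty list */
}

static void dev_lru_del(struct dev_page *p)
{
	p->prev->next = p->next;
	p->next->prev = p->prev;
}

static void dev_lru_add_front(struct hmm_device *dev, struct dev_page *p)
{
	p->next = dev->lru.next;
	p->prev = &dev->lru;
	dev->lru.next->prev = p;
	dev->lru.next = p;
}

/* helper the driver calls whenever the device accesses the page */
static void hmm_mark_page_accessed(struct hmm_device *dev, struct dev_page *p)
{
	dev_lru_del(p);
	dev_lru_add_front(dev, p);
}

/* reclaim for device pages would walk from dev->lru.prev, the cold end */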

Cheers,
Jérôme Glisse

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
@ 2014-05-07 12:39                   ` Jerome Glisse
  0 siblings, 0 replies; 107+ messages in thread
From: Jerome Glisse @ 2014-05-07 12:39 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, linux-fsdevel, Mel Gorman,
	H. Peter Anvin, Andrew Morton, Linda Wang, Kevin E Martin,
	Jerome Glisse, Andrea Arcangeli, Johannes Weiner, Larry Woodman,
	Rik van Riel, Dave Airlie, Jeff Law, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Or Gerlitz, Sagi Grimberg,
	Shachar Raindel, Liran Liss, Roland Dreier, Sander, Ben, Stoner,
	Greg, Bridgman, John, Mantor, Michael, Blinzer, Paul, Morichetti,
	Laurent, Deucher, Alexander, Gabbay, Oded, Davidlohr Bueso

On Wed, May 07, 2014 at 05:14:52PM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2014-05-06 at 12:18 -0400, Jerome Glisse wrote:
> > 
> > I do understand that i was pointing out that if i move to, tlb which i
> > am fine with, i will still need to sleep there. That's all i wanted to
> > stress, i did not wanted force using mmu_notifier, i am fine with them
> > becoming atomic as long as i have a place where i can intercept cpu
> > page table update and propagate them to device mmu.
> 
> Your MMU notifier can maintain a map of "dirty" PTEs and you do the
> actual synchronization in the subsequent flush_tlb_* , you need to add
> hooks there but it's much less painful than in the notifiers.

Well, getting the dirty info back from the GPU also requires sleeping. Maybe
I should explain how it is supposed to work. A GPU has several command buffers
and executes the instructions inside those command buffers in sequential order.
To update the GPU mmu you need to schedule a command into one of those command
buffers, but when you do so you do not know how many commands are in front of
you and how long it will take the GPU to get to your command.

Yes, the GPUs this patchset targets have preemption, but it is not as flexible
as CPU preemption: there is no kernel thread running and scheduling, all the
scheduling is done in hardware. So the preemption is more limited than on the
CPU.

That is why any update or information retrieval from the GPU needs to go
through some command buffer, and no matter how high priority the command
buffer for mmu updates is, it can still take a long time (think flushing
thousands of GPU threads and saving their context).

> 
> *However* Linus, even then we can't sleep. We do things like
> ptep_clear_flush() that need the PTL and have the synchronous flush
> semantics.
> 
> Sure, today we wait, possibly for a long time, with IPIs, but we do not
> sleep. Jerome would have to operate within a similar context. No sleep
> for you :)
> 
> Cheers,
> Ben.

So for ptep_clear_flush my idea is to have a special lru for pages that
are in use by the GPU. This will prevent page reclamation's try_to_unmap
and thus the ptep_clear_flush. I would block ksm, so that is again another
user that would not do ptep_clear_flush. I would need to fix remap_file_pages,
either adding some callback there or refactoring the unmap and tlb flushing.

Finally for page migration I see several solutions: forbid it (easy for me
but likely not what we want), have special code inside the migrate code to
handle pages in use by a device, or have special code inside try_to_unmap to
handle it.

I think these are all the current users of ptep_clear_flush and derivatives
that flush the tlb while holding a spinlock.

Note that for a special lru, or even special handling of pages in use by a
device, I need a new page flag. Would this be acceptable?

For the special lru I was thinking of doing it per device, as each device
is unlikely to constantly address all the pages it has mapped anyway. A simple
lru list would do, probably with some helper for the device driver to mark a
page accessed so that frequently used pages are not reclaimed.

But a global list is fine as well and simplifies the case where different
devices use the same pages.

Cheers,
Jerome Glisse

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
  2014-05-07  2:33     ` Davidlohr Bueso
@ 2014-05-07 13:00       ` Peter Zijlstra
  -1 siblings, 0 replies; 107+ messages in thread
From: Peter Zijlstra @ 2014-05-07 13:00 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: j.glisse, linux-mm, linux-kernel, linux-fsdevel, Mel Gorman,
	H. Peter Anvin, Andrew Morton, Linda Wang, Kevin E Martin,
	Jerome Glisse, Andrea Arcangeli, Johannes Weiner, Larry Woodman,
	Rik van Riel, Dave Airlie, Jeff Law, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan

[-- Attachment #1: Type: text/plain, Size: 603 bytes --]

On Tue, May 06, 2014 at 07:33:07PM -0700, Davidlohr Bueso wrote:

> So I've been running benchmarks (mostly aim7, which nicely exercises our
> locks) comparing my recent v4 for rwsem optimistic spinning against
> previous implementation ideas for the anon-vma lock, mostly:

> - rwlock_t
> - qrwlock_t

Which reminds me; can you provide the numbers for rwlock_t vs qrwlock_t
in a numeric form so I can include them in the qrwlock_t changelog.

That way I can queue those patches for inclusion; I think we want a fair
rwlock_t if we can show (and your graphs do, iirc) that it doesn't cost us
performance.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
@ 2014-05-07 13:00       ` Peter Zijlstra
  0 siblings, 0 replies; 107+ messages in thread
From: Peter Zijlstra @ 2014-05-07 13:00 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: j.glisse, linux-mm, linux-kernel, linux-fsdevel, Mel Gorman,
	H. Peter Anvin, Andrew Morton, Linda Wang, Kevin E Martin,
	Jerome Glisse, Andrea Arcangeli, Johannes Weiner, Larry Woodman,
	Rik van Riel, Dave Airlie, Jeff Law, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Or Gerlitz, Sagi Grimberg,
	Shachar Raindel, Liran Liss, Roland Dreier, Sander, Ben, Stoner,
	Greg, Bridgman, John, Mantor, Michael, Blinzer, Paul, Morichetti,
	Laurent, Deucher, Alexander, Gabbay, Oded, Linus Torvalds

[-- Attachment #1: Type: text/plain, Size: 603 bytes --]

On Tue, May 06, 2014 at 07:33:07PM -0700, Davidlohr Bueso wrote:

> So I've been running benchmarks (mostly aim7, which nicely exercises our
> locks) comparing my recent v4 for rwsem optimistic spinning against
> previous implementation ideas for the anon-vma lock, mostly:

> - rwlock_t
> - qrwlock_t

Which reminds me; can you provide the numbers for rwlock_t vs qrwlock_t
in a numeric form so I can include them in the qrwlock_t changelog.

That way I can queue those patches for inclusion; I think we want a fair
rwlock_t if we can show (and your graphs do, iirc) that it doesn't cost us
performance.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
  2014-05-07  2:33     ` Davidlohr Bueso
@ 2014-05-07 16:21       ` Linus Torvalds
  -1 siblings, 0 replies; 107+ messages in thread
From: Linus Torvalds @ 2014-05-07 16:21 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Peter Zijlstra, Jerome Glisse, linux-mm,
	Linux Kernel Mailing List, linux-fsdevel, Mel Gorman,
	H. Peter Anvin, Andrew Morton, Linda Wang, Kevin E Martin,
	Jerome Glisse, Andrea Arcangeli, Johannes Weiner, Larry Woodman,
	Rik van Riel, Dave Airlie, Jeff Law, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Mark Hairgrove

On Tue, May 6, 2014 at 7:33 PM, Davidlohr Bueso <davidlohr@hp.com> wrote:
> What is perhaps not so obvious is that rwsem+optimistic spinning beats
> all others, including the improved qrwlock from Waiman and Peter. This
> is mostly because of the idea of cancelable MCS, which was mimic'ed from
> mutexes. The delta in most cases is around +10-15%, which is non
> trivial.

Ahh, excellent news. That way I don't have to worry about the anonvma
lock any more. I'll just forget about it right now, in fact.

          Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
@ 2014-05-07 16:21       ` Linus Torvalds
  0 siblings, 0 replies; 107+ messages in thread
From: Linus Torvalds @ 2014-05-07 16:21 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Peter Zijlstra, Jerome Glisse, linux-mm,
	Linux Kernel Mailing List, linux-fsdevel, Mel Gorman,
	H. Peter Anvin, Andrew Morton, Linda Wang, Kevin E Martin,
	Jerome Glisse, Andrea Arcangeli, Johannes Weiner, Larry Woodman,
	Rik van Riel, Dave Airlie, Jeff Law, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Or Gerlitz, Sagi Grimberg,
	Shachar Raindel, Liran Liss, Roland Dreier, Sander, Ben, Stoner,
	Greg, Bridgman, John, Mantor, Michael, Blinzer, Paul, Morichetti,
	Laurent, Deucher, Alexander, Gabbay, Oded

On Tue, May 6, 2014 at 7:33 PM, Davidlohr Bueso <davidlohr@hp.com> wrote:
> What is perhaps not so obvious is that rwsem+optimistic spinning beats
> all others, including the improved qrwlock from Waiman and Peter. This
> is mostly because of the idea of cancelable MCS, which was mimic'ed from
> mutexes. The delta in most cases is around +10-15%, which is non
> trivial.

Ahh, excellent news. That way I don't have to worry about the anonvma
lock any more. I'll just forget about it right now, in fact.

          Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
  2014-05-07 13:00       ` Peter Zijlstra
@ 2014-05-07 17:34         ` Davidlohr Bueso
  -1 siblings, 0 replies; 107+ messages in thread
From: Davidlohr Bueso @ 2014-05-07 17:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: j.glisse, linux-mm, linux-kernel, linux-fsdevel, Mel Gorman,
	H. Peter Anvin, Andrew Morton, Linda Wang, Kevin E Martin,
	Jerome Glisse, Andrea Arcangeli, Johannes Weiner, Larry Woodman,
	Rik van Riel, Dave Airlie, Jeff Law, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan

On Wed, 2014-05-07 at 15:00 +0200, Peter Zijlstra wrote:
> On Tue, May 06, 2014 at 07:33:07PM -0700, Davidlohr Bueso wrote:
> 
> > So I've been running benchmarks (mostly aim7, which nicely exercises our
> > locks) comparing my recent v4 for rwsem optimistic spinning against
> > previous implementation ideas for the anon-vma lock, mostly:
> 
> > - rwlock_t
> > - qrwlock_t
> 
> Which reminds me; can you provide the numbers for rwlock_t vs qrwlock_t
> in a numeric form so I can include them in the qrwlock_t changelog.

Ah, right. I was lazy and just showed you the graphs.

> That way I can queue those patches for inclusion, I think we want a fair
> rwlock_t if we can show (and you graphs do iirc) that it doesn't cost us
> performance.

I agree, fairness is very welcome here. And I agree that, despite my good
numbers and the fact that we should keep the anon vma lock as a rwsem, it's
still worth merging the qrwlock stuff. I'll cook up my regular numeric table
today.

Thanks,
Davidlohr


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
@ 2014-05-07 17:34         ` Davidlohr Bueso
  0 siblings, 0 replies; 107+ messages in thread
From: Davidlohr Bueso @ 2014-05-07 17:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: j.glisse, linux-mm, linux-kernel, linux-fsdevel, Mel Gorman,
	H. Peter Anvin, Andrew Morton, Linda Wang, Kevin E Martin,
	Jerome Glisse, Andrea Arcangeli, Johannes Weiner, Larry Woodman,
	Rik van Riel, Dave Airlie, Jeff Law, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Or Gerlitz, Sagi Grimberg,
	Shachar Raindel, Liran Liss, Roland Dreier, Sander, Ben, Stoner,
	Greg, Bridgman, John, Mantor, Michael, Blinzer, Paul, Morichetti,
	Laurent, Deucher, Alexander, Gabbay, Oded, Linus Torvalds

On Wed, 2014-05-07 at 15:00 +0200, Peter Zijlstra wrote:
> On Tue, May 06, 2014 at 07:33:07PM -0700, Davidlohr Bueso wrote:
> 
> > So I've been running benchmarks (mostly aim7, which nicely exercises our
> > locks) comparing my recent v4 for rwsem optimistic spinning against
> > previous implementation ideas for the anon-vma lock, mostly:
> 
> > - rwlock_t
> > - qrwlock_t
> 
> Which reminds me; can you provide the numbers for rwlock_t vs qrwlock_t
> in a numeric form so I can include them in the qrwlock_t changelog.

Ah, right. I was lazy and just showed you the graphs.

> That way I can queue those patches for inclusion, I think we want a fair
> rwlock_t if we can show (and you graphs do iirc) that it doesn't cost us
> performance.

I agree, fairness is very welcome here. And I agree that, despite my good
numbers and the fact that we should keep the anon vma lock as a rwsem, it's
still worth merging the qrwlock stuff. I'll cook up my regular numeric table
today.

Thanks,
Davidlohr


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
  2014-05-07  2:33     ` Davidlohr Bueso
@ 2014-05-08 16:47       ` sagi grimberg
  -1 siblings, 0 replies; 107+ messages in thread
From: sagi grimberg @ 2014-05-08 16:47 UTC (permalink / raw)
  To: Davidlohr Bueso, Peter Zijlstra
  Cc: j.glisse, linux-mm, linux-kernel, linux-fsdevel, Mel Gorman,
	H. Peter Anvin, Andrew Morton, Linda Wang, Kevin E Martin,
	Jerome Glisse, Andrea Arcangeli, Johannes Weiner, Larry Woodman,
	Rik van Riel, Dave Airlie, Jeff Law, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan

On 5/7/2014 5:33 AM, Davidlohr Bueso wrote:
> On Tue, 2014-05-06 at 12:29 +0200, Peter Zijlstra wrote:
>> So you forgot to CC Linus, Linus has expressed some dislike for
>> preemptible mmu_notifiers in the recent past:
>>
>>    https://lkml.org/lkml/2013/9/30/385
> I'm glad this came up again.
>
> So I've been running benchmarks (mostly aim7, which nicely exercises our
> locks) comparing my recent v4 for rwsem optimistic spinning against
> previous implementation ideas for the anon-vma lock, mostly:
>
> - rwsem (currently)
> - rwlock_t
> - qrwlock_t
> - rwsem+optspin
>
> Of course, *any* change provides significant improvement in throughput
> for several workloads, by avoiding to block -- there are more
> performance numbers in the different patches. This is fairly obvious.
>
> What is perhaps not so obvious is that rwsem+optimistic spinning beats
> all others, including the improved qrwlock from Waiman and Peter. This
> is mostly because of the idea of cancelable MCS, which was mimic'ed from
> mutexes. The delta in most cases is around +10-15%, which is non
> trivial.

This is great news, David!

> I mention this because from a performance PoV, we'll stop caring so much
> about the type of lock we require in the notifier related code. So while
> this is not conclusive, I'm not as opposed to keeping the locks blocking
> as I once was. Now this might still imply things like poor design
> choices, but that's neither here nor there.

So is the rwsem+opt strategy the way to go, given it keeps everyone happy?
We will be more than satisfied with it, as it will allow us to guarantee
device MMU updates.

> /me sees Sagi smiling ;)

:)

Sagi.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
@ 2014-05-08 16:47       ` sagi grimberg
  0 siblings, 0 replies; 107+ messages in thread
From: sagi grimberg @ 2014-05-08 16:47 UTC (permalink / raw)
  To: Davidlohr Bueso, Peter Zijlstra
  Cc: j.glisse, linux-mm, linux-kernel, linux-fsdevel, Mel Gorman,
	H. Peter Anvin, Andrew Morton, Linda Wang, Kevin E Martin,
	Jerome Glisse, Andrea Arcangeli, Johannes Weiner, Larry Woodman,
	Rik van Riel, Dave Airlie, Jeff Law, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Or Gerlitz, Shachar Raindel,
	Liran Liss, Roland Dreier, Sander, Ben, Stoner, Greg, Bridgman,
	John, Mantor, Michael, Blinzer, Paul, Morichetti, Laurent,
	Deucher, Alexander, Gabbay, Oded, Linus Torvalds

On 5/7/2014 5:33 AM, Davidlohr Bueso wrote:
> On Tue, 2014-05-06 at 12:29 +0200, Peter Zijlstra wrote:
>> So you forgot to CC Linus, Linus has expressed some dislike for
>> preemptible mmu_notifiers in the recent past:
>>
>>    https://lkml.org/lkml/2013/9/30/385
> I'm glad this came up again.
>
> So I've been running benchmarks (mostly aim7, which nicely exercises our
> locks) comparing my recent v4 for rwsem optimistic spinning against
> previous implementation ideas for the anon-vma lock, mostly:
>
> - rwsem (currently)
> - rwlock_t
> - qrwlock_t
> - rwsem+optspin
>
> Of course, *any* change provides significant improvement in throughput
> for several workloads, by avoiding to block -- there are more
> performance numbers in the different patches. This is fairly obvious.
>
> What is perhaps not so obvious is that rwsem+optimistic spinning beats
> all others, including the improved qrwlock from Waiman and Peter. This
> is mostly because of the idea of cancelable MCS, which was mimic'ed from
> mutexes. The delta in most cases is around +10-15%, which is non
> trivial.

This is great news, David!

> I mention this because from a performance PoV, we'll stop caring so much
> about the type of lock we require in the notifier related code. So while
> this is not conclusive, I'm not as opposed to keeping the locks blocking
> as I once was. Now this might still imply things like poor design
> choices, but that's neither here nor there.

So is the rwsem+opt strategy the way to go, given it keeps everyone happy?
We will be more than satisfied with it, as it will allow us to guarantee
device MMU updates.

> /me sees Sagi smiling ;)

:)

Sagi.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
  2014-05-08 16:47       ` sagi grimberg
@ 2014-05-08 17:56         ` Jerome Glisse
  -1 siblings, 0 replies; 107+ messages in thread
From: Jerome Glisse @ 2014-05-08 17:56 UTC (permalink / raw)
  To: sagi grimberg
  Cc: Davidlohr Bueso, Peter Zijlstra, linux-mm, linux-kernel,
	linux-fsdevel, Mel Gorman, H. Peter Anvin, Andrew Morton,
	Linda Wang, Kevin E Martin, Jerome Glisse, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Jeff Law, Brendan Conoboy, Joe Donohue, Duncan Poole,
	Sherry Cheung, Subhash Gutti, John Hubbard, Mark Hairgrove,
	Lucien Dunning, Cameron Buschardt

On Thu, May 08, 2014 at 07:47:04PM +0300, sagi grimberg wrote:
> On 5/7/2014 5:33 AM, Davidlohr Bueso wrote:
> >On Tue, 2014-05-06 at 12:29 +0200, Peter Zijlstra wrote:
> >>So you forgot to CC Linus, Linus has expressed some dislike for
> >>preemptible mmu_notifiers in the recent past:
> >>
> >>   https://lkml.org/lkml/2013/9/30/385
> >I'm glad this came up again.
> >
> >So I've been running benchmarks (mostly aim7, which nicely exercises our
> >locks) comparing my recent v4 for rwsem optimistic spinning against
> >previous implementation ideas for the anon-vma lock, mostly:
> >
> >- rwsem (currently)
> >- rwlock_t
> >- qrwlock_t
> >- rwsem+optspin
> >
> >Of course, *any* change provides significant improvement in throughput
> >for several workloads, by avoiding to block -- there are more
> >performance numbers in the different patches. This is fairly obvious.
> >
> >What is perhaps not so obvious is that rwsem+optimistic spinning beats
> >all others, including the improved qrwlock from Waiman and Peter. This
> >is mostly because of the idea of cancelable MCS, which was mimic'ed from
> >mutexes. The delta in most cases is around +10-15%, which is non
> >trivial.
> 
> These are great news David!
> 
> >I mention this because from a performance PoV, we'll stop caring so much
> >about the type of lock we require in the notifier related code. So while
> >this is not conclusive, I'm not as opposed to keeping the locks blocking
> >as I once was. Now this might still imply things like poor design
> >choices, but that's neither here nor there.
> 
> So is the rwsem+opt strategy the way to go Given it keeps everyone happy?
> We will be more than satisfied with it as it will allow us to
> guarantee device
> MMU update.
> 
> >/me sees Sagi smiling ;)
> 
> :)

So I started doing things with the tlb flush approach, but I must say it looks
ugly. I need a new page flag (goodbye 32-bit platforms), and I need my own lru
and page reclamation for any page in use by a device; I need to hook up inside
try_to_unmap or migrate (I will do the former). I am trying to be smart by
scheduling a worker on another cpu before sending the ipi, so that while the
ipi is in progress another cpu will hopefully kick off the invalidation on the
GPU and the wait after the ipi for the gpu will be quick.
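
In pseudo-C the ordering looks like this; the names are made up and stubbed
out, this is not the actual patch:

/*
 * Sketch of the overlap trick.  Kick the GPU invalidation first (from
 * another cpu, via a worker), then do the normal IPI-based CPU tlb flush;
 * by the time we come back to wait on the GPU it has hopefully already
 * done most of the work.
 */
static void hmm_kick_device_invalidate(unsigned long start, unsigned long end)
{
	(void)start;
	(void)end;
	/* real code: schedule_work_on(other_cpu, ...) queuing the GPU command */
}

static void cpu_flush_tlb_range_stub(unsigned long start, unsigned long end)
{
	(void)start;
	(void)end;
	/* real code: the usual IPI-based flush_tlb_range() */
}

static void hmm_wait_device_invalidate(void)
{
	/* real code: sleep until the GPU fence for the invalidation signals */
}

static void hmm_flush_range(unsigned long start, unsigned long end)
{
	hmm_kick_device_invalidate(start, end);	/* overlaps with the IPIs */
	cpu_flush_tlb_range_stub(start, end);
	hmm_wait_device_invalidate();		/* hopefully short by now */
}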

So all in all this is looking ugly, and it does not change the fact that I
sleep (well, need to be able to sleep). It just moves the sleeping to another
place.

Maybe I should stress that with the mmu_notifier version it only sleeps for
processes that are using the GPU. Those processes are using userspace APIs
like OpenCL, which do not play well with fork (ie read: do not use fork if
you are using such an API).

So in my case, if a process has mm->hmm set to something, that would mean
there is a GPU using that address space, and it is unlikely to be running
the massive workloads that people try to optimize the anon_vma lock for.

My point is that with rwsem+optspin it could try spinning if mm->hmm
was NULL and make the massive fork workload go fast, or it could sleep
directly if mm->hmm is set.
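
Something like this is what I have in mind; made-up names, not the real
rwsem code:

/*
 * Sketch only.  Spin optimistically on the anon_vma lock only when no
 * device mirrors this mm (mm->hmm == NULL); if a device is attached the
 * lock holder may legitimately sleep for a long time, so go straight to
 * sleeping instead of burning cpu.
 */
struct mm_stub    { void *hmm; };	/* non-NULL when a device mirrors the mm */
struct rwsem_stub { int locked; };

static int try_optimistic_spin(struct rwsem_stub *sem)
{
	return !sem->locked ? (sem->locked = 1) : 0;	/* toy stand-in */
}

static void sleep_for_lock(struct rwsem_stub *sem)
{
	sem->locked = 1;	/* real code blocks until the rwsem is free */
}

static void anon_vma_lock_sketch(struct mm_stub *mm, struct rwsem_stub *sem)
{
	if (!mm->hmm && try_optimistic_spin(sem))
		return;		/* fork-heavy workloads stay fast */
	sleep_for_lock(sem);	/* hmm users pay the sleeping price */
}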

This way my additions do not damage anyone's workload; only the workloads
that use hmm would be likely to see lock contention on fork, but those
workloads should not fork in the first place, and if they do they should
pay a price.

I will finish up the hackish tlb version of hmm so people can judge how
ugly it is (in my view) and send it here as soon as I can.

But I think it's clear that with rwsem+optspin we can make all workloads
happy and fast.

Cheers,
Jérôme Glisse

> 
> Sagi.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
@ 2014-05-08 17:56         ` Jerome Glisse
  0 siblings, 0 replies; 107+ messages in thread
From: Jerome Glisse @ 2014-05-08 17:56 UTC (permalink / raw)
  To: sagi grimberg
  Cc: Davidlohr Bueso, Peter Zijlstra, linux-mm, linux-kernel,
	linux-fsdevel, Mel Gorman, H. Peter Anvin, Andrew Morton,
	Linda Wang, Kevin E Martin, Jerome Glisse, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Jeff Law, Brendan Conoboy, Joe Donohue, Duncan Poole,
	Sherry Cheung, Subhash Gutti, John Hubbard, Mark Hairgrove,
	Lucien Dunning, Cameron Buschardt, Arvind Gopalakrishnan,
	Haggai Eran, Or Gerlitz, Shachar Raindel, Liran Liss,
	Roland Dreier, Sander, Ben, Stoner, Greg, Bridgman, John, Mantor,
	Michael, Blinzer, Paul, Morichetti, Laurent, Deucher, Alexander,
	Gabbay, Oded, Linus Torvalds

On Thu, May 08, 2014 at 07:47:04PM +0300, sagi grimberg wrote:
> On 5/7/2014 5:33 AM, Davidlohr Bueso wrote:
> >On Tue, 2014-05-06 at 12:29 +0200, Peter Zijlstra wrote:
> >>So you forgot to CC Linus, Linus has expressed some dislike for
> >>preemptible mmu_notifiers in the recent past:
> >>
> >>   https://lkml.org/lkml/2013/9/30/385
> >I'm glad this came up again.
> >
> >So I've been running benchmarks (mostly aim7, which nicely exercises our
> >locks) comparing my recent v4 for rwsem optimistic spinning against
> >previous implementation ideas for the anon-vma lock, mostly:
> >
> >- rwsem (currently)
> >- rwlock_t
> >- qrwlock_t
> >- rwsem+optspin
> >
> >Of course, *any* change provides significant improvement in throughput
> >for several workloads, by avoiding to block -- there are more
> >performance numbers in the different patches. This is fairly obvious.
> >
> >What is perhaps not so obvious is that rwsem+optimistic spinning beats
> >all others, including the improved qrwlock from Waiman and Peter. This
> >is mostly because of the idea of cancelable MCS, which was mimic'ed from
> >mutexes. The delta in most cases is around +10-15%, which is non
> >trivial.
> 
> These are great news David!
> 
> >I mention this because from a performance PoV, we'll stop caring so much
> >about the type of lock we require in the notifier related code. So while
> >this is not conclusive, I'm not as opposed to keeping the locks blocking
> >as I once was. Now this might still imply things like poor design
> >choices, but that's neither here nor there.
> 
> So is the rwsem+opt strategy the way to go Given it keeps everyone happy?
> We will be more than satisfied with it as it will allow us to
> guarantee device
> MMU update.
> 
> >/me sees Sagi smiling ;)
> 
> :)

So I started doing things with the tlb flush approach, but I must say it looks
ugly. I need a new page flag (goodbye 32-bit platforms), and I need my own lru
and page reclamation for any page in use by a device; I need to hook up inside
try_to_unmap or migrate (I will do the former). I am trying to be smart by
scheduling a worker on another cpu before sending the ipi, so that while the
ipi is in progress another cpu will hopefully kick off the invalidation on the
GPU and the wait after the ipi for the gpu will be quick.

So all in all this is looking ugly, and it does not change the fact that I
sleep (well, need to be able to sleep). It just moves the sleeping to another
place.

Maybe i should stress that with the mmu_notifier version it only sleep for
process that are using the GPU those process are using userspace API like
OpenCL which are not playing well with fork, ie read do not use fork if
you are using such API.

So for my case if a process has mm->hmm set to something that would mean
that there is a GPU using that address space and that it is unlikely to
go under the massive workload that people try to optimize the anon_vma
lock for.

My point is that with rwsem+optspin it could try spinning if mm->hmm
was NULL and make the massive fork workload go fast, or it could sleep
directly if mm->hmm is set.
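
What I mean, as a minimal sketch only: down_write_nospin() does not exist, it
stands for "skip optimistic spinning and block right away", and mm->hmm is the
field this patchset adds.

#include <linux/rwsem.h>
#include <linux/rmap.h>
#include <linux/mm_types.h>

static inline void hmm_aware_anon_vma_lock_write(struct anon_vma *anon_vma,
                                                 struct mm_struct *mm)
{
        if (mm->hmm)
                /* a device mirrors this mm: the notifier will sleep anyway */
                down_write_nospin(&anon_vma->root->rwsem);
        else
                /* plain fork heavy workload: let rwsem+optspin do its thing */
                down_write(&anon_vma->root->rwsem);
}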

This way my additions do not hurt anyone else's workload. Only workloads that
use hmm would see lock contention on fork, but those workloads should not fork
in the first place, and if they do, they can pay that price.

I will finish up the hackish TLB version of hmm so people can judge how ugly
it is (in my view) and send it here as soon as I can.

But I think it is clear that with rwsem+optspin we can make every workload
happy and fast.

Cheers,
Jerome Glisse

> 
> Sagi.


* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
  2014-05-07  7:14                 ` Benjamin Herrenschmidt
@ 2014-05-09  1:26                   ` Jerome Glisse
  -1 siblings, 0 replies; 107+ messages in thread
From: Jerome Glisse @ 2014-05-09  1:26 UTC (permalink / raw)
  To: Linus Torvalds, Benjamin Herrenschmidt
  Cc: Peter Zijlstra, linux-mm, Linux Kernel Mailing List,
	linux-fsdevel, Mel Gorman, H. Peter Anvin, Andrew Morton,
	Linda Wang, Kevin E Martin, Jerome Glisse, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Jeff Law, Brendan Conoboy, Joe Donohue, Duncan Poole,
	Sherry Cheung, Subhash Gutti, John Hubbard, Mark Hairgrove,
	Lucien Dunning

On Wed, May 07, 2014 at 05:14:52PM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2014-05-06 at 12:18 -0400, Jerome Glisse wrote:
> > 
> > I do understand that i was pointing out that if i move to, tlb which i
> > am fine with, i will still need to sleep there. That's all i wanted to
> > stress, i did not wanted force using mmu_notifier, i am fine with them
> > becoming atomic as long as i have a place where i can intercept cpu
> > page table update and propagate them to device mmu.
> 
> Your MMU notifier can maintain a map of "dirty" PTEs and you do the
> actual synchronization in the subsequent flush_tlb_* , you need to add
> hooks there but it's much less painful than in the notifiers.
> 
> *However* Linus, even then we can't sleep. We do things like
> ptep_clear_flush() that need the PTL and have the synchronous flush
> semantics.
> 
> Sure, today we wait, possibly for a long time, with IPIs, but we do not
> sleep. Jerome would have to operate within a similar context. No sleep
> for you :)
> 
> Cheers,
> Ben.
> 
> 

So Linus, Benjamin is right, there are a couple of cases I did not think about.
For instance, with a COW page one thread might trigger the copy-on-write,
allocate a new page and update the page table, while another CPU thread starts
using the new page before we even get a chance to update the GPU page table,
so the GPU could be working on outdated data.

The same kind of race exists on fork when we write-protect a page, or when we
split a huge page.
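
To spell out the COW window (a stripped down sketch, not the real do_wp_page();
update_device_page_table() is just a placeholder for whatever tells the GPU):

#include <linux/mm.h>
#include <linux/mm_types.h>

static void cow_race_sketch(struct mm_struct *mm, unsigned long addr,
                            pte_t *ptep, spinlock_t *ptl,
                            struct page *new_page, pgprot_t prot)
{
        pte_t entry = mk_pte(new_page, prot);

        spin_lock(ptl);
        set_pte_at(mm, addr, ptep, entry);      /* CPU threads now see new_page */
        spin_unlock(ptl);

        /*
         * Window: another CPU thread can write through the new PTE right
         * here while the GPU page table still points to the old page, so
         * the device keeps working on stale data until it is told.
         */
        update_device_page_table(mm, addr);     /* hypothetical, and too late */
}

With mmu_notifier the invalidate_range_start() call sits before that PTE
switch, which is exactly the early warning I am after.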

I thought I only needed to special-case page reclamation and migration and to
forbid things like KSM, but I was wrong.

So with that in mind, are you OK with me pursuing the mmu_notifier approach,
taking into account the rwsem+optspin results that would keep the fork-heavy
workloads fast while still allowing mmu_notifier callbacks to sleep?

Otherwise I have no choice but to add something like mmu_notifier in every
place where there can be a race (huge page split, COW, ...), which sounds like
a bad idea to me when mmu_notifier is perfect for the job.

Cheers,
Jérôme Glisse


* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
@ 2014-05-09  1:26                   ` Jerome Glisse
  0 siblings, 0 replies; 107+ messages in thread
From: Jerome Glisse @ 2014-05-09  1:26 UTC (permalink / raw)
  To: Linus Torvalds, Benjamin Herrenschmidt
  Cc: Peter Zijlstra, linux-mm, Linux Kernel Mailing List,
	linux-fsdevel, Mel Gorman, H. Peter Anvin, Andrew Morton,
	Linda Wang, Kevin E Martin, Jerome Glisse, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Jeff Law, Brendan Conoboy, Joe Donohue, Duncan Poole,
	Sherry Cheung, Subhash Gutti, John Hubbard, Mark Hairgrove,
	Lucien Dunning, Cameron Buschardt, Arvind Gopalakrishnan,
	Haggai Eran, Or Gerlitz, Sagi Grimberg, Shachar Raindel,
	Liran Liss, Roland Dreier, Sander, Ben, Stoner, Greg, Bridgman,
	John, Mantor, Michael, Blinzer, Paul, Morichetti, Laurent,
	Deucher, Alexander, Gabbay, Oded, Davidlohr Bueso

On Wed, May 07, 2014 at 05:14:52PM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2014-05-06 at 12:18 -0400, Jerome Glisse wrote:
> > 
> > I do understand that i was pointing out that if i move to, tlb which i
> > am fine with, i will still need to sleep there. That's all i wanted to
> > stress, i did not wanted force using mmu_notifier, i am fine with them
> > becoming atomic as long as i have a place where i can intercept cpu
> > page table update and propagate them to device mmu.
> 
> Your MMU notifier can maintain a map of "dirty" PTEs and you do the
> actual synchronization in the subsequent flush_tlb_* , you need to add
> hooks there but it's much less painful than in the notifiers.
> 
> *However* Linus, even then we can't sleep. We do things like
> ptep_clear_flush() that need the PTL and have the synchronous flush
> semantics.
> 
> Sure, today we wait, possibly for a long time, with IPIs, but we do not
> sleep. Jerome would have to operate within a similar context. No sleep
> for you :)
> 
> Cheers,
> Ben.
> 
> 

So Linus, Benjamin is right, there are a couple of cases I did not think about.
For instance, with a COW page one thread might trigger the copy-on-write,
allocate a new page and update the page table, while another CPU thread starts
using the new page before we even get a chance to update the GPU page table,
so the GPU could be working on outdated data.

The same kind of race exists on fork when we write-protect a page, or when we
split a huge page.

I thought I only needed to special-case page reclamation and migration and to
forbid things like KSM, but I was wrong.

So with that in mind, are you OK with me pursuing the mmu_notifier approach,
taking into account the rwsem+optspin results that would keep the fork-heavy
workloads fast while still allowing mmu_notifier callbacks to sleep?

Otherwise I have no choice but to add something like mmu_notifier in every
place where there can be a race (huge page split, COW, ...), which sounds like
a bad idea to me when mmu_notifier is perfect for the job.

Cheers,
Jerome Glisse


* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
  2014-05-08 17:56         ` Jerome Glisse
@ 2014-05-09  1:42           ` Davidlohr Bueso
  -1 siblings, 0 replies; 107+ messages in thread
From: Davidlohr Bueso @ 2014-05-09  1:42 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: sagi grimberg, Peter Zijlstra, linux-mm, linux-kernel,
	linux-fsdevel, Mel Gorman, H. Peter Anvin, Andrew Morton,
	Linda Wang, Kevin E Martin, Jerome Glisse, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Jeff Law, Brendan Conoboy, Joe Donohue, Duncan Poole,
	Sherry Cheung, Subhash Gutti, John Hubbard, Mark Hairgrove,
	Lucien Dunning, Cameron Buschardt

On Thu, 2014-05-08 at 13:56 -0400, Jerome Glisse wrote:
> On Thu, May 08, 2014 at 07:47:04PM +0300, sagi grimberg wrote:
> > On 5/7/2014 5:33 AM, Davidlohr Bueso wrote:
> > >On Tue, 2014-05-06 at 12:29 +0200, Peter Zijlstra wrote:
> > >>So you forgot to CC Linus, Linus has expressed some dislike for
> > >>preemptible mmu_notifiers in the recent past:
> > >>
> > >>   https://lkml.org/lkml/2013/9/30/385
> > >I'm glad this came up again.
> > >
> > >So I've been running benchmarks (mostly aim7, which nicely exercises our
> > >locks) comparing my recent v4 for rwsem optimistic spinning against
> > >previous implementation ideas for the anon-vma lock, mostly:
> > >
> > >- rwsem (currently)
> > >- rwlock_t
> > >- qrwlock_t
> > >- rwsem+optspin
> > >
> > >Of course, *any* change provides significant improvement in throughput
> > >for several workloads, by avoiding to block -- there are more
> > >performance numbers in the different patches. This is fairly obvious.
> > >
> > >What is perhaps not so obvious is that rwsem+optimistic spinning beats
> > >all others, including the improved qrwlock from Waiman and Peter. This
> > >is mostly because of the idea of cancelable MCS, which was mimic'ed from
> > >mutexes. The delta in most cases is around +10-15%, which is non
> > >trivial.
> > 
> > These are great news David!
> > 
> > >I mention this because from a performance PoV, we'll stop caring so much
> > >about the type of lock we require in the notifier related code. So while
> > >this is not conclusive, I'm not as opposed to keeping the locks blocking
> > >as I once was. Now this might still imply things like poor design
> > >choices, but that's neither here nor there.
> > 
> > So is the rwsem+opt strategy the way to go Given it keeps everyone happy?
> > We will be more than satisfied with it as it will allow us to
> > guarantee device
> > MMU update.
> > 
> > >/me sees Sagi smiling ;)
> > 
> > :)
> 
> So i started doing thing with tlb flush but i must say things looks ugly.
> I need a new page flag (goodbye 32bits platform) and i need my own lru and
> page reclaimation for any page in use by a device, i need to hook up inside
> try_to_unmap or migrate (but i will do the former). I am trying to be smart
> by trying to schedule a worker on another cpu before before sending the ipi
> so that while the ipi is in progress hopefully another cpu might schedule
> the invalidation on the GPU and the wait after ipi for the gpu will be quick.
> 
> So all in all this is looking ugly and it does not change the fact that i
> sleep (well need to be able to sleep). It just move the sleeping to another
> part.
> 
> Maybe i should stress that with the mmu_notifier version it only sleep for
> process that are using the GPU those process are using userspace API like
> OpenCL which are not playing well with fork, ie read do not use fork if
> you are using such API.
> 
> So for my case if a process has mm->hmm set to something that would mean
> that there is a GPU using that address space and that it is unlikely to
> go under the massive workload that people try to optimize the anon_vma
> lock for.
> 
> My point is that with rwsem+optspin it could try spinning if mm->hmm
> was NULL and make the massive fork workload go fast, or it could sleep
> directly if mm->hmm is set.

Sorry? Unless I'm misunderstanding you, we don't do such things. Our
locks are generic and need to work under any circumstance, no special
cases here and there... _especially_ with this kind of thing. So no,
rwsem will spin as long as the owner is set, just like for any other user.

Thanks,
Davidlohr


* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
@ 2014-05-09  1:42           ` Davidlohr Bueso
  0 siblings, 0 replies; 107+ messages in thread
From: Davidlohr Bueso @ 2014-05-09  1:42 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: sagi grimberg, Peter Zijlstra, linux-mm, linux-kernel,
	linux-fsdevel, Mel Gorman, H. Peter Anvin, Andrew Morton,
	Linda Wang, Kevin E Martin, Jerome Glisse, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Jeff Law, Brendan Conoboy, Joe Donohue, Duncan Poole,
	Sherry Cheung, Subhash Gutti, John Hubbard, Mark Hairgrove,
	Lucien Dunning, Cameron Buschardt, Arvind Gopalakrishnan,
	Haggai Eran, Or Gerlitz, Shachar Raindel, Liran Liss,
	Roland Dreier, Sander, Ben, Stoner, Greg, Bridgman, John, Mantor,
	Michael, Blinzer, Paul, Morichetti, Laurent, Deucher, Alexander,
	Gabbay, Oded, Linus Torvalds

On Thu, 2014-05-08 at 13:56 -0400, Jerome Glisse wrote:
> On Thu, May 08, 2014 at 07:47:04PM +0300, sagi grimberg wrote:
> > On 5/7/2014 5:33 AM, Davidlohr Bueso wrote:
> > >On Tue, 2014-05-06 at 12:29 +0200, Peter Zijlstra wrote:
> > >>So you forgot to CC Linus, Linus has expressed some dislike for
> > >>preemptible mmu_notifiers in the recent past:
> > >>
> > >>   https://lkml.org/lkml/2013/9/30/385
> > >I'm glad this came up again.
> > >
> > >So I've been running benchmarks (mostly aim7, which nicely exercises our
> > >locks) comparing my recent v4 for rwsem optimistic spinning against
> > >previous implementation ideas for the anon-vma lock, mostly:
> > >
> > >- rwsem (currently)
> > >- rwlock_t
> > >- qrwlock_t
> > >- rwsem+optspin
> > >
> > >Of course, *any* change provides significant improvement in throughput
> > >for several workloads, by avoiding to block -- there are more
> > >performance numbers in the different patches. This is fairly obvious.
> > >
> > >What is perhaps not so obvious is that rwsem+optimistic spinning beats
> > >all others, including the improved qrwlock from Waiman and Peter. This
> > >is mostly because of the idea of cancelable MCS, which was mimic'ed from
> > >mutexes. The delta in most cases is around +10-15%, which is non
> > >trivial.
> > 
> > These are great news David!
> > 
> > >I mention this because from a performance PoV, we'll stop caring so much
> > >about the type of lock we require in the notifier related code. So while
> > >this is not conclusive, I'm not as opposed to keeping the locks blocking
> > >as I once was. Now this might still imply things like poor design
> > >choices, but that's neither here nor there.
> > 
> > So is the rwsem+opt strategy the way to go Given it keeps everyone happy?
> > We will be more than satisfied with it as it will allow us to
> > guarantee device
> > MMU update.
> > 
> > >/me sees Sagi smiling ;)
> > 
> > :)
> 
> So i started doing thing with tlb flush but i must say things looks ugly.
> I need a new page flag (goodbye 32bits platform) and i need my own lru and
> page reclaimation for any page in use by a device, i need to hook up inside
> try_to_unmap or migrate (but i will do the former). I am trying to be smart
> by trying to schedule a worker on another cpu before before sending the ipi
> so that while the ipi is in progress hopefully another cpu might schedule
> the invalidation on the GPU and the wait after ipi for the gpu will be quick.
> 
> So all in all this is looking ugly and it does not change the fact that i
> sleep (well need to be able to sleep). It just move the sleeping to another
> part.
> 
> Maybe i should stress that with the mmu_notifier version it only sleep for
> process that are using the GPU those process are using userspace API like
> OpenCL which are not playing well with fork, ie read do not use fork if
> you are using such API.
> 
> So for my case if a process has mm->hmm set to something that would mean
> that there is a GPU using that address space and that it is unlikely to
> go under the massive workload that people try to optimize the anon_vma
> lock for.
> 
> My point is that with rwsem+optspin it could try spinning if mm->hmm
> was NULL and make the massive fork workload go fast, or it could sleep
> directly if mm->hmm is set.

Sorry? Unless I'm misunderstanding you, we don't do such things. Our
locks are generic and need to work under any circumstance, no special
cases here and there... _especially_ with this kind of thing. So no,
rwsem will spin as long as the owner is set, just like for any other user.

Thanks,
Davidlohr


* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
  2014-05-09  1:42           ` Davidlohr Bueso
@ 2014-05-09  1:45             ` Jerome Glisse
  -1 siblings, 0 replies; 107+ messages in thread
From: Jerome Glisse @ 2014-05-09  1:45 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: sagi grimberg, Peter Zijlstra, linux-mm, linux-kernel,
	linux-fsdevel, Mel Gorman, H. Peter Anvin, Andrew Morton,
	Linda Wang, Kevin E Martin, Jerome Glisse, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Jeff Law, Brendan Conoboy, Joe Donohue, Duncan Poole,
	Sherry Cheung, Subhash Gutti, John Hubbard, Mark Hairgrove,
	Lucien Dunning, Cameron Buschardt

On Thu, May 08, 2014 at 06:42:14PM -0700, Davidlohr Bueso wrote:
> On Thu, 2014-05-08 at 13:56 -0400, Jerome Glisse wrote:
> > On Thu, May 08, 2014 at 07:47:04PM +0300, sagi grimberg wrote:
> > > On 5/7/2014 5:33 AM, Davidlohr Bueso wrote:
> > > >On Tue, 2014-05-06 at 12:29 +0200, Peter Zijlstra wrote:
> > > >>So you forgot to CC Linus, Linus has expressed some dislike for
> > > >>preemptible mmu_notifiers in the recent past:
> > > >>
> > > >>   https://lkml.org/lkml/2013/9/30/385
> > > >I'm glad this came up again.
> > > >
> > > >So I've been running benchmarks (mostly aim7, which nicely exercises our
> > > >locks) comparing my recent v4 for rwsem optimistic spinning against
> > > >previous implementation ideas for the anon-vma lock, mostly:
> > > >
> > > >- rwsem (currently)
> > > >- rwlock_t
> > > >- qrwlock_t
> > > >- rwsem+optspin
> > > >
> > > >Of course, *any* change provides significant improvement in throughput
> > > >for several workloads, by avoiding to block -- there are more
> > > >performance numbers in the different patches. This is fairly obvious.
> > > >
> > > >What is perhaps not so obvious is that rwsem+optimistic spinning beats
> > > >all others, including the improved qrwlock from Waiman and Peter. This
> > > >is mostly because of the idea of cancelable MCS, which was mimic'ed from
> > > >mutexes. The delta in most cases is around +10-15%, which is non
> > > >trivial.
> > > 
> > > These are great news David!
> > > 
> > > >I mention this because from a performance PoV, we'll stop caring so much
> > > >about the type of lock we require in the notifier related code. So while
> > > >this is not conclusive, I'm not as opposed to keeping the locks blocking
> > > >as I once was. Now this might still imply things like poor design
> > > >choices, but that's neither here nor there.
> > > 
> > > So is the rwsem+opt strategy the way to go Given it keeps everyone happy?
> > > We will be more than satisfied with it as it will allow us to
> > > guarantee device
> > > MMU update.
> > > 
> > > >/me sees Sagi smiling ;)
> > > 
> > > :)
> > 
> > So i started doing thing with tlb flush but i must say things looks ugly.
> > I need a new page flag (goodbye 32bits platform) and i need my own lru and
> > page reclaimation for any page in use by a device, i need to hook up inside
> > try_to_unmap or migrate (but i will do the former). I am trying to be smart
> > by trying to schedule a worker on another cpu before before sending the ipi
> > so that while the ipi is in progress hopefully another cpu might schedule
> > the invalidation on the GPU and the wait after ipi for the gpu will be quick.
> > 
> > So all in all this is looking ugly and it does not change the fact that i
> > sleep (well need to be able to sleep). It just move the sleeping to another
> > part.
> > 
> > Maybe i should stress that with the mmu_notifier version it only sleep for
> > process that are using the GPU those process are using userspace API like
> > OpenCL which are not playing well with fork, ie read do not use fork if
> > you are using such API.
> > 
> > So for my case if a process has mm->hmm set to something that would mean
> > that there is a GPU using that address space and that it is unlikely to
> > go under the massive workload that people try to optimize the anon_vma
> > lock for.
> > 
> > My point is that with rwsem+optspin it could try spinning if mm->hmm
> > was NULL and make the massive fork workload go fast, or it could sleep
> > directly if mm->hmm is set.
> 
> Sorry? Unless I'm misunderstanding you, we don't do such things. Our
> locks are generic and need to work for any circumstance, no special
> cases here and there... _specially_ with these kind of things. So no,
> rwsem will spin as long as the owner is set, just like any other users.
> 
> Thanks,
> Davidlohr
> 

I do not mind spinning all the time; I was just thinking it could be optimized
away when hmm is set for the current mm, since in that case there is very
likely going to be a schedule inside the mmu_notifier anyway.

But if you prefer to keep the code generic, I am fine with wasting the CPU cycles.

Cheers,
Jérôme Glisse


* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
@ 2014-05-09  1:45             ` Jerome Glisse
  0 siblings, 0 replies; 107+ messages in thread
From: Jerome Glisse @ 2014-05-09  1:45 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: sagi grimberg, Peter Zijlstra, linux-mm, linux-kernel,
	linux-fsdevel, Mel Gorman, H. Peter Anvin, Andrew Morton,
	Linda Wang, Kevin E Martin, Jerome Glisse, Andrea Arcangeli,
	Johannes Weiner, Larry Woodman, Rik van Riel, Dave Airlie,
	Jeff Law, Brendan Conoboy, Joe Donohue, Duncan Poole,
	Sherry Cheung, Subhash Gutti, John Hubbard, Mark Hairgrove,
	Lucien Dunning, Cameron Buschardt, Arvind Gopalakrishnan,
	Haggai Eran, Or Gerlitz, Shachar Raindel, Liran Liss,
	Roland Dreier, Sander, Ben, Stoner, Greg, Bridgman, John, Mantor,
	Michael, Blinzer, Paul, Morichetti, Laurent, Deucher, Alexander,
	Gabbay, Oded, Linus Torvalds

On Thu, May 08, 2014 at 06:42:14PM -0700, Davidlohr Bueso wrote:
> On Thu, 2014-05-08 at 13:56 -0400, Jerome Glisse wrote:
> > On Thu, May 08, 2014 at 07:47:04PM +0300, sagi grimberg wrote:
> > > On 5/7/2014 5:33 AM, Davidlohr Bueso wrote:
> > > >On Tue, 2014-05-06 at 12:29 +0200, Peter Zijlstra wrote:
> > > >>So you forgot to CC Linus, Linus has expressed some dislike for
> > > >>preemptible mmu_notifiers in the recent past:
> > > >>
> > > >>   https://lkml.org/lkml/2013/9/30/385
> > > >I'm glad this came up again.
> > > >
> > > >So I've been running benchmarks (mostly aim7, which nicely exercises our
> > > >locks) comparing my recent v4 for rwsem optimistic spinning against
> > > >previous implementation ideas for the anon-vma lock, mostly:
> > > >
> > > >- rwsem (currently)
> > > >- rwlock_t
> > > >- qrwlock_t
> > > >- rwsem+optspin
> > > >
> > > >Of course, *any* change provides significant improvement in throughput
> > > >for several workloads, by avoiding to block -- there are more
> > > >performance numbers in the different patches. This is fairly obvious.
> > > >
> > > >What is perhaps not so obvious is that rwsem+optimistic spinning beats
> > > >all others, including the improved qrwlock from Waiman and Peter. This
> > > >is mostly because of the idea of cancelable MCS, which was mimic'ed from
> > > >mutexes. The delta in most cases is around +10-15%, which is non
> > > >trivial.
> > > 
> > > These are great news David!
> > > 
> > > >I mention this because from a performance PoV, we'll stop caring so much
> > > >about the type of lock we require in the notifier related code. So while
> > > >this is not conclusive, I'm not as opposed to keeping the locks blocking
> > > >as I once was. Now this might still imply things like poor design
> > > >choices, but that's neither here nor there.
> > > 
> > > So is the rwsem+opt strategy the way to go Given it keeps everyone happy?
> > > We will be more than satisfied with it as it will allow us to
> > > guarantee device
> > > MMU update.
> > > 
> > > >/me sees Sagi smiling ;)
> > > 
> > > :)
> > 
> > So i started doing thing with tlb flush but i must say things looks ugly.
> > I need a new page flag (goodbye 32bits platform) and i need my own lru and
> > page reclaimation for any page in use by a device, i need to hook up inside
> > try_to_unmap or migrate (but i will do the former). I am trying to be smart
> > by trying to schedule a worker on another cpu before before sending the ipi
> > so that while the ipi is in progress hopefully another cpu might schedule
> > the invalidation on the GPU and the wait after ipi for the gpu will be quick.
> > 
> > So all in all this is looking ugly and it does not change the fact that i
> > sleep (well need to be able to sleep). It just move the sleeping to another
> > part.
> > 
> > Maybe i should stress that with the mmu_notifier version it only sleep for
> > process that are using the GPU those process are using userspace API like
> > OpenCL which are not playing well with fork, ie read do not use fork if
> > you are using such API.
> > 
> > So for my case if a process has mm->hmm set to something that would mean
> > that there is a GPU using that address space and that it is unlikely to
> > go under the massive workload that people try to optimize the anon_vma
> > lock for.
> > 
> > My point is that with rwsem+optspin it could try spinning if mm->hmm
> > was NULL and make the massive fork workload go fast, or it could sleep
> > directly if mm->hmm is set.
> 
> Sorry? Unless I'm misunderstanding you, we don't do such things. Our
> locks are generic and need to work for any circumstance, no special
> cases here and there... _specially_ with these kind of things. So no,
> rwsem will spin as long as the owner is set, just like any other users.
> 
> Thanks,
> Davidlohr
> 

I do not mind spinning all the time; I was just thinking it could be optimized
away when hmm is set for the current mm, since in that case there is very
likely going to be a schedule inside the mmu_notifier anyway.

But if you prefer to keep the code generic, I am fine with wasting the CPU cycles.

Cheers,
Jerome Glisse


* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
  2014-05-09  1:26                   ` Jerome Glisse
@ 2014-05-10  4:28                     ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 107+ messages in thread
From: Benjamin Herrenschmidt @ 2014-05-10  4:28 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, linux-fsdevel, Mel Gorman,
	H. Peter Anvin, Andrew Morton, Linda Wang, Kevin E Martin,
	Jerome Glisse, Andrea Arcangeli, Johannes Weiner, Larry Woodman,
	Rik van Riel, Dave Airlie, Jeff Law, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung

On Thu, 2014-05-08 at 21:26 -0400, Jerome Glisse wrote:
> Otherwise i have no other choice than to add something like mmu_notifier
> in the place where there can a be race (huge page split, cow, ...). Which
> sounds like a bad idea to me when mmu_notifier is perfect for the job.

Even there, how are you going to find a sleepable context? All of that code
runs with the PTL held.

Cheers,
Ben.




* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
@ 2014-05-10  4:28                     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 107+ messages in thread
From: Benjamin Herrenschmidt @ 2014-05-10  4:28 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, linux-fsdevel, Mel Gorman,
	H. Peter Anvin, Andrew Morton, Linda Wang, Kevin E Martin,
	Jerome Glisse, Andrea Arcangeli, Johannes Weiner, Larry Woodman,
	Rik van Riel, Dave Airlie, Jeff Law, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Or Gerlitz, Sagi Grimberg,
	Shachar Raindel, Liran Liss, Roland Dreier, Sander, Ben, Stoner,
	Greg, Bridgman, John, Mantor, Michael, Blinzer, Paul, Morichetti,
	Laurent, Deucher, Alexander, Gabbay, Oded, Davidlohr Bueso

On Thu, 2014-05-08 at 21:26 -0400, Jerome Glisse wrote:
> Otherwise i have no other choice than to add something like mmu_notifier
> in the place where there can a be race (huge page split, cow, ...). Which
> sounds like a bad idea to me when mmu_notifier is perfect for the job.

Even there, how are you going to find a sleepable context? All of that code
runs with the PTL held.

Cheers,
Ben.




* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
  2014-05-10  4:28                     ` Benjamin Herrenschmidt
@ 2014-05-11  0:48                       ` Jerome Glisse
  -1 siblings, 0 replies; 107+ messages in thread
From: Jerome Glisse @ 2014-05-11  0:48 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, linux-fsdevel, Mel Gorman,
	H. Peter Anvin, Andrew Morton, Linda Wang, Kevin E Martin,
	Jerome Glisse, Andrea Arcangeli, Johannes Weiner, Larry Woodman,
	Rik van Riel, Dave Airlie, Jeff Law, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Mark Hairgrove

On Sat, May 10, 2014 at 02:28:35PM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2014-05-08 at 21:26 -0400, Jerome Glisse wrote:
> > Otherwise i have no other choice than to add something like mmu_notifier
> > in the place where there can a be race (huge page split, cow, ...). Which
> > sounds like a bad idea to me when mmu_notifier is perfect for the job.
> 
> Even there, how are you going to find a sleepable context ? All that stuff
> has the PTL held.
> 
> Cheers,
> Ben.

All I need is invalidate_page and invalidate_range_start, neither of which is
called while holding any lock besides the anon_vma lock and/or mmap_sem. So I
am fine on that front.

The change_pte callbacks are bracketed by calls to invalidate_range_start/end.

What matters is flushing the GPU page table prior to any update that would make
the CPU page table point to a different page. For invalidate_page this is fine
because, so far, it is called either for page reclamation, where the entry is
changed to a swap or file entry, or for page migration, where the CPU page
table is set to the special migration entry (there is also the memory-failure
case, but that one is harmless too).

So as far as I can tell I am safe on that front, and only mmu_notifier gives
me the early warning that I need, notably for COW.
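
Roughly the wiring I have in mind, as a sketch only: my_gpu_flush_range() is a
placeholder for the driver-side flush (it may sleep waiting for the hardware),
and the callback signatures follow the current mmu_notifier_ops.

#include <linux/mmu_notifier.h>
#include <linux/mm_types.h>

static void my_invalidate_page(struct mmu_notifier *mn,
                               struct mm_struct *mm, unsigned long address)
{
        /*
         * Reclaim / migration path: the CPU page table entry is already a
         * swap, file or migration entry, so flushing the device here is safe.
         */
        my_gpu_flush_range(mn, address, address + PAGE_SIZE);
}

static void my_invalidate_range_start(struct mmu_notifier *mn,
                                      struct mm_struct *mm,
                                      unsigned long start, unsigned long end)
{
        /*
         * Called before the CPU page table changes (COW, fork write protect,
         * huge page split, ...), with only anon_vma lock and/or mmap_sem held.
         */
        my_gpu_flush_range(mn, start, end);
}

static const struct mmu_notifier_ops my_mirror_ops = {
        .invalidate_page        = my_invalidate_page,
        .invalidate_range_start = my_invalidate_range_start,
};

Both callbacks run in a context where sleeping is allowed, which is all hmm
needs from the core mm.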

Cheers,
Jérôme


* Re: [RFC] Heterogeneous memory management (mirror process address space on a device mmu).
@ 2014-05-11  0:48                       ` Jerome Glisse
  0 siblings, 0 replies; 107+ messages in thread
From: Jerome Glisse @ 2014-05-11  0:48 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Linus Torvalds, Peter Zijlstra, linux-mm,
	Linux Kernel Mailing List, linux-fsdevel, Mel Gorman,
	H. Peter Anvin, Andrew Morton, Linda Wang, Kevin E Martin,
	Jerome Glisse, Andrea Arcangeli, Johannes Weiner, Larry Woodman,
	Rik van Riel, Dave Airlie, Jeff Law, Brendan Conoboy,
	Joe Donohue, Duncan Poole, Sherry Cheung, Subhash Gutti,
	John Hubbard, Mark Hairgrove, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Haggai Eran, Or Gerlitz, Sagi Grimberg,
	Shachar Raindel, Liran Liss, Roland Dreier, Sander, Ben, Stoner,
	Greg, Bridgman, John, Mantor, Michael, Blinzer, Paul, Morichetti,
	Laurent, Deucher, Alexander, Gabbay, Oded, Davidlohr Bueso

On Sat, May 10, 2014 at 02:28:35PM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2014-05-08 at 21:26 -0400, Jerome Glisse wrote:
> > Otherwise i have no other choice than to add something like mmu_notifier
> > in the place where there can a be race (huge page split, cow, ...). Which
> > sounds like a bad idea to me when mmu_notifier is perfect for the job.
> 
> Even there, how are you going to find a sleepable context ? All that stuff
> has the PTL held.
> 
> Cheers,
> Ben.

All I need is invalidate_page and invalidate_range_start, neither of which is
called while holding any lock besides the anon_vma lock and/or mmap_sem. So I
am fine on that front.

The change_pte callbacks are bracketed by calls to invalidate_range_start/end.

What matters is flushing the GPU page table prior to any update that would make
the CPU page table point to a different page. For invalidate_page this is fine
because, so far, it is called either for page reclamation, where the entry is
changed to a swap or file entry, or for page migration, where the CPU page
table is set to the special migration entry (there is also the memory-failure
case, but that one is harmless too).

So as far as I can tell I am safe on that front, and only mmu_notifier gives
me the early warning that I need, notably for COW.

Cheers,
Jerome


end of thread (newest message: 2014-05-11  0:48 UTC)

Thread overview: 107+ messages
2014-05-02 13:51 [RFC] Heterogeneous memory management (mirror process address space on a device mmu) j.glisse
2014-05-02 13:51 ` j.glisse
2014-05-02 13:52 ` [PATCH 01/11] mm: differentiate unmap for vmscan from other unmap j.glisse
2014-05-02 13:52   ` j.glisse
2014-05-02 13:52   ` j.glisse
2014-05-02 13:52 ` [PATCH 02/11] mmu_notifier: add action information to address invalidation j.glisse
2014-05-02 13:52   ` j.glisse
2014-05-02 13:52   ` j.glisse
2014-05-02 13:52 ` [PATCH 03/11] mmu_notifier: pass through vma to invalidate_range and invalidate_page j.glisse
2014-05-02 13:52   ` j.glisse
2014-05-02 13:52   ` j.glisse
2014-05-02 13:52 ` [PATCH 04/11] interval_tree: helper to find previous item of a node in rb interval tree j.glisse
2014-05-02 13:52   ` j.glisse
2014-05-02 13:52   ` j.glisse
2014-05-02 13:52 ` [PATCH 05/11] mm/memcg: support accounting null page and transfering null charge to new page j.glisse
2014-05-02 13:52   ` j.glisse
2014-05-02 13:52   ` j.glisse
2014-05-02 13:52 ` [PATCH 06/11] hmm: heterogeneous memory management j.glisse
2014-05-02 13:52   ` j.glisse
2014-05-02 13:52   ` j.glisse
2014-05-02 13:52 ` [PATCH 07/11] hmm: support moving anonymous page to remote memory j.glisse
2014-05-02 13:52   ` j.glisse
2014-05-02 13:52   ` j.glisse
2014-05-02 13:52 ` [PATCH 08/11] hmm: support for migrate file backed pages " j.glisse
2014-05-02 13:52   ` j.glisse
2014-05-02 13:52   ` j.glisse
2014-05-02 13:52 ` [PATCH 09/11] fs/ext4: add support for hmm migration to remote memory of pagecache j.glisse
2014-05-02 13:52   ` j.glisse
2014-05-02 13:52   ` j.glisse
2014-05-02 13:52 ` [PATCH 10/11] hmm/dummy: dummy driver to showcase the hmm api j.glisse
2014-05-02 13:52   ` j.glisse
2014-05-02 13:52   ` j.glisse
2014-05-02 13:52 ` [PATCH 11/11] hmm/dummy_driver: add support for fake remote memory using pages j.glisse
2014-05-02 13:52   ` j.glisse
2014-05-02 13:52   ` j.glisse
2014-05-06 10:29 ` [RFC] Heterogeneous memory management (mirror process address space on a device mmu) Peter Zijlstra
2014-05-06 10:29   ` Peter Zijlstra
2014-05-06 14:57   ` Linus Torvalds
2014-05-06 14:57     ` Linus Torvalds
2014-05-06 15:00     ` Jerome Glisse
2014-05-06 15:00       ` Jerome Glisse
2014-05-06 15:18       ` Linus Torvalds
2014-05-06 15:18         ` Linus Torvalds
2014-05-06 15:33         ` Jerome Glisse
2014-05-06 15:33           ` Jerome Glisse
2014-05-06 15:42           ` Rik van Riel
2014-05-06 15:42             ` Rik van Riel
2014-05-06 15:47           ` Linus Torvalds
2014-05-06 15:47             ` Linus Torvalds
2014-05-06 16:18             ` Jerome Glisse
2014-05-06 16:18               ` Jerome Glisse
2014-05-06 16:32               ` Linus Torvalds
2014-05-06 16:32                 ` Linus Torvalds
2014-05-06 16:49                 ` Jerome Glisse
2014-05-06 16:49                   ` Jerome Glisse
2014-05-06 17:28                 ` Jerome Glisse
2014-05-06 17:28                   ` Jerome Glisse
2014-05-06 17:43                   ` Linus Torvalds
2014-05-06 17:43                     ` Linus Torvalds
2014-05-06 18:13                     ` Jerome Glisse
2014-05-06 18:13                       ` Jerome Glisse
2014-05-06 18:22                       ` Linus Torvalds
2014-05-06 18:22                         ` Linus Torvalds
2014-05-06 18:38                         ` Jerome Glisse
2014-05-06 18:38                           ` Jerome Glisse
2014-05-07  7:18                 ` Benjamin Herrenschmidt
2014-05-07  7:18                   ` Benjamin Herrenschmidt
2014-05-07  7:14               ` Benjamin Herrenschmidt
2014-05-07  7:14                 ` Benjamin Herrenschmidt
2014-05-07 12:39                 ` Jerome Glisse
2014-05-07 12:39                   ` Jerome Glisse
2014-05-09  1:26                 ` Jerome Glisse
2014-05-09  1:26                   ` Jerome Glisse
2014-05-10  4:28                   ` Benjamin Herrenschmidt
2014-05-10  4:28                     ` Benjamin Herrenschmidt
2014-05-11  0:48                     ` Jerome Glisse
2014-05-11  0:48                       ` Jerome Glisse
2014-05-06 16:30             ` Rik van Riel
2014-05-06 16:30               ` Rik van Riel
2014-05-06 16:34               ` Linus Torvalds
2014-05-06 16:34                 ` Linus Torvalds
2014-05-06 16:47                 ` Rik van Riel
2014-05-06 16:47                   ` Rik van Riel
2014-05-06 16:54                   ` Jerome Glisse
2014-05-06 16:54                     ` Jerome Glisse
2014-05-06 18:02                     ` H. Peter Anvin
2014-05-06 18:02                       ` H. Peter Anvin
2014-05-06 18:26                       ` Jerome Glisse
2014-05-06 18:26                         ` Jerome Glisse
2014-05-06 22:44                 ` David Airlie
2014-05-06 22:44                   ` David Airlie
2014-05-07  2:33   ` Davidlohr Bueso
2014-05-07  2:33     ` Davidlohr Bueso
2014-05-07 13:00     ` Peter Zijlstra
2014-05-07 13:00       ` Peter Zijlstra
2014-05-07 17:34       ` Davidlohr Bueso
2014-05-07 17:34         ` Davidlohr Bueso
2014-05-07 16:21     ` Linus Torvalds
2014-05-07 16:21       ` Linus Torvalds
2014-05-08 16:47     ` sagi grimberg
2014-05-08 16:47       ` sagi grimberg
2014-05-08 17:56       ` Jerome Glisse
2014-05-08 17:56         ` Jerome Glisse
2014-05-09  1:42         ` Davidlohr Bueso
2014-05-09  1:42           ` Davidlohr Bueso
2014-05-09  1:45           ` Jerome Glisse
2014-05-09  1:45             ` Jerome Glisse
