* mm preparatory patches for HMM and IOMMUv2
From: Jérôme Glisse @ 2014-06-28  2:00 UTC (permalink / raw)
  To: akpm, linux-mm, linux-kernel
  Cc: mgorman, hpa, peterz, aarcange, riel, jweiner, torvalds,
	Mark Hairgrove, Jatin Kumar, Subhash Gutti, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, John Hubbard,
	Sherry Cheung, Duncan Poole, Oded Gabbay, Alexander Deucher,
	Andrew Lewycky

Andrew, here is a set of mm patches that make some groundwork modifications to
core mm code. They apply on top of today's linux-next and they pass
checkpatch.pl with flying colors (except patch 4, where I did not want to be
overly strict about the 80-character line limit).

Patch 1 is the mmput notifier call chain we discussed with AMD.

Patches 2, 3 and 4 are so far only useful to HMM, but I am discussing with AMD
and I believe they will be useful to them too (in the context of IOMMUv2).

Patch 2 allows differentiating a page unmap done for vmscan from one done for poisoning.

Patch 3 associates each mmu_notifier invocation with an event type, allowing a
listener to take different code paths inside its mmu_notifier callbacks
depending on what is currently happening to the CPU page table. There is no
functional change; it just adds a new argument to the various mmu_notifier
calls and callbacks.

Patch 4 passes along the vma in which the range invalidation is happening.
There are a few functional changes in places where
mmu_notifier_invalidate_range_start/end used [0, -1] as the range; those
places now call the notifier once for each vma. This might prove to add
unwanted overhead, which is why I did it as a separate patch.

I did not include the core HMM patch, but I intend to send a v4 next week, so
I really would like to see these included in the next release.

As usual comments welcome.

Cheers,
Jérôme Glisse

* [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
From: Jérôme Glisse @ 2014-06-28  2:00 UTC (permalink / raw)
  To: akpm, linux-mm, linux-kernel
  Cc: mgorman, hpa, peterz, aarcange, riel, jweiner, torvalds,
	Mark Hairgrove, Jatin Kumar, Subhash Gutti, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, John Hubbard,
	Sherry Cheung, Duncan Poole, Oded Gabbay, Alexander Deucher,
	Andrew Lewycky, Jérôme Glisse

From: Jérôme Glisse <jglisse@redhat.com>

Several subsystems require a callback when an mm struct is being destroyed
so that they can clean up their respective per-mm state. Instead of having
each subsystem add its callback directly to mmput, use a notifier chain to
call each of the subsystems.

This will allow new subsystems to register callbacks even if they are
modules. There should be no contention on the rw semaphore protecting the
call chain, and the impact on the code path should be low and buried in
the noise.

Note that this patch also moves the call to the cleanup functions after
exit_mmap, so that new callbacks can assume that mmu_notifier_release has
already been called. This does not impact the existing cleanup functions,
as they do not rely on anything that exit_mmap frees. khugepaged_exit is
also moved into exit_mmap so that ordering is preserved for that function.
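
For illustration, here is a minimal sketch of how an out-of-tree subsystem
might hook into the new chain. Only mmput_register_notifier(),
mmput_unregister_notifier() and the notifier callback signature come from
this patch; the "foo" module and its cleanup logic are hypothetical:

#include <linux/module.h>
#include <linux/notifier.h>
#include <linux/sched.h>
#include <linux/mm_types.h>
#include <linux/printk.h>

/* Called from mmput() once exit_mmap() and mmu_notifier_release() have
 * already run for this mm; only private per-mm state is left to free. */
static int foo_mm_exit(struct notifier_block *nb,
		       unsigned long action, void *data)
{
	struct mm_struct *mm = data;

	pr_debug("foo: tearing down state for mm %p\n", mm);
	return 0;
}

static struct notifier_block foo_mmput_nb = {
	.notifier_call	= foo_mm_exit,
	.priority	= 0,
};

static int __init foo_init(void)
{
	return mmput_register_notifier(&foo_mmput_nb);
}

static void __exit foo_fini(void)
{
	mmput_unregister_notifier(&foo_mmput_nb);
}

module_init(foo_init);
module_exit(foo_fini);
MODULE_LICENSE("GPL");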

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 fs/aio.c                | 29 ++++++++++++++++++++++-------
 include/linux/aio.h     |  2 --
 include/linux/ksm.h     | 11 -----------
 include/linux/sched.h   |  5 +++++
 include/linux/uprobes.h |  1 -
 kernel/events/uprobes.c | 19 ++++++++++++++++---
 kernel/fork.c           | 22 ++++++++++++++++++----
 mm/ksm.c                | 26 +++++++++++++++++++++-----
 mm/mmap.c               |  3 +++
 9 files changed, 85 insertions(+), 33 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index c1d8c48..1d06e92 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -40,6 +40,7 @@
 #include <linux/ramfs.h>
 #include <linux/percpu-refcount.h>
 #include <linux/mount.h>
+#include <linux/notifier.h>
 
 #include <asm/kmap_types.h>
 #include <asm/uaccess.h>
@@ -774,20 +775,22 @@ ssize_t wait_on_sync_kiocb(struct kiocb *req)
 EXPORT_SYMBOL(wait_on_sync_kiocb);
 
 /*
- * exit_aio: called when the last user of mm goes away.  At this point, there is
+ * aio_exit: called when the last user of mm goes away.  At this point, there is
  * no way for any new requests to be submited or any of the io_* syscalls to be
  * called on the context.
  *
  * There may be outstanding kiocbs, but free_ioctx() will explicitly wait on
  * them.
  */
-void exit_aio(struct mm_struct *mm)
+static int aio_exit(struct notifier_block *nb,
+		    unsigned long action, void *data)
 {
+	struct mm_struct *mm = data;
 	struct kioctx_table *table = rcu_dereference_raw(mm->ioctx_table);
 	int i;
 
 	if (!table)
-		return;
+		return 0;
 
 	for (i = 0; i < table->nr; ++i) {
 		struct kioctx *ctx = table->table[i];
@@ -796,10 +799,10 @@ void exit_aio(struct mm_struct *mm)
 			continue;
 		/*
 		 * We don't need to bother with munmap() here - exit_mmap(mm)
-		 * is coming and it'll unmap everything. And we simply can't,
-		 * this is not necessarily our ->mm.
-		 * Since kill_ioctx() uses non-zero ->mmap_size as indicator
-		 * that it needs to unmap the area, just set it to 0.
+		 * have already been call and everything is unmap by now. But
+		 * to be safe set ->mmap_size to 0 since aio_free_ring() uses
+		 * non-zero ->mmap_size as indicator that it needs to unmap the
+		 * area.
 		 */
 		ctx->mmap_size = 0;
 		kill_ioctx(mm, ctx, NULL);
@@ -807,6 +810,7 @@ void exit_aio(struct mm_struct *mm)
 
 	RCU_INIT_POINTER(mm->ioctx_table, NULL);
 	kfree(table);
+	return 0;
 }
 
 static void put_reqs_available(struct kioctx *ctx, unsigned nr)
@@ -1629,3 +1633,14 @@ SYSCALL_DEFINE5(io_getevents, aio_context_t, ctx_id,
 	}
 	return ret;
 }
+
+static struct notifier_block aio_mmput_nb = {
+	.notifier_call		= aio_exit,
+	.priority		= 1,
+};
+
+static int __init aio_init(void)
+{
+	return mmput_register_notifier(&aio_mmput_nb);
+}
+subsys_initcall(aio_init);
diff --git a/include/linux/aio.h b/include/linux/aio.h
index d9c92da..6308fac 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -73,7 +73,6 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
 extern ssize_t wait_on_sync_kiocb(struct kiocb *iocb);
 extern void aio_complete(struct kiocb *iocb, long res, long res2);
 struct mm_struct;
-extern void exit_aio(struct mm_struct *mm);
 extern long do_io_submit(aio_context_t ctx_id, long nr,
 			 struct iocb __user *__user *iocbpp, bool compat);
 void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel);
@@ -81,7 +80,6 @@ void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel);
 static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; }
 static inline void aio_complete(struct kiocb *iocb, long res, long res2) { }
 struct mm_struct;
-static inline void exit_aio(struct mm_struct *mm) { }
 static inline long do_io_submit(aio_context_t ctx_id, long nr,
 				struct iocb __user * __user *iocbpp,
 				bool compat) { return 0; }
diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index 3be6bb1..84c184f 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -20,7 +20,6 @@ struct mem_cgroup;
 int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
 		unsigned long end, int advice, unsigned long *vm_flags);
 int __ksm_enter(struct mm_struct *mm);
-void __ksm_exit(struct mm_struct *mm);
 
 static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
 {
@@ -29,12 +28,6 @@ static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
 	return 0;
 }
 
-static inline void ksm_exit(struct mm_struct *mm)
-{
-	if (test_bit(MMF_VM_MERGEABLE, &mm->flags))
-		__ksm_exit(mm);
-}
-
 /*
  * A KSM page is one of those write-protected "shared pages" or "merged pages"
  * which KSM maps into multiple mms, wherever identical anonymous page content
@@ -83,10 +76,6 @@ static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
 	return 0;
 }
 
-static inline void ksm_exit(struct mm_struct *mm)
-{
-}
-
 static inline int PageKsm(struct page *page)
 {
 	return 0;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 322d4fc..428b3cf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2384,6 +2384,11 @@ static inline void mmdrop(struct mm_struct * mm)
 		__mmdrop(mm);
 }
 
+/* mmput call list of notifier and subsystem/module can register
+ * new one through this call.
+ */
+extern int mmput_register_notifier(struct notifier_block *nb);
+extern int mmput_unregister_notifier(struct notifier_block *nb);
 /* mmput gets rid of the mappings and all user-space */
 extern void mmput(struct mm_struct *);
 /* Grab a reference to a task's mm, if it is not already going away */
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 4f844c6..44e7267 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -120,7 +120,6 @@ extern int uprobe_pre_sstep_notifier(struct pt_regs *regs);
 extern void uprobe_notify_resume(struct pt_regs *regs);
 extern bool uprobe_deny_signal(void);
 extern bool arch_uprobe_skip_sstep(struct arch_uprobe *aup, struct pt_regs *regs);
-extern void uprobe_clear_state(struct mm_struct *mm);
 extern int  arch_uprobe_analyze_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long addr);
 extern int  arch_uprobe_pre_xol(struct arch_uprobe *aup, struct pt_regs *regs);
 extern int  arch_uprobe_post_xol(struct arch_uprobe *aup, struct pt_regs *regs);
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 46b7c31..32b04dc 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -37,6 +37,7 @@
 #include <linux/percpu-rwsem.h>
 #include <linux/task_work.h>
 #include <linux/shmem_fs.h>
+#include <linux/notifier.h>
 
 #include <linux/uprobes.h>
 
@@ -1220,16 +1221,19 @@ static struct xol_area *get_xol_area(void)
 /*
  * uprobe_clear_state - Free the area allocated for slots.
  */
-void uprobe_clear_state(struct mm_struct *mm)
+static int uprobe_clear_state(struct notifier_block *nb,
+			      unsigned long action, void *data)
 {
+	struct mm_struct *mm = data;
 	struct xol_area *area = mm->uprobes_state.xol_area;
 
 	if (!area)
-		return;
+		return 0;
 
 	put_page(area->page);
 	kfree(area->bitmap);
 	kfree(area);
+	return 0;
 }
 
 void uprobe_start_dup_mmap(void)
@@ -1979,9 +1983,14 @@ static struct notifier_block uprobe_exception_nb = {
 	.priority		= INT_MAX-1,	/* notified after kprobes, kgdb */
 };
 
+static struct notifier_block uprobe_mmput_nb = {
+	.notifier_call		= uprobe_clear_state,
+	.priority		= 0,
+};
+
 static int __init init_uprobes(void)
 {
-	int i;
+	int i, err;
 
 	for (i = 0; i < UPROBES_HASH_SZ; i++)
 		mutex_init(&uprobes_mmap_mutex[i]);
@@ -1989,6 +1998,10 @@ static int __init init_uprobes(void)
 	if (percpu_init_rwsem(&dup_mmap_sem))
 		return -ENOMEM;
 
+	err = mmput_register_notifier(&uprobe_mmput_nb);
+	if (err)
+		return err;
+
 	return register_die_notifier(&uprobe_exception_nb);
 }
 __initcall(init_uprobes);
diff --git a/kernel/fork.c b/kernel/fork.c
index dd8864f..b448509 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -87,6 +87,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/task.h>
 
+static BLOCKING_NOTIFIER_HEAD(mmput_notifier);
+
 /*
  * Protected counters by write_lock_irq(&tasklist_lock)
  */
@@ -623,6 +625,21 @@ void __mmdrop(struct mm_struct *mm)
 EXPORT_SYMBOL_GPL(__mmdrop);
 
 /*
+ * Register a notifier that will be call by mmput
+ */
+int mmput_register_notifier(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_register(&mmput_notifier, nb);
+}
+EXPORT_SYMBOL_GPL(mmput_register_notifier);
+
+int mmput_unregister_notifier(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_unregister(&mmput_notifier, nb);
+}
+EXPORT_SYMBOL_GPL(mmput_unregister_notifier);
+
+/*
  * Decrement the use count and release all resources for an mm.
  */
 void mmput(struct mm_struct *mm)
@@ -630,11 +647,8 @@ void mmput(struct mm_struct *mm)
 	might_sleep();
 
 	if (atomic_dec_and_test(&mm->mm_users)) {
-		uprobe_clear_state(mm);
-		exit_aio(mm);
-		ksm_exit(mm);
-		khugepaged_exit(mm); /* must run before exit_mmap */
 		exit_mmap(mm);
+		blocking_notifier_call_chain(&mmput_notifier, 0, mm);
 		set_mm_exe_file(mm, NULL);
 		if (!list_empty(&mm->mmlist)) {
 			spin_lock(&mmlist_lock);
diff --git a/mm/ksm.c b/mm/ksm.c
index 346ddc9..cb1e976 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -37,6 +37,7 @@
 #include <linux/freezer.h>
 #include <linux/oom.h>
 #include <linux/numa.h>
+#include <linux/notifier.h>
 
 #include <asm/tlbflush.h>
 #include "internal.h"
@@ -1586,7 +1587,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
 		ksm_scan.mm_slot = slot;
 		spin_unlock(&ksm_mmlist_lock);
 		/*
-		 * Although we tested list_empty() above, a racing __ksm_exit
+		 * Although we tested list_empty() above, a racing ksm_exit
 		 * of the last mm on the list may have removed it since then.
 		 */
 		if (slot == &ksm_mm_head)
@@ -1658,9 +1659,9 @@ next_mm:
 		/*
 		 * We've completed a full scan of all vmas, holding mmap_sem
 		 * throughout, and found no VM_MERGEABLE: so do the same as
-		 * __ksm_exit does to remove this mm from all our lists now.
-		 * This applies either when cleaning up after __ksm_exit
-		 * (but beware: we can reach here even before __ksm_exit),
+		 * ksm_exit does to remove this mm from all our lists now.
+		 * This applies either when cleaning up after ksm_exit
+		 * (but beware: we can reach here even before ksm_exit),
 		 * or when all VM_MERGEABLE areas have been unmapped (and
 		 * mmap_sem then protects against race with MADV_MERGEABLE).
 		 */
@@ -1821,11 +1822,16 @@ int __ksm_enter(struct mm_struct *mm)
 	return 0;
 }
 
-void __ksm_exit(struct mm_struct *mm)
+static int ksm_exit(struct notifier_block *nb,
+		    unsigned long action, void *data)
 {
+	struct mm_struct *mm = data;
 	struct mm_slot *mm_slot;
 	int easy_to_free = 0;
 
+	if (!test_bit(MMF_VM_MERGEABLE, &mm->flags))
+		return 0;
+
 	/*
 	 * This process is exiting: if it's straightforward (as is the
 	 * case when ksmd was never running), free mm_slot immediately.
@@ -1857,6 +1863,7 @@ void __ksm_exit(struct mm_struct *mm)
 		down_write(&mm->mmap_sem);
 		up_write(&mm->mmap_sem);
 	}
+	return 0;
 }
 
 struct page *ksm_might_need_to_copy(struct page *page,
@@ -2305,11 +2312,20 @@ static struct attribute_group ksm_attr_group = {
 };
 #endif /* CONFIG_SYSFS */
 
+static struct notifier_block ksm_mmput_nb = {
+	.notifier_call		= ksm_exit,
+	.priority		= 2,
+};
+
 static int __init ksm_init(void)
 {
 	struct task_struct *ksm_thread;
 	int err;
 
+	err = mmput_register_notifier(&ksm_mmput_nb);
+	if (err)
+		return err;
+
 	err = ksm_slab_init();
 	if (err)
 		goto out;
diff --git a/mm/mmap.c b/mm/mmap.c
index 61aec93..b684a21 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2775,6 +2775,9 @@ void exit_mmap(struct mm_struct *mm)
 	struct vm_area_struct *vma;
 	unsigned long nr_accounted = 0;
 
+	/* Important to call this first. */
+	khugepaged_exit(mm);
+
 	/* mm's last user has gone, and its about to be pulled down */
 	mmu_notifier_release(mm);
 
-- 
1.9.0

* [PATCH 2/6] mm: differentiate unmap for vmscan from other unmap.
From: Jérôme Glisse @ 2014-06-28  2:00 UTC (permalink / raw)
  To: akpm, linux-mm, linux-kernel
  Cc: mgorman, hpa, peterz, aarcange, riel, jweiner, torvalds,
	Mark Hairgrove, Jatin Kumar, Subhash Gutti, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, John Hubbard,
	Sherry Cheung, Duncan Poole, Oded Gabbay, Alexander Deucher,
	Andrew Lewycky, Jérôme Glisse

From: Jérôme Glisse <jglisse@redhat.com>

New code will need to be able to differentiate between a regular unmap and an
unmap triggered by vmscan, in which case we want to be as quick as possible.
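
As an illustration of the kind of branching this split enables (the helper
below is hypothetical and not part of the patch; only the TTU_VMSCAN and
TTU_POISON flags are introduced here):

#include <linux/rmap.h>

/* Decide whether an unmap caller may skip expensive bookkeeping. */
static bool unmap_is_fast_path(enum ttu_flags flags)
{
	/* vmscan wants the cheapest possible unmap */
	if (flags & TTU_VMSCAN)
		return true;
	/* hwpoison (and everything else) favors correctness over speed */
	return false;
}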

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 include/linux/rmap.h | 15 ++++++++-------
 mm/memory-failure.c  |  2 +-
 mm/vmscan.c          |  4 ++--
 3 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index be57450..eddbc07 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -72,13 +72,14 @@ struct anon_vma_chain {
 };
 
 enum ttu_flags {
-	TTU_UNMAP = 1,			/* unmap mode */
-	TTU_MIGRATION = 2,		/* migration mode */
-	TTU_MUNLOCK = 4,		/* munlock mode */
-
-	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
-	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
-	TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
+	TTU_VMSCAN = 1,			/* unmap for vmscan */
+	TTU_POISON = 2,			/* unmap for poison */
+	TTU_MIGRATION = 4,		/* migration mode */
+	TTU_MUNLOCK = 8,		/* munlock mode */
+
+	TTU_IGNORE_MLOCK = (1 << 9),	/* ignore mlock */
+	TTU_IGNORE_ACCESS = (1 << 10),	/* don't age */
+	TTU_IGNORE_HWPOISON = (1 << 11),/* corrupted page is recoverable */
 };
 
 #ifdef CONFIG_MMU
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index a7a89eb..ba176c4 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -887,7 +887,7 @@ static int page_action(struct page_state *ps, struct page *p,
 static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
 				  int trapno, int flags, struct page **hpagep)
 {
-	enum ttu_flags ttu = TTU_UNMAP | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
+	enum ttu_flags ttu = TTU_POISON | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
 	struct address_space *mapping;
 	LIST_HEAD(tokill);
 	int ret;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6d24fd6..5a7d286 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1163,7 +1163,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 	}
 
 	ret = shrink_page_list(&clean_pages, zone, &sc,
-			TTU_UNMAP|TTU_IGNORE_ACCESS,
+			TTU_VMSCAN|TTU_IGNORE_ACCESS,
 			&dummy1, &dummy2, &dummy3, &dummy4, &dummy5, true);
 	list_splice(&clean_pages, page_list);
 	mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret);
@@ -1518,7 +1518,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	if (nr_taken == 0)
 		return 0;
 
-	nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP,
+	nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_VMSCAN,
 				&nr_dirty, &nr_unqueued_dirty, &nr_congested,
 				&nr_writeback, &nr_immediate,
 				false);
-- 
1.9.0

* [PATCH 3/6] mmu_notifier: add event information to address invalidation v2
From: Jérôme Glisse @ 2014-06-28  2:00 UTC (permalink / raw)
  To: akpm, linux-mm, linux-kernel
  Cc: mgorman, hpa, peterz, aarcange, riel, jweiner, torvalds,
	Mark Hairgrove, Jatin Kumar, Subhash Gutti, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, John Hubbard,
	Sherry Cheung, Duncan Poole, Oded Gabbay, Alexander Deucher,
	Andrew Lewycky, Jérôme Glisse

From: Jérôme Glisse <jglisse@redhat.com>

The event information will be useful for new users of the mmu_notifier API.
The event argument differentiates between a vma disappearing, a page being
write protected or simply a page being unmapped. This allows new users to
take different paths for different events: for instance, on unmap the
resources used to track a vma are still valid and should stay around, while
if the event says that a vma is being destroyed, any resources used to
track that vma can be freed.

Changed since v1:
  - renamed action to event (updated commit message too).
  - simplified the event names and clarified their intended usage,
    also documenting what expectations the listener can have with
    respect to each event.
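
To illustrate how a listener might use the new argument (the callback below
is a hypothetical sketch; the signature and the enum mmu_event values are
the ones introduced by this patch):

#include <linux/mmu_notifier.h>

static void example_invalidate_range_start(struct mmu_notifier *mn,
					   struct mm_struct *mm,
					   unsigned long start,
					   unsigned long end,
					   enum mmu_event event)
{
	switch (event) {
	case MMU_MUNMAP:
		/* The range goes away for good: the listener can drop
		 * whatever it uses to track [start, end), not merely
		 * invalidate its secondary mappings. */
		break;
	case MMU_WB:
		/* Writes must stop once this callback returns, but
		 * read-only secondary mappings may be kept. */
		break;
	case MMU_STATUS:
		/* Status-only change (e.g. soft dirty); a listener that
		 * does not track such bits has nothing to do. */
		break;
	default:
		/* MMU_MIGRATE and anything unrecognized: treat as a
		 * full invalidation of the range. */
		break;
	}
}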

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 drivers/gpu/drm/i915/i915_gem_userptr.c |   3 +-
 drivers/iommu/amd_iommu_v2.c            |  14 ++--
 drivers/misc/sgi-gru/grutlbpurge.c      |   9 ++-
 drivers/xen/gntdev.c                    |   9 ++-
 fs/proc/task_mmu.c                      |   6 +-
 include/linux/hugetlb.h                 |   7 +-
 include/linux/mmu_notifier.h            | 117 ++++++++++++++++++++++++++------
 kernel/events/uprobes.c                 |  10 ++-
 mm/filemap_xip.c                        |   2 +-
 mm/huge_memory.c                        |  51 ++++++++------
 mm/hugetlb.c                            |  25 ++++---
 mm/ksm.c                                |  18 +++--
 mm/memory.c                             |  27 +++++---
 mm/migrate.c                            |   9 ++-
 mm/mmu_notifier.c                       |  28 +++++---
 mm/mprotect.c                           |  33 ++++++---
 mm/mremap.c                             |   6 +-
 mm/rmap.c                               |  24 +++++--
 virt/kvm/kvm_main.c                     |  12 ++--
 19 files changed, 291 insertions(+), 119 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 21ea928..ed6f35e 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -56,7 +56,8 @@ struct i915_mmu_object {
 static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 						       struct mm_struct *mm,
 						       unsigned long start,
-						       unsigned long end)
+						       unsigned long end,
+						       enum mmu_event event)
 {
 	struct i915_mmu_notifier *mn = container_of(_mn, struct i915_mmu_notifier, mn);
 	struct interval_tree_node *it = NULL;
diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 499b436..2bb9771 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -414,21 +414,25 @@ static int mn_clear_flush_young(struct mmu_notifier *mn,
 static void mn_change_pte(struct mmu_notifier *mn,
 			  struct mm_struct *mm,
 			  unsigned long address,
-			  pte_t pte)
+			  pte_t pte,
+			  enum mmu_event event)
 {
 	__mn_flush_page(mn, address);
 }
 
 static void mn_invalidate_page(struct mmu_notifier *mn,
 			       struct mm_struct *mm,
-			       unsigned long address)
+			       unsigned long address,
+			       enum mmu_event event)
 {
 	__mn_flush_page(mn, address);
 }
 
 static void mn_invalidate_range_start(struct mmu_notifier *mn,
 				      struct mm_struct *mm,
-				      unsigned long start, unsigned long end)
+				      unsigned long start,
+				      unsigned long end,
+				      enum mmu_event event)
 {
 	struct pasid_state *pasid_state;
 	struct device_state *dev_state;
@@ -449,7 +453,9 @@ static void mn_invalidate_range_start(struct mmu_notifier *mn,
 
 static void mn_invalidate_range_end(struct mmu_notifier *mn,
 				    struct mm_struct *mm,
-				    unsigned long start, unsigned long end)
+				    unsigned long start,
+				    unsigned long end,
+				    enum mmu_event event)
 {
 	struct pasid_state *pasid_state;
 	struct device_state *dev_state;
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index 2129274..e67fed1 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -221,7 +221,8 @@ void gru_flush_all_tlb(struct gru_state *gru)
  */
 static void gru_invalidate_range_start(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end)
+				       unsigned long start, unsigned long end,
+				       enum mmu_event event)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
@@ -235,7 +236,8 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
 
 static void gru_invalidate_range_end(struct mmu_notifier *mn,
 				     struct mm_struct *mm, unsigned long start,
-				     unsigned long end)
+				     unsigned long end,
+				     enum mmu_event event)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
@@ -248,7 +250,8 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
 }
 
 static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
-				unsigned long address)
+				unsigned long address,
+				enum mmu_event event)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 073b4a1..fe9da94 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -428,7 +428,9 @@ static void unmap_if_in_range(struct grant_map *map,
 
 static void mn_invl_range_start(struct mmu_notifier *mn,
 				struct mm_struct *mm,
-				unsigned long start, unsigned long end)
+				unsigned long start,
+				unsigned long end,
+				enum mmu_event event)
 {
 	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
 	struct grant_map *map;
@@ -445,9 +447,10 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
 
 static void mn_invl_page(struct mmu_notifier *mn,
 			 struct mm_struct *mm,
-			 unsigned long address)
+			 unsigned long address,
+			 enum mmu_event event)
 {
-	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE);
+	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, event);
 }
 
 static void mn_release(struct mmu_notifier *mn,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index cfa63ee..e9e79f7 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -830,7 +830,8 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 		};
 		down_read(&mm->mmap_sem);
 		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_start(mm, 0, -1);
+			mmu_notifier_invalidate_range_start(mm, 0,
+							    -1, MMU_STATUS);
 		for (vma = mm->mmap; vma; vma = vma->vm_next) {
 			cp.vma = vma;
 			if (is_vm_hugetlb_page(vma))
@@ -858,7 +859,8 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 					&clear_refs_walk);
 		}
 		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_end(mm, 0, -1);
+			mmu_notifier_invalidate_range_end(mm, 0,
+							  -1, MMU_STATUS);
 		flush_tlb_mm(mm);
 		up_read(&mm->mmap_sem);
 		mmput(mm);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 6a836ef..d7e512f 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -6,6 +6,7 @@
 #include <linux/fs.h>
 #include <linux/hugetlb_inline.h>
 #include <linux/cgroup.h>
+#include <linux/mmu_notifier.h>
 #include <linux/list.h>
 #include <linux/kref.h>
 
@@ -103,7 +104,8 @@ struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
 int pmd_huge(pmd_t pmd);
 int pud_huge(pud_t pmd);
 unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
-		unsigned long address, unsigned long end, pgprot_t newprot);
+		unsigned long address, unsigned long end, pgprot_t newprot,
+		enum mmu_event event);
 
 #else /* !CONFIG_HUGETLB_PAGE */
 
@@ -148,7 +150,8 @@ static inline bool isolate_huge_page(struct page *page, struct list_head *list)
 #define is_hugepage_active(x)	false
 
 static inline unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
-		unsigned long address, unsigned long end, pgprot_t newprot)
+		unsigned long address, unsigned long end, pgprot_t newprot,
+		enum mmu_event event)
 {
 	return 0;
 }
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index deca874..82e9577 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -9,6 +9,52 @@
 struct mmu_notifier;
 struct mmu_notifier_ops;
 
+/* Event report finer informations to the callback allowing the event listener
+ * to take better action. There are only few kinds of events :
+ *
+ *   - MMU_MIGRATE memory is migrating from one page to another thus all write
+ *     access must stop after invalidate_range_start callback returns. And no
+ *     read access should be allowed either as new page can be remapped with
+ *     write access before the invalidate_range_end callback happen and thus
+ *     any read access to old page might access outdated informations. Several
+ *     source to this event like page moving to swap (for various reasons like
+ *     page reclaim), outcome of mremap syscall, migration for numa reasons,
+ *     balancing memory pool, write fault on read only page trigger a new page
+ *     to be allocated and used, ...
+ *   - MMU_MPROT_NONE memory access protection is change, no page in the range
+ *     can be accessed in either read or write mode but the range of address
+ *     is still valid. All access are still fine until invalidate_range_end
+ *     callback returns.
+ *   - MMU_MPROT_RONLY memory access proctection is changing to read only.
+ *     All access are still fine until invalidate_range_end callback returns.
+ *   - MMU_MPROT_RANDW memory access proctection is changing to read an write.
+ *     All access are still fine until invalidate_range_end callback returns.
+ *   - MMU_MPROT_WONLY memory access proctection is changing to write only.
+ *     All access are still fine until invalidate_range_end callback returns.
+ *   - MMU_MUNMAP the range is being unmaped (outcome of a munmap syscall). It
+ *     is fine to still have read/write access until the invalidate_range_end
+ *     callback returns. This also imply that secondary page table can be trim
+ *     as the address range is no longer valid.
+ *   - MMU_WB memory is being write back to disk, all write access must stop
+ *     after invalidate_range_start callback returns. Read access are still
+ *     allowed.
+ *   - MMU_STATUS memory status change, like soft dirty.
+ *
+ * In doubt when adding a new notifier caller use MMU_MIGRATE it will always
+ * result in expected behavior but will not allow listener a chance to optimize
+ * its events.
+ */
+enum mmu_event {
+	MMU_MIGRATE = 0,
+	MMU_MPROT_NONE,
+	MMU_MPROT_RONLY,
+	MMU_MPROT_RANDW,
+	MMU_MPROT_WONLY,
+	MMU_MUNMAP,
+	MMU_STATUS,
+	MMU_WB,
+};
+
 #ifdef CONFIG_MMU_NOTIFIER
 
 /*
@@ -79,7 +125,8 @@ struct mmu_notifier_ops {
 	void (*change_pte)(struct mmu_notifier *mn,
 			   struct mm_struct *mm,
 			   unsigned long address,
-			   pte_t pte);
+			   pte_t pte,
+			   enum mmu_event event);
 
 	/*
 	 * Before this is invoked any secondary MMU is still ok to
@@ -90,7 +137,8 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_page)(struct mmu_notifier *mn,
 				struct mm_struct *mm,
-				unsigned long address);
+				unsigned long address,
+				enum mmu_event event);
 
 	/*
 	 * invalidate_range_start() and invalidate_range_end() must be
@@ -137,10 +185,14 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_range_start)(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end);
+				       unsigned long start,
+				       unsigned long end,
+				       enum mmu_event event);
 	void (*invalidate_range_end)(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
-				     unsigned long start, unsigned long end);
+				     unsigned long start,
+				     unsigned long end,
+				     enum mmu_event event);
 };
 
 /*
@@ -177,13 +229,20 @@ extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
 extern int __mmu_notifier_test_young(struct mm_struct *mm,
 				     unsigned long address);
 extern void __mmu_notifier_change_pte(struct mm_struct *mm,
-				      unsigned long address, pte_t pte);
+				      unsigned long address,
+				      pte_t pte,
+				      enum mmu_event event);
 extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address);
+					  unsigned long address,
+					  enum mmu_event event);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end);
+						  unsigned long start,
+						  unsigned long end,
+						  enum mmu_event event);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end);
+						unsigned long start,
+						unsigned long end,
+						enum mmu_event event);
 
 static inline void mmu_notifier_release(struct mm_struct *mm)
 {
@@ -208,31 +267,38 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_change_pte(struct mm_struct *mm,
-					   unsigned long address, pte_t pte)
+					   unsigned long address,
+					   pte_t pte,
+					   enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_change_pte(mm, address, pte);
+		__mmu_notifier_change_pte(mm, address, pte, event);
 }
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address)
+						unsigned long address,
+						enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_page(mm, address);
+		__mmu_notifier_invalidate_page(mm, address, event);
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						       unsigned long start,
+						       unsigned long end,
+						       enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_start(mm, start, end);
+		__mmu_notifier_invalidate_range_start(mm, start, end, event);
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						     unsigned long start,
+						     unsigned long end,
+						     enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_end(mm, start, end);
+		__mmu_notifier_invalidate_range_end(mm, start, end, event);
 }
 
 static inline void mmu_notifier_mm_init(struct mm_struct *mm)
@@ -278,13 +344,13 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
  * old page would remain mapped readonly in the secondary MMUs after the new
  * page is already writable by some CPU through the primary MMU.
  */
-#define set_pte_at_notify(__mm, __address, __ptep, __pte)		\
+#define set_pte_at_notify(__mm, __address, __ptep, __pte, __event)	\
 ({									\
 	struct mm_struct *___mm = __mm;					\
 	unsigned long ___address = __address;				\
 	pte_t ___pte = __pte;						\
 									\
-	mmu_notifier_change_pte(___mm, ___address, ___pte);		\
+	mmu_notifier_change_pte(___mm, ___address, ___pte, __event);	\
 	set_pte_at(___mm, ___address, __ptep, ___pte);			\
 })
 
@@ -307,22 +373,29 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_change_pte(struct mm_struct *mm,
-					   unsigned long address, pte_t pte)
+					   unsigned long address,
+					   pte_t pte,
+					   enum mmu_event event)
 {
 }
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address)
+						unsigned long address,
+						enum mmu_event event)
 {
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						       unsigned long start,
+						       unsigned long end,
+						       enum mmu_event event)
 {
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						     unsigned long start,
+						     unsigned long end,
+						     enum mmu_event event)
 {
 }
 
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 32b04dc..296f81e 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -177,7 +177,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	/* For try_to_free_swap() and munlock_vma_page() below */
 	lock_page(page);
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	err = -EAGAIN;
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -195,7 +196,9 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	flush_cache_page(vma, addr, pte_pfn(*ptep));
 	ptep_clear_flush(vma, addr, ptep);
-	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+	set_pte_at_notify(mm, addr, ptep,
+			  mk_pte(kpage, vma->vm_page_prot),
+			  MMU_MIGRATE);
 
 	page_remove_rmap(page);
 	if (!page_mapped(page))
@@ -209,7 +212,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	err = 0;
  unlock:
 	mem_cgroup_cancel_charge(kpage, memcg);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 	unlock_page(page);
 	return err;
 }
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index d8d9fe3..a2b3f09 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -198,7 +198,7 @@ retry:
 			BUG_ON(pte_dirty(pteval));
 			pte_unmap_unlock(pte, ptl);
 			/* must invalidate_page _before_ freeing the page */
-			mmu_notifier_invalidate_page(mm, address);
+			mmu_notifier_invalidate_page(mm, address, MMU_MIGRATE);
 			page_cache_release(page);
 		}
 	}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5d562a9..fa30857 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1020,6 +1020,11 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 		set_page_private(pages[i], (unsigned long)memcg);
 	}
 
+	mmun_start = haddr;
+	mmun_end   = haddr + HPAGE_PMD_SIZE;
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
+
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		copy_user_highpage(pages[i], page + i,
 				   haddr + PAGE_SIZE * i, vma);
@@ -1027,10 +1032,6 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 		cond_resched();
 	}
 
-	mmun_start = haddr;
-	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
-
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, orig_pmd)))
 		goto out_free_pages;
@@ -1063,7 +1064,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	page_remove_rmap(page);
 	spin_unlock(ptl);
 
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	ret |= VM_FAULT_WRITE;
 	put_page(page);
@@ -1073,7 +1075,8 @@ out:
 
 out_free_pages:
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		memcg = (void *)page_private(pages[i]);
 		set_page_private(pages[i], 0);
@@ -1157,16 +1160,17 @@ alloc:
 
 	count_vm_event(THP_FAULT_ALLOC);
 
+	mmun_start = haddr;
+	mmun_end   = haddr + HPAGE_PMD_SIZE;
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
+
 	if (!page)
 		clear_huge_page(new_page, haddr, HPAGE_PMD_NR);
 	else
 		copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
 	__SetPageUptodate(new_page);
 
-	mmun_start = haddr;
-	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
-
 	spin_lock(ptl);
 	if (page)
 		put_user_huge_page(page);
@@ -1197,7 +1201,8 @@ alloc:
 	}
 	spin_unlock(ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 out:
 	return ret;
 out_unlock:
@@ -1632,7 +1637,8 @@ static int __split_huge_page_splitting(struct page *page,
 	const unsigned long mmun_start = address;
 	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_STATUS);
 	pmd = page_check_address_pmd(page, mm, address,
 			PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
 	if (pmd) {
@@ -1647,7 +1653,8 @@ static int __split_huge_page_splitting(struct page *page,
 		ret = 1;
 		spin_unlock(ptl);
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_STATUS);
 
 	return ret;
 }
@@ -2446,7 +2453,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 
 	mmun_start = address;
 	mmun_end   = address + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
 	/*
 	 * After this gup_fast can't run anymore. This also removes
@@ -2456,7 +2464,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 */
 	_pmd = pmdp_clear_flush(vma, address, pmd);
 	spin_unlock(pmd_ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	spin_lock(pte_ptl);
 	isolated = __collapse_huge_page_isolate(vma, address, pte);
@@ -2845,24 +2854,28 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
 again:
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_trans_huge(*pmd))) {
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(mm, mmun_start,
+						  mmun_end, MMU_MIGRATE);
 		return;
 	}
 	if (is_huge_zero_pmd(*pmd)) {
 		__split_huge_zero_page_pmd(vma, haddr, pmd);
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(mm, mmun_start,
+						  mmun_end, MMU_MIGRATE);
 		return;
 	}
 	page = pmd_page(*pmd);
 	VM_BUG_ON_PAGE(!page_count(page), page);
 	get_page(page);
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	split_huge_page(page);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7faab71..73e1576 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2565,7 +2565,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	mmun_start = vma->vm_start;
 	mmun_end = vma->vm_end;
 	if (cow)
-		mmu_notifier_invalidate_range_start(src, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_start(src, mmun_start,
+						    mmun_end, MMU_MIGRATE);
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
 		spinlock_t *src_ptl, *dst_ptl;
@@ -2615,7 +2616,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	}
 
 	if (cow)
-		mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(src, mmun_start,
+						  mmun_end, MMU_MIGRATE);
 
 	return ret;
 }
@@ -2641,7 +2643,8 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	BUG_ON(end & ~huge_page_mask(h));
 
 	tlb_start_vma(tlb, vma);
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 again:
 	for (address = start; address < end; address += sz) {
 		ptep = huge_pte_offset(mm, address);
@@ -2712,7 +2715,8 @@ unlock:
 		if (address < end && !ref_page)
 			goto again;
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 	tlb_end_vma(tlb, vma);
 }
 
@@ -2899,7 +2903,8 @@ retry_avoidcopy:
 
 	mmun_start = address & huge_page_mask(h);
 	mmun_end = mmun_start + huge_page_size(h);
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	/*
 	 * Retake the page table lock to check for racing updates
 	 * before the page tables are altered
@@ -2919,7 +2924,8 @@ retry_avoidcopy:
 		new_page = old_page;
 	}
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 	page_cache_release(new_page);
 	page_cache_release(old_page);
 
@@ -3344,7 +3350,8 @@ same_page:
 }
 
 unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
-		unsigned long address, unsigned long end, pgprot_t newprot)
+		unsigned long address, unsigned long end, pgprot_t newprot,
+		enum mmu_event event)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long start = address;
@@ -3356,7 +3363,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	BUG_ON(address >= end);
 	flush_cache_range(vma, address, end);
 
-	mmu_notifier_invalidate_range_start(mm, start, end);
+	mmu_notifier_invalidate_range_start(mm, start, end, event);
 	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
 	for (; address < end; address += huge_page_size(h)) {
 		spinlock_t *ptl;
@@ -3386,7 +3393,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	 */
 	flush_tlb_range(vma, start, end);
 	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
-	mmu_notifier_invalidate_range_end(mm, start, end);
+	mmu_notifier_invalidate_range_end(mm, start, end, event);
 
 	return pages << h->order;
 }
diff --git a/mm/ksm.c b/mm/ksm.c
index cb1e976..4b659f1 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -873,7 +873,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 
 	mmun_start = addr;
 	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MPROT_RONLY);
 
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -905,7 +906,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 		if (pte_dirty(entry))
 			set_page_dirty(page);
 		entry = pte_mkclean(pte_wrprotect(entry));
-		set_pte_at_notify(mm, addr, ptep, entry);
+		set_pte_at_notify(mm, addr, ptep, entry, MMU_MPROT_RONLY);
 	}
 	*orig_pte = *ptep;
 	err = 0;
@@ -913,7 +914,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 out_unlock:
 	pte_unmap_unlock(ptep, ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MPROT_RONLY);
 out:
 	return err;
 }
@@ -949,7 +951,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 
 	mmun_start = addr;
 	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 
 	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	if (!pte_same(*ptep, orig_pte)) {
@@ -962,7 +965,9 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 
 	flush_cache_page(vma, addr, pte_pfn(*ptep));
 	ptep_clear_flush(vma, addr, ptep);
-	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+	set_pte_at_notify(mm, addr, ptep,
+			  mk_pte(kpage, vma->vm_page_prot),
+			  MMU_MIGRATE);
 
 	page_remove_rmap(page);
 	if (!page_mapped(page))
@@ -972,7 +977,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	pte_unmap_unlock(ptep, ptl);
 	err = 0;
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 out:
 	return err;
 }
diff --git a/mm/memory.c b/mm/memory.c
index 09e2cd0..d3908f0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1050,7 +1050,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	mmun_end   = end;
 	if (is_cow)
 		mmu_notifier_invalidate_range_start(src_mm, mmun_start,
-						    mmun_end);
+						    mmun_end, MMU_MIGRATE);
 
 	ret = 0;
 	dst_pgd = pgd_offset(dst_mm, addr);
@@ -1067,7 +1067,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
 
 	if (is_cow)
-		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end,
+						  MMU_MIGRATE);
 	return ret;
 }
 
@@ -1371,10 +1372,12 @@ void unmap_vmas(struct mmu_gather *tlb,
 {
 	struct mm_struct *mm = vma->vm_mm;
 
-	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
+	mmu_notifier_invalidate_range_start(mm, start_addr,
+					    end_addr, MMU_MUNMAP);
 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
 		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
-	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
+	mmu_notifier_invalidate_range_end(mm, start_addr,
+					  end_addr, MMU_MUNMAP);
 }
 
 /**
@@ -1396,10 +1399,10 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, start, end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, start, end);
+	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MUNMAP);
 	for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
 		unmap_single_vma(&tlb, vma, start, end, details);
-	mmu_notifier_invalidate_range_end(mm, start, end);
+	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MUNMAP);
 	tlb_finish_mmu(&tlb, start, end);
 }
 
@@ -1422,9 +1425,9 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, address, end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, address, end);
+	mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
 	unmap_single_vma(&tlb, vma, address, end, details);
-	mmu_notifier_invalidate_range_end(mm, address, end);
+	mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
 	tlb_finish_mmu(&tlb, address, end);
 }
 
@@ -2208,7 +2211,8 @@ gotten:
 
 	mmun_start  = address & PAGE_MASK;
 	mmun_end    = mmun_start + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 
 	/*
 	 * Re-check the pte - we dropped the lock
@@ -2240,7 +2244,7 @@ gotten:
 		 * mmu page tables (such as kvm shadow page tables), we want the
 		 * new page to be mapped directly into the secondary page table.
 		 */
-		set_pte_at_notify(mm, address, page_table, entry);
+		set_pte_at_notify(mm, address, page_table, entry, MMU_MIGRATE);
 		update_mmu_cache(vma, address, page_table);
 		if (old_page) {
 			/*
@@ -2279,7 +2283,8 @@ gotten:
 unlock:
 	pte_unmap_unlock(page_table, ptl);
 	if (mmun_end > mmun_start)
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(mm, mmun_start,
+						  mmun_end, MMU_MIGRATE);
 	if (old_page) {
 		/*
 		 * Don't let another task, with possibly unlocked vma,
diff --git a/mm/migrate.c b/mm/migrate.c
index ab43fbf..b526c72 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1820,12 +1820,14 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	WARN_ON(PageLRU(new_page));
 
 	/* Recheck the target PMD */
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
 fail_putback:
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(mm, mmun_start,
+						  mmun_end, MMU_MIGRATE);
 
 		/* Reverse changes made by migrate_page_copy() */
 		if (TestClearPageActive(new_page))
@@ -1878,7 +1880,8 @@ fail_putback:
 	page_remove_rmap(page);
 
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	/* Take an "isolate" reference and put new page on the LRU. */
 	get_page(new_page);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 41cefdf..9decb88 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -122,8 +122,10 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
 	return young;
 }
 
-void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
-			       pte_t pte)
+void __mmu_notifier_change_pte(struct mm_struct *mm,
+			       unsigned long address,
+			       pte_t pte,
+			       enum mmu_event event)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -131,13 +133,14 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->change_pte)
-			mn->ops->change_pte(mn, mm, address, pte);
+			mn->ops->change_pte(mn, mm, address, pte, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 
 void __mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address)
+				    unsigned long address,
+				    enum mmu_event event)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -145,13 +148,16 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_page)
-			mn->ops->invalidate_page(mn, mm, address);
+			mn->ops->invalidate_page(mn, mm, address, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 
 void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+					   unsigned long start,
+					   unsigned long end,
+					   enum mmu_event event)
+
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -159,14 +165,17 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_start)
-			mn->ops->invalidate_range_start(mn, mm, start, end);
+			mn->ops->invalidate_range_start(mn, mm, start,
+							end, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
 
 void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+					 unsigned long start,
+					 unsigned long end,
+					 enum mmu_event event)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -174,7 +183,8 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_end)
-			mn->ops->invalidate_range_end(mn, mm, start, end);
+			mn->ops->invalidate_range_end(mn, mm, start,
+						      end, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
diff --git a/mm/mprotect.c b/mm/mprotect.c
index c43d557..6ce6c23 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -137,7 +137,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		pud_t *pud, unsigned long addr, unsigned long end,
-		pgprot_t newprot, int dirty_accountable, int prot_numa)
+		pgprot_t newprot, int dirty_accountable, int prot_numa,
+		enum mmu_event event)
 {
 	pmd_t *pmd;
 	struct mm_struct *mm = vma->vm_mm;
@@ -157,7 +158,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		/* invoke the mmu notifier if the pmd is populated */
 		if (!mni_start) {
 			mni_start = addr;
-			mmu_notifier_invalidate_range_start(mm, mni_start, end);
+			mmu_notifier_invalidate_range_start(mm, mni_start,
+							    end, event);
 		}
 
 		if (pmd_trans_huge(*pmd)) {
@@ -185,7 +187,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	} while (pmd++, addr = next, addr != end);
 
 	if (mni_start)
-		mmu_notifier_invalidate_range_end(mm, mni_start, end);
+		mmu_notifier_invalidate_range_end(mm, mni_start, end, event);
 
 	if (nr_huge_updates)
 		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
@@ -194,7 +196,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 
 static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 		pgd_t *pgd, unsigned long addr, unsigned long end,
-		pgprot_t newprot, int dirty_accountable, int prot_numa)
+		pgprot_t newprot, int dirty_accountable, int prot_numa,
+		enum mmu_event event)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -206,7 +209,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		pages += change_pmd_range(vma, pud, addr, next, newprot,
-				 dirty_accountable, prot_numa);
+				 dirty_accountable, prot_numa, event);
 	} while (pud++, addr = next, addr != end);
 
 	return pages;
@@ -214,7 +217,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 
 static unsigned long change_protection_range(struct vm_area_struct *vma,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
-		int dirty_accountable, int prot_numa)
+		int dirty_accountable, int prot_numa, enum mmu_event event)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
@@ -231,7 +234,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		pages += change_pud_range(vma, pgd, addr, next, newprot,
-				 dirty_accountable, prot_numa);
+				 dirty_accountable, prot_numa, event);
 	} while (pgd++, addr = next, addr != end);
 
 	/* Only flush the TLB if we actually modified any entries: */
@@ -247,11 +250,23 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 		       int dirty_accountable, int prot_numa)
 {
 	unsigned long pages;
+	enum mmu_event event = MMU_MPROT_NONE;
+
+	/* At this points vm_flags is updated. */
+	if ((vma->vm_flags & VM_READ) && (vma->vm_flags & VM_WRITE))
+		event = MMU_MPROT_RANDW;
+	else if (vma->vm_flags & VM_WRITE)
+		event = MMU_MPROT_WONLY;
+	else if (vma->vm_flags & VM_READ)
+		event = MMU_MPROT_RONLY;
 
 	if (is_vm_hugetlb_page(vma))
-		pages = hugetlb_change_protection(vma, start, end, newprot);
+		pages = hugetlb_change_protection(vma, start, end,
+						  newprot, event);
 	else
-		pages = change_protection_range(vma, start, end, newprot, dirty_accountable, prot_numa);
+		pages = change_protection_range(vma, start, end, newprot,
+						dirty_accountable,
+						prot_numa, event);
 
 	return pages;
 }
diff --git a/mm/mremap.c b/mm/mremap.c
index 05f1180..6827d2f 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -177,7 +177,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 
 	mmun_start = old_addr;
 	mmun_end   = old_end;
-	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
 		cond_resched();
@@ -228,7 +229,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	if (likely(need_flush))
 		flush_tlb_range(vma, old_end-len, old_addr);
 
-	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	return len + old_addr - old_end;	/* how much done */
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index 7928ddd..bd7e6d7 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -840,7 +840,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 	pte_unmap_unlock(pte, ptl);
 
 	if (ret) {
-		mmu_notifier_invalidate_page(mm, address);
+		mmu_notifier_invalidate_page(mm, address, MMU_WB);
 		(*cleaned)++;
 	}
 out:
@@ -1128,6 +1128,10 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	int ret = SWAP_AGAIN;
 	enum ttu_flags flags = (enum ttu_flags)arg;
+	enum mmu_event event = MMU_MIGRATE;
+
+	if (flags & TTU_MUNLOCK)
+		event = MMU_STATUS;
 
 	pte = page_check_address(page, mm, address, &ptl, 0);
 	if (!pte)
@@ -1233,7 +1237,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret != SWAP_FAIL && !(flags & TTU_MUNLOCK))
-		mmu_notifier_invalidate_page(mm, address);
+		mmu_notifier_invalidate_page(mm, address, event);
 out:
 	return ret;
 
@@ -1287,7 +1291,9 @@ out_mlock:
 #define CLUSTER_MASK	(~(CLUSTER_SIZE - 1))
 
 static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
-		struct vm_area_struct *vma, struct page *check_page)
+				struct vm_area_struct *vma,
+				struct page *check_page,
+				enum ttu_flags flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pmd_t *pmd;
@@ -1301,6 +1307,10 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 	unsigned long end;
 	int ret = SWAP_AGAIN;
 	int locked_vma = 0;
+	enum mmu_event event = MMU_MIGRATE;
+
+	if (flags & TTU_MUNLOCK)
+		event = MMU_STATUS;
 
 	address = (vma->vm_start + cursor) & CLUSTER_MASK;
 	end = address + CLUSTER_SIZE;
@@ -1315,7 +1325,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 
 	mmun_start = address;
 	mmun_end   = end;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, event);
 
 	/*
 	 * If we can acquire the mmap_sem for read, and vma is VM_LOCKED,
@@ -1380,7 +1390,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 		(*mapcount)--;
 	}
 	pte_unmap_unlock(pte - 1, ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, event);
 	if (locked_vma)
 		up_read(&vma->vm_mm->mmap_sem);
 	return ret;
@@ -1436,7 +1446,9 @@ static int try_to_unmap_nonlinear(struct page *page,
 			while (cursor < max_nl_cursor &&
 				cursor < vma->vm_end - vma->vm_start) {
 				if (try_to_unmap_cluster(cursor, &mapcount,
-						vma, page) == SWAP_MLOCK)
+							 vma, page,
+							 (enum ttu_flags)arg)
+							 == SWAP_MLOCK)
 					ret = SWAP_MLOCK;
 				cursor += CLUSTER_SIZE;
 				vma->vm_private_data = (void *) cursor;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4b6c01b..6e1992f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -262,7 +262,8 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 
 static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
-					     unsigned long address)
+					     unsigned long address,
+					     enum mmu_event event)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush, idx;
@@ -301,7 +302,8 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 					struct mm_struct *mm,
 					unsigned long address,
-					pte_t pte)
+					pte_t pte,
+					enum mmu_event event)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int idx;
@@ -317,7 +319,8 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
 						    unsigned long start,
-						    unsigned long end)
+						    unsigned long end,
+						    enum mmu_event event)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush = 0, idx;
@@ -343,7 +346,8 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 						  struct mm_struct *mm,
 						  unsigned long start,
-						  unsigned long end)
+						  unsigned long end,
+						  enum mmu_event event)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 3/6] mmu_notifier: add event information to address invalidation v2
@ 2014-06-28  2:00   ` Jérôme Glisse
  0 siblings, 0 replies; 76+ messages in thread
From: Jérôme Glisse @ 2014-06-28  2:00 UTC (permalink / raw)
  To: akpm, linux-mm, linux-kernel
  Cc: mgorman, hpa, peterz, aarcange, riel, jweiner, torvalds,
	Mark Hairgrove, Jatin Kumar, Subhash Gutti, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, John Hubbard,
	Sherry Cheung, Duncan Poole, Oded Gabbay, Alexander Deucher,
	Andrew Lewycky, Jérôme Glisse

From: Jérôme Glisse <jglisse@redhat.com>

The event information will be useful for new users of the mmu_notifier API.
The event argument differentiates between a vma disappearing, a page being
write protected or simply a page being unmapped. This allows a new user to
take a different path for each event: for instance, on unmap the resources
used to track a vma are still valid and should stay around, while if the
event says that a vma is being destroyed it means that any resources used
to track this vma can be freed.
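
To illustrate, here is a minimal sketch of what a listener could do with the
new argument. Only the callback signature comes from this patch; the my_mirror
structure and its helpers are hypothetical and shown only for illustration:

static void my_mirror_invalidate_range_start(struct mmu_notifier *mn,
					     struct mm_struct *mm,
					     unsigned long start,
					     unsigned long end,
					     enum mmu_event event)
{
	struct my_mirror *mirror = container_of(mn, struct my_mirror, mn);

	switch (event) {
	case MMU_MUNMAP:
		/* Range goes away for good: tracking structures can be freed. */
		my_mirror_free_range(mirror, start, end);
		break;
	case MMU_MPROT_RONLY:
	case MMU_WB:
		/* Only writes must stop: downgrade to a read-only mapping. */
		my_mirror_write_protect(mirror, start, end);
		break;
	default:
		/* MMU_MIGRATE and anything else: invalidate the whole range. */
		my_mirror_invalidate(mirror, start, end);
		break;
	}
}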

Changed since v1:
  - renamed action to event (updated commit message too).
  - simplified the event names and clarified their intended usage,
    also documenting what expectations the listener can have with
    respect to each event.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 drivers/gpu/drm/i915/i915_gem_userptr.c |   3 +-
 drivers/iommu/amd_iommu_v2.c            |  14 ++--
 drivers/misc/sgi-gru/grutlbpurge.c      |   9 ++-
 drivers/xen/gntdev.c                    |   9 ++-
 fs/proc/task_mmu.c                      |   6 +-
 include/linux/hugetlb.h                 |   7 +-
 include/linux/mmu_notifier.h            | 117 ++++++++++++++++++++++++++------
 kernel/events/uprobes.c                 |  10 ++-
 mm/filemap_xip.c                        |   2 +-
 mm/huge_memory.c                        |  51 ++++++++------
 mm/hugetlb.c                            |  25 ++++---
 mm/ksm.c                                |  18 +++--
 mm/memory.c                             |  27 +++++---
 mm/migrate.c                            |   9 ++-
 mm/mmu_notifier.c                       |  28 +++++---
 mm/mprotect.c                           |  33 ++++++---
 mm/mremap.c                             |   6 +-
 mm/rmap.c                               |  24 +++++--
 virt/kvm/kvm_main.c                     |  12 ++--
 19 files changed, 291 insertions(+), 119 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 21ea928..ed6f35e 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -56,7 +56,8 @@ struct i915_mmu_object {
 static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 						       struct mm_struct *mm,
 						       unsigned long start,
-						       unsigned long end)
+						       unsigned long end,
+						       enum mmu_event event)
 {
 	struct i915_mmu_notifier *mn = container_of(_mn, struct i915_mmu_notifier, mn);
 	struct interval_tree_node *it = NULL;
diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 499b436..2bb9771 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -414,21 +414,25 @@ static int mn_clear_flush_young(struct mmu_notifier *mn,
 static void mn_change_pte(struct mmu_notifier *mn,
 			  struct mm_struct *mm,
 			  unsigned long address,
-			  pte_t pte)
+			  pte_t pte,
+			  enum mmu_event event)
 {
 	__mn_flush_page(mn, address);
 }
 
 static void mn_invalidate_page(struct mmu_notifier *mn,
 			       struct mm_struct *mm,
-			       unsigned long address)
+			       unsigned long address,
+			       enum mmu_event event)
 {
 	__mn_flush_page(mn, address);
 }
 
 static void mn_invalidate_range_start(struct mmu_notifier *mn,
 				      struct mm_struct *mm,
-				      unsigned long start, unsigned long end)
+				      unsigned long start,
+				      unsigned long end,
+				      enum mmu_event event)
 {
 	struct pasid_state *pasid_state;
 	struct device_state *dev_state;
@@ -449,7 +453,9 @@ static void mn_invalidate_range_start(struct mmu_notifier *mn,
 
 static void mn_invalidate_range_end(struct mmu_notifier *mn,
 				    struct mm_struct *mm,
-				    unsigned long start, unsigned long end)
+				    unsigned long start,
+				    unsigned long end,
+				    enum mmu_event event)
 {
 	struct pasid_state *pasid_state;
 	struct device_state *dev_state;
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index 2129274..e67fed1 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -221,7 +221,8 @@ void gru_flush_all_tlb(struct gru_state *gru)
  */
 static void gru_invalidate_range_start(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end)
+				       unsigned long start, unsigned long end,
+				       enum mmu_event event)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
@@ -235,7 +236,8 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
 
 static void gru_invalidate_range_end(struct mmu_notifier *mn,
 				     struct mm_struct *mm, unsigned long start,
-				     unsigned long end)
+				     unsigned long end,
+				     enum mmu_event event)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
@@ -248,7 +250,8 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
 }
 
 static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
-				unsigned long address)
+				unsigned long address,
+				enum mmu_event event)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 073b4a1..fe9da94 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -428,7 +428,9 @@ static void unmap_if_in_range(struct grant_map *map,
 
 static void mn_invl_range_start(struct mmu_notifier *mn,
 				struct mm_struct *mm,
-				unsigned long start, unsigned long end)
+				unsigned long start,
+				unsigned long end,
+				enum mmu_event event)
 {
 	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
 	struct grant_map *map;
@@ -445,9 +447,10 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
 
 static void mn_invl_page(struct mmu_notifier *mn,
 			 struct mm_struct *mm,
-			 unsigned long address)
+			 unsigned long address,
+			 enum mmu_event event)
 {
-	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE);
+	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, event);
 }
 
 static void mn_release(struct mmu_notifier *mn,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index cfa63ee..e9e79f7 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -830,7 +830,8 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 		};
 		down_read(&mm->mmap_sem);
 		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_start(mm, 0, -1);
+			mmu_notifier_invalidate_range_start(mm, 0,
+							    -1, MMU_STATUS);
 		for (vma = mm->mmap; vma; vma = vma->vm_next) {
 			cp.vma = vma;
 			if (is_vm_hugetlb_page(vma))
@@ -858,7 +859,8 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 					&clear_refs_walk);
 		}
 		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_end(mm, 0, -1);
+			mmu_notifier_invalidate_range_end(mm, 0,
+							  -1, MMU_STATUS);
 		flush_tlb_mm(mm);
 		up_read(&mm->mmap_sem);
 		mmput(mm);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 6a836ef..d7e512f 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -6,6 +6,7 @@
 #include <linux/fs.h>
 #include <linux/hugetlb_inline.h>
 #include <linux/cgroup.h>
+#include <linux/mmu_notifier.h>
 #include <linux/list.h>
 #include <linux/kref.h>
 
@@ -103,7 +104,8 @@ struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
 int pmd_huge(pmd_t pmd);
 int pud_huge(pud_t pmd);
 unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
-		unsigned long address, unsigned long end, pgprot_t newprot);
+		unsigned long address, unsigned long end, pgprot_t newprot,
+		enum mmu_event event);
 
 #else /* !CONFIG_HUGETLB_PAGE */
 
@@ -148,7 +150,8 @@ static inline bool isolate_huge_page(struct page *page, struct list_head *list)
 #define is_hugepage_active(x)	false
 
 static inline unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
-		unsigned long address, unsigned long end, pgprot_t newprot)
+		unsigned long address, unsigned long end, pgprot_t newprot,
+		enum mmu_event event)
 {
 	return 0;
 }
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index deca874..82e9577 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -9,6 +9,52 @@
 struct mmu_notifier;
 struct mmu_notifier_ops;
 
+/* The event reports finer information to the callback, allowing the event
+ * listener to take better action. There are only a few kinds of events:
+ *
+ *   - MMU_MIGRATE memory is migrating from one page to another, so all write
+ *     access must stop after the invalidate_range_start callback returns. No
+ *     read access should be allowed either, as the new page can be remapped
+ *     with write access before the invalidate_range_end callback happens and
+ *     thus any read access to the old page might return outdated information.
+ *     Several sources trigger this event, like a page moving to swap (for
+ *     various reasons such as page reclaim), the outcome of an mremap syscall,
+ *     migration for NUMA reasons, memory pool balancing, a write fault on a
+ *     read-only page triggering a new page to be allocated and used, ...
+ *   - MMU_MPROT_NONE memory access protection is changing; no page in the
+ *     range can be accessed in either read or write mode but the address
+ *     range is still valid. All accesses are still fine until the
+ *     invalidate_range_end callback returns.
+ *   - MMU_MPROT_RONLY memory access protection is changing to read only.
+ *     All accesses are still fine until invalidate_range_end callback returns.
+ *   - MMU_MPROT_RANDW memory access protection is changing to read and write.
+ *     All accesses are still fine until invalidate_range_end callback returns.
+ *   - MMU_MPROT_WONLY memory access protection is changing to write only.
+ *     All accesses are still fine until invalidate_range_end callback returns.
+ *   - MMU_MUNMAP the range is being unmapped (outcome of munmap syscall). It
+ *     is fine to still have read/write access until the invalidate_range_end
+ *     callback returns. This also implies that the secondary page table can
+ *     be trimmed, as the address range is no longer valid.
+ *   - MMU_WB memory is being written back to disk; all write access must stop
+ *     after the invalidate_range_start callback returns. Read accesses are
+ *     still allowed.
+ *   - MMU_STATUS memory status change, like soft dirty.
+ *
+ * When in doubt, a new notifier caller should use MMU_MIGRATE: it will always
+ * result in the expected behavior but will not give the listener a chance to
+ * optimize for the event.
+ */
+enum mmu_event {
+	MMU_MIGRATE = 0,
+	MMU_MPROT_NONE,
+	MMU_MPROT_RONLY,
+	MMU_MPROT_RANDW,
+	MMU_MPROT_WONLY,
+	MMU_MUNMAP,
+	MMU_STATUS,
+	MMU_WB,
+};
+
 #ifdef CONFIG_MMU_NOTIFIER
 
 /*
@@ -79,7 +125,8 @@ struct mmu_notifier_ops {
 	void (*change_pte)(struct mmu_notifier *mn,
 			   struct mm_struct *mm,
 			   unsigned long address,
-			   pte_t pte);
+			   pte_t pte,
+			   enum mmu_event event);
 
 	/*
 	 * Before this is invoked any secondary MMU is still ok to
@@ -90,7 +137,8 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_page)(struct mmu_notifier *mn,
 				struct mm_struct *mm,
-				unsigned long address);
+				unsigned long address,
+				enum mmu_event event);
 
 	/*
 	 * invalidate_range_start() and invalidate_range_end() must be
@@ -137,10 +185,14 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_range_start)(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end);
+				       unsigned long start,
+				       unsigned long end,
+				       enum mmu_event event);
 	void (*invalidate_range_end)(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
-				     unsigned long start, unsigned long end);
+				     unsigned long start,
+				     unsigned long end,
+				     enum mmu_event event);
 };
 
 /*
@@ -177,13 +229,20 @@ extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
 extern int __mmu_notifier_test_young(struct mm_struct *mm,
 				     unsigned long address);
 extern void __mmu_notifier_change_pte(struct mm_struct *mm,
-				      unsigned long address, pte_t pte);
+				      unsigned long address,
+				      pte_t pte,
+				      enum mmu_event event);
 extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address);
+					  unsigned long address,
+					  enum mmu_event event);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end);
+						  unsigned long start,
+						  unsigned long end,
+						  enum mmu_event event);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end);
+						unsigned long start,
+						unsigned long end,
+						enum mmu_event event);
 
 static inline void mmu_notifier_release(struct mm_struct *mm)
 {
@@ -208,31 +267,38 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_change_pte(struct mm_struct *mm,
-					   unsigned long address, pte_t pte)
+					   unsigned long address,
+					   pte_t pte,
+					   enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_change_pte(mm, address, pte);
+		__mmu_notifier_change_pte(mm, address, pte, event);
 }
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address)
+						unsigned long address,
+						enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_page(mm, address);
+		__mmu_notifier_invalidate_page(mm, address, event);
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						       unsigned long start,
+						       unsigned long end,
+						       enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_start(mm, start, end);
+		__mmu_notifier_invalidate_range_start(mm, start, end, event);
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						     unsigned long start,
+						     unsigned long end,
+						     enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_end(mm, start, end);
+		__mmu_notifier_invalidate_range_end(mm, start, end, event);
 }
 
 static inline void mmu_notifier_mm_init(struct mm_struct *mm)
@@ -278,13 +344,13 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
  * old page would remain mapped readonly in the secondary MMUs after the new
  * page is already writable by some CPU through the primary MMU.
  */
-#define set_pte_at_notify(__mm, __address, __ptep, __pte)		\
+#define set_pte_at_notify(__mm, __address, __ptep, __pte, __event)	\
 ({									\
 	struct mm_struct *___mm = __mm;					\
 	unsigned long ___address = __address;				\
 	pte_t ___pte = __pte;						\
 									\
-	mmu_notifier_change_pte(___mm, ___address, ___pte);		\
+	mmu_notifier_change_pte(___mm, ___address, ___pte, __event);	\
 	set_pte_at(___mm, ___address, __ptep, ___pte);			\
 })
 
@@ -307,22 +373,29 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_change_pte(struct mm_struct *mm,
-					   unsigned long address, pte_t pte)
+					   unsigned long address,
+					   pte_t pte,
+					   enum mmu_event event)
 {
 }
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address)
+						unsigned long address,
+						enum mmu_event event)
 {
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						       unsigned long start,
+						       unsigned long end,
+						       enum mmu_event event)
 {
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+						     unsigned long start,
+						     unsigned long end,
+						     enum mmu_event event)
 {
 }
 
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 32b04dc..296f81e 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -177,7 +177,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	/* For try_to_free_swap() and munlock_vma_page() below */
 	lock_page(page);
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	err = -EAGAIN;
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -195,7 +196,9 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	flush_cache_page(vma, addr, pte_pfn(*ptep));
 	ptep_clear_flush(vma, addr, ptep);
-	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+	set_pte_at_notify(mm, addr, ptep,
+			  mk_pte(kpage, vma->vm_page_prot),
+			  MMU_MIGRATE);
 
 	page_remove_rmap(page);
 	if (!page_mapped(page))
@@ -209,7 +212,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	err = 0;
  unlock:
 	mem_cgroup_cancel_charge(kpage, memcg);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 	unlock_page(page);
 	return err;
 }
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index d8d9fe3..a2b3f09 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -198,7 +198,7 @@ retry:
 			BUG_ON(pte_dirty(pteval));
 			pte_unmap_unlock(pte, ptl);
 			/* must invalidate_page _before_ freeing the page */
-			mmu_notifier_invalidate_page(mm, address);
+			mmu_notifier_invalidate_page(mm, address, MMU_MIGRATE);
 			page_cache_release(page);
 		}
 	}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5d562a9..fa30857 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1020,6 +1020,11 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 		set_page_private(pages[i], (unsigned long)memcg);
 	}
 
+	mmun_start = haddr;
+	mmun_end   = haddr + HPAGE_PMD_SIZE;
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
+
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		copy_user_highpage(pages[i], page + i,
 				   haddr + PAGE_SIZE * i, vma);
@@ -1027,10 +1032,6 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 		cond_resched();
 	}
 
-	mmun_start = haddr;
-	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
-
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, orig_pmd)))
 		goto out_free_pages;
@@ -1063,7 +1064,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	page_remove_rmap(page);
 	spin_unlock(ptl);
 
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	ret |= VM_FAULT_WRITE;
 	put_page(page);
@@ -1073,7 +1075,8 @@ out:
 
 out_free_pages:
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		memcg = (void *)page_private(pages[i]);
 		set_page_private(pages[i], 0);
@@ -1157,16 +1160,17 @@ alloc:
 
 	count_vm_event(THP_FAULT_ALLOC);
 
+	mmun_start = haddr;
+	mmun_end   = haddr + HPAGE_PMD_SIZE;
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
+
 	if (!page)
 		clear_huge_page(new_page, haddr, HPAGE_PMD_NR);
 	else
 		copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
 	__SetPageUptodate(new_page);
 
-	mmun_start = haddr;
-	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
-
 	spin_lock(ptl);
 	if (page)
 		put_user_huge_page(page);
@@ -1197,7 +1201,8 @@ alloc:
 	}
 	spin_unlock(ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 out:
 	return ret;
 out_unlock:
@@ -1632,7 +1637,8 @@ static int __split_huge_page_splitting(struct page *page,
 	const unsigned long mmun_start = address;
 	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_STATUS);
 	pmd = page_check_address_pmd(page, mm, address,
 			PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
 	if (pmd) {
@@ -1647,7 +1653,8 @@ static int __split_huge_page_splitting(struct page *page,
 		ret = 1;
 		spin_unlock(ptl);
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_STATUS);
 
 	return ret;
 }
@@ -2446,7 +2453,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 
 	mmun_start = address;
 	mmun_end   = address + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
 	/*
 	 * After this gup_fast can't run anymore. This also removes
@@ -2456,7 +2464,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 */
 	_pmd = pmdp_clear_flush(vma, address, pmd);
 	spin_unlock(pmd_ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	spin_lock(pte_ptl);
 	isolated = __collapse_huge_page_isolate(vma, address, pte);
@@ -2845,24 +2854,28 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
 again:
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_trans_huge(*pmd))) {
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(mm, mmun_start,
+						  mmun_end, MMU_MIGRATE);
 		return;
 	}
 	if (is_huge_zero_pmd(*pmd)) {
 		__split_huge_zero_page_pmd(vma, haddr, pmd);
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(mm, mmun_start,
+						  mmun_end, MMU_MIGRATE);
 		return;
 	}
 	page = pmd_page(*pmd);
 	VM_BUG_ON_PAGE(!page_count(page), page);
 	get_page(page);
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	split_huge_page(page);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7faab71..73e1576 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2565,7 +2565,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	mmun_start = vma->vm_start;
 	mmun_end = vma->vm_end;
 	if (cow)
-		mmu_notifier_invalidate_range_start(src, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_start(src, mmun_start,
+						    mmun_end, MMU_MIGRATE);
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
 		spinlock_t *src_ptl, *dst_ptl;
@@ -2615,7 +2616,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	}
 
 	if (cow)
-		mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(src, mmun_start,
+						  mmun_end, MMU_MIGRATE);
 
 	return ret;
 }
@@ -2641,7 +2643,8 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	BUG_ON(end & ~huge_page_mask(h));
 
 	tlb_start_vma(tlb, vma);
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 again:
 	for (address = start; address < end; address += sz) {
 		ptep = huge_pte_offset(mm, address);
@@ -2712,7 +2715,8 @@ unlock:
 		if (address < end && !ref_page)
 			goto again;
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 	tlb_end_vma(tlb, vma);
 }
 
@@ -2899,7 +2903,8 @@ retry_avoidcopy:
 
 	mmun_start = address & huge_page_mask(h);
 	mmun_end = mmun_start + huge_page_size(h);
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	/*
 	 * Retake the page table lock to check for racing updates
 	 * before the page tables are altered
@@ -2919,7 +2924,8 @@ retry_avoidcopy:
 		new_page = old_page;
 	}
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 	page_cache_release(new_page);
 	page_cache_release(old_page);
 
@@ -3344,7 +3350,8 @@ same_page:
 }
 
 unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
-		unsigned long address, unsigned long end, pgprot_t newprot)
+		unsigned long address, unsigned long end, pgprot_t newprot,
+		enum mmu_event event)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long start = address;
@@ -3356,7 +3363,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	BUG_ON(address >= end);
 	flush_cache_range(vma, address, end);
 
-	mmu_notifier_invalidate_range_start(mm, start, end);
+	mmu_notifier_invalidate_range_start(mm, start, end, event);
 	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
 	for (; address < end; address += huge_page_size(h)) {
 		spinlock_t *ptl;
@@ -3386,7 +3393,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	 */
 	flush_tlb_range(vma, start, end);
 	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
-	mmu_notifier_invalidate_range_end(mm, start, end);
+	mmu_notifier_invalidate_range_end(mm, start, end, event);
 
 	return pages << h->order;
 }
diff --git a/mm/ksm.c b/mm/ksm.c
index cb1e976..4b659f1 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -873,7 +873,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 
 	mmun_start = addr;
 	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MPROT_RONLY);
 
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -905,7 +906,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 		if (pte_dirty(entry))
 			set_page_dirty(page);
 		entry = pte_mkclean(pte_wrprotect(entry));
-		set_pte_at_notify(mm, addr, ptep, entry);
+		set_pte_at_notify(mm, addr, ptep, entry, MMU_MPROT_RONLY);
 	}
 	*orig_pte = *ptep;
 	err = 0;
@@ -913,7 +914,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 out_unlock:
 	pte_unmap_unlock(ptep, ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MPROT_RONLY);
 out:
 	return err;
 }
@@ -949,7 +951,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 
 	mmun_start = addr;
 	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 
 	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	if (!pte_same(*ptep, orig_pte)) {
@@ -962,7 +965,9 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 
 	flush_cache_page(vma, addr, pte_pfn(*ptep));
 	ptep_clear_flush(vma, addr, ptep);
-	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+	set_pte_at_notify(mm, addr, ptep,
+			  mk_pte(kpage, vma->vm_page_prot),
+			  MMU_MIGRATE);
 
 	page_remove_rmap(page);
 	if (!page_mapped(page))
@@ -972,7 +977,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	pte_unmap_unlock(ptep, ptl);
 	err = 0;
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 out:
 	return err;
 }
diff --git a/mm/memory.c b/mm/memory.c
index 09e2cd0..d3908f0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1050,7 +1050,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	mmun_end   = end;
 	if (is_cow)
 		mmu_notifier_invalidate_range_start(src_mm, mmun_start,
-						    mmun_end);
+						    mmun_end, MMU_MIGRATE);
 
 	ret = 0;
 	dst_pgd = pgd_offset(dst_mm, addr);
@@ -1067,7 +1067,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
 
 	if (is_cow)
-		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end,
+						  MMU_MIGRATE);
 	return ret;
 }
 
@@ -1371,10 +1372,12 @@ void unmap_vmas(struct mmu_gather *tlb,
 {
 	struct mm_struct *mm = vma->vm_mm;
 
-	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
+	mmu_notifier_invalidate_range_start(mm, start_addr,
+					    end_addr, MMU_MUNMAP);
 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
 		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
-	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
+	mmu_notifier_invalidate_range_end(mm, start_addr,
+					  end_addr, MMU_MUNMAP);
 }
 
 /**
@@ -1396,10 +1399,10 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, start, end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, start, end);
+	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MUNMAP);
 	for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
 		unmap_single_vma(&tlb, vma, start, end, details);
-	mmu_notifier_invalidate_range_end(mm, start, end);
+	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MUNMAP);
 	tlb_finish_mmu(&tlb, start, end);
 }
 
@@ -1422,9 +1425,9 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, address, end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, address, end);
+	mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
 	unmap_single_vma(&tlb, vma, address, end, details);
-	mmu_notifier_invalidate_range_end(mm, address, end);
+	mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
 	tlb_finish_mmu(&tlb, address, end);
 }
 
@@ -2208,7 +2211,8 @@ gotten:
 
 	mmun_start  = address & PAGE_MASK;
 	mmun_end    = mmun_start + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 
 	/*
 	 * Re-check the pte - we dropped the lock
@@ -2240,7 +2244,7 @@ gotten:
 		 * mmu page tables (such as kvm shadow page tables), we want the
 		 * new page to be mapped directly into the secondary page table.
 		 */
-		set_pte_at_notify(mm, address, page_table, entry);
+		set_pte_at_notify(mm, address, page_table, entry, MMU_MIGRATE);
 		update_mmu_cache(vma, address, page_table);
 		if (old_page) {
 			/*
@@ -2279,7 +2283,8 @@ gotten:
 unlock:
 	pte_unmap_unlock(page_table, ptl);
 	if (mmun_end > mmun_start)
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(mm, mmun_start,
+						  mmun_end, MMU_MIGRATE);
 	if (old_page) {
 		/*
 		 * Don't let another task, with possibly unlocked vma,
diff --git a/mm/migrate.c b/mm/migrate.c
index ab43fbf..b526c72 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1820,12 +1820,14 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	WARN_ON(PageLRU(new_page));
 
 	/* Recheck the target PMD */
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
 fail_putback:
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		mmu_notifier_invalidate_range_end(mm, mmun_start,
+						  mmun_end, MMU_MIGRATE);
 
 		/* Reverse changes made by migrate_page_copy() */
 		if (TestClearPageActive(new_page))
@@ -1878,7 +1880,8 @@ fail_putback:
 	page_remove_rmap(page);
 
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	/* Take an "isolate" reference and put new page on the LRU. */
 	get_page(new_page);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 41cefdf..9decb88 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -122,8 +122,10 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
 	return young;
 }
 
-void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
-			       pte_t pte)
+void __mmu_notifier_change_pte(struct mm_struct *mm,
+			       unsigned long address,
+			       pte_t pte,
+			       enum mmu_event event)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -131,13 +133,14 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->change_pte)
-			mn->ops->change_pte(mn, mm, address, pte);
+			mn->ops->change_pte(mn, mm, address, pte, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 
 void __mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address)
+				    unsigned long address,
+				    enum mmu_event event)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -145,13 +148,16 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_page)
-			mn->ops->invalidate_page(mn, mm, address);
+			mn->ops->invalidate_page(mn, mm, address, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 
 void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+					   unsigned long start,
+					   unsigned long end,
+					   enum mmu_event event)
+
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -159,14 +165,17 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_start)
-			mn->ops->invalidate_range_start(mn, mm, start, end);
+			mn->ops->invalidate_range_start(mn, mm, start,
+							end, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
 
 void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+					 unsigned long start,
+					 unsigned long end,
+					 enum mmu_event event)
 {
 	struct mmu_notifier *mn;
 	int id;
@@ -174,7 +183,8 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_end)
-			mn->ops->invalidate_range_end(mn, mm, start, end);
+			mn->ops->invalidate_range_end(mn, mm, start,
+						      end, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
diff --git a/mm/mprotect.c b/mm/mprotect.c
index c43d557..6ce6c23 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -137,7 +137,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		pud_t *pud, unsigned long addr, unsigned long end,
-		pgprot_t newprot, int dirty_accountable, int prot_numa)
+		pgprot_t newprot, int dirty_accountable, int prot_numa,
+		enum mmu_event event)
 {
 	pmd_t *pmd;
 	struct mm_struct *mm = vma->vm_mm;
@@ -157,7 +158,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		/* invoke the mmu notifier if the pmd is populated */
 		if (!mni_start) {
 			mni_start = addr;
-			mmu_notifier_invalidate_range_start(mm, mni_start, end);
+			mmu_notifier_invalidate_range_start(mm, mni_start,
+							    end, event);
 		}
 
 		if (pmd_trans_huge(*pmd)) {
@@ -185,7 +187,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	} while (pmd++, addr = next, addr != end);
 
 	if (mni_start)
-		mmu_notifier_invalidate_range_end(mm, mni_start, end);
+		mmu_notifier_invalidate_range_end(mm, mni_start, end, event);
 
 	if (nr_huge_updates)
 		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
@@ -194,7 +196,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 
 static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 		pgd_t *pgd, unsigned long addr, unsigned long end,
-		pgprot_t newprot, int dirty_accountable, int prot_numa)
+		pgprot_t newprot, int dirty_accountable, int prot_numa,
+		enum mmu_event event)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -206,7 +209,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		pages += change_pmd_range(vma, pud, addr, next, newprot,
-				 dirty_accountable, prot_numa);
+				 dirty_accountable, prot_numa, event);
 	} while (pud++, addr = next, addr != end);
 
 	return pages;
@@ -214,7 +217,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 
 static unsigned long change_protection_range(struct vm_area_struct *vma,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
-		int dirty_accountable, int prot_numa)
+		int dirty_accountable, int prot_numa, enum mmu_event event)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
@@ -231,7 +234,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		pages += change_pud_range(vma, pgd, addr, next, newprot,
-				 dirty_accountable, prot_numa);
+				 dirty_accountable, prot_numa, event);
 	} while (pgd++, addr = next, addr != end);
 
 	/* Only flush the TLB if we actually modified any entries: */
@@ -247,11 +250,23 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 		       int dirty_accountable, int prot_numa)
 {
 	unsigned long pages;
+	enum mmu_event event = MMU_MPROT_NONE;
+
+	/* At this point vm_flags is updated. */
+	if ((vma->vm_flags & VM_READ) && (vma->vm_flags & VM_WRITE))
+		event = MMU_MPROT_RANDW;
+	else if (vma->vm_flags & VM_WRITE)
+		event = MMU_MPROT_WONLY;
+	else if (vma->vm_flags & VM_READ)
+		event = MMU_MPROT_RONLY;
 
 	if (is_vm_hugetlb_page(vma))
-		pages = hugetlb_change_protection(vma, start, end, newprot);
+		pages = hugetlb_change_protection(vma, start, end,
+						  newprot, event);
 	else
-		pages = change_protection_range(vma, start, end, newprot, dirty_accountable, prot_numa);
+		pages = change_protection_range(vma, start, end, newprot,
+						dirty_accountable,
+						prot_numa, event);
 
 	return pages;
 }
diff --git a/mm/mremap.c b/mm/mremap.c
index 05f1180..6827d2f 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -177,7 +177,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 
 	mmun_start = old_addr;
 	mmun_end   = old_end;
-	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start,
+					    mmun_end, MMU_MIGRATE);
 
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
 		cond_resched();
@@ -228,7 +229,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	if (likely(need_flush))
 		flush_tlb_range(vma, old_end-len, old_addr);
 
-	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start,
+					  mmun_end, MMU_MIGRATE);
 
 	return len + old_addr - old_end;	/* how much done */
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index 7928ddd..bd7e6d7 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -840,7 +840,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 	pte_unmap_unlock(pte, ptl);
 
 	if (ret) {
-		mmu_notifier_invalidate_page(mm, address);
+		mmu_notifier_invalidate_page(mm, address, MMU_WB);
 		(*cleaned)++;
 	}
 out:
@@ -1128,6 +1128,10 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	int ret = SWAP_AGAIN;
 	enum ttu_flags flags = (enum ttu_flags)arg;
+	enum mmu_event event = MMU_MIGRATE;
+
+	if (flags & TTU_MUNLOCK)
+		event = MMU_STATUS;
 
 	pte = page_check_address(page, mm, address, &ptl, 0);
 	if (!pte)
@@ -1233,7 +1237,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret != SWAP_FAIL && !(flags & TTU_MUNLOCK))
-		mmu_notifier_invalidate_page(mm, address);
+		mmu_notifier_invalidate_page(mm, address, event);
 out:
 	return ret;
 
@@ -1287,7 +1291,9 @@ out_mlock:
 #define CLUSTER_MASK	(~(CLUSTER_SIZE - 1))
 
 static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
-		struct vm_area_struct *vma, struct page *check_page)
+				struct vm_area_struct *vma,
+				struct page *check_page,
+				enum ttu_flags flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pmd_t *pmd;
@@ -1301,6 +1307,10 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 	unsigned long end;
 	int ret = SWAP_AGAIN;
 	int locked_vma = 0;
+	enum mmu_event event = MMU_MIGRATE;
+
+	if (flags & TTU_MUNLOCK)
+		event = MMU_STATUS;
 
 	address = (vma->vm_start + cursor) & CLUSTER_MASK;
 	end = address + CLUSTER_SIZE;
@@ -1315,7 +1325,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 
 	mmun_start = address;
 	mmun_end   = end;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, event);
 
 	/*
 	 * If we can acquire the mmap_sem for read, and vma is VM_LOCKED,
@@ -1380,7 +1390,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 		(*mapcount)--;
 	}
 	pte_unmap_unlock(pte - 1, ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, event);
 	if (locked_vma)
 		up_read(&vma->vm_mm->mmap_sem);
 	return ret;
@@ -1436,7 +1446,9 @@ static int try_to_unmap_nonlinear(struct page *page,
 			while (cursor < max_nl_cursor &&
 				cursor < vma->vm_end - vma->vm_start) {
 				if (try_to_unmap_cluster(cursor, &mapcount,
-						vma, page) == SWAP_MLOCK)
+							 vma, page,
+							 (enum ttu_flags)arg)
+							 == SWAP_MLOCK)
 					ret = SWAP_MLOCK;
 				cursor += CLUSTER_SIZE;
 				vma->vm_private_data = (void *) cursor;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4b6c01b..6e1992f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -262,7 +262,8 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 
 static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
-					     unsigned long address)
+					     unsigned long address,
+					     enum mmu_event event)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush, idx;
@@ -301,7 +302,8 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 					struct mm_struct *mm,
 					unsigned long address,
-					pte_t pte)
+					pte_t pte,
+					enum mmu_event event)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int idx;
@@ -317,7 +319,8 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
 						    unsigned long start,
-						    unsigned long end)
+						    unsigned long end,
+						    enum mmu_event event)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush = 0, idx;
@@ -343,7 +346,8 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 						  struct mm_struct *mm,
 						  unsigned long start,
-						  unsigned long end)
+						  unsigned long end,
+						  enum mmu_event event)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 4/6] mmu_notifier: pass through vma to invalidate_range and invalidate_page
  2014-06-28  2:00 ` Jérôme Glisse
@ 2014-06-28  2:00   ` Jérôme Glisse
  -1 siblings, 0 replies; 76+ messages in thread
From: Jérôme Glisse @ 2014-06-28  2:00 UTC (permalink / raw)
  To: akpm, linux-mm, linux-kernel
  Cc: mgorman, hpa, peterz, aarcange, riel, jweiner, torvalds,
	Mark Hairgrove, Jatin Kumar, Subhash Gutti, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, John Hubbard,
	Sherry Cheung, Duncan Poole, Oded Gabbay, Alexander Deucher,
	Andrew Lewycky, Jérôme Glisse

From: Jérôme Glisse <jglisse@redhat.com>

New users of the mmu_notifier interface need to look up the vma in order
to perform the invalidation operation. Instead of redoing a vma lookup
inside the callback, just pass the vma through from the call site, where
it is already available.

This needs a small refactoring in memory.c so that invalidate_range is
called on vma boundaries; the overhead should be low enough.
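
As an illustration only (not part of this patch), a callback can now filter
and clamp on the vma it is handed instead of redoing the lookup under
mmap_sem; the callback and the example_dev_flush() helper below are
hypothetical:

	static void example_invalidate_range_start(struct mmu_notifier *mn,
						   struct mm_struct *mm,
						   struct vm_area_struct *vma,
						   unsigned long start,
						   unsigned long end,
						   enum mmu_event event)
	{
		/* Hypothetical sketch: this device only mirrors file backed
		 * mappings, so anonymous vmas can be skipped outright.
		 */
		if (!vma->vm_file)
			return;

		/* The core now invalidates per vma; clamp defensively. */
		start = max(start, vma->vm_start);
		end = min(end, vma->vm_end);

		example_dev_flush(mn, start, end, event);
	}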

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 drivers/gpu/drm/i915/i915_gem_userptr.c |  1 +
 drivers/iommu/amd_iommu_v2.c            |  3 +++
 drivers/misc/sgi-gru/grutlbpurge.c      |  6 ++++-
 drivers/xen/gntdev.c                    |  4 +++-
 fs/proc/task_mmu.c                      | 16 ++++++++-----
 include/linux/mmu_notifier.h            | 19 ++++++++++++---
 kernel/events/uprobes.c                 |  4 ++--
 mm/filemap_xip.c                        |  3 ++-
 mm/huge_memory.c                        | 26 ++++++++++----------
 mm/hugetlb.c                            | 16 ++++++-------
 mm/ksm.c                                |  8 +++----
 mm/memory.c                             | 42 +++++++++++++++++++++------------
 mm/migrate.c                            |  6 ++---
 mm/mmu_notifier.c                       |  9 ++++---
 mm/mprotect.c                           |  5 ++--
 mm/mremap.c                             |  4 ++--
 mm/rmap.c                               |  9 +++----
 virt/kvm/kvm_main.c                     |  3 +++
 18 files changed, 116 insertions(+), 68 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index ed6f35e..191ac71 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -55,6 +55,7 @@ struct i915_mmu_object {
 
 static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 						       struct mm_struct *mm,
+						       struct vm_area_struct *vma,
 						       unsigned long start,
 						       unsigned long end,
 						       enum mmu_event event)
diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 2bb9771..9f9e706 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -422,6 +422,7 @@ static void mn_change_pte(struct mmu_notifier *mn,
 
 static void mn_invalidate_page(struct mmu_notifier *mn,
 			       struct mm_struct *mm,
+			       struct vm_area_struct *vma,
 			       unsigned long address,
 			       enum mmu_event event)
 {
@@ -430,6 +431,7 @@ static void mn_invalidate_page(struct mmu_notifier *mn,
 
 static void mn_invalidate_range_start(struct mmu_notifier *mn,
 				      struct mm_struct *mm,
+				      struct vm_area_struct *vma,
 				      unsigned long start,
 				      unsigned long end,
 				      enum mmu_event event)
@@ -453,6 +455,7 @@ static void mn_invalidate_range_start(struct mmu_notifier *mn,
 
 static void mn_invalidate_range_end(struct mmu_notifier *mn,
 				    struct mm_struct *mm,
+				    struct vm_area_struct *vma,
 				    unsigned long start,
 				    unsigned long end,
 				    enum mmu_event event)
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index e67fed1..d02e4c7 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -221,6 +221,7 @@ void gru_flush_all_tlb(struct gru_state *gru)
  */
 static void gru_invalidate_range_start(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
+				       struct vm_area_struct *vma,
 				       unsigned long start, unsigned long end,
 				       enum mmu_event event)
 {
@@ -235,7 +236,9 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
 }
 
 static void gru_invalidate_range_end(struct mmu_notifier *mn,
-				     struct mm_struct *mm, unsigned long start,
+				     struct mm_struct *mm,
+				     struct vm_area_struct *vma,
+				     unsigned long start,
 				     unsigned long end,
 				     enum mmu_event event)
 {
@@ -250,6 +253,7 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
 }
 
 static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
+				struct vm_area_struct *vma,
 				unsigned long address,
 				enum mmu_event event)
 {
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index fe9da94..219928b 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -428,6 +428,7 @@ static void unmap_if_in_range(struct grant_map *map,
 
 static void mn_invl_range_start(struct mmu_notifier *mn,
 				struct mm_struct *mm,
+				struct vm_area_struct *vma,
 				unsigned long start,
 				unsigned long end,
 				enum mmu_event event)
@@ -447,10 +448,11 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
 
 static void mn_invl_page(struct mmu_notifier *mn,
 			 struct mm_struct *mm,
+			 struct vm_area_struct *vma,
 			 unsigned long address,
 			 enum mmu_event event)
 {
-	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, event);
+	mn_invl_range_start(mn, mm, vma, address, address + PAGE_SIZE, event);
 }
 
 static void mn_release(struct mmu_notifier *mn,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e9e79f7..8b0f25d 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -829,13 +829,15 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 			.private = &cp,
 		};
 		down_read(&mm->mmap_sem);
-		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_start(mm, 0,
-							    -1, MMU_STATUS);
 		for (vma = mm->mmap; vma; vma = vma->vm_next) {
 			cp.vma = vma;
 			if (is_vm_hugetlb_page(vma))
 				continue;
+			if (type == CLEAR_REFS_SOFT_DIRTY)
+				mmu_notifier_invalidate_range_start(mm, vma,
+								    vma->vm_start,
+								    vma->vm_end,
+								    MMU_STATUS);
 			/*
 			 * Writing 1 to /proc/pid/clear_refs affects all pages.
 			 *
@@ -857,10 +859,12 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 			}
 			walk_page_range(vma->vm_start, vma->vm_end,
 					&clear_refs_walk);
+			if (type == CLEAR_REFS_SOFT_DIRTY)
+				mmu_notifier_invalidate_range_end(mm, vma,
+								  vma->vm_start,
+								  vma->vm_end,
+								  MMU_STATUS);
 		}
-		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_end(mm, 0,
-							  -1, MMU_STATUS);
 		flush_tlb_mm(mm);
 		up_read(&mm->mmap_sem);
 		mmput(mm);
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 82e9577..8907e5d 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -137,6 +137,7 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_page)(struct mmu_notifier *mn,
 				struct mm_struct *mm,
+				struct vm_area_struct *vma,
 				unsigned long address,
 				enum mmu_event event);
 
@@ -185,11 +186,13 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_range_start)(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
+				       struct vm_area_struct *vma,
 				       unsigned long start,
 				       unsigned long end,
 				       enum mmu_event event);
 	void (*invalidate_range_end)(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
+				     struct vm_area_struct *vma,
 				     unsigned long start,
 				     unsigned long end,
 				     enum mmu_event event);
@@ -233,13 +236,16 @@ extern void __mmu_notifier_change_pte(struct mm_struct *mm,
 				      pte_t pte,
 				      enum mmu_event event);
 extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
+					   struct vm_area_struct *vma,
 					  unsigned long address,
 					  enum mmu_event event);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+						  struct vm_area_struct *vma,
 						  unsigned long start,
 						  unsigned long end,
 						  enum mmu_event event);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+						struct vm_area_struct *vma,
 						unsigned long start,
 						unsigned long end,
 						enum mmu_event event);
@@ -276,29 +282,33 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
+						struct vm_area_struct *vma,
 						unsigned long address,
 						enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_page(mm, address, event);
+		__mmu_notifier_invalidate_page(mm, vma, address, event);
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+						       struct vm_area_struct *vma,
 						       unsigned long start,
 						       unsigned long end,
 						       enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_start(mm, start, end, event);
+		__mmu_notifier_invalidate_range_start(mm, vma, start,
+						      end, event);
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+						     struct vm_area_struct *vma,
 						     unsigned long start,
 						     unsigned long end,
 						     enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_end(mm, start, end, event);
+		__mmu_notifier_invalidate_range_end(mm, vma, start, end, event);
 }
 
 static inline void mmu_notifier_mm_init(struct mm_struct *mm)
@@ -380,12 +390,14 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
+						struct vm_area_struct *vma,
 						unsigned long address,
 						enum mmu_event event)
 {
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+						       struct vm_area_struct *vma,
 						       unsigned long start,
 						       unsigned long end,
 						       enum mmu_event event)
@@ -393,6 +405,7 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+						     struct vm_area_struct *vma,
 						     unsigned long start,
 						     unsigned long end,
 						     enum mmu_event event)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 296f81e..0f552bc 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -177,7 +177,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	/* For try_to_free_swap() and munlock_vma_page() below */
 	lock_page(page);
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
 					    mmun_end, MMU_MIGRATE);
 	err = -EAGAIN;
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
@@ -212,7 +212,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	err = 0;
  unlock:
 	mem_cgroup_cancel_charge(kpage, memcg);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
 					  mmun_end, MMU_MIGRATE);
 	unlock_page(page);
 	return err;
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index a2b3f09..f0113df 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -198,7 +198,8 @@ retry:
 			BUG_ON(pte_dirty(pteval));
 			pte_unmap_unlock(pte, ptl);
 			/* must invalidate_page _before_ freeing the page */
-			mmu_notifier_invalidate_page(mm, address, MMU_MIGRATE);
+			mmu_notifier_invalidate_page(mm, vma, address,
+						     MMU_MIGRATE);
 			page_cache_release(page);
 		}
 	}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fa30857..cc74b60 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1022,7 +1022,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
 					    mmun_end, MMU_MIGRATE);
 
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
@@ -1064,7 +1064,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	page_remove_rmap(page);
 	spin_unlock(ptl);
 
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
 					  mmun_end, MMU_MIGRATE);
 
 	ret |= VM_FAULT_WRITE;
@@ -1075,7 +1075,7 @@ out:
 
 out_free_pages:
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
 					  mmun_end, MMU_MIGRATE);
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		memcg = (void *)page_private(pages[i]);
@@ -1162,7 +1162,7 @@ alloc:
 
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
 					    mmun_end, MMU_MIGRATE);
 
 	if (!page)
@@ -1201,7 +1201,7 @@ alloc:
 	}
 	spin_unlock(ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
 					  mmun_end, MMU_MIGRATE);
 out:
 	return ret;
@@ -1637,7 +1637,7 @@ static int __split_huge_page_splitting(struct page *page,
 	const unsigned long mmun_start = address;
 	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
 					    mmun_end, MMU_STATUS);
 	pmd = page_check_address_pmd(page, mm, address,
 			PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
@@ -1653,7 +1653,7 @@ static int __split_huge_page_splitting(struct page *page,
 		ret = 1;
 		spin_unlock(ptl);
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
 					  mmun_end, MMU_STATUS);
 
 	return ret;
@@ -2453,7 +2453,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 
 	mmun_start = address;
 	mmun_end   = address + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
 					    mmun_end, MMU_MIGRATE);
 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
 	/*
@@ -2464,7 +2464,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 */
 	_pmd = pmdp_clear_flush(vma, address, pmd);
 	spin_unlock(pmd_ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
 					  mmun_end, MMU_MIGRATE);
 
 	spin_lock(pte_ptl);
@@ -2854,19 +2854,19 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
 again:
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
 					    mmun_end, MMU_MIGRATE);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_trans_huge(*pmd))) {
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start,
+		mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
 						  mmun_end, MMU_MIGRATE);
 		return;
 	}
 	if (is_huge_zero_pmd(*pmd)) {
 		__split_huge_zero_page_pmd(vma, haddr, pmd);
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start,
+		mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
 						  mmun_end, MMU_MIGRATE);
 		return;
 	}
@@ -2874,7 +2874,7 @@ again:
 	VM_BUG_ON_PAGE(!page_count(page), page);
 	get_page(page);
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
 					  mmun_end, MMU_MIGRATE);
 
 	split_huge_page(page);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 73e1576..15f0123 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2565,7 +2565,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	mmun_start = vma->vm_start;
 	mmun_end = vma->vm_end;
 	if (cow)
-		mmu_notifier_invalidate_range_start(src, mmun_start,
+		mmu_notifier_invalidate_range_start(src, vma, mmun_start,
 						    mmun_end, MMU_MIGRATE);
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
@@ -2616,7 +2616,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	}
 
 	if (cow)
-		mmu_notifier_invalidate_range_end(src, mmun_start,
+		mmu_notifier_invalidate_range_end(src, vma, mmun_start,
 						  mmun_end, MMU_MIGRATE);
 
 	return ret;
@@ -2643,7 +2643,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	BUG_ON(end & ~huge_page_mask(h));
 
 	tlb_start_vma(tlb, vma);
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
 					    mmun_end, MMU_MIGRATE);
 again:
 	for (address = start; address < end; address += sz) {
@@ -2715,7 +2715,7 @@ unlock:
 		if (address < end && !ref_page)
 			goto again;
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
 					  mmun_end, MMU_MIGRATE);
 	tlb_end_vma(tlb, vma);
 }
@@ -2903,7 +2903,7 @@ retry_avoidcopy:
 
 	mmun_start = address & huge_page_mask(h);
 	mmun_end = mmun_start + huge_page_size(h);
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
 					    mmun_end, MMU_MIGRATE);
 	/*
 	 * Retake the page table lock to check for racing updates
@@ -2924,7 +2924,7 @@ retry_avoidcopy:
 		new_page = old_page;
 	}
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
 					  mmun_end, MMU_MIGRATE);
 	page_cache_release(new_page);
 	page_cache_release(old_page);
@@ -3363,7 +3363,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	BUG_ON(address >= end);
 	flush_cache_range(vma, address, end);
 
-	mmu_notifier_invalidate_range_start(mm, start, end, event);
+	mmu_notifier_invalidate_range_start(mm, vma, start, end, event);
 	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
 	for (; address < end; address += huge_page_size(h)) {
 		spinlock_t *ptl;
@@ -3393,7 +3393,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	 */
 	flush_tlb_range(vma, start, end);
 	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
-	mmu_notifier_invalidate_range_end(mm, start, end, event);
+	mmu_notifier_invalidate_range_end(mm, vma, start, end, event);
 
 	return pages << h->order;
 }
diff --git a/mm/ksm.c b/mm/ksm.c
index 4b659f1..1f3c4d7 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -873,7 +873,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 
 	mmun_start = addr;
 	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
 					    mmun_end, MMU_MPROT_RONLY);
 
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
@@ -914,7 +914,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 out_unlock:
 	pte_unmap_unlock(ptep, ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
 					  mmun_end, MMU_MPROT_RONLY);
 out:
 	return err;
@@ -951,7 +951,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 
 	mmun_start = addr;
 	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
 					    mmun_end, MMU_MIGRATE);
 
 	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
@@ -977,7 +977,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	pte_unmap_unlock(ptep, ptl);
 	err = 0;
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
 					  mmun_end, MMU_MIGRATE);
 out:
 	return err;
diff --git a/mm/memory.c b/mm/memory.c
index d3908f0..4717579 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1049,7 +1049,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	mmun_start = addr;
 	mmun_end   = end;
 	if (is_cow)
-		mmu_notifier_invalidate_range_start(src_mm, mmun_start,
+		mmu_notifier_invalidate_range_start(src_mm, vma, mmun_start,
 						    mmun_end, MMU_MIGRATE);
 
 	ret = 0;
@@ -1067,8 +1067,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
 
 	if (is_cow)
-		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end,
-						  MMU_MIGRATE);
+		mmu_notifier_invalidate_range_end(src_mm, vma, mmun_start,
+						  mmun_end, MMU_MIGRATE);
 	return ret;
 }
 
@@ -1372,12 +1372,17 @@ void unmap_vmas(struct mmu_gather *tlb,
 {
 	struct mm_struct *mm = vma->vm_mm;
 
-	mmu_notifier_invalidate_range_start(mm, start_addr,
-					    end_addr, MMU_MUNMAP);
-	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
+	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
+		mmu_notifier_invalidate_range_start(mm, vma,
+						    max(start_addr, vma->vm_start),
+						    min(end_addr, vma->vm_end),
+						    MMU_MUNMAP);
 		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
-	mmu_notifier_invalidate_range_end(mm, start_addr,
-					  end_addr, MMU_MUNMAP);
+		mmu_notifier_invalidate_range_end(mm, vma,
+						  max(start_addr, vma->vm_start),
+						  min(end_addr, vma->vm_end),
+						  MMU_MUNMAP);
+	}
 }
 
 /**
@@ -1399,10 +1404,17 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, start, end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MUNMAP);
-	for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
+	for ( ; vma && vma->vm_start < end; vma = vma->vm_next) {
+		mmu_notifier_invalidate_range_start(mm, vma,
+						    max(start, vma->vm_start),
+						    min(end, vma->vm_end),
+						    MMU_MUNMAP);
 		unmap_single_vma(&tlb, vma, start, end, details);
-	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MUNMAP);
+		mmu_notifier_invalidate_range_end(mm, vma,
+						  max(start, vma->vm_start),
+						  min(end, vma->vm_end),
+						  MMU_MUNMAP);
+	}
 	tlb_finish_mmu(&tlb, start, end);
 }
 
@@ -1425,9 +1437,9 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, address, end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
+	mmu_notifier_invalidate_range_start(mm, vma, address, end, MMU_MUNMAP);
 	unmap_single_vma(&tlb, vma, address, end, details);
-	mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
+	mmu_notifier_invalidate_range_end(mm, vma, address, end, MMU_MUNMAP);
 	tlb_finish_mmu(&tlb, address, end);
 }
 
@@ -2211,7 +2223,7 @@ gotten:
 
 	mmun_start  = address & PAGE_MASK;
 	mmun_end    = mmun_start + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
 					    mmun_end, MMU_MIGRATE);
 
 	/*
@@ -2283,7 +2295,7 @@ gotten:
 unlock:
 	pte_unmap_unlock(page_table, ptl);
 	if (mmun_end > mmun_start)
-		mmu_notifier_invalidate_range_end(mm, mmun_start,
+		mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
 						  mmun_end, MMU_MIGRATE);
 	if (old_page) {
 		/*
diff --git a/mm/migrate.c b/mm/migrate.c
index b526c72..0c61aa9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1820,13 +1820,13 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	WARN_ON(PageLRU(new_page));
 
 	/* Recheck the target PMD */
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
 					    mmun_end, MMU_MIGRATE);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
 fail_putback:
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start,
+		mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
 						  mmun_end, MMU_MIGRATE);
 
 		/* Reverse changes made by migrate_page_copy() */
@@ -1880,7 +1880,7 @@ fail_putback:
 	page_remove_rmap(page);
 
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
 					  mmun_end, MMU_MIGRATE);
 
 	/* Take an "isolate" reference and put new page on the LRU. */
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 9decb88..87e6bc5 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -139,6 +139,7 @@ void __mmu_notifier_change_pte(struct mm_struct *mm,
 }
 
 void __mmu_notifier_invalidate_page(struct mm_struct *mm,
+				    struct vm_area_struct *vma,
 				    unsigned long address,
 				    enum mmu_event event)
 {
@@ -148,12 +149,13 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_page)
-			mn->ops->invalidate_page(mn, mm, address, event);
+			mn->ops->invalidate_page(mn, mm, vma, address, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 
 void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+					   struct vm_area_struct *vma,
 					   unsigned long start,
 					   unsigned long end,
 					   enum mmu_event event)
@@ -165,7 +167,7 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_start)
-			mn->ops->invalidate_range_start(mn, mm, start,
+			mn->ops->invalidate_range_start(mn, mm, vma, start,
 							end, event);
 	}
 	srcu_read_unlock(&srcu, id);
@@ -173,6 +175,7 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
 
 void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+					 struct vm_area_struct *vma,
 					 unsigned long start,
 					 unsigned long end,
 					 enum mmu_event event)
@@ -183,7 +186,7 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_end)
-			mn->ops->invalidate_range_end(mn, mm, start,
+			mn->ops->invalidate_range_end(mn, mm, vma, start,
 						      end, event);
 	}
 	srcu_read_unlock(&srcu, id);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 6ce6c23..16ce504 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -158,7 +158,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		/* invoke the mmu notifier if the pmd is populated */
 		if (!mni_start) {
 			mni_start = addr;
-			mmu_notifier_invalidate_range_start(mm, mni_start,
+			mmu_notifier_invalidate_range_start(mm, vma, mni_start,
 							    end, event);
 		}
 
@@ -187,7 +187,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	} while (pmd++, addr = next, addr != end);
 
 	if (mni_start)
-		mmu_notifier_invalidate_range_end(mm, mni_start, end, event);
+		mmu_notifier_invalidate_range_end(mm, vma, mni_start,
+						  end, event);
 
 	if (nr_huge_updates)
 		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
diff --git a/mm/mremap.c b/mm/mremap.c
index 6827d2f..9bee6de 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -177,7 +177,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 
 	mmun_start = old_addr;
 	mmun_end   = old_end;
-	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start,
+	mmu_notifier_invalidate_range_start(vma->vm_mm, vma, mmun_start,
 					    mmun_end, MMU_MIGRATE);
 
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
@@ -229,7 +229,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	if (likely(need_flush))
 		flush_tlb_range(vma, old_end-len, old_addr);
 
-	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start,
+	mmu_notifier_invalidate_range_end(vma->vm_mm, vma, mmun_start,
 					  mmun_end, MMU_MIGRATE);
 
 	return len + old_addr - old_end;	/* how much done */
diff --git a/mm/rmap.c b/mm/rmap.c
index bd7e6d7..f1be50d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -840,7 +840,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 	pte_unmap_unlock(pte, ptl);
 
 	if (ret) {
-		mmu_notifier_invalidate_page(mm, address, MMU_WB);
+		mmu_notifier_invalidate_page(mm, vma, address, MMU_WB);
 		(*cleaned)++;
 	}
 out:
@@ -1237,7 +1237,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret != SWAP_FAIL && !(flags & TTU_MUNLOCK))
-		mmu_notifier_invalidate_page(mm, address, event);
+		mmu_notifier_invalidate_page(mm, vma, address, event);
 out:
 	return ret;
 
@@ -1325,7 +1325,8 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 
 	mmun_start = address;
 	mmun_end   = end;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, event);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
+					    mmun_end, event);
 
 	/*
 	 * If we can acquire the mmap_sem for read, and vma is VM_LOCKED,
@@ -1390,7 +1391,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 		(*mapcount)--;
 	}
 	pte_unmap_unlock(pte - 1, ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, event);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, event);
 	if (locked_vma)
 		up_read(&vma->vm_mm->mmap_sem);
 	return ret;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6e1992f..c4b7bf9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -262,6 +262,7 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 
 static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
+					     struct vm_area_struct *vma,
 					     unsigned long address,
 					     enum mmu_event event)
 {
@@ -318,6 +319,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 
 static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
+						    struct vm_area_struct *vma,
 						    unsigned long start,
 						    unsigned long end,
 						    enum mmu_event event)
@@ -345,6 +347,7 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 						  struct mm_struct *mm,
+						  struct vm_area_struct *vma,
 						  unsigned long start,
 						  unsigned long end,
 						  enum mmu_event event)
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 4/6] mmu_notifier: pass through vma to invalidate_range and invalidate_page
@ 2014-06-28  2:00   ` Jérôme Glisse
  0 siblings, 0 replies; 76+ messages in thread
From: Jérôme Glisse @ 2014-06-28  2:00 UTC (permalink / raw)
  To: akpm, linux-mm, linux-kernel
  Cc: mgorman, hpa, peterz, aarcange, riel, jweiner, torvalds,
	Mark Hairgrove, Jatin Kumar, Subhash Gutti, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, John Hubbard,
	Sherry Cheung, Duncan Poole, Oded Gabbay, Alexander Deucher,
	Andrew Lewycky, Jérôme Glisse

From: Jérôme Glisse <jglisse@redhat.com>

New users of the mmu_notifier interface need to look up the vma in order
to perform the invalidation operation. Instead of redoing a vma lookup
inside the callback, just pass the vma through from the call site, where
it is already available.

This needs a small refactoring in memory.c so that invalidate_range is
called on vma boundaries; the overhead should be low enough.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 drivers/gpu/drm/i915/i915_gem_userptr.c |  1 +
 drivers/iommu/amd_iommu_v2.c            |  3 +++
 drivers/misc/sgi-gru/grutlbpurge.c      |  6 ++++-
 drivers/xen/gntdev.c                    |  4 +++-
 fs/proc/task_mmu.c                      | 16 ++++++++-----
 include/linux/mmu_notifier.h            | 19 ++++++++++++---
 kernel/events/uprobes.c                 |  4 ++--
 mm/filemap_xip.c                        |  3 ++-
 mm/huge_memory.c                        | 26 ++++++++++----------
 mm/hugetlb.c                            | 16 ++++++-------
 mm/ksm.c                                |  8 +++----
 mm/memory.c                             | 42 +++++++++++++++++++++------------
 mm/migrate.c                            |  6 ++---
 mm/mmu_notifier.c                       |  9 ++++---
 mm/mprotect.c                           |  5 ++--
 mm/mremap.c                             |  4 ++--
 mm/rmap.c                               |  9 +++----
 virt/kvm/kvm_main.c                     |  3 +++
 18 files changed, 116 insertions(+), 68 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index ed6f35e..191ac71 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -55,6 +55,7 @@ struct i915_mmu_object {
 
 static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 						       struct mm_struct *mm,
+						       struct vm_area_struct *vma,
 						       unsigned long start,
 						       unsigned long end,
 						       enum mmu_event event)
diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 2bb9771..9f9e706 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -422,6 +422,7 @@ static void mn_change_pte(struct mmu_notifier *mn,
 
 static void mn_invalidate_page(struct mmu_notifier *mn,
 			       struct mm_struct *mm,
+			       struct vm_area_struct *vma,
 			       unsigned long address,
 			       enum mmu_event event)
 {
@@ -430,6 +431,7 @@ static void mn_invalidate_page(struct mmu_notifier *mn,
 
 static void mn_invalidate_range_start(struct mmu_notifier *mn,
 				      struct mm_struct *mm,
+				      struct vm_area_struct *vma,
 				      unsigned long start,
 				      unsigned long end,
 				      enum mmu_event event)
@@ -453,6 +455,7 @@ static void mn_invalidate_range_start(struct mmu_notifier *mn,
 
 static void mn_invalidate_range_end(struct mmu_notifier *mn,
 				    struct mm_struct *mm,
+				    struct vm_area_struct *vma,
 				    unsigned long start,
 				    unsigned long end,
 				    enum mmu_event event)
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index e67fed1..d02e4c7 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -221,6 +221,7 @@ void gru_flush_all_tlb(struct gru_state *gru)
  */
 static void gru_invalidate_range_start(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
+				       struct vm_area_struct *vma,
 				       unsigned long start, unsigned long end,
 				       enum mmu_event event)
 {
@@ -235,7 +236,9 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
 }
 
 static void gru_invalidate_range_end(struct mmu_notifier *mn,
-				     struct mm_struct *mm, unsigned long start,
+				     struct mm_struct *mm,
+				     struct vm_area_struct *vma,
+				     unsigned long start,
 				     unsigned long end,
 				     enum mmu_event event)
 {
@@ -250,6 +253,7 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
 }
 
 static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
+				struct vm_area_struct *vma,
 				unsigned long address,
 				enum mmu_event event)
 {
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index fe9da94..219928b 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -428,6 +428,7 @@ static void unmap_if_in_range(struct grant_map *map,
 
 static void mn_invl_range_start(struct mmu_notifier *mn,
 				struct mm_struct *mm,
+				struct vm_area_struct *vma,
 				unsigned long start,
 				unsigned long end,
 				enum mmu_event event)
@@ -447,10 +448,11 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
 
 static void mn_invl_page(struct mmu_notifier *mn,
 			 struct mm_struct *mm,
+			 struct vm_area_struct *vma,
 			 unsigned long address,
 			 enum mmu_event event)
 {
-	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, event);
+	mn_invl_range_start(mn, mm, vma, address, address + PAGE_SIZE, event);
 }
 
 static void mn_release(struct mmu_notifier *mn,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e9e79f7..8b0f25d 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -829,13 +829,15 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 			.private = &cp,
 		};
 		down_read(&mm->mmap_sem);
-		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_start(mm, 0,
-							    -1, MMU_STATUS);
 		for (vma = mm->mmap; vma; vma = vma->vm_next) {
 			cp.vma = vma;
 			if (is_vm_hugetlb_page(vma))
 				continue;
+			if (type == CLEAR_REFS_SOFT_DIRTY)
+				mmu_notifier_invalidate_range_start(mm, vma,
+								    vma->vm_start,
+								    vma->vm_end,
+								    MMU_STATUS);
 			/*
 			 * Writing 1 to /proc/pid/clear_refs affects all pages.
 			 *
@@ -857,10 +859,12 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 			}
 			walk_page_range(vma->vm_start, vma->vm_end,
 					&clear_refs_walk);
+			if (type == CLEAR_REFS_SOFT_DIRTY)
+				mmu_notifier_invalidate_range_end(mm, vma,
+								  vma->vm_start,
+								  vma->vm_end,
+								  MMU_STATUS);
 		}
-		if (type == CLEAR_REFS_SOFT_DIRTY)
-			mmu_notifier_invalidate_range_end(mm, 0,
-							  -1, MMU_STATUS);
 		flush_tlb_mm(mm);
 		up_read(&mm->mmap_sem);
 		mmput(mm);
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 82e9577..8907e5d 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -137,6 +137,7 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_page)(struct mmu_notifier *mn,
 				struct mm_struct *mm,
+				struct vm_area_struct *vma,
 				unsigned long address,
 				enum mmu_event event);
 
@@ -185,11 +186,13 @@ struct mmu_notifier_ops {
 	 */
 	void (*invalidate_range_start)(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
+				       struct vm_area_struct *vma,
 				       unsigned long start,
 				       unsigned long end,
 				       enum mmu_event event);
 	void (*invalidate_range_end)(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
+				     struct vm_area_struct *vma,
 				     unsigned long start,
 				     unsigned long end,
 				     enum mmu_event event);
@@ -233,13 +236,16 @@ extern void __mmu_notifier_change_pte(struct mm_struct *mm,
 				      pte_t pte,
 				      enum mmu_event event);
 extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
+					   struct vm_area_struct *vma,
 					  unsigned long address,
 					  enum mmu_event event);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+						  struct vm_area_struct *vma,
 						  unsigned long start,
 						  unsigned long end,
 						  enum mmu_event event);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+						struct vm_area_struct *vma,
 						unsigned long start,
 						unsigned long end,
 						enum mmu_event event);
@@ -276,29 +282,33 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
+						struct vm_area_struct *vma,
 						unsigned long address,
 						enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_page(mm, address, event);
+		__mmu_notifier_invalidate_page(mm, vma, address, event);
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+						       struct vm_area_struct *vma,
 						       unsigned long start,
 						       unsigned long end,
 						       enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_start(mm, start, end, event);
+		__mmu_notifier_invalidate_range_start(mm, vma, start,
+						      end, event);
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+						     struct vm_area_struct *vma,
 						     unsigned long start,
 						     unsigned long end,
 						     enum mmu_event event)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_end(mm, start, end, event);
+		__mmu_notifier_invalidate_range_end(mm, vma, start, end, event);
 }
 
 static inline void mmu_notifier_mm_init(struct mm_struct *mm)
@@ -380,12 +390,14 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
+						struct vm_area_struct *vma,
 						unsigned long address,
 						enum mmu_event event)
 {
 }
 
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+						       struct vm_area_struct *vma,
 						       unsigned long start,
 						       unsigned long end,
 						       enum mmu_event event)
@@ -393,6 +405,7 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+						     struct vm_area_struct *vma,
 						     unsigned long start,
 						     unsigned long end,
 						     enum mmu_event event)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 296f81e..0f552bc 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -177,7 +177,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	/* For try_to_free_swap() and munlock_vma_page() below */
 	lock_page(page);
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
 					    mmun_end, MMU_MIGRATE);
 	err = -EAGAIN;
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
@@ -212,7 +212,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	err = 0;
  unlock:
 	mem_cgroup_cancel_charge(kpage, memcg);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
 					  mmun_end, MMU_MIGRATE);
 	unlock_page(page);
 	return err;
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index a2b3f09..f0113df 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -198,7 +198,8 @@ retry:
 			BUG_ON(pte_dirty(pteval));
 			pte_unmap_unlock(pte, ptl);
 			/* must invalidate_page _before_ freeing the page */
-			mmu_notifier_invalidate_page(mm, address, MMU_MIGRATE);
+			mmu_notifier_invalidate_page(mm, vma, address,
+						     MMU_MIGRATE);
 			page_cache_release(page);
 		}
 	}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fa30857..cc74b60 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1022,7 +1022,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
 					    mmun_end, MMU_MIGRATE);
 
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
@@ -1064,7 +1064,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	page_remove_rmap(page);
 	spin_unlock(ptl);
 
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
 					  mmun_end, MMU_MIGRATE);
 
 	ret |= VM_FAULT_WRITE;
@@ -1075,7 +1075,7 @@ out:
 
 out_free_pages:
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
 					  mmun_end, MMU_MIGRATE);
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		memcg = (void *)page_private(pages[i]);
@@ -1162,7 +1162,7 @@ alloc:
 
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
 					    mmun_end, MMU_MIGRATE);
 
 	if (!page)
@@ -1201,7 +1201,7 @@ alloc:
 	}
 	spin_unlock(ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
 					  mmun_end, MMU_MIGRATE);
 out:
 	return ret;
@@ -1637,7 +1637,7 @@ static int __split_huge_page_splitting(struct page *page,
 	const unsigned long mmun_start = address;
 	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
 
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
 					    mmun_end, MMU_STATUS);
 	pmd = page_check_address_pmd(page, mm, address,
 			PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
@@ -1653,7 +1653,7 @@ static int __split_huge_page_splitting(struct page *page,
 		ret = 1;
 		spin_unlock(ptl);
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
 					  mmun_end, MMU_STATUS);
 
 	return ret;
@@ -2453,7 +2453,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 
 	mmun_start = address;
 	mmun_end   = address + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
 					    mmun_end, MMU_MIGRATE);
 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
 	/*
@@ -2464,7 +2464,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 */
 	_pmd = pmdp_clear_flush(vma, address, pmd);
 	spin_unlock(pmd_ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
 					  mmun_end, MMU_MIGRATE);
 
 	spin_lock(pte_ptl);
@@ -2854,19 +2854,19 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
 again:
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
 					    mmun_end, MMU_MIGRATE);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_trans_huge(*pmd))) {
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start,
+		mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
 						  mmun_end, MMU_MIGRATE);
 		return;
 	}
 	if (is_huge_zero_pmd(*pmd)) {
 		__split_huge_zero_page_pmd(vma, haddr, pmd);
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start,
+		mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
 						  mmun_end, MMU_MIGRATE);
 		return;
 	}
@@ -2874,7 +2874,7 @@ again:
 	VM_BUG_ON_PAGE(!page_count(page), page);
 	get_page(page);
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
 					  mmun_end, MMU_MIGRATE);
 
 	split_huge_page(page);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 73e1576..15f0123 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2565,7 +2565,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	mmun_start = vma->vm_start;
 	mmun_end = vma->vm_end;
 	if (cow)
-		mmu_notifier_invalidate_range_start(src, mmun_start,
+		mmu_notifier_invalidate_range_start(src, vma, mmun_start,
 						    mmun_end, MMU_MIGRATE);
 
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
@@ -2616,7 +2616,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	}
 
 	if (cow)
-		mmu_notifier_invalidate_range_end(src, mmun_start,
+		mmu_notifier_invalidate_range_end(src, vma, mmun_start,
 						  mmun_end, MMU_MIGRATE);
 
 	return ret;
@@ -2643,7 +2643,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	BUG_ON(end & ~huge_page_mask(h));
 
 	tlb_start_vma(tlb, vma);
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
 					    mmun_end, MMU_MIGRATE);
 again:
 	for (address = start; address < end; address += sz) {
@@ -2715,7 +2715,7 @@ unlock:
 		if (address < end && !ref_page)
 			goto again;
 	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
 					  mmun_end, MMU_MIGRATE);
 	tlb_end_vma(tlb, vma);
 }
@@ -2903,7 +2903,7 @@ retry_avoidcopy:
 
 	mmun_start = address & huge_page_mask(h);
 	mmun_end = mmun_start + huge_page_size(h);
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
 					    mmun_end, MMU_MIGRATE);
 	/*
 	 * Retake the page table lock to check for racing updates
@@ -2924,7 +2924,7 @@ retry_avoidcopy:
 		new_page = old_page;
 	}
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
 					  mmun_end, MMU_MIGRATE);
 	page_cache_release(new_page);
 	page_cache_release(old_page);
@@ -3363,7 +3363,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	BUG_ON(address >= end);
 	flush_cache_range(vma, address, end);
 
-	mmu_notifier_invalidate_range_start(mm, start, end, event);
+	mmu_notifier_invalidate_range_start(mm, vma, start, end, event);
 	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
 	for (; address < end; address += huge_page_size(h)) {
 		spinlock_t *ptl;
@@ -3393,7 +3393,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	 */
 	flush_tlb_range(vma, start, end);
 	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
-	mmu_notifier_invalidate_range_end(mm, start, end, event);
+	mmu_notifier_invalidate_range_end(mm, vma, start, end, event);
 
 	return pages << h->order;
 }
diff --git a/mm/ksm.c b/mm/ksm.c
index 4b659f1..1f3c4d7 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -873,7 +873,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 
 	mmun_start = addr;
 	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
 					    mmun_end, MMU_MPROT_RONLY);
 
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
@@ -914,7 +914,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 out_unlock:
 	pte_unmap_unlock(ptep, ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
 					  mmun_end, MMU_MPROT_RONLY);
 out:
 	return err;
@@ -951,7 +951,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 
 	mmun_start = addr;
 	mmun_end   = addr + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
 					    mmun_end, MMU_MIGRATE);
 
 	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
@@ -977,7 +977,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	pte_unmap_unlock(ptep, ptl);
 	err = 0;
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
 					  mmun_end, MMU_MIGRATE);
 out:
 	return err;
diff --git a/mm/memory.c b/mm/memory.c
index d3908f0..4717579 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1049,7 +1049,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	mmun_start = addr;
 	mmun_end   = end;
 	if (is_cow)
-		mmu_notifier_invalidate_range_start(src_mm, mmun_start,
+		mmu_notifier_invalidate_range_start(src_mm, vma, mmun_start,
 						    mmun_end, MMU_MIGRATE);
 
 	ret = 0;
@@ -1067,8 +1067,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
 
 	if (is_cow)
-		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end,
-						  MMU_MIGRATE);
+		mmu_notifier_invalidate_range_end(src_mm, vma, mmun_start,
+						  mmun_end, MMU_MIGRATE);
 	return ret;
 }
 
@@ -1372,12 +1372,17 @@ void unmap_vmas(struct mmu_gather *tlb,
 {
 	struct mm_struct *mm = vma->vm_mm;
 
-	mmu_notifier_invalidate_range_start(mm, start_addr,
-					    end_addr, MMU_MUNMAP);
-	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
+	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
+		mmu_notifier_invalidate_range_start(mm, vma,
+						    max(start_addr, vma->vm_start),
+						    min(end_addr, vma->vm_end),
+						    MMU_MUNMAP);
 		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
-	mmu_notifier_invalidate_range_end(mm, start_addr,
-					  end_addr, MMU_MUNMAP);
+		mmu_notifier_invalidate_range_end(mm, vma,
+						  max(start_addr, vma->vm_start),
+						  min(end_addr, vma->vm_end),
+						  MMU_MUNMAP);
+	}
 }
 
 /**
@@ -1399,10 +1404,17 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, start, end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MUNMAP);
-	for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
+	for ( ; vma && vma->vm_start < end; vma = vma->vm_next) {
+		mmu_notifier_invalidate_range_start(mm, vma,
+						    max(start, vma->vm_start),
+						    min(end, vma->vm_end),
+						    MMU_MUNMAP);
 		unmap_single_vma(&tlb, vma, start, end, details);
-	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MUNMAP);
+		mmu_notifier_invalidate_range_end(mm, vma,
+						  max(start, vma->vm_start),
+						  min(end, vma->vm_end),
+						  MMU_MUNMAP);
+	}
 	tlb_finish_mmu(&tlb, start, end);
 }
 
@@ -1425,9 +1437,9 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, address, end);
 	update_hiwater_rss(mm);
-	mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
+	mmu_notifier_invalidate_range_start(mm, vma, address, end, MMU_MUNMAP);
 	unmap_single_vma(&tlb, vma, address, end, details);
-	mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
+	mmu_notifier_invalidate_range_end(mm, vma, address, end, MMU_MUNMAP);
 	tlb_finish_mmu(&tlb, address, end);
 }
 
@@ -2211,7 +2223,7 @@ gotten:
 
 	mmun_start  = address & PAGE_MASK;
 	mmun_end    = mmun_start + PAGE_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
 					    mmun_end, MMU_MIGRATE);
 
 	/*
@@ -2283,7 +2295,7 @@ gotten:
 unlock:
 	pte_unmap_unlock(page_table, ptl);
 	if (mmun_end > mmun_start)
-		mmu_notifier_invalidate_range_end(mm, mmun_start,
+		mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
 						  mmun_end, MMU_MIGRATE);
 	if (old_page) {
 		/*
diff --git a/mm/migrate.c b/mm/migrate.c
index b526c72..0c61aa9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1820,13 +1820,13 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	WARN_ON(PageLRU(new_page));
 
 	/* Recheck the target PMD */
-	mmu_notifier_invalidate_range_start(mm, mmun_start,
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
 					    mmun_end, MMU_MIGRATE);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
 fail_putback:
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start,
+		mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
 						  mmun_end, MMU_MIGRATE);
 
 		/* Reverse changes made by migrate_page_copy() */
@@ -1880,7 +1880,7 @@ fail_putback:
 	page_remove_rmap(page);
 
 	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start,
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
 					  mmun_end, MMU_MIGRATE);
 
 	/* Take an "isolate" reference and put new page on the LRU. */
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 9decb88..87e6bc5 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -139,6 +139,7 @@ void __mmu_notifier_change_pte(struct mm_struct *mm,
 }
 
 void __mmu_notifier_invalidate_page(struct mm_struct *mm,
+				    struct vm_area_struct *vma,
 				    unsigned long address,
 				    enum mmu_event event)
 {
@@ -148,12 +149,13 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_page)
-			mn->ops->invalidate_page(mn, mm, address, event);
+			mn->ops->invalidate_page(mn, mm, vma, address, event);
 	}
 	srcu_read_unlock(&srcu, id);
 }
 
 void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+					   struct vm_area_struct *vma,
 					   unsigned long start,
 					   unsigned long end,
 					   enum mmu_event event)
@@ -165,7 +167,7 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_start)
-			mn->ops->invalidate_range_start(mn, mm, start,
+			mn->ops->invalidate_range_start(mn, mm, vma, start,
 							end, event);
 	}
 	srcu_read_unlock(&srcu, id);
@@ -173,6 +175,7 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
 
 void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+					 struct vm_area_struct *vma,
 					 unsigned long start,
 					 unsigned long end,
 					 enum mmu_event event)
@@ -183,7 +186,7 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_end)
-			mn->ops->invalidate_range_end(mn, mm, start,
+			mn->ops->invalidate_range_end(mn, mm, vma, start,
 						      end, event);
 	}
 	srcu_read_unlock(&srcu, id);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 6ce6c23..16ce504 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -158,7 +158,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		/* invoke the mmu notifier if the pmd is populated */
 		if (!mni_start) {
 			mni_start = addr;
-			mmu_notifier_invalidate_range_start(mm, mni_start,
+			mmu_notifier_invalidate_range_start(mm, vma, mni_start,
 							    end, event);
 		}
 
@@ -187,7 +187,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	} while (pmd++, addr = next, addr != end);
 
 	if (mni_start)
-		mmu_notifier_invalidate_range_end(mm, mni_start, end, event);
+		mmu_notifier_invalidate_range_end(mm, vma, mni_start,
+						  end, event);
 
 	if (nr_huge_updates)
 		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
diff --git a/mm/mremap.c b/mm/mremap.c
index 6827d2f..9bee6de 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -177,7 +177,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 
 	mmun_start = old_addr;
 	mmun_end   = old_end;
-	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start,
+	mmu_notifier_invalidate_range_start(vma->vm_mm, vma, mmun_start,
 					    mmun_end, MMU_MIGRATE);
 
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
@@ -229,7 +229,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	if (likely(need_flush))
 		flush_tlb_range(vma, old_end-len, old_addr);
 
-	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start,
+	mmu_notifier_invalidate_range_end(vma->vm_mm, vma, mmun_start,
 					  mmun_end, MMU_MIGRATE);
 
 	return len + old_addr - old_end;	/* how much done */
diff --git a/mm/rmap.c b/mm/rmap.c
index bd7e6d7..f1be50d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -840,7 +840,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 	pte_unmap_unlock(pte, ptl);
 
 	if (ret) {
-		mmu_notifier_invalidate_page(mm, address, MMU_WB);
+		mmu_notifier_invalidate_page(mm, vma, address, MMU_WB);
 		(*cleaned)++;
 	}
 out:
@@ -1237,7 +1237,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret != SWAP_FAIL && !(flags & TTU_MUNLOCK))
-		mmu_notifier_invalidate_page(mm, address, event);
+		mmu_notifier_invalidate_page(mm, vma, address, event);
 out:
 	return ret;
 
@@ -1325,7 +1325,8 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 
 	mmun_start = address;
 	mmun_end   = end;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, event);
+	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
+					    mmun_end, event);
 
 	/*
 	 * If we can acquire the mmap_sem for read, and vma is VM_LOCKED,
@@ -1390,7 +1391,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 		(*mapcount)--;
 	}
 	pte_unmap_unlock(pte - 1, ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, event);
+	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, event);
 	if (locked_vma)
 		up_read(&vma->vm_mm->mmap_sem);
 	return ret;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6e1992f..c4b7bf9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -262,6 +262,7 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 
 static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
+					     struct vm_area_struct *vma,
 					     unsigned long address,
 					     enum mmu_event event)
 {
@@ -318,6 +319,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 
 static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
+						    struct vm_area_struct *vma,
 						    unsigned long start,
 						    unsigned long end,
 						    enum mmu_event event)
@@ -345,6 +347,7 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 						  struct mm_struct *mm,
+						  struct vm_area_struct *vma,
 						  unsigned long start,
 						  unsigned long end,
 						  enum mmu_event event)
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH 4/6] mmu_notifier: pass through vma to invalidate_range and invalidate_page
  2014-06-28  2:00   ` Jérôme Glisse
@ 2014-06-30  3:29     ` John Hubbard
  -1 siblings, 0 replies; 76+ messages in thread
From: John Hubbard @ 2014-06-30  3:29 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: akpm, linux-mm, linux-kernel, mgorman, hpa, peterz, aarcange,
	riel, jweiner, torvalds, Mark Hairgrove, Jatin Kumar,
	Subhash Gutti, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Sherry Cheung, Duncan Poole, Oded Gabbay,
	Alexander Deucher, Andrew Lewycky, Jérôme Glisse

[-- Attachment #1: Type: text/plain, Size: 33563 bytes --]

On Fri, 27 Jun 2014, Jérôme Glisse wrote:

> From: Jérôme Glisse <jglisse@redhat.com>
> 
> New users of the mmu_notifier interface need to look up the vma in
> order to perform their invalidation operation. Instead of redoing a
> vma lookup inside the callback, just pass the vma through from the
> call site, where it is already available.
> 
> This needs a small refactoring in memory.c so that invalidate_range is
> called on vma boundaries; the overhead should be low enough.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> ---
>  drivers/gpu/drm/i915/i915_gem_userptr.c |  1 +
>  drivers/iommu/amd_iommu_v2.c            |  3 +++
>  drivers/misc/sgi-gru/grutlbpurge.c      |  6 ++++-
>  drivers/xen/gntdev.c                    |  4 +++-
>  fs/proc/task_mmu.c                      | 16 ++++++++-----
>  include/linux/mmu_notifier.h            | 19 ++++++++++++---
>  kernel/events/uprobes.c                 |  4 ++--
>  mm/filemap_xip.c                        |  3 ++-
>  mm/huge_memory.c                        | 26 ++++++++++----------
>  mm/hugetlb.c                            | 16 ++++++-------
>  mm/ksm.c                                |  8 +++----
>  mm/memory.c                             | 42 +++++++++++++++++++++------------
>  mm/migrate.c                            |  6 ++---
>  mm/mmu_notifier.c                       |  9 ++++---
>  mm/mprotect.c                           |  5 ++--
>  mm/mremap.c                             |  4 ++--
>  mm/rmap.c                               |  9 +++----
>  virt/kvm/kvm_main.c                     |  3 +++
>  18 files changed, 116 insertions(+), 68 deletions(-)
> 

Hi Jerome, considering that you have to change every call site already, it 
seems to me that it would be ideal to just delete the mm argument from all 
of these invalidate_range* callbacks. In other words, replace the mm 
argument with the new vma argument.  I don't see much point in passing 
them both around, and while it would make the patch a *bit* larger, it's 
mostly just an extra line or two per call site:

  mm = vma->vm_mm;
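
So a converted callback could end up looking roughly like this (only a
sketch of that idea; the "example_" prefix and the pr_debug() body are
made up here, not something from this series):

  static void example_invalidate_range_start(struct mmu_notifier *mn,
                                             struct vm_area_struct *vma,
                                             unsigned long start,
                                             unsigned long end,
                                             enum mmu_event event)
  {
          struct mm_struct *mm = vma->vm_mm;

          /* A real driver would tear down its mappings of [start, end)
           * for this mm here; vma->vm_flags and vma->vm_file are right
           * at hand, no extra lookup needed. */
          pr_debug("invalidate %#lx-%#lx for mm %p\n", start, end, mm);
  }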

Also, passing the vma around really does seem like a good approach, but it 
does cause a bunch of additional calls to the invalidate_range* routines,
because we generate a call per vma, instead of just one for the entire mm.
So that brings up a couple questions:

1) Is there any chance that this could cause measurable performance 
regressions?

2) Should you put a little note in the commit message, mentioning this 
point?
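
For context, what the per-vma calls buy each callback is the chance to
skip this kind of lookup (hypothetical consumer, not code from this
series; note that find_vma() also wants mmap_sem held):

  static void lookup_invalidate_range_start(struct mmu_notifier *mn,
                                            struct mm_struct *mm,
                                            unsigned long start,
                                            unsigned long end,
                                            enum mmu_event event)
  {
          /* Without the vma argument the callback has to redo the
           * lookup that the call site already did. */
          struct vm_area_struct *vma = find_vma(mm, start);

          if (!vma || vma->vm_start >= end)
                  return;
          /* use vma->vm_flags / vma->vm_file to decide how to
           * invalidate [start, end) */
  }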

> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> index ed6f35e..191ac71 100644
> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> @@ -55,6 +55,7 @@ struct i915_mmu_object {
>  
>  static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>  						       struct mm_struct *mm,
> +						       struct vm_area_struct *vma,
>  						       unsigned long start,
>  						       unsigned long end,
>  						       enum mmu_event event)

That routine has a local variable named vma, so it might be polite to 
rename the local variable, to make it more obvious to the reader that they 
are different. Of course, since the compiler knows which is which, feel 
free to ignore this comment.
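
For example, the new parameter could simply get a different name there
(purely illustrative, "mn_vma" is just a placeholder):

  static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
                                                         struct mm_struct *mm,
                                                         struct vm_area_struct *mn_vma,
                                                         unsigned long start,
                                                         unsigned long end,
                                                         enum mmu_event event);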

> diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
> index 2bb9771..9f9e706 100644
> --- a/drivers/iommu/amd_iommu_v2.c
> +++ b/drivers/iommu/amd_iommu_v2.c
> @@ -422,6 +422,7 @@ static void mn_change_pte(struct mmu_notifier *mn,
>  
>  static void mn_invalidate_page(struct mmu_notifier *mn,
>  			       struct mm_struct *mm,
> +			       struct vm_area_struct *vma,
>  			       unsigned long address,
>  			       enum mmu_event event)
>  {
> @@ -430,6 +431,7 @@ static void mn_invalidate_page(struct mmu_notifier *mn,
>  
>  static void mn_invalidate_range_start(struct mmu_notifier *mn,
>  				      struct mm_struct *mm,
> +				      struct vm_area_struct *vma,
>  				      unsigned long start,
>  				      unsigned long end,
>  				      enum mmu_event event)
> @@ -453,6 +455,7 @@ static void mn_invalidate_range_start(struct mmu_notifier *mn,
>  
>  static void mn_invalidate_range_end(struct mmu_notifier *mn,
>  				    struct mm_struct *mm,
> +				    struct vm_area_struct *vma,
>  				    unsigned long start,
>  				    unsigned long end,
>  				    enum mmu_event event)
> diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
> index e67fed1..d02e4c7 100644
> --- a/drivers/misc/sgi-gru/grutlbpurge.c
> +++ b/drivers/misc/sgi-gru/grutlbpurge.c
> @@ -221,6 +221,7 @@ void gru_flush_all_tlb(struct gru_state *gru)
>   */
>  static void gru_invalidate_range_start(struct mmu_notifier *mn,
>  				       struct mm_struct *mm,
> +				       struct vm_area_struct *vma,
>  				       unsigned long start, unsigned long end,
>  				       enum mmu_event event)
>  {
> @@ -235,7 +236,9 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
>  }
>  
>  static void gru_invalidate_range_end(struct mmu_notifier *mn,
> -				     struct mm_struct *mm, unsigned long start,
> +				     struct mm_struct *mm,
> +				     struct vm_area_struct *vma,
> +				     unsigned long start,
>  				     unsigned long end,
>  				     enum mmu_event event)
>  {
> @@ -250,6 +253,7 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
>  }
>  
>  static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
> +				struct vm_area_struct *vma,
>  				unsigned long address,
>  				enum mmu_event event)
>  {
> diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
> index fe9da94..219928b 100644
> --- a/drivers/xen/gntdev.c
> +++ b/drivers/xen/gntdev.c
> @@ -428,6 +428,7 @@ static void unmap_if_in_range(struct grant_map *map,
>  
>  static void mn_invl_range_start(struct mmu_notifier *mn,
>  				struct mm_struct *mm,
> +				struct vm_area_struct *vma,
>  				unsigned long start,
>  				unsigned long end,
>  				enum mmu_event event)
> @@ -447,10 +448,11 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
>  
>  static void mn_invl_page(struct mmu_notifier *mn,
>  			 struct mm_struct *mm,
> +			 struct vm_area_struct *vma,
>  			 unsigned long address,
>  			 enum mmu_event event)
>  {
> -	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, event);
> +	mn_invl_range_start(mn, mm, vma, address, address + PAGE_SIZE, event);
>  }
>  
>  static void mn_release(struct mmu_notifier *mn,
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index e9e79f7..8b0f25d 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -829,13 +829,15 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
>  			.private = &cp,
>  		};
>  		down_read(&mm->mmap_sem);
> -		if (type == CLEAR_REFS_SOFT_DIRTY)
> -			mmu_notifier_invalidate_range_start(mm, 0,
> -							    -1, MMU_STATUS);
>  		for (vma = mm->mmap; vma; vma = vma->vm_next) {
>  			cp.vma = vma;
>  			if (is_vm_hugetlb_page(vma))
>  				continue;
> +			if (type == CLEAR_REFS_SOFT_DIRTY)
> +				mmu_notifier_invalidate_range_start(mm, vma,
> +								    vma->vm_start,
> +								    vma->vm_end,
> +								    MMU_STATUS);
>  			/*
>  			 * Writing 1 to /proc/pid/clear_refs affects all pages.
>  			 *
> @@ -857,10 +859,12 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
>  			}
>  			walk_page_range(vma->vm_start, vma->vm_end,
>  					&clear_refs_walk);
> +			if (type == CLEAR_REFS_SOFT_DIRTY)
> +				mmu_notifier_invalidate_range_end(mm, vma,
> +								  vma->vm_start,
> +								  vma->vm_end,
> +								  MMU_STATUS);
>  		}
> -		if (type == CLEAR_REFS_SOFT_DIRTY)
> -			mmu_notifier_invalidate_range_end(mm, 0,
> -							  -1, MMU_STATUS);
>  		flush_tlb_mm(mm);
>  		up_read(&mm->mmap_sem);
>  		mmput(mm);
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 82e9577..8907e5d 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -137,6 +137,7 @@ struct mmu_notifier_ops {
>  	 */
>  	void (*invalidate_page)(struct mmu_notifier *mn,
>  				struct mm_struct *mm,
> +				struct vm_area_struct *vma,
>  				unsigned long address,
>  				enum mmu_event event);
>  
> @@ -185,11 +186,13 @@ struct mmu_notifier_ops {
>  	 */
>  	void (*invalidate_range_start)(struct mmu_notifier *mn,
>  				       struct mm_struct *mm,
> +				       struct vm_area_struct *vma,
>  				       unsigned long start,
>  				       unsigned long end,
>  				       enum mmu_event event);
>  	void (*invalidate_range_end)(struct mmu_notifier *mn,
>  				     struct mm_struct *mm,
> +				     struct vm_area_struct *vma,
>  				     unsigned long start,
>  				     unsigned long end,
>  				     enum mmu_event event);
> @@ -233,13 +236,16 @@ extern void __mmu_notifier_change_pte(struct mm_struct *mm,
>  				      pte_t pte,
>  				      enum mmu_event event);
>  extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
> +					   struct vm_area_struct *vma,
>  					  unsigned long address,
>  					  enum mmu_event event);
>  extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> +						  struct vm_area_struct *vma,
>  						  unsigned long start,
>  						  unsigned long end,
>  						  enum mmu_event event);
>  extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> +						struct vm_area_struct *vma,
>  						unsigned long start,
>  						unsigned long end,
>  						enum mmu_event event);
> @@ -276,29 +282,33 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm,
>  }
>  
>  static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
> +						struct vm_area_struct *vma,
>  						unsigned long address,
>  						enum mmu_event event)
>  {
>  	if (mm_has_notifiers(mm))
> -		__mmu_notifier_invalidate_page(mm, address, event);
> +		__mmu_notifier_invalidate_page(mm, vma, address, event);
>  }
>  
>  static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> +						       struct vm_area_struct *vma,
>  						       unsigned long start,
>  						       unsigned long end,
>  						       enum mmu_event event)
>  {
>  	if (mm_has_notifiers(mm))
> -		__mmu_notifier_invalidate_range_start(mm, start, end, event);
> +		__mmu_notifier_invalidate_range_start(mm, vma, start,
> +						      end, event);
>  }
>  
>  static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> +						     struct vm_area_struct *vma,
>  						     unsigned long start,
>  						     unsigned long end,
>  						     enum mmu_event event)
>  {
>  	if (mm_has_notifiers(mm))
> -		__mmu_notifier_invalidate_range_end(mm, start, end, event);
> +		__mmu_notifier_invalidate_range_end(mm, vma, start, end, event);
>  }
>  
>  static inline void mmu_notifier_mm_init(struct mm_struct *mm)
> @@ -380,12 +390,14 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm,
>  }
>  
>  static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
> +						struct vm_area_struct *vma,
>  						unsigned long address,
>  						enum mmu_event event)
>  {
>  }
>  
>  static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> +						       struct vm_area_struct *vma,
>  						       unsigned long start,
>  						       unsigned long end,
>  						       enum mmu_event event)
> @@ -393,6 +405,7 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
>  }
>  
>  static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> +						     struct vm_area_struct *vma,
>  						     unsigned long start,
>  						     unsigned long end,
>  						     enum mmu_event event)
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 296f81e..0f552bc 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -177,7 +177,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
>  	/* For try_to_free_swap() and munlock_vma_page() below */
>  	lock_page(page);
>  
> -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
>  					    mmun_end, MMU_MIGRATE);
>  	err = -EAGAIN;
>  	ptep = page_check_address(page, mm, addr, &ptl, 0);
> @@ -212,7 +212,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
>  	err = 0;
>   unlock:
>  	mem_cgroup_cancel_charge(kpage, memcg);
> -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
>  					  mmun_end, MMU_MIGRATE);
>  	unlock_page(page);
>  	return err;
> diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
> index a2b3f09..f0113df 100644
> --- a/mm/filemap_xip.c
> +++ b/mm/filemap_xip.c
> @@ -198,7 +198,8 @@ retry:
>  			BUG_ON(pte_dirty(pteval));
>  			pte_unmap_unlock(pte, ptl);
>  			/* must invalidate_page _before_ freeing the page */
> -			mmu_notifier_invalidate_page(mm, address, MMU_MIGRATE);
> +			mmu_notifier_invalidate_page(mm, vma, address,
> +						     MMU_MIGRATE);
>  			page_cache_release(page);
>  		}
>  	}
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index fa30857..cc74b60 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1022,7 +1022,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
>  
>  	mmun_start = haddr;
>  	mmun_end   = haddr + HPAGE_PMD_SIZE;
> -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
>  					    mmun_end, MMU_MIGRATE);
>  
>  	for (i = 0; i < HPAGE_PMD_NR; i++) {
> @@ -1064,7 +1064,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
>  	page_remove_rmap(page);
>  	spin_unlock(ptl);
>  
> -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
>  					  mmun_end, MMU_MIGRATE);
>  
>  	ret |= VM_FAULT_WRITE;
> @@ -1075,7 +1075,7 @@ out:
>  
>  out_free_pages:
>  	spin_unlock(ptl);
> -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
>  					  mmun_end, MMU_MIGRATE);
>  	for (i = 0; i < HPAGE_PMD_NR; i++) {
>  		memcg = (void *)page_private(pages[i]);
> @@ -1162,7 +1162,7 @@ alloc:
>  
>  	mmun_start = haddr;
>  	mmun_end   = haddr + HPAGE_PMD_SIZE;
> -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
>  					    mmun_end, MMU_MIGRATE);
>  
>  	if (!page)
> @@ -1201,7 +1201,7 @@ alloc:
>  	}
>  	spin_unlock(ptl);
>  out_mn:
> -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
>  					  mmun_end, MMU_MIGRATE);
>  out:
>  	return ret;
> @@ -1637,7 +1637,7 @@ static int __split_huge_page_splitting(struct page *page,
>  	const unsigned long mmun_start = address;
>  	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
>  
> -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
>  					    mmun_end, MMU_STATUS);
>  	pmd = page_check_address_pmd(page, mm, address,
>  			PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
> @@ -1653,7 +1653,7 @@ static int __split_huge_page_splitting(struct page *page,
>  		ret = 1;
>  		spin_unlock(ptl);
>  	}
> -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
>  					  mmun_end, MMU_STATUS);
>  
>  	return ret;
> @@ -2453,7 +2453,7 @@ static void collapse_huge_page(struct mm_struct *mm,
>  
>  	mmun_start = address;
>  	mmun_end   = address + HPAGE_PMD_SIZE;
> -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
>  					    mmun_end, MMU_MIGRATE);
>  	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
>  	/*
> @@ -2464,7 +2464,7 @@ static void collapse_huge_page(struct mm_struct *mm,
>  	 */
>  	_pmd = pmdp_clear_flush(vma, address, pmd);
>  	spin_unlock(pmd_ptl);
> -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
>  					  mmun_end, MMU_MIGRATE);
>  
>  	spin_lock(pte_ptl);
> @@ -2854,19 +2854,19 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
>  	mmun_start = haddr;
>  	mmun_end   = haddr + HPAGE_PMD_SIZE;
>  again:
> -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
>  					    mmun_end, MMU_MIGRATE);
>  	ptl = pmd_lock(mm, pmd);
>  	if (unlikely(!pmd_trans_huge(*pmd))) {
>  		spin_unlock(ptl);
> -		mmu_notifier_invalidate_range_end(mm, mmun_start,
> +		mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
>  						  mmun_end, MMU_MIGRATE);
>  		return;
>  	}
>  	if (is_huge_zero_pmd(*pmd)) {
>  		__split_huge_zero_page_pmd(vma, haddr, pmd);
>  		spin_unlock(ptl);
> -		mmu_notifier_invalidate_range_end(mm, mmun_start,
> +		mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
>  						  mmun_end, MMU_MIGRATE);
>  		return;
>  	}
> @@ -2874,7 +2874,7 @@ again:
>  	VM_BUG_ON_PAGE(!page_count(page), page);
>  	get_page(page);
>  	spin_unlock(ptl);
> -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
>  					  mmun_end, MMU_MIGRATE);
>  
>  	split_huge_page(page);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 73e1576..15f0123 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2565,7 +2565,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  	mmun_start = vma->vm_start;
>  	mmun_end = vma->vm_end;
>  	if (cow)
> -		mmu_notifier_invalidate_range_start(src, mmun_start,
> +		mmu_notifier_invalidate_range_start(src, vma, mmun_start,
>  						    mmun_end, MMU_MIGRATE);
>  
>  	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
> @@ -2616,7 +2616,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  	}
>  
>  	if (cow)
> -		mmu_notifier_invalidate_range_end(src, mmun_start,
> +		mmu_notifier_invalidate_range_end(src, vma, mmun_start,
>  						  mmun_end, MMU_MIGRATE);
>  
>  	return ret;
> @@ -2643,7 +2643,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  	BUG_ON(end & ~huge_page_mask(h));
>  
>  	tlb_start_vma(tlb, vma);
> -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
>  					    mmun_end, MMU_MIGRATE);
>  again:
>  	for (address = start; address < end; address += sz) {
> @@ -2715,7 +2715,7 @@ unlock:
>  		if (address < end && !ref_page)
>  			goto again;
>  	}
> -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
>  					  mmun_end, MMU_MIGRATE);
>  	tlb_end_vma(tlb, vma);
>  }
> @@ -2903,7 +2903,7 @@ retry_avoidcopy:
>  
>  	mmun_start = address & huge_page_mask(h);
>  	mmun_end = mmun_start + huge_page_size(h);
> -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
>  					    mmun_end, MMU_MIGRATE);
>  	/*
>  	 * Retake the page table lock to check for racing updates
> @@ -2924,7 +2924,7 @@ retry_avoidcopy:
>  		new_page = old_page;
>  	}
>  	spin_unlock(ptl);
> -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
>  					  mmun_end, MMU_MIGRATE);
>  	page_cache_release(new_page);
>  	page_cache_release(old_page);
> @@ -3363,7 +3363,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>  	BUG_ON(address >= end);
>  	flush_cache_range(vma, address, end);
>  
> -	mmu_notifier_invalidate_range_start(mm, start, end, event);
> +	mmu_notifier_invalidate_range_start(mm, vma, start, end, event);
>  	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
>  	for (; address < end; address += huge_page_size(h)) {
>  		spinlock_t *ptl;
> @@ -3393,7 +3393,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>  	 */
>  	flush_tlb_range(vma, start, end);
>  	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
> -	mmu_notifier_invalidate_range_end(mm, start, end, event);
> +	mmu_notifier_invalidate_range_end(mm, vma, start, end, event);
>  
>  	return pages << h->order;
>  }
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 4b659f1..1f3c4d7 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -873,7 +873,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
>  
>  	mmun_start = addr;
>  	mmun_end   = addr + PAGE_SIZE;
> -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
>  					    mmun_end, MMU_MPROT_RONLY);
>  
>  	ptep = page_check_address(page, mm, addr, &ptl, 0);
> @@ -914,7 +914,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
>  out_unlock:
>  	pte_unmap_unlock(ptep, ptl);
>  out_mn:
> -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
>  					  mmun_end, MMU_MPROT_RONLY);
>  out:
>  	return err;
> @@ -951,7 +951,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
>  
>  	mmun_start = addr;
>  	mmun_end   = addr + PAGE_SIZE;
> -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
>  					    mmun_end, MMU_MIGRATE);
>  
>  	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
> @@ -977,7 +977,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
>  	pte_unmap_unlock(ptep, ptl);
>  	err = 0;
>  out_mn:
> -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
>  					  mmun_end, MMU_MIGRATE);
>  out:
>  	return err;
> diff --git a/mm/memory.c b/mm/memory.c
> index d3908f0..4717579 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1049,7 +1049,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	mmun_start = addr;
>  	mmun_end   = end;
>  	if (is_cow)
> -		mmu_notifier_invalidate_range_start(src_mm, mmun_start,
> +		mmu_notifier_invalidate_range_start(src_mm, vma, mmun_start,
>  						    mmun_end, MMU_MIGRATE);
>  
>  	ret = 0;
> @@ -1067,8 +1067,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
>  
>  	if (is_cow)
> -		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end,
> -						  MMU_MIGRATE);
> +		mmu_notifier_invalidate_range_end(src_mm, vma, mmun_start,
> +						  mmun_end, MMU_MIGRATE);
>  	return ret;
>  }
>  
> @@ -1372,12 +1372,17 @@ void unmap_vmas(struct mmu_gather *tlb,
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  
> -	mmu_notifier_invalidate_range_start(mm, start_addr,
> -					    end_addr, MMU_MUNMAP);
> -	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
> +	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
> +		mmu_notifier_invalidate_range_start(mm, vma,
> +						    max(start_addr, vma->vm_start),
> +						    min(end_addr, vma->vm_end),
> +						    MMU_MUNMAP);
>  		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
> -	mmu_notifier_invalidate_range_end(mm, start_addr,
> -					  end_addr, MMU_MUNMAP);
> +		mmu_notifier_invalidate_range_end(mm, vma,
> +						  max(start_addr, vma->vm_start),
> +						  min(end_addr, vma->vm_end),
> +						  MMU_MUNMAP);
> +	}
>  }
>  
>  /**
> @@ -1399,10 +1404,17 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
>  	lru_add_drain();
>  	tlb_gather_mmu(&tlb, mm, start, end);
>  	update_hiwater_rss(mm);
> -	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MUNMAP);
> -	for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
> +	for ( ; vma && vma->vm_start < end; vma = vma->vm_next) {
> +		mmu_notifier_invalidate_range_start(mm, vma,
> +						    max(start, vma->vm_start),
> +						    min(end, vma->vm_end),
> +						    MMU_MUNMAP);
>  		unmap_single_vma(&tlb, vma, start, end, details);
> -	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MUNMAP);
> +		mmu_notifier_invalidate_range_end(mm, vma,
> +						  max(start, vma->vm_start),
> +						  min(end, vma->vm_end),
> +						  MMU_MUNMAP);
> +	}
>  	tlb_finish_mmu(&tlb, start, end);
>  }
>  
> @@ -1425,9 +1437,9 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
>  	lru_add_drain();
>  	tlb_gather_mmu(&tlb, mm, address, end);
>  	update_hiwater_rss(mm);
> -	mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
> +	mmu_notifier_invalidate_range_start(mm, vma, address, end, MMU_MUNMAP);
>  	unmap_single_vma(&tlb, vma, address, end, details);
> -	mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
> +	mmu_notifier_invalidate_range_end(mm, vma, address, end, MMU_MUNMAP);
>  	tlb_finish_mmu(&tlb, address, end);
>  }
>  
> @@ -2211,7 +2223,7 @@ gotten:
>  
>  	mmun_start  = address & PAGE_MASK;
>  	mmun_end    = mmun_start + PAGE_SIZE;
> -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
>  					    mmun_end, MMU_MIGRATE);
>  
>  	/*
> @@ -2283,7 +2295,7 @@ gotten:
>  unlock:
>  	pte_unmap_unlock(page_table, ptl);
>  	if (mmun_end > mmun_start)
> -		mmu_notifier_invalidate_range_end(mm, mmun_start,
> +		mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
>  						  mmun_end, MMU_MIGRATE);
>  	if (old_page) {
>  		/*
> diff --git a/mm/migrate.c b/mm/migrate.c
> index b526c72..0c61aa9 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1820,13 +1820,13 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>  	WARN_ON(PageLRU(new_page));
>  
>  	/* Recheck the target PMD */
> -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
>  					    mmun_end, MMU_MIGRATE);
>  	ptl = pmd_lock(mm, pmd);
>  	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
>  fail_putback:
>  		spin_unlock(ptl);
> -		mmu_notifier_invalidate_range_end(mm, mmun_start,
> +		mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
>  						  mmun_end, MMU_MIGRATE);
>  
>  		/* Reverse changes made by migrate_page_copy() */
> @@ -1880,7 +1880,7 @@ fail_putback:
>  	page_remove_rmap(page);
>  
>  	spin_unlock(ptl);
> -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
>  					  mmun_end, MMU_MIGRATE);
>  
>  	/* Take an "isolate" reference and put new page on the LRU. */
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 9decb88..87e6bc5 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -139,6 +139,7 @@ void __mmu_notifier_change_pte(struct mm_struct *mm,
>  }
>  
>  void __mmu_notifier_invalidate_page(struct mm_struct *mm,
> +				    struct vm_area_struct *vma,
>  				    unsigned long address,
>  				    enum mmu_event event)
>  {
> @@ -148,12 +149,13 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
>  	id = srcu_read_lock(&srcu);
>  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
>  		if (mn->ops->invalidate_page)
> -			mn->ops->invalidate_page(mn, mm, address, event);
> +			mn->ops->invalidate_page(mn, mm, vma, address, event);
>  	}
>  	srcu_read_unlock(&srcu, id);
>  }
>  
>  void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> +					   struct vm_area_struct *vma,
>  					   unsigned long start,
>  					   unsigned long end,
>  					   enum mmu_event event)
> @@ -165,7 +167,7 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
>  	id = srcu_read_lock(&srcu);
>  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
>  		if (mn->ops->invalidate_range_start)
> -			mn->ops->invalidate_range_start(mn, mm, start,
> +			mn->ops->invalidate_range_start(mn, mm, vma, start,
>  							end, event);
>  	}
>  	srcu_read_unlock(&srcu, id);
> @@ -173,6 +175,7 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
>  EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
>  
>  void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> +					 struct vm_area_struct *vma,
>  					 unsigned long start,
>  					 unsigned long end,
>  					 enum mmu_event event)
> @@ -183,7 +186,7 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
>  	id = srcu_read_lock(&srcu);
>  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
>  		if (mn->ops->invalidate_range_end)
> -			mn->ops->invalidate_range_end(mn, mm, start,
> +			mn->ops->invalidate_range_end(mn, mm, vma, start,
>  						      end, event);
>  	}
>  	srcu_read_unlock(&srcu, id);
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 6ce6c23..16ce504 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -158,7 +158,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>  		/* invoke the mmu notifier if the pmd is populated */
>  		if (!mni_start) {
>  			mni_start = addr;
> -			mmu_notifier_invalidate_range_start(mm, mni_start,
> +			mmu_notifier_invalidate_range_start(mm, vma, mni_start,
>  							    end, event);
>  		}
>  
> @@ -187,7 +187,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>  	} while (pmd++, addr = next, addr != end);
>  
>  	if (mni_start)
> -		mmu_notifier_invalidate_range_end(mm, mni_start, end, event);
> +		mmu_notifier_invalidate_range_end(mm, vma, mni_start,
> +						  end, event);
>  
>  	if (nr_huge_updates)
>  		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 6827d2f..9bee6de 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -177,7 +177,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>  
>  	mmun_start = old_addr;
>  	mmun_end   = old_end;
> -	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start,
> +	mmu_notifier_invalidate_range_start(vma->vm_mm, vma, mmun_start,
>  					    mmun_end, MMU_MIGRATE);
>  
>  	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
> @@ -229,7 +229,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>  	if (likely(need_flush))
>  		flush_tlb_range(vma, old_end-len, old_addr);
>  
> -	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start,
> +	mmu_notifier_invalidate_range_end(vma->vm_mm, vma, mmun_start,
>  					  mmun_end, MMU_MIGRATE);
>  
>  	return len + old_addr - old_end;	/* how much done */
> diff --git a/mm/rmap.c b/mm/rmap.c
> index bd7e6d7..f1be50d 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -840,7 +840,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>  	pte_unmap_unlock(pte, ptl);
>  
>  	if (ret) {
> -		mmu_notifier_invalidate_page(mm, address, MMU_WB);
> +		mmu_notifier_invalidate_page(mm, vma, address, MMU_WB);
>  		(*cleaned)++;
>  	}
>  out:
> @@ -1237,7 +1237,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  out_unmap:
>  	pte_unmap_unlock(pte, ptl);
>  	if (ret != SWAP_FAIL && !(flags & TTU_MUNLOCK))
> -		mmu_notifier_invalidate_page(mm, address, event);
> +		mmu_notifier_invalidate_page(mm, vma, address, event);
>  out:
>  	return ret;
>  
> @@ -1325,7 +1325,8 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
>  
>  	mmun_start = address;
>  	mmun_end   = end;
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, event);
> +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
> +					    mmun_end, event);
>  
>  	/*
>  	 * If we can acquire the mmap_sem for read, and vma is VM_LOCKED,
> @@ -1390,7 +1391,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
>  		(*mapcount)--;
>  	}
>  	pte_unmap_unlock(pte - 1, ptl);
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, event);
> +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, event);
>  	if (locked_vma)
>  		up_read(&vma->vm_mm->mmap_sem);
>  	return ret;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 6e1992f..c4b7bf9 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -262,6 +262,7 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
>  
>  static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
>  					     struct mm_struct *mm,
> +					     struct vm_area_struct *vma,
>  					     unsigned long address,
>  					     enum mmu_event event)
>  {
> @@ -318,6 +319,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>  
>  static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  						    struct mm_struct *mm,
> +						    struct vm_area_struct *vma,
>  						    unsigned long start,
>  						    unsigned long end,
>  						    enum mmu_event event)
> @@ -345,6 +347,7 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  
>  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
>  						  struct mm_struct *mm,
> +						  struct vm_area_struct *vma,
>  						  unsigned long start,
>  						  unsigned long end,
>  						  enum mmu_event event)
> -- 
> 1.9.0
> 

Other than the refinements suggested above, I can't seem to find anything 
wrong with this patch, so:

Reviewed-by: John Hubbard <jhubbard@nvidia.com>

thanks,
John H.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 4/6] mmu_notifier: pass through vma to invalidate_range and invalidate_page
@ 2014-06-30  3:29     ` John Hubbard
  0 siblings, 0 replies; 76+ messages in thread
From: John Hubbard @ 2014-06-30  3:29 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: akpm, linux-mm, linux-kernel, mgorman, hpa, peterz, aarcange,
	riel, jweiner, torvalds, Mark Hairgrove, Jatin Kumar,
	Subhash Gutti, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Sherry Cheung, Duncan Poole, Oded Gabbay,
	Alexander Deucher, Andrew Lewycky, Jérôme Glisse

[-- Attachment #1: Type: text/plain, Size: 33569 bytes --]

On Fri, 27 Jun 2014, Jérôme Glisse wrote:

> From: Jérôme Glisse <jglisse@redhat.com>
> 
> New users of the mmu_notifier interface need to look up the vma in
> order to perform their invalidation operation. Instead of redoing a
> vma lookup inside the callback, just pass the vma through from the
> call site, where it is already available.
> 
> This needs a small refactoring in memory.c so that invalidate_range is
> called on vma boundaries; the overhead should be low enough.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> ---
>  drivers/gpu/drm/i915/i915_gem_userptr.c |  1 +
>  drivers/iommu/amd_iommu_v2.c            |  3 +++
>  drivers/misc/sgi-gru/grutlbpurge.c      |  6 ++++-
>  drivers/xen/gntdev.c                    |  4 +++-
>  fs/proc/task_mmu.c                      | 16 ++++++++-----
>  include/linux/mmu_notifier.h            | 19 ++++++++++++---
>  kernel/events/uprobes.c                 |  4 ++--
>  mm/filemap_xip.c                        |  3 ++-
>  mm/huge_memory.c                        | 26 ++++++++++----------
>  mm/hugetlb.c                            | 16 ++++++-------
>  mm/ksm.c                                |  8 +++----
>  mm/memory.c                             | 42 +++++++++++++++++++++------------
>  mm/migrate.c                            |  6 ++---
>  mm/mmu_notifier.c                       |  9 ++++---
>  mm/mprotect.c                           |  5 ++--
>  mm/mremap.c                             |  4 ++--
>  mm/rmap.c                               |  9 +++----
>  virt/kvm/kvm_main.c                     |  3 +++
>  18 files changed, 116 insertions(+), 68 deletions(-)
> 

Hi Jerome, considering that you have to change every call site already, it 
seems to me that it would be ideal to just delete the mm argument from all 
of these invalidate_range* callbacks. In other words, replace the mm 
argument with the new vma argument.  I don't see much point in passing 
them both around, and while it would make the patch a *bit* larger, it's 
mostly just an extra line or two per call site:

  mm = vma->vm_mm;
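
Just to make that concrete, the start callback could then look roughly like
this (an untested sketch of the suggested signature, not code from this
series; the function name is made up):

static void example_invalidate_range_start(struct mmu_notifier *mn,
					   struct vm_area_struct *vma,
					   unsigned long start,
					   unsigned long end,
					   enum mmu_event event)
{
	/* The mm is still only one pointer away whenever a driver needs it. */
	struct mm_struct *mm = vma->vm_mm;

	/* ... driver-specific invalidation of [start, end) would go here ... */
	(void)mm;
}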

Also, passing the vma around really does seem like a good approach, but it 
does cause a bunch of additional calls to the invalidate_range* routines,
because we generate a call per vma, instead of just one for the entire mm.
So that brings up a couple questions:

1) Is there any chance that this could cause measurable performance 
regressions?

2) Should you put a little note in the commit message, mentioning this 
point?

> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> index ed6f35e..191ac71 100644
> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> @@ -55,6 +55,7 @@ struct i915_mmu_object {
>  
>  static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>  						       struct mm_struct *mm,
> +						       struct vm_area_struct *vma,
>  						       unsigned long start,
>  						       unsigned long end,
>  						       enum mmu_event event)

That routine has a local variable named vma, so it might be polite to 
rename the local variable, to make it more obvious to the reader that they 
are different. Of course, since the compiler knows which is which, feel 
free to ignore this comment.
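
For example, something along these lines (the parameter name here is
invented, purely to show the idea):

static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
						       struct mm_struct *mm,
						       struct vm_area_struct *inval_vma,
						       unsigned long start,
						       unsigned long end,
						       enum mmu_event event)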

> diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
> index 2bb9771..9f9e706 100644
> --- a/drivers/iommu/amd_iommu_v2.c
> +++ b/drivers/iommu/amd_iommu_v2.c
> @@ -422,6 +422,7 @@ static void mn_change_pte(struct mmu_notifier *mn,
>  
>  static void mn_invalidate_page(struct mmu_notifier *mn,
>  			       struct mm_struct *mm,
> +			       struct vm_area_struct *vma,
>  			       unsigned long address,
>  			       enum mmu_event event)
>  {
> @@ -430,6 +431,7 @@ static void mn_invalidate_page(struct mmu_notifier *mn,
>  
>  static void mn_invalidate_range_start(struct mmu_notifier *mn,
>  				      struct mm_struct *mm,
> +				      struct vm_area_struct *vma,
>  				      unsigned long start,
>  				      unsigned long end,
>  				      enum mmu_event event)
> @@ -453,6 +455,7 @@ static void mn_invalidate_range_start(struct mmu_notifier *mn,
>  
>  static void mn_invalidate_range_end(struct mmu_notifier *mn,
>  				    struct mm_struct *mm,
> +				    struct vm_area_struct *vma,
>  				    unsigned long start,
>  				    unsigned long end,
>  				    enum mmu_event event)
> diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
> index e67fed1..d02e4c7 100644
> --- a/drivers/misc/sgi-gru/grutlbpurge.c
> +++ b/drivers/misc/sgi-gru/grutlbpurge.c
> @@ -221,6 +221,7 @@ void gru_flush_all_tlb(struct gru_state *gru)
>   */
>  static void gru_invalidate_range_start(struct mmu_notifier *mn,
>  				       struct mm_struct *mm,
> +				       struct vm_area_struct *vma,
>  				       unsigned long start, unsigned long end,
>  				       enum mmu_event event)
>  {
> @@ -235,7 +236,9 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
>  }
>  
>  static void gru_invalidate_range_end(struct mmu_notifier *mn,
> -				     struct mm_struct *mm, unsigned long start,
> +				     struct mm_struct *mm,
> +				     struct vm_area_struct *vma,
> +				     unsigned long start,
>  				     unsigned long end,
>  				     enum mmu_event event)
>  {
> @@ -250,6 +253,7 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
>  }
>  
>  static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
> +				struct vm_area_struct *vma,
>  				unsigned long address,
>  				enum mmu_event event)
>  {
> diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
> index fe9da94..219928b 100644
> --- a/drivers/xen/gntdev.c
> +++ b/drivers/xen/gntdev.c
> @@ -428,6 +428,7 @@ static void unmap_if_in_range(struct grant_map *map,
>  
>  static void mn_invl_range_start(struct mmu_notifier *mn,
>  				struct mm_struct *mm,
> +				struct vm_area_struct *vma,
>  				unsigned long start,
>  				unsigned long end,
>  				enum mmu_event event)
> @@ -447,10 +448,11 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
>  
>  static void mn_invl_page(struct mmu_notifier *mn,
>  			 struct mm_struct *mm,
> +			 struct vm_area_struct *vma,
>  			 unsigned long address,
>  			 enum mmu_event event)
>  {
> -	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, event);
> +	mn_invl_range_start(mn, mm, vma, address, address + PAGE_SIZE, event);
>  }
>  
>  static void mn_release(struct mmu_notifier *mn,
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index e9e79f7..8b0f25d 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -829,13 +829,15 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
>  			.private = &cp,
>  		};
>  		down_read(&mm->mmap_sem);
> -		if (type == CLEAR_REFS_SOFT_DIRTY)
> -			mmu_notifier_invalidate_range_start(mm, 0,
> -							    -1, MMU_STATUS);
>  		for (vma = mm->mmap; vma; vma = vma->vm_next) {
>  			cp.vma = vma;
>  			if (is_vm_hugetlb_page(vma))
>  				continue;
> +			if (type == CLEAR_REFS_SOFT_DIRTY)
> +				mmu_notifier_invalidate_range_start(mm, vma,
> +								    vma->vm_start,
> +								    vma->vm_end,
> +								    MMU_STATUS);
>  			/*
>  			 * Writing 1 to /proc/pid/clear_refs affects all pages.
>  			 *
> @@ -857,10 +859,12 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
>  			}
>  			walk_page_range(vma->vm_start, vma->vm_end,
>  					&clear_refs_walk);
> +			if (type == CLEAR_REFS_SOFT_DIRTY)
> +				mmu_notifier_invalidate_range_end(mm, vma,
> +								  vma->vm_start,
> +								  vma->vm_end,
> +								  MMU_STATUS);
>  		}
> -		if (type == CLEAR_REFS_SOFT_DIRTY)
> -			mmu_notifier_invalidate_range_end(mm, 0,
> -							  -1, MMU_STATUS);
>  		flush_tlb_mm(mm);
>  		up_read(&mm->mmap_sem);
>  		mmput(mm);
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 82e9577..8907e5d 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -137,6 +137,7 @@ struct mmu_notifier_ops {
>  	 */
>  	void (*invalidate_page)(struct mmu_notifier *mn,
>  				struct mm_struct *mm,
> +				struct vm_area_struct *vma,
>  				unsigned long address,
>  				enum mmu_event event);
>  
> @@ -185,11 +186,13 @@ struct mmu_notifier_ops {
>  	 */
>  	void (*invalidate_range_start)(struct mmu_notifier *mn,
>  				       struct mm_struct *mm,
> +				       struct vm_area_struct *vma,
>  				       unsigned long start,
>  				       unsigned long end,
>  				       enum mmu_event event);
>  	void (*invalidate_range_end)(struct mmu_notifier *mn,
>  				     struct mm_struct *mm,
> +				     struct vm_area_struct *vma,
>  				     unsigned long start,
>  				     unsigned long end,
>  				     enum mmu_event event);
> @@ -233,13 +236,16 @@ extern void __mmu_notifier_change_pte(struct mm_struct *mm,
>  				      pte_t pte,
>  				      enum mmu_event event);
>  extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
> +					   struct vm_area_struct *vma,
>  					  unsigned long address,
>  					  enum mmu_event event);
>  extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> +						  struct vm_area_struct *vma,
>  						  unsigned long start,
>  						  unsigned long end,
>  						  enum mmu_event event);
>  extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> +						struct vm_area_struct *vma,
>  						unsigned long start,
>  						unsigned long end,
>  						enum mmu_event event);
> @@ -276,29 +282,33 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm,
>  }
>  
>  static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
> +						struct vm_area_struct *vma,
>  						unsigned long address,
>  						enum mmu_event event)
>  {
>  	if (mm_has_notifiers(mm))
> -		__mmu_notifier_invalidate_page(mm, address, event);
> +		__mmu_notifier_invalidate_page(mm, vma, address, event);
>  }
>  
>  static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> +						       struct vm_area_struct *vma,
>  						       unsigned long start,
>  						       unsigned long end,
>  						       enum mmu_event event)
>  {
>  	if (mm_has_notifiers(mm))
> -		__mmu_notifier_invalidate_range_start(mm, start, end, event);
> +		__mmu_notifier_invalidate_range_start(mm, vma, start,
> +						      end, event);
>  }
>  
>  static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> +						     struct vm_area_struct *vma,
>  						     unsigned long start,
>  						     unsigned long end,
>  						     enum mmu_event event)
>  {
>  	if (mm_has_notifiers(mm))
> -		__mmu_notifier_invalidate_range_end(mm, start, end, event);
> +		__mmu_notifier_invalidate_range_end(mm, vma, start, end, event);
>  }
>  
>  static inline void mmu_notifier_mm_init(struct mm_struct *mm)
> @@ -380,12 +390,14 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm,
>  }
>  
>  static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
> +						struct vm_area_struct *vma,
>  						unsigned long address,
>  						enum mmu_event event)
>  {
>  }
>  
>  static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> +						       struct vm_area_struct *vma,
>  						       unsigned long start,
>  						       unsigned long end,
>  						       enum mmu_event event)
> @@ -393,6 +405,7 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
>  }
>  
>  static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> +						     struct vm_area_struct *vma,
>  						     unsigned long start,
>  						     unsigned long end,
>  						     enum mmu_event event)
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 296f81e..0f552bc 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -177,7 +177,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
>  	/* For try_to_free_swap() and munlock_vma_page() below */
>  	lock_page(page);
>  
> -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
>  					    mmun_end, MMU_MIGRATE);
>  	err = -EAGAIN;
>  	ptep = page_check_address(page, mm, addr, &ptl, 0);
> @@ -212,7 +212,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
>  	err = 0;
>   unlock:
>  	mem_cgroup_cancel_charge(kpage, memcg);
> -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
>  					  mmun_end, MMU_MIGRATE);
>  	unlock_page(page);
>  	return err;
> diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
> index a2b3f09..f0113df 100644
> --- a/mm/filemap_xip.c
> +++ b/mm/filemap_xip.c
> @@ -198,7 +198,8 @@ retry:
>  			BUG_ON(pte_dirty(pteval));
>  			pte_unmap_unlock(pte, ptl);
>  			/* must invalidate_page _before_ freeing the page */
> -			mmu_notifier_invalidate_page(mm, address, MMU_MIGRATE);
> +			mmu_notifier_invalidate_page(mm, vma, address,
> +						     MMU_MIGRATE);
>  			page_cache_release(page);
>  		}
>  	}
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index fa30857..cc74b60 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1022,7 +1022,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
>  
>  	mmun_start = haddr;
>  	mmun_end   = haddr + HPAGE_PMD_SIZE;
> -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
>  					    mmun_end, MMU_MIGRATE);
>  
>  	for (i = 0; i < HPAGE_PMD_NR; i++) {
> @@ -1064,7 +1064,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
>  	page_remove_rmap(page);
>  	spin_unlock(ptl);
>  
> -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
>  					  mmun_end, MMU_MIGRATE);
>  
>  	ret |= VM_FAULT_WRITE;
> @@ -1075,7 +1075,7 @@ out:
>  
>  out_free_pages:
>  	spin_unlock(ptl);
> -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
>  					  mmun_end, MMU_MIGRATE);
>  	for (i = 0; i < HPAGE_PMD_NR; i++) {
>  		memcg = (void *)page_private(pages[i]);
> @@ -1162,7 +1162,7 @@ alloc:
>  
>  	mmun_start = haddr;
>  	mmun_end   = haddr + HPAGE_PMD_SIZE;
> -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
>  					    mmun_end, MMU_MIGRATE);
>  
>  	if (!page)
> @@ -1201,7 +1201,7 @@ alloc:
>  	}
>  	spin_unlock(ptl);
>  out_mn:
> -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
>  					  mmun_end, MMU_MIGRATE);
>  out:
>  	return ret;
> @@ -1637,7 +1637,7 @@ static int __split_huge_page_splitting(struct page *page,
>  	const unsigned long mmun_start = address;
>  	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
>  
> -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
>  					    mmun_end, MMU_STATUS);
>  	pmd = page_check_address_pmd(page, mm, address,
>  			PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
> @@ -1653,7 +1653,7 @@ static int __split_huge_page_splitting(struct page *page,
>  		ret = 1;
>  		spin_unlock(ptl);
>  	}
> -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
>  					  mmun_end, MMU_STATUS);
>  
>  	return ret;
> @@ -2453,7 +2453,7 @@ static void collapse_huge_page(struct mm_struct *mm,
>  
>  	mmun_start = address;
>  	mmun_end   = address + HPAGE_PMD_SIZE;
> -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
>  					    mmun_end, MMU_MIGRATE);
>  	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
>  	/*
> @@ -2464,7 +2464,7 @@ static void collapse_huge_page(struct mm_struct *mm,
>  	 */
>  	_pmd = pmdp_clear_flush(vma, address, pmd);
>  	spin_unlock(pmd_ptl);
> -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
>  					  mmun_end, MMU_MIGRATE);
>  
>  	spin_lock(pte_ptl);
> @@ -2854,19 +2854,19 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
>  	mmun_start = haddr;
>  	mmun_end   = haddr + HPAGE_PMD_SIZE;
>  again:
> -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
>  					    mmun_end, MMU_MIGRATE);
>  	ptl = pmd_lock(mm, pmd);
>  	if (unlikely(!pmd_trans_huge(*pmd))) {
>  		spin_unlock(ptl);
> -		mmu_notifier_invalidate_range_end(mm, mmun_start,
> +		mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
>  						  mmun_end, MMU_MIGRATE);
>  		return;
>  	}
>  	if (is_huge_zero_pmd(*pmd)) {
>  		__split_huge_zero_page_pmd(vma, haddr, pmd);
>  		spin_unlock(ptl);
> -		mmu_notifier_invalidate_range_end(mm, mmun_start,
> +		mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
>  						  mmun_end, MMU_MIGRATE);
>  		return;
>  	}
> @@ -2874,7 +2874,7 @@ again:
>  	VM_BUG_ON_PAGE(!page_count(page), page);
>  	get_page(page);
>  	spin_unlock(ptl);
> -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
>  					  mmun_end, MMU_MIGRATE);
>  
>  	split_huge_page(page);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 73e1576..15f0123 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2565,7 +2565,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  	mmun_start = vma->vm_start;
>  	mmun_end = vma->vm_end;
>  	if (cow)
> -		mmu_notifier_invalidate_range_start(src, mmun_start,
> +		mmu_notifier_invalidate_range_start(src, vma, mmun_start,
>  						    mmun_end, MMU_MIGRATE);
>  
>  	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
> @@ -2616,7 +2616,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  	}
>  
>  	if (cow)
> -		mmu_notifier_invalidate_range_end(src, mmun_start,
> +		mmu_notifier_invalidate_range_end(src, vma, mmun_start,
>  						  mmun_end, MMU_MIGRATE);
>  
>  	return ret;
> @@ -2643,7 +2643,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  	BUG_ON(end & ~huge_page_mask(h));
>  
>  	tlb_start_vma(tlb, vma);
> -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
>  					    mmun_end, MMU_MIGRATE);
>  again:
>  	for (address = start; address < end; address += sz) {
> @@ -2715,7 +2715,7 @@ unlock:
>  		if (address < end && !ref_page)
>  			goto again;
>  	}
> -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
>  					  mmun_end, MMU_MIGRATE);
>  	tlb_end_vma(tlb, vma);
>  }
> @@ -2903,7 +2903,7 @@ retry_avoidcopy:
>  
>  	mmun_start = address & huge_page_mask(h);
>  	mmun_end = mmun_start + huge_page_size(h);
> -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
>  					    mmun_end, MMU_MIGRATE);
>  	/*
>  	 * Retake the page table lock to check for racing updates
> @@ -2924,7 +2924,7 @@ retry_avoidcopy:
>  		new_page = old_page;
>  	}
>  	spin_unlock(ptl);
> -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
>  					  mmun_end, MMU_MIGRATE);
>  	page_cache_release(new_page);
>  	page_cache_release(old_page);
> @@ -3363,7 +3363,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>  	BUG_ON(address >= end);
>  	flush_cache_range(vma, address, end);
>  
> -	mmu_notifier_invalidate_range_start(mm, start, end, event);
> +	mmu_notifier_invalidate_range_start(mm, vma, start, end, event);
>  	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
>  	for (; address < end; address += huge_page_size(h)) {
>  		spinlock_t *ptl;
> @@ -3393,7 +3393,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>  	 */
>  	flush_tlb_range(vma, start, end);
>  	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
> -	mmu_notifier_invalidate_range_end(mm, start, end, event);
> +	mmu_notifier_invalidate_range_end(mm, vma, start, end, event);
>  
>  	return pages << h->order;
>  }
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 4b659f1..1f3c4d7 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -873,7 +873,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
>  
>  	mmun_start = addr;
>  	mmun_end   = addr + PAGE_SIZE;
> -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
>  					    mmun_end, MMU_MPROT_RONLY);
>  
>  	ptep = page_check_address(page, mm, addr, &ptl, 0);
> @@ -914,7 +914,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
>  out_unlock:
>  	pte_unmap_unlock(ptep, ptl);
>  out_mn:
> -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
>  					  mmun_end, MMU_MPROT_RONLY);
>  out:
>  	return err;
> @@ -951,7 +951,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
>  
>  	mmun_start = addr;
>  	mmun_end   = addr + PAGE_SIZE;
> -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
>  					    mmun_end, MMU_MIGRATE);
>  
>  	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
> @@ -977,7 +977,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
>  	pte_unmap_unlock(ptep, ptl);
>  	err = 0;
>  out_mn:
> -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
>  					  mmun_end, MMU_MIGRATE);
>  out:
>  	return err;
> diff --git a/mm/memory.c b/mm/memory.c
> index d3908f0..4717579 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1049,7 +1049,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	mmun_start = addr;
>  	mmun_end   = end;
>  	if (is_cow)
> -		mmu_notifier_invalidate_range_start(src_mm, mmun_start,
> +		mmu_notifier_invalidate_range_start(src_mm, vma, mmun_start,
>  						    mmun_end, MMU_MIGRATE);
>  
>  	ret = 0;
> @@ -1067,8 +1067,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
>  
>  	if (is_cow)
> -		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end,
> -						  MMU_MIGRATE);
> +		mmu_notifier_invalidate_range_end(src_mm, vma, mmun_start,
> +						  mmun_end, MMU_MIGRATE);
>  	return ret;
>  }
>  
> @@ -1372,12 +1372,17 @@ void unmap_vmas(struct mmu_gather *tlb,
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  
> -	mmu_notifier_invalidate_range_start(mm, start_addr,
> -					    end_addr, MMU_MUNMAP);
> -	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
> +	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
> +		mmu_notifier_invalidate_range_start(mm, vma,
> +						    max(start_addr, vma->vm_start),
> +						    min(end_addr, vma->vm_end),
> +						    MMU_MUNMAP);
>  		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
> -	mmu_notifier_invalidate_range_end(mm, start_addr,
> -					  end_addr, MMU_MUNMAP);
> +		mmu_notifier_invalidate_range_end(mm, vma,
> +						  max(start_addr, vma->vm_start),
> +						  min(end_addr, vma->vm_end),
> +						  MMU_MUNMAP);
> +	}
>  }
>  
>  /**
> @@ -1399,10 +1404,17 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
>  	lru_add_drain();
>  	tlb_gather_mmu(&tlb, mm, start, end);
>  	update_hiwater_rss(mm);
> -	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MUNMAP);
> -	for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
> +	for ( ; vma && vma->vm_start < end; vma = vma->vm_next) {
> +		mmu_notifier_invalidate_range_start(mm, vma,
> +						    max(start, vma->vm_start),
> +						    min(end, vma->vm_end),
> +						    MMU_MUNMAP);
>  		unmap_single_vma(&tlb, vma, start, end, details);
> -	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MUNMAP);
> +		mmu_notifier_invalidate_range_end(mm, vma,
> +						  max(start, vma->vm_start),
> +						  min(end, vma->vm_end),
> +						  MMU_MUNMAP);
> +	}
>  	tlb_finish_mmu(&tlb, start, end);
>  }
>  
> @@ -1425,9 +1437,9 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
>  	lru_add_drain();
>  	tlb_gather_mmu(&tlb, mm, address, end);
>  	update_hiwater_rss(mm);
> -	mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
> +	mmu_notifier_invalidate_range_start(mm, vma, address, end, MMU_MUNMAP);
>  	unmap_single_vma(&tlb, vma, address, end, details);
> -	mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
> +	mmu_notifier_invalidate_range_end(mm, vma, address, end, MMU_MUNMAP);
>  	tlb_finish_mmu(&tlb, address, end);
>  }
>  
> @@ -2211,7 +2223,7 @@ gotten:
>  
>  	mmun_start  = address & PAGE_MASK;
>  	mmun_end    = mmun_start + PAGE_SIZE;
> -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
>  					    mmun_end, MMU_MIGRATE);
>  
>  	/*
> @@ -2283,7 +2295,7 @@ gotten:
>  unlock:
>  	pte_unmap_unlock(page_table, ptl);
>  	if (mmun_end > mmun_start)
> -		mmu_notifier_invalidate_range_end(mm, mmun_start,
> +		mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
>  						  mmun_end, MMU_MIGRATE);
>  	if (old_page) {
>  		/*
> diff --git a/mm/migrate.c b/mm/migrate.c
> index b526c72..0c61aa9 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1820,13 +1820,13 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>  	WARN_ON(PageLRU(new_page));
>  
>  	/* Recheck the target PMD */
> -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
>  					    mmun_end, MMU_MIGRATE);
>  	ptl = pmd_lock(mm, pmd);
>  	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
>  fail_putback:
>  		spin_unlock(ptl);
> -		mmu_notifier_invalidate_range_end(mm, mmun_start,
> +		mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
>  						  mmun_end, MMU_MIGRATE);
>  
>  		/* Reverse changes made by migrate_page_copy() */
> @@ -1880,7 +1880,7 @@ fail_putback:
>  	page_remove_rmap(page);
>  
>  	spin_unlock(ptl);
> -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
>  					  mmun_end, MMU_MIGRATE);
>  
>  	/* Take an "isolate" reference and put new page on the LRU. */
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 9decb88..87e6bc5 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -139,6 +139,7 @@ void __mmu_notifier_change_pte(struct mm_struct *mm,
>  }
>  
>  void __mmu_notifier_invalidate_page(struct mm_struct *mm,
> +				    struct vm_area_struct *vma,
>  				    unsigned long address,
>  				    enum mmu_event event)
>  {
> @@ -148,12 +149,13 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
>  	id = srcu_read_lock(&srcu);
>  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
>  		if (mn->ops->invalidate_page)
> -			mn->ops->invalidate_page(mn, mm, address, event);
> +			mn->ops->invalidate_page(mn, mm, vma, address, event);
>  	}
>  	srcu_read_unlock(&srcu, id);
>  }
>  
>  void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> +					   struct vm_area_struct *vma,
>  					   unsigned long start,
>  					   unsigned long end,
>  					   enum mmu_event event)
> @@ -165,7 +167,7 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
>  	id = srcu_read_lock(&srcu);
>  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
>  		if (mn->ops->invalidate_range_start)
> -			mn->ops->invalidate_range_start(mn, mm, start,
> +			mn->ops->invalidate_range_start(mn, mm, vma, start,
>  							end, event);
>  	}
>  	srcu_read_unlock(&srcu, id);
> @@ -173,6 +175,7 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
>  EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
>  
>  void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> +					 struct vm_area_struct *vma,
>  					 unsigned long start,
>  					 unsigned long end,
>  					 enum mmu_event event)
> @@ -183,7 +186,7 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
>  	id = srcu_read_lock(&srcu);
>  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
>  		if (mn->ops->invalidate_range_end)
> -			mn->ops->invalidate_range_end(mn, mm, start,
> +			mn->ops->invalidate_range_end(mn, mm, vma, start,
>  						      end, event);
>  	}
>  	srcu_read_unlock(&srcu, id);
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 6ce6c23..16ce504 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -158,7 +158,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>  		/* invoke the mmu notifier if the pmd is populated */
>  		if (!mni_start) {
>  			mni_start = addr;
> -			mmu_notifier_invalidate_range_start(mm, mni_start,
> +			mmu_notifier_invalidate_range_start(mm, vma, mni_start,
>  							    end, event);
>  		}
>  
> @@ -187,7 +187,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>  	} while (pmd++, addr = next, addr != end);
>  
>  	if (mni_start)
> -		mmu_notifier_invalidate_range_end(mm, mni_start, end, event);
> +		mmu_notifier_invalidate_range_end(mm, vma, mni_start,
> +						  end, event);
>  
>  	if (nr_huge_updates)
>  		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 6827d2f..9bee6de 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -177,7 +177,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>  
>  	mmun_start = old_addr;
>  	mmun_end   = old_end;
> -	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start,
> +	mmu_notifier_invalidate_range_start(vma->vm_mm, vma, mmun_start,
>  					    mmun_end, MMU_MIGRATE);
>  
>  	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
> @@ -229,7 +229,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>  	if (likely(need_flush))
>  		flush_tlb_range(vma, old_end-len, old_addr);
>  
> -	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start,
> +	mmu_notifier_invalidate_range_end(vma->vm_mm, vma, mmun_start,
>  					  mmun_end, MMU_MIGRATE);
>  
>  	return len + old_addr - old_end;	/* how much done */
> diff --git a/mm/rmap.c b/mm/rmap.c
> index bd7e6d7..f1be50d 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -840,7 +840,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>  	pte_unmap_unlock(pte, ptl);
>  
>  	if (ret) {
> -		mmu_notifier_invalidate_page(mm, address, MMU_WB);
> +		mmu_notifier_invalidate_page(mm, vma, address, MMU_WB);
>  		(*cleaned)++;
>  	}
>  out:
> @@ -1237,7 +1237,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  out_unmap:
>  	pte_unmap_unlock(pte, ptl);
>  	if (ret != SWAP_FAIL && !(flags & TTU_MUNLOCK))
> -		mmu_notifier_invalidate_page(mm, address, event);
> +		mmu_notifier_invalidate_page(mm, vma, address, event);
>  out:
>  	return ret;
>  
> @@ -1325,7 +1325,8 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
>  
>  	mmun_start = address;
>  	mmun_end   = end;
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, event);
> +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
> +					    mmun_end, event);
>  
>  	/*
>  	 * If we can acquire the mmap_sem for read, and vma is VM_LOCKED,
> @@ -1390,7 +1391,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
>  		(*mapcount)--;
>  	}
>  	pte_unmap_unlock(pte - 1, ptl);
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, event);
> +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, event);
>  	if (locked_vma)
>  		up_read(&vma->vm_mm->mmap_sem);
>  	return ret;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 6e1992f..c4b7bf9 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -262,6 +262,7 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
>  
>  static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
>  					     struct mm_struct *mm,
> +					     struct vm_area_struct *vma,
>  					     unsigned long address,
>  					     enum mmu_event event)
>  {
> @@ -318,6 +319,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>  
>  static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  						    struct mm_struct *mm,
> +						    struct vm_area_struct *vma,
>  						    unsigned long start,
>  						    unsigned long end,
>  						    enum mmu_event event)
> @@ -345,6 +347,7 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  
>  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
>  						  struct mm_struct *mm,
> +						  struct vm_area_struct *vma,
>  						  unsigned long start,
>  						  unsigned long end,
>  						  enum mmu_event event)
> -- 
> 1.9.0
> 

Other than the refinements suggested above, I can't seem to find anything 
wrong with this patch, so:

Reviewed-by: John Hubbard <jhubbard@nvidia.com>

thanks,
John H.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
  2014-06-28  2:00   ` Jérôme Glisse
@ 2014-06-30  3:49     ` John Hubbard
  -1 siblings, 0 replies; 76+ messages in thread
From: John Hubbard @ 2014-06-30  3:49 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: akpm, linux-mm, linux-kernel, mgorman, hpa, peterz, aarcange,
	riel, jweiner, torvalds, Mark Hairgrove, Jatin Kumar,
	Subhash Gutti, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Sherry Cheung, Duncan Poole, Oded Gabbay,
	Alexander Deucher, Andrew Lewycky, Jérôme Glisse

[-- Attachment #1: Type: text/plain, Size: 14760 bytes --]

On Fri, 27 Jun 2014, Jérôme Glisse wrote:

> From: Jérôme Glisse <jglisse@redhat.com>
> 
> Several subsystems require a callback when an mm struct is being destroyed
> so that they can clean up their respective per-mm state. Instead of
> having each subsystem add its callback to mmput, use a notifier chain
> to call each of the subsystems.
> 
> This will allow new subsystems to register callbacks even if they are
> modules. There should be no contention on the rw semaphore protecting
> the call chain, and the impact on the code path should be low and
> buried in the noise.
> 
> Note that this patch also moves the call to the cleanup functions after
> exit_mmap so that new callbacks can assume that mmu_notifier_release
> has already been called. This does not impact existing cleanup functions
> as they do not rely on anything that exit_mmap is freeing. Also moved
> khugepaged_exit to exit_mmap so that ordering is preserved for that
> function.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> ---
>  fs/aio.c                | 29 ++++++++++++++++++++++-------
>  include/linux/aio.h     |  2 --
>  include/linux/ksm.h     | 11 -----------
>  include/linux/sched.h   |  5 +++++
>  include/linux/uprobes.h |  1 -
>  kernel/events/uprobes.c | 19 ++++++++++++++++---
>  kernel/fork.c           | 22 ++++++++++++++++++----
>  mm/ksm.c                | 26 +++++++++++++++++++++-----
>  mm/mmap.c               |  3 +++
>  9 files changed, 85 insertions(+), 33 deletions(-)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index c1d8c48..1d06e92 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -40,6 +40,7 @@
>  #include <linux/ramfs.h>
>  #include <linux/percpu-refcount.h>
>  #include <linux/mount.h>
> +#include <linux/notifier.h>
>  
>  #include <asm/kmap_types.h>
>  #include <asm/uaccess.h>
> @@ -774,20 +775,22 @@ ssize_t wait_on_sync_kiocb(struct kiocb *req)
>  EXPORT_SYMBOL(wait_on_sync_kiocb);
>  
>  /*
> - * exit_aio: called when the last user of mm goes away.  At this point, there is
> + * aio_exit: called when the last user of mm goes away.  At this point, there is
>   * no way for any new requests to be submited or any of the io_* syscalls to be
>   * called on the context.
>   *
>   * There may be outstanding kiocbs, but free_ioctx() will explicitly wait on
>   * them.
>   */
> -void exit_aio(struct mm_struct *mm)
> +static int aio_exit(struct notifier_block *nb,
> +		    unsigned long action, void *data)
>  {
> +	struct mm_struct *mm = data;
>  	struct kioctx_table *table = rcu_dereference_raw(mm->ioctx_table);
>  	int i;
>  
>  	if (!table)
> -		return;
> +		return 0;
>  
>  	for (i = 0; i < table->nr; ++i) {
>  		struct kioctx *ctx = table->table[i];
> @@ -796,10 +799,10 @@ void exit_aio(struct mm_struct *mm)
>  			continue;
>  		/*
>  		 * We don't need to bother with munmap() here - exit_mmap(mm)
> -		 * is coming and it'll unmap everything. And we simply can't,
> -		 * this is not necessarily our ->mm.
> -		 * Since kill_ioctx() uses non-zero ->mmap_size as indicator
> -		 * that it needs to unmap the area, just set it to 0.
> +		 * have already been call and everything is unmap by now. But
> +		 * to be safe set ->mmap_size to 0 since aio_free_ring() uses
> +		 * non-zero ->mmap_size as indicator that it needs to unmap the
> +		 * area.
>  		 */

Actually, I think the original part of the comment about kill_ioctx
was accurate, but the new reference to aio_free_ring looks like a typo 
(?).  I'd write the entire comment as follows (I've dropped the leading 
whitespace, for email):

    /*
     * We don't need to bother with munmap() here - exit_mmap(mm)
     * has already been called and everything is unmapped by now.
     * But to be safe, set ->mmap_size to 0 since kill_ioctx() uses a
     * non-zero ->mmap_size as an indicator that it needs to unmap the
     * area.
     */


>  		ctx->mmap_size = 0;
>  		kill_ioctx(mm, ctx, NULL);
> @@ -807,6 +810,7 @@ void exit_aio(struct mm_struct *mm)
>  
>  	RCU_INIT_POINTER(mm->ioctx_table, NULL);
>  	kfree(table);
> +	return 0;
>  }
>  
>  static void put_reqs_available(struct kioctx *ctx, unsigned nr)
> @@ -1629,3 +1633,14 @@ SYSCALL_DEFINE5(io_getevents, aio_context_t, ctx_id,
>  	}
>  	return ret;
>  }
> +
> +static struct notifier_block aio_mmput_nb = {
> +	.notifier_call		= aio_exit,
> +	.priority		= 1,
> +};
> +
> +static int __init aio_init(void)
> +{
> +	return mmput_register_notifier(&aio_mmput_nb);
> +}
> +subsys_initcall(aio_init);
> diff --git a/include/linux/aio.h b/include/linux/aio.h
> index d9c92da..6308fac 100644
> --- a/include/linux/aio.h
> +++ b/include/linux/aio.h
> @@ -73,7 +73,6 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
>  extern ssize_t wait_on_sync_kiocb(struct kiocb *iocb);
>  extern void aio_complete(struct kiocb *iocb, long res, long res2);
>  struct mm_struct;
> -extern void exit_aio(struct mm_struct *mm);
>  extern long do_io_submit(aio_context_t ctx_id, long nr,
>  			 struct iocb __user *__user *iocbpp, bool compat);
>  void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel);
> @@ -81,7 +80,6 @@ void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel);
>  static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; }
>  static inline void aio_complete(struct kiocb *iocb, long res, long res2) { }
>  struct mm_struct;
> -static inline void exit_aio(struct mm_struct *mm) { }
>  static inline long do_io_submit(aio_context_t ctx_id, long nr,
>  				struct iocb __user * __user *iocbpp,
>  				bool compat) { return 0; }
> diff --git a/include/linux/ksm.h b/include/linux/ksm.h
> index 3be6bb1..84c184f 100644
> --- a/include/linux/ksm.h
> +++ b/include/linux/ksm.h
> @@ -20,7 +20,6 @@ struct mem_cgroup;
>  int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
>  		unsigned long end, int advice, unsigned long *vm_flags);
>  int __ksm_enter(struct mm_struct *mm);
> -void __ksm_exit(struct mm_struct *mm);
>  
>  static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
>  {
> @@ -29,12 +28,6 @@ static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
>  	return 0;
>  }
>  
> -static inline void ksm_exit(struct mm_struct *mm)
> -{
> -	if (test_bit(MMF_VM_MERGEABLE, &mm->flags))
> -		__ksm_exit(mm);
> -}
> -
>  /*
>   * A KSM page is one of those write-protected "shared pages" or "merged pages"
>   * which KSM maps into multiple mms, wherever identical anonymous page content
> @@ -83,10 +76,6 @@ static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
>  	return 0;
>  }
>  
> -static inline void ksm_exit(struct mm_struct *mm)
> -{
> -}
> -
>  static inline int PageKsm(struct page *page)
>  {
>  	return 0;
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 322d4fc..428b3cf 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2384,6 +2384,11 @@ static inline void mmdrop(struct mm_struct * mm)
>  		__mmdrop(mm);
>  }
>  
> +/* mmput call list of notifier and subsystem/module can register
> + * new one through this call.
> + */
> +extern int mmput_register_notifier(struct notifier_block *nb);
> +extern int mmput_unregister_notifier(struct notifier_block *nb);
>  /* mmput gets rid of the mappings and all user-space */
>  extern void mmput(struct mm_struct *);
>  /* Grab a reference to a task's mm, if it is not already going away */
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index 4f844c6..44e7267 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -120,7 +120,6 @@ extern int uprobe_pre_sstep_notifier(struct pt_regs *regs);
>  extern void uprobe_notify_resume(struct pt_regs *regs);
>  extern bool uprobe_deny_signal(void);
>  extern bool arch_uprobe_skip_sstep(struct arch_uprobe *aup, struct pt_regs *regs);
> -extern void uprobe_clear_state(struct mm_struct *mm);
>  extern int  arch_uprobe_analyze_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long addr);
>  extern int  arch_uprobe_pre_xol(struct arch_uprobe *aup, struct pt_regs *regs);
>  extern int  arch_uprobe_post_xol(struct arch_uprobe *aup, struct pt_regs *regs);
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 46b7c31..32b04dc 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -37,6 +37,7 @@
>  #include <linux/percpu-rwsem.h>
>  #include <linux/task_work.h>
>  #include <linux/shmem_fs.h>
> +#include <linux/notifier.h>
>  
>  #include <linux/uprobes.h>
>  
> @@ -1220,16 +1221,19 @@ static struct xol_area *get_xol_area(void)
>  /*
>   * uprobe_clear_state - Free the area allocated for slots.
>   */
> -void uprobe_clear_state(struct mm_struct *mm)
> +static int uprobe_clear_state(struct notifier_block *nb,
> +			      unsigned long action, void *data)
>  {
> +	struct mm_struct *mm = data;
>  	struct xol_area *area = mm->uprobes_state.xol_area;
>  
>  	if (!area)
> -		return;
> +		return 0;
>  
>  	put_page(area->page);
>  	kfree(area->bitmap);
>  	kfree(area);
> +	return 0;
>  }
>  
>  void uprobe_start_dup_mmap(void)
> @@ -1979,9 +1983,14 @@ static struct notifier_block uprobe_exception_nb = {
>  	.priority		= INT_MAX-1,	/* notified after kprobes, kgdb */
>  };
>  
> +static struct notifier_block uprobe_mmput_nb = {
> +	.notifier_call		= uprobe_clear_state,
> +	.priority		= 0,
> +};
> +
>  static int __init init_uprobes(void)
>  {
> -	int i;
> +	int i, err;
>  
>  	for (i = 0; i < UPROBES_HASH_SZ; i++)
>  		mutex_init(&uprobes_mmap_mutex[i]);
> @@ -1989,6 +1998,10 @@ static int __init init_uprobes(void)
>  	if (percpu_init_rwsem(&dup_mmap_sem))
>  		return -ENOMEM;
>  
> +	err = mmput_register_notifier(&uprobe_mmput_nb);
> +	if (err)
> +		return err;
> +
>  	return register_die_notifier(&uprobe_exception_nb);
>  }
>  __initcall(init_uprobes);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index dd8864f..b448509 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -87,6 +87,8 @@
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/task.h>
>  
> +static BLOCKING_NOTIFIER_HEAD(mmput_notifier);
> +
>  /*
>   * Protected counters by write_lock_irq(&tasklist_lock)
>   */
> @@ -623,6 +625,21 @@ void __mmdrop(struct mm_struct *mm)
>  EXPORT_SYMBOL_GPL(__mmdrop);
>  
>  /*
> + * Register a notifier that will be call by mmput
> + */
> +int mmput_register_notifier(struct notifier_block *nb)
> +{
> +	return blocking_notifier_chain_register(&mmput_notifier, nb);
> +}
> +EXPORT_SYMBOL_GPL(mmput_register_notifier);
> +
> +int mmput_unregister_notifier(struct notifier_block *nb)
> +{
> +	return blocking_notifier_chain_unregister(&mmput_notifier, nb);
> +}
> +EXPORT_SYMBOL_GPL(mmput_unregister_notifier);
> +
> +/*
>   * Decrement the use count and release all resources for an mm.
>   */
>  void mmput(struct mm_struct *mm)
> @@ -630,11 +647,8 @@ void mmput(struct mm_struct *mm)
>  	might_sleep();
>  
>  	if (atomic_dec_and_test(&mm->mm_users)) {
> -		uprobe_clear_state(mm);
> -		exit_aio(mm);
> -		ksm_exit(mm);
> -		khugepaged_exit(mm); /* must run before exit_mmap */
>  		exit_mmap(mm);
> +		blocking_notifier_call_chain(&mmput_notifier, 0, mm);
>  		set_mm_exe_file(mm, NULL);
>  		if (!list_empty(&mm->mmlist)) {
>  			spin_lock(&mmlist_lock);
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 346ddc9..cb1e976 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -37,6 +37,7 @@
>  #include <linux/freezer.h>
>  #include <linux/oom.h>
>  #include <linux/numa.h>
> +#include <linux/notifier.h>
>  
>  #include <asm/tlbflush.h>
>  #include "internal.h"
> @@ -1586,7 +1587,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
>  		ksm_scan.mm_slot = slot;
>  		spin_unlock(&ksm_mmlist_lock);
>  		/*
> -		 * Although we tested list_empty() above, a racing __ksm_exit
> +		 * Although we tested list_empty() above, a racing ksm_exit
>  		 * of the last mm on the list may have removed it since then.
>  		 */
>  		if (slot == &ksm_mm_head)
> @@ -1658,9 +1659,9 @@ next_mm:
>  		/*
>  		 * We've completed a full scan of all vmas, holding mmap_sem
>  		 * throughout, and found no VM_MERGEABLE: so do the same as
> -		 * __ksm_exit does to remove this mm from all our lists now.
> -		 * This applies either when cleaning up after __ksm_exit
> -		 * (but beware: we can reach here even before __ksm_exit),
> +		 * ksm_exit does to remove this mm from all our lists now.
> +		 * This applies either when cleaning up after ksm_exit
> +		 * (but beware: we can reach here even before ksm_exit),
>  		 * or when all VM_MERGEABLE areas have been unmapped (and
>  		 * mmap_sem then protects against race with MADV_MERGEABLE).
>  		 */
> @@ -1821,11 +1822,16 @@ int __ksm_enter(struct mm_struct *mm)
>  	return 0;
>  }
>  
> -void __ksm_exit(struct mm_struct *mm)
> +static int ksm_exit(struct notifier_block *nb,
> +		    unsigned long action, void *data)
>  {
> +	struct mm_struct *mm = data;
>  	struct mm_slot *mm_slot;
>  	int easy_to_free = 0;
>  
> +	if (!test_bit(MMF_VM_MERGEABLE, &mm->flags))
> +		return 0;
> +
>  	/*
>  	 * This process is exiting: if it's straightforward (as is the
>  	 * case when ksmd was never running), free mm_slot immediately.
> @@ -1857,6 +1863,7 @@ void __ksm_exit(struct mm_struct *mm)
>  		down_write(&mm->mmap_sem);
>  		up_write(&mm->mmap_sem);
>  	}
> +	return 0;
>  }
>  
>  struct page *ksm_might_need_to_copy(struct page *page,
> @@ -2305,11 +2312,20 @@ static struct attribute_group ksm_attr_group = {
>  };
>  #endif /* CONFIG_SYSFS */
>  
> +static struct notifier_block ksm_mmput_nb = {
> +	.notifier_call		= ksm_exit,
> +	.priority		= 2,
> +};
> +
>  static int __init ksm_init(void)
>  {
>  	struct task_struct *ksm_thread;
>  	int err;
>  
> +	err = mmput_register_notifier(&ksm_mmput_nb);
> +	if (err)
> +		return err;
> +

In order to be perfectly consistent with this routine's existing code, you 
would want to write:

if (err)
	goto out;

...but it does the same thing as your code. It's just a consistency thing.
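
On a related note, since this exported pair is what will let modular users
hook mmput (as the changelog says), here is a minimal sketch of how an
out-of-tree module might use it. This is a hypothetical example, assuming
the interface lands as posted:

#include <linux/module.h>
#include <linux/notifier.h>
#include <linux/sched.h>
#include <linux/mm_types.h>

static int example_mm_release(struct notifier_block *nb,
			      unsigned long action, void *data)
{
	struct mm_struct *mm = data;

	/* exit_mmap() and mmu_notifier_release() have already run by now. */
	pr_debug("mm %p is being torn down\n", mm);
	return 0;
}

static struct notifier_block example_mmput_nb = {
	.notifier_call	= example_mm_release,
	.priority	= 0,
};

static int __init example_init(void)
{
	return mmput_register_notifier(&example_mmput_nb);
}

static void __exit example_exit(void)
{
	mmput_unregister_notifier(&example_mmput_nb);
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");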

>  	err = ksm_slab_init();
>  	if (err)
>  		goto out;
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 61aec93..b684a21 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2775,6 +2775,9 @@ void exit_mmap(struct mm_struct *mm)
>  	struct vm_area_struct *vma;
>  	unsigned long nr_accounted = 0;
>  
> +	/* Important to call this first. */
> +	khugepaged_exit(mm);
> +
>  	/* mm's last user has gone, and its about to be pulled down */
>  	mmu_notifier_release(mm);
>  
> -- 
> 1.9.0
> 

Above points are extremely minor, so:

Reviewed-by: John Hubbard <jhubbard@nvidia.com>

thanks,
John H.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
@ 2014-06-30  3:49     ` John Hubbard
  0 siblings, 0 replies; 76+ messages in thread
From: John Hubbard @ 2014-06-30  3:49 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: akpm, linux-mm, linux-kernel, mgorman, hpa, peterz, aarcange,
	riel, jweiner, torvalds, Mark Hairgrove, Jatin Kumar,
	Subhash Gutti, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Sherry Cheung, Duncan Poole, Oded Gabbay,
	Alexander Deucher, Andrew Lewycky, Jérôme Glisse

[-- Attachment #1: Type: text/plain, Size: 14766 bytes --]

On Fri, 27 Jun 2014, Jérôme Glisse wrote:

> From: Jérôme Glisse <jglisse@redhat.com>
> 
> Several subsystems require a callback when an mm struct is being destroyed
> so that they can clean up their respective per-mm state. Instead of
> having each subsystem add its callback to mmput, use a notifier chain
> to call each of the subsystems.
> 
> This will allow new subsystems to register callbacks even if they are
> modules. There should be no contention on the rw semaphore protecting
> the call chain, and the impact on the code path should be low and
> buried in the noise.
> 
> Note that this patch also moves the call to the cleanup functions after
> exit_mmap so that new callbacks can assume that mmu_notifier_release
> has already been called. This does not impact existing cleanup functions
> as they do not rely on anything that exit_mmap is freeing. Also moved
> khugepaged_exit to exit_mmap so that ordering is preserved for that
> function.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> ---
>  fs/aio.c                | 29 ++++++++++++++++++++++-------
>  include/linux/aio.h     |  2 --
>  include/linux/ksm.h     | 11 -----------
>  include/linux/sched.h   |  5 +++++
>  include/linux/uprobes.h |  1 -
>  kernel/events/uprobes.c | 19 ++++++++++++++++---
>  kernel/fork.c           | 22 ++++++++++++++++++----
>  mm/ksm.c                | 26 +++++++++++++++++++++-----
>  mm/mmap.c               |  3 +++
>  9 files changed, 85 insertions(+), 33 deletions(-)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index c1d8c48..1d06e92 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -40,6 +40,7 @@
>  #include <linux/ramfs.h>
>  #include <linux/percpu-refcount.h>
>  #include <linux/mount.h>
> +#include <linux/notifier.h>
>  
>  #include <asm/kmap_types.h>
>  #include <asm/uaccess.h>
> @@ -774,20 +775,22 @@ ssize_t wait_on_sync_kiocb(struct kiocb *req)
>  EXPORT_SYMBOL(wait_on_sync_kiocb);
>  
>  /*
> - * exit_aio: called when the last user of mm goes away.  At this point, there is
> + * aio_exit: called when the last user of mm goes away.  At this point, there is
>   * no way for any new requests to be submited or any of the io_* syscalls to be
>   * called on the context.
>   *
>   * There may be outstanding kiocbs, but free_ioctx() will explicitly wait on
>   * them.
>   */
> -void exit_aio(struct mm_struct *mm)
> +static int aio_exit(struct notifier_block *nb,
> +		    unsigned long action, void *data)
>  {
> +	struct mm_struct *mm = data;
>  	struct kioctx_table *table = rcu_dereference_raw(mm->ioctx_table);
>  	int i;
>  
>  	if (!table)
> -		return;
> +		return 0;
>  
>  	for (i = 0; i < table->nr; ++i) {
>  		struct kioctx *ctx = table->table[i];
> @@ -796,10 +799,10 @@ void exit_aio(struct mm_struct *mm)
>  			continue;
>  		/*
>  		 * We don't need to bother with munmap() here - exit_mmap(mm)
> -		 * is coming and it'll unmap everything. And we simply can't,
> -		 * this is not necessarily our ->mm.
> -		 * Since kill_ioctx() uses non-zero ->mmap_size as indicator
> -		 * that it needs to unmap the area, just set it to 0.
> +		 * have already been call and everything is unmap by now. But
> +		 * to be safe set ->mmap_size to 0 since aio_free_ring() uses
> +		 * non-zero ->mmap_size as indicator that it needs to unmap the
> +		 * area.
>  		 */

Actually, I think the original part of the comment about kill_ioctx
was accurate, but the new reference to aio_free_ring looks like a typo 
(?).  I'd write the entire comment as follows (I've dropped the leading 
whitespace, for email):

    /*
     * We don't need to bother with munmap() here - exit_mmap(mm)
     * has already been called and everything is unmapped by now.
     * But to be safe, set ->mmap_size to 0 since kill_ioctx() uses a
     * non-zero ->mmap_size as an indicator that it needs to unmap the
     * area.
     */


>  		ctx->mmap_size = 0;
>  		kill_ioctx(mm, ctx, NULL);
> @@ -807,6 +810,7 @@ void exit_aio(struct mm_struct *mm)
>  
>  	RCU_INIT_POINTER(mm->ioctx_table, NULL);
>  	kfree(table);
> +	return 0;
>  }
>  
>  static void put_reqs_available(struct kioctx *ctx, unsigned nr)
> @@ -1629,3 +1633,14 @@ SYSCALL_DEFINE5(io_getevents, aio_context_t, ctx_id,
>  	}
>  	return ret;
>  }
> +
> +static struct notifier_block aio_mmput_nb = {
> +	.notifier_call		= aio_exit,
> +	.priority		= 1,
> +};
> +
> +static int __init aio_init(void)
> +{
> +	return mmput_register_notifier(&aio_mmput_nb);
> +}
> +subsys_initcall(aio_init);
> diff --git a/include/linux/aio.h b/include/linux/aio.h
> index d9c92da..6308fac 100644
> --- a/include/linux/aio.h
> +++ b/include/linux/aio.h
> @@ -73,7 +73,6 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
>  extern ssize_t wait_on_sync_kiocb(struct kiocb *iocb);
>  extern void aio_complete(struct kiocb *iocb, long res, long res2);
>  struct mm_struct;
> -extern void exit_aio(struct mm_struct *mm);
>  extern long do_io_submit(aio_context_t ctx_id, long nr,
>  			 struct iocb __user *__user *iocbpp, bool compat);
>  void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel);
> @@ -81,7 +80,6 @@ void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel);
>  static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; }
>  static inline void aio_complete(struct kiocb *iocb, long res, long res2) { }
>  struct mm_struct;
> -static inline void exit_aio(struct mm_struct *mm) { }
>  static inline long do_io_submit(aio_context_t ctx_id, long nr,
>  				struct iocb __user * __user *iocbpp,
>  				bool compat) { return 0; }
> diff --git a/include/linux/ksm.h b/include/linux/ksm.h
> index 3be6bb1..84c184f 100644
> --- a/include/linux/ksm.h
> +++ b/include/linux/ksm.h
> @@ -20,7 +20,6 @@ struct mem_cgroup;
>  int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
>  		unsigned long end, int advice, unsigned long *vm_flags);
>  int __ksm_enter(struct mm_struct *mm);
> -void __ksm_exit(struct mm_struct *mm);
>  
>  static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
>  {
> @@ -29,12 +28,6 @@ static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
>  	return 0;
>  }
>  
> -static inline void ksm_exit(struct mm_struct *mm)
> -{
> -	if (test_bit(MMF_VM_MERGEABLE, &mm->flags))
> -		__ksm_exit(mm);
> -}
> -
>  /*
>   * A KSM page is one of those write-protected "shared pages" or "merged pages"
>   * which KSM maps into multiple mms, wherever identical anonymous page content
> @@ -83,10 +76,6 @@ static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
>  	return 0;
>  }
>  
> -static inline void ksm_exit(struct mm_struct *mm)
> -{
> -}
> -
>  static inline int PageKsm(struct page *page)
>  {
>  	return 0;
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 322d4fc..428b3cf 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2384,6 +2384,11 @@ static inline void mmdrop(struct mm_struct * mm)
>  		__mmdrop(mm);
>  }
>  
> +/* mmput call list of notifier and subsystem/module can register
> + * new one through this call.
> + */
> +extern int mmput_register_notifier(struct notifier_block *nb);
> +extern int mmput_unregister_notifier(struct notifier_block *nb);
>  /* mmput gets rid of the mappings and all user-space */
>  extern void mmput(struct mm_struct *);
>  /* Grab a reference to a task's mm, if it is not already going away */
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index 4f844c6..44e7267 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -120,7 +120,6 @@ extern int uprobe_pre_sstep_notifier(struct pt_regs *regs);
>  extern void uprobe_notify_resume(struct pt_regs *regs);
>  extern bool uprobe_deny_signal(void);
>  extern bool arch_uprobe_skip_sstep(struct arch_uprobe *aup, struct pt_regs *regs);
> -extern void uprobe_clear_state(struct mm_struct *mm);
>  extern int  arch_uprobe_analyze_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long addr);
>  extern int  arch_uprobe_pre_xol(struct arch_uprobe *aup, struct pt_regs *regs);
>  extern int  arch_uprobe_post_xol(struct arch_uprobe *aup, struct pt_regs *regs);
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 46b7c31..32b04dc 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -37,6 +37,7 @@
>  #include <linux/percpu-rwsem.h>
>  #include <linux/task_work.h>
>  #include <linux/shmem_fs.h>
> +#include <linux/notifier.h>
>  
>  #include <linux/uprobes.h>
>  
> @@ -1220,16 +1221,19 @@ static struct xol_area *get_xol_area(void)
>  /*
>   * uprobe_clear_state - Free the area allocated for slots.
>   */
> -void uprobe_clear_state(struct mm_struct *mm)
> +static int uprobe_clear_state(struct notifier_block *nb,
> +			      unsigned long action, void *data)
>  {
> +	struct mm_struct *mm = data;
>  	struct xol_area *area = mm->uprobes_state.xol_area;
>  
>  	if (!area)
> -		return;
> +		return 0;
>  
>  	put_page(area->page);
>  	kfree(area->bitmap);
>  	kfree(area);
> +	return 0;
>  }
>  
>  void uprobe_start_dup_mmap(void)
> @@ -1979,9 +1983,14 @@ static struct notifier_block uprobe_exception_nb = {
>  	.priority		= INT_MAX-1,	/* notified after kprobes, kgdb */
>  };
>  
> +static struct notifier_block uprobe_mmput_nb = {
> +	.notifier_call		= uprobe_clear_state,
> +	.priority		= 0,
> +};
> +
>  static int __init init_uprobes(void)
>  {
> -	int i;
> +	int i, err;
>  
>  	for (i = 0; i < UPROBES_HASH_SZ; i++)
>  		mutex_init(&uprobes_mmap_mutex[i]);
> @@ -1989,6 +1998,10 @@ static int __init init_uprobes(void)
>  	if (percpu_init_rwsem(&dup_mmap_sem))
>  		return -ENOMEM;
>  
> +	err = mmput_register_notifier(&uprobe_mmput_nb);
> +	if (err)
> +		return err;
> +
>  	return register_die_notifier(&uprobe_exception_nb);
>  }
>  __initcall(init_uprobes);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index dd8864f..b448509 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -87,6 +87,8 @@
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/task.h>
>  
> +static BLOCKING_NOTIFIER_HEAD(mmput_notifier);
> +
>  /*
>   * Protected counters by write_lock_irq(&tasklist_lock)
>   */
> @@ -623,6 +625,21 @@ void __mmdrop(struct mm_struct *mm)
>  EXPORT_SYMBOL_GPL(__mmdrop);
>  
>  /*
> + * Register a notifier that will be call by mmput
> + */
> +int mmput_register_notifier(struct notifier_block *nb)
> +{
> +	return blocking_notifier_chain_register(&mmput_notifier, nb);
> +}
> +EXPORT_SYMBOL_GPL(mmput_register_notifier);
> +
> +int mmput_unregister_notifier(struct notifier_block *nb)
> +{
> +	return blocking_notifier_chain_unregister(&mmput_notifier, nb);
> +}
> +EXPORT_SYMBOL_GPL(mmput_unregister_notifier);
> +
> +/*
>   * Decrement the use count and release all resources for an mm.
>   */
>  void mmput(struct mm_struct *mm)
> @@ -630,11 +647,8 @@ void mmput(struct mm_struct *mm)
>  	might_sleep();
>  
>  	if (atomic_dec_and_test(&mm->mm_users)) {
> -		uprobe_clear_state(mm);
> -		exit_aio(mm);
> -		ksm_exit(mm);
> -		khugepaged_exit(mm); /* must run before exit_mmap */
>  		exit_mmap(mm);
> +		blocking_notifier_call_chain(&mmput_notifier, 0, mm);
>  		set_mm_exe_file(mm, NULL);
>  		if (!list_empty(&mm->mmlist)) {
>  			spin_lock(&mmlist_lock);
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 346ddc9..cb1e976 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -37,6 +37,7 @@
>  #include <linux/freezer.h>
>  #include <linux/oom.h>
>  #include <linux/numa.h>
> +#include <linux/notifier.h>
>  
>  #include <asm/tlbflush.h>
>  #include "internal.h"
> @@ -1586,7 +1587,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
>  		ksm_scan.mm_slot = slot;
>  		spin_unlock(&ksm_mmlist_lock);
>  		/*
> -		 * Although we tested list_empty() above, a racing __ksm_exit
> +		 * Although we tested list_empty() above, a racing ksm_exit
>  		 * of the last mm on the list may have removed it since then.
>  		 */
>  		if (slot == &ksm_mm_head)
> @@ -1658,9 +1659,9 @@ next_mm:
>  		/*
>  		 * We've completed a full scan of all vmas, holding mmap_sem
>  		 * throughout, and found no VM_MERGEABLE: so do the same as
> -		 * __ksm_exit does to remove this mm from all our lists now.
> -		 * This applies either when cleaning up after __ksm_exit
> -		 * (but beware: we can reach here even before __ksm_exit),
> +		 * ksm_exit does to remove this mm from all our lists now.
> +		 * This applies either when cleaning up after ksm_exit
> +		 * (but beware: we can reach here even before ksm_exit),
>  		 * or when all VM_MERGEABLE areas have been unmapped (and
>  		 * mmap_sem then protects against race with MADV_MERGEABLE).
>  		 */
> @@ -1821,11 +1822,16 @@ int __ksm_enter(struct mm_struct *mm)
>  	return 0;
>  }
>  
> -void __ksm_exit(struct mm_struct *mm)
> +static int ksm_exit(struct notifier_block *nb,
> +		    unsigned long action, void *data)
>  {
> +	struct mm_struct *mm = data;
>  	struct mm_slot *mm_slot;
>  	int easy_to_free = 0;
>  
> +	if (!test_bit(MMF_VM_MERGEABLE, &mm->flags))
> +		return 0;
> +
>  	/*
>  	 * This process is exiting: if it's straightforward (as is the
>  	 * case when ksmd was never running), free mm_slot immediately.
> @@ -1857,6 +1863,7 @@ void __ksm_exit(struct mm_struct *mm)
>  		down_write(&mm->mmap_sem);
>  		up_write(&mm->mmap_sem);
>  	}
> +	return 0;
>  }
>  
>  struct page *ksm_might_need_to_copy(struct page *page,
> @@ -2305,11 +2312,20 @@ static struct attribute_group ksm_attr_group = {
>  };
>  #endif /* CONFIG_SYSFS */
>  
> +static struct notifier_block ksm_mmput_nb = {
> +	.notifier_call		= ksm_exit,
> +	.priority		= 2,
> +};
> +
>  static int __init ksm_init(void)
>  {
>  	struct task_struct *ksm_thread;
>  	int err;
>  
> +	err = mmput_register_notifier(&ksm_mmput_nb);
> +	if (err)
> +		return err;
> +

To be perfectly consistent with this routine's existing code, you would
want to write:

if (err)
	goto out;

...but it does the same thing as your code. It's just a consistency thing.
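
For example, just to show the placement I mean (a sketch only -- this assumes
the existing out: label still just returns err, which I believe it does):

static int __init ksm_init(void)
{
	struct task_struct *ksm_thread;
	int err;

	err = mmput_register_notifier(&ksm_mmput_nb);
	if (err)
		goto out;	/* same as "return err" here, just consistent */

	err = ksm_slab_init();
	if (err)
		goto out;

	/* ... rest of ksm_init() unchanged ... */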

>  	err = ksm_slab_init();
>  	if (err)
>  		goto out;
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 61aec93..b684a21 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2775,6 +2775,9 @@ void exit_mmap(struct mm_struct *mm)
>  	struct vm_area_struct *vma;
>  	unsigned long nr_accounted = 0;
>  
> +	/* Important to call this first. */
> +	khugepaged_exit(mm);
> +
>  	/* mm's last user has gone, and its about to be pulled down */
>  	mmu_notifier_release(mm);
>  
> -- 
> 1.9.0
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 

Above points are extremely minor, so:

Reviewed-by: John Hubbard <jhubbard@nvidia.com>

thanks,
John H.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 2/6] mm: differentiate unmap for vmscan from other unmap.
  2014-06-28  2:00   ` Jérôme Glisse
@ 2014-06-30  3:58     ` John Hubbard
  -1 siblings, 0 replies; 76+ messages in thread
From: John Hubbard @ 2014-06-30  3:58 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: akpm, linux-mm, linux-kernel, mgorman, hpa, peterz, aarcange,
	riel, jweiner, torvalds, Mark Hairgrove, Jatin Kumar,
	Subhash Gutti, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Sherry Cheung, Duncan Poole, Oded Gabbay,
	Alexander Deucher, Andrew Lewycky, Jérôme Glisse

[-- Attachment #1: Type: text/plain, Size: 3432 bytes --]

On Fri, 27 Jun 2014, Jérôme Glisse wrote:

> From: Jérôme Glisse <jglisse@redhat.com>
> 
> New code will need to be able to differentiate between a regular unmap and
> an unmap trigger by vmscan in which case we want to be as quick as possible.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> ---
>  include/linux/rmap.h | 15 ++++++++-------
>  mm/memory-failure.c  |  2 +-
>  mm/vmscan.c          |  4 ++--
>  3 files changed, 11 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index be57450..eddbc07 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -72,13 +72,14 @@ struct anon_vma_chain {
>  };
>  
>  enum ttu_flags {
> -	TTU_UNMAP = 1,			/* unmap mode */
> -	TTU_MIGRATION = 2,		/* migration mode */
> -	TTU_MUNLOCK = 4,		/* munlock mode */
> -
> -	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
> -	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
> -	TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
> +	TTU_VMSCAN = 1,			/* unmap for vmscan */
> +	TTU_POISON = 2,			/* unmap for poison */
> +	TTU_MIGRATION = 4,		/* migration mode */
> +	TTU_MUNLOCK = 8,		/* munlock mode */
> +
> +	TTU_IGNORE_MLOCK = (1 << 9),	/* ignore mlock */
> +	TTU_IGNORE_ACCESS = (1 << 10),	/* don't age */
> +	TTU_IGNORE_HWPOISON = (1 << 11),/* corrupted page is recoverable */

Unless there is a deeper purpose that I am overlooking, I think it would 
be better to leave the _MLOCK, _ACCESS, and _HWPOISON at their original 
values. I just can't quite see why they would need to start at bit 9 
instead of bit 8...
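
In other words, something along these lines (the new low flags only use
bits 0-3, so the high flags still fit at their original positions):

enum ttu_flags {
	TTU_VMSCAN = 1,			/* unmap for vmscan */
	TTU_POISON = 2,			/* unmap for poison */
	TTU_MIGRATION = 4,		/* migration mode */
	TTU_MUNLOCK = 8,		/* munlock mode */

	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
	TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
};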

>  };
>  
>  #ifdef CONFIG_MMU
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index a7a89eb..ba176c4 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -887,7 +887,7 @@ static int page_action(struct page_state *ps, struct page *p,
>  static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
>  				  int trapno, int flags, struct page **hpagep)
>  {
> -	enum ttu_flags ttu = TTU_UNMAP | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
> +	enum ttu_flags ttu = TTU_POISON | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
>  	struct address_space *mapping;
>  	LIST_HEAD(tokill);
>  	int ret;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 6d24fd6..5a7d286 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1163,7 +1163,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
>  	}
>  
>  	ret = shrink_page_list(&clean_pages, zone, &sc,
> -			TTU_UNMAP|TTU_IGNORE_ACCESS,
> +			TTU_VMSCAN|TTU_IGNORE_ACCESS,
>  			&dummy1, &dummy2, &dummy3, &dummy4, &dummy5, true);
>  	list_splice(&clean_pages, page_list);
>  	mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret);
> @@ -1518,7 +1518,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>  	if (nr_taken == 0)
>  		return 0;
>  
> -	nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP,
> +	nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_VMSCAN,
>  				&nr_dirty, &nr_unqueued_dirty, &nr_congested,
>  				&nr_writeback, &nr_immediate,
>  				false);
> -- 
> 1.9.0
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 

Other than that, looks good.

Reviewed-by: John Hubbard <jhubbard@nvidia.com>

thanks,
John H.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 3/6] mmu_notifier: add event information to address invalidation v2
  2014-06-28  2:00   ` Jérôme Glisse
@ 2014-06-30  5:22     ` John Hubbard
  -1 siblings, 0 replies; 76+ messages in thread
From: John Hubbard @ 2014-06-30  5:22 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: akpm, linux-mm, linux-kernel, mgorman, hpa, peterz, aarcange,
	riel, jweiner, torvalds, Mark Hairgrove, Jatin Kumar,
	Subhash Gutti, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Sherry Cheung, Duncan Poole, Oded Gabbay,
	Alexander Deucher, Andrew Lewycky, Jérôme Glisse

[-- Attachment #1: Type: text/plain, Size: 51734 bytes --]

On Fri, 27 Jun 2014, Jérôme Glisse wrote:

> From: Jérôme Glisse <jglisse@redhat.com>
> 
> The event information will be usefull for new user of mmu_notifier API.
> The event argument differentiate between a vma disappearing, a page
> being write protected or simply a page being unmaped. This allow new
> user to take different path for different event for instance on unmap
> the resource used to track a vma are still valid and should stay around.
> While if the event is saying that a vma is being destroy it means that any
> resources used to track this vma can be free.
> 
> Changed since v1:
>   - renamed action into event (updated commit message too).
>   - simplified the event names and clarified their intented usage
>     also documenting what exceptation the listener can have in
>     respect to each event.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> ---
>  drivers/gpu/drm/i915/i915_gem_userptr.c |   3 +-
>  drivers/iommu/amd_iommu_v2.c            |  14 ++--
>  drivers/misc/sgi-gru/grutlbpurge.c      |   9 ++-
>  drivers/xen/gntdev.c                    |   9 ++-
>  fs/proc/task_mmu.c                      |   6 +-
>  include/linux/hugetlb.h                 |   7 +-
>  include/linux/mmu_notifier.h            | 117 ++++++++++++++++++++++++++------
>  kernel/events/uprobes.c                 |  10 ++-
>  mm/filemap_xip.c                        |   2 +-
>  mm/huge_memory.c                        |  51 ++++++++------
>  mm/hugetlb.c                            |  25 ++++---
>  mm/ksm.c                                |  18 +++--
>  mm/memory.c                             |  27 +++++---
>  mm/migrate.c                            |   9 ++-
>  mm/mmu_notifier.c                       |  28 +++++---
>  mm/mprotect.c                           |  33 ++++++---
>  mm/mremap.c                             |   6 +-
>  mm/rmap.c                               |  24 +++++--
>  virt/kvm/kvm_main.c                     |  12 ++--
>  19 files changed, 291 insertions(+), 119 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> index 21ea928..ed6f35e 100644
> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> @@ -56,7 +56,8 @@ struct i915_mmu_object {
>  static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>  						       struct mm_struct *mm,
>  						       unsigned long start,
> -						       unsigned long end)
> +						       unsigned long end,
> +						       enum mmu_event event)
>  {
>  	struct i915_mmu_notifier *mn = container_of(_mn, struct i915_mmu_notifier, mn);
>  	struct interval_tree_node *it = NULL;
> diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
> index 499b436..2bb9771 100644
> --- a/drivers/iommu/amd_iommu_v2.c
> +++ b/drivers/iommu/amd_iommu_v2.c
> @@ -414,21 +414,25 @@ static int mn_clear_flush_young(struct mmu_notifier *mn,
>  static void mn_change_pte(struct mmu_notifier *mn,
>  			  struct mm_struct *mm,
>  			  unsigned long address,
> -			  pte_t pte)
> +			  pte_t pte,
> +			  enum mmu_event event)
>  {
>  	__mn_flush_page(mn, address);
>  }
>  
>  static void mn_invalidate_page(struct mmu_notifier *mn,
>  			       struct mm_struct *mm,
> -			       unsigned long address)
> +			       unsigned long address,
> +			       enum mmu_event event)
>  {
>  	__mn_flush_page(mn, address);
>  }
>  
>  static void mn_invalidate_range_start(struct mmu_notifier *mn,
>  				      struct mm_struct *mm,
> -				      unsigned long start, unsigned long end)
> +				      unsigned long start,
> +				      unsigned long end,
> +				      enum mmu_event event)
>  {
>  	struct pasid_state *pasid_state;
>  	struct device_state *dev_state;
> @@ -449,7 +453,9 @@ static void mn_invalidate_range_start(struct mmu_notifier *mn,
>  
>  static void mn_invalidate_range_end(struct mmu_notifier *mn,
>  				    struct mm_struct *mm,
> -				    unsigned long start, unsigned long end)
> +				    unsigned long start,
> +				    unsigned long end,
> +				    enum mmu_event event)
>  {
>  	struct pasid_state *pasid_state;
>  	struct device_state *dev_state;
> diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
> index 2129274..e67fed1 100644
> --- a/drivers/misc/sgi-gru/grutlbpurge.c
> +++ b/drivers/misc/sgi-gru/grutlbpurge.c
> @@ -221,7 +221,8 @@ void gru_flush_all_tlb(struct gru_state *gru)
>   */
>  static void gru_invalidate_range_start(struct mmu_notifier *mn,
>  				       struct mm_struct *mm,
> -				       unsigned long start, unsigned long end)
> +				       unsigned long start, unsigned long end,
> +				       enum mmu_event event)
>  {
>  	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
>  						 ms_notifier);
> @@ -235,7 +236,8 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
>  
>  static void gru_invalidate_range_end(struct mmu_notifier *mn,
>  				     struct mm_struct *mm, unsigned long start,
> -				     unsigned long end)
> +				     unsigned long end,
> +				     enum mmu_event event)
>  {
>  	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
>  						 ms_notifier);
> @@ -248,7 +250,8 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
>  }
>  
>  static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
> -				unsigned long address)
> +				unsigned long address,
> +				enum mmu_event event)
>  {
>  	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
>  						 ms_notifier);
> diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
> index 073b4a1..fe9da94 100644
> --- a/drivers/xen/gntdev.c
> +++ b/drivers/xen/gntdev.c
> @@ -428,7 +428,9 @@ static void unmap_if_in_range(struct grant_map *map,
>  
>  static void mn_invl_range_start(struct mmu_notifier *mn,
>  				struct mm_struct *mm,
> -				unsigned long start, unsigned long end)
> +				unsigned long start,
> +				unsigned long end,
> +				enum mmu_event event)
>  {
>  	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
>  	struct grant_map *map;
> @@ -445,9 +447,10 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
>  
>  static void mn_invl_page(struct mmu_notifier *mn,
>  			 struct mm_struct *mm,
> -			 unsigned long address)
> +			 unsigned long address,
> +			 enum mmu_event event)
>  {
> -	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE);
> +	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, event);
>  }
>  
>  static void mn_release(struct mmu_notifier *mn,
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index cfa63ee..e9e79f7 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -830,7 +830,8 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
>  		};
>  		down_read(&mm->mmap_sem);
>  		if (type == CLEAR_REFS_SOFT_DIRTY)
> -			mmu_notifier_invalidate_range_start(mm, 0, -1);
> +			mmu_notifier_invalidate_range_start(mm, 0,
> +							    -1, MMU_STATUS);
>  		for (vma = mm->mmap; vma; vma = vma->vm_next) {
>  			cp.vma = vma;
>  			if (is_vm_hugetlb_page(vma))
> @@ -858,7 +859,8 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
>  					&clear_refs_walk);
>  		}
>  		if (type == CLEAR_REFS_SOFT_DIRTY)
> -			mmu_notifier_invalidate_range_end(mm, 0, -1);
> +			mmu_notifier_invalidate_range_end(mm, 0,
> +							  -1, MMU_STATUS);
>  		flush_tlb_mm(mm);
>  		up_read(&mm->mmap_sem);
>  		mmput(mm);
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 6a836ef..d7e512f 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -6,6 +6,7 @@
>  #include <linux/fs.h>
>  #include <linux/hugetlb_inline.h>
>  #include <linux/cgroup.h>
> +#include <linux/mmu_notifier.h>
>  #include <linux/list.h>
>  #include <linux/kref.h>
>  
> @@ -103,7 +104,8 @@ struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
>  int pmd_huge(pmd_t pmd);
>  int pud_huge(pud_t pmd);
>  unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> -		unsigned long address, unsigned long end, pgprot_t newprot);
> +		unsigned long address, unsigned long end, pgprot_t newprot,
> +		enum mmu_event event);
>  
>  #else /* !CONFIG_HUGETLB_PAGE */
>  
> @@ -148,7 +150,8 @@ static inline bool isolate_huge_page(struct page *page, struct list_head *list)
>  #define is_hugepage_active(x)	false
>  
>  static inline unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> -		unsigned long address, unsigned long end, pgprot_t newprot)
> +		unsigned long address, unsigned long end, pgprot_t newprot,
> +		enum mmu_event event)
>  {
>  	return 0;
>  }
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index deca874..82e9577 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -9,6 +9,52 @@
>  struct mmu_notifier;
>  struct mmu_notifier_ops;
>  
> +/* Event report finer informations to the callback allowing the event listener
> + * to take better action. There are only few kinds of events :
> + *
> + *   - MMU_MIGRATE memory is migrating from one page to another thus all write
> + *     access must stop after invalidate_range_start callback returns. And no
> + *     read access should be allowed either as new page can be remapped with
> + *     write access before the invalidate_range_end callback happen and thus
> + *     any read access to old page might access outdated informations. Several
> + *     source to this event like page moving to swap (for various reasons like
> + *     page reclaim), outcome of mremap syscall, migration for numa reasons,
> + *     balancing memory pool, write fault on read only page trigger a new page
> + *     to be allocated and used, ...
> + *   - MMU_MPROT_NONE memory access protection is change, no page in the range
> + *     can be accessed in either read or write mode but the range of address
> + *     is still valid. All access are still fine until invalidate_range_end
> + *     callback returns.
> + *   - MMU_MPROT_RONLY memory access proctection is changing to read only.
> + *     All access are still fine until invalidate_range_end callback returns.
> + *   - MMU_MPROT_RANDW memory access proctection is changing to read an write.
> + *     All access are still fine until invalidate_range_end callback returns.
> + *   - MMU_MPROT_WONLY memory access proctection is changing to write only.
> + *     All access are still fine until invalidate_range_end callback returns.
> + *   - MMU_MUNMAP the range is being unmaped (outcome of a munmap syscall). It
> + *     is fine to still have read/write access until the invalidate_range_end
> + *     callback returns. This also imply that secondary page table can be trim
> + *     as the address range is no longer valid.
> + *   - MMU_WB memory is being write back to disk, all write access must stop
> + *     after invalidate_range_start callback returns. Read access are still
> + *     allowed.
> + *   - MMU_STATUS memory status change, like soft dirty.
> + *
> + * In doubt when adding a new notifier caller use MMU_MIGRATE it will always
> + * result in expected behavior but will not allow listener a chance to optimize
> + * its events.
> + */

Here is a pass at tightening up that documentation:

/* MMU Events report fine-grained information to the callback routine, allowing
 * the event listener to make a more informed decision as to what action to
 * take. The event types are:
 *
 *   - MMU_MIGRATE: memory is migrating from one page to another, thus all write
 *     access must stop after invalidate_range_start callback returns.
 *     Furthermore, no read access should be allowed either, as a new page can
 *     be remapped with write access before the invalidate_range_end callback
 *     happens and thus any read access to the old page might read stale data. There
 *     are several sources for this event, including:
 *
 *         - A page moving to swap (for various reasons, including page
 *           reclaim),
 *         - An mremap syscall,
 *         - migration for NUMA reasons,
 *         - balancing the memory pool,
 *         - write fault on a read-only page triggers a new page to be allocated
 *           and used,
 *         - and more that are not listed here.
 *
 *   - MMU_MPROT_NONE: memory access protection is changing to "none": no page
 *     in the range can be accessed in either read or write mode but the range
 *     of addresses is still valid. However, access is still allowed, up until
 *     invalidate_range_end callback returns.
 *
 *   - MMU_MPROT_RONLY: memory access protection is changing to read only.
 *     However, access is still allowed, up until invalidate_range_end callback
 *     returns.
 *
 *   - MMU_MPROT_RANDW: memory access protection is changing to read and write.
 *     However, access is still allowed, up until invalidate_range_end callback
 *     returns.
 *
 *   - MMU_MPROT_WONLY: memory access protection is changing to write only.
 *     However, access is still allowed, up until invalidate_range_end callback
 *     returns.
 *
 *   - MMU_MUNMAP: the range is being unmapped (outcome of a munmap syscall).
 *     However, access is still allowed, up until invalidate_range_end callback
 *     returns. This also implies that the secondary page table can be trimmed,
 *     because the address range is no longer valid.
 *
 *   - MMU_WB: memory is being written back to disk, all write accesses must
 *     stop after invalidate_range_start callback returns. Read accesses are
 *     still allowed.
 *
 *   - MMU_STATUS: memory status change, like soft dirty, or huge page
 *     splitting (in place).
 *
 * If in doubt when adding a new notifier caller, please use MMU_MIGRATE,
 * because it will always lead to reasonable behavior, but will not allow the
 * listener a chance to optimize its events.
 */

Mostly just cleaning up the wording, except that I did add "huge page 
splitting" to the cases that could cause an MMU_STATUS to fire.

> +enum mmu_event {
> +	MMU_MIGRATE = 0,
> +	MMU_MPROT_NONE,
> +	MMU_MPROT_RONLY,
> +	MMU_MPROT_RANDW,
> +	MMU_MPROT_WONLY,
> +	MMU_MUNMAP,
> +	MMU_STATUS,
> +	MMU_WB,
> +};
> +
>  #ifdef CONFIG_MMU_NOTIFIER
>  
>  /*
> @@ -79,7 +125,8 @@ struct mmu_notifier_ops {
>  	void (*change_pte)(struct mmu_notifier *mn,
>  			   struct mm_struct *mm,
>  			   unsigned long address,
> -			   pte_t pte);
> +			   pte_t pte,
> +			   enum mmu_event event);
>  
>  	/*
>  	 * Before this is invoked any secondary MMU is still ok to
> @@ -90,7 +137,8 @@ struct mmu_notifier_ops {
>  	 */
>  	void (*invalidate_page)(struct mmu_notifier *mn,
>  				struct mm_struct *mm,
> -				unsigned long address);
> +				unsigned long address,
> +				enum mmu_event event);
>  
>  	/*
>  	 * invalidate_range_start() and invalidate_range_end() must be
> @@ -137,10 +185,14 @@ struct mmu_notifier_ops {
>  	 */
>  	void (*invalidate_range_start)(struct mmu_notifier *mn,
>  				       struct mm_struct *mm,
> -				       unsigned long start, unsigned long end);
> +				       unsigned long start,
> +				       unsigned long end,
> +				       enum mmu_event event);
>  	void (*invalidate_range_end)(struct mmu_notifier *mn,
>  				     struct mm_struct *mm,
> -				     unsigned long start, unsigned long end);
> +				     unsigned long start,
> +				     unsigned long end,
> +				     enum mmu_event event);
>  };
>  
>  /*
> @@ -177,13 +229,20 @@ extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
>  extern int __mmu_notifier_test_young(struct mm_struct *mm,
>  				     unsigned long address);
>  extern void __mmu_notifier_change_pte(struct mm_struct *mm,
> -				      unsigned long address, pte_t pte);
> +				      unsigned long address,
> +				      pte_t pte,
> +				      enum mmu_event event);
>  extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
> -					  unsigned long address);
> +					  unsigned long address,
> +					  enum mmu_event event);
>  extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end);
> +						  unsigned long start,
> +						  unsigned long end,
> +						  enum mmu_event event);
>  extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end);
> +						unsigned long start,
> +						unsigned long end,
> +						enum mmu_event event);
>  
>  static inline void mmu_notifier_release(struct mm_struct *mm)
>  {
> @@ -208,31 +267,38 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
>  }
>  
>  static inline void mmu_notifier_change_pte(struct mm_struct *mm,
> -					   unsigned long address, pte_t pte)
> +					   unsigned long address,
> +					   pte_t pte,
> +					   enum mmu_event event)
>  {
>  	if (mm_has_notifiers(mm))
> -		__mmu_notifier_change_pte(mm, address, pte);
> +		__mmu_notifier_change_pte(mm, address, pte, event);
>  }
>  
>  static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
> -					  unsigned long address)
> +						unsigned long address,
> +						enum mmu_event event)
>  {
>  	if (mm_has_notifiers(mm))
> -		__mmu_notifier_invalidate_page(mm, address);
> +		__mmu_notifier_invalidate_page(mm, address, event);
>  }
>  
>  static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end)
> +						       unsigned long start,
> +						       unsigned long end,
> +						       enum mmu_event event)
>  {
>  	if (mm_has_notifiers(mm))
> -		__mmu_notifier_invalidate_range_start(mm, start, end);
> +		__mmu_notifier_invalidate_range_start(mm, start, end, event);
>  }
>  
>  static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end)
> +						     unsigned long start,
> +						     unsigned long end,
> +						     enum mmu_event event)
>  {
>  	if (mm_has_notifiers(mm))
> -		__mmu_notifier_invalidate_range_end(mm, start, end);
> +		__mmu_notifier_invalidate_range_end(mm, start, end, event);
>  }
>  
>  static inline void mmu_notifier_mm_init(struct mm_struct *mm)
> @@ -278,13 +344,13 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
>   * old page would remain mapped readonly in the secondary MMUs after the new
>   * page is already writable by some CPU through the primary MMU.
>   */
> -#define set_pte_at_notify(__mm, __address, __ptep, __pte)		\
> +#define set_pte_at_notify(__mm, __address, __ptep, __pte, __event)	\
>  ({									\
>  	struct mm_struct *___mm = __mm;					\
>  	unsigned long ___address = __address;				\
>  	pte_t ___pte = __pte;						\
>  									\
> -	mmu_notifier_change_pte(___mm, ___address, ___pte);		\
> +	mmu_notifier_change_pte(___mm, ___address, ___pte, __event);	\
>  	set_pte_at(___mm, ___address, __ptep, ___pte);			\
>  })
>  
> @@ -307,22 +373,29 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
>  }
>  
>  static inline void mmu_notifier_change_pte(struct mm_struct *mm,
> -					   unsigned long address, pte_t pte)
> +					   unsigned long address,
> +					   pte_t pte,
> +					   enum mmu_event event)
>  {
>  }
>  
>  static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
> -					  unsigned long address)
> +						unsigned long address,
> +						enum mmu_event event)
>  {
>  }
>  
>  static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end)
> +						       unsigned long start,
> +						       unsigned long end,
> +						       enum mmu_event event)
>  {
>  }
>  
>  static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end)
> +						     unsigned long start,
> +						     unsigned long end,
> +						     enum mmu_event event)
>  {
>  }
>  
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 32b04dc..296f81e 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -177,7 +177,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
>  	/* For try_to_free_swap() and munlock_vma_page() below */
>  	lock_page(page);
>  
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +					    mmun_end, MMU_MIGRATE);
>  	err = -EAGAIN;
>  	ptep = page_check_address(page, mm, addr, &ptl, 0);
>  	if (!ptep)
> @@ -195,7 +196,9 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
>  
>  	flush_cache_page(vma, addr, pte_pfn(*ptep));
>  	ptep_clear_flush(vma, addr, ptep);
> -	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
> +	set_pte_at_notify(mm, addr, ptep,
> +			  mk_pte(kpage, vma->vm_page_prot),
> +			  MMU_MIGRATE);
>  
>  	page_remove_rmap(page);
>  	if (!page_mapped(page))
> @@ -209,7 +212,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
>  	err = 0;
>   unlock:
>  	mem_cgroup_cancel_charge(kpage, memcg);
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +					  mmun_end, MMU_MIGRATE);
>  	unlock_page(page);
>  	return err;
>  }
> diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
> index d8d9fe3..a2b3f09 100644
> --- a/mm/filemap_xip.c
> +++ b/mm/filemap_xip.c
> @@ -198,7 +198,7 @@ retry:
>  			BUG_ON(pte_dirty(pteval));
>  			pte_unmap_unlock(pte, ptl);
>  			/* must invalidate_page _before_ freeing the page */
> -			mmu_notifier_invalidate_page(mm, address);
> +			mmu_notifier_invalidate_page(mm, address, MMU_MIGRATE);
>  			page_cache_release(page);
>  		}
>  	}
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 5d562a9..fa30857 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1020,6 +1020,11 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
>  		set_page_private(pages[i], (unsigned long)memcg);
>  	}
>  
> +	mmun_start = haddr;
> +	mmun_end   = haddr + HPAGE_PMD_SIZE;
> +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +					    mmun_end, MMU_MIGRATE);
> +
>  	for (i = 0; i < HPAGE_PMD_NR; i++) {
>  		copy_user_highpage(pages[i], page + i,
>  				   haddr + PAGE_SIZE * i, vma);
> @@ -1027,10 +1032,6 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
>  		cond_resched();
>  	}
>  
> -	mmun_start = haddr;
> -	mmun_end   = haddr + HPAGE_PMD_SIZE;
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> -
>  	ptl = pmd_lock(mm, pmd);
>  	if (unlikely(!pmd_same(*pmd, orig_pmd)))
>  		goto out_free_pages;

So, that looks like you are fixing a pre-existing bug here? The 
invalidate_range call is now happening *before* we copy pages. That seems 
correct, although this is starting to get into code I'm less comfortable 
with (huge pages).  But I think it's worth mentioning in the commit 
message.

> @@ -1063,7 +1064,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
>  	page_remove_rmap(page);
>  	spin_unlock(ptl);
>  
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +					  mmun_end, MMU_MIGRATE);
>  
>  	ret |= VM_FAULT_WRITE;
>  	put_page(page);
> @@ -1073,7 +1075,8 @@ out:
>  
>  out_free_pages:
>  	spin_unlock(ptl);
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +					  mmun_end, MMU_MIGRATE);
>  	for (i = 0; i < HPAGE_PMD_NR; i++) {
>  		memcg = (void *)page_private(pages[i]);
>  		set_page_private(pages[i], 0);
> @@ -1157,16 +1160,17 @@ alloc:
>  
>  	count_vm_event(THP_FAULT_ALLOC);
>  
> +	mmun_start = haddr;
> +	mmun_end   = haddr + HPAGE_PMD_SIZE;
> +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +					    mmun_end, MMU_MIGRATE);
> +
>  	if (!page)
>  		clear_huge_page(new_page, haddr, HPAGE_PMD_NR);
>  	else
>  		copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
>  	__SetPageUptodate(new_page);
>  
> -	mmun_start = haddr;
> -	mmun_end   = haddr + HPAGE_PMD_SIZE;
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> -

Another bug fix, OK.

>  	spin_lock(ptl);
>  	if (page)
>  		put_user_huge_page(page);
> @@ -1197,7 +1201,8 @@ alloc:
>  	}
>  	spin_unlock(ptl);
>  out_mn:
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +					  mmun_end, MMU_MIGRATE);
>  out:
>  	return ret;
>  out_unlock:
> @@ -1632,7 +1637,8 @@ static int __split_huge_page_splitting(struct page *page,
>  	const unsigned long mmun_start = address;
>  	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
>  
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +					    mmun_end, MMU_STATUS);

OK, just to be sure: we are not moving the page contents at this point, 
right? Just changing the page table from a single "huge" entry into lots 
of little 4K page entries? If so, then MMU_STATUS seems correct, but we 
should add that case to the "Event types" documentation above.

>  	pmd = page_check_address_pmd(page, mm, address,
>  			PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
>  	if (pmd) {
> @@ -1647,7 +1653,8 @@ static int __split_huge_page_splitting(struct page *page,
>  		ret = 1;
>  		spin_unlock(ptl);
>  	}
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +					  mmun_end, MMU_STATUS);
>  
>  	return ret;
>  }
> @@ -2446,7 +2453,8 @@ static void collapse_huge_page(struct mm_struct *mm,
>  
>  	mmun_start = address;
>  	mmun_end   = address + HPAGE_PMD_SIZE;
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +					    mmun_end, MMU_MIGRATE);
>  	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
>  	/*
>  	 * After this gup_fast can't run anymore. This also removes
> @@ -2456,7 +2464,8 @@ static void collapse_huge_page(struct mm_struct *mm,
>  	 */
>  	_pmd = pmdp_clear_flush(vma, address, pmd);
>  	spin_unlock(pmd_ptl);
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +					  mmun_end, MMU_MIGRATE);
>  
>  	spin_lock(pte_ptl);
>  	isolated = __collapse_huge_page_isolate(vma, address, pte);
> @@ -2845,24 +2854,28 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
>  	mmun_start = haddr;
>  	mmun_end   = haddr + HPAGE_PMD_SIZE;
>  again:
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +					    mmun_end, MMU_MIGRATE);

Just checking: this is MMU_MIGRATE, instead of MMU_STATUS, because we are 
actually moving data? (The pages backing the page table?)

>  	ptl = pmd_lock(mm, pmd);
>  	if (unlikely(!pmd_trans_huge(*pmd))) {
>  		spin_unlock(ptl);
> -		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +		mmu_notifier_invalidate_range_end(mm, mmun_start,
> +						  mmun_end, MMU_MIGRATE);
>  		return;
>  	}
>  	if (is_huge_zero_pmd(*pmd)) {
>  		__split_huge_zero_page_pmd(vma, haddr, pmd);
>  		spin_unlock(ptl);
> -		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +		mmu_notifier_invalidate_range_end(mm, mmun_start,
> +						  mmun_end, MMU_MIGRATE);
>  		return;
>  	}
>  	page = pmd_page(*pmd);
>  	VM_BUG_ON_PAGE(!page_count(page), page);
>  	get_page(page);
>  	spin_unlock(ptl);
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +					  mmun_end, MMU_MIGRATE);
>  
>  	split_huge_page(page);
>  
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 7faab71..73e1576 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2565,7 +2565,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  	mmun_start = vma->vm_start;
>  	mmun_end = vma->vm_end;
>  	if (cow)
> -		mmu_notifier_invalidate_range_start(src, mmun_start, mmun_end);
> +		mmu_notifier_invalidate_range_start(src, mmun_start,
> +						    mmun_end, MMU_MIGRATE);
>  
>  	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
>  		spinlock_t *src_ptl, *dst_ptl;
> @@ -2615,7 +2616,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  	}
>  
>  	if (cow)
> -		mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end);
> +		mmu_notifier_invalidate_range_end(src, mmun_start,
> +						  mmun_end, MMU_MIGRATE);
>  
>  	return ret;
>  }
> @@ -2641,7 +2643,8 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  	BUG_ON(end & ~huge_page_mask(h));
>  
>  	tlb_start_vma(tlb, vma);
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +					    mmun_end, MMU_MIGRATE);
>  again:
>  	for (address = start; address < end; address += sz) {
>  		ptep = huge_pte_offset(mm, address);
> @@ -2712,7 +2715,8 @@ unlock:
>  		if (address < end && !ref_page)
>  			goto again;
>  	}
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +					  mmun_end, MMU_MIGRATE);
>  	tlb_end_vma(tlb, vma);
>  }
>  
> @@ -2899,7 +2903,8 @@ retry_avoidcopy:
>  
>  	mmun_start = address & huge_page_mask(h);
>  	mmun_end = mmun_start + huge_page_size(h);
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +					    mmun_end, MMU_MIGRATE);
>  	/*
>  	 * Retake the page table lock to check for racing updates
>  	 * before the page tables are altered
> @@ -2919,7 +2924,8 @@ retry_avoidcopy:
>  		new_page = old_page;
>  	}
>  	spin_unlock(ptl);
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +					  mmun_end, MMU_MIGRATE);
>  	page_cache_release(new_page);
>  	page_cache_release(old_page);
>  
> @@ -3344,7 +3350,8 @@ same_page:
>  }
>  
>  unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> -		unsigned long address, unsigned long end, pgprot_t newprot)
> +		unsigned long address, unsigned long end, pgprot_t newprot,
> +		enum mmu_event event)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	unsigned long start = address;
> @@ -3356,7 +3363,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>  	BUG_ON(address >= end);
>  	flush_cache_range(vma, address, end);
>  
> -	mmu_notifier_invalidate_range_start(mm, start, end);
> +	mmu_notifier_invalidate_range_start(mm, start, end, event);
>  	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
>  	for (; address < end; address += huge_page_size(h)) {
>  		spinlock_t *ptl;
> @@ -3386,7 +3393,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>  	 */
>  	flush_tlb_range(vma, start, end);
>  	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
> -	mmu_notifier_invalidate_range_end(mm, start, end);
> +	mmu_notifier_invalidate_range_end(mm, start, end, event);
>  
>  	return pages << h->order;
>  }
> diff --git a/mm/ksm.c b/mm/ksm.c
> index cb1e976..4b659f1 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -873,7 +873,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
>  
>  	mmun_start = addr;
>  	mmun_end   = addr + PAGE_SIZE;
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +					    mmun_end, MMU_MPROT_RONLY);
>  
>  	ptep = page_check_address(page, mm, addr, &ptl, 0);
>  	if (!ptep)
> @@ -905,7 +906,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
>  		if (pte_dirty(entry))
>  			set_page_dirty(page);
>  		entry = pte_mkclean(pte_wrprotect(entry));
> -		set_pte_at_notify(mm, addr, ptep, entry);
> +		set_pte_at_notify(mm, addr, ptep, entry, MMU_MPROT_RONLY);
>  	}
>  	*orig_pte = *ptep;
>  	err = 0;
> @@ -913,7 +914,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
>  out_unlock:
>  	pte_unmap_unlock(ptep, ptl);
>  out_mn:
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +					  mmun_end, MMU_MPROT_RONLY);
>  out:
>  	return err;
>  }
> @@ -949,7 +951,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
>  
>  	mmun_start = addr;
>  	mmun_end   = addr + PAGE_SIZE;
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +					    mmun_end, MMU_MIGRATE);
>  
>  	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
>  	if (!pte_same(*ptep, orig_pte)) {
> @@ -962,7 +965,9 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
>  
>  	flush_cache_page(vma, addr, pte_pfn(*ptep));
>  	ptep_clear_flush(vma, addr, ptep);
> -	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
> +	set_pte_at_notify(mm, addr, ptep,
> +			  mk_pte(kpage, vma->vm_page_prot),
> +			  MMU_MIGRATE);
>  
>  	page_remove_rmap(page);
>  	if (!page_mapped(page))
> @@ -972,7 +977,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
>  	pte_unmap_unlock(ptep, ptl);
>  	err = 0;
>  out_mn:
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +					  mmun_end, MMU_MIGRATE);
>  out:
>  	return err;
>  }
> diff --git a/mm/memory.c b/mm/memory.c
> index 09e2cd0..d3908f0 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1050,7 +1050,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	mmun_end   = end;
>  	if (is_cow)
>  		mmu_notifier_invalidate_range_start(src_mm, mmun_start,
> -						    mmun_end);
> +						    mmun_end, MMU_MIGRATE);
>  
>  	ret = 0;
>  	dst_pgd = pgd_offset(dst_mm, addr);
> @@ -1067,7 +1067,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
>  
>  	if (is_cow)
> -		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end);
> +		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end,
> +						  MMU_MIGRATE);
>  	return ret;
>  }
>  
> @@ -1371,10 +1372,12 @@ void unmap_vmas(struct mmu_gather *tlb,
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  
> -	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
> +	mmu_notifier_invalidate_range_start(mm, start_addr,
> +					    end_addr, MMU_MUNMAP);
>  	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
>  		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
> -	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
> +	mmu_notifier_invalidate_range_end(mm, start_addr,
> +					  end_addr, MMU_MUNMAP);
>  }
>  
>  /**
> @@ -1396,10 +1399,10 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
>  	lru_add_drain();
>  	tlb_gather_mmu(&tlb, mm, start, end);
>  	update_hiwater_rss(mm);
> -	mmu_notifier_invalidate_range_start(mm, start, end);
> +	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MUNMAP);
>  	for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
>  		unmap_single_vma(&tlb, vma, start, end, details);
> -	mmu_notifier_invalidate_range_end(mm, start, end);
> +	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MUNMAP);
>  	tlb_finish_mmu(&tlb, start, end);
>  }
>  
> @@ -1422,9 +1425,9 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
>  	lru_add_drain();
>  	tlb_gather_mmu(&tlb, mm, address, end);
>  	update_hiwater_rss(mm);
> -	mmu_notifier_invalidate_range_start(mm, address, end);
> +	mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
>  	unmap_single_vma(&tlb, vma, address, end, details);
> -	mmu_notifier_invalidate_range_end(mm, address, end);
> +	mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
>  	tlb_finish_mmu(&tlb, address, end);
>  }
>  
> @@ -2208,7 +2211,8 @@ gotten:
>  
>  	mmun_start  = address & PAGE_MASK;
>  	mmun_end    = mmun_start + PAGE_SIZE;
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +					    mmun_end, MMU_MIGRATE);
>  
>  	/*
>  	 * Re-check the pte - we dropped the lock
> @@ -2240,7 +2244,7 @@ gotten:
>  		 * mmu page tables (such as kvm shadow page tables), we want the
>  		 * new page to be mapped directly into the secondary page table.
>  		 */
> -		set_pte_at_notify(mm, address, page_table, entry);
> +		set_pte_at_notify(mm, address, page_table, entry, MMU_MIGRATE);
>  		update_mmu_cache(vma, address, page_table);
>  		if (old_page) {
>  			/*
> @@ -2279,7 +2283,8 @@ gotten:
>  unlock:
>  	pte_unmap_unlock(page_table, ptl);
>  	if (mmun_end > mmun_start)
> -		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +		mmu_notifier_invalidate_range_end(mm, mmun_start,
> +						  mmun_end, MMU_MIGRATE);
>  	if (old_page) {
>  		/*
>  		 * Don't let another task, with possibly unlocked vma,
> diff --git a/mm/migrate.c b/mm/migrate.c
> index ab43fbf..b526c72 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1820,12 +1820,14 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>  	WARN_ON(PageLRU(new_page));
>  
>  	/* Recheck the target PMD */
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +					    mmun_end, MMU_MIGRATE);
>  	ptl = pmd_lock(mm, pmd);
>  	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
>  fail_putback:
>  		spin_unlock(ptl);
> -		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +		mmu_notifier_invalidate_range_end(mm, mmun_start,
> +						  mmun_end, MMU_MIGRATE);
>  
>  		/* Reverse changes made by migrate_page_copy() */
>  		if (TestClearPageActive(new_page))
> @@ -1878,7 +1880,8 @@ fail_putback:
>  	page_remove_rmap(page);
>  
>  	spin_unlock(ptl);
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +					  mmun_end, MMU_MIGRATE);
>  
>  	/* Take an "isolate" reference and put new page on the LRU. */
>  	get_page(new_page);
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 41cefdf..9decb88 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -122,8 +122,10 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
>  	return young;
>  }
>  
> -void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
> -			       pte_t pte)
> +void __mmu_notifier_change_pte(struct mm_struct *mm,
> +			       unsigned long address,
> +			       pte_t pte,
> +			       enum mmu_event event)
>  {
>  	struct mmu_notifier *mn;
>  	int id;
> @@ -131,13 +133,14 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
>  	id = srcu_read_lock(&srcu);
>  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
>  		if (mn->ops->change_pte)
> -			mn->ops->change_pte(mn, mm, address, pte);
> +			mn->ops->change_pte(mn, mm, address, pte, event);
>  	}
>  	srcu_read_unlock(&srcu, id);
>  }
>  
>  void __mmu_notifier_invalidate_page(struct mm_struct *mm,
> -					  unsigned long address)
> +				    unsigned long address,
> +				    enum mmu_event event)
>  {
>  	struct mmu_notifier *mn;
>  	int id;
> @@ -145,13 +148,16 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
>  	id = srcu_read_lock(&srcu);
>  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
>  		if (mn->ops->invalidate_page)
> -			mn->ops->invalidate_page(mn, mm, address);
> +			mn->ops->invalidate_page(mn, mm, address, event);
>  	}
>  	srcu_read_unlock(&srcu, id);
>  }
>  
>  void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end)
> +					   unsigned long start,
> +					   unsigned long end,
> +					   enum mmu_event event)
> +
>  {
>  	struct mmu_notifier *mn;
>  	int id;
> @@ -159,14 +165,17 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
>  	id = srcu_read_lock(&srcu);
>  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
>  		if (mn->ops->invalidate_range_start)
> -			mn->ops->invalidate_range_start(mn, mm, start, end);
> +			mn->ops->invalidate_range_start(mn, mm, start,
> +							end, event);
>  	}
>  	srcu_read_unlock(&srcu, id);
>  }
>  EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
>  
>  void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end)
> +					 unsigned long start,
> +					 unsigned long end,
> +					 enum mmu_event event)
>  {
>  	struct mmu_notifier *mn;
>  	int id;
> @@ -174,7 +183,8 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
>  	id = srcu_read_lock(&srcu);
>  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
>  		if (mn->ops->invalidate_range_end)
> -			mn->ops->invalidate_range_end(mn, mm, start, end);
> +			mn->ops->invalidate_range_end(mn, mm, start,
> +						      end, event);
>  	}
>  	srcu_read_unlock(&srcu, id);
>  }
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index c43d557..6ce6c23 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -137,7 +137,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  
>  static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>  		pud_t *pud, unsigned long addr, unsigned long end,
> -		pgprot_t newprot, int dirty_accountable, int prot_numa)
> +		pgprot_t newprot, int dirty_accountable, int prot_numa,
> +		enum mmu_event event)
>  {
>  	pmd_t *pmd;
>  	struct mm_struct *mm = vma->vm_mm;
> @@ -157,7 +158,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>  		/* invoke the mmu notifier if the pmd is populated */
>  		if (!mni_start) {
>  			mni_start = addr;
> -			mmu_notifier_invalidate_range_start(mm, mni_start, end);
> +			mmu_notifier_invalidate_range_start(mm, mni_start,
> +							    end, event);
>  		}
>  
>  		if (pmd_trans_huge(*pmd)) {
> @@ -185,7 +187,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>  	} while (pmd++, addr = next, addr != end);
>  
>  	if (mni_start)
> -		mmu_notifier_invalidate_range_end(mm, mni_start, end);
> +		mmu_notifier_invalidate_range_end(mm, mni_start, end, event);
>  
>  	if (nr_huge_updates)
>  		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
> @@ -194,7 +196,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>  
>  static inline unsigned long change_pud_range(struct vm_area_struct *vma,
>  		pgd_t *pgd, unsigned long addr, unsigned long end,
> -		pgprot_t newprot, int dirty_accountable, int prot_numa)
> +		pgprot_t newprot, int dirty_accountable, int prot_numa,
> +		enum mmu_event event)
>  {
>  	pud_t *pud;
>  	unsigned long next;
> @@ -206,7 +209,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
>  		if (pud_none_or_clear_bad(pud))
>  			continue;
>  		pages += change_pmd_range(vma, pud, addr, next, newprot,
> -				 dirty_accountable, prot_numa);
> +				 dirty_accountable, prot_numa, event);
>  	} while (pud++, addr = next, addr != end);
>  
>  	return pages;
> @@ -214,7 +217,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
>  
>  static unsigned long change_protection_range(struct vm_area_struct *vma,
>  		unsigned long addr, unsigned long end, pgprot_t newprot,
> -		int dirty_accountable, int prot_numa)
> +		int dirty_accountable, int prot_numa, enum mmu_event event)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	pgd_t *pgd;
> @@ -231,7 +234,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
>  		if (pgd_none_or_clear_bad(pgd))
>  			continue;
>  		pages += change_pud_range(vma, pgd, addr, next, newprot,
> -				 dirty_accountable, prot_numa);
> +				 dirty_accountable, prot_numa, event);
>  	} while (pgd++, addr = next, addr != end);
>  
>  	/* Only flush the TLB if we actually modified any entries: */
> @@ -247,11 +250,23 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
>  		       int dirty_accountable, int prot_numa)
>  {
>  	unsigned long pages;
> +	enum mmu_event event = MMU_MPROT_NONE;
> +
> +	/* At this points vm_flags is updated. */
> +	if ((vma->vm_flags & VM_READ) && (vma->vm_flags & VM_WRITE))
> +		event = MMU_MPROT_RANDW;
> +	else if (vma->vm_flags & VM_WRITE)
> +		event = MMU_MPROT_WONLY;
> +	else if (vma->vm_flags & VM_READ)
> +		event = MMU_MPROT_RONLY;

hmmm, shouldn't we be checking against the newprot argument, instead of
against vma->vm_flags?  The calling code, mprotect_fixup for example, can
set flags *other* than VM_READ or VM_WRITE, and that could lead to a
confusing or even inaccurate event. We could have a case where the event
type is MMU_MPROT_RONLY, but the page might have been read-only the entire
time, and some other flag was actually getting set.

I'm also starting to wonder if this event is adding much value here (for
protection changes), since the newprot argument already carries the same
information. Still, it is important to have a unified reporting scheme for
HMM, so that's probably a good enough reason to do this.
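
Just to make that concrete, here is a rough sketch of the alternative; the
helper name is made up, and it assumes the caller can hand down the new
vm_flags it has already computed (mprotect_fixup()'s newflags, say):

static enum mmu_event mmu_event_from_prot(unsigned long newflags)
{
	/* Derive the event from the new protection, not from vma->vm_flags. */
	if (!(newflags & (VM_READ | VM_WRITE)))
		return MMU_MPROT_NONE;
	if ((newflags & VM_READ) && (newflags & VM_WRITE))
		return MMU_MPROT_RANDW;
	if (newflags & VM_WRITE)
		return MMU_MPROT_WONLY;
	return MMU_MPROT_RONLY;
}

change_protection() (or its callers) could then pass the result down, instead
of re-deriving the event from the vma after the flags have been updated.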

>  
>  	if (is_vm_hugetlb_page(vma))
> -		pages = hugetlb_change_protection(vma, start, end, newprot);
> +		pages = hugetlb_change_protection(vma, start, end,
> +						  newprot, event);
>  	else
> -		pages = change_protection_range(vma, start, end, newprot, dirty_accountable, prot_numa);
> +		pages = change_protection_range(vma, start, end, newprot,
> +						dirty_accountable,
> +						prot_numa, event);
>  
>  	return pages;
>  }
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 05f1180..6827d2f 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -177,7 +177,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>  
>  	mmun_start = old_addr;
>  	mmun_end   = old_end;
> -	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start,
> +					    mmun_end, MMU_MIGRATE);
>  
>  	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
>  		cond_resched();
> @@ -228,7 +229,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>  	if (likely(need_flush))
>  		flush_tlb_range(vma, old_end-len, old_addr);
>  
> -	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start,
> +					  mmun_end, MMU_MIGRATE);
>  
>  	return len + old_addr - old_end;	/* how much done */
>  }
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 7928ddd..bd7e6d7 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -840,7 +840,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>  	pte_unmap_unlock(pte, ptl);
>  
>  	if (ret) {
> -		mmu_notifier_invalidate_page(mm, address);
> +		mmu_notifier_invalidate_page(mm, address, MMU_WB);
>  		(*cleaned)++;
>  	}
>  out:
> @@ -1128,6 +1128,10 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  	spinlock_t *ptl;
>  	int ret = SWAP_AGAIN;
>  	enum ttu_flags flags = (enum ttu_flags)arg;
> +	enum mmu_event event = MMU_MIGRATE;
> +
> +	if (flags & TTU_MUNLOCK)
> +		event = MMU_STATUS;
>  
>  	pte = page_check_address(page, mm, address, &ptl, 0);
>  	if (!pte)
> @@ -1233,7 +1237,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  out_unmap:
>  	pte_unmap_unlock(pte, ptl);
>  	if (ret != SWAP_FAIL && !(flags & TTU_MUNLOCK))
> -		mmu_notifier_invalidate_page(mm, address);
> +		mmu_notifier_invalidate_page(mm, address, event);
>  out:
>  	return ret;
>  
> @@ -1287,7 +1291,9 @@ out_mlock:
>  #define CLUSTER_MASK	(~(CLUSTER_SIZE - 1))
>  
>  static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
> -		struct vm_area_struct *vma, struct page *check_page)
> +				struct vm_area_struct *vma,
> +				struct page *check_page,
> +				enum ttu_flags flags)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	pmd_t *pmd;
> @@ -1301,6 +1307,10 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
>  	unsigned long end;
>  	int ret = SWAP_AGAIN;
>  	int locked_vma = 0;
> +	enum mmu_event event = MMU_MIGRATE;
> +
> +	if (flags & TTU_MUNLOCK)
> +		event = MMU_STATUS;
>  
>  	address = (vma->vm_start + cursor) & CLUSTER_MASK;
>  	end = address + CLUSTER_SIZE;
> @@ -1315,7 +1325,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
>  
>  	mmun_start = address;
>  	mmun_end   = end;
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, event);
>  
>  	/*
>  	 * If we can acquire the mmap_sem for read, and vma is VM_LOCKED,
> @@ -1380,7 +1390,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
>  		(*mapcount)--;
>  	}
>  	pte_unmap_unlock(pte - 1, ptl);
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, event);
>  	if (locked_vma)
>  		up_read(&vma->vm_mm->mmap_sem);
>  	return ret;
> @@ -1436,7 +1446,9 @@ static int try_to_unmap_nonlinear(struct page *page,
>  			while (cursor < max_nl_cursor &&
>  				cursor < vma->vm_end - vma->vm_start) {
>  				if (try_to_unmap_cluster(cursor, &mapcount,
> -						vma, page) == SWAP_MLOCK)
> +							 vma, page,
> +							 (enum ttu_flags)arg)
> +							 == SWAP_MLOCK)
>  					ret = SWAP_MLOCK;
>  				cursor += CLUSTER_SIZE;
>  				vma->vm_private_data = (void *) cursor;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 4b6c01b..6e1992f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -262,7 +262,8 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
>  
>  static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
>  					     struct mm_struct *mm,
> -					     unsigned long address)
> +					     unsigned long address,
> +					     enum mmu_event event)
>  {
>  	struct kvm *kvm = mmu_notifier_to_kvm(mn);
>  	int need_tlb_flush, idx;
> @@ -301,7 +302,8 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
>  static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>  					struct mm_struct *mm,
>  					unsigned long address,
> -					pte_t pte)
> +					pte_t pte,
> +					enum mmu_event event)
>  {
>  	struct kvm *kvm = mmu_notifier_to_kvm(mn);
>  	int idx;
> @@ -317,7 +319,8 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>  static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  						    struct mm_struct *mm,
>  						    unsigned long start,
> -						    unsigned long end)
> +						    unsigned long end,
> +						    enum mmu_event event)
>  {
>  	struct kvm *kvm = mmu_notifier_to_kvm(mn);
>  	int need_tlb_flush = 0, idx;
> @@ -343,7 +346,8 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
>  						  struct mm_struct *mm,
>  						  unsigned long start,
> -						  unsigned long end)
> +						  unsigned long end,
> +						  enum mmu_event event)
>  {
>  	struct kvm *kvm = mmu_notifier_to_kvm(mn);
>  
> -- 
> 1.9.0
> 

thanks,
John H.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 3/6] mmu_notifier: add event information to address invalidation v2
@ 2014-06-30  5:22     ` John Hubbard
  0 siblings, 0 replies; 76+ messages in thread
From: John Hubbard @ 2014-06-30  5:22 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: akpm, linux-mm, linux-kernel, mgorman, hpa, peterz, aarcange,
	riel, jweiner, torvalds, Mark Hairgrove, Jatin Kumar,
	Subhash Gutti, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Sherry Cheung, Duncan Poole, Oded Gabbay,
	Alexander Deucher, Andrew Lewycky, Jérôme Glisse

On Fri, 27 Jun 2014, Jérôme Glisse wrote:

> From: Jérôme Glisse <jglisse@redhat.com>
> 
> The event information will be useful for new users of the mmu_notifier API.
> The event argument differentiates between a vma disappearing, a page
> being write protected or simply a page being unmapped. This allows new
> users to take different paths for different events: for instance, on unmap
> the resources used to track a vma are still valid and should stay around,
> while if the event says that a vma is being destroyed, any resources used
> to track this vma can be freed.
> 
> Changed since v1:
>   - renamed action to event (updated commit message too).
>   - simplified the event names and clarified their intended usage,
>     also documenting what expectations the listener can have with
>     respect to each event.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> ---
>  drivers/gpu/drm/i915/i915_gem_userptr.c |   3 +-
>  drivers/iommu/amd_iommu_v2.c            |  14 ++--
>  drivers/misc/sgi-gru/grutlbpurge.c      |   9 ++-
>  drivers/xen/gntdev.c                    |   9 ++-
>  fs/proc/task_mmu.c                      |   6 +-
>  include/linux/hugetlb.h                 |   7 +-
>  include/linux/mmu_notifier.h            | 117 ++++++++++++++++++++++++++------
>  kernel/events/uprobes.c                 |  10 ++-
>  mm/filemap_xip.c                        |   2 +-
>  mm/huge_memory.c                        |  51 ++++++++------
>  mm/hugetlb.c                            |  25 ++++---
>  mm/ksm.c                                |  18 +++--
>  mm/memory.c                             |  27 +++++---
>  mm/migrate.c                            |   9 ++-
>  mm/mmu_notifier.c                       |  28 +++++---
>  mm/mprotect.c                           |  33 ++++++---
>  mm/mremap.c                             |   6 +-
>  mm/rmap.c                               |  24 +++++--
>  virt/kvm/kvm_main.c                     |  12 ++--
>  19 files changed, 291 insertions(+), 119 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> index 21ea928..ed6f35e 100644
> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> @@ -56,7 +56,8 @@ struct i915_mmu_object {
>  static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>  						       struct mm_struct *mm,
>  						       unsigned long start,
> -						       unsigned long end)
> +						       unsigned long end,
> +						       enum mmu_event event)
>  {
>  	struct i915_mmu_notifier *mn = container_of(_mn, struct i915_mmu_notifier, mn);
>  	struct interval_tree_node *it = NULL;
> diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
> index 499b436..2bb9771 100644
> --- a/drivers/iommu/amd_iommu_v2.c
> +++ b/drivers/iommu/amd_iommu_v2.c
> @@ -414,21 +414,25 @@ static int mn_clear_flush_young(struct mmu_notifier *mn,
>  static void mn_change_pte(struct mmu_notifier *mn,
>  			  struct mm_struct *mm,
>  			  unsigned long address,
> -			  pte_t pte)
> +			  pte_t pte,
> +			  enum mmu_event event)
>  {
>  	__mn_flush_page(mn, address);
>  }
>  
>  static void mn_invalidate_page(struct mmu_notifier *mn,
>  			       struct mm_struct *mm,
> -			       unsigned long address)
> +			       unsigned long address,
> +			       enum mmu_event event)
>  {
>  	__mn_flush_page(mn, address);
>  }
>  
>  static void mn_invalidate_range_start(struct mmu_notifier *mn,
>  				      struct mm_struct *mm,
> -				      unsigned long start, unsigned long end)
> +				      unsigned long start,
> +				      unsigned long end,
> +				      enum mmu_event event)
>  {
>  	struct pasid_state *pasid_state;
>  	struct device_state *dev_state;
> @@ -449,7 +453,9 @@ static void mn_invalidate_range_start(struct mmu_notifier *mn,
>  
>  static void mn_invalidate_range_end(struct mmu_notifier *mn,
>  				    struct mm_struct *mm,
> -				    unsigned long start, unsigned long end)
> +				    unsigned long start,
> +				    unsigned long end,
> +				    enum mmu_event event)
>  {
>  	struct pasid_state *pasid_state;
>  	struct device_state *dev_state;
> diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
> index 2129274..e67fed1 100644
> --- a/drivers/misc/sgi-gru/grutlbpurge.c
> +++ b/drivers/misc/sgi-gru/grutlbpurge.c
> @@ -221,7 +221,8 @@ void gru_flush_all_tlb(struct gru_state *gru)
>   */
>  static void gru_invalidate_range_start(struct mmu_notifier *mn,
>  				       struct mm_struct *mm,
> -				       unsigned long start, unsigned long end)
> +				       unsigned long start, unsigned long end,
> +				       enum mmu_event event)
>  {
>  	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
>  						 ms_notifier);
> @@ -235,7 +236,8 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
>  
>  static void gru_invalidate_range_end(struct mmu_notifier *mn,
>  				     struct mm_struct *mm, unsigned long start,
> -				     unsigned long end)
> +				     unsigned long end,
> +				     enum mmu_event event)
>  {
>  	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
>  						 ms_notifier);
> @@ -248,7 +250,8 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
>  }
>  
>  static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
> -				unsigned long address)
> +				unsigned long address,
> +				enum mmu_event event)
>  {
>  	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
>  						 ms_notifier);
> diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
> index 073b4a1..fe9da94 100644
> --- a/drivers/xen/gntdev.c
> +++ b/drivers/xen/gntdev.c
> @@ -428,7 +428,9 @@ static void unmap_if_in_range(struct grant_map *map,
>  
>  static void mn_invl_range_start(struct mmu_notifier *mn,
>  				struct mm_struct *mm,
> -				unsigned long start, unsigned long end)
> +				unsigned long start,
> +				unsigned long end,
> +				enum mmu_event event)
>  {
>  	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
>  	struct grant_map *map;
> @@ -445,9 +447,10 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
>  
>  static void mn_invl_page(struct mmu_notifier *mn,
>  			 struct mm_struct *mm,
> -			 unsigned long address)
> +			 unsigned long address,
> +			 enum mmu_event event)
>  {
> -	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE);
> +	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, event);
>  }
>  
>  static void mn_release(struct mmu_notifier *mn,
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index cfa63ee..e9e79f7 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -830,7 +830,8 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
>  		};
>  		down_read(&mm->mmap_sem);
>  		if (type == CLEAR_REFS_SOFT_DIRTY)
> -			mmu_notifier_invalidate_range_start(mm, 0, -1);
> +			mmu_notifier_invalidate_range_start(mm, 0,
> +							    -1, MMU_STATUS);
>  		for (vma = mm->mmap; vma; vma = vma->vm_next) {
>  			cp.vma = vma;
>  			if (is_vm_hugetlb_page(vma))
> @@ -858,7 +859,8 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
>  					&clear_refs_walk);
>  		}
>  		if (type == CLEAR_REFS_SOFT_DIRTY)
> -			mmu_notifier_invalidate_range_end(mm, 0, -1);
> +			mmu_notifier_invalidate_range_end(mm, 0,
> +							  -1, MMU_STATUS);
>  		flush_tlb_mm(mm);
>  		up_read(&mm->mmap_sem);
>  		mmput(mm);
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 6a836ef..d7e512f 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -6,6 +6,7 @@
>  #include <linux/fs.h>
>  #include <linux/hugetlb_inline.h>
>  #include <linux/cgroup.h>
> +#include <linux/mmu_notifier.h>
>  #include <linux/list.h>
>  #include <linux/kref.h>
>  
> @@ -103,7 +104,8 @@ struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
>  int pmd_huge(pmd_t pmd);
>  int pud_huge(pud_t pmd);
>  unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> -		unsigned long address, unsigned long end, pgprot_t newprot);
> +		unsigned long address, unsigned long end, pgprot_t newprot,
> +		enum mmu_event event);
>  
>  #else /* !CONFIG_HUGETLB_PAGE */
>  
> @@ -148,7 +150,8 @@ static inline bool isolate_huge_page(struct page *page, struct list_head *list)
>  #define is_hugepage_active(x)	false
>  
>  static inline unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> -		unsigned long address, unsigned long end, pgprot_t newprot)
> +		unsigned long address, unsigned long end, pgprot_t newprot,
> +		enum mmu_event event)
>  {
>  	return 0;
>  }
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index deca874..82e9577 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -9,6 +9,52 @@
>  struct mmu_notifier;
>  struct mmu_notifier_ops;
>  
> +/* Event report finer informations to the callback allowing the event listener
> + * to take better action. There are only few kinds of events :
> + *
> + *   - MMU_MIGRATE memory is migrating from one page to another thus all write
> + *     access must stop after invalidate_range_start callback returns. And no
> + *     read access should be allowed either as new page can be remapped with
> + *     write access before the invalidate_range_end callback happen and thus
> + *     any read access to old page might access outdated informations. Several
> + *     source to this event like page moving to swap (for various reasons like
> + *     page reclaim), outcome of mremap syscall, migration for numa reasons,
> + *     balancing memory pool, write fault on read only page trigger a new page
> + *     to be allocated and used, ...
> + *   - MMU_MPROT_NONE memory access protection is change, no page in the range
> + *     can be accessed in either read or write mode but the range of address
> + *     is still valid. All access are still fine until invalidate_range_end
> + *     callback returns.
> + *   - MMU_MPROT_RONLY memory access proctection is changing to read only.
> + *     All access are still fine until invalidate_range_end callback returns.
> + *   - MMU_MPROT_RANDW memory access proctection is changing to read an write.
> + *     All access are still fine until invalidate_range_end callback returns.
> + *   - MMU_MPROT_WONLY memory access proctection is changing to write only.
> + *     All access are still fine until invalidate_range_end callback returns.
> + *   - MMU_MUNMAP the range is being unmaped (outcome of a munmap syscall). It
> + *     is fine to still have read/write access until the invalidate_range_end
> + *     callback returns. This also imply that secondary page table can be trim
> + *     as the address range is no longer valid.
> + *   - MMU_WB memory is being write back to disk, all write access must stop
> + *     after invalidate_range_start callback returns. Read access are still
> + *     allowed.
> + *   - MMU_STATUS memory status change, like soft dirty.
> + *
> + * In doubt when adding a new notifier caller use MMU_MIGRATE it will always
> + * result in expected behavior but will not allow listener a chance to optimize
> + * its events.
> + */

Here is a pass at tightening up that documentation:

/* MMU Events report fine-grained information to the callback routine, allowing
 * the event listener to make a more informed decision as to what action to
 * take. The event types are:
 *
 *   - MMU_MIGRATE: memory is migrating from one page to another, thus all write
 *     access must stop after invalidate_range_start callback returns.
 *     Furthermore, no read access should be allowed either, as a new page can
 *     be remapped with write access before the invalidate_range_end callback
 *     happens and thus any read access to old page might read stale data. There
 *     are several sources for this event, including:
 *
 *         - A page moving to swap (for various reasons, including page
 *           reclaim),
 *         - An mremap syscall,
 *         - migration for NUMA reasons,
 *         - balancing the memory pool,
 *         - write fault on a read-only page triggers a new page to be allocated
 *           and used,
 *         - and more that are not listed here.
 *
 *   - MMU_MPROT_NONE: memory access protection is changing to "none": no page
 *     in the range can be accessed in either read or write mode but the range
 *     of addresses is still valid. However, access is still allowed, up until
 *     invalidate_range_end callback returns.
 *
 *   - MMU_MPROT_RONLY: memory access protection is changing to read only.
 *     However, access is still allowed, up until invalidate_range_end callback
 *     returns.
 *
 *   - MMU_MPROT_RANDW: memory access protection is changing to read and write.
 *     However, access is still allowed, up until invalidate_range_end callback
 *     returns.
 *
 *   - MMU_MPROT_WONLY: memory access protection is changing to write only.
 *     However, access is still allowed, up until invalidate_range_end callback
 *     returns.
 *
 *   - MMU_MUNMAP: the range is being unmapped (outcome of a munmap syscall).
 *     However, access is still allowed, up until invalidate_range_end callback
 *     returns. This also implies that the secondary page table can be trimmed,
 *     because the address range is no longer valid.
 *
 *   - MMU_WB: memory is being written back to disk, all write accesses must
 *     stop after invalidate_range_start callback returns. Read access are still
 *     allowed.
 *
 *   - MMU_STATUS: memory status change, like soft dirty, or huge page
 *     splitting (in place).
 *
 * If in doubt when adding a new notifier caller, please use MMU_MIGRATE,
 * because it will always lead to reasonable behavior, but will not allow the
 * listener a chance to optimize its events.
 */

Mostly just cleaning up the wording, except that I did add "huge page 
splitting" to the cases that could cause an MMU_STATUS to fire.

> +enum mmu_event {
> +	MMU_MIGRATE = 0,
> +	MMU_MPROT_NONE,
> +	MMU_MPROT_RONLY,
> +	MMU_MPROT_RANDW,
> +	MMU_MPROT_WONLY,
> +	MMU_MUNMAP,
> +	MMU_STATUS,
> +	MMU_WB,
> +};
> +
>  #ifdef CONFIG_MMU_NOTIFIER
>  
>  /*
> @@ -79,7 +125,8 @@ struct mmu_notifier_ops {
>  	void (*change_pte)(struct mmu_notifier *mn,
>  			   struct mm_struct *mm,
>  			   unsigned long address,
> -			   pte_t pte);
> +			   pte_t pte,
> +			   enum mmu_event event);
>  
>  	/*
>  	 * Before this is invoked any secondary MMU is still ok to
> @@ -90,7 +137,8 @@ struct mmu_notifier_ops {
>  	 */
>  	void (*invalidate_page)(struct mmu_notifier *mn,
>  				struct mm_struct *mm,
> -				unsigned long address);
> +				unsigned long address,
> +				enum mmu_event event);
>  
>  	/*
>  	 * invalidate_range_start() and invalidate_range_end() must be
> @@ -137,10 +185,14 @@ struct mmu_notifier_ops {
>  	 */
>  	void (*invalidate_range_start)(struct mmu_notifier *mn,
>  				       struct mm_struct *mm,
> -				       unsigned long start, unsigned long end);
> +				       unsigned long start,
> +				       unsigned long end,
> +				       enum mmu_event event);
>  	void (*invalidate_range_end)(struct mmu_notifier *mn,
>  				     struct mm_struct *mm,
> -				     unsigned long start, unsigned long end);
> +				     unsigned long start,
> +				     unsigned long end,
> +				     enum mmu_event event);
>  };
>  
>  /*
> @@ -177,13 +229,20 @@ extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
>  extern int __mmu_notifier_test_young(struct mm_struct *mm,
>  				     unsigned long address);
>  extern void __mmu_notifier_change_pte(struct mm_struct *mm,
> -				      unsigned long address, pte_t pte);
> +				      unsigned long address,
> +				      pte_t pte,
> +				      enum mmu_event event);
>  extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
> -					  unsigned long address);
> +					  unsigned long address,
> +					  enum mmu_event event);
>  extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end);
> +						  unsigned long start,
> +						  unsigned long end,
> +						  enum mmu_event event);
>  extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end);
> +						unsigned long start,
> +						unsigned long end,
> +						enum mmu_event event);
>  
>  static inline void mmu_notifier_release(struct mm_struct *mm)
>  {
> @@ -208,31 +267,38 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
>  }
>  
>  static inline void mmu_notifier_change_pte(struct mm_struct *mm,
> -					   unsigned long address, pte_t pte)
> +					   unsigned long address,
> +					   pte_t pte,
> +					   enum mmu_event event)
>  {
>  	if (mm_has_notifiers(mm))
> -		__mmu_notifier_change_pte(mm, address, pte);
> +		__mmu_notifier_change_pte(mm, address, pte, event);
>  }
>  
>  static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
> -					  unsigned long address)
> +						unsigned long address,
> +						enum mmu_event event)
>  {
>  	if (mm_has_notifiers(mm))
> -		__mmu_notifier_invalidate_page(mm, address);
> +		__mmu_notifier_invalidate_page(mm, address, event);
>  }
>  
>  static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end)
> +						       unsigned long start,
> +						       unsigned long end,
> +						       enum mmu_event event)
>  {
>  	if (mm_has_notifiers(mm))
> -		__mmu_notifier_invalidate_range_start(mm, start, end);
> +		__mmu_notifier_invalidate_range_start(mm, start, end, event);
>  }
>  
>  static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end)
> +						     unsigned long start,
> +						     unsigned long end,
> +						     enum mmu_event event)
>  {
>  	if (mm_has_notifiers(mm))
> -		__mmu_notifier_invalidate_range_end(mm, start, end);
> +		__mmu_notifier_invalidate_range_end(mm, start, end, event);
>  }
>  
>  static inline void mmu_notifier_mm_init(struct mm_struct *mm)
> @@ -278,13 +344,13 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
>   * old page would remain mapped readonly in the secondary MMUs after the new
>   * page is already writable by some CPU through the primary MMU.
>   */
> -#define set_pte_at_notify(__mm, __address, __ptep, __pte)		\
> +#define set_pte_at_notify(__mm, __address, __ptep, __pte, __event)	\
>  ({									\
>  	struct mm_struct *___mm = __mm;					\
>  	unsigned long ___address = __address;				\
>  	pte_t ___pte = __pte;						\
>  									\
> -	mmu_notifier_change_pte(___mm, ___address, ___pte);		\
> +	mmu_notifier_change_pte(___mm, ___address, ___pte, __event);	\
>  	set_pte_at(___mm, ___address, __ptep, ___pte);			\
>  })
>  
> @@ -307,22 +373,29 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
>  }
>  
>  static inline void mmu_notifier_change_pte(struct mm_struct *mm,
> -					   unsigned long address, pte_t pte)
> +					   unsigned long address,
> +					   pte_t pte,
> +					   enum mmu_event event)
>  {
>  }
>  
>  static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
> -					  unsigned long address)
> +						unsigned long address,
> +						enum mmu_event event)
>  {
>  }
>  
>  static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end)
> +						       unsigned long start,
> +						       unsigned long end,
> +						       enum mmu_event event)
>  {
>  }
>  
>  static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end)
> +						     unsigned long start,
> +						     unsigned long end,
> +						     enum mmu_event event)
>  {
>  }
>  
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 32b04dc..296f81e 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -177,7 +177,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
>  	/* For try_to_free_swap() and munlock_vma_page() below */
>  	lock_page(page);
>  
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +					    mmun_end, MMU_MIGRATE);
>  	err = -EAGAIN;
>  	ptep = page_check_address(page, mm, addr, &ptl, 0);
>  	if (!ptep)
> @@ -195,7 +196,9 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
>  
>  	flush_cache_page(vma, addr, pte_pfn(*ptep));
>  	ptep_clear_flush(vma, addr, ptep);
> -	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
> +	set_pte_at_notify(mm, addr, ptep,
> +			  mk_pte(kpage, vma->vm_page_prot),
> +			  MMU_MIGRATE);
>  
>  	page_remove_rmap(page);
>  	if (!page_mapped(page))
> @@ -209,7 +212,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
>  	err = 0;
>   unlock:
>  	mem_cgroup_cancel_charge(kpage, memcg);
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +					  mmun_end, MMU_MIGRATE);
>  	unlock_page(page);
>  	return err;
>  }
> diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
> index d8d9fe3..a2b3f09 100644
> --- a/mm/filemap_xip.c
> +++ b/mm/filemap_xip.c
> @@ -198,7 +198,7 @@ retry:
>  			BUG_ON(pte_dirty(pteval));
>  			pte_unmap_unlock(pte, ptl);
>  			/* must invalidate_page _before_ freeing the page */
> -			mmu_notifier_invalidate_page(mm, address);
> +			mmu_notifier_invalidate_page(mm, address, MMU_MIGRATE);
>  			page_cache_release(page);
>  		}
>  	}
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 5d562a9..fa30857 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1020,6 +1020,11 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
>  		set_page_private(pages[i], (unsigned long)memcg);
>  	}
>  
> +	mmun_start = haddr;
> +	mmun_end   = haddr + HPAGE_PMD_SIZE;
> +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +					    mmun_end, MMU_MIGRATE);
> +
>  	for (i = 0; i < HPAGE_PMD_NR; i++) {
>  		copy_user_highpage(pages[i], page + i,
>  				   haddr + PAGE_SIZE * i, vma);
> @@ -1027,10 +1032,6 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
>  		cond_resched();
>  	}
>  
> -	mmun_start = haddr;
> -	mmun_end   = haddr + HPAGE_PMD_SIZE;
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> -
>  	ptl = pmd_lock(mm, pmd);
>  	if (unlikely(!pmd_same(*pmd, orig_pmd)))
>  		goto out_free_pages;

So it looks like you are fixing a pre-existing bug here? The
invalidate_range_start call now happens *before* we copy the pages. That
seems correct, although this is starting to get into code I'm less
comfortable with (huge pages). But I think it's worth mentioning in the
commit message.

> @@ -1063,7 +1064,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
>  	page_remove_rmap(page);
>  	spin_unlock(ptl);
>  
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +					  mmun_end, MMU_MIGRATE);
>  
>  	ret |= VM_FAULT_WRITE;
>  	put_page(page);
> @@ -1073,7 +1075,8 @@ out:
>  
>  out_free_pages:
>  	spin_unlock(ptl);
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +					  mmun_end, MMU_MIGRATE);
>  	for (i = 0; i < HPAGE_PMD_NR; i++) {
>  		memcg = (void *)page_private(pages[i]);
>  		set_page_private(pages[i], 0);
> @@ -1157,16 +1160,17 @@ alloc:
>  
>  	count_vm_event(THP_FAULT_ALLOC);
>  
> +	mmun_start = haddr;
> +	mmun_end   = haddr + HPAGE_PMD_SIZE;
> +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +					    mmun_end, MMU_MIGRATE);
> +
>  	if (!page)
>  		clear_huge_page(new_page, haddr, HPAGE_PMD_NR);
>  	else
>  		copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
>  	__SetPageUptodate(new_page);
>  
> -	mmun_start = haddr;
> -	mmun_end   = haddr + HPAGE_PMD_SIZE;
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> -

Another bug fix, OK.

>  	spin_lock(ptl);
>  	if (page)
>  		put_user_huge_page(page);
> @@ -1197,7 +1201,8 @@ alloc:
>  	}
>  	spin_unlock(ptl);
>  out_mn:
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +					  mmun_end, MMU_MIGRATE);
>  out:
>  	return ret;
>  out_unlock:
> @@ -1632,7 +1637,8 @@ static int __split_huge_page_splitting(struct page *page,
>  	const unsigned long mmun_start = address;
>  	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
>  
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +					    mmun_end, MMU_STATUS);

OK, just to be sure: we are not moving the page contents at this point,
right? We are just changing the page table from a single "huge" entry into
lots of little 4K page entries? If so, then MMU_STATUS seems correct, but we
should add that case to the "Event types" documentation above.

>  	pmd = page_check_address_pmd(page, mm, address,
>  			PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
>  	if (pmd) {
> @@ -1647,7 +1653,8 @@ static int __split_huge_page_splitting(struct page *page,
>  		ret = 1;
>  		spin_unlock(ptl);
>  	}
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +					  mmun_end, MMU_STATUS);
>  
>  	return ret;
>  }
> @@ -2446,7 +2453,8 @@ static void collapse_huge_page(struct mm_struct *mm,
>  
>  	mmun_start = address;
>  	mmun_end   = address + HPAGE_PMD_SIZE;
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +					    mmun_end, MMU_MIGRATE);
>  	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
>  	/*
>  	 * After this gup_fast can't run anymore. This also removes
> @@ -2456,7 +2464,8 @@ static void collapse_huge_page(struct mm_struct *mm,
>  	 */
>  	_pmd = pmdp_clear_flush(vma, address, pmd);
>  	spin_unlock(pmd_ptl);
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +					  mmun_end, MMU_MIGRATE);
>  
>  	spin_lock(pte_ptl);
>  	isolated = __collapse_huge_page_isolate(vma, address, pte);
> @@ -2845,24 +2854,28 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
>  	mmun_start = haddr;
>  	mmun_end   = haddr + HPAGE_PMD_SIZE;
>  again:
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +					    mmun_end, MMU_MIGRATE);

Just checking: this is MMU_MIGRATE, instead of MMU_STATUS, because we are 
actually moving data? (The pages backing the page table?)

>  	ptl = pmd_lock(mm, pmd);
>  	if (unlikely(!pmd_trans_huge(*pmd))) {
>  		spin_unlock(ptl);
> -		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +		mmu_notifier_invalidate_range_end(mm, mmun_start,
> +						  mmun_end, MMU_MIGRATE);
>  		return;
>  	}
>  	if (is_huge_zero_pmd(*pmd)) {
>  		__split_huge_zero_page_pmd(vma, haddr, pmd);
>  		spin_unlock(ptl);
> -		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +		mmu_notifier_invalidate_range_end(mm, mmun_start,
> +						  mmun_end, MMU_MIGRATE);
>  		return;
>  	}
>  	page = pmd_page(*pmd);
>  	VM_BUG_ON_PAGE(!page_count(page), page);
>  	get_page(page);
>  	spin_unlock(ptl);
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +					  mmun_end, MMU_MIGRATE);
>  
>  	split_huge_page(page);
>  
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 7faab71..73e1576 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2565,7 +2565,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  	mmun_start = vma->vm_start;
>  	mmun_end = vma->vm_end;
>  	if (cow)
> -		mmu_notifier_invalidate_range_start(src, mmun_start, mmun_end);
> +		mmu_notifier_invalidate_range_start(src, mmun_start,
> +						    mmun_end, MMU_MIGRATE);
>  
>  	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
>  		spinlock_t *src_ptl, *dst_ptl;
> @@ -2615,7 +2616,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  	}
>  
>  	if (cow)
> -		mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end);
> +		mmu_notifier_invalidate_range_end(src, mmun_start,
> +						  mmun_end, MMU_MIGRATE);
>  
>  	return ret;
>  }
> @@ -2641,7 +2643,8 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  	BUG_ON(end & ~huge_page_mask(h));
>  
>  	tlb_start_vma(tlb, vma);
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +					    mmun_end, MMU_MIGRATE);
>  again:
>  	for (address = start; address < end; address += sz) {
>  		ptep = huge_pte_offset(mm, address);
> @@ -2712,7 +2715,8 @@ unlock:
>  		if (address < end && !ref_page)
>  			goto again;
>  	}
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +					  mmun_end, MMU_MIGRATE);
>  	tlb_end_vma(tlb, vma);
>  }
>  
> @@ -2899,7 +2903,8 @@ retry_avoidcopy:
>  
>  	mmun_start = address & huge_page_mask(h);
>  	mmun_end = mmun_start + huge_page_size(h);
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +					    mmun_end, MMU_MIGRATE);
>  	/*
>  	 * Retake the page table lock to check for racing updates
>  	 * before the page tables are altered
> @@ -2919,7 +2924,8 @@ retry_avoidcopy:
>  		new_page = old_page;
>  	}
>  	spin_unlock(ptl);
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +					  mmun_end, MMU_MIGRATE);
>  	page_cache_release(new_page);
>  	page_cache_release(old_page);
>  
> @@ -3344,7 +3350,8 @@ same_page:
>  }
>  
>  unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> -		unsigned long address, unsigned long end, pgprot_t newprot)
> +		unsigned long address, unsigned long end, pgprot_t newprot,
> +		enum mmu_event event)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	unsigned long start = address;
> @@ -3356,7 +3363,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>  	BUG_ON(address >= end);
>  	flush_cache_range(vma, address, end);
>  
> -	mmu_notifier_invalidate_range_start(mm, start, end);
> +	mmu_notifier_invalidate_range_start(mm, start, end, event);
>  	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
>  	for (; address < end; address += huge_page_size(h)) {
>  		spinlock_t *ptl;
> @@ -3386,7 +3393,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>  	 */
>  	flush_tlb_range(vma, start, end);
>  	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
> -	mmu_notifier_invalidate_range_end(mm, start, end);
> +	mmu_notifier_invalidate_range_end(mm, start, end, event);
>  
>  	return pages << h->order;
>  }
> diff --git a/mm/ksm.c b/mm/ksm.c
> index cb1e976..4b659f1 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -873,7 +873,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
>  
>  	mmun_start = addr;
>  	mmun_end   = addr + PAGE_SIZE;
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +					    mmun_end, MMU_MPROT_RONLY);
>  
>  	ptep = page_check_address(page, mm, addr, &ptl, 0);
>  	if (!ptep)
> @@ -905,7 +906,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
>  		if (pte_dirty(entry))
>  			set_page_dirty(page);
>  		entry = pte_mkclean(pte_wrprotect(entry));
> -		set_pte_at_notify(mm, addr, ptep, entry);
> +		set_pte_at_notify(mm, addr, ptep, entry, MMU_MPROT_RONLY);
>  	}
>  	*orig_pte = *ptep;
>  	err = 0;
> @@ -913,7 +914,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
>  out_unlock:
>  	pte_unmap_unlock(ptep, ptl);
>  out_mn:
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +					  mmun_end, MMU_MPROT_RONLY);
>  out:
>  	return err;
>  }
> @@ -949,7 +951,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
>  
>  	mmun_start = addr;
>  	mmun_end   = addr + PAGE_SIZE;
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +					    mmun_end, MMU_MIGRATE);
>  
>  	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
>  	if (!pte_same(*ptep, orig_pte)) {
> @@ -962,7 +965,9 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
>  
>  	flush_cache_page(vma, addr, pte_pfn(*ptep));
>  	ptep_clear_flush(vma, addr, ptep);
> -	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
> +	set_pte_at_notify(mm, addr, ptep,
> +			  mk_pte(kpage, vma->vm_page_prot),
> +			  MMU_MIGRATE);
>  
>  	page_remove_rmap(page);
>  	if (!page_mapped(page))
> @@ -972,7 +977,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
>  	pte_unmap_unlock(ptep, ptl);
>  	err = 0;
>  out_mn:
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +					  mmun_end, MMU_MIGRATE);
>  out:
>  	return err;
>  }
> diff --git a/mm/memory.c b/mm/memory.c
> index 09e2cd0..d3908f0 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1050,7 +1050,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	mmun_end   = end;
>  	if (is_cow)
>  		mmu_notifier_invalidate_range_start(src_mm, mmun_start,
> -						    mmun_end);
> +						    mmun_end, MMU_MIGRATE);
>  
>  	ret = 0;
>  	dst_pgd = pgd_offset(dst_mm, addr);
> @@ -1067,7 +1067,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
>  
>  	if (is_cow)
> -		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end);
> +		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end,
> +						  MMU_MIGRATE);
>  	return ret;
>  }
>  
> @@ -1371,10 +1372,12 @@ void unmap_vmas(struct mmu_gather *tlb,
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  
> -	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
> +	mmu_notifier_invalidate_range_start(mm, start_addr,
> +					    end_addr, MMU_MUNMAP);
>  	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
>  		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
> -	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
> +	mmu_notifier_invalidate_range_end(mm, start_addr,
> +					  end_addr, MMU_MUNMAP);
>  }
>  
>  /**
> @@ -1396,10 +1399,10 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
>  	lru_add_drain();
>  	tlb_gather_mmu(&tlb, mm, start, end);
>  	update_hiwater_rss(mm);
> -	mmu_notifier_invalidate_range_start(mm, start, end);
> +	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MUNMAP);
>  	for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
>  		unmap_single_vma(&tlb, vma, start, end, details);
> -	mmu_notifier_invalidate_range_end(mm, start, end);
> +	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MUNMAP);
>  	tlb_finish_mmu(&tlb, start, end);
>  }
>  
> @@ -1422,9 +1425,9 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
>  	lru_add_drain();
>  	tlb_gather_mmu(&tlb, mm, address, end);
>  	update_hiwater_rss(mm);
> -	mmu_notifier_invalidate_range_start(mm, address, end);
> +	mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
>  	unmap_single_vma(&tlb, vma, address, end, details);
> -	mmu_notifier_invalidate_range_end(mm, address, end);
> +	mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
>  	tlb_finish_mmu(&tlb, address, end);
>  }
>  
> @@ -2208,7 +2211,8 @@ gotten:
>  
>  	mmun_start  = address & PAGE_MASK;
>  	mmun_end    = mmun_start + PAGE_SIZE;
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +					    mmun_end, MMU_MIGRATE);
>  
>  	/*
>  	 * Re-check the pte - we dropped the lock
> @@ -2240,7 +2244,7 @@ gotten:
>  		 * mmu page tables (such as kvm shadow page tables), we want the
>  		 * new page to be mapped directly into the secondary page table.
>  		 */
> -		set_pte_at_notify(mm, address, page_table, entry);
> +		set_pte_at_notify(mm, address, page_table, entry, MMU_MIGRATE);
>  		update_mmu_cache(vma, address, page_table);
>  		if (old_page) {
>  			/*
> @@ -2279,7 +2283,8 @@ gotten:
>  unlock:
>  	pte_unmap_unlock(page_table, ptl);
>  	if (mmun_end > mmun_start)
> -		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +		mmu_notifier_invalidate_range_end(mm, mmun_start,
> +						  mmun_end, MMU_MIGRATE);
>  	if (old_page) {
>  		/*
>  		 * Don't let another task, with possibly unlocked vma,
> diff --git a/mm/migrate.c b/mm/migrate.c
> index ab43fbf..b526c72 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1820,12 +1820,14 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>  	WARN_ON(PageLRU(new_page));
>  
>  	/* Recheck the target PMD */
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> +					    mmun_end, MMU_MIGRATE);
>  	ptl = pmd_lock(mm, pmd);
>  	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
>  fail_putback:
>  		spin_unlock(ptl);
> -		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +		mmu_notifier_invalidate_range_end(mm, mmun_start,
> +						  mmun_end, MMU_MIGRATE);
>  
>  		/* Reverse changes made by migrate_page_copy() */
>  		if (TestClearPageActive(new_page))
> @@ -1878,7 +1880,8 @@ fail_putback:
>  	page_remove_rmap(page);
>  
>  	spin_unlock(ptl);
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> +					  mmun_end, MMU_MIGRATE);
>  
>  	/* Take an "isolate" reference and put new page on the LRU. */
>  	get_page(new_page);
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 41cefdf..9decb88 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -122,8 +122,10 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
>  	return young;
>  }
>  
> -void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
> -			       pte_t pte)
> +void __mmu_notifier_change_pte(struct mm_struct *mm,
> +			       unsigned long address,
> +			       pte_t pte,
> +			       enum mmu_event event)
>  {
>  	struct mmu_notifier *mn;
>  	int id;
> @@ -131,13 +133,14 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
>  	id = srcu_read_lock(&srcu);
>  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
>  		if (mn->ops->change_pte)
> -			mn->ops->change_pte(mn, mm, address, pte);
> +			mn->ops->change_pte(mn, mm, address, pte, event);
>  	}
>  	srcu_read_unlock(&srcu, id);
>  }
>  
>  void __mmu_notifier_invalidate_page(struct mm_struct *mm,
> -					  unsigned long address)
> +				    unsigned long address,
> +				    enum mmu_event event)
>  {
>  	struct mmu_notifier *mn;
>  	int id;
> @@ -145,13 +148,16 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
>  	id = srcu_read_lock(&srcu);
>  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
>  		if (mn->ops->invalidate_page)
> -			mn->ops->invalidate_page(mn, mm, address);
> +			mn->ops->invalidate_page(mn, mm, address, event);
>  	}
>  	srcu_read_unlock(&srcu, id);
>  }
>  
>  void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end)
> +					   unsigned long start,
> +					   unsigned long end,
> +					   enum mmu_event event)
> +
>  {
>  	struct mmu_notifier *mn;
>  	int id;
> @@ -159,14 +165,17 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
>  	id = srcu_read_lock(&srcu);
>  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
>  		if (mn->ops->invalidate_range_start)
> -			mn->ops->invalidate_range_start(mn, mm, start, end);
> +			mn->ops->invalidate_range_start(mn, mm, start,
> +							end, event);
>  	}
>  	srcu_read_unlock(&srcu, id);
>  }
>  EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
>  
>  void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end)
> +					 unsigned long start,
> +					 unsigned long end,
> +					 enum mmu_event event)
>  {
>  	struct mmu_notifier *mn;
>  	int id;
> @@ -174,7 +183,8 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
>  	id = srcu_read_lock(&srcu);
>  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
>  		if (mn->ops->invalidate_range_end)
> -			mn->ops->invalidate_range_end(mn, mm, start, end);
> +			mn->ops->invalidate_range_end(mn, mm, start,
> +						      end, event);
>  	}
>  	srcu_read_unlock(&srcu, id);
>  }
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index c43d557..6ce6c23 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -137,7 +137,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  
>  static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>  		pud_t *pud, unsigned long addr, unsigned long end,
> -		pgprot_t newprot, int dirty_accountable, int prot_numa)
> +		pgprot_t newprot, int dirty_accountable, int prot_numa,
> +		enum mmu_event event)
>  {
>  	pmd_t *pmd;
>  	struct mm_struct *mm = vma->vm_mm;
> @@ -157,7 +158,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>  		/* invoke the mmu notifier if the pmd is populated */
>  		if (!mni_start) {
>  			mni_start = addr;
> -			mmu_notifier_invalidate_range_start(mm, mni_start, end);
> +			mmu_notifier_invalidate_range_start(mm, mni_start,
> +							    end, event);
>  		}
>  
>  		if (pmd_trans_huge(*pmd)) {
> @@ -185,7 +187,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>  	} while (pmd++, addr = next, addr != end);
>  
>  	if (mni_start)
> -		mmu_notifier_invalidate_range_end(mm, mni_start, end);
> +		mmu_notifier_invalidate_range_end(mm, mni_start, end, event);
>  
>  	if (nr_huge_updates)
>  		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
> @@ -194,7 +196,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>  
>  static inline unsigned long change_pud_range(struct vm_area_struct *vma,
>  		pgd_t *pgd, unsigned long addr, unsigned long end,
> -		pgprot_t newprot, int dirty_accountable, int prot_numa)
> +		pgprot_t newprot, int dirty_accountable, int prot_numa,
> +		enum mmu_event event)
>  {
>  	pud_t *pud;
>  	unsigned long next;
> @@ -206,7 +209,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
>  		if (pud_none_or_clear_bad(pud))
>  			continue;
>  		pages += change_pmd_range(vma, pud, addr, next, newprot,
> -				 dirty_accountable, prot_numa);
> +				 dirty_accountable, prot_numa, event);
>  	} while (pud++, addr = next, addr != end);
>  
>  	return pages;
> @@ -214,7 +217,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
>  
>  static unsigned long change_protection_range(struct vm_area_struct *vma,
>  		unsigned long addr, unsigned long end, pgprot_t newprot,
> -		int dirty_accountable, int prot_numa)
> +		int dirty_accountable, int prot_numa, enum mmu_event event)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	pgd_t *pgd;
> @@ -231,7 +234,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
>  		if (pgd_none_or_clear_bad(pgd))
>  			continue;
>  		pages += change_pud_range(vma, pgd, addr, next, newprot,
> -				 dirty_accountable, prot_numa);
> +				 dirty_accountable, prot_numa, event);
>  	} while (pgd++, addr = next, addr != end);
>  
>  	/* Only flush the TLB if we actually modified any entries: */
> @@ -247,11 +250,23 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
>  		       int dirty_accountable, int prot_numa)
>  {
>  	unsigned long pages;
> +	enum mmu_event event = MMU_MPROT_NONE;
> +
> +	/* At this points vm_flags is updated. */
> +	if ((vma->vm_flags & VM_READ) && (vma->vm_flags & VM_WRITE))
> +		event = MMU_MPROT_RANDW;
> +	else if (vma->vm_flags & VM_WRITE)
> +		event = MMU_MPROT_WONLY;
> +	else if (vma->vm_flags & VM_READ)
> +		event = MMU_MPROT_RONLY;

hmmm, shouldn't we be checking against the newprot argument, instead of
against vma->vm_flags?  The calling code, mprotect_fixup for example, can
set flags *other* than VM_READ or VM_WRITE, and that could lead to a
confusing or even inaccurate event. We could have a case where the event
type is MMU_MPROT_RONLY, but the page might have been read-only the entire
time, and some other flag was actually getting set.

I'm also starting to wonder if this event is adding much value here (for
protection changes), since the newprot argument already carries the same
information. Still, it is important to have a unified reporting scheme for
HMM, so that's probably a good enough reason to do this.
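
To make that concrete, a sketch of what I mean (illustrative only, and it
assumes the caller passes down the new vm_flags it already has, e.g.
mprotect_fixup()'s newflags):

	/* Pick the event from what the mapping is changing *to*. */
	if (!(newflags & (VM_READ | VM_WRITE)))
		event = MMU_MPROT_NONE;
	else if ((newflags & VM_READ) && (newflags & VM_WRITE))
		event = MMU_MPROT_RANDW;
	else if (newflags & VM_WRITE)
		event = MMU_MPROT_WONLY;
	else
		event = MMU_MPROT_RONLY;

i.e. key the event off the new protection, rather than off vma->vm_flags
read back after the fact.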

>  
>  	if (is_vm_hugetlb_page(vma))
> -		pages = hugetlb_change_protection(vma, start, end, newprot);
> +		pages = hugetlb_change_protection(vma, start, end,
> +						  newprot, event);
>  	else
> -		pages = change_protection_range(vma, start, end, newprot, dirty_accountable, prot_numa);
> +		pages = change_protection_range(vma, start, end, newprot,
> +						dirty_accountable,
> +						prot_numa, event);
>  
>  	return pages;
>  }
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 05f1180..6827d2f 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -177,7 +177,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>  
>  	mmun_start = old_addr;
>  	mmun_end   = old_end;
> -	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start,
> +					    mmun_end, MMU_MIGRATE);
>  
>  	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
>  		cond_resched();
> @@ -228,7 +229,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>  	if (likely(need_flush))
>  		flush_tlb_range(vma, old_end-len, old_addr);
>  
> -	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start,
> +					  mmun_end, MMU_MIGRATE);
>  
>  	return len + old_addr - old_end;	/* how much done */
>  }
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 7928ddd..bd7e6d7 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -840,7 +840,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>  	pte_unmap_unlock(pte, ptl);
>  
>  	if (ret) {
> -		mmu_notifier_invalidate_page(mm, address);
> +		mmu_notifier_invalidate_page(mm, address, MMU_WB);
>  		(*cleaned)++;
>  	}
>  out:
> @@ -1128,6 +1128,10 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  	spinlock_t *ptl;
>  	int ret = SWAP_AGAIN;
>  	enum ttu_flags flags = (enum ttu_flags)arg;
> +	enum mmu_event event = MMU_MIGRATE;
> +
> +	if (flags & TTU_MUNLOCK)
> +		event = MMU_STATUS;
>  
>  	pte = page_check_address(page, mm, address, &ptl, 0);
>  	if (!pte)
> @@ -1233,7 +1237,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  out_unmap:
>  	pte_unmap_unlock(pte, ptl);
>  	if (ret != SWAP_FAIL && !(flags & TTU_MUNLOCK))
> -		mmu_notifier_invalidate_page(mm, address);
> +		mmu_notifier_invalidate_page(mm, address, event);
>  out:
>  	return ret;
>  
> @@ -1287,7 +1291,9 @@ out_mlock:
>  #define CLUSTER_MASK	(~(CLUSTER_SIZE - 1))
>  
>  static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
> -		struct vm_area_struct *vma, struct page *check_page)
> +				struct vm_area_struct *vma,
> +				struct page *check_page,
> +				enum ttu_flags flags)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	pmd_t *pmd;
> @@ -1301,6 +1307,10 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
>  	unsigned long end;
>  	int ret = SWAP_AGAIN;
>  	int locked_vma = 0;
> +	enum mmu_event event = MMU_MIGRATE;
> +
> +	if (flags & TTU_MUNLOCK)
> +		event = MMU_STATUS;
>  
>  	address = (vma->vm_start + cursor) & CLUSTER_MASK;
>  	end = address + CLUSTER_SIZE;
> @@ -1315,7 +1325,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
>  
>  	mmun_start = address;
>  	mmun_end   = end;
> -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, event);
>  
>  	/*
>  	 * If we can acquire the mmap_sem for read, and vma is VM_LOCKED,
> @@ -1380,7 +1390,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
>  		(*mapcount)--;
>  	}
>  	pte_unmap_unlock(pte - 1, ptl);
> -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, event);
>  	if (locked_vma)
>  		up_read(&vma->vm_mm->mmap_sem);
>  	return ret;
> @@ -1436,7 +1446,9 @@ static int try_to_unmap_nonlinear(struct page *page,
>  			while (cursor < max_nl_cursor &&
>  				cursor < vma->vm_end - vma->vm_start) {
>  				if (try_to_unmap_cluster(cursor, &mapcount,
> -						vma, page) == SWAP_MLOCK)
> +							 vma, page,
> +							 (enum ttu_flags)arg)
> +							 == SWAP_MLOCK)
>  					ret = SWAP_MLOCK;
>  				cursor += CLUSTER_SIZE;
>  				vma->vm_private_data = (void *) cursor;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 4b6c01b..6e1992f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -262,7 +262,8 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
>  
>  static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
>  					     struct mm_struct *mm,
> -					     unsigned long address)
> +					     unsigned long address,
> +					     enum mmu_event event)
>  {
>  	struct kvm *kvm = mmu_notifier_to_kvm(mn);
>  	int need_tlb_flush, idx;
> @@ -301,7 +302,8 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
>  static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>  					struct mm_struct *mm,
>  					unsigned long address,
> -					pte_t pte)
> +					pte_t pte,
> +					enum mmu_event event)
>  {
>  	struct kvm *kvm = mmu_notifier_to_kvm(mn);
>  	int idx;
> @@ -317,7 +319,8 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>  static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  						    struct mm_struct *mm,
>  						    unsigned long start,
> -						    unsigned long end)
> +						    unsigned long end,
> +						    enum mmu_event event)
>  {
>  	struct kvm *kvm = mmu_notifier_to_kvm(mn);
>  	int need_tlb_flush = 0, idx;
> @@ -343,7 +346,8 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
>  						  struct mm_struct *mm,
>  						  unsigned long start,
> -						  unsigned long end)
> +						  unsigned long end,
> +						  enum mmu_event event)
>  {
>  	struct kvm *kvm = mmu_notifier_to_kvm(mn);
>  
> -- 
> 1.9.0
> 

thanks,
John H.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
  2014-06-28  2:00   ` Jérôme Glisse
@ 2014-06-30 14:41     ` Gabbay, Oded
  -1 siblings, 0 replies; 76+ messages in thread
From: Gabbay, Oded @ 2014-06-30 14:41 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: Deucher, Alexander, Lewycky, Andrew, Cornwall, Jay, akpm,
	linux-mm, linux-kernel, mgorman, hpa, peterz, aarcange, riel,
	jweiner, torvalds, Mark Hairgrove, Jatin Kumar, Subhash Gutti,
	Lucien Dunning, Cameron Buschardt, Arvind Gopalakrishnan,
	John Hubbard, Sherry Cheung, Duncan Poole, Joerg Roedel, iommu

On Fri, 2014-06-27 at 22:00 -0400, Jérôme Glisse wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> Several subsystem require a callback when a mm struct is being destroy
> so that they can cleanup there respective per mm struct. Instead of
> having each subsystem add its callback to mmput use a notifier chain
> to call each of the subsystem.
> 
> This will allow new subsystem to register callback even if they are
> module. There should be no contention on the rw semaphore protecting
> the call chain and the impact on the code path should be low and
> burried in the noise.
> 
> Note that this patch also move the call to cleanup functions after
> exit_mmap so that new call back can assume that mmu_notifier_release
> have already been call. This does not impact existing cleanup functions
> as they do not rely on anything that exit_mmap is freeing. Also moved
> khugepaged_exit to exit_mmap so that ordering is preserved for that
> function.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> ---
>  fs/aio.c                | 29 ++++++++++++++++++++++-------
>  include/linux/aio.h     |  2 --
>  include/linux/ksm.h     | 11 -----------
>  include/linux/sched.h   |  5 +++++
>  include/linux/uprobes.h |  1 -
>  kernel/events/uprobes.c | 19 ++++++++++++++++---
>  kernel/fork.c           | 22 ++++++++++++++++++----
>  mm/ksm.c                | 26 +++++++++++++++++++++-----
>  mm/mmap.c               |  3 +++
>  9 files changed, 85 insertions(+), 33 deletions(-)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index c1d8c48..1d06e92 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -40,6 +40,7 @@
>  #include <linux/ramfs.h>
>  #include <linux/percpu-refcount.h>
>  #include <linux/mount.h>
> +#include <linux/notifier.h>
>  
>  #include <asm/kmap_types.h>
>  #include <asm/uaccess.h>
> @@ -774,20 +775,22 @@ ssize_t wait_on_sync_kiocb(struct kiocb *req)
>  EXPORT_SYMBOL(wait_on_sync_kiocb);
>  
>  /*
> - * exit_aio: called when the last user of mm goes away.  At this point, there is
> + * aio_exit: called when the last user of mm goes away.  At this point, there is
>   * no way for any new requests to be submited or any of the io_* syscalls to be
>   * called on the context.
>   *
>   * There may be outstanding kiocbs, but free_ioctx() will explicitly wait on
>   * them.
>   */
> -void exit_aio(struct mm_struct *mm)
> +static int aio_exit(struct notifier_block *nb,
> +		    unsigned long action, void *data)
>  {
> +	struct mm_struct *mm = data;
>  	struct kioctx_table *table = rcu_dereference_raw(mm->ioctx_table);
>  	int i;
>  
>  	if (!table)
> -		return;
> +		return 0;
>  
>  	for (i = 0; i < table->nr; ++i) {
>  		struct kioctx *ctx = table->table[i];
> @@ -796,10 +799,10 @@ void exit_aio(struct mm_struct *mm)
>  			continue;
>  		/*
>  		 * We don't need to bother with munmap() here - exit_mmap(mm)
> -		 * is coming and it'll unmap everything. And we simply can't,
> -		 * this is not necessarily our ->mm.
> -		 * Since kill_ioctx() uses non-zero ->mmap_size as indicator
> -		 * that it needs to unmap the area, just set it to 0.
> +		 * have already been call and everything is unmap by now. But
> +		 * to be safe set ->mmap_size to 0 since aio_free_ring() uses
> +		 * non-zero ->mmap_size as indicator that it needs to unmap the
> +		 * area.
>  		 */
>  		ctx->mmap_size = 0;
>  		kill_ioctx(mm, ctx, NULL);
> @@ -807,6 +810,7 @@ void exit_aio(struct mm_struct *mm)
>  
>  	RCU_INIT_POINTER(mm->ioctx_table, NULL);
>  	kfree(table);
> +	return 0;
>  }
>  
>  static void put_reqs_available(struct kioctx *ctx, unsigned nr)
> @@ -1629,3 +1633,14 @@ SYSCALL_DEFINE5(io_getevents, aio_context_t, ctx_id,
>  	}
>  	return ret;
>  }
> +
> +static struct notifier_block aio_mmput_nb = {
> +	.notifier_call		= aio_exit,
> +	.priority		= 1,
> +};
> +
> +static int __init aio_init(void)
> +{
> +	return mmput_register_notifier(&aio_mmput_nb);
> +}
> +subsys_initcall(aio_init);
> diff --git a/include/linux/aio.h b/include/linux/aio.h
> index d9c92da..6308fac 100644
> --- a/include/linux/aio.h
> +++ b/include/linux/aio.h
> @@ -73,7 +73,6 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
>  extern ssize_t wait_on_sync_kiocb(struct kiocb *iocb);
>  extern void aio_complete(struct kiocb *iocb, long res, long res2);
>  struct mm_struct;
> -extern void exit_aio(struct mm_struct *mm);
>  extern long do_io_submit(aio_context_t ctx_id, long nr,
>  			 struct iocb __user *__user *iocbpp, bool compat);
>  void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel);
> @@ -81,7 +80,6 @@ void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel);
>  static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; }
>  static inline void aio_complete(struct kiocb *iocb, long res, long res2) { }
>  struct mm_struct;
> -static inline void exit_aio(struct mm_struct *mm) { }
>  static inline long do_io_submit(aio_context_t ctx_id, long nr,
>  				struct iocb __user * __user *iocbpp,
>  				bool compat) { return 0; }
> diff --git a/include/linux/ksm.h b/include/linux/ksm.h
> index 3be6bb1..84c184f 100644
> --- a/include/linux/ksm.h
> +++ b/include/linux/ksm.h
> @@ -20,7 +20,6 @@ struct mem_cgroup;
>  int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
>  		unsigned long end, int advice, unsigned long *vm_flags);
>  int __ksm_enter(struct mm_struct *mm);
> -void __ksm_exit(struct mm_struct *mm);
>  
>  static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
>  {
> @@ -29,12 +28,6 @@ static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
>  	return 0;
>  }
>  
> -static inline void ksm_exit(struct mm_struct *mm)
> -{
> -	if (test_bit(MMF_VM_MERGEABLE, &mm->flags))
> -		__ksm_exit(mm);
> -}
> -
>  /*
>   * A KSM page is one of those write-protected "shared pages" or "merged pages"
>   * which KSM maps into multiple mms, wherever identical anonymous page content
> @@ -83,10 +76,6 @@ static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
>  	return 0;
>  }
>  
> -static inline void ksm_exit(struct mm_struct *mm)
> -{
> -}
> -
>  static inline int PageKsm(struct page *page)
>  {
>  	return 0;
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 322d4fc..428b3cf 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2384,6 +2384,11 @@ static inline void mmdrop(struct mm_struct * mm)
>  		__mmdrop(mm);
>  }
>  
> +/* mmput call list of notifier and subsystem/module can register
> + * new one through this call.
> + */
> +extern int mmput_register_notifier(struct notifier_block *nb);
> +extern int mmput_unregister_notifier(struct notifier_block *nb);
>  /* mmput gets rid of the mappings and all user-space */
>  extern void mmput(struct mm_struct *);
>  /* Grab a reference to a task's mm, if it is not already going away */
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index 4f844c6..44e7267 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -120,7 +120,6 @@ extern int uprobe_pre_sstep_notifier(struct pt_regs *regs);
>  extern void uprobe_notify_resume(struct pt_regs *regs);
>  extern bool uprobe_deny_signal(void);
>  extern bool arch_uprobe_skip_sstep(struct arch_uprobe *aup, struct pt_regs *regs);
> -extern void uprobe_clear_state(struct mm_struct *mm);
>  extern int  arch_uprobe_analyze_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long addr);
>  extern int  arch_uprobe_pre_xol(struct arch_uprobe *aup, struct pt_regs *regs);
>  extern int  arch_uprobe_post_xol(struct arch_uprobe *aup, struct pt_regs *regs);
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 46b7c31..32b04dc 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -37,6 +37,7 @@
>  #include <linux/percpu-rwsem.h>
>  #include <linux/task_work.h>
>  #include <linux/shmem_fs.h>
> +#include <linux/notifier.h>
>  
>  #include <linux/uprobes.h>
>  
> @@ -1220,16 +1221,19 @@ static struct xol_area *get_xol_area(void)
>  /*
>   * uprobe_clear_state - Free the area allocated for slots.
>   */
> -void uprobe_clear_state(struct mm_struct *mm)
> +static int uprobe_clear_state(struct notifier_block *nb,
> +			      unsigned long action, void *data)
>  {
> +	struct mm_struct *mm = data;
>  	struct xol_area *area = mm->uprobes_state.xol_area;
>  
>  	if (!area)
> -		return;
> +		return 0;
>  
>  	put_page(area->page);
>  	kfree(area->bitmap);
>  	kfree(area);
> +	return 0;
>  }
>  
>  void uprobe_start_dup_mmap(void)
> @@ -1979,9 +1983,14 @@ static struct notifier_block uprobe_exception_nb = {
>  	.priority		= INT_MAX-1,	/* notified after kprobes, kgdb */
>  };
>  
> +static struct notifier_block uprobe_mmput_nb = {
> +	.notifier_call		= uprobe_clear_state,
> +	.priority		= 0,
> +};
> +
>  static int __init init_uprobes(void)
>  {
> -	int i;
> +	int i, err;
>  
>  	for (i = 0; i < UPROBES_HASH_SZ; i++)
>  		mutex_init(&uprobes_mmap_mutex[i]);
> @@ -1989,6 +1998,10 @@ static int __init init_uprobes(void)
>  	if (percpu_init_rwsem(&dup_mmap_sem))
>  		return -ENOMEM;
>  
> +	err = mmput_register_notifier(&uprobe_mmput_nb);
> +	if (err)
> +		return err;
> +
>  	return register_die_notifier(&uprobe_exception_nb);
>  }
>  __initcall(init_uprobes);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index dd8864f..b448509 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -87,6 +87,8 @@
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/task.h>
>  
> +static BLOCKING_NOTIFIER_HEAD(mmput_notifier);
> +
>  /*
>   * Protected counters by write_lock_irq(&tasklist_lock)
>   */
> @@ -623,6 +625,21 @@ void __mmdrop(struct mm_struct *mm)
>  EXPORT_SYMBOL_GPL(__mmdrop);
>  
>  /*
> + * Register a notifier that will be call by mmput
> + */
> +int mmput_register_notifier(struct notifier_block *nb)
> +{
> +	return blocking_notifier_chain_register(&mmput_notifier, nb);
> +}
> +EXPORT_SYMBOL_GPL(mmput_register_notifier);
> +
> +int mmput_unregister_notifier(struct notifier_block *nb)
> +{
> +	return blocking_notifier_chain_unregister(&mmput_notifier, nb);
> +}
> +EXPORT_SYMBOL_GPL(mmput_unregister_notifier);
> +
> +/*
>   * Decrement the use count and release all resources for an mm.
>   */
>  void mmput(struct mm_struct *mm)
> @@ -630,11 +647,8 @@ void mmput(struct mm_struct *mm)
>  	might_sleep();
>  
>  	if (atomic_dec_and_test(&mm->mm_users)) {
> -		uprobe_clear_state(mm);
> -		exit_aio(mm);
> -		ksm_exit(mm);
> -		khugepaged_exit(mm); /* must run before exit_mmap */
>  		exit_mmap(mm);
> +		blocking_notifier_call_chain(&mmput_notifier, 0, mm);
>  		set_mm_exe_file(mm, NULL);
>  		if (!list_empty(&mm->mmlist)) {
>  			spin_lock(&mmlist_lock);
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 346ddc9..cb1e976 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -37,6 +37,7 @@
>  #include <linux/freezer.h>
>  #include <linux/oom.h>
>  #include <linux/numa.h>
> +#include <linux/notifier.h>
>  
>  #include <asm/tlbflush.h>
>  #include "internal.h"
> @@ -1586,7 +1587,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
>  		ksm_scan.mm_slot = slot;
>  		spin_unlock(&ksm_mmlist_lock);
>  		/*
> -		 * Although we tested list_empty() above, a racing __ksm_exit
> +		 * Although we tested list_empty() above, a racing ksm_exit
>  		 * of the last mm on the list may have removed it since then.
>  		 */
>  		if (slot == &ksm_mm_head)
> @@ -1658,9 +1659,9 @@ next_mm:
>  		/*
>  		 * We've completed a full scan of all vmas, holding mmap_sem
>  		 * throughout, and found no VM_MERGEABLE: so do the same as
> -		 * __ksm_exit does to remove this mm from all our lists now.
> -		 * This applies either when cleaning up after __ksm_exit
> -		 * (but beware: we can reach here even before __ksm_exit),
> +		 * ksm_exit does to remove this mm from all our lists now.
> +		 * This applies either when cleaning up after ksm_exit
> +		 * (but beware: we can reach here even before ksm_exit),
>  		 * or when all VM_MERGEABLE areas have been unmapped (and
>  		 * mmap_sem then protects against race with MADV_MERGEABLE).
>  		 */
> @@ -1821,11 +1822,16 @@ int __ksm_enter(struct mm_struct *mm)
>  	return 0;
>  }
>  
> -void __ksm_exit(struct mm_struct *mm)
> +static int ksm_exit(struct notifier_block *nb,
> +		    unsigned long action, void *data)
>  {
> +	struct mm_struct *mm = data;
>  	struct mm_slot *mm_slot;
>  	int easy_to_free = 0;
>  
> +	if (!test_bit(MMF_VM_MERGEABLE, &mm->flags))
> +		return 0;
> +
>  	/*
>  	 * This process is exiting: if it's straightforward (as is the
>  	 * case when ksmd was never running), free mm_slot immediately.
> @@ -1857,6 +1863,7 @@ void __ksm_exit(struct mm_struct *mm)
>  		down_write(&mm->mmap_sem);
>  		up_write(&mm->mmap_sem);
>  	}
> +	return 0;
>  }
>  
>  struct page *ksm_might_need_to_copy(struct page *page,
> @@ -2305,11 +2312,20 @@ static struct attribute_group ksm_attr_group = {
>  };
>  #endif /* CONFIG_SYSFS */
>  
> +static struct notifier_block ksm_mmput_nb = {
> +	.notifier_call		= ksm_exit,
> +	.priority		= 2,
> +};
> +
>  static int __init ksm_init(void)
>  {
>  	struct task_struct *ksm_thread;
>  	int err;
>  
> +	err = mmput_register_notifier(&ksm_mmput_nb);
> +	if (err)
> +		return err;
> +
>  	err = ksm_slab_init();
>  	if (err)
>  		goto out;
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 61aec93..b684a21 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2775,6 +2775,9 @@ void exit_mmap(struct mm_struct *mm)
>  	struct vm_area_struct *vma;
>  	unsigned long nr_accounted = 0;
>  
> +	/* Important to call this first. */
> +	khugepaged_exit(mm);
> +
>  	/* mm's last user has gone, and its about to be pulled down */
>  	mmu_notifier_release(mm);
>  

Hi Jerome, I reviewed the patch, integrated and tested it in AMD's HSA
driver (KFD). It works as expected and I didn't find any problems with
it.

I did face some problems with the AMD IOMMU v2 driver, which changed
its behavior (see commit "iommu/amd: Implement mmu_notifier_release
call-back") to use mmu_notifier_release and did some "bad things"
inside that notifier (primarily, but not only, deleting the object that
holds the mmu_notifier itself, which you mustn't do because of the
locking).

I'm thinking of changing that driver's behavior to use this new
mechanism instead of mmu_notifier_release. Does that seem acceptable?
Another solution would be to add a new mmu_notifier call, but we
already ruled that out ;)

So,
Reviewed-by: Oded Gabbay <oded.gabbay@amd.com>

	Oded


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
       [not found]     ` <019CCE693E457142B37B791721487FD91806B836-0nO7ALo/ziwxlywnonMhLEEOCMrvLtNR@public.gmane.org>
@ 2014-06-30 15:06       ` Jerome Glisse
  0 siblings, 0 replies; 76+ messages in thread
From: Jerome Glisse @ 2014-06-30 15:06 UTC (permalink / raw)
  To: Gabbay, Oded
  Cc: Deucher, Alexander, Lewycky, Andrew, Cornwall, Jay, akpm,
	linux-mm, linux-kernel, mgorman, hpa, peterz, aarcange, riel,
	jweiner, torvalds, Mark Hairgrove, Jatin Kumar, Subhash Gutti,
	Lucien Dunning, Cameron Buschardt, Arvind Gopalakrishnan,
	John Hubbard, Sherry Cheung, Duncan Poole, Joerg Roedel, iommu

On Mon, Jun 30, 2014 at 02:41:24PM +0000, Gabbay, Oded wrote:
> On Fri, 2014-06-27 at 22:00 -0400, Jerome Glisse wrote:
> > From: Jerome Glisse <jglisse@redhat.com>
> > 
> > Several subsystem require a callback when a mm struct is being destroy
> > so that they can cleanup there respective per mm struct. Instead of
> > having each subsystem add its callback to mmput use a notifier chain
> > to call each of the subsystem.
> > 
> > This will allow new subsystem to register callback even if they are
> > module. There should be no contention on the rw semaphore protecting
> > the call chain and the impact on the code path should be low and
> > burried in the noise.
> > 
> > Note that this patch also move the call to cleanup functions after
> > exit_mmap so that new call back can assume that mmu_notifier_release
> > have already been call. This does not impact existing cleanup functions
> > as they do not rely on anything that exit_mmap is freeing. Also moved
> > khugepaged_exit to exit_mmap so that ordering is preserved for that
> > function.
> > 
> > Signed-off-by: Jerome Glisse <jglisse@redhat.com>
> > ---
> >  fs/aio.c                | 29 ++++++++++++++++++++++-------
> >  include/linux/aio.h     |  2 --
> >  include/linux/ksm.h     | 11 -----------
> >  include/linux/sched.h   |  5 +++++
> >  include/linux/uprobes.h |  1 -
> >  kernel/events/uprobes.c | 19 ++++++++++++++++---
> >  kernel/fork.c           | 22 ++++++++++++++++++----
> >  mm/ksm.c                | 26 +++++++++++++++++++++-----
> >  mm/mmap.c               |  3 +++
> >  9 files changed, 85 insertions(+), 33 deletions(-)
> > 
> > diff --git a/fs/aio.c b/fs/aio.c
> > index c1d8c48..1d06e92 100644
> > --- a/fs/aio.c
> > +++ b/fs/aio.c
> > @@ -40,6 +40,7 @@
> >  #include <linux/ramfs.h>
> >  #include <linux/percpu-refcount.h>
> >  #include <linux/mount.h>
> > +#include <linux/notifier.h>
> >  
> >  #include <asm/kmap_types.h>
> >  #include <asm/uaccess.h>
> > @@ -774,20 +775,22 @@ ssize_t wait_on_sync_kiocb(struct kiocb *req)
> >  EXPORT_SYMBOL(wait_on_sync_kiocb);
> >  
> >  /*
> > - * exit_aio: called when the last user of mm goes away.  At this point, there is
> > + * aio_exit: called when the last user of mm goes away.  At this point, there is
> >   * no way for any new requests to be submited or any of the io_* syscalls to be
> >   * called on the context.
> >   *
> >   * There may be outstanding kiocbs, but free_ioctx() will explicitly wait on
> >   * them.
> >   */
> > -void exit_aio(struct mm_struct *mm)
> > +static int aio_exit(struct notifier_block *nb,
> > +		    unsigned long action, void *data)
> >  {
> > +	struct mm_struct *mm = data;
> >  	struct kioctx_table *table = rcu_dereference_raw(mm->ioctx_table);
> >  	int i;
> >  
> >  	if (!table)
> > -		return;
> > +		return 0;
> >  
> >  	for (i = 0; i < table->nr; ++i) {
> >  		struct kioctx *ctx = table->table[i];
> > @@ -796,10 +799,10 @@ void exit_aio(struct mm_struct *mm)
> >  			continue;
> >  		/*
> >  		 * We don't need to bother with munmap() here - exit_mmap(mm)
> > -		 * is coming and it'll unmap everything. And we simply can't,
> > -		 * this is not necessarily our ->mm.
> > -		 * Since kill_ioctx() uses non-zero ->mmap_size as indicator
> > -		 * that it needs to unmap the area, just set it to 0.
> > +		 * have already been call and everything is unmap by now. But
> > +		 * to be safe set ->mmap_size to 0 since aio_free_ring() uses
> > +		 * non-zero ->mmap_size as indicator that it needs to unmap the
> > +		 * area.
> >  		 */
> >  		ctx->mmap_size = 0;
> >  		kill_ioctx(mm, ctx, NULL);
> > @@ -807,6 +810,7 @@ void exit_aio(struct mm_struct *mm)
> >  
> >  	RCU_INIT_POINTER(mm->ioctx_table, NULL);
> >  	kfree(table);
> > +	return 0;
> >  }
> >  
> >  static void put_reqs_available(struct kioctx *ctx, unsigned nr)
> > @@ -1629,3 +1633,14 @@ SYSCALL_DEFINE5(io_getevents, aio_context_t, ctx_id,
> >  	}
> >  	return ret;
> >  }
> > +
> > +static struct notifier_block aio_mmput_nb = {
> > +	.notifier_call		= aio_exit,
> > +	.priority		= 1,
> > +};
> > +
> > +static int __init aio_init(void)
> > +{
> > +	return mmput_register_notifier(&aio_mmput_nb);
> > +}
> > +subsys_initcall(aio_init);
> > diff --git a/include/linux/aio.h b/include/linux/aio.h
> > index d9c92da..6308fac 100644
> > --- a/include/linux/aio.h
> > +++ b/include/linux/aio.h
> > @@ -73,7 +73,6 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
> >  extern ssize_t wait_on_sync_kiocb(struct kiocb *iocb);
> >  extern void aio_complete(struct kiocb *iocb, long res, long res2);
> >  struct mm_struct;
> > -extern void exit_aio(struct mm_struct *mm);
> >  extern long do_io_submit(aio_context_t ctx_id, long nr,
> >  			 struct iocb __user *__user *iocbpp, bool compat);
> >  void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel);
> > @@ -81,7 +80,6 @@ void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel);
> >  static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; }
> >  static inline void aio_complete(struct kiocb *iocb, long res, long res2) { }
> >  struct mm_struct;
> > -static inline void exit_aio(struct mm_struct *mm) { }
> >  static inline long do_io_submit(aio_context_t ctx_id, long nr,
> >  				struct iocb __user * __user *iocbpp,
> >  				bool compat) { return 0; }
> > diff --git a/include/linux/ksm.h b/include/linux/ksm.h
> > index 3be6bb1..84c184f 100644
> > --- a/include/linux/ksm.h
> > +++ b/include/linux/ksm.h
> > @@ -20,7 +20,6 @@ struct mem_cgroup;
> >  int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
> >  		unsigned long end, int advice, unsigned long *vm_flags);
> >  int __ksm_enter(struct mm_struct *mm);
> > -void __ksm_exit(struct mm_struct *mm);
> >  
> >  static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> >  {
> > @@ -29,12 +28,6 @@ static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> >  	return 0;
> >  }
> >  
> > -static inline void ksm_exit(struct mm_struct *mm)
> > -{
> > -	if (test_bit(MMF_VM_MERGEABLE, &mm->flags))
> > -		__ksm_exit(mm);
> > -}
> > -
> >  /*
> >   * A KSM page is one of those write-protected "shared pages" or "merged pages"
> >   * which KSM maps into multiple mms, wherever identical anonymous page content
> > @@ -83,10 +76,6 @@ static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> >  	return 0;
> >  }
> >  
> > -static inline void ksm_exit(struct mm_struct *mm)
> > -{
> > -}
> > -
> >  static inline int PageKsm(struct page *page)
> >  {
> >  	return 0;
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 322d4fc..428b3cf 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -2384,6 +2384,11 @@ static inline void mmdrop(struct mm_struct * mm)
> >  		__mmdrop(mm);
> >  }
> >  
> > +/* mmput call list of notifier and subsystem/module can register
> > + * new one through this call.
> > + */
> > +extern int mmput_register_notifier(struct notifier_block *nb);
> > +extern int mmput_unregister_notifier(struct notifier_block *nb);
> >  /* mmput gets rid of the mappings and all user-space */
> >  extern void mmput(struct mm_struct *);
> >  /* Grab a reference to a task's mm, if it is not already going away */
> > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > index 4f844c6..44e7267 100644
> > --- a/include/linux/uprobes.h
> > +++ b/include/linux/uprobes.h
> > @@ -120,7 +120,6 @@ extern int uprobe_pre_sstep_notifier(struct pt_regs *regs);
> >  extern void uprobe_notify_resume(struct pt_regs *regs);
> >  extern bool uprobe_deny_signal(void);
> >  extern bool arch_uprobe_skip_sstep(struct arch_uprobe *aup, struct pt_regs *regs);
> > -extern void uprobe_clear_state(struct mm_struct *mm);
> >  extern int  arch_uprobe_analyze_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long addr);
> >  extern int  arch_uprobe_pre_xol(struct arch_uprobe *aup, struct pt_regs *regs);
> >  extern int  arch_uprobe_post_xol(struct arch_uprobe *aup, struct pt_regs *regs);
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 46b7c31..32b04dc 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -37,6 +37,7 @@
> >  #include <linux/percpu-rwsem.h>
> >  #include <linux/task_work.h>
> >  #include <linux/shmem_fs.h>
> > +#include <linux/notifier.h>
> >  
> >  #include <linux/uprobes.h>
> >  
> > @@ -1220,16 +1221,19 @@ static struct xol_area *get_xol_area(void)
> >  /*
> >   * uprobe_clear_state - Free the area allocated for slots.
> >   */
> > -void uprobe_clear_state(struct mm_struct *mm)
> > +static int uprobe_clear_state(struct notifier_block *nb,
> > +			      unsigned long action, void *data)
> >  {
> > +	struct mm_struct *mm = data;
> >  	struct xol_area *area = mm->uprobes_state.xol_area;
> >  
> >  	if (!area)
> > -		return;
> > +		return 0;
> >  
> >  	put_page(area->page);
> >  	kfree(area->bitmap);
> >  	kfree(area);
> > +	return 0;
> >  }
> >  
> >  void uprobe_start_dup_mmap(void)
> > @@ -1979,9 +1983,14 @@ static struct notifier_block uprobe_exception_nb = {
> >  	.priority		= INT_MAX-1,	/* notified after kprobes, kgdb */
> >  };
> >  
> > +static struct notifier_block uprobe_mmput_nb = {
> > +	.notifier_call		= uprobe_clear_state,
> > +	.priority		= 0,
> > +};
> > +
> >  static int __init init_uprobes(void)
> >  {
> > -	int i;
> > +	int i, err;
> >  
> >  	for (i = 0; i < UPROBES_HASH_SZ; i++)
> >  		mutex_init(&uprobes_mmap_mutex[i]);
> > @@ -1989,6 +1998,10 @@ static int __init init_uprobes(void)
> >  	if (percpu_init_rwsem(&dup_mmap_sem))
> >  		return -ENOMEM;
> >  
> > +	err = mmput_register_notifier(&uprobe_mmput_nb);
> > +	if (err)
> > +		return err;
> > +
> >  	return register_die_notifier(&uprobe_exception_nb);
> >  }
> >  __initcall(init_uprobes);
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index dd8864f..b448509 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -87,6 +87,8 @@
> >  #define CREATE_TRACE_POINTS
> >  #include <trace/events/task.h>
> >  
> > +static BLOCKING_NOTIFIER_HEAD(mmput_notifier);
> > +
> >  /*
> >   * Protected counters by write_lock_irq(&tasklist_lock)
> >   */
> > @@ -623,6 +625,21 @@ void __mmdrop(struct mm_struct *mm)
> >  EXPORT_SYMBOL_GPL(__mmdrop);
> >  
> >  /*
> > + * Register a notifier that will be call by mmput
> > + */
> > +int mmput_register_notifier(struct notifier_block *nb)
> > +{
> > +	return blocking_notifier_chain_register(&mmput_notifier, nb);
> > +}
> > +EXPORT_SYMBOL_GPL(mmput_register_notifier);
> > +
> > +int mmput_unregister_notifier(struct notifier_block *nb)
> > +{
> > +	return blocking_notifier_chain_unregister(&mmput_notifier, nb);
> > +}
> > +EXPORT_SYMBOL_GPL(mmput_unregister_notifier);
> > +
> > +/*
> >   * Decrement the use count and release all resources for an mm.
> >   */
> >  void mmput(struct mm_struct *mm)
> > @@ -630,11 +647,8 @@ void mmput(struct mm_struct *mm)
> >  	might_sleep();
> >  
> >  	if (atomic_dec_and_test(&mm->mm_users)) {
> > -		uprobe_clear_state(mm);
> > -		exit_aio(mm);
> > -		ksm_exit(mm);
> > -		khugepaged_exit(mm); /* must run before exit_mmap */
> >  		exit_mmap(mm);
> > +		blocking_notifier_call_chain(&mmput_notifier, 0, mm);
> >  		set_mm_exe_file(mm, NULL);
> >  		if (!list_empty(&mm->mmlist)) {
> >  			spin_lock(&mmlist_lock);
> > diff --git a/mm/ksm.c b/mm/ksm.c
> > index 346ddc9..cb1e976 100644
> > --- a/mm/ksm.c
> > +++ b/mm/ksm.c
> > @@ -37,6 +37,7 @@
> >  #include <linux/freezer.h>
> >  #include <linux/oom.h>
> >  #include <linux/numa.h>
> > +#include <linux/notifier.h>
> >  
> >  #include <asm/tlbflush.h>
> >  #include "internal.h"
> > @@ -1586,7 +1587,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
> >  		ksm_scan.mm_slot = slot;
> >  		spin_unlock(&ksm_mmlist_lock);
> >  		/*
> > -		 * Although we tested list_empty() above, a racing __ksm_exit
> > +		 * Although we tested list_empty() above, a racing ksm_exit
> >  		 * of the last mm on the list may have removed it since then.
> >  		 */
> >  		if (slot == &ksm_mm_head)
> > @@ -1658,9 +1659,9 @@ next_mm:
> >  		/*
> >  		 * We've completed a full scan of all vmas, holding mmap_sem
> >  		 * throughout, and found no VM_MERGEABLE: so do the same as
> > -		 * __ksm_exit does to remove this mm from all our lists now.
> > -		 * This applies either when cleaning up after __ksm_exit
> > -		 * (but beware: we can reach here even before __ksm_exit),
> > +		 * ksm_exit does to remove this mm from all our lists now.
> > +		 * This applies either when cleaning up after ksm_exit
> > +		 * (but beware: we can reach here even before ksm_exit),
> >  		 * or when all VM_MERGEABLE areas have been unmapped (and
> >  		 * mmap_sem then protects against race with MADV_MERGEABLE).
> >  		 */
> > @@ -1821,11 +1822,16 @@ int __ksm_enter(struct mm_struct *mm)
> >  	return 0;
> >  }
> >  
> > -void __ksm_exit(struct mm_struct *mm)
> > +static int ksm_exit(struct notifier_block *nb,
> > +		    unsigned long action, void *data)
> >  {
> > +	struct mm_struct *mm = data;
> >  	struct mm_slot *mm_slot;
> >  	int easy_to_free = 0;
> >  
> > +	if (!test_bit(MMF_VM_MERGEABLE, &mm->flags))
> > +		return 0;
> > +
> >  	/*
> >  	 * This process is exiting: if it's straightforward (as is the
> >  	 * case when ksmd was never running), free mm_slot immediately.
> > @@ -1857,6 +1863,7 @@ void __ksm_exit(struct mm_struct *mm)
> >  		down_write(&mm->mmap_sem);
> >  		up_write(&mm->mmap_sem);
> >  	}
> > +	return 0;
> >  }
> >  
> >  struct page *ksm_might_need_to_copy(struct page *page,
> > @@ -2305,11 +2312,20 @@ static struct attribute_group ksm_attr_group = {
> >  };
> >  #endif /* CONFIG_SYSFS */
> >  
> > +static struct notifier_block ksm_mmput_nb = {
> > +	.notifier_call		= ksm_exit,
> > +	.priority		= 2,
> > +};
> > +
> >  static int __init ksm_init(void)
> >  {
> >  	struct task_struct *ksm_thread;
> >  	int err;
> >  
> > +	err = mmput_register_notifier(&ksm_mmput_nb);
> > +	if (err)
> > +		return err;
> > +
> >  	err = ksm_slab_init();
> >  	if (err)
> >  		goto out;
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 61aec93..b684a21 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2775,6 +2775,9 @@ void exit_mmap(struct mm_struct *mm)
> >  	struct vm_area_struct *vma;
> >  	unsigned long nr_accounted = 0;
> >  
> > +	/* Important to call this first. */
> > +	khugepaged_exit(mm);
> > +
> >  	/* mm's last user has gone, and its about to be pulled down */
> >  	mmu_notifier_release(mm);
> >  
> 
> Hi Jerome, I reviewed the patch, integrated and tested it in AMD's HSA
> driver (KFD). It works as expected and I didn't find any problems with
> it.
> 
> I did face some problems with the AMD IOMMU v2 driver, which changed
> its behavior (see commit "iommu/amd: Implement mmu_notifier_release
> call-back") to use mmu_notifier_release and did some "bad things"
> inside that notifier (primarily, but not only, deleting the object that
> holds the mmu_notifier itself, which you mustn't do because of the
> locking).
> 
> I'm thinking of changing that driver's behavior to use this new
> mechanism instead of mmu_notifier_release. Does that seem acceptable?
> Another solution would be to add a new mmu_notifier call, but we
> already ruled that out ;)
> 

This sounds acceptable. You can check how I did it in HMM, which had
similar issues:
http://cgit.freedesktop.org/~glisse/linux/log/?h=hmm
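
Roughly, the shape of it would be something like this (just a sketch:
my_pasid_state, my_find_pasid_state() and the my_driver_* names are made
up for illustration, not the real amd_iommu_v2 code; mmput_register_notifier()
is the new call from patch 1 and mmu_notifier_unregister() is the existing
core API):

struct my_pasid_state {
	struct mmu_notifier	mn;
	/* ... per-mm driver state ... */
};

static int my_driver_mm_exit(struct notifier_block *nb,
			     unsigned long action, void *data)
{
	struct mm_struct *mm = data;
	struct my_pasid_state *state = my_find_pasid_state(mm);

	if (!state)
		return 0;

	/*
	 * mmu_notifier_release() has already run by the time the mmput
	 * notifier chain is called, so ->release() no longer needs to do
	 * the teardown. Unregister first (mmu_notifier_unregister() also
	 * waits for pending srcu readers), and only then free the object
	 * that embeds the mmu_notifier.
	 */
	mmu_notifier_unregister(&state->mn, mm);
	kfree(state);
	return 0;
}

static struct notifier_block my_driver_mmput_nb = {
	.notifier_call	= my_driver_mm_exit,
};

static int __init my_driver_init(void)
{
	return mmput_register_notifier(&my_driver_mmput_nb);
}

If ordering relative to aio/ksm/uprobes matters, the .priority field can
be set the same way patch 1 does for those.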

Cheers,
Jerome

> So,
> Reviewed-by: Oded Gabbay <oded.gabbay@amd.com>
> 
> 	Oded

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
@ 2014-06-30 15:06       ` Jerome Glisse
  0 siblings, 0 replies; 76+ messages in thread
From: Jerome Glisse @ 2014-06-30 15:06 UTC (permalink / raw)
  To: Gabbay, Oded
  Cc: Sherry Cheung, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, aarcange-H+wXaHxf7aLQT0dZR+AlfA,
	Jatin Kumar, Lucien Dunning, mgorman-l3A5Bk7waGM,
	jweiner-H+wXaHxf7aLQT0dZR+AlfA, Subhash Gutti,
	riel-H+wXaHxf7aLQT0dZR+AlfA, John Hubbard, Mark Hairgrove,
	Cameron Buschardt, peterz-hDdKplPs4pWWVfeAwA7xHQ, Duncan Poole,
	Cornwall, Jay, Lewycky, Andrew,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Arvind Gopalakrishnan, Deucher, Alexander

On Mon, Jun 30, 2014 at 02:41:24PM +0000, Gabbay, Oded wrote:
> On Fri, 2014-06-27 at 22:00 -0400, Jérôme Glisse wrote:
> > From: Jérôme Glisse <jglisse-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > 
> > Several subsystem require a callback when a mm struct is being destroy
> > so that they can cleanup there respective per mm struct. Instead of
> > having each subsystem add its callback to mmput use a notifier chain
> > to call each of the subsystem.
> > 
> > This will allow new subsystem to register callback even if they are
> > module. There should be no contention on the rw semaphore protecting
> > the call chain and the impact on the code path should be low and
> > burried in the noise.
> > 
> > Note that this patch also move the call to cleanup functions after
> > exit_mmap so that new call back can assume that mmu_notifier_release
> > have already been call. This does not impact existing cleanup functions
> > as they do not rely on anything that exit_mmap is freeing. Also moved
> > khugepaged_exit to exit_mmap so that ordering is preserved for that
> > function.
> > 
> > Signed-off-by: Jérôme Glisse <jglisse-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > ---
> >  fs/aio.c                | 29 ++++++++++++++++++++++-------
> >  include/linux/aio.h     |  2 --
> >  include/linux/ksm.h     | 11 -----------
> >  include/linux/sched.h   |  5 +++++
> >  include/linux/uprobes.h |  1 -
> >  kernel/events/uprobes.c | 19 ++++++++++++++++---
> >  kernel/fork.c           | 22 ++++++++++++++++++----
> >  mm/ksm.c                | 26 +++++++++++++++++++++-----
> >  mm/mmap.c               |  3 +++
> >  9 files changed, 85 insertions(+), 33 deletions(-)
> > 
> > diff --git a/fs/aio.c b/fs/aio.c
> > index c1d8c48..1d06e92 100644
> > --- a/fs/aio.c
> > +++ b/fs/aio.c
> > @@ -40,6 +40,7 @@
> >  #include <linux/ramfs.h>
> >  #include <linux/percpu-refcount.h>
> >  #include <linux/mount.h>
> > +#include <linux/notifier.h>
> >  
> >  #include <asm/kmap_types.h>
> >  #include <asm/uaccess.h>
> > @@ -774,20 +775,22 @@ ssize_t wait_on_sync_kiocb(struct kiocb *req)
> >  EXPORT_SYMBOL(wait_on_sync_kiocb);
> >  
> >  /*
> > - * exit_aio: called when the last user of mm goes away.  At this point, there is
> > + * aio_exit: called when the last user of mm goes away.  At this point, there is
> >   * no way for any new requests to be submited or any of the io_* syscalls to be
> >   * called on the context.
> >   *
> >   * There may be outstanding kiocbs, but free_ioctx() will explicitly wait on
> >   * them.
> >   */
> > -void exit_aio(struct mm_struct *mm)
> > +static int aio_exit(struct notifier_block *nb,
> > +		    unsigned long action, void *data)
> >  {
> > +	struct mm_struct *mm = data;
> >  	struct kioctx_table *table = rcu_dereference_raw(mm->ioctx_table);
> >  	int i;
> >  
> >  	if (!table)
> > -		return;
> > +		return 0;
> >  
> >  	for (i = 0; i < table->nr; ++i) {
> >  		struct kioctx *ctx = table->table[i];
> > @@ -796,10 +799,10 @@ void exit_aio(struct mm_struct *mm)
> >  			continue;
> >  		/*
> >  		 * We don't need to bother with munmap() here - exit_mmap(mm)
> > -		 * is coming and it'll unmap everything. And we simply can't,
> > -		 * this is not necessarily our ->mm.
> > -		 * Since kill_ioctx() uses non-zero ->mmap_size as indicator
> > -		 * that it needs to unmap the area, just set it to 0.
> > +		 * have already been call and everything is unmap by now. But
> > +		 * to be safe set ->mmap_size to 0 since aio_free_ring() uses
> > +		 * non-zero ->mmap_size as indicator that it needs to unmap the
> > +		 * area.
> >  		 */
> >  		ctx->mmap_size = 0;
> >  		kill_ioctx(mm, ctx, NULL);
> > @@ -807,6 +810,7 @@ void exit_aio(struct mm_struct *mm)
> >  
> >  	RCU_INIT_POINTER(mm->ioctx_table, NULL);
> >  	kfree(table);
> > +	return 0;
> >  }
> >  
> >  static void put_reqs_available(struct kioctx *ctx, unsigned nr)
> > @@ -1629,3 +1633,14 @@ SYSCALL_DEFINE5(io_getevents, aio_context_t, ctx_id,
> >  	}
> >  	return ret;
> >  }
> > +
> > +static struct notifier_block aio_mmput_nb = {
> > +	.notifier_call		= aio_exit,
> > +	.priority		= 1,
> > +};
> > +
> > +static int __init aio_init(void)
> > +{
> > +	return mmput_register_notifier(&aio_mmput_nb);
> > +}
> > +subsys_initcall(aio_init);
> > diff --git a/include/linux/aio.h b/include/linux/aio.h
> > index d9c92da..6308fac 100644
> > --- a/include/linux/aio.h
> > +++ b/include/linux/aio.h
> > @@ -73,7 +73,6 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
> >  extern ssize_t wait_on_sync_kiocb(struct kiocb *iocb);
> >  extern void aio_complete(struct kiocb *iocb, long res, long res2);
> >  struct mm_struct;
> > -extern void exit_aio(struct mm_struct *mm);
> >  extern long do_io_submit(aio_context_t ctx_id, long nr,
> >  			 struct iocb __user *__user *iocbpp, bool compat);
> >  void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel);
> > @@ -81,7 +80,6 @@ void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel);
> >  static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; }
> >  static inline void aio_complete(struct kiocb *iocb, long res, long res2) { }
> >  struct mm_struct;
> > -static inline void exit_aio(struct mm_struct *mm) { }
> >  static inline long do_io_submit(aio_context_t ctx_id, long nr,
> >  				struct iocb __user * __user *iocbpp,
> >  				bool compat) { return 0; }
> > diff --git a/include/linux/ksm.h b/include/linux/ksm.h
> > index 3be6bb1..84c184f 100644
> > --- a/include/linux/ksm.h
> > +++ b/include/linux/ksm.h
> > @@ -20,7 +20,6 @@ struct mem_cgroup;
> >  int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
> >  		unsigned long end, int advice, unsigned long *vm_flags);
> >  int __ksm_enter(struct mm_struct *mm);
> > -void __ksm_exit(struct mm_struct *mm);
> >  
> >  static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> >  {
> > @@ -29,12 +28,6 @@ static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> >  	return 0;
> >  }
> >  
> > -static inline void ksm_exit(struct mm_struct *mm)
> > -{
> > -	if (test_bit(MMF_VM_MERGEABLE, &mm->flags))
> > -		__ksm_exit(mm);
> > -}
> > -
> >  /*
> >   * A KSM page is one of those write-protected "shared pages" or "merged pages"
> >   * which KSM maps into multiple mms, wherever identical anonymous page content
> > @@ -83,10 +76,6 @@ static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> >  	return 0;
> >  }
> >  
> > -static inline void ksm_exit(struct mm_struct *mm)
> > -{
> > -}
> > -
> >  static inline int PageKsm(struct page *page)
> >  {
> >  	return 0;
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 322d4fc..428b3cf 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -2384,6 +2384,11 @@ static inline void mmdrop(struct mm_struct * mm)
> >  		__mmdrop(mm);
> >  }
> >  
> > +/* mmput call list of notifier and subsystem/module can register
> > + * new one through this call.
> > + */
> > +extern int mmput_register_notifier(struct notifier_block *nb);
> > +extern int mmput_unregister_notifier(struct notifier_block *nb);
> >  /* mmput gets rid of the mappings and all user-space */
> >  extern void mmput(struct mm_struct *);
> >  /* Grab a reference to a task's mm, if it is not already going away */
> > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > index 4f844c6..44e7267 100644
> > --- a/include/linux/uprobes.h
> > +++ b/include/linux/uprobes.h
> > @@ -120,7 +120,6 @@ extern int uprobe_pre_sstep_notifier(struct pt_regs *regs);
> >  extern void uprobe_notify_resume(struct pt_regs *regs);
> >  extern bool uprobe_deny_signal(void);
> >  extern bool arch_uprobe_skip_sstep(struct arch_uprobe *aup, struct pt_regs *regs);
> > -extern void uprobe_clear_state(struct mm_struct *mm);
> >  extern int  arch_uprobe_analyze_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long addr);
> >  extern int  arch_uprobe_pre_xol(struct arch_uprobe *aup, struct pt_regs *regs);
> >  extern int  arch_uprobe_post_xol(struct arch_uprobe *aup, struct pt_regs *regs);
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 46b7c31..32b04dc 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -37,6 +37,7 @@
> >  #include <linux/percpu-rwsem.h>
> >  #include <linux/task_work.h>
> >  #include <linux/shmem_fs.h>
> > +#include <linux/notifier.h>
> >  
> >  #include <linux/uprobes.h>
> >  
> > @@ -1220,16 +1221,19 @@ static struct xol_area *get_xol_area(void)
> >  /*
> >   * uprobe_clear_state - Free the area allocated for slots.
> >   */
> > -void uprobe_clear_state(struct mm_struct *mm)
> > +static int uprobe_clear_state(struct notifier_block *nb,
> > +			      unsigned long action, void *data)
> >  {
> > +	struct mm_struct *mm = data;
> >  	struct xol_area *area = mm->uprobes_state.xol_area;
> >  
> >  	if (!area)
> > -		return;
> > +		return 0;
> >  
> >  	put_page(area->page);
> >  	kfree(area->bitmap);
> >  	kfree(area);
> > +	return 0;
> >  }
> >  
> >  void uprobe_start_dup_mmap(void)
> > @@ -1979,9 +1983,14 @@ static struct notifier_block uprobe_exception_nb = {
> >  	.priority		= INT_MAX-1,	/* notified after kprobes, kgdb */
> >  };
> >  
> > +static struct notifier_block uprobe_mmput_nb = {
> > +	.notifier_call		= uprobe_clear_state,
> > +	.priority		= 0,
> > +};
> > +
> >  static int __init init_uprobes(void)
> >  {
> > -	int i;
> > +	int i, err;
> >  
> >  	for (i = 0; i < UPROBES_HASH_SZ; i++)
> >  		mutex_init(&uprobes_mmap_mutex[i]);
> > @@ -1989,6 +1998,10 @@ static int __init init_uprobes(void)
> >  	if (percpu_init_rwsem(&dup_mmap_sem))
> >  		return -ENOMEM;
> >  
> > +	err = mmput_register_notifier(&uprobe_mmput_nb);
> > +	if (err)
> > +		return err;
> > +
> >  	return register_die_notifier(&uprobe_exception_nb);
> >  }
> >  __initcall(init_uprobes);
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index dd8864f..b448509 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -87,6 +87,8 @@
> >  #define CREATE_TRACE_POINTS
> >  #include <trace/events/task.h>
> >  
> > +static BLOCKING_NOTIFIER_HEAD(mmput_notifier);
> > +
> >  /*
> >   * Protected counters by write_lock_irq(&tasklist_lock)
> >   */
> > @@ -623,6 +625,21 @@ void __mmdrop(struct mm_struct *mm)
> >  EXPORT_SYMBOL_GPL(__mmdrop);
> >  
> >  /*
> > + * Register a notifier that will be call by mmput
> > + */
> > +int mmput_register_notifier(struct notifier_block *nb)
> > +{
> > +	return blocking_notifier_chain_register(&mmput_notifier, nb);
> > +}
> > +EXPORT_SYMBOL_GPL(mmput_register_notifier);
> > +
> > +int mmput_unregister_notifier(struct notifier_block *nb)
> > +{
> > +	return blocking_notifier_chain_unregister(&mmput_notifier, nb);
> > +}
> > +EXPORT_SYMBOL_GPL(mmput_unregister_notifier);
> > +
> > +/*
> >   * Decrement the use count and release all resources for an mm.
> >   */
> >  void mmput(struct mm_struct *mm)
> > @@ -630,11 +647,8 @@ void mmput(struct mm_struct *mm)
> >  	might_sleep();
> >  
> >  	if (atomic_dec_and_test(&mm->mm_users)) {
> > -		uprobe_clear_state(mm);
> > -		exit_aio(mm);
> > -		ksm_exit(mm);
> > -		khugepaged_exit(mm); /* must run before exit_mmap */
> >  		exit_mmap(mm);
> > +		blocking_notifier_call_chain(&mmput_notifier, 0, mm);
> >  		set_mm_exe_file(mm, NULL);
> >  		if (!list_empty(&mm->mmlist)) {
> >  			spin_lock(&mmlist_lock);
> > diff --git a/mm/ksm.c b/mm/ksm.c
> > index 346ddc9..cb1e976 100644
> > --- a/mm/ksm.c
> > +++ b/mm/ksm.c
> > @@ -37,6 +37,7 @@
> >  #include <linux/freezer.h>
> >  #include <linux/oom.h>
> >  #include <linux/numa.h>
> > +#include <linux/notifier.h>
> >  
> >  #include <asm/tlbflush.h>
> >  #include "internal.h"
> > @@ -1586,7 +1587,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
> >  		ksm_scan.mm_slot = slot;
> >  		spin_unlock(&ksm_mmlist_lock);
> >  		/*
> > -		 * Although we tested list_empty() above, a racing __ksm_exit
> > +		 * Although we tested list_empty() above, a racing ksm_exit
> >  		 * of the last mm on the list may have removed it since then.
> >  		 */
> >  		if (slot == &ksm_mm_head)
> > @@ -1658,9 +1659,9 @@ next_mm:
> >  		/*
> >  		 * We've completed a full scan of all vmas, holding mmap_sem
> >  		 * throughout, and found no VM_MERGEABLE: so do the same as
> > -		 * __ksm_exit does to remove this mm from all our lists now.
> > -		 * This applies either when cleaning up after __ksm_exit
> > -		 * (but beware: we can reach here even before __ksm_exit),
> > +		 * ksm_exit does to remove this mm from all our lists now.
> > +		 * This applies either when cleaning up after ksm_exit
> > +		 * (but beware: we can reach here even before ksm_exit),
> >  		 * or when all VM_MERGEABLE areas have been unmapped (and
> >  		 * mmap_sem then protects against race with MADV_MERGEABLE).
> >  		 */
> > @@ -1821,11 +1822,16 @@ int __ksm_enter(struct mm_struct *mm)
> >  	return 0;
> >  }
> >  
> > -void __ksm_exit(struct mm_struct *mm)
> > +static int ksm_exit(struct notifier_block *nb,
> > +		    unsigned long action, void *data)
> >  {
> > +	struct mm_struct *mm = data;
> >  	struct mm_slot *mm_slot;
> >  	int easy_to_free = 0;
> >  
> > +	if (!test_bit(MMF_VM_MERGEABLE, &mm->flags))
> > +		return 0;
> > +
> >  	/*
> >  	 * This process is exiting: if it's straightforward (as is the
> >  	 * case when ksmd was never running), free mm_slot immediately.
> > @@ -1857,6 +1863,7 @@ void __ksm_exit(struct mm_struct *mm)
> >  		down_write(&mm->mmap_sem);
> >  		up_write(&mm->mmap_sem);
> >  	}
> > +	return 0;
> >  }
> >  
> >  struct page *ksm_might_need_to_copy(struct page *page,
> > @@ -2305,11 +2312,20 @@ static struct attribute_group ksm_attr_group = {
> >  };
> >  #endif /* CONFIG_SYSFS */
> >  
> > +static struct notifier_block ksm_mmput_nb = {
> > +	.notifier_call		= ksm_exit,
> > +	.priority		= 2,
> > +};
> > +
> >  static int __init ksm_init(void)
> >  {
> >  	struct task_struct *ksm_thread;
> >  	int err;
> >  
> > +	err = mmput_register_notifier(&ksm_mmput_nb);
> > +	if (err)
> > +		return err;
> > +
> >  	err = ksm_slab_init();
> >  	if (err)
> >  		goto out;
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 61aec93..b684a21 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2775,6 +2775,9 @@ void exit_mmap(struct mm_struct *mm)
> >  	struct vm_area_struct *vma;
> >  	unsigned long nr_accounted = 0;
> >  
> > +	/* Important to call this first. */
> > +	khugepaged_exit(mm);
> > +
> >  	/* mm's last user has gone, and its about to be pulled down */
> >  	mmu_notifier_release(mm);
> >  
> 
> Hi Jerome, I reviewed the patch, integrated and tested it in AMD's HSA
> driver (KFD). It works as expected and I didn't find any problems with
> it.
> 
> I did face some problems regarding the amd IOMMU v2 driver, which
> changed its behavior (see commit "iommu/amd: Implement
> mmu_notifier_release call-back") to use mmu_notifier_release and did
> some "bad things" inside that
> notifier (primarily, but not only, deleting the object which held the
> mmu_notifier object itself, which you mustn't do because of the
> locking). 
> 
> I'm thinking of changing that driver's behavior to use this new
> mechanism instead of using mmu_notifier_release. Does that seem
> acceptable? Another solution would be to add a new mmu_notifier call,
> but we already ruled that out ;)
> 

This sounds acceptable. You can check how I did it in HMM:
http://cgit.freedesktop.org/~glisse/linux/log/?h=hmm

HMM had similar issues.
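
For illustration, a minimal sketch of the driver-side hookup on top of
patch 1 (using the mmput_register_notifier()/mmput_unregister_notifier()
API it adds); the my_driver_* names are hypothetical, not the actual
KFD/IOMMUv2 code:

#include <linux/module.h>
#include <linux/notifier.h>
#include <linux/sched.h>
#include <linux/mm_types.h>

/* Hypothetical per-mm teardown; a real driver would free its own state. */
static void my_driver_free_mm_state(struct mm_struct *mm)
{
	/* release pasid state, device page tables, ... */
}

static int my_driver_mm_release(struct notifier_block *nb,
				unsigned long action, void *data)
{
	struct mm_struct *mm = data;

	/*
	 * Called from mmput() after exit_mmap(), so mmu_notifier_release()
	 * has already run and the address space is torn down.  Per-mm
	 * bookkeeping can be freed here instead of from inside the
	 * mmu_notifier ->release() callback.
	 */
	my_driver_free_mm_state(mm);
	return 0;
}

static struct notifier_block my_driver_mmput_nb = {
	.notifier_call	= my_driver_mm_release,
	.priority	= 0,
};

static int __init my_driver_init(void)
{
	return mmput_register_notifier(&my_driver_mmput_nb);
}

static void __exit my_driver_exit(void)
{
	mmput_unregister_notifier(&my_driver_mmput_nb);
}

module_init(my_driver_init);
module_exit(my_driver_exit);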

Cheers,
Jérôme

> So,
> Reviewed-by: Oded Gabbay <oded.gabbay-5C7GfCeVMHo@public.gmane.org>
> 
> 	Oded

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
  2014-06-30  3:49     ` John Hubbard
@ 2014-06-30 15:07       ` Jerome Glisse
  -1 siblings, 0 replies; 76+ messages in thread
From: Jerome Glisse @ 2014-06-30 15:07 UTC (permalink / raw)
  To: John Hubbard
  Cc: akpm, linux-mm, linux-kernel, mgorman, hpa, peterz, aarcange,
	riel, jweiner, torvalds, Mark Hairgrove, Jatin Kumar,
	Subhash Gutti, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Sherry Cheung, Duncan Poole, Oded Gabbay,
	Alexander Deucher, Andrew Lewycky, Jérôme Glisse

On Sun, Jun 29, 2014 at 08:49:16PM -0700, John Hubbard wrote:
> On Fri, 27 Jun 2014, Jérôme Glisse wrote:
> 
> > From: Jérôme Glisse <jglisse@redhat.com>
> > 
> > Several subsystems require a callback when an mm struct is being
> > destroyed so that they can clean up their respective per-mm state.
> > Instead of having each subsystem add its callback to mmput, use a
> > notifier chain to call each of the subsystems.
> > 
> > This will allow new subsystems to register callbacks even if they are
> > modules. There should be no contention on the rw semaphore protecting
> > the call chain, and the impact on the code path should be low and
> > buried in the noise.
> > 
> > Note that this patch also moves the call to the cleanup functions after
> > exit_mmap so that new callbacks can assume that mmu_notifier_release
> > has already been called. This does not impact existing cleanup functions
> > as they do not rely on anything that exit_mmap is freeing. Also moved
> > khugepaged_exit to exit_mmap so that ordering is preserved for that
> > function.
> > 
> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > ---
> >  fs/aio.c                | 29 ++++++++++++++++++++++-------
> >  include/linux/aio.h     |  2 --
> >  include/linux/ksm.h     | 11 -----------
> >  include/linux/sched.h   |  5 +++++
> >  include/linux/uprobes.h |  1 -
> >  kernel/events/uprobes.c | 19 ++++++++++++++++---
> >  kernel/fork.c           | 22 ++++++++++++++++++----
> >  mm/ksm.c                | 26 +++++++++++++++++++++-----
> >  mm/mmap.c               |  3 +++
> >  9 files changed, 85 insertions(+), 33 deletions(-)
> > 
> > diff --git a/fs/aio.c b/fs/aio.c
> > index c1d8c48..1d06e92 100644
> > --- a/fs/aio.c
> > +++ b/fs/aio.c
> > @@ -40,6 +40,7 @@
> >  #include <linux/ramfs.h>
> >  #include <linux/percpu-refcount.h>
> >  #include <linux/mount.h>
> > +#include <linux/notifier.h>
> >  
> >  #include <asm/kmap_types.h>
> >  #include <asm/uaccess.h>
> > @@ -774,20 +775,22 @@ ssize_t wait_on_sync_kiocb(struct kiocb *req)
> >  EXPORT_SYMBOL(wait_on_sync_kiocb);
> >  
> >  /*
> > - * exit_aio: called when the last user of mm goes away.  At this point, there is
> > + * aio_exit: called when the last user of mm goes away.  At this point, there is
> >   * no way for any new requests to be submited or any of the io_* syscalls to be
> >   * called on the context.
> >   *
> >   * There may be outstanding kiocbs, but free_ioctx() will explicitly wait on
> >   * them.
> >   */
> > -void exit_aio(struct mm_struct *mm)
> > +static int aio_exit(struct notifier_block *nb,
> > +		    unsigned long action, void *data)
> >  {
> > +	struct mm_struct *mm = data;
> >  	struct kioctx_table *table = rcu_dereference_raw(mm->ioctx_table);
> >  	int i;
> >  
> >  	if (!table)
> > -		return;
> > +		return 0;
> >  
> >  	for (i = 0; i < table->nr; ++i) {
> >  		struct kioctx *ctx = table->table[i];
> > @@ -796,10 +799,10 @@ void exit_aio(struct mm_struct *mm)
> >  			continue;
> >  		/*
> >  		 * We don't need to bother with munmap() here - exit_mmap(mm)
> > -		 * is coming and it'll unmap everything. And we simply can't,
> > -		 * this is not necessarily our ->mm.
> > -		 * Since kill_ioctx() uses non-zero ->mmap_size as indicator
> > -		 * that it needs to unmap the area, just set it to 0.
> > +		 * have already been call and everything is unmap by now. But
> > +		 * to be safe set ->mmap_size to 0 since aio_free_ring() uses
> > +		 * non-zero ->mmap_size as indicator that it needs to unmap the
> > +		 * area.
> >  		 */
> 
> Actually, I think the original part of the comment about kill_ioctx
> was accurate, but the new reference to aio_free_ring looks like a typo 
> (?).  I'd write the entire comment as follows (I've dropped the leading 
> whitespace, for email):
> 
>     /*
>      * We don't need to bother with munmap() here - exit_mmap(mm)
>      * has already been called and everything is unmapped by now.
>      * But to be safe, set ->mmap_size to 0 since kill_ioctx() uses a
>      * non-zero ->mmap_size as an indicator that it needs to unmap the
>      * area.
>      */
>

This is a rebase issue: the code changed and I updated the code but
not the comment.
 
> 
> >  		ctx->mmap_size = 0;
> >  		kill_ioctx(mm, ctx, NULL);
> > @@ -807,6 +810,7 @@ void exit_aio(struct mm_struct *mm)
> >  
> >  	RCU_INIT_POINTER(mm->ioctx_table, NULL);
> >  	kfree(table);
> > +	return 0;
> >  }
> >  
> >  static void put_reqs_available(struct kioctx *ctx, unsigned nr)
> > @@ -1629,3 +1633,14 @@ SYSCALL_DEFINE5(io_getevents, aio_context_t, ctx_id,
> >  	}
> >  	return ret;
> >  }
> > +
> > +static struct notifier_block aio_mmput_nb = {
> > +	.notifier_call		= aio_exit,
> > +	.priority		= 1,
> > +};
> > +
> > +static int __init aio_init(void)
> > +{
> > +	return mmput_register_notifier(&aio_mmput_nb);
> > +}
> > +subsys_initcall(aio_init);
> > diff --git a/include/linux/aio.h b/include/linux/aio.h
> > index d9c92da..6308fac 100644
> > --- a/include/linux/aio.h
> > +++ b/include/linux/aio.h
> > @@ -73,7 +73,6 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
> >  extern ssize_t wait_on_sync_kiocb(struct kiocb *iocb);
> >  extern void aio_complete(struct kiocb *iocb, long res, long res2);
> >  struct mm_struct;
> > -extern void exit_aio(struct mm_struct *mm);
> >  extern long do_io_submit(aio_context_t ctx_id, long nr,
> >  			 struct iocb __user *__user *iocbpp, bool compat);
> >  void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel);
> > @@ -81,7 +80,6 @@ void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel);
> >  static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; }
> >  static inline void aio_complete(struct kiocb *iocb, long res, long res2) { }
> >  struct mm_struct;
> > -static inline void exit_aio(struct mm_struct *mm) { }
> >  static inline long do_io_submit(aio_context_t ctx_id, long nr,
> >  				struct iocb __user * __user *iocbpp,
> >  				bool compat) { return 0; }
> > diff --git a/include/linux/ksm.h b/include/linux/ksm.h
> > index 3be6bb1..84c184f 100644
> > --- a/include/linux/ksm.h
> > +++ b/include/linux/ksm.h
> > @@ -20,7 +20,6 @@ struct mem_cgroup;
> >  int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
> >  		unsigned long end, int advice, unsigned long *vm_flags);
> >  int __ksm_enter(struct mm_struct *mm);
> > -void __ksm_exit(struct mm_struct *mm);
> >  
> >  static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> >  {
> > @@ -29,12 +28,6 @@ static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> >  	return 0;
> >  }
> >  
> > -static inline void ksm_exit(struct mm_struct *mm)
> > -{
> > -	if (test_bit(MMF_VM_MERGEABLE, &mm->flags))
> > -		__ksm_exit(mm);
> > -}
> > -
> >  /*
> >   * A KSM page is one of those write-protected "shared pages" or "merged pages"
> >   * which KSM maps into multiple mms, wherever identical anonymous page content
> > @@ -83,10 +76,6 @@ static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> >  	return 0;
> >  }
> >  
> > -static inline void ksm_exit(struct mm_struct *mm)
> > -{
> > -}
> > -
> >  static inline int PageKsm(struct page *page)
> >  {
> >  	return 0;
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 322d4fc..428b3cf 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -2384,6 +2384,11 @@ static inline void mmdrop(struct mm_struct * mm)
> >  		__mmdrop(mm);
> >  }
> >  
> > +/* mmput call list of notifier and subsystem/module can register
> > + * new one through this call.
> > + */
> > +extern int mmput_register_notifier(struct notifier_block *nb);
> > +extern int mmput_unregister_notifier(struct notifier_block *nb);
> >  /* mmput gets rid of the mappings and all user-space */
> >  extern void mmput(struct mm_struct *);
> >  /* Grab a reference to a task's mm, if it is not already going away */
> > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > index 4f844c6..44e7267 100644
> > --- a/include/linux/uprobes.h
> > +++ b/include/linux/uprobes.h
> > @@ -120,7 +120,6 @@ extern int uprobe_pre_sstep_notifier(struct pt_regs *regs);
> >  extern void uprobe_notify_resume(struct pt_regs *regs);
> >  extern bool uprobe_deny_signal(void);
> >  extern bool arch_uprobe_skip_sstep(struct arch_uprobe *aup, struct pt_regs *regs);
> > -extern void uprobe_clear_state(struct mm_struct *mm);
> >  extern int  arch_uprobe_analyze_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long addr);
> >  extern int  arch_uprobe_pre_xol(struct arch_uprobe *aup, struct pt_regs *regs);
> >  extern int  arch_uprobe_post_xol(struct arch_uprobe *aup, struct pt_regs *regs);
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 46b7c31..32b04dc 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -37,6 +37,7 @@
> >  #include <linux/percpu-rwsem.h>
> >  #include <linux/task_work.h>
> >  #include <linux/shmem_fs.h>
> > +#include <linux/notifier.h>
> >  
> >  #include <linux/uprobes.h>
> >  
> > @@ -1220,16 +1221,19 @@ static struct xol_area *get_xol_area(void)
> >  /*
> >   * uprobe_clear_state - Free the area allocated for slots.
> >   */
> > -void uprobe_clear_state(struct mm_struct *mm)
> > +static int uprobe_clear_state(struct notifier_block *nb,
> > +			      unsigned long action, void *data)
> >  {
> > +	struct mm_struct *mm = data;
> >  	struct xol_area *area = mm->uprobes_state.xol_area;
> >  
> >  	if (!area)
> > -		return;
> > +		return 0;
> >  
> >  	put_page(area->page);
> >  	kfree(area->bitmap);
> >  	kfree(area);
> > +	return 0;
> >  }
> >  
> >  void uprobe_start_dup_mmap(void)
> > @@ -1979,9 +1983,14 @@ static struct notifier_block uprobe_exception_nb = {
> >  	.priority		= INT_MAX-1,	/* notified after kprobes, kgdb */
> >  };
> >  
> > +static struct notifier_block uprobe_mmput_nb = {
> > +	.notifier_call		= uprobe_clear_state,
> > +	.priority		= 0,
> > +};
> > +
> >  static int __init init_uprobes(void)
> >  {
> > -	int i;
> > +	int i, err;
> >  
> >  	for (i = 0; i < UPROBES_HASH_SZ; i++)
> >  		mutex_init(&uprobes_mmap_mutex[i]);
> > @@ -1989,6 +1998,10 @@ static int __init init_uprobes(void)
> >  	if (percpu_init_rwsem(&dup_mmap_sem))
> >  		return -ENOMEM;
> >  
> > +	err = mmput_register_notifier(&uprobe_mmput_nb);
> > +	if (err)
> > +		return err;
> > +
> >  	return register_die_notifier(&uprobe_exception_nb);
> >  }
> >  __initcall(init_uprobes);
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index dd8864f..b448509 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -87,6 +87,8 @@
> >  #define CREATE_TRACE_POINTS
> >  #include <trace/events/task.h>
> >  
> > +static BLOCKING_NOTIFIER_HEAD(mmput_notifier);
> > +
> >  /*
> >   * Protected counters by write_lock_irq(&tasklist_lock)
> >   */
> > @@ -623,6 +625,21 @@ void __mmdrop(struct mm_struct *mm)
> >  EXPORT_SYMBOL_GPL(__mmdrop);
> >  
> >  /*
> > + * Register a notifier that will be call by mmput
> > + */
> > +int mmput_register_notifier(struct notifier_block *nb)
> > +{
> > +	return blocking_notifier_chain_register(&mmput_notifier, nb);
> > +}
> > +EXPORT_SYMBOL_GPL(mmput_register_notifier);
> > +
> > +int mmput_unregister_notifier(struct notifier_block *nb)
> > +{
> > +	return blocking_notifier_chain_unregister(&mmput_notifier, nb);
> > +}
> > +EXPORT_SYMBOL_GPL(mmput_unregister_notifier);
> > +
> > +/*
> >   * Decrement the use count and release all resources for an mm.
> >   */
> >  void mmput(struct mm_struct *mm)
> > @@ -630,11 +647,8 @@ void mmput(struct mm_struct *mm)
> >  	might_sleep();
> >  
> >  	if (atomic_dec_and_test(&mm->mm_users)) {
> > -		uprobe_clear_state(mm);
> > -		exit_aio(mm);
> > -		ksm_exit(mm);
> > -		khugepaged_exit(mm); /* must run before exit_mmap */
> >  		exit_mmap(mm);
> > +		blocking_notifier_call_chain(&mmput_notifier, 0, mm);
> >  		set_mm_exe_file(mm, NULL);
> >  		if (!list_empty(&mm->mmlist)) {
> >  			spin_lock(&mmlist_lock);
> > diff --git a/mm/ksm.c b/mm/ksm.c
> > index 346ddc9..cb1e976 100644
> > --- a/mm/ksm.c
> > +++ b/mm/ksm.c
> > @@ -37,6 +37,7 @@
> >  #include <linux/freezer.h>
> >  #include <linux/oom.h>
> >  #include <linux/numa.h>
> > +#include <linux/notifier.h>
> >  
> >  #include <asm/tlbflush.h>
> >  #include "internal.h"
> > @@ -1586,7 +1587,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
> >  		ksm_scan.mm_slot = slot;
> >  		spin_unlock(&ksm_mmlist_lock);
> >  		/*
> > -		 * Although we tested list_empty() above, a racing __ksm_exit
> > +		 * Although we tested list_empty() above, a racing ksm_exit
> >  		 * of the last mm on the list may have removed it since then.
> >  		 */
> >  		if (slot == &ksm_mm_head)
> > @@ -1658,9 +1659,9 @@ next_mm:
> >  		/*
> >  		 * We've completed a full scan of all vmas, holding mmap_sem
> >  		 * throughout, and found no VM_MERGEABLE: so do the same as
> > -		 * __ksm_exit does to remove this mm from all our lists now.
> > -		 * This applies either when cleaning up after __ksm_exit
> > -		 * (but beware: we can reach here even before __ksm_exit),
> > +		 * ksm_exit does to remove this mm from all our lists now.
> > +		 * This applies either when cleaning up after ksm_exit
> > +		 * (but beware: we can reach here even before ksm_exit),
> >  		 * or when all VM_MERGEABLE areas have been unmapped (and
> >  		 * mmap_sem then protects against race with MADV_MERGEABLE).
> >  		 */
> > @@ -1821,11 +1822,16 @@ int __ksm_enter(struct mm_struct *mm)
> >  	return 0;
> >  }
> >  
> > -void __ksm_exit(struct mm_struct *mm)
> > +static int ksm_exit(struct notifier_block *nb,
> > +		    unsigned long action, void *data)
> >  {
> > +	struct mm_struct *mm = data;
> >  	struct mm_slot *mm_slot;
> >  	int easy_to_free = 0;
> >  
> > +	if (!test_bit(MMF_VM_MERGEABLE, &mm->flags))
> > +		return 0;
> > +
> >  	/*
> >  	 * This process is exiting: if it's straightforward (as is the
> >  	 * case when ksmd was never running), free mm_slot immediately.
> > @@ -1857,6 +1863,7 @@ void __ksm_exit(struct mm_struct *mm)
> >  		down_write(&mm->mmap_sem);
> >  		up_write(&mm->mmap_sem);
> >  	}
> > +	return 0;
> >  }
> >  
> >  struct page *ksm_might_need_to_copy(struct page *page,
> > @@ -2305,11 +2312,20 @@ static struct attribute_group ksm_attr_group = {
> >  };
> >  #endif /* CONFIG_SYSFS */
> >  
> > +static struct notifier_block ksm_mmput_nb = {
> > +	.notifier_call		= ksm_exit,
> > +	.priority		= 2,
> > +};
> > +
> >  static int __init ksm_init(void)
> >  {
> >  	struct task_struct *ksm_thread;
> >  	int err;
> >  
> > +	err = mmput_register_notifier(&ksm_mmput_nb);
> > +	if (err)
> > +		return err;
> > +
> 
> In order to be perfectly consistent with this routine's existing code, you 
> would want to write:
> 
> if (err)
> 	goto out;
> 
> ...but it does the same thing as your code. It's just a consistency thing.
> 
> >  	err = ksm_slab_init();
> >  	if (err)
> >  		goto out;
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 61aec93..b684a21 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2775,6 +2775,9 @@ void exit_mmap(struct mm_struct *mm)
> >  	struct vm_area_struct *vma;
> >  	unsigned long nr_accounted = 0;
> >  
> > +	/* Important to call this first. */
> > +	khugepaged_exit(mm);
> > +
> >  	/* mm's last user has gone, and its about to be pulled down */
> >  	mmu_notifier_release(mm);
> >  
> > -- 
> > 1.9.0
> > 
> > 
> 
> Above points are extremely minor, so:
> 
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>

I will respin nonetheless with the comment fixed.

> 
> thanks,
> John H.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
@ 2014-06-30 15:07       ` Jerome Glisse
  0 siblings, 0 replies; 76+ messages in thread
From: Jerome Glisse @ 2014-06-30 15:07 UTC (permalink / raw)
  To: John Hubbard
  Cc: akpm, linux-mm, linux-kernel, mgorman, hpa, peterz, aarcange,
	riel, jweiner, torvalds, Mark Hairgrove, Jatin Kumar,
	Subhash Gutti, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Sherry Cheung, Duncan Poole, Oded Gabbay,
	Alexander Deucher, Andrew Lewycky, Jérôme Glisse

On Sun, Jun 29, 2014 at 08:49:16PM -0700, John Hubbard wrote:
> On Fri, 27 Jun 2014, Jerome Glisse wrote:
> 
> > From: Jerome Glisse <jglisse@redhat.com>
> > 
> > Several subsystems require a callback when an mm struct is being
> > destroyed so that they can clean up their respective per-mm state.
> > Instead of having each subsystem add its callback to mmput, use a
> > notifier chain to call each of the subsystems.
> > 
> > This will allow new subsystems to register callbacks even if they are
> > modules. There should be no contention on the rw semaphore protecting
> > the call chain, and the impact on the code path should be low and
> > buried in the noise.
> > 
> > Note that this patch also moves the call to the cleanup functions after
> > exit_mmap so that new callbacks can assume that mmu_notifier_release
> > has already been called. This does not impact existing cleanup functions
> > as they do not rely on anything that exit_mmap is freeing. Also moved
> > khugepaged_exit to exit_mmap so that ordering is preserved for that
> > function.
> > 
> > Signed-off-by: Jerome Glisse <jglisse@redhat.com>
> > ---
> >  fs/aio.c                | 29 ++++++++++++++++++++++-------
> >  include/linux/aio.h     |  2 --
> >  include/linux/ksm.h     | 11 -----------
> >  include/linux/sched.h   |  5 +++++
> >  include/linux/uprobes.h |  1 -
> >  kernel/events/uprobes.c | 19 ++++++++++++++++---
> >  kernel/fork.c           | 22 ++++++++++++++++++----
> >  mm/ksm.c                | 26 +++++++++++++++++++++-----
> >  mm/mmap.c               |  3 +++
> >  9 files changed, 85 insertions(+), 33 deletions(-)
> > 
> > diff --git a/fs/aio.c b/fs/aio.c
> > index c1d8c48..1d06e92 100644
> > --- a/fs/aio.c
> > +++ b/fs/aio.c
> > @@ -40,6 +40,7 @@
> >  #include <linux/ramfs.h>
> >  #include <linux/percpu-refcount.h>
> >  #include <linux/mount.h>
> > +#include <linux/notifier.h>
> >  
> >  #include <asm/kmap_types.h>
> >  #include <asm/uaccess.h>
> > @@ -774,20 +775,22 @@ ssize_t wait_on_sync_kiocb(struct kiocb *req)
> >  EXPORT_SYMBOL(wait_on_sync_kiocb);
> >  
> >  /*
> > - * exit_aio: called when the last user of mm goes away.  At this point, there is
> > + * aio_exit: called when the last user of mm goes away.  At this point, there is
> >   * no way for any new requests to be submited or any of the io_* syscalls to be
> >   * called on the context.
> >   *
> >   * There may be outstanding kiocbs, but free_ioctx() will explicitly wait on
> >   * them.
> >   */
> > -void exit_aio(struct mm_struct *mm)
> > +static int aio_exit(struct notifier_block *nb,
> > +		    unsigned long action, void *data)
> >  {
> > +	struct mm_struct *mm = data;
> >  	struct kioctx_table *table = rcu_dereference_raw(mm->ioctx_table);
> >  	int i;
> >  
> >  	if (!table)
> > -		return;
> > +		return 0;
> >  
> >  	for (i = 0; i < table->nr; ++i) {
> >  		struct kioctx *ctx = table->table[i];
> > @@ -796,10 +799,10 @@ void exit_aio(struct mm_struct *mm)
> >  			continue;
> >  		/*
> >  		 * We don't need to bother with munmap() here - exit_mmap(mm)
> > -		 * is coming and it'll unmap everything. And we simply can't,
> > -		 * this is not necessarily our ->mm.
> > -		 * Since kill_ioctx() uses non-zero ->mmap_size as indicator
> > -		 * that it needs to unmap the area, just set it to 0.
> > +		 * have already been call and everything is unmap by now. But
> > +		 * to be safe set ->mmap_size to 0 since aio_free_ring() uses
> > +		 * non-zero ->mmap_size as indicator that it needs to unmap the
> > +		 * area.
> >  		 */
> 
> Actually, I think the original part of the comment about kill_ioctx
> was accurate, but the new reference to aio_free_ring looks like a typo 
> (?).  I'd write the entire comment as follows (I've dropped the leading 
> whitespace, for email):
> 
>     /*
>      * We don't need to bother with munmap() here - exit_mmap(mm)
>      * has already been called and everything is unmapped by now.
>      * But to be safe, set ->mmap_size to 0 since kill_ioctx() uses a
>      * non-zero ->mmap_size as an indicator that it needs to unmap the
>      * area.
>      */
>

This is a rebase issue: the code changed and I updated the code but
not the comment.
 
> 
> >  		ctx->mmap_size = 0;
> >  		kill_ioctx(mm, ctx, NULL);
> > @@ -807,6 +810,7 @@ void exit_aio(struct mm_struct *mm)
> >  
> >  	RCU_INIT_POINTER(mm->ioctx_table, NULL);
> >  	kfree(table);
> > +	return 0;
> >  }
> >  
> >  static void put_reqs_available(struct kioctx *ctx, unsigned nr)
> > @@ -1629,3 +1633,14 @@ SYSCALL_DEFINE5(io_getevents, aio_context_t, ctx_id,
> >  	}
> >  	return ret;
> >  }
> > +
> > +static struct notifier_block aio_mmput_nb = {
> > +	.notifier_call		= aio_exit,
> > +	.priority		= 1,
> > +};
> > +
> > +static int __init aio_init(void)
> > +{
> > +	return mmput_register_notifier(&aio_mmput_nb);
> > +}
> > +subsys_initcall(aio_init);
> > diff --git a/include/linux/aio.h b/include/linux/aio.h
> > index d9c92da..6308fac 100644
> > --- a/include/linux/aio.h
> > +++ b/include/linux/aio.h
> > @@ -73,7 +73,6 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
> >  extern ssize_t wait_on_sync_kiocb(struct kiocb *iocb);
> >  extern void aio_complete(struct kiocb *iocb, long res, long res2);
> >  struct mm_struct;
> > -extern void exit_aio(struct mm_struct *mm);
> >  extern long do_io_submit(aio_context_t ctx_id, long nr,
> >  			 struct iocb __user *__user *iocbpp, bool compat);
> >  void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel);
> > @@ -81,7 +80,6 @@ void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel);
> >  static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; }
> >  static inline void aio_complete(struct kiocb *iocb, long res, long res2) { }
> >  struct mm_struct;
> > -static inline void exit_aio(struct mm_struct *mm) { }
> >  static inline long do_io_submit(aio_context_t ctx_id, long nr,
> >  				struct iocb __user * __user *iocbpp,
> >  				bool compat) { return 0; }
> > diff --git a/include/linux/ksm.h b/include/linux/ksm.h
> > index 3be6bb1..84c184f 100644
> > --- a/include/linux/ksm.h
> > +++ b/include/linux/ksm.h
> > @@ -20,7 +20,6 @@ struct mem_cgroup;
> >  int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
> >  		unsigned long end, int advice, unsigned long *vm_flags);
> >  int __ksm_enter(struct mm_struct *mm);
> > -void __ksm_exit(struct mm_struct *mm);
> >  
> >  static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> >  {
> > @@ -29,12 +28,6 @@ static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> >  	return 0;
> >  }
> >  
> > -static inline void ksm_exit(struct mm_struct *mm)
> > -{
> > -	if (test_bit(MMF_VM_MERGEABLE, &mm->flags))
> > -		__ksm_exit(mm);
> > -}
> > -
> >  /*
> >   * A KSM page is one of those write-protected "shared pages" or "merged pages"
> >   * which KSM maps into multiple mms, wherever identical anonymous page content
> > @@ -83,10 +76,6 @@ static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> >  	return 0;
> >  }
> >  
> > -static inline void ksm_exit(struct mm_struct *mm)
> > -{
> > -}
> > -
> >  static inline int PageKsm(struct page *page)
> >  {
> >  	return 0;
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 322d4fc..428b3cf 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -2384,6 +2384,11 @@ static inline void mmdrop(struct mm_struct * mm)
> >  		__mmdrop(mm);
> >  }
> >  
> > +/* mmput call list of notifier and subsystem/module can register
> > + * new one through this call.
> > + */
> > +extern int mmput_register_notifier(struct notifier_block *nb);
> > +extern int mmput_unregister_notifier(struct notifier_block *nb);
> >  /* mmput gets rid of the mappings and all user-space */
> >  extern void mmput(struct mm_struct *);
> >  /* Grab a reference to a task's mm, if it is not already going away */
> > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > index 4f844c6..44e7267 100644
> > --- a/include/linux/uprobes.h
> > +++ b/include/linux/uprobes.h
> > @@ -120,7 +120,6 @@ extern int uprobe_pre_sstep_notifier(struct pt_regs *regs);
> >  extern void uprobe_notify_resume(struct pt_regs *regs);
> >  extern bool uprobe_deny_signal(void);
> >  extern bool arch_uprobe_skip_sstep(struct arch_uprobe *aup, struct pt_regs *regs);
> > -extern void uprobe_clear_state(struct mm_struct *mm);
> >  extern int  arch_uprobe_analyze_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long addr);
> >  extern int  arch_uprobe_pre_xol(struct arch_uprobe *aup, struct pt_regs *regs);
> >  extern int  arch_uprobe_post_xol(struct arch_uprobe *aup, struct pt_regs *regs);
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 46b7c31..32b04dc 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -37,6 +37,7 @@
> >  #include <linux/percpu-rwsem.h>
> >  #include <linux/task_work.h>
> >  #include <linux/shmem_fs.h>
> > +#include <linux/notifier.h>
> >  
> >  #include <linux/uprobes.h>
> >  
> > @@ -1220,16 +1221,19 @@ static struct xol_area *get_xol_area(void)
> >  /*
> >   * uprobe_clear_state - Free the area allocated for slots.
> >   */
> > -void uprobe_clear_state(struct mm_struct *mm)
> > +static int uprobe_clear_state(struct notifier_block *nb,
> > +			      unsigned long action, void *data)
> >  {
> > +	struct mm_struct *mm = data;
> >  	struct xol_area *area = mm->uprobes_state.xol_area;
> >  
> >  	if (!area)
> > -		return;
> > +		return 0;
> >  
> >  	put_page(area->page);
> >  	kfree(area->bitmap);
> >  	kfree(area);
> > +	return 0;
> >  }
> >  
> >  void uprobe_start_dup_mmap(void)
> > @@ -1979,9 +1983,14 @@ static struct notifier_block uprobe_exception_nb = {
> >  	.priority		= INT_MAX-1,	/* notified after kprobes, kgdb */
> >  };
> >  
> > +static struct notifier_block uprobe_mmput_nb = {
> > +	.notifier_call		= uprobe_clear_state,
> > +	.priority		= 0,
> > +};
> > +
> >  static int __init init_uprobes(void)
> >  {
> > -	int i;
> > +	int i, err;
> >  
> >  	for (i = 0; i < UPROBES_HASH_SZ; i++)
> >  		mutex_init(&uprobes_mmap_mutex[i]);
> > @@ -1989,6 +1998,10 @@ static int __init init_uprobes(void)
> >  	if (percpu_init_rwsem(&dup_mmap_sem))
> >  		return -ENOMEM;
> >  
> > +	err = mmput_register_notifier(&uprobe_mmput_nb);
> > +	if (err)
> > +		return err;
> > +
> >  	return register_die_notifier(&uprobe_exception_nb);
> >  }
> >  __initcall(init_uprobes);
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index dd8864f..b448509 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -87,6 +87,8 @@
> >  #define CREATE_TRACE_POINTS
> >  #include <trace/events/task.h>
> >  
> > +static BLOCKING_NOTIFIER_HEAD(mmput_notifier);
> > +
> >  /*
> >   * Protected counters by write_lock_irq(&tasklist_lock)
> >   */
> > @@ -623,6 +625,21 @@ void __mmdrop(struct mm_struct *mm)
> >  EXPORT_SYMBOL_GPL(__mmdrop);
> >  
> >  /*
> > + * Register a notifier that will be call by mmput
> > + */
> > +int mmput_register_notifier(struct notifier_block *nb)
> > +{
> > +	return blocking_notifier_chain_register(&mmput_notifier, nb);
> > +}
> > +EXPORT_SYMBOL_GPL(mmput_register_notifier);
> > +
> > +int mmput_unregister_notifier(struct notifier_block *nb)
> > +{
> > +	return blocking_notifier_chain_unregister(&mmput_notifier, nb);
> > +}
> > +EXPORT_SYMBOL_GPL(mmput_unregister_notifier);
> > +
> > +/*
> >   * Decrement the use count and release all resources for an mm.
> >   */
> >  void mmput(struct mm_struct *mm)
> > @@ -630,11 +647,8 @@ void mmput(struct mm_struct *mm)
> >  	might_sleep();
> >  
> >  	if (atomic_dec_and_test(&mm->mm_users)) {
> > -		uprobe_clear_state(mm);
> > -		exit_aio(mm);
> > -		ksm_exit(mm);
> > -		khugepaged_exit(mm); /* must run before exit_mmap */
> >  		exit_mmap(mm);
> > +		blocking_notifier_call_chain(&mmput_notifier, 0, mm);
> >  		set_mm_exe_file(mm, NULL);
> >  		if (!list_empty(&mm->mmlist)) {
> >  			spin_lock(&mmlist_lock);
> > diff --git a/mm/ksm.c b/mm/ksm.c
> > index 346ddc9..cb1e976 100644
> > --- a/mm/ksm.c
> > +++ b/mm/ksm.c
> > @@ -37,6 +37,7 @@
> >  #include <linux/freezer.h>
> >  #include <linux/oom.h>
> >  #include <linux/numa.h>
> > +#include <linux/notifier.h>
> >  
> >  #include <asm/tlbflush.h>
> >  #include "internal.h"
> > @@ -1586,7 +1587,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
> >  		ksm_scan.mm_slot = slot;
> >  		spin_unlock(&ksm_mmlist_lock);
> >  		/*
> > -		 * Although we tested list_empty() above, a racing __ksm_exit
> > +		 * Although we tested list_empty() above, a racing ksm_exit
> >  		 * of the last mm on the list may have removed it since then.
> >  		 */
> >  		if (slot == &ksm_mm_head)
> > @@ -1658,9 +1659,9 @@ next_mm:
> >  		/*
> >  		 * We've completed a full scan of all vmas, holding mmap_sem
> >  		 * throughout, and found no VM_MERGEABLE: so do the same as
> > -		 * __ksm_exit does to remove this mm from all our lists now.
> > -		 * This applies either when cleaning up after __ksm_exit
> > -		 * (but beware: we can reach here even before __ksm_exit),
> > +		 * ksm_exit does to remove this mm from all our lists now.
> > +		 * This applies either when cleaning up after ksm_exit
> > +		 * (but beware: we can reach here even before ksm_exit),
> >  		 * or when all VM_MERGEABLE areas have been unmapped (and
> >  		 * mmap_sem then protects against race with MADV_MERGEABLE).
> >  		 */
> > @@ -1821,11 +1822,16 @@ int __ksm_enter(struct mm_struct *mm)
> >  	return 0;
> >  }
> >  
> > -void __ksm_exit(struct mm_struct *mm)
> > +static int ksm_exit(struct notifier_block *nb,
> > +		    unsigned long action, void *data)
> >  {
> > +	struct mm_struct *mm = data;
> >  	struct mm_slot *mm_slot;
> >  	int easy_to_free = 0;
> >  
> > +	if (!test_bit(MMF_VM_MERGEABLE, &mm->flags))
> > +		return 0;
> > +
> >  	/*
> >  	 * This process is exiting: if it's straightforward (as is the
> >  	 * case when ksmd was never running), free mm_slot immediately.
> > @@ -1857,6 +1863,7 @@ void __ksm_exit(struct mm_struct *mm)
> >  		down_write(&mm->mmap_sem);
> >  		up_write(&mm->mmap_sem);
> >  	}
> > +	return 0;
> >  }
> >  
> >  struct page *ksm_might_need_to_copy(struct page *page,
> > @@ -2305,11 +2312,20 @@ static struct attribute_group ksm_attr_group = {
> >  };
> >  #endif /* CONFIG_SYSFS */
> >  
> > +static struct notifier_block ksm_mmput_nb = {
> > +	.notifier_call		= ksm_exit,
> > +	.priority		= 2,
> > +};
> > +
> >  static int __init ksm_init(void)
> >  {
> >  	struct task_struct *ksm_thread;
> >  	int err;
> >  
> > +	err = mmput_register_notifier(&ksm_mmput_nb);
> > +	if (err)
> > +		return err;
> > +
> 
> In order to be perfectly consistent with this routine's existing code, you 
> would want to write:
> 
> if (err)
> 	goto out;
> 
> ...but it does the same thing as your code. It's just a consistency thing.
> 
> >  	err = ksm_slab_init();
> >  	if (err)
> >  		goto out;
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 61aec93..b684a21 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2775,6 +2775,9 @@ void exit_mmap(struct mm_struct *mm)
> >  	struct vm_area_struct *vma;
> >  	unsigned long nr_accounted = 0;
> >  
> > +	/* Important to call this first. */
> > +	khugepaged_exit(mm);
> > +
> >  	/* mm's last user has gone, and its about to be pulled down */
> >  	mmu_notifier_release(mm);
> >  
> > -- 
> > 1.9.0
> > 
> > 
> 
> Above points are extremely minor, so:
> 
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>

I will respin nonetheless with the comment fixed.

> 
> thanks,
> John H.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
  2014-06-28  2:00   ` Jérôme Glisse
                     ` (2 preceding siblings ...)
  (?)
@ 2014-06-30 15:37   ` Joerg Roedel
  -1 siblings, 0 replies; 76+ messages in thread
From: Joerg Roedel @ 2014-06-30 15:37 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: akpm, linux-mm, linux-kernel, mgorman, hpa, peterz, aarcange,
	riel, jweiner, torvalds, Mark Hairgrove, Jatin Kumar,
	Subhash Gutti, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, John Hubbard, Sherry Cheung, Duncan Poole,
	Oded Gabbay, Alexander Deucher, Andrew Lewycky,
	Jérôme Glisse

On Fri, Jun 27, 2014 at 10:00:19PM -0400, Jérôme Glisse wrote:
> Note that this patch also moves the call to the cleanup functions after
> exit_mmap so that new callbacks can assume that mmu_notifier_release
> has already been called. This does not impact existing cleanup functions
> as they do not rely on anything that exit_mmap is freeing. Also moved
> khugepaged_exit to exit_mmap so that ordering is preserved for that
> function.

What this patch does is duplicate the functionality of the
mmu_notifier_release call-back. Why is it needed?


	Joerg



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
       [not found]     ` <019CCE693E457142B37B791721487FD91806B836-0nO7ALo/ziwxlywnonMhLEEOCMrvLtNR@public.gmane.org>
@ 2014-06-30 15:40       ` Joerg Roedel
  2014-06-30 16:06           ` Jerome Glisse
  0 siblings, 1 reply; 76+ messages in thread
From: Joerg Roedel @ 2014-06-30 15:40 UTC (permalink / raw)
  To: Gabbay, Oded
  Cc: Sherry Cheung, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Jérôme Glisse,
	aarcange-H+wXaHxf7aLQT0dZR+AlfA, Jatin Kumar, Lucien Dunning,
	mgorman-l3A5Bk7waGM, jweiner-H+wXaHxf7aLQT0dZR+AlfA,
	Subhash Gutti, riel-H+wXaHxf7aLQT0dZR+AlfA, John Hubbard,
	Mark Hairgrove, Cameron Buschardt, peterz-hDdKplPs4pWWVfeAwA7xHQ,
	Duncan Poole, Cornwall, Jay, Lewycky, Andrew,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Mon, Jun 30, 2014 at 02:41:24PM +0000, Gabbay, Oded wrote:
> I did face some problems regarding the amd IOMMU v2 driver, which
> changed its behavior (see commit "iommu/amd: Implement
> mmu_notifier_release call-back") to use mmu_notifier_release and did
> some "bad things" inside that
> notifier (primarily, but not only, deleting the object which held the
> mmu_notifier object itself, which you mustn't do because of the
> locking). 
> 
> I'm thinking of changing that driver's behavior to use this new
> mechanism instead of using mmu_notifier_release. Does that seem
> acceptable? Another solution would be to add a new mmu_notifier call,
> but we already ruled that out ;)

The mmu_notifier_release() function does exactly what this new notifier
aims to do. Unless there is a very compelling reason to duplicate this
functionality, I strongly NACK this approach.
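
To make the locking problem concrete: the pattern Oded describes, freeing
the structure that embeds the mmu_notifier from inside its own ->release()
callback, looks roughly like the hypothetical, simplified sketch below
(not the actual amd_iommu_v2 code). It is unsafe because
mmu_notifier_release() still walks the notifier list and unhooks the hlist
entry after the callback returns, so the free would have to be deferred,
for example past an SRCU grace period:

#include <linux/kernel.h>
#include <linux/mmu_notifier.h>
#include <linux/slab.h>

/* Hypothetical per-(device, mm) state embedding the notifier. */
struct my_pasid_state {
	struct mmu_notifier mn;
	/* ... pasid, device pointers, ... */
};

static void my_mn_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
	struct my_pasid_state *state =
		container_of(mn, struct my_pasid_state, mn);

	/*
	 * Problematic: the mmu_notifier core still dereferences mn->hlist
	 * after this callback returns, so freeing the embedding object
	 * here is a use-after-free.
	 */
	kfree(state);
}

/* The ops would be hooked up via mmu_notifier_register(&state->mn, mm). */
static const struct mmu_notifier_ops my_mn_ops = {
	.release	= my_mn_release,
};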


	Joerg

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 3/6] mmu_notifier: add event information to address invalidation v2
  2014-06-30  5:22     ` John Hubbard
@ 2014-06-30 15:57       ` Jerome Glisse
  -1 siblings, 0 replies; 76+ messages in thread
From: Jerome Glisse @ 2014-06-30 15:57 UTC (permalink / raw)
  To: John Hubbard
  Cc: akpm, linux-mm, linux-kernel, mgorman, hpa, peterz, aarcange,
	riel, jweiner, torvalds, Mark Hairgrove, Jatin Kumar,
	Subhash Gutti, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Sherry Cheung, Duncan Poole, Oded Gabbay,
	Alexander Deucher, Andrew Lewycky, Jérôme Glisse

On Sun, Jun 29, 2014 at 10:22:57PM -0700, John Hubbard wrote:
> On Fri, 27 Jun 2014, Jérôme Glisse wrote:
> 
> > From: Jérôme Glisse <jglisse@redhat.com>
> > 
> > The event information will be useful for new users of the mmu_notifier API.
> > The event argument differentiates between a vma disappearing, a page
> > being write protected or simply a page being unmapped. This allows new
> > users to take different paths for different events: for instance, on unmap
> > the resources used to track a vma are still valid and should stay around,
> > while if the event says that a vma is being destroyed it means that any
> > resources used to track this vma can be freed.
> > 
> > Changed since v1:
> >   - renamed action into event (updated commit message too).
> >   - simplified the event names and clarified their intended usage,
> >     also documenting what expectations the listener can have with
> >     respect to each event.
> > 
> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > ---
> >  drivers/gpu/drm/i915/i915_gem_userptr.c |   3 +-
> >  drivers/iommu/amd_iommu_v2.c            |  14 ++--
> >  drivers/misc/sgi-gru/grutlbpurge.c      |   9 ++-
> >  drivers/xen/gntdev.c                    |   9 ++-
> >  fs/proc/task_mmu.c                      |   6 +-
> >  include/linux/hugetlb.h                 |   7 +-
> >  include/linux/mmu_notifier.h            | 117 ++++++++++++++++++++++++++------
> >  kernel/events/uprobes.c                 |  10 ++-
> >  mm/filemap_xip.c                        |   2 +-
> >  mm/huge_memory.c                        |  51 ++++++++------
> >  mm/hugetlb.c                            |  25 ++++---
> >  mm/ksm.c                                |  18 +++--
> >  mm/memory.c                             |  27 +++++---
> >  mm/migrate.c                            |   9 ++-
> >  mm/mmu_notifier.c                       |  28 +++++---
> >  mm/mprotect.c                           |  33 ++++++---
> >  mm/mremap.c                             |   6 +-
> >  mm/rmap.c                               |  24 +++++--
> >  virt/kvm/kvm_main.c                     |  12 ++--
> >  19 files changed, 291 insertions(+), 119 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > index 21ea928..ed6f35e 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > @@ -56,7 +56,8 @@ struct i915_mmu_object {
> >  static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> >  						       struct mm_struct *mm,
> >  						       unsigned long start,
> > -						       unsigned long end)
> > +						       unsigned long end,
> > +						       enum mmu_event event)
> >  {
> >  	struct i915_mmu_notifier *mn = container_of(_mn, struct i915_mmu_notifier, mn);
> >  	struct interval_tree_node *it = NULL;
> > diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
> > index 499b436..2bb9771 100644
> > --- a/drivers/iommu/amd_iommu_v2.c
> > +++ b/drivers/iommu/amd_iommu_v2.c
> > @@ -414,21 +414,25 @@ static int mn_clear_flush_young(struct mmu_notifier *mn,
> >  static void mn_change_pte(struct mmu_notifier *mn,
> >  			  struct mm_struct *mm,
> >  			  unsigned long address,
> > -			  pte_t pte)
> > +			  pte_t pte,
> > +			  enum mmu_event event)
> >  {
> >  	__mn_flush_page(mn, address);
> >  }
> >  
> >  static void mn_invalidate_page(struct mmu_notifier *mn,
> >  			       struct mm_struct *mm,
> > -			       unsigned long address)
> > +			       unsigned long address,
> > +			       enum mmu_event event)
> >  {
> >  	__mn_flush_page(mn, address);
> >  }
> >  
> >  static void mn_invalidate_range_start(struct mmu_notifier *mn,
> >  				      struct mm_struct *mm,
> > -				      unsigned long start, unsigned long end)
> > +				      unsigned long start,
> > +				      unsigned long end,
> > +				      enum mmu_event event)
> >  {
> >  	struct pasid_state *pasid_state;
> >  	struct device_state *dev_state;
> > @@ -449,7 +453,9 @@ static void mn_invalidate_range_start(struct mmu_notifier *mn,
> >  
> >  static void mn_invalidate_range_end(struct mmu_notifier *mn,
> >  				    struct mm_struct *mm,
> > -				    unsigned long start, unsigned long end)
> > +				    unsigned long start,
> > +				    unsigned long end,
> > +				    enum mmu_event event)
> >  {
> >  	struct pasid_state *pasid_state;
> >  	struct device_state *dev_state;
> > diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
> > index 2129274..e67fed1 100644
> > --- a/drivers/misc/sgi-gru/grutlbpurge.c
> > +++ b/drivers/misc/sgi-gru/grutlbpurge.c
> > @@ -221,7 +221,8 @@ void gru_flush_all_tlb(struct gru_state *gru)
> >   */
> >  static void gru_invalidate_range_start(struct mmu_notifier *mn,
> >  				       struct mm_struct *mm,
> > -				       unsigned long start, unsigned long end)
> > +				       unsigned long start, unsigned long end,
> > +				       enum mmu_event event)
> >  {
> >  	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
> >  						 ms_notifier);
> > @@ -235,7 +236,8 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
> >  
> >  static void gru_invalidate_range_end(struct mmu_notifier *mn,
> >  				     struct mm_struct *mm, unsigned long start,
> > -				     unsigned long end)
> > +				     unsigned long end,
> > +				     enum mmu_event event)
> >  {
> >  	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
> >  						 ms_notifier);
> > @@ -248,7 +250,8 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
> >  }
> >  
> >  static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
> > -				unsigned long address)
> > +				unsigned long address,
> > +				enum mmu_event event)
> >  {
> >  	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
> >  						 ms_notifier);
> > diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
> > index 073b4a1..fe9da94 100644
> > --- a/drivers/xen/gntdev.c
> > +++ b/drivers/xen/gntdev.c
> > @@ -428,7 +428,9 @@ static void unmap_if_in_range(struct grant_map *map,
> >  
> >  static void mn_invl_range_start(struct mmu_notifier *mn,
> >  				struct mm_struct *mm,
> > -				unsigned long start, unsigned long end)
> > +				unsigned long start,
> > +				unsigned long end,
> > +				enum mmu_event event)
> >  {
> >  	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
> >  	struct grant_map *map;
> > @@ -445,9 +447,10 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
> >  
> >  static void mn_invl_page(struct mmu_notifier *mn,
> >  			 struct mm_struct *mm,
> > -			 unsigned long address)
> > +			 unsigned long address,
> > +			 enum mmu_event event)
> >  {
> > -	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE);
> > +	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, event);
> >  }
> >  
> >  static void mn_release(struct mmu_notifier *mn,
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index cfa63ee..e9e79f7 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -830,7 +830,8 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> >  		};
> >  		down_read(&mm->mmap_sem);
> >  		if (type == CLEAR_REFS_SOFT_DIRTY)
> > -			mmu_notifier_invalidate_range_start(mm, 0, -1);
> > +			mmu_notifier_invalidate_range_start(mm, 0,
> > +							    -1, MMU_STATUS);
> >  		for (vma = mm->mmap; vma; vma = vma->vm_next) {
> >  			cp.vma = vma;
> >  			if (is_vm_hugetlb_page(vma))
> > @@ -858,7 +859,8 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> >  					&clear_refs_walk);
> >  		}
> >  		if (type == CLEAR_REFS_SOFT_DIRTY)
> > -			mmu_notifier_invalidate_range_end(mm, 0, -1);
> > +			mmu_notifier_invalidate_range_end(mm, 0,
> > +							  -1, MMU_STATUS);
> >  		flush_tlb_mm(mm);
> >  		up_read(&mm->mmap_sem);
> >  		mmput(mm);
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 6a836ef..d7e512f 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -6,6 +6,7 @@
> >  #include <linux/fs.h>
> >  #include <linux/hugetlb_inline.h>
> >  #include <linux/cgroup.h>
> > +#include <linux/mmu_notifier.h>
> >  #include <linux/list.h>
> >  #include <linux/kref.h>
> >  
> > @@ -103,7 +104,8 @@ struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
> >  int pmd_huge(pmd_t pmd);
> >  int pud_huge(pud_t pmd);
> >  unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> > -		unsigned long address, unsigned long end, pgprot_t newprot);
> > +		unsigned long address, unsigned long end, pgprot_t newprot,
> > +		enum mmu_event event);
> >  
> >  #else /* !CONFIG_HUGETLB_PAGE */
> >  
> > @@ -148,7 +150,8 @@ static inline bool isolate_huge_page(struct page *page, struct list_head *list)
> >  #define is_hugepage_active(x)	false
> >  
> >  static inline unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> > -		unsigned long address, unsigned long end, pgprot_t newprot)
> > +		unsigned long address, unsigned long end, pgprot_t newprot,
> > +		enum mmu_event event)
> >  {
> >  	return 0;
> >  }
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index deca874..82e9577 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -9,6 +9,52 @@
> >  struct mmu_notifier;
> >  struct mmu_notifier_ops;
> >  
> > +/* Event report finer informations to the callback allowing the event listener
> > + * to take better action. There are only few kinds of events :
> > + *
> > + *   - MMU_MIGRATE memory is migrating from one page to another thus all write
> > + *     access must stop after invalidate_range_start callback returns. And no
> > + *     read access should be allowed either as new page can be remapped with
> > + *     write access before the invalidate_range_end callback happen and thus
> > + *     any read access to old page might access outdated informations. Several
> > + *     source to this event like page moving to swap (for various reasons like
> > + *     page reclaim), outcome of mremap syscall, migration for numa reasons,
> > + *     balancing memory pool, write fault on read only page trigger a new page
> > + *     to be allocated and used, ...
> > + *   - MMU_MPROT_NONE memory access protection is change, no page in the range
> > + *     can be accessed in either read or write mode but the range of address
> > + *     is still valid. All access are still fine until invalidate_range_end
> > + *     callback returns.
> > + *   - MMU_MPROT_RONLY memory access proctection is changing to read only.
> > + *     All access are still fine until invalidate_range_end callback returns.
> > + *   - MMU_MPROT_RANDW memory access proctection is changing to read an write.
> > + *     All access are still fine until invalidate_range_end callback returns.
> > + *   - MMU_MPROT_WONLY memory access proctection is changing to write only.
> > + *     All access are still fine until invalidate_range_end callback returns.
> > + *   - MMU_MUNMAP the range is being unmaped (outcome of a munmap syscall). It
> > + *     is fine to still have read/write access until the invalidate_range_end
> > + *     callback returns. This also imply that secondary page table can be trim
> > + *     as the address range is no longer valid.
> > + *   - MMU_WB memory is being write back to disk, all write access must stop
> > + *     after invalidate_range_start callback returns. Read access are still
> > + *     allowed.
> > + *   - MMU_STATUS memory status change, like soft dirty.
> > + *
> > + * In doubt when adding a new notifier caller use MMU_MIGRATE it will always
> > + * result in expected behavior but will not allow listener a chance to optimize
> > + * its events.
> > + */
> 
> Here is a pass at tightening up that documentation:
> 
> /* MMU Events report fine-grained information to the callback routine, allowing
>  * the event listener to make a more informed decision as to what action to
>  * take. The event types are:
>  *
>  *   - MMU_MIGRATE: memory is migrating from one page to another, thus all write
>  *     access must stop after invalidate_range_start callback returns.
>  *     Furthermore, no read access should be allowed either, as a new page can
>  *     be remapped with write access before the invalidate_range_end callback
>  *     happens and thus any read access to old page might read stale data. There
>  *     are several sources for this event, including:
>  *
>  *         - A page moving to swap (for various reasons, including page
>  *           reclaim),
>  *         - An mremap syscall,
>  *         - migration for NUMA reasons,
>  *         - balancing the memory pool,
>  *         - write fault on a read-only page triggers a new page to be allocated
>  *           and used,
>  *         - and more that are not listed here.
>  *
>  *   - MMU_MPROT_NONE: memory access protection is changing to "none": no page
>  *     in the range can be accessed in either read or write mode but the range
>  *     of addresses is still valid. However, access is still allowed, up until
>  *     invalidate_range_end callback returns.
>  *
>  *   - MMU_MPROT_RONLY: memory access protection is changing to read only.
>  *     However, access is still allowed, up until invalidate_range_end callback
>  *     returns.
>  *
>  *   - MMU_MPROT_RANDW: memory access protection is changing to read and write.
>  *     However, access is still allowed, up until invalidate_range_end callback
>  *     returns.
>  *
>  *   - MMU_MPROT_WONLY: memory access protection is changing to write only.
>  *     However, access is still allowed, up until invalidate_range_end callback
>  *     returns.
>  *
>  *   - MMU_MUNMAP: the range is being unmapped (outcome of a munmap syscall).
>  *     However, access is still allowed, up until invalidate_range_end callback
>  *     returns. This also implies that the secondary page table can be trimmed,
>  *     because the address range is no longer valid.
>  *
>  *   - MMU_WB: memory is being written back to disk, all write accesses must
>  *     stop after invalidate_range_start callback returns. Read accesses are still
>  *     allowed.
>  *
>  *   - MMU_STATUS: memory status change, like soft dirty, or huge page
>  *     splitting (in place).
>  *
>  * If in doubt when adding a new notifier caller, please use MMU_MIGRATE,
>  * because it will always lead to reasonable behavior, but will not allow the
>  * listener a chance to optimize its events.
>  */
> 
> Mostly just cleaning up the wording, except that I did add "huge page 
> splitting" to the cases that could cause an MMU_STATUS to fire.
>

Yes, your wording is better than mine. The huge page case is a bit of an odd
one, as it does not always reuse the same page (THP vs hugetlbfs differ here).
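
To make the intended usage a bit more concrete, here is a minimal sketch of
how a listener might dispatch on the event in its invalidate_range_start()
callback. The example_mirror structure and the example_mirror_*() helpers are
purely hypothetical; they only stand in for whatever tracking a real driver
would keep:

static void example_invalidate_range_start(struct mmu_notifier *mn,
					   struct mm_struct *mm,
					   unsigned long start,
					   unsigned long end,
					   enum mmu_event event)
{
	/* Hypothetical per-mm mirror state embedding the notifier. */
	struct example_mirror *mirror;

	mirror = container_of(mn, struct example_mirror, mn);

	switch (event) {
	case MMU_MPROT_RONLY:
	case MMU_WB:
		/* Only write permission is lost, so downgrade the secondary
		 * mappings to read-only instead of tearing them down.
		 */
		example_mirror_write_protect(mirror, start, end);
		break;
	case MMU_MUNMAP:
		/* The address range is going away for good, so the
		 * structures tracking it can be freed as well.
		 */
		example_mirror_free_range(mirror, start, end);
		break;
	case MMU_STATUS:
		/* CPU-side status change only (soft dirty, THP split in
		 * place); this hypothetical mirror has nothing to update.
		 */
		break;
	default:
		/* MMU_MIGRATE and anything unknown: unmap the whole range,
		 * which is always the safe, if pessimistic, choice.
		 */
		example_mirror_unmap(mirror, start, end);
		break;
	}
}

With MMU_MIGRATE documented as the fallback, an event type the listener does
not know about still ends up on the conservative path.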

> > +enum mmu_event {
> > +	MMU_MIGRATE = 0,
> > +	MMU_MPROT_NONE,
> > +	MMU_MPROT_RONLY,
> > +	MMU_MPROT_RANDW,
> > +	MMU_MPROT_WONLY,
> > +	MMU_MUNMAP,
> > +	MMU_STATUS,
> > +	MMU_WB,
> > +};
> > +
> >  #ifdef CONFIG_MMU_NOTIFIER
> >  
> >  /*
> > @@ -79,7 +125,8 @@ struct mmu_notifier_ops {
> >  	void (*change_pte)(struct mmu_notifier *mn,
> >  			   struct mm_struct *mm,
> >  			   unsigned long address,
> > -			   pte_t pte);
> > +			   pte_t pte,
> > +			   enum mmu_event event);
> >  
> >  	/*
> >  	 * Before this is invoked any secondary MMU is still ok to
> > @@ -90,7 +137,8 @@ struct mmu_notifier_ops {
> >  	 */
> >  	void (*invalidate_page)(struct mmu_notifier *mn,
> >  				struct mm_struct *mm,
> > -				unsigned long address);
> > +				unsigned long address,
> > +				enum mmu_event event);
> >  
> >  	/*
> >  	 * invalidate_range_start() and invalidate_range_end() must be
> > @@ -137,10 +185,14 @@ struct mmu_notifier_ops {
> >  	 */
> >  	void (*invalidate_range_start)(struct mmu_notifier *mn,
> >  				       struct mm_struct *mm,
> > -				       unsigned long start, unsigned long end);
> > +				       unsigned long start,
> > +				       unsigned long end,
> > +				       enum mmu_event event);
> >  	void (*invalidate_range_end)(struct mmu_notifier *mn,
> >  				     struct mm_struct *mm,
> > -				     unsigned long start, unsigned long end);
> > +				     unsigned long start,
> > +				     unsigned long end,
> > +				     enum mmu_event event);
> >  };
> >  
> >  /*
> > @@ -177,13 +229,20 @@ extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
> >  extern int __mmu_notifier_test_young(struct mm_struct *mm,
> >  				     unsigned long address);
> >  extern void __mmu_notifier_change_pte(struct mm_struct *mm,
> > -				      unsigned long address, pte_t pte);
> > +				      unsigned long address,
> > +				      pte_t pte,
> > +				      enum mmu_event event);
> >  extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
> > -					  unsigned long address);
> > +					  unsigned long address,
> > +					  enum mmu_event event);
> >  extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> > -				  unsigned long start, unsigned long end);
> > +						  unsigned long start,
> > +						  unsigned long end,
> > +						  enum mmu_event event);
> >  extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> > -				  unsigned long start, unsigned long end);
> > +						unsigned long start,
> > +						unsigned long end,
> > +						enum mmu_event event);
> >  
> >  static inline void mmu_notifier_release(struct mm_struct *mm)
> >  {
> > @@ -208,31 +267,38 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
> >  }
> >  
> >  static inline void mmu_notifier_change_pte(struct mm_struct *mm,
> > -					   unsigned long address, pte_t pte)
> > +					   unsigned long address,
> > +					   pte_t pte,
> > +					   enum mmu_event event)
> >  {
> >  	if (mm_has_notifiers(mm))
> > -		__mmu_notifier_change_pte(mm, address, pte);
> > +		__mmu_notifier_change_pte(mm, address, pte, event);
> >  }
> >  
> >  static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
> > -					  unsigned long address)
> > +						unsigned long address,
> > +						enum mmu_event event)
> >  {
> >  	if (mm_has_notifiers(mm))
> > -		__mmu_notifier_invalidate_page(mm, address);
> > +		__mmu_notifier_invalidate_page(mm, address, event);
> >  }
> >  
> >  static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> > -				  unsigned long start, unsigned long end)
> > +						       unsigned long start,
> > +						       unsigned long end,
> > +						       enum mmu_event event)
> >  {
> >  	if (mm_has_notifiers(mm))
> > -		__mmu_notifier_invalidate_range_start(mm, start, end);
> > +		__mmu_notifier_invalidate_range_start(mm, start, end, event);
> >  }
> >  
> >  static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> > -				  unsigned long start, unsigned long end)
> > +						     unsigned long start,
> > +						     unsigned long end,
> > +						     enum mmu_event event)
> >  {
> >  	if (mm_has_notifiers(mm))
> > -		__mmu_notifier_invalidate_range_end(mm, start, end);
> > +		__mmu_notifier_invalidate_range_end(mm, start, end, event);
> >  }
> >  
> >  static inline void mmu_notifier_mm_init(struct mm_struct *mm)
> > @@ -278,13 +344,13 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
> >   * old page would remain mapped readonly in the secondary MMUs after the new
> >   * page is already writable by some CPU through the primary MMU.
> >   */
> > -#define set_pte_at_notify(__mm, __address, __ptep, __pte)		\
> > +#define set_pte_at_notify(__mm, __address, __ptep, __pte, __event)	\
> >  ({									\
> >  	struct mm_struct *___mm = __mm;					\
> >  	unsigned long ___address = __address;				\
> >  	pte_t ___pte = __pte;						\
> >  									\
> > -	mmu_notifier_change_pte(___mm, ___address, ___pte);		\
> > +	mmu_notifier_change_pte(___mm, ___address, ___pte, __event);	\
> >  	set_pte_at(___mm, ___address, __ptep, ___pte);			\
> >  })
> >  
> > @@ -307,22 +373,29 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
> >  }
> >  
> >  static inline void mmu_notifier_change_pte(struct mm_struct *mm,
> > -					   unsigned long address, pte_t pte)
> > +					   unsigned long address,
> > +					   pte_t pte,
> > +					   enum mmu_event event)
> >  {
> >  }
> >  
> >  static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
> > -					  unsigned long address)
> > +						unsigned long address,
> > +						enum mmu_event event)
> >  {
> >  }
> >  
> >  static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> > -				  unsigned long start, unsigned long end)
> > +						       unsigned long start,
> > +						       unsigned long end,
> > +						       enum mmu_event event)
> >  {
> >  }
> >  
> >  static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> > -				  unsigned long start, unsigned long end)
> > +						     unsigned long start,
> > +						     unsigned long end,
> > +						     enum mmu_event event)
> >  {
> >  }
> >  
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 32b04dc..296f81e 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -177,7 +177,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
> >  	/* For try_to_free_swap() and munlock_vma_page() below */
> >  	lock_page(page);
> >  
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +					    mmun_end, MMU_MIGRATE);
> >  	err = -EAGAIN;
> >  	ptep = page_check_address(page, mm, addr, &ptl, 0);
> >  	if (!ptep)
> > @@ -195,7 +196,9 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
> >  
> >  	flush_cache_page(vma, addr, pte_pfn(*ptep));
> >  	ptep_clear_flush(vma, addr, ptep);
> > -	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
> > +	set_pte_at_notify(mm, addr, ptep,
> > +			  mk_pte(kpage, vma->vm_page_prot),
> > +			  MMU_MIGRATE);
> >  
> >  	page_remove_rmap(page);
> >  	if (!page_mapped(page))
> > @@ -209,7 +212,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
> >  	err = 0;
> >   unlock:
> >  	mem_cgroup_cancel_charge(kpage, memcg);
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +					  mmun_end, MMU_MIGRATE);
> >  	unlock_page(page);
> >  	return err;
> >  }
> > diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
> > index d8d9fe3..a2b3f09 100644
> > --- a/mm/filemap_xip.c
> > +++ b/mm/filemap_xip.c
> > @@ -198,7 +198,7 @@ retry:
> >  			BUG_ON(pte_dirty(pteval));
> >  			pte_unmap_unlock(pte, ptl);
> >  			/* must invalidate_page _before_ freeing the page */
> > -			mmu_notifier_invalidate_page(mm, address);
> > +			mmu_notifier_invalidate_page(mm, address, MMU_MIGRATE);
> >  			page_cache_release(page);
> >  		}
> >  	}
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 5d562a9..fa30857 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1020,6 +1020,11 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
> >  		set_page_private(pages[i], (unsigned long)memcg);
> >  	}
> >  
> > +	mmun_start = haddr;
> > +	mmun_end   = haddr + HPAGE_PMD_SIZE;
> > +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +					    mmun_end, MMU_MIGRATE);
> > +
> >  	for (i = 0; i < HPAGE_PMD_NR; i++) {
> >  		copy_user_highpage(pages[i], page + i,
> >  				   haddr + PAGE_SIZE * i, vma);
> > @@ -1027,10 +1032,6 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
> >  		cond_resched();
> >  	}
> >  
> > -	mmun_start = haddr;
> > -	mmun_end   = haddr + HPAGE_PMD_SIZE;
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> > -
> >  	ptl = pmd_lock(mm, pmd);
> >  	if (unlikely(!pmd_same(*pmd, orig_pmd)))
> >  		goto out_free_pages;
> 
> So, that looks like you are fixing a pre-existing bug here? The 
> invalidate_range call is now happening *before* we copy pages. That seems 
> correct, although this is starting to get into code I'm less comfortable 
> with (huge pages).  But I think it's worth mentioning in the commit 
> message.

Yes, I should actually split the fix out of this patch; I will respin with
the fix as a preparatory patch to this one.

> 
> > @@ -1063,7 +1064,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
> >  	page_remove_rmap(page);
> >  	spin_unlock(ptl);
> >  
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +					  mmun_end, MMU_MIGRATE);
> >  
> >  	ret |= VM_FAULT_WRITE;
> >  	put_page(page);
> > @@ -1073,7 +1075,8 @@ out:
> >  
> >  out_free_pages:
> >  	spin_unlock(ptl);
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +					  mmun_end, MMU_MIGRATE);
> >  	for (i = 0; i < HPAGE_PMD_NR; i++) {
> >  		memcg = (void *)page_private(pages[i]);
> >  		set_page_private(pages[i], 0);
> > @@ -1157,16 +1160,17 @@ alloc:
> >  
> >  	count_vm_event(THP_FAULT_ALLOC);
> >  
> > +	mmun_start = haddr;
> > +	mmun_end   = haddr + HPAGE_PMD_SIZE;
> > +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +					    mmun_end, MMU_MIGRATE);
> > +
> >  	if (!page)
> >  		clear_huge_page(new_page, haddr, HPAGE_PMD_NR);
> >  	else
> >  		copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
> >  	__SetPageUptodate(new_page);
> >  
> > -	mmun_start = haddr;
> > -	mmun_end   = haddr + HPAGE_PMD_SIZE;
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> > -
> 
> Another bug fix, OK.
> 
> >  	spin_lock(ptl);
> >  	if (page)
> >  		put_user_huge_page(page);
> > @@ -1197,7 +1201,8 @@ alloc:
> >  	}
> >  	spin_unlock(ptl);
> >  out_mn:
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +					  mmun_end, MMU_MIGRATE);
> >  out:
> >  	return ret;
> >  out_unlock:
> > @@ -1632,7 +1637,8 @@ static int __split_huge_page_splitting(struct page *page,
> >  	const unsigned long mmun_start = address;
> >  	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
> >  
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +					    mmun_end, MMU_STATUS);
> 
> OK, just to be sure: we are not moving the page contents at this point,
> right? Just changing the page table from a single "huge" entry into lots
> of little 4K page entries? If so, then MMU_STATUS seems correct, but we
> should add that case to the "Event types" documentation above.

Yes correct.

> 
> >  	pmd = page_check_address_pmd(page, mm, address,
> >  			PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
> >  	if (pmd) {
> > @@ -1647,7 +1653,8 @@ static int __split_huge_page_splitting(struct page *page,
> >  		ret = 1;
> >  		spin_unlock(ptl);
> >  	}
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +					  mmun_end, MMU_STATUS);
> >  
> >  	return ret;
> >  }
> > @@ -2446,7 +2453,8 @@ static void collapse_huge_page(struct mm_struct *mm,
> >  
> >  	mmun_start = address;
> >  	mmun_end   = address + HPAGE_PMD_SIZE;
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +					    mmun_end, MMU_MIGRATE);
> >  	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> >  	/*
> >  	 * After this gup_fast can't run anymore. This also removes
> > @@ -2456,7 +2464,8 @@ static void collapse_huge_page(struct mm_struct *mm,
> >  	 */
> >  	_pmd = pmdp_clear_flush(vma, address, pmd);
> >  	spin_unlock(pmd_ptl);
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +					  mmun_end, MMU_MIGRATE);
> >  
> >  	spin_lock(pte_ptl);
> >  	isolated = __collapse_huge_page_isolate(vma, address, pte);
> > @@ -2845,24 +2854,28 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
> >  	mmun_start = haddr;
> >  	mmun_end   = haddr + HPAGE_PMD_SIZE;
> >  again:
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +					    mmun_end, MMU_MIGRATE);
> 
> Just checking: this is MMU_MIGRATE, instead of MMU_STATUS, because we are 
> actually moving data? (The pages backing the page table?)

Well, in truth I think this mmu_notifier call should change, as what happens
depends on the branch taken. I am guessing the calls were added there because
of the spinlock. I will just remove them, with an explanation, in a
preparatory patch to this one. The tricky part is the huge zero page case,
but I think it is fine to call the mmu_notifier after having updated the CPU
page table in that case.

> 
> >  	ptl = pmd_lock(mm, pmd);
> >  	if (unlikely(!pmd_trans_huge(*pmd))) {
> >  		spin_unlock(ptl);
> > -		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> > +		mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +						  mmun_end, MMU_MIGRATE);
> >  		return;
> >  	}
> >  	if (is_huge_zero_pmd(*pmd)) {
> >  		__split_huge_zero_page_pmd(vma, haddr, pmd);
> >  		spin_unlock(ptl);
> > -		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> > +		mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +						  mmun_end, MMU_MIGRATE);
> >  		return;
> >  	}
> >  	page = pmd_page(*pmd);
> >  	VM_BUG_ON_PAGE(!page_count(page), page);
> >  	get_page(page);
> >  	spin_unlock(ptl);
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +					  mmun_end, MMU_MIGRATE);
> >  
> >  	split_huge_page(page);
> >  
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 7faab71..73e1576 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -2565,7 +2565,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> >  	mmun_start = vma->vm_start;
> >  	mmun_end = vma->vm_end;
> >  	if (cow)
> > -		mmu_notifier_invalidate_range_start(src, mmun_start, mmun_end);
> > +		mmu_notifier_invalidate_range_start(src, mmun_start,
> > +						    mmun_end, MMU_MIGRATE);
> >  
> >  	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
> >  		spinlock_t *src_ptl, *dst_ptl;
> > @@ -2615,7 +2616,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> >  	}
> >  
> >  	if (cow)
> > -		mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end);
> > +		mmu_notifier_invalidate_range_end(src, mmun_start,
> > +						  mmun_end, MMU_MIGRATE);
> >  
> >  	return ret;
> >  }
> > @@ -2641,7 +2643,8 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >  	BUG_ON(end & ~huge_page_mask(h));
> >  
> >  	tlb_start_vma(tlb, vma);
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +					    mmun_end, MMU_MIGRATE);
> >  again:
> >  	for (address = start; address < end; address += sz) {
> >  		ptep = huge_pte_offset(mm, address);
> > @@ -2712,7 +2715,8 @@ unlock:
> >  		if (address < end && !ref_page)
> >  			goto again;
> >  	}
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +					  mmun_end, MMU_MIGRATE);
> >  	tlb_end_vma(tlb, vma);
> >  }
> >  
> > @@ -2899,7 +2903,8 @@ retry_avoidcopy:
> >  
> >  	mmun_start = address & huge_page_mask(h);
> >  	mmun_end = mmun_start + huge_page_size(h);
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +					    mmun_end, MMU_MIGRATE);
> >  	/*
> >  	 * Retake the page table lock to check for racing updates
> >  	 * before the page tables are altered
> > @@ -2919,7 +2924,8 @@ retry_avoidcopy:
> >  		new_page = old_page;
> >  	}
> >  	spin_unlock(ptl);
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +					  mmun_end, MMU_MIGRATE);
> >  	page_cache_release(new_page);
> >  	page_cache_release(old_page);
> >  
> > @@ -3344,7 +3350,8 @@ same_page:
> >  }
> >  
> >  unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> > -		unsigned long address, unsigned long end, pgprot_t newprot)
> > +		unsigned long address, unsigned long end, pgprot_t newprot,
> > +		enum mmu_event event)
> >  {
> >  	struct mm_struct *mm = vma->vm_mm;
> >  	unsigned long start = address;
> > @@ -3356,7 +3363,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> >  	BUG_ON(address >= end);
> >  	flush_cache_range(vma, address, end);
> >  
> > -	mmu_notifier_invalidate_range_start(mm, start, end);
> > +	mmu_notifier_invalidate_range_start(mm, start, end, event);
> >  	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
> >  	for (; address < end; address += huge_page_size(h)) {
> >  		spinlock_t *ptl;
> > @@ -3386,7 +3393,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> >  	 */
> >  	flush_tlb_range(vma, start, end);
> >  	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
> > -	mmu_notifier_invalidate_range_end(mm, start, end);
> > +	mmu_notifier_invalidate_range_end(mm, start, end, event);
> >  
> >  	return pages << h->order;
> >  }
> > diff --git a/mm/ksm.c b/mm/ksm.c
> > index cb1e976..4b659f1 100644
> > --- a/mm/ksm.c
> > +++ b/mm/ksm.c
> > @@ -873,7 +873,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
> >  
> >  	mmun_start = addr;
> >  	mmun_end   = addr + PAGE_SIZE;
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +					    mmun_end, MMU_MPROT_RONLY);
> >  
> >  	ptep = page_check_address(page, mm, addr, &ptl, 0);
> >  	if (!ptep)
> > @@ -905,7 +906,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
> >  		if (pte_dirty(entry))
> >  			set_page_dirty(page);
> >  		entry = pte_mkclean(pte_wrprotect(entry));
> > -		set_pte_at_notify(mm, addr, ptep, entry);
> > +		set_pte_at_notify(mm, addr, ptep, entry, MMU_MPROT_RONLY);
> >  	}
> >  	*orig_pte = *ptep;
> >  	err = 0;
> > @@ -913,7 +914,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
> >  out_unlock:
> >  	pte_unmap_unlock(ptep, ptl);
> >  out_mn:
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +					  mmun_end, MMU_MPROT_RONLY);
> >  out:
> >  	return err;
> >  }
> > @@ -949,7 +951,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
> >  
> >  	mmun_start = addr;
> >  	mmun_end   = addr + PAGE_SIZE;
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +					    mmun_end, MMU_MIGRATE);
> >  
> >  	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
> >  	if (!pte_same(*ptep, orig_pte)) {
> > @@ -962,7 +965,9 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
> >  
> >  	flush_cache_page(vma, addr, pte_pfn(*ptep));
> >  	ptep_clear_flush(vma, addr, ptep);
> > -	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
> > +	set_pte_at_notify(mm, addr, ptep,
> > +			  mk_pte(kpage, vma->vm_page_prot),
> > +			  MMU_MIGRATE);
> >  
> >  	page_remove_rmap(page);
> >  	if (!page_mapped(page))
> > @@ -972,7 +977,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
> >  	pte_unmap_unlock(ptep, ptl);
> >  	err = 0;
> >  out_mn:
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +					  mmun_end, MMU_MIGRATE);
> >  out:
> >  	return err;
> >  }
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 09e2cd0..d3908f0 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1050,7 +1050,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >  	mmun_end   = end;
> >  	if (is_cow)
> >  		mmu_notifier_invalidate_range_start(src_mm, mmun_start,
> > -						    mmun_end);
> > +						    mmun_end, MMU_MIGRATE);
> >  
> >  	ret = 0;
> >  	dst_pgd = pgd_offset(dst_mm, addr);
> > @@ -1067,7 +1067,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >  	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
> >  
> >  	if (is_cow)
> > -		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end);
> > +		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end,
> > +						  MMU_MIGRATE);
> >  	return ret;
> >  }
> >  
> > @@ -1371,10 +1372,12 @@ void unmap_vmas(struct mmu_gather *tlb,
> >  {
> >  	struct mm_struct *mm = vma->vm_mm;
> >  
> > -	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
> > +	mmu_notifier_invalidate_range_start(mm, start_addr,
> > +					    end_addr, MMU_MUNMAP);
> >  	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
> >  		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
> > -	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
> > +	mmu_notifier_invalidate_range_end(mm, start_addr,
> > +					  end_addr, MMU_MUNMAP);
> >  }
> >  
> >  /**
> > @@ -1396,10 +1399,10 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
> >  	lru_add_drain();
> >  	tlb_gather_mmu(&tlb, mm, start, end);
> >  	update_hiwater_rss(mm);
> > -	mmu_notifier_invalidate_range_start(mm, start, end);
> > +	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MUNMAP);
> >  	for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
> >  		unmap_single_vma(&tlb, vma, start, end, details);
> > -	mmu_notifier_invalidate_range_end(mm, start, end);
> > +	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MUNMAP);
> >  	tlb_finish_mmu(&tlb, start, end);
> >  }
> >  
> > @@ -1422,9 +1425,9 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
> >  	lru_add_drain();
> >  	tlb_gather_mmu(&tlb, mm, address, end);
> >  	update_hiwater_rss(mm);
> > -	mmu_notifier_invalidate_range_start(mm, address, end);
> > +	mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
> >  	unmap_single_vma(&tlb, vma, address, end, details);
> > -	mmu_notifier_invalidate_range_end(mm, address, end);
> > +	mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
> >  	tlb_finish_mmu(&tlb, address, end);
> >  }
> >  
> > @@ -2208,7 +2211,8 @@ gotten:
> >  
> >  	mmun_start  = address & PAGE_MASK;
> >  	mmun_end    = mmun_start + PAGE_SIZE;
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +					    mmun_end, MMU_MIGRATE);
> >  
> >  	/*
> >  	 * Re-check the pte - we dropped the lock
> > @@ -2240,7 +2244,7 @@ gotten:
> >  		 * mmu page tables (such as kvm shadow page tables), we want the
> >  		 * new page to be mapped directly into the secondary page table.
> >  		 */
> > -		set_pte_at_notify(mm, address, page_table, entry);
> > +		set_pte_at_notify(mm, address, page_table, entry, MMU_MIGRATE);
> >  		update_mmu_cache(vma, address, page_table);
> >  		if (old_page) {
> >  			/*
> > @@ -2279,7 +2283,8 @@ gotten:
> >  unlock:
> >  	pte_unmap_unlock(page_table, ptl);
> >  	if (mmun_end > mmun_start)
> > -		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> > +		mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +						  mmun_end, MMU_MIGRATE);
> >  	if (old_page) {
> >  		/*
> >  		 * Don't let another task, with possibly unlocked vma,
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index ab43fbf..b526c72 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -1820,12 +1820,14 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
> >  	WARN_ON(PageLRU(new_page));
> >  
> >  	/* Recheck the target PMD */
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +					    mmun_end, MMU_MIGRATE);
> >  	ptl = pmd_lock(mm, pmd);
> >  	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
> >  fail_putback:
> >  		spin_unlock(ptl);
> > -		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> > +		mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +						  mmun_end, MMU_MIGRATE);
> >  
> >  		/* Reverse changes made by migrate_page_copy() */
> >  		if (TestClearPageActive(new_page))
> > @@ -1878,7 +1880,8 @@ fail_putback:
> >  	page_remove_rmap(page);
> >  
> >  	spin_unlock(ptl);
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +					  mmun_end, MMU_MIGRATE);
> >  
> >  	/* Take an "isolate" reference and put new page on the LRU. */
> >  	get_page(new_page);
> > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> > index 41cefdf..9decb88 100644
> > --- a/mm/mmu_notifier.c
> > +++ b/mm/mmu_notifier.c
> > @@ -122,8 +122,10 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
> >  	return young;
> >  }
> >  
> > -void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
> > -			       pte_t pte)
> > +void __mmu_notifier_change_pte(struct mm_struct *mm,
> > +			       unsigned long address,
> > +			       pte_t pte,
> > +			       enum mmu_event event)
> >  {
> >  	struct mmu_notifier *mn;
> >  	int id;
> > @@ -131,13 +133,14 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
> >  	id = srcu_read_lock(&srcu);
> >  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
> >  		if (mn->ops->change_pte)
> > -			mn->ops->change_pte(mn, mm, address, pte);
> > +			mn->ops->change_pte(mn, mm, address, pte, event);
> >  	}
> >  	srcu_read_unlock(&srcu, id);
> >  }
> >  
> >  void __mmu_notifier_invalidate_page(struct mm_struct *mm,
> > -					  unsigned long address)
> > +				    unsigned long address,
> > +				    enum mmu_event event)
> >  {
> >  	struct mmu_notifier *mn;
> >  	int id;
> > @@ -145,13 +148,16 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
> >  	id = srcu_read_lock(&srcu);
> >  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
> >  		if (mn->ops->invalidate_page)
> > -			mn->ops->invalidate_page(mn, mm, address);
> > +			mn->ops->invalidate_page(mn, mm, address, event);
> >  	}
> >  	srcu_read_unlock(&srcu, id);
> >  }
> >  
> >  void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> > -				  unsigned long start, unsigned long end)
> > +					   unsigned long start,
> > +					   unsigned long end,
> > +					   enum mmu_event event)
> > +
> >  {
> >  	struct mmu_notifier *mn;
> >  	int id;
> > @@ -159,14 +165,17 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> >  	id = srcu_read_lock(&srcu);
> >  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
> >  		if (mn->ops->invalidate_range_start)
> > -			mn->ops->invalidate_range_start(mn, mm, start, end);
> > +			mn->ops->invalidate_range_start(mn, mm, start,
> > +							end, event);
> >  	}
> >  	srcu_read_unlock(&srcu, id);
> >  }
> >  EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
> >  
> >  void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> > -				  unsigned long start, unsigned long end)
> > +					 unsigned long start,
> > +					 unsigned long end,
> > +					 enum mmu_event event)
> >  {
> >  	struct mmu_notifier *mn;
> >  	int id;
> > @@ -174,7 +183,8 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> >  	id = srcu_read_lock(&srcu);
> >  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
> >  		if (mn->ops->invalidate_range_end)
> > -			mn->ops->invalidate_range_end(mn, mm, start, end);
> > +			mn->ops->invalidate_range_end(mn, mm, start,
> > +						      end, event);
> >  	}
> >  	srcu_read_unlock(&srcu, id);
> >  }
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index c43d557..6ce6c23 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -137,7 +137,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> >  
> >  static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> >  		pud_t *pud, unsigned long addr, unsigned long end,
> > -		pgprot_t newprot, int dirty_accountable, int prot_numa)
> > +		pgprot_t newprot, int dirty_accountable, int prot_numa,
> > +		enum mmu_event event)
> >  {
> >  	pmd_t *pmd;
> >  	struct mm_struct *mm = vma->vm_mm;
> > @@ -157,7 +158,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> >  		/* invoke the mmu notifier if the pmd is populated */
> >  		if (!mni_start) {
> >  			mni_start = addr;
> > -			mmu_notifier_invalidate_range_start(mm, mni_start, end);
> > +			mmu_notifier_invalidate_range_start(mm, mni_start,
> > +							    end, event);
> >  		}
> >  
> >  		if (pmd_trans_huge(*pmd)) {
> > @@ -185,7 +187,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> >  	} while (pmd++, addr = next, addr != end);
> >  
> >  	if (mni_start)
> > -		mmu_notifier_invalidate_range_end(mm, mni_start, end);
> > +		mmu_notifier_invalidate_range_end(mm, mni_start, end, event);
> >  
> >  	if (nr_huge_updates)
> >  		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
> > @@ -194,7 +196,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> >  
> >  static inline unsigned long change_pud_range(struct vm_area_struct *vma,
> >  		pgd_t *pgd, unsigned long addr, unsigned long end,
> > -		pgprot_t newprot, int dirty_accountable, int prot_numa)
> > +		pgprot_t newprot, int dirty_accountable, int prot_numa,
> > +		enum mmu_event event)
> >  {
> >  	pud_t *pud;
> >  	unsigned long next;
> > @@ -206,7 +209,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
> >  		if (pud_none_or_clear_bad(pud))
> >  			continue;
> >  		pages += change_pmd_range(vma, pud, addr, next, newprot,
> > -				 dirty_accountable, prot_numa);
> > +				 dirty_accountable, prot_numa, event);
> >  	} while (pud++, addr = next, addr != end);
> >  
> >  	return pages;
> > @@ -214,7 +217,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
> >  
> >  static unsigned long change_protection_range(struct vm_area_struct *vma,
> >  		unsigned long addr, unsigned long end, pgprot_t newprot,
> > -		int dirty_accountable, int prot_numa)
> > +		int dirty_accountable, int prot_numa, enum mmu_event event)
> >  {
> >  	struct mm_struct *mm = vma->vm_mm;
> >  	pgd_t *pgd;
> > @@ -231,7 +234,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
> >  		if (pgd_none_or_clear_bad(pgd))
> >  			continue;
> >  		pages += change_pud_range(vma, pgd, addr, next, newprot,
> > -				 dirty_accountable, prot_numa);
> > +				 dirty_accountable, prot_numa, event);
> >  	} while (pgd++, addr = next, addr != end);
> >  
> >  	/* Only flush the TLB if we actually modified any entries: */
> > @@ -247,11 +250,23 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
> >  		       int dirty_accountable, int prot_numa)
> >  {
> >  	unsigned long pages;
> > +	enum mmu_event event = MMU_MPROT_NONE;
> > +
> > +	/* At this points vm_flags is updated. */
> > +	if ((vma->vm_flags & VM_READ) && (vma->vm_flags & VM_WRITE))
> > +		event = MMU_MPROT_RANDW;
> > +	else if (vma->vm_flags & VM_WRITE)
> > +		event = MMU_MPROT_WONLY;
> > +	else if (vma->vm_flags & VM_READ)
> > +		event = MMU_MPROT_RONLY;
> 
> Hmmm, shouldn't we be checking against the newprot argument, instead of
> against vma->vm_flags?  The calling code, mprotect_fixup for example, can
> set flags *other* than VM_READ or VM_WRITE, and that could lead to a
> confusing or even inaccurate event. We could have a case where the event
> type is MMU_MPROT_RONLY, but the page might have been read-only the entire
> time, and some other flag was actually getting set.

As far as I can tell, the only way to end up here is if VM_READ or VM_WRITE
changes (either of them, or both, being set or cleared). It might be simpler
to use newprot, as the whole VM_ vs VM_MAY_ arithmetic always confuses me.
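
Either way, the open-coded tests could be pulled into a small helper so the
mapping from protection bits to event type lives in one place. A quick sketch
against the vm_flags-based logic currently in the patch (mprot_event() is a
made-up name; switching it over to inspect newprot would only change the two
flag tests):

static enum mmu_event mprot_event(unsigned long vm_flags)
{
	/* Caller has already updated vm_flags to the new protection. */
	if ((vm_flags & (VM_READ | VM_WRITE)) == (VM_READ | VM_WRITE))
		return MMU_MPROT_RANDW;
	if (vm_flags & VM_WRITE)
		return MMU_MPROT_WONLY;
	if (vm_flags & VM_READ)
		return MMU_MPROT_RONLY;
	return MMU_MPROT_NONE;
}

change_protection() would then just do event = mprot_event(vma->vm_flags)
before handing off to hugetlb_change_protection() or
change_protection_range().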

> 
> I'm also starting to wonder if this event is adding much value here (for 
> protection changes), if the newprot argument contains the same 
> information. Although it is important to have a unified sort of reporting 
> system for HMM, so that's probably good enough reason to do this.
> 
> >  
> >  	if (is_vm_hugetlb_page(vma))
> > -		pages = hugetlb_change_protection(vma, start, end, newprot);
> > +		pages = hugetlb_change_protection(vma, start, end,
> > +						  newprot, event);
> >  	else
> > -		pages = change_protection_range(vma, start, end, newprot, dirty_accountable, prot_numa);
> > +		pages = change_protection_range(vma, start, end, newprot,
> > +						dirty_accountable,
> > +						prot_numa, event);
> >  
> >  	return pages;
> >  }
> > diff --git a/mm/mremap.c b/mm/mremap.c
> > index 05f1180..6827d2f 100644
> > --- a/mm/mremap.c
> > +++ b/mm/mremap.c
> > @@ -177,7 +177,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> >  
> >  	mmun_start = old_addr;
> >  	mmun_end   = old_end;
> > -	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start,
> > +					    mmun_end, MMU_MIGRATE);
> >  
> >  	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
> >  		cond_resched();
> > @@ -228,7 +229,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> >  	if (likely(need_flush))
> >  		flush_tlb_range(vma, old_end-len, old_addr);
> >  
> > -	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start,
> > +					  mmun_end, MMU_MIGRATE);
> >  
> >  	return len + old_addr - old_end;	/* how much done */
> >  }
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 7928ddd..bd7e6d7 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -840,7 +840,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
> >  	pte_unmap_unlock(pte, ptl);
> >  
> >  	if (ret) {
> > -		mmu_notifier_invalidate_page(mm, address);
> > +		mmu_notifier_invalidate_page(mm, address, MMU_WB);
> >  		(*cleaned)++;
> >  	}
> >  out:
> > @@ -1128,6 +1128,10 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >  	spinlock_t *ptl;
> >  	int ret = SWAP_AGAIN;
> >  	enum ttu_flags flags = (enum ttu_flags)arg;
> > +	enum mmu_event event = MMU_MIGRATE;
> > +
> > +	if (flags & TTU_MUNLOCK)
> > +		event = MMU_STATUS;
> >  
> >  	pte = page_check_address(page, mm, address, &ptl, 0);
> >  	if (!pte)
> > @@ -1233,7 +1237,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >  out_unmap:
> >  	pte_unmap_unlock(pte, ptl);
> >  	if (ret != SWAP_FAIL && !(flags & TTU_MUNLOCK))
> > -		mmu_notifier_invalidate_page(mm, address);
> > +		mmu_notifier_invalidate_page(mm, address, event);
> >  out:
> >  	return ret;
> >  
> > @@ -1287,7 +1291,9 @@ out_mlock:
> >  #define CLUSTER_MASK	(~(CLUSTER_SIZE - 1))
> >  
> >  static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
> > -		struct vm_area_struct *vma, struct page *check_page)
> > +				struct vm_area_struct *vma,
> > +				struct page *check_page,
> > +				enum ttu_flags flags)
> >  {
> >  	struct mm_struct *mm = vma->vm_mm;
> >  	pmd_t *pmd;
> > @@ -1301,6 +1307,10 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
> >  	unsigned long end;
> >  	int ret = SWAP_AGAIN;
> >  	int locked_vma = 0;
> > +	enum mmu_event event = MMU_MIGRATE;
> > +
> > +	if (flags & TTU_MUNLOCK)
> > +		event = MMU_STATUS;
> >  
> >  	address = (vma->vm_start + cursor) & CLUSTER_MASK;
> >  	end = address + CLUSTER_SIZE;
> > @@ -1315,7 +1325,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
> >  
> >  	mmun_start = address;
> >  	mmun_end   = end;
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, event);
> >  
> >  	/*
> >  	 * If we can acquire the mmap_sem for read, and vma is VM_LOCKED,
> > @@ -1380,7 +1390,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
> >  		(*mapcount)--;
> >  	}
> >  	pte_unmap_unlock(pte - 1, ptl);
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, event);
> >  	if (locked_vma)
> >  		up_read(&vma->vm_mm->mmap_sem);
> >  	return ret;
> > @@ -1436,7 +1446,9 @@ static int try_to_unmap_nonlinear(struct page *page,
> >  			while (cursor < max_nl_cursor &&
> >  				cursor < vma->vm_end - vma->vm_start) {
> >  				if (try_to_unmap_cluster(cursor, &mapcount,
> > -						vma, page) == SWAP_MLOCK)
> > +							 vma, page,
> > +							 (enum ttu_flags)arg)
> > +							 == SWAP_MLOCK)
> >  					ret = SWAP_MLOCK;
> >  				cursor += CLUSTER_SIZE;
> >  				vma->vm_private_data = (void *) cursor;
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 4b6c01b..6e1992f 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -262,7 +262,8 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
> >  
> >  static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
> >  					     struct mm_struct *mm,
> > -					     unsigned long address)
> > +					     unsigned long address,
> > +					     enum mmu_event event)
> >  {
> >  	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> >  	int need_tlb_flush, idx;
> > @@ -301,7 +302,8 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
> >  static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> >  					struct mm_struct *mm,
> >  					unsigned long address,
> > -					pte_t pte)
> > +					pte_t pte,
> > +					enum mmu_event event)
> >  {
> >  	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> >  	int idx;
> > @@ -317,7 +319,8 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> >  static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> >  						    struct mm_struct *mm,
> >  						    unsigned long start,
> > -						    unsigned long end)
> > +						    unsigned long end,
> > +						    enum mmu_event event)
> >  {
> >  	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> >  	int need_tlb_flush = 0, idx;
> > @@ -343,7 +346,8 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> >  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> >  						  struct mm_struct *mm,
> >  						  unsigned long start,
> > -						  unsigned long end)
> > +						  unsigned long end,
> > +						  enum mmu_event event)
> >  {
> >  	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> >  
> > -- 
> > 1.9.0
> > 
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majordomo@kvack.org.  For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> > 
> 
> thanks,
> John H.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 3/6] mmu_notifier: add event information to address invalidation v2
@ 2014-06-30 15:57       ` Jerome Glisse
  0 siblings, 0 replies; 76+ messages in thread
From: Jerome Glisse @ 2014-06-30 15:57 UTC (permalink / raw)
  To: John Hubbard
  Cc: akpm, linux-mm, linux-kernel, mgorman, hpa, peterz, aarcange,
	riel, jweiner, torvalds, Mark Hairgrove, Jatin Kumar,
	Subhash Gutti, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Sherry Cheung, Duncan Poole, Oded Gabbay,
	Alexander Deucher, Andrew Lewycky, Jérôme Glisse

On Sun, Jun 29, 2014 at 10:22:57PM -0700, John Hubbard wrote:
> On Fri, 27 Jun 2014, Jerome Glisse wrote:
> 
> > From: Jerome Glisse <jglisse@redhat.com>
> > 
> > The event information will be useful for new users of the mmu_notifier API.
> > The event argument differentiates between a vma disappearing, a page
> > being write protected, or simply a page being unmapped. This allows a new
> > user to take a different path for each event: for instance, on unmap the
> > resources used to track a vma are still valid and should stay around,
> > while if the event says that a vma is being destroyed, any resources used
> > to track this vma can be freed.
> > 
> > Changed since v1:
> >   - renamed action into event (updated commit message too).
> >   - simplified the event names and clarified their intended usage,
> >     also documenting what expectations the listener can have with
> >     respect to each event.
> > 
> > Signed-off-by: Jerome Glisse <jglisse@redhat.com>
> > ---
> >  drivers/gpu/drm/i915/i915_gem_userptr.c |   3 +-
> >  drivers/iommu/amd_iommu_v2.c            |  14 ++--
> >  drivers/misc/sgi-gru/grutlbpurge.c      |   9 ++-
> >  drivers/xen/gntdev.c                    |   9 ++-
> >  fs/proc/task_mmu.c                      |   6 +-
> >  include/linux/hugetlb.h                 |   7 +-
> >  include/linux/mmu_notifier.h            | 117 ++++++++++++++++++++++++++------
> >  kernel/events/uprobes.c                 |  10 ++-
> >  mm/filemap_xip.c                        |   2 +-
> >  mm/huge_memory.c                        |  51 ++++++++------
> >  mm/hugetlb.c                            |  25 ++++---
> >  mm/ksm.c                                |  18 +++--
> >  mm/memory.c                             |  27 +++++---
> >  mm/migrate.c                            |   9 ++-
> >  mm/mmu_notifier.c                       |  28 +++++---
> >  mm/mprotect.c                           |  33 ++++++---
> >  mm/mremap.c                             |   6 +-
> >  mm/rmap.c                               |  24 +++++--
> >  virt/kvm/kvm_main.c                     |  12 ++--
> >  19 files changed, 291 insertions(+), 119 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > index 21ea928..ed6f35e 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > @@ -56,7 +56,8 @@ struct i915_mmu_object {
> >  static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> >  						       struct mm_struct *mm,
> >  						       unsigned long start,
> > -						       unsigned long end)
> > +						       unsigned long end,
> > +						       enum mmu_event event)
> >  {
> >  	struct i915_mmu_notifier *mn = container_of(_mn, struct i915_mmu_notifier, mn);
> >  	struct interval_tree_node *it = NULL;
> > diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
> > index 499b436..2bb9771 100644
> > --- a/drivers/iommu/amd_iommu_v2.c
> > +++ b/drivers/iommu/amd_iommu_v2.c
> > @@ -414,21 +414,25 @@ static int mn_clear_flush_young(struct mmu_notifier *mn,
> >  static void mn_change_pte(struct mmu_notifier *mn,
> >  			  struct mm_struct *mm,
> >  			  unsigned long address,
> > -			  pte_t pte)
> > +			  pte_t pte,
> > +			  enum mmu_event event)
> >  {
> >  	__mn_flush_page(mn, address);
> >  }
> >  
> >  static void mn_invalidate_page(struct mmu_notifier *mn,
> >  			       struct mm_struct *mm,
> > -			       unsigned long address)
> > +			       unsigned long address,
> > +			       enum mmu_event event)
> >  {
> >  	__mn_flush_page(mn, address);
> >  }
> >  
> >  static void mn_invalidate_range_start(struct mmu_notifier *mn,
> >  				      struct mm_struct *mm,
> > -				      unsigned long start, unsigned long end)
> > +				      unsigned long start,
> > +				      unsigned long end,
> > +				      enum mmu_event event)
> >  {
> >  	struct pasid_state *pasid_state;
> >  	struct device_state *dev_state;
> > @@ -449,7 +453,9 @@ static void mn_invalidate_range_start(struct mmu_notifier *mn,
> >  
> >  static void mn_invalidate_range_end(struct mmu_notifier *mn,
> >  				    struct mm_struct *mm,
> > -				    unsigned long start, unsigned long end)
> > +				    unsigned long start,
> > +				    unsigned long end,
> > +				    enum mmu_event event)
> >  {
> >  	struct pasid_state *pasid_state;
> >  	struct device_state *dev_state;
> > diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
> > index 2129274..e67fed1 100644
> > --- a/drivers/misc/sgi-gru/grutlbpurge.c
> > +++ b/drivers/misc/sgi-gru/grutlbpurge.c
> > @@ -221,7 +221,8 @@ void gru_flush_all_tlb(struct gru_state *gru)
> >   */
> >  static void gru_invalidate_range_start(struct mmu_notifier *mn,
> >  				       struct mm_struct *mm,
> > -				       unsigned long start, unsigned long end)
> > +				       unsigned long start, unsigned long end,
> > +				       enum mmu_event event)
> >  {
> >  	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
> >  						 ms_notifier);
> > @@ -235,7 +236,8 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
> >  
> >  static void gru_invalidate_range_end(struct mmu_notifier *mn,
> >  				     struct mm_struct *mm, unsigned long start,
> > -				     unsigned long end)
> > +				     unsigned long end,
> > +				     enum mmu_event event)
> >  {
> >  	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
> >  						 ms_notifier);
> > @@ -248,7 +250,8 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
> >  }
> >  
> >  static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
> > -				unsigned long address)
> > +				unsigned long address,
> > +				enum mmu_event event)
> >  {
> >  	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
> >  						 ms_notifier);
> > diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
> > index 073b4a1..fe9da94 100644
> > --- a/drivers/xen/gntdev.c
> > +++ b/drivers/xen/gntdev.c
> > @@ -428,7 +428,9 @@ static void unmap_if_in_range(struct grant_map *map,
> >  
> >  static void mn_invl_range_start(struct mmu_notifier *mn,
> >  				struct mm_struct *mm,
> > -				unsigned long start, unsigned long end)
> > +				unsigned long start,
> > +				unsigned long end,
> > +				enum mmu_event event)
> >  {
> >  	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
> >  	struct grant_map *map;
> > @@ -445,9 +447,10 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
> >  
> >  static void mn_invl_page(struct mmu_notifier *mn,
> >  			 struct mm_struct *mm,
> > -			 unsigned long address)
> > +			 unsigned long address,
> > +			 enum mmu_event event)
> >  {
> > -	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE);
> > +	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, event);
> >  }
> >  
> >  static void mn_release(struct mmu_notifier *mn,
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index cfa63ee..e9e79f7 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -830,7 +830,8 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> >  		};
> >  		down_read(&mm->mmap_sem);
> >  		if (type == CLEAR_REFS_SOFT_DIRTY)
> > -			mmu_notifier_invalidate_range_start(mm, 0, -1);
> > +			mmu_notifier_invalidate_range_start(mm, 0,
> > +							    -1, MMU_STATUS);
> >  		for (vma = mm->mmap; vma; vma = vma->vm_next) {
> >  			cp.vma = vma;
> >  			if (is_vm_hugetlb_page(vma))
> > @@ -858,7 +859,8 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> >  					&clear_refs_walk);
> >  		}
> >  		if (type == CLEAR_REFS_SOFT_DIRTY)
> > -			mmu_notifier_invalidate_range_end(mm, 0, -1);
> > +			mmu_notifier_invalidate_range_end(mm, 0,
> > +							  -1, MMU_STATUS);
> >  		flush_tlb_mm(mm);
> >  		up_read(&mm->mmap_sem);
> >  		mmput(mm);
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 6a836ef..d7e512f 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -6,6 +6,7 @@
> >  #include <linux/fs.h>
> >  #include <linux/hugetlb_inline.h>
> >  #include <linux/cgroup.h>
> > +#include <linux/mmu_notifier.h>
> >  #include <linux/list.h>
> >  #include <linux/kref.h>
> >  
> > @@ -103,7 +104,8 @@ struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
> >  int pmd_huge(pmd_t pmd);
> >  int pud_huge(pud_t pmd);
> >  unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> > -		unsigned long address, unsigned long end, pgprot_t newprot);
> > +		unsigned long address, unsigned long end, pgprot_t newprot,
> > +		enum mmu_event event);
> >  
> >  #else /* !CONFIG_HUGETLB_PAGE */
> >  
> > @@ -148,7 +150,8 @@ static inline bool isolate_huge_page(struct page *page, struct list_head *list)
> >  #define is_hugepage_active(x)	false
> >  
> >  static inline unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> > -		unsigned long address, unsigned long end, pgprot_t newprot)
> > +		unsigned long address, unsigned long end, pgprot_t newprot,
> > +		enum mmu_event event)
> >  {
> >  	return 0;
> >  }
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index deca874..82e9577 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -9,6 +9,52 @@
> >  struct mmu_notifier;
> >  struct mmu_notifier_ops;
> >  
> > +/* Event report finer informations to the callback allowing the event listener
> > + * to take better action. There are only few kinds of events :
> > + *
> > + *   - MMU_MIGRATE memory is migrating from one page to another thus all write
> > + *     access must stop after invalidate_range_start callback returns. And no
> > + *     read access should be allowed either as new page can be remapped with
> > + *     write access before the invalidate_range_end callback happen and thus
> > + *     any read access to old page might access outdated informations. Several
> > + *     source to this event like page moving to swap (for various reasons like
> > + *     page reclaim), outcome of mremap syscall, migration for numa reasons,
> > + *     balancing memory pool, write fault on read only page trigger a new page
> > + *     to be allocated and used, ...
> > + *   - MMU_MPROT_NONE memory access protection is change, no page in the range
> > + *     can be accessed in either read or write mode but the range of address
> > + *     is still valid. All access are still fine until invalidate_range_end
> > + *     callback returns.
> > + *   - MMU_MPROT_RONLY memory access proctection is changing to read only.
> > + *     All access are still fine until invalidate_range_end callback returns.
> > + *   - MMU_MPROT_RANDW memory access proctection is changing to read an write.
> > + *     All access are still fine until invalidate_range_end callback returns.
> > + *   - MMU_MPROT_WONLY memory access proctection is changing to write only.
> > + *     All access are still fine until invalidate_range_end callback returns.
> > + *   - MMU_MUNMAP the range is being unmaped (outcome of a munmap syscall). It
> > + *     is fine to still have read/write access until the invalidate_range_end
> > + *     callback returns. This also imply that secondary page table can be trim
> > + *     as the address range is no longer valid.
> > + *   - MMU_WB memory is being write back to disk, all write access must stop
> > + *     after invalidate_range_start callback returns. Read access are still
> > + *     allowed.
> > + *   - MMU_STATUS memory status change, like soft dirty.
> > + *
> > + * In doubt when adding a new notifier caller use MMU_MIGRATE it will always
> > + * result in expected behavior but will not allow listener a chance to optimize
> > + * its events.
> > + */
> 
> Here is a pass at tightening up that documentation:
> 
> /* MMU Events report fine-grained information to the callback routine, allowing
>  * the event listener to make a more informed decision as to what action to
>  * take. The event types are:
>  *
>  *   - MMU_MIGRATE: memory is migrating from one page to another, thus all write
>  *     access must stop after invalidate_range_start callback returns.
>  *     Furthermore, no read access should be allowed either, as a new page can
>  *     be remapped with write access before the invalidate_range_end callback
>  *     happens and thus any read access to old page might read stale data. There
>  *     are several sources for this event, including:
>  *
>  *         - A page moving to swap (for various reasons, including page
>  *           reclaim),
>  *         - An mremap syscall,
>  *         - migration for NUMA reasons,
>  *         - balancing the memory pool,
>  *         - write fault on a read-only page triggers a new page to be allocated
>  *           and used,
>  *         - and more that are not listed here.
>  *
>  *   - MMU_MPROT_NONE: memory access protection is changing to "none": no page
>  *     in the range can be accessed in either read or write mode but the range
>  *     of addresses is still valid. However, access is still allowed, up until
>  *     invalidate_range_end callback returns.
>  *
>  *   - MMU_MPROT_RONLY: memory access protection is changing to read only.
>  *     However, access is still allowed, up until invalidate_range_end callback
>  *     returns.
>  *
>  *   - MMU_MPROT_RANDW: memory access protection is changing to read and write.
>  *     However, access is still allowed, up until invalidate_range_end callback
>  *     returns.
>  *
>  *   - MMU_MPROT_WONLY: memory access protection is changing to write only.
>  *     However, access is still allowed, up until invalidate_range_end callback
>  *     returns.
>  *
>  *   - MMU_MUNMAP: the range is being unmapped (outcome of a munmap syscall).
>  *     However, access is still allowed, up until invalidate_range_end callback
>  *     returns. This also implies that the secondary page table can be trimmed,
>  *     because the address range is no longer valid.
>  *
>  *   - MMU_WB: memory is being written back to disk, all write accesses must
>  *     stop after invalidate_range_start callback returns. Read access are still
>  *     allowed.
>  *
>  *   - MMU_STATUS: memory status change, like soft dirty, or huge page
>  *     splitting (in place).
>  *
>  * If in doubt when adding a new notifier caller, please use MMU_MIGRATE,
>  * because it will always lead to reasonable behavior, but will not allow the
>  * listener a chance to optimize its events.
>  */
> 
> Mostly just cleaning up the wording, except that I did add "huge page 
> splitting" to the cases that could cause an MMU_STATUS to fire.
>

Yes, your wording is better than mine. The huge page case is a bit of a
weird one, as it does not always reuse the same page (THP vs hugetlbfs).
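
To make the intent more concrete, here is a minimal sketch of how a listener
could use the event in its invalidate_range_start() callback. The my_mirror_*
names are made up for illustration, this is not from any real driver:

static void my_mirror_invalidate_range_start(struct mmu_notifier *mn,
                                             struct mm_struct *mm,
                                             unsigned long start,
                                             unsigned long end,
                                             enum mmu_event event)
{
        struct my_mirror *mirror = container_of(mn, struct my_mirror, mn);

        switch (event) {
        case MMU_STATUS:
                /* Only status bits (soft dirty, ...) change, nothing needs
                 * to be unmapped on the device side in this sketch. */
                return;
        case MMU_MPROT_RONLY:
        case MMU_WB:
                /* Write access must stop, read mappings can stay. */
                my_mirror_write_protect(mirror, start, end);
                return;
        default:
                /* MMU_MIGRATE, MMU_MUNMAP, ... : drop the whole range. */
                my_mirror_unmap(mirror, start, end);
                return;
        }
}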

> > +enum mmu_event {
> > +	MMU_MIGRATE = 0,
> > +	MMU_MPROT_NONE,
> > +	MMU_MPROT_RONLY,
> > +	MMU_MPROT_RANDW,
> > +	MMU_MPROT_WONLY,
> > +	MMU_MUNMAP,
> > +	MMU_STATUS,
> > +	MMU_WB,
> > +};
> > +
> >  #ifdef CONFIG_MMU_NOTIFIER
> >  
> >  /*
> > @@ -79,7 +125,8 @@ struct mmu_notifier_ops {
> >  	void (*change_pte)(struct mmu_notifier *mn,
> >  			   struct mm_struct *mm,
> >  			   unsigned long address,
> > -			   pte_t pte);
> > +			   pte_t pte,
> > +			   enum mmu_event event);
> >  
> >  	/*
> >  	 * Before this is invoked any secondary MMU is still ok to
> > @@ -90,7 +137,8 @@ struct mmu_notifier_ops {
> >  	 */
> >  	void (*invalidate_page)(struct mmu_notifier *mn,
> >  				struct mm_struct *mm,
> > -				unsigned long address);
> > +				unsigned long address,
> > +				enum mmu_event event);
> >  
> >  	/*
> >  	 * invalidate_range_start() and invalidate_range_end() must be
> > @@ -137,10 +185,14 @@ struct mmu_notifier_ops {
> >  	 */
> >  	void (*invalidate_range_start)(struct mmu_notifier *mn,
> >  				       struct mm_struct *mm,
> > -				       unsigned long start, unsigned long end);
> > +				       unsigned long start,
> > +				       unsigned long end,
> > +				       enum mmu_event event);
> >  	void (*invalidate_range_end)(struct mmu_notifier *mn,
> >  				     struct mm_struct *mm,
> > -				     unsigned long start, unsigned long end);
> > +				     unsigned long start,
> > +				     unsigned long end,
> > +				     enum mmu_event event);
> >  };
> >  
> >  /*
> > @@ -177,13 +229,20 @@ extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
> >  extern int __mmu_notifier_test_young(struct mm_struct *mm,
> >  				     unsigned long address);
> >  extern void __mmu_notifier_change_pte(struct mm_struct *mm,
> > -				      unsigned long address, pte_t pte);
> > +				      unsigned long address,
> > +				      pte_t pte,
> > +				      enum mmu_event event);
> >  extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
> > -					  unsigned long address);
> > +					  unsigned long address,
> > +					  enum mmu_event event);
> >  extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> > -				  unsigned long start, unsigned long end);
> > +						  unsigned long start,
> > +						  unsigned long end,
> > +						  enum mmu_event event);
> >  extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> > -				  unsigned long start, unsigned long end);
> > +						unsigned long start,
> > +						unsigned long end,
> > +						enum mmu_event event);
> >  
> >  static inline void mmu_notifier_release(struct mm_struct *mm)
> >  {
> > @@ -208,31 +267,38 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
> >  }
> >  
> >  static inline void mmu_notifier_change_pte(struct mm_struct *mm,
> > -					   unsigned long address, pte_t pte)
> > +					   unsigned long address,
> > +					   pte_t pte,
> > +					   enum mmu_event event)
> >  {
> >  	if (mm_has_notifiers(mm))
> > -		__mmu_notifier_change_pte(mm, address, pte);
> > +		__mmu_notifier_change_pte(mm, address, pte, event);
> >  }
> >  
> >  static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
> > -					  unsigned long address)
> > +						unsigned long address,
> > +						enum mmu_event event)
> >  {
> >  	if (mm_has_notifiers(mm))
> > -		__mmu_notifier_invalidate_page(mm, address);
> > +		__mmu_notifier_invalidate_page(mm, address, event);
> >  }
> >  
> >  static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> > -				  unsigned long start, unsigned long end)
> > +						       unsigned long start,
> > +						       unsigned long end,
> > +						       enum mmu_event event)
> >  {
> >  	if (mm_has_notifiers(mm))
> > -		__mmu_notifier_invalidate_range_start(mm, start, end);
> > +		__mmu_notifier_invalidate_range_start(mm, start, end, event);
> >  }
> >  
> >  static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> > -				  unsigned long start, unsigned long end)
> > +						     unsigned long start,
> > +						     unsigned long end,
> > +						     enum mmu_event event)
> >  {
> >  	if (mm_has_notifiers(mm))
> > -		__mmu_notifier_invalidate_range_end(mm, start, end);
> > +		__mmu_notifier_invalidate_range_end(mm, start, end, event);
> >  }
> >  
> >  static inline void mmu_notifier_mm_init(struct mm_struct *mm)
> > @@ -278,13 +344,13 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
> >   * old page would remain mapped readonly in the secondary MMUs after the new
> >   * page is already writable by some CPU through the primary MMU.
> >   */
> > -#define set_pte_at_notify(__mm, __address, __ptep, __pte)		\
> > +#define set_pte_at_notify(__mm, __address, __ptep, __pte, __event)	\
> >  ({									\
> >  	struct mm_struct *___mm = __mm;					\
> >  	unsigned long ___address = __address;				\
> >  	pte_t ___pte = __pte;						\
> >  									\
> > -	mmu_notifier_change_pte(___mm, ___address, ___pte);		\
> > +	mmu_notifier_change_pte(___mm, ___address, ___pte, __event);	\
> >  	set_pte_at(___mm, ___address, __ptep, ___pte);			\
> >  })
> >  
> > @@ -307,22 +373,29 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
> >  }
> >  
> >  static inline void mmu_notifier_change_pte(struct mm_struct *mm,
> > -					   unsigned long address, pte_t pte)
> > +					   unsigned long address,
> > +					   pte_t pte,
> > +					   enum mmu_event event)
> >  {
> >  }
> >  
> >  static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
> > -					  unsigned long address)
> > +						unsigned long address,
> > +						enum mmu_event event)
> >  {
> >  }
> >  
> >  static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> > -				  unsigned long start, unsigned long end)
> > +						       unsigned long start,
> > +						       unsigned long end,
> > +						       enum mmu_event event)
> >  {
> >  }
> >  
> >  static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> > -				  unsigned long start, unsigned long end)
> > +						     unsigned long start,
> > +						     unsigned long end,
> > +						     enum mmu_event event)
> >  {
> >  }
> >  
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 32b04dc..296f81e 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -177,7 +177,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
> >  	/* For try_to_free_swap() and munlock_vma_page() below */
> >  	lock_page(page);
> >  
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +					    mmun_end, MMU_MIGRATE);
> >  	err = -EAGAIN;
> >  	ptep = page_check_address(page, mm, addr, &ptl, 0);
> >  	if (!ptep)
> > @@ -195,7 +196,9 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
> >  
> >  	flush_cache_page(vma, addr, pte_pfn(*ptep));
> >  	ptep_clear_flush(vma, addr, ptep);
> > -	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
> > +	set_pte_at_notify(mm, addr, ptep,
> > +			  mk_pte(kpage, vma->vm_page_prot),
> > +			  MMU_MIGRATE);
> >  
> >  	page_remove_rmap(page);
> >  	if (!page_mapped(page))
> > @@ -209,7 +212,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
> >  	err = 0;
> >   unlock:
> >  	mem_cgroup_cancel_charge(kpage, memcg);
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +					  mmun_end, MMU_MIGRATE);
> >  	unlock_page(page);
> >  	return err;
> >  }
> > diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
> > index d8d9fe3..a2b3f09 100644
> > --- a/mm/filemap_xip.c
> > +++ b/mm/filemap_xip.c
> > @@ -198,7 +198,7 @@ retry:
> >  			BUG_ON(pte_dirty(pteval));
> >  			pte_unmap_unlock(pte, ptl);
> >  			/* must invalidate_page _before_ freeing the page */
> > -			mmu_notifier_invalidate_page(mm, address);
> > +			mmu_notifier_invalidate_page(mm, address, MMU_MIGRATE);
> >  			page_cache_release(page);
> >  		}
> >  	}
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 5d562a9..fa30857 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1020,6 +1020,11 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
> >  		set_page_private(pages[i], (unsigned long)memcg);
> >  	}
> >  
> > +	mmun_start = haddr;
> > +	mmun_end   = haddr + HPAGE_PMD_SIZE;
> > +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +					    mmun_end, MMU_MIGRATE);
> > +
> >  	for (i = 0; i < HPAGE_PMD_NR; i++) {
> >  		copy_user_highpage(pages[i], page + i,
> >  				   haddr + PAGE_SIZE * i, vma);
> > @@ -1027,10 +1032,6 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
> >  		cond_resched();
> >  	}
> >  
> > -	mmun_start = haddr;
> > -	mmun_end   = haddr + HPAGE_PMD_SIZE;
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> > -
> >  	ptl = pmd_lock(mm, pmd);
> >  	if (unlikely(!pmd_same(*pmd, orig_pmd)))
> >  		goto out_free_pages;
> 
> So, that looks like you are fixing a pre-existing bug here? The 
> invalidate_range call is now happening *before* we copy pages. That seems 
> correct, although this is starting to get into code I'm less comfortable 
> with (huge pages).  But I think it's worth mentioning in the commit 
> message.

Yes, I should actually split the fix out of this patch. I will respin with
the fix as a preparatory patch to this one.

> 
> > @@ -1063,7 +1064,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
> >  	page_remove_rmap(page);
> >  	spin_unlock(ptl);
> >  
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +					  mmun_end, MMU_MIGRATE);
> >  
> >  	ret |= VM_FAULT_WRITE;
> >  	put_page(page);
> > @@ -1073,7 +1075,8 @@ out:
> >  
> >  out_free_pages:
> >  	spin_unlock(ptl);
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +					  mmun_end, MMU_MIGRATE);
> >  	for (i = 0; i < HPAGE_PMD_NR; i++) {
> >  		memcg = (void *)page_private(pages[i]);
> >  		set_page_private(pages[i], 0);
> > @@ -1157,16 +1160,17 @@ alloc:
> >  
> >  	count_vm_event(THP_FAULT_ALLOC);
> >  
> > +	mmun_start = haddr;
> > +	mmun_end   = haddr + HPAGE_PMD_SIZE;
> > +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +					    mmun_end, MMU_MIGRATE);
> > +
> >  	if (!page)
> >  		clear_huge_page(new_page, haddr, HPAGE_PMD_NR);
> >  	else
> >  		copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
> >  	__SetPageUptodate(new_page);
> >  
> > -	mmun_start = haddr;
> > -	mmun_end   = haddr + HPAGE_PMD_SIZE;
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> > -
> 
> Another bug fix, OK.
> 
> >  	spin_lock(ptl);
> >  	if (page)
> >  		put_user_huge_page(page);
> > @@ -1197,7 +1201,8 @@ alloc:
> >  	}
> >  	spin_unlock(ptl);
> >  out_mn:
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +					  mmun_end, MMU_MIGRATE);
> >  out:
> >  	return ret;
> >  out_unlock:
> > @@ -1632,7 +1637,8 @@ static int __split_huge_page_splitting(struct page *page,
> >  	const unsigned long mmun_start = address;
> >  	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
> >  
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +					    mmun_end, MMU_STATUS);
> 
> OK, just to be sure: we are not moving the page contents at this point, 
> right? Just changing the page table from a single "huge" entry, into lots 
> of little 4K page entries? If so, then MMU_STATUS seems correct, but we
> should add that case to the "Event types" documentation above.

Yes, correct.

> 
> >  	pmd = page_check_address_pmd(page, mm, address,
> >  			PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
> >  	if (pmd) {
> > @@ -1647,7 +1653,8 @@ static int __split_huge_page_splitting(struct page *page,
> >  		ret = 1;
> >  		spin_unlock(ptl);
> >  	}
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +					  mmun_end, MMU_STATUS);
> >  
> >  	return ret;
> >  }
> > @@ -2446,7 +2453,8 @@ static void collapse_huge_page(struct mm_struct *mm,
> >  
> >  	mmun_start = address;
> >  	mmun_end   = address + HPAGE_PMD_SIZE;
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +					    mmun_end, MMU_MIGRATE);
> >  	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> >  	/*
> >  	 * After this gup_fast can't run anymore. This also removes
> > @@ -2456,7 +2464,8 @@ static void collapse_huge_page(struct mm_struct *mm,
> >  	 */
> >  	_pmd = pmdp_clear_flush(vma, address, pmd);
> >  	spin_unlock(pmd_ptl);
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +					  mmun_end, MMU_MIGRATE);
> >  
> >  	spin_lock(pte_ptl);
> >  	isolated = __collapse_huge_page_isolate(vma, address, pte);
> > @@ -2845,24 +2854,28 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
> >  	mmun_start = haddr;
> >  	mmun_end   = haddr + HPAGE_PMD_SIZE;
> >  again:
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +					    mmun_end, MMU_MIGRATE);
> 
> Just checking: this is MMU_MIGRATE, instead of MMU_STATUS, because we are 
> actually moving data? (The pages backing the page table?)

Well, in truth I think this mmu_notifier call should change, as what happens
depends on the branch taken. I am guessing the calls were added there because
of the spinlock. I will just remove them, with an explanation, in a
preparatory patch to this one. The tricky case is the huge zero page, but I
think it is fine to call the mmu_notifier after having updated the cpu page
table in that case.
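
Roughly what I have in mind for the huge zero page branch, as a sketch only
(the exact event type is still open to debate):

	if (is_huge_zero_pmd(*pmd)) {
		__split_huge_zero_page_pmd(vma, haddr, pmd);
		spin_unlock(ptl);
		/*
		 * No data moves here, so notifying after the cpu page
		 * table update should be fine.
		 */
		mmu_notifier_invalidate_range_start(mm, mmun_start,
						    mmun_end, MMU_STATUS);
		mmu_notifier_invalidate_range_end(mm, mmun_start,
						  mmun_end, MMU_STATUS);
		return;
	}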

> 
> >  	ptl = pmd_lock(mm, pmd);
> >  	if (unlikely(!pmd_trans_huge(*pmd))) {
> >  		spin_unlock(ptl);
> > -		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> > +		mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +						  mmun_end, MMU_MIGRATE);
> >  		return;
> >  	}
> >  	if (is_huge_zero_pmd(*pmd)) {
> >  		__split_huge_zero_page_pmd(vma, haddr, pmd);
> >  		spin_unlock(ptl);
> > -		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> > +		mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +						  mmun_end, MMU_MIGRATE);
> >  		return;
> >  	}
> >  	page = pmd_page(*pmd);
> >  	VM_BUG_ON_PAGE(!page_count(page), page);
> >  	get_page(page);
> >  	spin_unlock(ptl);
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +					  mmun_end, MMU_MIGRATE);
> >  
> >  	split_huge_page(page);
> >  
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 7faab71..73e1576 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -2565,7 +2565,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> >  	mmun_start = vma->vm_start;
> >  	mmun_end = vma->vm_end;
> >  	if (cow)
> > -		mmu_notifier_invalidate_range_start(src, mmun_start, mmun_end);
> > +		mmu_notifier_invalidate_range_start(src, mmun_start,
> > +						    mmun_end, MMU_MIGRATE);
> >  
> >  	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
> >  		spinlock_t *src_ptl, *dst_ptl;
> > @@ -2615,7 +2616,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> >  	}
> >  
> >  	if (cow)
> > -		mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end);
> > +		mmu_notifier_invalidate_range_end(src, mmun_start,
> > +						  mmun_end, MMU_MIGRATE);
> >  
> >  	return ret;
> >  }
> > @@ -2641,7 +2643,8 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >  	BUG_ON(end & ~huge_page_mask(h));
> >  
> >  	tlb_start_vma(tlb, vma);
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +					    mmun_end, MMU_MIGRATE);
> >  again:
> >  	for (address = start; address < end; address += sz) {
> >  		ptep = huge_pte_offset(mm, address);
> > @@ -2712,7 +2715,8 @@ unlock:
> >  		if (address < end && !ref_page)
> >  			goto again;
> >  	}
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +					  mmun_end, MMU_MIGRATE);
> >  	tlb_end_vma(tlb, vma);
> >  }
> >  
> > @@ -2899,7 +2903,8 @@ retry_avoidcopy:
> >  
> >  	mmun_start = address & huge_page_mask(h);
> >  	mmun_end = mmun_start + huge_page_size(h);
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +					    mmun_end, MMU_MIGRATE);
> >  	/*
> >  	 * Retake the page table lock to check for racing updates
> >  	 * before the page tables are altered
> > @@ -2919,7 +2924,8 @@ retry_avoidcopy:
> >  		new_page = old_page;
> >  	}
> >  	spin_unlock(ptl);
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +					  mmun_end, MMU_MIGRATE);
> >  	page_cache_release(new_page);
> >  	page_cache_release(old_page);
> >  
> > @@ -3344,7 +3350,8 @@ same_page:
> >  }
> >  
> >  unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> > -		unsigned long address, unsigned long end, pgprot_t newprot)
> > +		unsigned long address, unsigned long end, pgprot_t newprot,
> > +		enum mmu_event event)
> >  {
> >  	struct mm_struct *mm = vma->vm_mm;
> >  	unsigned long start = address;
> > @@ -3356,7 +3363,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> >  	BUG_ON(address >= end);
> >  	flush_cache_range(vma, address, end);
> >  
> > -	mmu_notifier_invalidate_range_start(mm, start, end);
> > +	mmu_notifier_invalidate_range_start(mm, start, end, event);
> >  	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
> >  	for (; address < end; address += huge_page_size(h)) {
> >  		spinlock_t *ptl;
> > @@ -3386,7 +3393,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> >  	 */
> >  	flush_tlb_range(vma, start, end);
> >  	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
> > -	mmu_notifier_invalidate_range_end(mm, start, end);
> > +	mmu_notifier_invalidate_range_end(mm, start, end, event);
> >  
> >  	return pages << h->order;
> >  }
> > diff --git a/mm/ksm.c b/mm/ksm.c
> > index cb1e976..4b659f1 100644
> > --- a/mm/ksm.c
> > +++ b/mm/ksm.c
> > @@ -873,7 +873,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
> >  
> >  	mmun_start = addr;
> >  	mmun_end   = addr + PAGE_SIZE;
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +					    mmun_end, MMU_MPROT_RONLY);
> >  
> >  	ptep = page_check_address(page, mm, addr, &ptl, 0);
> >  	if (!ptep)
> > @@ -905,7 +906,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
> >  		if (pte_dirty(entry))
> >  			set_page_dirty(page);
> >  		entry = pte_mkclean(pte_wrprotect(entry));
> > -		set_pte_at_notify(mm, addr, ptep, entry);
> > +		set_pte_at_notify(mm, addr, ptep, entry, MMU_MPROT_RONLY);
> >  	}
> >  	*orig_pte = *ptep;
> >  	err = 0;
> > @@ -913,7 +914,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
> >  out_unlock:
> >  	pte_unmap_unlock(ptep, ptl);
> >  out_mn:
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +					  mmun_end, MMU_MPROT_RONLY);
> >  out:
> >  	return err;
> >  }
> > @@ -949,7 +951,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
> >  
> >  	mmun_start = addr;
> >  	mmun_end   = addr + PAGE_SIZE;
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +					    mmun_end, MMU_MIGRATE);
> >  
> >  	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
> >  	if (!pte_same(*ptep, orig_pte)) {
> > @@ -962,7 +965,9 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
> >  
> >  	flush_cache_page(vma, addr, pte_pfn(*ptep));
> >  	ptep_clear_flush(vma, addr, ptep);
> > -	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
> > +	set_pte_at_notify(mm, addr, ptep,
> > +			  mk_pte(kpage, vma->vm_page_prot),
> > +			  MMU_MIGRATE);
> >  
> >  	page_remove_rmap(page);
> >  	if (!page_mapped(page))
> > @@ -972,7 +977,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
> >  	pte_unmap_unlock(ptep, ptl);
> >  	err = 0;
> >  out_mn:
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +					  mmun_end, MMU_MIGRATE);
> >  out:
> >  	return err;
> >  }
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 09e2cd0..d3908f0 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1050,7 +1050,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >  	mmun_end   = end;
> >  	if (is_cow)
> >  		mmu_notifier_invalidate_range_start(src_mm, mmun_start,
> > -						    mmun_end);
> > +						    mmun_end, MMU_MIGRATE);
> >  
> >  	ret = 0;
> >  	dst_pgd = pgd_offset(dst_mm, addr);
> > @@ -1067,7 +1067,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >  	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
> >  
> >  	if (is_cow)
> > -		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end);
> > +		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end,
> > +						  MMU_MIGRATE);
> >  	return ret;
> >  }
> >  
> > @@ -1371,10 +1372,12 @@ void unmap_vmas(struct mmu_gather *tlb,
> >  {
> >  	struct mm_struct *mm = vma->vm_mm;
> >  
> > -	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
> > +	mmu_notifier_invalidate_range_start(mm, start_addr,
> > +					    end_addr, MMU_MUNMAP);
> >  	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
> >  		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
> > -	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
> > +	mmu_notifier_invalidate_range_end(mm, start_addr,
> > +					  end_addr, MMU_MUNMAP);
> >  }
> >  
> >  /**
> > @@ -1396,10 +1399,10 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
> >  	lru_add_drain();
> >  	tlb_gather_mmu(&tlb, mm, start, end);
> >  	update_hiwater_rss(mm);
> > -	mmu_notifier_invalidate_range_start(mm, start, end);
> > +	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MUNMAP);
> >  	for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
> >  		unmap_single_vma(&tlb, vma, start, end, details);
> > -	mmu_notifier_invalidate_range_end(mm, start, end);
> > +	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MUNMAP);
> >  	tlb_finish_mmu(&tlb, start, end);
> >  }
> >  
> > @@ -1422,9 +1425,9 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
> >  	lru_add_drain();
> >  	tlb_gather_mmu(&tlb, mm, address, end);
> >  	update_hiwater_rss(mm);
> > -	mmu_notifier_invalidate_range_start(mm, address, end);
> > +	mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
> >  	unmap_single_vma(&tlb, vma, address, end, details);
> > -	mmu_notifier_invalidate_range_end(mm, address, end);
> > +	mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
> >  	tlb_finish_mmu(&tlb, address, end);
> >  }
> >  
> > @@ -2208,7 +2211,8 @@ gotten:
> >  
> >  	mmun_start  = address & PAGE_MASK;
> >  	mmun_end    = mmun_start + PAGE_SIZE;
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +					    mmun_end, MMU_MIGRATE);
> >  
> >  	/*
> >  	 * Re-check the pte - we dropped the lock
> > @@ -2240,7 +2244,7 @@ gotten:
> >  		 * mmu page tables (such as kvm shadow page tables), we want the
> >  		 * new page to be mapped directly into the secondary page table.
> >  		 */
> > -		set_pte_at_notify(mm, address, page_table, entry);
> > +		set_pte_at_notify(mm, address, page_table, entry, MMU_MIGRATE);
> >  		update_mmu_cache(vma, address, page_table);
> >  		if (old_page) {
> >  			/*
> > @@ -2279,7 +2283,8 @@ gotten:
> >  unlock:
> >  	pte_unmap_unlock(page_table, ptl);
> >  	if (mmun_end > mmun_start)
> > -		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> > +		mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +						  mmun_end, MMU_MIGRATE);
> >  	if (old_page) {
> >  		/*
> >  		 * Don't let another task, with possibly unlocked vma,
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index ab43fbf..b526c72 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -1820,12 +1820,14 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
> >  	WARN_ON(PageLRU(new_page));
> >  
> >  	/* Recheck the target PMD */
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +					    mmun_end, MMU_MIGRATE);
> >  	ptl = pmd_lock(mm, pmd);
> >  	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
> >  fail_putback:
> >  		spin_unlock(ptl);
> > -		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> > +		mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +						  mmun_end, MMU_MIGRATE);
> >  
> >  		/* Reverse changes made by migrate_page_copy() */
> >  		if (TestClearPageActive(new_page))
> > @@ -1878,7 +1880,8 @@ fail_putback:
> >  	page_remove_rmap(page);
> >  
> >  	spin_unlock(ptl);
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +					  mmun_end, MMU_MIGRATE);
> >  
> >  	/* Take an "isolate" reference and put new page on the LRU. */
> >  	get_page(new_page);
> > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> > index 41cefdf..9decb88 100644
> > --- a/mm/mmu_notifier.c
> > +++ b/mm/mmu_notifier.c
> > @@ -122,8 +122,10 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
> >  	return young;
> >  }
> >  
> > -void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
> > -			       pte_t pte)
> > +void __mmu_notifier_change_pte(struct mm_struct *mm,
> > +			       unsigned long address,
> > +			       pte_t pte,
> > +			       enum mmu_event event)
> >  {
> >  	struct mmu_notifier *mn;
> >  	int id;
> > @@ -131,13 +133,14 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
> >  	id = srcu_read_lock(&srcu);
> >  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
> >  		if (mn->ops->change_pte)
> > -			mn->ops->change_pte(mn, mm, address, pte);
> > +			mn->ops->change_pte(mn, mm, address, pte, event);
> >  	}
> >  	srcu_read_unlock(&srcu, id);
> >  }
> >  
> >  void __mmu_notifier_invalidate_page(struct mm_struct *mm,
> > -					  unsigned long address)
> > +				    unsigned long address,
> > +				    enum mmu_event event)
> >  {
> >  	struct mmu_notifier *mn;
> >  	int id;
> > @@ -145,13 +148,16 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
> >  	id = srcu_read_lock(&srcu);
> >  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
> >  		if (mn->ops->invalidate_page)
> > -			mn->ops->invalidate_page(mn, mm, address);
> > +			mn->ops->invalidate_page(mn, mm, address, event);
> >  	}
> >  	srcu_read_unlock(&srcu, id);
> >  }
> >  
> >  void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> > -				  unsigned long start, unsigned long end)
> > +					   unsigned long start,
> > +					   unsigned long end,
> > +					   enum mmu_event event)
> > +
> >  {
> >  	struct mmu_notifier *mn;
> >  	int id;
> > @@ -159,14 +165,17 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> >  	id = srcu_read_lock(&srcu);
> >  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
> >  		if (mn->ops->invalidate_range_start)
> > -			mn->ops->invalidate_range_start(mn, mm, start, end);
> > +			mn->ops->invalidate_range_start(mn, mm, start,
> > +							end, event);
> >  	}
> >  	srcu_read_unlock(&srcu, id);
> >  }
> >  EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
> >  
> >  void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> > -				  unsigned long start, unsigned long end)
> > +					 unsigned long start,
> > +					 unsigned long end,
> > +					 enum mmu_event event)
> >  {
> >  	struct mmu_notifier *mn;
> >  	int id;
> > @@ -174,7 +183,8 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> >  	id = srcu_read_lock(&srcu);
> >  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
> >  		if (mn->ops->invalidate_range_end)
> > -			mn->ops->invalidate_range_end(mn, mm, start, end);
> > +			mn->ops->invalidate_range_end(mn, mm, start,
> > +						      end, event);
> >  	}
> >  	srcu_read_unlock(&srcu, id);
> >  }
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index c43d557..6ce6c23 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -137,7 +137,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> >  
> >  static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> >  		pud_t *pud, unsigned long addr, unsigned long end,
> > -		pgprot_t newprot, int dirty_accountable, int prot_numa)
> > +		pgprot_t newprot, int dirty_accountable, int prot_numa,
> > +		enum mmu_event event)
> >  {
> >  	pmd_t *pmd;
> >  	struct mm_struct *mm = vma->vm_mm;
> > @@ -157,7 +158,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> >  		/* invoke the mmu notifier if the pmd is populated */
> >  		if (!mni_start) {
> >  			mni_start = addr;
> > -			mmu_notifier_invalidate_range_start(mm, mni_start, end);
> > +			mmu_notifier_invalidate_range_start(mm, mni_start,
> > +							    end, event);
> >  		}
> >  
> >  		if (pmd_trans_huge(*pmd)) {
> > @@ -185,7 +187,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> >  	} while (pmd++, addr = next, addr != end);
> >  
> >  	if (mni_start)
> > -		mmu_notifier_invalidate_range_end(mm, mni_start, end);
> > +		mmu_notifier_invalidate_range_end(mm, mni_start, end, event);
> >  
> >  	if (nr_huge_updates)
> >  		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
> > @@ -194,7 +196,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> >  
> >  static inline unsigned long change_pud_range(struct vm_area_struct *vma,
> >  		pgd_t *pgd, unsigned long addr, unsigned long end,
> > -		pgprot_t newprot, int dirty_accountable, int prot_numa)
> > +		pgprot_t newprot, int dirty_accountable, int prot_numa,
> > +		enum mmu_event event)
> >  {
> >  	pud_t *pud;
> >  	unsigned long next;
> > @@ -206,7 +209,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
> >  		if (pud_none_or_clear_bad(pud))
> >  			continue;
> >  		pages += change_pmd_range(vma, pud, addr, next, newprot,
> > -				 dirty_accountable, prot_numa);
> > +				 dirty_accountable, prot_numa, event);
> >  	} while (pud++, addr = next, addr != end);
> >  
> >  	return pages;
> > @@ -214,7 +217,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
> >  
> >  static unsigned long change_protection_range(struct vm_area_struct *vma,
> >  		unsigned long addr, unsigned long end, pgprot_t newprot,
> > -		int dirty_accountable, int prot_numa)
> > +		int dirty_accountable, int prot_numa, enum mmu_event event)
> >  {
> >  	struct mm_struct *mm = vma->vm_mm;
> >  	pgd_t *pgd;
> > @@ -231,7 +234,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
> >  		if (pgd_none_or_clear_bad(pgd))
> >  			continue;
> >  		pages += change_pud_range(vma, pgd, addr, next, newprot,
> > -				 dirty_accountable, prot_numa);
> > +				 dirty_accountable, prot_numa, event);
> >  	} while (pgd++, addr = next, addr != end);
> >  
> >  	/* Only flush the TLB if we actually modified any entries: */
> > @@ -247,11 +250,23 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
> >  		       int dirty_accountable, int prot_numa)
> >  {
> >  	unsigned long pages;
> > +	enum mmu_event event = MMU_MPROT_NONE;
> > +
> > +	/* At this points vm_flags is updated. */
> > +	if ((vma->vm_flags & VM_READ) && (vma->vm_flags & VM_WRITE))
> > +		event = MMU_MPROT_RANDW;
> > +	else if (vma->vm_flags & VM_WRITE)
> > +		event = MMU_MPROT_WONLY;
> > +	else if (vma->vm_flags & VM_READ)
> > +		event = MMU_MPROT_RONLY;
> 
> hmmm, shouldn't we be checking against the newprot argument, instead of 
> against vma->vm_flags?  The calling code, mprotect_fixup for example, can
> set flags *other* than VM_READ or VM_WRITE, and that could lead to a
> confusing or even inaccurate event. We could have a case where the event
> type is MMU_MPROT_RONLY, but the page might have been read-only the entire 
> time, and some other flag was actually getting set.

AFAICT the only way to end up here is if VM_WRITE or VM_READ changes (either
of them, or both, being set or cleared). It might be simpler to use newprot,
as the whole VM_ vs VM_MAY_ arithmetic always confuses me.
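
Either way the mapping is mechanical; pulled out into a helper (sketch only,
based on the updated vm_flags exactly as in the hunk above) it would be:

static enum mmu_event mprot_event(unsigned long vm_flags)
{
	if ((vm_flags & (VM_READ | VM_WRITE)) == (VM_READ | VM_WRITE))
		return MMU_MPROT_RANDW;
	if (vm_flags & VM_WRITE)
		return MMU_MPROT_WONLY;
	if (vm_flags & VM_READ)
		return MMU_MPROT_RONLY;
	return MMU_MPROT_NONE;
}

Switching such a helper over to derive the event from the new protection
instead would then be a local change.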

> 
> I'm also starting to wonder if this event is adding much value here (for 
> protection changes), if the newprot argument contains the same 
> information. Although it is important to have a unified sort of reporting 
> system for HMM, so that's probably good enough reason to do this.
> 
> >  
> >  	if (is_vm_hugetlb_page(vma))
> > -		pages = hugetlb_change_protection(vma, start, end, newprot);
> > +		pages = hugetlb_change_protection(vma, start, end,
> > +						  newprot, event);
> >  	else
> > -		pages = change_protection_range(vma, start, end, newprot, dirty_accountable, prot_numa);
> > +		pages = change_protection_range(vma, start, end, newprot,
> > +						dirty_accountable,
> > +						prot_numa, event);
> >  
> >  	return pages;
> >  }
> > diff --git a/mm/mremap.c b/mm/mremap.c
> > index 05f1180..6827d2f 100644
> > --- a/mm/mremap.c
> > +++ b/mm/mremap.c
> > @@ -177,7 +177,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> >  
> >  	mmun_start = old_addr;
> >  	mmun_end   = old_end;
> > -	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start,
> > +					    mmun_end, MMU_MIGRATE);
> >  
> >  	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
> >  		cond_resched();
> > @@ -228,7 +229,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> >  	if (likely(need_flush))
> >  		flush_tlb_range(vma, old_end-len, old_addr);
> >  
> > -	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start,
> > +					  mmun_end, MMU_MIGRATE);
> >  
> >  	return len + old_addr - old_end;	/* how much done */
> >  }
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 7928ddd..bd7e6d7 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -840,7 +840,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
> >  	pte_unmap_unlock(pte, ptl);
> >  
> >  	if (ret) {
> > -		mmu_notifier_invalidate_page(mm, address);
> > +		mmu_notifier_invalidate_page(mm, address, MMU_WB);
> >  		(*cleaned)++;
> >  	}
> >  out:
> > @@ -1128,6 +1128,10 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >  	spinlock_t *ptl;
> >  	int ret = SWAP_AGAIN;
> >  	enum ttu_flags flags = (enum ttu_flags)arg;
> > +	enum mmu_event event = MMU_MIGRATE;
> > +
> > +	if (flags & TTU_MUNLOCK)
> > +		event = MMU_STATUS;
> >  
> >  	pte = page_check_address(page, mm, address, &ptl, 0);
> >  	if (!pte)
> > @@ -1233,7 +1237,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >  out_unmap:
> >  	pte_unmap_unlock(pte, ptl);
> >  	if (ret != SWAP_FAIL && !(flags & TTU_MUNLOCK))
> > -		mmu_notifier_invalidate_page(mm, address);
> > +		mmu_notifier_invalidate_page(mm, address, event);
> >  out:
> >  	return ret;
> >  
> > @@ -1287,7 +1291,9 @@ out_mlock:
> >  #define CLUSTER_MASK	(~(CLUSTER_SIZE - 1))
> >  
> >  static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
> > -		struct vm_area_struct *vma, struct page *check_page)
> > +				struct vm_area_struct *vma,
> > +				struct page *check_page,
> > +				enum ttu_flags flags)
> >  {
> >  	struct mm_struct *mm = vma->vm_mm;
> >  	pmd_t *pmd;
> > @@ -1301,6 +1307,10 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
> >  	unsigned long end;
> >  	int ret = SWAP_AGAIN;
> >  	int locked_vma = 0;
> > +	enum mmu_event event = MMU_MIGRATE;
> > +
> > +	if (flags & TTU_MUNLOCK)
> > +		event = MMU_STATUS;
> >  
> >  	address = (vma->vm_start + cursor) & CLUSTER_MASK;
> >  	end = address + CLUSTER_SIZE;
> > @@ -1315,7 +1325,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
> >  
> >  	mmun_start = address;
> >  	mmun_end   = end;
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, event);
> >  
> >  	/*
> >  	 * If we can acquire the mmap_sem for read, and vma is VM_LOCKED,
> > @@ -1380,7 +1390,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
> >  		(*mapcount)--;
> >  	}
> >  	pte_unmap_unlock(pte - 1, ptl);
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> > +	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, event);
> >  	if (locked_vma)
> >  		up_read(&vma->vm_mm->mmap_sem);
> >  	return ret;
> > @@ -1436,7 +1446,9 @@ static int try_to_unmap_nonlinear(struct page *page,
> >  			while (cursor < max_nl_cursor &&
> >  				cursor < vma->vm_end - vma->vm_start) {
> >  				if (try_to_unmap_cluster(cursor, &mapcount,
> > -						vma, page) == SWAP_MLOCK)
> > +							 vma, page,
> > +							 (enum ttu_flags)arg)
> > +							 == SWAP_MLOCK)
> >  					ret = SWAP_MLOCK;
> >  				cursor += CLUSTER_SIZE;
> >  				vma->vm_private_data = (void *) cursor;
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 4b6c01b..6e1992f 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -262,7 +262,8 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
> >  
> >  static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
> >  					     struct mm_struct *mm,
> > -					     unsigned long address)
> > +					     unsigned long address,
> > +					     enum mmu_event event)
> >  {
> >  	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> >  	int need_tlb_flush, idx;
> > @@ -301,7 +302,8 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
> >  static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> >  					struct mm_struct *mm,
> >  					unsigned long address,
> > -					pte_t pte)
> > +					pte_t pte,
> > +					enum mmu_event event)
> >  {
> >  	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> >  	int idx;
> > @@ -317,7 +319,8 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> >  static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> >  						    struct mm_struct *mm,
> >  						    unsigned long start,
> > -						    unsigned long end)
> > +						    unsigned long end,
> > +						    enum mmu_event event)
> >  {
> >  	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> >  	int need_tlb_flush = 0, idx;
> > @@ -343,7 +346,8 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> >  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> >  						  struct mm_struct *mm,
> >  						  unsigned long start,
> > -						  unsigned long end)
> > +						  unsigned long end,
> > +						  enum mmu_event event)
> >  {
> >  	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> >  
> > -- 
> > 1.9.0
> > 
> 
> thanks,
> John H.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 2/6] mm: differentiate unmap for vmscan from other unmap.
  2014-06-30  3:58     ` John Hubbard
@ 2014-06-30 15:58       ` Jerome Glisse
  -1 siblings, 0 replies; 76+ messages in thread
From: Jerome Glisse @ 2014-06-30 15:58 UTC (permalink / raw)
  To: John Hubbard
  Cc: akpm, linux-mm, linux-kernel, mgorman, hpa, peterz, aarcange,
	riel, jweiner, torvalds, Mark Hairgrove, Jatin Kumar,
	Subhash Gutti, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Sherry Cheung, Duncan Poole, Oded Gabbay,
	Alexander Deucher, Andrew Lewycky, Jérôme Glisse

On Sun, Jun 29, 2014 at 08:58:17PM -0700, John Hubbard wrote:
> On Fri, 27 Jun 2014, Jérôme Glisse wrote:
> 
> > From: Jérôme Glisse <jglisse@redhat.com>
> > 
> > New code will need to be able to differentiate between a regular unmap and
> > an unmap trigger by vmscan in which case we want to be as quick as possible.
> > 
> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > ---
> >  include/linux/rmap.h | 15 ++++++++-------
> >  mm/memory-failure.c  |  2 +-
> >  mm/vmscan.c          |  4 ++--
> >  3 files changed, 11 insertions(+), 10 deletions(-)
> > 
> > diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> > index be57450..eddbc07 100644
> > --- a/include/linux/rmap.h
> > +++ b/include/linux/rmap.h
> > @@ -72,13 +72,14 @@ struct anon_vma_chain {
> >  };
> >  
> >  enum ttu_flags {
> > -	TTU_UNMAP = 1,			/* unmap mode */
> > -	TTU_MIGRATION = 2,		/* migration mode */
> > -	TTU_MUNLOCK = 4,		/* munlock mode */
> > -
> > -	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
> > -	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
> > -	TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
> > +	TTU_VMSCAN = 1,			/* unmap for vmscan */
> > +	TTU_POISON = 2,			/* unmap for poison */
> > +	TTU_MIGRATION = 4,		/* migration mode */
> > +	TTU_MUNLOCK = 8,		/* munlock mode */
> > +
> > +	TTU_IGNORE_MLOCK = (1 << 9),	/* ignore mlock */
> > +	TTU_IGNORE_ACCESS = (1 << 10),	/* don't age */
> > +	TTU_IGNORE_HWPOISON = (1 << 11),/* corrupted page is recoverable */
> 
> Unless there is a deeper purpose that I am overlooking, I think it would 
> be better to leave the _MLOCK, _ACCESS, and _HWPOISON at their original 
> values. I just can't quite see why they would need to start at bit 9 
> instead of bit 8...

This code was changed at some point to make the various TTU_* values bit
flags instead of plain values. I am not sure what the win was; I would need
to dig up the patch that did that. In any case, I preserve that change here,
hence starting at bit 9.
> 
> >  };
> >  
> >  #ifdef CONFIG_MMU
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index a7a89eb..ba176c4 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -887,7 +887,7 @@ static int page_action(struct page_state *ps, struct page *p,
> >  static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
> >  				  int trapno, int flags, struct page **hpagep)
> >  {
> > -	enum ttu_flags ttu = TTU_UNMAP | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
> > +	enum ttu_flags ttu = TTU_POISON | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
> >  	struct address_space *mapping;
> >  	LIST_HEAD(tokill);
> >  	int ret;
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 6d24fd6..5a7d286 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1163,7 +1163,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
> >  	}
> >  
> >  	ret = shrink_page_list(&clean_pages, zone, &sc,
> > -			TTU_UNMAP|TTU_IGNORE_ACCESS,
> > +			TTU_VMSCAN|TTU_IGNORE_ACCESS,
> >  			&dummy1, &dummy2, &dummy3, &dummy4, &dummy5, true);
> >  	list_splice(&clean_pages, page_list);
> >  	mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret);
> > @@ -1518,7 +1518,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> >  	if (nr_taken == 0)
> >  		return 0;
> >  
> > -	nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP,
> > +	nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_VMSCAN,
> >  				&nr_dirty, &nr_unqueued_dirty, &nr_congested,
> >  				&nr_writeback, &nr_immediate,
> >  				false);
> > -- 
> > 1.9.0
> > 
> 
> Other than that, looks good.
> 
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> 
> thanks,
> John H.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 2/6] mm: differentiate unmap for vmscan from other unmap.
@ 2014-06-30 15:58       ` Jerome Glisse
  0 siblings, 0 replies; 76+ messages in thread
From: Jerome Glisse @ 2014-06-30 15:58 UTC (permalink / raw)
  To: John Hubbard
  Cc: akpm, linux-mm, linux-kernel, mgorman, hpa, peterz, aarcange,
	riel, jweiner, torvalds, Mark Hairgrove, Jatin Kumar,
	Subhash Gutti, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Sherry Cheung, Duncan Poole, Oded Gabbay,
	Alexander Deucher, Andrew Lewycky, Jérôme Glisse

On Sun, Jun 29, 2014 at 08:58:17PM -0700, John Hubbard wrote:
> On Fri, 27 Jun 2014, Jerome Glisse wrote:
> 
> > From: Jerome Glisse <jglisse@redhat.com>
> > 
> > New code will need to be able to differentiate between a regular unmap and
> > an unmap trigger by vmscan in which case we want to be as quick as possible.
> > 
> > Signed-off-by: Jerome Glisse <jglisse@redhat.com>
> > ---
> >  include/linux/rmap.h | 15 ++++++++-------
> >  mm/memory-failure.c  |  2 +-
> >  mm/vmscan.c          |  4 ++--
> >  3 files changed, 11 insertions(+), 10 deletions(-)
> > 
> > diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> > index be57450..eddbc07 100644
> > --- a/include/linux/rmap.h
> > +++ b/include/linux/rmap.h
> > @@ -72,13 +72,14 @@ struct anon_vma_chain {
> >  };
> >  
> >  enum ttu_flags {
> > -	TTU_UNMAP = 1,			/* unmap mode */
> > -	TTU_MIGRATION = 2,		/* migration mode */
> > -	TTU_MUNLOCK = 4,		/* munlock mode */
> > -
> > -	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
> > -	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
> > -	TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
> > +	TTU_VMSCAN = 1,			/* unmap for vmscan */
> > +	TTU_POISON = 2,			/* unmap for poison */
> > +	TTU_MIGRATION = 4,		/* migration mode */
> > +	TTU_MUNLOCK = 8,		/* munlock mode */
> > +
> > +	TTU_IGNORE_MLOCK = (1 << 9),	/* ignore mlock */
> > +	TTU_IGNORE_ACCESS = (1 << 10),	/* don't age */
> > +	TTU_IGNORE_HWPOISON = (1 << 11),/* corrupted page is recoverable */
> 
> Unless there is a deeper purpose that I am overlooking, I think it would 
> be better to leave the _MLOCK, _ACCESS, and _HWPOISON at their original 
> values. I just can't quite see why they would need to start at bit 9 
> instead of bit 8...

This code was changed at some point to make the various TTU_* values bit
flags instead of plain values. I am not sure what the win was; I would need
to dig up the patch that did that. In any case, I preserve that change here,
hence starting at bit 9.
> 
> >  };
> >  
> >  #ifdef CONFIG_MMU
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index a7a89eb..ba176c4 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -887,7 +887,7 @@ static int page_action(struct page_state *ps, struct page *p,
> >  static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
> >  				  int trapno, int flags, struct page **hpagep)
> >  {
> > -	enum ttu_flags ttu = TTU_UNMAP | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
> > +	enum ttu_flags ttu = TTU_POISON | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
> >  	struct address_space *mapping;
> >  	LIST_HEAD(tokill);
> >  	int ret;
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 6d24fd6..5a7d286 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1163,7 +1163,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
> >  	}
> >  
> >  	ret = shrink_page_list(&clean_pages, zone, &sc,
> > -			TTU_UNMAP|TTU_IGNORE_ACCESS,
> > +			TTU_VMSCAN|TTU_IGNORE_ACCESS,
> >  			&dummy1, &dummy2, &dummy3, &dummy4, &dummy5, true);
> >  	list_splice(&clean_pages, page_list);
> >  	mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret);
> > @@ -1518,7 +1518,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> >  	if (nr_taken == 0)
> >  		return 0;
> >  
> > -	nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP,
> > +	nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_VMSCAN,
> >  				&nr_dirty, &nr_unqueued_dirty, &nr_congested,
> >  				&nr_writeback, &nr_immediate,
> >  				false);
> > -- 
> > 1.9.0
> > 
> 
> Other than that, looks good.
> 
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> 
> thanks,
> John H.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 4/6] mmu_notifier: pass through vma to invalidate_range and invalidate_page
  2014-06-30  3:29     ` John Hubbard
@ 2014-06-30 16:00       ` Jerome Glisse
  -1 siblings, 0 replies; 76+ messages in thread
From: Jerome Glisse @ 2014-06-30 16:00 UTC (permalink / raw)
  To: John Hubbard
  Cc: akpm, linux-mm, linux-kernel, mgorman, hpa, peterz, aarcange,
	riel, jweiner, torvalds, Mark Hairgrove, Jatin Kumar,
	Subhash Gutti, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, Sherry Cheung, Duncan Poole, Oded Gabbay,
	Alexander Deucher, Andrew Lewycky, Jérôme Glisse

On Sun, Jun 29, 2014 at 08:29:01PM -0700, John Hubbard wrote:
> On Fri, 27 Jun 2014, Jérôme Glisse wrote:
> 
> > From: Jérôme Glisse <jglisse@redhat.com>
> > 
> > New user of the mmu_notifier interface need to lookup vma in order to
> > perform the invalidation operation. Instead of redoing a vma lookup
> > inside the callback just pass through the vma from the call site where
> > it is already available.
> > 
> > This needs small refactoring in memory.c to call invalidate_range on
> > vma boundary the overhead should be low enough.
> > 
> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > ---
> >  drivers/gpu/drm/i915/i915_gem_userptr.c |  1 +
> >  drivers/iommu/amd_iommu_v2.c            |  3 +++
> >  drivers/misc/sgi-gru/grutlbpurge.c      |  6 ++++-
> >  drivers/xen/gntdev.c                    |  4 +++-
> >  fs/proc/task_mmu.c                      | 16 ++++++++-----
> >  include/linux/mmu_notifier.h            | 19 ++++++++++++---
> >  kernel/events/uprobes.c                 |  4 ++--
> >  mm/filemap_xip.c                        |  3 ++-
> >  mm/huge_memory.c                        | 26 ++++++++++----------
> >  mm/hugetlb.c                            | 16 ++++++-------
> >  mm/ksm.c                                |  8 +++----
> >  mm/memory.c                             | 42 +++++++++++++++++++++------------
> >  mm/migrate.c                            |  6 ++---
> >  mm/mmu_notifier.c                       |  9 ++++---
> >  mm/mprotect.c                           |  5 ++--
> >  mm/mremap.c                             |  4 ++--
> >  mm/rmap.c                               |  9 +++----
> >  virt/kvm/kvm_main.c                     |  3 +++
> >  18 files changed, 116 insertions(+), 68 deletions(-)
> > 
> 
> Hi Jerome, considering that you have to change every call site already, it 
> seems to me that it would be ideal to just delete the mm argument from all 
> of these invalidate_range* callbacks. In other words, replace the mm 
> argument with the new vma argument.  I don't see much point in passing 
> them both around, and while it would make the patch a *bit* larger, it's 
> mostly just an extra line or two per call site:
> 
>   mm = vma->vm_mm;

Yes, it probably is.
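
For illustration only, here is a minimal sketch (not part of this patch) of
what a callback could look like if the mm argument were dropped in favour of
the vma, as suggested above; the notifier would recover the mm via vma->vm_mm:

  static void example_invalidate_range_start(struct mmu_notifier *mn,
                                             struct vm_area_struct *vma,
                                             unsigned long start,
                                             unsigned long end,
                                             enum mmu_event event)
  {
          struct mm_struct *mm = vma->vm_mm; /* mm recovered from the vma */

          /* driver-specific invalidation of [start, end) for this vma */
  }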

> 
> Also, passing the vma around really does seem like a good approach, but it 
> does cause a bunch of additional calls to the invalidate_range* routines,
> because we generate a call per vma, instead of just one for the entire mm.
> So that brings up a couple questions:
> 
> 1) Is there any chance that this could cause measurable performance 
> regressions?
> 
> 2) Should you put a little note in the commit message, mentioning this 
> point?
> 

I pointed that out in my introduction mail to the patchset. I think those
code paths are mostly used during process destruction and hence this is fine,
but at the same time being able to spin up and destroy many processes quickly
may be considered a feature and an important use case that should not regress.
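
To make the cost concrete, this is roughly the per-vma pattern the patch
switches to in the teardown paths below (sketch only, mirroring the
unmap_vmas() hunk): one start/end notifier pair per vma, with the range
clamped to that vma.

  for (; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
          unsigned long s = max(start_addr, vma->vm_start);
          unsigned long e = min(end_addr, vma->vm_end);

          mmu_notifier_invalidate_range_start(mm, vma, s, e, MMU_MUNMAP);
          unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
          mmu_notifier_invalidate_range_end(mm, vma, s, e, MMU_MUNMAP);
  }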

> > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > index ed6f35e..191ac71 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > @@ -55,6 +55,7 @@ struct i915_mmu_object {
> >  
> >  static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> >  						       struct mm_struct *mm,
> > +						       struct vm_area_struct *vma,
> >  						       unsigned long start,
> >  						       unsigned long end,
> >  						       enum mmu_event event)
> 
> That routine has a local variable named vma, so it might be polite to 
> rename the local variable, to make it more obvious to the reader that they 
> are different. Of course, since the compiler knows which is which, feel 
> free to ignore this comment.
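
Something like the following rename would address that; the body here is
invented for illustration (only the signature comes from this patch), and
"mo_vma" is just a hypothetical name for the pre-existing local:

  static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
                                                         struct mm_struct *mm,
                                                         struct vm_area_struct *vma,
                                                         unsigned long start,
                                                         unsigned long end,
                                                         enum mmu_event event)
  {
          /* hypothetical: the local formerly named "vma", renamed so it no
           * longer shadows the new parameter */
          struct vm_area_struct *mo_vma;
          ...
  }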
> 
> > diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
> > index 2bb9771..9f9e706 100644
> > --- a/drivers/iommu/amd_iommu_v2.c
> > +++ b/drivers/iommu/amd_iommu_v2.c
> > @@ -422,6 +422,7 @@ static void mn_change_pte(struct mmu_notifier *mn,
> >  
> >  static void mn_invalidate_page(struct mmu_notifier *mn,
> >  			       struct mm_struct *mm,
> > +			       struct vm_area_struct *vma,
> >  			       unsigned long address,
> >  			       enum mmu_event event)
> >  {
> > @@ -430,6 +431,7 @@ static void mn_invalidate_page(struct mmu_notifier *mn,
> >  
> >  static void mn_invalidate_range_start(struct mmu_notifier *mn,
> >  				      struct mm_struct *mm,
> > +				      struct vm_area_struct *vma,
> >  				      unsigned long start,
> >  				      unsigned long end,
> >  				      enum mmu_event event)
> > @@ -453,6 +455,7 @@ static void mn_invalidate_range_start(struct mmu_notifier *mn,
> >  
> >  static void mn_invalidate_range_end(struct mmu_notifier *mn,
> >  				    struct mm_struct *mm,
> > +				    struct vm_area_struct *vma,
> >  				    unsigned long start,
> >  				    unsigned long end,
> >  				    enum mmu_event event)
> > diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
> > index e67fed1..d02e4c7 100644
> > --- a/drivers/misc/sgi-gru/grutlbpurge.c
> > +++ b/drivers/misc/sgi-gru/grutlbpurge.c
> > @@ -221,6 +221,7 @@ void gru_flush_all_tlb(struct gru_state *gru)
> >   */
> >  static void gru_invalidate_range_start(struct mmu_notifier *mn,
> >  				       struct mm_struct *mm,
> > +				       struct vm_area_struct *vma,
> >  				       unsigned long start, unsigned long end,
> >  				       enum mmu_event event)
> >  {
> > @@ -235,7 +236,9 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
> >  }
> >  
> >  static void gru_invalidate_range_end(struct mmu_notifier *mn,
> > -				     struct mm_struct *mm, unsigned long start,
> > +				     struct mm_struct *mm,
> > +				     struct vm_area_struct *vma,
> > +				     unsigned long start,
> >  				     unsigned long end,
> >  				     enum mmu_event event)
> >  {
> > @@ -250,6 +253,7 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
> >  }
> >  
> >  static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
> > +				struct vm_area_struct *vma,
> >  				unsigned long address,
> >  				enum mmu_event event)
> >  {
> > diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
> > index fe9da94..219928b 100644
> > --- a/drivers/xen/gntdev.c
> > +++ b/drivers/xen/gntdev.c
> > @@ -428,6 +428,7 @@ static void unmap_if_in_range(struct grant_map *map,
> >  
> >  static void mn_invl_range_start(struct mmu_notifier *mn,
> >  				struct mm_struct *mm,
> > +				struct vm_area_struct *vma,
> >  				unsigned long start,
> >  				unsigned long end,
> >  				enum mmu_event event)
> > @@ -447,10 +448,11 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
> >  
> >  static void mn_invl_page(struct mmu_notifier *mn,
> >  			 struct mm_struct *mm,
> > +			 struct vm_area_struct *vma,
> >  			 unsigned long address,
> >  			 enum mmu_event event)
> >  {
> > -	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE, event);
> > +	mn_invl_range_start(mn, mm, vma, address, address + PAGE_SIZE, event);
> >  }
> >  
> >  static void mn_release(struct mmu_notifier *mn,
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index e9e79f7..8b0f25d 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -829,13 +829,15 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> >  			.private = &cp,
> >  		};
> >  		down_read(&mm->mmap_sem);
> > -		if (type == CLEAR_REFS_SOFT_DIRTY)
> > -			mmu_notifier_invalidate_range_start(mm, 0,
> > -							    -1, MMU_STATUS);
> >  		for (vma = mm->mmap; vma; vma = vma->vm_next) {
> >  			cp.vma = vma;
> >  			if (is_vm_hugetlb_page(vma))
> >  				continue;
> > +			if (type == CLEAR_REFS_SOFT_DIRTY)
> > +				mmu_notifier_invalidate_range_start(mm, vma,
> > +								    vma->vm_start,
> > +								    vma->vm_end,
> > +								    MMU_STATUS);
> >  			/*
> >  			 * Writing 1 to /proc/pid/clear_refs affects all pages.
> >  			 *
> > @@ -857,10 +859,12 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> >  			}
> >  			walk_page_range(vma->vm_start, vma->vm_end,
> >  					&clear_refs_walk);
> > +			if (type == CLEAR_REFS_SOFT_DIRTY)
> > +				mmu_notifier_invalidate_range_end(mm, vma,
> > +								  vma->vm_start,
> > +								  vma->vm_end,
> > +								  MMU_STATUS);
> >  		}
> > -		if (type == CLEAR_REFS_SOFT_DIRTY)
> > -			mmu_notifier_invalidate_range_end(mm, 0,
> > -							  -1, MMU_STATUS);
> >  		flush_tlb_mm(mm);
> >  		up_read(&mm->mmap_sem);
> >  		mmput(mm);
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index 82e9577..8907e5d 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -137,6 +137,7 @@ struct mmu_notifier_ops {
> >  	 */
> >  	void (*invalidate_page)(struct mmu_notifier *mn,
> >  				struct mm_struct *mm,
> > +				struct vm_area_struct *vma,
> >  				unsigned long address,
> >  				enum mmu_event event);
> >  
> > @@ -185,11 +186,13 @@ struct mmu_notifier_ops {
> >  	 */
> >  	void (*invalidate_range_start)(struct mmu_notifier *mn,
> >  				       struct mm_struct *mm,
> > +				       struct vm_area_struct *vma,
> >  				       unsigned long start,
> >  				       unsigned long end,
> >  				       enum mmu_event event);
> >  	void (*invalidate_range_end)(struct mmu_notifier *mn,
> >  				     struct mm_struct *mm,
> > +				     struct vm_area_struct *vma,
> >  				     unsigned long start,
> >  				     unsigned long end,
> >  				     enum mmu_event event);
> > @@ -233,13 +236,16 @@ extern void __mmu_notifier_change_pte(struct mm_struct *mm,
> >  				      pte_t pte,
> >  				      enum mmu_event event);
> >  extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
> > +					   struct vm_area_struct *vma,
> >  					  unsigned long address,
> >  					  enum mmu_event event);
> >  extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> > +						  struct vm_area_struct *vma,
> >  						  unsigned long start,
> >  						  unsigned long end,
> >  						  enum mmu_event event);
> >  extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> > +						struct vm_area_struct *vma,
> >  						unsigned long start,
> >  						unsigned long end,
> >  						enum mmu_event event);
> > @@ -276,29 +282,33 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm,
> >  }
> >  
> >  static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
> > +						struct vm_area_struct *vma,
> >  						unsigned long address,
> >  						enum mmu_event event)
> >  {
> >  	if (mm_has_notifiers(mm))
> > -		__mmu_notifier_invalidate_page(mm, address, event);
> > +		__mmu_notifier_invalidate_page(mm, vma, address, event);
> >  }
> >  
> >  static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> > +						       struct vm_area_struct *vma,
> >  						       unsigned long start,
> >  						       unsigned long end,
> >  						       enum mmu_event event)
> >  {
> >  	if (mm_has_notifiers(mm))
> > -		__mmu_notifier_invalidate_range_start(mm, start, end, event);
> > +		__mmu_notifier_invalidate_range_start(mm, vma, start,
> > +						      end, event);
> >  }
> >  
> >  static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> > +						     struct vm_area_struct *vma,
> >  						     unsigned long start,
> >  						     unsigned long end,
> >  						     enum mmu_event event)
> >  {
> >  	if (mm_has_notifiers(mm))
> > -		__mmu_notifier_invalidate_range_end(mm, start, end, event);
> > +		__mmu_notifier_invalidate_range_end(mm, vma, start, end, event);
> >  }
> >  
> >  static inline void mmu_notifier_mm_init(struct mm_struct *mm)
> > @@ -380,12 +390,14 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm,
> >  }
> >  
> >  static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
> > +						struct vm_area_struct *vma,
> >  						unsigned long address,
> >  						enum mmu_event event)
> >  {
> >  }
> >  
> >  static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> > +						       struct vm_area_struct *vma,
> >  						       unsigned long start,
> >  						       unsigned long end,
> >  						       enum mmu_event event)
> > @@ -393,6 +405,7 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> >  }
> >  
> >  static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> > +						     struct vm_area_struct *vma,
> >  						     unsigned long start,
> >  						     unsigned long end,
> >  						     enum mmu_event event)
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 296f81e..0f552bc 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -177,7 +177,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
> >  	/* For try_to_free_swap() and munlock_vma_page() below */
> >  	lock_page(page);
> >  
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
> >  					    mmun_end, MMU_MIGRATE);
> >  	err = -EAGAIN;
> >  	ptep = page_check_address(page, mm, addr, &ptl, 0);
> > @@ -212,7 +212,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
> >  	err = 0;
> >   unlock:
> >  	mem_cgroup_cancel_charge(kpage, memcg);
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
> >  					  mmun_end, MMU_MIGRATE);
> >  	unlock_page(page);
> >  	return err;
> > diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
> > index a2b3f09..f0113df 100644
> > --- a/mm/filemap_xip.c
> > +++ b/mm/filemap_xip.c
> > @@ -198,7 +198,8 @@ retry:
> >  			BUG_ON(pte_dirty(pteval));
> >  			pte_unmap_unlock(pte, ptl);
> >  			/* must invalidate_page _before_ freeing the page */
> > -			mmu_notifier_invalidate_page(mm, address, MMU_MIGRATE);
> > +			mmu_notifier_invalidate_page(mm, vma, address,
> > +						     MMU_MIGRATE);
> >  			page_cache_release(page);
> >  		}
> >  	}
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index fa30857..cc74b60 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1022,7 +1022,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
> >  
> >  	mmun_start = haddr;
> >  	mmun_end   = haddr + HPAGE_PMD_SIZE;
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
> >  					    mmun_end, MMU_MIGRATE);
> >  
> >  	for (i = 0; i < HPAGE_PMD_NR; i++) {
> > @@ -1064,7 +1064,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
> >  	page_remove_rmap(page);
> >  	spin_unlock(ptl);
> >  
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
> >  					  mmun_end, MMU_MIGRATE);
> >  
> >  	ret |= VM_FAULT_WRITE;
> > @@ -1075,7 +1075,7 @@ out:
> >  
> >  out_free_pages:
> >  	spin_unlock(ptl);
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
> >  					  mmun_end, MMU_MIGRATE);
> >  	for (i = 0; i < HPAGE_PMD_NR; i++) {
> >  		memcg = (void *)page_private(pages[i]);
> > @@ -1162,7 +1162,7 @@ alloc:
> >  
> >  	mmun_start = haddr;
> >  	mmun_end   = haddr + HPAGE_PMD_SIZE;
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
> >  					    mmun_end, MMU_MIGRATE);
> >  
> >  	if (!page)
> > @@ -1201,7 +1201,7 @@ alloc:
> >  	}
> >  	spin_unlock(ptl);
> >  out_mn:
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
> >  					  mmun_end, MMU_MIGRATE);
> >  out:
> >  	return ret;
> > @@ -1637,7 +1637,7 @@ static int __split_huge_page_splitting(struct page *page,
> >  	const unsigned long mmun_start = address;
> >  	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
> >  
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
> >  					    mmun_end, MMU_STATUS);
> >  	pmd = page_check_address_pmd(page, mm, address,
> >  			PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
> > @@ -1653,7 +1653,7 @@ static int __split_huge_page_splitting(struct page *page,
> >  		ret = 1;
> >  		spin_unlock(ptl);
> >  	}
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
> >  					  mmun_end, MMU_STATUS);
> >  
> >  	return ret;
> > @@ -2453,7 +2453,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> >  
> >  	mmun_start = address;
> >  	mmun_end   = address + HPAGE_PMD_SIZE;
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
> >  					    mmun_end, MMU_MIGRATE);
> >  	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> >  	/*
> > @@ -2464,7 +2464,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> >  	 */
> >  	_pmd = pmdp_clear_flush(vma, address, pmd);
> >  	spin_unlock(pmd_ptl);
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
> >  					  mmun_end, MMU_MIGRATE);
> >  
> >  	spin_lock(pte_ptl);
> > @@ -2854,19 +2854,19 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
> >  	mmun_start = haddr;
> >  	mmun_end   = haddr + HPAGE_PMD_SIZE;
> >  again:
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
> >  					    mmun_end, MMU_MIGRATE);
> >  	ptl = pmd_lock(mm, pmd);
> >  	if (unlikely(!pmd_trans_huge(*pmd))) {
> >  		spin_unlock(ptl);
> > -		mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +		mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
> >  						  mmun_end, MMU_MIGRATE);
> >  		return;
> >  	}
> >  	if (is_huge_zero_pmd(*pmd)) {
> >  		__split_huge_zero_page_pmd(vma, haddr, pmd);
> >  		spin_unlock(ptl);
> > -		mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +		mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
> >  						  mmun_end, MMU_MIGRATE);
> >  		return;
> >  	}
> > @@ -2874,7 +2874,7 @@ again:
> >  	VM_BUG_ON_PAGE(!page_count(page), page);
> >  	get_page(page);
> >  	spin_unlock(ptl);
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
> >  					  mmun_end, MMU_MIGRATE);
> >  
> >  	split_huge_page(page);
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 73e1576..15f0123 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -2565,7 +2565,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> >  	mmun_start = vma->vm_start;
> >  	mmun_end = vma->vm_end;
> >  	if (cow)
> > -		mmu_notifier_invalidate_range_start(src, mmun_start,
> > +		mmu_notifier_invalidate_range_start(src, vma, mmun_start,
> >  						    mmun_end, MMU_MIGRATE);
> >  
> >  	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
> > @@ -2616,7 +2616,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> >  	}
> >  
> >  	if (cow)
> > -		mmu_notifier_invalidate_range_end(src, mmun_start,
> > +		mmu_notifier_invalidate_range_end(src, vma, mmun_start,
> >  						  mmun_end, MMU_MIGRATE);
> >  
> >  	return ret;
> > @@ -2643,7 +2643,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >  	BUG_ON(end & ~huge_page_mask(h));
> >  
> >  	tlb_start_vma(tlb, vma);
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
> >  					    mmun_end, MMU_MIGRATE);
> >  again:
> >  	for (address = start; address < end; address += sz) {
> > @@ -2715,7 +2715,7 @@ unlock:
> >  		if (address < end && !ref_page)
> >  			goto again;
> >  	}
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
> >  					  mmun_end, MMU_MIGRATE);
> >  	tlb_end_vma(tlb, vma);
> >  }
> > @@ -2903,7 +2903,7 @@ retry_avoidcopy:
> >  
> >  	mmun_start = address & huge_page_mask(h);
> >  	mmun_end = mmun_start + huge_page_size(h);
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
> >  					    mmun_end, MMU_MIGRATE);
> >  	/*
> >  	 * Retake the page table lock to check for racing updates
> > @@ -2924,7 +2924,7 @@ retry_avoidcopy:
> >  		new_page = old_page;
> >  	}
> >  	spin_unlock(ptl);
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
> >  					  mmun_end, MMU_MIGRATE);
> >  	page_cache_release(new_page);
> >  	page_cache_release(old_page);
> > @@ -3363,7 +3363,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> >  	BUG_ON(address >= end);
> >  	flush_cache_range(vma, address, end);
> >  
> > -	mmu_notifier_invalidate_range_start(mm, start, end, event);
> > +	mmu_notifier_invalidate_range_start(mm, vma, start, end, event);
> >  	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
> >  	for (; address < end; address += huge_page_size(h)) {
> >  		spinlock_t *ptl;
> > @@ -3393,7 +3393,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> >  	 */
> >  	flush_tlb_range(vma, start, end);
> >  	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
> > -	mmu_notifier_invalidate_range_end(mm, start, end, event);
> > +	mmu_notifier_invalidate_range_end(mm, vma, start, end, event);
> >  
> >  	return pages << h->order;
> >  }
> > diff --git a/mm/ksm.c b/mm/ksm.c
> > index 4b659f1..1f3c4d7 100644
> > --- a/mm/ksm.c
> > +++ b/mm/ksm.c
> > @@ -873,7 +873,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
> >  
> >  	mmun_start = addr;
> >  	mmun_end   = addr + PAGE_SIZE;
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
> >  					    mmun_end, MMU_MPROT_RONLY);
> >  
> >  	ptep = page_check_address(page, mm, addr, &ptl, 0);
> > @@ -914,7 +914,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
> >  out_unlock:
> >  	pte_unmap_unlock(ptep, ptl);
> >  out_mn:
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
> >  					  mmun_end, MMU_MPROT_RONLY);
> >  out:
> >  	return err;
> > @@ -951,7 +951,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
> >  
> >  	mmun_start = addr;
> >  	mmun_end   = addr + PAGE_SIZE;
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
> >  					    mmun_end, MMU_MIGRATE);
> >  
> >  	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
> > @@ -977,7 +977,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
> >  	pte_unmap_unlock(ptep, ptl);
> >  	err = 0;
> >  out_mn:
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
> >  					  mmun_end, MMU_MIGRATE);
> >  out:
> >  	return err;
> > diff --git a/mm/memory.c b/mm/memory.c
> > index d3908f0..4717579 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1049,7 +1049,7 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >  	mmun_start = addr;
> >  	mmun_end   = end;
> >  	if (is_cow)
> > -		mmu_notifier_invalidate_range_start(src_mm, mmun_start,
> > +		mmu_notifier_invalidate_range_start(src_mm, vma, mmun_start,
> >  						    mmun_end, MMU_MIGRATE);
> >  
> >  	ret = 0;
> > @@ -1067,8 +1067,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >  	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
> >  
> >  	if (is_cow)
> > -		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end,
> > -						  MMU_MIGRATE);
> > +		mmu_notifier_invalidate_range_end(src_mm, vma, mmun_start,
> > +						  mmun_end, MMU_MIGRATE);
> >  	return ret;
> >  }
> >  
> > @@ -1372,12 +1372,17 @@ void unmap_vmas(struct mmu_gather *tlb,
> >  {
> >  	struct mm_struct *mm = vma->vm_mm;
> >  
> > -	mmu_notifier_invalidate_range_start(mm, start_addr,
> > -					    end_addr, MMU_MUNMAP);
> > -	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
> > +	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
> > +		mmu_notifier_invalidate_range_start(mm, vma,
> > +						    max(start_addr, vma->vm_start),
> > +						    min(end_addr, vma->vm_end),
> > +						    MMU_MUNMAP);
> >  		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
> > -	mmu_notifier_invalidate_range_end(mm, start_addr,
> > -					  end_addr, MMU_MUNMAP);
> > +		mmu_notifier_invalidate_range_end(mm, vma,
> > +						  max(start_addr, vma->vm_start),
> > +						  min(end_addr, vma->vm_end),
> > +						  MMU_MUNMAP);
> > +	}
> >  }
> >  
> >  /**
> > @@ -1399,10 +1404,17 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
> >  	lru_add_drain();
> >  	tlb_gather_mmu(&tlb, mm, start, end);
> >  	update_hiwater_rss(mm);
> > -	mmu_notifier_invalidate_range_start(mm, start, end, MMU_MUNMAP);
> > -	for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
> > +	for ( ; vma && vma->vm_start < end; vma = vma->vm_next) {
> > +		mmu_notifier_invalidate_range_start(mm, vma,
> > +						    max(start, vma->vm_start),
> > +						    min(end, vma->vm_end),
> > +						    MMU_MUNMAP);
> >  		unmap_single_vma(&tlb, vma, start, end, details);
> > -	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MUNMAP);
> > +		mmu_notifier_invalidate_range_end(mm, vma,
> > +						  max(start, vma->vm_start),
> > +						  min(end, vma->vm_end),
> > +						  MMU_MUNMAP);
> > +	}
> >  	tlb_finish_mmu(&tlb, start, end);
> >  }
> >  
> > @@ -1425,9 +1437,9 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
> >  	lru_add_drain();
> >  	tlb_gather_mmu(&tlb, mm, address, end);
> >  	update_hiwater_rss(mm);
> > -	mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
> > +	mmu_notifier_invalidate_range_start(mm, vma, address, end, MMU_MUNMAP);
> >  	unmap_single_vma(&tlb, vma, address, end, details);
> > -	mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
> > +	mmu_notifier_invalidate_range_end(mm, vma, address, end, MMU_MUNMAP);
> >  	tlb_finish_mmu(&tlb, address, end);
> >  }
> >  
> > @@ -2211,7 +2223,7 @@ gotten:
> >  
> >  	mmun_start  = address & PAGE_MASK;
> >  	mmun_end    = mmun_start + PAGE_SIZE;
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
> >  					    mmun_end, MMU_MIGRATE);
> >  
> >  	/*
> > @@ -2283,7 +2295,7 @@ gotten:
> >  unlock:
> >  	pte_unmap_unlock(page_table, ptl);
> >  	if (mmun_end > mmun_start)
> > -		mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +		mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
> >  						  mmun_end, MMU_MIGRATE);
> >  	if (old_page) {
> >  		/*
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index b526c72..0c61aa9 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -1820,13 +1820,13 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
> >  	WARN_ON(PageLRU(new_page));
> >  
> >  	/* Recheck the target PMD */
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
> >  					    mmun_end, MMU_MIGRATE);
> >  	ptl = pmd_lock(mm, pmd);
> >  	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
> >  fail_putback:
> >  		spin_unlock(ptl);
> > -		mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +		mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
> >  						  mmun_end, MMU_MIGRATE);
> >  
> >  		/* Reverse changes made by migrate_page_copy() */
> > @@ -1880,7 +1880,7 @@ fail_putback:
> >  	page_remove_rmap(page);
> >  
> >  	spin_unlock(ptl);
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
> >  					  mmun_end, MMU_MIGRATE);
> >  
> >  	/* Take an "isolate" reference and put new page on the LRU. */
> > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> > index 9decb88..87e6bc5 100644
> > --- a/mm/mmu_notifier.c
> > +++ b/mm/mmu_notifier.c
> > @@ -139,6 +139,7 @@ void __mmu_notifier_change_pte(struct mm_struct *mm,
> >  }
> >  
> >  void __mmu_notifier_invalidate_page(struct mm_struct *mm,
> > +				    struct vm_area_struct *vma,
> >  				    unsigned long address,
> >  				    enum mmu_event event)
> >  {
> > @@ -148,12 +149,13 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
> >  	id = srcu_read_lock(&srcu);
> >  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
> >  		if (mn->ops->invalidate_page)
> > -			mn->ops->invalidate_page(mn, mm, address, event);
> > +			mn->ops->invalidate_page(mn, mm, vma, address, event);
> >  	}
> >  	srcu_read_unlock(&srcu, id);
> >  }
> >  
> >  void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> > +					   struct vm_area_struct *vma,
> >  					   unsigned long start,
> >  					   unsigned long end,
> >  					   enum mmu_event event)
> > @@ -165,7 +167,7 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> >  	id = srcu_read_lock(&srcu);
> >  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
> >  		if (mn->ops->invalidate_range_start)
> > -			mn->ops->invalidate_range_start(mn, mm, start,
> > +			mn->ops->invalidate_range_start(mn, mm, vma, start,
> >  							end, event);
> >  	}
> >  	srcu_read_unlock(&srcu, id);
> > @@ -173,6 +175,7 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> >  EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
> >  
> >  void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> > +					 struct vm_area_struct *vma,
> >  					 unsigned long start,
> >  					 unsigned long end,
> >  					 enum mmu_event event)
> > @@ -183,7 +186,7 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> >  	id = srcu_read_lock(&srcu);
> >  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
> >  		if (mn->ops->invalidate_range_end)
> > -			mn->ops->invalidate_range_end(mn, mm, start,
> > +			mn->ops->invalidate_range_end(mn, mm, vma, start,
> >  						      end, event);
> >  	}
> >  	srcu_read_unlock(&srcu, id);
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index 6ce6c23..16ce504 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -158,7 +158,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> >  		/* invoke the mmu notifier if the pmd is populated */
> >  		if (!mni_start) {
> >  			mni_start = addr;
> > -			mmu_notifier_invalidate_range_start(mm, mni_start,
> > +			mmu_notifier_invalidate_range_start(mm, vma, mni_start,
> >  							    end, event);
> >  		}
> >  
> > @@ -187,7 +187,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> >  	} while (pmd++, addr = next, addr != end);
> >  
> >  	if (mni_start)
> > -		mmu_notifier_invalidate_range_end(mm, mni_start, end, event);
> > +		mmu_notifier_invalidate_range_end(mm, vma, mni_start,
> > +						  end, event);
> >  
> >  	if (nr_huge_updates)
> >  		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
> > diff --git a/mm/mremap.c b/mm/mremap.c
> > index 6827d2f..9bee6de 100644
> > --- a/mm/mremap.c
> > +++ b/mm/mremap.c
> > @@ -177,7 +177,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> >  
> >  	mmun_start = old_addr;
> >  	mmun_end   = old_end;
> > -	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start,
> > +	mmu_notifier_invalidate_range_start(vma->vm_mm, vma, mmun_start,
> >  					    mmun_end, MMU_MIGRATE);
> >  
> >  	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
> > @@ -229,7 +229,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> >  	if (likely(need_flush))
> >  		flush_tlb_range(vma, old_end-len, old_addr);
> >  
> > -	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start,
> > +	mmu_notifier_invalidate_range_end(vma->vm_mm, vma, mmun_start,
> >  					  mmun_end, MMU_MIGRATE);
> >  
> >  	return len + old_addr - old_end;	/* how much done */
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index bd7e6d7..f1be50d 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -840,7 +840,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
> >  	pte_unmap_unlock(pte, ptl);
> >  
> >  	if (ret) {
> > -		mmu_notifier_invalidate_page(mm, address, MMU_WB);
> > +		mmu_notifier_invalidate_page(mm, vma, address, MMU_WB);
> >  		(*cleaned)++;
> >  	}
> >  out:
> > @@ -1237,7 +1237,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >  out_unmap:
> >  	pte_unmap_unlock(pte, ptl);
> >  	if (ret != SWAP_FAIL && !(flags & TTU_MUNLOCK))
> > -		mmu_notifier_invalidate_page(mm, address, event);
> > +		mmu_notifier_invalidate_page(mm, vma, address, event);
> >  out:
> >  	return ret;
> >  
> > @@ -1325,7 +1325,8 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
> >  
> >  	mmun_start = address;
> >  	mmun_end   = end;
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, event);
> > +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
> > +					    mmun_end, event);
> >  
> >  	/*
> >  	 * If we can acquire the mmap_sem for read, and vma is VM_LOCKED,
> > @@ -1390,7 +1391,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
> >  		(*mapcount)--;
> >  	}
> >  	pte_unmap_unlock(pte - 1, ptl);
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, event);
> > +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, event);
> >  	if (locked_vma)
> >  		up_read(&vma->vm_mm->mmap_sem);
> >  	return ret;
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 6e1992f..c4b7bf9 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -262,6 +262,7 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
> >  
> >  static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
> >  					     struct mm_struct *mm,
> > +					     struct vm_area_struct *vma,
> >  					     unsigned long address,
> >  					     enum mmu_event event)
> >  {
> > @@ -318,6 +319,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> >  
> >  static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> >  						    struct mm_struct *mm,
> > +						    struct vm_area_struct *vma,
> >  						    unsigned long start,
> >  						    unsigned long end,
> >  						    enum mmu_event event)
> > @@ -345,6 +347,7 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> >  
> >  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> >  						  struct mm_struct *mm,
> > +						  struct vm_area_struct *vma,
> >  						  unsigned long start,
> >  						  unsigned long end,
> >  						  enum mmu_event event)
> > -- 
> > 1.9.0
> > 
> 
> Other than the refinements suggested above, I can't seem to find anything 
> wrong with this patch, so:
> 
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> 
> thanks,
> John H.


^ permalink raw reply	[flat|nested] 76+ messages in thread

> > +						    MMU_MUNMAP);
> >  		unmap_single_vma(&tlb, vma, start, end, details);
> > -	mmu_notifier_invalidate_range_end(mm, start, end, MMU_MUNMAP);
> > +		mmu_notifier_invalidate_range_end(mm, vma,
> > +						  max(start, vma->vm_start),
> > +						  min(end, vma->vm_end),
> > +						  MMU_MUNMAP);
> > +	}
> >  	tlb_finish_mmu(&tlb, start, end);
> >  }
> >  
> > @@ -1425,9 +1437,9 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
> >  	lru_add_drain();
> >  	tlb_gather_mmu(&tlb, mm, address, end);
> >  	update_hiwater_rss(mm);
> > -	mmu_notifier_invalidate_range_start(mm, address, end, MMU_MUNMAP);
> > +	mmu_notifier_invalidate_range_start(mm, vma, address, end, MMU_MUNMAP);
> >  	unmap_single_vma(&tlb, vma, address, end, details);
> > -	mmu_notifier_invalidate_range_end(mm, address, end, MMU_MUNMAP);
> > +	mmu_notifier_invalidate_range_end(mm, vma, address, end, MMU_MUNMAP);
> >  	tlb_finish_mmu(&tlb, address, end);
> >  }
> >  
> > @@ -2211,7 +2223,7 @@ gotten:
> >  
> >  	mmun_start  = address & PAGE_MASK;
> >  	mmun_end    = mmun_start + PAGE_SIZE;
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
> >  					    mmun_end, MMU_MIGRATE);
> >  
> >  	/*
> > @@ -2283,7 +2295,7 @@ gotten:
> >  unlock:
> >  	pte_unmap_unlock(page_table, ptl);
> >  	if (mmun_end > mmun_start)
> > -		mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +		mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
> >  						  mmun_end, MMU_MIGRATE);
> >  	if (old_page) {
> >  		/*
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index b526c72..0c61aa9 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -1820,13 +1820,13 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
> >  	WARN_ON(PageLRU(new_page));
> >  
> >  	/* Recheck the target PMD */
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start,
> > +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
> >  					    mmun_end, MMU_MIGRATE);
> >  	ptl = pmd_lock(mm, pmd);
> >  	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
> >  fail_putback:
> >  		spin_unlock(ptl);
> > -		mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +		mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
> >  						  mmun_end, MMU_MIGRATE);
> >  
> >  		/* Reverse changes made by migrate_page_copy() */
> > @@ -1880,7 +1880,7 @@ fail_putback:
> >  	page_remove_rmap(page);
> >  
> >  	spin_unlock(ptl);
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start,
> > +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start,
> >  					  mmun_end, MMU_MIGRATE);
> >  
> >  	/* Take an "isolate" reference and put new page on the LRU. */
> > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> > index 9decb88..87e6bc5 100644
> > --- a/mm/mmu_notifier.c
> > +++ b/mm/mmu_notifier.c
> > @@ -139,6 +139,7 @@ void __mmu_notifier_change_pte(struct mm_struct *mm,
> >  }
> >  
> >  void __mmu_notifier_invalidate_page(struct mm_struct *mm,
> > +				    struct vm_area_struct *vma,
> >  				    unsigned long address,
> >  				    enum mmu_event event)
> >  {
> > @@ -148,12 +149,13 @@ void __mmu_notifier_invalidate_page(struct mm_struct *mm,
> >  	id = srcu_read_lock(&srcu);
> >  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
> >  		if (mn->ops->invalidate_page)
> > -			mn->ops->invalidate_page(mn, mm, address, event);
> > +			mn->ops->invalidate_page(mn, mm, vma, address, event);
> >  	}
> >  	srcu_read_unlock(&srcu, id);
> >  }
> >  
> >  void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> > +					   struct vm_area_struct *vma,
> >  					   unsigned long start,
> >  					   unsigned long end,
> >  					   enum mmu_event event)
> > @@ -165,7 +167,7 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> >  	id = srcu_read_lock(&srcu);
> >  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
> >  		if (mn->ops->invalidate_range_start)
> > -			mn->ops->invalidate_range_start(mn, mm, start,
> > +			mn->ops->invalidate_range_start(mn, vma, mm, start,
> >  							end, event);
> >  	}
> >  	srcu_read_unlock(&srcu, id);
> > @@ -173,6 +175,7 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> >  EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
> >  
> >  void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> > +					 struct vm_area_struct *vma,
> >  					 unsigned long start,
> >  					 unsigned long end,
> >  					 enum mmu_event event)
> > @@ -183,7 +186,7 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> >  	id = srcu_read_lock(&srcu);
> >  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
> >  		if (mn->ops->invalidate_range_end)
> > -			mn->ops->invalidate_range_end(mn, mm, start,
> > +			mn->ops->invalidate_range_end(mn, vma, mm, start,
> >  						      end, event);
> >  	}
> >  	srcu_read_unlock(&srcu, id);
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index 6ce6c23..16ce504 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -158,7 +158,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> >  		/* invoke the mmu notifier if the pmd is populated */
> >  		if (!mni_start) {
> >  			mni_start = addr;
> > -			mmu_notifier_invalidate_range_start(mm, mni_start,
> > +			mmu_notifier_invalidate_range_start(mm, vma, mni_start,
> >  							    end, event);
> >  		}
> >  
> > @@ -187,7 +187,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> >  	} while (pmd++, addr = next, addr != end);
> >  
> >  	if (mni_start)
> > -		mmu_notifier_invalidate_range_end(mm, mni_start, end, event);
> > +		mmu_notifier_invalidate_range_end(mm, vma, mni_start,
> > +						  end, event);
> >  
> >  	if (nr_huge_updates)
> >  		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
> > diff --git a/mm/mremap.c b/mm/mremap.c
> > index 6827d2f..9bee6de 100644
> > --- a/mm/mremap.c
> > +++ b/mm/mremap.c
> > @@ -177,7 +177,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> >  
> >  	mmun_start = old_addr;
> >  	mmun_end   = old_end;
> > -	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start,
> > +	mmu_notifier_invalidate_range_start(vma->vm_mm, vma, mmun_start,
> >  					    mmun_end, MMU_MIGRATE);
> >  
> >  	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
> > @@ -229,7 +229,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> >  	if (likely(need_flush))
> >  		flush_tlb_range(vma, old_end-len, old_addr);
> >  
> > -	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start,
> > +	mmu_notifier_invalidate_range_end(vma->vm_mm, vma, mmun_start,
> >  					  mmun_end, MMU_MIGRATE);
> >  
> >  	return len + old_addr - old_end;	/* how much done */
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index bd7e6d7..f1be50d 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -840,7 +840,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
> >  	pte_unmap_unlock(pte, ptl);
> >  
> >  	if (ret) {
> > -		mmu_notifier_invalidate_page(mm, address, MMU_WB);
> > +		mmu_notifier_invalidate_page(mm, vma, address, MMU_WB);
> >  		(*cleaned)++;
> >  	}
> >  out:
> > @@ -1237,7 +1237,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >  out_unmap:
> >  	pte_unmap_unlock(pte, ptl);
> >  	if (ret != SWAP_FAIL && !(flags & TTU_MUNLOCK))
> > -		mmu_notifier_invalidate_page(mm, address, event);
> > +		mmu_notifier_invalidate_page(mm, vma, address, event);
> >  out:
> >  	return ret;
> >  
> > @@ -1325,7 +1325,8 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
> >  
> >  	mmun_start = address;
> >  	mmun_end   = end;
> > -	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, event);
> > +	mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
> > +					    mmun_end, event);
> >  
> >  	/*
> >  	 * If we can acquire the mmap_sem for read, and vma is VM_LOCKED,
> > @@ -1390,7 +1391,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
> >  		(*mapcount)--;
> >  	}
> >  	pte_unmap_unlock(pte - 1, ptl);
> > -	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, event);
> > +	mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, event);
> >  	if (locked_vma)
> >  		up_read(&vma->vm_mm->mmap_sem);
> >  	return ret;
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 6e1992f..c4b7bf9 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -262,6 +262,7 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
> >  
> >  static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
> >  					     struct mm_struct *mm,
> > +					     struct vm_area_struct *vma,
> >  					     unsigned long address,
> >  					     enum mmu_event event)
> >  {
> > @@ -318,6 +319,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
> >  
> >  static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> >  						    struct mm_struct *mm,
> > +						    struct vm_area_struct *vma,
> >  						    unsigned long start,
> >  						    unsigned long end,
> >  						    enum mmu_event event)
> > @@ -345,6 +347,7 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> >  
> >  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> >  						  struct mm_struct *mm,
> > +						  struct vm_area_struct *vma,
> >  						  unsigned long start,
> >  						  unsigned long end,
> >  						  enum mmu_event event)
> > -- 
> > 1.9.0
> > 
> 
> Other than the refinements suggested above, I can't seem to find anything 
> wrong with this patch, so:
> 
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> 
> thanks,
> John H.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
  2014-06-30 15:40       ` Joerg Roedel
@ 2014-06-30 16:06           ` Jerome Glisse
  0 siblings, 0 replies; 76+ messages in thread
From: Jerome Glisse @ 2014-06-30 16:06 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Gabbay, Oded, Deucher, Alexander, Lewycky, Andrew, Cornwall, Jay,
	akpm, linux-mm, linux-kernel, mgorman, hpa, peterz, aarcange,
	riel, jweiner, torvalds, Mark Hairgrove, Jatin Kumar,
	Subhash Gutti, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, John Hubbard, Sherry Cheung, Duncan Poole,
	iommu

On Mon, Jun 30, 2014 at 05:40:42PM +0200, Joerg Roedel wrote:
> On Mon, Jun 30, 2014 at 02:41:24PM +0000, Gabbay, Oded wrote:
> > I did face some problems regarding the amd IOMMU v2 driver, which
> > changed its behavior (see commit "iommu/amd: Implement
> > mmu_notifier_release call-back") to use mmu_notifier_release and did
> > some "bad things" inside that
> > notifier (primarily, but not only, deleting the object which held the
> > mmu_notifier object itself, which you mustn't do because of the
> > locking). 
> > 
> > I'm thinking of changing that driver's behavior to use this new
> > mechanism instead of using mmu_notifier_release. Does that seem
> > acceptable ? Another solution will be to add a new mmu_notifier call,
> > but we already ruled that out ;)
> 
> The mmu_notifier_release() function is exactly what this new notifier
> aims to do. Unless there is a very compelling reason to duplicate this
> functionality I stronly NACK this approach.
> 
>

No, this patch does not duplicate it. Current users of mmu_notifier
rely on the file close code path to call mmu_notifier_unregister. New
code like AMD IOMMUv2 or HMM cannot rely on that, so it needs a way to
call mmu_notifier_unregister (which cannot be done from inside the
release callback).

If you look at the current code, the release callback is used to tear
down the secondary translations but not to free the associated
resources. Those resources are freed later on, after the release
callback has run (exactly when depends on whether the file is closed
before the process is killed).

So this patch aims to provide a callback to code outside of the
mmu_notifier realm, a place where it is safe to clean up the
mmu_notifier and its associated resources.
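
To give a rough idea of the intended use (sketch only and untested;
the mmput_register_notifier() name is hypothetical and just stands in
for whatever patch 1 ends up exposing on top of the usual
notifier_block API):

#include <linux/kernel.h>
#include <linux/notifier.h>
#include <linux/mmu_notifier.h>
#include <linux/slab.h>

struct my_dev_mirror {
	struct mmu_notifier	mn;	/* tears down secondary mappings */
	struct notifier_block	nb;	/* final cleanup at mmput() time */
	struct mm_struct	*mm;
};

static int my_dev_mm_exit(struct notifier_block *nb,
			  unsigned long action, void *data)
{
	struct my_dev_mirror *mirror =
		container_of(nb, struct my_dev_mirror, nb);

	/*
	 * Safe here: we are outside the SRCU read side section that
	 * __mmu_notifier_release() holds, so unregistering and then
	 * freeing the structure embedding the mmu_notifier is fine.
	 */
	mmu_notifier_unregister(&mirror->mn, mirror->mm);
	kfree(mirror);
	return NOTIFY_OK;
}

static int my_dev_mirror_create(struct my_dev_mirror *mirror,
				struct mm_struct *mm)
{
	mirror->mm = mm;
	mirror->nb.notifier_call = my_dev_mm_exit;
	/* hypothetical helper provided by patch 1 */
	return mmput_register_notifier(mm, &mirror->nb);
}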

Cheers,
Jerome Glisse


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
  2014-06-30 16:06           ` Jerome Glisse
@ 2014-06-30 18:16             ` Joerg Roedel
  -1 siblings, 0 replies; 76+ messages in thread
From: Joerg Roedel @ 2014-06-30 18:16 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Gabbay, Oded, Deucher, Alexander, Lewycky, Andrew, Cornwall, Jay,
	akpm, linux-mm, linux-kernel, mgorman, hpa, peterz, aarcange,
	riel, jweiner, torvalds, Mark Hairgrove, Jatin Kumar,
	Subhash Gutti, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, John Hubbard, Sherry Cheung, Duncan Poole,
	iommu

On Mon, Jun 30, 2014 at 12:06:05PM -0400, Jerome Glisse wrote:
> No this patch does not duplicate it. Current user of mmu_notifier
> rely on file close code path to call mmu_notifier_unregister. New
> code like AMD IOMMUv2 or HMM can not rely on that. Thus they need
> a way to call the mmu_notifer_unregister (which can not be done
> from inside the the release call back).

No, when the mm is destroyed, the .release function is called from
exit_mmap(), which calls mmu_notifier_release() right at the beginning.
In that case you don't need to call mmu_notifier_unregister yourself
(you can still call it, but it will be a no-op).

> If you look at current code the release callback is use to kill
> secondary translation but not to free associated resources. All
> the associated resources are free later on after the release
> callback (well it depends if the file is close before the process
> is kill).

In exit_mmap, the .release function is called while all mappings are
still present. That's the perfect point in time to unbind all those
resources from your device so that it can no longer use them when the
mappings get destroyed.
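
Roughly, that is all the release callback needs to be (hand-waving
sketch only, the my_dev_* names are made up):

static void my_dev_mn_release(struct mmu_notifier *mn,
			      struct mm_struct *mm)
{
	struct my_dev_mirror *mirror =
		container_of(mn, struct my_dev_mirror, mn);

	/* Stop the device from walking this address space... */
	my_dev_stop_dma(mirror);
	/* ...and drop its shadow page tables. Note: no freeing of
	 * 'mirror' here, that happens on the file-close path. */
	my_dev_clear_shadow_pagetables(mirror);
}

static const struct mmu_notifier_ops my_dev_mn_ops = {
	.release	= my_dev_mn_release,
};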

> So this patch aims to provide a callback to code outside of the
> mmu_notifier realms, a place where it is safe to cleanup the
> mmu_notifier and associated resources.

Still, this duplicates the mmu_notifier release callback, so still
NACK.


	Joerg


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
  2014-06-30 18:16             ` Joerg Roedel
@ 2014-06-30 18:35               ` Jerome Glisse
  -1 siblings, 0 replies; 76+ messages in thread
From: Jerome Glisse @ 2014-06-30 18:35 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Gabbay, Oded, Deucher, Alexander, Lewycky, Andrew, Cornwall, Jay,
	akpm, linux-mm, linux-kernel, mgorman, hpa, peterz, aarcange,
	riel, jweiner, torvalds, Mark Hairgrove, Jatin Kumar,
	Subhash Gutti, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, John Hubbard, Sherry Cheung, Duncan Poole,
	iommu

On Mon, Jun 30, 2014 at 08:16:23PM +0200, Joerg Roedel wrote:
> On Mon, Jun 30, 2014 at 12:06:05PM -0400, Jerome Glisse wrote:
> > No this patch does not duplicate it. Current user of mmu_notifier
> > rely on file close code path to call mmu_notifier_unregister. New
> > code like AMD IOMMUv2 or HMM can not rely on that. Thus they need
> > a way to call the mmu_notifer_unregister (which can not be done
> > from inside the the release call back).
> 
> No, when the mm is destroyed the .release function is called from
> exit_mmap() which calls mmu_notifier_release() right at the beginning.
> In this case you don't need to call mmu_notifer_unregister yourself (you
> can still call it, but it will be a nop).
> 

We do intend to tear down all secondary mappings inside the release
callback, but we still cannot clean up all the resources associated
with it.

> > If you look at current code the release callback is use to kill
> > secondary translation but not to free associated resources. All
> > the associated resources are free later on after the release
> > callback (well it depends if the file is close before the process
> > is kill).
> 
> In exit_mmap the .release function is called when all mappings are still
> present. Thats the perfect point in time to unbind all those resources
> from your device so that it can not use it anymore when the mappings get
> destroyed.
> 
> > So this patch aims to provide a callback to code outside of the
> > mmu_notifier realms, a place where it is safe to cleanup the
> > mmu_notifier and associated resources.
> 
> Still, this is a duplication of mmu_notifier release call-back, so still
> NACK.
> 

It does not. mmu_notifier_register() increases mm_count, and only
mmu_notifier_unregister() decreases it again; that is a different
counter from mm_users (the latter is the one that triggers the mmu
notifier release).

As said, you cannot call mmu_notifier_unregister from the release
callback, and thus you cannot fully clean things up there. The only
way to achieve that is to do it outside of the mmu_notifier callbacks.
As pointed out, current users do not have this issue because they rely
on the file close callback to perform the cleanup. New users will not
necessarily have such a thing to rely on, hence the idea of factorizing
the various mm_struct destruction callbacks into one notifier chain.

If you know of any other way to call mmu_notifier_unregister before
the end of the mmput function, then I am all ears. I am not adding
this callback just for the fun of it; I spent serious time trying to
find a way to do this without it. I might have missed a way, so if I
did, please show it to me.

Cheers,
Jerome Glisse


^ permalink raw reply	[flat|nested] 76+ messages in thread

* RE: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
  2014-06-30 18:35               ` Jerome Glisse
@ 2014-06-30 18:57                 ` Lewycky, Andrew
  -1 siblings, 0 replies; 76+ messages in thread
From: Lewycky, Andrew @ 2014-06-30 18:57 UTC (permalink / raw)
  To: Jerome Glisse, Joerg Roedel
  Cc: Gabbay, Oded, Deucher, Alexander, Cornwall, Jay, akpm, linux-mm,
	linux-kernel, mgorman, hpa, peterz, aarcange, riel, jweiner,
	torvalds, Mark Hairgrove, Jatin Kumar, Subhash Gutti,
	Lucien Dunning, Cameron Buschardt, Arvind Gopalakrishnan,
	John Hubbard, Sherry Cheung, Duncan Poole, iommu

> On Mon, Jun 30, 2014 at 08:16:23PM +0200, Joerg Roedel wrote:
> > On Mon, Jun 30, 2014 at 12:06:05PM -0400, Jerome Glisse wrote:
> > > No this patch does not duplicate it. Current user of mmu_notifier
> > > rely on file close code path to call mmu_notifier_unregister. New
> > > code like AMD IOMMUv2 or HMM can not rely on that. Thus they need a
> > > way to call the mmu_notifer_unregister (which can not be done from
> > > inside the the release call back).
> >
> > No, when the mm is destroyed the .release function is called from
> > exit_mmap() which calls mmu_notifier_release() right at the beginning.
> > In this case you don't need to call mmu_notifer_unregister yourself
> > (you can still call it, but it will be a nop).
> >
> 
> We do intend to tear down all secondary mapping inside the relase callback but
> still we can not cleanup all the resources associated with it.
> 
> > > If you look at current code the release callback is use to kill
> > > secondary translation but not to free associated resources. All the
> > > associated resources are free later on after the release callback
> > > (well it depends if the file is close before the process is kill).
> >
> > In exit_mmap the .release function is called when all mappings are
> > still present. Thats the perfect point in time to unbind all those
> > resources from your device so that it can not use it anymore when the
> > mappings get destroyed.
> >
> > > So this patch aims to provide a callback to code outside of the
> > > mmu_notifier realms, a place where it is safe to cleanup the
> > > mmu_notifier and associated resources.
> >
> > Still, this is a duplication of mmu_notifier release call-back, so
> > still NACK.
> >
> 
> It is not, mmu_notifier_register take increase mm_count and only
> mmu_notifier_unregister decrease the mm_count which is different from the
> mm_users count (the latter being the one that trigger an mmu notifier
> release).
> 
> As said from the release call back you can not call mmu_notifier_unregister
> and thus you can not fully cleanup things. Only way to achieve so is to do it
> ouside mmu_notifier callback. As pointed out current user do not have this
> issue because they rely on file close callback to perform the cleanup operation.
> New user will not necessarily have such things to rely on. Hence factorizing
> various mm_struct destruction callback with an callback chain.
> 
> If you know any other way to call mmu_notifier_unregister before the end of
> mmput function than i am all ear. I am not adding this call back just for the fun
> of it i spend serious time trying to find a way to do thing without it. I might have
> miss a way so if i did please show it to me.
> 

Joerg, please consider that the amd_iommu_v2 driver already breaks the
rules for what can be done from the release callback. In particular,
it frees the pasid_state structure containing the struct mmu_notifier
(mn_release -> unbind_pasid -> put_pasid_state_wait -> free_pasid_state).
Since that structure contains the next pointer for the mmu_notifier
list, __mmu_notifier_release will crash. Modifying the MMU notifier
list isn't allowed there because the notifier code is holding an SRCU
read lock.

In general the problem is that RCU read locks are very constraining,
and things that you'd like to do from release can't be done there. The
cleanup could be done from the mmput callback instead, or perhaps
mmu_notifier_release could invoke release via call_srcu.
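
To make the breakage concrete (heavily simplified, this is not the
actual driver code, just the shape of the problem):

static void mn_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
	struct pasid_state *pasid_state =
		container_of(mn, struct pasid_state, mn);

	/*
	 * Freeing pasid_state frees the embedded mmu_notifier, but
	 * __mmu_notifier_release() is still walking the hlist this
	 * notifier is linked into, under srcu_read_lock().
	 */
	free_pasid_state(pasid_state);
}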

As an aside, we found another small issue: amd_iommu_bind_pasid calls
get_task_mm. This bumps the mm_struct user count, and that reference
is never released, which incidentally prevents the buggy code path
described above from ever running in the first place.

Thanks.
Andrew


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 3/6] mmu_notifier: add event information to address invalidation v2
  2014-06-28  2:00   ` Jérôme Glisse
  (?)
  (?)
@ 2014-07-01  1:57   ` Linus Torvalds
  -1 siblings, 0 replies; 76+ messages in thread
From: Linus Torvalds @ 2014-07-01  1:57 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: Andrew Morton, linux-mm, Linux Kernel Mailing List, Mel Gorman,
	Peter Anvin, peterz, Andrea Arcangeli, Rik van Riel,
	Johannes Weiner, Mark Hairgrove, Jatin Kumar, Subhash Gutti,
	Lucien Dunning, Cameron Buschardt, Arvind Gopalakrishnan,
	John Hubbard, Sherry Cheung, Duncan Poole, Oded Gabbay,
	Alexander Deucher, Andrew Lewycky, Jérôme Glisse

On Fri, Jun 27, 2014 at 7:00 PM, Jérôme Glisse <j.glisse@gmail.com> wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
>
> The event information will be useful [...]

That needs to be cleaned up, though.

Why the heck are you making up new and stupid event types? Now you make
the generic VM code do stupid things like this:

+       if ((vma->vm_flags & VM_READ) && (vma->vm_flags & VM_WRITE))
+               event = MMU_MPROT_RANDW;
+       else if (vma->vm_flags & VM_WRITE)
+               event = MMU_MPROT_WONLY;
+       else if (vma->vm_flags & VM_READ)
+               event = MMU_MPROT_RONLY;

which makes no sense at all. The names are some horrible abortion too
("RANDW"? That sounds like "random write" to me, not "read-and-write",
which is commonly shortened RW or perhaps RDWR). Same goes for
RONLY/WONLY - what kind of crazy names are those?

But more importantly, afaik none of that is needed. Instead, tell us
why you need particular flags, and don't make up crazy names like
this. As far as I can tell, you're already passing in the new
protection information (thanks to passing in the vma), so all those
badly named states you've made up seem to be totally pointless. They
add no actual information, but they *do* add crazy code like the above
to generic code that doesn't even WANT any of this crap. The only
thing this should need is a single MMU_MPROT event, and just use that.
Then anybody who wants to know whether the protections are being
changed to read-only can just look at the vma->vm_flags themselves.
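
IOW, on the consumer side something like this (illustration only, with
MMU_MPROT being the single event suggested above and
my_dev_update_prot() obviously made up):

static void my_dev_handle_mprot(struct vm_area_struct *vma,
				unsigned long start, unsigned long end)
{
	bool readable = vma->vm_flags & VM_READ;
	bool writable = vma->vm_flags & VM_WRITE;

	/* Update the device page table protections from the vma itself. */
	my_dev_update_prot(start, end, readable, writable);
}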

So things like this need to be tightened up and made sane before any
chance of merging it.

So NAK NAK NAK in the meantime.

            Linus

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 4/6] mmu_notifier: pass through vma to invalidate_range and invalidate_page
  2014-06-28  2:00   ` Jérôme Glisse
  (?)
  (?)
@ 2014-07-01  2:04   ` Linus Torvalds
  -1 siblings, 0 replies; 76+ messages in thread
From: Linus Torvalds @ 2014-07-01  2:04 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: Andrew Morton, linux-mm, Linux Kernel Mailing List, Mel Gorman,
	Peter Anvin, Andrea Arcangeli, Rik van Riel, Johannes Weiner,
	Mark Hairgrove, Jatin Kumar, Subhash Gutti, Lucien Dunning,
	Cameron Buschardt, Arvind Gopalakrishnan, John Hubbard,
	Sherry Cheung, Duncan Poole, Oded Gabbay, Alexander Deucher,
	Andrew Lewycky, Jérôme Glisse, Peter Zijlstra

On Fri, Jun 27, 2014 at 7:00 PM, Jérôme Glisse <j.glisse@gmail.com> wrote:
>
> This needs small refactoring in memory.c to call invalidate_range on
> vma boundary the overhead should be low enough.

.. and looking at it, doesn't that mean that the whole invalidate call
should be moved inside unmap_single_vma() then, instead of being
duplicated in all the callers?
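
Something along these lines (totally untested, just to show the shape
of it - the real unmap_single_vma() keeps its current body where the
comment is):

static void unmap_single_vma(struct mmu_gather *tlb,
			     struct vm_area_struct *vma,
			     unsigned long start_addr,
			     unsigned long end_addr,
			     struct zap_details *details)
{
	unsigned long start = max(start_addr, vma->vm_start);
	unsigned long end = min(end_addr, vma->vm_end);

	if (start >= end)
		return;

	mmu_notifier_invalidate_range_start(vma->vm_mm, vma, start, end,
					    MMU_MUNMAP);
	/* ... existing per-vma unmap work ... */
	mmu_notifier_invalidate_range_end(vma->vm_mm, vma, start, end,
					  MMU_MUNMAP);
}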

I really get the feeling that somebody needs to go over this
patch-series with a fine comb to fix these kinds of ugly things.

                     Linus

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
       [not found]               ` <20140630183556.GB3280-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2014-07-01  9:15                 ` Joerg Roedel
  2014-07-01  9:29                     ` Gabbay, Oded
  0 siblings, 1 reply; 76+ messages in thread
From: Joerg Roedel @ 2014-07-01  9:15 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Sherry Cheung, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Gabbay, Oded,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, aarcange-H+wXaHxf7aLQT0dZR+AlfA,
	Jatin Kumar, Lucien Dunning, mgorman-l3A5Bk7waGM,
	jweiner-H+wXaHxf7aLQT0dZR+AlfA, Subhash Gutti,
	riel-H+wXaHxf7aLQT0dZR+AlfA, John Hubbard, Mark Hairgrove,
	Cameron Buschardt, peterz-hDdKplPs4pWWVfeAwA7xHQ, Duncan Poole,
	Cornwall, Jay, Lewycky, Andrew,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Arvind Gopalakrishnan

On Mon, Jun 30, 2014 at 02:35:57PM -0400, Jerome Glisse wrote:
> We do intend to tear down all secondary mapping inside the relase
> callback but still we can not cleanup all the resources associated
> with it.
>

And why can't you clean up the other resources in the file close path?
Tearing down the mappings is all you need to do in the release function
anyway.

> As said from the release call back you can not call
> mmu_notifier_unregister and thus you can not fully cleanup things.

You don't need to call mmu_notifier_unregister when the release function
is already running from exit_mmap because this is equivalent to calling
mmu_notifier_unregister.

> Only way to achieve so is to do it ouside mmu_notifier callback.

The resources that can't be handled there can be cleaned up in the
file-close path. No need for a new notifier in mm code.

In the end, all you need to do in the release function is to tear down
the secondary mapping and make sure the device can no longer access the
address space when the release function returns. Everything else, like
freeing any resources, can be done later when the file descriptors are
torn down.

> If you know any other way to call mmu_notifier_unregister before the
> end of mmput function than i am all ear. I am not adding this call
> back just for the fun of it i spend serious time trying to find a
> way to do thing without it. I might have miss a way so if i did please
> show it to me.

Why do you need to call mmu_notifier_unregister manually when it is done
implicitly in exit_mmap already? 


	Joerg

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
  2014-07-01  9:15                 ` Joerg Roedel
@ 2014-07-01  9:29                     ` Gabbay, Oded
  0 siblings, 0 replies; 76+ messages in thread
From: Gabbay, Oded @ 2014-07-01  9:29 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Deucher, Alexander, Lewycky, Andrew, Cornwall, Jay, Bridgman,
	John, Jerome Glisse, akpm, linux-mm, linux-kernel, mgorman, hpa,
	peterz, aarcange, riel, jweiner, torvalds, Mark Hairgrove,
	Jatin Kumar, Subhash Gutti, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, John Hubbard, Sherry Cheung, Duncan Poole,
	iommu

On Tue, 2014-07-01 at 11:15 +0200, Joerg Roedel wrote:
> On Mon, Jun 30, 2014 at 02:35:57PM -0400, Jerome Glisse wrote:
> > We do intend to tear down all secondary mapping inside the relase
> > callback but still we can not cleanup all the resources associated
> > with it.
> >
> 
> And why can't you cleanup the other resources in the file close path?
> Tearing down the mappings is all you need to do in the release function
> anyway.
> 
> > As said from the release call back you can not call
> > mmu_notifier_unregister and thus you can not fully cleanup things.
> 
> You don't need to call mmu_notifier_unregister when the release function
> is already running from exit_mmap because this is equivalent to calling
> mmu_notifier_unregister.
> 
> > Only way to achieve so is to do it ouside mmu_notifier callback.
> 
> The resources that can't be handled there can be cleaned up in the
> file-close path. No need for a new notifier in mm code.
> 
> In the end all you need to do in the release function is to tear down
> the secondary mapping and make sure the device can no longer access the
> address space when the release function returns. Everything else, like
> freeing any resources can be done later when the file descriptors are
> teared down.

I will answer from the KFD perspective, as I'm AMD's maintainer of this
driver.

A little background: AMD's HSA Linux kernel driver (called radeon_kfd,
or KFD for short) has been developed over the past year by AMD to
support running Linux compute applications on AMD's HSA-enabled APUs,
i.e. Kaveri (A10-7850K/7700K). The driver will be up for kernel
community review in about 2-3 weeks so that we can push it during the
3.17 merge window. Prior discussions about this driver were held with
the gpu/drm subsystem maintainers.

In the KFD, we need to maintain a notion of each compute process.
Therefore, we have an object called "kfd_process" that is created for
each process that uses the KFD. Naturally, we need to be able to track
the process's shutdown in order to perform cleanup of the resources it
uses (compute queues, virtual address space, gpu local memory
allocations, etc.).

To enable this tracking mechanism, we decided to associate the
kfd_process with the mm_struct to ensure that a kfd_process object has
exactly the same lifespan as the process it represents. We preferred to
use the mm_struct and not a file descriptor because using a file
descriptor to track “process” shutdown is wrong in two ways:

* Technical: file descriptors can be passed to unrelated processes using
AF_UNIX sockets. This means that a process can exit while the file stays
open. Even if we implement this “correctly” i.e. holding the address
space & page tables alive until the file is finally released, it’s
really dodgy.

* Philosophical: our ioctls are actually system calls in disguise. They
operate on the process, not on a device.

Moreover, because the GPU interacts with the process only through
virtual memory (and not e.g. file descriptors), and because virtual
address space is fundamental to an intuitive notion of what a process
is, the decision to associate the kfd_process with mm_struct seems like
a natural choice.

Then there was the issue of how the KFD is notified about an mm_struct's
destruction. Because the mmu_notifier release callback is called under
an SRCU read lock, it can't destroy the mmu_notifier object, which is
the kfd_process object itself. Therefore, I talked to Jerome and Andrew
Morton about a way to implement this, and after that discussion (which
was in private emails), Jerome was kind enough to write a patch, which
is the patch we are now discussing.
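
Schematically, the association looks like this (simplified, not the
exact structure from the tree linked below):

struct kfd_process {
	struct mmu_notifier	mmu_notifier;	/* embedded, so freeing the
						 * process object also frees
						 * the notifier */
	struct mm_struct	*mm;
	struct list_head	queues;
	/* gpu address space and local memory bookkeeping, ... */
};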

You are more than welcome to take a look at the entire driver at
http://cgit.freedesktop.org/~gabbayo/linux/?h=kfd-0.6.x although the
driver will undergo some changes before we send the pull request to
Dave Airlie.

I believe that converting the amd_iommu_v2 driver to use this patch as
well will benefit all parties. AFAIK, KFD is the _only_ client of the
amd_iommu_v2 driver, so it is imperative that we work together on
this.

	Oded
> > If you know any other way to call mmu_notifier_unregister before the
> > end of mmput function than i am all ear. I am not adding this call
> > back just for the fun of it i spend serious time trying to find a
> > way to do thing without it. I might have miss a way so if i did please
> > show it to me.
> 
> Why do you need to call mmu_notifier_unregister manually when it is done
> implicitly in exit_mmap already? 
> 
> 
> 	Joerg
> 
> 


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
  2014-06-30 18:57                 ` Lewycky, Andrew
@ 2014-07-01  9:41                   ` Joerg Roedel
  -1 siblings, 0 replies; 76+ messages in thread
From: Joerg Roedel @ 2014-07-01  9:41 UTC (permalink / raw)
  To: Lewycky, Andrew
  Cc: Jerome Glisse, Gabbay, Oded, Deucher, Alexander, Cornwall, Jay,
	akpm, linux-mm, linux-kernel, mgorman, hpa, peterz, aarcange,
	riel, jweiner, torvalds, Mark Hairgrove, Jatin Kumar,
	Subhash Gutti, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, John Hubbard, Sherry Cheung, Duncan Poole,
	iommu

Hi Andrew,

On Mon, Jun 30, 2014 at 06:57:48PM +0000, Lewycky, Andrew wrote:
> As an aside we found another small issue: amd_iommu_bind_pasid calls
> get_task_mm. This bumps the mm_struct use count and it will never be
> released. This would prevent the buggy code path described above from
> ever running in the first place.

You are right, the current code is a bit problematic, but fixing it does
not require a new notifier chain in the mm code.

In fact, using get_task_mm() is a good way to keep a reference to the mm
as a user (an external device is effectively another user) and defer the
destruction of the mappings to the file-close path (where you can call
mmput to destroy it). So this is another way to solve the problem
without any new notifier.
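
Roughly, that pattern would look like this (all names are made up for
illustration, and the notifier ops -- with a .release that only
quiesces -- are assumed to be defined elsewhere): the bind path takes
the mm reference, and the file release path drops it and unregisters,
so nothing has to be freed from .release itself:

struct kfd_bind_ctx {                           /* illustrative name */
        struct mmu_notifier mn;
        struct mm_struct *mm;
};

static int kfd_bind_process(struct file *filp)  /* e.g. from open/ioctl */
{
        struct kfd_bind_ctx *ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
        int ret;

        if (!ctx)
                return -ENOMEM;
        ctx->mm = get_task_mm(current);         /* takes an mm_users ref */
        if (!ctx->mm) {
                kfree(ctx);
                return -ESRCH;
        }
        ctx->mn.ops = &kfd_notifier_ops;        /* .release only quiesces */
        ret = mmu_notifier_register(&ctx->mn, ctx->mm);
        if (ret) {
                mmput(ctx->mm);
                kfree(ctx);
                return ret;
        }
        filp->private_data = ctx;
        return 0;
}

static int kfd_release(struct inode *inode, struct file *filp)
{
        struct kfd_bind_ctx *ctx = filp->private_data;

        mmu_notifier_unregister(&ctx->mn, ctx->mm);
        mmput(ctx->mm);                         /* balances get_task_mm() */
        kfree(ctx);
        return 0;
}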


	Joerg



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
@ 2014-07-01  9:41                   ` Joerg Roedel
  0 siblings, 0 replies; 76+ messages in thread
From: Joerg Roedel @ 2014-07-01  9:41 UTC (permalink / raw)
  To: Lewycky, Andrew
  Cc: Jerome Glisse, Gabbay, Oded, Deucher, Alexander, Cornwall, Jay,
	akpm, linux-mm, linux-kernel, mgorman, hpa, peterz, aarcange,
	riel, jweiner, torvalds, Mark Hairgrove, Jatin Kumar,
	Subhash Gutti, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, John

Hi Andrew,

On Mon, Jun 30, 2014 at 06:57:48PM +0000, Lewycky, Andrew wrote:
> As an aside we found another small issue: amd_iommu_bind_pasid calls
> get_task_mm. This bumps the mm_struct use count and it will never be
> released. This would prevent the buggy code path described above from
> ever running in the first place.

You are right, the current code is a bit problematic, but fixing it does
not require a new notifier chain in the mm code.

In fact, using get_task_mm() is a good way to keep a reference to the mm
as a user (an external device is effectively another user) and defer the
destruction of the mappings to the file-close path (where you can call
mmput to destroy it). So this is another way to solve the problem
without any new notifier.


	Joerg



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
       [not found]                     ` <019CCE693E457142B37B791721487FD91806DD8B-0nO7ALo/ziwxlywnonMhLEEOCMrvLtNR@public.gmane.org>
@ 2014-07-01 11:00                       ` Joerg Roedel
  2014-07-01 19:33                           ` Jerome Glisse
  0 siblings, 1 reply; 76+ messages in thread
From: Joerg Roedel @ 2014-07-01 11:00 UTC (permalink / raw)
  To: Gabbay, Oded
  Cc: Sherry Cheung, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Jerome Glisse,
	aarcange-H+wXaHxf7aLQT0dZR+AlfA, Jatin Kumar, Lucien Dunning,
	mgorman-l3A5Bk7waGM, jweiner-H+wXaHxf7aLQT0dZR+AlfA,
	Subhash Gutti, riel-H+wXaHxf7aLQT0dZR+AlfA, Bridgman, John,
	John Hubbard, Mark Hairgrove, Cameron Buschardt,
	peterz-hDdKplPs4pWWVfeAwA7xHQ, Duncan Poole, Cornwall, Jay,
	Lewycky, Andrew, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Tue, Jul 01, 2014 at 09:29:49AM +0000, Gabbay, Oded wrote:
> In the KFD, we need to maintain a notion of each compute process.
> Therefore, we have an object called "kfd_process" that is created for
> each process that uses the KFD. Naturally, we need to be able to track
> the process's shutdown in order to perform cleanup of the resources it
> uses (compute queues, virtual address space, gpu local memory
> allocations, etc.).

If it is only that, you can also use the task_exit notifier already in
the kernel.

> To enable this tracking mechanism, we decided to associate the
> kfd_process with mm_struct to ensure that a kfd_process object has
> exactly the same lifespan as the process it represents. We preferred to
> use the mm_struct and not a file description because using a file
> descriptor to track “process” shutdown is wrong in two ways:
> 
> * Technical: file descriptors can be passed to unrelated processes using
> AF_UNIX sockets. This means that a process can exit while the file stays
> open. Even if we implement this “correctly” i.e. holding the address
> space & page tables alive until the file is finally released, it’s
> really dodgy.

No, it's not in this case. The file descriptor is used to connect a
process address space with a device context. Thus without the mappings
the file descriptor is useless, and the mappings should stay intact
until the fd is closed.

It would be very bad semantics for userspace if an fd that is passed on
fails on the other side because the sending process died.


	Joerg



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
  2014-07-01 11:00                       ` Joerg Roedel
@ 2014-07-01 19:33                           ` Jerome Glisse
  0 siblings, 0 replies; 76+ messages in thread
From: Jerome Glisse @ 2014-07-01 19:33 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Gabbay, Oded, Deucher, Alexander, Lewycky, Andrew, Cornwall, Jay,
	Bridgman, John, akpm, linux-mm, linux-kernel, mgorman, hpa,
	peterz, aarcange, riel, jweiner, torvalds, Mark Hairgrove,
	Jatin Kumar, Subhash Gutti, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, John Hubbard, Sherry Cheung, Duncan Poole,
	iommu

On Tue, Jul 01, 2014 at 01:00:18PM +0200, Joerg Roedel wrote:
> On Tue, Jul 01, 2014 at 09:29:49AM +0000, Gabbay, Oded wrote:
> > In the KFD, we need to maintain a notion of each compute process.
> > Therefore, we have an object called "kfd_process" that is created for
> > each process that uses the KFD. Naturally, we need to be able to track
> > the process's shutdown in order to perform cleanup of the resources it
> > uses (compute queues, virtual address space, gpu local memory
> > allocations, etc.).
> 
> If it is only that, you can also use the task_exit notifier already in
> the kernel.

No, task_exit will happen per thread, not once per mm.
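
For illustration, assuming the task_exit notifier meant here is the
PROFILE_TASK_EXIT chain (the handler name is made up), a minimal sketch
of why it does not fit:

static int kfd_task_exit(struct notifier_block *nb,
                         unsigned long action, void *data)
{
        /*
         * "data" is the exiting task_struct.  This fires once per
         * exiting thread; the mm may still be shared with live threads,
         * so it is not a "the mm is gone" event.
         */
        return NOTIFY_OK;
}

static struct notifier_block kfd_task_exit_nb = {
        .notifier_call = kfd_task_exit,
};

/* registered via profile_event_register(PROFILE_TASK_EXIT, &kfd_task_exit_nb) */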

> 
> > To enable this tracking mechanism, we decided to associate the
> > kfd_process with mm_struct to ensure that a kfd_process object has
> > exactly the same lifespan as the process it represents. We preferred to
> > use the mm_struct and not a file description because using a file
> > descriptor to track “process” shutdown is wrong in two ways:
> > 
> > * Technical: file descriptors can be passed to unrelated processes using
> > AF_UNIX sockets. This means that a process can exit while the file stays
> > open. Even if we implement this “correctly” i.e. holding the address
> > space & page tables alive until the file is finally released, it’s
> > really dodgy.
> 
> No, its not in this case. The file descriptor is used to connect a
> process address space with a device context. Thus without the mappings
> the file-descriptor is useless and the mappings should stay in-tact
> until the fd is closed.
> 
> It would be a very bad semantic for userspace if a fd that is passed on
> fails on the other side because the sending process died.
> 

Consider the use case where there is no file associated with the
mmu_notifier, i.e. there is no device file descriptor that could hold and
take care of mmu_notifier destruction and cleanup. We need this call
chain for this case.

Any other idea than task_exit?

Cheers,
Jérôme


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
@ 2014-07-01 19:33                           ` Jerome Glisse
  0 siblings, 0 replies; 76+ messages in thread
From: Jerome Glisse @ 2014-07-01 19:33 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Gabbay, Oded, Deucher, Alexander, Lewycky, Andrew, Cornwall, Jay,
	Bridgman, John, akpm, linux-mm, linux-kernel, mgorman, hpa,
	peterz, aarcange, riel, jweiner, torvalds, Mark Hairgrove,
	Jatin Kumar, Subhash Gutti, Lucien Dunning, Cameron Buschardt,
	Ar

On Tue, Jul 01, 2014 at 01:00:18PM +0200, Joerg Roedel wrote:
> On Tue, Jul 01, 2014 at 09:29:49AM +0000, Gabbay, Oded wrote:
> > In the KFD, we need to maintain a notion of each compute process.
> > Therefore, we have an object called "kfd_process" that is created for
> > each process that uses the KFD. Naturally, we need to be able to track
> > the process's shutdown in order to perform cleanup of the resources it
> > uses (compute queues, virtual address space, gpu local memory
> > allocations, etc.).
> 
> If it is only that, you can also use the task_exit notifier already in
> the kernel.

No, task_exit will happen per thread, not once per mm.

> 
> > To enable this tracking mechanism, we decided to associate the
> > kfd_process with mm_struct to ensure that a kfd_process object has
> > exactly the same lifespan as the process it represents. We preferred to
> > use the mm_struct and not a file description because using a file
> > descriptor to track “process” shutdown is wrong in two ways:
> > 
> > * Technical: file descriptors can be passed to unrelated processes using
> > AF_UNIX sockets. This means that a process can exit while the file stays
> > open. Even if we implement this “correctly” i.e. holding the address
> > space & page tables alive until the file is finally released, it’s
> > really dodgy.
> 
> No, its not in this case. The file descriptor is used to connect a
> process address space with a device context. Thus without the mappings
> the file-descriptor is useless and the mappings should stay in-tact
> until the fd is closed.
> 
> It would be a very bad semantic for userspace if a fd that is passed on
> fails on the other side because the sending process died.
> 

Consider the use case where there is no file associated with the
mmu_notifier, i.e. there is no device file descriptor that could hold and
take care of mmu_notifier destruction and cleanup. We need this call
chain for this case.

Any other idea than task_exit?

Cheers,
Jérôme


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
       [not found]                           ` <20140701193343.GB3322-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2014-07-01 21:06                             ` Joerg Roedel
  2014-07-01 21:32                                 ` Jerome Glisse
  0 siblings, 1 reply; 76+ messages in thread
From: Joerg Roedel @ 2014-07-01 21:06 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Sherry Cheung, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Gabbay, Oded,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, aarcange-H+wXaHxf7aLQT0dZR+AlfA,
	Jatin Kumar, Lucien Dunning, mgorman-l3A5Bk7waGM,
	jweiner-H+wXaHxf7aLQT0dZR+AlfA, Subhash Gutti,
	riel-H+wXaHxf7aLQT0dZR+AlfA, Bridgman, John, John Hubbard,
	Mark Hairgrove, Cameron Buschardt, peterz-hDdKplPs4pWWVfeAwA7xHQ,
	Duncan Poole, Cornwall, Jay, Lewycky, Andrew,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Tue, Jul 01, 2014 at 03:33:44PM -0400, Jerome Glisse wrote:
> On Tue, Jul 01, 2014 at 01:00:18PM +0200, Joerg Roedel wrote:
> > No, its not in this case. The file descriptor is used to connect a
> > process address space with a device context. Thus without the mappings
> > the file-descriptor is useless and the mappings should stay in-tact
> > until the fd is closed.
> > 
> > It would be a very bad semantic for userspace if a fd that is passed on
> > fails on the other side because the sending process died.
> 
> Consider use case where there is no file associated with the mmu_notifier
> ie there is no device file descriptor that could hold and take care of
> mmu_notifier destruction and cleanup. We need this call chain for this
> case.

Example of such a use-case where no fd will be associated?

Anyway, even without an fd, there will always be something that sets the
mm<->device binding up (calling mmu_notifier_register) and tears it down
in the end (calling mmu_notifier_unregister). And those will be the
places where any resources left over from the .release call-back can be
cleaned up.


	Joerg

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
  2014-07-01 21:06                             ` Joerg Roedel
@ 2014-07-01 21:32                                 ` Jerome Glisse
  0 siblings, 0 replies; 76+ messages in thread
From: Jerome Glisse @ 2014-07-01 21:32 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Gabbay, Oded, Deucher, Alexander, Lewycky, Andrew, Cornwall, Jay,
	Bridgman, John, akpm, linux-mm, linux-kernel, mgorman, hpa,
	peterz, aarcange, riel, jweiner, torvalds, Mark Hairgrove,
	Jatin Kumar, Subhash Gutti, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, John Hubbard, Sherry Cheung, Duncan Poole,
	iommu

On Tue, Jul 01, 2014 at 11:06:20PM +0200, Joerg Roedel wrote:
> On Tue, Jul 01, 2014 at 03:33:44PM -0400, Jerome Glisse wrote:
> > On Tue, Jul 01, 2014 at 01:00:18PM +0200, Joerg Roedel wrote:
> > > No, its not in this case. The file descriptor is used to connect a
> > > process address space with a device context. Thus without the mappings
> > > the file-descriptor is useless and the mappings should stay in-tact
> > > until the fd is closed.
> > > 
> > > It would be a very bad semantic for userspace if a fd that is passed on
> > > fails on the other side because the sending process died.
> > 
> > Consider use case where there is no file associated with the mmu_notifier
> > ie there is no device file descriptor that could hold and take care of
> > mmu_notifier destruction and cleanup. We need this call chain for this
> > case.
> 
> Example of such a use-case where no fd will be associated?
> 
> Anyway, even without an fd, there will always be something that sets the
> mm->device binding up (calling mmu_notifier_register) and tears it down
> in the end (calling mmu_notifier_unregister). And this will be the
> places where any resources left from the .release call-back can be
> cleaned up.
> 

That's the whole point: we cannot do what we want without the callback,
i.e. the place where we do the cleanup is the mm callback we need. If you
do not like the call chain, then we will just add ourselves as another
caller in the exact same spot where the notifier chain is, which Andrew
disliked because there are already enough submodules that are interested
in being informed of mm destruction.

Cheers,
Jerome


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
@ 2014-07-01 21:32                                 ` Jerome Glisse
  0 siblings, 0 replies; 76+ messages in thread
From: Jerome Glisse @ 2014-07-01 21:32 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Gabbay, Oded, Deucher, Alexander, Lewycky, Andrew, Cornwall, Jay,
	Bridgman, John, akpm, linux-mm, linux-kernel, mgorman, hpa,
	peterz, aarcange, riel, jweiner, torvalds, Mark Hairgrove,
	Jatin Kumar, Subhash Gutti, Lucien Dunning, Cameron Buschardt,
	Ar

On Tue, Jul 01, 2014 at 11:06:20PM +0200, Joerg Roedel wrote:
> On Tue, Jul 01, 2014 at 03:33:44PM -0400, Jerome Glisse wrote:
> > On Tue, Jul 01, 2014 at 01:00:18PM +0200, Joerg Roedel wrote:
> > > No, its not in this case. The file descriptor is used to connect a
> > > process address space with a device context. Thus without the mappings
> > > the file-descriptor is useless and the mappings should stay in-tact
> > > until the fd is closed.
> > > 
> > > It would be a very bad semantic for userspace if a fd that is passed on
> > > fails on the other side because the sending process died.
> > 
> > Consider use case where there is no file associated with the mmu_notifier
> > ie there is no device file descriptor that could hold and take care of
> > mmu_notifier destruction and cleanup. We need this call chain for this
> > case.
> 
> Example of such a use-case where no fd will be associated?
> 
> Anyway, even without an fd, there will always be something that sets the
> mm->device binding up (calling mmu_notifier_register) and tears it down
> in the end (calling mmu_notifier_unregister). And this will be the
> places where any resources left from the .release call-back can be
> cleaned up.
> 

That's the whole point: we cannot do what we want without the callback,
i.e. the place where we do the cleanup is the mm callback we need. If you
do not like the call chain, then we will just add ourselves as another
caller in the exact same spot where the notifier chain is, which Andrew
disliked because there are already enough submodules that are interested
in being informed of mm destruction.

Cheers,
Jérôme


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
  2014-07-01 21:32                                 ` Jerome Glisse
@ 2014-07-03 18:30                                   ` Jerome Glisse
  -1 siblings, 0 replies; 76+ messages in thread
From: Jerome Glisse @ 2014-07-03 18:30 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Gabbay, Oded, Deucher, Alexander, Lewycky, Andrew, Cornwall, Jay,
	Bridgman, John, akpm, linux-mm, linux-kernel, mgorman, hpa,
	peterz, aarcange, riel, jweiner, torvalds, Mark Hairgrove,
	Jatin Kumar, Subhash Gutti, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, John Hubbard, Sherry Cheung, Duncan Poole,
	iommu

On Tue, Jul 01, 2014 at 05:32:09PM -0400, Jerome Glisse wrote:
> On Tue, Jul 01, 2014 at 11:06:20PM +0200, Joerg Roedel wrote:
> > On Tue, Jul 01, 2014 at 03:33:44PM -0400, Jerome Glisse wrote:
> > > On Tue, Jul 01, 2014 at 01:00:18PM +0200, Joerg Roedel wrote:
> > > > No, its not in this case. The file descriptor is used to connect a
> > > > process address space with a device context. Thus without the mappings
> > > > the file-descriptor is useless and the mappings should stay in-tact
> > > > until the fd is closed.
> > > > 
> > > > It would be a very bad semantic for userspace if a fd that is passed on
> > > > fails on the other side because the sending process died.
> > > 
> > > Consider use case where there is no file associated with the mmu_notifier
> > > ie there is no device file descriptor that could hold and take care of
> > > mmu_notifier destruction and cleanup. We need this call chain for this
> > > case.
> > 
> > Example of such a use-case where no fd will be associated?
> > 
> > Anyway, even without an fd, there will always be something that sets the
> > mm->device binding up (calling mmu_notifier_register) and tears it down
> > in the end (calling mmu_notifier_unregister). And this will be the
> > places where any resources left from the .release call-back can be
> > cleaned up.
> > 
> 
> That's the whole point we can not do what we want without the callback ie
> the place where we do the cleanup is the mm callback we need. If you do not
> like the call chain than we will just add ourself as another caller in the
> exact same spot where the notifier chain is which Andrew disliked because
> there are already enough submodule that are interested in being inform of
> mm destruction.
> 
> Cheers,
> Jerome

Joerg, do you still object to this patch? We need to bind the lifetime
of our object to the mm_struct. While the release callback of
mmu_notifier allows us to stop any further processing in a timely manner
on mm destruction, we cannot, however, free some of the associated
resources, namely the structure containing the mmu_notifier struct. We
could schedule a delayed job to do it sometime after we get the release
callback, but that would be hackish.
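
For the record, the hackish delayed job would look roughly like this
(a sketch only; the struct and names are made up). Nothing orders the
deferred free against the rest of the mm teardown, which is exactly the
problem:

struct mirror_ctx {                             /* illustrative only */
        struct mmu_notifier mn;
        struct work_struct release_work;
};

static void mirror_free_work(struct work_struct *work)
{
        struct mirror_ctx *m = container_of(work, struct mirror_ctx,
                                            release_work);

        /*
         * Runs at some arbitrary later time, possibly concurrently with
         * the rest of the mm destruction path -- nothing synchronizes
         * us with it, which is why this is hackish.
         */
        kfree(m);
}

static void mirror_release_cb(struct mmu_notifier *mn, struct mm_struct *mm)
{
        struct mirror_ctx *m = container_of(mn, struct mirror_ctx, mn);

        INIT_WORK(&m->release_work, mirror_free_work);
        schedule_work(&m->release_work);        /* defer the kfree() */
}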

Again, the natural place to call this is from mmput, and the fact that
many other subsystems already call in from there to clean up their own
per-mm data structures is a testimony that this is a valid use case and
a valid design.

This patch really just tries to allow new users to easily interface
themselves at the proper place in the mm lifetime. It is just like the
task exit notifier chain, but it deals with the mm_struct.
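
To make the comparison concrete, mmput() already hard-codes several
per-mm exit hooks; roughly (simplified and from memory, and the notifier
call name below is only a stand-in for whatever the patch actually uses):

void mmput(struct mm_struct *mm)
{
        if (atomic_dec_and_test(&mm->mm_users)) {
                /* existing, hard-coded per-mm exit hooks */
                exit_aio(mm);
                ksm_exit(mm);
                khugepaged_exit(mm);

                /* what the patch adds -- name assumed for illustration */
                mmput_notifier_call_chain(mm);

                exit_mmap(mm);
                /* ... */
                mmdrop(mm);
        }
}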

You pointed out that the cleanup should be done from the device driver
file close call. But as I stressed, some of the new users will not
necessarily have a device file open, hence no way for them to free the
associated structure except with a hackish delayed job.

So I do not see any reason to block this patch.

Cheers,
Jerome


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
@ 2014-07-03 18:30                                   ` Jerome Glisse
  0 siblings, 0 replies; 76+ messages in thread
From: Jerome Glisse @ 2014-07-03 18:30 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Gabbay, Oded, Deucher, Alexander, Lewycky, Andrew, Cornwall, Jay,
	Bridgman, John, akpm, linux-mm, linux-kernel, mgorman, hpa,
	peterz, aarcange, riel, jweiner, torvalds, Mark Hairgrove,
	Jatin Kumar, Subhash Gutti, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan

On Tue, Jul 01, 2014 at 05:32:09PM -0400, Jerome Glisse wrote:
> On Tue, Jul 01, 2014 at 11:06:20PM +0200, Joerg Roedel wrote:
> > On Tue, Jul 01, 2014 at 03:33:44PM -0400, Jerome Glisse wrote:
> > > On Tue, Jul 01, 2014 at 01:00:18PM +0200, Joerg Roedel wrote:
> > > > No, its not in this case. The file descriptor is used to connect a
> > > > process address space with a device context. Thus without the mappings
> > > > the file-descriptor is useless and the mappings should stay in-tact
> > > > until the fd is closed.
> > > > 
> > > > It would be a very bad semantic for userspace if a fd that is passed on
> > > > fails on the other side because the sending process died.
> > > 
> > > Consider use case where there is no file associated with the mmu_notifier
> > > ie there is no device file descriptor that could hold and take care of
> > > mmu_notifier destruction and cleanup. We need this call chain for this
> > > case.
> > 
> > Example of such a use-case where no fd will be associated?
> > 
> > Anyway, even without an fd, there will always be something that sets the
> > mm->device binding up (calling mmu_notifier_register) and tears it down
> > in the end (calling mmu_notifier_unregister). And this will be the
> > places where any resources left from the .release call-back can be
> > cleaned up.
> > 
> 
> That's the whole point we can not do what we want without the callback ie
> the place where we do the cleanup is the mm callback we need. If you do not
> like the call chain than we will just add ourself as another caller in the
> exact same spot where the notifier chain is which Andrew disliked because
> there are already enough submodule that are interested in being inform of
> mm destruction.
> 
> Cheers,
> Jérôme

Joerg, do you still object to this patch? We need to bind the lifetime
of our object to the mm_struct. While the release callback of
mmu_notifier allows us to stop any further processing in a timely manner
on mm destruction, we cannot, however, free some of the associated
resources, namely the structure containing the mmu_notifier struct. We
could schedule a delayed job to do it sometime after we get the release
callback, but that would be hackish.

Again, the natural place to call this is from mmput, and the fact that
many other subsystems already call in from there to clean up their own
per-mm data structures is a testimony that this is a valid use case and
a valid design.

This patch really just tries to allow new users to easily interface
themselves at the proper place in the mm lifetime. It is just like the
task exit notifier chain, but it deals with the mm_struct.

You pointed out that the cleanup should be done from the device driver
file close call. But as I stressed, some of the new users will not
necessarily have a device file open, hence no way for them to free the
associated structure except with a hackish delayed job.

So I do not see any reason to block this patch.

Cheers,
Jérôme


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
       [not found]                                   ` <20140703183024.GA3306-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2014-07-03 23:15                                     ` Joerg Roedel
  2014-07-04  0:03                                         ` Jerome Glisse
  2014-07-06 19:25                                         ` Gabbay, Oded
  0 siblings, 2 replies; 76+ messages in thread
From: Joerg Roedel @ 2014-07-03 23:15 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: peterz-wEGCiKHe2LqWVfeAwA7xHQ, Sherry Cheung,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Gabbay, Oded,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, aarcange-H+wXaHxf7aLQT0dZR+AlfA,
	Jatin Kumar, Lucien Dunning, mgorman-l3A5Bk7waGM,
	jweiner-H+wXaHxf7aLQT0dZR+AlfA, Subhash Gutti,
	riel-H+wXaHxf7aLQT0dZR+AlfA, Bridgman, John, John Hubbard,
	Mark Hairgrove, Cameron Buschardt, Duncan Poole, Cornwall, Jay,
	Lewycky, Andrew, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Hi Jerome,

On Thu, Jul 03, 2014 at 02:30:26PM -0400, Jerome Glisse wrote:
> Joerg do you still object to this patch ?

Yes.

> Again the natural place to call this is from mmput and the fact that many
> other subsystem already call in from there to cleanup there own per mm data
> structure is a testimony that this is a valid use case and valid design.

Device drivers are something different from subsystems. I think the
point that the mmu_notifier struct cannot be freed in the .release
call-back is a weak reason for introducing a new notifier. In the end,
every user of mmu_notifiers has to call mmu_notifier_register somewhere
(the file-open/ioctl path or somewhere else where the mm<->device
binding is set up) and can call mmu_notifier_unregister in a similar
path which destroys the binding.

> You pointed out that the cleanup should be done from the device driver file
> close call. But as i stressed some of the new user will not necessarily have
> a device file open hence no way for them to free the associated structure
> except with hackish delayed job.

Please tell me more about these 'new users'. How is the mm<->device
binding set up there if no fd is used?


	Joerg

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
  2014-07-03 23:15                                     ` Joerg Roedel
@ 2014-07-04  0:03                                         ` Jerome Glisse
  2014-07-06 19:25                                         ` Gabbay, Oded
  1 sibling, 0 replies; 76+ messages in thread
From: Jerome Glisse @ 2014-07-04  0:03 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Gabbay, Oded, Deucher, Alexander, Lewycky, Andrew, Cornwall, Jay,
	Bridgman, John, akpm, linux-mm, linux-kernel, mgorman, hpa,
	peterz, aarcange, riel, jweiner, torvalds, Mark Hairgrove,
	Jatin Kumar, Subhash Gutti, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan, John Hubbard, Sherry Cheung, Duncan Poole,
	iommu

On Fri, Jul 04, 2014 at 01:15:41AM +0200, Joerg Roedel wrote:
> Hi Jerome,
> 
> On Thu, Jul 03, 2014 at 02:30:26PM -0400, Jerome Glisse wrote:
> > Joerg do you still object to this patch ?
> 
> Yes.
> 
> > Again the natural place to call this is from mmput and the fact that many
> > other subsystem already call in from there to cleanup there own per mm data
> > structure is a testimony that this is a valid use case and valid design.
> 
> Device drivers are something different than subsystems. I think the
> point that the mmu_notifier struct can not be freed in the .release
> call-back is a weak reason for introducing a new notifier. In the end
> every user of mmu_notifiers has to call mmu_notifier_register somewhere
> (file-open/ioctl path or somewhere else where the mm<->device binding is
>  set up) and can call mmu_notifier_unregister in a similar path which
> destroys the binding.
> 
> > You pointed out that the cleanup should be done from the device driver file
> > close call. But as i stressed some of the new user will not necessarily have
> > a device file open hence no way for them to free the associated structure
> > except with hackish delayed job.
> 
> Please tell me more about these 'new users', how does mm<->device binding
> is set up there if no fd is used?

It could be set up on behalf of another process through other means, like
a local socket. Though the main use case I am thinking about is: you open
the device fd, you set up your gpu queue, and then you close the fd but
keep using the gpu, and the gpu keeps accessing the address space.

Further down the lane we might even see gpu code as directly executable,
though that seems unrealistic at this time.

Anyway, the whole point is that no matter how you turn the matter,
anything that mirrors a process address space is linked to the lifetime
of the mm_struct, so in order to have some logic there it is far better
to have destruction match the destruction of the mm. This also makes
things like fork a lot cleaner: on fork the device file descriptor is
duplicated into the child process, but nothing sets the child process up
to keep using the gpu. Thus you might end up with file closure delayed
way past the parent process's mm destruction.

Cheers,
Jerome



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
@ 2014-07-04  0:03                                         ` Jerome Glisse
  0 siblings, 0 replies; 76+ messages in thread
From: Jerome Glisse @ 2014-07-04  0:03 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Gabbay, Oded, Deucher, Alexander, Lewycky, Andrew, Cornwall, Jay,
	Bridgman, John, akpm, linux-mm, linux-kernel, mgorman, hpa,
	peterz, aarcange, riel, jweiner, torvalds, Mark Hairgrove,
	Jatin Kumar, Subhash Gutti, Lucien Dunning, Cameron Buschardt,
	Arvind Gopalakrishnan

On Fri, Jul 04, 2014 at 01:15:41AM +0200, Joerg Roedel wrote:
> Hi Jerome,
> 
> On Thu, Jul 03, 2014 at 02:30:26PM -0400, Jerome Glisse wrote:
> > Joerg do you still object to this patch ?
> 
> Yes.
> 
> > Again the natural place to call this is from mmput and the fact that many
> > other subsystem already call in from there to cleanup there own per mm data
> > structure is a testimony that this is a valid use case and valid design.
> 
> Device drivers are something different than subsystems. I think the
> point that the mmu_notifier struct can not be freed in the .release
> call-back is a weak reason for introducing a new notifier. In the end
> every user of mmu_notifiers has to call mmu_notifier_register somewhere
> (file-open/ioctl path or somewhere else where the mm<->device binding is
>  set up) and can call mmu_notifier_unregister in a similar path which
> destroys the binding.
> 
> > You pointed out that the cleanup should be done from the device driver file
> > close call. But as i stressed some of the new user will not necessarily have
> > a device file open hence no way for them to free the associated structure
> > except with hackish delayed job.
> 
> Please tell me more about these 'new users', how does mm<->device binding
> is set up there if no fd is used?

It could be set up on behalf of another process through other means, like
a local socket. Though the main use case I am thinking about is: you open
the device fd, you set up your gpu queue, and then you close the fd but
keep using the gpu, and the gpu keeps accessing the address space.

Further down the lane we might even see gpu code as directly executable,
though that seems unrealistic at this time.

Anyway, the whole point is that no matter how you turn the matter,
anything that mirrors a process address space is linked to the lifetime
of the mm_struct, so in order to have some logic there it is far better
to have destruction match the destruction of the mm. This also makes
things like fork a lot cleaner: on fork the device file descriptor is
duplicated into the child process, but nothing sets the child process up
to keep using the gpu. Thus you might end up with file closure delayed
way past the parent process's mm destruction.

Cheers,
Jérôme



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
@ 2014-07-06 19:25                                         ` Gabbay, Oded
  0 siblings, 0 replies; 76+ messages in thread
From: Gabbay, Oded @ 2014-07-06 19:25 UTC (permalink / raw)
  To: joro
  Cc: dpoole, linux-kernel, peterz, jweiner, mhairgrove, torvalds,
	linux-mm, j.glisse, Bridgman, John, Deucher, Alexander, Lewycky,
	Andrew, sgutti, akpm, hpa, aarcange, iommu, riel, arvindg,
	SCheung, jakumar, jhubbard, Cornwall, Jay, mgorman, cabuschardt,
	ldunning


On Fri, 2014-07-04 at 01:15 +0200, Joerg Roedel wrote:
> Hi Jerome,
> 
> On Thu, Jul 03, 2014 at 02:30:26PM -0400, Jerome Glisse wrote:
> > Joerg do you still object to this patch ?
> 
> Yes.
> 
> > Again the natural place to call this is from mmput and the fact that many
> > other subsystem already call in from there to cleanup there own per mm data
> > structure is a testimony that this is a valid use case and valid design.
> 
> Device drivers are something different than subsystems. 
I think that hsa (kfd) and hmm _are_ subsystems, if not in definition
then in practice. Our model is not a classic device-driver model in the
sense that our ioctls are more like syscalls than traditional
device-driver ioctls, e.g. our kfd_open() doesn't open a kfd device or
even a gpu device, it *binds* a *process* to a device. So basically, our
ioctls are not related to a specific H/W instance (a specific GPU in the
case of kfd) but rather to a specific process.

Once we can agree on that, then I think we can agree that kfd and hmm
can and should be bound to the mm struct and not to file descriptors.

	Oded

> I think the
> point that the mmu_notifier struct can not be freed in the .release
> call-back is a weak reason for introducing a new notifier. In the end
> every user of mmu_notifiers has to call mmu_notifier_register somewhere
> (file-open/ioctl path or somewhere else where the mm<->device binding is
>  set up) and can call mmu_notifier_unregister in a similar path which
> destroys the binding.
> 
> > You pointed out that the cleanup should be done from the device driver file
> > close call. But as i stressed some of the new user will not necessarily have
> > a device file open hence no way for them to free the associated structure
> > except with hackish delayed job.
> 
> Please tell me more about these 'new users', how does mm<->device binding
> is set up there if no fd is used?
> 
> 
> 	Joerg
> 
> 


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
@ 2014-07-06 19:25                                         ` Gabbay, Oded
  0 siblings, 0 replies; 76+ messages in thread
From: Gabbay, Oded @ 2014-07-06 19:25 UTC (permalink / raw)
  To: joro-zLv9SwRftAIdnm+yROfE0A
  Cc: peterz-wEGCiKHe2LqWVfeAwA7xHQ, SCheung-DDmLM1+adcrQT0dZR+AlfA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, hpa-YMNOUZJC4hwAvxtiuMwx3w,
	j.glisse-Re5JQEeQqe8AvxtiuMwx3w, aarcange-H+wXaHxf7aLQT0dZR+AlfA,
	jakumar-DDmLM1+adcrQT0dZR+AlfA, ldunning-DDmLM1+adcrQT0dZR+AlfA,
	mgorman-l3A5Bk7waGM, jweiner-H+wXaHxf7aLQT0dZR+AlfA,
	sgutti-DDmLM1+adcrQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA,
	Bridgman, John, jhubbard-DDmLM1+adcrQT0dZR+AlfA,
	mhairgrove-DDmLM1+adcrQT0dZR+AlfA,
	cabuschardt-DDmLM1+adcrQT0dZR+AlfA,
	dpoole-DDmLM1+adcrQT0dZR+AlfA, Cornwall, Jay, Lewycky, Andrew,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, iom

On Fri, 2014-07-04 at 01:15 +0200, Joerg Roedel wrote:
> Hi Jerome,
> 
> On Thu, Jul 03, 2014 at 02:30:26PM -0400, Jerome Glisse wrote:
> > Joerg do you still object to this patch ?
> 
> Yes.
> 
> > Again the natural place to call this is from mmput and the fact that many
> > other subsystem already call in from there to cleanup there own per mm data
> > structure is a testimony that this is a valid use case and valid design.
> 
> Device drivers are something different than subsystems. 
I think that hsa (kfd) and hmm _are_ subsystems, if not in definition
then in practice. Our model is not a classic device-driver model in the
sense that our ioctls are more like syscalls than traditional
device-driver ioctls, e.g. our kfd_open() doesn't open a kfd device or
even a gpu device, it *binds* a *process* to a device. So basically, our
ioctls are not related to a specific H/W instance (a specific GPU in the
case of kfd) but rather to a specific process.

Once we can agree on that, then I think we can agree that kfd and hmm
can and should be bound to the mm struct and not to file descriptors.

	Oded

> I think the
> point that the mmu_notifier struct can not be freed in the .release
> call-back is a weak reason for introducing a new notifier. In the end
> every user of mmu_notifiers has to call mmu_notifier_register somewhere
> (file-open/ioctl path or somewhere else where the mm<->device binding is
>  set up) and can call mmu_notifier_unregister in a similar path which
> destroys the binding.
> 
> > You pointed out that the cleanup should be done from the device driver file
> > close call. But as i stressed some of the new user will not necessarily have
> > a device file open hence no way for them to free the associated structure
> > except with hackish delayed job.
> 
> Please tell me more about these 'new users', how does mm<->device binding
> is set up there if no fd is used?
> 
> 
> 	Joerg
> 
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
  2014-07-06 19:25                                         ` Gabbay, Oded
@ 2014-07-07 10:11                                           ` joro
  -1 siblings, 0 replies; 76+ messages in thread
From: joro @ 2014-07-07 10:11 UTC (permalink / raw)
  To: Gabbay, Oded
  Cc: dpoole, linux-kernel, peterz, jweiner, mhairgrove, torvalds,
	linux-mm, j.glisse, Bridgman, John, Deucher, Alexander, Lewycky,
	Andrew, sgutti, akpm, hpa, aarcange, iommu, riel, arvindg,
	SCheung, jakumar, jhubbard, Cornwall, Jay, mgorman, cabuschardt,
	ldunning

On Sun, Jul 06, 2014 at 07:25:18PM +0000, Gabbay, Oded wrote:
> Once we can agree on that, than I think we can agree that kfd and hmm
> can and should be bounded to mm struct and not file descriptors.

The file descriptor concept is the way it works in the rest of the
kernel. It works for numerous drivers and subsystems (KVM, VFIO, UIO,
...): when you close a file descriptor handed out by any of those
drivers (already in the kernel), all related resources are freed. I
don't see a reason why HSA drivers should break these expectations and
be different.


	Joerg



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
@ 2014-07-07 10:11                                           ` joro
  0 siblings, 0 replies; 76+ messages in thread
From: joro @ 2014-07-07 10:11 UTC (permalink / raw)
  To: Gabbay, Oded
  Cc: dpoole, linux-kernel, peterz, jweiner, mhairgrove, torvalds,
	linux-mm, j.glisse, Bridgman, John, Deucher, Alexander, Lewycky,
	Andrew, sgutti, akpm, hpa, aarcange, iommu, riel, arvindg,
	SCheung

On Sun, Jul 06, 2014 at 07:25:18PM +0000, Gabbay, Oded wrote:
> Once we can agree on that, than I think we can agree that kfd and hmm
> can and should be bounded to mm struct and not file descriptors.

The file descriptor concept is the way it works in the rest of the
kernel. It works for numerous drivers and subsystems (KVM, VFIO, UIO,
...): when you close a file descriptor handed out by any of those
drivers (already in the kernel), all related resources are freed. I
don't see a reason why HSA drivers should break these expectations and
be different.


	Joerg



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
  2014-07-07 10:11                                           ` joro
@ 2014-07-07 10:36                                             ` Oded Gabbay
  -1 siblings, 0 replies; 76+ messages in thread
From: Oded Gabbay @ 2014-07-07 10:36 UTC (permalink / raw)
  To: joro
  Cc: dpoole, linux-kernel, peterz, jweiner, mhairgrove, torvalds,
	linux-mm, j.glisse, Bridgman, John, Deucher, Alexander, Lewycky,
	Andrew, sgutti, akpm, hpa, aarcange, iommu, riel, arvindg,
	SCheung, jakumar, jhubbard, Cornwall, Jay, mgorman, cabuschardt,
	ldunning

On Mon, 2014-07-07 at 12:11 +0200, joro@8bytes.org wrote:
> On Sun, Jul 06, 2014 at 07:25:18PM +0000, Gabbay, Oded wrote:
> > Once we can agree on that, than I think we can agree that kfd and hmm
> > can and should be bounded to mm struct and not file descriptors.
> 
> The file descriptor concept is the way it works in the rest of the
> kernel. It works for numerous drivers and subsystems (KVM, VFIO, UIO,
> ...), when you close a file descriptor handed out from any of those
> drivers (already in the kernel) all related resources will be freed. I
> don't see a reason why HSA drivers should break these expectations and
> be different.
> 
> 
> 	Joerg
> 
> 
As Jerome pointed out, there are a couple of subsystems/drivers that
don't rely on file descriptors but on the tear-down of the mm struct,
e.g. aio, ksm, uprobes, khugepaged.

So, based on this fact, I don't think that the argument of "the file
descriptor concept is the way it works in the rest of the kernel", and
that only HSA/HMM now want to change the rules, is a valid one.

Jerome and I are saying that HMM and HSA, respectively, are additional
use cases of binding to the mm struct. If you don't agree with that,
then I would like to hear why, but you can't say that no one else in the
kernel needs notification of mm struct tear-down.

As for the reasons why HSA drivers should follow aio, ksm, etc. and not
other drivers, I will repeat that our ioctls operate on a process
context and not on a device context. Moreover, the calling process
sometimes is not even aware of which device it runs on!

A prime example of why HSA is not a regular device driver, and operates
in the context of a process and not of a specific device, is the fact
that in the near future (3-4 months) kfd_open() will actually bind a
process address space to a *set* of devices, each of which could have
its *own* device driver (e.g. radeon for the CI device, other amd
drivers for future devices). I assume HMM can be considered in the same
way.

	Oded




^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
@ 2014-07-07 10:36                                             ` Oded Gabbay
  0 siblings, 0 replies; 76+ messages in thread
From: Oded Gabbay @ 2014-07-07 10:36 UTC (permalink / raw)
  To: joro
  Cc: dpoole, linux-kernel, peterz, jweiner, mhairgrove, torvalds,
	linux-mm, j.glisse, Bridgman, John, Deucher, Alexander, Lewycky,
	Andrew, sgutti, akpm, hpa, aarcange, iommu, riel, arvindg,
	SCheung@nvidia.com

On Mon, 2014-07-07 at 12:11 +0200, joro@8bytes.org wrote:
> On Sun, Jul 06, 2014 at 07:25:18PM +0000, Gabbay, Oded wrote:
> > Once we can agree on that, than I think we can agree that kfd and hmm
> > can and should be bounded to mm struct and not file descriptors.
> 
> The file descriptor concept is the way it works in the rest of the
> kernel. It works for numerous drivers and subsystems (KVM, VFIO, UIO,
> ...), when you close a file descriptor handed out from any of those
> drivers (already in the kernel) all related resources will be freed. I
> don't see a reason why HSA drivers should break these expectations and
> be different.
> 
> 
> 	Joerg
> 
> 
As Jerome pointed out, there are a couple of subsystems/drivers that
don't rely on file descriptors but on the tear-down of the mm struct,
e.g. aio, ksm, uprobes, khugepaged.

So, based on this fact, I don't think that the argument of "the file
descriptor concept is the way it works in the rest of the kernel", and
that only HSA/HMM now want to change the rules, is a valid one.

Jerome and I are saying that HMM and HSA, respectively, are additional
use cases of binding to the mm struct. If you don't agree with that,
then I would like to hear why, but you can't say that no one else in the
kernel needs notification of mm struct tear-down.

As for the reasons why HSA drivers should follow aio, ksm, etc. and not
other drivers, I will repeat that our ioctls operate on a process
context and not on a device context. Moreover, the calling process
sometimes is not even aware of which device it runs on!

A prime example of why HSA is not a regular device driver, and operates
in the context of a process and not of a specific device, is the fact
that in the near future (3-4 months) kfd_open() will actually bind a
process address space to a *set* of devices, each of which could have
its *own* device driver (e.g. radeon for the CI device, other amd
drivers for future devices). I assume HMM can be considered in the same
way.

	Oded




^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
  2014-07-07 10:11                                           ` joro
@ 2014-07-07 10:43                                             ` Oded Gabbay
  -1 siblings, 0 replies; 76+ messages in thread
From: Oded Gabbay @ 2014-07-07 10:43 UTC (permalink / raw)
  To: joro
  Cc: dpoole, linux-kernel, peterz, jweiner, mhairgrove, torvalds,
	linux-mm, j.glisse, Bridgman, John, Deucher, Alexander, Lewycky,
	Andrew, sgutti, akpm, hpa, aarcange, iommu, riel, arvindg,
	SCheung, jakumar, jhubbard, Cornwall, Jay, mgorman, cabuschardt,
	ldunning


On Mon, 2014-07-07 at 12:11 +0200, joro@8bytes.org wrote:
> On Sun, Jul 06, 2014 at 07:25:18PM +0000, Gabbay, Oded wrote:
> > Once we can agree on that, than I think we can agree that kfd and hmm
> > can and should be bounded to mm struct and not file descriptors.
> 
> The file descriptor concept is the way it works in the rest of the
> kernel. It works for numerous drivers and subsystems (KVM, VFIO, UIO,
> ...), when you close a file descriptor handed out from any of those
> drivers (already in the kernel) all related resources will be freed. I
> don't see a reason why HSA drivers should break these expectations and
> be different.
> 
> 
> 	Joerg
> 
> 
As Jerome pointed out, there are a couple of subsystems/drivers that
don't rely on file descriptors but on the tear-down of the mm struct,
e.g. aio, ksm, uprobes, khugepaged.

So, based on this fact, I don't think that the argument of "the file
descriptor concept is the way it works in the rest of the kernel", and
that only HSA/HMM now want to change the rules, is a valid one.

Jerome and I are saying that HMM and HSA, respectively, are additional
use cases of binding to the mm struct. If you don't agree with that,
then I would like to hear why, but you can't say that no one else in the
kernel needs notification of mm struct tear-down.

As for the reasons why HSA drivers should follow aio, ksm, etc. and not
other drivers, I will repeat that our ioctls operate on a process
context and not on a device context. Moreover, the calling process
sometimes is not even aware of which device it runs on!

A prime example of why HSA is not a regular device driver, and operates
in the context of a process and not of a specific device, is the fact
that in the near future (3-4 months) kfd_open() will actually bind a
process address space to a *set* of devices, each of which could have
its *own* device driver (e.g. radeon for the CI device, other amd
drivers for future devices). I assume HMM can be considered in the same
way.

	Oded





^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
@ 2014-07-07 10:43                                             ` Oded Gabbay
  0 siblings, 0 replies; 76+ messages in thread
From: Oded Gabbay @ 2014-07-07 10:43 UTC (permalink / raw)
  To: joro
  Cc: dpoole, linux-kernel, peterz, jweiner, mhairgrove, torvalds,
	linux-mm, j.glisse, Bridgman, John, Deucher, Alexander, Lewycky,
	Andrew, sgutti, akpm, hpa, aarcange, iommu, riel, arvindg,
	SCheung@nvidia.com


On Mon, 2014-07-07 at 12:11 +0200, joro@8bytes.org wrote:
> On Sun, Jul 06, 2014 at 07:25:18PM +0000, Gabbay, Oded wrote:
> > Once we can agree on that, than I think we can agree that kfd and hmm
> > can and should be bounded to mm struct and not file descriptors.
> 
> The file descriptor concept is the way it works in the rest of the
> kernel. It works for numerous drivers and subsystems (KVM, VFIO, UIO,
> ...), when you close a file descriptor handed out from any of those
> drivers (already in the kernel) all related resources will be freed. I
> don't see a reason why HSA drivers should break these expectations and
> be different.
> 
> 
> 	Joerg
> 
> 
As Jerome pointed out, there are a couple of subsystems/drivers that
don't rely on file descriptors but on the tear-down of the mm struct,
e.g. aio, ksm, uprobes, khugepaged.

So, based on this fact, I don't think that the argument of "the file
descriptor concept is the way it works in the rest of the kernel", and
that only HSA/HMM now want to change the rules, is a valid one.

Jerome and I are saying that HMM and HSA, respectively, are additional
use cases of binding to the mm struct. If you don't agree with that,
then I would like to hear why, but you can't say that no one else in the
kernel needs notification of mm struct tear-down.

As for the reasons why HSA drivers should follow aio, ksm, etc. and not
other drivers, I will repeat that our ioctls operate on a process
context and not on a device context. Moreover, the calling process
sometimes is not even aware of which device it runs on!

A prime example of why HSA is not a regular device driver, and operates
in the context of a process and not of a specific device, is the fact
that in the near future (3-4 months) kfd_open() will actually bind a
process address space to a *set* of devices, each of which could have
its *own* device driver (e.g. radeon for the CI device, other amd
drivers for future devices). I assume HMM can be considered in the same
way.

	Oded





^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
       [not found]                                             ` <1404729783.31606.1.camel-OrheeFI7RUaGvNAqNQFwiPZ4XP/Yx64J@public.gmane.org>
@ 2014-07-08  8:00                                               ` joro-zLv9SwRftAIdnm+yROfE0A
  2014-07-08 17:03                                                   ` Jerome Glisse
  0 siblings, 1 reply; 76+ messages in thread
From: joro-zLv9SwRftAIdnm+yROfE0A @ 2014-07-08  8:00 UTC (permalink / raw)
  To: Oded Gabbay
  Cc: peterz-wEGCiKHe2LqWVfeAwA7xHQ, SCheung-DDmLM1+adcrQT0dZR+AlfA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, hpa-YMNOUZJC4hwAvxtiuMwx3w,
	j.glisse-Re5JQEeQqe8AvxtiuMwx3w, aarcange-H+wXaHxf7aLQT0dZR+AlfA,
	jakumar-DDmLM1+adcrQT0dZR+AlfA, ldunning-DDmLM1+adcrQT0dZR+AlfA,
	mgorman-l3A5Bk7waGM, jweiner-H+wXaHxf7aLQT0dZR+AlfA,
	sgutti-DDmLM1+adcrQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA,
	Bridgman, John, jhubbard-DDmLM1+adcrQT0dZR+AlfA,
	mhairgrove-DDmLM1+adcrQT0dZR+AlfA,
	cabuschardt-DDmLM1+adcrQT0dZR+AlfA,
	dpoole-DDmLM1+adcrQT0dZR+AlfA, Cornwall, Jay, Lewycky, Andrew,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, iom

On Mon, Jul 07, 2014 at 01:43:03PM +0300, Oded Gabbay wrote:
> As Jerome pointed out, there are a couple of subsystems/drivers who
> don't rely on file descriptors but on the tear-down of mm struct, e.g.
> aio, ksm, uprobes, khugepaged

What you name here is completely different from what HSA offers. AIO,
KSM, uprobes and THP are not drivers or subsystems of their own but
extend existing subsystems. KSM and THP also work in the background and
do not need an fd to set things up (in some cases only new flags to
existing system calls).

What HSA does is offer a new service to userspace applications.  This
either requires new system calls or, as currently implemented, a device
file which can be opened to use the services.  In this regard it is much
more similar to VFIO or KVM, which also offer a new service and which
use file descriptors as their interface to userspace and tear everything
down when the fd is closed.

> Jerome and I are saying that HMM and HSA, respectively, are additional
> use cases of binding to mm struct. If you don't agree with that, than I
> would like to hear why, but you can't say that no one else in the kernel
> needs notification of mm struct tear-down.

In the first place HSA is a service that allows applications to send
compute jobs to peripheral devices (usually GPUs) and read back the
results. That the peripheral device can access the process address space
is a feature of that service that is handled in the driver.

> As for the reasons why HSA drivers should follow aio,ksm,etc. and not
> other drivers, I will repeat that our ioctls operate on a process
> context and not on a device context. Moreover, the calling process
> actually is sometimes not aware of which device it runs on!

KFD can very well hide the fact that there are multiple devices as the
IOMMU drivers usually also hide the details about how many IOMMUs are in
the system.


	Joerg

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
@ 2014-07-08 17:03                                                   ` Jerome Glisse
  0 siblings, 0 replies; 76+ messages in thread
From: Jerome Glisse @ 2014-07-08 17:03 UTC (permalink / raw)
  To: joro
  Cc: Oded Gabbay, dpoole, linux-kernel, peterz, jweiner, mhairgrove,
	torvalds, linux-mm, Bridgman, John, Deucher, Alexander, Lewycky,
	Andrew, sgutti, akpm, hpa, aarcange, iommu, riel, arvindg,
	SCheung, jakumar, jhubbard, Cornwall, Jay, mgorman, cabuschardt,
	ldunning

On Tue, Jul 08, 2014 at 10:00:59AM +0200, joro@8bytes.org wrote:
> On Mon, Jul 07, 2014 at 01:43:03PM +0300, Oded Gabbay wrote:
> > As Jerome pointed out, there are a couple of subsystems/drivers who
> > don't rely on file descriptors but on the tear-down of mm struct, e.g.
> > aio, ksm, uprobes, khugepaged
> 
> What you name here is completely different from what HSA offers. AIO,
> KSM, uprobes and THP are not drivers or subsystems of their own but
> extend existing subsystems. KSM and THP also work in the background and
> do not need an fd to set things up (in some cases only new flags to
> existing system calls).
> 
> What HSA does is offer a new service to userspace applications.  This
> either requires new system calls or, as currently implemented, a device
> file which can be opened to use the services.  In this regard it is much
> more similar to VFIO or KVM, which also offer a new service and which
> use file descriptors as their interface to userspace and tear everything
> down when the fd is closed.

The thing is, we are closer to AIO than to KVM. Unlike KVM, HMM stores a
pointer to its structure inside mm_struct, and thus we already hook ourselves
into the mm_init function, i.e. we have the same lifespan as the mm_struct,
not the same lifespan as a file.

Now regarding the device side, if we were to clean up inside the file release
callback then we would be broken with respect to fork. Imagine the following:
  - process A opens the device file and mirrors its address space (hmm or kfd)
    through that device file.
  - process A forks itself (child is B) while having the device file open.
  - process A quits.

Now the file release will not be called until child B exits, which might be
never. Thus we would be leaking memory. As we already pointed out, we cannot
free the resources from the mmu_notifier ->release callback.
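
To make the lifetime problem concrete, here is a minimal userspace sketch
(illustrative only: the device node name is made up and no real driver is
implied):

    /* Illustrative only: /dev/hmm-mirror is a hypothetical device node. */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = open("/dev/hmm-mirror", O_RDWR);

            if (fd < 0)
                    exit(1);

            if (fork() == 0) {
                    /* Child B: inherits fd, never touches or closes it. */
                    pause();        /* may block forever */
                    _exit(0);
            }

            /*
             * Parent (process A) exits here: its mm is torn down now via
             * mmput(), but the device file's ->release() only runs once B
             * also exits and drops the last reference on the struct file.
             */
            return 0;
    }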

One hacky way to do it would be to schedule some delayed job from the ->release
callback, but then again we would have no way to properly synchronize ourselves
with other mm destruction code, i.e. the delayed job could run concurrently
with other mm destruction code and interfere badly.

So as I am desperately trying to show you, there is no other clean way to free
the resources associated with hmm, and the same applies to kfd. The only way is
by adding a callback inside mmput.
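
The shape of the idea is roughly the following (a sketch only, with made-up
function names, not the actual patch): a notifier chain run from mmput() that
subsystems tied to the mm can register with.

    #include <linux/notifier.h>
    #include <linux/mm_types.h>

    /* Hypothetical sketch, not the actual patch code. */
    static BLOCKING_NOTIFIER_HEAD(mm_exit_chain);

    int mm_exit_register(struct notifier_block *nb)
    {
            return blocking_notifier_chain_register(&mm_exit_chain, nb);
    }

    /* Called from mmput() when mm_users drops to zero, before the vmas
     * and page tables are torn down, so subsystems hanging off the mm
     * (hmm, kfd, ...) get a chance to clean up. */
    static void mm_exit_notify(struct mm_struct *mm)
    {
            blocking_notifier_call_chain(&mm_exit_chain, 0, mm);
    }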


Another thing you must understand: the kfd or hmm structure can be shared among
different devices, each of them having their own device file. So there is one
and only one hmm per mm struct, but several devices using that hmm structure.
Obviously the lifetime of this hmm structure is tied first to the mm struct;
all ties to device files are secondary, and I can foresee situations where
there would be absolutely no device file open but still an hmm for the mm
struct (think of another process using the process address space through a
device driver because it provides some API for that).
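
The ownership I have in mind looks roughly like this (field and struct names
are made up for illustration; this is not the actual HMM patch code):

    #include <linux/list.h>
    #include <linux/mm_types.h>

    /* Made-up names: one hmm object per mm, living and dying with it. */
    struct hmm {
            struct mm_struct  *mm;       /* backpointer, same lifespan */
            struct list_head   mirrors;  /* one entry per device using this mm */
    };

    /* One per device (file) mirroring the address space. */
    struct hmm_mirror {
            struct hmm        *hmm;      /* the per-mm object */
            struct list_head   list;     /* linked on hmm->mirrors */
            /* device private bits ... */
    };

    /*
     * mm_struct then carries a single "struct hmm *hmm" pointer, set up
     * from mm_init(), so the hmm lifetime follows the mm, while mirrors
     * (and their device files) come and go independently.
     */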


I genuinely fail to see how to do it properly using the device file: as I said,
the file lifespan is not tied to an mm struct, while the structures we want to
clean up are tied to the mm struct.

Again hmm or kfd is like aio. Not like kvm.

Cheers,
Jerome

> 
> > Jerome and I are saying that HMM and HSA, respectively, are additional
> > use cases of binding to mm struct. If you don't agree with that, then I
> > would like to hear why, but you can't say that no one else in the kernel
> > needs notification of mm struct tear-down.
> 
> In the first place HSA is a service that allows applications to send
> compute jobs to peripheral devices (usually GPUs) and read back the
> results. That the peripheral device can access the process address space
> is a feature of that service that is handled in the driver.
> 
> > As for the reasons why HSA drivers should follow aio,ksm,etc. and not
> > other drivers, I will repeat that our ioctls operate on a process
> > context and not on a device context. Moreover, the calling process
> > actually is sometimes not aware of which device it runs on!
> 
> KFD can very well hide the fact that there are multiple devices as the
> IOMMU drivers usually also hide the details about how many IOMMUs are in
> the system.
> 
> 
> 	Joerg
> 
> 


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
  2014-07-08 17:03                                                   ` Jerome Glisse
@ 2015-10-11 19:03                                                     ` David Woodhouse
  -1 siblings, 0 replies; 76+ messages in thread
From: David Woodhouse @ 2015-10-11 19:03 UTC (permalink / raw)
  To: Jerome Glisse, joro
  Cc: peterz, SCheung, linux-mm, ldunning, hpa, aarcange, jakumar,
	mgorman, jweiner, sgutti, riel, Bridgman, John, jhubbard,
	mhairgrove, cabuschardt, dpoole, Cornwall, Jay, Lewycky, Andrew,
	linux-kernel, iommu, arvindg, Deucher, Alexander, akpm, torvalds

[-- Attachment #1: Type: text/plain, Size: 3029 bytes --]

On Tue, 2014-07-08 at 13:03 -0400, Jerome Glisse wrote:
> 
> Now regarding the device side, if we were to clean up inside the file release
> callback then we would be broken with respect to fork. Imagine the following:
>   - process A opens the device file and mirrors its address space (hmm or kfd)
>     through that device file.
>   - process A forks itself (child is B) while having the device file open.
>   - process A quits.
> 
> Now the file release will not be called until child B exits, which might be
> never. Thus we would be leaking memory. As we already pointed out, we cannot
> free the resources from the mmu_notifier ->release callback.

So if your library just registers a pthread_atfork() handler to close
the file descriptor in the child, that problem goes away? Like any
other "if we continue to hold file descriptors open when we should
close them, resources don't get freed" problem?

Perhaps we could even handle that automatically in the kernel, with
something like an O_CLOFORK flag on the file descriptor. Although
that's a little horrid.

You've argued that the amdkfd code is special and not just a device
driver. I'm not going to contradict you there, but now we *are* going
to see device drivers doing this kind of thing. And we definitely
*don't* want to be calling back into device driver code from the
mmu_notifier's .release() function.

I think amdkfd absolutely is *not* a useful precedent for normal device
drivers, and we *don't* want to follow this model in the general case.

As we try to put together a generic API for device access to processes'
address space, I definitely think we want to stick with the model that
we take a reference on the mm, and we *keep* it until the device driver
unbinds from the mm (because its file descriptor is closed, or
whatever).

Perhaps you can keep a back door into the AMD IOMMU code to continue to
do what you're doing, or perhaps the explicit management of off-cpu
tasks that is being posited will give you a sane cleanup path that
*doesn't* involve the IOMMU's mmu_notifier calling back to you. But
either way, I *really* don't want this to be the way it works for
device drivers.


> One hacky way to do it would be to schedule some delayed job from the
> ->release callback, but then again we would have no way to properly
> synchronize ourselves with other mm destruction code, i.e. the delayed job
> could run concurrently with other mm destruction code and interfere badly.

With the RCU-based free of the struct containing the notifier, your
'schedule some delayed job' is basically what we have now, isn't it?

I note that you also have your *own* notifier which does other things,
and has to cope with being called before or after the IOMMU's notifier.

Seriously, we don't want device drivers having to do this. We really
need to keep it simple.

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation


[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5691 bytes --]

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
  2015-10-11 19:03                                                     ` David Woodhouse
  (?)
@ 2015-10-12 17:41                                                       ` Jerome Glisse
  -1 siblings, 0 replies; 76+ messages in thread
From: Jerome Glisse @ 2015-10-12 17:41 UTC (permalink / raw)
  To: David Woodhouse
  Cc: joro, peterz, SCheung, linux-mm, ldunning, hpa, aarcange,
	jakumar, mgorman, jweiner, sgutti, riel, Bridgman, John,
	jhubbard, mhairgrove, cabuschardt, dpoole, Cornwall, Jay,
	Lewycky, Andrew, linux-kernel, iommu, arvindg, Deucher,
	Alexander, akpm, torvalds


Note that I am no longer actively pushing this patch series, but I believe the
solution it provides is needed in one form or another. So I still think
discussion on this is useful; see below for my answer.

On Sun, Oct 11, 2015 at 08:03:29PM +0100, David Woodhouse wrote:
> On Tue, 2014-07-08 at 13:03 -0400, Jerome Glisse wrote:
> > 
> > Now regarding the device side, if we were to clean up inside the file release
> > callback then we would be broken with respect to fork. Imagine the following:
> >   - process A opens the device file and mirrors its address space (hmm or kfd)
> >     through that device file.
> >   - process A forks itself (child is B) while having the device file open.
> >   - process A quits.
> > 
> > Now the file release will not be called until child B exits, which might be
> > never. Thus we would be leaking memory. As we already pointed out, we cannot
> > free the resources from the mmu_notifier ->release callback.
> 
> So if your library just registers a pthread_atfork() handler to close
> the file descriptor in the child, that problem goes away? Like any
> other "if we continue to hold file descriptors open when we should
> close them, resources don't get freed" problem?


I was just pointing out the existing device driver usage pattern, where user
space opens a device file and does ioctls on it without necessarily caring
about the mm struct.

The new use case, where the device actually runs threads against a specific
process mm, is different and requires proper synchronization, as the file
lifetime differs from the process lifetime in many cases when fork is involved.

> 
> Perhaps we could even handle that automatically in the kernel, with
> something like an O_CLOFORK flag on the file descriptor. Although
> that's a little horrid.
> 
> You've argued that the amdkfd code is special and not just a device
> driver. I'm not going to contradict you there, but now we *are* going
> to see device drivers doing this kind of thing. And we definitely
> *don't* want to be calling back into device driver code from the
> mmu_notifier's .release() function.

Well, that is the current solution: call back into the device driver from the
mmu_notifier release() callback. Since the changes to mmu_notifier this is
a workable solution (thanks to mmu_notifier_unregister_no_release()).

> 
> I think amdkfd absolutely is *not* a useful precedent for normal device
> drivers, and we *don't* want to follow this model in the general case.
> 
> As we try to put together a generic API for device access to processes'
> address space, I definitely think we want to stick with the model that
> we take a reference on the mm, and we *keep* it until the device driver
> unbinds from the mm (because its file descriptor is closed, or
> whatever).

Well, I think when a process is killed (for whatever reason) we do want to
also kill all device threads at the same time. For instance, we do not want
to have zombie GPU threads that can keep running indefinitely. This is why
the mmu_notifier.release() callback is kind of the right place, as it will be
called once all threads using a given mm are killed.

exit_mm() or do_exit() are also places from where we could add a callback
to let the device know that it must kill all threads related to a given mm.

>
> Perhaps you can keep a back door into the AMD IOMMU code to continue to
> do what you're doing, or perhaps the explicit management of off-cpu
> tasks that is being posited will give you a sane cleanup path that
> *doesn't* involve the IOMMU's mmu_notifier calling back to you. But
> either way, I *really* don't want this to be the way it works for
> device drivers.

So at kernel summit we are supposedly going to have a discussion about device
threads and scheduling, and I think device thread lifetime belongs to that
discussion too. My opinion is that device threads must be killed when a
process quits.


> > One hacky way to do it would be to schedule some delayed job from the
> > ->release callback, but then again we would have no way to properly
> > synchronize ourselves with other mm destruction code, i.e. the delayed job
> > could run concurrently with other mm destruction code and interfere badly.
> 
> With the RCU-based free of the struct containing the notifier, your
> 'schedule some delayed job' is basically what we have now, isn't it?
> 
> I note that you also have your *own* notifier which does other things,
> and has to cope with being called before or after the IOMMU's notifier.
> 
> Seriously, we don't want device drivers having to do this. We really
> need to keep it simple.

So right now in HMM what happens is that the device driver gets a callback as
a result of mmu_notifier.release(), and the device driver must then schedule,
through whatever means (a workqueue or a kernel thread), a call to the
unregister function.
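
Roughly this pattern, in other words (a sketch with made-up driver names such
as my_mirror; only mmu_notifier_unregister_no_release() is the real entry
point mentioned above):

    #include <linux/mmu_notifier.h>
    #include <linux/workqueue.h>
    #include <linux/slab.h>

    /* Made-up driver structure; illustrative only. */
    struct my_mirror {
            struct mmu_notifier  mn;
            struct mm_struct    *mm;
            struct work_struct   teardown_work;   /* INIT_WORK()ed at bind time */
    };

    static void my_mirror_teardown(struct work_struct *work)
    {
            struct my_mirror *m = container_of(work, struct my_mirror,
                                               teardown_work);

            /* Process context: safe to sleep, stop device access, etc. */
            mmu_notifier_unregister_no_release(&m->mn, m->mm);
            kfree(m);
    }

    static void my_mirror_release(struct mmu_notifier *mn,
                                  struct mm_struct *mm)
    {
            struct my_mirror *m = container_of(mn, struct my_mirror, mn);

            /* The mm is being torn down; do not tear ourselves down here,
             * just kick the deferred work. */
            schedule_work(&m->teardown_work);
    }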

Basically I am hiding the issue inside the device driver until we can agree
on a common proper way to do this.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.
  2015-10-11 19:03                                                     ` David Woodhouse
@ 2015-11-20 15:45                                                       ` David Woodhouse
  -1 siblings, 0 replies; 76+ messages in thread
From: David Woodhouse @ 2015-11-20 15:45 UTC (permalink / raw)
  To: Jerome Glisse, joro, Tang, CQ
  Cc: peterz, linux-kernel, linux-mm, hpa, aarcange, jakumar, ldunning,
	mgorman, jweiner, sgutti, riel, Bridgman, John, jhubbard,
	mhairgrove, cabuschardt, dpoole, Cornwall, Jay, Lewycky, Andrew,
	SCheung, iommu, arvindg, Deucher, Alexander, akpm, torvalds

[-- Attachment #1: Type: text/plain, Size: 1530 bytes --]

On Sun, 2015-10-11 at 20:03 +0100, David Woodhouse wrote:
> As we try to put together a generic API for device access to processes'
> address space, I definitely think we want to stick with the model that
> we take a reference on the mm, and we *keep* it until the device driver
> unbinds from the mm (because its file descriptor is closed, or
> whatever).

I've found another problem with this.

In some use cases, we mmap() the device file descriptor which is
responsible for the PASID binding. And in that case we end up with a
recursive refcount.

When the process exits, its file descriptors are closed... but the
underlying struct file remains open because it's still referenced from
the mmap'd VMA.

That VMA remains alive because it's still part of the MM.

And the MM remains alive because the PASID binding still holds a
refcount for it, because the device's struct file didn't get closed yet...
because it's still mmap'd... because the MM is still alive...

So I suspect that even for the relatively simple case where the
lifetime of the PASID can be bound to a file descriptor (unlike with
amdkfd), we probably still want to explicitly manage its lifetime as an
'off-cpu task' and explicitly kill it when the process dies.

I'm still not keen on doing that implicitly through the mm_release. I
think that way lies a lot of subtle bugs.

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation


[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5691 bytes --]

^ permalink raw reply	[flat|nested] 76+ messages in thread

end of thread, other threads:[~2015-11-20 15:45 UTC | newest]

Thread overview: 76+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-06-28  2:00 mm preparatory patches for HMM and IOMMUv2 Jérôme Glisse
2014-06-28  2:00 ` Jérôme Glisse
2014-06-28  2:00 ` [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler Jérôme Glisse
2014-06-28  2:00   ` Jérôme Glisse
2014-06-30  3:49   ` John Hubbard
2014-06-30  3:49     ` John Hubbard
2014-06-30 15:07     ` Jerome Glisse
2014-06-30 15:07       ` Jerome Glisse
2014-06-30 14:41   ` Gabbay, Oded
2014-06-30 14:41     ` Gabbay, Oded
2014-06-30 15:06     ` Jerome Glisse
2014-06-30 15:06       ` Jerome Glisse
     [not found]     ` <019CCE693E457142B37B791721487FD91806B836-0nO7ALo/ziwxlywnonMhLEEOCMrvLtNR@public.gmane.org>
2014-06-30 15:40       ` Joerg Roedel
2014-06-30 16:06         ` Jerome Glisse
2014-06-30 16:06           ` Jerome Glisse
2014-06-30 18:16           ` Joerg Roedel
2014-06-30 18:16             ` Joerg Roedel
2014-06-30 18:35             ` Jerome Glisse
2014-06-30 18:35               ` Jerome Glisse
2014-06-30 18:57               ` Lewycky, Andrew
2014-06-30 18:57                 ` Lewycky, Andrew
2014-07-01  9:41                 ` Joerg Roedel
2014-07-01  9:41                   ` Joerg Roedel
     [not found]               ` <20140630183556.GB3280-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2014-07-01  9:15                 ` Joerg Roedel
2014-07-01  9:29                   ` Gabbay, Oded
2014-07-01  9:29                     ` Gabbay, Oded
     [not found]                     ` <019CCE693E457142B37B791721487FD91806DD8B-0nO7ALo/ziwxlywnonMhLEEOCMrvLtNR@public.gmane.org>
2014-07-01 11:00                       ` Joerg Roedel
2014-07-01 19:33                         ` Jerome Glisse
2014-07-01 19:33                           ` Jerome Glisse
     [not found]                           ` <20140701193343.GB3322-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2014-07-01 21:06                             ` Joerg Roedel
2014-07-01 21:32                               ` Jerome Glisse
2014-07-01 21:32                                 ` Jerome Glisse
2014-07-03 18:30                                 ` Jerome Glisse
2014-07-03 18:30                                   ` Jerome Glisse
     [not found]                                   ` <20140703183024.GA3306-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2014-07-03 23:15                                     ` Joerg Roedel
2014-07-04  0:03                                       ` Jerome Glisse
2014-07-04  0:03                                         ` Jerome Glisse
2014-07-06 19:25                                       ` Gabbay, Oded
2014-07-06 19:25                                         ` Gabbay, Oded
2014-07-07 10:11                                         ` joro
2014-07-07 10:11                                           ` joro
2014-07-07 10:36                                           ` Oded Gabbay
2014-07-07 10:36                                             ` Oded Gabbay
2014-07-07 10:43                                           ` Oded Gabbay
2014-07-07 10:43                                             ` Oded Gabbay
     [not found]                                             ` <1404729783.31606.1.camel-OrheeFI7RUaGvNAqNQFwiPZ4XP/Yx64J@public.gmane.org>
2014-07-08  8:00                                               ` joro-zLv9SwRftAIdnm+yROfE0A
2014-07-08 17:03                                                 ` Jerome Glisse
2014-07-08 17:03                                                   ` Jerome Glisse
2015-10-11 19:03                                                   ` David Woodhouse
2015-10-11 19:03                                                     ` David Woodhouse
2015-10-12 17:41                                                     ` Jerome Glisse
2015-10-12 17:41                                                       ` Jerome Glisse
2015-10-12 17:41                                                       ` Jerome Glisse
2015-11-20 15:45                                                     ` David Woodhouse
2015-11-20 15:45                                                       ` David Woodhouse
2014-06-30 15:37   ` Joerg Roedel
2014-06-28  2:00 ` [PATCH 2/6] mm: differentiate unmap for vmscan from other unmap Jérôme Glisse
2014-06-28  2:00   ` Jérôme Glisse
2014-06-30  3:58   ` John Hubbard
2014-06-30  3:58     ` John Hubbard
2014-06-30 15:58     ` Jerome Glisse
2014-06-30 15:58       ` Jerome Glisse
2014-06-28  2:00 ` [PATCH 3/6] mmu_notifier: add event information to address invalidation v2 Jérôme Glisse
2014-06-28  2:00   ` Jérôme Glisse
2014-06-30  5:22   ` John Hubbard
2014-06-30  5:22     ` John Hubbard
2014-06-30 15:57     ` Jerome Glisse
2014-06-30 15:57       ` Jerome Glisse
2014-07-01  1:57   ` Linus Torvalds
2014-06-28  2:00 ` [PATCH 4/6] mmu_notifier: pass through vma to invalidate_range and invalidate_page Jérôme Glisse
2014-06-28  2:00   ` Jérôme Glisse
2014-06-30  3:29   ` John Hubbard
2014-06-30  3:29     ` John Hubbard
2014-06-30 16:00     ` Jerome Glisse
2014-06-30 16:00       ` Jerome Glisse
2014-07-01  2:04   ` Linus Torvalds
