linux-kernel.vger.kernel.org archive mirror
* [patch 0/4] [RFC] MMU Notifiers V1
@ 2008-01-25  5:56 Christoph Lameter
  2008-01-25  5:56 ` [patch 1/4] mmu_notifier: Core code Christoph Lameter
                   ` (4 more replies)
  0 siblings, 5 replies; 30+ messages in thread
From: Christoph Lameter @ 2008-01-25  5:56 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

This is a patchset implementing MMU notifier callbacks based on Andrea's
earlier work. These are needed when Linux pages are referenced by something
other than the mappings tracked by the kernel's rmaps.

To do:

- Make locking requirements for the callbacks consistent and
  document them accurately.
- Ensure that the callbacks are complete
- Feedback from users of the callbacks for KVM, RDMA, XPmem and GRU

Andrea's mmu_notifier #4 -> RFC V1

- Merge the subsystem rmap based approach with the Linux rmap based approach
- Move Linux rmap based notifiers out of macro
- Try to account for what locks are held while the notifiers are
  called.
- Develop a patch sequence that separates out the different types of
  hooks so that it is easier to review their use.
- Avoid adding #include to linux/mm_types.h
- Integrate RCU logic suggested by Peter.

-- 


* [patch 1/4] mmu_notifier: Core code
  2008-01-25  5:56 [patch 0/4] [RFC] MMU Notifiers V1 Christoph Lameter
@ 2008-01-25  5:56 ` Christoph Lameter
  2008-01-25 18:39   ` Robin Holt
  2008-01-25  5:56 ` [patch 2/4] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 30+ messages in thread
From: Christoph Lameter @ 2008-01-25  5:56 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

[-- Attachment #1: mmu_core --]
[-- Type: text/plain, Size: 12412 bytes --]

Core code for mmu notifiers.
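
The header added below defines two classes of hooks: mm-based
mmu_notifiers and page-based mmu_rmap_notifiers. As a rough illustration
of the intended use, a subsystem that mirrors an address space might
hook in like this (hypothetical driver code, not part of the patch; only
the ops layout and mmu_notifier_register() come from the header below):

	#include <linux/mmu_notifier.h>
	#include <linux/mm_types.h>
	#include <linux/sched.h>

	static void my_release(struct mmu_notifier *mn, struct mm_struct *mm)
	{
		/* The mm is going away: drop all external references to it. */
	}

	static void my_invalidate_range(struct mmu_notifier *mn,
					struct mm_struct *mm,
					unsigned long start, unsigned long end)
	{
		/* Shoot down external TLB entries covering [start, end). */
	}

	static const struct mmu_notifier_ops my_ops = {
		.release	  = my_release,
		.invalidate_range = my_invalidate_range,
	};

	static struct mmu_notifier my_notifier = { .ops = &my_ops };

	static void my_attach(void)
	{
		/* Attach to the current process' address space. */
		mmu_notifier_register(&my_notifier, current->mm);
	}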

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

---
 include/linux/mm_types.h     |    8 ++
 include/linux/mmu_notifier.h |  152 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/page-flags.h   |    9 ++
 kernel/fork.c                |    2 
 mm/Kconfig                   |    4 +
 mm/Makefile                  |    1 
 mm/mmap.c                    |    2 
 mm/mmu_notifier.c            |   91 +++++++++++++++++++++++++
 8 files changed, 269 insertions(+)

Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h	2008-01-24 20:59:17.000000000 -0800
+++ linux-2.6/include/linux/mm_types.h	2008-01-24 20:59:19.000000000 -0800
@@ -153,6 +153,10 @@ struct vm_area_struct {
 #endif
 };
 
+struct mmu_notifier_head {
+	struct hlist_head head;
+};
+
 struct mm_struct {
 	struct vm_area_struct * mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
@@ -219,6 +223,10 @@ struct mm_struct {
 	/* aio bits */
 	rwlock_t		ioctx_list_lock;
 	struct kioctx		*ioctx_list;
+
+#ifdef CONFIG_MMU_NOTIFIER
+	struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
+#endif
 };
 
 #endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/mmu_notifier.h	2008-01-24 20:59:19.000000000 -0800
@@ -0,0 +1,152 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+/*
+ * MMU notifier
+ *
+ * Notifier functions for hardware and software that establish external
+ * references to pages of a Linux system. The notifier calls ensure that
+ * the external mappings are removed when the Linux VM removes memory ranges
+ * or individual pages from a process.
+ *
+ * These fall into two classes
+ *
+ * 1. mmu_notifier
+ *
+ * 	These are callbacks registered with an mm_struct. If mappings are
+ * 	removed from an address space then callbacks are performed.
+ * 	Spinlocks must be held in order to walk the reverse maps and the
+ * 	notifications are performed while the spinlock is held.
+ *
+ *
+ * 2. mmu_rmap_notifier
+ *
+ *	Callbacks for subsystems that provide their own rmaps. These
+ *	need to walk their own rmaps for a page. The invalidate_page
+ *	callback is outside of locks so that we are not in a strictly
+ *	atomic context (but we may be in a PF_MEMALLOC context if the
+ *	notifier is called from reclaim code) and are able to sleep.
+ *	Rmap notifiers need an extra page bit and are only available
+ *	on 64 bit platforms. It is up to the subsystem to mark pages
+ *	as PageExternalRmap as needed to trigger the callbacks. Pages
+ *	must be marked dirty if dirty bits are set in the external
+ *	pte.
+ */
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/rcupdate.h>
+#include <linux/mm_types.h>
+
+struct mmu_notifier_ops;
+
+struct mmu_notifier {
+	struct hlist_node hlist;
+	const struct mmu_notifier_ops *ops;
+};
+
+struct mmu_notifier_ops {
+	void (*release)(struct mmu_notifier *mn,
+			struct mm_struct *mm);
+	int (*age_page)(struct mmu_notifier *mn,
+			struct mm_struct *mm,
+			unsigned long address);
+	void (*invalidate_page)(struct mmu_notifier *mn,
+				struct mm_struct *mm,
+				unsigned long address);
+	void (*invalidate_range)(struct mmu_notifier *mn,
+				 struct mm_struct *mm,
+				 unsigned long start, unsigned long end);
+};
+
+struct mmu_rmap_notifier_ops;
+
+struct mmu_rmap_notifier {
+	struct hlist_node hlist;
+	const struct mmu_rmap_notifier_ops *ops;
+};
+
+struct mmu_rmap_notifier_ops {
+	/*
+	 * Called with the page lock held after ptes are modified or removed
+	 * so that a subsystem with its own rmaps can remove remote ptes
+	 * mapping a page.
+	 */
+	void (*invalidate_page)(struct mmu_rmap_notifier *mrn, struct page *page);
+};
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+extern void mmu_notifier_register(struct mmu_notifier *mn,
+				  struct mm_struct *mm);
+extern void mmu_notifier_unregister(struct mmu_notifier *mn,
+				    struct mm_struct *mm);
+extern void mmu_notifier_release(struct mm_struct *mm);
+extern int mmu_notifier_age_page(struct mm_struct *mm,
+				 unsigned long address);
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
+{
+	INIT_HLIST_HEAD(&mnh->head);
+}
+
+#define mmu_notifier(function, mm, args...)				\
+	do {								\
+		struct mmu_notifier *__mn;				\
+		struct hlist_node *__n;					\
+									\
+		if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \
+			rcu_read_lock();				\
+			hlist_for_each_entry_rcu(__mn, __n,		\
+					     &(mm)->mmu_notifier.head,	\
+					     hlist)			\
+				if (__mn->ops->function)		\
+					__mn->ops->function(__mn,	\
+							    mm,		\
+							    args);	\
+			rcu_read_unlock();				\
+		}							\
+	} while (0)
+
+extern void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn);
+extern void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn);
+
+extern struct hlist_head mmu_rmap_notifier_list;
+
+#define mmu_rmap_notifier(function, args...)					\
+	do {									\
+		struct mmu_rmap_notifier *__mrn;				\
+		struct hlist_node *__n;						\
+										\
+		rcu_read_lock();						\
+		hlist_for_each_entry_rcu(__mrn, __n, &mmu_rmap_notifier_list, 	\
+						hlist)				\
+			if (__mrn->ops->function)				\
+				__mrn->ops->function(__mrn, args);		\
+		rcu_read_unlock();						\
+	} while (0)
+
+#else /* CONFIG_MMU_NOTIFIER */
+
+#define mmu_notifier(function, mm, args...) do { } while (0)
+#define mmu_rmap_notifier(function, args...) do { } while (0)
+
+static inline void mmu_notifier_register(struct mmu_notifier *mn,
+						struct mm_struct *mm) {}
+static inline void mmu_notifier_unregister(struct mmu_notifier *mn,
+						struct mm_struct *mm) {}
+static inline void mmu_notifier_release(struct mm_struct *mm) {}
+static inline int mmu_notifier_age_page(struct mm_struct *mm,
+				unsigned long address)
+{
+	return 0;
+}
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mmh) {}
+
+static inline void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn) {}
+static inline void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn) {}
+
+#endif /* CONFIG_MMU_NOTIFIER */
+
+#endif /* _LINUX_MMU_NOTIFIER_H */
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h	2008-01-24 20:59:17.000000000 -0800
+++ linux-2.6/include/linux/page-flags.h	2008-01-24 20:59:19.000000000 -0800
@@ -105,6 +105,7 @@
  * 64 bit  |           FIELDS             | ??????         FLAGS         |
  *         63                            32                              0
  */
+#define PG_external_rmap	30	/* Page has external rmap */
 #define PG_uncached		31	/* Page has been mapped as uncached */
 #endif
 
@@ -260,6 +261,14 @@ static inline void __ClearPageTail(struc
 #define SetPageUncached(page)	set_bit(PG_uncached, &(page)->flags)
 #define ClearPageUncached(page)	clear_bit(PG_uncached, &(page)->flags)
 
+#if defined(CONFIG_MMU_NOTIFIER) && defined(CONFIG_64BIT)
+#define PageExternalRmap(page)	test_bit(PG_external_rmap, &(page)->flags)
+#define SetPageExternalRmap(page)	set_bit(PG_external_rmap, &(page)->flags)
+#define ClearPageExternalRmap(page)	clear_bit(PG_external_rmap, &(page)->flags)
+#else
+#define PageExternalRmap(page)	0
+#endif
+
 struct page;	/* forward declaration */
 
 extern void cancel_dirty_page(struct page *page, unsigned int account_size);
Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig	2008-01-24 20:59:17.000000000 -0800
+++ linux-2.6/mm/Kconfig	2008-01-24 20:59:19.000000000 -0800
@@ -193,3 +193,7 @@ config NR_QUICK
 config VIRT_TO_BUS
 	def_bool y
 	depends on !ARCH_NO_VIRT_TO_BUS
+
+config MMU_NOTIFIER
+	def_bool y
+	bool "MMU notifier, for paging KVM/RDMA"
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile	2008-01-24 20:59:17.000000000 -0800
+++ linux-2.6/mm/Makefile	2008-01-24 20:59:19.000000000 -0800
@@ -30,4 +30,5 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
+obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
 
Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/mm/mmu_notifier.c	2008-01-24 20:59:19.000000000 -0800
@@ -0,0 +1,91 @@
+/*
+ *  linux/mm/mmu_notifier.c
+ *
+ *  Copyright (C) 2008  Qumranet, Inc.
+ *  Copyright (C) 2008  SGI
+ *  		Christoph Lameter <clameter@sgi.com>
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include <linux/mmu_notifier.h>
+#include <linux/module.h>
+
+void mmu_notifier_release(struct mm_struct *mm)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n;
+
+	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+		rcu_read_lock();
+		hlist_for_each_entry_rcu(mn, n,
+					  &mm->mmu_notifier.head, hlist) {
+			if (mn->ops->release)
+				mn->ops->release(mn, mm);
+			hlist_del(&mn->hlist);
+		}
+		rcu_read_unlock();
+	}
+}
+
+/*
+ * If no young bitflag is supported by the hardware, ->age_page can
+ * unmap the address and return 1 or 0 depending on whether the mapping
+ * previously existed.
+ */
+int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n;
+	int young = 0;
+
+	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+		rcu_read_lock();
+		hlist_for_each_entry_rcu(mn, n,
+					  &mm->mmu_notifier.head, hlist) {
+			if (mn->ops->age_page)
+				young |= mn->ops->age_page(mn, mm, address);
+		}
+		rcu_read_unlock();
+	}
+
+	return young;
+}
+
+static DEFINE_SPINLOCK(mmu_notifier_list_lock);
+
+void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	spin_lock(&mmu_notifier_list_lock);
+	hlist_add_head(&mn->hlist, &mm->mmu_notifier.head);
+	spin_unlock(&mmu_notifier_list_lock);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_register);
+
+void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	spin_lock(&mmu_notifier_list_lock);
+	hlist_del(&mn->hlist);
+	spin_unlock(&mmu_notifier_list_lock);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
+
+HLIST_HEAD(mmu_rmap_notifier_list);
+
+void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn)
+{
+	spin_lock(&mmu_notifier_list_lock);
+	hlist_add_head_rcu(&mrn->hlist, &mmu_rmap_notifier_list);
+	spin_unlock(&mmu_notifier_list_lock);
+}
+EXPORT_SYMBOL(mmu_rmap_notifier_register);
+
+void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn)
+{
+	spin_lock(&mmu_notifier_list_lock);
+	hlist_del_rcu(&mrn->hlist);
+	spin_unlock(&mmu_notifier_list_lock);
+}
+EXPORT_SYMBOL(mmu_rmap_notifier_unregister);
+
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2008-01-24 20:59:17.000000000 -0800
+++ linux-2.6/kernel/fork.c	2008-01-24 20:59:19.000000000 -0800
@@ -51,6 +51,7 @@
 #include <linux/random.h>
 #include <linux/tty.h>
 #include <linux/proc_fs.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -359,6 +360,7 @@ static struct mm_struct * mm_init(struct
 
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
+		mmu_notifier_head_init(&mm->mmu_notifier);
 		return mm;
 	}
 	free_mm(mm);
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2008-01-24 20:59:17.000000000 -0800
+++ linux-2.6/mm/mmap.c	2008-01-24 20:59:19.000000000 -0800
@@ -26,6 +26,7 @@
 #include <linux/mount.h>
 #include <linux/mempolicy.h>
 #include <linux/rmap.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -2043,6 +2044,7 @@ void exit_mmap(struct mm_struct *mm)
 	vm_unacct_memory(nr_accounted);
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
 	tlb_finish_mmu(tlb, 0, end);
+	mmu_notifier_release(mm);
 
 	/*
 	 * Walk the list again, actually closing and freeing it,

-- 


* [patch 2/4] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-25  5:56 [patch 0/4] [RFC] MMU Notifiers V1 Christoph Lameter
  2008-01-25  5:56 ` [patch 1/4] mmu_notifier: Core code Christoph Lameter
@ 2008-01-25  5:56 ` Christoph Lameter
  2008-01-25  5:56 ` [patch 3/4] mmu_notifier: invalidate_page callbacks for subsystems with rmap Christoph Lameter
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 30+ messages in thread
From: Christoph Lameter @ 2008-01-25  5:56 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

[-- Attachment #1: mmu_invalidate_range_callbacks --]
[-- Type: text/plain, Size: 4588 bytes --]

The invalidation of address ranges in a mm_struct needs to be
performed when pages are removed or their permissions etc. change.

invalidate_range() is generally called with mmap_sem held but
no spinlocks are active.

Exceptions:

We hold i_mmap_lock in __unmap_hugepage_range and
sometimes in zap_page_range. Should we pass a parameter to indicate
the different lock situation?

Comments state that mmap_sem must be held for
remap_pfn_range() but various drivers do not seem to do this?
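
To make the locking contract concrete: since invalidate_range() may be
called with i_mmap_lock held, an implementation must not sleep. A
hypothetical consumer-side sketch under that constraint (everything
except the callback signature is invented here):

	struct my_mirror {
		struct mmu_notifier mn;
		spinlock_t lock;
		/* secondary page table state would live here */
	};

	/* hypothetical helpers, defined elsewhere in the driver */
	static void my_drop_secondary_pte(struct my_mirror *m, unsigned long addr);
	static void my_flush_secondary_tlb(struct my_mirror *m);

	static void my_invalidate_range(struct mmu_notifier *mn,
					struct mm_struct *mm,
					unsigned long start, unsigned long end)
	{
		struct my_mirror *m = container_of(mn, struct my_mirror, mn);
		unsigned long addr;

		/* Atomic context: spinlocks only, no blocking allocations. */
		spin_lock(&m->lock);
		for (addr = start; addr < end; addr += PAGE_SIZE)
			my_drop_secondary_pte(m, addr);
		my_flush_secondary_tlb(m);
		spin_unlock(&m->lock);
	}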

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Robin Holt <holt@sgi.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/fremap.c  |    2 ++
 mm/hugetlb.c |    2 ++
 mm/memory.c  |    9 +++++++--
 mm/mmap.c    |    1 +
 4 files changed, 12 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c	2008-01-24 20:59:17.000000000 -0800
+++ linux-2.6/mm/fremap.c	2008-01-24 21:01:17.000000000 -0800
@@ -15,6 +15,7 @@
 #include <linux/rmap.h>
 #include <linux/module.h>
 #include <linux/syscalls.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/mmu_context.h>
 #include <asm/cacheflush.h>
@@ -211,6 +212,7 @@ asmlinkage long sys_remap_file_pages(uns
 		spin_unlock(&mapping->i_mmap_lock);
 	}
 
+	mmu_notifier(invalidate_range, mm, start, start + size);
 	err = populate_range(mm, vma, start, size, pgoff);
 	if (!err && !(flags & MAP_NONBLOCK)) {
 		if (unlikely(has_write_lock)) {
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c	2008-01-24 20:59:17.000000000 -0800
+++ linux-2.6/mm/hugetlb.c	2008-01-24 21:01:17.000000000 -0800
@@ -14,6 +14,7 @@
 #include <linux/mempolicy.h>
 #include <linux/cpuset.h>
 #include <linux/mutex.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -763,6 +764,7 @@ void __unmap_hugepage_range(struct vm_ar
 	}
 	spin_unlock(&mm->page_table_lock);
 	flush_tlb_range(vma, start, end);
+	mmu_notifier(invalidate_range, mm, start, end);
 	list_for_each_entry_safe(page, tmp, &page_list, lru) {
 		list_del(&page->lru);
 		put_page(page);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2008-01-24 20:59:17.000000000 -0800
+++ linux-2.6/mm/memory.c	2008-01-24 21:01:17.000000000 -0800
@@ -50,6 +50,7 @@
 #include <linux/delayacct.h>
 #include <linux/init.h>
 #include <linux/writeback.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -891,6 +892,7 @@ unsigned long zap_page_range(struct vm_a
 	end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
 	if (tlb)
 		tlb_finish_mmu(tlb, address, end);
+	mmu_notifier(invalidate_range, mm, address, end);
 	return end;
 }
 
@@ -1319,7 +1321,7 @@ int remap_pfn_range(struct vm_area_struc
 {
 	pgd_t *pgd;
 	unsigned long next;
-	unsigned long end = addr + PAGE_ALIGN(size);
+	unsigned long start = addr, end = addr + PAGE_ALIGN(size);
 	struct mm_struct *mm = vma->vm_mm;
 	int err;
 
@@ -1360,6 +1362,7 @@ int remap_pfn_range(struct vm_area_struc
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
+	mmu_notifier(invalidate_range, mm, start, end);
 	return err;
 }
 EXPORT_SYMBOL(remap_pfn_range);
@@ -1443,7 +1446,7 @@ int apply_to_page_range(struct mm_struct
 {
 	pgd_t *pgd;
 	unsigned long next;
-	unsigned long end = addr + size;
+	unsigned long start = addr, end = addr + size;
 	int err;
 
 	BUG_ON(addr >= end);
@@ -1454,6 +1457,7 @@ int apply_to_page_range(struct mm_struct
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
+	mmu_notifier(invalidate_range, mm, start, end);
 	return err;
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
@@ -1634,6 +1638,7 @@ gotten:
 	/*
 	 * Re-check the pte - we dropped the lock
 	 */
+	mmu_notifier(invalidate_range, mm, address, address + PAGE_SIZE);
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 	if (likely(pte_same(*page_table, orig_pte))) {
 		if (old_page) {
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2008-01-24 20:59:19.000000000 -0800
+++ linux-2.6/mm/mmap.c	2008-01-24 21:01:17.000000000 -0800
@@ -1748,6 +1748,7 @@ static void unmap_region(struct mm_struc
 	free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
 				 next? next->vm_start: 0);
 	tlb_finish_mmu(tlb, start, end);
+	mmu_notifier(invalidate_range, mm, start, end);
 }
 
 /*

-- 


* [patch 3/4] mmu_notifier: invalidate_page callbacks for subsystems with rmap
  2008-01-25  5:56 [patch 0/4] [RFC] MMU Notifiers V1 Christoph Lameter
  2008-01-25  5:56 ` [patch 1/4] mmu_notifier: Core code Christoph Lameter
  2008-01-25  5:56 ` [patch 2/4] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter
@ 2008-01-25  5:56 ` Christoph Lameter
  2008-01-25  5:56 ` [patch 4/4] MMU notifier: invalidate_page callbacks using Linux rmaps Christoph Lameter
  2008-01-25 11:42 ` [patch 0/4] [RFC] MMU Notifiers V1 Andrea Arcangeli
  4 siblings, 0 replies; 30+ messages in thread
From: Christoph Lameter @ 2008-01-25  5:56 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

[-- Attachment #1: mmu_invalidate_page_rmap_callbacks --]
[-- Type: text/plain, Size: 2220 bytes --]

Callbacks to remove individual pages if the subsystem has an
rmap capability. The page lock is held but no spinlocks are held.
The refcount of the page is elevated so that dropping the refcount
in the subsystem will not directly free the page.

The callbacks occur after the Linux rmaps have been walked.

Robin: We do not hold the page lock in __xip_unmap().
I guess we do not need to increase the refcount there since the
page is static and cannot go away?
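
For reference, a subsystem with its own reverse map might implement the
hook roughly as follows (hypothetical sketch; only SetPageExternalRmap,
mmu_rmap_notifier_register() and the ops layout come from this series):

	#include <linux/mmu_notifier.h>
	#include <linux/init.h>

	/* hypothetical: walks the subsystem's own rmap and unmaps every
	 * remote pte, marking the page dirty if a remote pte was dirty */
	static void my_walk_and_unmap_external(struct page *page);

	static void my_rmap_invalidate_page(struct mmu_rmap_notifier *mrn,
					    struct page *page)
	{
		/* Page lock held, no spinlocks: this callback may sleep. */
		my_walk_and_unmap_external(page);
	}

	static const struct mmu_rmap_notifier_ops my_rmap_ops = {
		.invalidate_page = my_rmap_invalidate_page,
	};

	static struct mmu_rmap_notifier my_rmap_notifier = {
		.ops = &my_rmap_ops,
	};

	static int __init my_init(void)
	{
		/* The subsystem must also SetPageExternalRmap(page) on
		 * every page it exports so that the callbacks fire. */
		mmu_rmap_notifier_register(&my_rmap_notifier);
		return 0;
	}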

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c	2008-01-24 19:47:28.000000000 -0800
+++ linux-2.6/mm/filemap_xip.c	2008-01-24 20:30:31.000000000 -0800
@@ -13,6 +13,7 @@
 #include <linux/module.h>
 #include <linux/uio.h>
 #include <linux/rmap.h>
+#include <linux/mmu_notifier.h>
 #include <linux/sched.h>
 #include <asm/tlbflush.h>
 
@@ -183,6 +184,9 @@ __xip_unmap (struct address_space * mapp
 	if (!page)
 		return;
 
+	if (PageExternalRmap(page))
+		mmu_rmap_notifier(invalidate_page, page);
+
 	spin_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		mm = vma->vm_mm;
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c	2008-01-24 19:47:28.000000000 -0800
+++ linux-2.6/mm/rmap.c	2008-01-24 20:30:31.000000000 -0800
@@ -49,6 +49,7 @@
 #include <linux/rcupdate.h>
 #include <linux/module.h>
 #include <linux/kallsyms.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/tlbflush.h>
 
@@ -473,6 +474,8 @@ int page_mkclean(struct page *page)
 		struct address_space *mapping = page_mapping(page);
 		if (mapping) {
 			ret = page_mkclean_file(mapping, page);
+			if (unlikely(PageExternalRmap(page)))
+				mmu_rmap_notifier(invalidate_page, page);
 			if (page_test_dirty(page)) {
 				page_clear_dirty(page);
 				ret = 1;
@@ -971,6 +974,9 @@ int try_to_unmap(struct page *page, int 
 	else
 		ret = try_to_unmap_file(page, migration);
 
+	if (unlikely(PageExternalRmap(page)))
+		mmu_rmap_notifier(invalidate_page, page);
+
 	if (!page_mapped(page))
 		ret = SWAP_SUCCESS;
 	return ret;

-- 


* [patch 4/4] MMU notifier: invalidate_page callbacks using Linux rmaps
  2008-01-25  5:56 [patch 0/4] [RFC] MMU Notifiers V1 Christoph Lameter
                   ` (2 preceding siblings ...)
  2008-01-25  5:56 ` [patch 3/4] mmu_notifier: invalidate_page callbacks for subsystems with rmap Christoph Lameter
@ 2008-01-25  5:56 ` Christoph Lameter
  2008-01-25 11:42 ` [patch 0/4] [RFC] MMU Notifiers V1 Andrea Arcangeli
  4 siblings, 0 replies; 30+ messages in thread
From: Christoph Lameter @ 2008-01-25  5:56 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

[-- Attachment #1: mmu_invalidate_page_callbacks --]
[-- Type: text/plain, Size: 4839 bytes --]

The notifiers here use the Linux rmaps to perform the callbacks. Locks
must be held in order to walk the rmaps, so the callbacks can only
operate in an atomic context.
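
A hypothetical sketch of such an atomic consumer, e.g. a device whose
driver keeps no rmap of its own (only the callback signatures come from
patch 1; the my_* helpers are invented):

	/* hypothetical helpers, defined elsewhere in the driver */
	static void my_flush_one_tlb_entry(struct mmu_notifier *mn,
					   unsigned long address);
	static int my_test_and_clear_young(struct mmu_notifier *mn,
					   unsigned long address);

	static void my_invalidate_page(struct mmu_notifier *mn,
				       struct mm_struct *mm,
				       unsigned long address)
	{
		/* Called with pte locks held: must not sleep. */
		my_flush_one_tlb_entry(mn, address);
	}

	static int my_age_page(struct mmu_notifier *mn,
			       struct mm_struct *mm,
			       unsigned long address)
	{
		/* Return 1 if the external mapping was recently referenced. */
		return my_test_and_clear_young(mn, address);
	}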

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/filemap_xip.c |    1 +
 mm/fremap.c      |    1 +
 mm/memory.c      |    1 +
 mm/mremap.c      |    2 ++
 mm/rmap.c        |   12 +++++++++++-
 5 files changed, 16 insertions(+), 1 deletion(-)

Index: linux-2.6/mm/mremap.c
===================================================================
--- linux-2.6.orig/mm/mremap.c	2008-01-24 21:35:25.000000000 -0800
+++ linux-2.6/mm/mremap.c	2008-01-24 21:39:34.000000000 -0800
@@ -18,6 +18,7 @@
 #include <linux/highmem.h>
 #include <linux/security.h>
 #include <linux/syscalls.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -105,6 +106,7 @@ static void move_ptes(struct vm_area_str
 		if (pte_none(*old_pte))
 			continue;
 		pte = ptep_clear_flush(vma, old_addr, old_pte);
+		mmu_notifier(invalidate_page, mm, old_addr);
 		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
 		set_pte_at(mm, new_addr, new_pte, pte);
 	}
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c	2008-01-24 21:39:33.000000000 -0800
+++ linux-2.6/mm/rmap.c	2008-01-24 21:39:34.000000000 -0800
@@ -288,6 +288,9 @@ static int page_referenced_one(struct pa
 	if (ptep_clear_flush_young(vma, address, pte))
 		referenced++;
 
+	if (mmu_notifier_age_page(mm, address))
+		referenced++;
+
 	/* Pretend the page is referenced if the task has the
 	   swap token and is in the middle of a page fault. */
 	if (mm != current->mm && has_swap_token(mm) &&
@@ -435,6 +438,7 @@ static int page_mkclean_one(struct page 
 
 		flush_cache_page(vma, address, pte_pfn(*pte));
 		entry = ptep_clear_flush(vma, address, pte);
+		mmu_notifier(invalidate_page, mm, address);
 		entry = pte_wrprotect(entry);
 		entry = pte_mkclean(entry);
 		set_pte_at(mm, address, pte, entry);
@@ -680,7 +684,8 @@ static int try_to_unmap_one(struct page 
 	 * skipped over this mm) then we should reactivate it.
 	 */
 	if (!migration && ((vma->vm_flags & VM_LOCKED) ||
-			(ptep_clear_flush_young(vma, address, pte)))) {
+			(ptep_clear_flush_young(vma, address, pte) ||
+				mmu_notifier_age_page(mm, address)))) {
 		ret = SWAP_FAIL;
 		goto out_unmap;
 	}
@@ -688,6 +693,7 @@ static int try_to_unmap_one(struct page 
 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address, page_to_pfn(page));
 	pteval = ptep_clear_flush(vma, address, pte);
+	mmu_notifier(invalidate_page, mm, address);
 
 	/* Move the dirty bit to the physical page now the pte is gone. */
 	if (pte_dirty(pteval))
@@ -815,9 +821,13 @@ static void try_to_unmap_cluster(unsigne
 		if (ptep_clear_flush_young(vma, address, pte))
 			continue;
 
+		if (mmu_notifier_age_page(mm, address))
+			continue;
+
 		/* Nuke the page table entry. */
 		flush_cache_page(vma, address, pte_pfn(*pte));
 		pteval = ptep_clear_flush(vma, address, pte);
+		mmu_notifier(invalidate_page, mm, address);
 
 		/* If nonlinear, store the file page offset in the pte. */
 		if (page->index != linear_page_index(vma, address))
Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c	2008-01-24 21:39:33.000000000 -0800
+++ linux-2.6/mm/filemap_xip.c	2008-01-24 21:39:34.000000000 -0800
@@ -198,6 +198,7 @@ __xip_unmap (struct address_space * mapp
 			/* Nuke the page table entry. */
 			flush_cache_page(vma, address, pte_pfn(*pte));
 			pteval = ptep_clear_flush(vma, address, pte);
+			mmu_notifier(invalidate_page, mm, address);
 			page_remove_rmap(page, vma);
 			dec_mm_counter(mm, file_rss);
 			BUG_ON(pte_dirty(pteval));
Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c	2008-01-24 21:39:31.000000000 -0800
+++ linux-2.6/mm/fremap.c	2008-01-24 21:39:34.000000000 -0800
@@ -31,6 +31,7 @@ static void zap_pte(struct mm_struct *mm
 
 		flush_cache_page(vma, addr, pte_pfn(pte));
 		pte = ptep_clear_flush(vma, addr, ptep);
+		mmu_notifier(invalidate_page, mm, addr);
 		page = vm_normal_page(vma, addr, pte);
 		if (page) {
 			if (pte_dirty(pte))
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2008-01-24 21:39:31.000000000 -0800
+++ linux-2.6/mm/memory.c	2008-01-24 21:39:34.000000000 -0800
@@ -1659,6 +1659,7 @@ gotten:
 		 * thread doing COW.
 		 */
 		ptep_clear_flush(vma, address, page_table);
+		mmu_notifier(invalidate_page, mm, address);
 		set_pte_at(mm, address, page_table, entry);
 		update_mmu_cache(vma, address, entry);
 		lru_cache_add_active(new_page);

-- 


* Re: [patch 0/4] [RFC] MMU Notifiers V1
  2008-01-25  5:56 [patch 0/4] [RFC] MMU Notifiers V1 Christoph Lameter
                   ` (3 preceding siblings ...)
  2008-01-25  5:56 ` [patch 4/4] MMU notifier: invalidate_page callbacks using Linux rmaps Christoph Lameter
@ 2008-01-25 11:42 ` Andrea Arcangeli
  2008-01-25 12:43   ` Robin Holt
                     ` (3 more replies)
  4 siblings, 4 replies; 30+ messages in thread
From: Andrea Arcangeli @ 2008-01-25 11:42 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Thu, Jan 24, 2008 at 09:56:06PM -0800, Christoph Lameter wrote:
> Andrea's mmu_notifier #4 -> RFC V1
> 
> - Merge subsystem rmap based with Linux rmap based approach
> - Move Linux rmap based notifiers out of macro
> - Try to account for what locks are held while the notifiers are
>   called.
> - Develop a patch sequence that separates out the different types of
>   hooks so that it is easier to review their use.
> - Avoid adding #include to linux/mm_types.h
> - Integrate RCU logic suggested by Peter.

I'm glad you're converging on something a bit saner and much much
closer to my code, plus perfectly usable by the KVM optimal rmap design
too. I would have preferred it if you had sent me patches like
Peter did for review and merging etc... that would have made review
especially easier. Anyway I'm used to that on lkml so it's ok, I just
need this patch to be included in mainline, everything else is
irrelevant to me.

On technical merit this still partially makes me sick and I think
it's the last issue to debate.

@@ -971,6 +974,9 @@ int try_to_unmap(struct page *page, int 
        else
                ret = try_to_unmap_file(page, migration);

+       if (unlikely(PageExternalRmap(page)))
+               mmu_rmap_notifier(invalidate_page, page);
+
        if (!page_mapped(page))
                ret = SWAP_SUCCESS;
        return ret;

I find the above hard to accept, because the moment you work with
physical pages and not "mm+address" I think you couldn't possibly care
if page_mapped is true or false, and I think the above notifier should
be called _outside_ try_to_unmap. In fact I'd call
mmu_rmap_notifier(invalidate_page, page); only if page_unmapped is
false and the linux pte is gone already (practically just before the
page_count == 2 check and after try_to_unmap).
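
(A rough sketch of that placement, just to make the proposal concrete;
this is one reading of it against the 2.6.24-era reclaim path, not code
anyone has posted:

	/* in shrink_page_list(), after try_to_unmap() has removed the
	 * linux ptes but before the page_count()-based freeing checks: */
	if (!page_mapped(page) && unlikely(PageExternalRmap(page)))
		mmu_rmap_notifier(invalidate_page, page);
)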

I also think it's still worth debating whether the rmap should be based
on a virtual or a physical index. By supporting both secondary-rmap
designs at the same time you seem to agree that the current KVM
lightweight rmap implementation is a superior design at least for KVM.
But by insisting on a physically indexed rmap for your usage, you're
implicitly telling us that it is a superior design for you. Yet we know
very little of why you can't build your rmap on virtual addresses like
KVM does! (Especially now that you've implicitly admitted the KVM rmap
design is superior at least for KVM, it'd be interesting to know why
you can't do exactly the same.) You said something on that, but I
certainly don't have a clear picture of why it can't work or why it
would be less efficient.

Like you said by PM I'd also like comments from Hugh, Nick and others
about this issue.

Nevertheless I'm very glad we already fully converged on the
set_page_dirty, invalidate-page after ptep_clear_flush/young,
etc... and furthermore that you only made very minor modifications to
my code to add a pair of hooks for the page-based rmap notifiers on
top of my patch. So from a functionality POV this is 100% workable
already from the KVM side!

Thanks!


* Re: [patch 0/4] [RFC] MMU Notifiers V1
  2008-01-25 11:42 ` [patch 0/4] [RFC] MMU Notifiers V1 Andrea Arcangeli
@ 2008-01-25 12:43   ` Robin Holt
  2008-01-25 18:31   ` Christoph Lameter
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 30+ messages in thread
From: Robin Holt @ 2008-01-25 12:43 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Christoph Lameter, Robin Holt, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

On Fri, Jan 25, 2008 at 12:42:29PM +0100, Andrea Arcangeli wrote:
> On a technical merit this still partially makes me sick and I think
> it's the last issue to debate.
> 
> @@ -971,6 +974,9 @@ int try_to_unmap(struct page *page, int 
>         else
>                 ret = try_to_unmap_file(page, migration);
> 
> +       if (unlikely(PageExternalRmap(page)))
> +               mmu_rmap_notifier(invalidate_page, page);
> +
>         if (!page_mapped(page))
>                 ret = SWAP_SUCCESS;
>         return ret;
> 
> I find the above hard to accept, because the moment you work with
> physical pages and not "mm+address" I think you couldn't possibly care
> if page_mapped is true or false, and I think the above notifier should
> be called _outside_ try_to_unmap. Infact I'd call
> mmu_rmap_notifier(invalidate_page, page); only if page_unmapped is
> false and the linux pte is gone already (practically just before the
> page_count == 2 check and after try_to_unmap).

How does the called process sleep, or how does it coordinate async work
with try_to_unmap?  We need to sleep.

On a separate note, I think the page flag needs to be set by the process
when it is acquiring the page for export.  But since the same page could
be acquired by multiple export mechanisms, we should not clear it in the
exporting driver, but rather here, after all exporters have been called
to invalidate_page.

That led me to believe we should add a flag to get_user_pages() that
indicates this is an export with external rmap.  We could then set the
page flag in get_user_pages.
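
A hypothetical sketch of that idea (get_user_pages() takes write/force
arguments rather than a flags word in this era, so it is shown here as a
wrapper; nothing like this exists in the series):

	static int get_user_pages_external(struct task_struct *tsk,
					   struct mm_struct *mm,
					   unsigned long start, int nr_pages,
					   int write, struct page **pages)
	{
		int i, ret;

		ret = get_user_pages(tsk, mm, start, nr_pages, write, 0,
				     pages, NULL);
		/* Mark each exported page so the rmap notifiers fire. */
		for (i = 0; i < ret; i++)
			SetPageExternalRmap(pages[i]);
		return ret;
	}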

Thanks,
Robin


* Re: [patch 0/4] [RFC] MMU Notifiers V1
  2008-01-25 11:42 ` [patch 0/4] [RFC] MMU Notifiers V1 Andrea Arcangeli
  2008-01-25 12:43   ` Robin Holt
@ 2008-01-25 18:31   ` Christoph Lameter
  2008-01-25 21:18   ` Benjamin Herrenschmidt
  2008-01-28 16:10   ` Izik Eidus
  3 siblings, 0 replies; 30+ messages in thread
From: Christoph Lameter @ 2008-01-25 18:31 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Fri, 25 Jan 2008, Andrea Arcangeli wrote:

> On a technical merit this still partially makes me sick and I think
> it's the last issue to debate.
> 
> @@ -971,6 +974,9 @@ int try_to_unmap(struct page *page, int 
>         else
>                 ret = try_to_unmap_file(page, migration);
> 
> +       if (unlikely(PageExternalRmap(page)))
> +               mmu_rmap_notifier(invalidate_page, page);
> +
>         if (!page_mapped(page))
>                 ret = SWAP_SUCCESS;
>         return ret;
> 
> I find the above hard to accept, because the moment you work with
> physical pages and not "mm+address" I think you couldn't possibly care
> if page_mapped is true or false, and I think the above notifier should
> be called _outside_ try_to_unmap. Infact I'd call
> mmu_rmap_notifier(invalidate_page, page); only if page_unmapped is
> false and the linux pte is gone already (practically just before the
> page_count == 2 check and after try_to_unmap).

try_to_unmap is called from multiple places. The placement here
also covers e.g. page migration.

We also need to do this in the page_mkclean case because the permissions
on an external pte are restricted there. So we need a refault to update
the pte.

> I also think it's still worth to debate the rmap based on virtual or
> physical index. By supporting both secondary-rmap designs at the same
> time you seem to agree current KVM lightweight rmap implementation is
> a superior design at least for KVM. But by insisting on your rmap
> based on physical for your usage, you're implicitly telling us that is
> a superior design for you. But we know very little of why you can't

We actually need both versions. We have hardware that has a driver without 
rmap that does not sleep. On the other hand XPmem has rmap capability and 
needs to sleep for its notifications.

> Nevertheless I'm very glad we already fully converged on the
> set_page_dirty, invalidate-page after ptep_clear_flush/young,
> etc... and furthermore that you only made very minor modification to
> my code to add a pair of hooks for the page-based rmap notifiers on
> top of my patch. So from a functionality POV this is 100% workable
> already from KVM side!

Well, we still have to review this stuff more, and I have a vague feeling 
that not all of the hooks that came about because I took 
mmu_notifier(invalidate_page, ...) out of the macro need to be kept, 
because some of them are already covered by the range operations.




* Re: [patch 1/4] mmu_notifier: Core code
  2008-01-25  5:56 ` [patch 1/4] mmu_notifier: Core code Christoph Lameter
@ 2008-01-25 18:39   ` Robin Holt
  2008-01-25 18:47     ` Christoph Lameter
  0 siblings, 1 reply; 30+ messages in thread
From: Robin Holt @ 2008-01-25 18:39 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

> +#define mmu_notifier(function, mm, args...)				\
...
> +				if (__mn->ops->function)		\
> +					__mn->ops->function(__mn,	\
> +							    mm,		\
> +							    args);	\

					__mn->ops->function(__mn, mm, args);	\
I realize it is a minor nit, but since we put the continuation in column
81 in the next define, can we do the same here and make this more
readable?

> +			rcu_read_unlock();				\
...
> +#define mmu_rmap_notifier(function, args...)					\
> +	do {									\
> +		struct mmu_rmap_notifier *__mrn;				\
> +		struct hlist_node *__n;						\
> +										\



> +void mmu_notifier_release(struct mm_struct *mm)
> +{
> +	struct mmu_notifier *mn;
> +	struct hlist_node *n;
> +
> +	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> +		rcu_read_lock();
> +		hlist_for_each_entry_rcu(mn, n,
> +					  &mm->mmu_notifier.head, hlist) {
> +			if (mn->ops->release)
> +				mn->ops->release(mn, mm);
> +			hlist_del(&mn->hlist);

I think the hlist_del needs to be before the function callout so we can free
the structure without a use-after-free issue.

		hlist_for_each_entry_rcu(mn, n,
					  &mm->mmu_notifier.head, hlist) {
			hlist_del_rcu(&mn->hlist);
			if (mn->ops->release)
				mn->ops->release(mn, mm);



> +static DEFINE_SPINLOCK(mmu_notifier_list_lock);

Remove

> +
> +void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> +	spin_lock(&mmu_notifier_list_lock);

Shouldn't this really be protected by the down_write(mmap_sem)?  Maybe:
	BUG_ON(!rwsem_is_write_locked(&mm->mmap_sem));

> +	hlist_add_head(&mn->hlist, &mm->mmu_notifier.head);
	hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);

> +	spin_unlock(&mmu_notifier_list_lock);
> +}
> +EXPORT_SYMBOL_GPL(mmu_notifier_register);
> +
> +void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> +	spin_lock(&mmu_notifier_list_lock);
> +	hlist_del(&mn->hlist);

hlist_del_rcu?  Ditto on the lock.

> +	spin_unlock(&mmu_notifier_list_lock);
> +}
> +EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
> +
> +HLIST_HEAD(mmu_rmap_notifier_list);

static DEFINE_SPINLOCK(mmu_rmap_notifier_list_lock);

> +
> +void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn)
> +{
> +	spin_lock(&mmu_notifier_list_lock);
> +	hlist_add_head_rcu(&mrn->hlist, &mmu_rmap_notifier_list);
> +	spin_unlock(&mmu_notifier_list_lock);

	spin_lock(&mmu_rmap_notifier_list_lock);
	hlist_add_head_rcu(&mrn->hlist, &mmu_rmap_notifier_list);
	spin_unlock(&mmu_rmap_notifier_list_lock);

> +}
> +EXPORT_SYMBOL(mmu_rmap_notifier_register);
> +
> +void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn)
> +{
> +	spin_lock(&mmu_notifier_list_lock);
> +	hlist_del_rcu(&mrn->hlist);
> +	spin_unlock(&mmu_notifier_list_lock);

	spin_lock(&mmu_rmap_notifier_list_lock);
	hlist_del_rcu(&mrn->hlist);
	spin_unlock(&mmu_rmap_notifier_list_lock);


> @@ -2043,6 +2044,7 @@ void exit_mmap(struct mm_struct *mm)
>  	vm_unacct_memory(nr_accounted);
>  	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
>  	tlb_finish_mmu(tlb, 0, end);
> +	mmu_notifier_release(mm);

Can we consider moving this notifier, introducing an additional notifier
in the release path, or adding a flag to this one indicating early/late?

The GRU that Jack is concerned with would benefit from the early call in
that it could just invalidate the GRU context and immediately all GRU
TLB entries are invalid.  I believe Jack would like to also be able to
remove his entry from the mmu_notifier list in an effort to avoid the
page and range callouts.

XPMEM would also benefit from an early call.  We could mark all the
segments as being torn down and start the recalls.  We already have
this code in and working (have since it was first written 6 years ago).
In this case, all segments are torn down with a single message to each
of the importing partitions.  In contrast, the teardown code which would
happen now would be one set of messages for each vma.

Thanks,
Robin


* Re: [patch 1/4] mmu_notifier: Core code
  2008-01-25 18:39   ` Robin Holt
@ 2008-01-25 18:47     ` Christoph Lameter
  2008-01-25 18:56       ` Robin Holt
  0 siblings, 1 reply; 30+ messages in thread
From: Christoph Lameter @ 2008-01-25 18:47 UTC (permalink / raw)
  To: Robin Holt
  Cc: Andrea Arcangeli, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Fri, 25 Jan 2008, Robin Holt wrote:

> I realize it is a minor nit, but since we put the continuation in column
> 81 in the next define, can we do the same here and make this more
> readable?

We need to fix the next define to not use column 81.
Found a couple more 80 column infractions. Will be fixed in the next 
release.

> > +void mmu_notifier_release(struct mm_struct *mm)
> > +{
> > +	struct mmu_notifier *mn;
> > +	struct hlist_node *n;
> > +
> > +	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> > +		rcu_read_lock();
> > +		hlist_for_each_entry_rcu(mn, n,
> > +					  &mm->mmu_notifier.head, hlist) {
> > +			if (mn->ops->release)
> > +				mn->ops->release(mn, mm);
> > +			hlist_del(&mn->hlist);
> 
> I think the hlist_del needs to be before the function callout so we can free
> the structure without a use-after-free issue.

The list head is in the mm_struct. This will be freed later.

> > +void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
> > +{
> > +	spin_lock(&mmu_notifier_list_lock);
> 
> Shouldn't this really be protected by the down_write(mmap_sem)?  Maybe:

Ok. We could switch this to mmap_sem protection for the mm_struct but the 
rmap notifier is not associated with an mm_struct. So we would need to 
keep it there. Since we already have a spinlock, just use it for both to 
avoid further complications.

> > +	spin_lock(&mmu_notifier_list_lock);
> > +	hlist_del(&mn->hlist);
> 
> hlist_del_rcu?  Ditto on the lock.

Peter already mentioned that and I have posted patches that address this 
issue.

> > @@ -2043,6 +2044,7 @@ void exit_mmap(struct mm_struct *mm)
> >  	vm_unacct_memory(nr_accounted);
> >  	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
> >  	tlb_finish_mmu(tlb, 0, end);
> > +	mmu_notifier_release(mm);
> 
> Can we consider moving this notifier or introducing an additional notifier
> in the release or a flag to this one indicating early/late.

There is only one call right now?

> The GRU that Jack is concerned with would benefit from the early in
> that it could just invalidate the GRU context and immediately all GRU
> TLB entries are invalid.  I believe Jack would like to also be able to
> remove his entry from the mmu_notifier list in an effort to avoid the
> page and range callouts.

The TLB entries are removed by earlier invalidate_range calls. I would 
think that no TLBs are left at this point. Its simply a matter of 
releasing any still allocated resources through this callback.
 
> XPMEM, would also benefit from a call early.  We could make all the
> segments as being torn down and start the recalls.  We already have
> this code in and working (have since it was first written 6 years ago).
> In this case, all segments are torn down with a single message to each
> of the importing partitions.  In contrast, the teardown code which would
> happen now would be one set of messages for each vma.

So we need an additional global teardown call? Then we'd need to switch 
off the vma based invalidate_range()?



* Re: [patch 1/4] mmu_notifier: Core code
  2008-01-25 18:47     ` Christoph Lameter
@ 2008-01-25 18:56       ` Robin Holt
  2008-01-25 19:03         ` Christoph Lameter
  0 siblings, 1 reply; 30+ messages in thread
From: Robin Holt @ 2008-01-25 18:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Andrea Arcangeli, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

On Fri, Jan 25, 2008 at 10:47:04AM -0800, Christoph Lameter wrote:
> On Fri, 25 Jan 2008, Robin Holt wrote:
> 
> > I realize it is a minor nit, but since we put the continuation in column
> > 81 in the next define, can we do the same here and make this more
> > readable?
> 
> We need to fix the next define to not use column 81.
> Found a couple of more 80 column infractions. Will be fixed in next 
> release.
> 
> > > +void mmu_notifier_release(struct mm_struct *mm)
> > > +{
> > > +	struct mmu_notifier *mn;
> > > +	struct hlist_node *n;
> > > +
> > > +	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> > > +		rcu_read_lock();
> > > +		hlist_for_each_entry_rcu(mn, n,
> > > +					  &mm->mmu_notifier.head, hlist) {
> > > +			if (mn->ops->release)
> > > +				mn->ops->release(mn, mm);
> > > +			hlist_del(&mn->hlist);
> > 
> > I think the hlist_del needs to be before the function callout so we can free
> > the structure without a use-after-free issue.
> 
> The list head is in the mm_struct. This will be freed later.
> 

I meant the structure pointed to by mn.  I assume it is intended that
structure be kmalloc'd as part of a larger structure.  The driver is the
entity which created that structure and should be the one to free it.

> > > +void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
> > > +{
> > > +	spin_lock(&mmu_notifier_list_lock);
> > 
> > Shouldn't this really be protected by the down_write(mmap_sem)?  Maybe:
> 
> Ok. We could switch this to mmap_sem protection for the mm_struct but the 
> rmap notifier is not associated with an mm_struct. So we would need to 
> keep it there. Since we already have a spinlock: Just use it for both to 
> avoid further complications.

But now you are putting a global lock in where it is inappropriate.

> 
> > > +	spin_lock(&mmu_notifier_list_lock);
> > > +	hlist_del(&mn->hlist);
> > 
> > hlist_del_rcu?  Ditto on the lock.
> 
> Peter already mentioned that and I have posted patches that address this 
> issue.
> 
> > > @@ -2043,6 +2044,7 @@ void exit_mmap(struct mm_struct *mm)
> > >  	vm_unacct_memory(nr_accounted);
> > >  	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
> > >  	tlb_finish_mmu(tlb, 0, end);
> > > +	mmu_notifier_release(mm);
> > 
> > Can we consider moving this notifier or introducing an additional notifier
> > in the release or a flag to this one indicating early/late.
> 
> There is only one call right now?
> 
> > The GRU that Jack is concerned with would benefit from the early in
> > that it could just invalidate the GRU context and immediately all GRU
> > TLB entries are invalid.  I believe Jack would like to also be able to
> > remove his entry from the mmu_notifier list in an effort to avoid the
> > page and range callouts.
> 
> The TLB entries are removed by earlier invalidate_range calls. I would 
> think that no TLBs are left at this point. Its simply a matter of 
> releasing any still allocated resources through this callback.

What I was asking for is a way to avoid those numerous callouts for
drivers that can do early cleanup.

>  
> > XPMEM, would also benefit from a call early.  We could make all the
> > segments as being torn down and start the recalls.  We already have
> > this code in and working (have since it was first written 6 years ago).
> > In this case, all segments are torn down with a single message to each
> > of the importing partitions.  In contrast, the teardown code which would
> > happen now would be one set of messages for each vma.
> 
> So we need an additional global teardown call? Then we'd need to switch 
> off the vma based invalidate_range()?

No, EXACTLY what I was originally asking for: either move this call site
up, introduce an additional mmu_notifier op, or place this one in two
locations with a flag indicating which call is being made.

Thanks,
Robin


* Re: [patch 1/4] mmu_notifier: Core code
  2008-01-25 18:56       ` Robin Holt
@ 2008-01-25 19:03         ` Christoph Lameter
  2008-01-25 19:35           ` Robin Holt
  0 siblings, 1 reply; 30+ messages in thread
From: Christoph Lameter @ 2008-01-25 19:03 UTC (permalink / raw)
  To: Robin Holt
  Cc: Andrea Arcangeli, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Fri, 25 Jan 2008, Robin Holt wrote:

> > > > +void mmu_notifier_release(struct mm_struct *mm)
> > > > +{
> > > > +	struct mmu_notifier *mn;
> > > > +	struct hlist_node *n;
> > > > +
> > > > +	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> > > > +		rcu_read_lock();
> > > > +		hlist_for_each_entry_rcu(mn, n,
> > > > +					  &mm->mmu_notifier.head, hlist) {
> > > > +			if (mn->ops->release)
> > > > +				mn->ops->release(mn, mm);
> > > > +			hlist_del(&mn->hlist);
> > > 
> > > I think the hlist_del needs to be before the function callout so we can free
> > > the structure without a use-after-free issue.
> > 
> > The list head is in the mm_struct. This will be freed later.
> > 
> 
> I meant the structure pointed to by &mn.  I assume it is intended that
> structure be kmalloc'd as part of a larger structure.  The driver is the
> entity which created that structure and should be the one to free it.

mn will be pointing at the entries on the list in the mm_struct, one 
after the other. You mean the ops structure?

> > > > +void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
> > > > +{
> > > > +	spin_lock(&mmu_notifier_list_lock);
> > > 
> > > Shouldn't this really be protected by the down_write(mmap_sem)?  Maybe:
> > 
> > Ok. We could switch this to mmap_sem protection for the mm_struct but the 
> > rmap notifier is not associated with an mm_struct. So we would need to 
> > keep it there. Since we already have a spinlock: Just use it for both to 
> > avoid further complications.
> 
> But now you are putting a global lock in where it is inappropriate.

The lock is only used during register and unregister. Very low level 
usage.

> > > XPMEM, would also benefit from a call early.  We could make all the
> > > segments as being torn down and start the recalls.  We already have
> > > this code in and working (have since it was first written 6 years ago).
> > > In this case, all segments are torn down with a single message to each
> > > of the importing partitions.  In contrast, the teardown code which would
> > > happen now would be one set of messages for each vma.
> > 
> > So we need an additional global teardown call? Then we'd need to switch 
> > off the vma based invalidate_range()?
> 
> No, EXACTLY what I originally was asking for, either move this call site
> up, introduce an additional mmu_notifier op, or place this one in two
> locations with a flag indicating which call is being made.

Add a new invalidate_all() call? Then on exit we do

1. invalidate_all()

2. invalidate_range() for each vma

3. release()

We cannot simply move the call up because there will be future range 
callbacks on vma invalidation.
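
A hypothetical sketch of that ordering (an invalidate_all op is not part
of the posted patch; the name and placement are just the proposal above):

	/* new op next to release/age_page/invalidate_page/invalidate_range */
	void (*invalidate_all)(struct mmu_notifier *mn, struct mm_struct *mm);

	/* exit_mmap() would then fire the hooks in this order:
	 *  1. invalidate_all()       - single global teardown callout
	 *  2. invalidate_range()     - per region, as emitted by unmap_vmas()
	 *  3. mmu_notifier_release() - final freeing of notifier state
	 */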




* Re: [patch 1/4] mmu_notifier: Core code
  2008-01-25 19:03         ` Christoph Lameter
@ 2008-01-25 19:35           ` Robin Holt
  2008-01-25 20:10             ` Christoph Lameter
  2008-01-25 21:18             ` Christoph Lameter
  0 siblings, 2 replies; 30+ messages in thread
From: Robin Holt @ 2008-01-25 19:35 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Andrea Arcangeli, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

On Fri, Jan 25, 2008 at 11:03:07AM -0800, Christoph Lameter wrote:
> > > > Shouldn't this really be protected by the down_write(mmap_sem)?  Maybe:
> > > 
> > > Ok. We could switch this to mmap_sem protection for the mm_struct but the 
> > > rmap notifier is not associated with an mm_struct. So we would need to 
> > > keep it there. Since we already have a spinlock: Just use it for both to 
> > > avoid further complications.
> > 
> > But now you are putting a global lock in where it is inappropriate.
> 
> The lock is only used during register and unregister. Very low level 
> usage.

Seems to me that is the same argument used for lock_kernel.  I am saying
we have a perfectly reasonable way to separate the protections down to
their smallest scope.  For the things hanging off the mm, mmap_sem; for
the other list, a list-specific lock.

Keep in mind that on a 2048p SSI MPI job starting up, we have 2048 ranks
doing this at the same time, 6 times within their address range.  That
seems like a lock which could get hot fairly quickly.  It may be for a
short period during startup and shutdown, but it is there.


> 
> > > > XPMEM, would also benefit from a call early.  We could make all the
> > > > segments as being torn down and start the recalls.  We already have
> > > > this code in and working (have since it was first written 6 years ago).
> > > > In this case, all segments are torn down with a single message to each
> > > > of the importing partitions.  In contrast, the teardown code which would
> > > > happen now would be one set of messages for each vma.
> > > 
> > > So we need an additional global teardown call? Then we'd need to switch 
> > > off the vma based invalidate_range()?
> > 
> > No, EXACTLY what I originally was asking for, either move this call site
> > up, introduce an additional mmu_notifier op, or place this one in two
> > locations with a flag indicating which call is being made.
> 
> Add a new invalidate_all() call? Then on exit we do
> 
> 1. invalidate_all()

That will be fine as long as we can unregister the ops notifier and free
the structure.  Otherwise, we end up being called needlessly.

> 
> 2. invalidate_range() for each vma
> 
> 3. release()
> 
> We cannot simply move the call up because there will be future range 
> callbacks on vma invalidation.

I am not sure what this means.  Right now, if you were to notify XPMEM
the process is exiting, we would take care of all the recalling of pages
exported by this process, clearing those pages' cache lines from cache,
and raising memory protections.  I would assume that moving the callout
earlier would expect the same of every driver.


Thanks,
Robin

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 1/4] mmu_notifier: Core code
  2008-01-25 19:35           ` Robin Holt
@ 2008-01-25 20:10             ` Christoph Lameter
  2008-01-26 11:56               ` Robin Holt
  2008-01-25 21:18             ` Christoph Lameter
  1 sibling, 1 reply; 30+ messages in thread
From: Christoph Lameter @ 2008-01-25 20:10 UTC (permalink / raw)
  To: Robin Holt
  Cc: Andrea Arcangeli, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Fri, 25 Jan 2008, Robin Holt wrote:

> Keep in mind that on a 2048p SSI MPI job starting up, we have 2048 ranks
> doing this at the same time 6 times withing their address range.  That
> seems like a lock which could get hot fairly quickly.  It may be for a
> short period during startup and shutdown, but it is there.

Ok. I guess we need to have a __register_mmu_notifier that expects the 
mmap_sem to be held then?

> > 1. invalidate_all()
> 
> That will be fine as long as we can unregister the ops notifier and free
> the structure.  Otherwise, we end up being called needlessly.

No, you cannot do that because there are still callbacks that come later.
The invalidate_all may lead to invalidate_range() doing nothing for this
mm. The ops notifier and the freeing of the structure have to wait until
release().

> > 2. invalidate_range() for each vma
> > 
> > 3. release()
> > 
> > We cannot simply move the call up because there will be future range 
> > callbacks on vma invalidation.
> 
> I am not sure what this means.  Right now, if you were to notify XPMEM
> that the process is exiting, we would take care of all the recalling of
> pages exported by this process, clearing those pages' cache lines from
> cache, and raising memory protections.  I would assume that moving the
> callout earlier would expect the same of every driver.

That does not sync with the current scheme of the invalidate_range() 
hooks. We would have to do a global invalidate early and then place the 
other invalidate_range hooks in such a way that none is called later in 
process exit handling.
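
To make the release()-time lifetime rule concrete, here is a minimal
driver-side sketch (all names hypothetical) of freeing the notifier from
the ->release() callback via call_rcu:

struct xpmem_like_state {
	struct mmu_notifier mn;
	struct rcu_head rcu;
	/* ... external rmap, segment data ... */
};

static void state_free_rcu(struct rcu_head *rcu)
{
	kfree(container_of(rcu, struct xpmem_like_state, rcu));
}

/* ->release(): recall/unexport everything, then free only after an
 * RCU grace period, since other CPUs may still walk the hlist. */
static void state_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
	struct xpmem_like_state *s =
		container_of(mn, struct xpmem_like_state, mn);

	/* recall exported pages, raise protections, etc. */
	call_rcu(&s->rcu, state_free_rcu);
}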

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 1/4] mmu_notifier: Core code
  2008-01-25 19:35           ` Robin Holt
  2008-01-25 20:10             ` Christoph Lameter
@ 2008-01-25 21:18             ` Christoph Lameter
  2008-01-26 12:01               ` Robin Holt
  1 sibling, 1 reply; 30+ messages in thread
From: Christoph Lameter @ 2008-01-25 21:18 UTC (permalink / raw)
  To: Robin Holt
  Cc: Andrea Arcangeli, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

Diff so far against V1

- Improve RCU support. (There is now a synchronize_rcu in
mmu_notifier_release, which is bad.)

- Clean compile for !MMU_NOTIFIER

- Use mmap_sem for serializing additions to the mmu_notifier list in the
mm_struct (but still a global spinlock for mmu_rmap_notifier; the
registration function is called only a couple of times)

---
 include/linux/list.h         |   14 ++++++++++++++
 include/linux/mm_types.h     |    2 --
 include/linux/mmu_notifier.h |   39 ++++++++++++++++++++++++++++++++++++---
 mm/mmu_notifier.c            |   28 +++++++++++++++++++---------
 4 files changed, 69 insertions(+), 14 deletions(-)

Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- linux-2.6.orig/mm/mmu_notifier.c	2008-01-25 12:14:49.000000000 -0800
+++ linux-2.6/mm/mmu_notifier.c	2008-01-25 12:14:49.000000000 -0800
@@ -15,17 +15,18 @@
 void mmu_notifier_release(struct mm_struct *mm)
 {
 	struct mmu_notifier *mn;
-	struct hlist_node *n;
+	struct hlist_node *n, *t;
 
 	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
 		rcu_read_lock();
-		hlist_for_each_entry_rcu(mn, n,
+		hlist_for_each_entry_safe_rcu(mn, n, t,
 					  &mm->mmu_notifier.head, hlist) {
 			if (mn->ops->release)
 				mn->ops->release(mn, mm);
 			hlist_del(&mn->hlist);
 		}
 		rcu_read_unlock();
+		synchronize_rcu();
 	}
 }
 
@@ -53,24 +54,33 @@ int mmu_notifier_age_page(struct mm_stru
 	return young;
 }
 
-static DEFINE_SPINLOCK(mmu_notifier_list_lock);
+/*
+ * Note that all notifiers use RCU. The updates are only guaranteed to be
+ * visible to other processes after an RCU quiescent period!
+ */
+void __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
+}
+EXPORT_SYMBOL_GPL(__mmu_notifier_register);
 
 void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
 {
-	spin_lock(&mmu_notifier_list_lock);
-	hlist_add_head(&mn->hlist, &mm->mmu_notifier.head);
-	spin_unlock(&mmu_notifier_list_lock);
+	down_write(&mm->mmap_sem);
+	__mmu_notifier_register(mn, mm);
+	up_write(&mm->mmap_sem);
 }
 EXPORT_SYMBOL_GPL(mmu_notifier_register);
 
 void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
 {
-	spin_lock(&mmu_notifier_list_lock);
-	hlist_del(&mn->hlist);
-	spin_unlock(&mmu_notifier_list_lock);
+	down_write(&mm->mmap_sem);
+	hlist_del_rcu(&mn->hlist);
+	up_write(&mm->mmap_sem);
 }
 EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
 
+static DEFINE_SPINLOCK(mmu_notifier_list_lock);
 HLIST_HEAD(mmu_rmap_notifier_list);
 
 void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn)
Index: linux-2.6/include/linux/list.h
===================================================================
--- linux-2.6.orig/include/linux/list.h	2008-01-25 12:14:47.000000000 -0800
+++ linux-2.6/include/linux/list.h	2008-01-25 12:14:49.000000000 -0800
@@ -991,6 +991,20 @@ static inline void hlist_add_after_rcu(s
 		({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
 	     pos = pos->next)
 
+/**
+ * hlist_for_each_entry_safe_rcu	- iterate over list of given type
+ * @tpos:	the type * to use as a loop cursor.
+ * @pos:	the &struct hlist_node to use as a loop cursor.
+ * @n:		temporary pointer
+ * @head:	the head for your list.
+ * @member:	the name of the hlist_node within the struct.
+ */
+#define hlist_for_each_entry_safe_rcu(tpos, pos, n, head, member)	 \
+	for (pos = (head)->first;					 \
+	     rcu_dereference(pos) && ({ n = pos->next; 1;}) &&		 \
+		({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
+	     pos = n)
+
 #else
 #warning "don't include kernel headers in userspace"
 #endif /* __KERNEL__ */
Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h	2008-01-25 12:14:49.000000000 -0800
+++ linux-2.6/include/linux/mm_types.h	2008-01-25 12:14:49.000000000 -0800
@@ -224,9 +224,7 @@ struct mm_struct {
 	rwlock_t		ioctx_list_lock;
 	struct kioctx		*ioctx_list;
 
-#ifdef CONFIG_MMU_NOTIFIER
 	struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
-#endif
 };
 
 #endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- linux-2.6.orig/include/linux/mmu_notifier.h	2008-01-25 12:14:49.000000000 -0800
+++ linux-2.6/include/linux/mmu_notifier.h	2008-01-25 13:07:54.000000000 -0800
@@ -46,6 +46,10 @@ struct mmu_notifier {
 };
 
 struct mmu_notifier_ops {
+	/*
+	 * Note the mmu_notifier structure must be released with
+	 * call_rcu
+	 */
 	void (*release)(struct mmu_notifier *mn,
 			struct mm_struct *mm);
 	int (*age_page)(struct mmu_notifier *mn,
@@ -78,10 +82,16 @@ struct mmu_rmap_notifier_ops {
 
 #ifdef CONFIG_MMU_NOTIFIER
 
+/* Must hold the mmap_sem */
+extern void __mmu_notifier_register(struct mmu_notifier *mn,
+				  struct mm_struct *mm);
+/* Will acquire mmap_sem */
 extern void mmu_notifier_register(struct mmu_notifier *mn,
 				  struct mm_struct *mm);
+/* Will acquire mmap_sem */
 extern void mmu_notifier_unregister(struct mmu_notifier *mn,
 				    struct mm_struct *mm);
+
 extern void mmu_notifier_release(struct mm_struct *mm);
 extern int mmu_notifier_age_page(struct mm_struct *mm,
 				 unsigned long address);
@@ -130,15 +140,38 @@ extern struct hlist_head mmu_rmap_notifi
 
 #else /* CONFIG_MMU_NOTIFIER */
 
-#define mmu_notifier(function, mm, args...) do { } while (0)
-#define mmu_rmap_notifier(function, args...) do { } while (0)
+/*
+ * Notifiers that use the parameters that they were passed so that the
+ * compiler does not complain about unused variables but does proper
+ * parameter checks even if !CONFIG_MMU_NOTIFIER.
+ * Macros generate no code.
+ */
+#define mmu_notifier(function, mm, args...)				\
+	do {								\
+		if (0) {						\
+			struct mmu_notifier *__mn;			\
+									\
+			__mn = (struct mmu_notifier *)(0x00ff);		\
+			__mn->ops->function(__mn, mm, args);		\
+		};							\
+	} while (0)
+
+#define mmu_rmap_notifier(function, args...)				\
+	do {								\
+		if (0) {						\
+			struct mmu_rmap_notifier *__mrn;		\
+									\
+			__mrn = (struct mmu_rmap_notifier *)(0x00ff);	\
+			__mrn->ops->function(__mrn, args);		\
+		}							\
+	} while (0)
 
 static inline void mmu_notifier_register(struct mmu_notifier *mn,
 						struct mm_struct *mm) {}
 static inline void mmu_notifier_unregister(struct mmu_notifier *mn,
 						struct mm_struct *mm) {}
 static inline void mmu_notifier_release(struct mm_struct *mm) {}
-static inline void mmu_notifier_age(struct mm_struct *mm,
+static inline int mmu_notifier_age_page(struct mm_struct *mm,
 				unsigned long address)
 {
 	return 0;


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 0/4] [RFC] MMU Notifiers V1
  2008-01-25 11:42 ` [patch 0/4] [RFC] MMU Notifiers V1 Andrea Arcangeli
  2008-01-25 12:43   ` Robin Holt
  2008-01-25 18:31   ` Christoph Lameter
@ 2008-01-25 21:18   ` Benjamin Herrenschmidt
  2008-01-25 21:25     ` Christoph Lameter
  2008-01-28 16:10   ` Izik Eidus
  3 siblings, 1 reply; 30+ messages in thread
From: Benjamin Herrenschmidt @ 2008-01-25 21:18 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Christoph Lameter, Robin Holt, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins


On Fri, 2008-01-25 at 12:42 +0100, Andrea Arcangeli wrote:
> On Thu, Jan 24, 2008 at 09:56:06PM -0800, Christoph Lameter wrote:
> > Andrea's mmu_notifier #4 -> RFC V1
> > 
> > - Merge subsystem rmap based with Linux rmap based approach
> > - Move Linux rmap based notifiers out of macro
> > - Try to account for what locks are held while the notifiers are
> >   called.
> > - Develop a patch sequence that separates out the different types of
> >   hooks so that it is easier to review their use.
> > - Avoid adding #include to linux/mm_types.h
> > - Integrate RCU logic suggested by Peter.
> 
> I'm glad you're converging on something a bit saner and much much
> closer to my code, plus perfectly usable by KVM optimal rmap design
> too. It would have preferred if you would have sent me patches like
> Peter did for review and merging etc... that would have made review
> especially easier. Anyway I'm used to that on lkml so it's ok, I just
> need this patch to be included in mainline, everything else is
> irrelevant to me.

Also, wouldn't there be a problem with something trying to use that
interface to keep in sync a secondary device MMU such as the DRM or
other accelerators, which might need virtual address based
invalidation?

Ben.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 0/4] [RFC] MMU Notifiers V1
  2008-01-25 21:18   ` Benjamin Herrenschmidt
@ 2008-01-25 21:25     ` Christoph Lameter
  0 siblings, 0 replies; 30+ messages in thread
From: Christoph Lameter @ 2008-01-25 21:25 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Sat, 26 Jan 2008, Benjamin Herrenschmidt wrote:

> Also, wouldn't there be a problem with something trying to use that
> interface to keep in sync a secondary device MMU such as the DRM or
> other accelerators, which might need virtual address based
> invalidation ?

Yes, just doing the rmap-based solution would have required DRM etc. to 
maintain their own rmaps. So it looks like we need to go with both 
variants. Note that secondary device MMUs that need to run code outside of 
atomic context may still need to create their own rmaps.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 1/4] mmu_notifier: Core code
  2008-01-25 20:10             ` Christoph Lameter
@ 2008-01-26 11:56               ` Robin Holt
  2008-01-28 18:51                 ` Christoph Lameter
  0 siblings, 1 reply; 30+ messages in thread
From: Robin Holt @ 2008-01-26 11:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Andrea Arcangeli, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

> > > 1. invalidate_all()
> > 
> > That will be fine as long as we can unregister the ops notifier and free
> > the structure.  Otherwise, we end up being called needlessly.
> 
> No you cannot do that because there are still callbacks that come later. 
> The invalidate_all may lead to invalidate_range() doing nothing for this 
> mm. The ops notifier and the freeing of the structure has to wait until 
> release().

Could you be a little more clear here?  Are you saying that the other
callbacks will need to do work?  I can assure you we will clean up those
pages and raise memory protections.  It will also be done in a much more
efficient fashion than the individual callouts.

If, on the other hand, you are saying we cannot because of the way
we traverse the list, can we return a result indicating to the caller
that we would like to be unregistered, and then have the mmu_notifier
code do the remove followed by a call to the release notifier?

> 
> > > 2. invalidate_range() for each vma
> > > 
> > > 3. release()
> > > 
> > > We cannot simply move the call up because there will be future range 
> > > callbacks on vma invalidation.
> > 
> > I am not sure what this means.  Right now, if you were to notify XPMEM
> > that the process is exiting, we would take care of all the recalling of
> > pages exported by this process, clearing those pages' cache lines from
> > cache, and raising memory protections.  I would assume that moving the
> > callout earlier would expect the same of every driver.
> 
> That does not sync with the current scheme of the invalidate_range() 
> hooks. We would have to do a global invalidate early and then place the 
> other invalidate_range hooks in such a way that none is called in later in 
> process exit handling.

But if the notifier is removed from the list following the invalidate_all
callout, there would be no additional callouts.

Thanks,
Robin

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 1/4] mmu_notifier: Core code
  2008-01-25 21:18             ` Christoph Lameter
@ 2008-01-26 12:01               ` Robin Holt
  2008-01-28 18:44                 ` Christoph Lameter
  0 siblings, 1 reply; 30+ messages in thread
From: Robin Holt @ 2008-01-26 12:01 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Andrea Arcangeli, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

>  void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
>  {
> -	spin_lock(&mmu_notifier_list_lock);
> -	hlist_add_head(&mn->hlist, &mm->mmu_notifier.head);
> -	spin_unlock(&mmu_notifier_list_lock);
> +	down_write(&mm->mmap_sem);
> +	__mmu_notifier_register(mn, mm);
> +	up_write(&mm->mmap_sem);
>  }
>  EXPORT_SYMBOL_GPL(mmu_notifier_register);

But what if the caller is already holding the mmap_sem?  Why force the
acquire into this function?  Since we are dealing with a semaphore/mutex,
it is reasonable that other structures are protected by it and that more
work will be done under it, so put the weight of acquiring the sema under
the control of the caller, who can decide if more needs to be completed.

That was why I originally suggested creating a new rwsem_is_write_locked()
function and basing a BUG_ON upon that.
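
i.e., something like the following sketch, where rwsem_is_write_locked()
is the proposed helper, not an existing kernel API:

void __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
{
	/* catch callers that do not hold mmap_sem for write */
	BUG_ON(!rwsem_is_write_locked(&mm->mmap_sem));
	hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
}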

Thanks,
Robin

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 0/4] [RFC] MMU Notifiers V1
  2008-01-25 11:42 ` [patch 0/4] [RFC] MMU Notifiers V1 Andrea Arcangeli
                     ` (2 preceding siblings ...)
  2008-01-25 21:18   ` Benjamin Herrenschmidt
@ 2008-01-28 16:10   ` Izik Eidus
  2008-01-28 17:25     ` Andrea Arcangeli
  3 siblings, 1 reply; 30+ messages in thread
From: Izik Eidus @ 2008-01-28 16:10 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Christoph Lameter, Robin Holt, Avi Kivity, Nick Piggin,
	kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, steiner,
	linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

Andrea Arcangeli wrote:
> On Thu, Jan 24, 2008 at 09:56:06PM -0800, Christoph Lameter wrote:
>   
>> Andrea's mmu_notifier #4 -> RFC V1
>>
>> - Merge subsystem rmap based with Linux rmap based approach
>> - Move Linux rmap based notifiers out of macro
>> - Try to account for what locks are held while the notifiers are
>>   called.
>> - Develop a patch sequence that separates out the different types of
>>   hooks so that it is easier to review their use.
>> - Avoid adding #include to linux/mm_types.h
>> - Integrate RCU logic suggested by Peter.
>>     
>
> I'm glad you're converging on something a bit saner and much much
> closer to my code, plus perfectly usable by KVM optimal rmap design
> too. It would have preferred if you would have sent me patches like
> Peter did for review and merging etc... that would have made review
> especially easier. Anyway I'm used to that on lkml so it's ok, I just
> need this patch to be included in mainline, everything else is
> irrelevant to me.
>
> On a technical merit this still partially makes me sick and I think
> it's the last issue to debate.
>
> @@ -971,6 +974,9 @@ int try_to_unmap(struct page *page, int 
>         else
>                 ret = try_to_unmap_file(page, migration);
>
> +       if (unlikely(PageExternalRmap(page)))
> +               mmu_rmap_notifier(invalidate_page, page);
> +
>         if (!page_mapped(page))
>                 ret = SWAP_SUCCESS;
>         return ret;
>
> I find the above hard to accept, because the moment you work with
> physical pages and not "mm+address" I think you couldn't possibly care
> if page_mapped is true or false, and I think the above notifier should
> be called _outside_ try_to_unmap. Infact I'd call
> mmu_rmap_notifier(invalidate_page, page); only if page_unmapped is
> false and the linux pte is gone already (practically just before the
> page_count == 2 check and after try_to_unmap).
>   

I don't understand how that is better than notification on tlb flush.
I mean, cpus have a very similar problem to ours,
so why not deal with the notification at the same place as they do (tlb
flush)?

Moreover, notification on tlb flush allows unmodified applications to work
with us.
For example, the memory merging driver that I wrote was able to work with
kvm without any change to its source code.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 0/4] [RFC] MMU Notifiers V1
  2008-01-28 16:10   ` Izik Eidus
@ 2008-01-28 17:25     ` Andrea Arcangeli
  2008-01-28 19:04       ` Christoph Lameter
  0 siblings, 1 reply; 30+ messages in thread
From: Andrea Arcangeli @ 2008-01-28 17:25 UTC (permalink / raw)
  To: Izik Eidus
  Cc: Christoph Lameter, Robin Holt, Avi Kivity, Nick Piggin,
	kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, steiner,
	linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

On Mon, Jan 28, 2008 at 06:10:39PM +0200, Izik Eidus wrote:
> i dont understand how is that better than notification on tlb flush?

I certainly agree. The quoted call wasn't actually the one that could
be moved to a single place in the .h file, though. But the 4/4 patch
could be reduced to a few-line patch, and it would not require any
change to ksm and mm/*.c internals that way etc...

If Christoph prefers to expand the two bits I added in the .h file
all over the .c files, that's ok with us. There is no good excuse for
that IMHO, because a __ptep_clear_flush could be added to optimize away
the notifiers when the flush is followed by an invalidate_range (not the
common case), and if we want to just mark an spte readonly instead of
invalidating it we could implement a new ptep_ function that would
invoke a new mmu notifier method. So there's full room for
optimization without requiring the expansion.

In the end it's only a coding style issue but one that will become
visible in the VM (i.e. ksm has to be modified to add a mmu_notifier
call after ptep_clear_flush, it won't work automatically without
changes anymore).
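
For reference, a sketch of that .h-only placement (names illustrative,
not from the posted patches): in the scheme above this wrapper would
simply *be* ptep_clear_flush(), with the raw flush renamed to
__ptep_clear_flush() for callers followed by an invalidate_range anyway,
so ksm and mm/*.c would pick up the notification without changes:

static inline pte_t ptep_clear_flush_notified(struct vm_area_struct *vma,
				unsigned long address, pte_t *ptep)
{
	pte_t pte = ptep_clear_flush(vma, address, ptep); /* raw flush */

	/* external mappings must go away whenever the pte does */
	mmu_notifier(invalidate_page, vma->vm_mm, address);
	return pte;
}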

So I'd like to know what can we do to help to merge the 4 patches from
Christoph in mainline, I'd appreciate comments on them so we can help
to address any outstanding issue!

It's very important to have this code in 2.6.25 final. KVM requires
mmu notifiers for reliable swapping, madvise/ballooning, ksm etc... so
it's high priority to get something merged to provide this
functionality regardless of its coding style ;).

Thanks!
Andrea

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 1/4] mmu_notifier: Core code
  2008-01-26 12:01               ` Robin Holt
@ 2008-01-28 18:44                 ` Christoph Lameter
  0 siblings, 0 replies; 30+ messages in thread
From: Christoph Lameter @ 2008-01-28 18:44 UTC (permalink / raw)
  To: Robin Holt
  Cc: Andrea Arcangeli, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Sat, 26 Jan 2008, Robin Holt wrote:

> But what if the caller is already holding the mmap_sem?  Why force the
> acquire into this function?  Since we are dealing with a semaphore/mutex,

Then you need to call __register_mmu_notifier.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 1/4] mmu_notifier: Core code
  2008-01-26 11:56               ` Robin Holt
@ 2008-01-28 18:51                 ` Christoph Lameter
  0 siblings, 0 replies; 30+ messages in thread
From: Christoph Lameter @ 2008-01-28 18:51 UTC (permalink / raw)
  To: Robin Holt
  Cc: Andrea Arcangeli, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Sat, 26 Jan 2008, Robin Holt wrote:

> > No, you cannot do that because there are still callbacks that come later.
> > The invalidate_all may lead to invalidate_range() doing nothing for this
> > mm. The ops notifier and the freeing of the structure have to wait until
> > release().
> 
> Could you be a little more clear here?  Are you saying that the other
> callbacks will need to do work?  I can assure you we will clean up those
> pages and raise memory protections.  It will also be done in a much more
> efficient fashion than the individual callouts.

No, the other callbacks need to work in the sense that they can be called.
You could have them do nothing after an invalidate_all().
But you cannot release the allocated structs needed for list traversal 
etc.

> If, on the other hand, you are saying we cannot because of the way
> we traverse the list, can we return a result indicating to the caller
> that we would like to be unregistered, and then have the mmu_notifier
> code do the remove followed by a call to the release notifier?

You would need to release the resources when the release notifier is 
called.

> > That does not sync with the current scheme of the invalidate_range() 
> > hooks. We would have to do a global invalidate early and then place the 
> > other invalidate_range hooks in such a way that none is called later in 
> > process exit handling.
> 
> But if the notifier is removed from the list following the invalidate_all
> callout, there would be no additional callouts.

Hmmm.... Okay, I did not think about that. Then you would need to do a 
synchronize_rcu() in invalidate_all()?
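
Something along these lines (a sketch only; the invalidate_all op is
still hypothetical):

void mmu_notifier_invalidate_all(struct mm_struct *mm)
{
	struct mmu_notifier *mn;
	struct hlist_node *n, *t;

	hlist_for_each_entry_safe(mn, n, t, &mm->mmu_notifier.head, hlist) {
		if (mn->ops->invalidate_all)
			mn->ops->invalidate_all(mn, mm);
		hlist_del_rcu(&mn->hlist);
	}
	/* wait out concurrent RCU list walkers before the notifiers
	 * may be freed by their owners */
	synchronize_rcu();
}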


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 0/4] [RFC] MMU Notifiers V1
  2008-01-28 17:25     ` Andrea Arcangeli
@ 2008-01-28 19:04       ` Christoph Lameter
  2008-01-28 19:40         ` Andrea Arcangeli
  0 siblings, 1 reply; 30+ messages in thread
From: Christoph Lameter @ 2008-01-28 19:04 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Izik Eidus, Robin Holt, Avi Kivity, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Mon, 28 Jan 2008, Andrea Arcangeli wrote:

> So I'd like to know what can we do to help to merge the 4 patches from
> Christoph in mainline, I'd appreciate comments on them so we can help
> to address any outstanding issue!

There are still some pending issues (RCU troubles). I will post V2 today.
 
> It's very important to have this code in 2.6.25 final. KVM requires
> mmu notifiers for reliable swapping, madvise/ballooning, ksm etc... so
> it's high priority to get something merged to provide this
> functionality regardless of its coding style ;).

We also need this urgently.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 0/4] [RFC] MMU Notifiers V1
  2008-01-28 19:04       ` Christoph Lameter
@ 2008-01-28 19:40         ` Andrea Arcangeli
  2008-01-28 20:16           ` Christoph Lameter
  0 siblings, 1 reply; 30+ messages in thread
From: Andrea Arcangeli @ 2008-01-28 19:40 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Izik Eidus, Robin Holt, Avi Kivity, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Mon, Jan 28, 2008 at 11:04:43AM -0800, Christoph Lameter wrote:
> On Mon, 28 Jan 2008, Andrea Arcangeli wrote:
> 
> > So I'd like to know what can we do to help to merge the 4 patches from
> > Christoph in mainline, I'd appreciate comments on them so we can help
> > to address any outstanding issue!
> 
> There are still some pending issues (RCU troubles). I will post V2 today.

With regard to the synchronize_rcu troubles, they can also be left to the
notifier user to solve. Certainly having the synchronize_rcu in _release,
like in your last incremental patches, will require fewer complications
(kvm pins the mm, so I suppose we could batch the call_rcu externally
too). But _release is not a fast path for KVM usage, so your V2 is
surely ok (and simpler to deal with) too.

For registration, synchronize_rcu is the only way if the notifiers
have to fire synchronously before mmu_notifier_register returns, but
that too can be left up to the caller if required (for example, KVM
doesn't need it). Otherwise there could be two entry points,
mmu_notifier_register and mmu_notifier_register_rcu, where the latter
calls synchronize_rcu before returning.
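
A minimal sketch of that second flavour (hypothetical; not in any
posted patch):

void mmu_notifier_register_rcu(struct mmu_notifier *mn,
			       struct mm_struct *mm)
{
	down_write(&mm->mmap_sem);
	hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
	up_write(&mm->mmap_sem);
	/* only return once the notifier is guaranteed visible to
	 * all RCU list walkers */
	synchronize_rcu();
}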

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 0/4] [RFC] MMU Notifiers V1
  2008-01-28 19:40         ` Andrea Arcangeli
@ 2008-01-28 20:16           ` Christoph Lameter
  0 siblings, 0 replies; 30+ messages in thread
From: Christoph Lameter @ 2008-01-28 20:16 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Izik Eidus, Robin Holt, Avi Kivity, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Mon, 28 Jan 2008, Andrea Arcangeli wrote:

> With regard to the synchronize_rcu troubles they also be left to the
> notifier-user to solve. Certainly having the synchronize_rcu like in

Ahh. Ok.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 1/4] mmu_notifier: Core code
  2008-02-01 10:55   ` Robin Holt
  2008-02-01 11:04     ` Robin Holt
@ 2008-02-01 19:14     ` Christoph Lameter
  1 sibling, 0 replies; 30+ messages in thread
From: Christoph Lameter @ 2008-02-01 19:14 UTC (permalink / raw)
  To: Robin Holt
  Cc: Jack Steiner, Andrea Arcangeli, Avi Kivity, Izik Eidus,
	kvm-devel, Peter Zijlstra, linux-kernel, linux-mm,
	daniel.blueman

On Fri, 1 Feb 2008, Robin Holt wrote:

> OK.  Now that release has been moved, I think I agree with you that the
> down_write(mmap_sem) can be used as our lock again and still work for
> Jack.  I would like a ruling from Jack as well.

Talked to Jack last night and he said it's okay.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 1/4] mmu_notifier: Core code
  2008-02-01 10:55   ` Robin Holt
@ 2008-02-01 11:04     ` Robin Holt
  2008-02-01 19:14     ` Christoph Lameter
  1 sibling, 0 replies; 30+ messages in thread
From: Robin Holt @ 2008-02-01 11:04 UTC (permalink / raw)
  To: Christoph Lameter, Jack Steiner
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, linux-kernel, linux-mm, daniel.blueman

On Fri, Feb 01, 2008 at 04:55:16AM -0600, Robin Holt wrote:
> OK.  Now that release has been moved, I think I agree with you that the
> down_write(mmap_sem) can be used as our lock again and still work for
> Jack.  I would like a ruling from Jack as well.

Ignore this, I was in the wrong work area.  I am sorry for adding to the
confusion.  This version has no locking requirement outside the driver
itself.

Sorry,
Robin

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch 1/4] mmu_notifier: Core code
  2008-02-01  5:04 ` [patch 1/4] mmu_notifier: Core code Christoph Lameter
@ 2008-02-01 10:55   ` Robin Holt
  2008-02-01 11:04     ` Robin Holt
  2008-02-01 19:14     ` Christoph Lameter
  0 siblings, 2 replies; 30+ messages in thread
From: Robin Holt @ 2008-02-01 10:55 UTC (permalink / raw)
  To: Christoph Lameter, Jack Steiner
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

OK.  Now that release has been moved, I think I agree with you that the
down_write(mmap_sem) can be used as our lock again and still work for
Jack.  I would like a ruling from Jack as well.

Thanks,
Robin

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [patch 1/4] mmu_notifier: Core code
  2008-02-01  5:04 [patch 0/4] [RFC] EMMU Notifiers V5 Christoph Lameter
@ 2008-02-01  5:04 ` Christoph Lameter
  2008-02-01 10:55   ` Robin Holt
  0 siblings, 1 reply; 30+ messages in thread
From: Christoph Lameter @ 2008-02-01  5:04 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

[-- Attachment #1: mmu_core --]
[-- Type: text/plain, Size: 14594 bytes --]

Notifier functions for hardware and software that establish external
references to pages of a Linux system. The notifier calls ensure that
external mappings are removed when the Linux VM removes memory ranges
or individual pages from a process.

This first portion is fitting for external mmus that do not have their
own rmap or need the ability to sleep before removing individual pages.

Two categories of external mmus are possible:

1. KVM style external mmus that have their own page table.
   These are capable of tracking pages in their page tables and
   can therefore increase the refcount on pages. An increased
   refcount guarantees page existence regardless of the VM's unmapping
   actions until the logic in the notifier call decides to drop a page.

2. GRU style external mmus that rely on the Linux page table for TLB lookups.
   These cannot track pages that are externally referenced.
   TLB entries can only be evicted as necessary.


Callbacks are registered with an mm_struct from a device driver using
mmu_notifier_register. When the VM removes pages (or restricts
permissions on pages), the callbacks are triggered.

The VM holds spinlocks in order to walk reverse maps in rmap.c. The single
page callback invalidate_page() is therefore always run with
spinlocks held (which limits what can be done in the callbacks).

The invalidate_range_begin/end callbacks can be run in atomic as well as
sleepable contexts. A flag is passed to indicate an atomic context.
The notifier may decide to defer actions if the context is atomic.

Pages must be marked dirty if dirty bits are found to be set in
the external ptes.

Requirements on synchronization within the driver:

     Multiple invalidate_range_begin/ends may be nested or called
     concurrently. That is legit. However, no new external references
     may be established as long as any invalidate_xxx is running or as long
     as any invalidate_range_begin() has not been completed through a
     corresponding call to invalidate_range_end().

     Locking within the notifier callbacks needs to serialize events
     correspondingly. One simple implementation would be the use of a spinlock
     that needs to be acquired for access to the page table or tlb managed by
     the driver. A rw lock could be used to allow multiple concurrent invalidates
     to run but then the driver needs to have additional internal synchronization
     for access to hardware resources.

     If all invalidate_xx notifier calls take the driver lock then it is possible
     to run follow_page() under the same lock. The lock can then guarantee
     that no page is removed and provides an additional existence guarantee
     of the page independent of the page count.

     invalidate_range_begin() must clear all references in the range
     and stop the establishment of new references.

     invalidate_range_end() reenables the establishment of references.
     The atomic parameter passed to invalidate_range_xx indicates that the function
     is called in an atomic context. We can sleep if atomic == 0.
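
As a hedged illustration of the simple spinlock scheme just described
(all drv_* names are hypothetical helpers, not part of this patch):

static DEFINE_SPINLOCK(drv_lock);

static void drv_invalidate_range_begin(struct mmu_notifier *mn,
		struct mm_struct *mm, unsigned long start,
		unsigned long end, int atomic)
{
	spin_lock(&drv_lock);
	drv_drop_external_refs(start, end);	/* hypothetical helper */
	drv_block_new_refs();			/* hypothetical helper */
	spin_unlock(&drv_lock);
}

static void drv_invalidate_range_end(struct mmu_notifier *mn,
		struct mm_struct *mm, unsigned long start,
		unsigned long end, int atomic)
{
	spin_lock(&drv_lock);
	drv_allow_new_refs();			/* hypothetical helper */
	spin_unlock(&drv_lock);
}

/* Establishing a reference takes the same lock, so no page can be
 * torn down while the driver instantiates its external pte.
 * Caller holds mmap_sem for read, as follow_page() requires. */
static struct page *drv_make_ref(struct vm_area_struct *vma,
				 unsigned long addr)
{
	struct page *page;

	spin_lock(&drv_lock);
	page = follow_page(vma, addr, FOLL_GET);
	if (page)
		drv_install_external_pte(vma, addr, page); /* hypothetical */
	spin_unlock(&drv_lock);
	return page;
}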

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

---
 include/linux/mm_types.h     |    8 +
 include/linux/mmu_notifier.h |  179 +++++++++++++++++++++++++++++++++++++++++++
 kernel/fork.c                |    2 
 mm/Kconfig                   |    4 
 mm/Makefile                  |    1 
 mm/mmap.c                    |    2 
 mm/mmu_notifier.c            |   76 ++++++++++++++++++
 7 files changed, 272 insertions(+)

Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h	2008-01-31 19:55:46.000000000 -0800
+++ linux-2.6/include/linux/mm_types.h	2008-01-31 19:59:51.000000000 -0800
@@ -153,6 +153,12 @@ struct vm_area_struct {
 #endif
 };
 
+struct mmu_notifier_head {
+#ifdef CONFIG_MMU_NOTIFIER
+	struct hlist_head head;
+#endif
+};
+
 struct mm_struct {
 	struct vm_area_struct * mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
@@ -219,6 +225,8 @@ struct mm_struct {
 	/* aio bits */
 	rwlock_t		ioctx_list_lock;
 	struct kioctx		*ioctx_list;
+
+	struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
 };
 
 #endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/mmu_notifier.h	2008-01-31 20:56:03.000000000 -0800
@@ -0,0 +1,179 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+/*
+ * MMU notifier
+ *
+ * Notifier functions for hardware and software that establish external
+ * references to pages of a Linux system. The notifier calls ensure that
+ * external mappings are removed when the Linux VM removes memory ranges
+ * or individual pages from a process.
+ *
+ * These fall into two classes:
+ *
+ * 1. mmu_notifier
+ *
+ * 	These are callbacks registered with an mm_struct. If pages are
+ * 	removed from an address space then callbacks are performed.
+ *
+ * 	Spinlocks must be held in order to walk reverse maps. The
+ * 	invalidate_page() callbacks are performed with spinlocks held.
+ *
+ * 	The invalidate_range_begin/end callbacks can be performed in contexts
+ * 	where sleeping is allowed or in atomic contexts. A flag is passed
+ * 	to indicate an atomic context.
+ *
+ *	Pages must be marked dirty if dirty bits are found to be set in
+ *	the external ptes.
+ */
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/rcupdate.h>
+#include <linux/mm_types.h>
+
+struct mmu_notifier_ops;
+
+struct mmu_notifier {
+	struct hlist_node hlist;
+	const struct mmu_notifier_ops *ops;
+};
+
+struct mmu_notifier_ops {
+	/*
+	 * The release notifier is called when no other execution threads
+	 * are left. Synchronization is not necessary.
+	 */
+	void (*release)(struct mmu_notifier *mn,
+			struct mm_struct *mm);
+
+	/*
+	 * age_page is called from contexts where the pte_lock is held
+	 */
+	int (*age_page)(struct mmu_notifier *mn,
+			struct mm_struct *mm,
+			unsigned long address);
+
+	/* invalidate_page is called from contexts where the pte_lock is held */
+	void (*invalidate_page)(struct mmu_notifier *mn,
+				struct mm_struct *mm,
+				unsigned long address);
+
+	/*
+	 * invalidate_range_begin() and invalidate_range_end() must be paired.
+	 *
+	 * Multiple invalidate_range_begin/ends may be nested or called
+	 * concurrently. That is legit. However, no new external references
+	 * may be established as long as any invalidate_xxx is running or
+	 * any invalidate_range_begin() has not been completed through a
+	 * corresponding call to invalidate_range_end().
+	 *
+	 * Locking within the notifier needs to serialize events correspondingly.
+	 *
+	 * If all invalidate_xx notifier calls take a driver lock then it is possible
+	 * to run follow_page() under the same lock. The lock can then guarantee
+	 * that no page is removed and provides an additional existence guarantee
+	 * of the page.
+	 *
+	 * invalidate_range_begin() must clear all references in the range
+	 * and stop the establishment of new references.
+	 *
+	 * invalidate_range_end() reenables the establishment of references.
+	 *
+	 * atomic indicates that the function is called in an atomic context.
+	 * We can sleep if atomic == 0.
+	 */
+	void (*invalidate_range_begin)(struct mmu_notifier *mn,
+				 struct mm_struct *mm,
+				 unsigned long start, unsigned long end,
+				 int atomic);
+
+	void (*invalidate_range_end)(struct mmu_notifier *mn,
+				 struct mm_struct *mm, unsigned long start,
+				 unsigned long end, int atomic);
+};
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+/*
+ * Must hold the mmap_sem for write.
+ *
+ * RCU is used to traverse the list. A quiescent period needs to pass
+ * before the notifier is guaranteed to be visible to all threads
+ */
+extern void mmu_notifier_register(struct mmu_notifier *mn,
+				  struct mm_struct *mm);
+
+/*
+ * Must hold mmap_sem for write.
+ *
+ * A quiescent period needs to pass before the mmu_notifier structure
+ * can be released. mmu_notifier_release() will wait for a quiescent period
+ * after calling the ->release callback. So it is safe to call
+ * mmu_notifier_unregister from the ->release function.
+ */
+extern void mmu_notifier_unregister(struct mmu_notifier *mn,
+				    struct mm_struct *mm);
+
+
+extern void mmu_notifier_release(struct mm_struct *mm);
+extern int mmu_notifier_age_page(struct mm_struct *mm,
+				 unsigned long address);
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
+{
+	INIT_HLIST_HEAD(&mnh->head);
+}
+
+#define mmu_notifier(function, mm, args...)				\
+	do {								\
+		struct mmu_notifier *__mn;				\
+		struct hlist_node *__n;					\
+									\
+		if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \
+			rcu_read_lock();				\
+			hlist_for_each_entry_rcu(__mn, __n,		\
+					     &(mm)->mmu_notifier.head,	\
+					     hlist)			\
+				if (__mn->ops->function)		\
+					__mn->ops->function(__mn,	\
+							    mm,		\
+							    args);	\
+			rcu_read_unlock();				\
+		}							\
+	} while (0)
+
+#else /* CONFIG_MMU_NOTIFIER */
+
+/*
+ * Notifiers that use the parameters that they were passed so that the
+ * compiler does not complain about unused variables but does proper
+ * parameter checks even if !CONFIG_MMU_NOTIFIER.
+ * Macros generate no code.
+ */
+#define mmu_notifier(function, mm, args...)				\
+	do {								\
+		if (0) {						\
+			struct mmu_notifier *__mn;			\
+									\
+			__mn = (struct mmu_notifier *)(0x00ff);		\
+			__mn->ops->function(__mn, mm, args);		\
+		};							\
+	} while (0)
+
+static inline void mmu_notifier_register(struct mmu_notifier *mn,
+						struct mm_struct *mm) {}
+static inline void mmu_notifier_unregister(struct mmu_notifier *mn,
+						struct mm_struct *mm) {}
+static inline void mmu_notifier_release(struct mm_struct *mm) {}
+static inline int mmu_notifier_age_page(struct mm_struct *mm,
+				unsigned long address)
+{
+	return 0;
+}
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mmh) {}
+
+#endif /* CONFIG_MMU_NOTIFIER */
+
+#endif /* _LINUX_MMU_NOTIFIER_H */
Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig	2008-01-31 19:55:46.000000000 -0800
+++ linux-2.6/mm/Kconfig	2008-01-31 19:59:51.000000000 -0800
@@ -193,3 +193,7 @@ config NR_QUICK
 config VIRT_TO_BUS
 	def_bool y
 	depends on !ARCH_NO_VIRT_TO_BUS
+
+config MMU_NOTIFIER
+	def_bool y
+	bool "MMU notifier, for paging KVM/RDMA"
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile	2008-01-31 19:55:46.000000000 -0800
+++ linux-2.6/mm/Makefile	2008-01-31 19:59:51.000000000 -0800
@@ -30,4 +30,5 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
+obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
 
Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/mm/mmu_notifier.c	2008-01-31 20:56:03.000000000 -0800
@@ -0,0 +1,76 @@
+/*
+ *  linux/mm/mmu_notifier.c
+ *
+ *  Copyright (C) 2008  Qumranet, Inc.
+ *  Copyright (C) 2008  SGI
+ *  		Christoph Lameter <clameter@sgi.com>
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
+
+/*
+ * No synchronization. This function can only be called when only a single
+ * process remains that performs teardown.
+ */
+void mmu_notifier_release(struct mm_struct *mm)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n, *t;
+
+	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+		hlist_for_each_entry_safe(mn, n, t,
+					  &mm->mmu_notifier.head, hlist) {
+			hlist_del_init(&mn->hlist);
+			if (mn->ops->release)
+				mn->ops->release(mn, mm);
+		}
+	}
+}
+
+/*
+ * If no young bitflag is supported by the hardware, ->age_page can
+ * unmap the address and return 1 or 0 depending on whether the mapping
+ * previously existed or not.
+ */
+int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n;
+	int young = 0;
+
+	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+		rcu_read_lock();
+		hlist_for_each_entry_rcu(mn, n,
+					  &mm->mmu_notifier.head, hlist) {
+			if (mn->ops->age_page)
+				young |= mn->ops->age_page(mn, mm, address);
+		}
+		rcu_read_unlock();
+	}
+
+	return young;
+}
+
+/*
+ * Note that all notifiers use RCU. The updates are only guaranteed to be
+ * visible to other processes after an RCU quiescent period!
+ *
+ * Must hold mmap_sem writably when calling registration functions.
+ */
+void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_register);
+
+void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	hlist_del_rcu(&mn->hlist);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
+
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2008-01-31 19:55:46.000000000 -0800
+++ linux-2.6/kernel/fork.c	2008-01-31 19:59:51.000000000 -0800
@@ -52,6 +52,7 @@
 #include <linux/tty.h>
 #include <linux/proc_fs.h>
 #include <linux/blkdev.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -360,6 +361,7 @@ static struct mm_struct * mm_init(struct
 
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
+		mmu_notifier_head_init(&mm->mmu_notifier);
 		return mm;
 	}
 	free_mm(mm);
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2008-01-31 19:55:46.000000000 -0800
+++ linux-2.6/mm/mmap.c	2008-01-31 20:56:03.000000000 -0800
@@ -26,6 +26,7 @@
 #include <linux/mount.h>
 #include <linux/mempolicy.h>
 #include <linux/rmap.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -2033,6 +2034,7 @@ void exit_mmap(struct mm_struct *mm)
 	unsigned long end;
 
 	/* mm's last user has gone, and its about to be pulled down */
+	mmu_notifier_release(mm);
 	arch_exit_mmap(mm);
 
 	lru_add_drain();

-- 

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2008-02-01 19:16 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-01-25  5:56 [patch 0/4] [RFC] MMU Notifiers V1 Christoph Lameter
2008-01-25  5:56 ` [patch 1/4] mmu_notifier: Core code Christoph Lameter
2008-01-25 18:39   ` Robin Holt
2008-01-25 18:47     ` Christoph Lameter
2008-01-25 18:56       ` Robin Holt
2008-01-25 19:03         ` Christoph Lameter
2008-01-25 19:35           ` Robin Holt
2008-01-25 20:10             ` Christoph Lameter
2008-01-26 11:56               ` Robin Holt
2008-01-28 18:51                 ` Christoph Lameter
2008-01-25 21:18             ` Christoph Lameter
2008-01-26 12:01               ` Robin Holt
2008-01-28 18:44                 ` Christoph Lameter
2008-01-25  5:56 ` [patch 2/4] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter
2008-01-25  5:56 ` [patch 3/4] mmu_notifier: invalidate_page callbacks for subsystems with rmap Christoph Lameter
2008-01-25  5:56 ` [patch 4/4] MMU notifier: invalidate_page callbacks using Linux rmaps Christoph Lameter
2008-01-25 11:42 ` [patch 0/4] [RFC] MMU Notifiers V1 Andrea Arcangeli
2008-01-25 12:43   ` Robin Holt
2008-01-25 18:31   ` Christoph Lameter
2008-01-25 21:18   ` Benjamin Herrenschmidt
2008-01-25 21:25     ` Christoph Lameter
2008-01-28 16:10   ` Izik Eidus
2008-01-28 17:25     ` Andrea Arcangeli
2008-01-28 19:04       ` Christoph Lameter
2008-01-28 19:40         ` Andrea Arcangeli
2008-01-28 20:16           ` Christoph Lameter
2008-02-01  5:04 [patch 0/4] [RFC] EMMU Notifiers V5 Christoph Lameter
2008-02-01  5:04 ` [patch 1/4] mmu_notifier: Core code Christoph Lameter
2008-02-01 10:55   ` Robin Holt
2008-02-01 11:04     ` Robin Holt
2008-02-01 19:14     ` Christoph Lameter
