linux-kernel.vger.kernel.org archive mirror
* [patch 0/3] [RFC] MMU Notifiers V4
@ 2008-01-31  4:57 Christoph Lameter
  2008-01-31  4:57 ` [patch 1/3] mmu_notifier: Core code Christoph Lameter
                   ` (3 more replies)
  0 siblings, 4 replies; 63+ messages in thread
From: Christoph Lameter @ 2008-01-31  4:57 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

I hope this is finally a release that covers all the requirements. Locking
description is at the top of the core patch.

This is a patchset implementing MMU notifier callbacks based on Andrea's
earlier work. These are needed if Linux pages are referenced by something
other than what is tracked by the kernel's rmaps. The known immediate users are

KVM (establishes a refcount on the page; external references are called sptes)
	(The refcount seems not to be necessary)

GRU (simple TLB shootdown without refcount. Has its own pagetable/tlb)

XPmem (uses its own reverse mappings and remote ptes; needs
	to sleep when sending messages)
	XPmem could defer freeing pages if a callback with atomic=1 occurs.

Pending:

- Feedback from users of the callbacks for KVM, RDMA, XPmem and GRU
  (Early tests with the GRU were successful).

Known issues:

- RCU quiescent periods are required on registering
  notifiers to guarantee visibility to other processors.

Andrea's mmu_notifier #4 -> RFC V1

- Merge subsystem rmap based with Linux rmap based approach
- Move Linux rmap based notifiers out of macro
- Try to account for what locks are held while the notifiers are
  called.
- Develop a patch sequence that separates out the different types of
  hooks so that we can review their use.
- Avoid adding include to linux/mm_types.h
- Integrate RCU logic suggested by Peter.

V1->V2:
- Improve RCU support
- Use mmap_sem for mmu_notifier register / unregister
- Drop invalidate_page from COW, mm/fremap.c and mm/rmap.c since we
  already have invalidate_range() callbacks there.
- Clean compile for !MMU_NOTIFIER
- Isolate filemap_xip strangeness into its own diff
- Pass a flag to invalidate_range to indicate if a spinlock
  is held.
- Add invalidate_all()

V2->V3:
- Further RCU fixes
- Fixes from Andrea to fixup aging and move invalidate_range() in do_wp_page
  and sys_remap_file_pages() after the pte clearing.

V3->V4:
- Drop locking and synchronize_rcu() on ->release since we know on release that
  we are the only executing thread. This is also true for invalidate_all() so
  we could drop off the mmu_notifier there early. Use hlist_del_init instead
  of hlist_del_rcu.
- Do the invalidation as begin/end pairs with the requirement that the driver
  holds off new references in between.
- Fixup filemap_xip.c
- Figure out a potential way in which XPmem can deal with locks that are held.
- Robin's patches to make the mmu_notifier logic manage the PageExternalRmap bit.
- Strip cc list down a bit.
- Drop Peter's new RCU list macro
- Add description to the core patch

-- 

* [patch 1/3] mmu_notifier: Core code
  2008-01-31  4:57 [patch 0/3] [RFC] MMU Notifiers V4 Christoph Lameter
@ 2008-01-31  4:57 ` Christoph Lameter
  2008-02-01  1:56   ` Jack Steiner
                     ` (2 more replies)
  2008-01-31  4:57 ` [patch 2/3] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 63+ messages in thread
From: Christoph Lameter @ 2008-01-31  4:57 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

[-- Attachment #1: mmu_core --]
[-- Type: text/plain, Size: 18722 bytes --]

Notifier functions for hardware and software that establish external
references to pages of a Linux system. The notifier calls ensure that
external mappings are removed when the Linux VM removes memory ranges
or individual pages from a process.

These fall into two classes:

1. mmu_notifier

     These are callbacks registered with an mm_struct. If pages are
     removed from an address space then callbacks are performed.

     Spinlocks must be held in order to walk reverse maps. The
     invalidate_page() callbacks are performed while spinlocks are held.

     The invalidate_range_begin/end callbacks can be performed in contexts
     where sleeping is allowed or in atomic contexts. A flag is passed
     to indicate an atomic context.


2. mmu_rmap_notifier

     Callbacks for subsystems that provide their own rmaps. These
     need to walk their own rmaps for a page. The invalidate_page
     callback is outside of locks so that we are not in a strictly
     atomic context (but we may be in a PF_MEMALLOC context if the
     notifier is called from reclaim code) and are able to sleep.

     Rmap notifiers need an extra page bit and are only available
     on 64 bit platforms.

     Pages must be marked dirty if dirty bits are found to be set in
     the external ptes.

Requirements on synchronization within the driver:

     Multiple invalidate_range_begin/ends may be nested or called
     concurrently. That is legit. However, no new external references
     may be established as long as any invalidate_xxx is running or
     any invalidate_range_begin() has been issued whose corresponding
     call to invalidate_range_end() has not yet been made.

     Locking within the notifier needs to serialize events correspondingly.

     If all invalidate_xxx notifier calls take a driver lock then it is possible
     to run follow_page() under the same lock. The lock can then guarantee
     that no page is removed and provides an additional existence guarantee
     of the page.

     invalidate_range_begin() must clear all references in the range
     and stop the establishment of new references.

     invalidate_range_end() reenables the establishment of references.
     atomic indicates that the function is called in an atomic context.
     We can sleep if atomic == 0.

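To make the conventions above concrete, a minimal driver might wire up the
interface roughly as in the sketch below. This is illustrative only: the
my_* names and the counter-based gating of new references are assumptions,
not part of this patch; only the ops structure, the callback signatures and
mmu_notifier_register() come from the code added here.

#include <linux/kernel.h>
#include <linux/spinlock.h>
#include <linux/mmu_notifier.h>

/* Hypothetical per-mm driver state wrapping the notifier. */
struct my_mmu {
	struct mmu_notifier notifier;
	spinlock_t lock;	/* taken before establishing external references */
	int invalidates;	/* number of outstanding range invalidates */
};

static void my_invalidate_range_begin(struct mmu_notifier *mn,
		struct mm_struct *mm, unsigned long start,
		unsigned long end, int atomic)
{
	struct my_mmu *m = container_of(mn, struct my_mmu, notifier);

	spin_lock(&m->lock);
	m->invalidates++;	/* hold off new external references */
	/* ... drop all external references/TLB entries in [start, end) ... */
	spin_unlock(&m->lock);
}

static void my_invalidate_range_end(struct mmu_notifier *mn,
		struct mm_struct *mm, int atomic)
{
	struct my_mmu *m = container_of(mn, struct my_mmu, notifier);

	spin_lock(&m->lock);
	m->invalidates--;	/* references may be established again at zero */
	spin_unlock(&m->lock);
}

static const struct mmu_notifier_ops my_ops = {
	.invalidate_range_begin	= my_invalidate_range_begin,
	.invalidate_range_end	= my_invalidate_range_end,
};

/* Attach to an mm. mmu_notifier_register() takes mmap_sem for write. */
static void my_attach(struct my_mmu *m, struct mm_struct *mm)
{
	spin_lock_init(&m->lock);
	m->invalidates = 0;
	m->notifier.ops = &my_ops;
	mmu_notifier_register(&m->notifier, mm);
}

The driver's fault path would then take m->lock, refuse to establish a new
external reference while m->invalidates is nonzero, and may call
follow_page() under the same lock to obtain the existence guarantee
described above.
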
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

---
 include/linux/mm_types.h     |    6 +
 include/linux/mmu_notifier.h |  248 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/page-flags.h   |   11 +
 kernel/fork.c                |    2 
 mm/Kconfig                   |    4 
 mm/Makefile                  |    1 
 mm/mmap.c                    |    3 
 mm/mmu_notifier.c            |  118 ++++++++++++++++++++
 8 files changed, 393 insertions(+)

Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h	2008-01-30 19:49:32.000000000 -0800
+++ linux-2.6/include/linux/mm_types.h	2008-01-30 19:49:34.000000000 -0800
@@ -153,6 +153,10 @@ struct vm_area_struct {
 #endif
 };
 
+struct mmu_notifier_head {
+	struct hlist_head head;
+};
+
 struct mm_struct {
 	struct vm_area_struct * mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
@@ -219,6 +223,8 @@ struct mm_struct {
 	/* aio bits */
 	rwlock_t		ioctx_list_lock;
 	struct kioctx		*ioctx_list;
+
+	struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
 };
 
 #endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/mmu_notifier.h	2008-01-30 20:25:43.000000000 -0800
@@ -0,0 +1,248 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+/*
+ * MMU notifier
+ *
+ * Notifier functions for hardware and software that establishes external
+ * references to pages of a Linux system. The notifier calls ensure that
+ * external mappings are removed when the Linux VM removes memory ranges
+ * or individual pages from a process.
+ *
+ * These fall into two classes
+ *
+ * 1. mmu_notifier
+ *
+ * 	These are callbacks registered with an mm_struct. If mappings are
+ * 	removed from an address space then callbacks are performed.
+ *
+ * 	Spinlocks must be held in order to walk reverse maps. The
+ * 	invalidate_page notifications are performed while spinlocks are held.
+ *
+ * 	The invalidate_range_start/end callbacks can be performed in contexts
+ * 	where sleeping is allowed or in atomic contexts. A flag is passed
+ * 	to indicate an atomic context.
+ *
+ *
+ * 2. mmu_rmap_notifier
+ *
+ *	Callbacks for subsystems that provide their own rmaps. These
+ *	need to walk their own rmaps for a page. The invalidate_page
+ *	callback is outside of locks so that we are not in a strictly
+ *	atomic context (but we may be in a PF_MEMALLOC context if the
+ *	notifier is called from reclaim code) and are able to sleep.
+ *
+ *	Rmap notifiers need an extra page bit and are only available
+ *	on 64 bit platforms.
+ *
+ *      Pages must be marked dirty if dirty bits found to be set in
+ *      the external ptes.
+ */
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/rcupdate.h>
+#include <linux/mm_types.h>
+
+struct mmu_notifier_ops;
+
+struct mmu_notifier {
+	struct hlist_node hlist;
+	const struct mmu_notifier_ops *ops;
+};
+
+struct mmu_notifier_ops {
+	/*
+	 * The release notifier is called when no other execution threads
+	 * are left. Synchronization is not necessary.
+	 */
+	void (*release)(struct mmu_notifier *mn,
+			struct mm_struct *mm);
+
+	/* Dummy needed because the mmu_notifier() macro requires it */
+	void (*invalidate_all)(struct mmu_notifier *mn, struct mm_struct *mm,
+				int dummy);
+
+	/*
+	 * age_page is called from contexts where the pte_lock is held
+	 */
+	int (*age_page)(struct mmu_notifier *mn,
+			struct mm_struct *mm,
+			unsigned long address);
+
+	/* invalidate_page is called from contexts where the pte_lock is held */
+	void (*invalidate_page)(struct mmu_notifier *mn,
+				struct mm_struct *mm,
+				unsigned long address);
+
+	/*
+	 * invalidate_range_begin() and invalidate_range_end() must be paired.
+	 * Multiple invalidate_range_begin/ends may be nested or called
+	 * concurrently. That is legit. However, no new external references
+	 * may be established as long as any invalidate_xxx is running or
+	 * any invalidate_range_begin() has been issued whose corresponding
+	 * call to invalidate_range_end() has not yet been made.
+	 *
+	 * Locking within the notifier needs to serialize events correspondingly.
+	 *
+	 * If all invalidate_xx notifier calls take a driver lock then it is possible
+	 * to run follow_page() under the same lock. The lock can then guarantee
+	 * that no page is removed. That way we can avoid increasing the refcount
+	 * of the pages.
+	 *
+	 * invalidate_range_begin() must clear all references in the range
+	 * and stop the establishment of new references.
+	 *
+	 * invalidate_range_end() reenables the establishment of references.
+	 *
+	 * atomic indicates that the function is called in an atomic context.
+	 * We can sleep if atomic == 0.
+	 */
+	void (*invalidate_range_begin)(struct mmu_notifier *mn,
+				 struct mm_struct *mm,
+				 unsigned long start, unsigned long end,
+				 int atomic);
+
+	void (*invalidate_range_end)(struct mmu_notifier *mn,
+				 struct mm_struct *mm, int atomic);
+};
+
+struct mmu_rmap_notifier_ops;
+
+struct mmu_rmap_notifier {
+	struct hlist_node hlist;
+	const struct mmu_rmap_notifier_ops *ops;
+};
+
+struct mmu_rmap_notifier_ops {
+	/*
+	 * Called with the page lock held after ptes are modified or removed
+	 * so that a subsystem with its own rmap's can remove remote ptes
+	 * mapping a page.
+	 */
+	void (*invalidate_page)(struct mmu_rmap_notifier *mrn,
+						struct page *page);
+};
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+/*
+ * Must hold the mmap_sem for write.
+ *
+ * RCU is used to traverse the list. A quiescent period needs to pass
+ * before the notifier is guaranteed to be visible to all threads
+ */
+extern void __mmu_notifier_register(struct mmu_notifier *mn,
+				  struct mm_struct *mm);
+/* Will acquire mmap_sem for write*/
+extern void mmu_notifier_register(struct mmu_notifier *mn,
+				  struct mm_struct *mm);
+/*
+ * Will acquire mmap_sem for write.
+ *
+ * A quiescent period needs to pass before the mmu_notifier structure
+ * can be released. mmu_notifier_release() will wait for a quiescent period
+ * after calling the ->release callback. So it is safe to call
+ * mmu_notifier_unregister from the ->release function.
+ */
+extern void mmu_notifier_unregister(struct mmu_notifier *mn,
+				    struct mm_struct *mm);
+
+
+extern void mmu_notifier_release(struct mm_struct *mm);
+extern int mmu_notifier_age_page(struct mm_struct *mm,
+				 unsigned long address);
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
+{
+	INIT_HLIST_HEAD(&mnh->head);
+}
+
+#define mmu_notifier(function, mm, args...)				\
+	do {								\
+		struct mmu_notifier *__mn;				\
+		struct hlist_node *__n;					\
+									\
+		if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \
+			rcu_read_lock();				\
+			hlist_for_each_entry_rcu(__mn, __n,		\
+					     &(mm)->mmu_notifier.head,	\
+					     hlist)			\
+				if (__mn->ops->function)		\
+					__mn->ops->function(__mn,	\
+							    mm,		\
+							    args);	\
+			rcu_read_unlock();				\
+		}							\
+	} while (0)
+
+extern void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn);
+extern void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn);
+
+/* Must hold PageLock */
+extern void mmu_rmap_export_page(struct page *page);
+
+extern struct hlist_head mmu_rmap_notifier_list;
+
+#define mmu_rmap_notifier(function, args...)				\
+	do {								\
+		struct mmu_rmap_notifier *__mrn;			\
+		struct hlist_node *__n;					\
+									\
+		rcu_read_lock();					\
+		hlist_for_each_entry_rcu(__mrn, __n,			\
+				&mmu_rmap_notifier_list, hlist)		\
+			if (__mrn->ops->function)			\
+				__mrn->ops->function(__mrn, args);	\
+		rcu_read_unlock();					\
+	} while (0);
+
+#else /* CONFIG_MMU_NOTIFIER */
+
+/*
+ * Notifiers that use the parameters that they were passed so that the
+ * compiler does not complain about unused variables but does proper
+ * parameter checks even if !CONFIG_MMU_NOTIFIER.
+ * Macros generate no code.
+ */
+#define mmu_notifier(function, mm, args...)				\
+	do {								\
+		if (0) {						\
+			struct mmu_notifier *__mn;			\
+									\
+			__mn = (struct mmu_notifier *)(0x00ff);		\
+			__mn->ops->function(__mn, mm, args);		\
+		};							\
+	} while (0)
+
+#define mmu_rmap_notifier(function, args...)				\
+	do {								\
+		if (0) {						\
+			struct mmu_rmap_notifier *__mrn;		\
+									\
+			__mrn = (struct mmu_rmap_notifier *)(0x00ff);	\
+			__mrn->ops->function(__mrn, args);		\
+		}							\
+	} while (0);
+
+static inline void mmu_notifier_register(struct mmu_notifier *mn,
+						struct mm_struct *mm) {}
+static inline void mmu_notifier_unregister(struct mmu_notifier *mn,
+						struct mm_struct *mm) {}
+static inline void mmu_notifier_release(struct mm_struct *mm) {}
+static inline int mmu_notifier_age_page(struct mm_struct *mm,
+				unsigned long address)
+{
+	return 0;
+}
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mmh) {}
+
+static inline void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn)
+									{}
+static inline void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn)
+									{}
+
+#endif /* CONFIG_MMU_NOTIFIER */
+
+#endif /* _LINUX_MMU_NOTIFIER_H */
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h	2008-01-30 19:49:32.000000000 -0800
+++ linux-2.6/include/linux/page-flags.h	2008-01-30 20:30:50.000000000 -0800
@@ -105,6 +105,7 @@
  * 64 bit  |           FIELDS             | ??????         FLAGS         |
  *         63                            32                              0
  */
+#define PG_external_rmap	30	/* Page has external rmap */
 #define PG_uncached		31	/* Page has been mapped as uncached */
 #endif
 
@@ -260,6 +261,16 @@ static inline void __ClearPageTail(struc
 #define SetPageUncached(page)	set_bit(PG_uncached, &(page)->flags)
 #define ClearPageUncached(page)	clear_bit(PG_uncached, &(page)->flags)
 
+#if defined(CONFIG_MMU_NOTIFIER) && defined(CONFIG_64BIT)
+#define PageExternalRmap(page)	test_bit(PG_external_rmap, &(page)->flags)
+#define SetPageExternalRmap(page) set_bit(PG_external_rmap, &(page)->flags)
+#define ClearPageExternalRmap(page) clear_bit(PG_external_rmap, \
+							&(page)->flags)
+#else
+#define ClearPageExternalRmap(page) do {} while (0)
+#define PageExternalRmap(page)	0
+#endif
+
 struct page;	/* forward declaration */
 
 extern void cancel_dirty_page(struct page *page, unsigned int account_size);
Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig	2008-01-30 19:49:32.000000000 -0800
+++ linux-2.6/mm/Kconfig	2008-01-30 19:49:34.000000000 -0800
@@ -193,3 +193,7 @@ config NR_QUICK
 config VIRT_TO_BUS
 	def_bool y
 	depends on !ARCH_NO_VIRT_TO_BUS
+
+config MMU_NOTIFIER
+	def_bool y
+	bool "MMU notifier, for paging KVM/RDMA"
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile	2008-01-30 19:49:32.000000000 -0800
+++ linux-2.6/mm/Makefile	2008-01-30 19:49:34.000000000 -0800
@@ -30,4 +30,5 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
+obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
 
Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/mm/mmu_notifier.c	2008-01-30 20:31:12.000000000 -0800
@@ -0,0 +1,118 @@
+/*
+ *  linux/mm/mmu_notifier.c
+ *
+ *  Copyright (C) 2008  Qumranet, Inc.
+ *  Copyright (C) 2008  SGI
+ *  		Christoph Lameter <clameter@sgi.com>
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
+
+/*
+ * No synchronization. This function can only be called when only a single
+ * process remains that performs teardown.
+ */
+void mmu_notifier_release(struct mm_struct *mm)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n, *t;
+
+	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+		hlist_for_each_entry_safe(mn, n, t,
+					  &mm->mmu_notifier.head, hlist) {
+			hlist_del_init(&mn->hlist);
+			if (mn->ops->release)
+				mn->ops->release(mn, mm);
+		}
+	}
+}
+
+/*
+ * If no young bitflag is supported by the hardware, ->age_page can
+ * unmap the address and return 1 or 0 depending if the mapping previously
+ * existed or not.
+ */
+int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n;
+	int young = 0;
+
+	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+		rcu_read_lock();
+		hlist_for_each_entry_rcu(mn, n,
+					  &mm->mmu_notifier.head, hlist) {
+			if (mn->ops->age_page)
+				young |= mn->ops->age_page(mn, mm, address);
+		}
+		rcu_read_unlock();
+	}
+
+	return young;
+}
+
+/*
+ * Note that all notifiers use RCU. The updates are only guaranteed to be
+ * visible to other processes after a RCU quiescent period!
+ */
+void __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
+}
+EXPORT_SYMBOL_GPL(__mmu_notifier_register);
+
+void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	down_write(&mm->mmap_sem);
+	__mmu_notifier_register(mn, mm);
+	up_write(&mm->mmap_sem);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_register);
+
+void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	down_write(&mm->mmap_sem);
+	hlist_del_rcu(&mn->hlist);
+	up_write(&mm->mmap_sem);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
+
+#ifdef CONFIG_64BIT
+static DEFINE_SPINLOCK(mmu_notifier_list_lock);
+HLIST_HEAD(mmu_rmap_notifier_list);
+
+void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn)
+{
+	spin_lock(&mmu_notifier_list_lock);
+	hlist_add_head_rcu(&mrn->hlist, &mmu_rmap_notifier_list);
+	spin_unlock(&mmu_notifier_list_lock);
+}
+EXPORT_SYMBOL(mmu_rmap_notifier_register);
+
+void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn)
+{
+	spin_lock(&mmu_notifier_list_lock);
+	hlist_del_rcu(&mrn->hlist);
+	spin_unlock(&mmu_notifier_list_lock);
+}
+EXPORT_SYMBOL(mmu_rmap_notifier_unregister);
+
+/*
+ * Export a page.
+ *
+ * Pagelock must be held.
+ * Must be called before a page is put on an external rmap.
+ */
+void mmu_rmap_export_page(struct page *page)
+{
+	BUG_ON(!PageLocked(page));
+	SetPageExternalRmap(page);
+}
+EXPORT_SYMBOL(mmu_rmap_export_page);
+
+#endif
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2008-01-30 19:49:32.000000000 -0800
+++ linux-2.6/kernel/fork.c	2008-01-30 19:49:34.000000000 -0800
@@ -52,6 +52,7 @@
 #include <linux/tty.h>
 #include <linux/proc_fs.h>
 #include <linux/blkdev.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -360,6 +361,7 @@ static struct mm_struct * mm_init(struct
 
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
+		mmu_notifier_head_init(&mm->mmu_notifier);
 		return mm;
 	}
 	free_mm(mm);
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2008-01-30 19:49:32.000000000 -0800
+++ linux-2.6/mm/mmap.c	2008-01-30 20:29:31.000000000 -0800
@@ -26,6 +26,7 @@
 #include <linux/mount.h>
 #include <linux/mempolicy.h>
 #include <linux/rmap.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -2033,6 +2034,7 @@ void exit_mmap(struct mm_struct *mm)
 	unsigned long end;
 
 	/* mm's last user has gone, and its about to be pulled down */
+	mmu_notifier(invalidate_all, mm, 0);
 	arch_exit_mmap(mm);
 
 	lru_add_drain();
@@ -2044,6 +2046,7 @@ void exit_mmap(struct mm_struct *mm)
 	vm_unacct_memory(nr_accounted);
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
 	tlb_finish_mmu(tlb, 0, end);
+	mmu_notifier_release(mm);
 
 	/*
 	 * Walk the list again, actually closing and freeing it,

-- 

* [patch 2/3] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-31  4:57 [patch 0/3] [RFC] MMU Notifiers V4 Christoph Lameter
  2008-01-31  4:57 ` [patch 1/3] mmu_notifier: Core code Christoph Lameter
@ 2008-01-31  4:57 ` Christoph Lameter
  2008-01-31 12:31   ` Andrea Arcangeli
  2008-02-01  4:24   ` [patch 2/3] mmu_notifier: Callbacks to invalidate address ranges Robin Holt
  2008-01-31  4:57 ` [patch 3/3] mmu_notifier: invalidate_page callbacks Christoph Lameter
  2008-01-31 17:18 ` [PATCH] mmu notifiers #v5 Andrea Arcangeli
  3 siblings, 2 replies; 63+ messages in thread
From: Christoph Lameter @ 2008-01-31  4:57 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

[-- Attachment #1: mmu_invalidate_range_callbacks --]
[-- Type: text/plain, Size: 7532 bytes --]

The invalidation of address ranges in a mm_struct needs to be
performed when pages are removed or permissions etc change.

invalidate_range_begin/end() is frequently called with only mmap_sem
held. If invalidate_range_begin() is called with spinlocks held then we
pass a flag into the callbacks to indicate that no sleeping is
possible.

In two cases we use invalidate_range_begin/end to invalidate
single pages because the pair allows holding off new references
(idea by Robin Holt).

do_wp_page(): We hold off new references while we update the pte.

xip_unmap: We are not taking the PageLock so we cannot
use the invalidate_page mmu_rmap_notifier. invalidate_range_begin/end
stands in.

Comments state that mmap_sem must be held for
remap_pfn_range() but various drivers do not seem to do this.

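To make the calling convention the hunks below follow explicit, here is a
hypothetical helper (not part of the patch): the begin/end pair brackets the
pte manipulation, and the atomic flag is 1 when the caller holds a spinlock
such as i_mmap_lock and 0 when only mmap_sem is held.

#include <linux/mmu_notifier.h>

static void example_invalidate(struct mm_struct *mm,
			       unsigned long start, unsigned long end,
			       int atomic)
{
	/* After this call the driver must hold off new external references. */
	mmu_notifier(invalidate_range_begin, mm, start, end, atomic);

	/* ... clear or change the Linux ptes for [start, end) here ... */

	/* External references may be established again. */
	mmu_notifier(invalidate_range_end, mm, atomic);
}
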
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Robin Holt <holt@sgi.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/filemap_xip.c |    4 ++++
 mm/fremap.c      |    3 +++
 mm/hugetlb.c     |    3 +++
 mm/memory.c      |   15 +++++++++++++--
 mm/mmap.c        |    2 ++
 5 files changed, 25 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c	2008-01-30 20:03:05.000000000 -0800
+++ linux-2.6/mm/fremap.c	2008-01-30 20:05:39.000000000 -0800
@@ -15,6 +15,7 @@
 #include <linux/rmap.h>
 #include <linux/module.h>
 #include <linux/syscalls.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/mmu_context.h>
 #include <asm/cacheflush.h>
@@ -211,7 +212,9 @@ asmlinkage long sys_remap_file_pages(uns
 		spin_unlock(&mapping->i_mmap_lock);
 	}
 
+	mmu_notifier(invalidate_range_begin, mm, start, start + size, 0);
 	err = populate_range(mm, vma, start, size, pgoff);
+	mmu_notifier(invalidate_range_end, mm, 0);
 	if (!err && !(flags & MAP_NONBLOCK)) {
 		if (unlikely(has_write_lock)) {
 			downgrade_write(&mm->mmap_sem);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2008-01-30 20:03:05.000000000 -0800
+++ linux-2.6/mm/memory.c	2008-01-30 20:07:27.000000000 -0800
@@ -50,6 +50,7 @@
 #include <linux/delayacct.h>
 #include <linux/init.h>
 #include <linux/writeback.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -883,13 +884,16 @@ unsigned long zap_page_range(struct vm_a
 	struct mmu_gather *tlb;
 	unsigned long end = address + size;
 	unsigned long nr_accounted = 0;
+	int atomic = details ? (details->i_mmap_lock != 0) : 0;
 
 	lru_add_drain();
 	tlb = tlb_gather_mmu(mm, 0);
 	update_hiwater_rss(mm);
+	mmu_notifier(invalidate_range_begin, mm, address, end, atomic);
 	end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
 	if (tlb)
 		tlb_finish_mmu(tlb, address, end);
+	mmu_notifier(invalidate_range_end, mm, atomic);
 	return end;
 }
 
@@ -1318,7 +1322,7 @@ int remap_pfn_range(struct vm_area_struc
 {
 	pgd_t *pgd;
 	unsigned long next;
-	unsigned long end = addr + PAGE_ALIGN(size);
+	unsigned long start = addr, end = addr + PAGE_ALIGN(size);
 	struct mm_struct *mm = vma->vm_mm;
 	int err;
 
@@ -1352,6 +1356,7 @@ int remap_pfn_range(struct vm_area_struc
 	pfn -= addr >> PAGE_SHIFT;
 	pgd = pgd_offset(mm, addr);
 	flush_cache_range(vma, addr, end);
+	mmu_notifier(invalidate_range_begin, mm, start, end, 0);
 	do {
 		next = pgd_addr_end(addr, end);
 		err = remap_pud_range(mm, pgd, addr, next,
@@ -1359,6 +1364,7 @@ int remap_pfn_range(struct vm_area_struc
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
+	mmu_notifier(invalidate_range_end, mm, 0);
 	return err;
 }
 EXPORT_SYMBOL(remap_pfn_range);
@@ -1442,10 +1448,11 @@ int apply_to_page_range(struct mm_struct
 {
 	pgd_t *pgd;
 	unsigned long next;
-	unsigned long end = addr + size;
+	unsigned long start = addr, end = addr + size;
 	int err;
 
 	BUG_ON(addr >= end);
+	mmu_notifier(invalidate_range_begin, mm, start, end, 0);
 	pgd = pgd_offset(mm, addr);
 	do {
 		next = pgd_addr_end(addr, end);
@@ -1453,6 +1460,7 @@ int apply_to_page_range(struct mm_struct
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
+	mmu_notifier(invalidate_range_end, mm, 0);
 	return err;
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
@@ -1630,6 +1638,8 @@ gotten:
 		goto oom;
 	cow_user_page(new_page, old_page, address, vma);
 
+	mmu_notifier(invalidate_range_begin, mm, address,
+				address + PAGE_SIZE - 1, 0);
 	/*
 	 * Re-check the pte - we dropped the lock
 	 */
@@ -1668,6 +1678,7 @@ gotten:
 		page_cache_release(old_page);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
+	mmu_notifier(invalidate_range_end, mm, 0);
 	if (dirty_page) {
 		if (vma->vm_file)
 			file_update_time(vma->vm_file);
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2008-01-30 20:03:05.000000000 -0800
+++ linux-2.6/mm/mmap.c	2008-01-30 20:05:39.000000000 -0800
@@ -1744,11 +1744,13 @@ static void unmap_region(struct mm_struc
 	lru_add_drain();
 	tlb = tlb_gather_mmu(mm, 0);
 	update_hiwater_rss(mm);
+	mmu_notifier(invalidate_range_begin, mm, start, end, 0);
 	unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
 	free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
 				 next? next->vm_start: 0);
 	tlb_finish_mmu(tlb, start, end);
+	mmu_notifier(invalidate_range_end, mm, 0);
 }
 
 /*
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c	2008-01-30 20:03:05.000000000 -0800
+++ linux-2.6/mm/hugetlb.c	2008-01-30 20:05:39.000000000 -0800
@@ -14,6 +14,7 @@
 #include <linux/mempolicy.h>
 #include <linux/cpuset.h>
 #include <linux/mutex.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -743,6 +744,7 @@ void __unmap_hugepage_range(struct vm_ar
 	BUG_ON(start & ~HPAGE_MASK);
 	BUG_ON(end & ~HPAGE_MASK);
 
+	mmu_notifier(invalidate_range_begin, mm, start, end, 1);
 	spin_lock(&mm->page_table_lock);
 	for (address = start; address < end; address += HPAGE_SIZE) {
 		ptep = huge_pte_offset(mm, address);
@@ -763,6 +765,7 @@ void __unmap_hugepage_range(struct vm_ar
 	}
 	spin_unlock(&mm->page_table_lock);
 	flush_tlb_range(vma, start, end);
+	mmu_notifier(invalidate_range_end, mm, 1);
 	list_for_each_entry_safe(page, tmp, &page_list, lru) {
 		list_del(&page->lru);
 		put_page(page);
Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c	2008-01-30 20:03:05.000000000 -0800
+++ linux-2.6/mm/filemap_xip.c	2008-01-30 20:05:39.000000000 -0800
@@ -13,6 +13,7 @@
 #include <linux/module.h>
 #include <linux/uio.h>
 #include <linux/rmap.h>
+#include <linux/mmu_notifier.h>
 #include <linux/sched.h>
 #include <asm/tlbflush.h>
 
@@ -189,6 +190,8 @@ __xip_unmap (struct address_space * mapp
 		address = vma->vm_start +
 			((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 		BUG_ON(address < vma->vm_start || address >= vma->vm_end);
+		mmu_notifier(invalidate_range_begin, mm, address,
+					address + PAGE_SIZE - 1, 1);
 		pte = page_check_address(page, mm, address, &ptl);
 		if (pte) {
 			/* Nuke the page table entry. */
@@ -200,6 +203,7 @@ __xip_unmap (struct address_space * mapp
 			pte_unmap_unlock(pte, ptl);
 			page_cache_release(page);
 		}
+		mmu_notifier(invalidate_range_end, mm, 1);
 	}
 	spin_unlock(&mapping->i_mmap_lock);
 }

-- 

* [patch 3/3] mmu_notifier: invalidate_page callbacks
  2008-01-31  4:57 [patch 0/3] [RFC] MMU Notifiers V4 Christoph Lameter
  2008-01-31  4:57 ` [patch 1/3] mmu_notifier: Core code Christoph Lameter
  2008-01-31  4:57 ` [patch 2/3] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter
@ 2008-01-31  4:57 ` Christoph Lameter
  2008-01-31 17:18 ` [PATCH] mmu notifiers #v5 Andrea Arcangeli
  3 siblings, 0 replies; 63+ messages in thread
From: Christoph Lameter @ 2008-01-31  4:57 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

[-- Attachment #1: mmu_invalidate_page_rmap_callbacks --]
[-- Type: text/plain, Size: 3850 bytes --]

Callbacks to remove individual pages as done in rmap code

3 types of callbacks are used:

1. invalidate_page mmu_notifier
	Called from the inner loop of rmap walks to invalidate
	pages.

2. invalidate_page mmu_rmap_notifier
	Called after the Linux rmap loop under PageLock to allow
	a device to scan its own rmaps and remove mappings.

3. mmu_notifier_age_page
	Called for the determination of the page referenced
	status.

The callbacks occur after the Linux rmaps have been walked. A device
driver does not have to support both type 1 and type 2 callbacks; one is sufficient.
If we do not care about page referenced status then callback #3 can also
be omitted.

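For illustration, a subsystem with its own rmap (type 2) might hook in
roughly as sketched below. The my_* names are assumptions; the registration
and export calls are the ones added by the core patch (patch 1/3). Dirty
propagation is left as a comment since it is subsystem specific.

#include <linux/mm.h>
#include <linux/mmu_notifier.h>

/* Hypothetical invalidate_page for a subsystem with its own rmap. */
static void my_rmap_invalidate_page(struct mmu_rmap_notifier *mrn,
				    struct page *page)
{
	/*
	 * Walk the subsystem's own rmap for this page and remove all
	 * remote ptes. Called with the page locked but outside of
	 * spinlocks, so sleeping (e.g. to send messages) is allowed.
	 * If the remote ptes carried the dirty bit, the page must be
	 * marked dirty here via set_page_dirty(page).
	 */
}

static const struct mmu_rmap_notifier_ops my_rmap_ops = {
	.invalidate_page = my_rmap_invalidate_page,
};

static struct mmu_rmap_notifier my_rmap_notifier = {
	.ops = &my_rmap_ops,
};

/* Call once at subsystem init. */
static void my_rmap_init(void)
{
	mmu_rmap_notifier_register(&my_rmap_notifier);
}

/* Before a page goes onto the external rmap; the page must be locked. */
static void my_export(struct page *page)
{
	mmu_rmap_export_page(page);	/* sets PG_external_rmap */
	/* ... insert the page into the subsystem's own rmap ... */
}
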
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Robin Holt <holt@sgi.com>

---
 mm/rmap.c |   22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c	2008-01-30 20:03:03.000000000 -0800
+++ linux-2.6/mm/rmap.c	2008-01-30 20:17:22.000000000 -0800
@@ -49,6 +49,7 @@
 #include <linux/rcupdate.h>
 #include <linux/module.h>
 #include <linux/kallsyms.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/tlbflush.h>
 
@@ -284,7 +285,8 @@ static int page_referenced_one(struct pa
 	if (!pte)
 		goto out;
 
-	if (ptep_clear_flush_young(vma, address, pte))
+	if (ptep_clear_flush_young(vma, address, pte) |
+	    mmu_notifier_age_page(mm, address))
 		referenced++;
 
 	/* Pretend the page is referenced if the task has the
@@ -434,6 +436,7 @@ static int page_mkclean_one(struct page 
 
 		flush_cache_page(vma, address, pte_pfn(*pte));
 		entry = ptep_clear_flush(vma, address, pte);
+		mmu_notifier(invalidate_page, mm, address);
 		entry = pte_wrprotect(entry);
 		entry = pte_mkclean(entry);
 		set_pte_at(mm, address, pte, entry);
@@ -473,6 +476,10 @@ int page_mkclean(struct page *page)
 		struct address_space *mapping = page_mapping(page);
 		if (mapping) {
 			ret = page_mkclean_file(mapping, page);
+			if (unlikely(PageExternalRmap(page))) {
+				mmu_rmap_notifier(invalidate_page, page);
+				ClearPageExternalRmap(page);
+			}
 			if (page_test_dirty(page)) {
 				page_clear_dirty(page);
 				ret = 1;
@@ -677,7 +684,8 @@ static int try_to_unmap_one(struct page 
 	 * skipped over this mm) then we should reactivate it.
 	 */
 	if (!migration && ((vma->vm_flags & VM_LOCKED) ||
-			(ptep_clear_flush_young(vma, address, pte)))) {
+			(ptep_clear_flush_young(vma, address, pte) |
+				mmu_notifier_age_page(mm, address)))) {
 		ret = SWAP_FAIL;
 		goto out_unmap;
 	}
@@ -685,6 +693,7 @@ static int try_to_unmap_one(struct page 
 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address, page_to_pfn(page));
 	pteval = ptep_clear_flush(vma, address, pte);
+	mmu_notifier(invalidate_page, mm, address);
 
 	/* Move the dirty bit to the physical page now the pte is gone. */
 	if (pte_dirty(pteval))
@@ -809,12 +818,14 @@ static void try_to_unmap_cluster(unsigne
 		page = vm_normal_page(vma, address, *pte);
 		BUG_ON(!page || PageAnon(page));
 
-		if (ptep_clear_flush_young(vma, address, pte))
+		if (ptep_clear_flush_young(vma, address, pte) |
+		    mmu_notifier_age_page(mm, address))
 			continue;
 
 		/* Nuke the page table entry. */
 		flush_cache_page(vma, address, pte_pfn(*pte));
 		pteval = ptep_clear_flush(vma, address, pte);
+		mmu_notifier(invalidate_page, mm, address);
 
 		/* If nonlinear, store the file page offset in the pte. */
 		if (page->index != linear_page_index(vma, address))
@@ -971,6 +982,11 @@ int try_to_unmap(struct page *page, int 
 	else
 		ret = try_to_unmap_file(page, migration);
 
+	if (unlikely(PageExternalRmap(page))) {
+		mmu_rmap_notifier(invalidate_page, page);
+		ClearPageExternalRmap(page);
+	}
+
 	if (!page_mapped(page))
 		ret = SWAP_SUCCESS;
 	return ret;

-- 

* Re: [patch 2/3] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-31  4:57 ` [patch 2/3] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter
@ 2008-01-31 12:31   ` Andrea Arcangeli
  2008-01-31 20:07     ` Christoph Lameter
  2008-01-31 22:01     ` mmu_notifier: close hole in fork Christoph Lameter
  2008-02-01  4:24   ` [patch 2/3] mmu_notifier: Callbacks to invalidate address ranges Robin Holt
  1 sibling, 2 replies; 63+ messages in thread
From: Andrea Arcangeli @ 2008-01-31 12:31 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Wed, Jan 30, 2008 at 08:57:52PM -0800, Christoph Lameter wrote:
> @@ -211,7 +212,9 @@ asmlinkage long sys_remap_file_pages(uns
>  		spin_unlock(&mapping->i_mmap_lock);
>  	}
>  
> +	mmu_notifier(invalidate_range_begin, mm, start, start + size, 0);
>  	err = populate_range(mm, vma, start, size, pgoff);
> +	mmu_notifier(invalidate_range_end, mm, 0);
>  	if (!err && !(flags & MAP_NONBLOCK)) {
>  		if (unlikely(has_write_lock)) {
>  			downgrade_write(&mm->mmap_sem);

This can't be enough for GRU, in fact it can't work for KVM either. You
got 1) to have some invalidate_page for GRU before freeing the page,
and 2) to pass start, end to range_end (if you want kvm to use it
instead of invalidate_page).

mremap still missing as a whole.

* [PATCH] mmu notifiers #v5
  2008-01-31  4:57 [patch 0/3] [RFC] MMU Notifiers V4 Christoph Lameter
                   ` (2 preceding siblings ...)
  2008-01-31  4:57 ` [patch 3/3] mmu_notifier: invalidate_page callbacks Christoph Lameter
@ 2008-01-31 17:18 ` Andrea Arcangeli
  2008-01-31 20:18   ` Christoph Lameter
  3 siblings, 1 reply; 63+ messages in thread
From: Andrea Arcangeli @ 2008-01-31 17:18 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

GRU should implement at least invalidate_page and invalidate_pages,
and it should be mostly covered with that.

My suggestion is to add the invalidate_range_start/end incrementally
with this, and to keep all the xpmem mmu notifiers in a separate
incremental patch (those are going to require many more changes to
perfect). They're very different things. GRU is simpler, will require
less changes and it should be taken care of sooner than XPMEM. KVM
requirements are a subset of GRU thanks to the page pin so I can
ignore KVM requirements as a whole and I only focus on GRU for the
time being.

There's room for optimization by moving some invalidate_page to
invalidate_pages but that would complicate all tlb flushing code a
lot, it'd require tons of changes and we should concentrate on getting
the functionality included in the simplest, obviously safe, least
intrusive form, in a way that can be optimized later if that turns out
to be an issue. And I think for KVM this is already quite optimal. And
there's zero slowdown when the notifiers are armed, for all tasks but
the one tracked by the notifiers.

Invalidates inside PT lock will avoid the page faults to happen in
parallel of my invalidates, no dependency on the page pin, mremap
covered, mprotect covered etc... I verified with the below patch KVM
swapping is rock solid on top of 2.6.24 as with my #v4 on an older host
kernel. I think it'd be much less efficient to add yet another
non-per-pte-scalar lock to serialize the GRU interrupt or KVM
pagefault against the main linux page fault, given we already have all
needed serialization out of the PT lock. XPMEM is forced to do that
because it has to sleep in sending the invalidate. So it has to take
such subsystem mutex, send the invalidate, ptep_clear_flush, and
finally unlock the mutex (hence the need of a _end callback, to unlock
the page faults). KVM could also add a kvm internal lock to do
something like that but there's no point given PT lock is more
scalar. GRU is totally forbidden to use such model because GRU lacks
the page pin (KVM could use the XPMEM invalidate_range too, instead of
the GRU invalidate_page[s] only thanks to the page pin taken in
get_user_pages)

Invalidate_pages and invalidate_range_start/end provide entirely
different semantics even if they're sharing the same mmu_notifier.head
registration list. And _end had better get the start,end range too.

XPMEM -> invalidate_range_start/end/invalidate_external_rmap
GRU/KVM -> invalidate_pages[s], in the future mprotect_pages optimization etc...

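For concreteness, a GRU-style user of these ops might look roughly like the
sketch below (the gru_like_* names are illustrative and not taken from any
real driver). All three callbacks run in atomic context under the PT lock,
so they can only flush the secondary TLB and must not sleep.

#include <linux/mmu_notifier.h>

static void gru_like_invalidate_page(struct mmu_notifier *mn,
		struct mm_struct *mm, unsigned long address)
{
	/* Flush the secondary-TLB entry for this address. The old page
	 * has not been freed yet, so a concurrent remote access is safe. */
}

static void gru_like_invalidate_pages(struct mmu_notifier *mn,
		struct mm_struct *mm, unsigned long start, unsigned long end)
{
	/* Flush the secondary TLB for the whole range [start, end). */
}

static int gru_like_age_page(struct mmu_notifier *mn,
		struct mm_struct *mm, unsigned long address)
{
	/* With no hardware young bit, drop the remote mapping and return
	 * 1 if one existed, so the VM sees proper aging. */
	return 0;
}

static const struct mmu_notifier_ops gru_like_ops = {
	.invalidate_page	= gru_like_invalidate_page,
	.invalidate_pages	= gru_like_invalidate_pages,
	.age_page		= gru_like_age_page,
};

/* Register per mm with mmu_notifier_register(), passing a struct
 * mmu_notifier whose ops point at gru_like_ops. */
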
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -46,6 +46,7 @@
 	__young = ptep_test_and_clear_young(__vma, __address, __ptep);	\
 	if (__young)							\
 		flush_tlb_page(__vma, __address);			\
+	__young |= mmu_notifier_age_page((__vma)->vm_mm, __address);	\
 	__young;							\
 })
 #endif
@@ -86,6 +87,7 @@ do {									\
 	pte_t __pte;							\
 	__pte = ptep_get_and_clear((__vma)->vm_mm, __address, __ptep);	\
 	flush_tlb_page(__vma, __address);				\
+	mmu_notifier(invalidate_page, (__vma)->vm_mm, __address);	\
 	__pte;								\
 })
 #endif
diff --git a/include/asm-s390/pgtable.h b/include/asm-s390/pgtable.h
--- a/include/asm-s390/pgtable.h
+++ b/include/asm-s390/pgtable.h
@@ -712,6 +712,7 @@ static inline pte_t ptep_clear_flush(str
 {
 	pte_t pte = *ptep;
 	ptep_invalidate(address, ptep);
+	mmu_notifier(invalidate_page, vma->vm_mm, address);
 	return pte;
 }
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -10,6 +10,7 @@
 #include <linux/rbtree.h>
 #include <linux/rwsem.h>
 #include <linux/completion.h>
+#include <linux/mmu_notifier.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
 
@@ -219,6 +220,8 @@ struct mm_struct {
 	/* aio bits */
 	rwlock_t		ioctx_list_lock;
 	struct kioctx		*ioctx_list;
+
+	struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
 };
 
 #endif /* _LINUX_MM_TYPES_H */
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
new file mode 100644
--- /dev/null
+++ b/include/linux/mmu_notifier.h
@@ -0,0 +1,132 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+
+struct mmu_notifier;
+
+struct mmu_notifier_ops {
+	/*
+	 * Called when nobody can register any more notifier in the mm
+	 * and after the "mn" notifier has been disarmed already.
+	 */
+	void (*release)(struct mmu_notifier *mn,
+			struct mm_struct *mm);
+
+	/*
+	 * invalidate_page[s] is called in atomic context
+	 * after any pte has been updated and before
+	 * dropping the PT lock required to update any Linux pte.
+	 * Once the PT lock will be released the pte will have its
+	 * final value to export through the secondary MMU.
+	 * Before this is invoked any secondary MMU is still ok
+	 * to read/write to the page previously pointed by the
+	 * Linux pte because the old page hasn't been freed yet.
+	 * If required set_page_dirty has to be called internally
+	 * to this method.
+	 */
+	void (*invalidate_page)(struct mmu_notifier *mn,
+				struct mm_struct *mm,
+				unsigned long address);
+	void (*invalidate_pages)(struct mmu_notifier *mn,
+				 struct mm_struct *mm,
+				 unsigned long start, unsigned long end);
+
+	/*
+	 * Age page is called in atomic context inside the PT lock
+	 * right after the VM is test-and-clearing the young/accessed
+	 * bitflag in the pte. This way the VM will provide proper aging
+	 * to the accesses to the page through the secondary MMUs
+	 * and not only to the ones through the Linux pte.
+	 */
+	int (*age_page)(struct mmu_notifier *mn,
+			struct mm_struct *mm,
+			unsigned long address);
+};
+
+struct mmu_notifier {
+	struct hlist_node hlist;
+	const struct mmu_notifier_ops *ops;
+};
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+struct mmu_notifier_head {
+	struct hlist_head head;
+	spinlock_t lock;
+};
+
+#include <linux/mm_types.h>
+
+/*
+ * RCU is used to traverse the list. A quiescent period needs to pass
+ * before the notifier is guaranteed to be visible to all threads.
+ */
+extern void mmu_notifier_register(struct mmu_notifier *mn,
+				  struct mm_struct *mm);
+/*
+ * RCU is used to traverse the list. A quiescent period needs to pass
+ * before the "struct mmu_notifier" can be freed. Alternatively it
+ * can be synchronously freed inside ->release when the list can't
+ * change anymore and nobody could possibly walk it.
+ */
+extern void mmu_notifier_unregister(struct mmu_notifier *mn,
+				    struct mm_struct *mm);
+extern void mmu_notifier_release(struct mm_struct *mm);
+extern int mmu_notifier_age_page(struct mm_struct *mm,
+				 unsigned long address);
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
+{
+	INIT_HLIST_HEAD(&mnh->head);
+	spin_lock_init(&mnh->lock);
+}
+
+#define mmu_notifier(function, mm, args...)				\
+	do {								\
+		struct mmu_notifier *__mn;				\
+		struct hlist_node *__n;					\
+									\
+		if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \
+			rcu_read_lock();				\
+			hlist_for_each_entry_rcu(__mn, __n,		\
+						 &(mm)->mmu_notifier.head, \
+						 hlist)			\
+				if (__mn->ops->function)		\
+					__mn->ops->function(__mn,	\
+							    mm,		\
+							    args);	\
+			rcu_read_unlock();				\
+		}							\
+	} while (0)
+
+#else /* CONFIG_MMU_NOTIFIER */
+
+struct mmu_notifier_head {};
+
+#define mmu_notifier_register(mn, mm) do {} while(0)
+#define mmu_notifier_unregister(mn, mm) do {} while (0)
+#define mmu_notifier_release(mm) do {} while (0)
+#define mmu_notifier_age_page(mm, address) ({ 0; })
+#define mmu_notifier_head_init(mmh) do {} while (0)
+
+/*
+ * Notifiers that use the parameters that they were passed so that the
+ * compiler does not complain about unused variables but does proper
+ * parameter checks even if !CONFIG_MMU_NOTIFIER.
+ * Macros generate no code.
+ */
+#define mmu_notifier(function, mm, args...)			       \
+	do {							       \
+		if (0) {					       \
+			struct mmu_notifier *__mn;		       \
+								       \
+			__mn = (struct mmu_notifier *)(0x00ff);	       \
+			__mn->ops->function(__mn, mm, args);	       \
+		};						       \
+	} while (0)
+
+#endif /* CONFIG_MMU_NOTIFIER */
+
+#endif /* _LINUX_MMU_NOTIFIER_H */
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -360,6 +360,7 @@ static struct mm_struct * mm_init(struct
 
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
+		mmu_notifier_head_init(&mm->mmu_notifier);
 		return mm;
 	}
 	free_mm(mm);
diff --git a/mm/Kconfig b/mm/Kconfig
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -193,3 +193,7 @@ config VIRT_TO_BUS
 config VIRT_TO_BUS
 	def_bool y
 	depends on !ARCH_NO_VIRT_TO_BUS
+
+config MMU_NOTIFIER
+	def_bool y
+	bool "MMU notifier, for paging KVM/RDMA"
diff --git a/mm/Makefile b/mm/Makefile
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -30,4 +30,5 @@ obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
+obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -756,6 +756,7 @@ void __unmap_hugepage_range(struct vm_ar
 		if (pte_none(pte))
 			continue;
 
+		mmu_notifier(invalidate_page, mm, address);
 		page = pte_page(pte);
 		if (pte_dirty(pte))
 			set_page_dirty(page);
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -661,6 +661,7 @@ static unsigned long zap_pte_range(struc
 			}
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
 							tlb->fullmm);
+			mmu_notifier(invalidate_page, mm, addr);
 			tlb_remove_tlb_entry(tlb, pte, addr);
 			if (unlikely(!page))
 				continue;
@@ -1249,6 +1250,7 @@ static int remap_pte_range(struct mm_str
 {
 	pte_t *pte;
 	spinlock_t *ptl;
+	unsigned long start = addr;
 
 	pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
 	if (!pte)
@@ -1260,6 +1262,7 @@ static int remap_pte_range(struct mm_str
 		pfn++;
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
+	mmu_notifier(invalidate_pages, mm, start, addr);
 	pte_unmap_unlock(pte - 1, ptl);
 	return 0;
 }
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2043,6 +2043,7 @@ void exit_mmap(struct mm_struct *mm)
 	vm_unacct_memory(nr_accounted);
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
 	tlb_finish_mmu(tlb, 0, end);
+	mmu_notifier_release(mm);
 
 	/*
 	 * Walk the list again, actually closing and freeing it,
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
new file mode 100644
--- /dev/null
+++ b/mm/mmu_notifier.c
@@ -0,0 +1,73 @@
+/*
+ *  linux/mm/mmu_notifier.c
+ *
+ *  Copyright (C) 2008  Qumranet, Inc.
+ *  Copyright (C) 2008  SGI
+ *             Christoph Lameter <clameter@sgi.com>
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include <linux/mmu_notifier.h>
+#include <linux/module.h>
+#include <linux/rcupdate.h>
+
+/*
+ * No synchronization. This function can only be called when only a single
+ * process remains that performs teardown.
+ */
+void mmu_notifier_release(struct mm_struct *mm)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n, *tmp;
+
+	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+		hlist_for_each_entry_safe(mn, n, tmp,
+					  &mm->mmu_notifier.head, hlist) {
+			hlist_del(&mn->hlist);
+			if (mn->ops->release)
+				mn->ops->release(mn, mm);
+		}
+	}
+}
+
+/*
+ * If no young bitflag is supported by the hardware, ->age_page can
+ * unmap the address and return 1 or 0 depending if the mapping previously
+ * existed or not.
+ */
+int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n;
+	int young = 0;
+
+	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+		rcu_read_lock();
+		hlist_for_each_entry_rcu(mn, n,
+					 &mm->mmu_notifier.head, hlist) {
+			if (mn->ops->age_page)
+				young |= mn->ops->age_page(mn, mm, address);
+		}
+		rcu_read_unlock();
+	}
+
+	return young;
+}
+
+void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	spin_lock(&mm->mmu_notifier.lock);
+	hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
+	spin_unlock(&mm->mmu_notifier.lock);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_register);
+
+void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	spin_lock(&mm->mmu_notifier.lock);
+	hlist_del_rcu(&mn->hlist);
+	spin_unlock(&mm->mmu_notifier.lock);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
diff --git a/mm/mprotect.c b/mm/mprotect.c
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -32,6 +32,7 @@ static void change_pte_range(struct mm_s
 {
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
+	unsigned long start = addr;
 
 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	arch_enter_lazy_mmu_mode();
@@ -71,6 +72,7 @@ static void change_pte_range(struct mm_s
 
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
+	mmu_notifier(invalidate_pages, mm, start, addr);
 	pte_unmap_unlock(pte - 1, ptl);
 }
 

* Re: [patch 2/3] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-31 12:31   ` Andrea Arcangeli
@ 2008-01-31 20:07     ` Christoph Lameter
  2008-01-31 22:01     ` mmu_notifier: close hole in fork Christoph Lameter
  1 sibling, 0 replies; 63+ messages in thread
From: Christoph Lameter @ 2008-01-31 20:07 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Thu, 31 Jan 2008, Andrea Arcangeli wrote:

> On Wed, Jan 30, 2008 at 08:57:52PM -0800, Christoph Lameter wrote:
> > @@ -211,7 +212,9 @@ asmlinkage long sys_remap_file_pages(uns
> >  		spin_unlock(&mapping->i_mmap_lock);
> >  	}
> >  
> > +	mmu_notifier(invalidate_range_begin, mm, start, start + size, 0);
> >  	err = populate_range(mm, vma, start, size, pgoff);
> > +	mmu_notifier(invalidate_range_end, mm, 0);
> >  	if (!err && !(flags & MAP_NONBLOCK)) {
> >  		if (unlikely(has_write_lock)) {
> >  			downgrade_write(&mm->mmap_sem);
> 
> This can't be enough for GRU, in fact it can't work for KVM either. You
> got 1) to have some invalidate_page for GRU before freeing the page,
> and 2) to pass start, end to range_end (if you want kvm to use it
> instead of invalidate_page).

The external references are dropped when calling invalidate_range_begin. 
This would work both for KVM and the GRU. Why would KVM not be able to 
invalidate the range before? The locking convention is that no additional 
external reference can be added between invalidate_range_begin and 
invalidate_range_end. So KVM is fine too.

> mremap still missing as a whole.

mremap uses do_munmap which calls into unmap_region() that already has 
callbacks. So what is wrong there?

* Re: [PATCH] mmu notifiers #v5
  2008-01-31 17:18 ` [PATCH] mmu notifiers #v5 Andrea Arcangeli
@ 2008-01-31 20:18   ` Christoph Lameter
  2008-01-31 23:09     ` Christoph Lameter
  2008-01-31 23:28     ` Andrea Arcangeli
  0 siblings, 2 replies; 63+ messages in thread
From: Christoph Lameter @ 2008-01-31 20:18 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Thu, 31 Jan 2008, Andrea Arcangeli wrote:

> My suggestion is to add the invalidate_range_start/end incrementally
> with this, and to keep all the xpmem mmu notifiers in a separate
> incremental patch (those are going to require many more changes to
> perfect). They're very different things. GRU is simpler, will require
> less changes and it should be taken care of sooner than XPMEM. KVM
> requirements are a subset of GRU thanks to the page pin so I can
> ignore KVM requirements as a whole and I only focus on GRU for the
> time being.

KVM requires get_user_pages. This makes them currently different.

> Invalidates inside PT lock will avoid the page faults to happen in
> parallel of my invalidates, no dependency on the page pin, mremap

You are aware that the pt lock is split for systems with >4 CPUs? You can 
use the pte_lock only to serialize access to individual ptes.

> pagefault against the main linux page fault, given we already have all
> needed serialization out of the PT lock. XPMEM is forced to do that

The pt lock cannot serialize with invalidate_range since it is split. A range 
requires locking for a series of ptes, not only individual ones.

> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -46,6 +46,7 @@
>  	__young = ptep_test_and_clear_young(__vma, __address, __ptep);	\
>  	if (__young)							\
>  		flush_tlb_page(__vma, __address);			\
> +	__young |= mmu_notifier_age_page((__vma)->vm_mm, __address);	\
>  	__young;							\
>  })
>  #endif

That may be okay. Have you checked all the arches that can provide their 
own implementation of this macro? This is only going to work on arches 
that use the generic implementation.

 > @@ -86,6 +87,7 @@ do {									\
>  	pte_t __pte;							\
>  	__pte = ptep_get_and_clear((__vma)->vm_mm, __address, __ptep);	\
>  	flush_tlb_page(__vma, __address);				\
> +	mmu_notifier(invalidate_page, (__vma)->vm_mm, __address);	\
>  	__pte;								\
>  })
>  #endif

This will require a callback on every(!) removal of a pte. A range 
invalidate does not do any good since the callbacks are performed anyway, 
probably needlessly.

In addition you have the same issues with arches providing their own macro 
here.

> diff --git a/include/asm-s390/pgtable.h b/include/asm-s390/pgtable.h
> --- a/include/asm-s390/pgtable.h
> +++ b/include/asm-s390/pgtable.h
> @@ -712,6 +712,7 @@ static inline pte_t ptep_clear_flush(str
>  {
>  	pte_t pte = *ptep;
>  	ptep_invalidate(address, ptep);
> +	mmu_notifier(invalidate_page, vma->vm_mm, address);
>  	return pte;
>  }
>  

Ahh you found an additional arch. How about x86 code? There is one 
override of these functions there as well.

> +	/*
> +	 * invalidate_page[s] is called in atomic context
> +	 * after any pte has been updated and before
> +	 * dropping the PT lock required to update any Linux pte.
> +	 * Once the PT lock will be released the pte will have its
> +	 * final value to export through the secondary MMU.
> +	 * Before this is invoked any secondary MMU is still ok
> +	 * to read/write to the page previously pointed by the
> +	 * Linux pte because the old page hasn't been freed yet.
> +	 * If required set_page_dirty has to be called internally
> +	 * to this method.
> +	 */
> +	void (*invalidate_page)(struct mmu_notifier *mn,
> +				struct mm_struct *mm,
> +				unsigned long address);


> +	void (*invalidate_pages)(struct mmu_notifier *mn,
> +				 struct mm_struct *mm,
> +				 unsigned long start, unsigned long end);

What is the point of invalidate_pages? It cannot be serialized properly
and you do the invalidate_page() calls regardless. Is this some sort of
optimization?

> +struct mmu_notifier_head {};
> +
> +#define mmu_notifier_register(mn, mm) do {} while(0)
> +#define mmu_notifier_unregister(mn, mm) do {} while (0)
> +#define mmu_notifier_release(mm) do {} while (0)
> +#define mmu_notifier_age_page(mm, address) ({ 0; })
> +#define mmu_notifier_head_init(mmh) do {} while (0)

Macros. We want functions there to be able to validate the parameters even 
if !CONFIG_MMU_NOTIFIER.
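
For example, stubs along these lines would do it (just a sketch of what I
mean, mirroring the macros above):

/* !CONFIG_MMU_NOTIFIER side of the header */
static inline void mmu_notifier_register(struct mmu_notifier *mn,
					 struct mm_struct *mm) {}
static inline void mmu_notifier_unregister(struct mmu_notifier *mn,
					   struct mm_struct *mm) {}
static inline void mmu_notifier_release(struct mm_struct *mm) {}
static inline int mmu_notifier_age_page(struct mm_struct *mm,
					unsigned long address)
{
	return 0;
}
static inline void mmu_notifier_head_init(struct mmu_notifier_head *mmh) {}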

> +
> +/*
> + * Notifiers that use the parameters that they were passed so that the
> + * compiler does not complain about unused variables but does proper
> + * parameter checks even if !CONFIG_MMU_NOTIFIER.
> + * Macros generate no code.
> + */
> +#define mmu_notifier(function, mm, args...)			       \
> +	do {							       \
> +		if (0) {					       \
> +			struct mmu_notifier *__mn;		       \
> +								       \
> +			__mn = (struct mmu_notifier *)(0x00ff);	       \
> +			__mn->ops->function(__mn, mm, args);	       \
> +		};						       \
> +	} while (0)
> +
> +#endif /* CONFIG_MMU_NOTIFIER */

Ok here you took the variant that checks parameters.

> @@ -1249,6 +1250,7 @@ static int remap_pte_range(struct mm_str
>  {
>  	pte_t *pte;
>  	spinlock_t *ptl;
> +	unsigned long start = addr;
>  
>  	pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
>  	if (!pte)
> @@ -1260,6 +1262,7 @@ static int remap_pte_range(struct mm_str
>  		pfn++;
>  	} while (pte++, addr += PAGE_SIZE, addr != end);
>  	arch_leave_lazy_mmu_mode();
> +	mmu_notifier(invalidate_pages, mm, start, addr);
>  	pte_unmap_unlock(pte - 1, ptl);
>  	return 0;

You are under the wrong impression that you can use the pte lock to
serialize general access to ptes! Nope. The pte lock only serializes access
to individual ptes. This is broken.

> +	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> +		hlist_for_each_entry_safe(mn, n, tmp,
> +					  &mm->mmu_notifier.head, hlist) {
> +			hlist_del(&mn->hlist);

hlist_del_init?

> @@ -71,6 +72,7 @@ static void change_pte_range(struct mm_s
>  
>  	} while (pte++, addr += PAGE_SIZE, addr != end);
>  	arch_leave_lazy_mmu_mode();
> +	mmu_notifier(invalidate_pages, mm, start, addr);
>  	pte_unmap_unlock(pte - 1, ptl);
>  }

Again broken serialization.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* mmu_notifier: close hole in fork
  2008-01-31 12:31   ` Andrea Arcangeli
  2008-01-31 20:07     ` Christoph Lameter
@ 2008-01-31 22:01     ` Christoph Lameter
  2008-01-31 22:16       ` mmu_notifier: reduce size of mm_struct if !CONFIG_MMU_NOTIFIER Christoph Lameter
                         ` (2 more replies)
  1 sibling, 3 replies; 63+ messages in thread
From: Christoph Lameter @ 2008-01-31 22:01 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

Talking to Robin and Jack we found that we still have a hole during fork.
Fork may write-protect a pte. At that point the remote ptes are
not marked readonly(!). Without this patch, remote writes may occur to
pages that are marked readonly locally.

mmu_notifier: Provide invalidate_range on fork

On fork we change ptes in cow mappings to readonly. This means we must
invalidate the ptes so that they are reestablished later with proper
permission.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/memory.c |    6 ++++++
 1 file changed, 6 insertions(+)

Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2008-01-31 13:42:35.000000000 -0800
+++ linux-2.6/mm/memory.c	2008-01-31 13:47:31.000000000 -0800
@@ -602,6 +602,9 @@ int copy_page_range(struct mm_struct *ds
 	if (is_vm_hugetlb_page(vma))
 		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
 
+	if (is_cow_mapping(vma->vm_flags))
+		mmu_notifier(invalidate_range_begin, src_mm, addr, end, 0);
+
 	dst_pgd = pgd_offset(dst_mm, addr);
 	src_pgd = pgd_offset(src_mm, addr);
 	do {
@@ -612,6 +615,9 @@ int copy_page_range(struct mm_struct *ds
 						vma, addr, next))
 			return -ENOMEM;
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
+
+	if (is_cow_mapping(vma->vm_flags))
+		mmu_notifier(invalidate_range_end, src_mm, 0);
 	return 0;
 }
 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: mmu_notifier: reduce size of mm_struct if !CONFIG_MMU_NOTIFIER
  2008-01-31 22:01     ` mmu_notifier: close hole in fork Christoph Lameter
@ 2008-01-31 22:16       ` Christoph Lameter
  2008-01-31 22:21       ` mmu_notifier: Move mmu_notifier_release up to get rid of the invalidat_all() callback Christoph Lameter
  2008-02-01  0:01       ` mmu_notifier: close hole in fork Andrea Arcangeli
  2 siblings, 0 replies; 63+ messages in thread
From: Christoph Lameter @ 2008-01-31 22:16 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

mmu_notifier: Reduce size of mm_struct if !CONFIG_MMU_NOTIFIER

Andrea and Peter had a concern about this.

Use an #ifdef to make the mmu_notifier_head structure empty if we have
no notifier. That allows the use of the structure in inline functions
(which allows parameter verification even if !CONFIG_MMU_NOTIFIER).

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/mm_types.h |    2 ++
 1 file changed, 2 insertions(+)

Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h	2008-01-31 14:03:23.000000000 -0800
+++ linux-2.6/include/linux/mm_types.h	2008-01-31 14:03:38.000000000 -0800
@@ -154,7 +154,9 @@ struct vm_area_struct {
 };
 
 struct mmu_notifier_head {
+#ifdef CONFIG_MMU_NOTIFIER
 	struct hlist_head head;
+#endif
 };
 
 struct mm_struct {


^ permalink raw reply	[flat|nested] 63+ messages in thread

* mmu_notifier: Move mmu_notifier_release up to get rid of the invalidat_all() callback
  2008-01-31 22:01     ` mmu_notifier: close hole in fork Christoph Lameter
  2008-01-31 22:16       ` mmu_notifier: reduce size of mm_struct if !CONFIG_MMU_NOTIFIER Christoph Lameter
@ 2008-01-31 22:21       ` Christoph Lameter
  2008-02-01  0:13         ` Andrea Arcangeli
  2008-02-01  0:01       ` mmu_notifier: close hole in fork Andrea Arcangeli
  2 siblings, 1 reply; 63+ messages in thread
From: Christoph Lameter @ 2008-01-31 22:21 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

mmu_notifier: Move mmu_notifier_release up to get rid of invalidate_all()

It seems that it is safe to call mmu_notifier_release() before we tear down
the pages and the vmas since we are the only executing thread. mmu_notifier_release
can then also tear down all its external ptes and thus we can get rid
of the invalidate_all() callback.

During the final teardown no mmu_notifiers are registered anymore, which
will speed up exit processing.

Is this okay for KVM too?

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/mmu_notifier.h |    4 ----
 mm/mmap.c                    |    3 +--
 2 files changed, 1 insertion(+), 6 deletions(-)

Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- linux-2.6.orig/include/linux/mmu_notifier.h	2008-01-31 14:17:17.000000000 -0800
+++ linux-2.6/include/linux/mmu_notifier.h	2008-01-31 14:17:28.000000000 -0800
@@ -59,10 +59,6 @@ struct mmu_notifier_ops {
 	void (*release)(struct mmu_notifier *mn,
 			struct mm_struct *mm);
 
-	/* Dummy needed because the mmu_notifier() macro requires it */
-	void (*invalidate_all)(struct mmu_notifier *mn, struct mm_struct *mm,
-				int dummy);
-
 	/*
 	 * age_page is called from contexts where the pte_lock is held
 	 */
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2008-01-31 14:16:51.000000000 -0800
+++ linux-2.6/mm/mmap.c	2008-01-31 14:17:10.000000000 -0800
@@ -2036,7 +2036,7 @@ void exit_mmap(struct mm_struct *mm)
 	unsigned long end;
 
 	/* mm's last user has gone, and its about to be pulled down */
-	mmu_notifier(invalidate_all, mm, 0);
+	mmu_notifier_release(mm);
 	arch_exit_mmap(mm);
 
 	lru_add_drain();
@@ -2048,7 +2048,6 @@ void exit_mmap(struct mm_struct *mm)
 	vm_unacct_memory(nr_accounted);
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
 	tlb_finish_mmu(tlb, 0, end);
-	mmu_notifier_release(mm);
 
 	/*
 	 * Walk the list again, actually closing and freeing it,


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] mmu notifiers #v5
  2008-01-31 20:18   ` Christoph Lameter
@ 2008-01-31 23:09     ` Christoph Lameter
  2008-01-31 23:41       ` Andrea Arcangeli
  2008-01-31 23:28     ` Andrea Arcangeli
  1 sibling, 1 reply; 63+ messages in thread
From: Christoph Lameter @ 2008-01-31 23:09 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Thu, 31 Jan 2008, Christoph Lameter wrote:

> > pagefault against the main linux page fault, given we already have all
> > needed serialization out of the PT lock. XPMEM is forced to do that
> 
> pt lock cannot serialize with invalidate_range since it is split. A range 
> requires locking for a series of ptes not only individual ones.

Hmmm.. May be okay after all. I see that you are only doing it on the pte 
level. This means the range callbacks are taking down a max of 512 
entries. So you have a callback for each pmd. A callback for 2M of memory?





^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] mmu notifiers #v5
  2008-01-31 20:18   ` Christoph Lameter
  2008-01-31 23:09     ` Christoph Lameter
@ 2008-01-31 23:28     ` Andrea Arcangeli
  2008-02-01  1:37       ` Christoph Lameter
  1 sibling, 1 reply; 63+ messages in thread
From: Andrea Arcangeli @ 2008-01-31 23:28 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Thu, Jan 31, 2008 at 12:18:54PM -0800, Christoph Lameter wrote:
> pt lock cannot serialize with invalidate_range since it is split. A range 
> requires locking for a series of ptes not only individual ones.

The lock I take already protects up to 512 ptes, yes. I call
invalidate_pages only across a _single_ 4k pagetable filled with at most
512 pte_t entries mapping virtual addresses in a 2M naturally aligned
virtual range.

There's no SMP race even for >4 CPUs. Check it again! I never call
invalidate_pages _outside_ the PT lock specific for that 4k pagetable
(a single PT lock protects 512 pte_t entries, not only a single one!).

Perhaps if you called it PMD lock there would be no misunderstanding:

    spinlock_t *__ptl = pte_lockptr(mm, pmd);	 \

(the PT lock is a function of the pmd, not of the pte_t!)
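
Condensed, the pattern my hunks follow looks like this (a sketch, not a
literal copy of the patch):

/*
 * Every pte touched below is mapped by the same pmd, so one split PT lock
 * covers the whole naturally aligned, <=2M range handed to invalidate_pages.
 */
static void sketch_update_pmd_range(struct mm_struct *mm, pmd_t *pmd,
				    unsigned long addr, unsigned long end)
{
	unsigned long start = addr;
	spinlock_t *ptl;
	pte_t *pte = pte_offset_map_lock(mm, pmd, addr, &ptl);

	do {
		/* ... modify *pte under the one pmd-wide lock ... */
	} while (pte++, addr += PAGE_SIZE, addr != end);

	mmu_notifier(invalidate_pages, mm, start, addr);	/* still under ptl */
	pte_unmap_unlock(pte - 1, ptl);
}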

> That may be okay. Have you checked all the arches that can provide their 
> own implementation of this macro? This is only going to work on arches 
> that use the generic implementation.

Obviously I checked them all, yes, and this was much faster than
hand-editing the .c files like you did.

> This will require a callback on every(!) removal of a pte. A range 
> invalidate does not do any good since the callbacks are performed anyways. 
> Probably needlessly.
> 
> In addition you have the same issues with arches providing their own macro 
> here.

Yes, s390 is the only one.

> 
> > diff --git a/include/asm-s390/pgtable.h b/include/asm-s390/pgtable.h
> > --- a/include/asm-s390/pgtable.h
> > +++ b/include/asm-s390/pgtable.h
> > @@ -712,6 +712,7 @@ static inline pte_t ptep_clear_flush(str
> >  {
> >  	pte_t pte = *ptep;
> >  	ptep_invalidate(address, ptep);
> > +	mmu_notifier(invalidate_page, vma->vm_mm, address);
> >  	return pte;
> >  }
> >  
> 
> Ahh you found an additional arch. How about x86 code? There is one 
> override of these functions there as well.

There's no ptep_clear_flush override on x86 or I would have patched it
like I did to s390.

I had to change 2 lines instead of a single one, not such a big deal.

> > +	/*
> > +	 * invalidate_page[s] is called in atomic context
> > +	 * after any pte has been updated and before
> > +	 * dropping the PT lock required to update any Linux pte.
> > +	 * Once the PT lock will be released the pte will have its
> > +	 * final value to export through the secondary MMU.
> > +	 * Before this is invoked any secondary MMU is still ok
> > +	 * to read/write to the page previously pointed by the
> > +	 * Linux pte because the old page hasn't been freed yet.
> > +	 * If required set_page_dirty has to be called internally
> > +	 * to this method.
> > +	 */
> > +	void (*invalidate_page)(struct mmu_notifier *mn,
> > +				struct mm_struct *mm,
> > +				unsigned long address);
> 
> 
> > +	void (*invalidate_pages)(struct mmu_notifier *mn,
> > +				 struct mm_struct *mm,
> > +				 unsigned long start, unsigned long end);
> 
> What is the point of invalidate_pages? It cannot be serialized properly
> and you do the invalidate_page() calls regardless. Is this some sort of
> optimization?

It is already serialized 100% properly, sorry.

> > +struct mmu_notifier_head {};
> > +
> > +#define mmu_notifier_register(mn, mm) do {} while(0)
> > +#define mmu_notifier_unregister(mn, mm) do {} while (0)
> > +#define mmu_notifier_release(mm) do {} while (0)
> > +#define mmu_notifier_age_page(mm, address) ({ 0; })
> > +#define mmu_notifier_head_init(mmh) do {} while (0)
> 
> Macros. We want functions there to be able to validate the parameters even 
> if !CONFIG_MMU_NOTIFIER.

If you want I can turn this into a static inline; it would already
work fine. Certainly this isn't a blocker for merging, given that most
people will have MMU_NOTIFIER=y and this only speeds up compilation a
tiny bit for embedded builds.

> > +
> > +/*
> > + * Notifiers that use the parameters that they were passed so that the
> > + * compiler does not complain about unused variables but does proper
> > + * parameter checks even if !CONFIG_MMU_NOTIFIER.
> > + * Macros generate no code.
> > + */
> > +#define mmu_notifier(function, mm, args...)			       \
> > +	do {							       \
> > +		if (0) {					       \
> > +			struct mmu_notifier *__mn;		       \
> > +								       \
> > +			__mn = (struct mmu_notifier *)(0x00ff);	       \
> > +			__mn->ops->function(__mn, mm, args);	       \
> > +		};						       \
> > +	} while (0)
> > +
> > +#endif /* CONFIG_MMU_NOTIFIER */
> 
> Ok here you took the variant that checks parameters.

This is primarily to turn off the compiler warning, not to check the
parameters, but yes, it also checks the parameter types as a bonus.

> > @@ -1249,6 +1250,7 @@ static int remap_pte_range(struct mm_str
> >  {
> >  	pte_t *pte;
> >  	spinlock_t *ptl;
> > +	unsigned long start = addr;
> >  
> >  	pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
> >  	if (!pte)
> > @@ -1260,6 +1262,7 @@ static int remap_pte_range(struct mm_str
> >  		pfn++;
> >  	} while (pte++, addr += PAGE_SIZE, addr != end);
> >  	arch_leave_lazy_mmu_mode();
> > +	mmu_notifier(invalidate_pages, mm, start, addr);
> >  	pte_unmap_unlock(pte - 1, ptl);
> >  	return 0;
> 
> You are under the wrong impression that you can use the pte lock to
> serialize general access to ptes! Nope. The pte lock only serializes access
> to individual ptes. This is broken.

You are under the wrong impression that invalidate_pages could be
called on a range that spans a region that isn't 2M naturally
aligned and <2M in size.

> > +	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> > +		hlist_for_each_entry_safe(mn, n, tmp,
> > +					  &mm->mmu_notifier.head, hlist) {
> > +			hlist_del(&mn->hlist);
> 
> hlist_del_init?

This is a mess, I have to say, and I'm almost willing to return to a
_safe_rcu walk and to removing the auto-disarming feature. KVM itself isn't
using mm_users but mm_count. Then if somebody needs to release the mn
inside ->release they should call mmu_notifier_unregister _inside_
->release and then use call_rcu to free the mn (synchronize_rcu isn't
allowed inside ->release because of the RCU read lock, which can't
schedule, or it would reach a quiescent point itself, allowing the other
structures to be released).

> > @@ -71,6 +72,7 @@ static void change_pte_range(struct mm_s
> >  
> >  	} while (pte++, addr += PAGE_SIZE, addr != end);
> >  	arch_leave_lazy_mmu_mode();
> > +	mmu_notifier(invalidate_pages, mm, start, addr);
> >  	pte_unmap_unlock(pte - 1, ptl);
> >  }
> 
> Again broken serialization.

The above is perfectly fine too. If you have doubts and the
misunderstanding about my 100% SMP-safe locking isn't immediately cleared
up, think about this: how could it be safe to modify 512 ptes with a
single lock, regardless of the mmu notifiers?

I appreciate the review! I hope my entirely bug-free and
straightforward #v5 will strongly increase the probability of getting
this in sooner rather than later. If nothing else it shows the approach I
prefer to cover GRU/KVM 100%, leaving the overkill mutex locking
requirements only to the mmu notifier users that can't deal with the
scalable, fine-grained and already-taken/trashed PT lock.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] mmu notifiers #v5
  2008-01-31 23:09     ` Christoph Lameter
@ 2008-01-31 23:41       ` Andrea Arcangeli
  2008-02-01  1:44         ` Christoph Lameter
  0 siblings, 1 reply; 63+ messages in thread
From: Andrea Arcangeli @ 2008-01-31 23:41 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Thu, Jan 31, 2008 at 03:09:55PM -0800, Christoph Lameter wrote:
> On Thu, 31 Jan 2008, Christoph Lameter wrote:
> 
> > > pagefault against the main linux page fault, given we already have all
> > > needed serialization out of the PT lock. XPMEM is forced to do that
> > 
> > pt lock cannot serialize with invalidate_range since it is split. A range 
> > requires locking for a series of ptes not only individual ones.
> 
> Hmmm.. May be okay after all. I see that you are only doing it on the pte 
> level. This means the range callbacks are taking down a max of 512 
> entries. So you have a callback for each pmd. A callback for 2M of memory?

Exactly. The point of _pages is to reduce by an order of magnitude
(512, or 1024, times) the number of needed invalidate_page calls in a
few places where it's a straightforward optimization for both KVM and
GRU. Thanks to the PT lock this remains a totally obviously safe
design and it requires zero additional locking anywhere (nor linux VM,
nor in the mmu notifier methods, nor in the KVM/GRU page fault).

Sure, you can do invalidate_range_start/end for virtual ranges larger than
the 2M (4M on 32bit) maximum. But my approach, which already amortizes the
fixed mmu_lock cost over 512 (1024) ptes, will make any larger-range
improvement barely measurable, given that to do it you have to add locking
as well and _surely_ decrease GRU scalability with tons of threads and
tons of cpus, potentially making GRU a lot slower _especially_ on your
NUMA systems.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: mmu_notifier: close hole in fork
  2008-01-31 22:01     ` mmu_notifier: close hole in fork Christoph Lameter
  2008-01-31 22:16       ` mmu_notifier: reduce size of mm_struct if !CONFIG_MMU_NOTIFIER Christoph Lameter
  2008-01-31 22:21       ` mmu_notifier: Move mmu_notifier_release up to get rid of the invalidat_all() callback Christoph Lameter
@ 2008-02-01  0:01       ` Andrea Arcangeli
  2008-02-01  1:48         ` Christoph Lameter
  2 siblings, 1 reply; 63+ messages in thread
From: Andrea Arcangeli @ 2008-02-01  0:01 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Thu, Jan 31, 2008 at 02:01:43PM -0800, Christoph Lameter wrote:
> Talking to Robin and Jack we found that we still have a hole during fork.
> Fork may write-protect a pte. At that point the remote ptes are
> not marked readonly(!). Without this patch, remote writes may occur to
> pages that are marked readonly locally.

Good catch! This was missing also in my #v5 (KVM doesn't need that
because the only possible cows on sptes can be generated by ksm, but
it would have been a problem for GRU). The more I think about it, the
more I think GRU will run very nicely with follow_page taking zero
additional locking and with the nicely scalable PT lock that doesn't
require 1024 threads to thrash on the same lock on 1024 different cpus
if they access addresses that aren't in the same naturally aligned 2M
chunk of virtual address space. I understood follow_page was
performance critical for GRU; if so, then I hope my _dual_ approach
is by far the best for at least GRU (and KVM of course for the very
same reason), and of course it'll fit XPMEM too the moment you add
invalidate_range_start/end.

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -494,6 +494,7 @@ static int copy_pte_range(struct mm_stru
 	spinlock_t *src_ptl, *dst_ptl;
 	int progress = 0;
 	int rss[2];
+	unsigned long start;
 
 again:
 	rss[1] = rss[0] = 0;
@@ -505,6 +506,7 @@ again:
 	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 	arch_enter_lazy_mmu_mode();
 
+	start = addr;
 	do {
 		/*
 		 * We are holding two locks at this point - either of them
@@ -525,6 +527,8 @@ again:
 	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
 
 	arch_leave_lazy_mmu_mode();
+	if (is_cow_mapping(vma->vm_flags))
+		mmu_notifier(invalidate_pages, vma->vm_mm, start, addr);
 	spin_unlock(src_ptl);
 	pte_unmap_nested(src_pte - 1);
 	add_mm_rss(dst_mm, rss[0], rss[1]);

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: mmu_notifier: Move mmu_notifier_release up to get rid of the invalidat_all() callback
  2008-01-31 22:21       ` mmu_notifier: Move mmu_notifier_release up to get rid of the invalidat_all() callback Christoph Lameter
@ 2008-02-01  0:13         ` Andrea Arcangeli
  2008-02-01  1:52           ` Christoph Lameter
  2008-02-01  1:57           ` mmu_notifier: invalidate_range for move_page_tables Christoph Lameter
  0 siblings, 2 replies; 63+ messages in thread
From: Andrea Arcangeli @ 2008-02-01  0:13 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Thu, Jan 31, 2008 at 02:21:58PM -0800, Christoph Lameter wrote:
> Is this okay for KVM too?

->release isn't implemented at all in KVM, only the list_del generates
complications.

I think the current code could already be safe through the mm_count pin,
because KVM relies on the fact that anybody pinning through mm_count like
KVM does is forbidden to call unregister and is forced to wait for the
auto-disarming when mm_users hits zero, but I feel like something's
still wrong if I think that I'm not using call_rcu to free the
notifier (OTOH we agreed the list had to be frozen and w/o readers
(modulo _release) before _release is called, so if this initial
assumption is ok it seems I may be safe w/o call_rcu?).

But it's a really tricky path. Anyway, this is the last of my worries
right now; it works perfectly fine with a single user obviously, and
the moment when KVM threads run remotely through GRU/XPMEM isn't happening
too soon ;) so let's concentrate on the rest first. I can say
hlist_del_init doesn't seem to provide any benefit given nobody could
possibly decide to call register or unregister after _release run.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] mmu notifiers #v5
  2008-01-31 23:28     ` Andrea Arcangeli
@ 2008-02-01  1:37       ` Christoph Lameter
  2008-02-01  2:23         ` Robin Holt
  2008-02-01 12:00         ` Andrea Arcangeli
  0 siblings, 2 replies; 63+ messages in thread
From: Christoph Lameter @ 2008-02-01  1:37 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Fri, 1 Feb 2008, Andrea Arcangeli wrote:

> I appreciate the review! I hope my entirely bug-free and
> straightforward #v5 will strongly increase the probability of getting
> this in sooner rather than later. If nothing else it shows the approach I
> prefer to cover GRU/KVM 100%, leaving the overkill mutex locking
> requirements only to the mmu notifier users that can't deal with the
> scalable, fine-grained and already-taken/trashed PT lock.

Mutex locking? Could you be more specific?

I hope you will continue to do reviews of the other mmu notifier patchset?



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] mmu notifiers #v5
  2008-01-31 23:41       ` Andrea Arcangeli
@ 2008-02-01  1:44         ` Christoph Lameter
  2008-02-01 12:09           ` Andrea Arcangeli
  0 siblings, 1 reply; 63+ messages in thread
From: Christoph Lameter @ 2008-02-01  1:44 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Fri, 1 Feb 2008, Andrea Arcangeli wrote:

> GRU. Thanks to the PT lock this remains a totally obviously safe
> design and it requires zero additional locking anywhere (nor linux VM,
> nor in the mmu notifier methods, nor in the KVM/GRU page fault).

Na. I would not be so sure about having caught all the issues yet...

> Sure, you can do invalidate_range_start/end for virtual ranges larger than
> the 2M (4M on 32bit) maximum. But my approach, which already amortizes the
> fixed mmu_lock cost over 512 (1024) ptes, will make any larger-range
> improvement barely measurable, given that to do it you have to add locking
> as well and _surely_ decrease GRU scalability with tons of threads and
> tons of cpus, potentially making GRU a lot slower _especially_ on your
> NUMA systems.

The trouble is that the invalidates are much more expensive if you have to
send these to remote partitions (XPmem). And it's really great if you can
simply tear down everything. Certainly this is a significant improvement
over the earlier approach, but you still have the invalidate_page calls in
ptep_clear_flush. So they fire needlessly?

Serializing access in the device driver makes sense and comes with the
additional possibility of not having to increment page counts all the time.
So you trade one cacheline dirtying for the many that are necessary if you
always increment the page count.

How does KVM ensure the consistency of the shadow page tables? Atomic ops?

The GRU has no page table on its own. It populates TLB entries on demand 
using the linux page table. There is no way it can figure out when to 
drop page counts again. The invalidate calls are turned directly into tlb 
flushes.
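
To illustrate what I mean (a sketch only, with made-up structure and helper
names, not the actual driver code):

struct gru_mm_sketch {
	struct mmu_notifier mn;
	/* ... per-mm GRU TLB state ... */
};

/* hypothetical helper that purges GRU TLB entries for a virtual range */
void gru_sketch_flush_tlb(struct gru_mm_sketch *gms,
			  unsigned long start, unsigned long end);

static void gru_sketch_invalidate_page(struct mmu_notifier *mn,
				       struct mm_struct *mm,
				       unsigned long address)
{
	struct gru_mm_sketch *gms = container_of(mn, struct gru_mm_sketch, mn);

	/* No external page table and no page pin: just drop the TLB entries. */
	gru_sketch_flush_tlb(gms, address, address + PAGE_SIZE);
}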

 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: mmu_notifier: close hole in fork
  2008-02-01  0:01       ` mmu_notifier: close hole in fork Andrea Arcangeli
@ 2008-02-01  1:48         ` Christoph Lameter
  0 siblings, 0 replies; 63+ messages in thread
From: Christoph Lameter @ 2008-02-01  1:48 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Fri, 1 Feb 2008, Andrea Arcangeli wrote:

> Good catch! This was missing also in my #v5 (KVM doesn't need that
> because the only possible cows on sptes can be generated by ksm, but
> it would have been a problem for GRU). The more I think about it, the

How do you think the GRU should know when to drop the refcount? There is 
no page table and thus no way of tracking that a refcount was taken. 
Without the refcount you cannot defer the freeing of the page. So a
shootdown on invalidate_range_begin, locking out new references until
invalidate_range_end, seems to be the only workable solution.

BTW what do you think about adding a flag parameter to the invalidate 
calls that allows shooting down writable ptes only? That could be useful 
for COW and page_mkclean.

So

#define MMU_ATOMIC 1
#define MMU_WRITABLE 2

instead of the atomic parameter?
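
In a sketch (the MMU_WRITABLE handling is hypothetical, not part of any
posted patch), the ops member and a COW-time callout could then look like:

#define MMU_ATOMIC	1	/* caller is in atomic context, don't sleep */
#define MMU_WRITABLE	2	/* only writable ptes are being shot down */

	void (*invalidate_range_begin)(struct mmu_notifier *mn,
				       struct mm_struct *mm,
				       unsigned long start, unsigned long end,
				       int flags);

	/* fork/COW would pass MMU_WRITABLE instead of the bare 0 above */
	mmu_notifier(invalidate_range_begin, src_mm, addr, end, MMU_WRITABLE);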




^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: mmu_notifier: Move mmu_notifier_release up to get rid of the invalidat_all() callback
  2008-02-01  0:13         ` Andrea Arcangeli
@ 2008-02-01  1:52           ` Christoph Lameter
  2008-02-01  1:57           ` mmu_notifier: invalidate_range for move_page_tables Christoph Lameter
  1 sibling, 0 replies; 63+ messages in thread
From: Christoph Lameter @ 2008-02-01  1:52 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Fri, 1 Feb 2008, Andrea Arcangeli wrote:

> On Thu, Jan 31, 2008 at 02:21:58PM -0800, Christoph Lameter wrote:
> > Is this okay for KVM too?
> 
> ->release isn't implemented at all in KVM, only the list_del generates
> complications.

Why would the list_del generate problems?

> I think the current code could already be safe through the mm_count pin,
> because KVM relies on the fact that anybody pinning through mm_count like
> KVM does is forbidden to call unregister and is forced to wait for the
> auto-disarming when mm_users hits zero, but I feel like something's
> still wrong if I think that I'm not using call_rcu to free the
> notifier (OTOH we agreed the list had to be frozen and w/o readers
> (modulo _release) before _release is called, so if this initial
> assumption is ok it seems I may be safe w/o call_rcu?).

You could pin via mm_users? Then it would be entirely safe and no need for 
rcu tricks?

OTOH if there are mm_count users like in KVM: Could we guarantee that 
they do not perform any operations with the mmu notifier list? Then we 
would be safe as well.

> too soon ;) so let's concentrate on the rest first. I can say
> hlist_del_init doesn't seem to provide any benefit given nobody could
> possibly decide to call register or unregister after _release run.

It is useful if a device driver has a list of data segments that contain 
struct mmu_notifiers. The device driver can inspect the mmu_notifier and 
reliably conclude that the beast is inactive.
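
Something like this sketch, which is only valid if ->release (and
unregister) use hlist_del_init(); seg, seg_lock and seg_free() are made-up
driver names:

	spin_lock(&seg_lock);
	if (hlist_unhashed(&seg->mn.hlist))
		seg_free(seg);		/* ->release already ran, notifier inactive */
	spin_unlock(&seg_lock);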
 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 1/3] mmu_notifier: Core code
  2008-01-31  4:57 ` [patch 1/3] mmu_notifier: Core code Christoph Lameter
@ 2008-02-01  1:56   ` Jack Steiner
  2008-02-01  2:24     ` Robin Holt
  2008-02-01  2:31   ` Robin Holt
  2008-02-01  3:52   ` Robin Holt
  2 siblings, 1 reply; 63+ messages in thread
From: Jack Steiner @ 2008-02-01  1:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, linux-kernel, linux-mm, daniel.blueman

> @@ -2033,6 +2034,7 @@ void exit_mmap(struct mm_struct *mm)
>  	unsigned long end;
>  
>  	/* mm's last user has gone, and its about to be pulled down */
> +	mmu_notifier(invalidate_all, mm, 0);
>  	arch_exit_mmap(mm);
>  

The name of the "invalidate_all" callout is not very descriptive.
Why not use "exit_mmap". That matches the function name, the arch callout
and better describes what is happening.


--- jack



^ permalink raw reply	[flat|nested] 63+ messages in thread

* mmu_notifier: invalidate_range for move_page_tables
  2008-02-01  0:13         ` Andrea Arcangeli
  2008-02-01  1:52           ` Christoph Lameter
@ 2008-02-01  1:57           ` Christoph Lameter
  2008-02-01  2:38             ` Robin Holt
  1 sibling, 1 reply; 63+ messages in thread
From: Christoph Lameter @ 2008-02-01  1:57 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

mmu_notifier: Provide invalidate_range for move_page_tables

Move page tables also needs to invalidate the external references
and hold new references off while moving page table entries.

This is already guaranteed by holding a writelock
on mmap_sem for get_user_pages() but follow_page() is not subject to
the mmap_sem locking.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/mremap.c |    4 ++++
 1 file changed, 4 insertions(+)

Index: linux-2.6/mm/mremap.c
===================================================================
--- linux-2.6.orig/mm/mremap.c	2008-01-25 14:25:31.000000000 -0800
+++ linux-2.6/mm/mremap.c	2008-01-31 17:54:19.000000000 -0800
@@ -18,6 +18,7 @@
 #include <linux/highmem.h>
 #include <linux/security.h>
 #include <linux/syscalls.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -130,6 +131,8 @@ unsigned long move_page_tables(struct vm
 	old_end = old_addr + len;
 	flush_cache_range(vma, old_addr, old_end);
 
+	mmu_notifier(invalidate_range_begin, vma->vm_mm,
+					old_addr, old_addr + len, 0);
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
 		cond_resched();
 		next = (old_addr + PMD_SIZE) & PMD_MASK;
@@ -150,6 +153,7 @@ unsigned long move_page_tables(struct vm
 		move_ptes(vma, old_pmd, old_addr, old_addr + extent,
 				new_vma, new_pmd, new_addr);
 	}
+	mmu_notifier(invalidate_range_end, vma->vm_mm, 0);
 
 	return len + old_addr - old_end;	/* how much done */
 }

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] mmu notifiers #v5
  2008-02-01  1:37       ` Christoph Lameter
@ 2008-02-01  2:23         ` Robin Holt
  2008-02-01  2:26           ` Christoph Lameter
  2008-02-01 12:00         ` Andrea Arcangeli
  1 sibling, 1 reply; 63+ messages in thread
From: Robin Holt @ 2008-02-01  2:23 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

On Thu, Jan 31, 2008 at 05:37:21PM -0800, Christoph Lameter wrote:
> On Fri, 1 Feb 2008, Andrea Arcangeli wrote:
> 
> > I appreciate the review! I hope my entirely bug-free and
> > straightforward #v5 will strongly increase the probability of getting
> > this in sooner rather than later. If nothing else it shows the approach I
> > prefer to cover GRU/KVM 100%, leaving the overkill mutex locking
> > requirements only to the mmu notifier users that can't deal with the
> > scalable, fine-grained and already-taken/trashed PT lock.
> 
> Mutex locking? Could you be more specific?

I think he is talking about the external locking that xpmem will need
to do to ensure we are not able to refault pages inside of regions that
are undergoing recall/page table clearing.  At least that has been my
understanding to this point.

Thanks,
Robin

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 1/3] mmu_notifier: Core code
  2008-02-01  1:56   ` Jack Steiner
@ 2008-02-01  2:24     ` Robin Holt
  2008-02-01  2:37       ` Jack Steiner
  0 siblings, 1 reply; 63+ messages in thread
From: Robin Holt @ 2008-02-01  2:24 UTC (permalink / raw)
  To: Jack Steiner
  Cc: Christoph Lameter, Andrea Arcangeli, Robin Holt, Avi Kivity,
	Izik Eidus, kvm-devel, Peter Zijlstra, linux-kernel, linux-mm,
	daniel.blueman

On Thu, Jan 31, 2008 at 07:56:12PM -0600, Jack Steiner wrote:
> > @@ -2033,6 +2034,7 @@ void exit_mmap(struct mm_struct *mm)
> >  	unsigned long end;
> >  
> >  	/* mm's last user has gone, and its about to be pulled down */
> > +	mmu_notifier(invalidate_all, mm, 0);
> >  	arch_exit_mmap(mm);
> >  
> 
> The name of the "invalidate_all" callout is not very descriptive.
> Why not use "exit_mmap". That matches the function name, the arch callout
> and better describes what is happening.

This got removed in a later patch.  We now only do the release.

Thanks,
Robin

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] mmu notifiers #v5
  2008-02-01  2:23         ` Robin Holt
@ 2008-02-01  2:26           ` Christoph Lameter
  0 siblings, 0 replies; 63+ messages in thread
From: Christoph Lameter @ 2008-02-01  2:26 UTC (permalink / raw)
  To: Robin Holt
  Cc: Andrea Arcangeli, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

On Thu, 31 Jan 2008, Robin Holt wrote:

> > Mutex locking? Could you be more specific?
> 
> I think he is talking about the external locking that xpmem will need
> to do to ensure we are not able to refault pages inside of regions that
> are undergoing recall/page table clearing.  At least that has been my
> understanding to this point.

Right, this has to be something like an rw spinlock. It's needed for both
GRU/XPmem. Not sure about KVM.

Take the read lock for invalidate operations. These can occur 
concurrently. (Or a simpler implementation for the GRU may just use a 
spinlock).

The write lock must be held for populate operations.

Lock can be refined as needed by the notifier driver. F.e. locking could 
be restricted to certain ranges.
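
A rough sketch of what I mean, assuming the sleeping (atomic == 0) case;
all names here are made up and a real driver can obviously refine this:

static DECLARE_RWSEM(xpm_range_sem);	/* hypothetical driver-wide lock */

static void xpm_invalidate_range_begin(struct mmu_notifier *mn,
				       struct mm_struct *mm,
				       unsigned long start, unsigned long end,
				       int atomic)
{
	down_read(&xpm_range_sem);		/* invalidates may run concurrently */
	xpm_recall_remote_ptes(mn, start, end);	/* hypothetical helper */
}

static void xpm_invalidate_range_end(struct mmu_notifier *mn,
				     struct mm_struct *mm, int atomic)
{
	up_read(&xpm_range_sem);
}

static struct page *xpm_populate(struct mmu_notifier *mn, unsigned long addr)
{
	struct page *page;

	down_write(&xpm_range_sem);	/* no new references while invalidates run */
	page = xpm_make_remote_pte(mn, addr);	/* hypothetical helper */
	up_write(&xpm_range_sem);
	return page;
}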


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 1/3] mmu_notifier: Core code
  2008-01-31  4:57 ` [patch 1/3] mmu_notifier: Core code Christoph Lameter
  2008-02-01  1:56   ` Jack Steiner
@ 2008-02-01  2:31   ` Robin Holt
  2008-02-01  2:39     ` Christoph Lameter
  2008-02-01  3:52   ` Robin Holt
  2 siblings, 1 reply; 63+ messages in thread
From: Robin Holt @ 2008-02-01  2:31 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

Christoph,

Jack has repeatedly pointed out needing an unregister outside the
mmap_sem.  I still don't see the benefit to not having the lock in the mm.

Thanks,
Robin

Index: mmu_notifiers-cl-v4/include/linux/mm_types.h
===================================================================
--- mmu_notifiers-cl-v4.orig/include/linux/mm_types.h	2008-01-31 19:56:08.000000000 -0600
+++ mmu_notifiers-cl-v4/include/linux/mm_types.h	2008-01-31 19:56:58.000000000 -0600
@@ -155,6 +155,7 @@ struct vm_area_struct {
 
 struct mmu_notifier_head {
 #ifdef CONFIG_MMU_NOTIFIER
+	spinlock_t	list_lock;
 	struct hlist_head head;
 #endif
 };
Index: mmu_notifiers-cl-v4/mm/mmu_notifier.c
===================================================================
--- mmu_notifiers-cl-v4.orig/mm/mmu_notifier.c	2008-01-31 19:56:08.000000000 -0600
+++ mmu_notifiers-cl-v4/mm/mmu_notifier.c	2008-01-31 20:26:31.000000000 -0600
@@ -60,25 +60,19 @@ int mmu_notifier_age_page(struct mm_stru
  * Note that all notifiers use RCU. The updates are only guaranteed to be
  * visible to other processes after a RCU quiescent period!
  */
-void __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
-{
-	hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
-}
-EXPORT_SYMBOL_GPL(__mmu_notifier_register);
-
 void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
 {
-	down_write(&mm->mmap_sem);
-	__mmu_notifier_register(mn, mm);
-	up_write(&mm->mmap_sem);
+	spin_lock(&mm->mmu_notifier.list_lock);
+	hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
+	spin_unlock(&mm->mmu_notifier.list_lock);
 }
 EXPORT_SYMBOL_GPL(mmu_notifier_register);
 
 void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
 {
-	down_write(&mm->mmap_sem);
+	spin_lock(&mm->mmu_notifier.list_lock);
 	hlist_del_rcu(&mn->hlist);
-	up_write(&mm->mmap_sem);
+	spin_unlock(&mm->mmu_notifier.list_lock);
 }
 EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 1/3] mmu_notifier: Core code
  2008-02-01  2:24     ` Robin Holt
@ 2008-02-01  2:37       ` Jack Steiner
  2008-02-01  2:39         ` Christoph Lameter
  0 siblings, 1 reply; 63+ messages in thread
From: Jack Steiner @ 2008-02-01  2:37 UTC (permalink / raw)
  To: Robin Holt
  Cc: Christoph Lameter, Andrea Arcangeli, Avi Kivity, Izik Eidus,
	kvm-devel, Peter Zijlstra, linux-kernel, linux-mm,
	daniel.blueman

On Thu, Jan 31, 2008 at 08:24:44PM -0600, Robin Holt wrote:
> On Thu, Jan 31, 2008 at 07:56:12PM -0600, Jack Steiner wrote:
> > > @@ -2033,6 +2034,7 @@ void exit_mmap(struct mm_struct *mm)
> > >  	unsigned long end;
> > >  
> > >  	/* mm's last user has gone, and its about to be pulled down */
> > > +	mmu_notifier(invalidate_all, mm, 0);
> > >  	arch_exit_mmap(mm);
> > >  
> > 
> > The name of the "invalidate_all" callout is not very descriptive.
> > Why not use "exit_mmap". That matches the function name, the arch callout
> > and better describes what is happening.
> 
> This got removed in a later patch.  We now only do the release.

Found it, thanks.

Christoph, is it time to post a new series of patches? I've got
as many fixup patches as I have patches in the original posting.



-- jack


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: mmu_notifier: invalidate_range for move_page_tables
  2008-02-01  1:57           ` mmu_notifier: invalidate_range for move_page_tables Christoph Lameter
@ 2008-02-01  2:38             ` Robin Holt
  2008-02-01  2:41               ` Christoph Lameter
  0 siblings, 1 reply; 63+ messages in thread
From: Robin Holt @ 2008-02-01  2:38 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

On Thu, Jan 31, 2008 at 05:57:25PM -0800, Christoph Lameter wrote:
> Move page tables also needs to invalidate the external references
> and hold new references off while moving page table entries.

I must admit to not having spent any time thinking about this, but aren't
we moving the entries from one set of page tables to the other, leaving
the pte_t entries unchanged?  I guess I should go look, but could you
provide a quick pointer in the proper direction as to why we need to
recall externals when the before and after views of these page tables
will have the same information for the TLBs?

Thanks,
Robin

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 1/3] mmu_notifier: Core code
  2008-02-01  2:31   ` Robin Holt
@ 2008-02-01  2:39     ` Christoph Lameter
  2008-02-01  2:47       ` Robin Holt
  2008-02-01  3:01       ` Jack Steiner
  0 siblings, 2 replies; 63+ messages in thread
From: Christoph Lameter @ 2008-02-01  2:39 UTC (permalink / raw)
  To: Robin Holt
  Cc: Andrea Arcangeli, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

On Thu, 31 Jan 2008, Robin Holt wrote:

> Jack has repeatedly pointed out needing an unregister outside the
> mmap_sem.  I still don't see the benefit to not having the lock in the mm.

I never understood why this would be needed. ->release removes the 
mmu_notifier right now.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 1/3] mmu_notifier: Core code
  2008-02-01  2:37       ` Jack Steiner
@ 2008-02-01  2:39         ` Christoph Lameter
  0 siblings, 0 replies; 63+ messages in thread
From: Christoph Lameter @ 2008-02-01  2:39 UTC (permalink / raw)
  To: Jack Steiner
  Cc: Robin Holt, Andrea Arcangeli, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, linux-kernel, linux-mm, daniel.blueman

On Thu, 31 Jan 2008, Jack Steiner wrote:

> Christoph, is it time to post a new series of patches? I've got
> as many fixup patches as I have patches in the original posting.

Maybe wait another day? This is getting a bit too frequent and so far we 
have only minor changes.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: mmu_notifier: invalidate_range for move_page_tables
  2008-02-01  2:38             ` Robin Holt
@ 2008-02-01  2:41               ` Christoph Lameter
  0 siblings, 0 replies; 63+ messages in thread
From: Christoph Lameter @ 2008-02-01  2:41 UTC (permalink / raw)
  To: Robin Holt
  Cc: Andrea Arcangeli, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

On Thu, 31 Jan 2008, Robin Holt wrote:

> On Thu, Jan 31, 2008 at 05:57:25PM -0800, Christoph Lameter wrote:
> > Move page tables also needs to invalidate the external references
> > and hold new references off while moving page table entries.
> 
> I must admit to not having spent any time thinking about this, but aren't
> we moving the entries from one set of page tables to the other, leaving
> the pte_t entries unchanged?  I guess I should go look, but could you
> provide a quick pointer in the proper direction as to why we need to
> recall externals when the before and after views of these page tables
> will have the same information for the TLBs?

mremap changes the address of pages in a process. The pages appear at
another address. Thus the external ptes will have the wrong information if
not invalidated.

Do a

man mremap



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 1/3] mmu_notifier: Core code
  2008-02-01  2:39     ` Christoph Lameter
@ 2008-02-01  2:47       ` Robin Holt
  2008-02-01  3:01         ` Christoph Lameter
  2008-02-01  3:01       ` Jack Steiner
  1 sibling, 1 reply; 63+ messages in thread
From: Robin Holt @ 2008-02-01  2:47 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Andrea Arcangeli, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

On Thu, Jan 31, 2008 at 06:39:19PM -0800, Christoph Lameter wrote:
> On Thu, 31 Jan 2008, Robin Holt wrote:
> 
> > Jack has repeatedly pointed out needing an unregister outside the
> > mmap_sem.  I still don't see the benefit to not having the lock in the mm.
> 
> I never understood why this would be needed. ->release removes the 
> mmu_notifier right now.

Both xpmem and GRU have means of removing their context separate from
process termination.  XPMEM's is by closing the fd, and I believe GRU's is
the same.  In the case of XPMEM, we are able to acquire the mmap_sem.
For GRU, I don't think it is possible, but I do not remember the exact
reason.

Thanks,
Robin

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 1/3] mmu_notifier: Core code
  2008-02-01  2:39     ` Christoph Lameter
  2008-02-01  2:47       ` Robin Holt
@ 2008-02-01  3:01       ` Jack Steiner
  2008-02-01  3:03         ` Christoph Lameter
  1 sibling, 1 reply; 63+ messages in thread
From: Jack Steiner @ 2008-02-01  3:01 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Andrea Arcangeli, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, linux-kernel, linux-mm, daniel.blueman

On Thu, Jan 31, 2008 at 06:39:19PM -0800, Christoph Lameter wrote:
> On Thu, 31 Jan 2008, Robin Holt wrote:
> 
> > Jack has repeatedly pointed out needing an unregister outside the
> > mmap_sem.  I still don't see the benefit to not having the lock in the mm.
> 
> I never understood why this would be needed. ->release removes the 
> mmu_notifier right now.

Christoph -

We discussed this earlier this week. Here is part of the mail:

------------

> > There currently is no __mmu_notifier_unregister(). Oversight???
>
> No need. mmu_notifier_release implies an unregister and I think that is
> the most favored way to release resources since it deals with the RCU
> quiescent period.


I currently unlink the mmu_notifier when the last GRU mapping is closed. For
example, if a user does a:

        gru_create_context();
        ...
        gru_destroy_context();

the mmu_notifier is unlinked and all task tables allocated
by the driver are freed. Are you suggesting that I leave tables
allocated until the task terminates??

Why is that better? What problem do I cause by trying
to free tables as soon as they are not needed?


-----------------------------------------------

> Christoph responded:
> > the mmu_notifier is unlinked and all task tables allocated
> > by the driver are freed. Are you suggesting that I leave tables
> > allocated until the task terminates??
>
> You need to leave the mmu_notifier structure allocated until the next
> quiescent rcu period unless you use the release notifier.

I assumed that I would need to use call_rcu() or synchronize_rcu()
before the table is actually freed. That's still on my TODO list.



--- jack

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 1/3] mmu_notifier: Core code
  2008-02-01  2:47       ` Robin Holt
@ 2008-02-01  3:01         ` Christoph Lameter
  0 siblings, 0 replies; 63+ messages in thread
From: Christoph Lameter @ 2008-02-01  3:01 UTC (permalink / raw)
  To: Robin Holt
  Cc: Andrea Arcangeli, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

On Thu, 31 Jan 2008, Robin Holt wrote:

> Both xpmem and GRU have means of removing their context separate from
> process termination.  XPMEM's is by closing the fd, and I believe GRU's is
> the same.  In the case of XPMEM, we are able to acquire the mmap_sem.
> For GRU, I don't think it is possible, but I do not remember the exact
> reason.

For any action initiated from user space you will not hold mmap_sem, so
you can call the unregister function. Then you need to do a
synchronize_rcu before freeing the structures.

It is also possible to shut this down outside via f.e. a control thread. 
The control thread can acquire mmap_sem and then unregister the notifier.

Am I missing something?


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 1/3] mmu_notifier: Core code
  2008-02-01  3:01       ` Jack Steiner
@ 2008-02-01  3:03         ` Christoph Lameter
  0 siblings, 0 replies; 63+ messages in thread
From: Christoph Lameter @ 2008-02-01  3:03 UTC (permalink / raw)
  To: Jack Steiner
  Cc: Robin Holt, Andrea Arcangeli, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, linux-kernel, linux-mm, daniel.blueman

On Thu, 31 Jan 2008, Jack Steiner wrote:

> I currently unlink the mmu_notifier when the last GRU mapping is closed. For
> example, if a user does a:
> 
>         gru_create_context();
>         ...
>         gru_destroy_context();
> 
> the mmu_notifier is unlinked and all task tables allocated
> by the driver are freed. Are you suggesting that I leave tables
> allocated until the task terminates??

You are in user space and calling into the kernel somehow. The 
mmap_sem is not held at that point so its no trouble to use the unregister 
function. After that wait for rcu and then free your tables.

> I assumed that I would need to use call_rcu() or synchronize_rcu()
> before the table is actually freed. That's still on my TODO list.

Right.
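
In other words, roughly this teardown order (a sketch only; the context
structure and helpers are made-up names, not the real GRU code):

struct sketch_ctx {
	struct mmu_notifier mn;
	struct mm_struct *mm;
	/* ... driver task tables ... */
};

static void sketch_destroy_context(struct sketch_ctx *ctx)
{
	/* called from an ioctl/close path, so mmap_sem is not held here */
	mmu_notifier_unregister(&ctx->mn, ctx->mm);

	/* wait for any notifier traversal still inside an RCU read section */
	synchronize_rcu();

	sketch_free_task_tables(ctx);	/* hypothetical helper */
	kfree(ctx);
}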


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 1/3] mmu_notifier: Core code
  2008-01-31  4:57 ` [patch 1/3] mmu_notifier: Core code Christoph Lameter
  2008-02-01  1:56   ` Jack Steiner
  2008-02-01  2:31   ` Robin Holt
@ 2008-02-01  3:52   ` Robin Holt
  2008-02-01  3:58     ` Christoph Lameter
  2 siblings, 1 reply; 63+ messages in thread
From: Robin Holt @ 2008-02-01  3:52 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

> Index: linux-2.6/include/linux/mmu_notifier.h
...
> +struct mmu_notifier_ops {
...
> +	/*
> +	 * invalidate_range_begin() and invalidate_range_end() must be paired.
> +	 * Multiple invalidate_range_begin/ends may be nested or called
> +	 * concurrently. That is legit. However, no new external references
> +	 * may be established as long as any invalidate_xxx is running or
> +	 * any invalidate_range_begin() and has not been completed through a
> +	 * corresponding call to invalidate_range_end().
> +	 *
> +	 * Locking within the notifier needs to serialize events correspondingly.
> +	 *
> +	 * If all invalidate_xx notifier calls take a driver lock then it is possible
> +	 * to run follow_page() under the same lock. The lock can then guarantee
> +	 * that no page is removed. That way we can avoid increasing the refcount
> +	 * of the pages.
> +	 *
> +	 * invalidate_range_begin() must clear all references in the range
> +	 * and stop the establishment of new references.
> +	 *
> +	 * invalidate_range_end() reenables the establishment of references.
> +	 *
> +	 * atomic indicates that the function is called in an atomic context.
> +	 * We can sleep if atomic == 0.
> +	 */
> +	void (*invalidate_range_begin)(struct mmu_notifier *mn,
> +				 struct mm_struct *mm,
> +				 unsigned long start, unsigned long end,
> +				 int atomic);
> +
> +	void (*invalidate_range_end)(struct mmu_notifier *mn,
> +				 struct mm_struct *mm, int atomic);

I think we need to pass in the same start-end here as well.  Without it,
the first invalidate_range would have to block faulting for all addresses
and would need to remain blocked until the last invalidate_range has
completed.  While this would work, (and will probably be how we implement
it for the short term), it is far from ideal.

Thanks,
Robin

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 1/3] mmu_notifier: Core code
  2008-02-01  3:52   ` Robin Holt
@ 2008-02-01  3:58     ` Christoph Lameter
  2008-02-01  4:15       ` Robin Holt
  2008-02-03  1:33       ` Andrea Arcangeli
  0 siblings, 2 replies; 63+ messages in thread
From: Christoph Lameter @ 2008-02-01  3:58 UTC (permalink / raw)
  To: Robin Holt
  Cc: Andrea Arcangeli, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

On Thu, 31 Jan 2008, Robin Holt wrote:

> > +	void (*invalidate_range_end)(struct mmu_notifier *mn,
> > +				 struct mm_struct *mm, int atomic);
> 
> I think we need to pass in the same start-end here as well.  Without it,
> the first invalidate_range would have to block faulting for all addresses
> and would need to remain blocked until the last invalidate_range has
> completed.  While this would work, (and will probably be how we implement
> it for the short term), it is far from ideal.

Ok. Andrea wanted the same because then he can avoid the begin callouts.

The problem is that you would have to track the start-end addresses, right?


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 1/3] mmu_notifier: Core code
  2008-02-01  3:58     ` Christoph Lameter
@ 2008-02-01  4:15       ` Robin Holt
  2008-02-03  1:33       ` Andrea Arcangeli
  1 sibling, 0 replies; 63+ messages in thread
From: Robin Holt @ 2008-02-01  4:15 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Andrea Arcangeli, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

On Thu, Jan 31, 2008 at 07:58:40PM -0800, Christoph Lameter wrote:
> On Thu, 31 Jan 2008, Robin Holt wrote:
> 
> > > +	void (*invalidate_range_end)(struct mmu_notifier *mn,
> > > +				 struct mm_struct *mm, int atomic);
> > 
> > I think we need to pass in the same start-end here as well.  Without it,
> > the first invalidate_range would have to block faulting for all addresses
> > and would need to remain blocked until the last invalidate_range has
> > completed.  While this would work, (and will probably be how we implement
> > it for the short term), it is far from ideal.
> 
> Ok. Andrea wanted the same because then he can avoid the begin callouts.
> 
> The problem is that you would have to track the start-end addresses, right?

Yep.  We will probably not do that in the next week, but I would expect
to have that working before we submit xpmem again.  We will probably
just chain them up in a regular linked list.
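
Roughly the kind of thing I have in mind (just a sketch with made-up names,
error handling omitted); with start/end also passed to _end, the end
callout can drop exactly the matching entry instead of blocking all faults:

struct sketch_active_range {
	struct list_head list;
	unsigned long start, end;
};

static LIST_HEAD(sketch_active);
static DEFINE_SPINLOCK(sketch_lock);

static void sketch_range_begin(unsigned long start, unsigned long end)
{
	struct sketch_active_range *r = kmalloc(sizeof(*r), GFP_KERNEL);

	r->start = start;
	r->end = end;
	spin_lock(&sketch_lock);
	list_add(&r->list, &sketch_active);
	spin_unlock(&sketch_lock);
}

static void sketch_range_end(unsigned long start, unsigned long end)
{
	struct sketch_active_range *r;

	spin_lock(&sketch_lock);
	list_for_each_entry(r, &sketch_active, list) {
		if (r->start == start && r->end == end) {
			list_del(&r->list);
			kfree(r);
			break;
		}
	}
	spin_unlock(&sketch_lock);
	/* the fault path retries while any overlapping entry is on the list */
}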

Thanks,
Robin

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 2/3] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-31  4:57 ` [patch 2/3] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter
  2008-01-31 12:31   ` Andrea Arcangeli
@ 2008-02-01  4:24   ` Robin Holt
  2008-02-01  4:43     ` Christoph Lameter
  1 sibling, 1 reply; 63+ messages in thread
From: Robin Holt @ 2008-02-01  4:24 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

> Index: linux-2.6/mm/memory.c
...
> @@ -1668,6 +1678,7 @@ gotten:
>  		page_cache_release(old_page);
>  unlock:
>  	pte_unmap_unlock(page_table, ptl);
> +	mmu_notifier(invalidate_range_end, mm, 0);

I think we can get an _end call without the _begin call before it.

Thanks,
Robin

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 2/3] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-01  4:24   ` [patch 2/3] mmu_notifier: Callbacks to invalidate address ranges Robin Holt
@ 2008-02-01  4:43     ` Christoph Lameter
  2008-02-01 10:32       ` Robin Holt
  0 siblings, 1 reply; 63+ messages in thread
From: Christoph Lameter @ 2008-02-01  4:43 UTC (permalink / raw)
  To: Robin Holt
  Cc: Andrea Arcangeli, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

On Thu, 31 Jan 2008, Robin Holt wrote:

> > Index: linux-2.6/mm/memory.c
> ...
> > @@ -1668,6 +1678,7 @@ gotten:
> >  		page_cache_release(old_page);
> >  unlock:
> >  	pte_unmap_unlock(page_table, ptl);
> > +	mmu_notifier(invalidate_range_end, mm, 0);
> 
> I think we can get an _end call without the _begin call before it.

If that were true then the pte would also have been left locked.

We always hit unlock. Maybe I just do not see it?


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 2/3] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-01  4:43     ` Christoph Lameter
@ 2008-02-01 10:32       ` Robin Holt
  2008-02-01 10:37         ` Robin Holt
  2008-02-01 19:13         ` Christoph Lameter
  0 siblings, 2 replies; 63+ messages in thread
From: Robin Holt @ 2008-02-01 10:32 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Andrea Arcangeli, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

On Thu, Jan 31, 2008 at 08:43:58PM -0800, Christoph Lameter wrote:
> On Thu, 31 Jan 2008, Robin Holt wrote:
> 
> > > Index: linux-2.6/mm/memory.c
> > ...
> > > @@ -1668,6 +1678,7 @@ gotten:
> > >  		page_cache_release(old_page);
> > >  unlock:
> > >  	pte_unmap_unlock(page_table, ptl);
> > > +	mmu_notifier(invalidate_range_end, mm, 0);
> > 
> > I think we can get an _end call without the _begin call before it.
> 
> If that were true then the pte would also have been left locked.
>
> We always hit unlock. Maybe I just do not see it?

Maybe I haven't looked closely enough, but let's start with some common
assumptions.  Looking at do_wp_page from 2.6.24 (I believe that is what
my work area is based upon).  On line 1559, the function begins being
declared.

On lines 1614 and 1630, we do "goto unlock" where the _end callout is
soon made.  The _begin callout does not come until after those branches
have been taken (occurs on line 1648).

Thanks,
Robin

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 2/3] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-01 10:32       ` Robin Holt
@ 2008-02-01 10:37         ` Robin Holt
  2008-02-01 19:13         ` Christoph Lameter
  1 sibling, 0 replies; 63+ messages in thread
From: Robin Holt @ 2008-02-01 10:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Andrea Arcangeli, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

On Fri, Feb 01, 2008 at 04:32:21AM -0600, Robin Holt wrote:
> On Thu, Jan 31, 2008 at 08:43:58PM -0800, Christoph Lameter wrote:
> > On Thu, 31 Jan 2008, Robin Holt wrote:
> > 
> > > > Index: linux-2.6/mm/memory.c
> > > ...
> > > > @@ -1668,6 +1678,7 @@ gotten:
> > > >  		page_cache_release(old_page);
> > > >  unlock:
> > > >  	pte_unmap_unlock(page_table, ptl);
> > > > +	mmu_notifier(invalidate_range_end, mm, 0);
> > > 
> > > I think we can get an _end call without the _begin call before it.
> > 
> > If that were true then the pte would also have been left locked.
> >
> > We always hit unlock. Maybe I just do not see it?
> 
> Maybe I haven't looked closely enough, but let's start with some common
> assumptions.  Looking at do_wp_page from 2.6.24 (I believe that is what
> my work area is based upon).  On line 1559, the function begins being
> declared.
> 
> On lines 1614 and 1630, we do "goto unlock" where the _end callout is
> soon made.  The _begin callout does not come until after those branches
> have been taken (occurs on line 1648).
> 
> Thanks,
> Robin

Ignore this thread, I am going to throw a patch against the new version.

Thanks,
Robin

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] mmu notifiers #v5
  2008-02-01  1:37       ` Christoph Lameter
  2008-02-01  2:23         ` Robin Holt
@ 2008-02-01 12:00         ` Andrea Arcangeli
  1 sibling, 0 replies; 63+ messages in thread
From: Andrea Arcangeli @ 2008-02-01 12:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Thu, Jan 31, 2008 at 05:37:21PM -0800, Christoph Lameter wrote:
> On Fri, 1 Feb 2008, Andrea Arcangeli wrote:
> 
> > I appreciate the review! I hope my entirely bug free and
> > strightforward #v5 will strongly increase the probability of getting
> > this in sooner than later. If something else it shows the approach I
> > prefer to cover GRU/KVM 100%, leaving the overkill mutex locking
> > requirements only to the mmu notifier users that can't deal with the
> > scalar and finegrined and already-taken/trashed PT lock.
> 
> Mutex locking? Could you be more specific?

Robin's understanding was correct on this point. Also I think if you add
start,end to range_end (as suggested a few times by me and once by
Robin) I may get away without a lock thanks to the page pin. That's
one strong reason why start,end are needed in range_end.

The only one that will definitely be forced to add a new lock around
follow_page, in addition to the very _scalable_ PT lock, is GRU.

> I hope you will continue to do reviews of the other mmu notifier patchset?

Sure. I will also create a new KVM patch to plug on top of your
versions (I already did for V2/V3 even if it never worked, but that
may just be a build problem, see kvm-devel for details). To be clear,
as long as anything is merged that saves me from using kprobes to make
KVM swap work, I'm satisfied. I thought my approach had several
advantages in simplicity, and in being more obviously safe (in not
relying entirely on the page pin and creating a window of time where
writes through the linux pte go to one page and writes through the
sptes go to another one, remember populate_range), and it avoids
introducing external locks in the GRU follow_page path. My approach
was supposed to be in addition to the range_start/end needed by xpmem,
which can't possibly take advantage of the scalar PT lock and
definitely requires an external lock to serialize the xpmem fault
against the linux pagefault (not the case for GRU/KVM).

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] mmu notifiers #v5
  2008-02-01  1:44         ` Christoph Lameter
@ 2008-02-01 12:09           ` Andrea Arcangeli
  2008-02-01 19:23             ` Christoph Lameter
  0 siblings, 1 reply; 63+ messages in thread
From: Andrea Arcangeli @ 2008-02-01 12:09 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Thu, Jan 31, 2008 at 05:44:24PM -0800, Christoph Lameter wrote:
> The trouble is that the invalidates are much more expensive if you have to 
> send theses to remote partitions (XPmem). And its really great if you can 
> simple tear down everything. Certainly this is a significant improvement 
> over the earlier approach but you still have the invalidate_page calls in 
> ptep_clear_flush. So they fire needlessly?

Dunno, they certainly fire more frequently than yours, even _pages
fires more frequently than range_start,end but don't forget why!
That's because I've a different spinlock for every 512
ptes/4k-grub-tlbs that are being invalidated... So it pays off in
scalability. I'm unsure if gru could play tricks with your patch, to
still allow faults to happen in parallel if they're on virtual
addresses not in the same 2M naturally aligned chunk.

> Serializing access in the device driver makes sense and comes with 
> additional possiblity of not having to increment page counts all the time. 
> So you trade one cacheline dirtying for many that are necessary if you 
> always increment the page count.

Note that my #v5 doesn't require increasing the page count all the
time, so GRU will work fine with #v5.

See this comment in my patch:

    /*
     * invalidate_page[s] is called in atomic context
     * after any pte has been updated and before
     * dropping the PT lock required to update any Linux pte.
     * Once the PT lock will be released the pte will have its
     * final value to export through the secondary MMU.
     * Before this is invoked any secondary MMU is still ok
     * to read/write to the page previously pointed by the
     * Linux pte because the old page hasn't been freed yet.
     * If required set_page_dirty has to be called internally
     * to this method.
     */


invalidate_page[s] is always called before the page is freed. This
will require modifications to the tlb flushing code logic to take
advantage of _pages in certain places. For now it's just safe.

> How does KVM insure the consistency of the shadow page tables? Atomic ops?

A per-VM mmu_lock spinlock is taken to serialize the access, plus
atomic ops for the cpu.
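
Schematically (illustration only, not code from the actual KVM patch; the
helper below is invented):

static void kvm_notifier_invalidate_page(struct kvm *kvm, unsigned long hva)
{
        /* every spte update is serialized by the per-VM spinlock */
        spin_lock(&kvm->mmu_lock);
        kvm_rmap_zap_sptes_for_hva(kvm, hva);   /* hypothetical helper */
        spin_unlock(&kvm->mmu_lock);
}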

> The GRU has no page table on its own. It populates TLB entries on demand 
> using the linux page table. There is no way it can figure out when to 
> drop page counts again. The invalidate calls are turned directly into tlb 
> flushes.

Yes, this is why it can't serialize follow_page with only the PT lock
with your patch. KVM may do it once you add start,end to range_end
only thanks to the additional pin on the page.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 2/3] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-01 10:32       ` Robin Holt
  2008-02-01 10:37         ` Robin Holt
@ 2008-02-01 19:13         ` Christoph Lameter
  1 sibling, 0 replies; 63+ messages in thread
From: Christoph Lameter @ 2008-02-01 19:13 UTC (permalink / raw)
  To: Robin Holt
  Cc: Andrea Arcangeli, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

On Fri, 1 Feb 2008, Robin Holt wrote:

> Maybe I haven't looked closely enough, but let's start with some common
> assumptions.  Looking at do_wp_page from 2.6.24 (I believe that is what
> my work area is based upon).  On line 1559, the function begins being
> declared.

Aah I looked at the wrong file.

> On lines 1614 and 1630, we do "goto unlock" where the _end callout is
> soon made.  The _begin callout does not come until after those branches
> have been taken (occurs on line 1648).

There are actually two cases...

---
 mm/memory.c |   11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2008-02-01 11:04:21.000000000 -0800
+++ linux-2.6/mm/memory.c	2008-02-01 11:12:12.000000000 -0800
@@ -1611,8 +1611,10 @@ static int do_wp_page(struct mm_struct *
 			page_table = pte_offset_map_lock(mm, pmd, address,
 							 &ptl);
 			page_cache_release(old_page);
-			if (!pte_same(*page_table, orig_pte))
-				goto unlock;
+			if (!pte_same(*page_table, orig_pte)) {
+				pte_unmap_unlock(page_table, ptl);
+				goto check_dirty;
+			}
 
 			page_mkwrite = 1;
 		}
@@ -1628,7 +1630,8 @@ static int do_wp_page(struct mm_struct *
 		if (ptep_set_access_flags(vma, address, page_table, entry,1))
 			update_mmu_cache(vma, address, entry);
 		ret |= VM_FAULT_WRITE;
-		goto unlock;
+		pte_unmap_unlock(page_table, ptl);
+		goto check_dirty;
 	}
 
 	/*
@@ -1684,10 +1687,10 @@ gotten:
 		page_cache_release(new_page);
 	if (old_page)
 		page_cache_release(old_page);
-unlock:
 	pte_unmap_unlock(page_table, ptl);
 	mmu_notifier(invalidate_range_end, mm,
 				address, address + PAGE_SIZE, 0);
+check_dirty:
 	if (dirty_page) {
 		if (vma->vm_file)
 			file_update_time(vma->vm_file);

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] mmu notifiers #v5
  2008-02-01 12:09           ` Andrea Arcangeli
@ 2008-02-01 19:23             ` Christoph Lameter
  2008-02-03  2:17               ` Andrea Arcangeli
  0 siblings, 1 reply; 63+ messages in thread
From: Christoph Lameter @ 2008-02-01 19:23 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Fri, 1 Feb 2008, Andrea Arcangeli wrote:

> Note that my #v5 doesn't require increasing the page count all the
> time, so GRU will work fine with #v5.

But that comes with the cost of firing invalidate_page for every page 
being evicted. In order to make your single invalidate_range work without 
it you need to hold a refcount on the page.

> invalidate_page[s] is always called before the page is freed. This
> will require modifications to the tlb flushing code logic to take
> advantage of _pages in certain places. For now it's just safe.

Yes so your invalidate_range is still some sort of dysfunctional 
optimization? Gazillions of invalidate_page's will have to be executed 
when tearing down large memory areas.

> > How does KVM insure the consistency of the shadow page tables? Atomic ops?
> 
> A per-VM mmu_lock spinlock is taken to serialize the access, plus
> atomic ops for the cpu.

And that would not be enough to hold off new references? With small tweaks 
this should work with a common scheme. We could also redefine the role 
of _start and _end slightly to just require that the refs are removed when 
_end completes. That would allow the KVM page count ref to work as is now 
and would avoid the individual invalidate_page() callouts.
 
> > The GRU has no page table on its own. It populates TLB entries on demand 
> > using the linux page table. There is no way it can figure out when to 
> > drop page counts again. The invalidate calls are turned directly into tlb 
> > flushes.
> 
> Yes, this is why it can't serialize follow_page with only the PT lock
> with your patch. KVM may do it once you add start,end to range_end
> only thanks to the additional pin on the page.

Right but that pin requires taking a refcount which we cannot do.

Frankly this looks as if this is a solution that would work only for KVM.
 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 1/3] mmu_notifier: Core code
  2008-02-01  3:58     ` Christoph Lameter
  2008-02-01  4:15       ` Robin Holt
@ 2008-02-03  1:33       ` Andrea Arcangeli
  2008-02-04 19:13         ` Christoph Lameter
  1 sibling, 1 reply; 63+ messages in thread
From: Andrea Arcangeli @ 2008-02-03  1:33 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Thu, Jan 31, 2008 at 07:58:40PM -0800, Christoph Lameter wrote:
> Ok. Andrea wanted the same because then he can avoid the begin callouts.

Exactly. I hope the page-pin will avoid me having to serialize the KVM
page fault against the start/end critical section.

BTW, I wonder if the start/end critical section API is intended to
forbid scheduling inside it. In short I wonder if GRU is allowed
to take a spinlock in _range_start as the last thing before returning,
and to release that same spinlock in _range_end as the first thing, and
not to be forced to use a mutex.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] mmu notifiers #v5
  2008-02-01 19:23             ` Christoph Lameter
@ 2008-02-03  2:17               ` Andrea Arcangeli
  2008-02-03  3:14                 ` Jack Steiner
  2008-02-04 19:09                 ` Christoph Lameter
  0 siblings, 2 replies; 63+ messages in thread
From: Andrea Arcangeli @ 2008-02-03  2:17 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Fri, Feb 01, 2008 at 11:23:57AM -0800, Christoph Lameter wrote:
> Yes so your invalidate_range is still some sort of dysfunctional 
> optimization? Gazillions of invalidate_page's will have to be executed 
> when tearing down large memory areas.

I don't know if gru can flush the external TLB references with a
single instruction like the cpu can do by overwriting %cr3. As far as
vmx/svm are concerned, you got to do some fixed amount of work on the
rmap structures and each spte, for each 4k invalidated regardless of
page/pages/range/ranges. You can only share the cost of the lock and
of the memslot lookup, so you lookup and take the lock once and you
drop 512 sptes instead of just 1. Similar to what we do in the main
linux VM by taking the PT lock for every 512 sptes.

So you worry about gazillions of invalidate_page without taking into
account that even if you call invalidate_range_end(0, -1ULL) KVM will
have to mangle over 4503599627370496 rmap entries anyway. Yes calling
1 invalidate_range_end instead of 4503599627370496 invalidate_page,
will be faster, but not much faster. For KVM it remains an O(N)
operation, where N is the number of pages. I'm not so sure we need to
worry about my invalidate_pages being limited to 512 entries.
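
(That figure is presumably just the whole 64-bit address space counted in
4k pages: 2^64 / 2^12 = 2^52 = 4503599627370496.)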

Perhaps GRU is different, I'm just describing the KVM situation here.

As far as KVM is concerned it's more sensible to be able to scale when
there are 1024 kvm threads on 1024 cpus and each kvm-vcpu is accessing
a different guest physical page (especially when new chunks of memory
are allocated for the first time) that won't collide the whole time on
naturally aligned 2M chunks of virtual addresses.

> And that would not be enough to hold off new references? With small tweaks 
> this should work with a common scheme. We could also redefine the role 
> of _start and _end slightly to just require that the refs are removed when 
> _end completes. That would allow the KVM page count ref to work as is now 
> and would avoid the individual invalidate_page() callouts.

I can already only use _end and ignore _start, only remaining problem
is this won't be 100% coherent (my patch is 100% coherent thanks to PT
lock implicitly serializing follow_page/get_user_pages of the KVM/GRU
secondary MMU faults). My approach gives better peace of mind with
full scalability and no lock recursion when it's the KVM page fault
that invokes get_user_pages that invokes the linux page fault routines
that will try to execute _start taking the lock to block the page
fault that is already running...

> > > The GRU has no page table on its own. It populates TLB entries on demand 
> > > using the linux page table. There is no way it can figure out when to 
> > > drop page counts again. The invalidate calls are turned directly into tlb 
> > > flushes.
> > 
> > Yes, this is why it can't serialize follow_page with only the PT lock
> > with your patch. KVM may do it once you add start,end to range_end
> > only thanks to the additional pin on the page.
> 
> Right but that pin requires taking a refcount which we cannot do.

GRU can use my patch without the pin. XPMEM obviously can't use my
patch as my invalidate_page[s] are under the PT lock (a feature to fit
GRU/KVM in the simplest way), this is why an incremental patch adding
invalidate_range_start/end would be required to support XPMEM too.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] mmu notifiers #v5
  2008-02-03  2:17               ` Andrea Arcangeli
@ 2008-02-03  3:14                 ` Jack Steiner
  2008-02-03  3:33                   ` Andrea Arcangeli
  2008-02-04 19:09                 ` Christoph Lameter
  1 sibling, 1 reply; 63+ messages in thread
From: Jack Steiner @ 2008-02-03  3:14 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Christoph Lameter, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, linux-kernel, linux-mm, daniel.blueman

On Sun, Feb 03, 2008 at 03:17:04AM +0100, Andrea Arcangeli wrote:
> On Fri, Feb 01, 2008 at 11:23:57AM -0800, Christoph Lameter wrote:
> > Yes so your invalidate_range is still some sort of dysfunctional 
> > optimization? Gazillions of invalidate_page's will have to be executed 
> > when tearing down large memory areas.
> 
> I don't know if gru can flush the external TLB references with a
> single instruction like the cpu can do by overwriting %cr3. As far as

The mechanism obviously is a little different, but the GRU can flush an
external TLB in a single read/write/flush of a cacheline (should take a few
100 nsec). Typically, an application uses a single GRU. Threaded
applications, however, could use one GRU per thread, so it is possible that
multiple TLBs must be flushed. In some cases, the external flush can be
avoided if the GRU is not currently in use by the thread.

Also, most (but not all) applications that use the GRU do not usually do
anything that requires frequent flushing (fortunately). The GRU is intended
for HPC-like applications. These don't usually do frequent map/unmap
operations or anything else that requires a lot of flushes.

I expect that KVM is a lot different.

I have most of the GRU code working with the latest mmuops patch. I still
have a list of loose ends that I'll get to next week. The most important is
the exact handling of the range invalidates. The code that I currently have
works (mostly) but has a few endcases that will cause problems. Once I
finish, I'll be glad to send you snippets of the code (or all of it) if you
would like to take a look.


> vmx/svm are concerned, you got to do some fixed amount of work on the
> rmap structures and each spte, for each 4k invalidated regardless of
> page/pages/range/ranges. You can only share the cost of the lock and
> of the memslot lookup, so you lookup and take the lock once and you
> drop 512 sptes instead of just 1. Similar to what we do in the main
> linux VM by taking the PT lock for every 512 sptes.
> 
> So you worry about gazillions of invalidate_page without taking into
> account that even if you call invalidate_range_end(0, -1ULL) KVM will
> have to mangle over 4503599627370496 rmap entries anyway. Yes calling
> 1 invalidate_range_end instead of 4503599627370496 invalidate_page,
> will be faster, but not much faster. For KVM it remains an O(N)
> operation, where N is the number of pages. I'm not so sure we need to
> worry about my invalidate_pages being limited to 512 entries.
> 
> Perhaps GRU is different, I'm just describing the KVM situation here.
> 
> As far as KVM is concerned it's more sensible to be able to scale when
> there are 1024 kvm threads on 1024 cpus and each kvm-vcpu is accessing
> a different guest physical page (especially when new chunks of memory
> are allocated for the first time) that won't collide the whole time on
> naturally aligned 2M chunks of virtual addresses.
> 
> > And that would not be enough to hold off new references? With small tweaks 
> > this should work with a common scheme. We could also redefine the role 
> > of _start and _end slightly to just require that the refs are removed when 
> > _end completes. That would allow the KVM page count ref to work as is now 
> > and would avoid the individual invalidate_page() callouts.
> 
> I can already only use _end and ignore _start, only remaining problem
> is this won't be 100% coherent (my patch is 100% coherent thanks to PT
> lock implicitly serializing follow_page/get_user_pages of the KVM/GRU
> secondary MMU faults). My approach gives better peace of mind with
> full scalability and no lock recursion when it's the KVM page fault
> that invokes get_user_pages that invokes the linux page fault routines
> that will try to execute _start taking the lock to block the page
> fault that is already running...
> 
> > > > The GRU has no page table on its own. It populates TLB entries on demand 
> > > > using the linux page table. There is no way it can figure out when to 
> > > > drop page counts again. The invalidate calls are turned directly into tlb 
> > > > flushes.
> > > 
> > > Yes, this is why it can't serialize follow_page with only the PT lock
> > > with your patch. KVM may do it once you add start,end to range_end
> > > only thanks to the additional pin on the page.
> > 
> > Right but that pin requires taking a refcount which we cannot do.
> 
> GRU can use my patch without the pin. XPMEM obviously can't use my
> patch as my invalidate_page[s] are under the PT lock (a feature to fit
> GRU/KVM in the simplest way), this is why an incremental patch adding
> invalidate_range_start/end would be required to support XPMEM too.

--- jack

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] mmu notifiers #v5
  2008-02-03  3:14                 ` Jack Steiner
@ 2008-02-03  3:33                   ` Andrea Arcangeli
  0 siblings, 0 replies; 63+ messages in thread
From: Andrea Arcangeli @ 2008-02-03  3:33 UTC (permalink / raw)
  To: Jack Steiner
  Cc: Christoph Lameter, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, linux-kernel, linux-mm, daniel.blueman

On Sat, Feb 02, 2008 at 09:14:57PM -0600, Jack Steiner wrote:
> Also, most (but not all) applications that use the GRU do not usually do
> anything that requires frequent flushing (fortunately). The GRU is intended
> for HPC-like applications. These don't usually do frequent map/unmap
> operations or anything else that requires a lot of flushes.
> 
> I expect that KVM is a lot different.

I don't think so. invalidate_page/pages/range_start,end is a slow and
infrequent path for KVM (or alternatively the ranges are very small, in
which case _range_start/end won't pay off compared to _pages). Whenever
invalidate_page[s] become a fast path, we're generally I/O
bound. get_user_pages is always the fast path instead. I thought it
was much more important that get_user_pages scale as well as it does
now and that the KVM page fault isn't serialized with a mutex, than
any invalidate-side optimization. get_user_pages may run
frequently from all vcpus even if there are no invalidates and no
memory pressure, and I don't mean only during startup.

> I have most of the GRU code working with the latest mmuops patch. I still
> have a list of loose ends that I'll get to next week. The most important is
> the exact handling of the range invalidates. The code that I currently have
> works (mostly) but has a few endcases that will cause problems. Once I
> finish, I'll be glad to send you snippets of the code (or all of it) if you
> would like to take a look.

Sure.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] mmu notifiers #v5
  2008-02-03  2:17               ` Andrea Arcangeli
  2008-02-03  3:14                 ` Jack Steiner
@ 2008-02-04 19:09                 ` Christoph Lameter
  2008-02-05  5:25                   ` Andrea Arcangeli
  1 sibling, 1 reply; 63+ messages in thread
From: Christoph Lameter @ 2008-02-04 19:09 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Sun, 3 Feb 2008, Andrea Arcangeli wrote:

> > Right but that pin requires taking a refcount which we cannot do.
> 
> GRU can use my patch without the pin. XPMEM obviously can't use my
> patch as my invalidate_page[s] are under the PT lock (a feature to fit
> GRU/KVM in the simplest way), this is why an incremental patch adding
> invalidate_range_start/end would be required to support XPMEM too.

Doesnt the kernel in some situations release the page before releasing the 
pte lock? Then there will be an external pte pointing to a page that may 
now have a different use. Its really bad if that pte does allow writes.



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [patch 1/3] mmu_notifier: Core code
  2008-02-03  1:33       ` Andrea Arcangeli
@ 2008-02-04 19:13         ` Christoph Lameter
  0 siblings, 0 replies; 63+ messages in thread
From: Christoph Lameter @ 2008-02-04 19:13 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Sun, 3 Feb 2008, Andrea Arcangeli wrote:

> On Thu, Jan 31, 2008 at 07:58:40PM -0800, Christoph Lameter wrote:
> > Ok. Andrea wanted the same because then he can avoid the begin callouts.
> 
> Exactly. I hope the page-pin will avoid me having to serialize the KVM
> page fault against the start/end critical section.
> 
> BTW, I wonder if the start/end critical section API is intended to
> forbid scheduling inside it. In short I wonder if GRU is allowed
> to take a spinlock in _range_start as the last thing before returning,
> and to release that same spinlock in _range_end as the first thing, and
> not to be forced to use a mutex.

_begin/end encloses code that may sleep and _begin/_end itself may sleep. 
So a semaphore may work but not a spinlock.
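
So for the GRU something like the following should be legal (rough sketch
only; all names here are invented, and the start/end/atomic arguments
assume the extended signature discussed elsewhere in this thread):

#include <linux/rwsem.h>

struct gru_mm_sketch {
        struct mmu_notifier mn;
        struct rw_semaphore fault_sem;
};

static void gru_range_begin(struct mmu_notifier *mn, struct mm_struct *mm,
                            unsigned long start, unsigned long end, int atomic)
{
        struct gru_mm_sketch *gms = container_of(mn, struct gru_mm_sketch, mn);

        down_write(&gms->fault_sem);    /* may sleep; allowed in _begin */
        /* ... flush the external GRU TLB for start..end here ... */
}

static void gru_range_end(struct mmu_notifier *mn, struct mm_struct *mm,
                          unsigned long start, unsigned long end, int atomic)
{
        struct gru_mm_sketch *gms = container_of(mn, struct gru_mm_sketch, mn);

        up_write(&gms->fault_sem);      /* GRU faults may be serviced again */
}

The GRU fault side would then take down_read(&gms->fault_sem) around its
follow_page call (assuming begin/end never nest for the same mm).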
 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] mmu notifiers #v5
  2008-02-04 19:09                 ` Christoph Lameter
@ 2008-02-05  5:25                   ` Andrea Arcangeli
  2008-02-05  6:11                     ` Christoph Lameter
  0 siblings, 1 reply; 63+ messages in thread
From: Andrea Arcangeli @ 2008-02-05  5:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Mon, Feb 04, 2008 at 11:09:01AM -0800, Christoph Lameter wrote:
> On Sun, 3 Feb 2008, Andrea Arcangeli wrote:
> 
> > > Right but that pin requires taking a refcount which we cannot do.
> > 
> > GRU can use my patch without the pin. XPMEM obviously can't use my
> > patch as my invalidate_page[s] are under the PT lock (a feature to fit
> > GRU/KVM in the simplest way), this is why an incremental patch adding
> > invalidate_range_start/end would be required to support XPMEM too.
> 
> Doesnt the kernel in some situations release the page before releasing the 
> pte lock? Then there will be an external pte pointing to a page that may 
> now have a different use. Its really bad if that pte does allow writes.

Sure the kernel does that most of the time, which is for example why I
had to use invalidate_page instead of invalidate_pages inside
zap_pte_range. Zero problems with that (this is also the exact reason
why I mentioned the tlb flushing code would need changes to convert
some page in pages).

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] mmu notifiers #v5
  2008-02-05  5:25                   ` Andrea Arcangeli
@ 2008-02-05  6:11                     ` Christoph Lameter
  2008-02-05 18:08                       ` Andrea Arcangeli
  0 siblings, 1 reply; 63+ messages in thread
From: Christoph Lameter @ 2008-02-05  6:11 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Tue, 5 Feb 2008, Andrea Arcangeli wrote:

> On Mon, Feb 04, 2008 at 11:09:01AM -0800, Christoph Lameter wrote:
> > On Sun, 3 Feb 2008, Andrea Arcangeli wrote:
> > 
> > > > Right but that pin requires taking a refcount which we cannot do.
> > > 
> > > GRU can use my patch without the pin. XPMEM obviously can't use my
> > > patch as my invalidate_page[s] are under the PT lock (a feature to fit
> > > GRU/KVM in the simplest way), this is why an incremental patch adding
> > > invalidate_range_start/end would be required to support XPMEM too.
> > 
> > Doesnt the kernel in some situations release the page before releasing the 
> > pte lock? Then there will be an external pte pointing to a page that may 
> > now have a different use. Its really bad if that pte does allow writes.
> 
> Sure the kernel does that most of the time, which is for example why I
> had to use invalidate_page instead of invalidate_pages inside
> zap_pte_range. Zero problems with that (this is also the exact reason
> why I mentioned the tlb flushing code would need changes to convert
> some page in pages).

Zero problems only if you find having a single callout for every page 
acceptable. So the invalidate_range in your patch is only working 
sometimes. And even if it works then it has to be used on 2M range. Seems 
to be a bit fragile and needlessly complex.

"conversion of some page in pages"? A proposal to defer the freeing of the 
pages until after the pte_unlock?


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] mmu notifiers #v5
  2008-02-05  6:11                     ` Christoph Lameter
@ 2008-02-05 18:08                       ` Andrea Arcangeli
  2008-02-05 18:17                         ` Christoph Lameter
  0 siblings, 1 reply; 63+ messages in thread
From: Andrea Arcangeli @ 2008-02-05 18:08 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Mon, Feb 04, 2008 at 10:11:24PM -0800, Christoph Lameter wrote:
> Zero problems only if you find having a single callout for every page 
> acceptable. So the invalidate_range in your patch is only working 

invalidate_pages is only a further optimization that was
straightforward in some places where the page isn't freed but only the
pte modified.

> sometimes. And even if it works then it has to be used on 2M range. Seems 
> to be a bit fragile and needlessly complex.

The patch as a whole isn't fragile or complex. Trying to use
invalidate_pages everywhere would be complex (and in turn more fragile
than my current patch, as complexity brings fragility).

Overall you can only argue against performance issues (my patch is
simpler for GRU/KVM, and it sure isn't fragile, quite the opposite,
given I never allow a coherency-loss between two threads that will
read/write to two different physical pages for the same virtual
address in remap_file_pages).

In performance terms, with your patch, before GRU can run follow_page it
has to take an mm-wide global mutex which every thread on every cpu will
have to take. That will thrash on >4-way when the tlb misses start
to occur. There's nothing like that in my patch. Your approach will
micro-optimize certain large pte-mangling calls, or do_exit, but those
aren't interesting paths either for GRU or for KVM. You're optimizing for
the slow path, and making the fast path slower.

> "conversion of some page in pages"? A proposal to defer the freeing of the 
> pages until after the pte_unlock?

There can be many tricks to optimize page in pages, but again munmap
and do_exit aren't the interesting path to optimzie, nor for GRU nor
for KVM so it doesn't matter right now.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] mmu notifiers #v5
  2008-02-05 18:08                       ` Andrea Arcangeli
@ 2008-02-05 18:17                         ` Christoph Lameter
  2008-02-05 20:55                           ` Andrea Arcangeli
  0 siblings, 1 reply; 63+ messages in thread
From: Christoph Lameter @ 2008-02-05 18:17 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Tue, 5 Feb 2008, Andrea Arcangeli wrote:

> given I never allow a coherency-loss between two threads that will
> read/write to two different physical pages for the same virtual
> address in remap_file_pages).

The other approach will not have any remote ptes at that point. Why would 
there be a coherency issue?
 
> In performance terms, with your patch, before GRU can run follow_page it
> has to take an mm-wide global mutex which every thread on every cpu will
> have to take. That will thrash on >4-way when the tlb misses start

No. It only has to lock the affected range. Remote page faults can occur 
while another part of the address space is being invalidated. The 
complexity of locking is up to the user of the mmu notifier. A simple 
implementation is satisfactory for the GRU right now. Should it become a 
problem then the lock granularity can be refined without changing the API.

> > "conversion of some page in pages"? A proposal to defer the freeing of the 
> > pages until after the pte_unlock?
> 
> There can be many tricks to optimize page in pages, but again munmap
> and do_exit aren't the interesting path to optimzie, nor for GRU nor
> for KVM so it doesn't matter right now.

Still not sure what we are talking about here.
 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] mmu notifiers #v5
  2008-02-05 18:17                         ` Christoph Lameter
@ 2008-02-05 20:55                           ` Andrea Arcangeli
  2008-02-05 22:06                             ` Christoph Lameter
  0 siblings, 1 reply; 63+ messages in thread
From: Andrea Arcangeli @ 2008-02-05 20:55 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Tue, Feb 05, 2008 at 10:17:41AM -0800, Christoph Lameter wrote:
> The other approach will not have any remote ptes at that point. Why would 
> there be a coherency issue?

It never happens that two threads writes to two different physical
pages by working on the same process virtual address. This is an issue
only for KVM which is probably ok with it but certainly you can't
consider the dependency on the page-pin less fragile or less complex
than my PT lock approach.

> No. It only has to lock the affected range. Remote page faults can occur 
> while another part of the address space is being invalidated. The 
> complexity of locking is up to the user of the mmu notifier. A simple 
> implementation is satisfactory for the GRU right now. Should it become a 
> problem then the lock granularity can be refined without changing the API.

That will make the follow_page fast path even slower if it has to
lookup a rbtree or a list of locked ranges. Still not comparable to
the PT lock that 1) it's zero cost and 2) it'll provide an even more
granular scalability.

> Still not sure what we are talking about here.

The apps using GRU/KVM never trigger large
munmap/mremap/do_exit. You're optimizing for the irrelevant workload,
by requiring unnecessary new locking in the GRU fast path.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] mmu notifiers #v5
  2008-02-05 20:55                           ` Andrea Arcangeli
@ 2008-02-05 22:06                             ` Christoph Lameter
  2008-02-05 22:12                               ` Robin Holt
  2008-02-05 22:26                               ` Andrea Arcangeli
  0 siblings, 2 replies; 63+ messages in thread
From: Christoph Lameter @ 2008-02-05 22:06 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Tue, 5 Feb 2008, Andrea Arcangeli wrote:

> On Tue, Feb 05, 2008 at 10:17:41AM -0800, Christoph Lameter wrote:
> > The other approach will not have any remote ptes at that point. Why would 
> > there be a coherency issue?
> 
> It never happens that two threads writes to two different physical
> pages by working on the same process virtual address. This is an issue
> only for KVM which is probably ok with it but certainly you can't
> consider the dependency on the page-pin less fragile or less complex
> than my PT lock approach.

You can avoid the page-pin and the pt lock completely by zapping the 
mappings at _start and then holding off new references until _end.

> > No. It only has to lock the affected range. Remote page faults can occur 
> > while another part of the address space is being invalidated. The 
> > complexity of locking is up to the user of the mmu notifier. A simple 
> > implementation is satisfactory for the GRU right now. Should it become a 
> > problem then the lock granularity can be refined without changing the API.
> 
> That will make the follow_page fast path even slower if it has to
> lookup a rbtree or a list of locked ranges. Still not comparable to
> the PT lock that 1) it's zero cost and 2) it'll provide an even more
> granular scalability.

As I said the implementation is up to the caller. Not sure what 
XPmem is using there but then XPmem is not using follow_page. The GRU 
would be using a lightweight way of locking, not rbtrees.

> > Still not sure what we are talking about here.
> 
> The apps using GRU/KVM never trigger large
> munmap/mremap/do_exit. You're optimizing for the irrelevant workload,
> by requiring unnecessary new locking in the GRU fast path.

Maybe that is true for KVM but certainly not true for the GRU. The GRU is 
designed to manage several petabytes of memory that may be mapped by a 
series of Linux instances. If a process only maps a small chunk of 4 
Gigabytes then we already have to deal with 1 mio callbacks.



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] mmu notifiers #v5
  2008-02-05 22:06                             ` Christoph Lameter
@ 2008-02-05 22:12                               ` Robin Holt
  2008-02-05 22:26                               ` Andrea Arcangeli
  1 sibling, 0 replies; 63+ messages in thread
From: Robin Holt @ 2008-02-05 22:12 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

On Tue, Feb 05, 2008 at 02:06:23PM -0800, Christoph Lameter wrote:
> On Tue, 5 Feb 2008, Andrea Arcangeli wrote:
> 
> > On Tue, Feb 05, 2008 at 10:17:41AM -0800, Christoph Lameter wrote:
> > > The other approach will not have any remote ptes at that point. Why would 
> > > there be a coherency issue?
> > 
> > It never happens that two threads writes to two different physical
> > pages by working on the same process virtual address. This is an issue
> > only for KVM which is probably ok with it but certainly you can't
> > consider the dependency on the page-pin less fragile or less complex
> > than my PT lock approach.
> 
> You can avoid the page-pin and the pt lock completely by zapping the 
> mappings at _start and then holding off new references until _end.

XPMEM is doing this by putting our equivalent structure of the mm into a
recalling state, which causes all future faulters to back off.  It then
marks any currently active faults in the range as invalid (we have a very
small number of possible concurrent faulters for a different reason),
proceeds to start remote shoot-downs, waits for those shoot-downs to
complete, and then returns from the _begin callout with the mm-equiv still
in the recalling state.  Additional recalls may occur, but no new faults
can.  The _end callout reduces the number of active recalls until there
are none left, at which point the faulters are allowed to proceed again.
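
In rough pseudo-C the core of it looks like this (sketch only; none of
these names match the real XPMEM code, and initialization is omitted):

#include <linux/wait.h>
#include <asm/atomic.h>

struct xpmem_mm_sketch {
        atomic_t recall_count;          /* outstanding _begin callouts */
        wait_queue_head_t recall_wq;
};

static void xpmem_range_begin(struct xpmem_mm_sketch *xmm)
{
        atomic_inc(&xmm->recall_count); /* enter the "recalling" state */
        /* ... mark active faults in the range invalid, start the remote
         *     shoot-downs and wait for them to complete ... */
        /* return with recall_count still raised: new faults keep backing off */
}

static void xpmem_range_end(struct xpmem_mm_sketch *xmm)
{
        if (atomic_dec_and_test(&xmm->recall_count))
                wake_up(&xmm->recall_wq);       /* faulters may proceed */
}

static void xpmem_fault_backoff(struct xpmem_mm_sketch *xmm)
{
        /* faulters back off until every outstanding recall has completed;
         * the real code re-validates after waking up, omitted here */
        wait_event(xmm->recall_wq, atomic_read(&xmm->recall_count) == 0);
}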

Thanks,
Robin

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] mmu notifiers #v5
  2008-02-05 22:06                             ` Christoph Lameter
  2008-02-05 22:12                               ` Robin Holt
@ 2008-02-05 22:26                               ` Andrea Arcangeli
  2008-02-05 23:10                                 ` Christoph Lameter
  1 sibling, 1 reply; 63+ messages in thread
From: Andrea Arcangeli @ 2008-02-05 22:26 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Tue, Feb 05, 2008 at 02:06:23PM -0800, Christoph Lameter wrote:
> On Tue, 5 Feb 2008, Andrea Arcangeli wrote:
> 
> > On Tue, Feb 05, 2008 at 10:17:41AM -0800, Christoph Lameter wrote:
> > > The other approach will not have any remote ptes at that point. Why would 
> > > there be a coherency issue?
> > 
> > It never happens that two threads writes to two different physical
> > pages by working on the same process virtual address. This is an issue
> > only for KVM which is probably ok with it but certainly you can't
> > consider the dependency on the page-pin less fragile or less complex
> > than my PT lock approach.
> 
> You can avoid the page-pin and the pt lock completely by zapping the 
> mappings at _start and then holding off new references until _end.

Avoid the PT lock? The PT lock has to be taken anyway by the linux
VM.

"holding off new references until _end" = per-range mutex less scalar
and more expensive than the PT lock that has to be taken anyway.

> As I said the implementation is up to the caller. Not sure what 
> XPmem is using there but then XPmem is not using follow_page. The GRU 
> would be using a lightweight way of locking, not rbtrees.

"lightway way of locking" = mm-wide-mutex (not necessary at all if we
take advantage of the per-pte-scalar PT lock that has to be taken
anyway like in my patch)

> Maybe that is true for KVM but certainly not true for the GRU. The GRU is 
> designed to manage several petabytes of memory that may be mapped by a 
> series of Linux instances. If a process only maps a small chunk of 4 
> Gigabytes then we already have to deal with 1 mio callbacks.

KVM is also going to map a lot of stuff, but mapping involves mmap,
not munmap/mremap/mprotect. The size of the mmap is irrelevant in both
approaches. Optimizing do_exit by making the tlb-miss runtime slower
doesn't sound great to me, and that's what your patch does if you force
GRU to use it.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] mmu notifiers #v5
  2008-02-05 22:26                               ` Andrea Arcangeli
@ 2008-02-05 23:10                                 ` Christoph Lameter
  2008-02-05 23:47                                   ` Andrea Arcangeli
  0 siblings, 1 reply; 63+ messages in thread
From: Christoph Lameter @ 2008-02-05 23:10 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Tue, 5 Feb 2008, Andrea Arcangeli wrote:

> > You can avoid the page-pin and the pt lock completely by zapping the 
> > mappings at _start and then holding off new references until _end.
> 
> "holding off new references until _end" = per-range mutex less scalar
> and more expensive than the PT lock that has to be taken anyway.

You can of course setup a 2M granularity lock to get the same granularity 
as the pte lock. That would even work for the cases where you have to page 
pin now.
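
For example (illustration only, nothing like this is in the posted patch):

#include <linux/hash.h>
#include <linux/mutex.h>

#define CHUNK_SHIFT             21      /* 2M naturally aligned chunks */
#define CHUNK_LOCK_HASH_BITS    8

struct chunk_lock_table {
        struct mutex lock[1 << CHUNK_LOCK_HASH_BITS];
};

static void chunk_lock_table_init(struct chunk_lock_table *t)
{
        int i;

        for (i = 0; i < (1 << CHUNK_LOCK_HASH_BITS); i++)
                mutex_init(&t->lock[i]);
}

/* collisions only mean two chunks share a lock, i.e. coarser serialization */
static struct mutex *chunk_lock(struct chunk_lock_table *t, unsigned long addr)
{
        return &t->lock[hash_long(addr >> CHUNK_SHIFT, CHUNK_LOCK_HASH_BITS)];
}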

> > Maybe that is true for KVM but certainly not true for the GRU. The GRU is 
> > designed to manage several petabytes of memory that may be mapped by a 
> > series of Linux instances. If a process only maps a small chunk of 4 
> > Gigabytes then we already have to deal with 1 mio callbacks.
> 
> KVM is also going to map a lot of stuff, but mapping involves mmap,
> not munmap/mremap/mprotect. The size of the mmap is irrelevant in both
> approaches. Optimizing do_exit by making the tlb-miss runtime slower
> doesn't sound great to me, and that's what your patch does if you force
> GRU to use it.

The size of the mmap is relevant if you have to perform callbacks on
every mapped page that involve taking mmu-specific locks. That seems to be
the case with this approach.

Optimizing do_exit by taking a single lock to zap all external references 
instead of 1 mio callbacks somehow leads to slowdown?


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] mmu notifiers #v5
  2008-02-05 23:10                                 ` Christoph Lameter
@ 2008-02-05 23:47                                   ` Andrea Arcangeli
  2008-02-06  0:04                                     ` Christoph Lameter
  0 siblings, 1 reply; 63+ messages in thread
From: Andrea Arcangeli @ 2008-02-05 23:47 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Tue, Feb 05, 2008 at 03:10:52PM -0800, Christoph Lameter wrote:
> On Tue, 5 Feb 2008, Andrea Arcangeli wrote:
> 
> > > You can avoid the page-pin and the pt lock completely by zapping the 
> > > mappings at _start and then holding off new references until _end.
> > 
> > "holding off new references until _end" = per-range mutex less scalar
> > and more expensive than the PT lock that has to be taken anyway.
> 
> You can of course setup a 2M granularity lock to get the same granularity 
> as the pte lock. That would even work for the cases where you have to page 
> pin now.

If you set a 2M granularity lock, the _start callback would need to
do:

	for_each_2m_lock()
		mutex_lock()

so you'd run zillions of mutex_locks in a row, and you're the one with
the millions-of-operations argument.

> The size of the mmap is relevant if you have to perform callbacks on
> every mapped page that involve taking mmu-specific locks. That seems to be
> the case with this approach.

mmap should never trigger any range_start/_end callback unless it's
overwriting an older mapping which is definitely not the interesting
workload for those apps including kvm.

> Optimizing do_exit by taking a single lock to zap all external references 
> instead of 1 mio callbacks somehow leads to slowdown?

It can if the application runs for more than a couple of seconds,
i.e. not a fork flood in which you care about do_exit speed. Keep in
mind that if you had 1 mio invalidate_pages callbacks it means you
previously called follow_page 1 mio times too...

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] mmu notifiers #v5
  2008-02-05 23:47                                   ` Andrea Arcangeli
@ 2008-02-06  0:04                                     ` Christoph Lameter
  0 siblings, 0 replies; 63+ messages in thread
From: Christoph Lameter @ 2008-02-06  0:04 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Wed, 6 Feb 2008, Andrea Arcangeli wrote:

> > You can of course setup a 2M granularity lock to get the same granularity 
> > as the pte lock. That would even work for the cases where you have to page 
> > pin now.
> 
> If you set a 2M granularity lock, the _start callback would need to
> do:
> 
> 	for_each_2m_lock()
> 		mutex_lock()
> 
> so you'd run zillions of mutex_locks in a row, and you're the one with
> the millions-of-operations argument.

There is no requirement to do a linear search. No one in his right mind 
would implement a performance critical operation that way.
 
> > The size of the mmap is relevant if you have to perform callbacks on 
> > every mapped page that involved take mmu specific locks. That seems to be 
> > the case with this approach.
> 
> mmap should never trigger any range_start/_end callback unless it's
> overwriting an older mapping which is definitely not the interesting
> workload for those apps including kvm.

There is still at least the need for teardown on exit. And you need to 
consider the boundless creativity of user land programmers. You would not 
believe what I have seen....

> > Optimizing do_exit by taking a single lock to zap all external references 
> > instead of 1 mio callbacks somehow leads to slowdown?
> 
> It can if the application runs for more than a couple of seconds,
> i.e. not a fork flood in which you care about do_exit speed. Keep in
> mind that if you had 1 mio invalidate_pages callbacks it means you
> previously called follow_page 1 mio times too...

That is another problem where we are also in need of solutions. I believe 
we have discussed that elsewhere.

^ permalink raw reply	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2008-02-06  0:04 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-01-31  4:57 [patch 0/3] [RFC] MMU Notifiers V4 Christoph Lameter
2008-01-31  4:57 ` [patch 1/3] mmu_notifier: Core code Christoph Lameter
2008-02-01  1:56   ` Jack Steiner
2008-02-01  2:24     ` Robin Holt
2008-02-01  2:37       ` Jack Steiner
2008-02-01  2:39         ` Christoph Lameter
2008-02-01  2:31   ` Robin Holt
2008-02-01  2:39     ` Christoph Lameter
2008-02-01  2:47       ` Robin Holt
2008-02-01  3:01         ` Christoph Lameter
2008-02-01  3:01       ` Jack Steiner
2008-02-01  3:03         ` Christoph Lameter
2008-02-01  3:52   ` Robin Holt
2008-02-01  3:58     ` Christoph Lameter
2008-02-01  4:15       ` Robin Holt
2008-02-03  1:33       ` Andrea Arcangeli
2008-02-04 19:13         ` Christoph Lameter
2008-01-31  4:57 ` [patch 2/3] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter
2008-01-31 12:31   ` Andrea Arcangeli
2008-01-31 20:07     ` Christoph Lameter
2008-01-31 22:01     ` mmu_notifier: close hole in fork Christoph Lameter
2008-01-31 22:16       ` mmu_notifier: reduce size of mm_struct if !CONFIG_MMU_NOTIFIER Christoph Lameter
2008-01-31 22:21       ` mmu_notifier: Move mmu_notifier_release up to get rid of the invalidat_all() callback Christoph Lameter
2008-02-01  0:13         ` Andrea Arcangeli
2008-02-01  1:52           ` Christoph Lameter
2008-02-01  1:57           ` mmu_notifier: invalidate_range for move_page_tables Christoph Lameter
2008-02-01  2:38             ` Robin Holt
2008-02-01  2:41               ` Christoph Lameter
2008-02-01  0:01       ` mmu_notifier: close hole in fork Andrea Arcangeli
2008-02-01  1:48         ` Christoph Lameter
2008-02-01  4:24   ` [patch 2/3] mmu_notifier: Callbacks to invalidate address ranges Robin Holt
2008-02-01  4:43     ` Christoph Lameter
2008-02-01 10:32       ` Robin Holt
2008-02-01 10:37         ` Robin Holt
2008-02-01 19:13         ` Christoph Lameter
2008-01-31  4:57 ` [patch 3/3] mmu_notifier: invalidate_page callbacks Christoph Lameter
2008-01-31 17:18 ` [PATCH] mmu notifiers #v5 Andrea Arcangeli
2008-01-31 20:18   ` Christoph Lameter
2008-01-31 23:09     ` Christoph Lameter
2008-01-31 23:41       ` Andrea Arcangeli
2008-02-01  1:44         ` Christoph Lameter
2008-02-01 12:09           ` Andrea Arcangeli
2008-02-01 19:23             ` Christoph Lameter
2008-02-03  2:17               ` Andrea Arcangeli
2008-02-03  3:14                 ` Jack Steiner
2008-02-03  3:33                   ` Andrea Arcangeli
2008-02-04 19:09                 ` Christoph Lameter
2008-02-05  5:25                   ` Andrea Arcangeli
2008-02-05  6:11                     ` Christoph Lameter
2008-02-05 18:08                       ` Andrea Arcangeli
2008-02-05 18:17                         ` Christoph Lameter
2008-02-05 20:55                           ` Andrea Arcangeli
2008-02-05 22:06                             ` Christoph Lameter
2008-02-05 22:12                               ` Robin Holt
2008-02-05 22:26                               ` Andrea Arcangeli
2008-02-05 23:10                                 ` Christoph Lameter
2008-02-05 23:47                                   ` Andrea Arcangeli
2008-02-06  0:04                                     ` Christoph Lameter
2008-01-31 23:28     ` Andrea Arcangeli
2008-02-01  1:37       ` Christoph Lameter
2008-02-01  2:23         ` Robin Holt
2008-02-01  2:26           ` Christoph Lameter
2008-02-01 12:00         ` Andrea Arcangeli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).