linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [patch 0/4] [RFC] EMMU Notifiers V5
@ 2008-02-01  5:04 Christoph Lameter
  2008-02-01  5:04 ` [patch 1/4] mmu_notifier: Core code Christoph Lameter
                   ` (6 more replies)
  0 siblings, 7 replies; 23+ messages in thread
From: Christoph Lameter @ 2008-02-01  5:04 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

This is a patchset implementing MMU notifier callbacks based on Andrea's
earlier work. These are needed if Linux pages are referenced from something
else than tracked by the rmaps of the kernel (an external MMU).

The known immediate users are

KVM
- Establishes a refcount to the page via get_user_pages().
- External references are called spte.
- Has page tables to track pages whose refcount was elevated(?) but
  no reverse maps.

GRU
- Simple additional hardware TLB (possibly covering multiple instances of
  Linux)
- Needs TLB shootdown when the VM unmaps pages.
- Determines page address via follow_page (from interrupt context) but can
  fall back to get_user_pages().
- No page reference possible since no page status is kept..

XPmem
- Allows use of a processes memory by remote instances of Linux.
- Provides its own reverse mappings to track remote pte.
- Established refcounts on the exported pages.
- Must sleep in order to wait for remote acks of ptes that are being
  cleared.



Known issues:

- RCU quiescent periods are required on registering
  notifiers to guarantee visibility to other processors.

Andrea's mmu_notifier #4 -> RFC V1

- Merge subsystem rmap based with Linux rmap based approach
- Move Linux rmap based notifiers out of macro
- Try to account for what locks are held while the notifiers are
  called.
- Develop a patch sequence that separates out the different types of
  hooks so that we can review their use.
- Avoid adding include to linux/mm_types.h
- Integrate RCU logic suggested by Peter.

V1->V2:
- Improve RCU support
- Use mmap_sem for mmu_notifier register / unregister
- Drop invalidate_page from COW, mm/fremap.c and mm/rmap.c since we
  already have invalidate_range() callbacks there.
- Clean compile for !MMU_NOTIFIER
- Isolate filemap_xip strangeness into its own diff
- Pass a the flag to invalidate_range to indicate if a spinlock
  is held.
- Add invalidate_all()

V2->V3:
- Further RCU fixes
- Fixes from Andrea to fixup aging and move invalidate_range() in do_wp_page
  and sys_remap_file_pages() after the pte clearing.

V3->V4:
- Drop locking and synchronize_rcu() on ->release since we know on release that
  we are the only executing thread. This is also true for invalidate_all() so
  we could drop off the mmu_notifier there early. Use hlist_del_init instead
  of hlist_del_rcu.
- Do the invalidation as begin/end pairs with the requirement that the driver
  holds off new references in between.
- Fixup filemap_xip.c
- Figure out a potential way in which XPmem can deal with locks that are held.
- Robin's patches to make the mmu_notifier logic manage the PageRmapExported bit.
- Strip cc list down a bit.
- Drop Peters new rcu list macro
- Add description to the core patch

V4->V5:
- Provide missing callouts for mremap.
- Provide missing callouts for copy_page_range.
- Reduce mm_struct space to zero if !MMU_NOTIFIER by #ifdeffing out
  structure contents.
- Get rid of the invalidate_all() callback by moving ->release in place
  of invalidate_all.
- Require holding mmap_sem on register/unregister instead of acquiring it
  ourselves. In some contexts where we want to register/unregister we are
  already holding mmap_sem.
- Split out the rmap support patch so that there is no need to apply
  all patches for KVM and GRU.

-- 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [patch 1/4] mmu_notifier: Core code
  2008-02-01  5:04 [patch 0/4] [RFC] EMMU Notifiers V5 Christoph Lameter
@ 2008-02-01  5:04 ` Christoph Lameter
  2008-02-01 10:55   ` Robin Holt
  2008-02-01  5:04 ` [patch 2/4] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Christoph Lameter @ 2008-02-01  5:04 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

[-- Attachment #1: mmu_core --]
[-- Type: text/plain, Size: 14594 bytes --]

Notifier functions for hardware and software that establishes external
references to pages of a Linux system. The notifier calls ensure that
external mappings are removed when the Linux VM removes memory ranges
or individual pages from a process.

This first portion is fitting for external mmu's that do not have their
own rmap or need the ability to sleep before removing individual pages.

Two categories of external mmus are possible:

1. KVM style external mmus that have their own page table.
   These are capable of tracking pages in their page tables and
   can therefore increase the refcount on pages. An increased
   refcount guarantees page existence regardless of the vms unmapping
   actions until the logic in the notifier call decides to drop a page.

2. GRU style external mmus that rely on the Linux page table for TLB lookups.
   These cannot track pages that are externally references.
   TLB entries can only be evicted as necessary.


Callbacks are registered with an mm_struct from a device drivers using
mmu_notifier_register. When the VM removes pages (or restricts
permissions on pages) then callbacks are triggered

The VM holds spinlocks in order to walk reverse maps in rmap.c. The single
page callback invalidate_page() is therefore always run with
spinlocks held (which limits what can be done in the callbacks).

The invalidate_range_start/end callbacks can be run in atomic as well as
sleepable contexts. A flag is passed to indicate an atomic context.
The notifier may decide to defer actions if the context is atomic.

Pages must be marked dirty if dirty bits are found to be set in
the external ptes.

Requirements on synchronization within the driver:

     Multiple invalidate_range_begin/ends may be nested or called
     concurrently. That is legit. However, no new external references
     may be established as long as any invalidate_xxx is running or as long
     as any invalidate_range_begin() and has not been completed through a
     corresponding call to invalidate_range_end().

     Locking within the notifier callbacks needs to serialize events
     correspondingly. One simple implementation would be the use of a spinlock
     that needs to be acquired for access to the page table or tlb managed by
     the driver. A rw lock could be used to allow multiplel concurrent invalidates
     to run but then the driver needs to have additional internal synchronization
     for access to hardware resources.

     If all invalidate_xx notifier calls take the driver lock then it is possible
     to run follow_page() under the same lock. The lock can then guarantee
     that no page is removed and provides an additional existence guarantee
     of the page independent of the page count.

     invalidate_range_begin() must clear all references in the range
     and stop the establishment of new references.

     invalidate_range_end() reenables the establishment of references.
     The atomic paramater passed to invalidatge_range_xx indicates that the function
     is called in an atomic context. We can sleep if atomic == 0.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

---
 include/linux/mm_types.h     |    8 +
 include/linux/mmu_notifier.h |  179 +++++++++++++++++++++++++++++++++++++++++++
 kernel/fork.c                |    2 
 mm/Kconfig                   |    4 
 mm/Makefile                  |    1 
 mm/mmap.c                    |    2 
 mm/mmu_notifier.c            |   76 ++++++++++++++++++
 7 files changed, 272 insertions(+)

Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h	2008-01-31 19:55:46.000000000 -0800
+++ linux-2.6/include/linux/mm_types.h	2008-01-31 19:59:51.000000000 -0800
@@ -153,6 +153,12 @@ struct vm_area_struct {
 #endif
 };
 
+struct mmu_notifier_head {
+#ifdef CONFIG_MMU_NOTIFIER
+	struct hlist_head head;
+#endif
+};
+
 struct mm_struct {
 	struct vm_area_struct * mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
@@ -219,6 +225,8 @@ struct mm_struct {
 	/* aio bits */
 	rwlock_t		ioctx_list_lock;
 	struct kioctx		*ioctx_list;
+
+	struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
 };
 
 #endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/mmu_notifier.h	2008-01-31 20:56:03.000000000 -0800
@@ -0,0 +1,179 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+/*
+ * MMU motifier
+ *
+ * Notifier functions for hardware and software that establishes external
+ * references to pages of a Linux system. The notifier calls ensure that
+ * external mappings are removed when the Linux VM removes memory ranges
+ * or individual pages from a process.
+ *
+ * These fall into two classes:
+ *
+ * 1. mmu_notifier
+ *
+ * 	These are callbacks registered with an mm_struct. If pages are
+ * 	removed from an address space then callbacks are performed.
+ *
+ * 	Spinlocks must be held in order to walk reverse maps. The
+ * 	invalidate_page() callbacks are performed with spinlocks are held.
+ *
+ * 	The invalidate_range_start/end callbacks can be performed in contexts
+ * 	where sleeping is allowed or in atomic contexts. A flag is passed
+ * 	to indicate an atomic context.
+ *
+ *	Pages must be marked dirty if dirty bits are found to be set in
+ *	the external ptes.
+ */
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/rcupdate.h>
+#include <linux/mm_types.h>
+
+struct mmu_notifier_ops;
+
+struct mmu_notifier {
+	struct hlist_node hlist;
+	const struct mmu_notifier_ops *ops;
+};
+
+struct mmu_notifier_ops {
+	/*
+	 * The release notifier is called when no other execution threads
+	 * are left. Synchronization is not necessary.
+	 */
+	void (*release)(struct mmu_notifier *mn,
+			struct mm_struct *mm);
+
+	/*
+	 * age_page is called from contexts where the pte_lock is held
+	 */
+	int (*age_page)(struct mmu_notifier *mn,
+			struct mm_struct *mm,
+			unsigned long address);
+
+	/* invalidate_page is called from contexts where the pte_lock is held */
+	void (*invalidate_page)(struct mmu_notifier *mn,
+				struct mm_struct *mm,
+				unsigned long address);
+
+	/*
+	 * invalidate_range_begin() and invalidate_range_end() must paired.
+	 *
+	 * Multiple invalidate_range_begin/ends may be nested or called
+	 * concurrently. That is legit. However, no new external references
+	 * may be established as long as any invalidate_xxx is running or
+	 * any invalidate_range_begin() and has not been completed through a
+	 * corresponding call to invalidate_range_end().
+	 *
+	 * Locking within the notifier needs to serialize events correspondingly.
+	 *
+	 * If all invalidate_xx notifier calls take a driver lock then it is possible
+	 * to run follow_page() under the same lock. The lock can then guarantee
+	 * that no page is removed and provides an additional existence guarantee
+	 * of the page.
+	 *
+	 * invalidate_range_begin() must clear all references in the range
+	 * and stop the establishment of new references.
+	 *
+	 * invalidate_range_end() reenables the establishment of references.
+	 *
+	 * atomic indicates that the function is called in an atomic context.
+	 * We can sleep if atomic == 0.
+	 */
+	void (*invalidate_range_begin)(struct mmu_notifier *mn,
+				 struct mm_struct *mm,
+				 unsigned long start, unsigned long end,
+				 int atomic);
+
+	void (*invalidate_range_end)(struct mmu_notifier *mn,
+				 unsigned long stat, unsigned long end,
+				 struct mm_struct *mm, int atomic);
+};
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+/*
+ * Must hold the mmap_sem for write.
+ *
+ * RCU is used to traverse the list. A quiescent period needs to pass
+ * before the notifier is guaranteed to be visible to all threads
+ */
+extern void mmu_notifier_register(struct mmu_notifier *mn,
+				  struct mm_struct *mm);
+
+/*
+ * Must hold mmap_sem for write.
+ *
+ * A quiescent period needs to pass before the mmu_notifier structure
+ * can be released. mmu_notifier_release() will wait for a quiescent period
+ * after calling the ->release callback. So it is safe to call
+ * mmu_notifier_unregister from the ->release function.
+ */
+extern void mmu_notifier_unregister(struct mmu_notifier *mn,
+				    struct mm_struct *mm);
+
+
+extern void mmu_notifier_release(struct mm_struct *mm);
+extern int mmu_notifier_age_page(struct mm_struct *mm,
+				 unsigned long address);
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
+{
+	INIT_HLIST_HEAD(&mnh->head);
+}
+
+#define mmu_notifier(function, mm, args...)				\
+	do {								\
+		struct mmu_notifier *__mn;				\
+		struct hlist_node *__n;					\
+									\
+		if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \
+			rcu_read_lock();				\
+			hlist_for_each_entry_rcu(__mn, __n,		\
+					     &(mm)->mmu_notifier.head,	\
+					     hlist)			\
+				if (__mn->ops->function)		\
+					__mn->ops->function(__mn,	\
+							    mm,		\
+							    args);	\
+			rcu_read_unlock();				\
+		}							\
+	} while (0)
+
+#else /* CONFIG_MMU_NOTIFIER */
+
+/*
+ * Notifiers that use the parameters that they were passed so that the
+ * compiler does not complain about unused variables but does proper
+ * parameter checks even if !CONFIG_MMU_NOTIFIER.
+ * Macros generate no code.
+ */
+#define mmu_notifier(function, mm, args...)				\
+	do {								\
+		if (0) {						\
+			struct mmu_notifier *__mn;			\
+									\
+			__mn = (struct mmu_notifier *)(0x00ff);		\
+			__mn->ops->function(__mn, mm, args);		\
+		};							\
+	} while (0)
+
+static inline void mmu_notifier_register(struct mmu_notifier *mn,
+						struct mm_struct *mm) {}
+static inline void mmu_notifier_unregister(struct mmu_notifier *mn,
+						struct mm_struct *mm) {}
+static inline void mmu_notifier_release(struct mm_struct *mm) {}
+static inline int mmu_notifier_age_page(struct mm_struct *mm,
+				unsigned long address)
+{
+	return 0;
+}
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mmh) {}
+
+#endif /* CONFIG_MMU_NOTIFIER */
+
+#endif /* _LINUX_MMU_NOTIFIER_H */
Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig	2008-01-31 19:55:46.000000000 -0800
+++ linux-2.6/mm/Kconfig	2008-01-31 19:59:51.000000000 -0800
@@ -193,3 +193,7 @@ config NR_QUICK
 config VIRT_TO_BUS
 	def_bool y
 	depends on !ARCH_NO_VIRT_TO_BUS
+
+config MMU_NOTIFIER
+	def_bool y
+	bool "MMU notifier, for paging KVM/RDMA"
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile	2008-01-31 19:55:46.000000000 -0800
+++ linux-2.6/mm/Makefile	2008-01-31 19:59:51.000000000 -0800
@@ -30,4 +30,5 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
+obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
 
Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/mm/mmu_notifier.c	2008-01-31 20:56:03.000000000 -0800
@@ -0,0 +1,76 @@
+/*
+ *  linux/mm/mmu_notifier.c
+ *
+ *  Copyright (C) 2008  Qumranet, Inc.
+ *  Copyright (C) 2008  SGI
+ *  		Christoph Lameter <clameter@sgi.com>
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
+
+/*
+ * No synchronization. This function can only be called when only a single
+ * process remains that performs teardown.
+ */
+void mmu_notifier_release(struct mm_struct *mm)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n, *t;
+
+	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+		hlist_for_each_entry_safe(mn, n, t,
+					  &mm->mmu_notifier.head, hlist) {
+			hlist_del_init(&mn->hlist);
+			if (mn->ops->release)
+				mn->ops->release(mn, mm);
+		}
+	}
+}
+
+/*
+ * If no young bitflag is supported by the hardware, ->age_page can
+ * unmap the address and return 1 or 0 depending if the mapping previously
+ * existed or not.
+ */
+int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n;
+	int young = 0;
+
+	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+		rcu_read_lock();
+		hlist_for_each_entry_rcu(mn, n,
+					  &mm->mmu_notifier.head, hlist) {
+			if (mn->ops->age_page)
+				young |= mn->ops->age_page(mn, mm, address);
+		}
+		rcu_read_unlock();
+	}
+
+	return young;
+}
+
+/*
+ * Note that all notifiers use RCU. The updates are only guaranteed to be
+ * visible to other processes after a RCU quiescent period!
+ *
+ * Must hold mmap_sem writably when calling registration functions.
+ */
+void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
+}
+EXPORT_SYMBOL_GPL(__mmu_notifier_register);
+
+void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	hlist_del_rcu(&mn->hlist);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
+
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2008-01-31 19:55:46.000000000 -0800
+++ linux-2.6/kernel/fork.c	2008-01-31 19:59:51.000000000 -0800
@@ -52,6 +52,7 @@
 #include <linux/tty.h>
 #include <linux/proc_fs.h>
 #include <linux/blkdev.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -360,6 +361,7 @@ static struct mm_struct * mm_init(struct
 
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
+		mmu_notifier_head_init(&mm->mmu_notifier);
 		return mm;
 	}
 	free_mm(mm);
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2008-01-31 19:55:46.000000000 -0800
+++ linux-2.6/mm/mmap.c	2008-01-31 20:56:03.000000000 -0800
@@ -26,6 +26,7 @@
 #include <linux/mount.h>
 #include <linux/mempolicy.h>
 #include <linux/rmap.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -2033,6 +2034,7 @@ void exit_mmap(struct mm_struct *mm)
 	unsigned long end;
 
 	/* mm's last user has gone, and its about to be pulled down */
+	mmu_notifier_release(mm);
 	arch_exit_mmap(mm);
 
 	lru_add_drain();

-- 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [patch 2/4] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-01  5:04 [patch 0/4] [RFC] EMMU Notifiers V5 Christoph Lameter
  2008-02-01  5:04 ` [patch 1/4] mmu_notifier: Core code Christoph Lameter
@ 2008-02-01  5:04 ` Christoph Lameter
  2008-02-01 10:49   ` Robin Holt
  2008-02-01 22:09   ` Robin Holt
  2008-02-01  5:04 ` [patch 3/4] mmu_notifier: invalidate_page callbacks Christoph Lameter
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 23+ messages in thread
From: Christoph Lameter @ 2008-02-01  5:04 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

[-- Attachment #1: mmu_invalidate_range_callbacks --]
[-- Type: text/plain, Size: 9668 bytes --]

The invalidation of address ranges in a mm_struct needs to be
performed when pages are removed or permissions etc change.

invalidate_range_begin/end() is frequently called with only mmap_sem
held. If invalidate_range_begin() is called with locks held then we
pass a flag into invalidate_range() to indicate that no sleeping is
possible.

In two cases we use invalidate_range_begin/end to invalidate
single pages because the pair allows holding off new references
(idea by Robin Holt).

do_wp_page(): We hold off new references while update the pte.

xip_unmap: We are not taking the PageLock so we cannot
use the invalidate_page mmu_rmap_notifier. invalidate_range_begin/end
stands in.

Comments state that mmap_sem must be held for
remap_pfn_range() but various drivers do not seem to do this.

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Robin Holt <holt@sgi.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/filemap_xip.c |    5 +++++
 mm/fremap.c      |    3 +++
 mm/hugetlb.c     |    3 +++
 mm/memory.c      |   24 ++++++++++++++++++++++--
 mm/mmap.c        |    2 ++
 mm/mremap.c      |    7 ++++++-
 6 files changed, 41 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c	2008-01-31 20:56:03.000000000 -0800
+++ linux-2.6/mm/fremap.c	2008-01-31 20:59:14.000000000 -0800
@@ -15,6 +15,7 @@
 #include <linux/rmap.h>
 #include <linux/module.h>
 #include <linux/syscalls.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/mmu_context.h>
 #include <asm/cacheflush.h>
@@ -211,7 +212,9 @@ asmlinkage long sys_remap_file_pages(uns
 		spin_unlock(&mapping->i_mmap_lock);
 	}
 
+	mmu_notifier(invalidate_range_begin, mm, start, start + size, 0);
 	err = populate_range(mm, vma, start, size, pgoff);
+	mmu_notifier(invalidate_range_end, mm, start, start + size, 0);
 	if (!err && !(flags & MAP_NONBLOCK)) {
 		if (unlikely(has_write_lock)) {
 			downgrade_write(&mm->mmap_sem);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2008-01-31 20:56:03.000000000 -0800
+++ linux-2.6/mm/memory.c	2008-01-31 20:59:14.000000000 -0800
@@ -50,6 +50,7 @@
 #include <linux/delayacct.h>
 #include <linux/init.h>
 #include <linux/writeback.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -601,6 +602,9 @@ int copy_page_range(struct mm_struct *ds
 	if (is_vm_hugetlb_page(vma))
 		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
 
+	if (is_cow_mapping(vma->vm_flags))
+		mmu_notifier(invalidate_range_begin, src_mm, addr, end, 0);
+
 	dst_pgd = pgd_offset(dst_mm, addr);
 	src_pgd = pgd_offset(src_mm, addr);
 	do {
@@ -611,6 +615,11 @@ int copy_page_range(struct mm_struct *ds
 						vma, addr, next))
 			return -ENOMEM;
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
+
+	if (is_cow_mapping(vma->vm_flags))
+		mmu_notifier(invalidate_range_end, src_mm,
+						vma->vm_start, end, 0);
+
 	return 0;
 }
 
@@ -883,13 +892,16 @@ unsigned long zap_page_range(struct vm_a
 	struct mmu_gather *tlb;
 	unsigned long end = address + size;
 	unsigned long nr_accounted = 0;
+	int atomic = details ? (details->i_mmap_lock != 0) : 0;
 
 	lru_add_drain();
 	tlb = tlb_gather_mmu(mm, 0);
 	update_hiwater_rss(mm);
+	mmu_notifier(invalidate_range_begin, mm, address, end, atomic);
 	end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
 	if (tlb)
 		tlb_finish_mmu(tlb, address, end);
+	mmu_notifier(invalidate_range_end, mm, address, end, atomic);
 	return end;
 }
 
@@ -1318,7 +1330,7 @@ int remap_pfn_range(struct vm_area_struc
 {
 	pgd_t *pgd;
 	unsigned long next;
-	unsigned long end = addr + PAGE_ALIGN(size);
+	unsigned long start = addr, end = addr + PAGE_ALIGN(size);
 	struct mm_struct *mm = vma->vm_mm;
 	int err;
 
@@ -1352,6 +1364,7 @@ int remap_pfn_range(struct vm_area_struc
 	pfn -= addr >> PAGE_SHIFT;
 	pgd = pgd_offset(mm, addr);
 	flush_cache_range(vma, addr, end);
+	mmu_notifier(invalidate_range_begin, mm, start, end, 0);
 	do {
 		next = pgd_addr_end(addr, end);
 		err = remap_pud_range(mm, pgd, addr, next,
@@ -1359,6 +1372,7 @@ int remap_pfn_range(struct vm_area_struc
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
+	mmu_notifier(invalidate_range_end, mm, start, end, 0);
 	return err;
 }
 EXPORT_SYMBOL(remap_pfn_range);
@@ -1442,10 +1456,11 @@ int apply_to_page_range(struct mm_struct
 {
 	pgd_t *pgd;
 	unsigned long next;
-	unsigned long end = addr + size;
+	unsigned long start = addr, end = addr + size;
 	int err;
 
 	BUG_ON(addr >= end);
+	mmu_notifier(invalidate_range_begin, mm, start, end, 0);
 	pgd = pgd_offset(mm, addr);
 	do {
 		next = pgd_addr_end(addr, end);
@@ -1453,6 +1468,7 @@ int apply_to_page_range(struct mm_struct
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
+	mmu_notifier(invalidate_range_end, mm, start, end, 0);
 	return err;
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
@@ -1630,6 +1646,8 @@ gotten:
 		goto oom;
 	cow_user_page(new_page, old_page, address, vma);
 
+	mmu_notifier(invalidate_range_begin, mm, address,
+				address + PAGE_SIZE, 0);
 	/*
 	 * Re-check the pte - we dropped the lock
 	 */
@@ -1668,6 +1686,8 @@ gotten:
 		page_cache_release(old_page);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
+	mmu_notifier(invalidate_range_end, mm,
+				address, address + PAGE_SIZE, 0);
 	if (dirty_page) {
 		if (vma->vm_file)
 			file_update_time(vma->vm_file);
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2008-01-31 20:58:05.000000000 -0800
+++ linux-2.6/mm/mmap.c	2008-01-31 20:59:14.000000000 -0800
@@ -1744,11 +1744,13 @@ static void unmap_region(struct mm_struc
 	lru_add_drain();
 	tlb = tlb_gather_mmu(mm, 0);
 	update_hiwater_rss(mm);
+	mmu_notifier(invalidate_range_begin, mm, start, end, 0);
 	unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
 	free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
 				 next? next->vm_start: 0);
 	tlb_finish_mmu(tlb, start, end);
+	mmu_notifier(invalidate_range_end, mm, start, end, 0);
 }
 
 /*
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c	2008-01-31 20:56:03.000000000 -0800
+++ linux-2.6/mm/hugetlb.c	2008-01-31 20:59:14.000000000 -0800
@@ -14,6 +14,7 @@
 #include <linux/mempolicy.h>
 #include <linux/cpuset.h>
 #include <linux/mutex.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -743,6 +744,7 @@ void __unmap_hugepage_range(struct vm_ar
 	BUG_ON(start & ~HPAGE_MASK);
 	BUG_ON(end & ~HPAGE_MASK);
 
+	mmu_notifier(invalidate_range_begin, mm, start, end, 1);
 	spin_lock(&mm->page_table_lock);
 	for (address = start; address < end; address += HPAGE_SIZE) {
 		ptep = huge_pte_offset(mm, address);
@@ -763,6 +765,7 @@ void __unmap_hugepage_range(struct vm_ar
 	}
 	spin_unlock(&mm->page_table_lock);
 	flush_tlb_range(vma, start, end);
+	mmu_notifier(invalidate_range_end, mm, start, end, 1);
 	list_for_each_entry_safe(page, tmp, &page_list, lru) {
 		list_del(&page->lru);
 		put_page(page);
Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c	2008-01-31 20:56:03.000000000 -0800
+++ linux-2.6/mm/filemap_xip.c	2008-01-31 20:59:14.000000000 -0800
@@ -13,6 +13,7 @@
 #include <linux/module.h>
 #include <linux/uio.h>
 #include <linux/rmap.h>
+#include <linux/mmu_notifier.h>
 #include <linux/sched.h>
 #include <asm/tlbflush.h>
 
@@ -189,6 +190,8 @@ __xip_unmap (struct address_space * mapp
 		address = vma->vm_start +
 			((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 		BUG_ON(address < vma->vm_start || address >= vma->vm_end);
+		mmu_notifier(invalidate_range_begin, mm, address,
+					address + PAGE_SIZE, 1);
 		pte = page_check_address(page, mm, address, &ptl);
 		if (pte) {
 			/* Nuke the page table entry. */
@@ -200,6 +203,8 @@ __xip_unmap (struct address_space * mapp
 			pte_unmap_unlock(pte, ptl);
 			page_cache_release(page);
 		}
+		mmu_notifier(invalidate_range_end, mm,
+				address, address + PAGE_SIZE, 1);
 	}
 	spin_unlock(&mapping->i_mmap_lock);
 }
Index: linux-2.6/mm/mremap.c
===================================================================
--- linux-2.6.orig/mm/mremap.c	2008-01-31 20:56:03.000000000 -0800
+++ linux-2.6/mm/mremap.c	2008-01-31 20:59:14.000000000 -0800
@@ -18,6 +18,7 @@
 #include <linux/highmem.h>
 #include <linux/security.h>
 #include <linux/syscalls.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -124,12 +125,15 @@ unsigned long move_page_tables(struct vm
 		unsigned long old_addr, struct vm_area_struct *new_vma,
 		unsigned long new_addr, unsigned long len)
 {
-	unsigned long extent, next, old_end;
+	unsigned long extent, next, old_start, old_end;
 	pmd_t *old_pmd, *new_pmd;
 
+	old_start = old_addr;
 	old_end = old_addr + len;
 	flush_cache_range(vma, old_addr, old_end);
 
+	mmu_notifier(invalidate_range_begin, vma->vm_mm,
+					old_addr, old_end, 0);
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
 		cond_resched();
 		next = (old_addr + PMD_SIZE) & PMD_MASK;
@@ -150,6 +154,7 @@ unsigned long move_page_tables(struct vm
 		move_ptes(vma, old_pmd, old_addr, old_addr + extent,
 				new_vma, new_pmd, new_addr);
 	}
+	mmu_notifier(invalidate_range_end, vma->vm_mm, old_start, old_end, 0);
 
 	return len + old_addr - old_end;	/* how much done */
 }

-- 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [patch 3/4] mmu_notifier: invalidate_page callbacks
  2008-02-01  5:04 [patch 0/4] [RFC] EMMU Notifiers V5 Christoph Lameter
  2008-02-01  5:04 ` [patch 1/4] mmu_notifier: Core code Christoph Lameter
  2008-02-01  5:04 ` [patch 2/4] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter
@ 2008-02-01  5:04 ` Christoph Lameter
  2008-02-01  5:04 ` [patch 4/4] mmu_notifier: Support for driverws with revers maps (f.e. for XPmem) Christoph Lameter
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 23+ messages in thread
From: Christoph Lameter @ 2008-02-01  5:04 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

[-- Attachment #1: mmu_invalidate_page --]
[-- Type: text/plain, Size: 2814 bytes --]

Two callbacks to remove individual pages as done in rmap code

	invalidate_page()

Called from the inner loop of rmap walks to invalidate pages.

	age_page()

Called for the determination of the page referenced status.

If we do not care about page referenced status then an age_page callback
may be be omitted.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Robin Holt <holt@sgi.com>

---
 mm/rmap.c |   13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c	2008-01-31 19:55:45.000000000 -0800
+++ linux-2.6/mm/rmap.c	2008-01-31 20:28:35.000000000 -0800
@@ -49,6 +49,7 @@
 #include <linux/rcupdate.h>
 #include <linux/module.h>
 #include <linux/kallsyms.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/tlbflush.h>
 
@@ -284,7 +285,8 @@ static int page_referenced_one(struct pa
 	if (!pte)
 		goto out;
 
-	if (ptep_clear_flush_young(vma, address, pte))
+	if (ptep_clear_flush_young(vma, address, pte) |
+	    mmu_notifier_age_page(mm, address))
 		referenced++;
 
 	/* Pretend the page is referenced if the task has the
@@ -434,6 +436,7 @@ static int page_mkclean_one(struct page 
 
 		flush_cache_page(vma, address, pte_pfn(*pte));
 		entry = ptep_clear_flush(vma, address, pte);
+		mmu_notifier(invalidate_page, mm, address);
 		entry = pte_wrprotect(entry);
 		entry = pte_mkclean(entry);
 		set_pte_at(mm, address, pte, entry);
@@ -677,7 +680,8 @@ static int try_to_unmap_one(struct page 
 	 * skipped over this mm) then we should reactivate it.
 	 */
 	if (!migration && ((vma->vm_flags & VM_LOCKED) ||
-			(ptep_clear_flush_young(vma, address, pte)))) {
+			(ptep_clear_flush_young(vma, address, pte) |
+				mmu_notifier_age_page(mm, address)))) {
 		ret = SWAP_FAIL;
 		goto out_unmap;
 	}
@@ -685,6 +689,7 @@ static int try_to_unmap_one(struct page 
 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address, page_to_pfn(page));
 	pteval = ptep_clear_flush(vma, address, pte);
+	mmu_notifier(invalidate_page, mm, address);
 
 	/* Move the dirty bit to the physical page now the pte is gone. */
 	if (pte_dirty(pteval))
@@ -809,12 +814,14 @@ static void try_to_unmap_cluster(unsigne
 		page = vm_normal_page(vma, address, *pte);
 		BUG_ON(!page || PageAnon(page));
 
-		if (ptep_clear_flush_young(vma, address, pte))
+		if (ptep_clear_flush_young(vma, address, pte) |
+		    mmu_notifier_age_page(mm, address))
 			continue;
 
 		/* Nuke the page table entry. */
 		flush_cache_page(vma, address, pte_pfn(*pte));
 		pteval = ptep_clear_flush(vma, address, pte);
+		mmu_notifier(invalidate_page, mm, address);
 
 		/* If nonlinear, store the file page offset in the pte. */
 		if (page->index != linear_page_index(vma, address))

-- 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [patch 4/4] mmu_notifier: Support for driverws with revers maps (f.e. for XPmem)
  2008-02-01  5:04 [patch 0/4] [RFC] EMMU Notifiers V5 Christoph Lameter
                   ` (2 preceding siblings ...)
  2008-02-01  5:04 ` [patch 3/4] mmu_notifier: invalidate_page callbacks Christoph Lameter
@ 2008-02-01  5:04 ` Christoph Lameter
  2008-02-01 11:58 ` Extending mmu_notifiers to handle __xip_unmap in a sleepable context? Robin Holt
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 23+ messages in thread
From: Christoph Lameter @ 2008-02-01  5:04 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

[-- Attachment #1: mmu_rmap_support --]
[-- Type: text/plain, Size: 8863 bytes --]

Support for an additional 3rd class of users of mmu_notifier.

These special additional callbacks are required because XPmem does
use its own rmap (multiple processes on a serires of remote Linux instances
may be accessing the memory of a process). XPmem may have to send out
notifications to remote Linux instances and receive confirmation before a
page can be freed.

So we handle this like an additional Linux reverse map that is walked after
the existing rmaps have been walked. We leave the walking to the driver that
is then able to use something else than a spinlock to walk its reverse
maps. So we can actually call the driver without holding spinlocks.

However, we cannot determine the mm_struct that a page belongs to. That
will have to be determined by the device driver. Therefore we need to
have a global list of reverse map callbacks.

We add another pageflag (PageExternalRmap) that is set if a page has
been remotely mapped (f.e. by a process from another Linux instance).
We can then only perform the callbacks for pages that are actually in
remote use.

Rmap notifiers need an extra page bit and are only available
on 64 bit platforms. This functionality is not available on 32 bit!

A notifier that uses the reverse maps callbacks does not need to provide
the invalidate_page() methods that are called when locks are held.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/mmu_notifier.h |   70 +++++++++++++++++++++++++++++++++++++++++--
 include/linux/page-flags.h   |   11 ++++++
 mm/mmu_notifier.c            |   36 +++++++++++++++++++++-
 mm/rmap.c                    |    9 +++++
 4 files changed, 123 insertions(+), 3 deletions(-)

Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h	2008-01-31 20:56:03.000000000 -0800
+++ linux-2.6/include/linux/page-flags.h	2008-01-31 21:00:40.000000000 -0800
@@ -105,6 +105,7 @@
  * 64 bit  |           FIELDS             | ??????         FLAGS         |
  *         63                            32                              0
  */
+#define PG_external_rmap	30	/* Page has external rmap */
 #define PG_uncached		31	/* Page has been mapped as uncached */
 #endif
 
@@ -260,6 +261,16 @@ static inline void __ClearPageTail(struc
 #define SetPageUncached(page)	set_bit(PG_uncached, &(page)->flags)
 #define ClearPageUncached(page)	clear_bit(PG_uncached, &(page)->flags)
 
+#if defined(CONFIG_MMU_NOTIFIER) && defined(CONFIG_64BIT)
+#define PageExternalRmap(page)	test_bit(PG_external_rmap, &(page)->flags)
+#define SetPageExternalRmap(page) set_bit(PG_external_rmap, &(page)->flags)
+#define ClearPageExternalRmap(page) clear_bit(PG_external_rmap, \
+							&(page)->flags)
+#else
+#define ClearPageExternalRmap(page) do {} while (0)
+#define PageExternalRmap(page)	0
+#endif
+
 struct page;	/* forward declaration */
 
 extern void cancel_dirty_page(struct page *page, unsigned int account_size);
Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- linux-2.6.orig/include/linux/mmu_notifier.h	2008-01-31 20:58:05.000000000 -0800
+++ linux-2.6/include/linux/mmu_notifier.h	2008-01-31 21:00:40.000000000 -0800
@@ -23,6 +23,18 @@
  * 	where sleeping is allowed or in atomic contexts. A flag is passed
  * 	to indicate an atomic context.
  *
+ *
+ * 2. mmu_rmap_notifier
+ *
+ *	Callbacks for subsystems that provide their own rmaps. These
+ *	need to walk their own rmaps for a page. The invalidate_page
+ *	callback is outside of locks so that we are not in a strictly
+ *	atomic context (but we may be in a PF_MEMALLOC context if the
+ *	notifier is called from reclaim code) and are able to sleep.
+ *
+ *	Rmap notifiers need an extra page bit and are only available
+ *	on 64 bit platforms.
+ *
  *	Pages must be marked dirty if dirty bits are found to be set in
  *	the external ptes.
  */
@@ -89,8 +101,26 @@ struct mmu_notifier_ops {
 				 int atomic);
 
 	void (*invalidate_range_end)(struct mmu_notifier *mn,
-				 unsigned long stat, unsigned long end,
-				 struct mm_struct *mm, int atomic);
+				 struct mm_struct *mm,
+				 unsigned long start, unsigned long end,
+				 int atomic);
+};
+
+struct mmu_rmap_notifier_ops;
+
+struct mmu_rmap_notifier {
+	struct hlist_node hlist;
+	const struct mmu_rmap_notifier_ops *ops;
+};
+
+struct mmu_rmap_notifier_ops {
+	/*
+	 * Called with the page lock held after ptes are modified or removed
+	 * so that a subsystem with its own rmap's can remove remote ptes
+	 * mapping a page.
+	 */
+	void (*invalidate_page)(struct mmu_rmap_notifier *mrn,
+						struct page *page);
 };
 
 #ifdef CONFIG_MMU_NOTIFIER
@@ -143,6 +173,27 @@ static inline void mmu_notifier_head_ini
 		}							\
 	} while (0)
 
+extern void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn);
+extern void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn);
+
+/* Must hold PageLock */
+extern void mmu_rmap_export_page(struct page *page);
+
+extern struct hlist_head mmu_rmap_notifier_list;
+
+#define mmu_rmap_notifier(function, args...)				\
+	do {								\
+		struct mmu_rmap_notifier *__mrn;			\
+		struct hlist_node *__n;					\
+									\
+		rcu_read_lock();					\
+		hlist_for_each_entry_rcu(__mrn, __n,			\
+				&mmu_rmap_notifier_list, hlist)		\
+			if (__mrn->ops->function)			\
+				__mrn->ops->function(__mrn, args);	\
+		rcu_read_unlock();					\
+	} while (0);
+
 #else /* CONFIG_MMU_NOTIFIER */
 
 /*
@@ -161,6 +212,16 @@ static inline void mmu_notifier_head_ini
 		};							\
 	} while (0)
 
+#define mmu_rmap_notifier(function, args...)				\
+	do {								\
+		if (0) {						\
+			struct mmu_rmap_notifier *__mrn;		\
+									\
+			__mrn = (struct mmu_rmap_notifier *)(0x00ff);	\
+			__mrn->ops->function(__mrn, args);		\
+		}							\
+	} while (0);
+
 static inline void mmu_notifier_register(struct mmu_notifier *mn,
 						struct mm_struct *mm) {}
 static inline void mmu_notifier_unregister(struct mmu_notifier *mn,
@@ -174,6 +235,11 @@ static inline int mmu_notifier_age_page(
 
 static inline void mmu_notifier_head_init(struct mmu_notifier_head *mmh) {}
 
+static inline void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn)
+									{}
+static inline void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn)
+									{}
+
 #endif /* CONFIG_MMU_NOTIFIER */
 
 #endif /* _LINUX_MMU_NOTIFIER_H */
Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- linux-2.6.orig/mm/mmu_notifier.c	2008-01-31 20:58:05.000000000 -0800
+++ linux-2.6/mm/mmu_notifier.c	2008-01-31 21:00:40.000000000 -0800
@@ -66,7 +66,7 @@ void mmu_notifier_register(struct mmu_no
 {
 	hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
 }
-EXPORT_SYMBOL_GPL(__mmu_notifier_register);
+EXPORT_SYMBOL_GPL(mmu_notifier_register);
 
 void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
 {
@@ -74,3 +74,37 @@ void mmu_notifier_unregister(struct mmu_
 }
 EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
 
+#ifdef CONFIG_64BIT
+static DEFINE_SPINLOCK(mmu_notifier_list_lock);
+HLIST_HEAD(mmu_rmap_notifier_list);
+
+void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn)
+{
+	spin_lock(&mmu_notifier_list_lock);
+	hlist_add_head_rcu(&mrn->hlist, &mmu_rmap_notifier_list);
+	spin_unlock(&mmu_notifier_list_lock);
+}
+EXPORT_SYMBOL(mmu_rmap_notifier_register);
+
+void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn)
+{
+	spin_lock(&mmu_notifier_list_lock);
+	hlist_del_rcu(&mrn->hlist);
+	spin_unlock(&mmu_notifier_list_lock);
+}
+EXPORT_SYMBOL(mmu_rmap_notifier_unregister);
+
+/*
+ * Export a page.
+ *
+ * Pagelock must be held.
+ * Must be called before a page is put on an external rmap.
+ */
+void mmu_rmap_export_page(struct page *page)
+{
+	BUG_ON(!PageLocked(page));
+	SetPageExternalRmap(page);
+}
+EXPORT_SYMBOL(mmu_rmap_export_page);
+
+#endif
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c	2008-01-31 21:00:36.000000000 -0800
+++ linux-2.6/mm/rmap.c	2008-01-31 21:00:40.000000000 -0800
@@ -476,6 +476,10 @@ int page_mkclean(struct page *page)
 		struct address_space *mapping = page_mapping(page);
 		if (mapping) {
 			ret = page_mkclean_file(mapping, page);
+			if (unlikely(PageExternalRmap(page))) {
+				mmu_rmap_notifier(invalidate_page, page);
+				ClearPageExternalRmap(page);
+			}
 			if (page_test_dirty(page)) {
 				page_clear_dirty(page);
 				ret = 1;
@@ -978,6 +982,11 @@ int try_to_unmap(struct page *page, int 
 	else
 		ret = try_to_unmap_file(page, migration);
 
+	if (unlikely(PageExternalRmap(page))) {
+		mmu_rmap_notifier(invalidate_page, page);
+		ClearPageExternalRmap(page);
+	}
+
 	if (!page_mapped(page))
 		ret = SWAP_SUCCESS;
 	return ret;

-- 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 2/4] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-01  5:04 ` [patch 2/4] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter
@ 2008-02-01 10:49   ` Robin Holt
  2008-02-01 19:14     ` Christoph Lameter
  2008-02-01 22:09   ` Robin Holt
  1 sibling, 1 reply; 23+ messages in thread
From: Robin Holt @ 2008-02-01 10:49 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

do_wp_page can reach the _end callout without passing the _begin
callout.  This prevents making the _end unles the _begin has also
been made.

Index: mmu_notifiers-cl-v5/mm/memory.c
===================================================================
--- mmu_notifiers-cl-v5.orig/mm/memory.c	2008-02-01 04:44:03.000000000 -0600
+++ mmu_notifiers-cl-v5/mm/memory.c	2008-02-01 04:46:18.000000000 -0600
@@ -1564,7 +1564,7 @@ static int do_wp_page(struct mm_struct *
 {
 	struct page *old_page, *new_page;
 	pte_t entry;
-	int reuse = 0, ret = 0;
+	int reuse = 0, ret = 0, invalidate_started = 0;
 	int page_mkwrite = 0;
 	struct page *dirty_page = NULL;
 
@@ -1649,6 +1649,8 @@ gotten:
 
 	mmu_notifier(invalidate_range_begin, mm, address,
 				address + PAGE_SIZE, 0);
+	invalidate_started = 1;
+
 	/*
 	 * Re-check the pte - we dropped the lock
 	 */
@@ -1687,7 +1689,8 @@ gotten:
 		page_cache_release(old_page);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
-	mmu_notifier(invalidate_range_end, mm,
+	if (invalidate_started)
+		mmu_notifier(invalidate_range_end, mm,
 				address, address + PAGE_SIZE, 0);
 	if (dirty_page) {
 		if (vma->vm_file)

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 1/4] mmu_notifier: Core code
  2008-02-01  5:04 ` [patch 1/4] mmu_notifier: Core code Christoph Lameter
@ 2008-02-01 10:55   ` Robin Holt
  2008-02-01 11:04     ` Robin Holt
  2008-02-01 19:14     ` Christoph Lameter
  0 siblings, 2 replies; 23+ messages in thread
From: Robin Holt @ 2008-02-01 10:55 UTC (permalink / raw)
  To: Christoph Lameter, Jack Steiner
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

OK.  Now that release has been moved, I think I agree with you that the
down_write(mmap_sem) can be used as our lock again and still work for
Jack.  I would like a ruling from Jack as well.

Thanks,
Robin

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 1/4] mmu_notifier: Core code
  2008-02-01 10:55   ` Robin Holt
@ 2008-02-01 11:04     ` Robin Holt
  2008-02-01 19:14     ` Christoph Lameter
  1 sibling, 0 replies; 23+ messages in thread
From: Robin Holt @ 2008-02-01 11:04 UTC (permalink / raw)
  To: Christoph Lameter, Jack Steiner
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, linux-kernel, linux-mm, daniel.blueman

On Fri, Feb 01, 2008 at 04:55:16AM -0600, Robin Holt wrote:
> OK.  Now that release has been moved, I think I agree with you that the
> down_write(mmap_sem) can be used as our lock again and still work for
> Jack.  I would like a ruling from Jack as well.

Ignore this, I was in the wrong work area.  I am sorry for adding to the
confusion.  This version has no locking requirement outside the driver
itself.

Sorry,
Robin

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Extending mmu_notifiers to handle __xip_unmap in a sleepable context?
  2008-02-01  5:04 [patch 0/4] [RFC] EMMU Notifiers V5 Christoph Lameter
                   ` (3 preceding siblings ...)
  2008-02-01  5:04 ` [patch 4/4] mmu_notifier: Support for driverws with revers maps (f.e. for XPmem) Christoph Lameter
@ 2008-02-01 11:58 ` Robin Holt
  2008-02-01 12:10   ` Robin Holt
  2008-02-01 19:17   ` Christoph Lameter
  2008-02-03  1:39 ` [patch 0/4] [RFC] EMMU Notifiers V5 Andrea Arcangeli
  2008-02-03 13:41 ` Robin Holt
  6 siblings, 2 replies; 23+ messages in thread
From: Robin Holt @ 2008-02-01 11:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman


With this set of patches, I think we have enough to get xpmem working
with most types of mappings.  In the past, we operated without any of
these callouts by significantly restricting why types of mappings could
remotely fault and what types of operations the user could do.  With
this set, I am certain we can continue to meet the above assumptions.

That said, I would like to discuss __xip_unmap in more detail.

Currently, it is calling mmu_notifier _begin and _end under the
i_mmap_lock.  I _THINK_ the following will make it so we could support
__xip_unmap (although I don't recall ever seeing that done on ia64 and
don't even know what the circumstances are for its use).

Thanks,
Robin

Index: mmu_notifiers-cl-v5/mm/filemap_xip.c
===================================================================
--- mmu_notifiers-cl-v5.orig/mm/filemap_xip.c	2008-02-01 05:38:32.000000000 -0600
+++ mmu_notifiers-cl-v5/mm/filemap_xip.c	2008-02-01 05:39:08.000000000 -0600
@@ -184,6 +184,7 @@ __xip_unmap (struct address_space * mapp
 	if (!page)
 		return;
 
+	mmu_rmap_notifier(invalidate_page, page);
 	spin_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		mm = vma->vm_mm;

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Extending mmu_notifiers to handle __xip_unmap in a sleepable context?
  2008-02-01 11:58 ` Extending mmu_notifiers to handle __xip_unmap in a sleepable context? Robin Holt
@ 2008-02-01 12:10   ` Robin Holt
  2008-02-01 19:17   ` Christoph Lameter
  1 sibling, 0 replies; 23+ messages in thread
From: Robin Holt @ 2008-02-01 12:10 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

Argh,

Here is the correct one.  Sorry

On Fri, Feb 01, 2008 at 05:58:41AM -0600, Robin Holt wrote:
> With this set of patches, I think we have enough to get xpmem working
> with most types of mappings.  In the past, we operated without any of
> these callouts by significantly restricting why types of mappings could
> remotely fault and what types of operations the user could do.  With
> this set, I am certain we can continue to meet the above assumptions.
> 
> That said, I would like to discuss __xip_unmap in more detail.
> 
> Currently, it is calling mmu_notifier _begin and _end under the
> i_mmap_lock.  I _THINK_ the following will make it so we could support
> __xip_unmap (although I don't recall ever seeing that done on ia64 and
> don't even know what the circumstances are for its use).

Index: mmu_notifiers-cl-v5/mm/filemap_xip.c
===================================================================
--- mmu_notifiers-cl-v5.orig/mm/filemap_xip.c	2008-02-01 05:38:32.000000000 -0600
+++ mmu_notifiers-cl-v5/mm/filemap_xip.c	2008-02-01 06:09:09.000000000 -0600
@@ -184,6 +184,10 @@ __xip_unmap (struct address_space * mapp
 	if (!page)
 		return;
 
+	if (unlikely(PageExternalRmap(page))) {
+		mmu_rmap_notifier(invalidate_page, page);
+		ClearPageExternalRmap(page);
+	}
 	spin_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		mm = vma->vm_mm;

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 2/4] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-01 10:49   ` Robin Holt
@ 2008-02-01 19:14     ` Christoph Lameter
  0 siblings, 0 replies; 23+ messages in thread
From: Christoph Lameter @ 2008-02-01 19:14 UTC (permalink / raw)
  To: Robin Holt
  Cc: Andrea Arcangeli, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

Argh. Did not see this soon enougn. Maybe this one is better since it 
avoids the additional unlocks?

On Fri, 1 Feb 2008, Robin Holt wrote:

> do_wp_page can reach the _end callout without passing the _begin
> callout.  This prevents making the _end unles the _begin has also
> been made.
> 
> Index: mmu_notifiers-cl-v5/mm/memory.c
> ===================================================================
> --- mmu_notifiers-cl-v5.orig/mm/memory.c	2008-02-01 04:44:03.000000000 -0600
> +++ mmu_notifiers-cl-v5/mm/memory.c	2008-02-01 04:46:18.000000000 -0600
> @@ -1564,7 +1564,7 @@ static int do_wp_page(struct mm_struct *
>  {
>  	struct page *old_page, *new_page;
>  	pte_t entry;
> -	int reuse = 0, ret = 0;
> +	int reuse = 0, ret = 0, invalidate_started = 0;
>  	int page_mkwrite = 0;
>  	struct page *dirty_page = NULL;
>  
> @@ -1649,6 +1649,8 @@ gotten:
>  
>  	mmu_notifier(invalidate_range_begin, mm, address,
>  				address + PAGE_SIZE, 0);
> +	invalidate_started = 1;
> +
>  	/*
>  	 * Re-check the pte - we dropped the lock
>  	 */
> @@ -1687,7 +1689,8 @@ gotten:
>  		page_cache_release(old_page);
>  unlock:
>  	pte_unmap_unlock(page_table, ptl);
> -	mmu_notifier(invalidate_range_end, mm,
> +	if (invalidate_started)
> +		mmu_notifier(invalidate_range_end, mm,
>  				address, address + PAGE_SIZE, 0);
>  	if (dirty_page) {
>  		if (vma->vm_file)
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 1/4] mmu_notifier: Core code
  2008-02-01 10:55   ` Robin Holt
  2008-02-01 11:04     ` Robin Holt
@ 2008-02-01 19:14     ` Christoph Lameter
  1 sibling, 0 replies; 23+ messages in thread
From: Christoph Lameter @ 2008-02-01 19:14 UTC (permalink / raw)
  To: Robin Holt
  Cc: Jack Steiner, Andrea Arcangeli, Avi Kivity, Izik Eidus,
	kvm-devel, Peter Zijlstra, linux-kernel, linux-mm,
	daniel.blueman

On Fri, 1 Feb 2008, Robin Holt wrote:

> OK.  Now that release has been moved, I think I agree with you that the
> down_write(mmap_sem) can be used as our lock again and still work for
> Jack.  I would like a ruling from Jack as well.

Talked to Jack last night and he said its okay.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Extending mmu_notifiers to handle __xip_unmap in a sleepable context?
  2008-02-01 11:58 ` Extending mmu_notifiers to handle __xip_unmap in a sleepable context? Robin Holt
  2008-02-01 12:10   ` Robin Holt
@ 2008-02-01 19:17   ` Christoph Lameter
  1 sibling, 0 replies; 23+ messages in thread
From: Christoph Lameter @ 2008-02-01 19:17 UTC (permalink / raw)
  To: Robin Holt
  Cc: Andrea Arcangeli, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

On Fri, 1 Feb 2008, Robin Holt wrote:

> Currently, it is calling mmu_notifier _begin and _end under the
> i_mmap_lock.  I _THINK_ the following will make it so we could support
> __xip_unmap (although I don't recall ever seeing that done on ia64 and
> don't even know what the circumstances are for its use).

Its called under lock yes.

The problem with this fix is that we currently have the requirement that 
the rmap invalidate_all call requires the pagelock to be held. That is not 
the case here. So I used _begin/_end to skirt the issue.

If you do not need the Pagelock to be held (it holds off modifications on 
the page!) then we are fine.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 2/4] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-01  5:04 ` [patch 2/4] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter
  2008-02-01 10:49   ` Robin Holt
@ 2008-02-01 22:09   ` Robin Holt
  2008-02-01 23:19     ` Christoph Lameter
  1 sibling, 1 reply; 23+ messages in thread
From: Robin Holt @ 2008-02-01 22:09 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

Christoph,

The following code in do_wp_page is a problem.

We are getting this callout when we transition the pte from a read-only
to read-write.  Jack and I can not see a reason we would need that
callout.  It is causing problems for xpmem in that a write fault goes
to get_user_pages which gets back to do_wp_page that does the callout.

XPMEM only allows either faulting or invalidating to occur for an mm.
As you can see, the case above needs it to be in both states.

Thanks,
Robin


> @@ -1630,6 +1646,8 @@ gotten:
>  		goto oom;
>  	cow_user_page(new_page, old_page, address, vma);
>  
> +	mmu_notifier(invalidate_range_begin, mm, address,
> +				address + PAGE_SIZE, 0);
>  	/*
>  	 * Re-check the pte - we dropped the lock
>  	 */
> @@ -1668,6 +1686,8 @@ gotten:
>  		page_cache_release(old_page);
>  unlock:
>  	pte_unmap_unlock(page_table, ptl);
> +	mmu_notifier(invalidate_range_end, mm,
> +				address, address + PAGE_SIZE, 0);
>  	if (dirty_page) {
>  		if (vma->vm_file)
>  			file_update_time(vma->vm_file);
> Index: linux-2.6/mm/mmap.c
> ===================================================================
> --- linux-2.6.orig/mm/mmap.c	2008-01-31 20:58:05.000000000 -0800
> +++ linux-2.6/mm/mmap.c	2008-01-31 20:59:14.000000000 -0800
> @@ -1744,11 +1744,13 @@ static void unmap_region(struct mm_struc
>  	lru_add_drain();
>  	tlb = tlb_gather_mmu(mm, 0);
>  	update_hiwater_rss(mm);
> +	mmu_notifier(invalidate_range_begin, mm, start, end, 0);
>  	unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
>  	vm_unacct_memory(nr_accounted);
>  	free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
>  				 next? next->vm_start: 0);
>  	tlb_finish_mmu(tlb, start, end);
> +	mmu_notifier(invalidate_range_end, mm, start, end, 0);
>  }
>  
>  /*
> Index: linux-2.6/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.orig/mm/hugetlb.c	2008-01-31 20:56:03.000000000 -0800
> +++ linux-2.6/mm/hugetlb.c	2008-01-31 20:59:14.000000000 -0800
> @@ -14,6 +14,7 @@
>  #include <linux/mempolicy.h>
>  #include <linux/cpuset.h>
>  #include <linux/mutex.h>
> +#include <linux/mmu_notifier.h>
>  
>  #include <asm/page.h>
>  #include <asm/pgtable.h>
> @@ -743,6 +744,7 @@ void __unmap_hugepage_range(struct vm_ar
>  	BUG_ON(start & ~HPAGE_MASK);
>  	BUG_ON(end & ~HPAGE_MASK);
>  
> +	mmu_notifier(invalidate_range_begin, mm, start, end, 1);
>  	spin_lock(&mm->page_table_lock);
>  	for (address = start; address < end; address += HPAGE_SIZE) {
>  		ptep = huge_pte_offset(mm, address);
> @@ -763,6 +765,7 @@ void __unmap_hugepage_range(struct vm_ar
>  	}
>  	spin_unlock(&mm->page_table_lock);
>  	flush_tlb_range(vma, start, end);
> +	mmu_notifier(invalidate_range_end, mm, start, end, 1);
>  	list_for_each_entry_safe(page, tmp, &page_list, lru) {
>  		list_del(&page->lru);
>  		put_page(page);
> Index: linux-2.6/mm/filemap_xip.c
> ===================================================================
> --- linux-2.6.orig/mm/filemap_xip.c	2008-01-31 20:56:03.000000000 -0800
> +++ linux-2.6/mm/filemap_xip.c	2008-01-31 20:59:14.000000000 -0800
> @@ -13,6 +13,7 @@
>  #include <linux/module.h>
>  #include <linux/uio.h>
>  #include <linux/rmap.h>
> +#include <linux/mmu_notifier.h>
>  #include <linux/sched.h>
>  #include <asm/tlbflush.h>
>  
> @@ -189,6 +190,8 @@ __xip_unmap (struct address_space * mapp
>  		address = vma->vm_start +
>  			((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>  		BUG_ON(address < vma->vm_start || address >= vma->vm_end);
> +		mmu_notifier(invalidate_range_begin, mm, address,
> +					address + PAGE_SIZE, 1);
>  		pte = page_check_address(page, mm, address, &ptl);
>  		if (pte) {
>  			/* Nuke the page table entry. */
> @@ -200,6 +203,8 @@ __xip_unmap (struct address_space * mapp
>  			pte_unmap_unlock(pte, ptl);
>  			page_cache_release(page);
>  		}
> +		mmu_notifier(invalidate_range_end, mm,
> +				address, address + PAGE_SIZE, 1);
>  	}
>  	spin_unlock(&mapping->i_mmap_lock);
>  }
> Index: linux-2.6/mm/mremap.c
> ===================================================================
> --- linux-2.6.orig/mm/mremap.c	2008-01-31 20:56:03.000000000 -0800
> +++ linux-2.6/mm/mremap.c	2008-01-31 20:59:14.000000000 -0800
> @@ -18,6 +18,7 @@
>  #include <linux/highmem.h>
>  #include <linux/security.h>
>  #include <linux/syscalls.h>
> +#include <linux/mmu_notifier.h>
>  
>  #include <asm/uaccess.h>
>  #include <asm/cacheflush.h>
> @@ -124,12 +125,15 @@ unsigned long move_page_tables(struct vm
>  		unsigned long old_addr, struct vm_area_struct *new_vma,
>  		unsigned long new_addr, unsigned long len)
>  {
> -	unsigned long extent, next, old_end;
> +	unsigned long extent, next, old_start, old_end;
>  	pmd_t *old_pmd, *new_pmd;
>  
> +	old_start = old_addr;
>  	old_end = old_addr + len;
>  	flush_cache_range(vma, old_addr, old_end);
>  
> +	mmu_notifier(invalidate_range_begin, vma->vm_mm,
> +					old_addr, old_end, 0);
>  	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
>  		cond_resched();
>  		next = (old_addr + PMD_SIZE) & PMD_MASK;
> @@ -150,6 +154,7 @@ unsigned long move_page_tables(struct vm
>  		move_ptes(vma, old_pmd, old_addr, old_addr + extent,
>  				new_vma, new_pmd, new_addr);
>  	}
> +	mmu_notifier(invalidate_range_end, vma->vm_mm, old_start, old_end, 0);
>  
>  	return len + old_addr - old_end;	/* how much done */
>  }
> 
> -- 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 2/4] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-01 22:09   ` Robin Holt
@ 2008-02-01 23:19     ` Christoph Lameter
  2008-02-01 23:35       ` Robin Holt
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Lameter @ 2008-02-01 23:19 UTC (permalink / raw)
  To: Robin Holt
  Cc: Andrea Arcangeli, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

On Fri, 1 Feb 2008, Robin Holt wrote:

> We are getting this callout when we transition the pte from a read-only
> to read-write.  Jack and I can not see a reason we would need that
> callout.  It is causing problems for xpmem in that a write fault goes
> to get_user_pages which gets back to do_wp_page that does the callout.

Right. You placed it there in the first place. So we can drop the code 
from do_wp_page?


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 2/4] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-01 23:19     ` Christoph Lameter
@ 2008-02-01 23:35       ` Robin Holt
  2008-02-02  0:05         ` Christoph Lameter
  2008-02-03  2:23         ` Andrea Arcangeli
  0 siblings, 2 replies; 23+ messages in thread
From: Robin Holt @ 2008-02-01 23:35 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Andrea Arcangeli, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

On Fri, Feb 01, 2008 at 03:19:32PM -0800, Christoph Lameter wrote:
> On Fri, 1 Feb 2008, Robin Holt wrote:
> 
> > We are getting this callout when we transition the pte from a read-only
> > to read-write.  Jack and I can not see a reason we would need that
> > callout.  It is causing problems for xpmem in that a write fault goes
> > to get_user_pages which gets back to do_wp_page that does the callout.
> 
> Right. You placed it there in the first place. So we can drop the code 
> from do_wp_page?

No, we need a callout when we are becoming more restrictive, but not
when becoming more permissive.  I would have to guess that is the case
for any of these callouts.  It is for both GRU and XPMEM.  I would
expect the same is true for KVM, but would like a ruling from Andrea on
that.

Thanks,
Robin

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 2/4] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-01 23:35       ` Robin Holt
@ 2008-02-02  0:05         ` Christoph Lameter
  2008-02-02  0:21           ` Robin Holt
  2008-02-03  2:23         ` Andrea Arcangeli
  1 sibling, 1 reply; 23+ messages in thread
From: Christoph Lameter @ 2008-02-02  0:05 UTC (permalink / raw)
  To: Robin Holt
  Cc: Andrea Arcangeli, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

On Fri, 1 Feb 2008, Robin Holt wrote:

> On Fri, Feb 01, 2008 at 03:19:32PM -0800, Christoph Lameter wrote:
> > On Fri, 1 Feb 2008, Robin Holt wrote:
> > 
> > > We are getting this callout when we transition the pte from a read-only
> > > to read-write.  Jack and I can not see a reason we would need that
> > > callout.  It is causing problems for xpmem in that a write fault goes
> > > to get_user_pages which gets back to do_wp_page that does the callout.
> > 
> > Right. You placed it there in the first place. So we can drop the code 
> > from do_wp_page?
> 
> No, we need a callout when we are becoming more restrictive, but not
> when becoming more permissive.  I would have to guess that is the case
> for any of these callouts.  It is for both GRU and XPMEM.  I would
> expect the same is true for KVM, but would like a ruling from Andrea on
> that.

do_wp_page is entered when the pte shows that the page is not writeable 
and it makes the page writable in some situations. Then we do not 
invalidate the remote reference.

However, when we do COW then a *new* page is put in place of the existing 
readonly page. At that point we need to remove the remote pte that is 
readonly. Then we install a new pte pointing to a *different* page that is 
writable.

Are you saying that you get the callback when transitioning from a read 
only to a read write pte on the *same* page?


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 2/4] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-02  0:05         ` Christoph Lameter
@ 2008-02-02  0:21           ` Robin Holt
  2008-02-02  0:38             ` Robin Holt
  0 siblings, 1 reply; 23+ messages in thread
From: Robin Holt @ 2008-02-02  0:21 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Andrea Arcangeli, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

On Fri, Feb 01, 2008 at 04:05:08PM -0800, Christoph Lameter wrote:
> On Fri, 1 Feb 2008, Robin Holt wrote:
> 
> > On Fri, Feb 01, 2008 at 03:19:32PM -0800, Christoph Lameter wrote:
> > > On Fri, 1 Feb 2008, Robin Holt wrote:
> > > 
> > > > We are getting this callout when we transition the pte from a read-only
> > > > to read-write.  Jack and I can not see a reason we would need that
> > > > callout.  It is causing problems for xpmem in that a write fault goes
> > > > to get_user_pages which gets back to do_wp_page that does the callout.
> > > 
> > > Right. You placed it there in the first place. So we can drop the code 
> > > from do_wp_page?
> > 
> > No, we need a callout when we are becoming more restrictive, but not
> > when becoming more permissive.  I would have to guess that is the case
> > for any of these callouts.  It is for both GRU and XPMEM.  I would
> > expect the same is true for KVM, but would like a ruling from Andrea on
> > that.
> 
> do_wp_page is entered when the pte shows that the page is not writeable 
> and it makes the page writable in some situations. Then we do not 
> invalidate the remote reference.
> 
> However, when we do COW then a *new* page is put in place of the existing 
> readonly page. At that point we need to remove the remote pte that is 
> readonly. Then we install a new pte pointing to a *different* page that is 
> writable.
> 
> Are you saying that you get the callback when transitioning from a read 
> only to a read write pte on the *same* page?

I believe that is what we saw.  We have not put in any more debug
information yet.  I will try to squeze it in this weekend.  Otherwise,
I will probably have to wait until early Monday.

Thanks
Robin

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 2/4] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-02  0:21           ` Robin Holt
@ 2008-02-02  0:38             ` Robin Holt
  0 siblings, 0 replies; 23+ messages in thread
From: Robin Holt @ 2008-02-02  0:38 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Andrea Arcangeli, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

On Fri, Feb 01, 2008 at 06:21:45PM -0600, Robin Holt wrote:
> On Fri, Feb 01, 2008 at 04:05:08PM -0800, Christoph Lameter wrote:
> > Are you saying that you get the callback when transitioning from a read 
> > only to a read write pte on the *same* page?
> 
> I believe that is what we saw.  We have not put in any more debug
> information yet.  I will try to squeze it in this weekend.  Otherwise,
> I will probably have to wait until early Monday.

I hate it when I am confused.  I misunderstood what Dean had been saying.
After I looked at his test case and remembering his screen at the time
we were discussing, I am nearly positive that both the parent and child
were still running (no exec, no exit).  We would therefore have two refs
on the page and, yes, be changing the pte which would warrant the callout.
Now I really need to think this through more.  Sounds like a good thing
for Monday.

Thanks,
Robin

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 0/4] [RFC] EMMU Notifiers V5
  2008-02-01  5:04 [patch 0/4] [RFC] EMMU Notifiers V5 Christoph Lameter
                   ` (4 preceding siblings ...)
  2008-02-01 11:58 ` Extending mmu_notifiers to handle __xip_unmap in a sleepable context? Robin Holt
@ 2008-02-03  1:39 ` Andrea Arcangeli
  2008-02-03 13:41 ` Robin Holt
  6 siblings, 0 replies; 23+ messages in thread
From: Andrea Arcangeli @ 2008-02-03  1:39 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Thu, Jan 31, 2008 at 09:04:39PM -0800, Christoph Lameter wrote:
> - Has page tables to track pages whose refcount was elevated(?) but
>   no reverse maps.

Just a correction, rmaps exists or swap couldn't be sane, it's just
that it's not built on the page_t because the guest memory is really
virtual and not physical at all (hence it swaps really well, thanks to
the regular linux VM algorithms without requiring any KVM knowledge at
all, it all looks (shared) anonymous memory as far as linux is
concerned ;).

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 2/4] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-01 23:35       ` Robin Holt
  2008-02-02  0:05         ` Christoph Lameter
@ 2008-02-03  2:23         ` Andrea Arcangeli
  1 sibling, 0 replies; 23+ messages in thread
From: Andrea Arcangeli @ 2008-02-03  2:23 UTC (permalink / raw)
  To: Robin Holt
  Cc: Christoph Lameter, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

On Fri, Feb 01, 2008 at 05:35:28PM -0600, Robin Holt wrote:
> No, we need a callout when we are becoming more restrictive, but not
> when becoming more permissive.  I would have to guess that is the case
> for any of these callouts.  It is for both GRU and XPMEM.  I would
> expect the same is true for KVM, but would like a ruling from Andrea on
> that.

I still hope I don't need to take any lock in _range_start and that
losing coherency (w/o risking global memory corruption but only
risking temporary userland data corruption thanks to the page pin) is
ok for KVM.

If I would have to take a lock in _range_start like XPMEM is forced to
do (GRU is by far not forced to it, if it would switch to my #v5) then
it would be a problem.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch 0/4] [RFC] EMMU Notifiers V5
  2008-02-01  5:04 [patch 0/4] [RFC] EMMU Notifiers V5 Christoph Lameter
                   ` (5 preceding siblings ...)
  2008-02-03  1:39 ` [patch 0/4] [RFC] EMMU Notifiers V5 Andrea Arcangeli
@ 2008-02-03 13:41 ` Robin Holt
  6 siblings, 0 replies; 23+ messages in thread
From: Robin Holt @ 2008-02-03 13:41 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

Great news!  I have taken over Dean's xpmem patch set while he is on
sabbatical.  Before he left, he had his patch mostly working on top of
this patch set.  We had one deadlock.  I have coded for that specific
deadlock and xpmem now passes a simple grant/attach/fault/fork/unmap/map
test.

After analyzing it, I believe we still have a nearly related deadlock
which will require some refactoring of code.  I am certain that the
same mechanism I used for this deadlock break will work in that case,
but it will require too many changes for me to finish this weekend.

For our customer base, this case, in the past, has resulted in termination
of the application and our MPI library specifically states that this
mode of operation is not permitted, so I think we will be able to pass
their regression tests.  I will need to coordinate that early next week.

The good news, at this point, Christoph's version 5 of the mmu_notifiers
appears to work for xpmem.  The mmu_notifier call-outs where the
in_atomic flag is set still result in a BUG_ON.  That is not an issue
for our normal customer as our MPI already states this is not a valid
mode of operation and provides means to avoid those types of mappings.

Thanks,
Robin

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [patch 2/4] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-25  5:56 [patch 0/4] [RFC] MMU Notifiers V1 Christoph Lameter
@ 2008-01-25  5:56 ` Christoph Lameter
  0 siblings, 0 replies; 23+ messages in thread
From: Christoph Lameter @ 2008-01-25  5:56 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

[-- Attachment #1: mmu_invalidate_range_callbacks --]
[-- Type: text/plain, Size: 4588 bytes --]

The invalidation of address ranges in a mm_struct needs to be
performed when pages are removed or permissions etc change.

invalidate_range() is generally called with mmap_sem held but
no spinlocks are active.

Exceptions:

We hold i_mmap_lock in __unmap_hugepage_range and
sometimes in zap_page_range. Should we pass a parameter to indicate
the different lock situation?

Comments state that mmap_sem must be held for
remap_pfn_range() but various drivers do not seem to do this?

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Robin Holt <holt@sgi.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/fremap.c  |    2 ++
 mm/hugetlb.c |    2 ++
 mm/memory.c  |    9 +++++++--
 mm/mmap.c    |    1 +
 4 files changed, 12 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c	2008-01-24 20:59:17.000000000 -0800
+++ linux-2.6/mm/fremap.c	2008-01-24 21:01:17.000000000 -0800
@@ -15,6 +15,7 @@
 #include <linux/rmap.h>
 #include <linux/module.h>
 #include <linux/syscalls.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/mmu_context.h>
 #include <asm/cacheflush.h>
@@ -211,6 +212,7 @@ asmlinkage long sys_remap_file_pages(uns
 		spin_unlock(&mapping->i_mmap_lock);
 	}
 
+	mmu_notifier(invalidate_range, mm, start, start + size);
 	err = populate_range(mm, vma, start, size, pgoff);
 	if (!err && !(flags & MAP_NONBLOCK)) {
 		if (unlikely(has_write_lock)) {
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c	2008-01-24 20:59:17.000000000 -0800
+++ linux-2.6/mm/hugetlb.c	2008-01-24 21:01:17.000000000 -0800
@@ -14,6 +14,7 @@
 #include <linux/mempolicy.h>
 #include <linux/cpuset.h>
 #include <linux/mutex.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -763,6 +764,7 @@ void __unmap_hugepage_range(struct vm_ar
 	}
 	spin_unlock(&mm->page_table_lock);
 	flush_tlb_range(vma, start, end);
+	mmu_notifier(invalidate_range, mm, start, end);
 	list_for_each_entry_safe(page, tmp, &page_list, lru) {
 		list_del(&page->lru);
 		put_page(page);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2008-01-24 20:59:17.000000000 -0800
+++ linux-2.6/mm/memory.c	2008-01-24 21:01:17.000000000 -0800
@@ -50,6 +50,7 @@
 #include <linux/delayacct.h>
 #include <linux/init.h>
 #include <linux/writeback.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -891,6 +892,7 @@ unsigned long zap_page_range(struct vm_a
 	end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
 	if (tlb)
 		tlb_finish_mmu(tlb, address, end);
+	mmu_notifier(invalidate_range, mm, address, end);
 	return end;
 }
 
@@ -1319,7 +1321,7 @@ int remap_pfn_range(struct vm_area_struc
 {
 	pgd_t *pgd;
 	unsigned long next;
-	unsigned long end = addr + PAGE_ALIGN(size);
+	unsigned long start = addr, end = addr + PAGE_ALIGN(size);
 	struct mm_struct *mm = vma->vm_mm;
 	int err;
 
@@ -1360,6 +1362,7 @@ int remap_pfn_range(struct vm_area_struc
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
+	mmu_notifier(invalidate_range, mm, start, end);
 	return err;
 }
 EXPORT_SYMBOL(remap_pfn_range);
@@ -1443,7 +1446,7 @@ int apply_to_page_range(struct mm_struct
 {
 	pgd_t *pgd;
 	unsigned long next;
-	unsigned long end = addr + size;
+	unsigned long start = addr, end = addr + size;
 	int err;
 
 	BUG_ON(addr >= end);
@@ -1454,6 +1457,7 @@ int apply_to_page_range(struct mm_struct
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
+	mmu_notifier(invalidate_range, mm, start, end);
 	return err;
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
@@ -1634,6 +1638,7 @@ gotten:
 	/*
 	 * Re-check the pte - we dropped the lock
 	 */
+	mmu_notifier(invalidate_range, mm, address, address + PAGE_SIZE - 1);
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 	if (likely(pte_same(*page_table, orig_pte))) {
 		if (old_page) {
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2008-01-24 20:59:19.000000000 -0800
+++ linux-2.6/mm/mmap.c	2008-01-24 21:01:17.000000000 -0800
@@ -1748,6 +1748,7 @@ static void unmap_region(struct mm_struc
 	free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
 				 next? next->vm_start: 0);
 	tlb_finish_mmu(tlb, start, end);
+	mmu_notifier(invalidate_range, mm, start, end);
 }
 
 /*

-- 

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2008-02-03 13:42 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-02-01  5:04 [patch 0/4] [RFC] EMMU Notifiers V5 Christoph Lameter
2008-02-01  5:04 ` [patch 1/4] mmu_notifier: Core code Christoph Lameter
2008-02-01 10:55   ` Robin Holt
2008-02-01 11:04     ` Robin Holt
2008-02-01 19:14     ` Christoph Lameter
2008-02-01  5:04 ` [patch 2/4] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter
2008-02-01 10:49   ` Robin Holt
2008-02-01 19:14     ` Christoph Lameter
2008-02-01 22:09   ` Robin Holt
2008-02-01 23:19     ` Christoph Lameter
2008-02-01 23:35       ` Robin Holt
2008-02-02  0:05         ` Christoph Lameter
2008-02-02  0:21           ` Robin Holt
2008-02-02  0:38             ` Robin Holt
2008-02-03  2:23         ` Andrea Arcangeli
2008-02-01  5:04 ` [patch 3/4] mmu_notifier: invalidate_page callbacks Christoph Lameter
2008-02-01  5:04 ` [patch 4/4] mmu_notifier: Support for driverws with revers maps (f.e. for XPmem) Christoph Lameter
2008-02-01 11:58 ` Extending mmu_notifiers to handle __xip_unmap in a sleepable context? Robin Holt
2008-02-01 12:10   ` Robin Holt
2008-02-01 19:17   ` Christoph Lameter
2008-02-03  1:39 ` [patch 0/4] [RFC] EMMU Notifiers V5 Andrea Arcangeli
2008-02-03 13:41 ` Robin Holt
  -- strict thread matches above, loose matches on Subject: below --
2008-01-25  5:56 [patch 0/4] [RFC] MMU Notifiers V1 Christoph Lameter
2008-01-25  5:56 ` [patch 2/4] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).