linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 00 of 11] mmu notifier #v15
@ 2008-05-02 15:05 Andrea Arcangeli
  2008-05-02 15:05 ` [PATCH 01 of 11] mmu-notifier-core Andrea Arcangeli
                   ` (11 more replies)
  0 siblings, 12 replies; 24+ messages in thread
From: Andrea Arcangeli @ 2008-05-02 15:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Jack Steiner, Robin Holt, Nick Piggin,
	Peter Zijlstra, kvm-devel, Kanoj Sarcar, Roland Dreier,
	Steve Wise, linux-kernel, Avi Kivity, linux-mm, general,
	Hugh Dickins, akpm, Rusty Russell, Anthony Liguori, Chris Wright,
	Marcelo Tosatti, Eric Dumazet, Paul E. McKenney

Hello everyone,

1/11 is the latest version of the mmu-notifier-core patch.

As usual all later 2-11/11 patches follows but those aren't meant for 2.6.26.

Thanks!
Andrea

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 01 of 11] mmu-notifier-core
  2008-05-02 15:05 [PATCH 00 of 11] mmu notifier #v15 Andrea Arcangeli
@ 2008-05-02 15:05 ` Andrea Arcangeli
  2008-05-04 19:13   ` Robin Holt
  2008-05-05 16:21   ` Jack Steiner
  2008-05-02 15:05 ` [PATCH 02 of 11] get_task_mm Andrea Arcangeli
                   ` (10 subsequent siblings)
  11 siblings, 2 replies; 24+ messages in thread
From: Andrea Arcangeli @ 2008-05-02 15:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Jack Steiner, Robin Holt, Nick Piggin,
	Peter Zijlstra, kvm-devel, Kanoj Sarcar, Roland Dreier,
	Steve Wise, linux-kernel, Avi Kivity, linux-mm, general,
	Hugh Dickins, akpm, Rusty Russell, Anthony Liguori, Chris Wright,
	Marcelo Tosatti, Eric Dumazet, Paul E. McKenney

# HG changeset patch
# User Andrea Arcangeli <andrea@qumranet.com>
# Date 1209740175 -7200
# Node ID 1489529e7b53d3f2dab8431372aa4850ec821caa
# Parent  5026689a3bc323a26d33ad882c34c4c9c9a3ecd8
mmu-notifier-core

With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to
pages. There are secondary MMUs (with secondary sptes and secondary
tlbs) too. sptes in the kvm case are shadow pagetables, but when I say
spte in mmu-notifier context, I mean "secondary pte". In GRU case
there's no actual secondary pte and there's only a secondary tlb
because the GRU secondary MMU has no knowledge about sptes and every
secondary tlb miss event in the MMU always generates a page fault that
has to be resolved by the CPU (this is not the case of KVM where the a
secondary tlb miss will walk sptes in hardware and it will refill the
secondary tlb transparently to software if the corresponding spte is
present). The same way zap_page_range has to invalidate the pte before
freeing the page, the spte (and secondary tlb) must also be
invalidated before any page is freed and reused.

Currently we take a page_count pin on every page mapped by sptes, but
that means the pages can't be swapped whenever they're mapped by any
spte because they're part of the guest working set. Furthermore a spte
unmap event can immediately lead to a page to be freed when the pin is
released (so requiring the same complex and relatively slow tlb_gather
smp safe logic we have in zap_page_range and that can be avoided
completely if the spte unmap event doesn't require an unpin of the
page previously mapped in the secondary MMU).

The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and
know when the VM is swapping or freeing or doing anything on the
primary MMU so that the secondary MMU code can drop sptes before the
pages are freed, avoiding all page pinning and allowing 100% reliable
swapping of guest physical address space. Furthermore it avoids the
code that teardown the mappings of the secondary MMU, to implement a
logic like tlb_gather in zap_page_range that would require many IPI to
flush other cpu tlbs, for each fixed number of spte unmapped.

To make an example: if what happens on the primary MMU is a protection
downgrade (from writeable to wrprotect) the secondary MMU mappings
will be invalidated, and the next secondary-mmu-page-fault will call
get_user_pages and trigger a do_wp_page through get_user_pages if it
called get_user_pages with write=1, and it'll re-establishing an
updated spte or secondary-tlb-mapping on the copied page. Or it will
setup a readonly spte or readonly tlb mapping if it's a guest-read, if
it calls get_user_pages with write=0. This is just an example.

This allows to map any page pointed by any pte (and in turn visible in
the primary CPU MMU), into a secondary MMU (be it a pure tlb like GRU,
or an full MMU with both sptes and secondary-tlb like the
shadow-pagetable layer with kvm), or a remote DMA in software like
XPMEM (hence needing of schedule in XPMEM code to send the invalidate
to the remote node, while no need to schedule in kvm/gru as it's an
immediate event like invalidating primary-mmu pte).

At least for KVM without this patch it's impossible to swap guests
reliably. And having this feature and removing the page pin allows
several other optimizations that simplify life considerably.

Dependencies:

1) Introduces list_del_init_rcu and documents it (fixes a comment for
   list_del_rcu too)

2) mm_lock() to register the mmu notifier when the whole VM isn't
   doing anything with "mm". This allows mmu notifier users to keep
   track if the VM is in the middle of the invalidate_range_begin/end
   critical section with an atomic counter incraese in range_begin and
   decreased in range_end. No secondary MMU page fault is allowed to
   map any spte or secondary tlb reference, while the VM is in the
   middle of range_begin/end as any page returned by get_user_pages in
   that critical section could later immediately be freed without any
   further ->invalidate_page notification (invalidate_range_begin/end
   works on ranges and ->invalidate_page isn't called immediately
   before freeing the page). To stop all page freeing and pagetable
   overwrites the mmap_sem must be taken in write mode and all other
   anon_vma/i_mmap locks must be taken in virtual address order. The
   order is critical to avoid mm_lock(mm1) and mm_lock(mm2) running
   concurrently to trigger lock inversion deadlocks.

3) It'd be a waste to add branches in the VM if nobody could possibly
   run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only enabled
   if CONFIG_KVM=m/y. In the current kernel kvm won't yet take
   advantage of mmu notifiers, but this already allows to compile a
   KVM external module against a kernel with mmu notifiers enabled and
   from the next pull from kvm.git we'll start using them. And
   GRU/XPMEM will also be able to continue the development by enabling
   KVM=m in their config, until they submit all GRU/XPMEM GPLv2 code
   to the mainline kernel. Then they can also enable MMU_NOTIFIERS in
   the same way KVM does it (even if KVM=n). This guarantees nobody
   selects MMU_NOTIFIER=y if KVM and GRU and XPMEM are all =n.

The mmu_notifier_register call can fail because mm_lock may not
allocate the required vmalloc space. See the comment on top of
mm_lock() implementation for the worst case memory requirements.
Because mmu_notifier_reigster is used when a driver startup, a failure
can be gracefully handled. Here an example of the change applied to
kvm to register the mmu notifiers. Usually when a driver startups
other allocations are required anyway and -ENOMEM failure paths exists
already.

 struct  kvm *kvm_arch_create_vm(void)
 {
        struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
+       int err;

        if (!kvm)
                return ERR_PTR(-ENOMEM);

        INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);

+       kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
+       err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
+       if (err) {
+               kfree(kvm);
+               return ERR_PTR(err);
+       }
+
        return kvm;
 }

mmu_notifier_unregister returns void and it's reliable.

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Christoph Lameter <clameter@sgi.com>

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -21,6 +21,7 @@ config KVM
 	tristate "Kernel-based Virtual Machine (KVM) support"
 	depends on HAVE_KVM
 	select PREEMPT_NOTIFIERS
+	select MMU_NOTIFIER
 	select ANON_INODES
 	---help---
 	  Support hosting fully virtualized guest machines using hardware
diff --git a/include/linux/list.h b/include/linux/list.h
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -747,7 +747,7 @@ static inline void hlist_del(struct hlis
  * or hlist_del_rcu(), running on this same list.
  * However, it is perfectly legal to run concurrently with
  * the _rcu list-traversal primitives, such as
- * hlist_for_each_entry().
+ * hlist_for_each_entry_rcu().
  */
 static inline void hlist_del_rcu(struct hlist_node *n)
 {
@@ -760,6 +760,34 @@ static inline void hlist_del_init(struct
 	if (!hlist_unhashed(n)) {
 		__hlist_del(n);
 		INIT_HLIST_NODE(n);
+	}
+}
+
+/**
+ * hlist_del_init_rcu - deletes entry from hash list with re-initialization
+ * @n: the element to delete from the hash list.
+ *
+ * Note: list_unhashed() on entry does return true after this. It is
+ * useful for RCU based read lockfree traversal if the writer side
+ * must know if the list entry is still hashed or already unhashed.
+ *
+ * In particular, it means that we can not poison the forward pointers
+ * that may still be used for walking the hash list and we can only
+ * zero the pprev pointer so list_unhashed() will return true after
+ * this.
+ *
+ * The caller must take whatever precautions are necessary (such as
+ * holding appropriate locks) to avoid racing with another
+ * list-mutation primitive, such as hlist_add_head_rcu() or
+ * hlist_del_rcu(), running on this same list.  However, it is
+ * perfectly legal to run concurrently with the _rcu list-traversal
+ * primitives, such as hlist_for_each_entry_rcu().
+ */
+static inline void hlist_del_init_rcu(struct hlist_node *n)
+{
+	if (!hlist_unhashed(n)) {
+		__hlist_del(n);
+		n->pprev = NULL;
 	}
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1084,6 +1084,15 @@ extern int install_special_mapping(struc
 				   unsigned long addr, unsigned long len,
 				   unsigned long flags, struct page **pages);
 
+struct mm_lock_data {
+	spinlock_t **i_mmap_locks;
+	spinlock_t **anon_vma_locks;
+	size_t nr_i_mmap_locks;
+	size_t nr_anon_vma_locks;
+};
+extern int mm_lock(struct mm_struct *mm, struct mm_lock_data *data);
+extern void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data);
+
 extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
 
 extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -10,6 +10,7 @@
 #include <linux/rbtree.h>
 #include <linux/rwsem.h>
 #include <linux/completion.h>
+#include <linux/cpumask.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
 
@@ -19,6 +20,7 @@
 #define AT_VECTOR_SIZE (2*(AT_VECTOR_SIZE_ARCH + AT_VECTOR_SIZE_BASE + 1))
 
 struct address_space;
+struct mmu_notifier_mm;
 
 #if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
 typedef atomic_long_t mm_counter_t;
@@ -235,6 +237,9 @@ struct mm_struct {
 	struct file *exe_file;
 	unsigned long num_exe_file_vmas;
 #endif
+#ifdef CONFIG_MMU_NOTIFIER
+	struct mmu_notifier_mm *mmu_notifier_mm;
+#endif
 };
 
 #endif /* _LINUX_MM_TYPES_H */
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
new file mode 100644
--- /dev/null
+++ b/include/linux/mmu_notifier.h
@@ -0,0 +1,265 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/mm_types.h>
+#include <linux/srcu.h>
+
+struct mmu_notifier;
+struct mmu_notifier_ops;
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+/*
+ * The mmu notifier_mm structure is allocated and installed in
+ * mm->mmu_notifier_mm inside the mm_lock() protected critical section
+ * and it's released only when mm_count reaches zero in mmdrop().
+ */
+struct mmu_notifier_mm {
+	/* all mmu notifiers registerd in this mm are queued in this list */
+	struct hlist_head list;
+	/* srcu structure for this mm */
+	struct srcu_struct srcu;
+	/* to serialize the list modifications and hlist_unhashed */
+	spinlock_t lock;
+};
+
+struct mmu_notifier_ops {
+	/*
+	 * Called either by mmu_notifier_unregister or when the mm is
+	 * being destroyed by exit_mmap, always before all pages are
+	 * freed. It's mandatory to implement this method. This can
+	 * run concurrently with other mmu notifier methods and it
+	 * should tear down all secondary mmu mappings and freeze the
+	 * secondary mmu.
+	 */
+	void (*release)(struct mmu_notifier *mn,
+			struct mm_struct *mm);
+
+	/*
+	 * clear_flush_young is called after the VM is
+	 * test-and-clearing the young/accessed bitflag in the
+	 * pte. This way the VM will provide proper aging to the
+	 * accesses to the page through the secondary MMUs and not
+	 * only to the ones through the Linux pte.
+	 */
+	int (*clear_flush_young)(struct mmu_notifier *mn,
+				 struct mm_struct *mm,
+				 unsigned long address);
+
+	/*
+	 * Before this is invoked any secondary MMU is still ok to
+	 * read/write to the page previously pointed to by the Linux
+	 * pte because the page hasn't been freed yet and it won't be
+	 * freed until this returns. If required set_page_dirty has to
+	 * be called internally to this method.
+	 */
+	void (*invalidate_page)(struct mmu_notifier *mn,
+				struct mm_struct *mm,
+				unsigned long address);
+
+	/*
+	 * invalidate_range_start() and invalidate_range_end() must be
+	 * paired and are called only when the mmap_sem and/or the
+	 * locks protecting the reverse maps are held. Both functions
+	 * may sleep. The subsystem must guarantee that no additional
+	 * references are taken to the pages in the range established
+	 * between the call to invalidate_range_start() and the
+	 * matching call to invalidate_range_end().
+	 *
+	 * Invalidation of multiple concurrent ranges may be
+	 * optionally permitted by the driver. Either way the
+	 * establishment of sptes is forbidden in the range passed to
+	 * invalidate_range_begin/end for the whole duration of the
+	 * invalidate_range_begin/end critical section.
+	 *
+	 * invalidate_range_start() is called when all pages in the
+	 * range are still mapped and have at least a refcount of one.
+	 *
+	 * invalidate_range_end() is called when all pages in the
+	 * range have been unmapped and the pages have been freed by
+	 * the VM.
+	 *
+	 * The VM will remove the page table entries and potentially
+	 * the page between invalidate_range_start() and
+	 * invalidate_range_end(). If the page must not be freed
+	 * because of pending I/O or other circumstances then the
+	 * invalidate_range_start() callback (or the initial mapping
+	 * by the driver) must make sure that the refcount is kept
+	 * elevated.
+	 *
+	 * If the driver increases the refcount when the pages are
+	 * initially mapped into an address space then either
+	 * invalidate_range_start() or invalidate_range_end() may
+	 * decrease the refcount. If the refcount is decreased on
+	 * invalidate_range_start() then the VM can free pages as page
+	 * table entries are removed.  If the refcount is only
+	 * droppped on invalidate_range_end() then the driver itself
+	 * will drop the last refcount but it must take care to flush
+	 * any secondary tlb before doing the final free on the
+	 * page. Pages will no longer be referenced by the linux
+	 * address space but may still be referenced by sptes until
+	 * the last refcount is dropped.
+	 */
+	void (*invalidate_range_start)(struct mmu_notifier *mn,
+				       struct mm_struct *mm,
+				       unsigned long start, unsigned long end);
+	void (*invalidate_range_end)(struct mmu_notifier *mn,
+				     struct mm_struct *mm,
+				     unsigned long start, unsigned long end);
+};
+
+/*
+ * The notifier chains are protected by mmap_sem and/or the reverse map
+ * semaphores. Notifier chains are only changed when all reverse maps and
+ * the mmap_sem locks are taken.
+ *
+ * Therefore notifier chains can only be traversed when either
+ *
+ * 1. mmap_sem is held.
+ * 2. One of the reverse map locks is held (i_mmap_sem or anon_vma->sem).
+ * 3. No other concurrent thread can access the list (release)
+ */
+struct mmu_notifier {
+	struct hlist_node hlist;
+	const struct mmu_notifier_ops *ops;
+};
+
+static inline int mm_has_notifiers(struct mm_struct *mm)
+{
+	return unlikely(mm->mmu_notifier_mm);
+}
+
+extern int mmu_notifier_register(struct mmu_notifier *mn,
+				 struct mm_struct *mm);
+extern void mmu_notifier_unregister(struct mmu_notifier *mn,
+				    struct mm_struct *mm);
+extern void __mmu_notifier_mm_destroy(struct mm_struct *mm);
+extern void __mmu_notifier_release(struct mm_struct *mm);
+extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
+					  unsigned long address);
+extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
+					  unsigned long address);
+extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+				  unsigned long start, unsigned long end);
+extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+				  unsigned long start, unsigned long end);
+
+
+static inline void mmu_notifier_release(struct mm_struct *mm)
+{
+	if (mm_has_notifiers(mm))
+		__mmu_notifier_release(mm);
+}
+
+static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
+					  unsigned long address)
+{
+	if (mm_has_notifiers(mm))
+		return __mmu_notifier_clear_flush_young(mm, address);
+	return 0;
+}
+
+static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
+					  unsigned long address)
+{
+	if (mm_has_notifiers(mm))
+		__mmu_notifier_invalidate_page(mm, address);
+}
+
+static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+				  unsigned long start, unsigned long end)
+{
+	if (mm_has_notifiers(mm))
+		__mmu_notifier_invalidate_range_start(mm, start, end);
+}
+
+static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+				  unsigned long start, unsigned long end)
+{
+	if (mm_has_notifiers(mm))
+		__mmu_notifier_invalidate_range_end(mm, start, end);
+}
+
+static inline void mmu_notifier_mm_init(struct mm_struct *mm)
+{
+	mm->mmu_notifier_mm = NULL;
+}
+
+static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
+{
+	if (mm_has_notifiers(mm))
+		__mmu_notifier_mm_destroy(mm);
+}
+
+/*
+ * These two macros will sometime replace ptep_clear_flush.
+ * ptep_clear_flush is impleemnted as macro itself, so this also is
+ * implemented as a macro until ptep_clear_flush will converted to an
+ * inline function, to diminish the risk of compilation failure. The
+ * invalidate_page method over time can be moved outside the PT lock
+ * and these two macros can be later removed.
+ */
+#define ptep_clear_flush_notify(__vma, __address, __ptep)		\
+({									\
+	pte_t __pte;							\
+	struct vm_area_struct *___vma = __vma;				\
+	unsigned long ___address = __address;				\
+	__pte = ptep_clear_flush(___vma, ___address, __ptep);		\
+	mmu_notifier_invalidate_page(___vma->vm_mm, ___address);	\
+	__pte;								\
+})
+
+#define ptep_clear_flush_young_notify(__vma, __address, __ptep)		\
+({									\
+	int __young;							\
+	struct vm_area_struct *___vma = __vma;				\
+	unsigned long ___address = __address;				\
+	__young = ptep_clear_flush_young(___vma, ___address, __ptep);	\
+	__young |= mmu_notifier_clear_flush_young(___vma->vm_mm,	\
+						  ___address);		\
+	__young;							\
+})
+
+#else /* CONFIG_MMU_NOTIFIER */
+
+static inline void mmu_notifier_release(struct mm_struct *mm)
+{
+}
+
+static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
+					  unsigned long address)
+{
+	return 0;
+}
+
+static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
+					  unsigned long address)
+{
+}
+
+static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+				  unsigned long start, unsigned long end)
+{
+}
+
+static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+				  unsigned long start, unsigned long end)
+{
+}
+
+static inline void mmu_notifier_mm_init(struct mm_struct *mm)
+{
+}
+
+static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
+{
+}
+
+#define ptep_clear_flush_young_notify ptep_clear_flush_young
+#define ptep_clear_flush_notify ptep_clear_flush
+
+#endif /* CONFIG_MMU_NOTIFIER */
+
+#endif /* _LINUX_MMU_NOTIFIER_H */
diff --git a/include/linux/srcu.h b/include/linux/srcu.h
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -27,6 +27,8 @@
 #ifndef _LINUX_SRCU_H
 #define _LINUX_SRCU_H
 
+#include <linux/mutex.h>
+
 struct srcu_struct_array {
 	int c[2];
 };
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -53,6 +53,7 @@
 #include <linux/tty.h>
 #include <linux/proc_fs.h>
 #include <linux/blkdev.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -385,6 +386,7 @@ static struct mm_struct * mm_init(struct
 
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
+		mmu_notifier_mm_init(mm);
 		return mm;
 	}
 
@@ -417,6 +419,7 @@ void __mmdrop(struct mm_struct *mm)
 	BUG_ON(mm == &init_mm);
 	mm_free_pgd(mm);
 	destroy_context(mm);
+	mmu_notifier_mm_destroy(mm);
 	free_mm(mm);
 }
 EXPORT_SYMBOL_GPL(__mmdrop);
diff --git a/mm/Kconfig b/mm/Kconfig
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -205,3 +205,6 @@ config VIRT_TO_BUS
 config VIRT_TO_BUS
 	def_bool y
 	depends on !ARCH_NO_VIRT_TO_BUS
+
+config MMU_NOTIFIER
+	bool
diff --git a/mm/Makefile b/mm/Makefile
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -33,4 +33,5 @@ obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
 
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -188,7 +188,7 @@ __xip_unmap (struct address_space * mapp
 		if (pte) {
 			/* Nuke the page table entry. */
 			flush_cache_page(vma, address, pte_pfn(*pte));
-			pteval = ptep_clear_flush(vma, address, pte);
+			pteval = ptep_clear_flush_notify(vma, address, pte);
 			page_remove_rmap(page, vma);
 			dec_mm_counter(mm, file_rss);
 			BUG_ON(pte_dirty(pteval));
diff --git a/mm/fremap.c b/mm/fremap.c
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -15,6 +15,7 @@
 #include <linux/rmap.h>
 #include <linux/module.h>
 #include <linux/syscalls.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/mmu_context.h>
 #include <asm/cacheflush.h>
@@ -214,7 +215,9 @@ asmlinkage long sys_remap_file_pages(uns
 		spin_unlock(&mapping->i_mmap_lock);
 	}
 
+	mmu_notifier_invalidate_range_start(mm, start, start + size);
 	err = populate_range(mm, vma, start, size, pgoff);
+	mmu_notifier_invalidate_range_end(mm, start, start + size);
 	if (!err && !(flags & MAP_NONBLOCK)) {
 		if (unlikely(has_write_lock)) {
 			downgrade_write(&mm->mmap_sem);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -14,6 +14,7 @@
 #include <linux/mempolicy.h>
 #include <linux/cpuset.h>
 #include <linux/mutex.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -823,6 +824,7 @@ void __unmap_hugepage_range(struct vm_ar
 	BUG_ON(start & ~HPAGE_MASK);
 	BUG_ON(end & ~HPAGE_MASK);
 
+	mmu_notifier_invalidate_range_start(mm, start, end);
 	spin_lock(&mm->page_table_lock);
 	for (address = start; address < end; address += HPAGE_SIZE) {
 		ptep = huge_pte_offset(mm, address);
@@ -843,6 +845,7 @@ void __unmap_hugepage_range(struct vm_ar
 	}
 	spin_unlock(&mm->page_table_lock);
 	flush_tlb_range(vma, start, end);
+	mmu_notifier_invalidate_range_end(mm, start, end);
 	list_for_each_entry_safe(page, tmp, &page_list, lru) {
 		list_del(&page->lru);
 		put_page(page);
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -51,6 +51,7 @@
 #include <linux/init.h>
 #include <linux/writeback.h>
 #include <linux/memcontrol.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -632,6 +633,7 @@ int copy_page_range(struct mm_struct *ds
 	unsigned long next;
 	unsigned long addr = vma->vm_start;
 	unsigned long end = vma->vm_end;
+	int ret;
 
 	/*
 	 * Don't copy ptes where a page fault will fill them correctly.
@@ -647,17 +649,33 @@ int copy_page_range(struct mm_struct *ds
 	if (is_vm_hugetlb_page(vma))
 		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
 
+	/*
+	 * We need to invalidate the secondary MMU mappings only when
+	 * there could be a permission downgrade on the ptes of the
+	 * parent mm. And a permission downgrade will only happen if
+	 * is_cow_mapping() returns true.
+	 */
+	if (is_cow_mapping(vma->vm_flags))
+		mmu_notifier_invalidate_range_start(src_mm, addr, end);
+
+	ret = 0;
 	dst_pgd = pgd_offset(dst_mm, addr);
 	src_pgd = pgd_offset(src_mm, addr);
 	do {
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(src_pgd))
 			continue;
-		if (copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
-						vma, addr, next))
-			return -ENOMEM;
+		if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
+					    vma, addr, next))) {
+			ret = -ENOMEM;
+			break;
+		}
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
-	return 0;
+
+	if (is_cow_mapping(vma->vm_flags))
+		mmu_notifier_invalidate_range_end(src_mm,
+						  vma->vm_start, end);
+	return ret;
 }
 
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
@@ -861,7 +879,9 @@ unsigned long unmap_vmas(struct mmu_gath
 	unsigned long start = start_addr;
 	spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
 	int fullmm = (*tlbp)->fullmm;
+	struct mm_struct *mm = vma->vm_mm;
 
+	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
 		unsigned long end;
 
@@ -912,6 +932,7 @@ unsigned long unmap_vmas(struct mmu_gath
 		}
 	}
 out:
+	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
 	return start;	/* which is now the end (or restart) address */
 }
 
@@ -1541,10 +1562,11 @@ int apply_to_page_range(struct mm_struct
 {
 	pgd_t *pgd;
 	unsigned long next;
-	unsigned long end = addr + size;
+	unsigned long start = addr, end = addr + size;
 	int err;
 
 	BUG_ON(addr >= end);
+	mmu_notifier_invalidate_range_start(mm, start, end);
 	pgd = pgd_offset(mm, addr);
 	do {
 		next = pgd_addr_end(addr, end);
@@ -1552,6 +1574,7 @@ int apply_to_page_range(struct mm_struct
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
+	mmu_notifier_invalidate_range_end(mm, start, end);
 	return err;
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
@@ -1753,7 +1776,7 @@ gotten:
 		 * seen in the presence of one thread doing SMC and another
 		 * thread doing COW.
 		 */
-		ptep_clear_flush(vma, address, page_table);
+		ptep_clear_flush_notify(vma, address, page_table);
 		set_pte_at(mm, address, page_table, entry);
 		update_mmu_cache(vma, address, entry);
 		lru_cache_add_active(new_page);
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -26,6 +26,9 @@
 #include <linux/mount.h>
 #include <linux/mempolicy.h>
 #include <linux/rmap.h>
+#include <linux/vmalloc.h>
+#include <linux/sort.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -2048,6 +2051,7 @@ void exit_mmap(struct mm_struct *mm)
 
 	/* mm's last user has gone, and its about to be pulled down */
 	arch_exit_mmap(mm);
+	mmu_notifier_release(mm);
 
 	lru_add_drain();
 	flush_cache_mm(mm);
@@ -2255,3 +2259,190 @@ int install_special_mapping(struct mm_st
 
 	return 0;
 }
+
+static int mm_lock_cmp(const void *a, const void *b)
+{
+	unsigned long _a = (unsigned long)*(spinlock_t **)a;
+	unsigned long _b = (unsigned long)*(spinlock_t **)b;
+
+	cond_resched();
+	if (_a < _b)
+		return -1;
+	if (_a > _b)
+		return 1;
+	return 0;
+}
+
+static unsigned long mm_lock_sort(struct mm_struct *mm, spinlock_t **locks,
+				  int anon)
+{
+	struct vm_area_struct *vma;
+	size_t i = 0;
+
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		if (anon) {
+			if (vma->anon_vma)
+				locks[i++] = &vma->anon_vma->lock;
+		} else {
+			if (vma->vm_file && vma->vm_file->f_mapping)
+				locks[i++] = &vma->vm_file->f_mapping->i_mmap_lock;
+		}
+	}
+
+	if (!i)
+		goto out;
+
+	sort(locks, i, sizeof(spinlock_t *), mm_lock_cmp, NULL);
+
+out:
+	return i;
+}
+
+static inline unsigned long mm_lock_sort_anon_vma(struct mm_struct *mm,
+						  spinlock_t **locks)
+{
+	return mm_lock_sort(mm, locks, 1);
+}
+
+static inline unsigned long mm_lock_sort_i_mmap(struct mm_struct *mm,
+						spinlock_t **locks)
+{
+	return mm_lock_sort(mm, locks, 0);
+}
+
+static void mm_lock_unlock(spinlock_t **locks, size_t nr, int lock)
+{
+	spinlock_t *last = NULL;
+	size_t i;
+
+	for (i = 0; i < nr; i++)
+		/*  Multiple vmas may use the same lock. */
+		if (locks[i] != last) {
+			BUG_ON((unsigned long) last > (unsigned long) locks[i]);
+			last = locks[i];
+			if (lock)
+				spin_lock(last);
+			else
+				spin_unlock(last);
+		}
+}
+
+static inline void __mm_lock(spinlock_t **locks, size_t nr)
+{
+	mm_lock_unlock(locks, nr, 1);
+}
+
+static inline void __mm_unlock(spinlock_t **locks, size_t nr)
+{
+	mm_lock_unlock(locks, nr, 0);
+}
+
+/*
+ * This operation locks against the VM for all pte/vma/mm related
+ * operations that could ever happen on a certain mm. This includes
+ * vmtruncate, try_to_unmap, and all page faults. The holder
+ * must not hold any mm related lock. A single task can't take more
+ * than one mm_lock in a row or it would deadlock.
+ *
+ * The mmap_sem must be taken in write mode to block all operations
+ * that could modify pagetables and free pages without altering the
+ * vma layout (for example populate_range() with nonlinear vmas).
+ *
+ * The sorting is needed to avoid lock inversion deadlocks if two
+ * tasks run mm_lock at the same time on different mm that happen to
+ * share some anon_vmas/inodes but mapped in different order.
+ *
+ * mm_lock and mm_unlock are expensive operations that may have to
+ * take thousand of locks. Thanks to sort() the complexity is
+ * O(N*log(N)) where N is the number of VMAs in the mm. The max number
+ * of vmas is defined in /proc/sys/vm/max_map_count.
+ *
+ * mm_lock() can fail if memory allocation fails. The worst case
+ * vmalloc allocation required is 2*max_map_count*sizeof(spinlock_t *),
+ * so around 1Mbyte, but in practice it'll be much less because
+ * normally there won't be max_map_count vmas allocated in the task
+ * that runs mm_lock().
+ *
+ * The vmalloc memory allocated by mm_lock is stored in the
+ * mm_lock_data structure that must be allocated by the caller and it
+ * must be later passed to mm_unlock that will free it after using it.
+ * Allocating the mm_lock_data structure on the stack is fine because
+ * it's only a couple of bytes in size.
+ *
+ * If mm_lock() returns -ENOMEM no memory has been allocated and the
+ * mm_lock_data structure can be freed immediately, and mm_unlock must
+ * not be called.
+ */
+int mm_lock(struct mm_struct *mm, struct mm_lock_data *data)
+{
+	spinlock_t **anon_vma_locks, **i_mmap_locks;
+
+	down_write(&mm->mmap_sem);
+	if (mm->map_count) {
+		anon_vma_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count);
+		if (unlikely(!anon_vma_locks)) {
+			up_write(&mm->mmap_sem);
+			return -ENOMEM;
+		}
+
+		i_mmap_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count);
+		if (unlikely(!i_mmap_locks)) {
+			up_write(&mm->mmap_sem);
+			vfree(anon_vma_locks);
+			return -ENOMEM;
+		}
+
+		/*
+		 * When mm_lock_sort_anon_vma/i_mmap returns zero it
+		 * means there's no lock to take and so we can free
+		 * the array here without waiting mm_unlock. mm_unlock
+		 * will do nothing if nr_i_mmap/anon_vma_locks is
+		 * zero.
+		 */
+		data->nr_anon_vma_locks = mm_lock_sort_anon_vma(mm, anon_vma_locks);
+		data->nr_i_mmap_locks = mm_lock_sort_i_mmap(mm, i_mmap_locks);
+
+		if (data->nr_anon_vma_locks) {
+			__mm_lock(anon_vma_locks, data->nr_anon_vma_locks);
+			data->anon_vma_locks = anon_vma_locks;
+		} else
+			vfree(anon_vma_locks);
+
+		if (data->nr_i_mmap_locks) {
+			__mm_lock(i_mmap_locks, data->nr_i_mmap_locks);
+			data->i_mmap_locks = i_mmap_locks;
+		} else
+			vfree(i_mmap_locks);
+	}
+	return 0;
+}
+
+static void mm_unlock_vfree(spinlock_t **locks, size_t nr)
+{
+	__mm_unlock(locks, nr);
+	vfree(locks);
+}
+
+/*
+ * mm_unlock doesn't require any memory allocation and it won't fail.
+ *
+ * All memory has been previously allocated by mm_lock and it'll be
+ * all freed before returning. Only after mm_unlock returns, the
+ * caller is allowed to free and forget the mm_lock_data structure.
+ * 
+ * mm_unlock runs in O(N) where N is the max number of VMAs in the
+ * mm. The max number of vmas is defined in
+ * /proc/sys/vm/max_map_count.
+ */
+void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data)
+{
+	if (mm->map_count) {
+		if (data->nr_anon_vma_locks)
+			mm_unlock_vfree(data->anon_vma_locks,
+					data->nr_anon_vma_locks);
+		if (data->nr_i_mmap_locks)
+			mm_unlock_vfree(data->i_mmap_locks,
+					data->nr_i_mmap_locks);
+	}
+	up_write(&mm->mmap_sem);
+}
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
new file mode 100644
--- /dev/null
+++ b/mm/mmu_notifier.c
@@ -0,0 +1,269 @@
+/*
+ *  linux/mm/mmu_notifier.c
+ *
+ *  Copyright (C) 2008  Qumranet, Inc.
+ *  Copyright (C) 2008  SGI
+ *             Christoph Lameter <clameter@sgi.com>
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include <linux/mmu_notifier.h>
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/err.h>
+#include <linux/srcu.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+
+/*
+ * This function can't run concurrently against mmu_notifier_register
+ * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap
+ * runs with mm_users == 0. Other tasks may still invoke mmu notifiers
+ * in parallel despite there being no task using this mm any more,
+ * through the vmas outside of the exit_mmap context, such as with
+ * vmtruncate. This serializes against mmu_notifier_unregister with
+ * the mmu_notifier_mm->lock in addition to SRCU and it serializes
+ * against the other mmu notifiers with SRCU. struct mmu_notifier_mm
+ * can't go away from under us as exit_mmap holds an mm_count pin
+ * itself.
+ */
+void __mmu_notifier_release(struct mm_struct *mm)
+{
+	struct mmu_notifier *mn;
+	int srcu;
+
+	spin_lock(&mm->mmu_notifier_mm->lock);
+	while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) {
+		mn = hlist_entry(mm->mmu_notifier_mm->list.first,
+				 struct mmu_notifier,
+				 hlist);
+		/*
+		 * We arrived before mmu_notifier_unregister so
+		 * mmu_notifier_unregister will do nothing other than
+		 * to wait ->release to finish and
+		 * mmu_notifier_unregister to return.
+		 */
+		hlist_del_init_rcu(&mn->hlist);
+		/*
+		 * SRCU here will block mmu_notifier_unregister until
+		 * ->release returns.
+		 */
+		srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
+		spin_unlock(&mm->mmu_notifier_mm->lock);
+		/*
+		 * if ->release runs before mmu_notifier_unregister it
+		 * must be handled as it's the only way for the driver
+		 * to flush all existing sptes and stop the driver
+		 * from establishing any more sptes before all the
+		 * pages in the mm are freed.
+		 */
+		mn->ops->release(mn, mm);
+		srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu);
+		spin_lock(&mm->mmu_notifier_mm->lock);
+	}
+	spin_unlock(&mm->mmu_notifier_mm->lock);
+
+	/*
+	 * synchronize_srcu here prevents mmu_notifier_release to
+	 * return to exit_mmap (which would proceed freeing all pages
+	 * in the mm) until the ->release method returns, if it was
+	 * invoked by mmu_notifier_unregister.
+	 *
+	 * The mmu_notifier_mm can't go away from under us because one
+	 * mm_count is hold by exit_mmap.
+	 */
+	synchronize_srcu(&mm->mmu_notifier_mm->srcu);
+}
+
+/*
+ * If no young bitflag is supported by the hardware, ->clear_flush_young can
+ * unmap the address and return 1 or 0 depending if the mapping previously
+ * existed or not.
+ */
+int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
+					unsigned long address)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n;
+	int young = 0, srcu;
+
+	srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
+	hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) {
+		if (mn->ops->clear_flush_young)
+			young |= mn->ops->clear_flush_young(mn, mm, address);
+	}
+	srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu);
+
+	return young;
+}
+
+void __mmu_notifier_invalidate_page(struct mm_struct *mm,
+					  unsigned long address)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n;
+	int srcu;
+
+	srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
+	hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) {
+		if (mn->ops->invalidate_page)
+			mn->ops->invalidate_page(mn, mm, address);
+	}
+	srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu);
+}
+
+void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+				  unsigned long start, unsigned long end)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n;
+	int srcu;
+
+	srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
+	hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) {
+		if (mn->ops->invalidate_range_start)
+			mn->ops->invalidate_range_start(mn, mm, start, end);
+	}
+	srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu);
+}
+
+void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+				  unsigned long start, unsigned long end)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n;
+	int srcu;
+
+	srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
+	hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) {
+		if (mn->ops->invalidate_range_end)
+			mn->ops->invalidate_range_end(mn, mm, start, end);
+	}
+	srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu);
+}
+
+/*
+ * Must not hold mmap_sem nor any other VM related lock when calling
+ * this registration function. Must also ensure mm_users can't go down
+ * to zero while this runs to avoid races with mmu_notifier_release,
+ * so mm has to be current->mm or the mm should be pinned safely such
+ * as with get_task_mm(). If the mm is not current->mm, the mm_users
+ * pin should be released by calling mmput after mmu_notifier_register
+ * returns. mmu_notifier_unregister must be always called to
+ * unregister the notifier. mm_count is automatically pinned to allow
+ * mmu_notifier_unregister to safely run at any time later, before or
+ * after exit_mmap. ->release will always be called before exit_mmap
+ * frees the pages.
+ */
+int mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	struct mm_lock_data data;
+	struct mmu_notifier_mm * mmu_notifier_mm;
+	int ret;
+
+	BUG_ON(atomic_read(&mm->mm_users) <= 0);
+
+	ret = -ENOMEM;
+	mmu_notifier_mm = kmalloc(sizeof(struct mmu_notifier_mm), GFP_KERNEL);
+	if (unlikely(!mmu_notifier_mm))
+		goto out;
+
+	ret = init_srcu_struct(&mmu_notifier_mm->srcu);
+	if (unlikely(ret))
+		goto out_kfree;
+
+	ret = mm_lock(mm, &data);
+	if (unlikely(ret))
+		goto out_cleanup;
+
+	if (!mm_has_notifiers(mm)) {
+		INIT_HLIST_HEAD(&mmu_notifier_mm->list);
+		spin_lock_init(&mmu_notifier_mm->lock);
+		mm->mmu_notifier_mm = mmu_notifier_mm;
+		mmu_notifier_mm = NULL;
+	}
+	atomic_inc(&mm->mm_count);
+
+	/*
+	 * Serialize the update against mmu_notifier_unregister. A
+	 * side note: mmu_notifier_release can't run concurrently with
+	 * us because we hold the mm_users pin (either implicitly as
+	 * current->mm or explicitly with get_task_mm() or similar).
+	 * We can't race against any other mmu notifiers either thanks
+	 * to mm_lock().
+	 */
+	spin_lock(&mm->mmu_notifier_mm->lock);
+	hlist_add_head(&mn->hlist, &mm->mmu_notifier_mm->list);
+	spin_unlock(&mm->mmu_notifier_mm->lock);
+
+	mm_unlock(mm, &data);
+out_cleanup:
+	if (mmu_notifier_mm)
+		cleanup_srcu_struct(&mmu_notifier_mm->srcu);
+out_kfree:
+	/* kfree() does nothing if mmu_notifier_mm is NULL */
+	kfree(mmu_notifier_mm);
+out:
+	BUG_ON(atomic_read(&mm->mm_users) <= 0);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_register);
+
+/* this is called after the last mmu_notifier_unregister() returned */
+void __mmu_notifier_mm_destroy(struct mm_struct *mm)
+{
+	BUG_ON(!hlist_empty(&mm->mmu_notifier_mm->list));
+	cleanup_srcu_struct(&mm->mmu_notifier_mm->srcu);
+	kfree(mm->mmu_notifier_mm);
+	mm->mmu_notifier_mm = LIST_POISON1; /* debug */
+}
+
+/*
+ * This releases the mm_count pin automatically and frees the mm
+ * structure if it was the last user of it. It serializes against
+ * running mmu notifiers with SRCU and against mmu_notifier_unregister
+ * with the unregister lock + SRCU. All sptes must be dropped before
+ * calling mmu_notifier_unregister. ->release or any other notifier
+ * method may be invoked concurrently with mmu_notifier_unregister,
+ * and only after mmu_notifier_unregister returned we're guaranteed
+ * that ->release or any other method can't run anymore.
+ */
+void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	BUG_ON(atomic_read(&mm->mm_count) <= 0);
+
+	spin_lock(&mm->mmu_notifier_mm->lock);
+	if (!hlist_unhashed(&mn->hlist)) {
+		int srcu;
+
+		hlist_del_rcu(&mn->hlist);
+
+		/*
+		 * SRCU here will force exit_mmap to wait ->release to finish
+		 * before freeing the pages.
+		 */
+		srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
+		spin_unlock(&mm->mmu_notifier_mm->lock);
+		/*
+		 * exit_mmap will block in mmu_notifier_release to
+		 * guarantee ->release is called before freeing the
+		 * pages.
+		 */
+		mn->ops->release(mn, mm);
+		srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu);
+	} else
+		spin_unlock(&mm->mmu_notifier_mm->lock);
+
+	/*
+	 * Wait any running method to finish, of course including
+	 * ->release if it was run by mmu_notifier_relase instead of us.
+	 */
+	synchronize_srcu(&mm->mmu_notifier_mm->srcu);
+
+	BUG_ON(atomic_read(&mm->mm_count) <= 0);
+
+	mmdrop(mm);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
diff --git a/mm/mprotect.c b/mm/mprotect.c
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -21,6 +21,7 @@
 #include <linux/syscalls.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/mmu_notifier.h>
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
 #include <asm/cacheflush.h>
@@ -198,10 +199,12 @@ success:
 		dirty_accountable = 1;
 	}
 
+	mmu_notifier_invalidate_range_start(mm, start, end);
 	if (is_vm_hugetlb_page(vma))
 		hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
 	else
 		change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
+	mmu_notifier_invalidate_range_end(mm, start, end);
 	vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
 	vm_stat_account(mm, newflags, vma->vm_file, nrpages);
 	return 0;
diff --git a/mm/mremap.c b/mm/mremap.c
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -18,6 +18,7 @@
 #include <linux/highmem.h>
 #include <linux/security.h>
 #include <linux/syscalls.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -74,7 +75,11 @@ static void move_ptes(struct vm_area_str
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *old_pte, *new_pte, pte;
 	spinlock_t *old_ptl, *new_ptl;
+	unsigned long old_start;
 
+	old_start = old_addr;
+	mmu_notifier_invalidate_range_start(vma->vm_mm,
+					    old_start, old_end);
 	if (vma->vm_file) {
 		/*
 		 * Subtle point from Rajesh Venkatasubramanian: before
@@ -116,6 +121,7 @@ static void move_ptes(struct vm_area_str
 	pte_unmap_unlock(old_pte - 1, old_ptl);
 	if (mapping)
 		spin_unlock(&mapping->i_mmap_lock);
+	mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);
 }
 
 #define LATENCY_LIMIT	(64 * PAGE_SIZE)
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -49,6 +49,7 @@
 #include <linux/module.h>
 #include <linux/kallsyms.h>
 #include <linux/memcontrol.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/tlbflush.h>
 
@@ -287,7 +288,7 @@ static int page_referenced_one(struct pa
 	if (vma->vm_flags & VM_LOCKED) {
 		referenced++;
 		*mapcount = 1;	/* break early from loop */
-	} else if (ptep_clear_flush_young(vma, address, pte))
+	} else if (ptep_clear_flush_young_notify(vma, address, pte))
 		referenced++;
 
 	/* Pretend the page is referenced if the task has the
@@ -457,7 +458,7 @@ static int page_mkclean_one(struct page 
 		pte_t entry;
 
 		flush_cache_page(vma, address, pte_pfn(*pte));
-		entry = ptep_clear_flush(vma, address, pte);
+		entry = ptep_clear_flush_notify(vma, address, pte);
 		entry = pte_wrprotect(entry);
 		entry = pte_mkclean(entry);
 		set_pte_at(mm, address, pte, entry);
@@ -717,14 +718,14 @@ static int try_to_unmap_one(struct page 
 	 * skipped over this mm) then we should reactivate it.
 	 */
 	if (!migration && ((vma->vm_flags & VM_LOCKED) ||
-			(ptep_clear_flush_young(vma, address, pte)))) {
+			(ptep_clear_flush_young_notify(vma, address, pte)))) {
 		ret = SWAP_FAIL;
 		goto out_unmap;
 	}
 
 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address, page_to_pfn(page));
-	pteval = ptep_clear_flush(vma, address, pte);
+	pteval = ptep_clear_flush_notify(vma, address, pte);
 
 	/* Move the dirty bit to the physical page now the pte is gone. */
 	if (pte_dirty(pteval))
@@ -849,12 +850,12 @@ static void try_to_unmap_cluster(unsigne
 		page = vm_normal_page(vma, address, *pte);
 		BUG_ON(!page || PageAnon(page));
 
-		if (ptep_clear_flush_young(vma, address, pte))
+		if (ptep_clear_flush_young_notify(vma, address, pte))
 			continue;
 
 		/* Nuke the page table entry. */
 		flush_cache_page(vma, address, pte_pfn(*pte));
-		pteval = ptep_clear_flush(vma, address, pte);
+		pteval = ptep_clear_flush_notify(vma, address, pte);
 
 		/* If nonlinear, store the file page offset in the pte. */
 		if (page->index != linear_page_index(vma, address))

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 02 of 11] get_task_mm
  2008-05-02 15:05 [PATCH 00 of 11] mmu notifier #v15 Andrea Arcangeli
  2008-05-02 15:05 ` [PATCH 01 of 11] mmu-notifier-core Andrea Arcangeli
@ 2008-05-02 15:05 ` Andrea Arcangeli
  2008-05-02 15:05 ` [PATCH 03 of 11] invalidate_page outside PT lock Andrea Arcangeli
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 24+ messages in thread
From: Andrea Arcangeli @ 2008-05-02 15:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Jack Steiner, Robin Holt, Nick Piggin,
	Peter Zijlstra, kvm-devel, Kanoj Sarcar, Roland Dreier,
	Steve Wise, linux-kernel, Avi Kivity, linux-mm, general,
	Hugh Dickins, akpm, Rusty Russell, Anthony Liguori, Chris Wright,
	Marcelo Tosatti, Eric Dumazet, Paul E. McKenney

# HG changeset patch
# User Andrea Arcangeli <andrea@qumranet.com>
# Date 1209740185 -7200
# Node ID c85c85c4be165eb6de16136bb97cf1fa7fd5c88f
# Parent  1489529e7b53d3f2dab8431372aa4850ec821caa
get_task_mm

get_task_mm should not succeed if mmput() is running and has reduced
the mm_users count to zero. This can occur if a processor follows
a tasks pointer to an mm struct because that pointer is only cleared
after the mmput().

If get_task_mm() succeeds after mmput() reduced the mm_users to zero then
we have the lovely situation that one portion of the kernel is doing
all the teardown work for an mm while another portion is happily using
it.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -465,7 +465,8 @@ struct mm_struct *get_task_mm(struct tas
 		if (task->flags & PF_BORROWED_MM)
 			mm = NULL;
 		else
-			atomic_inc(&mm->mm_users);
+			if (!atomic_inc_not_zero(&mm->mm_users))
+				mm = NULL;
 	}
 	task_unlock(task);
 	return mm;

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 03 of 11] invalidate_page outside PT lock
  2008-05-02 15:05 [PATCH 00 of 11] mmu notifier #v15 Andrea Arcangeli
  2008-05-02 15:05 ` [PATCH 01 of 11] mmu-notifier-core Andrea Arcangeli
  2008-05-02 15:05 ` [PATCH 02 of 11] get_task_mm Andrea Arcangeli
@ 2008-05-02 15:05 ` Andrea Arcangeli
  2008-05-02 15:05 ` [PATCH 04 of 11] free-pgtables Andrea Arcangeli
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 24+ messages in thread
From: Andrea Arcangeli @ 2008-05-02 15:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Jack Steiner, Robin Holt, Nick Piggin,
	Peter Zijlstra, kvm-devel, Kanoj Sarcar, Roland Dreier,
	Steve Wise, linux-kernel, Avi Kivity, linux-mm, general,
	Hugh Dickins, akpm, Rusty Russell, Anthony Liguori, Chris Wright,
	Marcelo Tosatti, Eric Dumazet, Paul E. McKenney

# HG changeset patch
# User Andrea Arcangeli <andrea@qumranet.com>
# Date 1209740185 -7200
# Node ID ea8fc9187b6d3ef2742061b4f62598afe55281cf
# Parent  c85c85c4be165eb6de16136bb97cf1fa7fd5c88f
invalidate_page outside PT lock

Moves all mmu notifier methods outside the PT lock (first and not last
step to make them sleep capable).

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -193,35 +193,6 @@ static inline void mmu_notifier_mm_destr
 		__mmu_notifier_mm_destroy(mm);
 }
 
-/*
- * These two macros will sometime replace ptep_clear_flush.
- * ptep_clear_flush is impleemnted as macro itself, so this also is
- * implemented as a macro until ptep_clear_flush will converted to an
- * inline function, to diminish the risk of compilation failure. The
- * invalidate_page method over time can be moved outside the PT lock
- * and these two macros can be later removed.
- */
-#define ptep_clear_flush_notify(__vma, __address, __ptep)		\
-({									\
-	pte_t __pte;							\
-	struct vm_area_struct *___vma = __vma;				\
-	unsigned long ___address = __address;				\
-	__pte = ptep_clear_flush(___vma, ___address, __ptep);		\
-	mmu_notifier_invalidate_page(___vma->vm_mm, ___address);	\
-	__pte;								\
-})
-
-#define ptep_clear_flush_young_notify(__vma, __address, __ptep)		\
-({									\
-	int __young;							\
-	struct vm_area_struct *___vma = __vma;				\
-	unsigned long ___address = __address;				\
-	__young = ptep_clear_flush_young(___vma, ___address, __ptep);	\
-	__young |= mmu_notifier_clear_flush_young(___vma->vm_mm,	\
-						  ___address);		\
-	__young;							\
-})
-
 #else /* CONFIG_MMU_NOTIFIER */
 
 static inline void mmu_notifier_release(struct mm_struct *mm)
@@ -257,9 +228,6 @@ static inline void mmu_notifier_mm_destr
 {
 }
 
-#define ptep_clear_flush_young_notify ptep_clear_flush_young
-#define ptep_clear_flush_notify ptep_clear_flush
-
 #endif /* CONFIG_MMU_NOTIFIER */
 
 #endif /* _LINUX_MMU_NOTIFIER_H */
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -188,11 +188,13 @@ __xip_unmap (struct address_space * mapp
 		if (pte) {
 			/* Nuke the page table entry. */
 			flush_cache_page(vma, address, pte_pfn(*pte));
-			pteval = ptep_clear_flush_notify(vma, address, pte);
+			pteval = ptep_clear_flush(vma, address, pte);
 			page_remove_rmap(page, vma);
 			dec_mm_counter(mm, file_rss);
 			BUG_ON(pte_dirty(pteval));
 			pte_unmap_unlock(pte, ptl);
+			/* must invalidate_page _before_ freeing the page */
+			mmu_notifier_invalidate_page(mm, address);
 			page_cache_release(page);
 		}
 	}
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1714,9 +1714,10 @@ static int do_wp_page(struct mm_struct *
 			 */
 			page_table = pte_offset_map_lock(mm, pmd, address,
 							 &ptl);
-			page_cache_release(old_page);
+			new_page = NULL;
 			if (!pte_same(*page_table, orig_pte))
 				goto unlock;
+			page_cache_release(old_page);
 
 			page_mkwrite = 1;
 		}
@@ -1732,6 +1733,7 @@ static int do_wp_page(struct mm_struct *
 		if (ptep_set_access_flags(vma, address, page_table, entry,1))
 			update_mmu_cache(vma, address, entry);
 		ret |= VM_FAULT_WRITE;
+		old_page = new_page = NULL;
 		goto unlock;
 	}
 
@@ -1776,7 +1778,7 @@ gotten:
 		 * seen in the presence of one thread doing SMC and another
 		 * thread doing COW.
 		 */
-		ptep_clear_flush_notify(vma, address, page_table);
+		ptep_clear_flush(vma, address, page_table);
 		set_pte_at(mm, address, page_table, entry);
 		update_mmu_cache(vma, address, entry);
 		lru_cache_add_active(new_page);
@@ -1788,12 +1790,18 @@ gotten:
 	} else
 		mem_cgroup_uncharge_page(new_page);
 
-	if (new_page)
+unlock:
+	pte_unmap_unlock(page_table, ptl);
+
+	if (new_page) {
+		if (new_page == old_page)
+			/* cow happened, notify before releasing old_page */
+			mmu_notifier_invalidate_page(mm, address);
 		page_cache_release(new_page);
+	}
 	if (old_page)
 		page_cache_release(old_page);
-unlock:
-	pte_unmap_unlock(page_table, ptl);
+
 	if (dirty_page) {
 		if (vma->vm_file)
 			file_update_time(vma->vm_file);
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -275,7 +275,7 @@ static int page_referenced_one(struct pa
 	unsigned long address;
 	pte_t *pte;
 	spinlock_t *ptl;
-	int referenced = 0;
+	int referenced = 0, clear_flush_young = 0;
 
 	address = vma_address(page, vma);
 	if (address == -EFAULT)
@@ -288,8 +288,11 @@ static int page_referenced_one(struct pa
 	if (vma->vm_flags & VM_LOCKED) {
 		referenced++;
 		*mapcount = 1;	/* break early from loop */
-	} else if (ptep_clear_flush_young_notify(vma, address, pte))
-		referenced++;
+	} else {
+		clear_flush_young = 1;
+		if (ptep_clear_flush_young(vma, address, pte))
+			referenced++;
+	}
 
 	/* Pretend the page is referenced if the task has the
 	   swap token and is in the middle of a page fault. */
@@ -299,6 +302,10 @@ static int page_referenced_one(struct pa
 
 	(*mapcount)--;
 	pte_unmap_unlock(pte, ptl);
+
+	if (clear_flush_young)
+		referenced += mmu_notifier_clear_flush_young(mm, address);
+
 out:
 	return referenced;
 }
@@ -458,7 +465,7 @@ static int page_mkclean_one(struct page 
 		pte_t entry;
 
 		flush_cache_page(vma, address, pte_pfn(*pte));
-		entry = ptep_clear_flush_notify(vma, address, pte);
+		entry = ptep_clear_flush(vma, address, pte);
 		entry = pte_wrprotect(entry);
 		entry = pte_mkclean(entry);
 		set_pte_at(mm, address, pte, entry);
@@ -466,6 +473,10 @@ static int page_mkclean_one(struct page 
 	}
 
 	pte_unmap_unlock(pte, ptl);
+
+	if (ret)
+		mmu_notifier_invalidate_page(mm, address);
+
 out:
 	return ret;
 }
@@ -717,15 +728,14 @@ static int try_to_unmap_one(struct page 
 	 * If it's recently referenced (perhaps page_referenced
 	 * skipped over this mm) then we should reactivate it.
 	 */
-	if (!migration && ((vma->vm_flags & VM_LOCKED) ||
-			(ptep_clear_flush_young_notify(vma, address, pte)))) {
+	if (!migration && (vma->vm_flags & VM_LOCKED)) {
 		ret = SWAP_FAIL;
 		goto out_unmap;
 	}
 
 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address, page_to_pfn(page));
-	pteval = ptep_clear_flush_notify(vma, address, pte);
+	pteval = ptep_clear_flush(vma, address, pte);
 
 	/* Move the dirty bit to the physical page now the pte is gone. */
 	if (pte_dirty(pteval))
@@ -780,6 +790,8 @@ static int try_to_unmap_one(struct page 
 
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
+	if (ret != SWAP_FAIL)
+		mmu_notifier_invalidate_page(mm, address);
 out:
 	return ret;
 }
@@ -818,7 +830,7 @@ static void try_to_unmap_cluster(unsigne
 	spinlock_t *ptl;
 	struct page *page;
 	unsigned long address;
-	unsigned long end;
+	unsigned long start, end;
 
 	address = (vma->vm_start + cursor) & CLUSTER_MASK;
 	end = address + CLUSTER_SIZE;
@@ -839,6 +851,8 @@ static void try_to_unmap_cluster(unsigne
 	if (!pmd_present(*pmd))
 		return;
 
+	start = address;
+	mmu_notifier_invalidate_range_start(mm, start, end);
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
 
 	/* Update high watermark before we lower rss */
@@ -850,12 +864,12 @@ static void try_to_unmap_cluster(unsigne
 		page = vm_normal_page(vma, address, *pte);
 		BUG_ON(!page || PageAnon(page));
 
-		if (ptep_clear_flush_young_notify(vma, address, pte))
+		if (ptep_clear_flush_young(vma, address, pte))
 			continue;
 
 		/* Nuke the page table entry. */
 		flush_cache_page(vma, address, pte_pfn(*pte));
-		pteval = ptep_clear_flush_notify(vma, address, pte);
+		pteval = ptep_clear_flush(vma, address, pte);
 
 		/* If nonlinear, store the file page offset in the pte. */
 		if (page->index != linear_page_index(vma, address))
@@ -871,6 +885,7 @@ static void try_to_unmap_cluster(unsigne
 		(*mapcount)--;
 	}
 	pte_unmap_unlock(pte - 1, ptl);
+	mmu_notifier_invalidate_range_end(mm, start, end);
 }
 
 static int try_to_unmap_anon(struct page *page, int migration)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 04 of 11] free-pgtables
  2008-05-02 15:05 [PATCH 00 of 11] mmu notifier #v15 Andrea Arcangeli
                   ` (2 preceding siblings ...)
  2008-05-02 15:05 ` [PATCH 03 of 11] invalidate_page outside PT lock Andrea Arcangeli
@ 2008-05-02 15:05 ` Andrea Arcangeli
  2008-05-02 15:05 ` [PATCH 05 of 11] unmap vmas tlb flushing Andrea Arcangeli
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 24+ messages in thread
From: Andrea Arcangeli @ 2008-05-02 15:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Jack Steiner, Robin Holt, Nick Piggin,
	Peter Zijlstra, kvm-devel, Kanoj Sarcar, Roland Dreier,
	Steve Wise, linux-kernel, Avi Kivity, linux-mm, general,
	Hugh Dickins, akpm, Rusty Russell, Anthony Liguori, Chris Wright,
	Marcelo Tosatti, Eric Dumazet, Paul E. McKenney

# HG changeset patch
# User Andrea Arcangeli <andrea@qumranet.com>
# Date 1209740185 -7200
# Node ID 14e9f5a12bb1657fa6756e18d5dac71d4ad1a55e
# Parent  ea8fc9187b6d3ef2742061b4f62598afe55281cf
free-pgtables

Move the tlb flushing into free_pgtables. The conversion of the locks
taken for reverse map scanning would require taking sleeping locks
in free_pgtables() and we cannot sleep while gathering pages for a tlb
flush.

Move the tlb_gather/tlb_finish call to free_pgtables() to be done
for each vma. This may add a number of tlb flushes depending on the
number of vmas that cannot be coalesced into one.

The first pointer argument to free_pgtables() can then be dropped.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -772,8 +772,8 @@ int walk_page_range(const struct mm_stru
 		    void *private);
 void free_pgd_range(struct mmu_gather **tlb, unsigned long addr,
 		unsigned long end, unsigned long floor, unsigned long ceiling);
-void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *start_vma,
-		unsigned long floor, unsigned long ceiling);
+void free_pgtables(struct vm_area_struct *start_vma, unsigned long floor,
+						unsigned long ceiling);
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
 			struct vm_area_struct *vma);
 void unmap_mapping_range(struct address_space *mapping,
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -272,9 +272,11 @@ void free_pgd_range(struct mmu_gather **
 	} while (pgd++, addr = next, addr != end);
 }
 
-void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma,
-		unsigned long floor, unsigned long ceiling)
+void free_pgtables(struct vm_area_struct *vma, unsigned long floor,
+							unsigned long ceiling)
 {
+	struct mmu_gather *tlb;
+
 	while (vma) {
 		struct vm_area_struct *next = vma->vm_next;
 		unsigned long addr = vma->vm_start;
@@ -286,7 +288,8 @@ void free_pgtables(struct mmu_gather **t
 		unlink_file_vma(vma);
 
 		if (is_vm_hugetlb_page(vma)) {
-			hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
+			tlb = tlb_gather_mmu(vma->vm_mm, 0);
+			hugetlb_free_pgd_range(&tlb, addr, vma->vm_end,
 				floor, next? next->vm_start: ceiling);
 		} else {
 			/*
@@ -299,9 +302,11 @@ void free_pgtables(struct mmu_gather **t
 				anon_vma_unlink(vma);
 				unlink_file_vma(vma);
 			}
-			free_pgd_range(tlb, addr, vma->vm_end,
+			tlb = tlb_gather_mmu(vma->vm_mm, 0);
+			free_pgd_range(&tlb, addr, vma->vm_end,
 				floor, next? next->vm_start: ceiling);
 		}
+		tlb_finish_mmu(tlb, addr, vma->vm_end);
 		vma = next;
 	}
 }
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1759,9 +1759,9 @@ static void unmap_region(struct mm_struc
 	update_hiwater_rss(mm);
 	unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
-	free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
+	tlb_finish_mmu(tlb, start, end);
+	free_pgtables(vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
 				 next? next->vm_start: 0);
-	tlb_finish_mmu(tlb, start, end);
 }
 
 /*
@@ -2060,8 +2060,8 @@ void exit_mmap(struct mm_struct *mm)
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
-	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
 	tlb_finish_mmu(tlb, 0, end);
+	free_pgtables(vma, FIRST_USER_ADDRESS, 0);
 
 	/*
 	 * Walk the list again, actually closing and freeing it,

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 05 of 11] unmap vmas tlb flushing
  2008-05-02 15:05 [PATCH 00 of 11] mmu notifier #v15 Andrea Arcangeli
                   ` (3 preceding siblings ...)
  2008-05-02 15:05 ` [PATCH 04 of 11] free-pgtables Andrea Arcangeli
@ 2008-05-02 15:05 ` Andrea Arcangeli
  2008-05-02 15:05 ` [PATCH 06 of 11] rwsem contended Andrea Arcangeli
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 24+ messages in thread
From: Andrea Arcangeli @ 2008-05-02 15:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Jack Steiner, Robin Holt, Nick Piggin,
	Peter Zijlstra, kvm-devel, Kanoj Sarcar, Roland Dreier,
	Steve Wise, linux-kernel, Avi Kivity, linux-mm, general,
	Hugh Dickins, akpm, Rusty Russell, Anthony Liguori, Chris Wright,
	Marcelo Tosatti, Eric Dumazet, Paul E. McKenney

# HG changeset patch
# User Andrea Arcangeli <andrea@qumranet.com>
# Date 1209740186 -7200
# Node ID a8ac53b928dfcea0ccb326fb7d71f908f0df85f4
# Parent  14e9f5a12bb1657fa6756e18d5dac71d4ad1a55e
unmap vmas tlb flushing

Move the tlb flushing inside of unmap vmas. This saves us from passing
a pointer to the TLB structure around and simplifies the callers.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -744,8 +744,7 @@ struct page *vm_normal_page(struct vm_ar
 
 unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
 		unsigned long size, struct zap_details *);
-unsigned long unmap_vmas(struct mmu_gather **tlb,
-		struct vm_area_struct *start_vma, unsigned long start_addr,
+unsigned long unmap_vmas(struct vm_area_struct *start_vma, unsigned long start_addr,
 		unsigned long end_addr, unsigned long *nr_accounted,
 		struct zap_details *);
 
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -849,7 +849,6 @@ static unsigned long unmap_page_range(st
 
 /**
  * unmap_vmas - unmap a range of memory covered by a list of vma's
- * @tlbp: address of the caller's struct mmu_gather
  * @vma: the starting vma
  * @start_addr: virtual address at which to start unmapping
  * @end_addr: virtual address at which to end unmapping
@@ -861,20 +860,13 @@ static unsigned long unmap_page_range(st
  * Unmap all pages in the vma list.
  *
  * We aim to not hold locks for too long (for scheduling latency reasons).
- * So zap pages in ZAP_BLOCK_SIZE bytecounts.  This means we need to
- * return the ending mmu_gather to the caller.
+ * So zap pages in ZAP_BLOCK_SIZE bytecounts.
  *
  * Only addresses between `start' and `end' will be unmapped.
  *
  * The VMA list must be sorted in ascending virtual address order.
- *
- * unmap_vmas() assumes that the caller will flush the whole unmapped address
- * range after unmap_vmas() returns.  So the only responsibility here is to
- * ensure that any thus-far unmapped pages are flushed before unmap_vmas()
- * drops the lock and schedules.
  */
-unsigned long unmap_vmas(struct mmu_gather **tlbp,
-		struct vm_area_struct *vma, unsigned long start_addr,
+unsigned long unmap_vmas(struct vm_area_struct *vma, unsigned long start_addr,
 		unsigned long end_addr, unsigned long *nr_accounted,
 		struct zap_details *details)
 {
@@ -883,9 +875,14 @@ unsigned long unmap_vmas(struct mmu_gath
 	int tlb_start_valid = 0;
 	unsigned long start = start_addr;
 	spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
-	int fullmm = (*tlbp)->fullmm;
+	int fullmm;
+	struct mmu_gather *tlb;
 	struct mm_struct *mm = vma->vm_mm;
 
+	lru_add_drain();
+	tlb = tlb_gather_mmu(mm, 0);
+	update_hiwater_rss(mm);
+	fullmm = tlb->fullmm;
 	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
 		unsigned long end;
@@ -912,7 +909,7 @@ unsigned long unmap_vmas(struct mmu_gath
 						(HPAGE_SIZE / PAGE_SIZE);
 				start = end;
 			} else
-				start = unmap_page_range(*tlbp, vma,
+				start = unmap_page_range(tlb, vma,
 						start, end, &zap_work, details);
 
 			if (zap_work > 0) {
@@ -920,22 +917,23 @@ unsigned long unmap_vmas(struct mmu_gath
 				break;
 			}
 
-			tlb_finish_mmu(*tlbp, tlb_start, start);
+			tlb_finish_mmu(tlb, tlb_start, start);
 
 			if (need_resched() ||
 				(i_mmap_lock && spin_needbreak(i_mmap_lock))) {
 				if (i_mmap_lock) {
-					*tlbp = NULL;
+					tlb = NULL;
 					goto out;
 				}
 				cond_resched();
 			}
 
-			*tlbp = tlb_gather_mmu(vma->vm_mm, fullmm);
+			tlb = tlb_gather_mmu(vma->vm_mm, fullmm);
 			tlb_start_valid = 0;
 			zap_work = ZAP_BLOCK_SIZE;
 		}
 	}
+	tlb_finish_mmu(tlb, start_addr, end_addr);
 out:
 	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
 	return start;	/* which is now the end (or restart) address */
@@ -951,18 +949,10 @@ unsigned long zap_page_range(struct vm_a
 unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
 		unsigned long size, struct zap_details *details)
 {
-	struct mm_struct *mm = vma->vm_mm;
-	struct mmu_gather *tlb;
 	unsigned long end = address + size;
 	unsigned long nr_accounted = 0;
 
-	lru_add_drain();
-	tlb = tlb_gather_mmu(mm, 0);
-	update_hiwater_rss(mm);
-	end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
-	if (tlb)
-		tlb_finish_mmu(tlb, address, end);
-	return end;
+	return unmap_vmas(vma, address, end, &nr_accounted, details);
 }
 
 /*
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1751,15 +1751,10 @@ static void unmap_region(struct mm_struc
 		unsigned long start, unsigned long end)
 {
 	struct vm_area_struct *next = prev? prev->vm_next: mm->mmap;
-	struct mmu_gather *tlb;
 	unsigned long nr_accounted = 0;
 
-	lru_add_drain();
-	tlb = tlb_gather_mmu(mm, 0);
-	update_hiwater_rss(mm);
-	unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
+	unmap_vmas(vma, start, end, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
-	tlb_finish_mmu(tlb, start, end);
 	free_pgtables(vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
 				 next? next->vm_start: 0);
 }
@@ -2044,7 +2039,6 @@ EXPORT_SYMBOL(do_brk);
 /* Release all mmaps. */
 void exit_mmap(struct mm_struct *mm)
 {
-	struct mmu_gather *tlb;
 	struct vm_area_struct *vma = mm->mmap;
 	unsigned long nr_accounted = 0;
 	unsigned long end;
@@ -2055,12 +2049,11 @@ void exit_mmap(struct mm_struct *mm)
 
 	lru_add_drain();
 	flush_cache_mm(mm);
-	tlb = tlb_gather_mmu(mm, 1);
+
 	/* Don't update_hiwater_rss(mm) here, do_exit already did */
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
-	end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
+	end = unmap_vmas(vma, 0, -1, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
-	tlb_finish_mmu(tlb, 0, end);
 	free_pgtables(vma, FIRST_USER_ADDRESS, 0);
 
 	/*

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 06 of 11] rwsem contended
  2008-05-02 15:05 [PATCH 00 of 11] mmu notifier #v15 Andrea Arcangeli
                   ` (4 preceding siblings ...)
  2008-05-02 15:05 ` [PATCH 05 of 11] unmap vmas tlb flushing Andrea Arcangeli
@ 2008-05-02 15:05 ` Andrea Arcangeli
  2008-05-02 15:05 ` [PATCH 07 of 11] i_mmap_rwsem Andrea Arcangeli
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 24+ messages in thread
From: Andrea Arcangeli @ 2008-05-02 15:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Jack Steiner, Robin Holt, Nick Piggin,
	Peter Zijlstra, kvm-devel, Kanoj Sarcar, Roland Dreier,
	Steve Wise, linux-kernel, Avi Kivity, linux-mm, general,
	Hugh Dickins, akpm, Rusty Russell, Anthony Liguori, Chris Wright,
	Marcelo Tosatti, Eric Dumazet, Paul E. McKenney

# HG changeset patch
# User Andrea Arcangeli <andrea@qumranet.com>
# Date 1209740186 -7200
# Node ID 74b873f3ea07012e2fc864f203edf1179865feb1
# Parent  a8ac53b928dfcea0ccb326fb7d71f908f0df85f4
rwsem contended

Add a function to rw_semaphores to check if there are any processes
waiting for the semaphore. Add rwsem_needbreak to sched.h that works
in the same way as spinlock_needbreak().

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -57,6 +57,8 @@ extern void up_write(struct rw_semaphore
  */
 extern void downgrade_write(struct rw_semaphore *sem);
 
+extern int rwsem_is_contended(struct rw_semaphore *sem);
+
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 /*
  * nested locking. NOTE: rwsems are not allowed to recurse
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2030,6 +2030,15 @@ static inline int spin_needbreak(spinloc
 #endif
 }
 
+static inline int rwsem_needbreak(struct rw_semaphore *sem)
+{
+#ifdef CONFIG_PREEMPT
+	return rwsem_is_contended(sem);
+#else
+	return 0;
+#endif
+}
+
 /*
  * Reevaluate whether the task has signals pending delivery.
  * Wake the task if so.
diff --git a/lib/rwsem-spinlock.c b/lib/rwsem-spinlock.c
--- a/lib/rwsem-spinlock.c
+++ b/lib/rwsem-spinlock.c
@@ -305,6 +305,18 @@ void __downgrade_write(struct rw_semapho
 	spin_unlock_irqrestore(&sem->wait_lock, flags);
 }
 
+int rwsem_is_contended(struct rw_semaphore *sem)
+{
+	/*
+	 * Racy check for an empty list. False positives or negatives
+	 * would be okay. False positive may cause a useless dropping of
+	 * locks. False negatives may cause locks to be held a bit
+	 * longer until the next check.
+	 */
+	return !list_empty(&sem->wait_list);
+}
+
+EXPORT_SYMBOL(rwsem_is_contended);
 EXPORT_SYMBOL(__init_rwsem);
 EXPORT_SYMBOL(__down_read);
 EXPORT_SYMBOL(__down_read_trylock);
diff --git a/lib/rwsem.c b/lib/rwsem.c
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -251,6 +251,18 @@ asmregparm struct rw_semaphore *rwsem_do
 	return sem;
 }
 
+int rwsem_is_contended(struct rw_semaphore *sem)
+{
+	/*
+	 * Racy check for an empty list. False positives or negatives
+	 * would be okay. False positive may cause a useless dropping of
+	 * locks. False negatives may cause locks to be held a bit
+	 * longer until the next check.
+	 */
+	return !list_empty(&sem->wait_list);
+}
+
+EXPORT_SYMBOL(rwsem_is_contended);
 EXPORT_SYMBOL(rwsem_down_read_failed);
 EXPORT_SYMBOL(rwsem_down_write_failed);
 EXPORT_SYMBOL(rwsem_wake);

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 07 of 11] i_mmap_rwsem
  2008-05-02 15:05 [PATCH 00 of 11] mmu notifier #v15 Andrea Arcangeli
                   ` (5 preceding siblings ...)
  2008-05-02 15:05 ` [PATCH 06 of 11] rwsem contended Andrea Arcangeli
@ 2008-05-02 15:05 ` Andrea Arcangeli
  2008-05-02 15:05 ` [PATCH 08 of 11] anon-vma-rwsem Andrea Arcangeli
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 24+ messages in thread
From: Andrea Arcangeli @ 2008-05-02 15:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Jack Steiner, Robin Holt, Nick Piggin,
	Peter Zijlstra, kvm-devel, Kanoj Sarcar, Roland Dreier,
	Steve Wise, linux-kernel, Avi Kivity, linux-mm, general,
	Hugh Dickins, akpm, Rusty Russell, Anthony Liguori, Chris Wright,
	Marcelo Tosatti, Eric Dumazet, Paul E. McKenney

# HG changeset patch
# User Andrea Arcangeli <andrea@qumranet.com>
# Date 1209740186 -7200
# Node ID de28c85baef11b90c993047ca851a2f52c85a5be
# Parent  74b873f3ea07012e2fc864f203edf1179865feb1
i_mmap_rwsem

The conversion to a rwsem allows notifier callbacks during rmap traversal
for files. A rw style lock also allows concurrent walking of the
reverse map so that multiple processors can expire pages in the same memory
area of the same process. So it increases the potential concurrency.

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>

diff --git a/Documentation/vm/locking b/Documentation/vm/locking
--- a/Documentation/vm/locking
+++ b/Documentation/vm/locking
@@ -66,7 +66,7 @@ expand_stack(), it is hard to come up wi
 expand_stack(), it is hard to come up with a destructive scenario without 
 having the vmlist protection in this case.
 
-The page_table_lock nests with the inode i_mmap_lock and the kmem cache
+The page_table_lock nests with the inode i_mmap_sem and the kmem cache
 c_spinlock spinlocks.  This is okay, since the kmem code asks for pages after
 dropping c_spinlock.  The page_table_lock also nests with pagecache_lock and
 pagemap_lru_lock spinlocks, and no code asks for memory with these locks
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -69,7 +69,7 @@ static void huge_pmd_share(struct mm_str
 	if (!vma_shareable(vma, addr))
 		return;
 
-	spin_lock(&mapping->i_mmap_lock);
+	down_read(&mapping->i_mmap_sem);
 	vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
 		if (svma == vma)
 			continue;
@@ -94,7 +94,7 @@ static void huge_pmd_share(struct mm_str
 		put_page(virt_to_page(spte));
 	spin_unlock(&mm->page_table_lock);
 out:
-	spin_unlock(&mapping->i_mmap_lock);
+	up_read(&mapping->i_mmap_sem);
 }
 
 /*
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -454,10 +454,10 @@ static int hugetlb_vmtruncate(struct ino
 	pgoff = offset >> PAGE_SHIFT;
 
 	i_size_write(inode, offset);
-	spin_lock(&mapping->i_mmap_lock);
+	down_read(&mapping->i_mmap_sem);
 	if (!prio_tree_empty(&mapping->i_mmap))
 		hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
-	spin_unlock(&mapping->i_mmap_lock);
+	up_read(&mapping->i_mmap_sem);
 	truncate_hugepages(inode, offset);
 	return 0;
 }
diff --git a/fs/inode.c b/fs/inode.c
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -210,7 +210,7 @@ void inode_init_once(struct inode *inode
 	INIT_LIST_HEAD(&inode->i_devices);
 	INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
 	rwlock_init(&inode->i_data.tree_lock);
-	spin_lock_init(&inode->i_data.i_mmap_lock);
+	init_rwsem(&inode->i_data.i_mmap_sem);
 	INIT_LIST_HEAD(&inode->i_data.private_list);
 	spin_lock_init(&inode->i_data.private_lock);
 	INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
diff --git a/include/linux/fs.h b/include/linux/fs.h
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -502,7 +502,7 @@ struct address_space {
 	unsigned int		i_mmap_writable;/* count VM_SHARED mappings */
 	struct prio_tree_root	i_mmap;		/* tree of private and shared mappings */
 	struct list_head	i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
-	spinlock_t		i_mmap_lock;	/* protect tree, count, list */
+	struct rw_semaphore	i_mmap_sem;	/* protect tree, count, list */
 	unsigned int		truncate_count;	/* Cover race condition with truncate */
 	unsigned long		nrpages;	/* number of total pages */
 	pgoff_t			writeback_index;/* writeback starts here */
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -735,7 +735,7 @@ struct zap_details {
 	struct address_space *check_mapping;	/* Check page->mapping if set */
 	pgoff_t	first_index;			/* Lowest page->index to unmap */
 	pgoff_t last_index;			/* Highest page->index to unmap */
-	spinlock_t *i_mmap_lock;		/* For unmap_mapping_range: */
+	struct rw_semaphore *i_mmap_sem;	/* For unmap_mapping_range: */
 	unsigned long truncate_count;		/* Compare vm_truncate_count */
 };
 
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -297,12 +297,12 @@ static int dup_mmap(struct mm_struct *mm
 				atomic_dec(&inode->i_writecount);
 
 			/* insert tmp into the share list, just after mpnt */
-			spin_lock(&file->f_mapping->i_mmap_lock);
+			down_write(&file->f_mapping->i_mmap_sem);
 			tmp->vm_truncate_count = mpnt->vm_truncate_count;
 			flush_dcache_mmap_lock(file->f_mapping);
 			vma_prio_tree_add(tmp, mpnt);
 			flush_dcache_mmap_unlock(file->f_mapping);
-			spin_unlock(&file->f_mapping->i_mmap_lock);
+			up_write(&file->f_mapping->i_mmap_sem);
 		}
 
 		/*
diff --git a/mm/filemap.c b/mm/filemap.c
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -61,16 +61,16 @@ generic_file_direct_IO(int rw, struct ki
 /*
  * Lock ordering:
  *
- *  ->i_mmap_lock		(vmtruncate)
+ *  ->i_mmap_sem		(vmtruncate)
  *    ->private_lock		(__free_pte->__set_page_dirty_buffers)
  *      ->swap_lock		(exclusive_swap_page, others)
  *        ->mapping->tree_lock
  *
  *  ->i_mutex
- *    ->i_mmap_lock		(truncate->unmap_mapping_range)
+ *    ->i_mmap_sem		(truncate->unmap_mapping_range)
  *
  *  ->mmap_sem
- *    ->i_mmap_lock
+ *    ->i_mmap_sem
  *      ->page_table_lock or pte_lock	(various, mainly in memory.c)
  *        ->mapping->tree_lock	(arch-dependent flush_dcache_mmap_lock)
  *
@@ -87,7 +87,7 @@ generic_file_direct_IO(int rw, struct ki
  *    ->sb_lock			(fs/fs-writeback.c)
  *    ->mapping->tree_lock	(__sync_single_inode)
  *
- *  ->i_mmap_lock
+ *  ->i_mmap_sem
  *    ->anon_vma.lock		(vma_adjust)
  *
  *  ->anon_vma.lock
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -178,7 +178,7 @@ __xip_unmap (struct address_space * mapp
 	if (!page)
 		return;
 
-	spin_lock(&mapping->i_mmap_lock);
+	down_read(&mapping->i_mmap_sem);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		mm = vma->vm_mm;
 		address = vma->vm_start +
@@ -198,7 +198,7 @@ __xip_unmap (struct address_space * mapp
 			page_cache_release(page);
 		}
 	}
-	spin_unlock(&mapping->i_mmap_lock);
+	up_read(&mapping->i_mmap_sem);
 }
 
 /*
diff --git a/mm/fremap.c b/mm/fremap.c
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -206,13 +206,13 @@ asmlinkage long sys_remap_file_pages(uns
 			}
 			goto out;
 		}
-		spin_lock(&mapping->i_mmap_lock);
+		down_write(&mapping->i_mmap_sem);
 		flush_dcache_mmap_lock(mapping);
 		vma->vm_flags |= VM_NONLINEAR;
 		vma_prio_tree_remove(vma, &mapping->i_mmap);
 		vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear);
 		flush_dcache_mmap_unlock(mapping);
-		spin_unlock(&mapping->i_mmap_lock);
+		up_write(&mapping->i_mmap_sem);
 	}
 
 	mmu_notifier_invalidate_range_start(mm, start, start + size);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -814,7 +814,7 @@ void __unmap_hugepage_range(struct vm_ar
 	struct page *page;
 	struct page *tmp;
 	/*
-	 * A page gathering list, protected by per file i_mmap_lock. The
+	 * A page gathering list, protected by per file i_mmap_sem. The
 	 * lock is used to avoid list corruption from multiple unmapping
 	 * of the same page since we are using page->lru.
 	 */
@@ -864,9 +864,9 @@ void unmap_hugepage_range(struct vm_area
 	 * do nothing in this case.
 	 */
 	if (vma->vm_file) {
-		spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
+		down_write(&vma->vm_file->f_mapping->i_mmap_sem);
 		__unmap_hugepage_range(vma, start, end);
-		spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
+		up_write(&vma->vm_file->f_mapping->i_mmap_sem);
 	}
 }
 
@@ -1111,7 +1111,7 @@ void hugetlb_change_protection(struct vm
 	BUG_ON(address >= end);
 	flush_cache_range(vma, address, end);
 
-	spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
+	down_write(&vma->vm_file->f_mapping->i_mmap_sem);
 	spin_lock(&mm->page_table_lock);
 	for (; address < end; address += HPAGE_SIZE) {
 		ptep = huge_pte_offset(mm, address);
@@ -1126,7 +1126,7 @@ void hugetlb_change_protection(struct vm
 		}
 	}
 	spin_unlock(&mm->page_table_lock);
-	spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
+	up_write(&vma->vm_file->f_mapping->i_mmap_sem);
 
 	flush_tlb_range(vma, start, end);
 }
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -874,7 +874,7 @@ unsigned long unmap_vmas(struct vm_area_
 	unsigned long tlb_start = 0;	/* For tlb_finish_mmu */
 	int tlb_start_valid = 0;
 	unsigned long start = start_addr;
-	spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
+	struct rw_semaphore *i_mmap_sem = details? details->i_mmap_sem: NULL;
 	int fullmm;
 	struct mmu_gather *tlb;
 	struct mm_struct *mm = vma->vm_mm;
@@ -920,8 +920,8 @@ unsigned long unmap_vmas(struct vm_area_
 			tlb_finish_mmu(tlb, tlb_start, start);
 
 			if (need_resched() ||
-				(i_mmap_lock && spin_needbreak(i_mmap_lock))) {
-				if (i_mmap_lock) {
+				(i_mmap_sem && rwsem_needbreak(i_mmap_sem))) {
+				if (i_mmap_sem) {
 					tlb = NULL;
 					goto out;
 				}
@@ -1829,7 +1829,7 @@ unwritable_page:
 /*
  * Helper functions for unmap_mapping_range().
  *
- * __ Notes on dropping i_mmap_lock to reduce latency while unmapping __
+ * __ Notes on dropping i_mmap_sem to reduce latency while unmapping __
  *
  * We have to restart searching the prio_tree whenever we drop the lock,
  * since the iterator is only valid while the lock is held, and anyway
@@ -1848,7 +1848,7 @@ unwritable_page:
  * can't efficiently keep all vmas in step with mapping->truncate_count:
  * so instead reset them all whenever it wraps back to 0 (then go to 1).
  * mapping->truncate_count and vma->vm_truncate_count are protected by
- * i_mmap_lock.
+ * i_mmap_sem.
  *
  * In order to make forward progress despite repeatedly restarting some
  * large vma, note the restart_addr from unmap_vmas when it breaks out:
@@ -1898,7 +1898,7 @@ again:
 
 	restart_addr = zap_page_range(vma, start_addr,
 					end_addr - start_addr, details);
-	need_break = need_resched() || spin_needbreak(details->i_mmap_lock);
+	need_break = need_resched() || rwsem_needbreak(details->i_mmap_sem);
 
 	if (restart_addr >= end_addr) {
 		/* We have now completed this vma: mark it so */
@@ -1912,9 +1912,9 @@ again:
 			goto again;
 	}
 
-	spin_unlock(details->i_mmap_lock);
+	up_write(details->i_mmap_sem);
 	cond_resched();
-	spin_lock(details->i_mmap_lock);
+	down_write(details->i_mmap_sem);
 	return -EINTR;
 }
 
@@ -2008,9 +2008,9 @@ void unmap_mapping_range(struct address_
 	details.last_index = hba + hlen - 1;
 	if (details.last_index < details.first_index)
 		details.last_index = ULONG_MAX;
-	details.i_mmap_lock = &mapping->i_mmap_lock;
+	details.i_mmap_sem = &mapping->i_mmap_sem;
 
-	spin_lock(&mapping->i_mmap_lock);
+	down_write(&mapping->i_mmap_sem);
 
 	/* Protect against endless unmapping loops */
 	mapping->truncate_count++;
@@ -2025,7 +2025,7 @@ void unmap_mapping_range(struct address_
 		unmap_mapping_range_tree(&mapping->i_mmap, &details);
 	if (unlikely(!list_empty(&mapping->i_mmap_nonlinear)))
 		unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details);
-	spin_unlock(&mapping->i_mmap_lock);
+	up_write(&mapping->i_mmap_sem);
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
diff --git a/mm/migrate.c b/mm/migrate.c
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -211,12 +211,12 @@ static void remove_file_migration_ptes(s
 	if (!mapping)
 		return;
 
-	spin_lock(&mapping->i_mmap_lock);
+	down_read(&mapping->i_mmap_sem);
 
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff)
 		remove_migration_pte(vma, old, new);
 
-	spin_unlock(&mapping->i_mmap_lock);
+	up_read(&mapping->i_mmap_sem);
 }
 
 /*
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -189,7 +189,7 @@ error:
 }
 
 /*
- * Requires inode->i_mapping->i_mmap_lock
+ * Requires inode->i_mapping->i_mmap_sem
  */
 static void __remove_shared_vm_struct(struct vm_area_struct *vma,
 		struct file *file, struct address_space *mapping)
@@ -217,9 +217,9 @@ void unlink_file_vma(struct vm_area_stru
 
 	if (file) {
 		struct address_space *mapping = file->f_mapping;
-		spin_lock(&mapping->i_mmap_lock);
+		down_write(&mapping->i_mmap_sem);
 		__remove_shared_vm_struct(vma, file, mapping);
-		spin_unlock(&mapping->i_mmap_lock);
+		up_write(&mapping->i_mmap_sem);
 	}
 }
 
@@ -445,7 +445,7 @@ static void vma_link(struct mm_struct *m
 		mapping = vma->vm_file->f_mapping;
 
 	if (mapping) {
-		spin_lock(&mapping->i_mmap_lock);
+		down_write(&mapping->i_mmap_sem);
 		vma->vm_truncate_count = mapping->truncate_count;
 	}
 	anon_vma_lock(vma);
@@ -455,7 +455,7 @@ static void vma_link(struct mm_struct *m
 
 	anon_vma_unlock(vma);
 	if (mapping)
-		spin_unlock(&mapping->i_mmap_lock);
+		up_write(&mapping->i_mmap_sem);
 
 	mm->map_count++;
 	validate_mm(mm);
@@ -542,7 +542,7 @@ again:			remove_next = 1 + (end > next->
 		mapping = file->f_mapping;
 		if (!(vma->vm_flags & VM_NONLINEAR))
 			root = &mapping->i_mmap;
-		spin_lock(&mapping->i_mmap_lock);
+		down_write(&mapping->i_mmap_sem);
 		if (importer &&
 		    vma->vm_truncate_count != next->vm_truncate_count) {
 			/*
@@ -626,7 +626,7 @@ again:			remove_next = 1 + (end > next->
 	if (anon_vma)
 		spin_unlock(&anon_vma->lock);
 	if (mapping)
-		spin_unlock(&mapping->i_mmap_lock);
+		up_write(&mapping->i_mmap_sem);
 
 	if (remove_next) {
 		if (file) {
@@ -2068,7 +2068,7 @@ void exit_mmap(struct mm_struct *mm)
 
 /* Insert vm structure into process list sorted by address
  * and into the inode's i_mmap tree.  If vm_file is non-NULL
- * then i_mmap_lock is taken here.
+ * then i_mmap_sem is taken here.
  */
 int insert_vm_struct(struct mm_struct * mm, struct vm_area_struct * vma)
 {
diff --git a/mm/mremap.c b/mm/mremap.c
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -88,7 +88,7 @@ static void move_ptes(struct vm_area_str
 		 * and we propagate stale pages into the dst afterward.
 		 */
 		mapping = vma->vm_file->f_mapping;
-		spin_lock(&mapping->i_mmap_lock);
+		down_write(&mapping->i_mmap_sem);
 		if (new_vma->vm_truncate_count &&
 		    new_vma->vm_truncate_count != vma->vm_truncate_count)
 			new_vma->vm_truncate_count = 0;
@@ -120,7 +120,7 @@ static void move_ptes(struct vm_area_str
 	pte_unmap_nested(new_pte - 1);
 	pte_unmap_unlock(old_pte - 1, old_ptl);
 	if (mapping)
-		spin_unlock(&mapping->i_mmap_lock);
+		up_write(&mapping->i_mmap_sem);
 	mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);
 }
 
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -24,7 +24,7 @@
  *   inode->i_alloc_sem (vmtruncate_range)
  *   mm->mmap_sem
  *     page->flags PG_locked (lock_page)
- *       mapping->i_mmap_lock
+ *       mapping->i_mmap_sem
  *         anon_vma->lock
  *           mm->page_table_lock or pte_lock
  *             zone->lru_lock (in mark_page_accessed, isolate_lru_page)
@@ -373,14 +373,14 @@ static int page_referenced_file(struct p
 	 * The page lock not only makes sure that page->mapping cannot
 	 * suddenly be NULLified by truncation, it makes sure that the
 	 * structure at mapping cannot be freed and reused yet,
-	 * so we can safely take mapping->i_mmap_lock.
+	 * so we can safely take mapping->i_mmap_sem.
 	 */
 	BUG_ON(!PageLocked(page));
 
-	spin_lock(&mapping->i_mmap_lock);
+	down_read(&mapping->i_mmap_sem);
 
 	/*
-	 * i_mmap_lock does not stabilize mapcount at all, but mapcount
+	 * i_mmap_sem does not stabilize mapcount at all, but mapcount
 	 * is more likely to be accurate if we note it after spinning.
 	 */
 	mapcount = page_mapcount(page);
@@ -403,7 +403,7 @@ static int page_referenced_file(struct p
 			break;
 	}
 
-	spin_unlock(&mapping->i_mmap_lock);
+	up_read(&mapping->i_mmap_sem);
 	return referenced;
 }
 
@@ -490,12 +490,12 @@ static int page_mkclean_file(struct addr
 
 	BUG_ON(PageAnon(page));
 
-	spin_lock(&mapping->i_mmap_lock);
+	down_read(&mapping->i_mmap_sem);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		if (vma->vm_flags & VM_SHARED)
 			ret += page_mkclean_one(page, vma);
 	}
-	spin_unlock(&mapping->i_mmap_lock);
+	up_read(&mapping->i_mmap_sem);
 	return ret;
 }
 
@@ -930,7 +930,7 @@ static int try_to_unmap_file(struct page
 	unsigned long max_nl_size = 0;
 	unsigned int mapcount;
 
-	spin_lock(&mapping->i_mmap_lock);
+	down_read(&mapping->i_mmap_sem);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		ret = try_to_unmap_one(page, vma, migration);
 		if (ret == SWAP_FAIL || !page_mapped(page))
@@ -967,7 +967,6 @@ static int try_to_unmap_file(struct page
 	mapcount = page_mapcount(page);
 	if (!mapcount)
 		goto out;
-	cond_resched_lock(&mapping->i_mmap_lock);
 
 	max_nl_size = (max_nl_size + CLUSTER_SIZE - 1) & CLUSTER_MASK;
 	if (max_nl_cursor == 0)
@@ -989,7 +988,6 @@ static int try_to_unmap_file(struct page
 			}
 			vma->vm_private_data = (void *) max_nl_cursor;
 		}
-		cond_resched_lock(&mapping->i_mmap_lock);
 		max_nl_cursor += CLUSTER_SIZE;
 	} while (max_nl_cursor <= max_nl_size);
 
@@ -1001,7 +999,7 @@ static int try_to_unmap_file(struct page
 	list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list)
 		vma->vm_private_data = NULL;
 out:
-	spin_unlock(&mapping->i_mmap_lock);
+	up_read(&mapping->i_mmap_sem);
 	return ret;
 }
 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 08 of 11] anon-vma-rwsem
  2008-05-02 15:05 [PATCH 00 of 11] mmu notifier #v15 Andrea Arcangeli
                   ` (6 preceding siblings ...)
  2008-05-02 15:05 ` [PATCH 07 of 11] i_mmap_rwsem Andrea Arcangeli
@ 2008-05-02 15:05 ` Andrea Arcangeli
  2008-05-02 15:05 ` [PATCH 09 of 11] mm_lock-rwsem Andrea Arcangeli
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 24+ messages in thread
From: Andrea Arcangeli @ 2008-05-02 15:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Jack Steiner, Robin Holt, Nick Piggin,
	Peter Zijlstra, kvm-devel, Kanoj Sarcar, Roland Dreier,
	Steve Wise, linux-kernel, Avi Kivity, linux-mm, general,
	Hugh Dickins, akpm, Rusty Russell, Anthony Liguori, Chris Wright,
	Marcelo Tosatti, Eric Dumazet, Paul E. McKenney

# HG changeset patch
# User Andrea Arcangeli <andrea@qumranet.com>
# Date 1209740186 -7200
# Node ID 0be678c52e540d5f5d5fd9af549b57b9bb018d32
# Parent  de28c85baef11b90c993047ca851a2f52c85a5be
anon-vma-rwsem

Convert the anon_vma spinlock to a rw semaphore. This allows concurrent
traversal of reverse maps for try_to_unmap() and page_mkclean(). It also
allows the calling of sleeping functions from reverse map traversal as
needed for the notifier callbacks. It includes possible concurrency.

Rcu is used in some context to guarantee the presence of the anon_vma
(try_to_unmap) while we acquire the anon_vma lock. We cannot take a
semaphore within an rcu critical section. Add a refcount to the anon_vma
structure which allow us to give an existence guarantee for the anon_vma
structure independent of the spinlock or the list contents.

The refcount can then be taken within the RCU section. If it has been
taken successfully then the refcount guarantees the existence of the
anon_vma. The refcount in anon_vma also allows us to fix a nasty
issue in page migration where we fudged by using rcu for a long code
path to guarantee the existence of the anon_vma. I think this is a bug
because the anon_vma may become empty and get scheduled to be freed
but then we increase the refcount again when the migration entries are
removed.

The refcount in general allows a shortening of RCU critical sections since
we can do a rcu_unlock after taking the refcount. This is particularly
relevant if the anon_vma chains contain hundreds of entries.

However:
- Atomic overhead increases in situations where a new reference
  to the anon_vma has to be established or removed. Overhead also increases
  when a speculative reference is used (try_to_unmap,
  page_mkclean, page migration).
- There is the potential for more frequent processor change due to up_xxx
  letting waiting tasks run first. This results in f.e. the Aim9 brk
  performance test to got down by 10-15%.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -25,7 +25,8 @@
  * pointing to this anon_vma once its vma list is empty.
  */
 struct anon_vma {
-	spinlock_t lock;	/* Serialize access to vma list */
+	atomic_t refcount;	/* vmas on the list */
+	struct rw_semaphore sem;/* Serialize access to vma list */
 	struct list_head head;	/* List of private "related" vmas */
 };
 
@@ -43,18 +44,31 @@ static inline void anon_vma_free(struct 
 	kmem_cache_free(anon_vma_cachep, anon_vma);
 }
 
+struct anon_vma *grab_anon_vma(struct page *page);
+
+static inline void get_anon_vma(struct anon_vma *anon_vma)
+{
+	atomic_inc(&anon_vma->refcount);
+}
+
+static inline void put_anon_vma(struct anon_vma *anon_vma)
+{
+	if (atomic_dec_and_test(&anon_vma->refcount))
+		anon_vma_free(anon_vma);
+}
+
 static inline void anon_vma_lock(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
 	if (anon_vma)
-		spin_lock(&anon_vma->lock);
+		down_write(&anon_vma->sem);
 }
 
 static inline void anon_vma_unlock(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
 	if (anon_vma)
-		spin_unlock(&anon_vma->lock);
+		up_write(&anon_vma->sem);
 }
 
 /*
diff --git a/mm/migrate.c b/mm/migrate.c
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -235,15 +235,16 @@ static void remove_anon_migration_ptes(s
 		return;
 
 	/*
-	 * We hold the mmap_sem lock. So no need to call page_lock_anon_vma.
+	 * We hold either the mmap_sem lock or a reference on the
+	 * anon_vma. So no need to call page_lock_anon_vma.
 	 */
 	anon_vma = (struct anon_vma *) (mapping - PAGE_MAPPING_ANON);
-	spin_lock(&anon_vma->lock);
+	down_read(&anon_vma->sem);
 
 	list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
 		remove_migration_pte(vma, old, new);
 
-	spin_unlock(&anon_vma->lock);
+	up_read(&anon_vma->sem);
 }
 
 /*
@@ -630,7 +631,7 @@ static int unmap_and_move(new_page_t get
 	int rc = 0;
 	int *result = NULL;
 	struct page *newpage = get_new_page(page, private, &result);
-	int rcu_locked = 0;
+	struct anon_vma *anon_vma = NULL;
 	int charge = 0;
 
 	if (!newpage)
@@ -654,16 +655,14 @@ static int unmap_and_move(new_page_t get
 	}
 	/*
 	 * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
-	 * we cannot notice that anon_vma is freed while we migrates a page.
+	 * we cannot notice that anon_vma is freed while we migrate a page.
 	 * This rcu_read_lock() delays freeing anon_vma pointer until the end
 	 * of migration. File cache pages are no problem because of page_lock()
 	 * File Caches may use write_page() or lock_page() in migration, then,
 	 * just care Anon page here.
 	 */
-	if (PageAnon(page)) {
-		rcu_read_lock();
-		rcu_locked = 1;
-	}
+	if (PageAnon(page))
+		anon_vma = grab_anon_vma(page);
 
 	/*
 	 * Corner case handling:
@@ -681,10 +680,7 @@ static int unmap_and_move(new_page_t get
 		if (!PageAnon(page) && PagePrivate(page)) {
 			/*
 			 * Go direct to try_to_free_buffers() here because
-			 * a) that's what try_to_release_page() would do anyway
-			 * b) we may be under rcu_read_lock() here, so we can't
-			 *    use GFP_KERNEL which is what try_to_release_page()
-			 *    needs to be effective.
+			 * that's what try_to_release_page() would do anyway
 			 */
 			try_to_free_buffers(page);
 		}
@@ -705,8 +701,8 @@ static int unmap_and_move(new_page_t get
 	} else if (charge)
  		mem_cgroup_end_migration(newpage);
 rcu_unlock:
-	if (rcu_locked)
-		rcu_read_unlock();
+	if (anon_vma)
+		put_anon_vma(anon_vma);
 
 unlock:
 
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -570,7 +570,7 @@ again:			remove_next = 1 + (end > next->
 	if (vma->anon_vma)
 		anon_vma = vma->anon_vma;
 	if (anon_vma) {
-		spin_lock(&anon_vma->lock);
+		down_write(&anon_vma->sem);
 		/*
 		 * Easily overlooked: when mprotect shifts the boundary,
 		 * make sure the expanding vma has anon_vma set if the
@@ -624,7 +624,7 @@ again:			remove_next = 1 + (end > next->
 	}
 
 	if (anon_vma)
-		spin_unlock(&anon_vma->lock);
+		up_write(&anon_vma->sem);
 	if (mapping)
 		up_write(&mapping->i_mmap_sem);
 
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -69,7 +69,7 @@ int anon_vma_prepare(struct vm_area_stru
 		if (anon_vma) {
 			allocated = NULL;
 			locked = anon_vma;
-			spin_lock(&locked->lock);
+			down_write(&locked->sem);
 		} else {
 			anon_vma = anon_vma_alloc();
 			if (unlikely(!anon_vma))
@@ -81,6 +81,7 @@ int anon_vma_prepare(struct vm_area_stru
 		/* page_table_lock to protect against threads */
 		spin_lock(&mm->page_table_lock);
 		if (likely(!vma->anon_vma)) {
+			get_anon_vma(anon_vma);
 			vma->anon_vma = anon_vma;
 			list_add_tail(&vma->anon_vma_node, &anon_vma->head);
 			allocated = NULL;
@@ -88,7 +89,7 @@ int anon_vma_prepare(struct vm_area_stru
 		spin_unlock(&mm->page_table_lock);
 
 		if (locked)
-			spin_unlock(&locked->lock);
+			up_write(&locked->sem);
 		if (unlikely(allocated))
 			anon_vma_free(allocated);
 	}
@@ -99,14 +100,17 @@ void __anon_vma_merge(struct vm_area_str
 {
 	BUG_ON(vma->anon_vma != next->anon_vma);
 	list_del(&next->anon_vma_node);
+	put_anon_vma(vma->anon_vma);
 }
 
 void __anon_vma_link(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
 
-	if (anon_vma)
+	if (anon_vma) {
+		get_anon_vma(anon_vma);
 		list_add_tail(&vma->anon_vma_node, &anon_vma->head);
+	}
 }
 
 void anon_vma_link(struct vm_area_struct *vma)
@@ -114,36 +118,32 @@ void anon_vma_link(struct vm_area_struct
 	struct anon_vma *anon_vma = vma->anon_vma;
 
 	if (anon_vma) {
-		spin_lock(&anon_vma->lock);
+		get_anon_vma(anon_vma);
+		down_write(&anon_vma->sem);
 		list_add_tail(&vma->anon_vma_node, &anon_vma->head);
-		spin_unlock(&anon_vma->lock);
+		up_write(&anon_vma->sem);
 	}
 }
 
 void anon_vma_unlink(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
-	int empty;
 
 	if (!anon_vma)
 		return;
 
-	spin_lock(&anon_vma->lock);
+	down_write(&anon_vma->sem);
 	list_del(&vma->anon_vma_node);
-
-	/* We must garbage collect the anon_vma if it's empty */
-	empty = list_empty(&anon_vma->head);
-	spin_unlock(&anon_vma->lock);
-
-	if (empty)
-		anon_vma_free(anon_vma);
+	up_write(&anon_vma->sem);
+	put_anon_vma(anon_vma);
 }
 
 static void anon_vma_ctor(struct kmem_cache *cachep, void *data)
 {
 	struct anon_vma *anon_vma = data;
 
-	spin_lock_init(&anon_vma->lock);
+	init_rwsem(&anon_vma->sem);
+	atomic_set(&anon_vma->refcount, 0);
 	INIT_LIST_HEAD(&anon_vma->head);
 }
 
@@ -157,9 +157,9 @@ void __init anon_vma_init(void)
  * Getting a lock on a stable anon_vma from a page off the LRU is
  * tricky: page_lock_anon_vma rely on RCU to guard against the races.
  */
-static struct anon_vma *page_lock_anon_vma(struct page *page)
+struct anon_vma *grab_anon_vma(struct page *page)
 {
-	struct anon_vma *anon_vma;
+	struct anon_vma *anon_vma = NULL;
 	unsigned long anon_mapping;
 
 	rcu_read_lock();
@@ -170,17 +170,26 @@ static struct anon_vma *page_lock_anon_v
 		goto out;
 
 	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
-	spin_lock(&anon_vma->lock);
-	return anon_vma;
+	if (!atomic_inc_not_zero(&anon_vma->refcount))
+		anon_vma = NULL;
 out:
 	rcu_read_unlock();
-	return NULL;
+	return anon_vma;
+}
+
+static struct anon_vma *page_lock_anon_vma(struct page *page)
+{
+	struct anon_vma *anon_vma = grab_anon_vma(page);
+
+	if (anon_vma)
+		down_read(&anon_vma->sem);
+	return anon_vma;
 }
 
 static void page_unlock_anon_vma(struct anon_vma *anon_vma)
 {
-	spin_unlock(&anon_vma->lock);
-	rcu_read_unlock();
+	up_read(&anon_vma->sem);
+	put_anon_vma(anon_vma);
 }
 
 /*

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 09 of 11] mm_lock-rwsem
  2008-05-02 15:05 [PATCH 00 of 11] mmu notifier #v15 Andrea Arcangeli
                   ` (7 preceding siblings ...)
  2008-05-02 15:05 ` [PATCH 08 of 11] anon-vma-rwsem Andrea Arcangeli
@ 2008-05-02 15:05 ` Andrea Arcangeli
  2008-05-02 15:05 ` [PATCH 10 of 11] export zap_page_range for XPMEM Andrea Arcangeli
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 24+ messages in thread
From: Andrea Arcangeli @ 2008-05-02 15:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Jack Steiner, Robin Holt, Nick Piggin,
	Peter Zijlstra, kvm-devel, Kanoj Sarcar, Roland Dreier,
	Steve Wise, linux-kernel, Avi Kivity, linux-mm, general,
	Hugh Dickins, akpm, Rusty Russell, Anthony Liguori, Chris Wright,
	Marcelo Tosatti, Eric Dumazet, Paul E. McKenney

# HG changeset patch
# User Andrea Arcangeli <andrea@qumranet.com>
# Date 1209740226 -7200
# Node ID 721c3787cd42043734331e54a42eb20c51766f71
# Parent  0be678c52e540d5f5d5fd9af549b57b9bb018d32
mm_lock-rwsem

Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock
conversion.

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1084,10 +1084,10 @@ extern int install_special_mapping(struc
 				   unsigned long flags, struct page **pages);
 
 struct mm_lock_data {
-	spinlock_t **i_mmap_locks;
-	spinlock_t **anon_vma_locks;
-	size_t nr_i_mmap_locks;
-	size_t nr_anon_vma_locks;
+	struct rw_semaphore **i_mmap_sems;
+	struct rw_semaphore **anon_vma_sems;
+	size_t nr_i_mmap_sems;
+	size_t nr_anon_vma_sems;
 };
 extern int mm_lock(struct mm_struct *mm, struct mm_lock_data *data);
 extern void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data);
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2255,8 +2255,8 @@ int install_special_mapping(struct mm_st
 
 static int mm_lock_cmp(const void *a, const void *b)
 {
-	unsigned long _a = (unsigned long)*(spinlock_t **)a;
-	unsigned long _b = (unsigned long)*(spinlock_t **)b;
+	unsigned long _a = (unsigned long)*(struct rw_semaphore **)a;
+	unsigned long _b = (unsigned long)*(struct rw_semaphore **)b;
 
 	cond_resched();
 	if (_a < _b)
@@ -2266,7 +2266,7 @@ static int mm_lock_cmp(const void *a, co
 	return 0;
 }
 
-static unsigned long mm_lock_sort(struct mm_struct *mm, spinlock_t **locks,
+static unsigned long mm_lock_sort(struct mm_struct *mm, struct rw_semaphore **sems,
 				  int anon)
 {
 	struct vm_area_struct *vma;
@@ -2275,59 +2275,59 @@ static unsigned long mm_lock_sort(struct
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
 		if (anon) {
 			if (vma->anon_vma)
-				locks[i++] = &vma->anon_vma->lock;
+				sems[i++] = &vma->anon_vma->sem;
 		} else {
 			if (vma->vm_file && vma->vm_file->f_mapping)
-				locks[i++] = &vma->vm_file->f_mapping->i_mmap_lock;
+				sems[i++] = &vma->vm_file->f_mapping->i_mmap_sem;
 		}
 	}
 
 	if (!i)
 		goto out;
 
-	sort(locks, i, sizeof(spinlock_t *), mm_lock_cmp, NULL);
+	sort(sems, i, sizeof(struct rw_semaphore *), mm_lock_cmp, NULL);
 
 out:
 	return i;
 }
 
 static inline unsigned long mm_lock_sort_anon_vma(struct mm_struct *mm,
-						  spinlock_t **locks)
+						  struct rw_semaphore **sems)
 {
-	return mm_lock_sort(mm, locks, 1);
+	return mm_lock_sort(mm, sems, 1);
 }
 
 static inline unsigned long mm_lock_sort_i_mmap(struct mm_struct *mm,
-						spinlock_t **locks)
+						struct rw_semaphore **sems)
 {
-	return mm_lock_sort(mm, locks, 0);
+	return mm_lock_sort(mm, sems, 0);
 }
 
-static void mm_lock_unlock(spinlock_t **locks, size_t nr, int lock)
+static void mm_lock_unlock(struct rw_semaphore **sems, size_t nr, int lock)
 {
-	spinlock_t *last = NULL;
+	struct rw_semaphore *last = NULL;
 	size_t i;
 
 	for (i = 0; i < nr; i++)
 		/*  Multiple vmas may use the same lock. */
-		if (locks[i] != last) {
-			BUG_ON((unsigned long) last > (unsigned long) locks[i]);
-			last = locks[i];
+		if (sems[i] != last) {
+			BUG_ON((unsigned long) last > (unsigned long) sems[i]);
+			last = sems[i];
 			if (lock)
-				spin_lock(last);
+				down_write(last);
 			else
-				spin_unlock(last);
+				up_write(last);
 		}
 }
 
-static inline void __mm_lock(spinlock_t **locks, size_t nr)
+static inline void __mm_lock(struct rw_semaphore **sems, size_t nr)
 {
-	mm_lock_unlock(locks, nr, 1);
+	mm_lock_unlock(sems, nr, 1);
 }
 
-static inline void __mm_unlock(spinlock_t **locks, size_t nr)
+static inline void __mm_unlock(struct rw_semaphore **sems, size_t nr)
 {
-	mm_lock_unlock(locks, nr, 0);
+	mm_lock_unlock(sems, nr, 0);
 }
 
 /*
@@ -2351,10 +2351,10 @@ static inline void __mm_unlock(spinlock_
  * of vmas is defined in /proc/sys/vm/max_map_count.
  *
  * mm_lock() can fail if memory allocation fails. The worst case
- * vmalloc allocation required is 2*max_map_count*sizeof(spinlock_t *),
- * so around 1Mbyte, but in practice it'll be much less because
- * normally there won't be max_map_count vmas allocated in the task
- * that runs mm_lock().
+ * vmalloc allocation required is 2*max_map_count*sizeof(struct
+ * rw_semaphore *), so around 1Mbyte, but in practice it'll be much
+ * less because normally there won't be max_map_count vmas allocated
+ * in the task that runs mm_lock().
  *
  * The vmalloc memory allocated by mm_lock is stored in the
  * mm_lock_data structure that must be allocated by the caller and it
@@ -2368,20 +2368,20 @@ static inline void __mm_unlock(spinlock_
  */
 int mm_lock(struct mm_struct *mm, struct mm_lock_data *data)
 {
-	spinlock_t **anon_vma_locks, **i_mmap_locks;
+	struct rw_semaphore **anon_vma_sems, **i_mmap_sems;
 
 	down_write(&mm->mmap_sem);
 	if (mm->map_count) {
-		anon_vma_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count);
-		if (unlikely(!anon_vma_locks)) {
+		anon_vma_sems = vmalloc(sizeof(struct rw_semaphore *) * mm->map_count);
+		if (unlikely(!anon_vma_sems)) {
 			up_write(&mm->mmap_sem);
 			return -ENOMEM;
 		}
 
-		i_mmap_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count);
-		if (unlikely(!i_mmap_locks)) {
+		i_mmap_sems = vmalloc(sizeof(struct rw_semaphore *) * mm->map_count);
+		if (unlikely(!i_mmap_sems)) {
 			up_write(&mm->mmap_sem);
-			vfree(anon_vma_locks);
+			vfree(anon_vma_sems);
 			return -ENOMEM;
 		}
 
@@ -2389,31 +2389,31 @@ int mm_lock(struct mm_struct *mm, struct
 		 * When mm_lock_sort_anon_vma/i_mmap returns zero it
 		 * means there's no lock to take and so we can free
 		 * the array here without waiting mm_unlock. mm_unlock
-		 * will do nothing if nr_i_mmap/anon_vma_locks is
+		 * will do nothing if nr_i_mmap/anon_vma_sems is
 		 * zero.
 		 */
-		data->nr_anon_vma_locks = mm_lock_sort_anon_vma(mm, anon_vma_locks);
-		data->nr_i_mmap_locks = mm_lock_sort_i_mmap(mm, i_mmap_locks);
+		data->nr_anon_vma_sems = mm_lock_sort_anon_vma(mm, anon_vma_sems);
+		data->nr_i_mmap_sems = mm_lock_sort_i_mmap(mm, i_mmap_sems);
 
-		if (data->nr_anon_vma_locks) {
-			__mm_lock(anon_vma_locks, data->nr_anon_vma_locks);
-			data->anon_vma_locks = anon_vma_locks;
+		if (data->nr_anon_vma_sems) {
+			__mm_lock(anon_vma_sems, data->nr_anon_vma_sems);
+			data->anon_vma_sems = anon_vma_sems;
 		} else
-			vfree(anon_vma_locks);
+			vfree(anon_vma_sems);
 
-		if (data->nr_i_mmap_locks) {
-			__mm_lock(i_mmap_locks, data->nr_i_mmap_locks);
-			data->i_mmap_locks = i_mmap_locks;
+		if (data->nr_i_mmap_sems) {
+			__mm_lock(i_mmap_sems, data->nr_i_mmap_sems);
+			data->i_mmap_sems = i_mmap_sems;
 		} else
-			vfree(i_mmap_locks);
+			vfree(i_mmap_sems);
 	}
 	return 0;
 }
 
-static void mm_unlock_vfree(spinlock_t **locks, size_t nr)
+static void mm_unlock_vfree(struct rw_semaphore **sems, size_t nr)
 {
-	__mm_unlock(locks, nr);
-	vfree(locks);
+	__mm_unlock(sems, nr);
+	vfree(sems);
 }
 
 /*
@@ -2430,12 +2430,12 @@ void mm_unlock(struct mm_struct *mm, str
 void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data)
 {
 	if (mm->map_count) {
-		if (data->nr_anon_vma_locks)
-			mm_unlock_vfree(data->anon_vma_locks,
-					data->nr_anon_vma_locks);
-		if (data->nr_i_mmap_locks)
-			mm_unlock_vfree(data->i_mmap_locks,
-					data->nr_i_mmap_locks);
+		if (data->nr_anon_vma_sems)
+			mm_unlock_vfree(data->anon_vma_sems,
+					data->nr_anon_vma_sems);
+		if (data->nr_i_mmap_sems)
+			mm_unlock_vfree(data->i_mmap_sems,
+					data->nr_i_mmap_sems);
 	}
 	up_write(&mm->mmap_sem);
 }

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 10 of 11] export zap_page_range for XPMEM
  2008-05-02 15:05 [PATCH 00 of 11] mmu notifier #v15 Andrea Arcangeli
                   ` (8 preceding siblings ...)
  2008-05-02 15:05 ` [PATCH 09 of 11] mm_lock-rwsem Andrea Arcangeli
@ 2008-05-02 15:05 ` Andrea Arcangeli
  2008-05-02 15:05 ` [PATCH 11 of 11] mmap sems Andrea Arcangeli
  2008-05-03 11:09 ` [PATCH 00 of 11] mmu notifier #v15 Jack Steiner
  11 siblings, 0 replies; 24+ messages in thread
From: Andrea Arcangeli @ 2008-05-02 15:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Jack Steiner, Robin Holt, Nick Piggin,
	Peter Zijlstra, kvm-devel, Kanoj Sarcar, Roland Dreier,
	Steve Wise, linux-kernel, Avi Kivity, linux-mm, general,
	Hugh Dickins, akpm, Rusty Russell, Anthony Liguori, Chris Wright,
	Marcelo Tosatti, Eric Dumazet, Paul E. McKenney

# HG changeset patch
# User Andrea Arcangeli <andrea@qumranet.com>
# Date 1209740229 -7200
# Node ID 4f462fb3dff614cd7d971219c3feaef0b43359c1
# Parent  721c3787cd42043734331e54a42eb20c51766f71
export zap_page_range for XPMEM

XPMEM would have used sys_madvise() except that madvise_dontneed()
returns an -EINVAL if VM_PFNMAP is set, which is always true for the pages
XPMEM imports from other partitions and is also true for uncached pages
allocated locally via the mspec allocator.  XPMEM needs zap_page_range()
functionality for these types of pages as well as 'normal' pages.

Signed-off-by: Dean Nelson <dcn@sgi.com>
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -954,6 +954,7 @@ unsigned long zap_page_range(struct vm_a
 
 	return unmap_vmas(vma, address, end, &nr_accounted, details);
 }
+EXPORT_SYMBOL_GPL(zap_page_range);
 
 /*
  * Do a quick page-table lookup for a single page.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 11 of 11] mmap sems
  2008-05-02 15:05 [PATCH 00 of 11] mmu notifier #v15 Andrea Arcangeli
                   ` (9 preceding siblings ...)
  2008-05-02 15:05 ` [PATCH 10 of 11] export zap_page_range for XPMEM Andrea Arcangeli
@ 2008-05-02 15:05 ` Andrea Arcangeli
  2008-05-03 11:09 ` [PATCH 00 of 11] mmu notifier #v15 Jack Steiner
  11 siblings, 0 replies; 24+ messages in thread
From: Andrea Arcangeli @ 2008-05-02 15:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Jack Steiner, Robin Holt, Nick Piggin,
	Peter Zijlstra, kvm-devel, Kanoj Sarcar, Roland Dreier,
	Steve Wise, linux-kernel, Avi Kivity, linux-mm, general,
	Hugh Dickins, akpm, Rusty Russell, Anthony Liguori, Chris Wright,
	Marcelo Tosatti, Eric Dumazet, Paul E. McKenney

# HG changeset patch
# User Andrea Arcangeli <andrea@qumranet.com>
# Date 1209740229 -7200
# Node ID b4bf6df98bc00bfbef9423b0dd31cfdba63a5eeb
# Parent  4f462fb3dff614cd7d971219c3feaef0b43359c1
mmap sems

This patch adds a lock ordering rule to avoid a potential deadlock when
multiple mmap_sems need to be locked.

Signed-off-by: Dean Nelson <dcn@sgi.com>
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

diff --git a/mm/filemap.c b/mm/filemap.c
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -79,6 +79,9 @@ generic_file_direct_IO(int rw, struct ki
  *
  *  ->i_mutex			(generic_file_buffered_write)
  *    ->mmap_sem		(fault_in_pages_readable->do_page_fault)
+ *
+ *    When taking multiple mmap_sems, one should lock the lowest-addressed
+ *    one first proceeding on up to the highest-addressed one.
  *
  *  ->i_mutex
  *    ->i_alloc_sem             (various)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 00 of 11] mmu notifier #v15
  2008-05-02 15:05 [PATCH 00 of 11] mmu notifier #v15 Andrea Arcangeli
                   ` (10 preceding siblings ...)
  2008-05-02 15:05 ` [PATCH 11 of 11] mmap sems Andrea Arcangeli
@ 2008-05-03 11:09 ` Jack Steiner
  11 siblings, 0 replies; 24+ messages in thread
From: Jack Steiner @ 2008-05-03 11:09 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, Christoph Lameter, Robin Holt, Nick Piggin,
	Peter Zijlstra, kvm-devel, Kanoj Sarcar, Roland Dreier,
	Steve Wise, linux-kernel, Avi Kivity, linux-mm, general,
	Hugh Dickins, Rusty Russell, Anthony Liguori, Chris Wright,
	Marcelo Tosatti, Eric Dumazet, Paul E. McKenney

On Fri, May 02, 2008 at 05:05:03PM +0200, Andrea Arcangeli wrote:
> Hello everyone,
> 
> 1/11 is the latest version of the mmu-notifier-core patch.
> 
> As usual all later 2-11/11 patches follows but those aren't meant for 2.6.26.
> 

Not sure why -mm is different, but I get compile errors w/o the following...

--- jack


Index: linux/mm/mmu_notifier.c
===================================================================
--- linux.orig/mm/mmu_notifier.c	2008-05-02 16:54:52.780576831 -0500
+++ linux/mm/mmu_notifier.c	2008-05-02 16:56:38.817719509 -0500
@@ -16,6 +16,7 @@
 #include <linux/srcu.h>
 #include <linux/rcupdate.h>
 #include <linux/sched.h>
+#include <linux/rculist.h>
 
 /*
  * This function can't run concurrently against mmu_notifier_register

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 01 of 11] mmu-notifier-core
  2008-05-02 15:05 ` [PATCH 01 of 11] mmu-notifier-core Andrea Arcangeli
@ 2008-05-04 19:13   ` Robin Holt
  2008-05-04 22:08     ` Andrea Arcangeli
  2008-05-05 16:21   ` Jack Steiner
  1 sibling, 1 reply; 24+ messages in thread
From: Robin Holt @ 2008-05-04 19:13 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, Christoph Lameter, Jack Steiner, Robin Holt,
	Nick Piggin, Peter Zijlstra, kvm-devel, Kanoj Sarcar,
	Roland Dreier, Steve Wise, linux-kernel, Avi Kivity, linux-mm,
	general, Hugh Dickins, Rusty Russell, Anthony Liguori,
	Chris Wright, Marcelo Tosatti, Eric Dumazet, Paul E. McKenney

> diff --git a/mm/Kconfig b/mm/Kconfig
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -205,3 +205,6 @@ config VIRT_TO_BUS
>  config VIRT_TO_BUS
>  	def_bool y
>  	depends on !ARCH_NO_VIRT_TO_BUS
> +
> +config MMU_NOTIFIER
> +	bool

Without some text following the bool keyword, I am not even asked for
this config setting on my ia64 build.

Thanks,
Robin

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 01 of 11] mmu-notifier-core
  2008-05-04 19:13   ` Robin Holt
@ 2008-05-04 22:08     ` Andrea Arcangeli
  2008-05-05  2:25       ` Robin Holt
  0 siblings, 1 reply; 24+ messages in thread
From: Andrea Arcangeli @ 2008-05-04 22:08 UTC (permalink / raw)
  To: Robin Holt
  Cc: Andrew Morton, Christoph Lameter, Jack Steiner, Nick Piggin,
	Peter Zijlstra, kvm-devel, Kanoj Sarcar, Roland Dreier,
	Steve Wise, linux-kernel, Avi Kivity, linux-mm, general,
	Hugh Dickins, Rusty Russell, Anthony Liguori, Chris Wright,
	Marcelo Tosatti, Eric Dumazet, Paul E. McKenney

On Sun, May 04, 2008 at 02:13:45PM -0500, Robin Holt wrote:
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -205,3 +205,6 @@ config VIRT_TO_BUS
> >  config VIRT_TO_BUS
> >  	def_bool y
> >  	depends on !ARCH_NO_VIRT_TO_BUS
> > +
> > +config MMU_NOTIFIER
> > +	bool
> 
> Without some text following the bool keyword, I am not even asked for
> this config setting on my ia64 build.

Yes, this was explicitly asked by Andrew after his review. This is the
explanation pasted from the changelog.

3) It'd be a waste to add branches in the VM if nobody could possibly
   run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only enabled
   if CONFIG_KVM=m/y. In the current kernel kvm won't yet take
   advantage of mmu notifiers, but this already allows to compile a
   KVM external module against a kernel with mmu notifiers enabled and
   from the next pull from kvm.git we'll start using them. And
   GRU/XPMEM will also be able to continue the development by enabling
   KVM=m in their config, until they submit all GRU/XPMEM GPLv2 code
   to the mainline kernel. Then they can also enable MMU_NOTIFIERS in
   the same way KVM does it (even if KVM=n). This guarantees nobody
   selects MMU_NOTIFIER=y if KVM and GRU and XPMEM are all =n.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 01 of 11] mmu-notifier-core
  2008-05-04 22:08     ` Andrea Arcangeli
@ 2008-05-05  2:25       ` Robin Holt
  0 siblings, 0 replies; 24+ messages in thread
From: Robin Holt @ 2008-05-05  2:25 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Andrew Morton, Christoph Lameter, Jack Steiner,
	Nick Piggin, Peter Zijlstra, kvm-devel, Kanoj Sarcar,
	Roland Dreier, Steve Wise, linux-kernel, Avi Kivity, linux-mm,
	general, Hugh Dickins, Rusty Russell, Anthony Liguori,
	Chris Wright, Marcelo Tosatti, Eric Dumazet, Paul E. McKenney

On Mon, May 05, 2008 at 12:08:25AM +0200, Andrea Arcangeli wrote:
> On Sun, May 04, 2008 at 02:13:45PM -0500, Robin Holt wrote:
> > > diff --git a/mm/Kconfig b/mm/Kconfig
> > > --- a/mm/Kconfig
> > > +++ b/mm/Kconfig
> > > @@ -205,3 +205,6 @@ config VIRT_TO_BUS
> > >  config VIRT_TO_BUS
> > >  	def_bool y
> > >  	depends on !ARCH_NO_VIRT_TO_BUS
> > > +
> > > +config MMU_NOTIFIER
> > > +	bool
> > 
> > Without some text following the bool keyword, I am not even asked for
> > this config setting on my ia64 build.
> 
> Yes, this was explicitly asked by Andrew after his review. This is the
> explanation pasted from the changelog.
> 
> 3) It'd be a waste to add branches in the VM if nobody could possibly
>    run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only enabled
>    if CONFIG_KVM=m/y. In the current kernel kvm won't yet take
>    advantage of mmu notifiers, but this already allows to compile a
>    KVM external module against a kernel with mmu notifiers enabled and
>    from the next pull from kvm.git we'll start using them. And
>    GRU/XPMEM will also be able to continue the development by enabling
>    KVM=m in their config, until they submit all GRU/XPMEM GPLv2 code
>    to the mainline kernel. Then they can also enable MMU_NOTIFIERS in
>    the same way KVM does it (even if KVM=n). This guarantees nobody
>    selects MMU_NOTIFIER=y if KVM and GRU and XPMEM are all =n.

Ah, so Andrew wants users of KVM to do a select of MMU_NOTIFIER.  That
makes sense.  I will change (fix) my Kconfig changes.

Thanks,
Robin

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 01 of 11] mmu-notifier-core
  2008-05-02 15:05 ` [PATCH 01 of 11] mmu-notifier-core Andrea Arcangeli
  2008-05-04 19:13   ` Robin Holt
@ 2008-05-05 16:21   ` Jack Steiner
  2008-05-05 17:14     ` Andrea Arcangeli
  1 sibling, 1 reply; 24+ messages in thread
From: Jack Steiner @ 2008-05-05 16:21 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, Christoph Lameter, Robin Holt, Nick Piggin,
	Peter Zijlstra, kvm-devel, Kanoj Sarcar, Roland Dreier,
	Steve Wise, linux-kernel, Avi Kivity, linux-mm, general,
	Hugh Dickins, Rusty Russell, Anthony Liguori, Chris Wright,
	Marcelo Tosatti, Eric Dumazet, Paul E. McKenney

On Fri, May 02, 2008 at 05:05:04PM +0200, Andrea Arcangeli wrote:
> # HG changeset patch
> # User Andrea Arcangeli <andrea@qumranet.com>
> # Date 1209740175 -7200
> # Node ID 1489529e7b53d3f2dab8431372aa4850ec821caa
> # Parent  5026689a3bc323a26d33ad882c34c4c9c9a3ecd8
> mmu-notifier-core
 


I upgraded to the latest mmu notifier patch & hit a deadlock. (Sorry -
I should have seen this  earlier but I haven't tracked the last couple
of patches).

The GRU does the registration/deregistration of mmu notifiers from mmap/munmap.
At this point, the mmap_sem is already held writeable. I hit a deadlock
in mm_lock.

A quick fix would be to do one of the following:

	- move the mmap_sem locking to the caller of the [de]registration routines.
	  Since the first/last thing done in mm_lock/mm_unlock is to
	  acquire/release mmap_sem, this change does not cause major changes.

	- add a flag to mmu_notifier_[un]register routines to indicate
	  if mmap_sem is already locked.


I've temporarily deleted the mm_lock locking of mmap_sem and am continuing to
test. More later....


--- jack

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 01 of 11] mmu-notifier-core
  2008-05-05 16:21   ` Jack Steiner
@ 2008-05-05 17:14     ` Andrea Arcangeli
  2008-05-05 17:25       ` Jack Steiner
  0 siblings, 1 reply; 24+ messages in thread
From: Andrea Arcangeli @ 2008-05-05 17:14 UTC (permalink / raw)
  To: Jack Steiner
  Cc: Andrew Morton, Christoph Lameter, Robin Holt, Nick Piggin,
	Peter Zijlstra, kvm-devel, Kanoj Sarcar, Roland Dreier,
	Steve Wise, linux-kernel, Avi Kivity, linux-mm, general,
	Hugh Dickins, Rusty Russell, Anthony Liguori, Chris Wright,
	Marcelo Tosatti, Eric Dumazet, Paul E. McKenney

On Mon, May 05, 2008 at 11:21:13AM -0500, Jack Steiner wrote:
> The GRU does the registration/deregistration of mmu notifiers from mmap/munmap.
> At this point, the mmap_sem is already held writeable. I hit a deadlock
> in mm_lock.

It'd been better to know about this detail earlier, but frankly this
is a minor problem, the important thing is we all agree together on
the more difficult parts ;).

> A quick fix would be to do one of the following:
> 
> 	- move the mmap_sem locking to the caller of the [de]registration routines.
> 	  Since the first/last thing done in mm_lock/mm_unlock is to
> 	  acquire/release mmap_sem, this change does not cause major changes.

I don't like this solution very much. Nor GRU nor KVM will call
mmu_notifier_register inside the mmap_sem protected sections, so I
think the default mmu_notifier_register should be smp safe by itself
without requiring additional locks to be artificially taken externally
(especially because the need for mmap_sem in write mode is a very
mmu_notifier internal detail).

> 	- add a flag to mmu_notifier_[un]register routines to indicate
> 	  if mmap_sem is already locked.

The interface would change like this:

#define MMU_NOTIFIER_REGISTER_MMAP_SEM (1<<0)
void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm,
			   unsigned long mmu_notifier_flags);

A third solution is to add:

/*
 * This must can be called instead of mmu_notifier_register after
 * taking the mmap_sem in write mode (read mode isn't enough).
 */
void __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm);

Do you still prefer the bitflag or you prefer
__mmu_notifier_register. It's ok either ways, except
__mmu_notifier_reigster could be removed in a backwards compatible
way, the bitflag can't.

> I've temporarily deleted the mm_lock locking of mmap_sem and am continuing to
> test. More later....

Sure! In the meantime go ahead this way.

Another very minor change I've been thinking about is to make
->release not mandatory. It happens that with KVM ->release isn't
strictly required because after mm_users reaches 0, no guest could
possibly run anymore. So I'm using ->release only for debugging by
placing -1UL in the root shadow pagetable, to be sure ;). So because
at least one user won't strictly require ->release being consistent in
having all method optional may be nicer. Alternatively we could make
them all mandatory and if somebody doesn't need one of the methods it
should implement it as a dummy function. Both ways have pros and cons,
but they don't make any difference to us in practice. If I've to
change the patch for the mmap_sem taken during registration I may as
well cleanup this minor bit.

Also note the rculist.h patch you sent earlier won't work against
mainline so I can't incorporate it in my patchset, Andrew will have to
apply it as mmu-notifier-core-mm after incorporating mmu-notifier-core
into -mm.

Until a new update is released, mmu-notifier-core v15 remains ok for
merging, no known bugs, here we're talking about a new and simple
feature and a tiny cleanup that nobody can notice anyway.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 01 of 11] mmu-notifier-core
  2008-05-05 17:14     ` Andrea Arcangeli
@ 2008-05-05 17:25       ` Jack Steiner
  2008-05-05 18:34         ` Andrea Arcangeli
  0 siblings, 1 reply; 24+ messages in thread
From: Jack Steiner @ 2008-05-05 17:25 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, Christoph Lameter, Robin Holt, Nick Piggin,
	Peter Zijlstra, kvm-devel, Kanoj Sarcar, Roland Dreier,
	Steve Wise, linux-kernel, Avi Kivity, linux-mm, general,
	Hugh Dickins, Rusty Russell, Anthony Liguori, Chris Wright,
	Marcelo Tosatti, Eric Dumazet, Paul E. McKenney

On Mon, May 05, 2008 at 07:14:34PM +0200, Andrea Arcangeli wrote:
> On Mon, May 05, 2008 at 11:21:13AM -0500, Jack Steiner wrote:
> > The GRU does the registration/deregistration of mmu notifiers from mmap/munmap.
> > At this point, the mmap_sem is already held writeable. I hit a deadlock
> > in mm_lock.
> 
> It'd been better to know about this detail earlier,

Agree. My apologies... I should have caught it.


> but frankly this
> is a minor problem, the important thing is we all agree together on
> the more difficult parts ;).
> 
> > A quick fix would be to do one of the following:
> > 
> > 	- move the mmap_sem locking to the caller of the [de]registration routines.
> > 	  Since the first/last thing done in mm_lock/mm_unlock is to
> > 	  acquire/release mmap_sem, this change does not cause major changes.
> 
> I don't like this solution very much. Nor GRU nor KVM will call
> mmu_notifier_register inside the mmap_sem protected sections, so I
> think the default mmu_notifier_register should be smp safe by itself
> without requiring additional locks to be artificially taken externally
> (especially because the need for mmap_sem in write mode is a very
> mmu_notifier internal detail).
> 
> > 	- add a flag to mmu_notifier_[un]register routines to indicate
> > 	  if mmap_sem is already locked.
> 
> The interface would change like this:
> 
> #define MMU_NOTIFIER_REGISTER_MMAP_SEM (1<<0)
> void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm,
> 			   unsigned long mmu_notifier_flags);

That works...


> 
> A third solution is to add:
> 
> /*
>  * This must can be called instead of mmu_notifier_register after
>  * taking the mmap_sem in write mode (read mode isn't enough).
>  */
> void __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm);
> 
> Do you still prefer the bitflag or you prefer
> __mmu_notifier_register. It's ok either ways, except
> __mmu_notifier_reigster could be removed in a backwards compatible
> way, the bitflag can't.
> 
> > I've temporarily deleted the mm_lock locking of mmap_sem and am continuing to
> > test. More later....

__mmu_notifier_register/__mmu_notifier_unregister seems like a better way to
go, although either is ok.


> 
> Sure! In the meantime go ahead this way.
> 
> Another very minor change I've been thinking about is to make
> ->release not mandatory. It happens that with KVM ->release isn't
> strictly required because after mm_users reaches 0, no guest could
> possibly run anymore. So I'm using ->release only for debugging by
> placing -1UL in the root shadow pagetable, to be sure ;). So because
> at least one user won't strictly require ->release being consistent in
> having all method optional may be nicer. Alternatively we could make
> them all mandatory and if somebody doesn't need one of the methods it
> should implement it as a dummy function. Both ways have pros and cons,
> but they don't make any difference to us in practice. If I've to
> change the patch for the mmap_sem taken during registration I may as
> well cleanup this minor bit.
 
Let me finish my testing. At one time, I did not use ->release but
with all the locking & teardown changes, I need to do some reverification.


--- jack

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 01 of 11] mmu-notifier-core
  2008-05-05 17:25       ` Jack Steiner
@ 2008-05-05 18:34         ` Andrea Arcangeli
  2008-05-05 19:46           ` Jack Steiner
  0 siblings, 1 reply; 24+ messages in thread
From: Andrea Arcangeli @ 2008-05-05 18:34 UTC (permalink / raw)
  To: Jack Steiner
  Cc: Andrew Morton, Christoph Lameter, Robin Holt, Nick Piggin,
	Peter Zijlstra, kvm-devel, Kanoj Sarcar, Roland Dreier,
	Steve Wise, linux-kernel, Avi Kivity, linux-mm, general,
	Hugh Dickins, Rusty Russell, Anthony Liguori, Chris Wright,
	Marcelo Tosatti, Eric Dumazet, Paul E. McKenney

On Mon, May 05, 2008 at 12:25:06PM -0500, Jack Steiner wrote:
> Agree. My apologies... I should have caught it.

No problem.

> __mmu_notifier_register/__mmu_notifier_unregister seems like a better way to
> go, although either is ok.

If you also like __mmu_notifier_register more I'll go with it. The
bitflags seems like a bit of overkill as I can't see the need of any
other bitflag other than this one and they also can't be removed as
easily in case you'll find a way to call it outside the lock later.

> Let me finish my testing. At one time, I did not use ->release but
> with all the locking & teardown changes, I need to do some reverification.

If you didn't implement it you shall apply this patch but you shall
read carefully the comment I written that covers that usage case.

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -29,10 +29,25 @@ struct mmu_notifier_ops {
 	/*
 	 * Called either by mmu_notifier_unregister or when the mm is
 	 * being destroyed by exit_mmap, always before all pages are
-	 * freed. It's mandatory to implement this method. This can
-	 * run concurrently with other mmu notifier methods and it
+	 * freed. This can run concurrently with other mmu notifier
+	 * methods (the ones invoked outside the mm context) and it
 	 * should tear down all secondary mmu mappings and freeze the
-	 * secondary mmu.
+	 * secondary mmu. If this method isn't implemented you've to
+	 * be sure that nothing could possibly write to the pages
+	 * through the secondary mmu by the time the last thread with
+	 * tsk->mm == mm exits.
+	 *
+	 * As side note: the pages freed after ->release returns could
+	 * be immediately reallocated by the gart at an alias physical
+	 * address with a different cache model, so if ->release isn't
+	 * implemented because all memory accesses through the
+	 * secondary mmu implicitly are terminated by the time the
+	 * last thread of this mm quits, you've also to be sure that
+	 * speculative hardware operations can't allocate dirty
+	 * cachelines in the cpu that could not be snooped and made
+	 * coherent with the other read and write operations happening
+	 * through the gart alias address, leading to memory
+	 * corruption.
 	 */
 	void (*release)(struct mmu_notifier *mn,
 			struct mm_struct *mm);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -59,7 +59,8 @@ void __mmu_notifier_release(struct mm_st
 		 * from establishing any more sptes before all the
 		 * pages in the mm are freed.
 		 */
-		mn->ops->release(mn, mm);
+		if (mn->ops->release)
+			mn->ops->release(mn, mm);
 		srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu);
 		spin_lock(&mm->mmu_notifier_mm->lock);
 	}
@@ -251,7 +252,8 @@ void mmu_notifier_unregister(struct mmu_
 		 * guarantee ->release is called before freeing the
 		 * pages.
 		 */
-		mn->ops->release(mn, mm);
+		if (mn->ops->release)
+			mn->ops->release(mn, mm);
 		srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu);
 	} else
 		spin_unlock(&mm->mmu_notifier_mm->lock);

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 01 of 11] mmu-notifier-core
  2008-05-05 18:34         ` Andrea Arcangeli
@ 2008-05-05 19:46           ` Jack Steiner
  2008-05-06 14:46             ` Andrea Arcangeli
  2008-05-06 17:53             ` mmu notifier v15 -> v16 diff Andrea Arcangeli
  0 siblings, 2 replies; 24+ messages in thread
From: Jack Steiner @ 2008-05-05 19:46 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, Christoph Lameter, Robin Holt, Nick Piggin,
	Peter Zijlstra, kvm-devel, Kanoj Sarcar, Roland Dreier,
	Steve Wise, linux-kernel, Avi Kivity, linux-mm, general,
	Hugh Dickins, Rusty Russell, Anthony Liguori, Chris Wright,
	Marcelo Tosatti, Eric Dumazet, Paul E. McKenney

[-- Attachment #1: Type: text/plain, Size: 2227 bytes --]

On Mon, May 05, 2008 at 08:34:05PM +0200, Andrea Arcangeli wrote:
> On Mon, May 05, 2008 at 12:25:06PM -0500, Jack Steiner wrote:
> > Agree. My apologies... I should have caught it.
> 
> No problem.
> 
> > __mmu_notifier_register/__mmu_notifier_unregister seems like a better way to
> > go, although either is ok.
> 
> If you also like __mmu_notifier_register more I'll go with it. The
> bitflags seems like a bit of overkill as I can't see the need of any
> other bitflag other than this one and they also can't be removed as
> easily in case you'll find a way to call it outside the lock later.
> 
> > Let me finish my testing. At one time, I did not use ->release but
> > with all the locking & teardown changes, I need to do some reverification.

I finished testing & everything looks good. I do use the ->release callout but
mainly as a performance hint that teardown is in progress & that TLB flushing is
no longer needed. (GRU TLB entries are tagged with a task-specific ID that will
not be reused until a full TLB purge is done. This eliminates the requirement
to purge at task-exit.)


Normally, a notifier is registered when a GRU segment is mmaped, and unregistered
when the segment is unmapped. Well behaved tasks will not have a GRU or
a notifier when exit starts.

If a task fails to unmap a GRU segment, they still exist at the start of
exit. On the ->release callout, I set a flag in the container of my
mmu_notifier that exit has started. As VMA are cleaned up, TLB flushes
are skipped because of the flag is set. When the GRU VMA is deleted, I free
my structure containing the notifier.

I _think_ works. Do you see any problems?

I should also mention that I have an open-coded function that possibly
belongs in mmu_notifier.c. A user is allowed to have multiple GRU segments.
Each GRU has a couple of data structures linked to the VMA. All, however,
need to share the same notifier. I currently open code a function that
scans the notifier list to determine if a GRU notifier already exists.
If it does, I update a refcnt & use it. Otherwise, I register a new
one. All of this is protected by the mmap_sem.

Just in case I mangled the above description, I'll attach a copy of the GRU mmuops
code.

--- jack

[-- Attachment #2: z --]
[-- Type: application/x-compress, Size: 2763 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 01 of 11] mmu-notifier-core
  2008-05-05 19:46           ` Jack Steiner
@ 2008-05-06 14:46             ` Andrea Arcangeli
  2008-05-06 17:53             ` mmu notifier v15 -> v16 diff Andrea Arcangeli
  1 sibling, 0 replies; 24+ messages in thread
From: Andrea Arcangeli @ 2008-05-06 14:46 UTC (permalink / raw)
  To: Jack Steiner
  Cc: Andrew Morton, Christoph Lameter, Robin Holt, Nick Piggin,
	Peter Zijlstra, kvm-devel, Kanoj Sarcar, Roland Dreier,
	Steve Wise, linux-kernel, Avi Kivity, linux-mm, general,
	Hugh Dickins, Rusty Russell, Anthony Liguori, Chris Wright,
	Marcelo Tosatti, Eric Dumazet, Paul E. McKenney

On Mon, May 05, 2008 at 02:46:25PM -0500, Jack Steiner wrote:
> If a task fails to unmap a GRU segment, they still exist at the start of

Yes, this will also happen in case the well behaved task receives
SIGKILL, so you can test it that way too.

> exit. On the ->release callout, I set a flag in the container of my
> mmu_notifier that exit has started. As VMA are cleaned up, TLB flushes
> are skipped because of the flag is set. When the GRU VMA is deleted, I free

GRU TLB flushes aren't skipped because your flag is set but because
__mmu_notifier_release already executed
list_del_init_rcu(&grunotifier->hlist) before proceeding with
unmap_vmas.

> my structure containing the notifier.

As long as nobody can write through the already established gru tlbs
and nobody can establish new tlbs after exit_mmap run you don't
strictly need ->release.

> I _think_ works. Do you see any problems?

You can remove the flag and ->release and ->clear_flush_young (if you
keep clear_flush_young implemented it should return 0). The
synchronize_rcu after mmu_notifier_register can also be dropped thanks
to mm_lock(). gru_drop_mmu_notifier should be careful with current->mm
if you're using an fd and if the fd can be passed to a different task
through unix sockets (you should probably fail any operation if
current->mm != gru->mm).

The way I use ->release in KVM is to set the root hpa to -1UL
(invalid) as a debug trap. That's only for debugging because even if
tlb entries and sptes are still established on the secondary mmu they
are only relevant when the cpu jumps to guest mode and that can never
happen again after exit_mmap is started.

> I should also mention that I have an open-coded function that possibly
> belongs in mmu_notifier.c. A user is allowed to have multiple GRU segments.
> Each GRU has a couple of data structures linked to the VMA. All, however,
> need to share the same notifier. I currently open code a function that
> scans the notifier list to determine if a GRU notifier already exists.
> If it does, I update a refcnt & use it. Otherwise, I register a new
> one. All of this is protected by the mmap_sem.
> 
> Just in case I mangled the above description, I'll attach a copy of the GRU mmuops
> code.

Well that function needs fixing w.r.t. srcu. Are you sure you want to
search for mn->ops == gru_mmuops and not for mn == gmn?  And if you
search for mn why can't you keep track of the mn being registered or
unregistered outside of the mmu_notifier layer? Set a bitflag in the
container after mmu_notifier_register returns and a clear it after
_unregister returns. I doubt saving one bitflag is worth searching the
list and your approach make it obvious that you've to protect the
bitflag and the register/unregister under write-mmap_sem
yourself. Otherwise the find function will return an object that can
be freed at any time if somebody calls unregister and
kfree. (synchronize_srcu in mmu_notifier_unregister won't wait for
anything but some outstanding srcu_read_lock)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* mmu notifier v15 -> v16 diff
  2008-05-05 19:46           ` Jack Steiner
  2008-05-06 14:46             ` Andrea Arcangeli
@ 2008-05-06 17:53             ` Andrea Arcangeli
  1 sibling, 0 replies; 24+ messages in thread
From: Andrea Arcangeli @ 2008-05-06 17:53 UTC (permalink / raw)
  To: Jack Steiner
  Cc: Andrew Morton, Christoph Lameter, Robin Holt, Nick Piggin,
	Peter Zijlstra, kvm-devel, Kanoj Sarcar, Roland Dreier,
	Steve Wise, linux-kernel, Avi Kivity, linux-mm, general,
	Hugh Dickins, Rusty Russell, Anthony Liguori, Chris Wright,
	Marcelo Tosatti, Eric Dumazet, Paul E. McKenney

Hello everyone,

This is to allow GRU code to call __mmu_notifier_register inside the
mmap_sem (write mode is required as documented in the patch).

It also removes the requirement to implement ->release as it's not
guaranteed all users will really need it.

I didn't integrate the search function as we can sort that out after
2.6.26 is out and it wasn't entirely obvious it's really needed, as
the driver should be able to track if a mmu notifier is registered in
the container.

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -29,10 +29,25 @@ struct mmu_notifier_ops {
 	/*
 	 * Called either by mmu_notifier_unregister or when the mm is
 	 * being destroyed by exit_mmap, always before all pages are
-	 * freed. It's mandatory to implement this method. This can
-	 * run concurrently with other mmu notifier methods and it
+	 * freed. This can run concurrently with other mmu notifier
+	 * methods (the ones invoked outside the mm context) and it
 	 * should tear down all secondary mmu mappings and freeze the
-	 * secondary mmu.
+	 * secondary mmu. If this method isn't implemented you've to
+	 * be sure that nothing could possibly write to the pages
+	 * through the secondary mmu by the time the last thread with
+	 * tsk->mm == mm exits.
+	 *
+	 * As side note: the pages freed after ->release returns could
+	 * be immediately reallocated by the gart at an alias physical
+	 * address with a different cache model, so if ->release isn't
+	 * implemented because all _software_ driven memory accesses
+	 * through the secondary mmu are terminated by the time the
+	 * last thread of this mm quits, you've also to be sure that
+	 * speculative _hardware_ operations can't allocate dirty
+	 * cachelines in the cpu that could not be snooped and made
+	 * coherent with the other read and write operations happening
+	 * through the gart alias address, so leading to memory
+	 * corruption.
 	 */
 	void (*release)(struct mmu_notifier *mn,
 			struct mm_struct *mm);
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2340,13 +2340,20 @@ static inline void __mm_unlock(spinlock_
 /*
  * This operation locks against the VM for all pte/vma/mm related
  * operations that could ever happen on a certain mm. This includes
- * vmtruncate, try_to_unmap, and all page faults. The holder
- * must not hold any mm related lock. A single task can't take more
- * than one mm_lock in a row or it would deadlock.
+ * vmtruncate, try_to_unmap, and all page faults.
  *
- * The mmap_sem must be taken in write mode to block all operations
- * that could modify pagetables and free pages without altering the
- * vma layout (for example populate_range() with nonlinear vmas).
+ * The caller must take the mmap_sem in read or write mode before
+ * calling mm_lock(). The caller isn't allowed to release the mmap_sem
+ * until mm_unlock() returns.
+ *
+ * While mm_lock() itself won't strictly require the mmap_sem in write
+ * mode to be safe, in order to block all operations that could modify
+ * pagetables and free pages without need of altering the vma layout
+ * (for example populate_range() with nonlinear vmas) the mmap_sem
+ * must be taken in write mode by the caller.
+ *
+ * A single task can't take more than one mm_lock in a row or it would
+ * deadlock.
  *
  * The sorting is needed to avoid lock inversion deadlocks if two
  * tasks run mm_lock at the same time on different mm that happen to
@@ -2377,17 +2384,13 @@ int mm_lock(struct mm_struct *mm, struct
 {
 	spinlock_t **anon_vma_locks, **i_mmap_locks;
 
-	down_write(&mm->mmap_sem);
 	if (mm->map_count) {
 		anon_vma_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count);
-		if (unlikely(!anon_vma_locks)) {
-			up_write(&mm->mmap_sem);
+		if (unlikely(!anon_vma_locks))
 			return -ENOMEM;
-		}
 
 		i_mmap_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count);
 		if (unlikely(!i_mmap_locks)) {
-			up_write(&mm->mmap_sem);
 			vfree(anon_vma_locks);
 			return -ENOMEM;
 		}
@@ -2426,10 +2429,12 @@ static void mm_unlock_vfree(spinlock_t *
 /*
  * mm_unlock doesn't require any memory allocation and it won't fail.
  *
+ * The mmap_sem cannot be released until mm_unlock returns.
+ *
  * All memory has been previously allocated by mm_lock and it'll be
  * all freed before returning. Only after mm_unlock returns, the
  * caller is allowed to free and forget the mm_lock_data structure.
- * 
+ *
  * mm_unlock runs in O(N) where N is the max number of VMAs in the
  * mm. The max number of vmas is defined in
  * /proc/sys/vm/max_map_count.
@@ -2444,5 +2449,4 @@ void mm_unlock(struct mm_struct *mm, str
 			mm_unlock_vfree(data->i_mmap_locks,
 					data->nr_i_mmap_locks);
 	}
-	up_write(&mm->mmap_sem);
 }
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -59,7 +59,8 @@ void __mmu_notifier_release(struct mm_st
 		 * from establishing any more sptes before all the
 		 * pages in the mm are freed.
 		 */
-		mn->ops->release(mn, mm);
+		if (mn->ops->release)
+			mn->ops->release(mn, mm);
 		srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu);
 		spin_lock(&mm->mmu_notifier_mm->lock);
 	}
@@ -144,20 +145,9 @@ void __mmu_notifier_invalidate_range_end
 	srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu);
 }
 
-/*
- * Must not hold mmap_sem nor any other VM related lock when calling
- * this registration function. Must also ensure mm_users can't go down
- * to zero while this runs to avoid races with mmu_notifier_release,
- * so mm has to be current->mm or the mm should be pinned safely such
- * as with get_task_mm(). If the mm is not current->mm, the mm_users
- * pin should be released by calling mmput after mmu_notifier_register
- * returns. mmu_notifier_unregister must be always called to
- * unregister the notifier. mm_count is automatically pinned to allow
- * mmu_notifier_unregister to safely run at any time later, before or
- * after exit_mmap. ->release will always be called before exit_mmap
- * frees the pages.
- */
-int mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+static int do_mmu_notifier_register(struct mmu_notifier *mn,
+				    struct mm_struct *mm,
+				    int take_mmap_sem)
 {
 	struct mm_lock_data data;
 	struct mmu_notifier_mm * mmu_notifier_mm;
@@ -174,6 +164,8 @@ int mmu_notifier_register(struct mmu_not
 	if (unlikely(ret))
 		goto out_kfree;
 
+	if (take_mmap_sem)
+		down_write(&mm->mmap_sem);
 	ret = mm_lock(mm, &data);
 	if (unlikely(ret))
 		goto out_cleanup;
@@ -200,6 +192,8 @@ int mmu_notifier_register(struct mmu_not
 
 	mm_unlock(mm, &data);
 out_cleanup:
+	if (take_mmap_sem)
+		up_write(&mm->mmap_sem);
 	if (mmu_notifier_mm)
 		cleanup_srcu_struct(&mmu_notifier_mm->srcu);
 out_kfree:
@@ -209,7 +203,35 @@ out:
 	BUG_ON(atomic_read(&mm->mm_users) <= 0);
 	return ret;
 }
+
+/*
+ * Must not hold mmap_sem nor any other VM related lock when calling
+ * this registration function. Must also ensure mm_users can't go down
+ * to zero while this runs to avoid races with mmu_notifier_release,
+ * so mm has to be current->mm or the mm should be pinned safely such
+ * as with get_task_mm(). If the mm is not current->mm, the mm_users
+ * pin should be released by calling mmput after mmu_notifier_register
+ * returns. mmu_notifier_unregister must be always called to
+ * unregister the notifier. mm_count is automatically pinned to allow
+ * mmu_notifier_unregister to safely run at any time later, before or
+ * after exit_mmap. ->release will always be called before exit_mmap
+ * frees the pages.
+ */
+int mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	return do_mmu_notifier_register(mn, mm, 1);
+}
 EXPORT_SYMBOL_GPL(mmu_notifier_register);
+
+/*
+ * Same as mmu_notifier_register but here the caller must hold the
+ * mmap_sem in write mode.
+ */
+int __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	return do_mmu_notifier_register(mn, mm, 0);
+}
+EXPORT_SYMBOL_GPL(__mmu_notifier_register);
 
 /* this is called after the last mmu_notifier_unregister() returned */
 void __mmu_notifier_mm_destroy(struct mm_struct *mm)
@@ -251,7 +273,8 @@ void mmu_notifier_unregister(struct mmu_
 		 * guarantee ->release is called before freeing the
 		 * pages.
 		 */
-		mn->ops->release(mn, mm);
+		if (mn->ops->release)
+			mn->ops->release(mn, mm);
 		srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu);
 	} else
 		spin_unlock(&mm->mmu_notifier_mm->lock);
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -148,6 +148,8 @@ static inline int mm_has_notifiers(struc
 
 extern int mmu_notifier_register(struct mmu_notifier *mn,
 				 struct mm_struct *mm);
+extern int __mmu_notifier_register(struct mmu_notifier *mn,
+				   struct mm_struct *mm);
 extern void mmu_notifier_unregister(struct mmu_notifier *mn,
 				    struct mm_struct *mm);
 extern void __mmu_notifier_mm_destroy(struct mm_struct *mm);

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 07 of 11] i_mmap_rwsem
  2008-05-07 14:35 [PATCH 00 of 11] mmu notifier #v16 Andrea Arcangeli
@ 2008-05-07 14:35 ` Andrea Arcangeli
  0 siblings, 0 replies; 24+ messages in thread
From: Andrea Arcangeli @ 2008-05-07 14:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Jack Steiner, Robin Holt, Nick Piggin,
	Peter Zijlstra, kvm-devel, Kanoj Sarcar, Roland Dreier,
	Steve Wise, linux-kernel, Avi Kivity, linux-mm, general,
	Hugh Dickins, akpm, Rusty Russell, Anthony Liguori, Chris Wright,
	Marcelo Tosatti, Eric Dumazet, Paul E. McKenney

# HG changeset patch
# User Andrea Arcangeli <andrea@qumranet.com>
# Date 1210115135 -7200
# Node ID 58f716ad4d067afb6bdd1b5f7042e19d854aae0d
# Parent  0621238970155f8ff2d60ca4996dcdd470f9c6ce
i_mmap_rwsem

The conversion to a rwsem allows notifier callbacks during rmap traversal
for files. A rw style lock also allows concurrent walking of the
reverse map so that multiple processors can expire pages in the same memory
area of the same process. So it increases the potential concurrency.

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>

diff --git a/Documentation/vm/locking b/Documentation/vm/locking
--- a/Documentation/vm/locking
+++ b/Documentation/vm/locking
@@ -66,7 +66,7 @@ expand_stack(), it is hard to come up wi
 expand_stack(), it is hard to come up with a destructive scenario without 
 having the vmlist protection in this case.
 
-The page_table_lock nests with the inode i_mmap_lock and the kmem cache
+The page_table_lock nests with the inode i_mmap_sem and the kmem cache
 c_spinlock spinlocks.  This is okay, since the kmem code asks for pages after
 dropping c_spinlock.  The page_table_lock also nests with pagecache_lock and
 pagemap_lru_lock spinlocks, and no code asks for memory with these locks
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -69,7 +69,7 @@ static void huge_pmd_share(struct mm_str
 	if (!vma_shareable(vma, addr))
 		return;
 
-	spin_lock(&mapping->i_mmap_lock);
+	down_read(&mapping->i_mmap_sem);
 	vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
 		if (svma == vma)
 			continue;
@@ -94,7 +94,7 @@ static void huge_pmd_share(struct mm_str
 		put_page(virt_to_page(spte));
 	spin_unlock(&mm->page_table_lock);
 out:
-	spin_unlock(&mapping->i_mmap_lock);
+	up_read(&mapping->i_mmap_sem);
 }
 
 /*
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -454,10 +454,10 @@ static int hugetlb_vmtruncate(struct ino
 	pgoff = offset >> PAGE_SHIFT;
 
 	i_size_write(inode, offset);
-	spin_lock(&mapping->i_mmap_lock);
+	down_read(&mapping->i_mmap_sem);
 	if (!prio_tree_empty(&mapping->i_mmap))
 		hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
-	spin_unlock(&mapping->i_mmap_lock);
+	up_read(&mapping->i_mmap_sem);
 	truncate_hugepages(inode, offset);
 	return 0;
 }
diff --git a/fs/inode.c b/fs/inode.c
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -210,7 +210,7 @@ void inode_init_once(struct inode *inode
 	INIT_LIST_HEAD(&inode->i_devices);
 	INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
 	rwlock_init(&inode->i_data.tree_lock);
-	spin_lock_init(&inode->i_data.i_mmap_lock);
+	init_rwsem(&inode->i_data.i_mmap_sem);
 	INIT_LIST_HEAD(&inode->i_data.private_list);
 	spin_lock_init(&inode->i_data.private_lock);
 	INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
diff --git a/include/linux/fs.h b/include/linux/fs.h
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -502,7 +502,7 @@ struct address_space {
 	unsigned int		i_mmap_writable;/* count VM_SHARED mappings */
 	struct prio_tree_root	i_mmap;		/* tree of private and shared mappings */
 	struct list_head	i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
-	spinlock_t		i_mmap_lock;	/* protect tree, count, list */
+	struct rw_semaphore	i_mmap_sem;	/* protect tree, count, list */
 	unsigned int		truncate_count;	/* Cover race condition with truncate */
 	unsigned long		nrpages;	/* number of total pages */
 	pgoff_t			writeback_index;/* writeback starts here */
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -735,7 +735,7 @@ struct zap_details {
 	struct address_space *check_mapping;	/* Check page->mapping if set */
 	pgoff_t	first_index;			/* Lowest page->index to unmap */
 	pgoff_t last_index;			/* Highest page->index to unmap */
-	spinlock_t *i_mmap_lock;		/* For unmap_mapping_range: */
+	struct rw_semaphore *i_mmap_sem;	/* For unmap_mapping_range: */
 	unsigned long truncate_count;		/* Compare vm_truncate_count */
 };
 
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -297,12 +297,12 @@ static int dup_mmap(struct mm_struct *mm
 				atomic_dec(&inode->i_writecount);
 
 			/* insert tmp into the share list, just after mpnt */
-			spin_lock(&file->f_mapping->i_mmap_lock);
+			down_write(&file->f_mapping->i_mmap_sem);
 			tmp->vm_truncate_count = mpnt->vm_truncate_count;
 			flush_dcache_mmap_lock(file->f_mapping);
 			vma_prio_tree_add(tmp, mpnt);
 			flush_dcache_mmap_unlock(file->f_mapping);
-			spin_unlock(&file->f_mapping->i_mmap_lock);
+			up_write(&file->f_mapping->i_mmap_sem);
 		}
 
 		/*
diff --git a/mm/filemap.c b/mm/filemap.c
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -61,16 +61,16 @@ generic_file_direct_IO(int rw, struct ki
 /*
  * Lock ordering:
  *
- *  ->i_mmap_lock		(vmtruncate)
+ *  ->i_mmap_sem		(vmtruncate)
  *    ->private_lock		(__free_pte->__set_page_dirty_buffers)
  *      ->swap_lock		(exclusive_swap_page, others)
  *        ->mapping->tree_lock
  *
  *  ->i_mutex
- *    ->i_mmap_lock		(truncate->unmap_mapping_range)
+ *    ->i_mmap_sem		(truncate->unmap_mapping_range)
  *
  *  ->mmap_sem
- *    ->i_mmap_lock
+ *    ->i_mmap_sem
  *      ->page_table_lock or pte_lock	(various, mainly in memory.c)
  *        ->mapping->tree_lock	(arch-dependent flush_dcache_mmap_lock)
  *
@@ -87,7 +87,7 @@ generic_file_direct_IO(int rw, struct ki
  *    ->sb_lock			(fs/fs-writeback.c)
  *    ->mapping->tree_lock	(__sync_single_inode)
  *
- *  ->i_mmap_lock
+ *  ->i_mmap_sem
  *    ->anon_vma.lock		(vma_adjust)
  *
  *  ->anon_vma.lock
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -178,7 +178,7 @@ __xip_unmap (struct address_space * mapp
 	if (!page)
 		return;
 
-	spin_lock(&mapping->i_mmap_lock);
+	down_read(&mapping->i_mmap_sem);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		mm = vma->vm_mm;
 		address = vma->vm_start +
@@ -198,7 +198,7 @@ __xip_unmap (struct address_space * mapp
 			page_cache_release(page);
 		}
 	}
-	spin_unlock(&mapping->i_mmap_lock);
+	up_read(&mapping->i_mmap_sem);
 }
 
 /*
diff --git a/mm/fremap.c b/mm/fremap.c
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -206,13 +206,13 @@ asmlinkage long sys_remap_file_pages(uns
 			}
 			goto out;
 		}
-		spin_lock(&mapping->i_mmap_lock);
+		down_write(&mapping->i_mmap_sem);
 		flush_dcache_mmap_lock(mapping);
 		vma->vm_flags |= VM_NONLINEAR;
 		vma_prio_tree_remove(vma, &mapping->i_mmap);
 		vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear);
 		flush_dcache_mmap_unlock(mapping);
-		spin_unlock(&mapping->i_mmap_lock);
+		up_write(&mapping->i_mmap_sem);
 	}
 
 	mmu_notifier_invalidate_range_start(mm, start, start + size);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -814,7 +814,7 @@ void __unmap_hugepage_range(struct vm_ar
 	struct page *page;
 	struct page *tmp;
 	/*
-	 * A page gathering list, protected by per file i_mmap_lock. The
+	 * A page gathering list, protected by per file i_mmap_sem. The
 	 * lock is used to avoid list corruption from multiple unmapping
 	 * of the same page since we are using page->lru.
 	 */
@@ -864,9 +864,9 @@ void unmap_hugepage_range(struct vm_area
 	 * do nothing in this case.
 	 */
 	if (vma->vm_file) {
-		spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
+		down_write(&vma->vm_file->f_mapping->i_mmap_sem);
 		__unmap_hugepage_range(vma, start, end);
-		spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
+		up_write(&vma->vm_file->f_mapping->i_mmap_sem);
 	}
 }
 
@@ -1111,7 +1111,7 @@ void hugetlb_change_protection(struct vm
 	BUG_ON(address >= end);
 	flush_cache_range(vma, address, end);
 
-	spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
+	down_write(&vma->vm_file->f_mapping->i_mmap_sem);
 	spin_lock(&mm->page_table_lock);
 	for (; address < end; address += HPAGE_SIZE) {
 		ptep = huge_pte_offset(mm, address);
@@ -1126,7 +1126,7 @@ void hugetlb_change_protection(struct vm
 		}
 	}
 	spin_unlock(&mm->page_table_lock);
-	spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
+	up_write(&vma->vm_file->f_mapping->i_mmap_sem);
 
 	flush_tlb_range(vma, start, end);
 }
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -874,7 +874,7 @@ unsigned long unmap_vmas(struct vm_area_
 	unsigned long tlb_start = 0;	/* For tlb_finish_mmu */
 	int tlb_start_valid = 0;
 	unsigned long start = start_addr;
-	spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
+	struct rw_semaphore *i_mmap_sem = details? details->i_mmap_sem: NULL;
 	int fullmm;
 	struct mmu_gather *tlb;
 	struct mm_struct *mm = vma->vm_mm;
@@ -920,8 +920,8 @@ unsigned long unmap_vmas(struct vm_area_
 			tlb_finish_mmu(tlb, tlb_start, start);
 
 			if (need_resched() ||
-				(i_mmap_lock && spin_needbreak(i_mmap_lock))) {
-				if (i_mmap_lock) {
+				(i_mmap_sem && rwsem_needbreak(i_mmap_sem))) {
+				if (i_mmap_sem) {
 					tlb = NULL;
 					goto out;
 				}
@@ -1829,7 +1829,7 @@ unwritable_page:
 /*
  * Helper functions for unmap_mapping_range().
  *
- * __ Notes on dropping i_mmap_lock to reduce latency while unmapping __
+ * __ Notes on dropping i_mmap_sem to reduce latency while unmapping __
  *
  * We have to restart searching the prio_tree whenever we drop the lock,
  * since the iterator is only valid while the lock is held, and anyway
@@ -1848,7 +1848,7 @@ unwritable_page:
  * can't efficiently keep all vmas in step with mapping->truncate_count:
  * so instead reset them all whenever it wraps back to 0 (then go to 1).
  * mapping->truncate_count and vma->vm_truncate_count are protected by
- * i_mmap_lock.
+ * i_mmap_sem.
  *
  * In order to make forward progress despite repeatedly restarting some
  * large vma, note the restart_addr from unmap_vmas when it breaks out:
@@ -1898,7 +1898,7 @@ again:
 
 	restart_addr = zap_page_range(vma, start_addr,
 					end_addr - start_addr, details);
-	need_break = need_resched() || spin_needbreak(details->i_mmap_lock);
+	need_break = need_resched() || rwsem_needbreak(details->i_mmap_sem);
 
 	if (restart_addr >= end_addr) {
 		/* We have now completed this vma: mark it so */
@@ -1912,9 +1912,9 @@ again:
 			goto again;
 	}
 
-	spin_unlock(details->i_mmap_lock);
+	up_write(details->i_mmap_sem);
 	cond_resched();
-	spin_lock(details->i_mmap_lock);
+	down_write(details->i_mmap_sem);
 	return -EINTR;
 }
 
@@ -2008,9 +2008,9 @@ void unmap_mapping_range(struct address_
 	details.last_index = hba + hlen - 1;
 	if (details.last_index < details.first_index)
 		details.last_index = ULONG_MAX;
-	details.i_mmap_lock = &mapping->i_mmap_lock;
+	details.i_mmap_sem = &mapping->i_mmap_sem;
 
-	spin_lock(&mapping->i_mmap_lock);
+	down_write(&mapping->i_mmap_sem);
 
 	/* Protect against endless unmapping loops */
 	mapping->truncate_count++;
@@ -2025,7 +2025,7 @@ void unmap_mapping_range(struct address_
 		unmap_mapping_range_tree(&mapping->i_mmap, &details);
 	if (unlikely(!list_empty(&mapping->i_mmap_nonlinear)))
 		unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details);
-	spin_unlock(&mapping->i_mmap_lock);
+	up_write(&mapping->i_mmap_sem);
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
diff --git a/mm/migrate.c b/mm/migrate.c
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -211,12 +211,12 @@ static void remove_file_migration_ptes(s
 	if (!mapping)
 		return;
 
-	spin_lock(&mapping->i_mmap_lock);
+	down_read(&mapping->i_mmap_sem);
 
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff)
 		remove_migration_pte(vma, old, new);
 
-	spin_unlock(&mapping->i_mmap_lock);
+	up_read(&mapping->i_mmap_sem);
 }
 
 /*
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -189,7 +189,7 @@ error:
 }
 
 /*
- * Requires inode->i_mapping->i_mmap_lock
+ * Requires inode->i_mapping->i_mmap_sem
  */
 static void __remove_shared_vm_struct(struct vm_area_struct *vma,
 		struct file *file, struct address_space *mapping)
@@ -217,9 +217,9 @@ void unlink_file_vma(struct vm_area_stru
 
 	if (file) {
 		struct address_space *mapping = file->f_mapping;
-		spin_lock(&mapping->i_mmap_lock);
+		down_write(&mapping->i_mmap_sem);
 		__remove_shared_vm_struct(vma, file, mapping);
-		spin_unlock(&mapping->i_mmap_lock);
+		up_write(&mapping->i_mmap_sem);
 	}
 }
 
@@ -445,7 +445,7 @@ static void vma_link(struct mm_struct *m
 		mapping = vma->vm_file->f_mapping;
 
 	if (mapping) {
-		spin_lock(&mapping->i_mmap_lock);
+		down_write(&mapping->i_mmap_sem);
 		vma->vm_truncate_count = mapping->truncate_count;
 	}
 	anon_vma_lock(vma);
@@ -455,7 +455,7 @@ static void vma_link(struct mm_struct *m
 
 	anon_vma_unlock(vma);
 	if (mapping)
-		spin_unlock(&mapping->i_mmap_lock);
+		up_write(&mapping->i_mmap_sem);
 
 	mm->map_count++;
 	validate_mm(mm);
@@ -542,7 +542,7 @@ again:			remove_next = 1 + (end > next->
 		mapping = file->f_mapping;
 		if (!(vma->vm_flags & VM_NONLINEAR))
 			root = &mapping->i_mmap;
-		spin_lock(&mapping->i_mmap_lock);
+		down_write(&mapping->i_mmap_sem);
 		if (importer &&
 		    vma->vm_truncate_count != next->vm_truncate_count) {
 			/*
@@ -626,7 +626,7 @@ again:			remove_next = 1 + (end > next->
 	if (anon_vma)
 		spin_unlock(&anon_vma->lock);
 	if (mapping)
-		spin_unlock(&mapping->i_mmap_lock);
+		up_write(&mapping->i_mmap_sem);
 
 	if (remove_next) {
 		if (file) {
@@ -2068,7 +2068,7 @@ void exit_mmap(struct mm_struct *mm)
 
 /* Insert vm structure into process list sorted by address
  * and into the inode's i_mmap tree.  If vm_file is non-NULL
- * then i_mmap_lock is taken here.
+ * then i_mmap_sem is taken here.
  */
 int insert_vm_struct(struct mm_struct * mm, struct vm_area_struct * vma)
 {
diff --git a/mm/mremap.c b/mm/mremap.c
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -88,7 +88,7 @@ static void move_ptes(struct vm_area_str
 		 * and we propagate stale pages into the dst afterward.
 		 */
 		mapping = vma->vm_file->f_mapping;
-		spin_lock(&mapping->i_mmap_lock);
+		down_write(&mapping->i_mmap_sem);
 		if (new_vma->vm_truncate_count &&
 		    new_vma->vm_truncate_count != vma->vm_truncate_count)
 			new_vma->vm_truncate_count = 0;
@@ -120,7 +120,7 @@ static void move_ptes(struct vm_area_str
 	pte_unmap_nested(new_pte - 1);
 	pte_unmap_unlock(old_pte - 1, old_ptl);
 	if (mapping)
-		spin_unlock(&mapping->i_mmap_lock);
+		up_write(&mapping->i_mmap_sem);
 	mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);
 }
 
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -24,7 +24,7 @@
  *   inode->i_alloc_sem (vmtruncate_range)
  *   mm->mmap_sem
  *     page->flags PG_locked (lock_page)
- *       mapping->i_mmap_lock
+ *       mapping->i_mmap_sem
  *         anon_vma->lock
  *           mm->page_table_lock or pte_lock
  *             zone->lru_lock (in mark_page_accessed, isolate_lru_page)
@@ -373,14 +373,14 @@ static int page_referenced_file(struct p
 	 * The page lock not only makes sure that page->mapping cannot
 	 * suddenly be NULLified by truncation, it makes sure that the
 	 * structure at mapping cannot be freed and reused yet,
-	 * so we can safely take mapping->i_mmap_lock.
+	 * so we can safely take mapping->i_mmap_sem.
 	 */
 	BUG_ON(!PageLocked(page));
 
-	spin_lock(&mapping->i_mmap_lock);
+	down_read(&mapping->i_mmap_sem);
 
 	/*
-	 * i_mmap_lock does not stabilize mapcount at all, but mapcount
+	 * i_mmap_sem does not stabilize mapcount at all, but mapcount
 	 * is more likely to be accurate if we note it after spinning.
 	 */
 	mapcount = page_mapcount(page);
@@ -403,7 +403,7 @@ static int page_referenced_file(struct p
 			break;
 	}
 
-	spin_unlock(&mapping->i_mmap_lock);
+	up_read(&mapping->i_mmap_sem);
 	return referenced;
 }
 
@@ -490,12 +490,12 @@ static int page_mkclean_file(struct addr
 
 	BUG_ON(PageAnon(page));
 
-	spin_lock(&mapping->i_mmap_lock);
+	down_read(&mapping->i_mmap_sem);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		if (vma->vm_flags & VM_SHARED)
 			ret += page_mkclean_one(page, vma);
 	}
-	spin_unlock(&mapping->i_mmap_lock);
+	up_read(&mapping->i_mmap_sem);
 	return ret;
 }
 
@@ -930,7 +930,7 @@ static int try_to_unmap_file(struct page
 	unsigned long max_nl_size = 0;
 	unsigned int mapcount;
 
-	spin_lock(&mapping->i_mmap_lock);
+	down_read(&mapping->i_mmap_sem);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		ret = try_to_unmap_one(page, vma, migration);
 		if (ret == SWAP_FAIL || !page_mapped(page))
@@ -967,7 +967,6 @@ static int try_to_unmap_file(struct page
 	mapcount = page_mapcount(page);
 	if (!mapcount)
 		goto out;
-	cond_resched_lock(&mapping->i_mmap_lock);
 
 	max_nl_size = (max_nl_size + CLUSTER_SIZE - 1) & CLUSTER_MASK;
 	if (max_nl_cursor == 0)
@@ -989,7 +988,6 @@ static int try_to_unmap_file(struct page
 			}
 			vma->vm_private_data = (void *) max_nl_cursor;
 		}
-		cond_resched_lock(&mapping->i_mmap_lock);
 		max_nl_cursor += CLUSTER_SIZE;
 	} while (max_nl_cursor <= max_nl_size);
 
@@ -1001,7 +999,7 @@ static int try_to_unmap_file(struct page
 	list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list)
 		vma->vm_private_data = NULL;
 out:
-	spin_unlock(&mapping->i_mmap_lock);
+	up_read(&mapping->i_mmap_sem);
 	return ret;
 }
 

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2008-05-07 14:41 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-05-02 15:05 [PATCH 00 of 11] mmu notifier #v15 Andrea Arcangeli
2008-05-02 15:05 ` [PATCH 01 of 11] mmu-notifier-core Andrea Arcangeli
2008-05-04 19:13   ` Robin Holt
2008-05-04 22:08     ` Andrea Arcangeli
2008-05-05  2:25       ` Robin Holt
2008-05-05 16:21   ` Jack Steiner
2008-05-05 17:14     ` Andrea Arcangeli
2008-05-05 17:25       ` Jack Steiner
2008-05-05 18:34         ` Andrea Arcangeli
2008-05-05 19:46           ` Jack Steiner
2008-05-06 14:46             ` Andrea Arcangeli
2008-05-06 17:53             ` mmu notifier v15 -> v16 diff Andrea Arcangeli
2008-05-02 15:05 ` [PATCH 02 of 11] get_task_mm Andrea Arcangeli
2008-05-02 15:05 ` [PATCH 03 of 11] invalidate_page outside PT lock Andrea Arcangeli
2008-05-02 15:05 ` [PATCH 04 of 11] free-pgtables Andrea Arcangeli
2008-05-02 15:05 ` [PATCH 05 of 11] unmap vmas tlb flushing Andrea Arcangeli
2008-05-02 15:05 ` [PATCH 06 of 11] rwsem contended Andrea Arcangeli
2008-05-02 15:05 ` [PATCH 07 of 11] i_mmap_rwsem Andrea Arcangeli
2008-05-02 15:05 ` [PATCH 08 of 11] anon-vma-rwsem Andrea Arcangeli
2008-05-02 15:05 ` [PATCH 09 of 11] mm_lock-rwsem Andrea Arcangeli
2008-05-02 15:05 ` [PATCH 10 of 11] export zap_page_range for XPMEM Andrea Arcangeli
2008-05-02 15:05 ` [PATCH 11 of 11] mmap sems Andrea Arcangeli
2008-05-03 11:09 ` [PATCH 00 of 11] mmu notifier #v15 Jack Steiner
2008-05-07 14:35 [PATCH 00 of 11] mmu notifier #v16 Andrea Arcangeli
2008-05-07 14:35 ` [PATCH 07 of 11] i_mmap_rwsem Andrea Arcangeli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).