* [patch 0/9] [RFC] EMM Notifier V2
@ 2008-04-01 20:55 Christoph Lameter
  2008-04-01 20:55 ` [patch 1/9] EMM Notifier: The notifier calls Christoph Lameter
                   ` (8 more replies)
  0 siblings, 9 replies; 51+ messages in thread
From: Christoph Lameter @ 2008-04-01 20:55 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman

[Note that I will be giving talks next week at the OpenFabrics Forum
and at the Linux Collab Summit in Austin on memory pinning etc. It would
be great if I could get some feedback on the approach then]

V1->V2:
- Additional optimizations in the VM
- Convert vm spinlocks to rw sems.
- Add XPMEM driver (requires sleeping in callbacks)
- Add XPMEM example

This patchset implements a simple callback for device drivers that establish
their own references to pages (KVM, GRU, XPmem, RDMA/Infiniband, DMA engines,
etc.). These references are unknown to the VM and are therefore external.

With these callbacks it is possible for the device driver to release external
references when the VM requests it. This enables swapping and page migration
and allows remapping, permission changes, etc. for the externally mapped
memory.

With this functionality it also becomes possible to avoid pinning or mlocking
pages (commonly done to stop the VM from unmapping device mapped pages).

A device driver must subscribe to a process using

        emm_notifier_register(struct emm_notifier *, struct mm_struct *)


The VM will then perform callbacks for operations that unmap or change
permissions of pages in that address space. When the process terminates
the callback function is called with emm_release.

Callbacks are performed before and after the unmapping action of the VM.

        emm_invalidate_start    before

        emm_invalidate_end      after

The device driver must hold off establishing new references to pages
in the specified range between a callback with emm_invalidate_start and
the subsequent callback with emm_invalidate_end. This allows the VM to
ensure that no concurrent driver actions are performed on an address
range while it is remapping or unmapping that range.
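
As an illustration (not part of this patchset), a minimal subscriber could
look roughly like the sketch below. Only the emm_* API comes from these
patches; the my_dev_* names are made up for the example.

#include <linux/mm.h>
#include <linux/rmap.h>	/* struct emm_notifier etc. (added by this series) */

static int my_dev_callback(struct emm_notifier *e, struct mm_struct *mm,
		enum emm_operation op, unsigned long start, unsigned long end)
{
	switch (op) {
	case emm_release:
		/* mm is going away: drop all external references here */
		break;
	case emm_invalidate_start:
		/* drop device references to [start, end) and block new ones */
		break;
	case emm_invalidate_end:
		/* references to [start, end) may be established again */
		break;
	case emm_referenced:
		/* return nonzero here if the device referenced the range */
		break;
	}
	return 0;
}

static struct emm_notifier my_dev_notifier = {
	.callback = my_dev_callback,
};

/* Subscribe to a process; mmap_sem must be held for write */
static void my_dev_attach(struct mm_struct *mm)
{
	down_write(&mm->mmap_sem);
	emm_notifier_register(&my_dev_notifier, mm);
	up_write(&mm->mmap_sem);
}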


This patchset contains additional modifications needed to ensure
that the callbacks can sleep. For that purpose two key locks in the vm
need to be converted to rw_sems. These patches are brand new, invasive
and need extensive discussion and evaluation.

The first patch alone may be applied if callbacks in atomic context are
sufficient for a device driver (likely the case for KVM and GRU and simple
DMA drivers).

Following the VM modifications is the XPMEM device driver that allows sharing
of memory between processes running on different instances of Linux. This is
also a prototype. It is known to run trivial sample programs included as the
last patch.

-- 


* [patch 1/9] EMM Notifier: The notifier calls
  2008-04-01 20:55 [patch 0/9] [RFC] EMM Notifier V2 Christoph Lameter
@ 2008-04-01 20:55 ` Christoph Lameter
  2008-04-01 21:14   ` Peter Zijlstra
  2008-04-02  6:49   ` [patch 1/9] EMM Notifier: The notifier calls Andrea Arcangeli
  2008-04-01 20:55 ` [patch 2/9] Move tlb flushing into free_pgtables Christoph Lameter
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 51+ messages in thread
From: Christoph Lameter @ 2008-04-01 20:55 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman

[-- Attachment #1: emm_notifier --]
[-- Type: text/plain, Size: 19852 bytes --]

This patch implements a simple callback for device drivers that establish
their own references to pages (KVM, GRU, XPmem, RDMA/Infiniband, DMA engines
etc). These references are unknown to the VM (therefore external).

With these callbacks it is possible for the device driver to release external
references when the VM requests it. This enables swapping and page migration
and allows remapping, permission changes, etc. for externally mapped memory.

With this functionality it also becomes possible to avoid pinning or mlocking
pages (commonly done to stop the VM from unmapping device mapped pages).

A device driver must subscribe to a process using

	emm_notifier_register(struct emm_notifier *, struct mm_struct *)


The VM will then perform callbacks for operations that unmap or change
permissions of pages in that address space. When the process terminates
the callback function is called with emm_release.

Callbacks are performed before and after the unmapping action of the VM.

	emm_invalidate_start	before

	emm_invalidate_end	after

The device driver must hold off establishing new references to pages
in the specified range between a callback with emm_invalidate_start and
the subsequent callback with emm_invalidate_end. This allows the VM to
ensure that no concurrent driver actions are performed on an address
range while it is remapping or unmapping that range.
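
One way for a driver to implement this hold-off is a simple invalidation
counter, sketched below. This is an illustration only, not part of the
patch; the my_dev_* names are invented. It uses a spinlock, which matters
because, as noted below, some callbacks arrive in atomic context.

#include <linux/errno.h>
#include <linux/spinlock.h>
#include <linux/rmap.h>

static DEFINE_SPINLOCK(my_dev_lock);
static int my_dev_invalidate_count;	/* nested emm_invalidate_start calls */

static int my_dev_callback(struct emm_notifier *e, struct mm_struct *mm,
		enum emm_operation op, unsigned long start, unsigned long end)
{
	spin_lock(&my_dev_lock);
	switch (op) {
	case emm_invalidate_start:
		my_dev_invalidate_count++;
		/* drop existing device references to [start, end) here */
		break;
	case emm_invalidate_end:
		my_dev_invalidate_count--;
		break;
	default:
		break;
	}
	spin_unlock(&my_dev_lock);
	return 0;
}

/* Device fault path: only establish a reference while no invalidate is pending */
static int my_dev_map_page(unsigned long addr)
{
	int ret = 0;

	spin_lock(&my_dev_lock);
	if (my_dev_invalidate_count > 0) {
		/* an invalidate is in flight: caller retries later */
		ret = -EAGAIN;
	} else {
		/* safe to establish the external reference to addr here */
	}
	spin_unlock(&my_dev_lock);
	return ret;
}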

Callbacks are mostly performed in a non-atomic context. However, in
various places spinlocks are held to traverse rmaps, so this patch by itself
is only useful for devices that can remove mappings in an atomic
context (e.g. KVM/GRU).

If the rmap spinlocks are converted to semaphores then all callbacks will
be performed in a non-atomic context. No additional changes will be necessary
to this patch.

V1->V2:
- page_referenced_one: Do not increment reference count if it is already
  != 0.
- Use rcu_assign_pointer() and rcu_dereference() instead of putting in our
  own barriers.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/mm_types.h |    3 +
 include/linux/rmap.h     |   50 +++++++++++++++++++++++++++++
 kernel/fork.c            |    3 +
 mm/Kconfig               |    5 ++
 mm/filemap_xip.c         |    4 ++
 mm/fremap.c              |    2 +
 mm/hugetlb.c             |    3 +
 mm/memory.c              |   42 +++++++++++++++++++-----
 mm/mmap.c                |    3 +
 mm/mprotect.c            |    3 +
 mm/mremap.c              |    4 ++
 mm/rmap.c                |   80 +++++++++++++++++++++++++++++++++++++++++++++--
 12 files changed, 192 insertions(+), 10 deletions(-)

Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h	2008-04-01 12:57:14.957042203 -0700
+++ linux-2.6/include/linux/mm_types.h	2008-04-01 12:57:38.957452502 -0700
@@ -225,6 +225,9 @@ struct mm_struct {
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 	struct mem_cgroup *mem_cgroup;
 #endif
+#ifdef CONFIG_EMM_NOTIFIER
+	struct emm_notifier     *emm_notifier;
+#endif
 };
 
 #endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig	2008-04-01 12:56:13.595994316 -0700
+++ linux-2.6/mm/Kconfig	2008-04-01 12:57:38.957452502 -0700
@@ -193,3 +193,8 @@ config NR_QUICK
 config VIRT_TO_BUS
 	def_bool y
 	depends on !ARCH_NO_VIRT_TO_BUS
+
+config EMM_NOTIFIER
+	def_bool n
+	bool "External Mapped Memory Notifier for drivers directly mapping memory"
+
Index: linux-2.6/include/linux/rmap.h
===================================================================
--- linux-2.6.orig/include/linux/rmap.h	2008-04-01 12:57:24.033197213 -0700
+++ linux-2.6/include/linux/rmap.h	2008-04-01 13:02:26.426353593 -0700
@@ -85,6 +85,56 @@ static inline void page_dup_rmap(struct 
 #endif
 
 /*
+ * Notifier for devices establishing their own references to Linux
+ * kernel pages in addition to the regular mapping via page
+ * table and rmap. The notifier allows the device to drop the mapping
+ * when the VM removes references to pages.
+ */
+enum emm_operation {
+	emm_release,		/* Process exiting */
+	emm_invalidate_start,	/* Before the VM unmaps pages */
+	emm_invalidate_end,	/* After the VM unmapped pages */
+	emm_referenced		/* Check if a range was referenced */
+};
+
+struct emm_notifier {
+	int (*callback)(struct emm_notifier *e, struct mm_struct *mm,
+		enum emm_operation op,
+		unsigned long start, unsigned long end);
+	struct emm_notifier *next;
+};
+
+extern int __emm_notify(struct mm_struct *mm, enum emm_operation op,
+		unsigned long start, unsigned long end);
+
+/*
+ * Callback to the device driver for an externally mapped section
+ * of memory.
+ *
+ * start	Address of first byte of the range
+ * end		Address of first byte after range.
+ */
+static inline int emm_notify(struct mm_struct *mm, enum emm_operation op,
+	unsigned long start, unsigned long end)
+{
+#ifdef CONFIG_EMM_NOTIFIER
+	if (unlikely(mm->emm_notifier))
+		return __emm_notify(mm, op, start, end);
+#endif
+	return 0;
+}
+
+/*
+ * Register a notifier with an mm struct. Release occurs when the process
+ * terminates by calling the notifier function with emm_release.
+ *
+ * Must hold the mmap_sem for write.
+ */
+extern void emm_notifier_register(struct emm_notifier *e,
+					struct mm_struct *mm);
+
+
+/*
  * Called from mm/vmscan.c to handle paging out
  */
 int page_referenced(struct page *, int is_locked, struct mem_cgroup *cnt);
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c	2008-04-01 12:56:13.603994568 -0700
+++ linux-2.6/mm/rmap.c	2008-04-01 12:57:38.957452502 -0700
@@ -263,6 +263,67 @@ pte_t *page_check_address(struct page *p
 	return NULL;
 }
 
+#ifdef CONFIG_EMM_NOTIFIER
+/*
+ * Notifier for devices establishing their own references to Linux
+ * kernel pages in addition to the regular mapping via page
+ * table and rmap. The notifier allows the device to drop the mapping
+ * when the VM removes references to pages.
+ */
+
+/*
+ * This function performs the notifier teardown. It is only called
+ * when the last process using the mm is exiting.
+ */
+void emm_notifier_release(struct mm_struct *mm)
+{
+	struct emm_notifier *e;
+
+	while (mm->emm_notifier) {
+		e = mm->emm_notifier;
+		mm->emm_notifier = e->next;
+		e->callback(e, mm, emm_release, 0, 0);
+	}
+}
+
+/* Register a notifier */
+void emm_notifier_register(struct emm_notifier *e, struct mm_struct *mm)
+{
+	e->next = mm->emm_notifier;
+	/*
+	 * The update to emm_notifier (e->next) must be visible
+	 * before the pointer becomes visible.
+	 * rcu_assign_pointer() does exactly what we need.
+	 */
+	rcu_assign_pointer(mm->emm_notifier, e);
+}
+EXPORT_SYMBOL_GPL(emm_notifier_register);
+
+/* Perform a callback */
+int __emm_notify(struct mm_struct *mm, enum emm_operation op,
+		unsigned long start, unsigned long end)
+{
+	struct emm_notifier *e = rcu_dereference(mm->emm_notifier);
+	int x;
+
+	while (e) {
+
+		if (e->callback) {
+			x = e->callback(e, mm, op, start, end);
+			if (x)
+				return x;
+		}
+		/*
+		 * The notifier contents (e) must be fetched after
+		 * the retrieval of the pointer to the notifier, hence
+		 * rcu_dereference() on the next pointer.
+		 */
+		e = rcu_dereference(e->next);
+	}
+	return 0;
+}
+EXPORT_SYMBOL_GPL(__emm_notify);
+#endif
+
 /*
  * Subfunctions of page_referenced: page_referenced_one called
  * repeatedly from either page_referenced_anon or page_referenced_file.
@@ -298,6 +359,10 @@ static int page_referenced_one(struct pa
 
 	(*mapcount)--;
 	pte_unmap_unlock(pte, ptl);
+
+	if (emm_notify(mm, emm_referenced, address, address + PAGE_SIZE)
+							&& !referenced)
+			referenced++;
 out:
 	return referenced;
 }
@@ -448,9 +513,10 @@ static int page_mkclean_one(struct page 
 	if (address == -EFAULT)
 		goto out;
 
+	emm_notify(mm, emm_invalidate_start, address, address + PAGE_SIZE);
 	pte = page_check_address(page, mm, address, &ptl);
 	if (!pte)
-		goto out;
+		goto out_notifier;
 
 	if (pte_dirty(*pte) || pte_write(*pte)) {
 		pte_t entry;
@@ -464,6 +530,9 @@ static int page_mkclean_one(struct page 
 	}
 
 	pte_unmap_unlock(pte, ptl);
+
+out_notifier:
+	emm_notify(mm, emm_invalidate_end, address, address + PAGE_SIZE);
 out:
 	return ret;
 }
@@ -707,9 +776,10 @@ static int try_to_unmap_one(struct page 
 	if (address == -EFAULT)
 		goto out;
 
+	emm_notify(mm, emm_invalidate_start, address, address + PAGE_SIZE);
 	pte = page_check_address(page, mm, address, &ptl);
 	if (!pte)
-		goto out;
+		goto out_notify;
 
 	/*
 	 * If the page is mlock()d, we cannot swap it out.
@@ -779,6 +849,8 @@ static int try_to_unmap_one(struct page 
 
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
+out_notify:
+	emm_notify(mm, emm_invalidate_end, address, address + PAGE_SIZE);
 out:
 	return ret;
 }
@@ -817,6 +889,7 @@ static void try_to_unmap_cluster(unsigne
 	spinlock_t *ptl;
 	struct page *page;
 	unsigned long address;
+	unsigned long start;
 	unsigned long end;
 
 	address = (vma->vm_start + cursor) & CLUSTER_MASK;
@@ -838,6 +911,8 @@ static void try_to_unmap_cluster(unsigne
 	if (!pmd_present(*pmd))
 		return;
 
+	start = address;
+	emm_notify(mm, emm_invalidate_start, start, end);
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
 
 	/* Update high watermark before we lower rss */
@@ -870,6 +945,7 @@ static void try_to_unmap_cluster(unsigne
 		(*mapcount)--;
 	}
 	pte_unmap_unlock(pte - 1, ptl);
+	emm_notify(mm, emm_invalidate_end, start, end);
 }
 
 static int try_to_unmap_anon(struct page *page, int migration)
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2008-04-01 12:56:13.655995449 -0700
+++ linux-2.6/kernel/fork.c	2008-04-01 12:57:38.961451952 -0700
@@ -362,6 +362,9 @@ static struct mm_struct * mm_init(struct
 
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
+#ifdef CONFIG_EMM_NOTIFIER
+		mm->emm_notifier = NULL;
+#endif
 		return mm;
 	}
 
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2008-04-01 12:56:13.607994778 -0700
+++ linux-2.6/mm/memory.c	2008-04-01 12:57:38.961451952 -0700
@@ -596,6 +596,7 @@ int copy_page_range(struct mm_struct *ds
 	unsigned long next;
 	unsigned long addr = vma->vm_start;
 	unsigned long end = vma->vm_end;
+	int ret = 0;
 
 	/*
 	 * Don't copy ptes where a page fault will fill them correctly.
@@ -605,12 +606,15 @@ int copy_page_range(struct mm_struct *ds
 	 */
 	if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
 		if (!vma->anon_vma)
-			return 0;
+			goto out;
 	}
 
 	if (is_vm_hugetlb_page(vma))
 		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
 
+	if (is_cow_mapping(vma->vm_flags))
+		emm_notify(src_mm, emm_invalidate_start, addr, end);
+
 	dst_pgd = pgd_offset(dst_mm, addr);
 	src_pgd = pgd_offset(src_mm, addr);
 	do {
@@ -618,10 +622,16 @@ int copy_page_range(struct mm_struct *ds
 		if (pgd_none_or_clear_bad(src_pgd))
 			continue;
 		if (copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
-						vma, addr, next))
-			return -ENOMEM;
+						vma, addr, next)) {
+			ret = -ENOMEM;
+			break;
+		}
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
-	return 0;
+
+	if (is_cow_mapping(vma->vm_flags))
+		emm_notify(src_mm, emm_invalidate_end, addr, end);
+out:
+	return ret;
 }
 
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
@@ -894,12 +904,15 @@ unsigned long zap_page_range(struct vm_a
 	unsigned long end = address + size;
 	unsigned long nr_accounted = 0;
 
+	emm_notify(mm, emm_invalidate_start, address, end);
 	lru_add_drain();
 	tlb = tlb_gather_mmu(mm, 0);
 	update_hiwater_rss(mm);
+
 	end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
 	if (tlb)
 		tlb_finish_mmu(tlb, address, end);
+	emm_notify(mm, emm_invalidate_end, address, end);
 	return end;
 }
 
@@ -1340,6 +1353,7 @@ int remap_pfn_range(struct vm_area_struc
 	pgd_t *pgd;
 	unsigned long next;
 	unsigned long end = addr + PAGE_ALIGN(size);
+	unsigned long start = addr;
 	struct mm_struct *mm = vma->vm_mm;
 	int err;
 
@@ -1372,6 +1386,7 @@ int remap_pfn_range(struct vm_area_struc
 	BUG_ON(addr >= end);
 	pfn -= addr >> PAGE_SHIFT;
 	pgd = pgd_offset(mm, addr);
+	emm_notify(mm, emm_invalidate_start, start, end);
 	flush_cache_range(vma, addr, end);
 	do {
 		next = pgd_addr_end(addr, end);
@@ -1380,6 +1395,7 @@ int remap_pfn_range(struct vm_area_struc
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
+	emm_notify(mm, emm_invalidate_end, start, end);
 	return err;
 }
 EXPORT_SYMBOL(remap_pfn_range);
@@ -1463,10 +1479,12 @@ int apply_to_page_range(struct mm_struct
 {
 	pgd_t *pgd;
 	unsigned long next;
+	unsigned long start = addr;
 	unsigned long end = addr + size;
 	int err;
 
 	BUG_ON(addr >= end);
+	emm_notify(mm, emm_invalidate_start, start, end);
 	pgd = pgd_offset(mm, addr);
 	do {
 		next = pgd_addr_end(addr, end);
@@ -1474,6 +1492,7 @@ int apply_to_page_range(struct mm_struct
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
+	emm_notify(mm, emm_invalidate_end, start, end);
 	return err;
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
@@ -1614,8 +1633,10 @@ static int do_wp_page(struct mm_struct *
 			page_table = pte_offset_map_lock(mm, pmd, address,
 							 &ptl);
 			page_cache_release(old_page);
-			if (!pte_same(*page_table, orig_pte))
-				goto unlock;
+			if (!pte_same(*page_table, orig_pte)) {
+				pte_unmap_unlock(page_table, ptl);
+				goto check_dirty;
+			}
 
 			page_mkwrite = 1;
 		}
@@ -1631,7 +1652,8 @@ static int do_wp_page(struct mm_struct *
 		if (ptep_set_access_flags(vma, address, page_table, entry,1))
 			update_mmu_cache(vma, address, entry);
 		ret |= VM_FAULT_WRITE;
-		goto unlock;
+		pte_unmap_unlock(page_table, ptl);
+		goto check_dirty;
 	}
 
 	/*
@@ -1653,6 +1675,7 @@ gotten:
 	if (mem_cgroup_charge(new_page, mm, GFP_KERNEL))
 		goto oom_free_new;
 
+	emm_notify(mm, emm_invalidate_start, address, address + PAGE_SIZE);
 	/*
 	 * Re-check the pte - we dropped the lock
 	 */
@@ -1691,8 +1714,11 @@ gotten:
 		page_cache_release(new_page);
 	if (old_page)
 		page_cache_release(old_page);
-unlock:
+
 	pte_unmap_unlock(page_table, ptl);
+	emm_notify(mm, emm_invalidate_end, address, address + PAGE_SIZE);
+
+check_dirty:
 	if (dirty_page) {
 		if (vma->vm_file)
 			file_update_time(vma->vm_file);
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2008-04-01 12:56:13.615994773 -0700
+++ linux-2.6/mm/mmap.c	2008-04-01 12:57:38.961451952 -0700
@@ -1744,6 +1744,7 @@ static void unmap_region(struct mm_struc
 	struct mmu_gather *tlb;
 	unsigned long nr_accounted = 0;
 
+	emm_notify(mm, emm_invalidate_start, start, end);
 	lru_add_drain();
 	tlb = tlb_gather_mmu(mm, 0);
 	update_hiwater_rss(mm);
@@ -1752,6 +1753,7 @@ static void unmap_region(struct mm_struc
 	free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
 				 next? next->vm_start: 0);
 	tlb_finish_mmu(tlb, start, end);
+	emm_notify(mm, emm_invalidate_end, start, end);
 }
 
 /*
@@ -2038,6 +2040,7 @@ void exit_mmap(struct mm_struct *mm)
 
 	/* mm's last user has gone, and its about to be pulled down */
 	arch_exit_mmap(mm);
+	emm_notify(mm, emm_release, 0, TASK_SIZE);
 
 	lru_add_drain();
 	flush_cache_mm(mm);
Index: linux-2.6/mm/mprotect.c
===================================================================
--- linux-2.6.orig/mm/mprotect.c	2008-04-01 12:56:13.619994769 -0700
+++ linux-2.6/mm/mprotect.c	2008-04-01 12:57:38.961451952 -0700
@@ -21,6 +21,7 @@
 #include <linux/syscalls.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/rmap.h>
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
 #include <asm/cacheflush.h>
@@ -198,10 +199,12 @@ success:
 		dirty_accountable = 1;
 	}
 
+	emm_notify(mm, emm_invalidate_start, start, end);
 	if (is_vm_hugetlb_page(vma))
 		hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
 	else
 		change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
+	emm_notify(mm, emm_invalidate_end, start, end);
 	vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
 	vm_stat_account(mm, newflags, vma->vm_file, nrpages);
 	return 0;
Index: linux-2.6/mm/mremap.c
===================================================================
--- linux-2.6.orig/mm/mremap.c	2008-04-01 12:56:13.627994994 -0700
+++ linux-2.6/mm/mremap.c	2008-04-01 12:57:38.961451952 -0700
@@ -18,6 +18,7 @@
 #include <linux/highmem.h>
 #include <linux/security.h>
 #include <linux/syscalls.h>
+#include <linux/rmap.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -74,7 +75,9 @@ static void move_ptes(struct vm_area_str
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *old_pte, *new_pte, pte;
 	spinlock_t *old_ptl, *new_ptl;
+	unsigned long old_start = old_addr;
 
+	emm_notify(mm, emm_invalidate_start, old_start, old_end);
 	if (vma->vm_file) {
 		/*
 		 * Subtle point from Rajesh Venkatasubramanian: before
@@ -116,6 +119,7 @@ static void move_ptes(struct vm_area_str
 	pte_unmap_unlock(old_pte - 1, old_ptl);
 	if (mapping)
 		spin_unlock(&mapping->i_mmap_lock);
+	emm_notify(mm, emm_invalidate_end, old_start, old_end);
 }
 
 #define LATENCY_LIMIT	(64 * PAGE_SIZE)
Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c	2008-04-01 12:56:13.631995051 -0700
+++ linux-2.6/mm/filemap_xip.c	2008-04-01 12:57:38.961451952 -0700
@@ -190,6 +190,8 @@ __xip_unmap (struct address_space * mapp
 		address = vma->vm_start +
 			((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 		BUG_ON(address < vma->vm_start || address >= vma->vm_end);
+		emm_notify(mm, emm_invalidate_start,
+					address, address + PAGE_SIZE);
 		pte = page_check_address(page, mm, address, &ptl);
 		if (pte) {
 			/* Nuke the page table entry. */
@@ -201,6 +203,8 @@ __xip_unmap (struct address_space * mapp
 			pte_unmap_unlock(pte, ptl);
 			page_cache_release(page);
 		}
+		emm_notify(mm, emm_invalidate_end,
+					address, address + PAGE_SIZE);
 	}
 	spin_unlock(&mapping->i_mmap_lock);
 }
Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c	2008-04-01 12:56:13.639995208 -0700
+++ linux-2.6/mm/fremap.c	2008-04-01 12:57:38.961451952 -0700
@@ -214,7 +214,9 @@ asmlinkage long sys_remap_file_pages(uns
 		spin_unlock(&mapping->i_mmap_lock);
 	}
 
+	emm_notify(mm, emm_invalidate_start, start, end);
 	err = populate_range(mm, vma, start, size, pgoff);
+	emm_notify(mm, emm_invalidate_end, start, end);
 	if (!err && !(flags & MAP_NONBLOCK)) {
 		if (unlikely(has_write_lock)) {
 			downgrade_write(&mm->mmap_sem);
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c	2008-04-01 12:56:13.647995311 -0700
+++ linux-2.6/mm/hugetlb.c	2008-04-01 12:57:38.961451952 -0700
@@ -14,6 +14,7 @@
 #include <linux/mempolicy.h>
 #include <linux/cpuset.h>
 #include <linux/mutex.h>
+#include <linux/rmap.h>
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -799,6 +800,7 @@ void __unmap_hugepage_range(struct vm_ar
 	BUG_ON(start & ~HPAGE_MASK);
 	BUG_ON(end & ~HPAGE_MASK);
 
+	emm_notify(mm, emm_invalidate_start, start, end);
 	spin_lock(&mm->page_table_lock);
 	for (address = start; address < end; address += HPAGE_SIZE) {
 		ptep = huge_pte_offset(mm, address);
@@ -819,6 +821,7 @@ void __unmap_hugepage_range(struct vm_ar
 	}
 	spin_unlock(&mm->page_table_lock);
 	flush_tlb_range(vma, start, end);
+	emm_notify(mm, emm_invalidate_end, start, end);
 	list_for_each_entry_safe(page, tmp, &page_list, lru) {
 		list_del(&page->lru);
 		put_page(page);

-- 


* [patch 2/9] Move tlb flushing into free_pgtables
  2008-04-01 20:55 [patch 0/9] [RFC] EMM Notifier V2 Christoph Lameter
  2008-04-01 20:55 ` [patch 1/9] EMM Notifier: The notifier calls Christoph Lameter
@ 2008-04-01 20:55 ` Christoph Lameter
  2008-04-01 20:55 ` [patch 3/9] Convert i_mmap_lock to i_mmap_sem Christoph Lameter
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 51+ messages in thread
From: Christoph Lameter @ 2008-04-01 20:55 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman

[-- Attachment #1: move_tlb_flush --]
[-- Type: text/plain, Size: 4260 bytes --]

Move the tlb flushing into free_pgtables. The conversion of the locks
taken for reverse map scanning would require taking sleeping locks
in free_pgtables(), and moving the tlb flushing into free_pgtables allows
sleeping in parts of free_pgtables().

This means that we do a tlb_finish_mmu() before freeing the page tables.
Strictly speaking there may be no need for another tlb flush after
freeing the tables, but it is the only way to free a series of page table
pages from the tlb list, and we do not want to call into the page allocator
for performance reasons. Aim9 numbers look okay after this patch.
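
Condensed from the mmap.c hunk below, the ordering in unmap_region()
changes roughly as follows (sketch only; floor/ceiling stand in for the
prev/next expressions in the real code):

	/* before: page tables freed while the mmu_gather is still open */
	unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
	free_pgtables(&tlb, vma, floor, ceiling);
	tlb_finish_mmu(tlb, start, end);

	/* after: finish the gather first; free_pgtables() no longer takes the
	 * gather and instead runs a short tlb_gather_mmu()/tlb_finish_mmu()
	 * cycle of its own around each free_pgd_range() call */
	unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
	tlb_finish_mmu(tlb, start, end);
	free_pgtables(vma, floor, ceiling);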

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/mm.h |    4 ++--
 mm/memory.c        |   14 ++++++++++----
 mm/mmap.c          |    6 +++---
 3 files changed, 15 insertions(+), 9 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2008-03-19 13:30:51.460856986 -0700
+++ linux-2.6/include/linux/mm.h	2008-03-19 13:31:20.809377398 -0700
@@ -751,8 +751,8 @@ int walk_page_range(const struct mm_stru
 		    void *private);
 void free_pgd_range(struct mmu_gather **tlb, unsigned long addr,
 		unsigned long end, unsigned long floor, unsigned long ceiling);
-void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *start_vma,
-		unsigned long floor, unsigned long ceiling);
+void free_pgtables(struct vm_area_struct *start_vma, unsigned long floor,
+						unsigned long ceiling);
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
 			struct vm_area_struct *vma);
 void unmap_mapping_range(struct address_space *mapping,
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2008-03-19 13:29:06.007351495 -0700
+++ linux-2.6/mm/memory.c	2008-03-19 13:46:31.352774359 -0700
@@ -271,9 +271,11 @@ void free_pgd_range(struct mmu_gather **
 	} while (pgd++, addr = next, addr != end);
 }
 
-void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma,
-		unsigned long floor, unsigned long ceiling)
+void free_pgtables(struct vm_area_struct *vma, unsigned long floor,
+							unsigned long ceiling)
 {
+	struct mmu_gather *tlb;
+
 	while (vma) {
 		struct vm_area_struct *next = vma->vm_next;
 		unsigned long addr = vma->vm_start;
@@ -285,8 +287,10 @@ void free_pgtables(struct mmu_gather **t
 		unlink_file_vma(vma);
 
 		if (is_vm_hugetlb_page(vma)) {
-			hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
+			tlb = tlb_gather_mmu(vma->vm_mm, 0);
+			hugetlb_free_pgd_range(&tlb, addr, vma->vm_end,
 				floor, next? next->vm_start: ceiling);
+			tlb_finish_mmu(tlb, addr, vma->vm_end);
 		} else {
 			/*
 			 * Optimization: gather nearby vmas into one call down
@@ -298,8 +302,10 @@ void free_pgtables(struct mmu_gather **t
 				anon_vma_unlink(vma);
 				unlink_file_vma(vma);
 			}
-			free_pgd_range(tlb, addr, vma->vm_end,
+			tlb = tlb_gather_mmu(vma->vm_mm, 0);
+			free_pgd_range(&tlb, addr, vma->vm_end,
 				floor, next? next->vm_start: ceiling);
+			tlb_finish_mmu(tlb, addr, vma->vm_end);
 		}
 		vma = next;
 	}
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2008-03-19 13:29:48.659889667 -0700
+++ linux-2.6/mm/mmap.c	2008-03-19 13:30:36.296604891 -0700
@@ -1750,9 +1750,9 @@ static void unmap_region(struct mm_struc
 	update_hiwater_rss(mm);
 	unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
-	free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
-				 next? next->vm_start: 0);
 	tlb_finish_mmu(tlb, start, end);
+	free_pgtables(vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
+				 next? next->vm_start: 0);
 	emm_notify(mm, emm_invalidate_end, start, end);
 }
 
@@ -2049,8 +2049,8 @@ void exit_mmap(struct mm_struct *mm)
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
-	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
 	tlb_finish_mmu(tlb, 0, end);
+	free_pgtables(vma, FIRST_USER_ADDRESS, 0);
 
 	/*
 	 * Walk the list again, actually closing and freeing it,

-- 


* [patch 3/9] Convert i_mmap_lock to i_mmap_sem
  2008-04-01 20:55 [patch 0/9] [RFC] EMM Notifier V2 Christoph Lameter
  2008-04-01 20:55 ` [patch 1/9] EMM Notifier: The notifier calls Christoph Lameter
  2008-04-01 20:55 ` [patch 2/9] Move tlb flushing into free_pgtables Christoph Lameter
@ 2008-04-01 20:55 ` Christoph Lameter
  2008-04-01 20:55 ` [patch 4/9] Remove tlb pointer from the parameters of unmap vmas Christoph Lameter
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 51+ messages in thread
From: Christoph Lameter @ 2008-04-01 20:55 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman

[-- Attachment #1: emm_immap_sem --]
[-- Type: text/plain, Size: 19519 bytes --]

The conversion to an rwsem allows callbacks during rmap traversal
for files in a non-atomic context. A read/write style lock also allows
concurrent walking of the reverse map. This is fairly straightforward
if one removes pieces of the resched checking.

[Restarting unmapping is an issue to be discussed.]

This slightly increases Aim9 performance results on an 8p.
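
Condensed from the hunks below: reverse map walks now take the semaphore
for read (so they can run concurrently and sleep), while paths that modify
the i_mmap tree or nonlinear list take it for write. For example:

	/* rmap traversal (page_mkclean_file, page_referenced_file, ...) */
	down_read(&mapping->i_mmap_sem);
	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
		if (vma->vm_flags & VM_SHARED)
			ret += page_mkclean_one(page, vma);
	}
	up_read(&mapping->i_mmap_sem);

	/* i_mmap modification (vma_link, unlink_file_vma, dup_mmap,
	 * unmap_mapping_range, ...) */
	down_write(&mapping->i_mmap_sem);
	__remove_shared_vm_struct(vma, file, mapping);
	up_write(&mapping->i_mmap_sem);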

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 arch/x86/mm/hugetlbpage.c |    4 ++--
 fs/hugetlbfs/inode.c      |    4 ++--
 fs/inode.c                |    2 +-
 include/linux/fs.h        |    2 +-
 include/linux/mm.h        |    2 +-
 kernel/fork.c             |    4 ++--
 mm/filemap.c              |    8 ++++----
 mm/filemap_xip.c          |    4 ++--
 mm/fremap.c               |    4 ++--
 mm/hugetlb.c              |   10 +++++-----
 mm/memory.c               |   29 +++++++++--------------------
 mm/migrate.c              |    4 ++--
 mm/mmap.c                 |   16 ++++++++--------
 mm/mremap.c               |    4 ++--
 mm/rmap.c                 |   20 +++++++++-----------
 15 files changed, 52 insertions(+), 65 deletions(-)

Index: linux-2.6/arch/x86/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/hugetlbpage.c	2008-04-01 12:17:51.760884399 -0700
+++ linux-2.6/arch/x86/mm/hugetlbpage.c	2008-04-01 12:33:17.589824389 -0700
@@ -69,7 +69,7 @@ static void huge_pmd_share(struct mm_str
 	if (!vma_shareable(vma, addr))
 		return;
 
-	spin_lock(&mapping->i_mmap_lock);
+	down_read(&mapping->i_mmap_sem);
 	vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
 		if (svma == vma)
 			continue;
@@ -94,7 +94,7 @@ static void huge_pmd_share(struct mm_str
 		put_page(virt_to_page(spte));
 	spin_unlock(&mm->page_table_lock);
 out:
-	spin_unlock(&mapping->i_mmap_lock);
+	up_read(&mapping->i_mmap_sem);
 }
 
 /*
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c	2008-04-01 12:17:51.760884399 -0700
+++ linux-2.6/fs/hugetlbfs/inode.c	2008-04-01 12:33:17.601824636 -0700
@@ -454,10 +454,10 @@ static int hugetlb_vmtruncate(struct ino
 	pgoff = offset >> PAGE_SHIFT;
 
 	i_size_write(inode, offset);
-	spin_lock(&mapping->i_mmap_lock);
+	down_read(&mapping->i_mmap_sem);
 	if (!prio_tree_empty(&mapping->i_mmap))
 		hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
-	spin_unlock(&mapping->i_mmap_lock);
+	up_read(&mapping->i_mmap_sem);
 	truncate_hugepages(inode, offset);
 	return 0;
 }
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2008-04-01 12:17:51.760884399 -0700
+++ linux-2.6/fs/inode.c	2008-04-01 12:33:17.617824968 -0700
@@ -210,7 +210,7 @@ void inode_init_once(struct inode *inode
 	INIT_LIST_HEAD(&inode->i_devices);
 	INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
 	rwlock_init(&inode->i_data.tree_lock);
-	spin_lock_init(&inode->i_data.i_mmap_lock);
+	init_rwsem(&inode->i_data.i_mmap_sem);
 	INIT_LIST_HEAD(&inode->i_data.private_list);
 	spin_lock_init(&inode->i_data.private_lock);
 	INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2008-04-01 12:17:51.764884492 -0700
+++ linux-2.6/include/linux/fs.h	2008-04-01 12:33:17.629825226 -0700
@@ -503,7 +503,7 @@ struct address_space {
 	unsigned int		i_mmap_writable;/* count VM_SHARED mappings */
 	struct prio_tree_root	i_mmap;		/* tree of private and shared mappings */
 	struct list_head	i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
-	spinlock_t		i_mmap_lock;	/* protect tree, count, list */
+	struct rw_semaphore	i_mmap_sem;	/* protect tree, count, list */
 	unsigned int		truncate_count;	/* Cover race condition with truncate */
 	unsigned long		nrpages;	/* number of total pages */
 	pgoff_t			writeback_index;/* writeback starts here */
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2008-04-01 12:30:41.734442981 -0700
+++ linux-2.6/include/linux/mm.h	2008-04-01 12:33:17.641825483 -0700
@@ -716,7 +716,7 @@ struct zap_details {
 	struct address_space *check_mapping;	/* Check page->mapping if set */
 	pgoff_t	first_index;			/* Lowest page->index to unmap */
 	pgoff_t last_index;			/* Highest page->index to unmap */
-	spinlock_t *i_mmap_lock;		/* For unmap_mapping_range: */
+	struct rw_semaphore *i_mmap_sem;	/* For unmap_mapping_range: */
 	unsigned long truncate_count;		/* Compare vm_truncate_count */
 };
 
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2008-04-01 12:29:11.704459847 -0700
+++ linux-2.6/kernel/fork.c	2008-04-01 12:33:17.641825483 -0700
@@ -273,12 +273,12 @@ static int dup_mmap(struct mm_struct *mm
 				atomic_dec(&inode->i_writecount);
 
 			/* insert tmp into the share list, just after mpnt */
-			spin_lock(&file->f_mapping->i_mmap_lock);
+			down_write(&file->f_mapping->i_mmap_sem);
 			tmp->vm_truncate_count = mpnt->vm_truncate_count;
 			flush_dcache_mmap_lock(file->f_mapping);
 			vma_prio_tree_add(tmp, mpnt);
 			flush_dcache_mmap_unlock(file->f_mapping);
-			spin_unlock(&file->f_mapping->i_mmap_lock);
+			up_write(&file->f_mapping->i_mmap_sem);
 		}
 
 		/*
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2008-04-01 12:17:51.764884492 -0700
+++ linux-2.6/mm/filemap.c	2008-04-01 12:33:17.669826082 -0700
@@ -61,16 +61,16 @@ generic_file_direct_IO(int rw, struct ki
 /*
  * Lock ordering:
  *
- *  ->i_mmap_lock		(vmtruncate)
+ *  ->i_mmap_sem		(vmtruncate)
  *    ->private_lock		(__free_pte->__set_page_dirty_buffers)
  *      ->swap_lock		(exclusive_swap_page, others)
  *        ->mapping->tree_lock
  *
  *  ->i_mutex
- *    ->i_mmap_lock		(truncate->unmap_mapping_range)
+ *    ->i_mmap_sem		(truncate->unmap_mapping_range)
  *
  *  ->mmap_sem
- *    ->i_mmap_lock
+ *    ->i_mmap_sem
  *      ->page_table_lock or pte_lock	(various, mainly in memory.c)
  *        ->mapping->tree_lock	(arch-dependent flush_dcache_mmap_lock)
  *
@@ -87,7 +87,7 @@ generic_file_direct_IO(int rw, struct ki
  *    ->sb_lock			(fs/fs-writeback.c)
  *    ->mapping->tree_lock	(__sync_single_inode)
  *
- *  ->i_mmap_lock
+ *  ->i_mmap_sem
  *    ->anon_vma.lock		(vma_adjust)
  *
  *  ->anon_vma.lock
Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c	2008-04-01 12:29:11.752460995 -0700
+++ linux-2.6/mm/filemap_xip.c	2008-04-01 12:33:17.669826082 -0700
@@ -184,7 +184,7 @@ __xip_unmap (struct address_space * mapp
 	if (!page)
 		return;
 
-	spin_lock(&mapping->i_mmap_lock);
+	down_read(&mapping->i_mmap_sem);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		mm = vma->vm_mm;
 		address = vma->vm_start +
@@ -206,7 +206,7 @@ __xip_unmap (struct address_space * mapp
 		emm_notify(mm, emm_invalidate_end,
 					address, address + PAGE_SIZE);
 	}
-	spin_unlock(&mapping->i_mmap_lock);
+	up_read(&mapping->i_mmap_sem);
 }
 
 /*
Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c	2008-04-01 12:29:11.760461078 -0700
+++ linux-2.6/mm/fremap.c	2008-04-01 12:33:17.669826082 -0700
@@ -205,13 +205,13 @@ asmlinkage long sys_remap_file_pages(uns
 			}
 			goto out;
 		}
-		spin_lock(&mapping->i_mmap_lock);
+		down_write(&mapping->i_mmap_sem);
 		flush_dcache_mmap_lock(mapping);
 		vma->vm_flags |= VM_NONLINEAR;
 		vma_prio_tree_remove(vma, &mapping->i_mmap);
 		vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear);
 		flush_dcache_mmap_unlock(mapping);
-		spin_unlock(&mapping->i_mmap_lock);
+		up_write(&mapping->i_mmap_sem);
 	}
 
 	emm_notify(mm, emm_invalidate_start, start, end);
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c	2008-04-01 12:29:11.796461877 -0700
+++ linux-2.6/mm/hugetlb.c	2008-04-01 12:33:17.669826082 -0700
@@ -790,7 +790,7 @@ void __unmap_hugepage_range(struct vm_ar
 	struct page *page;
 	struct page *tmp;
 	/*
-	 * A page gathering list, protected by per file i_mmap_lock. The
+	 * A page gathering list, protected by per file i_mmap_sem. The
 	 * lock is used to avoid list corruption from multiple unmapping
 	 * of the same page since we are using page->lru.
 	 */
@@ -840,9 +840,9 @@ void unmap_hugepage_range(struct vm_area
 	 * do nothing in this case.
 	 */
 	if (vma->vm_file) {
-		spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
+		down_write(&vma->vm_file->f_mapping->i_mmap_sem);
 		__unmap_hugepage_range(vma, start, end);
-		spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
+		up_write(&vma->vm_file->f_mapping->i_mmap_sem);
 	}
 }
 
@@ -1085,7 +1085,7 @@ void hugetlb_change_protection(struct vm
 	BUG_ON(address >= end);
 	flush_cache_range(vma, address, end);
 
-	spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
+	down_write(&vma->vm_file->f_mapping->i_mmap_sem);
 	spin_lock(&mm->page_table_lock);
 	for (; address < end; address += HPAGE_SIZE) {
 		ptep = huge_pte_offset(mm, address);
@@ -1100,7 +1100,7 @@ void hugetlb_change_protection(struct vm
 		}
 	}
 	spin_unlock(&mm->page_table_lock);
-	spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
+	up_write(&vma->vm_file->f_mapping->i_mmap_sem);
 
 	flush_tlb_range(vma, start, end);
 }
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2008-04-01 12:30:41.750443458 -0700
+++ linux-2.6/mm/memory.c	2008-04-01 12:36:12.677510808 -0700
@@ -839,7 +839,6 @@ unsigned long unmap_vmas(struct mmu_gath
 	unsigned long tlb_start = 0;	/* For tlb_finish_mmu */
 	int tlb_start_valid = 0;
 	unsigned long start = start_addr;
-	spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
 	int fullmm = (*tlbp)->fullmm;
 
 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
@@ -876,22 +875,12 @@ unsigned long unmap_vmas(struct mmu_gath
 			}
 
 			tlb_finish_mmu(*tlbp, tlb_start, start);
-
-			if (need_resched() ||
-				(i_mmap_lock && spin_needbreak(i_mmap_lock))) {
-				if (i_mmap_lock) {
-					*tlbp = NULL;
-					goto out;
-				}
-				cond_resched();
-			}
-
+			cond_resched();
 			*tlbp = tlb_gather_mmu(vma->vm_mm, fullmm);
 			tlb_start_valid = 0;
 			zap_work = ZAP_BLOCK_SIZE;
 		}
 	}
-out:
 	return start;	/* which is now the end (or restart) address */
 }
 
@@ -1757,7 +1746,7 @@ unwritable_page:
 /*
  * Helper functions for unmap_mapping_range().
  *
- * __ Notes on dropping i_mmap_lock to reduce latency while unmapping __
+ * __ Notes on dropping i_mmap_sem to reduce latency while unmapping __
  *
  * We have to restart searching the prio_tree whenever we drop the lock,
  * since the iterator is only valid while the lock is held, and anyway
@@ -1776,7 +1765,7 @@ unwritable_page:
  * can't efficiently keep all vmas in step with mapping->truncate_count:
  * so instead reset them all whenever it wraps back to 0 (then go to 1).
  * mapping->truncate_count and vma->vm_truncate_count are protected by
- * i_mmap_lock.
+ * i_mmap_sem.
  *
  * In order to make forward progress despite repeatedly restarting some
  * large vma, note the restart_addr from unmap_vmas when it breaks out:
@@ -1826,7 +1815,7 @@ again:
 
 	restart_addr = zap_page_range(vma, start_addr,
 					end_addr - start_addr, details);
-	need_break = need_resched() || spin_needbreak(details->i_mmap_lock);
+	need_break = need_resched();
 
 	if (restart_addr >= end_addr) {
 		/* We have now completed this vma: mark it so */
@@ -1840,9 +1829,9 @@ again:
 			goto again;
 	}
 
-	spin_unlock(details->i_mmap_lock);
+	up_write(details->i_mmap_sem);
 	cond_resched();
-	spin_lock(details->i_mmap_lock);
+	down_write(details->i_mmap_sem);
 	return -EINTR;
 }
 
@@ -1936,9 +1925,9 @@ void unmap_mapping_range(struct address_
 	details.last_index = hba + hlen - 1;
 	if (details.last_index < details.first_index)
 		details.last_index = ULONG_MAX;
-	details.i_mmap_lock = &mapping->i_mmap_lock;
+	details.i_mmap_sem = &mapping->i_mmap_sem;
 
-	spin_lock(&mapping->i_mmap_lock);
+	down_write(&mapping->i_mmap_sem);
 
 	/* Protect against endless unmapping loops */
 	mapping->truncate_count++;
@@ -1953,7 +1942,7 @@ void unmap_mapping_range(struct address_
 		unmap_mapping_range_tree(&mapping->i_mmap, &details);
 	if (unlikely(!list_empty(&mapping->i_mmap_nonlinear)))
 		unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details);
-	spin_unlock(&mapping->i_mmap_lock);
+	up_write(&mapping->i_mmap_sem);
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c	2008-04-01 12:17:51.768884533 -0700
+++ linux-2.6/mm/migrate.c	2008-04-01 12:33:17.673826169 -0700
@@ -211,12 +211,12 @@ static void remove_file_migration_ptes(s
 	if (!mapping)
 		return;
 
-	spin_lock(&mapping->i_mmap_lock);
+	down_read(&mapping->i_mmap_sem);
 
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff)
 		remove_migration_pte(vma, old, new);
 
-	spin_unlock(&mapping->i_mmap_lock);
+	up_read(&mapping->i_mmap_sem);
 }
 
 /*
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2008-04-01 12:30:41.750443458 -0700
+++ linux-2.6/mm/mmap.c	2008-04-01 12:33:17.673826169 -0700
@@ -186,7 +186,7 @@ error:
 }
 
 /*
- * Requires inode->i_mapping->i_mmap_lock
+ * Requires inode->i_mapping->i_mmap_sem
  */
 static void __remove_shared_vm_struct(struct vm_area_struct *vma,
 		struct file *file, struct address_space *mapping)
@@ -214,9 +214,9 @@ void unlink_file_vma(struct vm_area_stru
 
 	if (file) {
 		struct address_space *mapping = file->f_mapping;
-		spin_lock(&mapping->i_mmap_lock);
+		down_write(&mapping->i_mmap_sem);
 		__remove_shared_vm_struct(vma, file, mapping);
-		spin_unlock(&mapping->i_mmap_lock);
+		up_write(&mapping->i_mmap_sem);
 	}
 }
 
@@ -439,7 +439,7 @@ static void vma_link(struct mm_struct *m
 		mapping = vma->vm_file->f_mapping;
 
 	if (mapping) {
-		spin_lock(&mapping->i_mmap_lock);
+		down_write(&mapping->i_mmap_sem);
 		vma->vm_truncate_count = mapping->truncate_count;
 	}
 	anon_vma_lock(vma);
@@ -449,7 +449,7 @@ static void vma_link(struct mm_struct *m
 
 	anon_vma_unlock(vma);
 	if (mapping)
-		spin_unlock(&mapping->i_mmap_lock);
+		up_write(&mapping->i_mmap_sem);
 
 	mm->map_count++;
 	validate_mm(mm);
@@ -536,7 +536,7 @@ again:			remove_next = 1 + (end > next->
 		mapping = file->f_mapping;
 		if (!(vma->vm_flags & VM_NONLINEAR))
 			root = &mapping->i_mmap;
-		spin_lock(&mapping->i_mmap_lock);
+		down_write(&mapping->i_mmap_sem);
 		if (importer &&
 		    vma->vm_truncate_count != next->vm_truncate_count) {
 			/*
@@ -620,7 +620,7 @@ again:			remove_next = 1 + (end > next->
 	if (anon_vma)
 		spin_unlock(&anon_vma->lock);
 	if (mapping)
-		spin_unlock(&mapping->i_mmap_lock);
+		up_write(&mapping->i_mmap_sem);
 
 	if (remove_next) {
 		if (file)
@@ -2064,7 +2064,7 @@ void exit_mmap(struct mm_struct *mm)
 
 /* Insert vm structure into process list sorted by address
  * and into the inode's i_mmap tree.  If vm_file is non-NULL
- * then i_mmap_lock is taken here.
+ * then i_mmap_sem is taken here.
  */
 int insert_vm_struct(struct mm_struct * mm, struct vm_area_struct * vma)
 {
Index: linux-2.6/mm/mremap.c
===================================================================
--- linux-2.6.orig/mm/mremap.c	2008-04-01 12:29:11.752460995 -0700
+++ linux-2.6/mm/mremap.c	2008-04-01 12:33:17.673826169 -0700
@@ -86,7 +86,7 @@ static void move_ptes(struct vm_area_str
 		 * and we propagate stale pages into the dst afterward.
 		 */
 		mapping = vma->vm_file->f_mapping;
-		spin_lock(&mapping->i_mmap_lock);
+		down_write(&mapping->i_mmap_sem);
 		if (new_vma->vm_truncate_count &&
 		    new_vma->vm_truncate_count != vma->vm_truncate_count)
 			new_vma->vm_truncate_count = 0;
@@ -118,7 +118,7 @@ static void move_ptes(struct vm_area_str
 	pte_unmap_nested(new_pte - 1);
 	pte_unmap_unlock(old_pte - 1, old_ptl);
 	if (mapping)
-		spin_unlock(&mapping->i_mmap_lock);
+		up_write(&mapping->i_mmap_sem);
 	emm_notify(mm, emm_invalidate_end, old_start, old_end);
 }
 
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c	2008-04-01 12:30:07.993704887 -0700
+++ linux-2.6/mm/rmap.c	2008-04-01 12:33:17.673826169 -0700
@@ -24,7 +24,7 @@
  *   inode->i_alloc_sem (vmtruncate_range)
  *   mm->mmap_sem
  *     page->flags PG_locked (lock_page)
- *       mapping->i_mmap_lock
+ *       mapping->i_mmap_sem
  *         anon_vma->lock
  *           mm->page_table_lock or pte_lock
  *             zone->lru_lock (in mark_page_accessed, isolate_lru_page)
@@ -430,14 +430,14 @@ static int page_referenced_file(struct p
 	 * The page lock not only makes sure that page->mapping cannot
 	 * suddenly be NULLified by truncation, it makes sure that the
 	 * structure at mapping cannot be freed and reused yet,
-	 * so we can safely take mapping->i_mmap_lock.
+	 * so we can safely take mapping->i_mmap_sem.
 	 */
 	BUG_ON(!PageLocked(page));
 
-	spin_lock(&mapping->i_mmap_lock);
+	down_read(&mapping->i_mmap_sem);
 
 	/*
-	 * i_mmap_lock does not stabilize mapcount at all, but mapcount
+	 * i_mmap_sem does not stabilize mapcount at all, but mapcount
 	 * is more likely to be accurate if we note it after spinning.
 	 */
 	mapcount = page_mapcount(page);
@@ -460,7 +460,7 @@ static int page_referenced_file(struct p
 			break;
 	}
 
-	spin_unlock(&mapping->i_mmap_lock);
+	up_read(&mapping->i_mmap_sem);
 	return referenced;
 }
 
@@ -546,12 +546,12 @@ static int page_mkclean_file(struct addr
 
 	BUG_ON(PageAnon(page));
 
-	spin_lock(&mapping->i_mmap_lock);
+	down_read(&mapping->i_mmap_sem);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		if (vma->vm_flags & VM_SHARED)
 			ret += page_mkclean_one(page, vma);
 	}
-	spin_unlock(&mapping->i_mmap_lock);
+	up_read(&mapping->i_mmap_sem);
 	return ret;
 }
 
@@ -990,7 +990,7 @@ static int try_to_unmap_file(struct page
 	unsigned long max_nl_size = 0;
 	unsigned int mapcount;
 
-	spin_lock(&mapping->i_mmap_lock);
+	down_read(&mapping->i_mmap_sem);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		ret = try_to_unmap_one(page, vma, migration);
 		if (ret == SWAP_FAIL || !page_mapped(page))
@@ -1027,7 +1027,6 @@ static int try_to_unmap_file(struct page
 	mapcount = page_mapcount(page);
 	if (!mapcount)
 		goto out;
-	cond_resched_lock(&mapping->i_mmap_lock);
 
 	max_nl_size = (max_nl_size + CLUSTER_SIZE - 1) & CLUSTER_MASK;
 	if (max_nl_cursor == 0)
@@ -1049,7 +1048,6 @@ static int try_to_unmap_file(struct page
 			}
 			vma->vm_private_data = (void *) max_nl_cursor;
 		}
-		cond_resched_lock(&mapping->i_mmap_lock);
 		max_nl_cursor += CLUSTER_SIZE;
 	} while (max_nl_cursor <= max_nl_size);
 
@@ -1061,7 +1059,7 @@ static int try_to_unmap_file(struct page
 	list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list)
 		vma->vm_private_data = NULL;
 out:
-	spin_unlock(&mapping->i_mmap_lock);
+	up_read(&mapping->i_mmap_sem);
 	return ret;
 }
 

-- 


* [patch 4/9] Remove tlb pointer from the parameters of unmap vmas
  2008-04-01 20:55 [patch 0/9] [RFC] EMM Notifier V2 Christoph Lameter
                   ` (2 preceding siblings ...)
  2008-04-01 20:55 ` [patch 3/9] Convert i_mmap_lock to i_mmap_sem Christoph Lameter
@ 2008-04-01 20:55 ` Christoph Lameter
  2008-04-01 20:55 ` [patch 5/9] Convert anon_vma lock to rw_sem and refcount Christoph Lameter
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 51+ messages in thread
From: Christoph Lameter @ 2008-04-01 20:55 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman

[-- Attachment #1: cleanup_unmap_vmas --]
[-- Type: text/plain, Size: 6690 bytes --]

We no longer abort unmapping in unmap_vmas() because we can reschedule while
unmapping, since we are holding a semaphore. This allows moving more
of the tlb flushing into unmap_vmas(), reducing code in various places.
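
Condensed from the hunks below: unmap_vmas() now brackets the whole
operation itself (emm notifier calls, lru_add_drain(), tlb gather/finish),
so its callers shrink; zap_page_range(), for example, reduces to:

	unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
			unsigned long size, struct zap_details *details)
	{
		unsigned long end = address + size;
		unsigned long nr_accounted = 0;

		return unmap_vmas(vma, address, end, &nr_accounted, details);
	}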

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/mm.h |    3 +--
 mm/memory.c        |   43 +++++++++++++++++--------------------------
 mm/mmap.c          |   18 +++---------------
 3 files changed, 21 insertions(+), 43 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2008-04-01 13:02:41.374608387 -0700
+++ linux-2.6/include/linux/mm.h	2008-04-01 13:02:43.898651546 -0700
@@ -723,8 +723,7 @@ struct zap_details {
 struct page *vm_normal_page(struct vm_area_struct *, unsigned long, pte_t);
 unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
 		unsigned long size, struct zap_details *);
-unsigned long unmap_vmas(struct mmu_gather **tlb,
-		struct vm_area_struct *start_vma, unsigned long start_addr,
+unsigned long unmap_vmas(struct vm_area_struct *start_vma, unsigned long start_addr,
 		unsigned long end_addr, unsigned long *nr_accounted,
 		struct zap_details *);
 
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2008-04-01 13:02:41.378608315 -0700
+++ linux-2.6/mm/memory.c	2008-04-01 13:02:43.902651345 -0700
@@ -806,7 +806,6 @@ static unsigned long unmap_page_range(st
 
 /**
  * unmap_vmas - unmap a range of memory covered by a list of vma's
- * @tlbp: address of the caller's struct mmu_gather
  * @vma: the starting vma
  * @start_addr: virtual address at which to start unmapping
  * @end_addr: virtual address at which to end unmapping
@@ -818,20 +817,13 @@ static unsigned long unmap_page_range(st
  * Unmap all pages in the vma list.
  *
  * We aim to not hold locks for too long (for scheduling latency reasons).
- * So zap pages in ZAP_BLOCK_SIZE bytecounts.  This means we need to
- * return the ending mmu_gather to the caller.
+ * So zap pages in ZAP_BLOCK_SIZE bytecounts.
  *
  * Only addresses between `start' and `end' will be unmapped.
  *
  * The VMA list must be sorted in ascending virtual address order.
- *
- * unmap_vmas() assumes that the caller will flush the whole unmapped address
- * range after unmap_vmas() returns.  So the only responsibility here is to
- * ensure that any thus-far unmapped pages are flushed before unmap_vmas()
- * drops the lock and schedules.
  */
-unsigned long unmap_vmas(struct mmu_gather **tlbp,
-		struct vm_area_struct *vma, unsigned long start_addr,
+unsigned long unmap_vmas(struct vm_area_struct *vma, unsigned long start_addr,
 		unsigned long end_addr, unsigned long *nr_accounted,
 		struct zap_details *details)
 {
@@ -839,7 +831,15 @@ unsigned long unmap_vmas(struct mmu_gath
 	unsigned long tlb_start = 0;	/* For tlb_finish_mmu */
 	int tlb_start_valid = 0;
 	unsigned long start = start_addr;
-	int fullmm = (*tlbp)->fullmm;
+	int fullmm;
+	struct mmu_gather *tlb;
+	struct mm_struct *mm = vma->vm_mm;
+
+	emm_notify(mm, emm_invalidate_start, start_addr, end_addr);
+	lru_add_drain();
+	tlb = tlb_gather_mmu(mm, 0);
+	update_hiwater_rss(mm);
+	fullmm = tlb->fullmm;
 
 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
 		unsigned long end;
@@ -866,7 +866,7 @@ unsigned long unmap_vmas(struct mmu_gath
 						(HPAGE_SIZE / PAGE_SIZE);
 				start = end;
 			} else
-				start = unmap_page_range(*tlbp, vma,
+				start = unmap_page_range(tlb, vma,
 						start, end, &zap_work, details);
 
 			if (zap_work > 0) {
@@ -874,13 +874,15 @@ unsigned long unmap_vmas(struct mmu_gath
 				break;
 			}
 
-			tlb_finish_mmu(*tlbp, tlb_start, start);
+			tlb_finish_mmu(tlb, tlb_start, start);
 			cond_resched();
-			*tlbp = tlb_gather_mmu(vma->vm_mm, fullmm);
+			tlb = tlb_gather_mmu(vma->vm_mm, fullmm);
 			tlb_start_valid = 0;
 			zap_work = ZAP_BLOCK_SIZE;
 		}
 	}
+	tlb_finish_mmu(tlb, start_addr, end_addr);
+	emm_notify(mm, emm_invalidate_end, start_addr, end_addr);
 	return start;	/* which is now the end (or restart) address */
 }
 
@@ -894,21 +896,10 @@ unsigned long unmap_vmas(struct mmu_gath
 unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
 		unsigned long size, struct zap_details *details)
 {
-	struct mm_struct *mm = vma->vm_mm;
-	struct mmu_gather *tlb;
 	unsigned long end = address + size;
 	unsigned long nr_accounted = 0;
 
-	emm_notify(mm, emm_invalidate_start, address, end);
-	lru_add_drain();
-	tlb = tlb_gather_mmu(mm, 0);
-	update_hiwater_rss(mm);
-
-	end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
-	if (tlb)
-		tlb_finish_mmu(tlb, address, end);
-	emm_notify(mm, emm_invalidate_end, address, end);
-	return end;
+	return unmap_vmas(vma, address, end, &nr_accounted, details);
 }
 
 /*
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2008-04-01 13:02:41.378608315 -0700
+++ linux-2.6/mm/mmap.c	2008-04-01 13:03:19.627259624 -0700
@@ -1741,19 +1741,12 @@ static void unmap_region(struct mm_struc
 		unsigned long start, unsigned long end)
 {
 	struct vm_area_struct *next = prev? prev->vm_next: mm->mmap;
-	struct mmu_gather *tlb;
 	unsigned long nr_accounted = 0;
 
-	emm_notify(mm, emm_invalidate_start, start, end);
-	lru_add_drain();
-	tlb = tlb_gather_mmu(mm, 0);
-	update_hiwater_rss(mm);
-	unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
+	unmap_vmas(vma, start, end, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
-	tlb_finish_mmu(tlb, start, end);
 	free_pgtables(vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
 				 next? next->vm_start: 0);
-	emm_notify(mm, emm_invalidate_end, start, end);
 }
 
 /*
@@ -2033,7 +2026,6 @@ EXPORT_SYMBOL(do_brk);
 /* Release all mmaps. */
 void exit_mmap(struct mm_struct *mm)
 {
-	struct mmu_gather *tlb;
 	struct vm_area_struct *vma = mm->mmap;
 	unsigned long nr_accounted = 0;
 	unsigned long end;
@@ -2041,15 +2033,11 @@ void exit_mmap(struct mm_struct *mm)
 	/* mm's last user has gone, and its about to be pulled down */
 	arch_exit_mmap(mm);
 	emm_notify(mm, emm_release, 0, TASK_SIZE);
-
 	lru_add_drain();
 	flush_cache_mm(mm);
-	tlb = tlb_gather_mmu(mm, 1);
-	/* Don't update_hiwater_rss(mm) here, do_exit already did */
-	/* Use -1 here to ensure all VMAs in the mm are unmapped */
-	end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
+
+	end = unmap_vmas(vma, 0, -1, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
-	tlb_finish_mmu(tlb, 0, end);
 	free_pgtables(vma, FIRST_USER_ADDRESS, 0);
 
 	/*

-- 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [patch 5/9] Convert anon_vma lock to rw_sem and refcount
  2008-04-01 20:55 [patch 0/9] [RFC] EMM Notifier V2 Christoph Lameter
                   ` (3 preceding siblings ...)
  2008-04-01 20:55 ` [patch 4/9] Remove tlb pointer from the parameters of unmap vmas Christoph Lameter
@ 2008-04-01 20:55 ` Christoph Lameter
  2008-04-02 17:50   ` Andrea Arcangeli
  2008-04-01 20:55 ` [patch 6/9] This patch exports zap_page_range as it is needed by XPMEM Christoph Lameter
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 51+ messages in thread
From: Christoph Lameter @ 2008-04-01 20:55 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman

[-- Attachment #1: emm_anon_vma_sem --]
[-- Type: text/plain, Size: 10096 bytes --]

Convert the anon_vma spinlock to an rw semaphore. This allows concurrent
traversal of reverse maps for try_to_unmap and page_mkclean. It also
allows sleeping functions to be called from reverse map traversal.

An additional complication is that RCU is used in some contexts to guarantee
the presence of the anon_vma while we acquire the lock. We cannot take a
semaphore within an RCU critical section. Add a refcount to the anon_vma
structure which allows us to give an existence guarantee for the anon_vma
structure independent of the spinlock or the list contents.

The refcount can then be taken within the RCU section. If it has been
taken successfully then the refcount guarantees the existence of the
anon_vma. The refcount in anon_vma also allows us to fix a nasty
issue in page migration where we fudged by using rcu for a long code
path to guarantee the existence of the anon_vma.

The refcount in general allows a shortening of RCU critical sections since
we can do an rcu_read_unlock() after taking the refcount. This is particularly
relevant if the anon_vma chains contain hundreds of entries.
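
As an illustration (not part of the patch), here is a minimal sketch of how a
caller can use the speculative reference to keep the RCU section short while
still traversing the anon_vma list under the sleeping lock. The helper name is
made up; grab_anon_vma(), put_anon_vma() and anon_vma->sem are introduced by
the diff below:

	static void walk_anon_vma_sketch(struct page *page)
	{
		struct anon_vma *anon_vma;
		struct vm_area_struct *vma;

		/* RCU is held only inside grab_anon_vma() */
		anon_vma = grab_anon_vma(page);
		if (!anon_vma)
			return;

		/* the refcount now pins the structure, so we may sleep */
		down_read(&anon_vma->sem);
		list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
			/* ... work on each vma, sleeping allowed ... */
		}
		up_read(&anon_vma->sem);

		put_anon_vma(anon_vma);	/* drop the speculative reference */
	}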

Issues:
- Atomic overhead increases in situations where a new reference
  to the anon_vma has to be established or removed. Overhead also increases
  when a speculative reference is used (try_to_unmap,
  page_mkclean, page migration). There are also more frequent processor
  switches because up_xxx() lets waiting tasks run first.
  This causes e.g. the Aim9 brk performance test to go down by 10-15%.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/rmap.h |   20 ++++++++++++++++---
 mm/migrate.c         |   26 ++++++++++---------------
 mm/mmap.c            |    4 +--
 mm/rmap.c            |   53 +++++++++++++++++++++++++++++----------------------
 4 files changed, 61 insertions(+), 42 deletions(-)

Index: linux-2.6/include/linux/rmap.h
===================================================================
--- linux-2.6.orig/include/linux/rmap.h	2008-03-25 21:59:40.597918752 -0700
+++ linux-2.6/include/linux/rmap.h	2008-03-25 22:00:02.909168301 -0700
@@ -25,7 +25,8 @@
  * pointing to this anon_vma once its vma list is empty.
  */
 struct anon_vma {
-	spinlock_t lock;	/* Serialize access to vma list */
+	atomic_t refcount;	/* vmas on the list */
+	struct rw_semaphore sem;/* Serialize access to vma list */
 	struct list_head head;	/* List of private "related" vmas */
 };
 
@@ -43,18 +44,31 @@ static inline void anon_vma_free(struct 
 	kmem_cache_free(anon_vma_cachep, anon_vma);
 }
 
+struct anon_vma *grab_anon_vma(struct page *page);
+
+static inline void get_anon_vma(struct anon_vma *anon_vma)
+{
+	atomic_inc(&anon_vma->refcount);
+}
+
+static inline void put_anon_vma(struct anon_vma *anon_vma)
+{
+	if (atomic_dec_and_test(&anon_vma->refcount))
+		anon_vma_free(anon_vma);
+}
+
 static inline void anon_vma_lock(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
 	if (anon_vma)
-		spin_lock(&anon_vma->lock);
+		down_write(&anon_vma->sem);
 }
 
 static inline void anon_vma_unlock(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
 	if (anon_vma)
-		spin_unlock(&anon_vma->lock);
+		up_write(&anon_vma->sem);
 }
 
 /*
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c	2008-03-25 21:59:57.246668245 -0700
+++ linux-2.6/mm/migrate.c	2008-03-25 22:00:02.909168301 -0700
@@ -235,15 +235,16 @@ static void remove_anon_migration_ptes(s
 		return;
 
 	/*
-	 * We hold the mmap_sem lock. So no need to call page_lock_anon_vma.
+	 * We hold either the mmap_sem lock or a reference on the
+	 * anon_vma. So no need to call page_lock_anon_vma.
 	 */
 	anon_vma = (struct anon_vma *) (mapping - PAGE_MAPPING_ANON);
-	spin_lock(&anon_vma->lock);
+	down_read(&anon_vma->sem);
 
 	list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
 		remove_migration_pte(vma, old, new);
 
-	spin_unlock(&anon_vma->lock);
+	up_read(&anon_vma->sem);
 }
 
 /*
@@ -623,7 +624,7 @@ static int unmap_and_move(new_page_t get
 	int rc = 0;
 	int *result = NULL;
 	struct page *newpage = get_new_page(page, private, &result);
-	int rcu_locked = 0;
+	struct anon_vma *anon_vma = NULL;
 	int charge = 0;
 
 	if (!newpage)
@@ -647,16 +648,14 @@ static int unmap_and_move(new_page_t get
 	}
 	/*
 	 * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
-	 * we cannot notice that anon_vma is freed while we migrates a page.
+	 * we cannot notice that anon_vma is freed while we migrate a page.
 	 * This rcu_read_lock() delays freeing anon_vma pointer until the end
 	 * of migration. File cache pages are no problem because of page_lock()
 	 * File Caches may use write_page() or lock_page() in migration, then,
 	 * just care Anon page here.
 	 */
-	if (PageAnon(page)) {
-		rcu_read_lock();
-		rcu_locked = 1;
-	}
+	if (PageAnon(page))
+		anon_vma = grab_anon_vma(page);
 
 	/*
 	 * Corner case handling:
@@ -674,10 +673,7 @@ static int unmap_and_move(new_page_t get
 		if (!PageAnon(page) && PagePrivate(page)) {
 			/*
 			 * Go direct to try_to_free_buffers() here because
-			 * a) that's what try_to_release_page() would do anyway
-			 * b) we may be under rcu_read_lock() here, so we can't
-			 *    use GFP_KERNEL which is what try_to_release_page()
-			 *    needs to be effective.
+			 * that's what try_to_release_page() would do anyway
 			 */
 			try_to_free_buffers(page);
 		}
@@ -698,8 +694,8 @@ static int unmap_and_move(new_page_t get
 	} else if (charge)
  		mem_cgroup_end_migration(newpage);
 rcu_unlock:
-	if (rcu_locked)
-		rcu_read_unlock();
+	if (anon_vma)
+		put_anon_vma(anon_vma);
 
 unlock:
 
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c	2008-03-25 21:59:57.256667410 -0700
+++ linux-2.6/mm/rmap.c	2008-03-25 22:00:02.909168301 -0700
@@ -68,7 +68,7 @@ int anon_vma_prepare(struct vm_area_stru
 		if (anon_vma) {
 			allocated = NULL;
 			locked = anon_vma;
-			spin_lock(&locked->lock);
+			down_write(&locked->sem);
 		} else {
 			anon_vma = anon_vma_alloc();
 			if (unlikely(!anon_vma))
@@ -80,6 +80,7 @@ int anon_vma_prepare(struct vm_area_stru
 		/* page_table_lock to protect against threads */
 		spin_lock(&mm->page_table_lock);
 		if (likely(!vma->anon_vma)) {
+			get_anon_vma(anon_vma);
 			vma->anon_vma = anon_vma;
 			list_add_tail(&vma->anon_vma_node, &anon_vma->head);
 			allocated = NULL;
@@ -87,7 +88,7 @@ int anon_vma_prepare(struct vm_area_stru
 		spin_unlock(&mm->page_table_lock);
 
 		if (locked)
-			spin_unlock(&locked->lock);
+			up_write(&locked->sem);
 		if (unlikely(allocated))
 			anon_vma_free(allocated);
 	}
@@ -98,14 +99,17 @@ void __anon_vma_merge(struct vm_area_str
 {
 	BUG_ON(vma->anon_vma != next->anon_vma);
 	list_del(&next->anon_vma_node);
+	put_anon_vma(vma->anon_vma);
 }
 
 void __anon_vma_link(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
 
-	if (anon_vma)
+	if (anon_vma) {
+		get_anon_vma(anon_vma);
 		list_add_tail(&vma->anon_vma_node, &anon_vma->head);
+	}
 }
 
 void anon_vma_link(struct vm_area_struct *vma)
@@ -113,36 +117,32 @@ void anon_vma_link(struct vm_area_struct
 	struct anon_vma *anon_vma = vma->anon_vma;
 
 	if (anon_vma) {
-		spin_lock(&anon_vma->lock);
+		get_anon_vma(anon_vma);
+		down_write(&anon_vma->sem);
 		list_add_tail(&vma->anon_vma_node, &anon_vma->head);
-		spin_unlock(&anon_vma->lock);
+		up_write(&anon_vma->sem);
 	}
 }
 
 void anon_vma_unlink(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
-	int empty;
 
 	if (!anon_vma)
 		return;
 
-	spin_lock(&anon_vma->lock);
+	down_write(&anon_vma->sem);
 	list_del(&vma->anon_vma_node);
-
-	/* We must garbage collect the anon_vma if it's empty */
-	empty = list_empty(&anon_vma->head);
-	spin_unlock(&anon_vma->lock);
-
-	if (empty)
-		anon_vma_free(anon_vma);
+	up_write(&anon_vma->sem);
+	put_anon_vma(anon_vma);
 }
 
 static void anon_vma_ctor(struct kmem_cache *cachep, void *data)
 {
 	struct anon_vma *anon_vma = data;
 
-	spin_lock_init(&anon_vma->lock);
+	init_rwsem(&anon_vma->sem);
+	atomic_set(&anon_vma->refcount, 0);
 	INIT_LIST_HEAD(&anon_vma->head);
 }
 
@@ -156,9 +156,9 @@ void __init anon_vma_init(void)
  * Getting a lock on a stable anon_vma from a page off the LRU is
  * tricky: page_lock_anon_vma rely on RCU to guard against the races.
  */
-static struct anon_vma *page_lock_anon_vma(struct page *page)
+struct anon_vma *grab_anon_vma(struct page *page)
 {
-	struct anon_vma *anon_vma;
+	struct anon_vma *anon_vma = NULL;
 	unsigned long anon_mapping;
 
 	rcu_read_lock();
@@ -169,17 +169,26 @@ static struct anon_vma *page_lock_anon_v
 		goto out;
 
 	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
-	spin_lock(&anon_vma->lock);
-	return anon_vma;
+	if (!atomic_inc_not_zero(&anon_vma->refcount))
+		anon_vma = NULL;
 out:
 	rcu_read_unlock();
-	return NULL;
+	return anon_vma;
+}
+
+static struct anon_vma *page_lock_anon_vma(struct page *page)
+{
+	struct anon_vma *anon_vma = grab_anon_vma(page);
+
+	if (anon_vma)
+		down_read(&anon_vma->sem);
+	return anon_vma;
 }
 
 static void page_unlock_anon_vma(struct anon_vma *anon_vma)
 {
-	spin_unlock(&anon_vma->lock);
-	rcu_read_unlock();
+	up_read(&anon_vma->sem);
+	put_anon_vma(anon_vma);
 }
 
 /*
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2008-03-25 21:59:57.256667410 -0700
+++ linux-2.6/mm/mmap.c	2008-03-25 22:00:02.909168301 -0700
@@ -564,7 +564,7 @@ again:			remove_next = 1 + (end > next->
 	if (vma->anon_vma)
 		anon_vma = vma->anon_vma;
 	if (anon_vma) {
-		spin_lock(&anon_vma->lock);
+		down_write(&anon_vma->sem);
 		/*
 		 * Easily overlooked: when mprotect shifts the boundary,
 		 * make sure the expanding vma has anon_vma set if the
@@ -618,7 +618,7 @@ again:			remove_next = 1 + (end > next->
 	}
 
 	if (anon_vma)
-		spin_unlock(&anon_vma->lock);
+		up_write(&anon_vma->sem);
 	if (mapping)
 		up_write(&mapping->i_mmap_sem);
 

-- 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [patch 6/9] This patch exports zap_page_range as it is needed by XPMEM.
  2008-04-01 20:55 [patch 0/9] [RFC] EMM Notifier V2 Christoph Lameter
                   ` (4 preceding siblings ...)
  2008-04-01 20:55 ` [patch 5/9] Convert anon_vma lock to rw_sem and refcount Christoph Lameter
@ 2008-04-01 20:55 ` Christoph Lameter
  2008-04-01 20:55 ` [patch 7/9] Locking rules for taking multiple mmap_sem locks Christoph Lameter
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 51+ messages in thread
From: Christoph Lameter @ 2008-04-01 20:55 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Dean Nelson, Robin Holt, Avi Kivity,
	Izik Eidus, kvm-devel, Peter Zijlstra, general, Steve Wise,
	Roland Dreier, Kanoj Sarcar, steiner, linux-kernel, linux-mm,
	daniel.blueman

[-- Attachment #1: xpmem_v003_export-zap_page_range --]
[-- Type: text/plain, Size: 910 bytes --]

XPMEM would have used sys_madvise() except that madvise_dontneed()
returns -EINVAL if VM_PFNMAP is set, which is always true for the pages
XPMEM imports from other partitions and is also true for uncached pages
allocated locally via the mspec allocator.  XPMEM needs zap_page_range()
functionality for these types of pages as well as 'normal' pages.
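
For reference, a hedged sketch (not part of this patch) of the kind of call
XPMEM makes through the exported symbol; mm, vaddr and size are assumed to
describe a range within a single attachment mapping (compare
xpmem_clear_PTEs_of_att() in the XPMEM driver patch):

	down_read(&mm->mmap_sem);
	vma = find_vma(mm, vaddr);
	if (vma && vma->vm_start <= vaddr)
		zap_page_range(vma, vaddr, size, NULL);
	up_read(&mm->mmap_sem);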

Signed-off-by: Dean Nelson <dcn@sgi.com>

---
 mm/memory.c |    1 +
 1 file changed, 1 insertion(+)

Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2008-04-01 13:02:43.902651345 -0700
+++ linux-2.6/mm/memory.c	2008-04-01 13:04:43.720691616 -0700
@@ -901,6 +901,7 @@ unsigned long zap_page_range(struct vm_a
 
 	return unmap_vmas(vma, address, end, &nr_accounted, details);
 }
+EXPORT_SYMBOL_GPL(zap_page_range);
 
 /*
  * Do a quick page-table lookup for a single page.

-- 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [patch 7/9] Locking rules for taking multiple mmap_sem locks.
  2008-04-01 20:55 [patch 0/9] [RFC] EMM Notifier V2 Christoph Lameter
                   ` (5 preceding siblings ...)
  2008-04-01 20:55 ` [patch 6/9] This patch exports zap_page_range as it is needed by XPMEM Christoph Lameter
@ 2008-04-01 20:55 ` Christoph Lameter
  2008-04-01 20:55 ` [patch 8/9] XPMEM: The device driver Christoph Lameter
  2008-04-01 20:55 ` [patch 9/9] XPMEM: Simple example Christoph Lameter
  8 siblings, 0 replies; 51+ messages in thread
From: Christoph Lameter @ 2008-04-01 20:55 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Dean Nelson, Robin Holt, Avi Kivity,
	Izik Eidus, kvm-devel, Peter Zijlstra, general, Steve Wise,
	Roland Dreier, Kanoj Sarcar, steiner, linux-kernel, linux-mm,
	daniel.blueman

[-- Attachment #1: xpmem_v003_lock-rule --]
[-- Type: text/plain, Size: 826 bytes --]

This patch adds a lock ordering rule to avoid a potential deadlock when
multiple mmap_sems need to be locked.
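
A minimal sketch of the rule (not part of the patch); the helper name is made
up and lockdep nesting annotations are omitted. xpmem_fault_handler() in patch
8/9 applies the same idea with a trylock fallback:

	/* Take two mmap_sems for reading in address order (lowest first). */
	static void lock_two_mmap_sems(struct mm_struct *a, struct mm_struct *b)
	{
		if (a == b) {
			down_read(&a->mmap_sem);
		} else if (&a->mmap_sem < &b->mmap_sem) {
			down_read(&a->mmap_sem);
			down_read(&b->mmap_sem);
		} else {
			down_read(&b->mmap_sem);
			down_read(&a->mmap_sem);
		}
	}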

Signed-off-by: Dean Nelson <dcn@sgi.com>

---
 mm/filemap.c |    3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2008-04-01 13:02:41.374608387 -0700
+++ linux-2.6/mm/filemap.c	2008-04-01 13:05:02.777015782 -0700
@@ -80,6 +80,9 @@ generic_file_direct_IO(int rw, struct ki
  *  ->i_mutex			(generic_file_buffered_write)
  *    ->mmap_sem		(fault_in_pages_readable->do_page_fault)
  *
+ *    When taking multiple mmap_sems, one should lock the lowest-addressed
+ *    one first proceeding on up to the highest-addressed one.
+ *
  *  ->i_mutex
  *    ->i_alloc_sem             (various)
  *

-- 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [patch 8/9] XPMEM: The device driver
  2008-04-01 20:55 [patch 0/9] [RFC] EMM Notifier V2 Christoph Lameter
                   ` (6 preceding siblings ...)
  2008-04-01 20:55 ` [patch 7/9] Locking rules for taking multiple mmap_sem locks Christoph Lameter
@ 2008-04-01 20:55 ` Christoph Lameter
  2008-04-01 20:55 ` [patch 9/9] XPMEM: Simple example Christoph Lameter
  8 siblings, 0 replies; 51+ messages in thread
From: Christoph Lameter @ 2008-04-01 20:55 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman

[-- Attachment #1: xpmem_v003_emm_SSI_v3 --]
[-- Type: text/plain, Size: 122852 bytes --]

XPmem device driver that allows sharing of address spaces across different
instances of Linux. [Experimental, lots of issues still to be fixed].
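
A rough, untested user-space sketch of the intended call sequence, pieced
together from the ioctl handler below (the command codes, structure fields and
"/dev/xpmem" node all appear in this patch; the user-space header, error
handling and the fact that the get/attach side normally runs in a different
thread group are assumptions). Patch 9/9 contains a complete example:

	#include <stddef.h>
	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include "xpmem.h"	/* assumed to expose the cmd codes and structs */

	/* Hypothetical sketch: export [buf, buf + len) and attach it again. */
	static int xpmem_share_sketch(void *buf, size_t len)
	{
		int fd = open("/dev/xpmem", O_RDWR);
		struct xpmem_cmd_make make = {
			.vaddr = (unsigned long)buf,
			.size = len,
			.permit_type = XPMEM_PERMIT_MODE,
			.permit_value = 0600,
		};
		struct xpmem_cmd_get get;
		struct xpmem_cmd_attach att;

		if (fd < 0 || ioctl(fd, XPMEM_CMD_MAKE, &make) < 0)
			return -1;

		get = (struct xpmem_cmd_get){
			.segid = make.segid,	/* filled in by XPMEM_CMD_MAKE */
			.flags = XPMEM_RDWR,
			.permit_type = XPMEM_PERMIT_MODE,
		};
		if (ioctl(fd, XPMEM_CMD_GET, &get) < 0)
			return -1;

		att = (struct xpmem_cmd_attach){
			.apid = get.apid,	/* filled in by XPMEM_CMD_GET */
			.offset = 0,
			.size = len,
		};
		if (ioctl(fd, XPMEM_CMD_ATTACH, &att) < 0)
			return -1;

		/* att.vaddr now holds a new mapping of buf */
		return 0;
	}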

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: emm_notifier_xpmem_v1/drivers/misc/xp/Makefile
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ emm_notifier_xpmem_v1/drivers/misc/xp/Makefile	2008-04-01 10:42:33.045763082 -0500
@@ -0,0 +1,16 @@
+# drivers/misc/xp/Makefile
+#
+# This file is subject to the terms and conditions of the GNU General Public
+# License.  See the file "COPYING" in the main directory of this archive
+# for more details.
+#
+# Copyright (C) 1999,2001-2008 Silicon Graphics, Inc.  All Rights Reserved.
+#
+
+# This is just temporary.  Please do not comment.  I am waiting for Dean
+# Nelson's XPC patches to go in and will modify files introduced by his patches
+# to enable.
+obj-m				+= xpmem.o
+xpmem-y				:= xpmem_main.o xpmem_make.o xpmem_get.o \
+				   xpmem_attach.o xpmem_pfn.o \
+				   xpmem_misc.o
Index: emm_notifier_xpmem_v1/drivers/misc/xp/xpmem_attach.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ emm_notifier_xpmem_v1/drivers/misc/xp/xpmem_attach.c	2008-04-01 10:42:33.221784791 -0500
@@ -0,0 +1,824 @@
+/*
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License.  See the file "COPYING" in the main directory of this archive
+ * for more details.
+ *
+ * Copyright (c) 2004-2007 Silicon Graphics, Inc.  All Rights Reserved.
+ */
+
+/*
+ * Cross Partition Memory (XPMEM) attach support.
+ */
+
+#include <linux/device.h>
+#include <linux/err.h>
+#include <linux/mm.h>
+#include <linux/file.h>
+#include <linux/mman.h>
+#include "xpmem.h"
+#include "xpmem_private.h"
+
+/*
+ * This function is called whenever a XPMEM address segment is unmapped.
+ * We only expect this to occur from a XPMEM detach operation, and if that
+ * is the case, there is nothing to do since the detach code takes care of
+ * everything. In all other cases, something is tinkering with XPMEM vmas
+ * outside of the XPMEM API, so we do the necessary cleanup and kill the
+ * current thread group. The vma argument is the portion of the address space
+ * that is being unmapped.
+ */
+static void
+xpmem_close(struct vm_area_struct *vma)
+{
+	struct vm_area_struct *remaining_vma;
+	u64 remaining_vaddr;
+	struct xpmem_access_permit *ap;
+	struct xpmem_attachment *att;
+
+	att = vma->vm_private_data;
+	if (att == NULL)
+		return;
+
+	xpmem_att_ref(att);
+	mutex_lock(&att->mutex);
+
+	if (att->flags & XPMEM_FLAG_DESTROYING) {
+		/* the unmap is being done via a detach operation */
+		mutex_unlock(&att->mutex);
+		xpmem_att_deref(att);
+		return;
+	}
+
+	if (current->flags & PF_EXITING) {
+		/* the unmap is being done via process exit */
+		mutex_unlock(&att->mutex);
+		ap = att->ap;
+		xpmem_ap_ref(ap);
+		xpmem_detach_att(ap, att);
+		xpmem_ap_deref(ap);
+		xpmem_att_deref(att);
+		return;
+	}
+
+	/*
+	 * See if the entire vma is being unmapped. If so, clean up the
+	 * xpmem_attachment structure and leave the vma to be cleaned up
+	 * by the kernel exit path.
+	 */
+	if (vma->vm_start == att->at_vaddr &&
+	    ((vma->vm_end - vma->vm_start) == att->at_size)) {
+
+		xpmem_att_set_destroying(att);
+
+		ap = att->ap;
+		xpmem_ap_ref(ap);
+
+		spin_lock(&ap->lock);
+		list_del_init(&att->att_list);
+		spin_unlock(&ap->lock);
+
+		xpmem_ap_deref(ap);
+
+		xpmem_att_set_destroyed(att);
+		xpmem_att_destroyable(att);
+		goto out;
+	}
+
+	/*
+	 * Find the starting vaddr of the vma that will remain after the unmap
+	 * has finished. The following if-statement tells whether the kernel
+	 * is unmapping the head, tail, or middle of a vma respectively.
+	 */
+	if (vma->vm_start == att->at_vaddr)
+		remaining_vaddr = vma->vm_end;
+	else if (vma->vm_end == att->at_vaddr + att->at_size)
+		remaining_vaddr = att->at_vaddr;
+	else {
+		/*
+		 * If the unmap occurred in the middle of vma, we have two
+		 * remaining vmas to fix up. We first clear out the tail vma
+		 * so it gets cleaned up at exit without any ties remaining
+		 * to XPMEM.
+		 */
+		remaining_vaddr = vma->vm_end;
+		remaining_vma = find_vma(current->mm, remaining_vaddr);
+		BUG_ON(!remaining_vma ||
+		       remaining_vma->vm_start > remaining_vaddr ||
+		       remaining_vma->vm_private_data != vma->vm_private_data);
+
+		/* this should be safe (we have the mmap_sem write-locked) */
+		remaining_vma->vm_private_data = NULL;
+		remaining_vma->vm_ops = NULL;
+
+		/* now set the starting vaddr to point to the head vma */
+		remaining_vaddr = att->at_vaddr;
+	}
+
+	/*
+	 * Find the remaining vma left over by the unmap split and fix
+	 * up the corresponding xpmem_attachment structure.
+	 */
+	remaining_vma = find_vma(current->mm, remaining_vaddr);
+	BUG_ON(!remaining_vma ||
+	       remaining_vma->vm_start > remaining_vaddr ||
+	       remaining_vma->vm_private_data != vma->vm_private_data);
+
+	att->at_vaddr = remaining_vma->vm_start;
+	att->at_size = remaining_vma->vm_end - remaining_vma->vm_start;
+
+	/* clear out the private data for the vma being unmapped */
+	vma->vm_private_data = NULL;
+
+out:
+	mutex_unlock(&att->mutex);
+	xpmem_att_deref(att);
+
+	/* cause the demise of the current thread group */
+	dev_err(xpmem, "unexpected unmap of XPMEM segment at [0x%lx - 0x%lx], "
+		"killed process %d (%s)\n", vma->vm_start, vma->vm_end,
+		current->pid, current->comm);
+	sigaddset(&current->pending.signal, SIGKILL);
+	set_tsk_thread_flag(current, TIF_SIGPENDING);
+}
+
+static unsigned long
+xpmem_fault_handler(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	int ret;
+	int drop_memprot = 0;
+	int seg_tg_mmap_sem_locked = 0;
+	int vma_verification_needed = 0;
+	int recalls_blocked = 0;
+	u64 seg_vaddr;
+	u64 paddr;
+	unsigned long pfn = 0;
+	u64 *xpmem_pfn;
+	struct xpmem_thread_group *ap_tg;
+	struct xpmem_thread_group *seg_tg;
+	struct xpmem_access_permit *ap;
+	struct xpmem_attachment *att;
+	struct xpmem_segment *seg;
+	sigset_t oldset;
+
+	/* ensure do_coredump() doesn't fault pages of this attachment */
+	if (current->flags & PF_DUMPCORE)
+		return 0;
+
+	att = vma->vm_private_data;
+	if (att == NULL)
+		return 0;
+
+	xpmem_att_ref(att);
+	ap = att->ap;
+	xpmem_ap_ref(ap);
+	ap_tg = ap->tg;
+	xpmem_tg_ref(ap_tg);
+
+	seg = ap->seg;
+	xpmem_seg_ref(seg);
+	seg_tg = seg->tg;
+	xpmem_tg_ref(seg_tg);
+
+	DBUG_ON(current->tgid != ap_tg->tgid);
+	DBUG_ON(ap->mode != XPMEM_RDWR);
+
+	if ((ap->flags & XPMEM_FLAG_DESTROYING) ||
+	    (ap_tg->flags & XPMEM_FLAG_DESTROYING))
+		goto out_1;
+
+	/* translate the fault page offset to the source virtual address */
+	seg_vaddr = seg->vaddr + (vmf->pgoff << PAGE_SHIFT);
+
+	/*
+	 * The faulting thread has its mmap_sem locked on entrance to this
+	 * fault handler. In order to supply the missing page we will need
+	 * to get access to the segment that has it, as well as lock the
+	 * mmap_sem of the thread group that owns the segment should it be
+	 * different from the faulting thread's. Together these provide the
+	 * potential for a deadlock, which we attempt to avoid in what follows.
+	 */
+
+	ret = xpmem_seg_down_read(seg_tg, seg, 0, 0);
+
+avoid_deadlock_1:
+
+	if (ret == -EAGAIN) {
+		/* to avoid possible deadlock drop current->mm->mmap_sem */
+		up_read(&current->mm->mmap_sem);
+		ret = xpmem_seg_down_read(seg_tg, seg, 0, 1);
+		down_read(&current->mm->mmap_sem);
+		vma_verification_needed = 1;
+	}
+	if (ret != 0)
+		goto out_1;
+
+avoid_deadlock_2:
+
+	/* verify vma hasn't changed due to dropping current->mm->mmap_sem */
+	if (vma_verification_needed) {
+		struct vm_area_struct *retry_vma;
+
+		retry_vma = find_vma(current->mm, (u64)vmf->virtual_address);
+		if (!retry_vma ||
+		    retry_vma->vm_start > (u64)vmf->virtual_address ||
+		    !xpmem_is_vm_ops_set(retry_vma) ||
+		    retry_vma->vm_private_data != att)
+			goto out_2;
+
+		vma_verification_needed = 0;
+	}
+
+	xpmem_block_nonfatal_signals(&oldset);
+	if (mutex_lock_interruptible(&att->mutex)) {
+		xpmem_unblock_nonfatal_signals(&oldset);
+		goto out_2;
+	}
+	xpmem_unblock_nonfatal_signals(&oldset);
+
+	if ((att->flags & XPMEM_FLAG_DESTROYING) ||
+	    (ap_tg->flags & XPMEM_FLAG_DESTROYING) ||
+	    (seg_tg->flags & XPMEM_FLAG_DESTROYING))
+		goto out_3;
+
+	if (!seg_tg_mmap_sem_locked &&
+		   &current->mm->mmap_sem > &seg_tg->mm->mmap_sem) {
+		/*
+		 * The faulting thread's mmap_sem is numerically larger
+		 * than the seg's thread group's mmap_sem address-wise,
+		 * therefore we need to acquire the latter's mmap_sem in a
+		 * safe manner before calling xpmem_ensure_valid_PFNs() to
+		 * avoid a potential deadlock.
+		 *
+		 * Concerning the inc/dec of mm_users in this function:
+		 * When /dev/xpmem is opened by a user process, xpmem_open()
+		 * increments mm_users and when it is flushed, xpmem_flush()
+		 * decrements it via mmput() after having first ensured that
+		 * no XPMEM attachments to this mm exist. Therefore, the
+		 * decrement of mm_users by this function will never take it
+		 * to zero.
+		 */
+		seg_tg_mmap_sem_locked = 1;
+		atomic_inc(&seg_tg->mm->mm_users);
+		if (!down_read_trylock(&seg_tg->mm->mmap_sem)) {
+			mutex_unlock(&att->mutex);
+			up_read(&current->mm->mmap_sem);
+			down_read(&seg_tg->mm->mmap_sem);
+			down_read(&current->mm->mmap_sem);
+			vma_verification_needed = 1;
+			goto avoid_deadlock_2;
+		}
+	}
+
+	ret = xpmem_ensure_valid_PFNs(seg, seg_vaddr, 1, drop_memprot, 1,
+				      (vma->vm_flags & VM_PFNMAP),
+				      seg_tg_mmap_sem_locked, &recalls_blocked);
+	if (seg_tg_mmap_sem_locked) {
+		up_read(&seg_tg->mm->mmap_sem);
+		/* mm_users won't dec to 0, see comment above where inc'd */
+		atomic_dec(&seg_tg->mm->mm_users);
+		seg_tg_mmap_sem_locked = 0;
+	}
+	if (ret != 0) {
+		/* xpmem_ensure_valid_PFNs could not re-acquire. */
+		if (ret == -ENOENT) {
+			mutex_unlock(&att->mutex);
+			goto out_3;
+		}
+
+		if (ret == -EAGAIN) {
+			if (recalls_blocked) {
+				xpmem_unblock_recall_PFNs(seg_tg);
+				recalls_blocked = 0;
+			}
+			mutex_unlock(&att->mutex);
+			xpmem_seg_up_read(seg_tg, seg, 0);
+			goto avoid_deadlock_1;
+		}
+
+		goto out_4;
+	}
+
+	xpmem_pfn = xpmem_vaddr_to_PFN(seg, seg_vaddr);
+	DBUG_ON(!XPMEM_PFN_IS_KNOWN(xpmem_pfn));
+
+	if (*xpmem_pfn & XPMEM_PFN_UNCACHED)
+		vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+
+	paddr = XPMEM_PFN_TO_PADDR(xpmem_pfn);
+
+#ifdef CONFIG_IA64
+	if (att->flags & XPMEM_ATTACH_WC)
+		vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
+	else if (att->flags & XPMEM_ATTACH_GETSPACE)
+		paddr = __pa(TO_GET(paddr));
+#endif /* CONFIG_IA64 */
+
+	pfn = paddr >> PAGE_SHIFT;
+
+	att->flags |= XPMEM_FLAG_VALIDPTES;
+
+out_4:
+	if (recalls_blocked) {
+		xpmem_unblock_recall_PFNs(seg_tg);
+		recalls_blocked = 0;
+	}
+out_3:
+	mutex_unlock(&att->mutex);
+out_2:
+	if (seg_tg_mmap_sem_locked) {
+		up_read(&seg_tg->mm->mmap_sem);
+		/* mm_users won't dec to 0, see comment above where inc'd */
+		atomic_dec(&seg_tg->mm->mm_users);
+	}
+	xpmem_seg_up_read(seg_tg, seg, 0);
+out_1:
+	xpmem_att_deref(att);
+	xpmem_ap_deref(ap);
+	xpmem_tg_deref(ap_tg);
+	xpmem_seg_deref(seg);
+	xpmem_tg_deref(seg_tg);
+	return pfn;
+}
+
+/*
+ * This is the vm_ops->fault for xpmem_attach()'d segments. It is
+ * called by the Linux kernel function __do_fault().
+ */
+static int
+xpmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	unsigned long pfn;
+
+	pfn = xpmem_fault_handler(vma, vmf);
+	if (!pfn)
+		return VM_FAULT_SIGBUS;
+
+	BUG_ON(!pfn_valid(pfn));
+	vmf->page = pfn_to_page(pfn);
+	get_page(vmf->page);
+	return 0;
+}
+
+/*
+ * This is the vm_ops->nopfn for xpmem_attach()'d segments. It is
+ * called by the Linux kernel function do_no_pfn().
+ */
+static unsigned long
+xpmem_nopfn(struct vm_area_struct *vma, unsigned long vaddr)
+{
+	struct vm_fault vmf;
+	unsigned long pfn;
+
+	vmf.virtual_address = (void __user *)vaddr;
+	vmf.pgoff = (((vaddr & PAGE_MASK) - vma->vm_start) >> PAGE_SHIFT) +
+		    vma->vm_pgoff;
+	vmf.flags = 0; /* >>> Should be (write_access ? FAULT_FLAG_WRITE : 0) */
+	vmf.page = NULL;
+
+	pfn = xpmem_fault_handler(vma, &vmf);
+	if (!pfn)
+		return NOPFN_SIGBUS;
+
+	return pfn;
+}
+
+struct vm_operations_struct xpmem_vm_ops_fault = {
+	.close = xpmem_close,
+	.fault = xpmem_fault
+};
+
+struct vm_operations_struct xpmem_vm_ops_nopfn = {
+	.close = xpmem_close,
+	.nopfn = xpmem_nopfn
+};
+
+/*
+ * This function is called via the Linux kernel mmap() code, which is
+ * instigated by the call to do_mmap() in xpmem_attach().
+ */
+int
+xpmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	/*
+	 * When a mapping is related to a file, the file pointer is typically
+	 * stored in vma->vm_file and a fput() is done to it when the VMA is
+	 * unmapped. Since file is of no interest in XPMEM's case, we ensure
+	 * vm_file is empty and do the fput() here.
+	 */
+	vma->vm_file = NULL;
+	fput(file);
+
+	vma->vm_ops = &xpmem_vm_ops_fault;
+	vma->vm_flags |= VM_CAN_NONLINEAR;
+	return 0;
+}
+
+/*
+ * Attach a XPMEM address segment.
+ */
+int
+xpmem_attach(struct file *file, __s64 apid, off_t offset, size_t size,
+	     u64 vaddr, int fd, int att_flags, u64 *at_vaddr_p)
+{
+	int ret;
+	unsigned long flags;
+	unsigned long prot_flags = PROT_READ | PROT_WRITE;
+	unsigned long vm_pfnmap = 0;
+	u64 seg_vaddr;
+	u64 at_vaddr;
+	struct xpmem_thread_group *ap_tg;
+	struct xpmem_thread_group *seg_tg;
+	struct xpmem_access_permit *ap;
+	struct xpmem_segment *seg;
+	struct xpmem_attachment *att;
+	struct vm_area_struct *vma;
+	struct vm_area_struct *seg_vma;
+
+
+	/*
+	 * The attachment's starting offset into the source segment must be
+	 * page aligned and the attachment must be a multiple of pages in size.
+	 */
+	if (offset_in_page(offset) != 0 || offset_in_page(size) != 0)
+		return -EINVAL;
+
+	/* ensure the requested attach point (i.e., vaddr) is valid */
+	if (vaddr && (offset_in_page(vaddr) != 0 || vaddr + size > TASK_SIZE))
+		return -EINVAL;
+
+	/*
+	 * Ensure threads doing GET space attachments are pinned, and set
+	 * prot_flags to read-only.
+	 *
+	 * raw_smp_processor_id() is called directly to avoid the debug info
+	 * generated by smp_processor_id() should CONFIG_DEBUG_PREEMPT be set
+	 * and the thread not be pinned to this CPU, a condition for which
+	 * we return an error anyways.
+	 */
+	if (att_flags & XPMEM_ATTACH_GETSPACE) {
+		cpumask_t this_cpu;
+
+		this_cpu = cpumask_of_cpu(raw_smp_processor_id());
+
+		if (!cpus_equal(current->cpus_allowed, this_cpu))
+			return -EINVAL;
+
+		prot_flags = PROT_READ;
+	}
+
+	ap_tg = xpmem_tg_ref_by_apid(apid);
+	if (IS_ERR(ap_tg))
+		return PTR_ERR(ap_tg);
+
+	ap = xpmem_ap_ref_by_apid(ap_tg, apid);
+	if (IS_ERR(ap)) {
+		ret = PTR_ERR(ap);
+		goto out_1;
+	}
+
+	seg = ap->seg;
+	xpmem_seg_ref(seg);
+	seg_tg = seg->tg;
+	xpmem_tg_ref(seg_tg);
+
+	ret = xpmem_seg_down_read(seg_tg, seg, 0, 1);
+	if (ret != 0)
+		goto out_2;
+
+	seg_vaddr = xpmem_get_seg_vaddr(ap, offset, size, XPMEM_RDWR);
+	if (IS_ERR_VALUE(seg_vaddr)) {
+		ret = seg_vaddr;
+		goto out_3;
+	}
+
+	/*
+	 * Ensure thread is not attempting to attach its own memory on top
+	 * of itself (i.e. ensure the destination vaddr range doesn't overlap
+	 * the source vaddr range).
+	 */
+	if (current->tgid == seg_tg->tgid &&
+	    vaddr && (vaddr + size > seg_vaddr) && (vaddr < seg_vaddr + size)) {
+		ret = -EINVAL;
+		goto out_3;
+	}
+
+	/* source segment resides on this partition */
+	down_read(&seg_tg->mm->mmap_sem);
+	seg_vma = find_vma(seg_tg->mm, seg_vaddr);
+	if (seg_vma && seg_vma->vm_start <= seg_vaddr)
+		vm_pfnmap = (seg_vma->vm_flags & VM_PFNMAP);
+	up_read(&seg_tg->mm->mmap_sem);
+
+	/* create new attach structure */
+	att = kzalloc(sizeof(struct xpmem_attachment), GFP_KERNEL);
+	if (att == NULL) {
+		ret = -ENOMEM;
+		goto out_3;
+	}
+
+	mutex_init(&att->mutex);
+	att->offset = offset;
+	att->at_size = size;
+	att->flags |= (att_flags | XPMEM_FLAG_CREATING);
+	att->ap = ap;
+	INIT_LIST_HEAD(&att->att_list);
+	att->mm = current->mm;
+	init_waitqueue_head(&att->destroyed_wq);
+
+	xpmem_att_not_destroyable(att);
+	xpmem_att_ref(att);
+
+	/* must lock mmap_sem before att's sema to prevent deadlock */
+	down_write(&current->mm->mmap_sem);
+	mutex_lock(&att->mutex);	/* this will never block */
+
+	/* link attach structure to its access permit's att list */
+	spin_lock(&ap->lock);
+	list_add_tail(&att->att_list, &ap->att_list);
+	if (ap->flags & XPMEM_FLAG_DESTROYING) {
+		spin_unlock(&ap->lock);
+		ret = -ENOENT;
+		goto out_4;
+	}
+	spin_unlock(&ap->lock);
+
+	flags = MAP_SHARED;
+	if (vaddr)
+		flags |= MAP_FIXED;
+
+	/* check if a segment is already attached in the requested area */
+	if (flags & MAP_FIXED) {
+		struct vm_area_struct *existing_vma;
+
+		existing_vma = find_vma_intersection(current->mm, vaddr,
+						     vaddr + size);
+		if (existing_vma && xpmem_is_vm_ops_set(existing_vma)) {
+			ret = -ENOMEM;
+			goto out_4;
+		}
+	}
+
+	at_vaddr = do_mmap(file, vaddr, size, prot_flags, flags, offset);
+	if (IS_ERR_VALUE(at_vaddr)) {
+		ret = at_vaddr;
+		goto out_4;
+	}
+	att->at_vaddr = at_vaddr;
+	att->flags &= ~XPMEM_FLAG_CREATING;
+
+	vma = find_vma(current->mm, at_vaddr);
+	vma->vm_private_data = att;
+	vma->vm_flags |=
+	    VM_DONTCOPY | VM_RESERVED | VM_IO | VM_DONTEXPAND | vm_pfnmap;
+	if (vma->vm_flags & VM_PFNMAP) {
+		vma->vm_ops = &xpmem_vm_ops_nopfn;
+		vma->vm_flags &= ~VM_CAN_NONLINEAR;
+	}
+
+	*at_vaddr_p = at_vaddr;
+
+out_4:
+	if (ret != 0) {
+		xpmem_att_set_destroying(att);
+		spin_lock(&ap->lock);
+		list_del_init(&att->att_list);
+		spin_unlock(&ap->lock);
+		xpmem_att_set_destroyed(att);
+		xpmem_att_destroyable(att);
+	}
+	mutex_unlock(&att->mutex);
+	up_write(&current->mm->mmap_sem);
+	xpmem_att_deref(att);
+out_3:
+	xpmem_seg_up_read(seg_tg, seg, 0);
+out_2:
+	xpmem_seg_deref(seg);
+	xpmem_tg_deref(seg_tg);
+	xpmem_ap_deref(ap);
+out_1:
+	xpmem_tg_deref(ap_tg);
+	return ret;
+}
+
+/*
+ * Detach an attached XPMEM address segment.
+ */
+int
+xpmem_detach(u64 at_vaddr)
+{
+	int ret = 0;
+	struct xpmem_access_permit *ap;
+	struct xpmem_attachment *att;
+	struct vm_area_struct *vma;
+	sigset_t oldset;
+
+	down_write(&current->mm->mmap_sem);
+
+	/* find the corresponding vma */
+	vma = find_vma(current->mm, at_vaddr);
+	if (!vma || vma->vm_start > at_vaddr) {
+		ret = -ENOENT;
+		goto out_1;
+	}
+
+	att = vma->vm_private_data;
+	if (!xpmem_is_vm_ops_set(vma) || att == NULL) {
+		ret = -EINVAL;
+		goto out_1;
+	}
+	xpmem_att_ref(att);
+
+	xpmem_block_nonfatal_signals(&oldset);
+	if (mutex_lock_interruptible(&att->mutex)) {
+		xpmem_unblock_nonfatal_signals(&oldset);
+		ret = -EINTR;
+		goto out_2;
+	}
+	xpmem_unblock_nonfatal_signals(&oldset);
+
+	if (att->flags & XPMEM_FLAG_DESTROYING)
+		goto out_3;
+	xpmem_att_set_destroying(att);
+
+	ap = att->ap;
+	xpmem_ap_ref(ap);
+
+	if (current->tgid != ap->tg->tgid) {
+		xpmem_att_clear_destroying(att);
+		ret = -EACCES;
+		goto out_4;
+	}
+
+	vma->vm_private_data = NULL;
+
+	ret = do_munmap(current->mm, vma->vm_start, att->at_size);
+	DBUG_ON(ret != 0);
+
+	att->flags &= ~XPMEM_FLAG_VALIDPTES;
+
+	spin_lock(&ap->lock);
+	list_del_init(&att->att_list);
+	spin_unlock(&ap->lock);
+
+	xpmem_att_set_destroyed(att);
+	xpmem_att_destroyable(att);
+
+out_4:
+	xpmem_ap_deref(ap);
+out_3:
+	mutex_unlock(&att->mutex);
+out_2:
+	xpmem_att_deref(att);
+out_1:
+	up_write(&current->mm->mmap_sem);
+	return ret;
+}
+
+/*
+ * Detach an attached XPMEM address segment. This is functionally identical
+ * to xpmem_detach(). It is called when ap and att are known.
+ */
+void
+xpmem_detach_att(struct xpmem_access_permit *ap, struct xpmem_attachment *att)
+{
+	struct vm_area_struct *vma;
+	int ret;
+
+	/* must lock mmap_sem before att's sema to prevent deadlock */
+	down_write(&att->mm->mmap_sem);
+	mutex_lock(&att->mutex);
+
+	if (att->flags & XPMEM_FLAG_DESTROYING)
+		goto out;
+
+	xpmem_att_set_destroying(att);
+
+	/* find the corresponding vma */
+	vma = find_vma(att->mm, att->at_vaddr);
+	if (!vma || vma->vm_start > att->at_vaddr)
+		goto out;
+
+	DBUG_ON(!xpmem_is_vm_ops_set(vma));
+	DBUG_ON((vma->vm_end - vma->vm_start) != att->at_size);
+	DBUG_ON(vma->vm_private_data != att);
+
+	vma->vm_private_data = NULL;
+
+	if (!(current->flags & PF_EXITING)) {
+		ret = do_munmap(att->mm, vma->vm_start, att->at_size);
+		DBUG_ON(ret != 0);
+	}
+
+	att->flags &= ~XPMEM_FLAG_VALIDPTES;
+
+	spin_lock(&ap->lock);
+	list_del_init(&att->att_list);
+	spin_unlock(&ap->lock);
+
+	xpmem_att_set_destroyed(att);
+	xpmem_att_destroyable(att);
+
+out:
+	mutex_unlock(&att->mutex);
+	up_write(&att->mm->mmap_sem);
+}
+
+/*
+ * Clear all of the PTEs associated with the specified attachment.
+ */
+static void
+xpmem_clear_PTEs_of_att(struct xpmem_attachment *att, u64 vaddr, size_t size)
+{
+	if (att->flags & XPMEM_FLAG_DESTROYING)
+		xpmem_att_wait_destroyed(att);
+
+	if (att->flags & XPMEM_FLAG_DESTROYED)
+		return;
+
+	/* must lock mmap_sem before att's sema to prevent deadlock */
+	down_read(&att->mm->mmap_sem);
+	mutex_lock(&att->mutex);
+
+	/*
+	 * The att may have been detached before the down() succeeded.
+	 * If not, clear kernel PTEs, flush TLBs, etc.
+	 */
+	if (att->flags & XPMEM_FLAG_VALIDPTES) {
+		struct vm_area_struct *vma;
+
+		vma = find_vma(att->mm, vaddr);
+		zap_page_range(vma, vaddr, size, NULL);
+		att->flags &= ~XPMEM_FLAG_VALIDPTES;
+	}
+
+	mutex_unlock(&att->mutex);
+	up_read(&att->mm->mmap_sem);
+}
+
+/*
+ * Clear all of the PTEs associated with all attachments related to the
+ * specified access permit.
+ */
+static void
+xpmem_clear_PTEs_of_ap(struct xpmem_access_permit *ap, u64 seg_offset,
+		       size_t size)
+{
+	struct xpmem_attachment *att;
+	u64 t_vaddr;
+	size_t t_size;
+
+	spin_lock(&ap->lock);
+	list_for_each_entry(att, &ap->att_list, att_list) {
+		if (!(att->flags & XPMEM_FLAG_VALIDPTES))
+			continue;
+
+		t_vaddr = att->at_vaddr + seg_offset - att->offset;
+		t_size = size;
+		if (!xpmem_get_overlapping_range(att->at_vaddr, att->at_size,
+		    &t_vaddr, &t_size))
+			continue;
+
+		xpmem_att_ref(att);  /* don't care if XPMEM_FLAG_DESTROYING */
+		spin_unlock(&ap->lock);
+
+		xpmem_clear_PTEs_of_att(att, t_vaddr, t_size);
+
+		spin_lock(&ap->lock);
+		if (list_empty(&att->att_list)) {
+			/* att was deleted from ap->att_list, start over */
+			xpmem_att_deref(att);
+			att = list_entry(&ap->att_list, struct xpmem_attachment,
+					 att_list);
+		} else
+			xpmem_att_deref(att);
+	}
+	spin_unlock(&ap->lock);
+}
+
+/*
+ * Clear all of the PTEs associated with all attaches to the specified segment.
+ */
+void
+xpmem_clear_PTEs(struct xpmem_segment *seg, u64 vaddr, size_t size)
+{
+	struct xpmem_access_permit *ap;
+	u64 seg_offset = vaddr - seg->vaddr;
+
+	spin_lock(&seg->lock);
+	list_for_each_entry(ap, &seg->ap_list, ap_list) {
+		xpmem_ap_ref(ap);  /* don't care if XPMEM_FLAG_DESTROYING */
+		spin_unlock(&seg->lock);
+
+		xpmem_clear_PTEs_of_ap(ap, seg_offset, size);
+
+		spin_lock(&seg->lock);
+		if (list_empty(&ap->ap_list)) {
+			/* ap was deleted from seg->ap_list, start over */
+			xpmem_ap_deref(ap);
+			ap = list_entry(&seg->ap_list,
+					 struct xpmem_access_permit, ap_list);
+		} else
+			xpmem_ap_deref(ap);
+	}
+	spin_unlock(&seg->lock);
+}
Index: emm_notifier_xpmem_v1/drivers/misc/xp/xpmem_get.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ emm_notifier_xpmem_v1/drivers/misc/xp/xpmem_get.c	2008-04-01 10:42:33.189780844 -0500
@@ -0,0 +1,343 @@
+/*
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License.  See the file "COPYING" in the main directory of this archive
+ * for more details.
+ *
+ * Copyright (c) 2004-2007 Silicon Graphics, Inc.  All Rights Reserved.
+ */
+
+/*
+ * Cross Partition Memory (XPMEM) get access support.
+ */
+
+#include <linux/err.h>
+#include <linux/mm.h>
+#include <linux/stat.h>
+#include "xpmem.h"
+#include "xpmem_private.h"
+
+/*
+ * This is the kernel's IPC permission checking function without calls to
+ * do any extra security checks. See ipc/util.c for the original source.
+ */
+static int
+xpmem_ipcperms(struct kern_ipc_perm *ipcp, short flag)
+{
+	int requested_mode;
+	int granted_mode;
+
+	requested_mode = (flag >> 6) | (flag >> 3) | flag;
+	granted_mode = ipcp->mode;
+	if (current->euid == ipcp->cuid || current->euid == ipcp->uid)
+		granted_mode >>= 6;
+	else if (in_group_p(ipcp->cgid) || in_group_p(ipcp->gid))
+		granted_mode >>= 3;
+	/* is there some bit set in requested_mode but not in granted_mode? */
+	if ((requested_mode & ~granted_mode & 0007) && !capable(CAP_IPC_OWNER))
+		return -1;
+
+	return 0;
+}
+
+/*
+ * Ensure that the user is actually allowed to access the segment.
+ */
+static int
+xpmem_check_permit_mode(int flags, struct xpmem_segment *seg)
+{
+	struct kern_ipc_perm perm;
+	int ret;
+
+	DBUG_ON(seg->permit_type != XPMEM_PERMIT_MODE);
+
+	memset(&perm, 0, sizeof(struct kern_ipc_perm));
+	perm.uid = seg->tg->uid;
+	perm.gid = seg->tg->gid;
+	perm.cuid = seg->tg->uid;
+	perm.cgid = seg->tg->gid;
+	perm.mode = (u64)seg->permit_value;
+
+	ret = xpmem_ipcperms(&perm, S_IRUSR);
+	if (ret == 0 && (flags & XPMEM_RDWR))
+		ret = xpmem_ipcperms(&perm, S_IWUSR);
+
+	return ret;
+}
+
+/*
+ * Create a new and unique apid.
+ */
+static __s64
+xpmem_make_apid(struct xpmem_thread_group *ap_tg)
+{
+	struct xpmem_id apid;
+	__s64 *apid_p = (__s64 *)&apid;
+	int uniq;
+
+	DBUG_ON(sizeof(struct xpmem_id) != sizeof(__s64));
+	DBUG_ON(ap_tg->partid < 0 || ap_tg->partid >= XP_MAX_PARTITIONS);
+
+	uniq = atomic_inc_return(&ap_tg->uniq_apid);
+	if (uniq > XPMEM_MAX_UNIQ_ID) {
+		atomic_dec(&ap_tg->uniq_apid);
+		return -EBUSY;
+	}
+
+	apid.tgid = ap_tg->tgid;
+	apid.uniq = uniq;
+	apid.partid = ap_tg->partid;
+	return *apid_p;
+}
+
+/*
+ * Get permission to access a specified segid.
+ */
+int
+xpmem_get(__s64 segid, int flags, int permit_type, void *permit_value,
+	  __s64 *apid_p)
+{
+	__s64 apid;
+	struct xpmem_access_permit *ap;
+	struct xpmem_segment *seg;
+	struct xpmem_thread_group *ap_tg;
+	struct xpmem_thread_group *seg_tg;
+	int index;
+	int ret = 0;
+
+	if ((flags & ~(XPMEM_RDONLY | XPMEM_RDWR)) ||
+	    (flags & (XPMEM_RDONLY | XPMEM_RDWR)) ==
+	    (XPMEM_RDONLY | XPMEM_RDWR))
+		return -EINVAL;
+
+	if (permit_type != XPMEM_PERMIT_MODE || permit_value != NULL)
+		return -EINVAL;
+
+	ap_tg = xpmem_tg_ref_by_tgid(xpmem_my_part, current->tgid);
+	if (IS_ERR(ap_tg)) {
+		DBUG_ON(PTR_ERR(ap_tg) != -ENOENT);
+		return -XPMEM_ERRNO_NOPROC;
+	}
+
+	seg_tg = xpmem_tg_ref_by_segid(segid);
+	if (IS_ERR(seg_tg)) {
+		if (PTR_ERR(seg_tg) != -EREMOTE) {
+			ret = PTR_ERR(seg_tg);
+			goto out_1;
+		}
+
+		ret = -ENOENT;
+		goto out_1;
+	} else {
+		seg = xpmem_seg_ref_by_segid(seg_tg, segid);
+		if (IS_ERR(seg)) {
+			if (PTR_ERR(seg) != -EREMOTE) {
+				ret = PTR_ERR(seg);
+				goto out_2;
+			}
+			ret = -ENOENT;
+			goto out_2;
+		} else {
+			/* wait for proxy seg's creation to be complete */
+			wait_event(seg->created_wq,
+				   ((!(seg->flags & XPMEM_FLAG_CREATING)) ||
+				    (seg->flags & XPMEM_FLAG_DESTROYING)));
+			if (seg->flags & XPMEM_FLAG_DESTROYING) {
+				ret = -ENOENT;
+				goto out_3;
+			}
+		}
+	}
+
+	/* assuming XPMEM_PERMIT_MODE, do the appropriate permission check */
+	if (xpmem_check_permit_mode(flags, seg) != 0) {
+		ret = -EACCES;
+		goto out_3;
+	}
+
+	/* create a new xpmem_access_permit structure with a unique apid */
+
+	apid = xpmem_make_apid(ap_tg);
+	if (apid < 0) {
+		ret = apid;
+		goto out_3;
+	}
+
+	ap = kzalloc(sizeof(struct xpmem_access_permit), GFP_KERNEL);
+	if (ap == NULL) {
+		ret = -ENOMEM;
+		goto out_3;
+	}
+
+	spin_lock_init(&ap->lock);
+	ap->seg = seg;
+	ap->tg = ap_tg;
+	ap->apid = apid;
+	ap->mode = flags;
+	INIT_LIST_HEAD(&ap->att_list);
+	INIT_LIST_HEAD(&ap->ap_list);
+	INIT_LIST_HEAD(&ap->ap_hashlist);
+
+	xpmem_ap_not_destroyable(ap);
+
+	/* add ap to its seg's access permit list */
+	spin_lock(&seg->lock);
+	list_add_tail(&ap->ap_list, &seg->ap_list);
+	spin_unlock(&seg->lock);
+
+	/* add ap to its hash list */
+	index = xpmem_ap_hashtable_index(ap->apid);
+	write_lock(&ap_tg->ap_hashtable[index].lock);
+	list_add_tail(&ap->ap_hashlist, &ap_tg->ap_hashtable[index].list);
+	write_unlock(&ap_tg->ap_hashtable[index].lock);
+
+	*apid_p = apid;
+
+	/*
+	 * The following two derefs aren't being done at this time in order
+	 * to prevent the seg and seg_tg structures from being prematurely
+	 * kfree'd as long as the potential for them to be referenced via
+	 * this ap structure exists.
+	 *
+	 *      xpmem_seg_deref(seg);
+	 *      xpmem_tg_deref(seg_tg);
+	 *
+	 * These two derefs will be done by xpmem_release_ap() at the time
+	 * this ap structure is destroyed.
+	 */
+	goto out_1;
+
+out_3:
+	xpmem_seg_deref(seg);
+out_2:
+	xpmem_tg_deref(seg_tg);
+out_1:
+	xpmem_tg_deref(ap_tg);
+	return ret;
+}
+
+/*
+ * Release an access permit and detach all associated attaches.
+ */
+static void
+xpmem_release_ap(struct xpmem_thread_group *ap_tg,
+		  struct xpmem_access_permit *ap)
+{
+	int index;
+	struct xpmem_thread_group *seg_tg;
+	struct xpmem_attachment *att;
+	struct xpmem_segment *seg;
+
+	spin_lock(&ap->lock);
+	if (ap->flags & XPMEM_FLAG_DESTROYING) {
+		spin_unlock(&ap->lock);
+		return;
+	}
+	ap->flags |= XPMEM_FLAG_DESTROYING;
+
+	/* deal with all attaches first */
+	while (!list_empty(&ap->att_list)) {
+		att = list_entry((&ap->att_list)->next, struct xpmem_attachment,
+				 att_list);
+		xpmem_att_ref(att);
+		spin_unlock(&ap->lock);
+		xpmem_detach_att(ap, att);
+		DBUG_ON(atomic_read(&att->mm->mm_users) <= 0);
+		DBUG_ON(atomic_read(&att->mm->mm_count) <= 0);
+		xpmem_att_deref(att);
+		spin_lock(&ap->lock);
+	}
+	ap->flags |= XPMEM_FLAG_DESTROYED;
+	spin_unlock(&ap->lock);
+
+	/*
+	 * Remove access structure from its hash list.
+	 * This is done after the xpmem_detach_att to prevent any racing
+	 * thread from looking up access permits for the owning thread group
+	 * and not finding anything, assuming everything is clean, and
+	 * freeing the mm before xpmem_detach_att has a chance to
+	 * use it.
+	 */
+	index = xpmem_ap_hashtable_index(ap->apid);
+	write_lock(&ap_tg->ap_hashtable[index].lock);
+	list_del_init(&ap->ap_hashlist);
+	write_unlock(&ap_tg->ap_hashtable[index].lock);
+
+	/* the ap's seg and the seg's tg were ref'd in xpmem_get() */
+	seg = ap->seg;
+	seg_tg = seg->tg;
+
+	/* remove ap from its seg's access permit list */
+	spin_lock(&seg->lock);
+	list_del_init(&ap->ap_list);
+	spin_unlock(&seg->lock);
+
+	xpmem_seg_deref(seg);	/* deref of xpmem_get()'s ref */
+	xpmem_tg_deref(seg_tg);	/* deref of xpmem_get()'s ref */
+
+	xpmem_ap_destroyable(ap);
+}
+
+/*
+ * Release all access permits and detach all associated attaches for the given
+ * thread group.
+ */
+void
+xpmem_release_aps_of_tg(struct xpmem_thread_group *ap_tg)
+{
+	struct xpmem_hashlist *hashlist;
+	struct xpmem_access_permit *ap;
+	int index;
+
+	for (index = 0; index < XPMEM_AP_HASHTABLE_SIZE; index++) {
+		hashlist = &ap_tg->ap_hashtable[index];
+
+		read_lock(&hashlist->lock);
+		while (!list_empty(&hashlist->list)) {
+			ap = list_entry((&hashlist->list)->next,
+					struct xpmem_access_permit,
+					ap_hashlist);
+			xpmem_ap_ref(ap);
+			read_unlock(&hashlist->lock);
+
+			xpmem_release_ap(ap_tg, ap);
+
+			xpmem_ap_deref(ap);
+			read_lock(&hashlist->lock);
+		}
+		read_unlock(&hashlist->lock);
+	}
+}
+
+/*
+ * Release an access permit for a XPMEM address segment.
+ */
+int
+xpmem_release(__s64 apid)
+{
+	struct xpmem_thread_group *ap_tg;
+	struct xpmem_access_permit *ap;
+	int ret = 0;
+
+	ap_tg = xpmem_tg_ref_by_apid(apid);
+	if (IS_ERR(ap_tg))
+		return PTR_ERR(ap_tg);
+
+	if (current->tgid != ap_tg->tgid) {
+		ret = -EACCES;
+		goto out;
+	}
+
+	ap = xpmem_ap_ref_by_apid(ap_tg, apid);
+	if (IS_ERR(ap)) {
+		ret = PTR_ERR(ap);
+		goto out;
+	}
+	DBUG_ON(ap->tg != ap_tg);
+
+	xpmem_release_ap(ap_tg, ap);
+
+	xpmem_ap_deref(ap);
+out:
+	xpmem_tg_deref(ap_tg);
+	return ret;
+}
Index: emm_notifier_xpmem_v1/drivers/misc/xp/xpmem_main.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ emm_notifier_xpmem_v1/drivers/misc/xp/xpmem_main.c	2008-04-01 10:42:33.065765549 -0500
@@ -0,0 +1,440 @@
+/*
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License.  See the file "COPYING" in the main directory of this archive
+ * for more details.
+ *
+ * Copyright (c) 2004-2007 Silicon Graphics, Inc.  All Rights Reserved.
+ */
+
+/*
+ * Cross Partition Memory (XPMEM) support.
+ *
+ * This module (along with a corresponding library) provides support for
+ * cross-partition shared memory between threads.
+ *
+ * Caveats
+ *
+ *   * XPMEM cannot allocate VM_IO pages on behalf of another thread group
+ *     since get_user_pages() doesn't handle VM_IO pages. This is normally
+ *     valid if a thread group attaches a portion of an address space and is
+ *     the first to touch that portion. In addition, any pages which come from
+ *     the "low granule" such as fetchops, pages for cross-coherence
+ *     write-combining, etc. also are impossible since the kernel will try
+ *     to find a struct page which will not exist.
+ */
+
+#include <linux/device.h>
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/miscdevice.h>
+#include <linux/mm.h>
+#include <linux/file.h>
+#include <linux/err.h>
+#include <linux/proc_fs.h>
+#include <linux/uaccess.h>
+#include "xpmem.h"
+#include "xpmem_private.h"
+
+/* define the XPMEM debug device structure to be used with dev_dbg() et al */
+
+static struct device_driver xpmem_dbg_name = {
+	.name = "xpmem"
+};
+
+static struct device xpmem_dbg_subname = {
+	.bus_id = {0},		/* set to "" */
+	.driver = &xpmem_dbg_name
+};
+
+struct device *xpmem = &xpmem_dbg_subname;
+
+/* array of partitions indexed by partid */
+struct xpmem_partition *xpmem_partitions;
+
+struct xpmem_partition *xpmem_my_part;	/* pointer to this partition */
+short xpmem_my_partid;		/* this partition's ID */
+
+/*
+ * User open of the XPMEM driver. Called whenever /dev/xpmem is opened.
+ * Create a struct xpmem_thread_group structure for the specified thread group.
+ * And add the structure to the tg hash table.
+ */
+static int
+xpmem_open(struct inode *inode, struct file *file)
+{
+	struct xpmem_thread_group *tg;
+	int index;
+#ifdef CONFIG_PROC_FS
+	struct proc_dir_entry *unpin_entry;
+	char tgid_string[XPMEM_TGID_STRING_LEN];
+#endif /* CONFIG_PROC_FS */
+
+	/* if this has already been done, just return silently */
+	tg = xpmem_tg_ref_by_tgid(xpmem_my_part, current->tgid);
+	if (!IS_ERR(tg)) {
+		xpmem_tg_deref(tg);
+		return 0;
+	}
+
+	/* create tg */
+	tg = kzalloc(sizeof(struct xpmem_thread_group), GFP_KERNEL);
+	if (tg == NULL)
+		return -ENOMEM;
+
+	spin_lock_init(&tg->lock);
+	tg->partid = xpmem_my_partid;
+	tg->tgid = current->tgid;
+	tg->uid = current->uid;
+	tg->gid = current->gid;
+	atomic_set(&tg->uniq_segid, 0);
+	atomic_set(&tg->uniq_apid, 0);
+	atomic_set(&tg->n_pinned, 0);
+	tg->addr_limit = TASK_SIZE;
+	tg->seg_list_lock = RW_LOCK_UNLOCKED;
+	INIT_LIST_HEAD(&tg->seg_list);
+	INIT_LIST_HEAD(&tg->tg_hashlist);
+	atomic_set(&tg->n_recall_PFNs, 0);
+	mutex_init(&tg->recall_PFNs_mutex);
+	init_waitqueue_head(&tg->block_recall_PFNs_wq);
+	init_waitqueue_head(&tg->allow_recall_PFNs_wq);
+	tg->emm_notifier.callback = &xpmem_emm_notifier_callback;
+	spin_lock_init(&tg->page_requests_lock);
+	INIT_LIST_HEAD(&tg->page_requests);
+
+	/* create and initialize struct xpmem_access_permit hashtable */
+	tg->ap_hashtable = kzalloc(sizeof(struct xpmem_hashlist) *
+				     XPMEM_AP_HASHTABLE_SIZE, GFP_KERNEL);
+	if (tg->ap_hashtable == NULL) {
+		kfree(tg);
+		return -ENOMEM;
+	}
+	for (index = 0; index < XPMEM_AP_HASHTABLE_SIZE; index++) {
+		tg->ap_hashtable[index].lock = RW_LOCK_UNLOCKED;
+		INIT_LIST_HEAD(&tg->ap_hashtable[index].list);
+	}
+
+#ifdef CONFIG_PROC_FS
+	snprintf(tgid_string, XPMEM_TGID_STRING_LEN, "%d", current->tgid);
+	spin_lock(&xpmem_unpin_procfs_lock);
+	unpin_entry = create_proc_entry(tgid_string, 0644,
+					xpmem_unpin_procfs_dir);
+	spin_unlock(&xpmem_unpin_procfs_lock);
+	if (unpin_entry != NULL) {
+		unpin_entry->data = (void *)(unsigned long)current->tgid;
+		unpin_entry->write_proc = xpmem_unpin_procfs_write;
+		unpin_entry->read_proc = xpmem_unpin_procfs_read;
+		unpin_entry->owner = THIS_MODULE;
+		unpin_entry->uid = current->uid;
+		unpin_entry->gid = current->gid;
+	}
+#endif /* CONFIG_PROC_FS */
+
+	xpmem_tg_not_destroyable(tg);
+
+	/* add tg to its hash list */
+	index = xpmem_tg_hashtable_index(tg->tgid);
+	write_lock(&xpmem_my_part->tg_hashtable[index].lock);
+	list_add_tail(&tg->tg_hashlist,
+		      &xpmem_my_part->tg_hashtable[index].list);
+	write_unlock(&xpmem_my_part->tg_hashtable[index].lock);
+
+	/*
+	 * Increment 'mm->mm_users' for the current task's thread group leader.
+	 * This ensures that its mm_struct will still be around when our
+	 * thread group exits. (The Linux kernel normally tears down the
+	 * mm_struct prior to calling a module's 'flush' function.) Since all
+	 * XPMEM thread groups must go through this path, this extra reference
+	 * to mm_users also allows us to directly inc/dec mm_users in
+	 * xpmem_ensure_valid_PFNs() and avoid mmput() which has a scaling
+	 * issue with the mmlist_lock. Being a thread group leader guarantees
+	 * that the thread group leader's task_struct will still be around.
+	 */
+//>>> with the mm_users being bumped here do we even need to inc/dec mm_users
+//>>> in xpmem_ensure_valid_PFNs()?
+//>>>	get_task_struct(current->group_leader);
+	tg->group_leader = current->group_leader;
+
+	BUG_ON(current->mm != current->group_leader->mm);
+//>>>	atomic_inc(&current->group_leader->mm->mm_users);
+	tg->mm = current->group_leader->mm;
+
+	return 0;
+}
+
+/*
+ * The following function gets called whenever a thread group that has opened
+ * /dev/xpmem closes it.
+ */
+static int
+//>>> do we get rid of this function???
+xpmem_flush(struct file *file, fl_owner_t owner)
+{
+	struct xpmem_thread_group *tg;
+	int index;
+
+	tg = xpmem_tg_ref_by_tgid(xpmem_my_part, current->tgid);
+	if (IS_ERR(tg))
+		return 0;  /* probably child process who inherited fd */
+
+	spin_lock(&tg->lock);
+	if (tg->flags & XPMEM_FLAG_DESTROYING) {
+		spin_unlock(&tg->lock);
+		xpmem_tg_deref(tg);
+		return -EALREADY;
+	}
+	tg->flags |= XPMEM_FLAG_DESTROYING;
+	spin_unlock(&tg->lock);
+
+	xpmem_release_aps_of_tg(tg);
+	xpmem_remove_segs_of_tg(tg);
+
+	/*
+	 * At this point, XPMEM no longer needs to reference the thread group
+	 * leader's mm_struct. Decrement its 'mm->mm_users' to account for the
+	 * extra increment previously done in xpmem_open().
+	 */
+//>>>	mmput(tg->mm);
+//>>>	put_task_struct(tg->group_leader);
+
+	/* Remove tg structure from its hash list */
+	index = xpmem_tg_hashtable_index(tg->tgid);
+	write_lock(&xpmem_my_part->tg_hashtable[index].lock);
+	list_del_init(&tg->tg_hashlist);
+	write_unlock(&xpmem_my_part->tg_hashtable[index].lock);
+
+	xpmem_tg_destroyable(tg);
+	xpmem_tg_deref(tg);
+
+	return 0;
+}
+
+/*
+ * User ioctl to the XPMEM driver. Only 64-bit user applications are
+ * supported.
+ */
+static long
+xpmem_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+	struct xpmem_cmd_make make_info;
+	struct xpmem_cmd_remove remove_info;
+	struct xpmem_cmd_get get_info;
+	struct xpmem_cmd_release release_info;
+	struct xpmem_cmd_attach attach_info;
+	struct xpmem_cmd_detach detach_info;
+	__s64 segid;
+	__s64 apid;
+	u64 at_vaddr;
+	long ret;
+
+	switch (cmd) {
+	case XPMEM_CMD_VERSION:
+		return XPMEM_CURRENT_VERSION;
+
+	case XPMEM_CMD_MAKE:
+		if (copy_from_user(&make_info, (void __user *)arg,
+				   sizeof(struct xpmem_cmd_make)))
+			return -EFAULT;
+
+		ret = xpmem_make(make_info.vaddr, make_info.size,
+				 make_info.permit_type,
+				 (void *)make_info.permit_value, &segid);
+		if (ret != 0)
+			return ret;
+
+		if (put_user(segid,
+			     &((struct xpmem_cmd_make __user *)arg)->segid)) {
+			(void)xpmem_remove(segid);
+			return -EFAULT;
+		}
+		return 0;
+
+	case XPMEM_CMD_REMOVE:
+		if (copy_from_user(&remove_info, (void __user *)arg,
+				   sizeof(struct xpmem_cmd_remove)))
+			return -EFAULT;
+
+		return xpmem_remove(remove_info.segid);
+
+	case XPMEM_CMD_GET:
+		if (copy_from_user(&get_info, (void __user *)arg,
+				   sizeof(struct xpmem_cmd_get)))
+			return -EFAULT;
+
+		ret = xpmem_get(get_info.segid, get_info.flags,
+				get_info.permit_type,
+				(void *)get_info.permit_value, &apid);
+		if (ret != 0)
+			return ret;
+
+		if (put_user(apid,
+			     &((struct xpmem_cmd_get __user *)arg)->apid)) {
+			(void)xpmem_release(apid);
+			return -EFAULT;
+		}
+		return 0;
+
+	case XPMEM_CMD_RELEASE:
+		if (copy_from_user(&release_info, (void __user *)arg,
+				   sizeof(struct xpmem_cmd_release)))
+			return -EFAULT;
+
+		return xpmem_release(release_info.apid);
+
+	case XPMEM_CMD_ATTACH:
+		if (copy_from_user(&attach_info, (void __user *)arg,
+				   sizeof(struct xpmem_cmd_attach)))
+			return -EFAULT;
+
+		ret = xpmem_attach(file, attach_info.apid, attach_info.offset,
+				   attach_info.size, attach_info.vaddr,
+				   attach_info.fd, attach_info.flags,
+				   &at_vaddr);
+		if (ret != 0)
+			return ret;
+
+		if (put_user(at_vaddr,
+			     &((struct xpmem_cmd_attach __user *)arg)->vaddr)) {
+			(void)xpmem_detach(at_vaddr);
+			return -EFAULT;
+		}
+		return 0;
+
+	case XPMEM_CMD_DETACH:
+		if (copy_from_user(&detach_info, (void __user *)arg,
+				   sizeof(struct xpmem_cmd_detach)))
+			return -EFAULT;
+
+		return xpmem_detach(detach_info.vaddr);
+
+	default:
+		break;
+	}
+	return -ENOIOCTLCMD;
+}
+
+static const struct file_operations xpmem_fops = {
+	.owner = THIS_MODULE,
+	.open = xpmem_open,
+	.flush = xpmem_flush,
+	.unlocked_ioctl = xpmem_ioctl,
+	.mmap = xpmem_mmap
+};
+
+static struct miscdevice xpmem_dev_handle = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = XPMEM_MODULE_NAME,
+	.fops = &xpmem_fops
+};
+
+/*
+ * Initialize the XPMEM driver.
+ */
+int __init
+xpmem_init(void)
+{
+	int i;
+	int ret;
+	struct xpmem_hashlist *hashtable;
+
+	xpmem_my_partid = sn_partition_id;
+	if (xpmem_my_partid >= XP_MAX_PARTITIONS) {
+		dev_err(xpmem, "invalid partition ID, XPMEM driver failed to "
+			"initialize\n");
+		return -EINVAL;
+	}
+
+	/* create and initialize struct xpmem_partition array */
+	xpmem_partitions = kzalloc(sizeof(struct xpmem_partition) *
+				   XP_MAX_PARTITIONS, GFP_KERNEL);
+	if (xpmem_partitions == NULL)
+		return -ENOMEM;
+
+	xpmem_my_part = &xpmem_partitions[xpmem_my_partid];
+	for (i = 0; i < XP_MAX_PARTITIONS; i++) {
+		xpmem_partitions[i].flags |=
+		    (XPMEM_FLAG_UNINITIALIZED | XPMEM_FLAG_DOWN);
+		spin_lock_init(&xpmem_partitions[i].lock);
+		xpmem_partitions[i].version = -1;
+		xpmem_partitions[i].coherence_id = -1;
+		atomic_set(&xpmem_partitions[i].n_threads, 0);
+		init_waitqueue_head(&xpmem_partitions[i].thread_wq);
+	}
+
+#ifdef CONFIG_PROC_FS
+	/* create the /proc interface directory (/proc/xpmem) */
+	xpmem_unpin_procfs_dir = proc_mkdir(XPMEM_MODULE_NAME, NULL);
+	if (xpmem_unpin_procfs_dir == NULL) {
+		ret = -EBUSY;
+		goto out_1;
+	}
+	xpmem_unpin_procfs_dir->owner = THIS_MODULE;
+#endif /* CONFIG_PROC_FS */
+
+	/* create the XPMEM character device (/dev/xpmem) */
+	ret = misc_register(&xpmem_dev_handle);
+	if (ret != 0)
+		goto out_2;
+
+	hashtable = kzalloc(sizeof(struct xpmem_hashlist) *
+			    XPMEM_TG_HASHTABLE_SIZE, GFP_KERNEL);
+	if (hashtable == NULL) {
+		ret = -ENOMEM;
+		goto out_2;
+	}
+
+	for (i = 0; i < XPMEM_TG_HASHTABLE_SIZE; i++) {
+		rwlock_init(&hashtable[i].lock);
+		INIT_LIST_HEAD(&hashtable[i].list);
+	}
+
+	xpmem_my_part->tg_hashtable = hashtable;
+	xpmem_my_part->flags &= ~XPMEM_FLAG_UNINITIALIZED;
+	xpmem_my_part->version = XPMEM_CURRENT_VERSION;
+	xpmem_my_part->flags &= ~XPMEM_FLAG_DOWN;
+	xpmem_my_part->flags |= XPMEM_FLAG_UP;
+
+	dev_info(xpmem, "SGI XPMEM kernel module v%s loaded\n",
+		 XPMEM_CURRENT_VERSION_STRING);
+	return 0;
+
+	/* things didn't work out so well */
+out_2:
+#ifdef CONFIG_PROC_FS
+	remove_proc_entry(XPMEM_MODULE_NAME, NULL);
+#endif /* CONFIG_PROC_FS */
+out_1:
+	kfree(xpmem_partitions);
+	return ret;
+}
+
+/*
+ * Remove the XPMEM driver from the system.
+ */
+void __exit
+xpmem_exit(void)
+{
+	int i;
+
+	for (i = 0; i < XP_MAX_PARTITIONS; i++) {
+		if (!(xpmem_partitions[i].flags & XPMEM_FLAG_UNINITIALIZED))
+			kfree(xpmem_partitions[i].tg_hashtable);
+	}
+
+	kfree(xpmem_partitions);
+
+	misc_deregister(&xpmem_dev_handle);
+#ifdef CONFIG_PROC_FS
+	remove_proc_entry(XPMEM_MODULE_NAME, NULL);
+#endif /* CONFIG_PROC_FS */
+
+	dev_info(xpmem, "SGI XPMEM kernel module v%s unloaded\n",
+		 XPMEM_CURRENT_VERSION_STRING);
+}
+
+#ifdef EXPORT_NO_SYMBOLS
+EXPORT_NO_SYMBOLS;
+#endif
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Silicon Graphics, Inc.");
+MODULE_INFO(supported, "external");
+MODULE_DESCRIPTION("XPMEM support");
+module_init(xpmem_init);
+module_exit(xpmem_exit);
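
(Aside, not part of the patch: the expected user-space lifecycle against the
xpmem_open()/xpmem_flush()/xpmem_ioctl() entry points above is roughly the
minimal sketch below; error handling is omitted, and the XPMEM_CMD_* commands
are shown with the xpmem.h hunk further down.)

	#include <fcntl.h>
	#include <unistd.h>

	static void xpmem_session(void)
	{
		/* xpmem_open(): thread group state is set up for current->tgid */
		int fd = open("/dev/xpmem", O_RDWR);

		/* ... XPMEM_CMD_* ioctls against fd, dispatched by xpmem_ioctl() ... */

		/* close by the owning thread group runs xpmem_flush(): accesses
		 * are released and segments removed */
		close(fd);
	}
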
Index: emm_notifier_xpmem_v1/drivers/misc/xp/xpmem_make.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ emm_notifier_xpmem_v1/drivers/misc/xp/xpmem_make.c	2008-04-01 10:42:33.141774923 -0500
@@ -0,0 +1,249 @@
+/*
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License.  See the file "COPYING" in the main directory of this archive
+ * for more details.
+ *
+ * Copyright (c) 2004-2007 Silicon Graphics, Inc.  All Rights Reserved.
+ */
+
+/*
+ * Cross Partition Memory (XPMEM) make segment support.
+ */
+
+#include <linux/err.h>
+#include <linux/mm.h>
+#include "xpmem.h"
+#include "xpmem_private.h"
+
+/*
+ * Create a new and unique segid.
+ */
+static __s64
+xpmem_make_segid(struct xpmem_thread_group *seg_tg)
+{
+	struct xpmem_id segid;
+	__s64 *segid_p = (__s64 *)&segid;
+	int uniq;
+
+	DBUG_ON(sizeof(struct xpmem_id) != sizeof(__s64));
+	DBUG_ON(seg_tg->partid < 0 || seg_tg->partid >= XP_MAX_PARTITIONS);
+
+	uniq = atomic_inc_return(&seg_tg->uniq_segid);
+	if (uniq > XPMEM_MAX_UNIQ_ID) {
+		atomic_dec(&seg_tg->uniq_segid);
+		return -EBUSY;
+	}
+
+	segid.tgid = seg_tg->tgid;
+	segid.uniq = uniq;
+	segid.partid = seg_tg->partid;
+
+	DBUG_ON(*segid_p <= 0);
+	return *segid_p;
+}
+
+/*
+ * Make a segid and segment for the specified address segment.
+ */
+int
+xpmem_make(u64 vaddr, size_t size, int permit_type, void *permit_value,
+	   __s64 *segid_p)
+{
+	__s64 segid;
+	struct xpmem_thread_group *seg_tg;
+	struct xpmem_segment *seg;
+	int ret = 0;
+
+	if (permit_type != XPMEM_PERMIT_MODE ||
+	    ((u64)permit_value & ~00777) || size == 0)
+		return -EINVAL;
+
+	seg_tg = xpmem_tg_ref_by_tgid(xpmem_my_part, current->tgid);
+	if (IS_ERR(seg_tg)) {
+		DBUG_ON(PTR_ERR(seg_tg) != -ENOENT);
+		return -XPMEM_ERRNO_NOPROC;
+	}
+
+	if (vaddr + size > seg_tg->addr_limit) {
+		if (size != XPMEM_MAXADDR_SIZE) {
+			ret = -EINVAL;
+			goto out;
+		}
+		size = seg_tg->addr_limit - vaddr;
+	}
+
+	/*
+	 * The start of the segment must be page aligned and it must be a
+	 * multiple of pages in size.
+	 */
+	if (offset_in_page(vaddr) != 0 || offset_in_page(size) != 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	segid = xpmem_make_segid(seg_tg);
+	if (segid < 0) {
+		ret = segid;
+		goto out;
+	}
+
+	/* create a new struct xpmem_segment structure with a unique segid */
+	seg = kzalloc(sizeof(struct xpmem_segment), GFP_KERNEL);
+	if (seg == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	spin_lock_init(&seg->lock);
+	init_rwsem(&seg->sema);
+	seg->segid = segid;
+	seg->vaddr = vaddr;
+	seg->size = size;
+	seg->permit_type = permit_type;
+	seg->permit_value = permit_value;
+	init_waitqueue_head(&seg->created_wq);	/* only used for proxy seg */
+	init_waitqueue_head(&seg->destroyed_wq);
+	seg->tg = seg_tg;
+	INIT_LIST_HEAD(&seg->ap_list);
+	INIT_LIST_HEAD(&seg->seg_list);
+
+	/* allocate PFN table (level 4 only) */
+	mutex_init(&seg->PFNtable_mutex);
+	seg->PFNtable = kzalloc(XPMEM_PFNTABLE_L4SIZE * sizeof(u64 ***),
+				GFP_KERNEL);
+	if (seg->PFNtable == NULL) {
+		kfree(seg);
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	xpmem_seg_not_destroyable(seg);
+
+	/*
+	 * Add seg to its tg's list of segs and register the tg's emm_notifier
+	 * if there are no previously existing segs for this thread group.
+	 */
+	write_lock(&seg_tg->seg_list_lock);
+	if (list_empty(&seg_tg->seg_list))
+		emm_notifier_register(&seg_tg->emm_notifier, seg_tg->mm);
+	list_add_tail(&seg->seg_list, &seg_tg->seg_list);
+	write_unlock(&seg_tg->seg_list_lock);
+
+	*segid_p = segid;
+
+out:
+	xpmem_tg_deref(seg_tg);
+	return ret;
+}
+
+/*
+ * Remove a segment from the system.
+ */
+static int
+xpmem_remove_seg(struct xpmem_thread_group *seg_tg, struct xpmem_segment *seg)
+{
+	DBUG_ON(atomic_read(&seg->refcnt) <= 0);
+
+	/* see if the requesting thread is the segment's owner */
+	if (current->tgid != seg_tg->tgid)
+		return -EACCES;
+
+	spin_lock(&seg->lock);
+	if (seg->flags & XPMEM_FLAG_DESTROYING) {
+		spin_unlock(&seg->lock);
+		return 0;
+	}
+	seg->flags |= XPMEM_FLAG_DESTROYING;
+	spin_unlock(&seg->lock);
+
+	xpmem_seg_down_write(seg);
+
+	/* clear all PTEs for each local attach to this segment, if any */
+	xpmem_clear_PTEs(seg, seg->vaddr, seg->size);
+
+	/* clear the seg's PFN table and unpin pages */
+	xpmem_clear_PFNtable(seg, seg->vaddr, seg->size, 1, 0);
+
+	/* indicate that the segment has been destroyed */
+	spin_lock(&seg->lock);
+	seg->flags |= XPMEM_FLAG_DESTROYED;
+	spin_unlock(&seg->lock);
+
+	/*
+	 * Remove seg from its tg's list of segs and unregister the tg's
+	 * emm_notifier if there are no other segs for this thread group and
+	 * the process is not in exit processing (in which case the unregister
+	 * will be done automatically by emm_notifier_release()).
+	 */
+	write_lock(&seg_tg->seg_list_lock);
+	list_del_init(&seg->seg_list);
+// >>> 	if (list_empty(&seg_tg->seg_list) && !(current->flags & PF_EXITING))
+// >>> 		emm_notifier_unregister(&seg_tg->emm_notifier, seg_tg->mm);
+	write_unlock(&seg_tg->seg_list_lock);
+
+	xpmem_seg_up_write(seg);
+	xpmem_seg_destroyable(seg);
+
+	return 0;
+}
+
+/*
+ * Remove all segments belonging to the specified thread group.
+ */
+void
+xpmem_remove_segs_of_tg(struct xpmem_thread_group *seg_tg)
+{
+	struct xpmem_segment *seg;
+
+	DBUG_ON(current->tgid != seg_tg->tgid);
+
+	read_lock(&seg_tg->seg_list_lock);
+
+	while (!list_empty(&seg_tg->seg_list)) {
+		seg = list_entry((&seg_tg->seg_list)->next,
+				 struct xpmem_segment, seg_list);
+		if (!(seg->flags & XPMEM_FLAG_DESTROYING)) {
+			xpmem_seg_ref(seg);
+			read_unlock(&seg_tg->seg_list_lock);
+
+			(void)xpmem_remove_seg(seg_tg, seg);
+
+			xpmem_seg_deref(seg);
+			read_lock(&seg_tg->seg_list_lock);
+		}
+	}
+	read_unlock(&seg_tg->seg_list_lock);
+}
+
+/*
+ * Remove a segment from the system.
+ */
+int
+xpmem_remove(__s64 segid)
+{
+	struct xpmem_thread_group *seg_tg;
+	struct xpmem_segment *seg;
+	int ret;
+
+	seg_tg = xpmem_tg_ref_by_segid(segid);
+	if (IS_ERR(seg_tg))
+		return PTR_ERR(seg_tg);
+
+	if (current->tgid != seg_tg->tgid) {
+		xpmem_tg_deref(seg_tg);
+		return -EACCES;
+	}
+
+	seg = xpmem_seg_ref_by_segid(seg_tg, segid);
+	if (IS_ERR(seg)) {
+		xpmem_tg_deref(seg_tg);
+		return PTR_ERR(seg);
+	}
+	DBUG_ON(seg->tg != seg_tg);
+
+	ret = xpmem_remove_seg(seg_tg, seg);
+	xpmem_seg_deref(seg);
+	xpmem_tg_deref(seg_tg);
+
+	return ret;
+}
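
(Aside, not part of the patch: xpmem_make_segid() above packs the owner's
identity into the 64-bit segid via struct xpmem_id, which lives in
xpmem_private.h. A rough sketch of what that packing amounts to is below;
the field names come from the code above, the field widths are illustrative
assumptions.)

	struct xpmem_id_sketch {	/* must stay sizeof(__s64), see the DBUG_ON() */
		pid_t tgid;		/* owner, recovered by xpmem_segid_to_tgid() */
		unsigned short uniq;	/* per-tg counter, capped at XPMEM_MAX_UNIQ_ID */
		short partid;		/* owner's partition, xpmem_segid_to_partid() */
	};
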
Index: emm_notifier_xpmem_v1/drivers/misc/xp/xpmem_misc.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ emm_notifier_xpmem_v1/drivers/misc/xp/xpmem_misc.c	2008-04-01 10:42:33.201782324 -0500
@@ -0,0 +1,367 @@
+/*
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License.  See the file "COPYING" in the main directory of this archive
+ * for more details.
+ *
+ * Copyright (c) 2004-2007 Silicon Graphics, Inc.  All Rights Reserved.
+ */
+
+/*
+ * Cross Partition Memory (XPMEM) miscellaneous functions.
+ */
+
+#include <linux/mm.h>
+#include <linux/proc_fs.h>
+#include "xpmem.h"
+#include "xpmem_private.h"
+
+/*
+ * xpmem_tg_ref() - see xpmem_private.h for inline definition
+ */
+
+/*
+ * Return a pointer to the xpmem_thread_group structure that corresponds to the
+ * specified tgid. Increment the refcnt as well if found.
+ */
+struct xpmem_thread_group *
+xpmem_tg_ref_by_tgid(struct xpmem_partition *part, pid_t tgid)
+{
+	int index;
+	struct xpmem_thread_group *tg;
+
+	index = xpmem_tg_hashtable_index(tgid);
+	read_lock(&part->tg_hashtable[index].lock);
+
+	list_for_each_entry(tg, &part->tg_hashtable[index].list, tg_hashlist) {
+		if (tg->tgid == tgid) {
+			if (tg->flags & XPMEM_FLAG_DESTROYING)
+				continue;  /* could be others with this tgid */
+
+			xpmem_tg_ref(tg);
+			read_unlock(&part->tg_hashtable[index].lock);
+			return tg;
+		}
+	}
+
+	read_unlock(&part->tg_hashtable[index].lock);
+	return ((part != xpmem_my_part) ? ERR_PTR(-EREMOTE) : ERR_PTR(-ENOENT));
+}
+
+/*
+ * Return a pointer to the xpmem_thread_group structure that corresponds to the
+ * specified segid. Increment the refcnt as well if found.
+ */
+struct xpmem_thread_group *
+xpmem_tg_ref_by_segid(__s64 segid)
+{
+	short partid = xpmem_segid_to_partid(segid);
+	struct xpmem_partition *part;
+
+	if (partid < 0 || partid >= XP_MAX_PARTITIONS)
+		return ERR_PTR(-EINVAL);
+
+	part = &xpmem_partitions[partid];
+	/* XPMEM_FLAG_UNINITIALIZED could be an -EHOSTDOWN situation */
+	if (part->flags & XPMEM_FLAG_UNINITIALIZED)
+		return ERR_PTR(-EINVAL);
+
+	return xpmem_tg_ref_by_tgid(part, xpmem_segid_to_tgid(segid));
+}
+
+/*
+ * Return a pointer to the xpmem_thread_group structure that corresponds to the
+ * specified apid. Increment the refcnt as well if found.
+ */
+struct xpmem_thread_group *
+xpmem_tg_ref_by_apid(__s64 apid)
+{
+	short partid = xpmem_apid_to_partid(apid);
+	struct xpmem_partition *part;
+
+	if (partid < 0 || partid >= XP_MAX_PARTITIONS)
+		return ERR_PTR(-EINVAL);
+
+	part = &xpmem_partitions[partid];
+	/* XPMEM_FLAG_UNINITIALIZED could be an -EHOSTDOWN situation */
+	if (part->flags & XPMEM_FLAG_UNINITIALIZED)
+		return ERR_PTR(-EINVAL);
+
+	return xpmem_tg_ref_by_tgid(part, xpmem_apid_to_tgid(apid));
+}
+
+/*
+ * Decrement the refcnt for an xpmem_thread_group structure previously
+ * referenced via xpmem_tg_ref(), xpmem_tg_ref_by_tgid(), or
+ * xpmem_tg_ref_by_segid().
+ */
+void
+xpmem_tg_deref(struct xpmem_thread_group *tg)
+{
+#ifdef CONFIG_PROC_FS
+	char tgid_string[XPMEM_TGID_STRING_LEN];
+#endif /* CONFIG_PROC_FS */
+
+	DBUG_ON(atomic_read(&tg->refcnt) <= 0);
+	if (atomic_dec_return(&tg->refcnt) != 0)
+		return;
+
+	/*
+	 * Process has been removed from lookup lists and is no
+	 * longer being referenced, so it is safe to remove it.
+	 */
+	DBUG_ON(!(tg->flags & XPMEM_FLAG_DESTROYING));
+	DBUG_ON(!list_empty(&tg->seg_list));
+
+#ifdef CONFIG_PROC_FS
+	snprintf(tgid_string, XPMEM_TGID_STRING_LEN, "%d", tg->tgid);
+	spin_lock(&xpmem_unpin_procfs_lock);
+	remove_proc_entry(tgid_string, xpmem_unpin_procfs_dir);
+	spin_unlock(&xpmem_unpin_procfs_lock);
+#endif /* CONFIG_PROC_FS */
+
+	kfree(tg->ap_hashtable);
+
+	kfree(tg);
+}
+
+/*
+ * xpmem_seg_ref - see xpmem_private.h for inline definition
+ */
+
+/*
+ * Return a pointer to the xpmem_segment structure that corresponds to the
+ * given segid. Increment the refcnt as well.
+ */
+struct xpmem_segment *
+xpmem_seg_ref_by_segid(struct xpmem_thread_group *seg_tg, __s64 segid)
+{
+	struct xpmem_segment *seg;
+
+	read_lock(&seg_tg->seg_list_lock);
+
+	list_for_each_entry(seg, &seg_tg->seg_list, seg_list) {
+		if (seg->segid == segid) {
+			if (seg->flags & XPMEM_FLAG_DESTROYING)
+				continue; /* could be others with this segid */
+
+			xpmem_seg_ref(seg);
+			read_unlock(&seg_tg->seg_list_lock);
+			return seg;
+		}
+	}
+
+	read_unlock(&seg_tg->seg_list_lock);
+	return ERR_PTR(-ENOENT);
+}
+
+/*
+ * Decrement the refcnt for an xpmem_segment structure previously referenced via
+ * xpmem_seg_ref() or xpmem_seg_ref_by_segid().
+ */
+void
+xpmem_seg_deref(struct xpmem_segment *seg)
+{
+	int i;
+	int j;
+	int k;
+	u64 ****l4table;
+	u64 ***l3table;
+	u64 **l2table;
+
+	DBUG_ON(atomic_read(&seg->refcnt) <= 0);
+	if (atomic_dec_return(&seg->refcnt) != 0)
+		return;
+
+	/*
+	 * Segment has been removed from lookup lists and is no
+	 * longer being referenced so it is safe to free it.
+	 */
+	DBUG_ON(!(seg->flags & XPMEM_FLAG_DESTROYING));
+
+	/* free this segment's PFN table  */
+	DBUG_ON(seg->PFNtable == NULL);
+	l4table = seg->PFNtable;
+	for (i = 0; i < XPMEM_PFNTABLE_L4SIZE; i++) {
+		if (l4table[i] == NULL)
+			continue;
+
+		l3table = l4table[i];
+		for (j = 0; j < XPMEM_PFNTABLE_L3SIZE; j++) {
+			if (l3table[j] == NULL)
+				continue;
+
+			l2table = l3table[j];
+			for (k = 0; k < XPMEM_PFNTABLE_L2SIZE; k++) {
+				if (l2table[k] != NULL)
+					kfree(l2table[k]);
+			}
+			kfree(l2table);
+		}
+		kfree(l3table);
+	}
+	kfree(l4table);
+
+	kfree(seg);
+}
+
+/*
+ * xpmem_ap_ref() - see xpmem_private.h for inline definition
+ */
+
+/*
+ * Return a pointer to the xpmem_access_permit structure that corresponds to
+ * the given apid. Increment the refcnt as well.
+ */
+struct xpmem_access_permit *
+xpmem_ap_ref_by_apid(struct xpmem_thread_group *ap_tg, __s64 apid)
+{
+	int index;
+	struct xpmem_access_permit *ap;
+
+	index = xpmem_ap_hashtable_index(apid);
+	read_lock(&ap_tg->ap_hashtable[index].lock);
+
+	list_for_each_entry(ap, &ap_tg->ap_hashtable[index].list,
+			    ap_hashlist) {
+		if (ap->apid == apid) {
+			if (ap->flags & XPMEM_FLAG_DESTROYING)
+				break;	/* can't be others with this apid */
+
+			xpmem_ap_ref(ap);
+			read_unlock(&ap_tg->ap_hashtable[index].lock);
+			return ap;
+		}
+	}
+
+	read_unlock(&ap_tg->ap_hashtable[index].lock);
+	return ERR_PTR(-ENOENT);
+}
+
+/*
+ * Decrement the refcnt for an xpmem_access_permit structure previously
+ * referenced via xpmem_ap_ref() or xpmem_ap_ref_by_apid().
+ */
+void
+xpmem_ap_deref(struct xpmem_access_permit *ap)
+{
+	DBUG_ON(atomic_read(&ap->refcnt) <= 0);
+	if (atomic_dec_return(&ap->refcnt) == 0) {
+		/*
+		 * Access has been removed from lookup lists and is no
+		 * longer being referenced so it is safe to remove it.
+		 */
+		DBUG_ON(!(ap->flags & XPMEM_FLAG_DESTROYING));
+		kfree(ap);
+	}
+}
+
+/*
+ * xpmem_att_ref() - see xpmem_private.h for inline definition
+ */
+
+/*
+ * Decrement the refcnt for a xpmem_attachment structure previously referenced
+ * via xpmem_att_ref().
+ */
+void
+xpmem_att_deref(struct xpmem_attachment *att)
+{
+	DBUG_ON(atomic_read(&att->refcnt) <= 0);
+	if (atomic_dec_return(&att->refcnt) == 0) {
+		/*
+		 * Attach has been removed from lookup lists and is no
+		 * longer being referenced so it is safe to remove it.
+		 */
+		DBUG_ON(!(att->flags & XPMEM_FLAG_DESTROYING));
+		kfree(att);
+	}
+}
+
+/*
+ * Acquire read access to a xpmem_segment structure.
+ */
+int
+xpmem_seg_down_read(struct xpmem_thread_group *seg_tg,
+		    struct xpmem_segment *seg, int block_recall_PFNs, int wait)
+{
+	int ret;
+
+	if (block_recall_PFNs) {
+		ret = xpmem_block_recall_PFNs(seg_tg, wait);
+		if (ret != 0)
+			return ret;
+	}
+
+	if (!down_read_trylock(&seg->sema)) {
+		if (!wait) {
+			if (block_recall_PFNs)
+				xpmem_unblock_recall_PFNs(seg_tg);
+			return -EAGAIN;
+		}
+		down_read(&seg->sema);
+	}
+
+	if ((seg->flags & XPMEM_FLAG_DESTROYING) ||
+	    (seg_tg->flags & XPMEM_FLAG_DESTROYING)) {
+		up_read(&seg->sema);
+		if (block_recall_PFNs)
+			xpmem_unblock_recall_PFNs(seg_tg);
+		return -ENOENT;
+	}
+	return 0;
+}
+
+/*
+ * Ensure that a user is correctly accessing a segment for a copy or an attach
+ * and, if so, return the segment's vaddr adjusted by the user-specified offset.
+ */
+u64
+xpmem_get_seg_vaddr(struct xpmem_access_permit *ap, off_t offset,
+		    size_t size, int mode)
+{
+	/* first ensure that this thread has permission to access segment */
+	if (current->tgid != ap->tg->tgid ||
+	    (mode == XPMEM_RDWR && ap->mode == XPMEM_RDONLY))
+		return -EACCES;
+
+	if (offset < 0 || size == 0 || offset + size > ap->seg->size)
+		return -EINVAL;
+
+	return ap->seg->vaddr + offset;
+}
+
+/*
+ * Only allow through SIGTERM or SIGKILL if they will be fatal to the
+ * current thread.
+ */
+void
+xpmem_block_nonfatal_signals(sigset_t *oldset)
+{
+	unsigned long flags;
+	sigset_t new_blocked_signals;
+
+	spin_lock_irqsave(&current->sighand->siglock, flags);
+	*oldset = current->blocked;
+	sigfillset(&new_blocked_signals);
+	sigdelset(&new_blocked_signals, SIGKILL);
+	if (current->sighand->action[SIGTERM - 1].sa.sa_handler == SIG_DFL)
+		sigdelset(&new_blocked_signals, SIGTERM);
+
+	current->blocked = new_blocked_signals;
+	recalc_sigpending();
+	spin_unlock_irqrestore(&current->sighand->siglock, flags);
+}
+
+/*
+ * Return blocked signal mask to default.
+ */
+void
+xpmem_unblock_nonfatal_signals(sigset_t *oldset)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&current->sighand->siglock, flags);
+	current->blocked = *oldset;
+	recalc_sigpending();
+	spin_unlock_irqrestore(&current->sighand->siglock, flags);
+}
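
(Aside, not part of the patch: the two signal helpers above are meant to
bracket sleeps that should only be interruptible by a fatal signal. A minimal
sketch of the calling pattern, where my_wait_for_reply() is a hypothetical
stand-in for the driver's real wait:)

	static int my_wait_fatal_signals_only(void)
	{
		sigset_t oldset;
		int ret;

		xpmem_block_nonfatal_signals(&oldset);
		/* may sleep; only SIGTERM/SIGKILL can interrupt now */
		ret = my_wait_for_reply();
		xpmem_unblock_nonfatal_signals(&oldset);

		return ret;
	}
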
Index: emm_notifier_xpmem_v1/drivers/misc/xp/xpmem_pfn.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ emm_notifier_xpmem_v1/drivers/misc/xp/xpmem_pfn.c	2008-04-01 10:42:33.165777884 -0500
@@ -0,0 +1,1242 @@
+/*
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License.  See the file "COPYING" in the main directory of this archive
+ * for more details.
+ *
+ * Copyright (c) 2004-2007 Silicon Graphics, Inc.  All Rights Reserved.
+ */
+
+/*
+ * Cross Partition Memory (XPMEM) PFN support.
+ */
+
+#include <linux/device.h>
+#include <linux/efi.h>
+#include <linux/pagemap.h>
+#include "xpmem.h"
+#include "xpmem_private.h"
+
+/* number of pages spanned by [vaddr, vaddr + size), rounded up */
+static int
+xpmem_num_of_pages(u64 vaddr, size_t size)
+{
+	return (offset_in_page(vaddr) + size + (PAGE_SIZE - 1)) >> PAGE_SHIFT;
+}
+
+/*
+ * Recall all PFNs belonging to the specified segment that have been
+ * accessed by other thread groups.
+ */
+static void
+xpmem_recall_PFNs(struct xpmem_segment *seg, u64 vaddr, size_t size)
+{
+	int handled;	//>>> what name should this have?
+
+	DBUG_ON(atomic_read(&seg->refcnt) <= 0);
+	DBUG_ON(atomic_read(&seg->tg->refcnt) <= 0);
+
+	if (!xpmem_get_overlapping_range(seg->vaddr, seg->size, &vaddr, &size))
+		return;
+
+	spin_lock(&seg->lock);
+	while (seg->flags & (XPMEM_FLAG_DESTROYING |
+	       XPMEM_FLAG_RECALLINGPFNS)) {
+
+		handled = (vaddr >= seg->recall_vaddr && vaddr + size <=
+			     seg->recall_vaddr + seg->recall_size);
+		spin_unlock(&seg->lock);
+
+		xpmem_wait_for_seg_destroyed(seg);
+		if (handled || (seg->flags & XPMEM_FLAG_DESTROYED))
+			return;
+
+		spin_lock(&seg->lock);
+	}
+	seg->recall_vaddr = vaddr;
+	seg->recall_size = size;
+	seg->flags |= XPMEM_FLAG_RECALLINGPFNS;
+	spin_unlock(&seg->lock);
+
+	xpmem_seg_down_write(seg);
+
+	/* clear all PTEs for each local attach to this segment */
+	xpmem_clear_PTEs(seg, vaddr, size);
+
+	/* clear the seg's PFN table and unpin pages */
+	xpmem_clear_PFNtable(seg, vaddr, size, 1, 0);
+
+	spin_lock(&seg->lock);
+	seg->flags &= ~XPMEM_FLAG_RECALLINGPFNS;
+	spin_unlock(&seg->lock);
+
+	xpmem_seg_up_write(seg);
+}
+
+// >>> Argh.
+int xpmem_zzz(struct xpmem_segment *seg, u64 vaddr, size_t size);
+/*
+ * Recall all PFNs belonging to the specified thread group's XPMEM segments
+ * that have been accessed by other thread groups.
+ */
+static void
+xpmem_recall_PFNs_of_tg(struct xpmem_thread_group *seg_tg, u64 vaddr,
+			size_t size)
+{
+	struct xpmem_segment *seg;
+	struct xpmem_page_request *preq;
+	u64 t_vaddr;
+	size_t t_size;
+
+	/* mark any current faults as invalid */
+	spin_lock(&seg_tg->page_requests_lock);
+	list_for_each_entry(preq, &seg_tg->page_requests, page_requests) {
+		t_vaddr = vaddr;
+		t_size = size;
+		if (xpmem_get_overlapping_range(preq->vaddr, preq->size,
+						&t_vaddr, &t_size))
+			preq->valid = 0;
+	}
+	spin_unlock(&seg_tg->page_requests_lock);
+
+	read_lock(&seg_tg->seg_list_lock);
+	list_for_each_entry(seg, &seg_tg->seg_list, seg_list) {
+
+		t_vaddr = vaddr;
+		t_size = size;
+		if (xpmem_get_overlapping_range(seg->vaddr, seg->size,
+		    &t_vaddr, &t_size)) {
+
+			xpmem_seg_ref(seg);
+			read_unlock(&seg_tg->seg_list_lock);
+
+			if (xpmem_zzz(seg, t_vaddr, t_size))
+				xpmem_recall_PFNs(seg, t_vaddr, t_size);
+
+			read_lock(&seg_tg->seg_list_lock);
+			if (list_empty(&seg->seg_list)) {
+				/* seg was deleted from seg_tg->seg_list */
+				xpmem_seg_deref(seg);
+				seg = list_entry(&seg_tg->seg_list,
+						 struct xpmem_segment,
+						 seg_list);
+			} else
+				xpmem_seg_deref(seg);
+		}
+	}
+	read_unlock(&seg_tg->seg_list_lock);
+}
+
+int
+xpmem_block_recall_PFNs(struct xpmem_thread_group *tg, int wait)
+{
+	int value;
+	int returned_value;
+
+	while (1) {
+		if (waitqueue_active(&tg->allow_recall_PFNs_wq))
+			goto wait;
+
+		value = atomic_read(&tg->n_recall_PFNs);
+		while (1) {
+			if (unlikely(value > 0))
+				break;
+
+			returned_value = atomic_cmpxchg(&tg->n_recall_PFNs,
+							value, value - 1);
+			if (likely(returned_value == value))
+				break;
+
+			value = returned_value;
+		}
+
+		if (value <= 0)
+			return 0;
+wait:
+		if (!wait)
+			return -EAGAIN;
+
+		wait_event(tg->block_recall_PFNs_wq,
+			   (atomic_read(&tg->n_recall_PFNs) <= 0));
+	}
+}
+
+void
+xpmem_unblock_recall_PFNs(struct xpmem_thread_group *tg)
+{
+	if (atomic_inc_return(&tg->n_recall_PFNs) == 0)
+		wake_up(&tg->allow_recall_PFNs_wq);
+}
+
+static void
+xpmem_disallow_blocking_recall_PFNs(struct xpmem_thread_group *tg)
+{
+	int value;
+	int returned_value;
+
+	while (1) {
+		value = atomic_read(&tg->n_recall_PFNs);
+		while (1) {
+			if (unlikely(value < 0))
+				break;
+			returned_value = atomic_cmpxchg(&tg->n_recall_PFNs,
+							value, value + 1);
+			if (likely(returned_value == value))
+				break;
+			value = returned_value;
+		}
+
+		if (value >= 0)
+			return;
+
+		wait_event(tg->allow_recall_PFNs_wq,
+			  (atomic_read(&tg->n_recall_PFNs) >= 0));
+	}
+}
+
+static void
+xpmem_allow_blocking_recall_PFNs(struct xpmem_thread_group *tg)
+{
+	if (atomic_dec_return(&tg->n_recall_PFNs) == 0)
+		wake_up(&tg->block_recall_PFNs_wq);
+}
+
+
+int xpmem_emm_notifier_callback(struct emm_notifier *e, struct mm_struct *mm,
+		enum emm_operation op, unsigned long start, unsigned long end)
+{
+	struct xpmem_thread_group *tg;
+
+	tg = container_of(e, struct xpmem_thread_group, emm_notifier);
+	xpmem_tg_ref(tg);
+
+	DBUG_ON(tg->mm != mm);
+	switch(op) {
+	case emm_release:
+		xpmem_remove_segs_of_tg(tg);
+		break;
+	case emm_invalidate_start:
+		xpmem_disallow_blocking_recall_PFNs(tg);
+
+		mutex_lock(&tg->recall_PFNs_mutex);
+		xpmem_recall_PFNs_of_tg(tg, start, end - start);
+		mutex_unlock(&tg->recall_PFNs_mutex);
+		break;
+	case emm_invalidate_end:
+		xpmem_allow_blocking_recall_PFNs(tg);
+		break;
+	case emm_referenced:
+		break;
+	}
+
+	xpmem_tg_deref(tg);
+	return 0;
+}
+
+/*
+ * Fault in and pin all pages in the given range for the specified task and mm.
+ * VM_IO pages can't be pinned via get_user_pages().
+ */
+static int
+xpmem_pin_pages(struct xpmem_thread_group *tg, struct xpmem_segment *seg,
+		struct task_struct *src_task, struct mm_struct *src_mm,
+		u64 vaddr, size_t size, int *pinned, int *recalls_blocked)
+{
+	int ret;
+	int bret;
+	int malloc = 0;
+	int n_pgs = xpmem_num_of_pages(vaddr, size);
+//>>> What is pages_array being used for by get_user_pages() and can
+//>>> xpmem_fill_in_PFNtable() use it to do what it needs to do?
+	struct page *pages_array[16];
+	struct page **pages;
+	struct vm_area_struct *vma;
+	cpumask_t saved_mask = CPU_MASK_NONE;
+	struct xpmem_page_request preq = {
+		.valid = 1,
+		.page_requests = LIST_HEAD_INIT(preq.page_requests),
+	};
+	int request_retries = 0;
+
+	*pinned = 1;
+
+	vma = find_vma(src_mm, vaddr);
+	if (!vma || vma->vm_start > vaddr)
+		return -ENOENT;
+
+	/* don't pin pages in an address range which itself is an attachment */
+	if (xpmem_is_vm_ops_set(vma))
+		return -ENOENT;
+
+	if (n_pgs > 16) {
+		pages = kzalloc(sizeof(struct page *) * n_pgs, GFP_KERNEL);
+		if (pages == NULL)
+			return -ENOMEM;
+
+		malloc = 1;
+	} else
+		pages = pages_array;
+
+	/*
+	 * get_user_pages() may have to allocate pages on behalf of
+	 * the source thread group. If so, we want to ensure that pages
+	 * are allocated near the source thread group and not the current
+	 * thread calling get_user_pages(). Since this does not happen when
+	 * the policy is node-local (the most common default policy),
+	 * we might have to temporarily switch cpus to get the page
+	 * placed where we want it. Since MPI rarely uses xpmem_copy(),
+	 * we don't bother doing this unless we are allocating XPMEM
+	 * attached memory (i.e. n_pgs == 1).
+	 */
+	if (n_pgs == 1 && xpmem_vaddr_to_pte(src_mm, vaddr) == NULL &&
+	    cpu_to_node(task_cpu(current)) != cpu_to_node(task_cpu(src_task))) {
+		saved_mask = current->cpus_allowed;
+		set_cpus_allowed(current, cpumask_of_cpu(task_cpu(src_task)));
+	}
+
+	/*
+	 * At this point, we are ready to call the kernel to fault and reference
+	 * pages.  There is a deadlock case where our fault action may need to
+	 * do an invalidate_range.  To handle this case, we add our page_request
+	 * information to a list which any new invalidates will check and then
+	 * unblock invalidates.
+	 */
+	preq.vaddr = vaddr;
+	preq.size = size;
+	init_waitqueue_head(&preq.wq);
+	spin_lock(&tg->page_requests_lock);
+	list_add(&preq.page_requests, &tg->page_requests);
+	spin_unlock(&tg->page_requests_lock);
+
+retry_fault:
+	mutex_unlock(&seg->PFNtable_mutex);
+	if (*recalls_blocked) {
+		xpmem_unblock_recall_PFNs(tg);
+		*recalls_blocked = 0;
+	}
+
+	/* get_user_pages() faults and pins the pages */
+	ret = get_user_pages(src_task, src_mm, vaddr, n_pgs, 1, 1, pages, NULL);
+
+	bret = xpmem_block_recall_PFNs(tg, 1);
+	if (bret == 0)
+		*recalls_blocked = 1;
+	mutex_lock(&seg->PFNtable_mutex);
+
+	if (bret != 0 || !preq.valid) {
+		int to_free = ret;
+
+		while (to_free-- > 0) {
+			page_cache_release(pages[to_free]);
+		}
+		request_retries++;
+	}
+
+	if (preq.valid || bret != 0 || request_retries > 3) {
+		spin_lock(&tg->page_requests_lock);
+		list_del(&preq.page_requests);
+		spin_unlock(&tg->page_requests_lock);
+		wake_up_all(&preq.wq);
+	}
+
+	if (bret != 0) {
+		*recalls_blocked = 0;
+		return bret;
+	}
+	if (request_retries > 3)
+		return -EAGAIN;
+
+	if (!preq.valid) {
+		preq.valid = 1;
+		goto retry_fault;
+	}
+
+	if (!cpus_empty(saved_mask))
+		set_cpus_allowed(current, saved_mask);
+
+	if (malloc)
+		kfree(pages);
+
+	if (ret >= 0) {
+		DBUG_ON(ret != n_pgs);
+		atomic_add(ret, &tg->n_pinned);
+	} else {
+		struct vm_area_struct *vma;
+		u64 end_vaddr;
+		u64 tmp_vaddr;
+
+		/*
+		 * get_user_pages() doesn't pin VM_IO mappings. If the entire
+		 * area is locked I/O space however, we can continue and just
+		 * make note of the fact that this area was not pinned by
+		 * XPMEM. Fetchop (AMO) pages fall into this category.
+		 */
+		end_vaddr = vaddr + size;
+		tmp_vaddr = vaddr;
+		do {
+			vma = find_vma(src_mm, tmp_vaddr);
+			if (!vma || vma->vm_start >= end_vaddr ||
+//>>> VM_PFNMAP may also be set? Can we say it's always set?
+//>>> perhaps we could check for it and VM_IO and set something to indicate
+//>>> whether one or the other or both of these were set
+			    !(vma->vm_flags & VM_IO))
+				return ret;
+
+			tmp_vaddr = vma->vm_end;
+
+		} while (tmp_vaddr < end_vaddr);
+
+		/*
+		 * All mappings are pinned for I/O. Check the page tables to
+		 * ensure that all pages are present.
+		 */
+		while (n_pgs--) {
+			if (xpmem_vaddr_to_pte(src_mm, vaddr) == NULL)
+				return -EFAULT;
+
+			vaddr += PAGE_SIZE;
+		}
+		*pinned = 0;
+	}
+
+	return 0;
+}
+
+/*
+ * For a given virtual address range, grab the underlying PFNs from the
+ * page table and store them in XPMEM's PFN table. The underlying pages
+ * have already been pinned by the time this function is executed.
+ */
+static int
+xpmem_fill_in_PFNtable(struct mm_struct *src_mm, struct xpmem_segment *seg,
+		       u64 vaddr, size_t size, int drop_memprot, int pinned)
+{
+	int n_pgs = xpmem_num_of_pages(vaddr, size);
+	int n_pgs_unpinned;
+	pte_t *pte_p;
+	u64 *pfn_p;
+	u64 pfn;
+	int ret;
+
+	while (n_pgs--) {
+		pte_p = xpmem_vaddr_to_pte(src_mm, vaddr);
+		if (pte_p == NULL) {
+			ret = -ENOENT;
+			goto unpin_pages;
+		}
+		DBUG_ON(!pte_present(*pte_p));
+
+		pfn_p = xpmem_vaddr_to_PFN(seg, vaddr);
+		DBUG_ON(!XPMEM_PFN_IS_UNKNOWN(pfn_p));
+		pfn = pte_pfn(*pte_p);
+		DBUG_ON(!XPMEM_PFN_IS_KNOWN(&pfn));
+
+#ifdef CONFIG_IA64
+		/* check if this is an uncached page */
+		if (pte_val(*pte_p) & _PAGE_MA_UC)
+			pfn |= XPMEM_PFN_UNCACHED;
+#endif
+
+		if (!pinned)
+			pfn |= XPMEM_PFN_IO;
+
+		if (drop_memprot)
+			pfn |= XPMEM_PFN_MEMPROT_DOWN;
+
+		*pfn_p = pfn;
+		vaddr += PAGE_SIZE;
+	}
+
+	return 0;
+
+unpin_pages:
+	/* unpin any pinned pages not yet added to the PFNtable */
+	if (pinned) {
+		n_pgs_unpinned = 0;
+		do {
+//>>> The fact that the pte can be cleared after we've pinned the page suggests
+//>>> that we need to utilize the page_array set up by get_user_pages() as
+//>>> the only accurate means to find what indeed we've actually pinned.
+//>>> Can in fact the pte really be cleared from the time we pinned the page?
+			if (pte_p != NULL) {
+				page_cache_release(pte_page(*pte_p));
+				n_pgs_unpinned++;
+			}
+			vaddr += PAGE_SIZE;
+			if (n_pgs > 0)
+				pte_p = xpmem_vaddr_to_pte(src_mm, vaddr);
+		} while (n_pgs--);
+
+		atomic_sub(n_pgs_unpinned, &seg->tg->n_pinned);
+	}
+	return ret;
+}
+
+/*
+ * Determine unknown PFNs for a given virtual address range.
+ */
+static int
+xpmem_get_PFNs(struct xpmem_segment *seg, u64 vaddr, size_t size,
+	       int drop_memprot, int *recalls_blocked)
+{
+	struct xpmem_thread_group *seg_tg = seg->tg;
+	struct task_struct *src_task = seg_tg->group_leader;
+	struct mm_struct *src_mm = seg_tg->mm;
+	int ret;
+	int pinned;
+
+	/*
+	 * We used to look up the source task_struct by tgid, but that was
+	 * a performance killer. Instead we stash a pointer to the thread
+	 * group leader's task_struct in the xpmem_thread_group structure.
+	 * This is safe because we incremented the task_struct's usage count
+	 * at the same time we stashed the pointer.
+	 */
+
+	/*
+	 * Find and pin the pages. xpmem_pin_pages() fails if there are
+	 * holes in the vaddr range (which is what we want to happen).
+	 * VM_IO pages can't be pinned, however the Linux kernel ensures
+	 * those pages aren't swapped, so XPMEM keeps its hands off and
+	 * everything works out.
+	 */
+	ret = xpmem_pin_pages(seg_tg, seg, src_task, src_mm, vaddr, size,
+			      &pinned, recalls_blocked);
+	if (ret == 0) {
+		/* record the newly discovered pages in XPMEM's PFN table */
+		ret = xpmem_fill_in_PFNtable(src_mm, seg, vaddr, size,
+					     drop_memprot, pinned);
+	}
+	return ret;
+}
+
+/*
+ * Given a virtual address range and XPMEM segment, determine which portions
+ * of that range XPMEM needs to fetch PFN information for. As unknown
+ * contiguous portions of the virtual address range are determined, other
+ * functions are called to do the actual PFN discovery tasks.
+ */
+int
+xpmem_ensure_valid_PFNs(struct xpmem_segment *seg, u64 vaddr, size_t size,
+			int drop_memprot, int faulting,
+			unsigned long expected_vm_pfnmap,
+			int mmap_sem_prelocked, int *recalls_blocked)
+{
+	u64 *pfn;
+	int ret;
+	int n_pfns;
+	int n_pgs = xpmem_num_of_pages(vaddr, size);
+	int mmap_sem_locked = 0;
+	int PFNtable_locked = 0;
+	u64 f_vaddr = vaddr;
+	u64 l_vaddr = vaddr + size;
+	u64 t_vaddr = t_vaddr;	/* self-init silences a bogus "uninitialized" warning */
+	size_t t_size;
+	struct xpmem_thread_group *seg_tg = seg->tg;
+	struct xpmem_page_request *preq;
+	DEFINE_WAIT(wait);
+
+
+	DBUG_ON(seg->PFNtable == NULL);
+	DBUG_ON(n_pgs <= 0);
+
+again:
+	/*
+	 * We must grab the mmap_sem before the PFNtable_mutex if we are
+	 * looking up partition-local page data. If we are faulting a page in
+	 * our own address space, we don't have to grab the mmap_sem since we
+	 * already have it via ia64_do_page_fault(). If we are faulting a page
+	 * from another address space, there is a potential for a deadlock
+	 * on the mmap_sem. If the fault handler detects this potential, it
+	 * acquires the two mmap_sems in numeric order (address-wise).
+	 */
+	if (!(faulting && seg_tg->mm == current->mm)) {
+		if (!mmap_sem_prelocked) {
+//>>> Since we inc the mm_users up front in xpmem_open(), why bother here?
+//>>> but do comment that that is the case.
+			atomic_inc(&seg_tg->mm->mm_users);
+			down_read(&seg_tg->mm->mmap_sem);
+			mmap_sem_locked = 1;
+		}
+	}
+
+single_faulter:
+	ret = xpmem_block_recall_PFNs(seg_tg, 0);
+	if (ret != 0)
+		goto unlock;
+	*recalls_blocked = 1;
+
+	mutex_lock(&seg->PFNtable_mutex);
+	spin_lock(&seg_tg->page_requests_lock);
+	/* mark any current faults as invalid. */
+	list_for_each_entry(preq, &seg_tg->page_requests, page_requests) {
+		t_vaddr = vaddr;
+		t_size = size;
+		if (xpmem_get_overlapping_range(preq->vaddr, preq->size, &t_vaddr, &t_size)) {
+			prepare_to_wait(&preq->wq, &wait, TASK_UNINTERRUPTIBLE);
+			spin_unlock(&seg_tg->page_requests_lock);
+			mutex_unlock(&seg->PFNtable_mutex);
+			if (*recalls_blocked) {
+				xpmem_unblock_recall_PFNs(seg_tg);
+				*recalls_blocked = 0;
+			}
+
+			schedule();
+			set_current_state(TASK_RUNNING);
+			goto single_faulter;
+		}
+	}
+	spin_unlock(&seg_tg->page_requests_lock);
+	PFNtable_locked = 1;
+
+	/* the seg may have been marked for destruction while we were down() */
+	if (seg->flags & XPMEM_FLAG_DESTROYING) {
+		ret = -ENOENT;
+		goto unlock;
+	}
+
+	/*
+	 * Determine the number of unknown PFNs and PFNs whose memory
+	 * protections need to be modified.
+	 */
+	n_pfns = 0;
+
+	do {
+		ret = xpmem_vaddr_to_PFN_alloc(seg, vaddr, &pfn, 1);
+		if (ret != 0)
+			goto unlock;
+
+		if (XPMEM_PFN_IS_KNOWN(pfn) &&
+		    !XPMEM_PFN_DROP_MEMPROT(pfn, drop_memprot)) {
+			n_pgs--;
+			vaddr += PAGE_SIZE;
+			break;
+		}
+
+		if (n_pfns++ == 0) {
+			t_vaddr = vaddr;
+			if (t_vaddr > f_vaddr)
+				t_vaddr -= offset_in_page(t_vaddr);
+		}
+
+		n_pgs--;
+		vaddr += PAGE_SIZE;
+
+	} while (n_pgs > 0);
+
+	if (n_pfns > 0) {
+		t_size = (n_pfns * PAGE_SIZE) - offset_in_page(t_vaddr);
+		if (t_vaddr + t_size > l_vaddr)
+			t_size = l_vaddr - t_vaddr;
+
+		ret = xpmem_get_PFNs(seg, t_vaddr, t_size,
+				     drop_memprot, recalls_blocked);
+
+		if (ret != 0) {
+			goto unlock;
+		}
+	}
+
+	if (faulting) {
+		struct vm_area_struct *vma;
+
+		vma = find_vma(seg_tg->mm, vaddr - PAGE_SIZE);
+		BUG_ON(!vma || vma->vm_start > vaddr - PAGE_SIZE);
+		if ((vma->vm_flags & VM_PFNMAP) != expected_vm_pfnmap)
+			ret = -EINVAL;
+	}
+
+unlock:
+	if (PFNtable_locked)
+		mutex_unlock(&seg->PFNtable_mutex);
+	if (mmap_sem_locked) {
+		up_read(&seg_tg->mm->mmap_sem);
+		atomic_dec(&seg_tg->mm->mm_users);
+	}
+	if (ret != 0) {
+		if (*recalls_blocked) {
+			xpmem_unblock_recall_PFNs(seg_tg);
+			*recalls_blocked = 0;
+		}
+		return ret;
+	}
+
+	/*
+	 * Spin through the PFNs until we encounter one that isn't known
+	 * or the memory protection needs to be modified.
+	 */
+	DBUG_ON(faulting && n_pgs > 0);
+	while (n_pgs > 0) {
+		ret = xpmem_vaddr_to_PFN_alloc(seg, vaddr, &pfn, 0);
+		if (ret != 0)
+			return ret;
+
+		if (XPMEM_PFN_IS_UNKNOWN(pfn) ||
+		    XPMEM_PFN_DROP_MEMPROT(pfn, drop_memprot)) {
+			if (*recalls_blocked) {
+				xpmem_unblock_recall_PFNs(seg_tg);
+				*recalls_blocked = 0;
+			}
+			goto again;
+		}
+
+		n_pgs--;
+		vaddr += PAGE_SIZE;
+	}
+
+	return ret;
+}
+
+#ifdef CONFIG_X86_64
+#ifndef CONFIG_NUMA
+#ifndef CONFIG_SMP
+#undef node_to_cpumask
+#define	node_to_cpumask(nid)	(xpmem_cpu_online_map)
+static cpumask_t xpmem_cpu_online_map;
+#endif /* !CONFIG_SMP */
+#endif /* !CONFIG_NUMA */
+#endif /* CONFIG_X86_64 */
+
+static int
+xpmem_find_node_with_cpus(struct xpmem_node_PFNlists *npls, int starting_nid)
+{
+	int nid;
+	struct xpmem_node_PFNlist *npl;
+	cpumask_t node_cpus;
+
+	nid = starting_nid;
+	while (--nid != starting_nid) {
+		if (nid == -1)
+			nid = MAX_NUMNODES - 1;
+
+		npl = &npls->PFNlists[nid];
+
+		if (npl->nid == XPMEM_NODE_OFFLINE)
+			continue;
+
+		if (npl->nid != XPMEM_NODE_UNINITIALIZED) {
+			nid = npl->nid;
+			break;
+		}
+
+		if (!node_online(nid)) {
+			DBUG_ON(!cpus_empty(node_to_cpumask(nid)));
+			npl->nid = XPMEM_NODE_OFFLINE;
+			npl->cpu = XPMEM_CPUS_OFFLINE;
+			continue;
+		}
+		node_cpus = node_to_cpumask(nid);
+		if (!cpus_empty(node_cpus)) {
+			DBUG_ON(npl->cpu != XPMEM_CPUS_UNINITIALIZED);
+			npl->nid = nid;
+			break;
+		}
+		npl->cpu = XPMEM_CPUS_OFFLINE;
+	}
+
+	BUG_ON(nid == starting_nid);
+	return nid;
+}
+
+static void
+xpmem_process_PFNlist_by_CPU(struct work_struct *work)
+{
+	int i;
+	int n_unpinned = 0;
+	struct xpmem_PFNlist *pl = (struct xpmem_PFNlist *)work;
+	struct xpmem_node_PFNlists *npls = pl->PFNlists;
+	u64 *pfn;
+	struct page *page;
+
+	/* for each PFN in the PFNlist do... */
+	for (i = 0; i < pl->n_PFNs; i++) {
+		pfn = &pl->PFNs[i];
+
+		if (*pfn & XPMEM_PFN_UNPIN) {
+			if (!(*pfn & XPMEM_PFN_IO)) {
+				/* unpin the page */
+				page = virt_to_page(__va(XPMEM_PFN(pfn)
+							 << PAGE_SHIFT));
+				page_cache_release(page);
+				n_unpinned++;
+			}
+		}
+	}
+
+	if (n_unpinned > 0)
+		atomic_sub(n_unpinned, pl->n_pinned);
+
+	/* indicate we are done processing this PFNlist */
+	if (atomic_dec_return(&npls->n_PFNlists_processing) == 0)
+		wake_up(&npls->PFNlists_processing_wq);
+
+	kfree(pl);
+}
+
+static void
+xpmem_schedule_PFNlist_processing(struct xpmem_node_PFNlists *npls, int nid)
+{
+	int cpu;
+	int ret;
+	struct xpmem_node_PFNlist *npl = &npls->PFNlists[nid];
+	cpumask_t node_cpus;
+
+	DBUG_ON(npl->nid != nid);
+	DBUG_ON(npl->PFNlist == NULL);
+	DBUG_ON(npl->cpu == XPMEM_CPUS_OFFLINE);
+
+	/* select a CPU to schedule work on */
+	cpu = npl->cpu;
+	node_cpus = node_to_cpumask(nid);
+	cpu = next_cpu(cpu, node_cpus);
+	if (cpu == NR_CPUS)
+		cpu = first_cpu(node_cpus);
+
+	npl->cpu = cpu;
+
+	preempt_disable();
+	ret = schedule_delayed_work_on(cpu, &npl->PFNlist->dwork, 0);
+	preempt_enable();
+	BUG_ON(ret != 1);
+
+	npl->PFNlist = NULL;
+	npls->n_PFNlists_scheduled++;
+}
+
+/*
+ * Add the specified PFN to a node based list of PFNs. Each list is to be
+ * 'processed' by the CPUs resident on that node. If a node does not have
+ * any CPUs, the list processing will be scheduled on the CPUs of a node
+ * that does.
+ */
+static void
+xpmem_add_to_PFNlist(struct xpmem_segment *seg,
+		     struct xpmem_node_PFNlists **npls_ptr, u64 *pfn)
+{
+	int nid;
+	struct xpmem_node_PFNlists *npls = *npls_ptr;
+	struct xpmem_node_PFNlist *npl;
+	struct xpmem_PFNlist *pl;
+	cpumask_t node_cpus;
+
+	if (npls == NULL) {
+		npls = kmalloc(sizeof(struct xpmem_node_PFNlists), GFP_KERNEL);
+		BUG_ON(npls == NULL);
+		*npls_ptr = npls;
+
+		atomic_set(&npls->n_PFNlists_processing, 0);
+		init_waitqueue_head(&npls->PFNlists_processing_wq);
+
+		npls->n_PFNlists_created = 0;
+		npls->n_PFNlists_scheduled = 0;
+		npls->PFNlists = kmalloc(sizeof(struct xpmem_node_PFNlist) *
+					 MAX_NUMNODES, GFP_KERNEL);
+		BUG_ON(npls->PFNlists == NULL);
+
+		for (nid = 0; nid < MAX_NUMNODES; nid++) {
+			npls->PFNlists[nid].nid = XPMEM_NODE_UNINITIALIZED;
+			npls->PFNlists[nid].cpu = XPMEM_CPUS_UNINITIALIZED;
+			npls->PFNlists[nid].PFNlist = NULL;
+		}
+	}
+
+#ifdef CONFIG_IA64
+	nid = nasid_to_cnodeid(NASID_GET(XPMEM_PFN_TO_PADDR(pfn)));
+#else
+	nid = pfn_to_nid(XPMEM_PFN(pfn));
+#endif
+	BUG_ON(nid >= MAX_NUMNODES);
+	DBUG_ON(!node_online(nid));
+	npl = &npls->PFNlists[nid];
+
+	pl = npl->PFNlist;
+	if (pl == NULL) {
+
+		DBUG_ON(npl->nid == XPMEM_NODE_OFFLINE);
+		if (npl->nid == XPMEM_NODE_UNINITIALIZED) {
+			node_cpus = node_to_cpumask(nid);
+			if (npl->cpu == XPMEM_CPUS_OFFLINE ||
+			    cpus_empty(node_cpus)) {
+				/* mark this node as headless */
+				npl->cpu = XPMEM_CPUS_OFFLINE;
+
+				/* switch to a node with CPUs */
+				npl->nid = xpmem_find_node_with_cpus(npls, nid);
+				npl = &npls->PFNlists[npl->nid];
+			} else
+				npl->nid = nid;
+
+		} else if (npl->nid != nid) {
+			/* we're on a headless node, switch to one with CPUs */
+			DBUG_ON(npl->cpu != XPMEM_CPUS_OFFLINE);
+			npl = &npls->PFNlists[npl->nid];
+		}
+
+		pl = npl->PFNlist;
+		if (pl == NULL) {
+			pl = kmalloc_node(sizeof(struct xpmem_PFNlist) +
+					  sizeof(u64) * XPMEM_MAXNPFNs_PER_LIST,
+					  GFP_KERNEL, npl->nid);
+			BUG_ON(pl == NULL);
+
+			INIT_DELAYED_WORK(&pl->dwork,
+					  xpmem_process_PFNlist_by_CPU);
+			pl->n_pinned = &seg->tg->n_pinned;
+			pl->PFNlists = npls;
+			pl->n_PFNs = 0;
+
+			npl->PFNlist = pl;
+			npls->n_PFNlists_created++;
+		}
+	}
+
+	pl->PFNs[pl->n_PFNs++] = *pfn;
+
+	if (pl->n_PFNs == XPMEM_MAXNPFNs_PER_LIST)
+		xpmem_schedule_PFNlist_processing(npls, npl->nid);
+}
+
+/*
+ * Search for any PFNs found in the specified seg's level 1 PFNtable.
+ */
+static inline int
+xpmem_zzz_l1(struct xpmem_segment *seg, u64 *l1table, u64 *vaddr,
+			u64 end_vaddr)
+{
+	int nfound = 0;
+	int index = XPMEM_PFNTABLE_L1INDEX(*vaddr);
+	u64 *pfn;
+
+	for (; index < XPMEM_PFNTABLE_L1SIZE && *vaddr <= end_vaddr && nfound == 0;
+	     index++, *vaddr += PAGE_SIZE) {
+		pfn = &l1table[index];
+		if (XPMEM_PFN_IS_UNKNOWN(pfn))
+			continue;
+
+		nfound++;
+	}
+	return nfound;
+}
+
+/*
+ * Search for any PFNs found in the specified seg's level 2 PFNtable.
+ */
+static inline int
+xpmem_zzz_l2(struct xpmem_segment *seg, u64 **l2table, u64 *vaddr,
+			u64 end_vaddr)
+{
+	int nfound = 0;
+	int index = XPMEM_PFNTABLE_L2INDEX(*vaddr);
+	u64 *l1;
+
+	for (; index < XPMEM_PFNTABLE_L2SIZE && *vaddr <= end_vaddr && nfound == 0; index++) {
+		l1 = l2table[index];
+		if (l1 == NULL) {
+			*vaddr = (*vaddr & PMD_MASK) + PMD_SIZE;
+			continue;
+		}
+
+		nfound += xpmem_zzz_l1(seg, l1, vaddr, end_vaddr);
+	}
+	return nfound;
+}
+
+/*
+ * Search for any PFNs found in the specified seg's level 3 PFNtable.
+ */
+static inline int
+xpmem_zzz_l3(struct xpmem_segment *seg, u64 ***l3table, u64 *vaddr,
+			u64 end_vaddr)
+{
+	int nfound = 0;
+	int index = XPMEM_PFNTABLE_L3INDEX(*vaddr);
+	u64 **l2;
+
+	for (; index < XPMEM_PFNTABLE_L3SIZE && *vaddr <= end_vaddr && nfound == 0; index++) {
+		l2 = l3table[index];
+		if (l2 == NULL) {
+			*vaddr = (*vaddr & PUD_MASK) + PUD_SIZE;
+			continue;
+		}
+
+		nfound += xpmem_zzz_l2(seg, l2, vaddr, end_vaddr);
+	}
+	return nfound;
+}
+
+/*
+ * Search for any PFNs found in the specified seg's PFNtable.
+ *
+ * This function should only be called when XPMEM can guarantee that no
+ * other thread will be rummaging through the PFNtable at the same time.
+ */
+int
+xpmem_zzz(struct xpmem_segment *seg, u64 vaddr, size_t size)
+{
+	int nfound = 0;
+	int index;
+	int start_index;
+	int end_index;
+	u64 ***l3;
+	u64 end_vaddr = vaddr + size - 1;
+
+	mutex_lock(&seg->PFNtable_mutex);
+
+	/* ensure vaddr is aligned on a page boundary */
+	if (offset_in_page(vaddr))
+		vaddr = (vaddr & PAGE_MASK);
+
+	start_index = XPMEM_PFNTABLE_L4INDEX(vaddr);
+	end_index = XPMEM_PFNTABLE_L4INDEX(end_vaddr);
+
+	for (index = start_index; index <= end_index && nfound == 0; index++) {
+		/*
+		 * The virtual address space is broken up into 8 regions
+		 * of equal size, and upper portions of each region are
+		 * inaccessible to user page tables. When we encounter
+		 * the inaccessible portion of a region, we set vaddr to
+		 * the beginning of the next region and continue scanning
+		 * the XPMEM PFN table. Note: the region is stored in
+		 * bits 63..61 of a virtual address.
+		 *
+		 * This check would ideally use Linux kernel macros to
+		 * determine when vaddr overlaps with unimplemented space,
+		 * but such macros do not exist in 2.4.19. Instead, we jump
+		 * to the next region at each 1/8 of the page table.
+		 */
+		if ((index != start_index) &&
+		    ((index % (PTRS_PER_PGD / 8)) == 0))
+			vaddr = ((vaddr >> 61) + 1) << 61;
+
+		l3 = seg->PFNtable[index];
+		if (l3 == NULL) {
+			vaddr = (vaddr & PGDIR_MASK) + PGDIR_SIZE;
+			continue;
+		}
+
+		nfound += xpmem_zzz_l3(seg, l3, &vaddr, end_vaddr);
+	}
+
+	mutex_unlock(&seg->PFNtable_mutex);
+	return nfound;
+}
+
+/*
+ * Clear all PFNs found in the specified seg's level 1 PFNtable.
+ */
+static inline void
+xpmem_clear_PFNtable_l1(struct xpmem_segment *seg, u64 *l1table, u64 *vaddr,
+			u64 end_vaddr, int unpin_pages, int recall_only,
+			struct xpmem_node_PFNlists **npls_ptr)
+{
+	int index = XPMEM_PFNTABLE_L1INDEX(*vaddr);
+	u64 *pfn;
+
+	for (; index < XPMEM_PFNTABLE_L1SIZE && *vaddr <= end_vaddr;
+	     index++, *vaddr += PAGE_SIZE) {
+		pfn = &l1table[index];
+		if (XPMEM_PFN_IS_UNKNOWN(pfn))
+			continue;
+
+		if (recall_only) {
+			if (!(*pfn & XPMEM_PFN_UNCACHED) &&
+			    (*pfn & XPMEM_PFN_MEMPROT_DOWN))
+				xpmem_add_to_PFNlist(seg, npls_ptr, pfn);
+
+			continue;
+		}
+
+		if (unpin_pages) {
+			*pfn |= XPMEM_PFN_UNPIN;
+			xpmem_add_to_PFNlist(seg, npls_ptr, pfn);
+		}
+		*pfn = 0;
+	}
+}
+
+/*
+ * Clear all PFNs found in the specified seg's level 2 PFNtable.
+ */
+static inline void
+xpmem_clear_PFNtable_l2(struct xpmem_segment *seg, u64 **l2table, u64 *vaddr,
+			u64 end_vaddr, int unpin_pages, int recall_only,
+			struct xpmem_node_PFNlists **npls_ptr)
+{
+	int index = XPMEM_PFNTABLE_L2INDEX(*vaddr);
+	u64 *l1;
+
+	for (; index < XPMEM_PFNTABLE_L2SIZE && *vaddr <= end_vaddr; index++) {
+		l1 = l2table[index];
+		if (l1 == NULL) {
+			*vaddr = (*vaddr & PMD_MASK) + PMD_SIZE;
+			continue;
+		}
+
+		xpmem_clear_PFNtable_l1(seg, l1, vaddr, end_vaddr,
+					unpin_pages, recall_only, npls_ptr);
+	}
+}
+
+/*
+ * Clear all PFNs found in the specified seg's level 3 PFNtable.
+ */
+static inline void
+xpmem_clear_PFNtable_l3(struct xpmem_segment *seg, u64 ***l3table, u64 *vaddr,
+			u64 end_vaddr, int unpin_pages, int recall_only,
+			struct xpmem_node_PFNlists **npls_ptr)
+{
+	int index = XPMEM_PFNTABLE_L3INDEX(*vaddr);
+	u64 **l2;
+
+	for (; index < XPMEM_PFNTABLE_L3SIZE && *vaddr <= end_vaddr; index++) {
+		l2 = l3table[index];
+		if (l2 == NULL) {
+			*vaddr = (*vaddr & PUD_MASK) + PUD_SIZE;
+			continue;
+		}
+
+		xpmem_clear_PFNtable_l2(seg, l2, vaddr, end_vaddr,
+					unpin_pages, recall_only, npls_ptr);
+	}
+}
+
+/*
+ * Clear all PFNs found in the specified seg's PFNtable and, if requested,
+ * unpin the underlying physical pages.
+ *
+ * This function should only be called when XPMEM can guarantee that no
+ * other thread will be rummaging through the PFNtable at the same time.
+ */
+void
+xpmem_clear_PFNtable(struct xpmem_segment *seg, u64 vaddr, size_t size,
+		     int unpin_pages, int recall_only)
+{
+	int index;
+	int nid;
+	int start_index;
+	int end_index;
+	struct xpmem_node_PFNlists *npls = NULL;
+	u64 ***l3;
+	u64 end_vaddr = vaddr + size - 1;
+
+	DBUG_ON(unpin_pages && recall_only);
+
+	mutex_lock(&seg->PFNtable_mutex);
+
+	/* ensure vaddr is aligned on a page boundary */
+	if (offset_in_page(vaddr))
+		vaddr = (vaddr & PAGE_MASK);
+
+	start_index = XPMEM_PFNTABLE_L4INDEX(vaddr);
+	end_index = XPMEM_PFNTABLE_L4INDEX(end_vaddr);
+
+	for (index = start_index; index <= end_index; index++) {
+		/*
+		 * The virtual address space is broken up into 8 regions
+		 * of equal size, and upper portions of each region are
+		 * inaccessible to user page tables. When we encounter
+		 * the inaccessible portion of a region, we set vaddr to
+		 * the beginning of the next region and continue scanning
+		 * the XPMEM PFN table. Note: the region is stored in
+		 * bits 63..61 of a virtual address.
+		 *
+		 * This check would ideally use Linux kernel macros to
+		 * determine when vaddr overlaps with unimplemented space,
+		 * but such macros do not exist in 2.4.19. Instead, we jump
+		 * to the next region at each 1/8 of the page table.
+		 */
+		if ((index != start_index) &&
+		    ((index % (PTRS_PER_PGD / 8)) == 0))
+			vaddr = ((vaddr >> 61) + 1) << 61;
+
+		l3 = seg->PFNtable[index];
+		if (l3 == NULL) {
+			vaddr = (vaddr & PGDIR_MASK) + PGDIR_SIZE;
+			continue;
+		}
+
+		xpmem_clear_PFNtable_l3(seg, l3, &vaddr, end_vaddr,
+					unpin_pages, recall_only, &npls);
+	}
+
+	if (npls != NULL) {
+		if (npls->n_PFNlists_created > npls->n_PFNlists_scheduled) {
+			for_each_online_node(nid) {
+				if (npls->PFNlists[nid].PFNlist != NULL)
+					xpmem_schedule_PFNlist_processing(npls,
+									  nid);
+			}
+		}
+		DBUG_ON(npls->n_PFNlists_scheduled != npls->n_PFNlists_created);
+
+		atomic_add(npls->n_PFNlists_scheduled,
+			   &npls->n_PFNlists_processing);
+		wait_event(npls->PFNlists_processing_wq,
+			   (atomic_read(&npls->n_PFNlists_processing) == 0));
+
+		kfree(npls->PFNlists);
+		kfree(npls);
+	}
+
+	mutex_unlock(&seg->PFNtable_mutex);
+}
+
+#ifdef CONFIG_PROC_FS
+DEFINE_SPINLOCK(xpmem_unpin_procfs_lock);
+struct proc_dir_entry *xpmem_unpin_procfs_dir;
+
+static int
+xpmem_is_thread_group_stopped(struct xpmem_thread_group *tg)
+{
+	struct task_struct *task = tg->group_leader;
+
+	rcu_read_lock();
+	do {
+		if (!(task->flags & PF_EXITING) &&
+		    task->state != TASK_STOPPED) {
+			rcu_read_unlock();
+			return 0;
+		}
+		task = next_thread(task);
+	} while (task != tg->group_leader);
+	rcu_read_unlock();
+	return 1;
+}
+
+int
+xpmem_unpin_procfs_write(struct file *file, const char __user *buffer,
+			 unsigned long count, void *_tgid)
+{
+	pid_t tgid = (unsigned long)_tgid;
+	struct xpmem_thread_group *tg;
+
+	tg = xpmem_tg_ref_by_tgid(xpmem_my_part, tgid);
+	if (IS_ERR(tg))
+		return -ESRCH;
+
+	if (!xpmem_is_thread_group_stopped(tg)) {
+		xpmem_tg_deref(tg);
+		return -EPERM;
+	}
+
+	xpmem_disallow_blocking_recall_PFNs(tg);
+
+	mutex_lock(&tg->recall_PFNs_mutex);
+	xpmem_recall_PFNs_of_tg(tg, 0, VMALLOC_END);
+	mutex_unlock(&tg->recall_PFNs_mutex);
+
+	xpmem_allow_blocking_recall_PFNs(tg);
+
+	xpmem_tg_deref(tg);
+	return count;
+}
+
+int
+xpmem_unpin_procfs_read(char *page, char **start, off_t off, int count,
+			int *eof, void *_tgid)
+{
+	pid_t tgid = (unsigned long)_tgid;
+	struct xpmem_thread_group *tg;
+	int len = 0;
+
+	tg = xpmem_tg_ref_by_tgid(xpmem_my_part, tgid);
+	if (!IS_ERR(tg)) {
+		len = snprintf(page, count, "pages pinned by XPMEM: %d\n",
+			       atomic_read(&tg->n_pinned));
+		xpmem_tg_deref(tg);
+	}
+
+	return len;
+}
+#endif /* CONFIG_PROC_FS */
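
(Aside, not part of the patch: this file is where XPMEM meets the EMM notifier
from patch 1/9, so a stripped-down sketch of the same pattern may help readers
who skipped the XPMEM specifics. Everything prefixed my_dev_ is hypothetical,
and the header location and callback field name are assumed to match patch
1/9; only the emm_* names are taken from that patch.)

	#include <linux/mm.h>
	#include <linux/sched.h>
	#include <linux/rmap.h>		/* emm_notifier bits, per patch 1/9 */

	struct my_dev {
		struct emm_notifier notifier;
		struct mm_struct *mm;
	};

	/* drop whatever device references exist for [start, end) */
	static void my_dev_drop_references(struct my_dev *dev,
					   unsigned long start, unsigned long end)
	{
	}

	static int my_dev_emm_callback(struct emm_notifier *e, struct mm_struct *mm,
				       enum emm_operation op,
				       unsigned long start, unsigned long end)
	{
		struct my_dev *dev = container_of(e, struct my_dev, notifier);

		switch (op) {
		case emm_release:		/* address space is going away */
			my_dev_drop_references(dev, 0, TASK_SIZE);
			break;
		case emm_invalidate_start:	/* hold off new references ... */
			my_dev_drop_references(dev, start, end);
			break;
		case emm_invalidate_end:	/* ... and allow them again */
			break;
		case emm_referenced:
			break;
		}
		return 0;
	}

	static void my_dev_register(struct my_dev *dev)
	{
		dev->notifier.callback = my_dev_emm_callback;	/* field name assumed */
		dev->mm = current->mm;
		emm_notifier_register(&dev->notifier, dev->mm);
	}
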
Index: emm_notifier_xpmem_v1/drivers/misc/xp/xpmem.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ emm_notifier_xpmem_v1/drivers/misc/xp/xpmem.h	2008-04-01 10:42:33.093769003 -0500
@@ -0,0 +1,130 @@
+/*
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License.  See the file "COPYING" in the main directory of this archive
+ * for more details.
+ *
+ * Copyright (c) 2004-2007 Silicon Graphics, Inc.  All Rights Reserved.
+ */
+
+/*
+ * Cross Partition Memory (XPMEM) structures and macros.
+ */
+
+#ifndef _ASM_IA64_SN_XPMEM_H
+#define _ASM_IA64_SN_XPMEM_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+/*
+ * basic argument type definitions
+ */
+struct xpmem_addr {
+	__s64 apid;		/* apid that represents memory */
+	off_t offset;		/* offset into apid's memory */
+};
+
+#define XPMEM_MAXADDR_SIZE	(size_t)(-1L)
+
+#define XPMEM_ATTACH_WC		0x10000
+#define XPMEM_ATTACH_GETSPACE	0x20000
+
+/*
+ * path to XPMEM device
+ */
+#define XPMEM_DEV_PATH  "/dev/xpmem"
+
+/*
+ * The following are the possible XPMEM related errors.
+ */
+#define XPMEM_ERRNO_NOPROC	2004	/* unknown thread due to fork() */
+
+/*
+ * flags for segment permissions
+ */
+#define XPMEM_RDONLY	0x1
+#define XPMEM_RDWR	0x2
+
+/*
+ * Valid permit_type values for xpmem_make().
+ */
+#define XPMEM_PERMIT_MODE	0x1
+
+/*
+ * ioctl() commands used to interface to the kernel module.
+ */
+#define XPMEM_IOC_MAGIC		'x'
+#define XPMEM_CMD_VERSION	_IO(XPMEM_IOC_MAGIC, 0)
+#define XPMEM_CMD_MAKE		_IO(XPMEM_IOC_MAGIC, 1)
+#define XPMEM_CMD_REMOVE	_IO(XPMEM_IOC_MAGIC, 2)
+#define XPMEM_CMD_GET		_IO(XPMEM_IOC_MAGIC, 3)
+#define XPMEM_CMD_RELEASE	_IO(XPMEM_IOC_MAGIC, 4)
+#define XPMEM_CMD_ATTACH	_IO(XPMEM_IOC_MAGIC, 5)
+#define XPMEM_CMD_DETACH	_IO(XPMEM_IOC_MAGIC, 6)
+#define XPMEM_CMD_COPY		_IO(XPMEM_IOC_MAGIC, 7)
+#define XPMEM_CMD_BCOPY		_IO(XPMEM_IOC_MAGIC, 8)
+#define XPMEM_CMD_FORK_BEGIN	_IO(XPMEM_IOC_MAGIC, 9)
+#define XPMEM_CMD_FORK_END	_IO(XPMEM_IOC_MAGIC, 10)
+
+/*
+ * Structures used with the preceding ioctl() commands to pass data.
+ */
+struct xpmem_cmd_make {
+	__u64 vaddr;
+	size_t size;
+	int permit_type;
+	__u64 permit_value;
+	__s64 segid;		/* returned on success */
+};
+
+struct xpmem_cmd_remove {
+	__s64 segid;
+};
+
+struct xpmem_cmd_get {
+	__s64 segid;
+	int flags;
+	int permit_type;
+	__u64 permit_value;
+	__s64 apid;		/* returned on success */
+};
+
+struct xpmem_cmd_release {
+	__s64 apid;
+};
+
+struct xpmem_cmd_attach {
+	__s64 apid;
+	off_t offset;
+	size_t size;
+	__u64 vaddr;
+	int fd;
+	int flags;
+};
+
+struct xpmem_cmd_detach {
+	__u64 vaddr;
+};
+
+struct xpmem_cmd_copy {
+	__s64 src_apid;
+	off_t src_offset;
+	__s64 dst_apid;
+	off_t dst_offset;
+	size_t size;
+};
+
+#ifndef __KERNEL__
+extern int xpmem_version(void);
+extern __s64 xpmem_make(void *, size_t, int, void *);
+extern int xpmem_remove(__s64);
+extern __s64 xpmem_get(__s64, int, int, void *);
+extern int xpmem_release(__s64);
+extern void *xpmem_attach(struct xpmem_addr, size_t, void *);
+extern void *xpmem_attach_wc(struct xpmem_addr, size_t, void *);
+extern void *xpmem_attach_getspace(struct xpmem_addr, size_t, void *);
+extern int xpmem_detach(void *);
+extern int xpmem_bcopy(struct xpmem_addr, struct xpmem_addr, size_t);
+#endif
+
+#endif /* _ASM_IA64_SN_XPMEM_H */
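
(Aside, not part of the patch: a minimal user-space sketch that drives the
make/get/attach commands above through raw ioctl()s, which the libxpmem
prototypes at the bottom of the header would normally hide. Error handling is
trimmed, both roles run in one process for brevity, and passing vaddr = 0 to
XPMEM_CMD_ATTACH to let the driver pick an address is an assumption.)

	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <sys/mman.h>
	#include <unistd.h>
	#include "xpmem.h"		/* the header above */

	int main(void)
	{
		long page = sysconf(_SC_PAGESIZE);
		int fd = open(XPMEM_DEV_PATH, O_RDWR);
		void *buf = mmap(NULL, page, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		/* exporting side: publish one page, get a segid back */
		struct xpmem_cmd_make make = {
			.vaddr = (__u64)(unsigned long)buf,
			.size = page,
			.permit_type = XPMEM_PERMIT_MODE,
			.permit_value = 0600,
		};
		if (ioctl(fd, XPMEM_CMD_MAKE, &make) != 0)
			return 1;

		/* importing side (normally another process told make.segid) */
		struct xpmem_cmd_get get = {
			.segid = make.segid,
			.flags = XPMEM_RDWR,
			.permit_type = XPMEM_PERMIT_MODE,
			.permit_value = 0600,
		};
		if (ioctl(fd, XPMEM_CMD_GET, &get) != 0)
			return 1;

		struct xpmem_cmd_attach at = {
			.apid = get.apid,
			.offset = 0,
			.size = page,
			.vaddr = 0,	/* assumption: let the driver choose */
			.fd = fd,
		};
		if (ioctl(fd, XPMEM_CMD_ATTACH, &at) != 0)
			return 1;

		/* at.vaddr now aliases the exporter's page in this process */
		return 0;
	}
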
Index: emm_notifier_xpmem_v1/drivers/misc/xp/xpmem_private.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ emm_notifier_xpmem_v1/drivers/misc/xp/xpmem_private.h	2008-04-01 10:42:33.117771963 -0500
@@ -0,0 +1,783 @@
+/*
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License.  See the file "COPYING" in the main directory of this archive
+ * for more details.
+ *
+ * Copyright (c) 2004-2007 Silicon Graphics, Inc.  All Rights Reserved.
+ */
+
+/*
+ * Private Cross Partition Memory (XPMEM) structures and macros.
+ */
+
+#ifndef _ASM_IA64_XPMEM_PRIVATE_H
+#define _ASM_IA64_XPMEM_PRIVATE_H
+
+#include <linux/rmap.h>
+#include <linux/version.h>
+#include <linux/bit_spinlock.h>
+#include <linux/workqueue.h>
+#include <linux/signal.h>
+#include <linux/sched.h>
+#ifdef CONFIG_IA64
+#include <asm/sn/arch.h>
+#else
+#define sn_partition_id			0
+#endif
+
+#ifdef CONFIG_SGI_XP
+#include <asm/sn/xp.h>
+#else
+#define XP_MAX_PARTITIONS		1
+#endif
+
+#ifndef DBUG_ON
+#define DBUG_ON(condition)
+#endif
+/*
+ * XPMEM_CURRENT_VERSION is used to identify functional differences
+ * between various releases of XPMEM to users. XPMEM_CURRENT_VERSION_STRING
+ * is printed when the kernel module is loaded and unloaded.
+ *
+ *   version  differences
+ *
+ *     1.0    initial implementation of XPMEM
+ *     1.1    fetchop (AMO) pages supported
+ *     1.2    GET space and write combining attaches supported
+ *     1.3    Convert to build for both 2.4 and 2.6 versions of kernel
+ *     1.4    add recall PFNs RPC
+ *     1.5    first round of resiliency improvements
+ *     1.6    make coherence domain union of sharing partitions
+ *     2.0    replace 32-bit xpmem_handle_t by 64-bit segid (no typedef)
+ *            replace 32-bit xpmem_id_t by 64-bit apid (no typedef)
+ *
+ *
+ * This int constant has the following format:
+ *
+ *      +----+------------+----------------+
+ *      |////|   major    |     minor      |
+ *      +----+------------+----------------+
+ *
+ *       major - major revision number (12-bits)
+ *       minor - minor revision number (16-bits)
+ */
+#define XPMEM_CURRENT_VERSION		0x00020000
+#define XPMEM_CURRENT_VERSION_STRING	"2.0"
+
+#define XPMEM_MODULE_NAME "xpmem"
+
+#ifndef L1_CACHE_MASK
+#define L1_CACHE_MASK			(L1_CACHE_BYTES - 1)
+#endif /* L1_CACHE_MASK */
+
+/*
+ * Given an address space and a virtual address return a pointer to its
+ * pte if one is present.
+ */
+static inline pte_t *
+xpmem_vaddr_to_pte(struct mm_struct *mm, u64 vaddr)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte_p;
+
+	pgd = pgd_offset(mm, vaddr);
+	if (!pgd_present(*pgd))
+		return NULL;
+
+	pud = pud_offset(pgd, vaddr);
+	if (!pud_present(*pud))
+		return NULL;
+
+	pmd = pmd_offset(pud, vaddr);
+	if (!pmd_present(*pmd))
+		return NULL;
+
+	pte_p = pte_offset_map(pmd, vaddr);
+	if (!pte_present(*pte_p))
+		return NULL;
+
+	return pte_p;
+}
+
+/*
+ * A 64-bit PFNtable entry contains the following fields:
+ *
+ *                                ,-- XPMEM_PFN_WIDTH (currently 38 bits)
+ *                                |
+ *                    ,-----------'----------------,
+ *      +-+-+-+-+-----+----------------------------+
+ *      |a|u|i|p|/////|            pfn             |
+ *      +-+-+-+-+-----+----------------------------+
+ *      `-^-'-'-'
+ *       | | | |
+ *       | | | |
+ *       | | | |
+ *       | | | `-- unpin page bit
+ *       | | `-- I/O bit
+ *       | `-- uncached bit
+ *       `-- cross-partition access bit
+ *
+ *       a   - all access allowed (i/o and cpu)
+ *       u   - page is an uncached page
+ *       i   - page is an I/O page which wasn't pinned by XPMEM
+ *       p   - page was pinned by XPMEM and now needs to be unpinned
+ *       pfn - actual PFN value
+ */
+
+#define XPMEM_PFN_WIDTH			38
+
+#define XPMEM_PFN_UNPIN			((u64)1 << 60)
+#define XPMEM_PFN_IO			((u64)1 << 61)
+#define XPMEM_PFN_UNCACHED		((u64)1 << 62)
+#define XPMEM_PFN_MEMPROT_DOWN		((u64)1 << 63)
+#define XPMEM_PFN_DROP_MEMPROT(p, f)	((f) && \
+					       !(*(p) & XPMEM_PFN_MEMPROT_DOWN))
+
+#define XPMEM_PFN(p)			(*(p) & (((u64)1 << \
+						 XPMEM_PFN_WIDTH) - 1))
+#define XPMEM_PFN_TO_PADDR(p)		((u64)XPMEM_PFN(p) << PAGE_SHIFT)
+
+#define XPMEM_PFN_IS_UNKNOWN(p)		(*(p) == 0)
+#define XPMEM_PFN_IS_KNOWN(p)		(XPMEM_PFN(p) > 0)
+
+/*
+ * general internal driver structures
+ */
+
+struct xpmem_thread_group {
+	spinlock_t lock;	/* tg lock */
+	short partid;		/* partid tg resides on */
+	pid_t tgid;		/* tg's tgid */
+	uid_t uid;		/* tg's uid */
+	gid_t gid;		/* tg's gid */
+	int flags;		/* tg attributes and state */
+	atomic_t uniq_segid;
+	atomic_t uniq_apid;
+	rwlock_t seg_list_lock;
+	struct list_head seg_list;	/* tg's list of segs */
+	struct xpmem_hashlist *ap_hashtable;	/* locks + ap hash lists */
+	atomic_t refcnt;	/* references to tg */
+	atomic_t n_pinned;	/* #of pages pinned by this tg */
+	u64 addr_limit;		/* highest possible user addr */
+	struct list_head tg_hashlist;	/* tg hash list */
+	struct task_struct *group_leader;	/* thread group leader */
+	struct mm_struct *mm;	/* tg's mm */
+	atomic_t n_recall_PFNs;	/* #of recall of PFNs in progress */
+	struct mutex recall_PFNs_mutex;	/* lock for serializing recall of PFNs*/
+	wait_queue_head_t block_recall_PFNs_wq;	/*wait to block recall of PFNs*/
+	wait_queue_head_t allow_recall_PFNs_wq;	/*wait to allow recall of PFNs*/
+	struct emm_notifier emm_notifier;	/* >>> */
+	spinlock_t page_requests_lock;
+	struct list_head page_requests;		/* get_user_pages while unblocked */
+};
+
+struct xpmem_segment {
+	spinlock_t lock;	/* seg lock */
+	struct rw_semaphore sema;	/* seg sema */
+	__s64 segid;		/* unique segid */
+	u64 vaddr;		/* starting address */
+	size_t size;		/* size of seg */
+	int permit_type;	/* permission scheme */
+	void *permit_value;	/* permission data */
+	int flags;		/* seg attributes and state */
+	atomic_t refcnt;	/* references to seg */
+	wait_queue_head_t created_wq;	/* wait for seg to be created */
+	wait_queue_head_t destroyed_wq;	/* wait for seg to be destroyed */
+	struct xpmem_thread_group *tg;	/* creator tg */
+	struct list_head ap_list;	/* local access permits of seg */
+	struct list_head seg_list;	/* tg's list of segs */
+	int coherence_id;	/* where the seg resides */
+	u64 recall_vaddr;	/* vaddr being recalled if _RECALLINGPFNS set */
+	size_t recall_size;	/* size being recalled if _RECALLINGPFNS set */
+	struct mutex PFNtable_mutex;	/* serialization lock for PFN table */
+	u64 ****PFNtable;	/* PFN table */
+};
+
+struct xpmem_access_permit {
+	spinlock_t lock;	/* access permit lock */
+	__s64 apid;		/* unique apid */
+	int mode;		/* read/write mode */
+	int flags;		/* access permit attributes and state */
+	atomic_t refcnt;	/* references to access permit */
+	struct xpmem_segment *seg;	/* seg permitted to be accessed */
+	struct xpmem_thread_group *tg;	/* access permit's tg */
+	struct list_head att_list;	/* atts of this access permit's seg */
+	struct list_head ap_list;	/* access permits linked to seg */
+	struct list_head ap_hashlist;	/* access permit hash list */
+};
+
+struct xpmem_attachment {
+	struct mutex mutex;	/* att lock for serialization */
+	u64 offset;		/* starting offset within seg */
+	u64 at_vaddr;		/* address where seg is attached */
+	size_t at_size;		/* size of seg attachment */
+	int flags;		/* att attributes and state */
+	atomic_t refcnt;	/* references to att */
+	struct xpmem_access_permit *ap;/* associated access permit */
+	struct list_head att_list;	/* atts linked to access permit */
+	struct mm_struct *mm;	/* mm struct attached to */
+	wait_queue_head_t destroyed_wq;	/* wait for att to be destroyed */
+};
+
+struct xpmem_partition {
+	spinlock_t lock;	/* part lock */
+	int flags;		/* part attributes and state */
+	int n_proxies;		/* #of segs [im|ex]ported */
+	struct xpmem_hashlist *tg_hashtable;	/* locks + tg hash lists */
+	int version;		/* version of XPMEM running */
+	int coherence_id;	/* coherence id for partition */
+	atomic_t n_threads;	/* # of threads active */
+	wait_queue_head_t thread_wq;	/* notified when threads done */
+};
+
+/*
+ * Both the segid and apid are of type __s64 and designed to be opaque to
+ * the user. Both consist of the same underlying fields.
+ *
+ * The 'partid' field identifies the partition on which the thread group
+ * identified by 'tgid' field resides. The 'uniq' field is designed to give
+ * each segid or apid a unique value. Each type is only unique with respect
+ * to itself.
+ *
+ * An ID is never less than or equal to zero.
+ */
+struct xpmem_id {
+	pid_t tgid;		/* thread group that owns ID */
+	unsigned short uniq;	/* this value makes the ID unique */
+	signed short partid;	/* partition where tgid resides */
+};
+
+#define XPMEM_MAX_UNIQ_ID	((1 << (sizeof(short) * 8)) - 1)
+
+static inline signed short
+xpmem_segid_to_partid(__s64 segid)
+{
+	DBUG_ON(segid <= 0);
+	return ((struct xpmem_id *)&segid)->partid;
+}
+
+static inline pid_t
+xpmem_segid_to_tgid(__s64 segid)
+{
+	DBUG_ON(segid <= 0);
+	return ((struct xpmem_id *)&segid)->tgid;
+}
+
+static inline signed short
+xpmem_apid_to_partid(__s64 apid)
+{
+	DBUG_ON(apid <= 0);
+	return ((struct xpmem_id *)&apid)->partid;
+}
+
+static inline pid_t
+xpmem_apid_to_tgid(__s64 apid)
+{
+	DBUG_ON(apid <= 0);
+	return ((struct xpmem_id *)&apid)->tgid;
+}
+
+/*
+ * Attribute and state flags for various xpmem structures. Some values
+ * are defined in xpmem.h, so we reserved space here via XPMEM_DONT_USE_X
+ * to prevent overlap.
+ */
+#define XPMEM_FLAG_UNINITIALIZED	0x00001	/* state is uninitialized */
+#define XPMEM_FLAG_UP			0x00002	/* state is up */
+#define XPMEM_FLAG_DOWN			0x00004	/* state is down */
+
+#define XPMEM_FLAG_CREATING		0x00020	/* being created */
+#define XPMEM_FLAG_DESTROYING		0x00040	/* being destroyed */
+#define XPMEM_FLAG_DESTROYED		0x00080	/* 'being destroyed' finished */
+
+#define XPMEM_FLAG_PROXY		0x00100	/* is a proxy */
+#define XPMEM_FLAG_VALIDPTES		0x00200	/* valid PTEs exist */
+#define XPMEM_FLAG_RECALLINGPFNS	0x00400	/* recalling PFNs */
+
+#define XPMEM_FLAG_GOINGDOWN		0x00800	/* state is changing to down */
+
+#define	XPMEM_DONT_USE_1		0x10000	/* see XPMEM_ATTACH_WC */
+#define	XPMEM_DONT_USE_2		0x20000	/* see XPMEM_ATTACH_GETSPACE */
+#define	XPMEM_DONT_USE_3		0x40000	/* reserved for xpmem.h */
+#define	XPMEM_DONT_USE_4		0x80000	/* reserved for xpmem.h */
+
+/*
+ * The PFN table is a four-level table that can map all of a thread group's
+ * memory. This table is equivalent to the general Linux four-level page
+ * table described in the pgtable.h file. The sizes of each level are the same,
+ * but the type is different (here the type is a u64).
+ */
+
+/* Size of the XPMEM PFN four-level table */
+#define XPMEM_PFNTABLE_L4SIZE		PTRS_PER_PGD	/* #of L3 pointers */
+#define XPMEM_PFNTABLE_L3SIZE		PTRS_PER_PUD	/* #of L2 pointers */
+#define XPMEM_PFNTABLE_L2SIZE		PTRS_PER_PMD	/* #of L1 pointers */
+#define XPMEM_PFNTABLE_L1SIZE		PTRS_PER_PTE	/* #of PFN entries */
+
+/* Return an index into the specified level given a virtual address */
+#define XPMEM_PFNTABLE_L4INDEX(v)   pgd_index(v)
+#define XPMEM_PFNTABLE_L3INDEX(v)   ((v >> PUD_SHIFT) & (PTRS_PER_PUD - 1))
+#define XPMEM_PFNTABLE_L2INDEX(v)   ((v >> PMD_SHIFT) & (PTRS_PER_PMD - 1))
+#define XPMEM_PFNTABLE_L1INDEX(v)   ((v >> PAGE_SHIFT) & (PTRS_PER_PTE - 1))
+
+/* The following assumes all levels have been allocated for the given vaddr */
+static inline u64 *
+xpmem_vaddr_to_PFN(struct xpmem_segment *seg, u64 vaddr)
+{
+	u64 ****l4table;
+	u64 ***l3table;
+	u64 **l2table;
+	u64 *l1table;
+
+	l4table = seg->PFNtable;
+	DBUG_ON(l4table == NULL);
+	l3table = l4table[XPMEM_PFNTABLE_L4INDEX(vaddr)];
+	DBUG_ON(l3table == NULL);
+	l2table = l3table[XPMEM_PFNTABLE_L3INDEX(vaddr)];
+	DBUG_ON(l2table == NULL);
+	l1table = l2table[XPMEM_PFNTABLE_L2INDEX(vaddr)];
+	DBUG_ON(l1table == NULL);
+	return &l1table[XPMEM_PFNTABLE_L1INDEX(vaddr)];
+}
+
+/* the following will allocate missing levels for the given vaddr */
+
+static inline void *
+xpmem_alloc_PFNtable_entry(size_t size)
+{
+	void *entry;
+
+	entry = kzalloc(size, GFP_KERNEL);
+	wmb();	/* ensure that others will see the allocated space as zeroed */
+	return entry;
+}
+
+static inline int
+xpmem_vaddr_to_PFN_alloc(struct xpmem_segment *seg, u64 vaddr, u64 **pfn,
+			 int locked)
+{
+	u64 ****l4entry;
+	u64 ***l3entry;
+	u64 **l2entry;
+
+	DBUG_ON(seg->PFNtable == NULL);
+
+	l4entry = seg->PFNtable + XPMEM_PFNTABLE_L4INDEX(vaddr);
+	if (*l4entry == NULL) {
+		if (!locked)
+			mutex_lock(&seg->PFNtable_mutex);
+
+		if (locked || *l4entry == NULL)
+			*l4entry =
+			    xpmem_alloc_PFNtable_entry(XPMEM_PFNTABLE_L3SIZE *
+						       sizeof(u64 *));
+		if (!locked)
+			mutex_unlock(&seg->PFNtable_mutex);
+
+		if (*l4entry == NULL)
+			return -ENOMEM;
+	}
+	l3entry = *l4entry + XPMEM_PFNTABLE_L3INDEX(vaddr);
+	if (*l3entry == NULL) {
+		if (!locked)
+			mutex_lock(&seg->PFNtable_mutex);
+
+		if (locked || *l3entry == NULL)
+			*l3entry =
+			    xpmem_alloc_PFNtable_entry(XPMEM_PFNTABLE_L2SIZE *
+						       sizeof(u64 *));
+		if (!locked)
+			mutex_unlock(&seg->PFNtable_mutex);
+
+		if (*l3entry == NULL)
+			return -ENOMEM;
+	}
+	l2entry = *l3entry + XPMEM_PFNTABLE_L2INDEX(vaddr);
+	if (*l2entry == NULL) {
+		if (!locked)
+			mutex_lock(&seg->PFNtable_mutex);
+
+		if (locked || *l2entry == NULL)
+			*l2entry =
+			    xpmem_alloc_PFNtable_entry(XPMEM_PFNTABLE_L1SIZE *
+						       sizeof(u64));
+		if (!locked)
+			mutex_unlock(&seg->PFNtable_mutex);
+
+		if (*l2entry == NULL)
+			return -ENOMEM;
+	}
+	*pfn = *l2entry + XPMEM_PFNTABLE_L1INDEX(vaddr);
+
+	return 0;
+}
+
+/* node based PFN work list used when PFN tables are being cleared */
+
+struct xpmem_PFNlist {
+	struct delayed_work dwork;	/* for scheduling purposes */
+	atomic_t *n_pinned;	/* &tg->n_pinned */
+	struct xpmem_node_PFNlists *PFNlists;	/* PFNlists this belongs to */
+	int n_PFNs;		/* #of PFNs in array of PFNs */
+	u64 PFNs[0];		/* an array of PFNs */
+};
+
+struct xpmem_node_PFNlist {
+	int nid;		/* node to schedule work on */
+	int cpu;		/* last cpu work was scheduled on */
+	struct xpmem_PFNlist *PFNlist;	/* node based list to process */
+};
+
+struct xpmem_node_PFNlists {
+	atomic_t n_PFNlists_processing;
+	wait_queue_head_t PFNlists_processing_wq;
+
+	int n_PFNlists_created ____cacheline_aligned;
+	int n_PFNlists_scheduled;
+	struct xpmem_node_PFNlist *PFNlists;
+};
+
+#define XPMEM_NODE_UNINITIALIZED	-1
+#define XPMEM_CPUS_UNINITIALIZED	-1
+#define XPMEM_NODE_OFFLINE		-2
+#define XPMEM_CPUS_OFFLINE		-2
+
+/*
+ * Calculate the #of PFNs that can have their cache lines recalled within
+ * one timer tick. The hardcoded '4273504' represents the #of cache lines that
+ * can be recalled per second, which is based on a measured 30usec per page.
+ * The rest of it is just units conversion to pages per tick which allows
+ * for HZ and page size to change.
+ *
+ * (cachelines_per_sec / ticks_per_sec * bytes_per_cacheline / bytes_per_page)
+ */
+#define XPMEM_MAXNPFNs_PER_LIST		(4273504 / HZ * 128 / PAGE_SIZE)
+
+/*
+ * The following are active requests in get_user_pages.  If the address range
+ * is invalidated while these requests are pending, we have to assume the
+ * returned pages are not the correct ones.
+ */
+struct xpmem_page_request {
+	struct list_head page_requests;
+	u64 vaddr;
+	size_t size;
+	int valid;
+	wait_queue_head_t wq;
+};
+
+
+/*
+ * Functions registered by such things as add_timer() or called by functions
+ * like kernel_thread() only allow for a single 64-bit argument. The following
+ * inlines can be used to pack and unpack two (32-bit, 16-bit or 8-bit)
+ * arguments into or out from the passed argument.
+ */
+static inline u64
+xpmem_pack_arg1(u64 args, u32 arg1)
+{
+	return ((args & (((1UL << 32) - 1) << 32)) | arg1);
+}
+
+static inline u64
+xpmem_pack_arg2(u64 args, u32 arg2)
+{
+	return ((args & ((1UL << 32) - 1)) | ((u64)arg2 << 32));
+}
+
+static inline u32
+xpmem_unpack_arg1(u64 args)
+{
+	return (u32)(args & ((1UL << 32) - 1));
+}
+
+static inline u32
+xpmem_unpack_arg2(u64 args)
+{
+	return (u32)(args >> 32);
+}
+
+/* found in xpmem_main.c */
+extern struct device *xpmem;
+extern struct xpmem_thread_group *xpmem_open_proxy_tg_with_ref(__s64);
+extern void xpmem_flush_proxy_tg_with_nosegs(struct xpmem_thread_group *);
+extern int xpmem_send_version(short);
+
+/* found in xpmem_make.c */
+extern int xpmem_make(u64, size_t, int, void *, __s64 *);
+extern void xpmem_remove_segs_of_tg(struct xpmem_thread_group *);
+extern int xpmem_remove(__s64);
+
+/* found in xpmem_get.c */
+extern int xpmem_get(__s64, int, int, void *, __s64 *);
+extern void xpmem_release_aps_of_tg(struct xpmem_thread_group *);
+extern int xpmem_release(__s64);
+
+/* found in xpmem_attach.c */
+extern struct vm_operations_struct xpmem_vm_ops_fault;
+extern struct vm_operations_struct xpmem_vm_ops_nopfn;
+extern int xpmem_attach(struct file *, __s64, off_t, size_t, u64, int, int,
+			u64 *);
+extern void xpmem_clear_PTEs(struct xpmem_segment *, u64, size_t);
+extern int xpmem_detach(u64);
+extern void xpmem_detach_att(struct xpmem_access_permit *,
+			     struct xpmem_attachment *);
+extern int xpmem_mmap(struct file *, struct vm_area_struct *);
+
+/* found in xpmem_pfn.c */
+extern int xpmem_emm_notifier_callback(struct emm_notifier *, struct mm_struct *,
+		enum emm_operation, unsigned long, unsigned long);
+extern int xpmem_ensure_valid_PFNs(struct xpmem_segment *, u64, size_t, int,
+				   int, unsigned long, int, int *);
+extern void xpmem_clear_PFNtable(struct xpmem_segment *, u64, size_t, int, int);
+extern int xpmem_block_recall_PFNs(struct xpmem_thread_group *, int);
+extern void xpmem_unblock_recall_PFNs(struct xpmem_thread_group *);
+extern int xpmem_fork_begin(void);
+extern int xpmem_fork_end(void);
+#ifdef CONFIG_PROC_FS
+#define XPMEM_TGID_STRING_LEN	11
+extern spinlock_t xpmem_unpin_procfs_lock;
+extern struct proc_dir_entry *xpmem_unpin_procfs_dir;
+extern int xpmem_unpin_procfs_write(struct file *, const char __user *,
+				    unsigned long, void *);
+extern int xpmem_unpin_procfs_read(char *, char **, off_t, int, int *, void *);
+#endif /* CONFIG_PROC_FS */
+
+/* found in xpmem_partition.c */
+extern struct xpmem_partition *xpmem_partitions;
+extern struct xpmem_partition *xpmem_my_part;
+extern short xpmem_my_partid;
+/* found in xpmem_misc.c */
+extern struct xpmem_thread_group *xpmem_tg_ref_by_tgid(struct xpmem_partition *,
+						       pid_t);
+extern struct xpmem_thread_group *xpmem_tg_ref_by_segid(__s64);
+extern struct xpmem_thread_group *xpmem_tg_ref_by_apid(__s64);
+extern void xpmem_tg_deref(struct xpmem_thread_group *);
+extern struct xpmem_segment *xpmem_seg_ref_by_segid(struct xpmem_thread_group *,
+						    __s64);
+extern void xpmem_seg_deref(struct xpmem_segment *);
+extern struct xpmem_access_permit *xpmem_ap_ref_by_apid(struct
+							xpmem_thread_group *,
+							__s64);
+extern void xpmem_ap_deref(struct xpmem_access_permit *);
+extern void xpmem_att_deref(struct xpmem_attachment *);
+extern int xpmem_seg_down_read(struct xpmem_thread_group *,
+			       struct xpmem_segment *, int, int);
+extern u64 xpmem_get_seg_vaddr(struct xpmem_access_permit *, off_t, size_t,
+			       int);
+extern void xpmem_block_nonfatal_signals(sigset_t *);
+extern void xpmem_unblock_nonfatal_signals(sigset_t *);
+
+/*
+ * Inlines that mark an internal driver structure as being destroyable or not.
+ * The idea is to set the refcnt to 1 at structure creation time and then
+ * drop that reference at the time the structure is to be destroyed.
+ */
+static inline void
+xpmem_tg_not_destroyable(struct xpmem_thread_group *tg)
+{
+	atomic_set(&tg->refcnt, 1);
+}
+
+static inline void
+xpmem_tg_destroyable(struct xpmem_thread_group *tg)
+{
+	xpmem_tg_deref(tg);
+}
+
+static inline void
+xpmem_seg_not_destroyable(struct xpmem_segment *seg)
+{
+	atomic_set(&seg->refcnt, 1);
+}
+
+static inline void
+xpmem_seg_destroyable(struct xpmem_segment *seg)
+{
+	xpmem_seg_deref(seg);
+}
+
+static inline void
+xpmem_ap_not_destroyable(struct xpmem_access_permit *ap)
+{
+	atomic_set(&ap->refcnt, 1);
+}
+
+static inline void
+xpmem_ap_destroyable(struct xpmem_access_permit *ap)
+{
+	xpmem_ap_deref(ap);
+}
+
+static inline void
+xpmem_att_not_destroyable(struct xpmem_attachment *att)
+{
+	atomic_set(&att->refcnt, 1);
+}
+
+static inline void
+xpmem_att_destroyable(struct xpmem_attachment *att)
+{
+	xpmem_att_deref(att);
+}
+
+static inline void
+xpmem_att_set_destroying(struct xpmem_attachment *att)
+{
+	att->flags |= XPMEM_FLAG_DESTROYING;
+}
+
+static inline void
+xpmem_att_clear_destroying(struct xpmem_attachment *att)
+{
+	att->flags &= ~XPMEM_FLAG_DESTROYING;
+	wake_up(&att->destroyed_wq);
+}
+
+static inline void
+xpmem_att_set_destroyed(struct xpmem_attachment *att)
+{
+	att->flags |= XPMEM_FLAG_DESTROYED;
+	wake_up(&att->destroyed_wq);
+}
+
+static inline void
+xpmem_att_wait_destroyed(struct xpmem_attachment *att)
+{
+	wait_event(att->destroyed_wq, (!(att->flags & XPMEM_FLAG_DESTROYING) ||
+					(att->flags & XPMEM_FLAG_DESTROYED)));
+}
+
+
+/*
+ * Inlines that increment the refcnt for the specified structure.
+ */
+static inline void
+xpmem_tg_ref(struct xpmem_thread_group *tg)
+{
+	DBUG_ON(atomic_read(&tg->refcnt) <= 0);
+	atomic_inc(&tg->refcnt);
+}
+
+static inline void
+xpmem_seg_ref(struct xpmem_segment *seg)
+{
+	DBUG_ON(atomic_read(&seg->refcnt) <= 0);
+	atomic_inc(&seg->refcnt);
+}
+
+static inline void
+xpmem_ap_ref(struct xpmem_access_permit *ap)
+{
+	DBUG_ON(atomic_read(&ap->refcnt) <= 0);
+	atomic_inc(&ap->refcnt);
+}
+
+static inline void
+xpmem_att_ref(struct xpmem_attachment *att)
+{
+	DBUG_ON(atomic_read(&att->refcnt) <= 0);
+	atomic_inc(&att->refcnt);
+}
+
+/*
+ * A simple test to determine whether the specified vma corresponds to a
+ * XPMEM attachment.
+ */
+static inline int
+xpmem_is_vm_ops_set(struct vm_area_struct *vma)
+{
+	return ((vma->vm_flags & VM_PFNMAP) ?
+		(vma->vm_ops == &xpmem_vm_ops_nopfn) :
+		(vma->vm_ops == &xpmem_vm_ops_fault));
+}
+
+
+/* xpmem_seg_down_read() can be found in arch/ia64/sn/kernel/xpmem_misc.c */
+
+static inline void
+xpmem_seg_up_read(struct xpmem_thread_group *seg_tg,
+		  struct xpmem_segment *seg, int unblock_recall_PFNs)
+{
+	up_read(&seg->sema);
+	if (unblock_recall_PFNs)
+		xpmem_unblock_recall_PFNs(seg_tg);
+}
+
+static inline void
+xpmem_seg_down_write(struct xpmem_segment *seg)
+{
+	down_write(&seg->sema);
+}
+
+static inline void
+xpmem_seg_up_write(struct xpmem_segment *seg)
+{
+	up_write(&seg->sema);
+	wake_up(&seg->destroyed_wq);
+}
+
+static inline void
+xpmem_wait_for_seg_destroyed(struct xpmem_segment *seg)
+{
+	wait_event(seg->destroyed_wq, ((seg->flags & XPMEM_FLAG_DESTROYED) ||
+				       !(seg->flags & (XPMEM_FLAG_DESTROYING |
+						   XPMEM_FLAG_RECALLINGPFNS))));
+}
+
+/*
+ * Hash Tables
+ *
+ * XPMEM utilizes hash tables to enable faster lookups of list entries.
+ * These hash tables are implemented as arrays. A simple modulus of the hash
+ * key yields the appropriate array index. A hash table's array element (i.e.,
+ * hash table bucket) consists of a hash list and the lock that protects it.
+ *
+ * XPMEM has the following two hash tables:
+ *
+ * table		bucket					key
+ * part->tg_hashtable	list of struct xpmem_thread_group	tgid
+ * tg->ap_hashtable	list of struct xpmem_access_permit	apid.uniq
+ *
+ * (The 'part' pointer is defined as: &xpmem_partitions[tg->partid])
+ */
+
+struct xpmem_hashlist {
+	rwlock_t lock;		/* lock for hash list */
+	struct list_head list;	/* hash list */
+} ____cacheline_aligned;
+
+#define XPMEM_TG_HASHTABLE_SIZE	512
+#define XPMEM_AP_HASHTABLE_SIZE	8
+
+static inline int
+xpmem_tg_hashtable_index(pid_t tgid)
+{
+	return (tgid % XPMEM_TG_HASHTABLE_SIZE);
+}
+
+static inline int
+xpmem_ap_hashtable_index(__s64 apid)
+{
+	DBUG_ON(apid <= 0);
+	return (((struct xpmem_id *)&apid)->uniq % XPMEM_AP_HASHTABLE_SIZE);
+}
+
+/*
+ * >>>
+ */
+static inline size_t
+xpmem_get_overlapping_range(u64 base_vaddr, size_t base_size, u64 *vaddr_p,
+			    size_t *size_p)
+{
+	u64 start = max(*vaddr_p, base_vaddr);
+	u64 end = min(*vaddr_p + *size_p, base_vaddr + base_size);
+
+	*vaddr_p = start;
+	*size_p	= max((ssize_t)0, (ssize_t)(end - start));
+	return *size_p;
+}
+
+#endif /* _ASM_IA64_XPMEM_PRIVATE_H */
Index: emm_notifier_xpmem_v1/drivers/misc/Makefile
===================================================================
--- emm_notifier_xpmem_v1.orig/drivers/misc/Makefile	2008-04-01 10:12:01.278062055 -0500
+++ emm_notifier_xpmem_v1/drivers/misc/Makefile	2008-04-01 10:13:22.304137897 -0500
@@ -22,3 +22,4 @@ obj-$(CONFIG_FUJITSU_LAPTOP)	+= fujitsu-
 obj-$(CONFIG_EEPROM_93CX6)	+= eeprom_93cx6.o
 obj-$(CONFIG_INTEL_MENLOW)	+= intel_menlow.o
 obj-$(CONFIG_ENCLOSURE_SERVICES) += enclosure.o
+obj-y				+= xp/

-- 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [patch 9/9] XPMEM: Simple example
  2008-04-01 20:55 [patch 0/9] [RFC] EMM Notifier V2 Christoph Lameter
                   ` (7 preceding siblings ...)
  2008-04-01 20:55 ` [patch 8/9] XPMEM: The device driver Christoph Lameter
@ 2008-04-01 20:55 ` Christoph Lameter
  8 siblings, 0 replies; 51+ messages in thread
From: Christoph Lameter @ 2008-04-01 20:55 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman

[-- Attachment #1: xpmem_test --]
[-- Type: text/plain, Size: 8256 bytes --]

A simple test program (well actually a pair).  They are fairly easy to use.

NOTE: the xpmem.h is copied from the kernel/drivers/misc/xp/xpmem.h
file.

Type make.  Then from one session, type ./A1.  Grab the first
line of output which should begin with ./A2 and paste the whole line
into a second session.  Paste as many times as you like.  Each pass will
increment the value one additional time.  When you are tired, hit enter
in the first window.  You should see the same value printed from A1 as
you most recently received from A2.
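
For illustration, a run might look roughly like this (the four numbers
printed after ./A2 encode the segid; they are generated at runtime, so
the values shown here are made up):

	session 1:
	$ make
	$ ./A1
	./A2 1 2 3 4
	data_block[0] = 1
	Waiting for input before exiting.

	session 2:
	$ ./A2 1 2 3 4
	Just incremented the value to 2

	session 1, after hitting enter:
	data_block[0] = 2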

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 xpmem_test/A1.c     |   64 +++++++++++++++++++++++++
 xpmem_test/A2.c     |   70 ++++++++++++++++++++++++++++
 xpmem_test/Makefile |   14 +++++
 xpmem_test/xpmem.h  |  130 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 278 insertions(+)

Index: linux-2.6/xpmem_test/A1.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/xpmem_test/A1.c	2008-04-01 13:36:06.982428295 -0700
@@ -0,0 +1,64 @@
+/*
+ *  Simple test program.  Makes a segment then waits for an input line
+ * and finally prints the value of the first integer of that segment.
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stropts.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include "xpmem.h"
+
+int xpmem_fd;
+
+int
+main(int argc, char **argv)
+{
+	char input[32];
+	struct xpmem_cmd_make make_info;
+	int *data_block;
+	int ret;
+	__s64 segid;
+
+	xpmem_fd = open("/dev/xpmem", O_RDWR);
+	if (xpmem_fd == -1) {
+		perror("Opening /dev/xpmem");
+		return -1;
+	}
+
+	data_block = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
+			  MAP_SHARED | MAP_ANONYMOUS, 0, 0);
+	if (data_block == MAP_FAILED) {
+		perror("Creating mapping.");
+		return -1;
+	}
+	data_block[0] = 1;
+
+	make_info.vaddr = (__u64) data_block;
+	make_info.size = getpagesize();
+	make_info.permit_type = XPMEM_PERMIT_MODE;
+	make_info.permit_value = (__u64) 0600;
+	ret = ioctl(xpmem_fd, XPMEM_CMD_MAKE, &make_info);
+	if (ret != 0) {
+		perror("xpmem_make");
+		return -1;
+	}
+
+	segid = make_info.segid;
+	printf("./A2 %d %d %d %d\ndata_block[0] = %d\n",
+	       (int)(segid >> 48 & 0xffff), (int)(segid >> 32 & 0xffff),
+	       (int)(segid >> 16 & 0xffff), (int)(segid & 0xffff),
+	       data_block[0]);
+	printf("Waiting for input before exiting.\n");
+	fscanf(stdin, "%s", input);
+
+	printf("data_block[0] = %d\n", data_block[0]);
+
+	return 0;
+}
Index: linux-2.6/xpmem_test/A2.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/xpmem_test/A2.c	2008-04-01 13:36:09.498469523 -0700
@@ -0,0 +1,70 @@
+/*
+ * Simple test program that gets then attaches an xpmem segment identified
+ * on the command line then increments the first integer of that buffer by
+ * one and exits.
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stropts.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include "xpmem.h"
+
+int xpmem_fd;
+
+int
+main(int argc, char **argv)
+{
+	int ret;
+	__s64 segid;
+	__s64 apid;
+	struct xpmem_cmd_get get_info;
+	struct xpmem_cmd_attach attach_info;
+	int *attached_buffer;
+
+	xpmem_fd = open("/dev/xpmem", O_RDWR);
+	if (xpmem_fd == -1) {
+		perror("Opening /dev/xpmem");
+		return -1;
+	}
+
+	segid = (__s64) atoi(argv[1]) << 48;
+	segid |= (__s64) atoi(argv[2]) << 32;
+	segid |= (__s64) atoi(argv[3]) << 16;
+	segid |= (__s64) atoi(argv[4]);
+	get_info.segid = segid;
+	get_info.flags = XPMEM_RDWR;
+	get_info.permit_type = XPMEM_PERMIT_MODE;
+	get_info.permit_value = (__u64) NULL;
+	ret = ioctl(xpmem_fd, XPMEM_CMD_GET, &get_info);
+	if (ret != 0) {
+		perror("xpmem_get");
+		return -1;
+	}
+	apid = get_info.apid;
+
+	attach_info.apid = get_info.apid;
+	attach_info.offset = 0;
+	attach_info.size = getpagesize();
+	attach_info.vaddr = (__u64) NULL;
+	attach_info.fd = xpmem_fd;
+	attach_info.flags = 0;
+
+	ret = ioctl(xpmem_fd, XPMEM_CMD_ATTACH, &attach_info);
+	if (ret != 0) {
+		perror("xpmem_attach");
+		return -1;
+	}
+
+	attached_buffer = (int *)attach_info.vaddr;
+	attached_buffer[0]++;
+
+	printf("Just incremented the value to %d\n", attached_buffer[0]);
+	return 0;
+}
Index: linux-2.6/xpmem_test/Makefile
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/xpmem_test/Makefile	2008-04-01 13:36:19.218628862 -0700
@@ -0,0 +1,14 @@
+
+default:	A1 A2
+
+A1:	A1.c xpmem.h
+	gcc -Wall -o A1 A1.c
+
+A2:	A2.c xpmem.h
+	gcc -Wall -o A2 A2.c
+
+indent:
+	indent -npro -kr -i8 -ts8 -sob -l80 -ss -ncs -cp1 -psl -npcs A1.c A2.c
+
+clean:
+	rm -f A1 A2 *~
Index: linux-2.6/xpmem_test/xpmem.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/xpmem_test/xpmem.h	2008-04-01 13:36:24.418714133 -0700
@@ -0,0 +1,130 @@
+/*
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License.  See the file "COPYING" in the main directory of this archive
+ * for more details.
+ *
+ * Copyright (c) 2004-2007 Silicon Graphics, Inc.  All Rights Reserved.
+ */
+
+/*
+ * Cross Partition Memory (XPMEM) structures and macros.
+ */
+
+#ifndef _ASM_IA64_SN_XPMEM_H
+#define _ASM_IA64_SN_XPMEM_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+/*
+ * basic argument type definitions
+ */
+struct xpmem_addr {
+	__s64 apid;		/* apid that represents memory */
+	off_t offset;		/* offset into apid's memory */
+};
+
+#define XPMEM_MAXADDR_SIZE	(size_t)(-1L)
+
+#define XPMEM_ATTACH_WC		0x10000
+#define XPMEM_ATTACH_GETSPACE	0x20000
+
+/*
+ * path to XPMEM device
+ */
+#define XPMEM_DEV_PATH  "/dev/xpmem"
+
+/*
+ * The following are the possible XPMEM related errors.
+ */
+#define XPMEM_ERRNO_NOPROC	2004	/* unknown thread due to fork() */
+
+/*
+ * flags for segment permissions
+ */
+#define XPMEM_RDONLY	0x1
+#define XPMEM_RDWR	0x2
+
+/*
+ * Valid permit_type values for xpmem_make().
+ */
+#define XPMEM_PERMIT_MODE	0x1
+
+/*
+ * ioctl() commands used to interface to the kernel module.
+ */
+#define XPMEM_IOC_MAGIC		'x'
+#define XPMEM_CMD_VERSION	_IO(XPMEM_IOC_MAGIC, 0)
+#define XPMEM_CMD_MAKE		_IO(XPMEM_IOC_MAGIC, 1)
+#define XPMEM_CMD_REMOVE	_IO(XPMEM_IOC_MAGIC, 2)
+#define XPMEM_CMD_GET		_IO(XPMEM_IOC_MAGIC, 3)
+#define XPMEM_CMD_RELEASE	_IO(XPMEM_IOC_MAGIC, 4)
+#define XPMEM_CMD_ATTACH	_IO(XPMEM_IOC_MAGIC, 5)
+#define XPMEM_CMD_DETACH	_IO(XPMEM_IOC_MAGIC, 6)
+#define XPMEM_CMD_COPY		_IO(XPMEM_IOC_MAGIC, 7)
+#define XPMEM_CMD_BCOPY		_IO(XPMEM_IOC_MAGIC, 8)
+#define XPMEM_CMD_FORK_BEGIN	_IO(XPMEM_IOC_MAGIC, 9)
+#define XPMEM_CMD_FORK_END	_IO(XPMEM_IOC_MAGIC, 10)
+
+/*
+ * Structures used with the preceding ioctl() commands to pass data.
+ */
+struct xpmem_cmd_make {
+	__u64 vaddr;
+	size_t size;
+	int permit_type;
+	__u64 permit_value;
+	__s64 segid;		/* returned on success */
+};
+
+struct xpmem_cmd_remove {
+	__s64 segid;
+};
+
+struct xpmem_cmd_get {
+	__s64 segid;
+	int flags;
+	int permit_type;
+	__u64 permit_value;
+	__s64 apid;		/* returned on success */
+};
+
+struct xpmem_cmd_release {
+	__s64 apid;
+};
+
+struct xpmem_cmd_attach {
+	__s64 apid;
+	off_t offset;
+	size_t size;
+	__u64 vaddr;
+	int fd;
+	int flags;
+};
+
+struct xpmem_cmd_detach {
+	__u64 vaddr;
+};
+
+struct xpmem_cmd_copy {
+	__s64 src_apid;
+	off_t src_offset;
+	__s64 dst_apid;
+	off_t dst_offset;
+	size_t size;
+};
+
+#ifndef __KERNEL__
+extern int xpmem_version(void);
+extern __s64 xpmem_make(void *, size_t, int, void *);
+extern int xpmem_remove(__s64);
+extern __s64 xpmem_get(__s64, int, int, void *);
+extern int xpmem_release(__s64);
+extern void *xpmem_attach(struct xpmem_addr, size_t, void *);
+extern void *xpmem_attach_wc(struct xpmem_addr, size_t, void *);
+extern void *xpmem_attach_getspace(struct xpmem_addr, size_t, void *);
+extern int xpmem_detach(void *);
+extern int xpmem_bcopy(struct xpmem_addr, struct xpmem_addr, size_t);
+#endif
+
+#endif /* _ASM_IA64_SN_XPMEM_H */

-- 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 1/9] EMM Notifier: The notifier calls
  2008-04-01 20:55 ` [patch 1/9] EMM Notifier: The notifier calls Christoph Lameter
@ 2008-04-01 21:14   ` Peter Zijlstra
  2008-04-01 21:38     ` Paul E. McKenney
  2008-04-02  6:49   ` [patch 1/9] EMM Notifier: The notifier calls Andrea Arcangeli
  1 sibling, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2008-04-01 21:14 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Hugh Dickins, Andrea Arcangeli, Paul E. McKenney, linux-kernel

(Christoph, why are your CCs so often messed up?)

On Tue, 2008-04-01 at 13:55 -0700, Christoph Lameter wrote:
> plain text document attachment (emm_notifier)

> +/* Register a notifier */
> +void emm_notifier_register(struct emm_notifier *e, struct mm_struct *mm)
> +{
> +	e->next = mm->emm_notifier;
> +	/*
> +	 * The update to emm_notifier (e->next) must be visible
> +	 * before the pointer becomes visible.
> +	 * rcu_assign_pointer() does exactly what we need.
> +	 */
> +	rcu_assign_pointer(mm->emm_notifier, e);
> +}
> +EXPORT_SYMBOL_GPL(emm_notifier_register);
> +
> +/* Perform a callback */
> +int __emm_notify(struct mm_struct *mm, enum emm_operation op,
> +		unsigned long start, unsigned long end)
> +{
> +	struct emm_notifier *e = rcu_dereference(mm)->emm_notifier;
> +	int x;
> +
> +	while (e) {
> +
> +		if (e->callback) {
> +			x = e->callback(e, mm, op, start, end);
> +			if (x)
> +				return x;
> +		}
> +		/*
> +		 * emm_notifier contents (e) must be fetched after
> +		 * the retrival of the pointer to the notifier.
> +		 */
> +		e = rcu_dereference(e)->next;
> +	}
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(__emm_notify);
> +#endif

Those rcu_dereference()s are wrong. They should read:

  e = rcu_dereference(mm->emm_notifier);

and

  e = rcu_dereference(e->next);




^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 1/9] EMM Notifier: The notifier calls
  2008-04-01 21:14   ` Peter Zijlstra
@ 2008-04-01 21:38     ` Paul E. McKenney
  2008-04-02 17:44       ` Christoph Lameter
  2008-04-02 18:43       ` EMM: Fix rcu handling and spelling Christoph Lameter
  0 siblings, 2 replies; 51+ messages in thread
From: Paul E. McKenney @ 2008-04-01 21:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christoph Lameter, Hugh Dickins, Andrea Arcangeli, linux-kernel

On Tue, Apr 01, 2008 at 11:14:40PM +0200, Peter Zijlstra wrote:
> (Christoph, why are your CCs so often messed up?)
> 
> On Tue, 2008-04-01 at 13:55 -0700, Christoph Lameter wrote:
> > plain text document attachment (emm_notifier)
> 
> > +/* Register a notifier */
> > +void emm_notifier_register(struct emm_notifier *e, struct mm_struct *mm)
> > +{
> > +	e->next = mm->emm_notifier;
> > +	/*
> > +	 * The update to emm_notifier (e->next) must be visible
> > +	 * before the pointer becomes visible.
> > +	 * rcu_assign_pointer() does exactly what we need.
> > +	 */
> > +	rcu_assign_pointer(mm->emm_notifier, e);
> > +}
> > +EXPORT_SYMBOL_GPL(emm_notifier_register);
> > +
> > +/* Perform a callback */
> > +int __emm_notify(struct mm_struct *mm, enum emm_operation op,
> > +		unsigned long start, unsigned long end)
> > +{
> > +	struct emm_notifier *e = rcu_dereference(mm)->emm_notifier;
> > +	int x;
> > +
> > +	while (e) {
> > +
> > +		if (e->callback) {
> > +			x = e->callback(e, mm, op, start, end);
> > +			if (x)
> > +				return x;
> > +		}
> > +		/*
> > +		 * emm_notifier contents (e) must be fetched after
> > +		 * the retrival of the pointer to the notifier.
> > +		 */
> > +		e = rcu_dereference(e)->next;
> > +	}
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(__emm_notify);
> > +#endif
> 
> Those rcu_dereference()s are wrong. They should read:
> 
>   e = rcu_dereference(mm->emm_notifier);
> 
> and
> 
>   e = rcu_dereference(e->next);

Peter has it right.  You need to rcu_dereference() the same thing that
you rcu_assign_pointer() to.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 1/9] EMM Notifier: The notifier calls
  2008-04-01 20:55 ` [patch 1/9] EMM Notifier: The notifier calls Christoph Lameter
  2008-04-01 21:14   ` Peter Zijlstra
@ 2008-04-02  6:49   ` Andrea Arcangeli
  2008-04-02 10:59     ` Robin Holt
  2008-04-02 17:59     ` Christoph Lameter
  1 sibling, 2 replies; 51+ messages in thread
From: Andrea Arcangeli @ 2008-04-02  6:49 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Hugh Dickins, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman, Nick Piggin

On Tue, Apr 01, 2008 at 01:55:32PM -0700, Christoph Lameter wrote:
> +/* Perform a callback */
> +int __emm_notify(struct mm_struct *mm, enum emm_operation op,
> +		unsigned long start, unsigned long end)
> +{
> +	struct emm_notifier *e = rcu_dereference(mm)->emm_notifier;
> +	int x;
> +
> +	while (e) {
> +
> +		if (e->callback) {
> +			x = e->callback(e, mm, op, start, end);
> +			if (x)
> +				return x;

There are much bigger issues besides the rcu safety in this patch,
proper aging of the secondary mmu through access bits set by hardware
is unfixable with this model (you would need to do age |=
e->callback), which is the proof of why this isn't flexibile enough by
forcing the same parameter and retvals for all methods. No idea why
you go for such inferior solution that will never get the aging right
and will likely fall apart if we add more methods in the future.

For example the "switch" you have to add in
xpmem_emm_notifier_callback doesn't look good, at least gcc may be
able to optimize it with an array indexing simulating proper pointer
to function like in #v9.

Most other patches will apply cleanly on top of my coming mmu
notifiers #v10 that I hope will go in -mm.

For #v10 the only two left open issues to discuss are:

1) the moment you remove rcu_read_lock from the methods (my #v9 had
   rcu_read_lock so synchronize_rcu() in Jack's patch was working with
   my #v9) GRU has no way to ensure the methods will fire immediately
   after registering. To fix this race after removing the
   rcu_read_lock (to prepare for the later patches that allow the VM
   to schedule when the mmu notifier methods are invoked) I can
   replace rcu_read_lock with seqlock locking in the same way as I did
   in a previous patch posted here (seqlock_write around the
   registration method, and seqlock_read replying all callbacks if the
   race happened). then synchronize_rcu become unnecessary and the
   methods will be correctly replied allowing GRU not to corrupt
   memory after the registration method. EMM would also need a fix
   like this for GRU to be safe on top of EMM.

   Another less obviously safe approach is to allow the register
   method to succeed only when mm_users=1 and the task is single
   threaded. This way if all the places where the mmu notifiers aren't
   invoked on the mm not by the current task, are only doing
   invalidates after/before zapping ptes, if the instantiation of new
   ptes is single threaded too, we shouldn't worry if we miss an
   invalidate for a pte that is zero and doesn't point to any physical
   page. In the places where current->mm != mm I'm using
   invalidate_page 99% of the time, and that only follows the
   ptep_clear_flush. The problem is the range_begin calls that will happen
   before zapping the pte in places where current->mm !=
   mm. Unfortunately in my incremental patch where I move all
   invalidate_page outside of the PT lock to prepare for allowing
   sleeping inside the mmu notifiers, I used range_begin/end in places
   like try_to_unmap_cluster where current->mm != mm. In general
   this solution looks more fragile than the seqlock.

2) I'm uncertain how the driver can handle a range_end called before
   range_begin. Also multiple range_begin can happen in parallel later
   followed by range_end, so if there's a global seqlock that
   serializes the secondary mmu page fault, that will screw up (you
   can't seqlock_write in range_begin and sequnlock_write in
   range_end). The write side of the seqlock must be serialized and
   calling seqlock_write twice in a row before any sequnlock operation
   will break.

   A recursive rwsem taken in range_begin and released in range_end
   seems to be the only way to stop the secondary mmu page faults.

   If I would remove all range_begin/end in places where current->mm
   != mm, then I could as well bail out in mmu_notifier_register if
   use mm_users != 1 to solve problem 2 too.

   My solution to this is that I believe the driver is safe if the
   range_end is being missed if range_end is followed by an invalidate
   event like in invalidate_range_end. So the driver is ok to just
   have a static value that records whether range_begin has ever
   happened, and it will just return from range_end without doing
   anything if no range_begin ever happened.
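
   To illustrate (sketch only, none of this code has been posted and
   the names are invented), such a driver-side callback could look like:

	static int dummy_range_begin_seen;

	static int dummy_emm_callback(struct emm_notifier *e,
			struct mm_struct *mm, enum emm_operation op,
			unsigned long start, unsigned long end)
	{
		switch (op) {
		case emm_invalidate_start:
			dummy_range_begin_seen = 1;
			/* hold off secondary mmu faults, drop external refs */
			return 0;
		case emm_invalidate_end:
			if (!dummy_range_begin_seen)
				return 0; /* end without begin: nothing to undo */
			/* allow secondary mmu faults again */
			return 0;
		default:
			return 0;
		}
	}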


Notably I'll be trying to use range_begin in KVM too so I got to deal
with 2) too. For Nick: the reason for using range_begin is supposedly
an optimization: to guarantee that the last free of the page will
happen outside the mmu_lock, so KVM internally to the mmu_lock is free
to do:

   	     spin_lock(kvm->mmu_lock)
   	     put_page()
	     spte = nonpresent
	     flush secondary tlb()
	     spin_unlock(kvm->mmu_lock)

The above ordering is unsafe if the page could ever reach the freelist
before the tlb flush happened. The range_begin will take the mmu_lock
and will hold off kvm new page faults to allow kvm to free as many
page it wants, invalidate all ptes and only at the end do a single tlb
flush, while still being allowed to madvise(don't need) or munmap
parts of the memory mapped by sptes. It's uncertain if the ordering
should be changed to be robust against put_page putting the page in
the freelist immediately, instead of using range_begin to serialize
against the page going out of ptes immediately after put_page is
called. If we go for a range_end-only usage of the mmu notifiers kvm
will need some reordering and zapping a large number of ptes will
require multiple tlb flushes as the pages have to be pointed by an
array and the array is of limited size (the size of the array decides
the frequency of the tlb flushes). The suggested usage of range_begin
allows to do a single tlb flush for an unlimited number of sptes being
zapped.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 1/9] EMM Notifier: The notifier calls
  2008-04-02  6:49   ` [patch 1/9] EMM Notifier: The notifier calls Andrea Arcangeli
@ 2008-04-02 10:59     ` Robin Holt
  2008-04-02 11:16       ` Andrea Arcangeli
  2008-04-02 17:59     ` Christoph Lameter
  1 sibling, 1 reply; 51+ messages in thread
From: Robin Holt @ 2008-04-02 10:59 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Christoph Lameter, Hugh Dickins, Robin Holt, Avi Kivity,
	Izik Eidus, kvm-devel, Peter Zijlstra, general, Steve Wise,
	Roland Dreier, Kanoj Sarcar, steiner, linux-kernel, linux-mm,
	daniel.blueman, Nick Piggin

On Wed, Apr 02, 2008 at 08:49:52AM +0200, Andrea Arcangeli wrote:
> Most other patches will apply cleanly on top of my coming mmu
> notifiers #v10 that I hope will go in -mm.
> 
> For #v10 the only two left open issues to discuss are:

Does your v10 allow sleeping inside the callbacks?

Thanks,
Robin

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 1/9] EMM Notifier: The notifier calls
  2008-04-02 10:59     ` Robin Holt
@ 2008-04-02 11:16       ` Andrea Arcangeli
  2008-04-02 14:26         ` Robin Holt
  0 siblings, 1 reply; 51+ messages in thread
From: Andrea Arcangeli @ 2008-04-02 11:16 UTC (permalink / raw)
  To: Robin Holt
  Cc: Christoph Lameter, Hugh Dickins, Avi Kivity, Izik Eidus,
	kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier,
	Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman,
	Nick Piggin

On Wed, Apr 02, 2008 at 05:59:25AM -0500, Robin Holt wrote:
> On Wed, Apr 02, 2008 at 08:49:52AM +0200, Andrea Arcangeli wrote:
> > Most other patches will apply cleanly on top of my coming mmu
> > notifiers #v10 that I hope will go in -mm.
> > 
> > For #v10 the only two left open issues to discuss are:
> 
> Does your v10 allow sleeping inside the callbacks?

Yes if you apply all the patches. But not if you apply the first patch
only. Most patches in the EMM series will apply cleanly or with minor
rejects to #v10 too. Christoph's further work to make EMM sleep
capable looks very good and it's going to be 100% shared; it's also
going to be a lot more controversial for merging than either the #v10
or the EMM first patch. EMM also doesn't allow sleeping inside the
callbacks if you only apply the first patch in the series.

My priority is to get #v9 or the coming #v10 merged in -mm (only
difference will be the replacement of rcu_read_lock with the seqlock
to avoid breaking the synchronize_rcu in GRU code). I will mix seqlock
with rcu ordered writes. EMM indeed breaks GRU by making
synchronize_rcu a noop and by not providing any alternative (I will
obsolete synchronize_rcu making it a noop instead). This assumes Jack
used synchronize_rcu for whatever good reason. But this isn't the real
strong point against EMM; adding seqlock to EMM is as easy as adding
it to #v10 (admittedly with #v10 it is a bit easier because I didn't
expand the hlist operations for zero gain like in EMM).
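
To make that concrete, here is a rough sketch of the seqlock variant
grafted onto the EMM walk (illustrative only, this is not #v10 and no
such code has been posted; a per-mm lock may be preferable to the
global one shown, and callbacks must tolerate being replayed):

	static DEFINE_SEQLOCK(emm_register_lock);

	void emm_notifier_register(struct emm_notifier *e,
					struct mm_struct *mm)
	{
		write_seqlock(&emm_register_lock);
		e->next = mm->emm_notifier;
		rcu_assign_pointer(mm->emm_notifier, e);
		write_sequnlock(&emm_register_lock);
	}

	int __emm_notify(struct mm_struct *mm, enum emm_operation op,
			unsigned long start, unsigned long end)
	{
		struct emm_notifier *e;
		unsigned seq;
		int x;

	restart:
		seq = read_seqbegin(&emm_register_lock);
		for (e = rcu_dereference(mm->emm_notifier); e;
				e = rcu_dereference(e->next)) {
			if (e->callback) {
				x = e->callback(e, mm, op, start, end);
				if (x)
					return x;
			}
		}
		/* a registration raced with the walk: replay the callbacks */
		if (read_seqretry(&emm_register_lock, seq))
			goto restart;
		return 0;
	}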

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 1/9] EMM Notifier: The notifier calls
  2008-04-02 11:16       ` Andrea Arcangeli
@ 2008-04-02 14:26         ` Robin Holt
  0 siblings, 0 replies; 51+ messages in thread
From: Robin Holt @ 2008-04-02 14:26 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Christoph Lameter, Hugh Dickins, Avi Kivity,
	Izik Eidus, kvm-devel, Peter Zijlstra, general, Steve Wise,
	Roland Dreier, Kanoj Sarcar, steiner, linux-kernel, linux-mm,
	daniel.blueman, Nick Piggin

I must have missed v10.  Could you repost so I can build xpmem
against it to see how it operates?  To help reduce confusion, you should
probably commandeer the patches from Christoph's set which you think are
needed to make it sleep.

Thanks,
Robin


On Wed, Apr 02, 2008 at 01:16:51PM +0200, Andrea Arcangeli wrote:
> On Wed, Apr 02, 2008 at 05:59:25AM -0500, Robin Holt wrote:
> > On Wed, Apr 02, 2008 at 08:49:52AM +0200, Andrea Arcangeli wrote:
> > > Most other patches will apply cleanly on top of my coming mmu
> > > notifiers #v10 that I hope will go in -mm.
> > > 
> > > For #v10 the only two left open issues to discuss are:
> > 
> > Does your v10 allow sleeping inside the callbacks?
> 
> Yes if you apply all the patches. But not if you apply the first patch
> only, most patches in EMM serie will apply cleanly or with minor
> rejects to #v10 too, Christoph's further work to make EEM sleep
> capable looks very good and it's going to be 100% shared, it's also
> going to be a lot more controversial for merging than the two #v10 or
> EMM first patch. EMM also doesn't allow sleeping inside the callbacks
> if you only apply the first patch in the serie.
> 
> My priority is to get #v9 or the coming #v10 merged in -mm (only
> difference will be the replacement of rcu_read_lock with the seqlock
> to avoid breaking the synchronize_rcu in GRU code). I will mix seqlock
> with rcu ordered writes. EMM indeed breaks GRU by making
> synchronize_rcu a noop and by not providing any alternative (I will
> obsolete synchronize_rcu making it a noop instead). This assumes Jack
> used synchronize_rcu for whatever good reason. But this isn't the real
> strong point against EMM, adding seqlock to EMM is as easy as adding
> it to #v10 (admittedly with #v10 is a bit easier because I didn't
> expand the hlist operations for zero gain like in EMM).
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 1/9] EMM Notifier: The notifier calls
  2008-04-01 21:38     ` Paul E. McKenney
@ 2008-04-02 17:44       ` Christoph Lameter
  2008-04-02 18:43       ` EMM: Fix rcu handling and spelling Christoph Lameter
  1 sibling, 0 replies; 51+ messages in thread
From: Christoph Lameter @ 2008-04-02 17:44 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Hugh Dickins, Andrea Arcangeli, linux-kernel

On Tue, 1 Apr 2008, Paul E. McKenney wrote:

> Peter has it right.  You need to rcu_dereference() the same thing that
> you rcu_assign_pointer() to.

Ah. Ok. Thanks.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 5/9] Convert anon_vma lock to rw_sem and refcount
  2008-04-01 20:55 ` [patch 5/9] Convert anon_vma lock to rw_sem and refcount Christoph Lameter
@ 2008-04-02 17:50   ` Andrea Arcangeli
  2008-04-02 18:15     ` Christoph Lameter
  0 siblings, 1 reply; 51+ messages in thread
From: Andrea Arcangeli @ 2008-04-02 17:50 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Hugh Dickins, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Tue, Apr 01, 2008 at 01:55:36PM -0700, Christoph Lameter wrote:
>   This results in f.e. the Aim9 brk performance test to got down by 10-15%.

I guess it's more likely because of overscheduling for small critical
sections. Did you count the total number of context switches? I
guess there will be a lot more with your patch applied. That
regression is a showstopper and it is the reason why I've suggested
before to add a CONFIG_XPMEM or CONFIG_MMU_NOTIFIER_SLEEP config
option to make the VM locks sleep capable only when XPMEM=y
(PREEMPT_RT will enable it too). Thanks for doing the benchmark work!

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 1/9] EMM Notifier: The notifier calls
  2008-04-02  6:49   ` [patch 1/9] EMM Notifier: The notifier calls Andrea Arcangeli
  2008-04-02 10:59     ` Robin Holt
@ 2008-04-02 17:59     ` Christoph Lameter
  2008-04-02 19:03       ` EMM: Fixup return value handling of emm_notify() Christoph Lameter
                         ` (2 more replies)
  1 sibling, 3 replies; 51+ messages in thread
From: Christoph Lameter @ 2008-04-02 17:59 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman, Nick Piggin

On Wed, 2 Apr 2008, Andrea Arcangeli wrote:

> There are much bigger issues besides the rcu safety in this patch,
> proper aging of the secondary mmu through access bits set by hardware
> is unfixable with this model (you would need to do age |=
> e->callback), which is the proof of why this isn't flexibile enough by
> forcing the same parameter and retvals for all methods. No idea why
> you go for such inferior solution that will never get the aging right
> and will likely fall apart if we add more methods in the future.

There is always the possibility to add special functions in the same way 
as done in the mmu notifier series if it really becomes necessary. EMM
in no way precludes that.

Here, f.e., we can add a special emm_age() function that iterates
differently and does the | for you.
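
Something like this would do (sketch only, not part of the posted
series; it reuses the __emm_notify() walk but accumulates the results):

	int emm_age(struct mm_struct *mm, unsigned long start,
			unsigned long end)
	{
		struct emm_notifier *e = rcu_dereference(mm->emm_notifier);
		int referenced = 0;

		while (e) {
			if (e->callback)
				/* OR the results instead of returning early */
				referenced |= e->callback(e, mm,
						emm_referenced, start, end);
			e = rcu_dereference(e->next);
		}
		return referenced;
	}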

> For example the "switch" you have to add in
> xpmem_emm_notifier_callback doesn't look good, at least gcc may be
> able to optimize it with an array indexing simulating proper pointer
> to function like in #v9.

Actually the switch looks really good because it allows code to run
for all callbacks like f.e. xpmem_tg_ref(). Otherwise the refcounting code 
would have to be added to each callback.

> 
> Most other patches will apply cleanly on top of my coming mmu
> notifiers #v10 that I hope will go in -mm.
> 
> For #v10 the only two left open issues to discuss are:

Did I see #v10? Could you start a new subject when you post, please? Do
not respond to some old message, otherwise the threading will be wrong.

>    methods will be correctly replied allowing GRU not to corrupt
>    memory after the registration method. EMM would also need a fix
>    like this for GRU to be safe on top of EMM.

How exactly does the GRU corrupt memory?
 
>    Another less obviously safe approach is to allow the register
>    method to succeed only when mm_users=1 and the task is single
>    threaded. This way if all the places where the mmu notifers aren't
>    invoked on the mm not by the current task, are only doing
>    invalidates after/before zapping ptes, if the istantiation of new
>    ptes is single threaded too, we shouldn't worry if we miss an
>    invalidate for a pte that is zero and doesn't point to any physical
>    page. In the places where current->mm != mm I'm using
>    invalidate_page 99% of the time, and that only follows the
>    ptep_clear_flush. The problem are the range_begin that will happen
>    before zapping the pte in places where current->mm !=
>    mm. Unfortunately in my incremental patch where I move all
>    invalidate_page outside of the PT lock to prepare for allowing
>    sleeping inside the mmu notifiers, I used range_begin/end in places
>    like try_to_unmap_cluster where current->mm != mm. In general
>    this solution looks more fragile than the seqlock.

Hmmm... Okay that is one solution that would just require a BUG_ON in the 
registration methods.
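
Something along these lines in the registration function would be
enough (sketch only; thread_group_empty() is just one way to check for
a single threaded caller):

	void emm_notifier_register(struct emm_notifier *e,
					struct mm_struct *mm)
	{
		/* only safe while nobody else can instantiate ptes */
		BUG_ON(atomic_read(&mm->mm_users) > 1);
		BUG_ON(!thread_group_empty(current));

		e->next = mm->emm_notifier;
		rcu_assign_pointer(mm->emm_notifier, e);
	}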

> 2) I'm uncertain how the driver can handle a range_end called before
>    range_begin. Also multiple range_begin can happen in parallel later
>    followed by range_end, so if there's a global seqlock that
>    serializes the secondary mmu page fault, that will screwup (you
>    can't seqlock_write in range_begin and sequnlock_write in
>    range_end). The write side of the seqlock must be serialized and
>    calling seqlock_write twice in a row before any sequnlock operation
>    will break.

Well, doesn't the requirement of just one execution thread also deal with 
that issue?

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 5/9] Convert anon_vma lock to rw_sem and refcount
  2008-04-02 17:50   ` Andrea Arcangeli
@ 2008-04-02 18:15     ` Christoph Lameter
  2008-04-02 21:56       ` Andrea Arcangeli
  0 siblings, 1 reply; 51+ messages in thread
From: Christoph Lameter @ 2008-04-02 18:15 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Wed, 2 Apr 2008, Andrea Arcangeli wrote:

> On Tue, Apr 01, 2008 at 01:55:36PM -0700, Christoph Lameter wrote:
> >   This results in f.e. the Aim9 brk performance test to got down by 10-15%.
> 
> I guess it's more likely because of overscheduling for small crtitical
> sections, did you counted the total number of context switches? I
> guess there will be a lot more with your patch applied. That
> regression is a showstopper and it is the reason why I've suggested
> before to add a CONFIG_XPMEM or CONFIG_MMU_NOTIFIER_SLEEP config
> option to make the VM locks sleep capable only when XPMEM=y
> (PREEMPT_RT will enable it too). Thanks for doing the benchmark work!

There are more context switches if locks are contended. 

But that actually also has some good aspects, because we avoid busy loops 
and can potentially continue work in another process.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* EMM: Fix rcu handling and spelling
  2008-04-01 21:38     ` Paul E. McKenney
  2008-04-02 17:44       ` Christoph Lameter
@ 2008-04-02 18:43       ` Christoph Lameter
  2008-04-02 19:02         ` Paul E. McKenney
  1 sibling, 1 reply; 51+ messages in thread
From: Christoph Lameter @ 2008-04-02 18:43 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Hugh Dickins, Andrea Arcangeli, linux-kernel

Subject: EMM: Fix rcu handling and spelling

Fix the way rcu_dereference is done.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/rmap.h |    2 +-
 mm/rmap.c            |    4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6/include/linux/rmap.h
===================================================================
--- linux-2.6.orig/include/linux/rmap.h	2008-04-02 11:41:58.737866596 -0700
+++ linux-2.6/include/linux/rmap.h	2008-04-02 11:42:08.282029661 -0700
@@ -91,7 +91,7 @@ static inline void page_dup_rmap(struct 
  * when the VM removes references to pages.
  */
 enum emm_operation {
-	emm_release,		/* Process existing, */
+	emm_release,		/* Process exiting, */
 	emm_invalidate_start,	/* Before the VM unmaps pages */
 	emm_invalidate_end,	/* After the VM unmapped pages */
  	emm_referenced		/* Check if a range was referenced */
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c	2008-04-02 11:41:58.737866596 -0700
+++ linux-2.6/mm/rmap.c	2008-04-02 11:42:08.282029661 -0700
@@ -303,7 +303,7 @@ EXPORT_SYMBOL_GPL(emm_notifier_register)
 int __emm_notify(struct mm_struct *mm, enum emm_operation op,
 		unsigned long start, unsigned long end)
 {
-	struct emm_notifier *e = rcu_dereference(mm)->emm_notifier;
+	struct emm_notifier *e = rcu_dereference(mm->emm_notifier);
 	int x;
 
 	while (e) {
@@ -317,7 +317,7 @@ int __emm_notify(struct mm_struct *mm, e
 		 * emm_notifier contents (e) must be fetched after
 		 * the retrival of the pointer to the notifier.
 		 */
-		e = rcu_dereference(e)->next;
+		e = rcu_dereference(e->next);
 	}
 	return 0;
 }


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: EMM: Fix rcu handling and spelling
  2008-04-02 18:43       ` EMM: Fix rcu handling and spelling Christoph Lameter
@ 2008-04-02 19:02         ` Paul E. McKenney
  0 siblings, 0 replies; 51+ messages in thread
From: Paul E. McKenney @ 2008-04-02 19:02 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, Hugh Dickins, Andrea Arcangeli, linux-kernel

On Wed, Apr 02, 2008 at 11:43:02AM -0700, Christoph Lameter wrote:
> Subject: EMM: Fix rcu handling and spelling
> 
> Fix the way rcu_dereference is done.

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> 
> ---
>  include/linux/rmap.h |    2 +-
>  mm/rmap.c            |    4 ++--
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> Index: linux-2.6/include/linux/rmap.h
> ===================================================================
> --- linux-2.6.orig/include/linux/rmap.h	2008-04-02 11:41:58.737866596 -0700
> +++ linux-2.6/include/linux/rmap.h	2008-04-02 11:42:08.282029661 -0700
> @@ -91,7 +91,7 @@ static inline void page_dup_rmap(struct 
>   * when the VM removes references to pages.
>   */
>  enum emm_operation {
> -	emm_release,		/* Process existing, */
> +	emm_release,		/* Process exiting, */
>  	emm_invalidate_start,	/* Before the VM unmaps pages */
>  	emm_invalidate_end,	/* After the VM unmapped pages */
>   	emm_referenced		/* Check if a range was referenced */
> Index: linux-2.6/mm/rmap.c
> ===================================================================
> --- linux-2.6.orig/mm/rmap.c	2008-04-02 11:41:58.737866596 -0700
> +++ linux-2.6/mm/rmap.c	2008-04-02 11:42:08.282029661 -0700
> @@ -303,7 +303,7 @@ EXPORT_SYMBOL_GPL(emm_notifier_register)
>  int __emm_notify(struct mm_struct *mm, enum emm_operation op,
>  		unsigned long start, unsigned long end)
>  {
> -	struct emm_notifier *e = rcu_dereference(mm)->emm_notifier;
> +	struct emm_notifier *e = rcu_dereference(mm->emm_notifier);
>  	int x;
> 
>  	while (e) {
> @@ -317,7 +317,7 @@ int __emm_notify(struct mm_struct *mm, e
>  		 * emm_notifier contents (e) must be fetched after
>  		 * the retrival of the pointer to the notifier.
>  		 */
> -		e = rcu_dereference(e)->next;
> +		e = rcu_dereference(e->next);
>  	}
>  	return 0;
>  }
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* EMM: Fixup return value handling of emm_notify()
  2008-04-02 17:59     ` Christoph Lameter
@ 2008-04-02 19:03       ` Christoph Lameter
  2008-04-02 21:25         ` Andrea Arcangeli
  2008-04-02 21:05       ` EMM: Require single threadedness for registration Christoph Lameter
  2008-04-02 21:53       ` [patch 1/9] EMM Notifier: The notifier calls Andrea Arcangeli
  2 siblings, 1 reply; 51+ messages in thread
From: Christoph Lameter @ 2008-04-02 19:03 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman, Nick Piggin

On Wed, 2 Apr 2008, Christoph Lameter wrote:

> Here f.e. We can add a special emm_age() function that iterates 
> differently and does the | for you.

Well, maybe not really necessary. How about this fix? It's likely a problem 
to stop callbacks if one callback returned an error.


Subject: EMM: Fixup return value handling of emm_notify()

Right now we stop calling additional subsystems if one callback returned
an error. That has the potential for causing additional trouble with the
subsystems that do not receive the callbacks they expect if one has failed.

So change the handling of error codes to continue callbacks to other
subsystems but return the first error code encountered.

If a callback returns a positive return value then add up all the values
from all the calls. That can be used to establish how many references
exist (xpmem may want this feature at some point) or ensure that the
aging works the way Andrea wants it to (KVM, XPmem so far do not
care too much).

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/rmap.c |   28 +++++++++++++++++++++++-----
 1 file changed, 23 insertions(+), 5 deletions(-)

Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c	2008-04-02 11:46:20.738342852 -0700
+++ linux-2.6/mm/rmap.c	2008-04-02 12:03:57.672494320 -0700
@@ -299,27 +299,45 @@ void emm_notifier_register(struct emm_no
 }
 EXPORT_SYMBOL_GPL(emm_notifier_register);
 
-/* Perform a callback */
+/*
+ * Perform a callback
+ *
+ * The return of this function is either a negative error of the first
+ * callback that failed or a consolidated count of all the positive
+ * values that were returned by the callbacks.
+ */
 int __emm_notify(struct mm_struct *mm, enum emm_operation op,
 		unsigned long start, unsigned long end)
 {
 	struct emm_notifier *e = rcu_dereference(mm->emm_notifier);
 	int x;
+	int result = 0;
 
 	while (e) {
-
 		if (e->callback) {
 			x = e->callback(e, mm, op, start, end);
-			if (x)
-				return x;
+
+			/*
+			 * Callback may return a positive value to indicate a count
+			 * or a negative error code. We keep the first error code
+			 * but continue to perform callbacks to other subscribed
+			 * subsystems.
+			 */
+			if (x && result >= 0) {
+				if (x >= 0)
+					result += x;
+				else
+					result = x;
+			}
 		}
+
 		/*
 		 * emm_notifier contents (e) must be fetched after
 		 * the retrival of the pointer to the notifier.
 		 */
 		e = rcu_dereference(e->next);
 	}
-	return 0;
+	return result;
 }
 EXPORT_SYMBOL_GPL(__emm_notify);
 #endif
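
To illustrate the intended semantics, here is a minimal caller-side sketch
(example_check_referenced() is a hypothetical helper, not part of the
patch): a negative result is the first error any subscriber reported, a
positive result is the sum of the counts returned by all subscribers.

#include <linux/mm.h>
#include <linux/rmap.h>

/* Sketch only: interpreting the consolidated emm_notify() result. */
static int example_check_referenced(struct mm_struct *mm, unsigned long addr)
{
	int ret = emm_notify(mm, emm_referenced, addr, addr + PAGE_SIZE);

	if (ret < 0)
		return ret;	/* first error reported by any subscriber */

	/* ret is the sum of the positive counts from all subscribers */
	return ret > 0;
}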

^ permalink raw reply	[flat|nested] 51+ messages in thread

* EMM: Require single threadedness for registration.
  2008-04-02 17:59     ` Christoph Lameter
  2008-04-02 19:03       ` EMM: Fixup return value handling of emm_notify() Christoph Lameter
@ 2008-04-02 21:05       ` Christoph Lameter
  2008-04-02 22:01         ` Andrea Arcangeli
  2008-04-02 21:53       ` [patch 1/9] EMM Notifier: The notifier calls Andrea Arcangeli
  2 siblings, 1 reply; 51+ messages in thread
From: Christoph Lameter @ 2008-04-02 21:05 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman, Nick Piggin

Here is a patch to require single-threaded execution during 
emm_notifier_register(). This also allows an easy implementation of an
unregister function and gets rid of the races that Andrea worried about.

The approach here is similar to what was used in selinux for security
context changes (see selinux_setprocattr).

Is it okay for the users of emm to require single threadedness for 
registration?



Subject: EMM: Require single threaded execution for register and unregister

We can avoid the concurrency issues arising at registration if we
only allow registration of notifiers when the process has only a single
thread. That even allows us to avoid the use of rcu.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/rmap.c |   46 +++++++++++++++++++++++++++++++++++++---------
 1 file changed, 37 insertions(+), 9 deletions(-)

Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c	2008-04-02 13:53:46.002473685 -0700
+++ linux-2.6/mm/rmap.c	2008-04-02 14:03:05.872199896 -0700
@@ -286,20 +286,48 @@ void emm_notifier_release(struct mm_stru
 	}
 }
 
-/* Register a notifier */
+/*
+ * Register a notifier
+ *
+ * mmap_sem is held writably.
+ *
+ * Process must be single threaded.
+ */
 void emm_notifier_register(struct emm_notifier *e, struct mm_struct *mm)
 {
+	BUG_ON(atomic_read(&mm->mm_users) != 1);
+
 	e->next = mm->emm_notifier;
-	/*
-	 * The update to emm_notifier (e->next) must be visible
-	 * before the pointer becomes visible.
-	 * rcu_assign_pointer() does exactly what we need.
-	 */
-	rcu_assign_pointer(mm->emm_notifier, e);
+	mm->emm_notifier = e;
 }
 EXPORT_SYMBOL_GPL(emm_notifier_register);
 
 /*
+ * Unregister a notifier
+ *
+ * mmap_sem is held writably
+ *
+ * Process must be single threaded
+ */
+void emm_notifier_unregister(struct emm_notifier *e, struct mm_struct *mm)
+{
+	struct emm_notifier *p = mm->emm_notifier;
+
+	BUG_ON(atomic_read(&mm->mm_users) != 1);
+
+	if (e == p)
+		mm->emm_notifier = e->next;
+	else {
+		while (p->next != e)
+			p = p->next;
+
+		p->next = e->next;
+	}
+	e->callback(e, mm, emm_release, 0, TASK_SIZE);
+}
+EXPORT_SYMBOL_GPL(emm_notifier_unregister);
+
+/*
  * Perform a callback
  *
  * The return of this function is either a negative error of the first
@@ -309,7 +337,7 @@ EXPORT_SYMBOL_GPL(emm_notifier_register)
 int __emm_notify(struct mm_struct *mm, enum emm_operation op,
 		unsigned long start, unsigned long end)
 {
-	struct emm_notifier *e = rcu_dereference(mm->emm_notifier);
+	struct emm_notifier *e = mm->emm_notifier;
 	int x;
 	int result = 0;
 
@@ -335,7 +363,7 @@ int __emm_notify(struct mm_struct *mm, e
 		 * emm_notifier contents (e) must be fetched after
 		 * the retrival of the pointer to the notifier.
 		 */
-		e = rcu_dereference(e->next);
+		e = e->next;
 	}
 	return result;
 }
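
For what it is worth, a subscriber would then attach roughly like this
(sketch only; my_subscriber, stub_callback and my_attach are hypothetical
names, and the mm_users check simply mirrors the BUG_ON above):

#include <linux/mm.h>
#include <linux/rmap.h>
#include <linux/sched.h>

struct my_subscriber {
	struct emm_notifier notifier;
	/* driver private state ... */
};

static int stub_callback(struct emm_notifier *e, struct mm_struct *mm,
			 enum emm_operation op,
			 unsigned long start, unsigned long end)
{
	return 0;		/* a real subscriber reacts to op here */
}

/* Attach from the task's own context, under mmap_sem held for write,
 * before the task ever clones, as required by the patch above. */
static int my_attach(struct my_subscriber *s)
{
	struct mm_struct *mm = current->mm;
	int ret = 0;

	s->notifier.callback = stub_callback;

	down_write(&mm->mmap_sem);
	if (atomic_read(&mm->mm_users) != 1)
		ret = -EBUSY;	/* too late, the mm already has other users */
	else
		emm_notifier_register(&s->notifier, mm);
	up_write(&mm->mmap_sem);
	return ret;
}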

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: EMM: Fixup return value handling of emm_notify()
  2008-04-02 19:03       ` EMM: Fixup return value handling of emm_notify() Christoph Lameter
@ 2008-04-02 21:25         ` Andrea Arcangeli
  2008-04-02 21:33           ` Christoph Lameter
  0 siblings, 1 reply; 51+ messages in thread
From: Andrea Arcangeli @ 2008-04-02 21:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Hugh Dickins, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman, Nick Piggin

On Wed, Apr 02, 2008 at 12:03:50PM -0700, Christoph Lameter wrote:
> +			/*
> +			 * Callback may return a positive value to indicate a count
> +			 * or a negative error code. We keep the first error code
> +			 * but continue to perform callbacks to other subscribed
> +			 * subsystems.
> +			 */
> +			if (x && result >= 0) {
> +				if (x >= 0)
> +					result += x;
> +				else
> +					result = x;
> +			}
>  		}
> +

Now think of when one of the kernel janitors micro-optimizes PG_dirty
to be returned by invalidate_page so that a single set_page_dirty is
invoked... Keep in mind this is a kernel-internal API; ask Greg if we
can change it in order to optimize later in the future. I think my #v9
is optimal enough while being simple at the same time,
but anyway it's silly to be hardwired to such an interface that worst
of all requires switch statements instead of proper pointer to
functions and a fixed set of parameters and retval semantics for all
methods.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: EMM: Fixup return value handling of emm_notify()
  2008-04-02 21:25         ` Andrea Arcangeli
@ 2008-04-02 21:33           ` Christoph Lameter
  2008-04-03 10:40             ` Peter Zijlstra
  0 siblings, 1 reply; 51+ messages in thread
From: Christoph Lameter @ 2008-04-02 21:33 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman, Nick Piggin

On Wed, 2 Apr 2008, Andrea Arcangeli wrote:

> but anyway it's silly to be hardwired to such an interface that worst
> of all requires switch statements instead of proper pointer to
> functions and a fixed set of parameters and retval semantics for all
> methods.

The EMM API with a single callback is the simplest approach at this point. 
A common callback for all operations allows the driver to implement common 
entry and exit code as seen in XPMem.

I guess we can complicate this more by switching to a different API or 
adding additional emm_xxx() callback if need be but I really want to have 
a strong case for why this would be needed. There is the danger of 
adding frills with special callbacks in this and that situation that could 
make the notifier complicated and specific to a certain usage scenario. 

Having this generic simple interface will hopefully avoid such things.
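
To make that concrete, the common entry/exit code in a subscriber looks
roughly like this (simplified sketch; the my_* helpers are hypothetical
stand-ins for things like xpmem_tg_ref()):

#include <linux/kernel.h>
#include <linux/rmap.h>

struct my_context {
	struct emm_notifier notifier;
	/* subsystem state ... */
};

/* One callback, so the refcounting happens exactly once per notification.
 * my_context_ref/unref and the other my_* operations are hypothetical. */
static int my_callback(struct emm_notifier *e, struct mm_struct *mm,
		       enum emm_operation op,
		       unsigned long start, unsigned long end)
{
	struct my_context *ctx = container_of(e, struct my_context, notifier);
	int ret = 0;

	my_context_ref(ctx);			/* common entry code */

	switch (op) {
	case emm_invalidate_start:
		ret = my_block_and_flush(ctx, start, end);
		break;
	case emm_invalidate_end:
		my_allow_faults(ctx, start, end);
		break;
	case emm_referenced:
		ret = my_test_referenced(ctx, start, end);
		break;
	case emm_release:
		my_teardown(ctx);
		break;
	}

	my_context_unref(ctx);			/* common exit code */
	return ret;
}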



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 1/9] EMM Notifier: The notifier calls
  2008-04-02 17:59     ` Christoph Lameter
  2008-04-02 19:03       ` EMM: Fixup return value handling of emm_notify() Christoph Lameter
  2008-04-02 21:05       ` EMM: Require single threadedness for registration Christoph Lameter
@ 2008-04-02 21:53       ` Andrea Arcangeli
  2008-04-02 21:54         ` Christoph Lameter
  2 siblings, 1 reply; 51+ messages in thread
From: Andrea Arcangeli @ 2008-04-02 21:53 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Hugh Dickins, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman, Nick Piggin

On Wed, Apr 02, 2008 at 10:59:50AM -0700, Christoph Lameter wrote:
> Did I see #v10? Could you start a new subject when you post please? Do 
> not respond to some old message otherwise the threading will be wrong.

I wasn't clear enough, #v10 was in the works... I was thinking about
the last two issues before posting it.

> How exactly does the GRU corrupt memory?

Jack added synchronize_rcu, I assume for a reason.

>  
> >    Another less obviously safe approach is to allow the register
> >    method to succeed only when mm_users=1 and the task is single
> >    threaded. This way if all the places where the mmu notifers aren't
> >    invoked on the mm not by the current task, are only doing
> >    invalidates after/before zapping ptes, if the istantiation of new
> >    ptes is single threaded too, we shouldn't worry if we miss an
> >    invalidate for a pte that is zero and doesn't point to any physical
> >    page. In the places where current->mm != mm I'm using
> >    invalidate_page 99% of the time, and that only follows the
> >    ptep_clear_flush. The problem are the range_begin that will happen
> >    before zapping the pte in places where current->mm !=
> >    mm. Unfortunately in my incremental patch where I move all
> >    invalidate_page outside of the PT lock to prepare for allowing
> >    sleeping inside the mmu notifiers, I used range_begin/end in places
> >    like try_to_unmap_cluster where current->mm != mm. In general
> >    this solution looks more fragile than the seqlock.
> 
> Hmmm... Okay that is one solution that would just require a BUG_ON in the 
> registration methods.

Perhaps you didn't notice that this solution can't work if you call
range_begin/end not in the "current" context and try_to_unmap_cluster
does exactly that for both my patchset and yours. Missing an _end is
ok, missing a _begin is never ok.

> Well, doesn't the requirement of just one execution thread also deal with 
> that issue?

Yes, except again it can't work for try_to_unmap_cluster.

This solution is only applicable to #v10 if I fix try_to_unmap_cluster
to only call invalidate_page (relying on the fact that the VM holds a pin
and a lock on any page that is being mmu-notifier-invalidated).

You can't use the single-threaded approach to solve either 1 or 2,
because your _begin call can be invoked anywhere, and that's where you
call the secondary-TLB flush, so it's fatal to miss it.

invalidate_page is always called after, so it enforces the TLB flush
to happen _after_ and is therefore inherently safe.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 1/9] EMM Notifier: The notifier calls
  2008-04-02 21:53       ` [patch 1/9] EMM Notifier: The notifier calls Andrea Arcangeli
@ 2008-04-02 21:54         ` Christoph Lameter
  2008-04-02 22:09           ` Andrea Arcangeli
  0 siblings, 1 reply; 51+ messages in thread
From: Christoph Lameter @ 2008-04-02 21:54 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman, Nick Piggin

On Wed, 2 Apr 2008, Andrea Arcangeli wrote:

> > Hmmm... Okay that is one solution that would just require a BUG_ON in the 
> > registration methods.
> 
> Perhaps you didn't notice that this solution can't work if you call
> range_begin/end not in the "current" context and try_to_unmap_cluster
> does exactly that for both my patchset and yours. Missing an _end is
> ok, missing a _begin is never ok.

If you look at the patch you will see a requirement of holding a 
write lock on mmap_sem, which will keep out get_user_pages().

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 5/9] Convert anon_vma lock to rw_sem and refcount
  2008-04-02 18:15     ` Christoph Lameter
@ 2008-04-02 21:56       ` Andrea Arcangeli
  2008-04-02 21:56         ` Christoph Lameter
  0 siblings, 1 reply; 51+ messages in thread
From: Andrea Arcangeli @ 2008-04-02 21:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Hugh Dickins, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Wed, Apr 02, 2008 at 11:15:26AM -0700, Christoph Lameter wrote:
> On Wed, 2 Apr 2008, Andrea Arcangeli wrote:
> 
> > On Tue, Apr 01, 2008 at 01:55:36PM -0700, Christoph Lameter wrote:
> > >   This results in f.e. the Aim9 brk performance test to got down by 10-15%.
> > 
> > I guess it's more likely because of overscheduling for small crtitical
> > sections, did you counted the total number of context switches? I
> > guess there will be a lot more with your patch applied. That
> > regression is a showstopper and it is the reason why I've suggested
> > before to add a CONFIG_XPMEM or CONFIG_MMU_NOTIFIER_SLEEP config
> > option to make the VM locks sleep capable only when XPMEM=y
> > (PREEMPT_RT will enable it too). Thanks for doing the benchmark work!
> 
> There are more context switches if locks are contended. 
> 
> But that actually also has some good aspects, because we avoid busy loops 
> and can potentially continue work in another process.

That would be the case if the "wait time" were longer than the
scheduling time; the whole point is that with anon_vma the write side
is so fast it's likely never worth scheduling (probably not even with
preempt-rt for the write side; the read side is an entirely different
matter, but the read side can run concurrently if the system is doing heavy
paging), hence the slowdown. What you benchmarked is the write side,
which is also the fast path when the system is heavily CPU bound. I've
to say aim is a great benchmark to test this regression.

But I think a config option will solve all of this.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 5/9] Convert anon_vma lock to rw_sem and refcount
  2008-04-02 21:56       ` Andrea Arcangeli
@ 2008-04-02 21:56         ` Christoph Lameter
  2008-04-02 22:12           ` Andrea Arcangeli
  0 siblings, 1 reply; 51+ messages in thread
From: Christoph Lameter @ 2008-04-02 21:56 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Wed, 2 Apr 2008, Andrea Arcangeli wrote:

> paging), hence the slowdown. What you benchmarked is the write side,
> which is also the fast path when the system is heavily CPU bound. I've
> to say aim is a great benchmark to test this regression.

I am a bit surprised that brk performance is that important. There may be 
other measurements that have to be made to assess how this would impact a 
real load.




^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: EMM: Require single threadedness for registration.
  2008-04-02 21:05       ` EMM: Require single threadedness for registration Christoph Lameter
@ 2008-04-02 22:01         ` Andrea Arcangeli
  2008-04-02 22:06           ` Christoph Lameter
  0 siblings, 1 reply; 51+ messages in thread
From: Andrea Arcangeli @ 2008-04-02 22:01 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Hugh Dickins, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman, Nick Piggin

On Wed, Apr 02, 2008 at 02:05:28PM -0700, Christoph Lameter wrote:
> Here is a patch to require single threaded execution during emm_register. 
> This also allows an easy implementation of an unregister function and gets
> rid of the races that Andrea worried about.

That would work for #v10 if I remove the invalidate_range_start from
try_to_unmap_cluster, it can't work for EMM because you've
emm_invalidate_start firing anywhere outside the context of the
current task (even regular rmap code, not just nonlinear corner case
will trigger the race). In short the single threaded approach would be
workable only thanks to the fact that #v10 has the notion of
invalidate_page for flushing the tlb _after_, and to avoid blocking the
secondary page fault during swapping. In the kvm case I don't want to
block the page fault for anything but madvise, which is strictly only
used after the guest has inflated the balloon; the existence of
invalidate_page allows that optimization and allows us not to serialize
against the kvm page fault during all regular page faults, when
invalidate_page is called while the page is pinned by the VM.

The requirement for invalidate_page is that the pte and linux tlb are
flushed _before_ and the page is freed _after_ the invalidate_page
method. that's not the case for _begin/_end. The page is freed well
before _end runs, hence the need of _begin and to block the secondary
mmu page fault during the vma-mangling operations.

#v10 takes care of all this, and although I could perhaps fix the
remaining two issues using the single-threaded enforcement I
suggested, I preferred to play it safe and spend an unsigned per mm in
case anybody needs to attach at runtime; the single-threaded restriction
didn't look very clean.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: EMM: Require single threadedness for registration.
  2008-04-02 22:01         ` Andrea Arcangeli
@ 2008-04-02 22:06           ` Christoph Lameter
  2008-04-02 22:17             ` Andrea Arcangeli
  0 siblings, 1 reply; 51+ messages in thread
From: Christoph Lameter @ 2008-04-02 22:06 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman, Nick Piggin

On Thu, 3 Apr 2008, Andrea Arcangeli wrote:

> That would work for #v10 if I remove the invalidate_range_start from
> try_to_unmap_cluster, it can't work for EMM because you've
> emm_invalidate_start firing anywhere outside the context of the
> current task (even regular rmap code, not just nonlinear corner case
> will trigger the race). In short the single threaded approach would be

But in that case it will be firing for a callback to another mm_struct. 
The notifiers are bound to mm_structs and keep separate contexts.

> The requirement for invalidate_page is that the pte and linux tlb are
> flushed _before_ and the page is freed _after_ the invalidate_page
> method. that's not the case for _begin/_end. The page is freed well
> before _end runs, hence the need of _begin and to block the secondary
> mmu page fault during the vma-mangling operations.

You could flush in _begin and free on _end? I thought you are taking a 
refcount on the page? You can drop the refcount only on _end to ensure 
that the page does not go away before.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 1/9] EMM Notifier: The notifier calls
  2008-04-02 21:54         ` Christoph Lameter
@ 2008-04-02 22:09           ` Andrea Arcangeli
  2008-04-02 23:04             ` Christoph Lameter
  0 siblings, 1 reply; 51+ messages in thread
From: Andrea Arcangeli @ 2008-04-02 22:09 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Hugh Dickins, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman, Nick Piggin

On Wed, Apr 02, 2008 at 02:54:52PM -0700, Christoph Lameter wrote:
> On Wed, 2 Apr 2008, Andrea Arcangeli wrote:
> 
> > > Hmmm... Okay that is one solution that would just require a BUG_ON in the 
> > > registration methods.
> > 
> > Perhaps you didn't notice that this solution can't work if you call
> > range_begin/end not in the "current" context and try_to_unmap_cluster
> > does exactly that for both my patchset and yours. Missing an _end is
> > ok, missing a _begin is never ok.
> 
> If you look at the patch you will see a requirement of holding a 
> writelock on mmap_sem which will keep out get_user_pages().

I said try_to_unmap_cluster, not get_user_pages.

  CPU0					CPU1
  try_to_unmap_cluster:
  emm_invalidate_start in EMM (or mmu_notifier_invalidate_range_start in #v10)
  walking the list by hand in EMM (or with hlist cleaner in #v10)
  xpmem method invoked
  schedule for a long while inside invalidate_range_start while skbs are sent
					gru registers
					synchronize_rcu (sorry useless now)
					single threaded, so taking a page fault
  					secondary tlb instantiated
  xpm method returns
  end of the list (didn't notice that it has to restart to flush the gru)
  zap pte
  free the page
					gru corrupts memory

CPU1 was single threaded; CPU0 doesn't hold the mmap_sem or any other
lock that could ever serialize against the GRU, as far as I can tell.

In general my #v10 solution mixing seqlock + rcu looks more robust and
allows multithreaded attachment of mmu notifiers as well. I could have
fixed it with the single-threaded approach, thanks to the fact that the
only place outside mm->mmap_sem is try_to_unmap_cluster for me, but it
wasn't simple to convert, nor worth it, given that nonlinear isn't worth
optimizing for (not even the core VM cares about try_to_unmap_cluster,
which is in fact the only place in the VM with O(N) complexity for
each try_to_unmap call, where N is the size of the mapping divided by
page_size).

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 5/9] Convert anon_vma lock to rw_sem and refcount
  2008-04-02 21:56         ` Christoph Lameter
@ 2008-04-02 22:12           ` Andrea Arcangeli
  0 siblings, 0 replies; 51+ messages in thread
From: Andrea Arcangeli @ 2008-04-02 22:12 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Hugh Dickins, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Wed, Apr 02, 2008 at 02:56:25PM -0700, Christoph Lameter wrote:
> I am a bit surprised that brk performance is that important. There may be 

I think it's not brk but fork that is being slowed down; did you
oprofile? AIM forks a lot... The write-side fast path generating the
overscheduling, I guess, is when the new vmas are created for the child
and queued in the parent anon_vma in O(1), so it is immediate; even
preempt-rt would be ok with it spinning and not scheduling, as it's just
a list_add (much faster than schedule() indeed). Every time there's a
collision, when multiple children fork simultaneously and they all try to
queue in the same anon_vma, things will slow down.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: EMM: Require single threadedness for registration.
  2008-04-02 22:06           ` Christoph Lameter
@ 2008-04-02 22:17             ` Andrea Arcangeli
  2008-04-02 22:41               ` Christoph Lameter
  2008-04-03  1:24               ` EMM: disable other notifiers before register and unregister Christoph Lameter
  0 siblings, 2 replies; 51+ messages in thread
From: Andrea Arcangeli @ 2008-04-02 22:17 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Hugh Dickins, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman, Nick Piggin

On Wed, Apr 02, 2008 at 03:06:19PM -0700, Christoph Lameter wrote:
> On Thu, 3 Apr 2008, Andrea Arcangeli wrote:
> 
> > That would work for #v10 if I remove the invalidate_range_start from
> > try_to_unmap_cluster, it can't work for EMM because you've
> > emm_invalidate_start firing anywhere outside the context of the
> > current task (even regular rmap code, not just nonlinear corner case
> > will trigger the race). In short the single threaded approach would be
> 
> But in that case it will be firing for a callback to another mm_struct. 
> The notifiers are bound to mm_structs and keep separate contexts.

Why can't it fire on the mm_struct where GRU just registered? That
mm_struct existed way before GRU registered, and VM is free to unmap
it w/o mmap_sem if there was any memory pressure.

> You could flush in _begin and free on _end? I thought you are taking a 
> refcount on the page? You can drop the refcount only on _end to ensure 
> that the page does not go away before.

We're going to lock + flush on _begin and unlock on _end without
refcounting, to micro-optimize. Freeing is done by
unmap_vmas/madvise/munmap at will. That's a very slow path; inflating
the balloon is not problematic. But invalidate_page allows us to avoid
blocking page faults during swapping, so minor faults can happen and
refresh the pte young bits etc... When the VM unmaps the page while
holding the page pin, there's no race, and that's where invalidate_page
is being used to generate lower invalidation overhead.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: EMM: Require single threadedness for registration.
  2008-04-02 22:17             ` Andrea Arcangeli
@ 2008-04-02 22:41               ` Christoph Lameter
  2008-04-03  1:24               ` EMM: disable other notifiers before register and unregister Christoph Lameter
  1 sibling, 0 replies; 51+ messages in thread
From: Christoph Lameter @ 2008-04-02 22:41 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman, Nick Piggin

On Thu, 3 Apr 2008, Andrea Arcangeli wrote:

> Why can't it fire on the mm_struct where GRU just registered? That
> mm_struct existed way before GRU registered, and VM is free to unmap
> it w/o mmap_sem if there was any memory pressure.

Right. Hmmm... Bad situation. We would have invalidate_start take
a lock to prevent registration until _end has run.

We could use stop_machine_run to register the notifier.... ;-).


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 1/9] EMM Notifier: The notifier calls
  2008-04-02 22:09           ` Andrea Arcangeli
@ 2008-04-02 23:04             ` Christoph Lameter
  0 siblings, 0 replies; 51+ messages in thread
From: Christoph Lameter @ 2008-04-02 23:04 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman, Nick Piggin

On Thu, 3 Apr 2008, Andrea Arcangeli wrote:

> I said try_to_unmap_cluster, not get_user_pages.
> 
>   CPU0					CPU1
>   try_to_unmap_cluster:
>   emm_invalidate_start in EMM (or mmu_notifier_invalidate_range_start in #v10)
>   walking the list by hand in EMM (or with hlist cleaner in #v10)
>   xpmem method invoked
>   schedule for a long while inside invalidate_range_start while skbs are sent
> 					gru registers
> 					synchronize_rcu (sorry useless now)

All of this would be much easier if you could stop the drivel. The sync 
rcu was for an earlier release of the mmu notifier. Why the sniping?

> 					single threaded, so taking a page fault
>   					secondary tlb instantiated

The driver must not allow faults to occur between start and end. The 
trouble here is that GRU and XPMEM are mixed. If CPU0 had been 
running GRU instead of XPMEM then the fault would not have occurred, 
because the GRU would have noticed that a range op is active. If both
systems had been running XPMEM then the same would have worked.
 
I guess this means that an address space cannot reliably be registered 
with multiple subsystems if some of those do not take a refcount. If all 
drivers were required to take a refcount then this would also not 
occur.
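
Roughly, a subscriber that does not pin pages would need something like
the following (sketch only; all my_gru_* names are hypothetical, and the
recheck after the wait, range filtering and error handling are omitted):

#include <linux/kernel.h>
#include <linux/rmap.h>
#include <linux/spinlock.h>
#include <linux/wait.h>

struct my_gru_ctx {
	struct emm_notifier	notifier;
	spinlock_t		lock;
	int			invalidates_pending;
	wait_queue_head_t	wq;
};

static int my_gru_callback(struct emm_notifier *e, struct mm_struct *mm,
			   enum emm_operation op,
			   unsigned long start, unsigned long end)
{
	struct my_gru_ctx *g = container_of(e, struct my_gru_ctx, notifier);

	switch (op) {
	case emm_invalidate_start:
		spin_lock(&g->lock);
		g->invalidates_pending++;	/* hold off new faults */
		spin_unlock(&g->lock);
		my_gru_flush_tlb(g, start, end);	/* hypothetical */
		break;
	case emm_invalidate_end:
		spin_lock(&g->lock);
		if (!--g->invalidates_pending)
			wake_up(&g->wq);
		spin_unlock(&g->lock);
		break;
	default:
		break;
	}
	return 0;
}

/* Secondary fault path: wait until no invalidate is in flight before
 * instantiating a new external reference. */
static void my_gru_fault(struct my_gru_ctx *g, unsigned long address)
{
	wait_event(g->wq, g->invalidates_pending == 0);
	my_gru_instantiate(g, address);			/* hypothetical */
}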

> In general my #v10 solution mixing seqlock + rcu looks more robust and
> allows multithreaded attachment of mmu notifiers as well. I could have

Well, it's easy to say that if no one else has looked at it yet. I expressed 
some concerns in reply to your post of #v10.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* EMM: disable other notifiers before register and unregister
  2008-04-02 22:17             ` Andrea Arcangeli
  2008-04-02 22:41               ` Christoph Lameter
@ 2008-04-03  1:24               ` Christoph Lameter
  2008-04-03 10:40                 ` Peter Zijlstra
  2008-04-03 15:29                 ` Andrea Arcangeli
  1 sibling, 2 replies; 51+ messages in thread
From: Christoph Lameter @ 2008-04-03  1:24 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman, Nick Piggin

Ok, let's forget about the single-threaded thing to solve the registration 
races. As Andrea pointed out this still has issues with other subscribed 
subsystems (and also try_to_unmap). We could do something like what 
stop_machine_run does: first disable all running subsystems before 
registering a new one.

Maybe this is a possible solution.


Subject: EMM: disable other notifiers before register and unregister

As Andrea has pointed out: There are races during registration if other
subsystem notifiers are active while we register a callback.

Solve that issue by adding two new notifiers:

emm_stop
	Stops the notifier operations. Notifier must block on
	invalidate_start and emm_referenced from this point on.
	If an invalidate_start has not been completed by a call
	to invalidate_end then the driver must wait until the
	operation is complete before returning.

emm_start
	Restart notifier operations.

Before registration all other subscribed subsystems are stopped.
Then the new subsystem is subscribed and things can get running
without consistency issues.

Subsystems are restarted after the lists have been updated.

This also works for unregistering. If we can get all subsystems
to stop then we can also reliably unregister a subsystem. So
provide that callback.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/rmap.h |   10 +++++++---
 mm/rmap.c            |   30 ++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+), 3 deletions(-)

Index: linux-2.6/include/linux/rmap.h
===================================================================
--- linux-2.6.orig/include/linux/rmap.h	2008-04-02 18:16:07.906032549 -0700
+++ linux-2.6/include/linux/rmap.h	2008-04-02 18:17:10.291070009 -0700
@@ -94,7 +94,9 @@ enum emm_operation {
 	emm_release,		/* Process exiting, */
 	emm_invalidate_start,	/* Before the VM unmaps pages */
 	emm_invalidate_end,	/* After the VM unmapped pages */
- 	emm_referenced		/* Check if a range was referenced */
+ 	emm_referenced,		/* Check if a range was referenced */
+	emm_stop,		/* Halt all faults/invalidate_starts */
+	emm_start,		/* Restart operations */
 };
 
 struct emm_notifier {
@@ -126,13 +128,15 @@ static inline int emm_notify(struct mm_s
 
 /*
  * Register a notifier with an mm struct. Release occurs when the process
- * terminates by calling the notifier function with emm_release.
+ * terminates by calling the notifier function with emm_release or when
+ * emm_notifier_unregister is called.
  *
  * Must hold the mmap_sem for write.
  */
 extern void emm_notifier_register(struct emm_notifier *e,
 					struct mm_struct *mm);
-
+extern void emm_notifier_unregister(struct emm_notifier *e,
+					struct mm_struct *mm);
 
 /*
  * Called from mm/vmscan.c to handle paging out
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c	2008-04-02 18:16:09.378057062 -0700
+++ linux-2.6/mm/rmap.c	2008-04-02 18:16:10.710079201 -0700
@@ -289,16 +289,46 @@ void emm_notifier_release(struct mm_stru
 /* Register a notifier */
 void emm_notifier_register(struct emm_notifier *e, struct mm_struct *mm)
 {
+	/* Bring all other notifiers into a quiescent state */
+	emm_notify(mm, emm_stop, 0, TASK_SIZE);
+
 	e->next = mm->emm_notifier;
+
 	/*
 	 * The update to emm_notifier (e->next) must be visible
 	 * before the pointer becomes visible.
 	 * rcu_assign_pointer() does exactly what we need.
 	 */
 	rcu_assign_pointer(mm->emm_notifier, e);
+
+	/* Continue notifiers */
+	emm_notify(mm, emm_start, 0, TASK_SIZE);
 }
 EXPORT_SYMBOL_GPL(emm_notifier_register);
 
+/* Unregister a notifier */
+void emm_notifier_unregister(struct emm_notifier *e, struct mm_struct *mm)
+{
+	struct emm_notifier *p;
+
+	emm_notify(mm, emm_stop, 0, TASK_SIZE);
+
+	p = mm->emm_notifier;
+	if (e == p)
+		mm->emm_notifier = e->next;
+	else {
+		while (p->next != e)
+			p = p->next;
+
+		p->next = e->next;
+	}
+	e->next = mm->emm_notifier;
+
+	emm_notify(mm, emm_start, 0, TASK_SIZE);
+	e->callback(e, mm, emm_release, 0, TASK_SIZE);
+}
+EXPORT_SYMBOL_GPL(emm_notifier_unregister);
+
 /*
  * Perform a callback
  *


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: EMM: Fixup return value handling of emm_notify()
  2008-04-02 21:33           ` Christoph Lameter
@ 2008-04-03 10:40             ` Peter Zijlstra
  2008-04-03 15:00               ` Andrea Arcangeli
  2008-04-03 19:14               ` Christoph Lameter
  0 siblings, 2 replies; 51+ messages in thread
From: Peter Zijlstra @ 2008-04-03 10:40 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Hugh Dickins, Robin Holt, Avi Kivity,
	Izik Eidus, kvm-devel, general, Steve Wise, Roland Dreier,
	Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman,
	Nick Piggin

On Wed, 2008-04-02 at 14:33 -0700, Christoph Lameter wrote:
> On Wed, 2 Apr 2008, Andrea Arcangeli wrote:
> 
> > but anyway it's silly to be hardwired to such an interface that worst
> > of all requires switch statements instead of proper pointer to
> > functions and a fixed set of parameters and retval semantics for all
> > methods.
> 
> The EMM API with a single callback is the simplest approach at this point. 
> A common callback for all operations allows the driver to implement common 
> entry and exit code as seen in XPMem.

It seems to me that common code can be shared using functions? No need
to stuff everything into a single function. We have method vectors all
over the kernel, we could do a_ops as a single callback too, but we
dont.

FWIW I prefer separate methods.
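
Concretely, something along the lines of the following shape (just a
sketch to illustrate per-operation methods; not a proposal of the exact
signatures):

struct emm_notifier_ops {
	void (*release)(struct emm_notifier *e, struct mm_struct *mm);
	void (*invalidate_start)(struct emm_notifier *e, struct mm_struct *mm,
				 unsigned long start, unsigned long end);
	void (*invalidate_end)(struct emm_notifier *e, struct mm_struct *mm,
			       unsigned long start, unsigned long end);
	int (*referenced)(struct emm_notifier *e, struct mm_struct *mm,
			  unsigned long start, unsigned long end);
};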

> I guess we can complicate this more by switching to a different API or 
> adding additional emm_xxx() callback if need be but I really want to have 
> a strong case for why this would be needed. There is the danger of 
> adding frills with special callbacks in this and that situation that could 
> make the notifier complicated and specific to a certain usage scenario. 
> 
> Having this generic simple interface will hopefully avoid such things.
> 
> 


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: EMM: disable other notifiers before register and unregister
  2008-04-03  1:24               ` EMM: disable other notifiers before register and unregister Christoph Lameter
@ 2008-04-03 10:40                 ` Peter Zijlstra
  2008-04-03 15:29                 ` Andrea Arcangeli
  1 sibling, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2008-04-03 10:40 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Hugh Dickins, Robin Holt, Avi Kivity,
	Izik Eidus, kvm-devel, general, Steve Wise, Roland Dreier,
	Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman,
	Nick Piggin

On Wed, 2008-04-02 at 18:24 -0700, Christoph Lameter wrote:
> Ok, let's forget about the single-threaded thing to solve the registration 
> races. As Andrea pointed out this still has issues with other subscribed 
> subsystems (and also try_to_unmap). We could do something like what 
> stop_machine_run does: first disable all running subsystems before 
> registering a new one.
> 
> Maybe this is a possible solution.
> 
> 
> Subject: EMM: disable other notifiers before register and unregister
> 
> As Andrea has pointed out: There are races during registration if other
> subsystem notifiers are active while we register a callback.
> 
> Solve that issue by adding two new notifiers:
> 
> emm_stop
> 	Stops the notifier operations. Notifier must block on
> 	invalidate_start and emm_referenced from this point on.
> 	If an invalidate_start has not been completed by a call
> 	to invalidate_end then the driver must wait until the
> 	operation is complete before returning.
> 
> emm_start
> 	Restart notifier operations.

Please use pause and resume or something like that. stop-start is an
unnatural order; we usually start before we stop, whereas we pause first
and resume later.

> Before registration all other subscribed subsystems are stopped.
> Then the new subsystem is subscribed and things can get running
> without consistency issues.
> 
> Subsystems are restarted after the lists have been updated.
> 
> This also works for unregistering. If we can get all subsystems
> to stop then we can also reliably unregister a subsystem. So
> provide that callback.
> 
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> 
> ---
>  include/linux/rmap.h |   10 +++++++---
>  mm/rmap.c            |   30 ++++++++++++++++++++++++++++++
>  2 files changed, 37 insertions(+), 3 deletions(-)
> 
> Index: linux-2.6/include/linux/rmap.h
> ===================================================================
> --- linux-2.6.orig/include/linux/rmap.h	2008-04-02 18:16:07.906032549 -0700
> +++ linux-2.6/include/linux/rmap.h	2008-04-02 18:17:10.291070009 -0700
> @@ -94,7 +94,9 @@ enum emm_operation {
>  	emm_release,		/* Process exiting, */
>  	emm_invalidate_start,	/* Before the VM unmaps pages */
>  	emm_invalidate_end,	/* After the VM unmapped pages */
> - 	emm_referenced		/* Check if a range was referenced */
> + 	emm_referenced,		/* Check if a range was referenced */
> +	emm_stop,		/* Halt all faults/invalidate_starts */
> +	emm_start,		/* Restart operations */
>  };
>  
>  struct emm_notifier {
> @@ -126,13 +128,15 @@ static inline int emm_notify(struct mm_s
>  
>  /*
>   * Register a notifier with an mm struct. Release occurs when the process
> - * terminates by calling the notifier function with emm_release.
> + * terminates by calling the notifier function with emm_release or when
> + * emm_notifier_unregister is called.
>   *
>   * Must hold the mmap_sem for write.
>   */
>  extern void emm_notifier_register(struct emm_notifier *e,
>  					struct mm_struct *mm);
> -
> +extern void emm_notifier_unregister(struct emm_notifier *e,
> +					struct mm_struct *mm);
>  
>  /*
>   * Called from mm/vmscan.c to handle paging out
> Index: linux-2.6/mm/rmap.c
> ===================================================================
> --- linux-2.6.orig/mm/rmap.c	2008-04-02 18:16:09.378057062 -0700
> +++ linux-2.6/mm/rmap.c	2008-04-02 18:16:10.710079201 -0700
> @@ -289,16 +289,46 @@ void emm_notifier_release(struct mm_stru
>  /* Register a notifier */
>  void emm_notifier_register(struct emm_notifier *e, struct mm_struct *mm)
>  {
> +	/* Bring all other notifiers into a quiescent state */
> +	emm_notify(mm, emm_stop, 0, TASK_SIZE);
> +
>  	e->next = mm->emm_notifier;
> +
>  	/*
>  	 * The update to emm_notifier (e->next) must be visible
>  	 * before the pointer becomes visible.
>  	 * rcu_assign_pointer() does exactly what we need.
>  	 */
>  	rcu_assign_pointer(mm->emm_notifier, e);
> +
> +	/* Continue notifiers */
> +	emm_notify(mm, emm_start, 0, TASK_SIZE);
>  }
>  EXPORT_SYMBOL_GPL(emm_notifier_register);
>  
> +/* Unregister a notifier */
> +void emm_notifier_unregister(struct emm_notifier *e, struct mm_struct *mm)
> +{
> +	struct emm_notifier *p;
> +
> +	emm_notify(mm, emm_stop, 0, TASK_SIZE);
> +
> +	p = mm->emm_notifier;
> +	if (e == p)
> +		mm->emm_notifier = e->next;
> +	else {
> +		while (p->next != e)
> +			p = p->next;
> +
> +		p->next = e->next;
> +	}
> +	e->next = mm->emm_notifier;
> +
> +	emm_notify(mm, emm_start, 0, TASK_SIZE);
> +	e->callback(e, mm, emm_release, 0, TASK_SIZE);
> +}
> +EXPORT_SYMBOL_GPL(emm_notifier_unregister);
> +
>  /*
>   * Perform a callback
>   *
> 


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: EMM: Fixup return value handling of emm_notify()
  2008-04-03 10:40             ` Peter Zijlstra
@ 2008-04-03 15:00               ` Andrea Arcangeli
  2008-04-03 19:14               ` Christoph Lameter
  1 sibling, 0 replies; 51+ messages in thread
From: Andrea Arcangeli @ 2008-04-03 15:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christoph Lameter, Hugh Dickins, Robin Holt, Avi Kivity,
	Izik Eidus, kvm-devel, general, Steve Wise, Roland Dreier,
	Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman,
	Nick Piggin

On Thu, Apr 03, 2008 at 12:40:46PM +0200, Peter Zijlstra wrote:
> It seems to me that common code can be shared using functions? No need
> FWIW I prefer separate methods.

The kvm patch using mmu notifiers indeed shares 99% of the code between
the two different methods it implements, too. Code sharing is the same
either way, and if anything, pointers to functions will be faster if gcc
isn't smart enough to build a compile-time jump table and has to check
every case of the switch to reach the right address.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: EMM: disable other notifiers before register and unregister
  2008-04-03  1:24               ` EMM: disable other notifiers before register and unregister Christoph Lameter
  2008-04-03 10:40                 ` Peter Zijlstra
@ 2008-04-03 15:29                 ` Andrea Arcangeli
  2008-04-03 19:20                   ` Christoph Lameter
  1 sibling, 1 reply; 51+ messages in thread
From: Andrea Arcangeli @ 2008-04-03 15:29 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Hugh Dickins, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman, Nick Piggin

On Wed, Apr 02, 2008 at 06:24:15PM -0700, Christoph Lameter wrote:
> Ok, let's forget about the single-threaded thing to solve the registration 
> races. As Andrea pointed out this still has issues with other subscribed 
> subsystems (and also try_to_unmap). We could do something like what 
> stop_machine_run does: first disable all running subsystems before 
> registering a new one.
> 
> Maybe this is a possible solution.

It still doesn't solve this kernel crash.

   CPU0				CPU1
   range_start (mmu notifier chain is empty)
   range_start returns
				mmu_notifier_register
				kvm_emm_stop (how kvm can ever know
				the other cpu is in the middle of the critical section?)
				kvm page fault (kvm thinks mmu_notifier_register serialized)
   zap ptes
   free_page mapped by spte/GRU and not pinned -> crash


There's no way the lowlevel driver can stop mmu_notifier_register, and if
mmu_notifier_register returns, then sptes will be instantiated and
it'll corrupt memory the same way.

The seqlock was fine; what is wrong is the assumption that we can let
the lowlevel driver handle a range_end happening without a range_begin
before it. The problem is that by design the lowlevel driver can't handle
a range_end happening without a range_begin before it. This is the core
kernel-crashing problem we have (it's a kernel-crashing issue only for
drivers that don't pin the pages, so XPMEM wouldn't crash but it would
still leak memory, which is a more graceful failure than random mm
corruption).

The basic trouble is that sometimes range_begin/end critical sections
run outside the mmap_sem (see try_to_unmap_cluster in #v10, or even
try_to_unmap_one, though only in EMM-V2).

My attempt to fix this once and for all is to walk all vmas of the
"mm" inside mmu_notifier_register and take all anon_vma locks and
i_mmap_locks in virtual address order in a row. It's ok to take those
inside the mmap_sem. Supposedly if anybody will ever take a double
lock it'll do in order too. Then I can dump all the other locking and
remove the seqlock, and the driver is guaranteed there will be a
single call of range_begin followed by a single call of range_end the
whole time and no race could ever happen, and there won't be repeated
calls of range_begin that would screw up a recursive semaphore
locking. The patch won't be pretty; I guess I'll vmalloc an array of
pointers to locks to reorder them. It doesn't need to be fast. Also
the locks can't go away from under us while we hold
down_write(mmap_sem), because the vmas can be altered only with
down_write(mmap_sem) (modulo vm_start/vm_end, which can be modified with
only down_read(mmap_sem) + page_table_lock as in growsdown page
faults). So it should be ok to take all those locks inside the
mmap_sem and implement a lock_vm(mm) unlock_vm(mm). I'll think more
about this hammer approach while I try to implement it...
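
To give an idea of what I mean, a rough sketch (not the actual patch:
this assumes the current spinlock flavour of anon_vma->lock and
i_mmap_lock, orders by lock address rather than virtual address to get a
single global order and to skip duplicates, and leaves out the unlock
path and the lockdep details):

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/rmap.h>
#include <linux/sort.h>
#include <linux/vmalloc.h>

static int lock_cmp(const void *a, const void *b)
{
	const spinlock_t *la = *(spinlock_t * const *)a;
	const spinlock_t *lb = *(spinlock_t * const *)b;

	if (la == lb)
		return 0;
	return la < lb ? -1 : 1;
}

/*
 * Collect every i_mmap_lock and anon_vma lock reachable from the vmas,
 * sort the pointers and take each lock once. The caller holds
 * down_write(&mm->mmap_sem) so the vma list cannot change under us.
 * unlock_vm() would walk the same array in reverse and then vfree it.
 */
static spinlock_t **example_lock_vm(struct mm_struct *mm, int *nr_locks)
{
	struct vm_area_struct *vma;
	spinlock_t **locks;
	int nr = 0, i;

	locks = vmalloc(2 * mm->map_count * sizeof(*locks));
	if (!locks)
		return NULL;

	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		if (vma->vm_file && vma->vm_file->f_mapping)
			locks[nr++] = &vma->vm_file->f_mapping->i_mmap_lock;
		if (vma->anon_vma)
			locks[nr++] = &vma->anon_vma->lock;
	}

	sort(locks, nr, sizeof(*locks), lock_cmp, NULL);

	for (i = 0; i < nr; i++)
		if (i == 0 || locks[i] != locks[i - 1])
			spin_lock(locks[i]);	/* duplicates taken once */

	*nr_locks = nr;
	return locks;
}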

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: EMM: Fixup return value handling of emm_notify()
  2008-04-03 10:40             ` Peter Zijlstra
  2008-04-03 15:00               ` Andrea Arcangeli
@ 2008-04-03 19:14               ` Christoph Lameter
  1 sibling, 0 replies; 51+ messages in thread
From: Christoph Lameter @ 2008-04-03 19:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Hugh Dickins, Robin Holt, Avi Kivity,
	Izik Eidus, kvm-devel, general, Steve Wise, Roland Dreier,
	Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman,
	Nick Piggin

On Thu, 3 Apr 2008, Peter Zijlstra wrote:

> It seems to me that common code can be shared using functions? No need
> to stuff everything into a single function. We have method vectors all
> over the kernel, we could do a_ops as a single callback too, but we
> dont.
> 
> FWIW I prefer separate methods.

Ok. It seems that I already added some new methods which do not use all 
parameters. So lets switch back to the old scheme for the next release.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: EMM: disable other notifiers before register and unregister
  2008-04-03 15:29                 ` Andrea Arcangeli
@ 2008-04-03 19:20                   ` Christoph Lameter
  2008-04-03 20:23                     ` Christoph Lameter
                                       ` (2 more replies)
  0 siblings, 3 replies; 51+ messages in thread
From: Christoph Lameter @ 2008-04-03 19:20 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman, Nick Piggin

On Thu, 3 Apr 2008, Andrea Arcangeli wrote:

> My attempt to fix this once and for all is to walk all vmas of the
> "mm" inside mmu_notifier_register and take all anon_vma locks and
> i_mmap_locks in virtual address order in a row. It's ok to take those
> inside the mmap_sem. Supposedly if anybody will ever take a double
> lock it'll do in order too. Then I can dump all the other locking and

What about concurrent mmu_notifier registrations from two mm_structs 
that have shared mappings? Isn't there a potential deadlock situation?

> faults). So it should be ok to take all those locks inside the
> mmap_sem and implement a lock_vm(mm) unlock_vm(mm). I'll think more
> about this hammer approach while I try to implement it...

Well good luck. Hopefully we will get to something that works.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: EMM: disable other notifiers before register and unregister
  2008-04-03 19:20                   ` Christoph Lameter
@ 2008-04-03 20:23                     ` Christoph Lameter
  2008-04-04 12:30                     ` Andrea Arcangeli
  2008-04-04 20:20                     ` [PATCH] mmu notifier #v11 Andrea Arcangeli
  2 siblings, 0 replies; 51+ messages in thread
From: Christoph Lameter @ 2008-04-03 20:23 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Peter Zijlstra, steiner, linux-kernel, linux-mm, Nick Piggin

On Thu, 3 Apr 2008, Christoph Lameter wrote:

> > faults). So it should be ok to take all those locks inside the
> > mmap_sem and implement a lock_vm(mm) unlock_vm(mm). I'll think more
> > about this hammer approach while I try to implement it...
> 
> Well good luck. Hopefully we will get to something that works.

Another hammer to use may be the freezer from software suspend. With that 
you can get all tasks of a process into a definite state. Then take the 
mmap_sem for write. But then there are still try_to_unmap and friends that 
can race.



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: EMM: disable other notifiers before register and unregister
  2008-04-03 19:20                   ` Christoph Lameter
  2008-04-03 20:23                     ` Christoph Lameter
@ 2008-04-04 12:30                     ` Andrea Arcangeli
  2008-04-04 20:20                     ` [PATCH] mmu notifier #v11 Andrea Arcangeli
  2 siblings, 0 replies; 51+ messages in thread
From: Andrea Arcangeli @ 2008-04-04 12:30 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Hugh Dickins, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman, Nick Piggin

On Thu, Apr 03, 2008 at 12:20:41PM -0700, Christoph Lameter wrote:
> On Thu, 3 Apr 2008, Andrea Arcangeli wrote:
> 
> > My attempt to fix this once and for all is to walk all vmas of the
> > "mm" inside mmu_notifier_register and take all anon_vma locks and
> > i_mmap_locks in virtual address order in a row. It's ok to take those
> > inside the mmap_sem. Supposedly if anybody will ever take a double
> > lock it'll do in order too. Then I can dump all the other locking and
> 
> What about concurrent mmu_notifier registrations from two mm_structs
> that have shared mappings? Isn't there a potential deadlock situation?

No, the lock ordering avoids that. Here is a snippet.

/*
 * This operation locks against the VM for all pte/vma/mm related
 * operations that could ever happen on a certain mm. This includes
 * vmtruncate, try_to_unmap, and all page faults. The holder
 * must not hold any mm related lock. A single task can't take more
 * than one mm lock in a row or it would deadlock.
 */

So you can't do:

   mm_lock(mm1);
   mm_lock(mm2);

But if two different tasks run mm_lock everything is ok. Each task in
the system can lock at most one mm at a time.
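
To make the ordering argument concrete, here is a minimal sketch of the
rule itself, outside the VM code (illustrative only, the helper is not
part of any patch): when more than one of these spinlocks must be held,
they are always taken in ascending address order, so two tasks
contending on overlapping lock sets can never each hold a lock the
other is waiting for.

	#include <linux/spinlock.h>

	/*
	 * Illustrative helper, not from the patch: take two spinlocks in
	 * ascending address order.  Any two tasks that follow this rule on
	 * overlapping lock sets agree on the order, so ABBA is impossible.
	 */
	static void lock_two_ordered(spinlock_t *a, spinlock_t *b)
	{
		if (a == b) {
			spin_lock(a);
			return;
		}
		if ((unsigned long) a > (unsigned long) b) {
			spinlock_t *tmp = a;

			a = b;
			b = tmp;
		}
		spin_lock(a);
		spin_lock(b);
	}

mm_lock() below applies the same rule over the whole vma list of one
mm, which is why two tasks each locking a different mm are fine even
when some of the i_mmap_locks are shared between them.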

> Well good luck. Hopefully we will get to something that works.

Looks good so far but I didn't finish it yet.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH] mmu notifier #v11
  2008-04-03 19:20                   ` Christoph Lameter
  2008-04-03 20:23                     ` Christoph Lameter
  2008-04-04 12:30                     ` Andrea Arcangeli
@ 2008-04-04 20:20                     ` Andrea Arcangeli
  2008-04-04 22:06                       ` Christoph Lameter
  2 siblings, 1 reply; 51+ messages in thread
From: Andrea Arcangeli @ 2008-04-04 20:20 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Hugh Dickins, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman, Nick Piggin

This should guarantee that nobody can register while any of the mmu
notifiers is running, avoiding all the races, including guaranteeing
that range_start is never missed. I'll adapt the other patches to provide
the sleeping feature on top of this (only needed by XPMEM) soon. KVM
seems to run fine on top of this one.
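
For reference, a minimal sketch of how a driver would subscribe under
the API added below (illustrative only, not part of the patch; the ops
structure and mmu_notifier_register are the ones introduced by this
diff, everything named my_* is made up):

	#include <linux/mmu_notifier.h>

	static void my_invalidate_range_start(struct mmu_notifier *mn,
					      struct mm_struct *mm,
					      unsigned long start,
					      unsigned long end)
	{
		/* stop establishing new secondary mappings in [start, end) */
	}

	static void my_invalidate_range_end(struct mmu_notifier *mn,
					    struct mm_struct *mm,
					    unsigned long start,
					    unsigned long end)
	{
		/* secondary mappings for [start, end) may be re-established */
	}

	static const struct mmu_notifier_ops my_ops = {
		.invalidate_range_start	= my_invalidate_range_start,
		.invalidate_range_end	= my_invalidate_range_end,
	};

	static struct mmu_notifier my_mn = {
		.ops = &my_ops,
	};

	static void my_attach(struct mm_struct *mm)
	{
		/* must be called without mmap_sem or other mm locks held */
		mmu_notifier_register(&my_mn, mm);
	}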

Andrew can you apply this to -mm?

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Christoph Lameter <clameter@sgi.com>

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1050,6 +1050,9 @@
 				   unsigned long addr, unsigned long len,
 				   unsigned long flags, struct page **pages);
 
+extern void mm_lock(struct mm_struct *mm);
+extern void mm_unlock(struct mm_struct *mm);
+
 extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
 
 extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -225,6 +225,9 @@
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 	struct mem_cgroup *mem_cgroup;
 #endif
+#ifdef CONFIG_MMU_NOTIFIER
+	struct hlist_head mmu_notifier_list;
+#endif
 };
 
 #endif /* _LINUX_MM_TYPES_H */
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
new file mode 100644
--- /dev/null
+++ b/include/linux/mmu_notifier.h
@@ -0,0 +1,175 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/mm_types.h>
+
+struct mmu_notifier;
+struct mmu_notifier_ops;
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+struct mmu_notifier_ops {
+	/*
+	 * Called when nobody can register any more notifier in the mm
+	 * and after the "mn" notifier has been disarmed already.
+	 */
+	void (*release)(struct mmu_notifier *mn,
+			struct mm_struct *mm);
+
+	/*
+	 * clear_flush_young is called after the VM is
+	 * test-and-clearing the young/accessed bitflag in the
+	 * pte. This way the VM will provide proper aging to the
+	 * accesses to the page through the secondary MMUs and not
+	 * only to the ones through the Linux pte.
+	 */
+	int (*clear_flush_young)(struct mmu_notifier *mn,
+				 struct mm_struct *mm,
+				 unsigned long address);
+
+	/*
+	 * Before this is invoked any secondary MMU is still ok to
+	 * read/write to the page previously pointed by the Linux pte
+	 * because the old page hasn't been freed yet.  If required
+	 * set_page_dirty has to be called internally to this method.
+	 */
+	void (*invalidate_page)(struct mmu_notifier *mn,
+				struct mm_struct *mm,
+				unsigned long address);
+
+	/*
+	 * invalidate_range_start() and invalidate_range_end() must be
+	 * paired. Multiple invalidate_range_start/ends may be nested
+	 * or called concurrently.
+	 */
+	void (*invalidate_range_start)(struct mmu_notifier *mn,
+				       struct mm_struct *mm,
+				       unsigned long start, unsigned long end);
+	void (*invalidate_range_end)(struct mmu_notifier *mn,
+				     struct mm_struct *mm,
+				     unsigned long start, unsigned long end);
+};
+
+struct mmu_notifier {
+	struct hlist_node hlist;
+	const struct mmu_notifier_ops *ops;
+};
+
+static inline int mm_has_notifiers(struct mm_struct *mm)
+{
+	return unlikely(!hlist_empty(&mm->mmu_notifier_list));
+}
+
+extern void mmu_notifier_register(struct mmu_notifier *mn,
+				  struct mm_struct *mm);
+extern void __mmu_notifier_release(struct mm_struct *mm);
+extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
+					  unsigned long address);
+extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
+					  unsigned long address);
+extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+				  unsigned long start, unsigned long end);
+extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+				  unsigned long start, unsigned long end);
+
+
+static inline void mmu_notifier_release(struct mm_struct *mm)
+{
+	if (mm_has_notifiers(mm))
+		__mmu_notifier_release(mm);
+}
+
+static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
+					  unsigned long address)
+{
+	if (mm_has_notifiers(mm))
+		return __mmu_notifier_clear_flush_young(mm, address);
+	return 0;
+}
+
+static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
+					  unsigned long address)
+{
+	if (mm_has_notifiers(mm))
+		__mmu_notifier_invalidate_page(mm, address);
+}
+
+static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+				  unsigned long start, unsigned long end)
+{
+	if (mm_has_notifiers(mm))
+		__mmu_notifier_invalidate_range_start(mm, start, end);
+}
+
+static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+				  unsigned long start, unsigned long end)
+{
+	if (mm_has_notifiers(mm))
+		__mmu_notifier_invalidate_range_end(mm, start, end);
+}
+
+static inline void mmu_notifier_mm_init(struct mm_struct *mm)
+{
+	INIT_HLIST_HEAD(&mm->mmu_notifier_list);
+}
+
+#define ptep_clear_flush_notify(__vma, __address, __ptep)		\
+({									\
+	pte_t __pte;							\
+	struct vm_area_struct *___vma = __vma;				\
+	unsigned long ___address = __address;				\
+	__pte = ptep_clear_flush(___vma, ___address, __ptep);		\
+	mmu_notifier_invalidate_page(___vma->vm_mm, ___address);	\
+	__pte;								\
+})
+
+#define ptep_clear_flush_young_notify(__vma, __address, __ptep)		\
+({									\
+	int __young;							\
+	struct vm_area_struct *___vma = __vma;				\
+	unsigned long ___address = __address;				\
+	__young = ptep_clear_flush_young(___vma, ___address, __ptep);	\
+	__young |= mmu_notifier_clear_flush_young(___vma->vm_mm,	\
+						  ___address);		\
+	__young;							\
+})
+
+#else /* CONFIG_MMU_NOTIFIER */
+
+static inline void mmu_notifier_release(struct mm_struct *mm)
+{
+}
+
+static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
+					  unsigned long address)
+{
+	return 0;
+}
+
+static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
+					  unsigned long address)
+{
+}
+
+static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+				  unsigned long start, unsigned long end)
+{
+}
+
+static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+				  unsigned long start, unsigned long end)
+{
+}
+
+static inline void mmu_notifier_mm_init(struct mm_struct *mm)
+{
+}
+
+#define ptep_clear_flush_young_notify ptep_clear_flush_young
+#define ptep_clear_flush_notify ptep_clear_flush
+
+#endif /* CONFIG_MMU_NOTIFIER */
+
+#endif /* _LINUX_MMU_NOTIFIER_H */
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -53,6 +53,7 @@
 #include <linux/tty.h>
 #include <linux/proc_fs.h>
 #include <linux/blkdev.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -362,6 +363,7 @@
 
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
+		mmu_notifier_mm_init(mm);
 		return mm;
 	}
 
diff --git a/mm/Kconfig b/mm/Kconfig
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -193,3 +193,7 @@
 config VIRT_TO_BUS
 	def_bool y
 	depends on !ARCH_NO_VIRT_TO_BUS
+
+config MMU_NOTIFIER
+	def_bool y
+	bool "MMU notifier, for paging KVM/RDMA"
diff --git a/mm/Makefile b/mm/Makefile
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -33,4 +33,5 @@
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
 
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -194,7 +194,7 @@
 		if (pte) {
 			/* Nuke the page table entry. */
 			flush_cache_page(vma, address, pte_pfn(*pte));
-			pteval = ptep_clear_flush(vma, address, pte);
+			pteval = ptep_clear_flush_notify(vma, address, pte);
 			page_remove_rmap(page, vma);
 			dec_mm_counter(mm, file_rss);
 			BUG_ON(pte_dirty(pteval));
diff --git a/mm/fremap.c b/mm/fremap.c
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -15,6 +15,7 @@
 #include <linux/rmap.h>
 #include <linux/module.h>
 #include <linux/syscalls.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/mmu_context.h>
 #include <asm/cacheflush.h>
@@ -214,7 +215,9 @@
 		spin_unlock(&mapping->i_mmap_lock);
 	}
 
+	mmu_notifier_invalidate_range_start(mm, start, start + size);
 	err = populate_range(mm, vma, start, size, pgoff);
+	mmu_notifier_invalidate_range_end(mm, start, start + size);
 	if (!err && !(flags & MAP_NONBLOCK)) {
 		if (unlikely(has_write_lock)) {
 			downgrade_write(&mm->mmap_sem);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -14,6 +14,7 @@
 #include <linux/mempolicy.h>
 #include <linux/cpuset.h>
 #include <linux/mutex.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -799,6 +800,7 @@
 	BUG_ON(start & ~HPAGE_MASK);
 	BUG_ON(end & ~HPAGE_MASK);
 
+	mmu_notifier_invalidate_range_start(mm, start, end);
 	spin_lock(&mm->page_table_lock);
 	for (address = start; address < end; address += HPAGE_SIZE) {
 		ptep = huge_pte_offset(mm, address);
@@ -819,6 +821,7 @@
 	}
 	spin_unlock(&mm->page_table_lock);
 	flush_tlb_range(vma, start, end);
+	mmu_notifier_invalidate_range_end(mm, start, end);
 	list_for_each_entry_safe(page, tmp, &page_list, lru) {
 		list_del(&page->lru);
 		put_page(page);
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -51,6 +51,7 @@
 #include <linux/init.h>
 #include <linux/writeback.h>
 #include <linux/memcontrol.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -611,6 +612,9 @@
 	if (is_vm_hugetlb_page(vma))
 		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
 
+	if (is_cow_mapping(vma->vm_flags))
+		mmu_notifier_invalidate_range_start(src_mm, addr, end);
+
 	dst_pgd = pgd_offset(dst_mm, addr);
 	src_pgd = pgd_offset(src_mm, addr);
 	do {
@@ -621,6 +625,11 @@
 						vma, addr, next))
 			return -ENOMEM;
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
+
+	if (is_cow_mapping(vma->vm_flags))
+		mmu_notifier_invalidate_range_end(src_mm,
+						vma->vm_start, end);
+
 	return 0;
 }
 
@@ -897,7 +906,9 @@
 	lru_add_drain();
 	tlb = tlb_gather_mmu(mm, 0);
 	update_hiwater_rss(mm);
+	mmu_notifier_invalidate_range_start(mm, address, end);
 	end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
+	mmu_notifier_invalidate_range_end(mm, address, end);
 	if (tlb)
 		tlb_finish_mmu(tlb, address, end);
 	return end;
@@ -1463,10 +1474,11 @@
 {
 	pgd_t *pgd;
 	unsigned long next;
-	unsigned long end = addr + size;
+	unsigned long start = addr, end = addr + size;
 	int err;
 
 	BUG_ON(addr >= end);
+	mmu_notifier_invalidate_range_start(mm, start, end);
 	pgd = pgd_offset(mm, addr);
 	do {
 		next = pgd_addr_end(addr, end);
@@ -1474,6 +1486,7 @@
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
+	mmu_notifier_invalidate_range_end(mm, start, end);
 	return err;
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
@@ -1675,7 +1688,7 @@
 		 * seen in the presence of one thread doing SMC and another
 		 * thread doing COW.
 		 */
-		ptep_clear_flush(vma, address, page_table);
+		ptep_clear_flush_notify(vma, address, page_table);
 		set_pte_at(mm, address, page_table, entry);
 		update_mmu_cache(vma, address, entry);
 		lru_cache_add_active(new_page);
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -26,6 +26,7 @@
 #include <linux/mount.h>
 #include <linux/mempolicy.h>
 #include <linux/rmap.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -1747,11 +1748,13 @@
 	lru_add_drain();
 	tlb = tlb_gather_mmu(mm, 0);
 	update_hiwater_rss(mm);
+	mmu_notifier_invalidate_range_start(mm, start, end);
 	unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
 	free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
 				 next? next->vm_start: 0);
 	tlb_finish_mmu(tlb, start, end);
+	mmu_notifier_invalidate_range_end(mm, start, end);
 }
 
 /*
@@ -2037,6 +2040,7 @@
 	unsigned long end;
 
 	/* mm's last user has gone, and its about to be pulled down */
+	mmu_notifier_release(mm);
 	arch_exit_mmap(mm);
 
 	lru_add_drain();
@@ -2242,3 +2246,69 @@
 
 	return 0;
 }
+
+static void mm_lock_unlock(struct mm_struct *mm, int lock)
+{
+	struct vm_area_struct *vma;
+	spinlock_t *i_mmap_lock_last, *anon_vma_lock_last;
+
+	i_mmap_lock_last = NULL;
+	for (;;) {
+		spinlock_t *i_mmap_lock = (spinlock_t *) -1UL;
+		for (vma = mm->mmap; vma; vma = vma->vm_next)
+			if (vma->vm_file && vma->vm_file->f_mapping &&
+			    (unsigned long) i_mmap_lock >
+			    (unsigned long)
+			    &vma->vm_file->f_mapping->i_mmap_lock &&
+			    (unsigned long)
+			    &vma->vm_file->f_mapping->i_mmap_lock >
+			    (unsigned long) i_mmap_lock_last)
+				i_mmap_lock =
+					&vma->vm_file->f_mapping->i_mmap_lock;
+		if (i_mmap_lock == (spinlock_t *) -1UL)
+			break;
+		i_mmap_lock_last = i_mmap_lock;
+		if (lock)
+			spin_lock(i_mmap_lock);
+		else
+			spin_unlock(i_mmap_lock);
+	}
+
+	anon_vma_lock_last = NULL;
+	for (;;) {
+		spinlock_t *anon_vma_lock = (spinlock_t *) -1UL;
+		for (vma = mm->mmap; vma; vma = vma->vm_next)
+			if (vma->anon_vma &&
+			    (unsigned long) anon_vma_lock >
+			    (unsigned long) &vma->anon_vma->lock &&
+			    (unsigned long) &vma->anon_vma->lock >
+			    (unsigned long) anon_vma_lock_last)
+				anon_vma_lock = &vma->anon_vma->lock;
+		if (anon_vma_lock == (spinlock_t *) -1UL)
+			break;
+		anon_vma_lock_last = anon_vma_lock;
+		if (lock)
+			spin_lock(anon_vma_lock);
+		else
+			spin_unlock(anon_vma_lock);
+	}
+}
+
+/*
+ * This operation locks against the VM for all pte/vma/mm related
+ * operations that could ever happen on a certain mm. This includes
+ * vmtruncate, try_to_unmap, and all page faults. The holder
+ * must not hold any mm related lock. A single task can't take more
+ * than one mm lock in a row or it would deadlock.
+ */
+void mm_lock(struct mm_struct * mm)
+{
+	down_write(&mm->mmap_sem);
+	mm_lock_unlock(mm, 1);
+}
+
+void mm_unlock(struct mm_struct *mm)
+{
+	mm_lock_unlock(mm, 0);
+	up_write(&mm->mmap_sem);
+}
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
new file mode 100644
--- /dev/null
+++ b/mm/mmu_notifier.c
@@ -0,0 +1,100 @@
+/*
+ *  linux/mm/mmu_notifier.c
+ *
+ *  Copyright (C) 2008  Qumranet, Inc.
+ *  Copyright (C) 2008  SGI
+ *             Christoph Lameter <clameter@sgi.com>
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include <linux/mmu_notifier.h>
+#include <linux/module.h>
+#include <linux/mm.h>
+
+/*
+ * No synchronization. This function can only be called when only a single
+ * process remains that performs teardown.
+ */
+void __mmu_notifier_release(struct mm_struct *mm)
+{
+	struct mmu_notifier *mn;
+
+	while (unlikely(!hlist_empty(&mm->mmu_notifier_list))) {
+		mn = hlist_entry(mm->mmu_notifier_list.first,
+				 struct mmu_notifier,
+				 hlist);
+		hlist_del(&mn->hlist);
+		if (mn->ops->release)
+			mn->ops->release(mn, mm);
+	}
+}
+
+/*
+ * If no young bitflag is supported by the hardware, ->clear_flush_young can
+ * unmap the address and return 1 or 0 depending if the mapping previously
+ * existed or not.
+ */
+int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
+					unsigned long address)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n;
+	int young = 0;
+
+	hlist_for_each_entry(mn, n, &mm->mmu_notifier_list, hlist) {
+		if (mn->ops->clear_flush_young)
+			young |= mn->ops->clear_flush_young(mn, mm, address);
+	}
+
+	return young;
+}
+
+void __mmu_notifier_invalidate_page(struct mm_struct *mm,
+					  unsigned long address)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n;
+
+	hlist_for_each_entry(mn, n, &mm->mmu_notifier_list, hlist) {
+		if (mn->ops->invalidate_page)
+			mn->ops->invalidate_page(mn, mm, address);
+	}
+}
+
+void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+				  unsigned long start, unsigned long end)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n;
+
+	hlist_for_each_entry(mn, n, &mm->mmu_notifier_list, hlist) {
+		if (mn->ops->invalidate_range_start)
+			mn->ops->invalidate_range_start(mn, mm, start, end);
+	}
+}
+
+void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
+				  unsigned long start, unsigned long end)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n;
+
+	hlist_for_each_entry(mn, n, &mm->mmu_notifier_list, hlist) {
+		if (mn->ops->invalidate_range_end)
+			mn->ops->invalidate_range_end(mn, mm, start, end);
+	}
+}
+
+/*
+ * Must not hold mmap_sem nor any other VM related lock when calling
+ * this registration function.
+ */
+void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	mm_lock(mm);
+	hlist_add_head(&mn->hlist, &mm->mmu_notifier_list);
+	mm_unlock(mm);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_register);
diff --git a/mm/mprotect.c b/mm/mprotect.c
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -21,6 +21,7 @@
 #include <linux/syscalls.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/mmu_notifier.h>
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
 #include <asm/cacheflush.h>
@@ -198,10 +199,12 @@
 		dirty_accountable = 1;
 	}
 
+	mmu_notifier_invalidate_range_start(mm, start, end);
 	if (is_vm_hugetlb_page(vma))
 		hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
 	else
 		change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
+	mmu_notifier_invalidate_range_end(mm, start, end);
 	vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
 	vm_stat_account(mm, newflags, vma->vm_file, nrpages);
 	return 0;
diff --git a/mm/mremap.c b/mm/mremap.c
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -18,6 +18,7 @@
 #include <linux/highmem.h>
 #include <linux/security.h>
 #include <linux/syscalls.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -74,7 +75,11 @@
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *old_pte, *new_pte, pte;
 	spinlock_t *old_ptl, *new_ptl;
+	unsigned long old_start;
 
+	old_start = old_addr;
+	mmu_notifier_invalidate_range_start(vma->vm_mm,
+					    old_start, old_end);
 	if (vma->vm_file) {
 		/*
 		 * Subtle point from Rajesh Venkatasubramanian: before
@@ -116,6 +121,7 @@
 	pte_unmap_unlock(old_pte - 1, old_ptl);
 	if (mapping)
 		spin_unlock(&mapping->i_mmap_lock);
+	mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);
 }
 
 #define LATENCY_LIMIT	(64 * PAGE_SIZE)
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -49,6 +49,7 @@
 #include <linux/module.h>
 #include <linux/kallsyms.h>
 #include <linux/memcontrol.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/tlbflush.h>
 
@@ -287,7 +288,7 @@
 	if (vma->vm_flags & VM_LOCKED) {
 		referenced++;
 		*mapcount = 1;	/* break early from loop */
-	} else if (ptep_clear_flush_young(vma, address, pte))
+	} else if (ptep_clear_flush_young_notify(vma, address, pte))
 		referenced++;
 
 	/* Pretend the page is referenced if the task has the
@@ -456,7 +457,7 @@
 		pte_t entry;
 
 		flush_cache_page(vma, address, pte_pfn(*pte));
-		entry = ptep_clear_flush(vma, address, pte);
+		entry = ptep_clear_flush_notify(vma, address, pte);
 		entry = pte_wrprotect(entry);
 		entry = pte_mkclean(entry);
 		set_pte_at(mm, address, pte, entry);
@@ -717,14 +718,14 @@
 	 * skipped over this mm) then we should reactivate it.
 	 */
 	if (!migration && ((vma->vm_flags & VM_LOCKED) ||
-			(ptep_clear_flush_young(vma, address, pte)))) {
+			(ptep_clear_flush_young_notify(vma, address, pte)))) {
 		ret = SWAP_FAIL;
 		goto out_unmap;
 	}
 
 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address, page_to_pfn(page));
-	pteval = ptep_clear_flush(vma, address, pte);
+	pteval = ptep_clear_flush_notify(vma, address, pte);
 
 	/* Move the dirty bit to the physical page now the pte is gone. */
 	if (pte_dirty(pteval))
@@ -849,12 +850,12 @@
 		page = vm_normal_page(vma, address, *pte);
 		BUG_ON(!page || PageAnon(page));
 
-		if (ptep_clear_flush_young(vma, address, pte))
+		if (ptep_clear_flush_young_notify(vma, address, pte))
 			continue;
 
 		/* Nuke the page table entry. */
 		flush_cache_page(vma, address, pte_pfn(*pte));
-		pteval = ptep_clear_flush(vma, address, pte);
+		pteval = ptep_clear_flush_notify(vma, address, pte);
 
 		/* If nonlinear, store the file page offset in the pte. */
 		if (page->index != linear_page_index(vma, address))

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mmu notifier #v11
  2008-04-04 20:20                     ` [PATCH] mmu notifier #v11 Andrea Arcangeli
@ 2008-04-04 22:06                       ` Christoph Lameter
  2008-04-05  0:23                         ` Andrea Arcangeli
  0 siblings, 1 reply; 51+ messages in thread
From: Christoph Lameter @ 2008-04-04 22:06 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman, Nick Piggin

I am always the guy doing the cleanup after Andrea, it seems. Sigh.

Here is the mm_lock/mm_unlock logic separated out for easier review,
with some comments added. Still objectionable are the multiple ways of
invalidating pages in #v11. The callout now has locking similar to emm.
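
For context, the only intended caller of this pair so far is notifier
registration; a minimal sketch of such a user, modelled on the
mmu_notifier_register() in #v11 above (names assumed from there,
illustrative only):

	#include <linux/list.h>
	#include <linux/mm.h>
	#include <linux/mmu_notifier.h>

	/*
	 * Sketch (illustrative): with mm_lock() held, no reclaim, unmap or
	 * page fault can be running against this mm, so the notifier list
	 * can be changed without an in-flight invalidate missing the new
	 * entry.
	 */
	static void my_notifier_attach(struct mmu_notifier *mn,
				       struct mm_struct *mm)
	{
		mm_lock(mm);
		hlist_add_head(&mn->hlist, &mm->mmu_notifier_list);
		mm_unlock(mm);
	}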

From: Christoph Lameter <clameter@sgi.com>
Subject: mm_lock: Lock a process against reclaim

Provide a way to lock an mm_struct against reclaim (try_to_unmap
etc). This is necessary for the invalidate notifier approaches so
that they can reliably add and remove a notifier.

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/mm.h |   10 ++++++++
 mm/mmap.c          |   66 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 76 insertions(+)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2008-04-02 11:41:47.741678873 -0700
+++ linux-2.6/include/linux/mm.h	2008-04-04 15:02:17.660504756 -0700
@@ -1050,6 +1050,16 @@ extern int install_special_mapping(struc
 				   unsigned long addr, unsigned long len,
 				   unsigned long flags, struct page **pages);
 
+/*
+ * Locking and unlocking an mm against reclaim.
+ *
+ * mm_lock will take mmap_sem writably (to prevent additional vmas from being
+ * added) and then take all mapping locks of the existing vmas. With that
+ * reclaim is effectively stopped.
+ */
+extern void mm_lock(struct mm_struct *mm);
+extern void mm_unlock(struct mm_struct *mm);
+
 extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
 
 extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2008-04-04 14:55:03.477593980 -0700
+++ linux-2.6/mm/mmap.c	2008-04-04 14:59:05.505395402 -0700
@@ -2242,3 +2242,69 @@ int install_special_mapping(struct mm_st
 
 	return 0;
 }
+
+static void mm_lock_unlock(struct mm_struct *mm, int lock)
+{
+	struct vm_area_struct *vma;
+	spinlock_t *i_mmap_lock_last, *anon_vma_lock_last;
+
+	i_mmap_lock_last = NULL;
+	for (;;) {
+		spinlock_t *i_mmap_lock = (spinlock_t *) -1UL;
+		for (vma = mm->mmap; vma; vma = vma->vm_next)
+			if (vma->vm_file && vma->vm_file->f_mapping &&
+			    (unsigned long) i_mmap_lock >
+			    (unsigned long)
+			    &vma->vm_file->f_mapping->i_mmap_lock &&
+			    (unsigned long)
+			    &vma->vm_file->f_mapping->i_mmap_lock >
+			    (unsigned long) i_mmap_lock_last)
+				i_mmap_lock =
+					&vma->vm_file->f_mapping->i_mmap_lock;
+		if (i_mmap_lock == (spinlock_t *) -1UL)
+			break;
+		i_mmap_lock_last = i_mmap_lock;
+		if (lock)
+			spin_lock(i_mmap_lock);
+		else
+			spin_unlock(i_mmap_lock);
+	}
+
+	anon_vma_lock_last = NULL;
+	for (;;) {
+		spinlock_t *anon_vma_lock = (spinlock_t *) -1UL;
+		for (vma = mm->mmap; vma; vma = vma->vm_next)
+			if (vma->anon_vma &&
+			    (unsigned long) anon_vma_lock >
+			    (unsigned long) &vma->anon_vma->lock &&
+			    (unsigned long) &vma->anon_vma->lock >
+			    (unsigned long) anon_vma_lock_last)
+				anon_vma_lock = &vma->anon_vma->lock;
+		if (anon_vma_lock == (spinlock_t *) -1UL)
+			break;
+		anon_vma_lock_last = anon_vma_lock;
+		if (lock)
+			spin_lock(anon_vma_lock);
+		else
+			spin_unlock(anon_vma_lock);
+	}
+}
+
+/*
+ * This operation locks against the VM for all pte/vma/mm related
+ * operations that could ever happen on a certain mm. This includes
+ * vmtruncate, try_to_unmap, and all page faults. The holder
+ * must not hold any mm related lock. A single task can't take more
+ * than one mm lock in a row or it would deadlock.
+ */
+void mm_lock(struct mm_struct * mm)
+{
+	down_write(&mm->mmap_sem);
+	mm_lock_unlock(mm, 1);
+}
+
+void mm_unlock(struct mm_struct *mm)
+{
+	mm_lock_unlock(mm, 0);
+	up_write(&mm->mmap_sem);
+}


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mmu notifier #v11
  2008-04-04 22:06                       ` Christoph Lameter
@ 2008-04-05  0:23                         ` Andrea Arcangeli
  2008-04-07  5:45                           ` Christoph Lameter
  0 siblings, 1 reply; 51+ messages in thread
From: Andrea Arcangeli @ 2008-04-05  0:23 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Hugh Dickins, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman, Nick Piggin

On Fri, Apr 04, 2008 at 03:06:18PM -0700, Christoph Lameter wrote:
> Adds some comments. Still objectionable is the multiple ways of
> invalidating pages in #v11. Callout now has similar locking to emm.

range_begin exists because range_end is called after the page has
already been freed. invalidate_page is called _before_ the page is
freed but _after_ the pte has been zapped.

In short, when working with single pages it's a waste to block the
secondary-mmu page fault, because it's zero cost to invalidate_page
before put_page. Not even GRU needs to do that.

Instead, for multiple-pte zapping we have to call range_end _after_
the pages have already been freed, so that there is a single range_end
call for a huge amount of address space. That is why we need a
range_begin for the subsystems that don't use page pinning, for example.
When working with single pages (try_to_unmap_one, do_wp_page),
invalidate_page avoids blocking the secondary mmu page fault, and is in
turn faster.

Besides avoiding the need to serialize against the secondary mmu page
fault, invalidate_page also reduces the overhead when the mmu notifiers
are disarmed (i.e. kvm not running).
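
To illustrate the single-page case, a pinning driver's invalidate_page
could reduce to something like the sketch below (the my_* helpers are
hypothetical, not from any of the patches):

	#include <linux/mm.h>
	#include <linux/mmu_notifier.h>

	static void my_invalidate_page(struct mmu_notifier *mn,
				       struct mm_struct *mm,
				       unsigned long address)
	{
		struct page *page;

		/* Hypothetical helper: remove our secondary pte for address
		 * and return the page it pointed to, or NULL if none. */
		page = my_spte_zap(mn, address);
		if (!page)
			return;

		/* Hypothetical helper: flush the secondary TLB entry. */
		my_secondary_tlb_flush(mn, address);

		/* Drop the pin the driver took when it mapped the page. */
		put_page(page);
	}

The driver's pin keeps the old page alive until the final put_page(),
which is what makes it possible, per the argument above, to do this
without blocking the secondary mmu page fault across a begin/end
window.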

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mmu notifier #v11
  2008-04-05  0:23                         ` Andrea Arcangeli
@ 2008-04-07  5:45                           ` Christoph Lameter
  2008-04-07  6:02                             ` Andrea Arcangeli
  0 siblings, 1 reply; 51+ messages in thread
From: Christoph Lameter @ 2008-04-07  5:45 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman, Nick Piggin

On Sat, 5 Apr 2008, Andrea Arcangeli wrote:

> In short when working with single pages it's a waste to block the
> secondary-mmu page fault, because it's zero cost to invalidate_page
> before put_page. Not even GRU need to do that.

That depends on what the notifier is being used for. Some serialization
with the external mappings has to be done anyway. And it's cleaner to have
one API that does a lock/unlock scheme. Atomic operations can easily lead
to races.
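
A minimal sketch of the lock/unlock discipline being argued for here,
seen from the driver side and expressed against the #v11 callbacks
quoted above for concreteness (the my_* names are made up): the start
callback marks the range as under invalidation, the end callback clears
it, and the driver's own fault path refuses to establish new mappings
in between.

	#include <asm/atomic.h>
	#include <linux/kernel.h>
	#include <linux/mmu_notifier.h>

	struct my_ctx {
		struct mmu_notifier mn;
		atomic_t invalidate_count;	/* > 0 while a start..end is pending */
	};

	static void my_range_start(struct mmu_notifier *mn, struct mm_struct *mm,
				   unsigned long start, unsigned long end)
	{
		struct my_ctx *ctx = container_of(mn, struct my_ctx, mn);

		atomic_inc(&ctx->invalidate_count);
		/* tear down secondary mappings covering [start, end) */
	}

	static void my_range_end(struct mmu_notifier *mn, struct mm_struct *mm,
				 unsigned long start, unsigned long end)
	{
		struct my_ctx *ctx = container_of(mn, struct my_ctx, mn);

		atomic_dec(&ctx->invalidate_count);
	}

The driver's secondary fault handler then backs off (or sleeps, in the
XPMEM variant) while invalidate_count is non-zero; a counter rather
than a plain lock is used because start/end pairs may nest or run
concurrently.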


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] mmu notifier #v11
  2008-04-07  5:45                           ` Christoph Lameter
@ 2008-04-07  6:02                             ` Andrea Arcangeli
  0 siblings, 0 replies; 51+ messages in thread
From: Andrea Arcangeli @ 2008-04-07  6:02 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Hugh Dickins, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman, Nick Piggin

On Sun, Apr 06, 2008 at 10:45:41PM -0700, Christoph Lameter wrote:
> That depends on what the notifier is being used for. Some serialization 
> with the external mappings has to be done anyways. And its cleaner to have 

As far as I can tell, no: you don't need to serialize against the
secondary mmu page fault in invalidate_page, the way you instead have to
in range_begin if you aren't unpinning the pages in range_end.

> one API that does a lock/unlock scheme. Atomic operations can easily lead
> to races.

What races? Note that if you don't want to optimize, XPMEM and GRU are
free to implement their own invalidate_page like this:

	invalidate_page(mm, addr) {
		range_begin(mm, addr, addr+PAGE_SIZE)
		range_end(mm, addr, addr+PAGE_SIZE)
	}

There's zero risk of adding races if they do this, but I doubt they
want to run as slowly as with EMM, so I guess they'll exploit the
optimization by going lock-free against the spte page fault in
invalidate_page.
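
One way such a lock-free-ish fast path could look, as a rough sketch
only (the my_* helpers are hypothetical; loosely the scheme a KVM-like
user might adopt): the fault path samples a sequence count before
pinning the page and retries if an invalidate ran in between, so
invalidate_page never has to wait for it.

	#include <linux/errno.h>
	#include <linux/mm.h>
	#include <linux/mmu_notifier.h>
	#include <linux/spinlock.h>

	struct my_ctx {
		struct mmu_notifier mn;
		spinlock_t lock;		/* protects sptes and seq */
		unsigned long invalidate_seq;	/* bumped by every invalidate */
	};

	/* secondary mmu fault path */
	static int my_fault_one(struct my_ctx *ctx, unsigned long addr)
	{
		unsigned long seq;
		struct page *page;

		seq = ctx->invalidate_seq;
		smp_rmb();
		page = my_gup_one(ctx, addr);	/* hypothetical get_user_pages wrapper */
		if (!page)
			return -EFAULT;

		spin_lock(&ctx->lock);
		if (seq != ctx->invalidate_seq) {
			/* an invalidate ran meanwhile: drop the pin and retry */
			spin_unlock(&ctx->lock);
			put_page(page);
			return -EAGAIN;
		}
		my_spte_install(ctx, addr, page);
		spin_unlock(&ctx->lock);
		return 0;
	}

	/* invalidate_page would then, under ctx->lock, bump invalidate_seq,
	 * zap the spte and put_page() the old pin, without ever waiting for
	 * the fault path. */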

^ permalink raw reply	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2008-04-07  6:02 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-04-01 20:55 [patch 0/9] [RFC] EMM Notifier V2 Christoph Lameter
2008-04-01 20:55 ` [patch 1/9] EMM Notifier: The notifier calls Christoph Lameter
2008-04-01 21:14   ` Peter Zijlstra
2008-04-01 21:38     ` Paul E. McKenney
2008-04-02 17:44       ` Christoph Lameter
2008-04-02 18:43       ` EMM: Fix rcu handling and spelling Christoph Lameter
2008-04-02 19:02         ` Paul E. McKenney
2008-04-02  6:49   ` [patch 1/9] EMM Notifier: The notifier calls Andrea Arcangeli
2008-04-02 10:59     ` Robin Holt
2008-04-02 11:16       ` Andrea Arcangeli
2008-04-02 14:26         ` Robin Holt
2008-04-02 17:59     ` Christoph Lameter
2008-04-02 19:03       ` EMM: Fixup return value handling of emm_notify() Christoph Lameter
2008-04-02 21:25         ` Andrea Arcangeli
2008-04-02 21:33           ` Christoph Lameter
2008-04-03 10:40             ` Peter Zijlstra
2008-04-03 15:00               ` Andrea Arcangeli
2008-04-03 19:14               ` Christoph Lameter
2008-04-02 21:05       ` EMM: Require single threadedness for registration Christoph Lameter
2008-04-02 22:01         ` Andrea Arcangeli
2008-04-02 22:06           ` Christoph Lameter
2008-04-02 22:17             ` Andrea Arcangeli
2008-04-02 22:41               ` Christoph Lameter
2008-04-03  1:24               ` EMM: disable other notifiers before register and unregister Christoph Lameter
2008-04-03 10:40                 ` Peter Zijlstra
2008-04-03 15:29                 ` Andrea Arcangeli
2008-04-03 19:20                   ` Christoph Lameter
2008-04-03 20:23                     ` Christoph Lameter
2008-04-04 12:30                     ` Andrea Arcangeli
2008-04-04 20:20                     ` [PATCH] mmu notifier #v11 Andrea Arcangeli
2008-04-04 22:06                       ` Christoph Lameter
2008-04-05  0:23                         ` Andrea Arcangeli
2008-04-07  5:45                           ` Christoph Lameter
2008-04-07  6:02                             ` Andrea Arcangeli
2008-04-02 21:53       ` [patch 1/9] EMM Notifier: The notifier calls Andrea Arcangeli
2008-04-02 21:54         ` Christoph Lameter
2008-04-02 22:09           ` Andrea Arcangeli
2008-04-02 23:04             ` Christoph Lameter
2008-04-01 20:55 ` [patch 2/9] Move tlb flushing into free_pgtables Christoph Lameter
2008-04-01 20:55 ` [patch 3/9] Convert i_mmap_lock to i_mmap_sem Christoph Lameter
2008-04-01 20:55 ` [patch 4/9] Remove tlb pointer from the parameters of unmap vmas Christoph Lameter
2008-04-01 20:55 ` [patch 5/9] Convert anon_vma lock to rw_sem and refcount Christoph Lameter
2008-04-02 17:50   ` Andrea Arcangeli
2008-04-02 18:15     ` Christoph Lameter
2008-04-02 21:56       ` Andrea Arcangeli
2008-04-02 21:56         ` Christoph Lameter
2008-04-02 22:12           ` Andrea Arcangeli
2008-04-01 20:55 ` [patch 6/9] This patch exports zap_page_range as it is needed by XPMEM Christoph Lameter
2008-04-01 20:55 ` [patch 7/9] Locking rules for taking multiple mmap_sem locks Christoph Lameter
2008-04-01 20:55 ` [patch 8/9] XPMEM: The device driver Christoph Lameter
2008-04-01 20:55 ` [patch 9/9] XPMEM: Simple example Christoph Lameter
