linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
@ 2005-04-26 15:49 David Addison
  2005-04-26 16:57 ` Jesper Juhl
                   ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: David Addison @ 2005-04-26 15:49 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton, Andrea Arcangeli, David Addison

[-- Attachment #1: Type: text/plain, Size: 852 bytes --]

Hi,

here is a patch we use to integrate the Quadrics NICs into the Linux kernel.
The patch adds hooks to the Linux VM subsystem so that registered 'IOPROC'
devices can be informed of page table changes.
This allows the Quadrics NICs to perform user RDMAs safely, without requiring
page pinning. Looking through some of the recent IB and Ammasso discussions,
it may also prove useful to those NICs too.

This patch has been deployed in many large (1000+ CPUs) production Linux
clusters at high profile HPC sites such as LLNL and PNL. It has also been
incorporated in Linux kernel releases from HP, SGI and Bull.

I have discussed this patch with Andrew Morton and Andrea Arcangeli and they
believe now is the time to encourage further comments on whether it's
suitable to be incorporated into the mainline kernel.

Cheers,

David Addison
Quadrics Ltd


[-- Attachment #2: ioproc-2.6.12-rc3.patch --]
[-- Type: text/x-patch, Size: 37629 bytes --]

diff -ruN linux-2.6.12-rc3.orig/include/linux/ioproc.h linux-2.6.12-rc3.ioproc/include/linux/ioproc.h
--- linux-2.6.12-rc3.orig/include/linux/ioproc.h	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/include/linux/ioproc.h	2005-04-26 15:55:14.000000000 +0100
@@ -0,0 +1,271 @@
+/* -*- linux-c -*-
+ *
+ *    Copyright (C) 2002-2005 Quadrics Ltd.
+ *
+ *    This program is free software; you can redistribute it and/or modify
+ *    it under the terms of the GNU General Public License as published by
+ *    the Free Software Foundation; either version 2 of the License, or
+ *    (at your option) any later version.
+ *
+ *    This program is distributed in the hope that it will be useful,
+ *    but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *    GNU General Public License for more details.
+ *
+ *    You should have received a copy of the GNU General Public License
+ *    along with this program; if not, write to the Free Software
+ *    Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ *
+ */
+
+/*
+ * Callbacks for IO processor page table updates.
+ */
+
+#ifndef __LINUX_IOPROC_H__
+#define __LINUX_IOPROC_H__
+
+#include <linux/sched.h>
+#include <linux/mm.h>
+
+typedef struct ioproc_ops {
+	struct ioproc_ops *next;
+	void *arg;
+
+	void (*release)(void *arg, struct mm_struct *mm);
+	void (*sync_range)(void *arg, struct vm_area_struct *vma, unsigned long start, unsigned long end);
+	void (*invalidate_range)(void *arg, struct vm_area_struct *vma, unsigned long start, unsigned long end);
+	void (*update_range)(void *arg, struct vm_area_struct *vma, unsigned long start, unsigned long end);
+
+	void (*change_protection)(void *arg, struct vm_area_struct *vma, unsigned long start, unsigned long end, pgprot_t newprot);
+
+	void (*sync_page)(void *arg, struct vm_area_struct *vma, unsigned long address);
+	void (*invalidate_page)(void *arg, struct vm_area_struct *vma, unsigned long address);
+	void (*update_page)(void *arg, struct vm_area_struct *vma, unsigned long address);
+
+} ioproc_ops_t;
+
+/* IOPROC Registration
+ *
+ * Called by the IOPROC device driver to register its interest in page table
+ * changes for the process associated with the supplied mm_struct
+ *
+ * The caller should first allocate and fill out an ioproc_ops structure with
+ * the function pointers initialised to the device driver specific code for
+ * each callback. If the device driver doesn't have code for a particular
+ * callback then it should set the function pointer to be NULL.
+ * The ioproc_ops arg parameter will be passed unchanged as the first argument
+ * to each callback function invocation.
+ *
+ * The ioproc registration is not inherited across fork() and should be called
+ * once for each process that the IOPROC device driver is interested in.
+ *
+ * Must be called holding the mm->page_table_lock
+ */
+extern int ioproc_register_ops(struct mm_struct *mm, struct ioproc_ops *ip);
+
+
+/* IOPROC De-registration
+ *
+ * Called by the IOPROC device driver when it is no longer interested in page
+ * table changes for the process associated with the supplied mm_struct
+ *
+ * Normally this is not needed to be called as the ioproc_release() code will
+ * automatically unlink the ioproc_ops struct from the mm_struct as the
+ * process exits
+ *
+ * Must be called holding the mm->page_table_lock
+ */
+extern int ioproc_unregister_ops(struct mm_struct *mm, struct ioproc_ops *ip);
+
+#ifdef CONFIG_IOPROC
+
+/* IOPROC Release
+ *
+ * Called during exit_mmap() as all vmas are torn down and unmapped.
+ *
+ * Also unlinks the ioproc_ops structure from the mm list as it goes.
+ *
+ * No need for locks as the mm can no longer be accessed at this point
+ *
+ */
+static inline void
+ioproc_release(struct mm_struct *mm)
+{
+	struct ioproc_ops *cp;
+
+	while ((cp = mm->ioproc_ops) != NULL) {
+		mm->ioproc_ops = cp->next;
+
+		if (cp->release)
+			cp->release(cp->arg, mm);
+	}
+}
+
+/* IOPROC SYNC RANGE
+ *
+ * Called when a memory map is synchronised with its disk image i.e. when the
+ * msync() syscall is invoked. Any future read or write to the associated
+ * pages by the IOPROC should cause the page to be marked as referenced or
+ * modified.
+ *
+ * Called holding the mm->page_table_lock
+ */
+static inline void
+ioproc_sync_range(struct vm_area_struct *vma, unsigned long start, unsigned long end)
+{
+	struct ioproc_ops *cp;
+
+	for (cp = vma->vm_mm->ioproc_ops; cp; cp = cp->next)
+		if (cp->sync_range)
+			cp->sync_range(cp->arg, vma, start, end);
+}
+
+/* IOPROC INVALIDATE RANGE
+ *
+ * Called whenever a valid PTE is unloaded e.g. when a page is unmapped by the
+ * user or paged out by the kernel.
+ *
+ * After this call the IOPROC must not access the physical memory again unless
+ * a new translation is loaded.
+ *
+ * Called holding the mm->page_table_lock
+ */
+static inline void
+ioproc_invalidate_range(struct vm_area_struct *vma, unsigned long start, unsigned long end)
+{
+	struct ioproc_ops *cp;
+	
+	for (cp = vma->vm_mm->ioproc_ops; cp; cp = cp->next)
+		if (cp->invalidate_range)
+			cp->invalidate_range(cp->arg, vma, start, end);
+}
+
+/* IOPROC UPDATE RANGE
+ *
+ * Called whenever a valid PTE is loaded e.g. mmaping memory, moving the brk
+ * up, when breaking COW or faulting in an anonymous page of memory.
+ *
+ * These give the IOPROC device driver the opportunity to load translations
+ * speculatively, which can improve performance by avoiding device translation
+ * faults.
+ *
+ * Called holding the mm->page_table_lock
+ */
+static inline void
+ioproc_update_range(struct vm_area_struct *vma, unsigned long start, unsigned long end)
+{
+	struct ioproc_ops *cp;
+
+	for (cp = vma->vm_mm->ioproc_ops; cp; cp = cp->next)
+		if (cp->update_range)
+			cp->update_range(cp->arg, vma, start, end);
+}
+
+
+/* IOPROC CHANGE PROTECTION
+ *
+ * Called when the protection on a region of memory is changed i.e. when the
+ * mprotect() syscall is invoked.
+ *
+ * The IOPROC must not be able to write to a read-only page, so if the
+ * permissions are downgraded then it must honour them. If they are upgraded
+ * it can treat this in the same way as the ioproc_update_[range|sync]() calls
+ *
+ * Called holding the mm->page_table_lock
+ */
+static inline void
+ioproc_change_protection(struct vm_area_struct *vma, unsigned long start, unsigned long end, pgprot_t newprot)
+{
+	struct ioproc_ops *cp;
+
+	for (cp = vma->vm_mm->ioproc_ops; cp; cp = cp->next)
+		if (cp->change_protection)
+			cp->change_protection(cp->arg, vma, start, end, newprot);
+}
+
+/* IOPROC SYNC PAGE
+ *
+ * Called when a memory map is synchronised with its disk image i.e. when the
+ * msync() syscall is invoked. Any future read or write to the associated page
+ * by the IOPROC should cause the page to be marked as referenced or modified.
+ *
+ * Not currently called as msync() calls ioproc_sync_range() instead
+ *
+ * Called holding the mm->page_table_lock
+ */
+static inline void
+ioproc_sync_page(struct vm_area_struct *vma, unsigned long addr)
+{
+	struct ioproc_ops *cp;
+
+	for (cp = vma->vm_mm->ioproc_ops; cp; cp = cp->next)
+		if (cp->sync_page)
+			cp->sync_page(cp->arg, vma, addr);
+}
+
+/* IOPROC INVALIDATE PAGE
+ *
+ * Called whenever a valid PTE is unloaded e.g. when a page is unmapped by the
+ * user or paged out by the kernel.
+ *
+ * After this call the IOPROC must not access the physical memory again unless
+ * a new translation is loaded.
+ *
+ * Called holding the mm->page_table_lock
+ */
+static inline void
+ioproc_invalidate_page(struct vm_area_struct *vma, unsigned long addr)
+{
+	struct ioproc_ops *cp;
+
+	for (cp = vma->vm_mm->ioproc_ops; cp; cp = cp->next)
+		if (cp->invalidate_page)
+			cp->invalidate_page(cp->arg, vma, addr);
+}
+
+/* IOPROC UPDATE PAGE
+ *
+ * Called whenever a valid PTE is loaded e.g. mmaping memory, moving the brk
+ * up, when breaking COW or faulting in an anonymous page of memory.
+ *
+ * These give the IOPROC device the opportunity to load translations
+ * speculatively, which can improve performance by avoiding device translation
+ * faults.
+ *
+ * Called holding the mm->page_table_lock
+ */
+static inline void
+ioproc_update_page(struct vm_area_struct *vma, unsigned long addr)
+{
+	struct ioproc_ops *cp;
+
+	for (cp = vma->vm_mm->ioproc_ops; cp; cp = cp->next)
+		if (cp->update_page)
+			cp->update_page(cp->arg, vma, addr);
+}
+
+#else
+
+/* ! CONFIG_IOPROC so make all hooks empty */
+
+#define ioproc_release(mm)			do { } while (0)
+
+#define ioproc_sync_range(vma, start, end)	do { } while (0)
+
+#define ioproc_invalidate_range(vma, start,end)	do { } while (0)
+
+#define ioproc_update_range(vma, start, end)	do { } while (0)
+
+#define ioproc_change_protection(vma, start, end, prot)	do { } while (0)
+
+#define ioproc_sync_page(vma, addr)		do { } while (0)
+
+#define ioproc_invalidate_page(vma, addr)	do { } while (0)
+
+#define ioproc_update_page(vma, addr)		do { } while (0)
+
+#endif /* CONFIG_IOPROC */
+
+#endif /* __LINUX_IOPROC_H__ */
diff -ruN linux-2.6.12-rc3.orig/include/linux/sched.h linux-2.6.12-rc3.ioproc/include/linux/sched.h
--- linux-2.6.12-rc3.orig/include/linux/sched.h	2005-04-26 09:02:29.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/include/linux/sched.h	2005-04-26 15:55:14.000000000 +0100
@@ -186,6 +186,9 @@
 asmlinkage void schedule(void);
 
 struct namespace;
+#ifdef CONFIG_IOPROC
+struct ioproc_ops;
+#endif
 
 /* Maximum number of active map areas.. This is a random (large) number */
 #define DEFAULT_MAX_MAP_COUNT	65536
@@ -267,6 +270,11 @@
 
 	unsigned long hiwater_rss;	/* High-water RSS usage */
 	unsigned long hiwater_vm;	/* High-water virtual memory usage */
+
+#ifdef CONFIG_IOPROC
+	/* hooks for io devices with advanced RDMA capabilities */
+	struct ioproc_ops       *ioproc_ops;
+#endif
 };
 
 struct sighand_struct {
diff -ruN linux-2.6.12-rc3.orig/kernel/fork.c linux-2.6.12-rc3.ioproc/kernel/fork.c
--- linux-2.6.12-rc3.orig/kernel/fork.c	2005-04-26 09:02:36.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/kernel/fork.c	2005-04-26 15:55:14.000000000 +0100
@@ -320,6 +320,9 @@
 	spin_lock_init(&mm->page_table_lock);
 	rwlock_init(&mm->ioctx_list_lock);
 	mm->ioctx_list = NULL;
+#ifdef CONFIG_IOPROC
+	mm->ioproc_ops = NULL;
+#endif
 	mm->default_kioctx = (struct kioctx)INIT_KIOCTX(mm->default_kioctx, *mm);
 	mm->free_area_cache = TASK_UNMAPPED_BASE;
 
diff -ruN linux-2.6.12-rc3.orig/mm/fremap.c linux-2.6.12-rc3.ioproc/mm/fremap.c
--- linux-2.6.12-rc3.orig/mm/fremap.c	2005-04-26 09:02:39.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/mm/fremap.c	2005-04-26 15:55:14.000000000 +0100
@@ -12,6 +12,7 @@
 #include <linux/mman.h>
 #include <linux/pagemap.h>
 #include <linux/swapops.h>
+#include <linux/ioproc.h>
 #include <linux/rmap.h>
 #include <linux/module.h>
 #include <linux/syscalls.h>
@@ -30,6 +31,7 @@
 	if (pte_present(pte)) {
 		unsigned long pfn = pte_pfn(pte);
 
+		ioproc_invalidate_page(vma, addr);
 		flush_cache_page(vma, addr, pfn);
 		pte = ptep_clear_flush(vma, addr, ptep);
 		if (pfn_valid(pfn)) {
@@ -99,6 +101,7 @@
 	pte_val = *pte;
 	pte_unmap(pte);
 	update_mmu_cache(vma, addr, pte_val);
+	ioproc_update_page(vma, addr);
 
 	err = 0;
 err_unlock:
@@ -143,6 +146,7 @@
 	pte_val = *pte;
 	pte_unmap(pte);
 	update_mmu_cache(vma, addr, pte_val);
+	ioproc_update_page(vma, addr);
 	spin_unlock(&mm->page_table_lock);
 	return 0;
 
diff -ruN linux-2.6.12-rc3.orig/mm/hugetlb.c linux-2.6.12-rc3.ioproc/mm/hugetlb.c
--- linux-2.6.12-rc3.orig/mm/hugetlb.c	2005-03-02 07:38:12.000000000 +0000
+++ linux-2.6.12-rc3.ioproc/mm/hugetlb.c	2005-04-26 15:55:14.000000000 +0100
@@ -11,6 +11,7 @@
 #include <linux/sysctl.h>
 #include <linux/highmem.h>
 #include <linux/nodemask.h>
+#include <linux/ioproc.h>
 
 const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
 static unsigned long nr_huge_pages, free_huge_pages;
@@ -255,6 +256,7 @@
 	struct mm_struct *mm = vma->vm_mm;
 
 	spin_lock(&mm->page_table_lock);
+	ioproc_invalidate_range(vma, start, start + length);
 	unmap_hugepage_range(vma, start, start + length);
 	spin_unlock(&mm->page_table_lock);
 }
diff -ruN linux-2.6.12-rc3.orig/mm/ioproc.c linux-2.6.12-rc3.ioproc/mm/ioproc.c
--- linux-2.6.12-rc3.orig/mm/ioproc.c	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/mm/ioproc.c	2005-04-26 15:55:14.000000000 +0100
@@ -0,0 +1,58 @@
+/* -*- linux-c -*-
+ *
+ *    Copyright (C) 2002-2005 Quadrics Ltd.
+ *
+ *    This program is free software; you can redistribute it and/or modify
+ *    it under the terms of the GNU General Public License as published by
+ *    the Free Software Foundation; either version 2 of the License, or
+ *    (at your option) any later version.
+ *
+ *    This program is distributed in the hope that it will be useful,
+ *    but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *    GNU General Public License for more details.
+ *
+ *    You should have received a copy of the GNU General Public License
+ *    along with this program; if not, write to the Free Software
+ *    Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ *
+ */
+
+/*
+ * Registration for IO processor page table updates.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+
+#include <linux/mm.h>
+#include <linux/ioproc.h>
+
+int
+ioproc_register_ops(struct mm_struct *mm, struct ioproc_ops *ip)
+{
+	ip->next = mm->ioproc_ops;
+	mm->ioproc_ops = ip;
+
+	return 0;
+}
+
+EXPORT_SYMBOL_GPL(ioproc_register_ops);
+
+int
+ioproc_unregister_ops(struct mm_struct *mm, struct ioproc_ops *ip)
+{
+	struct ioproc_ops **tmp;
+
+	for (tmp = &mm->ioproc_ops; *tmp && *tmp != ip; tmp= &(*tmp)->next)
+		;
+	if (*tmp) {
+		*tmp = ip->next;
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
+EXPORT_SYMBOL_GPL(ioproc_unregister_ops);
diff -ruN linux-2.6.12-rc3.orig/mm/Kconfig linux-2.6.12-rc3.ioproc/mm/Kconfig
--- linux-2.6.12-rc3.orig/mm/Kconfig	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/mm/Kconfig	2005-04-26 15:55:14.000000000 +0100
@@ -0,0 +1,15 @@
+#
+# VM subsystem specific config
+#
+
+# Support for IO processors which have advanced RDMA capabilities
+#
+config IOPROC
+	bool "Enable IOPROC VM hooks"
+	depends on MMU
+	default y
+	help
+	This option enables hooks in the VM subsystem so that IO devices which
+	incorporate advanced RDMA capabilities can be kept in sync with CPU
+	page table changes.
+	See Documentation/vm/ioproc.txt for more details.
diff -ruN linux-2.6.12-rc3.orig/mm/Makefile linux-2.6.12-rc3.ioproc/mm/Makefile
--- linux-2.6.12-rc3.orig/mm/Makefile	2005-03-02 07:38:12.000000000 +0000
+++ linux-2.6.12-rc3.ioproc/mm/Makefile	2005-04-26 15:55:14.000000000 +0100
@@ -17,4 +17,5 @@
 obj-$(CONFIG_NUMA) 	+= mempolicy.o
 obj-$(CONFIG_SHMEM) += shmem.o
 obj-$(CONFIG_TINY_SHMEM) += tiny-shmem.o
+obj-$(CONFIG_IOPROC)	+= ioproc.o
 
diff -ruN linux-2.6.12-rc3.orig/mm/memory.c linux-2.6.12-rc3.ioproc/mm/memory.c
--- linux-2.6.12-rc3.orig/mm/memory.c	2005-04-26 09:02:39.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/mm/memory.c	2005-04-26 15:55:14.000000000 +0100
@@ -45,6 +45,7 @@
 #include <linux/swap.h>
 #include <linux/highmem.h>
 #include <linux/pagemap.h>
+#include <linux/ioproc.h>
 #include <linux/rmap.h>
 #include <linux/module.h>
 #include <linux/init.h>
@@ -765,6 +766,7 @@
 
 	lru_add_drain();
 	spin_lock(&mm->page_table_lock);
+ 	ioproc_invalidate_range(vma, address, end);
 	tlb = tlb_gather_mmu(mm, 0);
 	end = unmap_vmas(&tlb, mm, vma, address, end, &nr_accounted, details);
 	tlb_finish_mmu(tlb, address, end);
@@ -1076,6 +1078,7 @@
 {
 	pgd_t *pgd;
 	unsigned long next;
+	unsigned long beg = addr;
 	unsigned long end = addr + size;
 	struct mm_struct *mm = vma->vm_mm;
 	int err;
@@ -1084,12 +1087,14 @@
 	pgd = pgd_offset(mm, addr);
 	flush_cache_range(vma, addr, end);
 	spin_lock(&mm->page_table_lock);
+	ioproc_invalidate_range(vma, beg, end);
 	do {
 		next = pgd_addr_end(addr, end);
 		err = zeromap_pud_range(mm, pgd, addr, next, prot);
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
+	ioproc_update_range(vma, beg, end);
 	spin_unlock(&mm->page_table_lock);
 	return err;
 }
@@ -1164,6 +1169,7 @@
 {
 	pgd_t *pgd;
 	unsigned long next;
+	unsigned long beg = addr;
 	unsigned long end = addr + size;
 	struct mm_struct *mm = vma->vm_mm;
 	int err;
@@ -1183,6 +1189,7 @@
 	pgd = pgd_offset(mm, addr);
 	flush_cache_range(vma, addr, end);
 	spin_lock(&mm->page_table_lock);
+	ioproc_invalidate_range(vma, beg, end);
 	do {
 		next = pgd_addr_end(addr, end);
 		err = remap_pud_range(mm, pgd, addr, next,
@@ -1190,6 +1197,7 @@
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
+	ioproc_update_range(vma, beg, end);
 	spin_unlock(&mm->page_table_lock);
 	return err;
 }
@@ -1218,8 +1226,10 @@
 
 	entry = maybe_mkwrite(pte_mkdirty(mk_pte(new_page, vma->vm_page_prot)),
 			      vma);
+	ioproc_invalidate_page(vma, address);
 	ptep_establish(vma, address, page_table, entry);
 	update_mmu_cache(vma, address, entry);
+	ioproc_update_page(vma, address);
 	lazy_mmu_prot_update(entry);
 }
 
@@ -1273,6 +1283,7 @@
 					      vma);
 			ptep_set_access_flags(vma, address, page_table, entry, 1);
 			update_mmu_cache(vma, address, entry);
+			ioproc_update_page(vma, address);
 			lazy_mmu_prot_update(entry);
 			pte_unmap(page_table);
 			spin_unlock(&mm->page_table_lock);
@@ -1736,6 +1747,7 @@
 
 	/* No need to invalidate - it was non-present before */
 	update_mmu_cache(vma, address, pte);
+	ioproc_update_page(vma, address);
 	lazy_mmu_prot_update(pte);
 	pte_unmap(page_table);
 	spin_unlock(&mm->page_table_lock);
@@ -1794,6 +1806,7 @@
 
 	/* No need to invalidate - it was non-present before */
 	update_mmu_cache(vma, addr, entry);
+	ioproc_update_page(vma, addr);
 	lazy_mmu_prot_update(entry);
 	spin_unlock(&mm->page_table_lock);
 out:
@@ -1920,6 +1933,7 @@
 
 	/* no need to invalidate: a not-present page shouldn't be cached */
 	update_mmu_cache(vma, address, entry);
+	ioproc_update_page(vma, address);
 	lazy_mmu_prot_update(entry);
 	spin_unlock(&mm->page_table_lock);
 out:
diff -ruN linux-2.6.12-rc3.orig/mm/mmap.c linux-2.6.12-rc3.ioproc/mm/mmap.c
--- linux-2.6.12-rc3.orig/mm/mmap.c	2005-04-26 09:02:39.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/mm/mmap.c	2005-04-26 15:55:15.000000000 +0100
@@ -16,6 +16,7 @@
 #include <linux/init.h>
 #include <linux/file.h>
 #include <linux/fs.h>
+#include <linux/ioproc.h>
 #include <linux/personality.h>
 #include <linux/security.h>
 #include <linux/hugetlb.h>
@@ -1627,6 +1628,7 @@
 
 	lru_add_drain();
 	spin_lock(&mm->page_table_lock);
+	ioproc_invalidate_range(vma, start, end);
 	tlb = tlb_gather_mmu(mm, 0);
 	unmap_vmas(&tlb, mm, vma, start, end, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
@@ -1905,6 +1907,7 @@
 
 	spin_lock(&mm->page_table_lock);
 
+	ioproc_release(mm);
 	flush_cache_mm(mm);
 	tlb = tlb_gather_mmu(mm, 1);
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
diff -ruN linux-2.6.12-rc3.orig/mm/mprotect.c linux-2.6.12-rc3.ioproc/mm/mprotect.c
--- linux-2.6.12-rc3.orig/mm/mprotect.c	2005-04-26 09:02:40.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/mm/mprotect.c	2005-04-26 15:55:15.000000000 +0100
@@ -10,6 +10,7 @@
 
 #include <linux/mm.h>
 #include <linux/hugetlb.h>
+#include <linux/ioproc.h>
 #include <linux/slab.h>
 #include <linux/shm.h>
 #include <linux/mman.h>
@@ -89,6 +90,7 @@
 	pgd = pgd_offset(mm, addr);
 	flush_cache_range(vma, addr, end);
 	spin_lock(&mm->page_table_lock);
+	ioproc_change_protection(vma, start, end, newprot);
 	do {
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
diff -ruN linux-2.6.12-rc3.orig/mm/mremap.c linux-2.6.12-rc3.ioproc/mm/mremap.c
--- linux-2.6.12-rc3.orig/mm/mremap.c	2005-04-26 09:02:40.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/mm/mremap.c	2005-04-26 15:55:15.000000000 +0100
@@ -9,6 +9,7 @@
 
 #include <linux/mm.h>
 #include <linux/hugetlb.h>
+#include <linux/ioproc.h>
 #include <linux/slab.h>
 #include <linux/shm.h>
 #include <linux/mman.h>
@@ -161,6 +162,8 @@
 {
 	unsigned long offset;
 
+	ioproc_invalidate_range(vma, old_addr, old_addr + len);
+	ioproc_invalidate_range(vma, new_addr, new_addr + len);
 	flush_cache_range(vma, old_addr, old_addr + len);
 
 	/*
diff -ruN linux-2.6.12-rc3.orig/mm/msync.c linux-2.6.12-rc3.ioproc/mm/msync.c
--- linux-2.6.12-rc3.orig/mm/msync.c	2005-04-26 09:02:40.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/mm/msync.c	2005-04-26 15:55:15.000000000 +0100
@@ -13,6 +13,7 @@
 #include <linux/mman.h>
 #include <linux/hugetlb.h>
 #include <linux/syscalls.h>
+#include <linux/ioproc.h>
 
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
@@ -95,6 +96,7 @@
 	pgd = pgd_offset(mm, addr);
 	flush_cache_range(vma, addr, end);
 	spin_lock(&mm->page_table_lock);
+	ioproc_sync_range(vma, addr, end);
 	do {
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
diff -ruN linux-2.6.12-rc3.orig/mm/rmap.c linux-2.6.12-rc3.ioproc/mm/rmap.c
--- linux-2.6.12-rc3.orig/mm/rmap.c	2005-04-26 09:02:40.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/mm/rmap.c	2005-04-26 15:55:15.000000000 +0100
@@ -53,6 +53,7 @@
 #include <linux/init.h>
 #include <linux/rmap.h>
 #include <linux/rcupdate.h>
+#include <linux/ioproc.h>
 
 #include <asm/tlbflush.h>
 
@@ -573,6 +574,7 @@
 	}
 
 	/* Nuke the page table entry. */
+	ioproc_invalidate_page(vma, address);
 	flush_cache_page(vma, address, page_to_pfn(page));
 	pteval = ptep_clear_flush(vma, address, pte);
 
@@ -690,6 +692,7 @@
 			continue;
 
 		/* Nuke the page table entry. */
+		ioproc_invalidate_page(vma, address);
 		flush_cache_page(vma, address, pfn);
 		pteval = ptep_clear_flush(vma, address, pte);
 
diff -ruN linux-2.6.12-rc3.orig/arch/i386/defconfig linux-2.6.12-rc3.ioproc/arch/i386/defconfig
--- linux-2.6.12-rc3.orig/arch/i386/defconfig	2005-04-26 08:59:33.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/arch/i386/defconfig	2005-04-26 15:55:15.000000000 +0100
@@ -120,6 +120,7 @@
 CONFIG_IRQBALANCE=y
 CONFIG_HAVE_DEC_LOCK=y
 # CONFIG_REGPARM is not set
+CONFIG_IOPROC=y
 
 #
 # Power management options (ACPI, APM)
diff -ruN linux-2.6.12-rc3.orig/arch/i386/Kconfig linux-2.6.12-rc3.ioproc/arch/i386/Kconfig
--- linux-2.6.12-rc3.orig/arch/i386/Kconfig	2005-04-26 08:59:33.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/arch/i386/Kconfig	2005-04-26 15:55:15.000000000 +0100
@@ -923,6 +923,8 @@
 
 	  If unsure, say Y. Only embedded should say N here.
 
+source "mm/Kconfig"
+
 endmenu
 
 
diff -ruN linux-2.6.12-rc3.orig/arch/ia64/defconfig linux-2.6.12-rc3.ioproc/arch/ia64/defconfig
--- linux-2.6.12-rc3.orig/arch/ia64/defconfig	2005-03-02 07:37:48.000000000 +0000
+++ linux-2.6.12-rc3.ioproc/arch/ia64/defconfig	2005-04-26 15:55:15.000000000 +0100
@@ -92,6 +92,7 @@
 CONFIG_PERFMON=y
 CONFIG_IA64_PALINFO=y
 CONFIG_ACPI_DEALLOCATE_IRQ=y
+CONFIG_IOPROC=y
 
 #
 # Firmware Drivers
diff -ruN linux-2.6.12-rc3.orig/arch/ia64/Kconfig linux-2.6.12-rc3.ioproc/arch/ia64/Kconfig
--- linux-2.6.12-rc3.orig/arch/ia64/Kconfig	2005-04-26 08:59:38.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/arch/ia64/Kconfig	2005-04-26 15:55:15.000000000 +0100
@@ -319,6 +319,8 @@
 	depends on IOSAPIC && EXPERIMENTAL
 	default y
 
+source "mm/Kconfig"
+
 source "drivers/firmware/Kconfig"
 
 source "fs/Kconfig.binfmt"
diff -ruN linux-2.6.12-rc3.orig/arch/x86_64/defconfig linux-2.6.12-rc3.ioproc/arch/x86_64/defconfig
--- linux-2.6.12-rc3.orig/arch/x86_64/defconfig	2005-04-26 09:00:10.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/arch/x86_64/defconfig	2005-04-26 15:55:15.000000000 +0100
@@ -100,6 +100,7 @@
 CONFIG_SECCOMP=y
 CONFIG_GENERIC_HARDIRQS=y
 CONFIG_GENERIC_IRQ_PROBE=y
+CONFIG_IOPROC=y
 
 #
 # Power management options
diff -ruN linux-2.6.12-rc3.orig/arch/x86_64/Kconfig linux-2.6.12-rc3.ioproc/arch/x86_64/Kconfig
--- linux-2.6.12-rc3.orig/arch/x86_64/Kconfig	2005-04-26 09:00:10.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/arch/x86_64/Kconfig	2005-04-26 15:55:15.000000000 +0100
@@ -458,6 +458,8 @@
 	depends on IA32_EMULATION
 	default y
 
+source "mm/Kconfig"
+
 endmenu
 
 source drivers/Kconfig
diff -ruN linux-2.6.12-rc3.orig/Documentation/vm/ioproc.txt linux-2.6.12-rc3.ioproc/Documentation/vm/ioproc.txt
--- linux-2.6.12-rc3.orig/Documentation/vm/ioproc.txt	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/Documentation/vm/ioproc.txt	2005-04-26 15:55:15.000000000 +0100
@@ -0,0 +1,500 @@
+Linux IOPROC patch overview
+===========================
+
+The network interface for an HPC network differs significantly from
+network interfaces for traditional IP networks. HPC networks tend to
+be used directly from user processes and perform large RDMA transfers
+between the process address spaces. They also have a requirement
+for low latency communication, and typically achieve this by OS bypass
+techniques.  This then requires a different model to traditional
+interconnects, in that a process may need to expose a large amount of
+it's address space to the network RDMA.
+
+Locking down of memory has been a common mechanism for performing
+this, together with a pin-down cache implemented in user
+libraries. The disadvantage of this method is that large portions of
+the physical memory can be locked down for a single process, even if
+it's working set changes over the different phases of it's
+execution. This leads to inefficient memory utilisation - akin to the
+disadvantage of swapping compared to paging.
+
+This model also has problems where memory is being dynamically
+allocated and freed, since the pin down cache is unaware that memory
+may have been released by a call to munmap() and so it will still be
+locking down the now unused pages.
+
+Some modern HPC network interfaces implement their own MMU and are
+able to handle a translation fault during a network access. The
+Quadrics (http://www.quadrics.com) devices (Elan3 and Elan4) have done
+this for some time, and the Infiniband standard also allows for the
+case where memory has been deregistered when an RDMA occurs.
+These NICs are able to operate in an environment where paging occurs
+and do not require memory to be locked down. The advantage of this is
+that the user process can expose large portions of its address space
+without having to worry about physical memory constraints.
+
+However should the operating system decide to swap a page to disk,
+then the NIC must be made aware that it should no longer read/write
+from this memory, but should generate a translation fault instead.
+
+The ioproc patch has been developed to provide a mechanism whereby the
+device driver for a NIC can be made aware of when a user process's
+address translations change, either by paging or by explicitly mapping
+or unmapping of memory.
+
+The patch involves inserting callbacks where translations are being
+invalidated to notify the NIC that the memory behind those
+translations is no longer visible to the application (and so should
+not be visible to the NIC). This callback is then responsible for
+ensuring that the NIC will not access the physical memory that was
+being mapped.
+
+An ioproc invalidate callback in the kswapd code could be utilised to
+prevent memory from being paged out if the NIC is unable to support
+RDMA page faulting. This has not yet been implemented in this patch.
+
+For NICs which support RDMA page faulting, there is no requirement
+for a user level pin down cache, since they are able to page-in their
+translations on the first communication using a buffer. However this
+is likely to be inefficient, resulting in slow first use of the
+buffer. If the communication buffers were continually allocated and
+freed using mmap() based malloc() calls then this would lead to all
+communications being slower than desirable.
+
+To optimise these warm-up cases the ioproc patch adds calls to
+ioproc_update wherever the kernel is creating translations for a user
+process. These then allow the device driver to preload translations
+so that they are already present for the first network communication
+from a buffer.
+
+Linux 2.6 IOPROC implementation details
+=======================================
+
+The Linux IOPROC patch adds hooks to the Linux VM code whenever page
+table entries are being created and/or invalidated. IOPROC device
+drivers can register their interest in being informed of such changes
+by registering an ioproc_ops structure which is defined as follows;
+
+extern int ioproc_register_ops(struct mm_struct *mm, struct ioproc_ops *ip);
+extern int ioproc_unregister_ops(struct mm_struct *mm, struct ioproc_ops *ip);
+
+typedef struct ioproc_ops {
+	struct ioproc_ops *next;
+	void *arg;
+
+	void (*release)(void *arg, struct mm_struct *mm);
+	void (*sync_range)(void *arg, struct vm_area_struct *vma, unsigned long start, unsigned long end);
+	void (*invalidate_range)(void *arg, struct vm_area_struct *vma, unsigned long start, unsigned long end);
+	void (*update_range)(void *arg, struct vm_area_struct *vma, unsigned long start, unsigned long end);
+
+	void (*change_protection)(void *arg, struct vm_area_struct *vma, unsigned long start, unsigned long end, pgprot_t newprot);
+
+	void (*sync_page)(void *arg, struct vm_area_struct *vma, unsigned long address);
+	void (*invalidate_page)(void *arg, struct vm_area_struct *vma, unsigned long address);
+	void (*update_page)(void *arg, struct vm_area_struct *vma, unsigned long address);
+
+} ioproc_ops_t;
+
+ioproc_register_ops
+===================
+This function should be called by the IOPROC device driver to register
+its interest in PTE changes for the process associated with the passed
+in mm_struct.
+
+The ioproc registration is not inherited across fork() and should be
+called once for each process that IOPROC is interested in.
+
+This function must be called whilst holding the mm->page_table_lock.
+
+ioproc_unregister_ops
+=====================
+This function should be called by the IOPROC device driver when it no
+longer requires informing of PTE changes in the process associated
+with the supplied mm_struct.
+
+This function is not normally needed to be called as the ioproc_ops
+struct is unlinked from the associated mm_struct during the
+ioproc_release() call.
+
+This function must be called whilst holding the mm->page_table_lock.
+
+ioproc_ops struct
+=================
+A linked list ioproc_ops structures is hung off the user process
+mm_struct (linux/sched.h). At each hook point in the patched kernel
+the ioproc patch will call the associated ioproc_ops callback function
+pointer in turn for each registered structure.
+
+The intention of the callbacks is to allow the IOPROC device driver to
+inspect the new or modified PTE entry via the Linux kernel
+(e.g. find_pte_map()). These callbacks should not modify the Linux
+kernel VM state or PTE entries.
+
+The ioproc_ops callback function pointers are defined as follows;
+
+ioproc_release
+==============
+The release hook is called when a program exits and all its vma areas
+are torn down and unmapped. i.e. during exit_mmap(). Before each
+release hook is called the ioproc_ops structure is unlinked from the
+mm_struct.
+
+No locks are required as the process has the only reference to the mm
+at this point.
+
+ioproc_sync_[range|page]
+========================
+The sync hooks are called when a memory map is synchronised with its
+disk image i.e. when the msync() syscall is invoked. Any future read
+or write by the IOPROC device to the associated pages should cause the
+page to be marked as referenced or modified.
+
+Called holding the mm->page_table_lock
+
+ioproc_invalidate_[range|page]
+==============================
+The invalidate hooks are called whenever a valid PTE is unloaded
+e.g. when a page is unmapped by the user or paged out by the
+kernel. After this call the IOPROC must not access the physical memory
+again unless a new translation is loaded.
+
+Called holding the mm->page_table_lock
+
+ioproc_update_[range|page]
+==========================
+The update hooks are called whenever a valid PTE is loaded
+e.g. mmaping memory, moving the brk up, when breaking COW or faulting
+in an anonymous page of memory. These give the IOPROC device the
+opportunity to load translations speculatively, which can improve
+performance by avoiding device translation faults.
+
+Called holding the mm->page_table_lock
+
+ioproc_change_protection
+========================
+This hook is called when the protection on a region of memory is
+changed i.e. when the mprotect() syscall is invoked.
+
+The IOPROC must not be able to write to a read-only page, so if the
+permissions are downgraded then it must honour them. If they are
+upgraded it can treat this in the same way as the
+ioproc_update_[range|page]() calls
+
+Called holding the mm->page_table_lock
+
+
+Linux 2.6 IOPROC patch details
+==============================
+
+Here are the specific details of each ioproc hook added to the Linux
+2.6 VM system and the reasons for doing so;
+
+===============================================================================
+++++ FILE
+	mm/fremap.c
+
+==== FUNCTION
+	zap_pte
+
+CALLED FROM
+	install_page
+	install_file_pte
+
+PTE MODIFICATION
+	ptep_clear_flush
+
+ADDED HOOKS
+	ioproc_invalidate_page
+
+==== FUNCTION
+	install_page
+
+CALLED FROM
+	filemap_populate, shmem_populate
+
+PTE MODIFICATION
+	set_pte_at
+
+ADDED HOOKS
+	ioproc_update_page
+
+==== FUNCTION
+	install_file_pte
+
+CALLED FROM
+	filemap_populate, shmem_populate
+
+PTE MODIFICATION
+	set_pte_at
+
+ADDED HOOKS
+	ioproc_update_page
+
+
+===============================================================================
+++++ FILE
+	mm/memory.c
+
+==== FUNCTION
+	copy_page_range
+
+CALLED FROM
+       dup_mmap (fork.c)
+
+PTE MODIFICATION
+	set_pte_at (copy_one_pte)
+
+ADDED HOOKS
+	None necessary as its creating a new process
+
+==== FUNCTION
+	zap_page_range
+
+CALLED FROM
+	read_zero_pagealigned, madvise_dontneed, unmap_mapping_range,
+	unmap_mapping_range_list, do_mmap_pgoff
+
+PTE MODIFICATION
+	set_pte_at (unmap_vmas)
+
+ADDED HOOKS
+	ioproc_invalidate_range
+
+
+==== FUNCTION
+	zeromap_page_range
+
+CALLED FROM
+	read_zero_pagealigned, mmap_zero
+
+PTE MODIFICATION
+	set_pte_at (zeromap_pte_range via zeromap_[pud|pmd|pte]_range)
+
+ADDED HOOKS
+	ioproc_invalidate_range
+	ioproc_update_range
+
+
+==== FUNCTION
+	remap_pfn_range
+
+CALLED FROM
+	many device drivers
+
+PTE MODIFICATION
+	set_pte_at (remap_pte_range via remap_[pud|pmd|pte]_range)
+
+ADDED HOOKS
+	ioproc_invalidate_range
+	ioproc_update_range
+
+
+==== FUNCTION
+	break_cow
+
+CALLED FROM
+	do_wp_page
+
+PTE MODIFICATION
+	ptep_establish
+
+ADDED HOOKS
+	ioproc_invalidate_page
+	ioproc_update_page
+
+
+==== FUNCTION
+	do_wp_page
+
+CALLED FROM
+       do_swap_page, handle_pte_fault
+
+PTE MODIFICATION
+	ptep_set_access_flags, break_cow
+
+ADDED HOOKS
+	ioproc_update_page
+
+
+==== FUNCTION
+	do_swap_page
+
+CALLED FROM
+	handle_pte_fault
+
+PTE MODIFICATION
+	set_pte_at
+
+ADDED HOOKS
+	ioproc_update_page
+
+
+==== FUNCTION
+	do_anonymous_page
+
+CALLED FROM
+	do_no_page
+
+PTE MODIFICATION
+	set_pte_at
+
+ADDED HOOKS
+	ioproc_update_page
+
+
+==== FUNCTION
+	do_no_page
+
+CALLED FROM
+	do_file_page, handle_pte_fault
+
+PTE MODIFICATION
+	set_pte_at
+
+ADDED HOOKS
+	ioproc_update_page
+
+
+==== FUNCTION
+	handle_pte_fault
+
+CALLED FROM
+	handle_mm_fault
+
+PTE MODIFICATION
+	ptep_set_access_flags, do_no_page, do_file_page, do_swap_page
+
+ADDED HOOKS
+	Handled in called functions and not necessary for minor fault
+
+
+===============================================================================
+++++ FILE
+	mm/mmap.c
+
+==== FUNCTION
+	unmap_region
+
+CALLED FROM
+	do_munmap
+
+PTE MODIFICATION
+	set_pte_at (unmap_vmas)
+
+ADDED HOOKS
+	ioproc_invalidate_range
+
+
+==== FUNCTION
+	exit_mmap
+
+CALLED FROM
+	mmput
+
+PTE MODIFICATION
+	set_pte_at (unmap_vmas)
+
+ADDED HOOKS
+	ioproc_release
+
+
+===============================================================================
+++++ FILE
+	mm/mprotect.c
+
+==== FUNCTION
+	change_protection
+
+CALLED FROM
+	mprotect_fixup
+
+PTE MODIFICATION
+	set_pte_at (change_pte_range via change_[pud|pmd|pte]_range)
+
+ADDED HOOKS
+	ioproc_change_protection
+
+
+===============================================================================
+++++ FILE
+	mm/mremap.c
+
+==== FUNCTION
+	move_page_tables
+
+CALLED FROM
+	move_vma
+
+PTE MODIFICATION
+	ptep_clear_flush (move_one_page)
+
+ADDED HOOKS
+	ioproc_invalidate_range
+	ioproc_invalidate_range
+
+
+===============================================================================
+++++ FILE
+	mm/rmap.c
+
+==== FUNCTION
+	try_to_unmap_one
+
+CALLED FROM
+	try_to_unmap_anon, try_to_unmap_file
+
+PTE MODIFICATION
+	ptep_clear_flush
+
+ADDED HOOKS
+	ioproc_invalidate_page
+
+
+==== FUNCTION
+	try_to_unmap_cluster
+
+CALLED FROM
+	try_to_unmap_file
+
+PTE MODIFICATION
+	ptep_clear_flush
+
+ADDED HOOKS
+	ioproc_invalidate_page
+
+
+===============================================================================
+++++ FILE
+	mm/msync.c
+
+==== FUNCTION
+	filemap_sync
+
+CALLED FROM
+	msync_interval
+
+PTE MODIFICATION
+	ptep_clear_flush_dirty (filemap_sync_pte)
+
+ADDED HOOKS
+	ioproc_sync_range
+
+
+===============================================================================
+++++ FILE
+	mm/hugetlb.c
+
+==== FUNCTION
+	zap_hugepage_range
+
+CALLED FROM
+	hugetlb_vmtruncate_list
+
+PTE MODIFICATION
+	ptep_get_and_clear (unmap_hugepage_range)
+
+ADDED HOOK
+	ioproc_invalidate_range
+
+
+-- Last update DavidAddison - 26 Apr 2005

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
  2005-04-26 15:49 [PATCH][RFC] Linux VM hooks for advanced RDMA NICs David Addison
@ 2005-04-26 16:57 ` Jesper Juhl
  2005-04-26 17:13   ` Lee Revell
  2005-04-26 17:06 ` Brice Goglin
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 20+ messages in thread
From: Jesper Juhl @ 2005-04-26 16:57 UTC (permalink / raw)
  To: David Addison
  Cc: linux-kernel, Andrew Morton, Andrea Arcangeli, David Addison

On Tue, 26 Apr 2005, David Addison wrote:

> Hi,
> here is a patch we use to integrate the Quadrics NICs into the Linux kernel.
<snip>

A few small comments below.


> 
> +static inline void
> +ioproc_release(struct mm_struct *mm)
> +{

Return types on same line as function name makes grep'ing a lot 
easier/nicer.

Here's the example from Documentation/CodingStyle : 

        int function(int x)
        {
                body of function
        }

<snip>
> +/* ! CONFIG_IOPROC so make all hooks empty */
> +
> +#define ioproc_release(mm)			do { } while (0)
> +
> +#define ioproc_sync_range(vma, start, end)	do { } while (0)
> +
> +#define ioproc_invalidate_range(vma, start,end)	do { } while (0)
> +
> +#define ioproc_update_range(vma, start, end)	do { } while (0)
> +
> +#define ioproc_change_protection(vma, start, end, prot)	do { } while (0)
> +
> +#define ioproc_sync_page(vma, addr)		do { } while (0)
> +
> +#define ioproc_invalidate_page(vma, addr)	do { } while (0)
> +
> +#define ioproc_update_page(vma, addr)		do { } while (0)
> +
Why all these blank lines between each define? Seems like just a waste of 
screen space to me.


-- 
Jesper Juhl


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
  2005-04-26 15:49 [PATCH][RFC] Linux VM hooks for advanced RDMA NICs David Addison
  2005-04-26 16:57 ` Jesper Juhl
@ 2005-04-26 17:06 ` Brice Goglin
  2005-04-27  9:41   ` David Addison
  2005-04-27 13:43   ` Andi Kleen
  2005-04-28  1:42 ` Troy Benjegerdes
  2005-04-28  7:21 ` Brice Goglin
  3 siblings, 2 replies; 20+ messages in thread
From: Brice Goglin @ 2005-04-26 17:06 UTC (permalink / raw)
  To: David Addison
  Cc: linux-kernel, Andrew Morton, Andrea Arcangeli, David Addison

David Addison a écrit :
> Hi,
> 
> here is a patch we use to integrate the Quadrics NICs into the Linux 
> kernel.
> The patch adds hooks to the Linux VM subsystem so that registered 'IOPROC'
> devices can be informed of page table changes.
> This allows the Quadrics NICs to perform user RDMAs safely, without 
> requiring
> page pinning. Looking through some of the recent IB and Ammasso 
> discussions,
> it may also prove useful to those NICs too.

Hi,

I worked on a similar patch to help updating a registration cache on
Myrinet. I came to the problem of deciding between registering ioproc
to the entire address space (1) or only to some VMA (2).
You're doing (1), I tried (2).

(2) avoids calling ioproc hooks for all pages that are never involved
in any communication. This might be good if the amount of pages that
are involved is not too high and if the coproc_ops cost is a little bit
high.
Do you have any numbers about this in real applications on QsNet ?

I see two drawback in (2).
First, it requires to play with the list of ioproc_ops when VMA are
merged or split. Actually, it's not that bad since the list often
contains only 1 ioproc_ops.
Secondly, you have to add the ioproc to all involved VMA at some point.
It's easy when the API asks the application to register, you just add
the ioproc_ops to the target VMA during registration. But, I guess it's
not easy with Quadrics, right ?


I see in your patch that ioproc are not inherited during fork.
How do you support fork in your driver/lib then ?
What if a COW page is given to the son and the copy to the father
while some IO are being processed ? Do you require the application to
call a specific routine after forking ?
Don't you think it might be good to add a hook in the fork code
so that ioproc are inherited or duplicated pages are invalidated
in the card ?

Regards,
Brice

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
  2005-04-26 16:57 ` Jesper Juhl
@ 2005-04-26 17:13   ` Lee Revell
  2005-04-26 17:20     ` Jesper Juhl
                       ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Lee Revell @ 2005-04-26 17:13 UTC (permalink / raw)
  To: Jesper Juhl
  Cc: David Addison, linux-kernel, Andrew Morton, Andrea Arcangeli,
	David Addison

On Tue, 2005-04-26 at 18:57 +0200, Jesper Juhl wrote:
> > 
> > +static inline void
> > +ioproc_release(struct mm_struct *mm)
> > +{
> 
> Return types on same line as function name makes grep'ing a lot 
> easier/nicer.
> 
> Here's the example from Documentation/CodingStyle : 
> 
>         int function(int x)
>         {

How so?  I never understood the reasons.  This makes it easier to grep
for everything that returns int.  But you make the common case (what
file is function() defined in?) harder.

Lee



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
  2005-04-26 17:13   ` Lee Revell
@ 2005-04-26 17:20     ` Jesper Juhl
  2005-04-26 17:28       ` Lee Revell
  2005-04-26 20:09       ` Lars Marowsky-Bree
  2005-04-28 11:34     ` Jakob Oestergaard
  2005-04-29  8:22     ` Benjamin Herrenschmidt
  2 siblings, 2 replies; 20+ messages in thread
From: Jesper Juhl @ 2005-04-26 17:20 UTC (permalink / raw)
  To: Lee Revell
  Cc: David Addison, linux-kernel, Andrew Morton, Andrea Arcangeli,
	David Addison

On Tue, 26 Apr 2005, Lee Revell wrote:

> On Tue, 2005-04-26 at 18:57 +0200, Jesper Juhl wrote:
> > > 
> > > +static inline void
> > > +ioproc_release(struct mm_struct *mm)
> > > +{
> > 
> > Return types on same line as function name makes grep'ing a lot 
> > easier/nicer.
> > 
> > Here's the example from Documentation/CodingStyle : 
> > 
> >         int function(int x)
> >         {
> 
> How so?  I never understood the reasons.  This makes it easier to grep
> for everything that returns int.  But you make the common case (what
> file is function() defined in?) harder.
> 
I don't know what you do, but when I'm grep'ing the tree for some function 
I'm often looking for its return type, having that on the same line as the 
function name lets me grep for the function name and the grep output will 
contain the return type and function name nicely on the same line.

-- 
Jesper


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
  2005-04-26 17:20     ` Jesper Juhl
@ 2005-04-26 17:28       ` Lee Revell
  2005-04-26 17:38         ` Jesper Juhl
  2005-04-26 20:14         ` John W. Linville
  2005-04-26 20:09       ` Lars Marowsky-Bree
  1 sibling, 2 replies; 20+ messages in thread
From: Lee Revell @ 2005-04-26 17:28 UTC (permalink / raw)
  To: Jesper Juhl
  Cc: David Addison, linux-kernel, Andrew Morton, Andrea Arcangeli,
	David Addison

On Tue, 2005-04-26 at 19:20 +0200, Jesper Juhl wrote:
> I don't know what you do, but when I'm grep'ing the tree for some function 
> I'm often looking for its return type, having that on the same line as the 
> function name lets me grep for the function name and the grep output will 
> contain the return type and function name nicely on the same line.
> 

I do a lot of looking at large hunks of code I'm not familiar with and
trying to figure out how it works.  It's quite handy to grep for
foo_func to see all usages, then ^foo_func to see the function.  I guess
my preferred style favors people trying to grok code for the first time,
while the kernel style favors those who know it inside out.

Anyway, the coding style guidelines also state clearly that these points
are not up for debate on LKML so I'll stop now...

Lee


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
  2005-04-26 17:28       ` Lee Revell
@ 2005-04-26 17:38         ` Jesper Juhl
  2005-04-26 20:14         ` John W. Linville
  1 sibling, 0 replies; 20+ messages in thread
From: Jesper Juhl @ 2005-04-26 17:38 UTC (permalink / raw)
  To: Lee Revell
  Cc: David Addison, linux-kernel, Andrew Morton, Andrea Arcangeli,
	David Addison

On Tue, 26 Apr 2005, Lee Revell wrote:

> On Tue, 2005-04-26 at 19:20 +0200, Jesper Juhl wrote:
> > I don't know what you do, but when I'm grep'ing the tree for some function 
> > I'm often looking for its return type, having that on the same line as the 
> > function name lets me grep for the function name and the grep output will 
> > contain the return type and function name nicely on the same line.
> > 
> 
> I do a lot of looking at large hunks of code I'm not familiar with and
> trying to figure out how it works.  It's quite handy to grep for
> foo_func to see all usages, then ^foo_func to see the function.

Have you ever looked at what "make tags" gives you?
Run  make tags  in the kernel source dir, then open up a source file in 
vim, place the cursor over some struct name or function name and press 
CTRL+] and you'll be taken to the definition, you can drill down several 
levels like that, and if you want to go back up one level to where you 
were you simply press CTRL+t  very useful when navigating the source.

-- 
Jesper



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
  2005-04-26 17:20     ` Jesper Juhl
  2005-04-26 17:28       ` Lee Revell
@ 2005-04-26 20:09       ` Lars Marowsky-Bree
  1 sibling, 0 replies; 20+ messages in thread
From: Lars Marowsky-Bree @ 2005-04-26 20:09 UTC (permalink / raw)
  To: Jesper Juhl, Lee Revell; +Cc: linux-kernel

On 2005-04-26T19:20:13, Jesper Juhl <juhl-lkml@dif.dk> wrote:

> I don't know what you do, but when I'm grep'ing the tree for some function 
> I'm often looking for its return type, having that on the same line as the 
> function name lets me grep for the function name and the grep output will 
> contain the return type and function name nicely on the same line.

grep -rB1 '^function' drivers/




^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
  2005-04-26 17:28       ` Lee Revell
  2005-04-26 17:38         ` Jesper Juhl
@ 2005-04-26 20:14         ` John W. Linville
  2005-04-26 20:17           ` Lee Revell
  1 sibling, 1 reply; 20+ messages in thread
From: John W. Linville @ 2005-04-26 20:14 UTC (permalink / raw)
  To: Lee Revell
  Cc: Jesper Juhl, David Addison, linux-kernel, Andrew Morton,
	Andrea Arcangeli, David Addison

On Tue, Apr 26, 2005 at 01:28:31PM -0400, Lee Revell wrote:

> I do a lot of looking at large hunks of code I'm not familiar with and
> trying to figure out how it works.  It's quite handy to grep for

I'd suggest cscope...

John
-- 
John W. Linville
linville@tuxdriver.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
  2005-04-26 20:14         ` John W. Linville
@ 2005-04-26 20:17           ` Lee Revell
  0 siblings, 0 replies; 20+ messages in thread
From: Lee Revell @ 2005-04-26 20:17 UTC (permalink / raw)
  To: John W. Linville
  Cc: Jesper Juhl, David Addison, linux-kernel, Andrew Morton,
	Andrea Arcangeli, David Addison

On Tue, 2005-04-26 at 16:14 -0400, John W. Linville wrote:
> On Tue, Apr 26, 2005 at 01:28:31PM -0400, Lee Revell wrote:
> 
> > I do a lot of looking at large hunks of code I'm not familiar with and
> > trying to figure out how it works.  It's quite handy to grep for
> 
> I'd suggest cscope...

Thanks.  But now I feel bad hijacking the OP's thread.

Any comments on the patch?  ;-)

Lee


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
  2005-04-26 17:06 ` Brice Goglin
@ 2005-04-27  9:41   ` David Addison
  2005-04-28  8:38     ` Andy Isaacson
  2005-04-27 13:43   ` Andi Kleen
  1 sibling, 1 reply; 20+ messages in thread
From: David Addison @ 2005-04-27  9:41 UTC (permalink / raw)
  To: Brice Goglin; +Cc: linux-kernel, Andrew Morton, Andrea Arcangeli, David Addison

[-- Attachment #1: Type: text/plain, Size: 3679 bytes --]

Brice Goglin wrote:
> David Addison a écrit :
> 
>> Hi,
>>
>> here is a patch we use to integrate the Quadrics NICs into the Linux 
>> kernel.
>> The patch adds hooks to the Linux VM subsystem so that registered 
>> 'IOPROC'
>> devices can be informed of page table changes.
>> This allows the Quadrics NICs to perform user RDMAs safely, without 
>> requiring
>> page pinning. Looking through some of the recent IB and Ammasso 
>> discussions,
>> it may also prove useful to those NICs too.
> 
> 
> Hi,
> 
> I worked on a similar patch to help updating a registration cache on
> Myrinet. I came to the problem of deciding between registering ioproc
> to the entire address space (1) or only to some VMA (2).
> You're doing (1), I tried (2).
> 
> (2) avoids calling ioproc hooks for all pages that are never involved
> in any communication. This might be good if the amount of pages that
> are involved is not too high and if the coproc_ops cost is a little bit
> high.
> Do you have any numbers about this in real applications on QsNet ?
> 
We have always taken approach (1) as it seems to be the simplest method
and offers the model where the whole user process space can be made available
for RDMA operations.
Admittedly, the update calls for pages which are not going to be used for
RDMA operations is an overhead, but the device driver can elect not to
present update functions and instead rely on pre-registration of comms
buffers.
For invalidates the device driver will have knowledge of the registered
regions and can quickly ignore irrelevant unloads.

With HPC applications in general we find the memory image is pretty static
over the life of the job and hence most of the costs are taken as the
pages are created during job startup and initialisation.

> I see two drawback in (2).
> First, it requires to play with the list of ioproc_ops when VMA are
> merged or split. Actually, it's not that bad since the list often
> contains only 1 ioproc_ops.
> Secondly, you have to add the ioproc to all involved VMA at some point.
> It's easy when the API asks the application to register, you just add
> the ioproc_ops to the target VMA during registration. But, I guess it's
> not easy with Quadrics, right ?
> 
In the past we have allowed dynamic page faulting of all application exposed
memory via RDMA operations.
However, our newer libraries do now implement a registration cache so we
can pre-load translations to our NIC MMU (or pin, if kernel invalidate hooks
are not available).
However, I still prefer model (1) as it allows both implementations and
appears to be much simpler in terms of the linux kernel changes required.

> 
> I see in your patch that ioproc are not inherited during fork.
> How do you support fork in your driver/lib then ?
> What if a COW page is given to the son and the copy to the father
> while some IO are being processed ? Do you require the application to
> call a specific routine after forking ?
> Don't you think it might be good to add a hook in the fork code
> so that ioproc are inherited or duplicated pages are invalidated
> in the card ?
> 
Yes, on fork() our programming model is for the child to attach to the
device again. The QsNet model has a NIC MMU context for each process
so it makes sense for each process to attach and have independent
IOPROC and NIC memory management.

But you're right, there should be IOPROC hooks to ensure that the device
cannot write to COW pages after the fork. I've added a new callback for this;
ioproc_wrprotect_page() called in copy_one_pte(), and a new revised patch
is attached (Jesper: with whitespace corrections too ;-)

> Regards,
> Brice

Thanks for your comments,

Cheers
David.

[-- Attachment #2: ioproc-2.6.12-rc3.patch --]
[-- Type: text/x-patch, Size: 39428 bytes --]

diff -ruN linux-2.6.12-rc3.orig/include/linux/ioproc.h linux-2.6.12-rc3.ioproc/include/linux/ioproc.h
--- linux-2.6.12-rc3.orig/include/linux/ioproc.h	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/include/linux/ioproc.h	2005-04-27 09:59:49.000000000 +0100
@@ -0,0 +1,273 @@
+/* -*- linux-c -*-
+ *
+ *    Copyright (C) 2002-2005 Quadrics Ltd.
+ *
+ *    This program is free software; you can redistribute it and/or modify
+ *    it under the terms of the GNU General Public License as published by
+ *    the Free Software Foundation; either version 2 of the License, or
+ *    (at your option) any later version.
+ *
+ *    This program is distributed in the hope that it will be useful,
+ *    but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *    GNU General Public License for more details.
+ *
+ *    You should have received a copy of the GNU General Public License
+ *    along with this program; if not, write to the Free Software
+ *    Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ *
+ */
+
+/*
+ * Callbacks for IO processor page table updates.
+ */
+
+#ifndef __LINUX_IOPROC_H__
+#define __LINUX_IOPROC_H__
+
+#include <linux/sched.h>
+#include <linux/mm.h>
+
+typedef struct ioproc_ops {
+	struct ioproc_ops *next;
+	void *arg;
+
+	void (*release)(void *arg, struct mm_struct *mm);
+	void (*sync_range)(void *arg, struct vm_area_struct *vma, unsigned long start, unsigned long end);
+	void (*invalidate_range)(void *arg, struct vm_area_struct *vma, unsigned long start, unsigned long end);
+	void (*update_range)(void *arg, struct vm_area_struct *vma, unsigned long start, unsigned long end);
+
+	void (*change_protection)(void *arg, struct vm_area_struct *vma, unsigned long start, unsigned long end, pgprot_t newprot);
+
+	void (*sync_page)(void *arg, struct vm_area_struct *vma, unsigned long address);
+	void (*invalidate_page)(void *arg, struct vm_area_struct *vma, unsigned long address);
+	void (*update_page)(void *arg, struct vm_area_struct *vma, unsigned long address);
+	void (*wrprotect_page)(void *arg, struct vm_area_struct *vma, unsigned long address);
+
+} ioproc_ops_t;
+
+/* IOPROC Registration
+ *
+ * Called by the IOPROC device driver to register its interest in page table
+ * changes for the process associated with the supplied mm_struct
+ *
+ * The caller should first allocate and fill out an ioproc_ops structure with
+ * the function pointers initialised to the device driver specific code for
+ * each callback. If the device driver doesn't have code for a particular
+ * callback then it should set the function pointer to be NULL.
+ * The ioproc_ops arg parameter will be passed unchanged as the first argument
+ * to each callback function invocation.
+ *
+ * The ioproc registration is not inherited across fork() and should be called
+ * once for each process that the IOPROC device driver is interested in.
+ *
+ * Must be called holding the mm->page_table_lock
+ */
+extern int ioproc_register_ops(struct mm_struct *mm, struct ioproc_ops *ip);
+
+/* IOPROC De-registration
+ *
+ * Called by the IOPROC device driver when it is no longer interested in page
+ * table changes for the process associated with the supplied mm_struct
+ *
+ * Normally this is not needed to be called as the ioproc_release() code will
+ * automatically unlink the ioproc_ops struct from the mm_struct as the
+ * process exits
+ *
+ * Must be called holding the mm->page_table_lock
+ */
+extern int ioproc_unregister_ops(struct mm_struct *mm, struct ioproc_ops *ip);
+
+#ifdef CONFIG_IOPROC
+
+/* IOPROC Release
+ *
+ * Called during exit_mmap() as all vmas are torn down and unmapped.
+ *
+ * Also unlinks the ioproc_ops structure from the mm list as it goes.
+ *
+ * No need for locks as the mm can no longer be accessed at this point
+ *
+ */
+static inline void ioproc_release(struct mm_struct *mm)
+{
+	struct ioproc_ops *cp;
+
+	while ((cp = mm->ioproc_ops) != NULL) {
+		mm->ioproc_ops = cp->next;
+
+		if (cp->release)
+			cp->release(cp->arg, mm);
+	}
+}
+
+/* IOPROC SYNC RANGE
+ *
+ * Called when a memory map is synchronised with its disk image i.e. when the
+ * msync() syscall is invoked. Any future read or write to the associated
+ * pages by the IOPROC should cause the page to be marked as referenced or
+ * modified.
+ *
+ * Called holding the mm->page_table_lock
+ */
+static inline void ioproc_sync_range(struct vm_area_struct *vma, unsigned long start, unsigned long end)
+{
+	struct ioproc_ops *cp;
+
+	for (cp = vma->vm_mm->ioproc_ops; cp; cp = cp->next)
+		if (cp->sync_range)
+			cp->sync_range(cp->arg, vma, start, end);
+}
+
+/* IOPROC INVALIDATE RANGE
+ *
+ * Called whenever a valid PTE is unloaded e.g. when a page is unmapped by the
+ * user or paged out by the kernel.
+ *
+ * After this call the IOPROC must not access the physical memory again unless
+ * a new translation is loaded.
+ *
+ * Called holding the mm->page_table_lock
+ */
+static inline void ioproc_invalidate_range(struct vm_area_struct *vma, unsigned long start, unsigned long end)
+{
+	struct ioproc_ops *cp;
+	
+	for (cp = vma->vm_mm->ioproc_ops; cp; cp = cp->next)
+		if (cp->invalidate_range)
+			cp->invalidate_range(cp->arg, vma, start, end);
+}
+
+/* IOPROC UPDATE RANGE
+ *
+ * Called whenever a valid PTE is loaded e.g. mmaping memory, moving the brk
+ * up, when breaking COW or faulting in an anonymous page of memory.
+ *
+ * These give the IOPROC device driver the opportunity to load translations
+ * speculatively, which can improve performance by avoiding device translation
+ * faults.
+ *
+ * Called holding the mm->page_table_lock
+ */
+static inline void ioproc_update_range(struct vm_area_struct *vma, unsigned long start, unsigned long end)
+{
+	struct ioproc_ops *cp;
+
+	for (cp = vma->vm_mm->ioproc_ops; cp; cp = cp->next)
+		if (cp->update_range)
+			cp->update_range(cp->arg, vma, start, end);
+}
+
+
+/* IOPROC CHANGE PROTECTION
+ *
+ * Called when the protection on a region of memory is changed i.e. when the
+ * mprotect() syscall is invoked.
+ *
+ * The IOPROC must not be able to write to a read-only page, so if the
+ * permissions are downgraded then it must honour them. If they are upgraded
+ * it can treat this in the same way as the ioproc_update_[range|sync]() calls
+ *
+ * Called holding the mm->page_table_lock
+ */
+static inline void ioproc_change_protection(struct vm_area_struct *vma, unsigned long start, unsigned long end, pgprot_t newprot)
+{
+	struct ioproc_ops *cp;
+
+	for (cp = vma->vm_mm->ioproc_ops; cp; cp = cp->next)
+		if (cp->change_protection)
+			cp->change_protection(cp->arg, vma, start, end, newprot);
+}
+
+/* IOPROC SYNC PAGE
+ *
+ * Called when a memory map is synchronised with its disk image i.e. when the
+ * msync() syscall is invoked. Any future read or write to the associated page
+ * by the IOPROC should cause the page to be marked as referenced or modified.
+ *
+ * Not currently called as msync() calls ioproc_sync_range() instead
+ *
+ * Called holding the mm->page_table_lock
+ */
+static inline void ioproc_sync_page(struct vm_area_struct *vma, unsigned long addr)
+{
+	struct ioproc_ops *cp;
+
+	for (cp = vma->vm_mm->ioproc_ops; cp; cp = cp->next)
+		if (cp->sync_page)
+			cp->sync_page(cp->arg, vma, addr);
+}
+
+/* IOPROC INVALIDATE PAGE
+ *
+ * Called whenever a valid PTE is unloaded e.g. when a page is unmapped by the
+ * user or paged out by the kernel.
+ *
+ * After this call the IOPROC must not access the physical memory again unless
+ * a new translation is loaded.
+ *
+ * Called holding the mm->page_table_lock
+ */
+static inline void ioproc_invalidate_page(struct vm_area_struct *vma, unsigned long addr)
+{
+	struct ioproc_ops *cp;
+
+	for (cp = vma->vm_mm->ioproc_ops; cp; cp = cp->next)
+		if (cp->invalidate_page)
+			cp->invalidate_page(cp->arg, vma, addr);
+}
+
+/* IOPROC UPDATE PAGE
+ *
+ * Called whenever a valid PTE is loaded e.g. mmaping memory, moving the brk
+ * up, when breaking COW or faulting in an anonymous page of memory.
+ *
+ * These give the IOPROC device the opportunity to load translations
+ * speculatively, which can improve performance by avoiding device translation
+ * faults.
+ *
+ * Called holding the mm->page_table_lock
+ */
+static inline void ioproc_update_page(struct vm_area_struct *vma, unsigned long addr)
+{
+	struct ioproc_ops *cp;
+
+	for (cp = vma->vm_mm->ioproc_ops; cp; cp = cp->next)
+		if (cp->update_page)
+			cp->update_page(cp->arg, vma, addr);
+}
+
+/* IOPROC WRPROTECT PAGE
+ *
+ * Called when a page is downgraded for COW (during fork()). This should ensure that
+ * the page can no longer be written by the IOPROC
+ *
+ * Called holding the mm->page_table_lock
+ */
+static inline void ioproc_wrprotect_page(struct vm_area_struct *vma, unsigned long addr)
+{
+	struct ioproc_ops *cp;
+
+	for (cp = vma->vm_mm->ioproc_ops; cp; cp = cp->next)
+		if (cp->wrprotect_page)
+			cp->wrprotect_page(cp->arg, vma, addr);
+}
+
+#else
+
+/* ! CONFIG_IOPROC so make all hooks empty */
+
+#define ioproc_release(mm)			do { } while (0)
+#define ioproc_sync_range(vma, start, end)	do { } while (0)
+#define ioproc_invalidate_range(vma, start,end)	do { } while (0)
+#define ioproc_update_range(vma, start, end)	do { } while (0)
+#define ioproc_change_protection(vma, start, end, prot)	do { } while (0)
+#define ioproc_sync_page(vma, addr)		do { } while (0)
+#define ioproc_invalidate_page(vma, addr)	do { } while (0)
+#define ioproc_update_page(vma, addr)		do { } while (0)
+#define ioproc_wrprotect_page(vma, addr)	do { } while (0)
+
+#endif /* CONFIG_IOPROC */
+
+#endif /* __LINUX_IOPROC_H__ */
diff -ruN linux-2.6.12-rc3.orig/include/linux/sched.h linux-2.6.12-rc3.ioproc/include/linux/sched.h
--- linux-2.6.12-rc3.orig/include/linux/sched.h	2005-04-26 09:02:29.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/include/linux/sched.h	2005-04-26 16:03:07.000000000 +0100
@@ -186,6 +186,9 @@
 asmlinkage void schedule(void);
 
 struct namespace;
+#ifdef CONFIG_IOPROC
+struct ioproc_ops;
+#endif
 
 /* Maximum number of active map areas.. This is a random (large) number */
 #define DEFAULT_MAX_MAP_COUNT	65536
@@ -267,6 +270,11 @@
 
 	unsigned long hiwater_rss;	/* High-water RSS usage */
 	unsigned long hiwater_vm;	/* High-water virtual memory usage */
+
+#ifdef CONFIG_IOPROC
+	/* hooks for io devices with advanced RDMA capabilities */
+	struct ioproc_ops       *ioproc_ops;
+#endif
 };
 
 struct sighand_struct {
diff -ruN linux-2.6.12-rc3.orig/kernel/fork.c linux-2.6.12-rc3.ioproc/kernel/fork.c
--- linux-2.6.12-rc3.orig/kernel/fork.c	2005-04-26 09:02:36.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/kernel/fork.c	2005-04-26 16:03:07.000000000 +0100
@@ -320,6 +320,9 @@
 	spin_lock_init(&mm->page_table_lock);
 	rwlock_init(&mm->ioctx_list_lock);
 	mm->ioctx_list = NULL;
+#ifdef CONFIG_IOPROC
+	mm->ioproc_ops = NULL;
+#endif
 	mm->default_kioctx = (struct kioctx)INIT_KIOCTX(mm->default_kioctx, *mm);
 	mm->free_area_cache = TASK_UNMAPPED_BASE;
 
diff -ruN linux-2.6.12-rc3.orig/mm/fremap.c linux-2.6.12-rc3.ioproc/mm/fremap.c
--- linux-2.6.12-rc3.orig/mm/fremap.c	2005-04-26 09:02:39.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/mm/fremap.c	2005-04-26 16:03:07.000000000 +0100
@@ -12,6 +12,7 @@
 #include <linux/mman.h>
 #include <linux/pagemap.h>
 #include <linux/swapops.h>
+#include <linux/ioproc.h>
 #include <linux/rmap.h>
 #include <linux/module.h>
 #include <linux/syscalls.h>
@@ -30,6 +31,7 @@
 	if (pte_present(pte)) {
 		unsigned long pfn = pte_pfn(pte);
 
+		ioproc_invalidate_page(vma, addr);
 		flush_cache_page(vma, addr, pfn);
 		pte = ptep_clear_flush(vma, addr, ptep);
 		if (pfn_valid(pfn)) {
@@ -99,6 +101,7 @@
 	pte_val = *pte;
 	pte_unmap(pte);
 	update_mmu_cache(vma, addr, pte_val);
+	ioproc_update_page(vma, addr);
 
 	err = 0;
 err_unlock:
@@ -143,6 +146,7 @@
 	pte_val = *pte;
 	pte_unmap(pte);
 	update_mmu_cache(vma, addr, pte_val);
+	ioproc_update_page(vma, addr);
 	spin_unlock(&mm->page_table_lock);
 	return 0;
 
diff -ruN linux-2.6.12-rc3.orig/mm/hugetlb.c linux-2.6.12-rc3.ioproc/mm/hugetlb.c
--- linux-2.6.12-rc3.orig/mm/hugetlb.c	2005-03-02 07:38:12.000000000 +0000
+++ linux-2.6.12-rc3.ioproc/mm/hugetlb.c	2005-04-26 16:03:07.000000000 +0100
@@ -11,6 +11,7 @@
 #include <linux/sysctl.h>
 #include <linux/highmem.h>
 #include <linux/nodemask.h>
+#include <linux/ioproc.h>
 
 const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
 static unsigned long nr_huge_pages, free_huge_pages;
@@ -255,6 +256,7 @@
 	struct mm_struct *mm = vma->vm_mm;
 
 	spin_lock(&mm->page_table_lock);
+	ioproc_invalidate_range(vma, start, start + length);
 	unmap_hugepage_range(vma, start, start + length);
 	spin_unlock(&mm->page_table_lock);
 }
diff -ruN linux-2.6.12-rc3.orig/mm/ioproc.c linux-2.6.12-rc3.ioproc/mm/ioproc.c
--- linux-2.6.12-rc3.orig/mm/ioproc.c	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/mm/ioproc.c	2005-04-26 17:58:43.000000000 +0100
@@ -0,0 +1,56 @@
+/* -*- linux-c -*-
+ *
+ *    Copyright (C) 2002-2005 Quadrics Ltd.
+ *
+ *    This program is free software; you can redistribute it and/or modify
+ *    it under the terms of the GNU General Public License as published by
+ *    the Free Software Foundation; either version 2 of the License, or
+ *    (at your option) any later version.
+ *
+ *    This program is distributed in the hope that it will be useful,
+ *    but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *    GNU General Public License for more details.
+ *
+ *    You should have received a copy of the GNU General Public License
+ *    along with this program; if not, write to the Free Software
+ *    Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ *
+ */
+
+/*
+ * Registration for IO processor page table updates.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+
+#include <linux/mm.h>
+#include <linux/ioproc.h>
+
+int ioproc_register_ops(struct mm_struct *mm, struct ioproc_ops *ip)
+{
+	ip->next = mm->ioproc_ops;
+	mm->ioproc_ops = ip;
+
+	return 0;
+}
+
+EXPORT_SYMBOL_GPL(ioproc_register_ops);
+
+int ioproc_unregister_ops(struct mm_struct *mm, struct ioproc_ops *ip)
+{
+	struct ioproc_ops **tmp;
+
+	for (tmp = &mm->ioproc_ops; *tmp && *tmp != ip; tmp= &(*tmp)->next)
+		;
+	if (*tmp) {
+		*tmp = ip->next;
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
+EXPORT_SYMBOL_GPL(ioproc_unregister_ops);
diff -ruN linux-2.6.12-rc3.orig/mm/Kconfig linux-2.6.12-rc3.ioproc/mm/Kconfig
--- linux-2.6.12-rc3.orig/mm/Kconfig	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/mm/Kconfig	2005-04-26 16:03:07.000000000 +0100
@@ -0,0 +1,15 @@
+#
+# VM subsystem specific config
+#
+
+# Support for IO processors which have advanced RDMA capabilities
+#
+config IOPROC
+	bool "Enable IOPROC VM hooks"
+	depends on MMU
+	default y
+	help
+	This option enables hooks in the VM subsystem so that IO devices which
+	incorporate advanced RDMA capabilities can be kept in sync with CPU
+	page table changes.
+	See Documentation/vm/ioproc.txt for more details.
diff -ruN linux-2.6.12-rc3.orig/mm/Makefile linux-2.6.12-rc3.ioproc/mm/Makefile
--- linux-2.6.12-rc3.orig/mm/Makefile	2005-03-02 07:38:12.000000000 +0000
+++ linux-2.6.12-rc3.ioproc/mm/Makefile	2005-04-26 16:03:07.000000000 +0100
@@ -17,4 +17,5 @@
 obj-$(CONFIG_NUMA) 	+= mempolicy.o
 obj-$(CONFIG_SHMEM) += shmem.o
 obj-$(CONFIG_TINY_SHMEM) += tiny-shmem.o
+obj-$(CONFIG_IOPROC)	+= ioproc.o
 
diff -ruN linux-2.6.12-rc3.orig/mm/memory.c linux-2.6.12-rc3.ioproc/mm/memory.c
--- linux-2.6.12-rc3.orig/mm/memory.c	2005-04-26 09:02:39.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/mm/memory.c	2005-04-27 09:58:34.000000000 +0100
@@ -45,6 +45,7 @@
 #include <linux/swap.h>
 #include <linux/highmem.h>
 #include <linux/pagemap.h>
+#include <linux/ioproc.h>
 #include <linux/rmap.h>
 #include <linux/module.h>
 #include <linux/init.h>
@@ -343,9 +344,10 @@
 
 static inline void
 copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pte_t *dst_pte, pte_t *src_pte, unsigned long vm_flags,
+		pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
 		unsigned long addr)
 {
+	unsigned long vm_flags = vma->vm_flags;
 	pte_t pte = *src_pte;
 	struct page *page;
 	unsigned long pfn;
@@ -385,6 +387,7 @@
 	 * in the parent and the child
 	 */
 	if ((vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE) {
+		ioproc_wrprotect_page(vma, addr);
 		ptep_set_wrprotect(src_mm, addr, src_pte);
 		pte = *src_pte;
 	}
@@ -409,7 +412,6 @@
 		unsigned long addr, unsigned long end)
 {
 	pte_t *src_pte, *dst_pte;
-	unsigned long vm_flags = vma->vm_flags;
 	int progress;
 
 again:
@@ -433,7 +435,7 @@
 			progress++;
 			continue;
 		}
-		copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vm_flags, addr);
+		copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr);
 		progress += 8;
 	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
 	spin_unlock(&src_mm->page_table_lock);
@@ -765,6 +767,7 @@
 
 	lru_add_drain();
 	spin_lock(&mm->page_table_lock);
+ 	ioproc_invalidate_range(vma, address, end);
 	tlb = tlb_gather_mmu(mm, 0);
 	end = unmap_vmas(&tlb, mm, vma, address, end, &nr_accounted, details);
 	tlb_finish_mmu(tlb, address, end);
@@ -1076,6 +1079,7 @@
 {
 	pgd_t *pgd;
 	unsigned long next;
+	unsigned long beg = addr;
 	unsigned long end = addr + size;
 	struct mm_struct *mm = vma->vm_mm;
 	int err;
@@ -1084,12 +1088,14 @@
 	pgd = pgd_offset(mm, addr);
 	flush_cache_range(vma, addr, end);
 	spin_lock(&mm->page_table_lock);
+	ioproc_invalidate_range(vma, beg, end);
 	do {
 		next = pgd_addr_end(addr, end);
 		err = zeromap_pud_range(mm, pgd, addr, next, prot);
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
+	ioproc_update_range(vma, beg, end);
 	spin_unlock(&mm->page_table_lock);
 	return err;
 }
@@ -1164,6 +1170,7 @@
 {
 	pgd_t *pgd;
 	unsigned long next;
+	unsigned long beg = addr;
 	unsigned long end = addr + size;
 	struct mm_struct *mm = vma->vm_mm;
 	int err;
@@ -1183,6 +1190,7 @@
 	pgd = pgd_offset(mm, addr);
 	flush_cache_range(vma, addr, end);
 	spin_lock(&mm->page_table_lock);
+	ioproc_invalidate_range(vma, beg, end);
 	do {
 		next = pgd_addr_end(addr, end);
 		err = remap_pud_range(mm, pgd, addr, next,
@@ -1190,6 +1198,7 @@
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
+	ioproc_update_range(vma, beg, end);
 	spin_unlock(&mm->page_table_lock);
 	return err;
 }
@@ -1218,8 +1227,10 @@
 
 	entry = maybe_mkwrite(pte_mkdirty(mk_pte(new_page, vma->vm_page_prot)),
 			      vma);
+	ioproc_invalidate_page(vma, address);
 	ptep_establish(vma, address, page_table, entry);
 	update_mmu_cache(vma, address, entry);
+	ioproc_update_page(vma, address);
 	lazy_mmu_prot_update(entry);
 }
 
@@ -1273,6 +1284,7 @@
 					      vma);
 			ptep_set_access_flags(vma, address, page_table, entry, 1);
 			update_mmu_cache(vma, address, entry);
+			ioproc_update_page(vma, address);
 			lazy_mmu_prot_update(entry);
 			pte_unmap(page_table);
 			spin_unlock(&mm->page_table_lock);
@@ -1736,6 +1748,7 @@
 
 	/* No need to invalidate - it was non-present before */
 	update_mmu_cache(vma, address, pte);
+	ioproc_update_page(vma, address);
 	lazy_mmu_prot_update(pte);
 	pte_unmap(page_table);
 	spin_unlock(&mm->page_table_lock);
@@ -1794,6 +1807,7 @@
 
 	/* No need to invalidate - it was non-present before */
 	update_mmu_cache(vma, addr, entry);
+	ioproc_update_page(vma, addr);
 	lazy_mmu_prot_update(entry);
 	spin_unlock(&mm->page_table_lock);
 out:
@@ -1920,6 +1934,7 @@
 
 	/* no need to invalidate: a not-present page shouldn't be cached */
 	update_mmu_cache(vma, address, entry);
+	ioproc_update_page(vma, address);
 	lazy_mmu_prot_update(entry);
 	spin_unlock(&mm->page_table_lock);
 out:
diff -ruN linux-2.6.12-rc3.orig/mm/mmap.c linux-2.6.12-rc3.ioproc/mm/mmap.c
--- linux-2.6.12-rc3.orig/mm/mmap.c	2005-04-26 09:02:39.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/mm/mmap.c	2005-04-26 16:03:07.000000000 +0100
@@ -16,6 +16,7 @@
 #include <linux/init.h>
 #include <linux/file.h>
 #include <linux/fs.h>
+#include <linux/ioproc.h>
 #include <linux/personality.h>
 #include <linux/security.h>
 #include <linux/hugetlb.h>
@@ -1627,6 +1628,7 @@
 
 	lru_add_drain();
 	spin_lock(&mm->page_table_lock);
+	ioproc_invalidate_range(vma, start, end);
 	tlb = tlb_gather_mmu(mm, 0);
 	unmap_vmas(&tlb, mm, vma, start, end, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
@@ -1905,6 +1907,7 @@
 
 	spin_lock(&mm->page_table_lock);
 
+	ioproc_release(mm);
 	flush_cache_mm(mm);
 	tlb = tlb_gather_mmu(mm, 1);
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
diff -ruN linux-2.6.12-rc3.orig/mm/mprotect.c linux-2.6.12-rc3.ioproc/mm/mprotect.c
--- linux-2.6.12-rc3.orig/mm/mprotect.c	2005-04-26 09:02:40.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/mm/mprotect.c	2005-04-26 16:03:07.000000000 +0100
@@ -10,6 +10,7 @@
 
 #include <linux/mm.h>
 #include <linux/hugetlb.h>
+#include <linux/ioproc.h>
 #include <linux/slab.h>
 #include <linux/shm.h>
 #include <linux/mman.h>
@@ -89,6 +90,7 @@
 	pgd = pgd_offset(mm, addr);
 	flush_cache_range(vma, addr, end);
 	spin_lock(&mm->page_table_lock);
+	ioproc_change_protection(vma, start, end, newprot);
 	do {
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
diff -ruN linux-2.6.12-rc3.orig/mm/mremap.c linux-2.6.12-rc3.ioproc/mm/mremap.c
--- linux-2.6.12-rc3.orig/mm/mremap.c	2005-04-26 09:02:40.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/mm/mremap.c	2005-04-26 16:03:07.000000000 +0100
@@ -9,6 +9,7 @@
 
 #include <linux/mm.h>
 #include <linux/hugetlb.h>
+#include <linux/ioproc.h>
 #include <linux/slab.h>
 #include <linux/shm.h>
 #include <linux/mman.h>
@@ -161,6 +162,8 @@
 {
 	unsigned long offset;
 
+	ioproc_invalidate_range(vma, old_addr, old_addr + len);
+	ioproc_invalidate_range(vma, new_addr, new_addr + len);
 	flush_cache_range(vma, old_addr, old_addr + len);
 
 	/*
diff -ruN linux-2.6.12-rc3.orig/mm/msync.c linux-2.6.12-rc3.ioproc/mm/msync.c
--- linux-2.6.12-rc3.orig/mm/msync.c	2005-04-26 09:02:40.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/mm/msync.c	2005-04-26 16:03:07.000000000 +0100
@@ -13,6 +13,7 @@
 #include <linux/mman.h>
 #include <linux/hugetlb.h>
 #include <linux/syscalls.h>
+#include <linux/ioproc.h>
 
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
@@ -95,6 +96,7 @@
 	pgd = pgd_offset(mm, addr);
 	flush_cache_range(vma, addr, end);
 	spin_lock(&mm->page_table_lock);
+	ioproc_sync_range(vma, addr, end);
 	do {
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
diff -ruN linux-2.6.12-rc3.orig/mm/rmap.c linux-2.6.12-rc3.ioproc/mm/rmap.c
--- linux-2.6.12-rc3.orig/mm/rmap.c	2005-04-26 09:02:40.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/mm/rmap.c	2005-04-26 16:03:07.000000000 +0100
@@ -53,6 +53,7 @@
 #include <linux/init.h>
 #include <linux/rmap.h>
 #include <linux/rcupdate.h>
+#include <linux/ioproc.h>
 
 #include <asm/tlbflush.h>
 
@@ -573,6 +574,7 @@
 	}
 
 	/* Nuke the page table entry. */
+	ioproc_invalidate_page(vma, address);
 	flush_cache_page(vma, address, page_to_pfn(page));
 	pteval = ptep_clear_flush(vma, address, pte);
 
@@ -690,6 +692,7 @@
 			continue;
 
 		/* Nuke the page table entry. */
+		ioproc_invalidate_page(vma, address);
 		flush_cache_page(vma, address, pfn);
 		pteval = ptep_clear_flush(vma, address, pte);
 
diff -ruN linux-2.6.12-rc3.orig/arch/i386/defconfig linux-2.6.12-rc3.ioproc/arch/i386/defconfig
--- linux-2.6.12-rc3.orig/arch/i386/defconfig	2005-04-26 08:59:33.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/arch/i386/defconfig	2005-04-26 16:03:07.000000000 +0100
@@ -120,6 +120,7 @@
 CONFIG_IRQBALANCE=y
 CONFIG_HAVE_DEC_LOCK=y
 # CONFIG_REGPARM is not set
+CONFIG_IOPROC=y
 
 #
 # Power management options (ACPI, APM)
diff -ruN linux-2.6.12-rc3.orig/arch/i386/Kconfig linux-2.6.12-rc3.ioproc/arch/i386/Kconfig
--- linux-2.6.12-rc3.orig/arch/i386/Kconfig	2005-04-26 08:59:33.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/arch/i386/Kconfig	2005-04-26 16:03:08.000000000 +0100
@@ -923,6 +923,8 @@
 
 	  If unsure, say Y. Only embedded should say N here.
 
+source "mm/Kconfig"
+
 endmenu
 
 
diff -ruN linux-2.6.12-rc3.orig/arch/ia64/defconfig linux-2.6.12-rc3.ioproc/arch/ia64/defconfig
--- linux-2.6.12-rc3.orig/arch/ia64/defconfig	2005-03-02 07:37:48.000000000 +0000
+++ linux-2.6.12-rc3.ioproc/arch/ia64/defconfig	2005-04-26 16:03:08.000000000 +0100
@@ -92,6 +92,7 @@
 CONFIG_PERFMON=y
 CONFIG_IA64_PALINFO=y
 CONFIG_ACPI_DEALLOCATE_IRQ=y
+CONFIG_IOPROC=y
 
 #
 # Firmware Drivers
diff -ruN linux-2.6.12-rc3.orig/arch/ia64/Kconfig linux-2.6.12-rc3.ioproc/arch/ia64/Kconfig
--- linux-2.6.12-rc3.orig/arch/ia64/Kconfig	2005-04-26 08:59:38.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/arch/ia64/Kconfig	2005-04-26 16:03:08.000000000 +0100
@@ -319,6 +319,8 @@
 	depends on IOSAPIC && EXPERIMENTAL
 	default y
 
+source "mm/Kconfig"
+
 source "drivers/firmware/Kconfig"
 
 source "fs/Kconfig.binfmt"
diff -ruN linux-2.6.12-rc3.orig/arch/x86_64/defconfig linux-2.6.12-rc3.ioproc/arch/x86_64/defconfig
--- linux-2.6.12-rc3.orig/arch/x86_64/defconfig	2005-04-26 09:00:10.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/arch/x86_64/defconfig	2005-04-26 16:03:08.000000000 +0100
@@ -100,6 +100,7 @@
 CONFIG_SECCOMP=y
 CONFIG_GENERIC_HARDIRQS=y
 CONFIG_GENERIC_IRQ_PROBE=y
+CONFIG_IOPROC=y
 
 #
 # Power management options
diff -ruN linux-2.6.12-rc3.orig/arch/x86_64/Kconfig linux-2.6.12-rc3.ioproc/arch/x86_64/Kconfig
--- linux-2.6.12-rc3.orig/arch/x86_64/Kconfig	2005-04-26 09:00:10.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/arch/x86_64/Kconfig	2005-04-26 16:03:08.000000000 +0100
@@ -458,6 +458,8 @@
 	depends on IA32_EMULATION
 	default y
 
+source "mm/Kconfig"
+
 endmenu
 
 source drivers/Kconfig
diff -ruN linux-2.6.12-rc3.orig/Documentation/vm/ioproc.txt linux-2.6.12-rc3.ioproc/Documentation/vm/ioproc.txt
--- linux-2.6.12-rc3.orig/Documentation/vm/ioproc.txt	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.12-rc3.ioproc/Documentation/vm/ioproc.txt	2005-04-27 10:05:31.000000000 +0100
@@ -0,0 +1,512 @@
+Linux IOPROC patch overview
+===========================
+
+The network interface for an HPC network differs significantly from
+network interfaces for traditional IP networks. HPC networks tend to
+be used directly from user processes and perform large RDMA transfers
+between the process address spaces. They also have a requirement
+for low latency communication, and typically achieve this by OS bypass
+techniques.  This then requires a different model to traditional
+interconnects, in that a process may need to expose a large amount of
+it's address space to the network RDMA.
+
+Locking down of memory has been a common mechanism for performing
+this, together with a pin-down cache implemented in user
+libraries. The disadvantage of this method is that large portions of
+the physical memory can be locked down for a single process, even if
+it's working set changes over the different phases of it's
+execution. This leads to inefficient memory utilisation - akin to the
+disadvantage of swapping compared to paging.
+
+This model also has problems where memory is being dynamically
+allocated and freed, since the pin down cache is unaware that memory
+may have been released by a call to munmap() and so it will still be
+locking down the now unused pages.
+
+Some modern HPC network interfaces implement their own MMU and are
+able to handle a translation fault during a network access. The
+Quadrics (http://www.quadrics.com) devices (Elan3 and Elan4) have done
+this for some time, and the Infiniband standard also allows for the
+case where memory has been deregistered when an RDMA occurs.
+These NICs are able to operate in an environment where paging occurs
+and do not require memory to be locked down. The advantage of this is
+that the user process can expose large portions of its address space
+without having to worry about physical memory constraints.
+
+However should the operating system decide to swap a page to disk,
+then the NIC must be made aware that it should no longer read/write
+from this memory, but should generate a translation fault instead.
+
+The ioproc patch has been developed to provide a mechanism whereby the
+device driver for a NIC can be made aware of when a user process's
+address translations change, either by paging or by explicitly mapping
+or unmapping of memory.
+
+The patch involves inserting callbacks where translations are being
+invalidated to notify the NIC that the memory behind those
+translations is no longer visible to the application (and so should
+not be visible to the NIC). This callback is then responsible for
+ensuring that the NIC will not access the physical memory that was
+being mapped.
+
+An ioproc invalidate callback in the kswapd code could be utilised to
+prevent memory from being paged out if the NIC is unable to support
+RDMA page faulting. This has not yet been implemented in this patch.
+
+For NICs which support RDMA page faulting, there is no requirement
+for a user level pin down cache, since they are able to page-in their
+translations on the first communication using a buffer. However this
+is likely to be inefficient, resulting in slow first use of the
+buffer. If the communication buffers were continually allocated and
+freed using mmap() based malloc() calls then this would lead to all
+communications being slower than desirable.
+
+To optimise these warm-up cases the ioproc patch adds calls to
+ioproc_update wherever the kernel is creating translations for a user
+process. These then allow the device driver to preload translations
+so that they are already present for the first network communication
+from a buffer.
+
+Linux 2.6 IOPROC implementation details
+=======================================
+
+The Linux IOPROC patch adds hooks to the Linux VM code whenever page
+table entries are being created and/or invalidated. IOPROC device
+drivers can register their interest in being informed of such changes
+by registering an ioproc_ops structure which is defined as follows;
+
+extern int ioproc_register_ops(struct mm_struct *mm, struct ioproc_ops *ip);
+extern int ioproc_unregister_ops(struct mm_struct *mm, struct ioproc_ops *ip);
+
+typedef struct ioproc_ops {
+	struct ioproc_ops *next;
+	void *arg;
+
+	void (*release)(void *arg, struct mm_struct *mm);
+	void (*sync_range)(void *arg, struct vm_area_struct *vma, unsigned long start, unsigned long end);
+	void (*invalidate_range)(void *arg, struct vm_area_struct *vma, unsigned long start, unsigned long end);
+	void (*update_range)(void *arg, struct vm_area_struct *vma, unsigned long start, unsigned long end);
+
+	void (*change_protection)(void *arg, struct vm_area_struct *vma, unsigned long start, unsigned long end, pgprot_t newprot);
+
+	void (*sync_page)(void *arg, struct vm_area_struct *vma, unsigned long address);
+	void (*invalidate_page)(void *arg, struct vm_area_struct *vma, unsigned long address);
+	void (*update_page)(void *arg, struct vm_area_struct *vma, unsigned long address);
+
+} ioproc_ops_t;
+
+ioproc_register_ops
+===================
+This function should be called by the IOPROC device driver to register
+its interest in PTE changes for the process associated with the passed
+in mm_struct.
+
+The ioproc registration is not inherited across fork() and should be
+called once for each process that IOPROC is interested in.
+
+This function must be called whilst holding the mm->page_table_lock.
+
+ioproc_unregister_ops
+=====================
+This function should be called by the IOPROC device driver when it no
+longer requires informing of PTE changes in the process associated
+with the supplied mm_struct.
+
+This function is not normally needed to be called as the ioproc_ops
+struct is unlinked from the associated mm_struct during the
+ioproc_release() call.
+
+This function must be called whilst holding the mm->page_table_lock.
+
+ioproc_ops struct
+=================
+A linked list ioproc_ops structures is hung off the user process
+mm_struct (linux/sched.h). At each hook point in the patched kernel
+the ioproc patch will call the associated ioproc_ops callback function
+pointer in turn for each registered structure.
+
+The intention of the callbacks is to allow the IOPROC device driver to
+inspect the new or modified PTE entry via the Linux kernel
+(e.g. find_pte_map()). These callbacks should not modify the Linux
+kernel VM state or PTE entries.
+
+The ioproc_ops callback function pointers are defined as follows;
+
+ioproc_release
+==============
+The release hook is called when a program exits and all its vma areas
+are torn down and unmapped. i.e. during exit_mmap(). Before each
+release hook is called the ioproc_ops structure is unlinked from the
+mm_struct.
+
+No locks are required as the process has the only reference to the mm
+at this point.
+
+ioproc_sync_[range|page]
+========================
+The sync hooks are called when a memory map is synchronised with its
+disk image i.e. when the msync() syscall is invoked. Any future read
+or write by the IOPROC device to the associated pages should cause the
+page to be marked as referenced or modified.
+
+Called holding the mm->page_table_lock
+
+ioproc_invalidate_[range|page]
+==============================
+The invalidate hooks are called whenever a valid PTE is unloaded
+e.g. when a page is unmapped by the user or paged out by the
+kernel. After this call the IOPROC must not access the physical memory
+again unless a new translation is loaded.
+
+Called holding the mm->page_table_lock
+
+ioproc_update_[range|page]
+==========================
+The update hooks are called whenever a valid PTE is loaded
+e.g. mmaping memory, moving the brk up, when breaking COW or faulting
+in an anonymous page of memory. These give the IOPROC device the
+opportunity to load translations speculatively, which can improve
+performance by avoiding device translation faults.
+
+Called holding the mm->page_table_lock
+
+ioproc_change_protection
+========================
+This hook is called when the protection on a region of memory is
+changed i.e. when the mprotect() syscall is invoked.
+
+The IOPROC must not be able to write to a read-only page, so if the
+permissions are downgraded then it must honour them. If they are
+upgraded it can treat this in the same way as the
+ioproc_update_[range|page]() calls
+
+Called holding the mm->page_table_lock
+
+
+Linux 2.6 IOPROC patch details
+==============================
+
+Here are the specific details of each ioproc hook added to the Linux
+2.6 VM system and the reasons for doing so;
+
+===============================================================================
+++++ FILE
+	mm/fremap.c
+
+==== FUNCTION
+	zap_pte
+
+CALLED FROM
+	install_page
+	install_file_pte
+
+PTE MODIFICATION
+	ptep_clear_flush
+
+ADDED HOOKS
+	ioproc_invalidate_page
+
+==== FUNCTION
+	install_page
+
+CALLED FROM
+	filemap_populate, shmem_populate
+
+PTE MODIFICATION
+	set_pte_at
+
+ADDED HOOKS
+	ioproc_update_page
+
+==== FUNCTION
+	install_file_pte
+
+CALLED FROM
+	filemap_populate, shmem_populate
+
+PTE MODIFICATION
+	set_pte_at
+
+ADDED HOOKS
+	ioproc_update_page
+
+
+===============================================================================
+++++ FILE
+	mm/memory.c
+
+==== FUNCTION
+	copy_one_pte
+
+CALLED FROM
+	copy_pte_range
+
+PTE MODIFICATION
+	ptep_set_wrprotect
+
+ADDED HOOKS
+	ioproc_wrprotect_page
+
+==== FUNCTION
+	copy_page_range
+
+CALLED FROM
+       dup_mmap (fork.c)
+
+PTE MODIFICATION
+	set_pte_at (copy_one_pte)
+
+ADDED HOOKS
+	None necessary as its creating a new process
+
+==== FUNCTION
+	zap_page_range
+
+CALLED FROM
+	read_zero_pagealigned, madvise_dontneed, unmap_mapping_range,
+	unmap_mapping_range_list, do_mmap_pgoff
+
+PTE MODIFICATION
+	set_pte_at (unmap_vmas)
+
+ADDED HOOKS
+	ioproc_invalidate_range
+
+
+==== FUNCTION
+	zeromap_page_range
+
+CALLED FROM
+	read_zero_pagealigned, mmap_zero
+
+PTE MODIFICATION
+	set_pte_at (zeromap_pte_range via zeromap_[pud|pmd|pte]_range)
+
+ADDED HOOKS
+	ioproc_invalidate_range
+	ioproc_update_range
+
+
+==== FUNCTION
+	remap_pfn_range
+
+CALLED FROM
+	many device drivers
+
+PTE MODIFICATION
+	set_pte_at (remap_pte_range via remap_[pud|pmd|pte]_range)
+
+ADDED HOOKS
+	ioproc_invalidate_range
+	ioproc_update_range
+
+
+==== FUNCTION
+	break_cow
+
+CALLED FROM
+	do_wp_page
+
+PTE MODIFICATION
+	ptep_establish
+
+ADDED HOOKS
+	ioproc_invalidate_page
+	ioproc_update_page
+
+
+==== FUNCTION
+	do_wp_page
+
+CALLED FROM
+       do_swap_page, handle_pte_fault
+
+PTE MODIFICATION
+	ptep_set_access_flags, break_cow
+
+ADDED HOOKS
+	ioproc_update_page
+
+
+==== FUNCTION
+	do_swap_page
+
+CALLED FROM
+	handle_pte_fault
+
+PTE MODIFICATION
+	set_pte_at
+
+ADDED HOOKS
+	ioproc_update_page
+
+
+==== FUNCTION
+	do_anonymous_page
+
+CALLED FROM
+	do_no_page
+
+PTE MODIFICATION
+	set_pte_at
+
+ADDED HOOKS
+	ioproc_update_page
+
+
+==== FUNCTION
+	do_no_page
+
+CALLED FROM
+	do_file_page, handle_pte_fault
+
+PTE MODIFICATION
+	set_pte_at
+
+ADDED HOOKS
+	ioproc_update_page
+
+
+==== FUNCTION
+	handle_pte_fault
+
+CALLED FROM
+	handle_mm_fault
+
+PTE MODIFICATION
+	ptep_set_access_flags, do_no_page, do_file_page, do_swap_page
+
+ADDED HOOKS
+	Handled in called functions and not necessary for minor fault
+
+
+===============================================================================
+++++ FILE
+	mm/mmap.c
+
+==== FUNCTION
+	unmap_region
+
+CALLED FROM
+	do_munmap
+
+PTE MODIFICATION
+	set_pte_at (unmap_vmas)
+
+ADDED HOOKS
+	ioproc_invalidate_range
+
+
+==== FUNCTION
+	exit_mmap
+
+CALLED FROM
+	mmput
+
+PTE MODIFICATION
+	set_pte_at (unmap_vmas)
+
+ADDED HOOKS
+	ioproc_release
+
+
+===============================================================================
+++++ FILE
+	mm/mprotect.c
+
+==== FUNCTION
+	change_protection
+
+CALLED FROM
+	mprotect_fixup
+
+PTE MODIFICATION
+	set_pte_at (change_pte_range via change_[pud|pmd|pte]_range)
+
+ADDED HOOKS
+	ioproc_change_protection
+
+
+===============================================================================
+++++ FILE
+	mm/mremap.c
+
+==== FUNCTION
+	move_page_tables
+
+CALLED FROM
+	move_vma
+
+PTE MODIFICATION
+	ptep_clear_flush (move_one_page)
+
+ADDED HOOKS
+	ioproc_invalidate_range
+	ioproc_invalidate_range
+
+
+===============================================================================
+++++ FILE
+	mm/rmap.c
+
+==== FUNCTION
+	try_to_unmap_one
+
+CALLED FROM
+	try_to_unmap_anon, try_to_unmap_file
+
+PTE MODIFICATION
+	ptep_clear_flush
+
+ADDED HOOKS
+	ioproc_invalidate_page
+
+
+==== FUNCTION
+	try_to_unmap_cluster
+
+CALLED FROM
+	try_to_unmap_file
+
+PTE MODIFICATION
+	ptep_clear_flush
+
+ADDED HOOKS
+	ioproc_invalidate_page
+
+
+===============================================================================
+++++ FILE
+	mm/msync.c
+
+==== FUNCTION
+	filemap_sync
+
+CALLED FROM
+	msync_interval
+
+PTE MODIFICATION
+	ptep_clear_flush_dirty (filemap_sync_pte)
+
+ADDED HOOKS
+	ioproc_sync_range
+
+
+===============================================================================
+++++ FILE
+	mm/hugetlb.c
+
+==== FUNCTION
+	zap_hugepage_range
+
+CALLED FROM
+	hugetlb_vmtruncate_list
+
+PTE MODIFICATION
+	ptep_get_and_clear (unmap_hugepage_range)
+
+ADDED HOOK
+	ioproc_invalidate_range
+
+
+-- Last update DavidAddison - 26 Apr 2005

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
  2005-04-26 17:06 ` Brice Goglin
  2005-04-27  9:41   ` David Addison
@ 2005-04-27 13:43   ` Andi Kleen
  1 sibling, 0 replies; 20+ messages in thread
From: Andi Kleen @ 2005-04-27 13:43 UTC (permalink / raw)
  To: Brice Goglin; +Cc: linux-kernel, Andrew Morton, Andrea Arcangeli, David Addison

Brice Goglin <Brice.Goglin@ens-lyon.org> writes:
> I see two drawback in (2).
> First, it requires to play with the list of ioproc_ops when VMA are
> merged or split. Actually, it's not that bad since the list often
> contains only 1 ioproc_ops.

I had a similar problem with the NUMA policies. With some minor hacks
you could probably reuse the policy support by making it a weird kind
of policy.  That would allow to keep the fast path impact very low,
which I think is the most important part of such hardware specific
narrow purpose, useless to 99.9999% of all users hacks 
(Golden rule number 1 such code: dont impact anything else)

-Andi

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
  2005-04-26 15:49 [PATCH][RFC] Linux VM hooks for advanced RDMA NICs David Addison
  2005-04-26 16:57 ` Jesper Juhl
  2005-04-26 17:06 ` Brice Goglin
@ 2005-04-28  1:42 ` Troy Benjegerdes
  2005-04-28  7:21 ` Brice Goglin
  3 siblings, 0 replies; 20+ messages in thread
From: Troy Benjegerdes @ 2005-04-28  1:42 UTC (permalink / raw)
  To: David Addison
  Cc: linux-kernel, Andrew Morton, Andrea Arcangeli, David Addison

On Tue, Apr 26, 2005 at 04:49:01PM +0100, David Addison wrote:
> Hi,
> 
> here is a patch we use to integrate the Quadrics NICs into the Linux kernel.
> The patch adds hooks to the Linux VM subsystem so that registered 'IOPROC'
> devices can be informed of page table changes.
> This allows the Quadrics NICs to perform user RDMAs safely, without 
> requiring
> page pinning. Looking through some of the recent IB and Ammasso discussions,
> it may also prove useful to those NICs too.
> 

I think the best thing to do is post this patch to openib-general
( http://openib.org/mailman/listinfo/openib-general )
and get a patch developed that works on amasso, IB, and Quadrics
hardware, and then come back to lkml.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
  2005-04-26 15:49 [PATCH][RFC] Linux VM hooks for advanced RDMA NICs David Addison
                   ` (2 preceding siblings ...)
  2005-04-28  1:42 ` Troy Benjegerdes
@ 2005-04-28  7:21 ` Brice Goglin
  2005-04-28  9:21   ` David Addison
  2005-04-29  8:19   ` Benjamin Herrenschmidt
  3 siblings, 2 replies; 20+ messages in thread
From: Brice Goglin @ 2005-04-28  7:21 UTC (permalink / raw)
  To: David Addison; +Cc: Andrew Morton, Andrea Arcangeli, Linux Kernel

> @@ -267,6 +270,11 @@
>  
>  	unsigned long hiwater_rss;	/* High-water RSS usage */
>  	unsigned long hiwater_vm;	/* High-water virtual memory usage */
> +
> +#ifdef CONFIG_IOPROC
> +	/* hooks for io devices with advanced RDMA capabilities */
> +	struct ioproc_ops       *ioproc_ops;
> +#endif
>  };

> +int
> +ioproc_register_ops(struct mm_struct *mm, struct ioproc_ops *ip)
> +{
> +	ip->next = mm->ioproc_ops;
> +	mm->ioproc_ops = ip;
> +
> +	return 0;
> +}
> +
> +EXPORT_SYMBOL_GPL(ioproc_register_ops);
> +
> +int
> +ioproc_unregister_ops(struct mm_struct *mm, struct ioproc_ops *ip)
> +{
> +	struct ioproc_ops **tmp;
> +
> +	for (tmp = &mm->ioproc_ops; *tmp && *tmp != ip; tmp= &(*tmp)->next)
> +		;
> +	if (*tmp) {
> +		*tmp = ip->next;
> +		return 0;
> +	}
> +
> +	return -EINVAL;
> +}
> +
> +EXPORT_SYMBOL_GPL(ioproc_unregister_ops);

You don't seem to use any synchronization mechanism to protect the
ioproc list from concurrent modifications, right ?
I understand that it might be useless as long as QsNet is the only user
of ioprocs and takes care of locking the address space somewhere in the
driver before adding/removing hooks.
But, if this patch is to be merged to the mainline, you probably need
to do something here. It's not clear how other in-kernel users
(IB, Myri, Ammasso, ...) might use ioprocs.
And actually, I think all ioproc list traversal need to be protected
as well.

A spinlock_t ioproc_lock is probably appropriate here.
I don't know whether any of the existing locks in the task_struct
might be used instead.

Regards,
Brice

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
  2005-04-27  9:41   ` David Addison
@ 2005-04-28  8:38     ` Andy Isaacson
  0 siblings, 0 replies; 20+ messages in thread
From: Andy Isaacson @ 2005-04-28  8:38 UTC (permalink / raw)
  To: David Addison
  Cc: Brice Goglin, linux-kernel, Andrew Morton, Andrea Arcangeli,
	David Addison

On Wed, Apr 27, 2005 at 10:41:07AM +0100, David Addison wrote:
> Brice Goglin wrote:
> >I worked on a similar patch to help updating a registration cache on
> >Myrinet. I came to the problem of deciding between registering ioproc
> >to the entire address space (1) or only to some VMA (2).
> >You're doing (1), I tried (2).
>
> We have always taken approach (1) as it seems to be the simplest
> method and offers the model where the whole user process space can be
> made available for RDMA operations.

I agree that this is a nice patch for exploring the design space (and
frankly, for maintaining outside the kernel tree).  I'd like to see
something like this merged.  As it stands, the patch is a decent
standalone implementation of (1).

I would personally strongly prefer that whatever is merged be low-impact
and so obviously good that it would not need to be a CONFIG_ option.
(Or rather, it should be a CONFIG_ option, but one which is forced to
yes if !CONFIG_EMBEDDED.)

And of course, it needs to be general-purpose enough to satisfy all the
significant constituencies:
1. Myrinet/Quadrics (proprietary interconnects for HPC/etc)
2. Infiniband (slightly more general-purpose interconnect standard for
               etc/HPC)
3. RDMA TCP
and I would add
4. people who want to add a commodity card to a general-purpose server
   and be able to take advantage of direct-to-userspace transfers
   without breaking the general-purposeness of their server.

I think that given a reliable framework for DMA-to-userspace, other
users will pop up.  OpenGL (DRI) is one obvious example; I think there
are others.

With those (fairly lofty) goals in mind, I think the verdict is not good
for ioproc-2.6.12-rc3.patch.

It's got some style-ish issues that would have to be worked out before
it could be merged.  (#ifdef in code, for one.)

It's adding a linked-list walk to a bunch of places in mm/, which is (or
at least, seems to me) pretty unacceptable (even if it's just one
cacheline miss) in the fast paths.

Did you understand Andi's suggestion about NUMA policies?  (I'm not
smart enough to follow it.)  Can we share code between this and the NUMA
stuff?

> static over the life of the job and hence most of the costs are taken
> as the pages are created during job startup and initialisation.

Yeah, I'm pretty skeptical about claims that "It's too much work to keep
track of all that" regarding per-proc versus per-vma, and also regarding
explicit-lock-from-commlib versus dynamic-pinning.  For the people who
care (HPC), pin/unpin events are very rare (zero during normal runtime),
so the overhead is unimportant.  It's more important to provide reliable
operation with minimal impact to standard mm semantics.

> However, I still prefer model (1) as it allows both implementations and
> appears to be much simpler in terms of the linux kernel changes required.

I agree that (1) looks easier to implement when you're doing it outside
the kernel (and tracking).  However, if you're aiming for integration
we should figure out what the right answer is.  It feels like that's
per-vma, but I freely admit I don't have any code to back that up.

> Thanks for your comments,

Thank you for stepping up to be our archery target. :)

> diff -ruN linux-2.6.12-rc3.orig/include/linux/ioproc.h linux-2.6.12-rc3.ioproc/include/linux/ioproc.h

Could you add -p to your diff invocation, please...

This patch is *exactly* what I'd want if I were looking for an obvious,
easy-to-maintain externally-maintained patch to add this capability.
(Would that I could say that for all the HPC kernel patches I've been
subjected to.)

But I think we can do better.

At least I would like to see Andi (or another NUMA mm god) and you (or
another RDMA expert) hash over the possiblity of sharing code.

-andy

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
  2005-04-28  7:21 ` Brice Goglin
@ 2005-04-28  9:21   ` David Addison
  2005-04-29  8:19   ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 20+ messages in thread
From: David Addison @ 2005-04-28  9:21 UTC (permalink / raw)
  To: Brice Goglin; +Cc: Andrew Morton, Andrea Arcangeli, Linux Kernel

Brice Goglin wrote:
>> @@ -267,6 +270,11 @@
>>  
>>      unsigned long hiwater_rss;    /* High-water RSS usage */
>>      unsigned long hiwater_vm;    /* High-water virtual memory usage */
>> +
>> +#ifdef CONFIG_IOPROC
>> +    /* hooks for io devices with advanced RDMA capabilities */
>> +    struct ioproc_ops       *ioproc_ops;
>> +#endif
>>  };
> 
>> +int
>> +ioproc_register_ops(struct mm_struct *mm, struct ioproc_ops *ip)
>> +{
>> +    ip->next = mm->ioproc_ops;
>> +    mm->ioproc_ops = ip;
>> +
>> +    return 0;
>> +}
>> +
>> +EXPORT_SYMBOL_GPL(ioproc_register_ops);
>> +
>> +int
>> +ioproc_unregister_ops(struct mm_struct *mm, struct ioproc_ops *ip)
>> +{
>> +    struct ioproc_ops **tmp;
>> +
>> +    for (tmp = &mm->ioproc_ops; *tmp && *tmp != ip; tmp= &(*tmp)->next)
>> +        ;
>> +    if (*tmp) {
>> +        *tmp = ip->next;
>> +        return 0;
>> +    }
>> +
>> +    return -EINVAL;
>> +}
>> +
>> +EXPORT_SYMBOL_GPL(ioproc_unregister_ops);
> 
> You don't seem to use any synchronization mechanism to protect the
> ioproc list from concurrent modifications, right ?
> I understand that it might be useless as long as QsNet is the only user
> of ioprocs and takes care of locking the address space somewhere in the
> driver before adding/removing hooks.
> But, if this patch is to be merged to the mainline, you probably need
> to do something here. It's not clear how other in-kernel users
> (IB, Myri, Ammasso, ...) might use ioprocs.
> And actually, I think all ioproc list traversal need to be protected
> as well.
> 
All ioproc list traversal is protected by the mm->page_table_lock which is
held at all points where the callbacks are invoked.
[Actually there is one case where this isn't true, which I'll fix
when we refresh this patch later today]

The registration/unregister functions also need to be called holding this
spinlock, our device driver does this, but perhaps we need to document
that requirement more clearly.

Cheers
David.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
  2005-04-26 17:13   ` Lee Revell
  2005-04-26 17:20     ` Jesper Juhl
@ 2005-04-28 11:34     ` Jakob Oestergaard
  2005-04-29  8:22     ` Benjamin Herrenschmidt
  2 siblings, 0 replies; 20+ messages in thread
From: Jakob Oestergaard @ 2005-04-28 11:34 UTC (permalink / raw)
  To: Lee Revell
  Cc: Jesper Juhl, David Addison, linux-kernel, Andrew Morton,
	Andrea Arcangeli, David Addison

On Tue, Apr 26, 2005 at 01:13:04PM -0400, Lee Revell wrote:
> On Tue, 2005-04-26 at 18:57 +0200, Jesper Juhl wrote:
> > > 
> > > +static inline void
> > > +ioproc_release(struct mm_struct *mm)
> > > +{
> > 
> > Return types on same line as function name makes grep'ing a lot 
> > easier/nicer.
> > 
> > Here's the example from Documentation/CodingStyle : 
> > 
> >         int function(int x)
> >         {
> 
> How so?  I never understood the reasons.  This makes it easier to grep
> for everything that returns int.  But you make the common case (what
> file is function() defined in?) harder.

etags/ctags end of story :)

-- 

 / jakob


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
  2005-04-28  7:21 ` Brice Goglin
  2005-04-28  9:21   ` David Addison
@ 2005-04-29  8:19   ` Benjamin Herrenschmidt
  2005-04-29  9:25     ` David Addison
  1 sibling, 1 reply; 20+ messages in thread
From: Benjamin Herrenschmidt @ 2005-04-29  8:19 UTC (permalink / raw)
  To: Brice Goglin
  Cc: David Addison, Andrew Morton, Andrea Arcangeli, Linux Kernel list


> > +ioproc_register_ops(struct mm_struct *mm, struct ioproc_ops *ip)
> > +{
> > +	ip->next = mm->ioproc_ops;
> > +	mm->ioproc_ops = ip;
> > +
> > +	return 0;
> > +}
> > +

Why not use a list_head along with linux standard list primitives ?

Ben.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
  2005-04-26 17:13   ` Lee Revell
  2005-04-26 17:20     ` Jesper Juhl
  2005-04-28 11:34     ` Jakob Oestergaard
@ 2005-04-29  8:22     ` Benjamin Herrenschmidt
  2 siblings, 0 replies; 20+ messages in thread
From: Benjamin Herrenschmidt @ 2005-04-29  8:22 UTC (permalink / raw)
  To: Lee Revell
  Cc: Jesper Juhl, David Addison, Linux Kernel list, Andrew Morton,
	Andrea Arcangeli, David Addison

On Tue, 2005-04-26 at 13:13 -0400, Lee Revell wrote:
> On Tue, 2005-04-26 at 18:57 +0200, Jesper Juhl wrote:
> > > 
> > > +static inline void
> > > +ioproc_release(struct mm_struct *mm)
> > > +{
> > 
> > Return types on same line as function name makes grep'ing a lot 
> > easier/nicer.
> > 
> > Here's the example from Documentation/CodingStyle : 
> > 
> >         int function(int x)
> >         {
> 
> How so?  I never understood the reasons.  This makes it easier to grep
> for everything that returns int.  But you make the common case (what
> file is function() defined in?) harder.

Not exactly. I used the 2-lines style for a while, and changed overtime
and now can't stand anything but the one line style  :)

I recommend you read the mailing list archives for linus comments on
this issue btw.

Ben.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH][RFC] Linux VM hooks for advanced RDMA NICs
  2005-04-29  8:19   ` Benjamin Herrenschmidt
@ 2005-04-29  9:25     ` David Addison
  0 siblings, 0 replies; 20+ messages in thread
From: David Addison @ 2005-04-29  9:25 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Brice Goglin, Andrew Morton, Andrea Arcangeli, Linux Kernel list



Benjamin Herrenschmidt wrote:
>>>+ioproc_register_ops(struct mm_struct *mm, struct ioproc_ops *ip)
>>>+{
>>>+	ip->next = mm->ioproc_ops;
>>>+	mm->ioproc_ops = ip;
>>>+
>>>+	return 0;
>>>+}
>>>+
> 
> Why not use a list_head along with linux standard list primitives ?
> 
> Ben.
> 
> 
The reason we didn't use the standard list primitives was that we wanted the normal
case where no ioproc ops were registered to have minimal impact and this just comes
down to mm->ioproc_ops being checked against being zero, which is slightly lighter weight
than using the list primitives.

Also entries are rarely removed from the list using the ioproc_deregister function as
in the normal case they get removed in the call to ioproc_release.  Hence there is little
need for the doubly linked list.

Cheers,
David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2005-04-29  9:26 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-04-26 15:49 [PATCH][RFC] Linux VM hooks for advanced RDMA NICs David Addison
2005-04-26 16:57 ` Jesper Juhl
2005-04-26 17:13   ` Lee Revell
2005-04-26 17:20     ` Jesper Juhl
2005-04-26 17:28       ` Lee Revell
2005-04-26 17:38         ` Jesper Juhl
2005-04-26 20:14         ` John W. Linville
2005-04-26 20:17           ` Lee Revell
2005-04-26 20:09       ` Lars Marowsky-Bree
2005-04-28 11:34     ` Jakob Oestergaard
2005-04-29  8:22     ` Benjamin Herrenschmidt
2005-04-26 17:06 ` Brice Goglin
2005-04-27  9:41   ` David Addison
2005-04-28  8:38     ` Andy Isaacson
2005-04-27 13:43   ` Andi Kleen
2005-04-28  1:42 ` Troy Benjegerdes
2005-04-28  7:21 ` Brice Goglin
2005-04-28  9:21   ` David Addison
2005-04-29  8:19   ` Benjamin Herrenschmidt
2005-04-29  9:25     ` David Addison

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).