All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH/RFC] ummunot: Userspace support for MMU notifications
@ 2009-07-22 17:47 Roland Dreier
  2009-07-22 18:15 ` Andrew Morton
  0 siblings, 1 reply; 17+ messages in thread
From: Roland Dreier @ 2009-07-22 17:47 UTC (permalink / raw)
  To: linux-kernel, Andrew Morton; +Cc: Jeff Squyres

As discussed in <http://article.gmane.org/gmane.linux.drivers.openib/61925>
and follow-up messages, libraries using RDMA would like to track
precisely when application code changes memory mapping via free(),
munmap(), etc.  Current pure-userspace solutions using malloc hooks
and other tricks are not robust, and the feeling among experts is that
the issue is unfixable without kernel help.

We solve this not by implementing the full API proposed in the email
linked above but rather with a simpler and more generic interface,
which may be useful in other contexts.  Specifically, we implement a
new character device driver, ummunot, that creates a /dev/ummunot
node.  A userspace process can open this node read-only and use the fd
as follows:

 1. ioctl() to register/unregister an address range to watch in the
    kernel (cf struct ummunot_register_ioctl in <linux/ummunot.h>).

 2. read() to retrieve events generated when a mapping in a watched
    address range is invalidated (cf struct ummunot_event in
    <linux/ummunot.h>).  select()/poll()/epoll() and SIGIO are handled
    for this IO.

 3. mmap() one page at offset 0 to map a kernel page that contains a
    generation counter that is incremented each time an event is
    generated.  This allows userspace to have a fast path that checks
    that no events have occurred without a system call.

Thanks to Jason Gunthorpe <jgunthorpe@obsidianresearch.com> for
suggestions on the interface design.  Also thanks to Jeff Squyres
<jsquyres@cisco.com> for prototyping support for this in Open MPI, which
helped find several bugs during development.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
---
Andrew -- sending this to you since I don't really know who should merge
this sort of thing.

Solving the underlying issue here is a pretty big deal to the universe
of people that work on MPI implementations (sort of middleware for
high-performance computing).  I think everyone would be open to better
ideas on how to handle this, although I'm pretty happy with how small
and self-contained the ummunot implementation ended up being.

Anyway, review, suggestions, etc. all gratefully received.

Thanks,
  Roland


 drivers/char/Kconfig    |   12 ++
 drivers/char/Makefile   |    1 +
 drivers/char/ummunot.c  |  489 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/ummunot.h |  130 +++++++++++++
 4 files changed, 632 insertions(+), 0 deletions(-)
 create mode 100644 drivers/char/ummunot.c
 create mode 100644 include/linux/ummunot.h

diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
index 6a06913..8d3dd9d 100644
--- a/drivers/char/Kconfig
+++ b/drivers/char/Kconfig
@@ -1108,6 +1108,18 @@ config DEVPORT
 	depends on ISA || PCI
 	default y
 
+config UMMUNOT
+       tristate "Userspace MMU notifications"
+       select MMU_NOTIFIER
+       help
+         The ummunot (userspace MMU notification) driver creates a
+         character device that can be used by userspace libraries to
+         get notifications when an application's memory mapping
+         changed.  This is used, for example, by RDMA libraries to
+         improve the reliability of memory registration caching, since
+         the kernel's MMU notifications can be used to know precisely
+         when to shoot down a cached registration.
+
 source "drivers/s390/char/Kconfig"
 
 endmenu
diff --git a/drivers/char/Makefile b/drivers/char/Makefile
index 66f779a..39ff0c3 100644
--- a/drivers/char/Makefile
+++ b/drivers/char/Makefile
@@ -97,6 +97,7 @@ obj-$(CONFIG_NSC_GPIO)		+= nsc_gpio.o
 obj-$(CONFIG_CS5535_GPIO)	+= cs5535_gpio.o
 obj-$(CONFIG_GPIO_TB0219)	+= tb0219.o
 obj-$(CONFIG_TELCLOCK)		+= tlclk.o
+obj-$(CONFIG_UMMUNOT)		+= ummunot.o
 
 obj-$(CONFIG_MWAVE)		+= mwave/
 obj-$(CONFIG_AGP)		+= agp/
diff --git a/drivers/char/ummunot.c b/drivers/char/ummunot.c
new file mode 100644
index 0000000..4a13449
--- /dev/null
+++ b/drivers/char/ummunot.c
@@ -0,0 +1,489 @@
+/*
+ * Copyright (c) 2009 Cisco Systems.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ */
+
+#include <linux/fs.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/miscdevice.h>
+#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
+#include <linux/module.h>
+#include <linux/poll.h>
+#include <linux/rbtree.h>
+#include <linux/sched.h>
+#include <linux/spinlock.h>
+#include <linux/uaccess.h>
+#include <linux/ummunot.h>
+
+#include <asm/cacheflush.h>
+
+MODULE_AUTHOR("Roland Dreier");
+MODULE_DESCRIPTION("Userspace MMU notifiers");
+MODULE_LICENSE("GPL v2");
+
+enum {
+	UMMUNOT_FLAG_DIRTY	= 1,
+	UMMUNOT_FLAG_HINT	= 2,
+};
+
+struct ummunot_reg {
+	u64			user_cookie;
+	unsigned long		start;
+	unsigned long		end;
+	unsigned long		hint_start;
+	unsigned long		hint_end;
+	unsigned long		flags;
+	struct rb_node		node;
+	struct list_head	list;
+};
+
+struct ummunot_file {
+	struct mmu_notifier	mmu_notifier;
+	struct mm_struct       *mm;
+	struct rb_root		reg_tree;
+	struct list_head	dirty_list;
+	u64		       *counter;
+	spinlock_t		lock;
+	wait_queue_head_t	read_wait;
+	struct fasync_struct   *async_queue;
+	int			need_empty;
+	int			used;
+};
+
+static struct ummunot_file *to_ummunot_file(struct mmu_notifier *mn)
+{
+	return container_of(mn, struct ummunot_file, mmu_notifier);
+}
+
+static void ummunot_handle_not(struct mmu_notifier *mn,
+			       unsigned long start, unsigned long end)
+{
+	struct ummunot_file *priv = to_ummunot_file(mn);
+	struct rb_node *n;
+	struct ummunot_reg *reg;
+	unsigned long flags;
+	int hit = 0;
+
+	spin_lock_irqsave(&priv->lock, flags);
+
+	for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) {
+		reg = rb_entry(n, struct ummunot_reg, node);
+
+		if (reg->start >= end)
+			break;
+
+		/*
+		 * Ranges overlap if they're not disjoint; and they're
+		 * disjoint if the end of one is before the start of
+		 * the other one.
+		 */
+		if (!(reg->end <= start || end <= reg->start)) {
+			hit = 1;
+
+			if (!test_and_set_bit(UMMUNOT_FLAG_DIRTY, &reg->flags))
+				list_add_tail(&reg->list, &priv->dirty_list);
+
+			if (test_bit(UMMUNOT_FLAG_HINT, &reg->flags)) {
+				clear_bit(UMMUNOT_FLAG_HINT, &reg->flags);
+			} else {
+				set_bit(UMMUNOT_FLAG_HINT, &reg->flags);
+				reg->hint_start = start;
+				reg->hint_end   = end;
+			}
+		}
+	}
+
+	if (hit) {
+		++(*priv->counter);
+		flush_dcache_page(virt_to_page(priv->counter));
+		wake_up_interruptible(&priv->read_wait);
+		kill_fasync(&priv->async_queue, SIGIO, POLL_IN);
+	}
+
+	spin_unlock_irqrestore(&priv->lock, flags);
+}
+
+static void ummunot_inval_page(struct mmu_notifier *mn,
+			       struct mm_struct *mm,
+			       unsigned long addr)
+{
+	ummunot_handle_not(mn, addr, addr + PAGE_SIZE);
+}
+
+static void ummunot_inval_range_start(struct mmu_notifier *mn,
+				      struct mm_struct *mm,
+				      unsigned long start, unsigned long end)
+{
+	ummunot_handle_not(mn, start, end);
+}
+
+static const struct mmu_notifier_ops ummunot_mmu_notifier_ops = {
+	.invalidate_page	= ummunot_inval_page,
+	.invalidate_range_start	= ummunot_inval_range_start,
+};
+
+static int ummunot_open(struct inode *inode, struct file *filp)
+{
+	struct ummunot_file *priv;
+	int ret;
+
+	if (filp->f_mode & FMODE_WRITE)
+		return -EINVAL;
+
+	priv = kmalloc(sizeof *priv, GFP_KERNEL);
+	if (!priv)
+		return -ENOMEM;
+
+	priv->counter = (void *) get_zeroed_page(GFP_KERNEL);
+	if (!priv->counter) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	priv->reg_tree = RB_ROOT;
+	INIT_LIST_HEAD(&priv->dirty_list);
+	spin_lock_init(&priv->lock);
+	init_waitqueue_head(&priv->read_wait);
+	priv->async_queue = NULL;
+	priv->need_empty  = 0;
+	priv->used	  = 0;
+
+	priv->mmu_notifier.ops = &ummunot_mmu_notifier_ops;
+	/*
+	 * Register notifier last, since notifications can occur as
+	 * soon as we register....
+	 */
+	ret = mmu_notifier_register(&priv->mmu_notifier, current->mm);
+	if (ret)
+		goto err_page;
+
+	priv->mm = current->mm;
+	atomic_inc(&priv->mm->mm_count);
+
+	filp->private_data = priv;
+
+	return 0;
+
+err_page:
+	free_page((unsigned long) priv->counter);
+
+err:
+	kfree(priv);
+	return ret;
+}
+
+static int ummunot_close(struct inode *inode, struct file *filp)
+{
+	struct ummunot_file *priv = filp->private_data;
+	struct rb_node *n;
+	struct ummunot_reg *reg;
+
+	mmu_notifier_unregister(&priv->mmu_notifier, priv->mm);
+	mmdrop(priv->mm);
+	free_page((unsigned long) priv->counter);
+
+	for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) {
+		reg = rb_entry(n, struct ummunot_reg, node);
+		kfree(reg);
+	}
+
+	kfree(priv);
+
+	return 0;
+}
+
+static bool ummunot_readable(struct ummunot_file *priv)
+{
+	return priv->need_empty || !list_empty(&priv->dirty_list);
+}
+
+static ssize_t ummunot_read(struct file *filp, char __user *buf,
+			    size_t count, loff_t *pos)
+{
+	struct ummunot_file *priv = filp->private_data;
+	struct ummunot_reg *reg;
+	ssize_t ret;
+	struct ummunot_event *events;
+	int max;
+	int n;
+
+	priv->used = 1;
+
+	events = (void *) get_zeroed_page(GFP_KERNEL);
+	if (!events) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	spin_lock_irq(&priv->lock);
+
+	while (!ummunot_readable(priv)) {
+		spin_unlock_irq(&priv->lock);
+
+		if (filp->f_flags & O_NONBLOCK) {
+			ret = -EAGAIN;
+			goto out;
+		}
+
+		if (wait_event_interruptible(priv->read_wait,
+					     ummunot_readable(priv))) {
+			ret = -ERESTARTSYS;
+			goto out;
+		}
+
+		spin_lock_irq(&priv->lock);
+	}
+
+	max = min(PAGE_SIZE, count) / sizeof *events;
+
+	for (n = 0; n < max; ++n) {
+		if (list_empty(&priv->dirty_list)) {
+			events[n].type		      = UMMUNOT_EVENT_TYPE_LAST;
+			events[n].user_cookie_counter = *priv->counter;
+			++n;
+			priv->need_empty = 0;
+			break;
+		}
+
+		reg = list_first_entry(&priv->dirty_list, struct ummunot_reg,
+				       list);
+
+		events[n].type = UMMUNOT_EVENT_TYPE_INVAL;
+		if (test_bit(UMMUNOT_FLAG_HINT, &reg->flags)) {
+			events[n].flags		= UMMUNOT_EVENT_FLAG_HINT;
+			events[n].hint_start	= max(reg->start, reg->hint_start);
+			events[n].hint_end	= min(reg->end, reg->hint_end);
+		} else {
+			events[n].hint_start	= reg->start;
+			events[n].hint_end	= reg->end;
+		}
+		events[n].user_cookie_counter = reg->user_cookie;
+
+		list_del(&reg->list);
+		reg->flags = 0;
+		priv->need_empty = 1;
+	}
+
+	spin_unlock_irq(&priv->lock);
+
+	if (copy_to_user(buf, events, n * sizeof *events))
+		ret = -EFAULT;
+	else
+		ret = n * sizeof *events;
+
+out:
+	free_page((unsigned long) events);
+	return ret;
+}
+
+static unsigned int ummunot_poll(struct file *filp, struct poll_table_struct *wait)
+{
+	struct ummunot_file *priv = filp->private_data;
+
+	poll_wait(filp, &priv->read_wait, wait);
+
+	return ummunot_readable(priv) ? (POLLIN | POLLRDNORM) : 0;
+}
+
+static long ummunot_exchange_features(struct ummunot_file *priv,
+				      __u32 __user *arg)
+{
+	u32 feature_mask;
+
+	if (priv->used)
+		return -EINVAL;
+
+	priv->used = 1;
+
+	if (get_user(feature_mask, arg))
+		return -EFAULT;
+
+	/* No extensions defined at present. */
+	feature_mask = 0;
+
+	if (put_user(feature_mask, arg))
+		return -EFAULT;
+
+	return 0;
+}
+
+static long ummunot_register_region(struct ummunot_file *priv,
+				    struct ummunot_register_ioctl __user *arg)
+{
+	struct ummunot_register_ioctl parm;
+	struct ummunot_reg *reg, *treg;
+	struct rb_node **n = &priv->reg_tree.rb_node;
+	struct rb_node *pn;
+	int ret = 0;
+
+	if (copy_from_user(&parm, arg, sizeof parm))
+		return -EFAULT;
+
+	priv->used = 1;
+
+	reg = kmalloc(sizeof *reg, GFP_KERNEL);
+	if (!reg)
+		return -ENOMEM;
+
+	reg->user_cookie	= parm.user_cookie;
+	reg->start		= parm.start;
+	reg->end		= parm.end;
+	reg->flags		= 0;
+
+	spin_lock_irq(&priv->lock);
+
+	for (pn = rb_first(&priv->reg_tree); pn; pn = rb_next(pn)) {
+		treg = rb_entry(pn, struct ummunot_reg, node);
+
+		if (treg->user_cookie == parm.user_cookie) {
+			kfree(reg);
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
+	pn = NULL;
+	while (*n) {
+		pn = *n;
+		treg = rb_entry(pn, struct ummunot_reg, node);
+
+		if (reg->start <= treg->start)
+			n = &pn->rb_left;
+		else
+			n = &pn->rb_right;
+	}
+
+	rb_link_node(&reg->node, pn, n);
+	rb_insert_color(&reg->node, &priv->reg_tree);
+
+out:
+	spin_unlock_irq(&priv->lock);
+
+	return ret;
+}
+
+static long ummunot_unregister_region(struct ummunot_file *priv,
+				      __u64 __user *arg)
+{
+	u64 user_cookie;
+	struct rb_node *n;
+	struct ummunot_reg *reg;
+	int ret = -EINVAL;
+
+	if (get_user(user_cookie, arg))
+		return -EFAULT;
+
+	spin_lock_irq(&priv->lock);
+
+	for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) {
+		reg = rb_entry(n, struct ummunot_reg, node);
+
+		if (reg->user_cookie == user_cookie) {
+			rb_erase(n, &priv->reg_tree);
+			if (test_bit(UMMUNOT_FLAG_DIRTY, &reg->flags))
+				list_del(&reg->list);
+			kfree(reg);
+			ret = 0;
+			break;
+		}
+	}
+
+	spin_unlock_irq(&priv->lock);
+
+	return ret;
+}
+
+static long ummunot_ioctl(struct file *filp, unsigned int cmd,
+			  unsigned long arg)
+{
+	struct ummunot_file *priv = filp->private_data;
+	void __user *argp = (void __user *) arg;
+
+	switch (cmd) {
+	case UMMUNOT_EXCHANGE_FEATURES:
+		return ummunot_exchange_features(priv, argp);
+	case UMMUNOT_REGISTER_REGION:
+		return ummunot_register_region(priv, argp);
+	case UMMUNOT_UNREGISTER_REGION:
+		return ummunot_unregister_region(priv, argp);
+	default:
+		return -ENOIOCTLCMD;
+	}
+}
+
+static int ummunot_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct ummunot_file *priv = vma->vm_private_data;
+
+	if (vmf->pgoff != 0)
+		return VM_FAULT_SIGBUS;
+
+	vmf->page = virt_to_page(priv->counter);
+	get_page(vmf->page);
+
+	return 0;
+
+}
+
+static struct vm_operations_struct ummunot_vm_ops = {
+	.fault		= ummunot_fault,
+};
+
+static int ummunot_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+	struct ummunot_file *priv = filp->private_data;
+
+	if (vma->vm_end - vma->vm_start != PAGE_SIZE ||
+	    vma->vm_pgoff != 0)
+		return -EINVAL;
+
+	vma->vm_ops		= &ummunot_vm_ops;
+	vma->vm_private_data	= priv;
+
+	return 0;
+}
+
+static int ummunot_fasync(int fd, struct file *filp, int on)
+{
+	struct ummunot_file *priv = filp->private_data;
+
+	return fasync_helper(fd, filp, on, &priv->async_queue);
+}
+
+static const struct file_operations ummunot_fops = {
+	.owner		= THIS_MODULE,
+	.open		= ummunot_open,
+	.release	= ummunot_close,
+	.read		= ummunot_read,
+	.poll		= ummunot_poll,
+	.unlocked_ioctl	= ummunot_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= ummunot_ioctl,
+#endif
+	.mmap		= ummunot_mmap,
+	.fasync		= ummunot_fasync,
+};
+
+static struct miscdevice ummunot_misc = {
+	.minor	= MISC_DYNAMIC_MINOR,
+	.name	= "ummunot",
+	.fops	= &ummunot_fops,
+};
+
+static int __init ummunot_init(void)
+{
+	return misc_register(&ummunot_misc);
+}
+
+static void __exit ummunot_cleanup(void)
+{
+	misc_deregister(&ummunot_misc);
+}
+
+module_init(ummunot_init);
+module_exit(ummunot_cleanup);
diff --git a/include/linux/ummunot.h b/include/linux/ummunot.h
new file mode 100644
index 0000000..4640e36
--- /dev/null
+++ b/include/linux/ummunot.h
@@ -0,0 +1,130 @@
+/*
+ * Copyright (c) 2009 Cisco Systems.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenFabrics BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef _LINUX_UMMUNOT_H
+#define _LINUX_UMMUNOT_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+/*
+ * Ummunot relays MMU notifier events to userspace.  A userspace
+ * process uses it by opening /dev/ummunot, which returns a file
+ * descriptor.  Interest in address ranges is registered using ioctl()
+ * and MMU notifier events are retrieved using read(), as described in
+ * more detail below.
+ */
+
+/*
+ * struct ummunot_register_ioctl describes an address range from start
+ * to end (including start but not including end) to be monitored.
+ * user_cookie is an opaque handle that userspace assigns, and which
+ * is used to unregister.  flags and reserved are currently unused and
+ * should be set to 0 for forward compatibility.
+ */
+struct ummunot_register_ioctl {
+	__u64	start;		/* in */
+	__u64	end;		/* in */
+	__u64	user_cookie;	/* in */
+	__u32	flags;		/* in */
+	__u32	reserved;	/* in */
+};
+
+#define UMMUNOT_MAGIC			'U'
+
+/*
+ * Forward compatibility: Userspace passes in a 32-bit feature mask
+ * with feature flags set indicating which extensions it wishes to
+ * use.  The kernel will return a feature mask with the bits of
+ * userspace's mask that the kernel implements; from that point on
+ * both userspace and the kernel should behave as described by the
+ * kernel's feature mask.
+ *
+ * If userspace does not perform a UMMUNOT_EXCHANGE_FEATURES ioctl,
+ * then the kernel will use a feature mask of 0.
+ *
+ * No feature flags are currently defined, so the kernel will always
+ * return a feature mask of 0 at present.
+ */
+#define UMMUNOT_EXCHANGE_FEATURES	_IOWR(UMMUNOT_MAGIC, 1, __u32)
+
+/*
+ * Register interest in an address range; userspace should pass in a
+ * struct ummunot_register_ioctl describing the region.
+ */
+#define UMMUNOT_REGISTER_REGION		_IOW(UMMUNOT_MAGIC, 2, \
+					      struct ummunot_register_ioctl)
+/*
+ * Unregister interest in an address range; userspace should pass in
+ * the user_cookie value that was used to register the address range.
+ * No events for the address range will be reported once it is
+ * unregistered.
+ */
+#define UMMUNOT_UNREGISTER_REGION	_IOW(UMMUNOT_MAGIC, 3, __u64)
+
+/*
+ * Invalidation events are returned whenever the kernel changes the
+ * mapping for a monitored address.  These events are retrieved by
+ * read() on the ummunot file descriptor, which will fill the read()
+ * buffer with struct ummunot_event.
+ *
+ * If type field is INVAL, then user_cookie_counter holds the
+ * user_cookie for the region being reported; if the HINT flag is set
+ * then hint_start/hint_end hold the start and end of the mapping that
+ * was invalidated.  (If HINT is not set, then multiple events
+ * invalidated parts of the registered range and hint_start/hint_end
+ * and set to the start/end of the whole registered range)
+ *
+ * If type is LAST, then the read operation has emptied the list of
+ * invalidated regions, and user_cookie_counter holds the value of the
+ * kernel's generation counter when the empty list occurred.  The
+ * other fields are not filled in for this event.
+ */
+
+enum {
+	UMMUNOT_EVENT_TYPE_INVAL	= 0,
+	UMMUNOT_EVENT_TYPE_LAST		= 1,
+};
+
+enum {
+	UMMUNOT_EVENT_FLAG_HINT		= 1 << 0,
+};
+
+struct ummunot_event {
+	__u32	type;
+	__u32	flags;
+	__u64	hint_start;
+	__u64	hint_end;
+	__u64	user_cookie_counter;
+};
+
+#endif /* _LINUX_UMMUNOT_H */
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH/RFC] ummunot: Userspace support for MMU notifications
  2009-07-22 17:47 [PATCH/RFC] ummunot: Userspace support for MMU notifications Roland Dreier
@ 2009-07-22 18:15 ` Andrew Morton
  2009-07-22 19:27   ` Roland Dreier
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2009-07-22 18:15 UTC (permalink / raw)
  To: Roland Dreier; +Cc: linux-kernel, Jeff Squyres

On Wed, 22 Jul 2009 10:47:23 -0700 Roland Dreier <rdreier@cisco.com> wrote:

> As discussed in <http://article.gmane.org/gmane.linux.drivers.openib/61925>
> and follow-up messages, libraries using RDMA would like to track
> precisely when application code changes memory mapping via free(),
> munmap(), etc.

It isn't "precisely" - the tracker has a delayed view of the trackee,
of course.  Only really solvable by blocking the trackee each time it
does something.

The time-skew makes the facility considerably weaker.  Obviously this
is OK for your application but it's a fairly important design point
which it would be interesting to hear about.

>  Current pure-userspace solutions using malloc hooks
> and other tricks are not robust, and the feeling among experts is that
> the issue is unfixable without kernel help.
> 
> We solve this not by implementing the full API proposed in the email
> linked above but rather with a simpler and more generic interface,
> which may be useful in other contexts.  Specifically, we implement a
> new character device driver, ummunot, that creates a /dev/ummunot
> node.  A userspace process can open this node read-only and use the fd
> as follows:
> 
>  1. ioctl() to register/unregister an address range to watch in the
>     kernel (cf struct ummunot_register_ioctl in <linux/ummunot.h>).
> 
>  2. read() to retrieve events generated when a mapping in a watched
>     address range is invalidated (cf struct ummunot_event in
>     <linux/ummunot.h>).  select()/poll()/epoll() and SIGIO are handled
>     for this IO.
> 
>  3. mmap() one page at offset 0 to map a kernel page that contains a
>     generation counter that is incremented each time an event is
>     generated.  This allows userspace to have a fast path that checks
>     that no events have occurred without a system call.

If you stand back and squint, each of 1, 2 and 3 are things which the
kernel already provides for the delivery of ftrace events to userspace.

Did you look at reusing all that stuff?

> Thanks to Jason Gunthorpe <jgunthorpe@obsidianresearch.com> for
> suggestions on the interface design.  Also thanks to Jeff Squyres
> <jsquyres@cisco.com> for prototyping support for this in Open MPI, which
> helped find several bugs during development.
> 
> Signed-off-by: Roland Dreier <rolandd@cisco.com>
> ---
> Andrew -- sending this to you since I don't really know who should merge
> this sort of thing.

That works.

> Solving the underlying issue here is a pretty big deal to the universe
> of people that work on MPI implementations (sort of middleware for
> high-performance computing).  I think everyone would be open to better
> ideas on how to handle this, although I'm pretty happy with how small
> and self-contained the ummunot implementation ended up being.
> 
>
> ...
>
> +enum {
> +	UMMUNOT_FLAG_DIRTY	= 1,
> +	UMMUNOT_FLAG_HINT	= 2,
> +};
> +
> +struct ummunot_reg {
> +	u64			user_cookie;
> +	unsigned long		start;
> +	unsigned long		end;
> +	unsigned long		hint_start;
> +	unsigned long		hint_end;
> +	unsigned long		flags;
> +	struct rb_node		node;
> +	struct list_head	list;
> +};

Bah.

I looked at the code, asked "wtf is that rbtree being used for", went
to the defining data structure and ...  nothing.

And without knowing what the data structures do, and the relationship
between them it's hard to follow the code.  But you need to follow the
code to be able to reverse-engineer the data structures.

Ho hum.

What's the rbtree for?

What's a "hint"?

> +struct ummunot_file {
> +	struct mmu_notifier	mmu_notifier;
> +	struct mm_struct       *mm;
> +	struct rb_root		reg_tree;
> +	struct list_head	dirty_list;
> +	u64		       *counter;
> +	spinlock_t		lock;
> +	wait_queue_head_t	read_wait;
> +	struct fasync_struct   *async_queue;
> +	int			need_empty;
> +	int			used;
> +};

What's on dirty_list?

etc.

> +static struct ummunot_file *to_ummunot_file(struct mmu_notifier *mn)
> +{
> +	return container_of(mn, struct ummunot_file, mmu_notifier);
> +}
> +
> +static void ummunot_handle_not(struct mmu_notifier *mn,
> +			       unsigned long start, unsigned long end)

I really really want to do s/not/notify/g on this patch ;)

> +{
> +	struct ummunot_file *priv = to_ummunot_file(mn);
> +	struct rb_node *n;
> +	struct ummunot_reg *reg;
> +	unsigned long flags;
> +	int hit = 0;
> +
> +	spin_lock_irqsave(&priv->lock, flags);
> +
> +	for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) {
> +		reg = rb_entry(n, struct ummunot_reg, node);
> +
> +		if (reg->start >= end)
> +			break;
> +
> +		/*
> +		 * Ranges overlap if they're not disjoint; and they're
> +		 * disjoint if the end of one is before the start of
> +		 * the other one.
> +		 */
> +		if (!(reg->end <= start || end <= reg->start)) {
> +			hit = 1;
> +
> +			if (!test_and_set_bit(UMMUNOT_FLAG_DIRTY, &reg->flags))
> +				list_add_tail(&reg->list, &priv->dirty_list);
> +
> +			if (test_bit(UMMUNOT_FLAG_HINT, &reg->flags)) {
> +				clear_bit(UMMUNOT_FLAG_HINT, &reg->flags);
> +			} else {
> +				set_bit(UMMUNOT_FLAG_HINT, &reg->flags);
> +				reg->hint_start = start;
> +				reg->hint_end   = end;
> +			}
> +		}
> +	}
> +
> +	if (hit) {
> +		++(*priv->counter);
> +		flush_dcache_page(virt_to_page(priv->counter));
> +		wake_up_interruptible(&priv->read_wait);
> +		kill_fasync(&priv->async_queue, SIGIO, POLL_IN);
> +	}
> +
> +	spin_unlock_irqrestore(&priv->lock, flags);
> +}
> +
> +static void ummunot_inval_page(struct mmu_notifier *mn,

"invalidate"

> +			       struct mm_struct *mm,
> +			       unsigned long addr)
> +{
> +	ummunot_handle_not(mn, addr, addr + PAGE_SIZE);
> +}
> +
> +static void ummunot_inval_range_start(struct mmu_notifier *mn,
> +				      struct mm_struct *mm,
> +				      unsigned long start, unsigned long end)
> +{
> +	ummunot_handle_not(mn, start, end);
> +}
> +
> +static const struct mmu_notifier_ops ummunot_mmu_notifier_ops = {
> +	.invalidate_page	= ummunot_inval_page,
> +	.invalidate_range_start	= ummunot_inval_range_start,
> +};
> +
> +static int ummunot_open(struct inode *inode, struct file *filp)
> +{
> +	struct ummunot_file *priv;
> +	int ret;
> +
> +	if (filp->f_mode & FMODE_WRITE)
> +		return -EINVAL;

hm, how do you set the permissions on a miscdevice node?

> +	priv = kmalloc(sizeof *priv, GFP_KERNEL);
> +	if (!priv)
> +		return -ENOMEM;
> +
> +	priv->counter = (void *) get_zeroed_page(GFP_KERNEL);
> +	if (!priv->counter) {
> +		ret = -ENOMEM;
> +		goto err;
> +	}
> +
> +	priv->reg_tree = RB_ROOT;
> +	INIT_LIST_HEAD(&priv->dirty_list);
> +	spin_lock_init(&priv->lock);
> +	init_waitqueue_head(&priv->read_wait);
> +	priv->async_queue = NULL;
> +	priv->need_empty  = 0;
> +	priv->used	  = 0;
> +
> +	priv->mmu_notifier.ops = &ummunot_mmu_notifier_ops;
> +	/*
> +	 * Register notifier last, since notifications can occur as
> +	 * soon as we register....
> +	 */
> +	ret = mmu_notifier_register(&priv->mmu_notifier, current->mm);
> +	if (ret)
> +		goto err_page;
> +
> +	priv->mm = current->mm;
> +	atomic_inc(&priv->mm->mm_count);
> +
> +	filp->private_data = priv;
> +
> +	return 0;
> +
> +err_page:
> +	free_page((unsigned long) priv->counter);
> +
> +err:
> +	kfree(priv);
> +	return ret;
> +}
> +
>
> ...
>
> +static ssize_t ummunot_read(struct file *filp, char __user *buf,
> +			    size_t count, loff_t *pos)
> +{
> +	struct ummunot_file *priv = filp->private_data;
> +	struct ummunot_reg *reg;
> +	ssize_t ret;
> +	struct ummunot_event *events;
> +	int max;
> +	int n;
> +
> +	priv->used = 1;
> +
> +	events = (void *) get_zeroed_page(GFP_KERNEL);
> +	if (!events) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	spin_lock_irq(&priv->lock);
> +
> +	while (!ummunot_readable(priv)) {
> +		spin_unlock_irq(&priv->lock);
> +
> +		if (filp->f_flags & O_NONBLOCK) {
> +			ret = -EAGAIN;
> +			goto out;
> +		}
> +
> +		if (wait_event_interruptible(priv->read_wait,
> +					     ummunot_readable(priv))) {
> +			ret = -ERESTARTSYS;
> +			goto out;
> +		}
> +
> +		spin_lock_irq(&priv->lock);
> +	}
> +
> +	max = min(PAGE_SIZE, count) / sizeof *events;

It used to be the case that architectures disagreed over the type of
PAGE_SIZE - unsigned versus unsigned long, iirc.

If still true, this will warn.


> +	for (n = 0; n < max; ++n) {
> +		if (list_empty(&priv->dirty_list)) {
> +			events[n].type		      = UMMUNOT_EVENT_TYPE_LAST;
> +			events[n].user_cookie_counter = *priv->counter;
> +			++n;
> +			priv->need_empty = 0;
> +			break;
> +		}
> +
> +		reg = list_first_entry(&priv->dirty_list, struct ummunot_reg,
> +				       list);
> +
> +		events[n].type = UMMUNOT_EVENT_TYPE_INVAL;
> +		if (test_bit(UMMUNOT_FLAG_HINT, &reg->flags)) {
> +			events[n].flags		= UMMUNOT_EVENT_FLAG_HINT;
> +			events[n].hint_start	= max(reg->start, reg->hint_start);
> +			events[n].hint_end	= min(reg->end, reg->hint_end);
> +		} else {
> +			events[n].hint_start	= reg->start;
> +			events[n].hint_end	= reg->end;
> +		}
> +		events[n].user_cookie_counter = reg->user_cookie;
> +
> +		list_del(&reg->list);
> +		reg->flags = 0;
> +		priv->need_empty = 1;
> +	}
> +
> +	spin_unlock_irq(&priv->lock);
> +
> +	if (copy_to_user(buf, events, n * sizeof *events))
> +		ret = -EFAULT;
> +	else
> +		ret = n * sizeof *events;
> +
> +out:
> +	free_page((unsigned long) events);
> +	return ret;
> +}
> +
>
> ...
>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH/RFC] ummunot: Userspace support for MMU notifications
  2009-07-22 18:15 ` Andrew Morton
@ 2009-07-22 19:27   ` Roland Dreier
  2009-07-22 19:42     ` Andrew Morton
  2009-07-23  9:04     ` [PATCH/RFC] ummunot: " Li Zefan
  0 siblings, 2 replies; 17+ messages in thread
From: Roland Dreier @ 2009-07-22 19:27 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Jeff Squyres


 > > As discussed in <http://article.gmane.org/gmane.linux.drivers.openib/61925>
 > > and follow-up messages, libraries using RDMA would like to track
 > > precisely when application code changes memory mapping via free(),
 > > munmap(), etc.
 > 
 > It isn't "precisely" - the tracker has a delayed view of the trackee,
 > of course.  Only really solvable by blocking the trackee each time it
 > does something.
 > 
 > The time-skew makes the facility considerably weaker.  Obviously this
 > is OK for your application but it's a fairly important design point
 > which it would be interesting to hear about.

Fair enough.  Of course as you say "precise" is impossible if we have
multiple threads running on multiple CPUs without synchronization
between them.  As you alluded to, for the specific use that inspired
this, it would be a bug in the userspace code for any changes to occur
at the moment that the library cares about things.

 > >  1. ioctl() to register/unregister an address range to watch in the
 > >     kernel (cf struct ummunot_register_ioctl in <linux/ummunot.h>).
 > > 
 > >  2. read() to retrieve events generated when a mapping in a watched
 > >     address range is invalidated (cf struct ummunot_event in
 > >     <linux/ummunot.h>).  select()/poll()/epoll() and SIGIO are handled
 > >     for this IO.
 > > 
 > >  3. mmap() one page at offset 0 to map a kernel page that contains a
 > >     generation counter that is incremented each time an event is
 > >     generated.  This allows userspace to have a fast path that checks
 > >     that no events have occurred without a system call.
 > 
 > If you stand back and squint, each of 1, 2 and 3 are things which the
 > kernel already provides for the delivery of ftrace events to userspace.
 > 
 > Did you look at reusing all that stuff?

No, not really... will investigate a bit further.  Any pointers to how
the ftrace stuff might work?  Specifically how #3 maps to ftrace is a
little obscure to me; and also as I understand it, ftrace is controlled
through debugfs, which means there's a bit of hassle to make this usable
on a default install.  And also I'm not sure how the ftrace control path
really maps to "here's a 100 address ranges I'd like events for".

So at a first glance after unsquinting a bit I'm not sure how good the
fit really is.

 > I looked at the code, asked "wtf is that rbtree being used for", went
 > to the defining data structure and ...  nothing.
 > 
 > And without knowing what the data structures do, and the relationship
 > between them it's hard to follow the code.  But you need to follow the
 > code to be able to reverse-engineer the data structures.

OK, will document data structures a bit more in the next version.  The
specific questions here:

 > What's the rbtree for?

To keep track of what ranges userspace is interested in so that it's
reasonably efficient to go through them in sort order and reasonably
efficient to add/remove ranges too.

 > What's a "hint"?

If userspace asks us to monitor a range X...Y and we get an event that
invalidates some subrange W...Q then we tell userspace that only the
subrange is invalid as a hint.  If multiple events hit a single range,
we give up and don't try to tell userspace what parts got invalidated.
So it's a "hint" in the sense of "here's some information that might be
helpful that we may give you, but don't count on getting it every time"

 > What's on dirty_list?

list of address ranges being watched that have been invalidated.

 > I really really want to do s/not/notify/g on this patch ;)

The name is of course very provisional... I'd be happy with ummunotify
or even perhaps something with _s in it to be more readable.  I'll go
with ummunotify next time.

 > > +static void ummunot_inval_page(struct mmu_notifier *mn,

 > "invalidate"

sure.

 > hm, how do you set the permissions on a miscdevice node?

On my system I have a udev rule

KERNEL=="ummunot", MODE="0666"

and that seems to work.  (Will need to change it to KERNEL=="ummunotify"
I guess ;)

 > > +	max = min(PAGE_SIZE, count) / sizeof *events;

 > It used to be the case that architectures disagreed over the type of
 > PAGE_SIZE - unsigned versus unsigned long, iirc.

 > If still true, this will warn.

Will check it out... anyway it's easy to switch to min_t() to be safe.

Thanks for the review,
  Roland


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH/RFC] ummunot: Userspace support for MMU notifications
  2009-07-22 19:27   ` Roland Dreier
@ 2009-07-22 19:42     ` Andrew Morton
  2009-07-23  2:26       ` Steven Rostedt
  2009-07-24 22:56       ` [PATCH v2] ummunotify: " Roland Dreier
  2009-07-23  9:04     ` [PATCH/RFC] ummunot: " Li Zefan
  1 sibling, 2 replies; 17+ messages in thread
From: Andrew Morton @ 2009-07-22 19:42 UTC (permalink / raw)
  To: Roland Dreier; +Cc: linux-kernel, jsquyres, Steven Rostedt

On Wed, 22 Jul 2009 12:27:42 -0700
Roland Dreier <rdreier@cisco.com> wrote:

>  > >  1. ioctl() to register/unregister an address range to watch in the
>  > >     kernel (cf struct ummunot_register_ioctl in <linux/ummunot.h>).
>  > > 
>  > >  2. read() to retrieve events generated when a mapping in a watched
>  > >     address range is invalidated (cf struct ummunot_event in
>  > >     <linux/ummunot.h>).  select()/poll()/epoll() and SIGIO are handled
>  > >     for this IO.
>  > > 
>  > >  3. mmap() one page at offset 0 to map a kernel page that contains a
>  > >     generation counter that is incremented each time an event is
>  > >     generated.  This allows userspace to have a fast path that checks
>  > >     that no events have occurred without a system call.
>  > 
>  > If you stand back and squint, each of 1, 2 and 3 are things which the
>  > kernel already provides for the delivery of ftrace events to userspace.
>  > 
>  > Did you look at reusing all that stuff?
> 
> No, not really... will investigate a bit further.  Any pointers to how
> the ftrace stuff might work?

I know who to cc ;)

>  Specifically how #3 maps to ftrace is a
> little obscure to me; and also as I understand it, ftrace is controlled
> through debugfs, which means there's a bit of hassle to make this usable
> on a default install.  And also I'm not sure how the ftrace control path
> really maps to "here's a 100 address ranges I'd like events for".
> 
> So at a first glance after unsquinting a bit I'm not sure how good the
> fit really is.

Oh.  Here was I hoping that all that code was about to become useful.
<runs away>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH/RFC] ummunot: Userspace support for MMU notifications
  2009-07-22 19:42     ` Andrew Morton
@ 2009-07-23  2:26       ` Steven Rostedt
  2009-07-23 20:21         ` Roland Dreier
  2009-07-24 22:56       ` [PATCH v2] ummunotify: " Roland Dreier
  1 sibling, 1 reply; 17+ messages in thread
From: Steven Rostedt @ 2009-07-23  2:26 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Roland Dreier, linux-kernel, jsquyres


On Wed, 22 Jul 2009, Andrew Morton wrote:

> On Wed, 22 Jul 2009 12:27:42 -0700
> Roland Dreier <rdreier@cisco.com> wrote:
> 
> >  > >  1. ioctl() to register/unregister an address range to watch in the
> >  > >     kernel (cf struct ummunot_register_ioctl in <linux/ummunot.h>).
> >  > > 
> >  > >  2. read() to retrieve events generated when a mapping in a watched
> >  > >     address range is invalidated (cf struct ummunot_event in
> >  > >     <linux/ummunot.h>).  select()/poll()/epoll() and SIGIO are handled
> >  > >     for this IO.
> >  > > 
> >  > >  3. mmap() one page at offset 0 to map a kernel page that contains a
> >  > >     generation counter that is incremented each time an event is
> >  > >     generated.  This allows userspace to have a fast path that checks
> >  > >     that no events have occurred without a system call.

Looks like a vsyscall to me.

> >  > 
> >  > If you stand back and squint, each of 1, 2 and 3 are things which the
> >  > kernel already provides for the delivery of ftrace events to userspace.
> >  > 
> >  > Did you look at reusing all that stuff?
> > 
> > No, not really... will investigate a bit further.  Any pointers to how
> > the ftrace stuff might work?
> 
> I know who to cc ;)

You would wouldn't you ;-)

> 
> >  Specifically how #3 maps to ftrace is a
> > little obscure to me; and also as I understand it, ftrace is controlled
> > through debugfs, which means there's a bit of hassle to make this usable
> > on a default install.  And also I'm not sure how the ftrace control path

On Fedora 11 (early ftrace kernel)

# mount -t debugfs nodev /sys/kernel/debug
# ls /sys/kernel/debug/tracing
available_filter_functions     set_ftrace_filter
available_tracers              set_ftrace_notrace
buffer_size_kb                 set_ftrace_pid
current_tracer                 stack_max_size
dyn_ftrace_total_info          stack_trace
failures                       sysprof_sample_period
latency_trace                  trace
process_follow_pid             trace_marker
process_trace_lifecycle        trace_options
process_trace_README           trace_pipe
process_trace_signals          tracing_cpumask
process_trace_syscalls         tracing_enabled
process_trace_taskcomm_filter  tracing_max_latency
process_trace_uid_filter       tracing_on
README                         tracing_thresh

Not too hard

> > really maps to "here's a 100 address ranges I'd like events for".

Now to boot into a more recent kernel:

# cd /sys/kernel/debug/tracing
# echo "ptr > 0xffffffff81100000 && ptr < 0xffffffff8113000" > events/kmem/kmalloc/filter 
# echo 1 > events/kmem/kmalloc/enable
# cat events/kmem/kmalloc/filter
 ptr > 0xffffffff81100000 && ptr < 0xffffffff81130000
# cat trace
# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
            bash-6652  [002] 80345.390536: kmalloc: call_site=ffffffff810d76e1 ptr=ffff88001e8e7c80 bytes_req=53 bytes_alloc=64 gfp_flags=GFP_KERNEL
            bash-6652  [002] 80345.390542: kmalloc: call_site=ffffffff810ba480 ptr=ffff88003c5a2700 bytes_req=32 bytes_alloc=32 gfp_flags=GFP_KERNEL
            bash-6652  [002] 80345.390543: kmalloc: call_site=ffffffff810d76e1 ptr=ffff88003c5a2540 bytes_req=4 bytes_alloc=32 gfp_flags=GFP_KERNEL
            bash-6652  [002] 80345.390544: kmalloc: call_site=ffffffff810ba210 ptr=ffff8800318c4d00 bytes_req=24 bytes_alloc=32 gfp_flags=GFP_KERNEL
            bash-6652  [002] 80345.390550: kmalloc: call_site=ffffffff810ba480 ptr=ffff8800318c43c0 bytes_req=32 bytes_alloc=32 gfp_flags=GFP_KERNEL
            bash-6652  [002] 80345.390551: kmalloc: call_site=ffffffff810d76e1 ptr=ffff8800318c4ca0 bytes_req=19 bytes_alloc=32 gfp_flags=GFP_KERNEL
            bash-6652  [002] 80345.390552: kmalloc: call_site=ffffffff810ba320 ptr=ffff8800318c4d00 bytes_req=32 bytes_alloc=32 gfp_flags=GFP_KERNEL
            bash-6652  [002] 80345.390552: kmalloc: call_site=ffffffff810ba210 ptr=ffff8800318c4e20 bytes_req=24 bytes_alloc=32 gfp_flags=GFP_KERNEL
            bash-6652  [002] 80345.390554: kmalloc: call_site=ffffffff810ba480 ptr=ffff8800318c4400 bytes_req=32 bytes_alloc=32 gfp_flags=GFP_KERNEL
            bash-6652  [002] 80345.390555: kmalloc: call_site=ffffffff810d76e1 ptr=ffff8800318c4380 bytes_req=4 bytes_alloc=32 gfp_flags=GFP_KERNEL
            bash-6652  [002] 80345.390555: kmalloc: call_site=ffffffff810ba210 ptr=ffff8800318c42e0 bytes_req=24 bytes_alloc=32 gfp_flags=GFP_KERNEL
            bash-6652  [002] 80345.390561: kmalloc: call_site=ffffffff810ba480 ptr=ffff8800318c4d20 bytes_req=32 bytes_alloc=32 gfp_flags=GFP_KERNEL
            bash-6652  [002] 80345.390562: kmalloc: call_site=ffffffff810d76e1 ptr=ffff8800318c4240 bytes_req=19 bytes_alloc=32 gfp_flags=GFP_KERNEL
            bash-6652  [002] 80345.390563: kmalloc: call_site=ffffffff810ba320 ptr=ffff8800318c42e0 bytes_req=32 bytes_alloc=32 gfp_flags=GFP_KERNEL
            bash-6652  [002] 80345.390563: kmalloc: call_site=ffffffff810ba320 ptr=ffff8800318c4e20 bytes_req=32 bytes_alloc=32 gfp_flags=GFP_KERNEL
            bash-6652  [002] 80345.390566: kmalloc: call_site=ffffffff810bb0ee ptr=ffff88003d4e2680 bytes_req=176 bytes_alloc=192 gfp_flags=GFP_KERNEL|GFP_ZERO
            bash-6652  [002] 80345.390566: kmalloc: call_site=ffffffff810d76e1 ptr=ffff8800318c42c0 bytes_req=4 bytes_alloc=32 gfp_flags=GFP_KERNEL
            bash-6652  [002] 80345.390570: kmalloc: call_site=ffffffff810d76e1 ptr=ffff8800285405c0 bytes_req=4 bytes_alloc=32 gfp_flags=GFP_KERNEL
            bash-6652  [002] 80345.390574: kmalloc: call_site=ffffffff810bb320 ptr=ffff88003d4e2680 bytes_req=176 bytes_alloc=192 gfp_flags=GFP_KERNEL|GFP_ZERO

> >
> > So at a first glance after unsquinting a bit I'm not sure how good the
> > fit really is.

Well, if you need to add hooks, definitely at least use tracepoints. (see 
the TRACE_EVENT code in include/trace/events/*.h)

I'm not exactly sure what requirements you have, but it may be something 
we can work together on. Eliminate some duplicate code, or at least, 
ftrace can piggy back on it ;-)

> 
> Oh.  Here was I hoping that all that code was about to become useful.
> <runs away>

You better run!

-- Steve

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH/RFC] ummunot: Userspace support for MMU notifications
  2009-07-22 19:27   ` Roland Dreier
  2009-07-22 19:42     ` Andrew Morton
@ 2009-07-23  9:04     ` Li Zefan
  2009-07-23 20:28       ` Roland Dreier
  1 sibling, 1 reply; 17+ messages in thread
From: Li Zefan @ 2009-07-23  9:04 UTC (permalink / raw)
  To: Roland Dreier; +Cc: Andrew Morton, linux-kernel, Jeff Squyres

>  > >  1. ioctl() to register/unregister an address range to watch in the
>  > >     kernel (cf struct ummunot_register_ioctl in <linux/ummunot.h>).
>  > > 
>  > >  2. read() to retrieve events generated when a mapping in a watched
>  > >     address range is invalidated (cf struct ummunot_event in
>  > >     <linux/ummunot.h>).  select()/poll()/epoll() and SIGIO are handled
>  > >     for this IO.
>  > > 
>  > >  3. mmap() one page at offset 0 to map a kernel page that contains a
>  > >     generation counter that is incremented each time an event is
>  > >     generated.  This allows userspace to have a fast path that checks
>  > >     that no events have occurred without a system call.
>  > 
>  > If you stand back and squint, each of 1, 2 and 3 are things which the
>  > kernel already provides for the delivery of ftrace events to userspace.
>  > 
>  > Did you look at reusing all that stuff?
> 
> No, not really... will investigate a bit further.  Any pointers to how
> the ftrace stuff might work?

This can be implemented as a tracer or trace events, though I'm
not sure if it can fully meet your requirements or not.

To implement it as trace events, we add tracepoints in mmu_inlivadate_xxx(),
and define some TRACE_EVENT() macros.

And below shows how to use it:

  # mount -t debugfs xxx /mnt
  # cd /mnt/tracing/
  # echo 1 > events/mmu/enable
  # echo '(start >= 10000000 && end <= 10004096) || \
	(start >= 20000000 && end <= 20004096)' > events/mmu/filter
  # cat trace_pipe
        bash-2066  [001]   795.239077: mmu_invalidate_range_start: start=10000000 end=10000100
        bash-2066  [001]   795.239091: mmu_invalidate_range_start: start=10000000 end=10000100
        bash-2066  [001]   795.239098: mmu_invalidate_range_start: start=10000000 end=10000100
         cat-2189  [001]   795.239502: mmu_invalidate_page: start=20000000 end=20003000
         cat-2189  [001]   795.239578: mmu_invalidate_page: start=20000000 end=20003000
        bash-2066  [001]   795.239626: mmu_invalidate_page: start=20000000 end=20003000

The patch is extremely simple:

---
 include/linux/mmu_notifier.h |    4 +++
 include/trace/events/mmu.h   |   55 ++++++++++++++++++++++++++++++++++++++++++
 mm/mmu_notifier.c            |    3 ++
 3 files changed, 62 insertions(+), 0 deletions(-)
 create mode 100644 include/trace/events/mmu.h

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index b77486d..49ff4cc 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -5,6 +5,8 @@
 #include <linux/spinlock.h>
 #include <linux/mm_types.h>
 
+#include <trace/events/mmu.h>
+
 struct mmu_notifier;
 struct mmu_notifier_ops;
 
@@ -178,6 +180,7 @@ static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
 					  unsigned long address)
 {
+	trace_mmu_invalidate_page(address);
 	if (mm_has_notifiers(mm))
 		__mmu_notifier_invalidate_page(mm, address);
 }
@@ -185,6 +188,7 @@ static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 				  unsigned long start, unsigned long end)
 {
+	trace_mmu_invalidate_range_start(start, end);
 	if (mm_has_notifiers(mm))
 		__mmu_notifier_invalidate_range_start(mm, start, end);
 }
diff --git a/include/trace/events/mmu.h b/include/trace/events/mmu.h
new file mode 100644
index 0000000..0d6bbcd
--- /dev/null
+++ b/include/trace/events/mmu.h
@@ -0,0 +1,55 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM mmu
+
+#if !defined(_TRACE_MMU_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_MMU_H
+
+#include <linux/tracepoint.h>
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+TRACE_EVENT(mmu_invalidate_page,
+
+	TP_PROTO(unsigned long addr),
+
+	TP_ARGS(addr),
+
+	TP_STRUCT__entry(
+		__field(	unsigned long,	start		)
+		__field(	unsigned long,	end		)
+	),
+
+	TP_fast_assign(
+		__entry->start	= addr;
+		__entry->end	= addr + PAGE_SIZE;
+	),
+
+	TP_printk("start=%lu end=%lu", __entry->start, __entry->end)
+);
+
+TRACE_EVENT(mmu_invalidate_range_start,
+
+	TP_PROTO(unsigned long start, unsigned long end),
+
+	TP_ARGS(start, end),
+
+	TP_STRUCT__entry(
+		__field(	unsigned long,	start		)
+		__field(	unsigned long,	end		)
+	),
+
+	TP_fast_assign(
+		__entry->start	= start;
+		__entry->end	= end;
+	),
+
+	TP_printk("start=%lu end=%lu", __entry->start, __entry->end)
+);
+
+#endif /* CONFIG_MMU_NOTIFIER */
+
+#endif /* _TRACE_MMU_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
+
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 5f4ef02..685ffdc 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -17,6 +17,9 @@
 #include <linux/rcupdate.h>
 #include <linux/sched.h>
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/mmu.h>
+
 /*
  * This function can't run concurrently against mmu_notifier_register
  * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap
-- 
1.6.3


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH/RFC] ummunot: Userspace support for MMU notifications
  2009-07-23  2:26       ` Steven Rostedt
@ 2009-07-23 20:21         ` Roland Dreier
  2009-07-24  0:25           ` Steven Rostedt
  0 siblings, 1 reply; 17+ messages in thread
From: Roland Dreier @ 2009-07-23 20:21 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Andrew Morton, linux-kernel, jsquyres


 > > >  > >  3. mmap() one page at offset 0 to map a kernel page that contains a
 > > >  > >     generation counter that is incremented each time an event is
 > > >  > >     generated.  This allows userspace to have a fast path that checks
 > > >  > >     that no events have occurred without a system call.

 > Looks like a vsyscall to me.

Yes, in a way, although it is quite a bit simpler in the sense that it
doesn't require any arch-specific code (or indeed any code mapped from
the kernel) and is automatically available in a portable way.
Implementing this as a vsyscall seems as if it would add a lot of
complexity to the kernel side without much simplification on the
userspace side (in fact, hooking up the vsyscall is probably more code
than just doing mmap() + dereferencing a pointer).

 > # mount -t debugfs nodev /sys/kernel/debug
 > # ls /sys/kernel/debug/tracing

The use case I have in mind is for unprivileged user applications to use
this.  So requiring debugfs to be mounted hurts (since that isn't done
by default), and using the files in tracing really hurts, since they're
currently created with mode 0644 and so tracing can't be controlled by
unprivileged users.

[ASIDE: why is trace_marker created with the strange permission of 0220
when it is owned by root:root -- is there any reason for the group write
permission, or should it just be 0200 permission?]

In fact the whole model of ftrace seems to be a single privileged user
controlling a single context; the use case for ummunotify is that a lot
of processes running unprivileged (and possibly as multiple different
users) each want to get events for parts of their own address space.

So

 > # echo "ptr > 0xffffffff81100000 && ptr < 0xffffffff8113000" > events/kmem/kmalloc/filter 

is very cool; but what I would want is for a given process to be able to
say "please give me events for ptr in the following 100 ranges A..B,
C..D, ..." and "oh and add this range X..Y" and "oh don't give me events
for C..D anymore".  And that process should only get events about its
own address range; and 10 other (unprivileged) processes should be able
to do the same thing simultaneously.

Also is there a raw format for setting the filters that lets userspace
swap them atomically (ie change from filter A to filter B with a
guarantee that filter A is in effect right up to the time filter B is in
effect with no window where eg no filter is in effect).

 > Well, if you need to add hooks, definitely at least use tracepoints. (see 
 > the TRACE_EVENT code in include/trace/events/*.h)

I don't think I'm adding hooks -- the mmu notifier infrastructure
already suits me perfectly.  The only thing I'm doing is forwarding the
events delivered by mmu notifiers up to userspace, but not really in a
way that's very close to what ftrace does (I don't think).

It seems handling multiple unprivileged contexts accessing different
streams of trace events is going to require pretty huge ftrace changes.
And ummunotify is currently about 400 lines of code total (+ 300 lines
of comments :) so we're not going to simplify that code dramatically.
The hope I guess would be that a common interface would make things
conceptually simpler, but I don't see how to slot ftrace and ummunotify
together very cleanly.

Thanks,
  Roland

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH/RFC] ummunot: Userspace support for MMU notifications
  2009-07-23  9:04     ` [PATCH/RFC] ummunot: " Li Zefan
@ 2009-07-23 20:28       ` Roland Dreier
  0 siblings, 0 replies; 17+ messages in thread
From: Roland Dreier @ 2009-07-23 20:28 UTC (permalink / raw)
  To: Li Zefan; +Cc: Andrew Morton, linux-kernel, Jeff Squyres


 > This can be implemented as a tracer or trace events, though I'm
 > not sure if it can fully meet your requirements or not.
 > 
 > To implement it as trace events, we add tracepoints in mmu_inlivadate_xxx(),
 > and define some TRACE_EVENT() macros.
 > 
 > And below shows how to use it:
 > 
 >   # mount -t debugfs xxx /mnt
 >   # cd /mnt/tracing/
 >   # echo 1 > events/mmu/enable
 >   # echo '(start >= 10000000 && end <= 10004096) || \
 > 	(start >= 20000000 && end <= 20004096)' > events/mmu/filter
 >   # cat trace_pipe
 >         bash-2066  [001]   795.239077: mmu_invalidate_range_start: start=10000000 end=10000100
 >         bash-2066  [001]   795.239091: mmu_invalidate_range_start: start=10000000 end=10000100
 >         bash-2066  [001]   795.239098: mmu_invalidate_range_start: start=10000000 end=10000100
 >          cat-2189  [001]   795.239502: mmu_invalidate_page: start=20000000 end=20003000
 >          cat-2189  [001]   795.239578: mmu_invalidate_page: start=20000000 end=20003000
 >         bash-2066  [001]   795.239626: mmu_invalidate_page: start=20000000 end=20003000
 > 
 > The patch is extremely simple:

Thanks... unfortunately I don't think this really helps for the use case
I'm trying to address.  What's desired is a way for an unprivileged
userspace process to monitor (parts of) its own address space; I think
getting a single stream of all such events is not going to be helpful,
because, first, the overhead of filtering out the subset of useful
events may be too high to make this useful and, second, exposing events
about unrelated processes is probably a security hole (eg "allocation
channel" attacks on crypto processes analogous to timing channel attacks)

If there were some way for each process to get a trace of its own events
then that would be very close to what I'm trying to implement.

Thanks,
  Roland

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH/RFC] ummunot: Userspace support for MMU notifications
  2009-07-23 20:21         ` Roland Dreier
@ 2009-07-24  0:25           ` Steven Rostedt
  0 siblings, 0 replies; 17+ messages in thread
From: Steven Rostedt @ 2009-07-24  0:25 UTC (permalink / raw)
  To: Roland Dreier; +Cc: Andrew Morton, linux-kernel, jsquyres


On Thu, 23 Jul 2009, Roland Dreier wrote:

> 
>  > > >  > >  3. mmap() one page at offset 0 to map a kernel page that contains a
>  > > >  > >     generation counter that is incremented each time an event is
>  > > >  > >     generated.  This allows userspace to have a fast path that checks
>  > > >  > >     that no events have occurred without a system call.
> 
>  > Looks like a vsyscall to me.
> 
> Yes, in a way, although it is quite a bit simpler in the sense that it
> doesn't require any arch-specific code (or indeed any code mapped from
> the kernel) and is automatically available in a portable way.
> Implementing this as a vsyscall seems as if it would add a lot of
> complexity to the kernel side without much simplification on the
> userspace side (in fact, hooking up the vsyscall is probably more code
> than just doing mmap() + dereferencing a pointer).

Just making an observation, not really suggesting that we convert this to 
a vsyscall.

> 
>  > # mount -t debugfs nodev /sys/kernel/debug
>  > # ls /sys/kernel/debug/tracing
> 
> The use case I have in mind is for unprivileged user applications to use
> this.  So requiring debugfs to be mounted hurts (since that isn't done
> by default), and using the files in tracing really hurts, since they're
> currently created with mode 0644 and so tracing can't be controlled by
> unprivileged users.

Ah, allowing unprivileged users is a big commitment. That is, everything
that you handle must be considered untrusted in all accounts. The thing 
about tracing inside the kernel, is that enabling it, may affect everyone. 
Thus we can not simply allow unprivileged users to go messing with the 
performance of others.

> 
> [ASIDE: why is trace_marker created with the strange permission of 0220
> when it is owned by root:root -- is there any reason for the group write
> permission, or should it just be 0200 permission?]

Good question. Probably just an oversite. debugfs is kind of funny with 
its permissions.

> 
> In fact the whole model of ftrace seems to be a single privileged user
> controlling a single context; the use case for ummunotify is that a lot
> of processes running unprivileged (and possibly as multiple different
> users) each want to get events for parts of their own address space.
> 
> So
> 
>  > # echo "ptr > 0xffffffff81100000 && ptr < 0xffffffff8113000" > events/kmem/kmalloc/filter 
> 
> is very cool; but what I would want is for a given process to be able to
> say "please give me events for ptr in the following 100 ranges A..B,
> C..D, ..." and "oh and add this range X..Y" and "oh don't give me events
> for C..D anymore".  And that process should only get events about its
> own address range; and 10 other (unprivileged) processes should be able
> to do the same thing simultaneously.
> 
> Also is there a raw format for setting the filters that lets userspace
> swap them atomically (ie change from filter A to filter B with a
> guarantee that filter A is in effect right up to the time filter B is in
> effect with no window where eg no filter is in effect).
> 
>  > Well, if you need to add hooks, definitely at least use tracepoints. (see 
>  > the TRACE_EVENT code in include/trace/events/*.h)
> 
> I don't think I'm adding hooks -- the mmu notifier infrastructure
> already suits me perfectly.  The only thing I'm doing is forwarding the
> events delivered by mmu notifiers up to userspace, but not really in a
> way that's very close to what ftrace does (I don't think).

OK, I didn't look at the code enough to know.

> 
> It seems handling multiple unprivileged contexts accessing different
> streams of trace events is going to require pretty huge ftrace changes.
> And ummunotify is currently about 400 lines of code total (+ 300 lines
> of comments :) so we're not going to simplify that code dramatically.
> The hope I guess would be that a common interface would make things
> conceptually simpler, but I don't see how to slot ftrace and ummunotify
> together very cleanly.

I agree, what you are doing is out of the scope of ftrace. But perhaps 
someday this may change. Currently because of security issues, we keep 
ftrace a privileged user function. But we've been discussing changing this 
someday. But any change would need to be scrutinized for security sake. 
Ftrace gives a user a peak into what is going on inside the kernel. And 
with that, it theoretically, gives a way to get around kernel security 
measures.

-- Steve


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v2] ummunotify: Userspace support for MMU notifications
  2009-07-22 19:42     ` Andrew Morton
  2009-07-23  2:26       ` Steven Rostedt
@ 2009-07-24 22:56       ` Roland Dreier
  2009-07-27 23:53         ` Andrew Morton
  1 sibling, 1 reply; 17+ messages in thread
From: Roland Dreier @ 2009-07-24 22:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, jsquyres, Steven Rostedt

As discussed in <http://article.gmane.org/gmane.linux.drivers.openib/61925>
and follow-up messages, libraries using RDMA would like to track
precisely when application code changes memory mapping via free(),
munmap(), etc.  Current pure-userspace solutions using malloc hooks
and other tricks are not robust, and the feeling among experts is that
the issue is unfixable without kernel help.

We solve this not by implementing the full API proposed in the email
linked above but rather with a simpler and more generic interface,
which may be useful in other contexts.  Specifically, we implement a
new character device driver, ummunotify, that creates a /dev/ummunotify
node.  A userspace process can open this node read-only and use the fd
as follows:

 1. ioctl() to register/unregister an address range to watch in the
    kernel (cf struct ummunotify_register_ioctl in <linux/ummunotify.h>).

 2. read() to retrieve events generated when a mapping in a watched
    address range is invalidated (cf struct ummunotify_event in
    <linux/ummunotify.h>).  select()/poll()/epoll() and SIGIO are
    handled for this IO.

 3. mmap() one page at offset 0 to map a kernel page that contains a
    generation counter that is incremented each time an event is
    generated.  This allows userspace to have a fast path that checks
    that no events have occurred without a system call.

Thanks to Jason Gunthorpe <jgunthorpe@obsidianresearch.com> for
suggestions on the interface design.  Also thanks to Jeff Squyres
<jsquyres@cisco.com> for prototyping support for this in Open MPI, which
helped find several bugs during development.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
---
Hi Andrew,

since last time I've
 - agreed (I think) with Steve that this doesn't really fit in with
   (current) ftrace code.
 - s/ummunot/ummunotify/
 - documented the main datastructures
 - clarified a few other names (inval -> invalidate, dirty -> invalid)
 - used min_t to handle taking min of size_t and PAGE_SIZE
which I think covers everything you mentioned last time.

What else do you think needs to happen to move this upstream?

Thanks,
  Roland


 drivers/char/Kconfig       |   12 +
 drivers/char/Makefile      |    1 +
 drivers/char/ummunotify.c  |  554 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/ummunotify.h |  121 ++++++++++
 4 files changed, 688 insertions(+), 0 deletions(-)
 create mode 100644 drivers/char/ummunotify.c
 create mode 100644 include/linux/ummunotify.h

diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
index 6a06913..b8368e2 100644
--- a/drivers/char/Kconfig
+++ b/drivers/char/Kconfig
@@ -1108,6 +1108,18 @@ config DEVPORT
 	depends on ISA || PCI
 	default y
 
+config UMMUNOTIFY
+       tristate "Userspace MMU notifications"
+       select MMU_NOTIFIER
+       help
+         The ummunotify (userspace MMU notification) driver creates a
+         character device that can be used by userspace libraries to
+         get notifications when an application's memory mapping
+         changed.  This is used, for example, by RDMA libraries to
+         improve the reliability of memory registration caching, since
+         the kernel's MMU notifications can be used to know precisely
+         when to shoot down a cached registration.
+
 source "drivers/s390/char/Kconfig"
 
 endmenu
diff --git a/drivers/char/Makefile b/drivers/char/Makefile
index 66f779a..3547020 100644
--- a/drivers/char/Makefile
+++ b/drivers/char/Makefile
@@ -97,6 +97,7 @@ obj-$(CONFIG_NSC_GPIO)		+= nsc_gpio.o
 obj-$(CONFIG_CS5535_GPIO)	+= cs5535_gpio.o
 obj-$(CONFIG_GPIO_TB0219)	+= tb0219.o
 obj-$(CONFIG_TELCLOCK)		+= tlclk.o
+obj-$(CONFIG_UMMUNOTIFY)	+= ummunotify.o
 
 obj-$(CONFIG_MWAVE)		+= mwave/
 obj-$(CONFIG_AGP)		+= agp/
diff --git a/drivers/char/ummunotify.c b/drivers/char/ummunotify.c
new file mode 100644
index 0000000..1150623
--- /dev/null
+++ b/drivers/char/ummunotify.c
@@ -0,0 +1,554 @@
+/*
+ * Copyright (c) 2009 Cisco Systems.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/fs.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/miscdevice.h>
+#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
+#include <linux/module.h>
+#include <linux/poll.h>
+#include <linux/rbtree.h>
+#include <linux/sched.h>
+#include <linux/spinlock.h>
+#include <linux/uaccess.h>
+#include <linux/ummunotify.h>
+
+#include <asm/cacheflush.h>
+
+MODULE_AUTHOR("Roland Dreier");
+MODULE_DESCRIPTION("Userspace MMU notifiers");
+MODULE_LICENSE("GPL v2");
+
+/*
+ * Information about an address range userspace has asked us to watch.
+ *
+ * user_cookie: Opaque cookie given to us when userspace registers the
+ *   address range.
+ *
+ * start, end: Address range; start is inclusive, end is exclusive.
+ *
+ * hint_start, hint_end: If a single MMU notification event
+ *   invalidates the address range, we hold the actual range of
+ *   addresses that were invalidated (and set UMMUNOTIFY_FLAG_HINT).
+ *   If another event hits this range before userspace reads the
+ *   event, we give up and don't try to keep track of which subsets
+ *   got invalidated.
+ *
+ * flags: Holds the INVALID flag for ranges that are on the invalid
+ *   list and/or the HINT flag for ranges where the hint range holds
+ *   good information.
+ *
+ * node: Used to put the range into an rbtree we use to be able to
+ *   scan address ranges in order.
+ *
+ * list: Used to put the range on the invalid list when an MMU
+ *   notification event hits the range.
+ */
+enum {
+	UMMUNOTIFY_FLAG_INVALID	= 1,
+	UMMUNOTIFY_FLAG_HINT	= 2,
+};
+
+struct ummunotify_reg {
+	u64			user_cookie;
+	unsigned long		start;
+	unsigned long		end;
+	unsigned long		hint_start;
+	unsigned long		hint_end;
+	unsigned long		flags;
+	struct rb_node		node;
+	struct list_head	list;
+};
+
+/*
+ * Context attached to each file that userspace opens.
+ *
+ * mmu_notifier: MMU notifier registered for this context.
+ *
+ * mm: mm_struct for process that created the context; we use this to
+ *   hold a reference to the mm to make sure it doesn't go away until
+ *   we're done with it.
+ *
+ * reg_tree: RB tree of address ranges being watched, sorted by start
+ *   address.
+ *
+ * invalid_list: List of address ranges that have been invalidated by
+ *   MMU notification events; as userspace reads events, the address
+ *   range corresponding to the event is removed from the list.
+ *
+ * counter: Page that can be mapped read-only by userspace, which
+ *   holds a generation count that is incremented each time an event
+ *   occurs.
+ *
+ * lock: Spinlock used to protect all context.
+ *
+ * read_wait: Wait queue used to wait for data to become available in
+ *   blocking read()s.
+ *
+ * async_queue: Used to implement fasync().
+ *
+ * need_empty: Set when userspace reads an invalidation event, so that
+ *   read() knows it must generate an "empty" event when userspace
+ *   drains the invalid_list.
+ *
+ * used: Set after userspace does anything with the file, so that the
+ *   "exchange flags" ioctl() knows it's too late to change anything.
+ */
+struct ummunotify_file {
+	struct mmu_notifier	mmu_notifier;
+	struct mm_struct       *mm;
+	struct rb_root		reg_tree;
+	struct list_head	invalid_list;
+	u64		       *counter;
+	spinlock_t		lock;
+	wait_queue_head_t	read_wait;
+	struct fasync_struct   *async_queue;
+	int			need_empty;
+	int			used;
+};
+
+static void ummunotify_handle_notify(struct mmu_notifier *mn,
+				     unsigned long start, unsigned long end)
+{
+	struct ummunotify_file *priv =
+		container_of(mn, struct ummunotify_file, mmu_notifier);
+	struct rb_node *n;
+	struct ummunotify_reg *reg;
+	unsigned long flags;
+	int hit = 0;
+
+	spin_lock_irqsave(&priv->lock, flags);
+
+	for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) {
+		reg = rb_entry(n, struct ummunotify_reg, node);
+
+		if (reg->start >= end)
+			break;
+
+		/*
+		 * Ranges overlap if they're not disjoint; and they're
+		 * disjoint if the end of one is before the start of
+		 * the other one.
+		 */
+		if (!(reg->end <= start || end <= reg->start)) {
+			hit = 1;
+
+			if (!test_and_set_bit(UMMUNOTIFY_FLAG_INVALID, &reg->flags))
+				list_add_tail(&reg->list, &priv->invalid_list);
+
+			if (test_bit(UMMUNOTIFY_FLAG_HINT, &reg->flags)) {
+				clear_bit(UMMUNOTIFY_FLAG_HINT, &reg->flags);
+			} else {
+				set_bit(UMMUNOTIFY_FLAG_HINT, &reg->flags);
+				reg->hint_start = start;
+				reg->hint_end   = end;
+			}
+		}
+	}
+
+	if (hit) {
+		++(*priv->counter);
+		flush_dcache_page(virt_to_page(priv->counter));
+		wake_up_interruptible(&priv->read_wait);
+		kill_fasync(&priv->async_queue, SIGIO, POLL_IN);
+	}
+
+	spin_unlock_irqrestore(&priv->lock, flags);
+}
+
+static void ummunotify_invalidate_page(struct mmu_notifier *mn,
+				       struct mm_struct *mm,
+				       unsigned long addr)
+{
+	ummunotify_handle_notify(mn, addr, addr + PAGE_SIZE);
+}
+
+static void ummunotify_invalidate_range_start(struct mmu_notifier *mn,
+					      struct mm_struct *mm,
+					      unsigned long start,
+					      unsigned long end)
+{
+	ummunotify_handle_notify(mn, start, end);
+}
+
+static const struct mmu_notifier_ops ummunotify_mmu_notifier_ops = {
+	.invalidate_page	= ummunotify_invalidate_page,
+	.invalidate_range_start	= ummunotify_invalidate_range_start,
+};
+
+static int ummunotify_open(struct inode *inode, struct file *filp)
+{
+	struct ummunotify_file *priv;
+	int ret;
+
+	if (filp->f_mode & FMODE_WRITE)
+		return -EINVAL;
+
+	priv = kmalloc(sizeof *priv, GFP_KERNEL);
+	if (!priv)
+		return -ENOMEM;
+
+	priv->counter = (void *) get_zeroed_page(GFP_KERNEL);
+	if (!priv->counter) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	priv->reg_tree = RB_ROOT;
+	INIT_LIST_HEAD(&priv->invalid_list);
+	spin_lock_init(&priv->lock);
+	init_waitqueue_head(&priv->read_wait);
+	priv->async_queue = NULL;
+	priv->need_empty  = 0;
+	priv->used	  = 0;
+
+	priv->mmu_notifier.ops = &ummunotify_mmu_notifier_ops;
+	/*
+	 * Register notifier last, since notifications can occur as
+	 * soon as we register....
+	 */
+	ret = mmu_notifier_register(&priv->mmu_notifier, current->mm);
+	if (ret)
+		goto err_page;
+
+	priv->mm = current->mm;
+	atomic_inc(&priv->mm->mm_count);
+
+	filp->private_data = priv;
+
+	return 0;
+
+err_page:
+	free_page((unsigned long) priv->counter);
+
+err:
+	kfree(priv);
+	return ret;
+}
+
+static int ummunotify_close(struct inode *inode, struct file *filp)
+{
+	struct ummunotify_file *priv = filp->private_data;
+	struct rb_node *n;
+	struct ummunotify_reg *reg;
+
+	mmu_notifier_unregister(&priv->mmu_notifier, priv->mm);
+	mmdrop(priv->mm);
+	free_page((unsigned long) priv->counter);
+
+	for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) {
+		reg = rb_entry(n, struct ummunotify_reg, node);
+		kfree(reg);
+	}
+
+	kfree(priv);
+
+	return 0;
+}
+
+static bool ummunotify_readable(struct ummunotify_file *priv)
+{
+	return priv->need_empty || !list_empty(&priv->invalid_list);
+}
+
+static ssize_t ummunotify_read(struct file *filp, char __user *buf,
+			       size_t count, loff_t *pos)
+{
+	struct ummunotify_file *priv = filp->private_data;
+	struct ummunotify_reg *reg;
+	ssize_t ret;
+	struct ummunotify_event *events;
+	int max;
+	int n;
+
+	priv->used = 1;
+
+	events = (void *) get_zeroed_page(GFP_KERNEL);
+	if (!events) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	spin_lock_irq(&priv->lock);
+
+	while (!ummunotify_readable(priv)) {
+		spin_unlock_irq(&priv->lock);
+
+		if (filp->f_flags & O_NONBLOCK) {
+			ret = -EAGAIN;
+			goto out;
+		}
+
+		if (wait_event_interruptible(priv->read_wait,
+					     ummunotify_readable(priv))) {
+			ret = -ERESTARTSYS;
+			goto out;
+		}
+
+		spin_lock_irq(&priv->lock);
+	}
+
+	max = min_t(size_t, PAGE_SIZE, count) / sizeof *events;
+
+	for (n = 0; n < max; ++n) {
+		if (list_empty(&priv->invalid_list)) {
+			events[n].type		      = UMMUNOTIFY_EVENT_TYPE_LAST;
+			events[n].user_cookie_counter = *priv->counter;
+			++n;
+			priv->need_empty = 0;
+			break;
+		}
+
+		reg = list_first_entry(&priv->invalid_list, struct ummunotify_reg,
+				       list);
+
+		events[n].type = UMMUNOTIFY_EVENT_TYPE_INVAL;
+		if (test_bit(UMMUNOTIFY_FLAG_HINT, &reg->flags)) {
+			events[n].flags		= UMMUNOTIFY_EVENT_FLAG_HINT;
+			events[n].hint_start	= max(reg->start, reg->hint_start);
+			events[n].hint_end	= min(reg->end, reg->hint_end);
+		} else {
+			events[n].hint_start	= reg->start;
+			events[n].hint_end	= reg->end;
+		}
+		events[n].user_cookie_counter = reg->user_cookie;
+
+		list_del(&reg->list);
+		reg->flags = 0;
+		priv->need_empty = 1;
+	}
+
+	spin_unlock_irq(&priv->lock);
+
+	if (copy_to_user(buf, events, n * sizeof *events))
+		ret = -EFAULT;
+	else
+		ret = n * sizeof *events;
+
+out:
+	free_page((unsigned long) events);
+	return ret;
+}
+
+static unsigned int ummunotify_poll(struct file *filp, struct poll_table_struct *wait)
+{
+	struct ummunotify_file *priv = filp->private_data;
+
+	poll_wait(filp, &priv->read_wait, wait);
+
+	return ummunotify_readable(priv) ? (POLLIN | POLLRDNORM) : 0;
+}
+
+static long ummunotify_exchange_features(struct ummunotify_file *priv,
+					 __u32 __user *arg)
+{
+	u32 feature_mask;
+
+	if (priv->used)
+		return -EINVAL;
+
+	priv->used = 1;
+
+	if (get_user(feature_mask, arg))
+		return -EFAULT;
+
+	/* No extensions defined at present. */
+	feature_mask = 0;
+
+	if (put_user(feature_mask, arg))
+		return -EFAULT;
+
+	return 0;
+}
+
+static long ummunotify_register_region(struct ummunotify_file *priv,
+				       struct ummunotify_register_ioctl __user *arg)
+{
+	struct ummunotify_register_ioctl parm;
+	struct ummunotify_reg *reg, *treg;
+	struct rb_node **n = &priv->reg_tree.rb_node;
+	struct rb_node *pn;
+	int ret = 0;
+
+	if (copy_from_user(&parm, arg, sizeof parm))
+		return -EFAULT;
+
+	priv->used = 1;
+
+	reg = kmalloc(sizeof *reg, GFP_KERNEL);
+	if (!reg)
+		return -ENOMEM;
+
+	reg->user_cookie	= parm.user_cookie;
+	reg->start		= parm.start;
+	reg->end		= parm.end;
+	reg->flags		= 0;
+
+	spin_lock_irq(&priv->lock);
+
+	for (pn = rb_first(&priv->reg_tree); pn; pn = rb_next(pn)) {
+		treg = rb_entry(pn, struct ummunotify_reg, node);
+
+		if (treg->user_cookie == parm.user_cookie) {
+			kfree(reg);
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
+	pn = NULL;
+	while (*n) {
+		pn = *n;
+		treg = rb_entry(pn, struct ummunotify_reg, node);
+
+		if (reg->start <= treg->start)
+			n = &pn->rb_left;
+		else
+			n = &pn->rb_right;
+	}
+
+	rb_link_node(&reg->node, pn, n);
+	rb_insert_color(&reg->node, &priv->reg_tree);
+
+out:
+	spin_unlock_irq(&priv->lock);
+
+	return ret;
+}
+
+static long ummunotify_unregister_region(struct ummunotify_file *priv,
+					 __u64 __user *arg)
+{
+	u64 user_cookie;
+	struct rb_node *n;
+	struct ummunotify_reg *reg;
+	int ret = -EINVAL;
+
+	if (get_user(user_cookie, arg))
+		return -EFAULT;
+
+	spin_lock_irq(&priv->lock);
+
+	for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) {
+		reg = rb_entry(n, struct ummunotify_reg, node);
+
+		if (reg->user_cookie == user_cookie) {
+			rb_erase(n, &priv->reg_tree);
+			if (test_bit(UMMUNOTIFY_FLAG_INVALID, &reg->flags))
+				list_del(&reg->list);
+			kfree(reg);
+			ret = 0;
+			break;
+		}
+	}
+
+	spin_unlock_irq(&priv->lock);
+
+	return ret;
+}
+
+static long ummunotify_ioctl(struct file *filp, unsigned int cmd,
+			     unsigned long arg)
+{
+	struct ummunotify_file *priv = filp->private_data;
+	void __user *argp = (void __user *) arg;
+
+	switch (cmd) {
+	case UMMUNOTIFY_EXCHANGE_FEATURES:
+		return ummunotify_exchange_features(priv, argp);
+	case UMMUNOTIFY_REGISTER_REGION:
+		return ummunotify_register_region(priv, argp);
+	case UMMUNOTIFY_UNREGISTER_REGION:
+		return ummunotify_unregister_region(priv, argp);
+	default:
+		return -ENOIOCTLCMD;
+	}
+}
+
+static int ummunotify_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct ummunotify_file *priv = vma->vm_private_data;
+
+	if (vmf->pgoff != 0)
+		return VM_FAULT_SIGBUS;
+
+	vmf->page = virt_to_page(priv->counter);
+	get_page(vmf->page);
+
+	return 0;
+
+}
+
+static struct vm_operations_struct ummunotify_vm_ops = {
+	.fault		= ummunotify_fault,
+};
+
+static int ummunotify_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+	struct ummunotify_file *priv = filp->private_data;
+
+	if (vma->vm_end - vma->vm_start != PAGE_SIZE ||
+	    vma->vm_pgoff != 0)
+		return -EINVAL;
+
+	vma->vm_ops		= &ummunotify_vm_ops;
+	vma->vm_private_data	= priv;
+
+	return 0;
+}
+
+static int ummunotify_fasync(int fd, struct file *filp, int on)
+{
+	struct ummunotify_file *priv = filp->private_data;
+
+	return fasync_helper(fd, filp, on, &priv->async_queue);
+}
+
+static const struct file_operations ummunotify_fops = {
+	.owner		= THIS_MODULE,
+	.open		= ummunotify_open,
+	.release	= ummunotify_close,
+	.read		= ummunotify_read,
+	.poll		= ummunotify_poll,
+	.unlocked_ioctl	= ummunotify_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= ummunotify_ioctl,
+#endif
+	.mmap		= ummunotify_mmap,
+	.fasync		= ummunotify_fasync,
+};
+
+static struct miscdevice ummunotify_misc = {
+	.minor	= MISC_DYNAMIC_MINOR,
+	.name	= "ummunotify",
+	.fops	= &ummunotify_fops,
+};
+
+static int __init ummunotify_init(void)
+{
+	return misc_register(&ummunotify_misc);
+}
+
+static void __exit ummunotify_cleanup(void)
+{
+	misc_deregister(&ummunotify_misc);
+}
+
+module_init(ummunotify_init);
+module_exit(ummunotify_cleanup);
diff --git a/include/linux/ummunotify.h b/include/linux/ummunotify.h
new file mode 100644
index 0000000..9e4620e
--- /dev/null
+++ b/include/linux/ummunotify.h
@@ -0,0 +1,121 @@
+/*
+ * Copyright (c) 2009 Cisco Systems.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef _LINUX_UMMUNOTIFY_H
+#define _LINUX_UMMUNOTIFY_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+/*
+ * Ummunotify relays MMU notifier events to userspace.  A userspace
+ * process uses it by opening /dev/ummunotify, which returns a file
+ * descriptor.  Interest in address ranges is registered using ioctl()
+ * and MMU notifier events are retrieved using read(), as described in
+ * more detail below.
+ *
+ * Userspace can also mmap() a single read-only page at offset 0 on
+ * this file descriptor.  This page contains (at offest 0) a single
+ * 64-bit generation counter that the kernel increments each time an
+ * MMU notifier event occurs.  Userspace can use this to very quickly
+ * check if there are any events to retrieve without needing to do a
+ * system call.
+ */
+
+/*
+ * struct ummunotify_register_ioctl describes an address range from
+ * start to end (including start but not including end) to be
+ * monitored.  user_cookie is an opaque handle that userspace assigns,
+ * and which is used to unregister.  flags and reserved are currently
+ * unused and should be set to 0 for forward compatibility.
+ */
+struct ummunotify_register_ioctl {
+	__u64	start;		/* in */
+	__u64	end;		/* in */
+	__u64	user_cookie;	/* in */
+	__u32	flags;		/* in */
+	__u32	reserved;	/* in */
+};
+
+#define UMMUNOTIFY_MAGIC		'U'
+
+/*
+ * Forward compatibility: Userspace passes in a 32-bit feature mask
+ * with feature flags set indicating which extensions it wishes to
+ * use.  The kernel will return a feature mask with the bits of
+ * userspace's mask that the kernel implements; from that point on
+ * both userspace and the kernel should behave as described by the
+ * kernel's feature mask.
+ *
+ * If userspace does not perform a UMMUNOTIFY_EXCHANGE_FEATURES ioctl,
+ * then the kernel will use a feature mask of 0.
+ *
+ * No feature flags are currently defined, so the kernel will always
+ * return a feature mask of 0 at present.
+ */
+#define UMMUNOTIFY_EXCHANGE_FEATURES	_IOWR(UMMUNOTIFY_MAGIC, 1, __u32)
+
+/*
+ * Register interest in an address range; userspace should pass in a
+ * struct ummunotify_register_ioctl describing the region.
+ */
+#define UMMUNOTIFY_REGISTER_REGION	_IOW(UMMUNOTIFY_MAGIC, 2, \
+					     struct ummunotify_register_ioctl)
+/*
+ * Unregister interest in an address range; userspace should pass in
+ * the user_cookie value that was used to register the address range.
+ * No events for the address range will be reported once it is
+ * unregistered.
+ */
+#define UMMUNOTIFY_UNREGISTER_REGION	_IOW(UMMUNOTIFY_MAGIC, 3, __u64)
+
+/*
+ * Invalidation events are returned whenever the kernel changes the
+ * mapping for a monitored address.  These events are retrieved by
+ * read() on the ummunotify file descriptor, which will fill the
+ * read() buffer with struct ummunotify_event.
+ *
+ * If type field is INVAL, then user_cookie_counter holds the
+ * user_cookie for the region being reported; if the HINT flag is set
+ * then hint_start/hint_end hold the start and end of the mapping that
+ * was invalidated.  (If HINT is not set, then multiple events
+ * invalidated parts of the registered range and hint_start/hint_end
+ * and set to the start/end of the whole registered range)
+ *
+ * If type is LAST, then the read operation has emptied the list of
+ * invalidated regions, and user_cookie_counter holds the value of the
+ * kernel's generation counter when the empty list occurred.  The
+ * other fields are not filled in for this event.
+ */
+enum {
+	UMMUNOTIFY_EVENT_TYPE_INVAL	= 0,
+	UMMUNOTIFY_EVENT_TYPE_LAST	= 1,
+};
+
+enum {
+	UMMUNOTIFY_EVENT_FLAG_HINT	= 1 << 0,
+};
+
+struct ummunotify_event {
+	__u32	type;
+	__u32	flags;
+	__u64	hint_start;
+	__u64	hint_end;
+	__u64	user_cookie_counter;
+};
+
+#endif /* _LINUX_UMMUNOTIFY_H */
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] ummunotify: Userspace support for MMU notifications
  2009-07-24 22:56       ` [PATCH v2] ummunotify: " Roland Dreier
@ 2009-07-27 23:53         ` Andrew Morton
  2009-07-28 16:14           ` Roland Dreier
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2009-07-27 23:53 UTC (permalink / raw)
  To: Roland Dreier; +Cc: linux-kernel, jsquyres, rostedt

On Fri, 24 Jul 2009 15:56:17 -0700
Roland Dreier <rdreier@cisco.com> wrote:

> As discussed in <http://article.gmane.org/gmane.linux.drivers.openib/61925>
> and follow-up messages, libraries using RDMA would like to track
> precisely when application code changes memory mapping via free(),
> munmap(), etc.  Current pure-userspace solutions using malloc hooks
> and other tricks are not robust, and the feeling among experts is that
> the issue is unfixable without kernel help.
> 
> We solve this not by implementing the full API proposed in the email
> linked above but rather with a simpler and more generic interface,
> which may be useful in other contexts.  Specifically, we implement a
> new character device driver, ummunotify, that creates a /dev/ummunotify
> node.  A userspace process can open this node read-only and use the fd
> as follows:
> 
>  1. ioctl() to register/unregister an address range to watch in the
>     kernel (cf struct ummunotify_register_ioctl in <linux/ummunotify.h>).
> 
>  2. read() to retrieve events generated when a mapping in a watched
>     address range is invalidated (cf struct ummunotify_event in
>     <linux/ummunotify.h>).  select()/poll()/epoll() and SIGIO are
>     handled for this IO.
> 
>  3. mmap() one page at offset 0 to map a kernel page that contains a
>     generation counter that is incremented each time an event is
>     generated.  This allows userspace to have a fast path that checks
>     that no events have occurred without a system call.
> 
> Thanks to Jason Gunthorpe <jgunthorpe@obsidianresearch.com> for
> suggestions on the interface design.  Also thanks to Jeff Squyres
> <jsquyres@cisco.com> for prototyping support for this in Open MPI, which
> helped find several bugs during development.
> 
> ...
>
> +config UMMUNOTIFY
> +       tristate "Userspace MMU notifications"
> +       select MMU_NOTIFIER
> +       help
> +         The ummunotify (userspace MMU notification) driver creates a
> +         character device that can be used by userspace libraries to
> +         get notifications when an application's memory mapping
> +         changed.  This is used, for example, by RDMA libraries to
> +         improve the reliability of memory registration caching, since
> +         the kernel's MMU notifications can be used to know precisely
> +         when to shoot down a cached registration.

Does `select' dtrt here if UMMUNOTIFY=m?  I never trust it...

<searches in vain for ummunotify.txt>

Oh well :(

A little test app would be nice - I assume you have one.  We could toss
in in the tree as a how-to-use example, and people could perhaps turn
it into a regression test - perhaps the LTP people would take it.

>
> ...
>
> +			if (test_bit(UMMUNOTIFY_FLAG_HINT, &reg->flags)) {
> +				clear_bit(UMMUNOTIFY_FLAG_HINT, &reg->flags);
> +			} else {
> +				set_bit(UMMUNOTIFY_FLAG_HINT, &reg->flags);

It's a shame that change_bit() didn't return the old (or new) value.



The overall userspace interface seems a bit klunky, but I can't really
suggest anything better.  Netlink delivery?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] ummunotify: Userspace support for MMU notifications
  2009-07-27 23:53         ` Andrew Morton
@ 2009-07-28 16:14           ` Roland Dreier
  2009-07-31 18:54             ` [PATCH v3] " Roland Dreier
  0 siblings, 1 reply; 17+ messages in thread
From: Roland Dreier @ 2009-07-28 16:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, jsquyres, rostedt


 > > +config UMMUNOTIFY
 > > +       tristate "Userspace MMU notifications"
 > > +       select MMU_NOTIFIER

 > Does `select' dtrt here if UMMUNOTIFY=m?  I never trust it...

Yes, it appears to when I test... and all the other users (KVM, SGI_GRU)
of MMU_NOTIFIER are tristates that select it, so in practice it seems to
work as well.

 > <searches in vain for ummunotify.txt>
 > 
 > Oh well :(
 > 
 > A little test app would be nice - I assume you have one.  We could toss
 > in in the tree as a how-to-use example, and people could perhaps turn
 > it into a regression test - perhaps the LTP people would take it.

I'll write up a bit of doc and stick it in Documentation/ummunotify.txt,
and create samples/ummunotify/ as well for the next iteration.

 > > +			if (test_bit(UMMUNOTIFY_FLAG_HINT, &reg->flags)) {
 > > +				clear_bit(UMMUNOTIFY_FLAG_HINT, &reg->flags);
 > > +			} else {
 > > +				set_bit(UMMUNOTIFY_FLAG_HINT, &reg->flags);

 > It's a shame that change_bit() didn't return the old (or new) value.

Heh, I didn't know about change_bit().  <looks> ... seems like
test_and_change_bit() is exactly what would fit here.  Will update.

 > The overall userspace interface seems a bit klunky, but I can't really
 > suggest anything better.  Netlink delivery?

I took a quick look at netlink but discarded the idea when I saw that
netlink sockets don't support mmap() ... seems that plumbing in mmap
support (to handle the kernel exported generation count) is probably
more trouble than it's worth, and I'm not sure that all the netlink
encapsulation is that much of a win here anyway.  I'd prefer to proceed
with the "virtual character device" interface -- another option would be
sys_ummunotify_create and sys_ummunotify_ctl but I can't really imagine
this being worth another couple syscalls.

 - R.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v3] ummunotify: Userspace support for MMU notifications
  2009-07-28 16:14           ` Roland Dreier
@ 2009-07-31 18:54             ` Roland Dreier
  2009-08-02 19:59               ` Brice Goglin
  0 siblings, 1 reply; 17+ messages in thread
From: Roland Dreier @ 2009-07-31 18:54 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, jsquyres, rostedt

>From 8b067027a0fc5d3280fe3f006d936d6d775de297 Mon Sep 17 00:00:00 2001
From: Roland Dreier <rolandd@cisco.com>
Date: Fri, 31 Jul 2009 11:47:53 -0700
Subject: [PATCH] ummunotify: Userspace support for MMU notifications

As discussed in <http://article.gmane.org/gmane.linux.drivers.openib/61925>
and follow-up messages, libraries using RDMA would like to track
precisely when application code changes memory mapping via free(),
munmap(), etc.  Current pure-userspace solutions using malloc hooks
and other tricks are not robust, and the feeling among experts is that
the issue is unfixable without kernel help.

We solve this not by implementing the full API proposed in the email
linked above but rather with a simpler and more generic interface,
which may be useful in other contexts.  Specifically, we implement a
new character device driver, ummunotify, that creates a /dev/ummunotify
node.  A userspace process can open this node read-only and use the fd
as follows:

 1. ioctl() to register/unregister an address range to watch in the
    kernel (cf struct ummunotify_register_ioctl in <linux/ummunotify.h>).

 2. read() to retrieve events generated when a mapping in a watched
    address range is invalidated (cf struct ummunotify_event in
    <linux/ummunotify.h>).  select()/poll()/epoll() and SIGIO are
    handled for this IO.

 3. mmap() one page at offset 0 to map a kernel page that contains a
    generation counter that is incremented each time an event is
    generated.  This allows userspace to have a fast path that checks
    that no events have occurred without a system call.

Thanks to Jason Gunthorpe <jgunthorpe@obsidianresearch.com> for
suggestions on the interface design.  Also thanks to Jeff Squyres
<jsquyres@cisco.com> for prototyping support for this in Open MPI, which
helped find several bugs during development.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
---
Hi Andrew,
This should replace the ummunotify patch you're carrying in -mm (if
you'd prefer an incremental patch, let me know).

Changes since v2:
 - Added Documentation/ummunotify/ with a text file and a test program
   (hooked up to CONFIG_BUILD_DOCSRC, fixed things like hooking up
   ummunotify.h to headers_install)
 - Integrated Andrew's checkpatch fixes (no more > 80 char lines in
   kernel source; userspace test code has some long lines due to not
   wanting to split printf formats)
 - Clean up "if (test_bit) { clear_bit } else { set_bit }" -- code was
   actually buggy since we don't want to reset the bit after we cleared
   it (ie 3 events in a row)

 Documentation/Makefile                  |    3 +-
 Documentation/ummunotify/Makefile       |    7 +
 Documentation/ummunotify/ummunotify.txt |  150 ++++++++
 Documentation/ummunotify/umn-test.c     |  200 +++++++++++
 drivers/char/Kconfig                    |   12 +
 drivers/char/Makefile                   |    1 +
 drivers/char/ummunotify.c               |  566 +++++++++++++++++++++++++++++++
 include/linux/ummunotify.h              |  121 +++++++
 8 files changed, 1059 insertions(+), 1 deletions(-)
 create mode 100644 Documentation/ummunotify/Makefile
 create mode 100644 Documentation/ummunotify/ummunotify.txt
 create mode 100644 Documentation/ummunotify/umn-test.c
 create mode 100644 drivers/char/ummunotify.c
 create mode 100644 include/linux/ummunotify.h

diff --git a/Documentation/Makefile b/Documentation/Makefile
index 94b9457..21f05ea 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -1,3 +1,4 @@
 obj-m := DocBook/ accounting/ auxdisplay/ connector/ \
 	filesystems/configfs/ ia64/ networking/ \
-	pcmcia/ spi/ video4linux/ vm/ watchdog/src/
+	pcmcia/ spi/ video4linux/ vm/ ummunotify/ \
+	watchdog/src/
diff --git a/Documentation/ummunotify/Makefile b/Documentation/ummunotify/Makefile
new file mode 100644
index 0000000..89f31a0
--- /dev/null
+++ b/Documentation/ummunotify/Makefile
@@ -0,0 +1,7 @@
+# List of programs to build
+hostprogs-y := umn-test
+
+# Tell kbuild to always build the programs
+always := $(hostprogs-y)
+
+HOSTCFLAGS_umn-test.o += -I$(objtree)/usr/include
diff --git a/Documentation/ummunotify/ummunotify.txt b/Documentation/ummunotify/ummunotify.txt
new file mode 100644
index 0000000..78a79c2
--- /dev/null
+++ b/Documentation/ummunotify/ummunotify.txt
@@ -0,0 +1,150 @@
+UMMUNOTIFY
+
+  Ummunotify relays MMU notifier events to userspace.  This is useful
+  for libraries that need to track the memory mapping of applications;
+  for example, MPI implementations using RDMA want to cache memory
+  registrations for performance, but tracking all possible crazy cases
+  such as when, say, the FORTRAN runtime frees memory is impossible
+  without kernel help.
+
+Basic Model
+
+  A userspace process uses it by opening /dev/ummunotify, which
+  returns a file descriptor.  Interest in address ranges is registered
+  using ioctl() and MMU notifier events are retrieved using read(), as
+  described in more detail below.  Userspace can register multiple
+  address ranges to watch, and can unregister individual ranges.
+
+  Userspace can also mmap() a single read-only page at offset 0 on
+  this file descriptor.  This page contains (at offest 0) a single
+  64-bit generation counter that the kernel increments each time an
+  MMU notifier event occurs.  Userspace can use this to very quickly
+  check if there are any events to retrieve without needing to do a
+  system call.
+
+Control
+
+  To start using ummunotify, a process opens /dev/ummunotify in
+  read-only mode.  Control from userspace is done via ioctl(); the
+  defined ioctls are:
+
+    UMMUNOTIFY_EXCHANGE_FEATURES: This ioctl takes a single 32-bit
+      word of feature flags as input, and the kernel updates the
+      features flags word to contain only features requested by
+      userspace and also supported by the kernel.
+
+      This ioctl is only included for forward compatibility; no
+      feature flags are currently defined, and the kernel will simply
+      update any requested feature mask to 0.  The kernel will always
+      default to a feature mask of 0 if this ioctl is not used, so
+      current userspace does not need to perform this ioctl.
+
+    UMMUNOTIFY_REGISTER_REGION: Userspace uses this ioctl to tell the
+      kernel to start delivering events for an address range.  The
+      range is described using struct ummunotify_register_ioctl:
+
+	struct ummunotify_register_ioctl {
+		__u64	start;
+		__u64	end;
+		__u64	user_cookie;
+		__u32	flags;
+		__u32	reserved;
+	};
+
+      start and end give the range of userspace virtual addresses;
+      start is included in the range and end is not, so an example of
+      a 4 KB range would be start=0x1000, end=0x2000.
+
+      user_cookie is an opaque 64-bit quantity that is returned by the
+      kernel in events involving the range, and used by userspace to
+      stop watching the range.  Each registered address range must
+      have a distinct user_cookie.
+
+      It is fine with the kernel if userspace registers multiple
+      overlapping or even duplicate address ranges, as long as a
+      different cookie is used for each registration.
+
+      flags and reserved are included for forward compatibility;
+      userspace should simply set them to 0 for the current interface.
+
+    UMMUNOTIFY_UNREGISTER_REGION: Userspace passes in the 64-bit
+      user_cookie used to register a range to tell the kernel to stop
+      watching an address range.  Once this ioctl completes, the
+      kernel will not deliver any further events for the range that is
+      unregistered.
+
+Events
+
+  When an event occurs that invalidates some of a process's memory
+  mapping in an address range being watched, ummunotify queues an
+  event report for that address range.  If more than one event
+  invalidates parts of the same address range before userspace
+  retrieves the queued report, then further reports for the same range
+  will not be queued -- when userspace does read the queue, only a
+  single report for a given range will be returned.
+
+  If multiple ranges being watched are invalidated by a single event
+  (which is especially likely if userspace registers overlapping
+  ranges), then an event report structure will be queued for each
+  address range registration.
+
+  Userspace retrieves queued events via read() on the ummunotify file
+  descriptor; a buffer that is at least as big as struct
+  ummunotify_event should be used to retrieve event reports, and if a
+  larger buffer is passed to read(), multiple reports will be returned
+  (if available).
+
+  If the ummunotify file descriptor is in blocking mode, a read() call
+  will wait for an event report to be available.  Userspace may also
+  set the ummunotify file descriptor to non-blocking mode and use all
+  standard ways of waiting for data to be available on the ummunotify
+  file descriptor, including epoll/poll()/select() and SIGIO.
+
+  The format of event reports is:
+
+	struct ummunotify_event {
+		__u32	type;
+		__u32	flags;
+		__u64	hint_start;
+		__u64	hint_end;
+		__u64	user_cookie_counter;
+	};
+
+  where the type field is either UMMUNOTIFY_EVENT_TYPE_INVAL or
+  UMMUNOTIFY_EVENT_TYPE_LAST.  Events of type INVAL describe
+  invalidation events as follows: user_cookie_counter contains the
+  cookie passed in when userspace registered the range that the event
+  is for.  hint_start and hint_end contain the start address and end
+  address that were invalidated.
+
+  The flags word contains bit flags, with only UMMUNOTIFY_EVENT_FLAG_HINT
+  defined at the moment.  If HINT is set, then the invalidation event
+  invalidated less than the full address range and the kernel returns
+  the exact range invalidated; if HINT is not sent then hint_start and
+  hint_end are set to the original range registered by userspace.
+  (HINT will not be set if, for example, multiple events invalidated
+  disjoint parts of the range and so a single start/end pair cannot
+  represent the parts of the range that were invalidated)
+
+  If the event type is LAST, then the read operation has emptied the
+  list of invalidated regions, and the flags, hint_start and hint_end
+  fields are not used.  user_cookie_counter holds the value of the
+  kernel's generation counter (see below of more details) when the
+  empty list occurred.
+
+Generation Count
+
+  Userspace may mmap() a page on a ummunotify file descriptor via
+
+	mmap(NULL, sizeof (__u64), PROT_READ, MAP_SHARED, ummunotify_fd, 0);
+
+  to get a read-only mapping of the kernel's 64-bit generation
+  counter.  The kernel will increment this generation counter each
+  time an event report is queued.
+
+  Userspace can use the generation counter as a quick check to avoid
+  system calls; if the value read from the mapped kernel counter is
+  still equal to the value returned in user_cookie_counter for the
+  most recent LAST event retrieved, then no further events have been
+  queued and there is no need to try a read() on the ummunotify file
+  descriptor.
diff --git a/Documentation/ummunotify/umn-test.c b/Documentation/ummunotify/umn-test.c
new file mode 100644
index 0000000..143db2c
--- /dev/null
+++ b/Documentation/ummunotify/umn-test.c
@@ -0,0 +1,200 @@
+/*
+ * Copyright (c) 2009 Cisco Systems.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <stdint.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <unistd.h>
+
+#include <linux/ummunotify.h>
+
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/ioctl.h>
+
+#define UMN_TEST_COOKIE 123
+
+static int		umn_fd;
+static volatile __u64  *umn_counter;
+
+static int umn_init(void)
+{
+	__u32 flags;
+
+	umn_fd = open("/dev/ummunotify", O_RDONLY);
+	if (umn_fd < 0) {
+		perror("open");
+		return 1;
+	}
+
+	if (ioctl(umn_fd, UMMUNOTIFY_EXCHANGE_FEATURES, &flags)) {
+		perror("exchange ioctl");
+		return 1;
+	}
+
+	printf("kernel feature flags: 0x%08x\n", flags);
+
+	umn_counter = mmap(NULL, sizeof *umn_counter, PROT_READ,
+			   MAP_SHARED, umn_fd, 0);
+	if (umn_counter == MAP_FAILED) {
+		perror("mmap");
+		return 1;
+	}
+
+	return 0;
+}
+
+static int umn_register(void *buf, size_t size, __u64 cookie)
+{
+	struct ummunotify_register_ioctl r = {
+		.start		= (unsigned long) buf,
+		.end		= (unsigned long) buf + size,
+		.user_cookie	= cookie,
+	};
+
+	if (ioctl(umn_fd, UMMUNOTIFY_REGISTER_REGION, &r)) {
+		perror("register ioctl");
+		return 1;
+	}
+
+	return 0;
+}
+
+static int umn_unregister(__u64 cookie)
+{
+	if (ioctl(umn_fd, UMMUNOTIFY_UNREGISTER_REGION, &cookie)) {
+		perror("unregister ioctl");
+		return 1;
+	}
+
+	return 0;
+}
+
+int main(int argc, char *argv[])
+{
+	int			page_size;
+	__u64			old_counter;
+	void		       *t;
+	int			got_it;
+
+	if (umn_init())
+		return 1;
+
+	printf("\n");
+
+	old_counter = *umn_counter;
+	if (old_counter != 0) {
+		fprintf(stderr, "counter = %lld (expected 0)\n", old_counter);
+		return 1;
+	}
+
+	page_size = sysconf(_SC_PAGESIZE);
+	t = mmap(NULL, 3 * page_size, PROT_READ,
+		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
+
+	if (umn_register(t, 3 * page_size, UMN_TEST_COOKIE))
+		return 1;
+
+	munmap(t + page_size, page_size);
+
+	old_counter = *umn_counter;
+	if (old_counter != 1) {
+		fprintf(stderr, "counter = %lld (expected 1)\n", old_counter);
+		return 1;
+	}
+
+	got_it = 0;
+	while (1) {
+		struct ummunotify_event	ev;
+		int			len;
+
+		len = read(umn_fd, &ev, sizeof ev);
+		if (len < 0) {
+			perror("read event");
+			return 1;
+		}
+		if (len != sizeof ev) {
+			fprintf(stderr, "Read gave %d bytes (!= event size %zd)\n",
+				len, sizeof ev);
+			return 1;
+		}
+
+		switch (ev.type) {
+		case UMMUNOTIFY_EVENT_TYPE_INVAL:
+			if (got_it) {
+				fprintf(stderr, "Extra invalidate event\n");
+				return 1;
+			}
+			if (ev.user_cookie_counter != UMN_TEST_COOKIE) {
+				fprintf(stderr, "Invalidate event for cookie %lld (expected %d)\n",
+					ev.user_cookie_counter,
+					UMN_TEST_COOKIE);
+				return 1;
+			}
+
+			printf("Invalidate event:\tcookie %lld\n",
+			       ev.user_cookie_counter);
+
+			if (!(ev.flags & UMMUNOTIFY_EVENT_FLAG_HINT)) {
+				fprintf(stderr, "Hint flag not set\n");
+				return 1;
+			}
+
+			if (ev.hint_start != (uintptr_t) t + page_size ||
+			    ev.hint_end != (uintptr_t) t + page_size * 2) {
+				fprintf(stderr, "Got hint %llx..%llx, expected %p..%p\n",
+					ev.hint_start, ev.hint_end,
+					t + page_size, t + page_size * 2);
+				return 1;
+			}
+
+			printf("\t\t\thint %llx...%llx\n",
+			       ev.hint_start, ev.hint_end);
+
+			got_it = 1;
+			break;
+
+		case UMMUNOTIFY_EVENT_TYPE_LAST:
+			if (!got_it) {
+				fprintf(stderr, "Last event without invalidate event\n");
+				return 1;
+			}
+
+			printf("Empty event:\t\tcounter %lld\n",
+			       ev.user_cookie_counter);
+			goto done;
+
+		default:
+			fprintf(stderr, "unknown event type %d\n",
+				ev.type);
+			return 1;
+		}
+	}
+
+done:
+	umn_unregister(123);
+	munmap(t, page_size);
+
+	old_counter = *umn_counter;
+	if (old_counter != 1) {
+		fprintf(stderr, "counter = %lld (expected 1)\n", old_counter);
+		return 1;
+	}
+
+	return 0;
+}
diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
index 6a06913..b8368e2 100644
--- a/drivers/char/Kconfig
+++ b/drivers/char/Kconfig
@@ -1108,6 +1108,18 @@ config DEVPORT
 	depends on ISA || PCI
 	default y
 
+config UMMUNOTIFY
+       tristate "Userspace MMU notifications"
+       select MMU_NOTIFIER
+       help
+         The ummunotify (userspace MMU notification) driver creates a
+         character device that can be used by userspace libraries to
+         get notifications when an application's memory mapping
+         changed.  This is used, for example, by RDMA libraries to
+         improve the reliability of memory registration caching, since
+         the kernel's MMU notifications can be used to know precisely
+         when to shoot down a cached registration.
+
 source "drivers/s390/char/Kconfig"
 
 endmenu
diff --git a/drivers/char/Makefile b/drivers/char/Makefile
index 66f779a..3547020 100644
--- a/drivers/char/Makefile
+++ b/drivers/char/Makefile
@@ -97,6 +97,7 @@ obj-$(CONFIG_NSC_GPIO)		+= nsc_gpio.o
 obj-$(CONFIG_CS5535_GPIO)	+= cs5535_gpio.o
 obj-$(CONFIG_GPIO_TB0219)	+= tb0219.o
 obj-$(CONFIG_TELCLOCK)		+= tlclk.o
+obj-$(CONFIG_UMMUNOTIFY)	+= ummunotify.o
 
 obj-$(CONFIG_MWAVE)		+= mwave/
 obj-$(CONFIG_AGP)		+= agp/
diff --git a/drivers/char/ummunotify.c b/drivers/char/ummunotify.c
new file mode 100644
index 0000000..725fbb0
--- /dev/null
+++ b/drivers/char/ummunotify.c
@@ -0,0 +1,566 @@
+/*
+ * Copyright (c) 2009 Cisco Systems.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/fs.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/miscdevice.h>
+#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
+#include <linux/module.h>
+#include <linux/poll.h>
+#include <linux/rbtree.h>
+#include <linux/sched.h>
+#include <linux/spinlock.h>
+#include <linux/uaccess.h>
+#include <linux/ummunotify.h>
+
+#include <asm/cacheflush.h>
+
+MODULE_AUTHOR("Roland Dreier");
+MODULE_DESCRIPTION("Userspace MMU notifiers");
+MODULE_LICENSE("GPL v2");
+
+/*
+ * Information about an address range userspace has asked us to watch.
+ *
+ * user_cookie: Opaque cookie given to us when userspace registers the
+ *   address range.
+ *
+ * start, end: Address range; start is inclusive, end is exclusive.
+ *
+ * hint_start, hint_end: If a single MMU notification event
+ *   invalidates the address range, we hold the actual range of
+ *   addresses that were invalidated (and set UMMUNOTIFY_FLAG_HINT).
+ *   If another event hits this range before userspace reads the
+ *   event, we give up and don't try to keep track of which subsets
+ *   got invalidated.
+ *
+ * flags: Holds the INVALID flag for ranges that are on the invalid
+ *   list and/or the HINT flag for ranges where the hint range holds
+ *   good information.
+ *
+ * node: Used to put the range into an rbtree we use to be able to
+ *   scan address ranges in order.
+ *
+ * list: Used to put the range on the invalid list when an MMU
+ *   notification event hits the range.
+ */
+enum {
+	UMMUNOTIFY_FLAG_INVALID	= 1,
+	UMMUNOTIFY_FLAG_HINT	= 2,
+};
+
+struct ummunotify_reg {
+	u64			user_cookie;
+	unsigned long		start;
+	unsigned long		end;
+	unsigned long		hint_start;
+	unsigned long		hint_end;
+	unsigned long		flags;
+	struct rb_node		node;
+	struct list_head	list;
+};
+
+/*
+ * Context attached to each file that userspace opens.
+ *
+ * mmu_notifier: MMU notifier registered for this context.
+ *
+ * mm: mm_struct for process that created the context; we use this to
+ *   hold a reference to the mm to make sure it doesn't go away until
+ *   we're done with it.
+ *
+ * reg_tree: RB tree of address ranges being watched, sorted by start
+ *   address.
+ *
+ * invalid_list: List of address ranges that have been invalidated by
+ *   MMU notification events; as userspace reads events, the address
+ *   range corresponding to the event is removed from the list.
+ *
+ * counter: Page that can be mapped read-only by userspace, which
+ *   holds a generation count that is incremented each time an event
+ *   occurs.
+ *
+ * lock: Spinlock used to protect all context.
+ *
+ * read_wait: Wait queue used to wait for data to become available in
+ *   blocking read()s.
+ *
+ * async_queue: Used to implement fasync().
+ *
+ * need_empty: Set when userspace reads an invalidation event, so that
+ *   read() knows it must generate an "empty" event when userspace
+ *   drains the invalid_list.
+ *
+ * used: Set after userspace does anything with the file, so that the
+ *   "exchange flags" ioctl() knows it's too late to change anything.
+ */
+struct ummunotify_file {
+	struct mmu_notifier	mmu_notifier;
+	struct mm_struct       *mm;
+	struct rb_root		reg_tree;
+	struct list_head	invalid_list;
+	u64		       *counter;
+	spinlock_t		lock;
+	wait_queue_head_t	read_wait;
+	struct fasync_struct   *async_queue;
+	int			need_empty;
+	int			used;
+};
+
+static void ummunotify_handle_notify(struct mmu_notifier *mn,
+				     unsigned long start, unsigned long end)
+{
+	struct ummunotify_file *priv =
+		container_of(mn, struct ummunotify_file, mmu_notifier);
+	struct rb_node *n;
+	struct ummunotify_reg *reg;
+	unsigned long flags;
+	int hit = 0;
+
+	spin_lock_irqsave(&priv->lock, flags);
+
+	for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) {
+		reg = rb_entry(n, struct ummunotify_reg, node);
+
+		/*
+		 * Ranges overlap if they're not disjoint; and they're
+		 * disjoint if the end of one is before the start of
+		 * the other one.  So if both disjointness comparisons
+		 * fail then the ranges overlap.
+		 *
+		 * Since we keep the tree of regions we're watching
+		 * sorted by start address, we can end this loop as
+		 * soon as we hit a region that starts past the end of
+		 * the range for the event we're handling.
+		 */
+		if (reg->start >= end)
+			break;
+
+		/*
+		 * Just go to the next region if the start of the
+		 * range is after then end of the region -- there
+		 * might still be more overlapping ranges that have a
+		 * greater start.
+		 */
+		if (start >= reg->end)
+			continue;
+
+		hit = 1;
+
+		if (test_and_set_bit(UMMUNOTIFY_FLAG_INVALID, &reg->flags)) {
+			/* Already on invalid list */
+			clear_bit(UMMUNOTIFY_FLAG_HINT, &reg->flags);
+		} else {
+			list_add_tail(&reg->list, &priv->invalid_list);
+			set_bit(UMMUNOTIFY_FLAG_HINT, &reg->flags);
+			reg->hint_start = start;
+			reg->hint_end   = end;
+		}
+	}
+
+	if (hit) {
+		++(*priv->counter);
+		flush_dcache_page(virt_to_page(priv->counter));
+		wake_up_interruptible(&priv->read_wait);
+		kill_fasync(&priv->async_queue, SIGIO, POLL_IN);
+	}
+
+	spin_unlock_irqrestore(&priv->lock, flags);
+}
+
+static void ummunotify_invalidate_page(struct mmu_notifier *mn,
+				       struct mm_struct *mm,
+				       unsigned long addr)
+{
+	ummunotify_handle_notify(mn, addr, addr + PAGE_SIZE);
+}
+
+static void ummunotify_invalidate_range_start(struct mmu_notifier *mn,
+					      struct mm_struct *mm,
+					      unsigned long start,
+					      unsigned long end)
+{
+	ummunotify_handle_notify(mn, start, end);
+}
+
+static const struct mmu_notifier_ops ummunotify_mmu_notifier_ops = {
+	.invalidate_page	= ummunotify_invalidate_page,
+	.invalidate_range_start	= ummunotify_invalidate_range_start,
+};
+
+static int ummunotify_open(struct inode *inode, struct file *filp)
+{
+	struct ummunotify_file *priv;
+	int ret;
+
+	if (filp->f_mode & FMODE_WRITE)
+		return -EINVAL;
+
+	priv = kmalloc(sizeof *priv, GFP_KERNEL);
+	if (!priv)
+		return -ENOMEM;
+
+	priv->counter = (void *) get_zeroed_page(GFP_KERNEL);
+	if (!priv->counter) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	priv->reg_tree = RB_ROOT;
+	INIT_LIST_HEAD(&priv->invalid_list);
+	spin_lock_init(&priv->lock);
+	init_waitqueue_head(&priv->read_wait);
+	priv->async_queue = NULL;
+	priv->need_empty  = 0;
+	priv->used	  = 0;
+
+	priv->mmu_notifier.ops = &ummunotify_mmu_notifier_ops;
+	/*
+	 * Register notifier last, since notifications can occur as
+	 * soon as we register....
+	 */
+	ret = mmu_notifier_register(&priv->mmu_notifier, current->mm);
+	if (ret)
+		goto err_page;
+
+	priv->mm = current->mm;
+	atomic_inc(&priv->mm->mm_count);
+
+	filp->private_data = priv;
+
+	return 0;
+
+err_page:
+	free_page((unsigned long) priv->counter);
+
+err:
+	kfree(priv);
+	return ret;
+}
+
+static int ummunotify_close(struct inode *inode, struct file *filp)
+{
+	struct ummunotify_file *priv = filp->private_data;
+	struct rb_node *n;
+	struct ummunotify_reg *reg;
+
+	mmu_notifier_unregister(&priv->mmu_notifier, priv->mm);
+	mmdrop(priv->mm);
+	free_page((unsigned long) priv->counter);
+
+	for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) {
+		reg = rb_entry(n, struct ummunotify_reg, node);
+		kfree(reg);
+	}
+
+	kfree(priv);
+
+	return 0;
+}
+
+static bool ummunotify_readable(struct ummunotify_file *priv)
+{
+	return priv->need_empty || !list_empty(&priv->invalid_list);
+}
+
+static ssize_t ummunotify_read(struct file *filp, char __user *buf,
+			       size_t count, loff_t *pos)
+{
+	struct ummunotify_file *priv = filp->private_data;
+	struct ummunotify_reg *reg;
+	ssize_t ret;
+	struct ummunotify_event *events;
+	int max;
+	int n;
+
+	priv->used = 1;
+
+	events = (void *) get_zeroed_page(GFP_KERNEL);
+	if (!events) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	spin_lock_irq(&priv->lock);
+
+	while (!ummunotify_readable(priv)) {
+		spin_unlock_irq(&priv->lock);
+
+		if (filp->f_flags & O_NONBLOCK) {
+			ret = -EAGAIN;
+			goto out;
+		}
+
+		if (wait_event_interruptible(priv->read_wait,
+					     ummunotify_readable(priv))) {
+			ret = -ERESTARTSYS;
+			goto out;
+		}
+
+		spin_lock_irq(&priv->lock);
+	}
+
+	max = min_t(size_t, PAGE_SIZE, count) / sizeof *events;
+
+	for (n = 0; n < max; ++n) {
+		if (list_empty(&priv->invalid_list)) {
+			events[n].type = UMMUNOTIFY_EVENT_TYPE_LAST;
+			events[n].user_cookie_counter = *priv->counter;
+			++n;
+			priv->need_empty = 0;
+			break;
+		}
+
+		reg = list_first_entry(&priv->invalid_list,
+				       struct ummunotify_reg, list);
+
+		events[n].type = UMMUNOTIFY_EVENT_TYPE_INVAL;
+		if (test_bit(UMMUNOTIFY_FLAG_HINT, &reg->flags)) {
+			events[n].flags	     = UMMUNOTIFY_EVENT_FLAG_HINT;
+			events[n].hint_start = max(reg->start, reg->hint_start);
+			events[n].hint_end   = min(reg->end, reg->hint_end);
+		} else {
+			events[n].hint_start = reg->start;
+			events[n].hint_end   = reg->end;
+		}
+		events[n].user_cookie_counter = reg->user_cookie;
+
+		list_del(&reg->list);
+		reg->flags = 0;
+		priv->need_empty = 1;
+	}
+
+	spin_unlock_irq(&priv->lock);
+
+	if (copy_to_user(buf, events, n * sizeof *events))
+		ret = -EFAULT;
+	else
+		ret = n * sizeof *events;
+
+out:
+	free_page((unsigned long) events);
+	return ret;
+}
+
+static unsigned int ummunotify_poll(struct file *filp,
+				    struct poll_table_struct *wait)
+{
+	struct ummunotify_file *priv = filp->private_data;
+
+	poll_wait(filp, &priv->read_wait, wait);
+
+	return ummunotify_readable(priv) ? (POLLIN | POLLRDNORM) : 0;
+}
+
+static long ummunotify_exchange_features(struct ummunotify_file *priv,
+					 __u32 __user *arg)
+{
+	u32 feature_mask;
+
+	if (priv->used)
+		return -EINVAL;
+
+	priv->used = 1;
+
+	if (get_user(feature_mask, arg))
+		return -EFAULT;
+
+	/* No extensions defined at present. */
+	feature_mask = 0;
+
+	if (put_user(feature_mask, arg))
+		return -EFAULT;
+
+	return 0;
+}
+
+static long ummunotify_register_region(struct ummunotify_file *priv,
+				       void __user *arg)
+{
+	struct ummunotify_register_ioctl parm;
+	struct ummunotify_reg *reg, *treg;
+	struct rb_node **n = &priv->reg_tree.rb_node;
+	struct rb_node *pn;
+	int ret = 0;
+
+	if (copy_from_user(&parm, arg, sizeof parm))
+		return -EFAULT;
+
+	priv->used = 1;
+
+	reg = kmalloc(sizeof *reg, GFP_KERNEL);
+	if (!reg)
+		return -ENOMEM;
+
+	reg->user_cookie	= parm.user_cookie;
+	reg->start		= parm.start;
+	reg->end		= parm.end;
+	reg->flags		= 0;
+
+	spin_lock_irq(&priv->lock);
+
+	for (pn = rb_first(&priv->reg_tree); pn; pn = rb_next(pn)) {
+		treg = rb_entry(pn, struct ummunotify_reg, node);
+
+		if (treg->user_cookie == parm.user_cookie) {
+			kfree(reg);
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
+	pn = NULL;
+	while (*n) {
+		pn = *n;
+		treg = rb_entry(pn, struct ummunotify_reg, node);
+
+		if (reg->start <= treg->start)
+			n = &pn->rb_left;
+		else
+			n = &pn->rb_right;
+	}
+
+	rb_link_node(&reg->node, pn, n);
+	rb_insert_color(&reg->node, &priv->reg_tree);
+
+out:
+	spin_unlock_irq(&priv->lock);
+
+	return ret;
+}
+
+static long ummunotify_unregister_region(struct ummunotify_file *priv,
+					 __u64 __user *arg)
+{
+	u64 user_cookie;
+	struct rb_node *n;
+	struct ummunotify_reg *reg;
+	int ret = -EINVAL;
+
+	if (get_user(user_cookie, arg))
+		return -EFAULT;
+
+	spin_lock_irq(&priv->lock);
+
+	for (n = rb_first(&priv->reg_tree); n; n = rb_next(n)) {
+		reg = rb_entry(n, struct ummunotify_reg, node);
+
+		if (reg->user_cookie == user_cookie) {
+			rb_erase(n, &priv->reg_tree);
+			if (test_bit(UMMUNOTIFY_FLAG_INVALID, &reg->flags))
+				list_del(&reg->list);
+			kfree(reg);
+			ret = 0;
+			break;
+		}
+	}
+
+	spin_unlock_irq(&priv->lock);
+
+	return ret;
+}
+
+static long ummunotify_ioctl(struct file *filp, unsigned int cmd,
+			     unsigned long arg)
+{
+	struct ummunotify_file *priv = filp->private_data;
+	void __user *argp = (void __user *) arg;
+
+	switch (cmd) {
+	case UMMUNOTIFY_EXCHANGE_FEATURES:
+		return ummunotify_exchange_features(priv, argp);
+	case UMMUNOTIFY_REGISTER_REGION:
+		return ummunotify_register_region(priv, argp);
+	case UMMUNOTIFY_UNREGISTER_REGION:
+		return ummunotify_unregister_region(priv, argp);
+	default:
+		return -ENOIOCTLCMD;
+	}
+}
+
+static int ummunotify_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct ummunotify_file *priv = vma->vm_private_data;
+
+	if (vmf->pgoff != 0)
+		return VM_FAULT_SIGBUS;
+
+	vmf->page = virt_to_page(priv->counter);
+	get_page(vmf->page);
+
+	return 0;
+
+}
+
+static struct vm_operations_struct ummunotify_vm_ops = {
+	.fault		= ummunotify_fault,
+};
+
+static int ummunotify_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+	struct ummunotify_file *priv = filp->private_data;
+
+	if (vma->vm_end - vma->vm_start != PAGE_SIZE || vma->vm_pgoff != 0)
+		return -EINVAL;
+
+	vma->vm_ops		= &ummunotify_vm_ops;
+	vma->vm_private_data	= priv;
+
+	return 0;
+}
+
+static int ummunotify_fasync(int fd, struct file *filp, int on)
+{
+	struct ummunotify_file *priv = filp->private_data;
+
+	return fasync_helper(fd, filp, on, &priv->async_queue);
+}
+
+static const struct file_operations ummunotify_fops = {
+	.owner		= THIS_MODULE,
+	.open		= ummunotify_open,
+	.release	= ummunotify_close,
+	.read		= ummunotify_read,
+	.poll		= ummunotify_poll,
+	.unlocked_ioctl	= ummunotify_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= ummunotify_ioctl,
+#endif
+	.mmap		= ummunotify_mmap,
+	.fasync		= ummunotify_fasync,
+};
+
+static struct miscdevice ummunotify_misc = {
+	.minor	= MISC_DYNAMIC_MINOR,
+	.name	= "ummunotify",
+	.fops	= &ummunotify_fops,
+};
+
+static int __init ummunotify_init(void)
+{
+	return misc_register(&ummunotify_misc);
+}
+
+static void __exit ummunotify_cleanup(void)
+{
+	misc_deregister(&ummunotify_misc);
+}
+
+module_init(ummunotify_init);
+module_exit(ummunotify_cleanup);
diff --git a/include/linux/ummunotify.h b/include/linux/ummunotify.h
new file mode 100644
index 0000000..21b0d03
--- /dev/null
+++ b/include/linux/ummunotify.h
@@ -0,0 +1,121 @@
+/*
+ * Copyright (c) 2009 Cisco Systems.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef _LINUX_UMMUNOTIFY_H
+#define _LINUX_UMMUNOTIFY_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+/*
+ * Ummunotify relays MMU notifier events to userspace.  A userspace
+ * process uses it by opening /dev/ummunotify, which returns a file
+ * descriptor.  Interest in address ranges is registered using ioctl()
+ * and MMU notifier events are retrieved using read(), as described in
+ * more detail below.
+ *
+ * Userspace can also mmap() a single read-only page at offset 0 on
+ * this file descriptor.  This page contains (at offest 0) a single
+ * 64-bit generation counter that the kernel increments each time an
+ * MMU notifier event occurs.  Userspace can use this to very quickly
+ * check if there are any events to retrieve without needing to do a
+ * system call.
+ */
+
+/*
+ * struct ummunotify_register_ioctl describes an address range from
+ * start to end (including start but not including end) to be
+ * monitored.  user_cookie is an opaque handle that userspace assigns,
+ * and which is used to unregister.  flags and reserved are currently
+ * unused and should be set to 0 for forward compatibility.
+ */
+struct ummunotify_register_ioctl {
+	__u64	start;
+	__u64	end;
+	__u64	user_cookie;
+	__u32	flags;
+	__u32	reserved;
+};
+
+#define UMMUNOTIFY_MAGIC		'U'
+
+/*
+ * Forward compatibility: Userspace passes in a 32-bit feature mask
+ * with feature flags set indicating which extensions it wishes to
+ * use.  The kernel will return a feature mask with the bits of
+ * userspace's mask that the kernel implements; from that point on
+ * both userspace and the kernel should behave as described by the
+ * kernel's feature mask.
+ *
+ * If userspace does not perform a UMMUNOTIFY_EXCHANGE_FEATURES ioctl,
+ * then the kernel will use a feature mask of 0.
+ *
+ * No feature flags are currently defined, so the kernel will always
+ * return a feature mask of 0 at present.
+ */
+#define UMMUNOTIFY_EXCHANGE_FEATURES	_IOWR(UMMUNOTIFY_MAGIC, 1, __u32)
+
+/*
+ * Register interest in an address range; userspace should pass in a
+ * struct ummunotify_register_ioctl describing the region.
+ */
+#define UMMUNOTIFY_REGISTER_REGION	_IOW(UMMUNOTIFY_MAGIC, 2, \
+					     struct ummunotify_register_ioctl)
+/*
+ * Unregister interest in an address range; userspace should pass in
+ * the user_cookie value that was used to register the address range.
+ * No events for the address range will be reported once it is
+ * unregistered.
+ */
+#define UMMUNOTIFY_UNREGISTER_REGION	_IOW(UMMUNOTIFY_MAGIC, 3, __u64)
+
+/*
+ * Invalidation events are returned whenever the kernel changes the
+ * mapping for a monitored address.  These events are retrieved by
+ * read() on the ummunotify file descriptor, which will fill the
+ * read() buffer with struct ummunotify_event.
+ *
+ * If type field is INVAL, then user_cookie_counter holds the
+ * user_cookie for the region being reported; if the HINT flag is set
+ * then hint_start/hint_end hold the start and end of the mapping that
+ * was invalidated.  (If HINT is not set, then multiple events
+ * invalidated parts of the registered range and hint_start/hint_end
+ * and set to the start/end of the whole registered range)
+ *
+ * If type is LAST, then the read operation has emptied the list of
+ * invalidated regions, and user_cookie_counter holds the value of the
+ * kernel's generation counter when the empty list occurred.  The
+ * other fields are not filled in for this event.
+ */
+enum {
+	UMMUNOTIFY_EVENT_TYPE_INVAL	= 0,
+	UMMUNOTIFY_EVENT_TYPE_LAST	= 1,
+};
+
+enum {
+	UMMUNOTIFY_EVENT_FLAG_HINT	= 1 << 0,
+};
+
+struct ummunotify_event {
+	__u32	type;
+	__u32	flags;
+	__u64	hint_start;
+	__u64	hint_end;
+	__u64	user_cookie_counter;
+};
+
+#endif /* _LINUX_UMMUNOTIFY_H */
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v3] ummunotify: Userspace support for MMU notifications
  2009-07-31 18:54             ` [PATCH v3] " Roland Dreier
@ 2009-08-02 19:59               ` Brice Goglin
  2009-08-03  4:55                 ` Roland Dreier
  0 siblings, 1 reply; 17+ messages in thread
From: Brice Goglin @ 2009-08-02 19:59 UTC (permalink / raw)
  To: Roland Dreier; +Cc: Andrew Morton, linux-kernel, jsquyres, rostedt

Roland Dreier wrote:
> As discussed in <http://article.gmane.org/gmane.linux.drivers.openib/61925>
> and follow-up messages, libraries using RDMA would like to track
> precisely when application code changes memory mapping via free(),
> munmap(), etc.  Current pure-userspace solutions using malloc hooks
> and other tricks are not robust, and the feeling among experts is that
> the issue is unfixable without kernel help.
>
> We solve this not by implementing the full API proposed in the email
> linked above but rather with a simpler and more generic interface,
> which may be useful in other contexts.  Specifically, we implement a
> new character device driver, ummunotify, that creates a /dev/ummunotify
> node.  A userspace process can open this node read-only and use the fd
> as follows:
>
>  1. ioctl() to register/unregister an address range to watch in the
>     kernel (cf struct ummunotify_register_ioctl in <linux/ummunotify.h>).
>
>  2. read() to retrieve events generated when a mapping in a watched
>     address range is invalidated (cf struct ummunotify_event in
>     <linux/ummunotify.h>).  select()/poll()/epoll() and SIGIO are
>     handled for this IO.
>   

Hello Roland,

I like the interface but I have a couple questions:

1) Why does userspace have to register these address ranges? I would
have just reported all invalidation evens and let user-space check which
ones are interesting. My feeling is that the number of invalidation
events will usually be lower than the number registered ranges, so
you'll report more events through the file descriptor, but userspace
will do a lot less ioctls.

2) What happens in case of fork? If father+child keep reading from the
previously-open /dev/ummunotify, each event will be delivered only to
the first reader, right? Fork is always a mess in HPC, but maybe there's
something to do here.

3) What's userspace supposed to do if 2 libraries need such events in
the same process? Should each of them open /dev/ummunotify separately?
Doesn't matter much for performance, just wondering.

thanks,
Brice


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3] ummunotify: Userspace support for MMU notifications
  2009-08-02 19:59               ` Brice Goglin
@ 2009-08-03  4:55                 ` Roland Dreier
  2009-08-03  6:57                   ` Brice Goglin
  0 siblings, 1 reply; 17+ messages in thread
From: Roland Dreier @ 2009-08-03  4:55 UTC (permalink / raw)
  To: Brice Goglin; +Cc: Andrew Morton, linux-kernel, jsquyres, rostedt


 > I like the interface but I have a couple questions:

Thanks.

 > 1) Why does userspace have to register these address ranges? I would
 > have just reported all invalidation evens and let user-space check which
 > ones are interesting. My feeling is that the number of invalidation
 > events will usually be lower than the number registered ranges, so
 > you'll report more events through the file descriptor, but userspace
 > will do a lot less ioctls.

A couple of reasons.  First, MMU notifier events may be delivered (in
the kernel) in interrupt context so the amount of allocation we can do
in a notifier hook is limited (and any allocation will fail sometimes).
So if we just want to report all events to userspace then I don't see
any was around having to sometimes deliver an event like "uh, some
events got lost" and have userspace have to flush everything.

I suspect that MPI workloads will hit the overflow case in practice,
since they probably want to run as close to out-of-memory as possible,
and the application may not enter the MPI library often enough to keep
the queue of ummunotify events short -- I can imagine some codes that do
a lot of memory management, enter MPI infrequently, and end up
overflowing the queue and flushing all registrations over and over.
Having userspace register ranges means I can preallocate a landing area
for each event and make the MMU notifier hook pretty simple.

Second, it turns out that having the filter does cut down quite a bit on
the events.  From running some Open MPI tests that Jeff provided, I saw
that there were often several times as many MMU notifier events
delivered in the kernel than ended up being reported to userspace.

 > 2) What happens in case of fork? If father+child keep reading from the
 > previously-open /dev/ummunotify, each event will be delivered only to
 > the first reader, right? Fork is always a mess in HPC, but maybe there's
 > something to do here.

It works just like any other file where fork results in two file
descriptors in two processes... as you point out the two processes can
step on each other.  (And in the ummunotify case the file remains
associated with the original mm)  However this is the case for simpler
stuff like sockets etc too, and I think uniformity of interface and
least surprise say that ummunotify should follow the same model.

 > 3) What's userspace supposed to do if 2 libraries need such events in
 > the same process? Should each of them open /dev/ummunotify separately?
 > Doesn't matter much for performance, just wondering.

I guess the libraries could work out some way to share things, but that
would require one library to pass events to the other or something like
that.  It should work fine for 2 libraries to have independent
ummunotify files open though (I've not tested but "what could go wrong"?).

Thanks,
  Roland


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3] ummunotify: Userspace support for MMU notifications
  2009-08-03  4:55                 ` Roland Dreier
@ 2009-08-03  6:57                   ` Brice Goglin
  2009-08-04 17:14                     ` Roland Dreier
  0 siblings, 1 reply; 17+ messages in thread
From: Brice Goglin @ 2009-08-03  6:57 UTC (permalink / raw)
  To: Roland Dreier; +Cc: Andrew Morton, linux-kernel, jsquyres, rostedt

Roland Dreier wrote:
> I suspect that MPI workloads will hit the overflow case in practice,
> since they probably want to run as close to out-of-memory as possible,
> and the application may not enter the MPI library often enough to keep
> the queue of ummunotify events short -- I can imagine some codes that do
> a lot of memory management, enter MPI infrequently, and end up
> overflowing the queue and flushing all registrations over and over.
> Having userspace register ranges means I can preallocate a landing area
> for each event and make the MMU notifier hook pretty simple.
>   

Thanks, I see.

> Second, it turns out that having the filter does cut down quite a bit on
> the events.  From running some Open MPI tests that Jeff provided, I saw
> that there were often several times as many MMU notifier events
> delivered in the kernel than ended up being reported to userspace.
>   

So maybe multiple invalidate_page are gathered into the same range
event? If so, maybe it'd make sense to cache the last used rb_node in
ummunotify_handle_notify()? (and if multiple ranges were invalidated at
once, just don't cache anything, it shouldn't happen often anyway)

>  > 2) What happens in case of fork? If father+child keep reading from the
>  > previously-open /dev/ummunotify, each event will be delivered only to
>  > the first reader, right? Fork is always a mess in HPC, but maybe there's
>  > something to do here.
>
> It works just like any other file where fork results in two file
> descriptors in two processes... as you point out the two processes can
> step on each other.  (And in the ummunotify case the file remains
> associated with the original mm)  However this is the case for simpler
> stuff like sockets etc too, and I think uniformity of interface and
> least surprise say that ummunotify should follow the same model.
>   

I was wondering if adding a special event such as "COWED" could help
user-space. But maybe fork already invalidates all COW'ed ranges in
copy_page_range() anyway?

Brice


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3] ummunotify: Userspace support for MMU notifications
  2009-08-03  6:57                   ` Brice Goglin
@ 2009-08-04 17:14                     ` Roland Dreier
  0 siblings, 0 replies; 17+ messages in thread
From: Roland Dreier @ 2009-08-04 17:14 UTC (permalink / raw)
  To: Brice Goglin; +Cc: Andrew Morton, linux-kernel, jsquyres, rostedt


 > > Second, it turns out that having the filter does cut down quite a bit on
 > > the events.  From running some Open MPI tests that Jeff provided, I saw
 > > that there were often several times as many MMU notifier events
 > > delivered in the kernel than ended up being reported to userspace.

 > So maybe multiple invalidate_page are gathered into the same range
 > event? If so, maybe it'd make sense to cache the last used rb_node in
 > ummunotify_handle_notify()? (and if multiple ranges were invalidated at
 > once, just don't cache anything, it shouldn't happen often anyway)

Well, I just meant that there were lots of events for parts of the
address space that Open MPI wasn't interested in... the fortran runtime
or whatever was freeing stuff that wasn't ever used for communication.

 > >  > 2) What happens in case of fork? If father+child keep reading from the
 > >  > previously-open /dev/ummunotify, each event will be delivered only to
 > >  > the first reader, right? Fork is always a mess in HPC, but maybe there's
 > >  > something to do here.

 > > It works just like any other file where fork results in two file
 > > descriptors in two processes... as you point out the two processes can
 > > step on each other.  (And in the ummunotify case the file remains
 > > associated with the original mm)  However this is the case for simpler
 > > stuff like sockets etc too, and I think uniformity of interface and
 > > least surprise say that ummunotify should follow the same model.

 > I was wondering if adding a special event such as "COWED" could help
 > user-space. But maybe fork already invalidates all COW'ed ranges in
 > copy_page_range() anyway?

The problem I guess is that there is only one file object (pointed to by
two file descriptors of course) after the fork.  And it is tracking
changes to the parent's mapping.  I guess in the parent when touching
pages and triggering COW, that might be interesting -- but I don't
really know how to distinguish it from any other type of invalidate
event (and I don't know how userspace would do anything different
anyway).

I haven't actually looked at what fork() does to the MMU notifiers --
but the MMU notifiers hook in at such a low level that it does seem hard
to know that what's going on is related to fork or even COW.

 - R.

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2009-08-04 17:14 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-07-22 17:47 [PATCH/RFC] ummunot: Userspace support for MMU notifications Roland Dreier
2009-07-22 18:15 ` Andrew Morton
2009-07-22 19:27   ` Roland Dreier
2009-07-22 19:42     ` Andrew Morton
2009-07-23  2:26       ` Steven Rostedt
2009-07-23 20:21         ` Roland Dreier
2009-07-24  0:25           ` Steven Rostedt
2009-07-24 22:56       ` [PATCH v2] ummunotify: " Roland Dreier
2009-07-27 23:53         ` Andrew Morton
2009-07-28 16:14           ` Roland Dreier
2009-07-31 18:54             ` [PATCH v3] " Roland Dreier
2009-08-02 19:59               ` Brice Goglin
2009-08-03  4:55                 ` Roland Dreier
2009-08-03  6:57                   ` Brice Goglin
2009-08-04 17:14                     ` Roland Dreier
2009-07-23  9:04     ` [PATCH/RFC] ummunot: " Li Zefan
2009-07-23 20:28       ` Roland Dreier

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.