linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] UserfaultFD: Extension for non cooperative uffd usage
@ 2015-03-18 19:34 Pavel Emelyanov
  2015-03-18 19:34 ` [PATCH 1/3] uffd: Tossing bits around Pavel Emelyanov
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Pavel Emelyanov @ 2015-03-18 19:34 UTC (permalink / raw)
  To: Andrea Arcangeli, Linux Kernel Mailing List, Linux MM, Linux API
  Cc: Sanidhya Kashyap

Hi,

On the recent LSF Andrea presented his userfault-fd patches and
I had shown some issues that appear in usage scenarios when the
monitor task and mm task do not cooperate to each other on VM
changes (and fork()-s).

Here's the implementation of the extended uffd API that would help 
us to address those issues.

As proof of concept I've implemented only fix for fork() case, but
I also plan to add the mremap() and exit() notifications, both are
also required for such non-cooperative usage.

More details about the extension itself is in patch #2 and the fork()
notification description is in patch #3.

Comments and suggestion are warmly welcome :)


Andrea, what's the best way to go on with the patches -- would you
prefer to include them in your git tree or should I instead continue
with them on my own, re-sending them when required? Either way would
be OK for me.

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/3] uffd: Tossing bits around
  2015-03-18 19:34 [PATCH 0/3] UserfaultFD: Extension for non cooperative uffd usage Pavel Emelyanov
@ 2015-03-18 19:34 ` Pavel Emelyanov
  2015-03-18 19:35 ` [PATCH 2/3] uffd: Introduce the v2 API Pavel Emelyanov
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 10+ messages in thread
From: Pavel Emelyanov @ 2015-03-18 19:34 UTC (permalink / raw)
  To: Andrea Arcangeli, Linux Kernel Mailing List, Linux MM, Linux API
  Cc: Sanidhya Kashyap

Reformat the existing code a bit to make it easier for
further patching.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
---
 fs/userfaultfd.c | 37 ++++++++++++++++++++++++++++---------
 1 file changed, 28 insertions(+), 9 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index b4c7f25..6c9a2d6 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -61,6 +61,23 @@ struct userfaultfd_wake_range {
 	unsigned long len;
 };
 
+static const struct file_operations userfaultfd_fops;
+
+static struct userfaultfd_ctx *userfaultfd_ctx_alloc(void)
+{
+	struct userfaultfd_ctx *ctx;
+
+	ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
+	if (ctx) {
+		atomic_set(&ctx->refcount, 1);
+		init_waitqueue_head(&ctx->fault_wqh);
+		init_waitqueue_head(&ctx->fd_wqh);
+		ctx->released = false;
+	}
+
+	return ctx;
+}
+
 static int userfaultfd_wake_function(wait_queue_t *wq, unsigned mode,
 				     int wake_flags, void *key)
 {
@@ -307,15 +324,15 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
 	return 0;
 }
 
-static inline unsigned int find_userfault(struct userfaultfd_ctx *ctx,
+static inline unsigned int do_find_userfault(wait_queue_head_t *wqh,
 					  struct userfaultfd_wait_queue **uwq)
 {
 	wait_queue_t *wq;
 	struct userfaultfd_wait_queue *_uwq;
 	unsigned int ret = 0;
 
-	spin_lock(&ctx->fault_wqh.lock);
-	list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
+	spin_lock(&wqh->lock);
+	list_for_each_entry(wq, &wqh->task_list, task_list) {
 		_uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
 		if (_uwq->pending) {
 			ret = POLLIN;
@@ -324,11 +341,17 @@ static inline unsigned int find_userfault(struct userfaultfd_ctx *ctx,
 			break;
 		}
 	}
-	spin_unlock(&ctx->fault_wqh.lock);
+	spin_unlock(&wqh->lock);
 
 	return ret;
 }
 
+static inline unsigned int find_userfault(struct userfaultfd_ctx *ctx,
+		struct userfaultfd_wait_queue **uwq)
+{
+	return do_find_userfault(&ctx->fault_wqh, uwq);
+}
+
 static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
 {
 	struct userfaultfd_ctx *ctx = file->private_data;
@@ -1080,16 +1103,12 @@ static struct file *userfaultfd_file_create(int flags)
 		goto out;
 
 	file = ERR_PTR(-ENOMEM);
-	ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
+	ctx = userfaultfd_ctx_alloc();
 	if (!ctx)
 		goto out;
 
-	atomic_set(&ctx->refcount, 1);
-	init_waitqueue_head(&ctx->fault_wqh);
-	init_waitqueue_head(&ctx->fd_wqh);
 	ctx->flags = flags;
 	ctx->state = UFFD_STATE_WAIT_API;
-	ctx->released = false;
 	ctx->mm = current->mm;
 	/* prevent the mm struct to be freed */
 	atomic_inc(&ctx->mm->mm_count);
-- 
1.8.4.2



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/3] uffd: Introduce the v2 API
  2015-03-18 19:34 [PATCH 0/3] UserfaultFD: Extension for non cooperative uffd usage Pavel Emelyanov
  2015-03-18 19:34 ` [PATCH 1/3] uffd: Tossing bits around Pavel Emelyanov
@ 2015-03-18 19:35 ` Pavel Emelyanov
  2015-04-21 12:18   ` Andrea Arcangeli
  2015-03-18 19:35 ` [PATCH 3/3] uffd: Introduce fork() notification Pavel Emelyanov
  2015-04-21 12:02 ` [PATCH 0/3] UserfaultFD: Extension for non cooperative uffd usage Andrea Arcangeli
  3 siblings, 1 reply; 10+ messages in thread
From: Pavel Emelyanov @ 2015-03-18 19:35 UTC (permalink / raw)
  To: Andrea Arcangeli, Linux Kernel Mailing List, Linux MM, Linux API
  Cc: Sanidhya Kashyap

The new API will report more than just the page-faults. The
reason for this is -- when the task whose mm we monitor with 
uffd and the monitor task itself cannot cooperate with each
other, the former one can screw things up. Like this.

If task fork()-s the child process is detached from uffd and
thus all not-yet-faulted-in memory gets mapped with zero-pages
on touch.

Another example is mremap(). When the victim remaps the uffd-ed
region and starts touching it the monitor would receive fault
messages with addresses that were not register-ed with uffd
ioctl before it. Thus monitor will have no idea how to handle
those faults.

To address both we can send more events to the monitor. In
particular, on fork() we can create another uffd context,
register the same set of regions in it and "send" the descriptor
to monitor.

For mremap() we can send the message describing what change
has been performed.

So this patch prepares to ground for the described above feature
by introducing the v2 API of uffd. With new API the kernel would
respond with a message containing the event type (pagefault,
fork or remap) and argument (fault address, new uffd descriptor
or region change respectively).

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
---
 fs/userfaultfd.c                 | 56 ++++++++++++++++++++++++++++++----------
 include/uapi/linux/userfaultfd.h | 21 ++++++++++++++-
 2 files changed, 62 insertions(+), 15 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 6c9a2d6..bd629b4 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -41,6 +41,8 @@ struct userfaultfd_ctx {
 	wait_queue_head_t fd_wqh;
 	/* userfaultfd syscall flags */
 	unsigned int flags;
+	/* features flags */
+	unsigned int features;
 	/* state machine */
 	enum userfaultfd_state state;
 	/* released */
@@ -49,6 +51,8 @@ struct userfaultfd_ctx {
 	struct mm_struct *mm;
 };
 
+#define UFFD_FEATURE_LONGMSG	0x1
+
 struct userfaultfd_wait_queue {
 	unsigned long address;
 	wait_queue_t wq;
@@ -369,7 +373,7 @@ static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
 }
 
 static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
-				    __u64 *addr)
+				    __u64 *mtype, __u64 *addr)
 {
 	ssize_t ret;
 	DECLARE_WAITQUEUE(wait, current);
@@ -383,6 +387,7 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
 		if (find_userfault(ctx, &uwq)) {
 			uwq->pending = false;
 			/* careful to always initialize addr if ret == 0 */
+			*mtype = UFFD_PAGEFAULT;
 			*addr = uwq->address;
 			ret = 0;
 			break;
@@ -411,8 +416,6 @@ static ssize_t userfaultfd_read(struct file *file, char __user *buf,
 {
 	struct userfaultfd_ctx *ctx = file->private_data;
 	ssize_t _ret, ret = 0;
-	/* careful to always initialize addr if ret == 0 */
-	__u64 uninitialized_var(addr);
 	int no_wait = file->f_flags & O_NONBLOCK;
 
 	if (ctx->state == UFFD_STATE_WAIT_API)
@@ -420,16 +423,34 @@ static ssize_t userfaultfd_read(struct file *file, char __user *buf,
 	BUG_ON(ctx->state != UFFD_STATE_RUNNING);
 
 	for (;;) {
-		if (count < sizeof(addr))
-			return ret ? ret : -EINVAL;
-		_ret = userfaultfd_ctx_read(ctx, no_wait, &addr);
-		if (_ret < 0)
-			return ret ? ret : _ret;
-		if (put_user(addr, (__u64 __user *) buf))
-			return ret ? ret : -EFAULT;
-		ret += sizeof(addr);
-		buf += sizeof(addr);
-		count -= sizeof(addr);
+		if (!(ctx->features & UFFD_FEATURE_LONGMSG)) {
+			/* careful to always initialize addr if ret == 0 */
+			__u64 uninitialized_var(addr);
+			__u64 uninitialized_var(mtype);
+			if (count < sizeof(addr))
+				return ret ? ret : -EINVAL;
+			_ret = userfaultfd_ctx_read(ctx, no_wait, &mtype, &addr);
+			if (_ret < 0)
+				return ret ? ret : _ret;
+			BUG_ON(mtype != UFFD_PAGEFAULT);
+			if (put_user(addr, (__u64 __user *) buf))
+				return ret ? ret : -EFAULT;
+			_ret = sizeof(addr);
+		} else {
+			struct uffd_v2_msg msg;
+			if (count < sizeof(msg))
+				return ret ? ret : -EINVAL;
+			_ret = userfaultfd_ctx_read(ctx, no_wait, &msg.type, &msg.arg);
+			if (_ret < 0)
+				return ret ? ret : _ret;
+			if (copy_to_user(buf, &msg, sizeof(msg)))
+				return ret ? ret : -EINVAL;
+			_ret = sizeof(msg);
+		}
+
+		ret += _ret;
+		buf += _ret;
+		count -= _ret;
 		/*
 		 * Allow to read more than one fault at time but only
 		 * block if waiting for the very first one.
@@ -981,7 +1002,7 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
 	ret = -EFAULT;
 	if (copy_from_user(&uffdio_api, buf, sizeof(__u64)))
 		goto out;
-	if (uffdio_api.api != UFFD_API) {
+	if (uffdio_api.api != UFFD_API && uffdio_api.api != UFFD_API_V2) {
 		/* careful not to leak info, we only read the first 8 bytes */
 		memset(&uffdio_api, 0, sizeof(uffdio_api));
 		if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api)))
@@ -992,6 +1013,12 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
 	/* careful not to leak info, we only read the first 8 bytes */
 	uffdio_api.bits = UFFD_API_BITS;
 	uffdio_api.ioctls = UFFD_API_IOCTLS;
+
+	if (uffdio_api.api == UFFD_API_V2) {
+		ctx->features |= UFFD_FEATURE_LONGMSG;
+		uffdio_api.bits |= UFFD_API_V2_BITS;
+	}
+
 	ret = -EFAULT;
 	if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api)))
 		goto out;
@@ -1109,6 +1136,7 @@ static struct file *userfaultfd_file_create(int flags)
 
 	ctx->flags = flags;
 	ctx->state = UFFD_STATE_WAIT_API;
+	ctx->features = 0;
 	ctx->mm = current->mm;
 	/* prevent the mm struct to be freed */
 	atomic_inc(&ctx->mm->mm_count);
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index db6e99a..4e169b8 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -9,7 +9,9 @@
 #ifndef _LINUX_USERFAULTFD_H
 #define _LINUX_USERFAULTFD_H
 
-#define UFFD_API ((__u64)0xAA)
+#define UFFD_API 	((__u64)0xAA)
+#define UFFD_API_V2	((__u64)0xAB)
+
 /* FIXME: add "|UFFD_BIT_WP" to UFFD_API_BITS after implementing it */
 #define UFFD_API_BITS (UFFD_BIT_WRITE)
 #define UFFD_API_IOCTLS				\
@@ -147,4 +149,21 @@ struct uffdio_remap {
 	__s64 wake;
 };
 
+struct uffd_v2_msg {
+	__u64	type;
+	__u64	arg;
+};
+
+#define UFFD_PAGEFAULT	0x1
+
+#define UFFD_PAGEFAULT_BIT	(1 << (UFFD_PAGEFAULT - 1))
+#define __UFFD_API_V2_BITS	(UFFD_PAGEFAULT_BIT)
+
+/*
+ * Lower PAGE_SHIFT bits are used to report those supported
+ * by the pagefault message itself. Other bits are used to
+ * report the message types v2 API supports
+ */
+#define UFFD_API_V2_BITS	(__UFFD_API_V2_BITS << 12)
+
 #endif /* _LINUX_USERFAULTFD_H */
-- 
1.8.4.2



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 3/3] uffd: Introduce fork() notification
  2015-03-18 19:34 [PATCH 0/3] UserfaultFD: Extension for non cooperative uffd usage Pavel Emelyanov
  2015-03-18 19:34 ` [PATCH 1/3] uffd: Tossing bits around Pavel Emelyanov
  2015-03-18 19:35 ` [PATCH 2/3] uffd: Introduce the v2 API Pavel Emelyanov
@ 2015-03-18 19:35 ` Pavel Emelyanov
  2015-04-21 12:02 ` [PATCH 0/3] UserfaultFD: Extension for non cooperative uffd usage Andrea Arcangeli
  3 siblings, 0 replies; 10+ messages in thread
From: Pavel Emelyanov @ 2015-03-18 19:35 UTC (permalink / raw)
  To: Andrea Arcangeli, Linux Kernel Mailing List, Linux MM, Linux API
  Cc: Sanidhya Kashyap

As described in previous e-mail, we need to get informed when
the task, whose mm is monitored with uffd, calls fork().

The fork notification is the new uffd with the same regions
and flags as those on parent. When read()-ing from uffd the
monitor task would receive the new uffd's descriptor number
and will be able to start reading events from the new task.

The fork() of mm with uffd attached doesn't finish until the 
monitor "acks" the message by reading the new uffd descriptor.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
---
 fs/userfaultfd.c                 | 175 ++++++++++++++++++++++++++++++++++++++-
 include/linux/userfaultfd_k.h    |  12 +++
 include/uapi/linux/userfaultfd.h |   4 +-
 kernel/fork.c                    |   9 +-
 4 files changed, 194 insertions(+), 6 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index bd629b4..265f031 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -12,6 +12,7 @@
  *  mm/ksm.c (mm hashing).
  */
 
+#include <linux/list.h>
 #include <linux/hashtable.h>
 #include <linux/sched.h>
 #include <linux/mm.h>
@@ -37,6 +38,8 @@ struct userfaultfd_ctx {
 	atomic_t refcount;
 	/* waitqueue head for the userfaultfd page faults */
 	wait_queue_head_t fault_wqh;
+	/* waitqueue head for fork-s */
+	wait_queue_head_t fork_wqh;
 	/* waitqueue head for the pseudo fd to wakeup poll/read */
 	wait_queue_head_t fd_wqh;
 	/* userfaultfd syscall flags */
@@ -51,10 +54,21 @@ struct userfaultfd_ctx {
 	struct mm_struct *mm;
 };
 
+struct userfaultfd_fork_ctx {
+	struct userfaultfd_ctx *orig;
+	struct userfaultfd_ctx *new;
+	struct list_head list;
+};
+
 #define UFFD_FEATURE_LONGMSG	0x1
+#define UFFD_FEATURE_FORK	0x2
 
 struct userfaultfd_wait_queue {
-	unsigned long address;
+	union {
+		unsigned long address;
+		struct userfaultfd_ctx *nctx;
+		int fd;
+	};
 	wait_queue_t wq;
 	bool pending;
 	struct userfaultfd_ctx *ctx;
@@ -75,6 +89,7 @@ static struct userfaultfd_ctx *userfaultfd_ctx_alloc(void)
 	if (ctx) {
 		atomic_set(&ctx->refcount, 1);
 		init_waitqueue_head(&ctx->fault_wqh);
+		init_waitqueue_head(&ctx->fork_wqh);
 		init_waitqueue_head(&ctx->fd_wqh);
 		ctx->released = false;
 	}
@@ -270,6 +285,111 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 	return VM_FAULT_RETRY;
 }
 
+int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
+{
+	struct userfaultfd_ctx *ctx = NULL, *octx;
+	struct userfaultfd_fork_ctx *fctx;
+
+	octx = vma->vm_userfaultfd_ctx.ctx;
+	if (!octx || !(octx->features & UFFD_FEATURE_FORK)) {
+		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+		vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
+		return 0;
+	}
+
+	list_for_each_entry(fctx, fcs, list)
+		if (fctx->orig == octx) {
+			ctx = fctx->new;
+			break;
+		}
+
+	if (!ctx) {
+		fctx = kmalloc(sizeof(*fctx), GFP_KERNEL);
+		if (!fctx)
+			return -ENOMEM;
+
+		ctx = userfaultfd_ctx_alloc();
+		if (!ctx) {
+			kfree(fctx);
+			return -ENOMEM;
+		}
+
+		ctx->flags = octx->flags;
+		ctx->state = UFFD_STATE_RUNNING;
+		ctx->features = UFFD_FEATURE_FORK | UFFD_FEATURE_LONGMSG;
+		ctx->mm = vma->vm_mm;
+		atomic_inc(&ctx->mm->mm_count);
+
+		userfaultfd_ctx_get(octx);
+		fctx->orig = octx;
+		fctx->new = ctx;
+		list_add_tail(&fctx->list, fcs);
+	}
+
+	vma->vm_userfaultfd_ctx.ctx = ctx;
+	return 0;
+}
+
+static int dup_fctx(struct userfaultfd_fork_ctx *fctx)
+{
+	int ret = 0;
+	struct userfaultfd_ctx *ctx = fctx->orig;
+	struct userfaultfd_wait_queue uwq;
+
+	init_waitqueue_entry(&uwq.wq, current);
+	uwq.pending = true;
+	uwq.ctx = ctx;
+	uwq.nctx = fctx->new;
+
+	spin_lock(&ctx->fork_wqh.lock);
+	/*
+	 * After the __add_wait_queue the uwq is visible to userland
+	 * through poll/read().
+	 */
+	__add_wait_queue(&ctx->fork_wqh, &uwq.wq);
+	for (;;) {
+		set_current_state(TASK_KILLABLE);
+		if (!uwq.pending)
+			break;
+		if (ACCESS_ONCE(ctx->released) ||
+				fatal_signal_pending(current)) {
+			ret = -1;
+			break;
+		}
+
+		spin_unlock(&ctx->fork_wqh.lock);
+
+		wake_up_poll(&ctx->fd_wqh, POLLIN);
+		schedule();
+
+		spin_lock(&ctx->fork_wqh.lock);
+	}
+	__remove_wait_queue(&ctx->fork_wqh, &uwq.wq);
+	__set_current_state(TASK_RUNNING);
+	spin_unlock(&ctx->fork_wqh.lock);
+
+	/*
+	 * ctx may go away after this if the userfault pseudo fd is
+	 * already released.
+	 */
+	userfaultfd_ctx_put(ctx);
+
+	return ret;
+}
+
+void dup_userfaultfd_complete(struct list_head *fcs)
+{
+	int ret = 0;
+	struct userfaultfd_fork_ctx *fctx, *n;
+
+	list_for_each_entry_safe(fctx, n, fcs, list) {
+		if (!ret)
+			ret = dup_fctx(fctx);
+		list_del(&fctx->list);
+		kfree(fctx);
+	}
+}
+
 static int userfaultfd_release(struct inode *inode, struct file *file)
 {
 	struct userfaultfd_ctx *ctx = file->private_data;
@@ -356,6 +476,12 @@ static inline unsigned int find_userfault(struct userfaultfd_ctx *ctx,
 	return do_find_userfault(&ctx->fault_wqh, uwq);
 }
 
+static inline unsigned int find_userfault_fork(struct userfaultfd_ctx *ctx,
+		struct userfaultfd_wait_queue **uwq)
+{
+	return do_find_userfault(&ctx->fork_wqh, uwq);
+}
+
 static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
 {
 	struct userfaultfd_ctx *ctx = file->private_data;
@@ -366,12 +492,40 @@ static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
 	case UFFD_STATE_WAIT_API:
 		return POLLERR;
 	case UFFD_STATE_RUNNING:
-		return find_userfault(ctx, NULL);
+		return find_userfault(ctx, NULL) || find_userfault_fork(ctx, NULL);
 	default:
 		BUG();
 	}
 }
 
+static ssize_t resolve_userfault_fork(struct userfaultfd_ctx *ctx,
+		struct userfaultfd_wait_queue **uwq)
+{
+	struct userfaultfd_ctx *new;
+	int fd;
+	struct file *file;
+
+	if (!find_userfault_fork(ctx, uwq))
+		return 0;
+
+	new = (*uwq)->nctx;
+	fd = get_unused_fd_flags(new->flags & UFFD_SHARED_FCNTL_FLAGS);
+	if (fd < 0)
+		return fd;
+
+	file = anon_inode_getfile("[userfaultfd]", &userfaultfd_fops, new,
+				  O_RDWR | (new->flags & UFFD_SHARED_FCNTL_FLAGS));
+	if (IS_ERR(file)) {
+		put_unused_fd(fd);
+		return PTR_ERR(file);
+	}
+
+	fd_install(fd, file);
+	(*uwq)->fd = fd;
+
+	return 1;
+}
+
 static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
 				    __u64 *mtype, __u64 *addr)
 {
@@ -392,6 +546,21 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
 			ret = 0;
 			break;
 		}
+
+		ret = resolve_userfault_fork(ctx, &uwq);
+		if (ret < 0)
+			break;
+		if (ret > 0) {
+			*mtype = UFFD_FORK;
+			*addr = uwq->fd;
+
+			uwq->pending = false;
+			wake_up(&ctx->fork_wqh);
+
+			ret = 0;
+			break;
+		}
+
 		if (signal_pending(current)) {
 			ret = -ERESTARTSYS;
 			break;
@@ -1015,7 +1184,7 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
 	uffdio_api.ioctls = UFFD_API_IOCTLS;
 
 	if (uffdio_api.api == UFFD_API_V2) {
-		ctx->features |= UFFD_FEATURE_LONGMSG;
+		ctx->features |= UFFD_FEATURE_FORK | UFFD_FEATURE_LONGMSG;
 		uffdio_api.bits |= UFFD_API_V2_BITS;
 	}
 
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 81f0d11..44827f7 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -75,6 +75,9 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 	return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
 }
 
+extern int dup_userfaultfd(struct vm_area_struct *, struct list_head *);
+extern void dup_userfaultfd_complete(struct list_head *);
+
 #else /* CONFIG_USERFAULTFD */
 
 /* mm helpers */
@@ -107,6 +110,15 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 	return false;
 }
 
+static inline int dup_userfaultfd(struct vm_area_struct *, struct list_head *)
+{
+	return 0;
+}
+
+static inline void dup_userfaultfd_complete(struct list_head *)
+{
+}
+
 #endif /* CONFIG_USERFAULTFD */
 
 #endif /* _LINUX_USERFAULTFD_K_H */
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 4e169b8..f6cfea3 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -155,9 +155,11 @@ struct uffd_v2_msg {
 };
 
 #define UFFD_PAGEFAULT	0x1
+#define UFFD_FORK	0x2
 
 #define UFFD_PAGEFAULT_BIT	(1 << (UFFD_PAGEFAULT - 1))
-#define __UFFD_API_V2_BITS	(UFFD_PAGEFAULT_BIT)
+#define UFFD_FORK_BIT		(1 << (UFFD_FORK - 1))
+#define __UFFD_API_V2_BITS	(UFFD_PAGEFAULT_BIT | UFFD_FORK_BIT)
 
 /*
  * Lower PAGE_SHIFT bits are used to report those supported
diff --git a/kernel/fork.c b/kernel/fork.c
index cfab6e9..532882d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -55,6 +55,7 @@
 #include <linux/rmap.h>
 #include <linux/ksm.h>
 #include <linux/acct.h>
+#include <linux/userfaultfd_k.h>
 #include <linux/tsacct_kern.h>
 #include <linux/cn_proc.h>
 #include <linux/freezer.h>
@@ -370,6 +371,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 	struct rb_node **rb_link, *rb_parent;
 	int retval;
 	unsigned long charge;
+	LIST_HEAD(uf);
 
 	uprobe_start_dup_mmap();
 	down_write(&oldmm->mmap_sem);
@@ -421,11 +423,13 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 		if (retval)
 			goto fail_nomem_policy;
 		tmp->vm_mm = mm;
+		retval = dup_userfaultfd(tmp, &uf);
+		if (retval)
+			goto fail_nomem_anon_vma_fork;
 		if (anon_vma_fork(tmp, mpnt))
 			goto fail_nomem_anon_vma_fork;
-		tmp->vm_flags &= ~(VM_LOCKED|VM_UFFD_MISSING|VM_UFFD_WP);
+		tmp->vm_flags &= ~(VM_LOCKED);
 		tmp->vm_next = tmp->vm_prev = NULL;
-		tmp->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
 		file = tmp->vm_file;
 		if (file) {
 			struct inode *inode = file_inode(file);
@@ -481,6 +485,7 @@ out:
 	up_write(&mm->mmap_sem);
 	flush_tlb_mm(oldmm);
 	up_write(&oldmm->mmap_sem);
+	dup_userfaultfd_complete(&uf);
 	uprobe_end_dup_mmap();
 	return retval;
 fail_nomem_anon_vma_fork:
-- 
1.8.4.2



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 0/3] UserfaultFD: Extension for non cooperative uffd usage
  2015-03-18 19:34 [PATCH 0/3] UserfaultFD: Extension for non cooperative uffd usage Pavel Emelyanov
                   ` (2 preceding siblings ...)
  2015-03-18 19:35 ` [PATCH 3/3] uffd: Introduce fork() notification Pavel Emelyanov
@ 2015-04-21 12:02 ` Andrea Arcangeli
  2015-04-23  6:34   ` Pavel Emelyanov
  3 siblings, 1 reply; 10+ messages in thread
From: Andrea Arcangeli @ 2015-04-21 12:02 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Linux Kernel Mailing List, Linux MM, Linux API, Sanidhya Kashyap,
	Dr. David Alan Gilbert, Dave Hansen

Hi Pavel,

On Wed, Mar 18, 2015 at 10:34:26PM +0300, Pavel Emelyanov wrote:
> Hi,
> 
> On the recent LSF Andrea presented his userfault-fd patches and
> I had shown some issues that appear in usage scenarios when the
> monitor task and mm task do not cooperate to each other on VM
> changes (and fork()-s).
> 
> Here's the implementation of the extended uffd API that would help 
> us to address those issues.
> 
> As proof of concept I've implemented only fix for fork() case, but
> I also plan to add the mremap() and exit() notifications, both are
> also required for such non-cooperative usage.
> 
> More details about the extension itself is in patch #2 and the fork()
> notification description is in patch #3.
> 
> Comments and suggestion are warmly welcome :)

This looks feasible.

> Andrea, what's the best way to go on with the patches -- would you
> prefer to include them in your git tree or should I instead continue
> with them on my own, re-sending them when required? Either way would
> be OK for me.

Ok so various improvements happened in userfaultfd patchset over the
last month so your incremental patchset likely requires a rebase
sorry. When you posted it I was in the middle of the updates. Now
things are working stable and I have no pending updates, so it would
be a good time for a rebase.

I can merge it if you like, it's your call if you prefer to maintain
it incrementally or if I should merge it, but I wouldn't push it to
Andrew for upstream integration in the first batch, as this
complicates things further and it's not fully complete yet (at least
the version posted only handled fork). I think it can be merged
incrementally in a second stage.

The major updates of the userfaultfd patchset over the last month were:

1) Various mixed fixes thanks to the feedback from Dave Hansen and
   David Gilbert.

   The most notable one is the use of mm_users instead of mm_count to
   pin the mm to avoid crashes that assumed the vma still existed (in
   the userfaultfd_release method and in the various ioctl). exit_mmap
   doesn't even set mm->mmap to NULL, so unless I introduce a
   userfaultfd_exit to call in mmput, I have to pin the mm_users to be
   safe. This is mainly an issue for the non-cooperative usage you're
   implementing. Can you catch the exit somehow so you can close the
   fd? The memory won't be released until you do. Is this ok with you?
   I suppose you had to close the fd somehow anyway.

2) userfaults are waken immediately even if they're not been "read"
   yet, this can lead to POLLIN false positives (so I only allow poll
   if the fd is open in nonblocking mode to be sure it won't hang). Is
   it too paranoid to return POLLERR if the fd is not open in
   nonblocking mode?

	http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=f222d9de0a5302dc8ac62d6fab53a84251098751

3) optimize read to return entries in O(1) and poll which was already
   O(1) becomes lockless. This required to split the waitqueue in two,
   one for pending faults and one for non pending faults, and the
   faults are refiled across the two waitqueues when they're
   read. Both waitqueues are protected by a single lock to be simpler
   and faster at runtime (the fault_pending_wqh one).

	http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=9aa033ed43a1134c2223dac8c5d9e02e0100fca1

4) Allocate the ctx with kmem_cache_alloc. This is going to collide a
   bit with your cleanup patch sorry.

	http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=f5a8db16d2876eed8906a4d36f1d0e06ca5490f6

5) Originally qemu had two bitflags for each page and kept 3 states
   (of the 4 possible with two bits) for each page in order to deal
   with the races that can happen if one thread is reading the
   userfaults and another thread is calling the UFFDIO_COPY ioctl in
   the background. This patch solves all races in the kernel so the
   two bits per page can be dropped from qemu codebase. I started
   documenting the races that can materialize by using 2 threads
   (instead of running the workload single threaded with a single poll
   event loop), and how userland had to solve them until I decided it
   was simpler to fix the race in the kernel by running an ad-hoc
   pagetable walk inside the wait_event()-kind-of-section. This
   simplified qemu significantly and it doesn't make the kernel much
   more complicated.

   I tried this before in much older versions but I use gup_fast for
   it and it didn't work well with gup_fast for various reasons.

   http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=41efeae4e93f0296436f2a9fc6b28b6b0158512a

   After this patch the only reason to call UFFDIO_WAKE is to handle
   the userfaults in batches in combination with the DONT_WAKE flag of
   UFFDIO_COPY.

6) I removed the read recursion from mcopy_atomic. This avoids to
   depend on the write-starvation behavior of rwsem to be safe. After
   this change the rwsem is free to stop any further down_read if
   there's a down_write waiting on the lock.

   	  http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=b1e3a08acc9e3f6c2614e89fc3b8e338daa58e18

About other troubles for the non cooperative usage: MADV_DONTNEED
likely needs to be trapped too or how do you know that you shall map a
zero page instead of the old data at the faulting address?

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/3] uffd: Introduce the v2 API
  2015-03-18 19:35 ` [PATCH 2/3] uffd: Introduce the v2 API Pavel Emelyanov
@ 2015-04-21 12:18   ` Andrea Arcangeli
  2015-04-23  6:29     ` Pavel Emelyanov
  0 siblings, 1 reply; 10+ messages in thread
From: Andrea Arcangeli @ 2015-04-21 12:18 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Linux Kernel Mailing List, Linux MM, Linux API, Sanidhya Kashyap

On Wed, Mar 18, 2015 at 10:35:17PM +0300, Pavel Emelyanov wrote:
> +		if (!(ctx->features & UFFD_FEATURE_LONGMSG)) {

If we are to use different protocols, it'd be nicer to have two
different methods to assign to userfaultfd_fops.read that calls an
__always_inline function, so that the above check can be optimized
away at build time when the inline is expanded. So the branch is
converted to calling a different pointer to function which is zero
additional cost.

> +			/* careful to always initialize addr if ret == 0 */
> +			__u64 uninitialized_var(addr);
> +			__u64 uninitialized_var(mtype);
> +			if (count < sizeof(addr))
> +				return ret ? ret : -EINVAL;
> +			_ret = userfaultfd_ctx_read(ctx, no_wait, &mtype, &addr);
> +			if (_ret < 0)
> +				return ret ? ret : _ret;
> +			BUG_ON(mtype != UFFD_PAGEFAULT);
> +			if (put_user(addr, (__u64 __user *) buf))
> +				return ret ? ret : -EFAULT;
> +			_ret = sizeof(addr);
> +		} else {
> +			struct uffd_v2_msg msg;
> +			if (count < sizeof(msg))
> +				return ret ? ret : -EINVAL;
> +			_ret = userfaultfd_ctx_read(ctx, no_wait, &msg.type, &msg.arg);
> +			if (_ret < 0)
> +				return ret ? ret : _ret;
> +			if (copy_to_user(buf, &msg, sizeof(msg)))
> +				return ret ? ret : -EINVAL;
> +			_ret = sizeof(msg);

Reading 16bytes instead of 8bytes for each fault, probably wouldn't
move the needle much in terms of userfaultfd_read performance. Perhaps
we could consider using the uffd_v2_msg unconditionally and then have
a single protocol differentiated by the feature bits.

The only reason to have two different protocols would be to be able to
read 8 bytes per userfault, in the cooperative usage (i.e. qemu
postcopy). But if we do that we want to use the __always_inline trick
to avoid branches and additional runtime costs (otherwise we may as
well forget all microoptimizations and read 16bytes always).

> @@ -992,6 +1013,12 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
>  	/* careful not to leak info, we only read the first 8 bytes */
>  	uffdio_api.bits = UFFD_API_BITS;
>  	uffdio_api.ioctls = UFFD_API_IOCTLS;
> +
> +	if (uffdio_api.api == UFFD_API_V2) {
> +		ctx->features |= UFFD_FEATURE_LONGMSG;
> +		uffdio_api.bits |= UFFD_API_V2_BITS;
> +	}
> +
>  	ret = -EFAULT;
>  	if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api)))
>  		goto out;

The original meaning of the bits is:

If UFFD_BIT_WRITE was set in api.bits, it means the
!!(address&UFFD_BIT_WRITE) tells if it was a write fault (missing or
WP).

If UFFD_BIT_WP was set in api.bits, it means the
!!(address&UFFD_BIT_WP) tells if it was a WP fault (if not set it
means it was a missing fault).

Currently api.bits sets only UFFD_BIT_WRITE, and UFFD_BIT_WP will be
set later, after the WP tracking mode will be implemented.

I'm uncertain how bits translated to features and if they should be
unified or only have features.

> +struct uffd_v2_msg {
> +	__u64	type;
> +	__u64	arg;
> +};
> +
> +#define UFFD_PAGEFAULT	0x1
> +
> +#define UFFD_PAGEFAULT_BIT	(1 << (UFFD_PAGEFAULT - 1))
> +#define __UFFD_API_V2_BITS	(UFFD_PAGEFAULT_BIT)
> +
> +/*
> + * Lower PAGE_SHIFT bits are used to report those supported
> + * by the pagefault message itself. Other bits are used to
> + * report the message types v2 API supports
> + */
> +#define UFFD_API_V2_BITS	(__UFFD_API_V2_BITS << 12)
> +

And why exactly is this 12 hardcoded? And which field should be masked
with the bits? In the V1 protocol it was the "arg" (userfault address)
not the "type". So this is a bit confusing and probably requires
simplification.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/3] uffd: Introduce the v2 API
  2015-04-21 12:18   ` Andrea Arcangeli
@ 2015-04-23  6:29     ` Pavel Emelyanov
  2015-04-27 21:12       ` Andrea Arcangeli
  0 siblings, 1 reply; 10+ messages in thread
From: Pavel Emelyanov @ 2015-04-23  6:29 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linux Kernel Mailing List, Linux MM, Linux API, Sanidhya Kashyap

On 04/21/2015 03:18 PM, Andrea Arcangeli wrote:
> On Wed, Mar 18, 2015 at 10:35:17PM +0300, Pavel Emelyanov wrote:
>> +		if (!(ctx->features & UFFD_FEATURE_LONGMSG)) {
> 
> If we are to use different protocols, it'd be nicer to have two
> different methods to assign to userfaultfd_fops.read that calls an
> __always_inline function, so that the above check can be optimized
> away at build time when the inline is expanded. So the branch is
> converted to calling a different pointer to function which is zero
> additional cost.

OK :)

>> +			/* careful to always initialize addr if ret == 0 */
>> +			__u64 uninitialized_var(addr);
>> +			__u64 uninitialized_var(mtype);
>> +			if (count < sizeof(addr))
>> +				return ret ? ret : -EINVAL;
>> +			_ret = userfaultfd_ctx_read(ctx, no_wait, &mtype, &addr);
>> +			if (_ret < 0)
>> +				return ret ? ret : _ret;
>> +			BUG_ON(mtype != UFFD_PAGEFAULT);
>> +			if (put_user(addr, (__u64 __user *) buf))
>> +				return ret ? ret : -EFAULT;
>> +			_ret = sizeof(addr);
>> +		} else {
>> +			struct uffd_v2_msg msg;
>> +			if (count < sizeof(msg))
>> +				return ret ? ret : -EINVAL;
>> +			_ret = userfaultfd_ctx_read(ctx, no_wait, &msg.type, &msg.arg);
>> +			if (_ret < 0)
>> +				return ret ? ret : _ret;
>> +			if (copy_to_user(buf, &msg, sizeof(msg)))
>> +				return ret ? ret : -EINVAL;
>> +			_ret = sizeof(msg);
> 
> Reading 16bytes instead of 8bytes for each fault, probably wouldn't
> move the needle much in terms of userfaultfd_read performance. Perhaps
> we could consider using the uffd_v2_msg unconditionally and then have
> a single protocol differentiated by the feature bits.

So your proposal is to always report 16 bytes per PF from read() and
let userspace decide itself how to handle the result?

> The only reason to have two different protocols would be to be able to
> read 8 bytes per userfault, in the cooperative usage (i.e. qemu
> postcopy). But if we do that we want to use the __always_inline trick
> to avoid branches and additional runtime costs (otherwise we may as
> well forget all microoptimizations and read 16bytes always).
> 
>> @@ -992,6 +1013,12 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
>>  	/* careful not to leak info, we only read the first 8 bytes */
>>  	uffdio_api.bits = UFFD_API_BITS;
>>  	uffdio_api.ioctls = UFFD_API_IOCTLS;
>> +
>> +	if (uffdio_api.api == UFFD_API_V2) {
>> +		ctx->features |= UFFD_FEATURE_LONGMSG;
>> +		uffdio_api.bits |= UFFD_API_V2_BITS;
>> +	}
>> +
>>  	ret = -EFAULT;
>>  	if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api)))
>>  		goto out;
> 
> The original meaning of the bits is:
> 
> If UFFD_BIT_WRITE was set in api.bits, it means the
> !!(address&UFFD_BIT_WRITE) tells if it was a write fault (missing or
> WP).
> 
> If UFFD_BIT_WP was set in api.bits, it means the
> !!(address&UFFD_BIT_WP) tells if it was a WP fault (if not set it
> means it was a missing fault).
> 
> Currently api.bits sets only UFFD_BIT_WRITE, and UFFD_BIT_WP will be
> set later, after the WP tracking mode will be implemented.
> 
> I'm uncertain how bits translated to features and if they should be
> unified or only have features.
> 
>> +struct uffd_v2_msg {
>> +	__u64	type;
>> +	__u64	arg;
>> +};
>> +
>> +#define UFFD_PAGEFAULT	0x1
>> +
>> +#define UFFD_PAGEFAULT_BIT	(1 << (UFFD_PAGEFAULT - 1))
>> +#define __UFFD_API_V2_BITS	(UFFD_PAGEFAULT_BIT)
>> +
>> +/*
>> + * Lower PAGE_SHIFT bits are used to report those supported
>> + * by the pagefault message itself. Other bits are used to
>> + * report the message types v2 API supports
>> + */
>> +#define UFFD_API_V2_BITS	(__UFFD_API_V2_BITS << 12)
>> +
> 
> And why exactly is this 12 hardcoded?

Ah, it should have been the PAGE_SHIFT one, but I was unsure whether it
would be OK to have different shifts in different arches.

But taking into account your comment that bits field id bad for these
values, if we introduce the new .features one for api message, then this
12 will just go away.

> And which field should be masked
> with the bits? In the V1 protocol it was the "arg" (userfault address)
> not the "type". So this is a bit confusing and probably requires
> simplification.

I see. Actually I decided that since bits higher than 12th (for x86) is
always 0 in api message (no bits allowed there, since pfn sits in this
place), it would be OK to put non-PF bits there.

Should I better introduce another .features field in uffd API message?

-- Pavel


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 0/3] UserfaultFD: Extension for non cooperative uffd usage
  2015-04-21 12:02 ` [PATCH 0/3] UserfaultFD: Extension for non cooperative uffd usage Andrea Arcangeli
@ 2015-04-23  6:34   ` Pavel Emelyanov
  0 siblings, 0 replies; 10+ messages in thread
From: Pavel Emelyanov @ 2015-04-23  6:34 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linux Kernel Mailing List, Linux MM, Linux API, Sanidhya Kashyap,
	Dr. David Alan Gilbert, Dave Hansen

On 04/21/2015 03:02 PM, Andrea Arcangeli wrote:
> Hi Pavel,
> 
> On Wed, Mar 18, 2015 at 10:34:26PM +0300, Pavel Emelyanov wrote:
>> Hi,
>>
>> On the recent LSF Andrea presented his userfault-fd patches and
>> I had shown some issues that appear in usage scenarios when the
>> monitor task and mm task do not cooperate to each other on VM
>> changes (and fork()-s).
>>
>> Here's the implementation of the extended uffd API that would help 
>> us to address those issues.
>>
>> As proof of concept I've implemented only fix for fork() case, but
>> I also plan to add the mremap() and exit() notifications, both are
>> also required for such non-cooperative usage.
>>
>> More details about the extension itself is in patch #2 and the fork()
>> notification description is in patch #3.
>>
>> Comments and suggestion are warmly welcome :)
> 
> This looks feasible.
> 
>> Andrea, what's the best way to go on with the patches -- would you
>> prefer to include them in your git tree or should I instead continue
>> with them on my own, re-sending them when required? Either way would
>> be OK for me.
> 
> Ok so various improvements happened in userfaultfd patchset over the
> last month so your incremental patchset likely requires a rebase
> sorry. When you posted it I was in the middle of the updates. Now
> things are working stable and I have no pending updates, so it would
> be a good time for a rebase.

OK, thanks for the heads up! I will rebase my patches.

> I can merge it if you like, it's your call if you prefer to maintain
> it incrementally or if I should merge it, but I wouldn't push it to
> Andrew for upstream integration in the first batch, as this
> complicates things further and it's not fully complete yet (at least
> the version posted only handled fork). I think it can be merged
> incrementally in a second stage.

Sure!

> The major updates of the userfaultfd patchset over the last month were:
> 
> 1) Various mixed fixes thanks to the feedback from Dave Hansen and
>    David Gilbert.
> 
>    The most notable one is the use of mm_users instead of mm_count to
>    pin the mm to avoid crashes that assumed the vma still existed (in
>    the userfaultfd_release method and in the various ioctl). exit_mmap
>    doesn't even set mm->mmap to NULL, so unless I introduce a
>    userfaultfd_exit to call in mmput, I have to pin the mm_users to be
>    safe. This is mainly an issue for the non-cooperative usage you're
>    implementing. Can you catch the exit somehow so you can close the
>    fd? The memory won't be released until you do. Is this ok with you?
>    I suppose you had to close the fd somehow anyway.
> 
> 2) userfaults are waken immediately even if they're not been "read"
>    yet, this can lead to POLLIN false positives (so I only allow poll
>    if the fd is open in nonblocking mode to be sure it won't hang). Is
>    it too paranoid to return POLLERR if the fd is not open in
>    nonblocking mode?
> 
> 	http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=f222d9de0a5302dc8ac62d6fab53a84251098751
> 
> 3) optimize read to return entries in O(1) and poll which was already
>    O(1) becomes lockless. This required to split the waitqueue in two,
>    one for pending faults and one for non pending faults, and the
>    faults are refiled across the two waitqueues when they're
>    read. Both waitqueues are protected by a single lock to be simpler
>    and faster at runtime (the fault_pending_wqh one).
> 
> 	http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=9aa033ed43a1134c2223dac8c5d9e02e0100fca1
> 
> 4) Allocate the ctx with kmem_cache_alloc. This is going to collide a
>    bit with your cleanup patch sorry.
> 
> 	http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=f5a8db16d2876eed8906a4d36f1d0e06ca5490f6
> 
> 5) Originally qemu had two bitflags for each page and kept 3 states
>    (of the 4 possible with two bits) for each page in order to deal
>    with the races that can happen if one thread is reading the
>    userfaults and another thread is calling the UFFDIO_COPY ioctl in
>    the background. This patch solves all races in the kernel so the
>    two bits per page can be dropped from qemu codebase. I started
>    documenting the races that can materialize by using 2 threads
>    (instead of running the workload single threaded with a single poll
>    event loop), and how userland had to solve them until I decided it
>    was simpler to fix the race in the kernel by running an ad-hoc
>    pagetable walk inside the wait_event()-kind-of-section. This
>    simplified qemu significantly and it doesn't make the kernel much
>    more complicated.
> 
>    I tried this before in much older versions but I use gup_fast for
>    it and it didn't work well with gup_fast for various reasons.
> 
>    http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=41efeae4e93f0296436f2a9fc6b28b6b0158512a
> 
>    After this patch the only reason to call UFFDIO_WAKE is to handle
>    the userfaults in batches in combination with the DONT_WAKE flag of
>    UFFDIO_COPY.
> 
> 6) I removed the read recursion from mcopy_atomic. This avoids to
>    depend on the write-starvation behavior of rwsem to be safe. After
>    this change the rwsem is free to stop any further down_read if
>    there's a down_write waiting on the lock.
> 
>    	  http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=b1e3a08acc9e3f6c2614e89fc3b8e338daa58e18
> 
> About other troubles for the non cooperative usage: MADV_DONTNEED
> likely needs to be trapped too or how do you know that you shall map a
> zero page instead of the old data at the faulting address?
> 
> Thanks,
> Andrea
> .
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/3] uffd: Introduce the v2 API
  2015-04-23  6:29     ` Pavel Emelyanov
@ 2015-04-27 21:12       ` Andrea Arcangeli
  2015-04-30  9:50         ` Pavel Emelyanov
  0 siblings, 1 reply; 10+ messages in thread
From: Andrea Arcangeli @ 2015-04-27 21:12 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Linux Kernel Mailing List, Linux MM, Linux API, Sanidhya Kashyap

Hello,

On Thu, Apr 23, 2015 at 09:29:07AM +0300, Pavel Emelyanov wrote:
> So your proposal is to always report 16 bytes per PF from read() and
> let userspace decide itself how to handle the result?

Reading 16bytes for each userfault (instead of 8) and sharing the same
read(2) protocol (UFFD_API) for both the cooperative and
non-cooperative usages, is something I just suggested for
consideration after reading your patchset.

The pros of using a single protocol for both is that it would reduce
amount of code and there would be just one file operation for the
.read method. The cons is that it will waste 8 bytes per userfault in
terms of memory footprint. The other major cons is that it would force
us to define the format of the non cooperative protocol now despite it's
not fully finished yet.

I'm also ok with two protocols if nobody else objects, but if we use
two protocols, we should at least use different file operation methods
and use __always_inline with constants passed as parameter to optimize
away the branches at build time. This way we get the reduced memory
footprint in the read syscall without other runtime overhead
associated with it.

> >> +struct uffd_v2_msg {
> >> +	__u64	type;
> >> +	__u64	arg;
> >> +};
> >> +
> >> +#define UFFD_PAGEFAULT	0x1
> >> +
> >> +#define UFFD_PAGEFAULT_BIT	(1 << (UFFD_PAGEFAULT - 1))
> >> +#define __UFFD_API_V2_BITS	(UFFD_PAGEFAULT_BIT)
> >> +
> >> +/*
> >> + * Lower PAGE_SHIFT bits are used to report those supported
> >> + * by the pagefault message itself. Other bits are used to
> >> + * report the message types v2 API supports
> >> + */
> >> +#define UFFD_API_V2_BITS	(__UFFD_API_V2_BITS << 12)
> >> +
> > 
> > And why exactly is this 12 hardcoded?
> 
> Ah, it should have been the PAGE_SHIFT one, but I was unsure whether it
> would be OK to have different shifts in different arches.
> 
> But taking into account your comment that bits field id bad for these
> values, if we introduce the new .features one for api message, then this
> 12 will just go away.

Ok.

> > And which field should be masked
> > with the bits? In the V1 protocol it was the "arg" (userfault address)
> > not the "type". So this is a bit confusing and probably requires
> > simplification.
> 
> I see. Actually I decided that since bits higher than 12th (for x86) is
> always 0 in api message (no bits allowed there, since pfn sits in this
> place), it would be OK to put non-PF bits there.

That was ok yes.

> Should I better introduce another .features field in uffd API message?

What about renaming "uffdio_api.bits" to "uffdio_api.features"?

And then we set uffdio_api.features to
UFFD_FEATURE_WRITE|UFFD_FEATURE_WP|UFFD_FEATURE_FORK as needed.

UFFD_FEATURE_WRITE would always be enabled, it's there only in case we
want to disable it later (mostly if some arch has trouble with it,
which is unlikely, but qemu doesn't need that bit of information at
all for example so qemu would be fine if UFFD_FEATURE_WRITE
disappears).

UFFD_FEATURE_WP would signal also that the wrprotection feature (not
implemented yet) is available (then later the register ioctl would
also show the new wrprotection ioctl numbers available to mangle the
wrprotection). The UFFD_FEATURE_WP feature in the cooperative usage
(qemu live snapshotting) can use the UFFD_API first protocol too.

UFFD_FEATURE_FORK would be returned if the UFFD_API_V2 was set in
uffdio.api, and it would be part of the incremental non-cooperative
patchset.

We could also not define "UFFD_FEATURE_FORK" at all and imply that
fork/mremap/MADV_DONTNEED are all available if UFFD_API_V2 uffdio_api
ioctl succeeds... That's only doable if we keep two different read
protocols though. UFFD_FEATURE_FORK (or UFFD_FEATURE_NON_COOPERATIVE)
are really strictly needed only if we share the same read(2) protocol
for both the cooperative and non-cooperative usages.

The idea is that there's not huge benefit of only having the "fork"
feature supported but missing "mremap" and "madv_dontneed".

In fact if a new syscall that works like mremap is added later (call
it mremap2), we would need to fail the UFFDIO_API_V2 and require a
UFFDIO_API_V3 for such kernel that can return a new mremap2 type of
event. Userland couldn't just assume it is ok to use postcopy live
migration for containers, because
UFFD_FEATURE_FORK|MREMAP|MADV_DONTNEED are present in the
uffdio.features when it asked for API_V2. There shall be something
that tells userland "hey there's a new mremap2 that the software
inside the container can run on top of this kernel, so you are going
to get a new mremap2 type of userfault event too".

In any case, regardless of how we solve the above,
"uffdio_api.features" sounds better than ".bits".

If we retain two different UFFD_API, we'll be able to freeze the
current one and decide later if
UFFD_FEATURE_FORK|UFFD_FEATURE_MREMAP|UFFD_FEATURE_MADV_DONTNEED shall
be added to the .features, or if to rely on UFFD_API_V2 succeeding to
let userland know that the non-cooperative usage is fully supported by
the kernel.

Not having to freeze these details now is the main benefit of having
two different UFFD_API after all...

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/3] uffd: Introduce the v2 API
  2015-04-27 21:12       ` Andrea Arcangeli
@ 2015-04-30  9:50         ` Pavel Emelyanov
  0 siblings, 0 replies; 10+ messages in thread
From: Pavel Emelyanov @ 2015-04-30  9:50 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linux Kernel Mailing List, Linux MM, Linux API, Sanidhya Kashyap

On 04/28/2015 12:12 AM, Andrea Arcangeli wrote:
> Hello,
> 
> On Thu, Apr 23, 2015 at 09:29:07AM +0300, Pavel Emelyanov wrote:
>> So your proposal is to always report 16 bytes per PF from read() and
>> let userspace decide itself how to handle the result?
> 
> Reading 16bytes for each userfault (instead of 8) and sharing the same
> read(2) protocol (UFFD_API) for both the cooperative and
> non-cooperative usages, is something I just suggested for
> consideration after reading your patchset.
> 
> The pros of using a single protocol for both is that it would reduce
> amount of code and there would be just one file operation for the
> .read method. The cons is that it will waste 8 bytes per userfault in
> terms of memory footprint. The other major cons is that it would force
> us to define the format of the non cooperative protocol now despite it's
> not fully finished yet.
> 
> I'm also ok with two protocols if nobody else objects, but if we use
> two protocols, we should at least use different file operation methods
> and use __always_inline with constants passed as parameter to optimize
> away the branches at build time. This way we get the reduced memory
> footprint in the read syscall without other runtime overhead
> associated with it.

OK. I would go with two protocols then and will reshuffle the code to
use two ops.

>>>> +struct uffd_v2_msg {
>>>> +	__u64	type;
>>>> +	__u64	arg;
>>>> +};
>>>> +
>>>> +#define UFFD_PAGEFAULT	0x1
>>>> +
>>>> +#define UFFD_PAGEFAULT_BIT	(1 << (UFFD_PAGEFAULT - 1))
>>>> +#define __UFFD_API_V2_BITS	(UFFD_PAGEFAULT_BIT)
>>>> +
>>>> +/*
>>>> + * Lower PAGE_SHIFT bits are used to report those supported
>>>> + * by the pagefault message itself. Other bits are used to
>>>> + * report the message types v2 API supports
>>>> + */
>>>> +#define UFFD_API_V2_BITS	(__UFFD_API_V2_BITS << 12)
>>>> +
>>>
>>> And why exactly is this 12 hardcoded?
>>
>> Ah, it should have been the PAGE_SHIFT one, but I was unsure whether it
>> would be OK to have different shifts in different arches.
>>
>> But taking into account your comment that bits field id bad for these
>> values, if we introduce the new .features one for api message, then this
>> 12 will just go away.
> 
> Ok.
> 
>>> And which field should be masked
>>> with the bits? In the V1 protocol it was the "arg" (userfault address)
>>> not the "type". So this is a bit confusing and probably requires
>>> simplification.
>>
>> I see. Actually I decided that since bits higher than 12th (for x86) is
>> always 0 in api message (no bits allowed there, since pfn sits in this
>> place), it would be OK to put non-PF bits there.
> 
> That was ok yes.
> 
>> Should I better introduce another .features field in uffd API message?
> 
> What about renaming "uffdio_api.bits" to "uffdio_api.features"?

Yup, agreed, will do.

> And then we set uffdio_api.features to
> UFFD_FEATURE_WRITE|UFFD_FEATURE_WP|UFFD_FEATURE_FORK as needed.
> 
> UFFD_FEATURE_WRITE would always be enabled, it's there only in case we
> want to disable it later (mostly if some arch has trouble with it,
> which is unlikely, but qemu doesn't need that bit of information at
> all for example so qemu would be fine if UFFD_FEATURE_WRITE
> disappears).
> 
> UFFD_FEATURE_WP would signal also that the wrprotection feature (not
> implemented yet) is available (then later the register ioctl would
> also show the new wrprotection ioctl numbers available to mangle the
> wrprotection). The UFFD_FEATURE_WP feature in the cooperative usage
> (qemu live snapshotting) can use the UFFD_API first protocol too.
> 
> UFFD_FEATURE_FORK would be returned if the UFFD_API_V2 was set in
> uffdio.api, and it would be part of the incremental non-cooperative
> patchset.
> 
> We could also not define "UFFD_FEATURE_FORK" at all and imply that
> fork/mremap/MADV_DONTNEED are all available if UFFD_API_V2 uffdio_api
> ioctl succeeds... That's only doable if we keep two different read
> protocols though. UFFD_FEATURE_FORK (or UFFD_FEATURE_NON_COOPERATIVE)
> are really strictly needed only if we share the same read(2) protocol
> for both the cooperative and non-cooperative usages.
> 
> The idea is that there's not huge benefit of only having the "fork"
> feature supported but missing "mremap" and "madv_dontneed".
> 
> In fact if a new syscall that works like mremap is added later (call
> it mremap2), we would need to fail the UFFDIO_API_V2 and require a
> UFFDIO_API_V3 for such kernel that can return a new mremap2 type of
> event. Userland couldn't just assume it is ok to use postcopy live
> migration for containers, because
> UFFD_FEATURE_FORK|MREMAP|MADV_DONTNEED are present in the
> uffdio.features when it asked for API_V2. There shall be something
> that tells userland "hey there's a new mremap2 that the software
> inside the container can run on top of this kernel, so you are going
> to get a new mremap2 type of userfault event too".

But that's why I assumed to use per-sycall bits -- UFFD_FEATURE_FORK,
_MREMAP, _MWHATEVER so that userspace can read those bits and make sure
it contains only bits it understands with other bits set to zero.

If we had only one UFFD_API_NON_COOPERATIVE userspace would have no idea
what kind of messages it may receive.

> In any case, regardless of how we solve the above,
> "uffdio_api.features" sounds better than ".bits".
> 
> If we retain two different UFFD_API, we'll be able to freeze the
> current one and decide later if
> UFFD_FEATURE_FORK|UFFD_FEATURE_MREMAP|UFFD_FEATURE_MADV_DONTNEED shall
> be added to the .features, or if to rely on UFFD_API_V2 succeeding to
> let userland know that the non-cooperative usage is fully supported by
> the kernel.
> 
> Not having to freeze these details now is the main benefit of having
> two different UFFD_API after all...
> .
> 

-- Pavel

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2015-04-30  9:50 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-18 19:34 [PATCH 0/3] UserfaultFD: Extension for non cooperative uffd usage Pavel Emelyanov
2015-03-18 19:34 ` [PATCH 1/3] uffd: Tossing bits around Pavel Emelyanov
2015-03-18 19:35 ` [PATCH 2/3] uffd: Introduce the v2 API Pavel Emelyanov
2015-04-21 12:18   ` Andrea Arcangeli
2015-04-23  6:29     ` Pavel Emelyanov
2015-04-27 21:12       ` Andrea Arcangeli
2015-04-30  9:50         ` Pavel Emelyanov
2015-03-18 19:35 ` [PATCH 3/3] uffd: Introduce fork() notification Pavel Emelyanov
2015-04-21 12:02 ` [PATCH 0/3] UserfaultFD: Extension for non cooperative uffd usage Andrea Arcangeli
2015-04-23  6:34   ` Pavel Emelyanov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).