linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] userfaultfd: non-cooperative: syncronous events
@ 2018-02-27  8:19 Mike Rapoport
  2018-02-27  8:19 ` [PATCH 1/3] userfaultfd: introduce userfaultfd_init_waitqueue helper Mike Rapoport
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Mike Rapoport @ 2018-02-27  8:19 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Pavel Emelyanov, linux-mm, linux-api, lkml, crml, Mike Rapoport

Hi,

These patches add ability to generate userfaultfd events so that their
processing will be synchronized with the non-cooperative thread that caused
the event.

In the non-cooperative case userfaultfd resumes execution of the thread
that caused an event when the notification is read() by the uffd monitor.
In some cases, like, for example, madvise(MADV_REMOVE), it might be
desirable to keep the thread that caused the event suspended until the
uffd monitor had the event handled to avoid races between the thread that
caused the and userfaultfd ioctls.

Theses patches extend the userfaultfd API with an implementation of
UFFD_EVENT_REMOVE_SYNC that allows to keep the thread that triggered
UFFD_EVENT_REMOVE until the uffd monitor would not wake it explicitly.

Mike Rapoport (3):
  userfaultfd: introduce userfaultfd_init_waitqueue helper
  userfaultfd: non-cooperative: generalize wake key structure
  userfaultfd: non-cooperative: allow synchronous EVENT_REMOVE

 fs/userfaultfd.c                 | 191 +++++++++++++++++++++++++++++----------
 include/uapi/linux/userfaultfd.h |  14 +++
 2 files changed, 158 insertions(+), 47 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/3] userfaultfd: introduce userfaultfd_init_waitqueue helper
  2018-02-27  8:19 [PATCH 0/3] userfaultfd: non-cooperative: syncronous events Mike Rapoport
@ 2018-02-27  8:19 ` Mike Rapoport
  2018-02-27  8:19 ` [PATCH 2/3] userfaultfd: non-cooperative: generalize wake key structure Mike Rapoport
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 8+ messages in thread
From: Mike Rapoport @ 2018-02-27  8:19 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Pavel Emelyanov, linux-mm, linux-api, lkml, crml, Mike Rapoport

The helper can be used for initialization of wait queue entries for both
page-fault and non-cooperative events

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
---
 fs/userfaultfd.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index cec550c8468f..b32c7aaeca6b 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -131,6 +131,15 @@ static int userfaultfd_wake_function(wait_queue_entry_t *wq, unsigned mode,
 	return ret;
 }
 
+static inline void userfaultfd_init_waitqueue(struct userfaultfd_ctx *ctx,
+					      struct userfaultfd_wait_queue *uwq)
+{
+	init_waitqueue_func_entry(&uwq->wq, userfaultfd_wake_function);
+	uwq->wq.private = current;
+	uwq->ctx = ctx;
+	uwq->waken = false;
+}
+
 /**
  * userfaultfd_ctx_get - Acquires a reference to the internal userfaultfd
  * context.
@@ -444,12 +453,9 @@ int handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	/* take the reference before dropping the mmap_sem */
 	userfaultfd_ctx_get(ctx);
 
-	init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
-	uwq.wq.private = current;
+	userfaultfd_init_waitqueue(ctx, &uwq);
 	uwq.msg = userfault_msg(vmf->address, vmf->flags, reason,
 			ctx->features);
-	uwq.ctx = ctx;
-	uwq.waken = false;
 
 	return_to_userland =
 		(vmf->flags & (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE)) ==
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 2/3] userfaultfd: non-cooperative: generalize wake key structure
  2018-02-27  8:19 [PATCH 0/3] userfaultfd: non-cooperative: syncronous events Mike Rapoport
  2018-02-27  8:19 ` [PATCH 1/3] userfaultfd: introduce userfaultfd_init_waitqueue helper Mike Rapoport
@ 2018-02-27  8:19 ` Mike Rapoport
  2018-02-27  8:19 ` [PATCH 3/3] userfaultfd: non-cooperative: allow synchronous EVENT_REMOVE Mike Rapoport
  2018-03-02 23:38 ` [PATCH 0/3] userfaultfd: non-cooperative: syncronous events Andrew Morton
  3 siblings, 0 replies; 8+ messages in thread
From: Mike Rapoport @ 2018-02-27  8:19 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Pavel Emelyanov, linux-mm, linux-api, lkml, crml, Mike Rapoport

Upcoming support for synchronous non-page-fault events will require
userfaultfd_wake_function to be able to differentiate between the event
types. Depending on the event type, different parameters will define if the
wait queue element should be awaken. This requires more general structure
than userfaultfd_wake_range to be used as the "key" parameter for
userfaultfd_wake_function.
This patch introduces userfaultfd_wake_key that is used for waking up
threads waiting on page-fault and non-cooperative events.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
---
 fs/userfaultfd.c | 114 +++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 72 insertions(+), 42 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index b32c7aaeca6b..d9f74b389706 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -91,21 +91,44 @@ struct userfaultfd_wake_range {
 	unsigned long len;
 };
 
+struct userfaultfd_wake_key {
+	u8 event;
+	union {
+		struct userfaultfd_wake_range range;
+	} arg;
+};
+
+static bool userfaultfd_should_wake(struct userfaultfd_wait_queue *uwq,
+				    struct userfaultfd_wake_key *key)
+{
+	/* key->event == 0 means wake all */
+	if (key->event && key->event != uwq->msg.event)
+		return false;
+
+	if (key->event == UFFD_EVENT_PAGEFAULT) {
+		unsigned long start, len, address;
+
+		/* len == 0 means wake all threads waiting on page fault */
+		address = uwq->msg.arg.pagefault.address;
+		start = key->arg.range.start;
+		len = key->arg.range.len;
+		if (len && (start > address || start + len <= address))
+			return false;
+	}
+
+	return true;
+}
+
 static int userfaultfd_wake_function(wait_queue_entry_t *wq, unsigned mode,
-				     int wake_flags, void *key)
+				     int wake_flags, void *_key)
 {
-	struct userfaultfd_wake_range *range = key;
+	struct userfaultfd_wake_key *key = _key;
 	int ret;
 	struct userfaultfd_wait_queue *uwq;
-	unsigned long start, len;
 
 	uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
 	ret = 0;
-	/* len == 0 means wake all */
-	start = range->start;
-	len = range->len;
-	if (len && (start > uwq->msg.arg.pagefault.address ||
-		    start + len <= uwq->msg.arg.pagefault.address))
+	if (!userfaultfd_should_wake(uwq, key))
 		goto out;
 	WRITE_ONCE(uwq->waken, true);
 	/*
@@ -585,7 +608,7 @@ static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
 		goto out;
 
 	ewq->ctx = ctx;
-	init_waitqueue_entry(&ewq->wq, current);
+	userfaultfd_init_waitqueue(ctx, ewq);
 	release_new_ctx = NULL;
 
 	spin_lock(&ctx->event_wqh.lock);
@@ -596,7 +619,7 @@ static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
 	__add_wait_queue(&ctx->event_wqh, &ewq->wq);
 	for (;;) {
 		set_current_state(TASK_KILLABLE);
-		if (ewq->msg.event == 0)
+		if (READ_ONCE(ewq->waken))
 			break;
 		if (READ_ONCE(ctx->released) ||
 		    fatal_signal_pending(current)) {
@@ -653,9 +676,10 @@ static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
 static void userfaultfd_event_complete(struct userfaultfd_ctx *ctx,
 				       struct userfaultfd_wait_queue *ewq)
 {
-	ewq->msg.event = 0;
-	wake_up_locked(&ctx->event_wqh);
-	__remove_wait_queue(&ctx->event_wqh, &ewq->wq);
+	struct userfaultfd_wake_key key = { 0 };
+
+	key.event = ewq->msg.event;
+	 __wake_up_locked_key(&ctx->event_wqh, TASK_NORMAL, &key);
 }
 
 int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
@@ -854,8 +878,10 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
 	struct userfaultfd_ctx *ctx = file->private_data;
 	struct mm_struct *mm = ctx->mm;
 	struct vm_area_struct *vma, *prev;
-	/* len == 0 means wake all */
-	struct userfaultfd_wake_range range = { .len = 0, };
+	/* event == 0 means wake all */
+	struct userfaultfd_wake_key key = {
+		.event = 0,
+	};
 	unsigned long new_flags;
 
 	WRITE_ONCE(ctx->released, true);
@@ -903,12 +929,12 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
 	 * the fault_*wqh.
 	 */
 	spin_lock(&ctx->fault_pending_wqh.lock);
-	__wake_up_locked_key(&ctx->fault_pending_wqh, TASK_NORMAL, &range);
-	__wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, &range);
+	__wake_up_locked_key(&ctx->fault_pending_wqh, TASK_NORMAL, &key);
+	__wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, &key);
 	spin_unlock(&ctx->fault_pending_wqh.lock);
 
 	/* Flush pending events that may still wait on event_wqh */
-	wake_up_all(&ctx->event_wqh);
+	__wake_up(&ctx->event_wqh, TASK_NORMAL, 0, &key);
 
 	wake_up_poll(&ctx->fd_wqh, EPOLLHUP);
 	userfaultfd_ctx_put(ctx);
@@ -1201,20 +1227,20 @@ static ssize_t userfaultfd_read(struct file *file, char __user *buf,
 }
 
 static void __wake_userfault(struct userfaultfd_ctx *ctx,
-			     struct userfaultfd_wake_range *range)
+			     struct userfaultfd_wake_key *key)
 {
 	spin_lock(&ctx->fault_pending_wqh.lock);
 	/* wake all in the range and autoremove */
 	if (waitqueue_active(&ctx->fault_pending_wqh))
 		__wake_up_locked_key(&ctx->fault_pending_wqh, TASK_NORMAL,
-				     range);
+				     key);
 	if (waitqueue_active(&ctx->fault_wqh))
-		__wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, range);
+		__wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, key);
 	spin_unlock(&ctx->fault_pending_wqh.lock);
 }
 
 static __always_inline void wake_userfault(struct userfaultfd_ctx *ctx,
-					   struct userfaultfd_wake_range *range)
+					   struct userfaultfd_wake_key *key)
 {
 	unsigned seq;
 	bool need_wakeup;
@@ -1241,7 +1267,7 @@ static __always_inline void wake_userfault(struct userfaultfd_ctx *ctx,
 		cond_resched();
 	} while (read_seqcount_retry(&ctx->refile_seq, seq));
 	if (need_wakeup)
-		__wake_userfault(ctx, range);
+		__wake_userfault(ctx, key);
 }
 
 static __always_inline int validate_range(struct mm_struct *mm,
@@ -1567,10 +1593,11 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 			 * permanently and it avoids userland to call
 			 * UFFDIO_WAKE explicitly.
 			 */
-			struct userfaultfd_wake_range range;
-			range.start = start;
-			range.len = vma_end - start;
-			wake_userfault(vma->vm_userfaultfd_ctx.ctx, &range);
+			struct userfaultfd_wake_key key;
+			key.event = UFFD_EVENT_PAGEFAULT;
+			key.arg.range.start = start;
+			key.arg.range.len = vma_end - start;
+			wake_userfault(vma->vm_userfaultfd_ctx.ctx, &key);
 		}
 
 		new_flags = vma->vm_flags & ~(VM_UFFD_MISSING | VM_UFFD_WP);
@@ -1622,7 +1649,7 @@ static int userfaultfd_wake(struct userfaultfd_ctx *ctx,
 {
 	int ret;
 	struct uffdio_range uffdio_wake;
-	struct userfaultfd_wake_range range;
+	struct userfaultfd_wake_key key;
 	const void __user *buf = (void __user *)arg;
 
 	ret = -EFAULT;
@@ -1633,16 +1660,17 @@ static int userfaultfd_wake(struct userfaultfd_ctx *ctx,
 	if (ret)
 		goto out;
 
-	range.start = uffdio_wake.start;
-	range.len = uffdio_wake.len;
+	key.event = UFFD_EVENT_PAGEFAULT;
+	key.arg.range.start = uffdio_wake.start;
+	key.arg.range.len = uffdio_wake.len;
 
 	/*
 	 * len == 0 means wake all and we don't want to wake all here,
 	 * so check it again to be sure.
 	 */
-	VM_BUG_ON(!range.len);
+	VM_BUG_ON(!key.arg.range.len);
 
-	wake_userfault(ctx, &range);
+	wake_userfault(ctx, &key);
 	ret = 0;
 
 out:
@@ -1655,7 +1683,7 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
 	__s64 ret;
 	struct uffdio_copy uffdio_copy;
 	struct uffdio_copy __user *user_uffdio_copy;
-	struct userfaultfd_wake_range range;
+	struct userfaultfd_wake_key key;
 
 	user_uffdio_copy = (struct uffdio_copy __user *) arg;
 
@@ -1691,12 +1719,13 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
 		goto out;
 	BUG_ON(!ret);
 	/* len == 0 would wake all */
-	range.len = ret;
+	key.event = UFFD_EVENT_PAGEFAULT;
+	key.arg.range.len = ret;
 	if (!(uffdio_copy.mode & UFFDIO_COPY_MODE_DONTWAKE)) {
-		range.start = uffdio_copy.dst;
-		wake_userfault(ctx, &range);
+		key.arg.range.start = uffdio_copy.dst;
+		wake_userfault(ctx, &key);
 	}
-	ret = range.len == uffdio_copy.len ? 0 : -EAGAIN;
+	ret = key.arg.range.len == uffdio_copy.len ? 0 : -EAGAIN;
 out:
 	return ret;
 }
@@ -1707,7 +1736,7 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
 	__s64 ret;
 	struct uffdio_zeropage uffdio_zeropage;
 	struct uffdio_zeropage __user *user_uffdio_zeropage;
-	struct userfaultfd_wake_range range;
+	struct userfaultfd_wake_key key;
 
 	user_uffdio_zeropage = (struct uffdio_zeropage __user *) arg;
 
@@ -1738,12 +1767,13 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
 		goto out;
 	/* len == 0 would wake all */
 	BUG_ON(!ret);
-	range.len = ret;
+	key.event = UFFD_EVENT_PAGEFAULT;
+	key.arg.range.len = ret;
 	if (!(uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_DONTWAKE)) {
-		range.start = uffdio_zeropage.range.start;
-		wake_userfault(ctx, &range);
+		key.arg.range.start = uffdio_zeropage.range.start;
+		wake_userfault(ctx, &key);
 	}
-	ret = range.len == uffdio_zeropage.range.len ? 0 : -EAGAIN;
+	ret = key.arg.range.len == uffdio_zeropage.range.len ? 0 : -EAGAIN;
 out:
 	return ret;
 }
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 3/3] userfaultfd: non-cooperative: allow synchronous EVENT_REMOVE
  2018-02-27  8:19 [PATCH 0/3] userfaultfd: non-cooperative: syncronous events Mike Rapoport
  2018-02-27  8:19 ` [PATCH 1/3] userfaultfd: introduce userfaultfd_init_waitqueue helper Mike Rapoport
  2018-02-27  8:19 ` [PATCH 2/3] userfaultfd: non-cooperative: generalize wake key structure Mike Rapoport
@ 2018-02-27  8:19 ` Mike Rapoport
  2018-02-28  8:21   ` Pavel Emelyanov
  2018-03-02 23:38 ` [PATCH 0/3] userfaultfd: non-cooperative: syncronous events Andrew Morton
  3 siblings, 1 reply; 8+ messages in thread
From: Mike Rapoport @ 2018-02-27  8:19 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Pavel Emelyanov, linux-mm, linux-api, lkml, crml, Mike Rapoport

In non-cooperative case, userfaultfd monitor may encounter a race between
UFFDIO_COPY or UFFDIO_UNREGISTER and the processing of UFFD_EVENT_REMOVE.

Unlike the page faults that suspend the faulting thread until the page
fault is resolved, other events resume execution of the thread that caused
the event immediately after delivering the notification to the userfaultfd
monitor. The monitor may run UFFDIO_COPY in parallel with the event
processing and this may result in memory corruption.

Another race condition is caused if the faulting thread consequently calls
a system call causing UFFD_EVENT_REMOVE and munmap(). In this case, uffd
monitor will try to unregister the removed range as the response for
UFFD_EVENT_REMOVE, but the VMA linked to the uffd context might already be
gone because of munmap().

With UFFD_EVENT_REMOVE_SYNC introduced by this patch, it would be possible
to block the non-cooperative thread until the userfaultfd monitor will
explicitly wake it and thus allow uffd monitor proper processing of
UFFD_EVENT_REMOVE.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
---
 fs/userfaultfd.c                 | 65 ++++++++++++++++++++++++++++++++++++++--
 include/uapi/linux/userfaultfd.h | 14 +++++++++
 2 files changed, 77 insertions(+), 2 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index d9f74b389706..af813b3a3397 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -50,6 +50,8 @@ struct userfaultfd_ctx {
 	wait_queue_head_t fd_wqh;
 	/* waitqueue head for events */
 	wait_queue_head_t event_wqh;
+	/* waitqueue head for sync events */
+	wait_queue_head_t event_sync_wqh;
 	/* a refile sequence protected by fault_pending_wqh lock */
 	struct seqcount refile_seq;
 	/* pseudo fd refcounting */
@@ -116,6 +118,17 @@ static bool userfaultfd_should_wake(struct userfaultfd_wait_queue *uwq,
 			return false;
 	}
 
+	if (key->event == UFFD_EVENT_REMOVE_SYNC) {
+		unsigned long start, end;
+
+		start = key->arg.range.start;
+		end = start + key->arg.range.len;
+
+		if (start != uwq->msg.arg.remove.start ||
+		    end != uwq->msg.arg.remove.end)
+			return false;
+	}
+
 	return true;
 }
 
@@ -191,6 +204,8 @@ static void userfaultfd_ctx_put(struct userfaultfd_ctx *ctx)
 		VM_BUG_ON(waitqueue_active(&ctx->fault_wqh));
 		VM_BUG_ON(spin_is_locked(&ctx->event_wqh.lock));
 		VM_BUG_ON(waitqueue_active(&ctx->event_wqh));
+		VM_BUG_ON(spin_is_locked(&ctx->event_sync_wqh.lock));
+		VM_BUG_ON(waitqueue_active(&ctx->event_sync_wqh));
 		VM_BUG_ON(spin_is_locked(&ctx->fd_wqh.lock));
 		VM_BUG_ON(waitqueue_active(&ctx->fd_wqh));
 		mmdrop(ctx->mm);
@@ -676,7 +691,19 @@ static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
 static void userfaultfd_event_complete(struct userfaultfd_ctx *ctx,
 				       struct userfaultfd_wait_queue *ewq)
 {
-	struct userfaultfd_wake_key key = { 0 };
+	struct userfaultfd_wake_key key;
+
+	/*
+	 * For synchronous events we don't wake up the thread that
+	 * caused the event, but rather refile it onto
+	 * event_sync_wqh. The userfault monitor has to explicitly
+	 * wake it with ioctl(UFFDIO_WAKE_SYNC_EVENT)
+	 */
+	if (ewq->msg.event & UFFD_EVENT_FLAG_SYNC) {
+		list_del(&ewq->wq.entry);
+		__add_wait_queue(&ctx->event_sync_wqh, &ewq->wq);
+		return;
+	}
 
 	key.event = ewq->msg.event;
 	 __wake_up_locked_key(&ctx->event_wqh, TASK_NORMAL, &key);
@@ -798,7 +825,8 @@ bool userfaultfd_remove(struct vm_area_struct *vma,
 	struct userfaultfd_wait_queue ewq;
 
 	ctx = vma->vm_userfaultfd_ctx.ctx;
-	if (!ctx || !(ctx->features & UFFD_FEATURE_EVENT_REMOVE))
+	if (!ctx || !(ctx->features & UFFD_FEATURE_EVENT_REMOVE ||
+		      ctx->features & UFFD_FEATURE_EVENT_REMOVE_SYNC))
 		return true;
 
 	userfaultfd_ctx_get(ctx);
@@ -807,6 +835,9 @@ bool userfaultfd_remove(struct vm_area_struct *vma,
 	msg_init(&ewq.msg);
 
 	ewq.msg.event = UFFD_EVENT_REMOVE;
+	if (ctx->features & UFFD_FEATURE_EVENT_REMOVE_SYNC)
+		ewq.msg.event |= UFFD_EVENT_FLAG_SYNC;
+
 	ewq.msg.arg.remove.start = start;
 	ewq.msg.arg.remove.end = end;
 
@@ -935,6 +966,7 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
 
 	/* Flush pending events that may still wait on event_wqh */
 	__wake_up(&ctx->event_wqh, TASK_NORMAL, 0, &key);
+	__wake_up(&ctx->event_sync_wqh, TASK_NORMAL, 0, &key);
 
 	wake_up_poll(&ctx->fd_wqh, EPOLLHUP);
 	userfaultfd_ctx_put(ctx);
@@ -1677,6 +1709,31 @@ static int userfaultfd_wake(struct userfaultfd_ctx *ctx,
 	return ret;
 }
 
+static int userfaultfd_wake_sync_event(struct userfaultfd_ctx *ctx,
+				       unsigned long arg)
+{
+	struct uffd_msg uffd_msg;
+	struct userfaultfd_wake_key key;
+	const void __user *buf = (void __user *)arg;
+
+	if (copy_from_user(&uffd_msg, buf, sizeof(uffd_msg)))
+		return -EFAULT;
+
+	if (uffd_msg.event != UFFD_EVENT_REMOVE_SYNC)
+		return -EINVAL;
+
+	key.event = uffd_msg.event;
+	key.arg.range.start = uffd_msg.arg.remove.start;
+	key.arg.range.len = uffd_msg.arg.remove.end - uffd_msg.arg.remove.start;
+
+	spin_lock(&ctx->event_wqh.lock);
+	if (waitqueue_active(&ctx->event_sync_wqh))
+		__wake_up_locked_key(&ctx->event_sync_wqh, TASK_NORMAL, &key);
+	spin_unlock(&ctx->event_wqh.lock);
+
+	return 0;
+}
+
 static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
 			    unsigned long arg)
 {
@@ -1849,6 +1906,9 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
 	case UFFDIO_WAKE:
 		ret = userfaultfd_wake(ctx, arg);
 		break;
+	case UFFDIO_WAKE_SYNC_EVENT:
+		ret = userfaultfd_wake_sync_event(ctx, arg);
+		break;
 	case UFFDIO_COPY:
 		ret = userfaultfd_copy(ctx, arg);
 		break;
@@ -1909,6 +1969,7 @@ static void init_once_userfaultfd_ctx(void *mem)
 	init_waitqueue_head(&ctx->fault_pending_wqh);
 	init_waitqueue_head(&ctx->fault_wqh);
 	init_waitqueue_head(&ctx->event_wqh);
+	init_waitqueue_head(&ctx->event_sync_wqh);
 	init_waitqueue_head(&ctx->fd_wqh);
 	seqcount_init(&ctx->refile_seq);
 }
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 48f1a7c2f1f0..81e3e2e2eded 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -22,6 +22,7 @@
 #define UFFD_API_FEATURES (UFFD_FEATURE_EVENT_FORK |		\
 			   UFFD_FEATURE_EVENT_REMAP |		\
 			   UFFD_FEATURE_EVENT_REMOVE |	\
+			   UFFD_FEATURE_EVENT_REMOVE_SYNC |	\
 			   UFFD_FEATURE_EVENT_UNMAP |		\
 			   UFFD_FEATURE_MISSING_HUGETLBFS |	\
 			   UFFD_FEATURE_MISSING_SHMEM |		\
@@ -52,6 +53,7 @@
 #define _UFFDIO_WAKE			(0x02)
 #define _UFFDIO_COPY			(0x03)
 #define _UFFDIO_ZEROPAGE		(0x04)
+#define _UFFDIO_WAKE_SYNC_EVENT		(0x05)
 #define _UFFDIO_API			(0x3F)
 
 /* userfaultfd ioctl ids */
@@ -68,6 +70,8 @@
 				      struct uffdio_copy)
 #define UFFDIO_ZEROPAGE		_IOWR(UFFDIO, _UFFDIO_ZEROPAGE,	\
 				      struct uffdio_zeropage)
+#define UFFDIO_WAKE_SYNC_EVENT	_IOR(UFFDIO, _UFFDIO_WAKE_SYNC_EVENT, \
+				     struct uffd_msg)
 
 /* read() structure */
 struct uffd_msg {
@@ -119,6 +123,15 @@ struct uffd_msg {
 #define UFFD_EVENT_REMOVE	0x15
 #define UFFD_EVENT_UNMAP	0x16
 
+/*
+ * Events that are delivered synchronously. The causing thread is
+ * blocked until the event is handled by the userfault monitor. The
+ * monitor is responsible to explictly wake up the thread after
+ * processing the event.
+ */
+#define UFFD_EVENT_FLAG_SYNC	0x80
+#define UFFD_EVENT_REMOVE_SYNC	(UFFD_EVENT_REMOVE | UFFD_EVENT_FLAG_SYNC)
+
 /* flags for UFFD_EVENT_PAGEFAULT */
 #define UFFD_PAGEFAULT_FLAG_WRITE	(1<<0)	/* If this was a write fault */
 #define UFFD_PAGEFAULT_FLAG_WP		(1<<1)	/* If reason is VM_UFFD_WP */
@@ -176,6 +189,7 @@ struct uffdio_api {
 #define UFFD_FEATURE_EVENT_UNMAP		(1<<6)
 #define UFFD_FEATURE_SIGBUS			(1<<7)
 #define UFFD_FEATURE_THREAD_ID			(1<<8)
+#define UFFD_FEATURE_EVENT_REMOVE_SYNC		(1<<9)
 	__u64 features;
 
 	__u64 ioctls;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 3/3] userfaultfd: non-cooperative: allow synchronous EVENT_REMOVE
  2018-02-27  8:19 ` [PATCH 3/3] userfaultfd: non-cooperative: allow synchronous EVENT_REMOVE Mike Rapoport
@ 2018-02-28  8:21   ` Pavel Emelyanov
  2018-02-28  8:26     ` Mike Rapoport
  0 siblings, 1 reply; 8+ messages in thread
From: Pavel Emelyanov @ 2018-02-28  8:21 UTC (permalink / raw)
  To: Mike Rapoport, Andrew Morton, Andrea Arcangeli
  Cc: linux-mm, linux-api, lkml, crml


> @@ -52,6 +53,7 @@
>  #define _UFFDIO_WAKE			(0x02)
>  #define _UFFDIO_COPY			(0x03)
>  #define _UFFDIO_ZEROPAGE		(0x04)
> +#define _UFFDIO_WAKE_SYNC_EVENT		(0x05)

Excuse my ignorance, but what's the difference between UFFDIO_WAKE and UFFDIO_WAKE_SYNC_EVENT?

-- Pavel

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 3/3] userfaultfd: non-cooperative: allow synchronous EVENT_REMOVE
  2018-02-28  8:21   ` Pavel Emelyanov
@ 2018-02-28  8:26     ` Mike Rapoport
  0 siblings, 0 replies; 8+ messages in thread
From: Mike Rapoport @ 2018-02-28  8:26 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Andrew Morton, Andrea Arcangeli, linux-mm, linux-api, lkml, crml

On Wed, Feb 28, 2018 at 11:21:02AM +0300, Pavel Emelyanov wrote:
> 
> > @@ -52,6 +53,7 @@
> >  #define _UFFDIO_WAKE			(0x02)
> >  #define _UFFDIO_COPY			(0x03)
> >  #define _UFFDIO_ZEROPAGE		(0x04)
> > +#define _UFFDIO_WAKE_SYNC_EVENT		(0x05)
> 
> Excuse my ignorance, but what's the difference between UFFDIO_WAKE and UFFDIO_WAKE_SYNC_EVENT?

UFFDIO_WAKE is used when UFFDIO_COPY/UFFDIO_ZERO page are used with
UFFDIO_*_MODE_DONTWAKE flag set and it presumes 'struct uffdio_range'
argument to the ioctl(). Since waking up a non page fault event requires
different parameters I've add new ioctl to keep backwards compatibility.
 
> -- Pavel
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 0/3] userfaultfd: non-cooperative: syncronous events
  2018-02-27  8:19 [PATCH 0/3] userfaultfd: non-cooperative: syncronous events Mike Rapoport
                   ` (2 preceding siblings ...)
  2018-02-27  8:19 ` [PATCH 3/3] userfaultfd: non-cooperative: allow synchronous EVENT_REMOVE Mike Rapoport
@ 2018-03-02 23:38 ` Andrew Morton
  2018-03-03  9:09   ` Mike Rapoport
  3 siblings, 1 reply; 8+ messages in thread
From: Andrew Morton @ 2018-03-02 23:38 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrea Arcangeli, Pavel Emelyanov, linux-mm, linux-api, lkml, crml

On Tue, 27 Feb 2018 10:19:49 +0200 Mike Rapoport <rppt@linux.vnet.ibm.com> wrote:

> Hi,
> 
> These patches add ability to generate userfaultfd events so that their
> processing will be synchronized with the non-cooperative thread that caused
> the event.
> 
> In the non-cooperative case userfaultfd resumes execution of the thread
> that caused an event when the notification is read() by the uffd monitor.
> In some cases, like, for example, madvise(MADV_REMOVE), it might be
> desirable to keep the thread that caused the event suspended until the
> uffd monitor had the event handled to avoid races between the thread that
> caused the and userfaultfd ioctls.
> 
> Theses patches extend the userfaultfd API with an implementation of
> UFFD_EVENT_REMOVE_SYNC that allows to keep the thread that triggered
> UFFD_EVENT_REMOVE until the uffd monitor would not wake it explicitly.

"might be desirable" is a bit weak.  It might not be desirable, too ;)

_Is_ it desirable?  What are the use-cases and what is the end-user
benefit?

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 0/3] userfaultfd: non-cooperative: syncronous events
  2018-03-02 23:38 ` [PATCH 0/3] userfaultfd: non-cooperative: syncronous events Andrew Morton
@ 2018-03-03  9:09   ` Mike Rapoport
  0 siblings, 0 replies; 8+ messages in thread
From: Mike Rapoport @ 2018-03-03  9:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Pavel Emelyanov, linux-mm, linux-api, lkml, crml

On Fri, Mar 02, 2018 at 03:38:49PM -0800, Andrew Morton wrote:
> On Tue, 27 Feb 2018 10:19:49 +0200 Mike Rapoport <rppt@linux.vnet.ibm.com> wrote:
> 
> > Hi,
> > 
> > These patches add ability to generate userfaultfd events so that their
> > processing will be synchronized with the non-cooperative thread that caused
> > the event.
> > 
> > In the non-cooperative case userfaultfd resumes execution of the thread
> > that caused an event when the notification is read() by the uffd monitor.
> > In some cases, like, for example, madvise(MADV_REMOVE), it might be
> > desirable to keep the thread that caused the event suspended until the
> > uffd monitor had the event handled to avoid races between the thread that
> > caused the and userfaultfd ioctls.
> > 
> > Theses patches extend the userfaultfd API with an implementation of
> > UFFD_EVENT_REMOVE_SYNC that allows to keep the thread that triggered
> > UFFD_EVENT_REMOVE until the uffd monitor would not wake it explicitly.
> 
> "might be desirable" is a bit weak.  It might not be desirable, too ;)
> 
> _Is_ it desirable?  What are the use-cases and what is the end-user
> benefit?

It _is_ desirable :)
With asynchronous UFFD_EVENT_REMOVE, the faulting thread continues before
the uffd monitor had chance to process the event and the memory accesses or
layout modifications of faulting thread race with the monitor processing of
the UFFD_EVENT_REMOVE.  Moreover, for multithreaded uffd monitor there
could be also uffdio_{copy,zeropage} in flight that will also race with
those memory accesses.

I have elaborate description of the patch 3 in this series.

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2018-03-03  9:09 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-27  8:19 [PATCH 0/3] userfaultfd: non-cooperative: syncronous events Mike Rapoport
2018-02-27  8:19 ` [PATCH 1/3] userfaultfd: introduce userfaultfd_init_waitqueue helper Mike Rapoport
2018-02-27  8:19 ` [PATCH 2/3] userfaultfd: non-cooperative: generalize wake key structure Mike Rapoport
2018-02-27  8:19 ` [PATCH 3/3] userfaultfd: non-cooperative: allow synchronous EVENT_REMOVE Mike Rapoport
2018-02-28  8:21   ` Pavel Emelyanov
2018-02-28  8:26     ` Mike Rapoport
2018-03-02 23:38 ` [PATCH 0/3] userfaultfd: non-cooperative: syncronous events Andrew Morton
2018-03-03  9:09   ` Mike Rapoport

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).