From: Nadav Amit
To: linux-fsdevel@vger.kernel.org
Cc: Nadav Amit, Jens Axboe, Andrea Arcangeli, Peter Xu, Alexander Viro,
	io-uring@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [RFC PATCH 08/13] fs/userfaultfd: complete reads asynchronously
Date: Sat, 28 Nov 2020 16:45:43 -0800
Message-Id: <20201129004548.1619714-9-namit@vmware.com>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20201129004548.1619714-1-namit@vmware.com>
References: <20201129004548.1619714-1-namit@vmware.com>

From: Nadav Amit

Complete userfaultfd reads asynchronously so that io_uring reads can
complete asynchronously. Reads, which report page-faults and events, can
only be completed asynchronously if they are performed into a kernel
buffer, which guarantees that no page-fault can occur while completing
the read. Otherwise, we would need to handle nested page-faults or do
expensive pinning/unpinning of the pages into which the read is
performed.

Userfaultfd holds in its context the kiocb and iov_iter that will be used
for the next asynchronous read (this can later be extended into a list
that holds more than a single enqueued read). If such a buffer is
available when a fault occurs, the fault is reported to the user and the
fault is added to the fault workqueue instead of the pending-fault
workqueue.

A race between synchronous and asynchronous reads must be prevented, so
reads first use buffers that were previously enqueued, and only afterwards
consume pending faults and events. For this purpose a new "notification"
lock is introduced, which is held while enqueuing new events and pending
faults and during event reads. It may be possible to use fd_wqh.lock
instead, but having a separate lock for this purpose seems cleaner.
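As an illustration only (not part of this patch): the sketch below shows
how userspace could consume one message through io_uring once reads can
complete asynchronously. It assumes liburing, a userfaultfd opened with
O_NONBLOCK and already set up via UFFDIO_API/UFFDIO_REGISTER, and a
registered ("fixed") buffer so that the read is backed by the kernel
(bvec) iterator that userfaultfd_enqueue() requires. Error handling is
abbreviated and the helper name read_one_uffd_msg() is made up for the
example.

/* Illustrative sketch only -- not part of the patch. */
#include <liburing.h>
#include <linux/userfaultfd.h>
#include <sys/uio.h>
#include <errno.h>
#include <stdio.h>

/* Submit one fixed-buffer read on @uffd and wait for its completion. */
static int read_one_uffd_msg(int uffd)
{
	struct io_uring ring;
	struct uffd_msg msg;
	struct iovec iov = { .iov_base = &msg, .iov_len = sizeof(msg) };
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	int ret;

	ret = io_uring_queue_init(8, &ring, 0);
	if (ret < 0)
		return ret;

	/* Register the buffer so the read uses a fixed (bvec-backed) buffer. */
	ret = io_uring_register_buffers(&ring, &iov, 1);
	if (ret < 0)
		goto out;

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read_fixed(sqe, uffd, &msg, sizeof(msg), 0, 0);
	io_uring_submit(&ring);

	/* The completion arrives once a fault or an event is reported. */
	ret = io_uring_wait_cqe(&ring, &cqe);
	if (!ret) {
		if (cqe->res == sizeof(msg)) {
			if (msg.event == UFFD_EVENT_PAGEFAULT)
				printf("page fault at 0x%llx\n",
				       (unsigned long long)msg.arg.pagefault.address);
		} else {
			ret = cqe->res < 0 ? cqe->res : -EIO;
		}
		io_uring_cqe_seen(&ring, cqe);
	}
out:
	io_uring_queue_exit(&ring);
	return ret;
}

A plain synchronous read() into user memory keeps its existing behavior;
only nowait reads issued through io_uring with kernel-backed buffers take
the new -EIOCBQUEUED path.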
Cc: Jens Axboe
Cc: Andrea Arcangeli
Cc: Peter Xu
Cc: Alexander Viro
Cc: io-uring@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: Nadav Amit
---
 fs/userfaultfd.c | 265 +++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 235 insertions(+), 30 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 6333b4632742..db1a963f6ae2 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -44,9 +44,10 @@ enum userfaultfd_state {
  *
  * Locking order:
  *	fd_wqh.lock
- *	fault_pending_wqh.lock
- *	fault_wqh.lock
- *	event_wqh.lock
+ *	notification_lock
+ *	fault_pending_wqh.lock
+ *	fault_wqh.lock
+ *	event_wqh.lock
  *
  * To avoid deadlocks, IRQs must be disabled when taking any of the above locks,
  * since fd_wqh.lock is taken by aio_poll() while it's holding a lock that's
@@ -79,6 +80,16 @@ struct userfaultfd_ctx {
 	struct mm_struct *mm;
 	/* controlling process files as they might be different than current */
 	struct files_struct *files;
+	/*
+	 * lock for sync and async userfaultfd reads, which must be held when
+	 * enqueueing into fault_pending_wqh or event_wqh, upon userfaultfd
+	 * reads and on accesses of iocb_callback and to.
+	 */
+	spinlock_t notification_lock;
+	/* kiocb struct that is used for the next asynchronous read */
+	struct kiocb *iocb_callback;
+	/* the iterator that is used for the next asynchronous read */
+	struct iov_iter to;
 };
 
 struct userfaultfd_fork_ctx {
@@ -356,6 +367,53 @@ static inline long userfaultfd_get_blocking_state(unsigned int flags)
 	return TASK_UNINTERRUPTIBLE;
 }
 
+static bool userfaultfd_get_async_complete_locked(struct userfaultfd_ctx *ctx,
+		struct kiocb **iocb, struct iov_iter *iter)
+{
+	if (!ctx->released)
+		lockdep_assert_held(&ctx->notification_lock);
+
+	if (ctx->iocb_callback == NULL)
+		return false;
+
+	*iocb = ctx->iocb_callback;
+	*iter = ctx->to;
+
+	ctx->iocb_callback = NULL;
+	ctx->to.kvec = NULL;
+	return true;
+}
+
+static bool userfaultfd_get_async_complete(struct userfaultfd_ctx *ctx,
+		struct kiocb **iocb, struct iov_iter *iter)
+{
+	bool r;
+
+	spin_lock_irq(&ctx->notification_lock);
+	r = userfaultfd_get_async_complete_locked(ctx, iocb, iter);
+	spin_unlock_irq(&ctx->notification_lock);
+	return r;
+}
+
+static void userfaultfd_copy_async_msg(struct kiocb *iocb,
+				       struct iov_iter *iter,
+				       struct uffd_msg *msg,
+				       int ret)
+{
+
+	const struct kvec *kvec = iter->kvec;
+
+	if (ret == 0)
+		ret = copy_to_iter(msg, sizeof(*msg), iter);
+
+	/* Should never fail as we guarantee that we use a kernel buffer */
+	WARN_ON_ONCE(ret != sizeof(*msg));
+	iocb->ki_complete(iocb, ret, 0);
+
+	kfree(kvec);
+	iter->kvec = NULL;
+}
+
 /*
  * The locking rules involved in returning VM_FAULT_RETRY depending on
  * FAULT_FLAG_ALLOW_RETRY, FAULT_FLAG_RETRY_NOWAIT and
@@ -380,6 +438,10 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	bool must_wait;
 	long blocking_state;
 	bool poll;
+	bool async = false;
+	struct kiocb *iocb;
+	struct iov_iter iter;
+	wait_queue_head_t *wqh;
 
 	/*
 	 * We don't do userfault handling for the final child pid update.
@@ -489,12 +551,29 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 
 	blocking_state = userfaultfd_get_blocking_state(vmf->flags);
 
-	spin_lock_irq(&ctx->fault_pending_wqh.lock);
+	/*
+	 * Take the notification lock to protect against concurrent reads, to
+	 * avoid a scenario in which a fault/event is queued while a read
+	 * concurrently returns -EIOCBQUEUED.
+	 */
+	spin_lock_irq(&ctx->notification_lock);
+	async = userfaultfd_get_async_complete_locked(ctx, &iocb, &iter);
+	wqh = &ctx->fault_pending_wqh;
+
+	if (async)
+		wqh = &ctx->fault_wqh;
+
 	/*
 	 * After the __add_wait_queue the uwq is visible to userland
 	 * through poll/read().
 	 */
-	__add_wait_queue(&ctx->fault_pending_wqh, &uwq.wq);
+	spin_lock(&wqh->lock);
+
+	__add_wait_queue(wqh, &uwq.wq);
+
+	/* Ensure it is queued before userspace is informed. */
+	smp_wmb();
+
 	/*
 	 * The smp_mb() after __set_current_state prevents the reads
 	 * following the spin_unlock to happen before the list_add in
@@ -504,7 +583,15 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	if (!poll)
 		set_current_state(blocking_state);
 
-	spin_unlock_irq(&ctx->fault_pending_wqh.lock);
+	spin_unlock(&wqh->lock);
+	spin_unlock_irq(&ctx->notification_lock);
+
+	/*
+	 * Do the copy after the lock is relinquished to avoid circular lock
+	 * dependencies.
+	 */
+	if (async)
+		userfaultfd_copy_async_msg(iocb, &iter, &uwq.msg, 0);
 
 	if (!is_vm_hugetlb_page(vmf->vma))
 		must_wait = userfaultfd_must_wait(ctx, vmf->address, vmf->flags,
@@ -516,7 +603,9 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 		mmap_read_unlock(mm);
 
 	if (likely(must_wait && !READ_ONCE(ctx->released))) {
-		wake_up_poll(&ctx->fd_wqh, EPOLLIN);
+		if (!async)
+			wake_up_poll(&ctx->fd_wqh, EPOLLIN);
+
 		if (poll) {
 			while (!READ_ONCE(uwq.waken) && !READ_ONCE(ctx->released) &&
 			       !signal_pending(current)) {
@@ -544,13 +633,21 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	 * kernel stack can be released after the list_del_init.
 	 */
 	if (!list_empty_careful(&uwq.wq.entry)) {
-		spin_lock_irq(&ctx->fault_pending_wqh.lock);
+		local_irq_disable();
+		if (!async)
+			spin_lock(&ctx->fault_pending_wqh.lock);
+		spin_lock(&ctx->fault_wqh.lock);
+
 		/*
 		 * No need of list_del_init(), the uwq on the stack
 		 * will be freed shortly anyway.
 		 */
 		list_del(&uwq.wq.entry);
-		spin_unlock_irq(&ctx->fault_pending_wqh.lock);
+
+		spin_unlock(&ctx->fault_wqh.lock);
+		if (!async)
+			spin_unlock(&ctx->fault_pending_wqh.lock);
+		local_irq_enable();
 	}
 
 	/*
@@ -563,10 +660,17 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	return ret;
 }
 
+
+static int resolve_userfault_fork(struct userfaultfd_ctx *ctx,
+				  struct userfaultfd_ctx *new,
+				  struct uffd_msg *msg);
+
 static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
 					      struct userfaultfd_wait_queue *ewq)
 {
 	struct userfaultfd_ctx *release_new_ctx;
+	struct iov_iter iter;
+	struct kiocb *iocb;
 
 	if (WARN_ON_ONCE(current->flags & PF_EXITING))
 		goto out;
@@ -575,12 +679,42 @@ static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
 	init_waitqueue_entry(&ewq->wq, current);
 	release_new_ctx = NULL;
 
-	spin_lock_irq(&ctx->event_wqh.lock);
+retry:
+	spin_lock_irq(&ctx->notification_lock);
+
 	/*
-	 * After the __add_wait_queue the uwq is visible to userland
-	 * through poll/read().
+	 * Submit asynchronously when needed, and release the notification lock
+	 * as soon as the event is either queued on the work queue or an entry
+	 * is taken.
+	 */
+	if (userfaultfd_get_async_complete_locked(ctx, &iocb, &iter)) {
+		int r = 0;
+
+		spin_unlock_irq(&ctx->notification_lock);
+		if (ewq->msg.event == UFFD_EVENT_FORK) {
+			struct userfaultfd_ctx *new =
+				(struct userfaultfd_ctx *)(unsigned long)
+				ewq->msg.arg.reserved.reserved1;
+
+			r = resolve_userfault_fork(ctx, new, &ewq->msg);
+		}
+		userfaultfd_copy_async_msg(iocb, &iter, &ewq->msg, r);
+
+		if (r != 0)
+			goto retry;
+
+		goto out;
+	}
+
+	spin_lock(&ctx->event_wqh.lock);
+	/*
+	 * After the __add_wait_queue or the call to ki_complete the uwq is
+	 * visible to userland through poll/read().
 	 */
 	__add_wait_queue(&ctx->event_wqh, &ewq->wq);
+
+	spin_unlock(&ctx->notification_lock);
+
 	for (;;) {
 		set_current_state(TASK_KILLABLE);
 		if (ewq->msg.event == 0)
@@ -683,6 +817,7 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
 		ctx->features = octx->features;
 		ctx->released = false;
 		ctx->mmap_changing = false;
+		ctx->iocb_callback = NULL;
 		ctx->mm = vma->vm_mm;
 		mmgrab(ctx->mm);
 
@@ -854,6 +989,15 @@ void userfaultfd_unmap_complete(struct mm_struct *mm, struct list_head *uf)
 	}
 }
 
+static void userfaultfd_cancel_async_reads(struct userfaultfd_ctx *ctx)
+{
+	struct iov_iter iter;
+	struct kiocb *iocb;
+
+	while (userfaultfd_get_async_complete(ctx, &iocb, &iter))
+		userfaultfd_copy_async_msg(iocb, &iter, NULL, -EBADF);
+}
+
 static int userfaultfd_release(struct inode *inode, struct file *file)
 {
 	struct userfaultfd_ctx *ctx = file->private_data;
@@ -912,6 +1056,8 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
 	__wake_up(&ctx->fault_wqh, TASK_NORMAL, 1, &range);
 	spin_unlock_irq(&ctx->fault_pending_wqh.lock);
 
+	userfaultfd_cancel_async_reads(ctx);
+
 	/* Flush pending events that may still wait on event_wqh */
 	wake_up_all(&ctx->event_wqh);
 
@@ -1032,8 +1178,39 @@ static int resolve_userfault_fork(struct userfaultfd_ctx *ctx,
 	return 0;
 }
 
-static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
-				    struct uffd_msg *msg)
+static ssize_t userfaultfd_enqueue(struct kiocb *iocb,
+				   struct userfaultfd_ctx *ctx,
+				   struct iov_iter *to)
+{
+	lockdep_assert_irqs_disabled();
+
+	if (!to)
+		return -EAGAIN;
+
+	if (is_sync_kiocb(iocb) ||
+	    (!iov_iter_is_bvec(to) && !iov_iter_is_kvec(to)))
+		return -EAGAIN;
+
+	/* Check again if there are pending events */
+	if (waitqueue_active(&ctx->fault_pending_wqh) ||
+	    waitqueue_active(&ctx->event_wqh))
+		return -EAGAIN;
+
+	/*
+	 * Check that there is no other callback already registered, as
+	 * we only support one at the moment.
+	 */
+	if (ctx->iocb_callback)
+		return -EAGAIN;
+
+	ctx->iocb_callback = iocb;
+	ctx->to = *to;
+	return -EIOCBQUEUED;
+}
+
+static ssize_t userfaultfd_ctx_read(struct kiocb *iocb,
+				    struct userfaultfd_ctx *ctx, int no_wait,
+				    struct uffd_msg *msg, struct iov_iter *to)
 {
 	ssize_t ret;
 	DECLARE_WAITQUEUE(wait, current);
@@ -1051,6 +1228,7 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
 	/* always take the fd_wqh lock before the fault_pending_wqh lock */
 	spin_lock_irq(&ctx->fd_wqh.lock);
 	__add_wait_queue(&ctx->fd_wqh, &wait);
+	spin_lock(&ctx->notification_lock);
 	for (;;) {
 		set_current_state(TASK_INTERRUPTIBLE);
 		spin_lock(&ctx->fault_pending_wqh.lock);
@@ -1122,21 +1300,23 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
 			ret = 0;
 		}
 		spin_unlock(&ctx->event_wqh.lock);
-		if (!ret)
-			break;
 
-		if (signal_pending(current)) {
+		if (ret == -EAGAIN && signal_pending(current))
 			ret = -ERESTARTSYS;
+
+		if (ret == -EAGAIN && no_wait)
+			ret = userfaultfd_enqueue(iocb, ctx, to);
+
+		if (no_wait || ret != -EAGAIN)
 			break;
-		}
-		if (no_wait) {
-			ret = -EAGAIN;
-			break;
-		}
+
+		spin_unlock(&ctx->notification_lock);
 		spin_unlock_irq(&ctx->fd_wqh.lock);
 		schedule();
 		spin_lock_irq(&ctx->fd_wqh.lock);
+		spin_lock(&ctx->notification_lock);
 	}
+	spin_unlock(&ctx->notification_lock);
 	__remove_wait_queue(&ctx->fd_wqh, &wait);
 	__set_current_state(TASK_RUNNING);
 	spin_unlock_irq(&ctx->fd_wqh.lock);
@@ -1202,20 +1382,38 @@ static ssize_t userfaultfd_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	ssize_t _ret, ret = 0;
 	struct uffd_msg msg;
 	int no_wait = file->f_flags & O_NONBLOCK;
+	struct iov_iter _to, *async_to = NULL;
 
-	if (ctx->state == UFFD_STATE_WAIT_API)
+	if (ctx->state == UFFD_STATE_WAIT_API || READ_ONCE(ctx->released))
 		return -EINVAL;
 
+	/* Duplicate before taking the lock */
+	if (no_wait && !is_sync_kiocb(iocb) &&
+	    (iov_iter_is_bvec(to) || iov_iter_is_kvec(to))) {
+		async_to = &_to;
+		dup_iter(async_to, to, GFP_KERNEL);
+	}
+
 	for (;;) {
-		if (iov_iter_count(to) < sizeof(msg))
-			return ret ? ret : -EINVAL;
-		_ret = userfaultfd_ctx_read(ctx, no_wait, &msg);
-		if (_ret < 0)
-			return ret ? ret : _ret;
+		if (iov_iter_count(to) < sizeof(msg)) {
+			if (!ret)
+				ret = -EINVAL;
+			break;
+		}
+		_ret = userfaultfd_ctx_read(iocb, ctx, no_wait, &msg, async_to);
+		if (_ret < 0) {
+			if (ret == 0)
+				ret = _ret;
+			break;
+		}
+		async_to = NULL;
 
 		_ret = copy_to_iter(&msg, sizeof(msg), to);
-		if (_ret != sizeof(msg))
-			return ret ? ret : -EINVAL;
+		if (_ret != sizeof(msg)) {
+			if (ret == 0)
+				ret = -EINVAL;
+			break;
+		}
 
 		ret += sizeof(msg);
 
@@ -1225,6 +1423,11 @@ static ssize_t userfaultfd_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		 */
 		no_wait = O_NONBLOCK;
 	}
+
+	if (ret != -EIOCBQUEUED && async_to != NULL)
+		kfree(async_to->kvec);
+
+	return ret;
 }
 
 static void __wake_userfault(struct userfaultfd_ctx *ctx,
@@ -1997,6 +2200,7 @@ static void init_once_userfaultfd_ctx(void *mem)
 	init_waitqueue_head(&ctx->event_wqh);
 	init_waitqueue_head(&ctx->fd_wqh);
 	seqcount_spinlock_init(&ctx->refile_seq, &ctx->fault_pending_wqh.lock);
+	spin_lock_init(&ctx->notification_lock);
 }
 
 SYSCALL_DEFINE1(userfaultfd, int, flags)
@@ -2027,6 +2231,7 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
 	ctx->released = false;
 	ctx->mmap_changing = false;
 	ctx->mm = current->mm;
+	ctx->iocb_callback = NULL;
 	/* prevent the mm struct to be freed */
 	mmgrab(ctx->mm);
 
-- 
2.25.1