Date: Thu, 17 Feb 2022 23:15:46 +0200
From: Mike Rapoport <rppt@kernel.org>
To: Nadav Amit
Cc: Peter Xu, David Hildenbrand, Mike Rapoport, Andrea Arcangeli, Linux-MM
Subject: Re: userfaultfd: usability issue due to lack of UFFD events ordering
References: <63a8a665-4431-a13c-c320-1b46e5f62005@redhat.com>

On Tue, Feb 15, 2022 at 02:35:09PM -0800, Nadav Amit wrote:
> 
> > On Feb 13, 2022, at 8:02 PM, Peter Xu wrote:
> > 
> > Thanks for explaining.
> > 
> > I also digged out the discussion threads between you and Mike and that's a good
> > one too summarizing the problems:
> > 
> > https://lore.kernel.org/all/5921BA80-F263-4F8D-B7E6-316CEB602B51@gmail.com/
> > 
> > Scenario 4 is kind of special imho along all those, because that's the only one
> > that can be workarounded by user application by only copying pages one by one.
> > I know you were even leveraging iouring in your local tree, so that's probably
> > not a solution at all for you. But I'm just trying to start thinking without
> > that scenario for now.
> > 
> > Per my understanding, a major issue regarding the rest of the scenarios is
> > ordering of uffd messages may not match with how things are happening. This
> > actually contains two problems.
> > 
> > First of all, mmap_sem is mostly held read for all page faults and most of the
> > mm changes except e.g. fork, then we can never serialize them. Not to mention
> > uffd events releases mmap_sem within prep and completion. Let's call it
> > problem 1.
> > 
> > The other problem 2 is we can never serialize faults against events.
> > 
> > For problem 1, I do sense something that mmap_sem is just not suitable for uffd
> > scenario. Say, we grant concurrency with most of the events like dontneed and
> > mremap, but when uffd ordering is a concern we may not want to grant that
> > concurrency. I'm wondering whether it means uffd may need its own semaphore to
> > achieve this. So for all events that uffd cares about we take a write lock on a
> > new uffd_sem after mmap_sem, meanwhile we don't release that uffd_sem after prep
> > of events, not until completion (the message is read). It'll slow down uffd
> > tracked systems but guarantees ordering.
> 
> Peter,
> 
> Thanks for finding the time and looking into the issues that I encountered.
> 
> Your approach sounds possible, but it sounds to me unsafe to acquire uffd_sem
> after mmap_lock, since it might cause deadlocks (e.g., if a process uses events
> to manage its own memory).
> 
> > In the meantime, I'm wildly thinking whether we can tackle the other
> > problem by merging the page fault queue with the event queue, aka, event_wqh
> > and fault_pending_wqh.
> > Obviously we'll need to identify the messages when
> > read() and conditionally move them into fault_wqh only if they come from page
> > faults, but that seems doable?
> 
> This, I guess, is necessary in addition to your aforementioned proposal to have
> some semaphore protecting, and together they can do the trick.
> 
> While I have your attention, let me share some other challenges I encountered
> using userfaultfd. They might be unrelated, but perhaps you can keep them in
> the back of your mind. Nobody should suffer as I did ;-)
> 
> 1. mmap_changing (i.e., -EAGAIN on ioctls) makes using userfaultfd harder than
> it should be, especially when using io-uring as I wish to do.
> 
> I think it is not too hard to address by changing the API. For instance, if
> the uffd-ctx had a uffd-generation that would increase on each event, the user
> could have provided an ioctl-generation as part of copy/zero/etc ioctls, and
> the copy/zero/etc operation would only succeed if the uffd-generation is
> lower/equal than the one provided by the user.

Do you mean that if there were page faults with generations 1 and 3 and,
say, MADV_DONTNEED with generation 2, then even if the uffd copy that resolves
page fault 1 races with MADV_DONTNEED it will go through, while the copy for
page fault 3 will fail?

But how would you order zapping the pages and copying into them internally?

Or maybe my understanding of your idea is completely off?

As for the technicality of adding a generation to uffd_msg and to
uffdio_{copy,zero,etc}, we can use the __u32 reserved in the first one and 32
bits from mode in the second, with a bit of care for wraparound.

> 2. userfaultfd is separated from other tracing/instrumentation mechanisms in
> the kernel. I, for instance, also wanted to track mmap events (let's put
> aside for a second why).
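(Back to the wraparound point for a second: the comparison could follow the
same trick as the kernel's time_before()/time_after(). A userspace sketch, all
names made up, not actual uffd code:)

```c
#include <stdint.h>

/*
 * Sketch only: a 32-bit generation comparison that stays correct across
 * wraparound, as long as the two values are less than 2^31 apart. This
 * mirrors the signed-difference trick used by the kernel's time_before().
 */
static inline int uffd_gen_before_eq(uint32_t a, uint32_t b)
{
	/* signed difference is <= 0 iff a is not newer than b */
	return (int32_t)(a - b) <= 0;
}

/*
 * Hypothetical check a uffdio_copy/zero ioctl could make: proceed only if
 * the context's generation has not advanced past the user-supplied one.
 */
static inline int uffd_ioctl_allowed(uint32_t ctx_gen, uint32_t user_gen)
{
	return uffd_gen_before_eq(ctx_gen, user_gen);
}
```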
> Tracking these events can be done with ptrace or
> perf_event_open(), but then it is hard to correlate these events with
> userfaultfd. It would have been easier for users, I think, if userfaultfd
> notifications were provided through ptrace/tracepoints mechanisms as well.

This sounds like opening a Pandora's box ;-)

I think it's possible to trace userfaultfd events to some extent with a
probe at userfaultfd_event_wait_completion() entry and handle_userfault().
The "interesting" information is passed to these functions as parameters
and I believe all the data can be extracted with tools like bpftrace.

> 3. Nesting/chaining. It is not easy to allow two monitors to use userfaultfd
> concurrently. This seems like a general problem that I believe ptrace suffers
> from too. I know it might seem far-fetched to have 2 monitors at the moment,
> but I think that any tracking/instrumentation mechanism (e.g., ptrace,
> software-dirty, not to mention hardware virtualization) should be designed
> from the beginning with such support, as adding it in a later stage can be
> tricky.

It's not too far-fetched to have nested userfaultfd contexts even now.
If CRIU needs to post-copy restore a process that uses userfaultfd, it
will have to deal with nested uffds.

> Thanks again,
> Nadav

-- 
Sincerely yours,
Mike.