Date: Thu, 17 Feb 2022 23:15:46 +0200
From: Mike Rapoport <rppt@kernel.org>
To: Nadav Amit
Cc: Peter Xu, David Hildenbrand, Mike Rapoport, Andrea Arcangeli, Linux-MM
Subject: Re: userfaultfd: usability issue due to lack of UFFD events ordering
References: <63a8a665-4431-a13c-c320-1b46e5f62005@redhat.com>

On Tue, Feb 15, 2022 at 02:35:09PM -0800, Nadav Amit wrote:
> 
> > On Feb 13, 2022, at 8:02 PM, Peter Xu wrote:
> > 
> > Thanks for explaining.
> > 
> > I also digged out the discussion threads between you and Mike and that's a good
> > one too summarizing the problems:
> > 
> > https://lore.kernel.org/all/5921BA80-F263-4F8D-B7E6-316CEB602B51@gmail.com/
> > 
> > Scenario 4 is kind of special imho along all those, because that's the only one
> > that can be workarounded by user application by only copying pages one by one.
> > I know you were even leveraging iouring in your local tree, so that's probably
> > not a solution at all for you. But I'm just trying to start thinking without
> > that scenario for now.
> > 
> > Per my understanding, a major issue regarding the rest of the scenarios is
> > ordering of uffd messages may not match with how things are happening. This
> > actually contains two problems.
> > 
> > First of all, mmap_sem is mostly held read for all page faults and most of the
> > mm changes except e.g. fork, then we can never serialize them. Not to mention
> > uffd events releases mmap_sem within prep and completion. Let's call it
> > problem 1.
> > 
> > The other problem 2 is we can never serialize faults against events.
> > 
> > For problem 1, I do sense something that mmap_sem is just not suitable for uffd
> > scenario. Say, we grant concurrency with most of the events like dontneed and
> > mremap, but when uffd ordering is a concern we may not want to grant that
> > concurrency. I'm wondering whether it means uffd may need its own semaphore to
> > achieve this. So for all events that uffd cares about we take a write lock on a
> > new uffd_sem after mmap_sem, meanwhile we don't release that uffd_sem after prep
> > of events, not until completion (the message is read). It'll slow down uffd
> > tracked systems but guarantees ordering.
> 
> Peter,
> 
> Thanks for finding the time and looking into the issues that I encountered.
> 
> Your approach sounds possible, but it sounds to me unsafe to acquire uffd_sem
> after mmap_lock, since it might cause deadlocks (e.g., if a process uses events
> to manage its own memory).
> 
> > In the meantime, I'm wildly thinking whether we can tackle the other
> > problem by merging the page fault queue with the event queue, aka, event_wqh
> > and fault_pending_wqh.
> > Obviously we'll need to identify the messages when
> > read() and conditionally move them into fault_wqh only if they come from page
> > faults, but that seems doable?
> 
> This, I guess, is necessary in addition to your aforementioned proposal to have
> some semaphore protecting, and together they can do the trick.
> 
> While I have your attention, let me share some other challenges I encountered
> using userfaultfd. They might be unrelated, but perhaps you can keep them in
> the back of your mind. Nobody should suffer as I did ;-)
> 
> 1. mmap_changing (i.e., -EAGAIN on ioctls) makes using userfaultfd harder than
> it should be, especially when using io-uring as I wish to do.
> 
> I think it is not too hard to address by changing the API. For instance, if
> the uffd-ctx had a uffd-generation that would increase on each event, the user
> could have provided an ioctl-generation as part of copy/zero/etc ioctls, and
> the copy/zero/etc operation would only succeed if the uffd-generation is
> lower/equal than the one provided by the user.

Do you mean that if there were page faults with generations 1 and 3 and,
say, MADV_DONTNEED with generation 2, then even if the uffd copy that resolves
page fault 1 races with MADV_DONTNEED it will go through, while the copy for
page fault 3 will fail?

But how would you order zapping the pages and copying into them internally?

Or maybe my understanding of your idea is completely off?

As for the technicality of adding a generation to uffd_msg and to
uffdio_{copy,zero,etc}, we can use the __u32 reserved in the first one and 32
bits from mode in the second, with a bit of care for wraparound.

> 2. userfaultfd is separated from other tracing/instrumentation mechanisms in
> the kernel. I, for instance, also wanted to track mmap events (let's put
> aside for a second why).
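(Back to the wraparound point for a second: the comparison could follow the
same trick as the kernel's time_before()/time_after(). A userspace sketch, all
names made up, not actual uffd code:)

```c
#include <stdint.h>

/*
 * Sketch only: a 32-bit generation comparison that stays correct across
 * wraparound, as long as the two values are less than 2^31 apart. This
 * mirrors the signed-difference trick used by the kernel's time_before().
 */
static inline int uffd_gen_before_eq(uint32_t a, uint32_t b)
{
	/* signed difference is <= 0 iff a is not newer than b */
	return (int32_t)(a - b) <= 0;
}

/*
 * Hypothetical check a uffdio_copy/zero ioctl could make: proceed only if
 * the context's generation has not advanced past the user-supplied one.
 */
static inline int uffd_ioctl_allowed(uint32_t ctx_gen, uint32_t user_gen)
{
	return uffd_gen_before_eq(ctx_gen, user_gen);
}
```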
> Tracking these events can be done with ptrace or
> perf_event_open(), but then it is hard to correlate these events with
> userfaultfd. It would have been easier for users, I think, if userfaultfd
> notifications were provided through ptrace/tracepoints mechanisms as well.

This sounds like opening a Pandora's box ;-)

I think it's possible to trace userfaultfd events to some extent with a
probe at userfaultfd_event_wait_completion() entry and handle_userfault().
The "interesting" information is passed to these functions as parameters
and I believe all the data can be extracted with tools like bpftrace.

> 3. Nesting/chaining. It is not easy to allow two monitors to use userfaultfd
> concurrently. This seems like a general problem that I believe ptrace suffers
> from too. I know it might seem far-fetched to have 2 monitors at the moment,
> but I think that any tracking/instrumentation mechanism (e.g., ptrace,
> software-dirty, not to mention hardware virtualization) should be designed
> from the beginning with such support, as adding it in a later stage can be
> tricky.

It's not too far-fetched to have nested userfaultfd contexts even now.
If CRIU needs to post-copy restore a process that uses userfaultfd, it
will have to deal with nested uffds.

> Thanks again,
> Nadav

-- 
Sincerely yours,
Mike.