Subject: Re: userfaultfd: usability issue due to lack of UFFD events ordering
From: Nadav Amit
Date: Tue, 15 Feb 2022 14:35:09 -0800
To: Peter Xu
Cc: Mike Rapoport, David Hildenbrand, Mike Rapoport, Andrea Arcangeli, Linux-MM
References: <11831b20-0b46-92df-885a-1220430f9257@redhat.com> <63a8a665-4431-a13c-c320-1b46e5f62005@redhat.com>

> On Feb 13, 2022, at 8:02 PM, Peter Xu wrote:
>
> Thanks for explaining.
>
> I also digged out the discussion threads between you and Mike and that's a good
> one too summarizing the problems:
>
> https://lore.kernel.org/all/5921BA80-F263-4F8D-B7E6-316CEB602B51@gmail.com/
>
> Scenario 4 is kind of special imho along all those, because that's the only one
> that can be workarounded by user application by only copying pages one by one.
> I know you were even leveraging iouring in your local tree, so that's probably
> not a solution at all for you.
> But I'm just trying to start thinking without
> that scenario for now.
>
> Per my understanding, a major issue regarding the rest of the scenarios is
> ordering of uffd messages may not match with how things are happening. This
> actually contains two problems.
>
> First of all, mmap_sem is mostly held read for all page faults and most of the
> mm changes except e.g. fork, then we can never serialize them. Not to mention
> uffd events releases mmap_sem within prep and completion. Let's call it
> problem 1.
>
> The other problem 2 is we can never serialize faults against events.
>
> For problem 1, I do sense something that mmap_sem is just not suitable for uffd
> scenario. Say, we grant concurrent with most of the events like dontneed and
> mremap, but when uffd ordering is a concern we may not want to grant that
> concurrency. I'm wondering whether it means uffd may need its own semaphore to
> achieve this. So for all events that uffd cares we take write lock on a new
> uffd_sem after mmap_sem, meanwhile we don't release that uffd_sem after prep of
> events, not until completion (the message is read). It'll slow down uffd
> tracked systems but guarantees ordering.

Peter,

Thanks for finding the time and looking into the issues that I encountered.

Your approach sounds possible, but it sounds unsafe to me to acquire uffd_sem
after mmap_lock, since it might cause deadlocks (e.g., if a process uses events
to manage its own memory).

>
> At the meantime, I'm wildly thinking whether we can tackle with the other
> problem by merging the page fault queue with the event queue, aka, event_wqh
> and fault_pending_wqh. Obviously we'll need to identify the messages when
> read() and conditionally move then into fault_wqh only if they come from page
> faults, but that seems doable?

I guess this is necessary, and together with your aforementioned proposal to
have some protecting semaphore, it can do the trick.
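
To make the merged-queue idea concrete, here is a minimal userspace sketch. It
is not kernel code: struct fifo, struct msg, enum msg_kind and model_read() are
hypothetical names invented for illustration, and only event_wqh,
fault_pending_wqh and fault_wqh come from the discussion above. It assumes a
single merged FIFO replaces event_wqh and fault_pending_wqh, and that read()
tags each message so only page faults are moved to fault_wqh.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAP 16

/* Hypothetical message kinds; the real uffd_msg has more variants. */
enum msg_kind { MSG_PAGEFAULT, MSG_EVENT };

struct msg {
	enum msg_kind kind;
	unsigned long addr;
};

/* A fixed-size ring buffer standing in for a kernel wait queue. */
struct fifo {
	struct msg items[QUEUE_CAP];
	size_t head;	/* index of the next message to pop */
	size_t len;	/* number of queued messages */
};

/* Append a message; returns false if the queue is full. */
static bool fifo_push(struct fifo *q, struct msg m)
{
	assert(q != NULL);
	if (q->len == QUEUE_CAP)
		return false;
	q->items[(q->head + q->len) % QUEUE_CAP] = m;
	q->len++;
	return true;
}

/* Remove the oldest message; returns false if the queue is empty. */
static bool fifo_pop(struct fifo *q, struct msg *out)
{
	assert(q != NULL && out != NULL);
	if (q->len == 0)
		return false;
	*out = q->items[q->head];
	q->head = (q->head + 1) % QUEUE_CAP;
	q->len--;
	return true;
}

/*
 * Model of read(): pop the oldest message from the merged queue, so the
 * monitor sees faults and events in arrival order. Only page faults are
 * parked on fault_wqh, where they would wait to be resolved; events are
 * simply delivered. Returns false if the merged queue is empty or if
 * fault_wqh is full (in this toy model the message is then dropped).
 */
static bool model_read(struct fifo *merged, struct fifo *fault_wqh,
		       struct msg *out)
{
	if (!fifo_pop(merged, out))
		return false;
	if (out->kind == MSG_PAGEFAULT)
		return fifo_push(fault_wqh, *out);
	return true;
}
```

Pushing a fault, an event and another fault onto the merged queue and calling
model_read() three times returns them in that same order, with the two faults
ending up on fault_wqh. The sketch says nothing about locking, which is where
the uffd_sem question above comes in.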

While I have your attention, let me share some other challenges I encountered
using userfaultfd. They might be unrelated, but perhaps you can keep them in
the back of your mind. Nobody should suffer as I did ;-)

1. mmap_changing (i.e., -EAGAIN on ioctls) makes using userfaultfd harder than
it should be, especially when using io-uring as I wish to do.

I think it is not too hard to address by changing the API. For instance, if
uffd-ctx had a uffd-generation that would increase on each event, the user
could provide an ioctl-generation as part of the copy/zero/etc ioctls, and
the operation would succeed only if the uffd-generation is lower than or equal
to the one provided by the user; otherwise the kernel would fail it.

2. userfaultfd is separated from other tracing/instrumentation mechanisms in
the kernel. I, for instance, also wanted to track mmap events (let’s put aside
for a second why). Tracking these events can be done with ptrace or
perf_event_open(), but then it is hard to correlate these events with
userfaultfd. It would have been easier for users, I think, if userfaultfd
notifications were provided through ptrace/tracepoints mechanisms as well.

3. Nesting/chaining. It is not easy to allow two monitors to use userfaultfd
concurrently. This seems to be a general problem that I believe ptrace suffers
from too. I know it might seem far-fetched to have 2 monitors at the moment,
but I think that any tracking/instrumentation mechanism (e.g., ptrace,
software-dirty, not to mention hardware virtualization) should be designed
from the beginning with such support, as adding it at a later stage can be
tricky.

4. Missing state. It would be useful to provide the TID of the faulting
thread. I will send a patch for this one once I get the necessary internal
approvals.

Thanks again,
Nadav
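
A minimal userspace sketch of the generation check described in item 1 above.
It is not kernel code and no such API exists today: struct uffd_ctx_model,
model_queue_event() and model_ioctl_check() are hypothetical names, and the
sketch assumes the monitor learns the current generation from the events it
reads.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's userfaultfd context. */
struct uffd_ctx_model {
	uint64_t generation;	/* incremented on every event */
};

/*
 * Called when an event is queued. Returns the new generation, which this
 * sketch assumes would be reported to the monitor with the event.
 */
static uint64_t model_queue_event(struct uffd_ctx_model *ctx)
{
	assert(ctx != NULL);
	return ++ctx->generation;
}

/*
 * Model of the check a copy/zero/etc. ioctl would perform. The caller
 * passes the generation it last observed. The operation may proceed only
 * if the context generation is lower than or equal to that value, i.e.
 * the monitor has seen every event so far. Otherwise it gets -EAGAIN.
 */
static int model_ioctl_check(const struct uffd_ctx_model *ctx,
			     uint64_t ioctl_generation)
{
	assert(ctx != NULL);
	if (ctx->generation > ioctl_generation)
		return -EAGAIN;
	return 0;
}
```

With a fresh context, model_ioctl_check(&ctx, 0) returns 0. After one call to
model_queue_event(), the same check with generation 0 returns -EAGAIN, and it
succeeds again once the caller passes the generation returned by the event.
Unlike the mmap_changing flag, the failure here depends only on whether the
caller has caught up with the events, which may be easier to combine with
io-uring.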