From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 244B5C433FE for ; Tue, 8 Dec 2020 08:57:36 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 93BC623A79 for ; Tue, 8 Dec 2020 08:57:35 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 93BC623A79 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=gmail.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 0A28C6B005D; Tue, 8 Dec 2020 03:57:35 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 02B7D6B0068; Tue, 8 Dec 2020 03:57:34 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id E83016B006C; Tue, 8 Dec 2020 03:57:34 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0172.hostedemail.com [216.40.44.172]) by kanga.kvack.org (Postfix) with ESMTP id CD8C36B005D for ; Tue, 8 Dec 2020 03:57:34 -0500 (EST) Received: from smtpin19.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with ESMTP id 9B3728249980 for ; Tue, 8 Dec 2020 08:57:34 +0000 (UTC) X-FDA: 77569511628.19.jeans44_3312cb0273e5 Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251]) by smtpin19.hostedemail.com (Postfix) with ESMTP id 7AFDE1AD1B9 for ; Tue, 8 Dec 2020 08:57:34 +0000 (UTC) X-HE-Tag: jeans44_3312cb0273e5 X-Filterd-Recvd-Size: 7530 Received: from mail-pf1-f194.google.com (mail-pf1-f194.google.com [209.85.210.194]) by imf29.hostedemail.com (Postfix) with ESMTP for ; Tue, 8 Dec 2020 08:57:33 +0000 (UTC) Received: by mail-pf1-f194.google.com with SMTP id s21so13287740pfu.13 for ; Tue, 08 Dec 2020 00:57:33 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:subject:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to; bh=gtLb/BFxaZEtUCQGZNUXZW2j+UMC0CoivbHLowPJfAI=; b=hJ60Glvn+CmUXHYPQBC/b47Mub8Qtjb9XxPeVkVLwgtgjKxRJdpu1CymYyZQWCaB3x 97vp7zu7D5FDHJOm/ooRrVgXNMDyh0KhGHj6gnGpMn89i9W060X6hAvGlOqXkbvm0fEY 6sEx1UzxHxruJCQUXVi48Fos9Lur4v6RRPVhmzZWYDdCskNEUv+sC/sC+YUkbxD4Sqfb cOF1+mDLg8BoaTGe9886CiMpwCfMXu4I9GTbxb/jGpZR/lF/r7DULDCetztcD5Tu0YAS 4O8UM89EDWVCZM2sjGheqH78mvJZ7wIX7KY7nrH3wHOrq5g2aDclDjsLk4gLULW0/i4E ehqg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:subject:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to; bh=gtLb/BFxaZEtUCQGZNUXZW2j+UMC0CoivbHLowPJfAI=; b=jIYYHobDQFZukf9lNOeVH5z+5pSUIH+t92emN8jxk//fWwY9/ssMm/wj3TdC2C5gK+ 7h7IWnhiZ44muerqVQdl9Kd30lWDtQfTx/YeJFvifzHyM/Wlsx1tr3/TkohGVju4515u wWNDBJp8WS1EbdxYcpklqeyxUxTy8QMT7VJ0YyKh3tJRceTtCIwKL0LSzAseL6obymRz ZqYe6oAfCkXOPoykywsPqERDW0NuthYn0xpYvdU0fN1k7K+kATxGyqE4ixFpeD7dimMo TVTBZCqJUS1OId7LEss8cnN0rEJxBuFiSoABsSepe9H8gVIHkLcecn283+JqWa6ZAuP0 /aew== X-Gm-Message-State: AOAM532qHEZmDUFgxPxk4cZU5YH/6HJCKaszkXRmoWj3R0mhjcqA8IpQ SB1kGkuqxj5Zrrb2p5HVRf4= X-Google-Smtp-Source: ABdhPJyzh7Uemlt+vdwUtjLJLeCSsJuNHKl9rVRmHdzZGos30W8A7PLEXDV4YLoaQymFDiOYMQKl+g== X-Received: by 2002:a62:1716:0:b029:19d:b78b:ef02 with SMTP id 22-20020a6217160000b029019db78bef02mr19798332pfx.11.1607417852838; Tue, 08 Dec 2020 00:57:32 -0800 (PST) Received: from ?IPv6:2601:647:4700:9b2:5c98:e5b3:1ddc:54ce? ([2601:647:4700:9b2:5c98:e5b3:1ddc:54ce]) by smtp.gmail.com with ESMTPSA id k16sm7410225pfi.131.2020.12.08.00.57.31 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Tue, 08 Dec 2020 00:57:32 -0800 (PST) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 13.4 \(3608.120.23.2.4\)) Subject: Re: [PATCH] userfaultfd: prevent non-cooperative events vs mcopy_atomic races From: Nadav Amit In-Reply-To: <20201208083434.GA1164013@linux.ibm.com> Date: Tue, 8 Dec 2020 00:57:30 -0800 Cc: Mike Rapoport , Andrew Morton , linux-mm , lkml , Andrea Arcangeli , Mike Kravetz , Pavel Emelyanov , Andrei Vagin Content-Transfer-Encoding: quoted-printable Message-Id: <83A6D439-F732-4112-BD1F-00195EBFCE4C@gmail.com> References: <1527061324-19949-1-git-send-email-rppt@linux.vnet.ibm.com> <31DA12CC-E9CC-497D-A2EE-B83549D95CE8@gmail.com> <20201206093703.GY123287@linux.ibm.com> <5921BA80-F263-4F8D-B7E6-316CEB602B51@gmail.com> <20201208083434.GA1164013@linux.ibm.com> To: Mike Rapoport X-Mailer: Apple Mail (2.3608.120.23.2.4) X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: > On Dec 8, 2020, at 12:34 AM, Mike Rapoport wrote: >=20 > On Sun, Dec 06, 2020 at 08:31:39PM -0800, Nadav Amit wrote: >> Whenever I run into a non-standard and non-trivial synchronization = algorithm >> in the kernel (and elsewhere), I become very confused and concerned. = I >> raised my question since I wanted to modify the code and could not = figure >> out how to properly do so. Based on your input that the monitor is = expected >> to know the child mappings according to userfaultfd events, I now = think that >> the kernel does not provide this ability and the locking scheme is = broken. >>=20 >> Here are some scenarios that I think are broken - please correct me = if I am >> wrong: >>=20 >> * Scenario 1: MADV_DONTNEED racing with userfaultfd page-faults >>=20 >> userfaultfd_remove() only holds the mmap_lock for read, so these = events >> cannot be ordered with userfaultfd page-faults. >>=20 >> * Scenario 2: MADV_DONTNEED racing with fork() >>=20 >> As userfaultfd_remove() releases mmap_lock after the user = notification and >> before the actual unmapping, concurrent fork() might happen before or = after >> the actual unmapping in MADV_DONTNEED and the user therefore has no = way of >> knowing whether the actual unmapping took place before or after the = fork(). >>=20 >> * Scenario 3: Concurrent MADV_DONTNEED can cause userfaultfd_remove() = to >> clear mmap_changing cleared before all the notifications are = completed. >>=20 >> As mmap_lock is only taken for read, the first thread the completed >> userfaultfd_remove() would clear the indication that was set by the = other >> one. >>=20 >> * Scenario 4: Fork starts and ends between copying of two pages. >>=20 >> As mmap_lock might be released during ioctl_copy() (inside >> __mcopy_atomic()), some pages might be mapped in the child and others = not: >>=20 >>=20 >> CPU0 CPU1 >> ---- ---- >> ioctl_copy(): >> __mcopy_atomic() >> mmap_read_lock() >> !mmap_changing [ok] >> mfill_atomic_pte() =3D=3D 0 [page0 copied] >> mfill_atomic_pte() =3D=3D -ENOENT [page1 will be retried] >> mmap_read_unlock() >> goto retry >>=20 >> fork(): >> dup_userfaultfd() >> -> mmap_changing=3Dtrue >> userfaultfd_event_wait_completion() >> -> mmap_changing=3Dfalse >>=20 >> mmap_read_lock() >> !mmap_changing [ok] >> mfill_atomic_pte() =3D=3D 0 [page1 copied] >> mmap_read_unlock() >>=20 >> return: 2 pages were mapped, while the first is present in the child = and >> the second one is non-present. >>=20 >> Bottom-line: it seems to me that mmap_changing should be a counter = (not >> boolean) that is protected by mmap_lock. This counter should be kept >> elevated throughout the entire operation (in regard to = MADV_DONTNEED). >> Perhaps mmap_lock does not have to be taken to decrease the counter, = but >> then an smp_wmb() would be needed before the counter is decreased. >>=20 >> Let me know whether I am completely off or missing something. >=20 > I tried to remember what's going on there and wrap my head around your > examples. I'm not sure if userspace cannot workaround some of those, = but > I can't say I can propose it right now. >=20 > There is for sure userspace is helpless in Scenario 4, but I think it = is > very unlikely that fork() will be fast enough to grab and release > mmap_lock while uffd_copy() waits for CPU to retry. >=20 > I agree that a making mmap_changing a counter would be more robust > anyway. Thanks for confirming my suspicion. On a second thought, I think that a sequence lock would be required. I = will work on a patch to resolve it in the next RFC of the related patch = series I am working on. As for the race window size, as there are lock optimizations to prevent writers' starvation, I do not think the last scenario is completely far-fetched. Thanks again, Nadav=