From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.3 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_SANE_1 autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id CEAFBC433DB for ; Tue, 22 Dec 2020 21:18:06 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 6123B20855 for ; Tue, 22 Dec 2020 21:18:06 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 6123B20855 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=redhat.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id C8D188D000F; Tue, 22 Dec 2020 16:18:05 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id C3D058D0003; Tue, 22 Dec 2020 16:18:05 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id B2CDF8D000F; Tue, 22 Dec 2020 16:18:05 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0045.hostedemail.com [216.40.44.45]) by kanga.kvack.org (Postfix) with ESMTP id 9BE5A8D0003 for ; Tue, 22 Dec 2020 16:18:05 -0500 (EST) Received: from smtpin03.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id 5C3A5180AD837 for ; Tue, 22 Dec 2020 21:18:05 +0000 (UTC) X-FDA: 77622180930.03.woman01_631254b27463 Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251]) by smtpin03.hostedemail.com (Postfix) with ESMTP id 3C66728A26A for ; Tue, 22 Dec 2020 21:18:05 +0000 (UTC) X-HE-Tag: woman01_631254b27463 X-Filterd-Recvd-Size: 6183 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [216.205.24.124]) by imf47.hostedemail.com (Postfix) with ESMTP for ; Tue, 22 Dec 2020 21:18:04 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1608671884; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=AkoOcriox08aICF4e6yOPafNyaUMgYdVf0PsmAE3Q6M=; b=EwAp5cDOU7AMAmgzokH6l3fF/vSPzcNrvK8UaYDNoleEAnkIBkLWc8EHZnp9JS2rsZqDoE 4buVCDukP/XHNL29io/TiR/AyXGhDid/mqq1ApljiTR75xXcApr1olsksUJp5NuW4dfm/c 1MrQCqgxwjMir0Cxvq8TpHpZ3nhiuA4= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-213-JTGTxLJxM1aM-UHzYI6zKg-1; Tue, 22 Dec 2020 16:18:00 -0500 X-MC-Unique: JTGTxLJxM1aM-UHzYI6zKg-1 Received: from smtp.corp.redhat.com (int-mx02.intmail.prod.int.phx2.redhat.com [10.5.11.12]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id F14028042A2; Tue, 22 Dec 2020 21:17:57 +0000 (UTC) Received: from mail (ovpn-112-5.rdu2.redhat.com [10.10.112.5]) by smtp.corp.redhat.com (Postfix) with ESMTPS id C4B5960BF1; Tue, 22 Dec 2020 21:17:53 +0000 (UTC) Date: Tue, 22 Dec 2020 16:17:53 -0500 From: Andrea Arcangeli To: Nadav Amit Cc: Peter Xu , Yu Zhao , Linus Torvalds , linux-mm , lkml , Pavel Emelyanov , Mike Kravetz , Mike Rapoport , stable , Minchan Kim , Andy Lutomirski , Will Deacon , Peter Zijlstra Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect Message-ID: References: <9E301C7C-882A-4E0F-8D6D-1170E792065A@gmail.com> <1FCC8F93-FF29-44D3-A73A-DF943D056680@gmail.com> <20201221223041.GL6640@xz-x1> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/2.0.3 (2020-12-04) X-Scanned-By: MIMEDefang 2.79 on 10.5.11.12 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Tue, Dec 22, 2020 at 12:19:49PM -0800, Nadav Amit wrote: > Perhaps any change to PTE in a page-table should increase a page-table > generation that we would save in the page-table page-struct (next to the The current rule is that by the time in the page fault we find a pte/hugepmd in certain state under the "PT lock", the TLB cannot be left stale with a more permissive state at that point. So if under "PT lock" the page fault sees !pte_write, it means no other CPU can possibly modify the page anymore. That must be enforced at all times, no sequence number is required if you enforce that and the whole point of the fix is to enforce that. This is how things always worked in the page fault and it's perfectly fine. For the record I never suggested to change anything in the page fault code. So the only way we can leave stale TLB after dropping PT lock with the mmap_read_lock and concurrent page faults is with a marker: 1) take PT lock 2) leave in the pte/hugepmd a unique identifiable marker 3) change the pte to downgrade permissions 4) drop PT lock and leave stale TLB entries with the upgraded permissions In point 3) the pte is built in a way that is guaranteed to trigger a deterministic path in the page fault. And in the way of that deterministic path you put the marker check for 2) to send the fault to a dead-end where the stale TLB is actually irrelevant and harmless. No change to the page fault is needed if you only enforce the above. Even if we'd take the mmap_write_lock in userfaultfd_writeprotect, you will still have to deal with the above technique because of change_prot_numa(). Would you suggest to add a sequence counter and to change all pte_same in order to make change_prot_numa safer or is it safe enough already using the above technique that checks the marker in the page fault? To be safe NUMA balancing is using the same mechanism above of sending the page fault into a dead end, in order to call the very same function (change_permissions) with the very same lock (mmap_read_lock) as userfaultfd_writeprotect. What happens then in do_numa_page then is different than what happens in handle_userfault, but both are a dead ends as far as the page fault code is concerned and they will never come back to the page fault code. That's how they avoid breaking the page fault. The final rule for the above logic to work safe for uffd, is that the marker cannot be cleared until after the deferred TLB flush is executed (that's where the window for the race existed, and the fix closed such window). do_numa_page differs in clearing the marker while the TLB flush still pending, but it can do that because it puts the same exact pte value (with the upgraded permissions) that was there before it put the marker in the pte. In turn do_numa_page makes the pending TLB flush becomes a noop and it doesn't need to wait for it before removing the marker. Does that last difference in how the marker is cleared, warrant to consider what NUMA balancing radically different so to forbid userfaultfd_writeprotect to use the same logic in the very same function with the very same lock in order to decrease the VM complexity? In my view no. Thanks, Andrea