From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-13.5 required=3.0 tests=BAYES_00, DKIM_ADSP_CUSTOM_MED,DKIM_INVALID,DKIM_SIGNED,FREEMAIL_FORGED_FROMDOMAIN, FREEMAIL_FROM,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER, INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id B0CA7C4361B for ; Sat, 19 Dec 2020 04:34:07 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 3895923BE2 for ; Sat, 19 Dec 2020 04:34:07 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 3895923BE2 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=gmail.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 6E2FD6B005C; Fri, 18 Dec 2020 23:34:06 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 692DB6B005D; Fri, 18 Dec 2020 23:34:06 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 581C96B0068; Fri, 18 Dec 2020 23:34:06 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0125.hostedemail.com [216.40.44.125]) by kanga.kvack.org (Postfix) with ESMTP id 4005D6B005C for ; Fri, 18 Dec 2020 23:34:06 -0500 (EST) Received: from smtpin22.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with ESMTP id E3E868249980 for ; Sat, 19 Dec 2020 04:34:05 +0000 (UTC) X-FDA: 77608764450.22.angle40_3d1390027443 Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251]) by smtpin22.hostedemail.com (Postfix) with ESMTP id B92C6180C6234 for ; Sat, 19 Dec 2020 04:34:05 +0000 (UTC) X-HE-Tag: angle40_3d1390027443 X-Filterd-Recvd-Size: 7240 Received: from mail-pj1-f51.google.com (mail-pj1-f51.google.com [209.85.216.51]) by imf03.hostedemail.com (Postfix) with ESMTP for ; Sat, 19 Dec 2020 04:34:05 +0000 (UTC) Received: by mail-pj1-f51.google.com with SMTP id iq13so2425741pjb.3 for ; Fri, 18 Dec 2020 20:34:05 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:mime-version :content-transfer-encoding; bh=yoyAynmz1wcAyKdwj7JPb1oUlBRjnh/nNP8b1Gw1a2k=; b=sewz8SG1Hhr94jDzCDmMJeN7/3LbxnO32ongCAGDSWXlgEfHRoPdR6ocDAZN0oRe+h Pp3W7MMTndo1oCIyfSBC9GuBqgRHPCUHVUA9VLTSydMsUBn0U7HBAa8Sg2fP+X8JIKWc 6dm6gkAmUkhFhRPGuxvkEH6aGwpnhITUMcvIALGDCBtnAjrf/f9HuLABUzvssBAkSTDZ DCTfqdFFdyfKYsH1e1cMbKFsOiQL581rIWaAMWk1szhPHRyj0mfeVBhIEuD6JW6tXxNo nIeQquXrTvY+v3VyGd9dZF+6c3nUr+jvCNINxQn3Z98yOtCDs9xfXWORMeGtZP2DD6Y0 gfPw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:mime-version :content-transfer-encoding; bh=yoyAynmz1wcAyKdwj7JPb1oUlBRjnh/nNP8b1Gw1a2k=; b=JYHIN0jPDoUqNUpsnbHGk4nM1e6iyoxr/UMbn33K82qOsDdf8XVehPHCtkk1h0QpYs hVPciaXBpeW8lMvlPW2rDCkfJkL26GZWbSsbLgJoHBCM5Ei7UUuutOQUiv5bUZaDYaLa QTjYkvDlvEHDbXrnLlPxLSyNuk3AtQ4Pf9ArwacJMsvMNNcgiDUJCpD1DDARwwuXIc9q mkOF/rBWkGMrzcQLdsZNtCwCciqjtOnk+IpH4/w5/QtCvkWPL964JlY02U6rOTKEOEuf NvOa+QFqT/TYwqgKXogj1+7CRSmWTHOf2O3Lf2xUXHU3YRLpRWbek7ZW8W9BBnfNauuY YkWA== X-Gm-Message-State: AOAM533xDc/WiiVXBAvBZva3cOuiT90fh7TF6f/KWHqoKC65C2c+hZOB vKJgOELd6vwwDGCqKiZQr6b19BJil8X5Pg== X-Google-Smtp-Source: ABdhPJxoD5X9iSGSIKKaOutdsLJyUD+JyZV1Irl6iwgjqe8wTKCRGQCBDFPE/PuEI9PvUMhcvUxz4g== X-Received: by 2002:a17:90a:df0d:: with SMTP id gp13mr7397314pjb.151.1608352443766; Fri, 18 Dec 2020 20:34:03 -0800 (PST) Received: from sc2-haas01-esx0118.eng.vmware.com ([66.170.99.1]) by smtp.gmail.com with ESMTPSA id f64sm10737029pfb.146.2020.12.18.20.34.02 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 18 Dec 2020 20:34:03 -0800 (PST) From: Nadav Amit X-Google-Original-From: Nadav Amit To: linux-mm@kvack.org, Peter Xu Cc: linux-kernel@vger.kernel.org, Nadav Amit , Andrea Arcangeli , Pavel Emelyanov , Mike Kravetz , Mike Rapoport , stable@vger.kernel.org Subject: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect Date: Fri, 18 Dec 2020 20:30:06 -0800 Message-Id: <20201219043006.2206347-1-namit@vmware.com> X-Mailer: git-send-email 2.25.1 MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Nadav Amit Userfaultfd self-tests fail occasionally, indicating a memory corruption. Analyzing this problem indicates that there is a real bug since mmap_lock is only taken for read in mwriteprotect_range(). This might cause the TLBs to be in an inconsistent state due to the deferred batched TLB flushes. Consider the following scenario with 3 CPUs (cpu2 is not shown): cpu0 cpu1 ---- ---- userfaultfd_writeprotect() [ write-protecting ] mwriteprotect_range() mmap_read_lock() change_protection() change_protection_range() ... change_pte_range() [ defer TLB flushes] userfaultfd_writeprotect() mmap_read_lock() change_protection() [ write-unprotect ] ... [ unprotect PTE logically ] ... [ page-fault] ... wp_page_copy() [ set new writable page in PTE] At this point no TLB flush took place. cpu2 (not shown) might have a stale writable PTE, which was cached in the TLB before cpu0 called userfaultfd_writeprotect(), and this PTE points to a different page. Therefore, write-protecting of memory, even using userfaultfd, which does not change the vma, requires to prevent any concurrent reader (#PF handler) from reading PTEs from the page-tables if any CPU might still hold in it TLB a PTE with higher permissions for the same address. To do so mmap_lock needs to be taken for write. Surprisingly, memory-unprotection using userfaultfd also poses a problem. Although memory unprotection is logically a promotion of PTE permissions, and therefore should not require a TLB flush, the current code might actually cause a demotion of the permission, and therefore requires a TLB flush. During unprotection of userfaultfd managed memory region, the PTE is not really made writable, but instead marked "logically" as writable, and left for the #PF handler to be handled later. While this is ok, the code currently also *removes* the write permission, and therefore makes it necessary to flush the TLBs. To resolve these problems, acquire mmap_lock for write when write-protecting userfaultfd region using ioctl's. Keep taking mmap_lock for read when unprotecting memory, but keep the write-bit set when resolving userfaultfd write-protection. Cc: Peter Xu Cc: Andrea Arcangeli Cc: Pavel Emelyanov Cc: Mike Kravetz Cc: Mike Rapoport Cc: Fixes: 292924b26024 ("userfaultfd: wp: apply _PAGE_UFFD_WP bit") Signed-off-by: Nadav Amit --- mm/mprotect.c | 3 ++- mm/userfaultfd.c | 15 +++++++++++++-- 2 files changed, 15 insertions(+), 3 deletions(-) diff --git a/mm/mprotect.c b/mm/mprotect.c index ab709023e9aa..c08c4055b051 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -75,7 +75,8 @@ static unsigned long change_pte_range(struct vm_area_st= ruct *vma, pmd_t *pmd, oldpte =3D *pte; if (pte_present(oldpte)) { pte_t ptent; - bool preserve_write =3D prot_numa && pte_write(oldpte); + bool preserve_write =3D (prot_numa || uffd_wp_resolve) && + pte_write(oldpte); =20 /* * Avoid trapping faults against the zero or KSM diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 9a3d451402d7..7423808640ef 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -652,7 +652,15 @@ int mwriteprotect_range(struct mm_struct *dst_mm, un= signed long start, /* Does the address range wrap, or is the span zero-sized? */ BUG_ON(start + len <=3D start); =20 - mmap_read_lock(dst_mm); + /* + * Although we do not change the VMA, we have to ensure deferred TLB + * flushes are performed before page-faults can be handled. Otherwise + * we can get inconsistent TLB state. + */ + if (enable_wp) + mmap_write_lock(dst_mm); + else + mmap_read_lock(dst_mm); =20 /* * If memory mappings are changing because of non-cooperative @@ -686,6 +694,9 @@ int mwriteprotect_range(struct mm_struct *dst_mm, uns= igned long start, =20 err =3D 0; out_unlock: - mmap_read_unlock(dst_mm); + if (enable_wp) + mmap_write_unlock(dst_mm); + else + mmap_read_unlock(dst_mm); return err; } --=20 2.25.1