From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=3kWv=HC=kvack.org=owner-linux-mm@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-13.5 required=3.0 tests=BAYES_00,
	DKIM_ADSP_CUSTOM_MED,DKIM_INVALID,DKIM_SIGNED,FREEMAIL_FORGED_FROMDOMAIN,
	FREEMAIL_FROM,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,
	INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT
	autolearn=ham autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 6024EC433E0
	for <linux-mm@archiver.kernel.org>; Sun, 31 Jan 2021 00:16:50 +0000 (UTC)
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.kernel.org (Postfix) with ESMTP id 0629064E15
	for <linux-mm@archiver.kernel.org>; Sun, 31 Jan 2021 00:16:50 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 0629064E15
Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=gmail.com
Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix)
	id 95B0E6B0088; Sat, 30 Jan 2021 19:16:35 -0500 (EST)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 8E84B6B0089; Sat, 30 Jan 2021 19:16:35 -0500 (EST)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 713396B008A; Sat, 30 Jan 2021 19:16:35 -0500 (EST)
X-Delivered-To: linux-mm@kvack.org
Received: from forelay.hostedemail.com (smtprelay0241.hostedemail.com [216.40.44.241])
	by kanga.kvack.org (Postfix) with ESMTP id 54ABC6B0088
	for <linux-mm@kvack.org>; Sat, 30 Jan 2021 19:16:35 -0500 (EST)
Received: from smtpin07.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251])
	by forelay04.hostedemail.com (Postfix) with ESMTP id 2141E1EE6
	for <linux-mm@kvack.org>; Sun, 31 Jan 2021 00:16:35 +0000 (UTC)
X-FDA: 77764153950.07.deer53_4808843275b5
Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251])
	by smtpin07.hostedemail.com (Postfix) with ESMTP id 047B81803F90F
	for <linux-mm@kvack.org>; Sun, 31 Jan 2021 00:16:35 +0000 (UTC)
X-HE-Tag: deer53_4808843275b5
X-Filterd-Recvd-Size: 12183
Received: from mail-pl1-f177.google.com (mail-pl1-f177.google.com [209.85.214.177])
	by imf01.hostedemail.com (Postfix) with ESMTP
	for <linux-mm@kvack.org>; Sun, 31 Jan 2021 00:16:34 +0000 (UTC)
Received: by mail-pl1-f177.google.com with SMTP id u11so7895703plg.13
        for <linux-mm@kvack.org>; Sat, 30 Jan 2021 16:16:34 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=vTdsuJQth2J6WGuWWZ2jdlnmzFKq4d9KvvhYC+YtYPM=;
        b=kQA8w6RBjWwklh54JZyl96kDKOlZPvVrUhbxGdhR4VwqYSyrGHXNPzm/7T3cNlhk+9
         Jt/AvSljO4Qqx7b7YU7hxsVKNvzENWz4PlbqvlqI+E9TprAVSCOcZyHWlqBIA/+uqfSQ
         H/S8DpVZPw3N8xREtnXd5OXhbe3kkWh8ll+f58bkGpfUo1Zbv5s274x2Gi4ID+RKAp4R
         dsjseZNxDFRCwZJpWykiRQBVCNOwCJE7jXQtt0juVEt2J0aHZazhW3DRmn9E+N7OgeSe
         eB/l7QMfwLa04lCx1txffJBhV+2JqFsbkWOClh9AaUL623Seyvv9XMyzl2+mj2y0DFjj
         saXg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=vTdsuJQth2J6WGuWWZ2jdlnmzFKq4d9KvvhYC+YtYPM=;
        b=WFS5vlda70r5UvYhBDpG+E+gaRzgJo8ChxLj1AAIJPE227hVMTNpqH9km1eUx8xHs5
         /ZxaH8rM/uNcAAIud6EuLBQ6v6daqQQnGMlZRa8TASKEWhi+Z0YNiF6uYx2f5/2WoGAN
         fnq4k2eUgmyI807stYnhZHd5YspnFVX5wtipsnCG5cPXm2U+e6TVvWMaQDt8mqVTaVgB
         pXcYC3dvk2w/CCXtINbDiwT7RuDV9SELd0DKyQXEzJHO4127vhZuyk+1Dp/yJajRK1eJ
         yOJ6Ti2hBHJHdpmkandmP9HVbcImWprsN9MAWKF7nKIg3fUYc6nEDgC/o6hk56+47rWu
         y3Ww==
X-Gm-Message-State: AOAM532csc6jcd5vjnrrA6qVe/kUsXRxfbLDmKiopmU1ZLrSfkXC00uy
	Gj1FBpQ17S6DwSZ3HSdkA0ao2MMQajc=
X-Google-Smtp-Source: ABdhPJxc1EgCFTX8mBi+r3yG2hwnghqgny3SVhg+g/XVGDEfxMAToxY0ux9sP9kqY7dHWCPL7E5z2w==
X-Received: by 2002:a17:90a:8c87:: with SMTP id b7mr11168725pjo.158.1612052193016;
        Sat, 30 Jan 2021 16:16:33 -0800 (PST)
Received: from sc2-haas01-esx0118.eng.vmware.com ([66.170.99.1])
        by smtp.gmail.com with ESMTPSA id e12sm13127365pga.13.2021.01.30.16.16.31
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Sat, 30 Jan 2021 16:16:32 -0800 (PST)
From: Nadav Amit <nadav.amit@gmail.com>
X-Google-Original-From: Nadav Amit
To: linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Cc: Nadav Amit <namit@vmware.com>,
	Mel Gorman <mgorman@techsingularity.net>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Andy Lutomirski <luto@kernel.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Will Deacon <will@kernel.org>,
	Yu Zhao <yuzhao@google.com>,
	x86@kernel.org
Subject: [RFC 20/20] mm/rmap: avoid potential races
Date: Sat, 30 Jan 2021 16:11:32 -0800
Message-Id: <20210131001132.3368247-21-namit@vmware.com>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20210131001132.3368247-1-namit@vmware.com>
References: <20210131001132.3368247-1-namit@vmware.com>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

From: Nadav Amit <namit@vmware.com>

flush_tlb_batched_pending() appears to have a theoretical race:
tlb_flush_batched is cleared only after the TLB flush, so if another
core calls set_tlb_ubc_flush_pending() in between and sets the pending
TLB flush indication, this indication might be lost. Holding the
page-table lock when SPLIT_LOCK is set cannot eliminate this race.

The current batched TLB invalidation scheme therefore does not seem
viable or easily repairable.

Introduce a new scheme, in which a cpumask is maintained for pending
batched TLB flushes. When a full TLB flush is performed, clear the
corresponding bit for the CPU that performs the TLB flush.

This scheme is only suitable for architectures that use IPIs for TLB
shootdowns. As x86 is the only architecture that currently uses batched
TLB flushes, this is not an issue.

Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: x86@kernel.org
---
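For illustration only: a user-space sketch of the interleaving described
in the changelog, replayed sequentially so that it is deterministic. All
toy_* names are hypothetical stand-ins, and the stale-entry counter is an
assumption made only to observe the effect; the kernel keeps no such
counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for per-mm state; not the kernel structures. */
static atomic_bool toy_flush_batched;	/* "a batched flush is pending" */
static atomic_int toy_stale_entries;	/* deferred, not yet flushed */

/* Unmapping side: a PTE is cleared and its flush is deferred. */
void toy_set_pending(void)
{
	atomic_fetch_add(&toy_stale_entries, 1);
	atomic_store(&toy_flush_batched, true);
}

/* First half of the old flush_tlb_batched_pending(): the flush itself. */
void toy_flush(void)
{
	atomic_store(&toy_stale_entries, 0);
}

/* Second half: the flag is cleared only after the flush has finished. */
void toy_clear_flag(void)
{
	atomic_store(&toy_flush_batched, false);
}

/* Replay the interleaving; true means the pending indication was lost. */
bool toy_replay_race(void)
{
	atomic_store(&toy_stale_entries, 0);
	atomic_store(&toy_flush_batched, false);

	toy_set_pending();	/* core A defers a flush */
	toy_flush();		/* core B flushes ... */
	toy_set_pending();	/* core A defers another flush in between */
	toy_clear_flag();	/* ... and core B then clears the flag */

	return atomic_load(&toy_stale_entries) > 0 &&
	       !atomic_load(&toy_flush_batched);
}
```

In this replay toy_replay_race() reports a lost indication: one deferred
flush is still outstanding while the flag reads false. The atomics only
make the shared accesses well defined; the problem is the ordering of
flush and clear, not tearing.
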
 arch/x86/include/asm/tlbbatch.h | 15 ------------
 arch/x86/include/asm/tlbflush.h |  2 +-
 arch/x86/mm/tlb.c               | 18 ++++++++++-----
 include/linux/mm.h              |  7 ++++++
 include/linux/mm_types_task.h   | 13 -----------
 mm/rmap.c                       | 41 ++++++++++++++++-----------------
 6 files changed, 40 insertions(+), 56 deletions(-)
 delete mode 100644 arch/x86/include/asm/tlbbatch.h
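
A minimal user-space sketch of the cpumask scheme introduced by this
patch, assuming at most 64 CPUs and using C11 atomics on a single word in
place of cpumask_t. The toy_* names are hypothetical; in the patch the
own-bit clearing is done in flush_tlb_func_common() and the check in
flush_tlb_batched_pending().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy global mask of CPUs with a pending batched flush (up to 64 CPUs). */
static _Atomic uint64_t toy_batched_mask;

/* Deferring side: mark every CPU that may cache entries of the mm. */
void toy_batch_add(uint64_t mm_cpus)
{
	atomic_fetch_or(&toy_batched_mask, mm_cpus);
}

/* Flushing side: after a full local flush, a CPU clears only its own bit. */
void toy_full_flush_done(unsigned int cpu)
{
	assert(cpu < 64);
	atomic_fetch_and(&toy_batched_mask, ~(UINT64_C(1) << cpu));
}

/* Analogue of the cpumask_intersects() test: is a flush still pending? */
bool toy_flush_pending(uint64_t mm_cpus)
{
	return (atomic_load(&toy_batched_mask) & mm_cpus) > 0;
}
```

In this sketch a flushing CPU only ever clears its own bit, so a bit set
for a different CPU by toy_batch_add() is never wiped out by that flush.
The sketch does not model the IRQ-disabled context that the patch relies
on for the clearing step.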

diff --git a/arch/x86/include/asm/tlbbatch.h b/arch/x86/include/asm/tlbba=
tch.h
deleted file mode 100644
index 1ad56eb3e8a8..000000000000
--- a/arch/x86/include/asm/tlbbatch.h
+++ /dev/null
@@ -1,15 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _ARCH_X86_TLBBATCH_H
-#define _ARCH_X86_TLBBATCH_H
-
-#include <linux/cpumask.h>
-
-struct arch_tlbflush_unmap_batch {
-	/*
-	 * Each bit set is a CPU that potentially has a TLB entry for one of
-	 * the PFNs being flushed..
-	 */
-	struct cpumask cpumask;
-};
-
-#endif /* _ARCH_X86_TLBBATCH_H */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbfl=
ush.h
index a4e7c90d11a8..0e681a565b78 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -240,7 +240,7 @@ static inline void flush_tlb_page(struct vm_area_stru=
ct *vma, unsigned long a)
 	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
 }
=20
-extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)=
;
+extern void arch_tlbbatch_flush(void);
=20
 static inline bool pte_may_need_flush(pte_t oldpte, pte_t newpte)
 {
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index ba85d6bb4988..f7304d45e6b9 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -760,8 +760,15 @@ static void flush_tlb_func_common(const struct flush=
_tlb_info *f,
 			count_vm_tlb_events(NR_TLB_LOCAL_FLUSH_ONE, nr_invalidate);
 		trace_tlb_flush(reason, nr_invalidate);
 	} else {
+		int cpu =3D smp_processor_id();
+
 		/* Full flush. */
 		flush_tlb_local();
+
+		/* If there are batched TLB flushes, mark them as done */
+		if (cpumask_test_cpu(cpu, &tlb_flush_batched_cpumask))
+			cpumask_clear_cpu(cpu, &tlb_flush_batched_cpumask);
+
 		if (local)
 			count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
 		trace_tlb_flush(reason, TLB_FLUSH_ALL);
@@ -1143,21 +1150,20 @@ static const struct flush_tlb_info full_flush_tlb=
_info =3D {
 	.end =3D TLB_FLUSH_ALL,
 };
=20
-void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
+void arch_tlbbatch_flush(void)
 {
 	int cpu =3D get_cpu();
=20
-	if (cpumask_test_cpu(cpu, &batch->cpumask)) {
+	if (cpumask_test_cpu(cpu, &tlb_flush_batched_cpumask)) {
 		lockdep_assert_irqs_enabled();
 		local_irq_disable();
 		flush_tlb_func_local(&full_flush_tlb_info, TLB_LOCAL_SHOOTDOWN);
 		local_irq_enable();
 	}
=20
-	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids)
-		flush_tlb_others(&batch->cpumask, &full_flush_tlb_info);
-
-	cpumask_clear(&batch->cpumask);
+	if (cpumask_any_but(&tlb_flush_batched_cpumask, cpu) < nr_cpu_ids)
+		flush_tlb_others(&tlb_flush_batched_cpumask,
+				 &full_flush_tlb_info);
=20
 	/*
 	 * We cannot call mark_mm_tlb_gen_done() since we do not know which
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a8a5bf82bd03..e4eeee985cf6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3197,5 +3197,12 @@ unsigned long wp_shared_mapping_range(struct addre=
ss_space *mapping,
=20
 extern int sysctl_nr_trim_pages;
=20
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+extern volatile cpumask_t tlb_flush_batched_cpumask;
+void tlb_batch_init(void);
+#else
+static inline void tlb_batch_init(void) { }
+#endif
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.=
h
index c1bc6731125c..742c542aaf3f 100644
--- a/include/linux/mm_types_task.h
+++ b/include/linux/mm_types_task.h
@@ -15,10 +15,6 @@
=20
 #include <asm/page.h>
=20
-#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
-#include <asm/tlbbatch.h>
-#endif
-
 #define USE_SPLIT_PTE_PTLOCKS	(NR_CPUS >=3D CONFIG_SPLIT_PTLOCK_CPUS)
 #define USE_SPLIT_PMD_PTLOCKS	(USE_SPLIT_PTE_PTLOCKS && \
 		IS_ENABLED(CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK))
@@ -75,15 +71,6 @@ struct page_frag {
 /* Track pages that require TLB flushes */
 struct tlbflush_unmap_batch {
 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
-	/*
-	 * The arch code makes the following promise: generic code can modify a
-	 * PTE, then call arch_tlbbatch_add_mm() (which internally provides all
-	 * needed barriers), then call arch_tlbbatch_flush(), and the entries
-	 * will be flushed on all CPUs by the time that arch_tlbbatch_flush()
-	 * returns.
-	 */
-	struct arch_tlbflush_unmap_batch arch;
-
 	/* True if a flush is needed. */
 	bool flush_required;
=20
diff --git a/mm/rmap.c b/mm/rmap.c
index 9655e1fc328a..0d2ac5a72d19 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -586,6 +586,18 @@ void page_unlock_anon_vma_read(struct anon_vma *anon=
_vma)
 }
=20
 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+
+/*
+ * TLB batching requires arch code to make the following promise: upon a
+ * full TLB flush, the CPU that performs the flush will clear its bit in
+ * tlb_flush_batched_cpumask atomically (i.e., during an IRQ or while
+ * interrupts are disabled). arch_tlbbatch_flush() is required to flush
+ * all the CPUs that are set in tlb_flush_batched_cpumask.
+ *
+ * This scheme is therefore only suitable for IPI-based TLB shootdowns.
+ */
+volatile cpumask_t tlb_flush_batched_cpumask =3D { 0 };
+
 /*
  * Flush TLB entries for recently unmapped pages from remote CPUs. It is
  * important if a PTE was dirty when it was unmapped that it's flushed
@@ -599,7 +611,7 @@ void try_to_unmap_flush(void)
 	if (!tlb_ubc->flush_required)
 		return;
=20
-	arch_tlbbatch_flush(&tlb_ubc->arch);
+	arch_tlbbatch_flush();
 	tlb_ubc->flush_required =3D false;
 	tlb_ubc->writable =3D false;
 }
@@ -613,27 +625,20 @@ void try_to_unmap_flush_dirty(void)
 		try_to_unmap_flush();
 }
=20
-static inline void tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *bat=
ch,
-				   struct mm_struct *mm)
+static inline void tlbbatch_add_mm(struct mm_struct *mm)
 {
+	cpumask_atomic_or(&tlb_flush_batched_cpumask, mm_cpumask(mm));
+
 	inc_mm_tlb_gen(mm);
-	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
 }
=20
 static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writabl=
e)
 {
 	struct tlbflush_unmap_batch *tlb_ubc =3D &current->tlb_ubc;
=20
-	tlbbatch_add_mm(&tlb_ubc->arch, mm);
+	tlbbatch_add_mm(mm);
 	tlb_ubc->flush_required =3D true;
=20
-	/*
-	 * Ensure compiler does not re-order the setting of tlb_flush_batched
-	 * before the PTE is cleared.
-	 */
-	barrier();
-	mm->tlb_flush_batched =3D true;
-
 	/*
 	 * If the PTE was dirty then it's best to assume it's writable. The
 	 * caller must use try_to_unmap_flush_dirty() or try_to_unmap_flush()
@@ -679,16 +684,10 @@ static bool should_defer_flush(struct mm_struct *mm=
, enum ttu_flags flags)
  */
 void flush_tlb_batched_pending(struct mm_struct *mm)
 {
-	if (data_race(mm->tlb_flush_batched)) {
-		flush_tlb_mm(mm);
+	if (!cpumask_intersects(mm_cpumask(mm), &tlb_flush_batched_cpumask))
+		return;
=20
-		/*
-		 * Do not allow the compiler to re-order the clearing of
-		 * tlb_flush_batched before the tlb is flushed.
-		 */
-		barrier();
-		mm->tlb_flush_batched =3D false;
-	}
+	flush_tlb_mm(mm);
 }
 #else
 static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writabl=
e)
--=20
2.25.1