From: Nadav Amit <namit@vmware.com>
To: Dave Hansen <dave.hansen@intel.com>
Cc: Linux-MM <linux-mm@kvack.org>, LKML <linux-kernel@vger.kernel.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Andy Lutomirski <luto@kernel.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Peter Xu <peterx@redhat.com>, Peter Zijlstra <peterz@infradead.org>,
	Thomas Gleixner <tglx@linutronix.de>, Will Deacon <will@kernel.org>,
	Yu Zhao <yuzhao@google.com>, Nick Piggin <npiggin@gmail.com>,
	"x86@kernel.org" <x86@kernel.org>
Subject: Re: [PATCH v2 2/5] mm: avoid unnecessary flush on change_huge_pmd()
Date: Tue, 26 Oct 2021 16:47:56 +0000
Message-ID: <E38AEB97-DE1B-4C91-A959-132EC24812AE@vmware.com>
In-Reply-To: <c415820a-aebb-265c-7f47-e048ee429102@intel.com>

> On Oct 26, 2021, at 9:06 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/21/21 5:21 AM, Nadav Amit wrote:
>> The first TLB flush is only necessary to prevent the dirty bit (and with
>> a lesser importance the access bit) from changing while the PTE is
>> modified. However, this is not necessary as the x86 CPUs set the
>> dirty-bit atomically with an additional check that the PTE is (still)
>> present. One caveat is Intel's Knights Landing that has a bug and does
>> not do so.
>
> First, did I miss the check in this patch for X86_BUG_PTE_LEAK? I don't
> see it anywhere.

No, it is me who missed it. It should have been in pmdp_invalidate_ad():

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 3481b35cb4ec..f14f64cc17b5 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -780,6 +780,30 @@ int pmd_clear_huge(pmd_t *pmd)
 	return 0;
 }

+/*
+ * pmdp_invalidate_ad() - prevents the access and dirty bits from being
+ * further updated by the CPU.
+ *
+ * Returns the original PMD.
+ *
+ * During an access to a page, x86 CPUs set the dirty and access bits
+ * atomically with an additional check of the present bit. Therefore, it is
+ * possible to avoid the TLB flush if we change the PTE atomically, as
+ * pmdp_establish does.
+ *
+ * We do not make this optimization on certain CPUs that have a bug that
+ * violates this behavior (specifically Knights Landing).
+ */
+pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
+			 pmd_t *pmdp)
+{
+	pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
+
+	if (cpu_feature_enabled(X86_BUG_PTE_LEAK))
+		flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+	return old;
+}

>>  - * pmdp_invalidate() is required to make sure we don't miss
>>  - * dirty/young flags set by hardware.
>
> This got me thinking... In here:
>
>> https://lore.kernel.org/lkml/20160708001909.FB2443E2@viggo.jf.intel.com/
>
> I wrote:
>
>> These bits are truly "stray". In the case of the Dirty bit, the
>> thread associated with the stray set was *not* allowed to write to
>> the page. This means that we do not have to launder the bit(s); we
>> can simply ignore them.
>
> Is the goal of your proposed patch here to ensure that the dirty bit is
> not set at *all*? Or, is it to ensure that a dirty bit which we need to
> *launder* is never set?

At *all*.

Err... I remembered from our previous discussions that the dirty bit
cannot be set once the R/W bit is cleared atomically. But going back to
the SDM, I see the (relatively new?)
note:

"If software on one logical processor writes to a page while software on
another logical processor concurrently clears the R/W flag in the
paging-structure entry that maps the page, execution on some processors may
result in the entry's dirty flag being set (due to the write on the first
logical processor) and the entry's R/W flag being clear (due to the update
to the entry on the second logical processor). This will never occur on a
processor that supports control-flow enforcement technology (CET)."

So I guess that this optimization can only be enabled when CET is
enabled. :(
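If the no-flush path really does have to be gated on CET, the check could be
folded into the same function. The following is a rough C-like sketch only,
not a tested patch; in particular, the CET feature-flag name below is an
assumption for illustration, not the actual kernel identifier.

```
pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
			 pmd_t *pmdp)
{
	pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));

	/*
	 * Without CET, per the SDM note, a concurrent write may still set
	 * the dirty bit after R/W is cleared, so the flush cannot be
	 * elided. X86_FEATURE_CET is a placeholder name here.
	 */
	if (cpu_feature_enabled(X86_BUG_PTE_LEAK) ||
	    !cpu_feature_enabled(X86_FEATURE_CET))
		flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
	return old;
}
```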
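For readers following along: the present-bit check that makes the non-flushing
xchg in pmdp_establish() safe can be modeled in user-space C. This is an
illustrative sketch only, not kernel code; the bit positions and helper names
are simplified stand-ins for the real page-walk logic.

```c
/*
 * User-space model of the interaction between the CPU's A/D-bit setter
 * and a software invalidation that atomically replaces the entry.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT (1ULL << 0)
#define PTE_DIRTY   (1ULL << 6)

/*
 * Models the CPU page walker: set the dirty bit with a cmpxchg that
 * re-checks the present bit on every attempt, so a non-present entry
 * can never pick up a stray dirty bit.
 */
static bool hw_set_dirty(uint64_t *pte)
{
	uint64_t old = __atomic_load_n(pte, __ATOMIC_RELAXED);

	for (;;) {
		if (!(old & PTE_PRESENT))
			return false;	/* entry gone: dirty bit not set */
		if (__atomic_compare_exchange_n(pte, &old, old | PTE_DIRTY,
						false, __ATOMIC_RELAXED,
						__ATOMIC_RELAXED))
			return true;
		/* cmpxchg failed: 'old' now holds the current value; retry */
	}
}

/*
 * Models pmdp_establish(): atomically swap in a non-present copy of the
 * entry and return the old value. Any dirty bit set before the xchg is
 * captured in the return value; none can appear afterwards, so no
 * pre-flush is needed to "freeze" the A/D bits.
 */
static uint64_t invalidate(uint64_t *pte)
{
	uint64_t cur = __atomic_load_n(pte, __ATOMIC_RELAXED);

	return __atomic_exchange_n(pte, cur & ~PTE_PRESENT, __ATOMIC_RELAXED);
}
```

After invalidate(), the simulated walker refuses to set the dirty bit, which
is exactly the property the patch relies on (absent the Knights Landing
erratum).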