From: Nadav Amit <nadav.amit@gmail.com>
To: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>,
	"open list:MEMORY MANAGEMENT" <linux-mm@kvack.org>
Subject: Re: Potential race in TLB flush batching?
Date: Fri, 14 Jul 2017 16:16:44 -0700	[thread overview]
Message-ID: <A9CB595E-7C6D-438F-9835-A9EB8DA90892@gmail.com> (raw)
In-Reply-To: <20170713060706.o2cuko5y6irxwnww@suse.de>

Mel Gorman <mgorman@suse.de> wrote:

> On Wed, Jul 12, 2017 at 04:27:23PM -0700, Nadav Amit wrote:
>>> If reclaim is first, it'll take the PTL, set batched while a racing
>>> mprotect/munmap/etc spins. On release, the racing mprotect/munmap
>>> immediately calls flush_tlb_batched_pending() before proceeding as normal,
>>> finding pte_none with the TLB flushed.
>> 
>> This is the scenario I described in my example. Notice that when the first
>> flush_tlb_batched_pending is called, CPU0 and CPU1 hold different page-table
>> locks - allowing them to run concurrently. As a result,
>> flush_tlb_batched_pending is executed before the PTE is cleared, and
>> mm->tlb_flush_batched is cleared. Later, after CPU0 runs ptep_get_and_clear,
>> mm->tlb_flush_batched remains clear, and CPU1 can use the stale PTE.
> 
> If they hold different PTL locks, it means that reclaim and the parallel
> munmap/mprotect/madvise/mremap operation are operating on different regions
> of an mm or separate mm's and the race should not apply or at the very
> least is equivalent to not batching the flushes. For multiple parallel
> operations, munmap/mprotect/mremap are serialised by mmap_sem so there
> is only one risky operation at a time. For multiple madvise, there is a
> small window when a page is accessible after madvise returns but it is an
> advisory call so it's primarily a data integrity concern and the TLB is
> flushed before the page is either freed or IO starts on the reclaim side.

I think there is some miscommunication. Perhaps one detail was missing:

CPU0				CPU1
---- 				----
should_defer_flush
=> mm->tlb_flush_batched=true		
				flush_tlb_batched_pending (another PT)
				=> flush TLB
				=> mm->tlb_flush_batched=false

				Access PTE (and cache in TLB)
ptep_get_and_clear(PTE)
...

				flush_tlb_batched_pending (batched PT)
				[ no flush since tlb_flush_batched=false ]
				use the stale PTE
...
try_to_unmap_flush

There are only 2 CPUs, and both operate on the same address space. CPU0
reclaims a page from this address space. In the window between setting
tlb_flush_batched and the actual clearing of the PTE, the process on CPU1
runs munmap and calls flush_tlb_batched_pending. This can happen if CPU1
operates on a different page-table (and therefore holds a different PTL).

So CPU1 flushes the TLB and clears the tlb_flush_batched indication. Note,
however, that CPU0 still has not cleared the PTE, so CPU1 can access this PTE
and cache it. Then, after CPU0 clears the PTE, the process on CPU1 can try
to munmap the region that includes the cleared PTE. However, this time it
does not flush the TLB.

> +/*
> + * Ensure that any arch_tlbbatch_add_mm calls on this mm are up to date when
> + * this returns. Using the current mm tlb_gen means the TLB will be up to date
> + * with respect to the tlb_gen set at arch_tlbbatch_add_mm. If a flush has
> + * happened since then the IPIs will still be sent but the actual flush is
> + * avoided. Unfortunately the IPIs are necessary as the per-cpu context
> + * tlb_gens cannot be safely accessed.
> + */
> +void arch_tlbbatch_flush_one_mm(struct mm_struct *mm)
> +{
> +	int cpu;
> +	struct flush_tlb_info info = {
> +		.mm = mm,
> +		.new_tlb_gen = atomic64_read(&mm->context.tlb_gen),
> +		.start = 0,
> +		.end = TLB_FLUSH_ALL,
> +	};
> +
> +	cpu = get_cpu();
> +
> +	if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
> +		VM_WARN_ON(irqs_disabled());
> +		local_irq_disable();
> +		flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
> +		local_irq_enable();
> +	}
> +
> +	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
> +		flush_tlb_others(mm_cpumask(mm), &info);
> +
> +	put_cpu();
> +}
> +

It is a shame that after Andy collapsed all the different flushing flows,
you create a new one. How about squashing this untested patch into yours?

-- >8 --

Subject: x86/mm: refactor flush_tlb_mm_range and arch_tlbbatch_flush_one_mm

flush_tlb_mm_range() and arch_tlbbatch_flush_one_mm() share a lot of common
code. After the recent work on combining the x86 TLB userspace entry
flushes, it would be a shame to split them into different code paths again.

Refactor the common code into perform_tlb_flush().

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/mm/tlb.c | 48 +++++++++++++++++++-----------------------------
 1 file changed, 19 insertions(+), 29 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 248063dc5be8..56e00443a6cf 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -404,17 +404,30 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
  */
 static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
 
+static void perform_tlb_flush(struct mm_struct *mm, struct flush_tlb_info *info)
+{
+	int cpu = get_cpu();
+
+	if (info->mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
+		VM_WARN_ON(irqs_disabled());
+		local_irq_disable();
+		flush_tlb_func_local(info, TLB_LOCAL_MM_SHOOTDOWN);
+		local_irq_enable();
+	}
+
+	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
+		flush_tlb_others(mm_cpumask(mm), info);
+
+	put_cpu();
+}
+
 void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned long vmflag)
 {
-	int cpu;
-
 	struct flush_tlb_info info = {
 		.mm = mm,
 	};
 
-	cpu = get_cpu();
-
 	/* This is also a barrier that synchronizes with switch_mm(). */
 	info.new_tlb_gen = inc_mm_tlb_gen(mm);
 
@@ -429,17 +442,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 		info.end = TLB_FLUSH_ALL;
 	}
 
-	if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
-		VM_WARN_ON(irqs_disabled());
-		local_irq_disable();
-		flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
-		local_irq_enable();
-	}
-
-	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
-		flush_tlb_others(mm_cpumask(mm), &info);
-
-	put_cpu();
+	perform_tlb_flush(mm, &info);
 }
 
 
@@ -515,7 +518,6 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
  */
 void arch_tlbbatch_flush_one_mm(struct mm_struct *mm)
 {
-	int cpu;
 	struct flush_tlb_info info = {
 		.mm = mm,
 		.new_tlb_gen = atomic64_read(&mm->context.tlb_gen),
@@ -523,19 +525,7 @@ void arch_tlbbatch_flush_one_mm(struct mm_struct *mm)
 		.end = TLB_FLUSH_ALL,
 	};
 
-	cpu = get_cpu();
-
-	if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
-		VM_WARN_ON(irqs_disabled());
-		local_irq_disable();
-		flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
-		local_irq_enable();
-	}
-
-	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
-		flush_tlb_others(mm_cpumask(mm), &info);
-
-	put_cpu();
+	perform_tlb_flush(mm, &info);
 }
 
 static ssize_t tlbflush_read_file(struct file *file, char __user *user_buf,
--
