From: Nadav Amit <nadav.amit@gmail.com>
To: Mel Gorman <mgorman@suse.de>
Cc: "open list:MEMORY MANAGEMENT" <linux-mm@kvack.org>,
	Andy Lutomirski <luto@kernel.org>
Subject: Re: TLB batching breaks MADV_DONTNEED
Date: Wed, 19 Jul 2017 11:14:17 -0700
Message-ID: <196BA5A2-A4EA-43A0-8961-B5CF262CA745@gmail.com>
In-Reply-To: <20170719082316.ceuzf3wt34e6jy3s@suse.de>

Mel Gorman <mgorman@suse.de> wrote:

> On Tue, Jul 18, 2017 at 10:05:23PM -0700, Nadav Amit wrote:
>> Something seems to be really wrong with all these TLB flush batching
>> mechanisms that are all around the kernel. Here is another example, which was
>> not addressed by the recently submitted patches.
>> 
>> Consider what happens when two MADV_DONTNEED run concurrently. According to
>> the man page "After a successful MADV_DONTNEED operation … subsequent
>> accesses of pages in the range will succeed, but will result in …
>> zero-fill-on-demand pages for anonymous private mappings."
>> 
>> However, the test below, which does MADV_DONTNEED in two threads, reads "8"
>> and not "0" when reading the memory following MADV_DONTNEED. It happens
>> since one of the threads clears the PTE, but defers the TLB flush for some
>> time (until it finishes changing 16k PTEs). The main thread sees the PTE
>> already non-present and does not flush the TLB.
>> 
>> I think there is a need for a batching scheme that considers whether
>> mmap_sem is taken for write/read/nothing and the change to the PTE.
>> Unfortunately, I do not have the time to do it right now.
>> 
>> Am I missing something? Thoughts?
> 
> You're right that in this case, there will be a short window when the old
> anonymous data is still available. Non-anonymous doesn't matter in this case
> because if the data is unmapped but available from a stale TLB entry, all it
> means is that there is a delay in refetching the data from backing storage.
> 
> Technically, DONTNEED is not required to zero-fill the data but in the
> case of Linux, it actually does matter because the stale entry is
> pointing to a page that will be freed shortly. If a caller returns and
> uses a stale TLB entry to "reinitialise" the region then the writes may
> be lost.

And although I didn’t check, it may also have implications for userfaultfd,
which is often used together with MADV_DONTNEED.
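
For reference, the test I mentioned is shaped roughly like the sketch below
(not the original program; actually hitting the window depends on timing, and
on zapping enough PTEs that the first thread is still mid-zap with a deferred
flush):

#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* 16k PTEs, matching the scenario described above. */
#define SIZE	(16UL * 1024 * 4096)

static char *buf;

static void *dontneed(void *arg)
{
	/* Clears the PTEs, but the TLB flush may be deferred/batched. */
	madvise(buf, SIZE, MADV_DONTNEED);
	return NULL;
}

int main(void)
{
	pthread_t t;

	buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	memset(buf, 8, SIZE);

	pthread_create(&t, NULL, dontneed, NULL);

	/*
	 * May find the PTEs already non-present (cleared by the other
	 * thread) and therefore skip the TLB flush entirely.
	 */
	madvise(buf, SIZE, MADV_DONTNEED);

	/* The man page promises 0 here; a stale TLB entry can return 8. */
	printf("%d\n", buf[0]);

	pthread_join(t, NULL);
	return 0;
}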

> This is independent of the reclaim batching of flushes and specific to
> how madvise uses zap_page_range.
> 
> The most straightforward but overkill solution would be to take mmap_sem
> for write for madvise. That would have wide-ranging consequences and would
> likely be rejected.
> 
> A more reasonable solution would be to always flush the TLB range being
> madvised when the VMA is a private anonymous mapping to guarantee that
> a zero-fill-on-demand region exists. Other mappings do not need special
> protection as a parallel access will either use a stale TLB (no permission
> change so no problem) or refault the data. Special casing based on
> mmap_sem does not make much sense but is also unnecessary.
> 
> Something like this completely untested patch would point in the general
> direction, if a case can be found where this should be fixed. It
> could be optimised to only flush the local TLB but it's probably not worth
> the complexity.
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 9976852f1e1c..78bbe09e549e 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -497,6 +497,18 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
> 					unsigned long start, unsigned long end)
> {
> 	zap_page_range(vma, start, end - start);
> +
> +	/*
> +	 * A parallel madvise operation could have unmapped PTEs and deferred
> +	 * a flush before this madvise returns. Guarantee the TLB is flushed
> +	 * so that an immediate read after madvise will return zero's for
> +	 * private anonymous mappings. File-backed shared mappings do not
> +	 * matter as they will either use a stale TLB entry or refault the
> +	 * data in the event of a race.
> +	 */
> +	if (vma_is_anonymous(vma))
> +		flush_tlb_range(vma, start, end);
> +	
> 	return 0;
> }

It would work in this case, but it would very often result in a redundant
TLB flush. I also think that we still don’t understand the full extent of
the problem, judging by the issues that keep coming up. It may therefore be
better to be defensive and not try to avoid flushes too aggressively (e.g.,
on non-anonymous VMAs).

Here is what I have in mind (not tested). Exclusiveness is determined by
whether mmap_sem is acquired for write; if the caller does not hold it
exclusively, a TLB flush of the whole range is required. If I could use the
owner field of rwsem (when it is available), it would simplify the check of
whether the code runs exclusively.
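
Something along these lines, just to illustrate the idea (a hypothetical
helper, and it assumes CONFIG_RWSEM_SPIN_ON_OWNER, since the owner field does
not exist otherwise):

/*
 * Sketch only: guess whether the current task holds mmap_sem for write
 * by looking at the rwsem owner field. Without CONFIG_RWSEM_SPIN_ON_OWNER
 * we cannot tell, so assume non-exclusive and flush.
 */
static inline bool mmap_sem_owned_by_current(struct mm_struct *mm)
{
#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
	return READ_ONCE(mm->mmap_sem.owner) == current;
#else
	return false;
#endif
}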

Having said that, I still think the whole batching scheme needs to be
unified and rethought.

-- >8 --
Subject: [PATCH] mm: Add missing TLB flushes when zapping

From: Nadav Amit <namit@vmware.com>
---
 arch/s390/mm/gmap.c       |  2 +-
 arch/x86/mm/mpx.c         |  2 +-
 drivers/android/binder.c  |  2 +-
 include/asm-generic/tlb.h |  2 +-
 include/linux/mm.h        |  2 +-
 mm/madvise.c              |  2 +-
 mm/memory.c               | 25 ++++++++++++++++++++++++-
 7 files changed, 30 insertions(+), 7 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 4fb3d3cdb370..f11d78d74f64 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -690,7 +690,7 @@ void gmap_discard(struct gmap *gmap, unsigned long from, unsigned long to)
 		/* Find vma in the parent mm */
 		vma = find_vma(gmap->mm, vmaddr);
 		size = min(to - gaddr, PMD_SIZE - (gaddr & ~PMD_MASK));
-		zap_page_range(vma, vmaddr, size);
+		zap_page_range(vma, vmaddr, size, false);
 	}
 	up_read(&gmap->mm->mmap_sem);
 }
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 1c34b767c84c..e0b0caa09199 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -789,7 +789,7 @@ static noinline int zap_bt_entries_mapping(struct mm_struct *mm,
 			return -EINVAL;
 
 		len = min(vma->vm_end, end) - addr;
-		zap_page_range(vma, addr, len);
+		zap_page_range(vma, addr, len, true);
 		trace_mpx_unmap_zap(addr, addr+len);
 
 		vma = vma->vm_next;
diff --git a/drivers/android/binder.c b/drivers/android/binder.c
index aae4d8d4be36..7e3fdd0e6e21 100644
--- a/drivers/android/binder.c
+++ b/drivers/android/binder.c
@@ -658,7 +658,7 @@ static int binder_update_page_range(struct binder_proc *proc, int allocate,
 		page = &proc->pages[(page_addr - proc->buffer) / PAGE_SIZE];
 		if (vma)
 			zap_page_range(vma, (uintptr_t)page_addr +
-				proc->user_buffer_offset, PAGE_SIZE);
+				proc->user_buffer_offset, PAGE_SIZE, true);
 err_vm_insert_page_failed:
 		unmap_kernel_range((unsigned long)page_addr, PAGE_SIZE);
 err_map_kernel_failed:
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 9cd77edd642d..83ca2764f61a 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -121,7 +121,7 @@ extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
 
 static inline void __tlb_adjust_range(struct mmu_gather *tlb,
 				      unsigned long address,
-				      unsigned int range_size)
+				      unsigned long range_size)
 {
 	tlb->start = min(tlb->start, address);
 	tlb->end = max(tlb->end, address + range_size);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 46b9ac5e8569..bcd6b4138bec 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1203,7 +1203,7 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
 		unsigned long size);
 void zap_page_range(struct vm_area_struct *vma, unsigned long address,
-		unsigned long size);
+		unsigned long size, bool exclusive);
 void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long start, unsigned long end);
 
diff --git a/mm/madvise.c b/mm/madvise.c
index 9976852f1e1c..956f273fac2f 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -496,7 +496,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
 static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
 					unsigned long start, unsigned long end)
 {
-	zap_page_range(vma, start, end - start);
+	zap_page_range(vma, start, end - start, false);
 	return 0;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 0e517be91a89..becb8752a422 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1469,11 +1469,12 @@ void unmap_vmas(struct mmu_gather *tlb,
  * @vma: vm_area_struct holding the applicable pages
  * @start: starting address of pages to zap
  * @size: number of bytes to zap
+ * @exclusive: whether mmap_sem is acquired for write
  *
  * Caller must protect the VMA list
  */
 void zap_page_range(struct vm_area_struct *vma, unsigned long start,
-		unsigned long size)
+		unsigned long size, bool exclusive)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct mmu_gather tlb;
@@ -1486,6 +1487,20 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 	for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
 		unmap_single_vma(&tlb, vma, start, end, NULL);
 	mmu_notifier_invalidate_range_end(mm, start, end);
+
+	/*
+	 * If mmap_sem is not acquired for write, a TLB flush of the entire
+	 * range is needed, since zap_page_range may run while another thread
+	 * unmaps PTEs and defers TLB flushes to batch them. As a result, this
+	 * thread may not see present PTEs in zap_pte_range and would not ask to
+	 * flush them, although they were not flushed before.  This scenario can
+	 * happen, for example, when two threads use MADV_DONTNEED of
+	 * overlapping memory ranges concurrently. To avoid problems, if
+	 * mmap_sem is not held exclusively, we should flush the entire range.
+	 */
+	if (!exclusive)
+		__tlb_adjust_range(&tlb, start, size);
+
 	tlb_finish_mmu(&tlb, start, end);
 }
 
@@ -1511,6 +1526,14 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
 	mmu_notifier_invalidate_range_start(mm, address, end);
 	unmap_single_vma(&tlb, vma, address, end, details);
 	mmu_notifier_invalidate_range_end(mm, address, end);
+
+	/*
+	 * This code may race with other PTE unmapping (see comment in
+	 * zap_page_range). Always flush the TLB since we do not know if
+	 * mmap_sem is acquired for write.
+	 */
+	__tlb_adjust_range(&tlb, address, size);
+
 	tlb_finish_mmu(&tlb, address, end);
 }
 
-- 
2.11.0

 

Thread overview: 4+ messages
2017-07-19  5:05 TLB batching breaks MADV_DONTNEED Nadav Amit
2017-07-19  8:23 ` Mel Gorman
2017-07-19 18:14   ` Nadav Amit [this message]
2017-07-19 20:08     ` Mel Gorman
