Stable Archive on lore.kernel.org
* [v3 PATCH] mm: mmu_gather: remove __tlb_reset_range() for force flush
@ 2019-05-20  3:17 Yang Shi
  2019-05-21 23:18 ` Andrew Morton
  0 siblings, 1 reply; 3+ messages in thread
From: Yang Shi @ 2019-05-20  3:17 UTC
  To: jstancek, peterz, will.deacon, npiggin, aneesh.kumar, namit,
	minchan, mgorman, akpm
  Cc: yang.shi, stable, linux-mm, linux-kernel

A few new fields were added to mmu_gather to make TLB flushing smarter
for huge pages by recording which level of the page table has changed.
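
As a rough illustration of why per-level tracking helps, here is a
minimal userspace sketch. The struct, the function names and the shift
values (4K base pages, 2M and 1G huge mappings) are assumptions made for
the example; this is not the kernel's code.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed shifts: 4K base pages, 2M PMD mappings, 1G PUD mappings. */
#define MODEL_PAGE_SHIFT 12
#define MODEL_PMD_SHIFT  21
#define MODEL_PUD_SHIFT  30

/* Hypothetical subset of the per-level bits recorded while unmapping. */
struct gather_bits {
	bool cleared_ptes;
	bool cleared_pmds;
	bool cleared_puds;
};

/*
 * Pick the flush stride from the lowest level that was changed, and
 * fall back to the base page size when nothing was recorded.
 */
unsigned int model_unmap_shift(const struct gather_bits *bits)
{
	if (bits->cleared_ptes)
		return MODEL_PAGE_SHIFT;
	if (bits->cleared_pmds)
		return MODEL_PMD_SHIFT;
	if (bits->cleared_puds)
		return MODEL_PUD_SHIFT;
	return MODEL_PAGE_SHIFT;
}

/* Number of per-entry invalidations needed to cover [start, end). */
unsigned long model_flush_ops(const struct gather_bits *bits,
			      unsigned long start, unsigned long end)
{
	unsigned long stride = 1UL << model_unmap_shift(bits);

	assert(end >= start);
	return (end - start + stride - 1) / stride;
}
```

In this model, a 4M range that only touched PMD-level entries needs 2
invalidations, while the same range with no level recorded falls back to
1024 page-sized ones.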

__tlb_reset_range() resets all of this page table state to unchanged.
It is called by the TLB flush path when there are parallel mapping
changes to the same range under a non-exclusive lock (i.e. read
mmap_sem).  Before commit dd2283f2605e ("mm: mmap: zap pages with read
mmap_sem in munmap"), the syscalls (e.g. MADV_DONTNEED, MADV_FREE) which
may update PTEs in parallel didn't remove page tables.  But the
aforementioned commit may do munmap() under read mmap_sem and free page
tables.  This may result in a program hang on aarch64, as reported by
Jan Stancek.  The problem can be reproduced with a slightly modified
version of his test program, shown below.

---8<---

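#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

/* Not defined in the posted excerpt; 64MB is an assumed placeholder. */
#define DISTANT_MMAP_SIZE (64UL * 1024 * 1024)

static int map_size = 4096;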
static int num_iter = 500;
static long threads_total;

static void *distant_area;

void *map_write_unmap(void *ptr)
{
	int *fd = ptr;
	unsigned char *map_address;
	int i, j = 0;

	for (i = 0; i < num_iter; i++) {
		map_address = mmap(distant_area, (size_t) map_size, PROT_WRITE | PROT_READ,
			MAP_SHARED | MAP_ANONYMOUS, -1, 0);
		if (map_address == MAP_FAILED) {
			perror("mmap");
			exit(1);
		}

		for (j = 0; j < map_size; j++)
			map_address[j] = 'b';

		if (munmap(map_address, map_size) == -1) {
			perror("munmap");
			exit(1);
		}
	}

	return NULL;
}

void *dummy(void *ptr)
{
	return NULL;
}

int main(void)
{
	pthread_t thid[2];

	/* hint for mmap in map_write_unmap() */
	distant_area = mmap(0, DISTANT_MMAP_SIZE, PROT_WRITE | PROT_READ,
			MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	munmap(distant_area, (size_t)DISTANT_MMAP_SIZE);
	distant_area += DISTANT_MMAP_SIZE / 2;

	while (1) {
		pthread_create(&thid[0], NULL, map_write_unmap, NULL);
		pthread_create(&thid[1], NULL, dummy, NULL);

		pthread_join(thid[0], NULL);
		pthread_join(thid[1], NULL);
	}
}
---8<---

The program may lead to parallel execution like the following:

        t1                                        t2
munmap(map_address)
  downgrade_write(&mm->mmap_sem);
  unmap_region()
  tlb_gather_mmu()
    inc_tlb_flush_pending(tlb->mm);
  free_pgtables()
    tlb->freed_tables = 1
    tlb->cleared_pmds = 1

                                        pthread_exit()
                                        madvise(thread_stack, 8M, MADV_DONTNEED)
                                          zap_page_range()
                                            tlb_gather_mmu()
                                              inc_tlb_flush_pending(tlb->mm);

  tlb_finish_mmu()
    if (mm_tlb_flush_nested(tlb->mm))
      __tlb_reset_range()

__tlb_reset_range() resets the freed_tables and cleared_* bits, but this
causes an inconsistency for munmap(), which does free page tables.  As a
result, some architectures, e.g. aarch64, may not flush the TLB
completely as expected, leaving stale TLB entries behind.
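
The interleaving can be replayed with a small userspace model.  This is
only a sketch: gather_model, mm_model and the model_*() helpers are
hypothetical stand-ins for the kernel structures and functions named in
the trace, reduced to the fields that matter here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reduction of struct mmu_gather. */
struct gather_model {
	unsigned long start, end;
	bool freed_tables;
	bool cleared_ptes;
	bool cleared_pmds;
};

/* Hypothetical reduction of the mm's pending-flush counter. */
struct mm_model {
	int flush_pending;
};

/* Models tlb_gather_mmu(): start a batch and mark a flush as pending. */
void model_gather(struct mm_model *mm, struct gather_model *tlb)
{
	*tlb = (struct gather_model){ .start = ~0UL, .end = 0 };
	mm->flush_pending++;
}

/* Models mm_tlb_flush_nested(): more than one batch is in flight. */
bool model_flush_nested(const struct mm_model *mm)
{
	return mm->flush_pending > 1;
}

/* Models __tlb_reset_range(): forget everything that was recorded. */
void model_reset_range(struct gather_model *tlb)
{
	tlb->start = ~0UL;
	tlb->end = 0;
	tlb->freed_tables = false;
	tlb->cleared_ptes = false;
	tlb->cleared_pmds = false;
}

/* Models the pre-patch tlb_finish_mmu(): reset, then re-add the range. */
void model_finish_prepatch(struct mm_model *mm, struct gather_model *tlb,
			   unsigned long start, unsigned long end)
{
	assert(mm->flush_pending > 0);
	if (model_flush_nested(mm)) {
		model_reset_range(tlb);
		tlb->start = start;
		tlb->end = end;
	}
	/* A real flush would consume *tlb at this point. */
	mm->flush_pending--;
}

/* Replays the t1/t2 trace and returns what t1 is left to flush with. */
struct gather_model model_race(void)
{
	struct mm_model mm = { 0 };
	struct gather_model t1, t2;

	model_gather(&mm, &t1);		/* t1: munmap() */
	t1.freed_tables = true;		/* t1: free_pgtables() */
	t1.cleared_pmds = true;
	model_gather(&mm, &t2);		/* t2: madvise(MADV_DONTNEED) */
	model_finish_prepatch(&mm, &t1, 0x1000, 0x2000);
	return t1;
}
```

In this model, the state returned by model_race() still covers the
range but has freed_tables and cleared_pmds cleared, so a backend that
keys its invalidation on those bits no longer knows page tables were
freed.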

Use a fullmm flush since it yields much better performance on aarch64,
and non-fullmm doesn't yield a significant difference on x86.
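
The effect of the change on the forced-flush path can be sketched in
userspace as follows.  All names are hypothetical, and
model_backend_flush() is only loosely patterned on an arch backend that
uses freed_tables to choose between leaf-only and table-aware
invalidation; the real implementations may differ.

```c
#include <assert.h>
#include <stdbool.h>

enum flush_kind {
	FLUSH_NONE,
	FLUSH_RANGE_LEAF_ONLY,		/* last-level entries only */
	FLUSH_RANGE_WITH_TABLES,	/* also drops cached table walks */
	FLUSH_WHOLE_MM,
};

/* Hypothetical reduction of struct mmu_gather. */
struct gather_model {
	unsigned long start, end;
	bool fullmm;
	bool freed_tables;
};

/* Models __tlb_reset_range(); a fullmm batch keeps a non-empty range. */
void model_reset_range(struct gather_model *tlb)
{
	if (tlb->fullmm) {
		tlb->start = tlb->end = ~0UL;
	} else {
		tlb->start = ~0UL;
		tlb->end = 0;
	}
	tlb->freed_tables = false;
}

/* Assumed backend policy; not taken from any specific architecture. */
enum flush_kind model_backend_flush(const struct gather_model *tlb)
{
	if (!tlb->end)
		return FLUSH_NONE;
	if (tlb->fullmm)
		return tlb->freed_tables ? FLUSH_WHOLE_MM : FLUSH_NONE;
	return tlb->freed_tables ? FLUSH_RANGE_WITH_TABLES
				 : FLUSH_RANGE_LEAF_ONLY;
}

/* Forced flush before the patch: reset, then re-add only the range. */
enum flush_kind model_force_flush_prepatch(struct gather_model *tlb,
					   unsigned long start,
					   unsigned long end)
{
	assert(end > start);
	model_reset_range(tlb);
	tlb->start = start;
	tlb->end = end;
	return model_backend_flush(tlb);
}

/* Forced flush with the patch: fullmm, reset, then mark tables freed. */
enum flush_kind model_force_flush_patched(struct gather_model *tlb)
{
	tlb->fullmm = true;
	model_reset_range(tlb);
	tlb->freed_tables = true;
	return model_backend_flush(tlb);
}
```

In this model, a batch that freed page tables ends up with a leaf-only
range flush before the patch and a whole-mm flush after it.  The model
treats a fullmm batch with freed_tables clear as needing no flush, which
is why the sketch sets freed_tables after the reset.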

The originally proposed fix came from Jan Stancek, who did most of the
debugging of this issue; I just wrapped everything up together.

Reported-by: Jan Stancek <jstancek@redhat.com>
Tested-by: Jan Stancek <jstancek@redhat.com>
Suggested-by: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: stable@vger.kernel.org  4.20+
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Jan Stancek <jstancek@redhat.com>
---
v3: Adopted fullmm flush suggestion from Will
v2: Reworked the commit log per Peter and Will
    Adopted the suggestion from Peter

 mm/mmu_gather.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99740e1..289f8cf 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -245,14 +245,28 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
 {
 	/*
 	 * If there are parallel threads are doing PTE changes on same range
-	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
-	 * flush by batching, a thread has stable TLB entry can fail to flush
-	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
-	 * forcefully if we detect parallel PTE batching threads.
+	 * under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
+	 * flush by batching, one thread may end up seeing inconsistent PTEs
+	 * and result in having stale TLB entries.  So flush TLB forcefully
+	 * if we detect parallel PTE batching threads.
+	 *
+	 * However, some syscalls, e.g. munmap(), may free page tables, this
+	 * needs force flush everything in the given range. Otherwise this
+	 * may result in having stale TLB entries for some architectures,
+	 * e.g. aarch64, that could specify flush what level TLB.
 	 */
 	if (mm_tlb_flush_nested(tlb->mm)) {
+		/*
+		 * The aarch64 yields better performance with fullmm by
+		 * avoiding multiple CPUs spamming TLBI messages at the
+		 * same time.
+		 *
+		 * On x86 non-fullmm doesn't yield significant difference
+		 * against fullmm.
+		 */ 
+		tlb->fullmm = 1;
 		__tlb_reset_range(tlb);
-		__tlb_adjust_range(tlb, start, end - start);
+		tlb->freed_tables = 1;
 	}
 
 	tlb_flush_mmu(tlb);
-- 
1.8.3.1



* Re: [v3 PATCH] mm: mmu_gather: remove __tlb_reset_range() for force flush
  2019-05-20  3:17 [v3 PATCH] mm: mmu_gather: remove __tlb_reset_range() for force flush Yang Shi
@ 2019-05-21 23:18 ` Andrew Morton
  2019-05-22  1:00   ` Yang Shi
  0 siblings, 1 reply; 3+ messages in thread
From: Andrew Morton @ 2019-05-21 23:18 UTC
  To: Yang Shi
  Cc: jstancek, peterz, will.deacon, npiggin, aneesh.kumar, namit,
	minchan, mgorman, stable, linux-mm, linux-kernel

On Mon, 20 May 2019 11:17:32 +0800 Yang Shi <yang.shi@linux.alibaba.com> wrote:

> A few new fields were added to mmu_gather to make TLB flush smarter for
> huge page by telling what level of page table is changed.
> 
> __tlb_reset_range() is used to reset all these page table state to
> unchanged, which is called by TLB flush for parallel mapping changes for
> the same range under non-exclusive lock (i.e. read mmap_sem).  Before
> commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in
> munmap"), the syscalls (e.g. MADV_DONTNEED, MADV_FREE) which may update
> PTEs in parallel don't remove page tables.  But, the forementioned
> commit may do munmap() under read mmap_sem and free page tables.  This
> may result in program hang on aarch64 reported by Jan Stancek.  The
> problem could be reproduced by his test program with slightly modified
> below.
> 
> ...
> 
> Use fullmm flush since it yields much better performance on aarch64 and
> non-fullmm doesn't yields significant difference on x86.
> 
> The original proposed fix came from Jan Stancek who mainly debugged this
> issue, I just wrapped up everything together.

Thanks.  I'll add

Fixes: dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")

to this.


* Re: [v3 PATCH] mm: mmu_gather: remove __tlb_reset_range() for force flush
  2019-05-21 23:18 ` Andrew Morton
@ 2019-05-22  1:00   ` Yang Shi
  0 siblings, 0 replies; 3+ messages in thread
From: Yang Shi @ 2019-05-22  1:00 UTC
  To: Andrew Morton
  Cc: jstancek, peterz, will.deacon, npiggin, aneesh.kumar, namit,
	minchan, mgorman, stable, linux-mm, linux-kernel



On 5/22/19 7:18 AM, Andrew Morton wrote:
> On Mon, 20 May 2019 11:17:32 +0800 Yang Shi <yang.shi@linux.alibaba.com> wrote:
>
>> A few new fields were added to mmu_gather to make TLB flush smarter for
>> huge page by telling what level of page table is changed.
>>
>> __tlb_reset_range() is used to reset all these page table state to
>> unchanged, which is called by TLB flush for parallel mapping changes for
>> the same range under non-exclusive lock (i.e. read mmap_sem).  Before
>> commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in
>> munmap"), the syscalls (e.g. MADV_DONTNEED, MADV_FREE) which may update
>> PTEs in parallel don't remove page tables.  But, the forementioned
>> commit may do munmap() under read mmap_sem and free page tables.  This
>> may result in program hang on aarch64 reported by Jan Stancek.  The
>> problem could be reproduced by his test program with slightly modified
>> below.
>>
>> ...
>>
>> Use fullmm flush since it yields much better performance on aarch64 and
>> non-fullmm doesn't yields significant difference on x86.
>>
>> The original proposed fix came from Jan Stancek who mainly debugged this
>> issue, I just wrapped up everything together.
> Thanks.  I'll add
>
> Fixes: dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
>
> to this.

Thanks, Andrew.


