* + mm-mmu_gather-remove-__tlb_reset_range-for-force-flush.patch added to -mm tree
@ 2019-05-21 23:18 akpm
  2019-05-27 11:01 ` Peter Zijlstra
  0 siblings, 1 reply; 12+ messages in thread
From: akpm @ 2019-05-21 23:18 UTC (permalink / raw)
  To: mm-commits, will.deacon, stable, peterz, npiggin, namit, minchan,
	mgorman, jstancek, aneesh.kumar, yang.shi


The patch titled
     Subject: mm: mmu_gather: remove __tlb_reset_range() for force flush
has been added to the -mm tree.  Its filename is
     mm-mmu_gather-remove-__tlb_reset_range-for-force-flush.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-mmu_gather-remove-__tlb_reset_range-for-force-flush.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-mmu_gather-remove-__tlb_reset_range-for-force-flush.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Yang Shi <yang.shi@linux.alibaba.com>
Subject: mm: mmu_gather: remove __tlb_reset_range() for force flush

A few new fields were added to mmu_gather to make TLB flushes smarter for
huge pages by tracking which levels of the page table have changed.

__tlb_reset_range() resets all of this page-table state to "unchanged"; it
is called from the TLB flush path when parallel mapping changes happen on
the same range under a non-exclusive lock (i.e. the read side of mmap_sem).
Before commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in
munmap"), the syscalls that may update PTEs in parallel (e.g.
MADV_DONTNEED, MADV_FREE) did not remove page tables.  But the
aforementioned commit may do munmap() under the read mmap_sem and free page
tables.  This can result in a program hang on aarch64, as reported by Jan
Stancek.  The problem can be reproduced with a slightly modified version of
his test program, shown below.

---8<---

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/mman.h>

/* size of the mmap hint area; value assumed here, any suitably large region works */
#define DISTANT_MMAP_SIZE (64 * 1024 * 1024)

static int map_size = 4096;
static int num_iter = 500;
static long threads_total;

static void *distant_area;

void *map_write_unmap(void *ptr)
{
	int *fd = ptr;
	unsigned char *map_address;
	int i, j = 0;

	for (i = 0; i < num_iter; i++) {
		map_address = mmap(distant_area, (size_t) map_size, PROT_WRITE | PROT_READ,
			MAP_SHARED | MAP_ANONYMOUS, -1, 0);
		if (map_address == MAP_FAILED) {
			perror("mmap");
			exit(1);
		}

		for (j = 0; j < map_size; j++)
			map_address[j] = 'b';

		if (munmap(map_address, map_size) == -1) {
			perror("munmap");
			exit(1);
		}
	}

	return NULL;
}

void *dummy(void *ptr)
{
	return NULL;
}

int main(void)
{
	pthread_t thid[2];

	/* hint for mmap in map_write_unmap() */
	distant_area = mmap(0, DISTANT_MMAP_SIZE, PROT_WRITE | PROT_READ,
			MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	munmap(distant_area, (size_t)DISTANT_MMAP_SIZE);
	distant_area += DISTANT_MMAP_SIZE / 2;

	while (1) {
		pthread_create(&thid[0], NULL, map_write_unmap, NULL);
		pthread_create(&thid[1], NULL, dummy, NULL);

		pthread_join(thid[0], NULL);
		pthread_join(thid[1], NULL);
	}
}
---8<---

The program can produce a parallel execution like the one below:

        t1                                        t2
munmap(map_address)
  downgrade_write(&mm->mmap_sem);
  unmap_region()
  tlb_gather_mmu()
    inc_tlb_flush_pending(tlb->mm);
  free_pgtables()
    tlb->freed_tables = 1
    tlb->cleared_pmds = 1

                                        pthread_exit()
                                        madvise(thread_stack, 8M, MADV_DONTNEED)
                                          zap_page_range()
                                            tlb_gather_mmu()
                                              inc_tlb_flush_pending(tlb->mm);

  tlb_finish_mmu()
    if (mm_tlb_flush_nested(tlb->mm))
      __tlb_reset_range()

__tlb_reset_range() would reset the freed_tables and cleared_* bits, but
this causes an inconsistency for munmap(), which does free page tables.  As
a result, some architectures, e.g. aarch64, may not flush the TLB as
completely as expected and may be left with stale TLB entries.
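
For reference, the state that __tlb_reset_range() wipes out looks roughly
like the sketch below.  This is modeled loosely on the v5.2-era generic
mmu_gather code, not copied from it; the struct is a deliberately reduced
stand-in for the real struct mmu_gather.

struct mmu_gather_model {
	unsigned long	start, end;
	unsigned int	fullmm       : 1;
	unsigned int	freed_tables : 1;	/* set by free_pgtables() */
	unsigned int	cleared_ptes : 1;
	unsigned int	cleared_pmds : 1;
	unsigned int	cleared_puds : 1;
	unsigned int	cleared_p4ds : 1;
};

/* Approximately what __tlb_reset_range() does to that state. */
static void tlb_reset_range_model(struct mmu_gather_model *tlb)
{
	if (tlb->fullmm) {
		tlb->start = tlb->end = ~0UL;
	} else {
		tlb->start = ~0UL;	/* TASK_SIZE in the real code */
		tlb->end = 0;
	}
	tlb->freed_tables = 0;		/* the hint munmap() relies on */
	tlb->cleared_ptes = 0;
	tlb->cleared_pmds = 0;
	tlb->cleared_puds = 0;
	tlb->cleared_p4ds = 0;
}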

Use a fullmm flush, since it yields much better performance on aarch64,
and non-fullmm doesn't yield a significant difference on x86.

The originally proposed fix came from Jan Stancek, who did most of the
debugging of this issue; I just wrapped everything up together.

Link: http://lkml.kernel.org/r/1558322252-113575-1-git-send-email-yang.shi@linux.alibaba.com
Fixes: dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Reported-by: Jan Stancek <jstancek@redhat.com>
Tested-by: Jan Stancek <jstancek@redhat.com>
Suggested-by: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: <stable@vger.kernel.org>	[4.20+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/mmu_gather.c |   24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

--- a/mm/mmu_gather.c~mm-mmu_gather-remove-__tlb_reset_range-for-force-flush
+++ a/mm/mmu_gather.c
@@ -245,14 +245,28 @@ void tlb_finish_mmu(struct mmu_gather *t
 {
 	/*
 	 * If there are parallel threads are doing PTE changes on same range
-	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
-	 * flush by batching, a thread has stable TLB entry can fail to flush
-	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
-	 * forcefully if we detect parallel PTE batching threads.
+	 * under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
+	 * flush by batching, one thread may end up seeing inconsistent PTEs
+	 * and result in having stale TLB entries.  So flush TLB forcefully
+	 * if we detect parallel PTE batching threads.
+	 *
+	 * However, some syscalls, e.g. munmap(), may free page tables, this
+	 * needs force flush everything in the given range. Otherwise this
+	 * may result in having stale TLB entries for some architectures,
+	 * e.g. aarch64, that could specify flush what level TLB.
 	 */
 	if (mm_tlb_flush_nested(tlb->mm)) {
+		/*
+		 * The aarch64 yields better performance with fullmm by
+		 * avoiding multiple CPUs spamming TLBI messages at the
+		 * same time.
+		 *
+		 * On x86 non-fullmm doesn't yield significant difference
+		 * against fullmm.
+		 */ 
+		tlb->fullmm = 1;
 		__tlb_reset_range(tlb);
-		__tlb_adjust_range(tlb, start, end - start);
+		tlb->freed_tables = 1;
 	}
 
 	tlb_flush_mmu(tlb);
_

Patches currently in -mm which might be from yang.shi@linux.alibaba.com are

mm-mmu_gather-remove-__tlb_reset_range-for-force-flush.patch


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: + mm-mmu_gather-remove-__tlb_reset_range-for-force-flush.patch added to -mm tree
  2019-05-21 23:18 + mm-mmu_gather-remove-__tlb_reset_range-for-force-flush.patch added to -mm tree akpm
@ 2019-05-27 11:01 ` Peter Zijlstra
  2019-05-27 13:29   ` Aneesh Kumar K.V
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2019-05-27 11:01 UTC (permalink / raw)
  To: akpm
  Cc: mm-commits, will.deacon, stable, npiggin, namit, minchan,
	mgorman, jstancek, aneesh.kumar, yang.shi

On Tue, May 21, 2019 at 04:18:33PM -0700, akpm@linux-foundation.org wrote:
> --- a/mm/mmu_gather.c~mm-mmu_gather-remove-__tlb_reset_range-for-force-flush
> +++ a/mm/mmu_gather.c
> @@ -245,14 +245,28 @@ void tlb_finish_mmu(struct mmu_gather *t
>  {
>  	/*
>  	 * If there are parallel threads are doing PTE changes on same range
> -	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
> -	 * flush by batching, a thread has stable TLB entry can fail to flush
> -	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
> -	 * forcefully if we detect parallel PTE batching threads.
> +	 * under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
> +	 * flush by batching, one thread may end up seeing inconsistent PTEs
> +	 * and result in having stale TLB entries.  So flush TLB forcefully
> +	 * if we detect parallel PTE batching threads.
> +	 *
> +	 * However, some syscalls, e.g. munmap(), may free page tables, this
> +	 * needs force flush everything in the given range. Otherwise this
> +	 * may result in having stale TLB entries for some architectures,
> +	 * e.g. aarch64, that could specify flush what level TLB.
>  	 */
>  	if (mm_tlb_flush_nested(tlb->mm)) {
> +		/*
> +		 * The aarch64 yields better performance with fullmm by
> +		 * avoiding multiple CPUs spamming TLBI messages at the
> +		 * same time.
> +		 *
> +		 * On x86 non-fullmm doesn't yield significant difference
> +		 * against fullmm.
> +		 */ 
> +		tlb->fullmm = 1;
>  		__tlb_reset_range(tlb);
> -		__tlb_adjust_range(tlb, start, end - start);
> +		tlb->freed_tables = 1;
>  	}
>  
>  	tlb_flush_mmu(tlb);

Nick, Aneesh, can we now do this?

---

diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 4d841369399f..8d28b83914cb 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -881,39 +881,6 @@ void radix__tlb_flush(struct mmu_gather *tlb)
 	 */
 	if (tlb->fullmm) {
 		__flush_all_mm(mm, true);
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLB_PAGE)
-	} else if (mm_tlb_flush_nested(mm)) {
-		/*
-		 * If there is a concurrent invalidation that is clearing ptes,
-		 * then it's possible this invalidation will miss one of those
-		 * cleared ptes and miss flushing the TLB. If this invalidate
-		 * returns before the other one flushes TLBs, that can result
-		 * in it returning while there are still valid TLBs inside the
-		 * range to be invalidated.
-		 *
-		 * See mm/memory.c:tlb_finish_mmu() for more details.
-		 *
-		 * The solution to this is ensure the entire range is always
-		 * flushed here. The problem for powerpc is that the flushes
-		 * are page size specific, so this "forced flush" would not
-		 * do the right thing if there are a mix of page sizes in
-		 * the range to be invalidated. So use __flush_tlb_range
-		 * which invalidates all possible page sizes in the range.
-		 *
-		 * PWC flush probably is not be required because the core code
-		 * shouldn't free page tables in this path, but accounting
-		 * for the possibility makes us a bit more robust.
-		 *
-		 * need_flush_all is an uncommon case because page table
-		 * teardown should be done with exclusive locks held (but
-		 * after locks are dropped another invalidate could come
-		 * in), it could be optimized further if necessary.
-		 */
-		if (!tlb->need_flush_all)
-			__radix__flush_tlb_range(mm, start, end, true);
-		else
-			radix__flush_all_mm(mm);
-#endif
 	} else if ( (psize = radix_get_mmu_psize(page_size)) == -1) {
 		if (!tlb->need_flush_all)
 			radix__flush_tlb_mm(mm);


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: + mm-mmu_gather-remove-__tlb_reset_range-for-force-flush.patch added to -mm tree
  2019-05-27 11:01 ` Peter Zijlstra
@ 2019-05-27 13:29   ` Aneesh Kumar K.V
  2019-05-27 14:25     ` Peter Zijlstra
  0 siblings, 1 reply; 12+ messages in thread
From: Aneesh Kumar K.V @ 2019-05-27 13:29 UTC (permalink / raw)
  To: Peter Zijlstra, akpm
  Cc: mm-commits, will.deacon, stable, npiggin, namit, minchan,
	mgorman, jstancek, yang.shi

On 5/27/19 4:31 PM, Peter Zijlstra wrote:
> On Tue, May 21, 2019 at 04:18:33PM -0700, akpm@linux-foundation.org wrote:
>> --- a/mm/mmu_gather.c~mm-mmu_gather-remove-__tlb_reset_range-for-force-flush
>> +++ a/mm/mmu_gather.c
>> @@ -245,14 +245,28 @@ void tlb_finish_mmu(struct mmu_gather *t
>>   {
>>   	/*
>>   	 * If there are parallel threads are doing PTE changes on same range
>> -	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
>> -	 * flush by batching, a thread has stable TLB entry can fail to flush
>> -	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
>> -	 * forcefully if we detect parallel PTE batching threads.
>> +	 * under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
>> +	 * flush by batching, one thread may end up seeing inconsistent PTEs
>> +	 * and result in having stale TLB entries.  So flush TLB forcefully
>> +	 * if we detect parallel PTE batching threads.
>> +	 *
>> +	 * However, some syscalls, e.g. munmap(), may free page tables, this
>> +	 * needs force flush everything in the given range. Otherwise this
>> +	 * may result in having stale TLB entries for some architectures,
>> +	 * e.g. aarch64, that could specify flush what level TLB.
>>   	 */
>>   	if (mm_tlb_flush_nested(tlb->mm)) {
>> +		/*
>> +		 * The aarch64 yields better performance with fullmm by
>> +		 * avoiding multiple CPUs spamming TLBI messages at the
>> +		 * same time.
>> +		 *
>> +		 * On x86 non-fullmm doesn't yield significant difference
>> +		 * against fullmm.
>> +		 */
>> +		tlb->fullmm = 1;
>>   		__tlb_reset_range(tlb);
>> -		__tlb_adjust_range(tlb, start, end - start);
>> +		tlb->freed_tables = 1;
>>   	}
>>   
>>   	tlb_flush_mmu(tlb);
> 
> Nick, Aneesh, can we now do this?
> 
> ---
> 
> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
> index 4d841369399f..8d28b83914cb 100644
> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
> @@ -881,39 +881,6 @@ void radix__tlb_flush(struct mmu_gather *tlb)
>   	 */
>   	if (tlb->fullmm) {
>   		__flush_all_mm(mm, true);
> -#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLB_PAGE)
> -	} else if (mm_tlb_flush_nested(mm)) {
> -		/*
> -		 * If there is a concurrent invalidation that is clearing ptes,
> -		 * then it's possible this invalidation will miss one of those
> -		 * cleared ptes and miss flushing the TLB. If this invalidate
> -		 * returns before the other one flushes TLBs, that can result
> -		 * in it returning while there are still valid TLBs inside the
> -		 * range to be invalidated.
> -		 *
> -		 * See mm/memory.c:tlb_finish_mmu() for more details.
> -		 *
> -		 * The solution to this is ensure the entire range is always
> -		 * flushed here. The problem for powerpc is that the flushes
> -		 * are page size specific, so this "forced flush" would not
> -		 * do the right thing if there are a mix of page sizes in
> -		 * the range to be invalidated. So use __flush_tlb_range
> -		 * which invalidates all possible page sizes in the range.
> -		 *
> -		 * PWC flush probably is not be required because the core code
> -		 * shouldn't free page tables in this path, but accounting
> -		 * for the possibility makes us a bit more robust.
> -		 *
> -		 * need_flush_all is an uncommon case because page table
> -		 * teardown should be done with exclusive locks held (but
> -		 * after locks are dropped another invalidate could come
> -		 * in), it could be optimized further if necessary.
> -		 */
> -		if (!tlb->need_flush_all)
> -			__radix__flush_tlb_range(mm, start, end, true);
> -		else
> -			radix__flush_all_mm(mm);
> -#endif
>   	} else if ( (psize = radix_get_mmu_psize(page_size)) == -1) {
>   		if (!tlb->need_flush_all)
>   			radix__flush_tlb_mm(mm);
> 


I guess we can revert most of commit
02390f66bd2362df114a0a0770d80ec33061f6d1.  That is the only place we
flush multiple page sizes?  But should we evaluate the performance
impact of that fullmm flush on ppc64?


-aneesh


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: + mm-mmu_gather-remove-__tlb_reset_range-for-force-flush.patch added to -mm tree
  2019-05-27 13:29   ` Aneesh Kumar K.V
@ 2019-05-27 14:25     ` Peter Zijlstra
  2019-05-30 21:55       ` Jan Stancek
  2019-05-31  2:46       ` Nicholas Piggin
  0 siblings, 2 replies; 12+ messages in thread
From: Peter Zijlstra @ 2019-05-27 14:25 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: akpm, mm-commits, will.deacon, stable, npiggin, namit, minchan,
	mgorman, jstancek, yang.shi

On Mon, May 27, 2019 at 06:59:08PM +0530, Aneesh Kumar K.V wrote:
> On 5/27/19 4:31 PM, Peter Zijlstra wrote:
> > On Tue, May 21, 2019 at 04:18:33PM -0700, akpm@linux-foundation.org wrote:
> > > --- a/mm/mmu_gather.c~mm-mmu_gather-remove-__tlb_reset_range-for-force-flush
> > > +++ a/mm/mmu_gather.c
> > > @@ -245,14 +245,28 @@ void tlb_finish_mmu(struct mmu_gather *t
> > >   {
> > >   	/*
> > >   	 * If there are parallel threads are doing PTE changes on same range
> > > -	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
> > > -	 * flush by batching, a thread has stable TLB entry can fail to flush
> > > -	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
> > > -	 * forcefully if we detect parallel PTE batching threads.
> > > +	 * under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
> > > +	 * flush by batching, one thread may end up seeing inconsistent PTEs
> > > +	 * and result in having stale TLB entries.  So flush TLB forcefully
> > > +	 * if we detect parallel PTE batching threads.
> > > +	 *
> > > +	 * However, some syscalls, e.g. munmap(), may free page tables, this
> > > +	 * needs force flush everything in the given range. Otherwise this
> > > +	 * may result in having stale TLB entries for some architectures,
> > > +	 * e.g. aarch64, that could specify flush what level TLB.
> > >   	 */
> > >   	if (mm_tlb_flush_nested(tlb->mm)) {
> > > +		/*
> > > +		 * The aarch64 yields better performance with fullmm by
> > > +		 * avoiding multiple CPUs spamming TLBI messages at the
> > > +		 * same time.
> > > +		 *
> > > +		 * On x86 non-fullmm doesn't yield significant difference
> > > +		 * against fullmm.
> > > +		 */
> > > +		tlb->fullmm = 1;
> > >   		__tlb_reset_range(tlb);
> > > -		__tlb_adjust_range(tlb, start, end - start);
> > > +		tlb->freed_tables = 1;
> > >   	}
> > >   	tlb_flush_mmu(tlb);
> > 
> > Nick, Aneesh, can we now do this?
> > 
> > ---
> > 
> > diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
> > index 4d841369399f..8d28b83914cb 100644
> > --- a/arch/powerpc/mm/book3s64/radix_tlb.c
> > +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
> > @@ -881,39 +881,6 @@ void radix__tlb_flush(struct mmu_gather *tlb)
> >   	 */
> >   	if (tlb->fullmm) {
> >   		__flush_all_mm(mm, true);
> > -#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLB_PAGE)
> > -	} else if (mm_tlb_flush_nested(mm)) {
> > -		/*
> > -		 * If there is a concurrent invalidation that is clearing ptes,
> > -		 * then it's possible this invalidation will miss one of those
> > -		 * cleared ptes and miss flushing the TLB. If this invalidate
> > -		 * returns before the other one flushes TLBs, that can result
> > -		 * in it returning while there are still valid TLBs inside the
> > -		 * range to be invalidated.
> > -		 *
> > -		 * See mm/memory.c:tlb_finish_mmu() for more details.
> > -		 *
> > -		 * The solution to this is ensure the entire range is always
> > -		 * flushed here. The problem for powerpc is that the flushes
> > -		 * are page size specific, so this "forced flush" would not
> > -		 * do the right thing if there are a mix of page sizes in
> > -		 * the range to be invalidated. So use __flush_tlb_range
> > -		 * which invalidates all possible page sizes in the range.
> > -		 *
> > -		 * PWC flush probably is not be required because the core code
> > -		 * shouldn't free page tables in this path, but accounting
> > -		 * for the possibility makes us a bit more robust.
> > -		 *
> > -		 * need_flush_all is an uncommon case because page table
> > -		 * teardown should be done with exclusive locks held (but
> > -		 * after locks are dropped another invalidate could come
> > -		 * in), it could be optimized further if necessary.
> > -		 */
> > -		if (!tlb->need_flush_all)
> > -			__radix__flush_tlb_range(mm, start, end, true);
> > -		else
> > -			radix__flush_all_mm(mm);
> > -#endif
> >   	} else if ( (psize = radix_get_mmu_psize(page_size)) == -1) {
> >   		if (!tlb->need_flush_all)
> >   			radix__flush_tlb_mm(mm);
> > 
> 
> 
> I guess we can revert most of the commit
> 02390f66bd2362df114a0a0770d80ec33061f6d1. That is the only place we flush
> multiple page sizes? . But should we evaluate the performance impact of that
> fullmm flush on ppc64?

Maybe, but given the patch that went into -mm, PPC will never hit that
branch I killed anymore -- and that really shouldn't be in architecture
code anyway.

Also; as I noted last time: __radix___flush_tlb_range() and
__radix__flush_tlb_range_psize() look similar enough that they might
want to be a single function (and instead of @flush_all_sizes, have it
take @gflush, @hflush, @flush and @pwc).

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: + mm-mmu_gather-remove-__tlb_reset_range-for-force-flush.patch added to -mm tree
  2019-05-27 14:25     ` Peter Zijlstra
@ 2019-05-30 21:55       ` Jan Stancek
  2019-05-31  2:46       ` Nicholas Piggin
  1 sibling, 0 replies; 12+ messages in thread
From: Jan Stancek @ 2019-05-30 21:55 UTC (permalink / raw)
  To: Peter Zijlstra, Aneesh Kumar K.V, akpm
  Cc: mm-commits, will deacon, stable, npiggin, namit, minchan,
	mgorman, yang shi, Jan Stancek



----- Original Message -----
> On Mon, May 27, 2019 at 06:59:08PM +0530, Aneesh Kumar K.V wrote:
> > On 5/27/19 4:31 PM, Peter Zijlstra wrote:
> > > On Tue, May 21, 2019 at 04:18:33PM -0700, akpm@linux-foundation.org
> > > wrote:
> > > > ---
> > > > a/mm/mmu_gather.c~mm-mmu_gather-remove-__tlb_reset_range-for-force-flush
> > > > +++ a/mm/mmu_gather.c
> > > > @@ -245,14 +245,28 @@ void tlb_finish_mmu(struct mmu_gather *t
> > > >   {
> > > >   	/*
> > > >   	 * If there are parallel threads are doing PTE changes on same range
> > > > -	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
> > > > -	 * flush by batching, a thread has stable TLB entry can fail to flush
> > > > -	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
> > > > -	 * forcefully if we detect parallel PTE batching threads.
> > > > +	 * under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
> > > > +	 * flush by batching, one thread may end up seeing inconsistent PTEs
> > > > +	 * and result in having stale TLB entries.  So flush TLB forcefully
> > > > +	 * if we detect parallel PTE batching threads.
> > > > +	 *
> > > > +	 * However, some syscalls, e.g. munmap(), may free page tables, this
> > > > +	 * needs force flush everything in the given range. Otherwise this
> > > > +	 * may result in having stale TLB entries for some architectures,
> > > > +	 * e.g. aarch64, that could specify flush what level TLB.
> > > >   	 */
> > > >   	if (mm_tlb_flush_nested(tlb->mm)) {
> > > > +		/*
> > > > +		 * The aarch64 yields better performance with fullmm by
> > > > +		 * avoiding multiple CPUs spamming TLBI messages at the
> > > > +		 * same time.
> > > > +		 *
> > > > +		 * On x86 non-fullmm doesn't yield significant difference
> > > > +		 * against fullmm.
> > > > +		 */
> > > > +		tlb->fullmm = 1;
> > > >   		__tlb_reset_range(tlb);
> > > > -		__tlb_adjust_range(tlb, start, end - start);
> > > > +		tlb->freed_tables = 1;
> > > >   	}
> > > >   	tlb_flush_mmu(tlb);
> > > 
> > > Nick, Aneesh, can we now do this?
> > > 
> > > ---
> > > 
> > > diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c
> > > b/arch/powerpc/mm/book3s64/radix_tlb.c
> > > index 4d841369399f..8d28b83914cb 100644
> > > --- a/arch/powerpc/mm/book3s64/radix_tlb.c
> > > +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
> > > @@ -881,39 +881,6 @@ void radix__tlb_flush(struct mmu_gather *tlb)
> > >   	 */
> > >   	if (tlb->fullmm) {
> > >   		__flush_all_mm(mm, true);
> > > -#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLB_PAGE)
> > > -	} else if (mm_tlb_flush_nested(mm)) {
> > > -		/*
> > > -		 * If there is a concurrent invalidation that is clearing ptes,
> > > -		 * then it's possible this invalidation will miss one of those
> > > -		 * cleared ptes and miss flushing the TLB. If this invalidate
> > > -		 * returns before the other one flushes TLBs, that can result
> > > -		 * in it returning while there are still valid TLBs inside the
> > > -		 * range to be invalidated.
> > > -		 *
> > > -		 * See mm/memory.c:tlb_finish_mmu() for more details.
> > > -		 *
> > > -		 * The solution to this is ensure the entire range is always
> > > -		 * flushed here. The problem for powerpc is that the flushes
> > > -		 * are page size specific, so this "forced flush" would not
> > > -		 * do the right thing if there are a mix of page sizes in
> > > -		 * the range to be invalidated. So use __flush_tlb_range
> > > -		 * which invalidates all possible page sizes in the range.
> > > -		 *
> > > -		 * PWC flush probably is not be required because the core code
> > > -		 * shouldn't free page tables in this path, but accounting
> > > -		 * for the possibility makes us a bit more robust.
> > > -		 *
> > > -		 * need_flush_all is an uncommon case because page table
> > > -		 * teardown should be done with exclusive locks held (but
> > > -		 * after locks are dropped another invalidate could come
> > > -		 * in), it could be optimized further if necessary.
> > > -		 */
> > > -		if (!tlb->need_flush_all)
> > > -			__radix__flush_tlb_range(mm, start, end, true);
> > > -		else
> > > -			radix__flush_all_mm(mm);
> > > -#endif
> > >   	} else if ( (psize = radix_get_mmu_psize(page_size)) == -1) {
> > >   		if (!tlb->need_flush_all)
> > >   			radix__flush_tlb_mm(mm);
> > > 
> > 
> > 
> > I guess we can revert most of the commit
> > 02390f66bd2362df114a0a0770d80ec33061f6d1. That is the only place we flush
> > multiple page sizes? . But should we evaluate the performance impact of
> > that
> > fullmm flush on ppc64?

Hi,

I ran a test on ppc64le, using [1] as the benchmark (it's the same one that
produced the aarch64 numbers in [2]):

16 threads, each looping 100k times over (see the sketch after this list):
  mmap(16M)
  touch 1 page
  madvise(DONTNEED)
  munmap(16M)
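
A self-contained sketch of that per-thread loop is below.  It is
reconstructed from the description above rather than taken from mmap8.c
itself (see [1] for the real benchmark), so the exact flags and constants
are assumptions:

#include <pthread.h>
#include <stdlib.h>
#include <sys/mman.h>

#define NTHREADS	16
#define MAP_LEN		(16 * 1024 * 1024)	/* 16M per iteration */
#define ITERATIONS	100000

/* Each worker: mmap 16M, touch one page, MADV_DONTNEED, munmap. */
static void *worker(void *arg)
{
	int i;

	for (i = 0; i < ITERATIONS; i++) {
		char *p = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			exit(1);

		p[0] = 1;				/* touch 1 page */
		madvise(p, MAP_LEN, MADV_DONTNEED);	/* parallel PTE zap */
		munmap(p, MAP_LEN);			/* frees page tables */
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}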

10 runs averaged with and without v3 patch:

v5.2-rc2-24-gbec7550cca10
--------------------------
        mean     stddev
real    37.382   2.780
user     1.420   0.078
sys     54.658   1.855

v5.2-rc2-24-gbec7550cca10 + "mm: mmu_gather: remove __tlb_reset_range() for force flush"
-----------------------------------------------------------------------------------------
        mean     stddev
real    37.119   2.105
user     1.548   0.087
sys     55.698   1.357

Hardware
---------
Power9 Model 8335-GTH (ibm,witherspoon,128CPUs,256GB RAM,6 nodes).

cpu             : POWER9, altivec supported
clock           : 3300.000000MHz
revision        : 2.2 (pvr 004e 1202)

# numactl -H
available: 6 nodes (0,8,252-255)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 0 size: 130722 MB
node 0 free: 127327 MB
node 8 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 8 size: 130784 MB
node 8 free: 129864 MB
node 252 cpus:
node 252 size: 0 MB
node 252 free: 0 MB
node 253 cpus:
node 253 size: 0 MB
node 253 free: 0 MB
node 254 cpus:
node 254 size: 0 MB
node 254 free: 0 MB
node 255 cpus:
node 255 size: 0 MB
node 255 free: 0 MB
node distances:
node   0   8  252  253  254  255
  0:  10  40  80  80  80  80
  8:  40  10  80  80  80  80
 252:  80  80  10  80  80  80
 253:  80  80  80  10  80  80
 254:  80  80  80  80  10  80
 255:  80  80  80  80  80  10

Regards,
Jan

[1] https://github.com/jstancek/reproducers/blob/master/kernel/page_fault_stall/mmap8.c
[2] https://lore.kernel.org/linux-mm/1158926942.23199905.1558020575293.JavaMail.zimbra@redhat.com/

> 
> Maybe, but given the patch that went into -mm, PPC will never hit that
> branch I killed anymore -- and that really shouldn't be in architecture
> code anyway.
> 
> Also; as I noted last time: __radix___flush_tlb_range() and
> __radix__flush_tlb_range_psize() look similar enough that they might
> want to be a single function (and instead of @flush_all_sizes, have it
> take @gflush, @hflush, @flush and @pwc).
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: + mm-mmu_gather-remove-__tlb_reset_range-for-force-flush.patch added to -mm tree
  2019-05-27 14:25     ` Peter Zijlstra
  2019-05-30 21:55       ` Jan Stancek
@ 2019-05-31  2:46       ` Nicholas Piggin
  2019-05-31  9:49         ` Peter Zijlstra
  1 sibling, 1 reply; 12+ messages in thread
From: Nicholas Piggin @ 2019-05-31  2:46 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Peter Zijlstra
  Cc: akpm, jstancek, mgorman, minchan, mm-commits, namit, stable,
	will.deacon, yang.shi

Peter Zijlstra's on May 28, 2019 12:25 am:
> On Mon, May 27, 2019 at 06:59:08PM +0530, Aneesh Kumar K.V wrote:
>> On 5/27/19 4:31 PM, Peter Zijlstra wrote:
>> > On Tue, May 21, 2019 at 04:18:33PM -0700, akpm@linux-foundation.org wrote:
>> > > --- a/mm/mmu_gather.c~mm-mmu_gather-remove-__tlb_reset_range-for-force-flush
>> > > +++ a/mm/mmu_gather.c
>> > > @@ -245,14 +245,28 @@ void tlb_finish_mmu(struct mmu_gather *t
>> > >   {
>> > >   	/*
>> > >   	 * If there are parallel threads are doing PTE changes on same range
>> > > -	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
>> > > -	 * flush by batching, a thread has stable TLB entry can fail to flush
>> > > -	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
>> > > -	 * forcefully if we detect parallel PTE batching threads.
>> > > +	 * under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
>> > > +	 * flush by batching, one thread may end up seeing inconsistent PTEs
>> > > +	 * and result in having stale TLB entries.  So flush TLB forcefully
>> > > +	 * if we detect parallel PTE batching threads.
>> > > +	 *
>> > > +	 * However, some syscalls, e.g. munmap(), may free page tables, this
>> > > +	 * needs force flush everything in the given range. Otherwise this
>> > > +	 * may result in having stale TLB entries for some architectures,
>> > > +	 * e.g. aarch64, that could specify flush what level TLB.
>> > >   	 */
>> > >   	if (mm_tlb_flush_nested(tlb->mm)) {
>> > > +		/*
>> > > +		 * The aarch64 yields better performance with fullmm by
>> > > +		 * avoiding multiple CPUs spamming TLBI messages at the
>> > > +		 * same time.
>> > > +		 *
>> > > +		 * On x86 non-fullmm doesn't yield significant difference
>> > > +		 * against fullmm.
>> > > +		 */
>> > > +		tlb->fullmm = 1;
>> > >   		__tlb_reset_range(tlb);
>> > > -		__tlb_adjust_range(tlb, start, end - start);
>> > > +		tlb->freed_tables = 1;
>> > >   	}
>> > >   	tlb_flush_mmu(tlb);
>> > 
>> > Nick, Aneesh, can we now do this?

Sorry I meant to get to that but forgot.

>> > 
>> > ---
>> > 
>> > diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
>> > index 4d841369399f..8d28b83914cb 100644
>> > --- a/arch/powerpc/mm/book3s64/radix_tlb.c
>> > +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
>> > @@ -881,39 +881,6 @@ void radix__tlb_flush(struct mmu_gather *tlb)
>> >   	 */
>> >   	if (tlb->fullmm) {
>> >   		__flush_all_mm(mm, true);
>> > -#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLB_PAGE)
>> > -	} else if (mm_tlb_flush_nested(mm)) {
>> > -		/*
>> > -		 * If there is a concurrent invalidation that is clearing ptes,
>> > -		 * then it's possible this invalidation will miss one of those
>> > -		 * cleared ptes and miss flushing the TLB. If this invalidate
>> > -		 * returns before the other one flushes TLBs, that can result
>> > -		 * in it returning while there are still valid TLBs inside the
>> > -		 * range to be invalidated.
>> > -		 *
>> > -		 * See mm/memory.c:tlb_finish_mmu() for more details.
>> > -		 *
>> > -		 * The solution to this is ensure the entire range is always
>> > -		 * flushed here. The problem for powerpc is that the flushes
>> > -		 * are page size specific, so this "forced flush" would not
>> > -		 * do the right thing if there are a mix of page sizes in
>> > -		 * the range to be invalidated. So use __flush_tlb_range
>> > -		 * which invalidates all possible page sizes in the range.
>> > -		 *
>> > -		 * PWC flush probably is not be required because the core code
>> > -		 * shouldn't free page tables in this path, but accounting
>> > -		 * for the possibility makes us a bit more robust.
>> > -		 *
>> > -		 * need_flush_all is an uncommon case because page table
>> > -		 * teardown should be done with exclusive locks held (but
>> > -		 * after locks are dropped another invalidate could come
>> > -		 * in), it could be optimized further if necessary.
>> > -		 */
>> > -		if (!tlb->need_flush_all)
>> > -			__radix__flush_tlb_range(mm, start, end, true);
>> > -		else
>> > -			radix__flush_all_mm(mm);
>> > -#endif
>> >   	} else if ( (psize = radix_get_mmu_psize(page_size)) == -1) {
>> >   		if (!tlb->need_flush_all)
>> >   			radix__flush_tlb_mm(mm);
>> > 
>> 
>> 
>> I guess we can revert most of the commit
>> 02390f66bd2362df114a0a0770d80ec33061f6d1. That is the only place we flush
>> multiple page sizes? . But should we evaluate the performance impact of that
>> fullmm flush on ppc64?
> 
> Maybe, but given the patch that went into -mm, PPC will never hit that
> branch I killed anymore -- and that really shouldn't be in architecture
> code anyway.

Yeah well if mm/ does this then sure it's dead and can go.

I don't think it's very nice to set fullmm and freed_tables for this 
case though. Is this concurrent zapping an important fast path? It
must have been, in order to justify all this complexity to the mm, so
we don't want to tie this boat anchor to it AFAIKS?

Is the problem just that the freed page tables flags get cleared by
__tlb_reset_range()? Why not just remove that then, so the bits are
set properly for the munmap?
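
Read literally, that alternative would amount to something like the sketch
below in tlb_finish_mmu() -- illustrative only, reusing the helpers from
the quoted hunk, not a tested patch:

	if (mm_tlb_flush_nested(tlb->mm)) {
		/*
		 * Keep the freed_tables/cleared_* hints accumulated by the
		 * munmap() path (i.e. skip __tlb_reset_range()) and only
		 * make sure the gathered range covers everything.
		 */
		__tlb_adjust_range(tlb, start, end - start);
	}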

> Also; as I noted last time: __radix___flush_tlb_range() and
> __radix__flush_tlb_range_psize() look similar enough that they might
> want to be a single function (and instead of @flush_all_sizes, have it
> take @gflush, @hflush, @flush and @pwc).

Yeah, it could possibly use a cleanup pass. I have a few patches
sitting around to make a few more optimisations around the place,
was going to look at some refactoring after that.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: + mm-mmu_gather-remove-__tlb_reset_range-for-force-flush.patch added to -mm tree
  2019-05-31  2:46       ` Nicholas Piggin
@ 2019-05-31  9:49         ` Peter Zijlstra
  2019-06-03  2:11           ` Nicholas Piggin
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2019-05-31  9:49 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Aneesh Kumar K.V, akpm, jstancek, mgorman, minchan, mm-commits,
	namit, stable, will.deacon, yang.shi

On Fri, May 31, 2019 at 12:46:56PM +1000, Nicholas Piggin wrote:
> Peter Zijlstra's on May 28, 2019 12:25 am:
> > On Mon, May 27, 2019 at 06:59:08PM +0530, Aneesh Kumar K.V wrote:
> >> On 5/27/19 4:31 PM, Peter Zijlstra wrote:
> >> > On Tue, May 21, 2019 at 04:18:33PM -0700, akpm@linux-foundation.org wrote:
> >> > > --- a/mm/mmu_gather.c~mm-mmu_gather-remove-__tlb_reset_range-for-force-flush
> >> > > +++ a/mm/mmu_gather.c
> >> > > @@ -245,14 +245,28 @@ void tlb_finish_mmu(struct mmu_gather *t
> >> > >   {
> >> > >   	/*
> >> > >   	 * If there are parallel threads are doing PTE changes on same range
> >> > > -	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
> >> > > -	 * flush by batching, a thread has stable TLB entry can fail to flush
> >> > > -	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
> >> > > -	 * forcefully if we detect parallel PTE batching threads.
> >> > > +	 * under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
> >> > > +	 * flush by batching, one thread may end up seeing inconsistent PTEs
> >> > > +	 * and result in having stale TLB entries.  So flush TLB forcefully
> >> > > +	 * if we detect parallel PTE batching threads.
> >> > > +	 *
> >> > > +	 * However, some syscalls, e.g. munmap(), may free page tables, this
> >> > > +	 * needs force flush everything in the given range. Otherwise this
> >> > > +	 * may result in having stale TLB entries for some architectures,
> >> > > +	 * e.g. aarch64, that could specify flush what level TLB.
> >> > >   	 */
> >> > >   	if (mm_tlb_flush_nested(tlb->mm)) {
> >> > > +		/*
> >> > > +		 * The aarch64 yields better performance with fullmm by
> >> > > +		 * avoiding multiple CPUs spamming TLBI messages at the
> >> > > +		 * same time.
> >> > > +		 *
> >> > > +		 * On x86 non-fullmm doesn't yield significant difference
> >> > > +		 * against fullmm.
> >> > > +		 */
> >> > > +		tlb->fullmm = 1;
> >> > >   		__tlb_reset_range(tlb);
> >> > > -		__tlb_adjust_range(tlb, start, end - start);
> >> > > +		tlb->freed_tables = 1;
> >> > >   	}
> >> > >   	tlb_flush_mmu(tlb);

> > Maybe, but given the patch that went into -mm, PPC will never hit that
> > branch I killed anymore -- and that really shouldn't be in architecture
> > code anyway.
> 
> Yeah well if mm/ does this then sure it's dead and can go.
> 
> I don't think it's very nice to set fullmm and freed_tables for this 
> case though. Is this concurrent zapping an important fast path? It
> must have been, in order to justify all this complexity to the mm, so
> we don't want to tie this boat anchor to it AFAIKS?

I'm not convinced it's an important fast path; afaict it is an
unfortunate correctness issue caused by allowing concurrent frees.

> Is the problem just that the freed page tables flags get cleared by
> __tlb_reset_range()? Why not just remove that then, so the bits are
> set properly for the munmap?

That's insufficient; as argued in my initial suggestion:

  https://lkml.kernel.org/r/20190509103813.GP2589@hirez.programming.kicks-ass.net

Since we don't know what was flushed by the concurrent flushes, we must
flush all state (page sizes, tables, etc.).

But it looks like benchmarks (for the one test-case we have) seem to
favour flushing the world over flushing a smaller range.



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: + mm-mmu_gather-remove-__tlb_reset_range-for-force-flush.patch added to -mm tree
  2019-05-31  9:49         ` Peter Zijlstra
@ 2019-06-03  2:11           ` Nicholas Piggin
  2019-06-03 10:30             ` Will Deacon
  0 siblings, 1 reply; 12+ messages in thread
From: Nicholas Piggin @ 2019-06-03  2:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: akpm, Aneesh Kumar K.V, jstancek, mgorman, minchan, mm-commits,
	namit, stable, will.deacon, yang.shi

Peter Zijlstra's on May 31, 2019 7:49 pm:
> On Fri, May 31, 2019 at 12:46:56PM +1000, Nicholas Piggin wrote:
>> Peter Zijlstra's on May 28, 2019 12:25 am:
>> > On Mon, May 27, 2019 at 06:59:08PM +0530, Aneesh Kumar K.V wrote:
>> >> On 5/27/19 4:31 PM, Peter Zijlstra wrote:
>> >> > On Tue, May 21, 2019 at 04:18:33PM -0700, akpm@linux-foundation.org wrote:
>> >> > > --- a/mm/mmu_gather.c~mm-mmu_gather-remove-__tlb_reset_range-for-force-flush
>> >> > > +++ a/mm/mmu_gather.c
>> >> > > @@ -245,14 +245,28 @@ void tlb_finish_mmu(struct mmu_gather *t
>> >> > >   {
>> >> > >   	/*
>> >> > >   	 * If there are parallel threads are doing PTE changes on same range
>> >> > > -	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
>> >> > > -	 * flush by batching, a thread has stable TLB entry can fail to flush
>> >> > > -	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
>> >> > > -	 * forcefully if we detect parallel PTE batching threads.
>> >> > > +	 * under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
>> >> > > +	 * flush by batching, one thread may end up seeing inconsistent PTEs
>> >> > > +	 * and result in having stale TLB entries.  So flush TLB forcefully
>> >> > > +	 * if we detect parallel PTE batching threads.
>> >> > > +	 *
>> >> > > +	 * However, some syscalls, e.g. munmap(), may free page tables, this
>> >> > > +	 * needs force flush everything in the given range. Otherwise this
>> >> > > +	 * may result in having stale TLB entries for some architectures,
>> >> > > +	 * e.g. aarch64, that could specify flush what level TLB.
>> >> > >   	 */
>> >> > >   	if (mm_tlb_flush_nested(tlb->mm)) {
>> >> > > +		/*
>> >> > > +		 * The aarch64 yields better performance with fullmm by
>> >> > > +		 * avoiding multiple CPUs spamming TLBI messages at the
>> >> > > +		 * same time.
>> >> > > +		 *
>> >> > > +		 * On x86 non-fullmm doesn't yield significant difference
>> >> > > +		 * against fullmm.
>> >> > > +		 */
>> >> > > +		tlb->fullmm = 1;
>> >> > >   		__tlb_reset_range(tlb);
>> >> > > -		__tlb_adjust_range(tlb, start, end - start);
>> >> > > +		tlb->freed_tables = 1;
>> >> > >   	}
>> >> > >   	tlb_flush_mmu(tlb);
> 
>> > Maybe, but given the patch that went into -mm, PPC will never hit that
>> > branch I killed anymore -- and that really shouldn't be in architecture
>> > code anyway.
>> 
>> Yeah well if mm/ does this then sure it's dead and can go.
>> 
>> I don't think it's very nice to set fullmm and freed_tables for this 
>> case though. Is this concurrent zapping an important fast path? It
>> must have been, in order to justify all this complexity to the mm, so
>> we don't want to tie this boat anchor to it AFAIKS?
> 
> I'm not convinced its an important fast path, afaict it is an
> unfortunate correctness issue caused by allowing concurrenct frees.

I mean -- concurrent freeing was an important fastpath, right?
And concurrent freeing means that you hit this case. So this
case itself should be important too.

> 
>> Is the problem just that the freed page tables flags get cleared by
>> __tlb_reset_range()? Why not just remove that then, so the bits are
>> set properly for the munmap?
> 
> That's insufficient; as argued in my initial suggestion:
> 
>   https://lkml.kernel.org/r/20190509103813.GP2589@hirez.programming.kicks-ass.net
> 
> Since we don't know what was flushed by the concorrent flushes, we must
> flush all state (page sizes, tables etc..).

Page tables should not be concurrently freed I think. Just don't clear
those page table free flags and it should be okay. Page sizes yes,
but we accommodated for that in the arch code. I could see reason to
add a flag to the gather struct like "concurrent_free" and set that
from the generic code, which the arch has to take care of.

> But it looks like benchmarks (for the one test-case we have) seem to
> favour flushing the world over flushing a smaller range.

Testing on 16MB unmap is possibly not a good benchmark, I didn't run
it exactly but it looks likely to go beyond the range flush threshold
and flush the entire PID anyway.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: + mm-mmu_gather-remove-__tlb_reset_range-for-force-flush.patch added to -mm tree
  2019-06-03  2:11           ` Nicholas Piggin
@ 2019-06-03 10:30             ` Will Deacon
  2019-06-03 14:10               ` Nicholas Piggin
  0 siblings, 1 reply; 12+ messages in thread
From: Will Deacon @ 2019-06-03 10:30 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Peter Zijlstra, akpm, Aneesh Kumar K.V, jstancek, mgorman,
	minchan, mm-commits, namit, stable, yang.shi

Hi Nick,

On Mon, Jun 03, 2019 at 12:11:38PM +1000, Nicholas Piggin wrote:
> Peter Zijlstra's on May 31, 2019 7:49 pm:
> > On Fri, May 31, 2019 at 12:46:56PM +1000, Nicholas Piggin wrote:
> >> I don't think it's very nice to set fullmm and freed_tables for this 
> >> case though. Is this concurrent zapping an important fast path? It
> >> must have been, in order to justify all this complexity to the mm, so
> >> we don't want to tie this boat anchor to it AFAIKS?
> > 
> > I'm not convinced its an important fast path, afaict it is an
> > unfortunate correctness issue caused by allowing concurrenct frees.
> 
> I mean -- concurrent freeing was an important fastpath, right?
> And concurrent freeing means that you hit this case. So this
> case itself should be important too.

I honestly don't think we (Peter and I) know. Our first involvement with
this was because TLBs were found to contain stale user entries:

https://lore.kernel.org/linux-arm-kernel/1817839533.20996552.1557065445233.JavaMail.zimbra@redhat.com/

so the initial work to support the concurrent freeing was done separately
and, I assume, motivated by some real workloads. I would also very much
like to know more about that, since nothing remotely realistic has surfaced
in this discussion, other than some historical glibc thing which has long
since been fixed.

> >> Is the problem just that the freed page tables flags get cleared by
> >> __tlb_reset_range()? Why not just remove that then, so the bits are
> >> set properly for the munmap?
> > 
> > That's insufficient; as argued in my initial suggestion:
> > 
> >   https://lkml.kernel.org/r/20190509103813.GP2589@hirez.programming.kicks-ass.net
> > 
> > Since we don't know what was flushed by the concorrent flushes, we must
> > flush all state (page sizes, tables etc..).
> 
> Page tables should not be concurrently freed I think. Just don't clear
> those page table free flags and it should be okay. Page sizes yes,
> but we accommodated for that in the arch code. I could see reason to
> add a flag to the gather struct like "concurrent_free" and set that
> from the generic code, which the arch has to take care of.

I think you're correct that two CPUs cannot free the page tables
concurrently (I misunderstood this initially), although I also think
there may be some subtle issues if tlb->freed_tables is not set,
depending on the architecture. Roughly speaking, if one CPU is clearing
a PMD as part of munmap() and another CPU in madvise() does only last-level
TLB invalidation, then I think there's the potential for the invalidation
to be ineffective if observing a cleared PMD doesn't imply that the last
level has been unmapped from the perspective of the page-table walker.

> > But it looks like benchmarks (for the one test-case we have) seem to
> > favour flushing the world over flushing a smaller range.
> 
> Testing on 16MB unmap is possibly not a good benchmark, I didn't run
> it exactly but it looks likely to go beyond the range flush threshold
> and flush the entire PID anyway.

If we can get a better idea of what a "good benchmark" might look like (i.e.
something that is representative of the cases in which real workloads are
likely to trigger this path) then we can definitely try to optimise around
that.

In the meantime, I would really like to see this patch land in mainline
since it fixes a regression.

Will

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: + mm-mmu_gather-remove-__tlb_reset_range-for-force-flush.patch added to -mm tree
  2019-06-03 10:30             ` Will Deacon
@ 2019-06-03 14:10               ` Nicholas Piggin
  2019-06-03 17:57                 ` Will Deacon
  0 siblings, 1 reply; 12+ messages in thread
From: Nicholas Piggin @ 2019-06-03 14:10 UTC (permalink / raw)
  To: Will Deacon
  Cc: akpm, Aneesh Kumar K.V, jstancek, mgorman, minchan, mm-commits,
	namit, Peter Zijlstra, stable, yang.shi

Will Deacon's on June 3, 2019 8:30 pm:
> Hi Nick,
> 
> On Mon, Jun 03, 2019 at 12:11:38PM +1000, Nicholas Piggin wrote:
>> Peter Zijlstra's on May 31, 2019 7:49 pm:
>> > On Fri, May 31, 2019 at 12:46:56PM +1000, Nicholas Piggin wrote:
>> >> I don't think it's very nice to set fullmm and freed_tables for this 
>> >> case though. Is this concurrent zapping an important fast path? It
>> >> must have been, in order to justify all this complexity to the mm, so
>> >> we don't want to tie this boat anchor to it AFAIKS?
>> > 
>> > I'm not convinced its an important fast path, afaict it is an
>> > unfortunate correctness issue caused by allowing concurrenct frees.
>> 
>> I mean -- concurrent freeing was an important fastpath, right?
>> And concurrent freeing means that you hit this case. So this
>> case itself should be important too.
> 
> I honestly don't think we (Peter and I) know. Our first involvement with
> this was because TLBs were found to contain stale user entries:
> 
> https://lore.kernel.org/linux-arm-kernel/1817839533.20996552.1557065445233.JavaMail.zimbra@redhat.com/
> 
> so the initial work to support the concurrent freeing was done separately
> and, I assume, motivated by some real workloads. I would also very much
> like to know more about that, since nothing remotely realistic has surfaced
> in this discussion, other than some historical glibc thing which has long
> since been fixed.

Well, it seems like it is important. While the complexity is carried
in the mm, we should not skimp on this last small piece.

>> >> Is the problem just that the freed page tables flags get cleared by
>> >> __tlb_reset_range()? Why not just remove that then, so the bits are
>> >> set properly for the munmap?
>> > 
>> > That's insufficient; as argued in my initial suggestion:
>> > 
>> >   https://lkml.kernel.org/r/20190509103813.GP2589@hirez.programming.kicks-ass.net
>> > 
>> > Since we don't know what was flushed by the concorrent flushes, we must
>> > flush all state (page sizes, tables etc..).
>> 
>> Page tables should not be concurrently freed I think. Just don't clear
>> those page table free flags and it should be okay. Page sizes yes,
>> but we accommodated for that in the arch code. I could see reason to
>> add a flag to the gather struct like "concurrent_free" and set that
>> from the generic code, which the arch has to take care of.
> 
> I think you're correct that two CPUs cannot free the page tables
> concurrently (I misunderstood this initially), although I also think
> there may be some subtle issues if tlb->freed_tables is not set,
> depending on the architecture. Roughly speaking, if one CPU is clearing
> a PMD as part of munmap() and another CPU in madvise() does only last-level
> TLB invalidation, then I think there's the potential for the invalidation
> to be ineffective if observing a cleared PMD doesn't imply that the last
> level has been unmapped from the perspective of the page-table walker.

That should not be the case because the last level table should have
had all entries cleared before the pointer to it has been cleared.

So the page table walker could begin from the now-freed page table,
but it would never instantiate a valid TLB entry from there. So a
TLB invalidation would behave properly despite not flushing page
tables.

Powerpc at least would want to avoid over flushing here, AFAIKS.

>> > But it looks like benchmarks (for the one test-case we have) seem to
>> > favour flushing the world over flushing a smaller range.
>> 
>> Testing on 16MB unmap is possibly not a good benchmark, I didn't run
>> it exactly but it looks likely to go beyond the range flush threshold
>> and flush the entire PID anyway.
> 
> If we can get a better idea of what a "good benchmark" might look like (i.e.
> something that is representative of the cases in which real workloads are
> likely to trigger this path) then we can definitely try to optimise around
> that.

Hard to say unfortunately. A smaller unmap range to start with, but
even then when you have a TLB over-flushing case, then an unmap micro
benchmark is not a great test because you'd like to see more impact of
other useful entries being flushed (e.g., you need an actual working
set).

> In the meantime, I would really like to see this patch land in mainline
> since it fixes a regression.

Sorry I didn't provide input earlier. I would like to improve the fix or 
at least make an option for archs to provide an optimised way to flush 
this case, so it would be nice not to fix archs this way and then have 
to change the fix significantly right away.

But the bug does need to be fixed of course, if there needs to be more
thought about it maybe it's best to take this fix for next release.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: + mm-mmu_gather-remove-__tlb_reset_range-for-force-flush.patch added to -mm tree
  2019-06-03 14:10               ` Nicholas Piggin
@ 2019-06-03 17:57                 ` Will Deacon
  2019-06-04  8:18                   ` Nicholas Piggin
  0 siblings, 1 reply; 12+ messages in thread
From: Will Deacon @ 2019-06-03 17:57 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: akpm, Aneesh Kumar K.V, jstancek, mgorman, minchan, mm-commits,
	namit, Peter Zijlstra, stable, yang.shi

Hi Nick,

On Tue, Jun 04, 2019 at 12:10:37AM +1000, Nicholas Piggin wrote:
> Will Deacon's on June 3, 2019 8:30 pm:
> > On Mon, Jun 03, 2019 at 12:11:38PM +1000, Nicholas Piggin wrote:
> >> Peter Zijlstra's on May 31, 2019 7:49 pm:
> >> > On Fri, May 31, 2019 at 12:46:56PM +1000, Nicholas Piggin wrote:
> >> >> I don't think it's very nice to set fullmm and freed_tables for this 
> >> >> case though. Is this concurrent zapping an important fast path? It
> >> >> must have been, in order to justify all this complexity to the mm, so
> >> >> we don't want to tie this boat anchor to it AFAIKS?
> >> > 
> >> > I'm not convinced its an important fast path, afaict it is an
> >> > unfortunate correctness issue caused by allowing concurrenct frees.
> >> 
> >> I mean -- concurrent freeing was an important fastpath, right?
> >> And concurrent freeing means that you hit this case. So this
> >> case itself should be important too.
> > 
> > I honestly don't think we (Peter and I) know. Our first involvement with
> > this was because TLBs were found to contain stale user entries:
> > 
> > https://lore.kernel.org/linux-arm-kernel/1817839533.20996552.1557065445233.JavaMail.zimbra@redhat.com/
> > 
> > so the initial work to support the concurrent freeing was done separately
> > and, I assume, motivated by some real workloads. I would also very much
> > like to know more about that, since nothing remotely realistic has surfaced
> > in this discussion, other than some historical glibc thing which has long
> > since been fixed.
> 
> Well, it seems like it is important. While the complexity is carried
> in the mm, we should not skimp on this last small piece.

As I say, I really don't know. But yes, if we can do something better we
should.

> >> >> Is the problem just that the freed page tables flags get cleared by
> >> >> __tlb_reset_range()? Why not just remove that then, so the bits are
> >> >> set properly for the munmap?
> >> > 
> >> > That's insufficient; as argued in my initial suggestion:
> >> > 
> >> >   https://lkml.kernel.org/r/20190509103813.GP2589@hirez.programming.kicks-ass.net
> >> > 
> >> > Since we don't know what was flushed by the concorrent flushes, we must
> >> > flush all state (page sizes, tables etc..).
> >> 
> >> Page tables should not be concurrently freed I think. Just don't clear
> >> those page table free flags and it should be okay. Page sizes yes,
> >> but we accommodated for that in the arch code. I could see reason to
> >> add a flag to the gather struct like "concurrent_free" and set that
> >> from the generic code, which the arch has to take care of.
> > 
> > I think you're correct that two CPUs cannot free the page tables
> > concurrently (I misunderstood this initially), although I also think
> > there may be some subtle issues if tlb->freed_tables is not set,
> > depending on the architecture. Roughly speaking, if one CPU is clearing
> > a PMD as part of munmap() and another CPU in madvise() does only last-level
> > TLB invalidation, then I think there's the potential for the invalidation
> > to be ineffective if observing a cleared PMD doesn't imply that the last
> > level has been unmapped from the perspective of the page-table walker.
> 
> That should not be the case because the last level table should have
> had all entries cleared before the pointer to it has been cleared.

The devil is in the detail here, and I think specifically it depends
what you mean by "before". Does that mean memory barrier, or special
page-table walker barrier, or TLB invalidation or ...?

> So the page table walker could begin from the now-freed page table,
> but it would never instantiate a valid TLB entry from there. So a
> TLB invalidation would behave properly despite not flushing page
> tables.
> 
> Powerpc at least would want to avoid over flushing here, AFAIKS.

For arm64 it really depends how often this hits. Simply not setting
tlb->freed_tables would also break things for us, because we have an
optimisation where we elide invalidation in the fullmm && !freed_tables
case, since this is indicative of the mm going away and so we simply
avoid reallocating its ASID.
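
To make that concrete, here is a minimal, self-contained C model of the
check being described. It is a sketch only: the struct and function names
are made up for illustration and this is not the real arch/arm64 code.

---8<---
#include <stdbool.h>
#include <stdio.h>

/* Toy stand-in for the two mmu_gather fields discussed above. */
struct gather_model {
	bool fullmm;		/* whole address space is being torn down */
	bool freed_tables;	/* page-table pages were freed in this batch */
};

static void model_tlb_flush(struct gather_model *tlb)
{
	/*
	 * Full teardown with no freed tables: skip the invalidation and
	 * rely on the ASID not being reallocated without a flush.
	 */
	if (tlb->fullmm && !tlb->freed_tables) {
		puts("elide TLB invalidation (exit_mmap, no freed tables)");
		return;
	}

	printf("invalidate TLB, last_level=%d\n", !tlb->freed_tables);
}

int main(void)
{
	struct gather_model exit_mm = { .fullmm = true,  .freed_tables = false };
	struct gather_model unmap   = { .fullmm = false, .freed_tables = true  };

	model_tlb_flush(&exit_mm);	/* elided */
	model_tlb_flush(&unmap);	/* flushed, walk cache included */
	return 0;
}
---8<---

Since the fix being discussed sets both fullmm and freed_tables for the
concurrent-unmap case, the elision above would still only fire for a
genuine full teardown.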

> >> > But it looks like benchmarks (for the one test-case we have) seem to
> >> > favour flushing the world over flushing a smaller range.
> >> 
> >> Testing on a 16MB unmap is possibly not a good benchmark; I didn't run
> >> it exactly, but it looks likely to go beyond the range-flush threshold
> >> and flush the entire PID anyway.
> > 
> > If we can get a better idea of what a "good benchmark" might look like (i.e.
> > something that is representative of the cases in which real workloads are
> > likely to trigger this path) then we can definitely try to optimise around
> > that.
> 
> Hard to say, unfortunately. A smaller unmap range to start with, but
> even then, when you have a TLB over-flushing case, an unmap micro
> benchmark is not a great test because you'd like to see more of the
> impact of other useful entries being flushed (e.g., you need an actual
> working set).

Right, sounds like somebody needs to do some better analysis than what's
been done so far.
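
For reference, the trade-off behind that 16MB observation can be sketched
as below. The PAGE_SHIFT of 12 and the 512-page ceiling are arbitrary,
illustrative values rather than any architecture's real numbers; the point
is only that a 16MB unmap comfortably exceeds a typical per-page flush
ceiling, so both approaches end up flushing the whole PID and the
benchmark cannot tell them apart.

---8<---
#include <stdio.h>

#define PAGE_SHIFT		12
/* Illustrative ceiling: past this many pages, a full PID/ASID flush wins. */
#define RANGE_FLUSH_CEILING	512UL

static void model_flush(unsigned long start, unsigned long end)
{
	unsigned long pages = (end - start) >> PAGE_SHIFT;

	if (pages > RANGE_FLUSH_CEILING) {
		printf("%8lu pages: flush the whole PID/ASID\n", pages);
		return;
	}
	printf("%8lu pages: per-page range flush\n", pages);
}

int main(void)
{
	model_flush(0, 16UL << 20);	/* 16MB unmap: 4096 pages, full flush */
	model_flush(0, 64UL << 10);	/* 64KB unmap:   16 pages, range flush */
	return 0;
}
---8<---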

> > In the meantime, I would really like to see this patch land in mainline
> > since it fixes a regression.
> 
> Sorry I didn't provide input earlier. I would like to improve the fix, or
> at least add an option for archs to provide an optimised way to flush
> this case, so it would be nice not to fix archs this way and then have
> to change the fix significantly right away.

Please send patches ;)

> But the bug does need to be fixed, of course; if there needs to be more
> thought about it, maybe it's best to take this fix for the next release.

Agreed.

Will


* Re: + mm-mmu_gather-remove-__tlb_reset_range-for-force-flush.patch added to -mm tree
  2019-06-03 17:57                 ` Will Deacon
@ 2019-06-04  8:18                   ` Nicholas Piggin
  0 siblings, 0 replies; 12+ messages in thread
From: Nicholas Piggin @ 2019-06-04  8:18 UTC (permalink / raw)
  To: Will Deacon
  Cc: akpm, Aneesh Kumar K.V, jstancek, mgorman, minchan, mm-commits,
	namit, Peter Zijlstra, stable, yang.shi

Will Deacon's on June 4, 2019 3:57 am:
> Hi Nick,
> 
> On Tue, Jun 04, 2019 at 12:10:37AM +1000, Nicholas Piggin wrote:
>> Will Deacon's on June 3, 2019 8:30 pm:
>> > On Mon, Jun 03, 2019 at 12:11:38PM +1000, Nicholas Piggin wrote:
>> >> Peter Zijlstra's on May 31, 2019 7:49 pm:
>> >> > On Fri, May 31, 2019 at 12:46:56PM +1000, Nicholas Piggin wrote:
>> >> >> I don't think it's very nice to set fullmm and freed_tables for this 
>> >> >> case though. Is this concurrent zapping an important fast path? It
>> >> >> must have been, in order to justify all this complexity to the mm, so
>> >> >> we don't want to tie this boat anchor to it AFAIKS?
>> >> > 
>> >> > I'm not convinced it's an important fast path; afaict it is an
>> >> > unfortunate correctness issue caused by allowing concurrent frees.
>> >> 
>> >> I mean -- concurrent freeing was an important fastpath, right?
>> >> And concurrent freeing means that you hit this case. So this
>> >> case itself should be important too.
>> > 
>> > I honestly don't think we (Peter and I) know. Our first involvement with
>> > this was because TLBs were found to contain stale user entries:
>> > 
>> > https://lore.kernel.org/linux-arm-kernel/1817839533.20996552.1557065445233.JavaMail.zimbra@redhat.com/
>> > 
>> > so the initial work to support the concurrent freeing was done separately
>> > and, I assume, motivated by some real workloads. I would also very much
>> > like to know more about that, since nothing remotely realistic has surfaced
>> > in this discussion, other than some historical glibc thing which has long
>> > since been fixed.
>> 
>> Well, it seems like it is important. While the complexity is carried
>> in the mm, we should not skimp on this last small piece.
> 
> As I say, I really don't know. But yes, if we can do something better we
> should.
> 
>> >> >> Is the problem just that the freed page tables flags get cleared by
>> >> >> __tlb_reset_range()? Why not just remove that then, so the bits are
>> >> >> set properly for the munmap?
>> >> > 
>> >> > That's insufficient; as argued in my initial suggestion:
>> >> > 
>> >> >   https://lkml.kernel.org/r/20190509103813.GP2589@hirez.programming.kicks-ass.net
>> >> > 
>> >> > Since we don't know what was flushed by the concurrent flushes, we must
>> >> > flush all state (page sizes, tables, etc.).
>> >> 
>> >> Page tables should not be concurrently freed I think. Just don't clear
>> >> those page table free flags and it should be okay. Page sizes yes,
>> >> but we accommodated for that in the arch code. I could see reason to
>> >> add a flag to the gather struct like "concurrent_free" and set that
>> >> from the generic code, which the arch has to take care of.
>> > 
>> > I think you're correct that two CPUs cannot free the page tables
>> > concurrently (I misunderstood this initially), although I also think
>> > there may be some subtle issues if tlb->freed_tables is not set,
>> > depending on the architecture. Roughly speaking, if one CPU is clearing
>> > a PMD as part of munmap() and another CPU in madvise() does only last-level
>> > TLB invalidation, then I think there's the potential for the invalidation
>> > to be ineffective if observing a cleared PMD doesn't imply that the last
>> > level has been unmapped from the perspective of the page-table walker.
>> 
>> That should not be the case because the last level table should have
>> had all entries cleared before the pointer to it has been cleared.
> 
> The devil is in the detail here, and I think specifically it depends
> what you mean by "before". Does that mean memory barrier, or special
> page-table walker barrier, or TLB invalidation or ...?

I'm not sure that matters. It is a complicating factor, but would
not be a new one to page table freeing. The issue really is the
TLB entries (not the page walk entries) which need to be flushed by the
concurrent unmaps. Even without any page table freeing activity (so you
would still have the page table connected), the ordering and flushing
need to be right such that a CPU cannot re-instantiate a new TLB entry
from the page table with the old PTEs.

If that is solved, then the subsequent step of freeing the page table
page doesn't introduce a new window where old PTEs could be read
from the same table via the page walk cache after it is disconnected.
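
Spelling out the ordering being relied on here as a sketch in plain
userspace C (the constants and function names are made up for
illustration; this is not kernel code): the last-level entries are
cleared and the TLB flushed before the table is unlinked and freed, so a
walker that still reaches the stale table through the walk cache can only
find empty slots.

---8<---
#include <stdio.h>
#include <stdlib.h>

#define PTRS_PER_PTE	512

struct pte_table {
	unsigned long entry[PTRS_PER_PTE];
};

static void model_tlb_flush_range(void)
{
	/* Stand-in for flushing the TLB for the unmapped range. */
	puts("2: TLB flushed for the range");
}

static void model_free_table(struct pte_table *pte)
{
	puts("4: page-table page freed");
	free(pte);
}

/* pmd_slot models the PMD entry that points at the last-level table. */
static void model_unmap_and_free(struct pte_table **pmd_slot)
{
	struct pte_table *pte = *pmd_slot;
	size_t i;

	/* 1. Clear every last-level entry first. */
	for (i = 0; i < PTRS_PER_PTE; i++)
		pte->entry[i] = 0;
	puts("1: PTEs cleared");

	/* 2. Flush the TLB so stale translations cannot survive the clear. */
	model_tlb_flush_range();

	/* 3. Only now disconnect the table from the PMD... */
	*pmd_slot = NULL;
	puts("3: PMD cleared");

	/*
	 * 4. ...and free it. In the kernel the free is deferred via
	 * mmu_gather, after the walk cache has also been invalidated.
	 */
	model_free_table(pte);
}

int main(void)
{
	struct pte_table *pmd_slot = calloc(1, sizeof(*pmd_slot));

	if (!pmd_slot)
		return 1;
	model_unmap_and_free(&pmd_slot);
	return 0;
}
---8<---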

> 
>> So the page table walker could begin from the now-freed page table,
>> but it would never instantiate a valid TLB entry from there. So a
>> TLB invalidation would behave properly despite not flushing page
>> tables.
>> 
>> Powerpc at least would want to avoid over flushing here, AFAIKS.
> 
> For arm64 it really depends on how often this hits. Simply not setting
> tlb->freed_tables would also break things for us, because we have an
> optimisation where we elide invalidation in the fullmm && !freed_tables
> case, since this is indicative of the mm going away and so we simply
> avoid reallocating its ASID.

It wouldn't be not setting it, but rather not clearing it.

Not sure you have to worry about concurrent unmaps in the fullmm case
either.

> 
>> >> > But it looks like benchmarks (for the one test-case we have) seem to
>> >> > favour flushing the world over flushing a smaller range.
>> >> 
>> >> Testing on a 16MB unmap is possibly not a good benchmark; I didn't run
>> >> it exactly, but it looks likely to go beyond the range-flush threshold
>> >> and flush the entire PID anyway.
>> > 
>> > If we can get a better idea of what a "good benchmark" might look like (i.e.
>> > something that is representative of the cases in which real workloads are
>> > likely to trigger this path) then we can definitely try to optimise around
>> > that.
>> 
>> Hard to say, unfortunately. A smaller unmap range to start with, but
>> even then, when you have a TLB over-flushing case, an unmap micro
>> benchmark is not a great test because you'd like to see more of the
>> impact of other useful entries being flushed (e.g., you need an actual
>> working set).
> 
> Right, sounds like somebody needs to do some better analysis than what's
> been done so far.
> 
>> > In the meantime, I would really like to see this patch land in mainline
>> > since it fixes a regression.
>> 
>> Sorry I didn't provide input earlier. I would like to improve the fix, or
>> at least add an option for archs to provide an optimised way to flush
>> this case, so it would be nice not to fix archs this way and then have
>> to change the fix significantly right away.
> 
> Please send patches ;)

I have a few things lying around that I put on hold until after all the
recent nice refactoring and improvements. I will rebase and try to
look at this issue as well and see if there are any ways to improve.
I'm likely to need help with arch code and analysis of races.

>> But the bug does need to be fixed, of course; if there needs to be more
>> thought about it, maybe it's best to take this fix for the next release.
> 
> Agreed.

Well, I can't argue against it anymore, then.

Thanks,
Nick



end of thread, other threads:[~2019-06-04  8:18 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-21 23:18 + mm-mmu_gather-remove-__tlb_reset_range-for-force-flush.patch added to -mm tree akpm
2019-05-27 11:01 ` Peter Zijlstra
2019-05-27 13:29   ` Aneesh Kumar K.V
2019-05-27 14:25     ` Peter Zijlstra
2019-05-30 21:55       ` Jan Stancek
2019-05-31  2:46       ` Nicholas Piggin
2019-05-31  9:49         ` Peter Zijlstra
2019-06-03  2:11           ` Nicholas Piggin
2019-06-03 10:30             ` Will Deacon
2019-06-03 14:10               ` Nicholas Piggin
2019-06-03 17:57                 ` Will Deacon
2019-06-04  8:18                   ` Nicholas Piggin
