From: Will Deacon <will@kernel.org>
To: Prathu Baronia <prathu.baronia@oneplus.com>
Cc: Vlastimil Babka <vbabka@suse.cz>,
catalin.marinas@arm.com, alexander.duyck@gmail.com,
chintan.pandya@oneplus.com, mhocko@suse.com,
akpm@linux-foundation.org, linux-mm@kvack.org,
gregkh@linuxfoundation.com, gthelen@google.com, jack@suse.cz,
ken.lin@oneplus.com, gasine.xu@oneplus.com, ying.huang@intel.com,
mark.rutland@arm.com
Subject: Re: [PATCH v2] mm: Optimized hugepage zeroing & copying from user
Date: Tue, 5 May 2020 09:59:21 +0100 [thread overview]
Message-ID: <20200505085919.GB16980@willie-the-truck> (raw)
In-Reply-To: <20200501085855.c5dzk5hfrdzunqdl@oneplus.com>
On Fri, May 01, 2020 at 02:28:55PM +0530, Prathu Baronia wrote:
> Platform and setup conditions:
> Qualcomm's SM8150 platform under controlled conditions (i.e. only CPU0 and CPU6
> turned on and set to max frequency, and DDR set to the performance governor).
> ---------------------------------------------------------------------------
>
> ---------------------------------------------------------------------------
> Summary:
> We observed a ~61% improvement in the execution time of clearing a hugepage
> on arm64 if we increase the granularity, i.e. the chunk size cleared per
> subroutine call, from 4KB to 64KB.
> ---------------------------------------------------------------------------
>
> For the base build:
>
> clear_huge_page() ftrace profile
> --------------------------------
> - CPU0:
> - Samples: 95
> - Mean: 242.099 us
> - Std dev: 45.0096 us
That's one hell of a deviation. Any idea what's going on there?
> - CPU6:
> - Samples: 61
> - Mean: 258.372 us
> - Std dev: 22.0754 us
>
> With patches [PATCH {1,2,3}/4] (provided at the end), which just revert the
> forward-reverse traversal code, we observed:
>
> clear_huge_page() ftrace profile
> --------------------------------
> - CPU0:
> - Samples: 77
> - Mean: 234.568 us
> - Std dev: 6.52 us
> - CPU6:
> - Samples: 81
> - Mean: 259.437 us
> - Std dev: 19.25 us
>
> We were expecting some improvement in the arm64 case because of our hypothesis
> that reverse traversal is considerably slower on arm64, but after Will Deacon's
> test code showed similar timings for forward and reverse traversals we dug a
> bit deeper into this.
>
> I found that in the case of arm64 a page is cleared using a special clear_page.S
> assembly routine instead of an explicit call to memset. With the below patch we
> bypassed the assembly routine and observed an improvement in the execution time
> of clear_huge_page on CPU0.
>
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index ea5cdbd8c2c3..a0a97a95aee8 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -158,7 +158,7 @@ do {						\
>  static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
>  {
>  	void *addr = kmap_atomic(page);
> -	clear_user_page(addr, vaddr, page);
> +	memset(addr, 0x0, PAGE_SIZE);
>  	kunmap_atomic(addr);
>  }
>  #endif
>
> For reference I will call the above patch v-exp.
>
> When v-exp is applied on top of the base build we observed:
>
> clear_huge_page() ftrace profile
> --------------------------------
> - CPU0:
> - Samples: 71
> - Mean: 124.657 us
> - Std dev: 0.494165 us
This doesn't make any sense to me. memset() of zero is special-cased to
use the DC ZVA instruction in a loop:
3:
	dc	zva, dst
	add	dst, dst, zva_len_x
	subs	count, count, zva_len_x
	b.ge	3b
which is basically the same as clear_page():
1:	dc	zva, x0
	add	x0, x0, x1
	tst	x0, #(PAGE_SIZE - 1)
	b.ne	1b
Are you able to reproduce this in userspace?
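A minimal (untested) userspace sketch along these lines might do as a starting
point; the buffer size, chunk sizes and iteration count below are only
placeholders:

/*
 * Untested sketch: time zeroing a 2MB buffer via memset() in 4KB chunks,
 * 64KB chunks, and a single 2MB call.
 */
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE	(2UL << 20)	/* one 2MB hugepage worth of data */
#define ITERATIONS	100

static double clear_us(void *buf, size_t chunk)
{
	struct timespec t0, t1;
	size_t off;
	int i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < ITERATIONS; i++)
		for (off = 0; off < BUF_SIZE; off += chunk)
			memset((char *)buf + off, 0, chunk);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	/* Average microseconds per full 2MB clear. */
	return ((t1.tv_sec - t0.tv_sec) * 1e9 +
		(t1.tv_nsec - t0.tv_nsec)) / (1e3 * ITERATIONS);
}

int main(void)
{
	void *buf;

	/* Page-aligned so memset() can take its DC ZVA fast path. */
	if (posix_memalign(&buf, 4096, BUF_SIZE))
		return 1;

	printf(" 4KB chunks: %8.2f us\n", clear_us(buf, 4UL << 10));
	printf("64KB chunks: %8.2f us\n", clear_us(buf, 64UL << 10));
	printf("one 2MB call: %7.2f us\n", clear_us(buf, BUF_SIZE));

	free(buf);
	return 0;
}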
Will