linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/1] mm: Optimizing hugepage zeroing in arm64
@ 2021-01-21 16:51 Prathu Baronia
  2021-01-21 16:51 ` [PATCH 1/1] " Prathu Baronia
  2021-01-21 17:46 ` [PATCH 0/1] " Will Deacon
  0 siblings, 2 replies; 6+ messages in thread
From: Prathu Baronia @ 2021-01-21 16:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: chintan.pandya, Prathu Baronia, Catalin Marinas, Will Deacon,
	Vincenzo Frascino, glider, Geert Uytterhoeven, Andrew Morton,
	Anshuman Khandual, Andrey Konovalov, linux-arm-kernel

Hello!

This patch removes the unnecessary kmap calls in the hugepage zeroing path and
improves the zeroing time by 62%.

I had proposed a similar change in the Apr-May 2020 timeframe in memory.c,
clearing out a hugepage by directly calling memset over the whole hugepage,
but it was rejected on the grounds that the change was not architecture-neutral.

Upon revisiting this now, I see a significant improvement from removing around
2k barrier calls from the zeroing path. So hereby I propose an arm64-specific
definition of clear_user_highpage().

Prathu Baronia (1):
  mm: Optimizing hugepage zeroing in arm64

 arch/arm64/include/asm/page.h | 3 +++
 arch/arm64/mm/copypage.c      | 8 ++++++++
 2 files changed, 11 insertions(+)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH 1/1] mm: Optimizing hugepage zeroing in arm64
  2021-01-21 16:51 [PATCH 0/1] mm: Optimizing hugepage zeroing in arm64 Prathu Baronia
@ 2021-01-21 16:51 ` Prathu Baronia
  2021-01-21 17:46 ` [PATCH 0/1] " Will Deacon
  1 sibling, 0 replies; 6+ messages in thread
From: Prathu Baronia @ 2021-01-21 16:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: chintan.pandya, Prathu Baronia, Catalin Marinas, Will Deacon,
	Vincenzo Frascino, glider, Anshuman Khandual, Andrew Morton,
	Andrey Konovalov, linux-arm-kernel

In !HIGHMEM cases, especially on 64-bit architectures, we don't need a
temporary mapping of pages, so k(map|unmap)_atomic() amounts to nothing more
than bookkeeping and barrier() calls. For example, for a 2MB hugepage,
clear_huge_page() maps and unmaps each of the 512 subpages, which means 2048
barrier calls in total. This called for optimization: simply getting the
virtual address from the page does the job for us. We profiled
clear_huge_page() using ftrace and observed an improvement of 62%.

Setup:-
The data below was collected on Qualcomm's SM7250 SoC with THP enabled
(kernel v4.19.113), with only CPU-0 (Cortex-A55) and CPU-7 (Cortex-A76)
switched on and set to max frequency, and with DDR set to the performance
governor.

FTRACE Data:-

Base data:-
Number of iterations: 48
Mean of allocation time: 349.5 us
std deviation: 74.5 us

v1 data:-
Number of iterations: 48
Mean of allocation time: 131 us
std deviation: 32.7 us

The following simple userspace experiment, allocating 100MB (BUF_SZ) of pages
and writing to them, gave us good insight: we observed an improvement of 42%
in allocation and write timings.
-------------------------------------------------------------
Test code snippet
-------------------------------------------------------------
      clock_start();
      buf = malloc(BUF_SZ); /* Allocate 100 MB of memory */

      for (i = 0; i < BUF_SZ_PAGES; i++)
              *((int *)(buf + (i * PAGE_SIZE))) = 1;

      clock_end();
-------------------------------------------------------------

Malloc test timings for 100MB anon allocation:-

Base data:-
Number of iterations: 100
Mean of allocation time: 31831 us
std deviation: 4286 us

v1 data:-
Number of iterations: 100
Mean of allocation time: 18193 us
std deviation: 4915 us

Reported-by: Chintan Pandya <chintan.pandya@oneplus.com>
Signed-off-by: Prathu Baronia <prathu.baronia@oneplus.com>
---
 arch/arm64/include/asm/page.h | 3 +++
 arch/arm64/mm/copypage.c      | 8 ++++++++
 2 files changed, 11 insertions(+)

diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index 012cffc574e8..8f9d005a11bb 100644
--- a/arch/arm64/include/asm/page.h
+++ b/arch/arm64/include/asm/page.h
@@ -35,6 +35,9 @@ void copy_highpage(struct page *to, struct page *from);
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
 
+#define clear_user_highpage clear_user_highpage
+void clear_user_highpage(struct page *page, unsigned long vaddr);
+
 typedef struct page *pgtable_t;
 
 extern int pfn_valid(unsigned long);
diff --git a/arch/arm64/mm/copypage.c b/arch/arm64/mm/copypage.c
index b5447e53cd73..7f5943c6fc12 100644
--- a/arch/arm64/mm/copypage.c
+++ b/arch/arm64/mm/copypage.c
@@ -44,3 +44,11 @@ void copy_user_highpage(struct page *to, struct page *from,
 	flush_dcache_page(to);
 }
 EXPORT_SYMBOL_GPL(copy_user_highpage);
+
+inline void clear_user_highpage(struct page *page, unsigned long vaddr)
+{
+	void *addr = page_address(page);
+
+	clear_user_page(addr, vaddr, page);
+}
+EXPORT_SYMBOL_GPL(clear_user_highpage);
-- 
2.17.1



* Re: [PATCH 0/1] mm: Optimizing hugepage zeroing in arm64
  2021-01-21 16:51 [PATCH 0/1] mm: Optimizing hugepage zeroing in arm64 Prathu Baronia
  2021-01-21 16:51 ` [PATCH 1/1] " Prathu Baronia
@ 2021-01-21 17:46 ` Will Deacon
  2021-01-21 18:59   ` Robin Murphy
  1 sibling, 1 reply; 6+ messages in thread
From: Will Deacon @ 2021-01-21 17:46 UTC (permalink / raw)
  To: Prathu Baronia
  Cc: linux-kernel, chintan.pandya, Prathu Baronia, Catalin Marinas,
	Vincenzo Frascino, glider, Geert Uytterhoeven, Andrew Morton,
	Anshuman Khandual, Andrey Konovalov, linux-arm-kernel

On Thu, Jan 21, 2021 at 10:21:50PM +0530, Prathu Baronia wrote:
> This patch removes the unnecessary kmap calls in the hugepage zeroing path and
> improves the zeroing time by 62%.
> 
> I had proposed a similar change in the Apr-May 2020 timeframe in memory.c,
> clearing out a hugepage by directly calling memset over the whole hugepage,
> but it was rejected on the grounds that the change was not architecture-neutral.
> 
> Upon revisiting this now, I see a significant improvement from removing around
> 2k barrier calls from the zeroing path. So hereby I propose an arm64-specific
> definition of clear_user_highpage().

Given that barrier() is purely a thing for the compiler, wouldn't the same
change yield a benefit on any other architecture without HIGHMEM? In which
case, I think this sort of change belongs in the core code if it's actually
worthwhile.

Will


* Re: [PATCH 0/1] mm: Optimizing hugepage zeroing in arm64
  2021-01-21 17:46 ` [PATCH 0/1] " Will Deacon
@ 2021-01-21 18:59   ` Robin Murphy
  2021-01-22 12:13     ` Catalin Marinas
  0 siblings, 1 reply; 6+ messages in thread
From: Robin Murphy @ 2021-01-21 18:59 UTC (permalink / raw)
  To: Will Deacon, Prathu Baronia
  Cc: Prathu Baronia, Catalin Marinas, Anshuman Khandual, linux-kernel,
	chintan.pandya, glider, Andrey Konovalov, Geert Uytterhoeven,
	Andrew Morton, Vincenzo Frascino, linux-arm-kernel

On 2021-01-21 17:46, Will Deacon wrote:
> On Thu, Jan 21, 2021 at 10:21:50PM +0530, Prathu Baronia wrote:
>> This patch removes the unnecessary kmap calls in the hugepage zeroing path and
>> improves the zeroing time by 62%.
>>
>> I had proposed a similar change in the Apr-May 2020 timeframe in memory.c,
>> clearing out a hugepage by directly calling memset over the whole hugepage,
>> but it was rejected on the grounds that the change was not architecture-neutral.
>>
>> Upon revisiting this now, I see a significant improvement from removing around
>> 2k barrier calls from the zeroing path. So hereby I propose an arm64-specific
>> definition of clear_user_highpage().
> 
> Given that barrier() is purely a thing for the compiler, wouldn't the same
> change yield a benefit on any other architecture without HIGHMEM? In which
> case, I think this sort of change belongs in the core code if it's actually
> worthwhile.

I would have thought it's more the constant manipulation of the preempt 
and pagefault counts, rather than the compiler barriers between them, 
that has the impact. Either way, if arm64 doesn't need to be atomic WRT 
preemption when clearing parts of hugepages then I also can't imagine 
that anyone else (at least for !HIGHMEM) would either.

Robin.


* Re: [PATCH 0/1] mm: Optimizing hugepage zeroing in arm64
  2021-01-21 18:59   ` Robin Murphy
@ 2021-01-22 12:13     ` Catalin Marinas
  2021-01-22 12:45       ` Robin Murphy
  0 siblings, 1 reply; 6+ messages in thread
From: Catalin Marinas @ 2021-01-22 12:13 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Will Deacon, Prathu Baronia, Prathu Baronia, Anshuman Khandual,
	linux-kernel, chintan.pandya, glider, Andrey Konovalov,
	Geert Uytterhoeven, Andrew Morton, Vincenzo Frascino,
	linux-arm-kernel

On Thu, Jan 21, 2021 at 06:59:37PM +0000, Robin Murphy wrote:
> On 2021-01-21 17:46, Will Deacon wrote:
> > On Thu, Jan 21, 2021 at 10:21:50PM +0530, Prathu Baronia wrote:
> > > This patch removes the unnecessary kmap calls in the hugepage zeroing path and
> > > improves the zeroing time by 62%.
> > > 
> > > I had proposed a similar change in the Apr-May 2020 timeframe in memory.c,
> > > clearing out a hugepage by directly calling memset over the whole hugepage,
> > > but it was rejected on the grounds that the change was not architecture-neutral.
> > > 
> > > Upon revisiting this now, I see a significant improvement from removing around
> > > 2k barrier calls from the zeroing path. So hereby I propose an arm64-specific
> > > definition of clear_user_highpage().
> > 
> > Given that barrier() is purely a thing for the compiler, wouldn't the same
> > change yield a benefit on any other architecture without HIGHMEM? In which
> > case, I think this sort of change belongs in the core code if it's actually
> > worthwhile.
> 
> I would have thought it's more the constant manipulation of the preempt and
> pagefault counts, rather than the compiler barriers between them, that has
> the impact. Either way, if arm64 doesn't need to be atomic WRT preemption
> when clearing parts of hugepages then I also can't imagine that anyone else
> (at least for !HIGHMEM) would either.

I thought the kmap_local stuff was supposed to fix this unnecessary
preemption disabling on 64-bit architectures:

https://lwn.net/Articles/836144/

I guess it's not there yet.

-- 
Catalin


* Re: [PATCH 0/1] mm: Optimizing hugepage zeroing in arm64
  2021-01-22 12:13     ` Catalin Marinas
@ 2021-01-22 12:45       ` Robin Murphy
  0 siblings, 0 replies; 6+ messages in thread
From: Robin Murphy @ 2021-01-22 12:45 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Will Deacon, Prathu Baronia, Prathu Baronia, Anshuman Khandual,
	linux-kernel, chintan.pandya, glider, Andrey Konovalov,
	Geert Uytterhoeven, Andrew Morton, Vincenzo Frascino,
	linux-arm-kernel

On 2021-01-22 12:13, Catalin Marinas wrote:
> On Thu, Jan 21, 2021 at 06:59:37PM +0000, Robin Murphy wrote:
>> On 2021-01-21 17:46, Will Deacon wrote:
>>> On Thu, Jan 21, 2021 at 10:21:50PM +0530, Prathu Baronia wrote:
>>>> This patch removes the unnecessary kmap calls in the hugepage zeroing path and
>>>> improves the zeroing time by 62%.
>>>>
>>>> I had proposed a similar change in the Apr-May 2020 timeframe in memory.c,
>>>> clearing out a hugepage by directly calling memset over the whole hugepage,
>>>> but it was rejected on the grounds that the change was not architecture-neutral.
>>>>
>>>> Upon revisiting this now, I see a significant improvement from removing around
>>>> 2k barrier calls from the zeroing path. So hereby I propose an arm64-specific
>>>> definition of clear_user_highpage().
>>>
>>> Given that barrier() is purely a thing for the compiler, wouldn't the same
>>> change yield a benefit on any other architecture without HIGHMEM? In which
>>> case, I think this sort of change belongs in the core code if it's actually
>>> worthwhile.
>>
>> I would have thought it's more the constant manipulation of the preempt and
>> pagefault counts, rather than the compiler barriers between them, that has
>> the impact. Either way, if arm64 doesn't need to be atomic WRT preemption
>> when clearing parts of hugepages then I also can't imagine that anyone else
>> (at least for !HIGHMEM) would either.
> 
> I thought the kmap_local stuff was supposed to fix this unnecessary
> preemption disabling on 64-bit architectures:
> 
> https://lwn.net/Articles/836144/
> 
> I guess it's not there yet.

No, it's there alright - when I pulled up the code to double-check my 
memory of this area, I did notice the kerneldoc and start wondering if 
this should simply be using kmap_local_page() for everyone.
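
For concreteness, a kmap_local_page()-based generic implementation, roughly
what is being suggested here, might be sketched as below (kernel-side
pseudocode, not compilable outside the kernel and not part of the posted
patch):

```c
/* Sketch of a generic clear_user_highpage() using kmap_local_page():
 * with CONFIG_HIGHMEM=n this resolves to page_address() without touching
 * the preempt or pagefault counts, so every architecture would get the
 * cheap path, not just arm64. */
static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
{
	void *addr = kmap_local_page(page);

	clear_user_page(addr, vaddr, page);
	kunmap_local(addr);
}
```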

Robin.

