linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [RFC] mm/memory.c: Optimizing THP zeroing routine for !HIGHMEM cases
@ 2020-04-03  8:18 Prathu Baronia
  2020-04-03  8:52 ` Michal Hocko
  2020-04-10 18:54 ` Alexander Duyck
  0 siblings, 2 replies; 14+ messages in thread
From: Prathu Baronia @ 2020-04-03  8:18 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: gregkh, gthelen, jack, mhocko, ken.lin, gasine.xu, chintan.pandya

THP allocation for anon memory requires zeroing of the huge page. To do so,
we iterate over the 2MB region in 4KB chunks. Each iteration calls kmap_atomic()
and kunmap_atomic(). This routine makes sense where we need a temporary mapping of
the user page. In !HIGHMEM cases, especially on 64-bit architectures, we don't
need a temporary mapping. Hence, kmap_atomic() acts as nothing more than multiple
barrier() calls.

This calls for optimization. Simply getting the virtual address from the page does
the job for us. So, implement another (optimized) routine for clear_huge_page()
which doesn't need a temporary mapping of the user space page.

While testing this patch on a Qualcomm SM8150 SoC (kernel v4.14.117), we see a 64%
improvement in clear_huge_page().

Ftrace results:

Default profile:
 ------------------------------------------
 6) ! 473.802 us  |  clear_huge_page();
 ------------------------------------------

With this patch applied:
 ------------------------------------------
 5) ! 170.156 us  |  clear_huge_page();
 ------------------------------------------

Signed-off-by: Prathu Baronia <prathu.baronia@oneplus.com>
Reported-by: Chintan Pandya <chintan.pandya@oneplus.com>
---
 mm/memory.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 3ee073d..3e120e8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5119,6 +5119,7 @@ EXPORT_SYMBOL(__might_fault);
 #endif
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+#ifdef CONFIG_HIGHMEM
 static void clear_gigantic_page(struct page *page,
 				unsigned long addr,
 				unsigned int pages_per_huge_page)
@@ -5183,6 +5184,16 @@ void clear_huge_page(struct page *page,
 				    addr + right_idx * PAGE_SIZE);
 	}
 }
+#else
+void clear_huge_page(struct page *page,
+		     unsigned long addr_hint, unsigned int pages_per_huge_page)
+{
+	void *addr;
+
+	addr = page_address(page);
+	memset(addr, 0, pages_per_huge_page*PAGE_SIZE);
+}
+#endif
 
 static void copy_user_gigantic_page(struct page *dst, struct page *src,
 				    unsigned long addr,
-- 
2.7.4


-- 
Prathu Baronia


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [RFC] mm/memory.c: Optimizing THP zeroing routine for !HIGHMEM cases
  2020-04-03  8:18 [RFC] mm/memory.c: Optimizing THP zeroing routine for !HIGHMEM cases Prathu Baronia
@ 2020-04-03  8:52 ` Michal Hocko
  2020-04-09 15:29   ` Prathu Baronia
  2020-04-10 18:54 ` Alexander Duyck
  1 sibling, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2020-04-03  8:52 UTC (permalink / raw)
  To: Prathu Baronia
  Cc: akpm, linux-mm, gregkh, gthelen, jack, ken.lin, gasine.xu,
	chintan.pandya, Huang Ying

[Cc Ying Huang]

On Fri 03-04-20 13:48:14, Prathu Baronia wrote:
> THP allocation for anon memory requires zeroing of the huge page. To do so,
> we iterate over the 2MB region in 4KB chunks. Each iteration calls kmap_atomic()
> and kunmap_atomic(). This routine makes sense where we need a temporary mapping of
> the user page. In !HIGHMEM cases, especially on 64-bit architectures, we don't
> need a temporary mapping. Hence, kmap_atomic() acts as nothing more than multiple
> barrier() calls.
> 
> This calls for optimization. Simply getting the virtual address from the page does
> the job for us. So, implement another (optimized) routine for clear_huge_page()
> which doesn't need a temporary mapping of the user space page.
> 
> While testing this patch on a Qualcomm SM8150 SoC (kernel v4.14.117), we see a 64%
> improvement in clear_huge_page().

This is an old kernel. Do you see the same with the current upstream
kernel? Btw. 60% improvement only from dropping barrier sounds
unexpected to me. Are you sure this is the only reason? c79b57e462b5
("mm: hugetlb: clear target sub-page last when clearing huge page")
is already 4.14 AFAICS, is it possible that this is the effect of this
patch? Your patch is effectively disabling this optimization for most
workloads that really care about it. I strongly doubt that hugetlb is a
thing on 32b kernels these days. So this really begs for more data about
the real underlying problem IMHO.

> Ftrace results:
> 
> Default profile:
>  ------------------------------------------
>  6) ! 473.802 us  |  clear_huge_page();
>  ------------------------------------------
> 
> With this patch applied:
>  ------------------------------------------
>  5) ! 170.156 us  |  clear_huge_page();
>  ------------------------------------------
> 
> Signed-off-by: Prathu Baronia <prathu.baronia@oneplus.com>
> Reported-by: Chintan Pandya <chintan.pandya@oneplus.com>
> ---
>  mm/memory.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 3ee073d..3e120e8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5119,6 +5119,7 @@ EXPORT_SYMBOL(__might_fault);
>  #endif
>  
>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
> +#ifdef CONFIG_HIGHMEM
>  static void clear_gigantic_page(struct page *page,
>  				unsigned long addr,
>  				unsigned int pages_per_huge_page)
> @@ -5183,6 +5184,16 @@ void clear_huge_page(struct page *page,
>  				    addr + right_idx * PAGE_SIZE);
>  	}
>  }
> +#else
> +void clear_huge_page(struct page *page,
> +		     unsigned long addr_hint, unsigned int pages_per_huge_page)
> +{
> +	void *addr;
> +
> +	addr = page_address(page);
> +	memset(addr, 0, pages_per_huge_page*PAGE_SIZE);
> +}
> +#endif
>  
>  static void copy_user_gigantic_page(struct page *dst, struct page *src,
>  				    unsigned long addr,
> -- 
> 2.7.4
> 
> 
> -- 
> Prathu Baronia

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] mm/memory.c: Optimizing THP zeroing routine for !HIGHMEM cases
  2020-04-03  8:52 ` Michal Hocko
@ 2020-04-09 15:29   ` Prathu Baronia
  2020-04-09 15:45     ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: Prathu Baronia @ 2020-04-09 15:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, linux-mm, gregkh, gthelen, jack, ken.lin, gasine.xu,
	chintan.pandya, Huang Ying

Following your response, I tried to find out the real benefit of removing the
effective barrier() calls. To do that, I wrote a simple diff (exp-v2), shown below,
on top of the base code:
    
-------------------------------------------------------
include/linux/highmem.h | 3 +--    
1 file changed, 1 insertion(+), 2 deletions(-)    

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index b471a88..df908b4 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -145,9 +145,8 @@ do {                                                            \
 #ifndef clear_user_highpage
 static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
 {
-	void *addr = kmap_atomic(page);
+	void *addr = page_address(page);
 	clear_user_page(addr, vaddr, page);
-	kunmap_atomic(addr);
 }
 #endif
-------------------------------------------------------

For consistency, I kept the CPU, DDR and cache on the performance governor. The
target used is Qualcomm's SM8150 with kernel 4.14.117. On this platform, CPU0 is
a Cortex-A55 and CPU6 is a Cortex-A76.

And the result of profiling of clear_huge_page() is as follows:
-------------------------------------------------------
Ftrace results: Time mentioned is in micro-seconds.
-------------------------------------------------------
- Base:
	- CPU0:
		- Sample size : 95
		- Mean : 237.383
		- Std dev : 31.288
	- CPU6:
		- Sample size : 61
		- Mean : 258.065
		- Std dev : 19.97

-------------------------------------------------------
- v1 (original submission):
	- CPU0:
		- Sample size : 80
		- Mean : 112.298
		- Std dev : 0.36
	- CPU6:
		- Sample size : 83
		- Mean : 71.238
		- Std dev : 13.7819
-------------------------------------------------------
- exp-v2 (experimental diff mentioned above):
	- CPU0:
		- Sample size : 69
		- Mean : 218.911
		- Std dev : 54.306
	- CPU6:
		- Sample size : 101
		- Mean : 241.522
		- Std dev : 19.3068
-------------------------------------------------------

- Comparing base vs exp-v2: simply removing the barriers from the kmap_atomic() code
  doesn't improve the results significantly.

- Comparing v1 vs exp-v2: a straight memset(0) of the whole 2MB page is significantly
  faster than zeroing the individual 4KB pages.

- Analysing base and exp-v2: CPU6 was expected to outperform CPU0, but the zeroing
  pattern is adversarial for CPU6, which ends up performing poorly. In contrast, CPU6
  clearly outperforms CPU0 under a serialized load.

Based on the above three points, it looks like calling a straight memset(0) indeed
improves execution time, primarily because of the predictable access pattern it
presents to most CPU architectures out there.

Having said that, I also understand that v1 loses the optimization made by
c79b57e462b5 which keeps the caches hot around the faulting address. If keeping the
caches hot around the faulting address is that important (which numbers could prove,
and I don't have the insight to get those numbers), it might be better to develop on
top of v1 than to not use v1 at all.

The 04/03/2020 10:52, Michal Hocko wrote:
> 
> This is an old kernel. Do you see the same with the current upstream
> kernel? Btw. 60% improvement only from dropping barrier sounds
> unexpected to me. Are you sure this is the only reason? c79b57e462b5
> ("mm: hugetlb: clear target sub-page last when clearing huge page")
> is already 4.14 AFAICS, is it possible that this is the effect of this
> patch? Your patch is effectively disabling this optimization for most
> workloads that really care about it. I strongly doubt that hugetlb is a
> thing on 32b kernels these days. So this really begs for more data about
> the real underlying problem IMHO.
> 
> -- 
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] mm/memory.c: Optimizing THP zeroing routine for !HIGHMEM cases
  2020-04-09 15:29   ` Prathu Baronia
@ 2020-04-09 15:45     ` Michal Hocko
       [not found]       ` <SG2PR04MB2921D2AAA8726318EF53D83691DE0@SG2PR04MB2921.apcprd04.prod.outlook.com>
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2020-04-09 15:45 UTC (permalink / raw)
  To: Prathu Baronia
  Cc: akpm, linux-mm, gregkh, gthelen, jack, ken.lin, gasine.xu,
	chintan.pandya, Huang Ying

On Thu 09-04-20 20:59:14, Prathu Baronia wrote:
> Following your response, I tried to find out the real benefit of removing the
> effective barrier() calls. To do that, I wrote a simple diff (exp-v2), shown below,
> on top of the base code:
>     
> -------------------------------------------------------
> include/linux/highmem.h | 3 +--    
> 1 file changed, 1 insertion(+), 2 deletions(-)    
> 
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index b471a88..df908b4 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -145,9 +145,8 @@ do {                                                            \
>  #ifndef clear_user_highpage
>  static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
>  {
> -	void *addr = kmap_atomic(page);
> +	void *addr = page_address(page);
>  	clear_user_page(addr, vaddr, page);
> -	kunmap_atomic(addr);
>  }
>  #endif
> -------------------------------------------------------
> 
> For consistency, I kept the CPU, DDR and cache on the performance governor. The
> target used is Qualcomm's SM8150 with kernel 4.14.117. On this platform, CPU0 is
> a Cortex-A55 and CPU6 is a Cortex-A76.
> 
> And the result of profiling of clear_huge_page() is as follows:
> -------------------------------------------------------
> Ftrace results: Time mentioned is in micro-seconds.
> -------------------------------------------------------
> - Base:
> 	- CPU0:
> 		- Sample size : 95
> 		- Mean : 237.383
> 		- Std dev : 31.288
> 	- CPU6:
> 		- Sample size : 61
> 		- Mean : 258.065
> 		- Std dev : 19.97
> 
> -------------------------------------------------------
> - v1 (original submission):
> 	- CPU0:
> 		- Sample size : 80
> 		- Mean : 112.298
> 		- Std dev : 0.36
> 	- CPU6:
> 		- Sample size : 83
> 		- Mean : 71.238
> 		- Std dev : 13.7819
> -------------------------------------------------------
> - exp-v2 (experimental diff mentioned above):
> 	- CPU0:
> 		- Sample size : 69
> 		- Mean : 218.911
> 		- Std dev : 54.306
> 	- CPU6:
> 		- Sample size : 101
> 		- Mean : 241.522
> 		- Std dev : 19.3068
> -------------------------------------------------------
> 
> - Comparing base vs exp-v2: simply removing the barriers from the kmap_atomic() code
>   doesn't improve the results significantly.

Unless I am misreading those numbers, barrier() doesn't change anything
because the differences are within the noise. So the difference is indeed
caused by the more clever initialization to keep the faulted address
cache hot.

Could you be more specific about how you noticed the slowdown? I mean,
is there any real-world workload that you have observed a regression for
and narrowed it down to zeroing?

I do realize that the initialization improvement patch doesn't really
mention any real life usecase either. It is based on a microbenchmark
but the objective sounds reasonable. If it regresses some other
workloads then we either have to make it conditional or find out what is
causing the regression and how much that regression actually matters.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] mm/memory.c: Optimizing THP zeroing routine for !HIGHMEM cases
       [not found]       ` <SG2PR04MB2921D2AAA8726318EF53D83691DE0@SG2PR04MB2921.apcprd04.prod.outlook.com>
@ 2020-04-10  9:05         ` Huang, Ying
  2020-04-11 15:40           ` Chintan Pandya
  0 siblings, 1 reply; 14+ messages in thread
From: Huang, Ying @ 2020-04-10  9:05 UTC (permalink / raw)
  To: Chintan Pandya
  Cc: Michal Hocko, Prathu Baronia, akpm, linux-mm, gregkh, gthelen,
	jack, Ken Lin, Gasine Xu

Chintan Pandya <chintan.pandya@oneplus.com> writes:

>> I do realize that the initialization improvement patch doesn't really mention
>> any real life usecase either. It is based on a microbenchmark but the objective
>> sounds reasonable. If it regresses some other workloads then we either have to
>> make it conditional or find out what is causing the regression and how much
>
> Generally, many architectures are optimized for serial loads, be it initialization or
> access, as it is the simplest form of prediction. Any random access pattern would kill
> that pre-fetching. And for now, I suspect that to be the case here. Probably, we
> can run more tests to confirm this part.

Please prove your theory with a test.  Better to test on x86 too.

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] mm/memory.c: Optimizing THP zeroing routine for !HIGHMEM cases
  2020-04-03  8:18 [RFC] mm/memory.c: Optimizing THP zeroing routine for !HIGHMEM cases Prathu Baronia
  2020-04-03  8:52 ` Michal Hocko
@ 2020-04-10 18:54 ` Alexander Duyck
  2020-04-11  8:45   ` Chintan Pandya
  1 sibling, 1 reply; 14+ messages in thread
From: Alexander Duyck @ 2020-04-10 18:54 UTC (permalink / raw)
  To: Prathu Baronia
  Cc: Andrew Morton, linux-mm, Greg KH, gthelen, jack, Michal Hocko,
	ken.lin, gasine.xu, chintan.pandya

On Fri, Apr 3, 2020 at 1:18 AM Prathu Baronia
<prathu.baronia@oneplus.com> wrote:
>
> THP allocation for anon memory requires zeroing of the huge page. To do so,
> we iterate over the 2MB region in 4KB chunks. Each iteration calls kmap_atomic()
> and kunmap_atomic(). This routine makes sense where we need a temporary mapping of
> the user page. In !HIGHMEM cases, especially on 64-bit architectures, we don't
> need a temporary mapping. Hence, kmap_atomic() acts as nothing more than multiple
> barrier() calls.
>
> This calls for optimization. Simply getting the virtual address from the page does
> the job for us. So, implement another (optimized) routine for clear_huge_page()
> which doesn't need a temporary mapping of the user space page.
>
> While testing this patch on a Qualcomm SM8150 SoC (kernel v4.14.117), we see a 64%
> improvement in clear_huge_page().
>
> Ftrace results:
>
> Default profile:
>  ------------------------------------------
>  6) ! 473.802 us  |  clear_huge_page();
>  ------------------------------------------
>
> With this patch applied:
>  ------------------------------------------
>  5) ! 170.156 us  |  clear_huge_page();
>  ------------------------------------------

I suspect that if anything this is really pointing out how much
overhead is being added through process_huge_page. I know that on x86
most modern processors can initialize memory at somewhere between
16B/cycle and 32B/cycle, with some fixed amount of overhead for making
the rep movsb/stosb call. One thing that might make sense to look at would
be to see if we could possibly reduce the number of calls we have to
make with process_huge_page by taking the caches into account. For
example I know on x86 the L1 cache is 32K for most processors, so we
could look at possibly bumping things up so that we are processing 8
pages at a time and then making a call to cond_resched() instead of
doing it per 4K page.
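
A rough, untested sketch of that batching, reusing the clear_gigantic_page()
arguments (the batch of 8 sub-pages is only an assumption to match a 32K L1):

unsigned int i, j;

for (i = 0; i < pages_per_huge_page; i += 8) {
        unsigned int n = min(pages_per_huge_page - i, 8U);

        /* Zero one L1-sized batch of 4K sub-pages... */
        for (j = 0; j < n; j++)
                clear_user_highpage(page + i + j,
                                    addr + (i + j) * PAGE_SIZE);
        /* ...then give the scheduler a chance once per batch. */
        cond_resched();
}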

> Signed-off-by: Prathu Baronia <prathu.baronia@oneplus.com>
> Reported-by: Chintan Pandya <chintan.pandya@oneplus.com>
> ---
>  mm/memory.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 3ee073d..3e120e8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5119,6 +5119,7 @@ EXPORT_SYMBOL(__might_fault);
>  #endif
>
>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
> +#ifdef CONFIG_HIGHMEM
>  static void clear_gigantic_page(struct page *page,
>                                 unsigned long addr,
>                                 unsigned int pages_per_huge_page)
> @@ -5183,6 +5184,16 @@ void clear_huge_page(struct page *page,
>                                     addr + right_idx * PAGE_SIZE);
>         }
>  }
> +#else
> +void clear_huge_page(struct page *page,
> +                    unsigned long addr_hint, unsigned int pages_per_huge_page)
> +{
> +       void *addr;
> +
> +       addr = page_address(page);
> +       memset(addr, 0, pages_per_huge_page*PAGE_SIZE);
> +}
> +#endif

This seems like a very simplistic solution to the problem, and I am
worried something like this would introduce latency issues when
pages_per_huge_page gets to be large. It might make more sense to just
wrap the process_huge_page call in the original clear_huge_page and
then add this code block as an #else case. That way you avoid
potentially stalling a system for extended periods of time if you
start trying to clear 1G pages with the function.

One interesting data point would be to see what the cost is for
breaking this up into a loop where you only process some fixed number
of pages and running it with cond_resched() so you can avoid
introducing latency spikes.
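
For the !HIGHMEM side, a minimal sketch of that shape (untested; the 16-page
chunk size is an arbitrary assumption, and the names mirror the patch above)
could look like:

#else
void clear_huge_page(struct page *page,
                     unsigned long addr_hint, unsigned int pages_per_huge_page)
{
        void *addr = page_address(page);
        unsigned int i;

        /* Zero in fixed-size chunks so clearing a 1G page cannot stall the CPU. */
        for (i = 0; i < pages_per_huge_page; i += 16) {
                unsigned int n = min(pages_per_huge_page - i, 16U);

                memset(addr + i * PAGE_SIZE, 0, (size_t)n * PAGE_SIZE);
                cond_resched();
        }
}
#endif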


^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [RFC] mm/memory.c: Optimizing THP zeroing routine for !HIGHMEM cases
  2020-04-10 18:54 ` Alexander Duyck
@ 2020-04-11  8:45   ` Chintan Pandya
  2020-04-14 15:55     ` Daniel Jordan
  0 siblings, 1 reply; 14+ messages in thread
From: Chintan Pandya @ 2020-04-11  8:45 UTC (permalink / raw)
  To: Alexander Duyck, Prathu Baronia
  Cc: Andrew Morton, linux-mm, Greg KH, gthelen, jack, Michal Hocko,
	Ken Lin, Gasine Xu

> > +#else
> > +void clear_huge_page(struct page *page,
> > +                    unsigned long addr_hint, unsigned int
> > +pages_per_huge_page) {
> > +       void *addr;
> > +
> > +       addr = page_address(page);
> > +       memset(addr, 0, pages_per_huge_page*PAGE_SIZE); } #endif
> 
> This seems like a very simplistic solution to the problem, and I am worried
> something like this would introduce latency issues when pages_per_huge_page
> gets to be large. It might make more sense to just wrap the process_huge_page
> call in the original clear_huge_page and then add this code block as an #else
> case. That way you avoid potentially stalling a system for extended periods of
> time if you start trying to clear 1G pages with the function.
> 
> One interesting data point would be to see what the cost is for breaking this up
> into a loop where you only process some fixed number of pages and running it
> with cond_resched() so you can avoid introducing latency spikes.

As per the patch above, it's not using kmap_atomic(), and hence preemption and page
faults are not disabled. Do we still need to explicitly call cond_resched() in this case?
#justAsking

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [RFC] mm/memory.c: Optimizing THP zeroing routine for !HIGHMEM cases
  2020-04-10  9:05         ` Huang, Ying
@ 2020-04-11 15:40           ` Chintan Pandya
  2020-04-11 20:47             ` Alexander Duyck
  0 siblings, 1 reply; 14+ messages in thread
From: Chintan Pandya @ 2020-04-11 15:40 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Michal Hocko, Prathu Baronia, akpm, linux-mm, gregkh, gthelen,
	jack, Ken Lin, Gasine Xu

> > Generally, many architectures are optimized for serial loads, be it
> > initialization or access, as it is the simplest form of prediction. Any
> > random access pattern would kill that pre-fetching. And for now, I
> > suspect that to be the case here. Probably, we can run more tests to confirm
> this part.
> 
> Please prove your theory with a test.  Better to test on x86 too.

Wrote down below userspace test code.

Code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>


#define SZ_1M 0x100000
#define SZ_4K 0x1000
#define NUM 100

int main ()
{
  void *p;
  void *q;
  void *r;

  unsigned long total_pages, total_size;
  int i, j;
  struct timeval t0, t1, t2, t3;
  int elapsed;

  printf ("Hello World\n");

  total_size = NUM * SZ_1M;
  total_pages = NUM * (SZ_1M / SZ_4K);

  p = malloc (total_size);
  q = malloc (total_size);
  r = malloc (total_size);

  /* So that all pages gets allocated */
  memset (r, 0xa, total_size);
  memset (q, 0xa, total_size);
  memset (p, 0xa, total_size);

  gettimeofday (&t0, NULL);

  /* One shot memset */
  memset (r, 0xd, total_size);

  gettimeofday (&t1, NULL);

  /* traverse in forward order */
  for (j = 0; j < total_pages; j++)
    {
      memset (q + (j * SZ_4K), 0xc, SZ_4K);
    }

  gettimeofday (&t2, NULL);

  /* traverse in reverse order */
  for (i = 0; i < total_pages; i++)
    {
      memset (p + total_size - (i + 1) * SZ_4K, 0xb, SZ_4K);
    }

  gettimeofday (&t3, NULL);

  free (p);
  free (q);
  free (r);

  /* Results time */
  elapsed = ((t1.tv_sec - t0.tv_sec) * 1000000) + (t1.tv_usec - t0.tv_usec);
  printf ("One shot: %d micro seconds\n", elapsed);


  elapsed = ((t2.tv_sec - t1.tv_sec) * 1000000) + (t2.tv_usec - t1.tv_usec);
  printf ("Forward order: %d micro seconds\n", elapsed);


  elapsed = ((t3.tv_sec - t2.tv_sec) * 1000000) + (t3.tv_usec - t2.tv_usec);
  printf ("Reverse order: %d micro seconds\n", elapsed);
  return 0;
}
 
------------------------------------------------------------------------------------------------

Results for ARM64 target (SM8150, CPU0 & CPU6 are online, running at max frequency)
All numbers are mean of 100 iterations. Variation is ignorable.
- Oneshot : 3389.26 us
- Forward : 8876.16 us
- Reverse : 18157.6 us

Results for x86-64 (Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz, only CPU 0 in max frequency)
All numbers are mean of 100 iterations. Variation is ignorable.
- Oneshot : 3203.49 us
- Forward : 5766.46 us
- Reverse : 5187.86 us

To conclude, I observed that serial writes are optimized on the ARM processor. But strangely,
memset in reverse order performs better than forward order quite consistently across
multiple x86 machines. I don't have much insight into x86, so to clarify, I would like to
restrict my previous suspicion to ARM only.

> 
> Best Regards,
> Huang, Ying


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] mm/memory.c: Optimizing THP zeroing routine for !HIGHMEM cases
  2020-04-11 15:40           ` Chintan Pandya
@ 2020-04-11 20:47             ` Alexander Duyck
  2020-04-13 15:33               ` Prathu Baronia
  0 siblings, 1 reply; 14+ messages in thread
From: Alexander Duyck @ 2020-04-11 20:47 UTC (permalink / raw)
  To: Chintan Pandya
  Cc: Huang, Ying, Michal Hocko, Prathu Baronia, akpm, linux-mm,
	gregkh, gthelen, jack, Ken Lin, Gasine Xu

On Sat, Apr 11, 2020 at 8:40 AM Chintan Pandya
<chintan.pandya@oneplus.com> wrote:
>
> > > Generally, many architectures are optimized for serial loads, be it
> > > initialization or access, as it is the simplest form of prediction. Any
> > > random access pattern would kill that pre-fetching. And for now, I
> > > suspect that to be the case here. Probably, we can run more tests to confirm
> > this part.
> >
> > Please prove your theory with a test.  Better to test on x86 too.
>
> Wrote down below userspace test code.
>
> Code:
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/time.h>
>
>
> #define SZ_1M 0x100000
> #define SZ_4K 0x1000
> #define NUM 100
>
> int main ()
> {
>   void *p;
>   void *q;
>   void *r;
>
>   unsigned long total_pages, total_size;
>   int i, j;
>   struct timeval t0, t1, t2, t3;
>   int elapsed;
>
>   printf ("Hello World\n");
>
>   total_size = NUM * SZ_1M;
>   total_pages = NUM * (SZ_1M / SZ_4K);
>
>   p = malloc (total_size);
>   q = malloc (total_size);
>   r = malloc (total_size);
>
>   /* So that all pages gets allocated */
>   memset (r, 0xa, total_size);
>   memset (q, 0xa, total_size);
>   memset (p, 0xa, total_size);
>
>   gettimeofday (&t0, NULL);
>
>   /* One shot memset */
>   memset (r, 0xd, total_size);
>
>   gettimeofday (&t1, NULL);
>
>   /* traverse in forward order */
>   for (j = 0; j < total_pages; j++)
>     {
>       memset (q + (j * SZ_4K), 0xc, SZ_4K);
>     }
>
>   gettimeofday (&t2, NULL);
>
>   /* traverse in reverse order */
>   for (i = 0; i < total_pages; i++)
>     {
>       memset (p + total_size - (i + 1) * SZ_4K, 0xb, SZ_4K);
>     }
>
>   gettimeofday (&t3, NULL);
>
>   free (p);
>   free (q);
>   free (r);
>
>   /* Results time */
>   elapsed = ((t1.tv_sec - t0.tv_sec) * 1000000) + (t1.tv_usec - t0.tv_usec);
>   printf ("One shot: %d micro seconds\n", elapsed);
>
>
>   elapsed = ((t2.tv_sec - t1.tv_sec) * 1000000) + (t2.tv_usec - t1.tv_usec);
>   printf ("Forward order: %d micro seconds\n", elapsed);
>
>
>   elapsed = ((t3.tv_sec - t2.tv_sec) * 1000000) + (t3.tv_usec - t2.tv_usec);
>   printf ("Reverse order: %d micro seconds\n", elapsed);
>   return 0;
> }
>
> ------------------------------------------------------------------------------------------------
>
> Results for ARM64 target (SM8150, CPU0 & CPU6 are online, running at max frequency)
> All numbers are mean of 100 iterations. Variation is ignorable.
> - Oneshot : 3389.26 us
> - Forward : 8876.16 us
> - Reverse : 18157.6 us

This is an interesting data point. So running things in reverse seems
much more expensive than running them forward. As such I would imagine
process_huge_page is going to be significantly more expensive on
ARM64, then, since it will wind through the pages in reverse order from
the end of the page all the way down to wherever the page was accessed.

I wonder if we couldn't simply change process_huge_page to process pages
in two passes? The first would run from addr_hint + some offset to the
end, and then loop back around to the start of the page for the second
pass and process up to where we started the first pass. The idea
would be that the offset would be enough so that we have the 4K that
was accessed, plus some range before and after the address, hopefully
still in the L1 cache after we are done.
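
As a purely illustrative, untested sketch of that two-pass idea (KEEP_HOT_PAGES
is a made-up constant for the window to keep cache hot, and real code would
still need periodic cond_resched() calls):

unsigned long base = addr_hint &
        ~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
unsigned long hint_idx = (addr_hint - base) >> PAGE_SHIFT;
unsigned long split = min(hint_idx + KEEP_HOT_PAGES,
                          (unsigned long)pages_per_huge_page);
unsigned long i;

/* Pass 1: from just past the hot window to the end of the huge page. */
for (i = split; i < pages_per_huge_page; i++)
        clear_user_highpage(page + i, base + i * PAGE_SIZE);

/* Pass 2: wrap around, finishing near the faulting sub-page so the data
 * around it is still in the L1 cache when we return. */
for (i = 0; i < split; i++)
        clear_user_highpage(page + i, base + i * PAGE_SIZE);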

> Results for x86-64 (Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz, only CPU 0 in max frequency)
> All numbers are mean of 100 iterations. Variation is ignorable.
> - Oneshot : 3203.49 us
> - Forward : 5766.46 us
> - Reverse : 5187.86 us
>
> To conclude, I observed that serial writes are optimized on the ARM processor. But strangely,
> memset in reverse order performs better than forward order quite consistently across
> multiple x86 machines. I don't have much insight into x86, so to clarify, I would like to
> restrict my previous suspicion to ARM only.

What compiler options did you build the test code with? One
possibility is that the compiler may have optimized away
total_pages/total_size/i all into one variable and simply tracked it
until i is less than 0. I know I regularly will write loops to run in
reverse order for that reason as it tends to perform pretty well on
x86 as all you have to do is a sub or dec and then test the signed
flag to determine if you exit the loop.

An additional thing I was just wondering is if this also impacts the
copy operations as well? Looking through the code the two big users
for process_huge_page are clear_huge_page and copy_user_huge_page. One
thing that might make more sense than just splitting the code at a
high level would be to look at possibly refactoring process_huge_page
and the users for it.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] mm/memory.c: Optimizing THP zeroing routine for !HIGHMEM cases
  2020-04-11 20:47             ` Alexander Duyck
@ 2020-04-13 15:33               ` Prathu Baronia
  2020-04-13 16:24                 ` Alexander Duyck
  2020-04-14  1:10                 ` Huang, Ying
  0 siblings, 2 replies; 14+ messages in thread
From: Prathu Baronia @ 2020-04-13 15:33 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Chintan Pandya, Huang, Ying, Michal Hocko, akpm, linux-mm,
	gregkh, gthelen, jack, Ken Lin, Gasine Xu

The 04/11/2020 13:47, Alexander Duyck wrote:
> 
> This is an interesting data point. So running things in reverse seems
> much more expensive than running them forward. As such I would imagine
> process_huge_page is going to be significantly more expensive on
> ARM64, then, since it will wind through the pages in reverse order from
> the end of the page all the way down to wherever the page was accessed.
> 
> I wonder if we couldn't simply change process_huge_page to process pages
> in two passes? The first would run from addr_hint + some offset to the
> end, and then loop back around to the start of the page for the second
> pass and process up to where we started the first pass. The idea
> would be that the offset would be enough so that we have the 4K that
> was accessed, plus some range before and after the address, hopefully
> still in the L1 cache after we are done.
That's a great idea. We were working on a similar idea for the v2 patch, and your
suggestion has reassured us about our approach. This will combine the benefit of an
optimized memset with keeping the cache hot around the faulting address.

Earlier we had taken this offset as 0.5MB, and after your response we have kept it
at 32KB. As we understand there is a trade-off associated with keeping this value
too high, we would really appreciate it if you could suggest a method to derive an
appropriate value for this offset from the L1 cache size.
> 
> 
> An additional thing I was just wondering is if this also impacts the
> copy operations as well? Looking through the code the two big users
> for process_huge_page are clear_huge_page and copy_user_huge_page. One
> thing that might make more sense than just splitting the code at a
> high level would be to look at possibly refactoring process_huge_page
> and the users for it.
You are right, we didn't consider refactoring process_huge_page earlier. We
will incorporate this in the soon to be sent v2 patch.

Thanks a lot for the interesting insights!

-- 
Prathu Baronia
OnePlus RnD


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] mm/memory.c: Optimizing THP zeroing routine for !HIGHMEM cases
  2020-04-13 15:33               ` Prathu Baronia
@ 2020-04-13 16:24                 ` Alexander Duyck
  2020-04-14  1:10                 ` Huang, Ying
  1 sibling, 0 replies; 14+ messages in thread
From: Alexander Duyck @ 2020-04-13 16:24 UTC (permalink / raw)
  To: Prathu Baronia
  Cc: Chintan Pandya, Huang, Ying, Michal Hocko, akpm, linux-mm,
	gregkh, gthelen, jack, Ken Lin, Gasine Xu

On Mon, Apr 13, 2020 at 8:34 AM Prathu Baronia
<prathu.baronia@oneplus.com> wrote:
>
> The 04/11/2020 13:47, Alexander Duyck wrote:
> >
> > This is an interesting data point. So running things in reverse seems
> > much more expensive than running them forward. As such I would imagine
> > process_huge_page is going to be significantly more expensive on
> > ARM64, then, since it will wind through the pages in reverse order from
> > the end of the page all the way down to wherever the page was accessed.
> >
> > I wonder if we couldn't simply change process_huge_page to process pages
> > in two passes? The first would run from addr_hint + some offset to the
> > end, and then loop back around to the start of the page for the second
> > pass and process up to where we started the first pass. The idea
> > would be that the offset would be enough so that we have the 4K that
> > was accessed, plus some range before and after the address, hopefully
> > still in the L1 cache after we are done.
> That's a great idea. We were working on a similar idea for the v2 patch, and your
> suggestion has reassured us about our approach. This will combine the benefit of an
> optimized memset with keeping the cache hot around the faulting address.
>
> Earlier we had taken this offset as 0.5MB, and after your response we have kept it
> at 32KB. As we understand there is a trade-off associated with keeping this value
> too high, we would really appreciate it if you could suggest a method to derive an
> appropriate value for this offset from the L1 cache size.

I mentioned 32KB as a value since that happens to be a common value
for L1 cache size on both the ARM64 processor mentioned, and most
modern x86 CPUs. As far as deriving it I don't know if there is a good
way to go about doing that. I suspect it is something that would need
to be architecture specific. If nothing else you might be able to do
something like define it similar to how L1_CACHE_SHIFT/BYTES is
defined in cache.h for most architectures. Also we probably want to
play around with that value a bit as well as I suspect there may be
some room to either increase or decrease the value depending on the
cost for cold accesses versus being able to process memory
initialization in larger batches.
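
Something along these lines, purely as an illustration (these macros do not
exist in the kernel today and the 32KB default is just an assumption):

/* Arch-overridable size of the window to keep cache hot while clearing. */
#ifndef CLEAR_HUGE_HOT_SHIFT
#define CLEAR_HUGE_HOT_SHIFT	15	/* 32KB, a common L1D size */
#endif
#define CLEAR_HUGE_HOT_BYTES	(1UL << CLEAR_HUGE_HOT_SHIFT)
#define CLEAR_HUGE_HOT_PAGES	(CLEAR_HUGE_HOT_BYTES >> PAGE_SHIFT)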

> > An additional thing I was just wondering is if this also impacts the
> > copy operations as well? Looking through the code the two big users
> > for process_huge_page are clear_huge_page and copy_user_huge_page. One
> > thing that might make more sense than just splitting the code at a
> > high level would be to look at possibly refactoring process_huge_page
> > and the users for it.
> You are right, we didn't consider refactoring process_huge_page earlier. We
> will incorporate this in the soon to be sent v2 patch.
>
> Thanks a lot for the interesting insights!

Sounds good. I'll look forward to v2.

Thanks.

- Alex


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] mm/memory.c: Optimizing THP zeroing routine for !HIGHMEM cases
  2020-04-13 15:33               ` Prathu Baronia
  2020-04-13 16:24                 ` Alexander Duyck
@ 2020-04-14  1:10                 ` Huang, Ying
  1 sibling, 0 replies; 14+ messages in thread
From: Huang, Ying @ 2020-04-14  1:10 UTC (permalink / raw)
  To: Prathu Baronia
  Cc: Alexander Duyck, Chintan Pandya, Michal Hocko, akpm, linux-mm,
	gregkh, gthelen, jack, Ken Lin, Gasine Xu

Prathu Baronia <prathu.baronia@oneplus.com> writes:

> The 04/11/2020 13:47, Alexander Duyck wrote:
>> 
>> This is an interesting data point. So running things in reverse seems
>> much more expensive than running them forward. As such I would imagine
>> process_huge_page is going to be significantly more expensive on
>> ARM64, then, since it will wind through the pages in reverse order from
>> the end of the page all the way down to wherever the page was accessed.
>> 
>> I wonder if we couldn't simply change process_huge_page to process pages
>> in two passes? The first would run from addr_hint + some offset to the
>> end, and then loop back around to the start of the page for the second
>> pass and process up to where we started the first pass. The idea
>> would be that the offset would be enough so that we have the 4K that
>> was accessed, plus some range before and after the address, hopefully
>> still in the L1 cache after we are done.
> That's a great idea. We were working on a similar idea for the v2 patch, and your
> suggestion has reassured us about our approach. This will combine the benefit of an
> optimized memset with keeping the cache hot around the faulting address.
>
> Earlier we had taken this offset as 0.5MB, and after your response we have kept it
> at 32KB. As we understand there is a trade-off associated with keeping this value
> too high, we would really appreciate it if you could suggest a method to derive an
> appropriate value for this offset from the L1 cache size.

I don't think we should only keep the L1 cache hot.  I think it is good to
keep the L2 cache hot too.  That could be 1 MB on an x86 machine.  In theory,
it's better to keep as much cache hot as possible.

I understand that the benefit of keeping the cache hot is offset by the slower
backward zeroing on your system.  So you need to balance between the two.  But
because backward zeroing is as fast as forward zeroing on x86, we should
consider that too.  Maybe we need to use two different implementations
on x86 and ARM, or use some parameter to tune it for different
architectures.

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] mm/memory.c: Optimizing THP zeroing routine for !HIGHMEM cases
  2020-04-11  8:45   ` Chintan Pandya
@ 2020-04-14 15:55     ` Daniel Jordan
  2020-04-14 17:33       ` Chintan Pandya
  0 siblings, 1 reply; 14+ messages in thread
From: Daniel Jordan @ 2020-04-14 15:55 UTC (permalink / raw)
  To: Chintan Pandya
  Cc: Alexander Duyck, Prathu Baronia, Andrew Morton, linux-mm,
	Greg KH, gthelen, jack, Michal Hocko, Ken Lin, Gasine Xu

On Sat, Apr 11, 2020 at 08:45:35AM +0000, Chintan Pandya wrote:
> > > +#else
> > > +void clear_huge_page(struct page *page,
> > > +                    unsigned long addr_hint, unsigned int
> > > +pages_per_huge_page) {
> > > +       void *addr;
> > > +
> > > +       addr = page_address(page);
> > > +       memset(addr, 0, pages_per_huge_page*PAGE_SIZE); } #endif
> > 
> > This seems like a very simplistic solution to the problem, and I am worried
> > something like this would introduce latency issues when pages_per_huge_page
> > gets to be large. It might make more sense to just wrap the process_huge_page
> > call in the original clear_huge_page and then add this code block as an #else
> > case. That way you avoid potentially stalling a system for extended periods of
> > time if you start trying to clear 1G pages with the function.
> > 
> > One interesting data point would be to see what the cost is for breaking this up
> > into a loop where you only process some fixed number of pages and running it
> > with cond_resched() so you can avoid introducing latency spikes.
> 
> As per the patch above, it's not using kmap_atomic(), and hence preemption and page
> faults are not disabled. Do we still need to explicitly call cond_resched() in this case?
> #justAsking

Didn't see this answered on the list, but the answer is yes because the kernel
may not have CONFIG_PREEMPTION enabled.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [RFC] mm/memory.c: Optimizing THP zeroing routine for !HIGHMEM cases
  2020-04-14 15:55     ` Daniel Jordan
@ 2020-04-14 17:33       ` Chintan Pandya
  0 siblings, 0 replies; 14+ messages in thread
From: Chintan Pandya @ 2020-04-14 17:33 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: Alexander Duyck, Prathu Baronia, Andrew Morton, linux-mm,
	Greg KH, gthelen, jack, Michal Hocko, Ken Lin, Gasine Xu

> > > One interesting data point would be to see what the cost is for
> > > breaking this up into a loop where you only process some fixed
> > > number of pages and running it with cond_resched() so you can avoid
> introducing latency spikes.
> >
> > As per the patch above, it's not using kmap_atomic() and hence
> > preemption & page_fault are not disabled. Do we still need to explicitly call
> cond_resched() in this case?
> > #justAsking
> 
> Didn't see this answered on the list, but the answer is yes because the kernel
> may not have CONFIG_PREEMPTION enabled.

Ok. This is addressable. Prathu has already spun off v2. If there is any scope
for spinning a v3, this will be taken care of. Thanks for the clarification.


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2020-04-14 17:33 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-03  8:18 [RFC] mm/memory.c: Optimizing THP zeroing routine for !HIGHMEM cases Prathu Baronia
2020-04-03  8:52 ` Michal Hocko
2020-04-09 15:29   ` Prathu Baronia
2020-04-09 15:45     ` Michal Hocko
     [not found]       ` <SG2PR04MB2921D2AAA8726318EF53D83691DE0@SG2PR04MB2921.apcprd04.prod.outlook.com>
2020-04-10  9:05         ` Huang, Ying
2020-04-11 15:40           ` Chintan Pandya
2020-04-11 20:47             ` Alexander Duyck
2020-04-13 15:33               ` Prathu Baronia
2020-04-13 16:24                 ` Alexander Duyck
2020-04-14  1:10                 ` Huang, Ying
2020-04-10 18:54 ` Alexander Duyck
2020-04-11  8:45   ` Chintan Pandya
2020-04-14 15:55     ` Daniel Jordan
2020-04-14 17:33       ` Chintan Pandya

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).