* [PATCH v2 0/1] mm: Optimizing hugepage zeroing in arm64
@ 2021-02-02  7:42 Prathu Baronia
  2021-02-02  7:42 ` [PATCH v2 1/1] " Prathu Baronia
  0 siblings, 1 reply; 4+ messages in thread
From: Prathu Baronia @ 2021-02-02  7:42 UTC (permalink / raw)
  To: linux-kernel
  Cc: chintan.pandya, Prathu Baronia, Andrew Morton, Thomas Gleixner,
	Ira Weiny, Randy Dunlap, Matthew Wilcox (Oracle)

As discussed on the v1 thread, I have used the recently introduced
kmap_local_* APIs to avoid unnecessary preemption and pagefault
disabling. I did not get a further response on the previous thread, so
I am sending this again.

Prathu Baronia (1):
  mm: Optimizing hugepage zeroing in arm64

 include/linux/highmem.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

-- 
2.17.1



* [PATCH v2 1/1] mm: Optimizing hugepage zeroing in arm64
  2021-02-02  7:42 [PATCH v2 0/1] mm: Optimizing hugepage zeroing in arm64 Prathu Baronia
@ 2021-02-02  7:42 ` Prathu Baronia
  2021-02-02 20:03   ` Ira Weiny
  0 siblings, 1 reply; 4+ messages in thread
From: Prathu Baronia @ 2021-02-02  7:42 UTC (permalink / raw)
  To: linux-kernel
  Cc: chintan.pandya, Prathu Baronia, Andrew Morton, Ira Weiny,
	Thomas Gleixner, Peter Zijlstra (Intel),
	Randy Dunlap, Matthew Wilcox (Oracle)

In !HIGHMEM configurations, especially on 64-bit architectures, pages
never need a temporary kernel mapping, so k(map|unmap)_atomic() amounts
to nothing more than a pair of barrier() calls each (from disabling and
re-enabling preemption and pagefaults). For a 2MB hugepage,
clear_huge_page() calls each of them 512 times, once per subpage, which
adds up to 2048 barrier() calls in total. This called for optimization:
simply deriving the kernel virtual address from the page via the
kmap_local_* APIs does the job for us. We profiled clear_huge_page()
using ftrace and observed an improvement of 62%.
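
As a simplified sketch of where those barriers come from (this
paraphrases the generic !HIGHMEM code paths of this era rather than
quoting them verbatim):
-------------------------------------------------------------
/* !HIGHMEM: no temporary mapping is created; only the implied
 * barriers from the preempt/pagefault disable pairs remain. */
static inline void *kmap_atomic(struct page *page)
{
        preempt_disable();              /* contains a barrier() */
        pagefault_disable();            /* contains a barrier() */
        return page_address(page);
}

static inline void __kunmap_atomic(void *addr)
{
        pagefault_enable();             /* contains a barrier() */
        preempt_enable();               /* contains a barrier() */
}

/* !HIGHMEM: kmap_local_page() is just the address computation. */
static inline void *kmap_local_page(struct page *page)
{
        return page_address(page);
}
-------------------------------------------------------------
In the HIGHMEM case both APIs still create a real temporary mapping,
so the saving is confined to !HIGHMEM builds such as typical arm64
configurations.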

Setup:-
The data below was collected on Qualcomm's SM7250 SoC with THP enabled
(kernel v4.19.113), with only CPU-0 (Cortex-A55) and CPU-7 (Cortex-A76)
online and pinned to their maximum frequency, and with DDR set to the
performance governor.

FTRACE Data:-

Base data:-
Number of iterations: 48
Mean of allocation time: 349.5 us
std deviation: 74.5 us

v1 data:-
Number of iterations: 48
Mean of allocation time: 131 us
std deviation: 32.7 us

The following simple userspace experiment, which allocates 100MB
(BUF_SZ) of pages and writes to each of them, gave us good insight: we
observed a 42% improvement in the combined allocation and write
timings.
-------------------------------------------------------------
Test code snippet
-------------------------------------------------------------
      clock_start();
      buf = malloc(BUF_SZ); /* Allocate 100 MB of memory */

      for (i = 0; i < BUF_SZ_PAGES; i++)
              *(int *)(buf + i * PAGE_SIZE) = 1;

      clock_end();
-------------------------------------------------------------
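
For reference, a self-contained version of the experiment could look
like the sketch below; clock_start()/clock_end() above are pseudocode,
so clock_gettime() stands in for them here, and the PAGE_SIZE/BUF_SZ
constants are assumptions matching the description (4 KB pages, 100 MB
buffer):
-------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define PAGE_SIZE    4096UL
#define BUF_SZ       (100UL * 1024 * 1024)     /* 100 MB */
#define BUF_SZ_PAGES (BUF_SZ / PAGE_SIZE)

int main(void)
{
        struct timespec start, end;
        unsigned long i;
        char *buf;

        clock_gettime(CLOCK_MONOTONIC, &start);

        buf = malloc(BUF_SZ);   /* allocate 100 MB of memory */
        if (!buf)
                return 1;

        /* Write one word per page so each page is actually faulted
         * in and zeroed by the kernel. */
        for (i = 0; i < BUF_SZ_PAGES; i++)
                *(int *)(buf + i * PAGE_SIZE) = 1;

        clock_gettime(CLOCK_MONOTONIC, &end);

        printf("alloc + write: %ld us\n",
               (end.tv_sec - start.tv_sec) * 1000000L +
               (end.tv_nsec - start.tv_nsec) / 1000L);
        free(buf);
        return 0;
}
-------------------------------------------------------------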

Malloc test timings for 100MB anon allocation:-

Base data:-
Number of iterations: 100
Mean of allocation time: 31831 us
std deviation: 4286 us

v1 data:-
Number of iterations: 100
Mean of allocation time: 18193 us
std deviation: 4915 us

Reported-by: Chintan Pandya <chintan.pandya@oneplus.com>
Signed-off-by: Prathu Baronia <prathu.baronia@oneplus.com>
---
 include/linux/highmem.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index d2c70d3772a3..444df139b489 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -146,9 +146,9 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, int size)
 #ifndef clear_user_highpage
 static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
 {
-	void *addr = kmap_atomic(page);
+	void *addr = kmap_local_page(page);
 	clear_user_page(addr, vaddr, page);
-	kunmap_atomic(addr);
+	kunmap_local(addr);
 }
 #endif
 
-- 
2.17.1



* Re: [PATCH v2 1/1] mm: Optimizing hugepage zeroing in arm64
  2021-02-02  7:42 ` [PATCH v2 1/1] " Prathu Baronia
@ 2021-02-02 20:03   ` Ira Weiny
       [not found]     ` <CAJp9fscSi1yqcZagc7HzKV1h99X0wP6FWuQx8OpnqwgSp8yA5A@mail.gmail.com>
  0 siblings, 1 reply; 4+ messages in thread
From: Ira Weiny @ 2021-02-02 20:03 UTC (permalink / raw)
  To: Prathu Baronia
  Cc: linux-kernel, chintan.pandya, Prathu Baronia, Andrew Morton,
	Thomas Gleixner, Peter Zijlstra (Intel),
	Randy Dunlap, Matthew Wilcox (Oracle)

On Tue, Feb 02, 2021 at 01:12:24PM +0530, Prathu Baronia wrote:
> In !HIGHMEM configurations, especially on 64-bit architectures, pages
> never need a temporary kernel mapping, so k(map|unmap)_atomic() amounts
> to nothing more than a pair of barrier() calls each (from disabling and
> re-enabling preemption and pagefaults). For a 2MB hugepage,
> clear_huge_page() calls each of them 512 times, once per subpage, which
> adds up to 2048 barrier() calls in total. This called for optimization:
> simply deriving the kernel virtual address from the page via the
> kmap_local_* APIs does the job for us. We profiled clear_huge_page()
> using ftrace and observed an improvement of 62%.

Nice!

> 
> Setup:-
> The data below was collected on Qualcomm's SM7250 SoC with THP enabled
> (kernel v4.19.113), with only CPU-0 (Cortex-A55) and CPU-7 (Cortex-A76)
> online and pinned to their maximum frequency, and with DDR set to the
> performance governor.
> 
> FTRACE Data:-
> 
> Base data:-
> Number of iterations: 48
> Mean of allocation time: 349.5 us
> std deviation: 74.5 us
> 
> v1 data:-
> Number of iterations: 48
> Mean of allocation time: 131 us
> std deviation: 32.7 us
> 
> The following simple userspace experiment, which allocates 100MB
> (BUF_SZ) of pages and writes to each of them, gave us good insight: we
> observed a 42% improvement in the combined allocation and write
> timings.
> -------------------------------------------------------------
> Test code snippet
> -------------------------------------------------------------
>       clock_start();
>       buf = malloc(BUF_SZ); /* Allocate 100 MB of memory */
> 
>       for (i = 0; i < BUF_SZ_PAGES; i++)
>               *(int *)(buf + i * PAGE_SIZE) = 1;
> 
>       clock_end();
> -------------------------------------------------------------
> 
> Malloc test timings for 100MB anon allocation:-
> 
> Base data:-
> Number of iterations: 100
> Mean of allocation time: 31831 us
> std deviation: 4286 us
> 
> v1 data:-
> Number of iterations: 100
> Mean of allocation time: 18193 us
> std deviation: 4915 us
> 
> Reported-by: Chintan Pandya <chintan.pandya@oneplus.com>
> Signed-off-by: Prathu Baronia <prathu.baronia@oneplus.com>

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

FWIW, I have the same change in a patch in my kmap() changes branch.  However,
my patch also changes clear_highpage(), zero_user_segments(),
copy_user_highpage(), and copy_highpage().
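
A minimal sketch of what two of those conversions might look like,
assuming they follow the same kmap_atomic() -> kmap_local_page()
pattern as clear_user_highpage() above (an illustration, not the
actual patch):
-------------------------------------------------------------
static inline void clear_highpage(struct page *page)
{
        void *kaddr = kmap_local_page(page);

        clear_page(kaddr);
        kunmap_local(kaddr);
}

static inline void copy_highpage(struct page *to, struct page *from)
{
        char *vfrom = kmap_local_page(from);
        char *vto = kmap_local_page(to);

        copy_page(vto, vfrom);

        /* kmap_local mappings are stack-like: unmap in reverse order */
        kunmap_local(vto);
        kunmap_local(vfrom);
}
-------------------------------------------------------------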

Would changing those help you as well?

Ira

> ---
>  include/linux/highmem.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index d2c70d3772a3..444df139b489 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -146,9 +146,9 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, int size)
>  #ifndef clear_user_highpage
>  static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
>  {
> -	void *addr = kmap_atomic(page);
> +	void *addr = kmap_local_page(page);
>  	clear_user_page(addr, vaddr, page);
> -	kunmap_atomic(addr);
> +	kunmap_local(addr);
>  }
>  #endif
>  
> -- 
> 2.17.1
> 


* Re: [PATCH v2 1/1] mm: Optimizing hugepage zeroing in arm64
       [not found]     ` <CAJp9fscSi1yqcZagc7HzKV1h99X0wP6FWuQx8OpnqwgSp8yA5A@mail.gmail.com>
@ 2021-02-03 19:42       ` Ira Weiny
  0 siblings, 0 replies; 4+ messages in thread
From: Ira Weiny @ 2021-02-03 19:42 UTC (permalink / raw)
  To: Prathu Baronia
  Cc: linux-kernel, chintan.pandya, Prathu Baronia, Andrew Morton,
	Thomas Gleixner, Peter Zijlstra (Intel),
	Randy Dunlap, Matthew Wilcox (Oracle)

On Wed, Feb 03, 2021 at 04:08:08PM +0530, Prathu Baronia wrote:
>    Hey Ira,
>    I looked at your below-mentioned patch and I agree that the
>    above-mentioned functions also need modification similar to
>    clear_user_highpage().
>    Would it be okay with you if I send your patch again with a modified
>    commit message by adding my data and maintaining your authorship?
>    [1]https://lore.kernel.org/lkml/20201210171834.2472353-2-ira.weiny@intel.com/

Sure.  I have not changed the patch at all from that version.

Andrew, will this be going through your tree?  If not who?

If you take the above patch I can drop it from the series I'm about to submit
to convert btrfs kmaps.

Ira

>    Regards,
>    Prathu Baronia
> 
>    On Wed, Feb 3, 2021 at 1:33 AM Ira Weiny <[2]ira.weiny@intel.com> wrote:
> 
>      On Tue, Feb 02, 2021 at 01:12:24PM +0530, Prathu Baronia wrote:
>      > In !HIGHMEM configurations, especially on 64-bit architectures,
>      > pages never need a temporary kernel mapping, so
>      > k(map|unmap)_atomic() amounts to nothing more than a pair of
>      > barrier() calls each (from disabling and re-enabling preemption
>      > and pagefaults). For a 2MB hugepage, clear_huge_page() calls each
>      > of them 512 times, once per subpage, which adds up to 2048
>      > barrier() calls in total. This called for optimization: simply
>      > deriving the kernel virtual address from the page via the
>      > kmap_local_* APIs does the job for us. We profiled
>      > clear_huge_page() using ftrace and observed an improvement of 62%.
> 
>      Nice!
> 
>      >
>      > Setup:-
>      > The data below was collected on Qualcomm's SM7250 SoC with THP
>      > enabled (kernel v4.19.113), with only CPU-0 (Cortex-A55) and
>      > CPU-7 (Cortex-A76) online and pinned to their maximum frequency,
>      > and with DDR set to the performance governor.
>      >
>      > FTRACE Data:-
>      >
>      > Base data:-
>      > Number of iterations: 48
>      > Mean of allocation time: 349.5 us
>      > std deviation: 74.5 us
>      >
>      > v1 data:-
>      > Number of iterations: 48
>      > Mean of allocation time: 131 us
>      > std deviation: 32.7 us
>      >
>      > The following simple userspace experiment, which allocates 100MB
>      > (BUF_SZ) of pages and writes to each of them, gave us good
>      > insight: we observed a 42% improvement in the combined allocation
>      > and write timings.
>      > -------------------------------------------------------------
>      > Test code snippet
>      > -------------------------------------------------------------
>      >       clock_start();
>      >       buf = malloc(BUF_SZ); /* Allocate 100 MB of memory */
>      >
>      >       for (i = 0; i < BUF_SZ_PAGES; i++)
>      >               *(int *)(buf + i * PAGE_SIZE) = 1;
>      >
>      >       clock_end();
>      > -------------------------------------------------------------
>      >
>      > Malloc test timings for 100MB anon allocation:-
>      >
>      > Base data:-
>      > Number of iterations: 100
>      > Mean of allocation time: 31831 us
>      > std deviation: 4286 us
>      >
>      > v1 data:-
>      > Number of iterations: 100
>      > Mean of allocation time: 18193 us
>      > std deviation: 4915 us
>      >
>      > Reported-by: Chintan Pandya <[3]chintan.pandya@oneplus.com>
>      > Signed-off-by: Prathu Baronia <[4]prathu.baronia@oneplus.com>
> 
>      Reviewed-by: Ira Weiny <[5]ira.weiny@intel.com>
> 
>      FWIW, I have the same change in a patch in my kmap() changes
>      branch.  However, my patch also changes clear_highpage(),
>      zero_user_segments(), copy_user_highpage(), and copy_highpage().
> 
>      Would changing those help you as well?
> 
>      Ira
> 
>      > ---
>      >  include/linux/highmem.h | 4 ++--
>      >  1 file changed, 2 insertions(+), 2 deletions(-)
>      >
>      > diff --git a/include/linux/highmem.h b/include/linux/highmem.h
>      > index d2c70d3772a3..444df139b489 100644
>      > --- a/include/linux/highmem.h
>      > +++ b/include/linux/highmem.h
>      > @@ -146,9 +146,9 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, int size)
>      >  #ifndef clear_user_highpage
>      >  static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
>      >  {
>      > -     void *addr = kmap_atomic(page);
>      > +     void *addr = kmap_local_page(page);
>      >       clear_user_page(addr, vaddr, page);
>      > -     kunmap_atomic(addr);
>      > +     kunmap_local(addr);
>      >  }
>      >  #endif
>      > 
>      > --
>      > 2.17.1
>      >
> 
> References
> 
>    Visible links
>    1. https://lore.kernel.org/lkml/20201210171834.2472353-2-ira.weiny@intel.com/
>    2. mailto:ira.weiny@intel.com
>    3. mailto:chintan.pandya@oneplus.com
>    4. mailto:prathu.baronia@oneplus.com
>    5. mailto:ira.weiny@intel.com

