[PATCH v2] mm: Optimized hugepage zeroing & copying from user

From: Prathu Baronia @ 2020-04-14 15:38 UTC
  To: alexander.duyck, chintan.pandya, ying.huang, mhocko, akpm,
	linux-mm, gregkh, gthelen, jack, ken.lin, gasine.xu

In !HIGHMEM configurations, especially on 64-bit architectures, we don't need
temporary mappings of pages. k(map|unmap)_atomic() therefore amounts to nothing
more than multiple barrier() calls. For example, for a 2MB hugepage,
clear_huge_page() calls them 512 times each, i.e. once to map and once to unmap
every subpage, for a total of 2048 barrier calls. This calls for optimization:
simply getting the VADDR from the page does the job. The same applies to
copy_user_huge_page().
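
The barrier count above can be reproduced with a small calculation. This is
only a sketch: barrier_calls() is a hypothetical helper, not part of the patch,
and it assumes that each kmap_atomic() or kunmap_atomic() call reduces to two
barrier() calls in a !HIGHMEM build, which is consistent with the quoted total
but not stated explicitly.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of the patch): number of barrier() calls
 * made while mapping and unmapping every subpage of a huge page.
 * Assumption: each kmap_atomic()/kunmap_atomic() call costs
 * barriers_per_call barrier() calls.
 */
static size_t barrier_calls(size_t huge_page_size, size_t subpage_size,
			    size_t barriers_per_call)
{
	size_t subpages = huge_page_size / subpage_size;

	/* One map call and one unmap call per subpage. */
	return subpages * 2 * barriers_per_call;
}
```

With a 2MB hugepage, 4KB subpages and two barriers per call,
barrier_calls(2UL << 20, 4096, 2) gives 512 * 2 * 2 = 2048.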

With kmap_atomic() out of the picture, we can use memset and memcpy for sizes
larger than 4K. In a simple experiment, getting the VADDR from the page and
using memset directly, instead of the left-right approach to reach the target
subpage, gave a 64% improvement in time over the current approach.

With this (v2) patch we observe a 65.85% improvement (under controlled
conditions) over the current approach.

Currently process_huge_page() iterates over subpages in a left-right manner, so
that the subpage that was accessed is processed last and the cache stays hot
around the faulting address. This causes a latency issue: as we observed on
ARM64, reverse access is much slower than forward access, and far slower than
oneshot access, because of the prefetcher behaviour. The following simple
userspace experiment, which allocates 100MB (total_size) of pages and writes to
them using a oneshot memset, a forward-order loop and a reverse-order loop, gave
us good insight:

--------------------------------------------------------------------------------
Test code snippet:
--------------------------------------------------------------------------------
  /* One shot memset */
  memset (r, 0xd, total_size);

  /* traverse in forward order */
  for (j = 0; j < total_pages; j++)
    {
      memset (q + (j * SZ_4K), 0xc, SZ_4K);
    }

  /* traverse in reverse order */
  for (i = 0; i < total_pages; i++)
    {
      memset (p + total_size - (i + 1) * SZ_4K, 0xb, SZ_4K);
    }
----------------------------------------------------------------------
Results:
----------------------------------------------------------------------
Results for ARM64 target (SM8150, CPU0 & 6 are online, running at max
frequency)
All numbers are the mean of 100 iterations. Variation is negligible.
----------------------------------------------------------------------
- Oneshot : 3389.26 us
- Forward : 8876.16 us
- Reverse : 18157.6 us
----------------------------------------------------------------------

----------------------------------------------------------------------
Results for x86-64 (Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz, only CPU 0 at max
frequency, DDR also running at max frequency)
All numbers are the mean of 100 iterations. Variation is negligible.
----------------------------------------------------------------------
- Oneshot : 3203.49 us
- Forward : 5766.46 us
- Reverse : 5187.86 us
----------------------------------------------------------------------
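
For readers who want to repeat the experiment, the three access patterns can be
wrapped in a small timing harness. This is a minimal sketch, not the test
program used for the numbers above: fill_buffer(), time_fill_us() and the
fill_order enum are hypothetical names, and clock() measures CPU time rather
than wall-clock time, which is assumed to be adequate for a single-threaded
run.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

#define SZ_4K 4096UL

enum fill_order { FILL_ONESHOT, FILL_FORWARD, FILL_REVERSE };

/* Write `value` over the whole buffer in the requested order. */
static void fill_buffer(unsigned char *buf, size_t total_size,
			enum fill_order order, int value)
{
	size_t total_pages = total_size / SZ_4K;
	size_t i;

	switch (order) {
	case FILL_ONESHOT:
		/* One memset over the whole buffer. */
		memset(buf, value, total_size);
		break;
	case FILL_FORWARD:
		/* One 4K memset per page, lowest address first. */
		for (i = 0; i < total_pages; i++)
			memset(buf + i * SZ_4K, value, SZ_4K);
		break;
	case FILL_REVERSE:
		/* One 4K memset per page, highest address first. */
		for (i = 0; i < total_pages; i++)
			memset(buf + total_size - (i + 1) * SZ_4K, value, SZ_4K);
		break;
	}
}

/* Return the CPU time taken by one fill, in microseconds. */
static double time_fill_us(unsigned char *buf, size_t total_size,
			   enum fill_order order, int value)
{
	clock_t start, end;

	start = clock();
	fill_buffer(buf, total_size, order, value);
	end = clock();

	return (double)(end - start) * 1e6 / CLOCKS_PER_SEC;
}
```

A caller would allocate total_size bytes (a multiple of 4K), call
time_fill_us() once per order in a loop, and average the results, as the
figures above do over 100 iterations.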

Hence, refactor process_huge_page() to process the hugepage in a oneshot
manner, using oneshot versions of the clear_huge_page() and
copy_user_huge_page() routines for !HIGHMEM cases.

These oneshot routines zero using memset and copy using memcpy. After extensive
testing on ARM64 and some local testing on x86, we observed that the memset and
memcpy routines are highly optimized, and with the above data points in hand it
made sense to use them directly instead of looping over all subpages. The
oneshot routines zero and copy with a small offset (default 32KB for now) to
keep the cache hot around the faulting address. This offset depends on the
cache size and is therefore exposed as a tunable configuration option
(CONFIG_HOT_CACHE_RANGE).
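
The way the offset splits the huge page into two ranges can be modelled in
userspace. This is a sketch under stated assumptions, not the kernel code:
split_huge_page() and struct hp_range are hypothetical names, the page size and
hot range are passed as parameters instead of coming from PAGE_SIZE and
CONFIG_HOT_CACHE_RANGE, and both sizes are assumed to be powers of two.

```c
#include <stddef.h>

/* One contiguous byte range inside the huge page. */
struct hp_range {
	unsigned long offset;
	unsigned long size;
};

/*
 * Hypothetical userspace model of the range split. Fills `out` with the
 * ranges in processing order and returns how many there are (1 or 2).
 */
static int split_huge_page(unsigned long addr_hint,
			   unsigned long huge_page_size,
			   unsigned long page_size,
			   unsigned long hot_range_pages,
			   struct hp_range out[2])
{
	/* Base address of the huge page that contains addr_hint. */
	unsigned long addr_base = addr_hint & ~(huge_page_size - 1);
	/* Page-aligned offset of the faulting subpage in the huge page. */
	unsigned long offset = (addr_hint - addr_base) & ~(page_size - 1);
	/* Everything from this point up is processed first. */
	unsigned long split = offset + hot_range_pages * page_size;
	int n = 0;

	if (split < huge_page_size) {
		out[n].offset = split;
		out[n].size = huge_page_size - split;
		n++;
	} else {
		/* The hot range reaches the end: process in one pass. */
		split = huge_page_size;
	}

	/* The lower range, which contains the faulting address, goes last. */
	out[n].offset = 0;
	out[n].size = split;
	n++;

	return n;
}
```

For a 2MB huge page with 4KB pages, a hot range of 8 pages and a fault in
subpage 5, the model processes [53248, 2097152) first and [0, 53248) last. A
fault in subpage 510 leaves no room above the hot range, so the whole page is
processed in a single pass from offset 0.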

The profiles below are for ARM64 (SM8150, CPU0 & 6 are online, running at max
frequency, DDR also running at max frequency).

----------------------------------------------------------------------
Ftrace Results(clear_huge_page_profile()):
----------------------------------------------------------------------
All timing values are in microseconds(us)
----------------------------------------------------------------------
Base:
        - CPU0:
                - Samples: 95
                - Mean: 242.099 us
                - Std dev: 45.0096 us
        - CPU6:
                - Samples: 61
                - Mean: 258.372 us
                - Std dev: 22.0754 us
----------------------------------------------------------------------
v2:
        - CPU0:
                - Samples: 63
                - Mean: 112.297 us
                - Std dev: 0.310989 us
        - CPU6:
                - Samples: 99
                - Mean: 67.359 us
                - Std dev: 1.15997 us
----------------------------------------------------------------------
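
The per-CPU means above can be turned into a relative improvement with a
one-line helper. This is only an illustration: improvement_percent() is a
hypothetical name, and the calculation uses the means alone, ignoring the
differing sample counts and standard deviations.

```c
/* Relative reduction in mean time, in percent of the base time. */
static double improvement_percent(double base_us, double new_us)
{
	return (base_us - new_us) / base_us * 100.0;
}
```

Applied to the means above, this gives roughly 53.6% for CPU0 (242.099 us down
to 112.297 us) and roughly 73.9% for CPU6 (258.372 us down to 67.359 us).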

Signed-off-by: Prathu Baronia <prathu.baronia@oneplus.com>
Reported-by: Chintan Pandya <chintan.pandya@oneplus.com>
---
 mm/Kconfig  |  13 +++++
 mm/memory.c | 159 +++++++++++++++++++++++-----------------------------
 2 files changed, 83 insertions(+), 89 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index ab80933be65f..31c169432276 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -739,4 +739,17 @@ config ARCH_HAS_HUGEPD
 config MAPPING_DIRTY_HELPERS
         bool
 
+config HOT_CACHE_RANGE
+	int
+	default 8
+	range 1 512
+	help
+	  This value can be tweaked to make the cache hot around the
+	  faulting address of the hugepage, primarily to make the
+	  clearing and copying of hugepage more cache friendly. It is
+	  proportionate to the cache size and should be kept bigger
+	  for bigger caches for better performance.
+	  This value is a number of pages; the code scales it by PAGE_SIZE.
+	  Don't change it if you are not sure.
+
 endmenu
diff --git a/mm/memory.c b/mm/memory.c
index e8bfdf0d9d1d..e7b16e76e794 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4655,6 +4655,11 @@ EXPORT_SYMBOL(__might_fault);
 #endif
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+struct copy_subpage_arg {
+	struct page *dst;
+	struct page *src;
+	struct vm_area_struct *vma;
+};
 /*
  * Process all subpages of the specified huge page with the specified
  * operation.  The target subpage will be processed last to keep its
@@ -4662,137 +4667,113 @@ EXPORT_SYMBOL(__might_fault);
  */
 static inline void process_huge_page(
 	unsigned long addr_hint, unsigned int pages_per_huge_page,
-	void (*process_subpage)(unsigned long addr, int idx, void *arg),
+	void (*process_subpage)(unsigned long offset, int size, void *arg),
 	void *arg)
 {
-	int i, n, base, l;
-	unsigned long addr = addr_hint &
-		~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
+	unsigned long clear_start_addr;
+	unsigned long addr_base;
+	unsigned long offset;
+	unsigned long remaining;
+	unsigned long huge_page_size = (unsigned long) pages_per_huge_page << PAGE_SHIFT;
+	struct copy_subpage_arg *general_arg = arg;
+
+	addr_base = addr_hint & ~(huge_page_size - 1);
 
-	/* Process target subpage last to keep its cache lines hot */
-	might_sleep();
-	n = (addr_hint - addr) / PAGE_SIZE;
-	if (2 * n <= pages_per_huge_page) {
-		/* If target subpage in first half of huge page */
-		base = 0;
-		l = n;
-		/* Process subpages at the end of huge page */
-		for (i = pages_per_huge_page - 1; i >= 2 * n; i--) {
-			cond_resched();
-			process_subpage(addr + i * PAGE_SIZE, i, arg);
-		}
-	} else {
-		/* If target subpage in second half of huge page */
-		base = pages_per_huge_page - 2 * (pages_per_huge_page - n);
-		l = pages_per_huge_page - n;
-		/* Process subpages at the begin of huge page */
-		for (i = 0; i < base; i++) {
-			cond_resched();
-			process_subpage(addr + i * PAGE_SIZE, i, arg);
-		}
-	}
 	/*
-	 * Process remaining subpages in left-right-left-right pattern
-	 * towards the target subpage
+	 * Convert addr_hint into a PAGE_SIZE-aligned offset in the huge page
 	 */
-	for (i = 0; i < l; i++) {
-		int left_idx = base + i;
-		int right_idx = base + 2 * l - 1 - i;
+	offset = (addr_hint - addr_base) & ~(PAGE_SIZE - 1);
 
-		cond_resched();
-		process_subpage(addr + left_idx * PAGE_SIZE, left_idx, arg);
-		cond_resched();
-		process_subpage(addr + right_idx * PAGE_SIZE, right_idx, arg);
+	/*
+	 * First, we process the higher range of addresses and then the
+	 * lower range of addresses within the huge page. This makes
+	 * [addr_hint - range, addr_hint + range] the last range to be
+	 * processed, which keeps the cache hot around addr_hint and
+	 * helps subsequent operations.
+	 */
+	clear_start_addr = offset + CONFIG_HOT_CACHE_RANGE*PAGE_SIZE;
+	if (clear_start_addr < huge_page_size) {
+		process_subpage(clear_start_addr, huge_page_size - clear_start_addr, general_arg);
+		remaining = clear_start_addr;
+	} else {
+		remaining = huge_page_size;
 	}
+
+	process_subpage(0, remaining, general_arg);
 }
 
-static void clear_gigantic_page(struct page *page,
-				unsigned long addr,
-				unsigned int pages_per_huge_page)
+#ifdef CONFIG_HIGHMEM
+static void clear_subpage(unsigned long offset, int size, void *arg)
 {
 	int i;
-	struct page *p = page;
+	struct copy_subpage_arg *args = arg;
+	struct page *p = args->dst + (offset/PAGE_SIZE);
 
 	might_sleep();
-	for (i = 0; i < pages_per_huge_page;
-	     i++, p = mem_map_next(p, page, i)) {
+	for (i = 0; i < size/PAGE_SIZE;
+	     i++, p = mem_map_next(p, args->dst, i)) {
 		cond_resched();
-		clear_user_highpage(p, addr + i * PAGE_SIZE);
+		clear_user_highpage(p, 0);
 	}
 }
 
-static void clear_subpage(unsigned long addr, int idx, void *arg)
+static void copy_subpage(unsigned long offset, int size, void *arg)
 {
-	struct page *page = arg;
-
-	clear_user_highpage(page + idx, addr);
-}
+	int i;
+	struct copy_subpage_arg *copy_args = arg;
+	struct page *dst = copy_args->dst + (offset/PAGE_SIZE);
+	struct page *src = copy_args->src + (offset/PAGE_SIZE);
 
-void clear_huge_page(struct page *page,
-		     unsigned long addr_hint, unsigned int pages_per_huge_page)
-{
-	unsigned long addr = addr_hint &
-		~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
+	for (i = 0; i < size/PAGE_SIZE; ) {
+		cond_resched();
+		copy_user_highpage(dst, src, 0, copy_args->vma);
 
-	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
-		clear_gigantic_page(page, addr, pages_per_huge_page);
-		return;
+		i++;
+		dst = mem_map_next(dst, copy_args->dst, i);
+		src = mem_map_next(src, copy_args->src, i);
 	}
+}
+#else
+static void clear_subpage(unsigned long offset, int size, void *arg)
+{
+	struct copy_subpage_arg *args = arg;
+	unsigned long addr = (unsigned long) page_address(args->dst);
 
-	process_huge_page(addr_hint, pages_per_huge_page, clear_subpage, page);
+	memset((void *) addr + offset, 0x0, size);
 }
 
-static void copy_user_gigantic_page(struct page *dst, struct page *src,
-				    unsigned long addr,
-				    struct vm_area_struct *vma,
-				    unsigned int pages_per_huge_page)
+static void copy_subpage(unsigned long offset, int size, void *arg)
 {
-	int i;
-	struct page *dst_base = dst;
-	struct page *src_base = src;
+	struct copy_subpage_arg *copy_args = arg;
+	unsigned long d_addr = (unsigned long) page_address(copy_args->dst);
+	unsigned long s_addr = (unsigned long) page_address(copy_args->src);
 
-	for (i = 0; i < pages_per_huge_page; ) {
-		cond_resched();
-		copy_user_highpage(dst, src, addr + i*PAGE_SIZE, vma);
-
-		i++;
-		dst = mem_map_next(dst, dst_base, i);
-		src = mem_map_next(src, src_base, i);
-	}
+	memcpy((void *) d_addr + offset, (void *) s_addr + offset, size);
 }
+#endif
 
-struct copy_subpage_arg {
-	struct page *dst;
-	struct page *src;
-	struct vm_area_struct *vma;
-};
-
-static void copy_subpage(unsigned long addr, int idx, void *arg)
+void clear_huge_page(struct page *page,
+		     unsigned long addr_hint, unsigned int pages_per_huge_page)
 {
-	struct copy_subpage_arg *copy_arg = arg;
+	struct copy_subpage_arg arg = {
+		.dst = page,
+		.src = NULL,
+		.vma = NULL,
+	};
 
-	copy_user_highpage(copy_arg->dst + idx, copy_arg->src + idx,
-			   addr, copy_arg->vma);
+	process_huge_page(addr_hint, pages_per_huge_page, clear_subpage, &arg);
 }
 
 void copy_user_huge_page(struct page *dst, struct page *src,
 			 unsigned long addr_hint, struct vm_area_struct *vma,
 			 unsigned int pages_per_huge_page)
 {
-	unsigned long addr = addr_hint &
-		~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
 	struct copy_subpage_arg arg = {
 		.dst = dst,
 		.src = src,
 		.vma = vma,
 	};
 
-	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
-		copy_user_gigantic_page(dst, src, addr, vma,
-					pages_per_huge_page);
-		return;
-	}
-
 	process_huge_page(addr_hint, pages_per_huge_page, copy_subpage, &arg);
 }
 
-- 
2.17.1


-- 
Prathu Baronia
OnePlus RnD

