* [PATCH -mm -v2] mm: Clear to access sub-page last when clearing huge page
From: Huang, Ying @ 2017-08-15  1:46 UTC
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Andrea Arcangeli,
	Kirill A. Shutemov, Nadia Yvette Chambers, Michal Hocko,
	Matthew Wilcox, Hugh Dickins, Minchan Kim, Shaohua Li,
	Christopher Lameter, Mike Kravetz

From: Huang Ying <ying.huang@intel.com>

Huge pages help to reduce the TLB miss rate, but they have a larger
cache footprint, which can sometimes cause problems.  For example, when
clearing a huge page on x86_64, the cache footprint is 2M.  But a Xeon
E5 v3 2699 CPU has 18 cores, 36 threads, and only 45M of LLC (last
level cache).  That is, on average, there is 2.5M of LLC for each core
and 1.25M for each thread.  If the cache pressure is heavy when
clearing the huge page, and we clear it from beginning to end, the
beginning of the huge page may already have been evicted from the cache
by the time we finish clearing the end.  And the application may access
the beginning of the huge page right after the huge page is cleared.

To help with this situation, this patch changes the order in which
sub-pages are cleared when we clear a huge page.  In quite a few
situations, we can get the address that the application will access
after we clear the huge page, for example, in a page fault handler.
Instead of clearing the huge page from beginning to end, we clear the
sub-pages farthest from the sub-page to be accessed first, and clear
the sub-page to be accessed last.  This makes the sub-page to be
accessed the most cache-hot, and the sub-pages around it more cache-hot
too.  If we cannot know the address the application will access, the
beginning of the huge page is assumed to be that address.

With this patch, the throughput increases ~28.3% in the vm-scalability
anon-w-seq test case with 72 processes on a 2-socket Xeon E5 v3 2699
system (36 cores, 72 threads).  The test case creates 72 processes;
each process mmaps a big anonymous memory area and writes to it from
beginning to end.  For each process, the other processes can be seen
as another workload that generates heavy cache pressure.  At the same
time, the cache miss rate is reduced from ~33.4% to ~31.7%, the
IPC (instructions per cycle) increases from 0.56 to 0.74, and the time
spent in user space is reduced by ~7.9%.

Christopher Lameter suggested clearing the bytes inside a sub-page
from end to beginning too.  But tests show no visible performance
difference.  Maybe this is because the size of a page is small compared
with the cache size.

Thanks to Andi Kleen for proposing to use the address to be accessed
to determine the order in which sub-pages are cleared.

The hugetlbfs access address could be improved; that will be done in
another patch.

[Use address to access information]
Suggested-by: Andi Kleen <andi.kleen@intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Nadia Yvette Chambers <nyc@holomorphy.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shaohua Li <shli@fb.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/mm.h |  2 +-
 mm/huge_memory.c   |  4 ++--
 mm/memory.c        | 39 +++++++++++++++++++++++++++++++++++----
 3 files changed, 38 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9fee3213a75e..b77bcbddde20 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2508,7 +2508,7 @@ enum mf_action_page_type {
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
 extern void clear_huge_page(struct page *page,
-			    unsigned long addr,
+			    unsigned long addr_hint,
 			    unsigned int pages_per_huge_page);
 extern void copy_user_huge_page(struct page *dst, struct page *src,
 				unsigned long addr, struct vm_area_struct *vma,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fd3ad6c88c8a..4b19a233392e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -566,7 +566,7 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
 		return VM_FAULT_OOM;
 	}
 
-	clear_huge_page(page, haddr, HPAGE_PMD_NR);
+	clear_huge_page(page, vmf->address, HPAGE_PMD_NR);
 	/*
 	 * The memory barrier inside __SetPageUptodate makes sure that
 	 * clear_huge_page writes become visible before the set_pmd_at()
@@ -1310,7 +1310,7 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 	count_vm_event(THP_FAULT_ALLOC);
 
 	if (!page)
-		clear_huge_page(new_page, haddr, HPAGE_PMD_NR);
+		clear_huge_page(new_page, vmf->address, HPAGE_PMD_NR);
 	else
 		copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
 	__SetPageUptodate(new_page);
diff --git a/mm/memory.c b/mm/memory.c
index edabf6f03447..c939cfc38bcf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4364,19 +4364,50 @@ static void clear_gigantic_page(struct page *page,
 	}
 }
 void clear_huge_page(struct page *page,
-		     unsigned long addr, unsigned int pages_per_huge_page)
+		     unsigned long addr_hint, unsigned int pages_per_huge_page)
 {
-	int i;
+	int i, n, base, l;
+	unsigned long addr = addr_hint &
+		~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
 
 	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
 		clear_gigantic_page(page, addr, pages_per_huge_page);
 		return;
 	}
 
+	/* Clear sub-page to access last to keep its cache lines hot */
 	might_sleep();
-	for (i = 0; i < pages_per_huge_page; i++) {
+	n = (addr_hint - addr) / PAGE_SIZE;
+	if (2 * n <= pages_per_huge_page) {
+		/* If sub-page to access in first half of huge page */
+		base = 0;
+		l = n;
+		/* Clear sub-pages at the end of huge page */
+		for (i = pages_per_huge_page - 1; i >= 2 * n; i--) {
+			cond_resched();
+			clear_user_highpage(page + i, addr + i * PAGE_SIZE);
+		}
+	} else {
+		/* If sub-page to access in second half of huge page */
+		base = pages_per_huge_page - 2 * (pages_per_huge_page - n);
+		l = pages_per_huge_page - n;
+		/* Clear sub-pages at the begin of huge page */
+		for (i = 0; i < base; i++) {
+			cond_resched();
+			clear_user_highpage(page + i, addr + i * PAGE_SIZE);
+		}
+	}
+	/*
+	 * Clear remaining sub-pages in left-right-left-right pattern
+	 * towards the sub-page to access
+	 */
+	for (i = 0; i < l; i++) {
+		cond_resched();
+		clear_user_highpage(page + base + i,
+				    addr + (base + i) * PAGE_SIZE);
 		cond_resched();
-		clear_user_highpage(page + i, addr + i * PAGE_SIZE);
+		clear_user_highpage(page + base + 2 * l - 1 - i,
+				    addr + (base + 2 * l - 1 - i) * PAGE_SIZE);
 	}
 }
 
-- 
2.13.2
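
The patch derives the target sub-page index n from addr_hint by first
rounding the hint down to the huge page boundary.  The following is a
minimal user-space sketch of that arithmetic, not kernel code.  It assumes
4 KiB base pages (PAGE_SHIFT of 12) and a power-of-two sub-page count; the
helper names huge_page_base() and target_subpage() are hypothetical and do
not appear in the patch.

```c
#include <assert.h>

#define PAGE_SHIFT 12			/* assumption: 4 KiB base pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/*
 * Round addr_hint down to the start of the huge page that contains it.
 * The mask trick only works when the huge page size is a power of two.
 */
static unsigned long huge_page_base(unsigned long addr_hint,
				    unsigned int pages_per_huge_page)
{
	assert(pages_per_huge_page != 0 &&
	       (pages_per_huge_page & (pages_per_huge_page - 1)) == 0);

	return addr_hint &
		~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
}

/* Index of the sub-page that contains addr_hint (n in the patch). */
static unsigned int target_subpage(unsigned long addr_hint,
				   unsigned int pages_per_huge_page)
{
	unsigned long addr = huge_page_base(addr_hint, pages_per_huge_page);

	return (addr_hint - addr) / PAGE_SIZE;
}
```

For a hypothetical fault address of 0x40200000 + 5 * PAGE_SIZE + 0x123
inside a 2M huge page (512 sub-pages), huge_page_base() returns 0x40200000
and target_subpage() returns 5.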

* Re: [PATCH -mm -v2] mm: Clear to access sub-page last when clearing huge page
From: Michal Hocko @ 2017-08-21 11:52 UTC
  To: Huang, Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, Andrea Arcangeli,
	Kirill A. Shutemov, Nadia Yvette Chambers, Matthew Wilcox,
	Hugh Dickins, Minchan Kim, Shaohua Li, Christopher Lameter,
	Mike Kravetz

On Tue 15-08-17 09:46:18, Huang, Ying wrote:
> From: Huang Ying <ying.huang@intel.com>
> 
> Huge page helps to reduce TLB miss rate, but it has higher cache
> footprint, sometimes this may cause some issue.  For example, when
> clearing huge page on x86_64 platform, the cache footprint is 2M.  But
> on a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only 45M
> LLC (last level cache).  That is, in average, there are 2.5M LLC for
> each core and 1.25M LLC for each thread.  If the cache pressure is
> heavy when clearing the huge page, and we clear the huge page from the
> begin to the end, it is possible that the begin of huge page is
> evicted from the cache after we finishing clearing the end of the huge
> page.  And it is possible for the application to access the begin of
> the huge page after clearing the huge page.
> 
> To help the above situation, in this patch, when we clear a huge page,
> the order to clear sub-pages is changed.  In quite some situation, we
> can get the address that the application will access after we clear
> the huge page, for example, in a page fault handler.  Instead of
> clearing the huge page from begin to end, we will clear the sub-pages
> farthest from the the sub-page to access firstly, and clear the
> sub-page to access last.  This will make the sub-page to access most
> cache-hot and sub-pages around it more cache-hot too.  If we cannot
> know the address the application will access, the begin of the huge
> page is assumed to be the the address the application will access.
> 
> With this patch, the throughput increases ~28.3% in vm-scalability
> anon-w-seq test case with 72 processes on a 2 socket Xeon E5 v3 2699
> system (36 cores, 72 threads).  The test case creates 72 processes,
> each process mmap a big anonymous memory area and writes to it from
> the begin to the end.  For each process, other processes could be seen
> as other workload which generates heavy cache pressure.  At the same
> time, the cache miss rate reduced from ~33.4% to ~31.7%, the
> IPC (instruction per cycle) increased from 0.56 to 0.74, and the time
> spent in user space is reduced ~7.9%

The patch looks good to me, albeit a little bit tricky to read.

But I am still wondering. Have you considered non-temporal stores for
clearing?

> Christopher Lameter suggests to clear bytes inside a sub-page from end
> to begin too.  But tests show no visible performance difference in the
> tests.  May because the size of page is small compared with the cache
> size.
> 
> Thanks Andi Kleen to propose to use address to access to determine the
> order of sub-pages to clear.
> 
> The hugetlbfs access address could be improved, will do that in
> another patch.
> 
> [Use address to access information]
> Suggested-by: Andi Kleen <andi.kleen@intel.com>
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Acked-by: Jan Kara <jack@suse.cz>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Cc: Nadia Yvette Chambers <nyc@holomorphy.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Shaohua Li <shli@fb.com>
> Cc: Christopher Lameter <cl@linux.com>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>

Reviewed-by: Michal Hocko <mhocko@suse.com>

> +	for (i = 0; i < l; i++) {

I would find it a bit easier to read if this was
		int left_idx = base + i;
		int right_idx = base + 2*l - 1 - i;

> +		cond_resched();
> +		clear_user_highpage(page + base + i,
> +				    addr + (base + i) * PAGE_SIZE);
		clear_user_highpage(page + left_idx, addr + left_idx * PAGE_SIZE);

>  		cond_resched();
> -		clear_user_highpage(page + i, addr + i * PAGE_SIZE);
> +		clear_user_highpage(page + base + 2 * l - 1 - i,
> +				    addr + (base + 2 * l - 1 - i) * PAGE_SIZE);
		clear_user_highpage(page + right_idx, addr + right_idx * PAGE_SIZE);
>  	}
>  }
-- 
Michal Hocko
SUSE Labs
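
The clearing order discussed above can be checked outside the kernel.  The
following is a rough user-space sketch that records the order in which the
patched clear_huge_page() would visit sub-pages, using the left_idx and
right_idx names suggested in this reply.  It is an illustration only: the
function name clear_order() and the order[] array are hypothetical, and
clear_user_highpage() and cond_resched() are replaced by a plain array
store.

```c
#include <assert.h>

/*
 * Record the sub-page indices in the order the patched loop would clear
 * them.  n is the index of the sub-page to be accessed, nr_pages the
 * number of sub-pages.  order[] must have room for nr_pages entries.
 * Returns the number of entries written.
 */
static int clear_order(int n, int nr_pages, int *order)
{
	int i, base, l, count = 0;

	assert(n >= 0 && n < nr_pages);

	if (2 * n <= nr_pages) {
		/* Target in first half: clear the tail first, back to front */
		base = 0;
		l = n;
		for (i = nr_pages - 1; i >= 2 * n; i--)
			order[count++] = i;
	} else {
		/* Target in second half: clear the head first, front to back */
		base = nr_pages - 2 * (nr_pages - n);
		l = nr_pages - n;
		for (i = 0; i < base; i++)
			order[count++] = i;
	}

	/* Converge on the target from both sides, left-right-left-right */
	for (i = 0; i < l; i++) {
		int left_idx = base + i;
		int right_idx = base + 2 * l - 1 - i;

		order[count++] = left_idx;
		order[count++] = right_idx;
	}
	return count;
}
```

With 8 sub-pages and n = 2, the recorded order is 7 6 5 4 0 3 1 2: every
sub-page is visited once and the target sub-page comes last.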

* Re: [PATCH -mm -v2] mm: Clear to access sub-page last when clearing huge page
From: Huang, Ying @ 2017-08-22  0:54 UTC
  To: Michal Hocko
  Cc: Huang, Ying, Andrew Morton, linux-mm, linux-kernel,
	Andrea Arcangeli, Kirill A. Shutemov, Nadia Yvette Chambers,
	Matthew Wilcox, Hugh Dickins, Minchan Kim, Shaohua Li,
	Christopher Lameter, Mike Kravetz

Michal Hocko <mhocko@kernel.org> writes:

> On Tue 15-08-17 09:46:18, Huang, Ying wrote:
>> From: Huang Ying <ying.huang@intel.com>
>> 
>> Huge page helps to reduce TLB miss rate, but it has higher cache
>> footprint, sometimes this may cause some issue.  For example, when
>> clearing huge page on x86_64 platform, the cache footprint is 2M.  But
>> on a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only 45M
>> LLC (last level cache).  That is, in average, there are 2.5M LLC for
>> each core and 1.25M LLC for each thread.  If the cache pressure is
>> heavy when clearing the huge page, and we clear the huge page from the
>> begin to the end, it is possible that the begin of huge page is
>> evicted from the cache after we finishing clearing the end of the huge
>> page.  And it is possible for the application to access the begin of
>> the huge page after clearing the huge page.
>> 
>> To help the above situation, in this patch, when we clear a huge page,
>> the order to clear sub-pages is changed.  In quite some situation, we
>> can get the address that the application will access after we clear
>> the huge page, for example, in a page fault handler.  Instead of
>> clearing the huge page from begin to end, we will clear the sub-pages
>> farthest from the the sub-page to access firstly, and clear the
>> sub-page to access last.  This will make the sub-page to access most
>> cache-hot and sub-pages around it more cache-hot too.  If we cannot
>> know the address the application will access, the begin of the huge
>> page is assumed to be the the address the application will access.
>> 
>> With this patch, the throughput increases ~28.3% in vm-scalability
>> anon-w-seq test case with 72 processes on a 2 socket Xeon E5 v3 2699
>> system (36 cores, 72 threads).  The test case creates 72 processes,
>> each process mmap a big anonymous memory area and writes to it from
>> the begin to the end.  For each process, other processes could be seen
>> as other workload which generates heavy cache pressure.  At the same
>> time, the cache miss rate reduced from ~33.4% to ~31.7%, the
>> IPC (instruction per cycle) increased from 0.56 to 0.74, and the time
>> spent in user space is reduced ~7.9%
>
> The patch looks good to me alebit little bit tricky to read.
>
> But I am still wondering. Have you considered non-temporal stores for
> clearing?

Yes, non-temporal stores would put no cache pressure on other
processes.  But the cache would be cold for the current process too.
That is, accessing the memory after non-temporal stores needs a
synchronous load from RAM.  And if the cache pressure from other cores
isn't heavy, we can take better advantage of the shared last level
cache if we use normal memory clearing.

>> Christopher Lameter suggests to clear bytes inside a sub-page from end
>> to begin too.  But tests show no visible performance difference in the
>> tests.  May because the size of page is small compared with the cache
>> size.
>> 
>> Thanks Andi Kleen to propose to use address to access to determine the
>> order of sub-pages to clear.
>> 
>> The hugetlbfs access address could be improved, will do that in
>> another patch.
>> 
>> [Use address to access information]
>> Suggested-by: Andi Kleen <andi.kleen@intel.com>
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> Acked-by: Jan Kara <jack@suse.cz>
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>> Cc: Nadia Yvette Chambers <nyc@holomorphy.com>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Matthew Wilcox <mawilcox@microsoft.com>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Minchan Kim <minchan@kernel.org>
>> Cc: Shaohua Li <shli@fb.com>
>> Cc: Christopher Lameter <cl@linux.com>
>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>
> Reviewed-by: Michal Hocko <mhocko@suse.com>

Thanks!

>> +	for (i = 0; i < l; i++) {
>
> I would find it a bit easier to read if this was
> 		int left_idx = base + i;
> 		int right_idx = base + 2*l - 1 - i
>
>> +		cond_resched();
>> +		clear_user_highpage(page + base + i,
>> +				    addr + (base + i) * PAGE_SIZE);
> 		clear_user_highpage(page + left_idx, addr + left_idx * PAGE_SIZE);
>
>>  		cond_resched();
>> -		clear_user_highpage(page + i, addr + i * PAGE_SIZE);
>> +		clear_user_highpage(page + base + 2 * l - 1 - i,
>> +				    addr + (base + 2 * l - 1 - i) * PAGE_SIZE);
> 		clear_user_highpage(page + right_idx, addr + right_idx * PAGE_SIZE);
>>  	}
>>  }

Yes.  This looks better.

Best Regards,
Huang, Ying
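
For readers unfamiliar with the non-temporal stores discussed in this
reply, the following is a minimal user-space sketch of what such clearing
can look like.  It is not the kernel's page-clearing code and was not part
of the patch.  It assumes an x86 compiler with SSE2 intrinsics and falls
back to memset() elsewhere; the function name clear_nontemporal() is
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>	/* _mm_stream_si128, _mm_setzero_si128, _mm_sfence */
#endif

/*
 * Zero len bytes at dst.  dst must be 16-byte aligned and len a multiple
 * of 16.  With SSE2, the stores carry a non-temporal hint, so the zeroed
 * lines are not expected to stay in the cache.  Without SSE2, this falls
 * back to memset(), which goes through the cache as usual.
 */
static void clear_nontemporal(void *dst, size_t len)
{
	assert(((uintptr_t)dst & 15) == 0 && len % 16 == 0);

#if defined(__SSE2__)
	__m128i zero = _mm_setzero_si128();
	__m128i *p = dst;
	size_t i;

	for (i = 0; i < len / sizeof(*p); i++)
		_mm_stream_si128(p + i, zero);
	_mm_sfence();	/* order the streaming stores before later stores */
#else
	memset(dst, 0, len);
#endif
}
```

The trade-off is the one described above: the streaming variant avoids
evicting other data, but the first access to the cleared memory afterwards
is likely to miss the cache.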

