[PATCH] mm/khugepaged: sched to numa node when collapse huge page

* [PATCH] mm/khugepaged: sched to numa node when collapse huge page
@ 2022-03-11  9:01 Bibo Mao
  2022-03-11  9:20 ` David Hildenbrand
  2022-03-11 19:08 ` Yang Shi
  0 siblings, 2 replies; 7+ messages in thread
From: Bibo Mao @ 2022-03-11  9:01 UTC
  To: Andrew Morton; +Cc: linux-mm, linux-kernel

Collapsing a huge page is slow, especially when the khugepaged daemon
runs on a different NUMA node from that of the huge page. It suffers
from copying the huge page across nodes, and the cache of the target
node is not used. With this patch, the khugepaged daemon switches to
the same NUMA node as the huge page. This saves copying time and makes
better use of the local cache.

Signed-off-by: Bibo Mao <maobibo@loongson.cn>
---
 mm/khugepaged.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 131492fd1148..460c285dc974 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -116,6 +116,7 @@ struct khugepaged_scan {
 	struct list_head mm_head;
 	struct mm_slot *mm_slot;
 	unsigned long address;
+	int node;
 };
 
 static struct khugepaged_scan khugepaged_scan = {
@@ -1066,6 +1067,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	struct vm_area_struct *vma;
 	struct mmu_notifier_range range;
 	gfp_t gfp;
+	const struct cpumask *cpumask;
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
@@ -1079,6 +1081,13 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 * that. We will recheck the vma after taking it again in write mode.
 	 */
 	mmap_read_unlock(mm);
+
+	/* sched to specified node before huage page memory copy */
+	cpumask = cpumask_of_node(node);
+	if ((khugepaged_scan.node != node) && !cpumask_empty(cpumask)) {
+		set_cpus_allowed_ptr(current, cpumask);
+		khugepaged_scan.node = node;
+	}
 	new_page = khugepaged_alloc_page(hpage, gfp, node);
 	if (!new_page) {
 		result = SCAN_ALLOC_HUGE_PAGE_FAIL;
@@ -2380,6 +2389,7 @@ int start_stop_khugepaged(void)
 		kthread_stop(khugepaged_thread);
 		khugepaged_thread = NULL;
 	}
+	khugepaged_scan.node = NUMA_NO_NODE;
 	set_recommended_min_free_kbytes();
 fail:
 	mutex_unlock(&khugepaged_mutex);
-- 
2.31.1


* Re: [PATCH] mm/khugepaged: sched to numa node when collapse huge page
  2022-03-11  9:01 [PATCH] mm/khugepaged: sched to numa node when collapse huge page Bibo Mao
@ 2022-03-11  9:20 ` David Hildenbrand
  2022-03-11  9:51   ` maobibo
  2022-03-11 19:08 ` Yang Shi
  1 sibling, 1 reply; 7+ messages in thread
From: David Hildenbrand @ 2022-03-11  9:20 UTC
  To: Bibo Mao, Andrew Morton; +Cc: linux-mm, linux-kernel

On 11.03.22 10:01, Bibo Mao wrote:
> collapse huge page is slow, specially when khugepaged daemon runs
> on different numa node with that of huge page. It suffers from
> huge page copying across nodes, also cache is not used for target
> node. With this patch, khugepaged daemon switches to the same numa
> node with huge page. It saves copying time and makes use of local
> cache better.

Hi,

just the usual question, do you have any performance numbers to back
your claims (e.g., "is slow, specially when") and proof that this patch
does the trick?


> 
> Signed-off-by: Bibo Mao <maobibo@loongson.cn>
> ---
>  mm/khugepaged.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 131492fd1148..460c285dc974 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -116,6 +116,7 @@ struct khugepaged_scan {
>  	struct list_head mm_head;
>  	struct mm_slot *mm_slot;
>  	unsigned long address;
> +	int node;
>  };
>  
>  static struct khugepaged_scan khugepaged_scan = {
> @@ -1066,6 +1067,7 @@ static void collapse_huge_page(struct mm_struct *mm,
>  	struct vm_area_struct *vma;
>  	struct mmu_notifier_range range;
>  	gfp_t gfp;
> +	const struct cpumask *cpumask;

We tend to stick to the reverse Christmas tree format as closely as possible.

>  
>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>  
> @@ -1079,6 +1081,13 @@ static void collapse_huge_page(struct mm_struct *mm,
>  	 * that. We will recheck the vma after taking it again in write mode.
>  	 */
>  	mmap_read_unlock(mm);
> +
> +	/* sched to specified node before huage page memory copy */

s/huage/huge/

> +	cpumask = cpumask_of_node(node);
> +	if ((khugepaged_scan.node != node) && !cpumask_empty(cpumask)) {
> +		set_cpus_allowed_ptr(current, cpumask);
> +		khugepaged_scan.node = node;
> +	}
>  	new_page = khugepaged_alloc_page(hpage, gfp, node);
>  	if (!new_page) {
>  		result = SCAN_ALLOC_HUGE_PAGE_FAIL;
> @@ -2380,6 +2389,7 @@ int start_stop_khugepaged(void)
>  		kthread_stop(khugepaged_thread);
>  		khugepaged_thread = NULL;
>  	}
> +	khugepaged_scan.node = NUMA_NO_NODE;
>  	set_recommended_min_free_kbytes();
>  fail:
>  	mutex_unlock(&khugepaged_mutex);


-- 
Thanks,

David / dhildenb


* Re: [PATCH] mm/khugepaged: sched to numa node when collapse huge page
  2022-03-11  9:20 ` David Hildenbrand
@ 2022-03-11  9:51   ` maobibo
  2022-03-11  9:55     ` David Hildenbrand
  0 siblings, 1 reply; 7+ messages in thread
From: maobibo @ 2022-03-11  9:51 UTC
  To: David Hildenbrand, Andrew Morton; +Cc: linux-mm, linux-kernel



On 03/11/2022 05:20 PM, David Hildenbrand wrote:
> On 11.03.22 10:01, Bibo Mao wrote:
>> collapse huge page is slow, specially when khugepaged daemon runs
>> on different numa node with that of huge page. It suffers from
>> huge page copying across nodes, also cache is not used for target
>> node. With this patch, khugepaged daemon switches to the same numa
>> node with huge page. It saves copying time and makes use of local
>> cache better.
> 
> Hi,
> 
> just the usual question, do you have any performance numbers to back
> your claims (e.g., "is slow, specially when") and proof that this patch
> does the trick?
With SPECint 2006 on a LoongArch 3C5000L 32-core NUMA system, it improves
performance by about 6%. The base page size is 16K and the PMD page size is
32M, and memory performance differs noticeably across NUMA nodes. However, I
have not tested it on an x86 box.
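
For reference, the 32M figure is consistent with the 16K base page size if one assumes 8-byte page table entries (an assumption that is common on 64-bit kernels but is not stated in this thread): a 16K page-table page then holds 2048 entries, and 2048 * 16K = 32M. A minimal sketch of that arithmetic follows; the function names are illustrative only and are not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: each page table entry occupies 8 bytes (64-bit PTEs). */
#define PTE_SIZE_BYTES 8ULL

/* Number of entries that fit in one page-table page of the given size. */
uint64_t ptes_per_table(uint64_t page_size)
{
	assert(page_size >= PTE_SIZE_BYTES);
	return page_size / PTE_SIZE_BYTES;
}

/*
 * Memory covered by one PMD entry: one full last-level table,
 * each of whose entries maps one base page.
 */
uint64_t pmd_huge_page_size(uint64_t page_size)
{
	return ptes_per_table(page_size) * page_size;
}
```

Under the same assumption, a 4K base page gives 512 entries and a 2M huge page, so a collapse on the 16K configuration may copy up to 16 times as much data as on a 4K one.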


> 
> 
>>
>> Signed-off-by: Bibo Mao <maobibo@loongson.cn>
>> ---
>>  mm/khugepaged.c | 10 ++++++++++
>>  1 file changed, 10 insertions(+)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 131492fd1148..460c285dc974 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -116,6 +116,7 @@ struct khugepaged_scan {
>>  	struct list_head mm_head;
>>  	struct mm_slot *mm_slot;
>>  	unsigned long address;
>> +	int node;
>>  };
>>  
>>  static struct khugepaged_scan khugepaged_scan = {
>> @@ -1066,6 +1067,7 @@ static void collapse_huge_page(struct mm_struct *mm,
>>  	struct vm_area_struct *vma;
>>  	struct mmu_notifier_range range;
>>  	gfp_t gfp;
>> +	const struct cpumask *cpumask;
> 
> We tend to stick to reverse Christmas tree format as good as possible.
> 
>>  
>>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>>  
>> @@ -1079,6 +1081,13 @@ static void collapse_huge_page(struct mm_struct *mm,
>>  	 * that. We will recheck the vma after taking it again in write mode.
>>  	 */
>>  	mmap_read_unlock(mm);
>> +
>> +	/* sched to specified node before huage page memory copy */
> 
> s/huage/huge/
> 
>> +	cpumask = cpumask_of_node(node);
>> +	if ((khugepaged_scan.node != node) && !cpumask_empty(cpumask)) {
>> +		set_cpus_allowed_ptr(current, cpumask);
>> +		khugepaged_scan.node = node;
>> +	}
>>  	new_page = khugepaged_alloc_page(hpage, gfp, node);
>>  	if (!new_page) {
>>  		result = SCAN_ALLOC_HUGE_PAGE_FAIL;
>> @@ -2380,6 +2389,7 @@ int start_stop_khugepaged(void)
>>  		kthread_stop(khugepaged_thread);
>>  		khugepaged_thread = NULL;
>>  	}
>> +	khugepaged_scan.node = NUMA_NO_NODE;
>>  	set_recommended_min_free_kbytes();
>>  fail:
>>  	mutex_unlock(&khugepaged_mutex);
> 
> 


* Re: [PATCH] mm/khugepaged: sched to numa node when collapse huge page
  2022-03-11  9:51   ` maobibo
@ 2022-03-11  9:55     ` David Hildenbrand
  2022-03-11 10:12       ` maobibo
  0 siblings, 1 reply; 7+ messages in thread
From: David Hildenbrand @ 2022-03-11  9:55 UTC
  To: maobibo, Andrew Morton; +Cc: linux-mm, linux-kernel

On 11.03.22 10:51, maobibo wrote:
> 
> 
> On 03/11/2022 05:20 PM, David Hildenbrand wrote:
>> On 11.03.22 10:01, Bibo Mao wrote:
>>> collapse huge page is slow, specially when khugepaged daemon runs
>>> on different numa node with that of huge page. It suffers from
>>> huge page copying across nodes, also cache is not used for target
>>> node. With this patch, khugepaged daemon switches to the same numa
>>> node with huge page. It saves copying time and makes use of local
>>> cache better.
>>
>> Hi,
>>
>> just the usual question, do you have any performance numbers to back
>> your claims (e.g., "is slow, specially when") and proof that this patch
>> does the trick?
> With specint 2006 on loongarch 3C5000L 32core numa system, it improves
> about 6%. The page size is 16K and pmd page size is 32M, memory performance
> across numa node is obvious different. However I do not test it on x86 box.
> 

Thanks, can you add these details to the patch description?


-- 
Thanks,

David / dhildenb


* Re: [PATCH] mm/khugepaged: sched to numa node when collapse huge page
  2022-03-11  9:55     ` David Hildenbrand
@ 2022-03-11 10:12       ` maobibo
  0 siblings, 0 replies; 7+ messages in thread
From: maobibo @ 2022-03-11 10:12 UTC
  To: David Hildenbrand, Andrew Morton; +Cc: linux-mm, linux-kernel



On 03/11/2022 05:55 PM, David Hildenbrand wrote:
> On 11.03.22 10:51, maobibo wrote:
>>
>>
>> On 03/11/2022 05:20 PM, David Hildenbrand wrote:
>>> On 11.03.22 10:01, Bibo Mao wrote:
>>>> collapse huge page is slow, specially when khugepaged daemon runs
>>>> on different numa node with that of huge page. It suffers from
>>>> huge page copying across nodes, also cache is not used for target
>>>> node. With this patch, khugepaged daemon switches to the same numa
>>>> node with huge page. It saves copying time and makes use of local
>>>> cache better.
>>>
>>> Hi,
>>>
>>> just the usual question, do you have any performance numbers to back
>>> your claims (e.g., "is slow, specially when") and proof that this patch
>>> does the trick?
>> With specint 2006 on loongarch 3C5000L 32core numa system, it improves
>> about 6%. The page size is 16K and pmd page size is 32M, memory performance
>> across numa node is obvious different. However I do not test it on x86 box.
>>
> 
> Thanks, can you add these details to the patch description?
Sure, I will do that in the next patch.

regards
bibo, mao
> 
> 


* Re: [PATCH] mm/khugepaged: sched to numa node when collapse huge page
  2022-03-11  9:01 [PATCH] mm/khugepaged: sched to numa node when collapse huge page Bibo Mao
  2022-03-11  9:20 ` David Hildenbrand
@ 2022-03-11 19:08 ` Yang Shi
  2022-03-14  0:46   ` maobibo
  1 sibling, 1 reply; 7+ messages in thread
From: Yang Shi @ 2022-03-11 19:08 UTC
  To: Bibo Mao; +Cc: Andrew Morton, linux-mm, linux-kernel

On Fri, Mar 11, 2022 at 1:01 AM Bibo Mao <maobibo@loongson.cn> wrote:
>
> collapse huge page is slow, specially when khugepaged daemon runs
> on different numa node with that of huge page. It suffers from
> huge page copying across nodes, also cache is not used for target
> node. With this patch, khugepaged daemon switches to the same numa
> node with huge page. It saves copying time and makes use of local
> cache better.
>
> Signed-off-by: Bibo Mao <maobibo@loongson.cn>
> ---
>  mm/khugepaged.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 131492fd1148..460c285dc974 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -116,6 +116,7 @@ struct khugepaged_scan {
>         struct list_head mm_head;
>         struct mm_slot *mm_slot;
>         unsigned long address;
> +       int node;
>  };
>
>  static struct khugepaged_scan khugepaged_scan = {
> @@ -1066,6 +1067,7 @@ static void collapse_huge_page(struct mm_struct *mm,
>         struct vm_area_struct *vma;
>         struct mmu_notifier_range range;
>         gfp_t gfp;
> +       const struct cpumask *cpumask;
>
>         VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>
> @@ -1079,6 +1081,13 @@ static void collapse_huge_page(struct mm_struct *mm,
>          * that. We will recheck the vma after taking it again in write mode.
>          */
>         mmap_read_unlock(mm);
> +
> +       /* sched to specified node before huage page memory copy */
> +       cpumask = cpumask_of_node(node);
> +       if ((khugepaged_scan.node != node) && !cpumask_empty(cpumask)) {
> +               set_cpus_allowed_ptr(current, cpumask);
> +               khugepaged_scan.node = node;

What if khugepaged is scheduled onto another node after this, while
khugepaged_scan.node still equals node? That seems possible to me,
IIUC.

TBH I'm not quite sure whether migrating khugepaged is really worth it
for everyone. The worst case is when the base pages have no obvious
locality, for example when they are spread across all nodes, so you
always get cross-node memory copies. And khugepaged may get slower if
the CPUs are contended.

In addition, I see that MIPS has its own copy_user_highpage(); is that a
contributing factor too?

> +       }
>         new_page = khugepaged_alloc_page(hpage, gfp, node);
>         if (!new_page) {
>                 result = SCAN_ALLOC_HUGE_PAGE_FAIL;
> @@ -2380,6 +2389,7 @@ int start_stop_khugepaged(void)
>                 kthread_stop(khugepaged_thread);
>                 khugepaged_thread = NULL;
>         }
> +       khugepaged_scan.node = NUMA_NO_NODE;
>         set_recommended_min_free_kbytes();
>  fail:
>         mutex_unlock(&khugepaged_mutex);
> --
> 2.31.1
>
>

* Re: [PATCH] mm/khugepaged: sched to numa node when collapse huge page
  2022-03-11 19:08 ` Yang Shi
@ 2022-03-14  0:46   ` maobibo
  0 siblings, 0 replies; 7+ messages in thread
From: maobibo @ 2022-03-14  0:46 UTC
  To: Yang Shi; +Cc: Andrew Morton, linux-mm, linux-kernel

Hi Yang,

Thanks for reviewing my patch. My replies are inline.

On 03/12/2022 03:08 AM, Yang Shi wrote:
> On Fri, Mar 11, 2022 at 1:01 AM Bibo Mao <maobibo@loongson.cn> wrote:
>>
>> collapse huge page is slow, specially when khugepaged daemon runs
>> on different numa node with that of huge page. It suffers from
>> huge page copying across nodes, also cache is not used for target
>> node. With this patch, khugepaged daemon switches to the same numa
>> node with huge page. It saves copying time and makes use of local
>> cache better.
>>
>> Signed-off-by: Bibo Mao <maobibo@loongson.cn>
>> ---
>>  mm/khugepaged.c | 10 ++++++++++
>>  1 file changed, 10 insertions(+)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 131492fd1148..460c285dc974 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -116,6 +116,7 @@ struct khugepaged_scan {
>>         struct list_head mm_head;
>>         struct mm_slot *mm_slot;
>>         unsigned long address;
>> +       int node;
>>  };
>>
>>  static struct khugepaged_scan khugepaged_scan = {
>> @@ -1066,6 +1067,7 @@ static void collapse_huge_page(struct mm_struct *mm,
>>         struct vm_area_struct *vma;
>>         struct mmu_notifier_range range;
>>         gfp_t gfp;
>> +       const struct cpumask *cpumask;
>>
>>         VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>>
>> @@ -1079,6 +1081,13 @@ static void collapse_huge_page(struct mm_struct *mm,
>>          * that. We will recheck the vma after taking it again in write mode.
>>          */
>>         mmap_read_unlock(mm);
>> +
>> +       /* sched to specified node before huage page memory copy */
>> +       cpumask = cpumask_of_node(node);
>> +       if ((khugepaged_scan.node != node) && !cpumask_empty(cpumask)) {
>> +               set_cpus_allowed_ptr(current, cpumask);
>> +               khugepaged_scan.node = node;
> 
> What if khugepaged was scheduled to the other nodes after this, but
> khugepaged_scan.node still equals to node? It seems possible to me
> IIUC.
As far as I understand, khugepaged will not be scheduled onto other nodes
after set_cpus_allowed_ptr() is called. Alternatively, the node does not
need to be recorded at all, and we can simply use task_node(current), like this:
 +       /* sched to specified node before huage page memory copy */
 +       cpumask = cpumask_of_node(node);
 +       if ((task_node(current) != node) && !cpumask_empty(cpumask))
 +               set_cpus_allowed_ptr(current, cpumask);
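
The difference between the two checks discussed here can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct model_task`, `model_pin()` and both `repin_*` helpers are hypothetical names invented for illustration, `model_pin()` merely stands in for `set_cpus_allowed_ptr(current, cpumask_of_node(node))`, and the single `allowed_node` field is a simplification of what `task_node(current)` would report. It also assumes that something outside this code path can change the thread's placement, which the thread does not confirm.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_NODE (-1)

/* Toy model of a task: the node its allowed CPUs currently belong to. */
struct model_task {
	int allowed_node;
};

/* Hypothetical stand-in for pinning the task to the CPUs of one node. */
void model_pin(struct model_task *t, int node)
{
	t->allowed_node = node;
}

/*
 * Variant 1, modelled on the posted patch: re-pin only when the
 * cached node differs from the target. Returns true if it re-pinned.
 */
bool repin_with_cache(struct model_task *t, int *cached_node, int node)
{
	if (*cached_node == node)
		return false;
	model_pin(t, node);
	*cached_node = node;
	return true;
}

/*
 * Variant 2, modelled on the task_node(current) suggestion: compare
 * against where the task actually is, with no cached state.
 */
bool repin_with_task_node(struct model_task *t, int node)
{
	if (t->allowed_node == node)
		return false;
	model_pin(t, node);
	return true;
}
```

In this model, if the task is moved to node 0 after the cache has recorded node 1, `repin_with_cache()` skips the re-pin for target node 1 because the cache still matches, whereas `repin_with_task_node()` notices the mismatch and pins again.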
 
> 
> TBH I'm not quite sure if migrating khugepaged is really worth it for
> everyone or not. The worst case is the locality of base pages are not
> obvious, for example, the base pages may be across all nodes, so you
> always get cross nodes memory copy. And khugepaged may get slower if
> cpu is contentious.
The target node is calculated from the source nodes of the base pages. If the
base pages are spread across all nodes, the target node is the one that holds
the most base pages. In most cases where THP is used, the memory footprint of
the workload is large, so I think this is worthwhile for other architectures
too; however, I have no such binary workload on other architectures.
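
The selection rule described above ("the node that holds the most base pages") can be sketched as a simple argmax over per-node page counts. This is a simplified illustration, not the kernel's code: `pick_target_node()` and its parameters are hypothetical names, and the kernel's actual selection logic may break ties differently.

```c
#include <assert.h>
#include <stddef.h>

/*
 * node_load[i] is the number of scanned base pages found on node i.
 * Returns the index of the node holding the most base pages; the
 * lowest index wins ties. Returns -1 if no node holds any page.
 */
int pick_target_node(const unsigned int *node_load, size_t nr_nodes)
{
	unsigned int max = 0;
	int target = -1;

	for (size_t i = 0; i < nr_nodes; i++) {
		if (node_load[i] > max) {
			max = node_load[i];
			target = (int)i;
		}
	}
	return target;
}
```

When the counts are even, for example 512 base pages on each of four nodes, whichever node is chosen holds only a quarter of the source pages, so most of the copy is still remote. That is the worst case raised in the review.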

> 
> In addition, I saw MIPS has its own copy_user_highpage(), is it a
> contributing factor too?
On mips64, copy_user_highpage() is similar to the ordinary copy function; it
only differs on mips32. It has nothing to do with this; otherwise it would be
a big issue.

regards
bibo,mao

> 
>> +       }
>>         new_page = khugepaged_alloc_page(hpage, gfp, node);
>>         if (!new_page) {
>>                 result = SCAN_ALLOC_HUGE_PAGE_FAIL;
>> @@ -2380,6 +2389,7 @@ int start_stop_khugepaged(void)
>>                 kthread_stop(khugepaged_thread);
>>                 khugepaged_thread = NULL;
>>         }
>> +       khugepaged_scan.node = NUMA_NO_NODE;
>>         set_recommended_min_free_kbytes();
>>  fail:
>>         mutex_unlock(&khugepaged_mutex);
>> --
>> 2.31.1
>>
>>

