linux-kernel.vger.kernel.org archive mirror
* [PATCH] smp: do not send IPI if call_single_queue not empty
@ 2017-04-28 13:19 Aaron Lu
  2017-05-19  7:53 ` [PATCH resend] smp: avoid sending needless IPI in smp_call_function_many() Aaron Lu
  0 siblings, 1 reply; 6+ messages in thread
From: Aaron Lu @ 2017-04-28 13:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Tim Chen,
	Dave Hansen, Huang Ying, Aaron Lu

For smp_call_function_many(), whenever a CPU queues a CSD to a target
CPU, it sends an IPI to let the target CPU handle the work.
This isn't necessary - we only need to send an IPI when queueing a CSD
to an empty call_single_queue.

The reason:
flush_smp_call_function_queue(), which is called when a CPU receives an
IPI, empties the queue and then handles all of the CSDs on it. So if
the target CPU's call_single_queue is not empty, we know that:
i.  an IPI for the target CPU has already been sent by previous queuers*;
ii. flush_smp_call_function_queue() hasn't emptied that CPU's queue yet.
Thus, it's safe for us to just queue our CSD there without sending an
additional IPI.

*In fact, only the first queuer - the one that made the queue non-empty -
needs to send the IPI.
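
To make the 'first queuer' rule concrete: llist_add() returns true only
when the list was empty before the add, and the IPI handler
(flush_smp_call_function_queue()) drains the whole list in one go, so one
IPI per empty->non-empty transition is enough. A minimal user-space sketch
of the rule using C11 atomics (illustrative only, not kernel code; the
queue_push() helper is made up for this sketch):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct node {
	struct node *next;
	int work;
};

/* Models a target CPU's call_single_queue. */
static _Atomic(struct node *) queue_head;

/*
 * Lock-free push, analogous to llist_add(): returns true if the queue
 * was empty before the push, i.e. the caller is the first queuer and
 * the only one that needs to notify (IPI) the target CPU.
 */
static bool queue_push(struct node *n)
{
	struct node *old = atomic_load(&queue_head);

	do {
		n->next = old;
	} while (!atomic_compare_exchange_weak(&queue_head, &old, n));

	return old == NULL;
}

int main(void)
{
	struct node a = { .work = 1 }, b = { .work = 2 };

	if (queue_push(&a))
		puts("queue was empty -> send IPI");	/* first queuer */
	if (queue_push(&b))
		puts("not reached");	/* queue non-empty, no IPI needed */

	return 0;
}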

The workload used to evaluate the effectiveness of this change is
vm-scalability's case-swap-w-seq-mt:
https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-anon-w-seq-mt

It spawns 88 threads that continuously consume anonymous memory. The
test machine is a 2-node Broadwell-EP with 88 threads and 32G memory.
The test size is 100G, so a lot of swap-out happens after the available
memory is used up. Because:
i.  shrink_page_list() needs to call
    try_to_unmap_flush() -> flush_tlb_others(); and
ii. this process' mm_cpumask() is almost full,
the number of IPIs sent during the workload is huge.

Base:
"interrupts.CAL:Function_call_interrupts": 819388170.0,
"vm-scalability.throughput": 5051472.0,
This patch:
"interrupts.CAL:Function_call_interrupts": 92214434.66666667, ↓88.7%
"vm-scalability.throughput": 5991764.333333333 ↑18.6%

Function-call interrupts dropped by 88.7% and throughput increased by 18.6%.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 kernel/smp.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index a817769b53c0..76d16fe3c427 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -30,6 +30,7 @@ enum {
 struct call_function_data {
 	struct call_single_data	__percpu *csd;
 	cpumask_var_t		cpumask;
+	cpumask_var_t		cpumask_ipi;
 };
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_function_data, cfd_data);
@@ -45,9 +46,15 @@ int smpcfd_prepare_cpu(unsigned int cpu)
 	if (!zalloc_cpumask_var_node(&cfd->cpumask, GFP_KERNEL,
 				     cpu_to_node(cpu)))
 		return -ENOMEM;
+	if (!zalloc_cpumask_var_node(&cfd->cpumask_ipi, GFP_KERNEL,
+				     cpu_to_node(cpu))) {
+		free_cpumask_var(cfd->cpumask);
+		return -ENOMEM;
+	}
 	cfd->csd = alloc_percpu(struct call_single_data);
 	if (!cfd->csd) {
 		free_cpumask_var(cfd->cpumask);
+		free_cpumask_var(cfd->cpumask_ipi);
 		return -ENOMEM;
 	}
 
@@ -59,6 +66,7 @@ int smpcfd_dead_cpu(unsigned int cpu)
 	struct call_function_data *cfd = &per_cpu(cfd_data, cpu);
 
 	free_cpumask_var(cfd->cpumask);
+	free_cpumask_var(cfd->cpumask_ipi);
 	free_percpu(cfd->csd);
 	return 0;
 }
@@ -434,6 +442,7 @@ void smp_call_function_many(const struct cpumask *mask,
 	if (unlikely(!cpumask_weight(cfd->cpumask)))
 		return;
 
+	cpumask_clear(cfd->cpumask_ipi);
 	for_each_cpu(cpu, cfd->cpumask) {
 		struct call_single_data *csd = per_cpu_ptr(cfd->csd, cpu);
 
@@ -442,11 +451,12 @@ void smp_call_function_many(const struct cpumask *mask,
 			csd->flags |= CSD_FLAG_SYNCHRONOUS;
 		csd->func = func;
 		csd->info = info;
-		llist_add(&csd->llist, &per_cpu(call_single_queue, cpu));
+		if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
+			cpumask_set_cpu(cpu, cfd->cpumask_ipi);
 	}
 
 	/* Send a message to all CPUs in the map */
-	arch_send_call_function_ipi_mask(cfd->cpumask);
+	arch_send_call_function_ipi_mask(cfd->cpumask_ipi);
 
 	if (wait) {
 		for_each_cpu(cpu, cfd->cpumask) {
-- 
2.9.3


* [PATCH resend] smp: avoid sending needless IPI in smp_call_function_many()
  2017-04-28 13:19 [PATCH] smp: do not send IPI if call_single_queue not empty Aaron Lu
@ 2017-05-19  7:53 ` Aaron Lu
  2017-05-19 10:59   ` Peter Zijlstra
                     ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Aaron Lu @ 2017-05-19  7:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Tim Chen,
	Dave Hansen, Huang Ying, Aaron Lu

An Inter-Processor Interrupt (IPI) is needed when a page is unmapped and
the process' mm_cpumask() shows the process has run on other CPUs; page
migration and page reclaim both need IPIs. The number of IPIs that need
to be sent to different CPUs is especially large for a multi-threaded
workload since mm_cpumask() is per process.

For smp_call_function_many(), whenever a CPU queues a CSD to a target
CPU, it sends an IPI to let the target CPU handle the work.
This isn't necessary - we only need to send an IPI when queueing a CSD
to an empty call_single_queue.

The reason:
flush_smp_call_function_queue(), which is called when a CPU receives an
IPI, empties the queue and then handles all of the CSDs on it. So if
the target CPU's call_single_queue is not empty, we know that:
i.  an IPI for the target CPU has already been sent by the 'previous queuers';
ii. flush_smp_call_function_queue() hasn't emptied that CPU's queue yet.
Thus, it's safe for us to just queue our CSD there without sending an
additional IPI. And for the 'previous queuers', we can limit it to the
first queuer - the one that made the queue non-empty.

To demonstrate the effect of this patch, a multi-threaded workload that
spawns 80 threads to equally consume 100G of memory is used. It is run
on a 2-node Broadwell-EP with 44 cores/88 threads and 32G memory, so
once the 32G of memory is used up, a lot of page reclaim starts to
happen.

With this patch, the number of IPIs dropped by 88% and throughput
increased by about 15% for the above workload.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
resend: code has no change, only patch subject and description changes
to better describe the patch.

 kernel/smp.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index a817769b53c0..76d16fe3c427 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -30,6 +30,7 @@ enum {
 struct call_function_data {
 	struct call_single_data	__percpu *csd;
 	cpumask_var_t		cpumask;
+	cpumask_var_t		cpumask_ipi;
 };
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_function_data, cfd_data);
@@ -45,9 +46,15 @@ int smpcfd_prepare_cpu(unsigned int cpu)
 	if (!zalloc_cpumask_var_node(&cfd->cpumask, GFP_KERNEL,
 				     cpu_to_node(cpu)))
 		return -ENOMEM;
+	if (!zalloc_cpumask_var_node(&cfd->cpumask_ipi, GFP_KERNEL,
+				     cpu_to_node(cpu))) {
+		free_cpumask_var(cfd->cpumask);
+		return -ENOMEM;
+	}
 	cfd->csd = alloc_percpu(struct call_single_data);
 	if (!cfd->csd) {
 		free_cpumask_var(cfd->cpumask);
+		free_cpumask_var(cfd->cpumask_ipi);
 		return -ENOMEM;
 	}
 
@@ -59,6 +66,7 @@ int smpcfd_dead_cpu(unsigned int cpu)
 	struct call_function_data *cfd = &per_cpu(cfd_data, cpu);
 
 	free_cpumask_var(cfd->cpumask);
+	free_cpumask_var(cfd->cpumask_ipi);
 	free_percpu(cfd->csd);
 	return 0;
 }
@@ -434,6 +442,7 @@ void smp_call_function_many(const struct cpumask *mask,
 	if (unlikely(!cpumask_weight(cfd->cpumask)))
 		return;
 
+	cpumask_clear(cfd->cpumask_ipi);
 	for_each_cpu(cpu, cfd->cpumask) {
 		struct call_single_data *csd = per_cpu_ptr(cfd->csd, cpu);
 
@@ -442,11 +451,12 @@ void smp_call_function_many(const struct cpumask *mask,
 			csd->flags |= CSD_FLAG_SYNCHRONOUS;
 		csd->func = func;
 		csd->info = info;
-		llist_add(&csd->llist, &per_cpu(call_single_queue, cpu));
+		if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
+			cpumask_set_cpu(cpu, cfd->cpumask_ipi);
 	}
 
 	/* Send a message to all CPUs in the map */
-	arch_send_call_function_ipi_mask(cfd->cpumask);
+	arch_send_call_function_ipi_mask(cfd->cpumask_ipi);
 
 	if (wait) {
 		for_each_cpu(cpu, cfd->cpumask) {
-- 
2.9.4


* Re: [PATCH resend] smp: avoid sending needless IPI in smp_call_function_many()
  2017-05-19  7:53 ` [PATCH resend] smp: avoid sending needless IPI in smp_call_function_many() Aaron Lu
@ 2017-05-19 10:59   ` Peter Zijlstra
  2017-05-22  8:04     ` Aaron Lu
  2017-05-19 11:09   ` Peter Zijlstra
  2017-05-23  8:42   ` [tip:sched/core] smp: Avoid " tip-bot for Aaron Lu
  2 siblings, 1 reply; 6+ messages in thread
From: Peter Zijlstra @ 2017-05-19 10:59 UTC (permalink / raw)
  To: Aaron Lu
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Tim Chen,
	Dave Hansen, Huang Ying

On Fri, May 19, 2017 at 03:53:31PM +0800, Aaron Lu wrote:
> An Inter-Processor Interrupt (IPI) is needed when a page is unmapped and
> the process' mm_cpumask() shows the process has run on other CPUs; page
> migration and page reclaim both need IPIs. The number of IPIs that need
> to be sent to different CPUs is especially large for a multi-threaded
> workload since mm_cpumask() is per process.
> 
> For smp_call_function_many(), whenever a CPU queues a CSD to a target
> CPU, it sends an IPI to let the target CPU handle the work.
> This isn't necessary - we only need to send an IPI when queueing a CSD
> to an empty call_single_queue.
> 
> The reason:
> flush_smp_call_function_queue(), which is called when a CPU receives an
> IPI, empties the queue and then handles all of the CSDs on it. So if
> the target CPU's call_single_queue is not empty, we know that:
> i.  an IPI for the target CPU has already been sent by the 'previous queuers';
> ii. flush_smp_call_function_queue() hasn't emptied that CPU's queue yet.
> Thus, it's safe for us to just queue our CSD there without sending an
> additional IPI. And for the 'previous queuers', we can limit it to the
> first queuer - the one that made the queue non-empty.
> 
> To demonstrate the effect of this patch, a multi-threaded workload that
> spawns 80 threads to equally consume 100G of memory is used. It is run
> on a 2-node Broadwell-EP with 44 cores/88 threads and 32G memory, so
> once the 32G of memory is used up, a lot of page reclaim starts to
> happen.
> 
> With this patch, the number of IPIs dropped by 88% and throughput
> increased by about 15% for the above workload.
> 
> Signed-off-by: Aaron Lu <aaron.lu@intel.com>

Seems fine to me, I'll queue it.


> @@ -434,6 +442,7 @@ void smp_call_function_many(const struct cpumask *mask,
>  	if (unlikely(!cpumask_weight(cfd->cpumask)))
>  		return;
>  
> +	cpumask_clear(cfd->cpumask_ipi);
>  	for_each_cpu(cpu, cfd->cpumask) {
>  		struct call_single_data *csd = per_cpu_ptr(cfd->csd, cpu);
>  
> @@ -442,11 +451,12 @@ void smp_call_function_many(const struct cpumask *mask,
>  			csd->flags |= CSD_FLAG_SYNCHRONOUS;
>  		csd->func = func;
>  		csd->info = info;
> -		llist_add(&csd->llist, &per_cpu(call_single_queue, cpu));
> +		if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
> +			cpumask_set_cpu(cpu, cfd->cpumask_ipi);
>  	}

But looking at this I wonder why cpumask_{set,clear}_cpu() are atomic
ops while most other cpumask ops are not.

This seems to suggest we want __cpumask_{set,clear}_cpu() and use those
here. Because those LOCK prefixes sure are pointless.

>  
>  	/* Send a message to all CPUs in the map */
> -	arch_send_call_function_ipi_mask(cfd->cpumask);
> +	arch_send_call_function_ipi_mask(cfd->cpumask_ipi);
>  
>  	if (wait) {
>  		for_each_cpu(cpu, cfd->cpumask) {

Something like so on top I suppose.

Anybody?

---
 include/linux/cpumask.h |   11 +++++++++++
 kernel/smp.c            |    4 ++--
 2 files changed, 13 insertions(+), 2 deletions(-)

--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -293,6 +293,12 @@ static inline void cpumask_set_cpu(unsig
 	set_bit(cpumask_check(cpu), cpumask_bits(dstp));
 }
 
+static inline void __cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
+{
+	__set_bit(cpumask_check(cpu), cpumask_bits(dstp));
+}
+
+
 /**
  * cpumask_clear_cpu - clear a cpu in a cpumask
  * @cpu: cpu number (< nr_cpu_ids)
@@ -303,6 +309,11 @@ static inline void cpumask_clear_cpu(int
 	clear_bit(cpumask_check(cpu), cpumask_bits(dstp));
 }
 
+static inline void __cpumask_clear_cpu(int cpu, struct cpumask *dstp)
+{
+	__clear_bit(cpumask_check(cpu), cpumask_bits(dstp));
+}
+
 /**
  * cpumask_test_cpu - test for a cpu in a cpumask
  * @cpu: cpu number (< nr_cpu_ids)
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -436,7 +436,7 @@ void smp_call_function_many(const struct
 	cfd = this_cpu_ptr(&cfd_data);
 
 	cpumask_and(cfd->cpumask, mask, cpu_online_mask);
-	cpumask_clear_cpu(this_cpu, cfd->cpumask);
+	__cpumask_clear_cpu(this_cpu, cfd->cpumask);
 
 	/* Some callers race with other cpus changing the passed mask */
 	if (unlikely(!cpumask_weight(cfd->cpumask)))
@@ -452,7 +452,7 @@ void smp_call_function_many(const struct
 		csd->func = func;
 		csd->info = info;
 		if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
-			cpumask_set_cpu(cpu, cfd->cpumask_ipi);
+			__cpumask_set_cpu(cpu, cfd->cpumask_ipi);
 	}
 
 	/* Send a message to all CPUs in the map */


* Re: [PATCH resend] smp: avoid sending needless IPI in smp_call_function_many()
  2017-05-19  7:53 ` [PATCH resend] smp: avoid sending needless IPI in smp_call_function_many() Aaron Lu
  2017-05-19 10:59   ` Peter Zijlstra
@ 2017-05-19 11:09   ` Peter Zijlstra
  2017-05-23  8:42   ` [tip:sched/core] smp: Avoid " tip-bot for Aaron Lu
  2 siblings, 0 replies; 6+ messages in thread
From: Peter Zijlstra @ 2017-05-19 11:09 UTC (permalink / raw)
  To: Aaron Lu
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Tim Chen,
	Dave Hansen, Huang Ying

On Fri, May 19, 2017 at 03:53:31PM +0800, Aaron Lu wrote:
> @@ -434,6 +442,7 @@ void smp_call_function_many(const struct cpumask *mask,
>  	if (unlikely(!cpumask_weight(cfd->cpumask)))
>  		return;

Another thing that occurred to me while staring at that function is
that we could move:

	cpumask_and(cfd->cpumask, mask, cpu_online_mask);

to the very beginning of that function; that would then obviate the
need for that cpumask_weight() test afaict.


I'm not sure it's worth the trouble, but it would make the function
slightly easier to look at.
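
A rough sketch of what that reordering could look like (illustrative
only - this was not part of the applied patch - assuming the existing
fastpath checks are re-expressed against cfd->cpumask once it has been
computed up front):

	cfd = this_cpu_ptr(&cfd_data);

	cpumask_and(cfd->cpumask, mask, cpu_online_mask);
	__cpumask_clear_cpu(this_cpu, cfd->cpumask);

	/* No other online CPU in the mask?  Done. */
	cpu = cpumask_first(cfd->cpumask);
	if (cpu >= nr_cpu_ids)
		return;

	/* Exactly one target?  Take the single-CPU path. */
	if (cpumask_next(cpu, cfd->cpumask) >= nr_cpu_ids) {
		smp_call_function_single(cpu, func, info, wait);
		return;
	}

	/* ... queue the CSDs and send cfd->cpumask_ipi as before ... */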


* Re: [PATCH resend] smp: avoid sending needless IPI in smp_call_function_many()
  2017-05-19 10:59   ` Peter Zijlstra
@ 2017-05-22  8:04     ` Aaron Lu
  0 siblings, 0 replies; 6+ messages in thread
From: Aaron Lu @ 2017-05-22  8:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Tim Chen,
	Dave Hansen, Huang Ying

On Fri, May 19, 2017 at 12:59:54PM +0200, Peter Zijlstra wrote:
> On Fri, May 19, 2017 at 03:53:31PM +0800, Aaron Lu wrote:
> > An Inter-Processor Interrupt (IPI) is needed when a page is unmapped and
> > the process' mm_cpumask() shows the process has run on other CPUs; page
> > migration and page reclaim both need IPIs. The number of IPIs that need
> > to be sent to different CPUs is especially large for a multi-threaded
> > workload since mm_cpumask() is per process.
> > 
> > For smp_call_function_many(), whenever a CPU queues a CSD to a target
> > CPU, it sends an IPI to let the target CPU handle the work.
> > This isn't necessary - we only need to send an IPI when queueing a CSD
> > to an empty call_single_queue.
> > 
> > The reason:
> > flush_smp_call_function_queue(), which is called when a CPU receives an
> > IPI, empties the queue and then handles all of the CSDs on it. So if
> > the target CPU's call_single_queue is not empty, we know that:
> > i.  an IPI for the target CPU has already been sent by the 'previous queuers';
> > ii. flush_smp_call_function_queue() hasn't emptied that CPU's queue yet.
> > Thus, it's safe for us to just queue our CSD there without sending an
> > additional IPI. And for the 'previous queuers', we can limit it to the
> > first queuer - the one that made the queue non-empty.
> > 
> > To demonstrate the effect of this patch, a multi-threaded workload that
> > spawns 80 threads to equally consume 100G of memory is used. It is run
> > on a 2-node Broadwell-EP with 44 cores/88 threads and 32G memory, so
> > once the 32G of memory is used up, a lot of page reclaim starts to
> > happen.
> > 
> > With this patch, the number of IPIs dropped by 88% and throughput
> > increased by about 15% for the above workload.
> > 
> > Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> 
> Seems fine to me, I'll queue it.
> 
> 
> > @@ -434,6 +442,7 @@ void smp_call_function_many(const struct cpumask *mask,
> >  	if (unlikely(!cpumask_weight(cfd->cpumask)))
> >  		return;
> >  
> > +	cpumask_clear(cfd->cpumask_ipi);
> >  	for_each_cpu(cpu, cfd->cpumask) {
> >  		struct call_single_data *csd = per_cpu_ptr(cfd->csd, cpu);
> >  
> > @@ -442,11 +451,12 @@ void smp_call_function_many(const struct cpumask *mask,
> >  			csd->flags |= CSD_FLAG_SYNCHRONOUS;
> >  		csd->func = func;
> >  		csd->info = info;
> > -		llist_add(&csd->llist, &per_cpu(call_single_queue, cpu));
> > +		if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
> > +			cpumask_set_cpu(cpu, cfd->cpumask_ipi);
> >  	}
> 
> But looking at this I wonder why cpumask_{set,clear}_cpu() are atomic
> ops while most other cpumask ops are not.
> 
> This seems to suggest we want __cpumask_{set,clear}_cpu() and use those
> here. Because those LOCK prefixes sure are pointless.

Sounds reasonable to me.

> 
> >  
> >  	/* Send a message to all CPUs in the map */
> > -	arch_send_call_function_ipi_mask(cfd->cpumask);
> > +	arch_send_call_function_ipi_mask(cfd->cpumask_ipi);
> >  
> >  	if (wait) {
> >  		for_each_cpu(cpu, cfd->cpumask) {
> 
> Something like so on top I suppose.
> 
> Anybody?

With the below patch applied on top, the workload still works fine.
Performance is about the same.

Thanks,
Aaron
 
> ---
>  include/linux/cpumask.h |   11 +++++++++++
>  kernel/smp.c            |    4 ++--
>  2 files changed, 13 insertions(+), 2 deletions(-)
> 
> --- a/include/linux/cpumask.h
> +++ b/include/linux/cpumask.h
> @@ -293,6 +293,12 @@ static inline void cpumask_set_cpu(unsig
>  	set_bit(cpumask_check(cpu), cpumask_bits(dstp));
>  }
>  
> +static inline void __cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
> +{
> +	__set_bit(cpumask_check(cpu), cpumask_bits(dstp));
> +}
> +
> +
>  /**
>   * cpumask_clear_cpu - clear a cpu in a cpumask
>   * @cpu: cpu number (< nr_cpu_ids)
> @@ -303,6 +309,11 @@ static inline void cpumask_clear_cpu(int
>  	clear_bit(cpumask_check(cpu), cpumask_bits(dstp));
>  }
>  
> +static inline void __cpumask_clear_cpu(int cpu, struct cpumask *dstp)
> +{
> +	__clear_bit(cpumask_check(cpu), cpumask_bits(dstp));
> +}
> +
>  /**
>   * cpumask_test_cpu - test for a cpu in a cpumask
>   * @cpu: cpu number (< nr_cpu_ids)
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -436,7 +436,7 @@ void smp_call_function_many(const struct
>  	cfd = this_cpu_ptr(&cfd_data);
>  
>  	cpumask_and(cfd->cpumask, mask, cpu_online_mask);
> -	cpumask_clear_cpu(this_cpu, cfd->cpumask);
> +	__cpumask_clear_cpu(this_cpu, cfd->cpumask);
>  
>  	/* Some callers race with other cpus changing the passed mask */
>  	if (unlikely(!cpumask_weight(cfd->cpumask)))
> @@ -452,7 +452,7 @@ void smp_call_function_many(const struct
>  		csd->func = func;
>  		csd->info = info;
>  		if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
> -			cpumask_set_cpu(cpu, cfd->cpumask_ipi);
> +			__cpumask_set_cpu(cpu, cfd->cpumask_ipi);
>  	}
>  
>  	/* Send a message to all CPUs in the map */


* [tip:sched/core] smp: Avoid sending needless IPI in smp_call_function_many()
  2017-05-19  7:53 ` [PATCH resend] smp: avoid sending needless IPI in smp_call_function_many() Aaron Lu
  2017-05-19 10:59   ` Peter Zijlstra
  2017-05-19 11:09   ` Peter Zijlstra
@ 2017-05-23  8:42   ` tip-bot for Aaron Lu
  2 siblings, 0 replies; 6+ messages in thread
From: tip-bot for Aaron Lu @ 2017-05-23  8:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tim.c.chen, tglx, linux-kernel, hpa, dave.hansen, ying.huang,
	peterz, aaron.lu, torvalds, mingo

Commit-ID:  3fc5b3b6a80b2e08a0fec0056208c5dff757e547
Gitweb:     http://git.kernel.org/tip/3fc5b3b6a80b2e08a0fec0056208c5dff757e547
Author:     Aaron Lu <aaron.lu@intel.com>
AuthorDate: Fri, 19 May 2017 15:53:31 +0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 23 May 2017 10:01:32 +0200

smp: Avoid sending needless IPI in smp_call_function_many()

An Inter-Processor Interrupt (IPI) is needed when a page is unmapped and
the process' mm_cpumask() shows the process has run on other CPUs; page
migration and page reclaim both need IPIs. The number of IPIs that need
to be sent to different CPUs is especially large for a multi-threaded
workload since mm_cpumask() is per process.

For smp_call_function_many(), whenever a CPU queues a CSD to a target
CPU, it sends an IPI to let the target CPU handle the work.
This isn't necessary - we only need to send an IPI when queueing a CSD
to an empty call_single_queue.

The reason:

flush_smp_call_function_queue(), which is called when a CPU receives an
IPI, empties the queue and then handles all of the CSDs on it. So if
the target CPU's call_single_queue is not empty, we know that:
i.  an IPI for the target CPU has already been sent by the 'previous queuers';
ii. flush_smp_call_function_queue() hasn't emptied that CPU's queue yet.
Thus, it's safe for us to just queue our CSD there without sending an
additional IPI. And for the 'previous queuers', we can limit it to the
first queuer - the one that made the queue non-empty.

To demonstrate the effect of this patch, a multi-threaded workload that
spawns 80 threads to equally consume 100G of memory is used. It is run
on a 2-node Broadwell-EP with 44 cores/88 threads and 32G memory, so
once the 32G of memory is used up, a lot of page reclaim starts to
happen.

With this patch, the number of IPIs dropped by 88% and throughput
increased by about 15% for the above workload.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Link: http://lkml.kernel.org/r/20170519075331.GE2084@aaronlu.sh.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/smp.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index a817769..76d16fe 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -30,6 +30,7 @@ enum {
 struct call_function_data {
 	struct call_single_data	__percpu *csd;
 	cpumask_var_t		cpumask;
+	cpumask_var_t		cpumask_ipi;
 };
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_function_data, cfd_data);
@@ -45,9 +46,15 @@ int smpcfd_prepare_cpu(unsigned int cpu)
 	if (!zalloc_cpumask_var_node(&cfd->cpumask, GFP_KERNEL,
 				     cpu_to_node(cpu)))
 		return -ENOMEM;
+	if (!zalloc_cpumask_var_node(&cfd->cpumask_ipi, GFP_KERNEL,
+				     cpu_to_node(cpu))) {
+		free_cpumask_var(cfd->cpumask);
+		return -ENOMEM;
+	}
 	cfd->csd = alloc_percpu(struct call_single_data);
 	if (!cfd->csd) {
 		free_cpumask_var(cfd->cpumask);
+		free_cpumask_var(cfd->cpumask_ipi);
 		return -ENOMEM;
 	}
 
@@ -59,6 +66,7 @@ int smpcfd_dead_cpu(unsigned int cpu)
 	struct call_function_data *cfd = &per_cpu(cfd_data, cpu);
 
 	free_cpumask_var(cfd->cpumask);
+	free_cpumask_var(cfd->cpumask_ipi);
 	free_percpu(cfd->csd);
 	return 0;
 }
@@ -434,6 +442,7 @@ void smp_call_function_many(const struct cpumask *mask,
 	if (unlikely(!cpumask_weight(cfd->cpumask)))
 		return;
 
+	cpumask_clear(cfd->cpumask_ipi);
 	for_each_cpu(cpu, cfd->cpumask) {
 		struct call_single_data *csd = per_cpu_ptr(cfd->csd, cpu);
 
@@ -442,11 +451,12 @@ void smp_call_function_many(const struct cpumask *mask,
 			csd->flags |= CSD_FLAG_SYNCHRONOUS;
 		csd->func = func;
 		csd->info = info;
-		llist_add(&csd->llist, &per_cpu(call_single_queue, cpu));
+		if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
+			cpumask_set_cpu(cpu, cfd->cpumask_ipi);
 	}
 
 	/* Send a message to all CPUs in the map */
-	arch_send_call_function_ipi_mask(cfd->cpumask);
+	arch_send_call_function_ipi_mask(cfd->cpumask_ipi);
 
 	if (wait) {
 		for_each_cpu(cpu, cfd->cpumask) {

