All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] xen/smp: Speed up on_selected_cpus()
@ 2022-02-04 20:31 Andrew Cooper
  2022-02-06 19:40 ` Julien Grall
  2022-02-07  8:11 ` Jan Beulich
  0 siblings, 2 replies; 6+ messages in thread
From: Andrew Cooper @ 2022-02-04 20:31 UTC (permalink / raw)
  To: Xen-devel
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné,
	Wei Liu, Juergen Gross, Stefano Stabellini, Julien Grall,
	Volodymyr Babchuk, Bertrand Marquis

cpumask_weight() is a horribly expensive way to find if no bits are set, made
worse by the fact that the calculation is performed with the global call_lock
held.

Switch to using cpumask_empty() instead, which will short circuit as soon as
it find any set bit in the cpumask.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Juergen Gross <jgross@suse.com>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Julien Grall <julien@xen.org>
CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
CC: Bertrand Marquis <bertrand.marquis@arm.com>

I have not done performance testing, but I would be surprised if this cannot
be measured on a busy or large box.
---
 xen/common/smp.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/xen/common/smp.c b/xen/common/smp.c
index 781bcf2c246c..a011f541f1ea 100644
--- a/xen/common/smp.c
+++ b/xen/common/smp.c
@@ -50,8 +50,6 @@ void on_selected_cpus(
     void *info,
     int wait)
 {
-    unsigned int nr_cpus;
-
     ASSERT(local_irq_is_enabled());
     ASSERT(cpumask_subset(selected, &cpu_online_map));
 
@@ -59,8 +57,7 @@ void on_selected_cpus(
 
     cpumask_copy(&call_data.selected, selected);
 
-    nr_cpus = cpumask_weight(&call_data.selected);
-    if ( nr_cpus == 0 )
+    if ( cpumask_empty(&call_data.selected) )
         goto out;
 
     call_data.func = func;
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] xen/smp: Speed up on_selected_cpus()
  2022-02-04 20:31 [PATCH] xen/smp: Speed up on_selected_cpus() Andrew Cooper
@ 2022-02-06 19:40 ` Julien Grall
  2022-02-07 17:30   ` Andrew Cooper
  2022-02-07  8:11 ` Jan Beulich
  1 sibling, 1 reply; 6+ messages in thread
From: Julien Grall @ 2022-02-06 19:40 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel
  Cc: Jan Beulich, Roger Pau Monné,
	Wei Liu, Juergen Gross, Stefano Stabellini, Volodymyr Babchuk,
	Bertrand Marquis

Hi,

On 04/02/2022 20:31, Andrew Cooper wrote:
> cpumask_weight() is a horribly expensive way to find if no bits are set, made
> worse by the fact that the calculation is performed with the global call_lock
> held.

I looked at the archive because I was wondering why we were using 
cpumask_weight here. It looks like this was a left-over of the rework in 
ac3fc35d919c "x86: Fix flush_area_mask() and on_selected_cpus() to not 
race updates".

> 
> Switch to using cpumask_empty() instead, which will short circuit as soon as
> it find any set bit in the cpumask.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Julien Grall <jgrall@amazon.com>

> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> CC: Wei Liu <wl@xen.org>
> CC: Juergen Gross <jgross@suse.com>
> CC: Stefano Stabellini <sstabellini@kernel.org>
> CC: Julien Grall <julien@xen.org>
> CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
> CC: Bertrand Marquis <bertrand.marquis@arm.com>
> 
> I have not done performance testing, but I would be surprised if this cannot
> be measured on a busy or large box.
> ---
>   xen/common/smp.c | 5 +----
>   1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/xen/common/smp.c b/xen/common/smp.c
> index 781bcf2c246c..a011f541f1ea 100644
> --- a/xen/common/smp.c
> +++ b/xen/common/smp.c
> @@ -50,8 +50,6 @@ void on_selected_cpus(
>       void *info,
>       int wait)
>   {
> -    unsigned int nr_cpus;
> -
>       ASSERT(local_irq_is_enabled());
>       ASSERT(cpumask_subset(selected, &cpu_online_map));
>   
> @@ -59,8 +57,7 @@ void on_selected_cpus(
>   
>       cpumask_copy(&call_data.selected, selected);
>   
> -    nr_cpus = cpumask_weight(&call_data.selected);
> -    if ( nr_cpus == 0 )
> +    if ( cpumask_empty(&call_data.selected) )
>           goto out;
>   
>       call_data.func = func;

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] xen/smp: Speed up on_selected_cpus()
  2022-02-04 20:31 [PATCH] xen/smp: Speed up on_selected_cpus() Andrew Cooper
  2022-02-06 19:40 ` Julien Grall
@ 2022-02-07  8:11 ` Jan Beulich
  2022-02-07 17:06   ` Andrew Cooper
  1 sibling, 1 reply; 6+ messages in thread
From: Jan Beulich @ 2022-02-07  8:11 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Roger Pau Monné,
	Wei Liu, Juergen Gross, Stefano Stabellini, Julien Grall,
	Volodymyr Babchuk, Bertrand Marquis, Xen-devel

On 04.02.2022 21:31, Andrew Cooper wrote:
> cpumask_weight() is a horribly expensive way to find if no bits are set, made
> worse by the fact that the calculation is performed with the global call_lock
> held.
> 
> Switch to using cpumask_empty() instead, which will short circuit as soon as
> it find any set bit in the cpumask.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

May I suggest to drop "horribly"? How expensive one is compared to the other
depends on the number of CPUs actually enumerated in the system. (And of
course I still have that conversion to POPCNT alternatives patching pending,
where Roger did ask for some re-work in reply to v2, but where it has
remained unclear whether investing time into that wouldn't be in vein,
considering some of your replies on v1. Thus would have further shrunk the
difference, without me meaning to say the change here isn't a good one.)

Reviewed-by: Jan Beulich <jbeulich@suse.com>

Jan



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] xen/smp: Speed up on_selected_cpus()
  2022-02-07  8:11 ` Jan Beulich
@ 2022-02-07 17:06   ` Andrew Cooper
  2022-02-08 10:39     ` Jan Beulich
  0 siblings, 1 reply; 6+ messages in thread
From: Andrew Cooper @ 2022-02-07 17:06 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Roger Pau Monne, Wei Liu, Juergen Gross, Stefano Stabellini,
	Julien Grall, Volodymyr Babchuk, Bertrand Marquis, Xen-devel

On 07/02/2022 08:11, Jan Beulich wrote:
> On 04.02.2022 21:31, Andrew Cooper wrote:
>> cpumask_weight() is a horribly expensive way to find if no bits are set, made
>> worse by the fact that the calculation is performed with the global call_lock
>> held.
>>
>> Switch to using cpumask_empty() instead, which will short circuit as soon as
>> it find any set bit in the cpumask.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> May I suggest to drop "horribly"? How expensive one is compared to the other
> depends on the number of CPUs actually enumerated in the system.

In absolute terms perhaps, but they both scale as O(nr_cpus).  Hamming
weight has a far larger constant.

>  (And of
> course I still have that conversion to POPCNT alternatives patching pending,
> where Roger did ask for some re-work in reply to v2, but where it has
> remained unclear whether investing time into that wouldn't be in vein,
> considering some of your replies on v1. Thus would have further shrunk the
> difference, without me meaning to say the change here isn't a good one.)

There is a perfectly clear and simple way forward.  It's the one which
doesn't fight the optimiser and actively regress the code generation in
the calling functions, and add an unreasonable quantity technical debt
into the marginal paths.

I will ack a version where you're not adding complexity for negative gains.

~Andrew

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] xen/smp: Speed up on_selected_cpus()
  2022-02-06 19:40 ` Julien Grall
@ 2022-02-07 17:30   ` Andrew Cooper
  0 siblings, 0 replies; 6+ messages in thread
From: Andrew Cooper @ 2022-02-07 17:30 UTC (permalink / raw)
  To: Julien Grall, Xen-devel
  Cc: Jan Beulich, Roger Pau Monne, Wei Liu, Juergen Gross,
	Stefano Stabellini, Volodymyr Babchuk, Bertrand Marquis

On 06/02/2022 19:40, Julien Grall wrote:
> Hi,
>
> On 04/02/2022 20:31, Andrew Cooper wrote:
>> cpumask_weight() is a horribly expensive way to find if no bits are
>> set, made
>> worse by the fact that the calculation is performed with the global
>> call_lock
>> held.
>
> I looked at the archive because I was wondering why we were using
> cpumask_weight here. It looks like this was a left-over of the rework
> in ac3fc35d919c "x86: Fix flush_area_mask() and on_selected_cpus() to
> not race updates".

That change shuffled the code, but didn't introduce the problem.

I'm pretty sure it was 433f14699d48 which dropped the !=0 user of nr_cpus.


Talking of, there is more efficiency to be gained by reworking the
second cpumask_empty() call to not restart from 0 on failure, because
that removes useless reads.


>
>>
>> Switch to using cpumask_empty() instead, which will short circuit as
>> soon as
>> it find any set bit in the cpumask.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>
> Reviewed-by: Julien Grall <jgrall@amazon.com>

Thanks.

~Andrew

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] xen/smp: Speed up on_selected_cpus()
  2022-02-07 17:06   ` Andrew Cooper
@ 2022-02-08 10:39     ` Jan Beulich
  0 siblings, 0 replies; 6+ messages in thread
From: Jan Beulich @ 2022-02-08 10:39 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Roger Pau Monne, Wei Liu, Juergen Gross, Stefano Stabellini,
	Julien Grall, Volodymyr Babchuk, Bertrand Marquis, Xen-devel

On 07.02.2022 18:06, Andrew Cooper wrote:
> On 07/02/2022 08:11, Jan Beulich wrote:
>>  (And of
>> course I still have that conversion to POPCNT alternatives patching pending,
>> where Roger did ask for some re-work in reply to v2, but where it has
>> remained unclear whether investing time into that wouldn't be in vein,
>> considering some of your replies on v1. Thus would have further shrunk the
>> difference, without me meaning to say the change here isn't a good one.)
> 
> There is a perfectly clear and simple way forward.  It's the one which
> doesn't fight the optimiser and actively regress the code generation in
> the calling functions, and add an unreasonable quantity technical debt
> into the marginal paths.
> 
> I will ack a version where you're not adding complexity for negative gains.

Thanks, at least some form of a reply. I'm afraid I can't really translate
this to which parts of the change you'd be okay with and which parts need
changing. I didn't think I would "fight the optimiser and actively regress
the code generation in the calling functions" in v2 (this may have been
different in v1, but I haven't gone back to check; I wonder though whether
you're mixing this with e.g. the BMI patching series I've long given up on).

Jan



^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2022-02-08 10:40 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-04 20:31 [PATCH] xen/smp: Speed up on_selected_cpus() Andrew Cooper
2022-02-06 19:40 ` Julien Grall
2022-02-07 17:30   ` Andrew Cooper
2022-02-07  8:11 ` Jan Beulich
2022-02-07 17:06   ` Andrew Cooper
2022-02-08 10:39     ` Jan Beulich

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.