* [PATCH] irqchip/gic-v3: use dsb(ishst) to synchronize data to smp before issuing ipi
@ 2022-02-18 21:55 ` Barry Song
  0 siblings, 0 replies; 22+ messages in thread
From: Barry Song @ 2022-02-18 21:55 UTC (permalink / raw)
  To: maz, tglx, will, linux-kernel; +Cc: linux-arm-kernel, linuxarm, Barry Song

dsb(ishst) should be enough here, as we only need to guarantee the
visibility of data to the other CPUs in the inner shareable domain
before we send the IPI.

Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
---
 drivers/irqchip/irq-gic-v3.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index 5e935d97207d..0efe1a9a9f3b 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -1211,7 +1211,7 @@ static void gic_ipi_send_mask(struct irq_data *d, const struct cpumask *mask)
 	 * Ensure that stores to Normal memory are visible to the
 	 * other CPUs before issuing the IPI.
 	 */
-	wmb();
+	dsb(ishst);
 
 	for_each_cpu(cpu, mask) {
 		u64 cluster_id = MPIDR_TO_SGI_CLUSTER_ID(cpu_logical_map(cpu));
-- 
2.25.1
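
For context, the difference between the two barriers is only the shareability domain they order stores against. A simplified sketch of the arm64 definitions involved (adapted from arch/arm64/include/asm/barrier.h; shown for illustration only, not part of the patch):

```c
/* Simplified view of the arm64 barrier macros involved: */
#define dsb(opt)	asm volatile("dsb " #opt : : : "memory")

#define wmb()		dsb(st)		/* orders stores, full system domain */

/*
 * dsb(ishst) orders stores against the inner shareable domain only,
 * i.e. the set of CPUs participating in SMP -- which is all that is
 * needed before generating the SGI.
 */
```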


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] irqchip/gic-v3: use dsb(ishst) to synchronize data to smp before issuing ipi
  2022-02-18 21:55 ` Barry Song
@ 2022-02-19  9:56   ` Marc Zyngier
  -1 siblings, 0 replies; 22+ messages in thread
From: Marc Zyngier @ 2022-02-19  9:56 UTC (permalink / raw)
  To: Barry Song
  Cc: tglx, will, linux-kernel, linux-arm-kernel, linuxarm, Barry Song

On 2022-02-18 21:55, Barry Song wrote:
> dsb(ishst) should be enough here as we only need to guarantee the
> visibility of data to other CPUs in smp inner domain before we
> send the ipi.
> 
> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> ---
>  drivers/irqchip/irq-gic-v3.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/irqchip/irq-gic-v3.c 
> b/drivers/irqchip/irq-gic-v3.c
> index 5e935d97207d..0efe1a9a9f3b 100644
> --- a/drivers/irqchip/irq-gic-v3.c
> +++ b/drivers/irqchip/irq-gic-v3.c
> @@ -1211,7 +1211,7 @@ static void gic_ipi_send_mask(struct irq_data
> *d, const struct cpumask *mask)
>  	 * Ensure that stores to Normal memory are visible to the
>  	 * other CPUs before issuing the IPI.
>  	 */
> -	wmb();
> +	dsb(ishst);
> 
>  	for_each_cpu(cpu, mask) {
>  		u64 cluster_id = MPIDR_TO_SGI_CLUSTER_ID(cpu_logical_map(cpu));

I'm not opposed to that change, but I'm pretty curious whether this makes
any visible difference in practice. Could you measure the effect of this
change for any sort of IPI heavy workload?

Thanks,

         M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] irqchip/gic-v3: use dsb(ishst) to synchronize data to smp before issuing ipi
  2022-02-19  9:56   ` Marc Zyngier
@ 2022-02-19 23:46     ` Barry Song
  -1 siblings, 0 replies; 22+ messages in thread
From: Barry Song @ 2022-02-19 23:46 UTC (permalink / raw)
  To: maz
  Cc: 21cnbao, linux-arm-kernel, linux-kernel, linuxarm, song.bao.hua,
	tglx, will

>> +	dsb(ishst);
>> 
>>  	for_each_cpu(cpu, mask) {
>>  		u64 cluster_id = MPIDR_TO_SGI_CLUSTER_ID(cpu_logical_map(cpu));
>
> I'm not opposed to that change, but I'm pretty curious whether this makes
> any visible difference in practice. Could you measure the effect of this
> change for any sort of IPI heavy workload?
> 
> Thanks,
> 
>          M.

In practice, at least I don't see much difference on the hardware I am
using, so the result probably depends on the implementation of the real
hardware.

I wrote a micro benchmark to measure the latency w/ and w/o the patch on
Kunpeng 920 with 96 cores (2 sockets, each socket has 2 dies, each die has
24 cores; cpu0-cpu47 belong to socket0, cpu48-cpu95 belong to socket1) by
sending IPIs to cpu0-cpu95 1000 times from a specified CPU:

#include <linux/module.h>
#include <linux/timekeeping.h>

static void ipi_latency_func(void *val)
{
}

static int __init ipi_latency_init(void)
{
        ktime_t stime, etime, delta;
        int cpu, i;
        int start = smp_processor_id();

        stime = ktime_get();
        for (i = 0; i < 1000; i++)
                for (cpu = 0; cpu < 96; cpu++)
                        smp_call_function_single(cpu, ipi_latency_func, NULL, 1);
        etime = ktime_get();

        delta = ktime_sub(etime, stime);

        printk("%s ipi from cpu%d to cpu0-95 delta of 1000times:%lld\n",
                        __func__, start, delta);

        return 0;
}
module_init(ipi_latency_init);

static void __exit ipi_latency_exit(void)
{
}
module_exit(ipi_latency_exit);

MODULE_DESCRIPTION("IPI benchmark");
MODULE_LICENSE("GPL");

Run the following 10 times:
# taskset -c 0 insmod test.ko
# rmmod test

and the following 10 times:
# taskset -c 48 insmod test.ko
# rmmod test

With taskset -c, I can change the source CPU sending the IPI.

The result is as below:

vanilla kernel:
[  103.391684] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122237009
[  103.537256] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:121364329
[  103.681276] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:121420160
[  103.826254] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122392403
[  103.970209] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122371262
[  104.113879] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122041254
[  104.257444] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:121594453
[  104.402432] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122592556
[  104.561434] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:121601214
[  104.705561] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:121732767

[  124.592944] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:147048939
[  124.779280] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:147467842
[  124.958162] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:146448676
[  125.129253] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:141537482
[  125.298848] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:147161504
[  125.471531] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:147833787
[  125.643133] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:147438445
[  125.814530] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:146806172
[  125.989677] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:145971002
[  126.159497] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:147780655

patched kernel:
[  428.828167] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122195849
[  428.970822] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122361042
[  429.111058] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122528494
[  429.257704] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:121155045
[  429.410186] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:121608565
[  429.570171] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:121613673
[  429.718181] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:121593737
[  429.862615] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:121953875
[  430.002796] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122102741
[  430.142741] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122005473

[  516.642812] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:145610926
[  516.817002] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:145878266
[  517.004665] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:145602966
[  517.188758] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:145658672
[  517.372409] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:141329497
[  517.557313] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:146323829
[  517.733107] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:146015196
[  517.921491] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:146439231
[  518.093129] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:146106916
[  518.264162] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:145097868

So there is not much difference between the vanilla and patched kernels.

What really makes me worried about my hardware is that an IPI sent from the second
socket always shows worse performance than one sent from the first socket. This
seems to be a problem worth investigating.

Thanks
Barry

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] irqchip/gic-v3: use dsb(ishst) to synchronize data to smp before issuing ipi
  2022-02-19 23:46     ` Barry Song
@ 2022-02-20  1:33       ` Barry Song
  -1 siblings, 0 replies; 22+ messages in thread
From: Barry Song @ 2022-02-20  1:33 UTC (permalink / raw)
  To: 21cnbao
  Cc: linux-arm-kernel, linux-kernel, linuxarm, maz, song.bao.hua, tglx, will

> So there is not much difference between the vanilla and patched kernels.

Sorry, let me correct it.

I realized I should write some data before sending the IPI, so I have changed
the module as below:

#include <linux/module.h>
#include <linux/timekeeping.h>

volatile int data0 ____cacheline_aligned;
volatile int data1 ____cacheline_aligned;
volatile int data2 ____cacheline_aligned;
volatile int data3 ____cacheline_aligned;
volatile int data4 ____cacheline_aligned;
volatile int data5 ____cacheline_aligned;
volatile int data6 ____cacheline_aligned;

static void ipi_latency_func(void *val)
{
}

static int __init ipi_latency_init(void)
{
        ktime_t stime, etime, delta;
        int cpu, i;
        int start = smp_processor_id();

        stime = ktime_get();
        for (i = 0; i < 1000; i++)
                for (cpu = 0; cpu < 96; cpu++) {
                        data0 = data1 = data2 = data3 = data4 = data5 = data6 = cpu;
                        smp_call_function_single(cpu, ipi_latency_func, NULL, 1);
                }
        etime = ktime_get();

        delta = ktime_sub(etime, stime);

        printk("%s ipi from cpu%d to cpu0-95 delta of 1000times:%lld\n",
                        __func__, start, delta);

        return 0;
}
module_init(ipi_latency_init);

static void __exit ipi_latency_exit(void)
{
}
module_exit(ipi_latency_exit);

MODULE_DESCRIPTION("IPI benchmark");
MODULE_LICENSE("GPL");

After that, I can see a ~1% difference between the patched and vanilla kernels:

vanilla:
[  375.220131] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:126757449
[  375.382596] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:126784249
[  375.537975] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:126177703
[  375.686823] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:127022281
[  375.849967] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:126184883
[  375.999173] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:127374585
[  376.149565] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:125778089
[  376.298743] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:126974441
[  376.451125] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:127357625
[  376.606006] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:126228184

[  371.405378] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:151851181
[  371.591642] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:151568608
[  371.767906] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:151853441
[  371.944031] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:152065453
[  372.114085] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:146122093
[  372.291345] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:151379636
[  372.459812] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:151854411
[  372.629708] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:145750720
[  372.807574] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:151629448
[  372.994979] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:151050253

patched kernel:
[  105.598815] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:124467401
[  105.748368] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:123474209
[  105.900400] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:123558497
[  106.043890] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122993951
[  106.191845] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122984223
[  106.348215] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:123323609
[  106.501448] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:124507583
[  106.656358] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:123386963
[  106.804367] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:123340664
[  106.956331] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:123285324

[  108.930802] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:143616067
[  109.094750] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:148969821
[  109.267428] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:149648418
[  109.443274] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:149448903
[  109.621760] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:147882917
[  109.794611] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:148700282
[  109.975197] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:149050595
[  110.141543] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:143566604
[  110.315213] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:149202898
[  110.491008] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:148958261

As you can see, when cpu0 is the source, the vanilla kernel takes 125xxxxxx-127xxxxxx ns
while the patched kernel takes 122xxxxxx-124xxxxxx ns.
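
Dividing those deltas by the number of IPIs per run (96 CPUs x 1000 iterations = 96,000) makes the comparison easier to read; a small helper for that back-of-the-envelope conversion (illustration only, not part of the benchmark module):

```c
/*
 * Convert a printed ktime delta (total ns for one run of 96,000 IPIs,
 * i.e. 96 CPUs x 1000 iterations) into an approximate per-IPI
 * round-trip latency in microseconds.
 */
static double per_ipi_us(long long delta_ns)
{
        const long long n_ipis = 96LL * 1000;

        return (double)delta_ns / n_ipis / 1000.0;
}

/*
 * e.g. per_ipi_us(126757449) -> ~1.32 us/IPI (vanilla, cpu0 source)
 *      per_ipi_us(124467401) -> ~1.30 us/IPI (patched, cpu0 source)
 */
```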

Thanks
Barry

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] irqchip/gic-v3: use dsb(ishst) to synchronize data to smp before issuing ipi
  2022-02-19  9:56   ` Marc Zyngier
@ 2022-02-20 13:30     ` Ard Biesheuvel
  -1 siblings, 0 replies; 22+ messages in thread
From: Ard Biesheuvel @ 2022-02-20 13:30 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Barry Song, Thomas Gleixner, Will Deacon,
	Linux Kernel Mailing List, Linux ARM, Linuxarm, Barry Song

On Sat, 19 Feb 2022 at 10:57, Marc Zyngier <maz@kernel.org> wrote:
>
> On 2022-02-18 21:55, Barry Song wrote:
> > dsb(ishst) should be enough here as we only need to guarantee the
> > visibility of data to other CPUs in smp inner domain before we
> > send the ipi.
> >
> > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> > ---
> >  drivers/irqchip/irq-gic-v3.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/irqchip/irq-gic-v3.c
> > b/drivers/irqchip/irq-gic-v3.c
> > index 5e935d97207d..0efe1a9a9f3b 100644
> > --- a/drivers/irqchip/irq-gic-v3.c
> > +++ b/drivers/irqchip/irq-gic-v3.c
> > @@ -1211,7 +1211,7 @@ static void gic_ipi_send_mask(struct irq_data
> > *d, const struct cpumask *mask)
> >        * Ensure that stores to Normal memory are visible to the
> >        * other CPUs before issuing the IPI.
> >        */
> > -     wmb();
> > +     dsb(ishst);
> >
> >       for_each_cpu(cpu, mask) {
> >               u64 cluster_id = MPIDR_TO_SGI_CLUSTER_ID(cpu_logical_map(cpu));
>
> I'm not opposed to that change, but I'm pretty curious whether this makes
> any visible difference in practice. Could you measure the effect of this
> change for any sort of IPI heavy workload?
>

Does this have to be a DSB?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] irqchip/gic-v3: use dsb(ishst) to synchronize data to smp before issuing ipi
@ 2022-02-20 13:30     ` Ard Biesheuvel
  0 siblings, 0 replies; 22+ messages in thread
From: Ard Biesheuvel @ 2022-02-20 13:30 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Barry Song, Thomas Gleixner, Will Deacon,
	Linux Kernel Mailing List, Linux ARM, Linuxarm, Barry Song

On Sat, 19 Feb 2022 at 10:57, Marc Zyngier <maz@kernel.org> wrote:
>
> On 2022-02-18 21:55, Barry Song wrote:
> > dsb(ishst) should be enough here as we only need to guarantee the
> > visibility of data to other CPUs in smp inner domain before we
> > send the ipi.
> >
> > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> > ---
> >  drivers/irqchip/irq-gic-v3.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/irqchip/irq-gic-v3.c
> > b/drivers/irqchip/irq-gic-v3.c
> > index 5e935d97207d..0efe1a9a9f3b 100644
> > --- a/drivers/irqchip/irq-gic-v3.c
> > +++ b/drivers/irqchip/irq-gic-v3.c
> > @@ -1211,7 +1211,7 @@ static void gic_ipi_send_mask(struct irq_data
> > *d, const struct cpumask *mask)
> >        * Ensure that stores to Normal memory are visible to the
> >        * other CPUs before issuing the IPI.
> >        */
> > -     wmb();
> > +     dsb(ishst);
> >
> >       for_each_cpu(cpu, mask) {
> >               u64 cluster_id = MPIDR_TO_SGI_CLUSTER_ID(cpu_logical_map(cpu));
>
> I'm not opposed to that change, but I'm pretty curious whether this
> makes
> any visible difference in practice. Could you measure the effect of this
> change
> for any sort of IPI heavy workload?
>

Does this have to be a DSB ?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] irqchip/gic-v3: use dsb(ishst) to synchronize data to smp before issuing ipi
  2022-02-18 21:55 ` Barry Song
@ 2022-02-20 15:04   ` Russell King (Oracle)
  -1 siblings, 0 replies; 22+ messages in thread
From: Russell King (Oracle) @ 2022-02-20 15:04 UTC (permalink / raw)
  To: Barry Song
  Cc: maz, tglx, will, linux-kernel, linux-arm-kernel, linuxarm, Barry Song

On Sat, Feb 19, 2022 at 05:55:49AM +0800, Barry Song wrote:
> dsb(ishst) should be enough here as we only need to guarantee the
> visibility of data to other CPUs in smp inner domain before we
> send the ipi.
> 
> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> ---
>  drivers/irqchip/irq-gic-v3.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
> index 5e935d97207d..0efe1a9a9f3b 100644
> --- a/drivers/irqchip/irq-gic-v3.c
> +++ b/drivers/irqchip/irq-gic-v3.c
> @@ -1211,7 +1211,7 @@ static void gic_ipi_send_mask(struct irq_data *d, const struct cpumask *mask)
>  	 * Ensure that stores to Normal memory are visible to the
>  	 * other CPUs before issuing the IPI.
>  	 */
> -	wmb();
> +	dsb(ishst);

On ARM, wmb() is a dsb(st) followed by other operations which may
include a sync operation at the L2 cache, and SoC specific barriers
for the bus. Hopefully, nothing will break if the sledge hammer is
replaced by the tack hammer.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] irqchip/gic-v3: use dsb(ishst) to synchronize data to smp before issuing ipi
  2022-02-20 13:30     ` Ard Biesheuvel
@ 2022-02-20 15:04       ` Marc Zyngier
  -1 siblings, 0 replies; 22+ messages in thread
From: Marc Zyngier @ 2022-02-20 15:04 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Barry Song, Thomas Gleixner, Will Deacon,
	Linux Kernel Mailing List, Linux ARM, Linuxarm, Barry Song

On 2022-02-20 13:30, Ard Biesheuvel wrote:
> On Sat, 19 Feb 2022 at 10:57, Marc Zyngier <maz@kernel.org> wrote:
>> 
>> On 2022-02-18 21:55, Barry Song wrote:
>> > dsb(ishst) should be enough here as we only need to guarantee the
>> > visibility of data to other CPUs in smp inner domain before we
>> > send the ipi.
>> >
>> > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
>> > ---
>> >  drivers/irqchip/irq-gic-v3.c | 2 +-
>> >  1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> > diff --git a/drivers/irqchip/irq-gic-v3.c
>> > b/drivers/irqchip/irq-gic-v3.c
>> > index 5e935d97207d..0efe1a9a9f3b 100644
>> > --- a/drivers/irqchip/irq-gic-v3.c
>> > +++ b/drivers/irqchip/irq-gic-v3.c
>> > @@ -1211,7 +1211,7 @@ static void gic_ipi_send_mask(struct irq_data
>> > *d, const struct cpumask *mask)
>> >        * Ensure that stores to Normal memory are visible to the
>> >        * other CPUs before issuing the IPI.
>> >        */
>> > -     wmb();
>> > +     dsb(ishst);
>> >
>> >       for_each_cpu(cpu, mask) {
>> >               u64 cluster_id = MPIDR_TO_SGI_CLUSTER_ID(cpu_logical_map(cpu));
>> 
>> I'm not opposed to that change, but I'm pretty curious whether this
>> makes
>> any visible difference in practice. Could you measure the effect of 
>> this
>> change
>> for any sort of IPI heavy workload?
>> 
> 
> Does this have to be a DSB ?

We can use a DMB ISHST for the other interrupt controllers that use an
MMIO access to signal the IPI, as we need to order the previous writes
with the MMIO access (and DMB fits that bill).

For GICv3 when ICC_SRE_EL1.SRE==1, we need to order a set of writes with
a system register access with side effects, and the only barrier that
allows us to do that is DSB.
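
To make the distinction concrete, an illustrative sketch of the two
sequences (operands and the SGI encoding are simplified, not the actual
kernel code):

```asm
// MMIO-signalled IPI: both accesses are memory operations, so a
// DMB ISHST (store/store, inner shareable) is sufficient.
str   x1, [x0]            // data to be made visible to the target CPU
dmb   ishst
str   w2, [x3]            // Device-memory write that raises the IPI

// GICv3 sysreg-signalled IPI: the MSR is not a memory access, so a
// DMB cannot order it; a DSB is needed to complete the stores first.
str   x1, [x0]            // data to be made visible to the target CPU
dsb   ishst
msr   ICC_SGI1R_EL1, x4   // system-register write that raises the SGI
```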

It is a bit unfortunate that the relatively lightweight nature of the
sysreg CPU interface is encumbered by fairly heavy barriers (the most
horrible one being the DSB SY on each read of IAR, needed to
synchronise the CPU interface and the redistributor).

Thanks,

         M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] irqchip/gic-v3: use dsb(ishst) to synchronize data to smp before issuing ipi
  2022-02-20 13:30     ` Ard Biesheuvel
@ 2022-02-20 15:05       ` Russell King (Oracle)
  -1 siblings, 0 replies; 22+ messages in thread
From: Russell King (Oracle) @ 2022-02-20 15:05 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Marc Zyngier, Barry Song, Thomas Gleixner, Will Deacon,
	Linux Kernel Mailing List, Linux ARM, Linuxarm, Barry Song

On Sun, Feb 20, 2022 at 02:30:24PM +0100, Ard Biesheuvel wrote:
> On Sat, 19 Feb 2022 at 10:57, Marc Zyngier <maz@kernel.org> wrote:
> >
> > On 2022-02-18 21:55, Barry Song wrote:
> > > dsb(ishst) should be enough here as we only need to guarantee the
> > > visibility of data to other CPUs in smp inner domain before we
> > > send the ipi.
> > >
> > > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> > > ---
> > >  drivers/irqchip/irq-gic-v3.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/irqchip/irq-gic-v3.c
> > > b/drivers/irqchip/irq-gic-v3.c
> > > index 5e935d97207d..0efe1a9a9f3b 100644
> > > --- a/drivers/irqchip/irq-gic-v3.c
> > > +++ b/drivers/irqchip/irq-gic-v3.c
> > > @@ -1211,7 +1211,7 @@ static void gic_ipi_send_mask(struct irq_data
> > > *d, const struct cpumask *mask)
> > >        * Ensure that stores to Normal memory are visible to the
> > >        * other CPUs before issuing the IPI.
> > >        */
> > > -     wmb();
> > > +     dsb(ishst);
> > >
> > >       for_each_cpu(cpu, mask) {
> > >               u64 cluster_id = MPIDR_TO_SGI_CLUSTER_ID(cpu_logical_map(cpu));
> >
> > I'm not opposed to that change, but I'm pretty curious whether this
> > makes
> > any visible difference in practice. Could you measure the effect of this
> > change
> > for any sort of IPI heavy workload?
> >
> 
> Does this have to be a DSB ?

Are you suggesting that smp_wmb() may suffice (which is a dmb(ishst)) ?

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] irqchip/gic-v3: use dsb(ishst) to synchronize data to smp before issuing ipi
  2022-02-20 15:04   ` Russell King (Oracle)
@ 2022-02-20 15:21     ` Marc Zyngier
  -1 siblings, 0 replies; 22+ messages in thread
From: Marc Zyngier @ 2022-02-20 15:21 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Barry Song, tglx, will, linux-kernel, linux-arm-kernel, linuxarm,
	Barry Song

On 2022-02-20 15:04, Russell King (Oracle) wrote:
> On Sat, Feb 19, 2022 at 05:55:49AM +0800, Barry Song wrote:
>> dsb(ishst) should be enough here as we only need to guarantee the
>> visibility of data to other CPUs in smp inner domain before we
>> send the ipi.
>> 
>> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
>> ---
>>  drivers/irqchip/irq-gic-v3.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/drivers/irqchip/irq-gic-v3.c 
>> b/drivers/irqchip/irq-gic-v3.c
>> index 5e935d97207d..0efe1a9a9f3b 100644
>> --- a/drivers/irqchip/irq-gic-v3.c
>> +++ b/drivers/irqchip/irq-gic-v3.c
>> @@ -1211,7 +1211,7 @@ static void gic_ipi_send_mask(struct irq_data 
>> *d, const struct cpumask *mask)
>>  	 * Ensure that stores to Normal memory are visible to the
>>  	 * other CPUs before issuing the IPI.
>>  	 */
>> -	wmb();
>> +	dsb(ishst);
> 
> On ARM, wmb() is a dsb(st) followed by other operations which may
> include a sync operation at the L2 cache, and SoC specific barriers
> for the bus. Hopefully, nothing will break if the sledge hammer is
> replaced by the tack hammer.

The saving grace is that ARMv8 forbids (as per D4.4.11) these
SW-visible, non-architected caches (something like a PL310 simply isn't
allowed). Given that GICv3 requires ARMv8 in the first place, we should
be OK.

As for SoC-specific bus barriers, I don't know of any that would affect
an ARMv8 based SoC. But I'm always prepared to be badly surprised...

         M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] irqchip/gic-v3: use dsb(ishst) to synchronize data to smp before issuing ipi
  2022-02-20 15:05       ` Russell King (Oracle)
@ 2022-02-20 20:09         ` Barry Song
  -1 siblings, 0 replies; 22+ messages in thread
From: Barry Song @ 2022-02-20 20:09 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Ard Biesheuvel, Marc Zyngier, Thomas Gleixner, Will Deacon,
	Linux Kernel Mailing List, Linux ARM, Linuxarm, Barry Song

On Mon, Feb 21, 2022 at 4:05 AM Russell King (Oracle)
<linux@armlinux.org.uk> wrote:
>
> On Sun, Feb 20, 2022 at 02:30:24PM +0100, Ard Biesheuvel wrote:
> > On Sat, 19 Feb 2022 at 10:57, Marc Zyngier <maz@kernel.org> wrote:
> > >
> > > On 2022-02-18 21:55, Barry Song wrote:
> > > > dsb(ishst) should be enough here as we only need to guarantee the
> > > > visibility of data to other CPUs in smp inner domain before we
> > > > send the ipi.
> > > >
> > > > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> > > > ---
> > > >  drivers/irqchip/irq-gic-v3.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/irqchip/irq-gic-v3.c
> > > > b/drivers/irqchip/irq-gic-v3.c
> > > > index 5e935d97207d..0efe1a9a9f3b 100644
> > > > --- a/drivers/irqchip/irq-gic-v3.c
> > > > +++ b/drivers/irqchip/irq-gic-v3.c
> > > > @@ -1211,7 +1211,7 @@ static void gic_ipi_send_mask(struct irq_data
> > > > *d, const struct cpumask *mask)
> > > >        * Ensure that stores to Normal memory are visible to the
> > > >        * other CPUs before issuing the IPI.
> > > >        */
> > > > -     wmb();
> > > > +     dsb(ishst);
> > > >
> > > >       for_each_cpu(cpu, mask) {
> > > >               u64 cluster_id = MPIDR_TO_SGI_CLUSTER_ID(cpu_logical_map(cpu));
> > >
> > > I'm not opposed to that change, but I'm pretty curious whether this
> > > makes
> > > any visible difference in practice. Could you measure the effect of this
> > > change
> > > for any sort of IPI heavy workload?
> > >
> >
> > Does this have to be a DSB ?
>
> Are you suggesting that smp_wmb() may suffice (which is a dmb(ishst)) ?

Usually smp_wmb() is OK; for example, drivers/irqchip/irq-bcm2836.c:

static void bcm2836_arm_irqchip_ipi_send_mask(struct irq_data *d,
                                              const struct cpumask *mask)
{
        int cpu;
        void __iomem *mailbox0_base = intc.base + LOCAL_MAILBOX0_SET0;

        /*
         * Ensure that stores to normal memory are visible to the
         * other CPUs before issuing the IPI.
         */
        smp_wmb();

        for_each_cpu(cpu, mask)
                writel_relaxed(BIT(d->hwirq), mailbox0_base + 16 * cpu);
}

but the instruction that sends the IPI in irq-gic-v3.c isn't a store,
so this driver was changed by the following commit from smp_wmb()
(dmb(ishst)) to wmb() (dsb(st)):

commit 21ec30c0ef5234fb1039cc7c7737d885bf875a9e
Author: Shanker Donthineni <shankerd@codeaurora.org>

    irqchip/gic-v3: Use wmb() instead of smb_wmb() in gic_raise_softirq()

    A DMB instruction can be used to ensure the relative order of only
    memory accesses before and after the barrier. Since writes to system
    registers are not memory operations, barrier DMB is not sufficient
    for observability of memory accesses that occur before ICC_SGI1R_EL1
    writes.

    A DSB instruction ensures that no instructions that appear in program
    order after the DSB instruction, can execute until the DSB instruction
    has completed.

    ...

diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index d71be9a1f9d2..d99cc07903ec 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -688,7 +688,7 @@ static void gic_raise_softirq(const struct cpumask
*mask, unsigned int irq)
         * Ensure that stores to Normal memory are visible to the
         * other CPUs before issuing the IPI.
         */
-       smp_wmb();
+       wmb();

        for_each_cpu(cpu, mask) {
                u64 cluster_id = MPIDR_TO_SGI_CLUSTER_ID(cpu_logical_map(cpu));

My understanding is that dsb(st) is too strong: in this case we only
need the data to be observed by the other CPUs in the inner shareable
(SMP) domain, just as smp_wmb() (dmb(ishst)) achieves in other drivers;
we are not requiring the data to be observed by every observer in the
system.

>
> --
> RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
> FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

Thanks
Barry

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] irqchip/gic-v3: use dsb(ishst) to synchronize data to smp before issuing ipi
  2022-02-20 15:21     ` Marc Zyngier
@ 2022-02-20 20:20       ` Barry Song
  -1 siblings, 0 replies; 22+ messages in thread
From: Barry Song @ 2022-02-20 20:20 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Russell King (Oracle),
	Thomas Gleixner, Will Deacon, LKML, LAK, Linuxarm, Barry Song

On Mon, Feb 21, 2022 at 4:21 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On 2022-02-20 15:04, Russell King (Oracle) wrote:
> > On Sat, Feb 19, 2022 at 05:55:49AM +0800, Barry Song wrote:
> >> dsb(ishst) should be enough here as we only need to guarantee the
> >> visibility of data to other CPUs in smp inner domain before we
> >> send the ipi.
> >>
> >> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> >> ---
> >>  drivers/irqchip/irq-gic-v3.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/irqchip/irq-gic-v3.c
> >> b/drivers/irqchip/irq-gic-v3.c
> >> index 5e935d97207d..0efe1a9a9f3b 100644
> >> --- a/drivers/irqchip/irq-gic-v3.c
> >> +++ b/drivers/irqchip/irq-gic-v3.c
> >> @@ -1211,7 +1211,7 @@ static void gic_ipi_send_mask(struct irq_data
> >> *d, const struct cpumask *mask)
> >>       * Ensure that stores to Normal memory are visible to the
> >>       * other CPUs before issuing the IPI.
> >>       */
> >> -    wmb();
> >> +    dsb(ishst);
> >
> > On ARM, wmb() is a dsb(st) followed by other operations which may
> > include a sync operation at the L2 cache, and SoC specific barriers
> > for the bus. Hopefully, nothing will break if the sledge hammer is
> > replaced by the tack hammer.
>
> The saving grace is that ARMv8 forbids (as per D4.4.11) these
> SW-visible,
> non architected caches (something like PL310 simply isn't allowed).
> Given
> that GICv3 requires ARMv8 the first place, we should be OK.
>
> As for SoC-specific bus barriers, I don't know of any that would affect
> an ARMv8 based SoC. But I'm always prepared to be badly surprised...

I assume we have been safe, since dsb(ish)/dsb(ishst) have been widely
used for SMP data visibility in arm64:
arch/arm64/include/asm/assembler.h: dsb ish
arch/arm64/include/asm/cacheflush.h: dsb(ish);
arch/arm64/include/asm/pgtable.h: dsb(ishst);
arch/arm64/include/asm/pgtable.h: dsb(ishst);
arch/arm64/include/asm/pgtable.h: dsb(ishst);
arch/arm64/include/asm/pgtable.h: dsb(ishst);
arch/arm64/include/asm/pgtable.h: * is doing a dsb(ishst), but that
penalises the fastpath.
arch/arm64/include/asm/smp.h: dsb(ishst);
arch/arm64/include/asm/tlbflush.h:        "dsb ish\n tlbi " #op,        \
arch/arm64/include/asm/tlbflush.h:        "dsb ish\n tlbi " #op ", %0",     \
arch/arm64/include/asm/tlbflush.h: dsb(ishst);
arch/arm64/include/asm/tlbflush.h: dsb(ish);
arch/arm64/include/asm/tlbflush.h: dsb(ishst);
arch/arm64/include/asm/tlbflush.h: dsb(ish);
arch/arm64/include/asm/tlbflush.h: dsb(ishst);
arch/arm64/include/asm/tlbflush.h: dsb(ish);
arch/arm64/include/asm/tlbflush.h: dsb(ishst);
arch/arm64/include/asm/tlbflush.h: dsb(ish);
arch/arm64/include/asm/tlbflush.h: dsb(ishst);
arch/arm64/include/asm/tlbflush.h: dsb(ish);
arch/arm64/include/asm/tlbflush.h: dsb(ishst);
arch/arm64/include/asm/tlbflush.h: dsb(ish);
arch/arm64/kernel/alternative.c: dsb(ish);
arch/arm64/kernel/entry.S: dsb ish
arch/arm64/kernel/head.S: dsb ishst // Make zero page visible to PTW
arch/arm64/kernel/hibernate-asm.S: dsb ish /* wait for PoU cleaning to finish */
arch/arm64/kernel/hibernate-asm.S: dsb ish
arch/arm64/kernel/mte.c: dsb(ish);
arch/arm64/kernel/process.c: dsb(ish);
arch/arm64/kernel/sys_compat.c: dsb(ish);
arch/arm64/kvm/hyp/nvhe/tlb.c: dsb(ishst);
arch/arm64/kvm/hyp/nvhe/tlb.c: dsb(ish);
arch/arm64/kvm/hyp/nvhe/tlb.c: dsb(ish);
arch/arm64/kvm/hyp/nvhe/tlb.c: dsb(ishst);
arch/arm64/kvm/hyp/nvhe/tlb.c: dsb(ish);
arch/arm64/kvm/hyp/nvhe/tlb.c: dsb(ishst);
arch/arm64/kvm/hyp/nvhe/tlb.c: dsb(ish);
arch/arm64/kvm/hyp/pgtable.c: dsb(ishst);
arch/arm64/kvm/hyp/pgtable.c: dsb(ishst);
arch/arm64/kvm/hyp/pgtable.c: dsb(ishst);
arch/arm64/kvm/hyp/pgtable.c: dsb(ish);
arch/arm64/kvm/hyp/pgtable.c: dsb(ishst);
arch/arm64/kvm/hyp/pgtable.c: dsb(ishst);
arch/arm64/kvm/hyp/pgtable.c: dsb(ishst);
arch/arm64/kvm/hyp/vhe/tlb.c: dsb(ishst);
arch/arm64/kvm/hyp/vhe/tlb.c: dsb(ish);
arch/arm64/kvm/hyp/vhe/tlb.c: dsb(ish);
arch/arm64/kvm/hyp/vhe/tlb.c: dsb(ishst);
arch/arm64/kvm/hyp/vhe/tlb.c: dsb(ish);
arch/arm64/kvm/hyp/vhe/tlb.c: dsb(ishst);
arch/arm64/kvm/hyp/vhe/tlb.c: dsb(ish);
arch/arm64/mm/cache.S: dsb     ishst
arch/arm64/mm/cache.S: dsb ishst
arch/arm64/mm/kasan_init.c: dsb(ishst);
arch/arm64/mm/mmu.c: * We need dsb(ishst) here to ensure the
page-table-walker sees
arch/arm64/mm/mmu.c: dsb(ishst);
arch/arm64/mm/proc.S: dsb ish
drivers/irqchip/irq-gic-v3-its.c: dsb(ishst);
drivers/irqchip/irq-gic-v3-its.c: dsb(ishst);


Plus, is it even a problem on arm, given that arm only applies the
SoC-level barrier to the system-level memory barriers, not to the
SMP-level barriers?

#if defined(CONFIG_ARM_DMA_MEM_BUFFERABLE) || defined(CONFIG_SMP)
#define mb()            __arm_heavy_mb()
#define rmb()           dsb()
#define wmb()           __arm_heavy_mb(st)
#define dma_rmb()       dmb(osh)
#define dma_wmb()       dmb(oshst)
#else
#define mb()            barrier()
#define rmb()           barrier()
#define wmb()           barrier()
#define dma_rmb()       barrier()
#define dma_wmb()       barrier()
#endif

#define __smp_mb()      dmb(ish)
#define __smp_rmb()     __smp_mb()
#define __smp_wmb()     dmb(ishst)

I am not seeing arm requires soc-level mb for smp-level barrier though
the mandatory
barriers are using heavy_mb which has soc-level mb.

In this particular case, we are asking the data visibility for
smp-level not system-level. I am
not quite sure Russell's concern is relevant.

>
>          M.
> --
> Jazz is not dead. It just smells funny...

Thanks
Barry

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] irqchip/gic-v3: use dsb(ishst) to synchronize data to smp before issuing ipi
@ 2022-02-20 20:20       ` Barry Song
  0 siblings, 0 replies; 22+ messages in thread
From: Barry Song @ 2022-02-20 20:20 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Russell King (Oracle),
	Thomas Gleixner, Will Deacon, LKML, LAK, Linuxarm, Barry Song

On Mon, Feb 21, 2022 at 4:21 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On 2022-02-20 15:04, Russell King (Oracle) wrote:
> > On Sat, Feb 19, 2022 at 05:55:49AM +0800, Barry Song wrote:
> >> dsb(ishst) should be enough here as we only need to guarantee the
> >> visibility of data to other CPUs in smp inner domain before we
> >> send the ipi.
> >>
> >> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> >> ---
> >>  drivers/irqchip/irq-gic-v3.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/irqchip/irq-gic-v3.c
> >> b/drivers/irqchip/irq-gic-v3.c
> >> index 5e935d97207d..0efe1a9a9f3b 100644
> >> --- a/drivers/irqchip/irq-gic-v3.c
> >> +++ b/drivers/irqchip/irq-gic-v3.c
> >> @@ -1211,7 +1211,7 @@ static void gic_ipi_send_mask(struct irq_data
> >> *d, const struct cpumask *mask)
> >>       * Ensure that stores to Normal memory are visible to the
> >>       * other CPUs before issuing the IPI.
> >>       */
> >> -    wmb();
> >> +    dsb(ishst);
> >
> > On ARM, wmb() is a dsb(st) followed by other operations which may
> > include a sync operation at the L2 cache, and SoC specific barriers
> > for the bus. Hopefully, nothing will break if the sledge hammer is
> > replaced by the tack hammer.
>
> The saving grace is that ARMv8 forbids (as per D4.4.11) these
> SW-visible,
> non architected caches (something like PL310 simply isn't allowed).
> Given
> that GICv3 requires ARMv8 the first place, we should be OK.
>
> As for SoC-specific bus barriers, I don't know of any that would affect
> an ARMv8 based SoC. But I'm always prepared to be badly surprised...

I assume we have been safe, since dsb(ish) has been widely used for
SMP data visibility in arm64:
arch/arm64/include/asm/assembler.h: dsb ish
arch/arm64/include/asm/cacheflush.h: dsb(ish);
arch/arm64/include/asm/pgtable.h: dsb(ishst);
arch/arm64/include/asm/pgtable.h: dsb(ishst);
arch/arm64/include/asm/pgtable.h: dsb(ishst);
arch/arm64/include/asm/pgtable.h: dsb(ishst);
arch/arm64/include/asm/pgtable.h: * is doing a dsb(ishst), but that
penalises the fastpath.
arch/arm64/include/asm/smp.h: dsb(ishst);
arch/arm64/include/asm/tlbflush.h:        "dsb ish\n tlbi " #op,        \
arch/arm64/include/asm/tlbflush.h:        "dsb ish\n tlbi " #op ", %0",     \
arch/arm64/include/asm/tlbflush.h: dsb(ishst);
arch/arm64/include/asm/tlbflush.h: dsb(ish);
arch/arm64/include/asm/tlbflush.h: dsb(ishst);
arch/arm64/include/asm/tlbflush.h: dsb(ish);
arch/arm64/include/asm/tlbflush.h: dsb(ishst);
arch/arm64/include/asm/tlbflush.h: dsb(ish);
arch/arm64/include/asm/tlbflush.h: dsb(ishst);
arch/arm64/include/asm/tlbflush.h: dsb(ish);
arch/arm64/include/asm/tlbflush.h: dsb(ishst);
arch/arm64/include/asm/tlbflush.h: dsb(ish);
arch/arm64/include/asm/tlbflush.h: dsb(ishst);
arch/arm64/include/asm/tlbflush.h: dsb(ish);
arch/arm64/kernel/alternative.c: dsb(ish);
arch/arm64/kernel/entry.S: dsb ish
arch/arm64/kernel/head.S: dsb ishst // Make zero page visible to PTW
arch/arm64/kernel/hibernate-asm.S: dsb ish /* wait for PoU cleaning to finish */
arch/arm64/kernel/hibernate-asm.S: dsb ish
arch/arm64/kernel/mte.c: dsb(ish);
arch/arm64/kernel/process.c: dsb(ish);
arch/arm64/kernel/sys_compat.c: dsb(ish);
arch/arm64/kvm/hyp/nvhe/tlb.c: dsb(ishst);
arch/arm64/kvm/hyp/nvhe/tlb.c: dsb(ish);
arch/arm64/kvm/hyp/nvhe/tlb.c: dsb(ish);
arch/arm64/kvm/hyp/nvhe/tlb.c: dsb(ishst);
arch/arm64/kvm/hyp/nvhe/tlb.c: dsb(ish);
arch/arm64/kvm/hyp/nvhe/tlb.c: dsb(ishst);
arch/arm64/kvm/hyp/nvhe/tlb.c: dsb(ish);
arch/arm64/kvm/hyp/pgtable.c: dsb(ishst);
arch/arm64/kvm/hyp/pgtable.c: dsb(ishst);
arch/arm64/kvm/hyp/pgtable.c: dsb(ishst);
arch/arm64/kvm/hyp/pgtable.c: dsb(ish);
arch/arm64/kvm/hyp/pgtable.c: dsb(ishst);
arch/arm64/kvm/hyp/pgtable.c: dsb(ishst);
arch/arm64/kvm/hyp/pgtable.c: dsb(ishst);
arch/arm64/kvm/hyp/vhe/tlb.c: dsb(ishst);
arch/arm64/kvm/hyp/vhe/tlb.c: dsb(ish);
arch/arm64/kvm/hyp/vhe/tlb.c: dsb(ish);
arch/arm64/kvm/hyp/vhe/tlb.c: dsb(ishst);
arch/arm64/kvm/hyp/vhe/tlb.c: dsb(ish);
arch/arm64/kvm/hyp/vhe/tlb.c: dsb(ishst);
arch/arm64/kvm/hyp/vhe/tlb.c: dsb(ish);
arch/arm64/mm/cache.S: dsb     ishst
arch/arm64/mm/cache.S: dsb ishst
arch/arm64/mm/kasan_init.c: dsb(ishst);
arch/arm64/mm/mmu.c: * We need dsb(ishst) here to ensure the page-table-walker sees
arch/arm64/mm/mmu.c: dsb(ishst);
arch/arm64/mm/proc.S: dsb ish
drivers/irqchip/irq-gic-v3-its.c: dsb(ishst);
drivers/irqchip/irq-gic-v3-its.c: dsb(ishst);


Plus, is it even a problem on arm, given that arm only requires a
SoC-level barrier for system-level memory barriers, not for SMP-level
barriers?

#if defined(CONFIG_ARM_DMA_MEM_BUFFERABLE) || defined(CONFIG_SMP)
#define mb()            __arm_heavy_mb()
#define rmb()           dsb()
#define wmb()           __arm_heavy_mb(st)
#define dma_rmb()       dmb(osh)
#define dma_wmb()       dmb(oshst)
#else
#define mb()            barrier()
#define rmb()           barrier()
#define wmb()           barrier()
#define dma_rmb()       barrier()
#define dma_wmb()       barrier()
#endif

#define __smp_mb()      dmb(ish)
#define __smp_rmb()     __smp_mb()
#define __smp_wmb()     dmb(ishst)
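For comparison, arm64's counterparts follow the same split between
mandatory and SMP barriers. The fragment below is paraphrased from
memory of arch/arm64/include/asm/barrier.h around this kernel version
and may not match the tree exactly: the mandatory barriers are
full-system DSBs, while the SMP barriers are inner-shareable DMBs.

```
/* arm64 (paraphrased): mandatory barriers are system-wide DSBs,
 * SMP barriers are inner-shareable DMBs. */
#define mb()		dsb(sy)
#define rmb()		dsb(ld)
#define wmb()		dsb(st)

#define __smp_mb()	dmb(ish)
#define __smp_rmb()	dmb(ishld)
#define __smp_wmb()	dmb(ishst)
```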

I am not seeing that arm requires a SoC-level mb for SMP-level
barriers, even though the mandatory barriers use heavy_mb, which
includes the SoC-level mb.

In this particular case, we are asking for data visibility at the SMP
level, not the system level. I am not quite sure Russell's concern
applies here.

>
>          M.
> --
> Jazz is not dead. It just smells funny...

Thanks
Barry

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2022-02-20 20:22 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-18 21:55 [PATCH] irqchip/gic-v3: use dsb(ishst) to synchronize data to smp before issuing ipi Barry Song
2022-02-18 21:55 ` Barry Song
2022-02-19  9:56 ` Marc Zyngier
2022-02-19  9:56   ` Marc Zyngier
2022-02-19 23:46   ` Barry Song
2022-02-19 23:46     ` Barry Song
2022-02-20  1:33     ` Barry Song
2022-02-20  1:33       ` Barry Song
2022-02-20 13:30   ` Ard Biesheuvel
2022-02-20 13:30     ` Ard Biesheuvel
2022-02-20 15:04     ` Marc Zyngier
2022-02-20 15:04       ` Marc Zyngier
2022-02-20 15:05     ` Russell King (Oracle)
2022-02-20 15:05       ` Russell King (Oracle)
2022-02-20 20:09       ` Barry Song
2022-02-20 20:09         ` Barry Song
2022-02-20 15:04 ` Russell King (Oracle)
2022-02-20 15:04   ` Russell King (Oracle)
2022-02-20 15:21   ` Marc Zyngier
2022-02-20 15:21     ` Marc Zyngier
2022-02-20 20:20     ` Barry Song
2022-02-20 20:20       ` Barry Song
