* RE: [PATCH] irqchip/gic-v4.1: Optimize the delay time of the poll on the GICR_VPENDBASER.Dirty bit
@ 2020-09-15 14:04 lushenming
  2020-09-15 14:47 ` Marc Zyngier
  0 siblings, 1 reply; 5+ messages in thread
From: lushenming @ 2020-09-15 14:04 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Thomas Gleixner, Jason Cooper, linux-kernel, Wanghaibin (D), yuzenghui

Thanks for your quick response.

Okay, I agree that busy-waiting may add more overhead at the RD level. But I think the delay time can be adjusted. In our latest hardware implementation, we have optimized the search of the VPT; now even a VPT full of interrupts (56k) can be parsed within 2 microseconds. It is true that parse speeds differ across hardware implementations, but does a flat 10-microsecond wait completely mask the optimization on fast hardware? Maybe we can set the delay time smaller, like 1 microsecond?

In addition, the 10 microseconds seems to come from data our team reported earlier, which was the parse time in the worst case.

-----Original Message-----
From: Marc Zyngier [mailto:maz@kernel.org] 
Sent: 2020-09-15 15:41
To: lushenming <lushenming@huawei.com>
Cc: Thomas Gleixner <tglx@linutronix.de>; Jason Cooper <jason@lakedaemon.net>; linux-kernel@vger.kernel.org; Wanghaibin (D) <wanghaibin.wang@huawei.com>
Subject: Re: [PATCH] irqchip/gic-v4.1: Optimize the delay time of the poll on the GICR_VPENDBASER.Dirty bit

On 2020-09-15 08:22, Shenming Lu wrote:
> Every time the vPE is scheduled, the code starts polling the 
> GICR_VPENDBASER.Dirty bit until it becomes 0. It's OK. But the 
> delay_us of the poll function is directly set to 10, which is a long 
> time for most interrupts. In our measurement, it takes only 1~2 
> microseconds, or even hundreds of nanoseconds, to finish parsing the 
> VPT in most cases. However, in the current implementation, if the 
> GICR_VPENDBASER.Dirty bit is not 0 immediately after the vPE is 
> scheduled, it will directly wait for 10 microseconds, resulting in 
> meaningless waiting.
> 
> In order to avoid meaningless waiting, we can set the delay_us to 0, 
> which can exit the poll function immediately when the Dirty bit 
> becomes 0.

We clearly have a difference in interpretation of the word "meaningless".

With this, you are busy-waiting on the register, adding even more overhead at the RD level. How is that better? The whole point is to back off and let the RD do its stuff in the background. This is also based on a massive sample of *one* implementation. How is that representative?

It would be a lot more convincing if you measured the difference it makes on the total scheduling latency of a vcpu. Assuming it makes
*any* observable difference.

Thanks,

         M.

> 
> Signed-off-by: Shenming Lu <lushenming@huawei.com>
> ---
>  drivers/irqchip/irq-gic-v3-its.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/irqchip/irq-gic-v3-its.c
> b/drivers/irqchip/irq-gic-v3-its.c
> index 548de7538632..5cfcf0c2ce1a 100644
> --- a/drivers/irqchip/irq-gic-v3-its.c
> +++ b/drivers/irqchip/irq-gic-v3-its.c
> @@ -3803,7 +3803,7 @@ static void its_wait_vpt_parse_complete(void)
>  	WARN_ON_ONCE(readq_relaxed_poll_timeout_atomic(vlpi_base + 
> GICR_VPENDBASER,
>  						       val,
>  						       !(val & GICR_VPENDBASER_Dirty),
> -						       10, 500));
> +						       0, 500));
>  }
> 
>  static void its_vpe_schedule(struct its_vpe *vpe)
--
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] irqchip/gic-v4.1: Optimize the delay time of the poll on the GICR_VPENDBASER.Dirty bit
  2020-09-15 14:04 [PATCH] irqchip/gic-v4.1: Optimize the delay time of the poll on the GICR_VPENDBASER.Dirty bit lushenming
@ 2020-09-15 14:47 ` Marc Zyngier
  2020-09-16  7:04   ` lushenming
  0 siblings, 1 reply; 5+ messages in thread
From: Marc Zyngier @ 2020-09-15 14:47 UTC (permalink / raw)
  To: lushenming
  Cc: Thomas Gleixner, Jason Cooper, linux-kernel, Wanghaibin (D), yuzenghui

On 2020-09-15 15:04, lushenming wrote:
> Thanks for your quick response.
> 
> Okay, I agree that busy-waiting may add more overhead at the RD level.
> But I think that the delay time can be adjusted. In our latest
> hardware implementation, we optimize the search of the VPT, now even
> the VPT full of interrupts (56k) can be parsed within 2 microseconds.

It's not so much when the VPT is full that it is bad. It is when
the pending interrupts are not cached and you don't know *where*
to look for them in the VPT.

> It is true that the parse speeds of various hardware are different,
> but does directly waiting for 10 microseconds make the optimization of
> those fast hardware be completely masked? Maybe we can set the delay
> time smaller, like 1 microseconds?

That certainly would be more acceptable. But I still question the impact
of such a change compared to the cost of a vcpu entry. I suggest you
come up with measurements that actually show that polling this register
more often significantly reduces the entry latency. Only then can we
make an educated decision.

Thanks,

         M.
-- 
Jazz is not dead. It just smells funny...


* RE: [PATCH] irqchip/gic-v4.1: Optimize the delay time of the poll on the GICR_VPENDBASER.Dirty bit
  2020-09-15 14:47 ` Marc Zyngier
@ 2020-09-16  7:04   ` lushenming
  2020-09-16  8:39     ` Marc Zyngier
  0 siblings, 1 reply; 5+ messages in thread
From: lushenming @ 2020-09-16  7:04 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Thomas Gleixner, Jason Cooper, linux-kernel, Wanghaibin (D), yuzenghui

Hi,

Our team just discussed this issue again and consulted our GIC hardware 
design team. They think the RD can afford busy waiting. So we still think 
maybe 0 is better, at least for our hardware.

In addition, if not 0: as I said before, in our measurement it takes only 
hundreds of nanoseconds, or 1~2 microseconds, to finish parsing the VPT 
in most cases. So maybe 1 microsecond, or smaller, is more appropriate. 
Anyway, 10 microseconds is too much.

But it has to be said that it does depend on the hardware implementation.

Besides, I'm not sure where the start and end points of the total scheduling 
latency of a vcpu you mentioned are; it includes many events. Is the parse 
time of the VPT alone not clear enough?

-----Original Message-----
From: Marc Zyngier [mailto:maz@kernel.org] 
Sent: 2020-09-15 22:48
To: lushenming <lushenming@huawei.com>
Cc: Thomas Gleixner <tglx@linutronix.de>; Jason Cooper <jason@lakedaemon.net>; linux-kernel@vger.kernel.org; Wanghaibin (D) <wanghaibin.wang@huawei.com>; yuzenghui <yuzenghui@huawei.com>
Subject: Re: [PATCH] irqchip/gic-v4.1: Optimize the delay time of the poll on the GICR_VPENDBASER.Dirty bit

On 2020-09-15 15:04, lushenming wrote:
> Thanks for your quick response.
> 
> Okay, I agree that busy-waiting may add more overhead at the RD level.
> But I think that the delay time can be adjusted. In our latest 
> hardware implementation, we optimize the search of the VPT, now even 
> the VPT full of interrupts (56k) can be parsed within 2 microseconds.

It's not so much when the VPT is full that it is bad. It is when the pending interrupts are not cached and you don't know *where* to look for them in the VPT.

> It is true that the parse speeds of various hardware are different, 
> but does directly waiting for 10 microseconds make the optimization of 
> those fast hardware be completely masked? Maybe we can set the delay 
> time smaller, like 1 microseconds?

That certainly would be more acceptable. But I still question the impact of such a change compared to the cost of a vcpu entry. I suggest you come up with measurements that actually show that polling this register more often significantly reduces the entry latency. Only then can we make an educated decision.

Thanks,

         M.
--
Jazz is not dead. It just smells funny...


* Re: [PATCH] irqchip/gic-v4.1: Optimize the delay time of the poll on the GICR_VPENDBASER.Dirty bit
  2020-09-16  7:04   ` lushenming
@ 2020-09-16  8:39     ` Marc Zyngier
  2020-09-18 12:18       ` lushenming
  0 siblings, 1 reply; 5+ messages in thread
From: Marc Zyngier @ 2020-09-16  8:39 UTC (permalink / raw)
  To: lushenming
  Cc: Thomas Gleixner, Jason Cooper, linux-kernel, Wanghaibin (D), yuzenghui

On 2020-09-16 08:04, lushenming wrote:
> Hi,
> 
> Our team just discussed this issue again and consulted our GIC hardware
> design team. They think the RD can afford busy waiting. So we still 
> think
> maybe 0 is better, at least for our hardware.
> 
> In addition, if not 0, as I said before, in our measurement, it takes 
> only
> hundreds of nanoseconds, or 1~2 microseconds, to finish parsing the VPT
> in most cases. So maybe 1 microseconds, or smaller, is more 
> appropriate.
> Anyway, 10 microseconds is too much.
> 
> But it has to be said that it does depend on the hardware 
> implementation.

Exactly. And given that the only publicly available implementation is
a software model, I am reluctant to change "performance" related things
based on benchmarks that can't be verified and what appears to me to be
a micro-optimization.

> Besides, I'm not sure where are the start and end point of the total 
> scheduling
> latency of a vcpu you said, which includes many events. Is the parse 
> time of
> the VPT not clear enough?

Measure the time it takes from kvm_vcpu_load() to the point where the
vcpu enters the guest. How much, in proportion, do these 1/2/10µs represent?

Also, a better(?) course of action would maybe be to consider whether we
should split the its_vpe_schedule() call into two distinct operations:
one that programs the VPE to be resident, and another that polls the
Dirty bit *much later* on the entry path, giving the GIC a chance to
work in parallel with the CPU on the entry path.

If your HW is as quick as you say it is, it would pretty much guarantee
a clear read of GICR_VPENDBASER without waiting.

         M.
-- 
Jazz is not dead. It just smells funny...


* RE: [PATCH] irqchip/gic-v4.1: Optimize the delay time of the poll on the GICR_VPENDBASER.Dirty bit
  2020-09-16  8:39     ` Marc Zyngier
@ 2020-09-18 12:18       ` lushenming
  0 siblings, 0 replies; 5+ messages in thread
From: lushenming @ 2020-09-18 12:18 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Thomas Gleixner, Jason Cooper, linux-kernel, Wanghaibin (D), yuzenghui

Hi, Marc,

I measured the time from vcpu_load() (inclusive) to __guest_enter() on Kunpeng 920. On average, it takes 2.55 microseconds (not the first run, and with the VPT empty). So waiting for 10 microseconds in vcpu scheduling really hurts performance.

And I agree that delaying the execution of its_wait_vpt_parse_complete() might be a viable solution.

-----Original Message-----
From: Marc Zyngier [mailto:maz@kernel.org] 
Sent: 2020-09-16 16:40
To: lushenming <lushenming@huawei.com>
Cc: Thomas Gleixner <tglx@linutronix.de>; Jason Cooper <jason@lakedaemon.net>; linux-kernel@vger.kernel.org; Wanghaibin (D) <wanghaibin.wang@huawei.com>; yuzenghui <yuzenghui@huawei.com>
Subject: Re: [PATCH] irqchip/gic-v4.1: Optimize the delay time of the poll on the GICR_VPENDBASER.Dirty bit

On 2020-09-16 08:04, lushenming wrote:
> Hi,
> 
> Our team just discussed this issue again and consulted our GIC 
> hardware design team. They think the RD can afford busy waiting. So we 
> still think maybe 0 is better, at least for our hardware.
> 
> In addition, if not 0, as I said before, in our measurement, it takes 
> only hundreds of nanoseconds, or 1~2 microseconds, to finish parsing 
> the VPT in most cases. So maybe 1 microseconds, or smaller, is more 
> appropriate.
> Anyway, 10 microseconds is too much.
> 
> But it has to be said that it does depend on the hardware 
> implementation.

Exactly. And given that the only publicly available implementation is a software model, I am reluctant to change "performance" related things based on benchmarks that can't be verified and what appears to me to be a micro-optimization.

> Besides, I'm not sure where are the start and end point of the total 
> scheduling latency of a vcpu you said, which includes many events. Is 
> the parse time of the VPT not clear enough?

Measure the time it takes from kvm_vcpu_load() to the point where the vcpu enters the guest. How much, in proportion, do these 1/2/10µs represent?

Also, a better(?) course of action would maybe be to consider whether we should split the its_vpe_schedule() call into two distinct operations: one that programs the VPE to be resident, and another that polls the Dirty bit *much later* on the entry path, giving the GIC a chance to work in parallel with the CPU on the entry path.

If your HW is as quick as you say it is, it would pretty much guarantee a clear read of GICR_VPENDBASER without waiting.

         M.
--
Jazz is not dead. It just smells funny...


Thread overview:
2020-09-15 14:04 [PATCH] irqchip/gic-v4.1: Optimize the delay time of the poll on the GICR_VPENDBASER.Dirty bit lushenming
2020-09-15 14:47 ` Marc Zyngier
2020-09-16  7:04   ` lushenming
2020-09-16  8:39     ` Marc Zyngier
2020-09-18 12:18       ` lushenming
