From: Marc Zyngier <marc.zyngier@arm.com>
To: Andre Przywara <andre.przywara@arm.com>
Cc: linux-arm-kernel@lists.infradead.org,
	kvmarm@lists.cs.columbia.edu, kvm@vger.kernel.org, "Raslan,
	KarimAllah" <karahmed@amazon.de>,
	"Saidi, Ali" <alisaidi@amazon.com>
Subject: Re: [PATCH v2 0/9] KVM: arm/arm64: vgic: ITS translation cache
Date: Thu, 25 Jul 2019 09:50:18 +0100
Message-ID: <a757bac1-41d1-8ce5-9393-ac2e8a5e1114@arm.com>
In-Reply-To: <20190723121424.0b632efa@donnerap.cambridge.arm.com>

Hi Andre,

On 23/07/2019 12:14, Andre Przywara wrote:
> On Tue, 11 Jun 2019 18:03:27 +0100
> Marc Zyngier <marc.zyngier@arm.com> wrote:
> 
> Hi,
> 
>> It recently became apparent[1] that our LPI injection path is not as
>> efficient as it could be when injecting interrupts coming from a VFIO
>> assigned device.
>>
>> Although the proposed patch wasn't 100% correct, it outlined at least
>> two issues:
>>
>> (1) Injecting an LPI from VFIO always results in a context switch to a
>>     worker thread: no good
>>
>> (2) We have no way of amortising the cost of translating a DID+EID pair
>>     to an LPI number
>>
>> The reason for (1) is that we may sleep when translating an LPI, so we
>> do need a process context. A way to fix that is to implement a small
>> LPI translation cache that could be looked up from an atomic
>> context. It would also solve (2).
>>
>> This is what this small series proposes. It implements a very basic
>> LRU cache of pre-translated LPIs, which gets used to implement
>> kvm_arch_set_irq_inatomic. The size of the cache is currently
>> hard-coded at 16 times the number of vcpus, a number I have picked
>> under the influence of Ali Saidi. If that's not enough for you, blame
>> me, though.
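
For illustration, here is a toy user-space model of what such a cache boils
down to: a DID+EID pair is looked up without blocking, a hit is promoted to
the most-recently-used slot, and a miss falls back to a (here simulated)
slow translation whose result is then cached, evicting the least-recently-used
entry when full. All names, the 16-entries-per-vcpu sizing and the fake
translation below are made up for the sketch; this is not the actual patch
code.

/*
 * Toy model of an LRU DID+EID -> LPI translation cache. Everything here
 * is illustrative only and does not match the kernel's data structures.
 */
#include <stdio.h>
#include <string.h>

#define NR_VCPUS	4
#define CACHE_SIZE	(16 * NR_VCPUS)	/* 16 entries per vcpu, as above */

struct lpi_xlate {
	unsigned int devid;	/* DID: device ID presented to the ITS */
	unsigned int eventid;	/* EID: event ID for that device       */
	unsigned int lpi;	/* resulting LPI number                */
};

/* Kept in MRU order: entry 0 is the most recently used. */
static struct lpi_xlate cache[CACHE_SIZE];
static int nr_cached;

/* Fast path: no allocation, no blocking; returns -1 on a miss. */
static int cache_lookup(unsigned int devid, unsigned int eventid)
{
	for (int i = 0; i < nr_cached; i++) {
		if (cache[i].devid == devid && cache[i].eventid == eventid) {
			struct lpi_xlate hit = cache[i];

			/* Promote the hit to the MRU slot. */
			memmove(&cache[1], &cache[0], i * sizeof(cache[0]));
			cache[0] = hit;
			return hit.lpi;
		}
	}
	return -1;	/* miss: caller must take the slow path */
}

/* Slow-path stand-in: the real thing walks the ITS tables and may sleep. */
static unsigned int slow_translate(unsigned int devid, unsigned int eventid)
{
	return 8192 + devid * 32 + eventid;	/* made-up mapping */
}

/* Assumes the caller already saw a miss for this DID+EID pair. */
static void cache_insert(unsigned int devid, unsigned int eventid,
			 unsigned int lpi)
{
	if (nr_cached < CACHE_SIZE)
		nr_cached++;
	/* Shift everything towards the LRU end, dropping the tail if full. */
	memmove(&cache[1], &cache[0], (nr_cached - 1) * sizeof(cache[0]));
	cache[0] = (struct lpi_xlate){ .devid = devid, .eventid = eventid,
				       .lpi = lpi };
}

int main(void)
{
	unsigned int devid = 3, eventid = 7;
	int lpi = cache_lookup(devid, eventid);

	if (lpi < 0) {		/* first injection: miss, translate and cache */
		lpi = slow_translate(devid, eventid);
		cache_insert(devid, eventid, lpi);
	}
	printf("first lookup missed, cached LPI %d\n", lpi);
	printf("second lookup hits: LPI %d\n", cache_lookup(devid, eventid));
	return 0;
}
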
>>
>> Does it work? Well, it doesn't crash, and is thus perfect. More
>> seriously, I don't really have a way to benchmark it directly, so my
>> observations are only indirect:
>>
>> On a TX2 system, I run a 4 vcpu VM with an Ethernet interface passed
>> to it directly. From the host, I inject interrupts using debugfs. In
>> parallel, I look at the number of context switches and the number of
>> interrupts on the host. Without this series, I get the same number for
>> both IRQ and CS (about half a million of each per second is pretty
>> easy to reach). With this series, the number of context switches drops
>> to something pretty small (in the low 2k), while the number of
>> interrupts stays the same.
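
For anyone wanting to eyeball those two counters themselves: vmstat, or a
trivial watcher over /proc/stat (which exports running "intr" and "ctxt"
totals), is enough. The snippet below is just that, host-side and entirely
independent of this series; stop it with Ctrl-C.

/*
 * Print per-second deltas of the host's interrupt and context-switch
 * totals from /proc/stat. Illustrative tooling only.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Read the running "intr" and "ctxt" totals. */
static int read_totals(unsigned long long *intr, unsigned long long *ctxt)
{
	char line[4096];
	FILE *f = fopen("/proc/stat", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, "intr ", 5))
			sscanf(line + 5, "%llu", intr);
		else if (!strncmp(line, "ctxt ", 5))
			sscanf(line + 5, "%llu", ctxt);
	}
	fclose(f);
	return 0;
}

int main(void)
{
	unsigned long long intr0 = 0, ctxt0 = 0, intr1 = 0, ctxt1 = 0;

	if (read_totals(&intr0, &ctxt0))
		return 1;
	for (;;) {
		sleep(1);
		if (read_totals(&intr1, &ctxt1))
			return 1;
		printf("IRQ/s: %llu  CS/s: %llu\n",
		       intr1 - intr0, ctxt1 - ctxt0);
		intr0 = intr1;
		ctxt0 = ctxt1;
	}
}
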
>>
>> Yes, this is a pretty rubbish benchmark, what did you expect? ;-)
>>
>> So I'm putting this out for people with real workloads to try out and
>> report what they see.
> 
> So I gave that a shot with some benchmarks. As expected, it is quite hard
> to show an improvement with just one guest running, although we could show
> a 103%(!) improvement of the memcached QPS score in one experiment when
> running it in a guest with an external load generator.

Is that a fluke or something that you have been able to reproduce
consistently? Because doubling the performance of anything is something
I have a hard time believing in... ;-)

> Throwing more users into the game showed a significant improvement:
> 
> Benchmark 1: kernel compile/FIO: Compiling a kernel on the host, while
> letting a guest run FIO with 4K randreads from a passed-through NVMe SSD:
> The IOPS with this series improved by 27% compared to pure mainline,
> reaching 80% of the host value. Kernel compilation time improved by 8.5%
> compared to mainline.

OK, that's interesting. I guess that's the effect of not unnecessarily
disrupting the scheduling with one extra context-switch per interrupt.

> 
> Benchmark 2: FIO/FIO: Running FIO on a passed through SATA SSD in one
> guest, and FIO on a passed through NVMe SSD in another guest, at the same
> time:
> The IOPS with this series improved by 23% for the NVMe and 34% for the
> SATA disk, compared to pure mainline.

I guess that's the same thing. Not context-switching means more
resources available to other processes in the system.

> So judging from these results, I think this series is a significant
> improvement, which justifies merging it so that it can receive wider testing.
> 
> It would be good if others could also do performance experiments and post
> their results.

Wishful thinking...

Anyway, I'll repost the series shortly now that Eric has gone through it.

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...

Thread overview:
2019-06-11 17:03 [PATCH v2 0/9] KVM: arm/arm64: vgic: ITS translation cache Marc Zyngier
2019-06-11 17:03 ` [PATCH v2 1/9] KVM: arm/arm64: vgic: Add LPI translation cache definition Marc Zyngier
2019-06-12  8:16   ` Julien Thierry
2019-06-12  8:49     ` Julien Thierry
2019-06-12  9:52     ` Marc Zyngier
2019-06-12 10:58       ` Julien Thierry
2019-06-12 12:28         ` Julien Thierry
2019-07-23 12:43   ` Auger Eric
2019-06-11 17:03 ` [PATCH v2 2/9] KVM: arm/arm64: vgic: Add __vgic_put_lpi_locked primitive Marc Zyngier
2019-06-11 17:03 ` [PATCH v2 3/9] KVM: arm/arm64: vgic-its: Add MSI-LPI translation cache invalidation Marc Zyngier
2019-07-23 12:39   ` Auger Eric
2019-06-11 17:03 ` [PATCH v2 4/9] KVM: arm/arm64: vgic-its: Invalidate MSI-LPI translation cache on specific commands Marc Zyngier
2019-07-01 12:38   ` Auger Eric
2019-07-22 10:54     ` Marc Zyngier
2019-07-23 12:25       ` Auger Eric
2019-07-23 12:43         ` Marc Zyngier
2019-07-23 12:47           ` Auger Eric
2019-07-23 12:50             ` Marc Zyngier
2019-06-11 17:03 ` [PATCH v2 5/9] KVM: arm/arm64: vgic-its: Invalidate MSI-LPI translation cache on disabling LPIs Marc Zyngier
2019-07-23 15:09   ` Auger Eric
2019-06-11 17:03 ` [PATCH v2 6/9] KVM: arm/arm64: vgic-its: Invalidate MSI-LPI translation cache on vgic teardown Marc Zyngier
2019-07-23 15:10   ` Auger Eric
2019-06-11 17:03 ` [PATCH v2 7/9] KVM: arm/arm64: vgic-its: Cache successful MSI->LPI translation Marc Zyngier
2019-06-25 11:50   ` Zenghui Yu
2019-06-25 12:31     ` Marc Zyngier
2019-06-25 16:00       ` Zenghui Yu
2019-06-26  3:54         ` Zenghui Yu
2019-06-26  7:55         ` Marc Zyngier
2019-07-23 15:21   ` Auger Eric
2019-06-11 17:03 ` [PATCH v2 8/9] KVM: arm/arm64: vgic-its: Check the LPI translation cache on MSI injection Marc Zyngier
2019-07-23 15:10   ` Auger Eric
2019-07-23 15:45     ` Marc Zyngier
2019-07-24  7:41       ` Auger Eric
2019-06-11 17:03 ` [PATCH v2 9/9] KVM: arm/arm64: vgic-irqfd: Implement kvm_arch_set_irq_inatomic Marc Zyngier
2019-07-23 15:14   ` Auger Eric
2019-07-25  8:24     ` Marc Zyngier
2019-07-23 11:14 ` [PATCH v2 0/9] KVM: arm/arm64: vgic: ITS translation cache Andre Przywara
2019-07-25  8:50   ` Marc Zyngier [this message]
2019-07-25 10:01     ` Andre Przywara
2019-07-25 15:37       ` Marc Zyngier
