It recently became apparent[1] that our LPI injection path is not as
efficient as it could be when injecting interrupts coming from a VFIO
assigned device.

Although the proposed patch wasn't 100% correct, it outlined at least
two issues:

(1) Injecting an LPI from VFIO always results in a context switch to a
    worker thread: no good

(2) We have no way of amortising the cost of translating a DID+EID pair
    to an LPI number

The reason for (1) is that we may sleep when translating an LPI, so we
do need a process context. A way to fix that is to implement a small
LPI translation cache that could be looked up from an atomic context.
It would also solve (2).

This is what this small series proposes. It implements a very basic
LRU cache of pre-translated LPIs, which gets used to implement
kvm_arch_set_irq_inatomic. The size of the cache is currently
hard-coded at 16 times the number of vcpus, a number I have picked
under the influence of Ali Saidi. If that's not enough for you, blame
me, though.

Does it work? Well, it doesn't crash, and is thus perfect. More
seriously, I don't really have a way to benchmark it directly, so my
observations are only indirect:

On a TX2 system, I run a 4 vcpu VM with an Ethernet interface passed
to it directly. From the host, I inject interrupts using debugfs. In
parallel, I look at the number of context switches, and the number of
interrupts on the host. Without this series, I get the same number for
both IRQ and CS (about half a million of each per second is pretty
easy to reach). With this series, the number of context switches drops
to something pretty small (in the low 2k), while the number of
interrupts stays the same.

Yes, this is a pretty rubbish benchmark, what did you expect? ;-)

So I'm putting this out for people with real workloads to try out and
report what they see.
[1] https://lore.kernel.org/lkml/1552833373-19828-1-git-send-email-yuzenghui@huawei.com/

* From v1:
  - Fixed race on allocation, where the same LPI could be cached multiple times
  - Now invalidate the cache on vgic teardown, avoiding memory leaks
  - Change patch split slightly, general reshuffling
  - Small cleanups here and there
  - Rebased on 5.2-rc4

Marc Zyngier (9):
  KVM: arm/arm64: vgic: Add LPI translation cache definition
  KVM: arm/arm64: vgic: Add __vgic_put_lpi_locked primitive
  KVM: arm/arm64: vgic-its: Add MSI-LPI translation cache invalidation
  KVM: arm/arm64: vgic-its: Invalidate MSI-LPI translation cache on
    specific commands
  KVM: arm/arm64: vgic-its: Invalidate MSI-LPI translation cache on
    disabling LPIs
  KVM: arm/arm64: vgic-its: Invalidate MSI-LPI translation cache on vgic
    teardown
  KVM: arm/arm64: vgic-its: Cache successful MSI->LPI translation
  KVM: arm/arm64: vgic-its: Check the LPI translation cache on MSI
    injection
  KVM: arm/arm64: vgic-irqfd: Implement kvm_arch_set_irq_inatomic

 include/kvm/arm_vgic.h           |   3 +
 virt/kvm/arm/vgic/vgic-init.c    |   5 +
 virt/kvm/arm/vgic/vgic-irqfd.c   |  36 +++++-
 virt/kvm/arm/vgic/vgic-its.c     | 204 +++++++++++++++++++++++++++++++
 virt/kvm/arm/vgic/vgic-mmio-v3.c |   4 +-
 virt/kvm/arm/vgic/vgic.c         |  26 ++--
 virt/kvm/arm/vgic/vgic.h         |   5 +
 7 files changed, 267 insertions(+), 16 deletions(-)

-- 
2.20.1

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
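[Editor's note] For readers unfamiliar with the scheme described in the cover letter, here is a standalone userspace sketch of such a pre-translated LPI cache: a fixed pool of (doorbell, DevID, EventID) -> LPI entries kept in most-recently-used order, with the last slot always the eviction victim. Every name here is invented for illustration, and the array-plus-memmove layout is a simplification; the actual series uses kernel list primitives and reference-counted vgic_irq pointers.

```c
#include <assert.h>
#include <string.h>

#define CACHE_SIZE 4

struct cache_entry {
	unsigned long long db;	/* doorbell address */
	unsigned int devid;
	unsigned int eventid;
	int lpi;		/* 0 means "empty slot" */
};

static struct cache_entry cache[CACHE_SIZE];

/* Promote slot i to the front, shifting the more recent entries down */
static void move_to_front(int i)
{
	struct cache_entry tmp = cache[i];

	memmove(&cache[1], &cache[0], i * sizeof(cache[0]));
	cache[0] = tmp;
}

/* Return the cached LPI, or 0 on a miss; a hit becomes most recently used */
int cache_lookup(unsigned long long db, unsigned int devid,
		 unsigned int eventid)
{
	for (int i = 0; i < CACHE_SIZE; i++) {
		if (!cache[i].lpi)
			return 0;	/* nothing valid past an empty slot */
		if (cache[i].db == db && cache[i].devid == devid &&
		    cache[i].eventid == eventid) {
			move_to_front(i);
			return cache[0].lpi;
		}
	}
	return 0;
}

/* Insert a translation, evicting the least recently used entry */
void cache_insert(unsigned long long db, unsigned int devid,
		  unsigned int eventid, int lpi)
{
	/* Another CPU may have cached the same translation already */
	if (cache_lookup(db, devid, eventid))
		return;

	/* The last slot is, by construction, the LRU victim */
	move_to_front(CACHE_SIZE - 1);
	cache[0].db = db;
	cache[0].devid = devid;
	cache[0].eventid = eventid;
	cache[0].lpi = lpi;
}
```

A lookup that hits never sleeps and never allocates, which is the property that makes the real cache usable from kvm_arch_set_irq_inatomic; a miss simply falls back to the full (sleeping) translation path.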
Add the basic data structure that expresses an MSI to LPI
translation as well as the allocation/release hooks.

THe size of the cache is arbitrarily defined as 4*nr_vcpus.

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
---
 include/kvm/arm_vgic.h        |  3 +++
 virt/kvm/arm/vgic/vgic-init.c |  5 ++++
 virt/kvm/arm/vgic/vgic-its.c  | 49 +++++++++++++++++++++++++++++++++++
 virt/kvm/arm/vgic/vgic.h      |  2 ++
 4 files changed, 59 insertions(+)

diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
index c36c86f1ec9a..ca7bcf52dc85 100644
--- a/include/kvm/arm_vgic.h
+++ b/include/kvm/arm_vgic.h
@@ -260,6 +260,9 @@ struct vgic_dist {
 	struct list_head	lpi_list_head;
 	int			lpi_list_count;
 
+	/* LPI translation cache */
+	struct list_head	lpi_translation_cache;
+
 	/* used by vgic-debug */
 	struct vgic_state_iter *iter;
 
diff --git a/virt/kvm/arm/vgic/vgic-init.c b/virt/kvm/arm/vgic/vgic-init.c
index 3bdb31eaed64..c7c4c77dd430 100644
--- a/virt/kvm/arm/vgic/vgic-init.c
+++ b/virt/kvm/arm/vgic/vgic-init.c
@@ -64,6 +64,7 @@ void kvm_vgic_early_init(struct kvm *kvm)
 	struct vgic_dist *dist = &kvm->arch.vgic;
 
 	INIT_LIST_HEAD(&dist->lpi_list_head);
+	INIT_LIST_HEAD(&dist->lpi_translation_cache);
 	raw_spin_lock_init(&dist->lpi_list_lock);
 }
 
@@ -305,6 +306,7 @@ int vgic_init(struct kvm *kvm)
 	}
 
 	if (vgic_has_its(kvm)) {
+		vgic_lpi_translation_cache_init(kvm);
 		ret = vgic_v4_init(kvm);
 		if (ret)
 			goto out;
@@ -346,6 +348,9 @@ static void kvm_vgic_dist_destroy(struct kvm *kvm)
 		INIT_LIST_HEAD(&dist->rd_regions);
 	}
 
+	if (vgic_has_its(kvm))
+		vgic_lpi_translation_cache_destroy(kvm);
+
 	if (vgic_supports_direct_msis(kvm))
 		vgic_v4_teardown(kvm);
 }
diff --git a/virt/kvm/arm/vgic/vgic-its.c b/virt/kvm/arm/vgic/vgic-its.c
index 44ceaccb18cf..ce9bcddeb7f1 100644
--- a/virt/kvm/arm/vgic/vgic-its.c
+++ b/virt/kvm/arm/vgic/vgic-its.c
@@ -149,6 +149,14 @@ struct its_ite {
 	u32 event_id;
 };
 
+struct vgic_translation_cache_entry {
+	struct list_head	entry;
+	phys_addr_t		db;
+	u32			devid;
+	u32			eventid;
+	struct vgic_irq		*irq;
+};
+
 /**
  * struct vgic_its_abi - ITS abi ops and settings
  * @cte_esz: collection table entry size
@@ -1668,6 +1676,45 @@ static int vgic_register_its_iodev(struct kvm *kvm, struct vgic_its *its,
 	return ret;
 }
 
+/* Default is 16 cached LPIs per vcpu */
+#define LPI_DEFAULT_PCPU_CACHE_SIZE	16
+
+void vgic_lpi_translation_cache_init(struct kvm *kvm)
+{
+	struct vgic_dist *dist = &kvm->arch.vgic;
+	unsigned int sz;
+	int i;
+
+	if (!list_empty(&dist->lpi_translation_cache))
+		return;
+
+	sz = atomic_read(&kvm->online_vcpus) * LPI_DEFAULT_PCPU_CACHE_SIZE;
+
+	for (i = 0; i < sz; i++) {
+		struct vgic_translation_cache_entry *cte;
+
+		/* An allocation failure is not fatal */
+		cte = kzalloc(sizeof(*cte), GFP_KERNEL);
+		if (WARN_ON(!cte))
+			break;
+
+		INIT_LIST_HEAD(&cte->entry);
+		list_add(&cte->entry, &dist->lpi_translation_cache);
+	}
+}
+
+void vgic_lpi_translation_cache_destroy(struct kvm *kvm)
+{
+	struct vgic_dist *dist = &kvm->arch.vgic;
+	struct vgic_translation_cache_entry *cte, *tmp;
+
+	list_for_each_entry_safe(cte, tmp,
+				 &dist->lpi_translation_cache, entry) {
+		list_del(&cte->entry);
+		kfree(cte);
+	}
+}
+
 #define INITIAL_BASER_VALUE						  \
 	(GIC_BASER_CACHEABILITY(GITS_BASER, INNER, RaWb)		| \
 	 GIC_BASER_CACHEABILITY(GITS_BASER, OUTER, SameAsInner)		| \
@@ -1696,6 +1743,8 @@ static int vgic_its_create(struct kvm_device *dev, u32 type)
 			kfree(its);
 			return ret;
 		}
+
+		vgic_lpi_translation_cache_init(dev->kvm);
 	}
 
 	mutex_init(&its->its_lock);
diff --git a/virt/kvm/arm/vgic/vgic.h b/virt/kvm/arm/vgic/vgic.h
index abeeffabc456..50aad705c4a9 100644
--- a/virt/kvm/arm/vgic/vgic.h
+++ b/virt/kvm/arm/vgic/vgic.h
@@ -316,6 +316,8 @@ int vgic_copy_lpi_list(struct kvm *kvm, struct kvm_vcpu *vcpu, u32 **intid_ptr);
 int vgic_its_resolve_lpi(struct kvm *kvm, struct vgic_its *its,
 			 u32 devid, u32 eventid, struct vgic_irq **irq);
 struct vgic_its *vgic_msi_to_its(struct kvm *kvm, struct kvm_msi *msi);
+void vgic_lpi_translation_cache_init(struct kvm *kvm);
+void vgic_lpi_translation_cache_destroy(struct kvm *kvm);
 bool vgic_supports_direct_msis(struct kvm *kvm);
 int vgic_v4_init(struct kvm *kvm);

-- 
2.20.1
Our LPI translation cache needs to be able to drop the refcount
on an LPI whilst already holding the lpi_list_lock.

Let's add a new primitive for this.

Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
---
 virt/kvm/arm/vgic/vgic.c | 26 +++++++++++++++++---------
 virt/kvm/arm/vgic/vgic.h |  1 +
 2 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/virt/kvm/arm/vgic/vgic.c b/virt/kvm/arm/vgic/vgic.c
index 191deccf60bf..376a297e2169 100644
--- a/virt/kvm/arm/vgic/vgic.c
+++ b/virt/kvm/arm/vgic/vgic.c
@@ -130,6 +130,22 @@ static void vgic_irq_release(struct kref *ref)
 {
 }
 
+/*
+ * Drop the refcount on the LPI. Must be called with lpi_list_lock held.
+ */
+void __vgic_put_lpi_locked(struct kvm *kvm, struct vgic_irq *irq)
+{
+	struct vgic_dist *dist = &kvm->arch.vgic;
+
+	if (!kref_put(&irq->refcount, vgic_irq_release))
+		return;
+
+	list_del(&irq->lpi_list);
+	dist->lpi_list_count--;
+
+	kfree(irq);
+}
+
 void vgic_put_irq(struct kvm *kvm, struct vgic_irq *irq)
 {
 	struct vgic_dist *dist = &kvm->arch.vgic;
@@ -139,16 +155,8 @@ void vgic_put_irq(struct kvm *kvm, struct vgic_irq *irq)
 		return;
 
 	raw_spin_lock_irqsave(&dist->lpi_list_lock, flags);
-	if (!kref_put(&irq->refcount, vgic_irq_release)) {
-		raw_spin_unlock_irqrestore(&dist->lpi_list_lock, flags);
-		return;
-	};
-
-	list_del(&irq->lpi_list);
-	dist->lpi_list_count--;
+	__vgic_put_lpi_locked(kvm, irq);
 	raw_spin_unlock_irqrestore(&dist->lpi_list_lock, flags);
-
-	kfree(irq);
 }
 
 void vgic_flush_pending_lpis(struct kvm_vcpu *vcpu)
diff --git a/virt/kvm/arm/vgic/vgic.h b/virt/kvm/arm/vgic/vgic.h
index 50aad705c4a9..afac2fed7df4 100644
--- a/virt/kvm/arm/vgic/vgic.h
+++ b/virt/kvm/arm/vgic/vgic.h
@@ -172,6 +172,7 @@ vgic_get_mmio_region(struct kvm_vcpu *vcpu, struct vgic_io_device *iodev,
 		     gpa_t addr, int len);
 struct vgic_irq *vgic_get_irq(struct kvm *kvm, struct kvm_vcpu *vcpu,
 			      u32 intid);
+void __vgic_put_lpi_locked(struct kvm *kvm, struct vgic_irq *irq);
 void vgic_put_irq(struct kvm *kvm, struct vgic_irq *irq);
 bool vgic_get_phys_line_level(struct vgic_irq *irq);
 void vgic_irq_set_phys_pending(struct vgic_irq *irq, bool pending);

-- 
2.20.1
There are a number of cases where we need to invalidate the caching
of translations, so let's add basic support for that.

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
---
 virt/kvm/arm/vgic/vgic-its.c | 23 +++++++++++++++++++++++
 virt/kvm/arm/vgic/vgic.h     |  1 +
 2 files changed, 24 insertions(+)

diff --git a/virt/kvm/arm/vgic/vgic-its.c b/virt/kvm/arm/vgic/vgic-its.c
index ce9bcddeb7f1..9b6b66204b97 100644
--- a/virt/kvm/arm/vgic/vgic-its.c
+++ b/virt/kvm/arm/vgic/vgic-its.c
@@ -546,6 +546,29 @@ static unsigned long vgic_mmio_read_its_idregs(struct kvm *kvm,
 	return 0;
 }
 
+void vgic_its_invalidate_cache(struct kvm *kvm)
+{
+	struct vgic_dist *dist = &kvm->arch.vgic;
+	struct vgic_translation_cache_entry *cte;
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&dist->lpi_list_lock, flags);
+
+	list_for_each_entry(cte, &dist->lpi_translation_cache, entry) {
+		/*
+		 * If we hit a NULL entry, there is nothing after this
+		 * point.
+		 */
+		if (!cte->irq)
+			break;
+
+		__vgic_put_lpi_locked(kvm, cte->irq);
+		cte->irq = NULL;
+	}
+
+	raw_spin_unlock_irqrestore(&dist->lpi_list_lock, flags);
+}
+
 int vgic_its_resolve_lpi(struct kvm *kvm, struct vgic_its *its,
 			 u32 devid, u32 eventid, struct vgic_irq **irq)
 {
diff --git a/virt/kvm/arm/vgic/vgic.h b/virt/kvm/arm/vgic/vgic.h
index afac2fed7df4..072f810dc441 100644
--- a/virt/kvm/arm/vgic/vgic.h
+++ b/virt/kvm/arm/vgic/vgic.h
@@ -319,6 +319,7 @@ int vgic_its_resolve_lpi(struct kvm *kvm, struct vgic_its *its,
 struct vgic_its *vgic_msi_to_its(struct kvm *kvm, struct kvm_msi *msi);
 void vgic_lpi_translation_cache_init(struct kvm *kvm);
 void vgic_lpi_translation_cache_destroy(struct kvm *kvm);
+void vgic_its_invalidate_cache(struct kvm *kvm);
 bool vgic_supports_direct_msis(struct kvm *kvm);
 int vgic_v4_init(struct kvm *kvm);

-- 
2.20.1
The LPI translation cache needs to be discarded when an ITS command
may affect the translation of an LPI (DISCARD and MAPD with V=0) or
the routing of an LPI to a redistributor with disabled LPIs (MOVI,
MOVALL).

We decide to perform a full invalidation of the cache, irrespective
of the LPI that is affected. Commands are supposed to be rare enough
that it doesn't matter.

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
---
 virt/kvm/arm/vgic/vgic-its.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/virt/kvm/arm/vgic/vgic-its.c b/virt/kvm/arm/vgic/vgic-its.c
index 9b6b66204b97..5254bb762e1b 100644
--- a/virt/kvm/arm/vgic/vgic-its.c
+++ b/virt/kvm/arm/vgic/vgic-its.c
@@ -733,6 +733,8 @@ static int vgic_its_cmd_handle_discard(struct kvm *kvm, struct vgic_its *its,
 		 * don't bother here since we clear the ITTE anyway and the
 		 * pending state is a property of the ITTE struct.
 		 */
+		vgic_its_invalidate_cache(kvm);
+
 		its_free_ite(kvm, ite);
 		return 0;
 	}
@@ -768,6 +770,8 @@ static int vgic_its_cmd_handle_movi(struct kvm *kvm, struct vgic_its *its,
 	ite->collection = collection;
 	vcpu = kvm_get_vcpu(kvm, collection->target_addr);
 
+	vgic_its_invalidate_cache(kvm);
+
 	return update_affinity(ite->irq, vcpu);
 }
 
@@ -996,6 +1000,8 @@ static void vgic_its_free_device(struct kvm *kvm, struct its_device *device)
 	list_for_each_entry_safe(ite, temp, &device->itt_head, ite_list)
 		its_free_ite(kvm, ite);
 
+	vgic_its_invalidate_cache(kvm);
+
 	list_del(&device->dev_list);
 	kfree(device);
 }
@@ -1249,6 +1255,8 @@ static int vgic_its_cmd_handle_movall(struct kvm *kvm, struct vgic_its *its,
 		vgic_put_irq(kvm, irq);
 	}
 
+	vgic_its_invalidate_cache(kvm);
+
 	kfree(intids);
 	return 0;
 }

-- 
2.20.1
If a vcpu disables LPIs at its redistributor level, we need to make sure
we won't pend more interrupts. For this, we need to invalidate the LPI
translation cache.

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
---
 virt/kvm/arm/vgic/vgic-mmio-v3.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/arm/vgic/vgic-mmio-v3.c b/virt/kvm/arm/vgic/vgic-mmio-v3.c
index 936962abc38d..cb60da48810d 100644
--- a/virt/kvm/arm/vgic/vgic-mmio-v3.c
+++ b/virt/kvm/arm/vgic/vgic-mmio-v3.c
@@ -192,8 +192,10 @@ static void vgic_mmio_write_v3r_ctlr(struct kvm_vcpu *vcpu,
 
 	vgic_cpu->lpis_enabled = val & GICR_CTLR_ENABLE_LPIS;
 
-	if (was_enabled && !vgic_cpu->lpis_enabled)
+	if (was_enabled && !vgic_cpu->lpis_enabled) {
 		vgic_flush_pending_lpis(vcpu);
+		vgic_its_invalidate_cache(vcpu->kvm);
+	}
 
 	if (!was_enabled && vgic_cpu->lpis_enabled)
 		vgic_enable_lpis(vcpu);

-- 
2.20.1
In order to avoid leaking vgic_irq structures on teardown, we need to
drop all references to LPIs before deallocating the cache itself.

This is done by invalidating the cache on vgic teardown.

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
---
 virt/kvm/arm/vgic/vgic-its.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/virt/kvm/arm/vgic/vgic-its.c b/virt/kvm/arm/vgic/vgic-its.c
index 5254bb762e1b..0aa0cbbc3af6 100644
--- a/virt/kvm/arm/vgic/vgic-its.c
+++ b/virt/kvm/arm/vgic/vgic-its.c
@@ -1739,6 +1739,8 @@ void vgic_lpi_translation_cache_destroy(struct kvm *kvm)
 	struct vgic_dist *dist = &kvm->arch.vgic;
 	struct vgic_translation_cache_entry *cte, *tmp;
 
+	vgic_its_invalidate_cache(kvm);
+
 	list_for_each_entry_safe(cte, tmp,
 				 &dist->lpi_translation_cache, entry) {
 		list_del(&cte->entry);

-- 
2.20.1
On a successful translation, preserve the parameters in the LPI
translation cache. Each translation is reusing the last slot
in the list, naturally evicting the least recently used entry.

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
---
 virt/kvm/arm/vgic/vgic-its.c | 86 ++++++++++++++++++++++++++++++++++++
 1 file changed, 86 insertions(+)

diff --git a/virt/kvm/arm/vgic/vgic-its.c b/virt/kvm/arm/vgic/vgic-its.c
index 0aa0cbbc3af6..62932458476a 100644
--- a/virt/kvm/arm/vgic/vgic-its.c
+++ b/virt/kvm/arm/vgic/vgic-its.c
@@ -546,6 +546,90 @@ static unsigned long vgic_mmio_read_its_idregs(struct kvm *kvm,
 	return 0;
 }
 
+static struct vgic_irq *__vgic_its_check_cache(struct vgic_dist *dist,
+					       phys_addr_t db,
+					       u32 devid, u32 eventid)
+{
+	struct vgic_translation_cache_entry *cte;
+	struct vgic_irq *irq = NULL;
+
+	list_for_each_entry(cte, &dist->lpi_translation_cache, entry) {
+		/*
+		 * If we hit a NULL entry, there is nothing after this
+		 * point.
+		 */
+		if (!cte->irq)
+			break;
+
+		if (cte->db == db &&
+		    cte->devid == devid &&
+		    cte->eventid == eventid) {
+			/*
+			 * Move this entry to the head, as it is the
+			 * most recently used.
+			 */
+			list_move(&cte->entry, &dist->lpi_translation_cache);
+			irq = cte->irq;
+			break;
+		}
+	}
+
+	return irq;
+}
+
+static void vgic_its_cache_translation(struct kvm *kvm, struct vgic_its *its,
+				       u32 devid, u32 eventid,
+				       struct vgic_irq *irq)
+{
+	struct vgic_dist *dist = &kvm->arch.vgic;
+	struct vgic_translation_cache_entry *cte;
+	unsigned long flags;
+	phys_addr_t db;
+
+	/* Do not cache a directly injected interrupt */
+	if (irq->hw)
+		return;
+
+	raw_spin_lock_irqsave(&dist->lpi_list_lock, flags);
+
+	if (unlikely(list_empty(&dist->lpi_translation_cache)))
+		goto out;
+
+	/*
+	 * We could have raced with another CPU caching the same
+	 * translation behind our back, so let's check it is not in
+	 * already
+	 */
+	db = its->vgic_its_base + GITS_TRANSLATER;
+	if (__vgic_its_check_cache(dist, db, devid, eventid))
+		goto out;
+
+	/* Always reuse the last entry (LRU policy) */
+	cte = list_last_entry(&dist->lpi_translation_cache,
+			      typeof(*cte), entry);
+
+	/*
+	 * Caching the translation implies having an extra reference
+	 * to the interrupt, so drop the potential reference on what
+	 * was in the cache, and increment it on the new interrupt.
+	 */
+	if (cte->irq)
+		__vgic_put_lpi_locked(kvm, cte->irq);
+
+	vgic_get_irq_kref(irq);
+
+	cte->db		= db;
+	cte->devid	= devid;
+	cte->eventid	= eventid;
+	cte->irq	= irq;
+
+	/* Move the new translation to the head of the list */
+	list_move(&cte->entry, &dist->lpi_translation_cache);
+
+out:
+	raw_spin_unlock_irqrestore(&dist->lpi_list_lock, flags);
+}
+
 void vgic_its_invalidate_cache(struct kvm *kvm)
 {
 	struct vgic_dist *dist = &kvm->arch.vgic;
@@ -589,6 +673,8 @@ int vgic_its_resolve_lpi(struct kvm *kvm, struct vgic_its *its,
 	if (!vcpu->arch.vgic_cpu.lpis_enabled)
 		return -EBUSY;
 
+	vgic_its_cache_translation(kvm, its, devid, eventid, ite->irq);
+
 	*irq = ite->irq;
 	return 0;
 }

-- 
2.20.1
When performing an MSI injection, let's first check if the translation
is already in the cache. If so, let's inject it quickly without
going through the whole translation process.

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
---
 virt/kvm/arm/vgic/vgic-its.c | 36 ++++++++++++++++++++++++++++++++++++
 virt/kvm/arm/vgic/vgic.h     |  1 +
 2 files changed, 37 insertions(+)

diff --git a/virt/kvm/arm/vgic/vgic-its.c b/virt/kvm/arm/vgic/vgic-its.c
index 62932458476a..83d80ec33473 100644
--- a/virt/kvm/arm/vgic/vgic-its.c
+++ b/virt/kvm/arm/vgic/vgic-its.c
@@ -577,6 +577,20 @@ static struct vgic_irq *__vgic_its_check_cache(struct vgic_dist *dist,
 	return irq;
 }
 
+static struct vgic_irq *vgic_its_check_cache(struct kvm *kvm, phys_addr_t db,
+					     u32 devid, u32 eventid)
+{
+	struct vgic_dist *dist = &kvm->arch.vgic;
+	struct vgic_irq *irq;
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&dist->lpi_list_lock, flags);
+	irq = __vgic_its_check_cache(dist, db, devid, eventid);
+	raw_spin_unlock_irqrestore(&dist->lpi_list_lock, flags);
+
+	return irq;
+}
+
 static void vgic_its_cache_translation(struct kvm *kvm, struct vgic_its *its,
 				       u32 devid, u32 eventid,
 				       struct vgic_irq *irq)
@@ -736,6 +750,25 @@ static int vgic_its_trigger_msi(struct kvm *kvm, struct vgic_its *its,
 	return 0;
 }
 
+int vgic_its_inject_cached_translation(struct kvm *kvm, struct kvm_msi *msi)
+{
+	struct vgic_irq *irq;
+	unsigned long flags;
+	phys_addr_t db;
+
+	db = (u64)msi->address_hi << 32 | msi->address_lo;
+	irq = vgic_its_check_cache(kvm, db, msi->devid, msi->data);
+
+	if (!irq)
+		return -1;
+
+	raw_spin_lock_irqsave(&irq->irq_lock, flags);
+	irq->pending_latch = true;
+	vgic_queue_irq_unlock(kvm, irq, flags);
+
+	return 0;
+}
+
 /*
  * Queries the KVM IO bus framework to get the ITS pointer from the given
  * doorbell address.
@@ -747,6 +780,9 @@ int vgic_its_inject_msi(struct kvm *kvm, struct kvm_msi *msi)
 	struct vgic_its *its;
 	int ret;
 
+	if (!vgic_its_inject_cached_translation(kvm, msi))
+		return 1;
+
 	its = vgic_msi_to_its(kvm, msi);
 	if (IS_ERR(its))
 		return PTR_ERR(its);
diff --git a/virt/kvm/arm/vgic/vgic.h b/virt/kvm/arm/vgic/vgic.h
index 072f810dc441..ad6eba1e2beb 100644
--- a/virt/kvm/arm/vgic/vgic.h
+++ b/virt/kvm/arm/vgic/vgic.h
@@ -317,6 +317,7 @@ int vgic_copy_lpi_list(struct kvm *kvm, struct kvm_vcpu *vcpu, u32 **intid_ptr);
 int vgic_its_resolve_lpi(struct kvm *kvm, struct vgic_its *its,
 			 u32 devid, u32 eventid, struct vgic_irq **irq);
 struct vgic_its *vgic_msi_to_its(struct kvm *kvm, struct kvm_msi *msi);
+int vgic_its_inject_cached_translation(struct kvm *kvm, struct kvm_msi *msi);
 void vgic_lpi_translation_cache_init(struct kvm *kvm);
 void vgic_lpi_translation_cache_destroy(struct kvm *kvm);
 void vgic_its_invalidate_cache(struct kvm *kvm);

-- 
2.20.1
Now that we have a cache of MSI->LPI translations, it is pretty
easy to implement kvm_arch_set_irq_inatomic (this cache can be
parsed without sleeping).

Hopefully, this will improve some LPI-heavy workloads.

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
---
 virt/kvm/arm/vgic/vgic-irqfd.c | 36 ++++++++++++++++++++++++++++------
 1 file changed, 30 insertions(+), 6 deletions(-)

diff --git a/virt/kvm/arm/vgic/vgic-irqfd.c b/virt/kvm/arm/vgic/vgic-irqfd.c
index 99e026d2dade..9f203ed8c8f3 100644
--- a/virt/kvm/arm/vgic/vgic-irqfd.c
+++ b/virt/kvm/arm/vgic/vgic-irqfd.c
@@ -77,6 +77,15 @@ int kvm_set_routing_entry(struct kvm *kvm,
 	return r;
 }
 
+static void kvm_populate_msi(struct kvm_kernel_irq_routing_entry *e,
+			     struct kvm_msi *msi)
+{
+	msi->address_lo = e->msi.address_lo;
+	msi->address_hi = e->msi.address_hi;
+	msi->data = e->msi.data;
+	msi->flags = e->msi.flags;
+	msi->devid = e->msi.devid;
+}
+
 /**
  * kvm_set_msi: inject the MSI corresponding to the
  * MSI routing entry
@@ -90,21 +99,36 @@ int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
 {
 	struct kvm_msi msi;
 
-	msi.address_lo = e->msi.address_lo;
-	msi.address_hi = e->msi.address_hi;
-	msi.data = e->msi.data;
-	msi.flags = e->msi.flags;
-	msi.devid = e->msi.devid;
-
 	if (!vgic_has_its(kvm))
 		return -ENODEV;
 
 	if (!level)
 		return -1;
 
+	kvm_populate_msi(e, &msi);
 	return vgic_its_inject_msi(kvm, &msi);
 }
 
+/**
+ * kvm_arch_set_irq_inatomic: fast-path for irqfd injection
+ *
+ * Currently only direct MSI injection is supported.
+ */
+int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e,
+			      struct kvm *kvm, int irq_source_id, int level,
+			      bool line_status)
+{
+	if (e->type == KVM_IRQ_ROUTING_MSI && vgic_has_its(kvm) && level) {
+		struct kvm_msi msi;
+
+		kvm_populate_msi(e, &msi);
+		if (!vgic_its_inject_cached_translation(kvm, &msi))
+			return 0;
+	}
+
+	return -EWOULDBLOCK;
+}
+
 int kvm_vgic_setup_default_irq_routing(struct kvm *kvm)
 {
 	struct kvm_irq_routing_entry *entries;

-- 
2.20.1
Hi Marc,

On 11/06/2019 18:03, Marc Zyngier wrote:
> Add the basic data structure that expresses an MSI to LPI
> translation as well as the allocation/release hooks.
> 
> THe size of the cache is arbitrarily defined as 4*nr_vcpus.
> 

The size has been arbitrarily changed to 16*nr_vcpus :) .

Nit: The*

> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
> ---
>  include/kvm/arm_vgic.h        |  3 +++
>  virt/kvm/arm/vgic/vgic-init.c |  5 ++++
>  virt/kvm/arm/vgic/vgic-its.c  | 49 +++++++++++++++++++++++++++++++++++
>  virt/kvm/arm/vgic/vgic.h      |  2 ++
>  4 files changed, 59 insertions(+)
> 
> diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
> index c36c86f1ec9a..ca7bcf52dc85 100644
> --- a/include/kvm/arm_vgic.h
> +++ b/include/kvm/arm_vgic.h
> @@ -260,6 +260,9 @@ struct vgic_dist {
>  	struct list_head	lpi_list_head;
>  	int			lpi_list_count;
>  
> +	/* LPI translation cache */
> +	struct list_head	lpi_translation_cache;
> +
>  	/* used by vgic-debug */
>  	struct vgic_state_iter *iter;
>  
> diff --git a/virt/kvm/arm/vgic/vgic-init.c b/virt/kvm/arm/vgic/vgic-init.c
> index 3bdb31eaed64..c7c4c77dd430 100644
> --- a/virt/kvm/arm/vgic/vgic-init.c
> +++ b/virt/kvm/arm/vgic/vgic-init.c
> @@ -64,6 +64,7 @@ void kvm_vgic_early_init(struct kvm *kvm)
>  	struct vgic_dist *dist = &kvm->arch.vgic;
>  
>  	INIT_LIST_HEAD(&dist->lpi_list_head);
> +	INIT_LIST_HEAD(&dist->lpi_translation_cache);
>  	raw_spin_lock_init(&dist->lpi_list_lock);
>  }
>  
> @@ -305,6 +306,7 @@ int vgic_init(struct kvm *kvm)
>  	}
>  
>  	if (vgic_has_its(kvm)) {
> +		vgic_lpi_translation_cache_init(kvm);
>  		ret = vgic_v4_init(kvm);
>  		if (ret)
>  			goto out;
> @@ -346,6 +348,9 @@ static void kvm_vgic_dist_destroy(struct kvm *kvm)
>  		INIT_LIST_HEAD(&dist->rd_regions);
>  	}
>  
> +	if (vgic_has_its(kvm))
> +		vgic_lpi_translation_cache_destroy(kvm);
> +
>  	if (vgic_supports_direct_msis(kvm))
>  		vgic_v4_teardown(kvm);
>  }
> diff --git a/virt/kvm/arm/vgic/vgic-its.c b/virt/kvm/arm/vgic/vgic-its.c
> index 44ceaccb18cf..ce9bcddeb7f1 100644
> --- a/virt/kvm/arm/vgic/vgic-its.c
> +++ b/virt/kvm/arm/vgic/vgic-its.c
> @@ -149,6 +149,14 @@ struct its_ite {
>  	u32 event_id;
>  };
>  
> +struct vgic_translation_cache_entry {
> +	struct list_head	entry;
> +	phys_addr_t		db;
> +	u32			devid;
> +	u32			eventid;
> +	struct vgic_irq		*irq;
> +};
> +
>  /**
>   * struct vgic_its_abi - ITS abi ops and settings
>   * @cte_esz: collection table entry size
> @@ -1668,6 +1676,45 @@ static int vgic_register_its_iodev(struct kvm *kvm, struct vgic_its *its,
>  	return ret;
>  }
>  
> +/* Default is 16 cached LPIs per vcpu */
> +#define LPI_DEFAULT_PCPU_CACHE_SIZE	16
> +
> +void vgic_lpi_translation_cache_init(struct kvm *kvm)
> +{
> +	struct vgic_dist *dist = &kvm->arch.vgic;
> +	unsigned int sz;
> +	int i;
> +
> +	if (!list_empty(&dist->lpi_translation_cache))
> +		return;
> +
> +	sz = atomic_read(&kvm->online_vcpus) * LPI_DEFAULT_PCPU_CACHE_SIZE;
> +
> +	for (i = 0; i < sz; i++) {
> +		struct vgic_translation_cache_entry *cte;
> +
> +		/* An allocation failure is not fatal */
> +		cte = kzalloc(sizeof(*cte), GFP_KERNEL);
> +		if (WARN_ON(!cte))
> +			break;
> +
> +		INIT_LIST_HEAD(&cte->entry);
> +		list_add(&cte->entry, &dist->lpi_translation_cache);

Going through the series, it looks like this list is either empty
(before the cache init) or has a fixed number
(LPI_DEFAULT_PCPU_CACHE_SIZE * nr_cpus) of entries. And the list never
grows nor shrinks throughout the series, so it seems odd to be using a
list here.

Is there a reason for not using a dynamically allocated array instead of
the list? (does list_move() provide a big perf advantage over swapping
the data from one array entry to another? Or is there some other
facility I am missing?

Cheers,

-- 
Julien Thierry
On 12/06/2019 09:16, Julien Thierry wrote:
> Hi Marc,
> 
> On 11/06/2019 18:03, Marc Zyngier wrote:

[...]

>> +
>> +void vgic_lpi_translation_cache_init(struct kvm *kvm)
>> +{
>> +	struct vgic_dist *dist = &kvm->arch.vgic;
>> +	unsigned int sz;
>> +	int i;
>> +
>> +	if (!list_empty(&dist->lpi_translation_cache))
>> +		return;
>> +
>> +	sz = atomic_read(&kvm->online_vcpus) * LPI_DEFAULT_PCPU_CACHE_SIZE;
>> +
>> +	for (i = 0; i < sz; i++) {
>> +		struct vgic_translation_cache_entry *cte;
>> +
>> +		/* An allocation failure is not fatal */
>> +		cte = kzalloc(sizeof(*cte), GFP_KERNEL);
>> +		if (WARN_ON(!cte))
>> +			break;
>> +
>> +		INIT_LIST_HEAD(&cte->entry);
>> +		list_add(&cte->entry, &dist->lpi_translation_cache);
> 
> Going through the series, it looks like this list is either empty
> (before the cache init) or has a fixed number
> (LPI_DEFAULT_PCPU_CACHE_SIZE * nr_cpus) of entries. And the list never
> grows nor shrinks throughout the series, so it seems odd to be using a
> list here.
> 
> Is there a reason for not using a dynamically allocated array instead of
> the list? (does list_move() provide a big perf advantage over swapping
> the data from one array entry to another? Or is there some other
> facility I am missing?
> 

Scratch that, I realized having the list makes it easier to implement
the LRU policy later in the series.

-- 
Julien Thierry
Hi Julien, On Wed, 12 Jun 2019 09:16:21 +0100, Julien Thierry <julien.thierry@arm.com> wrote: > > Hi Marc, > > On 11/06/2019 18:03, Marc Zyngier wrote: > > Add the basic data structure that expresses an MSI to LPI > > translation as well as the allocation/release hooks. > > > > THe size of the cache is arbitrarily defined as 4*nr_vcpus. > > > > The size has been arbitrarily changed to 16*nr_vcpus :) . Well spotted! ;-) > > Nit: The* Ah, usual lazy finger on the Shift key... One day I'll learn to type. > > > Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> > > --- > > include/kvm/arm_vgic.h | 3 +++ > > virt/kvm/arm/vgic/vgic-init.c | 5 ++++ > > virt/kvm/arm/vgic/vgic-its.c | 49 +++++++++++++++++++++++++++++++++++ > > virt/kvm/arm/vgic/vgic.h | 2 ++ > > 4 files changed, 59 insertions(+) > > > > diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h > > index c36c86f1ec9a..ca7bcf52dc85 100644 > > --- a/include/kvm/arm_vgic.h > > +++ b/include/kvm/arm_vgic.h > > @@ -260,6 +260,9 @@ struct vgic_dist { > > struct list_head lpi_list_head; > > int lpi_list_count; > > > > + /* LPI translation cache */ > > + struct list_head lpi_translation_cache; > > + > > /* used by vgic-debug */ > > struct vgic_state_iter *iter; > > > > diff --git a/virt/kvm/arm/vgic/vgic-init.c b/virt/kvm/arm/vgic/vgic-init.c > > index 3bdb31eaed64..c7c4c77dd430 100644 > > --- a/virt/kvm/arm/vgic/vgic-init.c > > +++ b/virt/kvm/arm/vgic/vgic-init.c > > @@ -64,6 +64,7 @@ void kvm_vgic_early_init(struct kvm *kvm) > > struct vgic_dist *dist = &kvm->arch.vgic; > > > > INIT_LIST_HEAD(&dist->lpi_list_head); > > + INIT_LIST_HEAD(&dist->lpi_translation_cache); > > raw_spin_lock_init(&dist->lpi_list_lock); > > } > > > > @@ -305,6 +306,7 @@ int vgic_init(struct kvm *kvm) > > } > > > > if (vgic_has_its(kvm)) { > > + vgic_lpi_translation_cache_init(kvm); > > ret = vgic_v4_init(kvm); > > if (ret) > > goto out; > > @@ -346,6 +348,9 @@ static void kvm_vgic_dist_destroy(struct kvm *kvm) > > 
INIT_LIST_HEAD(&dist->rd_regions); > > } > > > > + if (vgic_has_its(kvm)) > > + vgic_lpi_translation_cache_destroy(kvm); > > + > > if (vgic_supports_direct_msis(kvm)) > > vgic_v4_teardown(kvm); > > } > > diff --git a/virt/kvm/arm/vgic/vgic-its.c b/virt/kvm/arm/vgic/vgic-its.c > > index 44ceaccb18cf..ce9bcddeb7f1 100644 > > --- a/virt/kvm/arm/vgic/vgic-its.c > > +++ b/virt/kvm/arm/vgic/vgic-its.c > > @@ -149,6 +149,14 @@ struct its_ite { > > u32 event_id; > > }; > > > > +struct vgic_translation_cache_entry { > > + struct list_head entry; > > + phys_addr_t db; > > + u32 devid; > > + u32 eventid; > > + struct vgic_irq *irq; > > +}; > > + > > /** > > * struct vgic_its_abi - ITS abi ops and settings > > * @cte_esz: collection table entry size > > @@ -1668,6 +1676,45 @@ static int vgic_register_its_iodev(struct kvm *kvm, struct vgic_its *its, > > return ret; > > } > > > > +/* Default is 16 cached LPIs per vcpu */ > > +#define LPI_DEFAULT_PCPU_CACHE_SIZE 16 > > + > > +void vgic_lpi_translation_cache_init(struct kvm *kvm) > > +{ > > + struct vgic_dist *dist = &kvm->arch.vgic; > > + unsigned int sz; > > + int i; > > + > > + if (!list_empty(&dist->lpi_translation_cache)) > > + return; > > + > > + sz = atomic_read(&kvm->online_vcpus) * LPI_DEFAULT_PCPU_CACHE_SIZE; > > + > > + for (i = 0; i < sz; i++) { > > + struct vgic_translation_cache_entry *cte; > > + > > + /* An allocation failure is not fatal */ > > + cte = kzalloc(sizeof(*cte), GFP_KERNEL); > > + if (WARN_ON(!cte)) > > + break; > > + > > + INIT_LIST_HEAD(&cte->entry); > > + list_add(&cte->entry, &dist->lpi_translation_cache); > > Going through the series, it looks like this list is either empty > (before the cache init) or has a fixed number > (LPI_DEFAULT_PCPU_CACHE_SIZE * nr_cpus) of entries. Well, it could also fail when allocating one of the entry, meaning we can have an allocation ranging from 0 to (LPI_DEFAULT_PCPU_CACHE_SIZE * nr_cpus) entries. 
> And the list never grows nor shrinks throughout the series, so it > seems odd to be using a list here. > > Is there a reason for not using a dynamically allocated array instead of > the list? (does list_move() provide a big perf advantage over swapping > the data from one array entry to another? Or is there some other > facility I am missing? The idea was to make the LRU policy cheap, on the assumption that list_move (which is only a couple of pointer updates) is cheaper than a memmove if you want to keep the array ordered. If we exclude the list head, we end up with 24 bytes per entry to move down to make room for the new entry at the head of the array. For large caches that miss very often, this will hurt badly. But is that really a problem? I don't know. We could allocate an array as you suggest, and use a linked list inside the array. Or something else. I'm definitely open to suggestions! Thanks, M. -- Jazz is not dead, it just smells funny.
On 12/06/2019 10:52, Marc Zyngier wrote: > Hi Julien, > > On Wed, 12 Jun 2019 09:16:21 +0100, > Julien Thierry <julien.thierry@arm.com> wrote: >> >> Hi Marc, >> >> On 11/06/2019 18:03, Marc Zyngier wrote: >>> Add the basic data structure that expresses an MSI to LPI >>> translation as well as the allocation/release hooks. >>> >>> THe size of the cache is arbitrarily defined as 4*nr_vcpus. >>> >> >> The size has been arbitrarily changed to 16*nr_vcpus :) . > > Well spotted! ;-) > >> >> Nit: The* > > Ah, usual lazy finger on the Shift key... One day I'll learn to type. > >> >>> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> >>> --- >>> include/kvm/arm_vgic.h | 3 +++ >>> virt/kvm/arm/vgic/vgic-init.c | 5 ++++ >>> virt/kvm/arm/vgic/vgic-its.c | 49 +++++++++++++++++++++++++++++++++++ >>> virt/kvm/arm/vgic/vgic.h | 2 ++ >>> 4 files changed, 59 insertions(+) >>> [...] >>> diff --git a/virt/kvm/arm/vgic/vgic-its.c b/virt/kvm/arm/vgic/vgic-its.c >>> index 44ceaccb18cf..ce9bcddeb7f1 100644 >>> --- a/virt/kvm/arm/vgic/vgic-its.c >>> +++ b/virt/kvm/arm/vgic/vgic-its.c >>> @@ -149,6 +149,14 @@ struct its_ite { >>> u32 event_id; >>> }; >>> >>> +struct vgic_translation_cache_entry { >>> + struct list_head entry; >>> + phys_addr_t db; >>> + u32 devid; >>> + u32 eventid; >>> + struct vgic_irq *irq; >>> +}; >>> + >>> /** >>> * struct vgic_its_abi - ITS abi ops and settings >>> * @cte_esz: collection table entry size >>> @@ -1668,6 +1676,45 @@ static int vgic_register_its_iodev(struct kvm *kvm, struct vgic_its *its, >>> return ret; >>> } >>> >>> +/* Default is 16 cached LPIs per vcpu */ >>> +#define LPI_DEFAULT_PCPU_CACHE_SIZE 16 >>> + >>> +void vgic_lpi_translation_cache_init(struct kvm *kvm) >>> +{ >>> + struct vgic_dist *dist = &kvm->arch.vgic; >>> + unsigned int sz; >>> + int i; >>> + >>> + if (!list_empty(&dist->lpi_translation_cache)) >>> + return; >>> + >>> + sz = atomic_read(&kvm->online_vcpus) * LPI_DEFAULT_PCPU_CACHE_SIZE; >>> + >>> + for (i = 0; i < sz; i++) { >>> + 
struct vgic_translation_cache_entry *cte; >>> + >>> + /* An allocation failure is not fatal */ >>> + cte = kzalloc(sizeof(*cte), GFP_KERNEL); >>> + if (WARN_ON(!cte)) >>> + break; >>> + >>> + INIT_LIST_HEAD(&cte->entry); >>> + list_add(&cte->entry, &dist->lpi_translation_cache); >> >> Going through the series, it looks like this list is either empty >> (before the cache init) or has a fixed number >> (LPI_DEFAULT_PCPU_CACHE_SIZE * nr_cpus) of entries. > > Well, it could also fail when allocating one of the entry, meaning we > can have an allocation ranging from 0 to (LPI_DEFAULT_PCPU_CACHE_SIZE > * nr_cpus) entries. > >> And the list never grows nor shrinks throughout the series, so it >> seems odd to be using a list here. >> >> Is there a reason for not using a dynamically allocated array instead of >> the list? (does list_move() provide a big perf advantage over swapping >> the data from one array entry to another? Or is there some other >> facility I am missing? > > The idea was to make the LRU policy cheap, on the assumption that > list_move (which is only a couple of pointer updates) is cheaper than > a memmove if you want to keep the array ordered. If we exclude the > list head, we end-up with 24 bytes per entry to move down to make room > for the new entry at the head of the array. For large caches that miss > very often, this will hurt badly. But is that really a problem? I > don't know. > > We could allocate an array as you suggest, and use a linked list > inside the array. Or something else. I'm definitely open to > suggestion! If there turns out to be some benefit to using a fixed array, we could use a simple ring buffer.
Have one pointer on the most recently inserted entry (and we know the next insertion will take place on the entry "just before" it) and one pointer on the least recently used entry (which gets moved when the most recently inserted catches up to it) so we know where to stop when looping. We don't really have to worry about the "ring buffer" full case since that means we just overwrite the LRU and move the pointer. This might prove a bit more efficient when looping over the cache entries compared to the list. However, I have no certainty of actual performance gain from that and the current implementation has the benefit of being simple. Let me know if you decide to give the ring buffer approach a try. Otherwise there's always the option to add an even more complex structure with a hashtable + linked list using hashes and tags to look up the entries. But keeping things simple for now seems reasonable (also, it avoids having to think about what to use as hash and tag :D ). Cheers, -- Julien Thierry
On 12/06/2019 11:58, Julien Thierry wrote: > > > On 12/06/2019 10:52, Marc Zyngier wrote: >> Hi Julien, >> >> On Wed, 12 Jun 2019 09:16:21 +0100, >> Julien Thierry <julien.thierry@arm.com> wrote: >>> >>> Hi Marc, >>> >>> On 11/06/2019 18:03, Marc Zyngier wrote: >>>> Add the basic data structure that expresses an MSI to LPI >>>> translation as well as the allocation/release hooks. >>>> >>>> THe size of the cache is arbitrarily defined as 4*nr_vcpus. >>>> >>> >>> The size has been arbitrarily changed to 16*nr_vcpus :) . >> >> Well spotted! ;-) >> >>> >>> Nit: The* >> >> Ah, usual lazy finger on the Shift key... One day I'll learn to type. >> >>> >>>> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> >>>> --- >>>> include/kvm/arm_vgic.h | 3 +++ >>>> virt/kvm/arm/vgic/vgic-init.c | 5 ++++ >>>> virt/kvm/arm/vgic/vgic-its.c | 49 +++++++++++++++++++++++++++++++++++ >>>> virt/kvm/arm/vgic/vgic.h | 2 ++ >>>> 4 files changed, 59 insertions(+) >>>> > > [...] > >>>> diff --git a/virt/kvm/arm/vgic/vgic-its.c b/virt/kvm/arm/vgic/vgic-its.c >>>> index 44ceaccb18cf..ce9bcddeb7f1 100644 >>>> --- a/virt/kvm/arm/vgic/vgic-its.c >>>> +++ b/virt/kvm/arm/vgic/vgic-its.c >>>> @@ -149,6 +149,14 @@ struct its_ite { >>>> u32 event_id; >>>> }; >>>> >>>> +struct vgic_translation_cache_entry { >>>> + struct list_head entry; >>>> + phys_addr_t db; >>>> + u32 devid; >>>> + u32 eventid; >>>> + struct vgic_irq *irq; >>>> +}; >>>> + >>>> /** >>>> * struct vgic_its_abi - ITS abi ops and settings >>>> * @cte_esz: collection table entry size >>>> @@ -1668,6 +1676,45 @@ static int vgic_register_its_iodev(struct kvm *kvm, struct vgic_its *its, >>>> return ret; >>>> } >>>> >>>> +/* Default is 16 cached LPIs per vcpu */ >>>> +#define LPI_DEFAULT_PCPU_CACHE_SIZE 16 >>>> + >>>> +void vgic_lpi_translation_cache_init(struct kvm *kvm) >>>> +{ >>>> + struct vgic_dist *dist = &kvm->arch.vgic; >>>> + unsigned int sz; >>>> + int i; >>>> + >>>> + if (!list_empty(&dist->lpi_translation_cache)) >>>> + return; >>>> + 
>>>> + sz = atomic_read(&kvm->online_vcpus) * LPI_DEFAULT_PCPU_CACHE_SIZE; >>>> + >>>> + for (i = 0; i < sz; i++) { >>>> + struct vgic_translation_cache_entry *cte; >>>> + >>>> + /* An allocation failure is not fatal */ >>>> + cte = kzalloc(sizeof(*cte), GFP_KERNEL); >>>> + if (WARN_ON(!cte)) >>>> + break; >>>> + >>>> + INIT_LIST_HEAD(&cte->entry); >>>> + list_add(&cte->entry, &dist->lpi_translation_cache); >>> >>> Going through the series, it looks like this list is either empty >>> (before the cache init) or has a fixed number >>> (LPI_DEFAULT_PCPU_CACHE_SIZE * nr_cpus) of entries. >> >> Well, it could also fail when allocating one of the entry, meaning we >> can have an allocation ranging from 0 to (LPI_DEFAULT_PCPU_CACHE_SIZE >> * nr_cpus) entries. >> >>> And the list never grows nor shrinks throughout the series, so it >>> seems odd to be using a list here. >>> >>> Is there a reason for not using a dynamically allocated array instead of >>> the list? (does list_move() provide a big perf advantage over swapping >>> the data from one array entry to another? Or is there some other >>> facility I am missing? >> >> The idea was to make the LRU policy cheap, on the assumption that >> list_move (which is only a couple of pointer updates) is cheaper than >> a memmove if you want to keep the array ordered. If we exclude the >> list head, we end-up with 24 bytes per entry to move down to make room >> for the new entry at the head of the array. For large caches that miss >> very often, this will hurt badly. But is that really a problem? I >> don't know. >> > > Yes, I realized afterwards that the LRU uses the fact you can easily > move list entries without modifying the rest of the list. > >> We could allocate an array as you suggest, and use a linked list >> inside the array. Or something else. I'm definitely open to >> suggestion! > > If it there turns out to be some benefit to just you a fixed array, we > could use a simple ring buffer. 
> Have one pointer on the most recently > inserted entry (and we know the next insertion will take place on the > entry "just before" it) and one pointer on the least recently used entry > (which gets moved when the most recently inserted catches up to it) so > we know where to stop when looping. We don't really have to worry about > the "ring buffer" full case since that means we just overwrite the LRU > and move the pointer. > > This might prove a bit more efficient when looping over the cache > entries compared to the list. However, I have no certainty of actual > performance gain from that and the current implementation has the > benefit of being simple. > > Let me know if you decide to give the ring buffer approach a try. > > Otherwise there's always the option to add even more complex structure > with a hashtable + linked list using hashes and tags to lookup the > entries. But keeping things simple for now seems reasonable (also, it > avoids having to think about what to use as hash and tag :D ). > Actually, that's still not a good approach for when there is a cache hit and we want to move an entry to the most recently used position. A list seems like the best approach in terms of keeping it simple. Sorry for the noise. -- Julien Thierry
Hi Marc, On 2019/6/12 1:03, Marc Zyngier wrote: > On a successful translation, preserve the parameters in the LPI > translation cache. Each translation is reusing the last slot > in the list, naturally evincting the least recently used entry. > > Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> > --- > virt/kvm/arm/vgic/vgic-its.c | 86 ++++++++++++++++++++++++++++++++++++ > 1 file changed, 86 insertions(+) > > diff --git a/virt/kvm/arm/vgic/vgic-its.c b/virt/kvm/arm/vgic/vgic-its.c > index 0aa0cbbc3af6..62932458476a 100644 > --- a/virt/kvm/arm/vgic/vgic-its.c > +++ b/virt/kvm/arm/vgic/vgic-its.c > @@ -546,6 +546,90 @@ static unsigned long vgic_mmio_read_its_idregs(struct kvm *kvm, > return 0; > } > > +static struct vgic_irq *__vgic_its_check_cache(struct vgic_dist *dist, > + phys_addr_t db, > + u32 devid, u32 eventid) > +{ > + struct vgic_translation_cache_entry *cte; > + struct vgic_irq *irq = NULL; > + > + list_for_each_entry(cte, &dist->lpi_translation_cache, entry) { > + /* > + * If we hit a NULL entry, there is nothing after this > + * point. > + */ > + if (!cte->irq) > + break; > + > + if (cte->db == db && > + cte->devid == devid && > + cte->eventid == eventid) { > + /* > + * Move this entry to the head, as it is the > + * most recently used. > + */ > + list_move(&cte->entry, &dist->lpi_translation_cache); Only for performance reasons: if we hit at the "head" of the list, we don't need to do a list_move(). In our tests, we found that a single list_move() takes nearly (sometimes even more than) one microsecond, for some unknown reason... 
Thanks, zenghui > + irq = cte->irq; > + break; > + } > + } > + > + return irq; > +} > + > +static void vgic_its_cache_translation(struct kvm *kvm, struct vgic_its *its, > + u32 devid, u32 eventid, > + struct vgic_irq *irq) > +{ > + struct vgic_dist *dist = &kvm->arch.vgic; > + struct vgic_translation_cache_entry *cte; > + unsigned long flags; > + phys_addr_t db; > + > + /* Do not cache a directly injected interrupt */ > + if (irq->hw) > + return; > + > + raw_spin_lock_irqsave(&dist->lpi_list_lock, flags); > + > + if (unlikely(list_empty(&dist->lpi_translation_cache))) > + goto out; > + > + /* > + * We could have raced with another CPU caching the same > + * translation behind our back, so let's check it is not in > + * already > + */ > + db = its->vgic_its_base + GITS_TRANSLATER; > + if (__vgic_its_check_cache(dist, db, devid, eventid)) > + goto out; > + > + /* Always reuse the last entry (LRU policy) */ > + cte = list_last_entry(&dist->lpi_translation_cache, > + typeof(*cte), entry); > + > + /* > + * Caching the translation implies having an extra reference > + * to the interrupt, so drop the potential reference on what > + * was in the cache, and increment it on the new interrupt. 
> + */ > + if (cte->irq) > + __vgic_put_lpi_locked(kvm, cte->irq); > + > + vgic_get_irq_kref(irq); > + > + cte->db = db; > + cte->devid = devid; > + cte->eventid = eventid; > + cte->irq = irq; > + > + /* Move the new translation to the head of the list */ > + list_move(&cte->entry, &dist->lpi_translation_cache); > + > +out: > + raw_spin_unlock_irqrestore(&dist->lpi_list_lock, flags); > +} > + > void vgic_its_invalidate_cache(struct kvm *kvm) > { > struct vgic_dist *dist = &kvm->arch.vgic; > @@ -589,6 +673,8 @@ int vgic_its_resolve_lpi(struct kvm *kvm, struct vgic_its *its, > if (!vcpu->arch.vgic_cpu.lpis_enabled) > return -EBUSY; > > + vgic_its_cache_translation(kvm, its, devid, eventid, ite->irq); > + > *irq = ite->irq; > return 0; > } > _______________________________________________ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Hi Zenghui, On 25/06/2019 12:50, Zenghui Yu wrote: > Hi Marc, > > On 2019/6/12 1:03, Marc Zyngier wrote: >> On a successful translation, preserve the parameters in the LPI >> translation cache. Each translation is reusing the last slot >> in the list, naturally evincting the least recently used entry. >> >> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> >> --- >> virt/kvm/arm/vgic/vgic-its.c | 86 ++++++++++++++++++++++++++++++++++++ >> 1 file changed, 86 insertions(+) >> >> diff --git a/virt/kvm/arm/vgic/vgic-its.c b/virt/kvm/arm/vgic/vgic-its.c >> index 0aa0cbbc3af6..62932458476a 100644 >> --- a/virt/kvm/arm/vgic/vgic-its.c >> +++ b/virt/kvm/arm/vgic/vgic-its.c >> @@ -546,6 +546,90 @@ static unsigned long vgic_mmio_read_its_idregs(struct kvm *kvm, >> return 0; >> } >> >> +static struct vgic_irq *__vgic_its_check_cache(struct vgic_dist *dist, >> + phys_addr_t db, >> + u32 devid, u32 eventid) >> +{ >> + struct vgic_translation_cache_entry *cte; >> + struct vgic_irq *irq = NULL; >> + >> + list_for_each_entry(cte, &dist->lpi_translation_cache, entry) { >> + /* >> + * If we hit a NULL entry, there is nothing after this >> + * point. >> + */ >> + if (!cte->irq) >> + break; >> + >> + if (cte->db == db && >> + cte->devid == devid && >> + cte->eventid == eventid) { >> + /* >> + * Move this entry to the head, as it is the >> + * most recently used. >> + */ >> + list_move(&cte->entry, &dist->lpi_translation_cache); > > Only for performance reasons: if we hit at the "head" of the list, we > don't need to do a list_move(). > In our tests, we found that a single list_move() takes nearly (sometimes > even more than) one microsecond, for some unknown reason... Huh... That's odd. Can you narrow down under which conditions this happens? I'm not sure if checking for the list head would be more efficient, as you end-up fetching the head anyway. 
Can you try replacing this line with: if (!list_is_first(&cte->entry, &dist->lpi_translation_cache)) list_move(&cte->entry, &dist->lpi_translation_cache); and let me know whether it helps? Thanks, M. -- Jazz is not dead. It just smells funny...
Hi Marc, On 2019/6/25 20:31, Marc Zyngier wrote: > Hi Zenghui, > > On 25/06/2019 12:50, Zenghui Yu wrote: >> Hi Marc, >> >> On 2019/6/12 1:03, Marc Zyngier wrote: >>> On a successful translation, preserve the parameters in the LPI >>> translation cache. Each translation is reusing the last slot >>> in the list, naturally evincting the least recently used entry. >>> >>> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> >>> --- >>> virt/kvm/arm/vgic/vgic-its.c | 86 ++++++++++++++++++++++++++++++++++++ >>> 1 file changed, 86 insertions(+) >>> >>> diff --git a/virt/kvm/arm/vgic/vgic-its.c b/virt/kvm/arm/vgic/vgic-its.c >>> index 0aa0cbbc3af6..62932458476a 100644 >>> --- a/virt/kvm/arm/vgic/vgic-its.c >>> +++ b/virt/kvm/arm/vgic/vgic-its.c >>> @@ -546,6 +546,90 @@ static unsigned long vgic_mmio_read_its_idregs(struct kvm *kvm, >>> return 0; >>> } >>> >>> +static struct vgic_irq *__vgic_its_check_cache(struct vgic_dist *dist, >>> + phys_addr_t db, >>> + u32 devid, u32 eventid) >>> +{ >>> + struct vgic_translation_cache_entry *cte; >>> + struct vgic_irq *irq = NULL; >>> + >>> + list_for_each_entry(cte, &dist->lpi_translation_cache, entry) { >>> + /* >>> + * If we hit a NULL entry, there is nothing after this >>> + * point. >>> + */ >>> + if (!cte->irq) >>> + break; >>> + >>> + if (cte->db == db && >>> + cte->devid == devid && >>> + cte->eventid == eventid) { >>> + /* >>> + * Move this entry to the head, as it is the >>> + * most recently used. >>> + */ >>> + list_move(&cte->entry, &dist->lpi_translation_cache); >> >> Only for performance reasons: if we hit at the "head" of the list, we >> don't need to do a list_move(). >> In our tests, we found that a single list_move() takes nearly (sometimes >> even more than) one microsecond, for some unknown reason... > > Huh... That's odd. > > Can you narrow down under which conditions this happens? I'm not sure if > checking for the list head would be more efficient, as you end-up > fetching the head anyway. 
> Can you try replacing this line with: > > if (!list_is_first(&cte->entry, &dist->lpi_translation_cache)) > list_move(&cte->entry, &dist->lpi_translation_cache); > > and let me know whether it helps? It helps. With this change, the overhead of list_move() is gone. We run 16 4-vcpu VMs on the host, each with a vhost-user nic, and run "iperf" in pairs between them. It's likely to hit at the head of the cache list in our tests. With this change, the sys% utilization of vhostdpfwd threads will decrease by about 10%. But I don't know the reason exactly (I haven't found any clues in the code yet, in the implementation of list_move...). Thanks, zenghui
On 2019/6/26 0:00, Zenghui Yu wrote: > Hi Marc, > > On 2019/6/25 20:31, Marc Zyngier wrote: >> Hi Zenghui, >> >> On 25/06/2019 12:50, Zenghui Yu wrote: >>> Hi Marc, >>> >>> On 2019/6/12 1:03, Marc Zyngier wrote: >>>> On a successful translation, preserve the parameters in the LPI >>>> translation cache. Each translation is reusing the last slot >>>> in the list, naturally evincting the least recently used entry. >>>> >>>> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> >>>> --- >>>> virt/kvm/arm/vgic/vgic-its.c | 86 >>>> ++++++++++++++++++++++++++++++++++++ >>>> 1 file changed, 86 insertions(+) >>>> >>>> diff --git a/virt/kvm/arm/vgic/vgic-its.c >>>> b/virt/kvm/arm/vgic/vgic-its.c >>>> index 0aa0cbbc3af6..62932458476a 100644 >>>> --- a/virt/kvm/arm/vgic/vgic-its.c >>>> +++ b/virt/kvm/arm/vgic/vgic-its.c >>>> @@ -546,6 +546,90 @@ static unsigned long >>>> vgic_mmio_read_its_idregs(struct kvm *kvm, >>>> return 0; >>>> } >>>> +static struct vgic_irq *__vgic_its_check_cache(struct vgic_dist *dist, >>>> + phys_addr_t db, >>>> + u32 devid, u32 eventid) >>>> +{ >>>> + struct vgic_translation_cache_entry *cte; >>>> + struct vgic_irq *irq = NULL; >>>> + >>>> + list_for_each_entry(cte, &dist->lpi_translation_cache, entry) { >>>> + /* >>>> + * If we hit a NULL entry, there is nothing after this >>>> + * point. >>>> + */ >>>> + if (!cte->irq) >>>> + break; >>>> + >>>> + if (cte->db == db && >>>> + cte->devid == devid && >>>> + cte->eventid == eventid) { >>>> + /* >>>> + * Move this entry to the head, as it is the >>>> + * most recently used. >>>> + */ >>>> + list_move(&cte->entry, &dist->lpi_translation_cache); >>> >>> Only for performance reasons: if we hit at the "head" of the list, we >>> don't need to do a list_move(). >>> In our tests, we found that a single list_move() takes nearly (sometimes >>> even more than) one microsecond, for some unknown reason... s/one microsecond/500 nanoseconds/ (I got the value of CNTFRQ wrong, sorry.) >> >> Huh... That's odd. 
>> >> Can you narrow down under which conditions this happens? I'm not sure if >> checking for the list head would be more efficient, as you end-up >> fetching the head anyway. Can you try replacing this line with: >> >> if (!list_is_first(&cte->entry, &dist->lpi_translation_cache)) >> list_move(&cte->entry, &dist->lpi_translation_cache); >> >> and let me know whether it helps? > > It helps. With this change, the overhead of list_move() is gone. > > We run 16 4-vcpu VMs on the host, each with a vhost-user nic, and run > "iperf" in pairs between them. It's likely to hit at the head of the > cache list in our tests. > With this change, the sys% utilization of vhostdpfwd threads will > decrease by about 10%. But I don't know the reason exactly (I haven't > found any clues in code yet, in implementation of list_move...). > > > Thanks, > zenghui > > _______________________________________________ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
On Tue, 25 Jun 2019 17:00:54 +0100, Zenghui Yu <yuzenghui@huawei.com> wrote: > > Hi Marc, > > On 2019/6/25 20:31, Marc Zyngier wrote: > > Hi Zenghui, > > > > On 25/06/2019 12:50, Zenghui Yu wrote: > >> Hi Marc, > >> > >> On 2019/6/12 1:03, Marc Zyngier wrote: > >>> On a successful translation, preserve the parameters in the LPI > >>> translation cache. Each translation is reusing the last slot > >>> in the list, naturally evincting the least recently used entry. > >>> > >>> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> > >>> --- > >>> virt/kvm/arm/vgic/vgic-its.c | 86 ++++++++++++++++++++++++++++++++++++ > >>> 1 file changed, 86 insertions(+) > >>> > >>> diff --git a/virt/kvm/arm/vgic/vgic-its.c b/virt/kvm/arm/vgic/vgic-its.c > >>> index 0aa0cbbc3af6..62932458476a 100644 > >>> --- a/virt/kvm/arm/vgic/vgic-its.c > >>> +++ b/virt/kvm/arm/vgic/vgic-its.c > >>> @@ -546,6 +546,90 @@ static unsigned long vgic_mmio_read_its_idregs(struct kvm *kvm, > >>> return 0; > >>> } > >>> +static struct vgic_irq *__vgic_its_check_cache(struct > >>> vgic_dist *dist, > >>> + phys_addr_t db, > >>> + u32 devid, u32 eventid) > >>> +{ > >>> + struct vgic_translation_cache_entry *cte; > >>> + struct vgic_irq *irq = NULL; > >>> + > >>> + list_for_each_entry(cte, &dist->lpi_translation_cache, entry) { > >>> + /* > >>> + * If we hit a NULL entry, there is nothing after this > >>> + * point. > >>> + */ > >>> + if (!cte->irq) > >>> + break; > >>> + > >>> + if (cte->db == db && > >>> + cte->devid == devid && > >>> + cte->eventid == eventid) { > >>> + /* > >>> + * Move this entry to the head, as it is the > >>> + * most recently used. > >>> + */ > >>> + list_move(&cte->entry, &dist->lpi_translation_cache); > >> > >> Only for performance reasons: if we hit at the "head" of the list, we > >> don't need to do a list_move(). > >> In our tests, we found that a single list_move() takes nearly (sometimes > >> even more than) one microsecond, for some unknown reason... > > > > Huh... That's odd. 
> > > > Can you narrow down under which conditions this happens? I'm not sure if > > checking for the list head would be more efficient, as you end-up > > fetching the head anyway. > > Can you try replacing this line with: > > > > if (!list_is_first(&cte->entry, &dist->lpi_translation_cache)) > > list_move(&cte->entry, &dist->lpi_translation_cache); > > > > and let me know whether it helps? > > It helps. With this change, the overhead of list_move() is gone. > > We run 16 4-vcpu VMs on the host, each with a vhost-user nic, and run > "iperf" in pairs between them. It's likely to hit at the head of the > cache list in our tests. > With this change, the sys% utilization of vhostdpfwd threads will > decrease by about 10%. But I don't know the reason exactly (I haven't > found any clues in code yet, in implementation of list_move...). list_move is rather simple, and shouldn't be too hard to execute quickly. The only contention I can imagine is that as the cache line is held by multiple CPUs, the update to the list pointers causes an invalidation to be sent to other CPUs, leading to a slower update. But it remains that 500ns is a pretty long time (that's 1000 cycles on a 2GHz CPU). It'd be interesting to throw perf at this and see what shows up. It would give us a clue about what is going on here. Thanks, M. -- Jazz is not dead, it just smells funny.
Hi Marc, On 6/11/19 7:03 PM, Marc Zyngier wrote: > The LPI translation cache needs to be discarded when an ITS command > may affect the translation of an LPI (DISCARD and MAPD with V=0) or > the routing of an LPI to a redistributor with disabled LPIs (MOVI, > MOVALL). > > We decide to perform a full invalidation of the cache, irrespective > of the LPI that is affected. Commands are supposed to be rare enough > that it doesn't matter. > > Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> > --- > virt/kvm/arm/vgic/vgic-its.c | 8 ++++++++ > 1 file changed, 8 insertions(+) > > diff --git a/virt/kvm/arm/vgic/vgic-its.c b/virt/kvm/arm/vgic/vgic-its.c > index 9b6b66204b97..5254bb762e1b 100644 > --- a/virt/kvm/arm/vgic/vgic-its.c > +++ b/virt/kvm/arm/vgic/vgic-its.c > @@ -733,6 +733,8 @@ static int vgic_its_cmd_handle_discard(struct kvm *kvm, struct vgic_its *its, > * don't bother here since we clear the ITTE anyway and the > * pending state is a property of the ITTE struct. > */ > + vgic_its_invalidate_cache(kvm); > + > its_free_ite(kvm, ite); > return 0; > } > @@ -768,6 +770,8 @@ static int vgic_its_cmd_handle_movi(struct kvm *kvm, struct vgic_its *its, > ite->collection = collection; > vcpu = kvm_get_vcpu(kvm, collection->target_addr); > > + vgic_its_invalidate_cache(kvm); > + > return update_affinity(ite->irq, vcpu); > } > > @@ -996,6 +1000,8 @@ static void vgic_its_free_device(struct kvm *kvm, struct its_device *device) > list_for_each_entry_safe(ite, temp, &device->itt_head, ite_list) > its_free_ite(kvm, ite); > > + vgic_its_invalidate_cache(kvm); > + > list_del(&device->dev_list); > kfree(device); > } > @@ -1249,6 +1255,8 @@ static int vgic_its_cmd_handle_movall(struct kvm *kvm, struct vgic_its *its, > vgic_put_irq(kvm, irq); > } > > + vgic_its_invalidate_cache(kvm); All the commands are executed with the its_lock held. Now we don't take it anymore on the fast cache injection path. 
Don't we have a window where the move has been applied at table level and the cache is not yet invalidated? Same question for vgic_its_free_device(). Thanks Eric > + > kfree(intids); > return 0; > }
Hi Eric, On 01/07/2019 13:38, Auger Eric wrote: > Hi Marc, > > On 6/11/19 7:03 PM, Marc Zyngier wrote: >> The LPI translation cache needs to be discarded when an ITS command >> may affect the translation of an LPI (DISCARD and MAPD with V=0) or >> the routing of an LPI to a redistributor with disabled LPIs (MOVI, >> MOVALL). >> >> We decide to perform a full invalidation of the cache, irrespective >> of the LPI that is affected. Commands are supposed to be rare enough >> that it doesn't matter. >> >> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> >> --- >> virt/kvm/arm/vgic/vgic-its.c | 8 ++++++++ >> 1 file changed, 8 insertions(+) >> >> diff --git a/virt/kvm/arm/vgic/vgic-its.c b/virt/kvm/arm/vgic/vgic-its.c >> index 9b6b66204b97..5254bb762e1b 100644 >> --- a/virt/kvm/arm/vgic/vgic-its.c >> +++ b/virt/kvm/arm/vgic/vgic-its.c >> @@ -733,6 +733,8 @@ static int vgic_its_cmd_handle_discard(struct kvm *kvm, struct vgic_its *its, >> * don't bother here since we clear the ITTE anyway and the >> * pending state is a property of the ITTE struct. >> */ >> + vgic_its_invalidate_cache(kvm); >> + >> its_free_ite(kvm, ite); >> return 0; >> } >> @@ -768,6 +770,8 @@ static int vgic_its_cmd_handle_movi(struct kvm *kvm, struct vgic_its *its, >> ite->collection = collection; >> vcpu = kvm_get_vcpu(kvm, collection->target_addr); >> >> + vgic_its_invalidate_cache(kvm); >> + >> return update_affinity(ite->irq, vcpu); >> } >> >> @@ -996,6 +1000,8 @@ static void vgic_its_free_device(struct kvm *kvm, struct its_device *device) >> list_for_each_entry_safe(ite, temp, &device->itt_head, ite_list) >> its_free_ite(kvm, ite); >> >> + vgic_its_invalidate_cache(kvm); >> + >> list_del(&device->dev_list); >> kfree(device); >> } >> @@ -1249,6 +1255,8 @@ static int vgic_its_cmd_handle_movall(struct kvm *kvm, struct vgic_its *its, >> vgic_put_irq(kvm, irq); >> } >> >> + vgic_its_invalidate_cache(kvm); > All the commands are executed with the its_lock held. 
> Now we don't take > it anymore on the fast cache injection path. > Don't we have a window > where the move has been applied at table level and the cache is not yet > invalidated? Same question for vgic_its_free_device(). There is definitely a race, but that race is invisible from the guest's perspective. The guest can only assume that the command has taken effect once a SYNC command has been executed, and it cannot observe that the SYNC command has been executed before we have invalidated the cache. Does this answer your question? Thanks, M. -- Jazz is not dead. It just smells funny...
On Tue, 11 Jun 2019 18:03:27 +0100 Marc Zyngier <marc.zyngier@arm.com> wrote: Hi, > It recently became apparent[1] that our LPI injection path is not as > efficient as it could be when injecting interrupts coming from a VFIO > assigned device. > > Although the proposed patch wasn't 100% correct, it outlined at least > two issues: > > (1) Injecting an LPI from VFIO always results in a context switch to a > worker thread: no good > > (2) We have no way of amortising the cost of translating a DID+EID pair > to an LPI number > > The reason for (1) is that we may sleep when translating an LPI, so we > do need a context process. A way to fix that is to implement a small > LPI translation cache that could be looked up from an atomic > context. It would also solve (2). > > This is what this small series proposes. It implements a very basic > LRU cache of pre-translated LPIs, which gets used to implement > kvm_arch_set_irq_inatomic. The size of the cache is currently > hard-coded at 16 times the number of vcpus, a number I have picked > under the influence of Ali Saidi. If that's not enough for you, blame > me, though. > > Does it work? well, it doesn't crash, and is thus perfect. More > seriously, I don't really have a way to benchmark it directly, so my > observations are only indirect: > > On a TX2 system, I run a 4 vcpu VM with an Ethernet interface passed > to it directly. From the host, I inject interrupts using debugfs. In > parallel, I look at the number of context switch, and the number of > interrupts on the host. Without this series, I get the same number for > both IRQ and CS (about half a million of each per second is pretty > easy to reach). With this series, the number of context switches drops > to something pretty small (in the low 2k), while the number of > interrupts stays the same. > > Yes, this is a pretty rubbish benchmark, what did you expect? ;-) > > So I'm putting this out for people with real workloads to try out and > report what they see. 
So I gave that a shot with some benchmarks. As expected, it is quite hard to show an improvement with just one guest running, although we could show a 103%(!) improvement of the memcached QPS score in one experiment when running it in a guest with an external load generator. Throwing more users into the game showed a significant improvement: Benchmark 1: kernel compile/FIO: Compiling a kernel on the host, while letting a guest run FIO with 4K randreads from a passed-through NVMe SSD: The IOPS with this series improved by 27% compared to pure mainline, reaching 80% of the host value. Kernel compilation time improved by 8.5% compared to mainline. Benchmark 2: FIO/FIO: Running FIO on a passed-through SATA SSD in one guest, and FIO on a passed-through NVMe SSD in another guest, at the same time: The IOPS with this series improved by 23% for the NVMe and 34% for the SATA disk, compared to pure mainline. So judging from these results, I think this series is a significant improvement, which justifies merging it so it can receive wider testing. It would be good if others could also run performance experiments and post their results. Cheers, Andre. 
> [1] https://lore.kernel.org/lkml/1552833373-19828-1-git-send-email-yuzenghui@huawei.com/ > > * From v1: > > - Fixed race on allocation, where the same LPI could be cached multiple times > - Now invalidate the cache on vgic teardown, avoiding memory leaks > - Change patch split slightly, general reshuffling > - Small cleanups here and there > - Rebased on 5.2-rc4 > > Marc Zyngier (9): > KVM: arm/arm64: vgic: Add LPI translation cache definition > KVM: arm/arm64: vgic: Add __vgic_put_lpi_locked primitive > KVM: arm/arm64: vgic-its: Add MSI-LPI translation cache invalidation > KVM: arm/arm64: vgic-its: Invalidate MSI-LPI translation cache on > specific commands > KVM: arm/arm64: vgic-its: Invalidate MSI-LPI translation cache on > disabling LPIs > KVM: arm/arm64: vgic-its: Invalidate MSI-LPI translation cache on vgic > teardown > KVM: arm/arm64: vgic-its: Cache successful MSI->LPI translation > KVM: arm/arm64: vgic-its: Check the LPI translation cache on MSI > injection > KVM: arm/arm64: vgic-irqfd: Implement kvm_arch_set_irq_inatomic > > include/kvm/arm_vgic.h | 3 + > virt/kvm/arm/vgic/vgic-init.c | 5 + > virt/kvm/arm/vgic/vgic-irqfd.c | 36 +++++- > virt/kvm/arm/vgic/vgic-its.c | 204 +++++++++++++++++++++++++++++++ > virt/kvm/arm/vgic/vgic-mmio-v3.c | 4 +- > virt/kvm/arm/vgic/vgic.c | 26 ++-- > virt/kvm/arm/vgic/vgic.h | 5 + > 7 files changed, 267 insertions(+), 16 deletions(-) >
Hi Marc, On 7/22/19 12:54 PM, Marc Zyngier wrote: > Hi Eric, > > On 01/07/2019 13:38, Auger Eric wrote: >> Hi Marc, >> >> On 6/11/19 7:03 PM, Marc Zyngier wrote: >>> The LPI translation cache needs to be discarded when an ITS command >>> may affect the translation of an LPI (DISCARD and MAPD with V=0) or >>> the routing of an LPI to a redistributor with disabled LPIs (MOVI, >>> MOVALL). >>> >>> We decide to perform a full invalidation of the cache, irrespective >>> of the LPI that is affected. Commands are supposed to be rare enough >>> that it doesn't matter. >>> >>> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> >>> --- >>> virt/kvm/arm/vgic/vgic-its.c | 8 ++++++++ >>> 1 file changed, 8 insertions(+) >>> >>> diff --git a/virt/kvm/arm/vgic/vgic-its.c b/virt/kvm/arm/vgic/vgic-its.c >>> index 9b6b66204b97..5254bb762e1b 100644 >>> --- a/virt/kvm/arm/vgic/vgic-its.c >>> +++ b/virt/kvm/arm/vgic/vgic-its.c >>> @@ -733,6 +733,8 @@ static int vgic_its_cmd_handle_discard(struct kvm *kvm, struct vgic_its *its, >>> * don't bother here since we clear the ITTE anyway and the >>> * pending state is a property of the ITTE struct. 
>>> */ >>> + vgic_its_invalidate_cache(kvm); >>> + >>> its_free_ite(kvm, ite); >>> return 0; >>> } >>> @@ -768,6 +770,8 @@ static int vgic_its_cmd_handle_movi(struct kvm *kvm, struct vgic_its *its, >>> ite->collection = collection; >>> vcpu = kvm_get_vcpu(kvm, collection->target_addr); >>> >>> + vgic_its_invalidate_cache(kvm); >>> + >>> return update_affinity(ite->irq, vcpu); >>> } >>> >>> @@ -996,6 +1000,8 @@ static void vgic_its_free_device(struct kvm *kvm, struct its_device *device) >>> list_for_each_entry_safe(ite, temp, &device->itt_head, ite_list) >>> its_free_ite(kvm, ite); >>> >>> + vgic_its_invalidate_cache(kvm); >>> + >>> list_del(&device->dev_list); >>> kfree(device); >>> } >>> @@ -1249,6 +1255,8 @@ static int vgic_its_cmd_handle_movall(struct kvm *kvm, struct vgic_its *its, >>> vgic_put_irq(kvm, irq); >>> } >>> >>> + vgic_its_invalidate_cache(kvm); >> All the commands are executed with the its_lock held. Now we don't take >> it anymore on the fast cache injection path. Don't we have a window >> where the move has been applied at table level and the cache is not yet >> invalidated? Same question for vgic_its_free_device(). > > There is definitely a race, but that race is invisible from the guest's > perspective. The guest can only assume that the command has taken effect > once a SYNC command has been executed, and it cannot observe that the > SYNC command has been executed before we have invalidated the cache. > > Does this answer your question? OK, that makes sense. Thank you for the clarification. Another question: don't we need to invalidate the cache on MAPC V=0 as well? Removing the mapping of the collection results in interrupts belonging to that collection being ignored. If we don't flush the cache, won't the pending bit still be set? Thanks Eric > > Thanks, > > M. >
Hi Marc, On 6/11/19 7:03 PM, Marc Zyngier wrote: > There's a number of cases where we need to invalidate the caching > of translations, so let's add basic support for that. > > Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Thanks Eric > --- > virt/kvm/arm/vgic/vgic-its.c | 23 +++++++++++++++++++++++ > virt/kvm/arm/vgic/vgic.h | 1 + > 2 files changed, 24 insertions(+) > > diff --git a/virt/kvm/arm/vgic/vgic-its.c b/virt/kvm/arm/vgic/vgic-its.c > index ce9bcddeb7f1..9b6b66204b97 100644 > --- a/virt/kvm/arm/vgic/vgic-its.c > +++ b/virt/kvm/arm/vgic/vgic-its.c > @@ -546,6 +546,29 @@ static unsigned long vgic_mmio_read_its_idregs(struct kvm *kvm, > return 0; > } > > +void vgic_its_invalidate_cache(struct kvm *kvm) > +{ > + struct vgic_dist *dist = &kvm->arch.vgic; > + struct vgic_translation_cache_entry *cte; > + unsigned long flags; > + > + raw_spin_lock_irqsave(&dist->lpi_list_lock, flags); > + > + list_for_each_entry(cte, &dist->lpi_translation_cache, entry) { > + /* > + * If we hit a NULL entry, there is nothing after this > + * point. 
> + */ > + if (!cte->irq) > + break; > + > + __vgic_put_lpi_locked(kvm, cte->irq); > + cte->irq = NULL; > + } > + > + raw_spin_unlock_irqrestore(&dist->lpi_list_lock, flags); > +} > + > int vgic_its_resolve_lpi(struct kvm *kvm, struct vgic_its *its, > u32 devid, u32 eventid, struct vgic_irq **irq) > { > diff --git a/virt/kvm/arm/vgic/vgic.h b/virt/kvm/arm/vgic/vgic.h > index afac2fed7df4..072f810dc441 100644 > --- a/virt/kvm/arm/vgic/vgic.h > +++ b/virt/kvm/arm/vgic/vgic.h > @@ -319,6 +319,7 @@ int vgic_its_resolve_lpi(struct kvm *kvm, struct vgic_its *its, > struct vgic_its *vgic_msi_to_its(struct kvm *kvm, struct kvm_msi *msi); > void vgic_lpi_translation_cache_init(struct kvm *kvm); > void vgic_lpi_translation_cache_destroy(struct kvm *kvm); > +void vgic_its_invalidate_cache(struct kvm *kvm); > > bool vgic_supports_direct_msis(struct kvm *kvm); > int vgic_v4_init(struct kvm *kvm); >
On 23/07/2019 13:25, Auger Eric wrote: > Hi Marc, > > On 7/22/19 12:54 PM, Marc Zyngier wrote: >> Hi Eric, >> >> On 01/07/2019 13:38, Auger Eric wrote: >>> Hi Marc, >>> >>> On 6/11/19 7:03 PM, Marc Zyngier wrote: >>>> The LPI translation cache needs to be discarded when an ITS command >>>> may affect the translation of an LPI (DISCARD and MAPD with V=0) or >>>> the routing of an LPI to a redistributor with disabled LPIs (MOVI, >>>> MOVALL). >>>> >>>> We decide to perform a full invalidation of the cache, irrespective >>>> of the LPI that is affected. Commands are supposed to be rare enough >>>> that it doesn't matter. >>>> >>>> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> >>>> --- >>>> virt/kvm/arm/vgic/vgic-its.c | 8 ++++++++ >>>> 1 file changed, 8 insertions(+) >>>> >>>> diff --git a/virt/kvm/arm/vgic/vgic-its.c b/virt/kvm/arm/vgic/vgic-its.c >>>> index 9b6b66204b97..5254bb762e1b 100644 >>>> --- a/virt/kvm/arm/vgic/vgic-its.c >>>> +++ b/virt/kvm/arm/vgic/vgic-its.c >>>> @@ -733,6 +733,8 @@ static int vgic_its_cmd_handle_discard(struct kvm *kvm, struct vgic_its *its, >>>> * don't bother here since we clear the ITTE anyway and the >>>> * pending state is a property of the ITTE struct. 
>>>> */ >>>> + vgic_its_invalidate_cache(kvm); >>>> + >>>> its_free_ite(kvm, ite); >>>> return 0; >>>> } >>>> @@ -768,6 +770,8 @@ static int vgic_its_cmd_handle_movi(struct kvm *kvm, struct vgic_its *its, >>>> ite->collection = collection; >>>> vcpu = kvm_get_vcpu(kvm, collection->target_addr); >>>> >>>> + vgic_its_invalidate_cache(kvm); >>>> + >>>> return update_affinity(ite->irq, vcpu); >>>> } >>>> >>>> @@ -996,6 +1000,8 @@ static void vgic_its_free_device(struct kvm *kvm, struct its_device *device) >>>> list_for_each_entry_safe(ite, temp, &device->itt_head, ite_list) >>>> its_free_ite(kvm, ite); >>>> >>>> + vgic_its_invalidate_cache(kvm); >>>> + >>>> list_del(&device->dev_list); >>>> kfree(device); >>>> } >>>> @@ -1249,6 +1255,8 @@ static int vgic_its_cmd_handle_movall(struct kvm *kvm, struct vgic_its *its, >>>> vgic_put_irq(kvm, irq); >>>> } >>>> >>>> + vgic_its_invalidate_cache(kvm); >>> All the commands are executed with the its_lock held. Now we don't take >>> it anymore on the fast cache injection path. Don't we have a window >>> where the move has been applied at table level and the cache is not yet >>> invalidated? Same question for vgic_its_free_device(). >> >> There is definitely a race, but that race is invisible from the guest's >> perspective. The guest can only assume that the command has taken effect >> once a SYNC command has been executed, and it cannot observe that the >> SYNC command has been executed before we have invalidated the cache. >> >> Does this answer your question? > > OK make sense. Thank you for the clarification > > Another question, don't we need to invalidate the cache on MAPC V=0 as > well? Removing the mapping of the collection results in interrupts > belonging to that collection being ignored. If we don't flush the > pending bit will be set? Yup, that's a good point. I think i had that at some point, and ended up dropping it, probably missing the point that the interrupt would be made pending. 
I think I had that at some point, and ended up dropping it, probably missing the point that the interrupt would be made pending. I'll add this: @@ -1218,6 +1218,7 @@ static int vgic_its_cmd_handle_mapc(struct kvm *kvm, struct vgic_its *its, if (!valid) { vgic_its_free_collection(its, coll_id); + vgic_its_invalidate_cache(kvm); } else { collection = find_collection(its, coll_id); Thanks, M. -- Jazz is not dead. It just smells funny...
Hi Marc, On 6/11/19 7:03 PM, Marc Zyngier wrote: > Add the basic data structure that expresses an MSI to LPI > translation as well as the allocation/release hooks. > > The size of the cache is arbitrarily defined as 16*nr_vcpus. > > Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> > --- > include/kvm/arm_vgic.h | 3 +++ > virt/kvm/arm/vgic/vgic-init.c | 5 ++++ > virt/kvm/arm/vgic/vgic-its.c | 49 +++++++++++++++++++++++++++++++++++ > virt/kvm/arm/vgic/vgic.h | 2 ++ > 4 files changed, 59 insertions(+) > > diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h > index c36c86f1ec9a..ca7bcf52dc85 100644 > --- a/include/kvm/arm_vgic.h > +++ b/include/kvm/arm_vgic.h > @@ -260,6 +260,9 @@ struct vgic_dist { > struct list_head lpi_list_head; > int lpi_list_count; > > + /* LPI translation cache */ > + struct list_head lpi_translation_cache; > + > /* used by vgic-debug */ > struct vgic_state_iter *iter; > > diff --git a/virt/kvm/arm/vgic/vgic-init.c b/virt/kvm/arm/vgic/vgic-init.c > index 3bdb31eaed64..c7c4c77dd430 100644 > --- a/virt/kvm/arm/vgic/vgic-init.c > +++ b/virt/kvm/arm/vgic/vgic-init.c > @@ -64,6 +64,7 @@ void kvm_vgic_early_init(struct kvm *kvm) > struct vgic_dist *dist = &kvm->arch.vgic; > > INIT_LIST_HEAD(&dist->lpi_list_head); > + INIT_LIST_HEAD(&dist->lpi_translation_cache); > raw_spin_lock_init(&dist->lpi_list_lock); > } > > @@ -305,6 +306,7 @@ int vgic_init(struct kvm *kvm) > } > > if (vgic_has_its(kvm)) { > + vgic_lpi_translation_cache_init(kvm); > ret = vgic_v4_init(kvm); > if (ret) > goto out; > @@ -346,6 +348,9 @@ static void kvm_vgic_dist_destroy(struct kvm *kvm) > INIT_LIST_HEAD(&dist->rd_regions); > } > > + if (vgic_has_its(kvm)) > + vgic_lpi_translation_cache_destroy(kvm); > + > if (vgic_supports_direct_msis(kvm)) > vgic_v4_teardown(kvm); > } > diff --git a/virt/kvm/arm/vgic/vgic-its.c b/virt/kvm/arm/vgic/vgic-its.c > index 44ceaccb18cf..ce9bcddeb7f1 100644 > --- a/virt/kvm/arm/vgic/vgic-its.c > +++ b/virt/kvm/arm/vgic/vgic-its.c > @@ 
-149,6 +149,14 @@ struct its_ite { > u32 event_id; > }; > > +struct vgic_translation_cache_entry { > + struct list_head entry; > + phys_addr_t db; > + u32 devid; > + u32 eventid; > + struct vgic_irq *irq; > +}; > + > /** > * struct vgic_its_abi - ITS abi ops and settings > * @cte_esz: collection table entry size > @@ -1668,6 +1676,45 @@ static int vgic_register_its_iodev(struct kvm *kvm, struct vgic_its *its, > return ret; > } > > +/* Default is 16 cached LPIs per vcpu */ > +#define LPI_DEFAULT_PCPU_CACHE_SIZE 16 > + > +void vgic_lpi_translation_cache_init(struct kvm *kvm) > +{ > + struct vgic_dist *dist = &kvm->arch.vgic; > + unsigned int sz; > + int i; > + > + if (!list_empty(&dist->lpi_translation_cache)) > + return; > + > + sz = atomic_read(&kvm->online_vcpus) * LPI_DEFAULT_PCPU_CACHE_SIZE; > + > + for (i = 0; i < sz; i++) { > + struct vgic_translation_cache_entry *cte; > + > + /* An allocation failure is not fatal */ > + cte = kzalloc(sizeof(*cte), GFP_KERNEL); > + if (WARN_ON(!cte)) > + break; > + > + INIT_LIST_HEAD(&cte->entry); > + list_add(&cte->entry, &dist->lpi_translation_cache); > + } > +} > + > +void vgic_lpi_translation_cache_destroy(struct kvm *kvm) > +{ > + struct vgic_dist *dist = &kvm->arch.vgic; > + struct vgic_translation_cache_entry *cte, *tmp; > + > + list_for_each_entry_safe(cte, tmp, > + &dist->lpi_translation_cache, entry) { > + list_del(&cte->entry); > + kfree(cte); > + } > +} > + > #define INITIAL_BASER_VALUE \ > (GIC_BASER_CACHEABILITY(GITS_BASER, INNER, RaWb) | \ > GIC_BASER_CACHEABILITY(GITS_BASER, OUTER, SameAsInner) | \ > @@ -1696,6 +1743,8 @@ static int vgic_its_create(struct kvm_device *dev, u32 type) > kfree(its); > return ret; > } > + > + vgic_lpi_translation_cache_init(dev->kvm); > } > > mutex_init(&its->its_lock); > diff --git a/virt/kvm/arm/vgic/vgic.h b/virt/kvm/arm/vgic/vgic.h > index abeeffabc456..50aad705c4a9 100644 > --- a/virt/kvm/arm/vgic/vgic.h > +++ b/virt/kvm/arm/vgic/vgic.h > @@ -316,6 +316,8 @@ int 
vgic_copy_lpi_list(struct kvm *kvm, struct kvm_vcpu *vcpu, u32 **intid_ptr); > int vgic_its_resolve_lpi(struct kvm *kvm, struct vgic_its *its, > u32 devid, u32 eventid, struct vgic_irq **irq); > struct vgic_its *vgic_msi_to_its(struct kvm *kvm, struct kvm_msi *msi); > +void vgic_lpi_translation_cache_init(struct kvm *kvm); > +void vgic_lpi_translation_cache_destroy(struct kvm *kvm); > > bool vgic_supports_direct_msis(struct kvm *kvm); > int vgic_v4_init(struct kvm *kvm); > Reviewed-by: Eric Auger <eric.auger@redhat.com> Thanks Eric
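For readers following the thread outside the kernel tree, the list discipline being reviewed above (fixed pool of entries, hit moves to head, miss on a full translation reuses the tail, empty entries always trail the valid ones) can be modelled in plain userspace C. All names below are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustrative userspace model of the series' LRU translation cache.
 * A fixed pool of entries lives on a doubly-linked list; a hit moves
 * the entry to the head, a new translation reuses the tail slot (the
 * least recently used), and empty slots (lpi < 0) always sit behind
 * the valid ones, so a lookup can stop at the first empty slot.
 */
struct cache_entry {
	struct cache_entry *prev, *next;
	unsigned long long db;		/* doorbell address */
	unsigned int devid, eventid;
	int lpi;			/* < 0 means "empty slot" */
};

struct lru_cache {
	struct cache_entry *head, *tail;
};

static void cache_init(struct lru_cache *c, struct cache_entry *slots, int n)
{
	for (int i = 0; i < n; i++) {
		slots[i].lpi = -1;
		slots[i].prev = i ? &slots[i - 1] : NULL;
		slots[i].next = (i + 1 < n) ? &slots[i + 1] : NULL;
	}
	c->head = &slots[0];
	c->tail = &slots[n - 1];
}

static void move_to_head(struct lru_cache *c, struct cache_entry *e)
{
	if (c->head == e)
		return;
	/* unlink from the middle or the tail... */
	e->prev->next = e->next;
	if (e->next)
		e->next->prev = e->prev;
	else
		c->tail = e->prev;
	/* ...and relink at the head */
	e->prev = NULL;
	e->next = c->head;
	c->head->prev = e;
	c->head = e;
}

/* Returns the cached LPI number, or -1 on a miss */
static int cache_lookup(struct lru_cache *c, unsigned long long db,
			unsigned int devid, unsigned int eventid)
{
	for (struct cache_entry *e = c->head; e; e = e->next) {
		if (e->lpi < 0)		/* nothing valid after this point */
			break;
		if (e->db == db && e->devid == devid && e->eventid == eventid) {
			move_to_head(c, e);	/* now most recently used */
			return e->lpi;
		}
	}
	return -1;
}

static void cache_insert(struct lru_cache *c, unsigned long long db,
			 unsigned int devid, unsigned int eventid, int lpi)
{
	struct cache_entry *e = c->tail;	/* always evict the LRU slot */

	e->db = db;
	e->devid = devid;
	e->eventid = eventid;
	e->lpi = lpi;
	move_to_head(c, e);
}

/* Full invalidation, as done on ITS commands that affect translations */
static void cache_invalidate(struct lru_cache *c)
{
	for (struct cache_entry *e = c->head; e && e->lpi >= 0; e = e->next)
		e->lpi = -1;
}
```

This is only the data-structure behaviour; the kernel version additionally takes lpi_list_lock and holds a reference on each cached vgic_irq.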
Hi Marc, On 7/23/19 2:43 PM, Marc Zyngier wrote: > On 23/07/2019 13:25, Auger Eric wrote: >> Hi Marc, >> >> On 7/22/19 12:54 PM, Marc Zyngier wrote: >>> Hi Eric, >>> >>> On 01/07/2019 13:38, Auger Eric wrote: >>>> Hi Marc, >>>> >>>> On 6/11/19 7:03 PM, Marc Zyngier wrote: >>>>> The LPI translation cache needs to be discarded when an ITS command >>>>> may affect the translation of an LPI (DISCARD and MAPD with V=0) or >>>>> the routing of an LPI to a redistributor with disabled LPIs (MOVI, >>>>> MOVALL). >>>>> >>>>> We decide to perform a full invalidation of the cache, irrespective >>>>> of the LPI that is affected. Commands are supposed to be rare enough >>>>> that it doesn't matter. >>>>> >>>>> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> >>>>> --- >>>>> virt/kvm/arm/vgic/vgic-its.c | 8 ++++++++ >>>>> 1 file changed, 8 insertions(+) >>>>> >>>>> diff --git a/virt/kvm/arm/vgic/vgic-its.c b/virt/kvm/arm/vgic/vgic-its.c >>>>> index 9b6b66204b97..5254bb762e1b 100644 >>>>> --- a/virt/kvm/arm/vgic/vgic-its.c >>>>> +++ b/virt/kvm/arm/vgic/vgic-its.c >>>>> @@ -733,6 +733,8 @@ static int vgic_its_cmd_handle_discard(struct kvm *kvm, struct vgic_its *its, >>>>> * don't bother here since we clear the ITTE anyway and the >>>>> * pending state is a property of the ITTE struct. 
>>>>> */ >>>>> + vgic_its_invalidate_cache(kvm); >>>>> + >>>>> its_free_ite(kvm, ite); >>>>> return 0; >>>>> } >>>>> @@ -768,6 +770,8 @@ static int vgic_its_cmd_handle_movi(struct kvm *kvm, struct vgic_its *its, >>>>> ite->collection = collection; >>>>> vcpu = kvm_get_vcpu(kvm, collection->target_addr); >>>>> >>>>> + vgic_its_invalidate_cache(kvm); >>>>> + >>>>> return update_affinity(ite->irq, vcpu); >>>>> } >>>>> >>>>> @@ -996,6 +1000,8 @@ static void vgic_its_free_device(struct kvm *kvm, struct its_device *device) >>>>> list_for_each_entry_safe(ite, temp, &device->itt_head, ite_list) >>>>> its_free_ite(kvm, ite); >>>>> >>>>> + vgic_its_invalidate_cache(kvm); >>>>> + >>>>> list_del(&device->dev_list); >>>>> kfree(device); >>>>> } >>>>> @@ -1249,6 +1255,8 @@ static int vgic_its_cmd_handle_movall(struct kvm *kvm, struct vgic_its *its, >>>>> vgic_put_irq(kvm, irq); >>>>> } >>>>> >>>>> + vgic_its_invalidate_cache(kvm); >>>> All the commands are executed with the its_lock held. Now we don't take >>>> it anymore on the fast cache injection path. Don't we have a window >>>> where the move has been applied at table level and the cache is not yet >>>> invalidated? Same question for vgic_its_free_device(). >>> >>> There is definitely a race, but that race is invisible from the guest's >>> perspective. The guest can only assume that the command has taken effect >>> once a SYNC command has been executed, and it cannot observe that the >>> SYNC command has been executed before we have invalidated the cache. >>> >>> Does this answer your question? >> >> OK make sense. Thank you for the clarification >> >> Another question, don't we need to invalidate the cache on MAPC V=0 as >> well? Removing the mapping of the collection results in interrupts >> belonging to that collection being ignored. If we don't flush the >> pending bit will be set? > > Yup, that's a good point. 
I think i had that at some point, and ended up > dropping it, probably missing the point that the interrupt would be made > pending. > > I'll add this: > > @@ -1218,6 +1218,7 @@ static int vgic_its_cmd_handle_mapc(struct kvm *kvm, struct vgic_its *its, > > if (!valid) { > vgic_its_free_collection(its, coll_id); > + vgic_its_invalidate_cache(kvm); > } else { > collection = find_collection(its, coll_id); > Yep, with that change feel free to add my R-b > > Reviewed-by: Eric Auger <eric.auger@redhat.com> Thanks Eric > > Thanks, > > M. >
On 23/07/2019 13:47, Auger Eric wrote: > Hi Marc, > > On 7/23/19 2:43 PM, Marc Zyngier wrote: >> On 23/07/2019 13:25, Auger Eric wrote: >>> Hi Marc, >>> >>> On 7/22/19 12:54 PM, Marc Zyngier wrote: >>>> Hi Eric, >>>> >>>> On 01/07/2019 13:38, Auger Eric wrote: >>>>> Hi Marc, >>>>> >>>>> On 6/11/19 7:03 PM, Marc Zyngier wrote: >>>>>> The LPI translation cache needs to be discarded when an ITS command >>>>>> may affect the translation of an LPI (DISCARD and MAPD with V=0) or >>>>>> the routing of an LPI to a redistributor with disabled LPIs (MOVI, >>>>>> MOVALL). >>>>>> >>>>>> We decide to perform a full invalidation of the cache, irrespective >>>>>> of the LPI that is affected. Commands are supposed to be rare enough >>>>>> that it doesn't matter. >>>>>> >>>>>> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> >>>>>> --- >>>>>> virt/kvm/arm/vgic/vgic-its.c | 8 ++++++++ >>>>>> 1 file changed, 8 insertions(+) >>>>>> >>>>>> diff --git a/virt/kvm/arm/vgic/vgic-its.c b/virt/kvm/arm/vgic/vgic-its.c >>>>>> index 9b6b66204b97..5254bb762e1b 100644 >>>>>> --- a/virt/kvm/arm/vgic/vgic-its.c >>>>>> +++ b/virt/kvm/arm/vgic/vgic-its.c >>>>>> @@ -733,6 +733,8 @@ static int vgic_its_cmd_handle_discard(struct kvm *kvm, struct vgic_its *its, >>>>>> * don't bother here since we clear the ITTE anyway and the >>>>>> * pending state is a property of the ITTE struct. 
>>>>>> */ >>>>>> + vgic_its_invalidate_cache(kvm); >>>>>> + >>>>>> its_free_ite(kvm, ite); >>>>>> return 0; >>>>>> } >>>>>> @@ -768,6 +770,8 @@ static int vgic_its_cmd_handle_movi(struct kvm *kvm, struct vgic_its *its, >>>>>> ite->collection = collection; >>>>>> vcpu = kvm_get_vcpu(kvm, collection->target_addr); >>>>>> >>>>>> + vgic_its_invalidate_cache(kvm); >>>>>> + >>>>>> return update_affinity(ite->irq, vcpu); >>>>>> } >>>>>> >>>>>> @@ -996,6 +1000,8 @@ static void vgic_its_free_device(struct kvm *kvm, struct its_device *device) >>>>>> list_for_each_entry_safe(ite, temp, &device->itt_head, ite_list) >>>>>> its_free_ite(kvm, ite); >>>>>> >>>>>> + vgic_its_invalidate_cache(kvm); >>>>>> + >>>>>> list_del(&device->dev_list); >>>>>> kfree(device); >>>>>> } >>>>>> @@ -1249,6 +1255,8 @@ static int vgic_its_cmd_handle_movall(struct kvm *kvm, struct vgic_its *its, >>>>>> vgic_put_irq(kvm, irq); >>>>>> } >>>>>> >>>>>> + vgic_its_invalidate_cache(kvm); >>>>> All the commands are executed with the its_lock held. Now we don't take >>>>> it anymore on the fast cache injection path. Don't we have a window >>>>> where the move has been applied at table level and the cache is not yet >>>>> invalidated? Same question for vgic_its_free_device(). >>>> >>>> There is definitely a race, but that race is invisible from the guest's >>>> perspective. The guest can only assume that the command has taken effect >>>> once a SYNC command has been executed, and it cannot observe that the >>>> SYNC command has been executed before we have invalidated the cache. >>>> >>>> Does this answer your question? >>> >>> OK make sense. Thank you for the clarification >>> >>> Another question, don't we need to invalidate the cache on MAPC V=0 as >>> well? Removing the mapping of the collection results in interrupts >>> belonging to that collection being ignored. If we don't flush the >>> pending bit will be set? >> >> Yup, that's a good point. 
I think i had that at some point, and ended up >> dropping it, probably missing the point that the interrupt would be made >> pending. >> >> I'll add this: >> >> @@ -1218,6 +1218,7 @@ static int vgic_its_cmd_handle_mapc(struct kvm *kvm, struct vgic_its *its, >> >> if (!valid) { >> vgic_its_free_collection(its, coll_id); >> + vgic_its_invalidate_cache(kvm); >> } else { >> collection = find_collection(its, coll_id); >> > Yep, with that change feel free to add my R-b > > Reviewed-by: Eric Auger <eric.auger@redhat.com> Thanks! M. -- Jazz is not dead. It just smells funny...
Hi Marc, On 6/11/19 7:03 PM, Marc Zyngier wrote: > If a vcpu disables LPIs at its redistributor level, we need to make sure > we won't pend more interrupts. For this, we need to invalidate the LPI > translation cache. > > Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Thanks Eric > --- > virt/kvm/arm/vgic/vgic-mmio-v3.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/virt/kvm/arm/vgic/vgic-mmio-v3.c b/virt/kvm/arm/vgic/vgic-mmio-v3.c > index 936962abc38d..cb60da48810d 100644 > --- a/virt/kvm/arm/vgic/vgic-mmio-v3.c > +++ b/virt/kvm/arm/vgic/vgic-mmio-v3.c > @@ -192,8 +192,10 @@ static void vgic_mmio_write_v3r_ctlr(struct kvm_vcpu *vcpu, > > vgic_cpu->lpis_enabled = val & GICR_CTLR_ENABLE_LPIS; > > - if (was_enabled && !vgic_cpu->lpis_enabled) > + if (was_enabled && !vgic_cpu->lpis_enabled) { > vgic_flush_pending_lpis(vcpu); > + vgic_its_invalidate_cache(vcpu->kvm); > + } > > if (!was_enabled && vgic_cpu->lpis_enabled) > vgic_enable_lpis(vcpu); >
Hi Marc, On 6/11/19 7:03 PM, Marc Zyngier wrote: > In order to avoid leaking vgic_irq structures on teardown, we need to > drop all references to LPIs before deallocating the cache itself. > > This is done by invalidating the cache on vgic teardown. > > Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> > --- > virt/kvm/arm/vgic/vgic-its.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/virt/kvm/arm/vgic/vgic-its.c b/virt/kvm/arm/vgic/vgic-its.c > index 5254bb762e1b..0aa0cbbc3af6 100644 > --- a/virt/kvm/arm/vgic/vgic-its.c > +++ b/virt/kvm/arm/vgic/vgic-its.c > @@ -1739,6 +1739,8 @@ void vgic_lpi_translation_cache_destroy(struct kvm *kvm) > struct vgic_dist *dist = &kvm->arch.vgic; > struct vgic_translation_cache_entry *cte, *tmp; > > + vgic_its_invalidate_cache(kvm); > + > list_for_each_entry_safe(cte, tmp, > &dist->lpi_translation_cache, entry) { > list_del(&cte->entry); > Reviewed-by: Eric Auger <eric.auger@redhat.com> Thanks Eric
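The leak the patch above closes can be made concrete with a toy reference-counting model (all names here are mine, not the kernel's): each cached translation holds a reference on its vgic_irq, invalidation drops that reference, so teardown must invalidate before freeing the entries themselves.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy refcount model of the teardown argument. The cache takes a
 * reference on each irq it stores; cache_invalidate() drops it (the
 * __vgic_put_lpi_locked() analogue). Destroying the cache without
 * invalidating it first would leak one reference per cached entry.
 */
struct irq {
	int refcount;
};

#define CACHE_SZ 4
static struct irq *cache[CACHE_SZ];

static void cache_store(int slot, struct irq *irq)
{
	irq->refcount++;		/* the cache takes its own reference */
	cache[slot] = irq;
}

static void cache_invalidate(void)
{
	for (int i = 0; i < CACHE_SZ; i++) {
		if (cache[i]) {
			cache[i]->refcount--;	/* drop the cache's reference */
			cache[i] = NULL;
		}
	}
}

static void cache_destroy(void)
{
	cache_invalidate();	/* the fix: drop references before freeing */
	/* ...then the entry structures themselves would be freed here */
}
```

After cache_destroy(), every irq is back to the refcount it had before being cached, which is exactly the property the two added lines in the patch guarantee.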
Hi Marc, On 6/11/19 7:03 PM, Marc Zyngier wrote: > When performing an MSI injection, let's first check if the translation > is already in the cache. If so, let's inject it quickly without > going through the whole translation process. > > Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> > --- > virt/kvm/arm/vgic/vgic-its.c | 36 ++++++++++++++++++++++++++++++++++++ > virt/kvm/arm/vgic/vgic.h | 1 + > 2 files changed, 37 insertions(+) > > diff --git a/virt/kvm/arm/vgic/vgic-its.c b/virt/kvm/arm/vgic/vgic-its.c > index 62932458476a..83d80ec33473 100644 > --- a/virt/kvm/arm/vgic/vgic-its.c > +++ b/virt/kvm/arm/vgic/vgic-its.c > @@ -577,6 +577,20 @@ static struct vgic_irq *__vgic_its_check_cache(struct vgic_dist *dist, > return irq; > } > > +static struct vgic_irq *vgic_its_check_cache(struct kvm *kvm, phys_addr_t db, > + u32 devid, u32 eventid) > +{ > + struct vgic_dist *dist = &kvm->arch.vgic; > + struct vgic_irq *irq; > + unsigned long flags; > + > + raw_spin_lock_irqsave(&dist->lpi_list_lock, flags); > + irq = __vgic_its_check_cache(dist, db, devid, eventid); > + raw_spin_unlock_irqrestore(&dist->lpi_list_lock, flags); > + > + return irq; > +} > + > static void vgic_its_cache_translation(struct kvm *kvm, struct vgic_its *its, > u32 devid, u32 eventid, > struct vgic_irq *irq) > @@ -736,6 +750,25 @@ static int vgic_its_trigger_msi(struct kvm *kvm, struct vgic_its *its, > return 0; > } > > +int vgic_its_inject_cached_translation(struct kvm *kvm, struct kvm_msi *msi) > +{ > + struct vgic_irq *irq; > + unsigned long flags; > + phys_addr_t db; > + > + db = (u64)msi->address_hi << 32 | msi->address_lo; > + irq = vgic_its_check_cache(kvm, db, msi->devid, msi->data); I think we're missing a check of its->enabled. This is currently done in vgic_its_resolve_lpi(), but it is now likely to be bypassed. Doing that check in this function is needed for the next patch, I think. 
Thanks Eric > + > + if (!irq) > + return -1; > + > + raw_spin_lock_irqsave(&irq->irq_lock, flags); > + irq->pending_latch = true; > + vgic_queue_irq_unlock(kvm, irq, flags); > + > + return 0; > +} > + > /* > * Queries the KVM IO bus framework to get the ITS pointer from the given > * doorbell address. > @@ -747,6 +780,9 @@ int vgic_its_inject_msi(struct kvm *kvm, struct kvm_msi *msi) > struct vgic_its *its; > int ret; > > + if (!vgic_its_inject_cached_translation(kvm, msi)) > + return 1; > + > its = vgic_msi_to_its(kvm, msi); > if (IS_ERR(its)) > return PTR_ERR(its); > diff --git a/virt/kvm/arm/vgic/vgic.h b/virt/kvm/arm/vgic/vgic.h > index 072f810dc441..ad6eba1e2beb 100644 > --- a/virt/kvm/arm/vgic/vgic.h > +++ b/virt/kvm/arm/vgic/vgic.h > @@ -317,6 +317,7 @@ int vgic_copy_lpi_list(struct kvm *kvm, struct kvm_vcpu *vcpu, u32 **intid_ptr); > int vgic_its_resolve_lpi(struct kvm *kvm, struct vgic_its *its, > u32 devid, u32 eventid, struct vgic_irq **irq); > struct vgic_its *vgic_msi_to_its(struct kvm *kvm, struct kvm_msi *msi); > +int vgic_its_inject_cached_translation(struct kvm *kvm, struct kvm_msi *msi); > void vgic_lpi_translation_cache_init(struct kvm *kvm); > void vgic_lpi_translation_cache_destroy(struct kvm *kvm); > void vgic_its_invalidate_cache(struct kvm *kvm); >
Hi Marc, On 6/11/19 7:03 PM, Marc Zyngier wrote: > Now that we have a cache of MSI->LPI translations, it is pretty > easy to implement kvm_arch_set_irq_inatomic (this cache can be > parsed without sleeping). > > Hopefully, this will improve some LPI-heavy workloads. > > Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> > --- > virt/kvm/arm/vgic/vgic-irqfd.c | 36 ++++++++++++++++++++++++++++------ > 1 file changed, 30 insertions(+), 6 deletions(-) > > diff --git a/virt/kvm/arm/vgic/vgic-irqfd.c b/virt/kvm/arm/vgic/vgic-irqfd.c > index 99e026d2dade..9f203ed8c8f3 100644 > --- a/virt/kvm/arm/vgic/vgic-irqfd.c > +++ b/virt/kvm/arm/vgic/vgic-irqfd.c > @@ -77,6 +77,15 @@ int kvm_set_routing_entry(struct kvm *kvm, > return r; > } > > +static void kvm_populate_msi(struct kvm_kernel_irq_routing_entry *e, > + struct kvm_msi *msi) > +{ > + msi->address_lo = e->msi.address_lo; > + msi->address_hi = e->msi.address_hi; > + msi->data = e->msi.data; > + msi->flags = e->msi.flags; > + msi->devid = e->msi.devid; > +} > /** > * kvm_set_msi: inject the MSI corresponding to the s/:/ - > * MSI routing entry > @@ -90,21 +99,36 @@ int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e, > { > struct kvm_msi msi; > > - msi.address_lo = e->msi.address_lo; > - msi.address_hi = e->msi.address_hi; > - msi.data = e->msi.data; > - msi.flags = e->msi.flags; > - msi.devid = e->msi.devid; > - > if (!vgic_has_its(kvm)) > return -ENODEV; > > if (!level) > return -1; > > + kvm_populate_msi(e, &msi); > return vgic_its_inject_msi(kvm, &msi); > } > > +/** > + * kvm_arch_set_irq_inatomic: fast-path for irqfd injection s/:/ - > + * > + * Currently only direct MSI injecton is supported. 
injection > + */ > +int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e, > + struct kvm *kvm, int irq_source_id, int level, > + bool line_status) > +{ > + if (e->type == KVM_IRQ_ROUTING_MSI && vgic_has_its(kvm) && level) { > + struct kvm_msi msi; > + > + kvm_populate_msi(e, &msi); > + if (!vgic_its_inject_cached_translation(kvm, &msi)) > + return 0; If this fails because its->enabled is false, we will re-attempt the injection through the normal injection path, but that's not a big deal. > + } > + > + return -EWOULDBLOCK; > +} > + > int kvm_vgic_setup_default_irq_routing(struct kvm *kvm) > { > struct kvm_irq_routing_entry *entries; > Reviewed-by: Eric Auger <eric.auger@redhat.com> Thanks Eric
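The contract under review here boils down to a two-way split: inject from the cache without sleeping and return 0, or return -EWOULDBLOCK so the caller falls back to the sleeping path on a worker thread. A minimal userspace sketch of that control flow (hypothetical names and a toy lookup table, not the kernel's):

```c
#include <assert.h>
#include <errno.h>

/*
 * Sketch of the fast-path contract: the inatomic entry point may not
 * sleep, so it either serves the injection from the translation cache
 * (return 0) or punts with -EWOULDBLOCK, making the caller fall back
 * to the full, sleeping translation path.
 */
struct msi {
	unsigned long long db;	/* doorbell address */
	unsigned int devid, data;
};

/* Stand-in for the cache lookup: returns an LPI number, or -1 on miss */
static int cached_translation(const struct msi *m)
{
	return (m->db == 0x1000 && m->devid == 1 && m->data == 3) ? 8195 : -1;
}

static int set_irq_inatomic(const struct msi *m, int *injected_lpi)
{
	int lpi = cached_translation(m);

	if (lpi < 0)
		return -EWOULDBLOCK;	/* miss: defer to the slow path */

	*injected_lpi = lpi;		/* hit: inject without sleeping */
	return 0;
}
```

A miss is not an error: it simply routes the event back through the pre-existing worker-thread path, which will also repopulate the cache on a successful translation.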
Hi Marc, On 6/11/19 7:03 PM, Marc Zyngier wrote: > On a successful translation, preserve the parameters in the LPI > translation cache. Each translation is reusing the last slot > in the list, naturally evincting the least recently used entry. evicting > > Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> > --- > virt/kvm/arm/vgic/vgic-its.c | 86 ++++++++++++++++++++++++++++++++++++ > 1 file changed, 86 insertions(+) > > diff --git a/virt/kvm/arm/vgic/vgic-its.c b/virt/kvm/arm/vgic/vgic-its.c > index 0aa0cbbc3af6..62932458476a 100644 > --- a/virt/kvm/arm/vgic/vgic-its.c > +++ b/virt/kvm/arm/vgic/vgic-its.c > @@ -546,6 +546,90 @@ static unsigned long vgic_mmio_read_its_idregs(struct kvm *kvm, > return 0; > } > > +static struct vgic_irq *__vgic_its_check_cache(struct vgic_dist *dist, > + phys_addr_t db, > + u32 devid, u32 eventid) > +{ > + struct vgic_translation_cache_entry *cte; > + struct vgic_irq *irq = NULL; > + > + list_for_each_entry(cte, &dist->lpi_translation_cache, entry) { > + /* > + * If we hit a NULL entry, there is nothing after this > + * point. > + */ > + if (!cte->irq) > + break; > + > + if (cte->db == db && > + cte->devid == devid && > + cte->eventid == eventid) { > + /* > + * Move this entry to the head, as it is the > + * most recently used. 
> + */ > + list_move(&cte->entry, &dist->lpi_translation_cache); > + irq = cte->irq; > + break; > + } > + } > + > + return irq; > +} > + > +static void vgic_its_cache_translation(struct kvm *kvm, struct vgic_its *its, > + u32 devid, u32 eventid, > + struct vgic_irq *irq) > +{ > + struct vgic_dist *dist = &kvm->arch.vgic; > + struct vgic_translation_cache_entry *cte; > + unsigned long flags; > + phys_addr_t db; > + > + /* Do not cache a directly injected interrupt */ > + if (irq->hw) > + return; > + > + raw_spin_lock_irqsave(&dist->lpi_list_lock, flags); > + > + if (unlikely(list_empty(&dist->lpi_translation_cache))) > + goto out; > + > + /* > + * We could have raced with another CPU caching the same > + * translation behind our back, so let's check it is not in > + * already > + */ > + db = its->vgic_its_base + GITS_TRANSLATER; > + if (__vgic_its_check_cache(dist, db, devid, eventid)) > + goto out; > + > + /* Always reuse the last entry (LRU policy) */ > + cte = list_last_entry(&dist->lpi_translation_cache, > + typeof(*cte), entry); > + > + /* > + * Caching the translation implies having an extra reference > + * to the interrupt, so drop the potential reference on what > + * was in the cache, and increment it on the new interrupt. 
> + */ > + if (cte->irq) > + __vgic_put_lpi_locked(kvm, cte->irq); > + > + vgic_get_irq_kref(irq); > + > + cte->db = db; > + cte->devid = devid; > + cte->eventid = eventid; > + cte->irq = irq; > + > + /* Move the new translation to the head of the list */ > + list_move(&cte->entry, &dist->lpi_translation_cache); > + > +out: > + raw_spin_unlock_irqrestore(&dist->lpi_list_lock, flags); > +} > + > void vgic_its_invalidate_cache(struct kvm *kvm) > { > struct vgic_dist *dist = &kvm->arch.vgic; > @@ -589,6 +673,8 @@ int vgic_its_resolve_lpi(struct kvm *kvm, struct vgic_its *its, > if (!vcpu->arch.vgic_cpu.lpis_enabled) > return -EBUSY; > > + vgic_its_cache_translation(kvm, its, devid, eventid, ite->irq); > + > *irq = ite->irq; > return 0; > } > Reviewed-by: Eric Auger <eric.auger@redhat.com> Thanks Eric _______________________________________________ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
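The LRU policy in this patch is: a lookup hit is promoted to the head of the list, and a new translation always reuses the tail entry (the least recently used), so eviction falls out naturally. A self-contained C sketch of that list discipline, using a circular doubly-linked list in place of the kernel's list helpers (names and sizes here are illustrative only):

```c
#include <assert.h>
#include <stddef.h>

/* Fixed pool of cache entries kept in MRU order: head = most
 * recently used, tail = least recently used.  irq == 0 marks an
 * empty slot; empty slots only ever sit at the tail. */
#define CACHE_SZ 4
struct cte {
    unsigned devid, eventid;
    int irq;
    struct cte *prev, *next;
};

static struct cte pool[CACHE_SZ];
static struct cte head;               /* sentinel of a circular list */

static void cache_init(void)
{
    head.prev = head.next = &head;
    for (int i = 0; i < CACHE_SZ; i++) {   /* insert each at the tail */
        pool[i].irq = 0;
        pool[i].prev = head.prev;
        pool[i].next = &head;
        head.prev->next = &pool[i];
        head.prev = &pool[i];
    }
}

/* Equivalent of list_move(): unlink and reinsert right after head. */
static void move_to_head(struct cte *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
    e->next = head.next;
    e->prev = &head;
    head.next->prev = e;
    head.next = e;
}

/* Lookup: a hit becomes the most recently used entry.  An empty slot
 * means nothing valid lies beyond it, so we can stop early. */
static int cache_lookup(unsigned devid, unsigned eventid)
{
    for (struct cte *e = head.next; e != &head; e = e->next) {
        if (!e->irq)
            break;
        if (e->devid == devid && e->eventid == eventid) {
            move_to_head(e);
            return e->irq;
        }
    }
    return 0;
}

/* Insert: re-check for a racing insertion, then always reuse the
 * tail entry (LRU policy) and promote it to the head. */
static void cache_insert(unsigned devid, unsigned eventid, int irq)
{
    if (cache_lookup(devid, eventid))
        return;
    struct cte *e = head.prev;
    e->devid = devid;
    e->eventid = eventid;
    e->irq = irq;
    move_to_head(e);
}
```

The re-check at the top of cache_insert() mirrors the "raced with another CPU caching the same translation" comment in the patch; in the sketch the reference counting done via vgic_get_irq_kref()/__vgic_put_lpi_locked() is left out.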
Hi Eric, On 23/07/2019 16:10, Auger Eric wrote: > Hi Marc, > > On 6/11/19 7:03 PM, Marc Zyngier wrote: >> When performing an MSI injection, let's first check if the translation >> is already in the cache. If so, let's inject it quickly without >> going through the whole translation process. >> >> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> >> --- >> virt/kvm/arm/vgic/vgic-its.c | 36 ++++++++++++++++++++++++++++++++++++ >> virt/kvm/arm/vgic/vgic.h | 1 + >> 2 files changed, 37 insertions(+) >> >> diff --git a/virt/kvm/arm/vgic/vgic-its.c b/virt/kvm/arm/vgic/vgic-its.c >> index 62932458476a..83d80ec33473 100644 >> --- a/virt/kvm/arm/vgic/vgic-its.c >> +++ b/virt/kvm/arm/vgic/vgic-its.c >> @@ -577,6 +577,20 @@ static struct vgic_irq *__vgic_its_check_cache(struct vgic_dist *dist, >> return irq; >> } >> >> +static struct vgic_irq *vgic_its_check_cache(struct kvm *kvm, phys_addr_t db, >> + u32 devid, u32 eventid) >> +{ >> + struct vgic_dist *dist = &kvm->arch.vgic; >> + struct vgic_irq *irq; >> + unsigned long flags; >> + >> + raw_spin_lock_irqsave(&dist->lpi_list_lock, flags); >> + irq = __vgic_its_check_cache(dist, db, devid, eventid); >> + raw_spin_unlock_irqrestore(&dist->lpi_list_lock, flags); >> + >> + return irq; >> +} >> + >> static void vgic_its_cache_translation(struct kvm *kvm, struct vgic_its *its, >> u32 devid, u32 eventid, >> struct vgic_irq *irq) >> @@ -736,6 +750,25 @@ static int vgic_its_trigger_msi(struct kvm *kvm, struct vgic_its *its, >> return 0; >> } >> >> +int vgic_its_inject_cached_translation(struct kvm *kvm, struct kvm_msi *msi) >> +{ >> + struct vgic_irq *irq; >> + unsigned long flags; >> + phys_addr_t db; >> + >> + db = (u64)msi->address_hi << 32 | msi->address_lo; >> + irq = vgic_its_check_cache(kvm, db, msi->devid, msi->data); > > I think we miss a check of its->enabled. This is currently done in > vgic_its_resolve_lpi() but now likely to be bypassed. But why would a translation be cached if the ITS is disabled? 
It should never have been there in the first place (vgic_its_resolve_lpi does check for the ITS being enabled, as you pointed out). Which makes me think that we miss an invalidate on an ITS being disabled: diff --git a/virt/kvm/arm/vgic/vgic-its.c b/virt/kvm/arm/vgic/vgic-its.c index 2633b0e88981..5f2ad74ad834 100644 --- a/virt/kvm/arm/vgic/vgic-its.c +++ b/virt/kvm/arm/vgic/vgic-its.c @@ -1719,6 +1719,8 @@ static void vgic_mmio_write_its_ctlr(struct kvm *kvm, struct vgic_its *its, goto out; its->enabled = !!(val & GITS_CTLR_ENABLE); + if (!its->enabled) + vgic_its_invalidate_cache(kvm); /* * Try to process any pending commands. This function bails out early What do you think? M. -- Jazz is not dead. It just smells funny...

Hi Marc, On 7/23/19 5:45 PM, Marc Zyngier wrote: > Hi Eric, > > On 23/07/2019 16:10, Auger Eric wrote: >> Hi Marc, >> >> On 6/11/19 7:03 PM, Marc Zyngier wrote: >>> When performing an MSI injection, let's first check if the translation >>> is already in the cache. If so, let's inject it quickly without >>> going through the whole translation process. >>> >>> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> >>> --- >>> virt/kvm/arm/vgic/vgic-its.c | 36 ++++++++++++++++++++++++++++++++++++ >>> virt/kvm/arm/vgic/vgic.h | 1 + >>> 2 files changed, 37 insertions(+) >>> >>> diff --git a/virt/kvm/arm/vgic/vgic-its.c b/virt/kvm/arm/vgic/vgic-its.c >>> index 62932458476a..83d80ec33473 100644 >>> --- a/virt/kvm/arm/vgic/vgic-its.c >>> +++ b/virt/kvm/arm/vgic/vgic-its.c >>> @@ -577,6 +577,20 @@ static struct vgic_irq *__vgic_its_check_cache(struct vgic_dist *dist, >>> return irq; >>> } >>> >>> +static struct vgic_irq *vgic_its_check_cache(struct kvm *kvm, phys_addr_t db, >>> + u32 devid, u32 eventid) >>> +{ >>> + struct vgic_dist *dist = &kvm->arch.vgic; >>> + struct vgic_irq *irq; >>> + unsigned long flags; >>> + >>> + raw_spin_lock_irqsave(&dist->lpi_list_lock, flags); >>> + irq = __vgic_its_check_cache(dist, db, devid, eventid); >>> + raw_spin_unlock_irqrestore(&dist->lpi_list_lock, flags); >>> + >>> + return irq; >>> +} >>> + >>> static void vgic_its_cache_translation(struct kvm *kvm, struct vgic_its *its, >>> u32 devid, u32 eventid, >>> struct vgic_irq *irq) >>> @@ -736,6 +750,25 @@ static int vgic_its_trigger_msi(struct kvm *kvm, struct vgic_its *its, >>> return 0; >>> } >>> >>> +int vgic_its_inject_cached_translation(struct kvm *kvm, struct kvm_msi *msi) >>> +{ >>> + struct vgic_irq *irq; >>> + unsigned long flags; >>> + phys_addr_t db; >>> + >>> + db = (u64)msi->address_hi << 32 | msi->address_lo; >>> + irq = vgic_its_check_cache(kvm, db, msi->devid, msi->data); >> >> I think we miss a check of its->enabled. 
This is currently done in >> vgic_its_resolve_lpi() but now likely to be bypassed. > > But why would a translation be cached if the ITS is disabled? It should > never haver been there the first place (vgic_its_resolve_lpi does check > for the ITS being enabled, as you pointed out). > > Which makes me think that we miss an invalidate on an ITS being disabled: > > diff --git a/virt/kvm/arm/vgic/vgic-its.c b/virt/kvm/arm/vgic/vgic-its.c > index 2633b0e88981..5f2ad74ad834 100644 > --- a/virt/kvm/arm/vgic/vgic-its.c > +++ b/virt/kvm/arm/vgic/vgic-its.c > @@ -1719,6 +1719,8 @@ static void vgic_mmio_write_its_ctlr(struct kvm *kvm, struct vgic_its *its, > goto out; > > its->enabled = !!(val & GITS_CTLR_ENABLE); > + if (!its->enabled) > + vgic_its_invalidate_cache(kvm); > > /* > * Try to process any pending commands. This function bails out early > > > What do you think? Yes I agree this is the right way to fix it. Thanks Eric > > M. > _______________________________________________ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
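The fix agreed on here is to invalidate the whole cache when the ITS is disabled: walk every entry, drop the reference it holds on its vgic_irq, and mark the slot empty. A toy C model of that invalidation, with a plain integer counter standing in for the kref that __vgic_put_lpi_locked() drops (all names simplified, locking elided):

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_SZ 4

/* Stand-in for a refcounted vgic_irq. */
struct irq_obj { int refcount; };

static struct { struct irq_obj *irq; } cache[CACHE_SZ];

/* Invalidate every cached translation, releasing the reference each
 * entry holds.  The real vgic_its_invalidate_cache() does the same
 * walk under the lpi_list_lock. */
static void cache_invalidate(void)
{
    for (size_t i = 0; i < CACHE_SZ; i++) {
        if (cache[i].irq) {
            cache[i].irq->refcount--;   /* __vgic_put_lpi_locked() */
            cache[i].irq = NULL;
        }
    }
}
```

After this, a disabled ITS can no longer serve stale hits from the fast path: the next lookup misses, falls back to vgic_its_resolve_lpi(), and the enable check there takes effect.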
On 23/07/2019 16:14, Auger Eric wrote: > Hi Marc, > > On 6/11/19 7:03 PM, Marc Zyngier wrote: >> Now that we have a cache of MSI->LPI translations, it is pretty >> easy to implement kvm_arch_set_irq_inatomic (this cache can be >> parsed without sleeping). >> >> Hopefully, this will improve some LPI-heavy workloads. >> >> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> >> --- >> virt/kvm/arm/vgic/vgic-irqfd.c | 36 ++++++++++++++++++++++++++++------ >> 1 file changed, 30 insertions(+), 6 deletions(-) >> >> diff --git a/virt/kvm/arm/vgic/vgic-irqfd.c b/virt/kvm/arm/vgic/vgic-irqfd.c >> index 99e026d2dade..9f203ed8c8f3 100644 >> --- a/virt/kvm/arm/vgic/vgic-irqfd.c >> +++ b/virt/kvm/arm/vgic/vgic-irqfd.c >> @@ -77,6 +77,15 @@ int kvm_set_routing_entry(struct kvm *kvm, >> return r; >> } >> >> +static void kvm_populate_msi(struct kvm_kernel_irq_routing_entry *e, >> + struct kvm_msi *msi) >> +{ >> + msi->address_lo = e->msi.address_lo; >> + msi->address_hi = e->msi.address_hi; >> + msi->data = e->msi.data; >> + msi->flags = e->msi.flags; >> + msi->devid = e->msi.devid; >> +} >> /** >> * kvm_set_msi: inject the MSI corresponding to the > s/:/ - >> * MSI routing entry >> @@ -90,21 +99,36 @@ int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e, >> { >> struct kvm_msi msi; >> >> - msi.address_lo = e->msi.address_lo; >> - msi.address_hi = e->msi.address_hi; >> - msi.data = e->msi.data; >> - msi.flags = e->msi.flags; >> - msi.devid = e->msi.devid; >> - >> if (!vgic_has_its(kvm)) >> return -ENODEV; >> >> if (!level) >> return -1; >> >> + kvm_populate_msi(e, &msi); >> return vgic_its_inject_msi(kvm, &msi); >> } >> >> +/** >> + * kvm_arch_set_irq_inatomic: fast-path for irqfd injection > s/:/ - >> + * >> + * Currently only direct MSI injecton is supported. 
> injection >> + */ >> +int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e, >> + struct kvm *kvm, int irq_source_id, int level, >> + bool line_status) >> +{ >> + if (e->type == KVM_IRQ_ROUTING_MSI && vgic_has_its(kvm) && level) { >> + struct kvm_msi msi; >> + >> + kvm_populate_msi(e, &msi); >> + if (!vgic_its_inject_cached_translation(kvm, &msi)) >> + return 0; > if this fails since its->enabled is false we will re-attempt the > injection though the normal injection path but that's not a big deal. Yeah, I wouldn't worry about that. If there is a screaming device, so be it. The guest should know better... >> + } >> + >> + return -EWOULDBLOCK; >> +} >> + >> int kvm_vgic_setup_default_irq_routing(struct kvm *kvm) >> { >> struct kvm_irq_routing_entry *entries; >> > Reviewed-by: Eric Auger <eric.auger@redhat.com> Thanks Eric. M. -- Jazz is not dead. It just smells funny... _______________________________________________ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Hi Andre, On 23/07/2019 12:14, Andre Przywara wrote: > On Tue, 11 Jun 2019 18:03:27 +0100 > Marc Zyngier <marc.zyngier@arm.com> wrote: > > Hi, > >> It recently became apparent[1] that our LPI injection path is not as >> efficient as it could be when injecting interrupts coming from a VFIO >> assigned device. >> >> Although the proposed patch wasn't 100% correct, it outlined at least >> two issues: >> >> (1) Injecting an LPI from VFIO always results in a context switch to a >> worker thread: no good >> >> (2) We have no way of amortising the cost of translating a DID+EID pair >> to an LPI number >> >> The reason for (1) is that we may sleep when translating an LPI, so we >> do need a context process. A way to fix that is to implement a small >> LPI translation cache that could be looked up from an atomic >> context. It would also solve (2). >> >> This is what this small series proposes. It implements a very basic >> LRU cache of pre-translated LPIs, which gets used to implement >> kvm_arch_set_irq_inatomic. The size of the cache is currently >> hard-coded at 16 times the number of vcpus, a number I have picked >> under the influence of Ali Saidi. If that's not enough for you, blame >> me, though. >> >> Does it work? well, it doesn't crash, and is thus perfect. More >> seriously, I don't really have a way to benchmark it directly, so my >> observations are only indirect: >> >> On a TX2 system, I run a 4 vcpu VM with an Ethernet interface passed >> to it directly. From the host, I inject interrupts using debugfs. In >> parallel, I look at the number of context switch, and the number of >> interrupts on the host. Without this series, I get the same number for >> both IRQ and CS (about half a million of each per second is pretty >> easy to reach). With this series, the number of context switches drops >> to something pretty small (in the low 2k), while the number of >> interrupts stays the same. >> >> Yes, this is a pretty rubbish benchmark, what did you expect? 
;-) >> >> So I'm putting this out for people with real workloads to try out and >> report what they see. > > So I gave that a shot with some benchmarks. As expected, it is quite hard > to show an improvement with just one guest running, although we could show > a 103%(!) improvement of the memcached QPS score in one experiment when > running it in a guest with an external load generator. Is that a fluke or something that you have been able to reproduce consistently? Because doubling the performance of anything is something I have a hard time believing in... ;-) > Throwing more users into the game showed a significant improvement: > > Benchmark 1: kernel compile/FIO: Compiling a kernel on the host, while > letting a guest run FIO with 4K randreads from a passed-through NVMe SSD: > The IOPS with this series improved by 27% compared to pure mainline, > reaching 80% of the host value. Kernel compilation time improved by 8.5% > compared to mainline. OK, that's interesting. I guess that's the effect of not unnecessarily disrupting the scheduling with one extra context-switch per interrupt. > > Benchmark 2: FIO/FIO: Running FIO on a passed through SATA SSD in one > guest, and FIO on a passed through NVMe SSD in another guest, at the same > time: > The IOPS with this series improved by 23% for the NVMe and 34% for the > SATA disk, compared to pure mainline. I guess that's the same thing. Not context-switching means more available resource to other processes in the system. > So judging from these results, I think this series is a significant > improvement, which justifies it to be merged, to receive wider testing. > > It would be good if others could also do performance experiments and post > their results. Wishful thinking... Anyway, I'll repost the series shortly now that Eric has gone through it. Thanks, M. -- Jazz is not dead. It just smells funny... 
On Thu, 25 Jul 2019 09:50:18 +0100 Marc Zyngier <marc.zyngier@arm.com> wrote: Hi Marc, > On 23/07/2019 12:14, Andre Przywara wrote: > > On Tue, 11 Jun 2019 18:03:27 +0100 > > Marc Zyngier <marc.zyngier@arm.com> wrote: > > > > Hi, > > > >> It recently became apparent[1] that our LPI injection path is not as > >> efficient as it could be when injecting interrupts coming from a VFIO > >> assigned device. > >> > >> Although the proposed patch wasn't 100% correct, it outlined at least > >> two issues: > >> > >> (1) Injecting an LPI from VFIO always results in a context switch to a > >> worker thread: no good > >> > >> (2) We have no way of amortising the cost of translating a DID+EID pair > >> to an LPI number > >> > >> The reason for (1) is that we may sleep when translating an LPI, so we > >> do need a context process. A way to fix that is to implement a small > >> LPI translation cache that could be looked up from an atomic > >> context. It would also solve (2). > >> > >> This is what this small series proposes. It implements a very basic > >> LRU cache of pre-translated LPIs, which gets used to implement > >> kvm_arch_set_irq_inatomic. The size of the cache is currently > >> hard-coded at 16 times the number of vcpus, a number I have picked > >> under the influence of Ali Saidi. If that's not enough for you, blame > >> me, though. > >> > >> Does it work? well, it doesn't crash, and is thus perfect. More > >> seriously, I don't really have a way to benchmark it directly, so my > >> observations are only indirect: > >> > >> On a TX2 system, I run a 4 vcpu VM with an Ethernet interface passed > >> to it directly. From the host, I inject interrupts using debugfs. In > >> parallel, I look at the number of context switch, and the number of > >> interrupts on the host. Without this series, I get the same number for > >> both IRQ and CS (about half a million of each per second is pretty > >> easy to reach). 
With this series, the number of context switches drops > >> to something pretty small (in the low 2k), while the number of > >> interrupts stays the same. > >> > >> Yes, this is a pretty rubbish benchmark, what did you expect? ;-) > >> > >> So I'm putting this out for people with real workloads to try out and > >> report what they see. > > > > So I gave that a shot with some benchmarks. As expected, it is quite hard > > to show an improvement with just one guest running, although we could show > > a 103%(!) improvement of the memcached QPS score in one experiment when > > running it in a guest with an external load generator. > > Is that a fluke or something that you have been able to reproduce > consistently? Because doubling the performance of anything is something > I have a hard time believing in... ;-) Me too. I didn't do this particular test, but it seems that at least in this particular setup the results were reproducible. AFAICS the parameters for memcached were just tuned to reduce variation. The test was run three times on a TX2, with a variation of +/- 5%. The average number (Memcached QPS SLA) was 180539 with this series, and 89076 without it. This benchmark setup is reported to be very latency sensitive, with high I/O requirements, so the observed scheduling improvement of this series would quite plausibly show a dramatic effect in a guest. > > Throwing more users into the game showed a significant improvement: > > > > Benchmark 1: kernel compile/FIO: Compiling a kernel on the host, while > > letting a guest run FIO with 4K randreads from a passed-through NVMe SSD: > > The IOPS with this series improved by 27% compared to pure mainline, > > reaching 80% of the host value. Kernel compilation time improved by 8.5% > > compared to mainline. > > OK, that's interesting. I guess that's the effect of not unnecessarily > disrupting the scheduling with one extra context-switch per interrupt. That's my understanding as well. 
The machine had four cores, the guest four VCPUs, FIO in that guest was told to use four jobs. The kernel was compiling with make -j5. So yes, the scheduler is quite busy here, and I would expect any relief there to benefit performance. > > Benchmark 2: FIO/FIO: Running FIO on a passed through SATA SSD in one > > guest, and FIO on a passed through NVMe SSD in another guest, at the same > > time: > > The IOPS with this series improved by 23% for the NVMe and 34% for the > > SATA disk, compared to pure mainline. > > I guess that's the same thing. Not context-switching means more > available resource to other processes in the system. Yes. These were again four VCPU guests with a 4-job FIO in each. And for the records, using FIO with just "read" and a blocksize of 1MB didn't show any effects: the numbers were basically the same as bare metal, in every case. I would attribute this to the number of interrupts being far too low to show an impact. > > So judging from these results, I think this series is a significant > > improvement, which justifies it to be merged, to receive wider testing. > > > > It would be good if others could also do performance experiments and post > > their results. > > Wishful thinking... > > Anyway, I'll repost the series shortly now that Eric has gone through it. Thanks! Feel free to add my Tested-by: at an appropriate place. Cheers, Andre. _______________________________________________ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
On 25/07/2019 11:01, Andre Przywara wrote: > Thanks! Feel free to add my Tested-by: at an appropriate place. Ah, sorry, missed that. If you give the new series a go, I swear I'll add your tag! ;-) Thanks, M. -- Jazz is not dead. It just smells funny...