Re: [PATCH v2 1/9] KVM: arm/arm64: vgic: Add LPI translation cache definition

From: Julien Thierry <julien.thierry@arm.com>
To: Marc Zyngier <marc.zyngier@arm.com>
Cc: kvm@vger.kernel.org, "Raslan, KarimAllah" <karahmed@amazon.de>,
	"Saidi, Ali" <alisaidi@amazon.com>,
	kvmarm@lists.cs.columbia.edu,
	linux-arm-kernel@lists.infradead.org
Subject: Re: [PATCH v2 1/9] KVM: arm/arm64: vgic: Add LPI translation cache definition
Date: Wed, 12 Jun 2019 11:58:28 +0100
Message-ID: <13655730-165b-d67b-a1da-11c8869c7053@arm.com>
In-Reply-To: <86ef3zgmg6.wl-marc.zyngier@arm.com>

On 12/06/2019 10:52, Marc Zyngier wrote:
> Hi Julien,
> 
> On Wed, 12 Jun 2019 09:16:21 +0100,
> Julien Thierry <julien.thierry@arm.com> wrote:
>>
>> Hi Marc,
>>
>> On 11/06/2019 18:03, Marc Zyngier wrote:
>>> Add the basic data structure that expresses an MSI to LPI
>>> translation as well as the allocation/release hooks.
>>>
>>> THe size of the cache is arbitrarily defined as 4*nr_vcpus.
>>>
>>
>> The size has been arbitrarily changed to 16*nr_vcpus :) .
> 
> Well spotted! ;-)
> 
>>
>> Nit: The*
> 
> Ah, usual lazy finger on the Shift key... One day I'll learn to type.
> 
>>
>>> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
>>> ---
>>>  include/kvm/arm_vgic.h        |  3 +++
>>>  virt/kvm/arm/vgic/vgic-init.c |  5 ++++
>>>  virt/kvm/arm/vgic/vgic-its.c  | 49 +++++++++++++++++++++++++++++++++++
>>>  virt/kvm/arm/vgic/vgic.h      |  2 ++
>>>  4 files changed, 59 insertions(+)
>>>

[...]

>>> diff --git a/virt/kvm/arm/vgic/vgic-its.c b/virt/kvm/arm/vgic/vgic-its.c
>>> index 44ceaccb18cf..ce9bcddeb7f1 100644
>>> --- a/virt/kvm/arm/vgic/vgic-its.c
>>> +++ b/virt/kvm/arm/vgic/vgic-its.c
>>> @@ -149,6 +149,14 @@ struct its_ite {
>>>  	u32 event_id;
>>>  };
>>>  
>>> +struct vgic_translation_cache_entry {
>>> +	struct list_head	entry;
>>> +	phys_addr_t		db;
>>> +	u32			devid;
>>> +	u32			eventid;
>>> +	struct vgic_irq		*irq;
>>> +};
>>> +
>>>  /**
>>>   * struct vgic_its_abi - ITS abi ops and settings
>>>   * @cte_esz: collection table entry size
>>> @@ -1668,6 +1676,45 @@ static int vgic_register_its_iodev(struct kvm *kvm, struct vgic_its *its,
>>>  	return ret;
>>>  }
>>>  
>>> +/* Default is 16 cached LPIs per vcpu */
>>> +#define LPI_DEFAULT_PCPU_CACHE_SIZE	16
>>> +
>>> +void vgic_lpi_translation_cache_init(struct kvm *kvm)
>>> +{
>>> +	struct vgic_dist *dist = &kvm->arch.vgic;
>>> +	unsigned int sz;
>>> +	int i;
>>> +
>>> +	if (!list_empty(&dist->lpi_translation_cache))
>>> +		return;
>>> +
>>> +	sz = atomic_read(&kvm->online_vcpus) * LPI_DEFAULT_PCPU_CACHE_SIZE;
>>> +
>>> +	for (i = 0; i < sz; i++) {
>>> +		struct vgic_translation_cache_entry *cte;
>>> +
>>> +		/* An allocation failure is not fatal */
>>> +		cte = kzalloc(sizeof(*cte), GFP_KERNEL);
>>> +		if (WARN_ON(!cte))
>>> +			break;
>>> +
>>> +		INIT_LIST_HEAD(&cte->entry);
>>> +		list_add(&cte->entry, &dist->lpi_translation_cache);
>>
>> Going through the series, it looks like this list is either empty
>> (before the cache init) or has a fixed number
>> (LPI_DEFAULT_PCPU_CACHE_SIZE * nr_cpus) of entries.
> 
> Well, it could also fail when allocating one of the entries, meaning we
> can have an allocation ranging from 0 to (LPI_DEFAULT_PCPU_CACHE_SIZE
> * nr_cpus) entries.
> 
>> And the list never grows nor shrinks throughout the series, so it
>> seems odd to be using a list here.
>>
>> Is there a reason for not using a dynamically allocated array instead of
>> the list? (does list_move() provide a big perf advantage over swapping
>> the data from one array entry to another? Or is there some other
>> facility I am missing?)
> 
> The idea was to make the LRU policy cheap, on the assumption that
> list_move (which is only a couple of pointer updates) is cheaper than
> a memmove if you want to keep the array ordered. If we exclude the
> list head, we end up with 24 bytes per entry to move down to make room
> for the new entry at the head of the array. For large caches that miss
> very often, this will hurt badly. But is that really a problem? I
> don't know.
> 

Yes, I realized afterwards that the LRU uses the fact you can easily
move list entries without modifying the rest of the list.
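
To make that concrete, here is a minimal userspace sketch of the
list-based LRU. It is not the code from the series: struct node and
its helpers are hypothetical stand-ins for struct list_head and
list_move(), and the irq pointer is reduced to an integer where 0
means "unused".

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the kernel's struct list_head. */
struct node {
	struct node *prev, *next;
};

/* Hypothetical entry: key fields as in the patch, irq as a plain number. */
struct cache_entry {
	struct node link;	/* kept first so a node can be cast back */
	uint64_t db;
	uint32_t devid;
	uint32_t eventid;
	uint32_t irq;		/* 0 means "unused" in this sketch */
};

static void list_init(struct node *head)
{
	head->prev = head->next = head;
}

static void list_push_front(struct node *n, struct node *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* Same idea as list_move(): unlink, then relink right after the head. */
static void list_move_front(struct node *n, struct node *head)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_push_front(n, head);
}

/* On a hit, promote the entry to most recently used. */
static struct cache_entry *cache_lookup(struct node *head, uint64_t db,
					uint32_t devid, uint32_t eventid)
{
	struct node *n;

	for (n = head->next; n != head; n = n->next) {
		struct cache_entry *e = (struct cache_entry *)n;

		if (e->irq && e->db == db && e->devid == devid &&
		    e->eventid == eventid) {
			list_move_front(n, head);
			return e;
		}
	}
	return NULL;
}

/* On a miss, recycle the tail (least recently used) entry. */
static void cache_insert(struct node *head, uint64_t db, uint32_t devid,
			 uint32_t eventid, uint32_t irq)
{
	struct cache_entry *e;

	if (head->prev == head)
		return;		/* nothing was allocated, cache disabled */

	e = (struct cache_entry *)head->prev;
	e->db = db;
	e->devid = devid;
	e->eventid = eventid;
	e->irq = irq;
	list_move_front(&e->link, head);
}
```

Neither a hit nor an insertion touches any entry other than the one
being moved and its neighbours, which is the property an ordered array
would not have.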

> We could allocate an array as you suggest, and use a linked list
> inside the array. Or something else. I'm definitely open to
> suggestions!

If there turns out to be some benefit to just using a fixed array, we
could use a simple ring buffer. Have one pointer to the most recently
inserted entry (so we know the next insertion will take place in the
entry "just before" it) and one pointer to the least recently used entry
(which gets moved when the most recently inserted entry catches up to
it) so we know where to stop when looping. We don't really have to worry
about the "ring buffer full" case since that just means we overwrite the
LRU entry and move the pointer.
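
A rough userspace sketch of what I have in mind, with assumptions
stated up front: RING_SIZE, struct ring_cache and the function names
are made up for illustration, the irq pointer is reduced to an integer,
and instead of a second pointer I keep a count from which the oldest
slot can be derived. Note that in this form a hit does not reorder
anything, so eviction follows insertion order rather than strict LRU.

```c
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 4	/* hypothetical; the series sizes the cache per vcpu */

struct ring_entry {
	uint64_t db;
	uint32_t devid;
	uint32_t eventid;
	uint32_t irq;
};

struct ring_cache {
	struct ring_entry slot[RING_SIZE];
	unsigned int newest;	/* index of the most recently inserted entry */
	unsigned int count;	/* valid entries, never more than RING_SIZE */
};

/* Insert in the slot "just before" the newest one, wrapping around. */
static void ring_insert(struct ring_cache *c, uint64_t db, uint32_t devid,
			uint32_t eventid, uint32_t irq)
{
	struct ring_entry *e;

	c->newest = (c->newest + RING_SIZE - 1) % RING_SIZE;
	e = &c->slot[c->newest];
	e->db = db;
	e->devid = devid;
	e->eventid = eventid;
	e->irq = irq;

	/* Once full, each insertion simply overwrites the oldest entry. */
	if (c->count < RING_SIZE)
		c->count++;
}

/* Walk from the newest entry towards the oldest, stopping after count. */
static struct ring_entry *ring_lookup(struct ring_cache *c, uint64_t db,
				      uint32_t devid, uint32_t eventid)
{
	unsigned int i;

	for (i = 0; i < c->count; i++) {
		struct ring_entry *e = &c->slot[(c->newest + i) % RING_SIZE];

		if (e->db == db && e->devid == devid && e->eventid == eventid)
			return e;
	}
	return NULL;
}
```

A zero-initialised struct ring_cache is a valid empty cache here, and
the lookup loop runs over contiguous memory instead of chasing list
pointers.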

This might prove a bit more efficient when looping over the cache
entries compared to the list. However, I am not certain there would be
an actual performance gain from that, and the current implementation
has the benefit of being simple.

Let me know if you decide to give the ring buffer approach a try.

Otherwise there's always the option to add an even more complex
structure with a hashtable + linked list, using hashes and tags to look
up the entries. But keeping things simple for now seems reasonable
(also, it avoids having to think about what to use as hash and tag :D ).

Cheers,

-- 
Julien Thierry