On Tue, May 01, 2018 at 02:52:21PM +1000, Sam Bobroff wrote:
> On Tue, Apr 24, 2018 at 01:48:25PM +1000, David Gibson wrote:
> > On Tue, Apr 24, 2018 at 01:19:15PM +1000, Sam Bobroff wrote:
> > > On Mon, Apr 23, 2018 at 11:06:35AM +0200, Cédric Le Goater wrote:
> > > > On 04/16/2018 06:09 AM, David Gibson wrote:

[snip]

> > > At the moment, kvm->vcores[] and xive->vp_base are both sized by
> > > NR_CPUS (via KVM_MAX_VCPUS and KVM_MAX_VCORES, which are both
> > > NR_CPUS). This is enough space for the maximum number of VCPUs,
> > > and some space is wasted when the guest uses fewer than this (but
> > > KVM doesn't know how many will be created, so we can't easily do
> > > better). The problem is that the indices overflow before all of
> > > those VCPUs can be created, not that more space is needed.
> > >
> > > We could fix the overflow by expanding these areas to
> > > KVM_MAX_VCPU_ID, but that would use 8x the space we use now, and
> > > we know that no more than KVM_MAX_VCPUS will be used, so all of
> > > this new space would be wasted.
> > >
> > > So remapping seems better, if it will work. (Ben H. was strongly
> > > against wasting more XIVE space if possible.)
> >
> > Hm, ok.  Are the relevant arrays here per-VM, or global?  Or some
> > of both?
>
> Per-VM. They are the kvm->vcores[] array and the blocks of memory
> pointed to by xive->vp_base.

Hm.  If it were global (where you can't know the size of a specific
VM) I'd certainly see the concern about not expanding the size of the
array.  As it is, I'm a little perplexed that we care so much about
the difference between KVM_MAX_VCPUS and KVM_MAX_VCPU_ID, a factor of
8, when we apparently don't care about the difference between the
VM's actual number of CPUs and KVM_MAX_VCPUS, a factor of maybe 2048
(for a 1-VCPU guest and powernv_defconfig).

> > > In short, remapping provides a way to allow the guest to create
> > > its full set of VCPUs without wasting any more space than we do
> > > currently, and without having to do something more complicated
> > > like tracking used IDs or adding additional KVM CAPs.
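(For anyone trying to follow the remapping from the sidelines, here is
a minimal sketch of the scheme as I understand it.  The function name
and the single-stride assumption are mine, not the patch's; the real
kvmppc_pack_vcpu_id() also has to handle intermediate SMT modes, so it
interleaves the blocks with an offset table rather than adding the
block number directly.)

/*
 * Sketch only: assume QEMU hands out guest VCPU IDs with a stride of
 * MAX_SMT_THREADS (8), i.e. id = core * 8, as for a guest with one
 * thread per emulated core.  IDs then span [0, 8 * KVM_MAX_VCPUS),
 * but only every 8th value in each KVM_MAX_VCPUS-sized block is
 * used, so the upper blocks fold into the gaps of block 0 and the
 * packed ID stays below KVM_MAX_VCPUS.
 */
static u32 pack_vcpu_id_sketch(struct kvm *kvm, u32 id)
{
	u32 block = id / KVM_MAX_VCPUS;   /* which 8th of the ID space  */
	u32 offset = id % KVM_MAX_VCPUS;  /* a multiple of 8 here       */

	BUG_ON(block >= MAX_SMT_THREADS); /* ID outside packable range  */
	return offset + block;            /* unique, < KVM_MAX_VCPUS    */
}

Since the offsets are multiples of 8 and the block number is below 8,
no two valid IDs can collide, and the result always indexes within a
KVM_MAX_VCPUS-sized array.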
> > > > >> +
> > > > >>  #endif /* __ASM_KVM_BOOK3S_H__ */
> > > > >> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > > > >> index 9cb9448163c4..49165cc90051 100644
> > > > >> --- a/arch/powerpc/kvm/book3s_hv.c
> > > > >> +++ b/arch/powerpc/kvm/book3s_hv.c
> > > > >> @@ -1762,7 +1762,7 @@ static int threads_per_vcore(struct kvm *kvm)
> > > > >>  	return threads_per_subcore;
> > > > >>  }
> > > > >>
> > > > >> -static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
> > > > >> +static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int id)
> > > > >>  {
> > > > >>  	struct kvmppc_vcore *vcore;
> > > > >>
> > > > >> @@ -1776,7 +1776,7 @@ static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
> > > > >>  	init_swait_queue_head(&vcore->wq);
> > > > >>  	vcore->preempt_tb = TB_NIL;
> > > > >>  	vcore->lpcr = kvm->arch.lpcr;
> > > > >> -	vcore->first_vcpuid = core * kvm->arch.smt_mode;
> > > > >> +	vcore->first_vcpuid = id;
> > > > >>  	vcore->kvm = kvm;
> > > > >>  	INIT_LIST_HEAD(&vcore->preempt_list);
> > > > >>
> > > > >> @@ -1992,12 +1992,18 @@ static struct kvm_vcpu *kvmppc_core_vcpu_create_hv(struct kvm *kvm,
> > > > >>  	mutex_lock(&kvm->lock);
> > > > >>  	vcore = NULL;
> > > > >>  	err = -EINVAL;
> > > > >> -	core = id / kvm->arch.smt_mode;
> > > > >> +	if (cpu_has_feature(CPU_FTR_ARCH_300)) {
> > > > >> +		BUG_ON(kvm->arch.smt_mode != 1);
> > > > >> +		core = kvmppc_pack_vcpu_id(kvm, id);
> > > > >> +	} else {
> > > > >> +		core = id / kvm->arch.smt_mode;
> > > > >> +	}
> > > > >>  	if (core < KVM_MAX_VCORES) {
> > > > >>  		vcore = kvm->arch.vcores[core];
> > > > >> +		BUG_ON(cpu_has_feature(CPU_FTR_ARCH_300) && vcore);
> > > > >>  		if (!vcore) {
> > > > >>  			err = -ENOMEM;
> > > > >> -			vcore = kvmppc_vcore_create(kvm, core);
> > > > >> +			vcore = kvmppc_vcore_create(kvm, id & ~(kvm->arch.smt_mode - 1));
> > > > >>  			kvm->arch.vcores[core] = vcore;
> > > > >>  			kvm->arch.online_vcores++;
> > > > >>  		}
> > > > >> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
> > > > >> index f9818d7d3381..681dfe12a5f3 100644
> > > > >> --- a/arch/powerpc/kvm/book3s_xive.c
> > > > >> +++ b/arch/powerpc/kvm/book3s_xive.c
> > > > >> @@ -317,6 +317,11 @@ static int xive_select_target(struct kvm *kvm, u32 *server, u8 prio)
> > > > >>  	return -EBUSY;
> > > > >>  }
> > > > >>
> > > > >> +static u32 xive_vp(struct kvmppc_xive *xive, u32 server)
> > > > >> +{
> > > > >> +	return xive->vp_base + kvmppc_pack_vcpu_id(xive->kvm, server);
> > > > >> +}
> > > > >> +
> > > > >
> > > > > I'm finding the XIVE indexing really baffling.  There are a
> > > > > bunch of other places where the code uses
> > > > > (xive->vp_base + NUMBER) directly.
> > >
> > > Ugh, yes. It looks like I botched part of my final cleanup and all
> > > the cases you saw in kvm/book3s_xive.c should have been replaced
> > > with a call to xive_vp(). I'll fix it and sorry for the confusion.
> >
> > Ok.
> >
> > > > This links the QEMU vCPU server NUMBER to a XIVE virtual
> > > > processor number in OPAL. So we need to check that all used
> > > > NUMBERs are, first, consistent and then, in the correct range.
> > >
> > > Right. My approach was to allow XIVE to keep using server numbers
> > > that are equal to VCPU IDs, and just pack down the ID before
> > > indexing into the vp_base area.
> > >
> > > > > If those are host side references, I guess they don't need
> > > > > updates for this.
> > >
> > > These are all guest side references.
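Concretely, once Sam's cleanup lands, the remaining direct users in
kvm/book3s_xive.c would all change along these lines (a sketch of one
representative call site; the exact sites and arguments may differ):

	/* Before: direct indexing; overflows the vp_base block once
	 * guest server numbers (== VCPU IDs) pass KVM_MAX_VCPUS. */
	rc = xive_native_configure_irq(hw_num,
				       xive->vp_base + state->act_server,
				       state->act_priority, state->number);

	/* After: the server number is packed before indexing, so the
	 * VP index stays inside the allocated vp_base block. */
	rc = xive_native_configure_irq(hw_num,
				       xive_vp(xive, state->act_server),
				       state->act_priority, state->number);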
> > > > > But if that's the case, then how does indexing into the same
> > > > > array with both host and guest server numbers make sense?
> > >
> > > Right, it doesn't make sense to mix host and guest server numbers
> > > when we're remapping only the guest ones, but in this case
> > > (without native guest XIVE support) it's just guest ones.
> >
> > Right.  Will this remapping be broken by guest-visible XIVE?  That
> > is, for guest-visible XIVE, are we going to need to expose
> > un-remapped XIVE server IDs to the guest?
>
> I'm not sure, I'll start looking at that next.
>
> > > > yes. VPs are allocated with KVM_MAX_VCPUS :
> > > >
> > > > 	xive->vp_base = xive_native_alloc_vp_block(KVM_MAX_VCPUS);
> > > >
> > > > but
> > > >
> > > > 	#define KVM_MAX_VCPU_ID  (threads_per_subcore * KVM_MAX_VCORES)
> > > >
> > > > We would need to change the allocation of the VPs I guess.
> > >
> > > Yes, this is one of the structures that overflows if we don't
> > > pack the IDs.
> > >
> > > > >>  static u8 xive_lock_and_mask(struct kvmppc_xive *xive,
> > > > >>  			     struct kvmppc_xive_src_block *sb,
> > > > >>  			     struct kvmppc_xive_irq_state *state)
> > > > >> @@ -1084,7 +1089,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
> > > > >>  		pr_devel("Duplicate !\n");
> > > > >>  		return -EEXIST;
> > > > >>  	}
> > > > >> -	if (cpu >= KVM_MAX_VCPUS) {
> > > > >> +	if (cpu >= KVM_MAX_VCPU_ID) {
> > > > >>  		pr_devel("Out of bounds !\n");
> > > > >>  		return -EINVAL;
> > > > >>  	}
> > > > >> @@ -1098,7 +1103,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
> > > > >>  	xc->xive = xive;
> > > > >>  	xc->vcpu = vcpu;
> > > > >>  	xc->server_num = cpu;
> > > > >> -	xc->vp_id = xive->vp_base + cpu;
> > > > >> +	xc->vp_id = xive_vp(xive, cpu);
> > > > >>  	xc->mfrr = 0xff;
> > > > >>  	xc->valid = true;
> > > > >>

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson