From mboxrd@z Thu Jan 1 00:00:00 1970
From: Cédric Le Goater
Subject: Re: [PATCH RFC 1/1] KVM: PPC: Book3S HV: pack VCORE IDs to access full VCPU ID space
Date: Mon, 23 Apr 2018 11:06:35 +0200
Message-ID: <1e01ea66-6103-94c8-ccb1-ed35b3a3104b@kaod.org>
References: <70974cfb62a7f09a53ec914d2909639884228244.1523516498.git.sam.bobroff@au1.ibm.com>
 <20180416040942.GB20551@umbus.fritz.box>
In-Reply-To: <20180416040942.GB20551@umbus.fritz.box>
To: David Gibson , Sam Bobroff
Cc: paulus@samba.org, linuxppc-dev@lists.ozlabs.org, kvm-ppc@vger.kernel.org, kvm@vger.kernel.org

On 04/16/2018 06:09 AM, David Gibson wrote:
> On Thu, Apr 12, 2018 at 05:02:06PM +1000, Sam Bobroff wrote:
>> It is not currently possible to create the full number of possible
>> VCPUs (KVM_MAX_VCPUS) on Power9 with KVM-HV when the guest uses fewer
>> threads per core than its core stride (or "VSMT mode"). This is
>> because the VCORE ID and XIVE offsets grow beyond KVM_MAX_VCPUS
>> even though the VCPU ID is less than KVM_MAX_VCPU_ID.
>>
>> To address this, "pack" the VCORE ID and XIVE offsets by using
>> knowledge of the way the VCPU IDs will be used when there are fewer
>> guest threads per core than the core stride. The primary thread of
>> each core will always be used first. Then, if the guest uses more than
>> one thread per core, these secondary threads will sequentially follow
>> the primary in each core.
>>
>> So, the only way an ID above KVM_MAX_VCPUS can be seen is if the
>> VCPUs are being spaced apart, so at least half of each core is empty
>> and IDs between KVM_MAX_VCPUS and (KVM_MAX_VCPUS * 2) can be mapped
>> into the second half of each core (4..7, in an 8-thread core).
>>
>> Similarly, if IDs above KVM_MAX_VCPUS * 2 are seen, at least 3/4 of
>> each core is being left empty, and we can map down into the second and
>> third quarters of each core (2, 3 and 5, 6 in an 8-thread core).
>>
>> Lastly, if IDs above KVM_MAX_VCPUS * 4 are seen, only the primary
>> threads are being used and 7/8 of the core is empty, allowing use of
>> the 1, 3, 5 and 7 thread slots.
>>
>> (Strides less than 8 are handled similarly.)
>>
>> This allows the VCORE ID or offset to be calculated quickly from the
>> VCPU ID or XIVE server numbers, without access to the VCPU structure.
>>
>> Signed-off-by: Sam Bobroff
>> ---
>> Hello everyone,
>>
>> I've tested this on P8 and P9, in lots of combinations of host and guest
>> threading modes and it has been fine, but it does feel like a "tricky"
>> approach, so I still feel somewhat wary about it.

Have you done any migration ?

>> I've posted it as an RFC because I have not tested it with guest native-XIVE,
>> and I suspect that it will take some work to support it.

The KVM XIVE device will be different for XIVE exploitation mode, same
structures though. I will send a patchset shortly.

>>  arch/powerpc/include/asm/kvm_book3s.h | 19 +++++++++++++++++++
>>  arch/powerpc/kvm/book3s_hv.c          | 14 ++++++++++----
>>  arch/powerpc/kvm/book3s_xive.c        |  9 +++++++--
>>  3 files changed, 36 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
>> index 376ae803b69c..1295056d564a 100644
>> --- a/arch/powerpc/include/asm/kvm_book3s.h
>> +++ b/arch/powerpc/include/asm/kvm_book3s.h
>> @@ -368,4 +368,23 @@ extern int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu);
>>  #define SPLIT_HACK_MASK    0xff000000
>>  #define SPLIT_HACK_OFFS    0xfb000000
>>
>> +/* Pack a VCPU ID from the [0..KVM_MAX_VCPU_ID) space down to the
>> + * [0..KVM_MAX_VCPUS) space, while using knowledge of the guest's core stride
>> + * (but not its actual threading mode, which is not available) to avoid
>> + * collisions.
>> + */
>> +static inline u32 kvmppc_pack_vcpu_id(struct kvm *kvm, u32 id)
>> +{
>> +	const int block_offsets[MAX_SMT_THREADS] = {0, 4, 2, 6, 1, 5, 3, 7};
>
> I'd suggest 1,3,5,7 at the end rather than 1,5,3,7 - accomplishes
> roughly the same thing, but I think makes the pattern more obvious.
>
>> +	int stride = kvm->arch.emul_smt_mode > 1 ?
>> +		kvm->arch.emul_smt_mode : kvm->arch.smt_mode;
>
> AFAICT from BUG_ON()s etc. at the callsites, kvm->arch.smt_mode must
> always be 1 when this is called, so the conditional here doesn't seem
> useful.
>
>> +	int block = (id / KVM_MAX_VCPUS) * (MAX_SMT_THREADS / stride);
>> +	u32 packed_id;
>> +
>> +	BUG_ON(block >= MAX_SMT_THREADS);
>> +	packed_id = (id % KVM_MAX_VCPUS) + block_offsets[block];
>> +	BUG_ON(packed_id >= KVM_MAX_VCPUS);
>> +	return packed_id;
>> +}
>
> It took me a while to wrap my head around the packing function, but I
> think I got there in the end. It's pretty clever.
>
> One thing bothers me, though. This certainly packs things under
> KVM_MAX_VCPUS, but not necessarily under the actual number of vcpus.
> e.g. KVM_MAX_VCPUS == 16, 8 vcpus total, stride 8, 2 vthreads/vcore (as
> qemu sees it), gives both unpacked IDs (0, 1, 8, 9, 16, 17, 24, 25)
> and packed ids of (0, 1, 8, 9, 4, 5, 12, 13) - leaving 2, 3, 6, 7
> etc. unused.
>
> So again, the question is what exactly are these remapped IDs useful
> for. If we're indexing into a bare array of structures of size
> KVM_MAX_VCPUS then we're *already* wasting a bunch of space by having
> more entries than vcpus. If we're indexing into something sparser,
> then why is the remapping worthwhile?
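
FWIW, the mapping in that example can be checked with a small user-space
sketch of the packing arithmetic. The constants below are assumptions
picked to match the example (KVM_MAX_VCPUS = 16, MAX_SMT_THREADS = 8,
guest core stride 8), not the kernel's real values:

#include <stdio.h>

#define KVM_MAX_VCPUS   16      /* assumption, matches the example only */
#define MAX_SMT_THREADS 8

/* user-space copy of the packing arithmetic in kvmppc_pack_vcpu_id() */
static unsigned int pack_vcpu_id(unsigned int id, unsigned int stride)
{
        const int block_offsets[MAX_SMT_THREADS] = {0, 4, 2, 6, 1, 5, 3, 7};
        unsigned int block = (id / KVM_MAX_VCPUS) * (MAX_SMT_THREADS / stride);

        return (id % KVM_MAX_VCPUS) + block_offsets[block];
}

int main(void)
{
        const unsigned int ids[] = {0, 1, 8, 9, 16, 17, 24, 25};
        unsigned int i;

        for (i = 0; i < sizeof(ids) / sizeof(ids[0]); i++)
                printf("%2u -> %2u\n", ids[i], pack_vcpu_id(ids[i], 8));
        /* prints the packed IDs 0, 1, 8, 9, 4, 5, 12, 13 */
        return 0;
}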
>
>
>
>> +
>>  #endif /* __ASM_KVM_BOOK3S_H__ */
>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>> index 9cb9448163c4..49165cc90051 100644
>> --- a/arch/powerpc/kvm/book3s_hv.c
>> +++ b/arch/powerpc/kvm/book3s_hv.c
>> @@ -1762,7 +1762,7 @@ static int threads_per_vcore(struct kvm *kvm)
>>  	return threads_per_subcore;
>>  }
>>
>> -static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
>> +static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int id)
>>  {
>>  	struct kvmppc_vcore *vcore;
>>
>> @@ -1776,7 +1776,7 @@ static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
>>  	init_swait_queue_head(&vcore->wq);
>>  	vcore->preempt_tb = TB_NIL;
>>  	vcore->lpcr = kvm->arch.lpcr;
>> -	vcore->first_vcpuid = core * kvm->arch.smt_mode;
>> +	vcore->first_vcpuid = id;
>>  	vcore->kvm = kvm;
>>  	INIT_LIST_HEAD(&vcore->preempt_list);
>>
>> @@ -1992,12 +1992,18 @@ static struct kvm_vcpu *kvmppc_core_vcpu_create_hv(struct kvm *kvm,
>>  	mutex_lock(&kvm->lock);
>>  	vcore = NULL;
>>  	err = -EINVAL;
>> -	core = id / kvm->arch.smt_mode;
>> +	if (cpu_has_feature(CPU_FTR_ARCH_300)) {
>> +		BUG_ON(kvm->arch.smt_mode != 1);
>> +		core = kvmppc_pack_vcpu_id(kvm, id);
>> +	} else {
>> +		core = id / kvm->arch.smt_mode;
>> +	}
>>  	if (core < KVM_MAX_VCORES) {
>>  		vcore = kvm->arch.vcores[core];
>> +		BUG_ON(cpu_has_feature(CPU_FTR_ARCH_300) && vcore);
>>  		if (!vcore) {
>>  			err = -ENOMEM;
>> -			vcore = kvmppc_vcore_create(kvm, core);
>> +			vcore = kvmppc_vcore_create(kvm, id & ~(kvm->arch.smt_mode - 1));
>>  			kvm->arch.vcores[core] = vcore;
>>  			kvm->arch.online_vcores++;
>>  		}
>> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
>> index f9818d7d3381..681dfe12a5f3 100644
>> --- a/arch/powerpc/kvm/book3s_xive.c
>> +++ b/arch/powerpc/kvm/book3s_xive.c
>> @@ -317,6 +317,11 @@ static int xive_select_target(struct kvm *kvm, u32 *server, u8 prio)
>>  	return -EBUSY;
>>  }
>>
>> +static u32 xive_vp(struct kvmppc_xive *xive, u32 server)
>> +{
>> +	return xive->vp_base + kvmppc_pack_vcpu_id(xive->kvm, server);
>> +}
>> +
>
> I'm finding the XIVE indexing really baffling. There are a bunch of
> other places where the code uses (xive->vp_base + NUMBER) directly.

This links the QEMU vCPU server NUMBER to a XIVE virtual processor
number in OPAL. So we need to check that all the NUMBERs used are,
first, consistent and, then, in the correct range.

> If those are host side references, I guess they don't need updates for
> this.
> But if that's the case, then how does indexing into the same array
> with both host and guest server numbers make sense?

Yes. VPs are allocated with KVM_MAX_VCPUS :

	xive->vp_base = xive_native_alloc_vp_block(KVM_MAX_VCPUS);

but

	#define KVM_MAX_VCPU_ID  (threads_per_subcore * KVM_MAX_VCORES)

We would need to change the allocation of the VPs, I guess.
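To make that concrete, sizing the VP block for the full unpacked
server-number space would look something like the untested sketch below
(not part of Sam's patch); the packing in this patch is precisely what
lets us avoid that over-allocation:

	/* untested sketch: allocate a VP for every possible (unpacked)
	 * server number instead of packing the server numbers down */
	xive->vp_base = xive_native_alloc_vp_block(KVM_MAX_VCPU_ID);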

>>  static u8 xive_lock_and_mask(struct kvmppc_xive *xive,
>>  			     struct kvmppc_xive_src_block *sb,
>>  			     struct kvmppc_xive_irq_state *state)
>> @@ -1084,7 +1089,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
>>  		pr_devel("Duplicate !\n");
>>  		return -EEXIST;
>>  	}
>> -	if (cpu >= KVM_MAX_VCPUS) {
>> +	if (cpu >= KVM_MAX_VCPU_ID) {
>>  		pr_devel("Out of bounds !\n");
>>  		return -EINVAL;
>>  	}
>> @@ -1098,7 +1103,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
>>  	xc->xive = xive;
>>  	xc->vcpu = vcpu;
>>  	xc->server_num = cpu;
>> -	xc->vp_id = xive->vp_base + cpu;
>> +	xc->vp_id = xive_vp(xive, cpu);
>>  	xc->mfrr = 0xff;
>>  	xc->valid = true;
>>
>