LinuxPPC-Dev Archive on lore.kernel.org
* [PATCH] KVM: PPC: Book3S HV: Tunable to configure maximum # of vCPUs per VM
@ 2019-09-10 16:49 Greg Kurz
  2019-09-11  2:30 ` David Gibson
  0 siblings, 1 reply; 4+ messages in thread
From: Greg Kurz @ 2019-09-10 16:49 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: kvm, kvm-ppc, Cédric Le Goater, linuxppc-dev, David Gibson

Each vCPU of a VM allocates a XIVE VP in OPAL which is associated with
8 event queue (EQ) descriptors, one for each priority. A POWER9 socket
can handle a maximum of 1M event queues.

The powernv platform allocates NR_CPUS (== 2048) VPs for the hypervisor,
and each XIVE KVM device allocates KVM_MAX_VCPUS (== 2048) VPs. This means
that on a bi-socket system, we can create at most:

(2 * 1M) / (8 * 2048) - 1 == 127 XIVE or XICS-on-XIVE KVM devices

i.e., we can start at most 127 VMs that benefit from an in-kernel interrupt
controller. Subsequent VMs have to rely on the much slower userspace-emulated
XIVE device in QEMU.
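
The arithmetic can be sanity-checked with a quick shell computation (a sketch;
the 1M-EQ-per-socket, 8-EQ-per-VP and 2048-VP figures all come from the
paragraphs above):

```shell
# Max XIVE/XICS-on-XIVE KVM devices on a bi-socket POWER9:
# 2 sockets * 1M EQs each, divided by (8 EQs per VP * 2048 VPs per device),
# minus 1 for the 2048 VPs the powernv platform reserves for the
# hypervisor (NR_CPUS).
echo $(( (2 * 1024 * 1024) / (8 * 2048) - 1 ))   # prints 127
```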

This is problematic as one can legitimately expect to start the same
number of mono-CPU VMs as the number of HW threads available on the
system (e.g., 144 on Witherspoon).

I'm not aware of any userspace supporting more than 1024 vCPUs. It thus
seems overkill to consume that many VPs per VM. Ideally we would even
want userspace to be able to tell KVM about the maximum number of vCPUs
when creating the VM.

For now, provide a module parameter to configure the maximum number of
vCPUs per VM. While here, reduce the default value to 1024 to match the
current limit in QEMU. This number is only used by the XIVE KVM devices,
but some more users of KVM_MAX_VCPUS could possibly be converted.
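
With the patch applied, the limit could be set at load time or at runtime
(a hypothetical usage sketch: the kvm_hv module name and sysfs path are
assumptions, not taken from the patch itself):

```shell
# Load-time (assumes the parameter lives in the kvm_hv module):
modprobe kvm_hv max_vcpus=512

# Runtime: the parameter is registered with S_IWUSR, so root can update it;
# only VMs created afterwards pick up the new value.
echo 512 > /sys/module/kvm_hv/parameters/max_vcpus
cat /sys/module/kvm_hv/parameters/max_vcpus
```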

With this change, I could successfully run 230 mono-CPU VMs on a
Witherspoon system using the official skiboot-6.3.

I could even run more VMs by using an upstream skiboot containing this
fix, which allows interrupts to be spread better between sockets:

e97391ae2bb5 ("xive: fix return value of opal_xive_allocate_irq()")

MAX VCPUS | MAX VMS
----------+---------
     1024 |     255
      512 |     511
      256 |    1023 (*)

(*) the system was barely usable because of the extreme load and
    memory exhaustion but the VMs did start.
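
The table rows follow from the same capacity formula as above (a sketch,
assuming the quoted 1M-EQ-per-socket limit):

```shell
# max VMs = (2 sockets * 1M EQs) / (8 EQs per VP * max_vcpus VPs per VM) - 1
for max_vcpus in 1024 512 256; do
    echo "$max_vcpus -> $(( (2 * 1024 * 1024) / (8 * max_vcpus) - 1 ))"
done
# prints:
#   1024 -> 255
#   512 -> 511
#   256 -> 1023
```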

Signed-off-by: Greg Kurz <groug@kaod.org>
---
 arch/powerpc/include/asm/kvm_host.h   |    1 +
 arch/powerpc/kvm/book3s_hv.c          |   32 ++++++++++++++++++++++++++++++++
 arch/powerpc/kvm/book3s_xive.c        |    2 +-
 arch/powerpc/kvm/book3s_xive_native.c |    2 +-
 4 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 6fb5fb4779e0..17582ce38788 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -335,6 +335,7 @@ struct kvm_arch {
 	struct kvm_nested_guest *nested_guests[KVM_MAX_NESTED_GUESTS];
 	/* This array can grow quite large, keep it at the end */
 	struct kvmppc_vcore *vcores[KVM_MAX_VCORES];
+	unsigned int max_vcpus;
 #endif
 };
 
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index f8975c620f41..393d8a1ce9d8 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -125,6 +125,36 @@ static bool nested = true;
 module_param(nested, bool, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(nested, "Enable nested virtualization (only on POWER9)");
 
+#define MIN(x, y) (((x) < (y)) ? (x) : (y))
+
+static unsigned int max_vcpus = MIN(KVM_MAX_VCPUS, 1024);
+
+static int set_max_vcpus(const char *val, const struct kernel_param *kp)
+{
+	unsigned int new_max_vcpus;
+	int ret;
+
+	ret = kstrtouint(val, 0, &new_max_vcpus);
+	if (ret)
+		return ret;
+
+	if (new_max_vcpus > KVM_MAX_VCPUS)
+		return -EINVAL;
+
+	max_vcpus = new_max_vcpus;
+
+	return 0;
+}
+
+static struct kernel_param_ops max_vcpus_ops = {
+	.set = set_max_vcpus,
+	.get = param_get_uint,
+};
+
+module_param_cb(max_vcpus, &max_vcpus_ops, &max_vcpus, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(max_vcpus, "Maximum number of vCPUS per VM (max = "
+		 __stringify(KVM_MAX_VCPUS) ")");
+
 static inline bool nesting_enabled(struct kvm *kvm)
 {
 	return kvm->arch.nested_enable && kvm_is_radix(kvm);
@@ -4918,6 +4948,8 @@ static int kvmppc_core_init_vm_hv(struct kvm *kvm)
 	if (radix_enabled())
 		kvmhv_radix_debugfs_init(kvm);
 
+	kvm->arch.max_vcpus = max_vcpus;
+
 	return 0;
 }
 
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index 2ef43d037a4f..0fea31b64564 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -2026,7 +2026,7 @@ static int kvmppc_xive_create(struct kvm_device *dev, u32 type)
 		xive->q_page_order = xive->q_order - PAGE_SHIFT;
 
 	/* Allocate a bunch of VPs */
-	xive->vp_base = xive_native_alloc_vp_block(KVM_MAX_VCPUS);
+	xive->vp_base = xive_native_alloc_vp_block(kvm->arch.max_vcpus);
 	pr_devel("VP_Base=%x\n", xive->vp_base);
 
 	if (xive->vp_base == XIVE_INVALID_VP)
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 84a354b90f60..20314010da56 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -1095,7 +1095,7 @@ static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type)
 	 * a default. Getting the max number of CPUs the VM was
 	 * configured with would improve our usage of the XIVE VP space.
 	 */
-	xive->vp_base = xive_native_alloc_vp_block(KVM_MAX_VCPUS);
+	xive->vp_base = xive_native_alloc_vp_block(kvm->arch.max_vcpus);
 	pr_devel("VP_Base=%x\n", xive->vp_base);
 
 	if (xive->vp_base == XIVE_INVALID_VP)



* Re: [PATCH] KVM: PPC: Book3S HV: Tunable to configure maximum # of vCPUs per VM
  2019-09-10 16:49 [PATCH] KVM: PPC: Book3S HV: Tunable to configure maximum # of vCPUs per VM Greg Kurz
@ 2019-09-11  2:30 ` David Gibson
  2019-09-11 10:25   ` Greg Kurz
  0 siblings, 1 reply; 4+ messages in thread
From: David Gibson @ 2019-09-11  2:30 UTC (permalink / raw)
  To: Greg Kurz; +Cc: kvm, kvm-ppc, Cédric Le Goater, linuxppc-dev


On Tue, Sep 10, 2019 at 06:49:34PM +0200, Greg Kurz wrote:
> Each vCPU of a VM allocates a XIVE VP in OPAL which is associated with
> 8 event queue (EQ) descriptors, one for each priority. A POWER9 socket
> can handle a maximum of 1M event queues.
> 
> The powernv platform allocates NR_CPUS (== 2048) VPs for the hypervisor,
> and each XIVE KVM device allocates KVM_MAX_VCPUS (== 2048) VPs. This means
> that on a bi-socket system, we can create at most:
> 
> (2 * 1M) / (8 * 2048) - 1 == 127 XIVE or XICS-on-XIVE KVM devices
> 
> i.e., we can start at most 127 VMs that benefit from an in-kernel interrupt
> controller. Subsequent VMs have to rely on the much slower userspace-emulated
> XIVE device in QEMU.
> 
> This is problematic as one can legitimately expect to start the same
> number of mono-CPU VMs as the number of HW threads available on the
> system (e.g., 144 on Witherspoon).
> 
> I'm not aware of any userspace supporting more than 1024 vCPUs. It thus
> seems overkill to consume that many VPs per VM. Ideally we would even
> want userspace to be able to tell KVM about the maximum number of vCPUs
> when creating the VM.
> 
> For now, provide a module parameter to configure the maximum number of
> vCPUs per VM. While here, reduce the default value to 1024 to match the
> current limit in QEMU. This number is only used by the XIVE KVM devices,
> but some more users of KVM_MAX_VCPUS could possibly be converted.
> 
> With this change, I could successfully run 230 mono-CPU VMs on a
> Witherspoon system using the official skiboot-6.3.
> 
> I could even run more VMs by using an upstream skiboot containing this
> fix, which allows interrupts to be spread better between sockets:
> 
> e97391ae2bb5 ("xive: fix return value of opal_xive_allocate_irq()")
> 
> MAX VCPUS | MAX VMS
> ----------+---------
>      1024 |     255
>       512 |     511
>       256 |    1023 (*)
> 
> (*) the system was barely usable because of the extreme load and
>     memory exhaustion but the VMs did start.

Hrm.  I don't love the idea of using a global tunable for this,
although I guess it could have some use.  It's another global system
property that admins have to worry about.

A better approach would seem to be a way for userspace to be able to
hint the maximum number of cpus for a specific VM to the kernel.


-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [PATCH] KVM: PPC: Book3S HV: Tunable to configure maximum # of vCPUs per VM
  2019-09-11  2:30 ` David Gibson
@ 2019-09-11 10:25   ` Greg Kurz
  2019-09-12 13:49     ` David Gibson
  0 siblings, 1 reply; 4+ messages in thread
From: Greg Kurz @ 2019-09-11 10:25 UTC (permalink / raw)
  To: David Gibson; +Cc: kvm, kvm-ppc, Cédric Le Goater, linuxppc-dev


On Wed, 11 Sep 2019 12:30:48 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Tue, Sep 10, 2019 at 06:49:34PM +0200, Greg Kurz wrote:
> > Each vCPU of a VM allocates a XIVE VP in OPAL which is associated with
> > 8 event queue (EQ) descriptors, one for each priority. A POWER9 socket
> > can handle a maximum of 1M event queues.
> > 
> > The powernv platform allocates NR_CPUS (== 2048) VPs for the hypervisor,
> > and each XIVE KVM device allocates KVM_MAX_VCPUS (== 2048) VPs. This means
> > that on a bi-socket system, we can create at most:
> > 
> > (2 * 1M) / (8 * 2048) - 1 == 127 XIVE or XICS-on-XIVE KVM devices
> > 
> > i.e., we can start at most 127 VMs that benefit from an in-kernel interrupt
> > controller. Subsequent VMs have to rely on the much slower userspace-emulated
> > XIVE device in QEMU.
> > 
> > This is problematic as one can legitimately expect to start the same
> > number of mono-CPU VMs as the number of HW threads available on the
> > system (e.g., 144 on Witherspoon).
> > 
> > I'm not aware of any userspace supporting more than 1024 vCPUs. It thus
> > seems overkill to consume that many VPs per VM. Ideally we would even
> > want userspace to be able to tell KVM about the maximum number of vCPUs
> > when creating the VM.
> > 
> > For now, provide a module parameter to configure the maximum number of
> > vCPUs per VM. While here, reduce the default value to 1024 to match the
> > current limit in QEMU. This number is only used by the XIVE KVM devices,
> > but some more users of KVM_MAX_VCPUS could possibly be converted.
> > 
> > With this change, I could successfully run 230 mono-CPU VMs on a
> > Witherspoon system using the official skiboot-6.3.
> > 
> > I could even run more VMs by using an upstream skiboot containing this
> > fix, which allows interrupts to be spread better between sockets:
> > 
> > e97391ae2bb5 ("xive: fix return value of opal_xive_allocate_irq()")
> > 
> > MAX VCPUS | MAX VMS
> > ----------+---------
> >      1024 |     255
> >       512 |     511
> >       256 |    1023 (*)
> > 
> > (*) the system was barely usable because of the extreme load and
> >     memory exhaustion but the VMs did start.
> 
> Hrm.  I don't love the idea of using a global tunable for this,
> although I guess it could have some use.  It's another global system
> property that admins have to worry about.
> 

Well, they have to worry only if they're unhappy with the new
1024 default FWIW.

> A better approach would seem to be a way for userspace to be able to
> hint the maximum number of cpus for a specific VM to the kernel.
> 

Yes, and it's mentioned in the changelog. Since this requires adding
a new API to KVM and the corresponding changes in QEMU, I was thinking
that having a way to change the limit in KVM would be an acceptable
solution for the short term.

Anyway, I'll start looking into the better approach.




* Re: [PATCH] KVM: PPC: Book3S HV: Tunable to configure maximum # of vCPUs per VM
  2019-09-11 10:25   ` Greg Kurz
@ 2019-09-12 13:49     ` David Gibson
  0 siblings, 0 replies; 4+ messages in thread
From: David Gibson @ 2019-09-12 13:49 UTC (permalink / raw)
  To: Greg Kurz; +Cc: kvm, kvm-ppc, Cédric Le Goater, linuxppc-dev


On Wed, Sep 11, 2019 at 12:25:24PM +0200, Greg Kurz wrote:
> On Wed, 11 Sep 2019 12:30:48 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Tue, Sep 10, 2019 at 06:49:34PM +0200, Greg Kurz wrote:
> > > Each vCPU of a VM allocates a XIVE VP in OPAL which is associated with
> > > 8 event queue (EQ) descriptors, one for each priority. A POWER9 socket
> > > can handle a maximum of 1M event queues.
> > > 
> > > The powernv platform allocates NR_CPUS (== 2048) VPs for the hypervisor,
> > > and each XIVE KVM device allocates KVM_MAX_VCPUS (== 2048) VPs. This means
> > > that on a bi-socket system, we can create at most:
> > > 
> > > (2 * 1M) / (8 * 2048) - 1 == 127 XIVE or XICS-on-XIVE KVM devices
> > > 
> > > i.e., we can start at most 127 VMs that benefit from an in-kernel interrupt
> > > controller. Subsequent VMs have to rely on the much slower userspace-emulated
> > > XIVE device in QEMU.
> > > 
> > > This is problematic as one can legitimately expect to start the same
> > > number of mono-CPU VMs as the number of HW threads available on the
> > > system (e.g., 144 on Witherspoon).
> > > 
> > > I'm not aware of any userspace supporting more than 1024 vCPUs. It thus
> > > seems overkill to consume that many VPs per VM. Ideally we would even
> > > want userspace to be able to tell KVM about the maximum number of vCPUs
> > > when creating the VM.
> > > 
> > > For now, provide a module parameter to configure the maximum number of
> > > vCPUs per VM. While here, reduce the default value to 1024 to match the
> > > current limit in QEMU. This number is only used by the XIVE KVM devices,
> > > but some more users of KVM_MAX_VCPUS could possibly be converted.
> > > 
> > > With this change, I could successfully run 230 mono-CPU VMs on a
> > > Witherspoon system using the official skiboot-6.3.
> > > 
> > > I could even run more VMs by using an upstream skiboot containing this
> > > fix, which allows interrupts to be spread better between sockets:
> > > 
> > > e97391ae2bb5 ("xive: fix return value of opal_xive_allocate_irq()")
> > > 
> > > MAX VCPUS | MAX VMS
> > > ----------+---------
> > >      1024 |     255
> > >       512 |     511
> > >       256 |    1023 (*)
> > > 
> > > (*) the system was barely usable because of the extreme load and
> > >     memory exhaustion but the VMs did start.
> > 
> > Hrm.  I don't love the idea of using a global tunable for this,
> > although I guess it could have some use.  It's another global system
> > property that admins have to worry about.
> > 
> 
> Well, they have to worry only if they're unhappy with the new
> 1024 default FWIW.

True.

> > A better approach would seem to be a way for userspace to be able to
> > hint the maximum number of cpus for a specific VM to the kernel.
> > 
> 
> > Yes, and it's mentioned in the changelog. Since this requires adding
> > a new API to KVM and the corresponding changes in QEMU, I was thinking
> > that having a way to change the limit in KVM would be an acceptable
> > solution for the short term.

Yeah, I guess that makes sense.

> Anyway, I'll start looking into the better approach.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



