On Wed, Sep 11, 2019 at 12:25:24PM +0200, Greg Kurz wrote:
> On Wed, 11 Sep 2019 12:30:48 +1000
> David Gibson wrote:
> 
> > On Tue, Sep 10, 2019 at 06:49:34PM +0200, Greg Kurz wrote:
> > > Each vCPU of a VM allocates a XIVE VP in OPAL, which is associated with
> > > 8 event queue (EQ) descriptors, one for each priority. A POWER9 socket
> > > can handle a maximum of 1M event queues.
> > > 
> > > The powernv platform allocates NR_CPUS (== 2048) VPs for the hypervisor,
> > > and each XIVE KVM device allocates KVM_MAX_VCPUS (== 2048) VPs. This means
> > > that on a bi-socket system, we can create at most:
> > > 
> > >     (2 * 1M) / (8 * 2048) - 1 == 127 XIVE or XICS-on-XIVE KVM devices
> > > 
> > > i.e., start at most 127 VMs benefiting from an in-kernel interrupt controller.
> > > Subsequent VMs need to rely on the much slower userspace-emulated XIVE device
> > > in QEMU.
> > > 
> > > This is problematic as one can legitimately expect to start the same
> > > number of mono-CPU VMs as the number of HW threads available on the
> > > system (e.g., 144 on Witherspoon).
> > > 
> > > I'm not aware of any userspace supporting more than 1024 vCPUs. It thus
> > > seems overkill to consume that many VPs per VM. Ideally we would even
> > > want userspace to be able to tell KVM about the maximum number of vCPUs
> > > when creating the VM.
> > > 
> > > For now, provide a module parameter to configure the maximum number of
> > > vCPUs per VM. While here, reduce the default value to 1024 to match the
> > > current limit in QEMU. This number is only used by the XIVE KVM devices,
> > > but some more users of KVM_MAX_VCPUS could possibly be converted.
> > > 
> > > With this change, I could successfully run 230 mono-CPU VMs on a
> > > Witherspoon system using the official skiboot-6.3.
> > > 
> > > I could even run more VMs by using upstream skiboot containing this
> > > fix, which allows interrupts to be spread better between sockets:
> > > 
> > >     e97391ae2bb5 ("xive: fix return value of opal_xive_allocate_irq()")
> > > 
> > >     MAX VCPUS | MAX VMS
> > >     ----------+---------
> > >          1024 |     255
> > >           512 |     511
> > >           256 |    1023 (*)
> > > 
> > > (*) the system was barely usable because of the extreme load and
> > > memory exhaustion, but the VMs did start.
> > 
> > Hrm.  I don't love the idea of using a global tunable for this,
> > although I guess it could have some use.  It's another global system
> > property that admins have to worry about.
> > 
> 
> Well, they have to worry only if they're unhappy with the new
> 1024 default, FWIW.

True.

> > A better approach would seem to be a way for userspace to be able to
> > hint the maximum number of cpus for a specific VM to the kernel.
> > 
> 
> Yes, and it's mentioned in the changelog. Since this requires adding
> a new API to KVM and the corresponding changes in QEMU, I was thinking
> that having a way to change the limit in KVM would be an acceptable
> solution for the short term.

Yeah, I guess that makes sense.

> Anyway, I'll start looking into the better approach.
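
For the record, one possible shape for that API -- purely illustrative,
none of the names or values below exist today -- would be to have QEMU
tell the in-kernel XIVE device how many vCPUs the VM was configured
with, before the first vCPU is connected, so the device can size its VP
block from that instead of KVM_MAX_VCPUS.  The userspace side could go
through the existing KVM_SET_DEVICE_ATTR ioctl on the device fd, along
these lines:

    /* Hypothetical userspace (QEMU-side) sketch, not an existing ABI. */
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Made-up group and attribute for illustration only. */
    #define KVM_DEV_XIVE_GRP_CONFIG   4
    #define KVM_DEV_XIVE_NR_VCPUS     1

    static int xive_set_nr_vcpus(int xive_dev_fd, uint32_t nr_vcpus)
    {
            struct kvm_device_attr attr = {
                    .group = KVM_DEV_XIVE_GRP_CONFIG,
                    .attr  = KVM_DEV_XIVE_NR_VCPUS,
                    .addr  = (uint64_t)(uintptr_t)&nr_vcpus,
            };

            /* KVM_SET_DEVICE_ATTR is the existing ioctl for KVM device fds */
            return ioctl(xive_dev_fd, KVM_SET_DEVICE_ATTR, &attr);
    }

On the kernel side, kvmppc_xive_create() / kvmppc_xive_native_create()
could then defer the xive_native_alloc_vp_block() call until that value
is known, falling back to the module parameter (or KVM_MAX_VCPUS) if
userspace never sets it.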
> > > 
> > > Signed-off-by: Greg Kurz 
> > > ---
> > >  arch/powerpc/include/asm/kvm_host.h   |  1 +
> > >  arch/powerpc/kvm/book3s_hv.c          | 32 ++++++++++++++++++++++++++++++++
> > >  arch/powerpc/kvm/book3s_xive.c        |  2 +-
> > >  arch/powerpc/kvm/book3s_xive_native.c |  2 +-
> > >  4 files changed, 35 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> > > index 6fb5fb4779e0..17582ce38788 100644
> > > --- a/arch/powerpc/include/asm/kvm_host.h
> > > +++ b/arch/powerpc/include/asm/kvm_host.h
> > > @@ -335,6 +335,7 @@ struct kvm_arch {
> > >  	struct kvm_nested_guest *nested_guests[KVM_MAX_NESTED_GUESTS];
> > >  	/* This array can grow quite large, keep it at the end */
> > >  	struct kvmppc_vcore *vcores[KVM_MAX_VCORES];
> > > +	unsigned int max_vcpus;
> > >  #endif
> > >  };
> > > 
> > > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > > index f8975c620f41..393d8a1ce9d8 100644
> > > --- a/arch/powerpc/kvm/book3s_hv.c
> > > +++ b/arch/powerpc/kvm/book3s_hv.c
> > > @@ -125,6 +125,36 @@ static bool nested = true;
> > >  module_param(nested, bool, S_IRUGO | S_IWUSR);
> > >  MODULE_PARM_DESC(nested, "Enable nested virtualization (only on POWER9)");
> > >  
> > > +#define MIN(x, y) (((x) < (y)) ? (x) : (y))
> > > +
> > > +static unsigned int max_vcpus = MIN(KVM_MAX_VCPUS, 1024);
> > > +
> > > +static int set_max_vcpus(const char *val, const struct kernel_param *kp)
> > > +{
> > > +	unsigned int new_max_vcpus;
> > > +	int ret;
> > > +
> > > +	ret = kstrtouint(val, 0, &new_max_vcpus);
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	if (new_max_vcpus > KVM_MAX_VCPUS)
> > > +		return -EINVAL;
> > > +
> > > +	max_vcpus = new_max_vcpus;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static struct kernel_param_ops max_vcpus_ops = {
> > > +	.set = set_max_vcpus,
> > > +	.get = param_get_uint,
> > > +};
> > > +
> > > +module_param_cb(max_vcpus, &max_vcpus_ops, &max_vcpus, S_IRUGO | S_IWUSR);
> > > +MODULE_PARM_DESC(max_vcpus, "Maximum number of vCPUS per VM (max = "
> > > +		 __stringify(KVM_MAX_VCPUS) ")");
> > > +
> > >  static inline bool nesting_enabled(struct kvm *kvm)
> > >  {
> > >  	return kvm->arch.nested_enable && kvm_is_radix(kvm);
> > > @@ -4918,6 +4948,8 @@ static int kvmppc_core_init_vm_hv(struct kvm *kvm)
> > >  	if (radix_enabled())
> > >  		kvmhv_radix_debugfs_init(kvm);
> > >  
> > > +	kvm->arch.max_vcpus = max_vcpus;
> > > +
> > >  	return 0;
> > >  }
> > > 
> > > diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
> > > index 2ef43d037a4f..0fea31b64564 100644
> > > --- a/arch/powerpc/kvm/book3s_xive.c
> > > +++ b/arch/powerpc/kvm/book3s_xive.c
> > > @@ -2026,7 +2026,7 @@ static int kvmppc_xive_create(struct kvm_device *dev, u32 type)
> > >  	xive->q_page_order = xive->q_order - PAGE_SHIFT;
> > >  
> > >  	/* Allocate a bunch of VPs */
> > > -	xive->vp_base = xive_native_alloc_vp_block(KVM_MAX_VCPUS);
> > > +	xive->vp_base = xive_native_alloc_vp_block(kvm->arch.max_vcpus);
> > >  	pr_devel("VP_Base=%x\n", xive->vp_base);
> > >  
> > >  	if (xive->vp_base == XIVE_INVALID_VP)
> > > diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
> > > index 84a354b90f60..20314010da56 100644
> > > --- a/arch/powerpc/kvm/book3s_xive_native.c
> > > +++ b/arch/powerpc/kvm/book3s_xive_native.c
> > > @@ -1095,7 +1095,7 @@ static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type)
> > >  	 * a default. Getting the max number of CPUs the VM was
> > >  	 * configured with would improve our usage of the XIVE VP space.
> > >  	 */
> > > -	xive->vp_base = xive_native_alloc_vp_block(KVM_MAX_VCPUS);
> > > +	xive->vp_base = xive_native_alloc_vp_block(kvm->arch.max_vcpus);
> > >  	pr_devel("VP_Base=%x\n", xive->vp_base);
> > >  
> > >  	if (xive->vp_base == XIVE_INVALID_VP)
> > > 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson