* [PATCH 0/6] KVM: PPC: Book3S: HV: XIVE: Allocate less VPs in OPAL
@ 2019-09-23 15:43 ` Greg Kurz
From: Greg Kurz @ 2019-09-23 15:43 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Michael Ellerman, Benjamin Herrenschmidt, Cédric Le Goater,
	David Gibson, Paolo Bonzini, Radim Krčmář,
	kvm-ppc, kvm, linuxppc-dev

Each vCPU of a VM allocates a XIVE VP in OPAL which is associated with
8 event queue (EQ) descriptors, one for each priority. A POWER9 socket
can handle a maximum of 1M event queues.

The powernv platform allocates NR_CPUS (== 2048) VPs for the hypervisor,
and each XIVE KVM device allocates KVM_MAX_VCPUS (== 2048) VPs. This
means that on a bi-socket system, we can create at most:

(2 * 1M) / (8 * 2048) - 1 == 127 XIVE KVM devices

i.e., we can start at most 127 VMs that benefit from an in-kernel interrupt
controller. Subsequent VMs have to rely on the much slower XIVE or XICS
devices emulated in userspace by QEMU.

This is problematic as one can legitimately expect to start as many
single-CPU VMs as there are HW threads available on the system, e.g.,
144 on a bi-socket POWER9 Witherspoon.

This series allows QEMU to tell KVM how many interrupt servers are needed,
which is likely far fewer than 2048 for a typical VM, e.g., only 256 for
32 vCPUs with a guest core stride of 8 and 1 thread per core.

With this I could run ~500 SMP1 VMs on a Witherspoon system.

Patches 1 to 3 are preliminary fixes (1 and 2 have already been posted
but are provided for convenience).

--
Greg

---

Cédric Le Goater (1):
      KVM: PPC: Book3S HV: XIVE: initialize private pointer when VPs are allocated

Greg Kurz (5):
      KVM: PPC: Book3S HV: XIVE: Set kvm->arch.xive when VPs are allocated
      KVM: PPC: Book3S HV: XIVE: Ensure VP isn't already in use
      KVM: PPC: Book3S HV: XIVE: Compute the VP id in a common helper
      KVM: PPC: Book3S HV: XIVE: Make VP block size configurable
      KVM: PPC: Book3S HV: XIVE: Allow userspace to set the # of VPs


 Documentation/virt/kvm/devices/xics.txt |   14 +++
 Documentation/virt/kvm/devices/xive.txt |    8 ++
 arch/powerpc/include/uapi/asm/kvm.h     |    3 +
 arch/powerpc/kvm/book3s_xive.c          |  145 +++++++++++++++++++++++++------
 arch/powerpc/kvm/book3s_xive.h          |   17 ++++
 arch/powerpc/kvm/book3s_xive_native.c   |   49 +++++-----
 6 files changed, 179 insertions(+), 57 deletions(-)


* [PATCH 1/6] KVM: PPC: Book3S HV: XIVE: initialize private pointer when VPs are allocated
@ 2019-09-23 15:43   ` Greg Kurz
From: Greg Kurz @ 2019-09-23 15:43 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Michael Ellerman, Benjamin Herrenschmidt, Cédric Le Goater,
	David Gibson, Paolo Bonzini, Radim Krčmář,
	kvm-ppc, kvm, linuxppc-dev

From: Cédric Le Goater <clg@kaod.org>

Do not assign the device private pointer before making sure the XIVE
VPs are allocated in OPAL, and check the pointer for validity when
releasing the device.

Fixes: 5422e95103cf ("KVM: PPC: Book3S HV: XIVE: Replace the 'destroy' method by a 'release' method")
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Greg Kurz <groug@kaod.org>
---
 arch/powerpc/kvm/book3s_xive.c        |   13 +++++++++++--
 arch/powerpc/kvm/book3s_xive_native.c |   13 +++++++++++--
 2 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index 591bfb4bfd0f..cd2006bfcd3e 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -1897,12 +1897,21 @@ void kvmppc_xive_free_sources(struct kvmppc_xive_src_block *sb)
 static void kvmppc_xive_release(struct kvm_device *dev)
 {
 	struct kvmppc_xive *xive = dev->private;
-	struct kvm *kvm = xive->kvm;
+	struct kvm *kvm;
 	struct kvm_vcpu *vcpu;
 	int i;
 
 	pr_devel("Releasing xive device\n");
 
+	/*
+	 * In case of failure (OPAL) at device creation, the device
+	 * can be partially initialized.
+	 */
+	if (!xive)
+		return;
+
+	kvm = xive->kvm;
+
 	/*
 	 * Since this is the device release function, we know that
 	 * userspace does not have any open fd referring to the
@@ -2001,7 +2010,6 @@ static int kvmppc_xive_create(struct kvm_device *dev, u32 type)
 	if (!xive)
 		return -ENOMEM;
 
-	dev->private = xive;
 	xive->dev = dev;
 	xive->kvm = kvm;
 	mutex_init(&xive->lock);
@@ -2031,6 +2039,7 @@ static int kvmppc_xive_create(struct kvm_device *dev, u32 type)
 	if (ret)
 		return ret;
 
+	dev->private = xive;
 	return 0;
 }
 
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 248c1ea9e788..e9cbb42de424 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -987,12 +987,21 @@ static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
 static void kvmppc_xive_native_release(struct kvm_device *dev)
 {
 	struct kvmppc_xive *xive = dev->private;
-	struct kvm *kvm = xive->kvm;
+	struct kvm *kvm;
 	struct kvm_vcpu *vcpu;
 	int i;
 
 	pr_devel("Releasing xive native device\n");
 
+	/*
+	 * In case of failure (OPAL) at device creation, the device
+	 * can be partially initialized.
+	 */
+	if (!xive)
+		return;
+
+	kvm = xive->kvm;
+
 	/*
 	 * Clear the KVM device file address_space which is used to
 	 * unmap the ESB pages when a device is passed-through.
@@ -1076,7 +1085,6 @@ static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type)
 	if (!xive)
 		return -ENOMEM;
 
-	dev->private = xive;
 	xive->dev = dev;
 	xive->kvm = kvm;
 	kvm->arch.xive = xive;
@@ -1100,6 +1108,7 @@ static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type)
 	if (ret)
 		return ret;
 
+	dev->private = xive;
 	return 0;
 }
 


* [PATCH 2/6] KVM: PPC: Book3S HV: XIVE: Set kvm->arch.xive when VPs are allocated
@ 2019-09-23 15:43   ` Greg Kurz
From: Greg Kurz @ 2019-09-23 15:43 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Michael Ellerman, Benjamin Herrenschmidt, Cédric Le Goater,
	David Gibson, Paolo Bonzini, Radim Krčmář,
	kvm-ppc, kvm, linuxppc-dev

If we cannot allocate the XIVE VPs in OPAL, the creation of a XIVE or
XICS-on-XIVE device is aborted as expected, but we leave kvm->arch.xive
set forever since the release method isn't called in this case. Any
subsequent attempt to create a XIVE or XICS-on-XIVE device for this VM
will thus always fail. This is a problem for QEMU since it destroys and
re-creates these devices when the VM is reset: the VM would be
restricted to using the emulated XIVE or XICS forever.

As an alternative to adding rollback, do not assign kvm->arch.xive before
making sure the XIVE VPs are allocated in OPAL.

Fixes: 5422e95103cf ("KVM: PPC: Book3S HV: XIVE: Replace the 'destroy' method by a 'release' method")
Signed-off-by: Greg Kurz <groug@kaod.org>
---
 arch/powerpc/kvm/book3s_xive.c        |   11 +++++------
 arch/powerpc/kvm/book3s_xive_native.c |    2 +-
 2 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index cd2006bfcd3e..2ef43d037a4f 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -2006,6 +2006,10 @@ static int kvmppc_xive_create(struct kvm_device *dev, u32 type)
 
 	pr_devel("Creating xive for partition\n");
 
+	/* Already there ? */
+	if (kvm->arch.xive)
+		return -EEXIST;
+
 	xive = kvmppc_xive_get_device(kvm, type);
 	if (!xive)
 		return -ENOMEM;
@@ -2014,12 +2018,6 @@ static int kvmppc_xive_create(struct kvm_device *dev, u32 type)
 	xive->kvm = kvm;
 	mutex_init(&xive->lock);
 
-	/* Already there ? */
-	if (kvm->arch.xive)
-		ret = -EEXIST;
-	else
-		kvm->arch.xive = xive;
-
 	/* We use the default queue size set by the host */
 	xive->q_order = xive_native_default_eq_shift();
 	if (xive->q_order < PAGE_SHIFT)
@@ -2040,6 +2038,7 @@ static int kvmppc_xive_create(struct kvm_device *dev, u32 type)
 		return ret;
 
 	dev->private = xive;
+	kvm->arch.xive = xive;
 	return 0;
 }
 
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index e9cbb42de424..84a354b90f60 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -1087,7 +1087,6 @@ static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type)
 
 	xive->dev = dev;
 	xive->kvm = kvm;
-	kvm->arch.xive = xive;
 	mutex_init(&xive->mapping_lock);
 	mutex_init(&xive->lock);
 
@@ -1109,6 +1108,7 @@ static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type)
 		return ret;
 
 	dev->private = xive;
+	kvm->arch.xive = xive;
 	return 0;
 }
 


* [PATCH 3/6] KVM: PPC: Book3S HV: XIVE: Ensure VP isn't already in use
@ 2019-09-23 15:43   ` Greg Kurz
From: Greg Kurz @ 2019-09-23 15:43 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Michael Ellerman, Benjamin Herrenschmidt, Cédric Le Goater,
	David Gibson, Paolo Bonzini, Radim Krčmář,
	kvm-ppc, kvm, linuxppc-dev

We currently prevent userspace from connecting a new vCPU if we already
have one with the same vCPU id. This is good but unfortunately not enough,
because VP ids derive from the packed vCPU ids, and kvmppc_pack_vcpu_id()
can return colliding values. For example, 348 stays unchanged since it
is < KVM_MAX_VCPUS, but it is also the packed value of 2392 when the
guest's core stride is 8. Nothing currently prevents userspace from
connecting vCPUs with forged ids that end up being associated with the
same VP. This confuses the irq layer and likely crashes the kernel:

[96631.670454] genirq: Flags mismatch irq 4161. 00010000 (kvm-1-2392) vs. 00010000 (kvm-1-348)

Check the VP id instead of the vCPU id when a new vCPU is connected.
The allocation of the XIVE CPU structure in kvmppc_xive_connect_vcpu()
is moved after the check to avoid the need for rollback.

Signed-off-by: Greg Kurz <groug@kaod.org>
---
 arch/powerpc/kvm/book3s_xive.c        |   24 ++++++++++++++++--------
 arch/powerpc/kvm/book3s_xive.h        |   12 ++++++++++++
 arch/powerpc/kvm/book3s_xive_native.c |    6 ++++--
 3 files changed, 32 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index 2ef43d037a4f..01bff7befc9f 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -1217,6 +1217,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
 	struct kvmppc_xive *xive = dev->private;
 	struct kvmppc_xive_vcpu *xc;
 	int i, r = -EBUSY;
+	u32 vp_id;
 
 	pr_devel("connect_vcpu(cpu=%d)\n", cpu);
 
@@ -1228,25 +1229,32 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
 		return -EPERM;
 	if (vcpu->arch.irq_type != KVMPPC_IRQ_DEFAULT)
 		return -EBUSY;
-	if (kvmppc_xive_find_server(vcpu->kvm, cpu)) {
-		pr_devel("Duplicate !\n");
-		return -EEXIST;
-	}
 	if (cpu >= (KVM_MAX_VCPUS * vcpu->kvm->arch.emul_smt_mode)) {
 		pr_devel("Out of bounds !\n");
 		return -EINVAL;
 	}
-	xc = kzalloc(sizeof(*xc), GFP_KERNEL);
-	if (!xc)
-		return -ENOMEM;
 
 	/* We need to synchronize with queue provisioning */
 	mutex_lock(&xive->lock);
+
+	vp_id = kvmppc_xive_vp(xive, cpu);
+	if (kvmppc_xive_vp_in_use(xive->kvm, vp_id)) {
+		pr_devel("Duplicate !\n");
+		r = -EEXIST;
+		goto bail;
+	}
+
+	xc = kzalloc(sizeof(*xc), GFP_KERNEL);
+	if (!xc) {
+		r = -ENOMEM;
+		goto bail;
+	}
+
 	vcpu->arch.xive_vcpu = xc;
 	xc->xive = xive;
 	xc->vcpu = vcpu;
 	xc->server_num = cpu;
-	xc->vp_id = kvmppc_xive_vp(xive, cpu);
+	xc->vp_id = vp_id;
 	xc->mfrr = 0xff;
 	xc->valid = true;
 
diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
index 955b820ffd6d..fe3ed50e0818 100644
--- a/arch/powerpc/kvm/book3s_xive.h
+++ b/arch/powerpc/kvm/book3s_xive.h
@@ -220,6 +220,18 @@ static inline u32 kvmppc_xive_vp(struct kvmppc_xive *xive, u32 server)
 	return xive->vp_base + kvmppc_pack_vcpu_id(xive->kvm, server);
 }
 
+static inline bool kvmppc_xive_vp_in_use(struct kvm *kvm, u32 vp_id)
+{
+	struct kvm_vcpu *vcpu = NULL;
+	int i;
+
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		if (vcpu->arch.xive_vcpu && vp_id == vcpu->arch.xive_vcpu->vp_id)
+			return true;
+	}
+	return false;
+}
+
 /*
  * Mapping between guest priorities and host priorities
  * is as follow.
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 84a354b90f60..53a22771908c 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -106,6 +106,7 @@ int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
 	struct kvmppc_xive *xive = dev->private;
 	struct kvmppc_xive_vcpu *xc = NULL;
 	int rc;
+	u32 vp_id;
 
 	pr_devel("native_connect_vcpu(server=%d)\n", server_num);
 
@@ -124,7 +125,8 @@ int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
 
 	mutex_lock(&xive->lock);
 
-	if (kvmppc_xive_find_server(vcpu->kvm, server_num)) {
+	vp_id = kvmppc_xive_vp(xive, server_num);
+	if (kvmppc_xive_vp_in_use(xive->kvm, vp_id)) {
 		pr_devel("Duplicate !\n");
 		rc = -EEXIST;
 		goto bail;
@@ -141,7 +143,7 @@ int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
 	xc->vcpu = vcpu;
 	xc->server_num = server_num;
 
-	xc->vp_id = kvmppc_xive_vp(xive, server_num);
+	xc->vp_id = vp_id;
 	xc->valid = true;
 	vcpu->arch.irq_type = KVMPPC_IRQ_XIVE;
 


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 3/6] KVM: PPC: Book3S HV: XIVE: Ensure VP isn't already in use
@ 2019-09-23 15:43   ` Greg Kurz
  0 siblings, 0 replies; 45+ messages in thread
From: Greg Kurz @ 2019-09-23 15:43 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: kvm, Radim Krčmář,
	kvm-ppc, Cédric Le Goater, Paolo Bonzini, linuxppc-dev,
	David Gibson

We currently prevent userspace from connecting a new vCPU if we already
have one with the same vCPU id. This is good but unfortunately not
enough, because VP ids derive from the packed vCPU ids, and
kvmppc_pack_vcpu_id() can return colliding values. For example, 348
stays unchanged since it is < KVM_MAX_VCPUS, but it is also the packed
value of 2392 when the guest's core stride is 8. Nothing currently
prevents userspace from connecting vCPUs with forged ids that end up
being associated with the same VP. This confuses the irq layer and
likely crashes the kernel:

[96631.670454] genirq: Flags mismatch irq 4161. 00010000 (kvm-1-2392) vs. 00010000 (kvm-1-348)

Check the VP id instead of the vCPU id when a new vCPU is connected.
The allocation of the XIVE CPU structure in kvmppc_xive_connect_vcpu()
is moved after the check to avoid the need for rollback.

Signed-off-by: Greg Kurz <groug@kaod.org>
---
 arch/powerpc/kvm/book3s_xive.c        |   24 ++++++++++++++++--------
 arch/powerpc/kvm/book3s_xive.h        |   12 ++++++++++++
 arch/powerpc/kvm/book3s_xive_native.c |    6 ++++--
 3 files changed, 32 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index 2ef43d037a4f..01bff7befc9f 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -1217,6 +1217,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
 	struct kvmppc_xive *xive = dev->private;
 	struct kvmppc_xive_vcpu *xc;
 	int i, r = -EBUSY;
+	u32 vp_id;
 
 	pr_devel("connect_vcpu(cpu=%d)\n", cpu);
 
@@ -1228,25 +1229,32 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
 		return -EPERM;
 	if (vcpu->arch.irq_type != KVMPPC_IRQ_DEFAULT)
 		return -EBUSY;
-	if (kvmppc_xive_find_server(vcpu->kvm, cpu)) {
-		pr_devel("Duplicate !\n");
-		return -EEXIST;
-	}
 	if (cpu >= (KVM_MAX_VCPUS * vcpu->kvm->arch.emul_smt_mode)) {
 		pr_devel("Out of bounds !\n");
 		return -EINVAL;
 	}
-	xc = kzalloc(sizeof(*xc), GFP_KERNEL);
-	if (!xc)
-		return -ENOMEM;
 
 	/* We need to synchronize with queue provisioning */
 	mutex_lock(&xive->lock);
+
+	vp_id = kvmppc_xive_vp(xive, cpu);
+	if (kvmppc_xive_vp_in_use(xive->kvm, vp_id)) {
+		pr_devel("Duplicate !\n");
+		r = -EEXIST;
+		goto bail;
+	}
+
+	xc = kzalloc(sizeof(*xc), GFP_KERNEL);
+	if (!xc) {
+		r = -ENOMEM;
+		goto bail;
+	}
+
 	vcpu->arch.xive_vcpu = xc;
 	xc->xive = xive;
 	xc->vcpu = vcpu;
 	xc->server_num = cpu;
-	xc->vp_id = kvmppc_xive_vp(xive, cpu);
+	xc->vp_id = vp_id;
 	xc->mfrr = 0xff;
 	xc->valid = true;
 
diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
index 955b820ffd6d..fe3ed50e0818 100644
--- a/arch/powerpc/kvm/book3s_xive.h
+++ b/arch/powerpc/kvm/book3s_xive.h
@@ -220,6 +220,18 @@ static inline u32 kvmppc_xive_vp(struct kvmppc_xive *xive, u32 server)
 	return xive->vp_base + kvmppc_pack_vcpu_id(xive->kvm, server);
 }
 
+static inline bool kvmppc_xive_vp_in_use(struct kvm *kvm, u32 vp_id)
+{
+	struct kvm_vcpu *vcpu = NULL;
+	int i;
+
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		if (vcpu->arch.xive_vcpu && vp_id == vcpu->arch.xive_vcpu->vp_id)
+			return true;
+	}
+	return false;
+}
+
 /*
  * Mapping between guest priorities and host priorities
  * is as follow.
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 84a354b90f60..53a22771908c 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -106,6 +106,7 @@ int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
 	struct kvmppc_xive *xive = dev->private;
 	struct kvmppc_xive_vcpu *xc = NULL;
 	int rc;
+	u32 vp_id;
 
 	pr_devel("native_connect_vcpu(server=%d)\n", server_num);
 
@@ -124,7 +125,8 @@ int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
 
 	mutex_lock(&xive->lock);
 
-	if (kvmppc_xive_find_server(vcpu->kvm, server_num)) {
+	vp_id = kvmppc_xive_vp(xive, server_num);
+	if (kvmppc_xive_vp_in_use(xive->kvm, vp_id)) {
 		pr_devel("Duplicate !\n");
 		rc = -EEXIST;
 		goto bail;
@@ -141,7 +143,7 @@ int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
 	xc->vcpu = vcpu;
 	xc->server_num = server_num;
 
-	xc->vp_id = kvmppc_xive_vp(xive, server_num);
+	xc->vp_id = vp_id;
 	xc->valid = true;
 	vcpu->arch.irq_type = KVMPPC_IRQ_XIVE;
 



* [PATCH 4/6] KVM: PPC: Book3S HV: XIVE: Compute the VP id in a common helper
@ 2019-09-23 15:43   ` Greg Kurz
  -1 siblings, 0 replies; 45+ messages in thread
From: Greg Kurz @ 2019-09-23 15:43 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Michael Ellerman, Benjamin Herrenschmidt, Cédric Le Goater,
	David Gibson, Paolo Bonzini, Radim Krčmář,
	kvm-ppc, kvm, linuxppc-dev

Reduce code duplication by consolidating the checking of vCPU ids and
VP ids into a common helper used by both the legacy and native XIVE
KVM devices. Also explain the magic with a comment.

Signed-off-by: Greg Kurz <groug@kaod.org>
---
 arch/powerpc/kvm/book3s_xive.c        |   42 ++++++++++++++++++++++++++-------
 arch/powerpc/kvm/book3s_xive.h        |    1 +
 arch/powerpc/kvm/book3s_xive_native.c |   11 ++-------
 3 files changed, 36 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index 01bff7befc9f..9ac6315fb9ae 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -1211,6 +1211,37 @@ void kvmppc_xive_cleanup_vcpu(struct kvm_vcpu *vcpu)
 	vcpu->arch.xive_vcpu = NULL;
 }
 
+static bool kvmppc_xive_vcpu_id_valid(struct kvmppc_xive *xive, u32 cpu)
+{
+	/* We have a block of KVM_MAX_VCPUS VPs. We just need to check
+	 * raw vCPU ids are below the expected limit for this guest's
+	 * core stride ; kvmppc_pack_vcpu_id() will pack them down to an
+	 * index that can be safely used to compute a VP id that belongs
+	 * to the VP block.
+	 */
+	return cpu < KVM_MAX_VCPUS * xive->kvm->arch.emul_smt_mode;
+}
+
+int kvmppc_xive_compute_vp_id(struct kvmppc_xive *xive, u32 cpu, u32 *vp)
+{
+	u32 vp_id;
+
+	if (!kvmppc_xive_vcpu_id_valid(xive, cpu)) {
+		pr_devel("Out of bounds !\n");
+		return -EINVAL;
+	}
+
+	vp_id = kvmppc_xive_vp(xive, cpu);
+	if (kvmppc_xive_vp_in_use(xive->kvm, vp_id)) {
+		pr_devel("Duplicate !\n");
+		return -EEXIST;
+	}
+
+	*vp = vp_id;
+
+	return 0;
+}
+
 int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
 			     struct kvm_vcpu *vcpu, u32 cpu)
 {
@@ -1229,20 +1260,13 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
 		return -EPERM;
 	if (vcpu->arch.irq_type != KVMPPC_IRQ_DEFAULT)
 		return -EBUSY;
-	if (cpu >= (KVM_MAX_VCPUS * vcpu->kvm->arch.emul_smt_mode)) {
-		pr_devel("Out of bounds !\n");
-		return -EINVAL;
-	}
 
 	/* We need to synchronize with queue provisioning */
 	mutex_lock(&xive->lock);
 
-	vp_id = kvmppc_xive_vp(xive, cpu);
-	if (kvmppc_xive_vp_in_use(xive->kvm, vp_id)) {
-		pr_devel("Duplicate !\n");
-		r = -EEXIST;
+	r = kvmppc_xive_compute_vp_id(xive, cpu, &vp_id);
+	if (r)
 		goto bail;
-	}
 
 	xc = kzalloc(sizeof(*xc), GFP_KERNEL);
 	if (!xc) {
diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
index fe3ed50e0818..90cf6ec35a68 100644
--- a/arch/powerpc/kvm/book3s_xive.h
+++ b/arch/powerpc/kvm/book3s_xive.h
@@ -296,6 +296,7 @@ int kvmppc_xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio,
 struct kvmppc_xive *kvmppc_xive_get_device(struct kvm *kvm, u32 type);
 void xive_cleanup_single_escalation(struct kvm_vcpu *vcpu,
 				    struct kvmppc_xive_vcpu *xc, int irq);
+int kvmppc_xive_compute_vp_id(struct kvmppc_xive *xive, u32 cpu, u32 *vp);
 
 #endif /* CONFIG_KVM_XICS */
 #endif /* _KVM_PPC_BOOK3S_XICS_H */
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 53a22771908c..6902319c5ee9 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -118,19 +118,12 @@ int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
 		return -EPERM;
 	if (vcpu->arch.irq_type != KVMPPC_IRQ_DEFAULT)
 		return -EBUSY;
-	if (server_num >= (KVM_MAX_VCPUS * vcpu->kvm->arch.emul_smt_mode)) {
-		pr_devel("Out of bounds !\n");
-		return -EINVAL;
-	}
 
 	mutex_lock(&xive->lock);
 
-	vp_id = kvmppc_xive_vp(xive, server_num);
-	if (kvmppc_xive_vp_in_use(xive->kvm, vp_id)) {
-		pr_devel("Duplicate !\n");
-		rc = -EEXIST;
+	rc = kvmppc_xive_compute_vp_id(xive, server_num, &vp_id);
+	if (rc)
 		goto bail;
-	}
 
 	xc = kzalloc(sizeof(*xc), GFP_KERNEL);
 	if (!xc) {



* [PATCH 5/6] KVM: PPC: Book3S HV: XIVE: Make VP block size configurable
@ 2019-09-23 15:44   ` Greg Kurz
  -1 siblings, 0 replies; 45+ messages in thread
From: Greg Kurz @ 2019-09-23 15:44 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Michael Ellerman, Benjamin Herrenschmidt, Cédric Le Goater,
	David Gibson, Paolo Bonzini, Radim Krčmář,
	kvm-ppc, kvm, linuxppc-dev

The XIVE VP is an internal structure which allows the XIVE interrupt
controller to maintain the interrupt context state of vCPUs that are
not dispatched on HW threads.

When a guest is started, the XIVE KVM device allocates a block of
XIVE VPs in OPAL, enough to accommodate the highest possible vCPU
id KVM_MAX_VCPU_ID (16384) packed down to KVM_MAX_VCPUS (2048).
With a guest's core stride of 8 and a threading mode of 1 (QEMU's
default), a VM must run at least 256 vCPUs to actually need such a
range of VPs.

A POWER9 system has a limited XIVE VP space: 512k entries. KVM is
currently wasting this HW resource with large VP allocations,
especially since a typical VM likely runs with far fewer vCPUs.

Make the size of the VP block configurable. Add an nr_servers
field to the XIVE structure and a function to set it for this
purpose.

Split VP allocation out of the device create function. Since the
VP block isn't used before the first vCPU connects to the XIVE KVM
device, allocation is now performed by kvmppc_xive_connect_vcpu().
This gives the opportunity to set nr_servers in between:

          kvmppc_xive_create() / kvmppc_xive_native_create()
                               .
                               .
                     kvmppc_xive_set_nr_servers()
                               .
                               .
    kvmppc_xive_connect_vcpu() / kvmppc_xive_native_connect_vcpu()

The connect_vcpu() functions check that the vCPU id is below nr_servers
and, if it is the first vCPU, allocate the VP block. The xive->lock
mutex protects this against a concurrent update of nr_servers by
kvmppc_xive_set_nr_servers().

Also, the block is allocated once for the device lifetime: nr_servers
should stay constant, otherwise connect_vcpu() could generate a bogus
VP id and likely crash OPAL. It is thus forbidden to update nr_servers
once the block is allocated.

If the VP allocation fails, return ENOSPC, which seems more appropriate
for reporting the depletion of a system-wide HW resource than ENOMEM or
ENXIO.

A VM using a stride of 8 and 1 thread per core with 32 vCPUs would hence
only need 256 VPs instead of 2048. If the stride is set to match the number
of threads per core, this goes further down to 32.

This will be exposed to userspace by a subsequent patch.

Signed-off-by: Greg Kurz <groug@kaod.org>
---
 arch/powerpc/kvm/book3s_xive.c        |   59 ++++++++++++++++++++++++++-------
 arch/powerpc/kvm/book3s_xive.h        |    4 ++
 arch/powerpc/kvm/book3s_xive_native.c |   18 +++-------
 3 files changed, 56 insertions(+), 25 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index 9ac6315fb9ae..4a333dcfddd8 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -1213,13 +1213,13 @@ void kvmppc_xive_cleanup_vcpu(struct kvm_vcpu *vcpu)
 
 static bool kvmppc_xive_vcpu_id_valid(struct kvmppc_xive *xive, u32 cpu)
 {
-	/* We have a block of KVM_MAX_VCPUS VPs. We just need to check
+	/* We have a block of xive->nr_servers VPs. We just need to check
 	 * raw vCPU ids are below the expected limit for this guest's
 	 * core stride ; kvmppc_pack_vcpu_id() will pack them down to an
 	 * index that can be safely used to compute a VP id that belongs
 	 * to the VP block.
 	 */
-	return cpu < KVM_MAX_VCPUS * xive->kvm->arch.emul_smt_mode;
+	return cpu < xive->nr_servers * xive->kvm->arch.emul_smt_mode;
 }
 
 int kvmppc_xive_compute_vp_id(struct kvmppc_xive *xive, u32 cpu, u32 *vp)
@@ -1231,6 +1231,14 @@ int kvmppc_xive_compute_vp_id(struct kvmppc_xive *xive, u32 cpu, u32 *vp)
 		return -EINVAL;
 	}
 
+	if (xive->vp_base == XIVE_INVALID_VP) {
+		xive->vp_base = xive_native_alloc_vp_block(xive->nr_servers);
+		pr_devel("VP_Base=%x nr_servers=%d\n", xive->vp_base, xive->nr_servers);
+
+		if (xive->vp_base == XIVE_INVALID_VP)
+			return -ENOSPC;
+	}
+
 	vp_id = kvmppc_xive_vp(xive, cpu);
 	if (kvmppc_xive_vp_in_use(xive->kvm, vp_id)) {
 		pr_devel("Duplicate !\n");
@@ -1858,6 +1866,37 @@ int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level,
 	return 0;
 }
 
+int kvmppc_xive_set_nr_servers(struct kvmppc_xive *xive, u64 addr)
+{
+	u32 __user *ubufp = (u32 __user *) addr;
+	u32 nr_servers;
+	int rc = 0;
+
+	if (get_user(nr_servers, ubufp))
+		return -EFAULT;
+
+	pr_devel("%s nr_servers=%u\n", __func__, nr_servers);
+
+	if (nr_servers > KVM_MAX_VCPUS)
+		return -EINVAL;
+
+	mutex_lock(&xive->lock);
+	/* The VP block is allocated once and freed when the device is
+	 * released. Better not allow its size to be changed since it is
+	 * used by connect_vcpu to validate vCPU ids (eg, setting it
+	 * back to a higher value could allow connect_vcpu to come up
+	 * with a VP id that goes beyond the VP block, which is likely
+	 * to cause a crash in OPAL).
+	 */
+	if (xive->vp_base != XIVE_INVALID_VP)
+		rc = -EBUSY;
+	else
+		xive->nr_servers = nr_servers;
+	mutex_unlock(&xive->lock);
+
+	return rc;
+}
+
 static int xive_set_attr(struct kvm_device *dev, struct kvm_device_attr *attr)
 {
 	struct kvmppc_xive *xive = dev->private;
@@ -2034,7 +2073,6 @@ static int kvmppc_xive_create(struct kvm_device *dev, u32 type)
 {
 	struct kvmppc_xive *xive;
 	struct kvm *kvm = dev->kvm;
-	int ret = 0;
 
 	pr_devel("Creating xive for partition\n");
 
@@ -2057,18 +2095,15 @@ static int kvmppc_xive_create(struct kvm_device *dev, u32 type)
 	else
 		xive->q_page_order = xive->q_order - PAGE_SHIFT;
 
-	/* Allocate a bunch of VPs */
-	xive->vp_base = xive_native_alloc_vp_block(KVM_MAX_VCPUS);
-	pr_devel("VP_Base=%x\n", xive->vp_base);
-
-	if (xive->vp_base == XIVE_INVALID_VP)
-		ret = -ENOMEM;
+	/* VP allocation is delayed to the first call to connect_vcpu */
+	xive->vp_base = XIVE_INVALID_VP;
+	/* KVM_MAX_VCPUS limits the number of VMs to roughly 64 per socket
+	 * on a POWER9 system.
+	 */
+	xive->nr_servers = KVM_MAX_VCPUS;
 
 	xive->single_escalation = xive_native_has_single_escalation();
 
-	if (ret)
-		return ret;
-
 	dev->private = xive;
 	kvm->arch.xive = xive;
 	return 0;
diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
index 90cf6ec35a68..382e3a56e789 100644
--- a/arch/powerpc/kvm/book3s_xive.h
+++ b/arch/powerpc/kvm/book3s_xive.h
@@ -135,6 +135,9 @@ struct kvmppc_xive {
 	/* Flags */
 	u8	single_escalation;
 
+	/* Number of entries in the VP block */
+	u32	nr_servers;
+
 	struct kvmppc_xive_ops *ops;
 	struct address_space   *mapping;
 	struct mutex mapping_lock;
@@ -297,6 +300,7 @@ struct kvmppc_xive *kvmppc_xive_get_device(struct kvm *kvm, u32 type);
 void xive_cleanup_single_escalation(struct kvm_vcpu *vcpu,
 				    struct kvmppc_xive_vcpu *xc, int irq);
 int kvmppc_xive_compute_vp_id(struct kvmppc_xive *xive, u32 cpu, u32 *vp);
+int kvmppc_xive_set_nr_servers(struct kvmppc_xive *xive, u64 addr);
 
 #endif /* CONFIG_KVM_XICS */
 #endif /* _KVM_PPC_BOOK3S_XICS_H */
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 6902319c5ee9..5e18364d52a9 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -1069,7 +1069,6 @@ static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type)
 {
 	struct kvmppc_xive *xive;
 	struct kvm *kvm = dev->kvm;
-	int ret = 0;
 
 	pr_devel("Creating xive native device\n");
 
@@ -1085,23 +1084,16 @@ static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type)
 	mutex_init(&xive->mapping_lock);
 	mutex_init(&xive->lock);
 
-	/*
-	 * Allocate a bunch of VPs. KVM_MAX_VCPUS is a large value for
-	 * a default. Getting the max number of CPUs the VM was
-	 * configured with would improve our usage of the XIVE VP space.
+	/* VP allocation is delayed to the first call to connect_vcpu */
+	xive->vp_base = XIVE_INVALID_VP;
+	/* KVM_MAX_VCPUS limits the number of VMs to roughly 64 per socket
+	 * on a POWER9 system.
+	 */
-	xive->vp_base = xive_native_alloc_vp_block(KVM_MAX_VCPUS);
-	pr_devel("VP_Base=%x\n", xive->vp_base);
-
-	if (xive->vp_base == XIVE_INVALID_VP)
-		ret = -ENXIO;
+	xive->nr_servers = KVM_MAX_VCPUS;
 
 	xive->single_escalation = xive_native_has_single_escalation();
 	xive->ops = &kvmppc_xive_native_ops;
 
-	if (ret)
-		return ret;
-
 	dev->private = xive;
 	kvm->arch.xive = xive;
 	return 0;


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 5/6] KVM: PPC: Book3S HV: XIVE: Make VP block size configurable
@ 2019-09-23 15:44   ` Greg Kurz
  0 siblings, 0 replies; 45+ messages in thread
From: Greg Kurz @ 2019-09-23 15:44 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: kvm, Radim Krčmář,
	kvm-ppc, Cédric Le Goater, Paolo Bonzini, linuxppc-dev,
	David Gibson

The XIVE VP is an internal structure which allow the XIVE interrupt
controller to maintain the interrupt context state of vCPUs non
dispatched on HW threads.

When a guest is started, the XIVE KVM device allocates a block of
XIVE VPs in OPAL, enough to accommodate the highest possible vCPU
id KVM_MAX_VCPU_ID (16384) packed down to KVM_MAX_VCPUS (2048).
With a guest's core stride of 8 and a threading mode of 1 (QEMU's
default), a VM must run at least 256 vCPUs to actually need such a
range of VPs.

A POWER9 system has a limited XIVE VP space : 512k and KVM is
currently wasting this HW resource with large VP allocations,
especially since a typical VM likely runs with a lot less vCPUs.

Make the size of the VP block configurable. Add an nr_servers
field to the XIVE structure and a function to set it for this
purpose.

Split VP allocation out of the device create function. Since the
VP block isn't used before the first vCPU connects to the XIVE KVM
device, allocation is now performed by kvmppc_xive_connect_vcpu().
This gives the opportunity to set nr_servers in between:

          kvmppc_xive_create() / kvmppc_xive_native_create()
                               .
                               .
                     kvmppc_xive_set_nr_servers()
                               .
                               .
    kvmppc_xive_connect_vcpu() / kvmppc_xive_native_connect_vcpu()

The connect_vcpu() functions check that the vCPU id is below nr_servers
and if it is the first vCPU they allocate the VP block. This is protected
against a concurrent update of nr_servers by kvmppc_xive_set_nr_servers()
with the xive->lock mutex.

Also, the block is allocated once for the device lifetime: nr_servers
should stay constant, otherwise connect_vcpu() could generate a bogus
VP id and likely crash OPAL. It is thus forbidden to update nr_servers
once the block is allocated.

If the VP allocation fails, return ENOSPC, which seems more appropriate
for reporting the depletion of a system-wide HW resource than ENOMEM or
ENXIO.

A VM using a stride of 8 and 1 thread per core with 32 vCPUs would hence
only need 256 VPs instead of 2048. If the stride is set to match the number
of threads per core, this goes further down to 32.

This will be exposed to userspace by a subsequent patch.

Signed-off-by: Greg Kurz <groug@kaod.org>
---
 arch/powerpc/kvm/book3s_xive.c        |   59 ++++++++++++++++++++++++++-------
 arch/powerpc/kvm/book3s_xive.h        |    4 ++
 arch/powerpc/kvm/book3s_xive_native.c |   18 +++-------
 3 files changed, 56 insertions(+), 25 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index 9ac6315fb9ae..4a333dcfddd8 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -1213,13 +1213,13 @@ void kvmppc_xive_cleanup_vcpu(struct kvm_vcpu *vcpu)
 
 static bool kvmppc_xive_vcpu_id_valid(struct kvmppc_xive *xive, u32 cpu)
 {
-	/* We have a block of KVM_MAX_VCPUS VPs. We just need to check
+	/* We have a block of xive->nr_servers VPs. We just need to check
 	 * raw vCPU ids are below the expected limit for this guest's
 	 * core stride ; kvmppc_pack_vcpu_id() will pack them down to an
 	 * index that can be safely used to compute a VP id that belongs
 	 * to the VP block.
 	 */
-	return cpu < KVM_MAX_VCPUS * xive->kvm->arch.emul_smt_mode;
+	return cpu < xive->nr_servers * xive->kvm->arch.emul_smt_mode;
 }
 
 int kvmppc_xive_compute_vp_id(struct kvmppc_xive *xive, u32 cpu, u32 *vp)
@@ -1231,6 +1231,14 @@ int kvmppc_xive_compute_vp_id(struct kvmppc_xive *xive, u32 cpu, u32 *vp)
 		return -EINVAL;
 	}
 
+	if (xive->vp_base == XIVE_INVALID_VP) {
+		xive->vp_base = xive_native_alloc_vp_block(xive->nr_servers);
+		pr_devel("VP_Base=%x nr_servers=%d\n", xive->vp_base, xive->nr_servers);
+
+		if (xive->vp_base == XIVE_INVALID_VP)
+			return -ENOSPC;
+	}
+
 	vp_id = kvmppc_xive_vp(xive, cpu);
 	if (kvmppc_xive_vp_in_use(xive->kvm, vp_id)) {
 		pr_devel("Duplicate !\n");
@@ -1858,6 +1866,37 @@ int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level,
 	return 0;
 }
 
+int kvmppc_xive_set_nr_servers(struct kvmppc_xive *xive, u64 addr)
+{
+	u32 __user *ubufp = (u32 __user *) addr;
+	u32 nr_servers;
+	int rc = 0;
+
+	if (get_user(nr_servers, ubufp))
+		return -EFAULT;
+
+	pr_devel("%s nr_servers=%u\n", __func__, nr_servers);
+
+	if (nr_servers > KVM_MAX_VCPUS)
+		return -EINVAL;
+
+	mutex_lock(&xive->lock);
+	/* The VP block is allocated once and freed when the device is
+	 * released. Better not allow changing its size since it's used
+	 * by connect_vcpu to check that vCPU ids are valid (eg, setting
+	 * it back to a higher value could allow connect_vcpu to come
+	 * up with a VP id that goes beyond the VP block, which is likely
+	 * to cause a crash in OPAL).
+	 */
+	if (xive->vp_base != XIVE_INVALID_VP)
+		rc = -EBUSY;
+	else
+		xive->nr_servers = nr_servers;
+	mutex_unlock(&xive->lock);
+
+	return rc;
+}
+
 static int xive_set_attr(struct kvm_device *dev, struct kvm_device_attr *attr)
 {
 	struct kvmppc_xive *xive = dev->private;
@@ -2034,7 +2073,6 @@ static int kvmppc_xive_create(struct kvm_device *dev, u32 type)
 {
 	struct kvmppc_xive *xive;
 	struct kvm *kvm = dev->kvm;
-	int ret = 0;
 
 	pr_devel("Creating xive for partition\n");
 
@@ -2057,18 +2095,15 @@ static int kvmppc_xive_create(struct kvm_device *dev, u32 type)
 	else
 		xive->q_page_order = xive->q_order - PAGE_SHIFT;
 
-	/* Allocate a bunch of VPs */
-	xive->vp_base = xive_native_alloc_vp_block(KVM_MAX_VCPUS);
-	pr_devel("VP_Base=%x\n", xive->vp_base);
-
-	if (xive->vp_base == XIVE_INVALID_VP)
-		ret = -ENOMEM;
+	/* VP allocation is delayed to the first call to connect_vcpu */
+	xive->vp_base = XIVE_INVALID_VP;
+	/* KVM_MAX_VCPUS limits the number of VMs to roughly 64 per socket
+	 * on a POWER9 system.
+	 */
+	xive->nr_servers = KVM_MAX_VCPUS;
 
 	xive->single_escalation = xive_native_has_single_escalation();
 
-	if (ret)
-		return ret;
-
 	dev->private = xive;
 	kvm->arch.xive = xive;
 	return 0;
diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
index 90cf6ec35a68..382e3a56e789 100644
--- a/arch/powerpc/kvm/book3s_xive.h
+++ b/arch/powerpc/kvm/book3s_xive.h
@@ -135,6 +135,9 @@ struct kvmppc_xive {
 	/* Flags */
 	u8	single_escalation;
 
+	/* Number of entries in the VP block */
+	u32	nr_servers;
+
 	struct kvmppc_xive_ops *ops;
 	struct address_space   *mapping;
 	struct mutex mapping_lock;
@@ -297,6 +300,7 @@ struct kvmppc_xive *kvmppc_xive_get_device(struct kvm *kvm, u32 type);
 void xive_cleanup_single_escalation(struct kvm_vcpu *vcpu,
 				    struct kvmppc_xive_vcpu *xc, int irq);
 int kvmppc_xive_compute_vp_id(struct kvmppc_xive *xive, u32 cpu, u32 *vp);
+int kvmppc_xive_set_nr_servers(struct kvmppc_xive *xive, u64 addr);
 
 #endif /* CONFIG_KVM_XICS */
 #endif /* _KVM_PPC_BOOK3S_XICS_H */
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 6902319c5ee9..5e18364d52a9 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -1069,7 +1069,6 @@ static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type)
 {
 	struct kvmppc_xive *xive;
 	struct kvm *kvm = dev->kvm;
-	int ret = 0;
 
 	pr_devel("Creating xive native device\n");
 
@@ -1085,23 +1084,16 @@ static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type)
 	mutex_init(&xive->mapping_lock);
 	mutex_init(&xive->lock);
 
-	/*
-	 * Allocate a bunch of VPs. KVM_MAX_VCPUS is a large value for
-	 * a default. Getting the max number of CPUs the VM was
-	 * configured with would improve our usage of the XIVE VP space.
+	/* VP allocation is delayed to the first call to connect_vcpu */
+	xive->vp_base = XIVE_INVALID_VP;
+	/* KVM_MAX_VCPUS limits the number of VMs to roughly 64 per socket
+	 * on a POWER9 system.
 	 */
-	xive->vp_base = xive_native_alloc_vp_block(KVM_MAX_VCPUS);
-	pr_devel("VP_Base=%x\n", xive->vp_base);
-
-	if (xive->vp_base == XIVE_INVALID_VP)
-		ret = -ENXIO;
+	xive->nr_servers = KVM_MAX_VCPUS;
 
 	xive->single_escalation = xive_native_has_single_escalation();
 	xive->ops = &kvmppc_xive_native_ops;
 
-	if (ret)
-		return ret;
-
 	dev->private = xive;
 	kvm->arch.xive = xive;
 	return 0;


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 6/6] KVM: PPC: Book3S HV: XIVE: Allow userspace to set the # of VPs
  2019-09-23 15:43 ` Greg Kurz
  (?)
@ 2019-09-23 15:44   ` Greg Kurz
  -1 siblings, 0 replies; 45+ messages in thread
From: Greg Kurz @ 2019-09-23 15:44 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Michael Ellerman, Benjamin Herrenschmidt, Cédric Le Goater,
	David Gibson, Paolo Bonzini, Radim Krčmář,
	kvm-ppc, kvm, linuxppc-dev

Add a new attribute to both legacy and native XIVE KVM devices so that
userspace can require fewer interrupt servers than the current default
(KVM_MAX_VCPUS, ie. 2048). This will allow allocating fewer VPs in OPAL,
and likely increase the number of VMs that can run with an in-kernel
XIVE implementation.

Since the legacy XIVE KVM device is exposed to userspace through the
XICS KVM API, a new attribute group is added to it for this purpose.
While here, fix the name of the existing KVM_DEV_XICS_GRP_SOURCES group
in the XICS documentation.

Signed-off-by: Greg Kurz <groug@kaod.org>
---
 Documentation/virt/kvm/devices/xics.txt |   14 ++++++++++++--
 Documentation/virt/kvm/devices/xive.txt |    8 ++++++++
 arch/powerpc/include/uapi/asm/kvm.h     |    3 +++
 arch/powerpc/kvm/book3s_xive.c          |   10 ++++++++++
 arch/powerpc/kvm/book3s_xive_native.c   |    3 +++
 5 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/devices/xics.txt b/Documentation/virt/kvm/devices/xics.txt
index 42864935ac5d..1cf9621f8341 100644
--- a/Documentation/virt/kvm/devices/xics.txt
+++ b/Documentation/virt/kvm/devices/xics.txt
@@ -3,9 +3,19 @@ XICS interrupt controller
 Device type supported: KVM_DEV_TYPE_XICS
 
 Groups:
-  KVM_DEV_XICS_SOURCES
+  1. KVM_DEV_XICS_GRP_SOURCES
   Attributes: One per interrupt source, indexed by the source number.
 
+  2. KVM_DEV_XICS_GRP_CTRL
+  Attributes:
+    2.1 KVM_DEV_XICS_NR_SERVERS (write only)
+  The kvm_device_attr.addr points to a __u32 value which is the number of
+  interrupt server numbers (ie, highest possible vcpu id plus one).
+  Errors:
+    -EINVAL: Value greater than KVM_MAX_VCPUS.
+    -EFAULT: Invalid user pointer for attr->addr.
+    -EBUSY:  A vcpu is already connected to the device.
+
 This device emulates the XICS (eXternal Interrupt Controller
 Specification) defined in PAPR.  The XICS has a set of interrupt
 sources, each identified by a 20-bit source number, and a set of
@@ -38,7 +48,7 @@ least-significant end of the word:
 
 Each source has 64 bits of state that can be read and written using
 the KVM_GET_DEVICE_ATTR and KVM_SET_DEVICE_ATTR ioctls, specifying the
-KVM_DEV_XICS_SOURCES attribute group, with the attribute number being
+KVM_DEV_XICS_GRP_SOURCES attribute group, with the attribute number being
 the interrupt source number.  The 64 bit state word has the following
 bitfields, starting from the least-significant end of the word:
 
diff --git a/Documentation/virt/kvm/devices/xive.txt b/Documentation/virt/kvm/devices/xive.txt
index 9a24a4525253..fd418b907d0e 100644
--- a/Documentation/virt/kvm/devices/xive.txt
+++ b/Documentation/virt/kvm/devices/xive.txt
@@ -78,6 +78,14 @@ the legacy interrupt mode, referred as XICS (POWER7/8).
     migrating the VM.
     Errors: none
 
+    1.3 KVM_DEV_XIVE_NR_SERVERS (write only)
+    The kvm_device_attr.addr points to a __u32 value which is the number of
+    interrupt server numbers (ie, highest possible vcpu id plus one).
+    Errors:
+      -EINVAL: Value greater than KVM_MAX_VCPUS.
+      -EFAULT: Invalid user pointer for attr->addr.
+      -EBUSY:  A vCPU is already connected to the device.
+
   2. KVM_DEV_XIVE_GRP_SOURCE (write only)
   Initializes a new source in the XIVE device and mask it.
   Attributes:
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index b0f72dea8b11..264e266a85bf 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -667,6 +667,8 @@ struct kvm_ppc_cpu_char {
 
 /* PPC64 eXternal Interrupt Controller Specification */
 #define KVM_DEV_XICS_GRP_SOURCES	1	/* 64-bit source attributes */
+#define KVM_DEV_XICS_GRP_CTRL		2
+#define   KVM_DEV_XICS_NR_SERVERS	1
 
 /* Layout of 64-bit source attribute values */
 #define  KVM_XICS_DESTINATION_SHIFT	0
@@ -683,6 +685,7 @@ struct kvm_ppc_cpu_char {
 #define KVM_DEV_XIVE_GRP_CTRL		1
 #define   KVM_DEV_XIVE_RESET		1
 #define   KVM_DEV_XIVE_EQ_SYNC		2
+#define   KVM_DEV_XIVE_NR_SERVERS	3
 #define KVM_DEV_XIVE_GRP_SOURCE		2	/* 64-bit source identifier */
 #define KVM_DEV_XIVE_GRP_SOURCE_CONFIG	3	/* 64-bit source identifier */
 #define KVM_DEV_XIVE_GRP_EQ_CONFIG	4	/* 64-bit EQ identifier */
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index 4a333dcfddd8..c1901583e6c0 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -1905,6 +1905,11 @@ static int xive_set_attr(struct kvm_device *dev, struct kvm_device_attr *attr)
 	switch (attr->group) {
 	case KVM_DEV_XICS_GRP_SOURCES:
 		return xive_set_source(xive, attr->attr, attr->addr);
+	case KVM_DEV_XICS_GRP_CTRL:
+		switch (attr->attr) {
+		case KVM_DEV_XICS_NR_SERVERS:
+			return kvmppc_xive_set_nr_servers(xive, attr->addr);
+		}
 	}
 	return -ENXIO;
 }
@@ -1930,6 +1935,11 @@ static int xive_has_attr(struct kvm_device *dev, struct kvm_device_attr *attr)
 		    attr->attr < KVMPPC_XICS_NR_IRQS)
 			return 0;
 		break;
+	case KVM_DEV_XICS_GRP_CTRL:
+		switch (attr->attr) {
+		case KVM_DEV_XICS_NR_SERVERS:
+			return 0;
+		}
 	}
 	return -ENXIO;
 }
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 5e18364d52a9..8e954c5d5efb 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -921,6 +921,8 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
 			return kvmppc_xive_reset(xive);
 		case KVM_DEV_XIVE_EQ_SYNC:
 			return kvmppc_xive_native_eq_sync(xive);
+		case KVM_DEV_XIVE_NR_SERVERS:
+			return kvmppc_xive_set_nr_servers(xive, attr->addr);
 		}
 		break;
 	case KVM_DEV_XIVE_GRP_SOURCE:
@@ -960,6 +962,7 @@ static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
 		switch (attr->attr) {
 		case KVM_DEV_XIVE_RESET:
 		case KVM_DEV_XIVE_EQ_SYNC:
+		case KVM_DEV_XIVE_NR_SERVERS:
 			return 0;
 		}
 		break;


^ permalink raw reply related	[flat|nested] 45+ messages in thread


^ permalink raw reply related	[flat|nested] 45+ messages in thread
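The error contract documented above for the KVM_DEV_XICS_NR_SERVERS /
KVM_DEV_XIVE_NR_SERVERS attributes can be modelled in a few lines of Python.
This is a sketch of the documented behavior only, not kernel code: the
dict-based device state and the `set_nr_servers` name are illustrative
assumptions.

```python
import errno

KVM_MAX_VCPUS = 2048

def set_nr_servers(device, nr_servers):
    # Sketch of the documented error contract; 'device' is an
    # illustrative dict, not a real KVM device handle.
    if nr_servers > KVM_MAX_VCPUS:
        return -errno.EINVAL          # value too large
    if device["vcpus_connected"]:
        return -errno.EBUSY           # a vCPU is already connected
    device["nr_servers"] = nr_servers # highest possible vcpu id plus one
    return 0
```

A write of 4096 fails with -EINVAL, while 256 succeeds on a device that has
no connected vCPUs yet.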

* Re: [PATCH 2/6] KVM: PPC: Book3S HV: XIVE: Set kvm->arch.xive when VPs are allocated
  2019-09-23 15:43   ` Greg Kurz
@ 2019-09-23 15:49     ` Cédric Le Goater
  -1 siblings, 0 replies; 45+ messages in thread
From: Cédric Le Goater @ 2019-09-23 15:49 UTC (permalink / raw)
  To: Greg Kurz, Paul Mackerras
  Cc: Michael Ellerman, Benjamin Herrenschmidt, David Gibson,
	Paolo Bonzini, Radim Krčmář,
	kvm-ppc, kvm, linuxppc-dev

On 23/09/2019 17:43, Greg Kurz wrote:
> If we cannot allocate the XIVE VPs in OPAL, the creation of a XIVE or
> XICS-on-XIVE device is aborted as expected, but we leave kvm->arch.xive
> set forever since the relase method isn't called in this case. Any

release

> subsequent attempt to create a XIVE or XICS-on-XIVE for this VM will
> thus always fail. This is a problem for QEMU since it destroys and
> re-creates these devices when the VM is reset: the VM would be
> restricted to using the emulated XIVE or XICS forever.
> 
> As an alternative to adding rollback, do not assign kvm->arch.xive before
> making sure the XIVE VPs are allocated in OPAL.
> 
> Fixes: 5422e95103cf ("KVM: PPC: Book3S HV: XIVE: Replace the 'destroy' method by a 'release' method")
> Signed-off-by: Greg Kurz <groug@kaod.org>


Reviewed-by: Cédric Le Goater <clg@kaod.org>

C.

> ---
>  arch/powerpc/kvm/book3s_xive.c        |   11 +++++------
>  arch/powerpc/kvm/book3s_xive_native.c |    2 +-
>  2 files changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
> index cd2006bfcd3e..2ef43d037a4f 100644
> --- a/arch/powerpc/kvm/book3s_xive.c
> +++ b/arch/powerpc/kvm/book3s_xive.c
> @@ -2006,6 +2006,10 @@ static int kvmppc_xive_create(struct kvm_device *dev, u32 type)
>  
>  	pr_devel("Creating xive for partition\n");
>  
> +	/* Already there ? */
> +	if (kvm->arch.xive)
> +		return -EEXIST;
> +
>  	xive = kvmppc_xive_get_device(kvm, type);
>  	if (!xive)
>  		return -ENOMEM;
> @@ -2014,12 +2018,6 @@ static int kvmppc_xive_create(struct kvm_device *dev, u32 type)
>  	xive->kvm = kvm;
>  	mutex_init(&xive->lock);
>  
> -	/* Already there ? */
> -	if (kvm->arch.xive)
> -		ret = -EEXIST;
> -	else
> -		kvm->arch.xive = xive;
> -
>  	/* We use the default queue size set by the host */
>  	xive->q_order = xive_native_default_eq_shift();
>  	if (xive->q_order < PAGE_SHIFT)
> @@ -2040,6 +2038,7 @@ static int kvmppc_xive_create(struct kvm_device *dev, u32 type)
>  		return ret;
>  
>  	dev->private = xive;
> +	kvm->arch.xive = xive;
>  	return 0;
>  }
>  
> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
> index e9cbb42de424..84a354b90f60 100644
> --- a/arch/powerpc/kvm/book3s_xive_native.c
> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> @@ -1087,7 +1087,6 @@ static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type)
>  
>  	xive->dev = dev;
>  	xive->kvm = kvm;
> -	kvm->arch.xive = xive;
>  	mutex_init(&xive->mapping_lock);
>  	mutex_init(&xive->lock);
>  
> @@ -1109,6 +1108,7 @@ static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type)
>  		return ret;
>  
>  	dev->private = xive;
> +	kvm->arch.xive = xive;
>  	return 0;
>  }
>  
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread
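The ordering fix in the patch above — publish kvm->arch.xive only once every
fallible step has succeeded — can be sketched in Python. The names, dict
state and return codes below are illustrative assumptions, not the kernel
API.

```python
import errno

def create_xive(kvm, alloc_vps):
    """Publish kvm['xive'] last, so a failed OPAL VP allocation
    leaves the VM free to retry device creation later."""
    if kvm.get("xive") is not None:
        return -errno.EEXIST
    vp_base = alloc_vps()      # may fail, e.g. OPAL is out of VPs
    if vp_base < 0:
        return vp_base         # kvm['xive'] stays unset: no rollback needed
    kvm["xive"] = {"vp_base": vp_base}
    return 0
```

With the pre-fix ordering, the equivalent of `kvm["xive"] = ...` happened
before `alloc_vps()`, so a failed allocation left the stale pointer behind
and every later creation attempt hit -EEXIST.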

* Re: [PATCH 2/6] KVM: PPC: Book3S HV: XIVE: Set kvm->arch.xive when VPs are allocated
@ 2019-09-23 15:49     ` Cédric Le Goater
  0 siblings, 0 replies; 45+ messages in thread
From: Cédric Le Goater @ 2019-09-23 15:49 UTC (permalink / raw)
  To: Greg Kurz, Paul Mackerras
  Cc: Michael Ellerman, Benjamin Herrenschmidt, David Gibson,
	Paolo Bonzini, Radim Krčmář,
	kvm-ppc, kvm, linuxppc-dev

On 23/09/2019 17:43, Greg Kurz wrote:
> If we cannot allocate the XIVE VPs in OPAL, the creation of a XIVE or
> XICS-on-XIVE device is aborted as expected, but we leave kvm->arch.xive
> set forever since the relase method isn't called in this case. Any

release

> subsequent tentative to create a XIVE or XICS-on-XIVE for this VM will
> thus always fail. This is a problem for QEMU since it destroys and
> re-creates these devices when the VM is reset: the VM would be
> restricted to using the emulated XIVE or XICS forever.
> 
> As an alternative to adding rollback, do not assign kvm->arch.xive before
> making sure the XIVE VPs are allocated in OPAL.
> 
> Fixes: 5422e95103cf ("KVM: PPC: Book3S HV: XIVE: Replace the 'destroy' method by a 'release' method")
> Signed-off-by: Greg Kurz <groug@kaod.org>


Reviewed-by: Cédric Le Goater <clg@kaod.org>

C.

> ---
>  arch/powerpc/kvm/book3s_xive.c        |   11 +++++------
>  arch/powerpc/kvm/book3s_xive_native.c |    2 +-
>  2 files changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
> index cd2006bfcd3e..2ef43d037a4f 100644
> --- a/arch/powerpc/kvm/book3s_xive.c
> +++ b/arch/powerpc/kvm/book3s_xive.c
> @@ -2006,6 +2006,10 @@ static int kvmppc_xive_create(struct kvm_device *dev, u32 type)
>  
>  	pr_devel("Creating xive for partition\n");
>  
> +	/* Already there ? */
> +	if (kvm->arch.xive)
> +		return -EEXIST;
> +
>  	xive = kvmppc_xive_get_device(kvm, type);
>  	if (!xive)
>  		return -ENOMEM;
> @@ -2014,12 +2018,6 @@ static int kvmppc_xive_create(struct kvm_device *dev, u32 type)
>  	xive->kvm = kvm;
>  	mutex_init(&xive->lock);
>  
> -	/* Already there ? */
> -	if (kvm->arch.xive)
> -		ret = -EEXIST;
> -	else
> -		kvm->arch.xive = xive;
> -
>  	/* We use the default queue size set by the host */
>  	xive->q_order = xive_native_default_eq_shift();
>  	if (xive->q_order < PAGE_SHIFT)
> @@ -2040,6 +2038,7 @@ static int kvmppc_xive_create(struct kvm_device *dev, u32 type)
>  		return ret;
>  
>  	dev->private = xive;
> +	kvm->arch.xive = xive;
>  	return 0;
>  }
>  
> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
> index e9cbb42de424..84a354b90f60 100644
> --- a/arch/powerpc/kvm/book3s_xive_native.c
> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> @@ -1087,7 +1087,6 @@ static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type)
>  
>  	xive->dev = dev;
>  	xive->kvm = kvm;
> -	kvm->arch.xive = xive;
>  	mutex_init(&xive->mapping_lock);
>  	mutex_init(&xive->lock);
>  
> @@ -1109,6 +1108,7 @@ static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type)
>  		return ret;
>  
>  	dev->private = xive;
> +	kvm->arch.xive = xive;
>  	return 0;
>  }
>  
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/6] KVM: PPC: Book3S HV: XIVE: Ensure VP isn't already in use
  2019-09-23 15:43   ` Greg Kurz
@ 2019-09-23 15:52     ` Cédric Le Goater
  -1 siblings, 0 replies; 45+ messages in thread
From: Cédric Le Goater @ 2019-09-23 15:52 UTC (permalink / raw)
  To: Greg Kurz, Paul Mackerras
  Cc: Michael Ellerman, Benjamin Herrenschmidt, David Gibson,
	Paolo Bonzini, Radim Krčmář,
	kvm-ppc, kvm, linuxppc-dev

On 23/09/2019 17:43, Greg Kurz wrote:
> We currently prevent userspace from connecting a new vCPU if we already
> have one with the same vCPU id. This is good but unfortunately not enough,
> because VP ids derive from the packed vCPU ids, and kvmppc_pack_vcpu_id()
> can return colliding values. For example, 348 stays unchanged since it
> is < KVM_MAX_VCPUS, but it is also the packed value of 2392 when the
> guest's core stride is 8. Nothing currently prevents userspace from
> connecting vCPUs with forged ids that end up being associated with the
> same VP. This confuses the irq layer and likely crashes the kernel:
> 
> [96631.670454] genirq: Flags mismatch irq 4161. 00010000 (kvm-1-2392) vs. 00010000 (kvm-1-348)
> 
> Check the VP id instead of the vCPU id when a new vCPU is connected.
> The allocation of the XIVE CPU structure in kvmppc_xive_connect_vcpu()
> is moved after the check to avoid the need for rollback.
> 
> Signed-off-by: Greg Kurz <groug@kaod.org>


Reviewed-by: Cédric Le Goater <clg@kaod.org>

C.


> ---
>  arch/powerpc/kvm/book3s_xive.c        |   24 ++++++++++++++++--------
>  arch/powerpc/kvm/book3s_xive.h        |   12 ++++++++++++
>  arch/powerpc/kvm/book3s_xive_native.c |    6 ++++--
>  3 files changed, 32 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
> index 2ef43d037a4f..01bff7befc9f 100644
> --- a/arch/powerpc/kvm/book3s_xive.c
> +++ b/arch/powerpc/kvm/book3s_xive.c
> @@ -1217,6 +1217,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
>  	struct kvmppc_xive *xive = dev->private;
>  	struct kvmppc_xive_vcpu *xc;
>  	int i, r = -EBUSY;
> +	u32 vp_id;
>  
>  	pr_devel("connect_vcpu(cpu=%d)\n", cpu);
>  
> @@ -1228,25 +1229,32 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
>  		return -EPERM;
>  	if (vcpu->arch.irq_type != KVMPPC_IRQ_DEFAULT)
>  		return -EBUSY;
> -	if (kvmppc_xive_find_server(vcpu->kvm, cpu)) {
> -		pr_devel("Duplicate !\n");
> -		return -EEXIST;
> -	}
>  	if (cpu >= (KVM_MAX_VCPUS * vcpu->kvm->arch.emul_smt_mode)) {
>  		pr_devel("Out of bounds !\n");
>  		return -EINVAL;
>  	}
> -	xc = kzalloc(sizeof(*xc), GFP_KERNEL);
> -	if (!xc)
> -		return -ENOMEM;
>  
>  	/* We need to synchronize with queue provisioning */
>  	mutex_lock(&xive->lock);
> +
> +	vp_id = kvmppc_xive_vp(xive, cpu);
> +	if (kvmppc_xive_vp_in_use(xive->kvm, vp_id)) {
> +		pr_devel("Duplicate !\n");
> +		r = -EEXIST;
> +		goto bail;
> +	}
> +
> +	xc = kzalloc(sizeof(*xc), GFP_KERNEL);
> +	if (!xc) {
> +		r = -ENOMEM;
> +		goto bail;
> +	}
> +
>  	vcpu->arch.xive_vcpu = xc;
>  	xc->xive = xive;
>  	xc->vcpu = vcpu;
>  	xc->server_num = cpu;
> -	xc->vp_id = kvmppc_xive_vp(xive, cpu);
> +	xc->vp_id = vp_id;
>  	xc->mfrr = 0xff;
>  	xc->valid = true;
>  
> diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
> index 955b820ffd6d..fe3ed50e0818 100644
> --- a/arch/powerpc/kvm/book3s_xive.h
> +++ b/arch/powerpc/kvm/book3s_xive.h
> @@ -220,6 +220,18 @@ static inline u32 kvmppc_xive_vp(struct kvmppc_xive *xive, u32 server)
>  	return xive->vp_base + kvmppc_pack_vcpu_id(xive->kvm, server);
>  }
>  
> +static inline bool kvmppc_xive_vp_in_use(struct kvm *kvm, u32 vp_id)
> +{
> +	struct kvm_vcpu *vcpu = NULL;
> +	int i;
> +
> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> +		if (vcpu->arch.xive_vcpu && vp_id == vcpu->arch.xive_vcpu->vp_id)
> +			return true;
> +	}
> +	return false;
> +}
> +
>  /*
>   * Mapping between guest priorities and host priorities
>   * is as follow.
> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
> index 84a354b90f60..53a22771908c 100644
> --- a/arch/powerpc/kvm/book3s_xive_native.c
> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> @@ -106,6 +106,7 @@ int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
>  	struct kvmppc_xive *xive = dev->private;
>  	struct kvmppc_xive_vcpu *xc = NULL;
>  	int rc;
> +	u32 vp_id;
>  
>  	pr_devel("native_connect_vcpu(server=%d)\n", server_num);
>  
> @@ -124,7 +125,8 @@ int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
>  
>  	mutex_lock(&xive->lock);
>  
> -	if (kvmppc_xive_find_server(vcpu->kvm, server_num)) {
> +	vp_id = kvmppc_xive_vp(xive, server_num);
> +	if (kvmppc_xive_vp_in_use(xive->kvm, vp_id)) {
>  		pr_devel("Duplicate !\n");
>  		rc = -EEXIST;
>  		goto bail;
> @@ -141,7 +143,7 @@ int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
>  	xc->vcpu = vcpu;
>  	xc->server_num = server_num;
>  
> -	xc->vp_id = kvmppc_xive_vp(xive, server_num);
> +	xc->vp_id = vp_id;
>  	xc->valid = true;
>  	vcpu->arch.irq_type = KVMPPC_IRQ_XIVE;
>  
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread
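The collision described in the commit message can be reproduced with a rough
Python model of the packing logic. The constants and the block-offset table
are assumptions based on the upstream kvmppc_pack_vcpu_id() helper; this is
a sketch, not the kernel code.

```python
KVM_MAX_VCPUS = 2048
MAX_SMT_THREADS = 8
BLOCK_OFFSETS = [0, 4, 2, 6, 1, 5, 3, 7]  # assumed offset table

def pack_vcpu_id(vcpu_id, stride):
    # stride corresponds to the guest's core stride
    # (kvm->arch.emul_smt_mode). Ids >= KVM_MAX_VCPUS are folded
    # into the holes left by a stride smaller than MAX_SMT_THREADS.
    block = (vcpu_id // KVM_MAX_VCPUS) * (MAX_SMT_THREADS // stride)
    return (vcpu_id % KVM_MAX_VCPUS) + BLOCK_OFFSETS[block]
```

With a core stride of 8, both 348 and 2392 pack to 348, so without the new
VP id check both vCPUs would end up associated with the same VP.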

* Re: [PATCH 3/6] KVM: PPC: Book3S HV: XIVE: Ensure VP isn't already in use
@ 2019-09-23 15:52     ` Cédric Le Goater
  0 siblings, 0 replies; 45+ messages in thread
From: Cédric Le Goater @ 2019-09-23 15:52 UTC (permalink / raw)
  To: Greg Kurz, Paul Mackerras
  Cc: Michael Ellerman, Benjamin Herrenschmidt, David Gibson,
	Paolo Bonzini, Radim Krčmář,
	kvm-ppc, kvm, linuxppc-dev

On 23/09/2019 17:43, Greg Kurz wrote:
> We currently prevent userspace to connect a new vCPU if we already have
> one with the same vCPU id. This is good but unfortunately not enough,
> because VP ids derive from the packed vCPU ids, and kvmppc_pack_vcpu_id()
> can return colliding values. For examples, 348 stays unchanged since it
> is < KVM_MAX_VCPUS, but it is also the packed value of 2392 when the
> guest's core stride is 8. Nothing currently prevents userspace to connect
> vCPUs with forged ids, that end up being associated to the same VP. This
> confuses the irq layer and likely crashes the kernel:
> 
> [96631.670454] genirq: Flags mismatch irq 4161. 00010000 (kvm-1-2392) vs. 00010000 (kvm-1-348)
> 
> Check the VP id instead of the vCPU id when a new vCPU is connected.
> The allocation of the XIVE CPU structure in kvmppc_xive_connect_vcpu()
> is moved after the check to avoid the need for rollback.
> 
> Signed-off-by: Greg Kurz <groug@kaod.org>


Reviewed-by: Cédric Le Goater <clg@kaod.org>

C.


> ---
>  arch/powerpc/kvm/book3s_xive.c        |   24 ++++++++++++++++--------
>  arch/powerpc/kvm/book3s_xive.h        |   12 ++++++++++++
>  arch/powerpc/kvm/book3s_xive_native.c |    6 ++++--
>  3 files changed, 32 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
> index 2ef43d037a4f..01bff7befc9f 100644
> --- a/arch/powerpc/kvm/book3s_xive.c
> +++ b/arch/powerpc/kvm/book3s_xive.c
> @@ -1217,6 +1217,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
>  	struct kvmppc_xive *xive = dev->private;
>  	struct kvmppc_xive_vcpu *xc;
>  	int i, r = -EBUSY;
> +	u32 vp_id;
>  
>  	pr_devel("connect_vcpu(cpu=%d)\n", cpu);
>  
> @@ -1228,25 +1229,32 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
>  		return -EPERM;
>  	if (vcpu->arch.irq_type != KVMPPC_IRQ_DEFAULT)
>  		return -EBUSY;
> -	if (kvmppc_xive_find_server(vcpu->kvm, cpu)) {
> -		pr_devel("Duplicate !\n");
> -		return -EEXIST;
> -	}
>  	if (cpu >= (KVM_MAX_VCPUS * vcpu->kvm->arch.emul_smt_mode)) {
>  		pr_devel("Out of bounds !\n");
>  		return -EINVAL;
>  	}
> -	xc = kzalloc(sizeof(*xc), GFP_KERNEL);
> -	if (!xc)
> -		return -ENOMEM;
>  
>  	/* We need to synchronize with queue provisioning */
>  	mutex_lock(&xive->lock);
> +
> +	vp_id = kvmppc_xive_vp(xive, cpu);
> +	if (kvmppc_xive_vp_in_use(xive->kvm, vp_id)) {
> +		pr_devel("Duplicate !\n");
> +		r = -EEXIST;
> +		goto bail;
> +	}
> +
> +	xc = kzalloc(sizeof(*xc), GFP_KERNEL);
> +	if (!xc) {
> +		r = -ENOMEM;
> +		goto bail;
> +	}
> +
>  	vcpu->arch.xive_vcpu = xc;
>  	xc->xive = xive;
>  	xc->vcpu = vcpu;
>  	xc->server_num = cpu;
> -	xc->vp_id = kvmppc_xive_vp(xive, cpu);
> +	xc->vp_id = vp_id;
>  	xc->mfrr = 0xff;
>  	xc->valid = true;
>  
> diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
> index 955b820ffd6d..fe3ed50e0818 100644
> --- a/arch/powerpc/kvm/book3s_xive.h
> +++ b/arch/powerpc/kvm/book3s_xive.h
> @@ -220,6 +220,18 @@ static inline u32 kvmppc_xive_vp(struct kvmppc_xive *xive, u32 server)
>  	return xive->vp_base + kvmppc_pack_vcpu_id(xive->kvm, server);
>  }
>  
> +static inline bool kvmppc_xive_vp_in_use(struct kvm *kvm, u32 vp_id)
> +{
> +	struct kvm_vcpu *vcpu = NULL;
> +	int i;
> +
> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> +		if (vcpu->arch.xive_vcpu && vp_id = vcpu->arch.xive_vcpu->vp_id)
> +			return true;
> +	}
> +	return false;
> +}
> +
>  /*
>   * Mapping between guest priorities and host priorities
>   * is as follow.
> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
> index 84a354b90f60..53a22771908c 100644
> --- a/arch/powerpc/kvm/book3s_xive_native.c
> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> @@ -106,6 +106,7 @@ int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
>  	struct kvmppc_xive *xive = dev->private;
>  	struct kvmppc_xive_vcpu *xc = NULL;
>  	int rc;
> +	u32 vp_id;
>  
>  	pr_devel("native_connect_vcpu(server=%d)\n", server_num);
>  
> @@ -124,7 +125,8 @@ int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
>  
>  	mutex_lock(&xive->lock);
>  
> -	if (kvmppc_xive_find_server(vcpu->kvm, server_num)) {
> +	vp_id = kvmppc_xive_vp(xive, server_num);
> +	if (kvmppc_xive_vp_in_use(xive->kvm, vp_id)) {
>  		pr_devel("Duplicate !\n");
>  		rc = -EEXIST;
>  		goto bail;
> @@ -141,7 +143,7 @@ int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
>  	xc->vcpu = vcpu;
>  	xc->server_num = server_num;
>  
> -	xc->vp_id = kvmppc_xive_vp(xive, server_num);
> +	xc->vp_id = vp_id;
>  	xc->valid = true;
>  	vcpu->arch.irq_type = KVMPPC_IRQ_XIVE;
>  
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 6/6] KVM: PPC: Book3S HV: XIVE: Allow userspace to set the # of VPs
  2019-09-23 15:44   ` Greg Kurz
@ 2019-09-23 15:56     ` Cédric Le Goater
  -1 siblings, 0 replies; 45+ messages in thread
From: Cédric Le Goater @ 2019-09-23 15:56 UTC (permalink / raw)
  To: Greg Kurz, Paul Mackerras
  Cc: Michael Ellerman, Benjamin Herrenschmidt, David Gibson,
	Paolo Bonzini, Radim Krčmář,
	kvm-ppc, kvm, linuxppc-dev

On 23/09/2019 17:44, Greg Kurz wrote:
> Add a new attribute to both legacy and native XIVE KVM devices so that
> userspace can require fewer interrupt servers than the current default
> (KVM_MAX_VCPUS, 2048). This will allow allocating fewer VPs in OPAL,
> and likely increase the number of VMs that can run with an in-kernel
> XIVE implementation.
> 
> Since the legacy XIVE KVM device is exposed to userspace through the
> XICS KVM API, a new attribute group is added to it for this purpose.
> While here, fix the syntax of the existing KVM_DEV_XICS_GRP_SOURCES
> in the XICS documentation.
> 
> Signed-off-by: Greg Kurz <groug@kaod.org>

Reviewed-by: Cédric Le Goater <clg@kaod.org>

C.

> ---
>  Documentation/virt/kvm/devices/xics.txt |   14 ++++++++++++--
>  Documentation/virt/kvm/devices/xive.txt |    8 ++++++++
>  arch/powerpc/include/uapi/asm/kvm.h     |    3 +++
>  arch/powerpc/kvm/book3s_xive.c          |   10 ++++++++++
>  arch/powerpc/kvm/book3s_xive_native.c   |    3 +++
>  5 files changed, 36 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/devices/xics.txt b/Documentation/virt/kvm/devices/xics.txt
> index 42864935ac5d..1cf9621f8341 100644
> --- a/Documentation/virt/kvm/devices/xics.txt
> +++ b/Documentation/virt/kvm/devices/xics.txt
> @@ -3,9 +3,19 @@ XICS interrupt controller
>  Device type supported: KVM_DEV_TYPE_XICS
>  
>  Groups:
> -  KVM_DEV_XICS_SOURCES
> +  1. KVM_DEV_XICS_GRP_SOURCES
>    Attributes: One per interrupt source, indexed by the source number.
>  
> +  2. KVM_DEV_XICS_GRP_CTRL
> +  Attributes:
> +    2.1 KVM_DEV_XICS_NR_SERVERS (write only)
> +  The kvm_device_attr.addr points to a __u32 value which is the number of
> +  interrupt server numbers (ie, highest possible vcpu id plus one).
> +  Errors:
> +    -EINVAL: Value greater than KVM_MAX_VCPUS.
> +    -EFAULT: Invalid user pointer for attr->addr.
> +    -EBUSY:  A vcpu is already connected to the device.
> +
>  This device emulates the XICS (eXternal Interrupt Controller
>  Specification) defined in PAPR.  The XICS has a set of interrupt
>  sources, each identified by a 20-bit source number, and a set of
> @@ -38,7 +48,7 @@ least-significant end of the word:
>  
>  Each source has 64 bits of state that can be read and written using
>  the KVM_GET_DEVICE_ATTR and KVM_SET_DEVICE_ATTR ioctls, specifying the
> -KVM_DEV_XICS_SOURCES attribute group, with the attribute number being
> +KVM_DEV_XICS_GRP_SOURCES attribute group, with the attribute number being
>  the interrupt source number.  The 64 bit state word has the following
>  bitfields, starting from the least-significant end of the word:
>  
> diff --git a/Documentation/virt/kvm/devices/xive.txt b/Documentation/virt/kvm/devices/xive.txt
> index 9a24a4525253..fd418b907d0e 100644
> --- a/Documentation/virt/kvm/devices/xive.txt
> +++ b/Documentation/virt/kvm/devices/xive.txt
> @@ -78,6 +78,14 @@ the legacy interrupt mode, referred as XICS (POWER7/8).
>      migrating the VM.
>      Errors: none
>  
> +    1.3 KVM_DEV_XIVE_NR_SERVERS (write only)
> +    The kvm_device_attr.addr points to a __u32 value which is the number of
> +    interrupt server numbers (ie, highest possible vcpu id plus one).
> +    Errors:
> +      -EINVAL: Value greater than KVM_MAX_VCPUS.
> +      -EFAULT: Invalid user pointer for attr->addr.
> +      -EBUSY:  A vCPU is already connected to the device.
> +
>    2. KVM_DEV_XIVE_GRP_SOURCE (write only)
>    Initializes a new source in the XIVE device and mask it.
>    Attributes:
> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> index b0f72dea8b11..264e266a85bf 100644
> --- a/arch/powerpc/include/uapi/asm/kvm.h
> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> @@ -667,6 +667,8 @@ struct kvm_ppc_cpu_char {
>  
>  /* PPC64 eXternal Interrupt Controller Specification */
>  #define KVM_DEV_XICS_GRP_SOURCES	1	/* 64-bit source attributes */
> +#define KVM_DEV_XICS_GRP_CTRL		2
> +#define   KVM_DEV_XICS_NR_SERVERS	1
>  
>  /* Layout of 64-bit source attribute values */
>  #define  KVM_XICS_DESTINATION_SHIFT	0
> @@ -683,6 +685,7 @@ struct kvm_ppc_cpu_char {
>  #define KVM_DEV_XIVE_GRP_CTRL		1
>  #define   KVM_DEV_XIVE_RESET		1
>  #define   KVM_DEV_XIVE_EQ_SYNC		2
> +#define   KVM_DEV_XIVE_NR_SERVERS	3
>  #define KVM_DEV_XIVE_GRP_SOURCE		2	/* 64-bit source identifier */
>  #define KVM_DEV_XIVE_GRP_SOURCE_CONFIG	3	/* 64-bit source identifier */
>  #define KVM_DEV_XIVE_GRP_EQ_CONFIG	4	/* 64-bit EQ identifier */
> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
> index 4a333dcfddd8..c1901583e6c0 100644
> --- a/arch/powerpc/kvm/book3s_xive.c
> +++ b/arch/powerpc/kvm/book3s_xive.c
> @@ -1905,6 +1905,11 @@ static int xive_set_attr(struct kvm_device *dev, struct kvm_device_attr *attr)
>  	switch (attr->group) {
>  	case KVM_DEV_XICS_GRP_SOURCES:
>  		return xive_set_source(xive, attr->attr, attr->addr);
> +	case KVM_DEV_XICS_GRP_CTRL:
> +		switch (attr->attr) {
> +		case KVM_DEV_XICS_NR_SERVERS:
> +			return kvmppc_xive_set_nr_servers(xive, attr->addr);
> +		}
>  	}
>  	return -ENXIO;
>  }
> @@ -1930,6 +1935,11 @@ static int xive_has_attr(struct kvm_device *dev, struct kvm_device_attr *attr)
>  		    attr->attr < KVMPPC_XICS_NR_IRQS)
>  			return 0;
>  		break;
> +	case KVM_DEV_XICS_GRP_CTRL:
> +		switch (attr->attr) {
> +		case KVM_DEV_XICS_NR_SERVERS:
> +			return 0;
> +		}
>  	}
>  	return -ENXIO;
>  }
> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
> index 5e18364d52a9..8e954c5d5efb 100644
> --- a/arch/powerpc/kvm/book3s_xive_native.c
> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> @@ -921,6 +921,8 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
>  			return kvmppc_xive_reset(xive);
>  		case KVM_DEV_XIVE_EQ_SYNC:
>  			return kvmppc_xive_native_eq_sync(xive);
> +		case KVM_DEV_XIVE_NR_SERVERS:
> +			return kvmppc_xive_set_nr_servers(xive, attr->addr);
>  		}
>  		break;
>  	case KVM_DEV_XIVE_GRP_SOURCE:
> @@ -960,6 +962,7 @@ static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
>  		switch (attr->attr) {
>  		case KVM_DEV_XIVE_RESET:
>  		case KVM_DEV_XIVE_EQ_SYNC:
> +		case KVM_DEV_XIVE_NR_SERVERS:
>  			return 0;
>  		}
>  		break;
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/6] KVM: PPC: Book3S HV: XIVE: initialize private pointer when VPs are allocated
  2019-09-23 15:43   ` Greg Kurz
  (?)
@ 2019-09-24  5:28     ` Paul Mackerras
  -1 siblings, 0 replies; 45+ messages in thread
From: Paul Mackerras @ 2019-09-24  5:28 UTC (permalink / raw)
  To: Greg Kurz
  Cc: Michael Ellerman, Benjamin Herrenschmidt, Cédric Le Goater,
	David Gibson, Paolo Bonzini, Radim Krčmář,
	kvm-ppc, kvm, linuxppc-dev

On Mon, Sep 23, 2019 at 05:43:37PM +0200, Greg Kurz wrote:
> From: Cédric Le Goater <clg@kaod.org>
> 
> Do not assign the device private pointer before making sure the XIVE
> VPs are allocated in OPAL and test pointer validity when releasing
> the device.
> 
> Fixes: 5422e95103cf ("KVM: PPC: Book3S HV: XIVE: Replace the 'destroy' method by a 'release' method")
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> Signed-off-by: Greg Kurz <groug@kaod.org>

What happens in the case where the OPAL allocation fails?  Does the
host crash, or hang, or leak resources?  I presume that users can
trigger the allocation failure just by starting a suitably large
number of guests - is that right?  Is there an easier way?  I'm trying
to work out whether this is urgently needed in 5.4 and the stable
trees or not.

Paul.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/6] KVM: PPC: Book3S HV: XIVE: Ensure VP isn't already in use
  2019-09-23 15:43   ` Greg Kurz
  (?)
@ 2019-09-24  5:33     ` Paul Mackerras
  -1 siblings, 0 replies; 45+ messages in thread
From: Paul Mackerras @ 2019-09-24  5:33 UTC (permalink / raw)
  To: Greg Kurz
  Cc: Michael Ellerman, Benjamin Herrenschmidt, Cédric Le Goater,
	David Gibson, Paolo Bonzini, Radim Krčmář,
	kvm-ppc, kvm, linuxppc-dev

On Mon, Sep 23, 2019 at 05:43:48PM +0200, Greg Kurz wrote:
> We currently prevent userspace from connecting a new vCPU if we already
> have one with the same vCPU id. This is good but unfortunately not
> enough, because VP ids derive from the packed vCPU ids, and
> kvmppc_pack_vcpu_id() can return colliding values. For example, 348
> stays unchanged since it is < KVM_MAX_VCPUS, but it is also the packed
> value of 2392 when the guest's core stride is 8. Nothing currently
> prevents userspace from connecting vCPUs with forged ids, which end up
> associated with the same VP. This confuses the irq layer and likely
> crashes the kernel:
> 
> [96631.670454] genirq: Flags mismatch irq 4161. 00010000 (kvm-1-2392) vs. 00010000 (kvm-1-348)

Have you seen a host kernel crash?  How hard would it be to exploit
this, and would it just be a denial of service, or do you think it
could be used to get a use-after-free in the kernel or something like
that?

Also, does this patch depend on the patch 2 in this series?

Paul.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/6] KVM: PPC: Book3S HV: XIVE: initialize private pointer when VPs are allocated
  2019-09-24  5:28     ` Paul Mackerras
  (?)
@ 2019-09-24 11:49       ` Greg Kurz
  -1 siblings, 0 replies; 45+ messages in thread
From: Greg Kurz @ 2019-09-24 11:49 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Michael Ellerman, Benjamin Herrenschmidt, Cédric Le Goater,
	David Gibson, Paolo Bonzini, Radim Krčmář,
	kvm-ppc, kvm, linuxppc-dev

On Tue, 24 Sep 2019 15:28:55 +1000
Paul Mackerras <paulus@ozlabs.org> wrote:

> On Mon, Sep 23, 2019 at 05:43:37PM +0200, Greg Kurz wrote:
> > From: Cédric Le Goater <clg@kaod.org>
> > 
> > Do not assign the device private pointer before making sure the XIVE
> > VPs are allocated in OPAL and test pointer validity when releasing
> > the device.
> > 
> > Fixes: 5422e95103cf ("KVM: PPC: Book3S HV: XIVE: Replace the 'destroy' method by a 'release' method")
> > Signed-off-by: Cédric Le Goater <clg@kaod.org>
> > Signed-off-by: Greg Kurz <groug@kaod.org>
> 
> What happens in the case where the OPAL allocation fails?  Does the
> host crash, or hang, or leak resources?  I presume that users can
> trigger the allocation failure just by starting a suitably large
> number of guests - is that right?  Is there an easier way?  I'm trying
> to work out whether this is urgently needed in 5.4 and the stable
> trees or not.
> 

Wait... I don't quite remember how this patch landed in my tree but when
I look at it again I have the impression it tries to fix something that
cannot happen.

It is indeed easy to trigger the allocation failure, eg. start more than
127 guests on a Witherspoon system. But if this happens, the create
function returns an error and the device isn't created. I don't see how
the release function could hence get called with a "partially initialized"
device.

Please ignore this patch. Unfortunately the rest of the series doesn't
apply cleanly without it... I'll rebase and post a v2.

Sorry for the noise :-\

> Paul.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/6] KVM: PPC: Book3S HV: XIVE: Ensure VP isn't already in use
  2019-09-24  5:33     ` Paul Mackerras
  (?)
@ 2019-09-24 17:26       ` Greg Kurz
  -1 siblings, 0 replies; 45+ messages in thread
From: Greg Kurz @ 2019-09-24 17:26 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Michael Ellerman, Benjamin Herrenschmidt, Cédric Le Goater,
	David Gibson, Paolo Bonzini, Radim Krčmář,
	kvm-ppc, kvm, linuxppc-dev

On Tue, 24 Sep 2019 15:33:28 +1000
Paul Mackerras <paulus@ozlabs.org> wrote:

> On Mon, Sep 23, 2019 at 05:43:48PM +0200, Greg Kurz wrote:
> > We currently prevent userspace from connecting a new vCPU if we already
> > have one with the same vCPU id. This is good but unfortunately not
> > enough, because VP ids derive from the packed vCPU ids, and
> > kvmppc_pack_vcpu_id() can return colliding values. For example, 348
> > stays unchanged since it is < KVM_MAX_VCPUS, but it is also the packed
> > value of 2392 when the guest's core stride is 8. Nothing currently
> > prevents userspace from connecting vCPUs with forged ids, which end up
> > associated with the same VP. This confuses the irq layer and likely
> > crashes the kernel:
> > 
> > [96631.670454] genirq: Flags mismatch irq 4161. 00010000 (kvm-1-2392) vs. 00010000 (kvm-1-348)
> 
> Have you seen a host kernel crash?

Yes I have.

[29191.162740] genirq: Flags mismatch irq 199. 00010000 (kvm-2-2392) vs. 00010000 (kvm-2-348)
[29191.162849] CPU: 24 PID: 88176 Comm: qemu-system-ppc Not tainted 5.3.0-xive-nr-servers-5.3-gku+ #38
[29191.162966] Call Trace:
[29191.163002] [c000003f7f9937e0] [c000000000c0110c] dump_stack+0xb0/0xf4 (unreliable)
[29191.163090] [c000003f7f993820] [c0000000001cb480] __setup_irq+0xa70/0xad0
[29191.163180] [c000003f7f9938d0] [c0000000001cb75c] request_threaded_irq+0x13c/0x260
[29191.163290] [c000003f7f993940] [c00800000d44e7ac] kvmppc_xive_attach_escalation+0x104/0x270 [kvm]
[29191.163396] [c000003f7f9939d0] [c00800000d45013c] kvmppc_xive_connect_vcpu+0x424/0x620 [kvm]
[29191.163504] [c000003f7f993ac0] [c00800000d444428] kvm_arch_vcpu_ioctl+0x260/0x448 [kvm]
[29191.163616] [c000003f7f993b90] [c00800000d43593c] kvm_vcpu_ioctl+0x154/0x7c8 [kvm]
[29191.163695] [c000003f7f993d00] [c0000000004840f0] do_vfs_ioctl+0xe0/0xc30
[29191.163806] [c000003f7f993db0] [c000000000484d44] ksys_ioctl+0x104/0x120
[29191.163889] [c000003f7f993e00] [c000000000484d88] sys_ioctl+0x28/0x80
[29191.163962] [c000003f7f993e20] [c00000000000b278] system_call+0x5c/0x68
[29191.164035] xive-kvm: Failed to request escalation interrupt for queue 0 of VCPU 2392
[29191.164152] ------------[ cut here ]------------
[29191.164229] remove_proc_entry: removing non-empty directory 'irq/199', leaking at least 'kvm-2-348'
[29191.164343] WARNING: CPU: 24 PID: 88176 at /home/greg/Work/linux/kernel-kvm-ppc/fs/proc/generic.c:684 remove_proc_entry+0x1ec/0x200
[29191.164501] Modules linked in: kvm_hv kvm dm_mod vhost_net vhost tap xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter squashfs loop fuse i2c_dev sg ofpart ocxl powernv_flash at24 xts mtd uio_pdrv_genirq vmx_crypto opal_prd ipmi_powernv uio ipmi_devintf ipmi_msghandler ibmpowernv ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables ext4 mbcache jbd2 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq libcrc32c raid1 raid0 linear sd_mod ast i2c_algo_bit drm_vram_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm ahci libahci libata tg3 drm_panel_orientation_quirks [last unloaded: kvm]
[29191.165450] CPU: 24 PID: 88176 Comm: qemu-system-ppc Not tainted 5.3.0-xive-nr-servers-5.3-gku+ #38
[29191.165568] NIP:  c00000000053b0cc LR: c00000000053b0c8 CTR: c0000000000ba3b0
[29191.165644] REGS: c000003f7f9934b0 TRAP: 0700   Not tainted  (5.3.0-xive-nr-servers-5.3-gku+)
[29191.165741] MSR:  9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 48228222  XER: 20040000
[29191.165939] CFAR: c000000000131a50 IRQMASK: 0 
[29191.165939] GPR00: c00000000053b0c8 c000003f7f993740 c0000000015ec500 0000000000000057 
[29191.165939] GPR04: 0000000000000001 0000000000000000 000049fb98484262 0000000000001bcf 
[29191.165939] GPR08: 0000000000000007 0000000000000007 0000000000000001 9000000000001033 
[29191.165939] GPR12: 0000000000008000 c000003ffffeb800 0000000000000000 000000012f4ce5a1 
[29191.165939] GPR16: 000000012ef5a0c8 0000000000000000 000000012f113bb0 0000000000000000 
[29191.165939] GPR20: 000000012f45d918 c000003f863758b0 c000003f86375870 0000000000000006 
[29191.165939] GPR24: c000003f86375a30 0000000000000007 c0002039373d9020 c0000000014c4a48 
[29191.165939] GPR28: 0000000000000001 c000003fe62a4f6b c00020394b2e9fab c000003fe62a4ec0 
[29191.166755] NIP [c00000000053b0cc] remove_proc_entry+0x1ec/0x200
[29191.166803] LR [c00000000053b0c8] remove_proc_entry+0x1e8/0x200
[29191.166874] Call Trace:
[29191.166908] [c000003f7f993740] [c00000000053b0c8] remove_proc_entry+0x1e8/0x200 (unreliable)
[29191.167022] [c000003f7f9937e0] [c0000000001d3654] unregister_irq_proc+0x114/0x150
[29191.167106] [c000003f7f993880] [c0000000001c6284] free_desc+0x54/0xb0
[29191.167175] [c000003f7f9938c0] [c0000000001c65ec] irq_free_descs+0xac/0x100
[29191.167256] [c000003f7f993910] [c0000000001d1ff8] irq_dispose_mapping+0x68/0x80
[29191.167347] [c000003f7f993940] [c00800000d44e8a4] kvmppc_xive_attach_escalation+0x1fc/0x270 [kvm]
[29191.167480] [c000003f7f9939d0] [c00800000d45013c] kvmppc_xive_connect_vcpu+0x424/0x620 [kvm]
[29191.167595] [c000003f7f993ac0] [c00800000d444428] kvm_arch_vcpu_ioctl+0x260/0x448 [kvm]
[29191.167683] [c000003f7f993b90] [c00800000d43593c] kvm_vcpu_ioctl+0x154/0x7c8 [kvm]
[29191.167772] [c000003f7f993d00] [c0000000004840f0] do_vfs_ioctl+0xe0/0xc30
[29191.167863] [c000003f7f993db0] [c000000000484d44] ksys_ioctl+0x104/0x120
[29191.167952] [c000003f7f993e00] [c000000000484d88] sys_ioctl+0x28/0x80
[29191.168002] [c000003f7f993e20] [c00000000000b278] system_call+0x5c/0x68
[29191.168088] Instruction dump:
[29191.168125] 2c230000 41820008 3923ff78 e8e900a0 3c82ff69 3c62ff8d 7fa6eb78 7fc5f378 
[29191.168221] 3884f080 3863b948 4bbf6925 60000000 <0fe00000> 4bffff7c fba10088 4bbf6e41 
[29191.168317] ---[ end trace b925b67a74a1d8d1 ]---
[29191.170904] BUG: Kernel NULL pointer dereference at 0x00000010
[29191.170925] Faulting instruction address: 0xc00800000d44fc04
[29191.170987] Oops: Kernel access of bad area, sig: 11 [#1]
[29191.171044] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
[29191.171132] Modules linked in: kvm_hv kvm dm_mod vhost_net vhost tap xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter squashfs loop fuse i2c_dev sg ofpart ocxl powernv_flash at24 xts mtd uio_pdrv_genirq vmx_crypto opal_prd ipmi_powernv uio ipmi_devintf ipmi_msghandler ibmpowernv ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables ext4 mbcache jbd2 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq libcrc32c raid1 raid0 linear sd_mod ast i2c_algo_bit drm_vram_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm ahci libahci libata tg3 drm_panel_orientation_quirks [last unloaded: kvm]
[29191.172045] CPU: 24 PID: 88176 Comm: qemu-system-ppc Tainted: G        W         5.3.0-xive-nr-servers-5.3-gku+ #38
[29191.172140] NIP:  c00800000d44fc04 LR: c00800000d44fc00 CTR: c0000000001cd970
[29191.172239] REGS: c000003f7f9938e0 TRAP: 0300   Tainted: G        W          (5.3.0-xive-nr-servers-5.3-gku+)
[29191.172337] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24228882  XER: 20040000
[29191.172439] CFAR: c0000000001cd9ac DAR: 0000000000000010 DSISR: 40000000 IRQMASK: 0 
[29191.172439] GPR00: c00800000d44fc00 c000003f7f993b70 c00800000d468300 0000000000000000 
[29191.172439] GPR04: 00000000000000c7 0000000000000000 0000000000000000 c000003ffacd06d8 
[29191.172439] GPR08: 0000000000000000 c000003ffacd0738 0000000000000000 fffffffffffffffd 
[29191.172439] GPR12: 0000000000000040 c000003ffffeb800 0000000000000000 000000012f4ce5a1 
[29191.172439] GPR16: 000000012ef5a0c8 0000000000000000 000000012f113bb0 0000000000000000 
[29191.172439] GPR20: 000000012f45d918 00007ffffe0d9a80 000000012f4f5df0 000000012ef8c9f8 
[29191.172439] GPR24: 0000000000000001 0000000000000000 c000003fe4501ed0 c000003f8b1d0000 
[29191.172439] GPR28: c0000033314689c0 c000003fe4501c00 c000003fe4501e70 c000003fe4501e90 
[29191.173262] NIP [c00800000d44fc04] kvmppc_xive_cleanup_vcpu+0xfc/0x210 [kvm]
[29191.173354] LR [c00800000d44fc00] kvmppc_xive_cleanup_vcpu+0xf8/0x210 [kvm]
[29191.173443] Call Trace:
[29191.173484] [c000003f7f993b70] [c00800000d44fc00] kvmppc_xive_cleanup_vcpu+0xf8/0x210 [kvm] (unreliable)
[29191.173640] [c000003f7f993bd0] [c00800000d450bd4] kvmppc_xive_release+0xdc/0x1b0 [kvm]
[29191.173737] [c000003f7f993c30] [c00800000d436a98] kvm_device_release+0xb0/0x110 [kvm]
[29191.173816] [c000003f7f993c70] [c00000000046730c] __fput+0xec/0x320
[29191.173908] [c000003f7f993cd0] [c000000000164ae0] task_work_run+0x150/0x1c0
[29191.173968] [c000003f7f993d30] [c000000000025034] do_notify_resume+0x304/0x440
[29191.174047] [c000003f7f993e20] [c00000000000dcc4] ret_from_except_lite+0x70/0x74
[29191.174136] Instruction dump:
[29191.174155] 3bff0008 7fbfd040 419e0054 847e0004 2fa30000 419effec e93d0000 8929203c 
[29191.174266] 2f890000 419effb8 4800821d e8410018 <e9230010> e9490008 9b2a0039 7c0004ac 
[29191.174348] ---[ end trace b925b67a74a1d8d2 ]---
[29191.372417] 
[29192.372502] Kernel panic - not syncing: Fatal exception


> How hard would it be to exploit
> this, 

I just had to run a guest (SMT1, stride 8, 300 vCPUs) with a patched QEMU
that turns the valid vCPU id 344 into 348 when calling KVM_CAP_IRQ_XICS.

> and would it just be a denial of service, or do you think it
> could be used to get a use-after-free in the kernel or something like
> that?
> 

This triggers a cascade of errors, the ultimate one being a NULL
pointer passed to irq_data_get_irq_handler_data() during the
escalation irq cleanup, if I read it right.

> Also, does this patch depend on the patch 2 in this series?
> 

No it doesn't.

> Paul.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/6] KVM: PPC: Book3S HV: XIVE: Ensure VP isn't already in use
@ 2019-09-24 17:26       ` Greg Kurz
  0 siblings, 0 replies; 45+ messages in thread
From: Greg Kurz @ 2019-09-24 17:26 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: kvm, Radim Krčmář,
	kvm-ppc, Cédric Le Goater, Paolo Bonzini, linuxppc-dev,
	David Gibson

On Tue, 24 Sep 2019 15:33:28 +1000
Paul Mackerras <paulus@ozlabs.org> wrote:

> On Mon, Sep 23, 2019 at 05:43:48PM +0200, Greg Kurz wrote:
> > We currently prevent userspace from connecting a new vCPU if we already
> > have one with the same vCPU id. This is good but unfortunately not enough,
> > because VP ids derive from the packed vCPU ids, and kvmppc_pack_vcpu_id()
> > can return colliding values. For example, 348 stays unchanged since it
> > is < KVM_MAX_VCPUS, but it is also the packed value of 2392 when the
> > guest's core stride is 8. Nothing currently prevents userspace from
> > connecting vCPUs with forged ids that end up associated with the same VP.
> > This confuses the irq layer and likely crashes the kernel:
> > 
> > [96631.670454] genirq: Flags mismatch irq 4161. 00010000 (kvm-1-2392) vs. 00010000 (kvm-1-348)
> 
> Have you seen a host kernel crash?

Yes I have.

[29191.162740] genirq: Flags mismatch irq 199. 00010000 (kvm-2-2392) vs. 00010000 (kvm-2-348)
[29191.162849] CPU: 24 PID: 88176 Comm: qemu-system-ppc Not tainted 5.3.0-xive-nr-servers-5.3-gku+ #38
[29191.162966] Call Trace:
[29191.163002] [c000003f7f9937e0] [c000000000c0110c] dump_stack+0xb0/0xf4 (unreliable)
[29191.163090] [c000003f7f993820] [c0000000001cb480] __setup_irq+0xa70/0xad0
[29191.163180] [c000003f7f9938d0] [c0000000001cb75c] request_threaded_irq+0x13c/0x260
[29191.163290] [c000003f7f993940] [c00800000d44e7ac] kvmppc_xive_attach_escalation+0x104/0x270 [kvm]
[29191.163396] [c000003f7f9939d0] [c00800000d45013c] kvmppc_xive_connect_vcpu+0x424/0x620 [kvm]
[29191.163504] [c000003f7f993ac0] [c00800000d444428] kvm_arch_vcpu_ioctl+0x260/0x448 [kvm]
[29191.163616] [c000003f7f993b90] [c00800000d43593c] kvm_vcpu_ioctl+0x154/0x7c8 [kvm]
[29191.163695] [c000003f7f993d00] [c0000000004840f0] do_vfs_ioctl+0xe0/0xc30
[29191.163806] [c000003f7f993db0] [c000000000484d44] ksys_ioctl+0x104/0x120
[29191.163889] [c000003f7f993e00] [c000000000484d88] sys_ioctl+0x28/0x80
[29191.163962] [c000003f7f993e20] [c00000000000b278] system_call+0x5c/0x68
[29191.164035] xive-kvm: Failed to request escalation interrupt for queue 0 of VCPU 2392
[29191.164152] ------------[ cut here ]------------
[29191.164229] remove_proc_entry: removing non-empty directory 'irq/199', leaking at least 'kvm-2-348'
[29191.164343] WARNING: CPU: 24 PID: 88176 at /home/greg/Work/linux/kernel-kvm-ppc/fs/proc/generic.c:684 remove_proc_entry+0x1ec/0x200
[29191.164501] Modules linked in: kvm_hv kvm dm_mod vhost_net vhost tap xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter squashfs loop fuse i2c_dev sg ofpart ocxl powernv_flash at24 xts mtd uio_pdrv_genirq vmx_crypto opal_prd ipmi_powernv uio ipmi_devintf ipmi_msghandler ibmpowernv ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables ext4 mbcache jbd2 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq libcrc32c raid1 raid0 linear sd_mod ast i2c_algo_bit drm_vram_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm ahci libahci libata tg3 drm_panel_orientation_quirks [last unloaded: kvm]
[29191.165450] CPU: 24 PID: 88176 Comm: qemu-system-ppc Not tainted 5.3.0-xive-nr-servers-5.3-gku+ #38
[29191.165568] NIP:  c00000000053b0cc LR: c00000000053b0c8 CTR: c0000000000ba3b0
[29191.165644] REGS: c000003f7f9934b0 TRAP: 0700   Not tainted  (5.3.0-xive-nr-servers-5.3-gku+)
[29191.165741] MSR:  9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 48228222  XER: 20040000
[29191.165939] CFAR: c000000000131a50 IRQMASK: 0 
[29191.165939] GPR00: c00000000053b0c8 c000003f7f993740 c0000000015ec500 0000000000000057 
[29191.165939] GPR04: 0000000000000001 0000000000000000 000049fb98484262 0000000000001bcf 
[29191.165939] GPR08: 0000000000000007 0000000000000007 0000000000000001 9000000000001033 
[29191.165939] GPR12: 0000000000008000 c000003ffffeb800 0000000000000000 000000012f4ce5a1 
[29191.165939] GPR16: 000000012ef5a0c8 0000000000000000 000000012f113bb0 0000000000000000 
[29191.165939] GPR20: 000000012f45d918 c000003f863758b0 c000003f86375870 0000000000000006 
[29191.165939] GPR24: c000003f86375a30 0000000000000007 c0002039373d9020 c0000000014c4a48 
[29191.165939] GPR28: 0000000000000001 c000003fe62a4f6b c00020394b2e9fab c000003fe62a4ec0 
[29191.166755] NIP [c00000000053b0cc] remove_proc_entry+0x1ec/0x200
[29191.166803] LR [c00000000053b0c8] remove_proc_entry+0x1e8/0x200
[29191.166874] Call Trace:
[29191.166908] [c000003f7f993740] [c00000000053b0c8] remove_proc_entry+0x1e8/0x200 (unreliable)
[29191.167022] [c000003f7f9937e0] [c0000000001d3654] unregister_irq_proc+0x114/0x150
[29191.167106] [c000003f7f993880] [c0000000001c6284] free_desc+0x54/0xb0
[29191.167175] [c000003f7f9938c0] [c0000000001c65ec] irq_free_descs+0xac/0x100
[29191.167256] [c000003f7f993910] [c0000000001d1ff8] irq_dispose_mapping+0x68/0x80
[29191.167347] [c000003f7f993940] [c00800000d44e8a4] kvmppc_xive_attach_escalation+0x1fc/0x270 [kvm]
[29191.167480] [c000003f7f9939d0] [c00800000d45013c] kvmppc_xive_connect_vcpu+0x424/0x620 [kvm]
[29191.167595] [c000003f7f993ac0] [c00800000d444428] kvm_arch_vcpu_ioctl+0x260/0x448 [kvm]
[29191.167683] [c000003f7f993b90] [c00800000d43593c] kvm_vcpu_ioctl+0x154/0x7c8 [kvm]
[29191.167772] [c000003f7f993d00] [c0000000004840f0] do_vfs_ioctl+0xe0/0xc30
[29191.167863] [c000003f7f993db0] [c000000000484d44] ksys_ioctl+0x104/0x120
[29191.167952] [c000003f7f993e00] [c000000000484d88] sys_ioctl+0x28/0x80
[29191.168002] [c000003f7f993e20] [c00000000000b278] system_call+0x5c/0x68
[29191.168088] Instruction dump:
[29191.168125] 2c230000 41820008 3923ff78 e8e900a0 3c82ff69 3c62ff8d 7fa6eb78 7fc5f378 
[29191.168221] 3884f080 3863b948 4bbf6925 60000000 <0fe00000> 4bffff7c fba10088 4bbf6e41 
[29191.168317] ---[ end trace b925b67a74a1d8d1 ]---
[29191.170904] BUG: Kernel NULL pointer dereference at 0x00000010
[29191.170925] Faulting instruction address: 0xc00800000d44fc04
[29191.170987] Oops: Kernel access of bad area, sig: 11 [#1]
[29191.171044] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
[29191.171132] Modules linked in: kvm_hv kvm dm_mod vhost_net vhost tap xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter squashfs loop fuse i2c_dev sg ofpart ocxl powernv_flash at24 xts mtd uio_pdrv_genirq vmx_crypto opal_prd ipmi_powernv uio ipmi_devintf ipmi_msghandler ibmpowernv ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables ext4 mbcache jbd2 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq libcrc32c raid1 raid0 linear sd_mod ast i2c_algo_bit drm_vram_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm ahci libahci libata tg3 drm_panel_orientation_quirks [last unloaded: kvm]
[29191.172045] CPU: 24 PID: 88176 Comm: qemu-system-ppc Tainted: G        W         5.3.0-xive-nr-servers-5.3-gku+ #38
[29191.172140] NIP:  c00800000d44fc04 LR: c00800000d44fc00 CTR: c0000000001cd970
[29191.172239] REGS: c000003f7f9938e0 TRAP: 0300   Tainted: G        W          (5.3.0-xive-nr-servers-5.3-gku+)
[29191.172337] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24228882  XER: 20040000
[29191.172439] CFAR: c0000000001cd9ac DAR: 0000000000000010 DSISR: 40000000 IRQMASK: 0 
[29191.172439] GPR00: c00800000d44fc00 c000003f7f993b70 c00800000d468300 0000000000000000 
[29191.172439] GPR04: 00000000000000c7 0000000000000000 0000000000000000 c000003ffacd06d8 
[29191.172439] GPR08: 0000000000000000 c000003ffacd0738 0000000000000000 fffffffffffffffd 
[29191.172439] GPR12: 0000000000000040 c000003ffffeb800 0000000000000000 000000012f4ce5a1 
[29191.172439] GPR16: 000000012ef5a0c8 0000000000000000 000000012f113bb0 0000000000000000 
[29191.172439] GPR20: 000000012f45d918 00007ffffe0d9a80 000000012f4f5df0 000000012ef8c9f8 
[29191.172439] GPR24: 0000000000000001 0000000000000000 c000003fe4501ed0 c000003f8b1d0000 
[29191.172439] GPR28: c0000033314689c0 c000003fe4501c00 c000003fe4501e70 c000003fe4501e90 
[29191.173262] NIP [c00800000d44fc04] kvmppc_xive_cleanup_vcpu+0xfc/0x210 [kvm]
[29191.173354] LR [c00800000d44fc00] kvmppc_xive_cleanup_vcpu+0xf8/0x210 [kvm]
[29191.173443] Call Trace:
[29191.173484] [c000003f7f993b70] [c00800000d44fc00] kvmppc_xive_cleanup_vcpu+0xf8/0x210 [kvm] (unreliable)
[29191.173640] [c000003f7f993bd0] [c00800000d450bd4] kvmppc_xive_release+0xdc/0x1b0 [kvm]
[29191.173737] [c000003f7f993c30] [c00800000d436a98] kvm_device_release+0xb0/0x110 [kvm]
[29191.173816] [c000003f7f993c70] [c00000000046730c] __fput+0xec/0x320
[29191.173908] [c000003f7f993cd0] [c000000000164ae0] task_work_run+0x150/0x1c0
[29191.173968] [c000003f7f993d30] [c000000000025034] do_notify_resume+0x304/0x440
[29191.174047] [c000003f7f993e20] [c00000000000dcc4] ret_from_except_lite+0x70/0x74
[29191.174136] Instruction dump:
[29191.174155] 3bff0008 7fbfd040 419e0054 847e0004 2fa30000 419effec e93d0000 8929203c 
[29191.174266] 2f890000 419effb8 4800821d e8410018 <e9230010> e9490008 9b2a0039 7c0004ac 
[29191.174348] ---[ end trace b925b67a74a1d8d2 ]---
[29191.372417] 
[29192.372502] Kernel panic - not syncing: Fatal exception


> How hard would it be to exploit
> this, 

I just had to run a guest (SMT1, stride 8, 300 vCPUs) with a patched QEMU
that turns the valid vCPU id 344 into 348 when calling KVM_CAP_IRQ_XICS.

> and would it just be a denial of service, or do you think it
> could be used to get a use-after-free in the kernel or something like
> that?
> 

This triggers a cascade of errors, the ultimate one being a NULL
pointer passed to irq_data_get_irq_handler_data() during the
escalation irq cleanup, if I read it right.

> Also, does this patch depend on the patch 2 in this series?
> 

No it doesn't.

> Paul.


^ permalink raw reply	[flat|nested] 45+ messages in thread


* Re: [PATCH 5/6] KVM: PPC: Book3S HV: XIVE: Make VP block size configurable
  2019-09-23 15:44   ` Greg Kurz
  (?)
@ 2019-09-25  8:47     ` Greg Kurz
  -1 siblings, 0 replies; 45+ messages in thread
From: Greg Kurz @ 2019-09-25  8:47 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Michael Ellerman, Benjamin Herrenschmidt, Cédric Le Goater,
	David Gibson, Paolo Bonzini, Radim Krčmář,
	kvm-ppc, kvm, linuxppc-dev

On Mon, 23 Sep 2019 17:44:00 +0200
Greg Kurz <groug@kaod.org> wrote:

> The XIVE VP is an internal structure which allows the XIVE interrupt
> controller to maintain the interrupt context state of vCPUs not
> dispatched on HW threads.
> 
> When a guest is started, the XIVE KVM device allocates a block of
> XIVE VPs in OPAL, enough to accommodate the highest possible vCPU
> id KVM_MAX_VCPU_ID (16384) packed down to KVM_MAX_VCPUS (2048).
> With a guest's core stride of 8 and a threading mode of 1 (QEMU's
> default), a VM must run at least 256 vCPUs to actually need such a
> range of VPs.
> 
> A POWER9 system has a limited XIVE VP space of 512k, and KVM is
> currently wasting this HW resource with large VP allocations,
> especially since a typical VM likely runs with far fewer vCPUs.
> 
> Make the size of the VP block configurable. Add an nr_servers
> field to the XIVE structure and a function to set it for this
> purpose.
> 
> Split VP allocation out of the device create function. Since the
> VP block isn't used before the first vCPU connects to the XIVE KVM
> device, allocation is now performed by kvmppc_xive_connect_vcpu().
> This gives the opportunity to set nr_servers in between:
> 
>           kvmppc_xive_create() / kvmppc_xive_native_create()
>                                .
>                                .
>                      kvmppc_xive_set_nr_servers()
>                                .
>                                .
>     kvmppc_xive_connect_vcpu() / kvmppc_xive_native_connect_vcpu()
> 
> The connect_vcpu() functions check that the vCPU id is below nr_servers
> and if it is the first vCPU they allocate the VP block. This is protected
> against a concurrent update of nr_servers by kvmppc_xive_set_nr_servers()
> with the xive->lock mutex.
> 
> Also, the block is allocated once for the device lifetime: nr_servers
> should stay constant, otherwise connect_vcpu() could generate a bogus
> VP id and likely crash OPAL. It is thus forbidden to update nr_servers
> once the block is allocated.
> 
> If the VP allocation fails, return ENOSPC, which seems more appropriate
> for reporting the depletion of a system-wide HW resource than ENOMEM or
> ENXIO.
> 
> A VM using a stride of 8 and 1 thread per core with 32 vCPUs would hence
> only need 256 VPs instead of 2048. If the stride is set to match the number
> of threads per core, this goes further down to 32.
> 
> This will be exposed to userspace by a subsequent patch.
> 
> Signed-off-by: Greg Kurz <groug@kaod.org>
> ---
>  arch/powerpc/kvm/book3s_xive.c        |   59 ++++++++++++++++++++++++++-------
>  arch/powerpc/kvm/book3s_xive.h        |    4 ++
>  arch/powerpc/kvm/book3s_xive_native.c |   18 +++-------
>  3 files changed, 56 insertions(+), 25 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
> index 9ac6315fb9ae..4a333dcfddd8 100644
> --- a/arch/powerpc/kvm/book3s_xive.c
> +++ b/arch/powerpc/kvm/book3s_xive.c
> @@ -1213,13 +1213,13 @@ void kvmppc_xive_cleanup_vcpu(struct kvm_vcpu *vcpu)
>  
>  static bool kvmppc_xive_vcpu_id_valid(struct kvmppc_xive *xive, u32 cpu)
>  {
> -	/* We have a block of KVM_MAX_VCPUS VPs. We just need to check
> +	/* We have a block of xive->nr_servers VPs. We just need to check
>  	 * raw vCPU ids are below the expected limit for this guest's
>  	 * core stride ; kvmppc_pack_vcpu_id() will pack them down to an
>  	 * index that can be safely used to compute a VP id that belongs
>  	 * to the VP block.
>  	 */
> -	return cpu < KVM_MAX_VCPUS * xive->kvm->arch.emul_smt_mode;
> +	return cpu < xive->nr_servers * xive->kvm->arch.emul_smt_mode;
>  }
>  
>  int kvmppc_xive_compute_vp_id(struct kvmppc_xive *xive, u32 cpu, u32 *vp)
> @@ -1231,6 +1231,14 @@ int kvmppc_xive_compute_vp_id(struct kvmppc_xive *xive, u32 cpu, u32 *vp)
>  		return -EINVAL;
>  	}
>  
> +	if (xive->vp_base == XIVE_INVALID_VP) {
> +		xive->vp_base = xive_native_alloc_vp_block(xive->nr_servers);
> +		pr_devel("VP_Base=%x nr_servers=%d\n", xive->vp_base, xive->nr_servers);
> +
> +		if (xive->vp_base == XIVE_INVALID_VP)
> +			return -ENOSPC;
> +	}
> +
>  	vp_id = kvmppc_xive_vp(xive, cpu);
>  	if (kvmppc_xive_vp_in_use(xive->kvm, vp_id)) {
>  		pr_devel("Duplicate !\n");
> @@ -1858,6 +1866,37 @@ int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level,
>  	return 0;
>  }
>  
> +int kvmppc_xive_set_nr_servers(struct kvmppc_xive *xive, u64 addr)
> +{
> +	u32 __user *ubufp = (u32 __user *) addr;
> +	u32 nr_servers;
> +	int rc = 0;
> +
> +	if (get_user(nr_servers, ubufp))
> +		return -EFAULT;
> +
> +	pr_devel("%s nr_servers=%u\n", __func__, nr_servers);
> +
> +	if (nr_servers > KVM_MAX_VCPUS)

Drat, this is wrong since QEMU can generate higher vCPU ids (which
is why we need to pack them in the first place). We should check
against KVM_MAX_VCPU_ID here...

> +		return -EINVAL;
> +
> +	mutex_lock(&xive->lock);
> +	/* The VP block is allocated once and freed when the device is
> +	 * released. Better not allow to change its size since its used
> +	 * by connect_vcpu to validate vCPU ids are valid (eg, setting
> +	 * it back to a higher value could allow connect_vcpu to come
> +	 * up with a VP id that goes beyond the VP block, which is likely
> +	 * to cause a crash in OPAL).
> +	 */
> +	if (xive->vp_base != XIVE_INVALID_VP)
> +		rc = -EBUSY;
> +	else
> +		xive->nr_servers = nr_servers;

... and clip down to KVM_MAX_VCPUS here.

I'll fix this in v2.

> +	mutex_unlock(&xive->lock);
> +
> +	return rc;
> +}
> +
>  static int xive_set_attr(struct kvm_device *dev, struct kvm_device_attr *attr)
>  {
>  	struct kvmppc_xive *xive = dev->private;
> @@ -2034,7 +2073,6 @@ static int kvmppc_xive_create(struct kvm_device *dev, u32 type)
>  {
>  	struct kvmppc_xive *xive;
>  	struct kvm *kvm = dev->kvm;
> -	int ret = 0;
>  
>  	pr_devel("Creating xive for partition\n");
>  
> @@ -2057,18 +2095,15 @@ static int kvmppc_xive_create(struct kvm_device *dev, u32 type)
>  	else
>  		xive->q_page_order = xive->q_order - PAGE_SHIFT;
>  
> -	/* Allocate a bunch of VPs */
> -	xive->vp_base = xive_native_alloc_vp_block(KVM_MAX_VCPUS);
> -	pr_devel("VP_Base=%x\n", xive->vp_base);
> -
> -	if (xive->vp_base == XIVE_INVALID_VP)
> -		ret = -ENOMEM;
> +	/* VP allocation is delayed to the first call to connect_vcpu */
> +	xive->vp_base = XIVE_INVALID_VP;
> +	/* KVM_MAX_VCPUS limits the number of VMs to roughly 64 per sockets
> +	 * on a POWER9 system.
> +	 */
> +	xive->nr_servers = KVM_MAX_VCPUS;
>  
>  	xive->single_escalation = xive_native_has_single_escalation();
>  
> -	if (ret)
> -		return ret;
> -
>  	dev->private = xive;
>  	kvm->arch.xive = xive;
>  	return 0;
> diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
> index 90cf6ec35a68..382e3a56e789 100644
> --- a/arch/powerpc/kvm/book3s_xive.h
> +++ b/arch/powerpc/kvm/book3s_xive.h
> @@ -135,6 +135,9 @@ struct kvmppc_xive {
>  	/* Flags */
>  	u8	single_escalation;
>  
> +	/* Number of entries in the VP block */
> +	u32	nr_servers;
> +
>  	struct kvmppc_xive_ops *ops;
>  	struct address_space   *mapping;
>  	struct mutex mapping_lock;
> @@ -297,6 +300,7 @@ struct kvmppc_xive *kvmppc_xive_get_device(struct kvm *kvm, u32 type);
>  void xive_cleanup_single_escalation(struct kvm_vcpu *vcpu,
>  				    struct kvmppc_xive_vcpu *xc, int irq);
>  int kvmppc_xive_compute_vp_id(struct kvmppc_xive *xive, u32 cpu, u32 *vp);
> +int kvmppc_xive_set_nr_servers(struct kvmppc_xive *xive, u64 addr);
>  
>  #endif /* CONFIG_KVM_XICS */
>  #endif /* _KVM_PPC_BOOK3S_XICS_H */
> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
> index 6902319c5ee9..5e18364d52a9 100644
> --- a/arch/powerpc/kvm/book3s_xive_native.c
> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> @@ -1069,7 +1069,6 @@ static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type)
>  {
>  	struct kvmppc_xive *xive;
>  	struct kvm *kvm = dev->kvm;
> -	int ret = 0;
>  
>  	pr_devel("Creating xive native device\n");
>  
> @@ -1085,23 +1084,16 @@ static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type)
>  	mutex_init(&xive->mapping_lock);
>  	mutex_init(&xive->lock);
>  
> -	/*
> -	 * Allocate a bunch of VPs. KVM_MAX_VCPUS is a large value for
> -	 * a default. Getting the max number of CPUs the VM was
> -	 * configured with would improve our usage of the XIVE VP space.
> +	/* VP allocation is delayed to the first call to connect_vcpu */
> +	xive->vp_base = XIVE_INVALID_VP;
> +	/* KVM_MAX_VCPUS limits the number of VMs to roughly 64 per sockets
> +	 * on a POWER9 system.
>  	 */
> -	xive->vp_base = xive_native_alloc_vp_block(KVM_MAX_VCPUS);
> -	pr_devel("VP_Base=%x\n", xive->vp_base);
> -
> -	if (xive->vp_base == XIVE_INVALID_VP)
> -		ret = -ENXIO;
> +	xive->nr_servers = KVM_MAX_VCPUS;
>  
>  	xive->single_escalation = xive_native_has_single_escalation();
>  	xive->ops = &kvmppc_xive_native_ops;
>  
> -	if (ret)
> -		return ret;
> -
>  	dev->private = xive;
>  	kvm->arch.xive = xive;
>  	return 0;
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2019-09-25 11:14 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-23 15:43 [PATCH 0/6] KVM: PPC: Book3S: HV: XIVE: Allocate less VPs in OPAL Greg Kurz
2019-09-23 15:43 ` [PATCH 1/6] KVM: PPC: Book3S HV: XIVE: initialize private pointer when VPs are allocated Greg Kurz
2019-09-24  5:28   ` Paul Mackerras
2019-09-24 11:49     ` Greg Kurz
2019-09-23 15:43 ` [PATCH 2/6] KVM: PPC: Book3S HV: XIVE: Set kvm->arch.xive " Greg Kurz
2019-09-23 15:49   ` Cédric Le Goater
2019-09-23 15:43 ` [PATCH 3/6] KVM: PPC: Book3S HV: XIVE: Ensure VP isn't already in use Greg Kurz
2019-09-23 15:52   ` Cédric Le Goater
2019-09-24  5:33   ` Paul Mackerras
2019-09-24 17:26     ` Greg Kurz
2019-09-23 15:43 ` [PATCH 4/6] KVM: PPC: Book3S HV: XIVE: Compute the VP id in a common helper Greg Kurz
2019-09-23 15:44 ` [PATCH 5/6] KVM: PPC: Book3S HV: XIVE: Make VP block size configurable Greg Kurz
2019-09-25  8:47   ` Greg Kurz
2019-09-23 15:44 ` [PATCH 6/6] KVM: PPC: Book3S HV: XIVE: Allow userspace to set the # of VPs Greg Kurz
2019-09-23 15:56   ` Cédric Le Goater
