kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/9] In-kernel XICS interrupt controller emulation
@ 2013-02-14 23:59 Paul Mackerras
  2013-02-14 23:59 ` [PATCH 1/9] KVM: PPC: Book3S: Add infrastructure to implement kernel-side RTAS calls Paul Mackerras
                   ` (8 more replies)
  0 siblings, 9 replies; 32+ messages in thread
From: Paul Mackerras @ 2013-02-14 23:59 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm, kvm-ppc

This patch series implements in-kernel emulation of the XICS interrupt
controller architecture defined in PAPR (Power Architecture Platform
Requirements, the document that defines IBM's pSeries platform
architecture).

One of the things I have done in this version is to provide a way for
this to coexist with other interrupt controller emulations.  The XICS
emulation exports a vector of function pointers that the generic
book3s code uses to call in to it.  Other emulations could set this
vector to point to their own functions.  (I realize that Scott Wood's
recently-posted patch series uses an entirely orthogonal approach and
may or may not find this useful.)

The interface defined here consists of:

* KVM_CREATE_IRQCHIP_ARGS: like KVM_CREATE_IRQCHIP but takes an
  argument struct containing a `type' field, specifying what overall
  interrupt controller architecture to emulate, and a `param' field
  (unused by the XICS emulation).  This is only called once per VM,
  before VCPUs are created.

* KVM_IRQCHIP_SET_SOURCES: used to create and configure interrupt
  sources in bulk.  The notion of "interrupt source" is fairly
  generic; a source has an identifying number (20 bits in the current
  implementation), a priority, a destination (a vcpu number or
  potentially a vcpu group identifier), plus flags indicating
  edge/level sensitivity, a masked bit, and a pending bit.

* KVM_IRQCHIP_GET_SOURCES: used to query the configuration and status
  of interrupt sources in bulk.

* A 64-bit one_reg identifier, KVM_REG_PPC_ICP_STATE, used to get and
  set per-vcpu interrupt controller state (per-vcpu priority,
  interrupt status, and pending IPI and interrupt state).

I believe this corresponds reasonably well to what was discussed at
KVM Forum.

This series is against Alex's kvm-ppc-queue branch, although it also
applies cleanly on the kvm tree's next branch.

Paul.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 1/9] KVM: PPC: Book3S: Add infrastructure to implement kernel-side RTAS calls
  2013-02-14 23:59 [PATCH 0/9] In-kernel XICS interrupt controller emulation Paul Mackerras
@ 2013-02-14 23:59 ` Paul Mackerras
  2013-03-21  8:52   ` Alexander Graf
  2013-02-15  0:00 ` [PATCH 2/9] KVM: PPC: Remove unused argument to kvmppc_core_dequeue_external Paul Mackerras
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: Paul Mackerras @ 2013-02-14 23:59 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm, kvm-ppc

From: Michael Ellerman <michael@ellerman.id.au>

For pseries machine emulation, in order to move the interrupt
controller code to the kernel, we need to intercept some RTAS
calls in the kernel itself.  This adds an infrastructure to allow
in-kernel handlers to be registered for RTAS services by name.
A new ioctl, KVM_PPC_RTAS_DEFINE_TOKEN, then allows userspace to
associate token values with those service names.  Then, when the
guest requests an RTAS service with one of those token values, it
will be handled by the relevant in-kernel handler rather than being
passed up to userspace as at present.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 Documentation/virtual/kvm/api.txt   |   19 ++++
 arch/powerpc/include/asm/hvcall.h   |    3 +
 arch/powerpc/include/asm/kvm_host.h |    1 +
 arch/powerpc/include/asm/kvm_ppc.h  |    4 +
 arch/powerpc/include/uapi/asm/kvm.h |    6 ++
 arch/powerpc/kvm/Makefile           |    1 +
 arch/powerpc/kvm/book3s_hv.c        |   18 +++-
 arch/powerpc/kvm/book3s_pr_papr.c   |    7 ++
 arch/powerpc/kvm/book3s_rtas.c      |  182 +++++++++++++++++++++++++++++++++++
 arch/powerpc/kvm/powerpc.c          |    9 +-
 include/uapi/linux/kvm.h            |    3 +
 11 files changed, 251 insertions(+), 2 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_rtas.c

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index c2534c3..d3e2d60 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2122,6 +2122,25 @@ header; first `n_valid' valid entries with contents from the data
 written, then `n_invalid' invalid entries, invalidating any previously
 valid entries found.
 
+4.79 KVM_PPC_RTAS_DEFINE_TOKEN
+
+Capability: KVM_CAP_PPC_RTAS
+Architectures: ppc
+Type: vm ioctl
+Parameters: struct kvm_rtas_token_args
+Returns: 0 on success, -1 on error
+
+Defines a token value for a RTAS (Run Time Abstraction Services)
+service in order to allow it to be handled in the kernel.  The
+argument struct gives the name of the service, which must be the name
+of a service that has a kernel-side implementation.  If the token
+value is non-zero, it will be associated with that service, and
+subsequent RTAS calls by the guest specifying that token will be
+handled by the kernel.  If the token value is 0, then any token
+associated with the service will be forgotten, and subsequent RTAS
+calls by the guest for that service will be passed to userspace to be
+handled.
+
 
 5. The kvm_run structure
 ------------------------
diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
index 7a86706..9ea22b2 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -269,6 +269,9 @@
 #define H_GET_MPP_X		0x314
 #define MAX_HCALL_OPCODE	H_GET_MPP_X
 
+/* Platform specific hcalls, used by KVM */
+#define H_RTAS			0xf000
+
 #ifndef __ASSEMBLY__
 
 /**
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 8a72d59..8295dc7 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -255,6 +255,7 @@ struct kvm_arch {
 #endif /* CONFIG_KVM_BOOK3S_64_HV */
 #ifdef CONFIG_PPC_BOOK3S_64
 	struct list_head spapr_tce_tables;
+	struct list_head rtas_tokens;
 #endif
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 44a657a..dd08cfa 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -165,6 +165,10 @@ extern int kvmppc_prepare_to_enter(struct kvm_vcpu *vcpu);
 
 extern int kvm_vm_ioctl_get_htab_fd(struct kvm *kvm, struct kvm_get_htab_fd *);
 
+extern int kvm_vm_ioctl_rtas_define_token(struct kvm *kvm, void __user *argp);
+extern int kvmppc_rtas_hcall(struct kvm_vcpu *vcpu);
+extern void kvmppc_rtas_tokens_free(struct kvm *kvm);
+
 /*
  * Cuts out inst bits with ordering according to spec.
  * That means the leftmost bit is zero. All given bits are included.
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 16064d0..d90743c 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -299,6 +299,12 @@ struct kvm_allocate_rma {
 	__u64 rma_size;
 };
 
+/* for KVM_CAP_PPC_RTAS */
+struct kvm_rtas_token_args {
+	char name[120];
+	__u64 token;	/* Use a token of 0 to undefine a mapping */
+};
+
 struct kvm_book3e_206_tlb_entry {
 	__u32 mas8;
 	__u32 mas1;
diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
index b772ede..432132c 100644
--- a/arch/powerpc/kvm/Makefile
+++ b/arch/powerpc/kvm/Makefile
@@ -86,6 +86,7 @@ kvm-book3s_64-module-objs := \
 	emulate.o \
 	book3s.o \
 	book3s_64_vio.o \
+	book3s_rtas.o \
 	$(kvm-book3s_64-objs-y)
 
 kvm-objs-$(CONFIG_KVM_BOOK3S_64) := $(kvm-book3s_64-module-objs)
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 80dcc53..567c264 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -479,7 +479,7 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
 	unsigned long req = kvmppc_get_gpr(vcpu, 3);
 	unsigned long target, ret = H_SUCCESS;
 	struct kvm_vcpu *tvcpu;
-	int idx;
+	int idx, rc;
 
 	switch (req) {
 	case H_ENTER:
@@ -515,6 +515,19 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
 					kvmppc_get_gpr(vcpu, 5),
 					kvmppc_get_gpr(vcpu, 6));
 		break;
+	case H_RTAS:
+		if (list_empty(&vcpu->kvm->arch.rtas_tokens))
+			return RESUME_HOST;
+
+		rc = kvmppc_rtas_hcall(vcpu);
+
+		if (rc == -ENOENT)
+			return RESUME_HOST;
+		else if (rc == 0)
+			break;
+
+		/* Send the error out to userspace via KVM_RUN */
+		return rc;
 	default:
 		return RESUME_HOST;
 	}
@@ -1821,6 +1834,7 @@ int kvmppc_core_init_vm(struct kvm *kvm)
 	cpumask_setall(&kvm->arch.need_tlb_flush);
 
 	INIT_LIST_HEAD(&kvm->arch.spapr_tce_tables);
+	INIT_LIST_HEAD(&kvm->arch.rtas_tokens);
 
 	kvm->arch.rma = NULL;
 
@@ -1866,6 +1880,8 @@ void kvmppc_core_destroy_vm(struct kvm *kvm)
 		kvm->arch.rma = NULL;
 	}
 
+	kvmppc_rtas_tokens_free(kvm);
+
 	kvmppc_free_hpt(kvm);
 	WARN_ON(!list_empty(&kvm->arch.spapr_tce_tables));
 }
diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
index ee02b30..4efa4a4 100644
--- a/arch/powerpc/kvm/book3s_pr_papr.c
+++ b/arch/powerpc/kvm/book3s_pr_papr.c
@@ -246,6 +246,13 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
 		clear_bit(KVM_REQ_UNHALT, &vcpu->requests);
 		vcpu->stat.halt_wakeup++;
 		return EMULATE_DONE;
+	case H_RTAS:
+		if (list_empty(&vcpu->kvm->arch.rtas_tokens))
+			return RESUME_HOST;
+		if (kvmppc_rtas_hcall(vcpu))
+			break;
+		kvmppc_set_gpr(vcpu, 3, 0);
+		return EMULATE_DONE;
 	}
 
 	return EMULATE_FAIL;
diff --git a/arch/powerpc/kvm/book3s_rtas.c b/arch/powerpc/kvm/book3s_rtas.c
new file mode 100644
index 0000000..8a324e8
--- /dev/null
+++ b/arch/powerpc/kvm/book3s_rtas.c
@@ -0,0 +1,182 @@
+/*
+ * Copyright 2012 Michael Ellerman, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, version 2, as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/kernel.h>
+#include <linux/kvm_host.h>
+#include <linux/kvm.h>
+#include <linux/err.h>
+
+#include <asm/uaccess.h>
+#include <asm/kvm_book3s.h>
+#include <asm/kvm_ppc.h>
+#include <asm/hvcall.h>
+#include <asm/rtas.h>
+
+
+struct rtas_handler {
+	void (*handler)(struct kvm_vcpu *vcpu, struct rtas_args *args);
+	char *name;
+};
+
+static struct rtas_handler rtas_handlers[] = { };
+
+struct rtas_token_definition {
+	struct list_head list;
+	struct rtas_handler *handler;
+	u64 token;
+};
+
+static int rtas_name_matches(char *s1, char *s2)
+{
+	struct kvm_rtas_token_args args;
+	return !strncmp(s1, s2, sizeof(args.name));
+}
+
+static int rtas_token_undefine(struct kvm *kvm, char *name)
+{
+	struct rtas_token_definition *d, *tmp;
+
+	lockdep_assert_held(&kvm->lock);
+
+	list_for_each_entry_safe(d, tmp, &kvm->arch.rtas_tokens, list) {
+		if (rtas_name_matches(d->handler->name, name)) {
+			list_del(&d->list);
+			kfree(d);
+			return 0;
+		}
+	}
+
+	/* It's not an error to undefine an undefined token */
+	return 0;
+}
+
+static int rtas_token_define(struct kvm *kvm, char *name, u64 token)
+{
+	struct rtas_token_definition *d;
+	struct rtas_handler *h;
+	bool found;
+	int i;
+
+	lockdep_assert_held(&kvm->lock);
+
+	list_for_each_entry(d, &kvm->arch.rtas_tokens, list) {
+		if (d->token == token)
+			return -EEXIST;
+	}
+
+	found = false;
+	for (i = 0; i < ARRAY_SIZE(rtas_handlers); i++) {
+		h = &rtas_handlers[i];
+		if (rtas_name_matches(h->name, name)) {
+			found = true;
+			break;
+		}
+	}
+
+	if (!found)
+		return -ENOENT;
+
+	d = kzalloc(sizeof(*d), GFP_KERNEL);
+	if (!d)
+		return -ENOMEM;
+
+	d->handler = h;
+	d->token = token;
+
+	list_add_tail(&d->list, &kvm->arch.rtas_tokens);
+
+	return 0;
+}
+
+int kvm_vm_ioctl_rtas_define_token(struct kvm *kvm, void __user *argp)
+{
+	struct kvm_rtas_token_args args;
+	int rc;
+
+	if (copy_from_user(&args, argp, sizeof(args)))
+		return -EFAULT;
+
+	mutex_lock(&kvm->lock);
+
+	if (args.token)
+		rc = rtas_token_define(kvm, args.name, args.token);
+	else
+		rc = rtas_token_undefine(kvm, args.name);
+
+	mutex_unlock(&kvm->lock);
+
+	return rc;
+}
+
+int kvmppc_rtas_hcall(struct kvm_vcpu *vcpu)
+{
+	struct rtas_token_definition *d;
+	struct rtas_args args;
+	rtas_arg_t *orig_rets;
+	gpa_t args_phys;
+	int rc;
+
+	/* r4 contains the guest physical address of the RTAS args */
+	args_phys = kvmppc_get_gpr(vcpu, 4);
+
+	rc = kvm_read_guest(vcpu->kvm, args_phys, &args, sizeof(args));
+	if (rc)
+		goto fail;
+
+	/*
+	 * args->rets is a pointer into args->args. Now that we've
+	 * copied args we need to fix it up to point into our copy,
+	 * not the guest args. We also need to save the original
+	 * value so we can restore it on the way out.
+	 */
+	orig_rets = args.rets;
+	args.rets = &args.args[args.nargs];
+
+	mutex_lock(&vcpu->kvm->lock);
+
+	rc = -ENOENT;
+	list_for_each_entry(d, &vcpu->kvm->arch.rtas_tokens, list) {
+		if (d->token == args.token) {
+			d->handler->handler(vcpu, &args);
+			rc = 0;
+			break;
+		}
+	}
+
+	mutex_unlock(&vcpu->kvm->lock);
+
+	if (rc == 0) {
+		args.rets = orig_rets;
+		rc = kvm_write_guest(vcpu->kvm, args_phys, &args, sizeof(args));
+		if (rc)
+			goto fail;
+	}
+
+	return rc;
+
+fail:
+	/*
+	 * We only get here if the guest has called RTAS with a bogus
+	 * args pointer. That means we can't get to the args, and so we
+	 * can't fail the RTAS call. So fail right out to userspace,
+	 * which should kill the guest.
+	 */
+	return rc;
+}
+
+void kvmppc_rtas_tokens_free(struct kvm *kvm)
+{
+	struct rtas_token_definition *d, *tmp;
+
+	lockdep_assert_held(&kvm->lock);
+
+	list_for_each_entry_safe(d, tmp, &kvm->arch.rtas_tokens, list) {
+		list_del(&d->list);
+		kfree(d);
+	}
+}
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 934413c..26d8003 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -335,6 +335,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 #ifdef CONFIG_PPC_BOOK3S_64
 	case KVM_CAP_SPAPR_TCE:
 	case KVM_CAP_PPC_ALLOC_HTAB:
+	case KVM_CAP_PPC_RTAS:
 		r = 1;
 		break;
 #endif /* CONFIG_PPC_BOOK3S_64 */
@@ -946,8 +947,8 @@ long kvm_arch_vm_ioctl(struct file *filp,
 
 #ifdef CONFIG_KVM_BOOK3S_64_HV
 	case KVM_ALLOCATE_RMA: {
-		struct kvm *kvm = filp->private_data;
 		struct kvm_allocate_rma rma;
+		struct kvm *kvm = filp->private_data;
 
 		r = kvm_vm_ioctl_allocate_rma(kvm, &rma);
 		if (r >= 0 && copy_to_user(argp, &rma, sizeof(rma)))
@@ -995,6 +996,12 @@ long kvm_arch_vm_ioctl(struct file *filp,
 			r = -EFAULT;
 		break;
 	}
+	case KVM_PPC_RTAS_DEFINE_TOKEN: {
+		struct kvm *kvm = filp->private_data;
+
+		r = kvm_vm_ioctl_rtas_define_token(kvm, argp);
+		break;
+	}
 #endif /* CONFIG_PPC_BOOK3S_64 */
 	default:
 		r = -ENOTTY;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 9a2db57..1e2fda0 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -662,6 +662,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_HTAB_FD 84
 #define KVM_CAP_S390_CSS_SUPPORT 85
 #define KVM_CAP_PPC_EPR 86
+#define KVM_CAP_PPC_RTAS 87
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -889,6 +890,8 @@ struct kvm_s390_ucas_mapping {
 #define KVM_ALLOCATE_RMA	  _IOR(KVMIO,  0xa9, struct kvm_allocate_rma)
 /* Available with KVM_CAP_PPC_HTAB_FD */
 #define KVM_PPC_GET_HTAB_FD	  _IOW(KVMIO,  0xaa, struct kvm_get_htab_fd)
+/* Available with KVM_CAP_PPC_RTAS */
+#define KVM_PPC_RTAS_DEFINE_TOKEN _IOW(KVMIO,  0xab, struct kvm_rtas_token_args)
 
 /*
  * ioctls for vcpu fds
-- 
1.7.10.rc3.219.g53414


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 2/9] KVM: PPC: Remove unused argument to kvmppc_core_dequeue_external
  2013-02-14 23:59 [PATCH 0/9] In-kernel XICS interrupt controller emulation Paul Mackerras
  2013-02-14 23:59 ` [PATCH 1/9] KVM: PPC: Book3S: Add infrastructure to implement kernel-side RTAS calls Paul Mackerras
@ 2013-02-15  0:00 ` Paul Mackerras
  2013-03-21  8:58   ` Alexander Graf
  2013-02-15  0:01 ` [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller Paul Mackerras
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: Paul Mackerras @ 2013-02-15  0:00 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm, kvm-ppc

Currently kvmppc_core_dequeue_external() takes a struct kvm_interrupt *
argument and does nothing with it, in any of its implementations.
This removes it in order to make things easier for forthcoming
in-kernel interrupt controller emulation code.

Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 arch/powerpc/include/asm/kvm_ppc.h |    3 +--
 arch/powerpc/kvm/book3s.c          |    3 +--
 arch/powerpc/kvm/booke.c           |    3 +--
 arch/powerpc/kvm/powerpc.c         |    2 +-
 4 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index dd08cfa..be611f6 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -104,8 +104,7 @@ extern void kvmppc_core_queue_dec(struct kvm_vcpu *vcpu);
 extern void kvmppc_core_dequeue_dec(struct kvm_vcpu *vcpu);
 extern void kvmppc_core_queue_external(struct kvm_vcpu *vcpu,
                                        struct kvm_interrupt *irq);
-extern void kvmppc_core_dequeue_external(struct kvm_vcpu *vcpu,
-                                         struct kvm_interrupt *irq);
+extern void kvmppc_core_dequeue_external(struct kvm_vcpu *vcpu);
 extern void kvmppc_core_flush_tlb(struct kvm_vcpu *vcpu);
 
 extern int kvmppc_core_emulate_op(struct kvm_run *run, struct kvm_vcpu *vcpu,
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index a4b6452..6548445 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -160,8 +160,7 @@ void kvmppc_core_queue_external(struct kvm_vcpu *vcpu,
 	kvmppc_book3s_queue_irqprio(vcpu, vec);
 }
 
-void kvmppc_core_dequeue_external(struct kvm_vcpu *vcpu,
-                                  struct kvm_interrupt *irq)
+void kvmppc_core_dequeue_external(struct kvm_vcpu *vcpu)
 {
 	kvmppc_book3s_dequeue_irqprio(vcpu, BOOK3S_INTERRUPT_EXTERNAL);
 	kvmppc_book3s_dequeue_irqprio(vcpu, BOOK3S_INTERRUPT_EXTERNAL_LEVEL);
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 020923e..f72abd1 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -222,8 +222,7 @@ void kvmppc_core_queue_external(struct kvm_vcpu *vcpu,
 	kvmppc_booke_queue_irqprio(vcpu, prio);
 }
 
-void kvmppc_core_dequeue_external(struct kvm_vcpu *vcpu,
-                                  struct kvm_interrupt *irq)
+void kvmppc_core_dequeue_external(struct kvm_vcpu *vcpu)
 {
 	clear_bit(BOOKE_IRQPRIO_EXTERNAL, &vcpu->arch.pending_exceptions);
 	clear_bit(BOOKE_IRQPRIO_EXTERNAL_LEVEL, &vcpu->arch.pending_exceptions);
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 26d8003..1772883 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -741,7 +741,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
 int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu, struct kvm_interrupt *irq)
 {
 	if (irq->irq == KVM_INTERRUPT_UNSET) {
-		kvmppc_core_dequeue_external(vcpu, irq);
+		kvmppc_core_dequeue_external(vcpu);
 		return 0;
 	}
 
-- 
1.7.10.rc3.219.g53414


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
  2013-02-14 23:59 [PATCH 0/9] In-kernel XICS interrupt controller emulation Paul Mackerras
  2013-02-14 23:59 ` [PATCH 1/9] KVM: PPC: Book3S: Add infrastructure to implement kernel-side RTAS calls Paul Mackerras
  2013-02-15  0:00 ` [PATCH 2/9] KVM: PPC: Remove unused argument to kvmppc_core_dequeue_external Paul Mackerras
@ 2013-02-15  0:01 ` Paul Mackerras
  2013-02-15 20:05   ` Scott Wood
  2013-02-15  0:01 ` [PATCH 4/9] KVM: PPC: Book3S: Generalize interfaces to interrupt controller emulation Paul Mackerras
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: Paul Mackerras @ 2013-02-15  0:01 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm, kvm-ppc

From: Benjamin Herrenschmidt <benh@kernel.crashing.org>

This adds in-kernel emulation of the XICS (eXternal Interrupt
Controller Specification) interrupt controller specified by PAPR, for
both HV and PR KVM guests.

This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like
KVM_CREATE_IRQCHIP in that it indicates that the virtual machine
should use in-kernel interrupt controller emulation, but also takes an
argument struct that contains the type of interrupt controller
architecture and an optional parameter.  Currently only one type value
is defined, that which indicates the XICS architecture.

The XICS emulation supports up to 1048560 interrupt sources.
Interrupt source numbers below 16 are reserved; 0 is used to mean no
interrupt and 2 is used for IPIs.  Internally these are represented in
blocks of 1024, called ICS (interrupt controller source) entities, but
that is not visible to userspace.

Two other new ioctls allow userspace to control the interrupt
sources.  The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority,
destination cpu, level/edge sensitivity and pending state of a range
of interrupt sources, creating them if they don't already exist.  The
KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a range of
interrupt sources (they are required to already exist).

Each vcpu gets one ICP (interrupt controller presentation) entity.
They are created automatically when the vcpu is created provided the
KVM_CREATE_IRQCHIP_ARGS ioctl has been performed.

This is based on an initial implementation by Michael Ellerman
<michael@ellerman.id.au> reworked by Benjamin Herrenschmidt and
Paul Mackerras.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 Documentation/virtual/kvm/api.txt     |   51 ++
 arch/powerpc/include/asm/kvm_book3s.h |    1 +
 arch/powerpc/include/asm/kvm_host.h   |    8 +
 arch/powerpc/include/asm/kvm_ppc.h    |   19 +
 arch/powerpc/kvm/Makefile             |    1 +
 arch/powerpc/kvm/book3s.c             |    2 +-
 arch/powerpc/kvm/book3s_hv.c          |   20 +
 arch/powerpc/kvm/book3s_pr.c          |   13 +
 arch/powerpc/kvm/book3s_pr_papr.c     |   16 +
 arch/powerpc/kvm/book3s_rtas.c        |   51 +-
 arch/powerpc/kvm/book3s_xics.c        | 1101 +++++++++++++++++++++++++++++++++
 arch/powerpc/kvm/book3s_xics.h        |  111 ++++
 arch/powerpc/kvm/powerpc.c            |   23 +
 include/uapi/linux/kvm.h              |   29 +
 14 files changed, 1444 insertions(+), 2 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_xics.c
 create mode 100644 arch/powerpc/kvm/book3s_xics.h

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index d3e2d60..0ff9dcf 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2141,6 +2141,57 @@ associated with the service will be forgotten, and subsequent RTAS
 calls by the guest for that service will be passed to userspace to be
 handled.
 
+4.80 KVM_CREATE_IRQCHIP_ARGS
+
+Capability: KVM_CAP_IRQCHIP_ARGS
+Architectures: ppc
+Type: vm ioctl
+Parameters: struct kvm_irqchip_args
+Returns: 0 on success, -1 on error
+
+Creates an interrupt controller model in the kernel.  The type field
+of the argument struct indicates the interrupt controller architecture
+of the virtual machine.  Currently the only value permitted for the
+type field is 1, indicating the XICS (eXternal Interrupt Controller
+Specification) model defined in PAPR.  For XICS, this ioctl indicates
+to the kernel that an interrupt controller presentation (ICP) entity
+should be created for every vcpu, and interrupt controller source
+(ICS) entities should be created to accommodate the sources that are
+configured with the KVM_IRQCHIP_SET_SOURCES ioctl.
+
+4.81 KVM_IRQCHIP_GET_SOURCES
+
+Capability: KVM_CAP_IRQCHIP_ARGS
+Architectures: ppc
+Type: vm ioctl
+Parameters: struct kvm_irq_sources
+Returns: 0 on success, -1 on error
+
+Copies configuration and status information about a range of interrupt
+sources into a user-supplied buffer.  The argument struct gives the
+starting interrupt source number and the number of interrupt sources.
+The user buffer is an array of 64-bit quantities, one per interrupt
+source, with (from the least- significant bit) 32 bits of interrupt
+server number, 8 bits of priority, and 1 bit each for a
+level-sensitive indicator, a masked indicator, and a pending
+indicator.  If some of the sources in the range don't exist, that is,
+have not yet been created with the KVM_IRQCHIP_SET_SOURCES ioctl,
+this returns an ENODEV error.
+
+4.82 KVM_IRQCHIP_SET_SOURCES
+
+Capability: KVM_CAP_IRQCHIP_ARGS
+Architectures: ppc
+Type: vm ioctl
+Parameters: struct kvm_irq_sources
+Returns: 0 on success, -1 on error
+
+Sets the configuration and status for a range of interrupt sources
+from information supplied in a user-supplied buffer, creating the
+sources if they don't already exist.  The argument struct gives the
+starting interrupt source number and the number of interrupt sources.
+The user buffer is formatted as for KVM_IRQCHIP_GET_SOURCES.
+
 
 5. The kvm_run structure
 ------------------------
diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index 5a56e1c..17c9a15 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -142,6 +142,7 @@ extern int kvmppc_mmu_hv_init(void);
 extern int kvmppc_ld(struct kvm_vcpu *vcpu, ulong *eaddr, int size, void *ptr, bool data);
 extern int kvmppc_st(struct kvm_vcpu *vcpu, ulong *eaddr, int size, void *ptr, bool data);
 extern void kvmppc_book3s_queue_irqprio(struct kvm_vcpu *vcpu, unsigned int vec);
+extern void kvmppc_book3s_dequeue_irqprio(struct kvm_vcpu *vcpu, unsigned int vec);
 extern void kvmppc_inject_interrupt(struct kvm_vcpu *vcpu, int vec, u64 flags);
 extern void kvmppc_set_bat(struct kvm_vcpu *vcpu, struct kvmppc_bat *bat,
 			   bool upper, u32 val);
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 8295dc7..b05e7cd 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -188,6 +188,10 @@ struct kvmppc_linear_info {
 	int		 type;
 };
 
+/* XICS components, defined in boo3s_xics.c */
+struct kvmppc_xics;
+struct kvmppc_icp;
+
 /*
  * The reverse mapping array has one entry for each HPTE,
  * which stores the guest's view of the second word of the HPTE
@@ -256,6 +260,7 @@ struct kvm_arch {
 #ifdef CONFIG_PPC_BOOK3S_64
 	struct list_head spapr_tce_tables;
 	struct list_head rtas_tokens;
+	struct kvmppc_xics *xics;
 #endif
 };
 
@@ -572,6 +577,9 @@ struct kvm_vcpu_arch {
 	u64 busy_stolen;
 	u64 busy_preempt;
 #endif
+#ifdef CONFIG_PPC_BOOK3S_64
+	struct kvmppc_icp *icp; /* XICS presentation controller */
+#endif
 };
 
 /* Values for vcpu->arch.state */
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index be611f6..f0fd22b 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -130,6 +130,13 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
 extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
 			struct kvm_memory_slot *memslot, unsigned long porder);
 extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
+extern int kvmppc_xics_hcall(struct kvm_vcpu *vcpu, u32 cmd);
+extern int kvmppc_xics_ioctl(struct kvm *kvm, unsigned ioctl, unsigned long arg);
+extern int kvmppc_xics_create(struct kvm *kvm, struct kvm_irqchip_args *args);
+extern void kvmppc_xics_free_icp(struct kvm_vcpu *vcpu);
+extern int kvmppc_xics_create_icp(struct kvm_vcpu *vcpu);
+extern void kvmppc_xics_free(struct kvm *kvm);
+
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce *args);
 extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
@@ -167,6 +174,8 @@ extern int kvm_vm_ioctl_get_htab_fd(struct kvm *kvm, struct kvm_get_htab_fd *);
 extern int kvm_vm_ioctl_rtas_define_token(struct kvm *kvm, void __user *argp);
 extern int kvmppc_rtas_hcall(struct kvm_vcpu *vcpu);
 extern void kvmppc_rtas_tokens_free(struct kvm *kvm);
+extern int kvmppc_xics_set_xive(struct kvm *kvm, u32 irq, u32 server, u32 priority);
+extern int kvmppc_xics_get_xive(struct kvm *kvm, u32 irq, u32 *server, u32 *priority);
 
 /*
  * Cuts out inst bits with ordering according to spec.
@@ -263,6 +272,16 @@ static inline void kvmppc_set_xics_phys(int cpu, unsigned long addr)
 
 static inline void kvm_linear_init(void)
 {}
+
+#endif
+
+#ifdef CONFIG_PPC_BOOK3S_64
+static inline int kvmppc_xics_enabled(struct kvm *kvm)
+{
+	return kvm->arch.xics != NULL;
+}
+#else
+static inline int kvmppc_xics_enabled(struct kvm *kvm) { return 0; }
 #endif
 
 static inline void kvmppc_set_epr(struct kvm_vcpu *vcpu, u32 epr)
diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
index 432132c..e2eb04c 100644
--- a/arch/powerpc/kvm/Makefile
+++ b/arch/powerpc/kvm/Makefile
@@ -87,6 +87,7 @@ kvm-book3s_64-module-objs := \
 	book3s.o \
 	book3s_64_vio.o \
 	book3s_rtas.o \
+	book3s_xics.o \
 	$(kvm-book3s_64-objs-y)
 
 kvm-objs-$(CONFIG_KVM_BOOK3S_64) := $(kvm-book3s_64-module-objs)
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 6548445..c5a4478 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -104,7 +104,7 @@ static int kvmppc_book3s_vec2irqprio(unsigned int vec)
 	return prio;
 }
 
-static void kvmppc_book3s_dequeue_irqprio(struct kvm_vcpu *vcpu,
+void kvmppc_book3s_dequeue_irqprio(struct kvm_vcpu *vcpu,
 					  unsigned int vec)
 {
 	unsigned long old_pending = vcpu->arch.pending_exceptions;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 567c264..aa3a0db 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -528,6 +528,14 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
 
 		/* Send the error out to userspace via KVM_RUN */
 		return rc;
+	case H_XIRR:
+	case H_CPPR:
+	case H_EOI:
+	case H_IPI:
+		if (kvmppc_xics_enabled(vcpu->kvm)) {
+			ret = kvmppc_xics_hcall(vcpu, req);
+			break;
+		} /* fallthrough */
 	default:
 		return RESUME_HOST;
 	}
@@ -886,6 +894,13 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
 	spin_lock_init(&vcpu->arch.tbacct_lock);
 	vcpu->arch.busy_preempt = TB_NIL;
 
+	/* Create the XICS */
+	if (kvmppc_xics_enabled(kvm)) {
+		err = kvmppc_xics_create_icp(vcpu);
+		if (err < 0)
+			goto free_vcpu;
+	}
+
 	kvmppc_mmu_book3s_hv_init(vcpu);
 
 	vcpu->arch.state = KVMPPC_VCPU_NOTREADY;
@@ -937,6 +952,8 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
 		kvmppc_unpin_guest_page(vcpu->kvm, vcpu->arch.vpa.pinned_addr);
 	spin_unlock(&vcpu->arch.vpa_update_lock);
 	kvm_vcpu_uninit(vcpu);
+	if (kvmppc_xics_enabled(vcpu->kvm))
+		kvmppc_xics_free_icp(vcpu);
 	kmem_cache_free(kvm_vcpu_cache, vcpu);
 }
 
@@ -1882,6 +1899,9 @@ void kvmppc_core_destroy_vm(struct kvm *kvm)
 
 	kvmppc_rtas_tokens_free(kvm);
 
+	if (kvmppc_xics_enabled(kvm))
+		kvmppc_xics_free(kvm);
+
 	kvmppc_free_hpt(kvm);
 	WARN_ON(!list_empty(&kvm->arch.spapr_tce_tables));
 }
diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index 73ed11c..9b2237f 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -1069,6 +1069,13 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
 	if (err < 0)
 		goto uninit_vcpu;
 
+	/* Create the XICS */
+	if (kvmppc_xics_enabled(kvm)) {
+		err = kvmppc_xics_create_icp(vcpu);
+		if (err < 0)
+			goto free_vcpu;
+	}
+
 	return vcpu;
 
 uninit_vcpu:
@@ -1085,6 +1092,8 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
 {
 	struct kvmppc_vcpu_book3s *vcpu_book3s = to_book3s(vcpu);
 
+	if (kvmppc_xics_enabled(vcpu->kvm))
+		kvmppc_xics_free_icp(vcpu);
 	free_page((unsigned long)vcpu->arch.shared & PAGE_MASK);
 	kvm_vcpu_uninit(vcpu);
 	kfree(vcpu_book3s->shadow_vcpu);
@@ -1293,6 +1302,7 @@ int kvmppc_core_init_vm(struct kvm *kvm)
 {
 #ifdef CONFIG_PPC64
 	INIT_LIST_HEAD(&kvm->arch.spapr_tce_tables);
+	INIT_LIST_HEAD(&kvm->arch.rtas_tokens);
 #endif
 
 	return 0;
@@ -1303,6 +1313,9 @@ void kvmppc_core_destroy_vm(struct kvm *kvm)
 #ifdef CONFIG_PPC64
 	WARN_ON(!list_empty(&kvm->arch.spapr_tce_tables));
 #endif
+	if (kvmppc_xics_enabled(kvm))
+		kvmppc_xics_free(kvm);
+
 }
 
 static int kvmppc_book3s_init(void)
diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
index 4efa4a4..94cec5b 100644
--- a/arch/powerpc/kvm/book3s_pr_papr.c
+++ b/arch/powerpc/kvm/book3s_pr_papr.c
@@ -227,6 +227,15 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
 	return EMULATE_DONE;
 }
 
+static int kvmppc_h_pr_xics_hcall(struct kvm_vcpu *vcpu, u32 cmd)
+{
+	long rc = kvmppc_xics_hcall(vcpu, cmd);
+	if (rc == H_TOO_HARD)
+		return EMULATE_FAIL;
+	kvmppc_set_gpr(vcpu, 3, rc);
+	return EMULATE_DONE;
+}
+
 int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
 {
 	switch (cmd) {
@@ -246,6 +255,13 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
 		clear_bit(KVM_REQ_UNHALT, &vcpu->requests);
 		vcpu->stat.halt_wakeup++;
 		return EMULATE_DONE;
+	case H_XIRR:
+	case H_CPPR:
+	case H_EOI:
+	case H_IPI:
+		if (kvmppc_xics_enabled(vcpu->kvm))
+			return kvmppc_h_pr_xics_hcall(vcpu, cmd);
+		break;
 	case H_RTAS:
 		if (list_empty(&vcpu->kvm->arch.rtas_tokens))
 			return RESUME_HOST;
diff --git a/arch/powerpc/kvm/book3s_rtas.c b/arch/powerpc/kvm/book3s_rtas.c
index 8a324e8..6a6c1fe 100644
--- a/arch/powerpc/kvm/book3s_rtas.c
+++ b/arch/powerpc/kvm/book3s_rtas.c
@@ -18,12 +18,61 @@
 #include <asm/rtas.h>
 
 
+static void kvm_rtas_set_xive(struct kvm_vcpu *vcpu, struct rtas_args *args)
+{
+	u32 irq, server, priority;
+	int rc;
+
+	if (args->nargs != 3 || args->nret != 1) {
+		rc = -3;
+		goto out;
+	}
+
+	irq = args->args[0];
+	server = args->args[1];
+	priority = args->args[2];
+
+	rc = kvmppc_xics_set_xive(vcpu->kvm, irq, server, priority);
+	if (rc)
+		rc = -3;
+out:
+	args->rets[0] = rc;
+}
+
+static void kvm_rtas_get_xive(struct kvm_vcpu *vcpu, struct rtas_args *args)
+{
+	u32 irq, server, priority;
+	int rc;
+
+	if (args->nargs != 1 || args->nret != 3) {
+		rc = -3;
+		goto out;
+	}
+
+	irq = args->args[0];
+
+	server = priority = 0;
+	rc = kvmppc_xics_get_xive(vcpu->kvm, irq, &server, &priority);
+	if (rc) {
+		rc = -3;
+		goto out;
+	}
+
+	args->rets[1] = server;
+	args->rets[2] = priority;
+out:
+	args->rets[0] = rc;
+}
+
 struct rtas_handler {
 	void (*handler)(struct kvm_vcpu *vcpu, struct rtas_args *args);
 	char *name;
 };
 
-static struct rtas_handler rtas_handlers[] = { };
+static struct rtas_handler rtas_handlers[] = {
+	{ .name = "ibm,set-xive", .handler = kvm_rtas_set_xive },
+	{ .name = "ibm,get-xive", .handler = kvm_rtas_get_xive },
+};
 
 struct rtas_token_definition {
 	struct list_head list;
diff --git a/arch/powerpc/kvm/book3s_xics.c b/arch/powerpc/kvm/book3s_xics.c
new file mode 100644
index 0000000..7749060
--- /dev/null
+++ b/arch/powerpc/kvm/book3s_xics.c
@@ -0,0 +1,1101 @@
+/*
+ * Copyright 2012 Michael Ellerman, IBM Corporation.
+ * Copyright 2012 Benjamin Herrenschmidt, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, version 2, as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/kernel.h>
+#include <linux/kvm_host.h>
+#include <linux/err.h>
+#include <linux/gfp.h>
+
+#include <asm/uaccess.h>
+#include <asm/kvm_book3s.h>
+#include <asm/kvm_ppc.h>
+#include <asm/hvcall.h>
+#include <asm/xics.h>
+#include <asm/debug.h>
+
+#include <linux/debugfs.h>
+#include <linux/seq_file.h>
+
+#include "book3s_xics.h"
+
+#define XICS_DBG(fmt...) do { } while (0)
+//#define XICS_DBG(fmt...) do { trace_printk(fmt); } while (0)
+
+/*
+ * LOCKING
+ * =======
+ *
+ * Each ICS has a mutex protecting the information about the IRQ
+ * sources and avoiding simultaneous deliveries if the same interrupt.
+ *
+ * ICP operations are done via a single compare & swap transaction
+ * (most ICP state fits in the union kvmppc_icp_state)
+ */
+
+/*
+ * TODO
+ * ====
+ *
+ * - To speed up resends, keep a bitmap of "resend" set bits in the
+ *   ICS
+ *
+ * - Speed up server# -> ICP lookup (array ? hash table ?)
+ *
+ * - Make ICS lockless as well, or at least a per-interrupt lock or hashed
+ *   locks array to improve scalability
+ *
+ * - ioctl's to save/restore the entire state for snapshot & migration
+ */
+
+/* -- ICS routines -- */
+
+static void icp_deliver_irq(struct kvmppc_xics *xics, struct kvmppc_icp *icp,
+			    u32 new_irq);
+
+static void ics_deliver_irq(struct kvmppc_xics *xics, u32 irq, u32 level)
+{
+	struct ics_irq_state *state;
+	struct kvmppc_ics *ics;	
+	u16 src;
+
+	XICS_DBG("ics deliver %#x (level: %d)\n", irq, level);
+
+	ics = kvmppc_xics_find_ics(xics, irq, &src);
+	if (!ics) {
+		XICS_DBG("ics_deliver_irq: IRQ 0x%06x not found !\n", irq);
+		return;
+	}
+	state = &ics->irq_state[src];
+
+	/*
+	 * We set state->asserted locklessly. This should be fine as
+	 * we are the only setter, thus concurrent access is undefined
+	 * to begin with.
+	 */
+	if (level == KVM_INTERRUPT_SET_LEVEL)
+		state->asserted = 1;
+	else if (level == KVM_INTERRUPT_UNSET) {
+		state->asserted = 0;
+		return;
+	}
+
+	/* Attempt delivery */
+	icp_deliver_irq(xics, NULL, irq);
+}
+
+static void ics_check_resend(struct kvmppc_xics *xics, struct kvmppc_ics *ics,
+			     struct kvmppc_icp *icp)
+{
+	int i;
+
+	mutex_lock(&ics->lock);
+
+	for (i = 0; i < KVMPPC_XICS_IRQ_PER_ICS; i++) {
+		struct ics_irq_state *state = &ics->irq_state[i];
+
+		if (!state->resend)
+			continue;
+
+		XICS_DBG("resend %#x prio %#x\n", state->number,
+			      state->priority);
+
+		mutex_unlock(&ics->lock);
+		icp_deliver_irq(xics, icp, state->number);
+		mutex_lock(&ics->lock);
+	}
+
+	mutex_unlock(&ics->lock);
+}
+
+int kvmppc_xics_set_xive(struct kvm *kvm, u32 irq, u32 server, u32 priority)
+{
+	struct kvmppc_xics *xics = kvm->arch.xics;
+	struct kvmppc_icp *icp;
+	struct kvmppc_ics *ics;
+	struct ics_irq_state *state;
+	u16 src;
+	bool deliver;
+
+	if (!xics)
+		return -ENODEV;
+
+	ics = kvmppc_xics_find_ics(xics, irq, &src);
+	if (!ics)
+		return -EINVAL;
+	state = &ics->irq_state[src];
+
+	icp = kvmppc_xics_find_server(kvm, server);
+	if (!icp)
+		return -EINVAL;
+
+	mutex_lock(&ics->lock);
+
+	XICS_DBG("set_xive %#x server %#x prio %#x MP:%d RS:%d\n",
+		 irq, server, priority,
+		 state->masked_pending, state->resend);
+
+	state->server = server;
+	state->priority = priority;
+	deliver = false;
+	if ((state->masked_pending || state->resend) && priority != MASKED) {
+		state->masked_pending = 0;
+		deliver = true;
+	}
+
+	mutex_unlock(&ics->lock);
+
+	if (deliver)
+		icp_deliver_irq(xics, icp, irq);
+
+	return 0;
+}
+
+int kvmppc_xics_get_xive(struct kvm *kvm, u32 irq, u32 *server, u32 *priority)
+{
+	struct kvmppc_xics *xics = kvm->arch.xics;
+	struct kvmppc_ics *ics;
+	struct ics_irq_state *state;
+	u16 src;
+
+	if (!xics)
+		return -ENODEV;
+
+	ics = kvmppc_xics_find_ics(xics, irq, &src);
+	if (!ics)
+		return -EINVAL;
+	state = &ics->irq_state[src];
+
+	mutex_lock(&ics->lock);
+	*server = state->server;
+	*priority = state->priority;
+	mutex_unlock(&ics->lock);
+
+	return 0;
+}
+
+/* -- ICP routines, including hcalls -- */
+
+static inline bool icp_try_update(struct kvmppc_icp *icp,
+				  union kvmppc_icp_state old,
+				  union kvmppc_icp_state new,
+				  bool change_self)
+{
+	bool success;
+
+	/* Calculate new output value */
+	new.out_ee = (new.xisr && (new.pending_pri < new.cppr));
+
+	/* Attempt atomic update */
+	success = cmpxchg64(&icp->state.raw, old.raw, new.raw) == old.raw;
+	if (!success)
+		goto bail;
+
+	XICS_DBG("UPD [%04x] - C:%02x M:%02x PP: %02x PI:%06x R:%d O:%d\n",
+		 icp->vcpu->vcpu_id,
+		 old.cppr, old.mfrr, old.pending_pri, old.xisr,
+		 old.need_resend, old.out_ee);
+	XICS_DBG("UPD        - C:%02x M:%02x PP: %02x PI:%06x R:%d O:%d\n",
+		 new.cppr, new.mfrr, new.pending_pri, new.xisr,
+		 new.need_resend, new.out_ee);
+	/*
+	 * Check for output state update
+	 *
+	 * Note that this is racy since another processor could be updating
+	 * the state already. This is why we never clear the interrupt output
+	 * here, we only ever set it. The clear only happens prior to doing
+	 * an update and only by the processor itself. Currently we do it
+	 * in Accept (H_XIRR) and Up_Cppr (H_XPPR).
+	 *
+	 * We also do not try to figure out whether the EE state has changed,
+	 * we unconditionally set it if the new state calls for it for the
+	 * same reason.
+	 */
+	if (new.out_ee) {
+		kvmppc_book3s_queue_irqprio(icp->vcpu,
+					    BOOK3S_INTERRUPT_EXTERNAL_LEVEL);
+		if (!change_self)
+			kvm_vcpu_kick(icp->vcpu);
+	}
+ bail:
+	return success;
+}
+
+static void icp_check_resend(struct kvmppc_xics *xics,
+			     struct kvmppc_icp *icp)
+{
+	u32 icsid;
+	
+	/* Order this load with the test for need_resend in the caller */
+	smp_rmb();
+	for_each_set_bit(icsid, icp->resend_map, xics->max_icsid + 1) {
+		struct kvmppc_ics *ics = xics->ics[icsid];
+
+		if (!test_and_clear_bit(icsid, icp->resend_map))
+			continue;
+		if (!ics)
+			continue;
+		ics_check_resend(xics, ics, icp);
+	}
+}
+
+static bool icp_try_to_deliver(struct kvmppc_icp *icp, u32 irq, u8 priority,
+			       u32 *reject)
+{
+	union kvmppc_icp_state old_state, new_state;
+	bool success;
+
+	XICS_DBG("try deliver %#x(P:%#x) to server %#x\n", irq, priority,
+		 icp->vcpu->vcpu_id);
+
+	do {
+		old_state = new_state = ACCESS_ONCE(icp->state);
+
+		*reject = 0;
+
+		/* See if we can deliver */
+		success = new_state.cppr > priority &&
+			new_state.mfrr > priority &&
+			new_state.pending_pri > priority;
+
+		/*
+		 * If we can, check for a rejection and perform the
+		 * delivery
+		 */
+		if (success) {
+			*reject = new_state.xisr;
+			new_state.xisr = irq;
+			new_state.pending_pri = priority;
+		} else {
+			/*
+			 * If we failed to deliver we set need_resend
+			 * so a subsequent CPPR state change causes us
+			 * to try a new delivery.
+			 */
+			new_state.need_resend = true;
+		}
+
+	} while (!icp_try_update(icp, old_state, new_state, false));
+
+	return success;
+}
+
+static void icp_deliver_irq(struct kvmppc_xics *xics, struct kvmppc_icp *icp,
+			    u32 new_irq)
+{
+	struct ics_irq_state *state;
+	struct kvmppc_ics *ics;
+	u32 reject;
+	u16 src;	
+
+	/*
+	 * This is used both for initial delivery of an interrupt and
+	 * for subsequent rejection.
+	 *
+	 * Rejection can be racy vs. resends. We have evaluated the
+	 * rejection in an atomic ICP transaction which is now complete,
+	 * so potentially the ICP can already accept the interrupt again.
+	 *
+	 * So we need to retry the delivery. Essentially the reject path
+	 * boils down to a failed delivery. Always.
+	 *
+	 * Now the interrupt could also have moved to a different target,
+	 * thus we may need to re-do the ICP lookup as well
+	 */
+	 
+ again:
+	/* Get the ICS state and lock it */
+	ics = kvmppc_xics_find_ics(xics, new_irq, &src);
+	if (!ics) {
+		XICS_DBG("icp_deliver_irq: IRQ 0x%06x not found !\n", new_irq);
+		return;
+	}
+	state = &ics->irq_state[src];
+
+	/* Get a lock on the ICS */
+	mutex_lock(&ics->lock);
+
+	/* Get our server */
+	if (!icp || state->server != icp->vcpu->vcpu_id) {
+		icp = kvmppc_xics_find_server(xics->kvm, state->server);
+		if (!icp) {
+			pr_warning("icp_deliver_irq: IRQ 0x%06x server 0x%x"
+				   " not found !\n", new_irq, state->server);
+			goto out;
+		}
+	}
+
+	/* Clear the resend bit of that interrupt */
+	state->resend = 0;
+
+	/*
+	 * If masked, bail out
+	 *
+	 * Note: PAPR doesn't mention anything about masked pending
+	 * when doing a resend, only when doing a delivery.
+	 *
+	 * However that would have the effect of losing a masked
+	 * interrupt that was rejected and isn't consistent with
+	 * the whole masked_pending business which is about not
+	 * losing interrupts that occur while masked.
+	 *
+	 * I don't differenciate normal deliveries and resends, this
+	 * implementation will differ from PAPR and not lose such
+	 * interrupts.
+	 */
+	if (state->priority == MASKED) {
+		XICS_DBG("irq %#x masked pending\n", new_irq);
+		state->masked_pending = 1;
+		goto out;
+	}
+
+	/*
+	 * Try the delivery, this will set the need_resend flag
+	 * in the ICP as part of the atomic transaction if the
+	 * delivery is not possible.
+	 *
+	 * Note that if successful, the new delivery might have itself
+	 * rejected an interrupt that was "delivered" before we took the
+	 * icp mutex.
+	 *
+	 * In this case we do the whole sequence all over again for the
+	 * new guy. We cannot assume that the rejected interrupt is less
+	 * favored than the new one, and thus doesn't need to be delivered,
+	 * because by the time we exit icp_try_to_deliver() the target
+	 * processor may well have alrady consumed & completed it, and thus
+	 * the rejected interrupt might actually be already acceptable.
+	 */
+	if (icp_try_to_deliver(icp, new_irq, state->priority, &reject)) {
+		/*
+		 * Delivery was successful, did we reject somebody else ?
+		 */
+		if (reject && reject != XICS_IPI) {
+			mutex_unlock(&ics->lock);
+			new_irq = reject;
+			goto again;
+		}
+	} else {
+		/*
+		 * We failed to deliver the interrupt we need to set the
+		 * resend map bit and mark the ICS state as needing a resend
+		 */
+		set_bit(ics->icsid, icp->resend_map);
+		state->resend = 1;
+
+		/*
+		 * If the need_resend flag got cleared in the ICP some time
+		 * between icp_try_to_deliver() atomic update and now, then
+		 * we know it might have missed the resend_map bit. So we
+		 * retry
+		 */
+		smp_mb();
+		if (!icp->state.need_resend) {
+			mutex_unlock(&ics->lock);
+			goto again;
+		}
+	}
+ out:
+	mutex_unlock(&ics->lock);
+}
+
+static void icp_down_cppr(struct kvmppc_xics *xics, struct kvmppc_icp *icp,
+			  u8 new_cppr)
+{
+	union kvmppc_icp_state old_state, new_state;
+	bool resend;
+
+	/*
+	 * This handles several related states in one operation:
+	 *
+	 * ICP State: Down_CPPR
+	 *
+	 * Load CPPR with new value and if the XISR is 0
+	 * then check for resends:
+	 *
+	 * ICP State: Resend
+	 *
+	 * If MFRR is more favored than CPPR, check for IPIs
+	 * and notify ICS of a potential resend. This is done
+	 * asynchronously (when used in real mode, we will have
+	 * to exit here).
+	 *
+	 * We do not handle the complete Check_IPI as documented
+	 * here. In the PAPR, this state will be used for both
+	 * Set_MFRR and Down_CPPR. However, we know that we aren't
+	 * changing the MFRR state here so we don't need to handle
+	 * the case of an MFRR causing a reject of a pending irq,
+	 * this will have been handled when the MFRR was set in the
+	 * first place.
+	 *
+	 * Thus we don't have to handle rejects, only resends.
+	 *
+	 * When implementing real mode for HV KVM, resend will lead to
+	 * a H_TOO_HARD return and the whole transaction will be handled
+	 * in virtual mode.
+	 */
+	do {
+		old_state = new_state = ACCESS_ONCE(icp->state);
+
+		/* Down_CPPR */
+		new_state.cppr = new_cppr;
+
+		/*
+		 * Cut down Resend / Check_IPI / IPI
+		 *
+		 * The logic is that we cannot have a pending interrupt
+		 * trumped by an IPI at this point (see above), so we
+		 * know that either the pending interrupt is already an
+		 * IPI (in which case we don't care to override it) or
+		 * it's either more favored than us or non existent
+		 */
+		if (new_state.mfrr < new_cppr &&
+		    new_state.mfrr <= new_state.pending_pri) {
+			WARN_ON(new_state.xisr != XICS_IPI &&
+				new_state.xisr != 0);
+			new_state.pending_pri = new_state.mfrr;
+			new_state.xisr = XICS_IPI;
+		}
+
+		/* Latch/clear resend bit */
+		resend = new_state.need_resend;
+		new_state.need_resend = 0;
+
+	} while (!icp_try_update(icp, old_state, new_state, true));
+
+	/*
+	 * Now handle resend checks. Those are asynchronous to the ICP
+	 * state update in HW (ie bus transactions) so we can handle them
+	 * separately here too
+	 */
+	if (resend)
+		icp_check_resend(xics, icp);
+}
+
+static noinline unsigned long h_xirr(struct kvm_vcpu *vcpu)
+{
+	union kvmppc_icp_state old_state, new_state;
+	struct kvmppc_icp *icp = vcpu->arch.icp;
+	u32 xirr;
+
+	/* First, remove EE from the processor */
+	kvmppc_book3s_dequeue_irqprio(icp->vcpu,
+				      BOOK3S_INTERRUPT_EXTERNAL_LEVEL);
+
+	/*
+	 * ICP State: Accept_Interrupt
+	 *
+	 * Return the pending interrupt (if any) along with the
+	 * current CPPR, then clear the XISR & set CPPR to the
+	 * pending priority
+	 */
+	do {
+		old_state = new_state = ACCESS_ONCE(icp->state);
+
+		xirr = old_state.xisr | (((u32)old_state.cppr) << 24);
+		if (!old_state.xisr)
+			break;
+		new_state.cppr = new_state.pending_pri;
+		new_state.pending_pri = 0xff;
+		new_state.xisr = 0;
+
+	} while (!icp_try_update(icp, old_state, new_state, true));
+
+	XICS_DBG("h_xirr vcpu %d xirr %#x\n", vcpu->vcpu_id, xirr);
+
+	return xirr;
+}
+
+static noinline int h_ipi(struct kvm_vcpu *vcpu, unsigned long server,
+			  unsigned long mfrr)
+{
+        union kvmppc_icp_state old_state, new_state;
+	struct kvmppc_xics *xics = vcpu->kvm->arch.xics;
+	struct kvmppc_icp *icp;
+	u32 reject;
+	bool resend;
+	bool local;
+
+	XICS_DBG("h_ipi vcpu %d to server %lu mfrr %#lx\n",
+			vcpu->vcpu_id, server, mfrr);
+
+	local = vcpu->vcpu_id == server;
+	if (local)
+		icp = vcpu->arch.icp;
+	else
+		icp = kvmppc_xics_find_server(vcpu->kvm, server);
+	if (!icp)
+		return H_PARAMETER;
+
+	/*
+	 * ICP state: Set_MFRR
+	 *
+	 * If the CPPR is more favored than the new MFRR, then
+	 * nothing needs to be rejected as there can be no XISR to
+	 * reject.  If the MFRR is being made less favored then
+	 * there might be a previously-rejected interrupt needing
+	 * to be resent.
+	 *
+	 * If the CPPR is less favored, then we might be replacing
+	 * an interrupt, and thus need to possibly reject it as in
+	 *
+	 * ICP state: Check_IPI
+	 */
+	do {
+		old_state = new_state = ACCESS_ONCE(icp->state);
+
+		/* Set_MFRR */
+		new_state.mfrr = mfrr;
+
+		/* Check_IPI */
+		reject = 0;
+		resend = false;
+		if (mfrr < new_state.cppr) {
+			/* Reject a pending interrupt if not an IPI */
+			if (mfrr <= new_state.pending_pri)
+				reject = new_state.xisr;
+			new_state.pending_pri = mfrr;
+			new_state.xisr = XICS_IPI;
+		}
+
+		if (mfrr > old_state.mfrr && mfrr > new_state.cppr) {
+			resend = new_state.need_resend;
+			new_state.need_resend = 0;
+		}
+	} while (!icp_try_update(icp, old_state, new_state, local));
+
+	/* Handle reject */
+	if (reject && reject != XICS_IPI)
+		icp_deliver_irq(xics, icp, reject);
+		
+	/* Handle resend */
+	if (resend)
+		icp_check_resend(xics, icp);
+
+	return H_SUCCESS;
+}
+
+static noinline void h_cppr(struct kvm_vcpu *vcpu, unsigned long cppr)
+{
+	union kvmppc_icp_state old_state, new_state;
+	struct kvmppc_xics *xics = vcpu->kvm->arch.xics;
+	struct kvmppc_icp *icp = vcpu->arch.icp;
+	u32 reject;
+
+	XICS_DBG("h_cppr vcpu %d cppr %#lx\n", vcpu->vcpu_id, cppr);
+
+	/*
+	 * ICP State: Set_CPPR
+	 *
+	 * We can safely compare the new value with the current
+	 * value outside of the transaction as the CPPR is only
+	 * ever changed by the processor on itself
+	 */
+	if (cppr > icp->state.cppr)
+		icp_down_cppr(xics, icp, cppr);
+	else if (cppr == icp->state.cppr)
+		return;
+
+	/*
+	 * ICP State: Up_CPPR
+	 *
+	 * The processor is raising its priority, this can result
+	 * in a rejection of a pending interrupt:
+	 *
+	 * ICP State: Reject_Current
+	 *
+	 * We can remove EE from the current processor, the update
+	 * transaction will set it again if needed
+	 */
+	kvmppc_book3s_dequeue_irqprio(icp->vcpu,
+				      BOOK3S_INTERRUPT_EXTERNAL_LEVEL);
+
+	do {
+		old_state = new_state = ACCESS_ONCE(icp->state);
+
+		reject = 0;
+		new_state.cppr = cppr;
+
+		if (cppr <= new_state.pending_pri) {
+			reject = new_state.xisr;
+			new_state.xisr = 0;
+			new_state.pending_pri = 0xff;
+		}
+
+	} while (!icp_try_update(icp, old_state, new_state, true));
+
+	/*
+	 * Check for rejects. They are handled by doing a new delivery
+	 * attempt (see comments in icp_deliver_irq).
+	 */
+	if (reject && reject != XICS_IPI)
+		icp_deliver_irq(xics, icp, reject);
+}
+
+static noinline int h_eoi(struct kvm_vcpu *vcpu, unsigned long xirr)
+{
+	struct kvmppc_xics *xics = vcpu->kvm->arch.xics;
+	struct kvmppc_icp *icp = vcpu->arch.icp;
+	struct kvmppc_ics *ics;
+	struct ics_irq_state *state;
+	u32 irq = xirr & 0x00ffffff;
+	u16 src;
+
+	XICS_DBG("h_eoi vcpu %d eoi %#lx\n", vcpu->vcpu_id, xirr);
+
+	/*
+	 * ICP State: EOI
+	 *
+	 * Note: If EOI is incorrectly used by SW to lower the CPPR
+	 * value (ie more favored), we do not check for rejection of
+	 * a pending interrupt, this is a SW error and PAPR sepcifies
+	 * that we don't have to deal with it.
+	 *
+	 * The sending of an EOI to the ICS is handled after the
+	 * CPPR update
+	 *
+	 * ICP State: Down_CPPR which we handle
+	 * in a separate function as it's shared with H_CPPR.
+	 */
+	icp_down_cppr(xics, icp, xirr >> 24);
+
+	/* IPIs have no EOI */
+	if (irq == XICS_IPI)
+		return H_SUCCESS;
+	/*
+	 * EOI handling: If the interrupt is still asserted, we need to
+	 * resend it. We can take a lockless "peek" at the ICS state here.
+	 *
+	 * "Message" interrupts will never have "asserted" set
+	 */
+	ics = kvmppc_xics_find_ics(xics, irq, &src);
+	if (!ics) {
+		XICS_DBG("h_eoi: IRQ 0x%06x not found !\n", irq);
+		return H_PARAMETER;
+	}
+	state = &ics->irq_state[src];
+
+	/* Still asserted, resend it */
+	if (state->asserted)
+		icp_deliver_irq(xics, icp, irq);
+
+	return H_SUCCESS;
+}
+
+int kvmppc_xics_hcall(struct kvm_vcpu *vcpu, u32 req)
+{
+	unsigned long res;
+	int rc = H_SUCCESS;
+
+	/* Check if we have an ICP */
+	if (!vcpu->arch.icp || !vcpu->kvm->arch.xics)
+		return H_HARDWARE;
+
+	switch (req) {
+	case H_XIRR:
+		res = h_xirr(vcpu);
+		kvmppc_set_gpr(vcpu, 4, res);
+		break;
+	case H_CPPR:
+		h_cppr(vcpu, kvmppc_get_gpr(vcpu, 4));
+		break;
+	case H_EOI:
+		rc = h_eoi(vcpu, kvmppc_get_gpr(vcpu, 4));
+		break;
+	case H_IPI:
+		rc = h_ipi(vcpu, kvmppc_get_gpr(vcpu, 4),
+			   kvmppc_get_gpr(vcpu, 5));
+		break;
+	}
+
+	return rc;
+}
+
+
+/* -- Initialisation code etc. -- */
+
+static int xics_debug_show(struct seq_file *m, void *private)
+{
+	struct kvmppc_xics *xics = m->private;
+	struct kvm *kvm = xics->kvm;
+	struct kvm_vcpu *vcpu;
+	int icsid, i;
+
+	if (!kvm)
+		return 0;
+
+	seq_printf(m, "=========\nICP state\n=========\n");
+
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		struct kvmppc_icp *icp = vcpu->arch.icp;
+		union kvmppc_icp_state state;
+
+		if (!icp)
+			continue;
+
+		state.raw = ACCESS_ONCE(icp->state.raw);
+		seq_printf(m, "cpu server %#x XIRR:%#x PPRI:%#x CPPR:%#x "
+			   "MFRR:%#x OUT:%d NR:%d\n", vcpu->vcpu_id, state.xisr,
+			   state.pending_pri, state.cppr, state.mfrr,
+			   state.out_ee, state.need_resend);
+	}
+
+	for (icsid = 0; icsid <= KVMPPC_XICS_MAX_ICS_ID; icsid++) {
+		struct kvmppc_ics *ics = xics->ics[icsid];
+
+		if (!ics)
+			continue;
+
+		seq_printf(m, "=========\nICS state for ICS 0x%x\n=========\n",
+			   icsid);
+
+		mutex_lock(&ics->lock);
+
+		for (i = 0; i < KVMPPC_XICS_IRQ_PER_ICS; i++) {
+			struct ics_irq_state *irq = &ics->irq_state[i];
+
+			seq_printf(m, "irq 0x%06x: server %#x prio %#x save"
+				   " prio %#x asserted %d resend %d masked"
+				   " pending %d\n",
+				   irq->number, irq->server, irq->priority,
+				   irq->saved_priority, irq->asserted,
+				   irq->resend, irq->masked_pending);
+
+		}
+		mutex_unlock(&ics->lock);
+	}
+	return 0;
+}
+
+static int xics_debug_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, xics_debug_show, inode->i_private);
+}
+
+static const struct file_operations xics_debug_fops = {
+	.open = xics_debug_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = single_release,
+};
+
+static void xics_debugfs_init(struct kvmppc_xics *xics)
+{
+	char *name;
+
+	name = kasprintf(GFP_KERNEL, "kvm-xics-%p", xics);
+	if (!name) {
+		pr_err("%s: no memory for name\n", __func__);
+		return;
+	}
+
+	xics->dentry = debugfs_create_file(name, S_IRUGO, powerpc_debugfs_root,
+					   xics, &xics_debug_fops);
+
+	pr_debug("%s: created %s\n", __func__, name);
+	kfree(name);
+}
+
+static struct kvmppc_ics *kvmppc_xics_create_ics(struct kvmppc_xics *xics,
+						 int irq)
+{
+	struct kvmppc_ics *ics;
+	int i, icsid;
+
+	icsid = irq >> KVMPPC_XICS_ICS_SHIFT;
+
+	mutex_lock(&xics->kvm->lock);
+
+	/* ICS already exists - somebody else got here first */
+	if (xics->ics[icsid])
+		goto out;
+
+	/* Create the ICS */
+	ics = kzalloc(sizeof(struct kvmppc_ics), GFP_KERNEL);
+	if (!ics)
+		goto out;
+
+	mutex_init(&ics->lock);
+	ics->icsid = icsid;
+
+	for (i = 0; i < KVMPPC_XICS_IRQ_PER_ICS; i++) {
+		ics->irq_state[i].number = (icsid << KVMPPC_XICS_ICS_SHIFT) | i;
+		ics->irq_state[i].priority = MASKED;
+		ics->irq_state[i].saved_priority = MASKED;
+	}
+	smp_wmb();
+	xics->ics[icsid] = ics;
+
+	if (icsid > xics->max_icsid)
+		xics->max_icsid = icsid;
+
+ out:
+	mutex_unlock(&xics->kvm->lock);
+	return xics->ics[icsid];
+}
+
+int kvmppc_xics_create_icp(struct kvm_vcpu *vcpu)
+{
+	struct kvmppc_icp *icp;
+
+	icp = kzalloc(sizeof(struct kvmppc_icp), GFP_KERNEL);
+	if (!icp)
+		return -ENOMEM;
+
+	icp->vcpu = vcpu;
+	icp->state.mfrr = MASKED;
+	icp->state.pending_pri = MASKED;
+	vcpu->arch.icp = icp;
+
+	XICS_DBG("created server for vcpu %d\n", vcpu->vcpu_id);
+
+	return 0;
+}
+
+void kvmppc_xics_free_icp(struct kvm_vcpu *vcpu)
+{
+	if (!vcpu->arch.icp)
+		return;
+	kfree(vcpu->arch.icp);
+	vcpu->arch.icp = NULL;
+}
+
+void kvmppc_xics_free(struct kvm *kvm)
+{
+	struct kvmppc_xics *xics = kvm->arch.xics;
+	int i;
+
+	if (!xics)
+		return;
+
+	lockdep_assert_held(&kvm->lock);
+
+	debugfs_remove(xics->dentry);
+
+	if (xics->kvm) {
+		xics->kvm->arch.xics = NULL;
+		xics->kvm = NULL;
+	}
+
+	for (i = 0; i <= xics->max_icsid; i++) {
+		if (xics->ics[i])
+			kfree(xics->ics[i]);
+	}
+	kfree(xics);
+}
+
+static int kvm_xics_get_sources(struct kvm *kvm, struct kvm_irq_sources *srcs)
+{
+	int ret = 0;
+	struct kvmppc_xics *xics = kvm->arch.xics;
+	struct kvmppc_ics *ics;
+	struct ics_irq_state *irqp;
+	u64 __user *ubufp;
+	u16 idx;
+	u64 val;
+	long int i, irq, nirq;
+
+	irq = srcs->irq;
+	ubufp = srcs->irqbuf;
+
+	while (srcs->nr_irqs > 0 && !ret) {
+		ics = kvmppc_xics_find_ics(xics, irq, &idx);
+		if (!ics)
+			return -ENOENT;
+		nirq = KVMPPC_XICS_IRQ_PER_ICS - idx;
+		if (nirq > srcs->nr_irqs)
+			nirq = srcs->nr_irqs;
+		srcs->nr_irqs -= nirq;
+		irq += nirq;
+
+		irqp = &ics->irq_state[idx];
+		mutex_lock(&ics->lock);
+		for (i = 0; i < nirq; ++i, ++irqp, ++ubufp) {
+			ret = -ENOENT;
+			if (!irqp->exists)
+				break;
+			val = irqp->server;
+			val |= ((u64)irqp->priority << KVM_IRQ_PRIORITY_SHIFT);
+			if (irqp->priority == MASKED)
+				val |= KVM_IRQ_MASKED;
+			if (irqp->asserted)
+				val |= KVM_IRQ_LEVEL_SENSITIVE |
+					KVM_IRQ_PENDING;
+			else if (irqp->masked_pending || irqp->resend)
+				val |= KVM_IRQ_PENDING;
+			ret = -EFAULT;
+			if (__put_user(val, ubufp))
+				break;
+			ret = 0;
+		}
+		mutex_unlock(&ics->lock);
+	}
+
+	return ret;
+}
+
+static int kvm_xics_set_sources(struct kvm *kvm, struct kvm_irq_sources *srcs)
+{
+	int ret = 0;
+	struct kvmppc_xics *xics = kvm->arch.xics;
+	struct kvmppc_ics *ics;
+	struct ics_irq_state *irqp;
+	u64 __user *ubufp;
+	u16 idx;
+	u64 val;
+	long int i, irq, nirq;
+
+	irq = srcs->irq;
+	ubufp = srcs->irqbuf;
+
+	if (irq < KVMPPC_XICS_FIRST_IRQ ||
+	    irq + srcs->nr_irqs > KVMPPC_XICS_NR_IRQS)
+		return -ENOENT;
+
+	while (srcs->nr_irqs > 0 && !ret) {
+		ics = kvmppc_xics_find_ics(xics, irq, &idx);
+		if (!ics) {
+			ics = kvmppc_xics_create_ics(xics, irq);
+			if (!ics)
+				return -ENOMEM;
+		}
+		nirq = KVMPPC_XICS_IRQ_PER_ICS - idx;
+		if (nirq > srcs->nr_irqs)
+			nirq = srcs->nr_irqs;
+		srcs->nr_irqs -= nirq;
+		irq += nirq;
+
+		irqp = &ics->irq_state[idx];
+		ubufp = srcs->irqbuf;
+		for (i = 0; i < nirq; ++i, ++irqp, ++ubufp) {
+			ret = -EFAULT;
+			if (__get_user(val, ubufp))
+				break;
+			ret = 0;
+
+			mutex_lock(&ics->lock);
+			irqp->server = val & KVM_IRQ_SERVER_MASK;
+			irqp->priority = val >> KVM_IRQ_PRIORITY_SHIFT;
+			irqp->resend = 0;
+			irqp->masked_pending = 0;
+			irqp->asserted = 0;
+			if ((val & KVM_IRQ_PENDING) &&
+			    (val & KVM_IRQ_LEVEL_SENSITIVE))
+				irqp->asserted = 1;
+			irqp->exists = 1;
+			mutex_unlock(&ics->lock);
+
+			if (val & KVM_IRQ_PENDING)
+				icp_deliver_irq(xics, NULL, irqp->number);
+		}
+	}
+
+	return ret;
+}
+
+/* -- ioctls -- */
+
+int kvmppc_xics_create(struct kvm *kvm, struct kvm_irqchip_args *args)
+{
+	struct kvmppc_xics *xics;
+	int rc = 0;
+
+	mutex_lock(&kvm->lock);
+
+	/* Already there ? */
+	if (kvm->arch.xics)
+		return -EEXIST;
+
+	xics = kzalloc(sizeof(*xics), GFP_KERNEL);
+	if (!xics) {
+		rc = -ENOMEM;
+		goto out;
+	}
+
+	xics->kvm = kvm;
+	kvm->arch.xics = xics;
+	xics_debugfs_init(xics);
+
+out:
+	mutex_unlock(&kvm->lock);
+	return rc;
+}
+
+static int kvm_vm_ioctl_xics_irq(struct kvm *kvm, struct kvm_irq_level *args)
+{
+	struct kvmppc_xics *xics;
+
+	/* locking against multiple callers? */
+
+	xics = kvm->arch.xics;
+	if (!xics)
+		return -ENODEV;
+
+	switch (args->level) {
+	case KVM_INTERRUPT_SET:
+	case KVM_INTERRUPT_SET_LEVEL:
+	case KVM_INTERRUPT_UNSET:
+		ics_deliver_irq(xics, args->irq, args->level);
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+int kvmppc_xics_ioctl(struct kvm *kvm, unsigned ioctl, unsigned long arg)
+{
+	void __user *argp = (void __user *)arg;
+	int rc;
+
+	BUILD_BUG_ON(sizeof(union kvmppc_icp_state) != sizeof(unsigned long));
+
+	switch (ioctl) {
+	case KVM_IRQ_LINE: {
+		struct kvm_irq_level args;
+
+		rc = -EFAULT;
+		if (copy_from_user(&args, argp, sizeof(args)))
+			break;
+		rc = kvm_vm_ioctl_xics_irq(kvm, &args);
+		break;
+	}
+
+	case KVM_IRQCHIP_GET_SOURCES: {
+		struct kvm_irq_sources sources;
+
+		rc = -EFAULT;
+		if (copy_from_user(&sources, argp, sizeof(sources)))
+			break;
+		if (!access_ok(VERIFY_WRITE, sources.irqbuf,
+			       sources.nr_irqs * sizeof(u64)))
+			break;
+		rc = kvm_xics_get_sources(kvm, &sources);
+		break;
+	}
+
+	case KVM_IRQCHIP_SET_SOURCES: {
+		struct kvm_irq_sources sources;
+
+		rc = -EFAULT;
+		if (copy_from_user(&sources, argp, sizeof(sources)))
+			break;
+		if (!access_ok(VERIFY_READ, sources.irqbuf,
+			       sources.nr_irqs * sizeof(u64)))
+			break;
+		rc = kvm_xics_set_sources(kvm, &sources);
+		break;
+	}
+
+	default:
+		rc = -ENOTTY;
+		break;
+	}
+
+	return rc;
+}
diff --git a/arch/powerpc/kvm/book3s_xics.h b/arch/powerpc/kvm/book3s_xics.h
new file mode 100644
index 0000000..0e20a51
--- /dev/null
+++ b/arch/powerpc/kvm/book3s_xics.h
@@ -0,0 +1,111 @@
+/*
+ * Copyright 2012 Michael Ellerman, IBM Corporation.
+ * Copyright 2012 Benjamin Herrenschmidt, IBM Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, version 2, as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef _KVM_PPC_BOOK3S_XICS_H
+#define _KVM_PPC_BOOK3S_XICS_H
+
+/*
+ * We use a two-level tree to store interrupt source information.
+ * There are up to 1024 ICS nodes, each of which can represent
+ * 1024 sources.
+ */
+#define KVMPPC_XICS_MAX_ICS_ID	1023
+#define KVMPPC_XICS_ICS_SHIFT	10
+#define KVMPPC_XICS_IRQ_PER_ICS	(1 << KVMPPC_XICS_ICS_SHIFT)
+#define KVMPPC_XICS_SRC_MASK	(KVMPPC_XICS_IRQ_PER_ICS - 1)
+
+/*
+ * Interrupt source numbers below this are reserved, for example
+ * 0 is "no interrupt", and 2 is used for IPIs.
+ */
+#define KVMPPC_XICS_FIRST_IRQ	16
+#define KVMPPC_XICS_NR_IRQS	((KVMPPC_XICS_MAX_ICS_ID + 1) * KVMPPC_XICS_IRQ_PER_ICS)
+
+/* Priority value to use for disabling an interrupt */
+#define MASKED	0xff
+
+/* State for one irq source */
+struct ics_irq_state {
+	u32 number;
+	u32 server;
+	u8  priority;
+	u8  saved_priority; /* currently unused */
+	u8  resend;
+	u8  masked_pending;
+	u8  asserted; /* Only for LSI */
+	u8  exists;
+};
+
+/* Atomic ICP state, updated with a single compare & swap */
+union kvmppc_icp_state {
+	unsigned long raw;
+	struct {
+		u8 out_ee : 1;
+		u8 need_resend : 1;
+		u8 cppr;
+		u8 mfrr;
+		u8 pending_pri;
+		u32 xisr;
+	};
+};
+
+/* One bit per ICS */
+#define ICP_RESEND_MAP_SIZE	(KVMPPC_XICS_MAX_ICS_ID / BITS_PER_LONG + 1)
+
+struct kvmppc_icp {
+	struct kvm_vcpu *vcpu;
+	union kvmppc_icp_state state;
+	unsigned long resend_map[ICP_RESEND_MAP_SIZE];
+};
+
+struct kvmppc_ics {
+	struct mutex lock;
+	u16 icsid;
+	struct ics_irq_state irq_state[KVMPPC_XICS_IRQ_PER_ICS];
+};
+
+struct kvmppc_xics {
+	struct kvm *kvm;
+	struct dentry *dentry;
+	u32 max_icsid;
+	struct kvmppc_ics *ics[KVMPPC_XICS_MAX_ICS_ID + 1];
+};
+
+static inline struct kvmppc_icp *kvmppc_xics_find_server(struct kvm *kvm,
+							 u32 nr)
+{
+	struct kvm_vcpu *vcpu = NULL;
+	int i;
+
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		if (nr == vcpu->vcpu_id)
+			return vcpu->arch.icp;
+	}
+	return NULL;
+}
+
+static inline struct kvmppc_ics *kvmppc_xics_find_ics(struct kvmppc_xics *xics,
+						      u32 irq, u16 *source)
+{
+	u32 icsid = irq >> KVMPPC_XICS_ICS_SHIFT;
+	u16 src = irq & KVMPPC_XICS_SRC_MASK;
+	struct kvmppc_ics *ics;
+
+	if (source)
+		*source = src;
+	if (icsid > KVMPPC_XICS_MAX_ICS_ID)
+		return NULL;
+	ics = xics->ics[icsid];
+	if (!ics)
+		return NULL;
+	return ics;
+}
+
+
+#endif /* _KVM_PPC_BOOK3S_XICS_H */
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 1772883..3bcc030 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -383,6 +383,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 		break;
 #ifdef CONFIG_PPC_BOOK3S_64
 	case KVM_CAP_PPC_GET_SMMU_INFO:
+	case KVM_CAP_IRQCHIP_ARGS:
 		r = 1;
 		break;
 #endif
@@ -1002,6 +1003,28 @@ long kvm_arch_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_rtas_define_token(kvm, argp);
 		break;
 	}
+	case KVM_IRQ_LINE:
+	case KVM_IRQCHIP_GET_SOURCES:
+	case KVM_IRQCHIP_SET_SOURCES: {
+		struct kvm *kvm = filp->private_data;
+
+		r = -ENOTTY;
+		if (kvmppc_xics_enabled(kvm))
+			r = kvmppc_xics_ioctl(kvm, ioctl, arg);
+		break;
+	}
+	case KVM_CREATE_IRQCHIP_ARGS: {
+		struct kvm *kvm = filp->private_data;
+		struct kvm_irqchip_args args;
+
+		r = -EFAULT;
+		if (copy_from_user(&args, argp, sizeof(args)))
+			break;
+		r = -EINVAL;
+		if (args.type == KVM_IRQCHIP_TYPE_XICS)
+			r = kvmppc_xics_create(kvm, &args);
+		break;
+	}
 #endif /* CONFIG_PPC_BOOK3S_64 */
 	default:
 		r = -ENOTTY;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 1e2fda0..25d73c0 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -115,6 +115,7 @@ struct kvm_irq_level {
 	 * ACPI gsi notion of irq.
 	 * For IA-64 (APIC model) IOAPIC0: irq 0-23; IOAPIC1: irq 24-47..
 	 * For X86 (standard AT mode) PIC0/1: irq 0-15. IOAPIC0: 0-23..
+	 * On powerpc SPAPR, the ICS source number, level is ignored.
 	 */
 	union {
 		__u32 irq;
@@ -146,6 +147,15 @@ struct kvm_pit_config {
 
 #define KVM_PIT_SPEAKER_DUMMY     1
 
+/* for KVM_CREATE_IRQCHIP_ARGS */
+struct kvm_irqchip_args {
+	__u64 type;
+	__u64 param;
+};
+
+/* values for type */
+#define KVM_IRQCHIP_TYPE_XICS	1	/* Power server external intr ctrler */
+
 #define KVM_EXIT_UNKNOWN          0
 #define KVM_EXIT_EXCEPTION        1
 #define KVM_EXIT_IO               2
@@ -663,6 +673,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_S390_CSS_SUPPORT 85
 #define KVM_CAP_PPC_EPR 86
 #define KVM_CAP_PPC_RTAS 87
+#define KVM_CAP_IRQCHIP_ARGS 88
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -805,6 +816,21 @@ struct kvm_msi {
 	__u8  pad[16];
 };
 
+struct kvm_irq_sources {
+	__u32 irq;
+	__u32 nr_irqs;
+	__u64 __user *irqbuf;
+};
+
+/* irqbuf entries are laid out like this: */
+#define KVM_IRQ_SERVER_SHIFT	0
+#define KVM_IRQ_SERVER_MASK	0xffffffffULL
+#define KVM_IRQ_PRIORITY_SHIFT	32
+#define KVM_IRQ_PRIORITY_MASK	0xff
+#define KVM_IRQ_LEVEL_SENSITIVE	(1ULL << 40)
+#define KVM_IRQ_MASKED		(1ULL << 41)
+#define KVM_IRQ_PENDING		(1ULL << 42)
+
 /*
  * ioctls for VM fds
  */
@@ -892,6 +918,9 @@ struct kvm_s390_ucas_mapping {
 #define KVM_PPC_GET_HTAB_FD	  _IOW(KVMIO,  0xaa, struct kvm_get_htab_fd)
 /* Available with KVM_CAP_PPC_RTAS */
 #define KVM_PPC_RTAS_DEFINE_TOKEN _IOW(KVMIO,  0xab, struct kvm_rtas_token_args)
+#define KVM_CREATE_IRQCHIP_ARGS   _IOW(KVMIO,  0xac, struct kvm_irqchip_args)
+#define KVM_IRQCHIP_GET_SOURCES	  _IOW(KVMIO,  0xad, struct kvm_irq_sources)
+#define KVM_IRQCHIP_SET_SOURCES	  _IOW(KVMIO,  0xae, struct kvm_irq_sources)
 
 /*
  * ioctls for vcpu fds
-- 
1.7.10.rc3.219.g53414

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 4/9] KVM: PPC: Book3S: Generalize interfaces to interrupt controller emulation
  2013-02-14 23:59 [PATCH 0/9] In-kernel XICS interrupt controller emulation Paul Mackerras
                   ` (2 preceding siblings ...)
  2013-02-15  0:01 ` [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller Paul Mackerras
@ 2013-02-15  0:01 ` Paul Mackerras
  2013-02-15  0:02 ` [PATCH 5/9] KVM: PPC: Book3S: Facilities to save/restore XICS presentation ctrler state Paul Mackerras
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: Paul Mackerras @ 2013-02-15  0:01 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm, kvm-ppc

This makes the XICS interrupt controller emulation code export a struct
containing function pointers for the various calls into the XICS code.
The generic book3s code then uses these function pointers instead of
calling directly into the XICS code (except for the XICS instantiation
function).

This should make it possible for other interrupt controller emulations
to coexist with the XICS emulation.

Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 arch/powerpc/include/asm/kvm_host.h |   11 ++++-
 arch/powerpc/include/asm/kvm_ppc.h  |   14 ------
 arch/powerpc/kvm/book3s_hv.c        |   18 +++----
 arch/powerpc/kvm/book3s_pr.c        |   14 +++---
 arch/powerpc/kvm/book3s_pr_papr.c   |   10 ++--
 arch/powerpc/kvm/book3s_xics.c      |   93 +++++++++++++++++++----------------
 arch/powerpc/kvm/powerpc.c          |    4 +-
 7 files changed, 85 insertions(+), 79 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index b05e7cd..8af4c0b 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -230,8 +230,18 @@ struct kvm_arch_memory_slot {
 #endif /* CONFIG_KVM_BOOK3S_64_HV */
 };
 
+struct kvm_irq_ctrler {
+	int (*setup_vcpu)(struct kvm_vcpu *vcpu);
+	void (*teardown_vcpu)(struct kvm_vcpu *vcpu);
+	void (*free_ctrler)(struct kvm *kvm);
+	int (*hcall)(struct kvm_vcpu *vcpu, unsigned long req);
+	int (*ioctl)(struct kvm *kvm, unsigned int ioctl, unsigned long arg);
+};
+
 struct kvm_arch {
 	unsigned int lpid;
+	struct kvm_irq_ctrler *irq_ctrler;
+	void *irq_ctrler_private;
 #ifdef CONFIG_KVM_BOOK3S_64_HV
 	unsigned long hpt_virt;
 	struct revmap_entry *revmap;
@@ -260,7 +270,6 @@ struct kvm_arch {
 #ifdef CONFIG_PPC_BOOK3S_64
 	struct list_head spapr_tce_tables;
 	struct list_head rtas_tokens;
-	struct kvmppc_xics *xics;
 #endif
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index f0fd22b..2308fff 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -130,12 +130,7 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
 extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
 			struct kvm_memory_slot *memslot, unsigned long porder);
 extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
-extern int kvmppc_xics_hcall(struct kvm_vcpu *vcpu, u32 cmd);
-extern int kvmppc_xics_ioctl(struct kvm *kvm, unsigned ioctl, unsigned long arg);
 extern int kvmppc_xics_create(struct kvm *kvm, struct kvm_irqchip_args *args);
-extern void kvmppc_xics_free_icp(struct kvm_vcpu *vcpu);
-extern int kvmppc_xics_create_icp(struct kvm_vcpu *vcpu);
-extern void kvmppc_xics_free(struct kvm *kvm);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce *args);
@@ -275,15 +270,6 @@ static inline void kvm_linear_init(void)
 
 #endif
 
-#ifdef CONFIG_PPC_BOOK3S_64
-static inline int kvmppc_xics_enabled(struct kvm *kvm)
-{
-	return kvm->arch.xics != NULL;
-}
-#else
-static inline int kvmppc_xics_enabled(struct kvm *kvm) { return 0; }
-#endif
-
 static inline void kvmppc_set_epr(struct kvm_vcpu *vcpu, u32 epr)
 {
 #ifdef CONFIG_KVM_BOOKE_HV
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index aa3a0db..8f09c36 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -532,8 +532,8 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
 	case H_CPPR:
 	case H_EOI:
 	case H_IPI:
-		if (kvmppc_xics_enabled(vcpu->kvm)) {
-			ret = kvmppc_xics_hcall(vcpu, req);
+		if (vcpu->kvm->arch.irq_ctrler) {
+			ret = vcpu->kvm->arch.irq_ctrler->hcall(vcpu, req);
 			break;
 		} /* fallthrough */
 	default:
@@ -894,9 +894,9 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
 	spin_lock_init(&vcpu->arch.tbacct_lock);
 	vcpu->arch.busy_preempt = TB_NIL;
 
-	/* Create the XICS */
-	if (kvmppc_xics_enabled(kvm)) {
-		err = kvmppc_xics_create_icp(vcpu);
+	/* Create the interrupt-controller-specific state */
+	if (kvm->arch.irq_ctrler) {
+		err = kvm->arch.irq_ctrler->setup_vcpu(vcpu);
 		if (err < 0)
 			goto free_vcpu;
 	}
@@ -952,8 +952,8 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
 		kvmppc_unpin_guest_page(vcpu->kvm, vcpu->arch.vpa.pinned_addr);
 	spin_unlock(&vcpu->arch.vpa_update_lock);
 	kvm_vcpu_uninit(vcpu);
-	if (kvmppc_xics_enabled(vcpu->kvm))
-		kvmppc_xics_free_icp(vcpu);
+	if (vcpu->kvm->arch.irq_ctrler)
+		vcpu->kvm->arch.irq_ctrler->teardown_vcpu(vcpu);
 	kmem_cache_free(kvm_vcpu_cache, vcpu);
 }
 
@@ -1899,8 +1899,8 @@ void kvmppc_core_destroy_vm(struct kvm *kvm)
 
 	kvmppc_rtas_tokens_free(kvm);
 
-	if (kvmppc_xics_enabled(kvm))
-		kvmppc_xics_free(kvm);
+	if (kvm->arch.irq_ctrler)
+		kvm->arch.irq_ctrler->free_ctrler(kvm);
 
 	kvmppc_free_hpt(kvm);
 	WARN_ON(!list_empty(&kvm->arch.spapr_tce_tables));
diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index 9b2237f..0c84242 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -1069,9 +1069,9 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
 	if (err < 0)
 		goto uninit_vcpu;
 
-	/* Create the XICS */
-	if (kvmppc_xics_enabled(kvm)) {
-		err = kvmppc_xics_create_icp(vcpu);
+	/* Create the interrupt-controller-specific vcpu state */
+	if (kvm->arch.irq_ctrler) {
+		err = kvm->arch.irq_ctrler->setup_vcpu(vcpu);
 		if (err < 0)
 			goto free_vcpu;
 	}
@@ -1092,8 +1092,8 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
 {
 	struct kvmppc_vcpu_book3s *vcpu_book3s = to_book3s(vcpu);
 
-	if (kvmppc_xics_enabled(vcpu->kvm))
-		kvmppc_xics_free_icp(vcpu);
+	if (vcpu->kvm->arch.irq_ctrler)
+		vcpu->kvm->arch.irq_ctrler->teardown_vcpu(vcpu);
 	free_page((unsigned long)vcpu->arch.shared & PAGE_MASK);
 	kvm_vcpu_uninit(vcpu);
 	kfree(vcpu_book3s->shadow_vcpu);
@@ -1313,8 +1313,8 @@ void kvmppc_core_destroy_vm(struct kvm *kvm)
 #ifdef CONFIG_PPC64
 	WARN_ON(!list_empty(&kvm->arch.spapr_tce_tables));
 #endif
-	if (kvmppc_xics_enabled(kvm))
-		kvmppc_xics_free(kvm);
+	if (kvm->arch.irq_ctrler)
+		kvm->arch.irq_ctrler->free_ctrler(kvm);
 
 }
 
diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
index 94cec5b..c423b51 100644
--- a/arch/powerpc/kvm/book3s_pr_papr.c
+++ b/arch/powerpc/kvm/book3s_pr_papr.c
@@ -229,7 +229,11 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
 
 static int kvmppc_h_pr_xics_hcall(struct kvm_vcpu *vcpu, u32 cmd)
 {
-	long rc = kvmppc_xics_hcall(vcpu, cmd);
+	long rc;
+
+	if (!vcpu->kvm->arch.irq_ctrler)
+		return EMULATE_FAIL;
+	rc = vcpu->kvm->arch.irq_ctrler->hcall(vcpu, cmd);
 	if (rc == H_TOO_HARD)
 		return EMULATE_FAIL;
 	kvmppc_set_gpr(vcpu, 3, rc);
@@ -259,9 +263,7 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
 	case H_CPPR:
 	case H_EOI:
 	case H_IPI:
-		if (kvmppc_xics_enabled(vcpu->kvm))
-			return kvmppc_h_pr_xics_hcall(vcpu, cmd);
-		break;
+		return kvmppc_h_pr_xics_hcall(vcpu, cmd);
 	case H_RTAS:
 		if (list_empty(&vcpu->kvm->arch.rtas_tokens))
 			return RESUME_HOST;
diff --git a/arch/powerpc/kvm/book3s_xics.c b/arch/powerpc/kvm/book3s_xics.c
index 7749060..2312b56 100644
--- a/arch/powerpc/kvm/book3s_xics.c
+++ b/arch/powerpc/kvm/book3s_xics.c
@@ -115,7 +115,7 @@ static void ics_check_resend(struct kvmppc_xics *xics, struct kvmppc_ics *ics,
 
 int kvmppc_xics_set_xive(struct kvm *kvm, u32 irq, u32 server, u32 priority)
 {
-	struct kvmppc_xics *xics = kvm->arch.xics;
+	struct kvmppc_xics *xics = kvm->arch.irq_ctrler_private;
 	struct kvmppc_icp *icp;
 	struct kvmppc_ics *ics;
 	struct ics_irq_state *state;
@@ -158,7 +158,7 @@ int kvmppc_xics_set_xive(struct kvm *kvm, u32 irq, u32 server, u32 priority)
 
 int kvmppc_xics_get_xive(struct kvm *kvm, u32 irq, u32 *server, u32 *priority)
 {
-	struct kvmppc_xics *xics = kvm->arch.xics;
+	struct kvmppc_xics *xics = kvm->arch.irq_ctrler_private;
 	struct kvmppc_ics *ics;
 	struct ics_irq_state *state;
 	u16 src;
@@ -514,7 +514,7 @@ static noinline int h_ipi(struct kvm_vcpu *vcpu, unsigned long server,
 			  unsigned long mfrr)
 {
         union kvmppc_icp_state old_state, new_state;
-	struct kvmppc_xics *xics = vcpu->kvm->arch.xics;
+	struct kvmppc_xics *xics = vcpu->kvm->arch.irq_ctrler_private;
 	struct kvmppc_icp *icp;
 	u32 reject;
 	bool resend;
@@ -582,7 +582,7 @@ static noinline int h_ipi(struct kvm_vcpu *vcpu, unsigned long server,
 static noinline void h_cppr(struct kvm_vcpu *vcpu, unsigned long cppr)
 {
 	union kvmppc_icp_state old_state, new_state;
-	struct kvmppc_xics *xics = vcpu->kvm->arch.xics;
+	struct kvmppc_xics *xics = vcpu->kvm->arch.irq_ctrler_private;
 	struct kvmppc_icp *icp = vcpu->arch.icp;
 	u32 reject;
 
@@ -638,7 +638,7 @@ static noinline void h_cppr(struct kvm_vcpu *vcpu, unsigned long cppr)
 
 static noinline int h_eoi(struct kvm_vcpu *vcpu, unsigned long xirr)
 {
-	struct kvmppc_xics *xics = vcpu->kvm->arch.xics;
+	struct kvmppc_xics *xics = vcpu->kvm->arch.irq_ctrler_private;
 	struct kvmppc_icp *icp = vcpu->arch.icp;
 	struct kvmppc_ics *ics;
 	struct ics_irq_state *state;
@@ -686,13 +686,13 @@ static noinline int h_eoi(struct kvm_vcpu *vcpu, unsigned long xirr)
 	return H_SUCCESS;
 }
 
-int kvmppc_xics_hcall(struct kvm_vcpu *vcpu, u32 req)
+static int kvmppc_xics_hcall(struct kvm_vcpu *vcpu, unsigned long req)
 {
 	unsigned long res;
 	int rc = H_SUCCESS;
 
 	/* Check if we have an ICP */
-	if (!vcpu->arch.icp || !vcpu->kvm->arch.xics)
+	if (!vcpu->arch.icp || !vcpu->kvm->arch.irq_ctrler_private)
 		return H_HARDWARE;
 
 	switch (req) {
@@ -838,7 +838,7 @@ static struct kvmppc_ics *kvmppc_xics_create_ics(struct kvmppc_xics *xics,
 	return xics->ics[icsid];
 }
 
-int kvmppc_xics_create_icp(struct kvm_vcpu *vcpu)
+static int kvmppc_xics_create_icp(struct kvm_vcpu *vcpu)
 {
 	struct kvmppc_icp *icp;
 
@@ -856,7 +856,7 @@ int kvmppc_xics_create_icp(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
-void kvmppc_xics_free_icp(struct kvm_vcpu *vcpu)
+static void kvmppc_xics_free_icp(struct kvm_vcpu *vcpu)
 {
 	if (!vcpu->arch.icp)
 		return;
@@ -864,9 +864,9 @@ void kvmppc_xics_free_icp(struct kvm_vcpu *vcpu)
 	vcpu->arch.icp = NULL;
 }
 
-void kvmppc_xics_free(struct kvm *kvm)
+static void kvmppc_xics_free(struct kvm *kvm)
 {
-	struct kvmppc_xics *xics = kvm->arch.xics;
+	struct kvmppc_xics *xics = kvm->arch.irq_ctrler_private;
 	int i;
 
 	if (!xics)
@@ -877,7 +877,7 @@ void kvmppc_xics_free(struct kvm *kvm)
 	debugfs_remove(xics->dentry);
 
 	if (xics->kvm) {
-		xics->kvm->arch.xics = NULL;
+		xics->kvm->arch.irq_ctrler_private = NULL;
 		xics->kvm = NULL;
 	}
 
@@ -891,7 +891,7 @@ void kvmppc_xics_free(struct kvm *kvm)
 static int kvm_xics_get_sources(struct kvm *kvm, struct kvm_irq_sources *srcs)
 {
 	int ret = 0;
-	struct kvmppc_xics *xics = kvm->arch.xics;
+	struct kvmppc_xics *xics = kvm->arch.irq_ctrler_private;
 	struct kvmppc_ics *ics;
 	struct ics_irq_state *irqp;
 	u64 __user *ubufp;
@@ -941,7 +941,7 @@ static int kvm_xics_get_sources(struct kvm *kvm, struct kvm_irq_sources *srcs)
 static int kvm_xics_set_sources(struct kvm *kvm, struct kvm_irq_sources *srcs)
 {
 	int ret = 0;
-	struct kvmppc_xics *xics = kvm->arch.xics;
+	struct kvmppc_xics *xics = kvm->arch.irq_ctrler_private;
 	struct kvmppc_ics *ics;
 	struct ics_irq_state *irqp;
 	u64 __user *ubufp;
@@ -999,39 +999,13 @@ static int kvm_xics_set_sources(struct kvm *kvm, struct kvm_irq_sources *srcs)
 
 /* -- ioctls -- */
 
-int kvmppc_xics_create(struct kvm *kvm, struct kvm_irqchip_args *args)
-{
-	struct kvmppc_xics *xics;
-	int rc = 0;
-
-	mutex_lock(&kvm->lock);
-
-	/* Already there ? */
-	if (kvm->arch.xics)
-		return -EEXIST;
-
-	xics = kzalloc(sizeof(*xics), GFP_KERNEL);
-	if (!xics) {
-		rc = -ENOMEM;
-		goto out;
-	}
-
-	xics->kvm = kvm;
-	kvm->arch.xics = xics;
-	xics_debugfs_init(xics);
-
-out:
-	mutex_unlock(&kvm->lock);
-	return rc;
-}
-
 static int kvm_vm_ioctl_xics_irq(struct kvm *kvm, struct kvm_irq_level *args)
 {
 	struct kvmppc_xics *xics;
 
 	/* locking against multiple callers? */
 
-	xics = kvm->arch.xics;
+	xics = kvm->arch.irq_ctrler_private;
 	if (!xics)
 		return -ENODEV;
 
@@ -1048,7 +1022,7 @@ static int kvm_vm_ioctl_xics_irq(struct kvm *kvm, struct kvm_irq_level *args)
 	return 0;
 }
 
-int kvmppc_xics_ioctl(struct kvm *kvm, unsigned ioctl, unsigned long arg)
+static int kvmppc_xics_ioctl(struct kvm *kvm, unsigned ioctl, unsigned long arg)
 {
 	void __user *argp = (void __user *)arg;
 	int rc;
@@ -1099,3 +1073,38 @@ int kvmppc_xics_ioctl(struct kvm *kvm, unsigned ioctl, unsigned long arg)
 
 	return rc;
 }
+
+static struct kvm_irq_ctrler xics_ctrler = {
+	.setup_vcpu = kvmppc_xics_create_icp,
+	.teardown_vcpu = kvmppc_xics_free_icp,
+	.free_ctrler = kvmppc_xics_free,
+	.hcall = kvmppc_xics_hcall,
+	.ioctl = kvmppc_xics_ioctl,
+};
+
+int kvmppc_xics_create(struct kvm *kvm, struct kvm_irqchip_args *args)
+{
+	struct kvmppc_xics *xics;
+	int rc = 0;
+
+	mutex_lock(&kvm->lock);
+
+	/* Already there ? */
+	if (kvm->arch.irq_ctrler)
+		return -EEXIST;
+
+	xics = kzalloc(sizeof(*xics), GFP_KERNEL);
+	if (!xics) {
+		rc = -ENOMEM;
+		goto out;
+	}
+
+	xics->kvm = kvm;
+	kvm->arch.irq_ctrler = &xics_ctrler;
+	kvm->arch.irq_ctrler_private = xics;
+	xics_debugfs_init(xics);
+
+out:
+	mutex_unlock(&kvm->lock);
+	return rc;
+}
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 3bcc030..e429839 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -1009,8 +1009,8 @@ long kvm_arch_vm_ioctl(struct file *filp,
 		struct kvm *kvm = filp->private_data;
 
 		r = -ENOTTY;
-		if (kvmppc_xics_enabled(kvm))
-			r = kvmppc_xics_ioctl(kvm, ioctl, arg);
+		if (kvm->arch.irq_ctrler)
+			r = kvm->arch.irq_ctrler->ioctl(kvm, ioctl, arg);
 		break;
 	}
 	case KVM_CREATE_IRQCHIP_ARGS: {
-- 
1.7.10.rc3.219.g53414

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 5/9] KVM: PPC: Book3S: Facilities to save/restore XICS presentation ctrler state
  2013-02-14 23:59 [PATCH 0/9] In-kernel XICS interrupt controller emulation Paul Mackerras
                   ` (3 preceding siblings ...)
  2013-02-15  0:01 ` [PATCH 4/9] KVM: PPC: Book3S: Generalize interfaces to interrupt controller emulation Paul Mackerras
@ 2013-02-15  0:02 ` Paul Mackerras
  2013-02-15  0:02 ` [PATCH 6/9] KVM: PPC: Book3S HV: Speed up wakeups of CPUs on HV KVM Paul Mackerras
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: Paul Mackerras @ 2013-02-15  0:02 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm, kvm-ppc

This adds the ability for userspace to save and restore the state
of the XICS interrupt presentation controllers (ICPs) via the
KVM_GET/SET_ONE_REG interface.  Since there is one ICP per vcpu, we
simply define a new 64-bit register in the ONE_REG space for the ICP
state.  The state includes the CPU priority setting, the pending IPI
priority, and the priority and source number of any pending external
interrupt.

Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 Documentation/virtual/kvm/api.txt   |    1 +
 arch/powerpc/include/asm/kvm_host.h |    2 +
 arch/powerpc/include/uapi/asm/kvm.h |   13 +++++
 arch/powerpc/kvm/book3s.c           |   19 ++++++++
 arch/powerpc/kvm/book3s_xics.c      |   92 +++++++++++++++++++++++++++++++++++
 5 files changed, 127 insertions(+)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 0ff9dcf..466636b 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1762,6 +1762,7 @@ registers, find a list below:
   PPC   | KVM_REG_PPC_VPA_DTL   | 128
   PPC   | KVM_REG_PPC_EPCR	| 32
   PPC   | KVM_REG_PPC_EPR	| 32
+  PPC   | KVM_REG_PPC_ICP_STATE | 64
 
 4.69 KVM_GET_ONE_REG
 
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 8af4c0b..2eb4c27 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -236,6 +236,8 @@ struct kvm_irq_ctrler {
 	void (*free_ctrler)(struct kvm *kvm);
 	int (*hcall)(struct kvm_vcpu *vcpu, unsigned long req);
 	int (*ioctl)(struct kvm *kvm, unsigned int ioctl, unsigned long arg);
+	u64 (*get_one_reg)(struct kvm_vcpu *vcpu, u64 reg);
+	int (*set_one_reg)(struct kvm_vcpu *vcpu, u64 reg, u64 val);
 };
 
 struct kvm_arch {
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index d90743c..8e34553 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -423,4 +423,17 @@ struct kvm_get_htab_header {
 #define KVM_REG_PPC_EPCR	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x85)
 #define KVM_REG_PPC_EPR		(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x86)
 
+/* Per-vcpu interrupt controller state */
+#define KVM_REG_PPC_ICP_STATE	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x87)
+
+/* Layout of above for XICS */
+#define  KVM_REG_PPC_ICP_CPPR_SHIFT	56	/* current proc priority */
+#define  KVM_REG_PPC_ICP_CPPR_MASK	0xff
+#define  KVM_REG_PPC_ICP_XISR_SHIFT	32	/* interrupt status field */
+#define  KVM_REG_PPC_ICP_XISR_MASK	0xffffff
+#define  KVM_REG_PPC_ICP_MFRR_SHIFT	24	/* pending IPI priority */
+#define  KVM_REG_PPC_ICP_MFRR_MASK	0xff
+#define  KVM_REG_PPC_ICP_PPRI_SHIFT	16	/* pending irq priority */
+#define  KVM_REG_PPC_ICP_PPRI_MASK	0xff
+
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index c5a4478..8d5ee31 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -529,6 +529,15 @@ int kvm_vcpu_ioctl_get_one_reg(struct kvm_vcpu *vcpu, struct kvm_one_reg *reg)
 			val = get_reg_val(reg->id, vcpu->arch.vscr.u[3]);
 			break;
 #endif /* CONFIG_ALTIVEC */
+		case KVM_REG_PPC_ICP_STATE: {
+			struct kvm_irq_ctrler *ic = vcpu->kvm->arch.irq_ctrler;
+			if (!ic) {
+				r = -ENXIO;
+				break;
+			}
+			val = get_reg_val(reg->id, ic->get_one_reg(vcpu, reg->id));
+			break;
+		}
 		default:
 			r = -EINVAL;
 			break;
@@ -591,6 +600,16 @@ int kvm_vcpu_ioctl_set_one_reg(struct kvm_vcpu *vcpu, struct kvm_one_reg *reg)
 			vcpu->arch.vscr.u[3] = set_reg_val(reg->id, val);
 			break;
 #endif /* CONFIG_ALTIVEC */
+		case KVM_REG_PPC_ICP_STATE: {
+			struct kvm_irq_ctrler *ic = vcpu->kvm->arch.irq_ctrler;
+
+			if (!ic) {
+				r = -ENXIO;
+				break;
+			}
+			r = ic->set_one_reg(vcpu, reg->id, set_reg_val(reg->id, val));
+			break;
+		}
 		default:
 			r = -EINVAL;
 			break;
diff --git a/arch/powerpc/kvm/book3s_xics.c b/arch/powerpc/kvm/book3s_xics.c
index 2312b56..f54b934 100644
--- a/arch/powerpc/kvm/book3s_xics.c
+++ b/arch/powerpc/kvm/book3s_xics.c
@@ -888,6 +888,96 @@ static void kvmppc_xics_free(struct kvm *kvm)
 	kfree(xics);
 }
 
+static u64 kvmppc_xics_get_icp(struct kvm_vcpu *vcpu, u64 reg_id)
+{
+	struct kvmppc_icp *icp = vcpu->arch.icp;
+	union kvmppc_icp_state state;
+
+	if (!icp)
+		return 0;
+	state = icp->state;
+	return ((u64)state.cppr << KVM_REG_PPC_ICP_CPPR_SHIFT) |
+		((u64)state.xisr << KVM_REG_PPC_ICP_XISR_SHIFT) |
+		((u64)state.mfrr << KVM_REG_PPC_ICP_MFRR_SHIFT) |
+		((u64)state.pending_pri << KVM_REG_PPC_ICP_PPRI_SHIFT);
+}
+
+static int kvmppc_xics_set_icp(struct kvm_vcpu *vcpu, u64 reg_id, u64 icpval)
+{
+	struct kvmppc_icp *icp = vcpu->arch.icp;
+	struct kvmppc_xics *xics = vcpu->kvm->arch.irq_ctrler_private;
+	union kvmppc_icp_state old_state, new_state;
+	struct kvmppc_ics *ics;
+	u8 cppr, mfrr, pending_pri;
+	u32 xisr;
+	u16 src;
+	bool resend;
+
+	if (!icp || !xics)
+		return -ENOENT;
+
+	cppr = icpval >> KVM_REG_PPC_ICP_CPPR_SHIFT;
+	xisr = (icpval >> KVM_REG_PPC_ICP_XISR_SHIFT) &
+		KVM_REG_PPC_ICP_XISR_MASK;
+	mfrr = icpval >> KVM_REG_PPC_ICP_MFRR_SHIFT;
+	pending_pri = icpval >> KVM_REG_PPC_ICP_PPRI_SHIFT;
+
+	/* Require the new state to be internally consistent */
+	if (xisr == 0) {
+		if (pending_pri != 0xff)
+			return -EINVAL;
+	} else if (xisr == XICS_IPI) {
+		if (pending_pri != mfrr || pending_pri >= cppr)
+			return -EINVAL;
+	} else {
+		if (pending_pri >= mfrr || pending_pri >= cppr)
+			return -EINVAL;
+		ics = kvmppc_xics_find_ics(xics, xisr, &src);
+		if (!ics)
+			return -EINVAL;
+	}
+
+	new_state.raw = 0;
+	new_state.cppr = cppr;
+	new_state.xisr = xisr;
+	new_state.mfrr = mfrr;
+	new_state.pending_pri = pending_pri;
+
+	/*
+	 * Deassert the CPU interrupt request.
+	 * icp_try_update will reassert it if necessary.
+	 */
+	kvmppc_book3s_dequeue_irqprio(icp->vcpu,
+				      BOOK3S_INTERRUPT_EXTERNAL_LEVEL);
+
+	/*
+	 * Note that if we displace an interrupt from old_state.xisr,
+	 * we don't mark it as rejected.  We expect userspace to set
+	 * the state of the interrupt sources to be consistent with
+	 * the ICP states (either before or afterwards, which doesn't
+	 * matter).  We do handle resends due to CPPR becoming less
+	 * favoured because that is necessary to end up with a
+	 * consistent state in the situation where userspace restores
+	 * the ICS states before the ICP states.
+	 */
+	do {
+		old_state = ACCESS_ONCE(icp->state);
+
+		if (new_state.mfrr <= old_state.mfrr) {
+			resend = false;
+			new_state.need_resend = old_state.need_resend;
+		} else {
+			resend = old_state.need_resend;
+			new_state.need_resend = 0;
+		}
+	} while (!icp_try_update(icp, old_state, new_state, false));
+
+	if (resend)
+		icp_check_resend(xics, icp);
+
+	return 0;
+}
+
 static int kvm_xics_get_sources(struct kvm *kvm, struct kvm_irq_sources *srcs)
 {
 	int ret = 0;
@@ -1080,6 +1170,8 @@ static struct kvm_irq_ctrler xics_ctrler = {
 	.free_ctrler = kvmppc_xics_free,
 	.hcall = kvmppc_xics_hcall,
 	.ioctl = kvmppc_xics_ioctl,
+	.get_one_reg = kvmppc_xics_get_icp,
+	.set_one_reg = kvmppc_xics_set_icp,
 };
 
 int kvmppc_xics_create(struct kvm *kvm, struct kvm_irqchip_args *args)
-- 
1.7.10.rc3.219.g53414

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 6/9] KVM: PPC: Book3S HV: Speed up wakeups of CPUs on HV KVM
  2013-02-14 23:59 [PATCH 0/9] In-kernel XICS interrupt controller emulation Paul Mackerras
                   ` (4 preceding siblings ...)
  2013-02-15  0:02 ` [PATCH 5/9] KVM: PPC: Book3S: Facilities to save/restore XICS presentation ctrler state Paul Mackerras
@ 2013-02-15  0:02 ` Paul Mackerras
  2013-02-15  0:03 ` [PATCH 7/9] KVM: PPC: Book3S HV: Add support for real mode ICP in XICS emulation Paul Mackerras
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: Paul Mackerras @ 2013-02-15  0:02 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm, kvm-ppc

From: Benjamin Herrenschmidt <benh@kernel.crashing.org>

Currently, we wake up a CPU by sending a host IPI with
smp_send_reschedule() to thread 0 of that core, which will take all
threads out of the guest, and cause them to re-evaluate their
interrupt status on the way back in.

This adds a mechanism to differentiate real host IPIs from IPIs sent
by KVM for guest threads to poke each other, in order to target the
guest threads precisely when possible and avoid that global switch of
the core to host state.

We then use this new facility in the in-kernel XICS code.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 arch/powerpc/include/asm/kvm_book3s_asm.h |    8 ++-
 arch/powerpc/include/asm/kvm_ppc.h        |   29 ++++++++
 arch/powerpc/kernel/asm-offsets.c         |    2 +
 arch/powerpc/kvm/book3s_hv.c              |   26 +++++++-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   |  102 ++++++++++++++++++++++++-----
 arch/powerpc/kvm/book3s_xics.c            |    2 +-
 arch/powerpc/sysdev/xics/icp-native.c     |    8 +++
 7 files changed, 158 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h b/arch/powerpc/include/asm/kvm_book3s_asm.h
index 88609b2..923522d 100644
--- a/arch/powerpc/include/asm/kvm_book3s_asm.h
+++ b/arch/powerpc/include/asm/kvm_book3s_asm.h
@@ -20,6 +20,11 @@
 #ifndef __ASM_KVM_BOOK3S_ASM_H__
 #define __ASM_KVM_BOOK3S_ASM_H__
 
+/* XICS ICP register offsets */
+#define XICS_XIRR		4
+#define XICS_MFRR		0xc
+#define XICS_IPI		2	/* interrupt source # for IPIs */
+
 #ifdef __ASSEMBLY__
 
 #ifdef CONFIG_KVM_BOOK3S_HANDLER
@@ -81,10 +86,11 @@ struct kvmppc_host_state {
 #ifdef CONFIG_KVM_BOOK3S_64_HV
 	u8 hwthread_req;
 	u8 hwthread_state;
-
+	u8 host_ipi;
 	struct kvm_vcpu *kvm_vcpu;
 	struct kvmppc_vcore *kvm_vcore;
 	unsigned long xics_phys;
+	u32 saved_xirr;
 	u64 dabr;
 	u64 host_mmcr[3];
 	u32 host_pmc[8];
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 2308fff..f371af8a 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -259,6 +259,21 @@ static inline void kvmppc_set_xics_phys(int cpu, unsigned long addr)
 	paca[cpu].kvm_hstate.xics_phys = addr;
 }
 
+static inline u32 kvmppc_get_xics_latch(void)
+{
+	u32 xirr = get_paca()->kvm_hstate.saved_xirr;
+
+	get_paca()->kvm_hstate.saved_xirr = 0;
+
+	return xirr;
+}
+
+static inline void kvmppc_set_host_ipi(int cpu, u8 host_ipi)
+{
+	paca[cpu].kvm_hstate.host_ipi = host_ipi;
+}
+
+extern void kvmppc_fast_vcpu_kick(struct kvm_vcpu *vcpu);
 extern void kvm_linear_init(void);
 
 #else
@@ -268,6 +283,18 @@ static inline void kvmppc_set_xics_phys(int cpu, unsigned long addr)
 static inline void kvm_linear_init(void)
 {}
 
+static inline u32 kvmppc_get_xics_latch(void)
+{
+	return 0;
+}
+
+static inline void kvmppc_set_host_ipi(int cpu, u8 host_ipi)
+{}
+
+static inline void kvmppc_fast_vcpu_kick(struct kvm_vcpu *vcpu)
+{
+	kvm_vcpu_kick(vcpu);
+}
 #endif
 
 static inline void kvmppc_set_epr(struct kvm_vcpu *vcpu, u32 epr)
@@ -332,4 +359,6 @@ static inline ulong kvmppc_get_ea_indexed(struct kvm_vcpu *vcpu, int ra, int rb)
 	return ea;
 }
 
+extern void xics_wake_cpu(int cpu);
+
 #endif /* __POWERPC_KVM_PPC_H__ */
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 46f6afd..c564ac3 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -543,6 +543,8 @@ int main(void)
 	HSTATE_FIELD(HSTATE_KVM_VCPU, kvm_vcpu);
 	HSTATE_FIELD(HSTATE_KVM_VCORE, kvm_vcore);
 	HSTATE_FIELD(HSTATE_XICS_PHYS, xics_phys);
+	HSTATE_FIELD(HSTATE_SAVED_XIRR, saved_xirr);
+	HSTATE_FIELD(HSTATE_HOST_IPI, host_ipi);
 	HSTATE_FIELD(HSTATE_MMCR, host_mmcr);
 	HSTATE_FIELD(HSTATE_PMC, host_pmc);
 	HSTATE_FIELD(HSTATE_PURR, host_purr);
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 8f09c36..1365440 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -66,6 +66,31 @@
 static void kvmppc_end_cede(struct kvm_vcpu *vcpu);
 static int kvmppc_hv_setup_htab_rma(struct kvm_vcpu *vcpu);
 
+void kvmppc_fast_vcpu_kick(struct kvm_vcpu *vcpu)
+{
+	int me;
+	int cpu = vcpu->cpu;
+	wait_queue_head_t *wqp;
+
+	wqp = kvm_arch_vcpu_wq(vcpu);
+	if (waitqueue_active(wqp)) {
+		wake_up_interruptible(wqp);
+		++vcpu->stat.halt_wakeup;
+	}
+
+	me = get_cpu();
+
+	/* CPU points to the first thread of the core */
+	if (cpu != me && cpu >= 0 && cpu < nr_cpu_ids) {
+		int real_cpu = cpu + vcpu->arch.ptid;
+		if (paca[real_cpu].kvm_hstate.xics_phys)
+			xics_wake_cpu(real_cpu);
+		else if (cpu_online(cpu))
+			smp_send_reschedule(cpu);			
+	}
+	put_cpu();
+}
+
 /*
  * We use the vcpu_load/put functions to measure stolen time.
  * Stolen time is counted as time when either the vcpu is able to
@@ -985,7 +1010,6 @@ static void kvmppc_end_cede(struct kvm_vcpu *vcpu)
 }
 
 extern int __kvmppc_vcore_entry(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu);
-extern void xics_wake_cpu(int cpu);
 
 static void kvmppc_remove_runnable(struct kvmppc_vcore *vc,
 				   struct kvm_vcpu *vcpu)
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 10b6c35..c84c39d 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -79,10 +79,6 @@ _GLOBAL(kvmppc_hv_entry_trampoline)
  *                                                                            *
  *****************************************************************************/
 
-#define XICS_XIRR		4
-#define XICS_QIRR		0xc
-#define XICS_IPI		2	/* interrupt source # for IPIs */
-
 /*
  * We come in here when wakened from nap mode on a secondary hw thread.
  * Relocation is off and most register values are lost.
@@ -122,7 +118,7 @@ kvm_start_guest:
 	beq	27f
 25:	ld	r5,HSTATE_XICS_PHYS(r13)
 	li	r0,0xff
-	li	r6,XICS_QIRR
+	li	r6,XICS_MFRR
 	li	r7,XICS_XIRR
 	lwzcix	r8,r5,r7		/* get and ack the interrupt */
 	sync
@@ -667,17 +663,91 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206)
 	cmpwi	r12,BOOK3S_INTERRUPT_SYSCALL
 	beq	hcall_try_real_mode
 
-	/* Check for mediated interrupts (could be done earlier really ...) */
+	/* Only handle external interrupts here on arch 206 and later */
 BEGIN_FTR_SECTION
-	cmpwi	r12,BOOK3S_INTERRUPT_EXTERNAL
-	bne+	1f
-	andi.	r0,r11,MSR_EE
-	beq	1f
-	mfspr	r5,SPRN_LPCR
-	andi.	r0,r5,LPCR_MER
+	b	ext_interrupt_to_host
+END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_206)
+
+	/* External interrupt ? */
+	cmpwi	r12, BOOK3S_INTERRUPT_EXTERNAL
+	bne+	ext_interrupt_to_host
+
+	/* External interrupt, first check for host_ipi. If this is
+	 * set, we know the host wants us out so let's do it now
+	 */
+	lbz	r0, HSTATE_HOST_IPI(r13)
+	cmpwi	r0, 0
+	bne	ext_interrupt_to_host
+
+	/* Now read the interrupt from the ICP */
+	ld	r5, HSTATE_XICS_PHYS(r13)
+	li	r7, XICS_XIRR
+	cmpdi	r5, 0
+	beq-	ext_interrupt_to_host
+	lwzcix	r3, r5, r7
+	rlwinm.	r0, r3, 0, 0xffffff
+	sync
+	bne	1f
+
+	/* Nothing pending in the ICP, check for mediated interrupts
+	 * and bounce it to the guest
+	 */
+	andi.	r0, r11, MSR_EE
+	beq	ext_interrupt_to_host /* shouldn't happen ?? */
+	mfspr	r5, SPRN_LPCR
+	andi.	r0, r5, LPCR_MER
 	bne	bounce_ext_interrupt
-1:
-END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206)
+	b	ext_interrupt_to_host /* shouldn't happen ?? */
+
+1:	/* We found something in the ICP...
+	 *
+	 * If it's not an IPI, stash it in the PACA and return to
+	 * the host, we don't (yet) handle directing real external
+	 * interrupts directly to the guest
+	 */
+	cmpwi	r0, XICS_IPI
+	bne	ext_stash_for_host
+
+	/* It's an IPI, clear the MFRR and EOI it */
+	li	r0, 0xff
+	li	r6, XICS_MFRR
+	stbcix	r0, r5, r6		/* clear the IPI */
+	stwcix	r3, r5, r7		/* EOI it */
+	sync
+
+	/* We need to re-check host IPI now in case it got set in the
+	 * meantime. If it's clear, we bounce the interrupt to the
+	 * guest
+	 */
+	lbz	r0, HSTATE_HOST_IPI(r13)
+	cmpwi	r0, 0
+	bne-	1f
+
+	/* Allright, looks like an IPI for the guest, we need to set MER */
+	mfspr	r8,SPRN_LPCR
+	ori	r8,r8,LPCR_MER
+	mtspr	SPRN_LPCR,r8
+
+	/* And if the guest EE is set, we can deliver immediately, else
+	 * we return to the guest with MER set
+	 */
+	andi.	r0, r11, MSR_EE
+	bne	bounce_ext_interrupt
+	mr	r4, r9
+	b	fast_guest_return
+	
+	/* We raced with the host, we need to resend that IPI, bummer */
+1:	li	r0, IPI_PRIORITY
+	stbcix	r0, r5, r6		/* set the IPI */
+	sync
+	b	ext_interrupt_to_host
+
+ext_stash_for_host:
+	/* It's not an IPI and it's for the host, stash it in the PACA
+	 * before exit, it will be picked up by the host ICP driver
+	 */
+	stw	r3, HSTATE_SAVED_XIRR(r13)
+ext_interrupt_to_host:
 
 guest_exit_cont:		/* r9 = vcpu, r12 = trap, r13 = paca */
 	/* Save DEC */
@@ -820,7 +890,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_201)
 	beq	44f
 	ld	r8,HSTATE_XICS_PHYS(r6)	/* get thread's XICS reg addr */
 	li	r0,IPI_PRIORITY
-	li	r7,XICS_QIRR
+	li	r7,XICS_MFRR
 	stbcix	r0,r7,r8		/* trigger the IPI */
 44:	srdi.	r3,r3,1
 	addi	r6,r6,PACA_SIZE
@@ -1617,7 +1687,7 @@ secondary_nap:
 	beq	37f
 	sync
 	li	r0, 0xff
-	li	r6, XICS_QIRR
+	li	r6, XICS_MFRR
 	stbcix	r0, r5, r6		/* clear the IPI */
 	stwcix	r3, r5, r7		/* EOI it */
 37:	sync
diff --git a/arch/powerpc/kvm/book3s_xics.c b/arch/powerpc/kvm/book3s_xics.c
index f54b934..2027427 100644
--- a/arch/powerpc/kvm/book3s_xics.c
+++ b/arch/powerpc/kvm/book3s_xics.c
@@ -220,7 +220,7 @@ static inline bool icp_try_update(struct kvmppc_icp *icp,
 		kvmppc_book3s_queue_irqprio(icp->vcpu,
 					    BOOK3S_INTERRUPT_EXTERNAL_LEVEL);
 		if (!change_self)
-			kvm_vcpu_kick(icp->vcpu);
+			kvmppc_fast_vcpu_kick(icp->vcpu);
 	}
  bail:
 	return success;
diff --git a/arch/powerpc/sysdev/xics/icp-native.c b/arch/powerpc/sysdev/xics/icp-native.c
index 48861d3..20b328b 100644
--- a/arch/powerpc/sysdev/xics/icp-native.c
+++ b/arch/powerpc/sysdev/xics/icp-native.c
@@ -51,6 +51,12 @@ static struct icp_ipl __iomem *icp_native_regs[NR_CPUS];
 static inline unsigned int icp_native_get_xirr(void)
 {
 	int cpu = smp_processor_id();
+	unsigned int xirr;
+
+	/* Handled an interrupt latched by KVM */
+	xirr = kvmppc_get_xics_latch();
+	if (xirr)
+		return xirr;
 
 	return in_be32(&icp_native_regs[cpu]->xirr.word);
 }
@@ -138,6 +144,7 @@ static unsigned int icp_native_get_irq(void)
 
 static void icp_native_cause_ipi(int cpu, unsigned long data)
 {
+	kvmppc_set_host_ipi(cpu, 1);
 	icp_native_set_qirr(cpu, IPI_PRIORITY);
 }
 
@@ -151,6 +158,7 @@ static irqreturn_t icp_native_ipi_action(int irq, void *dev_id)
 {
 	int cpu = smp_processor_id();
 
+	kvmppc_set_host_ipi(cpu, 0);
 	icp_native_set_qirr(cpu, 0xff);
 
 	return smp_ipi_demux();
-- 
1.7.10.rc3.219.g53414

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 7/9] KVM: PPC: Book3S HV: Add support for real mode ICP in XICS emulation
  2013-02-14 23:59 [PATCH 0/9] In-kernel XICS interrupt controller emulation Paul Mackerras
                   ` (5 preceding siblings ...)
  2013-02-15  0:02 ` [PATCH 6/9] KVM: PPC: Book3S HV: Speed up wakeups of CPUs on HV KVM Paul Mackerras
@ 2013-02-15  0:03 ` Paul Mackerras
  2013-02-15  0:03 ` [PATCH 8/9] KVM: PPC: Book3S HV: Improve real-mode handling of external interrupts Paul Mackerras
  2013-02-15  0:04 ` [PATCH 9/9] KVM: PPC: Book3S: Add support for ibm,int-on/off RTAS calls Paul Mackerras
  8 siblings, 0 replies; 32+ messages in thread
From: Paul Mackerras @ 2013-02-15  0:03 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm, kvm-ppc

From: Benjamin Herrenschmidt <benh@kernel.crashing.org>

This adds an implementation of the XICS hypercalls in real mode for HV
KVM, which allows us to avoid exiting the guest MMU context on all
threads for a variety of operations such as fetching a pending
interrupt, EOI of messages, IPIs, etc.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 arch/powerpc/kvm/Makefile               |    1 +
 arch/powerpc/kvm/book3s_hv_rm_xics.c    |  402 +++++++++++++++++++++++++++++++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |   10 +-
 arch/powerpc/kvm/book3s_xics.c          |   64 ++++-
 arch/powerpc/kvm/book3s_xics.h          |   16 ++
 5 files changed, 475 insertions(+), 18 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_hv_rm_xics.c

diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
index e2eb04c..895e880 100644
--- a/arch/powerpc/kvm/Makefile
+++ b/arch/powerpc/kvm/Makefile
@@ -77,6 +77,7 @@ kvm-book3s_64-builtin-objs-$(CONFIG_KVM_BOOK3S_64_HV) := \
 	book3s_hv_rm_mmu.o \
 	book3s_64_vio_hv.o \
 	book3s_hv_ras.o \
+	book3s_hv_rm_xics.o \
 	book3s_hv_builtin.o
 
 kvm-book3s_64-module-objs := \
diff --git a/arch/powerpc/kvm/book3s_hv_rm_xics.c b/arch/powerpc/kvm/book3s_hv_rm_xics.c
new file mode 100644
index 0000000..3605e0c
--- /dev/null
+++ b/arch/powerpc/kvm/book3s_hv_rm_xics.c
@@ -0,0 +1,402 @@
+/*
+ * Copyright 2012 Michael Ellerman, IBM Corporation.
+ * Copyright 2012 Benjamin Herrenschmidt, IBM Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, version 2, as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/kernel.h>
+#include <linux/kvm_host.h>
+#include <linux/err.h>
+
+#include <asm/kvm_book3s.h>
+#include <asm/kvm_ppc.h>
+#include <asm/hvcall.h>
+#include <asm/xics.h>
+#include <asm/debug.h>
+#include <asm/synch.h>
+#include <asm/ppc-opcode.h>
+
+#include "book3s_xics.h"
+
+#define DEBUG_PASSUP
+
+static inline void rm_writeb(unsigned long paddr, u8 val)
+{
+	__asm__ __volatile__("sync; stbcix %0,0,%1"
+		: : "r" (val), "r" (paddr) : "memory");
+}
+
+static void icp_rm_set_vcpu_irq(struct kvm_vcpu *vcpu, struct kvm_vcpu *this_vcpu)
+{
+	struct kvmppc_icp *this_icp = this_vcpu->arch.icp;
+	unsigned long xics_phys;
+	int cpu;
+
+	/* Mark the target VCPU as having an interrupt pending */
+	vcpu->stat.queue_intr++;
+	set_bit(BOOK3S_IRQPRIO_EXTERNAL_LEVEL, &vcpu->arch.pending_exceptions);
+
+	/* Kick self ? Just set MER and return */
+	if (vcpu == this_vcpu) {
+		mtspr(SPRN_LPCR, mfspr(SPRN_LPCR) | LPCR_MER);
+		return;
+	}
+
+	/* Check if the core is loaded, if not, too hard */
+	cpu = vcpu->cpu;
+	if (cpu < 0 || cpu >= nr_cpu_ids) {
+		this_icp->rm_action |= XICS_RM_KICK_VCPU;
+		this_icp->rm_kick_target = vcpu;
+		return;
+	}
+	/* In SMT cpu will always point to thread 0, we adjust it */
+	cpu += vcpu->arch.ptid;
+
+	/* Not too hard, then poke the target */
+	xics_phys = paca[cpu].kvm_hstate.xics_phys;
+	rm_writeb(xics_phys + XICS_MFRR, IPI_PRIORITY);
+}
+
+static void icp_rm_clr_vcpu_irq(struct kvm_vcpu *vcpu)
+{
+	/* Note: Only called on self ! */
+	clear_bit(BOOK3S_IRQPRIO_EXTERNAL_LEVEL, &vcpu->arch.pending_exceptions);
+	mtspr(SPRN_LPCR, mfspr(SPRN_LPCR) & ~LPCR_MER);
+}
+
+static inline bool icp_rm_try_update(struct kvmppc_icp *icp,
+				     union kvmppc_icp_state old,
+				     union kvmppc_icp_state new)
+{
+	struct kvm_vcpu *this_vcpu = local_paca->kvm_hstate.kvm_vcpu;
+	bool success;
+
+	/* Calculate new output value */
+	new.out_ee = (new.xisr && (new.pending_pri < new.cppr));
+
+	/* Attempt atomic update */
+	success = cmpxchg64(&icp->state.raw, old.raw, new.raw) == old.raw;
+	if (!success)
+		goto bail;
+
+	/*
+	 * Check for output state update
+	 *
+	 * Note that this is racy since another processor could be updating
+	 * the state already. This is why we never clear the interrupt output
+	 * here, we only ever set it. The clear only happens prior to doing
+	 * an update and only by the processor itself. Currently we do it
+	 * in Accept (H_XIRR) and Up_Cppr (H_XPPR).
+	 *
+	 * We also do not try to figure out whether the EE state has changed,
+	 * we unconditionally set it if the new state calls for it. The reason
+	 * for that is that we opportunistically remove the pending interrupt
+	 * flag when raising CPPR, so we need to set it back here if an
+	 * interrupt is still pending.
+	 */
+	if (new.out_ee)
+		icp_rm_set_vcpu_irq(icp->vcpu, this_vcpu);
+
+	/* Expose the state change for debug purposes */
+	this_vcpu->arch.icp->rm_dbgstate = new;
+	this_vcpu->arch.icp->rm_dbgtgt = icp->vcpu;
+
+ bail:
+	return success;
+}
+
+static inline int check_too_hard(struct kvmppc_xics *xics, struct kvmppc_icp *icp)
+{
+	return (xics->real_mode_dbg || icp->rm_action) ? H_TOO_HARD : H_SUCCESS;
+}
+
+static void icp_rm_down_cppr(struct kvmppc_xics *xics, struct kvmppc_icp *icp,
+			     u8 new_cppr)
+{
+	union kvmppc_icp_state old_state, new_state;
+	bool resend;
+
+	/*
+	 * This handles several related states in one operation:
+	 *
+	 * ICP State: Down_CPPR
+	 *
+	 * Load CPPR with new value and if the XISR is 0
+	 * then check for resends:
+	 *
+	 * ICP State: Resend
+	 *
+	 * If MFRR is more favored than CPPR, check for IPIs
+	 * and notify ICS of a potential resend. This is done
+	 * asynchronously (when used in real mode, we will have
+	 * to exit here).
+	 *
+	 * We do not handle the complete Check_IPI as documented
+	 * here. In the PAPR, this state will be used for both
+	 * Set_MFRR and Down_CPPR. However, we know that we aren't
+	 * changing the MFRR state here so we don't need to handle
+	 * the case of an MFRR causing a reject of a pending irq,
+	 * this will have been handled when the MFRR was set in the
+	 * first place.
+	 *
+	 * Thus we don't have to handle rejects, only resends.
+	 *
+	 * When implementing real mode for HV KVM, resend will lead to
+	 * a H_TOO_HARD return and the whole transaction will be handled
+	 * in virtual mode.
+	 */
+	do {
+		old_state = new_state = ACCESS_ONCE(icp->state);
+
+		/* Down_CPPR */
+		new_state.cppr = new_cppr;
+
+		/*
+		 * Cut down Resend / Check_IPI / IPI
+		 *
+		 * The logic is that we cannot have a pending interrupt
+		 * trumped by an IPI at this point (see above), so we
+		 * know that either the pending interrupt is already an
+		 * IPI (in which case we don't care to override it) or
+		 * it's either more favored than us or non existent
+		 */
+		if (new_state.mfrr < new_cppr &&
+		    new_state.mfrr <= new_state.pending_pri) {
+			new_state.pending_pri = new_state.mfrr;
+			new_state.xisr = XICS_IPI;
+		}
+
+		/* Latch/clear resend bit */
+		resend = new_state.need_resend;
+		new_state.need_resend = 0;
+
+	} while (!icp_rm_try_update(icp, old_state, new_state));
+
+	/*
+	 * Now handle resend checks. Those are asynchronous to the ICP
+	 * state update in HW (ie bus transactions) so we can handle them
+	 * separately here as well.
+	 */
+	if (resend)
+		icp->rm_action |= XICS_RM_CHECK_RESEND;
+}
+
+
+unsigned long kvmppc_rm_h_xirr(struct kvm_vcpu *vcpu)
+{
+	union kvmppc_icp_state old_state, new_state;
+	struct kvmppc_xics *xics = vcpu->kvm->arch.irq_ctrler_private;
+	struct kvmppc_icp *icp = vcpu->arch.icp;
+	u32 xirr;
+
+	if (!xics || !xics->real_mode)
+		return H_TOO_HARD;
+
+	/* First clear the interrupt */
+	icp_rm_clr_vcpu_irq(icp->vcpu);
+
+	/*
+	 * ICP State: Accept_Interrupt
+	 *
+	 * Return the pending interrupt (if any) along with the
+	 * current CPPR, then clear the XISR & set CPPR to the
+	 * pending priority
+	 */
+	do {
+		old_state = new_state = ACCESS_ONCE(icp->state);
+
+		xirr = old_state.xisr | (((u32)old_state.cppr) << 24);
+		if (!old_state.xisr)
+			break;
+		new_state.cppr = new_state.pending_pri;
+		new_state.pending_pri = 0xff;
+		new_state.xisr = 0;
+
+	} while (!icp_rm_try_update(icp, old_state, new_state));
+
+	/* Return the result in GPR4 */
+	vcpu->arch.gpr[4] = xirr;
+
+	return check_too_hard(xics, icp);
+}
+
+int kvmppc_rm_h_ipi(struct kvm_vcpu *vcpu, unsigned long server, unsigned long mfrr)
+{
+        union kvmppc_icp_state old_state, new_state;
+	struct kvmppc_xics *xics = vcpu->kvm->arch.irq_ctrler_private;
+	struct kvmppc_icp *icp, *this_icp = vcpu->arch.icp;
+	u32 reject;
+	bool resend;
+	bool local;
+
+	if (!xics || !xics->real_mode)
+		return H_TOO_HARD;
+
+	local = vcpu->vcpu_id == server;
+	if (local)
+		icp = this_icp;
+	else
+		icp = kvmppc_xics_find_server(vcpu->kvm, server);
+	if (!icp)
+		return H_PARAMETER;
+
+	/*
+	 * ICP state: Set_MFRR
+	 *
+	 * If the CPPR is more favored than the new MFRR, then
+	 * nothing needs to be done as there can be no XISR to
+	 * reject.
+	 *
+	 * If the CPPR is less favored, then we might be replacing
+	 * an interrupt, and thus need to possibly reject it as in
+	 *
+	 * ICP state: Check_IPI
+	 */
+	do {
+		old_state = new_state = ACCESS_ONCE(icp->state);
+
+		/* Set_MFRR */
+		new_state.mfrr = mfrr;
+
+		/* Check_IPI */
+		reject = 0;
+		resend = false;
+		if (mfrr < new_state.cppr) {
+			/* Reject a pending interrupt if not an IPI */
+			if (mfrr <= new_state.pending_pri)
+				reject = new_state.xisr;
+			new_state.pending_pri = mfrr;
+			new_state.xisr = XICS_IPI;
+		}
+
+		if (mfrr > old_state.mfrr && mfrr > new_state.cppr) {
+			resend = new_state.need_resend;
+			new_state.need_resend = 0;
+		}
+	} while (!icp_rm_try_update(icp, old_state, new_state));
+
+	/* Pass rejects to virtual mode */
+	if (reject && reject != XICS_IPI) {
+		this_icp->rm_action |= XICS_RM_REJECT;
+		this_icp->rm_reject = reject;
+	}
+
+	/* Pass resends to virtual mode */
+	if (resend)
+		this_icp->rm_action |= XICS_RM_CHECK_RESEND;
+
+	return check_too_hard(xics, this_icp);
+}
+
+int kvmppc_rm_h_cppr(struct kvm_vcpu *vcpu, unsigned long cppr)
+{
+	union kvmppc_icp_state old_state, new_state;
+	struct kvmppc_xics *xics = vcpu->kvm->arch.irq_ctrler_private;
+	struct kvmppc_icp *icp = vcpu->arch.icp;
+	u32 reject;
+
+	if (!xics || !xics->real_mode)
+		return H_TOO_HARD;
+
+	/*
+	 * ICP State: Set_CPPR
+	 *
+	 * We can safely compare the new value with the current
+	 * value outside of the transaction as the CPPR is only
+	 * ever changed by the processor on itself
+	 */
+	if (cppr > icp->state.cppr) {
+		icp_rm_down_cppr(xics, icp, cppr);
+		goto bail;
+	} else if (cppr == icp->state.cppr)
+		return H_SUCCESS;
+
+	/*
+	 * ICP State: Up_CPPR
+	 *
+	 * The processor is raising its priority, this can result
+	 * in a rejection of a pending interrupt:
+	 *
+	 * ICP State: Reject_Current
+	 *
+	 * We can remove EE from the current processor, the update
+	 * transaction will set it again if needed
+	 */
+	icp_rm_clr_vcpu_irq(icp->vcpu);
+
+	do {
+		old_state = new_state = ACCESS_ONCE(icp->state);
+
+		reject = 0;
+		new_state.cppr = cppr;
+
+		if (cppr <= new_state.pending_pri) {
+			reject = new_state.xisr;
+			new_state.xisr = 0;
+			new_state.pending_pri = 0xff;
+		}
+
+	} while (!icp_rm_try_update(icp, old_state, new_state));
+
+	/* Pass rejects to virtual mode */
+	if (reject && reject != XICS_IPI) {
+		icp->rm_action |= XICS_RM_REJECT;
+		icp->rm_reject = reject;
+	}
+ bail:
+	return check_too_hard(xics, icp);
+}
+
+int kvmppc_rm_h_eoi(struct kvm_vcpu *vcpu, unsigned long xirr)
+{
+	struct kvmppc_xics *xics = vcpu->kvm->arch.irq_ctrler_private;
+	struct kvmppc_icp *icp = vcpu->arch.icp;
+	struct kvmppc_ics *ics;
+	struct ics_irq_state *state;
+	u32 irq = xirr & 0x00ffffff;
+	u16 src;
+
+	if (!xics || !xics->real_mode)
+		return H_TOO_HARD;
+
+	/*
+	 * ICP State: EOI
+	 *
+	 * Note: If EOI is incorrectly used by SW to lower the CPPR
+	 * value (ie more favored), we do not check for rejection of
+	 * a pending interrupt, this is a SW error and PAPR sepcifies
+	 * that we don't have to deal with it.
+	 *
+	 * The sending of an EOI to the ICS is handled after the
+	 * CPPR update
+	 *
+	 * ICP State: Down_CPPR which we handle
+	 * in a separate function as it's shared with H_CPPR.
+	 */
+	icp_rm_down_cppr(xics, icp, xirr >> 24);
+
+	/* IPIs have no EOI */
+	if (irq == XICS_IPI)
+		goto bail;
+	/*
+	 * EOI handling: If the interrupt is still asserted, we need to
+	 * resend it. We can take a lockless "peek" at the ICS state here.
+	 *
+	 * "Message" interrupts will never have "asserted" set
+	 */
+	ics = kvmppc_xics_find_ics(xics, irq, &src);
+	if (!ics)
+		goto bail;
+	state = &ics->irq_state[src];
+
+	/* Still asserted, resend it, we make it look like a reject */
+	if (state->asserted) {
+		icp->rm_action |= XICS_RM_REJECT;
+		icp->rm_reject = irq;
+	}
+ bail:
+	return check_too_hard(xics, icp);
+}
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index c84c39d..fe908f5 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1411,11 +1411,11 @@ hcall_real_table:
 	.long	0		/* 0x58 */
 	.long	0		/* 0x5c */
 	.long	0		/* 0x60 */
-	.long	0		/* 0x64 */
-	.long	0		/* 0x68 */
-	.long	0		/* 0x6c */
-	.long	0		/* 0x70 */
-	.long	0		/* 0x74 */
+	.long	.kvmppc_rm_h_eoi - hcall_real_table
+	.long	.kvmppc_rm_h_cppr - hcall_real_table
+	.long	.kvmppc_rm_h_ipi - hcall_real_table
+	.long	0		/* 0x70 - H_IPOLL */
+	.long	.kvmppc_rm_h_xirr - hcall_real_table
 	.long	0		/* 0x78 */
 	.long	0		/* 0x7c */
 	.long	0		/* 0x80 */
diff --git a/arch/powerpc/kvm/book3s_xics.c b/arch/powerpc/kvm/book3s_xics.c
index 2027427..abd2dde 100644
--- a/arch/powerpc/kvm/book3s_xics.c
+++ b/arch/powerpc/kvm/book3s_xics.c
@@ -27,6 +27,9 @@
 #define XICS_DBG(fmt...) do { } while (0)
 //#define XICS_DBG(fmt...) do { trace_printk(fmt); } while (0)
 
+#define ENABLE_REALMODE	true
+#define DEBUG_REALMODE	false
+
 /*
  * LOCKING
  * =======
@@ -213,8 +216,10 @@ static inline bool icp_try_update(struct kvmppc_icp *icp,
 	 * in Accept (H_XIRR) and Up_Cppr (H_XPPR).
 	 *
 	 * We also do not try to figure out whether the EE state has changed,
-	 * we unconditionally set it if the new state calls for it for the
-	 * same reason.
+	 * we unconditionally set it if the new state calls for it. The reason
+	 * for that is that we opportunistically remove the pending interrupt
+	 * flag when raising CPPR, so we need to set it back here if an
+	 * interrupt is still pending.
 	 */
 	if (new.out_ee) {
 		kvmppc_book3s_queue_irqprio(icp->vcpu,
@@ -476,7 +481,7 @@ static void icp_down_cppr(struct kvmppc_xics *xics, struct kvmppc_icp *icp,
 		icp_check_resend(xics, icp);
 }
 
-static noinline unsigned long h_xirr(struct kvm_vcpu *vcpu)
+static noinline unsigned long kvmppc_h_xirr(struct kvm_vcpu *vcpu)
 {
 	union kvmppc_icp_state old_state, new_state;
 	struct kvmppc_icp *icp = vcpu->arch.icp;
@@ -510,8 +515,8 @@ static noinline unsigned long h_xirr(struct kvm_vcpu *vcpu)
 	return xirr;
 }
 
-static noinline int h_ipi(struct kvm_vcpu *vcpu, unsigned long server,
-			  unsigned long mfrr)
+static noinline int kvmppc_h_ipi(struct kvm_vcpu *vcpu, unsigned long server,
+				 unsigned long mfrr)
 {
         union kvmppc_icp_state old_state, new_state;
 	struct kvmppc_xics *xics = vcpu->kvm->arch.irq_ctrler_private;
@@ -579,7 +584,7 @@ static noinline int h_ipi(struct kvm_vcpu *vcpu, unsigned long server,
 	return H_SUCCESS;
 }
 
-static noinline void h_cppr(struct kvm_vcpu *vcpu, unsigned long cppr)
+static noinline void kvmppc_h_cppr(struct kvm_vcpu *vcpu, unsigned long cppr)
 {
 	union kvmppc_icp_state old_state, new_state;
 	struct kvmppc_xics *xics = vcpu->kvm->arch.irq_ctrler_private;
@@ -636,7 +641,7 @@ static noinline void h_cppr(struct kvm_vcpu *vcpu, unsigned long cppr)
 		icp_deliver_irq(xics, icp, reject);
 }
 
-static noinline int h_eoi(struct kvm_vcpu *vcpu, unsigned long xirr)
+static noinline int kvmppc_h_eoi(struct kvm_vcpu *vcpu, unsigned long xirr)
 {
 	struct kvmppc_xics *xics = vcpu->kvm->arch.irq_ctrler_private;
 	struct kvmppc_icp *icp = vcpu->arch.icp;
@@ -686,29 +691,54 @@ static noinline int h_eoi(struct kvm_vcpu *vcpu, unsigned long xirr)
 	return H_SUCCESS;
 }
 
+static int noinline kvmppc_xics_rm_complete(struct kvm_vcpu *vcpu, u32 hcall)
+{
+	struct kvmppc_xics *xics = vcpu->kvm->arch.irq_ctrler_private;
+	struct kvmppc_icp *icp = vcpu->arch.icp;
+
+	XICS_DBG("XICS_RM: H_%x completing, act: %x state: %lx tgt: %p\n",
+		 hcall, icp->rm_action, icp->rm_dbgstate.raw, icp->rm_dbgtgt);
+
+	if (icp->rm_action & XICS_RM_KICK_VCPU)
+		kvmppc_fast_vcpu_kick(icp->rm_kick_target);
+	if (icp->rm_action & XICS_RM_CHECK_RESEND)
+		icp_check_resend(xics, icp);
+	if (icp->rm_action & XICS_RM_REJECT)
+		icp_deliver_irq(xics, icp, icp->rm_reject);
+
+	icp->rm_action = 0;
+
+	return H_SUCCESS;
+}
+
 static int kvmppc_xics_hcall(struct kvm_vcpu *vcpu, unsigned long req)
 {
+	struct kvmppc_xics *xics = vcpu->kvm->arch.irq_ctrler_private;
 	unsigned long res;
 	int rc = H_SUCCESS;
 
 	/* Check if we have an ICP */
-	if (!vcpu->arch.icp || !vcpu->kvm->arch.irq_ctrler_private)
+	if (!xics || !vcpu->arch.icp)
 		return H_HARDWARE;
 
+	/* Check for real mode returning too hard */
+	if (xics->real_mode)
+		return kvmppc_xics_rm_complete(vcpu, req);
+
 	switch (req) {
 	case H_XIRR:
-		res = h_xirr(vcpu);
+		res = kvmppc_h_xirr(vcpu);
 		kvmppc_set_gpr(vcpu, 4, res);
 		break;
 	case H_CPPR:
-		h_cppr(vcpu, kvmppc_get_gpr(vcpu, 4));
+		kvmppc_h_cppr(vcpu, kvmppc_get_gpr(vcpu, 4));
 		break;
 	case H_EOI:
-		rc = h_eoi(vcpu, kvmppc_get_gpr(vcpu, 4));
+		rc = kvmppc_h_eoi(vcpu, kvmppc_get_gpr(vcpu, 4));
 		break;
 	case H_IPI:
-		rc = h_ipi(vcpu, kvmppc_get_gpr(vcpu, 4),
-			   kvmppc_get_gpr(vcpu, 5));
+		rc = kvmppc_h_ipi(vcpu, kvmppc_get_gpr(vcpu, 4),
+				  kvmppc_get_gpr(vcpu, 5));
 		break;
 	}
 
@@ -1196,6 +1226,14 @@ int kvmppc_xics_create(struct kvm *kvm, struct kvm_irqchip_args *args)
 	kvm->arch.irq_ctrler_private = xics;
 	xics_debugfs_init(xics);
 
+#ifdef CONFIG_KVM_BOOK3S_64_HV
+	if (cpu_has_feature(CPU_FTR_ARCH_206)) {
+		/* Enable real mode support */
+		xics->real_mode = ENABLE_REALMODE;
+		xics->real_mode_dbg = DEBUG_REALMODE;
+	}
+#endif /* CONFIG_KVM_BOOK3S_64_HV */
+
 out:
 	mutex_unlock(&kvm->lock);
 	return rc;
diff --git a/arch/powerpc/kvm/book3s_xics.h b/arch/powerpc/kvm/book3s_xics.h
index 0e20a51..2da27d7 100644
--- a/arch/powerpc/kvm/book3s_xics.h
+++ b/arch/powerpc/kvm/book3s_xics.h
@@ -62,6 +62,20 @@ struct kvmppc_icp {
 	struct kvm_vcpu *vcpu;
 	union kvmppc_icp_state state;
 	unsigned long resend_map[ICP_RESEND_MAP_SIZE];
+
+	/* Real mode might find something too hard, here's the action
+	 * it might request from virtual mode
+	 */
+#define XICS_RM_KICK_VCPU	0x1
+#define XICS_RM_CHECK_RESEND	0x2
+#define XICS_RM_REJECT		0x4
+	u32 rm_action;
+	struct kvm_vcpu *rm_kick_target;
+	u32  rm_reject;
+
+	/* Debug stuff for real mode */
+	union kvmppc_icp_state rm_dbgstate;
+	struct kvm_vcpu *rm_dbgtgt;
 };
 
 struct kvmppc_ics {
@@ -74,6 +88,8 @@ struct kvmppc_xics {
 	struct kvm *kvm;
 	struct dentry *dentry;
 	u32 max_icsid;
+	bool real_mode;
+	bool real_mode_dbg;
 	struct kvmppc_ics *ics[KVMPPC_XICS_MAX_ICS_ID + 1];
 };
 
-- 
1.7.10.rc3.219.g53414

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 8/9] KVM: PPC: Book3S HV: Improve real-mode handling of external interrupts
  2013-02-14 23:59 [PATCH 0/9] In-kernel XICS interrupt controller emulation Paul Mackerras
                   ` (6 preceding siblings ...)
  2013-02-15  0:03 ` [PATCH 7/9] KVM: PPC: Book3S HV: Add support for real mode ICP in XICS emulation Paul Mackerras
@ 2013-02-15  0:03 ` Paul Mackerras
  2013-02-15  0:04 ` [PATCH 9/9] KVM: PPC: Book3S: Add support for ibm,int-on/off RTAS calls Paul Mackerras
  8 siblings, 0 replies; 32+ messages in thread
From: Paul Mackerras @ 2013-02-15  0:03 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm, kvm-ppc

This streamlines our handling of external interrupts that come in
while we're in the guest.  First, when waking up a hardware thread
that was napping, we split off the "napping due to H_CEDE" case
earlier, and use the code that handles an external interrupt (0x500)
in the guest to handle that too.  Secondly, the code that handles
those external interrupts now checks if any other thread is exiting
to the host before bouncing an external interrupt to the guest, and
also checks that there is actually an external interrupt pending for
the guest before setting the LPCR MER bit (mediated external request).

This also makes sure that we clear the "ceded" flag when we handle a
wakeup from cede in real mode, and fixes a potential infinite loop
in kvmppc_run_vcpu() which can occur if we ever end up with the ceded
flag set but MSR[EE] off.

Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 arch/powerpc/include/asm/reg.h          |    1 +
 arch/powerpc/kvm/book3s_hv.c            |    5 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |  140 +++++++++++++++++--------------
 3 files changed, 81 insertions(+), 65 deletions(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 11ae3d8..abe34e0 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -254,6 +254,7 @@
 #define     LPCR_PECE1	0x00002000	/* decrementer can cause exit */
 #define     LPCR_PECE2	0x00001000	/* machine check etc can cause exit */
 #define   LPCR_MER	0x00000800	/* Mediated External Exception */
+#define   LPCR_MER_SH	11
 #define   LPCR_LPES    0x0000000c
 #define   LPCR_LPES0   0x00000008      /* LPAR Env selector 0 */
 #define   LPCR_LPES1   0x00000004      /* LPAR Env selector 1 */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 1365440..bd751a3 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1384,9 +1384,12 @@ static int kvmppc_run_vcpu(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
 			break;
 		vc->runner = vcpu;
 		n_ceded = 0;
-		list_for_each_entry(v, &vc->runnable_threads, arch.run_list)
+		list_for_each_entry(v, &vc->runnable_threads, arch.run_list) {
 			if (!v->arch.pending_exceptions)
 				n_ceded += v->arch.ceded;
+			else
+				v->arch.ceded = 0;
+		}
 		if (n_ceded == vc->n_runnable)
 			kvmppc_vcore_blocked(vc);
 		else
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index fe908f5..0c519cb 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -97,50 +97,51 @@ kvm_start_guest:
 	li	r0,1
 	stb	r0,PACA_NAPSTATELOST(r13)
 
-	/* get vcpu pointer, NULL if we have no vcpu to run */
-	ld	r4,HSTATE_KVM_VCPU(r13)
-	cmpdi	cr1,r4,0
+	/* were we napping due to cede? */
+	lbz	r0,HSTATE_NAPPING(r13)
+	cmpwi	r0,0
+	bne	kvm_end_cede
+
+	/*
+	 * We weren't napping due to cede, so this must be a secondary
+	 * thread being woken up to run a guest, or being woken up due
+	 * to a stray IPI.  (Or due to some machine check or hypervisor
+	 * maintenance interrupt while the core is in KVM.)
+	 */
 
 	/* Check the wake reason in SRR1 to see why we got here */
 	mfspr	r3,SPRN_SRR1
 	rlwinm	r3,r3,44-31,0x7		/* extract wake reason field */
 	cmpwi	r3,4			/* was it an external interrupt? */
-	bne	27f
-
-	/*
-	 * External interrupt - for now assume it is an IPI, since we
-	 * should never get any other interrupts sent to offline threads.
-	 * Only do this for secondary threads.
-	 */
-	beq	cr1,25f
-	lwz	r3,VCPU_PTID(r4)
-	cmpwi	r3,0
-	beq	27f
-25:	ld	r5,HSTATE_XICS_PHYS(r13)
-	li	r0,0xff
-	li	r6,XICS_MFRR
-	li	r7,XICS_XIRR
+	bne	27f			/* if not */
+	ld	r5,HSTATE_XICS_PHYS(r13)
+	li	r7,XICS_XIRR		/* if it was an external interrupt, */
 	lwzcix	r8,r5,r7		/* get and ack the interrupt */
 	sync
 	clrldi.	r9,r8,40		/* get interrupt source ID. */
-	beq	27f			/* none there? */
-	cmpwi	r9,XICS_IPI
-	bne	26f
+	beq	28f			/* none there? */
+	cmpwi	r9,XICS_IPI		/* was it an IPI? */
+	bne	29f
+	li	r0,0xff
+	li	r6,XICS_MFRR
 	stbcix	r0,r5,r6		/* clear IPI */
-26:	stwcix	r8,r5,r7		/* EOI the interrupt */
-
-27:	/* XXX should handle hypervisor maintenance interrupts etc. here */
+	stwcix	r8,r5,r7		/* EOI the interrupt */
+	sync				/* order loading of vcpu after that */
 
-	/* reload vcpu pointer after clearing the IPI */
+	/* get vcpu pointer, NULL if we have no vcpu to run */
 	ld	r4,HSTATE_KVM_VCPU(r13)
 	cmpdi	r4,0
 	/* if we have no vcpu to run, go back to sleep */
 	beq	kvm_no_guest
+	b	kvmppc_hv_entry
 
-	/* were we napping due to cede? */
-	lbz	r0,HSTATE_NAPPING(r13)
-	cmpwi	r0,0
-	bne	kvm_end_cede
+27:	/* XXX should handle hypervisor maintenance interrupts etc. here */
+	b	kvm_no_guest
+28:	/* SRR1 said external but ICP said nope?? */
+	b	kvm_no_guest
+29:	/* External non-IPI interrupt to offline secondary thread? help?? */
+	stw	r8,HSTATE_SAVED_XIRR(r13)
+	b	kvm_no_guest
 
 .global kvmppc_hv_entry
 kvmppc_hv_entry:
@@ -481,20 +482,20 @@ toc_tlbie_lock:
 	mtctr	r6
 	mtxer	r7
 
+	ld	r10, VCPU_PC(r4)
+	ld	r11, VCPU_MSR(r4)
 kvmppc_cede_reentry:		/* r4 = vcpu, r13 = paca */
 	ld	r6, VCPU_SRR0(r4)
 	ld	r7, VCPU_SRR1(r4)
-	ld	r10, VCPU_PC(r4)
-	ld	r11, VCPU_MSR(r4)	/* r11 = vcpu->arch.msr & ~MSR_HV */
 
+	/* r11 = vcpu->arch.msr & ~MSR_HV */
 	rldicl	r11, r11, 63 - MSR_HV_LG, 1
 	rotldi	r11, r11, 1 + MSR_HV_LG
 	ori	r11, r11, MSR_ME
 
 	/* Check if we can deliver an external or decrementer interrupt now */
 	ld	r0,VCPU_PENDING_EXC(r4)
-	li	r8,(1 << BOOK3S_IRQPRIO_EXTERNAL)
-	oris	r8,r8,(1 << BOOK3S_IRQPRIO_EXTERNAL_LEVEL)@h
+	lis	r8,(1 << BOOK3S_IRQPRIO_EXTERNAL_LEVEL)@h
 	and	r0,r0,r8
 	cmpdi	cr1,r0,0
 	andi.	r0,r11,MSR_EE
@@ -522,10 +523,10 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206)
 	/* Move SRR0 and SRR1 into the respective regs */
 5:	mtspr	SPRN_SRR0, r6
 	mtspr	SPRN_SRR1, r7
-	li	r0,0
-	stb	r0,VCPU_CEDED(r4)	/* cancel cede */
 
 fast_guest_return:
+	li	r0,0
+	stb	r0,VCPU_CEDED(r4)	/* cancel cede */
 	mtspr	SPRN_HSRR0,r10
 	mtspr	SPRN_HSRR1,r11
 
@@ -675,6 +676,7 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_206)
 	/* External interrupt, first check for host_ipi. If this is
 	 * set, we know the host wants us out so let's do it now
 	 */
+do_ext_interrupt:
 	lbz	r0, HSTATE_HOST_IPI(r13)
 	cmpwi	r0, 0
 	bne	ext_interrupt_to_host
@@ -687,19 +689,9 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_206)
 	lwzcix	r3, r5, r7
 	rlwinm.	r0, r3, 0, 0xffffff
 	sync
-	bne	1f
-
-	/* Nothing pending in the ICP, check for mediated interrupts
-	 * and bounce it to the guest
-	 */
-	andi.	r0, r11, MSR_EE
-	beq	ext_interrupt_to_host /* shouldn't happen ?? */
-	mfspr	r5, SPRN_LPCR
-	andi.	r0, r5, LPCR_MER
-	bne	bounce_ext_interrupt
-	b	ext_interrupt_to_host /* shouldn't happen ?? */
+	beq	3f		/* if nothing pending in the ICP */
 
-1:	/* We found something in the ICP...
+	/* We found something in the ICP...
 	 *
 	 * If it's not an IPI, stash it in the PACA and return to
 	 * the host, we don't (yet) handle directing real external
@@ -724,18 +716,35 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_206)
 	bne-	1f
 
 	/* Allright, looks like an IPI for the guest, we need to set MER */
-	mfspr	r8,SPRN_LPCR
-	ori	r8,r8,LPCR_MER
-	mtspr	SPRN_LPCR,r8
+3:
+	/* Check if any CPU is heading out to the host, if so head out too */
+	ld	r5, HSTATE_KVM_VCORE(r13)
+	lwz	r0, VCORE_ENTRY_EXIT(r5)
+	cmpwi	r0, 0x100
+	bge	ext_interrupt_to_host
+
+	/* See if there is a pending interrupt for the guest */
+	mfspr	r8, SPRN_LPCR
+	ld	r0, VCPU_PENDING_EXC(r9)
+	/* Insert EXTERNAL_LEVEL bit into LPCR at the MER bit position */
+	rldicl.	r0, r0, 64 - BOOK3S_IRQPRIO_EXTERNAL_LEVEL, 63
+	rldimi	r8, r0, LPCR_MER_SH, 63 - LPCR_MER_SH
+	beq	2f
 
 	/* And if the guest EE is set, we can deliver immediately, else
 	 * we return to the guest with MER set
 	 */
 	andi.	r0, r11, MSR_EE
-	bne	bounce_ext_interrupt
-	mr	r4, r9
+	beq	2f
+	mtspr	SPRN_SRR0, r10
+	mtspr	SPRN_SRR1, r11
+	li	r10, BOOK3S_INTERRUPT_EXTERNAL
+	li	r11, (MSR_ME << 1) | 1	/* synthesize MSR_SF | MSR_ME */
+	rotldi	r11, r11, 63
+2:	mr	r4, r9
+	mtspr	SPRN_LPCR, r8
 	b	fast_guest_return
-	
+
 	/* We raced with the host, we need to resend that IPI, bummer */
 1:	li	r0, IPI_PRIORITY
 	stbcix	r0, r5, r6		/* set the IPI */
@@ -1466,15 +1475,6 @@ ignore_hdec:
 	mr	r4,r9
 	b	fast_guest_return
 
-bounce_ext_interrupt:
-	mr	r4,r9
-	mtspr	SPRN_SRR0,r10
-	mtspr	SPRN_SRR1,r11
-	li	r10,BOOK3S_INTERRUPT_EXTERNAL
-	li	r11,(MSR_ME << 1) | 1	/* synthesize MSR_SF | MSR_ME */
-	rotldi	r11,r11,63
-	b	fast_guest_return
-
 _GLOBAL(kvmppc_h_set_dabr)
 	std	r4,VCPU_DABR(r3)
 	/* Work around P7 bug where DABR can get corrupted on mtspr */
@@ -1580,6 +1580,9 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_206)
 	b	.
 
 kvm_end_cede:
+	/* get vcpu pointer */
+	ld	r4, HSTATE_KVM_VCPU(r13)
+
 	/* Woken by external or decrementer interrupt */
 	ld	r1, HSTATE_HOST_R1(r13)
 
@@ -1619,6 +1622,16 @@ kvm_end_cede:
 	li	r0,0
 	stb	r0,HSTATE_NAPPING(r13)
 
+	/* Check the wake reason in SRR1 to see why we got here */
+	mfspr	r3, SPRN_SRR1
+	rlwinm	r3, r3, 44-31, 0x7	/* extract wake reason field */
+	cmpwi	r3, 4			/* was it an external interrupt? */
+	li	r12, BOOK3S_INTERRUPT_EXTERNAL
+	mr	r9, r4
+	ld	r10, VCPU_PC(r9)
+	ld	r11, VCPU_MSR(r9)
+	beq	do_ext_interrupt	/* if so */
+	
 	/* see if any other thread is already exiting */
 	lwz	r0,VCORE_ENTRY_EXIT(r5)
 	cmpwi	r0,0x100
@@ -1638,8 +1651,7 @@ kvm_cede_prodded:
 
 	/* we've ceded but we want to give control to the host */
 kvm_cede_exit:
-	li	r3,H_TOO_HARD
-	blr
+	b	hcall_real_fallback
 
 	/* Try to handle a machine check in real mode */
 machine_check_realmode:
-- 
1.7.10.rc3.219.g53414

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 9/9] KVM: PPC: Book3S: Add support for ibm,int-on/off RTAS calls
  2013-02-14 23:59 [PATCH 0/9] In-kernel XICS interrupt controller emulation Paul Mackerras
                   ` (7 preceding siblings ...)
  2013-02-15  0:03 ` [PATCH 8/9] KVM: PPC: Book3S HV: Improve real-mode handling of external interrupts Paul Mackerras
@ 2013-02-15  0:04 ` Paul Mackerras
  8 siblings, 0 replies; 32+ messages in thread
From: Paul Mackerras @ 2013-02-15  0:04 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm, kvm-ppc

This adds support for the ibm,int-on and ibm,int-off RTAS calls to the
in-kernel XICS emulation and corrects the handling of the saved
priority by the ibm,set-xive RTAS call.  With this, ibm,int-off sets
the specified interrupt's priority in its saved_priority field and
sets the priority to 0xff (the least favoured value).  ibm,int-on
restores the saved_priority to the priority field, and ibm,set-xive
sets both the priority and the saved_priority to the specified
priority value.

Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 arch/powerpc/include/asm/kvm_ppc.h |    2 +
 arch/powerpc/kvm/book3s_rtas.c     |   40 ++++++++++++++
 arch/powerpc/kvm/book3s_xics.c     |  101 +++++++++++++++++++++++++++++-------
 arch/powerpc/kvm/book3s_xics.h     |    2 +-
 4 files changed, 125 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index f371af8a..dd7c1fc 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -171,6 +171,8 @@ extern int kvmppc_rtas_hcall(struct kvm_vcpu *vcpu);
 extern void kvmppc_rtas_tokens_free(struct kvm *kvm);
 extern int kvmppc_xics_set_xive(struct kvm *kvm, u32 irq, u32 server, u32 priority);
 extern int kvmppc_xics_get_xive(struct kvm *kvm, u32 irq, u32 *server, u32 *priority);
+extern int kvmppc_xics_int_on(struct kvm *kvm, u32 irq);
+extern int kvmppc_xics_int_off(struct kvm *kvm, u32 irq);
 
 /*
  * Cuts out inst bits with ordering according to spec.
diff --git a/arch/powerpc/kvm/book3s_rtas.c b/arch/powerpc/kvm/book3s_rtas.c
index 6a6c1fe..fc1a749 100644
--- a/arch/powerpc/kvm/book3s_rtas.c
+++ b/arch/powerpc/kvm/book3s_rtas.c
@@ -64,6 +64,44 @@ out:
 	args->rets[0] = rc;
 }
 
+static void kvm_rtas_int_off(struct kvm_vcpu *vcpu, struct rtas_args *args)
+{
+	u32 irq;
+	int rc;
+
+	if (args->nargs != 1 || args->nret != 1) {
+		rc = -3;
+		goto out;
+	}
+
+	irq = args->args[0];
+
+	rc = kvmppc_xics_int_off(vcpu->kvm, irq);
+	if (rc)
+		rc = -3;
+out:
+	args->rets[0] = rc;
+}
+
+static void kvm_rtas_int_on(struct kvm_vcpu *vcpu, struct rtas_args *args)
+{
+	u32 irq;
+	int rc;
+
+	if (args->nargs != 1 || args->nret != 1) {
+		rc = -3;
+		goto out;
+	}
+
+	irq = args->args[0];
+
+	rc = kvmppc_xics_int_on(vcpu->kvm, irq);
+	if (rc)
+		rc = -3;
+out:
+	args->rets[0] = rc;
+}
+
 struct rtas_handler {
 	void (*handler)(struct kvm_vcpu *vcpu, struct rtas_args *args);
 	char *name;
@@ -72,6 +110,8 @@ struct rtas_handler {
 static struct rtas_handler rtas_handlers[] = {
 	{ .name = "ibm,set-xive", .handler = kvm_rtas_set_xive },
 	{ .name = "ibm,get-xive", .handler = kvm_rtas_get_xive },
+	{ .name = "ibm,int-off",  .handler = kvm_rtas_int_off },
+	{ .name = "ibm,int-on",   .handler = kvm_rtas_int_on },
 };
 
 struct rtas_token_definition {
diff --git a/arch/powerpc/kvm/book3s_xics.c b/arch/powerpc/kvm/book3s_xics.c
index abd2dde..665777e 100644
--- a/arch/powerpc/kvm/book3s_xics.c
+++ b/arch/powerpc/kvm/book3s_xics.c
@@ -116,6 +116,28 @@ static void ics_check_resend(struct kvmppc_xics *xics, struct kvmppc_ics *ics,
 	mutex_unlock(&ics->lock);
 }
 
+static bool write_xive(struct kvmppc_xics *xics, struct kvmppc_ics *ics,
+		       struct ics_irq_state *state,
+		       u32 server, u32 priority, u32 saved_priority)
+{
+	bool deliver;
+
+	mutex_lock(&ics->lock);
+
+	state->server = server;
+	state->priority = priority;
+	state->saved_priority = saved_priority;
+	deliver = false;
+	if ((state->masked_pending || state->resend) && priority != MASKED) {
+		state->masked_pending = 0;
+		deliver = true;
+	}
+
+	mutex_unlock(&ics->lock);
+
+	return deliver;
+}
+
 int kvmppc_xics_set_xive(struct kvm *kvm, u32 irq, u32 server, u32 priority)
 {
 	struct kvmppc_xics *xics = kvm->arch.irq_ctrler_private;
@@ -123,7 +145,6 @@ int kvmppc_xics_set_xive(struct kvm *kvm, u32 irq, u32 server, u32 priority)
 	struct kvmppc_ics *ics;
 	struct ics_irq_state *state;
 	u16 src;
-	bool deliver;
 
 	if (!xics)
 		return -ENODEV;
@@ -137,23 +158,11 @@ int kvmppc_xics_set_xive(struct kvm *kvm, u32 irq, u32 server, u32 priority)
 	if (!icp)
 		return -EINVAL;
 
-	mutex_lock(&ics->lock);
-
 	XICS_DBG("set_xive %#x server %#x prio %#x MP:%d RS:%d\n",
 		 irq, server, priority,
 		 state->masked_pending, state->resend);
 
-	state->server = server;
-	state->priority = priority;
-	deliver = false;
-	if ((state->masked_pending || state->resend) && priority != MASKED) {
-		state->masked_pending = 0;
-		deliver = true;
-	}
-
-	mutex_unlock(&ics->lock);
-
-	if (deliver)
+	if (write_xive(xics, ics, state, server, priority, priority))
 		icp_deliver_irq(xics, icp, irq);
 
 	return 0;
@@ -182,6 +191,53 @@ int kvmppc_xics_get_xive(struct kvm *kvm, u32 irq, u32 *server, u32 *priority)
 	return 0;
 }
 
+int kvmppc_xics_int_on(struct kvm *kvm, u32 irq)
+{
+	struct kvmppc_xics *xics = kvm->arch.irq_ctrler_private;
+	struct kvmppc_icp *icp;
+	struct kvmppc_ics *ics;
+	struct ics_irq_state *state;
+	u16 src;
+
+	if (!xics)
+		return -ENODEV;
+
+	ics = kvmppc_xics_find_ics(xics, irq, &src);
+	if (!ics)
+		return -EINVAL;
+	state = &ics->irq_state[src];
+
+	icp = kvmppc_xics_find_server(kvm, state->server);
+	if (!icp)
+		return -EINVAL;
+
+	if (write_xive(xics, ics, state, state->server, state->saved_priority,
+		       state->saved_priority))
+		icp_deliver_irq(xics, icp, irq);
+
+	return 0;
+}
+
+int kvmppc_xics_int_off(struct kvm *kvm, u32 irq)
+{
+	struct kvmppc_xics *xics = kvm->arch.irq_ctrler_private;
+	struct kvmppc_ics *ics;
+	struct ics_irq_state *state;
+	u16 src;
+
+	if (!xics)
+		return -ENODEV;
+
+	ics = kvmppc_xics_find_ics(xics, irq, &src);
+	if (!ics)
+		return -EINVAL;
+	state = &ics->irq_state[src];
+
+	write_xive(xics, ics, state, state->server, MASKED, state->priority);
+
+	return 0;
+}
+
 /* -- ICP routines, including hcalls -- */
 
 static inline bool icp_try_update(struct kvmppc_icp *icp,
@@ -576,7 +632,7 @@ static noinline int kvmppc_h_ipi(struct kvm_vcpu *vcpu, unsigned long server,
 	/* Handle reject */
 	if (reject && reject != XICS_IPI)
 		icp_deliver_irq(xics, icp, reject);
-		
+
 	/* Handle resend */
 	if (resend)
 		icp_check_resend(xics, icp);
@@ -1016,7 +1072,7 @@ static int kvm_xics_get_sources(struct kvm *kvm, struct kvm_irq_sources *srcs)
 	struct ics_irq_state *irqp;
 	u64 __user *ubufp;
 	u16 idx;
-	u64 val;
+	u64 val, prio;
 	long int i, irq, nirq;
 
 	irq = srcs->irq;
@@ -1039,9 +1095,12 @@ static int kvm_xics_get_sources(struct kvm *kvm, struct kvm_irq_sources *srcs)
 			if (!irqp->exists)
 				break;
 			val = irqp->server;
-			val |= ((u64)irqp->priority << KVM_IRQ_PRIORITY_SHIFT);
-			if (irqp->priority == MASKED)
+			prio = irqp->priority;
+			if (irqp->priority == MASKED) {
 				val |= KVM_IRQ_MASKED;
+				prio = irqp->saved_priority;
+			}
+			val |= prio << KVM_IRQ_PRIORITY_SHIFT;
 			if (irqp->asserted)
 				val |= KVM_IRQ_LEVEL_SENSITIVE |
 					KVM_IRQ_PENDING;
@@ -1099,7 +1158,11 @@ static int kvm_xics_set_sources(struct kvm *kvm, struct kvm_irq_sources *srcs)
 
 			mutex_lock(&ics->lock);
 			irqp->server = val & KVM_IRQ_SERVER_MASK;
-			irqp->priority = val >> KVM_IRQ_PRIORITY_SHIFT;
+			irqp->saved_priority = val >> KVM_IRQ_PRIORITY_SHIFT;
+			if (val & KVM_IRQ_MASKED)
+				irqp->priority = MASKED;
+			else
+				irqp->priority = irqp->saved_priority;
 			irqp->resend = 0;
 			irqp->masked_pending = 0;
 			irqp->asserted = 0;
diff --git a/arch/powerpc/kvm/book3s_xics.h b/arch/powerpc/kvm/book3s_xics.h
index 2da27d7..9c1f1bb 100644
--- a/arch/powerpc/kvm/book3s_xics.h
+++ b/arch/powerpc/kvm/book3s_xics.h
@@ -35,7 +35,7 @@ struct ics_irq_state {
 	u32 number;
 	u32 server;
 	u8  priority;
-	u8  saved_priority; /* currently unused */
+	u8  saved_priority;
 	u8  resend;
 	u8  masked_pending;
 	u8  asserted; /* Only for LSI */
-- 
1.7.10.rc3.219.g53414

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
  2013-02-15  0:01 ` [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller Paul Mackerras
@ 2013-02-15 20:05   ` Scott Wood
  2013-02-15 23:18     ` Paul Mackerras
  0 siblings, 1 reply; 32+ messages in thread
From: Scott Wood @ 2013-02-15 20:05 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Alexander Graf, kvm, kvm-ppc

On 02/14/2013 06:01:08 PM, Paul Mackerras wrote:
> From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> 
> This adds in-kernel emulation of the XICS (eXternal Interrupt
> Controller Specification) interrupt controller specified by PAPR, for
> both HV and PR KVM guests.
> 
> This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like
> KVM_CREATE_IRQCHIP in that it indicates that the virtual machine
> should use in-kernel interrupt controller emulation, but also takes an
> argument struct that contains the type of interrupt controller
> architecture and an optional parameter.  Currently only one type value
> is defined, that which indicates the XICS architecture.

Would the device config API I posted a couple days ago work for you?

> The XICS emulation supports up to 1048560 interrupt sources.
> Interrupt source numbers below 16 are reserved; 0 is used to mean no
> interrupt and 2 is used for IPIs.  Internally these are represented in
> blocks of 1024, called ICS (interrupt controller source) entities, but
> that is not visible to userspace.
> 
> Two other new ioctls allow userspace to control the interrupt
> sources.  The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority,
> destination cpu, level/edge sensitivity and pending state of a range
> of interrupt sources, creating them if they don't already exist.  The
> KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a range of
> interrupt sources (they are required to already exist).

Why is it userspace's job to control these?  If you use KVM_IRQ_PENDING
for interrupt injection, what if there's a race with the user changing
other flags via MMIO?  Maybe this isn't an issue with XICS, but this is
being presented as a generic API.

> +4.80 KVM_CREATE_IRQCHIP_ARGS
> +
> +Capability: KVM_CAP_IRQCHIP_ARGS
> +Architectures: ppc

Why just ppc?

> +struct kvm_irq_sources {
> +	__u32 irq;
> +	__u32 nr_irqs;
> +	__u64 __user *irqbuf;
> +};

Please no pointers in UAPI -- this would require a compat wrapper with
32-bit user and 64-bit kernel.

> +/* irqbuf entries are laid out like this: */
> +#define KVM_IRQ_SERVER_SHIFT	0
> +#define KVM_IRQ_SERVER_MASK	0xffffffffULL
> +#define KVM_IRQ_PRIORITY_SHIFT	32
> +#define KVM_IRQ_PRIORITY_MASK	0xff
> +#define KVM_IRQ_LEVEL_SENSITIVE	(1ULL << 40)
> +#define KVM_IRQ_MASKED		(1ULL << 41)
> +#define KVM_IRQ_PENDING		(1ULL << 42)

What does "server" mean?  Do you mean "laid out like this for XICS"?
Let's please have a clean separation between what is generic and what is
implementation-specific.

-Scott

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
  2013-02-15 20:05   ` Scott Wood
@ 2013-02-15 23:18     ` Paul Mackerras
  2013-02-15 23:59       ` Scott Wood
  0 siblings, 1 reply; 32+ messages in thread
From: Paul Mackerras @ 2013-02-15 23:18 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm, kvm-ppc

On Fri, Feb 15, 2013 at 02:05:41PM -0600, Scott Wood wrote:
> On 02/14/2013 06:01:08 PM, Paul Mackerras wrote:
> >From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> >
> >This adds in-kernel emulation of the XICS (eXternal Interrupt
> >Controller Specification) interrupt controller specified by PAPR, for
> >both HV and PR KVM guests.
> >
> >This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like
> >KVM_CREATE_IRQCHIP in that it indicates that the virtual machine
> >should use in-kernel interrupt controller emulation, but also takes an
> >argument struct that contains the type of interrupt controller
> >architecture and an optional parameter.  Currently only one type value
> >is defined, that which indicates the XICS architecture.
> 
> Would the device config API I posted a couple days ago work for you?

I suppose it could be made to work.  It doesn't feel like a natural
fit though, because your API seems to assume (AFAICT) that a device is
manipulated via registers at specific physical addresses, so I would
have to invent an artificial set of registers with addresses and bit
layouts, that aren't otherwise required.  The XICS is operated from
the guest side via hcalls, not via emulated MMIO.

Part of the reason I went with the API that I did is that that was
what was agreed on at KVM Forum (as far as I can tell, not having been
at the meeting).  Your device API seems to be quite different to that.

> >The XICS emulation supports up to 1048560 interrupt sources.
> >Interrupt source numbers below 16 are reserved; 0 is used to mean no
> >interrupt and 2 is used for IPIs.  Internally these are represented in
> >blocks of 1024, called ICS (interrupt controller source) entities, but
> >that is not visible to userspace.
> >
> >Two other new ioctls allow userspace to control the interrupt
> >sources.  The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority,
> >destination cpu, level/edge sensitivity and pending state of a range
> >of interrupt sources, creating them if they don't already exist.  The
> >KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a range of
> >interrupt sources (they are required to already exist).
> 
> Why is it userspace's job to control these?  If you use KVM_IRQ_PENDING

These are primarily there to support live migration.  For live
migration, userspace needs to be able to extract the entire state of
the virtual machine from the old guest, and then set the new guest to
that exact same state.  We have live migration working in qemu for
pSeries guests with in-kernel XICS emulation using this interface.
If you're not doing live migration, the only time userspace needs to
use these is at initialization, when it does a SET_SOURCES to create
the interrupt sources it wants the VM to have.

> for interrupt injection, what if there's a race with the user changing
> other flags via MMIO?  Maybe this isn't an issue with XICS, but this is
> being presented as a generic API.

They're not used while the guest is running, as I said, but even if
they were, there is appropriate locking in there to handle any races.

> >+4.80 KVM_CREATE_IRQCHIP_ARGS
> >+
> >+Capability: KVM_CAP_IRQCHIP_ARGS
> >+Architectures: ppc
> 
> Why just ppc?

Just because only PPC has code to handle it at this point.  I would
hope that ARM and others could pick it up.  Maybe I should make it:

+Architectures: all (but so far only implemented on ppc)

or something.

> >+struct kvm_irq_sources {
> >+	__u32 irq;
> >+	__u32 nr_irqs;
> >+	__u64 __user *irqbuf;
> >+};
> 
> Please no pointers in UAPI -- this would require a compat wrapper with
> 32-bit user and 64-bit kernel.

Hmmm, you're right.  I suppose it will have to be a fixed-size buffer,
which is generally less efficient, but since it's mainly for migration
that probably doesn't matter.  In fact, I could probably use the
existing KVM_GET_IRQCHIP and KVM_SET_IRQCHIP ioctls, if I use the
`chip_id' field for `irq' and the `pad' field for `nr_irqs'.

> >+/* irqbuf entries are laid out like this: */
> >+#define KVM_IRQ_SERVER_SHIFT	0
> >+#define KVM_IRQ_SERVER_MASK	0xffffffffULL
> >+#define KVM_IRQ_PRIORITY_SHIFT	32
> >+#define KVM_IRQ_PRIORITY_MASK	0xff
> >+#define KVM_IRQ_LEVEL_SENSITIVE	(1ULL << 40)
> >+#define KVM_IRQ_MASKED		(1ULL << 41)
> >+#define KVM_IRQ_PENDING		(1ULL << 42)
> 
> What does "server" mean?  Do you mean "laid out like this for XICS"?

Sorry, I should have made that "destination" rather than "server".
You're right, "server" is confusing, but it just means "where the
interrupt is sent to be handled".  It has nothing particularly to do
with "server" computers.

> Let's please have a clean separation between what is generic and what is
> implementation-specific.

I believe that the interface is pretty cleanly generic - the model is
a set of interrupt sources and some per-vcpu state, with priorities to
decide which interrupts get delivered when.  That describes the basics
of just about any SMP-capable interrupt controller, including MPIC.

MPIC would still need an extra interface for userspace to save and
restore things like the timer registers at live migration time, and
for userspace to configure the base address, MPIC version, etc., of
course.

Paul.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
  2013-02-15 23:18     ` Paul Mackerras
@ 2013-02-15 23:59       ` Scott Wood
  2013-02-16  2:56         ` Paul Mackerras
  0 siblings, 1 reply; 32+ messages in thread
From: Scott Wood @ 2013-02-15 23:59 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Alexander Graf, kvm, kvm-ppc

On 02/15/2013 05:18:31 PM, Paul Mackerras wrote:
> On Fri, Feb 15, 2013 at 02:05:41PM -0600, Scott Wood wrote:
> > On 02/14/2013 06:01:08 PM, Paul Mackerras wrote:
> > >From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > >
> > >This adds in-kernel emulation of the XICS (eXternal Interrupt
> > >Controller Specification) interrupt controller specified by PAPR,  
> for
> > >both HV and PR KVM guests.
> > >
> > >This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like
> > >KVM_CREATE_IRQCHIP in that it indicates that the virtual machine
> > >should use in-kernel interrupt controller emulation, but also  
> takes an
> > >argument struct that contains the type of interrupt controller
> > >architecture and an optional parameter.  Currently only one type  
> value
> > >is defined, that which indicates the XICS architecture.
> >
> > Would the device config API I posted a couple days ago work for you?
> 
> I suppose it could be made to work.  It doesn't feel like a natural
> fit though, because your API seems to assume (AFAICT) that a device is
> manipulated via registers at specific physical addresses, so I would
> have to invent an artificial set of registers with addresses and bit
> layouts, that aren't otherwise required.  The XICS is operated from
> the guest side via hcalls, not via emulated MMIO.

I don't think it makes such an assumption.  The MPIC device has  
physical registers, so it exposes them, but it also exposes things that  
are not physical registers (e.g. the per-IRQ input state).  The generic  
device control layer leaves interpretation of attributes up to the  
device.

I think it would be easier to fit XICS into the device control api  
model than to fit MPIC into this model, not to mention what would  
happen if we later want to emulate some other type of device -- x86  
already has at least one non-irqchip emulated device (i8254).

> Part of the reason I went with the API that I did is that that was
> what was agreed on at KVM Forum (as far as I can tell, not having been
> at the meeting).  Your device API seems to be quite different to that.

I wasn't there either.  It's fine to have discussions at such events,  
but it should not preclude others from having input, or better ideas  
from being considered afterward.

> > >The XICS emulation supports up to 1048560 interrupt sources.
> > >Interrupt source numbers below 16 are reserved; 0 is used to mean  
> no
> > >interrupt and 2 is used for IPIs.  Internally these are  
> represented in
> > >blocks of 1024, called ICS (interrupt controller source) entities,  
> but
> > >that is not visible to userspace.
> > >
> > >Two other new ioctls allow userspace to control the interrupt
> > >sources.  The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority,
> > >destination cpu, level/edge sensitivity and pending state of a  
> range
> > >of interrupt sources, creating them if they don't already exist.   
> The
> > >KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a range  
> of
> > >interrupt sources (they are required to already exist).
> >
> > Why is it userspace's job to control these?  If you use  
> KVM_IRQ_PENDING
> 
> These are primarily there to support live migration.  For live
> migration, userspace needs to be able to extract the entire state of
> the virtual machine from the old guest, and then set the new guest to
> that exact same state.

In that case, the state it describes is insufficient for MPIC.

> We have live migration working in qemu for
> pSeries guests with in-kernel XICS emulation using this interface.
> If you're not doing live migration,

We don't yet, but would prefer not to assume that it'll never happen.

> > for interrupt injection, what if there's a race with the user  
> changing
> > other flags via MMIO?  Maybe this isn't an issue with XICS, but  
> this is
> > being presented as a generic API.
> 
> They're not used while the guest is running, as I said, but even if
> they were, there is appropriate locking in there to handle any races.

OK, KVM_IRQ_LINE is still used for interrupt injection.  I was hoping  
to avoid going through a standardized interface that forces a global  
interrupt numberspace.

How do MSIs get injected?

BTW, do you have any plans regarding irqfd?

> > >+struct kvm_irq_sources {
> > >+	__u32 irq;
> > >+	__u32 nr_irqs;
> > >+	__u64 __user *irqbuf;
> > >+};
> >
> > Please no pointers in UAPI -- this would require a compat wrapper  
> with
> > 32-bit user and 64-bit kernel.
> 
> Hmmm, you're right.  I suppose it will have to be a fixed-size buffer,

It doesn't need to be a fixed size buffer.  You can still have  
pointers, but they need to be represented as a plain "__u64" with users  
casting the pointer.

> > >+/* irqbuf entries are laid out like this: */
> > >+#define KVM_IRQ_SERVER_SHIFT	0
> > >+#define KVM_IRQ_SERVER_MASK	0xffffffffULL
> > >+#define KVM_IRQ_PRIORITY_SHIFT	32
> > >+#define KVM_IRQ_PRIORITY_MASK	0xff
> > >+#define KVM_IRQ_LEVEL_SENSITIVE	(1ULL << 40)
> > >+#define KVM_IRQ_MASKED		(1ULL << 41)
> > >+#define KVM_IRQ_PENDING		(1ULL << 42)
> >
> > What does "server" mean?  Do you mean "laid out like this for XICS"?
> 
> Sorry, I should have made that "destination" rather than "server".
> You're right, "server" is confusing, but it just means "where the
> interrupt is sent to be handled".  It has nothing particularly to do
> with "server" computers.

Right, I was aware of the IBM terminology here -- just thought it  
wasn't appropriate in a generic interface.

What about interrupt controllers that allow multiple destinations?   
More than 256 priorities?  Different "levels" of output (normal,  
critical, machine check)?  Programmable vector numbers?  Active  
high/low control?

I just don't think irqchip state can be sanely made generic.

> > Let's please have a clean separation between what is generic and  
> what is
> > implementation-specific.
> 
> I believe that the interface is pretty cleanly generic - the model is
> a set of interrupt sources and some per-vcpu state, with priorities to
> decide which interrupts get delivered when.  That describes the basics
> of just about any SMP-capable interrupt controller, including MPIC.

The per-vcpu state isn't even part of this AFAICT.  It's an  
XICS-specific ONE_REG -- which is fine, but all that's left of the  
"generic" API is the get/set sources which is an imperfect match to our  
per-IRQ state and it's not clear how an implementation should extend it.

It would be straightforward to map it to the device control api by  
having XICS declare an attribute group that corresponds to IRQ sources  
(like KVM_DEV_MPIC_GRP_IRQ_ACTIVE) and the attribute data be what  
you've defined for kvm_irq_sources.  Or if you want to preserve the  
"set multiple IRQs at once" approach, you could have an attribute that  
takes exactly a kvm_irq_sources struct, though it might be better to  
have a generic batched attribute set/get ioctl (as has been proposed  
for ONE_REG if the need arises).

> MPIC would still need an extra interface for userspace to save and
> restore things like the timer registers at live migration time, and
> for userspace to configure the base address, MPIC version, etc., of
> course.

Yes, which pretty much means we need the device control api anyway --  
or MPIC-specific ioctls, which I wanted to avoid.

-Scott

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
  2013-02-15 23:59       ` Scott Wood
@ 2013-02-16  2:56         ` Paul Mackerras
  2013-02-16  3:57           ` Scott Wood
  2013-02-20 19:58           ` Marcelo Tosatti
  0 siblings, 2 replies; 32+ messages in thread
From: Paul Mackerras @ 2013-02-16  2:56 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm, kvm-ppc

On Fri, Feb 15, 2013 at 05:59:11PM -0600, Scott Wood wrote:
> On 02/15/2013 05:18:31 PM, Paul Mackerras wrote:
> >On Fri, Feb 15, 2013 at 02:05:41PM -0600, Scott Wood wrote:
> >> On 02/14/2013 06:01:08 PM, Paul Mackerras wrote:
> >> >From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> >> >
> >> >This adds in-kernel emulation of the XICS (eXternal Interrupt
> >> >Controller Specification) interrupt controller specified by
> >PAPR, for
> >> >both HV and PR KVM guests.
> >> >
> >> >This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like
> >> >KVM_CREATE_IRQCHIP in that it indicates that the virtual machine
> >> >should use in-kernel interrupt controller emulation, but also
> >takes an
> >> >argument struct that contains the type of interrupt controller
> >> >architecture and an optional parameter.  Currently only one
> >type value
> >> >is defined, that which indicates the XICS architecture.
> >>
> >> Would the device config API I posted a couple days ago work for you?
> >
> >I suppose it could be made to work.  It doesn't feel like a natural
> >fit though, because your API seems to assume (AFAICT) that a device is
> >manipulated via registers at specific physical addresses, so I would
> >have to invent an artificial set of registers with addresses and bit
> >layouts, that aren't otherwise required.  The XICS is operated from
> >the guest side via hcalls, not via emulated MMIO.
> 
> I don't think it makes such an assumption.  The MPIC device has
> physical registers, so it exposes them, but it also exposes things
> that are not physical registers (e.g. the per-IRQ input state).  The
> generic device control layer leaves interpretation of attributes up
> to the device.
> 
> I think it would be easier to fit XICS into the device control api
> model than to fit MPIC into this model, not to mention what would
> happen if we later want to emulate some other type of device -- x86
> already has at least one non-irqchip emulated device (i8254).

I have no particular objection to the device control API per se, but
I have two objections to using it as the primary interface to the XICS
emulation.

First, I dislike the magical side-effect where creating a device of a
particular type (e.g. MPIC or XICS) automatically attaches it to the
interrupt lines of the vcpus.  I prefer an explicit request to do
in-kernel interrupt control.  Further, the magic means that you can
only have one instance of the device, whereas you might want to model
the interrupt controller architecture as several devices.  You could
do that using several device types, but then the interconnections
between them would also be magic.

Secondly, it means that we are completely abandoning any attempt to
define an abstract or generic interface to in-kernel interrupt
controller emulations.  Each device will have its own unique set of
attribute groups and its own unique userspace code to drive it, with
no commonality between them.

> >Part of the reason I went with the API that I did is that that was
> >what was agreed on at KVM Forum (as far as I can tell, not having been
> >at the meeting).  Your device API seems to be quite different to that.
> 
> I wasn't there either.  It's fine to have discussions at such
> events, but it should not preclude others from having input, or
> better ideas from being considered afterward.

Sure - I was just trying to fit in with the expressed wish of the
maintainer.

> >> >The XICS emulation supports up to 1048560 interrupt sources.
> >> >Interrupt source numbers below 16 are reserved; 0 is used to
> >mean no
> >> >interrupt and 2 is used for IPIs.  Internally these are
> >represented in
> >> >blocks of 1024, called ICS (interrupt controller source)
> >entities, but
> >> >that is not visible to userspace.
> >> >
> >> >Two other new ioctls allow userspace to control the interrupt
> >> >sources.  The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority,
> >> >destination cpu, level/edge sensitivity and pending state of a
> >range
> >> >of interrupt sources, creating them if they don't already
> >exist.  The
> >> >KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a
> >range of
> >> >interrupt sources (they are required to already exist).
> >>
> >> Why is it userspace's job to control these?  If you use
> >KVM_IRQ_PENDING
> >
> >These are primarily there to support live migration.  For live
> >migration, userspace needs to be able to extract the entire state of
> >the virtual machine from the old guest, and then set the new guest to
> >that exact same state.
> 
> In that case, the state it describes is insufficient for MPIC.

Yes, MPIC has other random stuff in it besides interrupt control, so
that's not surprising.

> >We have live migration working in qemu for
> >pSeries guests with in-kernel XICS emulation using this interface.
> >If you're not doing live migration,
> 
> We don't yet, but would prefer not to assume that it'll never happen.
> 
> >> for interrupt injection, what if there's a race with the user
> >changing
> >> other flags via MMIO?  Maybe this isn't an issue with XICS, but
> >this is
> >> being presented as a generic API.
> >
> >They're not used while the guest is running, as I said, but even if
> >they were, there is appropriate locking in there to handle any races.
> 
> OK, KVM_IRQ_LINE is still used for interrupt injection.  I was
> hoping to avoid going through a standardized interface that forces a
> global interrupt numberspace.

Why?

> How do MSIs get injected?

Just like other interrupts - from the point of view of the interrupt
controller they're edge-triggered interrupt sources.

> BTW, do you have any plans regarding irqfd?

I'm going to look at that next.

> 
> >> >+struct kvm_irq_sources {
> >> >+	__u32 irq;
> >> >+	__u32 nr_irqs;
> >> >+	__u64 __user *irqbuf;
> >> >+};
> >>
> >> Please no pointers in UAPI -- this would require a compat
> >wrapper with
> >> 32-bit user and 64-bit kernel.
> >
> >Hmmm, you're right.  I suppose it will have to be a fixed-size buffer,
> 
> It doesn't need to be a fixed size buffer.  You can still have
> pointers, but they need to be represented as a plain "__u64" with
> users casting the pointer.
> 
> >> >+/* irqbuf entries are laid out like this: */
> >> >+#define KVM_IRQ_SERVER_SHIFT	0
> >> >+#define KVM_IRQ_SERVER_MASK	0xffffffffULL
> >> >+#define KVM_IRQ_PRIORITY_SHIFT	32
> >> >+#define KVM_IRQ_PRIORITY_MASK	0xff
> >> >+#define KVM_IRQ_LEVEL_SENSITIVE	(1ULL << 40)
> >> >+#define KVM_IRQ_MASKED		(1ULL << 41)
> >> >+#define KVM_IRQ_PENDING		(1ULL << 42)
> >>
> >> What does "server" mean?  Do you mean "laid out like this for XICS"?
> >
> >Sorry, I should have made that "destination" rather than "server".
> >You're right, "server" is confusing, but it just means "where the
> >interrupt is sent to be handled".  It has nothing particularly to do
> >with "server" computers.
> 
> Right, I was aware of the IBM terminology here -- just thought it
> wasn't appropriate in a generic interface.

Sure.

> What about interrupt controllers that allow multiple destinations?

The destination can be an identifier for a group of vcpus, or even a
bitmap -- that's why I made it 32 bits.

> More than 256 priorities?  Different "levels" of output (normal,
> critical, machine check)?  Programmable vector numbers?  Active
> high/low control?

There are plenty of bits free in the 64 bits per source that I have
allowed.  We can accommodate those things.  (BTW, I think having more
than 256 priorities would be insane - do you know of any actual
example that does?)

> I just don't think irqchip state can be sanely made generic.
> 
> >> Let's please have a clean separation between what is generic and
> >what is
> >> implementation-specific.
> >
> >I believe that the interface is pretty cleanly generic - the model is
> >a set of interrupt sources and some per-vcpu state, with priorities to
> >decide which interrupts get delivered when.  That describes the basics
> >of just about any SMP-capable interrupt controller, including MPIC.
> 
> The per-vcpu state isn't even part of this AFAICT.  It's an
> XICS-specific ONE_REG -- which is fine, but all that's left of the
> "generic" API is the get/set sources which is an imperfect match to
> our per-IRQ state and it's not clear how an implementation should
> extend it.

Yes, the names of the bitfields in the ICP state word are
XICS-specific, but the concepts are pretty generic - current processor
priority, identifier for interrupt awaiting service, pending IPI
request priority, pending interrupt request priority.

> It would be straightforward to map it to the device control api by
> having XICS declare an attribute group that corresponds to IRQ
> sources (like KVM_DEV_MPIC_GRP_IRQ_ACTIVE) and the attribute data be
> what you've defined for kvm_irq_sources.  Or if you want to preserve
> the "set multiple IRQs at once" approach, you could have an
> attribute that takes exactly a kvm_irq_sources struct, though it
> might be better to have a generic batched attribute set/get ioctl
> (as has been proposed for ONE_REG if the need arises).

Sure, the device control API can probably do just about anything.
We could even use it to create and control vcpus and memory slots, but
that doesn't mean we should.

> >MPIC would still need an extra interface for userspace to save and
> >restore things like the timer registers at live migration time, and
> >for userspace to configure the base address, MPIC version, etc., of
> >course.
> 
> Yes, which pretty much means we need the device control api anyway
> -- or MPIC-specific ioctls, which I wanted to avoid.

I agree that MPIC has enough extra stuff in it, besides interrupt
control, that it probably needs your device control API, at least for
that extra stuff.

Paul.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
  2013-02-16  2:56         ` Paul Mackerras
@ 2013-02-16  3:57           ` Scott Wood
  2013-02-16  4:51             ` Paul Mackerras
  2013-02-20 19:58           ` Marcelo Tosatti
  1 sibling, 1 reply; 32+ messages in thread
From: Scott Wood @ 2013-02-16  3:57 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Alexander Graf, kvm, kvm-ppc

On 02/15/2013 08:56:14 PM, Paul Mackerras wrote:
> I have no particular objection to the device control API per se, but
> I have two objections to using it as the primary interface to the XICS
> emulation.
> 
> First, I dislike the magical side-effect where creating a device of a
> particular type (e.g. MPIC or XICS) automatically attaches it to the
> interrupt lines of the vcpus.  I prefer an explicit request to do
> in-kernel interrupt control.

OK.  This is device-specific behavior, so you could define it  
differently for XICS than MPIC.  I suppose we could change it for MPIC  
as well, to leave an opening for the unlikely case where we'd want to  
model an MPIC that isn't directly connected to the CPUs.

How is the explicit request made in this patchset?

> Secondly, it means that we are completely abandoning any attempt to
> define an abstract or generic interface to in-kernel interrupt
> controller emulations.  Each device will have its own unique set of
> attribute groups and its own unique userspace code to drive it, with
> no commonality between them.

Yes.  I am unconvinced that such an abstraction is well-advised  
(especially after seeing existing "generic" interfaces that are clearly  
APIC-oriented).  This isn't like normal driver interfaces where we're  
abstracting away hardware differences to let generic code use a  
device.  Userspace knows what kind of device it wants, and how it wants  
it to integrate with the rest of the emulated system.  We'd have to go  
out of our way to apply the abstraction on *both* ends.  What do we get  
from that other than a chance that the abstraction leaks?  What  
significant code actually becomes common?  kvm_set_irq() is just a thin  
wrapper around the ioctl.

> > >We have live migration working in qemu for
> > >pSeries guests with in-kernel XICS emulation using this interface.
> > >If you're not doing live migration,
> >
> > We don't yet, but would prefer not to assume that it'll never  
> happen.
> >
> > >> for interrupt injection, what if there's a race with the user
> > >changing
> > >> other flags via MMIO?  Maybe this isn't an issue with XICS, but
> > >this is
> > >> being presented as a generic API.
> > >
> > >They're not used while the guest is running, as I said, but even if
> > >they were, there is appropriate locking in there to handle any  
> races.
> >
> > OK, KVM_IRQ_LINE is still used for interrupt injection.  I was
> > hoping to avoid going through a standardized interface that forces a
> > global interrupt numberspace.
> 
> Why?

The standardized interface doesn't make things any easier (as noted  
above, the caller is already mpic-specific code), and we'd have to come  
up with a scheme for flattening our interrupt numberspace (rather than  
introduce new attribute groups for things like IPI and timer  
interrupts).  It may still be necessary when it comes to irqfd,  
though...

> > How do MSIs get injected?
> 
> Just like other interrupts - from the point of view of the interrupt
> controller they're edge-triggered interrupt sources.

Ah right, I guess this is all set up via hcalls for XICS.

With MPIC exposing its registers via the device control api, everything  
just works -- the PCI device generates a write to the MPIC's memory  
region, the QEMU MPIC stub sends the write to the kernel as for any  
other MMIO access (this passthrough is also useful for debugging), the  
in-kernel MPIC sees the write to the "generate an MSI" register and  
does its thing.  Compare that to all special the MSI code for APIC...  
:-)

> > BTW, do you have any plans regarding irqfd?
> 
> I'm going to look at that next.

Likewise...  We should probably coordinate our efforts so that at least  
the de-APICization of the code only has to get done once.

> > What about interrupt controllers that allow multiple destinations?
> 
> The destination can be an identifier for a group of vcpus, or even a
> bitmap -- that's why I made it 32 bits.

So you can have single delivery, or be limited to 32 vcpus, or have to  
implement some destination ID allocation scheme (which is more state  
that needs to be accessible somehow).

> > More than 256 priorities?  Different "levels" of output (normal,
> > critical, machine check)?  Programmable vector numbers?  Active
> > high/low control?
> 
> There are plenty of bits free in the 64 bits per source that I have
> allowed.  We can accommodate those things.

MPIC vector numbers take up 16 of the bits.  The architected interrupt  
level field is 8 bits, though only a handful of values are actually  
needed.  Add a couple binary flags, and it gets pretty tight if a third  
type of interrupt controller starts wanting something new.

>  (BTW, I think having more
> than 256 priorities would be insane - do you know of any actual
> example that does?)

No, but hardware designers have been known to do insane things.

> > The per-vcpu state isn't even part of this AFAICT.  It's an
> > XICS-specific ONE_REG -- which is fine, but all that's left of the
> > "generic" API is the get/set sources which is an imperfect match to
> > our per-IRQ state and it's not clear how an implementation should
> > extend it.
> 
> Yes, the names of the bitfields in the ICP state word are
> XICS-specific, but the concepts are pretty generic - current processor
> priority, identifier for interrupt awaiting service, pending IPI
> request priority, pending interrupt request priority.

We don't have separate concepts of "pending IPI request priority" and  
"pending interrupt request priority".  There can be multiple interrupts  
awaiting service (or even in service, if different priorities).  We  
have both "current task priority" (which is a user-set mask-by-priority  
register) and the priority of the highest-prio in-service interrupt --  
which would "current processor priorty" be?  Etc.

-Scott

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
  2013-02-16  3:57           ` Scott Wood
@ 2013-02-16  4:51             ` Paul Mackerras
  2013-02-18 22:43               ` Scott Wood
  0 siblings, 1 reply; 32+ messages in thread
From: Paul Mackerras @ 2013-02-16  4:51 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm, kvm-ppc

On Fri, Feb 15, 2013 at 09:57:06PM -0600, Scott Wood wrote:
> On 02/15/2013 08:56:14 PM, Paul Mackerras wrote:
> >I have no particular objection to the device control API per se, but
> >I have two objections to using it as the primary interface to the XICS
> >emulation.
> >
> >First, I dislike the magical side-effect where creating a device of a
> >particular type (e.g. MPIC or XICS) automatically attaches it to the
> >interrupt lines of the vcpus.  I prefer an explicit request to do
> >in-kernel interrupt control.
> 
> OK.  This is device-specific behavior, so you could define it
> differently for XICS than MPIC.  I suppose we could change it for
> MPIC as well, to leave an opening for the unlikely case where we'd
> want to model an MPIC that isn't directly connected to the CPUs.

You can have cascaded MPICs; I once had a powerbook that had one MPIC
cascaded into another.

> How is the explicit request made in this patchset?

The KVM_CREATE_IRQCHIP_ARGS ioctl says that you want emulation of a
specific interrupt controller architecture connected to the vcpus'
external interrupt inputs.  In that sense it's explicit, compared to a
generic "create device" ioctl that could be for any device.

> >Secondly, it means that we are completely abandoning any attempt to
> >define an abstract or generic interface to in-kernel interrupt
> >controller emulations.  Each device will have its own unique set of
> >attribute groups and its own unique userspace code to drive it, with
> >no commonality between them.
> 
> Yes.  I am unconvinced that such an abstraction is well-advised
> (especially after seeing existing "generic" interfaces that are
> clearly APIC-oriented).  This isn't like normal driver interfaces
> where we're abstracting away hardware differences to let generic
> code use a device.  Userspace knows what kind of device it wants,
> and how it wants it to integrate with the rest of the emulated
> system.  We'd have to go out of our way to apply the abstraction on
> *both* ends.  What do we get from that other than a chance that the
> abstraction leaks?  What significant code actually becomes common?
> kvm_set_irq() is just a thin wrapper around the ioctl.

I'd think there could be some code reduction in the live migration
code, but I'd need a qemu hacker to chime in here.

> >> OK, KVM_IRQ_LINE is still used for interrupt injection.  I was
> >> hoping to avoid going through a standardized interface that forces a
> >> global interrupt numberspace.
> >
> >Why?
> 
> The standardized interface doesn't make things any easier (as noted
> above, the caller is already mpic-specific code), and we'd have to
> come up with a scheme for flattening our interrupt numberspace
> (rather than introduce new attribute groups for things like IPI and
> timer interrupts).  It may still be necessary when it comes to
> irqfd, though...

With 24 bits of interrupt source number, you don't have to flatten it
very far.  IPIs are handled separately and don't appear in the
interrupt source space.

> >> How do MSIs get injected?
> >
> >Just like other interrupts - from the point of view of the interrupt
> >controller they're edge-triggered interrupt sources.
> 
> Ah right, I guess this is all set up via hcalls for XICS.
> 
> With MPIC exposing its registers via the device control api,
> everything just works -- the PCI device generates a write to the
> MPIC's memory region, the QEMU MPIC stub sends the write to the
> kernel as for any other MMIO access (this passthrough is also useful
> for debugging), the in-kernel MPIC sees the write to the "generate
> an MSI" register and does its thing.  Compare that to all special
> the MSI code for APIC... :-)

You're doing a round trip to userspace for every MPIC register access
by the guest?  Seriously?  If that's so, then why bother with
in-kernel emulation at all?  Once you're back in userspace, it's just
as fast to do the emulation there as in the kernel.

> >There are plenty of bits free in the 64 bits per source that I have
> >allowed.  We can accommodate those things.
> 
> MPIC vector numbers take up 16 of the bits.  The architected
> interrupt level field is 8 bits, though only a handful of values are
> actually needed.  Add a couple binary flags, and it gets pretty
> tight if a third type of interrupt controller starts wanting
> something new.

There's enough space for MPIC to have 16 bits of vector and some
flags.  We don't need to overdesign this.

> >Yes, the names of the bitfields in the ICP state word are
> >XICS-specific, but the concepts are pretty generic - current processor
> >priority, identifier for interrupt awaiting service, pending IPI
> >request priority, pending interrupt request priority.
> 
> We don't have separate concepts of "pending IPI request priority"
> and "pending interrupt request priority".  There can be multiple

Sorry, I meant "pending interrupt request".  You do have that, it's
what you read from the interrupt acknowledge register.

> interrupts awaiting service (or even in service, if different
> priorities).  We have both "current task priority" (which is a
> user-set mask-by-priority register) and the priority of the
> highest-prio in-service interrupt -- which would "current processor
> priorty" be?  Etc.

It would be the current task priority.  I assume MPIC maintains a
16-bit map of the interrupt priorities in service, so that would need
to be added.  (XICS doesn't maintain the stack in hardware, which
would require 256 bits of state, but instead gets software to do
that.)

Paul.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
  2013-02-16  4:51             ` Paul Mackerras
@ 2013-02-18 22:43               ` Scott Wood
  2013-02-20  0:41                 ` Paul Mackerras
  0 siblings, 1 reply; 32+ messages in thread
From: Scott Wood @ 2013-02-18 22:43 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Alexander Graf, kvm, kvm-ppc

On 02/15/2013 10:51:16 PM, Paul Mackerras wrote:
> On Fri, Feb 15, 2013 at 09:57:06PM -0600, Scott Wood wrote:
> > On 02/15/2013 08:56:14 PM, Paul Mackerras wrote:
> > >I have no particular objection to the device control API per se,  
> but
> > >I have two objections to using it as the primary interface to the  
> XICS
> > >emulation.
> > >
> > >First, I dislike the magical side-effect where creating a device  
> of a
> > >particular type (e.g. MPIC or XICS) automatically attaches it to  
> the
> > >interrupt lines of the vcpus.  I prefer an explicit request to do
> > >in-kernel interrupt control.
> >
> > OK.  This is device-specific behavior, so you could define it
> > differently for XICS than MPIC.  I suppose we could change it for
> > MPIC as well, to leave an opening for the unlikely case where we'd
> > want to model an MPIC that isn't directly connected to the CPUs.
> 
> You can have cascaded MPICs; I once had a powerbook that had one MPIC
> cascaded into another.

OK.

> > How is the explicit request made in this patchset?
> 
> The KVM_CREATE_IRQCHIP_ARGS ioctl says that you want emulation of a
> specific interrupt controller architecture connected to the vcpus'
> external interrupt inputs.  In that sense it's explicit, compared to a
> generic "create device" ioctl that could be for any device.

Hooking up to the CPU's interrupt lines is implicit in creating an MPIC  
(and I'm fine with changing that), not in creating any device.  I don't  
see how it's worse than being implicit in calling  
KVM_CREATE_IRQCHIP_ARGS (which doesn't allow for cascaded irqchips).

> > The standardized interface doesn't make things any easier (as noted
> > above, the caller is already mpic-specific code), and we'd have to
> > come up with a scheme for flattening our interrupt numberspace
> > (rather than introduce new attribute groups for things like IPI and
> > timer interrupts).  It may still be necessary when it comes to
> > irqfd, though...
> 
> With 24 bits of interrupt source number, you don't have to flatten it
> very far.  IPIs are handled separately and don't appear in the
> interrupt source space.

They do need to appear in the interrupt source space if we want to  
inject or irqfd them.  Most users won't want to do that, but we have  
had a customer directly assign IPIs (to communicate with an OS running  
on the other CPU in an AMP setup -- host Linux was non-SMP so wasn't  
using them) and MPIC timers to a guest.

> > >> How do MSIs get injected?
> > >
> > >Just like other interrupts - from the point of view of the  
> interrupt
> > >controller they're edge-triggered interrupt sources.
> >
> > Ah right, I guess this is all set up via hcalls for XICS.
> >
> > With MPIC exposing its registers via the device control api,
> > everything just works -- the PCI device generates a write to the
> > MPIC's memory region, the QEMU MPIC stub sends the write to the
> > kernel as for any other MMIO access (this passthrough is also useful
> > for debugging), the in-kernel MPIC sees the write to the "generate
> > an MSI" register and does its thing.  Compare that to all special
> > the MSI code for APIC... :-)
> 
> You're doing a round trip to userspace for every MPIC register access
> by the guest?  Seriously?

No.  Accesses by the guest get handled in the kernel.  Accesses in  
QEMU, including MSIs generated by virtio, get forwarded to the kernel.

> > >There are plenty of bits free in the 64 bits per source that I have
> > >allowed.  We can accommodate those things.
> >
> > MPIC vector numbers take up 16 of the bits.  The architected
> > interrupt level field is 8 bits, though only a handful of values are
> > actually needed.  Add a couple binary flags, and it gets pretty
> > tight if a third type of interrupt controller starts wanting
> > something new.
> 
> There's enough space for MPIC to have 16 bits of vector and some
> flags.  We don't need to overdesign this.

I view anything other than passing the actual MPIC register values  
around as overdesign here, given that it is communication between  
hw/kvm/mpic.c on the QEMU side and arch/powerpc/kvm/mpic.c on the  
kernel side.

> > interrupts awaiting service (or even in service, if different
> > priorities).  We have both "current task priority" (which is a
> > user-set mask-by-priority register) and the priority of the
> > highest-prio in-service interrupt -- which would "current processor
> > priorty" be?  Etc.
> 
> It would be the current task priority.  I assume MPIC maintains a
> 16-bit map of the interrupt priorities in service, so that would need
> to be added.

We don't maintain such a map in the emulation code.  We have a per-CPU  
bitmap of the actual interrupt sources pending/active, which is another  
attribute that would need to be added in order to support migration on  
MPIC.

-Scott

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
  2013-02-18 22:43               ` Scott Wood
@ 2013-02-20  0:41                 ` Paul Mackerras
  2013-02-20  1:01                   ` Scott Wood
  0 siblings, 1 reply; 32+ messages in thread
From: Paul Mackerras @ 2013-02-20  0:41 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm, kvm-ppc

On Mon, Feb 18, 2013 at 04:43:27PM -0600, Scott Wood wrote:
> On 02/15/2013 10:51:16 PM, Paul Mackerras wrote:
> >The KVM_CREATE_IRQCHIP_ARGS ioctl says that you want emulation of a
> >specific interrupt controller architecture connected to the vcpus'
> >external interrupt inputs.  In that sense it's explicit, compared to a
> >generic "create device" ioctl that could be for any device.
> 
> Hooking up to the CPU's interrupt lines is implicit in creating an
> MPIC (and I'm fine with changing that), not in creating any device.
> I don't see how it's worse than being implicit in calling
> KVM_CREATE_IRQCHIP_ARGS (which doesn't allow for cascaded irqchips).

First, KVM_CREATE_IRQCHIP_ARGS specifies the overall architecture of
the interrupt control subsystem, so yes it does allow for cascaded
controllers.

Secondly, the difference is that if you see a KVM_CREATE_IRQCHIP_ARGS
call, you know that the vcpus' interrupt inputs will be driven by
kernel code.  If you see a KVM_CREATE_DEVICE call, you don't know
that; they might be, or they might not be.

> >You're doing a round trip to userspace for every MPIC register access
> >by the guest?  Seriously?
> 
> No.  Accesses by the guest get handled in the kernel.  Accesses in
> QEMU, including MSIs generated by virtio, get forwarded to the
> kernel.

OK, I missed the path where that gets done, then.

> >It would be the current task priority.  I assume MPIC maintains a
> >16-bit map of the interrupt priorities in service, so that would need
> >to be added.
> 
> We don't maintain such a map in the emulation code.  We have a

Oh, so how do you handle EOI of nested interrupts?  How do you know
what to reset the CPU priority to in that case?

> per-CPU bitmap of the actual interrupt sources pending/active, which
> is another attribute that would need to be added in order to support
> migration on MPIC.

Not really, that can be recomputed from the sources easily enough.

Paul.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
  2013-02-20  0:41                 ` Paul Mackerras
@ 2013-02-20  1:01                   ` Scott Wood
  0 siblings, 0 replies; 32+ messages in thread
From: Scott Wood @ 2013-02-20  1:01 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Alexander Graf, kvm, kvm-ppc

On 02/19/2013 06:41:11 PM, Paul Mackerras wrote:
> On Mon, Feb 18, 2013 at 04:43:27PM -0600, Scott Wood wrote:
> > On 02/15/2013 10:51:16 PM, Paul Mackerras wrote:
> > >The KVM_CREATE_IRQCHIP_ARGS ioctl says that you want emulation of a
> > >specific interrupt controller architecture connected to the vcpus'
> > >external interrupt inputs.  In that sense it's explicit, compared  
> to a
> > >generic "create device" ioctl that could be for any device.
> >
> > Hooking up to the CPU's interrupt lines is implicit in creating an
> > MPIC (and I'm fine with changing that), not in creating any device.
> > I don't see how it's worse than being implicit in calling
> > KVM_CREATE_IRQCHIP_ARGS (which doesn't allow for cascaded irqchips).
> 
> First, KVM_CREATE_IRQCHIP_ARGS specifies the overall architecture of
> the interrupt control subsystem, so yes it does allow for cascaded
> controllers.

Fine, but in that case you're dealing with a new irqchip (or irqarch if  
you prefer) type number (or param).  If you wanted to take that  
approach with cascaded MPICs (I wouldn't), you would create a new  
device type number.  I don't see the difference here.

> Secondly, the difference is that if you see a KVM_CREATE_IRQCHIP_ARGS
> call, you know that the vcpus' interrupt inputs will be driven by
> kernel code.  If you see a KVM_CREATE_DEVICE call, you don't know
> that; they might be, or they might not be.

I just don't understand what you mean here.  Nobody's suggesting that  
we make this assumption as soon as you see a "KVM_CREATE_DEVICE" call  
for any random device.  It's specifically in the creation of a  
KVM_DEV_TYPE_FSL_MPIC_20 or KVM_DEV_TYPE_FSL_MPIC_42 that this  
assumption is currently made.  I don't see how creation of one of those  
specific devices is any different from calling KVM_CREATE_IRQCHIP_ARGS  
in terms of the intent that can reasonably be inferred.

> > >You're doing a round trip to userspace for every MPIC register  
> access
> > >by the guest?  Seriously?
> >
> > No.  Accesses by the guest get handled in the kernel.  Accesses in
> > QEMU, including MSIs generated by virtio, get forwarded to the
> > kernel.
> 
> OK, I missed the path where that gets done, then.

ppce500_pci has a memory region for e500-pci-bar0 which is an alias for  
the ccsr memory region.  The mpic registers are a child of the ccsr  
memory region.  When hw/kvm/mpic.c in QEMU sees an access to that mpic  
memory region, it forwards the access to the kernel via  
KVM_DEV_MPIC_GRP_REGISTER.

> > >It would be the current task priority.  I assume MPIC maintains a
> > >16-bit map of the interrupt priorities in service, so that would  
> need
> > >to be added.
> >
> > We don't maintain such a map in the emulation code.  We have a
> 
> Oh, so how do you handle EOI of nested interrupts?

We scan the in-service bitmap for pending interrupts, and choose the  
one that has the highest priority as the one that the EOI must be  
referring to.

Just having a bitmap of priorities would only tell us the priority of  
the highest priority in-service interrupt, not which actual interrupt  
it is.

>  How do you know what to reset the CPU priority to in that case?

Once we identify the interrupt that is being EOId as above, we clear  
that bit and check again to see if there's a remaining interrupt whose  
priority is high enough to prevent a pending interrupt from being  
delivered.

> > per-CPU bitmap of the actual interrupt sources pending/active, which
> > is another attribute that would need to be added in order to support
> > migration on MPIC.
> 
> Not really, that can be recomputed from the sources easily enough.

I'm skeptical.  IPIs at least would be a problem, as would other  
multicast interrupts if we allowed that.  How would we distinguish an  
interrupt that is pending from one that is in-service, just from the  
sources?

In any case, it's a bit premature to discuss what we'd need for  
migration until QEMU itself can save/restore a normal QEMU openpic.   
When that time comes, attributes can be added for whatever extra state  
we need (if any) to extend that capability to in-kernel MPICs.

-Scott

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
  2013-02-16  2:56         ` Paul Mackerras
  2013-02-16  3:57           ` Scott Wood
@ 2013-02-20 19:58           ` Marcelo Tosatti
  2013-02-21  0:20             ` Scott Wood
  2013-02-25  0:59             ` Paul Mackerras
  1 sibling, 2 replies; 32+ messages in thread
From: Marcelo Tosatti @ 2013-02-20 19:58 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Scott Wood, Alexander Graf, kvm, kvm-ppc

On Sat, Feb 16, 2013 at 01:56:14PM +1100, Paul Mackerras wrote:
> On Fri, Feb 15, 2013 at 05:59:11PM -0600, Scott Wood wrote:
> > On 02/15/2013 05:18:31 PM, Paul Mackerras wrote:
> > >On Fri, Feb 15, 2013 at 02:05:41PM -0600, Scott Wood wrote:
> > >> On 02/14/2013 06:01:08 PM, Paul Mackerras wrote:
> > >> >From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > >> >
> > >> >This adds in-kernel emulation of the XICS (eXternal Interrupt
> > >> >Controller Specification) interrupt controller specified by
> > >PAPR, for
> > >> >both HV and PR KVM guests.
> > >> >
> > >> >This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like
> > >> >KVM_CREATE_IRQCHIP in that it indicates that the virtual machine
> > >> >should use in-kernel interrupt controller emulation, but also
> > >takes an
> > >> >argument struct that contains the type of interrupt controller
> > >> >architecture and an optional parameter.  Currently only one
> > >type value
> > >> >is defined, that which indicates the XICS architecture.
> > >>
> > >> Would the device config API I posted a couple days ago work for you?
> > >
> > >I suppose it could be made to work.  It doesn't feel like a natural
> > >fit though, because your API seems to assume (AFAICT) that a device is
> > >manipulated via registers at specific physical addresses, so I would
> > >have to invent an artificial set of registers with addresses and bit
> > >layouts, that aren't otherwise required.  The XICS is operated from
> > >the guest side via hcalls, not via emulated MMIO.
> > 
> > I don't think it makes such an assumption.  The MPIC device has
> > physical registers, so it exposes them, but it also exposes things
> > that are not physical registers (e.g. the per-IRQ input state).  The
> > generic device control layer leaves interpretation of attributes up
> > to the device.
> > 
> > I think it would be easier to fit XICS into the device control api
> > model than to fit MPIC into this model, not to mention what would
> > happen if we later want to emulate some other type of device -- x86
> > already has at least one non-irqchip emulated device (i8254).
> 
> I have no particular objection to the device control API per se, but
> I have two objections to using it as the primary interface to the XICS
> emulation.
> 
> First, I dislike the magical side-effect where creating a device of a
> particular type (e.g. MPIC or XICS) automatically attaches it to the
> interrupt lines of the vcpus.  I prefer an explicit request to do
> in-kernel interrupt control. 

This is probably a stupid question, but why the
KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface is not appropriate for
your purposes?

x86 sets up a default GSI->IRQCHIP PIN mapping on creation (during
KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING.

>  Further, the magic means that you can
> only have one instance of the device, whereas you might want to model
> the interrupt controller architecture as several devices.  You could
> do that using several device types, but then the interconnections
> between them would also be magic.
> 
> Secondly, it means that we are completely abandoning any attempt to
> define an abstract or generic interface to in-kernel interrupt
> controller emulations.  Each device will have its own unique set of
> attribute groups and its own unique userspace code to drive it, with
> no commonality between them.

<snip>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
  2013-02-20 19:58           ` Marcelo Tosatti
@ 2013-02-21  0:20             ` Scott Wood
  2013-02-21  1:09               ` Marcelo Tosatti
  2013-02-24 14:08               ` Gleb Natapov
  2013-02-25  0:59             ` Paul Mackerras
  1 sibling, 2 replies; 32+ messages in thread
From: Scott Wood @ 2013-02-21  0:20 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Paul Mackerras, Alexander Graf, kvm, kvm-ppc

On 02/20/2013 01:58:54 PM, Marcelo Tosatti wrote:
> This is probably a stupid question, but why the
> KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface is not appropriate for
> your purposes?
> 
> x86 sets up a default GSI->IRQCHIP PIN mapping on creation (during
> KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING.

To start, the whole IRQ routing stuff is poorly documented.

Am I supposed to make up GSI numbers and use the routing thing to  
associate them with real interrupts?  Are there constraints on what  
sort of GSI numbers I can choose (I now see in the code that  
KVM_MAX_IRQ_ROUTES is returned from the capability check, but where is  
that documented?  It looks like the APIC implementation has default  
routes, where is that documented?)?  Where does the code live to manage  
this table, and how APICy is it (looks like the answer is "irq_comm.c,  
and very")?  I suppose I could write another implementation of the  
table management code for MPIC, though the placement of "irqchip"  
inside the route entry, rather than as an argument to KVM_IRQ_LINE,  
suggests the table is supposed to be global, not in the individual  
interrupt controller.

It looks like I'm going to have to do this anyway for irqfd, though  
that doesn't make the other uses of the device control api go away.   
Even KVM_DEV_MPIC_GRP_IRQ_ACTIVE would still be useful for reading the  
status for debugging (reading device registers doesn't quite do it,  
since the "active" bit won't show up if the interrupt is masked).  At  
that point, is it more offensive to make it read-only even though it  
would be trivial to make it read/write (which would allow users who  
don't need it to bypass the routing API), or to make it read/write and  
live with there being more than one way to do something?

KVM_SET_IRQCHIP is not suitable because we have more than 512 bytes of  
state, and because it doesn't allow debugging access to device  
registers (e.g. inspecting from the QEMU command line), and because  
it's hard to add new pieces of state if we realize we left something  
out.  It reminds be of GET/SET_SREGS.  With that, I did what you seem  
to want here, which is to adapt the existing interfaces, using feature  
flags to control optional state.  It ended up being a mess, and ONE_REG  
was introduced as a replacement.  The device control API is the  
equivalent of ONE_REG for things other than vcpus.

-Scott

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
  2013-02-21  0:20             ` Scott Wood
@ 2013-02-21  1:09               ` Marcelo Tosatti
  2013-02-21  1:45                 ` Scott Wood
  2013-02-24 14:08               ` Gleb Natapov
  1 sibling, 1 reply; 32+ messages in thread
From: Marcelo Tosatti @ 2013-02-21  1:09 UTC (permalink / raw)
  To: Scott Wood; +Cc: Paul Mackerras, Alexander Graf, kvm, kvm-ppc

On Wed, Feb 20, 2013 at 06:20:51PM -0600, Scott Wood wrote:
> On 02/20/2013 01:58:54 PM, Marcelo Tosatti wrote:
> >This is probably a stupid question, but why the
> >KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface is not appropriate for
> >your purposes?
> >
> >x86 sets up a default GSI->IRQCHIP PIN mapping on creation (during
> >KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING.
> 
> To start, the whole IRQ routing stuff is poorly documented.
> 
> Am I supposed to make up GSI numbers and use the routing thing to
> associate them with real interrupts? 

I have no idea. Is mapping from one integer linear space (GSIs) 
to real interrupts suitable for you? 

> Are there constraints on what
> sort of GSI numbers I can choose (I now see in the code that
> KVM_MAX_IRQ_ROUTES is returned from the capability check, but where
> is that documented?  

Don't think it is.

> It looks like the APIC implementation has
> default routes, where is that documented?)? 

In the code.

> Where does the code live to manage this table, and how APICy is it (looks like the
> answer is "irq_comm.c, and very")? 

Thinking faster than typing? Not sure what you mean.

> I suppose I could write another
> implementation of the table management code for MPIC, though the
> placement of "irqchip" inside the route entry, rather than as an
> argument to KVM_IRQ_LINE, suggests the table is supposed to be
> global, not in the individual interrupt controller.

Yes the table is global. It maps GSI ("Global System Interrupt" IIRC)
(integer) to (irqchip,pin) pair.

> It looks like I'm going to have to do this anyway for irqfd, though
> that doesn't make the other uses of the device control api go away.
> Even KVM_DEV_MPIC_GRP_IRQ_ACTIVE would still be useful for reading
> the status for debugging (reading device registers doesn't quite do
> it, since the "active" bit won't show up if the interrupt is
> masked).  

> At that point, is it more offensive to make it read-only
> even though it would be trivial to make it read/write (which would
> allow users who don't need it to bypass the routing API), or to make
> it read/write and live with there being more than one way to do
> something?

Can't follow this sentence.

> KVM_SET_IRQCHIP is not suitable because we have more than 512 bytes
> of state, and because it doesn't allow debugging access to device
> registers (e.g. inspecting from the QEMU command line), and because
> it's hard to add new pieces of state if we realize we left something
> out.  It reminds be of GET/SET_SREGS.  With that, I did what you
> seem to want here, which is to adapt the existing interfaces, using
> feature flags to control optional state.  It ended up being a mess,
> and ONE_REG was introduced as a replacement.  The device control API
> is the equivalent of ONE_REG for things other than vcpus.
> 
> -Scott

- ACK on 512 bytes not sufficient. Add another ioctl, SET_IRQCHIP2?
- Agree on poor extensibility of interface. Adding a reserved amount
of space as padding and versioning such as has been done so far 
is not acceptable? 
- Debugging: why is reading entire register state not acceptable? Yes,
  its slow.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
  2013-02-21  1:09               ` Marcelo Tosatti
@ 2013-02-21  1:45                 ` Scott Wood
  0 siblings, 0 replies; 32+ messages in thread
From: Scott Wood @ 2013-02-21  1:45 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Paul Mackerras, Alexander Graf, kvm, kvm-ppc

On 02/20/2013 07:09:55 PM, Marcelo Tosatti wrote:
> On Wed, Feb 20, 2013 at 06:20:51PM -0600, Scott Wood wrote:
> > On 02/20/2013 01:58:54 PM, Marcelo Tosatti wrote:
> > >This is probably a stupid question, but why the
> > >KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface is not appropriate  
> for
> > >your purposes?
> > >
> > >x86 sets up a default GSI->IRQCHIP PIN mapping on creation (during
> > >KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING.
> >
> > To start, the whole IRQ routing stuff is poorly documented.
> >
> > Am I supposed to make up GSI numbers and use the routing thing to
> > associate them with real interrupts?
> 
> I have no idea. Is mapping from one integer linear space (GSIs)
> to real interrupts suitable for you?

I can live with it.

> > Where does the code live to manage this table, and how APICy is it  
> (looks like the
> > answer is "irq_comm.c, and very")?
> 
> Thinking faster than typing? Not sure what you mean.

Sorry...  The code to manage the table lives in virt/kvm/irq_comm.c.   
That code is very APIC-specific and not currently in a state that  
invites sharing.

> > It looks like I'm going to have to do this anyway for irqfd, though
> > that doesn't make the other uses of the device control api go away.
> > Even KVM_DEV_MPIC_GRP_IRQ_ACTIVE would still be useful for reading
> > the status for debugging (reading device registers doesn't quite do
> > it, since the "active" bit won't show up if the interrupt is
> > masked).
> 
> > At that point, is it more offensive to make it read-only
> > even though it would be trivial to make it read/write (which would
> > allow users who don't need it to bypass the routing API), or to make
> > it read/write and live with there being more than one way to do
> > something?
> 
> Can't follow this sentence.

Suppose, for the sake of irqfd (and maillist tranquility) MPIC uses  
existing KVM_IRQ_LINE and such.  Will there be objection to being able  
to use KVM_GET_DEVICE_ATTR to *get* the irq line status for debugging  
purposes (maybe also useful for migration)?  If there's no objection to  
that, would there be any actual reason, beyond saving a few lines of  
glue code, to make it a read-only attribute?

> > KVM_SET_IRQCHIP is not suitable because we have more than 512 bytes
> > of state, and because it doesn't allow debugging access to device
> > registers (e.g. inspecting from the QEMU command line), and because
> > it's hard to add new pieces of state if we realize we left something
> > out.  It reminds be of GET/SET_SREGS.  With that, I did what you
> > seem to want here, which is to adapt the existing interfaces, using
> > feature flags to control optional state.  It ended up being a mess,
> > and ONE_REG was introduced as a replacement.  The device control API
> > is the equivalent of ONE_REG for things other than vcpus.
> >
> > -Scott
> 
> - ACK on 512 bytes not sufficient. Add another ioctl, SET_IRQCHIP2?

Well, that's what KVM_SET_DEVICE_ATTR is.

> - Agree on poor extensibility of interface. Adding a reserved amount
> of space as padding and versioning such as has been done so far
> is not acceptable?

That's exactly what we did with SREGS, and it got declared a mess and  
replaced with ONE_REG.  I'm trying to learn from my mistakes. :-)

> - Debugging: why is reading entire register state not acceptable? Yes,
>   its slow.

For one, it's more work.  The current way works by simulating a guest  
MMIO access.  No blob format to design, implement, and keep compatible.

-Scott

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
  2013-02-21  0:20             ` Scott Wood
  2013-02-21  1:09               ` Marcelo Tosatti
@ 2013-02-24 14:08               ` Gleb Natapov
  1 sibling, 0 replies; 32+ messages in thread
From: Gleb Natapov @ 2013-02-24 14:08 UTC (permalink / raw)
  To: Scott Wood; +Cc: Marcelo Tosatti, Paul Mackerras, Alexander Graf, kvm, kvm-ppc

On Wed, Feb 20, 2013 at 06:20:51PM -0600, Scott Wood wrote:
> On 02/20/2013 01:58:54 PM, Marcelo Tosatti wrote:
> >This is probably a stupid question, but why the
> >KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface is not appropriate for
> >your purposes?
> >
> >x86 sets up a default GSI->IRQCHIP PIN mapping on creation (during
> >KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING.
> 
> To start, the whole IRQ routing stuff is poorly documented.
> 
> Am I supposed to make up GSI numbers and use the routing thing to
> associate them with real interrupts?
You can consider GSI to be a cookie that you use to refer to whatever
data you've put into routing table by KVM_IRQ_LINE/irqfd interface.
Even on x86, when irq routing is used to inject MSI interrupt, this is
exactly how GSI is used. In MSI case it does not have a meaning besides
"look at that interrupt entry to see what MSI should be injected".

>                                       Are there constraints on what
> sort of GSI numbers I can choose (I now see in the code that
> KVM_MAX_IRQ_ROUTES is returned from the capability check, but where
> is that documented?
The only constrain is that the number should be smalled than
KVM_MAX_IRQ_ROUTES, but this is implementation detail. Current
implementation uses array to map from GSI to a data, if a lot more
entries then currently allowed is needed implementation may be changed
to different data structure.

>                      It looks like the APIC implementation has
> default routes, where is that documented?)?
It is very PC centric, should not be even compiled for other arches.

>                                              Where does the code
> live to manage this table, and how APICy is it (looks like the
> answer is "irq_comm.c, and very")?
It is a mistake to refer to the irq routing table as APICy :). It is
certainly PC centric currently, but there is at least one HW layer
between it and the APIC. PC has global GSI space, each GSI can be
delivered via different irq chip. Some GSIs can be delivered through
multiple irq chips. Irq routing table provides mapping between GSI and
irq chips it should be delivered through. Some irq chips deliver
interrupt via APIC some not, but this is different story.

The work is needed to make the code not PC centric, but it should not be
a lot of work.

>                                     I suppose I could write another
> implementation of the table management code for MPIC, though the
> placement of "irqchip" inside the route entry, rather than as an
> argument to KVM_IRQ_LINE, suggests the table is supposed to be
> global, not in the individual interrupt controller.
> 
Yes, it is global. It sits between emulated devices and irq chips.

--
			Gleb.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
  2013-02-20 19:58           ` Marcelo Tosatti
  2013-02-21  0:20             ` Scott Wood
@ 2013-02-25  0:59             ` Paul Mackerras
  2013-03-21  9:20               ` Alexander Graf
  1 sibling, 1 reply; 32+ messages in thread
From: Paul Mackerras @ 2013-02-25  0:59 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Scott Wood, Alexander Graf, kvm, kvm-ppc

On Wed, Feb 20, 2013 at 04:58:54PM -0300, Marcelo Tosatti wrote:

> This is probably a stupid question, but why the
> KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface is not appropriate for
> your purposes?
> 
> x86 sets up a default GSI->IRQCHIP PIN mapping on creation (during
> KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING.

So, I see Scott already answered from the point of view of his MPIC
emulation stuff, but I'll answer too from the point of view of my XICS
emulation code.

My understanding, possibly imperfect, is that in a real system the
routing of GSIs to IOAPICs would either be hardwired or set up by the
BIOS, described in ACPI tables, and not modified by the operating
system.  Is that correct?  So my belief is that the GSI routing is
fundamentally distinct from and handled differently from the routing
of interrupts to CPUs, which is fully under the control of the OS.

In the XICS model we have a set of interrupt sources, each identified
by a 24-bit number.  Control operations on an interrupt source just
identify the source by its number.  Thus the interrupt source number
is like a GSI, but we don't need to map that to a different space
(e.g. IOAPIC identifier and input number) in order to operate on it,
we can just operate on it directly.

Paul.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/9] KVM: PPC: Book3S: Add infrastructure to implement kernel-side RTAS calls
  2013-02-14 23:59 ` [PATCH 1/9] KVM: PPC: Book3S: Add infrastructure to implement kernel-side RTAS calls Paul Mackerras
@ 2013-03-21  8:52   ` Alexander Graf
  2013-04-04  5:37     ` Paul Mackerras
  0 siblings, 1 reply; 32+ messages in thread
From: Alexander Graf @ 2013-03-21  8:52 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: kvm, kvm-ppc


On 15.02.2013, at 00:59, Paul Mackerras wrote:

> From: Michael Ellerman <michael@ellerman.id.au>
> 
> For pseries machine emulation, in order to move the interrupt
> controller code to the kernel, we need to intercept some RTAS
> calls in the kernel itself.  This adds an infrastructure to allow
> in-kernel handlers to be registered for RTAS services by name.
> A new ioctl, KVM_PPC_RTAS_DEFINE_TOKEN, then allows userspace to
> associate token values with those service names.  Then, when the
> guest requests an RTAS service with one of those token values, it
> will be handled by the relevant in-kernel handler rather than being
> passed up to userspace as at present.
> 
> Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Signed-off-by: Paul Mackerras <paulus@samba.org>
> ---
> Documentation/virtual/kvm/api.txt   |   19 ++++
> arch/powerpc/include/asm/hvcall.h   |    3 +
> arch/powerpc/include/asm/kvm_host.h |    1 +
> arch/powerpc/include/asm/kvm_ppc.h  |    4 +
> arch/powerpc/include/uapi/asm/kvm.h |    6 ++
> arch/powerpc/kvm/Makefile           |    1 +
> arch/powerpc/kvm/book3s_hv.c        |   18 +++-
> arch/powerpc/kvm/book3s_pr_papr.c   |    7 ++
> arch/powerpc/kvm/book3s_rtas.c      |  182 +++++++++++++++++++++++++++++++++++
> arch/powerpc/kvm/powerpc.c          |    9 +-
> include/uapi/linux/kvm.h            |    3 +
> 11 files changed, 251 insertions(+), 2 deletions(-)
> create mode 100644 arch/powerpc/kvm/book3s_rtas.c
> 
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index c2534c3..d3e2d60 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2122,6 +2122,25 @@ header; first `n_valid' valid entries with contents from the data
> written, then `n_invalid' invalid entries, invalidating any previously
> valid entries found.
> 
> +4.79 KVM_PPC_RTAS_DEFINE_TOKEN
> +
> +Capability: KVM_CAP_PPC_RTAS
> +Architectures: ppc
> +Type: vm ioctl
> +Parameters: struct kvm_rtas_token_args
> +Returns: 0 on success, -1 on error
> +
> +Defines a token value for a RTAS (Run Time Abstraction Services)
> +service in order to allow it to be handled in the kernel.  The
> +argument struct gives the name of the service, which must be the name
> +of a service that has a kernel-side implementation.  If the token
> +value is non-zero, it will be associated with that service, and
> +subsequent RTAS calls by the guest specifying that token will be
> +handled by the kernel.  If the token value is 0, then any token
> +associated with the service will be forgotten, and subsequent RTAS
> +calls by the guest for that service will be passed to userspace to be
> +handled.
> +
> 
> 5. The kvm_run structure
> ------------------------
> diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
> index 7a86706..9ea22b2 100644
> --- a/arch/powerpc/include/asm/hvcall.h
> +++ b/arch/powerpc/include/asm/hvcall.h
> @@ -269,6 +269,9 @@
> #define H_GET_MPP_X		0x314
> #define MAX_HCALL_OPCODE	H_GET_MPP_X
> 
> +/* Platform specific hcalls, used by KVM */
> +#define H_RTAS			0xf000

How about you define a different hcall ID for this? Then QEMU would create its "rtas entry blob" such that KVM-routed RTAS handling goes to KVM directly.


Alex

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/9] KVM: PPC: Remove unused argument to kvmppc_core_dequeue_external
  2013-02-15  0:00 ` [PATCH 2/9] KVM: PPC: Remove unused argument to kvmppc_core_dequeue_external Paul Mackerras
@ 2013-03-21  8:58   ` Alexander Graf
  0 siblings, 0 replies; 32+ messages in thread
From: Alexander Graf @ 2013-03-21  8:58 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: kvm, kvm-ppc


On 15.02.2013, at 01:00, Paul Mackerras wrote:

> Currently kvmppc_core_dequeue_external() takes a struct kvm_interrupt *
> argument and does nothing with it, in any of its implementations.
> This removes it in order to make things easier for forthcoming
> in-kernel interrupt controller emulation code.
> 
> Signed-off-by: Paul Mackerras <paulus@samba.org>

Thanks, applied to kvm-ppc-queue.


Alex

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
  2013-02-25  0:59             ` Paul Mackerras
@ 2013-03-21  9:20               ` Alexander Graf
  0 siblings, 0 replies; 32+ messages in thread
From: Alexander Graf @ 2013-03-21  9:20 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Marcelo Tosatti, Scott Wood, kvm, kvm-ppc


On 25.02.2013, at 01:59, Paul Mackerras wrote:

> On Wed, Feb 20, 2013 at 04:58:54PM -0300, Marcelo Tosatti wrote:
> 
>> This is probably a stupid question, but why the
>> KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface is not appropriate for
>> your purposes?
>> 
>> x86 sets up a default GSI->IRQCHIP PIN mapping on creation (during
>> KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING.
> 
> So, I see Scott already answered from the point of view of his MPIC
> emulation stuff, but I'll answer too from the point of view of my XICS
> emulation code.
> 
> My understanding, possibly imperfect, is that in a real system the
> routing of GSIs to IOAPICs would either be hardwired or set up by the
> BIOS, described in ACPI tables, and not modified by the operating
> system.  Is that correct?  So my belief is that the GSI routing is
> fundamentally distinct from and handled differently from the routing
> of interrupts to CPUs, which is fully under the control of the OS.

It's a different layer. I guess there's really some confusion on names here :). I'm always confused when I read "sources" and you apparently get confused when you read about GSIs.

GSIs are an ACPI concept. It's not x86 specific, it's also not APIC specific. It's just a global name space for IRQs.

Imagine you have 2 MPICs in your system. But you only want to use a single token / numer to access any IRQ on any chip. That's where GSIs come into play. They map different irqchip IRQs onto a flat number space. To speak with x86 names:

Virtualization perspective:

  QEMU -> GSI -> IOAPIC -> LAPIC -> CPU

Device perspective:

  Device irq line -> IOAPIC -> LAPIC -> CPU


The "IOAPIC" is the piece of hardware that interrupt lines get attached to. You connect a pin on it to an irq pin of your device. That talks to the LAPIC to actually schedule interrupts on target CPUs. The LAPIC then fetches interrupts and pulls the CPU's interrupt line.

Of course, things are slightly more complicated in the x86 world, as everything behind the IOAPIC also carries a payload defining which pin actually got triggered, but you get the idea.

So really just consider GSIs as a "global flat number space" for irqchip pins.


Alex

> In the XICS model we have a set of interrupt sources, each identified
> by a 24-bit number.  Control operations on an interrupt source just
> identify the source by its number.  Thus the interrupt source number
> is like a GSI, but we don't need to map that to a different space
> (e.g. IOAPIC identifier and input number) in order to operate on it,
> we can just operate on it directly.
> 
> Paul.
> --
> To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/9] KVM: PPC: Book3S: Add infrastructure to implement kernel-side RTAS calls
  2013-03-21  8:52   ` Alexander Graf
@ 2013-04-04  5:37     ` Paul Mackerras
  2013-04-04  9:49       ` Alexander Graf
  0 siblings, 1 reply; 32+ messages in thread
From: Paul Mackerras @ 2013-04-04  5:37 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm, kvm-ppc

On Thu, Mar 21, 2013 at 09:52:16AM +0100, Alexander Graf wrote:
> > +/* Platform specific hcalls, used by KVM */
> > +#define H_RTAS			0xf000
> 
> How about you define a different hcall ID for this? Then QEMU would
> create its "rtas entry blob" such that KVM-routed RTAS handling goes
> to KVM directly.

QEMU can still do that, and I don't see that it would change the
kernel side if it did.  We would still have to have agreement between
the kernel and userspace as to what the hcall number for invoking the
in-kernel RTAS calls was, and the kernel would still have to keep a
list of token numbers and how they correspond to the functions it
provides.  The only thing different would be that the in-kernel RTAS
hcall could return to the guest if it didn't recognize the token
number, rather than pushing the problem up to userspace.  However,
that wouldn't make the code any simpler, and it isn't a situation
where performance is an issue.

Do you see some kernel-side improvements or simplifications from your
suggestion that I'm missing?  Remember, the guest gets the token
numbers from the device tree (properties under the /rtas node), so
they are under the control of userspace/QEMU.

Paul.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/9] KVM: PPC: Book3S: Add infrastructure to implement kernel-side RTAS calls
  2013-04-04  5:37     ` Paul Mackerras
@ 2013-04-04  9:49       ` Alexander Graf
  2013-04-04 22:38         ` Paul Mackerras
  0 siblings, 1 reply; 32+ messages in thread
From: Alexander Graf @ 2013-04-04  9:49 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: kvm, kvm-ppc


On 04.04.2013, at 07:37, Paul Mackerras wrote:

> On Thu, Mar 21, 2013 at 09:52:16AM +0100, Alexander Graf wrote:
>>> +/* Platform specific hcalls, used by KVM */
>>> +#define H_RTAS			0xf000
>> 
>> How about you define a different hcall ID for this? Then QEMU would
>> create its "rtas entry blob" such that KVM-routed RTAS handling goes
>> to KVM directly.
> 
> QEMU can still do that, and I don't see that it would change the
> kernel side if it did.  We would still have to have agreement between
> the kernel and userspace as to what the hcall number for invoking the
> in-kernel RTAS calls was, and the kernel would still have to keep a
> list of token numbers and how they correspond to the functions it
> provides.  The only thing different would be that the in-kernel RTAS
> hcall could return to the guest if it didn't recognize the token
> number, rather than pushing the problem up to userspace.  However,
> that wouldn't make the code any simpler, and it isn't a situation
> where performance is an issue.
> 
> Do you see some kernel-side improvements or simplifications from your
> suggestion that I'm missing?  Remember, the guest gets the token
> numbers from the device tree (properties under the /rtas node), so
> they are under the control of userspace/QEMU.

The code flow with this patch:

  <setup time>

  foreach (override in overrides)
    ioctl(OVERRIDE_RTAS, ...);

  <runtime>

  switch (hcall_id) {
  case QEMU_RTAS_ID:
    foreach (override in kvm_overrides) {
      int rtas_id = ...;
      if (override.rtas_id == rtas_id) {
        handle_rtas();
        handled = true;
      }
    }
    if (!handled)
      pass_to_qemu();
    break;
  default:
    pass_to_qemu();
    break
  }

What I'm suggesting:

  <setup time>

  nothing from KVM's point of view

  <runtime>

  switch (hcall_id) {
  case KVM_RTAS_ID:
    handle_rtas();
    break;
  default:
    pass_to_qemu();
    break;
  }


Which one looks easier and less error prone to you? :)

Speaking of which, how does user space know that the kernel actually supports a specific RTAS token?


Alex

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/9] KVM: PPC: Book3S: Add infrastructure to implement kernel-side RTAS calls
  2013-04-04  9:49       ` Alexander Graf
@ 2013-04-04 22:38         ` Paul Mackerras
  2013-04-19 15:16           ` Alexander Graf
  0 siblings, 1 reply; 32+ messages in thread
From: Paul Mackerras @ 2013-04-04 22:38 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm, kvm-ppc

On Thu, Apr 04, 2013 at 11:49:55AM +0200, Alexander Graf wrote:
> 
> On 04.04.2013, at 07:37, Paul Mackerras wrote:
> 
> > On Thu, Mar 21, 2013 at 09:52:16AM +0100, Alexander Graf wrote:
> >>> +/* Platform specific hcalls, used by KVM */
> >>> +#define H_RTAS			0xf000
> >> 
> >> How about you define a different hcall ID for this? Then QEMU would
> >> create its "rtas entry blob" such that KVM-routed RTAS handling goes
> >> to KVM directly.
> > 
> > QEMU can still do that, and I don't see that it would change the
> > kernel side if it did.  We would still have to have agreement between
> > the kernel and userspace as to what the hcall number for invoking the
> > in-kernel RTAS calls was, and the kernel would still have to keep a
> > list of token numbers and how they correspond to the functions it
> > provides.  The only thing different would be that the in-kernel RTAS
> > hcall could return to the guest if it didn't recognize the token
> > number, rather than pushing the problem up to userspace.  However,
> > that wouldn't make the code any simpler, and it isn't a situation
> > where performance is an issue.
> > 
> > Do you see some kernel-side improvements or simplifications from your
> > suggestion that I'm missing?  Remember, the guest gets the token
> > numbers from the device tree (properties under the /rtas node), so
> > they are under the control of userspace/QEMU.
> 
> The code flow with this patch:
> 
>   <setup time>
> 
>   foreach (override in overrides)
>     ioctl(OVERRIDE_RTAS, ...);
> 
>   <runtime>
> 
>   switch (hcall_id) {
>   case QEMU_RTAS_ID:
>     foreach (override in kvm_overrides) {
>       int rtas_id = ...;
>       if (override.rtas_id == rtas_id) {
>         handle_rtas();

Actually this is more like: override.handler();

>         handled = true;
>       }
>     }
>     if (!handled)
>       pass_to_qemu();
>     break;
>   default:
>     pass_to_qemu();
>     break
>   }
> 
> What I'm suggesting:
> 
>   <setup time>
> 
>   nothing from KVM's point of view

Actually, this can't be "nothing".

The way the RTAS calls work is that there is a name and a "token"
(32-bit integer value) for each RTAS call.  The tokens have to be
unique for each different name.  Userspace puts the names and tokens
in the device tree under the /rtas node (a set of properties where the
property name is the RTAS function name and the property value is the
token).  The guest looks up the token for each RTAS function it wants
to use, and passes the token in the argument buffer for the RTAS call.

This means that userspace has to know the names and tokens for all
supported RTAS functions, both the ones implemented in the kernel and
the ones implemented in userspace.

Also, the token numbers are pretty arbitrary, and the token numbers
for the kernel-implemented RTAS functions could be chosen by userspace
or by the kernel.  If they're chosen by the kernel, then userspace
needs a way to discover them (so it can put them in the device tree),
and also has to avoid choosing any token numbers for its functions
that collide with a kernel-chosen token.  If userspace chooses the
token numbers, it has to tell the kernel what token numbers it has
chosen for the kernel-implemented RTAS functions.  We chose the latter
since it gives userspace more control.

So this <setup time> code has to be either (your suggestion):

    foreach RTAS function possibly implemented in kernel {
        query kernel token for function, by name
	if that gives an error, mark function as needing to be
	        implemented in userspace
    }
    (userspace) allocate tokens for remaining functions,
                avoiding collisions with kernel-chosen tokens

or else it is (my suggestion):

    (userspace) allocate tokens for all RTAS functions
    foreach RTAS function possibly implemented in kernel {
        tell kernel the (name, token) correspondence
    }

>   <runtime>
> 
>   switch (hcall_id) {
>   case KVM_RTAS_ID:
>     handle_rtas();

Here, you've compressed details that you expanded in your pseudo-code
above, making this a less than fair comparison.  This handle_rtas()
function has to fetch the token and branch out to the appropriate
handler routine.  Whether that's a switch statement or a loop over
registered handlers doesn't make all that much difference.

>     break;
>   default:
>     pass_to_qemu();
>     break;
>   }
> 
> 
> Which one looks easier and less error prone to you? :)
> 
> Speaking of which, how does user space know that the kernel actually
> supports a specific RTAS token? 

It's really the names that are more important, the tokens are pretty
arbitrary.  In my scheme, userspace does a KVM_PPC_RTAS_DEFINE_TOKEN
ioctl giving the name and the (userspace-chosen) token, which gets an
error if the kernel doesn't recognize the name.  In your scheme, there
would have to be an equivalent ioctl to query the (kernel-chosen)
token for a given name, which once again would return an error if the
kernel doesn't recognize the name.  Either way the kernel has to have
a list of names that it knows about.

Paul.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/9] KVM: PPC: Book3S: Add infrastructure to implement kernel-side RTAS calls
  2013-04-04 22:38         ` Paul Mackerras
@ 2013-04-19 15:16           ` Alexander Graf
  0 siblings, 0 replies; 32+ messages in thread
From: Alexander Graf @ 2013-04-19 15:16 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: kvm, kvm-ppc


On 05.04.2013, at 00:38, Paul Mackerras wrote:

> On Thu, Apr 04, 2013 at 11:49:55AM +0200, Alexander Graf wrote:
>> 
>> On 04.04.2013, at 07:37, Paul Mackerras wrote:
>> 
>>> On Thu, Mar 21, 2013 at 09:52:16AM +0100, Alexander Graf wrote:
>>>>> +/* Platform specific hcalls, used by KVM */
>>>>> +#define H_RTAS			0xf000
>>>> 
>>>> How about you define a different hcall ID for this? Then QEMU would
>>>> create its "rtas entry blob" such that KVM-routed RTAS handling goes
>>>> to KVM directly.
>>> 
>>> QEMU can still do that, and I don't see that it would change the
>>> kernel side if it did.  We would still have to have agreement between
>>> the kernel and userspace as to what the hcall number for invoking the
>>> in-kernel RTAS calls was, and the kernel would still have to keep a
>>> list of token numbers and how they correspond to the functions it
>>> provides.  The only thing different would be that the in-kernel RTAS
>>> hcall could return to the guest if it didn't recognize the token
>>> number, rather than pushing the problem up to userspace.  However,
>>> that wouldn't make the code any simpler, and it isn't a situation
>>> where performance is an issue.
>>> 
>>> Do you see some kernel-side improvements or simplifications from your
>>> suggestion that I'm missing?  Remember, the guest gets the token
>>> numbers from the device tree (properties under the /rtas node), so
>>> they are under the control of userspace/QEMU.
>> 
>> The code flow with this patch:
>> 
>>  <setup time>
>> 
>>  foreach (override in overrides)
>>    ioctl(OVERRIDE_RTAS, ...);
>> 
>>  <runtime>
>> 
>>  switch (hcall_id) {
>>  case QEMU_RTAS_ID:
>>    foreach (override in kvm_overrides) {
>>      int rtas_id = ...;
>>      if (override.rtas_id == rtas_id) {
>>        handle_rtas();
> 
> Actually this is more like: override.handler();
> 
>>        handled = true;
>>      }
>>    }
>>    if (!handled)
>>      pass_to_qemu();
>>    break;
>>  default:
>>    pass_to_qemu();
>>    break
>>  }
>> 
>> What I'm suggesting:
>> 
>>  <setup time>
>> 
>>  nothing from KVM's point of view
> 
> Actually, this can't be "nothing".
> 
> The way the RTAS calls work is that there is a name and a "token"
> (32-bit integer value) for each RTAS call.  The tokens have to be
> unique for each different name.  Userspace puts the names and tokens
> in the device tree under the /rtas node (a set of properties where the
> property name is the RTAS function name and the property value is the
> token).  The guest looks up the token for each RTAS function it wants
> to use, and passes the token in the argument buffer for the RTAS call.
> 
> This means that userspace has to know the names and tokens for all
> supported RTAS functions, both the ones implemented in the kernel and
> the ones implemented in userspace.
> 
> Also, the token numbers are pretty arbitrary, and the token numbers
> for the kernel-implemented RTAS functions could be chosen by userspace
> or by the kernel.  If they're chosen by the kernel, then userspace
> needs a way to discover them (so it can put them in the device tree),
> and also has to avoid choosing any token numbers for its functions
> that collide with a kernel-chosen token.  If userspace chooses the
> token numbers, it has to tell the kernel what token numbers it has
> chosen for the kernel-implemented RTAS functions.  We chose the latter
> since it gives userspace more control.
> 
> So this <setup time> code has to be either (your suggestion):
> 
>    foreach RTAS function possibly implemented in kernel {
>        query kernel token for function, by name
> 	if that gives an error, mark function as needing to be
> 	        implemented in userspace
>    }
>    (userspace) allocate tokens for remaining functions,
>                avoiding collisions with kernel-chosen tokens
> 
> or else it is (my suggestion):
> 
>    (userspace) allocate tokens for all RTAS functions
>    foreach RTAS function possibly implemented in kernel {
>        tell kernel the (name, token) correspondence
>    }
> 
>>  <runtime>
>> 
>>  switch (hcall_id) {
>>  case KVM_RTAS_ID:
>>    handle_rtas();
> 
> Here, you've compressed details that you expanded in your pseudo-code
> above, making this a less than fair comparison.  This handle_rtas()
> function has to fetch the token and branch out to the appropriate
> handler routine.  Whether that's a switch statement or a loop over
> registered handlers doesn't make all that much difference.
> 
>>    break;
>>  default:
>>    pass_to_qemu();
>>    break;
>>  }
>> 
>> 
>> Which one looks easier and less error prone to you? :)
>> 
>> Speaking of which, how does user space know that the kernel actually
>> supports a specific RTAS token? 
> 
> It's really the names that are more important, the tokens are pretty
> arbitrary.  In my scheme, userspace does a KVM_PPC_RTAS_DEFINE_TOKEN
> ioctl giving the name and the (userspace-chosen) token, which gets an
> error if the kernel doesn't recognize the name.  In your scheme, there
> would have to be an equivalent ioctl to query the (kernel-chosen)
> token for a given name, which once again would return an error if the
> kernel doesn't recognize the name.  Either way the kernel has to have
> a list of names that it knows about.

Hrm. I think I'm slowly grasping what the real issue is.

Fair enough, your approach works for me then.


Alex

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2013-04-19 15:16 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-02-14 23:59 [PATCH 0/9] In-kernel XICS interrupt controller emulation Paul Mackerras
2013-02-14 23:59 ` [PATCH 1/9] KVM: PPC: Book3S: Add infrastructure to implement kernel-side RTAS calls Paul Mackerras
2013-03-21  8:52   ` Alexander Graf
2013-04-04  5:37     ` Paul Mackerras
2013-04-04  9:49       ` Alexander Graf
2013-04-04 22:38         ` Paul Mackerras
2013-04-19 15:16           ` Alexander Graf
2013-02-15  0:00 ` [PATCH 2/9] KVM: PPC: Remove unused argument to kvmppc_core_dequeue_external Paul Mackerras
2013-03-21  8:58   ` Alexander Graf
2013-02-15  0:01 ` [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller Paul Mackerras
2013-02-15 20:05   ` Scott Wood
2013-02-15 23:18     ` Paul Mackerras
2013-02-15 23:59       ` Scott Wood
2013-02-16  2:56         ` Paul Mackerras
2013-02-16  3:57           ` Scott Wood
2013-02-16  4:51             ` Paul Mackerras
2013-02-18 22:43               ` Scott Wood
2013-02-20  0:41                 ` Paul Mackerras
2013-02-20  1:01                   ` Scott Wood
2013-02-20 19:58           ` Marcelo Tosatti
2013-02-21  0:20             ` Scott Wood
2013-02-21  1:09               ` Marcelo Tosatti
2013-02-21  1:45                 ` Scott Wood
2013-02-24 14:08               ` Gleb Natapov
2013-02-25  0:59             ` Paul Mackerras
2013-03-21  9:20               ` Alexander Graf
2013-02-15  0:01 ` [PATCH 4/9] KVM: PPC: Book3S: Generalize interfaces to interrupt controller emulation Paul Mackerras
2013-02-15  0:02 ` [PATCH 5/9] KVM: PPC: Book3S: Facilities to save/restore XICS presentation ctrler state Paul Mackerras
2013-02-15  0:02 ` [PATCH 6/9] KVM: PPC: Book3S HV: Speed up wakeups of CPUs on HV KVM Paul Mackerras
2013-02-15  0:03 ` [PATCH 7/9] KVM: PPC: Book3S HV: Add support for real mode ICP in XICS emulation Paul Mackerras
2013-02-15  0:03 ` [PATCH 8/9] KVM: PPC: Book3S HV: Improve real-mode handling of external interrupts Paul Mackerras
2013-02-15  0:04 ` [PATCH 9/9] KVM: PPC: Book3S: Add support for ibm,int-on/off RTAS calls Paul Mackerras

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).