* [RFC PATCH 0/6] kvm/ppc/mpic: in-kernel irqchip
From: Scott Wood @ 2013-02-14  5:49 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm

Scott Wood (6):
  kvm: add device control API
  kvm/ppc: add a notifier chain for vcpu creation/destruction.
  kvm/ppc/mpic: import hw/openpic.c from QEMU
  kvm/ppc/mpic: remove some obviously unneeded code
  kvm/ppc/mpic: adapt to kernel style and environment
  kvm/ppc/mpic: in-kernel MPIC emulation

 Documentation/virtual/kvm/api.txt          |   76 ++
 Documentation/virtual/kvm/devices/README   |    1 +
 Documentation/virtual/kvm/devices/mpic.txt |   36 +
 arch/powerpc/include/asm/kvm_host.h        |    9 +-
 arch/powerpc/include/asm/kvm_ppc.h         |    4 +
 arch/powerpc/kvm/Kconfig                   |    5 +
 arch/powerpc/kvm/Makefile                  |    2 +
 arch/powerpc/kvm/booke.c                   |   10 +-
 arch/powerpc/kvm/mpic.c                    | 1890 ++++++++++++++++++++++++++++
 arch/powerpc/kvm/powerpc.c                 |   31 +-
 include/linux/kvm_host.h                   |   44 +-
 include/uapi/linux/kvm.h                   |   34 +
 virt/kvm/kvm_main.c                        |  141 +++
 13 files changed, 2268 insertions(+), 15 deletions(-)
 create mode 100644 Documentation/virtual/kvm/devices/README
 create mode 100644 Documentation/virtual/kvm/devices/mpic.txt
 create mode 100644 arch/powerpc/kvm/mpic.c

-- 
1.7.9.5

* [RFC PATCH 1/6] kvm: add device control API
From: Scott Wood @ 2013-02-14  5:49 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, Scott Wood

Currently, devices that are emulated inside KVM are configured in a
hardcoded manner based on an assumption that any given architecture
only has one way to do it.  If there's any need to access device state,
it is done through inflexible one-purpose-only IOCTLs (e.g.
KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
cumbersome and depletes a limited numberspace.

This API provides a mechanism to instantiate a device of a certain
type, returning an ID that can be used to set/get attributes of the
device.  Attributes may include configuration parameters (e.g.
register base address), device state, operational commands, etc.  It
is similar to the ONE_REG API, except that it acts on devices rather
than vcpus.

Both device types and individual attributes can be tested without
creating the device or getting/setting the attribute, so there is no
need to manage enumerated capabilities separately.
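
For illustration only (not part of the patch), here is a rough sketch of
how userspace might fill in the two structures described above.  The
rfc_-prefixed names are hypothetical local mirrors of the structures and
constants proposed in this RFC; nothing here issues a real ioctl, and the
layout assumes the RFC as posted rather than any merged interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical local mirrors of the structures proposed in this RFC. */
struct rfc_kvm_create_device {
	uint32_t type;	/* in: device type number */
	uint32_t id;	/* out: device handle */
	uint32_t flags;	/* in: creation flags */
};

struct rfc_kvm_device_attr {
	uint32_t dev;	/* id returned by device creation */
	uint32_t group;	/* common group or device-defined */
	uint64_t attr;	/* group-defined */
	uint64_t addr;	/* userspace address of attr data */
};

/* Values copied from the uapi hunk of this patch. */
#define RFC_KVM_CREATE_DEVICE_TEST	1u
#define RFC_KVM_DEV_ATTR_COMMON		0u
#define RFC_KVM_DEV_ATTR_TYPE		0u

/* Build a request that only probes whether a device type is supported. */
static struct rfc_kvm_create_device rfc_probe_request(uint32_t type)
{
	struct rfc_kvm_create_device cd;

	memset(&cd, 0, sizeof(cd));
	cd.type = type;
	cd.flags = RFC_KVM_CREATE_DEVICE_TEST;
	return cd;
}

/*
 * Build a request for the common, read-only "type" attribute.  The
 * attribute is documented as 32-bit, so "out" points at a uint32_t.
 */
static struct rfc_kvm_device_attr rfc_type_query(uint32_t dev_id,
						 uint32_t *out)
{
	struct rfc_kvm_device_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.dev = dev_id;
	attr.group = RFC_KVM_DEV_ATTR_COMMON;
	attr.attr = RFC_KVM_DEV_ATTR_TYPE;
	attr.addr = (uint64_t)(uintptr_t)out;
	return attr;
}
```

On a kernel carrying this series, the resulting structures would be
passed by pointer to the vm fd via KVM_CREATE_DEVICE and
KVM_GET_DEVICE_ATTR respectively.  The ioctl numbers are specific to
this RFC, so the requests should not be issued against a kernel that
does not include it.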

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
 Documentation/virtual/kvm/api.txt        |   76 ++++++++++++++++++
 Documentation/virtual/kvm/devices/README |    1 +
 include/linux/kvm_host.h                 |   21 +++++
 include/uapi/linux/kvm.h                 |   25 ++++++
 virt/kvm/kvm_main.c                      |  127 ++++++++++++++++++++++++++++++
 5 files changed, 250 insertions(+)
 create mode 100644 Documentation/virtual/kvm/devices/README

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index c2534c3..5bcdb42 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2122,6 +2122,82 @@ header; first `n_valid' valid entries with contents from the data
 written, then `n_invalid' invalid entries, invalidating any previously
 valid entries found.
 
+4.79 KVM_CREATE_DEVICE
+
+Capability: KVM_CAP_DEVICE_CTRL
+Type: vm ioctl
+Parameters: struct kvm_create_device (in/out)
+Returns: 0 on success, -1 on error
+Errors:
+  ENODEV: The device type is unknown or unsupported
+  EEXIST: Device already created, and this type of device may not
+          be instantiated multiple times
+  ENOSPC: Too many devices have been created
+
+  Other error conditions may be defined by individual device types.
+
+Creates an emulated device in the kernel.  The returned handle
+can be used with KVM_SET/GET_DEVICE_ATTR.
+
+If the KVM_CREATE_DEVICE_TEST flag is set, only test whether the
+device type is supported (not necessarily whether it can be created
+in the current vm).
+
+Individual devices should not define flags.  Attributes should be used
+for specifying any behavior that is not implied by the device type
+number.
+
+struct kvm_create_device {
+	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
+	__u32	id;	/* out: device handle */
+	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
+};
+
+4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR
+
+Capability: KVM_CAP_DEVICE_CTRL
+Type: vm ioctl
+Parameters: struct kvm_device_attr
+Returns: 0 on success, -1 on error
+Errors:
+  ENODEV: The device id is invalid
+  ENXIO:  The group or attribute is unknown/unsupported for this device
+  EPERM:  The attribute cannot (currently) be accessed this way
+          (e.g. read-only attribute, or attribute that only makes
+          sense when the device is in a different state)
+
+  Other error conditions may be defined by individual device types.
+
+Gets/sets a specified piece of device configuration and/or state.  The
+semantics are device-specific except for certain global attributes.  See
+individual device documentation in the "devices" directory.  As with
+ONE_REG, the size of the data transferred is defined by the particular
+attribute.
+
+Attributes in group KVM_DEV_ATTR_COMMON are not device-specific:
+   KVM_DEV_ATTR_TYPE (ro, 32-bit): the device type passed to KVM_CREATE_DEVICE
+
+struct kvm_device_attr {
+	__u32	dev;		/* id from KVM_CREATE_DEVICE */
+	__u32	group;		/* KVM_DEV_ATTR_COMMON or device-defined */
+	__u64	attr;		/* group-defined */
+	__u64	addr;		/* userspace address of attr data */
+};
+
+4.81 KVM_HAS_DEVICE_ATTR
+
+Capability: KVM_CAP_DEVICE_CTRL
+Type: vm ioctl
+Parameters: struct kvm_device_attr
+Returns: 0 on success, -1 on error
+Errors:
+  ENODEV: The device id is invalid
+  ENXIO:  The group or attribute is unknown/unsupported for this device
+
+Tests whether a device supports a particular attribute.  A successful
+return indicates the attribute is implemented.  It does not necessarily
+indicate that the attribute can be read or written in the device's
+current state.  "addr" is ignored.
 
 5. The kvm_run structure
 ------------------------
diff --git a/Documentation/virtual/kvm/devices/README b/Documentation/virtual/kvm/devices/README
new file mode 100644
index 0000000..34a6983
--- /dev/null
+++ b/Documentation/virtual/kvm/devices/README
@@ -0,0 +1 @@
+This directory contains specific device bindings for KVM_CAP_DEVICE_CTRL.
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0350e0d..dbaf012 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -335,6 +335,25 @@ struct kvm_memslots {
 	short id_to_index[KVM_MEM_SLOTS_NUM];
 };
 
+/*
+ * The worst case number of simultaneous devices will likely be very low
+ * (usually zero or one) for the foreseeable future.  If the worst case
+ * exceeds this, then it can be increased, or we can convert to idr.
+ */
+#define KVM_MAX_DEVICES 4
+
+struct kvm_device {
+	u32 type;
+
+	int (*set_attr)(struct kvm *kvm, struct kvm_device *dev,
+			struct kvm_device_attr *attr);
+	int (*get_attr)(struct kvm *kvm, struct kvm_device *dev,
+			struct kvm_device_attr *attr);
+	int (*has_attr)(struct kvm *kvm, struct kvm_device *dev,
+			struct kvm_device_attr *attr);
+	void (*destroy)(struct kvm *kvm, struct kvm_device *dev);
+};
+
 struct kvm {
 	spinlock_t mmu_lock;
 	struct mutex slots_lock;
@@ -385,6 +404,8 @@ struct kvm {
 	long mmu_notifier_count;
 #endif
 	long tlbs_dirty;
+	struct kvm_device *devices[KVM_MAX_DEVICES];
+	unsigned int num_devices;
 };
 
 #define kvm_err(fmt, ...) \
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 9a2db57..1f348e0 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -662,6 +662,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_HTAB_FD 84
 #define KVM_CAP_S390_CSS_SUPPORT 85
 #define KVM_CAP_PPC_EPR 86
+#define KVM_CAP_DEVICE_CTRL 87
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -890,6 +891,30 @@ struct kvm_s390_ucas_mapping {
 /* Available with KVM_CAP_PPC_HTAB_FD */
 #define KVM_PPC_GET_HTAB_FD	  _IOW(KVMIO,  0xaa, struct kvm_get_htab_fd)
 
+/* Available with KVM_CAP_DEVICE_CTRL */
+#define KVM_CREATE_DEVICE_TEST		1
+
+struct kvm_create_device {
+	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
+	__u32	id;	/* out: device handle */
+	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
+};
+
+struct kvm_device_attr {
+	__u32	dev;		/* id from KVM_CREATE_DEVICE */
+	__u32	group;		/* KVM_DEV_ATTR_COMMON or device-defined */
+	__u64	attr;		/* group-defined */
+	__u64	addr;		/* userspace address of attr data */
+};
+
+#define KVM_DEV_ATTR_COMMON		0
+#define   KVM_DEV_ATTR_TYPE		0 /* 32-bit */
+
+#define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xac, struct kvm_create_device)
+#define KVM_SET_DEVICE_ATTR	  _IOW(KVMIO,  0xad, struct kvm_device_attr)
+#define KVM_GET_DEVICE_ATTR	  _IOW(KVMIO,  0xae, struct kvm_device_attr)
+#define KVM_HAS_DEVICE_ATTR	  _IOW(KVMIO,  0xaf, struct kvm_device_attr)
+
 /*
  * ioctls for vcpu fds
  */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2e93630..baf8481 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -580,6 +580,18 @@ void kvm_free_physmem(struct kvm *kvm)
 	kfree(kvm->memslots);
 }
 
+static void kvm_destroy_devices(struct kvm *kvm)
+{
+	int i;
+
+	for (i = 0; i < kvm->num_devices; i++) {
+		kvm->devices[i]->destroy(kvm, kvm->devices[i]);
+		kvm->devices[i] = NULL;
+	}
+
+	kvm->num_devices = 0;
+}
+
 static void kvm_destroy_vm(struct kvm *kvm)
 {
 	int i;
@@ -590,6 +602,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	list_del(&kvm->vm_list);
 	raw_spin_unlock(&kvm_lock);
 	kvm_free_irq_routing(kvm);
+	kvm_destroy_devices(kvm);
 	for (i = 0; i < KVM_NR_BUSES; i++)
 		kvm_io_bus_destroy(kvm->buses[i]);
 	kvm_coalesced_mmio_free(kvm);
@@ -2178,6 +2191,86 @@ out:
 }
 #endif
 
+static int kvm_ioctl_create_device(struct kvm *kvm,
+				   struct kvm_create_device *cd)
+{
+	struct kvm_device *dev = NULL;
+	bool test = cd->flags & KVM_CREATE_DEVICE_TEST;
+	int id;
+	int r;
+
+	mutex_lock(&kvm->lock);
+
+	id = kvm->num_devices;
+	if (id >= KVM_MAX_DEVICES && !test) {
+		r = -ENOSPC;
+		goto out;
+	}
+
+	switch (cd->type) {
+	default:
+		r = -ENODEV;
+		goto out;
+	}
+
+	if (test) {
+		WARN_ON_ONCE(dev);
+		goto out;
+	}
+
+	if (r == 0) {
+		WARN_ON_ONCE(dev->type != cd->type);
+
+		kvm->devices[id] = dev;
+		cd->id = id;
+		kvm->num_devices++;
+	}
+
+out:
+	mutex_unlock(&kvm->lock);
+	return r;
+}
+
+static int kvm_ioctl_device_attr(struct kvm *kvm, int ioctl,
+				 struct kvm_device_attr *attr)
+{
+	struct kvm_device *dev;
+	int (*accessor)(struct kvm *kvm, struct kvm_device *dev,
+			struct kvm_device_attr *attr);
+
+	if (attr->dev >= KVM_MAX_DEVICES)
+		return -ENODEV;
+
+	dev = kvm->devices[attr->dev];
+	if (!dev)
+		return -ENODEV;
+
+	switch (ioctl) {
+	case KVM_SET_DEVICE_ATTR:
+		if (attr->group == KVM_DEV_ATTR_COMMON &&
+		    attr->attr == KVM_DEV_ATTR_TYPE)
+			return -EPERM;
+
+		accessor = dev->set_attr;
+		break;
+	case KVM_GET_DEVICE_ATTR:
+		if (attr->group == KVM_DEV_ATTR_COMMON &&
+		    attr->attr == KVM_DEV_ATTR_TYPE)
+			return dev->type;
+
+		accessor = dev->get_attr;
+		break;
+	case KVM_HAS_DEVICE_ATTR:
+		accessor = dev->has_attr;
+		break;
+	}
+
+	if (!accessor)
+		return -EPERM;
+
+	return accessor(kvm, dev, attr);
+}
+
 static long kvm_vm_ioctl(struct file *filp,
 			   unsigned int ioctl, unsigned long arg)
 {
@@ -2292,6 +2385,40 @@ static long kvm_vm_ioctl(struct file *filp,
 		break;
 	}
 #endif
+	case KVM_CREATE_DEVICE: {
+		struct kvm_create_device cd;
+
+		r = -EFAULT;
+		if (copy_from_user(&cd, argp, sizeof(cd)))
+			goto out;
+
+		r = kvm_ioctl_create_device(kvm, &cd);
+		if (r)
+			goto out;
+
+		r = -EFAULT;
+		if (copy_to_user(argp, &cd, sizeof(cd)))
+			goto out;
+
+		r = 0;
+		break;
+	}
+	case KVM_SET_DEVICE_ATTR:
+	case KVM_GET_DEVICE_ATTR:
+	case KVM_HAS_DEVICE_ATTR: {
+		struct kvm_device_attr attr;
+
+		r = -EFAULT;
+		if (copy_from_user(&attr, argp, sizeof(attr)))
+			goto out;
+
+		r = kvm_ioctl_device_attr(kvm, ioctl, &attr);
+		if (r)
+			goto out;
+
+		r = 0;
+		break;
+	}
 	default:
 		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
 		if (r == -ENOTTY)
-- 
1.7.9.5



* [RFC PATCH 2/6] kvm/ppc: add a notifier chain for vcpu creation/destruction.
From: Scott Wood @ 2013-02-14  5:49 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, Scott Wood

This will be used by the in-kernel MPIC to update its per-vcpu data
structures, and other vcpu init actions may benefit from migrating to
this over fixed initialization.

The notifier itself is kept in the non-arch-specific struct, and
initialized from non-arch-specific code.  I was hoping to make the
entire mechanism non-arch-specific, but vm and vcpu destruction is too
arch-specific for that to happen yet -- there's no hook in
non-arch-code for per-vcpu destruction.  Even just adding it to all
current architectures made me hesitate, as I lack confidence in
understanding what is going on on x86 (why are kvm_arch_vcpu_free and
kvm_arch_vcpu_destroy so different?).
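
For illustration only (not part of the patch), the listener contract
described in the kvm_host.h comment below can be modelled in plain
userspace C.  This is a simplified stand-in for the kernel's notifier
chains, not the kernel API: every model_ name is hypothetical, and the
chain walker here ignores return values on destruction, whereas the
patch relies on listeners never terminating that chain.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_VCPU_DESTROY	0UL	/* "val" for destruction */
#define MODEL_VCPU_CREATE	1UL	/* "val" for creation */

/* Hypothetical listener; a real one would embed a notifier block. */
struct model_listener {
	int (*call)(struct model_listener *self, unsigned long val,
		    void *vcpu);
	struct model_listener *next;
	int live_vcpus;		/* per-listener bookkeeping */
	int fail_create;	/* test knob: refuse new vcpus */
};

/*
 * Walk the chain.  A creation error stops the walk and is returned to
 * the caller; destruction is always delivered to every listener.
 */
static int model_call_chain(struct model_listener *head,
			    unsigned long val, void *vcpu)
{
	struct model_listener *l;

	for (l = head; l; l = l->next) {
		int r = l->call(l, val, vcpu);

		if (val == MODEL_VCPU_CREATE && r < 0)
			return r;
	}

	return 0;
}

/* A listener in the style an in-kernel irqchip might register. */
static int model_irqchip_listener(struct model_listener *self,
				  unsigned long val, void *vcpu)
{
	(void)vcpu;

	if (val == MODEL_VCPU_CREATE) {
		if (self->fail_create)
			return -ENOMEM;
		self->live_vcpus++;
		return 0;
	}

	/* Destroy may arrive without a matching create; tolerate it. */
	if (self->live_vcpus > 0)
		self->live_vcpus--;
	return 0;
}
```

With two listeners chained and the second refusing creation, the first
sees the create and the second never records one; the later destroy
still reaches both, which is why the destroy path must not assume that
its own create succeeded.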

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
 arch/powerpc/kvm/powerpc.c |   19 +++++++++++++++----
 include/linux/kvm_host.h   |   19 +++++++++++++++++++
 virt/kvm/kvm_main.c        |    2 ++
 3 files changed, 36 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 934413c..61989f4 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -440,11 +440,20 @@ void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
 struct kvm_vcpu *kvm_arch_vcpu_create(struct kvm *kvm, unsigned int id)
 {
 	struct kvm_vcpu *vcpu;
+	int ret;
+
 	vcpu = kvmppc_core_vcpu_create(kvm, id);
-	if (!IS_ERR(vcpu)) {
-		vcpu->arch.wqp = &vcpu->wq;
-		kvmppc_create_vcpu_debugfs(vcpu, id);
-	}
+	if (IS_ERR(vcpu))
+		return vcpu;
+
+	vcpu->arch.wqp = &vcpu->wq;
+	kvmppc_create_vcpu_debugfs(vcpu, id);
+
+	ret = blocking_notifier_call_chain(&kvm->vcpu_notifier, 1, vcpu);
+	ret = notifier_to_errno(ret);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
 	return vcpu;
 }
 
@@ -455,6 +464,8 @@ int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
 
 void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
 {
+	blocking_notifier_call_chain(&vcpu->kvm->vcpu_notifier, 0, vcpu);
+
 	/* Make sure we're not using the vcpu anymore */
 	hrtimer_cancel(&vcpu->arch.dec_timer);
 	tasklet_kill(&vcpu->arch.tasklet);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index dbaf012..3d28037 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -406,6 +406,25 @@ struct kvm {
 	long tlbs_dirty;
 	struct kvm_device *devices[KVM_MAX_DEVICES];
 	unsigned int num_devices;
+
+	/*
+	 * notifier pointer is vcpu; "val" is 1 for create, 0 for destroy
+	 *
+	 * Creation notice is after other vcpu init is complete, but before
+	 * the vcpu has been exposed to userspace.
+	 *
+	 * Destruction notice is before other vcpu destruction begins, but
+	 * after the vcpu is no longer able to execute (either the vm is
+	 * being destroyed, or vcpu init failed and was never exposed).
+	 *
+	 * If a listener encounters an error during a creation event that
+	 * precludes a working vcpu, it should terminate the notifier chain
+	 * with an error.  However, destruction notifications should never
+	 * be terminated and destruction listeners should be prepared
+	 * to accept getting called without having seen the creation
+	 * notice.
+	 */
+	struct blocking_notifier_head vcpu_notifier;
 };
 
 #define kvm_err(fmt, ...) \
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index baf8481..dd4c78d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -508,6 +508,8 @@ static struct kvm *kvm_create_vm(unsigned long type)
 	if (r)
 		goto out_err;
 
+	BLOCKING_INIT_NOTIFIER_HEAD(&kvm->vcpu_notifier);
+
 	raw_spin_lock(&kvm_lock);
 	list_add(&kvm->vm_list, &vm_list);
 	raw_spin_unlock(&kvm_lock);
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 261+ messages in thread

* [RFC PATCH 2/6] kvm/ppc: add a notifier chain for vcpu creation/destruction.
@ 2013-02-14  5:49   ` Scott Wood
  0 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-02-14  5:49 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, Scott Wood

This will be used by the in-kernel MPIC to update its per-vcpu data
structures, and other vcpu init actions may benefit from migrating to
this over fixed initialization.

The notifier itself is kept in the non-arch-specific struct, and
initialized from non-arch-specific code.  I was hoping to make the
entire mechanism non-arch-specific, but vm and vcpu destruction is too
arch-specific for that to happen yet -- there's no hook in
non-arch-code for per-vcpu destruction.  Even just adding it to all
current architectures made me hesitate, as I lack confidence in
understanding what is going on on x86 (why are kvm_arch_vcpu_free and
kvm_arch_vcpu_destroy so different?).

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
 arch/powerpc/kvm/powerpc.c |   19 +++++++++++++++----
 include/linux/kvm_host.h   |   19 +++++++++++++++++++
 virt/kvm/kvm_main.c        |    2 ++
 3 files changed, 36 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 934413c..61989f4 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -440,11 +440,20 @@ void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
 struct kvm_vcpu *kvm_arch_vcpu_create(struct kvm *kvm, unsigned int id)
 {
 	struct kvm_vcpu *vcpu;
+	int ret;
+
 	vcpu = kvmppc_core_vcpu_create(kvm, id);
-	if (!IS_ERR(vcpu)) {
-		vcpu->arch.wqp = &vcpu->wq;
-		kvmppc_create_vcpu_debugfs(vcpu, id);
-	}
+	if (IS_ERR(vcpu))
+		return vcpu;
+
+	vcpu->arch.wqp = &vcpu->wq;
+	kvmppc_create_vcpu_debugfs(vcpu, id);
+
+	ret = blocking_notifier_call_chain(&kvm->vcpu_notifier, 1, vcpu);
+	ret = notifier_to_errno(ret);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
 	return vcpu;
 }
 
@@ -455,6 +464,8 @@ int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
 
 void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
 {
+	blocking_notifier_call_chain(&vcpu->kvm->vcpu_notifier, 0, vcpu);
+
 	/* Make sure we're not using the vcpu anymore */
 	hrtimer_cancel(&vcpu->arch.dec_timer);
 	tasklet_kill(&vcpu->arch.tasklet);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index dbaf012..3d28037 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -406,6 +406,25 @@ struct kvm {
 	long tlbs_dirty;
 	struct kvm_device *devices[KVM_MAX_DEVICES];
 	unsigned int num_devices;
+
+	/*
+	 * notifier pointer is vcpu; "val" is 1 for create, 0 for destroy
+	 *
+	 * Creation notice is after other vcpu init is complete, but before
+	 * the vcpu has been exposed to userspace.
+	 *
+	 * Destruction notice is before other vcpu destruction begins, but
+	 * after the vcpu is no longer able to execute (either the vm is
+	 * being destroyed, or vcpu init failed and was never exposed).
+	 *
+	 * If a listener encounters an error during a creation event that
+	 * precludes a working vcpu, it should terminate the notifier chain
+	 * with an error.  However, destruction notifications should never
+	 * be terminated and destruction listeners should be prepared
+	 * to accept getting called without having seen the creation
+	 * notice.
+	 */
+	struct blocking_notifier_head vcpu_notifier;
 };
 
 #define kvm_err(fmt, ...) \
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index baf8481..dd4c78d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -508,6 +508,8 @@ static struct kvm *kvm_create_vm(unsigned long type)
 	if (r)
 		goto out_err;
 
+	BLOCKING_INIT_NOTIFIER_HEAD(&kvm->vcpu_notifier);
+
 	raw_spin_lock(&kvm_lock);
 	list_add(&kvm->vm_list, &vm_list);
 	raw_spin_unlock(&kvm_lock);
-- 
1.7.9.5
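
The vcpu_notifier contract described in the kvm_host.h comment above can be
sketched in plain userspace C. This is only an illustration under stated
assumptions: `struct listener`, `listener_event()`, `call_chain()` and the
`VCPU_CREATE`/`VCPU_DESTROY` names are hypothetical stand-ins, not kernel API,
and the real patch uses blocking_notifier_call_chain() with
notifier_to_errno(). The sketch shows a creation error stopping the chain and
a destroy event being accepted by a listener that never saw the create.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for illustration; not the kernel API. */
enum { VCPU_DESTROY = 0, VCPU_CREATE = 1 };

struct listener {
	int has_state;		/* per-vcpu state set up on create */
	int fail_create;	/* test knob: simulate a setup failure */
};

/* Returns 0 on success, or a negative errno to veto vcpu creation. */
static int listener_event(struct listener *l, unsigned long val)
{
	assert(val == VCPU_CREATE || val == VCPU_DESTROY);

	if (val == VCPU_CREATE) {
		if (l->fail_create)
			return -ENOMEM;
		l->has_state = 1;
		return 0;
	}

	/* Destroy may arrive without a matching create: tolerate it. */
	l->has_state = 0;
	return 0;
}

/* Walk the listeners in order and stop at the first error. */
static int call_chain(struct listener *chain, size_t n, unsigned long val)
{
	for (size_t i = 0; i < n; i++) {
		int ret = listener_event(&chain[i], val);

		if (ret < 0)
			return ret;
	}
	return 0;
}
```

In this sketch a listener never returns an error for a destroy event, which
is how it honours the rule that destruction notifications are never
terminated.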




* [RFC PATCH 3/6] kvm/ppc/mpic: import hw/openpic.c from QEMU
  2013-02-14  5:49 ` Scott Wood
@ 2013-02-14  5:49   ` Scott Wood
From: Scott Wood @ 2013-02-14  5:49 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, Scott Wood

This is QEMU's hw/openpic.c from commit
abd8d4a4d6dfea7ddea72f095f993e1de941614e ("Update version for
1.4.0-rc0"), run through Lindent with no other changes to ease merging
future changes between Linux and QEMU.  Remaining style issues
(including those introduced by Lindent) will be fixed in a later patch.

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
 arch/powerpc/kvm/mpic.c | 1686 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 1686 insertions(+)
 create mode 100644 arch/powerpc/kvm/mpic.c
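
For orientation, the queue scan done by IRQ_check() in the imported file can
be sketched as a standalone function. This is a simplified illustration, not
the imported code: `pick_next_irq()`, `ivpr_priority()` and `DEMO_MAX_IRQ` are
hypothetical names, and a single 64-bit word stands in for the IRQQueue
bitmap. The priority field position (bits 19:16) follows IVPR_PRIORITY_MASK.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MAX_IRQ 64	/* hypothetical bound for this sketch */

/* Extract the 4-bit priority field, bits 19:16 of an IVPR value. */
static int ivpr_priority(uint32_t ivpr)
{
	return (int)((ivpr >> 16) & 0xF);
}

/*
 * Return the pending IRQ with the highest priority, or -1 if none.
 * Bit n of "pending" is set when IRQ n is queued.
 */
static int pick_next_irq(uint64_t pending, const uint32_t *ivpr,
			 int max_irq, int *prio_out)
{
	int next = -1;
	int priority = -1;

	assert(max_irq >= 0 && max_irq <= DEMO_MAX_IRQ);

	for (int irq = 0; irq < max_irq; irq++) {
		if (!(pending & (UINT64_C(1) << irq)))
			continue;
		/* Strict '>' means ties go to the lowest-numbered IRQ. */
		if (ivpr_priority(ivpr[irq]) > priority) {
			next = irq;
			priority = ivpr_priority(ivpr[irq]);
		}
	}

	if (prio_out)
		*prio_out = priority;
	return next;
}
```

Starting the running priority at -1 lets a pending source with priority 0
still be selected, and an empty queue yields -1 for both the IRQ and its
priority.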

diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
new file mode 100644
index 0000000..57655b9
--- /dev/null
+++ b/arch/powerpc/kvm/mpic.c
@@ -0,0 +1,1686 @@
+/*
+ * OpenPIC emulation
+ *
+ * Copyright (c) 2004 Jocelyn Mayer
+ *               2011 Alexander Graf
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+/*
+ *
+ * Based on OpenPic implementations:
+ * - Intel GW80314 I/O companion chip developer's manual
+ * - Motorola MPC8245 & MPC8540 user manuals.
+ * - Motorola MCP750 (aka Raven) programmer manual.
+ * - Motorola Harrier programmer manual
+ *
+ * Serial interrupts, as implemented in Raven chipset are not supported yet.
+ *
+ */
+#include "hw.h"
+#include "ppc/mac.h"
+#include "pci/pci.h"
+#include "openpic.h"
+#include "sysbus.h"
+#include "pci/msi.h"
+#include "qemu/bitops.h"
+#include "ppc.h"
+
+//#define DEBUG_OPENPIC
+
+#ifdef DEBUG_OPENPIC
+static const int debug_openpic = 1;
+#else
+static const int debug_openpic = 0;
+#endif
+
+#define DPRINTF(fmt, ...) do { \
+        if (debug_openpic) { \
+            printf(fmt , ## __VA_ARGS__); \
+        } \
+    } while (0)
+
+#define MAX_CPU     32
+#define MAX_SRC     256
+#define MAX_TMR     4
+#define MAX_IPI     4
+#define MAX_MSI     8
+#define MAX_IRQ     (MAX_SRC + MAX_IPI + MAX_TMR)
+#define VID         0x03	/* MPIC version ID */
+
+/* OpenPIC capability flags */
+#define OPENPIC_FLAG_IDR_CRIT     (1 << 0)
+#define OPENPIC_FLAG_ILR          (2 << 0)
+
+/* OpenPIC address map */
+#define OPENPIC_GLB_REG_START        0x0
+#define OPENPIC_GLB_REG_SIZE         0x10F0
+#define OPENPIC_TMR_REG_START        0x10F0
+#define OPENPIC_TMR_REG_SIZE         0x220
+#define OPENPIC_MSI_REG_START        0x1600
+#define OPENPIC_MSI_REG_SIZE         0x200
+#define OPENPIC_SUMMARY_REG_START   0x3800
+#define OPENPIC_SUMMARY_REG_SIZE    0x800
+#define OPENPIC_SRC_REG_START        0x10000
+#define OPENPIC_SRC_REG_SIZE         (MAX_SRC * 0x20)
+#define OPENPIC_CPU_REG_START        0x20000
+#define OPENPIC_CPU_REG_SIZE         0x100 + ((MAX_CPU - 1) * 0x1000)
+
+/* Raven */
+#define RAVEN_MAX_CPU      2
+#define RAVEN_MAX_EXT     48
+#define RAVEN_MAX_IRQ     64
+#define RAVEN_MAX_TMR      MAX_TMR
+#define RAVEN_MAX_IPI      MAX_IPI
+
+/* Interrupt definitions */
+#define RAVEN_FE_IRQ     (RAVEN_MAX_EXT)	/* Internal functional IRQ */
+#define RAVEN_ERR_IRQ    (RAVEN_MAX_EXT + 1)	/* Error IRQ */
+#define RAVEN_TMR_IRQ    (RAVEN_MAX_EXT + 2)	/* First timer IRQ */
+#define RAVEN_IPI_IRQ    (RAVEN_TMR_IRQ + RAVEN_MAX_TMR)	/* First IPI IRQ */
+/* First doorbell IRQ */
+#define RAVEN_DBL_IRQ    (RAVEN_IPI_IRQ + (RAVEN_MAX_CPU * RAVEN_MAX_IPI))
+
+typedef struct FslMpicInfo {
+	int max_ext;
+} FslMpicInfo;
+
+static FslMpicInfo fsl_mpic_20 = {
+	.max_ext = 12,
+};
+
+static FslMpicInfo fsl_mpic_42 = {
+	.max_ext = 12,
+};
+
+#define FRR_NIRQ_SHIFT    16
+#define FRR_NCPU_SHIFT     8
+#define FRR_VID_SHIFT      0
+
+#define VID_REVISION_1_2   2
+#define VID_REVISION_1_3   3
+
+#define VIR_GENERIC      0x00000000	/* Generic Vendor ID */
+
+#define GCR_RESET        0x80000000
+#define GCR_MODE_PASS    0x00000000
+#define GCR_MODE_MIXED   0x20000000
+#define GCR_MODE_PROXY   0x60000000
+
+#define TBCR_CI           0x80000000	/* count inhibit */
+#define TCCR_TOG          0x80000000	/* toggles when decrement to zero */
+
+#define IDR_EP_SHIFT      31
+#define IDR_EP_MASK       (1 << IDR_EP_SHIFT)
+#define IDR_CI0_SHIFT     30
+#define IDR_CI1_SHIFT     29
+#define IDR_P1_SHIFT      1
+#define IDR_P0_SHIFT      0
+
+#define ILR_INTTGT_MASK   0x000000ff
+#define ILR_INTTGT_INT    0x00
+#define ILR_INTTGT_CINT   0x01	/* critical */
+#define ILR_INTTGT_MCP    0x02	/* machine check */
+
+/* The currently supported INTTGT values happen to be the same as QEMU's
+ * openpic output codes, but don't depend on this.  The output codes
+ * could change (unlikely, but...) or support could be added for
+ * more INTTGT values.
+ */
+static const int inttgt_output[][2] = {
+	{ILR_INTTGT_INT, OPENPIC_OUTPUT_INT},
+	{ILR_INTTGT_CINT, OPENPIC_OUTPUT_CINT},
+	{ILR_INTTGT_MCP, OPENPIC_OUTPUT_MCK},
+};
+
+static int inttgt_to_output(int inttgt)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(inttgt_output); i++) {
+		if (inttgt_output[i][0] == inttgt) {
+			return inttgt_output[i][1];
+		}
+	}
+
+	fprintf(stderr, "%s: unsupported inttgt %d\n", __func__, inttgt);
+	return OPENPIC_OUTPUT_INT;
+}
+
+static int output_to_inttgt(int output)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(inttgt_output); i++) {
+		if (inttgt_output[i][1] == output) {
+			return inttgt_output[i][0];
+		}
+	}
+
+	abort();
+}
+
+#define MSIIR_OFFSET       0x140
+#define MSIIR_SRS_SHIFT    29
+#define MSIIR_SRS_MASK     (0x7 << MSIIR_SRS_SHIFT)
+#define MSIIR_IBS_SHIFT    24
+#define MSIIR_IBS_MASK     (0x1f << MSIIR_IBS_SHIFT)
+
+static int get_current_cpu(void)
+{
+	CPUState *cpu_single_cpu;
+
+	if (!cpu_single_env) {
+		return -1;
+	}
+
+	cpu_single_cpu = ENV_GET_CPU(cpu_single_env);
+	return cpu_single_cpu->cpu_index;
+}
+
+static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx);
+static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
+				       uint32_t val, int idx);
+
+typedef enum IRQType {
+	IRQ_TYPE_NORMAL = 0,
+	IRQ_TYPE_FSLINT,	/* FSL internal interrupt -- level only */
+	IRQ_TYPE_FSLSPECIAL,	/* FSL timer/IPI interrupt, edge, no polarity */
+} IRQType;
+
+typedef struct IRQQueue {
+	/* Round up to the nearest 64 IRQs so that the queue length
+	 * won't change when moving between 32 and 64 bit hosts.
+	 */
+	unsigned long queue[BITS_TO_LONGS((MAX_IRQ + 63) & ~63)];
+	int next;
+	int priority;
+} IRQQueue;
+
+typedef struct IRQSource {
+	uint32_t ivpr;		/* IRQ vector/priority register */
+	uint32_t idr;		/* IRQ destination register */
+	uint32_t destmask;	/* bitmap of CPU destinations */
+	int last_cpu;
+	int output;		/* IRQ level, e.g. OPENPIC_OUTPUT_INT */
+	int pending;		/* TRUE if IRQ is pending */
+	IRQType type;
+	bool level:1;		/* level-triggered */
+	bool nomask:1;		/* critical interrupts ignore mask on some FSL MPICs */
+} IRQSource;
+
+#define IVPR_MASK_SHIFT       31
+#define IVPR_MASK_MASK        (1 << IVPR_MASK_SHIFT)
+#define IVPR_ACTIVITY_SHIFT   30
+#define IVPR_ACTIVITY_MASK    (1 << IVPR_ACTIVITY_SHIFT)
+#define IVPR_MODE_SHIFT       29
+#define IVPR_MODE_MASK        (1 << IVPR_MODE_SHIFT)
+#define IVPR_POLARITY_SHIFT   23
+#define IVPR_POLARITY_MASK    (1 << IVPR_POLARITY_SHIFT)
+#define IVPR_SENSE_SHIFT      22
+#define IVPR_SENSE_MASK       (1 << IVPR_SENSE_SHIFT)
+
+#define IVPR_PRIORITY_MASK     (0xF << 16)
+#define IVPR_PRIORITY(_ivprr_) ((int)(((_ivprr_) & IVPR_PRIORITY_MASK) >> 16))
+#define IVPR_VECTOR(opp, _ivprr_) ((_ivprr_) & (opp)->vector_mask)
+
+/* IDR[EP/CI] are only for FSL MPIC prior to v4.0 */
+#define IDR_EP      0x80000000	/* external pin */
+#define IDR_CI      0x40000000	/* critical interrupt */
+
+typedef struct IRQDest {
+	int32_t ctpr;		/* CPU current task priority */
+	IRQQueue raised;
+	IRQQueue servicing;
+	qemu_irq *irqs;
+
+	/* Count of IRQ sources asserting on non-INT outputs */
+	uint32_t outputs_active[OPENPIC_OUTPUT_NB];
+} IRQDest;
+
+typedef struct OpenPICState {
+	SysBusDevice busdev;
+	MemoryRegion mem;
+
+	/* Behavior control */
+	FslMpicInfo *fsl;
+	uint32_t model;
+	uint32_t flags;
+	uint32_t nb_irqs;
+	uint32_t vid;
+	uint32_t vir;		/* Vendor identification register */
+	uint32_t vector_mask;
+	uint32_t tfrr_reset;
+	uint32_t ivpr_reset;
+	uint32_t idr_reset;
+	uint32_t brr1;
+	uint32_t mpic_mode_mask;
+
+	/* Sub-regions */
+	MemoryRegion sub_io_mem[6];
+
+	/* Global registers */
+	uint32_t frr;		/* Feature reporting register */
+	uint32_t gcr;		/* Global configuration register  */
+	uint32_t pir;		/* Processor initialization register */
+	uint32_t spve;		/* Spurious vector register */
+	uint32_t tfrr;		/* Timer frequency reporting register */
+	/* Source registers */
+	IRQSource src[MAX_IRQ];
+	/* Local registers per output pin */
+	IRQDest dst[MAX_CPU];
+	uint32_t nb_cpus;
+	/* Timer registers */
+	struct {
+		uint32_t tccr;	/* Global timer current count register */
+		uint32_t tbcr;	/* Global timer base count register */
+	} timers[MAX_TMR];
+	/* Shared MSI registers */
+	struct {
+		uint32_t msir;	/* Shared Message Signaled Interrupt Register */
+	} msi[MAX_MSI];
+	uint32_t max_irq;
+	uint32_t irq_ipi0;
+	uint32_t irq_tim0;
+	uint32_t irq_msi;
+} OpenPICState;
+
+static inline void IRQ_setbit(IRQQueue * q, int n_IRQ)
+{
+	set_bit(n_IRQ, q->queue);
+}
+
+static inline void IRQ_resetbit(IRQQueue * q, int n_IRQ)
+{
+	clear_bit(n_IRQ, q->queue);
+}
+
+static inline int IRQ_testbit(IRQQueue * q, int n_IRQ)
+{
+	return test_bit(n_IRQ, q->queue);
+}
+
+static void IRQ_check(OpenPICState * opp, IRQQueue * q)
+{
+	int irq = -1;
+	int next = -1;
+	int priority = -1;
+
+	for (;;) {
+		irq = find_next_bit(q->queue, opp->max_irq, irq + 1);
+		if (irq == opp->max_irq) {
+			break;
+		}
+
+		DPRINTF("IRQ_check: irq %d set ivpr_pr=%d pr=%d\n",
+			irq, IVPR_PRIORITY(opp->src[irq].ivpr), priority);
+
+		if (IVPR_PRIORITY(opp->src[irq].ivpr) > priority) {
+			next = irq;
+			priority = IVPR_PRIORITY(opp->src[irq].ivpr);
+		}
+	}
+
+	q->next = next;
+	q->priority = priority;
+}
+
+static int IRQ_get_next(OpenPICState * opp, IRQQueue * q)
+{
+	/* XXX: optimize */
+	IRQ_check(opp, q);
+
+	return q->next;
+}
+
+static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
+			   bool active, bool was_active)
+{
+	IRQDest *dst;
+	IRQSource *src;
+	int priority;
+
+	dst = &opp->dst[n_CPU];
+	src = &opp->src[n_IRQ];
+
+	DPRINTF("%s: IRQ %d active %d was %d\n",
+		__func__, n_IRQ, active, was_active);
+
+	if (src->output != OPENPIC_OUTPUT_INT) {
+		DPRINTF("%s: output %d irq %d active %d was %d count %d\n",
+			__func__, src->output, n_IRQ, active, was_active,
+			dst->outputs_active[src->output]);
+
+		/* On Freescale MPIC, critical interrupts ignore priority,
+		 * IACK, EOI, etc.  Before MPIC v4.1 they also ignore
+		 * masking.
+		 */
+		if (active) {
+			if (!was_active
+			    && dst->outputs_active[src->output]++ == 0) {
+				DPRINTF
+				    ("%s: Raise OpenPIC output %d cpu %d irq %d\n",
+				     __func__, src->output, n_CPU, n_IRQ);
+				qemu_irq_raise(dst->irqs[src->output]);
+			}
+		} else {
+			if (was_active
+			    && --dst->outputs_active[src->output] == 0) {
+				DPRINTF
+				    ("%s: Lower OpenPIC output %d cpu %d irq %d\n",
+				     __func__, src->output, n_CPU, n_IRQ);
+				qemu_irq_lower(dst->irqs[src->output]);
+			}
+		}
+
+		return;
+	}
+
+	priority = IVPR_PRIORITY(src->ivpr);
+
+	/* Even if the interrupt doesn't have enough priority,
+	 * it is still raised, in case ctpr is lowered later.
+	 */
+	if (active) {
+		IRQ_setbit(&dst->raised, n_IRQ);
+	} else {
+		IRQ_resetbit(&dst->raised, n_IRQ);
+	}
+
+	IRQ_check(opp, &dst->raised);
+
+	if (active && priority <= dst->ctpr) {
+		DPRINTF
+		    ("%s: IRQ %d priority %d too low for ctpr %d on CPU %d\n",
+		     __func__, n_IRQ, priority, dst->ctpr, n_CPU);
+		active = 0;
+	}
+
+	if (active) {
+		if (IRQ_get_next(opp, &dst->servicing) >= 0 &&
+		    priority <= dst->servicing.priority) {
+			DPRINTF
+			    ("%s: IRQ %d is hidden by servicing IRQ %d on CPU %d\n",
+			     __func__, n_IRQ, dst->servicing.next, n_CPU);
+		} else {
+			DPRINTF
+			    ("%s: Raise OpenPIC INT output cpu %d irq %d/%d\n",
+			     __func__, n_CPU, n_IRQ, dst->raised.next);
+			qemu_irq_raise(opp->dst[n_CPU].
+				       irqs[OPENPIC_OUTPUT_INT]);
+		}
+	} else {
+		IRQ_get_next(opp, &dst->servicing);
+		if (dst->raised.priority > dst->ctpr &&
+		    dst->raised.priority > dst->servicing.priority) {
+			DPRINTF
+			    ("%s: IRQ %d inactive, IRQ %d prio %d above %d/%d, CPU %d\n",
+			     __func__, n_IRQ, dst->raised.next,
+			     dst->raised.priority, dst->ctpr,
+			     dst->servicing.priority, n_CPU);
+			/* IRQ line stays asserted */
+		} else {
+			DPRINTF
+			    ("%s: IRQ %d inactive, current prio %d/%d, CPU %d\n",
+			     __func__, n_IRQ, dst->ctpr,
+			     dst->servicing.priority, n_CPU);
+			qemu_irq_lower(opp->dst[n_CPU].
+				       irqs[OPENPIC_OUTPUT_INT]);
+		}
+	}
+}
+
+/* update pic state because registers for n_IRQ have changed value */
+static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
+{
+	IRQSource *src;
+	bool active, was_active;
+	int i;
+
+	src = &opp->src[n_IRQ];
+	active = src->pending;
+
+	if ((src->ivpr & IVPR_MASK_MASK) && !src->nomask) {
+		/* Interrupt source is disabled */
+		DPRINTF("%s: IRQ %d is disabled\n", __func__, n_IRQ);
+		active = false;
+	}
+
+	was_active = ! !(src->ivpr & IVPR_ACTIVITY_MASK);
+
+	/*
+	 * We don't have a similar check for already-active because
+	 * ctpr may have changed and we need to withdraw the interrupt.
+	 */
+	if (!active && !was_active) {
+		DPRINTF("%s: IRQ %d is already inactive\n", __func__, n_IRQ);
+		return;
+	}
+
+	if (active) {
+		src->ivpr |= IVPR_ACTIVITY_MASK;
+	} else {
+		src->ivpr &= ~IVPR_ACTIVITY_MASK;
+	}
+
+	if (src->destmask == 0) {
+		/* No target */
+		DPRINTF("%s: IRQ %d has no target\n", __func__, n_IRQ);
+		return;
+	}
+
+	if (src->destmask == (1 << src->last_cpu)) {
+		/* Only one CPU is allowed to receive this IRQ */
+		IRQ_local_pipe(opp, src->last_cpu, n_IRQ, active, was_active);
+	} else if (!(src->ivpr & IVPR_MODE_MASK)) {
+		/* Directed delivery mode */
+		for (i = 0; i < opp->nb_cpus; i++) {
+			if (src->destmask & (1 << i)) {
+				IRQ_local_pipe(opp, i, n_IRQ, active,
+					       was_active);
+			}
+		}
+	} else {
+		/* Distributed delivery mode */
+		for (i = src->last_cpu + 1; i != src->last_cpu; i++) {
+			if (i == opp->nb_cpus) {
+				i = 0;
+			}
+			if (src->destmask & (1 << i)) {
+				IRQ_local_pipe(opp, i, n_IRQ, active,
+					       was_active);
+				src->last_cpu = i;
+				break;
+			}
+		}
+	}
+}
+
+static void openpic_set_irq(void *opaque, int n_IRQ, int level)
+{
+	OpenPICState *opp = opaque;
+	IRQSource *src;
+
+	if (n_IRQ >= MAX_IRQ) {
+		fprintf(stderr, "%s: IRQ %d out of range\n", __func__, n_IRQ);
+		abort();
+	}
+
+	src = &opp->src[n_IRQ];
+	DPRINTF("openpic: set irq %d = %d ivpr=0x%08x\n",
+		n_IRQ, level, src->ivpr);
+	if (src->level) {
+		/* level-sensitive irq */
+		src->pending = level;
+		openpic_update_irq(opp, n_IRQ);
+	} else {
+		/* edge-sensitive irq */
+		if (level) {
+			src->pending = 1;
+			openpic_update_irq(opp, n_IRQ);
+		}
+
+		if (src->output != OPENPIC_OUTPUT_INT) {
+			/* Edge-triggered interrupts shouldn't be used
+			 * with non-INT delivery, but just in case,
+			 * try to make it do something sane rather than
+			 * cause an interrupt storm.  This is close to
+			 * what you'd probably see happen in real hardware.
+			 */
+			src->pending = 0;
+			openpic_update_irq(opp, n_IRQ);
+		}
+	}
+}
+
+static void openpic_reset(DeviceState * d)
+{
+	OpenPICState *opp = FROM_SYSBUS(typeof(*opp), SYS_BUS_DEVICE(d));
+	int i;
+
+	opp->gcr = GCR_RESET;
+	/* Initialise controller registers */
+	opp->frr = ((opp->nb_irqs - 1) << FRR_NIRQ_SHIFT) |
+	    ((opp->nb_cpus - 1) << FRR_NCPU_SHIFT) |
+	    (opp->vid << FRR_VID_SHIFT);
+
+	opp->pir = 0;
+	opp->spve = -1 & opp->vector_mask;
+	opp->tfrr = opp->tfrr_reset;
+	/* Initialise IRQ sources */
+	for (i = 0; i < opp->max_irq; i++) {
+		opp->src[i].ivpr = opp->ivpr_reset;
+		opp->src[i].idr = opp->idr_reset;
+
+		switch (opp->src[i].type) {
+		case IRQ_TYPE_NORMAL:
+			opp->src[i].level =
+			    ! !(opp->ivpr_reset & IVPR_SENSE_MASK);
+			break;
+
+		case IRQ_TYPE_FSLINT:
+			opp->src[i].ivpr |= IVPR_POLARITY_MASK;
+			break;
+
+		case IRQ_TYPE_FSLSPECIAL:
+			break;
+		}
+	}
+	/* Initialise IRQ destinations */
+	for (i = 0; i < MAX_CPU; i++) {
+		opp->dst[i].ctpr = 15;
+		memset(&opp->dst[i].raised, 0, sizeof(IRQQueue));
+		opp->dst[i].raised.next = -1;
+		memset(&opp->dst[i].servicing, 0, sizeof(IRQQueue));
+		opp->dst[i].servicing.next = -1;
+	}
+	/* Initialise timers */
+	for (i = 0; i < MAX_TMR; i++) {
+		opp->timers[i].tccr = 0;
+		opp->timers[i].tbcr = TBCR_CI;
+	}
+	/* Go out of RESET state */
+	opp->gcr = 0;
+}
+
+static inline uint32_t read_IRQreg_idr(OpenPICState * opp, int n_IRQ)
+{
+	return opp->src[n_IRQ].idr;
+}
+
+static inline uint32_t read_IRQreg_ilr(OpenPICState * opp, int n_IRQ)
+{
+	if (opp->flags & OPENPIC_FLAG_ILR) {
+		return output_to_inttgt(opp->src[n_IRQ].output);
+	}
+
+	return 0xffffffff;
+}
+
+static inline uint32_t read_IRQreg_ivpr(OpenPICState * opp, int n_IRQ)
+{
+	return opp->src[n_IRQ].ivpr;
+}
+
+static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
+{
+	IRQSource *src = &opp->src[n_IRQ];
+	uint32_t normal_mask = (1UL << opp->nb_cpus) - 1;
+	uint32_t crit_mask = 0;
+	uint32_t mask = normal_mask;
+	int crit_shift = IDR_EP_SHIFT - opp->nb_cpus;
+	int i;
+
+	if (opp->flags & OPENPIC_FLAG_IDR_CRIT) {
+		crit_mask = mask << crit_shift;
+		mask |= crit_mask | IDR_EP;
+	}
+
+	src->idr = val & mask;
+	DPRINTF("Set IDR %d to 0x%08x\n", n_IRQ, src->idr);
+
+	if (opp->flags & OPENPIC_FLAG_IDR_CRIT) {
+		if (src->idr & crit_mask) {
+			if (src->idr & normal_mask) {
+				DPRINTF
+				    ("%s: IRQ configured for multiple output types, using "
+				     "critical\n", __func__);
+			}
+
+			src->output = OPENPIC_OUTPUT_CINT;
+			src->nomask = true;
+			src->destmask = 0;
+
+			for (i = 0; i < opp->nb_cpus; i++) {
+				int n_ci = IDR_CI0_SHIFT - i;
+
+				if (src->idr & (1UL << n_ci)) {
+					src->destmask |= 1UL << i;
+				}
+			}
+		} else {
+			src->output = OPENPIC_OUTPUT_INT;
+			src->nomask = false;
+			src->destmask = src->idr & normal_mask;
+		}
+	} else {
+		src->destmask = src->idr;
+	}
+}
+
+static inline void write_IRQreg_ilr(OpenPICState * opp, int n_IRQ, uint32_t val)
+{
+	if (opp->flags & OPENPIC_FLAG_ILR) {
+		IRQSource *src = &opp->src[n_IRQ];
+
+		src->output = inttgt_to_output(val & ILR_INTTGT_MASK);
+		DPRINTF("Set ILR %d to 0x%08x, output %d\n", n_IRQ, src->idr,
+			src->output);
+
+		/* TODO: on MPIC v4.0 only, set nomask for non-INT */
+	}
+}
+
+static inline void write_IRQreg_ivpr(OpenPICState * opp, int n_IRQ,
+				     uint32_t val)
+{
+	uint32_t mask;
+
+	/* NOTE when implementing newer FSL MPIC models: starting with v4.0,
+	 * the polarity bit is read-only on internal interrupts.
+	 */
+	mask = IVPR_MASK_MASK | IVPR_PRIORITY_MASK | IVPR_SENSE_MASK |
+	    IVPR_POLARITY_MASK | opp->vector_mask;
+
+	/* ACTIVITY bit is read-only */
+	opp->src[n_IRQ].ivpr =
+	    (opp->src[n_IRQ].ivpr & IVPR_ACTIVITY_MASK) | (val & mask);
+
+	/* For FSL internal interrupts, The sense bit is reserved and zero,
+	 * and the interrupt is always level-triggered.  Timers and IPIs
+	 * have no sense or polarity bits, and are edge-triggered.
+	 */
+	switch (opp->src[n_IRQ].type) {
+	case IRQ_TYPE_NORMAL:
+		opp->src[n_IRQ].level =
+		    ! !(opp->src[n_IRQ].ivpr & IVPR_SENSE_MASK);
+		break;
+
+	case IRQ_TYPE_FSLINT:
+		opp->src[n_IRQ].ivpr &= ~IVPR_SENSE_MASK;
+		break;
+
+	case IRQ_TYPE_FSLSPECIAL:
+		opp->src[n_IRQ].ivpr &= ~(IVPR_POLARITY_MASK | IVPR_SENSE_MASK);
+		break;
+	}
+
+	openpic_update_irq(opp, n_IRQ);
+	DPRINTF("Set IVPR %d to 0x%08x -> 0x%08x\n", n_IRQ, val,
+		opp->src[n_IRQ].ivpr);
+}
+
+static void openpic_gcr_write(OpenPICState * opp, uint64_t val)
+{
+	bool mpic_proxy = false;
+
+	if (val & GCR_RESET) {
+		openpic_reset(&opp->busdev.qdev);
+		return;
+	}
+
+	opp->gcr &= ~opp->mpic_mode_mask;
+	opp->gcr |= val & opp->mpic_mode_mask;
+
+	/* Set external proxy mode */
+	if ((val & opp->mpic_mode_mask) == GCR_MODE_PROXY) {
+		mpic_proxy = true;
+	}
+
+	ppce500_set_mpic_proxy(mpic_proxy);
+}
+
+static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned len)
+{
+	OpenPICState *opp = opaque;
+	IRQDest *dst;
+	int idx;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+		__func__, addr, val);
+	if (addr & 0xF) {
+		return;
+	}
+	switch (addr) {
+	case 0x00:		/* Block Revision Register1 (BRR1) is Readonly */
+		break;
+	case 0x40:
+	case 0x50:
+	case 0x60:
+	case 0x70:
+	case 0x80:
+	case 0x90:
+	case 0xA0:
+	case 0xB0:
+		openpic_cpu_write_internal(opp, addr, val, get_current_cpu());
+		break;
+	case 0x1000:		/* FRR */
+		break;
+	case 0x1020:		/* GCR */
+		openpic_gcr_write(opp, val);
+		break;
+	case 0x1080:		/* VIR */
+		break;
+	case 0x1090:		/* PIR */
+		for (idx = 0; idx < opp->nb_cpus; idx++) {
+			if ((val & (1 << idx)) && !(opp->pir & (1 << idx))) {
+				DPRINTF
+				    ("Raise OpenPIC RESET output for CPU %d\n",
+				     idx);
+				dst = &opp->dst[idx];
+				qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_RESET]);
+			} else if (!(val & (1 << idx))
+				   && (opp->pir & (1 << idx))) {
+				DPRINTF
+				    ("Lower OpenPIC RESET output for CPU %d\n",
+				     idx);
+				dst = &opp->dst[idx];
+				qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_RESET]);
+			}
+		}
+		opp->pir = val;
+		break;
+	case 0x10A0:		/* IPI_IVPR */
+	case 0x10B0:
+	case 0x10C0:
+	case 0x10D0:
+		{
+			int idx;
+			idx = (addr - 0x10A0) >> 4;
+			write_IRQreg_ivpr(opp, opp->irq_ipi0 + idx, val);
+		}
+		break;
+	case 0x10E0:		/* SPVE */
+		opp->spve = val & opp->vector_mask;
+		break;
+	default:
+		break;
+	}
+}
+
+static uint64_t openpic_gbl_read(void *opaque, hwaddr addr, unsigned len)
+{
+	OpenPICState *opp = opaque;
+	uint32_t retval;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	retval = 0xFFFFFFFF;
+	if (addr & 0xF) {
+		return retval;
+	}
+	switch (addr) {
+	case 0x1000:		/* FRR */
+		retval = opp->frr;
+		break;
+	case 0x1020:		/* GCR */
+		retval = opp->gcr;
+		break;
+	case 0x1080:		/* VIR */
+		retval = opp->vir;
+		break;
+	case 0x1090:		/* PIR */
+		retval = 0x00000000;
+		break;
+	case 0x00:		/* Block Revision Register1 (BRR1) */
+		retval = opp->brr1;
+		break;
+	case 0x40:
+	case 0x50:
+	case 0x60:
+	case 0x70:
+	case 0x80:
+	case 0x90:
+	case 0xA0:
+	case 0xB0:
+		retval =
+		    openpic_cpu_read_internal(opp, addr, get_current_cpu());
+		break;
+	case 0x10A0:		/* IPI_IVPR */
+	case 0x10B0:
+	case 0x10C0:
+	case 0x10D0:
+		{
+			int idx;
+			idx = (addr - 0x10A0) >> 4;
+			retval = read_IRQreg_ivpr(opp, opp->irq_ipi0 + idx);
+		}
+		break;
+	case 0x10E0:		/* SPVE */
+		retval = opp->spve;
+		break;
+	default:
+		break;
+	}
+	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+
+	return retval;
+}
+
+static void openpic_tmr_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned len)
+{
+	OpenPICState *opp = opaque;
+	int idx;
+
+	addr += 0x10f0;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+		__func__, addr, val);
+	if (addr & 0xF) {
+		return;
+	}
+
+	if (addr == 0x10f0) {
+		/* TFRR */
+		opp->tfrr = val;
+		return;
+	}
+
+	idx = (addr >> 6) & 0x3;
+	addr = addr & 0x30;
+
+	switch (addr & 0x30) {
+	case 0x00:		/* TCCR */
+		break;
+	case 0x10:		/* TBCR */
+		if ((opp->timers[idx].tccr & TCCR_TOG) != 0 &&
+		    (val & TBCR_CI) == 0 &&
+		    (opp->timers[idx].tbcr & TBCR_CI) != 0) {
+			opp->timers[idx].tccr &= ~TCCR_TOG;
+		}
+		opp->timers[idx].tbcr = val;
+		break;
+	case 0x20:		/* TVPR */
+		write_IRQreg_ivpr(opp, opp->irq_tim0 + idx, val);
+		break;
+	case 0x30:		/* TDR */
+		write_IRQreg_idr(opp, opp->irq_tim0 + idx, val);
+		break;
+	}
+}
+
+static uint64_t openpic_tmr_read(void *opaque, hwaddr addr, unsigned len)
+{
+	OpenPICState *opp = opaque;
+	uint32_t retval = -1;
+	int idx;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	if (addr & 0xF) {
+		goto out;
+	}
+	idx = (addr >> 6) & 0x3;
+	if (addr == 0x0) {
+		/* TFRR */
+		retval = opp->tfrr;
+		goto out;
+	}
+	switch (addr & 0x30) {
+	case 0x00:		/* TCCR */
+		retval = opp->timers[idx].tccr;
+		break;
+	case 0x10:		/* TBCR */
+		retval = opp->timers[idx].tbcr;
+		break;
+	case 0x20:		/* TIPV */
+		retval = read_IRQreg_ivpr(opp, opp->irq_tim0 + idx);
+		break;
+	case 0x30:		/* TIDE (TIDR) */
+		retval = read_IRQreg_idr(opp, opp->irq_tim0 + idx);
+		break;
+	}
+
+out:
+	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+
+	return retval;
+}
+
+static void openpic_src_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned len)
+{
+	OpenPICState *opp = opaque;
+	int idx;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+		__func__, addr, val);
+
+	addr = addr & 0xffff;
+	idx = addr >> 5;
+
+	switch (addr & 0x1f) {
+	case 0x00:
+		write_IRQreg_ivpr(opp, idx, val);
+		break;
+	case 0x10:
+		write_IRQreg_idr(opp, idx, val);
+		break;
+	case 0x18:
+		write_IRQreg_ilr(opp, idx, val);
+		break;
+	}
+}
+
+static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
+{
+	OpenPICState *opp = opaque;
+	uint32_t retval;
+	int idx;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	retval = 0xFFFFFFFF;
+
+	addr = addr & 0xffff;
+	idx = addr >> 5;
+
+	switch (addr & 0x1f) {
+	case 0x00:
+		retval = read_IRQreg_ivpr(opp, idx);
+		break;
+	case 0x10:
+		retval = read_IRQreg_idr(opp, idx);
+		break;
+	case 0x18:
+		retval = read_IRQreg_ilr(opp, idx);
+		break;
+	}
+
+	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	return retval;
+}
+
+static void openpic_msi_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned size)
+{
+	OpenPICState *opp = opaque;
+	int idx = opp->irq_msi;
+	int srs, ibs;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
+		__func__, addr, val);
+	if (addr & 0xF) {
+		return;
+	}
+
+	switch (addr) {
+	case MSIIR_OFFSET:
+		srs = val >> MSIIR_SRS_SHIFT;
+		idx += srs;
+		ibs = (val & MSIIR_IBS_MASK) >> MSIIR_IBS_SHIFT;
+		opp->msi[srs].msir |= 1 << ibs;
+		openpic_set_irq(opp, idx, 1);
+		break;
+	default:
+		/* most registers are read-only, thus ignored */
+		break;
+	}
+}
+
+static uint64_t openpic_msi_read(void *opaque, hwaddr addr, unsigned size)
+{
+	OpenPICState *opp = opaque;
+	uint64_t r = 0;
+	int i, srs;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	if (addr & 0xF) {
+		return -1;
+	}
+
+	srs = addr >> 4;
+
+	switch (addr) {
+	case 0x00:
+	case 0x10:
+	case 0x20:
+	case 0x30:
+	case 0x40:
+	case 0x50:
+	case 0x60:
+	case 0x70:		/* MSIRs */
+		r = opp->msi[srs].msir;
+		/* Clear on read */
+		opp->msi[srs].msir = 0;
+		openpic_set_irq(opp, opp->irq_msi + srs, 0);
+		break;
+	case 0x120:		/* MSISR */
+		for (i = 0; i < MAX_MSI; i++) {
+			r |= (opp->msi[i].msir ? 1 : 0) << i;
+		}
+		break;
+	}
+
+	return r;
+}
+
+static uint64_t openpic_summary_read(void *opaque, hwaddr addr, unsigned size)
+{
+	uint64_t r = 0;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+
+	/* TODO: EISR/EIMR */
+
+	return r;
+}
+
+static void openpic_summary_write(void *opaque, hwaddr addr, uint64_t val,
+				  unsigned size)
+{
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
+		__func__, addr, val);
+
+	/* TODO: EISR/EIMR */
+}
+
+static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
+				       uint32_t val, int idx)
+{
+	OpenPICState *opp = opaque;
+	IRQSource *src;
+	IRQDest *dst;
+	int s_IRQ, n_IRQ;
+
+	DPRINTF("%s: cpu %d addr %#" HWADDR_PRIx " <= 0x%08x\n", __func__, idx,
+		addr, val);
+
+	if (idx < 0) {
+		return;
+	}
+
+	if (addr & 0xF) {
+		return;
+	}
+	dst = &opp->dst[idx];
+	addr &= 0xFF0;
+	switch (addr) {
+	case 0x40:		/* IPIDR */
+	case 0x50:
+	case 0x60:
+	case 0x70:
+		idx = (addr - 0x40) >> 4;
+		/* we use IDE as mask which CPUs to deliver the IPI to still. */
+		opp->src[opp->irq_ipi0 + idx].destmask |= val;
+		openpic_set_irq(opp, opp->irq_ipi0 + idx, 1);
+		openpic_set_irq(opp, opp->irq_ipi0 + idx, 0);
+		break;
+	case 0x80:		/* CTPR */
+		dst->ctpr = val & 0x0000000F;
+
+		DPRINTF("%s: set CPU %d ctpr to %d, raised %d servicing %d\n",
+			__func__, idx, dst->ctpr, dst->raised.priority,
+			dst->servicing.priority);
+
+		if (dst->raised.priority <= dst->ctpr) {
+			DPRINTF
+			    ("%s: Lower OpenPIC INT output cpu %d due to ctpr\n",
+			     __func__, idx);
+			qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
+		} else if (dst->raised.priority > dst->servicing.priority) {
+			DPRINTF("%s: Raise OpenPIC INT output cpu %d irq %d\n",
+				__func__, idx, dst->raised.next);
+			qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_INT]);
+		}
+
+		break;
+	case 0x90:		/* WHOAMI */
+		/* Read-only register */
+		break;
+	case 0xA0:		/* IACK */
+		/* Read-only register */
+		break;
+	case 0xB0:		/* EOI */
+		DPRINTF("EOI\n");
+		s_IRQ = IRQ_get_next(opp, &dst->servicing);
+
+		if (s_IRQ < 0) {
+			DPRINTF("%s: EOI with no interrupt in service\n",
+				__func__);
+			break;
+		}
+
+		IRQ_resetbit(&dst->servicing, s_IRQ);
+		/* Set up next servicing IRQ */
+		s_IRQ = IRQ_get_next(opp, &dst->servicing);
+		/* Check queued interrupts. */
+		n_IRQ = IRQ_get_next(opp, &dst->raised);
+		src = &opp->src[n_IRQ];
+		if (n_IRQ != -1 &&
+		    (s_IRQ == -1 ||
+		     IVPR_PRIORITY(src->ivpr) > dst->servicing.priority)) {
+			DPRINTF("Raise OpenPIC INT output cpu %d irq %d\n",
+				idx, n_IRQ);
+			qemu_irq_raise(opp->dst[idx].irqs[OPENPIC_OUTPUT_INT]);
+		}
+		break;
+	default:
+		break;
+	}
+}
+
+static void openpic_cpu_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned len)
+{
+	openpic_cpu_write_internal(opaque, addr, val, (addr & 0x1f000) >> 12);
+}
+
+static uint32_t openpic_iack(OpenPICState * opp, IRQDest * dst, int cpu)
+{
+	IRQSource *src;
+	int retval, irq;
+
+	DPRINTF("Lower OpenPIC INT output\n");
+	qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
+
+	irq = IRQ_get_next(opp, &dst->raised);
+	DPRINTF("IACK: irq=%d\n", irq);
+
+	if (irq == -1) {
+		/* No more interrupt pending */
+		return opp->spve;
+	}
+
+	src = &opp->src[irq];
+	if (!(src->ivpr & IVPR_ACTIVITY_MASK) ||
+	    !(IVPR_PRIORITY(src->ivpr) > dst->ctpr)) {
+		fprintf(stderr, "%s: bad raised IRQ %d ctpr %d ivpr 0x%08x\n",
+			__func__, irq, dst->ctpr, src->ivpr);
+		openpic_update_irq(opp, irq);
+		retval = opp->spve;
+	} else {
+		/* IRQ enter servicing state */
+		IRQ_setbit(&dst->servicing, irq);
+		retval = IVPR_VECTOR(opp, src->ivpr);
+	}
+
+	if (!src->level) {
+		/* edge-sensitive IRQ */
+		src->ivpr &= ~IVPR_ACTIVITY_MASK;
+		src->pending = 0;
+		IRQ_resetbit(&dst->raised, irq);
+	}
+
+	if ((irq >= opp->irq_ipi0) && (irq < (opp->irq_ipi0 + MAX_IPI))) {
+		src->destmask &= ~(1 << cpu);
+		if (src->destmask && !src->level) {
+			/* trigger on CPUs that didn't know about it yet */
+			openpic_set_irq(opp, irq, 1);
+			openpic_set_irq(opp, irq, 0);
+			/* if all CPUs knew about it, set active bit again */
+			src->ivpr |= IVPR_ACTIVITY_MASK;
+		}
+	}
+
+	return retval;
+}
+
+static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx)
+{
+	OpenPICState *opp = opaque;
+	IRQDest *dst;
+	uint32_t retval;
+
+	DPRINTF("%s: cpu %d addr %#" HWADDR_PRIx "\n", __func__, idx, addr);
+	retval = 0xFFFFFFFF;
+
+	if (idx < 0) {
+		return retval;
+	}
+
+	if (addr & 0xF) {
+		return retval;
+	}
+	dst = &opp->dst[idx];
+	addr &= 0xFF0;
+	switch (addr) {
+	case 0x80:		/* CTPR */
+		retval = dst->ctpr;
+		break;
+	case 0x90:		/* WHOAMI */
+		retval = idx;
+		break;
+	case 0xA0:		/* IACK */
+		retval = openpic_iack(opp, dst, idx);
+		break;
+	case 0xB0:		/* EOI */
+		retval = 0;
+		break;
+	default:
+		break;
+	}
+	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+
+	return retval;
+}
+
+static uint64_t openpic_cpu_read(void *opaque, hwaddr addr, unsigned len)
+{
+	return openpic_cpu_read_internal(opaque, addr, (addr & 0x1f000) >> 12);
+}
+
+static const MemoryRegionOps openpic_glb_ops_le = {
+	.write = openpic_gbl_write,
+	.read = openpic_gbl_read,
+	.endianness = DEVICE_LITTLE_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_glb_ops_be = {
+	.write = openpic_gbl_write,
+	.read = openpic_gbl_read,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_tmr_ops_le = {
+	.write = openpic_tmr_write,
+	.read = openpic_tmr_read,
+	.endianness = DEVICE_LITTLE_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_tmr_ops_be = {
+	.write = openpic_tmr_write,
+	.read = openpic_tmr_read,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_cpu_ops_le = {
+	.write = openpic_cpu_write,
+	.read = openpic_cpu_read,
+	.endianness = DEVICE_LITTLE_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_cpu_ops_be = {
+	.write = openpic_cpu_write,
+	.read = openpic_cpu_read,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_src_ops_le = {
+	.write = openpic_src_write,
+	.read = openpic_src_read,
+	.endianness = DEVICE_LITTLE_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_src_ops_be = {
+	.write = openpic_src_write,
+	.read = openpic_src_read,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_msi_ops_be = {
+	.read = openpic_msi_read,
+	.write = openpic_msi_write,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_summary_ops_be = {
+	.read = openpic_summary_read,
+	.write = openpic_summary_write,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static void openpic_save_IRQ_queue(QEMUFile * f, IRQQueue * q)
+{
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(q->queue); i++) {
+		/* Always put the lower half of a 64-bit long first, in case we
+		 * restore on a 32-bit host.  The least significant bits correspond
+		 * to lower IRQ numbers in the bitmap.
+		 */
+		qemu_put_be32(f, (uint32_t) q->queue[i]);
+#if LONG_MAX > 0x7FFFFFFF
+		qemu_put_be32(f, (uint32_t) (q->queue[i] >> 32));
+#endif
+	}
+
+	qemu_put_sbe32s(f, &q->next);
+	qemu_put_sbe32s(f, &q->priority);
+}
+
+static void openpic_save(QEMUFile * f, void *opaque)
+{
+	OpenPICState *opp = (OpenPICState *) opaque;
+	unsigned int i;
+
+	qemu_put_be32s(f, &opp->gcr);
+	qemu_put_be32s(f, &opp->vir);
+	qemu_put_be32s(f, &opp->pir);
+	qemu_put_be32s(f, &opp->spve);
+	qemu_put_be32s(f, &opp->tfrr);
+
+	qemu_put_be32s(f, &opp->nb_cpus);
+
+	for (i = 0; i < opp->nb_cpus; i++) {
+		qemu_put_sbe32s(f, &opp->dst[i].ctpr);
+		openpic_save_IRQ_queue(f, &opp->dst[i].raised);
+		openpic_save_IRQ_queue(f, &opp->dst[i].servicing);
+		qemu_put_buffer(f, (uint8_t *) & opp->dst[i].outputs_active,
+				sizeof(opp->dst[i].outputs_active));
+	}
+
+	for (i = 0; i < MAX_TMR; i++) {
+		qemu_put_be32s(f, &opp->timers[i].tccr);
+		qemu_put_be32s(f, &opp->timers[i].tbcr);
+	}
+
+	for (i = 0; i < opp->max_irq; i++) {
+		qemu_put_be32s(f, &opp->src[i].ivpr);
+		qemu_put_be32s(f, &opp->src[i].idr);
+		qemu_put_be32s(f, &opp->src[i].destmask);
+		qemu_put_sbe32s(f, &opp->src[i].last_cpu);
+		qemu_put_sbe32s(f, &opp->src[i].pending);
+	}
+}
+
+static void openpic_load_IRQ_queue(QEMUFile * f, IRQQueue * q)
+{
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(q->queue); i++) {
+		unsigned long val;
+
+		val = qemu_get_be32(f);
+#if LONG_MAX > 0x7FFFFFFF
+		/* The lower half was saved first; the upper half follows. */
+		val |= (unsigned long)qemu_get_be32(f) << 32;
+#endif
+
+		q->queue[i] = val;
+	}
+
+	qemu_get_sbe32s(f, &q->next);
+	qemu_get_sbe32s(f, &q->priority);
+}
+
+static int openpic_load(QEMUFile * f, void *opaque, int version_id)
+{
+	OpenPICState *opp = (OpenPICState *) opaque;
+	unsigned int i;
+
+	if (version_id != 1) {
+		return -EINVAL;
+	}
+
+	qemu_get_be32s(f, &opp->gcr);
+	qemu_get_be32s(f, &opp->vir);
+	qemu_get_be32s(f, &opp->pir);
+	qemu_get_be32s(f, &opp->spve);
+	qemu_get_be32s(f, &opp->tfrr);
+
+	qemu_get_be32s(f, &opp->nb_cpus);
+
+	for (i = 0; i < opp->nb_cpus; i++) {
+		qemu_get_sbe32s(f, &opp->dst[i].ctpr);
+		openpic_load_IRQ_queue(f, &opp->dst[i].raised);
+		openpic_load_IRQ_queue(f, &opp->dst[i].servicing);
+		qemu_get_buffer(f, (uint8_t *) & opp->dst[i].outputs_active,
+				sizeof(opp->dst[i].outputs_active));
+	}
+
+	for (i = 0; i < MAX_TMR; i++) {
+		qemu_get_be32s(f, &opp->timers[i].tccr);
+		qemu_get_be32s(f, &opp->timers[i].tbcr);
+	}
+
+	for (i = 0; i < opp->max_irq; i++) {
+		uint32_t val;
+
+		val = qemu_get_be32(f);
+		write_IRQreg_idr(opp, i, val);
+		val = qemu_get_be32(f);
+		write_IRQreg_ivpr(opp, i, val);
+
+		qemu_get_be32s(f, &opp->src[i].ivpr);
+		qemu_get_be32s(f, &opp->src[i].idr);
+		qemu_get_be32s(f, &opp->src[i].destmask);
+		qemu_get_sbe32s(f, &opp->src[i].last_cpu);
+		qemu_get_sbe32s(f, &opp->src[i].pending);
+	}
+
+	return 0;
+}
+
+typedef struct MemReg {
+	const char *name;
+	MemoryRegionOps const *ops;
+	hwaddr start_addr;
+	ram_addr_t size;
+} MemReg;
+
+static void fsl_common_init(OpenPICState * opp)
+{
+	int i;
+	int virq = MAX_SRC;
+
+	opp->vid = VID_REVISION_1_2;
+	opp->vir = VIR_GENERIC;
+	opp->vector_mask = 0xFFFF;
+	opp->tfrr_reset = 0;
+	opp->ivpr_reset = IVPR_MASK_MASK;
+	opp->idr_reset = 1 << 0;
+	opp->max_irq = MAX_IRQ;
+
+	opp->irq_ipi0 = virq;
+	virq += MAX_IPI;
+	opp->irq_tim0 = virq;
+	virq += MAX_TMR;
+
+	assert(virq <= MAX_IRQ);
+
+	opp->irq_msi = 224;
+
+	msi_supported = true;
+	for (i = 0; i < opp->fsl->max_ext; i++) {
+		opp->src[i].level = false;
+	}
+
+	/* Internal interrupts, including message and MSI */
+	for (i = 16; i < MAX_SRC; i++) {
+		opp->src[i].type = IRQ_TYPE_FSLINT;
+		opp->src[i].level = true;
+	}
+
+	/* timers and IPIs */
+	for (i = MAX_SRC; i < virq; i++) {
+		opp->src[i].type = IRQ_TYPE_FSLSPECIAL;
+		opp->src[i].level = false;
+	}
+}
+
+static void map_list(OpenPICState * opp, const MemReg * list, int *count)
+{
+	while (list->name) {
+		assert(*count < ARRAY_SIZE(opp->sub_io_mem));
+
+		memory_region_init_io(&opp->sub_io_mem[*count], list->ops, opp,
+				      list->name, list->size);
+
+		memory_region_add_subregion(&opp->mem, list->start_addr,
+					    &opp->sub_io_mem[*count]);
+
+		(*count)++;
+		list++;
+	}
+}
+
+static int openpic_init(SysBusDevice * dev)
+{
+	OpenPICState *opp = FROM_SYSBUS(typeof(*opp), dev);
+	int i, j;
+	int list_count = 0;
+	static const MemReg list_le[] = {
+		{"glb", &openpic_glb_ops_le,
+		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
+		{"tmr", &openpic_tmr_ops_le,
+		 OPENPIC_TMR_REG_START, OPENPIC_TMR_REG_SIZE},
+		{"src", &openpic_src_ops_le,
+		 OPENPIC_SRC_REG_START, OPENPIC_SRC_REG_SIZE},
+		{"cpu", &openpic_cpu_ops_le,
+		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
+		{NULL}
+	};
+	static const MemReg list_be[] = {
+		{"glb", &openpic_glb_ops_be,
+		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
+		{"tmr", &openpic_tmr_ops_be,
+		 OPENPIC_TMR_REG_START, OPENPIC_TMR_REG_SIZE},
+		{"src", &openpic_src_ops_be,
+		 OPENPIC_SRC_REG_START, OPENPIC_SRC_REG_SIZE},
+		{"cpu", &openpic_cpu_ops_be,
+		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
+		{NULL}
+	};
+	static const MemReg list_fsl[] = {
+		{"msi", &openpic_msi_ops_be,
+		 OPENPIC_MSI_REG_START, OPENPIC_MSI_REG_SIZE},
+		{"summary", &openpic_summary_ops_be,
+		 OPENPIC_SUMMARY_REG_START, OPENPIC_SUMMARY_REG_SIZE},
+		{NULL}
+	};
+
+	memory_region_init(&opp->mem, "openpic", 0x40000);
+
+	switch (opp->model) {
+	case OPENPIC_MODEL_FSL_MPIC_20:
+	default:
+		opp->fsl = &fsl_mpic_20;
+		opp->brr1 = 0x00400200;
+		opp->flags |= OPENPIC_FLAG_IDR_CRIT;
+		opp->nb_irqs = 80;
+		opp->mpic_mode_mask = GCR_MODE_MIXED;
+
+		fsl_common_init(opp);
+		map_list(opp, list_be, &list_count);
+		map_list(opp, list_fsl, &list_count);
+
+		break;
+
+	case OPENPIC_MODEL_FSL_MPIC_42:
+		opp->fsl = &fsl_mpic_42;
+		opp->brr1 = 0x00400402;
+		opp->flags |= OPENPIC_FLAG_ILR;
+		opp->nb_irqs = 196;
+		opp->mpic_mode_mask = GCR_MODE_PROXY;
+
+		fsl_common_init(opp);
+		map_list(opp, list_be, &list_count);
+		map_list(opp, list_fsl, &list_count);
+
+		break;
+
+	case OPENPIC_MODEL_RAVEN:
+		opp->nb_irqs = RAVEN_MAX_EXT;
+		opp->vid = VID_REVISION_1_3;
+		opp->vir = VIR_GENERIC;
+		opp->vector_mask = 0xFF;
+		opp->tfrr_reset = 4160000;
+		opp->ivpr_reset = IVPR_MASK_MASK | IVPR_MODE_MASK;
+		opp->idr_reset = 0;
+		opp->max_irq = RAVEN_MAX_IRQ;
+		opp->irq_ipi0 = RAVEN_IPI_IRQ;
+		opp->irq_tim0 = RAVEN_TMR_IRQ;
+		opp->brr1 = -1;
+		opp->mpic_mode_mask = GCR_MODE_MIXED;
+
+		/* Only UP supported today */
+		if (opp->nb_cpus != 1) {
+			return -EINVAL;
+		}
+
+		map_list(opp, list_le, &list_count);
+		break;
+	}
+
+	for (i = 0; i < opp->nb_cpus; i++) {
+		opp->dst[i].irqs = g_new(qemu_irq, OPENPIC_OUTPUT_NB);
+		for (j = 0; j < OPENPIC_OUTPUT_NB; j++) {
+			sysbus_init_irq(dev, &opp->dst[i].irqs[j]);
+		}
+	}
+
+	register_savevm(&opp->busdev.qdev, "openpic", 0, 2,
+			openpic_save, openpic_load, opp);
+
+	sysbus_init_mmio(dev, &opp->mem);
+	qdev_init_gpio_in(&dev->qdev, openpic_set_irq, opp->max_irq);
+
+	return 0;
+}
+
+static Property openpic_properties[] = {
+	DEFINE_PROP_UINT32("model", OpenPICState, model,
+			   OPENPIC_MODEL_FSL_MPIC_20),
+	DEFINE_PROP_UINT32("nb_cpus", OpenPICState, nb_cpus, 1),
+	DEFINE_PROP_END_OF_LIST(),
+};
+
+static void openpic_class_init(ObjectClass * klass, void *data)
+{
+	DeviceClass *dc = DEVICE_CLASS(klass);
+	SysBusDeviceClass *k = SYS_BUS_DEVICE_CLASS(klass);
+
+	k->init = openpic_init;
+	dc->props = openpic_properties;
+	dc->reset = openpic_reset;
+}
+
+static const TypeInfo openpic_info = {
+	.name = "openpic",
+	.parent = TYPE_SYS_BUS_DEVICE,
+	.instance_size = sizeof(OpenPICState),
+	.class_init = openpic_class_init,
+};
+
+static void openpic_register_types(void)
+{
+	type_register_static(&openpic_info);
+}
+
+type_init(openpic_register_types)
-- 
1.7.9.5
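
A note on the queue serialization above: the comment in
openpic_save_IRQ_queue() says each bitmap word is written as two 32-bit
values, lower half first, so that the stream layout does not depend on the
host's word size. The following is a minimal standalone sketch of that
split and the matching join. The helper names split_word() and join_word()
are made up for illustration and are not part of the patch or of QEMU.

```c
#include <stdint.h>

/* Hypothetical helper: split one 64-bit bitmap word into the two
 * 32-bit values that would be written to the stream, lower half first.
 */
static void split_word(uint64_t word, uint32_t *lo, uint32_t *hi)
{
	*lo = (uint32_t)word;
	*hi = (uint32_t)(word >> 32);
}

/* Hypothetical helper: rebuild the word from the two halves, taking
 * them in the same order in which split_word() produced them.
 */
static uint64_t join_word(uint32_t lo, uint32_t hi)
{
	return (uint64_t)lo | ((uint64_t)hi << 32);
}
```

A reader that takes the first value as the lower half and the second as the
upper half gets the original word back; swapping the two on the load side
would move low-numbered IRQs into the upper half of the bitmap.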




* [RFC PATCH 3/6] kvm/ppc/mpic: import hw/openpic.c from QEMU
@ 2013-02-14  5:49   ` Scott Wood
  0 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-02-14  5:49 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, Scott Wood

This is QEMU's hw/openpic.c from commit
abd8d4a4d6dfea7ddea72f095f993e1de941614e ("Update version for
1.4.0-rc0"), run through Lindent with no other changes to ease merging
future changes between Linux and QEMU.  Remaining style issues
(including those introduced by Lindent) will be fixed in a later patch.

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
 arch/powerpc/kvm/mpic.c | 1686 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 1686 insertions(+)
 create mode 100644 arch/powerpc/kvm/mpic.c
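
For readers new to the imported code: IRQ_check() below walks the pending
bitmap and remembers the set bit whose source has the highest IVPR priority,
using a strict "greater than" comparison. The following is a rough
standalone sketch of that selection only, assuming a single 64-bit bitmap
and a plain priority array. The names sketch_pick, sketch_pick_irq() and
SKETCH_NIRQ are invented for this note and do not appear in the patch.

```c
#include <stdint.h>

#define SKETCH_NIRQ 64		/* assumption: one 64-bit word of IRQs */

struct sketch_pick {
	int next;		/* selected IRQ, or -1 if none pending */
	int priority;		/* its priority, or -1 if none pending */
};

/* Scan the pending bits in ascending IRQ order and keep the one with
 * the highest priority.  The comparison is strict, so on a tie the
 * lowest-numbered IRQ is kept.
 */
static struct sketch_pick sketch_pick_irq(uint64_t pending,
					  const int prio[SKETCH_NIRQ])
{
	struct sketch_pick p = { -1, -1 };
	int irq;

	for (irq = 0; irq < SKETCH_NIRQ; irq++) {
		if (!(pending & (UINT64_C(1) << irq)))
			continue;

		if (prio[irq] > p.priority) {
			p.next = irq;
			p.priority = prio[irq];
		}
	}

	return p;
}
```

With IRQs 3, 7 and 9 pending at priorities 5, 9 and 9, the sketch selects
IRQ 7; with nothing pending it reports -1 for both fields, which is the
state the real code stores in q->next and q->priority.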

diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
new file mode 100644
index 0000000..57655b9
--- /dev/null
+++ b/arch/powerpc/kvm/mpic.c
@@ -0,0 +1,1686 @@
+/*
+ * OpenPIC emulation
+ *
+ * Copyright (c) 2004 Jocelyn Mayer
+ *               2011 Alexander Graf
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+/*
+ *
+ * Based on OpenPic implementations:
+ * - Intel GW80314 I/O companion chip developer's manual
+ * - Motorola MPC8245 & MPC8540 user manuals.
+ * - Motorola MCP750 (aka Raven) programmer manual.
+ * - Motorola Harrier programmer manual
+ *
+ * Serial interrupts, as implemented in Raven chipset are not supported yet.
+ *
+ */
+#include "hw.h"
+#include "ppc/mac.h"
+#include "pci/pci.h"
+#include "openpic.h"
+#include "sysbus.h"
+#include "pci/msi.h"
+#include "qemu/bitops.h"
+#include "ppc.h"
+
+//#define DEBUG_OPENPIC
+
+#ifdef DEBUG_OPENPIC
+static const int debug_openpic = 1;
+#else
+static const int debug_openpic = 0;
+#endif
+
+#define DPRINTF(fmt, ...) do { \
+        if (debug_openpic) { \
+            printf(fmt , ## __VA_ARGS__); \
+        } \
+    } while (0)
+
+#define MAX_CPU     32
+#define MAX_SRC     256
+#define MAX_TMR     4
+#define MAX_IPI     4
+#define MAX_MSI     8
+#define MAX_IRQ     (MAX_SRC + MAX_IPI + MAX_TMR)
+#define VID         0x03	/* MPIC version ID */
+
+/* OpenPIC capability flags */
+#define OPENPIC_FLAG_IDR_CRIT     (1 << 0)
+#define OPENPIC_FLAG_ILR          (2 << 0)
+
+/* OpenPIC address map */
+#define OPENPIC_GLB_REG_START        0x0
+#define OPENPIC_GLB_REG_SIZE         0x10F0
+#define OPENPIC_TMR_REG_START        0x10F0
+#define OPENPIC_TMR_REG_SIZE         0x220
+#define OPENPIC_MSI_REG_START        0x1600
+#define OPENPIC_MSI_REG_SIZE         0x200
+#define OPENPIC_SUMMARY_REG_START   0x3800
+#define OPENPIC_SUMMARY_REG_SIZE    0x800
+#define OPENPIC_SRC_REG_START        0x10000
+#define OPENPIC_SRC_REG_SIZE         (MAX_SRC * 0x20)
+#define OPENPIC_CPU_REG_START        0x20000
+#define OPENPIC_CPU_REG_SIZE         0x100 + ((MAX_CPU - 1) * 0x1000)
+
+/* Raven */
+#define RAVEN_MAX_CPU      2
+#define RAVEN_MAX_EXT     48
+#define RAVEN_MAX_IRQ     64
+#define RAVEN_MAX_TMR      MAX_TMR
+#define RAVEN_MAX_IPI      MAX_IPI
+
+/* Interrupt definitions */
+#define RAVEN_FE_IRQ     (RAVEN_MAX_EXT)	/* Internal functional IRQ */
+#define RAVEN_ERR_IRQ    (RAVEN_MAX_EXT + 1)	/* Error IRQ */
+#define RAVEN_TMR_IRQ    (RAVEN_MAX_EXT + 2)	/* First timer IRQ */
+#define RAVEN_IPI_IRQ    (RAVEN_TMR_IRQ + RAVEN_MAX_TMR)	/* First IPI IRQ */
+/* First doorbell IRQ */
+#define RAVEN_DBL_IRQ    (RAVEN_IPI_IRQ + (RAVEN_MAX_CPU * RAVEN_MAX_IPI))
+
+typedef struct FslMpicInfo {
+	int max_ext;
+} FslMpicInfo;
+
+static FslMpicInfo fsl_mpic_20 = {
+	.max_ext = 12,
+};
+
+static FslMpicInfo fsl_mpic_42 = {
+	.max_ext = 12,
+};
+
+#define FRR_NIRQ_SHIFT    16
+#define FRR_NCPU_SHIFT     8
+#define FRR_VID_SHIFT      0
+
+#define VID_REVISION_1_2   2
+#define VID_REVISION_1_3   3
+
+#define VIR_GENERIC      0x00000000	/* Generic Vendor ID */
+
+#define GCR_RESET        0x80000000
+#define GCR_MODE_PASS    0x00000000
+#define GCR_MODE_MIXED   0x20000000
+#define GCR_MODE_PROXY   0x60000000
+
+#define TBCR_CI           0x80000000	/* count inhibit */
+#define TCCR_TOG          0x80000000	/* toggles when decrement to zero */
+
+#define IDR_EP_SHIFT      31
+#define IDR_EP_MASK       (1 << IDR_EP_SHIFT)
+#define IDR_CI0_SHIFT     30
+#define IDR_CI1_SHIFT     29
+#define IDR_P1_SHIFT      1
+#define IDR_P0_SHIFT      0
+
+#define ILR_INTTGT_MASK   0x000000ff
+#define ILR_INTTGT_INT    0x00
+#define ILR_INTTGT_CINT   0x01	/* critical */
+#define ILR_INTTGT_MCP    0x02	/* machine check */
+
+/* The currently supported INTTGT values happen to be the same as QEMU's
+ * openpic output codes, but don't depend on this.  The output codes
+ * could change (unlikely, but...) or support could be added for
+ * more INTTGT values.
+ */
+static const int inttgt_output[][2] = {
+	{ILR_INTTGT_INT, OPENPIC_OUTPUT_INT},
+	{ILR_INTTGT_CINT, OPENPIC_OUTPUT_CINT},
+	{ILR_INTTGT_MCP, OPENPIC_OUTPUT_MCK},
+};
+
+static int inttgt_to_output(int inttgt)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(inttgt_output); i++) {
+		if (inttgt_output[i][0] == inttgt) {
+			return inttgt_output[i][1];
+		}
+	}
+
+	fprintf(stderr, "%s: unsupported inttgt %d\n", __func__, inttgt);
+	return OPENPIC_OUTPUT_INT;
+}
+
+static int output_to_inttgt(int output)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(inttgt_output); i++) {
+		if (inttgt_output[i][1] == output) {
+			return inttgt_output[i][0];
+		}
+	}
+
+	abort();
+}
+
+#define MSIIR_OFFSET       0x140
+#define MSIIR_SRS_SHIFT    29
+#define MSIIR_SRS_MASK     (0x7 << MSIIR_SRS_SHIFT)
+#define MSIIR_IBS_SHIFT    24
+#define MSIIR_IBS_MASK     (0x1f << MSIIR_IBS_SHIFT)
+
+static int get_current_cpu(void)
+{
+	CPUState *cpu_single_cpu;
+
+	if (!cpu_single_env) {
+		return -1;
+	}
+
+	cpu_single_cpu = ENV_GET_CPU(cpu_single_env);
+	return cpu_single_cpu->cpu_index;
+}
+
+static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx);
+static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
+				       uint32_t val, int idx);
+
+typedef enum IRQType {
+	IRQ_TYPE_NORMAL = 0,
+	IRQ_TYPE_FSLINT,	/* FSL internal interrupt -- level only */
+	IRQ_TYPE_FSLSPECIAL,	/* FSL timer/IPI interrupt, edge, no polarity */
+} IRQType;
+
+typedef struct IRQQueue {
+	/* Round up to the nearest 64 IRQs so that the queue length
+	 * won't change when moving between 32 and 64 bit hosts.
+	 */
+	unsigned long queue[BITS_TO_LONGS((MAX_IRQ + 63) & ~63)];
+	int next;
+	int priority;
+} IRQQueue;
+
+typedef struct IRQSource {
+	uint32_t ivpr;		/* IRQ vector/priority register */
+	uint32_t idr;		/* IRQ destination register */
+	uint32_t destmask;	/* bitmap of CPU destinations */
+	int last_cpu;
+	int output;		/* IRQ level, e.g. OPENPIC_OUTPUT_INT */
+	int pending;		/* TRUE if IRQ is pending */
+	IRQType type;
+	bool level:1;		/* level-triggered */
+	bool nomask:1;		/* critical interrupts ignore mask on some FSL MPICs */
+} IRQSource;
+
+#define IVPR_MASK_SHIFT       31
+#define IVPR_MASK_MASK        (1 << IVPR_MASK_SHIFT)
+#define IVPR_ACTIVITY_SHIFT   30
+#define IVPR_ACTIVITY_MASK    (1 << IVPR_ACTIVITY_SHIFT)
+#define IVPR_MODE_SHIFT       29
+#define IVPR_MODE_MASK        (1 << IVPR_MODE_SHIFT)
+#define IVPR_POLARITY_SHIFT   23
+#define IVPR_POLARITY_MASK    (1 << IVPR_POLARITY_SHIFT)
+#define IVPR_SENSE_SHIFT      22
+#define IVPR_SENSE_MASK       (1 << IVPR_SENSE_SHIFT)
+
+#define IVPR_PRIORITY_MASK     (0xF << 16)
+#define IVPR_PRIORITY(_ivprr_) ((int)(((_ivprr_) & IVPR_PRIORITY_MASK) >> 16))
+#define IVPR_VECTOR(opp, _ivprr_) ((_ivprr_) & (opp)->vector_mask)
+
+/* IDR[EP/CI] are only for FSL MPIC prior to v4.0 */
+#define IDR_EP      0x80000000	/* external pin */
+#define IDR_CI      0x40000000	/* critical interrupt */
+
+typedef struct IRQDest {
+	int32_t ctpr;		/* CPU current task priority */
+	IRQQueue raised;
+	IRQQueue servicing;
+	qemu_irq *irqs;
+
+	/* Count of IRQ sources asserting on non-INT outputs */
+	uint32_t outputs_active[OPENPIC_OUTPUT_NB];
+} IRQDest;
+
+typedef struct OpenPICState {
+	SysBusDevice busdev;
+	MemoryRegion mem;
+
+	/* Behavior control */
+	FslMpicInfo *fsl;
+	uint32_t model;
+	uint32_t flags;
+	uint32_t nb_irqs;
+	uint32_t vid;
+	uint32_t vir;		/* Vendor identification register */
+	uint32_t vector_mask;
+	uint32_t tfrr_reset;
+	uint32_t ivpr_reset;
+	uint32_t idr_reset;
+	uint32_t brr1;
+	uint32_t mpic_mode_mask;
+
+	/* Sub-regions */
+	MemoryRegion sub_io_mem[6];
+
+	/* Global registers */
+	uint32_t frr;		/* Feature reporting register */
+	uint32_t gcr;		/* Global configuration register  */
+	uint32_t pir;		/* Processor initialization register */
+	uint32_t spve;		/* Spurious vector register */
+	uint32_t tfrr;		/* Timer frequency reporting register */
+	/* Source registers */
+	IRQSource src[MAX_IRQ];
+	/* Local registers per output pin */
+	IRQDest dst[MAX_CPU];
+	uint32_t nb_cpus;
+	/* Timer registers */
+	struct {
+		uint32_t tccr;	/* Global timer current count register */
+		uint32_t tbcr;	/* Global timer base count register */
+	} timers[MAX_TMR];
+	/* Shared MSI registers */
+	struct {
+		uint32_t msir;	/* Shared Message Signaled Interrupt Register */
+	} msi[MAX_MSI];
+	uint32_t max_irq;
+	uint32_t irq_ipi0;
+	uint32_t irq_tim0;
+	uint32_t irq_msi;
+} OpenPICState;
+
+static inline void IRQ_setbit(IRQQueue * q, int n_IRQ)
+{
+	set_bit(n_IRQ, q->queue);
+}
+
+static inline void IRQ_resetbit(IRQQueue * q, int n_IRQ)
+{
+	clear_bit(n_IRQ, q->queue);
+}
+
+static inline int IRQ_testbit(IRQQueue * q, int n_IRQ)
+{
+	return test_bit(n_IRQ, q->queue);
+}
+
+static void IRQ_check(OpenPICState * opp, IRQQueue * q)
+{
+	int irq = -1;
+	int next = -1;
+	int priority = -1;
+
+	for (;;) {
+		irq = find_next_bit(q->queue, opp->max_irq, irq + 1);
+		if (irq == opp->max_irq) {
+			break;
+		}
+
+		DPRINTF("IRQ_check: irq %d set ivpr_pr=%d pr=%d\n",
+			irq, IVPR_PRIORITY(opp->src[irq].ivpr), priority);
+
+		if (IVPR_PRIORITY(opp->src[irq].ivpr) > priority) {
+			next = irq;
+			priority = IVPR_PRIORITY(opp->src[irq].ivpr);
+		}
+	}
+
+	q->next = next;
+	q->priority = priority;
+}
+
+static int IRQ_get_next(OpenPICState * opp, IRQQueue * q)
+{
+	/* XXX: optimize */
+	IRQ_check(opp, q);
+
+	return q->next;
+}
+
+static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
+			   bool active, bool was_active)
+{
+	IRQDest *dst;
+	IRQSource *src;
+	int priority;
+
+	dst = &opp->dst[n_CPU];
+	src = &opp->src[n_IRQ];
+
+	DPRINTF("%s: IRQ %d active %d was %d\n",
+		__func__, n_IRQ, active, was_active);
+
+	if (src->output != OPENPIC_OUTPUT_INT) {
+		DPRINTF("%s: output %d irq %d active %d was %d count %d\n",
+			__func__, src->output, n_IRQ, active, was_active,
+			dst->outputs_active[src->output]);
+
+		/* On Freescale MPIC, critical interrupts ignore priority,
+		 * IACK, EOI, etc.  Before MPIC v4.1 they also ignore
+		 * masking.
+		 */
+		if (active) {
+			if (!was_active
+			    && dst->outputs_active[src->output]++ == 0) {
+				DPRINTF
+				    ("%s: Raise OpenPIC output %d cpu %d irq %d\n",
+				     __func__, src->output, n_CPU, n_IRQ);
+				qemu_irq_raise(dst->irqs[src->output]);
+			}
+		} else {
+			if (was_active
+			    && --dst->outputs_active[src->output] == 0) {
+				DPRINTF
+				    ("%s: Lower OpenPIC output %d cpu %d irq %d\n",
+				     __func__, src->output, n_CPU, n_IRQ);
+				qemu_irq_lower(dst->irqs[src->output]);
+			}
+		}
+
+		return;
+	}
+
+	priority = IVPR_PRIORITY(src->ivpr);
+
+	/* Even if the interrupt doesn't have enough priority,
+	 * it is still raised, in case ctpr is lowered later.
+	 */
+	if (active) {
+		IRQ_setbit(&dst->raised, n_IRQ);
+	} else {
+		IRQ_resetbit(&dst->raised, n_IRQ);
+	}
+
+	IRQ_check(opp, &dst->raised);
+
+	if (active && priority <= dst->ctpr) {
+		DPRINTF
+		    ("%s: IRQ %d priority %d too low for ctpr %d on CPU %d\n",
+		     __func__, n_IRQ, priority, dst->ctpr, n_CPU);
+		active = 0;
+	}
+
+	if (active) {
+		if (IRQ_get_next(opp, &dst->servicing) >= 0 &&
+		    priority <= dst->servicing.priority) {
+			DPRINTF
+			    ("%s: IRQ %d is hidden by servicing IRQ %d on CPU %d\n",
+			     __func__, n_IRQ, dst->servicing.next, n_CPU);
+		} else {
+			DPRINTF
+			    ("%s: Raise OpenPIC INT output cpu %d irq %d/%d\n",
+			     __func__, n_CPU, n_IRQ, dst->raised.next);
+			qemu_irq_raise(opp->dst[n_CPU].
+				       irqs[OPENPIC_OUTPUT_INT]);
+		}
+	} else {
+		IRQ_get_next(opp, &dst->servicing);
+		if (dst->raised.priority > dst->ctpr &&
+		    dst->raised.priority > dst->servicing.priority) {
+			DPRINTF
+			    ("%s: IRQ %d inactive, IRQ %d prio %d above %d/%d, CPU %d\n",
+			     __func__, n_IRQ, dst->raised.next,
+			     dst->raised.priority, dst->ctpr,
+			     dst->servicing.priority, n_CPU);
+			/* IRQ line stays asserted */
+		} else {
+			DPRINTF
+			    ("%s: IRQ %d inactive, current prio %d/%d, CPU %d\n",
+			     __func__, n_IRQ, dst->ctpr,
+			     dst->servicing.priority, n_CPU);
+			qemu_irq_lower(opp->dst[n_CPU].
+				       irqs[OPENPIC_OUTPUT_INT]);
+		}
+	}
+}
+
+/* update pic state because registers for n_IRQ have changed value */
+static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
+{
+	IRQSource *src;
+	bool active, was_active;
+	int i;
+
+	src = &opp->src[n_IRQ];
+	active = src->pending;
+
+	if ((src->ivpr & IVPR_MASK_MASK) && !src->nomask) {
+		/* Interrupt source is disabled */
+		DPRINTF("%s: IRQ %d is disabled\n", __func__, n_IRQ);
+		active = false;
+	}
+
+	was_active = ! !(src->ivpr & IVPR_ACTIVITY_MASK);
+
+	/*
+	 * We don't have a similar check for already-active because
+	 * ctpr may have changed and we need to withdraw the interrupt.
+	 */
+	if (!active && !was_active) {
+		DPRINTF("%s: IRQ %d is already inactive\n", __func__, n_IRQ);
+		return;
+	}
+
+	if (active) {
+		src->ivpr |= IVPR_ACTIVITY_MASK;
+	} else {
+		src->ivpr &= ~IVPR_ACTIVITY_MASK;
+	}
+
+	if (src->destmask == 0) {
+		/* No target */
+		DPRINTF("%s: IRQ %d has no target\n", __func__, n_IRQ);
+		return;
+	}
+
+	if (src->destmask == (1 << src->last_cpu)) {
+		/* Only one CPU is allowed to receive this IRQ */
+		IRQ_local_pipe(opp, src->last_cpu, n_IRQ, active, was_active);
+	} else if (!(src->ivpr & IVPR_MODE_MASK)) {
+		/* Directed delivery mode */
+		for (i = 0; i < opp->nb_cpus; i++) {
+			if (src->destmask & (1 << i)) {
+				IRQ_local_pipe(opp, i, n_IRQ, active,
+					       was_active);
+			}
+		}
+	} else {
+		/* Distributed delivery mode */
+		for (i = src->last_cpu + 1; i != src->last_cpu; i++) {
+			if (i == opp->nb_cpus) {
+				i = 0;
+			}
+			if (src->destmask & (1 << i)) {
+				IRQ_local_pipe(opp, i, n_IRQ, active,
+					       was_active);
+				src->last_cpu = i;
+				break;
+			}
+		}
+	}
+}
+
+static void openpic_set_irq(void *opaque, int n_IRQ, int level)
+{
+	OpenPICState *opp = opaque;
+	IRQSource *src;
+
+	if (n_IRQ >= MAX_IRQ) {
+		fprintf(stderr, "%s: IRQ %d out of range\n", __func__, n_IRQ);
+		abort();
+	}
+
+	src = &opp->src[n_IRQ];
+	DPRINTF("openpic: set irq %d = %d ivpr=0x%08x\n",
+		n_IRQ, level, src->ivpr);
+	if (src->level) {
+		/* level-sensitive irq */
+		src->pending = level;
+		openpic_update_irq(opp, n_IRQ);
+	} else {
+		/* edge-sensitive irq */
+		if (level) {
+			src->pending = 1;
+			openpic_update_irq(opp, n_IRQ);
+		}
+
+		if (src->output != OPENPIC_OUTPUT_INT) {
+			/* Edge-triggered interrupts shouldn't be used
+			 * with non-INT delivery, but just in case,
+			 * try to make it do something sane rather than
+			 * cause an interrupt storm.  This is close to
+			 * what you'd probably see happen in real hardware.
+			 */
+			src->pending = 0;
+			openpic_update_irq(opp, n_IRQ);
+		}
+	}
+}
+
+static void openpic_reset(DeviceState * d)
+{
+	OpenPICState *opp = FROM_SYSBUS(typeof(*opp), SYS_BUS_DEVICE(d));
+	int i;
+
+	opp->gcr = GCR_RESET;
+	/* Initialise controller registers */
+	opp->frr = ((opp->nb_irqs - 1) << FRR_NIRQ_SHIFT) |
+	    ((opp->nb_cpus - 1) << FRR_NCPU_SHIFT) |
+	    (opp->vid << FRR_VID_SHIFT);
+
+	opp->pir = 0;
+	opp->spve = -1 & opp->vector_mask;
+	opp->tfrr = opp->tfrr_reset;
+	/* Initialise IRQ sources */
+	for (i = 0; i < opp->max_irq; i++) {
+		opp->src[i].ivpr = opp->ivpr_reset;
+		opp->src[i].idr = opp->idr_reset;
+
+		switch (opp->src[i].type) {
+		case IRQ_TYPE_NORMAL:
+			opp->src[i].level =
+			    ! !(opp->ivpr_reset & IVPR_SENSE_MASK);
+			break;
+
+		case IRQ_TYPE_FSLINT:
+			opp->src[i].ivpr |= IVPR_POLARITY_MASK;
+			break;
+
+		case IRQ_TYPE_FSLSPECIAL:
+			break;
+		}
+	}
+	/* Initialise IRQ destinations */
+	for (i = 0; i < MAX_CPU; i++) {
+		opp->dst[i].ctpr = 15;
+		memset(&opp->dst[i].raised, 0, sizeof(IRQQueue));
+		opp->dst[i].raised.next = -1;
+		memset(&opp->dst[i].servicing, 0, sizeof(IRQQueue));
+		opp->dst[i].servicing.next = -1;
+	}
+	/* Initialise timers */
+	for (i = 0; i < MAX_TMR; i++) {
+		opp->timers[i].tccr = 0;
+		opp->timers[i].tbcr = TBCR_CI;
+	}
+	/* Go out of RESET state */
+	opp->gcr = 0;
+}
+
+static inline uint32_t read_IRQreg_idr(OpenPICState * opp, int n_IRQ)
+{
+	return opp->src[n_IRQ].idr;
+}
+
+static inline uint32_t read_IRQreg_ilr(OpenPICState * opp, int n_IRQ)
+{
+	if (opp->flags & OPENPIC_FLAG_ILR) {
+		return output_to_inttgt(opp->src[n_IRQ].output);
+	}
+
+	return 0xffffffff;
+}
+
+static inline uint32_t read_IRQreg_ivpr(OpenPICState * opp, int n_IRQ)
+{
+	return opp->src[n_IRQ].ivpr;
+}
+
+static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
+{
+	IRQSource *src = &opp->src[n_IRQ];
+	uint32_t normal_mask = (1UL << opp->nb_cpus) - 1;
+	uint32_t crit_mask = 0;
+	uint32_t mask = normal_mask;
+	int crit_shift = IDR_EP_SHIFT - opp->nb_cpus;
+	int i;
+
+	if (opp->flags & OPENPIC_FLAG_IDR_CRIT) {
+		crit_mask = mask << crit_shift;
+		mask |= crit_mask | IDR_EP;
+	}
+
+	src->idr = val & mask;
+	DPRINTF("Set IDR %d to 0x%08x\n", n_IRQ, src->idr);
+
+	if (opp->flags & OPENPIC_FLAG_IDR_CRIT) {
+		if (src->idr & crit_mask) {
+			if (src->idr & normal_mask) {
+				DPRINTF
+				    ("%s: IRQ configured for multiple output types, using "
+				     "critical\n", __func__);
+			}
+
+			src->output = OPENPIC_OUTPUT_CINT;
+			src->nomask = true;
+			src->destmask = 0;
+
+			for (i = 0; i < opp->nb_cpus; i++) {
+				int n_ci = IDR_CI0_SHIFT - i;
+
+				if (src->idr & (1UL << n_ci)) {
+					src->destmask |= 1UL << i;
+				}
+			}
+		} else {
+			src->output = OPENPIC_OUTPUT_INT;
+			src->nomask = false;
+			src->destmask = src->idr & normal_mask;
+		}
+	} else {
+		src->destmask = src->idr;
+	}
+}
+
+static inline void write_IRQreg_ilr(OpenPICState * opp, int n_IRQ, uint32_t val)
+{
+	if (opp->flags & OPENPIC_FLAG_ILR) {
+		IRQSource *src = &opp->src[n_IRQ];
+
+		src->output = inttgt_to_output(val & ILR_INTTGT_MASK);
+		DPRINTF("Set ILR %d to 0x%08x, output %d\n", n_IRQ, src->idr,
+			src->output);
+
+		/* TODO: on MPIC v4.0 only, set nomask for non-INT */
+	}
+}
+
+static inline void write_IRQreg_ivpr(OpenPICState * opp, int n_IRQ,
+				     uint32_t val)
+{
+	uint32_t mask;
+
+	/* NOTE when implementing newer FSL MPIC models: starting with v4.0,
+	 * the polarity bit is read-only on internal interrupts.
+	 */
+	mask = IVPR_MASK_MASK | IVPR_PRIORITY_MASK | IVPR_SENSE_MASK |
+	    IVPR_POLARITY_MASK | opp->vector_mask;
+
+	/* ACTIVITY bit is read-only */
+	opp->src[n_IRQ].ivpr =
+	    (opp->src[n_IRQ].ivpr & IVPR_ACTIVITY_MASK) | (val & mask);
+
+	/* For FSL internal interrupts, The sense bit is reserved and zero,
+	 * and the interrupt is always level-triggered.  Timers and IPIs
+	 * have no sense or polarity bits, and are edge-triggered.
+	 */
+	switch (opp->src[n_IRQ].type) {
+	case IRQ_TYPE_NORMAL:
+		opp->src[n_IRQ].level =
+		    ! !(opp->src[n_IRQ].ivpr & IVPR_SENSE_MASK);
+		break;
+
+	case IRQ_TYPE_FSLINT:
+		opp->src[n_IRQ].ivpr &= ~IVPR_SENSE_MASK;
+		break;
+
+	case IRQ_TYPE_FSLSPECIAL:
+		opp->src[n_IRQ].ivpr &= ~(IVPR_POLARITY_MASK | IVPR_SENSE_MASK);
+		break;
+	}
+
+	openpic_update_irq(opp, n_IRQ);
+	DPRINTF("Set IVPR %d to 0x%08x -> 0x%08x\n", n_IRQ, val,
+		opp->src[n_IRQ].ivpr);
+}
+
+static void openpic_gcr_write(OpenPICState * opp, uint64_t val)
+{
+	bool mpic_proxy = false;
+
+	if (val & GCR_RESET) {
+		openpic_reset(&opp->busdev.qdev);
+		return;
+	}
+
+	opp->gcr &= ~opp->mpic_mode_mask;
+	opp->gcr |= val & opp->mpic_mode_mask;
+
+	/* Set external proxy mode */
+	if ((val & opp->mpic_mode_mask) == GCR_MODE_PROXY) {
+		mpic_proxy = true;
+	}
+
+	ppce500_set_mpic_proxy(mpic_proxy);
+}
+
+static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned len)
+{
+	OpenPICState *opp = opaque;
+	IRQDest *dst;
+	int idx;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+		__func__, addr, val);
+	if (addr & 0xF) {
+		return;
+	}
+	switch (addr) {
+	case 0x00:		/* Block Revision Register1 (BRR1) is Readonly */
+		break;
+	case 0x40:
+	case 0x50:
+	case 0x60:
+	case 0x70:
+	case 0x80:
+	case 0x90:
+	case 0xA0:
+	case 0xB0:
+		openpic_cpu_write_internal(opp, addr, val, get_current_cpu());
+		break;
+	case 0x1000:		/* FRR */
+		break;
+	case 0x1020:		/* GCR */
+		openpic_gcr_write(opp, val);
+		break;
+	case 0x1080:		/* VIR */
+		break;
+	case 0x1090:		/* PIR */
+		for (idx = 0; idx < opp->nb_cpus; idx++) {
+			if ((val & (1 << idx)) && !(opp->pir & (1 << idx))) {
+				DPRINTF
+				    ("Raise OpenPIC RESET output for CPU %d\n",
+				     idx);
+				dst = &opp->dst[idx];
+				qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_RESET]);
+			} else if (!(val & (1 << idx))
+				   && (opp->pir & (1 << idx))) {
+				DPRINTF
+				    ("Lower OpenPIC RESET output for CPU %d\n",
+				     idx);
+				dst = &opp->dst[idx];
+				qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_RESET]);
+			}
+		}
+		opp->pir = val;
+		break;
+	case 0x10A0:		/* IPI_IVPR */
+	case 0x10B0:
+	case 0x10C0:
+	case 0x10D0:
+		{
+			int idx;
+			idx = (addr - 0x10A0) >> 4;
+			write_IRQreg_ivpr(opp, opp->irq_ipi0 + idx, val);
+		}
+		break;
+	case 0x10E0:		/* SPVE */
+		opp->spve = val & opp->vector_mask;
+		break;
+	default:
+		break;
+	}
+}
+
+static uint64_t openpic_gbl_read(void *opaque, hwaddr addr, unsigned len)
+{
+	OpenPICState *opp = opaque;
+	uint32_t retval;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	retval = 0xFFFFFFFF;
+	if (addr & 0xF) {
+		return retval;
+	}
+	switch (addr) {
+	case 0x1000:		/* FRR */
+		retval = opp->frr;
+		break;
+	case 0x1020:		/* GCR */
+		retval = opp->gcr;
+		break;
+	case 0x1080:		/* VIR */
+		retval = opp->vir;
+		break;
+	case 0x1090:		/* PIR */
+		retval = 0x00000000;
+		break;
+	case 0x00:		/* Block Revision Register1 (BRR1) */
+		retval = opp->brr1;
+		break;
+	case 0x40:
+	case 0x50:
+	case 0x60:
+	case 0x70:
+	case 0x80:
+	case 0x90:
+	case 0xA0:
+	case 0xB0:
+		retval =
+		    openpic_cpu_read_internal(opp, addr, get_current_cpu());
+		break;
+	case 0x10A0:		/* IPI_IVPR */
+	case 0x10B0:
+	case 0x10C0:
+	case 0x10D0:
+		{
+			int idx;
+			idx = (addr - 0x10A0) >> 4;
+			retval = read_IRQreg_ivpr(opp, opp->irq_ipi0 + idx);
+		}
+		break;
+	case 0x10E0:		/* SPVE */
+		retval = opp->spve;
+		break;
+	default:
+		break;
+	}
+	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+
+	return retval;
+}
+
+static void openpic_tmr_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned len)
+{
+	OpenPICState *opp = opaque;
+	int idx;
+
+	addr += 0x10f0;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+		__func__, addr, val);
+	if (addr & 0xF) {
+		return;
+	}
+
+	if (addr == 0x10f0) {
+		/* TFRR */
+		opp->tfrr = val;
+		return;
+	}
+
+	idx = (addr >> 6) & 0x3;
+	addr = addr & 0x30;
+
+	switch (addr & 0x30) {
+	case 0x00:		/* TCCR */
+		break;
+	case 0x10:		/* TBCR */
+		if ((opp->timers[idx].tccr & TCCR_TOG) != 0 &&
+		    (val & TBCR_CI) == 0 &&
+		    (opp->timers[idx].tbcr & TBCR_CI) != 0) {
+			opp->timers[idx].tccr &= ~TCCR_TOG;
+		}
+		opp->timers[idx].tbcr = val;
+		break;
+	case 0x20:		/* TVPR */
+		write_IRQreg_ivpr(opp, opp->irq_tim0 + idx, val);
+		break;
+	case 0x30:		/* TDR */
+		write_IRQreg_idr(opp, opp->irq_tim0 + idx, val);
+		break;
+	}
+}
+
+static uint64_t openpic_tmr_read(void *opaque, hwaddr addr, unsigned len)
+{
+	OpenPICState *opp = opaque;
+	uint32_t retval = -1;
+	int idx;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	if (addr & 0xF) {
+		goto out;
+	}
+	idx = (addr >> 6) & 0x3;
+	if (addr == 0x0) {
+		/* TFRR */
+		retval = opp->tfrr;
+		goto out;
+	}
+	switch (addr & 0x30) {
+	case 0x00:		/* TCCR */
+		retval = opp->timers[idx].tccr;
+		break;
+	case 0x10:		/* TBCR */
+		retval = opp->timers[idx].tbcr;
+		break;
+	case 0x20:		/* TIPV */
+		retval = read_IRQreg_ivpr(opp, opp->irq_tim0 + idx);
+		break;
+	case 0x30:		/* TIDE (TIDR) */
+		retval = read_IRQreg_idr(opp, opp->irq_tim0 + idx);
+		break;
+	}
+
+out:
+	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+
+	return retval;
+}
+
+static void openpic_src_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned len)
+{
+	OpenPICState *opp = opaque;
+	int idx;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+		__func__, addr, val);
+
+	addr = addr & 0xffff;
+	idx = addr >> 5;
+
+	switch (addr & 0x1f) {
+	case 0x00:
+		write_IRQreg_ivpr(opp, idx, val);
+		break;
+	case 0x10:
+		write_IRQreg_idr(opp, idx, val);
+		break;
+	case 0x18:
+		write_IRQreg_ilr(opp, idx, val);
+		break;
+	}
+}
+
+static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
+{
+	OpenPICState *opp = opaque;
+	uint32_t retval;
+	int idx;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	retval = 0xFFFFFFFF;
+
+	addr = addr & 0xffff;
+	idx = addr >> 5;
+
+	switch (addr & 0x1f) {
+	case 0x00:
+		retval = read_IRQreg_ivpr(opp, idx);
+		break;
+	case 0x10:
+		retval = read_IRQreg_idr(opp, idx);
+		break;
+	case 0x18:
+		retval = read_IRQreg_ilr(opp, idx);
+		break;
+	}
+
+	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	return retval;
+}
+
+static void openpic_msi_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned size)
+{
+	OpenPICState *opp = opaque;
+	int idx = opp->irq_msi;
+	int srs, ibs;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
+		__func__, addr, val);
+	if (addr & 0xF) {
+		return;
+	}
+
+	switch (addr) {
+	case MSIIR_OFFSET:
+		srs = val >> MSIIR_SRS_SHIFT;
+		idx += srs;
+		ibs = (val & MSIIR_IBS_MASK) >> MSIIR_IBS_SHIFT;
+		opp->msi[srs].msir |= 1 << ibs;
+		openpic_set_irq(opp, idx, 1);
+		break;
+	default:
+		/* most registers are read-only, thus ignored */
+		break;
+	}
+}
+
+static uint64_t openpic_msi_read(void *opaque, hwaddr addr, unsigned size)
+{
+	OpenPICState *opp = opaque;
+	uint64_t r = 0;
+	int i, srs;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	if (addr & 0xF) {
+		return -1;
+	}
+
+	srs = addr >> 4;
+
+	switch (addr) {
+	case 0x00:
+	case 0x10:
+	case 0x20:
+	case 0x30:
+	case 0x40:
+	case 0x50:
+	case 0x60:
+	case 0x70:		/* MSIRs */
+		r = opp->msi[srs].msir;
+		/* Clear on read */
+		opp->msi[srs].msir = 0;
+		openpic_set_irq(opp, opp->irq_msi + srs, 0);
+		break;
+	case 0x120:		/* MSISR */
+		for (i = 0; i < MAX_MSI; i++) {
+			r |= (opp->msi[i].msir ? 1 : 0) << i;
+		}
+		break;
+	}
+
+	return r;
+}
+
+static uint64_t openpic_summary_read(void *opaque, hwaddr addr, unsigned size)
+{
+	uint64_t r = 0;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+
+	/* TODO: EISR/EIMR */
+
+	return r;
+}
+
+static void openpic_summary_write(void *opaque, hwaddr addr, uint64_t val,
+				  unsigned size)
+{
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
+		__func__, addr, val);
+
+	/* TODO: EISR/EIMR */
+}
+
+static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
+				       uint32_t val, int idx)
+{
+	OpenPICState *opp = opaque;
+	IRQSource *src;
+	IRQDest *dst;
+	int s_IRQ, n_IRQ;
+
+	DPRINTF("%s: cpu %d addr %#" HWADDR_PRIx " <= 0x%08x\n", __func__, idx,
+		addr, val);
+
+	if (idx < 0) {
+		return;
+	}
+
+	if (addr & 0xF) {
+		return;
+	}
+	dst = &opp->dst[idx];
+	addr &= 0xFF0;
+	switch (addr) {
+	case 0x40:		/* IPIDR */
+	case 0x50:
+	case 0x60:
+	case 0x70:
+		idx = (addr - 0x40) >> 4;
+		/* we use IDE as mask which CPUs to deliver the IPI to still. */
+		opp->src[opp->irq_ipi0 + idx].destmask |= val;
+		openpic_set_irq(opp, opp->irq_ipi0 + idx, 1);
+		openpic_set_irq(opp, opp->irq_ipi0 + idx, 0);
+		break;
+	case 0x80:		/* CTPR */
+		dst->ctpr = val & 0x0000000F;
+
+		DPRINTF("%s: set CPU %d ctpr to %d, raised %d servicing %d\n",
+			__func__, idx, dst->ctpr, dst->raised.priority,
+			dst->servicing.priority);
+
+		if (dst->raised.priority <= dst->ctpr) {
+			DPRINTF
+			    ("%s: Lower OpenPIC INT output cpu %d due to ctpr\n",
+			     __func__, idx);
+			qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
+		} else if (dst->raised.priority > dst->servicing.priority) {
+			DPRINTF("%s: Raise OpenPIC INT output cpu %d irq %d\n",
+				__func__, idx, dst->raised.next);
+			qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_INT]);
+		}
+
+		break;
+	case 0x90:		/* WHOAMI */
+		/* Read-only register */
+		break;
+	case 0xA0:		/* IACK */
+		/* Read-only register */
+		break;
+	case 0xB0:		/* EOI */
+		DPRINTF("EOI\n");
+		s_IRQ = IRQ_get_next(opp, &dst->servicing);
+
+		if (s_IRQ < 0) {
+			DPRINTF("%s: EOI with no interrupt in service\n",
+				__func__);
+			break;
+		}
+
+		IRQ_resetbit(&dst->servicing, s_IRQ);
+		/* Set up next servicing IRQ */
+		s_IRQ = IRQ_get_next(opp, &dst->servicing);
+		/* Check queued interrupts. */
+		n_IRQ = IRQ_get_next(opp, &dst->raised);
+		src = &opp->src[n_IRQ];
+		if (n_IRQ != -1 &&
+		    (s_IRQ == -1 ||
+		     IVPR_PRIORITY(src->ivpr) > dst->servicing.priority)) {
+			DPRINTF("Raise OpenPIC INT output cpu %d irq %d\n",
+				idx, n_IRQ);
+			qemu_irq_raise(opp->dst[idx].irqs[OPENPIC_OUTPUT_INT]);
+		}
+		break;
+	default:
+		break;
+	}
+}
+
+static void openpic_cpu_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned len)
+{
+	openpic_cpu_write_internal(opaque, addr, val, (addr & 0x1f000) >> 12);
+}
+
+static uint32_t openpic_iack(OpenPICState * opp, IRQDest * dst, int cpu)
+{
+	IRQSource *src;
+	int retval, irq;
+
+	DPRINTF("Lower OpenPIC INT output\n");
+	qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
+
+	irq = IRQ_get_next(opp, &dst->raised);
+	DPRINTF("IACK: irq=%d\n", irq);
+
+	if (irq == -1) {
+		/* No more interrupt pending */
+		return opp->spve;
+	}
+
+	src = &opp->src[irq];
+	if (!(src->ivpr & IVPR_ACTIVITY_MASK) ||
+	    !(IVPR_PRIORITY(src->ivpr) > dst->ctpr)) {
+		fprintf(stderr, "%s: bad raised IRQ %d ctpr %d ivpr 0x%08x\n",
+			__func__, irq, dst->ctpr, src->ivpr);
+		openpic_update_irq(opp, irq);
+		retval = opp->spve;
+	} else {
+		/* IRQ enter servicing state */
+		IRQ_setbit(&dst->servicing, irq);
+		retval = IVPR_VECTOR(opp, src->ivpr);
+	}
+
+	if (!src->level) {
+		/* edge-sensitive IRQ */
+		src->ivpr &= ~IVPR_ACTIVITY_MASK;
+		src->pending = 0;
+		IRQ_resetbit(&dst->raised, irq);
+	}
+
+	if ((irq >= opp->irq_ipi0) && (irq < (opp->irq_ipi0 + MAX_IPI))) {
+		src->destmask &= ~(1 << cpu);
+		if (src->destmask && !src->level) {
+			/* trigger on CPUs that didn't know about it yet */
+			openpic_set_irq(opp, irq, 1);
+			openpic_set_irq(opp, irq, 0);
+			/* if all CPUs knew about it, set active bit again */
+			src->ivpr |= IVPR_ACTIVITY_MASK;
+		}
+	}
+
+	return retval;
+}
+
+static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx)
+{
+	OpenPICState *opp = opaque;
+	IRQDest *dst;
+	uint32_t retval;
+
+	DPRINTF("%s: cpu %d addr %#" HWADDR_PRIx "\n", __func__, idx, addr);
+	retval = 0xFFFFFFFF;
+
+	if (idx < 0) {
+		return retval;
+	}
+
+	if (addr & 0xF) {
+		return retval;
+	}
+	dst = &opp->dst[idx];
+	addr &= 0xFF0;
+	switch (addr) {
+	case 0x80:		/* CTPR */
+		retval = dst->ctpr;
+		break;
+	case 0x90:		/* WHOAMI */
+		retval = idx;
+		break;
+	case 0xA0:		/* IACK */
+		retval = openpic_iack(opp, dst, idx);
+		break;
+	case 0xB0:		/* EOI */
+		retval = 0;
+		break;
+	default:
+		break;
+	}
+	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+
+	return retval;
+}
+
+static uint64_t openpic_cpu_read(void *opaque, hwaddr addr, unsigned len)
+{
+	return openpic_cpu_read_internal(opaque, addr, (addr & 0x1f000) >> 12);
+}
+
+static const MemoryRegionOps openpic_glb_ops_le = {
+	.write = openpic_gbl_write,
+	.read = openpic_gbl_read,
+	.endianness = DEVICE_LITTLE_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_glb_ops_be = {
+	.write = openpic_gbl_write,
+	.read = openpic_gbl_read,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_tmr_ops_le = {
+	.write = openpic_tmr_write,
+	.read = openpic_tmr_read,
+	.endianness = DEVICE_LITTLE_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_tmr_ops_be = {
+	.write = openpic_tmr_write,
+	.read = openpic_tmr_read,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_cpu_ops_le = {
+	.write = openpic_cpu_write,
+	.read = openpic_cpu_read,
+	.endianness = DEVICE_LITTLE_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_cpu_ops_be = {
+	.write = openpic_cpu_write,
+	.read = openpic_cpu_read,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_src_ops_le = {
+	.write = openpic_src_write,
+	.read = openpic_src_read,
+	.endianness = DEVICE_LITTLE_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_src_ops_be = {
+	.write = openpic_src_write,
+	.read = openpic_src_read,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_msi_ops_be = {
+	.read = openpic_msi_read,
+	.write = openpic_msi_write,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_summary_ops_be = {
+	.read = openpic_summary_read,
+	.write = openpic_summary_write,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static void openpic_save_IRQ_queue(QEMUFile * f, IRQQueue * q)
+{
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(q->queue); i++) {
+		/* Always put the lower half of a 64-bit long first, in case we
+		 * restore on a 32-bit host.  The least significant bits correspond
+		 * to lower IRQ numbers in the bitmap.
+		 */
+		qemu_put_be32(f, (uint32_t) q->queue[i]);
+#if LONG_MAX > 0x7FFFFFFF
+		qemu_put_be32(f, (uint32_t) (q->queue[i] >> 32));
+#endif
+	}
+
+	qemu_put_sbe32s(f, &q->next);
+	qemu_put_sbe32s(f, &q->priority);
+}
+
+static void openpic_save(QEMUFile * f, void *opaque)
+{
+	OpenPICState *opp = (OpenPICState *) opaque;
+	unsigned int i;
+
+	qemu_put_be32s(f, &opp->gcr);
+	qemu_put_be32s(f, &opp->vir);
+	qemu_put_be32s(f, &opp->pir);
+	qemu_put_be32s(f, &opp->spve);
+	qemu_put_be32s(f, &opp->tfrr);
+
+	qemu_put_be32s(f, &opp->nb_cpus);
+
+	for (i = 0; i < opp->nb_cpus; i++) {
+		qemu_put_sbe32s(f, &opp->dst[i].ctpr);
+		openpic_save_IRQ_queue(f, &opp->dst[i].raised);
+		openpic_save_IRQ_queue(f, &opp->dst[i].servicing);
+		qemu_put_buffer(f, (uint8_t *) & opp->dst[i].outputs_active,
+				sizeof(opp->dst[i].outputs_active));
+	}
+
+	for (i = 0; i < MAX_TMR; i++) {
+		qemu_put_be32s(f, &opp->timers[i].tccr);
+		qemu_put_be32s(f, &opp->timers[i].tbcr);
+	}
+
+	for (i = 0; i < opp->max_irq; i++) {
+		qemu_put_be32s(f, &opp->src[i].ivpr);
+		qemu_put_be32s(f, &opp->src[i].idr);
+		qemu_get_be32s(f, &opp->src[i].destmask);
+		qemu_put_sbe32s(f, &opp->src[i].last_cpu);
+		qemu_put_sbe32s(f, &opp->src[i].pending);
+	}
+}
+
+static void openpic_load_IRQ_queue(QEMUFile * f, IRQQueue * q)
+{
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(q->queue); i++) {
+		unsigned long val;
+
+		val = qemu_get_be32(f);
+#if LONG_MAX > 0x7FFFFFFF
+		val <<= 32;
+		val |= qemu_get_be32(f);
+#endif
+
+		q->queue[i] = val;
+	}
+
+	qemu_get_sbe32s(f, &q->next);
+	qemu_get_sbe32s(f, &q->priority);
+}
+
+static int openpic_load(QEMUFile * f, void *opaque, int version_id)
+{
+	OpenPICState *opp = (OpenPICState *) opaque;
+	unsigned int i;
+
+	if (version_id != 1) {
+		return -EINVAL;
+	}
+
+	qemu_get_be32s(f, &opp->gcr);
+	qemu_get_be32s(f, &opp->vir);
+	qemu_get_be32s(f, &opp->pir);
+	qemu_get_be32s(f, &opp->spve);
+	qemu_get_be32s(f, &opp->tfrr);
+
+	qemu_get_be32s(f, &opp->nb_cpus);
+
+	for (i = 0; i < opp->nb_cpus; i++) {
+		qemu_get_sbe32s(f, &opp->dst[i].ctpr);
+		openpic_load_IRQ_queue(f, &opp->dst[i].raised);
+		openpic_load_IRQ_queue(f, &opp->dst[i].servicing);
+		qemu_get_buffer(f, (uint8_t *) & opp->dst[i].outputs_active,
+				sizeof(opp->dst[i].outputs_active));
+	}
+
+	for (i = 0; i < MAX_TMR; i++) {
+		qemu_get_be32s(f, &opp->timers[i].tccr);
+		qemu_get_be32s(f, &opp->timers[i].tbcr);
+	}
+
+	for (i = 0; i < opp->max_irq; i++) {
+		uint32_t val;
+
+		val = qemu_get_be32(f);
+		write_IRQreg_idr(opp, i, val);
+		val = qemu_get_be32(f);
+		write_IRQreg_ivpr(opp, i, val);
+
+		qemu_get_be32s(f, &opp->src[i].ivpr);
+		qemu_get_be32s(f, &opp->src[i].idr);
+		qemu_get_be32s(f, &opp->src[i].destmask);
+		qemu_get_sbe32s(f, &opp->src[i].last_cpu);
+		qemu_get_sbe32s(f, &opp->src[i].pending);
+	}
+
+	return 0;
+}
+
+typedef struct MemReg {
+	const char *name;
+	MemoryRegionOps const *ops;
+	hwaddr start_addr;
+	ram_addr_t size;
+} MemReg;
+
+static void fsl_common_init(OpenPICState * opp)
+{
+	int i;
+	int virq = MAX_SRC;
+
+	opp->vid = VID_REVISION_1_2;
+	opp->vir = VIR_GENERIC;
+	opp->vector_mask = 0xFFFF;
+	opp->tfrr_reset = 0;
+	opp->ivpr_reset = IVPR_MASK_MASK;
+	opp->idr_reset = 1 << 0;
+	opp->max_irq = MAX_IRQ;
+
+	opp->irq_ipi0 = virq;
+	virq += MAX_IPI;
+	opp->irq_tim0 = virq;
+	virq += MAX_TMR;
+
+	assert(virq <= MAX_IRQ);
+
+	opp->irq_msi = 224;
+
+	msi_supported = true;
+	for (i = 0; i < opp->fsl->max_ext; i++) {
+		opp->src[i].level = false;
+	}
+
+	/* Internal interrupts, including message and MSI */
+	for (i = 16; i < MAX_SRC; i++) {
+		opp->src[i].type = IRQ_TYPE_FSLINT;
+		opp->src[i].level = true;
+	}
+
+	/* timers and IPIs */
+	for (i = MAX_SRC; i < virq; i++) {
+		opp->src[i].type = IRQ_TYPE_FSLSPECIAL;
+		opp->src[i].level = false;
+	}
+}
+
+static void map_list(OpenPICState * opp, const MemReg * list, int *count)
+{
+	while (list->name) {
+		assert(*count < ARRAY_SIZE(opp->sub_io_mem));
+
+		memory_region_init_io(&opp->sub_io_mem[*count], list->ops, opp,
+				      list->name, list->size);
+
+		memory_region_add_subregion(&opp->mem, list->start_addr,
+					    &opp->sub_io_mem[*count]);
+
+		(*count)++;
+		list++;
+	}
+}
+
+static int openpic_init(SysBusDevice * dev)
+{
+	OpenPICState *opp = FROM_SYSBUS(typeof(*opp), dev);
+	int i, j;
+	int list_count = 0;
+	static const MemReg list_le[] = {
+		{"glb", &openpic_glb_ops_le,
+		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
+		{"tmr", &openpic_tmr_ops_le,
+		 OPENPIC_TMR_REG_START, OPENPIC_TMR_REG_SIZE},
+		{"src", &openpic_src_ops_le,
+		 OPENPIC_SRC_REG_START, OPENPIC_SRC_REG_SIZE},
+		{"cpu", &openpic_cpu_ops_le,
+		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
+		{NULL}
+	};
+	static const MemReg list_be[] = {
+		{"glb", &openpic_glb_ops_be,
+		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
+		{"tmr", &openpic_tmr_ops_be,
+		 OPENPIC_TMR_REG_START, OPENPIC_TMR_REG_SIZE},
+		{"src", &openpic_src_ops_be,
+		 OPENPIC_SRC_REG_START, OPENPIC_SRC_REG_SIZE},
+		{"cpu", &openpic_cpu_ops_be,
+		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
+		{NULL}
+	};
+	static const MemReg list_fsl[] = {
+		{"msi", &openpic_msi_ops_be,
+		 OPENPIC_MSI_REG_START, OPENPIC_MSI_REG_SIZE},
+		{"summary", &openpic_summary_ops_be,
+		 OPENPIC_SUMMARY_REG_START, OPENPIC_SUMMARY_REG_SIZE},
+		{NULL}
+	};
+
+	memory_region_init(&opp->mem, "openpic", 0x40000);
+
+	switch (opp->model) {
+	case OPENPIC_MODEL_FSL_MPIC_20:
+	default:
+		opp->fsl = &fsl_mpic_20;
+		opp->brr1 = 0x00400200;
+		opp->flags |= OPENPIC_FLAG_IDR_CRIT;
+		opp->nb_irqs = 80;
+		opp->mpic_mode_mask = GCR_MODE_MIXED;
+
+		fsl_common_init(opp);
+		map_list(opp, list_be, &list_count);
+		map_list(opp, list_fsl, &list_count);
+
+		break;
+
+	case OPENPIC_MODEL_FSL_MPIC_42:
+		opp->fsl = &fsl_mpic_42;
+		opp->brr1 = 0x00400402;
+		opp->flags |= OPENPIC_FLAG_ILR;
+		opp->nb_irqs = 196;
+		opp->mpic_mode_mask = GCR_MODE_PROXY;
+
+		fsl_common_init(opp);
+		map_list(opp, list_be, &list_count);
+		map_list(opp, list_fsl, &list_count);
+
+		break;
+
+	case OPENPIC_MODEL_RAVEN:
+		opp->nb_irqs = RAVEN_MAX_EXT;
+		opp->vid = VID_REVISION_1_3;
+		opp->vir = VIR_GENERIC;
+		opp->vector_mask = 0xFF;
+		opp->tfrr_reset = 4160000;
+		opp->ivpr_reset = IVPR_MASK_MASK | IVPR_MODE_MASK;
+		opp->idr_reset = 0;
+		opp->max_irq = RAVEN_MAX_IRQ;
+		opp->irq_ipi0 = RAVEN_IPI_IRQ;
+		opp->irq_tim0 = RAVEN_TMR_IRQ;
+		opp->brr1 = -1;
+		opp->mpic_mode_mask = GCR_MODE_MIXED;
+
+		/* Only UP supported today */
+		if (opp->nb_cpus != 1) {
+			return -EINVAL;
+		}
+
+		map_list(opp, list_le, &list_count);
+		break;
+	}
+
+	for (i = 0; i < opp->nb_cpus; i++) {
+		opp->dst[i].irqs = g_new(qemu_irq, OPENPIC_OUTPUT_NB);
+		for (j = 0; j < OPENPIC_OUTPUT_NB; j++) {
+			sysbus_init_irq(dev, &opp->dst[i].irqs[j]);
+		}
+	}
+
+	register_savevm(&opp->busdev.qdev, "openpic", 0, 2,
+			openpic_save, openpic_load, opp);
+
+	sysbus_init_mmio(dev, &opp->mem);
+	qdev_init_gpio_in(&dev->qdev, openpic_set_irq, opp->max_irq);
+
+	return 0;
+}
+
+static Property openpic_properties[] = {
+	DEFINE_PROP_UINT32("model", OpenPICState, model,
+			   OPENPIC_MODEL_FSL_MPIC_20),
+	DEFINE_PROP_UINT32("nb_cpus", OpenPICState, nb_cpus, 1),
+	DEFINE_PROP_END_OF_LIST(),
+};
+
+static void openpic_class_init(ObjectClass * klass, void *data)
+{
+	DeviceClass *dc = DEVICE_CLASS(klass);
+	SysBusDeviceClass *k = SYS_BUS_DEVICE_CLASS(klass);
+
+	k->init = openpic_init;
+	dc->props = openpic_properties;
+	dc->reset = openpic_reset;
+}
+
+static const TypeInfo openpic_info = {
+	.name = "openpic",
+	.parent = TYPE_SYS_BUS_DEVICE,
+	.instance_size = sizeof(OpenPICState),
+	.class_init = openpic_class_init,
+};
+
+static void openpic_register_types(void)
+{
+	type_register_static(&openpic_info);
+}
+
+type_init(openpic_register_types)
-- 
1.7.9.5
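
As a reading aid for write_IRQreg_idr() in the patch above, here is a
minimal standalone sketch of how the writable IDR mask and the critical
destination mask are derived. It is not part of the patch. The helper
names idr_writable_mask() and idr_crit_destmask() are hypothetical, and
the values given for IDR_EP_SHIFT, IDR_EP and IDR_CI0_SHIFT are
assumptions, since their definitions lie outside the portion of the file
shown here. The sketch also assumes nb_cpus is between 1 and 15, so that
the normal and critical fields cannot overlap.

```c
#include <stdint.h>

/* Assumed bit positions; the patch uses these names, but their values
 * are defined outside the excerpt shown here.
 */
#define IDR_EP_SHIFT   31
#define IDR_EP         (1U << IDR_EP_SHIFT)
#define IDR_CI0_SHIFT  30

/* Writable IDR bits for nb_cpus CPUs (assumed 1..15).
 * has_crit mirrors the OPENPIC_FLAG_IDR_CRIT test in the patch.
 */
static uint32_t idr_writable_mask(unsigned int nb_cpus, int has_crit)
{
	uint32_t normal_mask = (1U << nb_cpus) - 1;
	uint32_t mask = normal_mask;

	if (has_crit) {
		/* Critical bits sit just below EP, one per CPU. */
		uint32_t crit_mask = normal_mask << (IDR_EP_SHIFT - nb_cpus);

		mask |= crit_mask | IDR_EP;
	}

	return mask;
}

/* Destination CPUs selected by the critical bits: CI0 is bit 30 and
 * maps to CPU 0, the next lower bit maps to CPU 1, and so on.
 */
static uint32_t idr_crit_destmask(uint32_t idr, unsigned int nb_cpus)
{
	uint32_t destmask = 0;
	unsigned int i;

	for (i = 0; i < nb_cpus; i++) {
		if (idr & (1U << (IDR_CI0_SHIFT - i)))
			destmask |= 1U << i;
	}

	return destmask;
}
```

Under these assumptions, two CPUs with critical routing enabled give a
writable mask of 0xE0000003, and an IDR value of 0x40000000 selects
CPU 0 as the critical-interrupt destination.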




* [RFC PATCH 4/6] kvm/ppc/mpic: remove some obviously unneeded code
  2013-02-14  5:49 ` Scott Wood
@ 2013-02-14  5:49   ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-02-14  5:49 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, Scott Wood

Remove some parts of the code that are obviously QEMU or Raven specific
before fixing style issues, to reduce the style issues that need to be
fixed.

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
 arch/powerpc/kvm/mpic.c |  344 -----------------------------------------------
 1 file changed, 344 deletions(-)

diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index 57655b9..d6d70a4 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -22,39 +22,6 @@
  * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
  * THE SOFTWARE.
  */
-/*
- *
- * Based on OpenPic implementations:
- * - Intel GW80314 I/O companion chip developer's manual
- * - Motorola MPC8245 & MPC8540 user manuals.
- * - Motorola MCP750 (aka Raven) programmer manual.
- * - Motorola Harrier programmer manuel
- *
- * Serial interrupts, as implemented in Raven chipset are not supported yet.
- *
- */
-#include "hw.h"
-#include "ppc/mac.h"
-#include "pci/pci.h"
-#include "openpic.h"
-#include "sysbus.h"
-#include "pci/msi.h"
-#include "qemu/bitops.h"
-#include "ppc.h"
-
-//#define DEBUG_OPENPIC
-
-#ifdef DEBUG_OPENPIC
-static const int debug_openpic = 1;
-#else
-static const int debug_openpic = 0;
-#endif
-
-#define DPRINTF(fmt, ...) do { \
-        if (debug_openpic) { \
-            printf(fmt , ## __VA_ARGS__); \
-        } \
-    } while (0)
 
 #define MAX_CPU     32
 #define MAX_SRC     256
@@ -82,21 +49,6 @@ static const int debug_openpic = 0;
 #define OPENPIC_CPU_REG_START        0x20000
 #define OPENPIC_CPU_REG_SIZE         0x100 + ((MAX_CPU - 1) * 0x1000)
 
-/* Raven */
-#define RAVEN_MAX_CPU      2
-#define RAVEN_MAX_EXT     48
-#define RAVEN_MAX_IRQ     64
-#define RAVEN_MAX_TMR      MAX_TMR
-#define RAVEN_MAX_IPI      MAX_IPI
-
-/* Interrupt definitions */
-#define RAVEN_FE_IRQ     (RAVEN_MAX_EXT)	/* Internal functional IRQ */
-#define RAVEN_ERR_IRQ    (RAVEN_MAX_EXT + 1)	/* Error IRQ */
-#define RAVEN_TMR_IRQ    (RAVEN_MAX_EXT + 2)	/* First timer IRQ */
-#define RAVEN_IPI_IRQ    (RAVEN_TMR_IRQ + RAVEN_MAX_TMR)	/* First IPI IRQ */
-/* First doorbell IRQ */
-#define RAVEN_DBL_IRQ    (RAVEN_IPI_IRQ + (RAVEN_MAX_CPU * RAVEN_MAX_IPI))
-
 typedef struct FslMpicInfo {
 	int max_ext;
 } FslMpicInfo;
@@ -138,44 +90,6 @@ static FslMpicInfo fsl_mpic_42 = {
 #define ILR_INTTGT_CINT   0x01	/* critical */
 #define ILR_INTTGT_MCP    0x02	/* machine check */
 
-/* The currently supported INTTGT values happen to be the same as QEMU's
- * openpic output codes, but don't depend on this.  The output codes
- * could change (unlikely, but...) or support could be added for
- * more INTTGT values.
- */
-static const int inttgt_output[][2] = {
-	{ILR_INTTGT_INT, OPENPIC_OUTPUT_INT},
-	{ILR_INTTGT_CINT, OPENPIC_OUTPUT_CINT},
-	{ILR_INTTGT_MCP, OPENPIC_OUTPUT_MCK},
-};
-
-static int inttgt_to_output(int inttgt)
-{
-	int i;
-
-	for (i = 0; i < ARRAY_SIZE(inttgt_output); i++) {
-		if (inttgt_output[i][0] == inttgt) {
-			return inttgt_output[i][1];
-		}
-	}
-
-	fprintf(stderr, "%s: unsupported inttgt %d\n", __func__, inttgt);
-	return OPENPIC_OUTPUT_INT;
-}
-
-static int output_to_inttgt(int output)
-{
-	int i;
-
-	for (i = 0; i < ARRAY_SIZE(inttgt_output); i++) {
-		if (inttgt_output[i][1] == output) {
-			return inttgt_output[i][0];
-		}
-	}
-
-	abort();
-}
-
 #define MSIIR_OFFSET       0x140
 #define MSIIR_SRS_SHIFT    29
 #define MSIIR_SRS_MASK     (0x7 << MSIIR_SRS_SHIFT)
@@ -1265,228 +1179,36 @@ static uint64_t openpic_cpu_read(void *opaque, hwaddr addr, unsigned len)
 	return openpic_cpu_read_internal(opaque, addr, (addr & 0x1f000) >> 12);
 }
 
-static const MemoryRegionOps openpic_glb_ops_le = {
-	.write = openpic_gbl_write,
-	.read = openpic_gbl_read,
-	.endianness = DEVICE_LITTLE_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
-};
-
 static const MemoryRegionOps openpic_glb_ops_be = {
 	.write = openpic_gbl_write,
 	.read = openpic_gbl_read,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
-};
-
-static const MemoryRegionOps openpic_tmr_ops_le = {
-	.write = openpic_tmr_write,
-	.read = openpic_tmr_read,
-	.endianness = DEVICE_LITTLE_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
 static const MemoryRegionOps openpic_tmr_ops_be = {
 	.write = openpic_tmr_write,
 	.read = openpic_tmr_read,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
-};
-
-static const MemoryRegionOps openpic_cpu_ops_le = {
-	.write = openpic_cpu_write,
-	.read = openpic_cpu_read,
-	.endianness = DEVICE_LITTLE_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
 static const MemoryRegionOps openpic_cpu_ops_be = {
 	.write = openpic_cpu_write,
 	.read = openpic_cpu_read,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
-};
-
-static const MemoryRegionOps openpic_src_ops_le = {
-	.write = openpic_src_write,
-	.read = openpic_src_read,
-	.endianness = DEVICE_LITTLE_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
 static const MemoryRegionOps openpic_src_ops_be = {
 	.write = openpic_src_write,
 	.read = openpic_src_read,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
 static const MemoryRegionOps openpic_msi_ops_be = {
 	.read = openpic_msi_read,
 	.write = openpic_msi_write,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
 static const MemoryRegionOps openpic_summary_ops_be = {
 	.read = openpic_summary_read,
 	.write = openpic_summary_write,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
-static void openpic_save_IRQ_queue(QEMUFile * f, IRQQueue * q)
-{
-	unsigned int i;
-
-	for (i = 0; i < ARRAY_SIZE(q->queue); i++) {
-		/* Always put the lower half of a 64-bit long first, in case we
-		 * restore on a 32-bit host.  The least significant bits correspond
-		 * to lower IRQ numbers in the bitmap.
-		 */
-		qemu_put_be32(f, (uint32_t) q->queue[i]);
-#if LONG_MAX > 0x7FFFFFFF
-		qemu_put_be32(f, (uint32_t) (q->queue[i] >> 32));
-#endif
-	}
-
-	qemu_put_sbe32s(f, &q->next);
-	qemu_put_sbe32s(f, &q->priority);
-}
-
-static void openpic_save(QEMUFile * f, void *opaque)
-{
-	OpenPICState *opp = (OpenPICState *) opaque;
-	unsigned int i;
-
-	qemu_put_be32s(f, &opp->gcr);
-	qemu_put_be32s(f, &opp->vir);
-	qemu_put_be32s(f, &opp->pir);
-	qemu_put_be32s(f, &opp->spve);
-	qemu_put_be32s(f, &opp->tfrr);
-
-	qemu_put_be32s(f, &opp->nb_cpus);
-
-	for (i = 0; i < opp->nb_cpus; i++) {
-		qemu_put_sbe32s(f, &opp->dst[i].ctpr);
-		openpic_save_IRQ_queue(f, &opp->dst[i].raised);
-		openpic_save_IRQ_queue(f, &opp->dst[i].servicing);
-		qemu_put_buffer(f, (uint8_t *) & opp->dst[i].outputs_active,
-				sizeof(opp->dst[i].outputs_active));
-	}
-
-	for (i = 0; i < MAX_TMR; i++) {
-		qemu_put_be32s(f, &opp->timers[i].tccr);
-		qemu_put_be32s(f, &opp->timers[i].tbcr);
-	}
-
-	for (i = 0; i < opp->max_irq; i++) {
-		qemu_put_be32s(f, &opp->src[i].ivpr);
-		qemu_put_be32s(f, &opp->src[i].idr);
-		qemu_get_be32s(f, &opp->src[i].destmask);
-		qemu_put_sbe32s(f, &opp->src[i].last_cpu);
-		qemu_put_sbe32s(f, &opp->src[i].pending);
-	}
-}
-
-static void openpic_load_IRQ_queue(QEMUFile * f, IRQQueue * q)
-{
-	unsigned int i;
-
-	for (i = 0; i < ARRAY_SIZE(q->queue); i++) {
-		unsigned long val;
-
-		val = qemu_get_be32(f);
-#if LONG_MAX > 0x7FFFFFFF
-		val <<= 32;
-		val |= qemu_get_be32(f);
-#endif
-
-		q->queue[i] = val;
-	}
-
-	qemu_get_sbe32s(f, &q->next);
-	qemu_get_sbe32s(f, &q->priority);
-}
-
-static int openpic_load(QEMUFile * f, void *opaque, int version_id)
-{
-	OpenPICState *opp = (OpenPICState *) opaque;
-	unsigned int i;
-
-	if (version_id != 1) {
-		return -EINVAL;
-	}
-
-	qemu_get_be32s(f, &opp->gcr);
-	qemu_get_be32s(f, &opp->vir);
-	qemu_get_be32s(f, &opp->pir);
-	qemu_get_be32s(f, &opp->spve);
-	qemu_get_be32s(f, &opp->tfrr);
-
-	qemu_get_be32s(f, &opp->nb_cpus);
-
-	for (i = 0; i < opp->nb_cpus; i++) {
-		qemu_get_sbe32s(f, &opp->dst[i].ctpr);
-		openpic_load_IRQ_queue(f, &opp->dst[i].raised);
-		openpic_load_IRQ_queue(f, &opp->dst[i].servicing);
-		qemu_get_buffer(f, (uint8_t *) & opp->dst[i].outputs_active,
-				sizeof(opp->dst[i].outputs_active));
-	}
-
-	for (i = 0; i < MAX_TMR; i++) {
-		qemu_get_be32s(f, &opp->timers[i].tccr);
-		qemu_get_be32s(f, &opp->timers[i].tbcr);
-	}
-
-	for (i = 0; i < opp->max_irq; i++) {
-		uint32_t val;
-
-		val = qemu_get_be32(f);
-		write_IRQreg_idr(opp, i, val);
-		val = qemu_get_be32(f);
-		write_IRQreg_ivpr(opp, i, val);
-
-		qemu_get_be32s(f, &opp->src[i].ivpr);
-		qemu_get_be32s(f, &opp->src[i].idr);
-		qemu_get_be32s(f, &opp->src[i].destmask);
-		qemu_get_sbe32s(f, &opp->src[i].last_cpu);
-		qemu_get_sbe32s(f, &opp->src[i].pending);
-	}
-
-	return 0;
-}
-
 typedef struct MemReg {
 	const char *name;
 	MemoryRegionOps const *ops;
@@ -1614,73 +1336,7 @@ static int openpic_init(SysBusDevice * dev)
 		map_list(opp, list_fsl, &list_count);
 
 		break;
-
-	case OPENPIC_MODEL_RAVEN:
-		opp->nb_irqs = RAVEN_MAX_EXT;
-		opp->vid = VID_REVISION_1_3;
-		opp->vir = VIR_GENERIC;
-		opp->vector_mask = 0xFF;
-		opp->tfrr_reset = 4160000;
-		opp->ivpr_reset = IVPR_MASK_MASK | IVPR_MODE_MASK;
-		opp->idr_reset = 0;
-		opp->max_irq = RAVEN_MAX_IRQ;
-		opp->irq_ipi0 = RAVEN_IPI_IRQ;
-		opp->irq_tim0 = RAVEN_TMR_IRQ;
-		opp->brr1 = -1;
-		opp->mpic_mode_mask = GCR_MODE_MIXED;
-
-		/* Only UP supported today */
-		if (opp->nb_cpus != 1) {
-			return -EINVAL;
-		}
-
-		map_list(opp, list_le, &list_count);
-		break;
-	}
-
-	for (i = 0; i < opp->nb_cpus; i++) {
-		opp->dst[i].irqs = g_new(qemu_irq, OPENPIC_OUTPUT_NB);
-		for (j = 0; j < OPENPIC_OUTPUT_NB; j++) {
-			sysbus_init_irq(dev, &opp->dst[i].irqs[j]);
-		}
 	}
 
-	register_savevm(&opp->busdev.qdev, "openpic", 0, 2,
-			openpic_save, openpic_load, opp);
-
-	sysbus_init_mmio(dev, &opp->mem);
-	qdev_init_gpio_in(&dev->qdev, openpic_set_irq, opp->max_irq);
-
 	return 0;
 }
-
-static Property openpic_properties[] = {
-	DEFINE_PROP_UINT32("model", OpenPICState, model,
-			   OPENPIC_MODEL_FSL_MPIC_20),
-	DEFINE_PROP_UINT32("nb_cpus", OpenPICState, nb_cpus, 1),
-	DEFINE_PROP_END_OF_LIST(),
-};
-
-static void openpic_class_init(ObjectClass * klass, void *data)
-{
-	DeviceClass *dc = DEVICE_CLASS(klass);
-	SysBusDeviceClass *k = SYS_BUS_DEVICE_CLASS(klass);
-
-	k->init = openpic_init;
-	dc->props = openpic_properties;
-	dc->reset = openpic_reset;
-}
-
-static const TypeInfo openpic_info = {
-	.name = "openpic",
-	.parent = TYPE_SYS_BUS_DEVICE,
-	.instance_size = sizeof(OpenPICState),
-	.class_init = openpic_class_init,
-};
-
-static void openpic_register_types(void)
-{
-	type_register_static(&openpic_info);
-}
-
-type_init(openpic_register_types)
-- 
1.7.9.5



^ permalink raw reply related	[flat|nested] 261+ messages in thread

* [RFC PATCH 5/6] kvm/ppc/mpic: adapt to kernel style and environment
  2013-02-14  5:49 ` Scott Wood
@ 2013-02-14  5:49   ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-02-14  5:49 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, Scott Wood

Remove braces that Linux style doesn't permit, remove space after
'*' that Lindent added, keep error/debug strings contiguous, etc.

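Substitute kernel type names (gpa_t for hwaddr, struct tags for the QEMU
typedefs), switch debug and error prints to pr_debug()/pr_err(), etc.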

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
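A note on the OPENPIC_CPU_REG_SIZE hunk below: wrapping the expansion in
parentheses is more than a style fix, since an unparenthesized macro body
can bind wrongly when the macro is used inside a larger expression.  The
following is a minimal standalone sketch, not part of the patch.  MAX_CPU
and the size expression are copied from mpic.c; the CPU_REG_SIZE_* names
and the two_windows_*() and check_sizes() helpers are hypothetical and
exist only for illustration.

```c
#include <assert.h>

#define MAX_CPU 32

/* Pre-patch form: the expansion is not parenthesized. */
#define CPU_REG_SIZE_UNSAFE	0x100 + ((MAX_CPU - 1) * 0x1000)

/* Post-patch form, matching the new OPENPIC_CPU_REG_SIZE definition. */
#define CPU_REG_SIZE_SAFE	(0x100 + ((MAX_CPU - 1) * 0x1000))

/* Hypothetical helper: size of two back-to-back register windows. */
static unsigned long two_windows_unsafe(void)
{
	/* Expands to 2 * 0x100 + (31 * 0x1000): only 0x100 is doubled. */
	return 2 * CPU_REG_SIZE_UNSAFE;
}

static unsigned long two_windows_safe(void)
{
	/* Expands to 2 * (0x100 + (31 * 0x1000)): the whole size is doubled. */
	return 2 * CPU_REG_SIZE_SAFE;
}

/* Sanity checks; call from a test harness or from main(). */
static void check_sizes(void)
{
	assert(CPU_REG_SIZE_SAFE == 0x1f100);
	assert(two_windows_safe() == 0x3e200UL);
	assert(two_windows_unsafe() == 0x1f200UL);
}
```

Whether any current user of the macro is affected is not claimed here; the
sketch only shows why the parenthesized form is the safer one.
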
 arch/powerpc/kvm/mpic.c |  445 ++++++++++++++++++++++-------------------------
 1 file changed, 208 insertions(+), 237 deletions(-)

diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index d6d70a4..1df67ae 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -42,22 +42,22 @@
 #define OPENPIC_TMR_REG_SIZE         0x220
 #define OPENPIC_MSI_REG_START        0x1600
 #define OPENPIC_MSI_REG_SIZE         0x200
-#define OPENPIC_SUMMARY_REG_START   0x3800
-#define OPENPIC_SUMMARY_REG_SIZE    0x800
+#define OPENPIC_SUMMARY_REG_START    0x3800
+#define OPENPIC_SUMMARY_REG_SIZE     0x800
 #define OPENPIC_SRC_REG_START        0x10000
 #define OPENPIC_SRC_REG_SIZE         (MAX_SRC * 0x20)
 #define OPENPIC_CPU_REG_START        0x20000
-#define OPENPIC_CPU_REG_SIZE         0x100 + ((MAX_CPU - 1) * 0x1000)
+#define OPENPIC_CPU_REG_SIZE         (0x100 + ((MAX_CPU - 1) * 0x1000))
 
-typedef struct FslMpicInfo {
+struct fsl_mpic_info {
 	int max_ext;
-} FslMpicInfo;
+};
 
-static FslMpicInfo fsl_mpic_20 = {
+static struct fsl_mpic_info fsl_mpic_20 = {
 	.max_ext = 12,
 };
 
-static FslMpicInfo fsl_mpic_42 = {
+static struct fsl_mpic_info fsl_mpic_42 = {
 	.max_ext = 12,
 };
 
@@ -100,44 +100,43 @@ static int get_current_cpu(void)
 {
 	CPUState *cpu_single_cpu;
 
-	if (!cpu_single_env) {
+	if (!cpu_single_env)
 		return -1;
-	}
 
 	cpu_single_cpu = ENV_GET_CPU(cpu_single_env);
 	return cpu_single_cpu->cpu_index;
 }
 
-static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx);
-static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
+static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx);
+static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
 				       uint32_t val, int idx);
 
-typedef enum IRQType {
+enum irq_type {
 	IRQ_TYPE_NORMAL = 0,
 	IRQ_TYPE_FSLINT,	/* FSL internal interrupt -- level only */
 	IRQ_TYPE_FSLSPECIAL,	/* FSL timer/IPI interrupt, edge, no polarity */
-} IRQType;
+};
 
-typedef struct IRQQueue {
+struct irq_queue {
 	/* Round up to the nearest 64 IRQs so that the queue length
 	 * won't change when moving between 32 and 64 bit hosts.
 	 */
 	unsigned long queue[BITS_TO_LONGS((MAX_IRQ + 63) & ~63)];
 	int next;
 	int priority;
-} IRQQueue;
+};
 
-typedef struct IRQSource {
+struct irq_source {
 	uint32_t ivpr;		/* IRQ vector/priority register */
 	uint32_t idr;		/* IRQ destination register */
 	uint32_t destmask;	/* bitmap of CPU destinations */
 	int last_cpu;
 	int output;		/* IRQ level, e.g. OPENPIC_OUTPUT_INT */
 	int pending;		/* TRUE if IRQ is pending */
-	IRQType type;
+	enum irq_type type;
 	bool level:1;		/* level-triggered */
-	bool nomask:1;		/* critical interrupts ignore mask on some FSL MPICs */
-} IRQSource;
+	bool nomask:1;	/* critical interrupts ignore mask on some FSL MPICs */
+};
 
 #define IVPR_MASK_SHIFT       31
 #define IVPR_MASK_MASK        (1 << IVPR_MASK_SHIFT)
@@ -158,22 +157,19 @@ typedef struct IRQSource {
 #define IDR_EP      0x80000000	/* external pin */
 #define IDR_CI      0x40000000	/* critical interrupt */
 
-typedef struct IRQDest {
+struct irq_dest {
 	int32_t ctpr;		/* CPU current task priority */
-	IRQQueue raised;
-	IRQQueue servicing;
+	struct irq_queue raised;
+	struct irq_queue servicing;
 	qemu_irq *irqs;
 
 	/* Count of IRQ sources asserting on non-INT outputs */
 	uint32_t outputs_active[OPENPIC_OUTPUT_NB];
-} IRQDest;
-
-typedef struct OpenPICState {
-	SysBusDevice busdev;
-	MemoryRegion mem;
+};
 
+struct openpic {
 	/* Behavior control */
-	FslMpicInfo *fsl;
+	struct fsl_mpic_info *fsl;
 	uint32_t model;
 	uint32_t flags;
 	uint32_t nb_irqs;
@@ -186,9 +182,6 @@ typedef struct OpenPICState {
 	uint32_t brr1;
 	uint32_t mpic_mode_mask;
 
-	/* Sub-regions */
-	MemoryRegion sub_io_mem[6];
-
 	/* Global registers */
 	uint32_t frr;		/* Feature reporting register */
 	uint32_t gcr;		/* Global configuration register  */
@@ -196,9 +189,9 @@ typedef struct OpenPICState {
 	uint32_t spve;		/* Spurious vector register */
 	uint32_t tfrr;		/* Timer frequency reporting register */
 	/* Source registers */
-	IRQSource src[MAX_IRQ];
+	struct irq_source src[MAX_IRQ];
 	/* Local registers per output pin */
-	IRQDest dst[MAX_CPU];
+	struct irq_dest dst[MAX_CPU];
 	uint32_t nb_cpus;
 	/* Timer registers */
 	struct {
@@ -213,24 +206,24 @@ typedef struct OpenPICState {
 	uint32_t irq_ipi0;
 	uint32_t irq_tim0;
 	uint32_t irq_msi;
-} OpenPICState;
+};
 
-static inline void IRQ_setbit(IRQQueue * q, int n_IRQ)
+static inline void IRQ_setbit(struct irq_queue *q, int n_IRQ)
 {
 	set_bit(n_IRQ, q->queue);
 }
 
-static inline void IRQ_resetbit(IRQQueue * q, int n_IRQ)
+static inline void IRQ_resetbit(struct irq_queue *q, int n_IRQ)
 {
 	clear_bit(n_IRQ, q->queue);
 }
 
-static inline int IRQ_testbit(IRQQueue * q, int n_IRQ)
+static inline int IRQ_testbit(struct irq_queue *q, int n_IRQ)
 {
 	return test_bit(n_IRQ, q->queue);
 }
 
-static void IRQ_check(OpenPICState * opp, IRQQueue * q)
+static void IRQ_check(struct openpic *opp, struct irq_queue *q)
 {
 	int irq = -1;
 	int next = -1;
@@ -238,11 +231,10 @@ static void IRQ_check(OpenPICState * opp, IRQQueue * q)
 
 	for (;;) {
 		irq = find_next_bit(q->queue, opp->max_irq, irq + 1);
-		if (irq == opp->max_irq) {
+		if (irq == opp->max_irq)
 			break;
-		}
 
-		DPRINTF("IRQ_check: irq %d set ivpr_pr=%d pr=%d\n",
+		pr_debug("IRQ_check: irq %d set ivpr_pr=%d pr=%d\n",
 			irq, IVPR_PRIORITY(opp->src[irq].ivpr), priority);
 
 		if (IVPR_PRIORITY(opp->src[irq].ivpr) > priority) {
@@ -255,7 +247,7 @@ static void IRQ_check(OpenPICState * opp, IRQQueue * q)
 	q->priority = priority;
 }
 
-static int IRQ_get_next(OpenPICState * opp, IRQQueue * q)
+static int IRQ_get_next(struct openpic *opp, struct irq_queue *q)
 {
 	/* XXX: optimize */
 	IRQ_check(opp, q);
@@ -263,21 +255,21 @@ static int IRQ_get_next(OpenPICState * opp, IRQQueue * q)
 	return q->next;
 }
 
-static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
+static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 			   bool active, bool was_active)
 {
-	IRQDest *dst;
-	IRQSource *src;
+	struct irq_dest *dst;
+	struct irq_source *src;
 	int priority;
 
 	dst = &opp->dst[n_CPU];
 	src = &opp->src[n_IRQ];
 
-	DPRINTF("%s: IRQ %d active %d was %d\n",
+	pr_debug("%s: IRQ %d active %d was %d\n",
 		__func__, n_IRQ, active, was_active);
 
 	if (src->output != OPENPIC_OUTPUT_INT) {
-		DPRINTF("%s: output %d irq %d active %d was %d count %d\n",
+		pr_debug("%s: output %d irq %d active %d was %d count %d\n",
 			__func__, src->output, n_IRQ, active, was_active,
 			dst->outputs_active[src->output]);
 
@@ -286,19 +278,17 @@ static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
 		 * masking.
 		 */
 		if (active) {
-			if (!was_active
-			    && dst->outputs_active[src->output]++ == 0) {
-				DPRINTF
-				    ("%s: Raise OpenPIC output %d cpu %d irq %d\n",
-				     __func__, src->output, n_CPU, n_IRQ);
+			if (!was_active &&
+			    dst->outputs_active[src->output]++ == 0) {
+				pr_debug("%s: Raise OpenPIC output %d cpu %d irq %d\n",
+					__func__, src->output, n_CPU, n_IRQ);
 				qemu_irq_raise(dst->irqs[src->output]);
 			}
 		} else {
-			if (was_active
-			    && --dst->outputs_active[src->output] == 0) {
-				DPRINTF
-				    ("%s: Lower OpenPIC output %d cpu %d irq %d\n",
-				     __func__, src->output, n_CPU, n_IRQ);
+			if (was_active &&
+			    --dst->outputs_active[src->output] == 0) {
+				pr_debug("%s: Lower OpenPIC output %d cpu %d irq %d\n",
+					__func__, src->output, n_CPU, n_IRQ);
 				qemu_irq_lower(dst->irqs[src->output]);
 			}
 		}
@@ -311,31 +301,27 @@ static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
 	/* Even if the interrupt doesn't have enough priority,
 	 * it is still raised, in case ctpr is lowered later.
 	 */
-	if (active) {
+	if (active)
 		IRQ_setbit(&dst->raised, n_IRQ);
-	} else {
+	else
 		IRQ_resetbit(&dst->raised, n_IRQ);
-	}
 
 	IRQ_check(opp, &dst->raised);
 
 	if (active && priority <= dst->ctpr) {
-		DPRINTF
-		    ("%s: IRQ %d priority %d too low for ctpr %d on CPU %d\n",
-		     __func__, n_IRQ, priority, dst->ctpr, n_CPU);
+		pr_debug("%s: IRQ %d priority %d too low for ctpr %d on CPU %d\n",
+			__func__, n_IRQ, priority, dst->ctpr, n_CPU);
 		active = 0;
 	}
 
 	if (active) {
 		if (IRQ_get_next(opp, &dst->servicing) >= 0 &&
 		    priority <= dst->servicing.priority) {
-			DPRINTF
-			    ("%s: IRQ %d is hidden by servicing IRQ %d on CPU %d\n",
-			     __func__, n_IRQ, dst->servicing.next, n_CPU);
+			pr_debug("%s: IRQ %d is hidden by servicing IRQ %d on CPU %d\n",
+				__func__, n_IRQ, dst->servicing.next, n_CPU);
 		} else {
-			DPRINTF
-			    ("%s: Raise OpenPIC INT output cpu %d irq %d/%d\n",
-			     __func__, n_CPU, n_IRQ, dst->raised.next);
+			pr_debug("%s: Raise OpenPIC INT output cpu %d irq %d/%d\n",
+				__func__, n_CPU, n_IRQ, dst->raised.next);
 			qemu_irq_raise(opp->dst[n_CPU].
 				       irqs[OPENPIC_OUTPUT_INT]);
 		}
@@ -343,17 +329,15 @@ static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
 		IRQ_get_next(opp, &dst->servicing);
 		if (dst->raised.priority > dst->ctpr &&
 		    dst->raised.priority > dst->servicing.priority) {
-			DPRINTF
-			    ("%s: IRQ %d inactive, IRQ %d prio %d above %d/%d, CPU %d\n",
-			     __func__, n_IRQ, dst->raised.next,
-			     dst->raised.priority, dst->ctpr,
-			     dst->servicing.priority, n_CPU);
+			pr_debug("%s: IRQ %d inactive, IRQ %d prio %d above %d/%d, CPU %d\n",
+				__func__, n_IRQ, dst->raised.next,
+				dst->raised.priority, dst->ctpr,
+				dst->servicing.priority, n_CPU);
 			/* IRQ line stays asserted */
 		} else {
-			DPRINTF
-			    ("%s: IRQ %d inactive, current prio %d/%d, CPU %d\n",
-			     __func__, n_IRQ, dst->ctpr,
-			     dst->servicing.priority, n_CPU);
+			pr_debug("%s: IRQ %d inactive, current prio %d/%d, CPU %d\n",
+				__func__, n_IRQ, dst->ctpr,
+				dst->servicing.priority, n_CPU);
 			qemu_irq_lower(opp->dst[n_CPU].
 				       irqs[OPENPIC_OUTPUT_INT]);
 		}
@@ -361,9 +345,9 @@ static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
 }
 
 /* update pic state because registers for n_IRQ have changed value */
-static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
+static void openpic_update_irq(struct openpic *opp, int n_IRQ)
 {
-	IRQSource *src;
+	struct irq_source *src;
 	bool active, was_active;
 	int i;
 
@@ -372,30 +356,29 @@ static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
 
 	if ((src->ivpr & IVPR_MASK_MASK) && !src->nomask) {
 		/* Interrupt source is disabled */
-		DPRINTF("%s: IRQ %d is disabled\n", __func__, n_IRQ);
+		pr_debug("%s: IRQ %d is disabled\n", __func__, n_IRQ);
 		active = false;
 	}
 
-	was_active = ! !(src->ivpr & IVPR_ACTIVITY_MASK);
+	was_active = !!(src->ivpr & IVPR_ACTIVITY_MASK);
 
 	/*
 	 * We don't have a similar check for already-active because
 	 * ctpr may have changed and we need to withdraw the interrupt.
 	 */
 	if (!active && !was_active) {
-		DPRINTF("%s: IRQ %d is already inactive\n", __func__, n_IRQ);
+		pr_debug("%s: IRQ %d is already inactive\n", __func__, n_IRQ);
 		return;
 	}
 
-	if (active) {
+	if (active)
 		src->ivpr |= IVPR_ACTIVITY_MASK;
-	} else {
+	else
 		src->ivpr &= ~IVPR_ACTIVITY_MASK;
-	}
 
 	if (src->destmask == 0) {
 		/* No target */
-		DPRINTF("%s: IRQ %d has no target\n", __func__, n_IRQ);
+		pr_debug("%s: IRQ %d has no target\n", __func__, n_IRQ);
 		return;
 	}
 
@@ -413,9 +396,9 @@ static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
 	} else {
 		/* Distributed delivery mode */
 		for (i = src->last_cpu + 1; i != src->last_cpu; i++) {
-			if (i == opp->nb_cpus) {
+			if (i == opp->nb_cpus)
 				i = 0;
-			}
+
 			if (src->destmask & (1 << i)) {
 				IRQ_local_pipe(opp, i, n_IRQ, active,
 					       was_active);
@@ -428,16 +411,16 @@ static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
 
 static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 {
-	OpenPICState *opp = opaque;
-	IRQSource *src;
+	struct openpic *opp = opaque;
+	struct irq_source *src;
 
 	if (n_IRQ >= MAX_IRQ) {
-		fprintf(stderr, "%s: IRQ %d out of range\n", __func__, n_IRQ);
+		pr_err("%s: IRQ %d out of range\n", __func__, n_IRQ);
 		abort();
 	}
 
 	src = &opp->src[n_IRQ];
-	DPRINTF("openpic: set irq %d = %d ivpr=0x%08x\n",
+	pr_debug("openpic: set irq %d = %d ivpr=0x%08x\n",
 		n_IRQ, level, src->ivpr);
 	if (src->level) {
 		/* level-sensitive irq */
@@ -463,9 +446,9 @@ static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 	}
 }
 
-static void openpic_reset(DeviceState * d)
+static void openpic_reset(DeviceState *d)
 {
-	OpenPICState *opp = FROM_SYSBUS(typeof(*opp), SYS_BUS_DEVICE(d));
+	struct openpic *opp = FROM_SYSBUS(typeof(*opp), SYS_BUS_DEVICE(d));
 	int i;
 
 	opp->gcr = GCR_RESET;
@@ -485,7 +468,7 @@ static void openpic_reset(DeviceState * d)
 		switch (opp->src[i].type) {
 		case IRQ_TYPE_NORMAL:
 			opp->src[i].level =
-			    ! !(opp->ivpr_reset & IVPR_SENSE_MASK);
+			    !!(opp->ivpr_reset & IVPR_SENSE_MASK);
 			break;
 
 		case IRQ_TYPE_FSLINT:
@@ -499,9 +482,9 @@ static void openpic_reset(DeviceState * d)
 	/* Initialise IRQ destinations */
 	for (i = 0; i < MAX_CPU; i++) {
 		opp->dst[i].ctpr = 15;
-		memset(&opp->dst[i].raised, 0, sizeof(IRQQueue));
+		memset(&opp->dst[i].raised, 0, sizeof(struct irq_queue));
 		opp->dst[i].raised.next = -1;
-		memset(&opp->dst[i].servicing, 0, sizeof(IRQQueue));
+		memset(&opp->dst[i].servicing, 0, sizeof(struct irq_queue));
 		opp->dst[i].servicing.next = -1;
 	}
 	/* Initialise timers */
@@ -513,28 +496,28 @@ static void openpic_reset(DeviceState * d)
 	opp->gcr = 0;
 }
 
-static inline uint32_t read_IRQreg_idr(OpenPICState * opp, int n_IRQ)
+static inline uint32_t read_IRQreg_idr(struct openpic *opp, int n_IRQ)
 {
 	return opp->src[n_IRQ].idr;
 }
 
-static inline uint32_t read_IRQreg_ilr(OpenPICState * opp, int n_IRQ)
+static inline uint32_t read_IRQreg_ilr(struct openpic *opp, int n_IRQ)
 {
-	if (opp->flags & OPENPIC_FLAG_ILR) {
+	if (opp->flags & OPENPIC_FLAG_ILR)
 		return output_to_inttgt(opp->src[n_IRQ].output);
-	}
 
 	return 0xffffffff;
 }
 
-static inline uint32_t read_IRQreg_ivpr(OpenPICState * opp, int n_IRQ)
+static inline uint32_t read_IRQreg_ivpr(struct openpic *opp, int n_IRQ)
 {
 	return opp->src[n_IRQ].ivpr;
 }
 
-static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
+static inline void write_IRQreg_idr(struct openpic *opp, int n_IRQ,
+				    uint32_t val)
 {
-	IRQSource *src = &opp->src[n_IRQ];
+	struct irq_source *src = &opp->src[n_IRQ];
 	uint32_t normal_mask = (1UL << opp->nb_cpus) - 1;
 	uint32_t crit_mask = 0;
 	uint32_t mask = normal_mask;
@@ -547,14 +530,13 @@ static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
 	}
 
 	src->idr = val & mask;
-	DPRINTF("Set IDR %d to 0x%08x\n", n_IRQ, src->idr);
+	pr_debug("Set IDR %d to 0x%08x\n", n_IRQ, src->idr);
 
 	if (opp->flags & OPENPIC_FLAG_IDR_CRIT) {
 		if (src->idr & crit_mask) {
 			if (src->idr & normal_mask) {
-				DPRINTF
-				    ("%s: IRQ configured for multiple output types, using "
-				     "critical\n", __func__);
+				pr_debug("%s: IRQ configured for multiple output types, using critical\n",
+					__func__);
 			}
 
 			src->output = OPENPIC_OUTPUT_CINT;
@@ -564,9 +546,8 @@ static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
 			for (i = 0; i < opp->nb_cpus; i++) {
 				int n_ci = IDR_CI0_SHIFT - i;
 
-				if (src->idr & (1UL << n_ci)) {
+				if (src->idr & (1UL << n_ci))
 					src->destmask |= 1UL << i;
-				}
 			}
 		} else {
 			src->output = OPENPIC_OUTPUT_INT;
@@ -578,20 +559,21 @@ static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
 	}
 }
 
-static inline void write_IRQreg_ilr(OpenPICState * opp, int n_IRQ, uint32_t val)
+static inline void write_IRQreg_ilr(struct openpic *opp, int n_IRQ,
+				    uint32_t val)
 {
 	if (opp->flags & OPENPIC_FLAG_ILR) {
-		IRQSource *src = &opp->src[n_IRQ];
+		struct irq_source *src = &opp->src[n_IRQ];
 
 		src->output = inttgt_to_output(val & ILR_INTTGT_MASK);
-		DPRINTF("Set ILR %d to 0x%08x, output %d\n", n_IRQ, src->idr,
+		pr_debug("Set ILR %d to 0x%08x, output %d\n", n_IRQ, src->idr,
 			src->output);
 
 		/* TODO: on MPIC v4.0 only, set nomask for non-INT */
 	}
 }
 
-static inline void write_IRQreg_ivpr(OpenPICState * opp, int n_IRQ,
+static inline void write_IRQreg_ivpr(struct openpic *opp, int n_IRQ,
 				     uint32_t val)
 {
 	uint32_t mask;
@@ -613,7 +595,7 @@ static inline void write_IRQreg_ivpr(OpenPICState * opp, int n_IRQ,
 	switch (opp->src[n_IRQ].type) {
 	case IRQ_TYPE_NORMAL:
 		opp->src[n_IRQ].level =
-		    ! !(opp->src[n_IRQ].ivpr & IVPR_SENSE_MASK);
+		    !!(opp->src[n_IRQ].ivpr & IVPR_SENSE_MASK);
 		break;
 
 	case IRQ_TYPE_FSLINT:
@@ -626,11 +608,11 @@ static inline void write_IRQreg_ivpr(OpenPICState * opp, int n_IRQ,
 	}
 
 	openpic_update_irq(opp, n_IRQ);
-	DPRINTF("Set IVPR %d to 0x%08x -> 0x%08x\n", n_IRQ, val,
+	pr_debug("Set IVPR %d to 0x%08x -> 0x%08x\n", n_IRQ, val,
 		opp->src[n_IRQ].ivpr);
 }
 
-static void openpic_gcr_write(OpenPICState * opp, uint64_t val)
+static void openpic_gcr_write(struct openpic *opp, uint64_t val)
 {
 	bool mpic_proxy = false;
 
@@ -643,27 +625,26 @@ static void openpic_gcr_write(OpenPICState * opp, uint64_t val)
 	opp->gcr |= val & opp->mpic_mode_mask;
 
 	/* Set external proxy mode */
-	if ((val & opp->mpic_mode_mask) == GCR_MODE_PROXY) {
+	if ((val & opp->mpic_mode_mask) == GCR_MODE_PROXY)
 		mpic_proxy = true;
-	}
 
 	ppce500_set_mpic_proxy(mpic_proxy);
 }
 
-static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_gbl_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned len)
 {
-	OpenPICState *opp = opaque;
-	IRQDest *dst;
+	struct openpic *opp = opaque;
+	struct irq_dest *dst;
 	int idx;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
 		__func__, addr, val);
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return;
-	}
+
 	switch (addr) {
-	case 0x00:		/* Block Revision Register1 (BRR1) is Readonly */
+	case 0x00:	/* Block Revision Register1 (BRR1) is Readonly */
 		break;
 	case 0x40:
 	case 0x50:
@@ -685,16 +666,14 @@ static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
 	case 0x1090:		/* PIR */
 		for (idx = 0; idx < opp->nb_cpus; idx++) {
 			if ((val & (1 << idx)) && !(opp->pir & (1 << idx))) {
-				DPRINTF
-				    ("Raise OpenPIC RESET output for CPU %d\n",
-				     idx);
+				pr_debug("Raise OpenPIC RESET output for CPU %d\n",
+					idx);
 				dst = &opp->dst[idx];
 				qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_RESET]);
-			} else if (!(val & (1 << idx))
-				   && (opp->pir & (1 << idx))) {
-				DPRINTF
-				    ("Lower OpenPIC RESET output for CPU %d\n",
-				     idx);
+			} else if (!(val & (1 << idx)) &&
+				   (opp->pir & (1 << idx))) {
+				pr_debug("Lower OpenPIC RESET output for CPU %d\n",
+					idx);
 				dst = &opp->dst[idx];
 				qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_RESET]);
 			}
@@ -704,13 +683,12 @@ static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
 	case 0x10A0:		/* IPI_IVPR */
 	case 0x10B0:
 	case 0x10C0:
-	case 0x10D0:
-		{
-			int idx;
-			idx = (addr - 0x10A0) >> 4;
-			write_IRQreg_ivpr(opp, opp->irq_ipi0 + idx, val);
-		}
+	case 0x10D0: {
+		int idx;
+		idx = (addr - 0x10A0) >> 4;
+		write_IRQreg_ivpr(opp, opp->irq_ipi0 + idx, val);
 		break;
+	}
 	case 0x10E0:		/* SPVE */
 		opp->spve = val & opp->vector_mask;
 		break;
@@ -719,16 +697,16 @@ static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
 	}
 }
 
-static uint64_t openpic_gbl_read(void *opaque, hwaddr addr, unsigned len)
+static uint64_t openpic_gbl_read(void *opaque, gpa_t addr, unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	uint32_t retval;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
 	retval = 0xFFFFFFFF;
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return retval;
-	}
+
 	switch (addr) {
 	case 0x1000:		/* FRR */
 		retval = opp->frr;
@@ -772,24 +750,23 @@ static uint64_t openpic_gbl_read(void *opaque, hwaddr addr, unsigned len)
 	default:
 		break;
 	}
-	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
 	return retval;
 }
 
-static void openpic_tmr_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_tmr_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	int idx;
 
 	addr += 0x10f0;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
 		__func__, addr, val);
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return;
-	}
 
 	if (addr == 0x10f0) {
 		/* TFRR */
@@ -806,9 +783,9 @@ static void openpic_tmr_write(void *opaque, hwaddr addr, uint64_t val,
 	case 0x10:		/* TBCR */
 		if ((opp->timers[idx].tccr & TCCR_TOG) != 0 &&
 		    (val & TBCR_CI) == 0 &&
-		    (opp->timers[idx].tbcr & TBCR_CI) != 0) {
+		    (opp->timers[idx].tbcr & TBCR_CI) != 0)
 			opp->timers[idx].tccr &= ~TCCR_TOG;
-		}
+
 		opp->timers[idx].tbcr = val;
 		break;
 	case 0x20:		/* TVPR */
@@ -820,16 +797,16 @@ static void openpic_tmr_write(void *opaque, hwaddr addr, uint64_t val,
 	}
 }
 
-static uint64_t openpic_tmr_read(void *opaque, hwaddr addr, unsigned len)
+static uint64_t openpic_tmr_read(void *opaque, gpa_t addr, unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	uint32_t retval = -1;
 	int idx;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
-	if (addr & 0xF) {
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	if (addr & 0xF)
 		goto out;
-	}
+
 	idx = (addr >> 6) & 0x3;
 	if (addr == 0x0) {
 		/* TFRR */
@@ -852,18 +829,18 @@ static uint64_t openpic_tmr_read(void *opaque, hwaddr addr, unsigned len)
 	}
 
 out:
-	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
 	return retval;
 }
 
-static void openpic_src_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_src_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	int idx;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
 		__func__, addr, val);
 
 	addr = addr & 0xffff;
@@ -884,11 +861,11 @@ static void openpic_src_write(void *opaque, hwaddr addr, uint64_t val,
 
 static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	uint32_t retval;
 	int idx;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
 	retval = 0xFFFFFFFF;
 
 	addr = addr & 0xffff;
@@ -906,22 +883,21 @@ static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
 		break;
 	}
 
-	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
 	return retval;
 }
 
-static void openpic_msi_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_msi_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned size)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	int idx = opp->irq_msi;
 	int srs, ibs;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
 		__func__, addr, val);
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return;
-	}
 
 	switch (addr) {
 	case MSIIR_OFFSET:
@@ -937,16 +913,15 @@ static void openpic_msi_write(void *opaque, hwaddr addr, uint64_t val,
 	}
 }
 
-static uint64_t openpic_msi_read(void *opaque, hwaddr addr, unsigned size)
+static uint64_t openpic_msi_read(void *opaque, gpa_t addr, unsigned size)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	uint64_t r = 0;
 	int i, srs;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
-	if (addr & 0xF) {
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	if (addr & 0xF)
 		return -1;
-	}
 
 	srs = addr >> 4;
 
@@ -965,53 +940,51 @@ static uint64_t openpic_msi_read(void *opaque, hwaddr addr, unsigned size)
 		openpic_set_irq(opp, opp->irq_msi + srs, 0);
 		break;
 	case 0x120:		/* MSISR */
-		for (i = 0; i < MAX_MSI; i++) {
+		for (i = 0; i < MAX_MSI; i++)
 			r |= (opp->msi[i].msir ? 1 : 0) << i;
-		}
 		break;
 	}
 
 	return r;
 }
 
-static uint64_t openpic_summary_read(void *opaque, hwaddr addr, unsigned size)
+static uint64_t openpic_summary_read(void *opaque, gpa_t addr, unsigned size)
 {
 	uint64_t r = 0;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
 
 	/* TODO: EISR/EIMR */
 
 	return r;
 }
 
-static void openpic_summary_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_summary_write(void *opaque, gpa_t addr, uint64_t val,
 				  unsigned size)
 {
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
 		__func__, addr, val);
 
 	/* TODO: EISR/EIMR */
 }
 
-static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
+static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
 				       uint32_t val, int idx)
 {
-	OpenPICState *opp = opaque;
-	IRQSource *src;
-	IRQDest *dst;
+	struct openpic *opp = opaque;
+	struct irq_source *src;
+	struct irq_dest *dst;
 	int s_IRQ, n_IRQ;
 
-	DPRINTF("%s: cpu %d addr %#" HWADDR_PRIx " <= 0x%08x\n", __func__, idx,
+	pr_debug("%s: cpu %d addr %#" HWADDR_PRIx " <= 0x%08x\n", __func__, idx,
 		addr, val);
 
-	if (idx < 0) {
+	if (idx < 0)
 		return;
-	}
 
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return;
-	}
+
 	dst = &opp->dst[idx];
 	addr &= 0xFF0;
 	switch (addr) {
@@ -1028,17 +1001,16 @@ static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
 	case 0x80:		/* CTPR */
 		dst->ctpr = val & 0x0000000F;
 
-		DPRINTF("%s: set CPU %d ctpr to %d, raised %d servicing %d\n",
+		pr_debug("%s: set CPU %d ctpr to %d, raised %d servicing %d\n",
 			__func__, idx, dst->ctpr, dst->raised.priority,
 			dst->servicing.priority);
 
 		if (dst->raised.priority <= dst->ctpr) {
-			DPRINTF
-			    ("%s: Lower OpenPIC INT output cpu %d due to ctpr\n",
-			     __func__, idx);
+			pr_debug("%s: Lower OpenPIC INT output cpu %d due to ctpr\n",
+				__func__, idx);
 			qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
 		} else if (dst->raised.priority > dst->servicing.priority) {
-			DPRINTF("%s: Raise OpenPIC INT output cpu %d irq %d\n",
+			pr_debug("%s: Raise OpenPIC INT output cpu %d irq %d\n",
 				__func__, idx, dst->raised.next);
 			qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_INT]);
 		}
@@ -1051,11 +1023,11 @@ static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
 		/* Read-only register */
 		break;
 	case 0xB0:		/* EOI */
-		DPRINTF("EOI\n");
+		pr_debug("EOI\n");
 		s_IRQ = IRQ_get_next(opp, &dst->servicing);
 
 		if (s_IRQ < 0) {
-			DPRINTF("%s: EOI with no interrupt in service\n",
+			pr_debug("%s: EOI with no interrupt in service\n",
 				__func__);
 			break;
 		}
@@ -1069,7 +1041,7 @@ static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
 		if (n_IRQ != -1 &&
 		    (s_IRQ == -1 ||
 		     IVPR_PRIORITY(src->ivpr) > dst->servicing.priority)) {
-			DPRINTF("Raise OpenPIC INT output cpu %d irq %d\n",
+			pr_debug("Raise OpenPIC INT output cpu %d irq %d\n",
 				idx, n_IRQ);
 			qemu_irq_raise(opp->dst[idx].irqs[OPENPIC_OUTPUT_INT]);
 		}
@@ -1079,32 +1051,32 @@ static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
 	}
 }
 
-static void openpic_cpu_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_cpu_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned len)
 {
 	openpic_cpu_write_internal(opaque, addr, val, (addr & 0x1f000) >> 12);
 }
 
-static uint32_t openpic_iack(OpenPICState * opp, IRQDest * dst, int cpu)
+static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
+			     int cpu)
 {
-	IRQSource *src;
+	struct irq_source *src;
 	int retval, irq;
 
-	DPRINTF("Lower OpenPIC INT output\n");
+	pr_debug("Lower OpenPIC INT output\n");
 	qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
 
 	irq = IRQ_get_next(opp, &dst->raised);
-	DPRINTF("IACK: irq=%d\n", irq);
+	pr_debug("IACK: irq=%d\n", irq);
 
-	if (irq == -1) {
+	if (irq == -1)
 		/* No more interrupt pending */
 		return opp->spve;
-	}
 
 	src = &opp->src[irq];
 	if (!(src->ivpr & IVPR_ACTIVITY_MASK) ||
 	    !(IVPR_PRIORITY(src->ivpr) > dst->ctpr)) {
-		fprintf(stderr, "%s: bad raised IRQ %d ctpr %d ivpr 0x%08x\n",
+		pr_err("%s: bad raised IRQ %d ctpr %d ivpr 0x%08x\n",
 			__func__, irq, dst->ctpr, src->ivpr);
 		openpic_update_irq(opp, irq);
 		retval = opp->spve;
@@ -1135,22 +1107,21 @@ static uint32_t openpic_iack(OpenPICState * opp, IRQDest * dst, int cpu)
 	return retval;
 }
 
-static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx)
+static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx)
 {
-	OpenPICState *opp = opaque;
-	IRQDest *dst;
+	struct openpic *opp = opaque;
+	struct irq_dest *dst;
 	uint32_t retval;
 
-	DPRINTF("%s: cpu %d addr %#" HWADDR_PRIx "\n", __func__, idx, addr);
+	pr_debug("%s: cpu %d addr %#" HWADDR_PRIx "\n", __func__, idx, addr);
 	retval = 0xFFFFFFFF;
 
-	if (idx < 0) {
+	if (idx < 0)
 		return retval;
-	}
 
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return retval;
-	}
+
 	dst = &opp->dst[idx];
 	addr &= 0xFF0;
 	switch (addr) {
@@ -1169,54 +1140,54 @@ static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx)
 	default:
 		break;
 	}
-	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
 	return retval;
 }
 
-static uint64_t openpic_cpu_read(void *opaque, hwaddr addr, unsigned len)
+static uint64_t openpic_cpu_read(void *opaque, gpa_t addr, unsigned len)
 {
 	return openpic_cpu_read_internal(opaque, addr, (addr & 0x1f000) >> 12);
 }
 
-static const MemoryRegionOps openpic_glb_ops_be = {
+static const struct kvm_io_device_ops openpic_glb_ops_be = {
 	.write = openpic_gbl_write,
 	.read = openpic_gbl_read,
 };
 
-static const MemoryRegionOps openpic_tmr_ops_be = {
+static const struct kvm_io_device_ops openpic_tmr_ops_be = {
 	.write = openpic_tmr_write,
 	.read = openpic_tmr_read,
 };
 
-static const MemoryRegionOps openpic_cpu_ops_be = {
+static const struct kvm_io_device_ops openpic_cpu_ops_be = {
 	.write = openpic_cpu_write,
 	.read = openpic_cpu_read,
 };
 
-static const MemoryRegionOps openpic_src_ops_be = {
+static const struct kvm_io_device_ops openpic_src_ops_be = {
 	.write = openpic_src_write,
 	.read = openpic_src_read,
 };
 
-static const MemoryRegionOps openpic_msi_ops_be = {
+static const struct kvm_io_device_ops openpic_msi_ops_be = {
 	.read = openpic_msi_read,
 	.write = openpic_msi_write,
 };
 
-static const MemoryRegionOps openpic_summary_ops_be = {
+static const struct kvm_io_device_ops openpic_summary_ops_be = {
 	.read = openpic_summary_read,
 	.write = openpic_summary_write,
 };
 
-typedef struct MemReg {
+struct mem_reg {
 	const char *name;
-	MemoryRegionOps const *ops;
-	hwaddr start_addr;
-	ram_addr_t size;
-} MemReg;
+	const struct kvm_io_device_ops *ops;
+	gpa_t start_addr;
+	int size;
+};
 
-static void fsl_common_init(OpenPICState * opp)
+static void fsl_common_init(struct openpic *opp)
 {
 	int i;
 	int virq = MAX_SRC;
@@ -1239,9 +1210,8 @@ static void fsl_common_init(OpenPICState * opp)
 	opp->irq_msi = 224;
 
 	msi_supported = true;
-	for (i = 0; i < opp->fsl->max_ext; i++) {
+	for (i = 0; i < opp->fsl->max_ext; i++)
 		opp->src[i].level = false;
-	}
 
 	/* Internal interrupts, including message and MSI */
 	for (i = 16; i < MAX_SRC; i++) {
@@ -1256,7 +1226,8 @@ static void fsl_common_init(OpenPICState * opp)
 	}
 }
 
-static void map_list(OpenPICState * opp, const MemReg * list, int *count)
+static void map_list(struct openpic *opp, const struct mem_reg *list,
+		     int *count)
 {
 	while (list->name) {
 		assert(*count < ARRAY_SIZE(opp->sub_io_mem));
@@ -1272,12 +1243,12 @@ static void map_list(OpenPICState * opp, const MemReg * list, int *count)
 	}
 }
 
-static int openpic_init(SysBusDevice * dev)
+static int openpic_init(SysBusDevice *dev)
 {
-	OpenPICState *opp = FROM_SYSBUS(typeof(*opp), dev);
+	struct openpic *opp = FROM_SYSBUS(typeof(*opp), dev);
 	int i, j;
 	int list_count = 0;
-	static const MemReg list_le[] = {
+	static const struct mem_reg list_le[] = {
 		{"glb", &openpic_glb_ops_le,
 		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
 		{"tmr", &openpic_tmr_ops_le,
@@ -1288,7 +1259,7 @@ static int openpic_init(SysBusDevice * dev)
 		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
 		{NULL}
 	};
-	static const MemReg list_be[] = {
+	static const struct mem_reg list_be[] = {
 		{"glb", &openpic_glb_ops_be,
 		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
 		{"tmr", &openpic_tmr_ops_be,
@@ -1299,7 +1270,7 @@ static int openpic_init(SysBusDevice * dev)
 		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
 		{NULL}
 	};
-	static const MemReg list_fsl[] = {
+	static const struct mem_reg list_fsl[] = {
 		{"msi", &openpic_msi_ops_be,
 		 OPENPIC_MSI_REG_START, OPENPIC_MSI_REG_SIZE},
 		{"summary", &openpic_summary_ops_be,
-- 
1.7.9.5


* [RFC PATCH 5/6] kvm/ppc/mpic: adapt to kernel style and environment
@ 2013-02-14  5:49   ` Scott Wood
  0 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-02-14  5:49 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, Scott Wood

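Remove braces around single-statement blocks, which Linux coding style
doesn't permit, remove the space after '*' that Lindent added, keep
error/debug strings contiguous, etc.

Substitute kernel equivalents for the QEMU type names (e.g. gpa_t for
hwaddr, struct openpic for OpenPICState) and for the debug prints
(pr_debug for DPRINTF, pr_err for fprintf(stderr, ...)).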

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
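Note on the OPENPIC_CPU_REG_SIZE hunk: the old definition expanded to a
bare sum, so any higher-precedence operator next to the macro would bind
to only part of it. A minimal sketch of the difference follows. MAX_CPU
is given an assumed value of 32 purely for illustration, and the
*_UNSAFE/*_SAFE names and twice_*() helpers are hypothetical, not part
of this patch.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CPU 32	/* assumed value, for illustration only */

/* Old form: expands to a bare sum, so "2 * X" multiplies only 0x100. */
#define CPU_REG_SIZE_UNSAFE	0x100 + ((MAX_CPU - 1) * 0x1000)

/* New form, as in this patch: the whole expression is one operand. */
#define CPU_REG_SIZE_SAFE	(0x100 + ((MAX_CPU - 1) * 0x1000))

static inline uint32_t twice_unsafe(void)
{
	/* Expands to 2 * 0x100 + ((MAX_CPU - 1) * 0x1000). */
	return 2 * CPU_REG_SIZE_UNSAFE;
}

static inline uint32_t twice_safe(void)
{
	/* Expands to 2 * (0x100 + ((MAX_CPU - 1) * 0x1000)). */
	return 2 * CPU_REG_SIZE_SAFE;
}
```

Nothing in the file currently uses the macro in such an expression, as
far as this diff shows, so the change is defensive rather than a
behavioural fix.
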
 arch/powerpc/kvm/mpic.c |  445 ++++++++++++++++++++++-------------------------
 1 file changed, 208 insertions(+), 237 deletions(-)
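
Note on the struct irq_queue comment ("round up to the nearest 64 IRQs"):
the arithmetic behind it can be sketched as below. queue_bytes() is a
hypothetical helper and the 272-bit figure used in the follow-up is made
up; the real MAX_IRQ is defined elsewhere in mpic.c. The sketch assumes
8-bit bytes and models the kernel's BITS_TO_LONGS() with the width of
long passed in explicitly.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Bytes occupied by an array of "long" holding nbits bits, on a host
 * whose long is long_bytes wide. If round64 is nonzero, nbits is first
 * rounded up to a multiple of 64, as struct irq_queue does.
 */
static inline size_t queue_bytes(size_t nbits, size_t long_bytes, int round64)
{
	size_t bits_per_long = long_bytes * CHAR_BIT;

	if (round64)
		nbits = (nbits + 63) & ~(size_t)63;

	/* Same rounding as BITS_TO_LONGS(), parameterised by width. */
	return ((nbits + bits_per_long - 1) / bits_per_long) * long_bytes;
}
```

With 272 bits and no rounding, a 32-bit host needs 9 longs (36 bytes)
while a 64-bit host needs 5 longs (40 bytes). After rounding to 320 bits
both come out at 40 bytes, so the queue has the same size on either
host.
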

diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index d6d70a4..1df67ae 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -42,22 +42,22 @@
 #define OPENPIC_TMR_REG_SIZE         0x220
 #define OPENPIC_MSI_REG_START        0x1600
 #define OPENPIC_MSI_REG_SIZE         0x200
-#define OPENPIC_SUMMARY_REG_START   0x3800
-#define OPENPIC_SUMMARY_REG_SIZE    0x800
+#define OPENPIC_SUMMARY_REG_START    0x3800
+#define OPENPIC_SUMMARY_REG_SIZE     0x800
 #define OPENPIC_SRC_REG_START        0x10000
 #define OPENPIC_SRC_REG_SIZE         (MAX_SRC * 0x20)
 #define OPENPIC_CPU_REG_START        0x20000
-#define OPENPIC_CPU_REG_SIZE         0x100 + ((MAX_CPU - 1) * 0x1000)
+#define OPENPIC_CPU_REG_SIZE         (0x100 + ((MAX_CPU - 1) * 0x1000))
 
-typedef struct FslMpicInfo {
+struct fsl_mpic_info {
 	int max_ext;
-} FslMpicInfo;
+};
 
-static FslMpicInfo fsl_mpic_20 = {
+static struct fsl_mpic_info fsl_mpic_20 = {
 	.max_ext = 12,
 };
 
-static FslMpicInfo fsl_mpic_42 = {
+static struct fsl_mpic_info fsl_mpic_42 = {
 	.max_ext = 12,
 };
 
@@ -100,44 +100,43 @@ static int get_current_cpu(void)
 {
 	CPUState *cpu_single_cpu;
 
-	if (!cpu_single_env) {
+	if (!cpu_single_env)
 		return -1;
-	}
 
 	cpu_single_cpu = ENV_GET_CPU(cpu_single_env);
 	return cpu_single_cpu->cpu_index;
 }
 
-static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx);
-static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
+static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx);
+static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
 				       uint32_t val, int idx);
 
-typedef enum IRQType {
+enum irq_type {
 	IRQ_TYPE_NORMAL = 0,
 	IRQ_TYPE_FSLINT,	/* FSL internal interrupt -- level only */
 	IRQ_TYPE_FSLSPECIAL,	/* FSL timer/IPI interrupt, edge, no polarity */
-} IRQType;
+};
 
-typedef struct IRQQueue {
+struct irq_queue {
 	/* Round up to the nearest 64 IRQs so that the queue length
 	 * won't change when moving between 32 and 64 bit hosts.
 	 */
 	unsigned long queue[BITS_TO_LONGS((MAX_IRQ + 63) & ~63)];
 	int next;
 	int priority;
-} IRQQueue;
+};
 
-typedef struct IRQSource {
+struct irq_source {
 	uint32_t ivpr;		/* IRQ vector/priority register */
 	uint32_t idr;		/* IRQ destination register */
 	uint32_t destmask;	/* bitmap of CPU destinations */
 	int last_cpu;
 	int output;		/* IRQ level, e.g. OPENPIC_OUTPUT_INT */
 	int pending;		/* TRUE if IRQ is pending */
-	IRQType type;
+	enum irq_type type;
 	bool level:1;		/* level-triggered */
-	bool nomask:1;		/* critical interrupts ignore mask on some FSL MPICs */
-} IRQSource;
+	bool nomask:1;	/* critical interrupts ignore mask on some FSL MPICs */
+};
 
 #define IVPR_MASK_SHIFT       31
 #define IVPR_MASK_MASK        (1 << IVPR_MASK_SHIFT)
@@ -158,22 +157,19 @@ typedef struct IRQSource {
 #define IDR_EP      0x80000000	/* external pin */
 #define IDR_CI      0x40000000	/* critical interrupt */
 
-typedef struct IRQDest {
+struct irq_dest {
 	int32_t ctpr;		/* CPU current task priority */
-	IRQQueue raised;
-	IRQQueue servicing;
+	struct irq_queue raised;
+	struct irq_queue servicing;
 	qemu_irq *irqs;
 
 	/* Count of IRQ sources asserting on non-INT outputs */
 	uint32_t outputs_active[OPENPIC_OUTPUT_NB];
-} IRQDest;
-
-typedef struct OpenPICState {
-	SysBusDevice busdev;
-	MemoryRegion mem;
+};
 
+struct openpic {
 	/* Behavior control */
-	FslMpicInfo *fsl;
+	struct fsl_mpic_info *fsl;
 	uint32_t model;
 	uint32_t flags;
 	uint32_t nb_irqs;
@@ -186,9 +182,6 @@ typedef struct OpenPICState {
 	uint32_t brr1;
 	uint32_t mpic_mode_mask;
 
-	/* Sub-regions */
-	MemoryRegion sub_io_mem[6];
-
 	/* Global registers */
 	uint32_t frr;		/* Feature reporting register */
 	uint32_t gcr;		/* Global configuration register  */
@@ -196,9 +189,9 @@ typedef struct OpenPICState {
 	uint32_t spve;		/* Spurious vector register */
 	uint32_t tfrr;		/* Timer frequency reporting register */
 	/* Source registers */
-	IRQSource src[MAX_IRQ];
+	struct irq_source src[MAX_IRQ];
 	/* Local registers per output pin */
-	IRQDest dst[MAX_CPU];
+	struct irq_dest dst[MAX_CPU];
 	uint32_t nb_cpus;
 	/* Timer registers */
 	struct {
@@ -213,24 +206,24 @@ typedef struct OpenPICState {
 	uint32_t irq_ipi0;
 	uint32_t irq_tim0;
 	uint32_t irq_msi;
-} OpenPICState;
+};
 
-static inline void IRQ_setbit(IRQQueue * q, int n_IRQ)
+static inline void IRQ_setbit(struct irq_queue *q, int n_IRQ)
 {
 	set_bit(n_IRQ, q->queue);
 }
 
-static inline void IRQ_resetbit(IRQQueue * q, int n_IRQ)
+static inline void IRQ_resetbit(struct irq_queue *q, int n_IRQ)
 {
 	clear_bit(n_IRQ, q->queue);
 }
 
-static inline int IRQ_testbit(IRQQueue * q, int n_IRQ)
+static inline int IRQ_testbit(struct irq_queue *q, int n_IRQ)
 {
 	return test_bit(n_IRQ, q->queue);
 }
 
-static void IRQ_check(OpenPICState * opp, IRQQueue * q)
+static void IRQ_check(struct openpic *opp, struct irq_queue *q)
 {
 	int irq = -1;
 	int next = -1;
@@ -238,11 +231,10 @@ static void IRQ_check(OpenPICState * opp, IRQQueue * q)
 
 	for (;;) {
 		irq = find_next_bit(q->queue, opp->max_irq, irq + 1);
-		if (irq == opp->max_irq) {
+		if (irq == opp->max_irq)
 			break;
-		}
 
-		DPRINTF("IRQ_check: irq %d set ivpr_pr=%d pr=%d\n",
+		pr_debug("IRQ_check: irq %d set ivpr_pr=%d pr=%d\n",
 			irq, IVPR_PRIORITY(opp->src[irq].ivpr), priority);
 
 		if (IVPR_PRIORITY(opp->src[irq].ivpr) > priority) {
@@ -255,7 +247,7 @@ static void IRQ_check(OpenPICState * opp, IRQQueue * q)
 	q->priority = priority;
 }
 
-static int IRQ_get_next(OpenPICState * opp, IRQQueue * q)
+static int IRQ_get_next(struct openpic *opp, struct irq_queue *q)
 {
 	/* XXX: optimize */
 	IRQ_check(opp, q);
@@ -263,21 +255,21 @@ static int IRQ_get_next(OpenPICState * opp, IRQQueue * q)
 	return q->next;
 }
 
-static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
+static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 			   bool active, bool was_active)
 {
-	IRQDest *dst;
-	IRQSource *src;
+	struct irq_dest *dst;
+	struct irq_source *src;
 	int priority;
 
 	dst = &opp->dst[n_CPU];
 	src = &opp->src[n_IRQ];
 
-	DPRINTF("%s: IRQ %d active %d was %d\n",
+	pr_debug("%s: IRQ %d active %d was %d\n",
 		__func__, n_IRQ, active, was_active);
 
 	if (src->output != OPENPIC_OUTPUT_INT) {
-		DPRINTF("%s: output %d irq %d active %d was %d count %d\n",
+		pr_debug("%s: output %d irq %d active %d was %d count %d\n",
 			__func__, src->output, n_IRQ, active, was_active,
 			dst->outputs_active[src->output]);
 
@@ -286,19 +278,17 @@ static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
 		 * masking.
 		 */
 		if (active) {
-			if (!was_active
-			    && dst->outputs_active[src->output]++ == 0) {
-				DPRINTF
-				    ("%s: Raise OpenPIC output %d cpu %d irq %d\n",
-				     __func__, src->output, n_CPU, n_IRQ);
+			if (!was_active &&
+			    dst->outputs_active[src->output]++ == 0) {
+				pr_debug("%s: Raise OpenPIC output %d cpu %d irq %d\n",
+					__func__, src->output, n_CPU, n_IRQ);
 				qemu_irq_raise(dst->irqs[src->output]);
 			}
 		} else {
-			if (was_active
-			    && --dst->outputs_active[src->output] == 0) {
-				DPRINTF
-				    ("%s: Lower OpenPIC output %d cpu %d irq %d\n",
-				     __func__, src->output, n_CPU, n_IRQ);
+			if (was_active &&
+			    --dst->outputs_active[src->output] == 0) {
+				pr_debug("%s: Lower OpenPIC output %d cpu %d irq %d\n",
+					__func__, src->output, n_CPU, n_IRQ);
 				qemu_irq_lower(dst->irqs[src->output]);
 			}
 		}
@@ -311,31 +301,27 @@ static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
 	/* Even if the interrupt doesn't have enough priority,
 	 * it is still raised, in case ctpr is lowered later.
 	 */
-	if (active) {
+	if (active)
 		IRQ_setbit(&dst->raised, n_IRQ);
-	} else {
+	else
 		IRQ_resetbit(&dst->raised, n_IRQ);
-	}
 
 	IRQ_check(opp, &dst->raised);
 
 	if (active && priority <= dst->ctpr) {
-		DPRINTF
-		    ("%s: IRQ %d priority %d too low for ctpr %d on CPU %d\n",
-		     __func__, n_IRQ, priority, dst->ctpr, n_CPU);
+		pr_debug("%s: IRQ %d priority %d too low for ctpr %d on CPU %d\n",
+			__func__, n_IRQ, priority, dst->ctpr, n_CPU);
 		active = 0;
 	}
 
 	if (active) {
 		if (IRQ_get_next(opp, &dst->servicing) >= 0 &&
 		    priority <= dst->servicing.priority) {
-			DPRINTF
-			    ("%s: IRQ %d is hidden by servicing IRQ %d on CPU %d\n",
-			     __func__, n_IRQ, dst->servicing.next, n_CPU);
+			pr_debug("%s: IRQ %d is hidden by servicing IRQ %d on CPU %d\n",
+				__func__, n_IRQ, dst->servicing.next, n_CPU);
 		} else {
-			DPRINTF
-			    ("%s: Raise OpenPIC INT output cpu %d irq %d/%d\n",
-			     __func__, n_CPU, n_IRQ, dst->raised.next);
+			pr_debug("%s: Raise OpenPIC INT output cpu %d irq %d/%d\n",
+				__func__, n_CPU, n_IRQ, dst->raised.next);
 			qemu_irq_raise(opp->dst[n_CPU].
 				       irqs[OPENPIC_OUTPUT_INT]);
 		}
@@ -343,17 +329,15 @@ static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
 		IRQ_get_next(opp, &dst->servicing);
 		if (dst->raised.priority > dst->ctpr &&
 		    dst->raised.priority > dst->servicing.priority) {
-			DPRINTF
-			    ("%s: IRQ %d inactive, IRQ %d prio %d above %d/%d, CPU %d\n",
-			     __func__, n_IRQ, dst->raised.next,
-			     dst->raised.priority, dst->ctpr,
-			     dst->servicing.priority, n_CPU);
+			pr_debug("%s: IRQ %d inactive, IRQ %d prio %d above %d/%d, CPU %d\n",
+				__func__, n_IRQ, dst->raised.next,
+				dst->raised.priority, dst->ctpr,
+				dst->servicing.priority, n_CPU);
 			/* IRQ line stays asserted */
 		} else {
-			DPRINTF
-			    ("%s: IRQ %d inactive, current prio %d/%d, CPU %d\n",
-			     __func__, n_IRQ, dst->ctpr,
-			     dst->servicing.priority, n_CPU);
+			pr_debug("%s: IRQ %d inactive, current prio %d/%d, CPU %d\n",
+				__func__, n_IRQ, dst->ctpr,
+				dst->servicing.priority, n_CPU);
 			qemu_irq_lower(opp->dst[n_CPU].
 				       irqs[OPENPIC_OUTPUT_INT]);
 		}
@@ -361,9 +345,9 @@ static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
 }
 
 /* update pic state because registers for n_IRQ have changed value */
-static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
+static void openpic_update_irq(struct openpic *opp, int n_IRQ)
 {
-	IRQSource *src;
+	struct irq_source *src;
 	bool active, was_active;
 	int i;
 
@@ -372,30 +356,29 @@ static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
 
 	if ((src->ivpr & IVPR_MASK_MASK) && !src->nomask) {
 		/* Interrupt source is disabled */
-		DPRINTF("%s: IRQ %d is disabled\n", __func__, n_IRQ);
+		pr_debug("%s: IRQ %d is disabled\n", __func__, n_IRQ);
 		active = false;
 	}
 
-	was_active = ! !(src->ivpr & IVPR_ACTIVITY_MASK);
+	was_active = !!(src->ivpr & IVPR_ACTIVITY_MASK);
 
 	/*
 	 * We don't have a similar check for already-active because
 	 * ctpr may have changed and we need to withdraw the interrupt.
 	 */
 	if (!active && !was_active) {
-		DPRINTF("%s: IRQ %d is already inactive\n", __func__, n_IRQ);
+		pr_debug("%s: IRQ %d is already inactive\n", __func__, n_IRQ);
 		return;
 	}
 
-	if (active) {
+	if (active)
 		src->ivpr |= IVPR_ACTIVITY_MASK;
-	} else {
+	else
 		src->ivpr &= ~IVPR_ACTIVITY_MASK;
-	}
 
 	if (src->destmask == 0) {
 		/* No target */
-		DPRINTF("%s: IRQ %d has no target\n", __func__, n_IRQ);
+		pr_debug("%s: IRQ %d has no target\n", __func__, n_IRQ);
 		return;
 	}
 
@@ -413,9 +396,9 @@ static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
 	} else {
 		/* Distributed delivery mode */
 		for (i = src->last_cpu + 1; i != src->last_cpu; i++) {
-			if (i == opp->nb_cpus) {
+			if (i == opp->nb_cpus)
 				i = 0;
-			}
+
 			if (src->destmask & (1 << i)) {
 				IRQ_local_pipe(opp, i, n_IRQ, active,
 					       was_active);
@@ -428,16 +411,16 @@ static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
 
 static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 {
-	OpenPICState *opp = opaque;
-	IRQSource *src;
+	struct openpic *opp = opaque;
+	struct irq_source *src;
 
 	if (n_IRQ >= MAX_IRQ) {
-		fprintf(stderr, "%s: IRQ %d out of range\n", __func__, n_IRQ);
+		pr_err("%s: IRQ %d out of range\n", __func__, n_IRQ);
 		abort();
 	}
 
 	src = &opp->src[n_IRQ];
-	DPRINTF("openpic: set irq %d = %d ivpr=0x%08x\n",
+	pr_debug("openpic: set irq %d = %d ivpr=0x%08x\n",
 		n_IRQ, level, src->ivpr);
 	if (src->level) {
 		/* level-sensitive irq */
@@ -463,9 +446,9 @@ static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 	}
 }
 
-static void openpic_reset(DeviceState * d)
+static void openpic_reset(DeviceState *d)
 {
-	OpenPICState *opp = FROM_SYSBUS(typeof(*opp), SYS_BUS_DEVICE(d));
+	struct openpic *opp = FROM_SYSBUS(typeof(*opp), SYS_BUS_DEVICE(d));
 	int i;
 
 	opp->gcr = GCR_RESET;
@@ -485,7 +468,7 @@ static void openpic_reset(DeviceState * d)
 		switch (opp->src[i].type) {
 		case IRQ_TYPE_NORMAL:
 			opp->src[i].level =
-			    ! !(opp->ivpr_reset & IVPR_SENSE_MASK);
+			    !!(opp->ivpr_reset & IVPR_SENSE_MASK);
 			break;
 
 		case IRQ_TYPE_FSLINT:
@@ -499,9 +482,9 @@ static void openpic_reset(DeviceState * d)
 	/* Initialise IRQ destinations */
 	for (i = 0; i < MAX_CPU; i++) {
 		opp->dst[i].ctpr = 15;
-		memset(&opp->dst[i].raised, 0, sizeof(IRQQueue));
+		memset(&opp->dst[i].raised, 0, sizeof(struct irq_queue));
 		opp->dst[i].raised.next = -1;
-		memset(&opp->dst[i].servicing, 0, sizeof(IRQQueue));
+		memset(&opp->dst[i].servicing, 0, sizeof(struct irq_queue));
 		opp->dst[i].servicing.next = -1;
 	}
 	/* Initialise timers */
@@ -513,28 +496,28 @@ static void openpic_reset(DeviceState * d)
 	opp->gcr = 0;
 }
 
-static inline uint32_t read_IRQreg_idr(OpenPICState * opp, int n_IRQ)
+static inline uint32_t read_IRQreg_idr(struct openpic *opp, int n_IRQ)
 {
 	return opp->src[n_IRQ].idr;
 }
 
-static inline uint32_t read_IRQreg_ilr(OpenPICState * opp, int n_IRQ)
+static inline uint32_t read_IRQreg_ilr(struct openpic *opp, int n_IRQ)
 {
-	if (opp->flags & OPENPIC_FLAG_ILR) {
+	if (opp->flags & OPENPIC_FLAG_ILR)
 		return output_to_inttgt(opp->src[n_IRQ].output);
-	}
 
 	return 0xffffffff;
 }
 
-static inline uint32_t read_IRQreg_ivpr(OpenPICState * opp, int n_IRQ)
+static inline uint32_t read_IRQreg_ivpr(struct openpic *opp, int n_IRQ)
 {
 	return opp->src[n_IRQ].ivpr;
 }
 
-static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
+static inline void write_IRQreg_idr(struct openpic *opp, int n_IRQ,
+				    uint32_t val)
 {
-	IRQSource *src = &opp->src[n_IRQ];
+	struct irq_source *src = &opp->src[n_IRQ];
 	uint32_t normal_mask = (1UL << opp->nb_cpus) - 1;
 	uint32_t crit_mask = 0;
 	uint32_t mask = normal_mask;
@@ -547,14 +530,13 @@ static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
 	}
 
 	src->idr = val & mask;
-	DPRINTF("Set IDR %d to 0x%08x\n", n_IRQ, src->idr);
+	pr_debug("Set IDR %d to 0x%08x\n", n_IRQ, src->idr);
 
 	if (opp->flags & OPENPIC_FLAG_IDR_CRIT) {
 		if (src->idr & crit_mask) {
 			if (src->idr & normal_mask) {
-				DPRINTF
-				    ("%s: IRQ configured for multiple output types, using "
-				     "critical\n", __func__);
+				pr_debug("%s: IRQ configured for multiple output types, using critical\n",
+					__func__);
 			}
 
 			src->output = OPENPIC_OUTPUT_CINT;
@@ -564,9 +546,8 @@ static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
 			for (i = 0; i < opp->nb_cpus; i++) {
 				int n_ci = IDR_CI0_SHIFT - i;
 
-				if (src->idr & (1UL << n_ci)) {
+				if (src->idr & (1UL << n_ci))
 					src->destmask |= 1UL << i;
-				}
 			}
 		} else {
 			src->output = OPENPIC_OUTPUT_INT;
@@ -578,20 +559,21 @@ static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
 	}
 }
 
-static inline void write_IRQreg_ilr(OpenPICState * opp, int n_IRQ, uint32_t val)
+static inline void write_IRQreg_ilr(struct openpic *opp, int n_IRQ,
+				    uint32_t val)
 {
 	if (opp->flags & OPENPIC_FLAG_ILR) {
-		IRQSource *src = &opp->src[n_IRQ];
+		struct irq_source *src = &opp->src[n_IRQ];
 
 		src->output = inttgt_to_output(val & ILR_INTTGT_MASK);
-		DPRINTF("Set ILR %d to 0x%08x, output %d\n", n_IRQ, src->idr,
+		pr_debug("Set ILR %d to 0x%08x, output %d\n", n_IRQ, src->idr,
 			src->output);
 
 		/* TODO: on MPIC v4.0 only, set nomask for non-INT */
 	}
 }
 
-static inline void write_IRQreg_ivpr(OpenPICState * opp, int n_IRQ,
+static inline void write_IRQreg_ivpr(struct openpic *opp, int n_IRQ,
 				     uint32_t val)
 {
 	uint32_t mask;
@@ -613,7 +595,7 @@ static inline void write_IRQreg_ivpr(OpenPICState * opp, int n_IRQ,
 	switch (opp->src[n_IRQ].type) {
 	case IRQ_TYPE_NORMAL:
 		opp->src[n_IRQ].level =
-		    ! !(opp->src[n_IRQ].ivpr & IVPR_SENSE_MASK);
+		    !!(opp->src[n_IRQ].ivpr & IVPR_SENSE_MASK);
 		break;
 
 	case IRQ_TYPE_FSLINT:
@@ -626,11 +608,11 @@ static inline void write_IRQreg_ivpr(OpenPICState * opp, int n_IRQ,
 	}
 
 	openpic_update_irq(opp, n_IRQ);
-	DPRINTF("Set IVPR %d to 0x%08x -> 0x%08x\n", n_IRQ, val,
+	pr_debug("Set IVPR %d to 0x%08x -> 0x%08x\n", n_IRQ, val,
 		opp->src[n_IRQ].ivpr);
 }
 
-static void openpic_gcr_write(OpenPICState * opp, uint64_t val)
+static void openpic_gcr_write(struct openpic *opp, uint64_t val)
 {
 	bool mpic_proxy = false;
 
@@ -643,27 +625,26 @@ static void openpic_gcr_write(OpenPICState * opp, uint64_t val)
 	opp->gcr |= val & opp->mpic_mode_mask;
 
 	/* Set external proxy mode */
-	if ((val & opp->mpic_mode_mask) == GCR_MODE_PROXY) {
+	if ((val & opp->mpic_mode_mask) == GCR_MODE_PROXY)
 		mpic_proxy = true;
-	}
 
 	ppce500_set_mpic_proxy(mpic_proxy);
 }
 
-static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_gbl_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned len)
 {
-	OpenPICState *opp = opaque;
-	IRQDest *dst;
+	struct openpic *opp = opaque;
+	struct irq_dest *dst;
 	int idx;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
 		__func__, addr, val);
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return;
-	}
+
 	switch (addr) {
-	case 0x00:		/* Block Revision Register1 (BRR1) is Readonly */
+	case 0x00:	/* Block Revision Register1 (BRR1) is Readonly */
 		break;
 	case 0x40:
 	case 0x50:
@@ -685,16 +666,14 @@ static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
 	case 0x1090:		/* PIR */
 		for (idx = 0; idx < opp->nb_cpus; idx++) {
 			if ((val & (1 << idx)) && !(opp->pir & (1 << idx))) {
-				DPRINTF
-				    ("Raise OpenPIC RESET output for CPU %d\n",
-				     idx);
+				pr_debug("Raise OpenPIC RESET output for CPU %d\n",
+					idx);
 				dst = &opp->dst[idx];
 				qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_RESET]);
-			} else if (!(val & (1 << idx))
-				   && (opp->pir & (1 << idx))) {
-				DPRINTF
-				    ("Lower OpenPIC RESET output for CPU %d\n",
-				     idx);
+			} else if (!(val & (1 << idx)) &&
+				   (opp->pir & (1 << idx))) {
+				pr_debug("Lower OpenPIC RESET output for CPU %d\n",
+					idx);
 				dst = &opp->dst[idx];
 				qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_RESET]);
 			}
@@ -704,13 +683,12 @@ static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
 	case 0x10A0:		/* IPI_IVPR */
 	case 0x10B0:
 	case 0x10C0:
-	case 0x10D0:
-		{
-			int idx;
-			idx = (addr - 0x10A0) >> 4;
-			write_IRQreg_ivpr(opp, opp->irq_ipi0 + idx, val);
-		}
+	case 0x10D0: {
+		int idx;
+		idx = (addr - 0x10A0) >> 4;
+		write_IRQreg_ivpr(opp, opp->irq_ipi0 + idx, val);
 		break;
+	}
 	case 0x10E0:		/* SPVE */
 		opp->spve = val & opp->vector_mask;
 		break;
@@ -719,16 +697,16 @@ static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
 	}
 }
 
-static uint64_t openpic_gbl_read(void *opaque, hwaddr addr, unsigned len)
+static uint64_t openpic_gbl_read(void *opaque, gpa_t addr, unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	uint32_t retval;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
 	retval = 0xFFFFFFFF;
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return retval;
-	}
+
 	switch (addr) {
 	case 0x1000:		/* FRR */
 		retval = opp->frr;
@@ -772,24 +750,23 @@ static uint64_t openpic_gbl_read(void *opaque, hwaddr addr, unsigned len)
 	default:
 		break;
 	}
-	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
 	return retval;
 }
 
-static void openpic_tmr_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_tmr_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	int idx;
 
 	addr += 0x10f0;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
 		__func__, addr, val);
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return;
-	}
 
 	if (addr == 0x10f0) {
 		/* TFRR */
@@ -806,9 +783,9 @@ static void openpic_tmr_write(void *opaque, hwaddr addr, uint64_t val,
 	case 0x10:		/* TBCR */
 		if ((opp->timers[idx].tccr & TCCR_TOG) != 0 &&
 		    (val & TBCR_CI) == 0 &&
-		    (opp->timers[idx].tbcr & TBCR_CI) != 0) {
+		    (opp->timers[idx].tbcr & TBCR_CI) != 0)
 			opp->timers[idx].tccr &= ~TCCR_TOG;
-		}
+
 		opp->timers[idx].tbcr = val;
 		break;
 	case 0x20:		/* TVPR */
@@ -820,16 +797,16 @@ static void openpic_tmr_write(void *opaque, hwaddr addr, uint64_t val,
 	}
 }
 
-static uint64_t openpic_tmr_read(void *opaque, hwaddr addr, unsigned len)
+static uint64_t openpic_tmr_read(void *opaque, gpa_t addr, unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	uint32_t retval = -1;
 	int idx;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
-	if (addr & 0xF) {
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	if (addr & 0xF)
 		goto out;
-	}
+
 	idx = (addr >> 6) & 0x3;
 	if (addr == 0x0) {
 		/* TFRR */
@@ -852,18 +829,18 @@ static uint64_t openpic_tmr_read(void *opaque, hwaddr addr, unsigned len)
 	}
 
 out:
-	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
 	return retval;
 }
 
-static void openpic_src_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_src_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	int idx;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
 		__func__, addr, val);
 
 	addr = addr & 0xffff;
@@ -884,11 +861,11 @@ static void openpic_src_write(void *opaque, hwaddr addr, uint64_t val,
 
 static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	uint32_t retval;
 	int idx;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
 	retval = 0xFFFFFFFF;
 
 	addr = addr & 0xffff;
@@ -906,22 +883,21 @@ static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
 		break;
 	}
 
-	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
 	return retval;
 }
 
-static void openpic_msi_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_msi_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned size)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	int idx = opp->irq_msi;
 	int srs, ibs;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
 		__func__, addr, val);
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return;
-	}
 
 	switch (addr) {
 	case MSIIR_OFFSET:
@@ -937,16 +913,15 @@ static void openpic_msi_write(void *opaque, hwaddr addr, uint64_t val,
 	}
 }
 
-static uint64_t openpic_msi_read(void *opaque, hwaddr addr, unsigned size)
+static uint64_t openpic_msi_read(void *opaque, gpa_t addr, unsigned size)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	uint64_t r = 0;
 	int i, srs;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
-	if (addr & 0xF) {
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	if (addr & 0xF)
 		return -1;
-	}
 
 	srs = addr >> 4;
 
@@ -965,53 +940,51 @@ static uint64_t openpic_msi_read(void *opaque, hwaddr addr, unsigned size)
 		openpic_set_irq(opp, opp->irq_msi + srs, 0);
 		break;
 	case 0x120:		/* MSISR */
-		for (i = 0; i < MAX_MSI; i++) {
+		for (i = 0; i < MAX_MSI; i++)
 			r |= (opp->msi[i].msir ? 1 : 0) << i;
-		}
 		break;
 	}
 
 	return r;
 }
 
-static uint64_t openpic_summary_read(void *opaque, hwaddr addr, unsigned size)
+static uint64_t openpic_summary_read(void *opaque, gpa_t addr, unsigned size)
 {
 	uint64_t r = 0;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
 
 	/* TODO: EISR/EIMR */
 
 	return r;
 }
 
-static void openpic_summary_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_summary_write(void *opaque, gpa_t addr, uint64_t val,
 				  unsigned size)
 {
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
 		__func__, addr, val);
 
 	/* TODO: EISR/EIMR */
 }
 
-static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
+static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
 				       uint32_t val, int idx)
 {
-	OpenPICState *opp = opaque;
-	IRQSource *src;
-	IRQDest *dst;
+	struct openpic *opp = opaque;
+	struct irq_source *src;
+	struct irq_dest *dst;
 	int s_IRQ, n_IRQ;
 
-	DPRINTF("%s: cpu %d addr %#" HWADDR_PRIx " <= 0x%08x\n", __func__, idx,
+	pr_debug("%s: cpu %d addr %#" HWADDR_PRIx " <= 0x%08x\n", __func__, idx,
 		addr, val);
 
-	if (idx < 0) {
+	if (idx < 0)
 		return;
-	}
 
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return;
-	}
+
 	dst = &opp->dst[idx];
 	addr &= 0xFF0;
 	switch (addr) {
@@ -1028,17 +1001,16 @@ static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
 	case 0x80:		/* CTPR */
 		dst->ctpr = val & 0x0000000F;
 
-		DPRINTF("%s: set CPU %d ctpr to %d, raised %d servicing %d\n",
+		pr_debug("%s: set CPU %d ctpr to %d, raised %d servicing %d\n",
 			__func__, idx, dst->ctpr, dst->raised.priority,
 			dst->servicing.priority);
 
 		if (dst->raised.priority <= dst->ctpr) {
-			DPRINTF
-			    ("%s: Lower OpenPIC INT output cpu %d due to ctpr\n",
-			     __func__, idx);
+			pr_debug("%s: Lower OpenPIC INT output cpu %d due to ctpr\n",
+				__func__, idx);
 			qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
 		} else if (dst->raised.priority > dst->servicing.priority) {
-			DPRINTF("%s: Raise OpenPIC INT output cpu %d irq %d\n",
+			pr_debug("%s: Raise OpenPIC INT output cpu %d irq %d\n",
 				__func__, idx, dst->raised.next);
 			qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_INT]);
 		}
@@ -1051,11 +1023,11 @@ static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
 		/* Read-only register */
 		break;
 	case 0xB0:		/* EOI */
-		DPRINTF("EOI\n");
+		pr_debug("EOI\n");
 		s_IRQ = IRQ_get_next(opp, &dst->servicing);
 
 		if (s_IRQ < 0) {
-			DPRINTF("%s: EOI with no interrupt in service\n",
+			pr_debug("%s: EOI with no interrupt in service\n",
 				__func__);
 			break;
 		}
@@ -1069,7 +1041,7 @@ static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
 		if (n_IRQ != -1 &&
 		    (s_IRQ == -1 ||
 		     IVPR_PRIORITY(src->ivpr) > dst->servicing.priority)) {
-			DPRINTF("Raise OpenPIC INT output cpu %d irq %d\n",
+			pr_debug("Raise OpenPIC INT output cpu %d irq %d\n",
 				idx, n_IRQ);
 			qemu_irq_raise(opp->dst[idx].irqs[OPENPIC_OUTPUT_INT]);
 		}
@@ -1079,32 +1051,32 @@ static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
 	}
 }
 
-static void openpic_cpu_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_cpu_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned len)
 {
 	openpic_cpu_write_internal(opaque, addr, val, (addr & 0x1f000) >> 12);
 }
 
-static uint32_t openpic_iack(OpenPICState * opp, IRQDest * dst, int cpu)
+static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
+			     int cpu)
 {
-	IRQSource *src;
+	struct irq_source *src;
 	int retval, irq;
 
-	DPRINTF("Lower OpenPIC INT output\n");
+	pr_debug("Lower OpenPIC INT output\n");
 	qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
 
 	irq = IRQ_get_next(opp, &dst->raised);
-	DPRINTF("IACK: irq=%d\n", irq);
+	pr_debug("IACK: irq=%d\n", irq);
 
-	if (irq == -1) {
+	if (irq == -1)
 		/* No more interrupt pending */
 		return opp->spve;
-	}
 
 	src = &opp->src[irq];
 	if (!(src->ivpr & IVPR_ACTIVITY_MASK) ||
 	    !(IVPR_PRIORITY(src->ivpr) > dst->ctpr)) {
-		fprintf(stderr, "%s: bad raised IRQ %d ctpr %d ivpr 0x%08x\n",
+		pr_err("%s: bad raised IRQ %d ctpr %d ivpr 0x%08x\n",
 			__func__, irq, dst->ctpr, src->ivpr);
 		openpic_update_irq(opp, irq);
 		retval = opp->spve;
@@ -1135,22 +1107,21 @@ static uint32_t openpic_iack(OpenPICState * opp, IRQDest * dst, int cpu)
 	return retval;
 }
 
-static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx)
+static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx)
 {
-	OpenPICState *opp = opaque;
-	IRQDest *dst;
+	struct openpic *opp = opaque;
+	struct irq_dest *dst;
 	uint32_t retval;
 
-	DPRINTF("%s: cpu %d addr %#" HWADDR_PRIx "\n", __func__, idx, addr);
+	pr_debug("%s: cpu %d addr %#" HWADDR_PRIx "\n", __func__, idx, addr);
 	retval = 0xFFFFFFFF;
 
-	if (idx < 0) {
+	if (idx < 0)
 		return retval;
-	}
 
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return retval;
-	}
+
 	dst = &opp->dst[idx];
 	addr &= 0xFF0;
 	switch (addr) {
@@ -1169,54 +1140,54 @@ static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx)
 	default:
 		break;
 	}
-	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
 	return retval;
 }
 
-static uint64_t openpic_cpu_read(void *opaque, hwaddr addr, unsigned len)
+static uint64_t openpic_cpu_read(void *opaque, gpa_t addr, unsigned len)
 {
 	return openpic_cpu_read_internal(opaque, addr, (addr & 0x1f000) >> 12);
 }
 
-static const MemoryRegionOps openpic_glb_ops_be = {
+static const struct kvm_io_device_ops openpic_glb_ops_be = {
 	.write = openpic_gbl_write,
 	.read = openpic_gbl_read,
 };
 
-static const MemoryRegionOps openpic_tmr_ops_be = {
+static const struct kvm_io_device_ops openpic_tmr_ops_be = {
 	.write = openpic_tmr_write,
 	.read = openpic_tmr_read,
 };
 
-static const MemoryRegionOps openpic_cpu_ops_be = {
+static const struct kvm_io_device_ops openpic_cpu_ops_be = {
 	.write = openpic_cpu_write,
 	.read = openpic_cpu_read,
 };
 
-static const MemoryRegionOps openpic_src_ops_be = {
+static const struct kvm_io_device_ops openpic_src_ops_be = {
 	.write = openpic_src_write,
 	.read = openpic_src_read,
 };
 
-static const MemoryRegionOps openpic_msi_ops_be = {
+static const struct kvm_io_device_ops openpic_msi_ops_be = {
 	.read = openpic_msi_read,
 	.write = openpic_msi_write,
 };
 
-static const MemoryRegionOps openpic_summary_ops_be = {
+static const struct kvm_io_device_ops openpic_summary_ops_be = {
 	.read = openpic_summary_read,
 	.write = openpic_summary_write,
 };
 
-typedef struct MemReg {
+struct mem_reg {
 	const char *name;
-	MemoryRegionOps const *ops;
-	hwaddr start_addr;
-	ram_addr_t size;
-} MemReg;
+	const struct kvm_io_device_ops *ops;
+	gpa_t start_addr;
+	int size;
+};
 
-static void fsl_common_init(OpenPICState * opp)
+static void fsl_common_init(struct openpic *opp)
 {
 	int i;
 	int virq = MAX_SRC;
@@ -1239,9 +1210,8 @@ static void fsl_common_init(OpenPICState * opp)
 	opp->irq_msi = 224;
 
 	msi_supported = true;
-	for (i = 0; i < opp->fsl->max_ext; i++) {
+	for (i = 0; i < opp->fsl->max_ext; i++)
 		opp->src[i].level = false;
-	}
 
 	/* Internal interrupts, including message and MSI */
 	for (i = 16; i < MAX_SRC; i++) {
@@ -1256,7 +1226,8 @@ static void fsl_common_init(OpenPICState * opp)
 	}
 }
 
-static void map_list(OpenPICState * opp, const MemReg * list, int *count)
+static void map_list(struct openpic *opp, const struct mem_reg *list,
+		     int *count)
 {
 	while (list->name) {
 		assert(*count < ARRAY_SIZE(opp->sub_io_mem));
@@ -1272,12 +1243,12 @@ static void map_list(OpenPICState * opp, const MemReg * list, int *count)
 	}
 }
 
-static int openpic_init(SysBusDevice * dev)
+static int openpic_init(SysBusDevice *dev)
 {
-	OpenPICState *opp = FROM_SYSBUS(typeof(*opp), dev);
+	struct openpic *opp = FROM_SYSBUS(typeof(*opp), dev);
 	int i, j;
 	int list_count = 0;
-	static const MemReg list_le[] = {
+	static const struct mem_reg list_le[] = {
 		{"glb", &openpic_glb_ops_le,
 		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
 		{"tmr", &openpic_tmr_ops_le,
@@ -1288,7 +1259,7 @@ static int openpic_init(SysBusDevice * dev)
 		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
 		{NULL}
 	};
-	static const MemReg list_be[] = {
+	static const struct mem_reg list_be[] = {
 		{"glb", &openpic_glb_ops_be,
 		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
 		{"tmr", &openpic_tmr_ops_be,
@@ -1299,7 +1270,7 @@ static int openpic_init(SysBusDevice * dev)
 		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
 		{NULL}
 	};
-	static const MemReg list_fsl[] = {
+	static const struct mem_reg list_fsl[] = {
 		{"msi", &openpic_msi_ops_be,
 		 OPENPIC_MSI_REG_START, OPENPIC_MSI_REG_SIZE},
 		{"summary", &openpic_summary_ops_be,
-- 
1.7.9.5



* [RFC PATCH 6/6] kvm/ppc/mpic: in-kernel MPIC emulation
  2013-02-14  5:49 ` Scott Wood
@ 2013-02-14  5:49   ` Scott Wood
From: Scott Wood @ 2013-02-14  5:49 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, Scott Wood

Hook the MPIC code up to the KVM interfaces, add locking, etc.

TODO: irqfd support

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
 Documentation/virtual/kvm/devices/mpic.txt |   36 ++
 arch/powerpc/include/asm/kvm_host.h        |    9 +-
 arch/powerpc/include/asm/kvm_ppc.h         |    4 +
 arch/powerpc/kvm/Kconfig                   |    5 +
 arch/powerpc/kvm/Makefile                  |    2 +
 arch/powerpc/kvm/booke.c                   |   10 +-
 arch/powerpc/kvm/mpic.c                    |  875 +++++++++++++++++++++++-----
 arch/powerpc/kvm/powerpc.c                 |   12 +-
 include/linux/kvm_host.h                   |    4 +-
 include/uapi/linux/kvm.h                   |   17 +-
 virt/kvm/kvm_main.c                        |   12 +
 11 files changed, 822 insertions(+), 164 deletions(-)
 create mode 100644 Documentation/virtual/kvm/devices/mpic.txt
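
Note (illustrative only, not part of the patch): the attribute rules stated
in mpic.txt reduce to a little arithmetic. The following is a minimal sketch
in C, assuming hypothetical helper names (mpic_base_addr_ok,
mpic_reg_attr_ok, mpic_irq_from_ivpr_offset) that do not exist anywhere in
this series. Only the 256 KiB register space size, the natural-alignment
requirement, the 4-byte access alignment and the divide-by-32 rule come from
the documentation; the upper bound check on register offsets is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Size of the MPIC register space, as stated in mpic.txt. */
#define MPIC_REG_SPACE_SIZE	(256u * 1024u)
/* Byte distance between consecutive IVPRs, per the "divided by 32" rule. */
#define MPIC_IVPR_STRIDE	32u

/*
 * Hypothetical helper: check a value for KVM_DEV_MPIC_BASE_ADDR.
 * "Naturally aligned" means a multiple of the region size; zero
 * (which disables the mapping) passes the same test.
 */
static bool mpic_base_addr_ok(uint64_t base)
{
	return (base & (MPIC_REG_SPACE_SIZE - 1)) == 0;
}

/*
 * Hypothetical helper: check an "attr" for KVM_DEV_MPIC_GRP_REGISTER.
 * The 4-byte alignment is documented; rejecting offsets past the end
 * of the register space is an assumption made for this sketch.
 */
static bool mpic_reg_attr_ok(uint32_t attr)
{
	return (attr & 3) == 0 && attr < MPIC_REG_SPACE_SIZE;
}

/*
 * Hypothetical helper: derive the "attr" (IRQ number) for
 * KVM_DEV_MPIC_GRP_IRQ_ACTIVE from the byte offset of an IVPR
 * relative to EIVPR0.  Returns -1 if the offset is not on an IVPR.
 */
static int mpic_irq_from_ivpr_offset(uint32_t off)
{
	if (off % MPIC_IVPR_STRIDE)
		return -1;

	return (int)(off / MPIC_IVPR_STRIDE);
}
```

Under these assumptions, userspace would validate a base address or register
offset before passing it to the device, and would compute the IRQ number from
the IVPR offset rather than hard-coding it.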

diff --git a/Documentation/virtual/kvm/devices/mpic.txt b/Documentation/virtual/kvm/devices/mpic.txt
new file mode 100644
index 0000000..1ef30f0
--- /dev/null
+++ b/Documentation/virtual/kvm/devices/mpic.txt
@@ -0,0 +1,36 @@
+MPIC interrupt controller
+=========================
+
+Device types supported:
+  KVM_DEV_TYPE_FSL_MPIC_20     Freescale MPIC v2.0
+  KVM_DEV_TYPE_FSL_MPIC_42     Freescale MPIC v4.2
+
+Only one MPIC instance, of any type, may be instantiated.  The created
+MPIC will act as the system interrupt controller, connecting to each
+vcpu's interrupt inputs.
+
+Groups:
+  KVM_DEV_MPIC_GRP_MISC
+  Attributes:
+    KVM_DEV_MPIC_BASE_ADDR (rw, 64-bit)
+      Base address of the 256 KiB MPIC register space.  Must be
+      naturally aligned.  A value of zero disables the mapping.
+      Reset value is zero.
+
+  KVM_DEV_MPIC_GRP_REGISTER (rw, 32-bit)
+    Access MPIC register state.  "attr" is the byte offset into
+    the MPIC register space.  Accesses must be 4-byte aligned.
+
+    MSIs may be signaled by using this attribute group to write
+    to the relevant MSIIR.
+
+  KVM_DEV_MPIC_GRP_IRQ_ACTIVE (rw, 32-bit)
+    IRQ input line for each standard openpic source.  0 is inactive and 1
+    is active, regardless of interrupt sense.
+
+    For edge-triggered interrupts:  Writing 1 is considered an activating
+    edge, and writing 0 is ignored.  Reading returns 1 if a previously
+    signaled edge has not been acknowledged, and 0 otherwise.
+
+    "attr" is the IRQ number.  IRQ numbers for standard sources are the
+    byte offset of the relevant IVPR from EIVPR0, divided by 32.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 8a72d59..be81c7a 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -256,6 +256,7 @@ struct kvm_arch {
 #ifdef CONFIG_PPC_BOOK3S_64
 	struct list_head spapr_tce_tables;
 #endif
+	void *irqchip_priv;
 };
 
 /*
@@ -359,6 +360,11 @@ struct kvmppc_slb {
 #define KVMPPC_BOOKE_MAX_IAC	4
 #define KVMPPC_BOOKE_MAX_DAC	2
 
+/* KVMPPC_EPR_USER takes precedence over KVMPPC_EPR_KERNEL */
+#define KVMPPC_EPR_NONE		0 /* EPR not supported */
+#define KVMPPC_EPR_USER		1 /* exit to userspace to fill EPR */
+#define KVMPPC_EPR_KERNEL	2 /* in-kernel irqchip */
+
 struct kvmppc_booke_debug_reg {
 	u32 dbcr0;
 	u32 dbcr1;
@@ -520,7 +526,7 @@ struct kvm_vcpu_arch {
 	u8 sane;
 	u8 cpu_type;
 	u8 hcall_needed;
-	u8 epr_enabled;
+	u8 epr_flags; /* KVMPPC_EPR_xxx */
 	u8 epr_needed;
 
 	u32 cpr0_cfgaddr; /* holds the last set cpr0_cfgaddr */
@@ -587,5 +593,6 @@ struct kvm_vcpu_arch {
 #define KVM_MMIO_REG_FQPR	0x0060
 
 #define __KVM_HAVE_ARCH_WQP
+#define __KVM_HAVE_CREATE_DEVICE
 
 #endif /* __POWERPC_KVM_HOST_H__ */
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 44a657a..d46504d 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -165,6 +165,8 @@ extern int kvmppc_prepare_to_enter(struct kvm_vcpu *vcpu);
 
 extern int kvm_vm_ioctl_get_htab_fd(struct kvm *kvm, struct kvm_get_htab_fd *);
 
+int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu, struct kvm_interrupt *irq);
+
 /*
  * Cuts out inst bits with ordering according to spec.
  * That means the leftmost bit is zero. All given bits are included.
@@ -271,6 +273,8 @@ static inline void kvmppc_set_epr(struct kvm_vcpu *vcpu, u32 epr)
 #endif
 }
 
+void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu);
+
 int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
 			      struct kvm_config_tlb *cfg);
 int kvm_vcpu_ioctl_dirty_tlb(struct kvm_vcpu *vcpu,
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 4730c95..18d5e72 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -151,6 +151,11 @@ config KVM_E500MC
 
 	  If unsure, say N.
 
+config KVM_MPIC
+	bool "KVM in-kernel MPIC emulation"
+	depends on KVM
+
+
 source drivers/vhost/Kconfig
 
 endif # VIRTUALIZATION
diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
index b772ede..4a2277a 100644
--- a/arch/powerpc/kvm/Makefile
+++ b/arch/powerpc/kvm/Makefile
@@ -103,6 +103,8 @@ kvm-book3s_32-objs := \
 	book3s_32_mmu.o
 kvm-objs-$(CONFIG_KVM_BOOK3S_32) := $(kvm-book3s_32-objs)
 
+kvm-objs-$(CONFIG_KVM_MPIC) += mpic.o
+
 kvm-objs := $(kvm-objs-m) $(kvm-objs-y)
 
 obj-$(CONFIG_KVM_440) += kvm.o
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 020923e..8483cb2 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -347,7 +347,7 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu *vcpu,
 		keep_irq = true;
 	}
 
-	if ((priority == BOOKE_IRQPRIO_EXTERNAL) && vcpu->arch.epr_enabled)
+	if ((priority == BOOKE_IRQPRIO_EXTERNAL) && vcpu->arch.epr_flags)
 		update_epr = true;
 
 	switch (priority) {
@@ -428,8 +428,12 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu *vcpu,
 			set_guest_esr(vcpu, vcpu->arch.queued_esr);
 		if (update_dear == true)
 			set_guest_dear(vcpu, vcpu->arch.queued_dear);
-		if (update_epr == true)
-			kvm_make_request(KVM_REQ_EPR_EXIT, vcpu);
+		if (update_epr == true) {
+			if (vcpu->arch.epr_flags & KVMPPC_EPR_USER)
+				kvm_make_request(KVM_REQ_EPR_EXIT, vcpu);
+			else if (vcpu->arch.epr_flags & KVMPPC_EPR_KERNEL)
+				kvmppc_mpic_set_epr(vcpu);
+		}
 
 		new_msr &= msr_mask;
 #if defined(CONFIG_64BIT)
diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index 1df67ae..27040e4 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -23,6 +23,18 @@
  * THE SOFTWARE.
  */
 
+#include <linux/slab.h>
+#include <linux/mutex.h>
+#include <linux/kvm_host.h>
+#include <linux/errno.h>
+#include <linux/notifier.h>
+#include <asm/uaccess.h>
+#include <asm/mpic.h>
+#include <asm/kvm_para.h>
+#include <asm/kvm_host.h>
+#include <asm/kvm_ppc.h>
+#include "iodev.h"
+
 #define MAX_CPU     32
 #define MAX_SRC     256
 #define MAX_TMR     4
@@ -89,6 +101,7 @@ static struct fsl_mpic_info fsl_mpic_42 = {
 #define ILR_INTTGT_INT    0x00
 #define ILR_INTTGT_CINT   0x01	/* critical */
 #define ILR_INTTGT_MCP    0x02	/* machine check */
+#define NUM_OUTPUTS       3
 
 #define MSIIR_OFFSET       0x140
 #define MSIIR_SRS_SHIFT    29
@@ -98,18 +111,14 @@ static struct fsl_mpic_info fsl_mpic_42 = {
 
 static int get_current_cpu(void)
 {
-	CPUState *cpu_single_cpu;
-
-	if (!cpu_single_env)
-		return -1;
-
-	cpu_single_cpu = ENV_GET_CPU(cpu_single_env);
-	return cpu_single_cpu->cpu_index;
+	struct kvm_vcpu *vcpu = current->thread.kvm_vcpu;
+	return vcpu ? vcpu->vcpu_id : -1;
 }
 
-static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx);
-static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
-				       uint32_t val, int idx);
+static int openpic_cpu_write_internal(struct kvm_io_device *this, gpa_t addr,
+				      u32 val, int idx);
+static int openpic_cpu_read_internal(struct kvm_io_device *this, gpa_t addr,
+				     u32 *ptr, int idx);
 
 enum irq_type {
 	IRQ_TYPE_NORMAL = 0,
@@ -131,7 +140,7 @@ struct irq_source {
 	uint32_t idr;		/* IRQ destination register */
 	uint32_t destmask;	/* bitmap of CPU destinations */
 	int last_cpu;
-	int output;		/* IRQ level, e.g. OPENPIC_OUTPUT_INT */
+	int output;		/* IRQ level, e.g. ILR_INTTGT_INT */
 	int pending;		/* TRUE if IRQ is pending */
 	enum irq_type type;
 	bool level:1;		/* level-triggered */
@@ -158,16 +167,35 @@ struct irq_source {
 #define IDR_CI      0x40000000	/* critical interrupt */
 
 struct irq_dest {
+	struct kvm_vcpu *vcpu;
+
 	int32_t ctpr;		/* CPU current task priority */
 	struct irq_queue raised;
 	struct irq_queue servicing;
-	qemu_irq *irqs;
 
 	/* Count of IRQ sources asserting on non-INT outputs */
-	uint32_t outputs_active[OPENPIC_OUTPUT_NB];
+	uint32_t outputs_active[NUM_OUTPUTS];
+};
+
+struct openpic;
+
+struct sub_region {
+	struct kvm_io_device iodev;
+	struct openpic *opp;
+	gpa_t base;
+	int size;
 };
 
 struct openpic {
+	struct kvm_device dev;
+	struct kvm *kvm;
+	gpa_t reg_base;
+	spinlock_t lock;
+	struct notifier_block vcpu_notifier;
+
+	struct sub_region sub_io_mem[6];
+	int sub_count;
+
 	/* Behavior control */
 	struct fsl_mpic_info *fsl;
 	uint32_t model;
@@ -208,6 +236,51 @@ struct openpic {
 	uint32_t irq_msi;
 };
 
+
+static void mpic_irq_raise(struct openpic *opp, struct irq_dest *dst,
+			   int output)
+{
+	struct kvm_interrupt irq = {
+		.irq = KVM_INTERRUPT_SET_LEVEL,
+	};
+
+	if (!dst->vcpu) {
+		pr_debug("%s: destination cpu %d does not exist\n",
+			 __func__, dst - &opp->dst[0]);
+		return;
+	}
+
+	pr_debug("%s: cpu %d output %d\n", __func__, dst->vcpu->vcpu_id,
+		output);
+
+	if (output != ILR_INTTGT_INT)	/* TODO */
+		return;
+
+	kvm_vcpu_ioctl_interrupt(dst->vcpu, &irq);
+}
+
+static void mpic_irq_lower(struct openpic *opp, struct irq_dest *dst,
+			   int output)
+{
+	struct kvm_interrupt irq = {
+		.irq = KVM_INTERRUPT_UNSET,
+	};
+
+	if (!dst->vcpu) {
+		pr_debug("%s: destination cpu %d does not exist\n",
+			 __func__, dst - &opp->dst[0]);
+		return;
+	}
+
+	pr_debug("%s: cpu %d output %d\n", __func__, dst->vcpu->vcpu_id,
+		output);
+
+	if (output != ILR_INTTGT_INT)	/* TODO */
+		return;
+
+	kvmppc_core_dequeue_external(dst->vcpu, &irq);
+}
+
 static inline void IRQ_setbit(struct irq_queue *q, int n_IRQ)
 {
 	set_bit(n_IRQ, q->queue);
@@ -268,7 +341,7 @@ static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 	pr_debug("%s: IRQ %d active %d was %d\n",
 		__func__, n_IRQ, active, was_active);
 
-	if (src->output != OPENPIC_OUTPUT_INT) {
+	if (src->output != ILR_INTTGT_INT) {
 		pr_debug("%s: output %d irq %d active %d was %d count %d\n",
 			__func__, src->output, n_IRQ, active, was_active,
 			dst->outputs_active[src->output]);
@@ -282,14 +355,14 @@ static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 			    dst->outputs_active[src->output]++ == 0) {
 				pr_debug("%s: Raise OpenPIC output %d cpu %d irq %d\n",
 					__func__, src->output, n_CPU, n_IRQ);
-				qemu_irq_raise(dst->irqs[src->output]);
+				mpic_irq_raise(opp, dst, src->output);
 			}
 		} else {
 			if (was_active &&
 			    --dst->outputs_active[src->output] == 0) {
 				pr_debug("%s: Lower OpenPIC output %d cpu %d irq %d\n",
 					__func__, src->output, n_CPU, n_IRQ);
-				qemu_irq_lower(dst->irqs[src->output]);
+				mpic_irq_lower(opp, dst, src->output);
 			}
 		}
 
@@ -322,8 +395,7 @@ static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 		} else {
 			pr_debug("%s: Raise OpenPIC INT output cpu %d irq %d/%d\n",
 				__func__, n_CPU, n_IRQ, dst->raised.next);
-			qemu_irq_raise(opp->dst[n_CPU].
-				       irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_raise(opp, dst, ILR_INTTGT_INT);
 		}
 	} else {
 		IRQ_get_next(opp, &dst->servicing);
@@ -338,8 +410,7 @@ static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 			pr_debug("%s: IRQ %d inactive, current prio %d/%d, CPU %d\n",
 				__func__, n_IRQ, dst->ctpr,
 				dst->servicing.priority, n_CPU);
-			qemu_irq_lower(opp->dst[n_CPU].
-				       irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_lower(opp, dst, ILR_INTTGT_INT);
 		}
 	}
 }
@@ -415,8 +486,8 @@ static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 	struct irq_source *src;
 
 	if (n_IRQ >= MAX_IRQ) {
-		pr_err("%s: IRQ %d out of range\n", __func__, n_IRQ);
-		abort();
+		WARN_ONCE(1, "%s: IRQ %d out of range\n", __func__, n_IRQ);
+		return;
 	}
 
 	src = &opp->src[n_IRQ];
@@ -433,7 +504,7 @@ static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 			openpic_update_irq(opp, n_IRQ);
 		}
 
-		if (src->output != OPENPIC_OUTPUT_INT) {
+		if (src->output != ILR_INTTGT_INT) {
 			/* Edge-triggered interrupts shouldn't be used
 			 * with non-INT delivery, but just in case,
 			 * try to make it do something sane rather than
@@ -446,15 +517,14 @@ static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 	}
 }
 
-static void openpic_reset(DeviceState *d)
+static void openpic_reset(struct openpic *opp)
 {
-	struct openpic *opp = FROM_SYSBUS(typeof(*opp), SYS_BUS_DEVICE(d));
 	int i;
 
 	opp->gcr = GCR_RESET;
+
 	/* Initialise controller registers */
 	opp->frr = ((opp->nb_irqs - 1) << FRR_NIRQ_SHIFT) |
-	    ((opp->nb_cpus - 1) << FRR_NCPU_SHIFT) |
 	    (opp->vid << FRR_VID_SHIFT);
 
 	opp->pir = 0;
@@ -504,7 +574,7 @@ static inline uint32_t read_IRQreg_idr(struct openpic *opp, int n_IRQ)
 static inline uint32_t read_IRQreg_ilr(struct openpic *opp, int n_IRQ)
 {
 	if (opp->flags & OPENPIC_FLAG_ILR)
-		return output_to_inttgt(opp->src[n_IRQ].output);
+		return opp->src[n_IRQ].output;
 
 	return 0xffffffff;
 }
@@ -539,7 +609,7 @@ static inline void write_IRQreg_idr(struct openpic *opp, int n_IRQ,
 					__func__);
 			}
 
-			src->output = OPENPIC_OUTPUT_CINT;
+			src->output = ILR_INTTGT_CINT;
 			src->nomask = true;
 			src->destmask = 0;
 
@@ -550,7 +620,7 @@ static inline void write_IRQreg_idr(struct openpic *opp, int n_IRQ,
 					src->destmask |= 1UL << i;
 			}
 		} else {
-			src->output = OPENPIC_OUTPUT_INT;
+			src->output = ILR_INTTGT_INT;
 			src->nomask = false;
 			src->destmask = src->idr & normal_mask;
 		}
@@ -565,7 +635,7 @@ static inline void write_IRQreg_ilr(struct openpic *opp, int n_IRQ,
 	if (opp->flags & OPENPIC_FLAG_ILR) {
 		struct irq_source *src = &opp->src[n_IRQ];
 
-		src->output = inttgt_to_output(val & ILR_INTTGT_MASK);
+		src->output = val & ILR_INTTGT_MASK;
 		pr_debug("Set ILR %d to 0x%08x, output %d\n", n_IRQ, src->idr,
 			src->output);
 
@@ -614,34 +684,77 @@ static inline void write_IRQreg_ivpr(struct openpic *opp, int n_IRQ,
 
 static void openpic_gcr_write(struct openpic *opp, uint64_t val)
 {
+#if 0
 	bool mpic_proxy = false;
+#endif
 
 	if (val & GCR_RESET) {
-		openpic_reset(&opp->busdev.qdev);
+		openpic_reset(opp);
 		return;
 	}
 
 	opp->gcr &= ~opp->mpic_mode_mask;
 	opp->gcr |= val & opp->mpic_mode_mask;
-
+#if 0
 	/* Set external proxy mode */
 	if ((val & opp->mpic_mode_mask) == GCR_MODE_PROXY)
 		mpic_proxy = true;
 
 	ppce500_set_mpic_proxy(mpic_proxy);
+#endif
 }
 
-static void openpic_gbl_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned len)
+static int openpic_get_val32(int len, const void *ptr, u32 *val)
 {
-	struct openpic *opp = opaque;
+	if (len != 4) {
+		pr_debug("%s: bad length %d\n", __func__, len);
+		return -EINVAL;
+	}
+
+	memcpy(val, ptr, min(len, 4));
+	return 0;
+}
+
+static int openpic_put_val32(int len, void *ptr, u32 val)
+{
+	/*
+	 * Technically only 32-bit accesses are allowed, but be nice
+	 * to people dumping registers -- it works in real hardware
+	 * (reads only, not writes).
+	 */
+	if (len > 4) {
+		pr_debug("%s: bad length %d\n", __func__, len);
+		return -EINVAL;
+	}
+
+	memcpy(ptr, &val, min(len, 4));
+	return 0;
+}
+
+static int openpic_gbl_write(struct kvm_io_device *this, gpa_t addr,
+			     int len, const void *ptr)
+{
+	struct sub_region *sub = container_of(this, struct sub_region, iodev);
+	struct openpic *opp = sub->opp;
+#if 0
 	struct irq_dest *dst;
-	int idx;
+#endif
+	u32 val;
+	int ret, idx;
+
+	addr -= sub->base;
+	if (addr > sub->size)
+		return 1;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
-		__func__, addr, val);
+	ret = openpic_get_val32(len, ptr, &val);
+	if (ret < 0)
+		return 1;
+
+	pr_debug("%s: addr %#llx <= %08x\n", __func__, addr, val);
 	if (addr & 0xF)
-		return;
+		return 0;
+
+	spin_lock_irq(&opp->lock);
 
 	switch (addr) {
 	case 0x00:	/* Block Revision Register1 (BRR1) is Readonly */
@@ -654,7 +767,7 @@ static void openpic_gbl_write(void *opaque, gpa_t addr, uint64_t val,
 	case 0x90:
 	case 0xA0:
 	case 0xB0:
-		openpic_cpu_write_internal(opp, addr, val, get_current_cpu());
+		openpic_cpu_write_internal(this, addr, val, get_current_cpu());
 		break;
 	case 0x1000:		/* FRR */
 		break;
@@ -668,14 +781,18 @@ static void openpic_gbl_write(void *opaque, gpa_t addr, uint64_t val,
 			if ((val & (1 << idx)) && !(opp->pir & (1 << idx))) {
 				pr_debug("Raise OpenPIC RESET output for CPU %d\n",
 					idx);
+#if 0
 				dst = &opp->dst[idx];
-				qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_RESET]);
+				mpic_irq_raise(opp, dst, OPENPIC_OUTPUT_RESET);
+#endif
 			} else if (!(val & (1 << idx)) &&
 				   (opp->pir & (1 << idx))) {
 				pr_debug("Lower OpenPIC RESET output for CPU %d\n",
 					idx);
+#if 0
 				dst = &opp->dst[idx];
-				qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_RESET]);
+				mpic_irq_lower(opp, dst, OPENPIC_OUTPUT_RESET);
+#endif
 			}
 		}
 		opp->pir = val;
@@ -695,21 +812,34 @@ static void openpic_gbl_write(void *opaque, gpa_t addr, uint64_t val,
 	default:
 		break;
 	}
+
+	spin_unlock_irq(&opp->lock);
+	return 0;
 }
 
-static uint64_t openpic_gbl_read(void *opaque, gpa_t addr, unsigned len)
+static int openpic_gbl_read(struct kvm_io_device *this, gpa_t addr,
+			    int len, void *ptr)
 {
-	struct openpic *opp = opaque;
-	uint32_t retval;
+	struct sub_region *sub = container_of(this, struct sub_region, iodev);
+	struct openpic *opp = sub->opp;
+	u32 retval;
+	int ret;
+
+	addr -= sub->base;
+	if (addr > sub->size)
+		return 1;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 	retval = 0xFFFFFFFF;
 	if (addr & 0xF)
-		return retval;
+		goto out;
+
+	spin_lock_irq(&opp->lock);
 
 	switch (addr) {
 	case 0x1000:		/* FRR */
 		retval = opp->frr;
+		retval |= (opp->nb_cpus - 1) << FRR_NCPU_SHIFT;
 		break;
 	case 0x1020:		/* GCR */
 		retval = opp->gcr;
@@ -731,8 +861,8 @@ static uint64_t openpic_gbl_read(void *opaque, gpa_t addr, unsigned len)
 	case 0x90:
 	case 0xA0:
 	case 0xB0:
-		retval =
-		    openpic_cpu_read_internal(opp, addr, get_current_cpu());
+		ret = openpic_cpu_read_internal(this, addr,
+			&retval, get_current_cpu());
 		break;
 	case 0x10A0:		/* IPI_IVPR */
 	case 0x10B0:
@@ -750,33 +880,51 @@ static uint64_t openpic_gbl_read(void *opaque, gpa_t addr, unsigned len)
 	default:
 		break;
 	}
+
+	spin_unlock_irq(&opp->lock);
+out:
 	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
-	return retval;
+	ret = openpic_put_val32(len, ptr, retval);
+	if (ret < 0)
+		return 1;
+
+	return 0;
 }
 
-static void openpic_tmr_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned len)
+static int openpic_tmr_write(struct kvm_io_device *this, gpa_t addr,
+			     int len, const void *ptr)
 {
-	struct openpic *opp = opaque;
-	int idx;
+	struct sub_region *sub = container_of(this, struct sub_region, iodev);
+	struct openpic *opp = sub->opp;
+	u32 val;
+	int ret, idx;
+
+	addr -= sub->base;
+	if (addr > sub->size)
+		return 1;
+
+	ret = openpic_get_val32(len, ptr, &val);
+	if (ret < 0)
+		return 1;
 
 	addr += 0x10f0;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
-		__func__, addr, val);
+	pr_debug("%s: addr %#llx <= %08x\n", __func__, addr, val);
 	if (addr & 0xF)
-		return;
+		return 0;
 
 	if (addr == 0x10f0) {
 		/* TFRR */
 		opp->tfrr = val;
-		return;
+		return 0;
 	}
 
 	idx = (addr >> 6) & 0x3;
 	addr = addr & 0x30;
 
+	spin_lock_irq(&opp->lock);
+
 	switch (addr & 0x30) {
 	case 0x00:		/* TCCR */
 		break;
@@ -795,15 +943,25 @@ static void openpic_tmr_write(void *opaque, gpa_t addr, uint64_t val,
 		write_IRQreg_idr(opp, opp->irq_tim0 + idx, val);
 		break;
 	}
+
+	spin_unlock_irq(&opp->lock);
+
+	return 0;
 }
 
-static uint64_t openpic_tmr_read(void *opaque, gpa_t addr, unsigned len)
+static int openpic_tmr_read(struct kvm_io_device *this, gpa_t addr,
+				 int len, void *ptr)
 {
-	struct openpic *opp = opaque;
+	struct sub_region *sub = container_of(this, struct sub_region, iodev);
+	struct openpic *opp = sub->opp;
 	uint32_t retval = -1;
-	int idx;
+	int ret, idx;
+
+	addr -= sub->base;
+	if (addr > sub->size)
+		return 1;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 	if (addr & 0xF)
 		goto out;
 
@@ -813,6 +971,9 @@ static uint64_t openpic_tmr_read(void *opaque, gpa_t addr, unsigned len)
 		retval = opp->tfrr;
 		goto out;
 	}
+
+	spin_lock_irq(&opp->lock);
+
 	switch (addr & 0x30) {
 	case 0x00:		/* TCCR */
 		retval = opp->timers[idx].tccr;
@@ -828,24 +989,40 @@ static uint64_t openpic_tmr_read(void *opaque, gpa_t addr, unsigned len)
 		break;
 	}
 
+	spin_unlock_irq(&opp->lock);
 out:
 	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
-	return retval;
+	ret = openpic_put_val32(len, ptr, retval);
+	if (ret < 0)
+		return 1;
+
+	return 0;
 }
 
-static void openpic_src_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned len)
+static int openpic_src_write(struct kvm_io_device *this, gpa_t addr,
+			     int len, const void *ptr)
 {
-	struct openpic *opp = opaque;
-	int idx;
+	struct sub_region *sub = container_of(this, struct sub_region, iodev);
+	struct openpic *opp = sub->opp;
+	u32 val;
+	int ret, idx;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
-		__func__, addr, val);
+	addr -= sub->base;
+	if (addr > sub->size)
+		return 1;
+
+	ret = openpic_get_val32(len, ptr, &val);
+	if (ret < 0)
+		return 1;
+
+	pr_debug("%s: addr %#llx <= %08x\n", __func__, addr, val);
 
 	addr = addr & 0xffff;
 	idx = addr >> 5;
 
+	spin_lock_irq(&opp->lock);
+
 	switch (addr & 0x1f) {
 	case 0x00:
 		write_IRQreg_ivpr(opp, idx, val);
@@ -857,20 +1034,32 @@ static void openpic_src_write(void *opaque, gpa_t addr, uint64_t val,
 		write_IRQreg_ilr(opp, idx, val);
 		break;
 	}
+
+
+	spin_unlock_irq(&opp->lock);
+	return 0;
 }
 
-static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
+static int openpic_src_read(struct kvm_io_device *this, uint64_t addr,
+			    int len, void *ptr)
 {
-	struct openpic *opp = opaque;
+	struct sub_region *sub = container_of(this, struct sub_region, iodev);
+	struct openpic *opp = sub->opp;
 	uint32_t retval;
-	int idx;
+	int ret, idx;
+
+	addr -= sub->base;
+	if (addr > sub->size)
+		return 1;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 	retval = 0xFFFFFFFF;
 
 	addr = addr & 0xffff;
 	idx = addr >> 5;
 
+	spin_lock_irq(&opp->lock);
+
 	switch (addr & 0x1f) {
 	case 0x00:
 		retval = read_IRQreg_ivpr(opp, idx);
@@ -883,21 +1072,38 @@ static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
 		break;
 	}
 
+	spin_unlock_irq(&opp->lock);
 	pr_debug("%s: => 0x%08x\n", __func__, retval);
-	return retval;
+
+	ret = openpic_put_val32(len, ptr, retval);
+	if (ret < 0)
+		return 1;
+
+	return 0;
 }
 
-static void openpic_msi_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned size)
+static int openpic_msi_write(struct kvm_io_device *this, gpa_t addr,
+			     int len, const void *ptr)
 {
-	struct openpic *opp = opaque;
+	struct sub_region *sub = container_of(this, struct sub_region, iodev);
+	struct openpic *opp = sub->opp;
+	u32 val;
 	int idx = opp->irq_msi;
-	int srs, ibs;
+	int srs, ibs, ret;
+
+	addr -= sub->base;
+	if (addr > sub->size)
+		return 1;
+
+	ret = openpic_get_val32(len, ptr, &val);
+	if (ret < 0)
+		return 1;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
-		__func__, addr, val);
+	pr_debug("%s: addr %#llx <= 0x%08x\n", __func__, addr, val);
 	if (addr & 0xF)
-		return;
+		return 0;
+
+	spin_lock_irq(&opp->lock);
 
 	switch (addr) {
 	case MSIIR_OFFSET:
@@ -911,20 +1117,31 @@ static void openpic_msi_write(void *opaque, gpa_t addr, uint64_t val,
 		/* most registers are read-only, thus ignored */
 		break;
 	}
+
+	spin_unlock_irq(&opp->lock);
+	return 0;
 }
 
-static uint64_t openpic_msi_read(void *opaque, gpa_t addr, unsigned size)
+static int openpic_msi_read(struct kvm_io_device *this, gpa_t addr,
+			    int len, void *ptr)
 {
-	struct openpic *opp = opaque;
-	uint64_t r = 0;
-	int i, srs;
+	struct sub_region *sub = container_of(this, struct sub_region, iodev);
+	struct openpic *opp = sub->opp;
+	uint32_t r = 0;
+	int i, srs, ret;
+
+	addr -= sub->base;
+	if (addr > sub->size)
+		return 1;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 	if (addr & 0xF)
-		return -1;
+		return 1;
 
 	srs = addr >> 4;
 
+	spin_lock_irq(&opp->lock);
+
 	switch (addr) {
 	case 0x00:
 	case 0x10:
@@ -945,45 +1162,76 @@ static uint64_t openpic_msi_read(void *opaque, gpa_t addr, unsigned size)
 		break;
 	}
 
-	return r;
+	spin_unlock_irq(&opp->lock);
+	pr_debug("%s: => 0x%08x\n", __func__, r);
+
+	ret = openpic_put_val32(len, ptr, r);
+	if (ret < 0)
+		return 1;
+
+	return 0;
 }
 
-static uint64_t openpic_summary_read(void *opaque, gpa_t addr, unsigned size)
+static int openpic_summary_read(struct kvm_io_device *this, gpa_t addr,
+				int len, void *ptr)
 {
-	uint64_t r = 0;
+	struct sub_region *sub = container_of(this, struct sub_region, iodev);
+	uint32_t r = 0;
+	int ret;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	addr -= sub->base;
+	if (addr > sub->size)
+		return 1;
+
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 
 	/* TODO: EISR/EIMR */
 
-	return r;
+	ret = openpic_put_val32(len, ptr, r);
+	if (ret < 0)
+		return 1;
+
+	return 0;
 }
 
-static void openpic_summary_write(void *opaque, gpa_t addr, uint64_t val,
-				  unsigned size)
+static int openpic_summary_write(struct kvm_io_device *this, gpa_t addr,
+				 int len, const void *ptr)
 {
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
-		__func__, addr, val);
+	struct sub_region *sub = container_of(this, struct sub_region, iodev);
+	int ret;
+	uint32_t val;
+
+	addr -= sub->base;
+	if (addr > sub->size)
+		return 1;
+
+	ret = openpic_get_val32(len, ptr, &val);
+	if (ret < 0)
+		return 1;
+
+	pr_debug("%s: addr %#llx <= 0x%08x\n", __func__, addr, val);
 
 	/* TODO: EISR/EIMR */
+	return 0;
 }
 
-static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
-				       uint32_t val, int idx)
+static int openpic_cpu_write_internal(struct kvm_io_device *this, gpa_t addr,
+				      u32 val, int idx)
 {
-	struct openpic *opp = opaque;
+	struct sub_region *sub = container_of(this, struct sub_region, iodev);
+	struct openpic *opp = sub->opp;
 	struct irq_source *src;
 	struct irq_dest *dst;
 	int s_IRQ, n_IRQ;
 
-	pr_debug("%s: cpu %d addr %#" HWADDR_PRIx " <= 0x%08x\n", __func__, idx,
+	pr_debug("%s: cpu %d addr %#llx <= 0x%08x\n", __func__, idx,
 		addr, val);
 
 	if (idx < 0)
-		return;
+		return 0;
 
 	if (addr & 0xF)
-		return;
+		return 0;
 
 	dst = &opp->dst[idx];
 	addr &= 0xFF0;
@@ -1008,11 +1256,11 @@ static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
 		if (dst->raised.priority <= dst->ctpr) {
 			pr_debug("%s: Lower OpenPIC INT output cpu %d due to ctpr\n",
 				__func__, idx);
-			qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_lower(opp, dst, ILR_INTTGT_INT);
 		} else if (dst->raised.priority > dst->servicing.priority) {
 			pr_debug("%s: Raise OpenPIC INT output cpu %d irq %d\n",
 				__func__, idx, dst->raised.next);
-			qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_raise(opp, dst, ILR_INTTGT_INT);
 		}
 
 		break;
@@ -1043,18 +1291,38 @@ static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
 		     IVPR_PRIORITY(src->ivpr) > dst->servicing.priority)) {
 			pr_debug("Raise OpenPIC INT output cpu %d irq %d\n",
 				idx, n_IRQ);
-			qemu_irq_raise(opp->dst[idx].irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_raise(opp, dst, ILR_INTTGT_INT);
 		}
 		break;
 	default:
 		break;
 	}
+
+	return 0;
 }
 
-static void openpic_cpu_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned len)
+static int openpic_cpu_write(struct kvm_io_device *this, gpa_t addr,
+			     int len, const void *ptr)
 {
-	openpic_cpu_write_internal(opaque, addr, val, (addr & 0x1f000) >> 12);
+	struct sub_region *sub = container_of(this, struct sub_region, iodev);
+	struct openpic *opp = sub->opp;
+	u32 val;
+	int ret;
+
+	addr -= sub->base;
+	if (addr > sub->size)
+		return 1;
+
+	ret = openpic_get_val32(len, ptr, &val);
+	if (ret < 0)
+		return 1;
+
+	spin_lock_irq(&opp->lock);
+	ret = openpic_cpu_write_internal(this, addr, val,
+					 (addr & 0x1f000) >> 12);
+
+	spin_unlock_irq(&opp->lock);
+	return ret;
 }
 
 static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
@@ -1064,7 +1332,7 @@ static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
 	int retval, irq;
 
 	pr_debug("Lower OpenPIC INT output\n");
-	qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
+	mpic_irq_lower(opp, dst, ILR_INTTGT_INT);
 
 	irq = IRQ_get_next(opp, &dst->raised);
 	pr_debug("IACK: irq=%d\n", irq);
@@ -1107,20 +1375,37 @@ static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
 	return retval;
 }
 
-static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx)
+void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu)
 {
-	struct openpic *opp = opaque;
+	struct kvm *kvm = vcpu->kvm;
+	struct openpic *opp = kvm->arch.irqchip_priv;
+	int cpu = vcpu->vcpu_id;
+	unsigned long flags;
+
+	spin_lock_irqsave(&opp->lock, flags);
+
+	if ((opp->gcr & opp->mpic_mode_mask) == GCR_MODE_PROXY)
+		kvmppc_set_epr(vcpu, openpic_iack(opp, &opp->dst[cpu], cpu));
+
+	spin_unlock_irqrestore(&opp->lock, flags);
+}
+
+static int openpic_cpu_read_internal(struct kvm_io_device *this, gpa_t addr,
+				     u32 *ptr, int idx)
+{
+	struct sub_region *sub = container_of(this, struct sub_region, iodev);
+	struct openpic *opp = sub->opp;
 	struct irq_dest *dst;
 	uint32_t retval;
 
-	pr_debug("%s: cpu %d addr %#" HWADDR_PRIx "\n", __func__, idx, addr);
+	pr_debug("%s: cpu %d addr %#llx\n", __func__, idx, addr);
 	retval = 0xFFFFFFFF;
 
 	if (idx < 0)
-		return retval;
+		goto out;
 
 	if (addr & 0xF)
-		return retval;
+		goto out;
 
 	dst = &opp->dst[idx];
 	addr &= 0xFF0;
@@ -1142,12 +1427,35 @@ static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx)
 	}
 	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
-	return retval;
+out:
+	*ptr = retval;
+	return 0;
 }
 
-static uint64_t openpic_cpu_read(void *opaque, gpa_t addr, unsigned len)
+static int openpic_cpu_read(struct kvm_io_device *this, gpa_t addr,
+			    int len, void *ptr)
 {
-	return openpic_cpu_read_internal(opaque, addr, (addr & 0x1f000) >> 12);
+	struct sub_region *sub = container_of(this, struct sub_region, iodev);
+	struct openpic *opp = sub->opp;
+	int ret;
+	u32 val;
+
+	addr -= sub->base;
+	if (addr > sub->size)
+		return 1;
+
+	spin_lock_irq(&opp->lock);
+	ret = openpic_cpu_read_internal(this, addr, &val,
+					(addr & 0x1f000) >> 12);
+	spin_unlock_irq(&opp->lock);
+	if (ret < 0)
+		return 1;
+
+	ret = openpic_put_val32(len, ptr, val);
+	if (ret < 0)
+		return 1;
+
+	return 0;
 }
 
 static const struct kvm_io_device_ops openpic_glb_ops_be = {
@@ -1205,11 +1513,10 @@ static void fsl_common_init(struct openpic *opp)
 	opp->irq_tim0 = virq;
 	virq += MAX_TMR;
 
-	assert(virq <= MAX_IRQ);
+	BUG_ON(virq > MAX_IRQ);
 
 	opp->irq_msi = 224;
 
-	msi_supported = true;
 	for (i = 0; i < opp->fsl->max_ext; i++)
 		opp->src[i].level = false;
 
@@ -1226,39 +1533,55 @@ static void fsl_common_init(struct openpic *opp)
 	}
 }
 
-static void map_list(struct openpic *opp, const struct mem_reg *list,
-		     int *count)
+static void map_list(struct openpic *opp, const struct mem_reg *list)
 {
+	mutex_lock(&opp->kvm->slots_lock);
+
 	while (list->name) {
-		assert(*count < ARRAY_SIZE(opp->sub_io_mem));
+		struct sub_region *sub;
+
+		BUG_ON(opp->sub_count >= ARRAY_SIZE(opp->sub_io_mem));
 
-		memory_region_init_io(&opp->sub_io_mem[*count], list->ops, opp,
-				      list->name, list->size);
+		sub = &opp->sub_io_mem[opp->sub_count];
+		sub->opp = opp;
+		sub->base = opp->reg_base + list->start_addr;
+		sub->size = list->size;
 
-		memory_region_add_subregion(&opp->mem, list->start_addr,
-					    &opp->sub_io_mem[*count]);
+		kvm_iodevice_init(&sub->iodev, list->ops);
 
-		(*count)++;
+		kvm_io_bus_register_dev(opp->kvm, KVM_MMIO_BUS,
+			opp->reg_base + list->start_addr, list->size,
+			&sub->iodev);
+
+		opp->sub_count++;
 		list++;
 	}
+
+	mutex_unlock(&opp->kvm->slots_lock);
+}
+
+static void unmap_all(struct openpic *opp)
+{
+	int i;
+
+	mutex_lock(&opp->kvm->slots_lock);
+
+	for (i = 0; i < opp->sub_count; i++) {
+		kvm_io_bus_unregister_dev(opp->kvm, KVM_MMIO_BUS,
+			&opp->sub_io_mem[i].iodev);
+	}
+
+	mutex_unlock(&opp->kvm->slots_lock);
+
+	opp->sub_count = 0;
 }
 
-static int openpic_init(SysBusDevice *dev)
+static int set_base_addr(struct kvm *kvm, struct kvm_device *dev,
+			 struct kvm_device_attr *attr)
 {
-	struct openpic *opp = FROM_SYSBUS(typeof(*opp), dev);
-	int i, j;
-	int list_count = 0;
-	static const struct mem_reg list_le[] = {
-		{"glb", &openpic_glb_ops_le,
-		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
-		{"tmr", &openpic_tmr_ops_le,
-		 OPENPIC_TMR_REG_START, OPENPIC_TMR_REG_SIZE},
-		{"src", &openpic_src_ops_le,
-		 OPENPIC_SRC_REG_START, OPENPIC_SRC_REG_SIZE},
-		{"cpu", &openpic_cpu_ops_le,
-		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
-		{NULL}
-	};
+	struct openpic *opp = container_of(dev, struct openpic, dev);
+	u64 base;
+
 	static const struct mem_reg list_be[] = {
 		{"glb", &openpic_glb_ops_be,
 		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
@@ -1278,11 +1601,239 @@ static int openpic_init(SysBusDevice *dev)
 		{NULL}
 	};
 
-	memory_region_init(&opp->mem, "openpic", 0x40000);
+	if (copy_from_user(&base, (u64 __user *)(long)attr->addr, sizeof(u64)))
+		return -EFAULT;
+
+	if (base & 0x3ffff) {
+		pr_debug("kvm mpic %s: KVM_DEV_MPIC_BASE_ADDR %08llx not aligned\n",
+			 __func__, base);
+		return -EINVAL;
+	}
+
+	if (base == opp->reg_base)
+		return 0;
+
+	unmap_all(opp);
+	opp->reg_base = base;
+
+	pr_debug("kvm mpic %s: KVM_DEV_MPIC_BASE_ADDR %08llx\n",
+		 __func__, base);
+
+	if (base == 0)
+		return 0;
 
 	switch (opp->model) {
-	case OPENPIC_MODEL_FSL_MPIC_20:
+	case KVM_DEV_TYPE_FSL_MPIC_20:
+		map_list(opp, list_be);
+		map_list(opp, list_fsl);
+
+		break;
+
+	case KVM_DEV_TYPE_FSL_MPIC_42:
+		map_list(opp, list_be);
+		map_list(opp, list_fsl);
+
+		break;
+
 	default:
+		WARN_ON_ONCE(1);
+	}
+
+	return 0;
+}
+
+#define ATTR_SET		0
+#define ATTR_GET		1
+
+static int access_reg(struct openpic *opp, gpa_t addr, u32 *val, int type)
+{
+	int ret;
+
+	if (!opp->sub_count)
+		return -EPERM;
+
+	if (addr & 3)
+		return -ENXIO;
+
+	if (addr > 0x40000)
+		return -ENXIO;
+
+	addr += opp->reg_base;
+
+	mutex_lock(&opp->kvm->slots_lock);
+
+	if (type == ATTR_SET)
+		ret = kvm_io_bus_write(opp->kvm, KVM_MMIO_BUS, addr, 4, val);
+	else
+		ret = kvm_io_bus_read(opp->kvm, KVM_MMIO_BUS, addr, 4, val);
+
+	mutex_unlock(&opp->kvm->slots_lock);
+
+	pr_debug("%s: type %d addr %llx val %x\n", __func__, type, addr, *val);
+
+	return ret;
+}
+
+static int mpic_set_attr(struct kvm *kvm, struct kvm_device *dev,
+			 struct kvm_device_attr *attr)
+{
+	struct openpic *opp = container_of(dev, struct openpic, dev);
+	u32 attr32;
+
+	switch (attr->group) {
+	case KVM_DEV_MPIC_GRP_MISC:
+		switch (attr->attr) {
+		case KVM_DEV_MPIC_BASE_ADDR:
+			return set_base_addr(kvm, dev, attr);
+		}
+
+		break;
+
+	case KVM_DEV_MPIC_GRP_REGISTER:
+		if (copy_from_user(&attr32, (u32 __user *)(long)attr->addr,
+				   sizeof(u32)))
+			return -EFAULT;
+
+		return access_reg(opp, attr->attr, &attr32, ATTR_SET);
+
+	case KVM_DEV_MPIC_GRP_IRQ_ACTIVE:
+		if (attr->attr > MAX_SRC)
+			return -EINVAL;
+
+		if (copy_from_user(&attr32, (u32 __user *)(long)attr->addr,
+				   sizeof(u32)))
+			return -EFAULT;
+
+		if (attr32 != 0 && attr32 != 1)
+			return -EINVAL;
+
+		spin_lock_irq(&opp->lock);
+		openpic_set_irq(opp, attr->attr, attr32);
+		spin_unlock_irq(&opp->lock);
+		return 0;
+	}
+
+	return -ENXIO;
+}
+
+static int mpic_get_attr(struct kvm *kvm, struct kvm_device *dev,
+			 struct kvm_device_attr *attr)
+{
+	struct openpic *opp = container_of(dev, struct openpic, dev);
+	u64 attr64;
+	u32 attr32;
+	int ret;
+
+	switch (attr->group) {
+	case KVM_DEV_MPIC_GRP_MISC:
+		switch (attr->attr) {
+		case KVM_DEV_MPIC_BASE_ADDR:
+			attr64 = opp->reg_base;
+
+			if (copy_to_user((u64 __user *)(long)attr->addr,
+					 &attr64, sizeof(u64)))
+				return -EFAULT;
+
+			return 0;
+		}
+
+		break;
+
+	case KVM_DEV_MPIC_GRP_REGISTER:
+		ret = access_reg(opp, attr->attr, &attr32, ATTR_GET);
+		if (ret)
+			return ret;
+
+		if (copy_to_user((u32 __user *)(long)attr->addr, &attr32,
+				 sizeof(u32)))
+			return -EFAULT;
+
+		return 0;
+
+	case KVM_DEV_MPIC_GRP_IRQ_ACTIVE:
+		if (attr->attr > MAX_SRC)
+			return -EINVAL;
+
+		attr32 = opp->src[attr->attr].pending;
+
+		if (copy_to_user((u32 __user *)(long)attr->addr, &attr32,
+				 sizeof(u32)))
+			return -EFAULT;
+
+		return 0;
+	}
+
+	return -ENXIO;
+}
+
+static void mpic_destroy(struct kvm *kvm, struct kvm_device *dev)
+{
+	struct openpic *opp = container_of(dev, struct openpic, dev);
+
+	blocking_notifier_chain_unregister(&kvm->vcpu_notifier,
+					   &opp->vcpu_notifier);
+
+	unmap_all(opp);
+	kfree(opp);
+}
+
+static int add_cpu(struct openpic *opp, struct kvm_vcpu *vcpu)
+{
+	u32 id = vcpu->vcpu_id;
+
+	if (id >= MAX_CPU)
+		return -EPERM;
+
+	spin_lock_irq(&opp->lock);
+
+	WARN_ON(opp->dst[id].vcpu);
+	opp->dst[id].vcpu = vcpu;
+	opp->nb_cpus = max(opp->nb_cpus, id + 1);
+
+	spin_unlock_irq(&opp->lock);
+
+	if (opp->mpic_mode_mask == GCR_MODE_PROXY)
+		vcpu->arch.epr_flags |= KVMPPC_EPR_KERNEL;
+
+	return 0;
+}
+
+static int kvm_mpic_vcpu_notifier(struct notifier_block *nb,
+				  unsigned long create, void *v)
+{
+	struct openpic *opp = container_of(nb, struct openpic, vcpu_notifier);
+	struct kvm_vcpu *vcpu = v;
+	int ret;
+
+	if (create) {
+		ret = add_cpu(opp, vcpu);
+		if (ret < 0)
+			return notifier_from_errno(ret);
+	}
+
+	return NOTIFY_OK;
+}
+
+int kvm_create_mpic(struct kvm *kvm, u32 type, struct kvm_device **dev)
+{
+	struct openpic *opp;
+	struct kvm_vcpu *vcpu;
+	int ret, i;
+
+	if (kvm->arch.irqchip_priv)
+		return -EEXIST;
+
+	opp = kzalloc(sizeof(struct openpic), GFP_KERNEL);
+	if (!opp)
+		return -ENOMEM;
+
+	kvm->arch.irqchip_priv = opp;
+	opp->kvm = kvm;
+	opp->model = type;
+	spin_lock_init(&opp->lock);
+
+	switch (opp->model) {
+	case KVM_DEV_TYPE_FSL_MPIC_20:
 		opp->fsl = &fsl_mpic_20;
 		opp->brr1 = 0x00400200;
 		opp->flags |= OPENPIC_FLAG_IDR_CRIT;
@@ -1290,12 +1841,10 @@ static int openpic_init(SysBusDevice *dev)
 		opp->mpic_mode_mask = GCR_MODE_MIXED;
 
 		fsl_common_init(opp);
-		map_list(opp, list_be, &list_count);
-		map_list(opp, list_fsl, &list_count);
 
 		break;
 
-	case OPENPIC_MODEL_FSL_MPIC_42:
+	case KVM_DEV_TYPE_FSL_MPIC_42:
 		opp->fsl = &fsl_mpic_42;
 		opp->brr1 = 0x00400402;
 		opp->flags |= OPENPIC_FLAG_ILR;
@@ -1303,11 +1852,39 @@ static int openpic_init(SysBusDevice *dev)
 		opp->mpic_mode_mask = GCR_MODE_PROXY;
 
 		fsl_common_init(opp);
-		map_list(opp, list_be, &list_count);
-		map_list(opp, list_fsl, &list_count);
 
 		break;
+
+	default:
+		ret = -ENODEV;
+		goto err;
 	}
 
+	openpic_reset(opp);
+
+	opp->dev.type = type;
+	opp->dev.set_attr = mpic_set_attr;
+	opp->dev.get_attr = mpic_get_attr;
+	opp->dev.destroy = mpic_destroy;
+	*dev = &opp->dev;
+
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		ret = add_cpu(opp, vcpu);
+		if (ret < 0)
+			goto err;
+	}
+
+	opp->vcpu_notifier.notifier_call = kvm_mpic_vcpu_notifier;
+
+	/* FIXME: register notifier for subsequently created vcpus */
+	ret = blocking_notifier_chain_register(&kvm->vcpu_notifier,
+					       &opp->vcpu_notifier);
+	if (ret < 0)
+		goto err;
+
 	return 0;
+
+err:
+	kfree(opp);
+	return ret;
 }
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 61989f4..e3d09f7 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -317,6 +317,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_ENABLE_CAP:
 	case KVM_CAP_ONE_REG:
 	case KVM_CAP_IOEVENTFD:
+	case KVM_CAP_DEVICE_CTRL:
 		r = 1;
 		break;
 #ifndef CONFIG_KVM_BOOK3S_64_HV
@@ -781,7 +782,10 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 		break;
 	case KVM_CAP_PPC_EPR:
 		r = 0;
-		vcpu->arch.epr_enabled = cap->args[0];
+		if (cap->args[0])
+			vcpu->arch.epr_flags |= KVMPPC_EPR_USER;
+		else
+			vcpu->arch.epr_flags &= ~KVMPPC_EPR_USER;
 		break;
 #ifdef CONFIG_BOOKE
 	case KVM_CAP_PPC_BOOKE_WATCHDOG:
@@ -927,6 +931,7 @@ static int kvm_vm_ioctl_get_pvinfo(struct kvm_ppc_pvinfo *pvinfo)
 long kvm_arch_vm_ioctl(struct file *filp,
                        unsigned int ioctl, unsigned long arg)
 {
+	struct kvm *kvm __maybe_unused = filp->private_data;
 	void __user *argp = (void __user *)arg;
 	long r;
 
@@ -945,7 +950,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 #ifdef CONFIG_PPC_BOOK3S_64
 	case KVM_CREATE_SPAPR_TCE: {
 		struct kvm_create_spapr_tce create_tce;
-		struct kvm *kvm = filp->private_data;
 
 		r = -EFAULT;
 		if (copy_from_user(&create_tce, argp, sizeof(create_tce)))
@@ -957,7 +961,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 
 #ifdef CONFIG_KVM_BOOK3S_64_HV
 	case KVM_ALLOCATE_RMA: {
-		struct kvm *kvm = filp->private_data;
 		struct kvm_allocate_rma rma;
 
 		r = kvm_vm_ioctl_allocate_rma(kvm, &rma);
@@ -967,7 +970,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 	}
 
 	case KVM_PPC_ALLOCATE_HTAB: {
-		struct kvm *kvm = filp->private_data;
 		u32 htab_order;
 
 		r = -EFAULT;
@@ -984,7 +986,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 	}
 
 	case KVM_PPC_GET_HTAB_FD: {
-		struct kvm *kvm = filp->private_data;
 		struct kvm_get_htab_fd ghf;
 
 		r = -EFAULT;
@@ -997,7 +998,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 
 #ifdef CONFIG_PPC_BOOK3S_64
 	case KVM_PPC_GET_SMMU_INFO: {
-		struct kvm *kvm = filp->private_data;
 		struct kvm_ppc_smmu_info info;
 
 		memset(&info, 0, sizeof(info));
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3d28037..48342a6 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1059,5 +1059,7 @@ static inline bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
 }
 
 #endif /* CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT */
-#endif
 
+int kvm_create_mpic(struct kvm *kvm, u32 type, struct kvm_device **dev);
+
+#endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 1f348e0..1048a03 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -910,10 +910,19 @@ struct kvm_device_attr {
 #define KVM_DEV_ATTR_COMMON		0
 #define   KVM_DEV_ATTR_TYPE		0 /* 32-bit */
 
-#define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xac, struct kvm_create_device)
-#define KVM_SET_DEVICE_ATTR	  _IOW(KVMIO,  0xad, struct kvm_device_attr)
-#define KVM_GET_DEVICE_ATTR	  _IOW(KVMIO,  0xae, struct kvm_device_attr)
-#define KVM_HAS_DEVICE_ATTR	  _IOW(KVMIO,  0xaf, struct kvm_device_attr)
+#define KVM_DEV_TYPE_FSL_MPIC_20	1
+#define KVM_DEV_TYPE_FSL_MPIC_42	2
+
+#define KVM_DEV_MPIC_GRP_MISC		1
+#define   KVM_DEV_MPIC_BASE_ADDR	0	/* 64-bit */
+
+#define KVM_DEV_MPIC_GRP_REGISTER	2	/* 32-bit */
+#define KVM_DEV_MPIC_GRP_IRQ_ACTIVE	3	/* 32-bit */
+
+#define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xab, struct kvm_create_device)
+#define KVM_SET_DEVICE_ATTR	  _IOW(KVMIO,  0xac, struct kvm_device_attr)
+#define KVM_GET_DEVICE_ATTR	  _IOW(KVMIO,  0xad, struct kvm_device_attr)
+#define KVM_HAS_DEVICE_ATTR	  _IOW(KVMIO,  0xae, struct kvm_device_attr)
 
 /*
  * ioctls for vcpu fds
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index dd4c78d..db0c2b3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2210,6 +2210,18 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
 	}
 
 	switch (cd->type) {
+#ifdef CONFIG_KVM_MPIC
+	case KVM_DEV_TYPE_FSL_MPIC_20:
+	case KVM_DEV_TYPE_FSL_MPIC_42: {
+		if (test) {
+			r = 0;
+			break;
+		}
+
+		r = kvm_create_mpic(kvm, cd->type, &dev);
+		break;
+	}
+#endif
 	default:
 		r = -ENODEV;
 		goto out;
-- 
1.7.9.5

* [RFC PATCH 6/6] kvm/ppc/mpic: in-kernel MPIC emulation
@ 2013-02-14  5:49   ` Scott Wood
  0 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-02-14  5:49 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, Scott Wood

Hook the MPIC code up to the KVM interfaces, add locking, etc.

TODO: irqfd support

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
 Documentation/virtual/kvm/devices/mpic.txt |   36 ++
 arch/powerpc/include/asm/kvm_host.h        |    9 +-
 arch/powerpc/include/asm/kvm_ppc.h         |    4 +
 arch/powerpc/kvm/Kconfig                   |    5 +
 arch/powerpc/kvm/Makefile                  |    2 +
 arch/powerpc/kvm/booke.c                   |   10 +-
 arch/powerpc/kvm/mpic.c                    |  875 +++++++++++++++++++++++-----
 arch/powerpc/kvm/powerpc.c                 |   12 +-
 include/linux/kvm_host.h                   |    4 +-
 include/uapi/linux/kvm.h                   |   17 +-
 virt/kvm/kvm_main.c                        |   12 +
 11 files changed, 822 insertions(+), 164 deletions(-)
 create mode 100644 Documentation/virtual/kvm/devices/mpic.txt

diff --git a/Documentation/virtual/kvm/devices/mpic.txt b/Documentation/virtual/kvm/devices/mpic.txt
new file mode 100644
index 0000000..1ef30f0
--- /dev/null
+++ b/Documentation/virtual/kvm/devices/mpic.txt
@@ -0,0 +1,36 @@
+MPIC interrupt controller
+=========================
+
+Device types supported:
+  KVM_DEV_TYPE_FSL_MPIC_20     Freescale MPIC v2.0
+  KVM_DEV_TYPE_FSL_MPIC_42     Freescale MPIC v4.2
+
+Only one MPIC instance, of any type, may be instantiated.  The created
+MPIC will act as the system interrupt controller, connecting to each
+vcpu's interrupt inputs.
+
+Groups:
+  KVM_DEV_MPIC_GRP_MISC
+  Attributes:
+    KVM_DEV_MPIC_BASE_ADDR (rw, 64-bit)
+      Base address of the 256 KiB MPIC register space.  Must be
+      naturally aligned.  A value of zero disables the mapping.
+      Reset value is zero.
+
+  KVM_DEV_MPIC_GRP_REGISTER (rw, 32-bit)
+    Access MPIC register state.  "attr" is the byte offset into
+    the MPIC register space.  Accesses must be 4-byte aligned.
+
+    MSIs may be signaled by using this attribute group to write
+    to the relevant MSIIR.
+
+  KVM_DEV_MPIC_GRP_IRQ_ACTIVE (rw, 32-bit)
+    IRQ input line for each standard openpic source.  0 is inactive and 1
+    is active, regardless of interrupt sense.
+
+    For edge-triggered interrupts:  Writing 1 is considered an activating
+    edge, and writing 0 is ignored.  Reading returns 1 if a previously
+    signaled edge has not been acknowledged, and 0 otherwise.
+
+    "attr" is the IRQ number.  IRQ numbers for standard sources are the
+    byte offset of the relevant IVPR from EIVPR0, divided by 32.
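
The two numeric rules in mpic.txt above (natural alignment of the 256 KiB
base address, and IRQ number = IVPR byte offset from EIVPR0 divided by 32)
can be sketched in plain C.  The helper names below are hypothetical and do
not appear in the patch, and returning -1 for an offset that is not a
multiple of 32 is an assumption made for illustration.  This is only a
sketch of the arithmetic, not the kernel's implementation.

```c
#include <stdint.h>

/* Size of the MPIC register space stated in mpic.txt: 256 KiB. */
#define MPIC_REG_SPACE_SIZE	0x40000ULL

/* Spacing between per-source register blocks, per the "divided by 32" rule. */
#define MPIC_SRC_STRIDE		32U

/*
 * Hypothetical helper (not in the patch): returns 1 if base is a multiple
 * of the register space size.  Zero also passes; the documentation says a
 * base of zero disables the mapping.
 */
static inline int mpic_base_is_naturally_aligned(uint64_t base)
{
	return (base & (MPIC_REG_SPACE_SIZE - 1)) == 0;
}

/*
 * Hypothetical helper (not in the patch): converts the byte offset of an
 * IVPR, measured from EIVPR0, into the IRQ number used as "attr".
 * Returns -1 if the offset is not the start of a 32-byte source block
 * (an assumption; the documentation does not define this case).
 */
static inline int mpic_irq_from_ivpr_offset(uint32_t offset_from_eivpr0)
{
	if (offset_from_eivpr0 % MPIC_SRC_STRIDE)
		return -1;

	return (int)(offset_from_eivpr0 / MPIC_SRC_STRIDE);
}
```

For example, an IVPR located 0x40 bytes past EIVPR0 maps to IRQ 2, which
would be the "attr" value used with KVM_DEV_MPIC_GRP_IRQ_ACTIVE.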
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 8a72d59..be81c7a 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -256,6 +256,7 @@ struct kvm_arch {
 #ifdef CONFIG_PPC_BOOK3S_64
 	struct list_head spapr_tce_tables;
 #endif
+	void *irqchip_priv;
 };
 
 /*
@@ -359,6 +360,11 @@ struct kvmppc_slb {
 #define KVMPPC_BOOKE_MAX_IAC	4
 #define KVMPPC_BOOKE_MAX_DAC	2
 
+/* KVMPPC_EPR_USER takes precedence over KVMPPC_EPR_KERNEL */
+#define KVMPPC_EPR_NONE		0 /* EPR not supported */
+#define KVMPPC_EPR_USER		1 /* exit to userspace to fill EPR */
+#define KVMPPC_EPR_KERNEL	2 /* in-kernel irqchip */
+
 struct kvmppc_booke_debug_reg {
 	u32 dbcr0;
 	u32 dbcr1;
@@ -520,7 +526,7 @@ struct kvm_vcpu_arch {
 	u8 sane;
 	u8 cpu_type;
 	u8 hcall_needed;
-	u8 epr_enabled;
+	u8 epr_flags; /* KVMPPC_EPR_xxx */
 	u8 epr_needed;
 
 	u32 cpr0_cfgaddr; /* holds the last set cpr0_cfgaddr */
@@ -587,5 +593,6 @@ struct kvm_vcpu_arch {
 #define KVM_MMIO_REG_FQPR	0x0060
 
 #define __KVM_HAVE_ARCH_WQP
+#define __KVM_HAVE_CREATE_DEVICE
 
 #endif /* __POWERPC_KVM_HOST_H__ */
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 44a657a..d46504d 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -165,6 +165,8 @@ extern int kvmppc_prepare_to_enter(struct kvm_vcpu *vcpu);
 
 extern int kvm_vm_ioctl_get_htab_fd(struct kvm *kvm, struct kvm_get_htab_fd *);
 
+int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu, struct kvm_interrupt *irq);
+
 /*
  * Cuts out inst bits with ordering according to spec.
  * That means the leftmost bit is zero. All given bits are included.
@@ -271,6 +273,8 @@ static inline void kvmppc_set_epr(struct kvm_vcpu *vcpu, u32 epr)
 #endif
 }
 
+void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu);
+
 int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
 			      struct kvm_config_tlb *cfg);
 int kvm_vcpu_ioctl_dirty_tlb(struct kvm_vcpu *vcpu,
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 4730c95..18d5e72 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -151,6 +151,11 @@ config KVM_E500MC
 
 	  If unsure, say N.
 
+config KVM_MPIC
+	bool "KVM in-kernel MPIC emulation"
+	depends on KVM
+
+
 source drivers/vhost/Kconfig
 
 endif # VIRTUALIZATION
diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
index b772ede..4a2277a 100644
--- a/arch/powerpc/kvm/Makefile
+++ b/arch/powerpc/kvm/Makefile
@@ -103,6 +103,8 @@ kvm-book3s_32-objs := \
 	book3s_32_mmu.o
 kvm-objs-$(CONFIG_KVM_BOOK3S_32) := $(kvm-book3s_32-objs)
 
+kvm-objs-$(CONFIG_KVM_MPIC) += mpic.o
+
 kvm-objs := $(kvm-objs-m) $(kvm-objs-y)
 
 obj-$(CONFIG_KVM_440) += kvm.o
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 020923e..8483cb2 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -347,7 +347,7 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu *vcpu,
 		keep_irq = true;
 	}
 
-	if ((priority == BOOKE_IRQPRIO_EXTERNAL) && vcpu->arch.epr_enabled)
+	if ((priority == BOOKE_IRQPRIO_EXTERNAL) && vcpu->arch.epr_flags)
 		update_epr = true;
 
 	switch (priority) {
@@ -428,8 +428,12 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu *vcpu,
 			set_guest_esr(vcpu, vcpu->arch.queued_esr);
 		if (update_dear == true)
 			set_guest_dear(vcpu, vcpu->arch.queued_dear);
-		if (update_epr == true)
-			kvm_make_request(KVM_REQ_EPR_EXIT, vcpu);
+		if (update_epr == true) {
+			if (vcpu->arch.epr_flags & KVMPPC_EPR_USER)
+				kvm_make_request(KVM_REQ_EPR_EXIT, vcpu);
+			else if (vcpu->arch.epr_flags & KVMPPC_EPR_KERNEL)
+				kvmppc_mpic_set_epr(vcpu);
+		}
 
 		new_msr &= msr_mask;
 #if defined(CONFIG_64BIT)
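
The EPR handling in the booke.c hunk above picks between a userspace exit
and the in-kernel MPIC, with KVMPPC_EPR_USER checked first.  A minimal
userspace model of that choice might look like the following.  The enum and
function names are hypothetical labels for illustration; only the three
KVMPPC_EPR_* values are taken from the patch.

```c
#include <stdint.h>

/* Values copied from the kvm_host.h hunk in this patch. */
#define KVMPPC_EPR_NONE		0 /* EPR not supported */
#define KVMPPC_EPR_USER		1 /* exit to userspace to fill EPR */
#define KVMPPC_EPR_KERNEL	2 /* in-kernel irqchip */

/* Hypothetical labels for the three outcomes; not names from the patch. */
enum epr_action {
	EPR_DO_NOTHING,
	EPR_EXIT_TO_USERSPACE,
	EPR_ASK_IN_KERNEL_MPIC,
};

/*
 * Models the if / else-if order in the hunk: the USER bit is tested
 * first, so it wins when both bits are set.
 */
static inline enum epr_action epr_pick_action(uint8_t epr_flags)
{
	if (epr_flags & KVMPPC_EPR_USER)
		return EPR_EXIT_TO_USERSPACE;

	if (epr_flags & KVMPPC_EPR_KERNEL)
		return EPR_ASK_IN_KERNEL_MPIC;

	return EPR_DO_NOTHING;
}
```

With both bits set the model returns EPR_EXIT_TO_USERSPACE, matching the
"KVMPPC_EPR_USER takes precedence" comment in kvm_host.h.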
diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index 1df67ae..27040e4 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -23,6 +23,18 @@
  * THE SOFTWARE.
  */
 
+#include <linux/slab.h>
+#include <linux/mutex.h>
+#include <linux/kvm_host.h>
+#include <linux/errno.h>
+#include <linux/notifier.h>
+#include <asm/uaccess.h>
+#include <asm/mpic.h>
+#include <asm/kvm_para.h>
+#include <asm/kvm_host.h>
+#include <asm/kvm_ppc.h>
+#include "iodev.h"
+
 #define MAX_CPU     32
 #define MAX_SRC     256
 #define MAX_TMR     4
@@ -89,6 +101,7 @@ static struct fsl_mpic_info fsl_mpic_42 = {
 #define ILR_INTTGT_INT    0x00
 #define ILR_INTTGT_CINT   0x01	/* critical */
 #define ILR_INTTGT_MCP    0x02	/* machine check */
+#define NUM_OUTPUTS       3
 
 #define MSIIR_OFFSET       0x140
 #define MSIIR_SRS_SHIFT    29
@@ -98,18 +111,14 @@ static struct fsl_mpic_info fsl_mpic_42 = {
 
 static int get_current_cpu(void)
 {
-	CPUState *cpu_single_cpu;
-
-	if (!cpu_single_env)
-		return -1;
-
-	cpu_single_cpu = ENV_GET_CPU(cpu_single_env);
-	return cpu_single_cpu->cpu_index;
+	struct kvm_vcpu *vcpu = current->thread.kvm_vcpu;
+	return vcpu ? vcpu->vcpu_id : -1;
 }
 
-static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx);
-static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
-				       uint32_t val, int idx);
+static int openpic_cpu_write_internal(struct kvm_io_device *this, gpa_t addr,
+				      u32 val, int idx);
+static int openpic_cpu_read_internal(struct kvm_io_device *this, gpa_t addr,
+				     u32 *ptr, int idx);
 
 enum irq_type {
 	IRQ_TYPE_NORMAL = 0,
@@ -131,7 +140,7 @@ struct irq_source {
 	uint32_t idr;		/* IRQ destination register */
 	uint32_t destmask;	/* bitmap of CPU destinations */
 	int last_cpu;
-	int output;		/* IRQ level, e.g. OPENPIC_OUTPUT_INT */
+	int output;		/* IRQ level, e.g. ILR_INTTGT_INT */
 	int pending;		/* TRUE if IRQ is pending */
 	enum irq_type type;
 	bool level:1;		/* level-triggered */
@@ -158,16 +167,35 @@ struct irq_source {
 #define IDR_CI      0x40000000	/* critical interrupt */
 
 struct irq_dest {
+	struct kvm_vcpu *vcpu;
+
 	int32_t ctpr;		/* CPU current task priority */
 	struct irq_queue raised;
 	struct irq_queue servicing;
-	qemu_irq *irqs;
 
 	/* Count of IRQ sources asserting on non-INT outputs */
-	uint32_t outputs_active[OPENPIC_OUTPUT_NB];
+	uint32_t outputs_active[NUM_OUTPUTS];
+};
+
+struct openpic;
+
+struct sub_region {
+	struct kvm_io_device iodev;
+	struct openpic *opp;
+	gpa_t base;
+	int size;
 };
 
 struct openpic {
+	struct kvm_device dev;
+	struct kvm *kvm;
+	gpa_t reg_base;
+	spinlock_t lock;
+	struct notifier_block vcpu_notifier;
+
+	struct sub_region sub_io_mem[6];
+	int sub_count;
+
 	/* Behavior control */
 	struct fsl_mpic_info *fsl;
 	uint32_t model;
@@ -208,6 +236,51 @@ struct openpic {
 	uint32_t irq_msi;
 };
 
+
+static void mpic_irq_raise(struct openpic *opp, struct irq_dest *dst,
+			   int output)
+{
+	struct kvm_interrupt irq = {
+		.irq = KVM_INTERRUPT_SET_LEVEL,
+	};
+
+	if (!dst->vcpu) {
+		pr_debug("%s: destination cpu %d does not exist\n",
+			 __func__, dst - &opp->dst[0]);
+		return;
+	}
+
+	pr_debug("%s: cpu %d output %d\n", __func__, dst->vcpu->vcpu_id,
+		output);
+
+	if (output != ILR_INTTGT_INT)	/* TODO */
+		return;
+
+	kvm_vcpu_ioctl_interrupt(dst->vcpu, &irq);
+}
+
+static void mpic_irq_lower(struct openpic *opp, struct irq_dest *dst,
+			   int output)
+{
+	struct kvm_interrupt irq = {
+		.irq = KVM_INTERRUPT_UNSET,
+	};
+
+	if (!dst->vcpu) {
+		pr_debug("%s: destination cpu %d does not exist\n",
+			 __func__, dst - &opp->dst[0]);
+		return;
+	}
+
+	pr_debug("%s: cpu %d output %d\n", __func__, dst->vcpu->vcpu_id,
+		output);
+
+	if (output != ILR_INTTGT_INT)	/* TODO */
+		return;
+
+	kvmppc_core_dequeue_external(dst->vcpu, &irq);
+}
+
 static inline void IRQ_setbit(struct irq_queue *q, int n_IRQ)
 {
 	set_bit(n_IRQ, q->queue);
@@ -268,7 +341,7 @@ static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 	pr_debug("%s: IRQ %d active %d was %d\n",
 		__func__, n_IRQ, active, was_active);
 
-	if (src->output != OPENPIC_OUTPUT_INT) {
+	if (src->output != ILR_INTTGT_INT) {
 		pr_debug("%s: output %d irq %d active %d was %d count %d\n",
 			__func__, src->output, n_IRQ, active, was_active,
 			dst->outputs_active[src->output]);
@@ -282,14 +355,14 @@ static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 			    dst->outputs_active[src->output]++ == 0) {
 				pr_debug("%s: Raise OpenPIC output %d cpu %d irq %d\n",
 					__func__, src->output, n_CPU, n_IRQ);
-				qemu_irq_raise(dst->irqs[src->output]);
+				mpic_irq_raise(opp, dst, src->output);
 			}
 		} else {
 			if (was_active &&
 			    --dst->outputs_active[src->output] == 0) {
 				pr_debug("%s: Lower OpenPIC output %d cpu %d irq %d\n",
 					__func__, src->output, n_CPU, n_IRQ);
-				qemu_irq_lower(dst->irqs[src->output]);
+				mpic_irq_lower(opp, dst, src->output);
 			}
 		}
 
@@ -322,8 +395,7 @@ static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 		} else {
 			pr_debug("%s: Raise OpenPIC INT output cpu %d irq %d/%d\n",
 				__func__, n_CPU, n_IRQ, dst->raised.next);
-			qemu_irq_raise(opp->dst[n_CPU].
-				       irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_raise(opp, dst, ILR_INTTGT_INT);
 		}
 	} else {
 		IRQ_get_next(opp, &dst->servicing);
@@ -338,8 +410,7 @@ static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 			pr_debug("%s: IRQ %d inactive, current prio %d/%d, CPU %d\n",
 				__func__, n_IRQ, dst->ctpr,
 				dst->servicing.priority, n_CPU);
-			qemu_irq_lower(opp->dst[n_CPU].
-				       irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_lower(opp, dst, ILR_INTTGT_INT);
 		}
 	}
 }
@@ -415,8 +486,8 @@ static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 	struct irq_source *src;
 
 	if (n_IRQ >= MAX_IRQ) {
-		pr_err("%s: IRQ %d out of range\n", __func__, n_IRQ);
-		abort();
+		WARN_ONCE(1, "%s: IRQ %d out of range\n", __func__, n_IRQ);
+		return;
 	}
 
 	src = &opp->src[n_IRQ];
@@ -433,7 +504,7 @@ static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 			openpic_update_irq(opp, n_IRQ);
 		}
 
-		if (src->output != OPENPIC_OUTPUT_INT) {
+		if (src->output != ILR_INTTGT_INT) {
 			/* Edge-triggered interrupts shouldn't be used
 			 * with non-INT delivery, but just in case,
 			 * try to make it do something sane rather than
@@ -446,15 +517,14 @@ static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 	}
 }
 
-static void openpic_reset(DeviceState *d)
+static void openpic_reset(struct openpic *opp)
 {
-	struct openpic *opp = FROM_SYSBUS(typeof(*opp), SYS_BUS_DEVICE(d));
 	int i;
 
 	opp->gcr = GCR_RESET;
+
 	/* Initialise controller registers */
 	opp->frr = ((opp->nb_irqs - 1) << FRR_NIRQ_SHIFT) |
-	    ((opp->nb_cpus - 1) << FRR_NCPU_SHIFT) |
 	    (opp->vid << FRR_VID_SHIFT);
 
 	opp->pir = 0;
@@ -504,7 +574,7 @@ static inline uint32_t read_IRQreg_idr(struct openpic *opp, int n_IRQ)
 static inline uint32_t read_IRQreg_ilr(struct openpic *opp, int n_IRQ)
 {
 	if (opp->flags & OPENPIC_FLAG_ILR)
-		return output_to_inttgt(opp->src[n_IRQ].output);
+		return opp->src[n_IRQ].output;
 
 	return 0xffffffff;
 }
@@ -539,7 +609,7 @@ static inline void write_IRQreg_idr(struct openpic *opp, int n_IRQ,
 					__func__);
 			}
 
-			src->output = OPENPIC_OUTPUT_CINT;
+			src->output = ILR_INTTGT_CINT;
 			src->nomask = true;
 			src->destmask = 0;
 
@@ -550,7 +620,7 @@ static inline void write_IRQreg_idr(struct openpic *opp, int n_IRQ,
 					src->destmask |= 1UL << i;
 			}
 		} else {
-			src->output = OPENPIC_OUTPUT_INT;
+			src->output = ILR_INTTGT_INT;
 			src->nomask = false;
 			src->destmask = src->idr & normal_mask;
 		}
@@ -565,7 +635,7 @@ static inline void write_IRQreg_ilr(struct openpic *opp, int n_IRQ,
 	if (opp->flags & OPENPIC_FLAG_ILR) {
 		struct irq_source *src = &opp->src[n_IRQ];
 
-		src->output = inttgt_to_output(val & ILR_INTTGT_MASK);
+		src->output = val & ILR_INTTGT_MASK;
 		pr_debug("Set ILR %d to 0x%08x, output %d\n", n_IRQ, src->idr,
 			src->output);
 
@@ -614,34 +684,77 @@ static inline void write_IRQreg_ivpr(struct openpic *opp, int n_IRQ,
 
 static void openpic_gcr_write(struct openpic *opp, uint64_t val)
 {
+#if 0
 	bool mpic_proxy = false;
+#endif
 
 	if (val & GCR_RESET) {
-		openpic_reset(&opp->busdev.qdev);
+		openpic_reset(opp);
 		return;
 	}
 
 	opp->gcr &= ~opp->mpic_mode_mask;
 	opp->gcr |= val & opp->mpic_mode_mask;
-
+#if 0
 	/* Set external proxy mode */
 	if ((val & opp->mpic_mode_mask) == GCR_MODE_PROXY)
 		mpic_proxy = true;
 
 	ppce500_set_mpic_proxy(mpic_proxy);
+#endif
 }
 
-static void openpic_gbl_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned len)
+static int openpic_get_val32(int len, const void *ptr, u32 *val)
 {
-	struct openpic *opp = opaque;
+	if (len != 4) {
+		pr_debug("%s: bad length %d\n", __func__, len);
+		return -EINVAL;
+	}
+
+	memcpy(val, ptr, min(len, 4));
+	return 0;
+}
+
+static int openpic_put_val32(int len, void *ptr, u32 val)
+{
+	/*
+	 * Technically only 32-bit accesses are allowed, but be nice
+	 * to people dumping registers -- it works in real hardware
+	 * (reads only, not writes).
+	 */
+	if (len > 4) {
+		pr_debug("%s: bad length %d\n", __func__, len);
+		return -EINVAL;
+	}
+
+	memcpy(ptr, &val, min(len, 4));
+	return 0;
+}
+
+static int openpic_gbl_write(struct kvm_io_device *this, gpa_t addr,
+			     int len, const void *ptr)
+{
+	struct sub_region *sub = container_of(this, struct sub_region, iodev);
+	struct openpic *opp = sub->opp;
+#if 0
 	struct irq_dest *dst;
-	int idx;
+#endif
+	u32 val;
+	int ret, idx;
+
+	addr -= sub->base;
+	if (addr > sub->size)
+		return 1;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
-		__func__, addr, val);
+	ret = openpic_get_val32(len, ptr, &val);
+	if (ret < 0)
+		return 1;
+
+	pr_debug("%s: addr %#llx <= %08x\n", __func__, addr, val);
 	if (addr & 0xF)
-		return;
+		return 0;
+
+	spin_lock_irq(&opp->lock);
 
 	switch (addr) {
 	case 0x00:	/* Block Revision Register1 (BRR1) is Readonly */
@@ -654,7 +767,7 @@ static void openpic_gbl_write(void *opaque, gpa_t addr, uint64_t val,
 	case 0x90:
 	case 0xA0:
 	case 0xB0:
-		openpic_cpu_write_internal(opp, addr, val, get_current_cpu());
+		openpic_cpu_write_internal(this, addr, val, get_current_cpu());
 		break;
 	case 0x1000:		/* FRR */
 		break;
@@ -668,14 +781,18 @@ static void openpic_gbl_write(void *opaque, gpa_t addr, uint64_t val,
 			if ((val & (1 << idx)) && !(opp->pir & (1 << idx))) {
 				pr_debug("Raise OpenPIC RESET output for CPU %d\n",
 					idx);
+#if 0
 				dst = &opp->dst[idx];
-				qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_RESET]);
+				mpic_irq_raise(opp, dst, OPENPIC_OUTPUT_RESET);
+#endif
 			} else if (!(val & (1 << idx)) &&
 				   (opp->pir & (1 << idx))) {
 				pr_debug("Lower OpenPIC RESET output for CPU %d\n",
 					idx);
+#if 0
 				dst = &opp->dst[idx];
-				qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_RESET]);
+				mpic_irq_lower(opp, dst, OPENPIC_OUTPUT_RESET);
+#endif
 			}
 		}
 		opp->pir = val;
@@ -695,21 +812,34 @@ static void openpic_gbl_write(void *opaque, gpa_t addr, uint64_t val,
 	default:
 		break;
 	}
+
+	spin_unlock_irq(&opp->lock);
+	return 0;
 }
 
-static uint64_t openpic_gbl_read(void *opaque, gpa_t addr, unsigned len)
+static int openpic_gbl_read(struct kvm_io_device *this, gpa_t addr,
+			    int len, void *ptr)
 {
-	struct openpic *opp = opaque;
-	uint32_t retval;
+	struct sub_region *sub = container_of(this, struct sub_region, iodev);
+	struct openpic *opp = sub->opp;
+	u32 retval;
+	int ret;
+
+	addr -= sub->base;
+	if (addr > sub->size)
+		return 1;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 	retval = 0xFFFFFFFF;
 	if (addr & 0xF)
-		return retval;
+		goto out;
+
+	spin_lock_irq(&opp->lock);
 
 	switch (addr) {
 	case 0x1000:		/* FRR */
 		retval = opp->frr;
+		retval |= (opp->nb_cpus - 1) << FRR_NCPU_SHIFT;
 		break;
 	case 0x1020:		/* GCR */
 		retval = opp->gcr;
@@ -731,8 +861,8 @@ static uint64_t openpic_gbl_read(void *opaque, gpa_t addr, unsigned len)
 	case 0x90:
 	case 0xA0:
 	case 0xB0:
-		retval =
-		    openpic_cpu_read_internal(opp, addr, get_current_cpu());
+		openpic_cpu_read_internal(this, addr,
+			&retval, get_current_cpu());
 		break;
 	case 0x10A0:		/* IPI_IVPR */
 	case 0x10B0:
@@ -750,33 +880,51 @@ static uint64_t openpic_gbl_read(void *opaque, gpa_t addr, unsigned len)
 	default:
 		break;
 	}
+
+	spin_unlock_irq(&opp->lock);
+out:
 	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
-	return retval;
+	ret = openpic_put_val32(len, ptr, retval);
+	if (ret < 0)
+		return 1;
+
+	return 0;
 }
 
-static void openpic_tmr_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned len)
+static int openpic_tmr_write(struct kvm_io_device *this, gpa_t addr,
+			     int len, const void *ptr)
 {
-	struct openpic *opp = opaque;
-	int idx;
+	struct sub_region *sub = container_of(this, struct sub_region, iodev);
+	struct openpic *opp = sub->opp;
+	u32 val;
+	int ret, idx;
+
+	addr -= sub->base;
+	if (addr > sub->size)
+		return 1;
+
+	ret = openpic_get_val32(len, ptr, &val);
+	if (ret < 0)
+		return 1;
 
 	addr += 0x10f0;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
-		__func__, addr, val);
+	pr_debug("%s: addr %#llx <= %08x\n", __func__, addr, val);
 	if (addr & 0xF)
-		return;
+		return 0;
 
 	if (addr == 0x10f0) {
 		/* TFRR */
 		opp->tfrr = val;
-		return;
+		return 0;
 	}
 
 	idx = (addr >> 6) & 0x3;
 	addr = addr & 0x30;
 
+	spin_lock_irq(&opp->lock);
+
 	switch (addr & 0x30) {
 	case 0x00:		/* TCCR */
 		break;
@@ -795,15 +943,25 @@ static void openpic_tmr_write(void *opaque, gpa_t addr, uint64_t val,
 		write_IRQreg_idr(opp, opp->irq_tim0 + idx, val);
 		break;
 	}
+
+	spin_unlock_irq(&opp->lock);
+
+	return 0;
 }
 
-static uint64_t openpic_tmr_read(void *opaque, gpa_t addr, unsigned len)
+static int openpic_tmr_read(struct kvm_io_device *this, gpa_t addr,
+				 int len, void *ptr)
 {
-	struct openpic *opp = opaque;
+	struct sub_region *sub = container_of(this, struct sub_region, iodev);
+	struct openpic *opp = sub->opp;
 	uint32_t retval = -1;
-	int idx;
+	int ret, idx;
+
+	addr -= sub->base;
+	if (addr > sub->size)
+		return 1;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 	if (addr & 0xF)
 		goto out;
 
@@ -813,6 +971,9 @@ static uint64_t openpic_tmr_read(void *opaque, gpa_t addr, unsigned len)
 		retval = opp->tfrr;
 		goto out;
 	}
+
+	spin_lock_irq(&opp->lock);
+
 	switch (addr & 0x30) {
 	case 0x00:		/* TCCR */
 		retval = opp->timers[idx].tccr;
@@ -828,24 +989,40 @@ static uint64_t openpic_tmr_read(void *opaque, gpa_t addr, unsigned len)
 		break;
 	}
 
+	spin_unlock_irq(&opp->lock);
 out:
 	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
-	return retval;
+	ret = openpic_put_val32(len, ptr, retval);
+	if (ret < 0)
+		return 1;
+
+	return 0;
 }
 
-static void openpic_src_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned len)
+static int openpic_src_write(struct kvm_io_device *this, gpa_t addr,
+			     int len, const void *ptr)
 {
-	struct openpic *opp = opaque;
-	int idx;
+	struct sub_region *sub = container_of(this, struct sub_region, iodev);
+	struct openpic *opp = sub->opp;
+	u32 val;
+	int ret, idx;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
-		__func__, addr, val);
+	addr -= sub->base;
+	if (addr > sub->size)
+		return 1;
+
+	ret = openpic_get_val32(len, ptr, &val);
+	if (ret < 0)
+		return 1;
+
+	pr_debug("%s: addr %#llx <= %08x\n", __func__, addr, val);
 
 	addr = addr & 0xffff;
 	idx = addr >> 5;
 
+	spin_lock_irq(&opp->lock);
+
 	switch (addr & 0x1f) {
 	case 0x00:
 		write_IRQreg_ivpr(opp, idx, val);
@@ -857,20 +1034,32 @@ static void openpic_src_write(void *opaque, gpa_t addr, uint64_t val,
 		write_IRQreg_ilr(opp, idx, val);
 		break;
 	}
+
+
+	spin_unlock_irq(&opp->lock);
+	return 0;
 }
 
-static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
+static int openpic_src_read(struct kvm_io_device *this, uint64_t addr,
+			    int len, void *ptr)
 {
-	struct openpic *opp = opaque;
+	struct sub_region *sub = container_of(this, struct sub_region, iodev);
+	struct openpic *opp = sub->opp;
 	uint32_t retval;
-	int idx;
+	int ret, idx;
+
+	addr -= sub->base;
+	if (addr > sub->size)
+		return 1;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 	retval = 0xFFFFFFFF;
 
 	addr = addr & 0xffff;
 	idx = addr >> 5;
 
+	spin_lock_irq(&opp->lock);
+
 	switch (addr & 0x1f) {
 	case 0x00:
 		retval = read_IRQreg_ivpr(opp, idx);
@@ -883,21 +1072,38 @@ static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
 		break;
 	}
 
+	spin_unlock_irq(&opp->lock);
 	pr_debug("%s: => 0x%08x\n", __func__, retval);
-	return retval;
+
+	ret = openpic_put_val32(len, ptr, retval);
+	if (ret < 0)
+		return 1;
+
+	return 0;
 }
 
-static void openpic_msi_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned size)
+static int openpic_msi_write(struct kvm_io_device *this, gpa_t addr,
+			     int len, const void *ptr)
 {
-	struct openpic *opp = opaque;
+	struct sub_region *sub = container_of(this, struct sub_region, iodev);
+	struct openpic *opp = sub->opp;
+	u32 val;
 	int idx = opp->irq_msi;
-	int srs, ibs;
+	int srs, ibs, ret;
+
+	addr -= sub->base;
+	if (addr > sub->size)
+		return 1;
+
+	ret = openpic_get_val32(len, ptr, &val);
+	if (ret < 0)
+		return 1;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
-		__func__, addr, val);
+	pr_debug("%s: addr %#llx <= 0x%08x\n", __func__, addr, val);
 	if (addr & 0xF)
-		return;
+		return 0;
+
+	spin_lock_irq(&opp->lock);
 
 	switch (addr) {
 	case MSIIR_OFFSET:
@@ -911,20 +1117,31 @@ static void openpic_msi_write(void *opaque, gpa_t addr, uint64_t val,
 		/* most registers are read-only, thus ignored */
 		break;
 	}
+
+	spin_unlock_irq(&opp->lock);
+	return 0;
 }
 
-static uint64_t openpic_msi_read(void *opaque, gpa_t addr, unsigned size)
+static int openpic_msi_read(struct kvm_io_device *this, gpa_t addr,
+			    int len, void *ptr)
 {
-	struct openpic *opp = opaque;
-	uint64_t r = 0;
-	int i, srs;
+	struct sub_region *sub = container_of(this, struct sub_region, iodev);
+	struct openpic *opp = sub->opp;
+	uint32_t r = 0;
+	int i, srs, ret;
+
+	addr -= sub->base;
+	if (addr > sub->size)
+		return 1;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 	if (addr & 0xF)
-		return -1;
+		return 1;
 
 	srs = addr >> 4;
 
+	spin_lock_irq(&opp->lock);
+
 	switch (addr) {
 	case 0x00:
 	case 0x10:
@@ -945,45 +1162,76 @@ static uint64_t openpic_msi_read(void *opaque, gpa_t addr, unsigned size)
 		break;
 	}
 
-	return r;
+	spin_unlock_irq(&opp->lock);
+	pr_debug("%s: => 0x%08x\n", __func__, r);
+
+	ret = openpic_put_val32(len, ptr, r);
+	if (ret < 0)
+		return 1;
+
+	return 0;
 }
 
-static uint64_t openpic_summary_read(void *opaque, gpa_t addr, unsigned size)
+static int openpic_summary_read(struct kvm_io_device *this, gpa_t addr,
+				int len, void *ptr)
 {
-	uint64_t r = 0;
+	struct sub_region *sub = container_of(this, struct sub_region, iodev);
+	uint32_t r = 0;
+	int ret;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	addr -= sub->base;
+	if (addr > sub->size)
+		return 1;
+
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 
 	/* TODO: EISR/EIMR */
 
-	return r;
+	ret = openpic_put_val32(len, ptr, r);
+	if (ret < 0)
+		return 1;
+
+	return 0;
 }
 
-static void openpic_summary_write(void *opaque, gpa_t addr, uint64_t val,
-				  unsigned size)
+static int openpic_summary_write(struct kvm_io_device *this, gpa_t addr,
+				 int len, const void *ptr)
 {
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
-		__func__, addr, val);
+	struct sub_region *sub = container_of(this, struct sub_region, iodev);
+	int ret;
+	uint32_t val;
+
+	addr -= sub->base;
+	if (addr > sub->size)
+		return 1;
+
+	ret = openpic_get_val32(len, ptr, &val);
+	if (ret < 0)
+		return 1;
+
+	pr_debug("%s: addr %#llx <= 0x%08x\n", __func__, addr, val);
 
 	/* TODO: EISR/EIMR */
+	return 0;
 }
 
-static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
-				       uint32_t val, int idx)
+static int openpic_cpu_write_internal(struct kvm_io_device *this, gpa_t addr,
+				      u32 val, int idx)
 {
-	struct openpic *opp = opaque;
+	struct sub_region *sub = container_of(this, struct sub_region, iodev);
+	struct openpic *opp = sub->opp;
 	struct irq_source *src;
 	struct irq_dest *dst;
 	int s_IRQ, n_IRQ;
 
-	pr_debug("%s: cpu %d addr %#" HWADDR_PRIx " <= 0x%08x\n", __func__, idx,
+	pr_debug("%s: cpu %d addr %#llx <= 0x%08x\n", __func__, idx,
 		addr, val);
 
 	if (idx < 0)
-		return;
+		return 0;
 
 	if (addr & 0xF)
-		return;
+		return 0;
 
 	dst = &opp->dst[idx];
 	addr &= 0xFF0;
@@ -1008,11 +1256,11 @@ static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
 		if (dst->raised.priority <= dst->ctpr) {
 			pr_debug("%s: Lower OpenPIC INT output cpu %d due to ctpr\n",
 				__func__, idx);
-			qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_lower(opp, dst, ILR_INTTGT_INT);
 		} else if (dst->raised.priority > dst->servicing.priority) {
 			pr_debug("%s: Raise OpenPIC INT output cpu %d irq %d\n",
 				__func__, idx, dst->raised.next);
-			qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_raise(opp, dst, ILR_INTTGT_INT);
 		}
 
 		break;
@@ -1043,18 +1291,38 @@ static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
 		     IVPR_PRIORITY(src->ivpr) > dst->servicing.priority)) {
 			pr_debug("Raise OpenPIC INT output cpu %d irq %d\n",
 				idx, n_IRQ);
-			qemu_irq_raise(opp->dst[idx].irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_raise(opp, dst, ILR_INTTGT_INT);
 		}
 		break;
 	default:
 		break;
 	}
+
+	return 0;
 }
 
-static void openpic_cpu_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned len)
+static int openpic_cpu_write(struct kvm_io_device *this, gpa_t addr,
+			     int len, const void *ptr)
 {
-	openpic_cpu_write_internal(opaque, addr, val, (addr & 0x1f000) >> 12);
+	struct sub_region *sub = container_of(this, struct sub_region, iodev);
+	struct openpic *opp = sub->opp;
+	u32 val;
+	int ret;
+
+	addr -= sub->base;
+	if (addr > sub->size)
+		return 1;
+
+	ret = openpic_get_val32(len, ptr, &val);
+	if (ret < 0)
+		return 1;
+
+	spin_lock_irq(&opp->lock);
+	ret = openpic_cpu_write_internal(this, addr, val,
+					 (addr & 0x1f000) >> 12);
+
+	spin_unlock_irq(&opp->lock);
+	return ret;
 }
 
 static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
@@ -1064,7 +1332,7 @@ static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
 	int retval, irq;
 
 	pr_debug("Lower OpenPIC INT output\n");
-	qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
+	mpic_irq_lower(opp, dst, ILR_INTTGT_INT);
 
 	irq = IRQ_get_next(opp, &dst->raised);
 	pr_debug("IACK: irq=%d\n", irq);
@@ -1107,20 +1375,37 @@ static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
 	return retval;
 }
 
-static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx)
+void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu)
 {
-	struct openpic *opp = opaque;
+	struct kvm *kvm = vcpu->kvm;
+	struct openpic *opp = kvm->arch.irqchip_priv;
+	int cpu = vcpu->vcpu_id;
+	unsigned long flags;
+
+	spin_lock_irqsave(&opp->lock, flags);
+
+	if ((opp->gcr & opp->mpic_mode_mask) == GCR_MODE_PROXY)
+		kvmppc_set_epr(vcpu, openpic_iack(opp, &opp->dst[cpu], cpu));
+
+	spin_unlock_irqrestore(&opp->lock, flags);
+}
+
+static int openpic_cpu_read_internal(struct kvm_io_device *this, gpa_t addr,
+				     u32 *ptr, int idx)
+{
+	struct sub_region *sub = container_of(this, struct sub_region, iodev);
+	struct openpic *opp = sub->opp;
 	struct irq_dest *dst;
 	uint32_t retval;
 
-	pr_debug("%s: cpu %d addr %#" HWADDR_PRIx "\n", __func__, idx, addr);
+	pr_debug("%s: cpu %d addr %#llx\n", __func__, idx, addr);
 	retval = 0xFFFFFFFF;
 
 	if (idx < 0)
-		return retval;
+		goto out;
 
 	if (addr & 0xF)
-		return retval;
+		goto out;
 
 	dst = &opp->dst[idx];
 	addr &= 0xFF0;
@@ -1142,12 +1427,35 @@ static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx)
 	}
 	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
-	return retval;
+out:
+	*ptr = retval;
+	return 0;
 }
 
-static uint64_t openpic_cpu_read(void *opaque, gpa_t addr, unsigned len)
+static int openpic_cpu_read(struct kvm_io_device *this, gpa_t addr,
+			    int len, void *ptr)
 {
-	return openpic_cpu_read_internal(opaque, addr, (addr & 0x1f000) >> 12);
+	struct sub_region *sub = container_of(this, struct sub_region, iodev);
+	struct openpic *opp = sub->opp;
+	int ret;
+	u32 val;
+
+	addr -= sub->base;
+	if (addr > sub->size)
+		return 1;
+
+	spin_lock_irq(&opp->lock);
+	ret = openpic_cpu_read_internal(this, addr, &val,
+					(addr & 0x1f000) >> 12);
+	spin_unlock_irq(&opp->lock);
+	if (ret < 0)
+		return 1;
+
+	ret = openpic_put_val32(len, ptr, val);
+	if (ret < 0)
+		return 1;
+
+	return 0;
 }
 
 static const struct kvm_io_device_ops openpic_glb_ops_be = {
@@ -1205,11 +1513,10 @@ static void fsl_common_init(struct openpic *opp)
 	opp->irq_tim0 = virq;
 	virq += MAX_TMR;
 
-	assert(virq <= MAX_IRQ);
+	BUG_ON(virq > MAX_IRQ);
 
 	opp->irq_msi = 224;
 
-	msi_supported = true;
 	for (i = 0; i < opp->fsl->max_ext; i++)
 		opp->src[i].level = false;
 
@@ -1226,39 +1533,55 @@ static void fsl_common_init(struct openpic *opp)
 	}
 }
 
-static void map_list(struct openpic *opp, const struct mem_reg *list,
-		     int *count)
+static void map_list(struct openpic *opp, const struct mem_reg *list)
 {
+	mutex_lock(&opp->kvm->slots_lock);
+
 	while (list->name) {
-		assert(*count < ARRAY_SIZE(opp->sub_io_mem));
+		struct sub_region *sub;
+
+		BUG_ON(opp->sub_count >= ARRAY_SIZE(opp->sub_io_mem));
 
-		memory_region_init_io(&opp->sub_io_mem[*count], list->ops, opp,
-				      list->name, list->size);
+		sub = &opp->sub_io_mem[opp->sub_count];
+		sub->opp = opp;
+		sub->base = opp->reg_base + list->start_addr;
+		sub->size = list->size;
 
-		memory_region_add_subregion(&opp->mem, list->start_addr,
-					    &opp->sub_io_mem[*count]);
+		kvm_iodevice_init(&sub->iodev, list->ops);
 
-		(*count)++;
+		kvm_io_bus_register_dev(opp->kvm, KVM_MMIO_BUS,
+			opp->reg_base + list->start_addr, list->size,
+			&sub->iodev);
+
+		opp->sub_count++;
 		list++;
 	}
+
+	mutex_unlock(&opp->kvm->slots_lock);
+}
+
+static void unmap_all(struct openpic *opp)
+{
+	int i;
+
+	mutex_lock(&opp->kvm->slots_lock);
+
+	for (i = 0; i < opp->sub_count; i++) {
+		kvm_io_bus_unregister_dev(opp->kvm, KVM_MMIO_BUS,
+			&opp->sub_io_mem[i].iodev);
+	}
+
+	mutex_unlock(&opp->kvm->slots_lock);
+
+	opp->sub_count = 0;
 }
 
-static int openpic_init(SysBusDevice *dev)
+static int set_base_addr(struct kvm *kvm, struct kvm_device *dev,
+			 struct kvm_device_attr *attr)
 {
-	struct openpic *opp = FROM_SYSBUS(typeof(*opp), dev);
-	int i, j;
-	int list_count = 0;
-	static const struct mem_reg list_le[] = {
-		{"glb", &openpic_glb_ops_le,
-		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
-		{"tmr", &openpic_tmr_ops_le,
-		 OPENPIC_TMR_REG_START, OPENPIC_TMR_REG_SIZE},
-		{"src", &openpic_src_ops_le,
-		 OPENPIC_SRC_REG_START, OPENPIC_SRC_REG_SIZE},
-		{"cpu", &openpic_cpu_ops_le,
-		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
-		{NULL}
-	};
+	struct openpic *opp = container_of(dev, struct openpic, dev);
+	u64 base;
+
 	static const struct mem_reg list_be[] = {
 		{"glb", &openpic_glb_ops_be,
 		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
@@ -1278,11 +1601,239 @@ static int openpic_init(SysBusDevice *dev)
 		{NULL}
 	};
 
-	memory_region_init(&opp->mem, "openpic", 0x40000);
+	if (copy_from_user(&base, (u64 __user *)(long)attr->addr, sizeof(u64)))
+		return -EFAULT;
+
+	if (base & 0x3ffff) {
+		pr_debug("kvm mpic %s: KVM_DEV_MPIC_BASE_ADDR %08llx not aligned\n",
+			 __func__, base);
+		return -EINVAL;
+	}
+
+	if (base == opp->reg_base)
+		return 0;
+
+	unmap_all(opp);
+	opp->reg_base = base;
+
+	pr_debug("kvm mpic %s: KVM_DEV_MPIC_BASE_ADDR %08llx\n",
+		 __func__, base);
+
+	if (base == 0)
+		return 0;
 
 	switch (opp->model) {
-	case OPENPIC_MODEL_FSL_MPIC_20:
+	case KVM_DEV_TYPE_FSL_MPIC_20:
+		map_list(opp, list_be);
+		map_list(opp, list_fsl);
+
+		break;
+
+	case KVM_DEV_TYPE_FSL_MPIC_42:
+		map_list(opp, list_be);
+		map_list(opp, list_fsl);
+
+		break;
+
 	default:
+		WARN_ON_ONCE(1);
+	}
+
+	return 0;
+}
+
+#define ATTR_SET		0
+#define ATTR_GET		1
+
+static int access_reg(struct openpic *opp, gpa_t addr, u32 *val, int type)
+{
+	int ret;
+
+	if (!opp->sub_count)
+		return -EPERM;
+
+	if (addr & 3)
+		return -ENXIO;
+
+	if (addr >= 0x40000)
+		return -ENXIO;
+
+	addr += opp->reg_base;
+
+	mutex_lock(&opp->kvm->slots_lock);
+
+	if (type == ATTR_SET)
+		ret = kvm_io_bus_write(opp->kvm, KVM_MMIO_BUS, addr, 4, val);
+	else
+		ret = kvm_io_bus_read(opp->kvm, KVM_MMIO_BUS, addr, 4, val);
+
+	mutex_unlock(&opp->kvm->slots_lock);
+
+	pr_debug("%s: type %d addr %llx val %x\n", __func__, type, addr, *val);
+
+	return ret;
+}
+
+static int mpic_set_attr(struct kvm *kvm, struct kvm_device *dev,
+			 struct kvm_device_attr *attr)
+{
+	struct openpic *opp = container_of(dev, struct openpic, dev);
+	u32 attr32;
+
+	switch (attr->group) {
+	case KVM_DEV_MPIC_GRP_MISC:
+		switch (attr->attr) {
+		case KVM_DEV_MPIC_BASE_ADDR:
+			return set_base_addr(kvm, dev, attr);
+		}
+
+		break;
+
+	case KVM_DEV_MPIC_GRP_REGISTER:
+		if (copy_from_user(&attr32, (u32 __user *)(long)attr->addr,
+				   sizeof(u32)))
+			return -EFAULT;
+
+		return access_reg(opp, attr->attr, &attr32, ATTR_SET);
+
+	case KVM_DEV_MPIC_GRP_IRQ_ACTIVE:
+		if (attr->attr > MAX_SRC)
+			return -EINVAL;
+
+		if (copy_from_user(&attr32, (u32 __user *)(long)attr->addr,
+				   sizeof(u32)))
+			return -EFAULT;
+
+		if (attr32 != 0 && attr32 != 1)
+			return -EINVAL;
+
+		spin_lock_irq(&opp->lock);
+		openpic_set_irq(opp, attr->attr, attr32);
+		spin_unlock_irq(&opp->lock);
+		return 0;
+	}
+
+	return -ENXIO;
+}
+
+static int mpic_get_attr(struct kvm *kvm, struct kvm_device *dev,
+			 struct kvm_device_attr *attr)
+{
+	struct openpic *opp = container_of(dev, struct openpic, dev);
+	u64 attr64;
+	u32 attr32;
+	int ret;
+
+	switch (attr->group) {
+	case KVM_DEV_MPIC_GRP_MISC:
+		switch (attr->attr) {
+		case KVM_DEV_MPIC_BASE_ADDR:
+			attr64 = opp->reg_base;
+
+			if (copy_to_user((u64 __user *)(long)attr->addr,
+					 &attr64, sizeof(u64)))
+				return -EFAULT;
+
+			return 0;
+		}
+
+		break;
+
+	case KVM_DEV_MPIC_GRP_REGISTER:
+		ret = access_reg(opp, attr->attr, &attr32, ATTR_GET);
+		if (ret)
+			return ret;
+
+		if (copy_to_user((u32 __user *)(long)attr->addr, &attr32,
+				 sizeof(u32)))
+			return -EFAULT;
+
+		return 0;
+
+	case KVM_DEV_MPIC_GRP_IRQ_ACTIVE:
+		if (attr->attr > MAX_SRC)
+			return -EINVAL;
+
+		attr32 = opp->src[attr->attr].pending;
+
+		if (copy_to_user((u32 __user *)(long)attr->addr, &attr32,
+				 sizeof(u32)))
+			return -EFAULT;
+
+		return 0;
+	}
+
+	return -ENXIO;
+}
+
+static void mpic_destroy(struct kvm *kvm, struct kvm_device *dev)
+{
+	struct openpic *opp = container_of(dev, struct openpic, dev);
+
+	blocking_notifier_chain_unregister(&kvm->vcpu_notifier,
+					   &opp->vcpu_notifier);
+
+	unmap_all(opp);
+	kfree(opp);
+}
+
+static int add_cpu(struct openpic *opp, struct kvm_vcpu *vcpu)
+{
+	u32 id = vcpu->vcpu_id;
+
+	if (id < 0 || id >= MAX_CPU)
+		return -EPERM;
+
+	spin_lock_irq(&opp->lock);
+
+	WARN_ON(opp->dst[id].vcpu);
+	opp->dst[id].vcpu = vcpu;
+	opp->nb_cpus = max(opp->nb_cpus, id + 1);
+
+	spin_unlock_irq(&opp->lock);
+
+	if (opp->mpic_mode_mask == GCR_MODE_PROXY)
+		vcpu->arch.epr_flags |= KVMPPC_EPR_KERNEL;
+
+	return 0;
+}
+
+static int kvm_mpic_vcpu_notifier(struct notifier_block *nb,
+				  unsigned long create, void *v)
+{
+	struct openpic *opp = container_of(nb, struct openpic, vcpu_notifier);
+	struct kvm_vcpu *vcpu = v;
+	int ret;
+
+	if (create) {
+		ret = add_cpu(opp, vcpu);
+		if (ret < 0)
+			return notifier_from_errno(ret);
+	}
+
+	return NOTIFY_OK;
+}
+
+int kvm_create_mpic(struct kvm *kvm, u32 type, struct kvm_device **dev)
+{
+	struct openpic *opp;
+	struct kvm_vcpu *vcpu;
+	int ret, i;
+
+	if (kvm->arch.irqchip_priv)
+		return -EEXIST;
+
+	opp = kzalloc(sizeof(struct openpic), GFP_KERNEL);
+	if (!opp)
+		return -ENOMEM;
+
+	kvm->arch.irqchip_priv = opp;
+	opp->kvm = kvm;
+	opp->model = type;
+	spin_lock_init(&opp->lock);
+
+	switch (opp->model) {
+	case KVM_DEV_TYPE_FSL_MPIC_20:
 		opp->fsl = &fsl_mpic_20;
 		opp->brr1 = 0x00400200;
 		opp->flags |= OPENPIC_FLAG_IDR_CRIT;
@@ -1290,12 +1841,10 @@ static int openpic_init(SysBusDevice *dev)
 		opp->mpic_mode_mask = GCR_MODE_MIXED;
 
 		fsl_common_init(opp);
-		map_list(opp, list_be, &list_count);
-		map_list(opp, list_fsl, &list_count);
 
 		break;
 
-	case OPENPIC_MODEL_FSL_MPIC_42:
+	case KVM_DEV_TYPE_FSL_MPIC_42:
 		opp->fsl = &fsl_mpic_42;
 		opp->brr1 = 0x00400402;
 		opp->flags |= OPENPIC_FLAG_ILR;
@@ -1303,11 +1852,39 @@ static int openpic_init(SysBusDevice *dev)
 		opp->mpic_mode_mask = GCR_MODE_PROXY;
 
 		fsl_common_init(opp);
-		map_list(opp, list_be, &list_count);
-		map_list(opp, list_fsl, &list_count);
 
 		break;
+
+	default:
+		ret = -ENODEV;
+		goto err;
 	}
 
+	openpic_reset(opp);
+
+	opp->dev.type = type;
+	opp->dev.set_attr = mpic_set_attr;
+	opp->dev.get_attr = mpic_get_attr;
+	opp->dev.destroy = mpic_destroy;
+	*dev = &opp->dev;
+
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		ret = add_cpu(opp, vcpu);
+		if (ret < 0)
+			goto err;
+	}
+
+	opp->vcpu_notifier.notifier_call = kvm_mpic_vcpu_notifier;
+
+	/* FIXME: register notifier for subsequently created vcpus */
+	ret = blocking_notifier_chain_register(&kvm->vcpu_notifier,
+					       &opp->vcpu_notifier);
+	if (ret < 0)
+		goto err;
+
 	return 0;
+
+err:
+	kfree(opp);
+	return ret;
 }
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 61989f4..e3d09f7 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -317,6 +317,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_ENABLE_CAP:
 	case KVM_CAP_ONE_REG:
 	case KVM_CAP_IOEVENTFD:
+	case KVM_CAP_DEVICE_CTRL:
 		r = 1;
 		break;
 #ifndef CONFIG_KVM_BOOK3S_64_HV
@@ -781,7 +782,10 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 		break;
 	case KVM_CAP_PPC_EPR:
 		r = 0;
-		vcpu->arch.epr_enabled = cap->args[0];
+		if (cap->args[0])
+			vcpu->arch.epr_flags |= KVMPPC_EPR_USER;
+		else
+			vcpu->arch.epr_flags &= ~KVMPPC_EPR_USER;
 		break;
 #ifdef CONFIG_BOOKE
 	case KVM_CAP_PPC_BOOKE_WATCHDOG:
@@ -927,6 +931,7 @@ static int kvm_vm_ioctl_get_pvinfo(struct kvm_ppc_pvinfo *pvinfo)
 long kvm_arch_vm_ioctl(struct file *filp,
                        unsigned int ioctl, unsigned long arg)
 {
+	struct kvm *kvm __maybe_unused = filp->private_data;
 	void __user *argp = (void __user *)arg;
 	long r;
 
@@ -945,7 +950,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 #ifdef CONFIG_PPC_BOOK3S_64
 	case KVM_CREATE_SPAPR_TCE: {
 		struct kvm_create_spapr_tce create_tce;
-		struct kvm *kvm = filp->private_data;
 
 		r = -EFAULT;
 		if (copy_from_user(&create_tce, argp, sizeof(create_tce)))
@@ -957,7 +961,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 
 #ifdef CONFIG_KVM_BOOK3S_64_HV
 	case KVM_ALLOCATE_RMA: {
-		struct kvm *kvm = filp->private_data;
 		struct kvm_allocate_rma rma;
 
 		r = kvm_vm_ioctl_allocate_rma(kvm, &rma);
@@ -967,7 +970,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 	}
 
 	case KVM_PPC_ALLOCATE_HTAB: {
-		struct kvm *kvm = filp->private_data;
 		u32 htab_order;
 
 		r = -EFAULT;
@@ -984,7 +986,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 	}
 
 	case KVM_PPC_GET_HTAB_FD: {
-		struct kvm *kvm = filp->private_data;
 		struct kvm_get_htab_fd ghf;
 
 		r = -EFAULT;
@@ -997,7 +998,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 
 #ifdef CONFIG_PPC_BOOK3S_64
 	case KVM_PPC_GET_SMMU_INFO: {
-		struct kvm *kvm = filp->private_data;
 		struct kvm_ppc_smmu_info info;
 
 		memset(&info, 0, sizeof(info));
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3d28037..48342a6 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1059,5 +1059,7 @@ static inline bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
 }
 
 #endif /* CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT */
-#endif
 
+int kvm_create_mpic(struct kvm *kvm, u32 type, struct kvm_device **dev);
+
+#endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 1f348e0..1048a03 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -910,10 +910,19 @@ struct kvm_device_attr {
 #define KVM_DEV_ATTR_COMMON		0
 #define   KVM_DEV_ATTR_TYPE		0 /* 32-bit */
 
-#define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xac, struct kvm_create_device)
-#define KVM_SET_DEVICE_ATTR	  _IOW(KVMIO,  0xad, struct kvm_device_attr)
-#define KVM_GET_DEVICE_ATTR	  _IOW(KVMIO,  0xae, struct kvm_device_attr)
-#define KVM_HAS_DEVICE_ATTR	  _IOW(KVMIO,  0xaf, struct kvm_device_attr)
+#define KVM_DEV_TYPE_FSL_MPIC_20	1
+#define KVM_DEV_TYPE_FSL_MPIC_42	2
+
+#define KVM_DEV_MPIC_GRP_MISC		1
+#define   KVM_DEV_MPIC_BASE_ADDR	0	/* 64-bit */
+
+#define KVM_DEV_MPIC_GRP_REGISTER	2	/* 32-bit */
+#define KVM_DEV_MPIC_GRP_IRQ_ACTIVE	3	/* 32-bit */
+
+#define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xab, struct kvm_create_device)
+#define KVM_SET_DEVICE_ATTR	  _IOW(KVMIO,  0xac, struct kvm_device_attr)
+#define KVM_GET_DEVICE_ATTR	  _IOW(KVMIO,  0xad, struct kvm_device_attr)
+#define KVM_HAS_DEVICE_ATTR	  _IOW(KVMIO,  0xae, struct kvm_device_attr)
 
 /*
  * ioctls for vcpu fds
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index dd4c78d..db0c2b3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2210,6 +2210,18 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
 	}
 
 	switch (cd->type) {
+#ifdef CONFIG_KVM_MPIC
+	case KVM_DEV_TYPE_FSL_MPIC_20:
+	case KVM_DEV_TYPE_FSL_MPIC_42: {
+		if (test) {
+			r = 0;
+			break;
+		}
+
+		r = kvm_create_mpic(kvm, cd->type, &dev);
+		break;
+	}
+#endif
 	default:
 		r = -ENODEV;
 		goto out;
-- 
1.7.9.5
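
For readers skimming the patch, the base-address handling in set_base_addr() can be modelled in user space roughly as follows. This is only a sketch under stated assumptions: struct mpic_model, its mapped flag and mpic_model_set_base() are hypothetical names that do not appear in the patch, and the real code registers and unregisters kvm_io_device sub-regions rather than toggling a flag.

```c
#include <errno.h>
#include <stdint.h>

/* The patch rejects any base with the low 18 bits set (base & 0x3ffff),
 * i.e. any base that is not aligned to the 0x40000-byte register window. */
#define MPIC_REGION_SIZE 0x40000ULL

/* Hypothetical stand-in for the few fields of struct openpic used here. */
struct mpic_model {
	uint64_t reg_base;	/* 0 means "not mapped" */
	int mapped;		/* nonzero once the sub-regions are registered */
};

/* Returns 0 on success, -EINVAL for a misaligned base. */
static int mpic_model_set_base(struct mpic_model *m, uint64_t base)
{
	if (base & (MPIC_REGION_SIZE - 1))
		return -EINVAL;

	if (base == m->reg_base)
		return 0;		/* unchanged, nothing to do */

	m->mapped = 0;			/* models unmap_all() */
	m->reg_base = base;

	if (base == 0)
		return 0;		/* a zero base leaves the device unmapped */

	m->mapped = 1;			/* models the map_list() calls */
	return 0;
}
```

In this model, setting a nonzero aligned base maps the device, and setting the base back to 0 unmaps it again.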



^ permalink raw reply related	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 0/6] kvm/ppc/mpic: in-kernel irqchip
  2013-02-14  5:49 ` Scott Wood
@ 2013-02-18 12:04   ` Gleb Natapov
  -1 siblings, 0 replies; 261+ messages in thread
From: Gleb Natapov @ 2013-02-18 12:04 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm

Can you tell us why mpic should be in kernel? Is it used often by modern
guests, or maybe they prefer MSI for interrupt delivery (hmm, maybe MSIs
are delivered through mpic too)? On x86 we actually would've preferred
to move PIC/IOAPIC out of the kernel and leave only LAPIC there (but for
historical reasons creation of PIC/IOAPIC/LAPIC is bundled together),
hence my question.

On Wed, Feb 13, 2013 at 11:49:14PM -0600, Scott Wood wrote:
> Scott Wood (6):
>   kvm: add device control API
>   kvm/ppc: add a notifier chain for vcpu creation/destruction.
>   kvm/ppc/mpic: import hw/openpic.c from QEMU
>   kvm/ppc/mpic: remove some obviously unneeded code
>   kvm/ppc/mpic: adapt to kernel style and environment
>   kvm/ppc/mpic: in-kernel MPIC emulation
> 
>  Documentation/virtual/kvm/api.txt          |   76 ++
>  Documentation/virtual/kvm/devices/README   |    1 +
>  Documentation/virtual/kvm/devices/mpic.txt |   36 +
>  arch/powerpc/include/asm/kvm_host.h        |    9 +-
>  arch/powerpc/include/asm/kvm_ppc.h         |    4 +
>  arch/powerpc/kvm/Kconfig                   |    5 +
>  arch/powerpc/kvm/Makefile                  |    2 +
>  arch/powerpc/kvm/booke.c                   |   10 +-
>  arch/powerpc/kvm/mpic.c                    | 1890 ++++++++++++++++++++++++++++
>  arch/powerpc/kvm/powerpc.c                 |   31 +-
>  include/linux/kvm_host.h                   |   44 +-
>  include/uapi/linux/kvm.h                   |   34 +
>  virt/kvm/kvm_main.c                        |  141 +++
>  13 files changed, 2268 insertions(+), 15 deletions(-)
>  create mode 100644 Documentation/virtual/kvm/devices/README
>  create mode 100644 Documentation/virtual/kvm/devices/mpic.txt
>  create mode 100644 arch/powerpc/kvm/mpic.c
> 
> -- 
> 1.7.9.5
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
			Gleb.

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-14  5:49   ` Scott Wood
@ 2013-02-18 12:21     ` Gleb Natapov
  -1 siblings, 0 replies; 261+ messages in thread
From: Gleb Natapov @ 2013-02-18 12:21 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm, Christoffer Dall

Copying Christoffer since ARM has in kernel irq chip too.

On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote:
> Currently, devices that are emulated inside KVM are configured in a
> hardcoded manner based on an assumption that any given architecture
> only has one way to do it.  If there's any need to access device state,
> it is done through inflexible one-purpose-only IOCTLs (e.g.
> KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> cumbersome and depletes a limited numberspace.
> 
> This API provides a mechanism to instantiate a device of a certain
> type, returning an ID that can be used to set/get attributes of the
> device.  Attributes may include configuration parameters (e.g.
> register base address), device state, operational commands, etc.  It
> is similar to the ONE_REG API, except that it acts on devices rather
> than vcpus.

You not only provide a different way to create an in-kernel irq chip, you
also use an alternate way to trigger interrupt lines. Before going into
interface specifics, let's think about whether it is really worth it. x86
obviously supports the old way and will have to for a very long time.
The ARM vGIC code, which is ready to go upstream, uses the old way too. So
it will be 2 archs against one. Christoffer, do you think the proposed way
is better for your needs? Are you willing to make vGIC use it?

Scott, what other devices are you planning to support with this
interface?

--
			Gleb.

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-18 12:21     ` Gleb Natapov
@ 2013-02-18 23:01       ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-02-18 23:01 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Alexander Graf, kvm-ppc, kvm, Christoffer Dall

On 02/18/2013 06:21:59 AM, Gleb Natapov wrote:
> Copying Christoffer since ARM has in kernel irq chip too.
> 
> On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote:
> > Currently, devices that are emulated inside KVM are configured in a
> > hardcoded manner based on an assumption that any given architecture
> > only has one way to do it.  If there's any need to access device state,
> > it is done through inflexible one-purpose-only IOCTLs (e.g.
> > KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> > cumbersome and depletes a limited numberspace.
> >
> > This API provides a mechanism to instantiate a device of a certain
> > type, returning an ID that can be used to set/get attributes of the
> > device.  Attributes may include configuration parameters (e.g.
> > register base address), device state, operational commands, etc.  It
> > is similar to the ONE_REG API, except that it acts on devices rather
> > than vcpus.
> You not only provide a different way to create an in-kernel irq chip, you
> also use an alternate way to trigger interrupt lines. Before going into
> interface specifics, let's think about whether it is really worth it.

Which "it" do you mean here?

The ability to set/get attributes is needed.  Sorry, but "get or set  
one blob of data, up to 512 bytes, for the entire irqchip" is just not  
good enough -- assuming you don't want us to start sticking pointers  
and commands in *that* data. :-)
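
To illustrate the point for readers (this is not code from the series), here is a rough user-space model of how one set/get pair keyed by group and attribute can cover a base address, individual registers and interrupt line state. The group and attribute numbers are the ones defined in patch 6; struct model_dev, MODEL_NUM_REGS, MODEL_NUM_IRQS and the model_*() functions are hypothetical and exist only for this sketch.

```c
#include <errno.h>
#include <stdint.h>

/* Group and attribute numbers as defined in the uapi hunk of patch 6. */
#define KVM_DEV_MPIC_GRP_MISC		1
#define   KVM_DEV_MPIC_BASE_ADDR	0
#define KVM_DEV_MPIC_GRP_REGISTER	2
#define KVM_DEV_MPIC_GRP_IRQ_ACTIVE	3

#define MODEL_NUM_REGS	4	/* hypothetical, tiny register file */
#define MODEL_NUM_IRQS	8	/* hypothetical */

/* Hypothetical user-space model of a device, not the kernel structure. */
struct model_dev {
	uint64_t base;
	uint32_t regs[MODEL_NUM_REGS];
	uint32_t irq_active[MODEL_NUM_IRQS];
};

static int model_set_attr(struct model_dev *d, uint32_t group,
			  uint64_t attr, uint64_t val)
{
	switch (group) {
	case KVM_DEV_MPIC_GRP_MISC:
		if (attr != KVM_DEV_MPIC_BASE_ADDR)
			break;
		d->base = val;
		return 0;
	case KVM_DEV_MPIC_GRP_REGISTER:
		/* attr is a byte offset to a 32-bit register */
		if ((attr & 3) || attr / 4 >= MODEL_NUM_REGS)
			return -ENXIO;
		d->regs[attr / 4] = (uint32_t)val;
		return 0;
	case KVM_DEV_MPIC_GRP_IRQ_ACTIVE:
		/* attr is the interrupt source, val is the line level */
		if (attr >= MODEL_NUM_IRQS || val > 1)
			return -EINVAL;
		d->irq_active[attr] = (uint32_t)val;
		return 0;
	}
	return -ENXIO;
}

static int model_get_attr(const struct model_dev *d, uint32_t group,
			  uint64_t attr, uint64_t *val)
{
	switch (group) {
	case KVM_DEV_MPIC_GRP_MISC:
		if (attr != KVM_DEV_MPIC_BASE_ADDR)
			break;
		*val = d->base;
		return 0;
	case KVM_DEV_MPIC_GRP_REGISTER:
		if ((attr & 3) || attr / 4 >= MODEL_NUM_REGS)
			return -ENXIO;
		*val = d->regs[attr / 4];
		return 0;
	case KVM_DEV_MPIC_GRP_IRQ_ACTIVE:
		if (attr >= MODEL_NUM_IRQS)
			return -EINVAL;
		*val = d->irq_active[attr];
		return 0;
	}
	return -ENXIO;
}
```

New kinds of state then only need a new group or attribute number, not a new entry point.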

If you mean the way to inject interrupts, it's simpler this way.  Why  
go out of our way to inject common glue code into a communication path  
between hw/kvm/mpic.c in QEMU and arch/powerpc/kvm/mpic.c in KVM?  Or  
rather, why make that common glue be specific to this one function when  
we could reuse the same communication glue used for other things, such  
as device state?

And that's just for regular interrupts.  MSIs are vastly simpler on  
MPIC than what x86 does.

> x86 obviously supports the old way and will have to for a very long time.

Sure.

> The ARM vGIC code, which is ready to go upstream, uses the old way too.
> So it will be 2 archs against one.

I wasn't aware that that's how it worked. :-P

I was trying to be considerate by not making the entire thing  
gratuitously PPC or MPIC specific, as some others seem inclined to do  
(e.g. see irqfd and APIC).  We already had a discussion on ARM's "set  
address" ioctl and rather than extend *that* interface, they preferred  
to just stick something ARM-specific in ASAP with the understanding  
that it would be replaced (or more accurately, kept around as a thin  
wrapper around the new stuff) later.

> Christoffer, do you think the proposed way is
> better for your needs? Are you willing to make vGIC use it?
> 
> Scott, what other devices are you planning to support with this
> interface?

At the moment I do not have plans for other devices, though what does  
it hurt for the capability to be there?

-Scott

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 0/6] kvm/ppc/mpic: in-kernel irqchip
  2013-02-18 12:04   ` Gleb Natapov
@ 2013-02-18 23:05     ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-02-18 23:05 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Alexander Graf, kvm-ppc, kvm

On 02/18/2013 06:04:51 AM, Gleb Natapov wrote:
> Can you tell us why mpic should be in kernel? Is it used often by modern
> guests, or maybe they prefer MSI for interrupt delivery (hmm, maybe MSIs
> are delivered through mpic too)?

Yes, MSIs are delivered through the mpic.

Plus, MSIs are only (normally) for PCI(e).  We have embedded "system on  
a chip"s with important non-PCI devices, which cannot use MSIs.

> On x86 we actually would've preferred
> to move PIC/IOAPIC out of the kernel and leave only LAPIC there (but for
> historical reasons creation of PIC/IOAPIC/LAPIC is bundled together),
> hence my question.

We don't have that same split on this hardware.  MPIC is one device  
that covers all of it.  Some of the functionality is per-CPU, but it is  
not easily extracted from the rest.
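
As a rough illustration of the per-CPU part (a user-space sketch, not the patch itself): the cpu register handlers in patch 6 select the destination from bits 12 to 16 of the offset into the cpu sub-region, require 16-byte alignment, and mask the offset with 0xFF0 to find the register within that CPU's block. The function names mpic_cpu_index() and mpic_cpu_reg() are made up for this example, and returning -1 for a misaligned offset is a simplification; the patch's read path reports all-ones data in that case.

```c
#include <stdint.h>

/* Per-CPU block index for an offset into the cpu sub-region, or -1 if
 * the offset is not 16-byte aligned. */
static int mpic_cpu_index(uint64_t offset)
{
	if (offset & 0xF)
		return -1;
	return (int)((offset & 0x1f000) >> 12);
}

/* Register offset within the selected per-CPU block. */
static uint32_t mpic_cpu_reg(uint64_t offset)
{
	return (uint32_t)(offset & 0xFF0);
}
```

So the per-CPU registers are just 4 KiB slices of the same register window, which is part of why they are hard to split off from the rest of the device.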

-Scott

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-18 23:01       ` Scott Wood
@ 2013-02-19  0:43         ` Christoffer Dall
  -1 siblings, 0 replies; 261+ messages in thread
From: Christoffer Dall @ 2013-02-19  0:43 UTC (permalink / raw)
  To: Scott Wood; +Cc: Gleb Natapov, Alexander Graf, kvm-ppc, kvm

On Mon, Feb 18, 2013 at 3:01 PM, Scott Wood <scottwood@freescale.com> wrote:
> On 02/18/2013 06:21:59 AM, Gleb Natapov wrote:
>>
>> Copying Christoffer since ARM has in kernel irq chip too.
>>
>> On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote:
>> > Currently, devices that are emulated inside KVM are configured in a
>> > hardcoded manner based on an assumption that any given architecture
>> > only has one way to do it.  If there's any need to access device state,
>> > it is done through inflexible one-purpose-only IOCTLs (e.g.
>> > KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
>> > cumbersome and depletes a limited numberspace.
>> >
>> > This API provides a mechanism to instantiate a device of a certain
>> > type, returning an ID that can be used to set/get attributes of the
>> > device.  Attributes may include configuration parameters (e.g.
>> > register base address), device state, operational commands, etc.  It
>> > is similar to the ONE_REG API, except that it acts on devices rather
>> > than vcpus.
>> You not only provide a different way to create an in-kernel irq chip, you
>> also use an alternate way to trigger interrupt lines. Before going into
>> interface specifics, let's think about whether it is really worth it.
>
>
> Which "it" do you mean here?
>
> The ability to set/get attributes is needed.  Sorry, but "get or set one
> blob of data, up to 512 bytes, for the entire irqchip" is just not good
> enough -- assuming you don't want us to start sticking pointers and commands
> in *that* data. :-)
>
> If you mean the way to inject interrupts, it's simpler this way.  Why go out
> of our way to inject common glue code into a communication path between
> hw/kvm/mpic.c in QEMU and arch/powerpc/kvm/mpic.c in KVM?  Or rather, why
> make that common glue be specific to this one function when we could reuse
> the same communication glue used for other things, such as device state?
>
> And that's just for regular interrupts.  MSIs are vastly simpler on MPIC
> than what x86 does.
>
>
>> x86 obviously supports the old way and will have to for a very long time.
>
>
> Sure.
>
>
>> The ARM vGIC code, which is ready to go upstream, uses the old way too.
>> So it will be 2 archs against one.
>
>
> I wasn't aware that that's how it worked. :-P
>
> I was trying to be considerate by not making the entire thing gratuitously
> PPC or MPIC specific, as some others seem inclined to do (e.g. see irqfd and
> APIC).  We already had a discussion on ARM's "set address" ioctl and rather
> than extend *that* interface, they preferred to just stick something
> ARM-specific in ASAP with the understanding that it would be replaced (or
> more accurately, kept around as a thin wrapper around the new stuff) later.
>
>
>> Christoffer do you think the proposed way it
>> better for your needs. Are you willing to make vGIC use it?
>>

Well it won't improve functionality much from the current hardware
point of view, but the proposed interface is superior to what we have
now. Adding and coordinating new interfaces is indeed a pain, so a
generic interface which is flexible enough to cater for a certain
group of needs, is welcome imho. And this does seem to fit the bill.

I can imagine that if there's support for a future ARM gic version or
if we add in-kernel support for other stuff on ARM, then this
interface will be useful, and in fact using the current interface to
support two separate, but similar, interrupt controllers could get
messy from a user space point of view.

I am definitely willing to change to use this interface, the agreement
on the KVM_ARM_SET_DEVICE_ADDR ioctl was exactly because of this.

I had some nits on the RFC, which I'll send separately.

>> Scott, what other devices are you planning to support with this
>> interface?
>
>
> At the moment I do not have plans for other devices, though what does it
> hurt for the capability to be there?
>

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-14  5:49   ` Scott Wood
@ 2013-02-19  0:44     ` Christoffer Dall
  -1 siblings, 0 replies; 261+ messages in thread
From: Christoffer Dall @ 2013-02-19  0:44 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm

On Wed, Feb 13, 2013 at 9:49 PM, Scott Wood <scottwood@freescale.com> wrote:
> Currently, devices that are emulated inside KVM are configured in a
> hardcoded manner based on an assumption that any given architecture
> only has one way to do it.  If there's any need to access device state,
> it is done through inflexible one-purpose-only IOCTLs (e.g.
> KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> cumbersome and depletes a limited numberspace.
>
> This API provides a mechanism to instantiate a device of a certain
> type, returning an ID that can be used to set/get attributes of the
> device.  Attributes may include configuration parameters (e.g.
> register base address), device state, operational commands, etc.  It
> is similar to the ONE_REG API, except that it acts on devices rather
> than vcpus.
>
> Both device types and individual attributes can be tested without having
> to create the device or get/set the attribute, without the need for
> separately managing enumerated capabilities.
>
> Signed-off-by: Scott Wood <scottwood@freescale.com>
> ---
>  Documentation/virtual/kvm/api.txt        |   76 ++++++++++++++++++
>  Documentation/virtual/kvm/devices/README |    1 +
>  include/linux/kvm_host.h                 |   21 +++++
>  include/uapi/linux/kvm.h                 |   25 ++++++
>  virt/kvm/kvm_main.c                      |  127 ++++++++++++++++++++++++++++++
>  5 files changed, 250 insertions(+)
>  create mode 100644 Documentation/virtual/kvm/devices/README
>
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index c2534c3..5bcdb42 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2122,6 +2122,82 @@ header; first `n_valid' valid entries with contents from the data
>  written, then `n_invalid' invalid entries, invalidating any previously
>  valid entries found.
>
> +4.79 KVM_CREATE_DEVICE
> +
> +Capability: KVM_CAP_DEVICE_CTRL
> +Type: vm ioctl
> +Parameters: struct kvm_create_device (in/out)
> +Returns: 0 on success, -1 on error
> +Errors:
> +  ENODEV: The device type is unknown or unsupported
> +  EEXIST: Device already created, and this type of device may not
> +          be instantiated multiple times
> +  ENOSPC: Too many devices have been created
> +
> +  Other error conditions may be defined by individual device types.
> +
> +Creates an emulated device in the kernel.  The returned handle
> +can be used with KVM_SET/GET_DEVICE_ATTR.
> +
> +If the KVM_CREATE_DEVICE_TEST flag is set, only test whether the
> +device type is supported (not necessarily whether it can be created
> +in the current vm).
> +
> +Individual devices should not define flags.  Attributes should be used
> +for specifying any behavior that is not implied by the device type
> +number.
> +
> +struct kvm_create_device {
> +       __u32   type;   /* in: KVM_DEV_TYPE_xxx */
> +       __u32   id;     /* out: device handle */
> +       __u32   flags;  /* in: KVM_CREATE_DEVICE_xxx */
> +};
> +
> +4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR
> +
> +Capability: KVM_CAP_DEVICE_CTRL
> +Type: vm ioctl
> +Parameters: struct kvm_device_attr
> +Returns: 0 on success, -1 on error
> +Errors:
> +  ENODEV: The device id is invalid
> +  ENXIO:  The group or attribute is unknown/unsupported for this device
> +  EPERM:  The attribute cannot (currently) be accessed this way
> +          (e.g. read-only attribute, or attribute that only makes
> +          sense when the device is in a different state)
> +
> +  Other error conditions may be defined by individual device types.
> +
> +Gets/sets a specified piece of device configuration and/or state.  The
> +semantics are device-specific except for certain global attributes.  See
> +individual device documentation in the "devices" directory.  As with
> +ONE_REG, the size of the data transferred is defined by the particular
> +attribute.
> +
> +Attributes in group KVM_DEV_ATTR_COMMON are not device-specific:
> +   KVM_DEV_ATTR_TYPE (ro, 32-bit): the device type passed to KVM_CREATE_DEVICE
> +
> +struct kvm_device_attr {
> +       __u32   dev;            /* id from KVM_CREATE_DEVICE */
> +       __u32   group;          /* KVM_DEV_ATTR_COMMON or device-defined */
> +       __u64   attr;           /* group-defined */
> +       __u64   addr;           /* userspace address of attr data */
> +};
> +
> +4.81 KVM_HAS_DEVICE_ATTR
> +
> +Capability: KVM_CAP_DEVICE_CTRL
> +Type: vm ioctl
> +Parameters: struct kvm_device_attr
> +Returns: 0 on success, -1 on error
> +Errors:
> +  ENODEV: The device id is invalid
> +  ENXIO:  The group or attribute is unknown/unsupported for this device
> +
> +Tests whether a device supports a particular attribute.  A successful
> +return indicates the attribute is implemented.  It does not necessarily
> +indicate that the attribute can be read or written in the device's
> +current state.  "addr" is ignored.
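
The probe-then-create flow described in 4.79 can be sketched from the
userspace side roughly as below.  This is only an illustration under
stated assumptions: the struct layout and the TEST flag value are copied
from the RFC text, while every demo_ name, the callback that stands in
for ioctl(vm_fd, KVM_CREATE_DEVICE, ...), and the fake backend are
hypothetical and exist only so the helper can be tried without /dev/kvm.
The stand-in returns 0 or -errno, whereas the real ioctl would return -1
and set errno.

```c
#include <errno.h>
#include <stdint.h>

/* Layout and flag value copied from the RFC text; names are made up. */
struct demo_create_device {
	uint32_t type;   /* in:  device type number */
	uint32_t id;     /* out: device handle */
	uint32_t flags;  /* in:  DEMO_CREATE_DEVICE_xxx */
};
#define DEMO_CREATE_DEVICE_TEST 1u

/* Hypothetical stand-in for the ioctl: 0 on success, -errno on error. */
typedef int (*demo_create_fn)(void *ctx, struct demo_create_device *cd);

/* Probe the type first, then create it; *id is written only on success. */
int demo_probe_and_create(demo_create_fn create, void *ctx,
			  uint32_t type, uint32_t *id)
{
	struct demo_create_device cd = {
		.type = type,
		.flags = DEMO_CREATE_DEVICE_TEST,
	};
	int r;

	r = create(ctx, &cd);	/* test only: nothing is instantiated */
	if (r)
		return r;

	cd.flags = 0;
	r = create(ctx, &cd);	/* real creation; backend fills cd.id */
	if (r)
		return r;

	*id = cd.id;
	return 0;
}

/* Fake backend: knows only type 1, ctx points at a device counter. */
int demo_fake_create(void *ctx, struct demo_create_device *cd)
{
	unsigned int *num_devices = ctx;

	if (cd->type != 1)
		return -ENODEV;
	if (!(cd->flags & DEMO_CREATE_DEVICE_TEST))
		cd->id = (*num_devices)++;
	return 0;
}
```

Note that, per the text above, a successful TEST call says only that
the type is supported, not that creation in this vm will succeed, so
the second call still needs its own error check.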
>
>  5. The kvm_run structure
>  ------------------------
> diff --git a/Documentation/virtual/kvm/devices/README b/Documentation/virtual/kvm/devices/README
> new file mode 100644
> index 0000000..34a6983
> --- /dev/null
> +++ b/Documentation/virtual/kvm/devices/README
> @@ -0,0 +1 @@
> +This directory contains specific device bindings for KVM_CAP_DEVICE_CTRL.
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 0350e0d..dbaf012 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -335,6 +335,25 @@ struct kvm_memslots {
>         short id_to_index[KVM_MEM_SLOTS_NUM];
>  };
>
> +/*
> + * The worst case number of simultaneous devices will likely be very low
> + * (usually zero or one) for the forseeable future.  If the worst case
> + * exceeds this, then it can be increased, or we can convert to idr.
> + */

This comment is on the heavy side (if at all needed). If you want to
remind people of idr, just put that in a single line. A define is a
define is a define.

> +#define KVM_MAX_DEVICES 4
> +
> +struct kvm_device {
> +       u32 type;
> +
> +       int (*set_attr)(struct kvm *kvm, struct kvm_device *dev,
> +                       struct kvm_device_attr *attr);
> +       int (*get_attr)(struct kvm *kvm, struct kvm_device *dev,
> +                       struct kvm_device_attr *attr);
> +       int (*has_attr)(struct kvm *kvm, struct kvm_device *dev,
> +                       struct kvm_device_attr *attr);
> +       void (*destroy)(struct kvm *kvm, struct kvm_device *dev);
> +};
> +
>  struct kvm {
>         spinlock_t mmu_lock;
>         struct mutex slots_lock;
> @@ -385,6 +404,8 @@ struct kvm {
>         long mmu_notifier_count;
>  #endif
>         long tlbs_dirty;
> +       struct kvm_device *devices[KVM_MAX_DEVICES];
> +       unsigned int num_devices;
>  };
>
>  #define kvm_err(fmt, ...) \
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 9a2db57..1f348e0 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -662,6 +662,7 @@ struct kvm_ppc_smmu_info {
>  #define KVM_CAP_PPC_HTAB_FD 84
>  #define KVM_CAP_S390_CSS_SUPPORT 85
>  #define KVM_CAP_PPC_EPR 86
> +#define KVM_CAP_DEVICE_CTRL 87
>
>  #ifdef KVM_CAP_IRQ_ROUTING
>
> @@ -890,6 +891,30 @@ struct kvm_s390_ucas_mapping {
>  /* Available with KVM_CAP_PPC_HTAB_FD */
>  #define KVM_PPC_GET_HTAB_FD      _IOW(KVMIO,  0xaa, struct kvm_get_htab_fd)
>
> +/* Available with KVM_CAP_DEVICE_CTRL */
> +#define KVM_CREATE_DEVICE_TEST         1
> +
> +struct kvm_create_device {
> +       __u32   type;   /* in: KVM_DEV_TYPE_xxx */
> +       __u32   id;     /* out: device handle */
> +       __u32   flags;  /* in: KVM_CREATE_DEVICE_xxx */
> +};
> +
> +struct kvm_device_attr {
> +       __u32   dev;            /* id from KVM_CREATE_DEVICE */
> +       __u32   group;          /* KVM_DEV_ATTR_COMMON or device-defined */
> +       __u64   attr;           /* group-defined */
> +       __u64   addr;           /* userspace address of attr data */
> +};
> +
> +#define KVM_DEV_ATTR_COMMON            0
> +#define   KVM_DEV_ATTR_TYPE            0 /* 32-bit */
> +
> +#define KVM_CREATE_DEVICE        _IOWR(KVMIO,  0xac, struct kvm_create_device)
> +#define KVM_SET_DEVICE_ATTR      _IOW(KVMIO,  0xad, struct kvm_device_attr)
> +#define KVM_GET_DEVICE_ATTR      _IOW(KVMIO,  0xae, struct kvm_device_attr)

_IOWR ?
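
To spell out what is being asked: the _IOW/_IOWR macros encode direction
bits that say whether the argument struct is copied into the kernel,
back out to userspace, or both.  A minimal userspace sketch, assuming a
Linux toolchain with the kernel uapi headers installed; the struct is a
hypothetical mirror of the layout quoted above, the demo_/DEMO_ names
are made up, and 0xAE is the KVMIO magic from linux/kvm.h:

```c
#include <stdint.h>
#include <linux/ioctl.h>  /* _IOW, _IOWR, _IOC_DIR, _IOC_READ, _IOC_WRITE */

/* Hypothetical mirror of the RFC's struct kvm_device_attr layout. */
struct demo_device_attr {
	uint32_t dev;
	uint32_t group;
	uint64_t attr;
	uint64_t addr;
};

#define DEMO_KVMIO 0xAE

/* The same command number encoded both ways, for comparison only. */
#define DEMO_GET_ATTR_IOW  _IOW(DEMO_KVMIO, 0xae, struct demo_device_attr)
#define DEMO_GET_ATTR_IOWR _IOWR(DEMO_KVMIO, 0xae, struct demo_device_attr)

/* Nonzero if the encoding says userspace passes the struct in. */
int demo_struct_copied_in(unsigned long cmd)
{
	return (_IOC_DIR(cmd) & _IOC_WRITE) != 0;
}

/* Nonzero if the encoding says the kernel writes the struct back. */
int demo_struct_copied_out(unsigned long cmd)
{
	return (_IOC_DIR(cmd) & _IOC_READ) != 0;
}
```

In the patch as posted, the GET path only does copy_from_user() on the
struct and the attribute data travels through "addr", so _IOW matches
what the code does today; _IOWR would describe the ioctl accurately only
if some field of the struct itself became an output.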

> +#define KVM_HAS_DEVICE_ATTR      _IOW(KVMIO,  0xaf, struct kvm_device_attr)
> +
>  /*
>   * ioctls for vcpu fds
>   */
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 2e93630..baf8481 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -580,6 +580,18 @@ void kvm_free_physmem(struct kvm *kvm)
>         kfree(kvm->memslots);
>  }
>
> +static void kvm_destroy_devices(struct kvm *kvm)
> +{
> +       int i;
> +
> +       for (i = 0; i < kvm->num_devices; i++) {
> +               kvm->devices[i]->destroy(kvm, kvm->devices[i]);
> +               kvm->devices[i] = NULL;
> +       }
> +
> +       kvm->num_devices = 0;
> +}
> +
>  static void kvm_destroy_vm(struct kvm *kvm)
>  {
>         int i;
> @@ -590,6 +602,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
>         list_del(&kvm->vm_list);
>         raw_spin_unlock(&kvm_lock);
>         kvm_free_irq_routing(kvm);
> +       kvm_destroy_devices(kvm);
>         for (i = 0; i < KVM_NR_BUSES; i++)
>                 kvm_io_bus_destroy(kvm->buses[i]);
>         kvm_coalesced_mmio_free(kvm);
> @@ -2178,6 +2191,86 @@ out:
>  }
>  #endif
>
> +static int kvm_ioctl_create_device(struct kvm *kvm,
> +                                  struct kvm_create_device *cd)
> +{
> +       struct kvm_device *dev = NULL;
> +       bool test = cd->flags & KVM_CREATE_DEVICE_TEST;
> +       int id;
> +       int r;
> +
> +       mutex_lock(&kvm->lock);
> +
> +       id = kvm->num_devices;
> +       if (id >= KVM_MAX_DEVICES && !test) {
> +               r = -ENOSPC;
> +               goto out;
> +       }
> +
> +       switch (cd->type) {
> +       default:
> +               r = -ENODEV;
> +               goto out;
> +       }

Do we really believe that there will be any arch-generic recognition
of types? Shouldn't this be a call to an arch-specific function
instead? Which makes me wonder whether the device type IDs should be
arch-specific as well...
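
One way to picture the alternative: the generic code keeps only the
slot bookkeeping and hands type recognition to an arch hook.  The
following is a self-contained userspace model, not kernel code; the
hook signature, all demo_ names, and the type number are hypothetical,
and only the ENOSPC/ENODEV/TEST behaviour is taken from the patch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_DEVICES        4
#define DEMO_CREATE_DEVICE_TEST 1u
#define DEMO_DEV_TYPE_FAKE_PIC  1u	/* made-up, arch-private type number */

struct demo_device {
	uint32_t type;
};

struct demo_vm {
	struct demo_device *devices[DEMO_MAX_DEVICES];
	unsigned int num_devices;
};

/*
 * Hypothetical arch hook: returns -ENODEV for types this arch does not
 * know.  With test set it only reports support and creates nothing.
 */
typedef int (*demo_arch_create_fn)(uint32_t type, int test,
				   struct demo_device **dev);

int demo_arch_create(uint32_t type, int test, struct demo_device **dev)
{
	static struct demo_device pic = { .type = DEMO_DEV_TYPE_FAKE_PIC };

	if (type != DEMO_DEV_TYPE_FAKE_PIC)
		return -ENODEV;
	if (!test)
		*dev = &pic;
	return 0;
}

/* Generic side: bookkeeping only; type recognition is delegated. */
int demo_create_device(struct demo_vm *vm, demo_arch_create_fn arch_create,
		       uint32_t type, uint32_t flags, uint32_t *id)
{
	int test = (flags & DEMO_CREATE_DEVICE_TEST) != 0;
	struct demo_device *dev = NULL;
	int r;

	if (vm->num_devices >= DEMO_MAX_DEVICES && !test)
		return -ENOSPC;

	r = arch_create(type, test, &dev);
	if (r || test)
		return r;

	*id = vm->num_devices;
	vm->devices[vm->num_devices++] = dev;
	return 0;
}
```

With this split the type numbers could stay global or become per-arch
without the generic function changing, which is the part of the
question the sketch does not settle.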

> +
> +       if (test) {
> +               WARN_ON_ONCE(dev);
> +               goto out;
> +       }
> +
> +       if (r == 0) {
> +               WARN_ON_ONCE(dev->type != cd->type);
> +
> +               kvm->devices[id] = dev;
> +               cd->id = id;
> +               kvm->num_devices++;
> +       }
> +
> +out:
> +       mutex_unlock(&kvm->lock);
> +       return r;
> +}
> +
> +static int kvm_ioctl_device_attr(struct kvm *kvm, int ioctl,
> +                                struct kvm_device_attr *attr)
> +{
> +       struct kvm_device *dev;
> +       int (*accessor)(struct kvm *kvm, struct kvm_device *dev,
> +                       struct kvm_device_attr *attr);
> +
> +       if (attr->dev >= KVM_MAX_DEVICES)
> +               return -ENODEV;
> +
> +       dev = kvm->devices[attr->dev];
> +       if (!dev)
> +               return -ENODEV;
> +
> +       switch (ioctl) {
> +       case KVM_SET_DEVICE_ATTR:
> +               if (attr->group == KVM_DEV_ATTR_COMMON &&
> +                   attr->attr == KVM_DEV_ATTR_TYPE)
> +                       return -EPERM;
> +
> +               accessor = dev->set_attr;
> +               break;
> +       case KVM_GET_DEVICE_ATTR:
> +               if (attr->group == KVM_DEV_ATTR_COMMON &&
> +                   attr->attr == KVM_DEV_ATTR_TYPE)
> +                       return dev->type;
> +
> +               accessor = dev->get_attr;
> +               break;
> +       case KVM_HAS_DEVICE_ATTR:
> +               accessor = dev->has_attr;
> +               break;
> +       }
> +
> +       if (!accessor)
> +               return -EPERM;
> +
> +       return accessor(kvm, dev, attr);
> +}
> +
>  static long kvm_vm_ioctl(struct file *filp,
>                            unsigned int ioctl, unsigned long arg)
>  {
> @@ -2292,6 +2385,40 @@ static long kvm_vm_ioctl(struct file *filp,
>                 break;
>         }
>  #endif
> +       case KVM_CREATE_DEVICE: {
> +               struct kvm_create_device cd;
> +
> +               r = -EFAULT;
> +               if (copy_from_user(&cd, argp, sizeof(cd)))
> +                       goto out;
> +
> +               r = kvm_ioctl_create_device(kvm, &cd);
> +               if (r)
> +                       goto out;
> +
> +               r = -EFAULT;
> +               if (copy_to_user(argp, &cd, sizeof(cd)))
> +                       goto out;
> +
> +               r = 0;
> +               break;
> +       }
> +       case KVM_SET_DEVICE_ATTR:
> +       case KVM_GET_DEVICE_ATTR:
> +       case KVM_HAS_DEVICE_ATTR: {
> +               struct kvm_device_attr attr;
> +
> +               r = -EFAULT;
> +               if (copy_from_user(&attr, argp, sizeof(attr)))
> +                       goto out;
> +
> +               r = kvm_ioctl_device_attr(kvm, ioctl, &attr);
> +               if (r)
> +                       goto out;
> +
> +               r = 0;
> +               break;
> +       }
>         default:
>                 r = kvm_arch_vm_ioctl(filp, ioctl, arg);
>                 if (r == -ENOTTY)
> --
> 1.7.9.5
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
@ 2013-02-19  0:44     ` Christoffer Dall
  0 siblings, 0 replies; 261+ messages in thread
From: Christoffer Dall @ 2013-02-19  0:44 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm

On Wed, Feb 13, 2013 at 9:49 PM, Scott Wood <scottwood@freescale.com> wrote:
> Currently, devices that are emulated inside KVM are configured in a
> hardcoded manner based on an assumption that any given architecture
> only has one way to do it.  If there's any need to access device state,
> it is done through inflexible one-purpose-only IOCTLs (e.g.
> KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> cumbersome and depletes a limited numberspace.
>
> This API provides a mechanism to instantiate a device of a certain
> type, returning an ID that can be used to set/get attributes of the
> device.  Attributes may include configuration parameters (e.g.
> register base address), device state, operational commands, etc.  It
> is similar to the ONE_REG API, except that it acts on devices rather
> than vcpus.
>
> Both device types and individual attributes can be tested without having
> to create the device or get/set the attribute, without the need for
> separately managing enumerated capabilities.
>
> Signed-off-by: Scott Wood <scottwood@freescale.com>
> ---
>  Documentation/virtual/kvm/api.txt        |   76 ++++++++++++++++++
>  Documentation/virtual/kvm/devices/README |    1 +
>  include/linux/kvm_host.h                 |   21 +++++
>  include/uapi/linux/kvm.h                 |   25 ++++++
>  virt/kvm/kvm_main.c                      |  127 ++++++++++++++++++++++++++++++
>  5 files changed, 250 insertions(+)
>  create mode 100644 Documentation/virtual/kvm/devices/README
>
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index c2534c3..5bcdb42 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2122,6 +2122,82 @@ header; first `n_valid' valid entries with contents from the data
>  written, then `n_invalid' invalid entries, invalidating any previously
>  valid entries found.
>
> +4.79 KVM_CREATE_DEVICE
> +
> +Capability: KVM_CAP_DEVICE_CTRL
> +Type: vm ioctl
> +Parameters: struct kvm_create_device (in/out)
> +Returns: 0 on success, -1 on error
> +Errors:
> +  ENODEV: The device type is unknown or unsupported
> +  EEXIST: Device already created, and this type of device may not
> +          be instantiated multiple times
> +  ENOSPC: Too many devices have been created
> +
> +  Other error conditions may be defined by individual device types.
> +
> +Creates an emulated device in the kernel.  The returned handle
> +can be used with KVM_SET/GET_DEVICE_ATTR.
> +
> +If the KVM_CREATE_DEVICE_TEST flag is set, only test whether the
> +device type is supported (not necessarily whether it can be created
> +in the current vm).
> +
> +Individual devices should not define flags.  Attributes should be used
> +for specifying any behavior that is not implied by the device type
> +number.
> +
> +struct kvm_create_device {
> +       __u32   type;   /* in: KVM_DEV_TYPE_xxx */
> +       __u32   id;     /* out: device handle */
> +       __u32   flags;  /* in: KVM_CREATE_DEVICE_xxx */
> +};
> +
> +4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR
> +
> +Capability: KVM_CAP_DEVICE_CTRL
> +Type: vm ioctl
> +Parameters: struct kvm_device_attr
> +Returns: 0 on success, -1 on error
> +Errors:
> +  ENODEV: The device id is invalid
> +  ENXIO:  The group or attribute is unknown/unsupported for this device
> +  EPERM:  The attribute cannot (currently) be accessed this way
> +          (e.g. read-only attribute, or attribute that only makes
> +          sense when the device is in a different state)
> +
> +  Other error conditions may be defined by individual device types.
> +
> +Gets/sets a specified piece of device configuration and/or state.  The
> +semantics are device-specific except for certain global attributes.  See
> +individual device documentation in the "devices" directory.  As with
> +ONE_REG, the size of the data transferred is defined by the particular
> +attribute.
> +
> +Attributes in group KVM_DEV_ATTR_COMMON are not device-specific:
> +   KVM_DEV_ATTR_TYPE (ro, 32-bit): the device type passed to KVM_CREATE_DEVICE
> +
> +struct kvm_device_attr {
> +       __u32   dev;            /* id from KVM_CREATE_DEVICE */
> +       __u32   group;          /* KVM_DEV_ATTR_COMMON or device-defined */
> +       __u64   attr;           /* group-defined */
> +       __u64   addr;           /* userspace address of attr data */
> +};
> +
> +4.81 KVM_HAS_DEVICE_ATTR
> +
> +Capability: KVM_CAP_DEVICE_CTRL
> +Type: vm ioctl
> +Parameters: struct kvm_device_attr
> +Returns: 0 on success, -1 on error
> +Errors:
> +  ENODEV: The device id is invalid
> +  ENXIO:  The group or attribute is unknown/unsupported for this device
> +
> +Tests whether a device supports a particular attribute.  A successful
> +return indicates the attribute is implemented.  It does not necessarily
> +indicate that the attribute can be read or written in the device's
> +current state.  "addr" is ignored.
>
>  5. The kvm_run structure
>  ------------------------
> diff --git a/Documentation/virtual/kvm/devices/README b/Documentation/virtual/kvm/devices/README
> new file mode 100644
> index 0000000..34a6983
> --- /dev/null
> +++ b/Documentation/virtual/kvm/devices/README
> @@ -0,0 +1 @@
> +This directory contains specific device bindings for KVM_CAP_DEVICE_CTRL.
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 0350e0d..dbaf012 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -335,6 +335,25 @@ struct kvm_memslots {
>         short id_to_index[KVM_MEM_SLOTS_NUM];
>  };
>
> +/*
> + * The worst case number of simultaneous devices will likely be very low
> + * (usually zero or one) for the forseeable future.  If the worst case
> + * exceeds this, then it can be increased, or we can convert to idr.
> + */

This comment is on the heavy side (if at all needed). If you want to
remind people of idr, just put that in a single line. A define is a
define is a define.

> +#define KVM_MAX_DEVICES 4
> +
> +struct kvm_device {
> +       u32 type;
> +
> +       int (*set_attr)(struct kvm *kvm, struct kvm_device *dev,
> +                       struct kvm_device_attr *attr);
> +       int (*get_attr)(struct kvm *kvm, struct kvm_device *dev,
> +                       struct kvm_device_attr *attr);
> +       int (*has_attr)(struct kvm *kvm, struct kvm_device *dev,
> +                       struct kvm_device_attr *attr);
> +       void (*destroy)(struct kvm *kvm, struct kvm_device *dev);
> +};
> +
>  struct kvm {
>         spinlock_t mmu_lock;
>         struct mutex slots_lock;
> @@ -385,6 +404,8 @@ struct kvm {
>         long mmu_notifier_count;
>  #endif
>         long tlbs_dirty;
> +       struct kvm_device *devices[KVM_MAX_DEVICES];
> +       unsigned int num_devices;
>  };
>
>  #define kvm_err(fmt, ...) \
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 9a2db57..1f348e0 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -662,6 +662,7 @@ struct kvm_ppc_smmu_info {
>  #define KVM_CAP_PPC_HTAB_FD 84
>  #define KVM_CAP_S390_CSS_SUPPORT 85
>  #define KVM_CAP_PPC_EPR 86
> +#define KVM_CAP_DEVICE_CTRL 87
>
>  #ifdef KVM_CAP_IRQ_ROUTING
>
> @@ -890,6 +891,30 @@ struct kvm_s390_ucas_mapping {
>  /* Available with KVM_CAP_PPC_HTAB_FD */
>  #define KVM_PPC_GET_HTAB_FD      _IOW(KVMIO,  0xaa, struct kvm_get_htab_fd)
>
> +/* Available with KVM_CAP_DEVICE_CTRL */
> +#define KVM_CREATE_DEVICE_TEST         1
> +
> +struct kvm_create_device {
> +       __u32   type;   /* in: KVM_DEV_TYPE_xxx */
> +       __u32   id;     /* out: device handle */
> +       __u32   flags;  /* in: KVM_CREATE_DEVICE_xxx */
> +};
> +
> +struct kvm_device_attr {
> +       __u32   dev;            /* id from KVM_CREATE_DEVICE */
> +       __u32   group;          /* KVM_DEV_ATTR_COMMON or device-defined */
> +       __u64   attr;           /* group-defined */
> +       __u64   addr;           /* userspace address of attr data */
> +};
> +
> +#define KVM_DEV_ATTR_COMMON            0
> +#define   KVM_DEV_ATTR_TYPE            0 /* 32-bit */
> +
> +#define KVM_CREATE_DEVICE        _IOWR(KVMIO,  0xac, struct kvm_create_device)
> +#define KVM_SET_DEVICE_ATTR      _IOW(KVMIO,  0xad, struct kvm_device_attr)
> +#define KVM_GET_DEVICE_ATTR      _IOW(KVMIO,  0xae, struct kvm_device_attr)

_IOWR ?

> +#define KVM_HAS_DEVICE_ATTR      _IOW(KVMIO,  0xaf, struct kvm_device_attr)
> +
>  /*
>   * ioctls for vcpu fds
>   */
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 2e93630..baf8481 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -580,6 +580,18 @@ void kvm_free_physmem(struct kvm *kvm)
>         kfree(kvm->memslots);
>  }
>
> +static void kvm_destroy_devices(struct kvm *kvm)
> +{
> +       int i;
> +
> +       for (i = 0; i < kvm->num_devices; i++) {
> +               kvm->devices[i]->destroy(kvm, kvm->devices[i]);
> +               kvm->devices[i] = NULL;
> +       }
> +
> +       kvm->num_devices = 0;
> +}
> +
>  static void kvm_destroy_vm(struct kvm *kvm)
>  {
>         int i;
> @@ -590,6 +602,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
>         list_del(&kvm->vm_list);
>         raw_spin_unlock(&kvm_lock);
>         kvm_free_irq_routing(kvm);
> +       kvm_destroy_devices(kvm);
>         for (i = 0; i < KVM_NR_BUSES; i++)
>                 kvm_io_bus_destroy(kvm->buses[i]);
>         kvm_coalesced_mmio_free(kvm);
> @@ -2178,6 +2191,86 @@ out:
>  }
>  #endif
>
> +static int kvm_ioctl_create_device(struct kvm *kvm,
> +                                  struct kvm_create_device *cd)
> +{
> +       struct kvm_device *dev = NULL;
> +       bool test = cd->flags & KVM_CREATE_DEVICE_TEST;
> +       int id;
> +       int r;
> +
> +       mutex_lock(&kvm->lock);
> +
> +       id = kvm->num_devices;
> +       if (id >= KVM_MAX_DEVICES && !test) {
> +               r = -ENOSPC;
> +               goto out;
> +       }
> +
> +       switch (cd->type) {
> +       default:
> +               r = -ENODEV;
> +               goto out;
> +       }

Do we really believe that there will be any arch-generic recognition
of types? Shouldn't this be a call to an arch-specific function
instead? Which makes me wonder whether the device type IDs should be
arch-specific as well...

> +
> +       if (test) {
> +               WARN_ON_ONCE(dev);
> +               goto out;
> +       }
> +
> +       if (r == 0) {
> +               WARN_ON_ONCE(dev->type != cd->type);
> +
> +               kvm->devices[id] = dev;
> +               cd->id = id;
> +               kvm->num_devices++;
> +       }
> +
> +out:
> +       mutex_unlock(&kvm->lock);
> +       return r;
> +}
> +
> +static int kvm_ioctl_device_attr(struct kvm *kvm, int ioctl,
> +                                struct kvm_device_attr *attr)
> +{
> +       struct kvm_device *dev;
> +       int (*accessor)(struct kvm *kvm, struct kvm_device *dev,
> +                       struct kvm_device_attr *attr);
> +
> +       if (attr->dev >= KVM_MAX_DEVICES)
> +               return -ENODEV;
> +
> +       dev = kvm->devices[attr->dev];
> +       if (!dev)
> +               return -ENODEV;
> +
> +       switch (ioctl) {
> +       case KVM_SET_DEVICE_ATTR:
> +               if (attr->group == KVM_DEV_ATTR_COMMON &&
> +                   attr->attr == KVM_DEV_ATTR_TYPE)
> +                       return -EPERM;
> +
> +               accessor = dev->set_attr;
> +               break;
> +       case KVM_GET_DEVICE_ATTR:
> +               if (attr->group == KVM_DEV_ATTR_COMMON &&
> +                   attr->attr == KVM_DEV_ATTR_TYPE)
> +                       return dev->type;
> +
> +               accessor = dev->get_attr;
> +               break;
> +       case KVM_HAS_DEVICE_ATTR:
> +               accessor = dev->has_attr;
> +               break;
> +       }
> +
> +       if (!accessor)
> +               return -EPERM;
> +
> +       return accessor(kvm, dev, attr);
> +}
> +
>  static long kvm_vm_ioctl(struct file *filp,
>                            unsigned int ioctl, unsigned long arg)
>  {
> @@ -2292,6 +2385,40 @@ static long kvm_vm_ioctl(struct file *filp,
>                 break;
>         }
>  #endif
> +       case KVM_CREATE_DEVICE: {
> +               struct kvm_create_device cd;
> +
> +               r = -EFAULT;
> +               if (copy_from_user(&cd, argp, sizeof(cd)))
> +                       goto out;
> +
> +               r = kvm_ioctl_create_device(kvm, &cd);
> +               if (r)
> +                       goto out;
> +
> +               r = -EFAULT;
> +               if (copy_to_user(argp, &cd, sizeof(cd)))
> +                       goto out;
> +
> +               r = 0;
> +               break;
> +       }
> +       case KVM_SET_DEVICE_ATTR:
> +       case KVM_GET_DEVICE_ATTR:
> +       case KVM_HAS_DEVICE_ATTR: {
> +               struct kvm_device_attr attr;
> +
> +               r = -EFAULT;
> +               if (copy_from_user(&attr, argp, sizeof(attr)))
> +                       goto out;
> +
> +               r = kvm_ioctl_device_attr(kvm, ioctl, &attr);
> +               if (r)
> +                       goto out;
> +
> +               r = 0;
> +               break;
> +       }
>         default:
>                 r = kvm_arch_vm_ioctl(filp, ioctl, arg);
>                 if (r == -ENOTTY)
> --
> 1.7.9.5
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-19  0:44     ` Christoffer Dall
@ 2013-02-19  0:53       ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-02-19  0:53 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: Alexander Graf, kvm-ppc, kvm

On 02/18/2013 06:44:20 PM, Christoffer Dall wrote:
> On Wed, Feb 13, 2013 at 9:49 PM, Scott Wood <scottwood@freescale.com> wrote:
> > index 0350e0d..dbaf012 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -335,6 +335,25 @@ struct kvm_memslots {
> >         short id_to_index[KVM_MEM_SLOTS_NUM];
> >  };
> >
> > +/*
> > + * The worst case number of simultaneous devices will likely be very low
> > + * (usually zero or one) for the foreseeable future.  If the worst case
> > + * exceeds this, then it can be increased, or we can convert to idr.
> > + */
> 
> This comment is on the heavy side (if at all needed). If you want to
> remind people of idr, just put that in a single line. A define is a
> define is a define.

OK.

> > +#define KVM_CREATE_DEVICE        _IOWR(KVMIO,  0xac, struct kvm_create_device)
> > +#define KVM_SET_DEVICE_ATTR      _IOW(KVMIO,  0xad, struct kvm_device_attr)
> > +#define KVM_GET_DEVICE_ATTR      _IOW(KVMIO,  0xae, struct kvm_device_attr)
> 
> _IOWR ?

struct kvm_device_attr itself is write-only, though the data pointed to  
by the addr field goes the other way for GET.  ONE_REG is in the same  
situation and also uses _IOW for both.
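
A minimal userspace-only sketch of that data flow, under loud assumptions: `struct demo_device_attr` is a simplified stand-in rather than the real uapi layout, and the "kernel side" is simulated by ordinary functions (`demo_get_attr`, `demo_set_attr`, `demo_reg` are all hypothetical names). It only illustrates that the struct itself is read, never written back, while the payload moves through the buffer that addr points to.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct kvm_device_attr; not the real uapi layout. */
struct demo_device_attr {
	uint32_t dev;	/* device ID */
	uint32_t group;	/* attribute group */
	uint64_t attr;	/* attribute within the group */
	uint64_t addr;	/* userspace pointer to the attribute data */
};

/* Simulated device state: a single 32-bit register. */
static uint32_t demo_reg = 0x1234;

/*
 * Simulated GET handler.  The struct is taken as const: nothing is
 * written back into it, which is why a write-only (_IOW) encoding is
 * enough.  The value leaves through the buffer at attr->addr instead.
 */
static int demo_get_attr(const struct demo_device_attr *attr)
{
	void *dst = (void *)(uintptr_t)attr->addr;

	if (!dst)
		return -1;
	memcpy(dst, &demo_reg, sizeof(demo_reg));
	return 0;
}

/* Simulated SET handler: the payload flows the other way through addr. */
static int demo_set_attr(const struct demo_device_attr *attr)
{
	const void *src = (const void *)(uintptr_t)attr->addr;

	if (!src)
		return -1;
	memcpy(&demo_reg, src, sizeof(demo_reg));
	return 0;
}
```

In both directions the caller only fills in the struct; the direction of the payload is decided by which handler runs, not by the ioctl encoding.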

> > +static int kvm_ioctl_create_device(struct kvm *kvm,
> > +                                  struct kvm_create_device *cd)
> > +{
> > +       struct kvm_device *dev = NULL;
> > +       bool test = cd->flags & KVM_CREATE_DEVICE_TEST;
> > +       int id;
> > +       int r;
> > +
> > +       mutex_lock(&kvm->lock);
> > +
> > +       id = kvm->num_devices;
> > +       if (id >= KVM_MAX_DEVICES && !test) {
> > +               r = -ENOSPC;
> > +               goto out;
> > +       }
> > +
> > +       switch (cd->type) {
> > +       default:
> > +               r = -ENODEV;
> > +               goto out;
> > +       }
> 
> do we really believe that there will be any arch-generic recognition
> of types? shouldn't this be a call to an arch-specific function
> instead. Which makes me wonder whether the device type IDs should be
> arch specific as well...

I prefer to look at it from the other direction -- is there any reason  
why this *should* be architecture specific?  What will that make easier?

By doing device recognition here we don't need a separate copy of this  
per arch (including some #ifdef or modifying every arch at once --  
including ARM which I can't modify yet because it's not merged), and  
*if* we should end up with an in-kernel-emulated device that gets used  
across multiple architectures, it would be annoying to have to modify  
all relevant architectures (and worse to deal with per-arch  
numberspaces).
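
As a rough sketch of the generic approach described above, assuming made-up names throughout (`demo_shared_dev_type`, `DEMO_CONFIG_PIC_A` standing in for a Kconfig symbol, `demo_create_pic_a`): one type numberspace is shared by all architectures, and a device that is not built in simply falls through to -ENODEV.

```c
#include <errno.h>

/* Hypothetical global numberspace: one enum shared by every architecture. */
enum demo_shared_dev_type {
	DEMO_DEV_TYPE_PIC_A = 1,
	DEMO_DEV_TYPE_PIC_B = 2,
};

/* Stands in for a Kconfig symbol that only some builds would define. */
#define DEMO_CONFIG_PIC_A 1

/* Placeholder for the device-specific constructor. */
static int demo_create_pic_a(void)
{
	return 0;
}

/*
 * Generic recognition: a type whose support is compiled out is
 * indistinguishable from an unknown type, so both yield -ENODEV.
 */
static int demo_generic_create_device(int type)
{
	switch (type) {
#ifdef DEMO_CONFIG_PIC_A
	case DEMO_DEV_TYPE_PIC_A:
		return demo_create_pic_a();
#endif
	default:
		return -ENODEV;
	}
}
```

The cost of this layout is the per-device #ifdef inside generic code; the benefit is that type numbers never collide across architectures.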

-Scott

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-19  0:53       ` Scott Wood
@ 2013-02-19  5:50         ` Christoffer Dall
  -1 siblings, 0 replies; 261+ messages in thread
From: Christoffer Dall @ 2013-02-19  5:50 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm

On Mon, Feb 18, 2013 at 4:53 PM, Scott Wood <scottwood@freescale.com> wrote:
> On 02/18/2013 06:44:20 PM, Christoffer Dall wrote:
>>
>> On Wed, Feb 13, 2013 at 9:49 PM, Scott Wood <scottwood@freescale.com>
>> wrote:
>> > index 0350e0d..dbaf012 100644
>> > --- a/include/linux/kvm_host.h
>> > +++ b/include/linux/kvm_host.h
>> > @@ -335,6 +335,25 @@ struct kvm_memslots {
>> >         short id_to_index[KVM_MEM_SLOTS_NUM];
>> >  };
>> >
>> > +/*
>> > + * The worst case number of simultaneous devices will likely be very low
>> > + * (usually zero or one) for the foreseeable future.  If the worst case
>> > + * exceeds this, then it can be increased, or we can convert to idr.
>> > + */
>>
>> This comment is on the heavy side (if at all needed). If you want to
>> remind people of idr, just put that in a single line. A define is a
>> define is a define.
>
>
> OK.
>
>
>> > +#define KVM_CREATE_DEVICE        _IOWR(KVMIO,  0xac, struct kvm_create_device)
>> > +#define KVM_SET_DEVICE_ATTR      _IOW(KVMIO,  0xad, struct kvm_device_attr)
>> > +#define KVM_GET_DEVICE_ATTR      _IOW(KVMIO,  0xae, struct kvm_device_attr)
>>
>> _IOWR ?
>
>
> struct kvm_device_attr itself is write-only, though the data pointed to by
> the addr field goes the other way for GET.  ONE_REG is in the same situation
> and also uses _IOW for both.
>
>

ok.

Btw, what about the size of the attr? Is it implicitly defined through the attr id?

>> > +static int kvm_ioctl_create_device(struct kvm *kvm,
>> > +                                  struct kvm_create_device *cd)
>> > +{
>> > +       struct kvm_device *dev = NULL;
>> > +       bool test = cd->flags & KVM_CREATE_DEVICE_TEST;
>> > +       int id;
>> > +       int r;
>> > +
>> > +       mutex_lock(&kvm->lock);
>> > +
>> > +       id = kvm->num_devices;
>> > +       if (id >= KVM_MAX_DEVICES && !test) {
>> > +               r = -ENOSPC;
>> > +               goto out;
>> > +       }
>> > +
>> > +       switch (cd->type) {
>> > +       default:
>> > +               r = -ENODEV;
>> > +               goto out;
>> > +       }
>>
>> do we really believe that there will be any arch-generic recognition
>> of types? shouldn't this be a call to an arch-specific function
>> instead. Which makes me wonder whether the device type IDs should be
>> arch specific as well...
>
>
> I prefer to look at it from the other direction -- is there any reason why
> this *should* be architecture specific?  What will that make easier?
>

The fact that you don't have to create static inlines for the
architectures that don't define the functions that get called, or
resort to similar #ifdef tricks. I also think it's easier to read the
arch-specific bits of the code that way, instead of some arbitrary
function that you have to trace through to figure out where it's
called from.

> By doing device recognition here we don't need a separate copy of this per
> arch (including some #ifdef or modifying every arch at once -- including ARM
> which I can't modify yet because it's not merged), and *if* we should end up
> with an in-kernel-emulated device that gets used across multiple
> architectures, it would be annoying to have to modify all relevant
> architectures (and worse to deal with per-arch numberspaces).

I would say that's exactly what you're going to need with your approach:

switch (cd->type) {
case KVM_ARM_VGIC_V2_0:
    kvm_arm_vgic_v2_0_create(...);
}


Are you going to #ifdef here in this function, or what? I think it's
cleaner to have a single arch-specific hook and handle the cases there.
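
The single-hook alternative argued for here could be sketched roughly as follows. Every name is hypothetical (`demo_arch_create_device`, `demo_create_device`, `DEMO_DEV_TYPE_ARCH_IRQCHIP`), and the per-architecture hook is modelled as one ordinary function rather than a definition supplied by each arch.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical device type ID, for illustration only. */
enum demo_arch_dev_type {
	DEMO_DEV_TYPE_ARCH_IRQCHIP = 1,
};

struct demo_device {
	int type;
};

static struct demo_device demo_irqchip;

/*
 * Stand-in for a per-architecture hook.  In a real tree each
 * architecture would supply its own definition; this one recognises
 * a single type and rejects everything else.
 */
static int demo_arch_create_device(int type, struct demo_device **dev)
{
	switch (type) {
	case DEMO_DEV_TYPE_ARCH_IRQCHIP:
		demo_irqchip.type = type;
		*dev = &demo_irqchip;
		return 0;
	default:
		return -ENODEV;
	}
}

/*
 * Generic entry point: no per-device #ifdef here, only the single
 * call into the architecture hook.
 */
static int demo_create_device(int type, struct demo_device **dev)
{
	*dev = NULL;
	return demo_arch_create_device(type, dev);
}
```

The trade-off is that every architecture must provide the hook, even if only as a stub that returns -ENODEV.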

The use case of a single device that is so central to the system that
we emulate it inside the kernel, and that is also shared across
multiple archs, seems pretty far-fetched to me.

However, this is internal and can always be changed, so if everyone
agrees on the overall API, whichever way you implement it is fine with
me.

>
> -Scott
>

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-18 23:01       ` Scott Wood
@ 2013-02-19 12:24         ` Gleb Natapov
  -1 siblings, 0 replies; 261+ messages in thread
From: Gleb Natapov @ 2013-02-19 12:24 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm, Christoffer Dall

On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
> On 02/18/2013 06:21:59 AM, Gleb Natapov wrote:
> >Copying Christoffer since ARM has in kernel irq chip too.
> >
> >On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote:
> >> Currently, devices that are emulated inside KVM are configured in a
> >> hardcoded manner based on an assumption that any given architecture
> >> only has one way to do it.  If there's any need to access device
> >state,
> >> it is done through inflexible one-purpose-only IOCTLs (e.g.
> >> KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> >> cumbersome and depletes a limited numberspace.
> >>
> >> This API provides a mechanism to instantiate a device of a certain
> >> type, returning an ID that can be used to set/get attributes of the
> >> device.  Attributes may include configuration parameters (e.g.
> >> register base address), device state, operational commands, etc.  It
> >> is similar to the ONE_REG API, except that it acts on devices rather
> >> than vcpus.
> >You are not only provide different way to create in kernel irq
> >chip you
> >also use an alternate way to trigger interrupt lines. Before going
> >into
> >interface specifics lets think about whether it is really worth it?
> 
> Which "it" do you mean here?
> 
"It" is adding a new interface if it will have only one user while the
existing one can be adjusted for your needs. If the ARM people are on
board I feel much better about it. The question is how on board they
are :) Are they willing to make vGIC use it before the upstream merge?
vGIC is merged separately from the KVM code, so it will not affect the
merging of KVM itself.

> The ability to set/get attributes is needed.  Sorry, but "get or set
> one blob of data, up to 512 bytes, for the entire irqchip" is just
> not good enough -- assuming you don't want us to start sticking
> pointers and commands in *that* data. :-)
> 
The proposed interface sticks pointers into ioctl data, so why does
doing the same for KVM_SET_IRQCHIP/KVM_GET_IRQCHIP make you smile? For
signaling irqs (I think this is what you mean by "commands") we have
KVM_IRQ_LINE.

> If you mean the way to inject interrupts, it's simpler this way.
> Why go out of our way to inject common glue code into a
> communication path between hw/kvm/mpic.c in QEMU and
> arch/powerpc/kvm/mpic.c in KVM?  Or rather, why make that common
> glue be specific to this one function when we could reuse the same
> communication glue used for other things, such as device state?
You will need glue anyway, and I do not see how the amount of it is much
different one way or the other. Gluing qemu_set_irq() to
ioctl(KVM_IRQ_LINE) or to ioctl(KVM_SET_DEVICE_ATTR) is not much different.

Of course, since the interface you propose is not irq chip specific, we
need a non irq chip specific way to talk to it. But how do you propose
to model things like KVM_IRQ_LINE_STATUS with KVM_SET_DEVICE_ATTR?
KVM_SET_DEVICE_ATTR would need to return data, and getting data back
from a "set" ioctl is strange. Other devices may get other commands that
need a response, so if we design a generic interface we should take that
into account. I think using KVM_SET_DEVICE_ATTR to inject interrupts is
a misnomer: you do not set an internal device attribute, you toggle an
external input. Maybe another ioctl, KVM_SEND_DEVICE_COMMAND, is needed.
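
To make the distinction concrete, here is a small sketch of what a command-style call with an explicit response field might look like. Nothing in it exists in KVM: `struct demo_device_cmd`, `DEMO_CMD_SET_LINE` and `demo_send_device_command` are invented for illustration, and the handler is an ordinary function simulating the device.

```c
#include <stdint.h>

/* Hypothetical command block; none of these names exist in KVM. */
struct demo_device_cmd {
	uint32_t cmd;		/* in: which command to run */
	uint32_t arg;		/* in: command argument, here the line level */
	int32_t status;		/* out: result reported by the device */
};

#define DEMO_CMD_SET_LINE	1

/* Simulated input line state of the device. */
static uint32_t demo_line_level;

/*
 * Simulated handler.  Unlike a "set attribute" call, the command
 * explicitly carries an out field, so returning data is part of the
 * contract: status is 1 if the level changed, 0 if the request was
 * redundant (a rough analogue of a coalesced interrupt).
 */
static int demo_send_device_command(struct demo_device_cmd *c)
{
	switch (c->cmd) {
	case DEMO_CMD_SET_LINE:
		c->status = (demo_line_level != c->arg);
		demo_line_level = c->arg;
		return 0;
	default:
		return -1;
	}
}
```

Because the struct is both read and written, an interface shaped like this would naturally be encoded as a read-write ioctl rather than a write-only one.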

> 
> And that's just for regular interrupts.  MSIs are vastly simpler on
> MPIC than what x86 does.
> 
> >x86 obviously support old way and will have to for some, very
> >long, time.
> 
> Sure.
> 
> >ARM vGIC code, that is ready to go upstream, uses old way too. So
> >it will
> >be 2 archs against one.
> 
> I wasn't aware that that's how it worked. :-P
> 
What worked? That vGIC uses the existing interface, or that a non-generic
interface used by many arches wins over a generic one used by only one arch?

> I was trying to be considerate by not making the entire thing
> gratuitously PPC or MPIC specific, as some others seem inclined to
> do (e.g. see irqfd and APIC).  We already had a discussion on ARM's
> "set address" ioctl and rather than extend *that* interface, they
> preferred to just stick something ARM-specific in ASAP with the
> understanding that it would be replaced (or more accurately, kept
> around as a thin wrapper around the new stuff) later.
> 
I am not against generic interfaces in general, or the proposed one in
particular (I have comments about it, but those are for other emails).
I am trying to make sure it will be used by more than one user before
committing to it. APIs are easy to add and impossible to remove.

> >Christoffer do you think the proposed way it
> >better for your needs. Are you willing to make vGIC use it?
> >
> >Scott, what other devices are you planning to support with this
> >interface?
> 
> At the moment I do not have plans for other devices, though what
> does it hurt for the capability to be there?
> 
> -Scott

--
			Gleb.

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-19 12:24         ` Gleb Natapov
@ 2013-02-19 15:51           ` Christoffer Dall
  -1 siblings, 0 replies; 261+ messages in thread
From: Christoffer Dall @ 2013-02-19 15:51 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Scott Wood, Alexander Graf, kvm-ppc, kvm

On Tue, Feb 19, 2013 at 4:24 AM, Gleb Natapov <gleb@redhat.com> wrote:
> On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
>> On 02/18/2013 06:21:59 AM, Gleb Natapov wrote:
>> >Copying Christoffer since ARM has in kernel irq chip too.
>> >
>> >On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote:
>> >> Currently, devices that are emulated inside KVM are configured in a
>> >> hardcoded manner based on an assumption that any given architecture
>> >> only has one way to do it.  If there's any need to access device state,
>> >> it is done through inflexible one-purpose-only IOCTLs (e.g.
>> >> KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
>> >> cumbersome and depletes a limited numberspace.
>> >>
>> >> This API provides a mechanism to instantiate a device of a certain
>> >> type, returning an ID that can be used to set/get attributes of the
>> >> device.  Attributes may include configuration parameters (e.g.
>> >> register base address), device state, operational commands, etc.  It
>> >> is similar to the ONE_REG API, except that it acts on devices rather
>> >> than vcpus.
>> >You are not only provide different way to create in kernel irq chip you
>> >also use an alternate way to trigger interrupt lines. Before going into
>> >interface specifics lets think about whether it is really worth it?
>>
>> Which "it" do you mean here?
>>
> "It" is adding of a new interface if it will have only one user while
> existing one can be adjusted for your needs. If ARM people are on board
> I feel much better about it. The question is how on board they are :)
> are they willing to make vGIC use it before upstream merge? vGIC is
> merged separately from KVM code, so it will not affect merging of KVM
> itself.

Everything is queued for 3.9, the vgic+timers code is in the arm-soc
tree, so this is not going to change right now, but I can see it
happening for 3.10 if this proposed interface is accepted.

>
>> The ability to set/get attributes is needed.  Sorry, but "get or set
>> one blob of data, up to 512 bytes, for the entire irqchip" is just
>> not good enough -- assuming you don't want us to start sticking
>> pointers and commands in *that* data. :-)
>>
> Proposed interface sticks pointers into ioctl data, so why doing the same
> for KVM_SET_IRQCHIP/KVM_GET_IRQCHIP makes you smile. For signaling irqs (I
> think this is what you mean by "commands") we have KVM_IRQ_LINE.
>
>> If you mean the way to inject interrupts, it's simpler this way.
>> Why go out of our way to inject common glue code into a
>> communication path between hw/kvm/mpic.c in QEMU and
>> arch/powerpc/kvm/mpic.c in KVM?  Or rather, why make that common
>> glue be specific to this one function when we could reuse the same
>> communication glue used for other things, such as device state?
> You will need glue anyway and I do no see how amount of it is much
> different one way or the other. Gluing qemu_set_irq() to
> ioctl(KVM_IRQ_LINE) or ioctl(KVM_SET_DEVICE_ATTR) is not much different.
>
> Of course, since the interface you propose is not irq chip specific we
> need non irq chip specific way to talk to it. But how do you propose
> to model things like KVM_IRQ_LINE_STATUS with KVM_SET_DEVICE_ATTR?
> KVM_SET_DEVICE_ATTR needs to return data back and getting data back from
> "set" ioctl is strange. Other devices may get other commands that need
> response, so if we design generic interface we should take it into
> account. I think using KVM_SET_DEVICE_ATTR to inject interrupts is a
> misnomer, you do not set internal device attribute, you toggle external
> input. My be another ioctl KVM_SEND_DEVICE_COMMAND is needed.
>

I agree that using KVM_SET_DEVICE_ATTR for injecting interrupts is a bit
funky. For the ARM uses, we would use this for setting the addresses used
to expose the distributor and CPU interfaces to guests, not to inject
interrupts, at least as a first approximation.

>>
>> And that's just for regular interrupts.  MSIs are vastly simpler on
>> MPIC than what x86 does.
>>
>> >x86 obviously support old way and will have to for some, very long, time.
>>
>> Sure.
>>
>> >ARM vGIC code, that is ready to go upstream, uses old way too. So it will
>> >be 2 archs against one.
>>
>> I wasn't aware that that's how it worked. :-P
>>
> What worked? That vGIC uses existing interface or that non generic
> interface used by many arches wins generic one used by only one arch?
>
>> I was trying to be considerate by not making the entire thing
>> gratuitously PPC or MPIC specific, as some others seem inclined to
>> do (e.g. see irqfd and APIC).  We already had a discussion on ARM's
>> "set address" ioctl and rather than extend *that* interface, they
>> preferred to just stick something ARM-specific in ASAP with the
>> understanding that it would be replaced (or more accurately, kept
>> around as a thin wrapper around the new stuff) later.
>>
> I am not against generic interfaces in general and proposed one in
> particular (I have comments about it but this is for other emails),
> I am trying to make sure it will be used by more than one user before
> committing to it. APIs are easy to add and impossible to remove.
>
>> >Christoffer do you think the proposed way it
>> >better for your needs. Are you willing to make vGIC use it?
>> >
>> >Scott, what other devices are you planning to support with this
>> >interface?
>>
>> At the moment I do not have plans for other devices, though what
>> does it hurt for the capability to be there?
>>
>> -Scott
>
> --
>                         Gleb.

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-19  5:50         ` Christoffer Dall
@ 2013-02-19 20:16           ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-02-19 20:16 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: Alexander Graf, kvm-ppc, kvm

On 02/18/2013 11:50:44 PM, Christoffer Dall wrote:
> On Mon, Feb 18, 2013 at 4:53 PM, Scott Wood <scottwood@freescale.com> wrote:
> > On 02/18/2013 06:44:20 PM, Christoffer Dall wrote:
> >>
> >> On Wed, Feb 13, 2013 at 9:49 PM, Scott Wood <scottwood@freescale.com>
> >> > +#define KVM_CREATE_DEVICE        _IOWR(KVMIO,  0xac, struct kvm_create_device)
> >> > +#define KVM_SET_DEVICE_ATTR      _IOW(KVMIO,  0xad, struct kvm_device_attr)
> >> > +#define KVM_GET_DEVICE_ATTR      _IOW(KVMIO,  0xae, struct kvm_device_attr)
> >>
> >> _IOWR ?
> >
> >
> > struct kvm_device_attr itself is write-only, though the data pointed to by
> > the addr field goes the other way for GET.  ONE_REG is in the same situation
> > and also uses _IOW for both.
> >
> >
> 
> ok.
> 
> Btw., what about the size of the attr? implicitly defined through the
> attr id?

Yes, same as in ONE_REG.
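
To make the _IOW point concrete, here is a minimal userspace sketch.  It
is not kernel code and not the implementation from the patch.  It assumes
a control structure with dev, attr and addr fields (the exact layout is an
assumption; only the addr field is named above), a made-up attribute table
in which each attribute ID implies its payload size, and memcpy() standing
in for copy_from_user()/copy_to_user().  The handlers only ever read the
control structure; the payload travels through the pointer in addr, in
either direction.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout; only the "addr" field is named in the discussion. */
struct dev_attr_ctl {
	uint32_t dev;   /* device ID returned at creation time */
	uint32_t attr;  /* attribute ID; implies the payload size */
	uint64_t addr;  /* userspace pointer to the payload */
};

/* Hypothetical attribute IDs for a toy device. */
enum { ATTR_BASE_ADDR = 1, ATTR_IRQ_MASK = 2 };

struct toy_device {
	uint64_t base_addr;
	uint32_t irq_mask;
};

/* Look up the storage and size that an attribute ID implies. */
static void *attr_slot(struct toy_device *d, uint32_t attr, size_t *size)
{
	switch (attr) {
	case ATTR_BASE_ADDR:
		*size = sizeof(d->base_addr);
		return &d->base_addr;
	case ATTR_IRQ_MASK:
		*size = sizeof(d->irq_mask);
		return &d->irq_mask;
	default:
		return NULL;
	}
}

/* "SET": the control struct is read, the payload is copied in via addr. */
static int toy_set_attr(struct toy_device *d, const struct dev_attr_ctl *ctl)
{
	size_t size;
	void *slot = attr_slot(d, ctl->attr, &size);

	if (!slot)
		return -ENXIO;
	memcpy(slot, (const void *)(uintptr_t)ctl->addr, size);
	return 0;
}

/* "GET": the control struct is still only read; the payload goes out via addr. */
static int toy_get_attr(struct toy_device *d, const struct dev_attr_ctl *ctl)
{
	size_t size;
	void *slot = attr_slot(d, ctl->attr, &size);

	if (!slot)
		return -ENXIO;
	memcpy((void *)(uintptr_t)ctl->addr, slot, size);
	return 0;
}
```

In both directions the caller fills in the control structure and the
handler never writes to it, which is why a write-only ioctl encoding is
enough for GET as well as SET.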

> and I also think it's easier to read the
> arch-specific bits of the code that way, instead of some arbitrary
> function that you have to trace through to figure out where it's
> called from.

I don't follow.

> > By doing device recognition here we don't need a separate copy of this per
> > arch (including some #ifdef or modifying every arch at once -- including ARM
> > which I can't modify yet because it's not merged), and *if* we should end up
> > with an in-kernel-emulated device that gets used across multiple
> > architectures, it would be annoying to have to modify all relevant
> > architectures (and worse to deal with per-arch numberspaces).
> 
> I would say that's exactly what you're going to need with your approach:
> 
> switch (cd->type) {
> case KVM_ARM_VGIC_V2_0:
>     kvm_arm_vgic_v2_0_create(...);
> }
> 
> 
> are you going to ifdef here in this function, or? I think it's cleaner
> to have the single arch-specific hooks and handle the cases there.

There's an ifdef, as you can see from the patch that adds MPIC  
support.  But it's the same ifdef that gets used to determine whether  
the device code gets built in.  Nothing special needs to be added; no  
per-architecture hoop to jump through.

Note that we would still need per-device ifdefs in the arch code,  
because not all PPC KVM builds are going to have MPIC support.

What if, instead of a switch statement and ifdefs, it operated on a  
registration basis?
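
A rough sketch of what a registration basis might look like, written as
plain userspace C so it stands alone.  Every name here
(register_device_type, create_device, toy_mpic_create, the table size) is
hypothetical and not taken from the patch; a kernel version would also
need locking around the table.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DEV_TYPES 8

struct toy_dev;

/* Per-type creation hook; all names here are hypothetical. */
typedef int (*dev_create_fn)(uint32_t type, struct toy_dev **out);

struct dev_type_entry {
	uint32_t type;
	dev_create_fn create;
};

static struct dev_type_entry dev_types[MAX_DEV_TYPES];
static size_t num_dev_types;

/* Called by each device implementation from its own init code. */
static int register_device_type(uint32_t type, dev_create_fn create)
{
	size_t i;

	for (i = 0; i < num_dev_types; i++)
		if (dev_types[i].type == type)
			return -EEXIST;
	if (num_dev_types == MAX_DEV_TYPES)
		return -ENOSPC;
	dev_types[num_dev_types].type = type;
	dev_types[num_dev_types].create = create;
	num_dev_types++;
	return 0;
}

/* Generic creation path: no switch statement, no per-device #ifdef. */
static int create_device(uint32_t type, int test, struct toy_dev **out)
{
	size_t i;

	for (i = 0; i < num_dev_types; i++) {
		if (dev_types[i].type != type)
			continue;
		if (test)
			return 0;	/* type is supported; create nothing */
		return dev_types[i].create(type, out);
	}
	return -ENODEV;
}

/* A toy device type that would live in its own, conditionally built file. */
struct toy_dev {
	uint32_t type;
};

static struct toy_dev the_mpic;

static int toy_mpic_create(uint32_t type, struct toy_dev **out)
{
	the_mpic.type = type;
	*out = &the_mpic;
	return 0;
}
```

With this shape, whether a device type is available depends only on
whether its file was built and called register_device_type(); the generic
path returns -ENODEV for anything that never registered.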

> The use case of having a single device which is so central to the
> system that we emulate it inside the kernel and is shared across
> multiple archs is pretty far fetched to me.

I don't think it's that far fetched.  IOAPIC is shared between x86 and
ia64 -- even if IOAPIC has no need for anything beyond existing API, it
shows that it's not a crazy possibility.  Freescale already has some  
devices that are shared between PPC and ARM, and that set will expand  
(probably not to irq controllers, though the probability is non-zero)  
with Layerscape, about which Freescale says, "The unique,
core-agnostic architecture incorporates the optimum core for the given  
application—ARM cores or Power Architecture cores."  We already need to  
go back and non-ppc-ize various drivers, including their reliance on  
I/O accessors that are defined in architecture-specific ways...  Making  
things gratuitously architecture specific is just a bad idea, even if
the "use case" for it actually being used on multiple architectures is  
remote.

Normal kernel drivers tend to go in drivers/, not arch/, even if  
they're only used on one architecture...

> However, this is internal and can always be changed, so if everyone
> agrees on the overall API, whichever way you implement it is fine with
> me.

We at least need the numberspace to not be architecture-specific if we  
want to retain the possibility of changing later -- not to mention what  
happens if architectures merge.  I see that "arm" and "arm64" are  
separate, despite the fact that other architectures that used to be  
split this way have since merged.  Maybe "arm64" is too different from  
"arm" for that to happen, but who knows...

...and if they don't merge, wouldn't that be a likely case for devices  
shared across architectures?  Does arm64 use gic/vgic?  This post  
suggests that there is at least something in common (the bit about
"once the GIC code is shared between arm and arm64"):
http://lists.infradead.org/pipermail/linux-arm-kernel/2012-December/135836.html

-Scott

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-19 12:24         ` Gleb Natapov
@ 2013-02-19 21:16           ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-02-19 21:16 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Alexander Graf, kvm-ppc, kvm, Christoffer Dall

On 02/19/2013 06:24:18 AM, Gleb Natapov wrote:
> On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
> > The ability to set/get attributes is needed.  Sorry, but "get or set
> > one blob of data, up to 512 bytes, for the entire irqchip" is just
> > not good enough -- assuming you don't want us to start sticking
> > pointers and commands in *that* data. :-)
> >
> Proposed interface sticks pointers into ioctl data, so why doing the same
> for KVM_SET_IRQCHIP/KVM_GET_IRQCHIP makes you smile.

There's a difference between putting a pointer in an ioctl control  
structure that is specifically documented as being that way (as in  
ONE_REG), versus taking an ioctl that claims to be setting/getting a  
blob of state and embedding pointers in it.  It would be like sticking  
a pointer in the attribute payload of this API, which I think is  
something to be discouraged.  It'd also be using KVM_SET_IRQCHIP to  
read data, which is the sort of thing you object to later on regarding  
KVM_IRQ_LINE_STATUS.

Then there's the silliness of transporting 512 bytes just to read a  
descriptor for transporting something else.

> For signaling irqs (I think this is what you mean by "commands") we
> have KVM_IRQ_LINE.

It's one type of command.  Another is setting the address.  Another is  
writing to registers that have side effects (this is how MSI injection  
is done on MPIC, just as in real hardware).

What is the benefit of KVM_IRQ_LINE over what MPIC does?  What real  
(non-glue/wrapper) code can become common?

And I really hope you don't want us to do MSIs the x86 way.

In the XICS thread, Paul brought up the possibility of cascaded MPICs.
It's not relevant to the systems we're trying to model, but if one did  
want to use the in-kernel irqchip interface for that, it would be  
really nice to be able to operate on a specific MPIC for injection  
rather than have to come up with some sort of global identifier (above  
and beyond the minor flattening we'd need to do to represent a single  
MPIC's interrupts in a flat numberspace).

> > If you mean the way to inject interrupts, it's simpler this way.
> > Why go out of our way to inject common glue code into a
> > communication path between hw/kvm/mpic.c in QEMU and
> > arch/powerpc/kvm/mpic.c in KVM?  Or rather, why make that common
> > glue be specific to this one function when we could reuse the same
> > communication glue used for other things, such as device state?
> You will need glue anyway and I do no see how amount of it is much
> different one way or the other.

It uses glue that we need to be present for other things anyway.  If it  
weren't for XICS we wouldn't need a KVM_IRQ_LINE implementation at all  
on PPC.  It may not be a major difference, but it doesn't affect  
anything but MPIC and it seems more straightforward this way.

> Gluing qemu_set_irq() to
> ioctl(KVM_IRQ_LINE) or ioctl(KVM_SET_DEVICE_ATTR) is not much different.

qemu_set_irq() is not glued to either of those.  It's glued to  
kvm_openpic_set_irq(), kvm_ioapic_set_irq(), etc.  It's already not  
generic code.

> Of course, since the interface you propose is not irq chip specific

This part of it is.

> we need non irq chip specific way to talk to it. But how do you propose
> to model things like KVM_IRQ_LINE_STATUS with KVM_SET_DEVICE_ATTR?

That one's not even in api.txt, so could you explain what exactly it's  
supposed to return, and why it's needed?

AFAICT, the only thing it gets used for in QEMU is coalescing  
mc146818rtc interrupts.

Could an error return be used for cases where the IRQ was not  
delivered, in the very unlikely event that we want to implement  
something similar on MPIC?  Note again that MPIC's decision to use or  
not use KVM_IRQ_LINE is only about what MPIC does; it is not inherent  
in the device control API.
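
As a minimal sketch of that idea (hypothetical names, userspace C rather
than kernel code): a "raise" operation could return a distinct error when
the line was already pending, which is roughly the information a
coalescing-aware caller wants.  The toy_pic structure and the choice of
-EBUSY are assumptions for illustration only; nothing in this thread
settles on either.

```c
#include <errno.h>
#include <stdint.h>

#define TOY_NUM_IRQS 32

/* Hypothetical per-device state: one pending bit per input line. */
struct toy_pic {
	uint32_t pending;
};

/*
 * Assert an input line.  Returns 0 if the interrupt was newly latched,
 * -EBUSY if it was already pending (i.e. coalesced), -EINVAL on a bad
 * line number.  The choice of -EBUSY is an assumption for illustration.
 */
static int toy_pic_raise(struct toy_pic *pic, uint32_t irq)
{
	uint32_t bit;

	if (irq >= TOY_NUM_IRQS)
		return -EINVAL;
	bit = UINT32_C(1) << irq;
	if (pic->pending & bit)
		return -EBUSY;
	pic->pending |= bit;
	return 0;
}

/* Guest acknowledges the interrupt, clearing the pending bit. */
static void toy_pic_ack(struct toy_pic *pic, uint32_t irq)
{
	if (irq < TOY_NUM_IRQS)
		pic->pending &= ~(UINT32_C(1) << irq);
}
```

A caller that cares about coalescing would check for the specific error
and retry later; a caller that does not care can treat it as success.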

> KVM_SET_DEVICE_ATTR needs to return data back and getting data back from
> "set" ioctl is strange.

If we really need a single atomic operation to both read and write,  
beyond returning error values, then yes, that would be a new ioctl.  It  
could be added in the future if needed.

> Other devices may get other commands that need
> response, so if we design generic interface we should take it into
> account. I think using KVM_SET_DEVICE_ATTR to inject interrupts is a
> misnomer, you do not set internal device attribute, you toggle external
> input. My be another ioctl KVM_SEND_DEVICE_COMMAND is needed.

I see no need for a separate ioctl in terms of the underlying  
infrastructure for distinguishing "attribute" from "write-only  
command".  I'm open to improvements on what the ioctl is called.  It's  
basically like setting a register on a device, except I was concerned  
that if we actually called it a "register" that people would take it  
too literally and think it's only for the architected register state of  
the emulated device.

> > >ARM vGIC code, that is ready to go upstream, uses old way too. So
> > >it will
> > >be 2 archs against one.
> >
> > I wasn't aware that that's how it worked. :-P
> >
> What worked? That vGIC uses existing interface or that non generic
> interface used by many arches wins generic one used by only one arch?

The latter.  Two wrongs don't make a right, and adding another  
inextensible, device-specific API is not the answer to the existing  
APIs being too inextensible and device/arch-specific.  Some portion  
will always need to be device-specific because we're controlling the  
creation and of a specific device, but the glue does not need to be.

> APIs are easy to add and impossible to remove.

That's why I want to get it right this time.

-Scott

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
@ 2013-02-19 21:16           ` Scott Wood
  0 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-02-19 21:16 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Alexander Graf, kvm-ppc, kvm, Christoffer Dall

On 02/19/2013 06:24:18 AM, Gleb Natapov wrote:
> On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
> > The ability to set/get attributes is needed.  Sorry, but "get or set
> > one blob of data, up to 512 bytes, for the entire irqchip" is just
> > not good enough -- assuming you don't want us to start sticking
> > pointers and commands in *that* data. :-)
> >
> Proposed interface sticks pointers into ioctl data, so why doing the same
> for KVM_SET_IRQCHIP/KVM_GET_IRQCHIP makes you smile.

There's a difference between putting a pointer in an ioctl control  
structure that is specifically documented as being that way (as in  
ONE_REG), versus taking an ioctl that claims to be setting/getting a  
blob of state and embedding pointers in it.  It would be like sticking  
a pointer in the attribute payload of this API, which I think is  
something to be discouraged.  It'd also be using KVM_SET_IRQCHIP to  
read data, which is the sort of thing you object to later on regarding  
KVM_IRQ_LINE_STATUS.

Then there's the silliness of transporting 512 bytes just to read a  
descriptor for transporting something else.

> For signaling irqs (I think this is what you mean by "commands") we
> have KVM_IRQ_LINE.

It's one type of command.  Another is setting the address.  Another is  
writing to registers that have side effects (this is how MSI injection  
is done on MPIC, just as in real hardware).

What is the benefit of KVM_IRQ_LINE over what MPIC does?  What real  
(non-glue/wrapper) code can become common?

And I really hope you don't want us to do MSIs the x86 way.

In the XICS thread, Paul brought up the possibility of cascaded MPICs.
It's not relevant to the systems we're trying to model, but if one did  
want to use the in-kernel irqchip interface for that, it would be  
really nice to be able to operate on a specific MPIC for injection  
rather than have to come up with some sort of global identifier (above  
and beyond the minor flattening we'd need to do to represent a single  
MPIC's interrupts in a flat numberspace).

> > If you mean the way to inject interrupts, it's simpler this way.
> > Why go out of our way to inject common glue code into a
> > communication path between hw/kvm/mpic.c in QEMU and
> > arch/powerpc/kvm/mpic.c in KVM?  Or rather, why make that common
> > glue be specific to this one function when we could reuse the same
> > communication glue used for other things, such as device state?
> You will need glue anyway and I do not see how the amount of it is much
> different one way or the other.

It uses glue that we need to be present for other things anyway.  If it  
weren't for XICS we wouldn't need a KVM_IRQ_LINE implementation at all  
on PPC.  It may not be a major difference, but it doesn't affect  
anything but MPIC and it seems more straightforward this way.

> Gluing qemu_set_irq() to
> ioctl(KVM_IRQ_LINE) or ioctl(KVM_SET_DEVICE_ATTR) is not much  
> different.

qemu_set_irq() is not glued to either of those.  It's glued to  
kvm_openpic_set_irq(), kvm_ioapic_set_irq(), etc.  It's already not  
generic code.

> Of course, since the interface you propose is not irq chip specific

This part of it is.

> we need non irq chip specific way to talk to it. But how do you  
> propose
> to model things like KVM_IRQ_LINE_STATUS with KVM_SET_DEVICE_ATTR?

That one's not even in api.txt, so could you explain what exactly it's  
supposed to return, and why it's needed?

AFAICT, the only thing it gets used for in QEMU is coalescing  
mc146818rtc interrupts.

Could an error return be used for cases where the IRQ was not  
delivered, in the very unlikely event that we want to implement  
something similar on MPIC?  Note again that MPIC's decision to use or  
not use KVM_IRQ_LINE is only about what MPIC does; it is not inherent  
in the device control API.

> KVM_SET_DEVICE_ATTR needs to return data back and getting data back  
> from
> "set" ioctl is strange.

If we really need a single atomic operation to both read and write,  
beyond returning error values, then yes, that would be a new ioctl.  It  
could be added in the future if needed.

> Other devices may get other commands that need
> response, so if we design generic interface we should take it into
> account. I think using KVM_SET_DEVICE_ATTR to inject interrupts is a
> misnomer, you do not set internal device attribute, you toggle  
> external
> input. Maybe another ioctl KVM_SEND_DEVICE_COMMAND is needed.

I see no need for a separate ioctl in terms of the underlying  
infrastructure for distinguishing "attribute" from "write-only  
command".  I'm open to improvements on what the ioctl is called.  It's  
basically like setting a register on a device, except I was concerned  
that if we actually called it a "register" that people would take it  
too literally and think it's only for the architected register state of  
the emulated device.

> > >ARM vGIC code, that is ready to go upstream, uses old way too. So
> > >it will
> > >be 2 archs against one.
> >
> > I wasn't aware that that's how it worked. :-P
> >
> What worked? That vGIC uses existing interface or that non generic
> interface used by many arches wins generic one used by only one arch?

The latter.  Two wrongs don't make a right, and adding another  
inextensible, device-specific API is not the answer to the existing  
APIs being too inextensible and device/arch-specific.  Some portion  
will always need to be device-specific because we're controlling the  
creation and of a specific device, but the glue does not need to be.

> APIs are easy to add and impossible to remove.

That's why I want to get it right this time.

-Scott

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-19 20:16           ` Scott Wood
@ 2013-02-20  2:16             ` Christoffer Dall
  -1 siblings, 0 replies; 261+ messages in thread
From: Christoffer Dall @ 2013-02-20  2:16 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm

On Tue, Feb 19, 2013 at 12:16 PM, Scott Wood <scottwood@freescale.com> wrote:
> On 02/18/2013 11:50:44 PM, Christoffer Dall wrote:
>>
>> On Mon, Feb 18, 2013 at 4:53 PM, Scott Wood <scottwood@freescale.com>
>> wrote:
>> > On 02/18/2013 06:44:20 PM, Christoffer Dall wrote:
>> >>
>> >> On Wed, Feb 13, 2013 at 9:49 PM, Scott Wood <scottwood@freescale.com>
>> >> > +#define KVM_CREATE_DEVICE        _IOWR(KVMIO,  0xac, struct
>> >> > kvm_create_device)
>> >> > +#define KVM_SET_DEVICE_ATTR      _IOW(KVMIO,  0xad, struct
>> >> > kvm_device_attr)
>> >> > +#define KVM_GET_DEVICE_ATTR      _IOW(KVMIO,  0xae, struct
>> >> > kvm_device_attr)
>> >>
>> >> _IOWR ?
>> >
>> >
>> > struct kvm_device_attr itself is write-only, though the data pointed to
>> > by
>> > the addr field goes the other way for GET.  ONE_REG is in the same
>> > situation
>> > and also uses _IOW for both.
>> >
>> >
>>
>> ok.
>>
>> Btw., what about the size of the attr? implicitly defined through the attr
>> id?
>
>
> Yes, same as in ONE_REG.
>
>
>> and I also think it's easier to read the
>> arch-specific bits of the code that way, instead of some arbitrary
>> function that you have to trace through to figure out where it's
>> called from.
>
>
> I don't follow.
>
>

I just mean that if you look in arch/XXX/kvm/XXX.c and see a function
called kvm_create_xxx_dev(...) then you're like, what context is this
being called from again and in which order, etc. Of course we can name
the function kvm_ioctl_create_dev_xxx(...), but I still like to be
able to follow the flow of things that are really arch specific, but
anyhow, this is a weak argument.

>> > By doing device recognition here we don't need a separate copy of this
>> > per
>> > arch (including some #ifdef or modifying every arch at once -- including
>> > ARM
>> > which I can't modify yet because it's not merged), and *if* we should
>> > end up
>> > with an in-kernel-emulated device that gets used across multiple
>> > architectures, it would be annoying to have to modify all relevant
>> > architectures (and worse to deal with per-arch numberspaces).
>>
>> I would say that's exactly what you're going to need with your approach:
>>
>> switch (cd->type) {
>> case KVM_ARM_VGIC_V2_0:
>>     kvm_arm_vgic_v2_0_create(...);
>> }
>>
>>
>> are you going to ifdef here in this function, or? I think it's cleaner
>> to have the single arch-specific hooks and handle the cases there.
>
>
> There's an ifdef, as you can see from the patch that adds MPIC support.  But
> it's the same ifdef that gets used to determine whether the device code gets
> built in.  Nothing special needs to be added; no per-architecture hoop to
> jump through.
>
> Note that we would still need per-device ifdefs in the arch code, because
> not all PPC KVM builds are going to have MPIC support.

yeah, that's the same on ARM.

>
> What if, instead of a switch statement and ifdefs, it operated on a
> registration basis?
>
>

Probably just makes the code more confusing and harder to grep with
the limited number of in-kernel devices we support. Between that and
your current approach, I prefer the current approach. Anyhow, the
whole thing is internal state, as I wrote earlier, and by no means a
deal breaker for me, and as long as we don't have to mess around with
include/linux/kvm_host.h for changing arch-specific stuff, which I
believe is the case even with the current design, I'm ok with it.
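
For reference, the registration-based alternative raised above could look roughly like the following userspace sketch.  Nothing here comes from the patch: the demo_* names, the table size and the example backend are all hypothetical, and locking is left out.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_DEVICE_TYPES 8	/* hypothetical size of the type numberspace */

/* A create hook that a device backend would supply. */
typedef int (*demo_create_fn)(uint32_t type);

static demo_create_fn demo_create_table[DEMO_MAX_DEVICE_TYPES];

/* Called once by each backend that is built in; no #ifdef at the call site. */
int demo_register_device_type(uint32_t type, demo_create_fn fn)
{
	if (type >= DEMO_MAX_DEVICE_TYPES || fn == NULL)
		return -EINVAL;
	if (demo_create_table[type])
		return -EEXIST;
	demo_create_table[type] = fn;
	return 0;
}

/* Generic entry point: look up the type instead of switching on it. */
int demo_create_device(uint32_t type)
{
	if (type >= DEMO_MAX_DEVICE_TYPES || !demo_create_table[type])
		return -ENODEV;
	return demo_create_table[type](type);
}

/* Example backend; returns a made-up handle derived from the type. */
int demo_mpic_create(uint32_t type)
{
	return 100 + (int)type;
}
```

The trade-off is the one described above: the generic code loses its switch and ifdefs, but finding out who handles a given type now means grepping for the registration call.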

>> The use case of having a single device which is so central to the
>> system that we emulate it inside the kernel and is shared across
>> multiple archs is pretty far fetched to me.
>
>
> I don't think it's that far fetched.  APIC is shared between x86 and ia64 --
> even if APIC has no need for anything beyond existing API, it shows that
> it's not a crazy possibility.  Freescale already has some devices that are
> shared between PPC and ARM, and that set will expand (probably not to irq
> controllers, though the probability is non-zero) with Layerscape, about
> which Freescale says, "The unique, core-agnostic architecture incorporates
> the optimum core for the given application—ARM cores or Power Architecture
> cores."  We already need to go back and non-ppc-ize various drivers,
> including their reliance on I/O accessors that are defined in
> architecture-specific ways...  Making things gratuitously architecture
> specific is just a bad idea, even if the "use case" for it actually being
> used on multiple architectures is remote.
>

For the record I think this is a simplification: making this generic
always comes at the cost of some added complexity exactly due to the
loss of being specific. It's a balancing act to figure out which is
preferred given the magnitude of the cost. As I indicate above, this
case is not too bad.

> Normal kernel drivers tend to go in drivers/, not arch/, even if they're
> only used on one architecture...
>

I know, but then you don't have #ifdef CONFIG_SOME_WEIRD_DEVICE in
kernel/xxx.c which is the only thing I find mildly unpleasing.

But let's not waste any more time on this detail.

>
>> However, this is internal and can always be changed, so if everyone
>> agrees on the overall API, whichever way you implement it is fine with
>> me.
>
>
> We at least need the numberspace to not be architecture-specific if we want
> to retain the possibility of changing later -- not to mention what happens
> if architectures merge.  I see that "arm" and "arm64" are separate, despite
> the fact that other architectures that used to be split this way have since
> merged.  Maybe "arm64" is too different from "arm" for that to happen, but
> who knows...
>

Fair point, nobody knows.

> ...and if they don't merge, wouldn't that be a likely case for devices
> shared across architectures?  Does arm64 use gic/vgic?  This post suggests
> that there is at least something in common (the bit about "once the GIC code
> is shared between
> arm and arm64"):
> http://lists.infradead.org/pipermail/linux-arm-kernel/2012-December/135836.html
>

I'm not sure how much of that is public at this point, or even
determined. But KVM already shares code between arm64 and arm, so I
guess I thought of this as a single architecture from the point of
view of virt/kvm/kvm_main.c, but that may be incorrect actually.

I really need to find time to play around more with the arm64 code.

Thanks for the thoughts.

-Christoffer

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-19 21:16           ` Scott Wood
@ 2013-02-20 13:09             ` Gleb Natapov
  -1 siblings, 0 replies; 261+ messages in thread
From: Gleb Natapov @ 2013-02-20 13:09 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm, Christoffer Dall

On Tue, Feb 19, 2013 at 03:16:37PM -0600, Scott Wood wrote:
> On 02/19/2013 06:24:18 AM, Gleb Natapov wrote:
> >On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
> >> The ability to set/get attributes is needed.  Sorry, but "get or set
> >> one blob of data, up to 512 bytes, for the entire irqchip" is just
> >> not good enough -- assuming you don't want us to start sticking
> >> pointers and commands in *that* data. :-)
> >>
> >Proposed interface sticks pointers into ioctl data, so why doing
> >the same
> >for KVM_SET_IRQCHIP/KVM_GET_IRQCHIP makes you smile.
> 
> There's a difference between putting a pointer in an ioctl control
> structure that is specifically documented as being that way (as in
> ONE_REG), versus taking an ioctl that claims to be setting/getting a
> blob of state and embedding pointers in it.  It would be like
> sticking a pointer in the attribute payload of this API, which I
> think is something to be discouraged.
If documentation is what differentiates silly from smart for you,
then write documentation instead of new interfaces.

KVM_SET_IRQCHIP/KVM_GET_IRQCHIP is defined to operate on a blob of data on
x86; nothing prevents you from adding MPIC specifics to the interface.
Add MPIC state to the kvm_irqchip structure, and if 512 bytes is not enough
for you to transfer the state, put pointers there and _document_ them.
But with 512 bytes you can transfer properties inline, so you probably
do not need a pointer there anyway. I see you have three properties, two of
them 32-bit and one 64-bit.

>                                        It'd also be using
> KVM_SET_IRQCHIP to read data, which is the sort of thing you object
> to later on regarding KVM_IRQ_LINE_STATUS.
> 
Do not see why.

> Then there's the silliness of transporting 512 bytes just to read a
> descriptor for transporting something else.
> 
Yes, agreed. But is this enough of a reason to introduce an entirely new
interface? Is it on a performance-critical path? I doubt it, unless you
abuse the interface to send interrupts, but then isn't it silly to
do copy_from_user() twice to inject an interrupt, as the proposed interface
does?

> >For signaling irqs (I think this is what you mean by "commands")
> >we have KVM_IRQ_LINE.
> 
> It's one type of command.  Another is setting the address.  Another
> is writing to registers that have side effects (this is how MSI
> injection is done on MPIC, just as in real hardware).
> 
Setting the address is setting an attribute. Sending MSI is a command.
Things you set/get during init/migration are attributes. Things you do
to cause side-effects are commands.

> What is the benefit of KVM_IRQ_LINE over what MPIC does?  What real
> (non-glue/wrapper) code can become common?
> 
No new ioctl with exactly same result (well actually even faster since
less copying is done). You need to show us the benefits of the new interface
vs existing one, not vice versa.

> And I really hope you don't want us to do MSIs the x86 way.
> 
What is wrong with KVM_SIGNAL_MSI? Except that you'll need code to glue it
to mpic.

> In the XICS thread, Paul brought up the possibility of cascaded
> MPICs.  It's not relevant to the systems we're trying to model, but
> if one did want to use the in-kernel irqchip interface for that, it
> would be really nice to be able to operate on a specific MPIC for
> injection rather than have to come up with some sort of global
> identifier (above and beyond the minor flattening we'd need to do to
> represent a single MPIC's interrupts in a flat numberspace).
> 
ARM encodes information in the irq field of KVM_IRQ_LINE like this:
  bits:  | 31 ... 24 | 23  ... 16 | 15    ...     0 |
  field: | irq_type  | vcpu_index |   irq_number    |
Why will similar approach not work?
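
As a minimal sketch of the layout in the table above, the following helpers pack and unpack a 32-bit irq value with the type in bits 31..24, the vcpu index in bits 23..16 and the irq number in bits 15..0.  The DEMO_* macros and demo_irq_* helpers are hypothetical names chosen for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Field positions taken from the table above; macro names are hypothetical. */
#define DEMO_IRQ_TYPE_SHIFT	24
#define DEMO_IRQ_TYPE_MASK	0xffu
#define DEMO_IRQ_VCPU_SHIFT	16
#define DEMO_IRQ_VCPU_MASK	0xffu
#define DEMO_IRQ_NUM_MASK	0xffffu

/* Build the irq field from its three components. */
static inline uint32_t demo_irq_pack(uint32_t type, uint32_t vcpu, uint32_t num)
{
	assert(type <= DEMO_IRQ_TYPE_MASK);
	assert(vcpu <= DEMO_IRQ_VCPU_MASK);
	assert(num <= DEMO_IRQ_NUM_MASK);

	return (type << DEMO_IRQ_TYPE_SHIFT) |
	       (vcpu << DEMO_IRQ_VCPU_SHIFT) |
	       num;
}

/* Extract bits 31..24. */
static inline uint32_t demo_irq_type(uint32_t irq)
{
	return (irq >> DEMO_IRQ_TYPE_SHIFT) & DEMO_IRQ_TYPE_MASK;
}

/* Extract bits 23..16. */
static inline uint32_t demo_irq_vcpu(uint32_t irq)
{
	return (irq >> DEMO_IRQ_VCPU_SHIFT) & DEMO_IRQ_VCPU_MASK;
}

/* Extract bits 15..0. */
static inline uint32_t demo_irq_num(uint32_t irq)
{
	return irq & DEMO_IRQ_NUM_MASK;
}
```

A similar scheme for cascaded controllers would presumably reuse one of the upper fields as a controller index; that mapping is an assumption here, not something either proposal defines.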

> >> If you mean the way to inject interrupts, it's simpler this way.
> >> Why go out of our way to inject common glue code into a
> >> communication path between hw/kvm/mpic.c in QEMU and
> >> arch/powerpc/kvm/mpic.c in KVM?  Or rather, why make that common
> >> glue be specific to this one function when we could reuse the same
> >> communication glue used for other things, such as device state?
> >You will need glue anyway and I do not see how the amount of it is much
> >different one way or the other.
> 
> It uses glue that we need to be present for other things anyway.  If
> it weren't for XICS we wouldn't need a KVM_IRQ_LINE implementation
> at all on PPC.  It may not be a major difference, but it doesn't
> affect anything but MPIC and it seems more straightforward this way.
> 
We are talking about something like 4 lines of userspace code including
bracket. I do not think this is strong point in favor of different
interface.

> >Gluing qemu_set_irq() to
> >ioctl(KVM_IRQ_LINE) or ioctl(KVM_SET_DEVICE_ATTR) is not much
> >different.
> 
> qemu_set_irq() is not glued to either of those.  It's glued to
> kvm_openpic_set_irq(), kvm_ioapic_set_irq(), etc.  It's already not
> generic code.
> 
OK, this does not invalidate my argument though.

> >Of course, since the interface you propose is not irq chip specific
> 
> This part of it is.
> 
> >we need non irq chip specific way to talk to it. But how do you
> >propose
> >to model things like KVM_IRQ_LINE_STATUS with KVM_SET_DEVICE_ATTR?
> 
> That one's not even in api.txt, so could you explain what exactly
> it's supposed to return, and why it's needed?
True. We need to add it. Your guess below is correct.

> 
> AFAICT, the only thing it gets used for in QEMU is coalescing
> mc146818rtc interrupts.
> 
At present, yes. It still needs to be supportable.

> Could an error return be used for cases where the IRQ was not
> delivered, in the very unlikely event that we want to implement
> something similar on MPIC?
We can, but I do not think it will be a good API. This condition is not an
error.
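
To illustrate the point that a coalesced interrupt is a status and not an error, here is a toy model that keeps the two channels separate: the return value carries only errors, and a separate out-parameter carries the delivery status.  This is a simplified sketch, not how any in-kernel irqchip works; the names and the rule "raising an already asserted line counts as coalesced" are assumptions made for the example.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical status values, reported separately from the error code. */
enum demo_irq_status {
	DEMO_IRQ_COALESCED = 0,	/* line was already asserted; nothing new */
	DEMO_IRQ_DELIVERED = 1,	/* the level change was accepted */
};

/* Toy interrupt line: just remembers the last level written. */
struct demo_irq_line {
	int asserted;
};

/*
 * Returns 0 on success or a negative errno on bad arguments.
 * On success, *status says whether the request was coalesced.
 */
int demo_irq_line_status(struct demo_irq_line *line, int level,
			 enum demo_irq_status *status)
{
	if (line == NULL || status == NULL || (level != 0 && level != 1))
		return -EINVAL;

	if (level && line->asserted)
		*status = DEMO_IRQ_COALESCED;
	else
		*status = DEMO_IRQ_DELIVERED;

	line->asserted = level;
	return 0;
}
```

A caller can then treat a nonzero return as a real failure while still reacting to DEMO_IRQ_COALESCED, for example by re-injecting a timer tick later.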

>                            Note again that MPIC's decision to use
> or not use KVM_IRQ_LINE is only about what MPIC does; it is not
> inherent in the device control API.
That's the crux of the problem though. MPIC tries to be different just
for the sake of being different. Why? The only explanation you provide is
that the current API is "silly", not that you cannot implement MPIC with
it or that it will be unnecessarily slow, just "silly".

> 
> >KVM_SET_DEVICE_ATTR needs to return data back and getting data
> >back from
> >"set" ioctl is strange.
> 
> If we really need a single atomic operation to both read and write,
> beyond returning error values, then yes, that would be a new ioctl.
> It could be added in the future if needed.
> 
> >Other devices may get other commands that need
> >response, so if we design generic interface we should take it into
> >account. I think using KVM_SET_DEVICE_ATTR to inject interrupts is a
> >misnomer, you do not set internal device attribute, you toggle
> >external
> >input. Maybe another ioctl KVM_SEND_DEVICE_COMMAND is needed.
> 
> I see no need for a separate ioctl in terms of the underlying
> infrastructure for distinguishing "attribute" from "write-only
> command".  I'm open to improvements on what the ioctl is called.
> It's basically like setting a register on a device, except I was
> concerned that if we actually called it a "register" that people
> would take it too literally and think it's only for the architected
> register state of the emulated device.
I agree "attribute" is a better name than "register", but injecting
an interrupt is not setting an attribute.

> 
> >> >ARM vGIC code, that is ready to go upstream, uses old way too. So
> >> >it will
> >> >be 2 archs against one.
> >>
> >> I wasn't aware that that's how it worked. :-P
> >>
> >What worked? That vGIC uses existing interface or that non generic
> >interface used by many arches wins generic one used by only one arch?
> 
> The latter.  Two wrongs don't make a right, and adding another
> inextensible, device-specific API is not the answer to the existing
> APIs being too inextensible and device/arch-specific.  Some portion
> will always need to be device-specific because we're controlling the
> creation and of a specific device, but the glue does not need to be.
>
This is not "adding another inextensible, device-specific API" vs "adding
cool generic extensible API" though. It is "using existing inextensible,
device-specific API" vs "adding cool generic extensible API".

> >APIs are easy to add and impossible to remove.
> 
> That's why I want to get it right this time.
> 
And what if you fail? What if the next architecture brings a new
developer who proclaims your new interface "silly" since it does not
allow for device destruction and does not return a file descriptor for a newly
created device that userspace can select on to wait for the device's
events, or mmap for fast userspace/device communication?

--
			Gleb.

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
@ 2013-02-20 13:09             ` Gleb Natapov
  0 siblings, 0 replies; 261+ messages in thread
From: Gleb Natapov @ 2013-02-20 13:09 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm, Christoffer Dall

On Tue, Feb 19, 2013 at 03:16:37PM -0600, Scott Wood wrote:
> On 02/19/2013 06:24:18 AM, Gleb Natapov wrote:
> >On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
> >> The ability to set/get attributes is needed.  Sorry, but "get or set
> >> one blob of data, up to 512 bytes, for the entire irqchip" is just
> >> not good enough -- assuming you don't want us to start sticking
> >> pointers and commands in *that* data. :-)
> >>
> >Proposed interface sticks pointers into ioctl data, so why doing
> >the same
> >for KVM_SET_IRQCHIP/KVM_GET_IRQCHIP makes you smile.
> 
> There's a difference between putting a pointer in an ioctl control
> structure that is specifically documented as being that way (as in
> ONE_REG), versus taking an ioctl that claims to be setting/getting a
> blob of state and embedding pointers in it.  It would be like
> sticking a pointer in the attribute payload of this API, which I
> think is something to be discouraged.
If documentation is what differentiate for you between silly and smart
then write documentation instead of new interfaces.

KVM_SET_IRQCHIP/KVM_GET_IRQCHIP is defined to operate on blob of data on
x86, nothing prevent you from adding MPIC specifics to the interface,
Add mpic state into kvm_irqchip structure and if 512 bytes is not enough
for you to transfer the state put pointers there and _document_ them.
But with 512 bytes you can transfer properties inline, so you probably
do not need pointer there anyway. I see you have three properties 2 of
them 32bit and one 64bit.

>                                        It'd also be using
> KVM_SET_IRQCHIP to read data, which is the sort of thing you object
> to later on regarding KVM_IRQ_LINE_STATUS.
> 
Do not see why.

> Then there's the silliness of transporting 512 bytes just to read a
> descriptor for transporting something else.
> 
Yes, agree. But is this enough of a reason to introduce entirely new
interface? Is it on performance critical path? Doubt it, unless you
abuse the interface to send interrupts, but then isn't it silty to
do copy_from_user() twice to inject an interrupt like proposed interface
does?

> >For signaling irqs (I think this is what you mean by "commands")
> >we have KVM_IRQ_LINE.
> 
> It's one type of command.  Another is setting the address.  Another
> is writing to registers that have side effects (this is how MSI
> injection is done on MPIC, just as in real hardware).
> 
Setting the address is setting an attribute. Sending MSI is a command.
Things you set/get during init/migration are attributes. Things you do
to cause side-effects are commands.

> What is the benefit of KVM_IRQ_LINE over what MPIC does?  What real
> (non-glue/wrapper) code can become common?
> 
No new ioctl with exactly same result (well actually even faster since
less copying is done). You need to show us the benefits of the new interface
vs existing one, not vice versa.

> And I really hope you don't want us to do MSIs the x86 way.
> 
What is wrong with KVM_SIGNAL_MSI? Except that you'll need code to glue it
to mpic.

> In the XICS thread, Paul brought up the possibliity of cascaded
> MPICs.  It's not relevant to the systems we're trying to model, but
> if one did want to use the in-kernel irqchip interface for that, it
> would be really nice to be able to operate on a specific MPIC for
> injection rather than have to come up with some sort of global
> identifier (above and beyond the minor flattening we'd need to do to
> represent a single MPIC's interrupts in a flat numberspace).
> 
ARM encodes information in irq field of KVM_IRQ_LINE like that:
 šbits:  | 31 ... 24 | 23  ... 16 | 15    ...     0 |
  field: | irq_type  | vcpu_index |   irq_number    |
Why will similar approach not work?

> >> If you mean the way to inject interrupts, it's simpler this way.
> >> Why go out of our way to inject common glue code into a
> >> communication path between hw/kvm/mpic.c in QEMU and
> >> arch/powerpc/kvm/mpic.c in KVM?  Or rather, why make that common
> >> glue be specific to this one function when we could reuse the same
> >> communication glue used for other things, such as device state?
> >You will need glue anyway and I do no see how amount of it is much
> >different one way or the other.
> 
> It uses glue that we need to be present for other things anyway.  If
> it weren't for XICS we wouldn't need a KVM_IRQ_LINE implementation
> at all on PPC.  It may not be a major difference, but it doesn't
> affect anything but MPIC and it seems more straightforward this way.
> 
We are talking about something like 4 lines of userspace code including
bracket. I do not think this is strong point in favor of different
interface.

> >Gluing qemu_set_irq() to
> >ioctl(KVM_IRQ_LINE) or ioctl(KVM_SET_DEVICE_ATTR) is not much
> >different.
> 
> qemu_set_irq() is not glued to either of those.  It's glued to
> kvm_openpic_set_irq(), kvm_ioapic_set_irq(), etc.  It's already not
> generic code.
> 
OK, this does not invalidates my argument though.

> >Of course, since the interface you propose is not irq chip specific
> 
> This part of it is.
> 
> >we need non irq chip specific way to talk to it. But how do you
> >propose
> >to model things like KVM_IRQ_LINE_STATUS with KVM_SET_DEVICE_ATTR?
> 
> That one's not even in api.txt, so could you explain what exactly
> it's supposed to return, and why it's needed?
True. We need to add it. Your guess below is correct.

> 
> AFAICT, the only thing it gets used for in QEMU is coalescing
> mc146818rtc interrupts.
> 
At present, yes. It still needs to be supportable.

> Could an error return be used for cases where the IRQ was not
> delivered, in the very unlikely event that we want to implement
> something similar on MPIC?
We can, but I do not think it will be a good API. This condition is not an
error.

>                            Note again that MPIC's decision to use
> or not use KVM_IRQ_LINE is only about what MPIC does; it is not
> inherent in the device control API.
That's the crux of the problem though. MPIC tries to be different just
for the sake of being different. Why? The only explanation you provide is
that the current API is "silly", not that you cannot implement MPIC with
it or that it will be unnecessarily slow, just "silly".

> 
> >KVM_SET_DEVICE_ATTR needs to return data back and getting data
> >back from
> >"set" ioctl is strange.
> 
> If we really need a single atomic operation to both read and write,
> beyond returning error values, then yes, that would be a new ioctl.
> It could be added in the future if needed.
> 
> >Other devices may get other commands that need a
> >response, so if we design a generic interface we should take that into
> >account. I think using KVM_SET_DEVICE_ATTR to inject interrupts is a
> >misnomer: you do not set an internal device attribute, you toggle an
> >external input. Maybe another ioctl, KVM_SEND_DEVICE_COMMAND, is needed.
> 
> I see no need for a separate ioctl in terms of the underlying
> infrastructure for distinguishing "attribute" from "write-only
> command".  I'm open to improvements on what the ioctl is called.
> It's basically like setting a register on a device, except I was
> concerned that if we actually called it a "register" that people
> would take it too literally and think it's only for the architected
> register state of the emulated device.
I agree "attribute" is a better name than "register", but injecting
an interrupt is not setting an attribute.

> 
> >> >ARM vGIC code, that is ready to go upstream, uses old way too. So
> >> >it will
> >> >be 2 archs against one.
> >>
> >> I wasn't aware that that's how it worked. :-P
> >>
> >What worked? That vGIC uses existing interface or that non generic
> >interface used by many arches wins generic one used by only one arch?
> 
> The latter.  Two wrongs don't make a right, and adding another
> inextensible, device-specific API is not the answer to the existing
> APIs being too inextensible and device/arch-specific.  Some portion
> will always need to be device-specific because we're controlling the
> creation and of a specific device, but the glue does not need to be.
>
This is not "adding another inextensible, device-specific API" vs "adding
cool generic extensible API" though. It is "using existing inextensible,
device-specific API" vs "adding cool generic extensible API".

> >APIs are easy to add and impossible to remove.
> 
> That's why I want to get it right this time.
> 
And what if you fail? What if the next architecture brings a new
developer who proclaims your new interface "silly" since it does not
allow for device destruction and does not return a file descriptor for a
newly created device that userspace can select on to wait for the device's
events, or mmap for fast userspace/device communication?

--
			Gleb.

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-18 23:01       ` Scott Wood
@ 2013-02-20 21:17         ` Marcelo Tosatti
  -1 siblings, 0 replies; 261+ messages in thread
From: Marcelo Tosatti @ 2013-02-20 21:17 UTC (permalink / raw)
  To: Scott Wood; +Cc: Gleb Natapov, Alexander Graf, kvm-ppc, kvm, Christoffer Dall

On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
> On 02/18/2013 06:21:59 AM, Gleb Natapov wrote:
> >Copying Christoffer since ARM has in kernel irq chip too.
> >
> >On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote:
> >> Currently, devices that are emulated inside KVM are configured in a
> >> hardcoded manner based on an assumption that any given architecture
> >> only has one way to do it.  If there's any need to access device state,
> >> it is done through inflexible one-purpose-only IOCTLs (e.g.
> >> KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> >> cumbersome and depletes a limited numberspace.

Creating an ioctl has advantages. It makes explicit what the
data contains and how it should be interpreted.
This means that, for example, policy control can be performed at the ioctl
level (one can group, by ioctl number, which ioctl calls are allowed after
VM creation).

It also makes it explicit that it's a userspace interface on which
applications depend.

With a single 'set device attribute' ioctl it's more difficult
to do that.

Abstracting the details also makes it cumbersome to read
strace output :-)

> >> This API provides a mechanism to instantiate a device of a certain
> >> type, returning an ID that can be used to set/get attributes of the
> >> device.  Attributes may include configuration parameters (e.g.
> >> register base address), device state, operational commands, etc.  It
> >> is similar to the ONE_REG API, except that it acts on devices rather
> >> than vcpus.
> >You not only provide a different way to create an in-kernel irq chip, you
> >also use an alternate way to trigger interrupt lines. Before going into
> >interface specifics, let's think about whether it is really worth it.
> 
> Which "it" do you mean here?
> 
> The ability to set/get attributes is needed.  Sorry, but "get or set
> one blob of data, up to 512 bytes, for the entire irqchip" is just
> not good enough -- assuming you don't want us to start sticking
> pointers and commands in *that* data. :-)

Why not? Is it necessary to constantly read/write attributes?

> If you mean the way to inject interrupts, it's simpler this way.

Are you injecting interrupts via this new SET_DEVICE_ATTRIBUTE ioctl?

> Why go out of our way to inject common glue code into a
> communication path between hw/kvm/mpic.c in QEMU and
> arch/powerpc/kvm/mpic.c in KVM?  Or rather, why make that common
> glue be specific to this one function when we could reuse the same
> communication glue used for other things, such as device state?
> 
> And that's just for regular interrupts.  MSIs are vastly simpler on
> MPIC than what x86 does.
> 
> >x86 obviously support old way and will have to for some, very
> >long, time.
> 
> Sure.
> 
> >ARM vGIC code, that is ready to go upstream, uses old way too. So
> >it will
> >be 2 archs against one.
> 
> I wasn't aware that that's how it worked. :-P
> 
> I was trying to be considerate by not making the entire thing
> gratuitously PPC or MPIC specific, as some others seem inclined to
> do (e.g. see irqfd and APIC).  We already had a discussion on ARM's
> "set address" ioctl and rather than extend *that* interface, they
> preferred to just stick something ARM-specific in ASAP with the
> understanding that it would be replaced (or more accurately, kept
> around as a thin wrapper around the new stuff) later.
> 
> >Christoffer do you think the proposed way it
> >better for your needs. Are you willing to make vGIC use it?
> >
> >Scott, what other devices are you planning to support with this
> >interface?
> 
> At the moment I do not have plans for other devices, though what
> does it hurt for the capability to be there?
> 
> -Scott

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-20 13:09             ` Gleb Natapov
@ 2013-02-20 21:28               ` Marcelo Tosatti
  -1 siblings, 0 replies; 261+ messages in thread
From: Marcelo Tosatti @ 2013-02-20 21:28 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Scott Wood, Alexander Graf, kvm-ppc, kvm, Christoffer Dall

On Wed, Feb 20, 2013 at 03:09:49PM +0200, Gleb Natapov wrote:
> On Tue, Feb 19, 2013 at 03:16:37PM -0600, Scott Wood wrote:
> > On 02/19/2013 06:24:18 AM, Gleb Natapov wrote:
> > >On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
> > >> The ability to set/get attributes is needed.  Sorry, but "get or set
> > >> one blob of data, up to 512 bytes, for the entire irqchip" is just
> > >> not good enough -- assuming you don't want us to start sticking
> > >> pointers and commands in *that* data. :-)
> > >>
> > >Proposed interface sticks pointers into ioctl data, so why doing
> > >the same
> > >for KVM_SET_IRQCHIP/KVM_GET_IRQCHIP makes you smile.
> > 
> > There's a difference between putting a pointer in an ioctl control
> > structure that is specifically documented as being that way (as in
> > ONE_REG), versus taking an ioctl that claims to be setting/getting a
> > blob of state and embedding pointers in it.  It would be like
> > sticking a pointer in the attribute payload of this API, which I
> > think is something to be discouraged.
> If documentation is what differentiate for you between silly and smart
> then write documentation instead of new interfaces.
> 
> KVM_SET_IRQCHIP/KVM_GET_IRQCHIP is defined to operate on blob of data on
> x86, nothing prevent you from adding MPIC specifics to the interface,
> Add mpic state into kvm_irqchip structure and if 512 bytes is not enough
> for you to transfer the state put pointers there and _document_ them.
> But with 512 bytes you can transfer properties inline, so you probably
> do not need pointer there anyway. I see you have three properties 2 of
> them 32bit and one 64bit.
> 
> >                                        It'd also be using
> > KVM_SET_IRQCHIP to read data, which is the sort of thing you object
> > to later on regarding KVM_IRQ_LINE_STATUS.
> > 
> Do not see why.
> 
> > Then there's the silliness of transporting 512 bytes just to read a
> > descriptor for transporting something else.
> > 
> Yes, agree. But is this enough of a reason to introduce entirely new
> interface? Is it on performance critical path? Doubt it, unless you
> abuse the interface to send interrupts, but then isn't it silly to
> do copy_from_user() twice to inject an interrupt like proposed interface
> does?
> 
> > >For signaling irqs (I think this is what you mean by "commands")
> > >we have KVM_IRQ_LINE.
> > 
> > It's one type of command.  Another is setting the address.  Another
> > is writing to registers that have side effects (this is how MSI
> > injection is done on MPIC, just as in real hardware).
> > 
> Setting the address is setting an attribute. Sending MSI is a command.
> Things you set/get during init/migration are attributes. Things you do
> to cause side-effects are commands.

Yes, it would be good to restrict what can be done via the interface
(to avoid abstracting away problems). At first glance, I would say
it should allow for "initial configuration of device state", such as
registers etc.

Why exactly do you need to set attributes, Scott?

> > What is the benefit of KVM_IRQ_LINE over what MPIC does?  What real
> > (non-glue/wrapper) code can become common?
> > 
> No new ioctl with exactly same result (well actually even faster since
> less copying is done). You need to show us the benefits of the new interface
> vs existing one, not vice versa.

Also related to this question is the point of avoiding having the
implementation of devices spread across userspace and the kernel
(this is one point Avi brought up often). If the device emulation
is implemented entirely in the kernel, all that is necessary are the initial
values of device registers (and retrieval of those values later for
savevm/migration).

It is then not necessary to set device attributes on a live guest and
deal with the complications associated with that.

<snip>

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-20 21:28               ` Marcelo Tosatti
@ 2013-02-20 22:44                 ` Marcelo Tosatti
  -1 siblings, 0 replies; 261+ messages in thread
From: Marcelo Tosatti @ 2013-02-20 22:44 UTC (permalink / raw)
  To: Gleb Natapov, Avi Kivity
  Cc: Scott Wood, Alexander Graf, kvm-ppc, kvm, Christoffer Dall

On Wed, Feb 20, 2013 at 06:28:24PM -0300, Marcelo Tosatti wrote:
> On Wed, Feb 20, 2013 at 03:09:49PM +0200, Gleb Natapov wrote:
> > On Tue, Feb 19, 2013 at 03:16:37PM -0600, Scott Wood wrote:
> > > On 02/19/2013 06:24:18 AM, Gleb Natapov wrote:
> > > >On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
> > > >> The ability to set/get attributes is needed.  Sorry, but "get or set
> > > >> one blob of data, up to 512 bytes, for the entire irqchip" is just
> > > >> not good enough -- assuming you don't want us to start sticking
> > > >> pointers and commands in *that* data. :-)
> > > >>
> > > >Proposed interface sticks pointers into ioctl data, so why doing
> > > >the same
> > > >for KVM_SET_IRQCHIP/KVM_GET_IRQCHIP makes you smile.
> > > 
> > > There's a difference between putting a pointer in an ioctl control
> > > structure that is specifically documented as being that way (as in
> > > ONE_REG), versus taking an ioctl that claims to be setting/getting a
> > > blob of state and embedding pointers in it.  It would be like
> > > sticking a pointer in the attribute payload of this API, which I
> > > think is something to be discouraged.
> > If documentation is what differentiate for you between silly and smart
> > then write documentation instead of new interfaces.
> > 
> > KVM_SET_IRQCHIP/KVM_GET_IRQCHIP is defined to operate on blob of data on
> > x86, nothing prevent you from adding MPIC specifics to the interface,
> > Add mpic state into kvm_irqchip structure and if 512 bytes is not enough
> > for you to transfer the state put pointers there and _document_ them.
> > But with 512 bytes you can transfer properties inline, so you probably
> > do not need pointer there anyway. I see you have three properties 2 of
> > them 32bit and one 64bit.
> > 
> > >                                        It'd also be using
> > > KVM_SET_IRQCHIP to read data, which is the sort of thing you object
> > > to later on regarding KVM_IRQ_LINE_STATUS.
> > > 
> > Do not see why.
> > 
> > > Then there's the silliness of transporting 512 bytes just to read a
> > > descriptor for transporting something else.
> > > 
> > Yes, agree. But is this enough of a reason to introduce entirely new
> > interface? Is it on performance critical path? Doubt it, unless you
> > > abuse the interface to send interrupts, but then isn't it silly to
> > do copy_from_user() twice to inject an interrupt like proposed interface
> > does?
> > 
> > > >For signaling irqs (I think this is what you mean by "commands")
> > > >we have KVM_IRQ_LINE.
> > > 
> > > It's one type of command.  Another is setting the address.  Another
> > > is writing to registers that have side effects (this is how MSI
> > > injection is done on MPIC, just as in real hardware).
> > > 
> > Setting the address is setting an attribute. Sending MSI is a command.
> > Things you set/get during init/migration are attributes. Things you do
> > to cause side-effects are commands.
> 
> Yes, it would be good to restrict what can be done via the interface
> (to avoid abstracting away problems). At a first glance, i would say
> it should allow for "initial configuration of device state", such as
> registers etc.
> 
> Why exactly you need to set attributes Scott?
> 
> > > What is the benefit of KVM_IRQ_LINE over what MPIC does?  What real
> > > (non-glue/wrapper) code can become common?
> > > 
> > No new ioctl with exactly same result (well actually even faster since
> > less copying is done). You need to show us the benefits of the new interface
> > vs existing one, not vice versa.
> 
> Also related to this question is the point of avoiding the
> implementation of devices to be spread across userspace and the kernel
> (this is one point Avi brought up often). If the device emulation
> is implemented entirely in the kernel, all is necessary are initial
> values of device registers (and retrieval of those values later for
> savevm/migration).
> 
> It is then not necessary to set device attributes on a live guest and
> deal with the complications associated with that.
> 
> <snip>

I suppose the atomic-test-and-set type operations introduced to ONE_REG ioctls
are one example of such complications.

Avi, can you remind us what the drawbacks of having separate
userspace/kernel device emulation are?

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-20 21:17         ` Marcelo Tosatti
@ 2013-02-20 23:20           ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-02-20 23:20 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Gleb Natapov, Alexander Graf, kvm-ppc, kvm, Christoffer Dall

  On 02/20/2013 03:17:27 PM, Marcelo Tosatti wrote:
> On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
> > On 02/18/2013 06:21:59 AM, Gleb Natapov wrote:
> > >Copying Christoffer since ARM has in kernel irq chip too.
> > >
> > >On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote:
> > >> Currently, devices that are emulated inside KVM are configured in a
> > >> hardcoded manner based on an assumption that any given architecture
> > >> only has one way to do it.  If there's any need to access device state,
> > >> it is done through inflexible one-purpose-only IOCTLs (e.g.
> > >> KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> > >> cumbersome and depletes a limited numberspace.
> 
> Creation of ioctl has advantages. It makes explicit what the
> data contains and how it should be interpreted.
> This means that for example, policy control can be performed at ioctl
> level (can group, by ioctl number, which ioctl calls are allowed after
> VM creation, for example).
> 
> It also makes it explicit that it's a userspace interface on which
> applications depend.
> 
> With a single 'set device attribute' ioctl it's more difficult
> to do that.

I don't see why creating ioctls rather than device attributes (or  
whatever you want to call them) differs this way.  Device attributes  
are inherently userspace interfaces, just as ioctls are.  What the data  
contains and how it is interpreted are documented, albeit in a more  
lightweight format than KVM usually uses for ioctls.

An ioctl is no guarantee of good documentation.  KVM is far better than  
most of the kernel in that regard, but even there some ioctls are  
missing (e.g. KVM_IRQ_LINE_STATUS as mentioned elsewhere in this  
thread), and some others are inadequately explained (e.g. IRQ routing).

By "policy control", do you just mean that some ioctls act on a  
/dev/kvm fd, others on a vm fd, and others on a vcpu fd?  This is  
pretty similar to having a device fd, except for the mechanism used.

The main things that I dislike about adding new things at the ioctl
level are:
- limited numberspace, and more prone to merge conflicts than a
  device-specific numberspace
- having to add a new pathway to get into the driver for each ioctl,
  rather than having all operations on a particular device go to the
  right place, and the device implementation interprets further (this
  assumes that we're talking about vm ioctls, and not returning a
  new fd for the device)
- awkward way of negotiating capabilities with userspace (have to
  declare the capability id separately, and add code somewhere outside
  the device implementation to respond to capability requests)
- api.txt formatting that imposes yet another document-internal
  numberspace to conflict on :-)

> Abstracting the details also makes it cumbersome to read
> strace output :-)

You'd have to update strace one way or another if you want  
pretty-printed output.  Having a more restricted API than arbitrary  
ioctls could actually improve the situation -- this could be a good  
reason for including the attribute length as an explicit parameter  
rather than being implicit in the attribute id.  Then you'd just teach  
strace about the device control API once, and not for each new  
attribute or device that gets added.
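
To make that concrete, here is a rough sketch.  The struct layout and all
names are hypothetical, chosen only for illustration; this is not what
patch 1/6 defines.  With the length carried in the request, one generic
routine can dump any attribute without knowing what it means:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical request layout, for illustration only. */
struct dev_attr_req {
	uint32_t dev;	/* device ID returned at creation time */
	uint32_t attr;	/* device-specific attribute ID */
	uint32_t len;	/* payload length in bytes, stated explicitly */
	uint32_t pad;
	uint64_t addr;	/* userspace pointer to the payload */
};

/*
 * Format a request as "OP(dev=…, attr=…, len=…, data=<hex>)".
 * Needs no per-attribute knowledge because len is explicit.
 * Returns 0 on success, -1 if buf is too small.
 */
static int format_dev_attr(char *buf, size_t size, const char *op,
			   const struct dev_attr_req *req)
{
	const unsigned char *data =
		(const unsigned char *)(uintptr_t)req->addr;
	size_t off;
	uint32_t i;
	int n;

	n = snprintf(buf, size, "%s(dev=%u, attr=%u, len=%u, data=", op,
		     (unsigned)req->dev, (unsigned)req->attr,
		     (unsigned)req->len);
	if (n < 0 || (size_t)n >= size)
		return -1;
	off = (size_t)n;

	/* Hex-dump the payload one byte at a time. */
	for (i = 0; i < req->len; i++) {
		n = snprintf(buf + off, size - off, "%02x",
			     (unsigned)data[i]);
		if (n < 0 || (size_t)n >= size - off)
			return -1;
		off += (size_t)n;
	}

	n = snprintf(buf + off, size - off, ")");
	if (n < 0 || (size_t)n >= size - off)
		return -1;
	return 0;
}
```

If the length were implicit in the attribute id, the loop above would need
a per-device table of attribute sizes, which is exactly the per-attribute
strace maintenance I'd like to avoid.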

> > >> This API provides a mechanism to instantiate a device of a certain
> > >> type, returning an ID that can be used to set/get attributes of the
> > >> device.  Attributes may include configuration parameters (e.g.
> > >> register base address), device state, operational commands, etc.  It
> > >> is similar to the ONE_REG API, except that it acts on devices rather
> > >> than vcpus.
> > >You are not only providing a different way to create an in-kernel irq
> > >chip, you also use an alternate way to trigger interrupt lines. Before
> > >going into interface specifics, let's think about whether it is really
> > >worth it.
> >
> > Which "it" do you mean here?
> >
> > The ability to set/get attributes is needed.  Sorry, but "get or set
> > one blob of data, up to 512 bytes, for the entire irqchip" is just
> > not good enough -- assuming you don't want us to start sticking
> > pointers and commands in *that* data. :-)
> 
> Why not? Is it necessary to constantly read/write attributes?

It's not about how often we do it, but how much state we have,  
especially if we ever want to implement migration.

> > If you mean the way to inject interrupts, it's simpler this way.
> 
> Are you injecting interrupts via this new SET_DEVICE_ATTRIBUTE ioctl?

Yes, but even if that gets shot down (the best objection IMHO is the  
one nobody is raising -- how to hook into irqfd), we still need the  
rest of it.  I think we'd even want to keep attributes for IRQ line  
status so that we have a way to read it, even if just for debugging.

-Scott

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-20 21:28               ` Marcelo Tosatti
@ 2013-02-20 23:53                 ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-02-20 23:53 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Gleb Natapov, Alexander Graf, kvm-ppc, kvm, Christoffer Dall

On 02/20/2013 03:28:24 PM, Marcelo Tosatti wrote:
> On Wed, Feb 20, 2013 at 03:09:49PM +0200, Gleb Natapov wrote:
> > On Tue, Feb 19, 2013 at 03:16:37PM -0600, Scott Wood wrote:
> > > On 02/19/2013 06:24:18 AM, Gleb Natapov wrote:
> > > >On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
> > > >> The ability to set/get attributes is needed.  Sorry, but "get or set
> > > >> one blob of data, up to 512 bytes, for the entire irqchip" is just
> > > >> not good enough -- assuming you don't want us to start sticking
> > > >> pointers and commands in *that* data. :-)
> > > >>
> > > >The proposed interface sticks pointers into ioctl data, so why does
> > > >doing the same for KVM_SET_IRQCHIP/KVM_GET_IRQCHIP make you smile?
> > >
> > > There's a difference between putting a pointer in an ioctl control
> > > structure that is specifically documented as being that way (as in
> > > ONE_REG), versus taking an ioctl that claims to be setting/getting a
> > > blob of state and embedding pointers in it.  It would be like
> > > sticking a pointer in the attribute payload of this API, which I
> > > think is something to be discouraged.
> > If documentation is what differentiates silly from smart for you,
> > then write documentation instead of new interfaces.
> >
> > KVM_SET_IRQCHIP/KVM_GET_IRQCHIP is defined to operate on a blob of data
> > on x86; nothing prevents you from adding MPIC specifics to the interface.
> > Add mpic state into the kvm_irqchip structure, and if 512 bytes is not
> > enough for you to transfer the state, put pointers there and _document_
> > them.  But with 512 bytes you can transfer properties inline, so you
> > probably do not need pointers there anyway. I see you have three
> > properties, two of them 32-bit and one 64-bit.
> >
> > >                                        It'd also be using
> > > KVM_SET_IRQCHIP to read data, which is the sort of thing you object
> > > to later on regarding KVM_IRQ_LINE_STATUS.
> > >
> > Do not see why.
> >
> > > Then there's the silliness of transporting 512 bytes just to read a
> > > descriptor for transporting something else.
> > >
> > Yes, agree. But is this enough of a reason to introduce an entirely new
> > interface? Is it on a performance-critical path? Doubt it, unless you
> > abuse the interface to send interrupts, but then isn't it silly to
> > do copy_from_user() twice to inject an interrupt, as the proposed
> > interface does?
> >
> > > >For signaling irqs (I think this is what you mean by "commands")
> > > >we have KVM_IRQ_LINE.
> > >
> > > It's one type of command.  Another is setting the address.  Another
> > > is writing to registers that have side effects (this is how MSI
> > > injection is done on MPIC, just as in real hardware).
> > >
> > Setting the address is setting an attribute. Sending MSI is a command.
> > Things you set/get during init/migration are attributes. Things you do
> > to cause side-effects are commands.
> 
> Yes, it would be good to restrict what can be done via the interface
> (to avoid abstracting away problems). At first glance, I would say
> it should allow for "initial configuration of device state", such as
> registers etc.
> 
> Why exactly do you need to set attributes, Scott?

That's best answered by looking at patch 6/6 and discussing the actual  
attributes that are defined so far.

The need to set the base address of the registers is straightforward.   
When ARM added their special ioctl for this, we discussed it being  
eventually replaced with a more general API (in fact, that was the  
reason they put "ARM" in the name).

Access to device registers was originally intended for debugging, and  
eventually migration.  It turned out to also be very useful for  
injecting MSIs -- nothing special needed to be done.  It Just Works(tm)  
the same way it does in hardware, with an MMIO write from a PCI device  
to a specific MPIC register.  Again, irqfd may complicate this, but  
there's no need for QEMU-generated MSIs to have to take a circuitous  
path.

Finally, there's the interrupt source attribute.  Even if we get rid of  
that, we'll need MPIC-specific documentation on how to flatten the IRQ  
numberspace, and if we ever have a cascaded PIC situation things could  
get ugly since there's no way to identify a specific IRQ controller in  
KVM_IRQ_LINE.
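
As an illustration of what "flattening" would mean (this is not in the
patchset; the 8/24 bit split and the helper names are arbitrary
assumptions), the controller would have to be folded into the single
IRQ number that KVM_IRQ_LINE carries:

```c
#include <stdint.h>

/*
 * Hypothetical encoding: top 8 bits select the interrupt controller,
 * low 24 bits select the source on that controller.
 */
#define IRQ_CTRL_SHIFT	24
#define IRQ_SRC_MASK	((UINT32_C(1) << IRQ_CTRL_SHIFT) - 1)

/* Pack a (controller, source) pair into one flat IRQ number. */
static inline uint32_t irq_flatten(uint8_t ctrl, uint32_t src)
{
	return ((uint32_t)ctrl << IRQ_CTRL_SHIFT) | (src & IRQ_SRC_MASK);
}

/* Recover the controller index from a flat IRQ number. */
static inline uint8_t irq_ctrl(uint32_t irq)
{
	return (uint8_t)(irq >> IRQ_CTRL_SHIFT);
}

/* Recover the per-controller source from a flat IRQ number. */
static inline uint32_t irq_src(uint32_t irq)
{
	return irq & IRQ_SRC_MASK;
}
```

Whatever split is chosen becomes ABI and has to be documented per
controller type, whereas a per-device attribute names the controller
directly.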

> Also related to this question is the point of avoiding the
> implementation of devices being spread across userspace and the kernel
> (this is one point Avi brought up often). If the device emulation
> is implemented entirely in the kernel, all that is necessary are initial
> values of device registers (and retrieval of those values later for
> savevm/migration).

MPIC emulation is entirely in the kernel with this patchset -- though  
the register that lets you reset cores will likely need to be kicked  
back to QEMU.

> It is then not necessary to set device attributes on a live guest and
> deal with the complications associated with that.

Which complications?

-Scott

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-20 23:20           ` Scott Wood
@ 2013-02-21  0:01             ` Marcelo Tosatti
  -1 siblings, 0 replies; 261+ messages in thread
From: Marcelo Tosatti @ 2013-02-21  0:01 UTC (permalink / raw)
  To: Scott Wood; +Cc: Gleb Natapov, Alexander Graf, kvm-ppc, kvm, Christoffer Dall

On Wed, Feb 20, 2013 at 05:20:50PM -0600, Scott Wood wrote:
>  On 02/20/2013 03:17:27 PM, Marcelo Tosatti wrote:
> >On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
> >> On 02/18/2013 06:21:59 AM, Gleb Natapov wrote:
> >> >Copying Christoffer since ARM has in kernel irq chip too.
> >> >
> >> >On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote:
> >> >> Currently, devices that are emulated inside KVM are configured in a
> >> >> hardcoded manner based on an assumption that any given architecture
> >> >> only has one way to do it.  If there's any need to access device state,
> >> >> it is done through inflexible one-purpose-only IOCTLs (e.g.
> >> >> KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> >> >> cumbersome and depletes a limited numberspace.
> >
> >Creation of ioctl has advantages. It makes explicit what the
> >data contains and how it should be interpreted.
> >This means that for example, policy control can be performed at ioctl
> >level (can group, by ioctl number, which ioctl calls are allowed after
> >VM creation, for example).
> >
> >It also makes it explicit that it's a userspace interface on which
> >applications depend.
> >
> >With a single 'set device attribute' ioctl it's more difficult
> >to do that.
> 
> I don't see why creating ioctls rather than device attributes (or
> whatever you want to call them) differs this way.  Device attributes
> are inherently userspace interfaces, just as ioctls are.  What the
> data contains and how it is interpreted are documented, albeit in a
> more lightweight format than KVM usually uses for ioctls.
> 
> An ioctl is no guarantee of good documentation.  KVM is far better
> than most of the kernel in that regard, but even there some ioctls
> are missing (e.g. KVM_IRQ_LINE_STATUS as mentioned elsewhere in this
> thread), and some others are inadequately explained (e.g. IRQ
> routing).
> 
> By "policy control", do you just mean that some ioctls act on a
> /dev/kvm fd, others on a vm fd, and others on a vcpu fd?  This is
> pretty similar to having a device fd, except for the mechanism used.

No, I mean policy control acting on ioctl numbers. It's essentially the
same issue as with 'strace' (in that it's necessary to parse the
ioctl data further).

> The main things that I dislike about adding new things at the ioctl
> level are:
> - limited numberspace, and more prone to merge conflicts than a
> device-specific numberspace

Can reserve per-architecture ranges to deal with that.

> - having to add a new pathway to get into the driver for each ioctl,
>   rather than having all operations on a particular device go to the
>   right place, and the device implementation interprets further
>   (this assumes that we're talking about vm ioctls, and not returning
>   a new fd for the device)

Agree with that point.

> - awkward way of negotiating capabilities with userspace (have to
>   declare the capability id separately, and add code somewhere outside
>   the device implementation to respond to capability requests)

Agree with that point.

> - api.txt formatting that imposes yet another document-internal
> numberspace to conflict on :-)

The main problem, to me, is that the interface allows flexibility but
the details of using such flexibility are not worked out.

That is, it lacks definitions such as when setting attributes is allowed.
With a big blob, that information is implicit: a SET operation is a full
device reset.

With individual attributes:
- Which attributes can be set individually?
- Is there an order which must be respected?
- Locking.
- End result: more logic/code necessary.
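
As a sketch of the kind of extra logic meant here (attribute names and the
rules themselves are made up for illustration; nothing like this table is
in the posted patches), each attribute would need an explicit statement of
when it may be set:

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical attribute IDs. */
enum {
	ATTR_BASE_ADDR,
	ATTR_REGISTER,
	ATTR_IRQ_SOURCE,
};

/* One rule per attribute: may it be set once the guest is running? */
struct attr_rule {
	uint32_t attr;
	bool set_while_running;
};

/* Assumed policy, for illustration only. */
static const struct attr_rule attr_rules[] = {
	{ ATTR_BASE_ADDR,  false },	/* only during initial setup */
	{ ATTR_REGISTER,   true  },
	{ ATTR_IRQ_SOURCE, true  },
};

/*
 * Returns 0 if the set is allowed, -EBUSY if the attribute cannot be
 * changed on a running guest, -ENXIO if the attribute is unknown.
 */
static int attr_set_allowed(uint32_t attr, bool vm_running)
{
	size_t i;

	for (i = 0; i < sizeof(attr_rules) / sizeof(attr_rules[0]); i++) {
		if (attr_rules[i].attr != attr)
			continue;
		if (vm_running && !attr_rules[i].set_while_running)
			return -EBUSY;
		return 0;
	}
	return -ENXIO;
}
```

Ordering constraints and locking would come on top of this; with a single
blob none of it has to be specified.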

> >Abstracting the details also makes it cumbersome to read
> >strace output :-)
> 
> You'd have to update strace one way or another if you want
> pretty-printed output.  Having a more restricted API than arbitrary
> ioctls could actually improve the situation -- this could be a good
> reason for including the attribute length as an explicit parameter
> rather than being implicit in the attribute id.  Then you'd just
> teach strace about the device control API once, and not for each new
> attribute or device that gets added.
> 
> >> >> This API provides a mechanism to instantiate a device of a certain
> >> >> type, returning an ID that can be used to set/get attributes of the
> >> >> device.  Attributes may include configuration parameters (e.g.
> >> >> register base address), device state, operational commands, etc.  It
> >> >> is similar to the ONE_REG API, except that it acts on devices rather
> >> >> than vcpus.
> >> >You are not only providing a different way to create an in-kernel irq
> >> >chip, you also use an alternate way to trigger interrupt lines. Before
> >> >going into interface specifics, let's think about whether it is really
> >> >worth it.
> >>
> >> Which "it" do you mean here?
> >>
> >> The ability to set/get attributes is needed.  Sorry, but "get or set
> >> one blob of data, up to 512 bytes, for the entire irqchip" is just
> >> not good enough -- assuming you don't want us to start sticking
> >> pointers and commands in *that* data. :-)
> >
> >Why not? Is it necessary to constantly read/write attributes?
> 
> It's not about how often we do it, but how much state we have,
> especially if we ever want to implement migration.

Migration reads/writes the device state once, which is supposedly much
smaller than VM's RAM, so can't see the logic here.

> >> If you mean the way to inject interrupts, it's simpler this way.
> >
> >Are you injecting interrupts via this new SET_DEVICE_ATTRIBUTE ioctl?
> 
> Yes, but even if that gets shot down (the best objection IMHO is the
> one nobody is raising -- how to hook into irqfd), we still need the
> rest of it.  

Why are individual attributes necessary? Still unclear.

How about dropping it? And then assume full blob write is a device reset.

> I think we'd even want to keep attributes for IRQ line
> status so that we have a way to read it, even if just for debugging.

No problem reading the full blob in that case.

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-20 23:53                 ` Scott Wood
@ 2013-02-21  0:14                   ` Marcelo Tosatti
  -1 siblings, 0 replies; 261+ messages in thread
From: Marcelo Tosatti @ 2013-02-21  0:14 UTC (permalink / raw)
  To: Scott Wood; +Cc: Gleb Natapov, Alexander Graf, kvm-ppc, kvm, Christoffer Dall

On Wed, Feb 20, 2013 at 05:53:20PM -0600, Scott Wood wrote:
> >Why exactly do you need to set attributes, Scott?
> 
> That's best answered by looking at patch 6/6 and discussing the
> actual attributes that are defined so far.
> 
> The need to set the base address of the registers is
> straightforward.  When ARM added their special ioctl for this, we
> discussed it being eventually replaced with a more general API (in
> fact, that was the reason they put "ARM" in the name).
> 
> Access to device registers was originally intended for debugging,
> and eventually migration.  It turned out to also be very useful for
> injecting MSIs -- nothing special needed to be done.  It Just
> Works(tm) the same way it does in hardware, with an MMIO write from
> a PCI device to a specific MPIC register.  Again, irqfd may
> complicate this, but there's no need for QEMU-generated MSIs to have
> to take a circuitous path.
> 
> Finally, there's the interrupt source attribute.  Even if we get rid
> of that, we'll need MPIC-specific documentation on how to flatten
> the IRQ numberspace, and if we ever have a cascaded PIC situation
> things could get ugly since there's no way to identify a specific
> IRQ controller in KVM_IRQ_LINE.
> 
> >Also related to this question is the point of avoiding the
> >implementation of devices to be spread across userspace and the kernel
> >(this is one point Avi brought up often). If the device emulation
> >is implemented entirely in the kernel, all that is necessary are initial
> >values of device registers (and retrieval of those values later for
> >savevm/migration).
> 
> MPIC emulation is entirely in the kernel with this patchset --
> though the register that lets you reset cores will likely need to be
> kicked back to QEMU.
> 
> >It is then not necessary to set device attributes on a live guest and
> >deal with the complications associated with that.
> 
> Which complications?
> 
> -Scott

Semantics of individual attribute writes, for one.

Locking versus currently executing VCPUs, for another (see
KVM_SET_IRQ's RCU usage, for instance; that is something that should be
shared).

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-21  0:01             ` Marcelo Tosatti
@ 2013-02-21  0:33               ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-02-21  0:33 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Gleb Natapov, Alexander Graf, kvm-ppc, kvm, Christoffer Dall

On 02/20/2013 06:01:00 PM, Marcelo Tosatti wrote:
> The main problem, to me, is that the interface allows flexibility but
> the details of using such flexibility are not worked out.
> 
> That is, there is a lack of definitions, such as when setting
> attributes is allowed.

That is defined by the individual attributes.

There is one constraint that I did forget to indicate in mpic.txt --  
KVM_DEV_MPIC_GRP_REGISTER can only be used when the registers are  
mapped.

> With a big blob, that information is implicit: a SET operation is a
> full device reset.

So, for example, how would I handle the guest changing the location of  
the MPIC registers dynamically (similar to changing a PCI BAR)?  QEMU  
doesn't implement this currently, but the hardware does, so the kernel  
interface shouldn't preclude it.

What if we later discover that we need some extra bit of state that  
wasn't included in the initial definition?  Especially since we don't  
do migration at all yet on ppc, so we can't even use what QEMU  
currently saves as a reference.

What if a particular array within the state blob needs to be scaled up  
because we now have more sources?

Will we need to version the blob?  It just gets to be a mess, like  
SREGS.

And then we need special code for packing/unpacking the blob, whereas  
with this patchset we reuse the same emulation code guest MMIO uses.
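
To make the last point concrete, here is a minimal sketch of one register-write routine serving both the guest MMIO path and a userspace register-attribute path. It is not the patchset's code: every name (toy_mpic, toy_mpic_set_reg_attr, and so on), the 16-register file, and the -ENXIO return values are assumptions made purely for illustration. The check on `mapped` models the constraint mentioned above for KVM_DEV_MPIC_GRP_REGISTER.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy device; not the in-kernel MPIC from the patchset. */
#define TOY_NUM_REGS 16u	/* toy register file: 16 x 32-bit registers */

struct toy_mpic {
	bool mapped;			/* false until a base address is set */
	uint64_t base;			/* guest-physical base of the registers */
	uint32_t regs[TOY_NUM_REGS];
};

/* The single place that knows register semantics; side effects go here. */
int toy_mpic_reg_write(struct toy_mpic *m, uint64_t offset, uint32_t val)
{
	assert(m != NULL);
	if ((offset & 3) || offset / 4 >= TOY_NUM_REGS)
		return -ENXIO;	/* assumed error code for a bad offset */
	m->regs[offset / 4] = val;
	return 0;
}

/* Guest path: an MMIO write that traps into the emulation. */
int toy_mpic_mmio_write(struct toy_mpic *m, uint64_t gpa, uint32_t val)
{
	if (!m->mapped || gpa < m->base)
		return -ENXIO;
	return toy_mpic_reg_write(m, gpa - m->base, val);
}

/* Userspace path: "set attribute" in a register group, attr = offset. */
int toy_mpic_set_reg_attr(struct toy_mpic *m, uint64_t attr, uint32_t val)
{
	if (!m->mapped)		/* registers must be mapped first */
		return -ENXIO;
	return toy_mpic_reg_write(m, attr, val);
}
```

Both entry points end in toy_mpic_reg_write(), so there is no separate pack/unpack code to keep in sync with the emulation; a blob-based interface would need its own serializer on top of this.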

> With individual attributes:
> - Which attributes can be set individually?
> - Is there an order which must be respected?

If there are constraints, the device documentation should specify them.

> - Locking.

Same locking I need for MMIO and interrupt injection via other paths.

> - End result: more logic/code necessary.

That's a tradeoff for individual devices to make, in cases where it's  
actually true.

> > It's not about how often we do it, but how much state we have,
> > especially if we ever want to implement migration.
> 
> Migration reads/writes the device state once, which is supposedly much
> smaller than VM's RAM, so can't see the logic here.

struct kvm_irqchip can only hold 512 bytes.  We have more state than  
that.

-Scott

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-21  0:14                   ` Marcelo Tosatti
@ 2013-02-21  1:28                     ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-02-21  1:28 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Gleb Natapov, Alexander Graf, kvm-ppc, kvm, Christoffer Dall

On 02/20/2013 06:14:37 PM, Marcelo Tosatti wrote:
> On Wed, Feb 20, 2013 at 05:53:20PM -0600, Scott Wood wrote:
> > >It is then not necessary to set device attributes on a live guest
> > >and deal with the complications associated with that.
> >
> > Which complications?
> >
> > -Scott
> 
> Semantics of individual attribute writes, for one.

When the attribute is a device register, the hardware documentation  
takes care of that.  Otherwise, the semantics are documented in the  
device-specific documentation -- which can include restricting when the  
access is allowed.  Same as with any other interface documentation.

I suppose mpic.txt could use an additional statement that  
KVM_DEV_MPIC_GRP_REGISTER performs an access as if it were performed by  
the guest.

> Locking versus currently executing VCPUs, for another (see
> KVM_SET_IRQ's RCU usage, for instance; that is something that should be
> shared).

If you mean kvm_set_irq() in irq_comm.c, that's only relevant when you  
have a GSI routing table, which this patchset doesn't.

Assuming we end up having a routing table to support irqfd, we still  
can't share the code as is, since it's APIC-specific.

-Scott

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-20 13:09             ` Gleb Natapov
@ 2013-02-21  2:05               ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-02-21  2:05 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Alexander Graf, kvm-ppc, kvm, Christoffer Dall

On 02/20/2013 07:09:49 AM, Gleb Natapov wrote:
> On Tue, Feb 19, 2013 at 03:16:37PM -0600, Scott Wood wrote:
> > On 02/19/2013 06:24:18 AM, Gleb Natapov wrote:
> > >On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
> > >> The ability to set/get attributes is needed.  Sorry, but "get or
> > >> set one blob of data, up to 512 bytes, for the entire irqchip" is
> > >> just not good enough -- assuming you don't want us to start
> > >> sticking pointers and commands in *that* data. :-)
> > >>
> > >The proposed interface sticks pointers into ioctl data, so why does
> > >doing the same for KVM_SET_IRQCHIP/KVM_GET_IRQCHIP make you smile?
> >
> > There's a difference between putting a pointer in an ioctl control
> > structure that is specifically documented as being that way (as in
> > ONE_REG), versus taking an ioctl that claims to be setting/getting a
> > blob of state and embedding pointers in it.  It would be like
> > sticking a pointer in the attribute payload of this API, which I
> > think is something to be discouraged.
> If documentation is what differentiates silly from smart for you,
> then write documentation instead of new interfaces.

You mean like what we did with SREGS, that got deprecated and replaced  
with ONE_REG?

How is writing documentation not creating new interfaces, if the  
documentation is different from what the interface is currently  
understood to do?
Note that Marcelo seems to view KVM_SET_IRQCHIP as effectively being a  
device reset, which is rather different.

> KVM_SET_IRQCHIP/KVM_GET_IRQCHIP is defined to operate on a blob of data
> on x86; nothing prevents you from adding MPIC specifics to the interface.
> Add mpic state into the kvm_irqchip structure, and if 512 bytes is not
> enough for you to transfer the state, put pointers there and _document_
> them.

So basically, you want me to keep this interface but share the ioctl  
number with an older interface? :-P

> But with 512 bytes you can transfer properties inline, so you probably
> do not need a pointer there anyway. I see you have three properties, two
> of them 32-bit and one 64-bit.

Three *groups* of properties.  One of the property groups is per  
source, and we can have hundreds of sources.  Another exposes the  
register space, which is 64 KiB (admittedly it's somewhat sparse, but  
there's more than 512 bytes of real data in there).  And we don't  
necessarily want to set *everything*.
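
A rough sizing sketch of why the 512-byte payload is the sticking point. The struct below is a hypothetical stand-in that only mirrors the shape of struct kvm_irqchip (an id plus a 512-byte payload); the source count and bytes-per-source values used in the note after the code are made-up numbers, while the 64 KiB register space is the figure given above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in with the same shape as struct kvm_irqchip. */
struct toy_irqchip {
	uint32_t chip_id;
	uint32_t pad;
	union {
		char dummy[512];	/* fixed-size state payload */
	} chip;
};

#define TOY_IRQCHIP_PAYLOAD	sizeof(((struct toy_irqchip *)0)->chip)
#define TOY_MPIC_REG_SPACE	(64u * 1024u)	/* 64 KiB register window */

static_assert(TOY_IRQCHIP_PAYLOAD == 512, "payload is 512 bytes");

/* Bytes needed for per-source state (both inputs are assumptions). */
size_t toy_source_state_bytes(size_t nr_sources, size_t bytes_per_source)
{
	return nr_sources * bytes_per_source;
}

/* Nonzero if a state of the given size fits in the fixed payload. */
int toy_fits_in_irqchip(size_t state_bytes)
{
	return state_bytes <= TOY_IRQCHIP_PAYLOAD;
}
```

With an assumed 256 sources at 8 bytes each, toy_source_state_bytes(256, 8) is 2048 bytes, four times the payload, and that is before any of the register space is counted.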

> >                                        It'd also be using
> > KVM_SET_IRQCHIP to read data, which is the sort of thing you object
> > to later on regarding KVM_IRQ_LINE_STATUS.
> >
> Do not see why.

It's either that, or have the data direction of the "chip" field in  
KVM_GET_IRQCHIP not be entirely in the "get" direction.

> > Then there's the silliness of transporting 512 bytes just to read a
> > descriptor for transporting something else.
> >
> Yes, agree. But is this enough of a reason to introduce an entirely new
> interface? Is it on a performance-critical path? Doubt it, unless you
> abuse the interface to send interrupts, but then isn't it silly to
> do copy_from_user() twice to inject an interrupt like the proposed
> interface does?

It should probably be get_user() instead, which is pretty fast in the  
absence of a fault.

> > >For signaling irqs (I think this is what you mean by "commands")
> > >we have KVM_IRQ_LINE.
> >
> > It's one type of command.  Another is setting the address.  Another
> > is writing to registers that have side effects (this is how MSI
> > injection is done on MPIC, just as in real hardware).
> >
> Setting the address is setting an attribute. Sending MSI is a command.
> Things you set/get during init/migration are attributes. Things you do
> to cause side-effects are commands.

What if I set the address at a time that isn't init/migration (the  
hardware allows moving it, like a PCI BAR)?  Suddenly it becomes not an  
attribute due to how the caller uses it?

> > What is the benefit of KVM_IRQ_LINE over what MPIC does?  What real
> > (non-glue/wrapper) code can become common?
> >
> No new ioctl with exactly same result (well actually even faster since
> less copying is done).

Which ioctl would go away?

> You need to show us the benefits of the new interface
> vs existing one, not vice versa.

Well, as I said to Marcelo, the real reason why we probably need to
use the existing interface is irqfd.  That doesn't make the device  
control stuff go away.

> > And I really hope you don't want us to do MSIs the x86 way.
> >
> What is wrong with KVM_SIGNAL_MSI? Except that you'll need code to
> glue it to mpic.

We'll have to write extra code for it compared to the current way where  
it works with *zero* code beyond what is wanted for other purposes such  
as debug and (eventually) migration.  At least it's more direct than  
having to establish a GSI route...

> > In the XICS thread, Paul brought up the possibility of cascaded
> > MPICs.  It's not relevant to the systems we're trying to model, but
> > if one did want to use the in-kernel irqchip interface for that, it
> > would be really nice to be able to operate on a specific MPIC for
> > injection rather than have to come up with some sort of global
> > identifier (above and beyond the minor flattening we'd need to do to
> > represent a single MPIC's interrupts in a flat numberspace).
> >
> ARM encodes information in irq field of KVM_IRQ_LINE like that:
>   bits:  | 31 ... 24 | 23  ... 16 | 15    ...     0 |
>   field: | irq_type  | vcpu_index |   irq_number    |
> Why will similar approach not work?

Well, it conflicts with the GSI routing stuff, and I don't see an irq  
chip ID field...
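
For reference, the bit layout quoted above can be written out as pack/unpack helpers. This is only a sketch of the quoted diagram; the TOY_* macro names and the function names are hypothetical and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Field positions taken from the quoted diagram; names are made up. */
#define TOY_IRQ_TYPE_SHIFT	24
#define TOY_IRQ_VCPU_SHIFT	16
#define TOY_IRQ_TYPE_MASK	0xffu
#define TOY_IRQ_VCPU_MASK	0xffu
#define TOY_IRQ_NUM_MASK	0xffffu

/* Pack irq_type, vcpu_index and irq_number into one 32-bit irq value. */
uint32_t toy_irq_encode(uint32_t type, uint32_t vcpu, uint32_t num)
{
	assert(type <= TOY_IRQ_TYPE_MASK);
	assert(vcpu <= TOY_IRQ_VCPU_MASK);
	assert(num <= TOY_IRQ_NUM_MASK);
	return (type << TOY_IRQ_TYPE_SHIFT) |
	       (vcpu << TOY_IRQ_VCPU_SHIFT) | num;
}

uint32_t toy_irq_type(uint32_t irq)
{
	return (irq >> TOY_IRQ_TYPE_SHIFT) & TOY_IRQ_TYPE_MASK;
}

uint32_t toy_irq_vcpu(uint32_t irq)
{
	return (irq >> TOY_IRQ_VCPU_SHIFT) & TOY_IRQ_VCPU_MASK;
}

uint32_t toy_irq_number(uint32_t irq)
{
	return irq & TOY_IRQ_NUM_MASK;
}
```

All 32 bits are taken by the three fields, so selecting one of several cascaded controllers would mean carving bits out of an existing field or carrying the controller id somewhere else.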

But otherwise (or assuming you mean to use such an encoding when  
setting up a GSI route), I didn't say this part couldn't be made to  
work.  It will require new kernel code for managing a GSI table in a  
non-APIC way, and a new callback into the device code, but as I've said  
elsewhere I think we need it for irqfd anyway.  If I use KVM_IRQ_LINE  
for injecting interrupts, do you still object to the rest of it?

> > Could an error return be used for cases where the IRQ was not
> > delivered, in the very unlikely event that we want to implement
> > something similar on MPIC?
> We can, but I do not think it will be a good API. This condition is not
> an error.

-EBUSY seems appropriate enough...

> >                            Note again that MPIC's decision to use
> > or not use KVM_IRQ_LINE is only about what MPIC does; it is not
> > inherent in the device control API.
> That's the crux of the problem though. MPIC tries to be different just
> for the sake of being different. Why? The only explanation you provide
> is that the current API is "silly", not that you cannot implement MPIC
> with it or that it will be unnecessarily slow, just "silly".

It's not so much about "silliness" as that this new thing I added for other
reasons did the job just as well (again, except when it comes to  
irqfd), and avoided the need for a GSI table, etc.  IRQ injection was  
not the main point of the new interface.

> > >Other devices may get other commands that need a
> > >response, so if we design a generic interface we should take it into
> > >account. I think using KVM_SET_DEVICE_ATTR to inject interrupts is
> > >a misnomer: you do not set an internal device attribute, you toggle
> > >an external input. Maybe another ioctl, KVM_SEND_DEVICE_COMMAND, is
> > >needed.
> >
> > I see no need for a separate ioctl in terms of the underlying
> > infrastructure for distinguishing "attribute" from "write-only
> > command".  I'm open to improvements on what the ioctl is called.
> > It's basically like setting a register on a device, except I was
> > concerned that if we actually called it a "register" that people
> > would take it too literally and think it's only for the architected
> > register state of the emulated device.
> I agree "attribute" is a better name than "register", but injecting
> an interrupt is not setting an attribute.

It's a dynamic attribute -- the state of the input line.  Better names  
are welcome.  I don't see this difference as enough to warrant separate  
ioctls.

> > >> >ARM vGIC code, that is ready to go upstream, uses the old way
> > >> >too.  So it will be 2 archs against one.
> > >>
> > >> I wasn't aware that that's how it worked. :-P
> > >>
> > >What worked? That vGIC uses the existing interface, or that a
> > >non-generic interface used by many arches wins over a generic one
> > >used by only one arch?
> >
> > The latter.  Two wrongs don't make a right, and adding another
> > inextensible, device-specific API is not the answer to the existing
> > APIs being too inextensible and device/arch-specific.  Some portion
> > will always need to be device-specific because we're controlling the
> > creation of a specific device, but the glue does not need to be.
> >
> This is not "adding another inextensible, device-specific API" vs
> "adding cool generic extensible API" though. It is "using existing
> inextensible, device-specific API" vs "adding cool generic extensible
> API".

The "existing inextensible device-specific API" doesn't have support  
for this "specific device".  Something new has to be added one way or  
another.

> > >APIs are easy to add and impossible to remove.
> >
> > That's why I want to get it right this time.
> >
> And what if you fail?

That's always a possibility of course.  I don't think that's a good  
reason to avoid trying to move in the right direction.

> What if the next architecture brings a new developer who proclaims
> your new interface "silly" since it does not allow for device
> destruction and does not return a file descriptor for a newly created
> device that userspace can select on to wait for the device's events,
> or mmap memory for fast userspace/device communication?

The device id that gets returned is arbitrary; you could turn it into  
an fd later with no loss of compatibility.

Device destruction would complicate things and I would not support  
requiring all devices to allow it.  If someone wanted to add it for  
certain devices, at the interface level it would just be a new ioctl.

-Scott

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
@ 2013-02-21  2:05               ` Scott Wood
  0 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-02-21  2:05 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Alexander Graf, kvm-ppc, kvm, Christoffer Dall

  On 02/20/2013 07:09:49 AM, Gleb Natapov wrote:
> On Tue, Feb 19, 2013 at 03:16:37PM -0600, Scott Wood wrote:
> > On 02/19/2013 06:24:18 AM, Gleb Natapov wrote:
> > >On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
> > >> The ability to set/get attributes is needed.  Sorry, but "get or  
> set
> > >> one blob of data, up to 512 bytes, for the entire irqchip" is  
> just
> > >> not good enough -- assuming you don't want us to start sticking
> > >> pointers and commands in *that* data. :-)
> > >>
> > >Proposed interface sticks pointers into ioctl data, so why doing
> > >the same
> > >for KVM_SET_IRQCHIP/KVM_GET_IRQCHIP makes you smile.
> >
> > There's a difference between putting a pointer in an ioctl control
> > structure that is specifically documented as being that way (as in
> > ONE_REG), versus taking an ioctl that claims to be setting/getting a
> > blob of state and embedding pointers in it.  It would be like
> > sticking a pointer in the attribute payload of this API, which I
> > think is something to be discouraged.
> If documentation is what differentiate for you between silly and smart
> then write documentation instead of new interfaces.

You mean like what we did with SREGS, that got deprecated and replaced  
with ONE_REG?

How is writing documentation not creating new interfaces, if the  
documentation is different from what the interface is currently  
understood to do?
Note that Marcelo seems to view KVM_SET_IRQCHIP as effectively being a  
device reset, which is rather different.

> KVM_SET_IRQCHIP/KVM_GET_IRQCHIP is defined to operate on blob of data  
> on
> x86, nothing prevent you from adding MPIC specifics to the interface,
> Add mpic state into kvm_irqchip structure and if 512 bytes is not  
> enough
> for you to transfer the state put pointers there and _document_ them.

So basically, you want me to keep this interface but share the ioctl  
number with an older interface? :-P

> But with 512 bytes you can transfer properties inline, so you probably
> do not need pointer there anyway. I see you have three properties 2 of
> them 32bit and one 64bit.

Three *groups* of properties.  One of the property groups is per  
source, and we can have hundreds of sources.  Another exposes the  
register space, which is 64 KiB (admittedly it's somewhat sparse, but  
there's more than 512 bytes of real data in there).  And we don't  
necessarily want to set *everything*.

> >                                        It'd also be using
> > KVM_SET_IRQCHIP to read data, which is the sort of thing you object
> > to later on regarding KVM_IRQ_LINE_STATUS.
> >
> Do not see why.

It's either that, or have the data direction of the "chip" field in  
KVM_GET_IRQCHIP not be entirely in the "get" direction.

> > Then there's the silliness of transporting 512 bytes just to read a
> > descriptor for transporting something else.
> >
> Yes, agree. But is this enough of a reason to introduce entirely new
> interface? Is it on performance critical path? Doubt it, unless you
> abuse the interface to send interrupts, but then isn't it silty to
> do copy_from_user() twice to inject an interrupt like proposed  
> interface
> does?

It should probably be get_user() instead, which is pretty fast in the  
absence of a fault.

> > >For signaling irqs (I think this is what you mean by "commands")
> > >we have KVM_IRQ_LINE.
> >
> > It's one type of command.  Another is setting the address.  Another
> > is writing to registers that have side effects (this is how MSI
> > injection is done on MPIC, just as in real hardware).
> >
> Setting the address is setting an attribute. Sending MSI is a command.
> Things you set/get during init/migration are attributes. Things you do
> to cause side-effects are commands.

What if I set the address at a time that isn't init/migration (the  
hardware allows moving it, like a PCI BAR)?  Suddenly it becomes not an  
attribute due to how the caller uses it?

> > What is the benefit of KVM_IRQ_LINE over what MPIC does?  What real
> > (non-glue/wrapper) code can become common?
> >
> No new ioctl with exactly same result (well actually even faster since
> less copying is done).

Which ioctl would go away?

> You need to show us the benefits of the new interface
> vs existing one, not vice versa.

Well, as I said to Marcello, the real reason why we probably need to  
use the existing interface is irqfd.  That doesn't make the device  
control stuff go away.

> > And I really hope you don't want us to do MSIs the x86 way.
> >
> What is wrong with KVM_SIGNAL_MSI? Except that you'll need code to  
> glue it
> to mpic.

We'll have to write extra code for it compared to the current way where  
it works with *zero* code beyond what is wanted for other purposes such  
as debug and (eventually) migration.  At least it's more direct than  
having to establish a GSI route...

> > In the XICS thread, Paul brought up the possibliity of cascaded
> > MPICs.  It's not relevant to the systems we're trying to model, but
> > if one did want to use the in-kernel irqchip interface for that, it
> > would be really nice to be able to operate on a specific MPIC for
> > injection rather than have to come up with some sort of global
> > identifier (above and beyond the minor flattening we'd need to do to
> > represent a single MPIC's interrupts in a flat numberspace).
> >
> ARM encodes information in irq field of KVM_IRQ_LINE like that:
>   bits:  | 31 ... 24 | 23  ... 16 | 15    ...     0 |
>   field: | irq_type  | vcpu_index |   irq_number    |
> Why will similar approach not work?

Well, it conflicts with the GSI routing stuff, and I don't see an irq  
chip ID field...

But otherwise (or assuming you mean to use such an encoding when  
setting up a GSI route), I didn't say this part couldn't be made to  
work.  It will require new kernel code for managing a GSI table in a  
non-APIC way, and a new callback into the device code, but as I've said  
elsewhere I think we need it for irqfd anyway.  If I use KVM_IRQ_LINE  
for injecting interrupts, do you still object to the rest of it?

> > Could an error return be used for cases where the IRQ was not
> > delivered, in the very unlikely event that we want to implement
> > something similar on MPIC?
> We can, but I do not think it will be good API. This condition is not  
> an
> error.

-EBUSY seems appropriate enough...

> >                            Note again that MPIC's decision to use
> > or not use KVM_IRQ_LINE is only about what MPIC does; it is not
> > inherent in the device control API.
> That's the crux of the problem though. MPIC tries to be different just
> for the sake to be different. Why? The only explanation you provide is
> because current API is "silly", not that you cannot implement MPIC  
> with
> it or it will be unnecessary slow, just "silly".

It's not about "silliness" as that this new thing I added for other  
reasons did the job just as well (again, except when it comes to  
irqfd), and avoided the need for a GSI table, etc.  IRQ injection was  
not the main point of the new interface.

> > >Other devices may get other commands that need
> > >response, so if we design generic interface we should take it into
> > >account. I think using KVM_SET_DEVICE_ATTR to inject interrupts is  
> a
> > >misnomer, you do not set internal device attribute, you toggle
> > >external
> > >input. My be another ioctl KVM_SEND_DEVICE_COMMAND is needed.
> >
> > I see no need for a separate ioctl in terms of the underlying
> > infrastructure for distinguishing "attribute" from "write-only
> > command".  I'm open to improvements on what the ioctl is called.
> > It's basically like setting a register on a device, except I was
> > concerned that if we actually called it a "register" that people
> > would take it too literally and think it's only for the architected
> > register state of the emulated device.
> I agree "attribute" is better name than "register", but injecting
> interrupt is not setting an attribute.

It's a dynamic attribute -- the state of the input line.  Better names  
are welcome.  I don't see this difference as enough to warrant separate  
ioctls.

> > >> >ARM vGIC code, that is ready to go upstream, uses old way too. So
> > >> >it will be 2 archs against one.
> > >>
> > >> I wasn't aware that that's how it worked. :-P
> > >>
> > >What worked? That vGIC uses existing interface or that non generic
> > >interface used by many arches wins generic one used by only one arch?
> >
> > The latter.  Two wrongs don't make a right, and adding another
> > inextensible, device-specific API is not the answer to the existing
> > APIs being too inextensible and device/arch-specific.  Some portion
> > will always need to be device-specific because we're controlling the
> > creation and of a specific device, but the glue does not need to be.
> >
> This is not "adding another inextensible, device-specific API" vs
> "adding cool generic extensible API" though. It is "using existing
> inextensible, device-specific API" vs "adding cool generic extensible API".

The "existing inextensible device-specific API" doesn't have support  
for this "specific device".  Something new has to be added one way or  
another.

> > >APIs are easy to add and impossible to remove.
> >
> > That's why I want to get it right this time.
> >
> And what if you'll fail?

That's always a possibility of course.  I don't think that's a good  
reason to avoid trying to move in the right direction.

> What if next architecture will bring new
> developer that will proclaim your new interface "silly" since it does not
> allow for device destruction and do not return file descriptor for newly
> created device that userspace can do select on to wait for a device's
> events or mmap memory for fast userspace/device communication?

The device id that gets returned is arbitrary; you could turn it into  
an fd later with no loss of compatibility.

Device destruction would complicate things and I would not support  
requiring all devices to allow it.  If someone wanted to add it for  
certain devices, at the interface level it would just be a new ioctl.

-Scott

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-21  1:28                     ` Scott Wood
@ 2013-02-21  6:39                       ` Gleb Natapov
  -1 siblings, 0 replies; 261+ messages in thread
From: Gleb Natapov @ 2013-02-21  6:39 UTC (permalink / raw)
  To: Scott Wood
  Cc: Marcelo Tosatti, Alexander Graf, kvm-ppc, kvm, Christoffer Dall

On Wed, Feb 20, 2013 at 07:28:52PM -0600, Scott Wood wrote:
> On 02/20/2013 06:14:37 PM, Marcelo Tosatti wrote:
> >On Wed, Feb 20, 2013 at 05:53:20PM -0600, Scott Wood wrote:
> >> >It is then not necessary to set device attributes on a live guest and
> >> >deal with the complications associated with that.
> >>
> >> Which complications?
> >>
> >> -Scott
> >
> >Semantics of individual attribute writes, for one.
> 
> When the attribute is a device register, the hardware documentation
> takes care of that.  Otherwise, the semantics are documented in the
> device-specific documentation -- which can include restricting when
> the access is allowed.  Same as with any other interface
> documentation.
> 
> I suppose mpic.txt could use an additional statement that
> KVM_DEV_MPIC_GRP_REGISTER performs an access as if it were performed
> by the guest.
> 
If access to an attribute has a guest-visible side effect, you cannot
use this interface for migration. This is exactly the point in favor
of distinguishing accesses that have side effects (COMMAND or whatever)
from accesses that set/get an attribute (SET|GET_ATTR).

--
			Gleb.

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-21  2:05               ` Scott Wood
@ 2013-02-21  8:22                 ` Gleb Natapov
  -1 siblings, 0 replies; 261+ messages in thread
From: Gleb Natapov @ 2013-02-21  8:22 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm, Christoffer Dall

On Wed, Feb 20, 2013 at 08:05:12PM -0600, Scott Wood wrote:
>  On 02/20/2013 07:09:49 AM, Gleb Natapov wrote:
> >On Tue, Feb 19, 2013 at 03:16:37PM -0600, Scott Wood wrote:
> >> On 02/19/2013 06:24:18 AM, Gleb Natapov wrote:
> >> >On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
> >> >> The ability to set/get attributes is needed.  Sorry, but "get or set
> >> >> one blob of data, up to 512 bytes, for the entire irqchip" is just
> >> >> not good enough -- assuming you don't want us to start sticking
> >> >> pointers and commands in *that* data. :-)
> >> >>
> >> >Proposed interface sticks pointers into ioctl data, so why does doing
> >> >the same for KVM_SET_IRQCHIP/KVM_GET_IRQCHIP make you smile?
> >>
> >> There's a difference between putting a pointer in an ioctl control
> >> structure that is specifically documented as being that way (as in
> >> ONE_REG), versus taking an ioctl that claims to be setting/getting a
> >> blob of state and embedding pointers in it.  It would be like
> >> sticking a pointer in the attribute payload of this API, which I
> >> think is something to be discouraged.
> >If documentation is what differentiates for you between silly and smart
> >then write documentation instead of new interfaces.
> 
> You mean like what we did with SREGS, that got deprecated and
> replaced with ONE_REG?
> 
SREGS is implemented by ppc. I see ONE_REG as an addition to the REGS
interface. You can access all registers at once or you can access them
one by one. If there is a need we can add MULTIPLE_REGS that will get
a list of requested REGS. The interface is not overly generic, i.e. it does
not try to replace KVM_RUN by writing a special register.

> How is writing documentation not creating new interfaces, if the
> documentation is different from what the interface is currently
> understood to do?
In this case you misunderstand what I am proposing. The interface
sets/gets irq chip state and this is what it will continue to do. What
needs to be documented is the format of the mpic irqchip state. Nobody
expects it to be the same for all irq chips.
 
> Note that Marcelo seems to view KVM_SET_IRQCHIP as effectively being
> a device reset, which is rather different.
> 
Marcelo views the interface exactly the same as I view it. It is used for
initializing device state (reset is one of those times when it happens)
and to transfer device state for migration purposes.

> >KVM_SET_IRQCHIP/KVM_GET_IRQCHIP is defined to operate on blob of data on
> >x86, nothing prevents you from adding MPIC specifics to the interface.
> >Add mpic state into kvm_irqchip structure and if 512 bytes is not enough
> >for you to transfer the state put pointers there and _document_ them.
> 
> So basically, you want me to keep this interface but share the ioctl
> number with an older interface? :-P
If that is what you want. Obviously you can drop the things that make the
proposed interface a generic one.

> 
> >But with 512 bytes you can transfer properties inline, so you probably
> >do not need pointer there anyway. I see you have three properties 2 of
> >them 32bit and one 64bit.
> 
> Three *groups* of properties.  One of the property groups is per
> source, and we can have hundreds of sources.  Another exposes the
> register space, which is 64 KiB (admittedly it's somewhat sparse,
> but there's more than 512 bytes of real data in there). 
I mean that you still access each property one by one but since each
individual one is not bigger than 64bit you can put it inline and do not
need pointers, or you can access groups of properties if each one fits
into the buffer.
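
To illustrate what "inline" means here, a minimal userspace sketch follows.
It is not the structure from the RFC and not kvm_irqchip; every name in it
(toy_attr, toy_irqchip, TOY_GRP_*) is made up for illustration. It only
shows a per-property access where the value travels in the argument itself,
so no user pointer has to be dereferenced.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define TOY_NUM_SOURCES 256

/* Hypothetical property groups: one misc group, one per-source group. */
enum toy_group { TOY_GRP_MISC, TOY_GRP_SOURCE };

/* Hypothetical ioctl argument: the value is carried inline, no pointer. */
struct toy_attr {
	uint32_t group;	/* which group of properties */
	uint32_t attr;	/* index of the property within the group */
	uint64_t val;	/* the property value itself, at most 64 bits */
};

struct toy_irqchip {
	uint64_t base_addr;
	uint32_t source[TOY_NUM_SOURCES];
};

/* Set one property; returns 0 or a negative errno-style code. */
static int toy_set_attr(struct toy_irqchip *chip, const struct toy_attr *a)
{
	assert(chip && a);

	switch (a->group) {
	case TOY_GRP_MISC:
		if (a->attr != 0)
			return -ENXIO;
		chip->base_addr = a->val;
		return 0;
	case TOY_GRP_SOURCE:
		if (a->attr >= TOY_NUM_SOURCES || a->val > UINT32_MAX)
			return -EINVAL;
		chip->source[a->attr] = (uint32_t)a->val;
		return 0;
	default:
		return -ENXIO;
	}
}
```

With hundreds of sources this still means one call per property, which is
the trade-off being argued about above.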

>                                                    And we
> don't necessarily want to set *everything*.
What are those cases? You do need to on reset/migration.

> 
> >>                                        It'd also be using
> >> KVM_SET_IRQCHIP to read data, which is the sort of thing you object
> >> to later on regarding KVM_IRQ_LINE_STATUS.
> >>
> >Do not see why.
> 
> It's either that, or have the data direction of the "chip" field in
> KVM_GET_IRQCHIP not be entirely in the "get" direction.
> 
Still do not follow. Example?

> >> Then there's the silliness of transporting 512 bytes just to read a
> >> descriptor for transporting something else.
> >>
> >Yes, agree. But is this enough of a reason to introduce entirely new
> >interface? Is it on performance critical path? Doubt it, unless you
> >abuse the interface to send interrupts, but then isn't it silly to
> >do copy_from_user() twice to inject an interrupt like proposed
> >interface does?
> 
> It should probably be get_user() instead, which is pretty fast in
> the absence of a fault.
> 
> >> >For signaling irqs (I think this is what you mean by "commands")
> >> >we have KVM_IRQ_LINE.
> >>
> >> It's one type of command.  Another is setting the address.  Another
> >> is writing to registers that have side effects (this is how MSI
> >> injection is done on MPIC, just as in real hardware).
> >>
> >Setting the address is setting an attribute. Sending MSI is a command.
> >Things you set/get during init/migration are attributes. Things you do
> >to cause side-effects are commands.
> 
> What if I set the address at a time that isn't init/migration (the
> hardware allows moving it, like a PCI BAR)?  Suddenly it becomes not
> an attribute due to how the caller uses it?
> 
What's the interface for the guest to move it? Why does it go via userspace?
You can move the APIC base too, but this does not involve userspace. But
even if you do go via userspace, it is just a guest asking to change device
configuration, so using SET_ATTR to set the new configuration is fine.

> >> What is the benefit of KVM_IRQ_LINE over what MPIC does?  What real
> >> (non-glue/wrapper) code can become common?
> >>
> >No new ioctl with exactly same result (well actually even faster since
> >less copying is done).
> 
> Which ioctl would go away?
> 
Those that you propose in your new interface.

> >You need to show us the benefits of the new interface
> >vs existing one, not vice versa.
> 
> Well, as I said to Marcelo, the real reason why we probably need to
> use the existing interface is irqfd.  That doesn't make the device
> control stuff go away.
> 
> >> And I really hope you don't want us to do MSIs the x86 way.
> >>
> >What is wrong with KVM_SIGNAL_MSI? Except that you'll need code to glue
> >it to mpic.
> 
> We'll have to write extra code for it compared to the current way
> where it works with *zero* code beyond what is wanted for other
> purposes such as debug and (eventually) migration.  At least it's
> more direct than having to establish a GSI route...
If just writing a register causes an MSI to be sent, how do you distinguish
between a write that should send an MSI and a write that is done on migration
to transfer the current value? We had that problem with MSRs on x86. We had
to, eventually, add a flag that tells us the reason for the MSR access.
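
A minimal sketch of that kind of flag, as a userspace toy rather than MPIC
or x86 code: the device, its fields and the host_initiated parameter name
are all assumptions made for this example. The same register write either
just restores state or also fires the side effect, depending on who asked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy device with one register whose guest write raises an MSI. */
struct toy_dev {
	uint32_t msi_reg;	/* last value written to the trigger register */
	unsigned int msis_sent;	/* how many MSIs the device has raised */
};

/*
 * Write the trigger register.
 * host_initiated == true:  state restore (e.g. migration), no side effect.
 * host_initiated == false: access on behalf of the guest, raises an MSI.
 */
static void toy_write_msi_reg(struct toy_dev *dev, uint32_t val,
			      bool host_initiated)
{
	assert(dev);

	dev->msi_reg = val;
	if (!host_initiated)
		dev->msis_sent++;
}
```

Without something like the flag, the restore path and the injection path
are indistinguishable to the device code.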

> 
> >> In the XICS thread, Paul brought up the possibility of cascaded
> >> MPICs.  It's not relevant to the systems we're trying to model, but
> >> if one did want to use the in-kernel irqchip interface for that, it
> >> would be really nice to be able to operate on a specific MPIC for
> >> injection rather than have to come up with some sort of global
> >> identifier (above and beyond the minor flattening we'd need to do to
> >> represent a single MPIC's interrupts in a flat numberspace).
> >>
> >ARM encodes information in irq field of KVM_IRQ_LINE like that:
> >  bits:  | 31 ... 24 | 23  ... 16 | 15    ...     0 |
> >  field: | irq_type  | vcpu_index |   irq_number    |
> >Why will similar approach not work?
> 
> Well, it conflicts with the GSI routing stuff, and I don't see an
> irq chip ID field...
It does :( Can't say I am happy about it, but I skipped the discussion
about the interface back then and it is too late to complain now. Since,
as you noticed, irqfd interfaces with irq routing, I wonder what ARM's
plan is for it. But if you choose to go the ARM way, the format is ARM
specific, so you can use your own encoding and put irq chip information there.
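
For reference, the bit layout quoted above can be packed and unpacked as in
the sketch below. The field positions come from the table; the macro and
function names are invented for this example and are not the identifiers
used by the ARM code. An MPIC-specific encoding would be free to pick
different fields, for instance an irq chip index, which is not shown here.

```c
#include <assert.h>
#include <stdint.h>

/* Field positions from the quoted table; names are illustrative only. */
#define TOY_IRQ_TYPE_SHIFT	24
#define TOY_IRQ_TYPE_MASK	0xffu
#define TOY_IRQ_VCPU_SHIFT	16
#define TOY_IRQ_VCPU_MASK	0xffu
#define TOY_IRQ_NUM_MASK	0xffffu

/* Build the 32-bit irq field from its three components. */
static inline uint32_t toy_irq_pack(uint32_t type, uint32_t vcpu, uint32_t num)
{
	assert(type <= TOY_IRQ_TYPE_MASK);
	assert(vcpu <= TOY_IRQ_VCPU_MASK);
	assert(num <= TOY_IRQ_NUM_MASK);

	return (type << TOY_IRQ_TYPE_SHIFT) | (vcpu << TOY_IRQ_VCPU_SHIFT) | num;
}

static inline uint32_t toy_irq_type(uint32_t irq)
{
	return (irq >> TOY_IRQ_TYPE_SHIFT) & TOY_IRQ_TYPE_MASK;
}

static inline uint32_t toy_irq_vcpu(uint32_t irq)
{
	return (irq >> TOY_IRQ_VCPU_SHIFT) & TOY_IRQ_VCPU_MASK;
}

static inline uint32_t toy_irq_num(uint32_t irq)
{
	return irq & TOY_IRQ_NUM_MASK;
}
```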

> 
> But otherwise (or assuming you mean to use such an encoding when
> setting up a GSI route), I didn't say this part couldn't be made to
> work.  It will require new kernel code for managing a GSI table in a
> non-APIC way, and a new callback into the device code, but as I've
> said elsewhere I think we need it for irqfd anyway.  If I use
> KVM_IRQ_LINE for injecting interrupts, do you still object to the
> rest of it?
The rest of what, the proposed interface? There are two separate discussions
interleaved here. The first is "do we need to introduce a new generic
interface for device creation when the existing one, albeit not ideal, can be
used", and I am OK with that as long as ARM moves to it for 3.10, although
I would prefer to have some example of what this interface will be used
for besides irq chips; otherwise it will be just another way to create an
irqchip. The second is "what should the interface look like". And here I
think that a strong distinction is needed between setting attributes
and sending commands with side effects, for reasons explained all over
this ml thread.

> 
> >> Could an error return be used for cases where the IRQ was not
> >> delivered, in the very unlikely event that we want to implement
> >> something similar on MPIC?
> >We can, but I do not think it will be good API. This condition is
> >not an error.
> 
> -EBUSY seems appropriate enough...
It is. Other commands may have more elaborate return status. Generic
interface should take this into account.

> 
> >>                            Note again that MPIC's decision to use
> >> or not use KVM_IRQ_LINE is only about what MPIC does; it is not
> >> inherent in the device control API.
> >That's the crux of the problem though. MPIC tries to be different just
> >for the sake of being different. Why? The only explanation you provide is
> >because current API is "silly", not that you cannot implement MPIC
> >with it or it will be unnecessarily slow, just "silly".
> 
> It's not about "silliness" as that this new thing I added for other
> reasons did the job just as well (again, except when it comes to
> irqfd), and avoided the need for a GSI table, etc.  IRQ injection
> was not the main point of the new interface.
Having a generic interface for device creation and then making some devices
special by allowing them to be used with KVM_IRQ_LINE makes little
sense, so IRQ injection may not be the main point of the new interface,
but communication with a device created by the new interface is
something that cannot be ignored at the interface design stage.
 
> 
> >> >Other devices may get other commands that need
> >> >response, so if we design generic interface we should take it into
> >> >account. I think using KVM_SET_DEVICE_ATTR to inject interrupts is a
> >> >misnomer, you do not set internal device attribute, you toggle
> >> >external input. Maybe another ioctl KVM_SEND_DEVICE_COMMAND is needed.
> >>
> >> I see no need for a separate ioctl in terms of the underlying
> >> infrastructure for distinguishing "attribute" from "write-only
> >> command".  I'm open to improvements on what the ioctl is called.
> >> It's basically like setting a register on a device, except I was
> >> concerned that if we actually called it a "register" that people
> >> would take it too literally and think it's only for the architected
> >> register state of the emulated device.
> >I agree "attribute" is better name than "register", but injecting
> >interrupt is not setting an attribute.
> 
> It's a dynamic attribute -- the state of the input line.  Better
> names are welcome.  I don't see this difference as enough to warrant
> separate ioctls.
As long as you use the same attribute for migration and interrupt injection
purposes, I do. If you use separate attributes for migration and interrupt
injection, then not having a separate ioctl is just a hack.

> 
> >> >> >ARM vGIC code, that is ready to go upstream, uses old way too. So
> >> >> >it will be 2 archs against one.
> >> >>
> >> >> I wasn't aware that that's how it worked. :-P
> >> >>
> >> >What worked? That vGIC uses existing interface or that non generic
> >> >interface used by many arches wins generic one used by only one arch?
> >>
> >> The latter.  Two wrongs don't make a right, and adding another
> >> inextensible, device-specific API is not the answer to the existing
> >> APIs being too inextensible and device/arch-specific.  Some portion
> >> will always need to be device-specific because we're controlling the
> >> creation and of a specific device, but the glue does not need to be.
> >>
> >This is not "adding another inextensible, device-specific API" vs
> >"adding cool generic extensible API" though. It is "using existing
> >inextensible, device-specific API" vs "adding cool generic extensible API".
> 
> The "existing inextensible device-specific API" doesn't have support
> for this "specific device".  Something new has to be added one way
> or another.
> 
And extending the existing interface (which already supports more than one
irqchip, BTW) wins.

> >> >APIs are easy to add and impossible to remove.
> >>
> >> That's why I want to get it right this time.
> >>
> >And what if you'll fail?
> 
> That's always a possibility of course.  I don't think that's a good
> reason to avoid trying to move in the right direction.
It is not, but that is not the point I am trying to make :)

> 
> >What if next architecture will bring new
> >developer that will proclaim your new interface "silly" since it does not
> >allow for device destruction and do not return file descriptor for newly
> >created device that userspace can do select on to wait for a device's
> >events or mmap memory for fast userspace/device communication?
> 
> The device id that gets returned is arbitrary; you could turn it
> into an fd later with no loss of compatibility.
> 
> Device destruction would complicate things and I would not support
> requiring all devices to allow it.  If someone wanted to add it for
> certain devices, at the interface level it would just be a new
> ioctl.
> 
And here is the point that I am trying to make. You propose how your
interface can be extended and I agree, it can. But that other guy will
see it in another light: why should I extend an interface that is broken
instead of providing a new, perfect one?

--
			Gleb.

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
@ 2013-02-21  8:22                 ` Gleb Natapov
  0 siblings, 0 replies; 261+ messages in thread
From: Gleb Natapov @ 2013-02-21  8:22 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm, Christoffer Dall

On Wed, Feb 20, 2013 at 08:05:12PM -0600, Scott Wood wrote:
>  On 02/20/2013 07:09:49 AM, Gleb Natapov wrote:
> >On Tue, Feb 19, 2013 at 03:16:37PM -0600, Scott Wood wrote:
> >> On 02/19/2013 06:24:18 AM, Gleb Natapov wrote:
> >> >On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
> >> >> The ability to set/get attributes is needed.  Sorry, but "get
> >or set
> >> >> one blob of data, up to 512 bytes, for the entire irqchip" is
> >just
> >> >> not good enough -- assuming you don't want us to start sticking
> >> >> pointers and commands in *that* data. :-)
> >> >>
> >> >Proposed interface sticks pointers into ioctl data, so why doing
> >> >the same
> >> >for KVM_SET_IRQCHIP/KVM_GET_IRQCHIP makes you smile.
> >>
> >> There's a difference between putting a pointer in an ioctl control
> >> structure that is specifically documented as being that way (as in
> >> ONE_REG), versus taking an ioctl that claims to be setting/getting a
> >> blob of state and embedding pointers in it.  It would be like
> >> sticking a pointer in the attribute payload of this API, which I
> >> think is something to be discouraged.
> >If documentation is what differentiate for you between silly and smart
> >then write documentation instead of new interfaces.
> 
> You mean like what we did with SREGS, that got deprecated and
> replaced with ONE_REG?
> 
SREGS is implemented by ppc. I see ONE_REG as addition to REGS
interface. You can access all register at once or you can access them
one by one. If there is a need we can add MULTIPLE_REGS that will get
list of requested REGS. The interface is not over generic i.e it does
not try to replace KVM_RUN by writing special register.

> How is writing documentation not creating new interfaces, if the
> documentation is different from what the interface is currently
> understood to do?
If this case you misunderstand what I am proposing. The interface
sets/gets irq chip state and this is what it will continue to do. What
needs to be documented is the format of mpic irqchip. Nobody expects it
to be the same for all irq chips.
 
> Note that Marcelo seems to view KVM_SET_IRQCHIP as effectively being
> a device reset, which is rather different.
> 
Marcelo views the interface exactly the same as I view it. It is used for
initializing device state (reset is one of those time when it happens)
and to transfer device state for migration purposes.

> >KVM_SET_IRQCHIP/KVM_GET_IRQCHIP is defined to operate on blob of
> >data on
> >x86, nothing prevent you from adding MPIC specifics to the interface,
> >Add mpic state into kvm_irqchip structure and if 512 bytes is not
> >enough
> >for you to transfer the state put pointers there and _document_ them.
> 
> So basically, you want me to keep this interface but share the ioctl
> number with an older interface? :-P
If that is what you want. Obviously you can drop things that makes
proposed interface generic one.

> 
> >But with 512 bytes you can transfer properties inline, so you probably
> >do not need pointer there anyway. I see you have three properties 2 of
> >them 32bit and one 64bit.
> 
> Three *groups* of properties.  One of the property groups is per
> source, and we can have hundreds of sources.  Another exposes the
> register space, which is 64 KiB (admittedly it's somewhat sparse,
> but there's more than 512 bytes of real data in there). 
I mean that you still access each property one by one but since each
individual one is not bigger than 64bit you can put it inline and do not
need pointers, or you can access groups of properties if each one fits
into the buffer.

>                                                    And we
> don't necessarily want to set *everything*.
What are those cases? You do need to on reset/migration.

> 
> >>                                        It'd also be using
> >> KVM_SET_IRQCHIP to read data, which is the sort of thing you object
> >> to later on regarding KVM_IRQ_LINE_STATUS.
> >>
> >Do not see why.
> 
> It's either that, or have the data direction of the "chip" field in
> KVM_GET_IRQCHIP not be entirely in the "get" direction.
> 
Still do not follow. Example?

> >> Then there's the silliness of transporting 512 bytes just to read a
> >> descriptor for transporting something else.
> >>
> >Yes, agree. But is this enough of a reason to introduce entirely new
> >interface? Is it on performance critical path? Doubt it, unless you
> >abuse the interface to send interrupts, but then isn't it silty to
> >do copy_from_user() twice to inject an interrupt like proposed
> >interface
> >does?
> 
> It should probably be get_user() instead, which is pretty fast in
> the absence of a fault.
> 
> >> >For signaling irqs (I think this is what you mean by "commands")
> >> >we have KVM_IRQ_LINE.
> >>
> >> It's one type of command.  Another is setting the address.  Another
> >> is writing to registers that have side effects (this is how MSI
> >> injection is done on MPIC, just as in real hardware).
> >>
> >Setting the address is setting an attribute. Sending MSI is a command.
> >Things you set/get during init/migration are attributes. Things you do
> >to cause side-effects are commands.
> 
> What if I set the address at a time that isn't init/migration (the
> hardware allows moving it, like a PCI BAR)?  Suddenly it becomes not
> an attribute due to how the caller uses it?
> 
What's the interface for guest to move it? Why it goes via userspace?
You can move APIC base too, but this does not involve userspace. But
even if you do go via userspace, it is just a guest asking to change device
configuration, so using SET_ATTR to set new configuration is fine.

> >> What is the benefit of KVM_IRQ_LINE over what MPIC does?  What real
> >> (non-glue/wrapper) code can become common?
> >>
> >No new ioctl with exactly same result (well actually even faster since
> >less copying is done).
> 
> Which ioctl would go away?
> 
Those that you propose in your new interface.

> >You need to show us the benefits of the new interface
> >vs existing one, not vice versa.
> 
> Well, as I said to Marcello, the real reason why we probably need to
> use the existing interface is irqfd.  That doesn't make the device
> control stuff go away.
> 
> >> And I really hope you don't want us to do MSIs the x86 way.
> >>
> >What is wrong with KVM_SIGNAL_MSI? Except that you'll need code to
> >glue it
> >to mpic.
> 
> We'll have to write extra code for it compared to the current way
> where it works with *zero* code beyond what is wanted for other
> purposes such as debug and (eventually) migration.  At least it's
> more direct than having to establish a GSI route...
If just writing a register cause MSI to be send how do you distinguish
between write that should send MSI and write that is done on migration
to transfer current value? We had that problem with MSRs on x86. We had
to, eventually, add a flag that tells us the reason of MSR access.

> 
> >> In the XICS thread, Paul brought up the possibliity of cascaded
> >> MPICs.  It's not relevant to the systems we're trying to model, but
> >> if one did want to use the in-kernel irqchip interface for that, it
> >> would be really nice to be able to operate on a specific MPIC for
> >> injection rather than have to come up with some sort of global
> >> identifier (above and beyond the minor flattening we'd need to do to
> >> represent a single MPIC's interrupts in a flat numberspace).
> >>
> >ARM encodes information in irq field of KVM_IRQ_LINE like that:
> > šbits:  | 31 ... 24 | 23  ... 16 | 15    ...     0 |
> >  field: | irq_type  | vcpu_index |   irq_number    |
> >Why will similar approach not work?
> 
> Well, it conflicts with the GSI routing stuff, and I don't see an
> irq chip ID field...
It does :( Can't say I am happy about it, but I skipped the discussion
about the interface back then and it is too late to complain now. Since,
as you notices, irqfd interfaces with irq routing I wonder what's ARM
plan about it. But if you choose to go ARM way the format is ARM specific,
so you can use your own encoding and put irq chip information there.

> 
> But otherwise (or assuming you mean to use such an encoding when
> setting up a GSI route), I didn't say this part couldn't be made to
> work.  It will require new kernel code for managing a GSI table in a
> non-APIC way, and a new callback into the device code, but as I've
> said elsewhere I think we need it for irqfd anyway.  If I use
> KVM_IRQ_LINE for injecting interrupts, do you still object to the
> rest of it?
The rest of what, proposed interface? There are two separate discussions
happening here interleaved. First is "do we need to introduce new generic
interface for device creation when existing one, albeit not ideal, can be
used" and I am OK with that as long as ARM moves to it for 3.10, although
I would prefer to have some example of what this interface will be used
for besides irq chips otherwise it will be just another way to create
irqchip. Second one is "how the interface should look like". And here I
think that strong distinction is needed between setting the attributes
and sending commands with side effects for reasons explained all over
this ml thread.

> 
> >> Could an error return be used for cases where the IRQ was not
> >> delivered, in the very unlikely event that we want to implement
> >> something similar on MPIC?
> >We can, but I do not think it will be good API. This condition is
> >not an
> >error.
> 
> -EBUSY seems appropriate enough...
It is. Other commands may have more elaborate return status. Generic
interface should take this into account.

> 
> >>                            Note again that MPIC's decision to use
> >> or not use KVM_IRQ_LINE is only about what MPIC does; it is not
> >> inherent in the device control API.
> >That's the crux of the problem though. MPIC tries to be different just
> >for the sake to be different. Why? The only explanation you provide is
> >because current API is "silly", not that you cannot implement MPIC
> >with
> >it or it will be unnecessary slow, just "silly".
> 
> It's not about "silliness" as that this new thing I added for other
> reasons did the job just as well (again, except when it comes to
> irqfd), and avoided the need for a GSI table, etc.  IRQ injection
> was not the main point of the new interface.
Having generic interface for device creation and then make some devices
special by allowing them to be used with KVM_IRQ_LINE makes little
sense, so IRQ injection may be not the main point of the new interface,
but communication with a device created by the new interface is
something that cannot be ignored at the interface design stage.
 
> 
> >> >Other devices may get other commands that need
> >> >response, so if we design generic interface we should take it into
> >> >account. I think using KVM_SET_DEVICE_ATTR to inject interrupts
> >is a
> >> >misnomer, you do not set internal device attribute, you toggle
> >> >external
> >> >input. My be another ioctl KVM_SEND_DEVICE_COMMAND is needed.
> >>
> >> I see no need for a separate ioctl in terms of the underlying
> >> infrastructure for distinguishing "attribute" from "write-only
> >> command".  I'm open to improvements on what the ioctl is called.
> >> It's basically like setting a register on a device, except I was
> >> concerned that if we actually called it a "register" that people
> >> would take it too literally and think it's only for the architected
> >> register state of the emulated device.
> >I agree "attribute" is better name than "register", but injecting
> >interrupt is not setting an attribute.
> 
> It's a dynamic attribute -- the state of the input line.  Better
> names are welcome.  I don't see this difference as enough to warrant
> separate ioctls.
As long as you use the same attribute for both migration and interrupt
injection, I do. If you use separate attributes for migration and interrupt
injection, then not having a separate ioctl is just a hack.

> 
> >> >> >The ARM vGIC code, which is ready to go upstream, uses the old way
> >> >> >too. So it will be 2 archs against one.
> >> >>
> >> >> I wasn't aware that that's how it worked. :-P
> >> >>
> >> >What worked? That vGIC uses the existing interface, or that a
> >> >non-generic interface used by many arches beats a generic one used by
> >> >only one arch?
> >>
> >> The latter.  Two wrongs don't make a right, and adding another
> >> inextensible, device-specific API is not the answer to the existing
> >> APIs being too inextensible and device/arch-specific.  Some portion
> >> will always need to be device-specific because we're controlling the
> >> creation and of a specific device, but the glue does not need to be.
> >>
> >This is not "adding another inextensible, device-specific API" vs
> >"adding cool generic extensible API" though. It is "using existing
> >inextensible, device-specific API" vs "adding cool generic extensible API".
> 
> The "existing inextensible device-specific API" doesn't have support
> for this "specific device".  Something new has to be added one way
> or another.
> 
And extending the existing interface (which already supports more than one
irqchip, BTW) wins.

> >> >APIs are easy to add and impossible to remove.
> >>
> >> That's why I want to get it right this time.
> >>
> >And what if you fail?
> 
> That's always a possibility of course.  I don't think that's a good
> reason to avoid trying to move in the right direction.
It is not, but that is not the point I am trying to make :)

> 
> >What if the next architecture brings a new developer who proclaims your
> >new interface "silly" since it does not allow for device destruction and
> >does not return a file descriptor for the newly created device that
> >userspace can select on to wait for the device's events, or mmap for
> >fast userspace/device communication?
> 
> The device id that gets returned is arbitrary; you could turn it
> into an fd later with no loss of compatibility.
> 
> Device destruction would complicate things and I would not support
> requiring all devices to allow it.  If someone wanted to add it for
> certain devices, at the interface level it would just be a new
> ioctl.
> 
And here is the point that I am trying to make. You propose how your
interface can be extended and I agree, it can. But that other guy will
see it in another light: "Why should I extend an interface that is broken
instead of providing a new, perfect one?"

--
			Gleb.

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-21  1:28                     ` Scott Wood
@ 2013-02-21 23:03                       ` Marcelo Tosatti
  -1 siblings, 0 replies; 261+ messages in thread
From: Marcelo Tosatti @ 2013-02-21 23:03 UTC (permalink / raw)
  To: Scott Wood; +Cc: Gleb Natapov, Alexander Graf, kvm-ppc, kvm, Christoffer Dall

On Wed, Feb 20, 2013 at 07:28:52PM -0600, Scott Wood wrote:
> On 02/20/2013 06:14:37 PM, Marcelo Tosatti wrote:
> >On Wed, Feb 20, 2013 at 05:53:20PM -0600, Scott Wood wrote:
> >> >It is then not necessary to set device attributes on a live guest
> >> >and deal with the complications associated with that.
> >>
> >> Which complications?
> >>
> >> -Scott
> >
> >Semantics of individual attribute writes, for one.
> 
> When the attribute is a device register, the hardware documentation
> takes care of that.  

You are not writing to the registers from the CPU point of view.

> Otherwise, the semantics are documented in the
> device-specific documentation -- which can include restricting when
> the access is allowed.  Same as with any other interface
> documentation.

Again, you are talking about the semantics of device access from the
point of view of software running on the machine. We are discussing the
hypervisor userspace <-> hypervisor kernel interface.

In general you never have to set attributes on a device after it has
been initialized, because there is state associated with that device
that requires proper handling (example: if you modify a timer counter
register of a timer device, any software timers used to emulate the
timer counter must be cancelled).
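
The timer example can be made concrete with a small sketch. This is a toy
model, not code from the patchset: struct toy_timer, its fields, and
toy_timer_set_count() are all hypothetical names, and the "software timer"
is reduced to a flag plus a deadline. It only shows why a counter write has
to touch more than the counter itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy timer device; every name here is hypothetical. */
struct toy_timer {
	uint32_t count;		/* emulated counter register */
	int sw_timer_armed;	/* nonzero while a host software timer is pending */
	uint64_t sw_deadline;	/* expiry time of that software timer */
};

/*
 * Update the counter the way a register write has to be handled: the
 * software timer armed for the old count is cancelled before the new
 * count takes effect, and a new one is armed only if the counter is
 * still running (nonzero).
 */
void toy_timer_set_count(struct toy_timer *t, uint32_t count, uint64_t now)
{
	assert(t != NULL);

	t->sw_timer_armed = 0;		/* cancel the stale software timer */
	t->count = count;
	if (count != 0) {
		t->sw_deadline = now + count;
		t->sw_timer_armed = 1;	/* re-arm for the new count */
	}
}
```

A write that only stored the new count would leave sw_deadline pointing at
the old expiry, which is the kind of stale state the paragraph above warns
about.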

Also, it is necessary to provide proper locking of device attribute
write versus vcpu device access. So far we have been focusing on having 
a lockless vcpu path.

So when device attributes can be modified has implications beyond what
may seem visible at first.

Are these reasonable arguments?

Basically abstract 'device attributes' are too abstract.

However, your proposed interface deals with the sucky capability, versioning
and namespace conflicts we have now. Note these items can probably be
improved separately.

> I suppose mpic.txt could use an additional statement that
> KVM_DEV_MPIC_GRP_REGISTER performs an access as if it were performed
> by the guest.
> 
> >Locking versus currently executing VCPUs, for another (see
> >KVM_SET_IRQ's RCU usage, for instance; that is something that should be
> >shared).
> 
> If you mean kvm_set_irq() in irq_comm.c, that's only relevant when
> you have a GSI routing table, which this patchset doesn't.
> 
> Assuming we end up having a routing table to support irqfd, we still
> can't share the code as is, since it's APIC-specific.

Suppose it is worthwhile to attempt to share code as much as possible.

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-21 23:03                       ` Marcelo Tosatti
@ 2013-02-22  2:00                         ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-02-22  2:00 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Gleb Natapov, Alexander Graf, kvm-ppc, kvm, Christoffer Dall

On 02/21/2013 05:03:32 PM, Marcelo Tosatti wrote:
> On Wed, Feb 20, 2013 at 07:28:52PM -0600, Scott Wood wrote:
> > On 02/20/2013 06:14:37 PM, Marcelo Tosatti wrote:
> > >On Wed, Feb 20, 2013 at 05:53:20PM -0600, Scott Wood wrote:
> > >> >It is then not necessary to set device attributes on a live guest
> > >> >and deal with the complications associated with that.
> > >>
> > >> Which complications?
> > >>
> > >> -Scott
> > >
> > >Semantics of individual attribute writes, for one.
> >
> > When the attribute is a device register, the hardware documentation
> > takes care of that.
> 
> You are not writing to the registers from the CPU point of view.

That's exactly how KVM_DEV_MPIC_GRP_REGISTER is defined and implemented  
on MPIC (with the exception of registers whose behavior changes based  
on which specific vcpu you use to access them).  If/when we have a need  
to set/get state in a different manner, that's a separate attribute  
group.

> > Otherwise, the semantics are documented in the
> > device-specific documentation -- which can include restricting when
> > the access is allowed.  Same as with any other interface
> > documentation.
> 
> Again, you are talking about the semantics of device access from the
> point of view of software running on the machine. We are discussing the
> hypervisor userspace <-> hypervisor kernel interface.

And I was talking about the userspace-to-hypervisor kernel interface  
documentation.  It just happens that one specific MPIC device group  
("when the attribute is a device register") is defined with respect to  
what guest software would see if it did a similar access.

> In general you never have to set attributes on a device after it has
> been initialized, because there is state associated with that device
> that requires proper handling (example: if you modify a timer counter
> register of a timer device, any software timers used to emulate the
> timer counter must be cancelled).

Yes, it requires proper handling and the MMIO code does that.

If and when we add raw state accessors, it's totally reasonable for  
there to be command/attribute-specific documented restrictions on when  
the access can be done.

> Also, it is necessary to provide proper locking of device attribute
> write versus vcpu device access. So far we have been focusing on having
> a lockless vcpu path.

How is device access related to vcpus?  Existing irqchip code is not  
lockless.

> So when device attributes can be modified has implications beyond what
> may seem visible at first.
> 
> Are these reasonable arguments?
> 
> Basically abstract 'device attributes' are too abstract.

It's up to the device-specific documentation to make them not abstract  
(I admit there are a few details missing in mpic.txt that I've pointed  
out in this thread -- it is RFC v1 after all).  This wouldn't be any  
different if we used separate ioctls for everything.  It's like saying  
abstract 'ioctl' is too abstract.

> However, your proposed interface deals with the sucky capability,
> versioning and namespace conflicts we have now. Note these items can
> probably be improved separately.

Any particular proposals?

> > I suppose mpic.txt could use an additional statement that
> > KVM_DEV_MPIC_GRP_REGISTER performs an access as if it were performed
> > by the guest.
> >
> > >Locking versus currently executing VCPUs, for another (see
> > >KVM_SET_IRQ's RCU usage, for instance; that is something that should be
> > >shared).
> >
> > If you mean kvm_set_irq() in irq_comm.c, that's only relevant when
> > you have a GSI routing table, which this patchset doesn't.
> >
> > Assuming we end up having a routing table to support irqfd, we still
> > can't share the code as is, since it's APIC-specific.
> 
> Suppose it is worthwhile to attempt to share code as much as possible.

Sure... my point is it isn't a case of "the common code is right over  
there, why aren't you using it?"  I'll try to share what I reasonably  
can, subject to my limited knowledge of how the APIC stuff works.  The  
irqfd code is substantial enough that refactoring for sharing should be  
worthwhile.  I'm not so sure about irq_comm.c.

-scott

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-21  8:22                 ` Gleb Natapov
@ 2013-02-22  2:17                   ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-02-22  2:17 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Alexander Graf, kvm-ppc, kvm, Christoffer Dall, Paul Mackerras

  On 02/21/2013 02:22:09 AM, Gleb Natapov wrote:
> On Wed, Feb 20, 2013 at 08:05:12PM -0600, Scott Wood wrote:
> >  On 02/20/2013 07:09:49 AM, Gleb Natapov wrote:
> > >On Tue, Feb 19, 2013 at 03:16:37PM -0600, Scott Wood wrote:
> > >> On 02/19/2013 06:24:18 AM, Gleb Natapov wrote:
> > >> >On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
> > >> >> The ability to set/get attributes is needed.  Sorry, but "get or
> > >> >> set one blob of data, up to 512 bytes, for the entire irqchip" is
> > >> >> just not good enough -- assuming you don't want us to start
> > >> >> sticking pointers and commands in *that* data. :-)
> > >> >>
> > >> >The proposed interface sticks pointers into ioctl data, so why does
> > >> >doing the same for KVM_SET_IRQCHIP/KVM_GET_IRQCHIP make you smile?
> > >>
> > >> There's a difference between putting a pointer in an ioctl control
> > >> structure that is specifically documented as being that way (as in
> > >> ONE_REG), versus taking an ioctl that claims to be setting/getting a
> > >> blob of state and embedding pointers in it.  It would be like
> > >> sticking a pointer in the attribute payload of this API, which I
> > >> think is something to be discouraged.
> > >If documentation is what differentiates silly from smart for you,
> > >then write documentation instead of new interfaces.
> >
> > You mean like what we did with SREGS, that got deprecated and
> > replaced with ONE_REG?
> >
> SREGS is implemented by ppc. I see ONE_REG as an addition to the REGS
> interface. You can access all registers at once or you can access them
> one by one. If there is a need we can add MULTIPLE_REGS that will get
> a list of requested REGS.

http://www.spinics.net/lists/kvm-ppc/msg04876.html
http://www.spinics.net/lists/kvm-ppc/msg05842.html

> The interface is not overly generic, i.e. it does
> not try to replace KVM_RUN by writing a special register.

Sigh.

> >                                                    And we
> > don't necessarily want to set *everything*.
> What are those cases? You do need to on reset/migration.

Why do we want to set all the registers on reset, rather than tell the  
in-kernel device to reset?  The default state came from the kernel in  
the first place on irqchip creation...

> > >>                                        It'd also be using
> > >> KVM_SET_IRQCHIP to read data, which is the sort of thing you object
> > >> to later on regarding KVM_IRQ_LINE_STATUS.
> > >>
> > >Do not see why.
> >
> > It's either that, or have the data direction of the "chip" field in
> > KVM_GET_IRQCHIP not be entirely in the "get" direction.
> >
> Still do not follow. Example?

struct kvm_irqchip has "chip_id", "pad", and "chip".  "pad" is  
insufficient to communicate attribute type plus a pointer.  So if we  
want to provide a pointer for the kernel to write the attribute into,  
it has to read from memory that the ioctl definition suggests should  
only be written to.
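
A minimal sketch of the space problem, assuming the layout described above
(chip_id, pad, chip) and the 512-byte blob size mentioned earlier in the
thread. struct irqchip_blob is a local mirror written for illustration, not
the kernel header, and struct attr_request is a hypothetical description of
what a per-attribute "get" would need to pass in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the layout described in the text; illustration only. */
struct irqchip_blob {
	uint32_t chip_id;	/* which irqchip the call refers to */
	uint32_t pad;		/* the only spare in-band space: 32 bits */
	union {
		char dummy[512];	/* opaque state blob */
	} chip;
};

/* The blob starts right after the two 32-bit words. */
static_assert(offsetof(struct irqchip_blob, chip) == 8,
	      "unexpected padding before the blob");

/* Hypothetical input that a per-attribute "get" would have to carry. */
struct attr_request {
	uint32_t attr_type;	/* which attribute to fetch */
	uint64_t user_addr;	/* where the kernel should write the value */
};

/* Returns 1 if such a request fits in "pad" alone, 0 otherwise. */
int request_fits_in_pad(void)
{
	return sizeof(struct attr_request) <=
	       sizeof(((struct irqchip_blob *)0)->pad);
}
```

Since the request does not fit in pad, the remaining option is to place it
in chip, which is the field a "get" call is supposed to fill in rather than
read.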

> > >Setting the address is setting an attribute. Sending MSI is a command.
> > >Things you set/get during init/migration are attributes. Things you do
> > >to cause side-effects are commands.
> >
> > What if I set the address at a time that isn't init/migration (the
> > hardware allows moving it, like a PCI BAR)?  Suddenly it becomes not
> > an attribute due to how the caller uses it?
> >
> What's the interface for the guest to move it?

Some non-MPIC registers called CCSRBARH, CCSRBARL, and CCSRBAR.

> Why does it go via userspace?

Because the mechanism in question doesn't just move MPIC.  It moves a  
big block of a bunch of different devices all at once.

> You can move the APIC base too, but this does not involve userspace. But
> even if you do go via userspace, it is just a guest asking to change
> device configuration, so using SET_ATTR to set new configuration is fine.
> 
> > >> What is the benefit of KVM_IRQ_LINE over what MPIC does?  What real
> > >> (non-glue/wrapper) code can become common?
> > >>
> > >No new ioctl with exactly the same result (well actually even faster
> > >since less copying is done).
> >
> > Which ioctl would go away?
> >
> Those that you propose in your new interface.

No, they wouldn't.  At most one MPIC attribute group would go away  
(though as I've noted it would still be useful to be able to "get"  
those attributes for debugging).

> > >> And I really hope you don't want us to do MSIs the x86 way.
> > >>
> > >What is wrong with KVM_SIGNAL_MSI? Except that you'll need code to
> > >glue it to mpic.
> >
> > We'll have to write extra code for it compared to the current way
> > where it works with *zero* code beyond what is wanted for other
> > purposes such as debug and (eventually) migration.  At least it's
> > more direct than having to establish a GSI route...
> If just writing a register causes an MSI to be sent, how do you distinguish
> between a write that should send an MSI and a write that is done on
> migration to transfer the current value?

It is a write-only command register.  The registers that contain the  
state are elsewhere.

Again, we do not currently support migration on MPIC.  It is a very low
priority for embedded.  We do not wish to rule it out entirely, but it
most likely would require adding more state accessors.

> We had that problem with MSRs on x86. We had
> to, eventually, add a flag that tells us the reason for the MSR access.

The equivalent to that flag would be using the right kind of accessor  
for what you want to do (simulated guest access versus backdoor state  
access).
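
As a rough sketch of that distinction, the toy model below routes one write
call through two attribute groups. Everything in it is hypothetical
(toy_irqchip, toy_set_attr, and both group names, including the raw-state
group, which the patchset does not have); it only illustrates how the choice
of accessor, not a flag, can say whether side effects are wanted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical attribute groups for a toy interrupt controller. */
enum toy_group {
	TOY_GRP_REGISTER,	/* simulated guest access, side effects apply */
	TOY_GRP_RAW_STATE,	/* backdoor state access, no side effects */
};

struct toy_irqchip {
	uint32_t pending;		/* bitmap of pending interrupt sources */
	unsigned int deliveries;	/* times delivery was triggered */
};

/*
 * Write "val" through the chosen group.  Returns 0 on success and -1 for
 * an unknown group.
 */
int toy_set_attr(struct toy_irqchip *chip, enum toy_group group, uint32_t val)
{
	assert(chip != NULL);

	switch (group) {
	case TOY_GRP_REGISTER:
		/* Command-style write: raise source (val % 32) and deliver. */
		chip->pending |= UINT32_C(1) << (val % 32);
		chip->deliveries++;
		return 0;
	case TOY_GRP_RAW_STATE:
		/* Restore-style write: replace the bitmap, deliver nothing. */
		chip->pending = val;
		return 0;
	}
	return -1;
}
```

With this split, a migration-style restore would go through the raw-state
group and trigger no delivery, while an injection would go through the
register group.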

> > >> In the XICS thread, Paul brought up the possibility of cascaded
> > >> MPICs.  It's not relevant to the systems we're trying to model, but
> > >> if one did want to use the in-kernel irqchip interface for that, it
> > >> would be really nice to be able to operate on a specific MPIC for
> > >> injection rather than have to come up with some sort of global
> > >> identifier (above and beyond the minor flattening we'd need to do to
> > >> represent a single MPIC's interrupts in a flat numberspace).
> > >>
> > >ARM encodes information in the irq field of KVM_IRQ_LINE like this:
> > >  bits:  | 31 ... 24 | 23  ... 16 | 15    ...     0 |
> > >  field: | irq_type  | vcpu_index |   irq_number    |
> > >Why would a similar approach not work?
> >
> > Well, it conflicts with the GSI routing stuff, and I don't see an
> > irq chip ID field...
> It does :( Can't say I am happy about it, but I skipped the discussion
> about the interface back then and it is too late to complain now. Since,
> as you noticed, irqfd interfaces with irq routing, I wonder what ARM's
> plan is for it. But if you choose to go the ARM way, the format is ARM
> specific, so you can use your own encoding and put irq chip information
> there.

Well, we do want to support irqfd, so I don't think we'll be following  
ARM here.
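
For reference, the bit layout quoted above can be written out as a small
pack/unpack sketch. The field positions come from the quoted diagram; the
macro and function names are made up for this example and are not kernel
definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Field widths and positions follow the bit diagram quoted above. */
#define IRQ_TYPE_SHIFT		24
#define VCPU_INDEX_SHIFT	16
#define IRQ_TYPE_MAX		0xffu
#define VCPU_INDEX_MAX		0xffu
#define IRQ_NUMBER_MAX		0xffffu

/* Pack the three fields into one 32-bit irq value. */
uint32_t pack_irq(uint32_t irq_type, uint32_t vcpu_index, uint32_t irq_number)
{
	assert(irq_type <= IRQ_TYPE_MAX);
	assert(vcpu_index <= VCPU_INDEX_MAX);
	assert(irq_number <= IRQ_NUMBER_MAX);

	return (irq_type << IRQ_TYPE_SHIFT) |
	       (vcpu_index << VCPU_INDEX_SHIFT) |
	       irq_number;
}

/* Extract each field again. */
uint32_t irq_type_of(uint32_t irq)
{
	return irq >> IRQ_TYPE_SHIFT;
}

uint32_t vcpu_index_of(uint32_t irq)
{
	return (irq >> VCPU_INDEX_SHIFT) & VCPU_INDEX_MAX;
}

uint32_t irq_number_of(uint32_t irq)
{
	return irq & IRQ_NUMBER_MAX;
}
```

The three fields use all 32 bits, so an irqchip identifier would have to be
carved out of one of them; that is the missing piece noted in the exchange
above.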

> > But otherwise (or assuming you mean to use such an encoding when
> > setting up a GSI route), I didn't say this part couldn't be made to
> > work.  It will require new kernel code for managing a GSI table in a
> > non-APIC way, and a new callback into the device code, but as I've
> > said elsewhere I think we need it for irqfd anyway.  If I use
> > KVM_IRQ_LINE for injecting interrupts, do you still object to the
> > rest of it?
> The rest of what, the proposed interface? There are two separate
> discussions happening here, interleaved. The first is "do we need to
> introduce a new generic interface for device creation when the existing
> one, albeit not ideal, can be used", and I am OK with that as long as ARM
> moves to it for 3.10, although I would prefer to have some example of
> what this interface will be used for besides irq chips; otherwise it will
> be just another way to create an irqchip.

We need a new way to create irqchips anyway, even if it's just what the  
XICS patchset adds (KVM_CREATE_IRQCHIP_ARGS, which is similar to  
KVM_CREATE_DEVICE except it doesn't return an identifier for operating  
on a specific device).  And of course we want to sort this out before  
either patchset gets merged, so we don't end up adding both methods.  I  
suspect the XICS patchset flew under your radar because it has "PPC:"  
in front of it and the subject line doesn't mention the ioctl, but I'm  
not the only one that felt the need for a few new ioctls.

As for other types of devices, x86 has i8254.c emulated in-kernel -- I
know that's not going to switch to the new interface, but it could have
if it had existed back then.  I can also see creating an in-kernel
emulation device for doing MMIO filtering on some piece of embedded
hardware that guests need to access with reasonable performance, but
where the hardware designers screwed up the protection slightly (e.g. put
other things in the same 4K page).  We've done such filtering before in
our standalone hypervisor; the question is whether it happens to
anything with enough performance requirements to be done in the kernel.

> The second one is "what should the interface look like". And here I
> think that a strong distinction is needed between setting attributes
> and sending commands with side effects, for reasons explained all over
> this ml thread.

OK, so let's just call them "commands".  I like the split into "read"  
and "write" commands, especially when most of the commands naturally  
come in such pairs, but if you don't like that part it can be reduced  
to a single read/write command (and then we'd define separate set/get  
commands where appropriate).

Note that the XICS patchset also involves device commands.  It does it  
by passing any unknown vm ioctl to the irqchip (XICS implements  
KVM_IRQCHIP_GET_SOURCES and KVM_IRQCHIP_SET_SOURCES in addition to  
KVM_IRQ_LINE).  Obviously both ways can work; I've given my reasons  
elsewhere in the thread for preferring something that doesn't require a  
new ioctl for every device command.

> > It's not so much about "silliness" as that this new thing I added for
> > other reasons did the job just as well (again, except when it comes to
> > irqfd), and avoided the need for a GSI table, etc.  IRQ injection
> > was not the main point of the new interface.
> Having a generic interface for device creation and then making some
> devices special by allowing them to be used with KVM_IRQ_LINE makes
> little sense,

Well, we'd want to document which devices tie into which generic  
interfaces, and which device is selected if multiple such devices are  
created (if that is allowed for a particular device class).

For KVM_IRQ_LINE, we could perhaps use the device id as the irqchip id  
in kvm_irq_routing_irqchip (or, have an attribute/command that  
retrieves the id to be used there).  Unfortunately there is no irqchip  
field in kvm_irq_routing_msi, though since it's basically a command to  
write to an arbitrary MMIO address, maybe it could just be implemented  
that way?

> > >> I see no need for a separate ioctl in terms of the underlying
> > >> infrastructure for distinguishing "attribute" from "write-only
> > >> command".  I'm open to improvements on what the ioctl is called.
> > >> It's basically like setting a register on a device, except I was
> > >> concerned that if we actually called it a "register" that people
> > >> would take it too literally and think it's only for the architected
> > >> register state of the emulated device.
> > >I agree "attribute" is a better name than "register", but injecting
> > >an interrupt is not setting an attribute.
> >
> > It's a dynamic attribute -- the state of the input line.  Better
> > names are welcome.  I don't see this difference as enough to warrant
> > separate ioctls.
> As long as you use the same attribute for both migration and interrupt
> injection, I do. If you use separate attributes for migration and interrupt
> injection, then not having a separate ioctl is just a hack.

Why is it a hack?  Is it also a hack to not use a separate ioctl to  
reset the device, to move its address map, etc?

-Scott

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
@ 2013-02-22  2:17                   ` Scott Wood
  0 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-02-22  2:17 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Alexander Graf, kvm-ppc, kvm, Christoffer Dall, Paul Mackerras

  On 02/21/2013 02:22:09 AM, Gleb Natapov wrote:
> On Wed, Feb 20, 2013 at 08:05:12PM -0600, Scott Wood wrote:
> >  On 02/20/2013 07:09:49 AM, Gleb Natapov wrote:
> > >On Tue, Feb 19, 2013 at 03:16:37PM -0600, Scott Wood wrote:
> > >> On 02/19/2013 06:24:18 AM, Gleb Natapov wrote:
> > >> >On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
> > >> >> The ability to set/get attributes is needed.  Sorry, but "get
> > >or set
> > >> >> one blob of data, up to 512 bytes, for the entire irqchip" is
> > >just
> > >> >> not good enough -- assuming you don't want us to start  
> sticking
> > >> >> pointers and commands in *that* data. :-)
> > >> >>
> > >> >Proposed interface sticks pointers into ioctl data, so why doing
> > >> >the same
> > >> >for KVM_SET_IRQCHIP/KVM_GET_IRQCHIP makes you smile.
> > >>
> > >> There's a difference between putting a pointer in an ioctl  
> control
> > >> structure that is specifically documented as being that way (as  
> in
> > >> ONE_REG), versus taking an ioctl that claims to be  
> setting/getting a
> > >> blob of state and embedding pointers in it.  It would be like
> > >> sticking a pointer in the attribute payload of this API, which I
> > >> think is something to be discouraged.
> > >If documentation is what differentiate for you between silly and  
> smart
> > >then write documentation instead of new interfaces.
> >
> > You mean like what we did with SREGS, that got deprecated and
> > replaced with ONE_REG?
> >
> SREGS is implemented by ppc. I see ONE_REG as addition to REGS
> interface. You can access all register at once or you can access them
> one by one. If there is a need we can add MULTIPLE_REGS that will get
> list of requested REGS.

http://www.spinics.net/lists/kvm-ppc/msg04876.html
http://www.spinics.net/lists/kvm-ppc/msg05842.html

> The interface is not over generic i.e it does
> not try to replace KVM_RUN by writing special register.

Sigh.

> >                                                    And we
> > don't necessarily want to set *everything*.
> What are those cases? You do need to on reset/migration.

Why do we want to set all the registers on reset, rather than tell the  
in-kernel device to reset?  The default state came from the kernel in  
the first place on irqchip creation...

> > >>                                        It'd also be using
> > >> KVM_SET_IRQCHIP to read data, which is the sort of thing you  
> object
> > >> to later on regarding KVM_IRQ_LINE_STATUS.
> > >>
> > >Do not see why.
> >
> > It's either that, or have the data direction of the "chip" field in
> > KVM_GET_IRQCHIP not be entirely in the "get" direction.
> >
> Still do not follow. Example?

struct kvm_irqchip has "chip_id", "pad", and "chip".  "pad" is  
insufficient to communicate attribute type plus a pointer.  So if we  
want to provide a pointer for the kernel to write the attribute into,  
it has to read from memory that the ioctl definition suggests should  
only be written to.
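
To make the layout argument concrete, here is a minimal sketch of that structure. It is not the kernel header: `irqchip_blob`, `attr_request` and `request_offset()` are made-up stand-ins, the 512-byte payload mirrors the "up to 512 bytes" blob mentioned above, and the "attribute type plus pointer" request is only an assumption about what such a call would need to carry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct kvm_irqchip: an id, 32 bits of padding, and a blob. */
struct irqchip_blob {
	uint32_t chip_id;
	uint32_t pad;
	union {
		char dummy[512];	/* opaque state, filled in by a "get" */
	} chip;
};

/* Hypothetical request: which attribute, and where the kernel should put it. */
struct attr_request {
	uint32_t attr_type;
	uint64_t user_addr;
};

/* Returns the offset inside the struct where such a request would have to live. */
size_t request_offset(const struct irqchip_blob *ic)
{
	assert(ic != NULL);
	if (sizeof(struct attr_request) <= sizeof(ic->pad))
		return offsetof(struct irqchip_blob, pad);
	/* Too big for "pad": it spills into the payload a "get" should only fill. */
	return offsetof(struct irqchip_blob, chip);
}
```

On common ABIs the request is larger than the four bytes of "pad", so it ends up in "chip", which is the part the kernel would then have to read during a nominal "get".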

> > >Setting the address is setting an attribute. Sending MSI is a  
> command.
> > >Things you set/get during init/migration are attributes. Things  
> you do
> > >to cause side-effects are commands.
> >
> > What if I set the address at a time that isn't init/migration (the
> > hardware allows moving it, like a PCI BAR)?  Suddenly it becomes not
> > an attribute due to how the caller uses it?
> >
> What's the interface for guest to move it?

Some non-MPIC registers called CCSRBARH, CCSRBARL, and CCSRBAR.

> Why it goes via userspace?

Because the mechanism in question doesn't just move MPIC.  It moves a  
big block of a bunch of different devices all at once.

> You can move APIC base too, but this does not involve userspace. But
> even if you do go via userspace, it is just a guest asking to change  
> device
> configuration, so using SET_ATTR to set new configuration is fine.
> 
> > >> What is the benefit of KVM_IRQ_LINE over what MPIC does?  What  
> real
> > >> (non-glue/wrapper) code can become common?
> > >>
> > >No new ioctl with exactly same result (well actually even faster  
> since
> > >less copying is done).
> >
> > Which ioctl would go away?
> >
> Those that you propose in your new interface.

No, they wouldn't.  At most one MPIC attribute group would go away  
(though as I've noted it would still be useful to be able to "get"  
those attributes for debugging).

> > >> And I really hope you don't want us to do MSIs the x86 way.
> > >>
> > >What is wrong with KVM_SIGNAL_MSI? Except that you'll need code to
> > >glue it
> > >to mpic.
> >
> > We'll have to write extra code for it compared to the current way
> > where it works with *zero* code beyond what is wanted for other
> > purposes such as debug and (eventually) migration.  At least it's
> > more direct than having to establish a GSI route...
> If just writing a register cause MSI to be send how do you distinguish
> between write that should send MSI and write that is done on migration
> to transfer current value?

It is a write-only command register.  The registers that contain the  
state are elsewhere.

Again, we do not currently support migration on MPIC.  It is a very low  
priority for embedded.  We do not wish to rule it out entirely, but it  
most likely would require adding more state accessors.

> We had that problem with MSRs on x86. We had
> to, eventually, add a flag that tells us the reason of MSR access.

The equivalent to that flag would be using the right kind of accessor  
for what you want to do (simulated guest access versus backdoor state  
access).
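
A rough sketch of what "the right kind of accessor" could mean, using a toy device rather than the MPIC code. Everything here is hypothetical: the `toy_` names, the two attribute groups, and the register numbers are illustration only, and the raw-state group stands for the state accessors that this patchset does not yet have.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum toy_group {
	TOY_GRP_REGISTER,	/* access as if performed by the guest */
	TOY_GRP_RAW_STATE,	/* backdoor state access, no side effects */
};

enum { TOY_REG_MSI_CMD = 0, TOY_REG_PENDING = 1 };

struct toy_irqchip {
	uint32_t pending;	/* the actual state */
	unsigned int injections;	/* counts side effects, for the demo */
};

/* Returns 0 on success, -1 if the group/attribute pair is not writable. */
int toy_set_attr(struct toy_irqchip *dev, enum toy_group grp,
		 uint32_t attr, uint32_t val)
{
	assert(dev != NULL);

	switch (grp) {
	case TOY_GRP_REGISTER:
		if (attr != TOY_REG_MSI_CMD || val >= 32)
			return -1;	/* pending is not guest-writable here */
		dev->pending |= UINT32_C(1) << val;	/* command: raise irq */
		dev->injections++;
		return 0;
	case TOY_GRP_RAW_STATE:
		if (attr != TOY_REG_PENDING)
			return -1;	/* the command register holds no state */
		dev->pending = val;	/* restore state, nothing is injected */
		return 0;
	}
	return -1;
}
```

The group chosen by the caller plays the role of the x86 MSR flag: the same ioctl carries both kinds of access, and the device code never has to guess why a write happened.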

> > >> In the XICS thread, Paul brought up the possibliity of cascaded
> > >> MPICs.  It's not relevant to the systems we're trying to model,  
> but
> > >> if one did want to use the in-kernel irqchip interface for that,  
> it
> > >> would be really nice to be able to operate on a specific MPIC for
> > >> injection rather than have to come up with some sort of global
> > >> identifier (above and beyond the minor flattening we'd need to  
> do to
> > >> represent a single MPIC's interrupts in a flat numberspace).
> > >>
> > >ARM encodes information in irq field of KVM_IRQ_LINE like that:
> > >  bits:  | 31 ... 24 | 23  ... 16 | 15    ...     0 |
> > >  field: | irq_type  | vcpu_index |   irq_number    |
> > >Why will similar approach not work?
> >
> > Well, it conflicts with the GSI routing stuff, and I don't see an
> > irq chip ID field...
> It does :( Can't say I am happy about it, but I skipped the discussion
> about the interface back then and it is too late to complain now.  
> Since,
> as you notices, irqfd interfaces with irq routing I wonder what's ARM
> plan about it. But if you choose to go ARM way the format is ARM  
> specific,
> so you can use your own encoding and put irq chip information there.

Well, we do want to support irqfd, so I don't think we'll be following  
ARM here.
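
For reference, the bit layout quoted above can be sketched as a pair of helpers. The function and struct names are illustrative, not kernel macros; only the field positions come from the quoted table.

```c
#include <assert.h>
#include <stdint.h>

struct irq_fields {
	uint32_t irq_type;	/* bits 31..24 */
	uint32_t vcpu_index;	/* bits 23..16 */
	uint32_t irq_number;	/* bits 15..0  */
};

/* Pack the three fields into the 32-bit "irq" value. */
uint32_t pack_irq(uint32_t irq_type, uint32_t vcpu_index, uint32_t irq_number)
{
	assert(irq_type <= 0xff && vcpu_index <= 0xff && irq_number <= 0xffff);
	return (irq_type << 24) | (vcpu_index << 16) | irq_number;
}

/* Split a packed "irq" value back into its fields. */
struct irq_fields unpack_irq(uint32_t irq)
{
	struct irq_fields f = {
		.irq_type   = (irq >> 24) & 0xff,
		.vcpu_index = (irq >> 16) & 0xff,
		.irq_number = irq & 0xffff,
	};
	return f;
}
```

All 32 bits are spoken for, so an irqchip id would need either a different encoding or a different place to live.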

> > But otherwise (or assuming you mean to use such an encoding when
> > setting up a GSI route), I didn't say this part couldn't be made to
> > work.  It will require new kernel code for managing a GSI table in a
> > non-APIC way, and a new callback into the device code, but as I've
> > said elsewhere I think we need it for irqfd anyway.  If I use
> > KVM_IRQ_LINE for injecting interrupts, do you still object to the
> > rest of it?
> The rest of what, proposed interface? There are two separate  
> discussions
> happening here interleaved. First is "do we need to introduce new  
> generic
> interface for device creation when existing one, albeit not ideal,  
> can be
> used" and I am OK with that as long as ARM moves to it for 3.10,  
> although
> I would prefer to have some example of what this interface will be  
> used
> for besides irq chips otherwise it will be just another way to create
> irqchip.

We need a new way to create irqchips anyway, even if it's just what the  
XICS patchset adds (KVM_CREATE_IRQCHIP_ARGS, which is similar to  
KVM_CREATE_DEVICE except it doesn't return an identifier for operating  
on a specific device).  And of course we want to sort this out before  
either patchset gets merged, so we don't end up adding both methods.  I  
suspect the XICS patchset flew under your radar because it has "PPC:"  
in front of it and the subject line doesn't mention the ioctl, but I'm  
not the only one that felt the need for a few new ioctls.

As for other types of devices, x86 has i8254.c emulated in-kernel -- I  
know that's not going to switch to the new interface, but it could have  
if it existed back then.  I can also see creating an in-kernel  
emulation device for doing MMIO filtering on some piece of embedded  
hardware that guests need to access with reasonable performance, but  
the hardware desginers screwed up the protection slightly (e.g. put  
other things in the same 4K page).  We've done such filtering before in  
our standalone hypervisor; the question is whether it happens to  
anything with enough performance requirements to be done in the kernel.

> Second one is "how the interface should look like". And here I
> think that strong distinction is needed between setting the attributes
> and sending commands with side effects for reasons explained all over
> this ml thread.

OK, so let's just call them "commands".  I like the split into "read"  
and "write" commands, especially when most of the commands naturally  
come in such pairs, but if you don't like that part it can be reduced  
to a single read/write command (and then we'd define separate set/get  
commands where appropriate).

Note that the XICS patchset also involves device commands.  It does it  
by passing any unknown vm ioctl to the irqchip (XICS implements  
KVM_IRQCHIP_GET_SOURCES and KVM_IRQCHIP_SET_SOURCES in addition to  
KVM_IRQ_LINE).  Obviously both ways can work; I've given my reasons  
elsewhere in the thread for preferring something that doesn't require a  
new ioctl for every device command.

> > It's not about "silliness" as that this new thing I added for other
> > reasons did the job just as well (again, except when it comes to
> > irqfd), and avoided the need for a GSI table, etc.  IRQ injection
> > was not the main point of the new interface.
> Having generic interface for device creation and then make some  
> devices
> special by allowing them to be used with KVM_IRQ_LINE makes little
> sense,

Well, we'd want to document which devices tie into which generic  
interfaces, and which device is selected if multiple such devices are  
created (if that is allowed for a particular device class).

For KVM_IRQ_LINE, we could perhaps use the device id as the irqchip id  
in kvm_irq_routing_irqchip (or, have an attribute/command that  
retrieves the id to be used there).  Unfortunately there is no irqchip  
field in kvm_irq_routing_msi, though since it's basically a command to  
write to an arbitrary MMIO address, maybe it could just be implemented  
that way?

> > >> I see no need for a separate ioctl in terms of the underlying
> > >> infrastructure for distinguishing "attribute" from "write-only
> > >> command".  I'm open to improvements on what the ioctl is called.
> > >> It's basically like setting a register on a device, except I was
> > >> concerned that if we actually called it a "register" that people
> > >> would take it too literally and think it's only for the  
> architected
> > >> register state of the emulated device.
> > >I agree "attribute" is better name than "register", but injecting
> > >interrupt is not setting an attribute.
> >
> > It's a dynamic attribute -- the state of the input line.  Better
> > names are welcome.  I don't see this difference as enough to warrant
> > separate ioctls.
> As long as you use the same attribute for migration and interrupt  
> injection
> purpose I do. If you use separate attributes for migration and  
> interrupt
> injection then not having separate ioctl is just a hack.

Why is it a hack?  Is it also a hack to not use a separate ioctl to  
reset the device, to move its address map, etc?

-Scott

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-22  2:00                         ` Scott Wood
@ 2013-02-23 15:04                           ` Marcelo Tosatti
  -1 siblings, 0 replies; 261+ messages in thread
From: Marcelo Tosatti @ 2013-02-23 15:04 UTC (permalink / raw)
  To: Scott Wood; +Cc: Gleb Natapov, Alexander Graf, kvm-ppc, kvm, Christoffer Dall

On Thu, Feb 21, 2013 at 08:00:25PM -0600, Scott Wood wrote:
> On 02/21/2013 05:03:32 PM, Marcelo Tosatti wrote:
> >On Wed, Feb 20, 2013 at 07:28:52PM -0600, Scott Wood wrote:
> >> On 02/20/2013 06:14:37 PM, Marcelo Tosatti wrote:
> >> >On Wed, Feb 20, 2013 at 05:53:20PM -0600, Scott Wood wrote:
> >> >> >It is then not necessary to set device attributes on a live
> >> >guest and
> >> >> >deal with the complications associated with that.
> >> >>
> >> >> Which complications?
> >> >>
> >> >> -Scott
> >> >
> >> >Semantics of individual attribute writes, for one.
> >>
> >> When the attribute is a device register, the hardware documentation
> >> takes care of that.
> >
> >You are not writing to the registers from the CPU point of view.
> 
> That's exactly how KVM_DEV_MPIC_GRP_REGISTER is defined and
> implemented on MPIC (with the exception of registers whose behavior
> changes based on which specific vcpu you use to access them).
> If/when we have a need to set/get state in a different manner,
> that's a separate attribute group.

Can you describe usage of this register again?

> >> Otherwise, the semantics are documented in the
> >> device-specific documentation -- which can include restricting when
> >> the access is allowed.  Same as with any other interface
> >> documentation.
> >
> >Again, you are talking about the semantics of device access from the
> >software operating on the machine view. We are discussing hypervisor
> >userspace <-> hypervisor kernel interface.
> 
> And I was talking about the userspace-to-hypervisor kernel interface
> documentation.  It just happens that one specific MPIC device group
> ("when the attribute is a device register") is defined with respect
> to what guest software would see if it did a similar access.
> 
> >In general you never have to set attributes on a device after it has
> >been initialized, because there is state associated with that device
> >that requires proper handling (example: if you modify a timer counter
> >register of a timer device, any software timers used to emulate the
> >timer counter must be cancelled).
> 
> Yes, it requires proper handling and the MMIO code does that.
>
> If and when we add raw state accessors, it's totally reasonable for
> there to be command/attribute-specific documented restrictions on
> when the access can be done.

> >Also, it is necessary to provide proper locking of device attribute
> >write versus vcpu device access. So far we have been focusing on
> >having
> >a lockless vcpu path.
> 
> How is device access related to vcpus?  Existing irqchip code is not
> lockless.

VCPUS access in-kernel devices. Yes, it is lockless (see RCU usage in
virt/kvm/).

> >So when device attributes can be modified has implications beyond what
> >may seem visible at first.
> >
> >Are this reasonable arguments?
> >
> >Basically abstract 'device attributes' are too abstract.
> 
> It's up to the device-specific documentation to make them not
> abstract (I admit there are a few details missing in mpic.txt that
> I've pointed out in this thread -- it is RFC v1 after all).  This
> wouldn't be any different if we used separate ioctls for everything.
> It's like saying abstract 'ioctl' is too abstract.

Perhaps a better way to put it is that it's too permissive.

> >However, your proposed interface deals with sucky capability,
> >versioning
> >and namespace conflicts we have now. Note these items can probably be
> >improved separately.
> 
> Any particular proposals?

Namespace conflicts: Reserve ranges for each arch. 

As for the other two items, I haven't thought about them. I am not the one
bothered :-) (yes, they suck).

> >> I suppose mpic.txt could use an additional statement that
> >> KVM_DEV_MPIC_GRP_REGISTER performs an access as if it were performed
> >> by the guest.
> >>
> >> >Locking versus currently executing VCPUs, for another (see how
> >> >KVM_SET_IRQ's RCU usage, for instance, that is something should be
> >> >shared).
> >>
> >> If you mean kvm_set_irq() in irq_comm.c, that's only relevant when
> >> you have a GSI routing table, which this patchset doesn't.
> >>
> >> Assuming we end up having a routing table to support irqfd, we still
> >> can't share the code as is, since it's APIC-specific.
> >
> >Suppose it is worthwhile to attempt to share code as much as possible.
> 
> Sure... my point is it isn't a case of "the common code is right
> over there, why aren't you using it?"  I'll try to share what I
> reasonably can, subject to my limited knowledge of how the APIC
> stuff works.  The irqfd code is substantial enough that refactoring
> for sharing should be worthwhile.  I'm not so sure about irq_comm.c.
> 
> -scott

Note that I am just pointing out drawbacks of device attributes (if something
of that sort is integrated, x86 should also use it).

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-20  2:16             ` Christoffer Dall
@ 2013-02-24 13:12               ` Marc Zyngier
  -1 siblings, 0 replies; 261+ messages in thread
From: Marc Zyngier @ 2013-02-24 13:12 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: Scott Wood, Alexander Graf, kvm-ppc, kvm

On Tue, 19 Feb 2013 18:16:53 -0800, Christoffer Dall
<cdall@cs.columbia.edu> wrote:
> On Tue, Feb 19, 2013 at 12:16 PM, Scott Wood <scottwood@freescale.com>
> wrote:


>> We at least need the numberspace to not be architecture-specific if we
>> want
>> to retain the possibility of changing later -- not to mention what
>> happens
>> if architectures merge.  I see that "arm" and "arm64" are separate,
>> despite
>> the fact that other architectures that used to be split this way have
>> since
>> merged.  Maybe "arm64" is too different from "arm" for that to happen,
>> but
>> who knows...
>>
> 
> Fair point, nobody knows.

This is unlikely to happen soon.

>> ...and if they don't merge, wouldn't that be a likely case for devices
>> shared across architectures?  Does arm64 use gic/vgic?  This post
>> suggests
>> that there is at least something in common (the bit about "once the GIC
>> code
>> is shared between
>> arm and arm64"):
>>
http://lists.infradead.org/pipermail/linux-arm-kernel/2012-December/135836.html
>>
> 
> I'm not sure how much of that is public at this point, or even
> determined. But KVM already shares code between arm64 and arm, so I
> guess I thought of this as a single architecture from the point of
> view of virt/kvm/kvm_main.c, but that may be incorrect actually.

As I am the sad bastard who wrote most of the VGIC stuff, and also the one
who maintains KVM on arm64, I feel a need to chime in:

arm64 very much uses GIC/VGIC. Actually, all the "in-kernel device" code
is shared between arm and arm64. That means VGIC, timers and PSCI. Which
probably means that code should be moved out of arch/arm/kvm, but that's a
separate story.

It is my intention to keep the interface as close as possible to arm, as
long as that doesn't cause any major issue (arm64 has some slightly
different requirements).

Cheers,

        M.
-- 
Who you jivin' with that Cosmik Debris?

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-22  2:17                   ` Scott Wood
@ 2013-02-24 15:46                     ` Gleb Natapov
  -1 siblings, 0 replies; 261+ messages in thread
From: Gleb Natapov @ 2013-02-24 15:46 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm, Christoffer Dall, Paul Mackerras

On Thu, Feb 21, 2013 at 08:17:54PM -0600, Scott Wood wrote:
>  On 02/21/2013 02:22:09 AM, Gleb Natapov wrote:
> >On Wed, Feb 20, 2013 at 08:05:12PM -0600, Scott Wood wrote:
> >>  On 02/20/2013 07:09:49 AM, Gleb Natapov wrote:
> >> >On Tue, Feb 19, 2013 at 03:16:37PM -0600, Scott Wood wrote:
> >> >> On 02/19/2013 06:24:18 AM, Gleb Natapov wrote:
> >> >> >On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
> >> >> >> The ability to set/get attributes is needed.  Sorry, but "get
> >> >or set
> >> >> >> one blob of data, up to 512 bytes, for the entire irqchip" is
> >> >just
> >> >> >> not good enough -- assuming you don't want us to start
> >sticking
> >> >> >> pointers and commands in *that* data. :-)
> >> >> >>
> >> >> >Proposed interface sticks pointers into ioctl data, so why doing
> >> >> >the same
> >> >> >for KVM_SET_IRQCHIP/KVM_GET_IRQCHIP makes you smile.
> >> >>
> >> >> There's a difference between putting a pointer in an ioctl
> >control
> >> >> structure that is specifically documented as being that way
> >(as in
> >> >> ONE_REG), versus taking an ioctl that claims to be
> >setting/getting a
> >> >> blob of state and embedding pointers in it.  It would be like
> >> >> sticking a pointer in the attribute payload of this API, which I
> >> >> think is something to be discouraged.
> >> >If documentation is what differentiate for you between silly
> >and smart
> >> >then write documentation instead of new interfaces.
> >>
> >> You mean like what we did with SREGS, that got deprecated and
> >> replaced with ONE_REG?
> >>
> >SREGS is implemented by ppc. I see ONE_REG as addition to REGS
> >interface. You can access all register at once or you can access them
> >one by one. If there is a need we can add MULTIPLE_REGS that will get
> >list of requested REGS.
> 
> http://www.spinics.net/lists/kvm-ppc/msg04876.html
> http://www.spinics.net/lists/kvm-ppc/msg05842.html
> 
If Alex prefers ONE_REG interface this is his call :)

> >The interface is not over generic i.e it does
> >not try to replace KVM_RUN by writing special register.
> 
> Sigh.
> 
> >>                                                    And we
> >> don't necessarily want to set *everything*.
> >What are those cases? You do need to on reset/migration.
> 
> Why do we want to set all the registers on reset, rather than tell
> the in-kernel device to reset?  The default state came from the
> kernel in the first place on irqchip creation...
> 
I have nothing against telling the in-kernel device to reset, provided there
is a way to do so, which the current interface lacks. Reset in userspace has
its advantages too: bugs are easier to fix, and there may be different kinds
of resets (hard/soft).

> >> >>                                        It'd also be using
> >> >> KVM_SET_IRQCHIP to read data, which is the sort of thing you
> >object
> >> >> to later on regarding KVM_IRQ_LINE_STATUS.
> >> >>
> >> >Do not see why.
> >>
> >> It's either that, or have the data direction of the "chip" field in
> >> KVM_GET_IRQCHIP not be entirely in the "get" direction.
> >>
> >Still do not follow. Example?
> 
> struct kvm_irqchip has "chip_id", "pad", and "chip".  "pad" is
> insufficient to communicate attribute type plus a pointer.  So if we
> want to provide a pointer for the kernel to write the attribute
> into, it has to read from memory that the ioctl definition suggests
> should only be written to.
Yes, but this is not different from the interface you propose.

> 
> >> >Setting the address is setting an attribute. Sending MSI is a
> >command.
> >> >Things you set/get during init/migration are attributes. Things
> >you do
> >> >to cause side-effects are commands.
> >>
> >> What if I set the address at a time that isn't init/migration (the
> >> hardware allows moving it, like a PCI BAR)?  Suddenly it becomes not
> >> an attribute due to how the caller uses it?
> >>
> >What's the interface for guest to move it?
> 
> Some non-MPIC registers called CCSRBARH, CCSRBARL, and CCSRBAR.
> 
> >Why it goes via userspace?
> 
> Because the mechanism in question doesn't just move MPIC.  It moves
> a big block of a bunch of different devices all at once.
> 
> >You can move APIC base too, but this does not involve userspace. But
> >even if you do go via userspace, it is just a guest asking to
> >change device
> >configuration, so using SET_ATTR to set new configuration is fine.
> >
> >> >> What is the benefit of KVM_IRQ_LINE over what MPIC does?
> >What real
> >> >> (non-glue/wrapper) code can become common?
> >> >>
> >> >No new ioctl with exactly same result (well actually even
> >faster since
> >> >less copying is done).
> >>
> >> Which ioctl would go away?
> >>
> >Those that you propose in your new interface.
> 
> No, they wouldn't.  At most one MPIC attribute group would go away
> (though as I've noted it would still be useful to be able to "get"
> those attributes for debugging).
> 
> >> >> And I really hope you don't want us to do MSIs the x86 way.
> >> >>
> >> >What is wrong with KVM_SIGNAL_MSI? Except that you'll need code to
> >> >glue it
> >> >to mpic.
> >>
> >> We'll have to write extra code for it compared to the current way
> >> where it works with *zero* code beyond what is wanted for other
> >> purposes such as debug and (eventually) migration.  At least it's
> >> more direct than having to establish a GSI route...
> >If just writing a register cause MSI to be send how do you distinguish
> >between write that should send MSI and write that is done on migration
> >to transfer current value?
> 
> It is a write-only command register.  The registers that contain the
> state are elsewhere.
> 
The register may be write-only from the OS point of view, but its internal
state may still need to be transferred on migration. This brings us back
to the point that state and commands should have different accessors.

> Again, we do not currently support migration on MPIC.  It is a very
> low priority for embedded.  We do not wish to rule it out entirely,
> but it most likely would require adding more state accessors.
The interface is supposed to be generic; we are not talking about MPIC
specifically here. Regarding migration, "never say never"  :)

> 
> >We had that problem with MSRs on x86. We had
> >to, eventually, add a flag that tells us the reason of MSR access.
> 
> The equivalent to that flag would be using the right kind of
> accessor for what you want to do (simulated guest access versus
> backdoor state access).
Ugh.

> 
> >> >> In the XICS thread, Paul brought up the possibility of cascaded
> >> >> MPICs.  It's not relevant to the systems we're trying to model,
> >> >> but if one did want to use the in-kernel irqchip interface for
> >> >> that, it would be really nice to be able to operate on a specific
> >> >> MPIC for injection rather than have to come up with some sort of
> >> >> global identifier (above and beyond the minor flattening we'd need
> >> >> to do to represent a single MPIC's interrupts in a flat
> >> >> numberspace).
> >> >>
> >> >ARM encodes information in irq field of KVM_IRQ_LINE like that:
> >> >  bits:  | 31 ... 24 | 23  ... 16 | 15    ...     0 |
> >> >  field: | irq_type  | vcpu_index |   irq_number    |
> >> >Why will similar approach not work?
> >>
> >> Well, it conflicts with the GSI routing stuff, and I don't see an
> >> irq chip ID field...
> >It does :( Can't say I am happy about it, but I skipped the discussion
> >about the interface back then and it is too late to complain now.
> >Since, as you noticed, irqfd interfaces with irq routing, I wonder what
> >ARM's plan for it is. But if you choose to go the ARM way, the format is
> >ARM specific, so you can use your own encoding and put irq chip
> >information there.
> 
> Well, we do want to support irqfd, so I don't think we'll be
> following ARM here.
> 
I think they want to as well. Maybe they have a plan to enhance irqfd. The
S390 people are doing something about it now. I haven't looked at the
proposed patches yet.
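
A minimal sketch of the ARM-style packing quoted above may make the
constraint easier to see. Only the bit positions come from the quoted
table; the TOY_* macro names and toy_irq_* helpers are hypothetical and
are not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; only the bit positions come from the quoted table. */
#define TOY_IRQ_TYPE_SHIFT 24
#define TOY_IRQ_TYPE_MASK  0xffu
#define TOY_IRQ_VCPU_SHIFT 16
#define TOY_IRQ_VCPU_MASK  0xffu
#define TOY_IRQ_NUM_MASK   0xffffu

/* Pack irq_type, vcpu_index and irq_number into one 32-bit irq value. */
static inline uint32_t toy_irq_encode(uint32_t type, uint32_t vcpu,
				      uint32_t num)
{
	/* Each field must fit in the bits the layout reserves for it. */
	assert(type <= TOY_IRQ_TYPE_MASK);
	assert(vcpu <= TOY_IRQ_VCPU_MASK);
	assert(num <= TOY_IRQ_NUM_MASK);

	return (type << TOY_IRQ_TYPE_SHIFT) |
	       (vcpu << TOY_IRQ_VCPU_SHIFT) |
	       num;
}

/* Extract bits 31..24. */
static inline uint32_t toy_irq_type(uint32_t irq)
{
	return (irq >> TOY_IRQ_TYPE_SHIFT) & TOY_IRQ_TYPE_MASK;
}

/* Extract bits 23..16. */
static inline uint32_t toy_irq_vcpu(uint32_t irq)
{
	return (irq >> TOY_IRQ_VCPU_SHIFT) & TOY_IRQ_VCPU_MASK;
}

/* Extract bits 15..0. */
static inline uint32_t toy_irq_num(uint32_t irq)
{
	return irq & TOY_IRQ_NUM_MASK;
}
```

The three fields use all 32 bits, so this layout has no room left for an
irqchip ID; a different arch-specific encoding would have to carve such a
field out of one of these ranges.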

> >> But otherwise (or assuming you mean to use such an encoding when
> >> setting up a GSI route), I didn't say this part couldn't be made to
> >> work.  It will require new kernel code for managing a GSI table in a
> >> non-APIC way, and a new callback into the device code, but as I've
> >> said elsewhere I think we need it for irqfd anyway.  If I use
> >> KVM_IRQ_LINE for injecting interrupts, do you still object to the
> >> rest of it?
> >The rest of what, proposed interface? There are two separate
> >discussions happening here interleaved. The first is "do we need to
> >introduce a new generic interface for device creation when the existing
> >one, albeit not ideal, can be used", and I am OK with that as long as
> >ARM moves to it for 3.10, although I would prefer to have some example
> >of what this interface will be used for besides irq chips; otherwise it
> >will be just another way to create an irqchip.
> 
> We need a new way to create irqchips anyway, even if it's just what
> the XICS patchset adds (KVM_CREATE_IRQCHIP_ARGS, which is similar to
> KVM_CREATE_DEVICE except it doesn't return an identifier for
> operating on a specific device).  And of course we want to sort this
> out before either patchset gets merged, so we don't end up adding
> both methods.  I suspect the XICS patchset flew under your radar
> because it has "PPC:" in front of it and the subject line doesn't
> mention the ioctl, but I'm not the only one that felt the need for a
> few new ioctls.
> 
I noticed the thread (there was in-kernel irqchip there to compensate for
PPC), but hadn't read it before sending the email; otherwise I would
have added here that you guys need to agree on a common interface.

Now I have looked closer at the proposed interface. I am not sure why Paul
decided to implement KVM_CREATE_IRQCHIP_ARGS instead of using
KVM_CREATE_IRQCHIP. He supports only one type with his patches and I am
not sure if he is planning to add something else. KVM_IRQCHIP_SET_SOURCES
looks like it tries to reimplement irq routing.
 
> As for other types of devices, x86 has i8254.c emulated in-kernel --
> I know that's not going to switch to the new interface, but it could
> have if it existed back then.
Since it is not going to switch, it is not a good example. On x86, an
interrupt remapping device will probably need some kernel component that
may take advantage of the new interface, but I haven't thought about
interrupt remapping implementation enough to be sure.

>                                I can also see creating an in-kernel
> emulation device for doing MMIO filtering on some piece of embedded
> hardware that guests need to access with reasonable performance, but
> the hardware designers screwed up the protection slightly (e.g. put
> other things in the same 4K page).  We've done such filtering before
> in our standalone hypervisor; the question is whether it happens to
> anything with enough performance requirements to be done in the
> kernel.
I am not sure why a special device is needed for such filtering. If MMIO
is not handled by the kernel, it is forwarded to userspace.

> 
> >The second one is "what the interface should look like". And here I
> >think that a strong distinction is needed between setting attributes
> >and sending commands with side effects, for reasons explained all over
> >this ml thread.
> 
> OK, so let's just call them "commands".  I like the split into
> "read" and "write" commands, especially when most of the commands
> naturally come in such pairs, but if you don't like that part it can
> be reduced to a single read/write command (and then we'd define
> separate set/get commands where appropriate).
I think read/write will be simpler.

> 
> Note that the XICS patchset also involves device commands.  It does
> it by passing any unknown vm ioctl to the irqchip (XICS implements
> KVM_IRQCHIP_GET_SOURCES and KVM_IRQCHIP_SET_SOURCES in addition to
> KVM_IRQ_LINE).  Obviously both ways can work; I've given my reasons
> elsewhere in the thread for preferring something that doesn't
> require a new ioctl for every device command.
> 
> >> It's not about "silliness" as that this new thing I added for other
> >> reasons did the job just as well (again, except when it comes to
> >> irqfd), and avoided the need for a GSI table, etc.  IRQ injection
> >> was not the main point of the new interface.
> >Having a generic interface for device creation and then making some
> >devices special by allowing them to be used with KVM_IRQ_LINE makes
> >little sense,
> 
> Well, we'd want to document which devices tie into which generic
> interfaces, and which device is selected if multiple such devices
> are created (if that is allowed for a particular device class).
> 
> For KVM_IRQ_LINE, we could perhaps use the device id as the irqchip
> id in kvm_irq_routing_irqchip (or, have an attribute/command that
> retrieves the id to be used there).  Unfortunately there is no
> irqchip field in kvm_irq_routing_msi, though since it's basically a
> command to write to an arbitrary MMIO address, maybe it could just
> be implemented that way?
> 
How is MSI delivered with MPIC? From the device's point of view, sending an
MSI is doing a write at address X of data Y. How does PPC with MPIC
translate this into an actual interrupt?

> >> >> I see no need for a separate ioctl in terms of the underlying
> >> >> infrastructure for distinguishing "attribute" from "write-only
> >> >> command".  I'm open to improvements on what the ioctl is called.
> >> >> It's basically like setting a register on a device, except I was
> >> >> concerned that if we actually called it a "register" that people
> >> >> would take it too literally and think it's only for the
> >> >> architected register state of the emulated device.
> >> >I agree "attribute" is a better name than "register", but injecting
> >> >an interrupt is not setting an attribute.
> >>
> >> It's a dynamic attribute -- the state of the input line.  Better
> >> names are welcome.  I don't see this difference as enough to warrant
> >> separate ioctls.
> >As long as you use the same attribute for migration and interrupt
> >injection purposes, I do. If you use separate attributes for migration
> >and interrupt injection, then not having a separate ioctl is just a
> >hack.
> 
> Why is it a hack?  Is it also a hack to not use a separate ioctl to
> reset the device, to move its address map, etc?
>
It is a hack because the purpose of the interface becomes unclear. If you
see it called in code, you have no idea what semantics are expected.
For instance, getting/setting of state should be done when vcpus are
not running, but commands may be sent while they are. This also
encourages bugs where an incorrect attribute is used during migration or
vice versa. In short, the interface does not help you; it requires you to
read the documentation of each device very carefully. Reset is one thing
that will be implemented as a command ioctl. Moving the address map
depends: if an address is part of the device's state and will be migrated
as such, then it is a device attribute; otherwise use a command to move it.
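
The distinction argued for here can be sketched, very roughly, as two
separate entry points with different calling rules. Everything below is a
hypothetical toy model (struct toy_dev, toy_dev_set_state,
toy_dev_command_reset, and the choice of -EBUSY are all assumptions), not
code from either patchset.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy device: one piece of migratable state plus a command counter. */
struct toy_dev {
	bool vcpus_running;   /* true while any vcpu may be executing */
	uint32_t base_addr;   /* state: saved and restored on migration */
	unsigned int resets;  /* bookkeeping only, never migrated */
};

/*
 * State accessor: meant for init and migration, so in this sketch it
 * is refused while vcpus are running.
 */
static int toy_dev_set_state(struct toy_dev *d, uint32_t base_addr)
{
	if (d->vcpus_running)
		return -EBUSY;
	d->base_addr = base_addr;
	return 0;
}

/*
 * Command: has a side effect and is allowed at any time, including
 * while vcpus are running.
 */
static int toy_dev_command_reset(struct toy_dev *d)
{
	d->base_addr = 0;
	d->resets++;
	return 0;
}
```

With separate entry points a caller cannot replay a command by accident
while restoring state, which is the class of bug described above.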

--
			Gleb.

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-18 12:21     ` Gleb Natapov
@ 2013-02-25  1:11       ` Paul Mackerras
  -1 siblings, 0 replies; 261+ messages in thread
From: Paul Mackerras @ 2013-02-25  1:11 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Scott Wood, Alexander Graf, kvm-ppc, kvm, Christoffer Dall

On Mon, Feb 18, 2013 at 02:21:59PM +0200, Gleb Natapov wrote:
> Copying Christoffer since ARM has in kernel irq chip too.
> 
> On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote:
> > Currently, devices that are emulated inside KVM are configured in a
> > hardcoded manner based on an assumption that any given architecture
> > only has one way to do it.  If there's any need to access device state,
> > it is done through inflexible one-purpose-only IOCTLs (e.g.
> > KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> > cumbersome and depletes a limited numberspace.
> > 
> > This API provides a mechanism to instantiate a device of a certain
> > type, returning an ID that can be used to set/get attributes of the
> > device.  Attributes may include configuration parameters (e.g.
> > register base address), device state, operational commands, etc.  It
> > is similar to the ONE_REG API, except that it acts on devices rather
> > than vcpus.
> You not only provide a different way to create an in-kernel irq chip, you
> also use an alternate way to trigger interrupt lines. Before going into
> interface specifics, let's think about whether it is really worth it. x86
> obviously supports the old way and will have to for some, very long, time.
> ARM vGIC code, which is ready to go upstream, uses the old way too. So it
> will be 2 archs against one. Christoffer, do you think the proposed way is
> better for your needs? Are you willing to make vGIC use it?

In fact there have been two distinct interrupt controller emulations
for PPC posted recently, Scott's and mine, with quite different
interfaces.  In contrast to Scott's device control API, where the
operations you would do for different interrupt controllers are quite
different, mine attempted to define a much more abstract interface for
interrupt controller emulations, with the idea that an interrupt
controller subsystem connects a set of interrupt sources, each with
some state, to a set of interrupt delivery mechanisms, one per vcpu,
also with some state.

Thus my interface had:

- a KVM_CREATE_IRQCHIP_ARGS ioctl, with an argument structure that
  indicates the overall architecture of the interrupt subsystem,

- KVM_IRQCHIP_SET_SOURCES and KVM_IRQCHIP_GET_SOURCES ioctls, that
  return or modify the state of some set of interrupt sources

- a KVM_REG_PPC_ICP_STATE identifier in the ONE_REG register
  identifier space, that is used to save and restore the per-vcpu
  interrupt presentation state.

Alternatively, I could modify my code to use the existing
KVM_CREATE_IRQCHIP, KVM_GET_IRQCHIP, and KVM_SET_IRQCHIP ioctls.  If I
were to do that I would want to rename the 'pad' field in struct
kvm_irqchip to 'nr' and use it with 'chip_id' to identify a range of
interrupt sources to be affected.  I'd also want to keep the ONE_REG
identifier for the per-vcpu state.
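
A rough sketch of what that variant could look like, under stated
assumptions: the text does not say which field holds the start of the
range and which the length, so treating chip_id as the first source and
nr as the count is a guess, and struct toy_irqchip and toy_range_fits are
hypothetical names. The 512-byte payload mirrors the size of the union in
the existing x86 struct kvm_irqchip.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CHIP_BYTES 512

/* Hypothetical variant of struct kvm_irqchip with 'pad' renamed to 'nr'. */
struct toy_irqchip {
	uint32_t chip_id;              /* assumed: first source in the range */
	uint32_t nr;                   /* assumed: number of sources */
	uint8_t chip[TOY_CHIP_BYTES];  /* per-source state, packed */
};

/*
 * Return 1 if the requested range lies within the controller's sources
 * and its state fits in the fixed-size payload, 0 otherwise.
 */
static int toy_range_fits(const struct toy_irqchip *c, uint32_t total_sources,
			  size_t bytes_per_source)
{
	uint64_t end = (uint64_t)c->chip_id + c->nr;
	uint64_t bytes = (uint64_t)c->nr * bytes_per_source;

	return c->nr > 0 && end <= total_sources && bytes <= TOY_CHIP_BYTES;
}
```

With, say, 8 bytes of state per source, one call could cover at most 64
sources, so a controller with more sources would need several calls.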

Or I could change over to using Scott's device control API, though I
have some objections to doing that.

What is your advice about which interface to use?

Paul.

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-25  1:11       ` Paul Mackerras
@ 2013-02-25 13:09         ` Gleb Natapov
  -1 siblings, 0 replies; 261+ messages in thread
From: Gleb Natapov @ 2013-02-25 13:09 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Scott Wood, Alexander Graf, kvm-ppc, kvm, Christoffer Dall

On Mon, Feb 25, 2013 at 12:11:19PM +1100, Paul Mackerras wrote:
> On Mon, Feb 18, 2013 at 02:21:59PM +0200, Gleb Natapov wrote:
> > Copying Christoffer since ARM has in kernel irq chip too.
> > 
> > On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote:
> > > Currently, devices that are emulated inside KVM are configured in a
> > > hardcoded manner based on an assumption that any given architecture
> > > only has one way to do it.  If there's any need to access device state,
> > > it is done through inflexible one-purpose-only IOCTLs (e.g.
> > > KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> > > cumbersome and depletes a limited numberspace.
> > > 
> > > This API provides a mechanism to instantiate a device of a certain
> > > type, returning an ID that can be used to set/get attributes of the
> > > device.  Attributes may include configuration parameters (e.g.
> > > register base address), device state, operational commands, etc.  It
> > > is similar to the ONE_REG API, except that it acts on devices rather
> > > than vcpus.
> > You are not only provide different way to create in kernel irq chip you
> > also use an alternate way to trigger interrupt lines. Before going into
> > interface specifics lets think about whether it is really worth it? x86
> > obviously support old way and will have to for some, very long, time.
> > ARM vGIC code, that is ready to go upstream, uses old way too. So it will
> > be 2 archs against one. Christoffer do you think the proposed way it
> > better for your needs. Are you willing to make vGIC use it?
> 
> In fact there have been two distinct interrupt controller emulations
> for PPC posted recently, Scott's and mine, with quite different
> interfaces.  In contrast to Scott's device control API, where the
> operations you would do for different interrupt controllers are quite
> different, mine attempted to define a much more abstract interface for
> interrupt controller emulations, with the idea that an interrupt
> controller subsystem connects a set of interrupt sources, each with
> some state, to a set of interrupt delivery mechanisms, one per vcpu,
> also with some state.
> 
I agree with Scott that it is futile to try to come up with a generic
irqchip configuration interface, and I doubt it is needed from the QEMU
or other userspace point of view. I looked at the proposed
KVM_IRQCHIP_SET_SOURCES interface, and while it is possible to pass some
information about pic/ioapic using it, there is a lot of information that
will not fit there. For one, there is global irqchip-related state, and the
proposed interface only talks about interrupt sources. For another, using a
generic interface will require us to have code that translates the irqchip
representation into this generic one and back for no apparent gain.
Currently pic/ioapic state is very similar to what the HW specification
defines, and it is not going to change.

Looking at your implementation of KVM_IRQCHIP_SET_SOURCES, I wonder how
well it works for migration, though. The interface is supposed to transfer
irqchip state as is, but I see things like this in your code:

                       mutex_lock(&ics->lock);
                       irqp->server = val & KVM_IRQ_SERVER_MASK;
                       irqp->priority = val >> KVM_IRQ_PRIORITY_SHIFT;
                       irqp->resend = 0;
                       irqp->masked_pending = 0;
                       irqp->asserted = 0;

Why is it safe to initialize those values to defaults during
migration? Wouldn't it be simpler and more correct to just transfer
the whole content of irqp from src to dst?
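
To make the question concrete, here is a minimal userspace sketch of what
"transfer the whole content" could look like if the per-source value also
carried the transient flags. The bit layout, the SRC_* names, and struct
irq_source are assumptions made for illustration only; they are not taken
from the patch being reviewed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit layout, not the one used by the patch under review. */
#define SRC_SERVER_MASK		0xffffffffULL
#define SRC_PRIORITY_SHIFT	32
#define SRC_PRIORITY_MASK	0xffULL
#define SRC_RESEND		(1ULL << 40)
#define SRC_MASKED_PENDING	(1ULL << 41)
#define SRC_ASSERTED		(1ULL << 42)

/* Hypothetical stand-in for the per-source state (irqp in the snippet). */
struct irq_source {
	uint32_t server;
	uint8_t priority;
	uint8_t resend;
	uint8_t masked_pending;
	uint8_t asserted;
};

/* Encode every field, including the transient flags. */
static uint64_t source_pack(const struct irq_source *s)
{
	uint64_t val = s->server;

	val |= (uint64_t)s->priority << SRC_PRIORITY_SHIFT;
	if (s->resend)
		val |= SRC_RESEND;
	if (s->masked_pending)
		val |= SRC_MASKED_PENDING;
	if (s->asserted)
		val |= SRC_ASSERTED;
	return val;
}

/* Decode on the destination; nothing is reset to a default. */
static void source_unpack(struct irq_source *s, uint64_t val)
{
	s->server = (uint32_t)(val & SRC_SERVER_MASK);
	s->priority = (uint8_t)((val >> SRC_PRIORITY_SHIFT) & SRC_PRIORITY_MASK);
	s->resend = !!(val & SRC_RESEND);
	s->masked_pending = !!(val & SRC_MASKED_PENDING);
	s->asserted = !!(val & SRC_ASSERTED);
}

/* Check that a source survives a save/restore cycle unchanged. */
static void source_check_roundtrip(const struct irq_source *src)
{
	struct irq_source dst;

	source_unpack(&dst, source_pack(src));
	assert(dst.server == src->server);
	assert(dst.priority == src->priority);
	assert(dst.resend == !!src->resend);
	assert(dst.masked_pending == !!src->masked_pending);
	assert(dst.asserted == !!src->asserted);
}
```

With a layout like this, a source that was asserted or awaiting resend on
the source host would still be so on the destination.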

> Thus my interface had:
> 
> - a KVM_CREATE_IRQCHIP_ARGS ioctl, with an argument structure that
>   indicates the overall architecture of the interrupt subsystem,
> 
> - KVM_IRQCHIP_SET_SOURCES and KVM_IRQCHIP_GET_SOURCES ioctls, that
>   return or modify the state of some set of interrupt sources
> 
> - a KVM_REG_PPC_ICP_STATE identifier in the ONE_REG register
>   identifier space, that is used to save and restore the per-vcpu
>   interrupt presentation state.
> 
> Alternatively, I could modify my code to use the existing
> KVM_CREATE_IRQCHIP, KVM_GET_IRQCHIP, and KVM_SET_IRQCHIP ioctls.  If I
> were to do that I would want to rename the 'pad' field in struct
> kvm_irqchip to 'nr' and use it with 'chip_id' to identify a range of
> interrupt sources to be affected.  I'd also want to keep the ONE_REG
> identifier for the per-vcpu state.
That is preferable to the interface you propose, since I do not think your
interface fits other interrupt controllers well. You can put the nr field
into the dummy[] payload, or replace pad with "union {pad, nr}".
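
A rough sketch of the second option follows, written as a standalone
userspace struct so that it compiles on its own. The name irqchip_sketch is
made up here, and the field list only mirrors what this thread says about
struct kvm_irqchip (chip_id, pad, and a 512-byte chip payload); the real
definition lives in the kernel's uapi headers and should be checked there.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative copy only; not the real struct kvm_irqchip. */
struct irqchip_sketch {
	uint32_t chip_id;
	union {
		uint32_t pad;	/* existing name, still valid for old callers */
		uint32_t nr;	/* hypothetical: number of sources in the range */
	};
	union {
		char dummy[512];	/* reserved payload space */
	} chip;
};

/* The ABI must not move: same offsets and total size as before. */
static_assert(offsetof(struct irqchip_sketch, pad) == 4, "pad stays at 4");
static_assert(offsetof(struct irqchip_sketch, nr) == 4, "nr aliases pad");
static_assert(offsetof(struct irqchip_sketch, chip) == 8, "payload at 8");
static_assert(sizeof(struct irqchip_sketch) == 520, "size unchanged");
```

Because nr and pad share storage, existing userspace that zeroes pad keeps
working, and new userspace can name the field for what it carries.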

> 
> Or I could change over to using Scott's device control API, though I
> have some objections to doing that.
> 
> What is your advice about which interface to use?
> 
The ideal situation would be for you and Scott to agree on the details of
the interface. If you don't like something about Scott's interface, we
can discuss it and shape it into something you agree with or even like.
I already asked Scott to introduce a command interface. Scott does not
care about migration; you do, so you can make sure that the interface works
for that.

--
			Gleb.

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-24 15:46                     ` Gleb Natapov
@ 2013-02-25 15:23                       ` Alexander Graf
  -1 siblings, 0 replies; 261+ messages in thread
From: Alexander Graf @ 2013-02-25 15:23 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Scott Wood, kvm-ppc, kvm, Christoffer Dall, Paul Mackerras


On 24.02.2013, at 16:46, Gleb Natapov wrote:

> On Thu, Feb 21, 2013 at 08:17:54PM -0600, Scott Wood wrote:
>> On 02/21/2013 02:22:09 AM, Gleb Natapov wrote:
>>> On Wed, Feb 20, 2013 at 08:05:12PM -0600, Scott Wood wrote:
>>>> On 02/20/2013 07:09:49 AM, Gleb Natapov wrote:
>>>>> On Tue, Feb 19, 2013 at 03:16:37PM -0600, Scott Wood wrote:
>>>>>> On 02/19/2013 06:24:18 AM, Gleb Natapov wrote:
>>>>>>> On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
>>>>>>>> The ability to set/get attributes is needed.  Sorry, but "get
>>>>> or set
>>>>>>>> one blob of data, up to 512 bytes, for the entire irqchip" is
>>>>> just
>>>>>>>> not good enough -- assuming you don't want us to start
>>> sticking
>>>>>>>> pointers and commands in *that* data. :-)
>>>>>>>> 
>>>>>>> Proposed interface sticks pointers into ioctl data, so why doing
>>>>>>> the same
>>>>>>> for KVM_SET_IRQCHIP/KVM_GET_IRQCHIP makes you smile.
>>>>>> 
>>>>>> There's a difference between putting a pointer in an ioctl
>>> control
>>>>>> structure that is specifically documented as being that way
>>> (as in
>>>>>> ONE_REG), versus taking an ioctl that claims to be
>>> setting/getting a
>>>>>> blob of state and embedding pointers in it.  It would be like
>>>>>> sticking a pointer in the attribute payload of this API, which I
>>>>>> think is something to be discouraged.
>>>>> If documentation is what differentiate for you between silly
>>> and smart
>>>>> then write documentation instead of new interfaces.
>>>> 
>>>> You mean like what we did with SREGS, that got deprecated and
>>>> replaced with ONE_REG?
>>>> 
>>> SREGS is implemented by ppc. I see ONE_REG as addition to REGS
>>> interface. You can access all register at once or you can access them
>>> one by one. If there is a need we can add MULTIPLE_REGS that will get
>>> list of requested REGS.
>> 
>> http://www.spinics.net/lists/kvm-ppc/msg04876.html
>> http://www.spinics.net/lists/kvm-ppc/msg05842.html
>> 
> If Alex prefers ONE_REG interface this is his call :)

There's a reason I prefer it :). We ran out of space in the SREGS call; we had a bitmap with multiple chunks of data you could set in there; and worst of all, when we ran out of space we didn't realize it until the code was already in Linus' tree!

The normal "here's a chunk of data" ioctl style works great when you know all or most of the details of what you want to transfer. In the case of kvm, we need to expose an interface to something that

  a) We don't control ourselves. Hardware guys make hardware. We just follow.
  b) Is pretty complex. There can be a lot of state in a CPU / device.

So because of that we needed something more flexible and dynamic. Something where we could add registers we forgot without jumping through lots of hoops.

That's why I prefer ONE_REG for vcpus. Since I see the same danger on device emulation, I tend to prefer it here as well.

> 
>>> The interface is not over generic i.e it does
>>> not try to replace KVM_RUN by writing special register.
>> 
>> Sigh.
>> 
>>>>                                                   And we
>>>> don't necessarily want to set *everything*.
>>> What are those cases? You do need to on reset/migration.
>> 
>> Why do we want to set all the registers on reset, rather than tell
>> the in-kernel device to reset?  The default state came from the
>> kernel in the first place on irqchip creation...
>> 
> I have nothing against telling in-kernel device to reset provided there
> is a way to do so, which current interface lacks. Reset in userspace has
> its advantage too: bugs are easier to fix, there may be different kind
> of resets (hard/soft).

The same argument holds for state access. If you suddenly realize "oh, I need to also set this one hidden state bit in my hardware" you only want to add that specific write on reset.

If you suddenly find a race condition in migration that happens very rarely, you want to add just the one register that you forgot to synchronize in your new code. You don't want to change the whole struct passed through the ioctl.

That's why we have the padding in our ioctl structs. And feature bitmaps. And once we exceed them, we're screwed.

I want to make sure we don't hit that point.

Actually, there's a really easy way to understand what this interface does. It basically implements a state get/set ioctl which allows for very fine-grained control over your ioctl struct. It does that by only passing one register at a time.

If we run into concurrency issues (multiple registers need to be set in a locked manner at the same time), we can always add a new ioctl that sets multiple attrs/regs at once. For CPUs, we haven't hit that case yet.
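
The one-register-at-a-time idea can be modelled in a few lines of userspace
C. This is only a toy: struct toy_device, the ATTR_* ids, and the -EINVAL
return for unknown ids are assumptions made for the sketch, not the ONE_REG
or device control ABI.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical attribute ids for an imaginary device. */
enum {
	ATTR_BASE_ADDR = 0,
	ATTR_CTRL = 1,
	ATTR_HIDDEN_LATCH = 2,	/* "forgotten" register, added later */
	ATTR_COUNT
};

struct toy_device {
	uint64_t attrs[ATTR_COUNT];
};

/* Set exactly one attribute; unknown ids are rejected. */
static int dev_set_attr(struct toy_device *dev, uint32_t id, uint64_t val)
{
	assert(dev);
	if (id >= ATTR_COUNT)
		return -EINVAL;
	dev->attrs[id] = val;
	return 0;
}

/* Read exactly one attribute into *val. */
static int dev_get_attr(const struct toy_device *dev, uint32_t id,
			uint64_t *val)
{
	assert(dev && val);
	if (id >= ATTR_COUNT)
		return -EINVAL;
	*val = dev->attrs[id];
	return 0;
}
```

Adding ATTR_HIDDEN_LATCH only grows the id space. Callers that never heard
of it are unaffected, and no argument struct changes size or layout.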

> 
>>>>>>                                       It'd also be using
>>>>>> KVM_SET_IRQCHIP to read data, which is the sort of thing you
>>> object
>>>>>> to later on regarding KVM_IRQ_LINE_STATUS.
>>>>>> 
>>>>> Do not see why.
>>>> 
>>>> It's either that, or have the data direction of the "chip" field in
>>>> KVM_GET_IRQCHIP not be entirely in the "get" direction.
>>>> 
>>> Still do not follow. Example?
>> 
>> struct kvm_irqchip has "chip_id", "pad", and "chip".  "pad" is
>> insufficient to communicate attribute type plus a pointer.  So if we
>> want to provide a pointer for the kernel to write the attribute
>> into, it has to read from memory that the ioctl definition suggests
>> should only be written to.
> Yes, but this is not different from the interface you propose.
> 
>> 
>>>>> Setting the address is setting an attribute. Sending MSI is a
>>> command.
>>>>> Things you set/get during init/migration are attributes. Things
>>> you do
>>>>> to cause side-effects are commands.
>>>> 
>>>> What if I set the address at a time that isn't init/migration (the
>>>> hardware allows moving it, like a PCI BAR)?  Suddenly it becomes not
>>>> an attribute due to how the caller uses it?
>>>> 
>>> What's the interface for guest to move it?
>> 
>> Some non-MPIC registers called CCSRBARH, CCSRBARL, and CCSRBAR.
>> 
>>> Why it goes via userspace?
>> 
>> Because the mechanism in question doesn't just move MPIC.  It moves
>> a big block of a bunch of different devices all at once.
>> 
>>> You can move APIC base too, but this does not involve userspace. But
>>> even if you do go via userspace, it is just a guest asking to
>>> change device
>>> configuration, so using SET_ATTR to set new configuration is fine.
>>> 
>>>>>> What is the benefit of KVM_IRQ_LINE over what MPIC does?
>>> What real
>>>>>> (non-glue/wrapper) code can become common?
>>>>>> 
>>>>> No new ioctl with exactly same result (well actually even
>>> faster since
>>>>> less copying is done).
>>>> 
>>>> Which ioctl would go away?
>>>> 
>>> Those that you propose in your new interface.
>> 
>> No, they wouldn't.  At most one MPIC attribute group would go away
>> (though as I've noted it would still be useful to be able to "get"
>> those attributes for debugging).
>> 
>>>>>> And I really hope you don't want us to do MSIs the x86 way.
>>>>>> 
>>>>> What is wrong with KVM_SIGNAL_MSI? Except that you'll need code to
>>>>> glue it
>>>>> to mpic.
>>>> 
>>>> We'll have to write extra code for it compared to the current way
>>>> where it works with *zero* code beyond what is wanted for other
>>>> purposes such as debug and (eventually) migration.  At least it's
>>>> more direct than having to establish a GSI route...
>>> If just writing a register cause MSI to be send how do you distinguish
>>> between write that should send MSI and write that is done on migration
>>> to transfer current value?
>> 
>> It is a write-only command register.  The registers that contain the
>> state are elsewhere.
>> 
> The register may be write-only from OS point of view, but its internal
> state may still need to be transferred on migration. This brings us back
> to the point that state and commands should have different accessors.

Scott expressed this through his "groups". Hidden registers are a separate group, but are still accessed the same way, one register at a time. I think that makes sense.

> 
>> Again, we do not currently support migration on MPIC.  It is a very
>> low priority for embedded.  We do not wish to rule it out entirely,
>> but it most likely would require adding more state accessors.
> The interface suppose to be generic, we are not talking about MPIC
> specifically here. Regarding migration "never say never"  :)
> 
>> 
>>> We had that problem with MSRs on x86. We had
>>> to, eventually, add a flag that tells us the reason of MSR access.
>> 
>> The equivalent to that flag would be using the right kind of
>> accessor for what you want to do (simulated guest access versus
>> backdoor state access).
> Ugh.

If we need to modify hidden state for migration, we can always create hidden registers in this interface. We won't know until we actually implement it. And then a new MPIC revision gets released and we need to do things all over again.

The point of a generic get/set interface really is that we can change and extend the type of state we move between user and kernel space.

> 
>> 
>>>>>> In the XICS thread, Paul brought up the possibility of cascaded
>>>>>> MPICs.  It's not relevant to the systems we're trying to
>>> model, but
>>>>>> if one did want to use the in-kernel irqchip interface for
>>> that, it
>>>>>> would be really nice to be able to operate on a specific MPIC for
>>>>>> injection rather than have to come up with some sort of global
>>>>>> identifier (above and beyond the minor flattening we'd need
>>> to do to
>>>>>> represent a single MPIC's interrupts in a flat numberspace).
>>>>>> 
>>>>> ARM encodes information in irq field of KVM_IRQ_LINE like that:
>>>>>  bits:  | 31 ... 24 | 23  ... 16 | 15    ...     0 |
>>>>> field: | irq_type  | vcpu_index |   irq_number    |
>>>>> Why will similar approach not work?
>>>> 
>>>> Well, it conflicts with the GSI routing stuff, and I don't see an
>>>> irq chip ID field...
>>> It does :( Can't say I am happy about it, but I skipped the discussion
>>> about the interface back then and it is too late to complain now.
>>> Since,
>>> as you notices, irqfd interfaces with irq routing I wonder what's ARM
>>> plan about it. But if you choose to go ARM way the format is ARM
>>> specific,
>>> so you can use your own encoding and put irq chip information there.
>> 
>> Well, we do want to support irqfd, so I don't think we'll be
>> following ARM here.
>> 
> I think they want too. May be they have a plan to enhance irqfd. S390
> people do something about it now. Haven't look at proposed patches yet.
> 
>>>> But otherwise (or assuming you mean to use such an encoding when
>>>> setting up a GSI route), I didn't say this part couldn't be made to
>>>> work.  It will require new kernel code for managing a GSI table in a
>>>> non-APIC way, and a new callback into the device code, but as I've
>>>> said elsewhere I think we need it for irqfd anyway.  If I use
>>>> KVM_IRQ_LINE for injecting interrupts, do you still object to the
>>>> rest of it?
>>> The rest of what, proposed interface? There are two separate
>>> discussions
>>> happening here interleaved. First is "do we need to introduce new
>>> generic
>>> interface for device creation when existing one, albeit not ideal,
>>> can be
>>> used" and I am OK with that as long as ARM moves to it for 3.10,
>>> although
>>> I would prefer to have some example of what this interface will be
>>> used
>>> for besides irq chips otherwise it will be just another way to create
>>> irqchip.
>> 
>> We need a new way to create irqchips anyway, even if it's just what
>> the XICS patchset adds (KVM_CREATE_IRQCHIP_ARGS, which is similar to
>> KVM_CREATE_DEVICE except it doesn't return an identifier for
>> operating on a specific device).  And of course we want to sort this
>> out before either patchset gets merged, so we don't end up adding
>> both methods.  I suspect the XICS patchset flew under your radar
>> because it has "PPC:" in front of it and the subject line doesn't
>> mention the ioctl, but I'm not the only one that felt the need for a
>> few new ioctls.
>> 
> I noticed the thread (there was in-kernel irqchip there to compensate for
> PPC), but haven't read it before sending the email otherwise I would
> have added here that you guys need to agree on common interface.
> 
> Now I looked closer into proposed interface. I am not sure why Paul
> decided to implement KVM_CREATE_IRQCHIP_ARGS instead of using
> KVM_CREATE_IRQCHIP. He supports only one type with his patches and I am
> not sure if he is planning add something else. KVM_IRQCHIP_SET_SOURCES
> looks like it tries to reimplement irq routing.

Because the same host hardware can have either an in-kernel MPIC or in-kernel XICS, depending on the type of guest you want to run.

> 
>> As for other types of devices, x86 has i8254.c emulated in-kernel --
>> I know that's not going to switch to the new interface, but it could
>> have if it existed back then.
> Since it is not going to switch it is not a good example. On x86 probably
> interrupt remapping device will have to have some kernel component that
> may take advantage of the new interface, but I haven't thought about
> interrupt remapping implementation enough to be sure.
> 
>>                               I can also see creating an in-kernel
>> emulation device for doing MMIO filtering on some piece of embedded
>> hardware that guests need to access with reasonable performance, but
>> the hardware designers screwed up the protection slightly (e.g. put
>> other things in the same 4K page).  We've done such filtering before
>> in our standalone hypervisor; the question is whether it happens to
>> anything with enough performance requirements to be done in the
>> kernel.
> I am not sure why special device is needed for such filtering. If MMIO
> is not handled by the kernel it is forwarded to a userspace.

Have you attended my talk at KVM Forum? :)

Emulated device access (PIO / MMIO) in kvm is about 2-3x faster than going to QEMU on x86. Pure roundtrip time. And it gets worse with hypercalls. See slide 41:

  https://dl.dropbox.com/u/8976842/KVM%20Forum%202012/MMIO%20Tuning.pdf

I haven't done these benchmarks on PPC yet, but I don't expect them to look vastly different.

> 
>> 
>>> Second one is "how the interface should look like". And here I
>>> think that strong distinction is needed between setting the attributes
>>> and sending commands with side effects for reasons explained all over
>>> this ml thread.
>> 
>> OK, so let's just call them "commands".  I like the split into
>> "read" and "write" commands, especially when most of the commands
>> naturally come in such pairs, but if you don't like that part it can
>> be reduced to a single read/write command (and then we'd define
>> separate set/get commands where appropriate).
> I think read/write will be simpler.
> 
>> 
>> Note that the XICS patchset also involves device commands.  It does
>> it by passing any unknown vm ioctl to the irqchip (XICS implements
>> KVM_IRQCHIP_GET_SOURCES and KVM_IRQCHIP_SET_SOURCES in addition to
>> KVM_IRQ_LINE).  Obviously both ways can work; I've given my reasons
>> elsewhere in the thread for preferring something that doesn't
>> require a new ioctl for every device command.
>> 
>>>> It's not about "silliness" as that this new thing I added for other
>>>> reasons did the job just as well (again, except when it comes to
>>>> irqfd), and avoided the need for a GSI table, etc.  IRQ injection
>>>> was not the main point of the new interface.
>>> Having generic interface for device creation and then make some
>>> devices
>>> special by allowing them to be used with KVM_IRQ_LINE makes little
>>> sense,
>> 
>> Well, we'd want to document which devices tie into which generic
>> interfaces, and which device is selected if multiple such devices
>> are created (if that is allowed for a particular device class).
>> 
>> For KVM_IRQ_LINE, we could perhaps use the device id as the irqchip
>> id in kvm_irq_routing_irqchip (or, have an attribute/command that
>> retrieves the id to be used there).  Unfortunately there is no
>> irqchip field in kvm_irq_routing_msi, though since it's basically a
>> command to write to an arbitrary MMIO address, maybe it could just
>> be implemented that way?
>> 
> How MSI is delivered with MPIC? From device point of view sending an MSI
> is doing write at address X of data Y. How PPC with MPIC translates this
> into actual interrupt?

The device DMAs into an MPIC register to trigger the MSI. The MPIC then enables the "level high" bit for that particular internal interrupt line. It also triggers an interrupt on a CPU - the same way it would with level or timer interrupts. Filtering and priority also work the same way. Once the guest ACKs that interrupt, the "level high" bit for that MSI line gets cleared.

So essentially, the MPIC converts MSIs into level-based interrupts.

That's why pushing an MSI into the in-kernel MPIC by using the live migration interface works. All you need to do is live migrate the "MSI interrupt line is active" bit into the device state.
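
The latch-and-ack behaviour described above can be sketched as a tiny
userspace model. Everything here (struct toy_mpic, TOY_NUM_MSI, the function
names) is hypothetical, and priority and filtering are left out entirely; it
only shows why setting the "active" bit through a state interface has the
same effect as the device's write.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NUM_MSI 8	/* hypothetical number of MSI lines */

struct toy_mpic {
	uint32_t msi_active;	/* one "level high" bit per MSI line */
};

/* Device write into the MSI register: latch the line high. */
static void toy_mpic_msi_write(struct toy_mpic *m, unsigned int line)
{
	assert(line < TOY_NUM_MSI);
	m->msi_active |= 1u << line;
}

/* Guest acknowledges the interrupt: clear the latched bit. */
static void toy_mpic_ack(struct toy_mpic *m, unsigned int line)
{
	assert(line < TOY_NUM_MSI);
	m->msi_active &= ~(1u << line);
}

/* The CPU interrupt output is asserted while any line is high. */
static int toy_mpic_irq_pending(const struct toy_mpic *m)
{
	return m->msi_active != 0;
}
```

In this model, restoring msi_active with bit 3 set is indistinguishable
from a device having called toy_mpic_msi_write(m, 3).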

> 
>>>>>> I see no need for a separate ioctl in terms of the underlying
>>>>>> infrastructure for distinguishing "attribute" from "write-only
>>>>>> command".  I'm open to improvements on what the ioctl is called.
>>>>>> It's basically like setting a register on a device, except I was
>>>>>> concerned that if we actually called it a "register" that people
>>>>>> would take it too literally and think it's only for the
>>> architected
>>>>>> register state of the emulated device.
>>>>> I agree "attribute" is better name than "register", but injecting
>>>>> interrupt is not setting an attribute.
>>>> 
>>>> It's a dynamic attribute -- the state of the input line.  Better
>>>> names are welcome.  I don't see this difference as enough to warrant
>>>> separate ioctls.
>>> As long as you use the same attribute for migration and interrupt
>>> injection
>>> purpose I do. If you use separate attributes for migration and
>>> interrupt
>>> injection then not having separate ioctl is just a hack.
>> 
>> Why is it a hack?  Is it also a hack to not use a separate ioctl to
>> reset the device, to move its address map, etc?
>> 
> It is a hack because purpose of the interface becomes unclear. If you
> see it called in a code you have no idea what semantics is expected.
> For instance getting/setting of a state should be done when vcpus are
> not running, but commands may be sent while they are. This also
> encourage bugs when incorrect attribute is used during migration or vice
> versa. In short interface does not help you, it requires you to read
> documentation of each device very carefully. Reset is one thing that
> will be implemented as a command ioctl. Moving address map depends. If
> an address is a part of device's state and will be migrated as such then it
> is device attribute, otherwise use command to move it.

I agree here. I think it makes sense to have a separate interface to set interrupt states. This separate interface can be a separate group inside this interface too. But I agree that we shouldn't use the live migration interface to set the interrupt line of the MPIC.
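
One way to picture "a separate group inside this interface" is a single
setter that dispatches on a group id, with different rules per group. The
sketch below is a userspace toy under stated assumptions: the group names,
struct toy_irqchip, and the policy of refusing state writes with -EBUSY
while vcpus run (taken from Gleb's remark above) are all illustrative, not
part of the proposed API.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define TOY_NUM_REGS	4
#define TOY_NUM_LINES	32

/* Hypothetical groups: saved state versus live input lines. */
enum toy_group {
	TOY_GRP_STATE,		/* migration/debug state */
	TOY_GRP_IRQ_LINE,	/* interrupt line level, usable at runtime */
};

struct toy_irqchip {
	int vcpus_running;
	uint32_t regs[TOY_NUM_REGS];
	uint32_t line_level;
};

static int toy_set(struct toy_irqchip *c, enum toy_group grp,
		   uint32_t idx, uint32_t val)
{
	assert(c);
	switch (grp) {
	case TOY_GRP_STATE:
		/* Assumed policy: state is only written while vcpus are stopped. */
		if (c->vcpus_running)
			return -EBUSY;
		if (idx >= TOY_NUM_REGS)
			return -EINVAL;
		c->regs[idx] = val;
		return 0;
	case TOY_GRP_IRQ_LINE:
		/* Line changes are allowed at any time. */
		if (idx >= TOY_NUM_LINES)
			return -EINVAL;
		if (val)
			c->line_level |= 1u << idx;
		else
			c->line_level &= ~(1u << idx);
		return 0;
	}
	return -EINVAL;
}
```

The ioctl count stays the same, while a reader of the calling code can
still tell from the group whether a call restores state or raises a line.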


Alex

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
@ 2013-02-25 15:23                       ` Alexander Graf
  0 siblings, 0 replies; 261+ messages in thread
From: Alexander Graf @ 2013-02-25 15:23 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Scott Wood, kvm-ppc, kvm, Christoffer Dall, Paul Mackerras


On 24.02.2013, at 16:46, Gleb Natapov wrote:

> On Thu, Feb 21, 2013 at 08:17:54PM -0600, Scott Wood wrote:
>> On 02/21/2013 02:22:09 AM, Gleb Natapov wrote:
>>> On Wed, Feb 20, 2013 at 08:05:12PM -0600, Scott Wood wrote:
>>>> On 02/20/2013 07:09:49 AM, Gleb Natapov wrote:
>>>>> On Tue, Feb 19, 2013 at 03:16:37PM -0600, Scott Wood wrote:
>>>>>> On 02/19/2013 06:24:18 AM, Gleb Natapov wrote:
>>>>>>> On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
>>>>>>>> The ability to set/get attributes is needed.  Sorry, but "get
>>>>> or set
>>>>>>>> one blob of data, up to 512 bytes, for the entire irqchip" is
>>>>> just
>>>>>>>> not good enough -- assuming you don't want us to start
>>> sticking
>>>>>>>> pointers and commands in *that* data. :-)
>>>>>>>> 
>>>>>>> Proposed interface sticks pointers into ioctl data, so why doing
>>>>>>> the same
>>>>>>> for KVM_SET_IRQCHIP/KVM_GET_IRQCHIP makes you smile.
>>>>>> 
>>>>>> There's a difference between putting a pointer in an ioctl
>>> control
>>>>>> structure that is specifically documented as being that way
>>> (as in
>>>>>> ONE_REG), versus taking an ioctl that claims to be
>>> setting/getting a
>>>>>> blob of state and embedding pointers in it.  It would be like
>>>>>> sticking a pointer in the attribute payload of this API, which I
>>>>>> think is something to be discouraged.
>>>>> If documentation is what differentiate for you between silly
>>> and smart
>>>>> then write documentation instead of new interfaces.
>>>> 
>>>> You mean like what we did with SREGS, that got deprecated and
>>>> replaced with ONE_REG?
>>>> 
>>> SREGS is implemented by ppc. I see ONE_REG as addition to REGS
>>> interface. You can access all register at once or you can access them
>>> one by one. If there is a need we can add MULTIPLE_REGS that will get
>>> list of requested REGS.
>> 
>> http://www.spinics.net/lists/kvm-ppc/msg04876.html
>> http://www.spinics.net/lists/kvm-ppc/msg05842.html
>> 
> If Alex prefers ONE_REG interface this is his call :)

There's a reason I prefer it :). We ran out of space in the SREGS call; we had a bitmap with multiple chunks of data you could set in there; and worst of all, when we ran out of space we didn't realize it until the code was already in Linus' tree!

The normal "here's a chunk of data" ioctl style works great when you know all or most of the details of what you want to transfer. In the case of kvm, we need to expose an interface to something that

  a) We don't control ourselves. Hardware guys make hardware. We just follow.
  b) Is pretty complex. There can be a lot of state in a CPU / device.

So because of that we needed something more flexible and dynamic. Something where we could add registers we forgot without jumping through lots of hoops.

That's why I prefer ONE_REG for vcpus. Since I see the same danger on device emulation, I tend to prefer it here as well.

> 
>>> The interface is not overly generic, i.e. it does
>>> not try to replace KVM_RUN by writing a special register.
>> 
>> Sigh.
>> 
>>>>                                                   And we
>>>> don't necessarily want to set *everything*.
>>> What are those cases? You do need to on reset/migration.
>> 
>> Why do we want to set all the registers on reset, rather than tell
>> the in-kernel device to reset?  The default state came from the
>> kernel in the first place on irqchip creation...
>> 
> I have nothing against telling in-kernel device to reset provided there
> is a way to do so, which current interface lacks. Reset in userspace has
> its advantage too: bugs are easier to fix, there may be different kind
> of resets (hard/soft).

The same argument holds for state access. If you suddenly realize "oh, I need to also set this one hidden state bit in my hardware" you only want to add that specific write on reset.

If you suddenly find a race condition in migration that happens very rarely, you want to add the one register that you forgot to synchronize in addition in your new code. You don't want to change the whole ioctl passed struct.

That's why we have the padding in our ioctl structs. And feature bitmaps. And once we exceed them, we're screwed.

I want to make sure we don't hit that point.

Actually, there's a really easy way to understand what this interface does. It basically implements a state get/set ioctl which allows for very fine grained control over your ioctl struct. It does that by only passing one register at a time.

If we run into concurrency issues (multiple registers need to be set in a locked manner at the same time), we can always add a new ioctl that sets multiple attrs/regs at once. For CPUs, we haven't hit that case yet.
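
To make the one-register-at-a-time idea a bit more concrete, here is a rough sketch in plain C. It is only an illustration: every demo_* name and register ID is made up for this example and is not taken from any KVM header. The only thing borrowed from the real interface is the shape of the request, an ID plus the address of a user buffer, as in struct kvm_one_reg.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical register IDs, invented for this sketch. */
enum { DEMO_REG_CTRL = 1, DEMO_REG_PRIO = 2, DEMO_REG_HIDDEN = 3 };

/* Shaped like struct kvm_one_reg: an ID plus a user buffer address. */
struct demo_one_reg {
    uint64_t id;
    uint64_t addr;
};

/* Hypothetical device state. */
struct demo_dev {
    uint32_t ctrl, prio, hidden;
};

/* Map a register ID to its storage, or NULL if the ID is unknown. */
static uint32_t *demo_lookup(struct demo_dev *d, uint64_t id)
{
    switch (id) {
    case DEMO_REG_CTRL:   return &d->ctrl;
    case DEMO_REG_PRIO:   return &d->prio;
    case DEMO_REG_HIDDEN: return &d->hidden; /* added later; struct unchanged */
    default:              return NULL;
    }
}

/* Copy one value from the caller's buffer into the device. */
int demo_set_one_reg(struct demo_dev *d, const struct demo_one_reg *r)
{
    uint32_t *slot;

    assert(d && r);
    slot = demo_lookup(d, r->id);
    if (!slot)
        return -1;
    memcpy(slot, (const void *)(uintptr_t)r->addr, sizeof(*slot));
    return 0;
}

/* Copy one value from the device into the caller's buffer. */
int demo_get_one_reg(struct demo_dev *d, const struct demo_one_reg *r)
{
    uint32_t *slot;

    assert(d && r);
    slot = demo_lookup(d, r->id);
    if (!slot)
        return -1;
    memcpy((void *)(uintptr_t)r->addr, slot, sizeof(*slot));
    return 0;
}
```

The point of the sketch is that adding DEMO_REG_HIDDEN later only adds a case to the lookup. The request structure passed across the boundary never changes size, so there is no padding or feature bitmap to run out of.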

> 
>>>>>>                                       It'd also be using
>>>>>> KVM_SET_IRQCHIP to read data, which is the sort of thing you object
>>>>>> to later on regarding KVM_IRQ_LINE_STATUS.
>>>>>> 
>>>>> Do not see why.
>>>> 
>>>> It's either that, or have the data direction of the "chip" field in
>>>> KVM_GET_IRQCHIP not be entirely in the "get" direction.
>>>> 
>>> Still do not follow. Example?
>> 
>> struct kvm_irqchip has "chip_id", "pad", and "chip".  "pad" is
>> insufficient to communicate attribute type plus a pointer.  So if we
>> want to provide a pointer for the kernel to write the attribute
>> into, it has to read from memory that the ioctl definition suggests
>> should only be written to.
> Yes, but this is not different from the interface you propose.
> 
>> 
>>>>> Setting the address is setting an attribute. Sending MSI is a command.
>>>>> Things you set/get during init/migration are attributes. Things you do
>>>>> to cause side-effects are commands.
>>>> 
>>>> What if I set the address at a time that isn't init/migration (the
>>>> hardware allows moving it, like a PCI BAR)?  Suddenly it becomes not
>>>> an attribute due to how the caller uses it?
>>>> 
>>> What's the interface for guest to move it?
>> 
>> Some non-MPIC registers called CCSRBARH, CCSRBARL, and CCSRBAR.
>> 
>>> Why it goes via userspace?
>> 
>> Because the mechanism in question doesn't just move MPIC.  It moves
>> a big block of a bunch of different devices all at once.
>> 
>>> You can move APIC base too, but this does not involve userspace. But
>>> even if you do go via userspace, it is just a guest asking to change
>>> device configuration, so using SET_ATTR to set new configuration is fine.
>>> 
>>>>>> What is the benefit of KVM_IRQ_LINE over what MPIC does?  What real
>>>>>> (non-glue/wrapper) code can become common?
>>>>>> 
>>>>> No new ioctl with exactly same result (well actually even faster since
>>>>> less copying is done).
>>>> 
>>>> Which ioctl would go away?
>>>> 
>>> Those that you propose in your new interface.
>> 
>> No, they wouldn't.  At most one MPIC attribute group would go away
>> (though as I've noted it would still be useful to be able to "get"
>> those attributes for debugging).
>> 
>>>>>> And I really hope you don't want us to do MSIs the x86 way.
>>>>>> 
>>>>> What is wrong with KVM_SIGNAL_MSI? Except that you'll need code to
>>>>> glue it to mpic.
>>>> 
>>>> We'll have to write extra code for it compared to the current way
>>>> where it works with *zero* code beyond what is wanted for other
>>>> purposes such as debug and (eventually) migration.  At least it's
>>>> more direct than having to establish a GSI route...
>>> If just writing a register causes an MSI to be sent, how do you distinguish
>>> between a write that should send an MSI and a write that is done on migration
>>> to transfer the current value?
>> 
>> It is a write-only command register.  The registers that contain the
>> state are elsewhere.
>> 
> The register may be write-only from the OS point of view, but its internal
> state may still need to be transferred on migration. This brings us back
> to the point that state and commands should have different accessors.

Scott expressed this through his "groups". Hidden registers are a separate group, but still accessed using the same ideology of one register at a time. I think that makes sense.
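
Roughly what I have in mind with groups, as a hedged sketch: the group names, sizes, and the side_effects counter below are all invented for illustration and are not the RFC's actual definitions. One accessor takes a (group, attr) pair. The register group behaves like a guest access, side effects included, while the hidden-state group just stores the value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical attribute groups, invented for this sketch. */
enum demo_group {
    DEMO_GRP_REGISTER = 1,  /* behaves like a guest register access */
    DEMO_GRP_HIDDEN   = 2   /* raw internal state, no side effects  */
};

#define DEMO_NUM_REGS   4u
#define DEMO_NUM_HIDDEN 2u

/* One request: which group, which attribute in it, and the value. */
struct demo_attr {
    uint32_t group;
    uint32_t attr;
    uint32_t val;
};

struct demo_pic {
    uint32_t regs[DEMO_NUM_REGS];
    uint32_t hidden[DEMO_NUM_HIDDEN];
    unsigned int side_effects;  /* counts simulated guest-visible writes */
};

/* Set one attribute; the group decides the semantics of the write. */
int demo_set_attr(struct demo_pic *p, const struct demo_attr *a)
{
    assert(p && a);

    switch (a->group) {
    case DEMO_GRP_REGISTER:
        if (a->attr >= DEMO_NUM_REGS)
            return -1;
        p->regs[a->attr] = a->val;
        p->side_effects++;      /* e.g. an interrupt might get raised */
        return 0;
    case DEMO_GRP_HIDDEN:
        if (a->attr >= DEMO_NUM_HIDDEN)
            return -1;
        p->hidden[a->attr] = a->val;  /* state only, nothing triggered */
        return 0;
    default:
        return -1;
    }
}
```

A migration path would then only ever touch the side-effect-free group, and the distinction shows up in the group number rather than in a separate ioctl.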

> 
>> Again, we do not currently support migration on MPIC.  It is a very
>> low priority for embedded.  We do not wish to rule it out entirely,
>> but it most likely would require adding more state accessors.
> The interface is supposed to be generic, we are not talking about MPIC
> specifically here. Regarding migration "never say never"  :)
> 
>> 
>>> We had that problem with MSRs on x86. We had
>>> to, eventually, add a flag that tells us the reason of MSR access.
>> 
>> The equivalent to that flag would be using the right kind of
>> accessor for what you want to do (simulated guest access versus
>> backdoor state access).
> Ugh.

If we need to modify hidden state for migration, we can always create hidden registers in this interface. We won't know until we actually implement it. And then a new MPIC revision gets released and we need to do things all over again.

The point of a generic get/set interface really is that we can change and extend the type of state we move between user and kernel space.

> 
>> 
>>>>>> In the XICS thread, Paul brought up the possibility of cascaded
>>>>>> MPICs.  It's not relevant to the systems we're trying to model, but
>>>>>> if one did want to use the in-kernel irqchip interface for that, it
>>>>>> would be really nice to be able to operate on a specific MPIC for
>>>>>> injection rather than have to come up with some sort of global
>>>>>> identifier (above and beyond the minor flattening we'd need to do to
>>>>>> represent a single MPIC's interrupts in a flat numberspace).
>>>>>> 
>>>>> ARM encodes information in irq field of KVM_IRQ_LINE like that:
>>>>>  bits: | 31 ... 24 | 23  ... 16 | 15    ...     0 |
>>>>> field: | irq_type  | vcpu_index |   irq_number    |
>>>>> Why will similar approach not work?
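
For reference, the bit layout quoted above can be sketched as a pair of pack/unpack helpers. This is an illustration only: the demo_* names and macros are made up here, and only the field positions come from the layout as quoted.

```c
#include <assert.h>
#include <stdint.h>

/* Field positions taken from the quoted layout; names are hypothetical. */
#define DEMO_IRQ_TYPE_SHIFT 24
#define DEMO_IRQ_VCPU_SHIFT 16
#define DEMO_IRQ_TYPE_MASK  0xffu
#define DEMO_IRQ_VCPU_MASK  0xffu
#define DEMO_IRQ_NUM_MASK   0xffffu

/* Pack irq_type, vcpu_index and irq_number into one 32-bit irq field. */
static inline uint32_t demo_irq_pack(uint32_t type, uint32_t vcpu,
                                     uint32_t num)
{
    assert(type <= DEMO_IRQ_TYPE_MASK);
    assert(vcpu <= DEMO_IRQ_VCPU_MASK);
    assert(num <= DEMO_IRQ_NUM_MASK);
    return (type << DEMO_IRQ_TYPE_SHIFT) |
           (vcpu << DEMO_IRQ_VCPU_SHIFT) | num;
}

static inline uint32_t demo_irq_type(uint32_t irq)
{
    return (irq >> DEMO_IRQ_TYPE_SHIFT) & DEMO_IRQ_TYPE_MASK;
}

static inline uint32_t demo_irq_vcpu(uint32_t irq)
{
    return (irq >> DEMO_IRQ_VCPU_SHIFT) & DEMO_IRQ_VCPU_MASK;
}

static inline uint32_t demo_irq_num(uint32_t irq)
{
    return irq & DEMO_IRQ_NUM_MASK;
}
```

All 32 bits are spoken for by the three fields, so an irq chip identifier would have to be carved out of one of them or carried elsewhere.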
>>>> 
>>>> Well, it conflicts with the GSI routing stuff, and I don't see an
>>>> irq chip ID field...
>>> It does :( Can't say I am happy about it, but I skipped the discussion
>>> about the interface back then and it is too late to complain now. Since,
>>> as you noticed, irqfd interfaces with irq routing, I wonder what ARM's
>>> plan is for it. But if you choose to go the ARM way, the format is ARM
>>> specific, so you can use your own encoding and put irq chip information there.
>> 
>> Well, we do want to support irqfd, so I don't think we'll be
>> following ARM here.
>> 
> I think they want to. Maybe they have a plan to enhance irqfd. S390
> people are doing something about it now. Haven't looked at the proposed patches yet.
> 
>>>> But otherwise (or assuming you mean to use such an encoding when
>>>> setting up a GSI route), I didn't say this part couldn't be made to
>>>> work.  It will require new kernel code for managing a GSI table in a
>>>> non-APIC way, and a new callback into the device code, but as I've
>>>> said elsewhere I think we need it for irqfd anyway.  If I use
>>>> KVM_IRQ_LINE for injecting interrupts, do you still object to the
>>>> rest of it?
>>> The rest of what, proposed interface? There are two separate discussions
>>> happening here interleaved. First is "do we need to introduce new generic
>>> interface for device creation when existing one, albeit not ideal, can be
>>> used" and I am OK with that as long as ARM moves to it for 3.10, although
>>> I would prefer to have some example of what this interface will be used
>>> for besides irq chips otherwise it will be just another way to create
>>> irqchip.
>> 
>> We need a new way to create irqchips anyway, even if it's just what
>> the XICS patchset adds (KVM_CREATE_IRQCHIP_ARGS, which is similar to
>> KVM_CREATE_DEVICE except it doesn't return an identifier for
>> operating on a specific device).  And of course we want to sort this
>> out before either patchset gets merged, so we don't end up adding
>> both methods.  I suspect the XICS patchset flew under your radar
>> because it has "PPC:" in front of it and the subject line doesn't
>> mention the ioctl, but I'm not the only one that felt the need for a
>> few new ioctls.
>> 
> I noticed the thread (there was in-kernel irqchip there to compensate for
> PPC), but haven't read it before sending the email otherwise I would
> have added here that you guys need to agree on common interface.
> 
> Now I looked closer into proposed interface. I am not sure why Paul
> decided to implement KVM_CREATE_IRQCHIP_ARGS instead of using
> KVM_CREATE_IRQCHIP. He supports only one type with his patches and I am
> not sure if he is planning to add something else. KVM_IRQCHIP_SET_SOURCES
> looks like it tries to reimplement irq routing.

Because the same host hardware can have either an in-kernel MPIC or in-kernel XICS, depending on the type of guest you want to run.

> 
>> As for other types of devices, x86 has i8254.c emulated in-kernel --
>> I know that's not going to switch to the new interface, but it could
>> have if it existed back then.
> Since it is not going to switch it is not a good example. On x86 probably
> interrupt remapping device will have to have some kernel component that
> may take advantage of the new interface, but I haven't thought about
> interrupt remapping implementation enough to be sure.
> 
>>                               I can also see creating an in-kernel
>> emulation device for doing MMIO filtering on some piece of embedded
>> hardware that guests need to access with reasonable performance, but
>> the hardware designers screwed up the protection slightly (e.g. put
>> other things in the same 4K page).  We've done such filtering before
>> in our standalone hypervisor; the question is whether it happens to
>> anything with enough performance requirements to be done in the
>> kernel.
> I am not sure why special device is needed for such filtering. If MMIO
> is not handled by the kernel it is forwarded to a userspace.

Have you attended my talk at KVM Forum? :)

Emulated device access (PIO / MMIO) in kvm is about 2-3x faster than going to QEMU on x86. Pure roundtrip time. And it gets worse with hypercalls. See slide 41:

  https://dl.dropbox.com/u/8976842/KVM%20Forum%202012/MMIO%20Tuning.pdf

I haven't done these benchmarks on PPC yet, but I don't expect them to look vastly different.

> 
>> 
>>> Second one is "how the interface should look like". And here I
>>> think that strong distinction is needed between setting the attributes
>>> and sending commands with side effects for reasons explained all over
>>> this ml thread.
>> 
>> OK, so let's just call them "commands".  I like the split into
>> "read" and "write" commands, especially when most of the commands
>> naturally come in such pairs, but if you don't like that part it can
>> be reduced to a single read/write command (and then we'd define
>> separate set/get commands where appropriate).
> I think read/write will be simpler.
> 
>> 
>> Note that the XICS patchset also involves device commands.  It does
>> it by passing any unknown vm ioctl to the irqchip (XICS implements
>> KVM_IRQCHIP_GET_SOURCES and KVM_IRQCHIP_SET_SOURCES in addition to
>> KVM_IRQ_LINE).  Obviously both ways can work; I've given my reasons
>> elsewhere in the thread for preferring something that doesn't
>> require a new ioctl for every device command.
>> 
>>>> It's not about "silliness" as that this new thing I added for other
>>>> reasons did the job just as well (again, except when it comes to
>>>> irqfd), and avoided the need for a GSI table, etc.  IRQ injection
>>>> was not the main point of the new interface.
>>> Having generic interface for device creation and then make some devices
>>> special by allowing them to be used with KVM_IRQ_LINE makes little sense,
>> 
>> Well, we'd want to document which devices tie into which generic
>> interfaces, and which device is selected if multiple such devices
>> are created (if that is allowed for a particular device class).
>> 
>> For KVM_IRQ_LINE, we could perhaps use the device id as the irqchip
>> id in kvm_irq_routing_irqchip (or, have an attribute/command that
>> retrieves the id to be used there).  Unfortunately there is no
>> irqchip field in kvm_irq_routing_msi, though since it's basically a
>> command to write to an arbitrary MMIO address, maybe it could just
>> be implemented that way?
>> 
> How is MSI delivered with MPIC? From the device's point of view sending an MSI
> is doing a write at address X of data Y. How does PPC with MPIC translate this
> into an actual interrupt?

The device DMAs into an MPIC register to trigger the MSI. The MPIC then enables the "level high" bit for that particular internal interrupt line. It also triggers an interrupt on a CPU - the same way it would with level or timer interrupts. Filtering and priority also work the same way. Once the guest ACKs that interrupt, the "level high" bit for that MSI line gets cleared.

So essentially, the MPIC converts MSIs into level based interrupts.

That's why pushing an MSI into the in-kernel MPIC by using the live migration interface works. All you need to do is live migrate the "MSI interrupt line is active" bit into the device state.
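
A small sketch of that behavior, heavily simplified: it ignores masking, priority and per-CPU routing, and all demo_* names are invented for illustration rather than taken from the MPIC code. It only models the "level high" bit per MSI line and shows that restoring the bit through a state accessor leaves the device in the same condition as a trigger would.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NUM_MSI 32u

/* Hypothetical model: one "level high" bit per internal MSI line. */
struct demo_mpic {
    uint32_t msi_active;
};

/* A write to the MSI trigger register raises the line's level bit. */
void demo_msi_trigger(struct demo_mpic *m, unsigned int line)
{
    assert(line < DEMO_NUM_MSI);
    m->msi_active |= UINT32_C(1) << line;
}

/* The guest acknowledges the interrupt; the level bit is cleared. */
void demo_guest_ack(struct demo_mpic *m, unsigned int line)
{
    assert(line < DEMO_NUM_MSI);
    m->msi_active &= ~(UINT32_C(1) << line);
}

/* State-restore path: sets the same bits a trigger would have set. */
void demo_restore_active(struct demo_mpic *m, uint32_t bits)
{
    m->msi_active = bits;
}

/* An interrupt is presented to the CPU while any line is active. */
int demo_cpu_irq_pending(const struct demo_mpic *m)
{
    return m->msi_active != 0;
}
```

In this model demo_msi_trigger(m, 5) and demo_restore_active(m, 1u << 5) on an idle device end in the same state, which is the whole trick.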

> 
>>>>>> I see no need for a separate ioctl in terms of the underlying
>>>>>> infrastructure for distinguishing "attribute" from "write-only
>>>>>> command".  I'm open to improvements on what the ioctl is called.
>>>>>> It's basically like setting a register on a device, except I was
>>>>>> concerned that if we actually called it a "register" that people
>>>>>> would take it too literally and think it's only for the architected
>>>>>> register state of the emulated device.
>>>>> I agree "attribute" is better name than "register", but injecting
>>>>> interrupt is not setting an attribute.
>>>> 
>>>> It's a dynamic attribute -- the state of the input line.  Better
>>>> names are welcome.  I don't see this difference as enough to warrant
>>>> separate ioctls.
>>> As long as you use the same attribute for migration and interrupt injection
>>> purpose I do. If you use separate attributes for migration and interrupt
>>> injection then not having separate ioctl is just a hack.
>> 
>> Why is it a hack?  Is it also a hack to not use a separate ioctl to
>> reset the device, to move its address map, etc?
>> 
> It is a hack because the purpose of the interface becomes unclear. If you
> see it called in code you have no idea what semantics are expected.
> For instance getting/setting of state should be done when vcpus are
> not running, but commands may be sent while they are. This also
> encourages bugs when an incorrect attribute is used during migration or vice
> versa. In short the interface does not help you, it requires you to read
> the documentation of each device very carefully. Reset is one thing that
> will be implemented as a command ioctl. Moving the address map depends. If
> an address is part of the device's state and will be migrated as such then it
> is a device attribute, otherwise use a command to move it.

I agree here. I think it makes sense to have a separate interface to set interrupt states. This separate interface can be a separate group inside this interface too. But I agree that we shouldn't use the live migration interface to set the interrupt line of the MPIC.


Alex


^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-25 13:09         ` Gleb Natapov
@ 2013-02-25 15:29           ` Alexander Graf
  -1 siblings, 0 replies; 261+ messages in thread
From: Alexander Graf @ 2013-02-25 15:29 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Paul Mackerras, Scott Wood, kvm-ppc, kvm, Christoffer Dall


On 25.02.2013, at 14:09, Gleb Natapov wrote:

> On Mon, Feb 25, 2013 at 12:11:19PM +1100, Paul Mackerras wrote:
>> On Mon, Feb 18, 2013 at 02:21:59PM +0200, Gleb Natapov wrote:
>>> Copying Christoffer since ARM has in kernel irq chip too.
>>> 
>>> On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote:
>>>> Currently, devices that are emulated inside KVM are configured in a
>>>> hardcoded manner based on an assumption that any given architecture
>>>> only has one way to do it.  If there's any need to access device state,
>>>> it is done through inflexible one-purpose-only IOCTLs (e.g.
>>>> KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
>>>> cumbersome and depletes a limited numberspace.
>>>> 
>>>> This API provides a mechanism to instantiate a device of a certain
>>>> type, returning an ID that can be used to set/get attributes of the
>>>> device.  Attributes may include configuration parameters (e.g.
>>>> register base address), device state, operational commands, etc.  It
>>>> is similar to the ONE_REG API, except that it acts on devices rather
>>>> than vcpus.
>>> You are not only providing a different way to create an in-kernel irq chip, you
>>> also use an alternate way to trigger interrupt lines. Before going into
>>> interface specifics let's think about whether it is really worth it. x86
>>> obviously supports the old way and will have to for some, very long, time.
>>> ARM vGIC code, that is ready to go upstream, uses the old way too. So it will
>>> be 2 archs against one. Christoffer, do you think the proposed way is
>>> better for your needs? Are you willing to make vGIC use it?
>> 
>> In fact there have been two distinct interrupt controller emulations
>> for PPC posted recently, Scott's and mine, with quite different
>> interfaces.  In contrast to Scott's device control API, where the
>> operations you would do for different interrupt controllers are quite
>> different, mine attempted to define a much more abstract interface for
>> interrupt controller emulations, with the idea that an interrupt
>> controller subsystem connects a set of interrupt sources, each with
>> some state, to a set of interrupt delivery mechanisms, one per vcpu,
>> also with some state.
>> 
> I agree with Scott that it is futile to try and come up with generic
> irqchip configuration interface and I doubt it is needed from QEMU
> or other userspace pov. I looked at proposed KVM_IRQCHIP_SET_SOURCES
> interface and while it is possible to pass some information about
> pic/ioapic using it there will be a lot of information that will not
> fit there. For one there is global irqchips related state and proposed
> interface only talk about interrupt sources. Another is that using
> generic interface will require us to have a code that translate irqchip
> representation into this generic one and back for no apparent gain.
> Currently pic/ioapic state is very similar to what HW specification
> defines and it is not going to change.
> 
> Looking at your implementation of KVM_IRQCHIP_SET_SOURCES I wonder how
> well it works for migration though. The interface is supposed to transfer
> irqchip state as is, but I see things like this in your code:
> 
>                       mutex_lock(&ics->lock);
>                       irqp->server = val & KVM_IRQ_SERVER_MASK;
>                       irqp->priority = val >> KVM_IRQ_PRIORITY_SHIFT;
>                       irqp->resend = 0;
>                       irqp->masked_pending = 0;
>                       irqp->asserted = 0;
> 
> Why is it safe to initialize those values to default values during
> migration? Wouldn't it be simpler and more correct to just transfer
> the whole content of irqp from src to dst?
> 
>> Thus my interface had:
>> 
>> - a KVM_CREATE_IRQCHIP_ARGS ioctl, with an argument structure that
>>  indicates the overall architecture of the interrupt subsystem,
>> 
>> - KVM_IRQCHIP_SET_SOURCES and KVM_IRQCHIP_GET_SOURCES ioctls, that
>>  return or modify the state of some set of interrupt sources
>> 
>> - a KVM_REG_PPC_ICP_STATE identifier in the ONE_REG register
>>  identifier space, that is used to save and restore the per-vcpu
>>  interrupt presentation state.
>> 
>> Alternatively, I could modify my code to use the existing
>> KVM_CREATE_IRQCHIP, KVM_GET_IRQCHIP, and KVM_SET_IRQCHIP ioctls.  If I
>> were to do that I would want to rename the 'pad' field in struct
>> kvm_irqchip to 'nr' and use it with 'chip_id' to identify a range of
>> interrupt sources to be affected.  I'd also want to keep the ONE_REG
>> identifier for the per-vcpu state.
> It is preferable to the interface you propose since I do not think your
> interface fits other interrupt controllers well. You can put nr field
> into dummy[] payload, or replace pad with "union {pad, nr}".
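
A sketch of the "union {pad, nr}" variant, for illustration only. The struct below is a stand-in modeled on the chip_id / pad / chip layout discussed in this thread, not the real uapi definition, and demo_set_range is a hypothetical helper. It relies on C11 anonymous unions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the chip_id/pad/chip layout discussed in the thread. */
struct demo_irqchip {
    uint32_t chip_id;       /* here: first interrupt source of the range */
    union {
        uint32_t pad;       /* old name, kept for existing users */
        uint32_t nr;        /* new name: number of sources in the range */
    };
    union {
        char dummy[512];    /* opaque payload space */
    } chip;
};

/* The alias must not move any field or change the overall size. */
_Static_assert(offsetof(struct demo_irqchip, nr) ==
               offsetof(struct demo_irqchip, pad), "nr must alias pad");
_Static_assert(sizeof(struct demo_irqchip) == 8 + 512,
               "layout must stay unchanged");

/* Hypothetical helper: select a range of interrupt sources. */
void demo_set_range(struct demo_irqchip *c, uint32_t first, uint32_t count)
{
    assert(c);
    c->chip_id = first;
    c->nr = count;
}
```

Because nr occupies the same four bytes as pad, callers that still zero pad keep working, and the structure size seen by the ioctl does not change.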
> 
>> 
>> Or I could change over to using Scott's device control API, though I
>> have some objections to doing that.
>> 
>> What is your advice about which interface to use?
>> 
> The ideal situation would be if you and Scott agree on the details about
> the interface. If you don't like something about Scott's interface we
> can discuss it and shape it to something you agree with or even like.
> I already asked Scott to introduce command interface. Scott does not
> care about migration, you do, so you can make sure that interface works
> for that.

I agree. I really want to see one style used for both devices. If we come to the conclusion that we need to go with the old ioctl payload style, then fine. I obviously prefer fine-grained individual state accessors.

But really, even if the "generic interface" only generalizes MPIC and XICS it's already a win for me :).


Alex

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-23 15:04                           ` Marcelo Tosatti
@ 2013-02-26  0:27                             ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-02-26  0:27 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Gleb Natapov, Alexander Graf, kvm-ppc, kvm, Christoffer Dall

  On 02/23/2013 09:04:33 AM, Marcelo Tosatti wrote:
> On Thu, Feb 21, 2013 at 08:00:25PM -0600, Scott Wood wrote:
> > On 02/21/2013 05:03:32 PM, Marcelo Tosatti wrote:
> > >You are not writing to the registers from the CPU point of view.
> >
> > That's exactly how KVM_DEV_MPIC_GRP_REGISTER is defined and
> > implemented on MPIC (with the exception of registers whose behavior
> > changes based on which specific vcpu you use to access them).
> > If/when we have a need to set/get state in a different manner,
> > that's a separate attribute group.
> 
> Can you describe usage of this register again?

It's used by QEMU to reflect memory space accesses to the kernel.  In  
practice, these are either debug accesses or MSIs.

> > >Also, it is necessary to provide proper locking of device attribute
> > >write versus vcpu device access. So far we have been focusing on
> > >having
> > >a lockless vcpu path.
> >
> > How is device access related to vcpus?  Existing irqchip code is not
> > lockless.
> 
> VCPUS access in-kernel devices.

Right, VCPUs (plural) not VCPU (singular).  Thus locking is needed.

> Yes, it is lockless (see RCU usage in virt/kvm/).

virt/kvm/kvm_main.c: /* kvm_io_bus_write - called under kvm->slots_lock */
virt/kvm/ioapic.c: spin_lock(&ioapic->lock);
etc.

> > >Basically abstract 'device attributes' are too abstract.
> >
> > It's up to the device-specific documentation to make them not
> > abstract (I admit there are a few details missing in mpic.txt that
> > I've pointed out in this thread -- it is RFC v1 after all).  This
> > wouldn't be any different if we used separate ioctls for everything.
> > It's like saying abstract 'ioctl' is too abstract.
> 
> Perhaps a better way to put it is that it's too permissive.

It's no more permissive than saying "go define an ioctl".  It is  
somewhat more constrained than that, in that it is an operation on a  
particular device, as opposed to any possible KVM action.

> > >However, your proposed interface deals with sucky capability,
> > >versioning
> > >and namespace conflicts we have now. Note these items can probably be
> > >improved separately.
> >
> > Any particular proposals?
> 
> Namespace conflicts: Reserve ranges for each arch.

That only helps when the conflict is between different arches (as  
opposed to different devices), and it discourages sharing between  
arches.

-Scott

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-24 15:46                     ` Gleb Natapov
@ 2013-02-26  2:38                       ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-02-26  2:38 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Alexander Graf, kvm-ppc, kvm, Christoffer Dall, Paul Mackerras

On 02/24/2013 09:46:17 AM, Gleb Natapov wrote:
> On Thu, Feb 21, 2013 at 08:17:54PM -0600, Scott Wood wrote:
> >  On 02/21/2013 02:22:09 AM, Gleb Natapov wrote:
> > >On Wed, Feb 20, 2013 at 08:05:12PM -0600, Scott Wood wrote:
> > >> You mean like what we did with SREGS, that got deprecated and
> > >> replaced with ONE_REG?
> > >>
> > >SREGS is implemented by ppc. I see ONE_REG as an addition to the REGS
> > >interface. You can access all registers at once or you can access them
> > >one by one. If there is a need we can add MULTIPLE_REGS that will get
> > >a list of requested REGS.
> >
> > http://www.spinics.net/lists/kvm-ppc/msg04876.html
> > http://www.spinics.net/lists/kvm-ppc/msg05842.html
> >
> If Alex prefers ONE_REG interface this is his call :)

My point is the reasons we prefer it: SREGS really is deprecated for new
state on PPC (REGS is not extensible so it's been in that state for even
longer -- hence SREGS), and we want to avoid creating that same mess
again.  Running out of space is just one of the ways in which SREGS is a
mess, and I think trying to represent MPIC state that way would be even
worse.

> > >The interface is not over generic i.e it does
> > >not try to replace KVM_RUN by writing special register.
> >
> > Sigh.
> >
> > >>                                                    And we
> > >> don't necessarily want to set *everything*.
> > >What are those cases? You do need to on reset/migration.
> >
> > Why do we want to set all the registers on reset, rather than tell
> > the in-kernel device to reset?  The default state came from the
> > kernel in the first place on irqchip creation...
> >
> I have nothing against telling in-kernel device to reset provided there
> is a way to do so, which current interface lacks.

If by "current" you mean what is proposed here, you can use
KVM_DEV_MPIC_GRP_REGISTER to set GCR[RESET].

I can see the appeal of an explicit "reset device" command, though,
especially if the hardware reset bit is defined such that you're supposed
to poll until it's cleared to know that the reset is complete (the KVM
implementation is not likely to return before the reset is complete, but
it would be better to not rely on such knowledge).
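
A minimal sketch of the "set the reset bit, then poll until it clears"
pattern described above.  Everything here is a toy model for
illustration: the toy_* names are hypothetical, the register offset and
bit value are assumptions based on the usual OpenPIC layout, and the
mock device completes the reset synchronously.  It is not the code from
this series.

```c
#include <stdint.h>

/* Assumed OpenPIC-style values; check the device documentation. */
#define TOY_GCR_OFFSET 0x1020u
#define TOY_GCR_RESET  0x80000000u

/* Hypothetical stand-in for an in-kernel MPIC instance. */
struct toy_mpic {
	uint32_t gcr;
	uint32_t pending;	/* some state that a reset clears */
};

/* Toy register write: a reset completes at once and the bit self-clears. */
static int toy_reg_write(struct toy_mpic *m, uint32_t off, uint32_t val)
{
	if (off != TOY_GCR_OFFSET)
		return -1;
	if (val & TOY_GCR_RESET) {
		m->pending = 0;
		val &= ~TOY_GCR_RESET;
	}
	m->gcr = val;
	return 0;
}

static int toy_reg_read(const struct toy_mpic *m, uint32_t off, uint32_t *val)
{
	if (off != TOY_GCR_OFFSET)
		return -1;
	*val = m->gcr;
	return 0;
}

/*
 * Request a reset and poll for completion instead of assuming the
 * write only returns once the reset is done.
 * Returns 0 on success, -1 on access error or if max_polls is exhausted.
 */
static int toy_reset(struct toy_mpic *m, int max_polls)
{
	uint32_t val;

	if (toy_reg_write(m, TOY_GCR_OFFSET, TOY_GCR_RESET))
		return -1;

	for (int i = 0; i < max_polls; i++) {
		if (toy_reg_read(m, TOY_GCR_OFFSET, &val))
			return -1;
		if (!(val & TOY_GCR_RESET))
			return 0;
	}
	return -1;
}
```

An explicit "reset device" command would make the polling loop
unnecessary, which is the appeal mentioned above.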

> Reset in userspace has
> its advantage too: bugs are easier to fix,

They're harder to fix if fixing it requires adding something to the API.
Bugs in reset are just a small portion of overall bugs, so I'm not sure
how important this is.  It would require that we implement additional
state accessors now, rather than if and when we do migration.  It could
cause the total bug count to go up versus code that already works and is
self-contained.  :-P

> there may be different kind of resets (hard/soft).

This can be communicated through a device-specific command if it is
applicable (it isn't for MPIC).

> > Again, we do not currently support migration on MPIC.  It is a very
> > low priority for embedded.  We do not wish to rule it out entirely,
> > but it most likely would require adding more state accessors.
> The interface is supposed to be generic; we are not talking about MPIC
> specifically here.

There are two separate things here.  There is the device control api,
which is generic, but is just a way to create devices and issue commands
to them.

Then there is the specific device implementation, which is not generic.
The details of how migration is handled for a particular device belong to
the specific device implementation.

> Regarding migration "never say never"  :)

I did say we don't want to rule it out -- I'm just emphasising this so
that we don't get bogged down in questions of how the currently defined
MPIC attributes/commands would get used for migration.  If we tried to
blindly add support for that now, without being able to test, or being
able to refer to what QEMU normally saves/restores for MPIC, we'd
probably get it wrong.  Just as the untested savevm code in QEMU's
hw/openpic.c is wrong/incomplete.

> > As for other types of devices, x86 has i8254.c emulated in-kernel --
> > I know that's not going to switch to the new interface, but it could
> > have if it existed back then.
> Since it is not going to switch it is not a good example.

The point of the example is to show that irqchips aren't the only thing
that could ever make sense to emulate in the kernel.  I can't see into
the future and tell you what the actual first non-irqchip user would be,
or if there will definitely be one.

If we wait until that user comes along, we'd end up saying *that* is
the only user and "since MPIC is not going to switch it is not a good
example".  If no non-irqchip user ends up coming along, what is the
actual harm from not putting the words "irqchip" in the interface
description (the generic part, not the device-specific part)?

> >                                I can also see creating an in-kernel
> > emulation device for doing MMIO filtering on some piece of embedded
> > hardware that guests need to access with reasonable performance, but
> > the hardware designers screwed up the protection slightly (e.g. put
> > other things in the same 4K page).  We've done such filtering before
> > in our standalone hypervisor; the question is whether it happens to
> > anything with enough performance requirements to be done in the
> > kernel.
> I am not sure why special device is needed for such filtering. If MMIO
> is not handled by the kernel it is forwarded to a userspace.

"with reasonable performance"

> > For KVM_IRQ_LINE, we could perhaps use the device id as the irqchip
> > id in kvm_irq_routing_irqchip (or, have an attribute/command that
> > retrieves the id to be used there).  Unfortunately there is no
> > irqchip field in kvm_irq_routing_msi, though since it's basically a
> > command to write to an arbitrary MMIO address, maybe it could just
> > be implemented that way?
> >
> How is an MSI delivered with MPIC? From the device's point of view,
> sending an MSI is a write of data Y to address X. How does PPC with
> MPIC translate this into an actual interrupt?

Alex described pretty much what happens.

I see that currently MSIs are just hardcoded to go to
kvm_irq_delivery_to_apic().  This could become a function pointer to
whatever irqchip has registered itself as handling MSIs for a particular
VM.  If we ever have more than one irqchip at the same time that could
handle an MSI, then we'd need something more complicated, but hopefully
we never see that.  For irqfd MSIs I'd rather avoid the overhead of going
through the full MMIO emulation path.
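
The "function pointer to whatever irqchip has registered itself" idea
could look roughly like the sketch below.  All msi_* names are
hypothetical and the structures are simplified stand-ins, not the
actual KVM types; the single-handler restriction reflects the
assumption above that only one irqchip per VM handles MSIs.

```c
#include <stddef.h>
#include <stdint.h>

struct msi_vm;

/* Per-VM MSI delivery hook, supplied by the irqchip emulation. */
typedef int (*msi_deliver_fn)(struct msi_vm *vm, uint64_t addr, uint32_t data);

/* Hypothetical stand-in for the per-VM state. */
struct msi_vm {
	msi_deliver_fn deliver_msi;
	uint32_t last_irq;	/* toy state so the effect is observable */
};

/* Only one irqchip may claim MSI delivery for a VM. */
static int msi_register_handler(struct msi_vm *vm, msi_deliver_fn fn)
{
	if (vm->deliver_msi)
		return -1;
	vm->deliver_msi = fn;
	return 0;
}

/* Generic entry point: dispatch to the registered irqchip, if any. */
static int msi_send(struct msi_vm *vm, uint64_t addr, uint32_t data)
{
	if (!vm->deliver_msi)
		return -1;
	return vm->deliver_msi(vm, addr, data);
}

/* Toy irqchip handler: treats the MSI data as the interrupt number. */
static int msi_toy_pic_deliver(struct msi_vm *vm, uint64_t addr, uint32_t data)
{
	(void)addr;
	vm->last_irq = data;
	return 0;
}
```

Calling the hook directly from the irqfd path is what avoids going
through full MMIO emulation in this model.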

> > >As long as you use the same attribute for migration and interrupt
> > >injection purpose I do. If you use separate attributes for migration
> > >and interrupt injection then not having separate ioctl is just a hack.
> >
> > Why is it a hack?  Is it also a hack to not use a separate ioctl to
> > reset the device, to move its address map, etc?
> >
> It is a hack because purpose of the interface becomes unclear.

You can have a separate interface without having a separate ioctl.  I
understand the objection to calling certain types of commands
"attributes", but I don't understand why "set this attribute" can't be
called a command.

> If you
> see it called in a code you have no idea what semantics is expected.

Hmm?  If you see a call to KVM_IOCTL_FOO, you look up the documentation
for KVM_IOCTL_FOO to see what the semantics are.  If you see a call to
KVM_DEVICE_COMMAND with a KVM_DEV_MPIC_GRP_FOO command group, you look at
mpic.txt to see what the semantics are.
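
As a rough illustration of why the group number tells a reader where to
look, here is a sketch of a device dispatching a "set attribute" call
by group.  The grp_* names, group numbers and struct layout are
hypothetical and do not match the structures in patch 1/6; the point is
only that each group has its own documented semantics.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical groups; real values would be defined per device. */
enum { GRP_TOY_MISC = 1, GRP_TOY_REGISTER = 2 };
enum { GRP_TOY_MISC_BASE_ADDR = 0 };

#define GRP_TOY_NUM_REGS 4

/* Simplified stand-in for the attribute descriptor. */
struct grp_attr {
	uint32_t group;
	uint64_t attr;
	uint64_t value;
};

struct grp_dev {
	uint64_t base_addr;
	uint32_t regs[GRP_TOY_NUM_REGS];
};

/* One entry point; semantics are selected by the attribute group. */
static int grp_set_attr(struct grp_dev *dev, const struct grp_attr *a)
{
	switch (a->group) {
	case GRP_TOY_MISC:
		if (a->attr != GRP_TOY_MISC_BASE_ADDR)
			return -ENXIO;
		dev->base_addr = a->value;
		return 0;
	case GRP_TOY_REGISTER:
		/* attr is a byte offset into a file of 32-bit registers */
		if (a->attr % 4 || a->attr / 4 >= GRP_TOY_NUM_REGS)
			return -ENXIO;
		dev->regs[a->attr / 4] = (uint32_t)a->value;
		return 0;
	default:
		return -ENXIO;	/* unknown group */
	}
}
```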

> For instance getting/setting of a state should be done when vcpus are
> not running,

I don't have anything on MPIC that I would currently want to define as
"state" under that definition.  If/when migration is implemented, *some*
of the state might have that restriction, and those would be accessed via
something other than KVM_DEV_MPIC_GRP_REGISTER.  I would prefer not to
create accessors for all state in this way, just where something special
is needed -- as in PPC booke ONE_REG where only things like TSR need
special handling.  Plus, most of what would need special handling on MPIC
is totally hidden state (e.g. the pending/active queues and the
interrupt input state) rather than a special variant of an existing
register.  The registers that show active MSIs (normally read-only and
read-to-clear) are one of the few exceptions.

Note that earlier you said that you'd be OK with calling the base address
an attribute, even if it can change during guest execution -- so is
"should be done when vcpus are not running" really a good thing to base
any such split on?

> but commands may be sent while they are. This also
> encourages bugs when incorrect attribute is used during migration or
> vice versa.

How is that different from specifying the flag incorrectly when you set
an MSR?

> In short interface does not help you, it requires you to read
> documentation of each device very carefully.

You need to read it carefully enough to know what it does and how to use
it, regardless of any such splitting of ioctls.  This is especially true
if/when we start adding hidden state for migration, since the format of
that state is not defined by the device itself.

> Reset is one thing that will be implemented as a command ioctl.  Moving
> address map depends.  If an address is a part of device's state and will
> be migrated as such then it is device attribute, otherwise use command
> to move it.

It would be migrated, but can also be set at runtime for non-migration
reasons.

-Scott

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-14  5:49   ` Scott Wood
@ 2013-03-06  0:59     ` Paul Mackerras
  -1 siblings, 0 replies; 261+ messages in thread
From: Paul Mackerras @ 2013-03-06  0:59 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm

On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote:
> Currently, devices that are emulated inside KVM are configured in a
> hardcoded manner based on an assumption that any given architecture
> only has one way to do it.  If there's any need to access device state,
> it is done through inflexible one-purpose-only IOCTLs (e.g.
> KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> cumbersome and depletes a limited numberspace.
> 
> This API provides a mechanism to instantiate a device of a certain
> type, returning an ID that can be used to set/get attributes of the
> device.  Attributes may include configuration parameters (e.g.
> register base address), device state, operational commands, etc.  It
> is similar to the ONE_REG API, except that it acts on devices rather
> than vcpus.
> 
> Both device types and individual attributes can be tested without having
> to create the device or get/set the attribute, without the need for
> separately managing enumerated capabilities.

The API looks fine to me.  Ultimately I could use a version of the
get/set attribute ioctls that get or set multiple attributes within a
group, but that can come later.

Were you thinking that the attribute codes should encode the size of
the attribute, like the one_reg register IDs do?  If so it would be
good to define the bitfield and values for that in this patch.
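
For reference, a sketch of what "encode the size in the attribute code"
could mean, modelled on how ONE_REG ids carry a log2 size in a 4-bit
field.  The ATTRSZ_* names are hypothetical; the shift and mask values
are taken from the ONE_REG definitions in include/uapi/linux/kvm.h as
an assumption about what a device attribute encoding might reuse, not
something this patch defines.

```c
#include <stdint.h>

/* Same bit positions as the ONE_REG size field (assumed reuse). */
#define ATTRSZ_SHIFT 52
#define ATTRSZ_MASK  0x00f0000000000000ULL
#define ATTRSZ_U32   0x0020000000000000ULL	/* log2(4) = 2 */
#define ATTRSZ_U64   0x0030000000000000ULL	/* log2(8) = 3 */

/* Size in bytes encoded in an attribute code. */
static inline uint32_t attrsz_size(uint64_t id)
{
	return 1u << ((id & ATTRSZ_MASK) >> ATTRSZ_SHIFT);
}

/* Attribute code with the size field stripped. */
static inline uint64_t attrsz_index(uint64_t id)
{
	return id & ~ATTRSZ_MASK;
}
```

The alternative is to carry the size as its own field in the attribute
struct, which keeps the attribute number space free of encoding rules.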

The one comment I have on the implementation is that it doesn't seem
to conveniently allow for multiple instances of a device type, since
there is no instance-specific pointer stored anywhere.  More comments
below...

> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 0350e0d..dbaf012 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -335,6 +335,25 @@ struct kvm_memslots {
>  	short id_to_index[KVM_MEM_SLOTS_NUM];
>  };
>  
> +/*
> + * The worst case number of simultaneous devices will likely be very low
> + * (usually zero or one) for the forseeable future.  If the worst case
> + * exceeds this, then it can be increased, or we can convert to idr.
> + */
> +#define KVM_MAX_DEVICES 4
> +
> +struct kvm_device {
> +	u32 type;
> +
> +	int (*set_attr)(struct kvm *kvm, struct kvm_device *dev,
> +			struct kvm_device_attr *attr);
> +	int (*get_attr)(struct kvm *kvm, struct kvm_device *dev,
> +			struct kvm_device_attr *attr);
> +	int (*has_attr)(struct kvm *kvm, struct kvm_device *dev,
> +			struct kvm_device_attr *attr);
> +	void (*destroy)(struct kvm *kvm, struct kvm_device *dev);
> +};

This is more of a device class definition than a device instance
definition.  There is nothing in this struct that would be different
between different instances of a given device, and in fact it would
make sense to use the one copy of this struct for all instances of a
given type.  However, the functions listed here only take the struct
kvm_device pointer, meaning that to distinguish between instances,
these functions would have to do some sort of container_of trick to
know which instance to operate on.

I think it would make more sense either to add a void * instance data
pointer to struct kvm_device, or to add a void * argument to those
functions as an instance data pointer and add another field to struct
kvm like this:

> +
>  struct kvm {
>  	spinlock_t mmu_lock;
>  	struct mutex slots_lock;
> @@ -385,6 +404,8 @@ struct kvm {
>  	long mmu_notifier_count;
>  #endif
>  	long tlbs_dirty;
> +	struct kvm_device *devices[KVM_MAX_DEVICES];

+	void *device_instance[KVM_MAX_DEVICES];

Paul.
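
A sketch of the second option suggested above: one shared table of
operations per device type, with a separate per-slot instance pointer
passed to each callback.  The inst_* names and simplified signatures
are hypothetical and only illustrate the shape of the idea, not the
proposed kernel structures.

```c
#include <stdint.h>

/* One shared "class" table per device type. */
struct inst_device_ops {
	uint32_t type;
	int (*get_attr)(void *instance, uint64_t attr, uint64_t *val);
};

#define INST_MAX_DEVICES 4

/* Stand-in for the two parallel arrays suggested for struct kvm. */
struct inst_vm {
	const struct inst_device_ops *devices[INST_MAX_DEVICES];
	void *device_instance[INST_MAX_DEVICES];
};

/* Hypothetical device with per-instance state. */
struct inst_timer {
	uint64_t count;
};

static int inst_timer_get_attr(void *instance, uint64_t attr, uint64_t *val)
{
	struct inst_timer *t = instance;

	(void)attr;
	*val = t->count;
	return 0;
}

static const struct inst_device_ops inst_timer_ops = {
	.type = 2,
	.get_attr = inst_timer_get_attr,
};

/* Place an instance in the first free slot; returns its id or -1 if full. */
static int inst_vm_add(struct inst_vm *vm, const struct inst_device_ops *ops,
		       void *instance)
{
	for (int i = 0; i < INST_MAX_DEVICES; i++) {
		if (!vm->devices[i]) {
			vm->devices[i] = ops;
			vm->device_instance[i] = instance;
			return i;
		}
	}
	return -1;
}

/* Look up the slot and hand the callback its own instance pointer. */
static int inst_vm_get(struct inst_vm *vm, int id, uint64_t attr, uint64_t *val)
{
	if (id < 0 || id >= INST_MAX_DEVICES || !vm->devices[id])
		return -1;
	return vm->devices[id]->get_attr(vm->device_instance[id], attr, val);
}
```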

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-03-06  0:59     ` Paul Mackerras
@ 2013-03-06  1:20       ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-03-06  1:20 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Alexander Graf, kvm-ppc, kvm

On 03/05/2013 06:59:26 PM, Paul Mackerras wrote:
> On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote:
> > Currently, devices that are emulated inside KVM are configured in a
> > hardcoded manner based on an assumption that any given architecture
> > only has one way to do it.  If there's any need to access device state,
> > it is done through inflexible one-purpose-only IOCTLs (e.g.
> > KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> > cumbersome and depletes a limited numberspace.
> >
> > This API provides a mechanism to instantiate a device of a certain
> > type, returning an ID that can be used to set/get attributes of the
> > device.  Attributes may include configuration parameters (e.g.
> > register base address), device state, operational commands, etc.  It
> > is similar to the ONE_REG API, except that it acts on devices rather
> > than vcpus.
> >
> > Both device types and individual attributes can be tested without having
> > to create the device or get/set the attribute, without the need for
> > separately managing enumerated capabilities.
> 
> The API looks fine to me.  Ultimately I could use a version of the
> get/set attribute ioctls that get or set multiple attributes within a
> group, but that can come later.
> 
> Were you thinking that the attribute codes should encode the size of
> the attribute, like the one_reg register IDs do?  If so it would be
> good to define the bitfield and values for that in this patch.

Ah, forgot that ONE_REG did that.  I think I'd rather just make size a  
separate field in the struct.

> The one comment I have on the implementation is that it doesn't seem
> to conveniently allow for multiple instances of a device type, since
> there is no instance-specific pointer stored anywhere.  More comments
> below...

The device implementation creates the kvm_device struct, and can embed  
it in some other struct to provide instance-specific context, using  
container_of() to extract it.  MPIC does this in patch 6/6.
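
A minimal userspace sketch of that pattern follows.  It is not the code
from patch 6/6: struct kvm_device is cut down to a single callback, the
container_of() macro is a simplified stand-in for the kernel's, and
example_pic and the other example_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Cut-down stand-in for the kvm_device struct in this patch. */
struct kvm_device {
	unsigned int type;
	int (*get_attr)(struct kvm_device *dev, unsigned long *val);
};

/* Hypothetical per-instance state that embeds the generic struct. */
struct example_pic {
	unsigned long reg_base;
	struct kvm_device dev;
};

/* The callback only receives the kvm_device pointer... */
static int example_get_attr(struct kvm_device *dev, unsigned long *val)
{
	/* ...and recovers the enclosing instance from its address. */
	struct example_pic *pic = container_of(dev, struct example_pic, dev);

	*val = pic->reg_base;
	return 0;
}

static void example_pic_init(struct example_pic *pic, unsigned long base)
{
	pic->reg_base = base;
	pic->dev.type = 1;	/* arbitrary illustrative type code */
	pic->dev.get_attr = example_get_attr;
}

/* Two instances share one callback but resolve to different state. */
int example_usage(void)
{
	struct example_pic a, b;
	unsigned long val = 0;

	example_pic_init(&a, 0x1000);
	example_pic_init(&b, 0x2000);

	assert(a.dev.get_attr(&a.dev, &val) == 0 && val == 0x1000);
	assert(b.dev.get_attr(&b.dev, &val) == 0 && val == 0x2000);
	return 0;
}
```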

MPIC currently forbids multiple instances because of semantic issues  
(if there are multiple MPICs, which one gets the connection to the  
vcpus, and where do the outputs of the others go?), not anything to do  
with the device control API.

> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 0350e0d..dbaf012 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -335,6 +335,25 @@ struct kvm_memslots {
> >  	short id_to_index[KVM_MEM_SLOTS_NUM];
> >  };
> >
> > +/*
> > + * The worst case number of simultaneous devices will likely be very low
> > + * (usually zero or one) for the foreseeable future.  If the worst case
> > + * exceeds this, then it can be increased, or we can convert to idr.
> > + */
> > +#define KVM_MAX_DEVICES 4
> > +
> > +struct kvm_device {
> > +	u32 type;
> > +
> > +	int (*set_attr)(struct kvm *kvm, struct kvm_device *dev,
> > +			struct kvm_device_attr *attr);
> > +	int (*get_attr)(struct kvm *kvm, struct kvm_device *dev,
> > +			struct kvm_device_attr *attr);
> > +	int (*has_attr)(struct kvm *kvm, struct kvm_device *dev,
> > +			struct kvm_device_attr *attr);
> > +	void (*destroy)(struct kvm *kvm, struct kvm_device *dev);
> > +};
> 
> This is more of a device class definition than a device instance
> definition.  There is nothing in this struct that would be different
> between different instances of a given device,

Its address would be different, which can be used with container_of()  
as described above.

> and in fact it would make sense to use the one copy of this struct
> for all instances of a given type.

We could do that, but then we'd need to wrap it with a struct that  
points to the class data and the instance data (the latter either as a  
pointer or through container_of).  Given that both the number of  
instances and the number of struct members are small, I didn't think it  
was worth separating it out.
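
For illustration, the split under discussion might look roughly like
the following.  This is only a sketch of the alternative, not something
in this series; every example_* name is hypothetical and the callback
set is reduced to one function.

```c
#include <assert.h>

struct example_dev;

/* Per-type "class" data: a single shared copy for all instances. */
struct example_dev_ops {
	unsigned int type;
	int (*get_attr)(struct example_dev *dev, unsigned long *val);
};

/* Wrapper that points at both the class data and the instance data. */
struct example_dev {
	const struct example_dev_ops *ops;
	void *priv;
};

/* Hypothetical instance state for one interrupt controller. */
struct example_pic_state {
	unsigned long reg_base;
};

static int example_pic_get_attr(struct example_dev *dev, unsigned long *val)
{
	struct example_pic_state *s = dev->priv;	/* explicit pointer */

	*val = s->reg_base;
	return 0;
}

static const struct example_dev_ops example_pic_ops = {
	.type = 1,	/* arbitrary illustrative type code */
	.get_attr = example_pic_get_attr,
};

/* Two wrappers share example_pic_ops but carry different state. */
int example_split_usage(void)
{
	struct example_pic_state s0 = { .reg_base = 0x1000 };
	struct example_pic_state s1 = { .reg_base = 0x2000 };
	struct example_dev d0 = { .ops = &example_pic_ops, .priv = &s0 };
	struct example_dev d1 = { .ops = &example_pic_ops, .priv = &s1 };
	unsigned long val = 0;

	assert(d0.ops->get_attr(&d0, &val) == 0 && val == 0x1000);
	assert(d1.ops->get_attr(&d1, &val) == 0 && val == 0x2000);
	return 0;
}
```

The cost is one extra struct and one extra indirection per call; the
saving is three or four function pointers per instance.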

> However, the functions listed here only take the struct
> kvm_device pointer, meaning that to distinguish between instances,
> these functions would have to do some sort of container_of trick to
> know which instance to operate on.

container_of is fairly idiomatic in Linux.  The existing kvm_io_device  
works this way as well.  What is the advantage of using an explicit  
pointer?

-Scott

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-02-14  5:49   ` Scott Wood
@ 2013-03-06  2:48     ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 261+ messages in thread
From: Benjamin Herrenschmidt @ 2013-03-06  2:48 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm, Gleb Natapov, Marcelo Tosatti

On Wed, 2013-02-13 at 23:49 -0600, Scott Wood wrote:
> Currently, devices that are emulated inside KVM are configured in a
> hardcoded manner based on an assumption that any given architecture
> only has one way to do it.  If there's any need to access device state,
> it is done through inflexible one-purpose-only IOCTLs (e.g.
> KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> cumbersome and depletes a limited numberspace.
> 
> This API provides a mechanism to instantiate a device of a certain
> type, returning an ID that can be used to set/get attributes of the
> device.  Attributes may include configuration parameters (e.g.
> register base address), device state, operational commands, etc.  It
> is similar to the ONE_REG API, except that it acts on devices rather
> than vcpus.

All right guys, let's take a break for a minute :-)

What you seem to be proposing is a whole new construct / API to create
"device" objects with "attributes" as a way to avoid adding tons of
ioctls to the VM.

Then you somewhat "coerce" behaviours (i.e. methods) as side effects of
setting some of those attributes, and create some kind of rigid API
through which any kind of potential device "attribute" needs to be
coerced.

Essentially you are trying to re-invent encapsulation of kernel objects
manipulated by userspace, we already have several mechanisms for doing
just that and you are trying to add yet a new one :-)

What about instead using existing mechanisms for doing just that:

Make your "create device" return an anonymous file descriptor !

That gives you everything ... private ioctl's which can do whatever the
heck you want (attributes, methods, etc...), mmap, etc...

Guess what ? That's already what we do for various things like our
in-kernel emulated iommu tables as far as I can remember.

If your problem is to avoid the bottleneck of having to deal with
upstream maintainers for generic VM ioctls every time you add some new
platform specific kernel "object" then this is probably a much better
approach.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-03-06  2:48     ` Benjamin Herrenschmidt
@ 2013-03-06  3:36       ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-03-06  3:36 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexander Graf, kvm-ppc, kvm, Gleb Natapov, Marcelo Tosatti

On 03/05/2013 08:48:33 PM, Benjamin Herrenschmidt wrote:
> On Wed, 2013-02-13 at 23:49 -0600, Scott Wood wrote:
> > Currently, devices that are emulated inside KVM are configured in a
> > hardcoded manner based on an assumption that any given architecture
> > only has one way to do it.  If there's any need to access device state,
> > it is done through inflexible one-purpose-only IOCTLs (e.g.
> > KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> > cumbersome and depletes a limited numberspace.
> >
> > This API provides a mechanism to instantiate a device of a certain
> > type, returning an ID that can be used to set/get attributes of the
> > device.  Attributes may include configuration parameters (e.g.
> > register base address), device state, operational commands, etc.  It
> > is similar to the ONE_REG API, except that it acts on devices rather
> > than vcpus.
> 
> All right guys, let's take a break for a minute :-)
> 
> What you seem to be proposing is a whole new construct / API to create
> "device" objects with "attributes" as a way to avoid adding tons of
> ioctls to the VM.
> 
> Then you somewhat "coerce" behaviours (i.e. methods) as side effects of
> setting some of those attributes, and create some kind of rigid API
> through which any kind of potential device "attribute" needs to be
> coerced.

It's not *that* rigid, and the rigidity that is there reduces what has  
to be spelled out for each individual thing.

It also helps with things like strace, once "size" is added.  If we go  
the private ioctl route, it would need to be updated for every new  
command (or more likely, it never gets updated and the strace output is  
less useful).

> Essentially you are trying to re-invent encapsulation of kernel objects
> manipulated by userspace, we already have several mechanisms for doing
> just that and you are trying to add yet a new one :-)
> 
> What about instead using existing mechanisms for doing just that:
> 
> Make your "create device" return an anonymous file descriptor !

Well, technically the API documentation doesn't say that it *isn't* a  
file descriptor. :-)

I don't necessarily hate the idea, it just seemed like overkill.  And  
if I went that way probably someone else would complain that it was.

> That gives you everything ... private ioctl's which can do whatever the
> heck you want (attributes, methods, etc...), mmap, etc...

Including all the downsides of ioctls, plus dealing with the reference  
counting and destruction order...

> Guess what ? That's already what we do for various things like our
> in-kernel emulated iommu tables as far as I can remember.

Which file is this in?

-Scott

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-03-06  3:36       ` Scott Wood
@ 2013-03-06  4:28         ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 261+ messages in thread
From: Benjamin Herrenschmidt @ 2013-03-06  4:28 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm, Gleb Natapov, Marcelo Tosatti

On Tue, 2013-03-05 at 21:36 -0600, Scott Wood wrote:
> 
> Which file is this in?

A few hits:

$ grep anon arch/powerpc/kvm/*
arch/powerpc/kvm/book3s_64_mmu_hv.c:#include <linux/anon_inodes.h>
arch/powerpc/kvm/book3s_64_mmu_hv.c:	ret = anon_inode_getfd("kvm-htab", &kvm_htab_fops, ctx, rwflag);
arch/powerpc/kvm/book3s_64_vio.c:#include <linux/anon_inodes.h>
arch/powerpc/kvm/book3s_64_vio.c:	return anon_inode_getfd("kvm-spapr-tce", &kvm_spapr_tce_fops,
arch/powerpc/kvm/#book3s_hv.c#:#include <linux/anon_inodes.h>
arch/powerpc/kvm/#book3s_hv.c#:	fd = anon_inode_getfd("kvm-rma", &kvm_rma_fops, ri, O_RDWR);
arch/powerpc/kvm/book3s_hv.c:#include <linux/anon_inodes.h>
arch/powerpc/kvm/book3s_hv.c:	fd = anon_inode_getfd("kvm-rma", &kvm_rma_fops, ri, O_RDWR);
arch/powerpc/kvm/book3s_pr_papr.c:#include <linux/anon_inodes.h>

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 1/6] kvm: add device control API
  2013-03-06  2:48     ` Benjamin Herrenschmidt
@ 2013-03-06 10:18       ` Gleb Natapov
  -1 siblings, 0 replies; 261+ messages in thread
From: Gleb Natapov @ 2013-03-06 10:18 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Scott Wood, Alexander Graf, kvm-ppc, kvm, Marcelo Tosatti

On Wed, Mar 06, 2013 at 01:48:33PM +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2013-02-13 at 23:49 -0600, Scott Wood wrote:
> > Currently, devices that are emulated inside KVM are configured in a
> > hardcoded manner based on an assumption that any given architecture
> > only has one way to do it.  If there's any need to access device state,
> > it is done through inflexible one-purpose-only IOCTLs (e.g.
> > KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> > cumbersome and depletes a limited numberspace.
> > 
> > This API provides a mechanism to instantiate a device of a certain
> > type, returning an ID that can be used to set/get attributes of the
> > device.  Attributes may include configuration parameters (e.g.
> > register base address), device state, operational commands, etc.  It
> > is similar to the ONE_REG API, except that it acts on devices rather
> > than vcpus.
> 
> All right guys, let's take a break for a minute :-)
> 
> What you seem to be proposing is a whole new construct / API to create
> "device" objects with "attributes" as a way to avoid adding tons of
> ioctls to the VM.
> 
> Then you somewhat "coerce" behaviours (i.e. methods) as side effects of
> setting some of those attributes, and create some kind of rigid API
> through which any kind of potential device "attribute" needs to be
> coerced.
> 
> Essentially you are trying to re-invent encapsulation of kernel objects
> manipulated by userspace, we already have several mechanisms for doing
> just that and you are trying to add yet a new one :-)
> 
> What about instead using existing mechanisms for doing just that:
> 
> Make your "create device" return an anonymous file descriptor !
> 
That was faster than I predicted! :)

http://www.spinics.net/lists/kvm/msg86687.html last paragraph.

> That gives you everything ... private ioctl's which can do whatever the
> heck you want (attributes, methods, etc...), mmap, etc...
> 
> Guess what ? That's already what we do for various things like our
> in-kernel emulated iommu tables as far as I can remember.
> 
> If your problem is to avoid the bottleneck of having to deal with
> upstream maintainers for generic VM ioctls every time you add some new
> platform specific kernel "object" then this is probably a much better
> approach.
> 
> Cheers,
> Ben.
> 

--
			Gleb.

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 6/6] kvm/ppc/mpic: in-kernel MPIC emulation
  2013-02-14  5:49   ` Scott Wood
@ 2013-03-21  8:28     ` Alexander Graf
  -1 siblings, 0 replies; 261+ messages in thread
From: Alexander Graf @ 2013-03-21  8:28 UTC (permalink / raw)
  To: Scott Wood; +Cc: kvm-ppc, kvm


On 14.02.2013, at 06:49, Scott Wood wrote:

> Hook the MPIC code up to the KVM interfaces, add locking, etc.
> 
> TODO: irqfd support
> 
> Signed-off-by: Scott Wood <scottwood@freescale.com>

Could you please split this patch up on your next respin? Also please make sure you don't have #if 0'ed code in here. Just return to user space with an error when you encounter something you don't know how to handle.


Alex


^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 6/6] kvm/ppc/mpic: in-kernel MPIC emulation
  2013-03-21  8:28     ` Alexander Graf
@ 2013-03-21 14:43       ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-03-21 14:43 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm

On 03/21/2013 03:28:35 AM, Alexander Graf wrote:
> 
> On 14.02.2013, at 06:49, Scott Wood wrote:
> 
> > Hook the MPIC code up to the KVM interfaces, add locking, etc.
> >
> > TODO: irqfd support
> >
> > Signed-off-by: Scott Wood <scottwood@freescale.com>
> 
> Could you please split this patch up on your next respin?

Any particular split you're looking for?

The only reason it's split as much as it is already is to give some  
chance of merging updates from QEMU being less painful.  As far as the  
kernel is concerned, this is new code, which is not functional (and  
thus not built) before this patch.  There aren't meaningful  
intermediate states.

> Also please make sure you don't have #if 0'ed code in here.

Well, yeah.  Note the RFC. :-)

-Scott

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH 6/6] kvm/ppc/mpic: in-kernel MPIC emulation
  2013-03-21 14:43       ` Scott Wood
@ 2013-03-21 14:52         ` Alexander Graf
  -1 siblings, 0 replies; 261+ messages in thread
From: Alexander Graf @ 2013-03-21 14:52 UTC (permalink / raw)
  To: Scott Wood; +Cc: kvm-ppc, kvm


On 21.03.2013, at 15:43, Scott Wood wrote:

> On 03/21/2013 03:28:35 AM, Alexander Graf wrote:
>> On 14.02.2013, at 06:49, Scott Wood wrote:
>> > Hook the MPIC code up to the KVM interfaces, add locking, etc.
>> >
>> > TODO: irqfd support
>> >
>> > Signed-off-by: Scott Wood <scottwood@freescale.com>
>> Could you please split this patch up on your next respin?
> 
> Any particular split you're looking for?

Anything that makes reviewing it easier :). I can't concentrate for 100k straight.

> The only reason it's split as much as it is already is to give some chance of merging updates from QEMU being less painful.  As far as the kernel is concerned, this is new code, which is not functional (and thus not built) before this patch.  There aren't meaningful intermediate states.
> 
>> Also please make sure you don't have #if 0'ed code in here.
> 
> Well, yeah.  Note the RFC. :-)

Just wanted to make sure you don't forget them when you send out a non-RFC :). Not that I'd assume you'd do that ;)


Alex

^ permalink raw reply	[flat|nested] 261+ messages in thread

* [RFC PATCH v2 0/6] device control and in-kernel MPIC
  2013-02-14  5:49 ` Scott Wood
@ 2013-04-01 22:47   ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-01 22:47 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus

v2 addresses some requested changes, such as the use of a file descriptor
instead of an ad-hoc handle array, and the use of an enableable
IRQ-type-specific capability to bind the vcpu to a particular MPIC device
(among other things, this allows the notifier patch to go away).

Some other requested improvements, such as support for the standard
KVM_IRQ_LINE interface and splitting up the "in-kernel MPIC emulation"
patch, will be addressed in a later revision.

Scott Wood (6):
  kvm: add device control API
  kvm/ppc/mpic: import hw/openpic.c from QEMU
  kvm/ppc/mpic: remove some obviously unneeded code
  kvm/ppc/mpic: adapt to kernel style and environment
  kvm/ppc/mpic: in-kernel MPIC emulation
  kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC

 Documentation/virtual/kvm/api.txt          |   78 ++
 Documentation/virtual/kvm/devices/README   |    1 +
 Documentation/virtual/kvm/devices/mpic.txt |   37 +
 arch/powerpc/include/asm/kvm_host.h        |   16 +-
 arch/powerpc/include/asm/kvm_ppc.h         |    8 +
 arch/powerpc/kvm/Kconfig                   |    5 +
 arch/powerpc/kvm/Makefile                  |    2 +
 arch/powerpc/kvm/booke.c                   |   12 +-
 arch/powerpc/kvm/mpic.c                    | 1786 ++++++++++++++++++++++++++++
 arch/powerpc/kvm/powerpc.c                 |   38 +-
 include/linux/kvm_host.h                   |    2 +
 include/uapi/linux/kvm.h                   |   37 +
 virt/kvm/kvm_main.c                        |   40 +
 13 files changed, 2052 insertions(+), 10 deletions(-)
 create mode 100644 Documentation/virtual/kvm/devices/README
 create mode 100644 Documentation/virtual/kvm/devices/mpic.txt
 create mode 100644 arch/powerpc/kvm/mpic.c

-- 
1.7.9.5

^ permalink raw reply	[flat|nested] 261+ messages in thread

* [RFC PATCH v2 1/6] kvm: add device control API
  2013-04-01 22:47   ` Scott Wood
@ 2013-04-01 22:47     ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-01 22:47 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus, Scott Wood

Currently, devices that are emulated inside KVM are configured in a
hardcoded manner based on an assumption that any given architecture
only has one way to do it.  If there's any need to access device state,
it is done through inflexible one-purpose-only IOCTLs (e.g.
KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
cumbersome and depletes a limited numberspace.

This API provides a mechanism to instantiate a device of a certain
type, returning a file descriptor that can be used to set/get attributes
of the device.  Attributes may include configuration parameters (e.g.
register base address), device state, operational commands, etc.  It
is similar to the ONE_REG API, except that it acts on devices rather
than vcpus.

Support for both device types and individual attributes can be tested
without creating the device or getting/setting the attribute, and
without separately managing enumerated capabilities.

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
 Documentation/virtual/kvm/api.txt        |   70 ++++++++++++++++++++++++++++++
 Documentation/virtual/kvm/devices/README |    1 +
 arch/powerpc/include/asm/kvm_host.h      |    6 +++
 arch/powerpc/include/asm/kvm_ppc.h       |    2 +
 arch/powerpc/kvm/powerpc.c               |    7 +++
 include/uapi/linux/kvm.h                 |   27 ++++++++++++
 virt/kvm/kvm_main.c                      |   31 +++++++++++++
 7 files changed, 144 insertions(+)
 create mode 100644 Documentation/virtual/kvm/devices/README

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 976eb65..77328aa 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2173,6 +2173,76 @@ header; first `n_valid' valid entries with contents from the data
 written, then `n_invalid' invalid entries, invalidating any previously
 valid entries found.
 
+4.79 KVM_CREATE_DEVICE
+
+Capability: KVM_CAP_DEVICE_CTRL
+Type: vm ioctl
+Parameters: struct kvm_create_device (in/out)
+Returns: 0 on success, -1 on error
+Errors:
+  ENODEV: The device type is unknown or unsupported
+  EEXIST: Device already created, and this type of device may not
+          be instantiated multiple times
+  ENOSPC: Too many devices have been created
+
+  Other error conditions may be defined by individual device types.
+
+Creates an emulated device in the kernel.  The file descriptor returned
+in fd can be used with KVM_SET/GET/HAS_DEVICE_ATTR.
+
+If the KVM_CREATE_DEVICE_TEST flag is set, only test whether the
+device type is supported (not necessarily whether it can be created
+in the current vm).
+
+Individual devices should not define flags.  Attributes should be used
+for specifying any behavior that is not implied by the device type
+number.
+
+struct kvm_create_device {
+	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
+	__u32	fd;	/* out: device handle */
+	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
+};
+
+4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR
+
+Capability: KVM_CAP_DEVICE_CTRL
+Type: device ioctl
+Parameters: struct kvm_device_attr
+Returns: 0 on success, -1 on error
+Errors:
+  ENXIO:  The group or attribute is unknown/unsupported for this device
+  EPERM:  The attribute cannot (currently) be accessed this way
+          (e.g. read-only attribute, or attribute that only makes
+          sense when the device is in a different state)
+
+  Other error conditions may be defined by individual device types.
+
+Gets/sets a specified piece of device configuration and/or state.  The
+semantics are device-specific.  See individual device documentation in
+the "devices" directory.  As with ONE_REG, the size of the data
+transferred is defined by the particular attribute.
+
+struct kvm_device_attr {
+	__u32	flags;		/* no flags currently defined */
+	__u32	group;		/* device-defined */
+	__u64	attr;		/* group-defined */
+	__u64	addr;		/* userspace address of attr data */
+};
+
+4.81 KVM_HAS_DEVICE_ATTR
+
+Capability: KVM_CAP_DEVICE_CTRL
+Type: device ioctl
+Parameters: struct kvm_device_attr
+Returns: 0 on success, -1 on error
+Errors:
+  ENXIO:  The group or attribute is unknown/unsupported for this device
+
+Tests whether a device supports a particular attribute.  A successful
+return indicates the attribute is implemented.  It does not necessarily
+indicate that the attribute can be read or written in the device's
+current state.  "addr" is ignored.
 
 4.77 KVM_ARM_VCPU_INIT
 
diff --git a/Documentation/virtual/kvm/devices/README b/Documentation/virtual/kvm/devices/README
new file mode 100644
index 0000000..34a6983
--- /dev/null
+++ b/Documentation/virtual/kvm/devices/README
@@ -0,0 +1 @@
+This directory contains specific device bindings for KVM_CAP_DEVICE_CTRL.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index e34f8fe..e0caae2 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -370,6 +370,9 @@ struct kvmppc_booke_debug_reg {
 	u64 dac[KVMPPC_BOOKE_MAX_DAC];
 };
 
+#define KVMPPC_IRQCHIP_NONE	0
+#define KVMPPC_IRQCHIP_MPIC	1
+
 struct kvm_vcpu_arch {
 	ulong host_stack;
 	u32 host_pid;
@@ -549,6 +552,9 @@ struct kvm_vcpu_arch {
 	unsigned long magic_page_pa; /* phys addr to map the magic page to */
 	unsigned long magic_page_ea; /* effect. addr to map the magic page to */
 
+	int irqchip_type;
+	void *irqchip_priv;
+
 #ifdef CONFIG_KVM_BOOK3S_64_HV
 	struct kvm_vcpu_arch_shared shregs;
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index f589307..f44932c 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -323,4 +323,6 @@ static inline ulong kvmppc_get_ea_indexed(struct kvm_vcpu *vcpu, int ra, int rb)
 	return ea;
 }
 
+void mpic_put(void *priv);
+
 #endif /* __POWERPC_KVM_PPC_H__ */
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 16b4595..bdfa526 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -459,6 +459,13 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
 	tasklet_kill(&vcpu->arch.tasklet);
 
 	kvmppc_remove_vcpu_debugfs(vcpu);
+
+	switch (vcpu->arch.irqchip_type) {
+	case KVMPPC_IRQCHIP_MPIC:
+		mpic_put(vcpu->arch.irqchip_priv);
+		break;
+	}
+
 	kvmppc_core_vcpu_free(vcpu);
 }
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 74d0ff3..20ce2d2 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_EPR 86
 #define KVM_CAP_ARM_PSCI 87
 #define KVM_CAP_ARM_SET_DEVICE_ADDR 88
+#define KVM_CAP_DEVICE_CTRL 89
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -909,6 +910,32 @@ struct kvm_s390_ucas_mapping {
 #define KVM_ARM_SET_DEVICE_ADDR	  _IOW(KVMIO,  0xab, struct kvm_arm_device_addr)
 
 /*
+ * Device control API, available with KVM_CAP_DEVICE_CTRL
+ */
+#define KVM_CREATE_DEVICE_TEST		1
+
+struct kvm_create_device {
+	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
+	__u32	fd;	/* out: device handle */
+	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
+};
+
+struct kvm_device_attr {
+	__u32	flags;		/* no flags currently defined */
+	__u32	group;		/* device-defined */
+	__u64	attr;		/* group-defined */
+	__u64	addr;		/* userspace address of attr data */
+};
+
+/* ioctl for vm fd */
+#define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xe0, struct kvm_create_device)
+
+/* ioctls for fds returned by KVM_CREATE_DEVICE */
+#define KVM_SET_DEVICE_ATTR	  _IOW(KVMIO,  0xe1, struct kvm_device_attr)
+#define KVM_GET_DEVICE_ATTR	  _IOW(KVMIO,  0xe2, struct kvm_device_attr)
+#define KVM_HAS_DEVICE_ATTR	  _IOW(KVMIO,  0xe3, struct kvm_device_attr)
+
+/*
  * ioctls for vcpu fds
  */
 #define KVM_RUN                   _IO(KVMIO,   0x80)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ff71541..ed033c0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2158,6 +2158,17 @@ out:
 }
 #endif
 
+static int kvm_ioctl_create_device(struct kvm *kvm,
+				   struct kvm_create_device *cd)
+{
+	bool test = cd->flags & KVM_CREATE_DEVICE_TEST;
+
+	switch (cd->type) {
+	default:
+		return -ENODEV;
+	}
+}
+
 static long kvm_vm_ioctl(struct file *filp,
 			   unsigned int ioctl, unsigned long arg)
 {
@@ -2272,6 +2283,26 @@ static long kvm_vm_ioctl(struct file *filp,
 		break;
 	}
 #endif
+	case KVM_CREATE_DEVICE: {
+		struct kvm_create_device cd;
+
+		r = -EFAULT;
+		if (copy_from_user(&cd, argp, sizeof(cd)))
+			goto out;
+
+		mutex_lock(&kvm->lock);
+		r = kvm_ioctl_create_device(kvm, &cd);
+		mutex_unlock(&kvm->lock);
+		if (r)
+			goto out;
+
+		r = -EFAULT;
+		if (copy_to_user(argp, &cd, sizeof(cd)))
+			goto out;
+
+		r = 0;
+		break;
+	}
 	default:
 		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
 		if (r == -ENOTTY)
-- 
1.7.9.5
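
For readers following the documentation hunk in the patch above, the intended userspace calling sequence can be sketched roughly as follows. This is only an illustration under stated assumptions: the structures and ioctl numbers are copied from this RFC and redefined locally under an `rfc_` prefix so the snippet builds against headers that predate the patch, and the final layout may differ. The helper names (`rfc_init_create`, `rfc_init_attr`, `rfc_create_device`, `rfc_set_attr`) are hypothetical, and since this patch defines no device types or attribute groups, any type/group/attr values a caller passes are placeholders.

```c
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Local copies of the structures proposed in this RFC (assumed layout). */
struct rfc_kvm_create_device {
	uint32_t type;   /* in: device type */
	uint32_t fd;     /* out: device handle */
	uint32_t flags;  /* in: RFC_KVM_CREATE_DEVICE_xxx */
};

struct rfc_kvm_device_attr {
	uint32_t flags;  /* no flags currently defined */
	uint32_t group;  /* device-defined */
	uint64_t attr;   /* group-defined */
	uint64_t addr;   /* userspace address of attr data */
};

#define RFC_KVMIO                  0xAE
#define RFC_KVM_CREATE_DEVICE_TEST 1
#define RFC_KVM_CREATE_DEVICE \
	_IOWR(RFC_KVMIO, 0xe0, struct rfc_kvm_create_device)
#define RFC_KVM_SET_DEVICE_ATTR \
	_IOW(RFC_KVMIO, 0xe1, struct rfc_kvm_device_attr)

/* Build a create request; a nonzero 'test' only probes for support. */
static inline void rfc_init_create(struct rfc_kvm_create_device *cd,
				   uint32_t type, int test)
{
	memset(cd, 0, sizeof(*cd));
	cd->type = type;
	cd->flags = test ? RFC_KVM_CREATE_DEVICE_TEST : 0;
}

/* Build an attribute request pointing at caller-owned data. */
static inline void rfc_init_attr(struct rfc_kvm_device_attr *da,
				 uint32_t group, uint64_t attr,
				 const void *data)
{
	memset(da, 0, sizeof(*da));
	da->group = group;
	da->attr = attr;
	da->addr = (uint64_t)(uintptr_t)data;
}

/* Issue the vm ioctl; returns the device fd, or -1 with errno set. */
static inline int rfc_create_device(int vm_fd, uint32_t type)
{
	struct rfc_kvm_create_device cd;

	rfc_init_create(&cd, type, 0);
	if (ioctl(vm_fd, RFC_KVM_CREATE_DEVICE, &cd) < 0)
		return -1;
	return (int)cd.fd;
}

/* Issue the device ioctl on the fd returned by rfc_create_device(). */
static inline int rfc_set_attr(int dev_fd, uint32_t group, uint64_t attr,
			       const void *data)
{
	struct rfc_kvm_device_attr da;

	rfc_init_attr(&da, group, attr, data);
	return ioctl(dev_fd, RFC_KVM_SET_DEVICE_ATTR, &da);
}
```

A caller would first open a vm fd through the usual KVM_CREATE_VM path, call `rfc_create_device()` on it, and then use the returned fd for attribute ioctls. Since the size of the transferred data is defined by each attribute, the buffer passed as `data` has to match what the device documentation specifies.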



^ permalink raw reply related	[flat|nested] 261+ messages in thread

* [RFC PATCH v2 2/6] kvm/ppc/mpic: import hw/openpic.c from QEMU
  2013-04-01 22:47   ` Scott Wood
@ 2013-04-01 22:47     ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-01 22:47 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus, Scott Wood

This is QEMU's hw/openpic.c from commit
abd8d4a4d6dfea7ddea72f095f993e1de941614e ("Update version for
1.4.0-rc0"), run through Lindent with no other changes to ease merging
future changes between Linux and QEMU.  Remaining style issues
(including those introduced by Lindent) will be fixed in a later patch.

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
 arch/powerpc/kvm/mpic.c | 1686 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 1686 insertions(+)
 create mode 100644 arch/powerpc/kvm/mpic.c

diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
new file mode 100644
index 0000000..57655b9
--- /dev/null
+++ b/arch/powerpc/kvm/mpic.c
@@ -0,0 +1,1686 @@
+/*
+ * OpenPIC emulation
+ *
+ * Copyright (c) 2004 Jocelyn Mayer
+ *               2011 Alexander Graf
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+/*
+ *
+ * Based on OpenPic implementations:
+ * - Intel GW80314 I/O companion chip developer's manual
+ * - Motorola MPC8245 & MPC8540 user manuals.
+ * - Motorola MCP750 (aka Raven) programmer manual.
+ * - Motorola Harrier programmer manual
+ *
+ * Serial interrupts, as implemented in Raven chipset are not supported yet.
+ *
+ */
+#include "hw.h"
+#include "ppc/mac.h"
+#include "pci/pci.h"
+#include "openpic.h"
+#include "sysbus.h"
+#include "pci/msi.h"
+#include "qemu/bitops.h"
+#include "ppc.h"
+
+//#define DEBUG_OPENPIC
+
+#ifdef DEBUG_OPENPIC
+static const int debug_openpic = 1;
+#else
+static const int debug_openpic = 0;
+#endif
+
+#define DPRINTF(fmt, ...) do { \
+        if (debug_openpic) { \
+            printf(fmt , ## __VA_ARGS__); \
+        } \
+    } while (0)
+
+#define MAX_CPU     32
+#define MAX_SRC     256
+#define MAX_TMR     4
+#define MAX_IPI     4
+#define MAX_MSI     8
+#define MAX_IRQ     (MAX_SRC + MAX_IPI + MAX_TMR)
+#define VID         0x03	/* MPIC version ID */
+
+/* OpenPIC capability flags */
+#define OPENPIC_FLAG_IDR_CRIT     (1 << 0)
+#define OPENPIC_FLAG_ILR          (2 << 0)
+
+/* OpenPIC address map */
+#define OPENPIC_GLB_REG_START        0x0
+#define OPENPIC_GLB_REG_SIZE         0x10F0
+#define OPENPIC_TMR_REG_START        0x10F0
+#define OPENPIC_TMR_REG_SIZE         0x220
+#define OPENPIC_MSI_REG_START        0x1600
+#define OPENPIC_MSI_REG_SIZE         0x200
+#define OPENPIC_SUMMARY_REG_START   0x3800
+#define OPENPIC_SUMMARY_REG_SIZE    0x800
+#define OPENPIC_SRC_REG_START        0x10000
+#define OPENPIC_SRC_REG_SIZE         (MAX_SRC * 0x20)
+#define OPENPIC_CPU_REG_START        0x20000
+#define OPENPIC_CPU_REG_SIZE         0x100 + ((MAX_CPU - 1) * 0x1000)
+
+/* Raven */
+#define RAVEN_MAX_CPU      2
+#define RAVEN_MAX_EXT     48
+#define RAVEN_MAX_IRQ     64
+#define RAVEN_MAX_TMR      MAX_TMR
+#define RAVEN_MAX_IPI      MAX_IPI
+
+/* Interrupt definitions */
+#define RAVEN_FE_IRQ     (RAVEN_MAX_EXT)	/* Internal functional IRQ */
+#define RAVEN_ERR_IRQ    (RAVEN_MAX_EXT + 1)	/* Error IRQ */
+#define RAVEN_TMR_IRQ    (RAVEN_MAX_EXT + 2)	/* First timer IRQ */
+#define RAVEN_IPI_IRQ    (RAVEN_TMR_IRQ + RAVEN_MAX_TMR)	/* First IPI IRQ */
+/* First doorbell IRQ */
+#define RAVEN_DBL_IRQ    (RAVEN_IPI_IRQ + (RAVEN_MAX_CPU * RAVEN_MAX_IPI))
+
+typedef struct FslMpicInfo {
+	int max_ext;
+} FslMpicInfo;
+
+static FslMpicInfo fsl_mpic_20 = {
+	.max_ext = 12,
+};
+
+static FslMpicInfo fsl_mpic_42 = {
+	.max_ext = 12,
+};
+
+#define FRR_NIRQ_SHIFT    16
+#define FRR_NCPU_SHIFT     8
+#define FRR_VID_SHIFT      0
+
+#define VID_REVISION_1_2   2
+#define VID_REVISION_1_3   3
+
+#define VIR_GENERIC      0x00000000	/* Generic Vendor ID */
+
+#define GCR_RESET        0x80000000
+#define GCR_MODE_PASS    0x00000000
+#define GCR_MODE_MIXED   0x20000000
+#define GCR_MODE_PROXY   0x60000000
+
+#define TBCR_CI           0x80000000	/* count inhibit */
+#define TCCR_TOG          0x80000000	/* toggles when decrement to zero */
+
+#define IDR_EP_SHIFT      31
+#define IDR_EP_MASK       (1 << IDR_EP_SHIFT)
+#define IDR_CI0_SHIFT     30
+#define IDR_CI1_SHIFT     29
+#define IDR_P1_SHIFT      1
+#define IDR_P0_SHIFT      0
+
+#define ILR_INTTGT_MASK   0x000000ff
+#define ILR_INTTGT_INT    0x00
+#define ILR_INTTGT_CINT   0x01	/* critical */
+#define ILR_INTTGT_MCP    0x02	/* machine check */
+
+/* The currently supported INTTGT values happen to be the same as QEMU's
+ * openpic output codes, but don't depend on this.  The output codes
+ * could change (unlikely, but...) or support could be added for
+ * more INTTGT values.
+ */
+static const int inttgt_output[][2] = {
+	{ILR_INTTGT_INT, OPENPIC_OUTPUT_INT},
+	{ILR_INTTGT_CINT, OPENPIC_OUTPUT_CINT},
+	{ILR_INTTGT_MCP, OPENPIC_OUTPUT_MCK},
+};
+
+static int inttgt_to_output(int inttgt)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(inttgt_output); i++) {
+		if (inttgt_output[i][0] == inttgt) {
+			return inttgt_output[i][1];
+		}
+	}
+
+	fprintf(stderr, "%s: unsupported inttgt %d\n", __func__, inttgt);
+	return OPENPIC_OUTPUT_INT;
+}
+
+static int output_to_inttgt(int output)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(inttgt_output); i++) {
+		if (inttgt_output[i][1] == output) {
+			return inttgt_output[i][0];
+		}
+	}
+
+	abort();
+}
+
+#define MSIIR_OFFSET       0x140
+#define MSIIR_SRS_SHIFT    29
+#define MSIIR_SRS_MASK     (0x7 << MSIIR_SRS_SHIFT)
+#define MSIIR_IBS_SHIFT    24
+#define MSIIR_IBS_MASK     (0x1f << MSIIR_IBS_SHIFT)
+
+static int get_current_cpu(void)
+{
+	CPUState *cpu_single_cpu;
+
+	if (!cpu_single_env) {
+		return -1;
+	}
+
+	cpu_single_cpu = ENV_GET_CPU(cpu_single_env);
+	return cpu_single_cpu->cpu_index;
+}
+
+static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx);
+static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
+				       uint32_t val, int idx);
+
+typedef enum IRQType {
+	IRQ_TYPE_NORMAL = 0,
+	IRQ_TYPE_FSLINT,	/* FSL internal interrupt -- level only */
+	IRQ_TYPE_FSLSPECIAL,	/* FSL timer/IPI interrupt, edge, no polarity */
+} IRQType;
+
+typedef struct IRQQueue {
+	/* Round up to the nearest 64 IRQs so that the queue length
+	 * won't change when moving between 32 and 64 bit hosts.
+	 */
+	unsigned long queue[BITS_TO_LONGS((MAX_IRQ + 63) & ~63)];
+	int next;
+	int priority;
+} IRQQueue;
+
+typedef struct IRQSource {
+	uint32_t ivpr;		/* IRQ vector/priority register */
+	uint32_t idr;		/* IRQ destination register */
+	uint32_t destmask;	/* bitmap of CPU destinations */
+	int last_cpu;
+	int output;		/* IRQ level, e.g. OPENPIC_OUTPUT_INT */
+	int pending;		/* TRUE if IRQ is pending */
+	IRQType type;
+	bool level:1;		/* level-triggered */
+	bool nomask:1;		/* critical interrupts ignore mask on some FSL MPICs */
+} IRQSource;
+
+#define IVPR_MASK_SHIFT       31
+#define IVPR_MASK_MASK        (1 << IVPR_MASK_SHIFT)
+#define IVPR_ACTIVITY_SHIFT   30
+#define IVPR_ACTIVITY_MASK    (1 << IVPR_ACTIVITY_SHIFT)
+#define IVPR_MODE_SHIFT       29
+#define IVPR_MODE_MASK        (1 << IVPR_MODE_SHIFT)
+#define IVPR_POLARITY_SHIFT   23
+#define IVPR_POLARITY_MASK    (1 << IVPR_POLARITY_SHIFT)
+#define IVPR_SENSE_SHIFT      22
+#define IVPR_SENSE_MASK       (1 << IVPR_SENSE_SHIFT)
+
+#define IVPR_PRIORITY_MASK     (0xF << 16)
+#define IVPR_PRIORITY(_ivprr_) ((int)(((_ivprr_) & IVPR_PRIORITY_MASK) >> 16))
+#define IVPR_VECTOR(opp, _ivprr_) ((_ivprr_) & (opp)->vector_mask)
+
+/* IDR[EP/CI] are only for FSL MPIC prior to v4.0 */
+#define IDR_EP      0x80000000	/* external pin */
+#define IDR_CI      0x40000000	/* critical interrupt */
+
+typedef struct IRQDest {
+	int32_t ctpr;		/* CPU current task priority */
+	IRQQueue raised;
+	IRQQueue servicing;
+	qemu_irq *irqs;
+
+	/* Count of IRQ sources asserting on non-INT outputs */
+	uint32_t outputs_active[OPENPIC_OUTPUT_NB];
+} IRQDest;
+
+typedef struct OpenPICState {
+	SysBusDevice busdev;
+	MemoryRegion mem;
+
+	/* Behavior control */
+	FslMpicInfo *fsl;
+	uint32_t model;
+	uint32_t flags;
+	uint32_t nb_irqs;
+	uint32_t vid;
+	uint32_t vir;		/* Vendor identification register */
+	uint32_t vector_mask;
+	uint32_t tfrr_reset;
+	uint32_t ivpr_reset;
+	uint32_t idr_reset;
+	uint32_t brr1;
+	uint32_t mpic_mode_mask;
+
+	/* Sub-regions */
+	MemoryRegion sub_io_mem[6];
+
+	/* Global registers */
+	uint32_t frr;		/* Feature reporting register */
+	uint32_t gcr;		/* Global configuration register */
+	uint32_t pir;		/* Processor initialization register */
+	uint32_t spve;		/* Spurious vector register */
+	uint32_t tfrr;		/* Timer frequency reporting register */
+	/* Source registers */
+	IRQSource src[MAX_IRQ];
+	/* Local registers per output pin */
+	IRQDest dst[MAX_CPU];
+	uint32_t nb_cpus;
+	/* Timer registers */
+	struct {
+		uint32_t tccr;	/* Global timer current count register */
+		uint32_t tbcr;	/* Global timer base count register */
+	} timers[MAX_TMR];
+	/* Shared MSI registers */
+	struct {
+		uint32_t msir;	/* Shared Message Signaled Interrupt Register */
+	} msi[MAX_MSI];
+	uint32_t max_irq;
+	uint32_t irq_ipi0;
+	uint32_t irq_tim0;
+	uint32_t irq_msi;
+} OpenPICState;
+
+static inline void IRQ_setbit(IRQQueue * q, int n_IRQ)
+{
+	set_bit(n_IRQ, q->queue);
+}
+
+static inline void IRQ_resetbit(IRQQueue * q, int n_IRQ)
+{
+	clear_bit(n_IRQ, q->queue);
+}
+
+static inline int IRQ_testbit(IRQQueue * q, int n_IRQ)
+{
+	return test_bit(n_IRQ, q->queue);
+}
+
+static void IRQ_check(OpenPICState * opp, IRQQueue * q)
+{
+	int irq = -1;
+	int next = -1;
+	int priority = -1;
+
+	for (;;) {
+		irq = find_next_bit(q->queue, opp->max_irq, irq + 1);
+		if (irq == opp->max_irq) {
+			break;
+		}
+
+		DPRINTF("IRQ_check: irq %d set ivpr_pr=%d pr=%d\n",
+			irq, IVPR_PRIORITY(opp->src[irq].ivpr), priority);
+
+		if (IVPR_PRIORITY(opp->src[irq].ivpr) > priority) {
+			next = irq;
+			priority = IVPR_PRIORITY(opp->src[irq].ivpr);
+		}
+	}
+
+	q->next = next;
+	q->priority = priority;
+}
+
+static int IRQ_get_next(OpenPICState * opp, IRQQueue * q)
+{
+	/* XXX: optimize */
+	IRQ_check(opp, q);
+
+	return q->next;
+}
+
+static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
+			   bool active, bool was_active)
+{
+	IRQDest *dst;
+	IRQSource *src;
+	int priority;
+
+	dst = &opp->dst[n_CPU];
+	src = &opp->src[n_IRQ];
+
+	DPRINTF("%s: IRQ %d active %d was %d\n",
+		__func__, n_IRQ, active, was_active);
+
+	if (src->output != OPENPIC_OUTPUT_INT) {
+		DPRINTF("%s: output %d irq %d active %d was %d count %d\n",
+			__func__, src->output, n_IRQ, active, was_active,
+			dst->outputs_active[src->output]);
+
+		/* On Freescale MPIC, critical interrupts ignore priority,
+		 * IACK, EOI, etc.  Before MPIC v4.1 they also ignore
+		 * masking.
+		 */
+		if (active) {
+			if (!was_active
+			    && dst->outputs_active[src->output]++ == 0) {
+				DPRINTF
+				    ("%s: Raise OpenPIC output %d cpu %d irq %d\n",
+				     __func__, src->output, n_CPU, n_IRQ);
+				qemu_irq_raise(dst->irqs[src->output]);
+			}
+		} else {
+			if (was_active
+			    && --dst->outputs_active[src->output] == 0) {
+				DPRINTF
+				    ("%s: Lower OpenPIC output %d cpu %d irq %d\n",
+				     __func__, src->output, n_CPU, n_IRQ);
+				qemu_irq_lower(dst->irqs[src->output]);
+			}
+		}
+
+		return;
+	}
+
+	priority = IVPR_PRIORITY(src->ivpr);
+
+	/* Even if the interrupt doesn't have enough priority,
+	 * it is still raised, in case ctpr is lowered later.
+	 */
+	if (active) {
+		IRQ_setbit(&dst->raised, n_IRQ);
+	} else {
+		IRQ_resetbit(&dst->raised, n_IRQ);
+	}
+
+	IRQ_check(opp, &dst->raised);
+
+	if (active && priority <= dst->ctpr) {
+		DPRINTF
+		    ("%s: IRQ %d priority %d too low for ctpr %d on CPU %d\n",
+		     __func__, n_IRQ, priority, dst->ctpr, n_CPU);
+		active = 0;
+	}
+
+	if (active) {
+		if (IRQ_get_next(opp, &dst->servicing) >= 0 &&
+		    priority <= dst->servicing.priority) {
+			DPRINTF
+			    ("%s: IRQ %d is hidden by servicing IRQ %d on CPU %d\n",
+			     __func__, n_IRQ, dst->servicing.next, n_CPU);
+		} else {
+			DPRINTF
+			    ("%s: Raise OpenPIC INT output cpu %d irq %d/%d\n",
+			     __func__, n_CPU, n_IRQ, dst->raised.next);
+			qemu_irq_raise(opp->dst[n_CPU].
+				       irqs[OPENPIC_OUTPUT_INT]);
+		}
+	} else {
+		IRQ_get_next(opp, &dst->servicing);
+		if (dst->raised.priority > dst->ctpr &&
+		    dst->raised.priority > dst->servicing.priority) {
+			DPRINTF
+			    ("%s: IRQ %d inactive, IRQ %d prio %d above %d/%d, CPU %d\n",
+			     __func__, n_IRQ, dst->raised.next,
+			     dst->raised.priority, dst->ctpr,
+			     dst->servicing.priority, n_CPU);
+			/* IRQ line stays asserted */
+		} else {
+			DPRINTF
+			    ("%s: IRQ %d inactive, current prio %d/%d, CPU %d\n",
+			     __func__, n_IRQ, dst->ctpr,
+			     dst->servicing.priority, n_CPU);
+			qemu_irq_lower(opp->dst[n_CPU].
+				       irqs[OPENPIC_OUTPUT_INT]);
+		}
+	}
+}
+
+/* Update PIC state after the registers for n_IRQ have changed. */
+static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
+{
+	IRQSource *src;
+	bool active, was_active;
+	int i;
+
+	src = &opp->src[n_IRQ];
+	active = src->pending;
+
+	if ((src->ivpr & IVPR_MASK_MASK) && !src->nomask) {
+		/* Interrupt source is disabled */
+		DPRINTF("%s: IRQ %d is disabled\n", __func__, n_IRQ);
+		active = false;
+	}
+
+	was_active = !!(src->ivpr & IVPR_ACTIVITY_MASK);
+
+	/*
+	 * We don't have a similar check for already-active because
+	 * ctpr may have changed and we need to withdraw the interrupt.
+	 */
+	if (!active && !was_active) {
+		DPRINTF("%s: IRQ %d is already inactive\n", __func__, n_IRQ);
+		return;
+	}
+
+	if (active) {
+		src->ivpr |= IVPR_ACTIVITY_MASK;
+	} else {
+		src->ivpr &= ~IVPR_ACTIVITY_MASK;
+	}
+
+	if (src->destmask == 0) {
+		/* No target */
+		DPRINTF("%s: IRQ %d has no target\n", __func__, n_IRQ);
+		return;
+	}
+
+	if (src->destmask == (1 << src->last_cpu)) {
+		/* Only one CPU is allowed to receive this IRQ */
+		IRQ_local_pipe(opp, src->last_cpu, n_IRQ, active, was_active);
+	} else if (!(src->ivpr & IVPR_MODE_MASK)) {
+		/* Directed delivery mode */
+		for (i = 0; i < opp->nb_cpus; i++) {
+			if (src->destmask & (1 << i)) {
+				IRQ_local_pipe(opp, i, n_IRQ, active,
+					       was_active);
+			}
+		}
+	} else {
+		/* Distributed delivery mode */
+		for (i = src->last_cpu + 1; i != src->last_cpu; i++) {
+			if (i == opp->nb_cpus) {
+				i = 0;
+			}
+			if (src->destmask & (1 << i)) {
+				IRQ_local_pipe(opp, i, n_IRQ, active,
+					       was_active);
+				src->last_cpu = i;
+				break;
+			}
+		}
+	}
+}
+
+static void openpic_set_irq(void *opaque, int n_IRQ, int level)
+{
+	OpenPICState *opp = opaque;
+	IRQSource *src;
+
+	if (n_IRQ >= MAX_IRQ) {
+		fprintf(stderr, "%s: IRQ %d out of range\n", __func__, n_IRQ);
+		abort();
+	}
+
+	src = &opp->src[n_IRQ];
+	DPRINTF("openpic: set irq %d = %d ivpr=0x%08x\n",
+		n_IRQ, level, src->ivpr);
+	if (src->level) {
+		/* level-sensitive irq */
+		src->pending = level;
+		openpic_update_irq(opp, n_IRQ);
+	} else {
+		/* edge-sensitive irq */
+		if (level) {
+			src->pending = 1;
+			openpic_update_irq(opp, n_IRQ);
+		}
+
+		if (src->output != OPENPIC_OUTPUT_INT) {
+			/* Edge-triggered interrupts shouldn't be used
+			 * with non-INT delivery, but just in case,
+			 * try to make it do something sane rather than
+			 * cause an interrupt storm.  This is close to
+			 * what you'd probably see happen in real hardware.
+			 */
+			src->pending = 0;
+			openpic_update_irq(opp, n_IRQ);
+		}
+	}
+}
+
+static void openpic_reset(DeviceState * d)
+{
+	OpenPICState *opp = FROM_SYSBUS(typeof(*opp), SYS_BUS_DEVICE(d));
+	int i;
+
+	opp->gcr = GCR_RESET;
+	/* Initialise controller registers */
+	opp->frr = ((opp->nb_irqs - 1) << FRR_NIRQ_SHIFT) |
+	    ((opp->nb_cpus - 1) << FRR_NCPU_SHIFT) |
+	    (opp->vid << FRR_VID_SHIFT);
+
+	opp->pir = 0;
+	opp->spve = -1 & opp->vector_mask;
+	opp->tfrr = opp->tfrr_reset;
+	/* Initialise IRQ sources */
+	for (i = 0; i < opp->max_irq; i++) {
+		opp->src[i].ivpr = opp->ivpr_reset;
+		opp->src[i].idr = opp->idr_reset;
+
+		switch (opp->src[i].type) {
+		case IRQ_TYPE_NORMAL:
+			opp->src[i].level =
+			    !!(opp->ivpr_reset & IVPR_SENSE_MASK);
+			break;
+
+		case IRQ_TYPE_FSLINT:
+			opp->src[i].ivpr |= IVPR_POLARITY_MASK;
+			break;
+
+		case IRQ_TYPE_FSLSPECIAL:
+			break;
+		}
+	}
+	/* Initialise IRQ destinations */
+	for (i = 0; i < MAX_CPU; i++) {
+		opp->dst[i].ctpr = 15;
+		memset(&opp->dst[i].raised, 0, sizeof(IRQQueue));
+		opp->dst[i].raised.next = -1;
+		memset(&opp->dst[i].servicing, 0, sizeof(IRQQueue));
+		opp->dst[i].servicing.next = -1;
+	}
+	/* Initialise timers */
+	for (i = 0; i < MAX_TMR; i++) {
+		opp->timers[i].tccr = 0;
+		opp->timers[i].tbcr = TBCR_CI;
+	}
+	/* Go out of RESET state */
+	opp->gcr = 0;
+}
+
+static inline uint32_t read_IRQreg_idr(OpenPICState * opp, int n_IRQ)
+{
+	return opp->src[n_IRQ].idr;
+}
+
+static inline uint32_t read_IRQreg_ilr(OpenPICState * opp, int n_IRQ)
+{
+	if (opp->flags & OPENPIC_FLAG_ILR) {
+		return output_to_inttgt(opp->src[n_IRQ].output);
+	}
+
+	return 0xffffffff;
+}
+
+static inline uint32_t read_IRQreg_ivpr(OpenPICState * opp, int n_IRQ)
+{
+	return opp->src[n_IRQ].ivpr;
+}
+
+static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
+{
+	IRQSource *src = &opp->src[n_IRQ];
+	uint32_t normal_mask = (1UL << opp->nb_cpus) - 1;
+	uint32_t crit_mask = 0;
+	uint32_t mask = normal_mask;
+	int crit_shift = IDR_EP_SHIFT - opp->nb_cpus;
+	int i;
+
+	if (opp->flags & OPENPIC_FLAG_IDR_CRIT) {
+		crit_mask = mask << crit_shift;
+		mask |= crit_mask | IDR_EP;
+	}
+
+	src->idr = val & mask;
+	DPRINTF("Set IDR %d to 0x%08x\n", n_IRQ, src->idr);
+
+	if (opp->flags & OPENPIC_FLAG_IDR_CRIT) {
+		if (src->idr & crit_mask) {
+			if (src->idr & normal_mask) {
+				DPRINTF
+				    ("%s: IRQ configured for multiple output types, using "
+				     "critical\n", __func__);
+			}
+
+			src->output = OPENPIC_OUTPUT_CINT;
+			src->nomask = true;
+			src->destmask = 0;
+
+			for (i = 0; i < opp->nb_cpus; i++) {
+				int n_ci = IDR_CI0_SHIFT - i;
+
+				if (src->idr & (1UL << n_ci)) {
+					src->destmask |= 1UL << i;
+				}
+			}
+		} else {
+			src->output = OPENPIC_OUTPUT_INT;
+			src->nomask = false;
+			src->destmask = src->idr & normal_mask;
+		}
+	} else {
+		src->destmask = src->idr;
+	}
+}
+
+static inline void write_IRQreg_ilr(OpenPICState * opp, int n_IRQ, uint32_t val)
+{
+	if (opp->flags & OPENPIC_FLAG_ILR) {
+		IRQSource *src = &opp->src[n_IRQ];
+
+		src->output = inttgt_to_output(val & ILR_INTTGT_MASK);
+		DPRINTF("Set ILR %d to 0x%08x, output %d\n", n_IRQ, val,
+			src->output);
+
+		/* TODO: on MPIC v4.0 only, set nomask for non-INT */
+	}
+}
+
+static inline void write_IRQreg_ivpr(OpenPICState * opp, int n_IRQ,
+				     uint32_t val)
+{
+	uint32_t mask;
+
+	/* NOTE when implementing newer FSL MPIC models: starting with v4.0,
+	 * the polarity bit is read-only on internal interrupts.
+	 */
+	mask = IVPR_MASK_MASK | IVPR_PRIORITY_MASK | IVPR_SENSE_MASK |
+	    IVPR_POLARITY_MASK | opp->vector_mask;
+
+	/* ACTIVITY bit is read-only */
+	opp->src[n_IRQ].ivpr =
+	    (opp->src[n_IRQ].ivpr & IVPR_ACTIVITY_MASK) | (val & mask);
+
+	/* For FSL internal interrupts, the sense bit is reserved and zero,
+	 * and the interrupt is always level-triggered.  Timers and IPIs
+	 * have no sense or polarity bits, and are edge-triggered.
+	 */
+	switch (opp->src[n_IRQ].type) {
+	case IRQ_TYPE_NORMAL:
+		opp->src[n_IRQ].level =
+		    !!(opp->src[n_IRQ].ivpr & IVPR_SENSE_MASK);
+		break;
+
+	case IRQ_TYPE_FSLINT:
+		opp->src[n_IRQ].ivpr &= ~IVPR_SENSE_MASK;
+		break;
+
+	case IRQ_TYPE_FSLSPECIAL:
+		opp->src[n_IRQ].ivpr &= ~(IVPR_POLARITY_MASK | IVPR_SENSE_MASK);
+		break;
+	}
+
+	openpic_update_irq(opp, n_IRQ);
+	DPRINTF("Set IVPR %d to 0x%08x -> 0x%08x\n", n_IRQ, val,
+		opp->src[n_IRQ].ivpr);
+}
+
+static void openpic_gcr_write(OpenPICState * opp, uint64_t val)
+{
+	bool mpic_proxy = false;
+
+	if (val & GCR_RESET) {
+		openpic_reset(&opp->busdev.qdev);
+		return;
+	}
+
+	opp->gcr &= ~opp->mpic_mode_mask;
+	opp->gcr |= val & opp->mpic_mode_mask;
+
+	/* Set external proxy mode */
+	if ((val & opp->mpic_mode_mask) == GCR_MODE_PROXY) {
+		mpic_proxy = true;
+	}
+
+	ppce500_set_mpic_proxy(mpic_proxy);
+}
+
+static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned len)
+{
+	OpenPICState *opp = opaque;
+	IRQDest *dst;
+	int idx;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+		__func__, addr, val);
+	if (addr & 0xF) {
+		return;
+	}
+	switch (addr) {
+	case 0x00:		/* Block Revision Register1 (BRR1) is read-only */
+		break;
+	case 0x40:
+	case 0x50:
+	case 0x60:
+	case 0x70:
+	case 0x80:
+	case 0x90:
+	case 0xA0:
+	case 0xB0:
+		openpic_cpu_write_internal(opp, addr, val, get_current_cpu());
+		break;
+	case 0x1000:		/* FRR */
+		break;
+	case 0x1020:		/* GCR */
+		openpic_gcr_write(opp, val);
+		break;
+	case 0x1080:		/* VIR */
+		break;
+	case 0x1090:		/* PIR */
+		for (idx = 0; idx < opp->nb_cpus; idx++) {
+			if ((val & (1 << idx)) && !(opp->pir & (1 << idx))) {
+				DPRINTF
+				    ("Raise OpenPIC RESET output for CPU %d\n",
+				     idx);
+				dst = &opp->dst[idx];
+				qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_RESET]);
+			} else if (!(val & (1 << idx))
+				   && (opp->pir & (1 << idx))) {
+				DPRINTF
+				    ("Lower OpenPIC RESET output for CPU %d\n",
+				     idx);
+				dst = &opp->dst[idx];
+				qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_RESET]);
+			}
+		}
+		opp->pir = val;
+		break;
+	case 0x10A0:		/* IPI_IVPR */
+	case 0x10B0:
+	case 0x10C0:
+	case 0x10D0:
+		{
+			int idx;
+			idx = (addr - 0x10A0) >> 4;
+			write_IRQreg_ivpr(opp, opp->irq_ipi0 + idx, val);
+		}
+		break;
+	case 0x10E0:		/* SPVE */
+		opp->spve = val & opp->vector_mask;
+		break;
+	default:
+		break;
+	}
+}
+
+static uint64_t openpic_gbl_read(void *opaque, hwaddr addr, unsigned len)
+{
+	OpenPICState *opp = opaque;
+	uint32_t retval;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	retval = 0xFFFFFFFF;
+	if (addr & 0xF) {
+		return retval;
+	}
+	switch (addr) {
+	case 0x1000:		/* FRR */
+		retval = opp->frr;
+		break;
+	case 0x1020:		/* GCR */
+		retval = opp->gcr;
+		break;
+	case 0x1080:		/* VIR */
+		retval = opp->vir;
+		break;
+	case 0x1090:		/* PIR */
+		retval = 0x00000000;
+		break;
+	case 0x00:		/* Block Revision Register1 (BRR1) */
+		retval = opp->brr1;
+		break;
+	case 0x40:
+	case 0x50:
+	case 0x60:
+	case 0x70:
+	case 0x80:
+	case 0x90:
+	case 0xA0:
+	case 0xB0:
+		retval =
+		    openpic_cpu_read_internal(opp, addr, get_current_cpu());
+		break;
+	case 0x10A0:		/* IPI_IVPR */
+	case 0x10B0:
+	case 0x10C0:
+	case 0x10D0:
+		{
+			int idx;
+			idx = (addr - 0x10A0) >> 4;
+			retval = read_IRQreg_ivpr(opp, opp->irq_ipi0 + idx);
+		}
+		break;
+	case 0x10E0:		/* SPVE */
+		retval = opp->spve;
+		break;
+	default:
+		break;
+	}
+	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+
+	return retval;
+}
+
+static void openpic_tmr_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned len)
+{
+	OpenPICState *opp = opaque;
+	int idx;
+
+	addr += 0x10f0;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+		__func__, addr, val);
+	if (addr & 0xF) {
+		return;
+	}
+
+	if (addr == 0x10f0) {
+		/* TFRR */
+		opp->tfrr = val;
+		return;
+	}
+
+	idx = (addr >> 6) & 0x3;
+	addr = addr & 0x30;
+
+	switch (addr & 0x30) {
+	case 0x00:		/* TCCR */
+		break;
+	case 0x10:		/* TBCR */
+		if ((opp->timers[idx].tccr & TCCR_TOG) != 0 &&
+		    (val & TBCR_CI) == 0 &&
+		    (opp->timers[idx].tbcr & TBCR_CI) != 0) {
+			opp->timers[idx].tccr &= ~TCCR_TOG;
+		}
+		opp->timers[idx].tbcr = val;
+		break;
+	case 0x20:		/* TVPR */
+		write_IRQreg_ivpr(opp, opp->irq_tim0 + idx, val);
+		break;
+	case 0x30:		/* TDR */
+		write_IRQreg_idr(opp, opp->irq_tim0 + idx, val);
+		break;
+	}
+}
+
+static uint64_t openpic_tmr_read(void *opaque, hwaddr addr, unsigned len)
+{
+	OpenPICState *opp = opaque;
+	uint32_t retval = -1;
+	int idx;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	if (addr & 0xF) {
+		goto out;
+	}
+	idx = (addr >> 6) & 0x3;
+	if (addr == 0x0) {
+		/* TFRR */
+		retval = opp->tfrr;
+		goto out;
+	}
+	switch (addr & 0x30) {
+	case 0x00:		/* TCCR */
+		retval = opp->timers[idx].tccr;
+		break;
+	case 0x10:		/* TBCR */
+		retval = opp->timers[idx].tbcr;
+		break;
+	case 0x20:		/* TVPR */
+		retval = read_IRQreg_ivpr(opp, opp->irq_tim0 + idx);
+		break;
+	case 0x30:		/* TDR */
+		retval = read_IRQreg_idr(opp, opp->irq_tim0 + idx);
+		break;
+	}
+
+out:
+	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+
+	return retval;
+}
+
+static void openpic_src_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned len)
+{
+	OpenPICState *opp = opaque;
+	int idx;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+		__func__, addr, val);
+
+	addr = addr & 0xffff;
+	idx = addr >> 5;
+
+	switch (addr & 0x1f) {
+	case 0x00:
+		write_IRQreg_ivpr(opp, idx, val);
+		break;
+	case 0x10:
+		write_IRQreg_idr(opp, idx, val);
+		break;
+	case 0x18:
+		write_IRQreg_ilr(opp, idx, val);
+		break;
+	}
+}
+
+static uint64_t openpic_src_read(void *opaque, hwaddr addr, unsigned len)
+{
+	OpenPICState *opp = opaque;
+	uint32_t retval;
+	int idx;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	retval = 0xFFFFFFFF;
+
+	addr = addr & 0xffff;
+	idx = addr >> 5;
+
+	switch (addr & 0x1f) {
+	case 0x00:
+		retval = read_IRQreg_ivpr(opp, idx);
+		break;
+	case 0x10:
+		retval = read_IRQreg_idr(opp, idx);
+		break;
+	case 0x18:
+		retval = read_IRQreg_ilr(opp, idx);
+		break;
+	}
+
+	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	return retval;
+}
+
+static void openpic_msi_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned size)
+{
+	OpenPICState *opp = opaque;
+	int idx = opp->irq_msi;
+	int srs, ibs;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
+		__func__, addr, val);
+	if (addr & 0xF) {
+		return;
+	}
+
+	switch (addr) {
+	case MSIIR_OFFSET:
+		srs = val >> MSIIR_SRS_SHIFT;
+		idx += srs;
+		ibs = (val & MSIIR_IBS_MASK) >> MSIIR_IBS_SHIFT;
+		opp->msi[srs].msir |= 1 << ibs;
+		openpic_set_irq(opp, idx, 1);
+		break;
+	default:
+		/* most registers are read-only, thus ignored */
+		break;
+	}
+}
+
+static uint64_t openpic_msi_read(void *opaque, hwaddr addr, unsigned size)
+{
+	OpenPICState *opp = opaque;
+	uint64_t r = 0;
+	int i, srs;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	if (addr & 0xF) {
+		return -1;
+	}
+
+	srs = addr >> 4;
+
+	switch (addr) {
+	case 0x00:
+	case 0x10:
+	case 0x20:
+	case 0x30:
+	case 0x40:
+	case 0x50:
+	case 0x60:
+	case 0x70:		/* MSIRs */
+		r = opp->msi[srs].msir;
+		/* Clear on read */
+		opp->msi[srs].msir = 0;
+		openpic_set_irq(opp, opp->irq_msi + srs, 0);
+		break;
+	case 0x120:		/* MSISR */
+		for (i = 0; i < MAX_MSI; i++) {
+			r |= (opp->msi[i].msir ? 1 : 0) << i;
+		}
+		break;
+	}
+
+	return r;
+}
+
+static uint64_t openpic_summary_read(void *opaque, hwaddr addr, unsigned size)
+{
+	uint64_t r = 0;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+
+	/* TODO: EISR/EIMR */
+
+	return r;
+}
+
+static void openpic_summary_write(void *opaque, hwaddr addr, uint64_t val,
+				  unsigned size)
+{
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
+		__func__, addr, val);
+
+	/* TODO: EISR/EIMR */
+}
+
+static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
+				       uint32_t val, int idx)
+{
+	OpenPICState *opp = opaque;
+	IRQSource *src;
+	IRQDest *dst;
+	int s_IRQ, n_IRQ;
+
+	DPRINTF("%s: cpu %d addr %#" HWADDR_PRIx " <= 0x%08x\n", __func__, idx,
+		addr, val);
+
+	if (idx < 0) {
+		return;
+	}
+
+	if (addr & 0xF) {
+		return;
+	}
+	dst = &opp->dst[idx];
+	addr &= 0xFF0;
+	switch (addr) {
+	case 0x40:		/* IPIDR */
+	case 0x50:
+	case 0x60:
+	case 0x70:
+		idx = (addr - 0x40) >> 4;
+		/* destmask tracks which CPUs still need this IPI delivered */
+		opp->src[opp->irq_ipi0 + idx].destmask |= val;
+		openpic_set_irq(opp, opp->irq_ipi0 + idx, 1);
+		openpic_set_irq(opp, opp->irq_ipi0 + idx, 0);
+		break;
+	case 0x80:		/* CTPR */
+		dst->ctpr = val & 0x0000000F;
+
+		DPRINTF("%s: set CPU %d ctpr to %d, raised %d servicing %d\n",
+			__func__, idx, dst->ctpr, dst->raised.priority,
+			dst->servicing.priority);
+
+		if (dst->raised.priority <= dst->ctpr) {
+			DPRINTF
+			    ("%s: Lower OpenPIC INT output cpu %d due to ctpr\n",
+			     __func__, idx);
+			qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
+		} else if (dst->raised.priority > dst->servicing.priority) {
+			DPRINTF("%s: Raise OpenPIC INT output cpu %d irq %d\n",
+				__func__, idx, dst->raised.next);
+			qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_INT]);
+		}
+
+		break;
+	case 0x90:		/* WHOAMI */
+		/* Read-only register */
+		break;
+	case 0xA0:		/* IACK */
+		/* Read-only register */
+		break;
+	case 0xB0:		/* EOI */
+		DPRINTF("EOI\n");
+		s_IRQ = IRQ_get_next(opp, &dst->servicing);
+
+		if (s_IRQ < 0) {
+			DPRINTF("%s: EOI with no interrupt in service\n",
+				__func__);
+			break;
+		}
+
+		IRQ_resetbit(&dst->servicing, s_IRQ);
+		/* Set up next servicing IRQ */
+		s_IRQ = IRQ_get_next(opp, &dst->servicing);
+		/* Check queued interrupts. */
+		n_IRQ = IRQ_get_next(opp, &dst->raised);
+		src = &opp->src[n_IRQ];
+		if (n_IRQ != -1 &&
+		    (s_IRQ == -1 ||
+		     IVPR_PRIORITY(src->ivpr) > dst->servicing.priority)) {
+			DPRINTF("Raise OpenPIC INT output cpu %d irq %d\n",
+				idx, n_IRQ);
+			qemu_irq_raise(opp->dst[idx].irqs[OPENPIC_OUTPUT_INT]);
+		}
+		break;
+	default:
+		break;
+	}
+}
+
+static void openpic_cpu_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned len)
+{
+	openpic_cpu_write_internal(opaque, addr, val, (addr & 0x1f000) >> 12);
+}
+
+static uint32_t openpic_iack(OpenPICState * opp, IRQDest * dst, int cpu)
+{
+	IRQSource *src;
+	int retval, irq;
+
+	DPRINTF("Lower OpenPIC INT output\n");
+	qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
+
+	irq = IRQ_get_next(opp, &dst->raised);
+	DPRINTF("IACK: irq=%d\n", irq);
+
+	if (irq == -1) {
+		/* No more interrupts pending */
+		return opp->spve;
+	}
+
+	src = &opp->src[irq];
+	if (!(src->ivpr & IVPR_ACTIVITY_MASK) ||
+	    !(IVPR_PRIORITY(src->ivpr) > dst->ctpr)) {
+		fprintf(stderr, "%s: bad raised IRQ %d ctpr %d ivpr 0x%08x\n",
+			__func__, irq, dst->ctpr, src->ivpr);
+		openpic_update_irq(opp, irq);
+		retval = opp->spve;
+	} else {
+		/* IRQ enters servicing state */
+		IRQ_setbit(&dst->servicing, irq);
+		retval = IVPR_VECTOR(opp, src->ivpr);
+	}
+
+	if (!src->level) {
+		/* edge-sensitive IRQ */
+		src->ivpr &= ~IVPR_ACTIVITY_MASK;
+		src->pending = 0;
+		IRQ_resetbit(&dst->raised, irq);
+	}
+
+	if ((irq >= opp->irq_ipi0) && (irq < (opp->irq_ipi0 + MAX_IPI))) {
+		src->destmask &= ~(1 << cpu);
+		if (src->destmask && !src->level) {
+			/* trigger on CPUs that didn't know about it yet */
+			openpic_set_irq(opp, irq, 1);
+			openpic_set_irq(opp, irq, 0);
+			/* if all CPUs knew about it, set active bit again */
+			src->ivpr |= IVPR_ACTIVITY_MASK;
+		}
+	}
+
+	return retval;
+}
+
+static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx)
+{
+	OpenPICState *opp = opaque;
+	IRQDest *dst;
+	uint32_t retval;
+
+	DPRINTF("%s: cpu %d addr %#" HWADDR_PRIx "\n", __func__, idx, addr);
+	retval = 0xFFFFFFFF;
+
+	if (idx < 0) {
+		return retval;
+	}
+
+	if (addr & 0xF) {
+		return retval;
+	}
+	dst = &opp->dst[idx];
+	addr &= 0xFF0;
+	switch (addr) {
+	case 0x80:		/* CTPR */
+		retval = dst->ctpr;
+		break;
+	case 0x90:		/* WHOAMI */
+		retval = idx;
+		break;
+	case 0xA0:		/* IACK */
+		retval = openpic_iack(opp, dst, idx);
+		break;
+	case 0xB0:		/* EOI */
+		retval = 0;
+		break;
+	default:
+		break;
+	}
+	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+
+	return retval;
+}
+
+static uint64_t openpic_cpu_read(void *opaque, hwaddr addr, unsigned len)
+{
+	return openpic_cpu_read_internal(opaque, addr, (addr & 0x1f000) >> 12);
+}
+
+static const MemoryRegionOps openpic_glb_ops_le = {
+	.write = openpic_gbl_write,
+	.read = openpic_gbl_read,
+	.endianness = DEVICE_LITTLE_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_glb_ops_be = {
+	.write = openpic_gbl_write,
+	.read = openpic_gbl_read,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_tmr_ops_le = {
+	.write = openpic_tmr_write,
+	.read = openpic_tmr_read,
+	.endianness = DEVICE_LITTLE_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_tmr_ops_be = {
+	.write = openpic_tmr_write,
+	.read = openpic_tmr_read,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_cpu_ops_le = {
+	.write = openpic_cpu_write,
+	.read = openpic_cpu_read,
+	.endianness = DEVICE_LITTLE_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_cpu_ops_be = {
+	.write = openpic_cpu_write,
+	.read = openpic_cpu_read,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_src_ops_le = {
+	.write = openpic_src_write,
+	.read = openpic_src_read,
+	.endianness = DEVICE_LITTLE_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_src_ops_be = {
+	.write = openpic_src_write,
+	.read = openpic_src_read,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_msi_ops_be = {
+	.read = openpic_msi_read,
+	.write = openpic_msi_write,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_summary_ops_be = {
+	.read = openpic_summary_read,
+	.write = openpic_summary_write,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static void openpic_save_IRQ_queue(QEMUFile * f, IRQQueue * q)
+{
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(q->queue); i++) {
+		/* Always put the lower half of a 64-bit long first, in case we
+		 * restore on a 32-bit host.  The least significant bits correspond
+		 * to lower IRQ numbers in the bitmap.
+		 */
+		qemu_put_be32(f, (uint32_t) q->queue[i]);
+#if LONG_MAX > 0x7FFFFFFF
+		qemu_put_be32(f, (uint32_t) (q->queue[i] >> 32));
+#endif
+	}
+
+	qemu_put_sbe32s(f, &q->next);
+	qemu_put_sbe32s(f, &q->priority);
+}
+
+static void openpic_save(QEMUFile * f, void *opaque)
+{
+	OpenPICState *opp = (OpenPICState *) opaque;
+	unsigned int i;
+
+	qemu_put_be32s(f, &opp->gcr);
+	qemu_put_be32s(f, &opp->vir);
+	qemu_put_be32s(f, &opp->pir);
+	qemu_put_be32s(f, &opp->spve);
+	qemu_put_be32s(f, &opp->tfrr);
+
+	qemu_put_be32s(f, &opp->nb_cpus);
+
+	for (i = 0; i < opp->nb_cpus; i++) {
+		qemu_put_sbe32s(f, &opp->dst[i].ctpr);
+		openpic_save_IRQ_queue(f, &opp->dst[i].raised);
+		openpic_save_IRQ_queue(f, &opp->dst[i].servicing);
+		qemu_put_buffer(f, (uint8_t *) & opp->dst[i].outputs_active,
+				sizeof(opp->dst[i].outputs_active));
+	}
+
+	for (i = 0; i < MAX_TMR; i++) {
+		qemu_put_be32s(f, &opp->timers[i].tccr);
+		qemu_put_be32s(f, &opp->timers[i].tbcr);
+	}
+
+	for (i = 0; i < opp->max_irq; i++) {
+		qemu_put_be32s(f, &opp->src[i].ivpr);
+		qemu_put_be32s(f, &opp->src[i].idr);
+		qemu_put_be32s(f, &opp->src[i].destmask);
+		qemu_put_sbe32s(f, &opp->src[i].last_cpu);
+		qemu_put_sbe32s(f, &opp->src[i].pending);
+	}
+}
+
+static void openpic_load_IRQ_queue(QEMUFile * f, IRQQueue * q)
+{
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(q->queue); i++) {
+		unsigned long val;
+
+		val = qemu_get_be32(f);
+#if LONG_MAX > 0x7FFFFFFF
+		/* The high half follows the low half, matching the save order */
+		val |= (unsigned long)qemu_get_be32(f) << 32;
+#endif
+
+		q->queue[i] = val;
+	}
+
+	qemu_get_sbe32s(f, &q->next);
+	qemu_get_sbe32s(f, &q->priority);
+}
+
+static int openpic_load(QEMUFile * f, void *opaque, int version_id)
+{
+	OpenPICState *opp = (OpenPICState *) opaque;
+	unsigned int i;
+
+	if (version_id != 1) {
+		return -EINVAL;
+	}
+
+	qemu_get_be32s(f, &opp->gcr);
+	qemu_get_be32s(f, &opp->vir);
+	qemu_get_be32s(f, &opp->pir);
+	qemu_get_be32s(f, &opp->spve);
+	qemu_get_be32s(f, &opp->tfrr);
+
+	qemu_get_be32s(f, &opp->nb_cpus);
+
+	for (i = 0; i < opp->nb_cpus; i++) {
+		qemu_get_sbe32s(f, &opp->dst[i].ctpr);
+		openpic_load_IRQ_queue(f, &opp->dst[i].raised);
+		openpic_load_IRQ_queue(f, &opp->dst[i].servicing);
+		qemu_get_buffer(f, (uint8_t *) & opp->dst[i].outputs_active,
+				sizeof(opp->dst[i].outputs_active));
+	}
+
+	for (i = 0; i < MAX_TMR; i++) {
+		qemu_get_be32s(f, &opp->timers[i].tccr);
+		qemu_get_be32s(f, &opp->timers[i].tbcr);
+	}
+
+	for (i = 0; i < opp->max_irq; i++) {
+		uint32_t val;
+
+		val = qemu_get_be32(f);
+		write_IRQreg_idr(opp, i, val);
+		val = qemu_get_be32(f);
+		write_IRQreg_ivpr(opp, i, val);
+
+		qemu_get_be32s(f, &opp->src[i].ivpr);
+		qemu_get_be32s(f, &opp->src[i].idr);
+		qemu_get_be32s(f, &opp->src[i].destmask);
+		qemu_get_sbe32s(f, &opp->src[i].last_cpu);
+		qemu_get_sbe32s(f, &opp->src[i].pending);
+	}
+
+	return 0;
+}
+
+typedef struct MemReg {
+	const char *name;
+	MemoryRegionOps const *ops;
+	hwaddr start_addr;
+	ram_addr_t size;
+} MemReg;
+
+static void fsl_common_init(OpenPICState * opp)
+{
+	int i;
+	int virq = MAX_SRC;
+
+	opp->vid = VID_REVISION_1_2;
+	opp->vir = VIR_GENERIC;
+	opp->vector_mask = 0xFFFF;
+	opp->tfrr_reset = 0;
+	opp->ivpr_reset = IVPR_MASK_MASK;
+	opp->idr_reset = 1 << 0;
+	opp->max_irq = MAX_IRQ;
+
+	opp->irq_ipi0 = virq;
+	virq += MAX_IPI;
+	opp->irq_tim0 = virq;
+	virq += MAX_TMR;
+
+	assert(virq <= MAX_IRQ);
+
+	opp->irq_msi = 224;
+
+	msi_supported = true;
+	for (i = 0; i < opp->fsl->max_ext; i++) {
+		opp->src[i].level = false;
+	}
+
+	/* Internal interrupts, including message and MSI */
+	for (i = 16; i < MAX_SRC; i++) {
+		opp->src[i].type = IRQ_TYPE_FSLINT;
+		opp->src[i].level = true;
+	}
+
+	/* timers and IPIs */
+	for (i = MAX_SRC; i < virq; i++) {
+		opp->src[i].type = IRQ_TYPE_FSLSPECIAL;
+		opp->src[i].level = false;
+	}
+}
+
+static void map_list(OpenPICState * opp, const MemReg * list, int *count)
+{
+	while (list->name) {
+		assert(*count < ARRAY_SIZE(opp->sub_io_mem));
+
+		memory_region_init_io(&opp->sub_io_mem[*count], list->ops, opp,
+				      list->name, list->size);
+
+		memory_region_add_subregion(&opp->mem, list->start_addr,
+					    &opp->sub_io_mem[*count]);
+
+		(*count)++;
+		list++;
+	}
+}
+
+static int openpic_init(SysBusDevice * dev)
+{
+	OpenPICState *opp = FROM_SYSBUS(typeof(*opp), dev);
+	int i, j;
+	int list_count = 0;
+	static const MemReg list_le[] = {
+		{"glb", &openpic_glb_ops_le,
+		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
+		{"tmr", &openpic_tmr_ops_le,
+		 OPENPIC_TMR_REG_START, OPENPIC_TMR_REG_SIZE},
+		{"src", &openpic_src_ops_le,
+		 OPENPIC_SRC_REG_START, OPENPIC_SRC_REG_SIZE},
+		{"cpu", &openpic_cpu_ops_le,
+		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
+		{NULL}
+	};
+	static const MemReg list_be[] = {
+		{"glb", &openpic_glb_ops_be,
+		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
+		{"tmr", &openpic_tmr_ops_be,
+		 OPENPIC_TMR_REG_START, OPENPIC_TMR_REG_SIZE},
+		{"src", &openpic_src_ops_be,
+		 OPENPIC_SRC_REG_START, OPENPIC_SRC_REG_SIZE},
+		{"cpu", &openpic_cpu_ops_be,
+		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
+		{NULL}
+	};
+	static const MemReg list_fsl[] = {
+		{"msi", &openpic_msi_ops_be,
+		 OPENPIC_MSI_REG_START, OPENPIC_MSI_REG_SIZE},
+		{"summary", &openpic_summary_ops_be,
+		 OPENPIC_SUMMARY_REG_START, OPENPIC_SUMMARY_REG_SIZE},
+		{NULL}
+	};
+
+	memory_region_init(&opp->mem, "openpic", 0x40000);
+
+	switch (opp->model) {
+	case OPENPIC_MODEL_FSL_MPIC_20:
+	default:
+		opp->fsl = &fsl_mpic_20;
+		opp->brr1 = 0x00400200;
+		opp->flags |= OPENPIC_FLAG_IDR_CRIT;
+		opp->nb_irqs = 80;
+		opp->mpic_mode_mask = GCR_MODE_MIXED;
+
+		fsl_common_init(opp);
+		map_list(opp, list_be, &list_count);
+		map_list(opp, list_fsl, &list_count);
+
+		break;
+
+	case OPENPIC_MODEL_FSL_MPIC_42:
+		opp->fsl = &fsl_mpic_42;
+		opp->brr1 = 0x00400402;
+		opp->flags |= OPENPIC_FLAG_ILR;
+		opp->nb_irqs = 196;
+		opp->mpic_mode_mask = GCR_MODE_PROXY;
+
+		fsl_common_init(opp);
+		map_list(opp, list_be, &list_count);
+		map_list(opp, list_fsl, &list_count);
+
+		break;
+
+	case OPENPIC_MODEL_RAVEN:
+		opp->nb_irqs = RAVEN_MAX_EXT;
+		opp->vid = VID_REVISION_1_3;
+		opp->vir = VIR_GENERIC;
+		opp->vector_mask = 0xFF;
+		opp->tfrr_reset = 4160000;
+		opp->ivpr_reset = IVPR_MASK_MASK | IVPR_MODE_MASK;
+		opp->idr_reset = 0;
+		opp->max_irq = RAVEN_MAX_IRQ;
+		opp->irq_ipi0 = RAVEN_IPI_IRQ;
+		opp->irq_tim0 = RAVEN_TMR_IRQ;
+		opp->brr1 = -1;
+		opp->mpic_mode_mask = GCR_MODE_MIXED;
+
+		/* Only UP supported today */
+		if (opp->nb_cpus != 1) {
+			return -EINVAL;
+		}
+
+		map_list(opp, list_le, &list_count);
+		break;
+	}
+
+	for (i = 0; i < opp->nb_cpus; i++) {
+		opp->dst[i].irqs = g_new(qemu_irq, OPENPIC_OUTPUT_NB);
+		for (j = 0; j < OPENPIC_OUTPUT_NB; j++) {
+			sysbus_init_irq(dev, &opp->dst[i].irqs[j]);
+		}
+	}
+
+	register_savevm(&opp->busdev.qdev, "openpic", 0, 2,
+			openpic_save, openpic_load, opp);
+
+	sysbus_init_mmio(dev, &opp->mem);
+	qdev_init_gpio_in(&dev->qdev, openpic_set_irq, opp->max_irq);
+
+	return 0;
+}
+
+static Property openpic_properties[] = {
+	DEFINE_PROP_UINT32("model", OpenPICState, model,
+			   OPENPIC_MODEL_FSL_MPIC_20),
+	DEFINE_PROP_UINT32("nb_cpus", OpenPICState, nb_cpus, 1),
+	DEFINE_PROP_END_OF_LIST(),
+};
+
+static void openpic_class_init(ObjectClass * klass, void *data)
+{
+	DeviceClass *dc = DEVICE_CLASS(klass);
+	SysBusDeviceClass *k = SYS_BUS_DEVICE_CLASS(klass);
+
+	k->init = openpic_init;
+	dc->props = openpic_properties;
+	dc->reset = openpic_reset;
+}
+
+static const TypeInfo openpic_info = {
+	.name = "openpic",
+	.parent = TYPE_SYS_BUS_DEVICE,
+	.instance_size = sizeof(OpenPICState),
+	.class_init = openpic_class_init,
+};
+
+static void openpic_register_types(void)
+{
+	type_register_static(&openpic_info);
+}
+
+type_init(openpic_register_types)
-- 
1.7.9.5


* [RFC PATCH v2 2/6] kvm/ppc/mpic: import hw/openpic.c from QEMU
@ 2013-04-01 22:47     ` Scott Wood
  0 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-01 22:47 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus, Scott Wood

This is QEMU's hw/openpic.c from commit
abd8d4a4d6dfea7ddea72f095f993e1de941614e ("Update version for
1.4.0-rc0"), run through Lindent with no other changes to ease merging
future changes between Linux and QEMU.  Remaining style issues
(including those introduced by Lindent) will be fixed in a later patch.

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
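For orientation while reading the diff, the queue scan done by IRQ_check()
can be sketched on its own.  This is a minimal illustration, not the
imported code: the sketch_* names, SKETCH_MAX_IRQ and the plain priority
array are assumptions made up for the example, standing in for IRQQueue,
MAX_IRQ and IVPR_PRIORITY(src->ivpr).

```c
#include <assert.h>
#include <limits.h>

#define SKETCH_MAX_IRQ		64	/* hypothetical bound, not the patch's MAX_IRQ */
#define SKETCH_BITS_PER_WORD	((int)(sizeof(unsigned long) * CHAR_BIT))
#define SKETCH_WORDS \
	((SKETCH_MAX_IRQ + SKETCH_BITS_PER_WORD - 1) / SKETCH_BITS_PER_WORD)

struct sketch_queue {
	unsigned long pending[SKETCH_WORDS];	/* one bit per IRQ source */
	int next;				/* best pending IRQ, or -1 */
	int priority;				/* priority of that IRQ, or -1 */
};

/* Mark an IRQ as pending in the bitmap. */
static void sketch_set(struct sketch_queue *q, int irq)
{
	assert(irq >= 0 && irq < SKETCH_MAX_IRQ);
	q->pending[irq / SKETCH_BITS_PER_WORD] |=
	    1UL << (irq % SKETCH_BITS_PER_WORD);
}

/* Return nonzero if the IRQ is pending. */
static int sketch_test(const struct sketch_queue *q, int irq)
{
	return (q->pending[irq / SKETCH_BITS_PER_WORD] >>
		(irq % SKETCH_BITS_PER_WORD)) & 1;
}

/*
 * Walk the pending bitmap in ascending IRQ order and remember the source
 * with the highest priority.  The comparison is strict, so the
 * lowest-numbered source wins a tie.
 */
static void sketch_check(struct sketch_queue *q, const int *prio, int max_irq)
{
	int irq, next = -1, priority = -1;

	for (irq = 0; irq < max_irq; irq++) {
		if (sketch_test(q, irq) && prio[irq] > priority) {
			next = irq;
			priority = prio[irq];
		}
	}

	q->next = next;
	q->priority = priority;
}
```

With IRQs 3, 7 and 9 pending at priorities 5, 9 and 9, the scan settles on
IRQ 7.  An empty queue leaves both next and priority at -1.
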
 arch/powerpc/kvm/mpic.c | 1686 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 1686 insertions(+)
 create mode 100644 arch/powerpc/kvm/mpic.c
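
Distributed delivery mode in openpic_update_irq() rotates among the CPUs
selected by destmask, starting after last_cpu.  The helper below is a
simplified sketch of that rotation under assumed names (sketch_next_cpu is
hypothetical).  It is not a transcription of the loop in the patch, which
advances i in place and records the choice in src->last_cpu.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the first CPU after last_cpu whose bit is set in destmask,
 * wrapping around at nb_cpus.  last_cpu itself is tried last.
 * Returns -1 if no bit below nb_cpus is set.
 */
static int sketch_next_cpu(uint32_t destmask, int last_cpu, int nb_cpus)
{
	int step;

	assert(nb_cpus > 0 && nb_cpus <= 32);
	assert(last_cpu >= 0 && last_cpu < nb_cpus);

	for (step = 1; step <= nb_cpus; step++) {
		int cpu = (last_cpu + step) % nb_cpus;

		if (destmask & (UINT32_C(1) << cpu))
			return cpu;
	}

	return -1;
}
```

A caller would store the returned CPU as the new last_cpu, so that repeated
deliveries of the same source walk around the destination set.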

diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
new file mode 100644
index 0000000..57655b9
--- /dev/null
+++ b/arch/powerpc/kvm/mpic.c
@@ -0,0 +1,1686 @@
+/*
+ * OpenPIC emulation
+ *
+ * Copyright (c) 2004 Jocelyn Mayer
+ *               2011 Alexander Graf
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+/*
+ *
+ * Based on OpenPic implementations:
+ * - Intel GW80314 I/O companion chip developer's manual
+ * - Motorola MPC8245 & MPC8540 user manuals.
+ * - Motorola MCP750 (aka Raven) programmer manual.
+ * - Motorola Harrier programmer manuel
+ *
+ * Serial interrupts, as implemented in Raven chipset are not supported yet.
+ *
+ */
+#include "hw.h"
+#include "ppc/mac.h"
+#include "pci/pci.h"
+#include "openpic.h"
+#include "sysbus.h"
+#include "pci/msi.h"
+#include "qemu/bitops.h"
+#include "ppc.h"
+
+//#define DEBUG_OPENPIC
+
+#ifdef DEBUG_OPENPIC
+static const int debug_openpic = 1;
+#else
+static const int debug_openpic = 0;
+#endif
+
+#define DPRINTF(fmt, ...) do { \
+        if (debug_openpic) { \
+            printf(fmt , ## __VA_ARGS__); \
+        } \
+    } while (0)
+
+#define MAX_CPU     32
+#define MAX_SRC     256
+#define MAX_TMR     4
+#define MAX_IPI     4
+#define MAX_MSI     8
+#define MAX_IRQ     (MAX_SRC + MAX_IPI + MAX_TMR)
+#define VID         0x03	/* MPIC version ID */
+
+/* OpenPIC capability flags */
+#define OPENPIC_FLAG_IDR_CRIT     (1 << 0)
+#define OPENPIC_FLAG_ILR          (2 << 0)
+
+/* OpenPIC address map */
+#define OPENPIC_GLB_REG_START        0x0
+#define OPENPIC_GLB_REG_SIZE         0x10F0
+#define OPENPIC_TMR_REG_START        0x10F0
+#define OPENPIC_TMR_REG_SIZE         0x220
+#define OPENPIC_MSI_REG_START        0x1600
+#define OPENPIC_MSI_REG_SIZE         0x200
+#define OPENPIC_SUMMARY_REG_START   0x3800
+#define OPENPIC_SUMMARY_REG_SIZE    0x800
+#define OPENPIC_SRC_REG_START        0x10000
+#define OPENPIC_SRC_REG_SIZE         (MAX_SRC * 0x20)
+#define OPENPIC_CPU_REG_START        0x20000
+#define OPENPIC_CPU_REG_SIZE         0x100 + ((MAX_CPU - 1) * 0x1000)
+
+/* Raven */
+#define RAVEN_MAX_CPU      2
+#define RAVEN_MAX_EXT     48
+#define RAVEN_MAX_IRQ     64
+#define RAVEN_MAX_TMR      MAX_TMR
+#define RAVEN_MAX_IPI      MAX_IPI
+
+/* Interrupt definitions */
+#define RAVEN_FE_IRQ     (RAVEN_MAX_EXT)	/* Internal functional IRQ */
+#define RAVEN_ERR_IRQ    (RAVEN_MAX_EXT + 1)	/* Error IRQ */
+#define RAVEN_TMR_IRQ    (RAVEN_MAX_EXT + 2)	/* First timer IRQ */
+#define RAVEN_IPI_IRQ    (RAVEN_TMR_IRQ + RAVEN_MAX_TMR)	/* First IPI IRQ */
+/* First doorbell IRQ */
+#define RAVEN_DBL_IRQ    (RAVEN_IPI_IRQ + (RAVEN_MAX_CPU * RAVEN_MAX_IPI))
+
+typedef struct FslMpicInfo {
+	int max_ext;
+} FslMpicInfo;
+
+static FslMpicInfo fsl_mpic_20 = {
+	.max_ext = 12,
+};
+
+static FslMpicInfo fsl_mpic_42 = {
+	.max_ext = 12,
+};
+
+#define FRR_NIRQ_SHIFT    16
+#define FRR_NCPU_SHIFT     8
+#define FRR_VID_SHIFT      0
+
+#define VID_REVISION_1_2   2
+#define VID_REVISION_1_3   3
+
+#define VIR_GENERIC      0x00000000	/* Generic Vendor ID */
+
+#define GCR_RESET        0x80000000
+#define GCR_MODE_PASS    0x00000000
+#define GCR_MODE_MIXED   0x20000000
+#define GCR_MODE_PROXY   0x60000000
+
+#define TBCR_CI           0x80000000	/* count inhibit */
+#define TCCR_TOG          0x80000000	/* toggles when decrement to zero */
+
+#define IDR_EP_SHIFT      31
+#define IDR_EP_MASK       (1 << IDR_EP_SHIFT)
+#define IDR_CI0_SHIFT     30
+#define IDR_CI1_SHIFT     29
+#define IDR_P1_SHIFT      1
+#define IDR_P0_SHIFT      0
+
+#define ILR_INTTGT_MASK   0x000000ff
+#define ILR_INTTGT_INT    0x00
+#define ILR_INTTGT_CINT   0x01	/* critical */
+#define ILR_INTTGT_MCP    0x02	/* machine check */
+
+/* The currently supported INTTGT values happen to be the same as QEMU's
+ * openpic output codes, but don't depend on this.  The output codes
+ * could change (unlikely, but...) or support could be added for
+ * more INTTGT values.
+ */
+static const int inttgt_output[][2] = {
+	{ILR_INTTGT_INT, OPENPIC_OUTPUT_INT},
+	{ILR_INTTGT_CINT, OPENPIC_OUTPUT_CINT},
+	{ILR_INTTGT_MCP, OPENPIC_OUTPUT_MCK},
+};
+
+static int inttgt_to_output(int inttgt)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(inttgt_output); i++) {
+		if (inttgt_output[i][0] == inttgt) {
+			return inttgt_output[i][1];
+		}
+	}
+
+	fprintf(stderr, "%s: unsupported inttgt %d\n", __func__, inttgt);
+	return OPENPIC_OUTPUT_INT;
+}
+
+static int output_to_inttgt(int output)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(inttgt_output); i++) {
+		if (inttgt_output[i][1] == output) {
+			return inttgt_output[i][0];
+		}
+	}
+
+	abort();
+}
+
+#define MSIIR_OFFSET       0x140
+#define MSIIR_SRS_SHIFT    29
+#define MSIIR_SRS_MASK     (0x7 << MSIIR_SRS_SHIFT)
+#define MSIIR_IBS_SHIFT    24
+#define MSIIR_IBS_MASK     (0x1f << MSIIR_IBS_SHIFT)
+
+static int get_current_cpu(void)
+{
+	CPUState *cpu_single_cpu;
+
+	if (!cpu_single_env) {
+		return -1;
+	}
+
+	cpu_single_cpu = ENV_GET_CPU(cpu_single_env);
+	return cpu_single_cpu->cpu_index;
+}
+
+static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx);
+static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
+				       uint32_t val, int idx);
+
+typedef enum IRQType {
+	IRQ_TYPE_NORMAL = 0,
+	IRQ_TYPE_FSLINT,	/* FSL internal interrupt -- level only */
+	IRQ_TYPE_FSLSPECIAL,	/* FSL timer/IPI interrupt, edge, no polarity */
+} IRQType;
+
+typedef struct IRQQueue {
+	/* Round up to the nearest 64 IRQs so that the queue length
+	 * won't change when moving between 32 and 64 bit hosts.
+	 */
+	unsigned long queue[BITS_TO_LONGS((MAX_IRQ + 63) & ~63)];
+	int next;
+	int priority;
+} IRQQueue;
+
+typedef struct IRQSource {
+	uint32_t ivpr;		/* IRQ vector/priority register */
+	uint32_t idr;		/* IRQ destination register */
+	uint32_t destmask;	/* bitmap of CPU destinations */
+	int last_cpu;
+	int output;		/* IRQ level, e.g. OPENPIC_OUTPUT_INT */
+	int pending;		/* TRUE if IRQ is pending */
+	IRQType type;
+	bool level:1;		/* level-triggered */
+	bool nomask:1;		/* critical interrupts ignore mask on some FSL MPICs */
+} IRQSource;
+
+#define IVPR_MASK_SHIFT       31
+#define IVPR_MASK_MASK        (1 << IVPR_MASK_SHIFT)
+#define IVPR_ACTIVITY_SHIFT   30
+#define IVPR_ACTIVITY_MASK    (1 << IVPR_ACTIVITY_SHIFT)
+#define IVPR_MODE_SHIFT       29
+#define IVPR_MODE_MASK        (1 << IVPR_MODE_SHIFT)
+#define IVPR_POLARITY_SHIFT   23
+#define IVPR_POLARITY_MASK    (1 << IVPR_POLARITY_SHIFT)
+#define IVPR_SENSE_SHIFT      22
+#define IVPR_SENSE_MASK       (1 << IVPR_SENSE_SHIFT)
+
+#define IVPR_PRIORITY_MASK     (0xF << 16)
+#define IVPR_PRIORITY(_ivprr_) ((int)(((_ivprr_) & IVPR_PRIORITY_MASK) >> 16))
+#define IVPR_VECTOR(opp, _ivprr_) ((_ivprr_) & (opp)->vector_mask)
+
+/* IDR[EP/CI] are only for FSL MPIC prior to v4.0 */
+#define IDR_EP      0x80000000	/* external pin */
+#define IDR_CI      0x40000000	/* critical interrupt */
+
+typedef struct IRQDest {
+	int32_t ctpr;		/* CPU current task priority */
+	IRQQueue raised;
+	IRQQueue servicing;
+	qemu_irq *irqs;
+
+	/* Count of IRQ sources asserting on non-INT outputs */
+	uint32_t outputs_active[OPENPIC_OUTPUT_NB];
+} IRQDest;
+
+typedef struct OpenPICState {
+	SysBusDevice busdev;
+	MemoryRegion mem;
+
+	/* Behavior control */
+	FslMpicInfo *fsl;
+	uint32_t model;
+	uint32_t flags;
+	uint32_t nb_irqs;
+	uint32_t vid;
+	uint32_t vir;		/* Vendor identification register */
+	uint32_t vector_mask;
+	uint32_t tfrr_reset;
+	uint32_t ivpr_reset;
+	uint32_t idr_reset;
+	uint32_t brr1;
+	uint32_t mpic_mode_mask;
+
+	/* Sub-regions */
+	MemoryRegion sub_io_mem[6];
+
+	/* Global registers */
+	uint32_t frr;		/* Feature reporting register */
+	uint32_t gcr;		/* Global configuration register  */
+	uint32_t pir;		/* Processor initialization register */
+	uint32_t spve;		/* Spurious vector register */
+	uint32_t tfrr;		/* Timer frequency reporting register */
+	/* Source registers */
+	IRQSource src[MAX_IRQ];
+	/* Local registers per output pin */
+	IRQDest dst[MAX_CPU];
+	uint32_t nb_cpus;
+	/* Timer registers */
+	struct {
+		uint32_t tccr;	/* Global timer current count register */
+		uint32_t tbcr;	/* Global timer base count register */
+	} timers[MAX_TMR];
+	/* Shared MSI registers */
+	struct {
+		uint32_t msir;	/* Shared Message Signaled Interrupt Register */
+	} msi[MAX_MSI];
+	uint32_t max_irq;
+	uint32_t irq_ipi0;
+	uint32_t irq_tim0;
+	uint32_t irq_msi;
+} OpenPICState;
+
+static inline void IRQ_setbit(IRQQueue * q, int n_IRQ)
+{
+	set_bit(n_IRQ, q->queue);
+}
+
+static inline void IRQ_resetbit(IRQQueue * q, int n_IRQ)
+{
+	clear_bit(n_IRQ, q->queue);
+}
+
+static inline int IRQ_testbit(IRQQueue * q, int n_IRQ)
+{
+	return test_bit(n_IRQ, q->queue);
+}
+
+static void IRQ_check(OpenPICState * opp, IRQQueue * q)
+{
+	int irq = -1;
+	int next = -1;
+	int priority = -1;
+
+	for (;;) {
+		irq = find_next_bit(q->queue, opp->max_irq, irq + 1);
+		if (irq == opp->max_irq) {
+			break;
+		}
+
+		DPRINTF("IRQ_check: irq %d set ivpr_pr=%d pr=%d\n",
+			irq, IVPR_PRIORITY(opp->src[irq].ivpr), priority);
+
+		if (IVPR_PRIORITY(opp->src[irq].ivpr) > priority) {
+			next = irq;
+			priority = IVPR_PRIORITY(opp->src[irq].ivpr);
+		}
+	}
+
+	q->next = next;
+	q->priority = priority;
+}
+
+static int IRQ_get_next(OpenPICState * opp, IRQQueue * q)
+{
+	/* XXX: optimize */
+	IRQ_check(opp, q);
+
+	return q->next;
+}
+
+static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
+			   bool active, bool was_active)
+{
+	IRQDest *dst;
+	IRQSource *src;
+	int priority;
+
+	dst = &opp->dst[n_CPU];
+	src = &opp->src[n_IRQ];
+
+	DPRINTF("%s: IRQ %d active %d was %d\n",
+		__func__, n_IRQ, active, was_active);
+
+	if (src->output != OPENPIC_OUTPUT_INT) {
+		DPRINTF("%s: output %d irq %d active %d was %d count %d\n",
+			__func__, src->output, n_IRQ, active, was_active,
+			dst->outputs_active[src->output]);
+
+		/* On Freescale MPIC, critical interrupts ignore priority,
+		 * IACK, EOI, etc.  Before MPIC v4.1 they also ignore
+		 * masking.
+		 */
+		if (active) {
+			if (!was_active
+			    && dst->outputs_active[src->output]++ == 0) {
+				DPRINTF
+				    ("%s: Raise OpenPIC output %d cpu %d irq %d\n",
+				     __func__, src->output, n_CPU, n_IRQ);
+				qemu_irq_raise(dst->irqs[src->output]);
+			}
+		} else {
+			if (was_active
+			    && --dst->outputs_active[src->output] == 0) {
+				DPRINTF
+				    ("%s: Lower OpenPIC output %d cpu %d irq %d\n",
+				     __func__, src->output, n_CPU, n_IRQ);
+				qemu_irq_lower(dst->irqs[src->output]);
+			}
+		}
+
+		return;
+	}
+
+	priority = IVPR_PRIORITY(src->ivpr);
+
+	/* Even if the interrupt doesn't have enough priority,
+	 * it is still raised, in case ctpr is lowered later.
+	 */
+	if (active) {
+		IRQ_setbit(&dst->raised, n_IRQ);
+	} else {
+		IRQ_resetbit(&dst->raised, n_IRQ);
+	}
+
+	IRQ_check(opp, &dst->raised);
+
+	if (active && priority <= dst->ctpr) {
+		DPRINTF
+		    ("%s: IRQ %d priority %d too low for ctpr %d on CPU %d\n",
+		     __func__, n_IRQ, priority, dst->ctpr, n_CPU);
+		active = 0;
+	}
+
+	if (active) {
+		if (IRQ_get_next(opp, &dst->servicing) >= 0 &&
+		    priority <= dst->servicing.priority) {
+			DPRINTF
+			    ("%s: IRQ %d is hidden by servicing IRQ %d on CPU %d\n",
+			     __func__, n_IRQ, dst->servicing.next, n_CPU);
+		} else {
+			DPRINTF
+			    ("%s: Raise OpenPIC INT output cpu %d irq %d/%d\n",
+			     __func__, n_CPU, n_IRQ, dst->raised.next);
+			qemu_irq_raise(opp->dst[n_CPU].
+				       irqs[OPENPIC_OUTPUT_INT]);
+		}
+	} else {
+		IRQ_get_next(opp, &dst->servicing);
+		if (dst->raised.priority > dst->ctpr &&
+		    dst->raised.priority > dst->servicing.priority) {
+			DPRINTF
+			    ("%s: IRQ %d inactive, IRQ %d prio %d above %d/%d, CPU %d\n",
+			     __func__, n_IRQ, dst->raised.next,
+			     dst->raised.priority, dst->ctpr,
+			     dst->servicing.priority, n_CPU);
+			/* IRQ line stays asserted */
+		} else {
+			DPRINTF
+			    ("%s: IRQ %d inactive, current prio %d/%d, CPU %d\n",
+			     __func__, n_IRQ, dst->ctpr,
+			     dst->servicing.priority, n_CPU);
+			qemu_irq_lower(opp->dst[n_CPU].
+				       irqs[OPENPIC_OUTPUT_INT]);
+		}
+	}
+}
+
+/* update pic state because registers for n_IRQ have changed value */
+static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
+{
+	IRQSource *src;
+	bool active, was_active;
+	int i;
+
+	src = &opp->src[n_IRQ];
+	active = src->pending;
+
+	if ((src->ivpr & IVPR_MASK_MASK) && !src->nomask) {
+		/* Interrupt source is disabled */
+		DPRINTF("%s: IRQ %d is disabled\n", __func__, n_IRQ);
+		active = false;
+	}
+
+	was_active = ! !(src->ivpr & IVPR_ACTIVITY_MASK);
+
+	/*
+	 * We don't have a similar check for already-active because
+	 * ctpr may have changed and we need to withdraw the interrupt.
+	 */
+	if (!active && !was_active) {
+		DPRINTF("%s: IRQ %d is already inactive\n", __func__, n_IRQ);
+		return;
+	}
+
+	if (active) {
+		src->ivpr |= IVPR_ACTIVITY_MASK;
+	} else {
+		src->ivpr &= ~IVPR_ACTIVITY_MASK;
+	}
+
+	if (src->destmask == 0) {
+		/* No target */
+		DPRINTF("%s: IRQ %d has no target\n", __func__, n_IRQ);
+		return;
+	}
+
+	if (src->destmask == (1 << src->last_cpu)) {
+		/* Only one CPU is allowed to receive this IRQ */
+		IRQ_local_pipe(opp, src->last_cpu, n_IRQ, active, was_active);
+	} else if (!(src->ivpr & IVPR_MODE_MASK)) {
+		/* Directed delivery mode */
+		for (i = 0; i < opp->nb_cpus; i++) {
+			if (src->destmask & (1 << i)) {
+				IRQ_local_pipe(opp, i, n_IRQ, active,
+					       was_active);
+			}
+		}
+	} else {
+		/* Distributed delivery mode */
+		for (i = src->last_cpu + 1; i != src->last_cpu; i++) {
+			if (i == opp->nb_cpus) {
+				i = 0;
+			}
+			if (src->destmask & (1 << i)) {
+				IRQ_local_pipe(opp, i, n_IRQ, active,
+					       was_active);
+				src->last_cpu = i;
+				break;
+			}
+		}
+	}
+}
+
+static void openpic_set_irq(void *opaque, int n_IRQ, int level)
+{
+	OpenPICState *opp = opaque;
+	IRQSource *src;
+
+	if (n_IRQ >= MAX_IRQ) {
+		fprintf(stderr, "%s: IRQ %d out of range\n", __func__, n_IRQ);
+		abort();
+	}
+
+	src = &opp->src[n_IRQ];
+	DPRINTF("openpic: set irq %d = %d ivpr=0x%08x\n",
+		n_IRQ, level, src->ivpr);
+	if (src->level) {
+		/* level-sensitive irq */
+		src->pending = level;
+		openpic_update_irq(opp, n_IRQ);
+	} else {
+		/* edge-sensitive irq */
+		if (level) {
+			src->pending = 1;
+			openpic_update_irq(opp, n_IRQ);
+		}
+
+		if (src->output != OPENPIC_OUTPUT_INT) {
+			/* Edge-triggered interrupts shouldn't be used
+			 * with non-INT delivery, but just in case,
+			 * try to make it do something sane rather than
+			 * cause an interrupt storm.  This is close to
+			 * what you'd probably see happen in real hardware.
+			 */
+			src->pending = 0;
+			openpic_update_irq(opp, n_IRQ);
+		}
+	}
+}
+
+static void openpic_reset(DeviceState * d)
+{
+	OpenPICState *opp = FROM_SYSBUS(typeof(*opp), SYS_BUS_DEVICE(d));
+	int i;
+
+	opp->gcr = GCR_RESET;
+	/* Initialise controller registers */
+	opp->frr = ((opp->nb_irqs - 1) << FRR_NIRQ_SHIFT) |
+	    ((opp->nb_cpus - 1) << FRR_NCPU_SHIFT) |
+	    (opp->vid << FRR_VID_SHIFT);
+
+	opp->pir = 0;
+	opp->spve = -1 & opp->vector_mask;
+	opp->tfrr = opp->tfrr_reset;
+	/* Initialise IRQ sources */
+	for (i = 0; i < opp->max_irq; i++) {
+		opp->src[i].ivpr = opp->ivpr_reset;
+		opp->src[i].idr = opp->idr_reset;
+
+		switch (opp->src[i].type) {
+		case IRQ_TYPE_NORMAL:
+			opp->src[i].level =
+			    ! !(opp->ivpr_reset & IVPR_SENSE_MASK);
+			break;
+
+		case IRQ_TYPE_FSLINT:
+			opp->src[i].ivpr |= IVPR_POLARITY_MASK;
+			break;
+
+		case IRQ_TYPE_FSLSPECIAL:
+			break;
+		}
+	}
+	/* Initialise IRQ destinations */
+	for (i = 0; i < MAX_CPU; i++) {
+		opp->dst[i].ctpr = 15;
+		memset(&opp->dst[i].raised, 0, sizeof(IRQQueue));
+		opp->dst[i].raised.next = -1;
+		memset(&opp->dst[i].servicing, 0, sizeof(IRQQueue));
+		opp->dst[i].servicing.next = -1;
+	}
+	/* Initialise timers */
+	for (i = 0; i < MAX_TMR; i++) {
+		opp->timers[i].tccr = 0;
+		opp->timers[i].tbcr = TBCR_CI;
+	}
+	/* Go out of RESET state */
+	opp->gcr = 0;
+}
+
+static inline uint32_t read_IRQreg_idr(OpenPICState * opp, int n_IRQ)
+{
+	return opp->src[n_IRQ].idr;
+}
+
+static inline uint32_t read_IRQreg_ilr(OpenPICState * opp, int n_IRQ)
+{
+	if (opp->flags & OPENPIC_FLAG_ILR) {
+		return output_to_inttgt(opp->src[n_IRQ].output);
+	}
+
+	return 0xffffffff;
+}
+
+static inline uint32_t read_IRQreg_ivpr(OpenPICState * opp, int n_IRQ)
+{
+	return opp->src[n_IRQ].ivpr;
+}
+
+static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
+{
+	IRQSource *src = &opp->src[n_IRQ];
+	uint32_t normal_mask = (1UL << opp->nb_cpus) - 1;
+	uint32_t crit_mask = 0;
+	uint32_t mask = normal_mask;
+	int crit_shift = IDR_EP_SHIFT - opp->nb_cpus;
+	int i;
+
+	if (opp->flags & OPENPIC_FLAG_IDR_CRIT) {
+		crit_mask = mask << crit_shift;
+		mask |= crit_mask | IDR_EP;
+	}
+
+	src->idr = val & mask;
+	DPRINTF("Set IDR %d to 0x%08x\n", n_IRQ, src->idr);
+
+	if (opp->flags & OPENPIC_FLAG_IDR_CRIT) {
+		if (src->idr & crit_mask) {
+			if (src->idr & normal_mask) {
+				DPRINTF
+				    ("%s: IRQ configured for multiple output types, using "
+				     "critical\n", __func__);
+			}
+
+			src->output = OPENPIC_OUTPUT_CINT;
+			src->nomask = true;
+			src->destmask = 0;
+
+			for (i = 0; i < opp->nb_cpus; i++) {
+				int n_ci = IDR_CI0_SHIFT - i;
+
+				if (src->idr & (1UL << n_ci)) {
+					src->destmask |= 1UL << i;
+				}
+			}
+		} else {
+			src->output = OPENPIC_OUTPUT_INT;
+			src->nomask = false;
+			src->destmask = src->idr & normal_mask;
+		}
+	} else {
+		src->destmask = src->idr;
+	}
+}
+
+static inline void write_IRQreg_ilr(OpenPICState * opp, int n_IRQ, uint32_t val)
+{
+	if (opp->flags & OPENPIC_FLAG_ILR) {
+		IRQSource *src = &opp->src[n_IRQ];
+
+		src->output = inttgt_to_output(val & ILR_INTTGT_MASK);
+		DPRINTF("Set ILR %d to 0x%08x, output %d\n", n_IRQ, src->idr,
+			src->output);
+
+		/* TODO: on MPIC v4.0 only, set nomask for non-INT */
+	}
+}
+
+static inline void write_IRQreg_ivpr(OpenPICState * opp, int n_IRQ,
+				     uint32_t val)
+{
+	uint32_t mask;
+
+	/* NOTE when implementing newer FSL MPIC models: starting with v4.0,
+	 * the polarity bit is read-only on internal interrupts.
+	 */
+	mask = IVPR_MASK_MASK | IVPR_PRIORITY_MASK | IVPR_SENSE_MASK |
+	    IVPR_POLARITY_MASK | opp->vector_mask;
+
+	/* ACTIVITY bit is read-only */
+	opp->src[n_IRQ].ivpr =
+	    (opp->src[n_IRQ].ivpr & IVPR_ACTIVITY_MASK) | (val & mask);
+
+	/* For FSL internal interrupts, The sense bit is reserved and zero,
+	 * and the interrupt is always level-triggered.  Timers and IPIs
+	 * have no sense or polarity bits, and are edge-triggered.
+	 */
+	switch (opp->src[n_IRQ].type) {
+	case IRQ_TYPE_NORMAL:
+		opp->src[n_IRQ].level =
+		    ! !(opp->src[n_IRQ].ivpr & IVPR_SENSE_MASK);
+		break;
+
+	case IRQ_TYPE_FSLINT:
+		opp->src[n_IRQ].ivpr &= ~IVPR_SENSE_MASK;
+		break;
+
+	case IRQ_TYPE_FSLSPECIAL:
+		opp->src[n_IRQ].ivpr &= ~(IVPR_POLARITY_MASK | IVPR_SENSE_MASK);
+		break;
+	}
+
+	openpic_update_irq(opp, n_IRQ);
+	DPRINTF("Set IVPR %d to 0x%08x -> 0x%08x\n", n_IRQ, val,
+		opp->src[n_IRQ].ivpr);
+}
+
+static void openpic_gcr_write(OpenPICState * opp, uint64_t val)
+{
+	bool mpic_proxy = false;
+
+	if (val & GCR_RESET) {
+		openpic_reset(&opp->busdev.qdev);
+		return;
+	}
+
+	opp->gcr &= ~opp->mpic_mode_mask;
+	opp->gcr |= val & opp->mpic_mode_mask;
+
+	/* Set external proxy mode */
+	if ((val & opp->mpic_mode_mask) == GCR_MODE_PROXY) {
+		mpic_proxy = true;
+	}
+
+	ppce500_set_mpic_proxy(mpic_proxy);
+}
+
+static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned len)
+{
+	OpenPICState *opp = opaque;
+	IRQDest *dst;
+	int idx;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+		__func__, addr, val);
+	if (addr & 0xF) {
+		return;
+	}
+	switch (addr) {
+	case 0x00:		/* Block Revision Register1 (BRR1) is Readonly */
+		break;
+	case 0x40:
+	case 0x50:
+	case 0x60:
+	case 0x70:
+	case 0x80:
+	case 0x90:
+	case 0xA0:
+	case 0xB0:
+		openpic_cpu_write_internal(opp, addr, val, get_current_cpu());
+		break;
+	case 0x1000:		/* FRR */
+		break;
+	case 0x1020:		/* GCR */
+		openpic_gcr_write(opp, val);
+		break;
+	case 0x1080:		/* VIR */
+		break;
+	case 0x1090:		/* PIR */
+		for (idx = 0; idx < opp->nb_cpus; idx++) {
+			if ((val & (1 << idx)) && !(opp->pir & (1 << idx))) {
+				DPRINTF
+				    ("Raise OpenPIC RESET output for CPU %d\n",
+				     idx);
+				dst = &opp->dst[idx];
+				qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_RESET]);
+			} else if (!(val & (1 << idx))
+				   && (opp->pir & (1 << idx))) {
+				DPRINTF
+				    ("Lower OpenPIC RESET output for CPU %d\n",
+				     idx);
+				dst = &opp->dst[idx];
+				qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_RESET]);
+			}
+		}
+		opp->pir = val;
+		break;
+	case 0x10A0:		/* IPI_IVPR */
+	case 0x10B0:
+	case 0x10C0:
+	case 0x10D0:
+		{
+			int idx;
+			idx = (addr - 0x10A0) >> 4;
+			write_IRQreg_ivpr(opp, opp->irq_ipi0 + idx, val);
+		}
+		break;
+	case 0x10E0:		/* SPVE */
+		opp->spve = val & opp->vector_mask;
+		break;
+	default:
+		break;
+	}
+}
+
+static uint64_t openpic_gbl_read(void *opaque, hwaddr addr, unsigned len)
+{
+	OpenPICState *opp = opaque;
+	uint32_t retval;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	retval = 0xFFFFFFFF;
+	if (addr & 0xF) {
+		return retval;
+	}
+	switch (addr) {
+	case 0x1000:		/* FRR */
+		retval = opp->frr;
+		break;
+	case 0x1020:		/* GCR */
+		retval = opp->gcr;
+		break;
+	case 0x1080:		/* VIR */
+		retval = opp->vir;
+		break;
+	case 0x1090:		/* PIR */
+		retval = 0x00000000;
+		break;
+	case 0x00:		/* Block Revision Register1 (BRR1) */
+		retval = opp->brr1;
+		break;
+	case 0x40:
+	case 0x50:
+	case 0x60:
+	case 0x70:
+	case 0x80:
+	case 0x90:
+	case 0xA0:
+	case 0xB0:
+		retval =
+		    openpic_cpu_read_internal(opp, addr, get_current_cpu());
+		break;
+	case 0x10A0:		/* IPI_IVPR */
+	case 0x10B0:
+	case 0x10C0:
+	case 0x10D0:
+		{
+			int idx;
+			idx = (addr - 0x10A0) >> 4;
+			retval = read_IRQreg_ivpr(opp, opp->irq_ipi0 + idx);
+		}
+		break;
+	case 0x10E0:		/* SPVE */
+		retval = opp->spve;
+		break;
+	default:
+		break;
+	}
+	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+
+	return retval;
+}
+
+static void openpic_tmr_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned len)
+{
+	OpenPICState *opp = opaque;
+	int idx;
+
+	addr += 0x10f0;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+		__func__, addr, val);
+	if (addr & 0xF) {
+		return;
+	}
+
+	if (addr == 0x10f0) {
+		/* TFRR */
+		opp->tfrr = val;
+		return;
+	}
+
+	idx = (addr >> 6) & 0x3;
+	addr = addr & 0x30;
+
+	switch (addr & 0x30) {
+	case 0x00:		/* TCCR */
+		break;
+	case 0x10:		/* TBCR */
+		if ((opp->timers[idx].tccr & TCCR_TOG) != 0 &&
+		    (val & TBCR_CI) == 0 &&
+		    (opp->timers[idx].tbcr & TBCR_CI) != 0) {
+			opp->timers[idx].tccr &= ~TCCR_TOG;
+		}
+		opp->timers[idx].tbcr = val;
+		break;
+	case 0x20:		/* TVPR */
+		write_IRQreg_ivpr(opp, opp->irq_tim0 + idx, val);
+		break;
+	case 0x30:		/* TDR */
+		write_IRQreg_idr(opp, opp->irq_tim0 + idx, val);
+		break;
+	}
+}
+
+static uint64_t openpic_tmr_read(void *opaque, hwaddr addr, unsigned len)
+{
+	OpenPICState *opp = opaque;
+	uint32_t retval = -1;
+	int idx;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	if (addr & 0xF) {
+		goto out;
+	}
+	idx = (addr >> 6) & 0x3;
+	if (addr == 0x0) {
+		/* TFRR */
+		retval = opp->tfrr;
+		goto out;
+	}
+	switch (addr & 0x30) {
+	case 0x00:		/* TCCR */
+		retval = opp->timers[idx].tccr;
+		break;
+	case 0x10:		/* TBCR */
+		retval = opp->timers[idx].tbcr;
+		break;
+	case 0x20:		/* TIPV */
+		retval = read_IRQreg_ivpr(opp, opp->irq_tim0 + idx);
+		break;
+	case 0x30:		/* TIDE (TIDR) */
+		retval = read_IRQreg_idr(opp, opp->irq_tim0 + idx);
+		break;
+	}
+
+out:
+	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+
+	return retval;
+}
+
+static void openpic_src_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned len)
+{
+	OpenPICState *opp = opaque;
+	int idx;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+		__func__, addr, val);
+
+	addr = addr & 0xffff;
+	idx = addr >> 5;
+
+	switch (addr & 0x1f) {
+	case 0x00:
+		write_IRQreg_ivpr(opp, idx, val);
+		break;
+	case 0x10:
+		write_IRQreg_idr(opp, idx, val);
+		break;
+	case 0x18:
+		write_IRQreg_ilr(opp, idx, val);
+		break;
+	}
+}
+
+static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
+{
+	OpenPICState *opp = opaque;
+	uint32_t retval;
+	int idx;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	retval = 0xFFFFFFFF;
+
+	addr = addr & 0xffff;
+	idx = addr >> 5;
+
+	switch (addr & 0x1f) {
+	case 0x00:
+		retval = read_IRQreg_ivpr(opp, idx);
+		break;
+	case 0x10:
+		retval = read_IRQreg_idr(opp, idx);
+		break;
+	case 0x18:
+		retval = read_IRQreg_ilr(opp, idx);
+		break;
+	}
+
+	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	return retval;
+}
+
+static void openpic_msi_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned size)
+{
+	OpenPICState *opp = opaque;
+	int idx = opp->irq_msi;
+	int srs, ibs;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
+		__func__, addr, val);
+	if (addr & 0xF) {
+		return;
+	}
+
+	switch (addr) {
+	case MSIIR_OFFSET:
+		srs = val >> MSIIR_SRS_SHIFT;
+		idx += srs;
+		ibs = (val & MSIIR_IBS_MASK) >> MSIIR_IBS_SHIFT;
+		opp->msi[srs].msir |= 1 << ibs;
+		openpic_set_irq(opp, idx, 1);
+		break;
+	default:
+		/* most registers are read-only, thus ignored */
+		break;
+	}
+}
+
+static uint64_t openpic_msi_read(void *opaque, hwaddr addr, unsigned size)
+{
+	OpenPICState *opp = opaque;
+	uint64_t r = 0;
+	int i, srs;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	if (addr & 0xF) {
+		return -1;
+	}
+
+	srs = addr >> 4;
+
+	switch (addr) {
+	case 0x00:
+	case 0x10:
+	case 0x20:
+	case 0x30:
+	case 0x40:
+	case 0x50:
+	case 0x60:
+	case 0x70:		/* MSIRs */
+		r = opp->msi[srs].msir;
+		/* Clear on read */
+		opp->msi[srs].msir = 0;
+		openpic_set_irq(opp, opp->irq_msi + srs, 0);
+		break;
+	case 0x120:		/* MSISR */
+		for (i = 0; i < MAX_MSI; i++) {
+			r |= (opp->msi[i].msir ? 1 : 0) << i;
+		}
+		break;
+	}
+
+	return r;
+}
+
+static uint64_t openpic_summary_read(void *opaque, hwaddr addr, unsigned size)
+{
+	uint64_t r = 0;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+
+	/* TODO: EISR/EIMR */
+
+	return r;
+}
+
+static void openpic_summary_write(void *opaque, hwaddr addr, uint64_t val,
+				  unsigned size)
+{
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
+		__func__, addr, val);
+
+	/* TODO: EISR/EIMR */
+}
+
+static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
+				       uint32_t val, int idx)
+{
+	OpenPICState *opp = opaque;
+	IRQSource *src;
+	IRQDest *dst;
+	int s_IRQ, n_IRQ;
+
+	DPRINTF("%s: cpu %d addr %#" HWADDR_PRIx " <= 0x%08x\n", __func__, idx,
+		addr, val);
+
+	if (idx < 0) {
+		return;
+	}
+
+	if (addr & 0xF) {
+		return;
+	}
+	dst = &opp->dst[idx];
+	addr &= 0xFF0;
+	switch (addr) {
+	case 0x40:		/* IPIDR */
+	case 0x50:
+	case 0x60:
+	case 0x70:
+		idx = (addr - 0x40) >> 4;
+		/* we use IDE as mask which CPUs to deliver the IPI to still. */
+		opp->src[opp->irq_ipi0 + idx].destmask |= val;
+		openpic_set_irq(opp, opp->irq_ipi0 + idx, 1);
+		openpic_set_irq(opp, opp->irq_ipi0 + idx, 0);
+		break;
+	case 0x80:		/* CTPR */
+		dst->ctpr = val & 0x0000000F;
+
+		DPRINTF("%s: set CPU %d ctpr to %d, raised %d servicing %d\n",
+			__func__, idx, dst->ctpr, dst->raised.priority,
+			dst->servicing.priority);
+
+		if (dst->raised.priority <= dst->ctpr) {
+			DPRINTF
+			    ("%s: Lower OpenPIC INT output cpu %d due to ctpr\n",
+			     __func__, idx);
+			qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
+		} else if (dst->raised.priority > dst->servicing.priority) {
+			DPRINTF("%s: Raise OpenPIC INT output cpu %d irq %d\n",
+				__func__, idx, dst->raised.next);
+			qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_INT]);
+		}
+
+		break;
+	case 0x90:		/* WHOAMI */
+		/* Read-only register */
+		break;
+	case 0xA0:		/* IACK */
+		/* Read-only register */
+		break;
+	case 0xB0:		/* EOI */
+		DPRINTF("EOI\n");
+		s_IRQ = IRQ_get_next(opp, &dst->servicing);
+
+		if (s_IRQ < 0) {
+			DPRINTF("%s: EOI with no interrupt in service\n",
+				__func__);
+			break;
+		}
+
+		IRQ_resetbit(&dst->servicing, s_IRQ);
+		/* Set up next servicing IRQ */
+		s_IRQ = IRQ_get_next(opp, &dst->servicing);
+		/* Check queued interrupts. */
+		n_IRQ = IRQ_get_next(opp, &dst->raised);
+		src = &opp->src[n_IRQ];
+		if (n_IRQ != -1 &&
+		    (s_IRQ == -1 ||
+		     IVPR_PRIORITY(src->ivpr) > dst->servicing.priority)) {
+			DPRINTF("Raise OpenPIC INT output cpu %d irq %d\n",
+				idx, n_IRQ);
+			qemu_irq_raise(opp->dst[idx].irqs[OPENPIC_OUTPUT_INT]);
+		}
+		break;
+	default:
+		break;
+	}
+}
+
+static void openpic_cpu_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned len)
+{
+	openpic_cpu_write_internal(opaque, addr, val, (addr & 0x1f000) >> 12);
+}
+
+static uint32_t openpic_iack(OpenPICState * opp, IRQDest * dst, int cpu)
+{
+	IRQSource *src;
+	int retval, irq;
+
+	DPRINTF("Lower OpenPIC INT output\n");
+	qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
+
+	irq = IRQ_get_next(opp, &dst->raised);
+	DPRINTF("IACK: irq=%d\n", irq);
+
+	if (irq == -1) {
+		/* No more interrupt pending */
+		return opp->spve;
+	}
+
+	src = &opp->src[irq];
+	if (!(src->ivpr & IVPR_ACTIVITY_MASK) ||
+	    !(IVPR_PRIORITY(src->ivpr) > dst->ctpr)) {
+		fprintf(stderr, "%s: bad raised IRQ %d ctpr %d ivpr 0x%08x\n",
+			__func__, irq, dst->ctpr, src->ivpr);
+		openpic_update_irq(opp, irq);
+		retval = opp->spve;
+	} else {
+		/* IRQ enter servicing state */
+		IRQ_setbit(&dst->servicing, irq);
+		retval = IVPR_VECTOR(opp, src->ivpr);
+	}
+
+	if (!src->level) {
+		/* edge-sensitive IRQ */
+		src->ivpr &= ~IVPR_ACTIVITY_MASK;
+		src->pending = 0;
+		IRQ_resetbit(&dst->raised, irq);
+	}
+
+	if ((irq >= opp->irq_ipi0) && (irq < (opp->irq_ipi0 + MAX_IPI))) {
+		src->destmask &= ~(1 << cpu);
+		if (src->destmask && !src->level) {
+			/* trigger on CPUs that didn't know about it yet */
+			openpic_set_irq(opp, irq, 1);
+			openpic_set_irq(opp, irq, 0);
+			/* if all CPUs knew about it, set active bit again */
+			src->ivpr |= IVPR_ACTIVITY_MASK;
+		}
+	}
+
+	return retval;
+}
+
+static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx)
+{
+	OpenPICState *opp = opaque;
+	IRQDest *dst;
+	uint32_t retval;
+
+	DPRINTF("%s: cpu %d addr %#" HWADDR_PRIx "\n", __func__, idx, addr);
+	retval = 0xFFFFFFFF;
+
+	if (idx < 0) {
+		return retval;
+	}
+
+	if (addr & 0xF) {
+		return retval;
+	}
+	dst = &opp->dst[idx];
+	addr &= 0xFF0;
+	switch (addr) {
+	case 0x80:		/* CTPR */
+		retval = dst->ctpr;
+		break;
+	case 0x90:		/* WHOAMI */
+		retval = idx;
+		break;
+	case 0xA0:		/* IACK */
+		retval = openpic_iack(opp, dst, idx);
+		break;
+	case 0xB0:		/* EOI */
+		retval = 0;
+		break;
+	default:
+		break;
+	}
+	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+
+	return retval;
+}
+
+static uint64_t openpic_cpu_read(void *opaque, hwaddr addr, unsigned len)
+{
+	return openpic_cpu_read_internal(opaque, addr, (addr & 0x1f000) >> 12);
+}
+
+static const MemoryRegionOps openpic_glb_ops_le = {
+	.write = openpic_gbl_write,
+	.read = openpic_gbl_read,
+	.endianness = DEVICE_LITTLE_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_glb_ops_be = {
+	.write = openpic_gbl_write,
+	.read = openpic_gbl_read,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_tmr_ops_le = {
+	.write = openpic_tmr_write,
+	.read = openpic_tmr_read,
+	.endianness = DEVICE_LITTLE_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_tmr_ops_be = {
+	.write = openpic_tmr_write,
+	.read = openpic_tmr_read,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_cpu_ops_le = {
+	.write = openpic_cpu_write,
+	.read = openpic_cpu_read,
+	.endianness = DEVICE_LITTLE_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_cpu_ops_be = {
+	.write = openpic_cpu_write,
+	.read = openpic_cpu_read,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_src_ops_le = {
+	.write = openpic_src_write,
+	.read = openpic_src_read,
+	.endianness = DEVICE_LITTLE_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_src_ops_be = {
+	.write = openpic_src_write,
+	.read = openpic_src_read,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_msi_ops_be = {
+	.read = openpic_msi_read,
+	.write = openpic_msi_write,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_summary_ops_be = {
+	.read = openpic_summary_read,
+	.write = openpic_summary_write,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static void openpic_save_IRQ_queue(QEMUFile * f, IRQQueue * q)
+{
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(q->queue); i++) {
+		/* Always put the lower half of a 64-bit long first, in case we
+		 * restore on a 32-bit host.  The least significant bits correspond
+		 * to lower IRQ numbers in the bitmap.
+		 */
+		qemu_put_be32(f, (uint32_t) q->queue[i]);
+#if LONG_MAX > 0x7FFFFFFF
+		qemu_put_be32(f, (uint32_t) (q->queue[i] >> 32));
+#endif
+	}
+
+	qemu_put_sbe32s(f, &q->next);
+	qemu_put_sbe32s(f, &q->priority);
+}
+
+static void openpic_save(QEMUFile * f, void *opaque)
+{
+	OpenPICState *opp = (OpenPICState *) opaque;
+	unsigned int i;
+
+	qemu_put_be32s(f, &opp->gcr);
+	qemu_put_be32s(f, &opp->vir);
+	qemu_put_be32s(f, &opp->pir);
+	qemu_put_be32s(f, &opp->spve);
+	qemu_put_be32s(f, &opp->tfrr);
+
+	qemu_put_be32s(f, &opp->nb_cpus);
+
+	for (i = 0; i < opp->nb_cpus; i++) {
+		qemu_put_sbe32s(f, &opp->dst[i].ctpr);
+		openpic_save_IRQ_queue(f, &opp->dst[i].raised);
+		openpic_save_IRQ_queue(f, &opp->dst[i].servicing);
+		qemu_put_buffer(f, (uint8_t *) & opp->dst[i].outputs_active,
+				sizeof(opp->dst[i].outputs_active));
+	}
+
+	for (i = 0; i < MAX_TMR; i++) {
+		qemu_put_be32s(f, &opp->timers[i].tccr);
+		qemu_put_be32s(f, &opp->timers[i].tbcr);
+	}
+
+	for (i = 0; i < opp->max_irq; i++) {
+		qemu_put_be32s(f, &opp->src[i].ivpr);
+		qemu_put_be32s(f, &opp->src[i].idr);
+		qemu_get_be32s(f, &opp->src[i].destmask);
+		qemu_put_sbe32s(f, &opp->src[i].last_cpu);
+		qemu_put_sbe32s(f, &opp->src[i].pending);
+	}
+}
+
+static void openpic_load_IRQ_queue(QEMUFile * f, IRQQueue * q)
+{
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(q->queue); i++) {
+		unsigned long val;
+
+		val = qemu_get_be32(f);
+#if LONG_MAX > 0x7FFFFFFF
+		val <<= 32;
+		val |= qemu_get_be32(f);
+#endif
+
+		q->queue[i] = val;
+	}
+
+	qemu_get_sbe32s(f, &q->next);
+	qemu_get_sbe32s(f, &q->priority);
+}
+
+static int openpic_load(QEMUFile * f, void *opaque, int version_id)
+{
+	OpenPICState *opp = (OpenPICState *) opaque;
+	unsigned int i;
+
+	if (version_id != 1) {
+		return -EINVAL;
+	}
+
+	qemu_get_be32s(f, &opp->gcr);
+	qemu_get_be32s(f, &opp->vir);
+	qemu_get_be32s(f, &opp->pir);
+	qemu_get_be32s(f, &opp->spve);
+	qemu_get_be32s(f, &opp->tfrr);
+
+	qemu_get_be32s(f, &opp->nb_cpus);
+
+	for (i = 0; i < opp->nb_cpus; i++) {
+		qemu_get_sbe32s(f, &opp->dst[i].ctpr);
+		openpic_load_IRQ_queue(f, &opp->dst[i].raised);
+		openpic_load_IRQ_queue(f, &opp->dst[i].servicing);
+		qemu_get_buffer(f, (uint8_t *) & opp->dst[i].outputs_active,
+				sizeof(opp->dst[i].outputs_active));
+	}
+
+	for (i = 0; i < MAX_TMR; i++) {
+		qemu_get_be32s(f, &opp->timers[i].tccr);
+		qemu_get_be32s(f, &opp->timers[i].tbcr);
+	}
+
+	for (i = 0; i < opp->max_irq; i++) {
+		uint32_t val;
+
+		val = qemu_get_be32(f);
+		write_IRQreg_idr(opp, i, val);
+		val = qemu_get_be32(f);
+		write_IRQreg_ivpr(opp, i, val);
+
+		qemu_get_be32s(f, &opp->src[i].ivpr);
+		qemu_get_be32s(f, &opp->src[i].idr);
+		qemu_get_be32s(f, &opp->src[i].destmask);
+		qemu_get_sbe32s(f, &opp->src[i].last_cpu);
+		qemu_get_sbe32s(f, &opp->src[i].pending);
+	}
+
+	return 0;
+}
+
+typedef struct MemReg {
+	const char *name;
+	MemoryRegionOps const *ops;
+	hwaddr start_addr;
+	ram_addr_t size;
+} MemReg;
+
+static void fsl_common_init(OpenPICState * opp)
+{
+	int i;
+	int virq = MAX_SRC;
+
+	opp->vid = VID_REVISION_1_2;
+	opp->vir = VIR_GENERIC;
+	opp->vector_mask = 0xFFFF;
+	opp->tfrr_reset = 0;
+	opp->ivpr_reset = IVPR_MASK_MASK;
+	opp->idr_reset = 1 << 0;
+	opp->max_irq = MAX_IRQ;
+
+	opp->irq_ipi0 = virq;
+	virq += MAX_IPI;
+	opp->irq_tim0 = virq;
+	virq += MAX_TMR;
+
+	assert(virq <= MAX_IRQ);
+
+	opp->irq_msi = 224;
+
+	msi_supported = true;
+	for (i = 0; i < opp->fsl->max_ext; i++) {
+		opp->src[i].level = false;
+	}
+
+	/* Internal interrupts, including message and MSI */
+	for (i = 16; i < MAX_SRC; i++) {
+		opp->src[i].type = IRQ_TYPE_FSLINT;
+		opp->src[i].level = true;
+	}
+
+	/* timers and IPIs */
+	for (i = MAX_SRC; i < virq; i++) {
+		opp->src[i].type = IRQ_TYPE_FSLSPECIAL;
+		opp->src[i].level = false;
+	}
+}
+
+static void map_list(OpenPICState * opp, const MemReg * list, int *count)
+{
+	while (list->name) {
+		assert(*count < ARRAY_SIZE(opp->sub_io_mem));
+
+		memory_region_init_io(&opp->sub_io_mem[*count], list->ops, opp,
+				      list->name, list->size);
+
+		memory_region_add_subregion(&opp->mem, list->start_addr,
+					    &opp->sub_io_mem[*count]);
+
+		(*count)++;
+		list++;
+	}
+}
+
+static int openpic_init(SysBusDevice * dev)
+{
+	OpenPICState *opp = FROM_SYSBUS(typeof(*opp), dev);
+	int i, j;
+	int list_count = 0;
+	static const MemReg list_le[] = {
+		{"glb", &openpic_glb_ops_le,
+		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
+		{"tmr", &openpic_tmr_ops_le,
+		 OPENPIC_TMR_REG_START, OPENPIC_TMR_REG_SIZE},
+		{"src", &openpic_src_ops_le,
+		 OPENPIC_SRC_REG_START, OPENPIC_SRC_REG_SIZE},
+		{"cpu", &openpic_cpu_ops_le,
+		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
+		{NULL}
+	};
+	static const MemReg list_be[] = {
+		{"glb", &openpic_glb_ops_be,
+		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
+		{"tmr", &openpic_tmr_ops_be,
+		 OPENPIC_TMR_REG_START, OPENPIC_TMR_REG_SIZE},
+		{"src", &openpic_src_ops_be,
+		 OPENPIC_SRC_REG_START, OPENPIC_SRC_REG_SIZE},
+		{"cpu", &openpic_cpu_ops_be,
+		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
+		{NULL}
+	};
+	static const MemReg list_fsl[] = {
+		{"msi", &openpic_msi_ops_be,
+		 OPENPIC_MSI_REG_START, OPENPIC_MSI_REG_SIZE},
+		{"summary", &openpic_summary_ops_be,
+		 OPENPIC_SUMMARY_REG_START, OPENPIC_SUMMARY_REG_SIZE},
+		{NULL}
+	};
+
+	memory_region_init(&opp->mem, "openpic", 0x40000);
+
+	switch (opp->model) {
+	case OPENPIC_MODEL_FSL_MPIC_20:
+	default:
+		opp->fsl = &fsl_mpic_20;
+		opp->brr1 = 0x00400200;
+		opp->flags |= OPENPIC_FLAG_IDR_CRIT;
+		opp->nb_irqs = 80;
+		opp->mpic_mode_mask = GCR_MODE_MIXED;
+
+		fsl_common_init(opp);
+		map_list(opp, list_be, &list_count);
+		map_list(opp, list_fsl, &list_count);
+
+		break;
+
+	case OPENPIC_MODEL_FSL_MPIC_42:
+		opp->fsl = &fsl_mpic_42;
+		opp->brr1 = 0x00400402;
+		opp->flags |= OPENPIC_FLAG_ILR;
+		opp->nb_irqs = 196;
+		opp->mpic_mode_mask = GCR_MODE_PROXY;
+
+		fsl_common_init(opp);
+		map_list(opp, list_be, &list_count);
+		map_list(opp, list_fsl, &list_count);
+
+		break;
+
+	case OPENPIC_MODEL_RAVEN:
+		opp->nb_irqs = RAVEN_MAX_EXT;
+		opp->vid = VID_REVISION_1_3;
+		opp->vir = VIR_GENERIC;
+		opp->vector_mask = 0xFF;
+		opp->tfrr_reset = 4160000;
+		opp->ivpr_reset = IVPR_MASK_MASK | IVPR_MODE_MASK;
+		opp->idr_reset = 0;
+		opp->max_irq = RAVEN_MAX_IRQ;
+		opp->irq_ipi0 = RAVEN_IPI_IRQ;
+		opp->irq_tim0 = RAVEN_TMR_IRQ;
+		opp->brr1 = -1;
+		opp->mpic_mode_mask = GCR_MODE_MIXED;
+
+		/* Only UP supported today */
+		if (opp->nb_cpus != 1) {
+			return -EINVAL;
+		}
+
+		map_list(opp, list_le, &list_count);
+		break;
+	}
+
+	for (i = 0; i < opp->nb_cpus; i++) {
+		opp->dst[i].irqs = g_new(qemu_irq, OPENPIC_OUTPUT_NB);
+		for (j = 0; j < OPENPIC_OUTPUT_NB; j++) {
+			sysbus_init_irq(dev, &opp->dst[i].irqs[j]);
+		}
+	}
+
+	register_savevm(&opp->busdev.qdev, "openpic", 0, 2,
+			openpic_save, openpic_load, opp);
+
+	sysbus_init_mmio(dev, &opp->mem);
+	qdev_init_gpio_in(&dev->qdev, openpic_set_irq, opp->max_irq);
+
+	return 0;
+}
+
+static Property openpic_properties[] = {
+	DEFINE_PROP_UINT32("model", OpenPICState, model,
+			   OPENPIC_MODEL_FSL_MPIC_20),
+	DEFINE_PROP_UINT32("nb_cpus", OpenPICState, nb_cpus, 1),
+	DEFINE_PROP_END_OF_LIST(),
+};
+
+static void openpic_class_init(ObjectClass * klass, void *data)
+{
+	DeviceClass *dc = DEVICE_CLASS(klass);
+	SysBusDeviceClass *k = SYS_BUS_DEVICE_CLASS(klass);
+
+	k->init = openpic_init;
+	dc->props = openpic_properties;
+	dc->reset = openpic_reset;
+}
+
+static const TypeInfo openpic_info = {
+	.name = "openpic",
+	.parent = TYPE_SYS_BUS_DEVICE,
+	.instance_size = sizeof(OpenPICState),
+	.class_init = openpic_class_init,
+};
+
+static void openpic_register_types(void)
+{
+	type_register_static(&openpic_info);
+}
+
+type_init(openpic_register_types)
-- 
1.7.9.5



^ permalink raw reply related	[flat|nested] 261+ messages in thread

* [RFC PATCH v2 3/6] kvm/ppc/mpic: remove some obviously unneeded code
  2013-04-01 22:47   ` Scott Wood
@ 2013-04-01 22:47     ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-01 22:47 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus, Scott Wood

Remove the parts of the code that are obviously QEMU- or Raven-specific
before fixing style issues, so that there is less code whose style
needs to be fixed.

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
 arch/powerpc/kvm/mpic.c |  344 -----------------------------------------------
 1 file changed, 344 deletions(-)

diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index 57655b9..d6d70a4 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -22,39 +22,6 @@
  * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
  * THE SOFTWARE.
  */
-/*
- *
- * Based on OpenPic implementations:
- * - Intel GW80314 I/O companion chip developer's manual
- * - Motorola MPC8245 & MPC8540 user manuals.
- * - Motorola MCP750 (aka Raven) programmer manual.
- * - Motorola Harrier programmer manuel
- *
- * Serial interrupts, as implemented in Raven chipset are not supported yet.
- *
- */
-#include "hw.h"
-#include "ppc/mac.h"
-#include "pci/pci.h"
-#include "openpic.h"
-#include "sysbus.h"
-#include "pci/msi.h"
-#include "qemu/bitops.h"
-#include "ppc.h"
-
-//#define DEBUG_OPENPIC
-
-#ifdef DEBUG_OPENPIC
-static const int debug_openpic = 1;
-#else
-static const int debug_openpic = 0;
-#endif
-
-#define DPRINTF(fmt, ...) do { \
-        if (debug_openpic) { \
-            printf(fmt , ## __VA_ARGS__); \
-        } \
-    } while (0)
 
 #define MAX_CPU     32
 #define MAX_SRC     256
@@ -82,21 +49,6 @@ static const int debug_openpic = 0;
 #define OPENPIC_CPU_REG_START        0x20000
 #define OPENPIC_CPU_REG_SIZE         0x100 + ((MAX_CPU - 1) * 0x1000)
 
-/* Raven */
-#define RAVEN_MAX_CPU      2
-#define RAVEN_MAX_EXT     48
-#define RAVEN_MAX_IRQ     64
-#define RAVEN_MAX_TMR      MAX_TMR
-#define RAVEN_MAX_IPI      MAX_IPI
-
-/* Interrupt definitions */
-#define RAVEN_FE_IRQ     (RAVEN_MAX_EXT)	/* Internal functional IRQ */
-#define RAVEN_ERR_IRQ    (RAVEN_MAX_EXT + 1)	/* Error IRQ */
-#define RAVEN_TMR_IRQ    (RAVEN_MAX_EXT + 2)	/* First timer IRQ */
-#define RAVEN_IPI_IRQ    (RAVEN_TMR_IRQ + RAVEN_MAX_TMR)	/* First IPI IRQ */
-/* First doorbell IRQ */
-#define RAVEN_DBL_IRQ    (RAVEN_IPI_IRQ + (RAVEN_MAX_CPU * RAVEN_MAX_IPI))
-
 typedef struct FslMpicInfo {
 	int max_ext;
 } FslMpicInfo;
@@ -138,44 +90,6 @@ static FslMpicInfo fsl_mpic_42 = {
 #define ILR_INTTGT_CINT   0x01	/* critical */
 #define ILR_INTTGT_MCP    0x02	/* machine check */
 
-/* The currently supported INTTGT values happen to be the same as QEMU's
- * openpic output codes, but don't depend on this.  The output codes
- * could change (unlikely, but...) or support could be added for
- * more INTTGT values.
- */
-static const int inttgt_output[][2] = {
-	{ILR_INTTGT_INT, OPENPIC_OUTPUT_INT},
-	{ILR_INTTGT_CINT, OPENPIC_OUTPUT_CINT},
-	{ILR_INTTGT_MCP, OPENPIC_OUTPUT_MCK},
-};
-
-static int inttgt_to_output(int inttgt)
-{
-	int i;
-
-	for (i = 0; i < ARRAY_SIZE(inttgt_output); i++) {
-		if (inttgt_output[i][0] == inttgt) {
-			return inttgt_output[i][1];
-		}
-	}
-
-	fprintf(stderr, "%s: unsupported inttgt %d\n", __func__, inttgt);
-	return OPENPIC_OUTPUT_INT;
-}
-
-static int output_to_inttgt(int output)
-{
-	int i;
-
-	for (i = 0; i < ARRAY_SIZE(inttgt_output); i++) {
-		if (inttgt_output[i][1] == output) {
-			return inttgt_output[i][0];
-		}
-	}
-
-	abort();
-}
-
 #define MSIIR_OFFSET       0x140
 #define MSIIR_SRS_SHIFT    29
 #define MSIIR_SRS_MASK     (0x7 << MSIIR_SRS_SHIFT)
@@ -1265,228 +1179,36 @@ static uint64_t openpic_cpu_read(void *opaque, hwaddr addr, unsigned len)
 	return openpic_cpu_read_internal(opaque, addr, (addr & 0x1f000) >> 12);
 }
 
-static const MemoryRegionOps openpic_glb_ops_le = {
-	.write = openpic_gbl_write,
-	.read = openpic_gbl_read,
-	.endianness = DEVICE_LITTLE_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
-};
-
 static const MemoryRegionOps openpic_glb_ops_be = {
 	.write = openpic_gbl_write,
 	.read = openpic_gbl_read,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
-};
-
-static const MemoryRegionOps openpic_tmr_ops_le = {
-	.write = openpic_tmr_write,
-	.read = openpic_tmr_read,
-	.endianness = DEVICE_LITTLE_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
 static const MemoryRegionOps openpic_tmr_ops_be = {
 	.write = openpic_tmr_write,
 	.read = openpic_tmr_read,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
-};
-
-static const MemoryRegionOps openpic_cpu_ops_le = {
-	.write = openpic_cpu_write,
-	.read = openpic_cpu_read,
-	.endianness = DEVICE_LITTLE_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
 static const MemoryRegionOps openpic_cpu_ops_be = {
 	.write = openpic_cpu_write,
 	.read = openpic_cpu_read,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
-};
-
-static const MemoryRegionOps openpic_src_ops_le = {
-	.write = openpic_src_write,
-	.read = openpic_src_read,
-	.endianness = DEVICE_LITTLE_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
 static const MemoryRegionOps openpic_src_ops_be = {
 	.write = openpic_src_write,
 	.read = openpic_src_read,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
 static const MemoryRegionOps openpic_msi_ops_be = {
 	.read = openpic_msi_read,
 	.write = openpic_msi_write,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
 static const MemoryRegionOps openpic_summary_ops_be = {
 	.read = openpic_summary_read,
 	.write = openpic_summary_write,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
-static void openpic_save_IRQ_queue(QEMUFile * f, IRQQueue * q)
-{
-	unsigned int i;
-
-	for (i = 0; i < ARRAY_SIZE(q->queue); i++) {
-		/* Always put the lower half of a 64-bit long first, in case we
-		 * restore on a 32-bit host.  The least significant bits correspond
-		 * to lower IRQ numbers in the bitmap.
-		 */
-		qemu_put_be32(f, (uint32_t) q->queue[i]);
-#if LONG_MAX > 0x7FFFFFFF
-		qemu_put_be32(f, (uint32_t) (q->queue[i] >> 32));
-#endif
-	}
-
-	qemu_put_sbe32s(f, &q->next);
-	qemu_put_sbe32s(f, &q->priority);
-}
-
-static void openpic_save(QEMUFile * f, void *opaque)
-{
-	OpenPICState *opp = (OpenPICState *) opaque;
-	unsigned int i;
-
-	qemu_put_be32s(f, &opp->gcr);
-	qemu_put_be32s(f, &opp->vir);
-	qemu_put_be32s(f, &opp->pir);
-	qemu_put_be32s(f, &opp->spve);
-	qemu_put_be32s(f, &opp->tfrr);
-
-	qemu_put_be32s(f, &opp->nb_cpus);
-
-	for (i = 0; i < opp->nb_cpus; i++) {
-		qemu_put_sbe32s(f, &opp->dst[i].ctpr);
-		openpic_save_IRQ_queue(f, &opp->dst[i].raised);
-		openpic_save_IRQ_queue(f, &opp->dst[i].servicing);
-		qemu_put_buffer(f, (uint8_t *) & opp->dst[i].outputs_active,
-				sizeof(opp->dst[i].outputs_active));
-	}
-
-	for (i = 0; i < MAX_TMR; i++) {
-		qemu_put_be32s(f, &opp->timers[i].tccr);
-		qemu_put_be32s(f, &opp->timers[i].tbcr);
-	}
-
-	for (i = 0; i < opp->max_irq; i++) {
-		qemu_put_be32s(f, &opp->src[i].ivpr);
-		qemu_put_be32s(f, &opp->src[i].idr);
-		qemu_get_be32s(f, &opp->src[i].destmask);
-		qemu_put_sbe32s(f, &opp->src[i].last_cpu);
-		qemu_put_sbe32s(f, &opp->src[i].pending);
-	}
-}
-
-static void openpic_load_IRQ_queue(QEMUFile * f, IRQQueue * q)
-{
-	unsigned int i;
-
-	for (i = 0; i < ARRAY_SIZE(q->queue); i++) {
-		unsigned long val;
-
-		val = qemu_get_be32(f);
-#if LONG_MAX > 0x7FFFFFFF
-		val <<= 32;
-		val |= qemu_get_be32(f);
-#endif
-
-		q->queue[i] = val;
-	}
-
-	qemu_get_sbe32s(f, &q->next);
-	qemu_get_sbe32s(f, &q->priority);
-}
-
-static int openpic_load(QEMUFile * f, void *opaque, int version_id)
-{
-	OpenPICState *opp = (OpenPICState *) opaque;
-	unsigned int i;
-
-	if (version_id != 1) {
-		return -EINVAL;
-	}
-
-	qemu_get_be32s(f, &opp->gcr);
-	qemu_get_be32s(f, &opp->vir);
-	qemu_get_be32s(f, &opp->pir);
-	qemu_get_be32s(f, &opp->spve);
-	qemu_get_be32s(f, &opp->tfrr);
-
-	qemu_get_be32s(f, &opp->nb_cpus);
-
-	for (i = 0; i < opp->nb_cpus; i++) {
-		qemu_get_sbe32s(f, &opp->dst[i].ctpr);
-		openpic_load_IRQ_queue(f, &opp->dst[i].raised);
-		openpic_load_IRQ_queue(f, &opp->dst[i].servicing);
-		qemu_get_buffer(f, (uint8_t *) & opp->dst[i].outputs_active,
-				sizeof(opp->dst[i].outputs_active));
-	}
-
-	for (i = 0; i < MAX_TMR; i++) {
-		qemu_get_be32s(f, &opp->timers[i].tccr);
-		qemu_get_be32s(f, &opp->timers[i].tbcr);
-	}
-
-	for (i = 0; i < opp->max_irq; i++) {
-		uint32_t val;
-
-		val = qemu_get_be32(f);
-		write_IRQreg_idr(opp, i, val);
-		val = qemu_get_be32(f);
-		write_IRQreg_ivpr(opp, i, val);
-
-		qemu_get_be32s(f, &opp->src[i].ivpr);
-		qemu_get_be32s(f, &opp->src[i].idr);
-		qemu_get_be32s(f, &opp->src[i].destmask);
-		qemu_get_sbe32s(f, &opp->src[i].last_cpu);
-		qemu_get_sbe32s(f, &opp->src[i].pending);
-	}
-
-	return 0;
-}
-
 typedef struct MemReg {
 	const char *name;
 	MemoryRegionOps const *ops;
@@ -1614,73 +1336,7 @@ static int openpic_init(SysBusDevice * dev)
 		map_list(opp, list_fsl, &list_count);
 
 		break;
-
-	case OPENPIC_MODEL_RAVEN:
-		opp->nb_irqs = RAVEN_MAX_EXT;
-		opp->vid = VID_REVISION_1_3;
-		opp->vir = VIR_GENERIC;
-		opp->vector_mask = 0xFF;
-		opp->tfrr_reset = 4160000;
-		opp->ivpr_reset = IVPR_MASK_MASK | IVPR_MODE_MASK;
-		opp->idr_reset = 0;
-		opp->max_irq = RAVEN_MAX_IRQ;
-		opp->irq_ipi0 = RAVEN_IPI_IRQ;
-		opp->irq_tim0 = RAVEN_TMR_IRQ;
-		opp->brr1 = -1;
-		opp->mpic_mode_mask = GCR_MODE_MIXED;
-
-		/* Only UP supported today */
-		if (opp->nb_cpus != 1) {
-			return -EINVAL;
-		}
-
-		map_list(opp, list_le, &list_count);
-		break;
-	}
-
-	for (i = 0; i < opp->nb_cpus; i++) {
-		opp->dst[i].irqs = g_new(qemu_irq, OPENPIC_OUTPUT_NB);
-		for (j = 0; j < OPENPIC_OUTPUT_NB; j++) {
-			sysbus_init_irq(dev, &opp->dst[i].irqs[j]);
-		}
 	}
 
-	register_savevm(&opp->busdev.qdev, "openpic", 0, 2,
-			openpic_save, openpic_load, opp);
-
-	sysbus_init_mmio(dev, &opp->mem);
-	qdev_init_gpio_in(&dev->qdev, openpic_set_irq, opp->max_irq);
-
 	return 0;
 }
-
-static Property openpic_properties[] = {
-	DEFINE_PROP_UINT32("model", OpenPICState, model,
-			   OPENPIC_MODEL_FSL_MPIC_20),
-	DEFINE_PROP_UINT32("nb_cpus", OpenPICState, nb_cpus, 1),
-	DEFINE_PROP_END_OF_LIST(),
-};
-
-static void openpic_class_init(ObjectClass * klass, void *data)
-{
-	DeviceClass *dc = DEVICE_CLASS(klass);
-	SysBusDeviceClass *k = SYS_BUS_DEVICE_CLASS(klass);
-
-	k->init = openpic_init;
-	dc->props = openpic_properties;
-	dc->reset = openpic_reset;
-}
-
-static const TypeInfo openpic_info = {
-	.name = "openpic",
-	.parent = TYPE_SYS_BUS_DEVICE,
-	.instance_size = sizeof(OpenPICState),
-	.class_init = openpic_class_init,
-};
-
-static void openpic_register_types(void)
-{
-	type_register_static(&openpic_info);
-}
-
-type_init(openpic_register_types)
-- 
1.7.9.5



^ permalink raw reply related	[flat|nested] 261+ messages in thread

* [RFC PATCH v2 3/6] kvm/ppc/mpic: remove some obviously unneeded code
@ 2013-04-01 22:47     ` Scott Wood
  0 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-01 22:47 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus, Scott Wood

Remove the parts of the code that are obviously QEMU- or Raven-specific
before fixing style issues, so that there is less code whose style
needs to be fixed.

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
 arch/powerpc/kvm/mpic.c |  344 -----------------------------------------------
 1 file changed, 344 deletions(-)

diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index 57655b9..d6d70a4 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -22,39 +22,6 @@
  * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
  * THE SOFTWARE.
  */
-/*
- *
- * Based on OpenPic implementations:
- * - Intel GW80314 I/O companion chip developer's manual
- * - Motorola MPC8245 & MPC8540 user manuals.
- * - Motorola MCP750 (aka Raven) programmer manual.
- * - Motorola Harrier programmer manuel
- *
- * Serial interrupts, as implemented in Raven chipset are not supported yet.
- *
- */
-#include "hw.h"
-#include "ppc/mac.h"
-#include "pci/pci.h"
-#include "openpic.h"
-#include "sysbus.h"
-#include "pci/msi.h"
-#include "qemu/bitops.h"
-#include "ppc.h"
-
-//#define DEBUG_OPENPIC
-
-#ifdef DEBUG_OPENPIC
-static const int debug_openpic = 1;
-#else
-static const int debug_openpic = 0;
-#endif
-
-#define DPRINTF(fmt, ...) do { \
-        if (debug_openpic) { \
-            printf(fmt , ## __VA_ARGS__); \
-        } \
-    } while (0)
 
 #define MAX_CPU     32
 #define MAX_SRC     256
@@ -82,21 +49,6 @@ static const int debug_openpic = 0;
 #define OPENPIC_CPU_REG_START        0x20000
 #define OPENPIC_CPU_REG_SIZE         0x100 + ((MAX_CPU - 1) * 0x1000)
 
-/* Raven */
-#define RAVEN_MAX_CPU      2
-#define RAVEN_MAX_EXT     48
-#define RAVEN_MAX_IRQ     64
-#define RAVEN_MAX_TMR      MAX_TMR
-#define RAVEN_MAX_IPI      MAX_IPI
-
-/* Interrupt definitions */
-#define RAVEN_FE_IRQ     (RAVEN_MAX_EXT)	/* Internal functional IRQ */
-#define RAVEN_ERR_IRQ    (RAVEN_MAX_EXT + 1)	/* Error IRQ */
-#define RAVEN_TMR_IRQ    (RAVEN_MAX_EXT + 2)	/* First timer IRQ */
-#define RAVEN_IPI_IRQ    (RAVEN_TMR_IRQ + RAVEN_MAX_TMR)	/* First IPI IRQ */
-/* First doorbell IRQ */
-#define RAVEN_DBL_IRQ    (RAVEN_IPI_IRQ + (RAVEN_MAX_CPU * RAVEN_MAX_IPI))
-
 typedef struct FslMpicInfo {
 	int max_ext;
 } FslMpicInfo;
@@ -138,44 +90,6 @@ static FslMpicInfo fsl_mpic_42 = {
 #define ILR_INTTGT_CINT   0x01	/* critical */
 #define ILR_INTTGT_MCP    0x02	/* machine check */
 
-/* The currently supported INTTGT values happen to be the same as QEMU's
- * openpic output codes, but don't depend on this.  The output codes
- * could change (unlikely, but...) or support could be added for
- * more INTTGT values.
- */
-static const int inttgt_output[][2] = {
-	{ILR_INTTGT_INT, OPENPIC_OUTPUT_INT},
-	{ILR_INTTGT_CINT, OPENPIC_OUTPUT_CINT},
-	{ILR_INTTGT_MCP, OPENPIC_OUTPUT_MCK},
-};
-
-static int inttgt_to_output(int inttgt)
-{
-	int i;
-
-	for (i = 0; i < ARRAY_SIZE(inttgt_output); i++) {
-		if (inttgt_output[i][0] == inttgt) {
-			return inttgt_output[i][1];
-		}
-	}
-
-	fprintf(stderr, "%s: unsupported inttgt %d\n", __func__, inttgt);
-	return OPENPIC_OUTPUT_INT;
-}
-
-static int output_to_inttgt(int output)
-{
-	int i;
-
-	for (i = 0; i < ARRAY_SIZE(inttgt_output); i++) {
-		if (inttgt_output[i][1] == output) {
-			return inttgt_output[i][0];
-		}
-	}
-
-	abort();
-}
-
 #define MSIIR_OFFSET       0x140
 #define MSIIR_SRS_SHIFT    29
 #define MSIIR_SRS_MASK     (0x7 << MSIIR_SRS_SHIFT)
@@ -1265,228 +1179,36 @@ static uint64_t openpic_cpu_read(void *opaque, hwaddr addr, unsigned len)
 	return openpic_cpu_read_internal(opaque, addr, (addr & 0x1f000) >> 12);
 }
 
-static const MemoryRegionOps openpic_glb_ops_le = {
-	.write = openpic_gbl_write,
-	.read = openpic_gbl_read,
-	.endianness = DEVICE_LITTLE_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
-};
-
 static const MemoryRegionOps openpic_glb_ops_be = {
 	.write = openpic_gbl_write,
 	.read = openpic_gbl_read,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
-};
-
-static const MemoryRegionOps openpic_tmr_ops_le = {
-	.write = openpic_tmr_write,
-	.read = openpic_tmr_read,
-	.endianness = DEVICE_LITTLE_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
 static const MemoryRegionOps openpic_tmr_ops_be = {
 	.write = openpic_tmr_write,
 	.read = openpic_tmr_read,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
-};
-
-static const MemoryRegionOps openpic_cpu_ops_le = {
-	.write = openpic_cpu_write,
-	.read = openpic_cpu_read,
-	.endianness = DEVICE_LITTLE_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
 static const MemoryRegionOps openpic_cpu_ops_be = {
 	.write = openpic_cpu_write,
 	.read = openpic_cpu_read,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
-};
-
-static const MemoryRegionOps openpic_src_ops_le = {
-	.write = openpic_src_write,
-	.read = openpic_src_read,
-	.endianness = DEVICE_LITTLE_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
 static const MemoryRegionOps openpic_src_ops_be = {
 	.write = openpic_src_write,
 	.read = openpic_src_read,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
 static const MemoryRegionOps openpic_msi_ops_be = {
 	.read = openpic_msi_read,
 	.write = openpic_msi_write,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
 static const MemoryRegionOps openpic_summary_ops_be = {
 	.read = openpic_summary_read,
 	.write = openpic_summary_write,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
-static void openpic_save_IRQ_queue(QEMUFile * f, IRQQueue * q)
-{
-	unsigned int i;
-
-	for (i = 0; i < ARRAY_SIZE(q->queue); i++) {
-		/* Always put the lower half of a 64-bit long first, in case we
-		 * restore on a 32-bit host.  The least significant bits correspond
-		 * to lower IRQ numbers in the bitmap.
-		 */
-		qemu_put_be32(f, (uint32_t) q->queue[i]);
-#if LONG_MAX > 0x7FFFFFFF
-		qemu_put_be32(f, (uint32_t) (q->queue[i] >> 32));
-#endif
-	}
-
-	qemu_put_sbe32s(f, &q->next);
-	qemu_put_sbe32s(f, &q->priority);
-}
-
-static void openpic_save(QEMUFile * f, void *opaque)
-{
-	OpenPICState *opp = (OpenPICState *) opaque;
-	unsigned int i;
-
-	qemu_put_be32s(f, &opp->gcr);
-	qemu_put_be32s(f, &opp->vir);
-	qemu_put_be32s(f, &opp->pir);
-	qemu_put_be32s(f, &opp->spve);
-	qemu_put_be32s(f, &opp->tfrr);
-
-	qemu_put_be32s(f, &opp->nb_cpus);
-
-	for (i = 0; i < opp->nb_cpus; i++) {
-		qemu_put_sbe32s(f, &opp->dst[i].ctpr);
-		openpic_save_IRQ_queue(f, &opp->dst[i].raised);
-		openpic_save_IRQ_queue(f, &opp->dst[i].servicing);
-		qemu_put_buffer(f, (uint8_t *) & opp->dst[i].outputs_active,
-				sizeof(opp->dst[i].outputs_active));
-	}
-
-	for (i = 0; i < MAX_TMR; i++) {
-		qemu_put_be32s(f, &opp->timers[i].tccr);
-		qemu_put_be32s(f, &opp->timers[i].tbcr);
-	}
-
-	for (i = 0; i < opp->max_irq; i++) {
-		qemu_put_be32s(f, &opp->src[i].ivpr);
-		qemu_put_be32s(f, &opp->src[i].idr);
-		qemu_get_be32s(f, &opp->src[i].destmask);
-		qemu_put_sbe32s(f, &opp->src[i].last_cpu);
-		qemu_put_sbe32s(f, &opp->src[i].pending);
-	}
-}
-
-static void openpic_load_IRQ_queue(QEMUFile * f, IRQQueue * q)
-{
-	unsigned int i;
-
-	for (i = 0; i < ARRAY_SIZE(q->queue); i++) {
-		unsigned long val;
-
-		val = qemu_get_be32(f);
-#if LONG_MAX > 0x7FFFFFFF
-		val <<= 32;
-		val |= qemu_get_be32(f);
-#endif
-
-		q->queue[i] = val;
-	}
-
-	qemu_get_sbe32s(f, &q->next);
-	qemu_get_sbe32s(f, &q->priority);
-}
-
-static int openpic_load(QEMUFile * f, void *opaque, int version_id)
-{
-	OpenPICState *opp = (OpenPICState *) opaque;
-	unsigned int i;
-
-	if (version_id != 1) {
-		return -EINVAL;
-	}
-
-	qemu_get_be32s(f, &opp->gcr);
-	qemu_get_be32s(f, &opp->vir);
-	qemu_get_be32s(f, &opp->pir);
-	qemu_get_be32s(f, &opp->spve);
-	qemu_get_be32s(f, &opp->tfrr);
-
-	qemu_get_be32s(f, &opp->nb_cpus);
-
-	for (i = 0; i < opp->nb_cpus; i++) {
-		qemu_get_sbe32s(f, &opp->dst[i].ctpr);
-		openpic_load_IRQ_queue(f, &opp->dst[i].raised);
-		openpic_load_IRQ_queue(f, &opp->dst[i].servicing);
-		qemu_get_buffer(f, (uint8_t *) & opp->dst[i].outputs_active,
-				sizeof(opp->dst[i].outputs_active));
-	}
-
-	for (i = 0; i < MAX_TMR; i++) {
-		qemu_get_be32s(f, &opp->timers[i].tccr);
-		qemu_get_be32s(f, &opp->timers[i].tbcr);
-	}
-
-	for (i = 0; i < opp->max_irq; i++) {
-		uint32_t val;
-
-		val = qemu_get_be32(f);
-		write_IRQreg_idr(opp, i, val);
-		val = qemu_get_be32(f);
-		write_IRQreg_ivpr(opp, i, val);
-
-		qemu_get_be32s(f, &opp->src[i].ivpr);
-		qemu_get_be32s(f, &opp->src[i].idr);
-		qemu_get_be32s(f, &opp->src[i].destmask);
-		qemu_get_sbe32s(f, &opp->src[i].last_cpu);
-		qemu_get_sbe32s(f, &opp->src[i].pending);
-	}
-
-	return 0;
-}
-
 typedef struct MemReg {
 	const char *name;
 	MemoryRegionOps const *ops;
@@ -1614,73 +1336,7 @@ static int openpic_init(SysBusDevice * dev)
 		map_list(opp, list_fsl, &list_count);
 
 		break;
-
-	case OPENPIC_MODEL_RAVEN:
-		opp->nb_irqs = RAVEN_MAX_EXT;
-		opp->vid = VID_REVISION_1_3;
-		opp->vir = VIR_GENERIC;
-		opp->vector_mask = 0xFF;
-		opp->tfrr_reset = 4160000;
-		opp->ivpr_reset = IVPR_MASK_MASK | IVPR_MODE_MASK;
-		opp->idr_reset = 0;
-		opp->max_irq = RAVEN_MAX_IRQ;
-		opp->irq_ipi0 = RAVEN_IPI_IRQ;
-		opp->irq_tim0 = RAVEN_TMR_IRQ;
-		opp->brr1 = -1;
-		opp->mpic_mode_mask = GCR_MODE_MIXED;
-
-		/* Only UP supported today */
-		if (opp->nb_cpus != 1) {
-			return -EINVAL;
-		}
-
-		map_list(opp, list_le, &list_count);
-		break;
-	}
-
-	for (i = 0; i < opp->nb_cpus; i++) {
-		opp->dst[i].irqs = g_new(qemu_irq, OPENPIC_OUTPUT_NB);
-		for (j = 0; j < OPENPIC_OUTPUT_NB; j++) {
-			sysbus_init_irq(dev, &opp->dst[i].irqs[j]);
-		}
 	}
 
-	register_savevm(&opp->busdev.qdev, "openpic", 0, 2,
-			openpic_save, openpic_load, opp);
-
-	sysbus_init_mmio(dev, &opp->mem);
-	qdev_init_gpio_in(&dev->qdev, openpic_set_irq, opp->max_irq);
-
 	return 0;
 }
-
-static Property openpic_properties[] = {
-	DEFINE_PROP_UINT32("model", OpenPICState, model,
-			   OPENPIC_MODEL_FSL_MPIC_20),
-	DEFINE_PROP_UINT32("nb_cpus", OpenPICState, nb_cpus, 1),
-	DEFINE_PROP_END_OF_LIST(),
-};
-
-static void openpic_class_init(ObjectClass * klass, void *data)
-{
-	DeviceClass *dc = DEVICE_CLASS(klass);
-	SysBusDeviceClass *k = SYS_BUS_DEVICE_CLASS(klass);
-
-	k->init = openpic_init;
-	dc->props = openpic_properties;
-	dc->reset = openpic_reset;
-}
-
-static const TypeInfo openpic_info = {
-	.name = "openpic",
-	.parent = TYPE_SYS_BUS_DEVICE,
-	.instance_size = sizeof(OpenPICState),
-	.class_init = openpic_class_init,
-};
-
-static void openpic_register_types(void)
-{
-	type_register_static(&openpic_info);
-}
-
-type_init(openpic_register_types)
-- 
1.7.9.5




* [RFC PATCH v2 4/6] kvm/ppc/mpic: adapt to kernel style and environment
From: Scott Wood @ 2013-04-01 22:47 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus, Scott Wood

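Remove braces around single-statement blocks, which Linux style doesn't
permit; remove the space after '*' that Lindent added; keep error/debug
strings contiguous on one line; parenthesize OPENPIC_CPU_REG_SIZE.

Substitute kernel names for the QEMU ones: plain struct/enum tags for
the typedefs (e.g. OpenPICState -> struct openpic), gpa_t for hwaddr,
and pr_debug/pr_err for DPRINTF/fprintf(stderr, ...).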

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
 arch/powerpc/kvm/mpic.c |  445 ++++++++++++++++++++++-------------------------
 1 file changed, 208 insertions(+), 237 deletions(-)
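
Editorial note, not part of the patch: a minimal standalone sketch of why
the OPENPIC_CPU_REG_SIZE hunk below is more than cosmetic. Without the
outer parentheses the macro changes meaning when it appears inside a
larger expression. MAX_CPU is set to 32 here purely as an assumption for
illustration (the real value is defined elsewhere in mpic.c), and the
CPU_REG_SIZE_* names and helper functions are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value for illustration only; the real MAX_CPU lives in mpic.c. */
#define MAX_CPU 32

/* Old form: the expansion is not wrapped in parentheses. */
#define CPU_REG_SIZE_UNPAREN	0x100 + ((MAX_CPU - 1) * 0x1000)
/* New form, as in this patch: the whole expansion is parenthesized. */
#define CPU_REG_SIZE_PAREN	(0x100 + ((MAX_CPU - 1) * 0x1000))

/* Expands to 2 * 0x100 + (31 * 0x1000): only the first term is doubled. */
static uint32_t doubled_unparen(void)
{
	return 2 * CPU_REG_SIZE_UNPAREN;
}

/* Expands to 2 * (0x100 + (31 * 0x1000)): the whole size is doubled. */
static uint32_t doubled_paren(void)
{
	return 2 * CPU_REG_SIZE_PAREN;
}

/* Sanity check of the two expansions under the assumed MAX_CPU. */
static void check_reg_size_macros(void)
{
	assert(doubled_unparen() == 0x1F200);
	assert(doubled_paren() == 0x3E200);
}
```

Whether the old definition was ever used inside a larger expression is
not visible in this diff; the sketch only shows the hazard that the
added parentheses rule out.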

diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index d6d70a4..1df67ae 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -42,22 +42,22 @@
 #define OPENPIC_TMR_REG_SIZE         0x220
 #define OPENPIC_MSI_REG_START        0x1600
 #define OPENPIC_MSI_REG_SIZE         0x200
-#define OPENPIC_SUMMARY_REG_START   0x3800
-#define OPENPIC_SUMMARY_REG_SIZE    0x800
+#define OPENPIC_SUMMARY_REG_START    0x3800
+#define OPENPIC_SUMMARY_REG_SIZE     0x800
 #define OPENPIC_SRC_REG_START        0x10000
 #define OPENPIC_SRC_REG_SIZE         (MAX_SRC * 0x20)
 #define OPENPIC_CPU_REG_START        0x20000
-#define OPENPIC_CPU_REG_SIZE         0x100 + ((MAX_CPU - 1) * 0x1000)
+#define OPENPIC_CPU_REG_SIZE         (0x100 + ((MAX_CPU - 1) * 0x1000))
 
-typedef struct FslMpicInfo {
+struct fsl_mpic_info {
 	int max_ext;
-} FslMpicInfo;
+};
 
-static FslMpicInfo fsl_mpic_20 = {
+static struct fsl_mpic_info fsl_mpic_20 = {
 	.max_ext = 12,
 };
 
-static FslMpicInfo fsl_mpic_42 = {
+static struct fsl_mpic_info fsl_mpic_42 = {
 	.max_ext = 12,
 };
 
@@ -100,44 +100,43 @@ static int get_current_cpu(void)
 {
 	CPUState *cpu_single_cpu;
 
-	if (!cpu_single_env) {
+	if (!cpu_single_env)
 		return -1;
-	}
 
 	cpu_single_cpu = ENV_GET_CPU(cpu_single_env);
 	return cpu_single_cpu->cpu_index;
 }
 
-static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx);
-static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
+static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx);
+static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
 				       uint32_t val, int idx);
 
-typedef enum IRQType {
+enum irq_type {
 	IRQ_TYPE_NORMAL = 0,
 	IRQ_TYPE_FSLINT,	/* FSL internal interrupt -- level only */
 	IRQ_TYPE_FSLSPECIAL,	/* FSL timer/IPI interrupt, edge, no polarity */
-} IRQType;
+};
 
-typedef struct IRQQueue {
+struct irq_queue {
 	/* Round up to the nearest 64 IRQs so that the queue length
 	 * won't change when moving between 32 and 64 bit hosts.
 	 */
 	unsigned long queue[BITS_TO_LONGS((MAX_IRQ + 63) & ~63)];
 	int next;
 	int priority;
-} IRQQueue;
+};
 
-typedef struct IRQSource {
+struct irq_source {
 	uint32_t ivpr;		/* IRQ vector/priority register */
 	uint32_t idr;		/* IRQ destination register */
 	uint32_t destmask;	/* bitmap of CPU destinations */
 	int last_cpu;
 	int output;		/* IRQ level, e.g. OPENPIC_OUTPUT_INT */
 	int pending;		/* TRUE if IRQ is pending */
-	IRQType type;
+	enum irq_type type;
 	bool level:1;		/* level-triggered */
-	bool nomask:1;		/* critical interrupts ignore mask on some FSL MPICs */
-} IRQSource;
+	bool nomask:1;	/* critical interrupts ignore mask on some FSL MPICs */
+};
 
 #define IVPR_MASK_SHIFT       31
 #define IVPR_MASK_MASK        (1 << IVPR_MASK_SHIFT)
@@ -158,22 +157,19 @@ typedef struct IRQSource {
 #define IDR_EP      0x80000000	/* external pin */
 #define IDR_CI      0x40000000	/* critical interrupt */
 
-typedef struct IRQDest {
+struct irq_dest {
 	int32_t ctpr;		/* CPU current task priority */
-	IRQQueue raised;
-	IRQQueue servicing;
+	struct irq_queue raised;
+	struct irq_queue servicing;
 	qemu_irq *irqs;
 
 	/* Count of IRQ sources asserting on non-INT outputs */
 	uint32_t outputs_active[OPENPIC_OUTPUT_NB];
-} IRQDest;
-
-typedef struct OpenPICState {
-	SysBusDevice busdev;
-	MemoryRegion mem;
+};
 
+struct openpic {
 	/* Behavior control */
-	FslMpicInfo *fsl;
+	struct fsl_mpic_info *fsl;
 	uint32_t model;
 	uint32_t flags;
 	uint32_t nb_irqs;
@@ -186,9 +182,6 @@ typedef struct OpenPICState {
 	uint32_t brr1;
 	uint32_t mpic_mode_mask;
 
-	/* Sub-regions */
-	MemoryRegion sub_io_mem[6];
-
 	/* Global registers */
 	uint32_t frr;		/* Feature reporting register */
 	uint32_t gcr;		/* Global configuration register  */
@@ -196,9 +189,9 @@ typedef struct OpenPICState {
 	uint32_t spve;		/* Spurious vector register */
 	uint32_t tfrr;		/* Timer frequency reporting register */
 	/* Source registers */
-	IRQSource src[MAX_IRQ];
+	struct irq_source src[MAX_IRQ];
 	/* Local registers per output pin */
-	IRQDest dst[MAX_CPU];
+	struct irq_dest dst[MAX_CPU];
 	uint32_t nb_cpus;
 	/* Timer registers */
 	struct {
@@ -213,24 +206,24 @@ typedef struct OpenPICState {
 	uint32_t irq_ipi0;
 	uint32_t irq_tim0;
 	uint32_t irq_msi;
-} OpenPICState;
+};
 
-static inline void IRQ_setbit(IRQQueue * q, int n_IRQ)
+static inline void IRQ_setbit(struct irq_queue *q, int n_IRQ)
 {
 	set_bit(n_IRQ, q->queue);
 }
 
-static inline void IRQ_resetbit(IRQQueue * q, int n_IRQ)
+static inline void IRQ_resetbit(struct irq_queue *q, int n_IRQ)
 {
 	clear_bit(n_IRQ, q->queue);
 }
 
-static inline int IRQ_testbit(IRQQueue * q, int n_IRQ)
+static inline int IRQ_testbit(struct irq_queue *q, int n_IRQ)
 {
 	return test_bit(n_IRQ, q->queue);
 }
 
-static void IRQ_check(OpenPICState * opp, IRQQueue * q)
+static void IRQ_check(struct openpic *opp, struct irq_queue *q)
 {
 	int irq = -1;
 	int next = -1;
@@ -238,11 +231,10 @@ static void IRQ_check(OpenPICState * opp, IRQQueue * q)
 
 	for (;;) {
 		irq = find_next_bit(q->queue, opp->max_irq, irq + 1);
-		if (irq == opp->max_irq) {
+		if (irq == opp->max_irq)
 			break;
-		}
 
-		DPRINTF("IRQ_check: irq %d set ivpr_pr=%d pr=%d\n",
+		pr_debug("IRQ_check: irq %d set ivpr_pr=%d pr=%d\n",
 			irq, IVPR_PRIORITY(opp->src[irq].ivpr), priority);
 
 		if (IVPR_PRIORITY(opp->src[irq].ivpr) > priority) {
@@ -255,7 +247,7 @@ static void IRQ_check(OpenPICState * opp, IRQQueue * q)
 	q->priority = priority;
 }
 
-static int IRQ_get_next(OpenPICState * opp, IRQQueue * q)
+static int IRQ_get_next(struct openpic *opp, struct irq_queue *q)
 {
 	/* XXX: optimize */
 	IRQ_check(opp, q);
@@ -263,21 +255,21 @@ static int IRQ_get_next(OpenPICState * opp, IRQQueue * q)
 	return q->next;
 }
 
-static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
+static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 			   bool active, bool was_active)
 {
-	IRQDest *dst;
-	IRQSource *src;
+	struct irq_dest *dst;
+	struct irq_source *src;
 	int priority;
 
 	dst = &opp->dst[n_CPU];
 	src = &opp->src[n_IRQ];
 
-	DPRINTF("%s: IRQ %d active %d was %d\n",
+	pr_debug("%s: IRQ %d active %d was %d\n",
 		__func__, n_IRQ, active, was_active);
 
 	if (src->output != OPENPIC_OUTPUT_INT) {
-		DPRINTF("%s: output %d irq %d active %d was %d count %d\n",
+		pr_debug("%s: output %d irq %d active %d was %d count %d\n",
 			__func__, src->output, n_IRQ, active, was_active,
 			dst->outputs_active[src->output]);
 
@@ -286,19 +278,17 @@ static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
 		 * masking.
 		 */
 		if (active) {
-			if (!was_active
-			    && dst->outputs_active[src->output]++ == 0) {
-				DPRINTF
-				    ("%s: Raise OpenPIC output %d cpu %d irq %d\n",
-				     __func__, src->output, n_CPU, n_IRQ);
+			if (!was_active &&
+			    dst->outputs_active[src->output]++ == 0) {
+				pr_debug("%s: Raise OpenPIC output %d cpu %d irq %d\n",
+					__func__, src->output, n_CPU, n_IRQ);
 				qemu_irq_raise(dst->irqs[src->output]);
 			}
 		} else {
-			if (was_active
-			    && --dst->outputs_active[src->output] == 0) {
-				DPRINTF
-				    ("%s: Lower OpenPIC output %d cpu %d irq %d\n",
-				     __func__, src->output, n_CPU, n_IRQ);
+			if (was_active &&
+			    --dst->outputs_active[src->output] == 0) {
+				pr_debug("%s: Lower OpenPIC output %d cpu %d irq %d\n",
+					__func__, src->output, n_CPU, n_IRQ);
 				qemu_irq_lower(dst->irqs[src->output]);
 			}
 		}
@@ -311,31 +301,27 @@ static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
 	/* Even if the interrupt doesn't have enough priority,
 	 * it is still raised, in case ctpr is lowered later.
 	 */
-	if (active) {
+	if (active)
 		IRQ_setbit(&dst->raised, n_IRQ);
-	} else {
+	else
 		IRQ_resetbit(&dst->raised, n_IRQ);
-	}
 
 	IRQ_check(opp, &dst->raised);
 
 	if (active && priority <= dst->ctpr) {
-		DPRINTF
-		    ("%s: IRQ %d priority %d too low for ctpr %d on CPU %d\n",
-		     __func__, n_IRQ, priority, dst->ctpr, n_CPU);
+		pr_debug("%s: IRQ %d priority %d too low for ctpr %d on CPU %d\n",
+			__func__, n_IRQ, priority, dst->ctpr, n_CPU);
 		active = 0;
 	}
 
 	if (active) {
 		if (IRQ_get_next(opp, &dst->servicing) >= 0 &&
 		    priority <= dst->servicing.priority) {
-			DPRINTF
-			    ("%s: IRQ %d is hidden by servicing IRQ %d on CPU %d\n",
-			     __func__, n_IRQ, dst->servicing.next, n_CPU);
+			pr_debug("%s: IRQ %d is hidden by servicing IRQ %d on CPU %d\n",
+				__func__, n_IRQ, dst->servicing.next, n_CPU);
 		} else {
-			DPRINTF
-			    ("%s: Raise OpenPIC INT output cpu %d irq %d/%d\n",
-			     __func__, n_CPU, n_IRQ, dst->raised.next);
+			pr_debug("%s: Raise OpenPIC INT output cpu %d irq %d/%d\n",
+				__func__, n_CPU, n_IRQ, dst->raised.next);
 			qemu_irq_raise(opp->dst[n_CPU].
 				       irqs[OPENPIC_OUTPUT_INT]);
 		}
@@ -343,17 +329,15 @@ static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
 		IRQ_get_next(opp, &dst->servicing);
 		if (dst->raised.priority > dst->ctpr &&
 		    dst->raised.priority > dst->servicing.priority) {
-			DPRINTF
-			    ("%s: IRQ %d inactive, IRQ %d prio %d above %d/%d, CPU %d\n",
-			     __func__, n_IRQ, dst->raised.next,
-			     dst->raised.priority, dst->ctpr,
-			     dst->servicing.priority, n_CPU);
+			pr_debug("%s: IRQ %d inactive, IRQ %d prio %d above %d/%d, CPU %d\n",
+				__func__, n_IRQ, dst->raised.next,
+				dst->raised.priority, dst->ctpr,
+				dst->servicing.priority, n_CPU);
 			/* IRQ line stays asserted */
 		} else {
-			DPRINTF
-			    ("%s: IRQ %d inactive, current prio %d/%d, CPU %d\n",
-			     __func__, n_IRQ, dst->ctpr,
-			     dst->servicing.priority, n_CPU);
+			pr_debug("%s: IRQ %d inactive, current prio %d/%d, CPU %d\n",
+				__func__, n_IRQ, dst->ctpr,
+				dst->servicing.priority, n_CPU);
 			qemu_irq_lower(opp->dst[n_CPU].
 				       irqs[OPENPIC_OUTPUT_INT]);
 		}
@@ -361,9 +345,9 @@ static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
 }
 
 /* update pic state because registers for n_IRQ have changed value */
-static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
+static void openpic_update_irq(struct openpic *opp, int n_IRQ)
 {
-	IRQSource *src;
+	struct irq_source *src;
 	bool active, was_active;
 	int i;
 
@@ -372,30 +356,29 @@ static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
 
 	if ((src->ivpr & IVPR_MASK_MASK) && !src->nomask) {
 		/* Interrupt source is disabled */
-		DPRINTF("%s: IRQ %d is disabled\n", __func__, n_IRQ);
+		pr_debug("%s: IRQ %d is disabled\n", __func__, n_IRQ);
 		active = false;
 	}
 
-	was_active = ! !(src->ivpr & IVPR_ACTIVITY_MASK);
+	was_active = !!(src->ivpr & IVPR_ACTIVITY_MASK);
 
 	/*
 	 * We don't have a similar check for already-active because
 	 * ctpr may have changed and we need to withdraw the interrupt.
 	 */
 	if (!active && !was_active) {
-		DPRINTF("%s: IRQ %d is already inactive\n", __func__, n_IRQ);
+		pr_debug("%s: IRQ %d is already inactive\n", __func__, n_IRQ);
 		return;
 	}
 
-	if (active) {
+	if (active)
 		src->ivpr |= IVPR_ACTIVITY_MASK;
-	} else {
+	else
 		src->ivpr &= ~IVPR_ACTIVITY_MASK;
-	}
 
 	if (src->destmask == 0) {
 		/* No target */
-		DPRINTF("%s: IRQ %d has no target\n", __func__, n_IRQ);
+		pr_debug("%s: IRQ %d has no target\n", __func__, n_IRQ);
 		return;
 	}
 
@@ -413,9 +396,9 @@ static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
 	} else {
 		/* Distributed delivery mode */
 		for (i = src->last_cpu + 1; i != src->last_cpu; i++) {
-			if (i == opp->nb_cpus) {
+			if (i == opp->nb_cpus)
 				i = 0;
-			}
+
 			if (src->destmask & (1 << i)) {
 				IRQ_local_pipe(opp, i, n_IRQ, active,
 					       was_active);
@@ -428,16 +411,16 @@ static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
 
 static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 {
-	OpenPICState *opp = opaque;
-	IRQSource *src;
+	struct openpic *opp = opaque;
+	struct irq_source *src;
 
 	if (n_IRQ >= MAX_IRQ) {
-		fprintf(stderr, "%s: IRQ %d out of range\n", __func__, n_IRQ);
+		pr_err("%s: IRQ %d out of range\n", __func__, n_IRQ);
 		abort();
 	}
 
 	src = &opp->src[n_IRQ];
-	DPRINTF("openpic: set irq %d = %d ivpr=0x%08x\n",
+	pr_debug("openpic: set irq %d = %d ivpr=0x%08x\n",
 		n_IRQ, level, src->ivpr);
 	if (src->level) {
 		/* level-sensitive irq */
@@ -463,9 +446,9 @@ static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 	}
 }
 
-static void openpic_reset(DeviceState * d)
+static void openpic_reset(DeviceState *d)
 {
-	OpenPICState *opp = FROM_SYSBUS(typeof(*opp), SYS_BUS_DEVICE(d));
+	struct openpic *opp = FROM_SYSBUS(typeof(*opp), SYS_BUS_DEVICE(d));
 	int i;
 
 	opp->gcr = GCR_RESET;
@@ -485,7 +468,7 @@ static void openpic_reset(DeviceState * d)
 		switch (opp->src[i].type) {
 		case IRQ_TYPE_NORMAL:
 			opp->src[i].level =
-			    ! !(opp->ivpr_reset & IVPR_SENSE_MASK);
+			    !!(opp->ivpr_reset & IVPR_SENSE_MASK);
 			break;
 
 		case IRQ_TYPE_FSLINT:
@@ -499,9 +482,9 @@ static void openpic_reset(DeviceState * d)
 	/* Initialise IRQ destinations */
 	for (i = 0; i < MAX_CPU; i++) {
 		opp->dst[i].ctpr = 15;
-		memset(&opp->dst[i].raised, 0, sizeof(IRQQueue));
+		memset(&opp->dst[i].raised, 0, sizeof(struct irq_queue));
 		opp->dst[i].raised.next = -1;
-		memset(&opp->dst[i].servicing, 0, sizeof(IRQQueue));
+		memset(&opp->dst[i].servicing, 0, sizeof(struct irq_queue));
 		opp->dst[i].servicing.next = -1;
 	}
 	/* Initialise timers */
@@ -513,28 +496,28 @@ static void openpic_reset(DeviceState * d)
 	opp->gcr = 0;
 }
 
-static inline uint32_t read_IRQreg_idr(OpenPICState * opp, int n_IRQ)
+static inline uint32_t read_IRQreg_idr(struct openpic *opp, int n_IRQ)
 {
 	return opp->src[n_IRQ].idr;
 }
 
-static inline uint32_t read_IRQreg_ilr(OpenPICState * opp, int n_IRQ)
+static inline uint32_t read_IRQreg_ilr(struct openpic *opp, int n_IRQ)
 {
-	if (opp->flags & OPENPIC_FLAG_ILR) {
+	if (opp->flags & OPENPIC_FLAG_ILR)
 		return output_to_inttgt(opp->src[n_IRQ].output);
-	}
 
 	return 0xffffffff;
 }
 
-static inline uint32_t read_IRQreg_ivpr(OpenPICState * opp, int n_IRQ)
+static inline uint32_t read_IRQreg_ivpr(struct openpic *opp, int n_IRQ)
 {
 	return opp->src[n_IRQ].ivpr;
 }
 
-static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
+static inline void write_IRQreg_idr(struct openpic *opp, int n_IRQ,
+				    uint32_t val)
 {
-	IRQSource *src = &opp->src[n_IRQ];
+	struct irq_source *src = &opp->src[n_IRQ];
 	uint32_t normal_mask = (1UL << opp->nb_cpus) - 1;
 	uint32_t crit_mask = 0;
 	uint32_t mask = normal_mask;
@@ -547,14 +530,13 @@ static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
 	}
 
 	src->idr = val & mask;
-	DPRINTF("Set IDR %d to 0x%08x\n", n_IRQ, src->idr);
+	pr_debug("Set IDR %d to 0x%08x\n", n_IRQ, src->idr);
 
 	if (opp->flags & OPENPIC_FLAG_IDR_CRIT) {
 		if (src->idr & crit_mask) {
 			if (src->idr & normal_mask) {
-				DPRINTF
-				    ("%s: IRQ configured for multiple output types, using "
-				     "critical\n", __func__);
+				pr_debug("%s: IRQ configured for multiple output types, using critical\n",
+					__func__);
 			}
 
 			src->output = OPENPIC_OUTPUT_CINT;
@@ -564,9 +546,8 @@ static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
 			for (i = 0; i < opp->nb_cpus; i++) {
 				int n_ci = IDR_CI0_SHIFT - i;
 
-				if (src->idr & (1UL << n_ci)) {
+				if (src->idr & (1UL << n_ci))
 					src->destmask |= 1UL << i;
-				}
 			}
 		} else {
 			src->output = OPENPIC_OUTPUT_INT;
@@ -578,20 +559,21 @@ static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
 	}
 }
 
-static inline void write_IRQreg_ilr(OpenPICState * opp, int n_IRQ, uint32_t val)
+static inline void write_IRQreg_ilr(struct openpic *opp, int n_IRQ,
+				    uint32_t val)
 {
 	if (opp->flags & OPENPIC_FLAG_ILR) {
-		IRQSource *src = &opp->src[n_IRQ];
+		struct irq_source *src = &opp->src[n_IRQ];
 
 		src->output = inttgt_to_output(val & ILR_INTTGT_MASK);
-		DPRINTF("Set ILR %d to 0x%08x, output %d\n", n_IRQ, src->idr,
+		pr_debug("Set ILR %d to 0x%08x, output %d\n", n_IRQ, src->idr,
 			src->output);
 
 		/* TODO: on MPIC v4.0 only, set nomask for non-INT */
 	}
 }
 
-static inline void write_IRQreg_ivpr(OpenPICState * opp, int n_IRQ,
+static inline void write_IRQreg_ivpr(struct openpic *opp, int n_IRQ,
 				     uint32_t val)
 {
 	uint32_t mask;
@@ -613,7 +595,7 @@ static inline void write_IRQreg_ivpr(OpenPICState * opp, int n_IRQ,
 	switch (opp->src[n_IRQ].type) {
 	case IRQ_TYPE_NORMAL:
 		opp->src[n_IRQ].level =
-		    ! !(opp->src[n_IRQ].ivpr & IVPR_SENSE_MASK);
+		    !!(opp->src[n_IRQ].ivpr & IVPR_SENSE_MASK);
 		break;
 
 	case IRQ_TYPE_FSLINT:
@@ -626,11 +608,11 @@ static inline void write_IRQreg_ivpr(OpenPICState * opp, int n_IRQ,
 	}
 
 	openpic_update_irq(opp, n_IRQ);
-	DPRINTF("Set IVPR %d to 0x%08x -> 0x%08x\n", n_IRQ, val,
+	pr_debug("Set IVPR %d to 0x%08x -> 0x%08x\n", n_IRQ, val,
 		opp->src[n_IRQ].ivpr);
 }
 
-static void openpic_gcr_write(OpenPICState * opp, uint64_t val)
+static void openpic_gcr_write(struct openpic *opp, uint64_t val)
 {
 	bool mpic_proxy = false;
 
@@ -643,27 +625,26 @@ static void openpic_gcr_write(OpenPICState * opp, uint64_t val)
 	opp->gcr |= val & opp->mpic_mode_mask;
 
 	/* Set external proxy mode */
-	if ((val & opp->mpic_mode_mask) == GCR_MODE_PROXY) {
+	if ((val & opp->mpic_mode_mask) == GCR_MODE_PROXY)
 		mpic_proxy = true;
-	}
 
 	ppce500_set_mpic_proxy(mpic_proxy);
 }
 
-static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_gbl_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned len)
 {
-	OpenPICState *opp = opaque;
-	IRQDest *dst;
+	struct openpic *opp = opaque;
+	struct irq_dest *dst;
 	int idx;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
 		__func__, addr, val);
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return;
-	}
+
 	switch (addr) {
-	case 0x00:		/* Block Revision Register1 (BRR1) is Readonly */
+	case 0x00:	/* Block Revision Register1 (BRR1) is Readonly */
 		break;
 	case 0x40:
 	case 0x50:
@@ -685,16 +666,14 @@ static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
 	case 0x1090:		/* PIR */
 		for (idx = 0; idx < opp->nb_cpus; idx++) {
 			if ((val & (1 << idx)) && !(opp->pir & (1 << idx))) {
-				DPRINTF
-				    ("Raise OpenPIC RESET output for CPU %d\n",
-				     idx);
+				pr_debug("Raise OpenPIC RESET output for CPU %d\n",
+					idx);
 				dst = &opp->dst[idx];
 				qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_RESET]);
-			} else if (!(val & (1 << idx))
-				   && (opp->pir & (1 << idx))) {
-				DPRINTF
-				    ("Lower OpenPIC RESET output for CPU %d\n",
-				     idx);
+			} else if (!(val & (1 << idx)) &&
+				   (opp->pir & (1 << idx))) {
+				pr_debug("Lower OpenPIC RESET output for CPU %d\n",
+					idx);
 				dst = &opp->dst[idx];
 				qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_RESET]);
 			}
@@ -704,13 +683,12 @@ static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
 	case 0x10A0:		/* IPI_IVPR */
 	case 0x10B0:
 	case 0x10C0:
-	case 0x10D0:
-		{
-			int idx;
-			idx = (addr - 0x10A0) >> 4;
-			write_IRQreg_ivpr(opp, opp->irq_ipi0 + idx, val);
-		}
+	case 0x10D0: {
+		int idx;
+		idx = (addr - 0x10A0) >> 4;
+		write_IRQreg_ivpr(opp, opp->irq_ipi0 + idx, val);
 		break;
+	}
 	case 0x10E0:		/* SPVE */
 		opp->spve = val & opp->vector_mask;
 		break;
@@ -719,16 +697,16 @@ static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
 	}
 }
 
-static uint64_t openpic_gbl_read(void *opaque, hwaddr addr, unsigned len)
+static uint64_t openpic_gbl_read(void *opaque, gpa_t addr, unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	uint32_t retval;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
 	retval = 0xFFFFFFFF;
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return retval;
-	}
+
 	switch (addr) {
 	case 0x1000:		/* FRR */
 		retval = opp->frr;
@@ -772,24 +750,23 @@ static uint64_t openpic_gbl_read(void *opaque, hwaddr addr, unsigned len)
 	default:
 		break;
 	}
-	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
 	return retval;
 }
 
-static void openpic_tmr_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_tmr_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	int idx;
 
 	addr += 0x10f0;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
 		__func__, addr, val);
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return;
-	}
 
 	if (addr == 0x10f0) {
 		/* TFRR */
@@ -806,9 +783,9 @@ static void openpic_tmr_write(void *opaque, hwaddr addr, uint64_t val,
 	case 0x10:		/* TBCR */
 		if ((opp->timers[idx].tccr & TCCR_TOG) != 0 &&
 		    (val & TBCR_CI) == 0 &&
-		    (opp->timers[idx].tbcr & TBCR_CI) != 0) {
+		    (opp->timers[idx].tbcr & TBCR_CI) != 0)
 			opp->timers[idx].tccr &= ~TCCR_TOG;
-		}
+
 		opp->timers[idx].tbcr = val;
 		break;
 	case 0x20:		/* TVPR */
@@ -820,16 +797,16 @@ static void openpic_tmr_write(void *opaque, hwaddr addr, uint64_t val,
 	}
 }
 
-static uint64_t openpic_tmr_read(void *opaque, hwaddr addr, unsigned len)
+static uint64_t openpic_tmr_read(void *opaque, gpa_t addr, unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	uint32_t retval = -1;
 	int idx;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
-	if (addr & 0xF) {
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	if (addr & 0xF)
 		goto out;
-	}
+
 	idx = (addr >> 6) & 0x3;
 	if (addr == 0x0) {
 		/* TFRR */
@@ -852,18 +829,18 @@ static uint64_t openpic_tmr_read(void *opaque, hwaddr addr, unsigned len)
 	}
 
 out:
-	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
 	return retval;
 }
 
-static void openpic_src_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_src_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	int idx;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
 		__func__, addr, val);
 
 	addr = addr & 0xffff;
@@ -884,11 +861,11 @@ static void openpic_src_write(void *opaque, hwaddr addr, uint64_t val,
 
 static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	uint32_t retval;
 	int idx;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
 	retval = 0xFFFFFFFF;
 
 	addr = addr & 0xffff;
@@ -906,22 +883,21 @@ static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
 		break;
 	}
 
-	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
 	return retval;
 }
 
-static void openpic_msi_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_msi_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned size)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	int idx = opp->irq_msi;
 	int srs, ibs;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
 		__func__, addr, val);
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return;
-	}
 
 	switch (addr) {
 	case MSIIR_OFFSET:
@@ -937,16 +913,15 @@ static void openpic_msi_write(void *opaque, hwaddr addr, uint64_t val,
 	}
 }
 
-static uint64_t openpic_msi_read(void *opaque, hwaddr addr, unsigned size)
+static uint64_t openpic_msi_read(void *opaque, gpa_t addr, unsigned size)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	uint64_t r = 0;
 	int i, srs;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
-	if (addr & 0xF) {
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	if (addr & 0xF)
 		return -1;
-	}
 
 	srs = addr >> 4;
 
@@ -965,53 +940,51 @@ static uint64_t openpic_msi_read(void *opaque, hwaddr addr, unsigned size)
 		openpic_set_irq(opp, opp->irq_msi + srs, 0);
 		break;
 	case 0x120:		/* MSISR */
-		for (i = 0; i < MAX_MSI; i++) {
+		for (i = 0; i < MAX_MSI; i++)
 			r |= (opp->msi[i].msir ? 1 : 0) << i;
-		}
 		break;
 	}
 
 	return r;
 }
 
-static uint64_t openpic_summary_read(void *opaque, hwaddr addr, unsigned size)
+static uint64_t openpic_summary_read(void *opaque, gpa_t addr, unsigned size)
 {
 	uint64_t r = 0;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
 
 	/* TODO: EISR/EIMR */
 
 	return r;
 }
 
-static void openpic_summary_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_summary_write(void *opaque, gpa_t addr, uint64_t val,
 				  unsigned size)
 {
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
 		__func__, addr, val);
 
 	/* TODO: EISR/EIMR */
 }
 
-static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
+static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
 				       uint32_t val, int idx)
 {
-	OpenPICState *opp = opaque;
-	IRQSource *src;
-	IRQDest *dst;
+	struct openpic *opp = opaque;
+	struct irq_source *src;
+	struct irq_dest *dst;
 	int s_IRQ, n_IRQ;
 
-	DPRINTF("%s: cpu %d addr %#" HWADDR_PRIx " <= 0x%08x\n", __func__, idx,
+	pr_debug("%s: cpu %d addr %#" HWADDR_PRIx " <= 0x%08x\n", __func__, idx,
 		addr, val);
 
-	if (idx < 0) {
+	if (idx < 0)
 		return;
-	}
 
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return;
-	}
+
 	dst = &opp->dst[idx];
 	addr &= 0xFF0;
 	switch (addr) {
@@ -1028,17 +1001,16 @@ static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
 	case 0x80:		/* CTPR */
 		dst->ctpr = val & 0x0000000F;
 
-		DPRINTF("%s: set CPU %d ctpr to %d, raised %d servicing %d\n",
+		pr_debug("%s: set CPU %d ctpr to %d, raised %d servicing %d\n",
 			__func__, idx, dst->ctpr, dst->raised.priority,
 			dst->servicing.priority);
 
 		if (dst->raised.priority <= dst->ctpr) {
-			DPRINTF
-			    ("%s: Lower OpenPIC INT output cpu %d due to ctpr\n",
-			     __func__, idx);
+			pr_debug("%s: Lower OpenPIC INT output cpu %d due to ctpr\n",
+				__func__, idx);
 			qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
 		} else if (dst->raised.priority > dst->servicing.priority) {
-			DPRINTF("%s: Raise OpenPIC INT output cpu %d irq %d\n",
+			pr_debug("%s: Raise OpenPIC INT output cpu %d irq %d\n",
 				__func__, idx, dst->raised.next);
 			qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_INT]);
 		}
@@ -1051,11 +1023,11 @@ static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
 		/* Read-only register */
 		break;
 	case 0xB0:		/* EOI */
-		DPRINTF("EOI\n");
+		pr_debug("EOI\n");
 		s_IRQ = IRQ_get_next(opp, &dst->servicing);
 
 		if (s_IRQ < 0) {
-			DPRINTF("%s: EOI with no interrupt in service\n",
+			pr_debug("%s: EOI with no interrupt in service\n",
 				__func__);
 			break;
 		}
@@ -1069,7 +1041,7 @@ static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
 		if (n_IRQ != -1 &&
 		    (s_IRQ == -1 ||
 		     IVPR_PRIORITY(src->ivpr) > dst->servicing.priority)) {
-			DPRINTF("Raise OpenPIC INT output cpu %d irq %d\n",
+			pr_debug("Raise OpenPIC INT output cpu %d irq %d\n",
 				idx, n_IRQ);
 			qemu_irq_raise(opp->dst[idx].irqs[OPENPIC_OUTPUT_INT]);
 		}
@@ -1079,32 +1051,32 @@ static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
 	}
 }
 
-static void openpic_cpu_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_cpu_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned len)
 {
 	openpic_cpu_write_internal(opaque, addr, val, (addr & 0x1f000) >> 12);
 }
 
-static uint32_t openpic_iack(OpenPICState * opp, IRQDest * dst, int cpu)
+static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
+			     int cpu)
 {
-	IRQSource *src;
+	struct irq_source *src;
 	int retval, irq;
 
-	DPRINTF("Lower OpenPIC INT output\n");
+	pr_debug("Lower OpenPIC INT output\n");
 	qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
 
 	irq = IRQ_get_next(opp, &dst->raised);
-	DPRINTF("IACK: irq=%d\n", irq);
+	pr_debug("IACK: irq=%d\n", irq);
 
-	if (irq == -1) {
+	if (irq == -1)
 		/* No more interrupt pending */
 		return opp->spve;
-	}
 
 	src = &opp->src[irq];
 	if (!(src->ivpr & IVPR_ACTIVITY_MASK) ||
 	    !(IVPR_PRIORITY(src->ivpr) > dst->ctpr)) {
-		fprintf(stderr, "%s: bad raised IRQ %d ctpr %d ivpr 0x%08x\n",
+		pr_err("%s: bad raised IRQ %d ctpr %d ivpr 0x%08x\n",
 			__func__, irq, dst->ctpr, src->ivpr);
 		openpic_update_irq(opp, irq);
 		retval = opp->spve;
@@ -1135,22 +1107,21 @@ static uint32_t openpic_iack(OpenPICState * opp, IRQDest * dst, int cpu)
 	return retval;
 }
 
-static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx)
+static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx)
 {
-	OpenPICState *opp = opaque;
-	IRQDest *dst;
+	struct openpic *opp = opaque;
+	struct irq_dest *dst;
 	uint32_t retval;
 
-	DPRINTF("%s: cpu %d addr %#" HWADDR_PRIx "\n", __func__, idx, addr);
+	pr_debug("%s: cpu %d addr %#" HWADDR_PRIx "\n", __func__, idx, addr);
 	retval = 0xFFFFFFFF;
 
-	if (idx < 0) {
+	if (idx < 0)
 		return retval;
-	}
 
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return retval;
-	}
+
 	dst = &opp->dst[idx];
 	addr &= 0xFF0;
 	switch (addr) {
@@ -1169,54 +1140,54 @@ static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx)
 	default:
 		break;
 	}
-	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
 	return retval;
 }
 
-static uint64_t openpic_cpu_read(void *opaque, hwaddr addr, unsigned len)
+static uint64_t openpic_cpu_read(void *opaque, gpa_t addr, unsigned len)
 {
 	return openpic_cpu_read_internal(opaque, addr, (addr & 0x1f000) >> 12);
 }
 
-static const MemoryRegionOps openpic_glb_ops_be = {
+static const struct kvm_io_device_ops openpic_glb_ops_be = {
 	.write = openpic_gbl_write,
 	.read = openpic_gbl_read,
 };
 
-static const MemoryRegionOps openpic_tmr_ops_be = {
+static const struct kvm_io_device_ops openpic_tmr_ops_be = {
 	.write = openpic_tmr_write,
 	.read = openpic_tmr_read,
 };
 
-static const MemoryRegionOps openpic_cpu_ops_be = {
+static const struct kvm_io_device_ops openpic_cpu_ops_be = {
 	.write = openpic_cpu_write,
 	.read = openpic_cpu_read,
 };
 
-static const MemoryRegionOps openpic_src_ops_be = {
+static const struct kvm_io_device_ops openpic_src_ops_be = {
 	.write = openpic_src_write,
 	.read = openpic_src_read,
 };
 
-static const MemoryRegionOps openpic_msi_ops_be = {
+static const struct kvm_io_device_ops openpic_msi_ops_be = {
 	.read = openpic_msi_read,
 	.write = openpic_msi_write,
 };
 
-static const MemoryRegionOps openpic_summary_ops_be = {
+static const struct kvm_io_device_ops openpic_summary_ops_be = {
 	.read = openpic_summary_read,
 	.write = openpic_summary_write,
 };
 
-typedef struct MemReg {
+struct mem_reg {
 	const char *name;
-	MemoryRegionOps const *ops;
-	hwaddr start_addr;
-	ram_addr_t size;
-} MemReg;
+	const struct kvm_io_device_ops *ops;
+	gpa_t start_addr;
+	int size;
+};
 
-static void fsl_common_init(OpenPICState * opp)
+static void fsl_common_init(struct openpic *opp)
 {
 	int i;
 	int virq = MAX_SRC;
@@ -1239,9 +1210,8 @@ static void fsl_common_init(OpenPICState * opp)
 	opp->irq_msi = 224;
 
 	msi_supported = true;
-	for (i = 0; i < opp->fsl->max_ext; i++) {
+	for (i = 0; i < opp->fsl->max_ext; i++)
 		opp->src[i].level = false;
-	}
 
 	/* Internal interrupts, including message and MSI */
 	for (i = 16; i < MAX_SRC; i++) {
@@ -1256,7 +1226,8 @@ static void fsl_common_init(OpenPICState * opp)
 	}
 }
 
-static void map_list(OpenPICState * opp, const MemReg * list, int *count)
+static void map_list(struct openpic *opp, const struct mem_reg *list,
+		     int *count)
 {
 	while (list->name) {
 		assert(*count < ARRAY_SIZE(opp->sub_io_mem));
@@ -1272,12 +1243,12 @@ static void map_list(OpenPICState * opp, const MemReg * list, int *count)
 	}
 }
 
-static int openpic_init(SysBusDevice * dev)
+static int openpic_init(SysBusDevice *dev)
 {
-	OpenPICState *opp = FROM_SYSBUS(typeof(*opp), dev);
+	struct openpic *opp = FROM_SYSBUS(typeof(*opp), dev);
 	int i, j;
 	int list_count = 0;
-	static const MemReg list_le[] = {
+	static const struct mem_reg list_le[] = {
 		{"glb", &openpic_glb_ops_le,
 		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
 		{"tmr", &openpic_tmr_ops_le,
@@ -1288,7 +1259,7 @@ static int openpic_init(SysBusDevice * dev)
 		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
 		{NULL}
 	};
-	static const MemReg list_be[] = {
+	static const struct mem_reg list_be[] = {
 		{"glb", &openpic_glb_ops_be,
 		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
 		{"tmr", &openpic_tmr_ops_be,
@@ -1299,7 +1270,7 @@ static int openpic_init(SysBusDevice * dev)
 		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
 		{NULL}
 	};
-	static const MemReg list_fsl[] = {
+	static const struct mem_reg list_fsl[] = {
 		{"msi", &openpic_msi_ops_be,
 		 OPENPIC_MSI_REG_START, OPENPIC_MSI_REG_SIZE},
 		{"summary", &openpic_summary_ops_be,
-- 
1.7.9.5




* [RFC PATCH v2 4/6] kvm/ppc/mpic: adapt to kernel style and environment
@ 2013-04-01 22:47     ` Scott Wood
  0 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-01 22:47 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus, Scott Wood

Remove braces that Linux style doesn't permit, remove space after
'*' that Lindent added, keep error/debug strings contiguous, etc.

Substitute type names, debug prints, etc.

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
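
For readers skimming the diff, the conversions described above can be
illustrated with a minimal, self-contained sketch. It is not taken from
mpic.c: irq_source_demo, update_level, style_demo, SENSE_MASK,
CPU_REG_SIZE_OLD/CPU_REG_SIZE_NEW and the MAX_CPU value are hypothetical
stand-ins chosen only for illustration. It shows a plain struct tag in
place of a typedef, brace-free single-statement branches, "!!" without
an inner space, and why the size macro gains outer parentheses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CPU 32	/* hypothetical value, for illustration only */

/* Old form: the expansion is not protected by outer parentheses. */
#define CPU_REG_SIZE_OLD 0x100 + ((MAX_CPU - 1) * 0x1000)
/* New form: safe to use inside a larger expression. */
#define CPU_REG_SIZE_NEW (0x100 + ((MAX_CPU - 1) * 0x1000))

#define SENSE_MASK (1u << 22)	/* stand-in for a sense bit mask */

/* Kernel style: a plain struct tag rather than "typedef struct ... Name". */
struct irq_source_demo {
	uint32_t ivpr;
	bool level;
};

static int update_level(struct irq_source_demo *src)
{
	/* Single-statement branch: no braces. */
	if (!src)
		return -1;

	/* "!!" written without the space that Lindent inserted. */
	src->level = !!(src->ivpr & SENSE_MASK);
	return 0;
}

static void style_demo(void)
{
	struct irq_source_demo src = { .ivpr = SENSE_MASK, .level = false };

	assert(update_level(&src) == 0);
	assert(src.level);

	/* With outer parentheses the whole size is doubled. */
	assert(2 * CPU_REG_SIZE_NEW == 0x3e200);
	/* Without them only the first term (0x100) is doubled. */
	assert(2 * CPU_REG_SIZE_OLD == 0x1f200);
}
```

Calling style_demo() exercises both the brace-free early return and the
macro difference; the old macro form happens to be harmless wherever it
is used on its own, which is presumably why it went unnoticed.
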
 arch/powerpc/kvm/mpic.c |  445 ++++++++++++++++++++++-------------------------
 1 file changed, 208 insertions(+), 237 deletions(-)

diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index d6d70a4..1df67ae 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -42,22 +42,22 @@
 #define OPENPIC_TMR_REG_SIZE         0x220
 #define OPENPIC_MSI_REG_START        0x1600
 #define OPENPIC_MSI_REG_SIZE         0x200
-#define OPENPIC_SUMMARY_REG_START   0x3800
-#define OPENPIC_SUMMARY_REG_SIZE    0x800
+#define OPENPIC_SUMMARY_REG_START    0x3800
+#define OPENPIC_SUMMARY_REG_SIZE     0x800
 #define OPENPIC_SRC_REG_START        0x10000
 #define OPENPIC_SRC_REG_SIZE         (MAX_SRC * 0x20)
 #define OPENPIC_CPU_REG_START        0x20000
-#define OPENPIC_CPU_REG_SIZE         0x100 + ((MAX_CPU - 1) * 0x1000)
+#define OPENPIC_CPU_REG_SIZE         (0x100 + ((MAX_CPU - 1) * 0x1000))
 
-typedef struct FslMpicInfo {
+struct fsl_mpic_info {
 	int max_ext;
-} FslMpicInfo;
+};
 
-static FslMpicInfo fsl_mpic_20 = {
+static struct fsl_mpic_info fsl_mpic_20 = {
 	.max_ext = 12,
 };
 
-static FslMpicInfo fsl_mpic_42 = {
+static struct fsl_mpic_info fsl_mpic_42 = {
 	.max_ext = 12,
 };
 
@@ -100,44 +100,43 @@ static int get_current_cpu(void)
 {
 	CPUState *cpu_single_cpu;
 
-	if (!cpu_single_env) {
+	if (!cpu_single_env)
 		return -1;
-	}
 
 	cpu_single_cpu = ENV_GET_CPU(cpu_single_env);
 	return cpu_single_cpu->cpu_index;
 }
 
-static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx);
-static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
+static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx);
+static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
 				       uint32_t val, int idx);
 
-typedef enum IRQType {
+enum irq_type {
 	IRQ_TYPE_NORMAL = 0,
 	IRQ_TYPE_FSLINT,	/* FSL internal interrupt -- level only */
 	IRQ_TYPE_FSLSPECIAL,	/* FSL timer/IPI interrupt, edge, no polarity */
-} IRQType;
+};
 
-typedef struct IRQQueue {
+struct irq_queue {
 	/* Round up to the nearest 64 IRQs so that the queue length
 	 * won't change when moving between 32 and 64 bit hosts.
 	 */
 	unsigned long queue[BITS_TO_LONGS((MAX_IRQ + 63) & ~63)];
 	int next;
 	int priority;
-} IRQQueue;
+};
 
-typedef struct IRQSource {
+struct irq_source {
 	uint32_t ivpr;		/* IRQ vector/priority register */
 	uint32_t idr;		/* IRQ destination register */
 	uint32_t destmask;	/* bitmap of CPU destinations */
 	int last_cpu;
 	int output;		/* IRQ level, e.g. OPENPIC_OUTPUT_INT */
 	int pending;		/* TRUE if IRQ is pending */
-	IRQType type;
+	enum irq_type type;
 	bool level:1;		/* level-triggered */
-	bool nomask:1;		/* critical interrupts ignore mask on some FSL MPICs */
-} IRQSource;
+	bool nomask:1;	/* critical interrupts ignore mask on some FSL MPICs */
+};
 
 #define IVPR_MASK_SHIFT       31
 #define IVPR_MASK_MASK        (1 << IVPR_MASK_SHIFT)
@@ -158,22 +157,19 @@ typedef struct IRQSource {
 #define IDR_EP      0x80000000	/* external pin */
 #define IDR_CI      0x40000000	/* critical interrupt */
 
-typedef struct IRQDest {
+struct irq_dest {
 	int32_t ctpr;		/* CPU current task priority */
-	IRQQueue raised;
-	IRQQueue servicing;
+	struct irq_queue raised;
+	struct irq_queue servicing;
 	qemu_irq *irqs;
 
 	/* Count of IRQ sources asserting on non-INT outputs */
 	uint32_t outputs_active[OPENPIC_OUTPUT_NB];
-} IRQDest;
-
-typedef struct OpenPICState {
-	SysBusDevice busdev;
-	MemoryRegion mem;
+};
 
+struct openpic {
 	/* Behavior control */
-	FslMpicInfo *fsl;
+	struct fsl_mpic_info *fsl;
 	uint32_t model;
 	uint32_t flags;
 	uint32_t nb_irqs;
@@ -186,9 +182,6 @@ typedef struct OpenPICState {
 	uint32_t brr1;
 	uint32_t mpic_mode_mask;
 
-	/* Sub-regions */
-	MemoryRegion sub_io_mem[6];
-
 	/* Global registers */
 	uint32_t frr;		/* Feature reporting register */
 	uint32_t gcr;		/* Global configuration register  */
@@ -196,9 +189,9 @@ typedef struct OpenPICState {
 	uint32_t spve;		/* Spurious vector register */
 	uint32_t tfrr;		/* Timer frequency reporting register */
 	/* Source registers */
-	IRQSource src[MAX_IRQ];
+	struct irq_source src[MAX_IRQ];
 	/* Local registers per output pin */
-	IRQDest dst[MAX_CPU];
+	struct irq_dest dst[MAX_CPU];
 	uint32_t nb_cpus;
 	/* Timer registers */
 	struct {
@@ -213,24 +206,24 @@ typedef struct OpenPICState {
 	uint32_t irq_ipi0;
 	uint32_t irq_tim0;
 	uint32_t irq_msi;
-} OpenPICState;
+};
 
-static inline void IRQ_setbit(IRQQueue * q, int n_IRQ)
+static inline void IRQ_setbit(struct irq_queue *q, int n_IRQ)
 {
 	set_bit(n_IRQ, q->queue);
 }
 
-static inline void IRQ_resetbit(IRQQueue * q, int n_IRQ)
+static inline void IRQ_resetbit(struct irq_queue *q, int n_IRQ)
 {
 	clear_bit(n_IRQ, q->queue);
 }
 
-static inline int IRQ_testbit(IRQQueue * q, int n_IRQ)
+static inline int IRQ_testbit(struct irq_queue *q, int n_IRQ)
 {
 	return test_bit(n_IRQ, q->queue);
 }
 
-static void IRQ_check(OpenPICState * opp, IRQQueue * q)
+static void IRQ_check(struct openpic *opp, struct irq_queue *q)
 {
 	int irq = -1;
 	int next = -1;
@@ -238,11 +231,10 @@ static void IRQ_check(OpenPICState * opp, IRQQueue * q)
 
 	for (;;) {
 		irq = find_next_bit(q->queue, opp->max_irq, irq + 1);
-		if (irq == opp->max_irq) {
+		if (irq == opp->max_irq)
 			break;
-		}
 
-		DPRINTF("IRQ_check: irq %d set ivpr_pr=%d pr=%d\n",
+		pr_debug("IRQ_check: irq %d set ivpr_pr=%d pr=%d\n",
 			irq, IVPR_PRIORITY(opp->src[irq].ivpr), priority);
 
 		if (IVPR_PRIORITY(opp->src[irq].ivpr) > priority) {
@@ -255,7 +247,7 @@ static void IRQ_check(OpenPICState * opp, IRQQueue * q)
 	q->priority = priority;
 }
 
-static int IRQ_get_next(OpenPICState * opp, IRQQueue * q)
+static int IRQ_get_next(struct openpic *opp, struct irq_queue *q)
 {
 	/* XXX: optimize */
 	IRQ_check(opp, q);
@@ -263,21 +255,21 @@ static int IRQ_get_next(OpenPICState * opp, IRQQueue * q)
 	return q->next;
 }
 
-static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
+static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 			   bool active, bool was_active)
 {
-	IRQDest *dst;
-	IRQSource *src;
+	struct irq_dest *dst;
+	struct irq_source *src;
 	int priority;
 
 	dst = &opp->dst[n_CPU];
 	src = &opp->src[n_IRQ];
 
-	DPRINTF("%s: IRQ %d active %d was %d\n",
+	pr_debug("%s: IRQ %d active %d was %d\n",
 		__func__, n_IRQ, active, was_active);
 
 	if (src->output != OPENPIC_OUTPUT_INT) {
-		DPRINTF("%s: output %d irq %d active %d was %d count %d\n",
+		pr_debug("%s: output %d irq %d active %d was %d count %d\n",
 			__func__, src->output, n_IRQ, active, was_active,
 			dst->outputs_active[src->output]);
 
@@ -286,19 +278,17 @@ static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
 		 * masking.
 		 */
 		if (active) {
-			if (!was_active
-			    && dst->outputs_active[src->output]++ == 0) {
-				DPRINTF
-				    ("%s: Raise OpenPIC output %d cpu %d irq %d\n",
-				     __func__, src->output, n_CPU, n_IRQ);
+			if (!was_active &&
+			    dst->outputs_active[src->output]++ == 0) {
+				pr_debug("%s: Raise OpenPIC output %d cpu %d irq %d\n",
+					__func__, src->output, n_CPU, n_IRQ);
 				qemu_irq_raise(dst->irqs[src->output]);
 			}
 		} else {
-			if (was_active
-			    && --dst->outputs_active[src->output] == 0) {
-				DPRINTF
-				    ("%s: Lower OpenPIC output %d cpu %d irq %d\n",
-				     __func__, src->output, n_CPU, n_IRQ);
+			if (was_active &&
+			    --dst->outputs_active[src->output] == 0) {
+				pr_debug("%s: Lower OpenPIC output %d cpu %d irq %d\n",
+					__func__, src->output, n_CPU, n_IRQ);
 				qemu_irq_lower(dst->irqs[src->output]);
 			}
 		}
@@ -311,31 +301,27 @@ static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
 	/* Even if the interrupt doesn't have enough priority,
 	 * it is still raised, in case ctpr is lowered later.
 	 */
-	if (active) {
+	if (active)
 		IRQ_setbit(&dst->raised, n_IRQ);
-	} else {
+	else
 		IRQ_resetbit(&dst->raised, n_IRQ);
-	}
 
 	IRQ_check(opp, &dst->raised);
 
 	if (active && priority <= dst->ctpr) {
-		DPRINTF
-		    ("%s: IRQ %d priority %d too low for ctpr %d on CPU %d\n",
-		     __func__, n_IRQ, priority, dst->ctpr, n_CPU);
+		pr_debug("%s: IRQ %d priority %d too low for ctpr %d on CPU %d\n",
+			__func__, n_IRQ, priority, dst->ctpr, n_CPU);
 		active = 0;
 	}
 
 	if (active) {
 		if (IRQ_get_next(opp, &dst->servicing) >= 0 &&
 		    priority <= dst->servicing.priority) {
-			DPRINTF
-			    ("%s: IRQ %d is hidden by servicing IRQ %d on CPU %d\n",
-			     __func__, n_IRQ, dst->servicing.next, n_CPU);
+			pr_debug("%s: IRQ %d is hidden by servicing IRQ %d on CPU %d\n",
+				__func__, n_IRQ, dst->servicing.next, n_CPU);
 		} else {
-			DPRINTF
-			    ("%s: Raise OpenPIC INT output cpu %d irq %d/%d\n",
-			     __func__, n_CPU, n_IRQ, dst->raised.next);
+			pr_debug("%s: Raise OpenPIC INT output cpu %d irq %d/%d\n",
+				__func__, n_CPU, n_IRQ, dst->raised.next);
 			qemu_irq_raise(opp->dst[n_CPU].
 				       irqs[OPENPIC_OUTPUT_INT]);
 		}
@@ -343,17 +329,15 @@ static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
 		IRQ_get_next(opp, &dst->servicing);
 		if (dst->raised.priority > dst->ctpr &&
 		    dst->raised.priority > dst->servicing.priority) {
-			DPRINTF
-			    ("%s: IRQ %d inactive, IRQ %d prio %d above %d/%d, CPU %d\n",
-			     __func__, n_IRQ, dst->raised.next,
-			     dst->raised.priority, dst->ctpr,
-			     dst->servicing.priority, n_CPU);
+			pr_debug("%s: IRQ %d inactive, IRQ %d prio %d above %d/%d, CPU %d\n",
+				__func__, n_IRQ, dst->raised.next,
+				dst->raised.priority, dst->ctpr,
+				dst->servicing.priority, n_CPU);
 			/* IRQ line stays asserted */
 		} else {
-			DPRINTF
-			    ("%s: IRQ %d inactive, current prio %d/%d, CPU %d\n",
-			     __func__, n_IRQ, dst->ctpr,
-			     dst->servicing.priority, n_CPU);
+			pr_debug("%s: IRQ %d inactive, current prio %d/%d, CPU %d\n",
+				__func__, n_IRQ, dst->ctpr,
+				dst->servicing.priority, n_CPU);
 			qemu_irq_lower(opp->dst[n_CPU].
 				       irqs[OPENPIC_OUTPUT_INT]);
 		}
@@ -361,9 +345,9 @@ static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
 }
 
 /* update pic state because registers for n_IRQ have changed value */
-static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
+static void openpic_update_irq(struct openpic *opp, int n_IRQ)
 {
-	IRQSource *src;
+	struct irq_source *src;
 	bool active, was_active;
 	int i;
 
@@ -372,30 +356,29 @@ static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
 
 	if ((src->ivpr & IVPR_MASK_MASK) && !src->nomask) {
 		/* Interrupt source is disabled */
-		DPRINTF("%s: IRQ %d is disabled\n", __func__, n_IRQ);
+		pr_debug("%s: IRQ %d is disabled\n", __func__, n_IRQ);
 		active = false;
 	}
 
-	was_active = ! !(src->ivpr & IVPR_ACTIVITY_MASK);
+	was_active = !!(src->ivpr & IVPR_ACTIVITY_MASK);
 
 	/*
 	 * We don't have a similar check for already-active because
 	 * ctpr may have changed and we need to withdraw the interrupt.
 	 */
 	if (!active && !was_active) {
-		DPRINTF("%s: IRQ %d is already inactive\n", __func__, n_IRQ);
+		pr_debug("%s: IRQ %d is already inactive\n", __func__, n_IRQ);
 		return;
 	}
 
-	if (active) {
+	if (active)
 		src->ivpr |= IVPR_ACTIVITY_MASK;
-	} else {
+	else
 		src->ivpr &= ~IVPR_ACTIVITY_MASK;
-	}
 
 	if (src->destmask == 0) {
 		/* No target */
-		DPRINTF("%s: IRQ %d has no target\n", __func__, n_IRQ);
+		pr_debug("%s: IRQ %d has no target\n", __func__, n_IRQ);
 		return;
 	}
 
@@ -413,9 +396,9 @@ static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
 	} else {
 		/* Distributed delivery mode */
 		for (i = src->last_cpu + 1; i != src->last_cpu; i++) {
-			if (i == opp->nb_cpus) {
+			if (i == opp->nb_cpus)
 				i = 0;
-			}
+
 			if (src->destmask & (1 << i)) {
 				IRQ_local_pipe(opp, i, n_IRQ, active,
 					       was_active);
@@ -428,16 +411,16 @@ static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
 
 static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 {
-	OpenPICState *opp = opaque;
-	IRQSource *src;
+	struct openpic *opp = opaque;
+	struct irq_source *src;
 
 	if (n_IRQ >= MAX_IRQ) {
-		fprintf(stderr, "%s: IRQ %d out of range\n", __func__, n_IRQ);
+		pr_err("%s: IRQ %d out of range\n", __func__, n_IRQ);
 		abort();
 	}
 
 	src = &opp->src[n_IRQ];
-	DPRINTF("openpic: set irq %d = %d ivpr=0x%08x\n",
+	pr_debug("openpic: set irq %d = %d ivpr=0x%08x\n",
 		n_IRQ, level, src->ivpr);
 	if (src->level) {
 		/* level-sensitive irq */
@@ -463,9 +446,9 @@ static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 	}
 }
 
-static void openpic_reset(DeviceState * d)
+static void openpic_reset(DeviceState *d)
 {
-	OpenPICState *opp = FROM_SYSBUS(typeof(*opp), SYS_BUS_DEVICE(d));
+	struct openpic *opp = FROM_SYSBUS(typeof(*opp), SYS_BUS_DEVICE(d));
 	int i;
 
 	opp->gcr = GCR_RESET;
@@ -485,7 +468,7 @@ static void openpic_reset(DeviceState * d)
 		switch (opp->src[i].type) {
 		case IRQ_TYPE_NORMAL:
 			opp->src[i].level =
-			    ! !(opp->ivpr_reset & IVPR_SENSE_MASK);
+			    !!(opp->ivpr_reset & IVPR_SENSE_MASK);
 			break;
 
 		case IRQ_TYPE_FSLINT:
@@ -499,9 +482,9 @@ static void openpic_reset(DeviceState * d)
 	/* Initialise IRQ destinations */
 	for (i = 0; i < MAX_CPU; i++) {
 		opp->dst[i].ctpr = 15;
-		memset(&opp->dst[i].raised, 0, sizeof(IRQQueue));
+		memset(&opp->dst[i].raised, 0, sizeof(struct irq_queue));
 		opp->dst[i].raised.next = -1;
-		memset(&opp->dst[i].servicing, 0, sizeof(IRQQueue));
+		memset(&opp->dst[i].servicing, 0, sizeof(struct irq_queue));
 		opp->dst[i].servicing.next = -1;
 	}
 	/* Initialise timers */
@@ -513,28 +496,28 @@ static void openpic_reset(DeviceState * d)
 	opp->gcr = 0;
 }
 
-static inline uint32_t read_IRQreg_idr(OpenPICState * opp, int n_IRQ)
+static inline uint32_t read_IRQreg_idr(struct openpic *opp, int n_IRQ)
 {
 	return opp->src[n_IRQ].idr;
 }
 
-static inline uint32_t read_IRQreg_ilr(OpenPICState * opp, int n_IRQ)
+static inline uint32_t read_IRQreg_ilr(struct openpic *opp, int n_IRQ)
 {
-	if (opp->flags & OPENPIC_FLAG_ILR) {
+	if (opp->flags & OPENPIC_FLAG_ILR)
 		return output_to_inttgt(opp->src[n_IRQ].output);
-	}
 
 	return 0xffffffff;
 }
 
-static inline uint32_t read_IRQreg_ivpr(OpenPICState * opp, int n_IRQ)
+static inline uint32_t read_IRQreg_ivpr(struct openpic *opp, int n_IRQ)
 {
 	return opp->src[n_IRQ].ivpr;
 }
 
-static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
+static inline void write_IRQreg_idr(struct openpic *opp, int n_IRQ,
+				    uint32_t val)
 {
-	IRQSource *src = &opp->src[n_IRQ];
+	struct irq_source *src = &opp->src[n_IRQ];
 	uint32_t normal_mask = (1UL << opp->nb_cpus) - 1;
 	uint32_t crit_mask = 0;
 	uint32_t mask = normal_mask;
@@ -547,14 +530,13 @@ static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
 	}
 
 	src->idr = val & mask;
-	DPRINTF("Set IDR %d to 0x%08x\n", n_IRQ, src->idr);
+	pr_debug("Set IDR %d to 0x%08x\n", n_IRQ, src->idr);
 
 	if (opp->flags & OPENPIC_FLAG_IDR_CRIT) {
 		if (src->idr & crit_mask) {
 			if (src->idr & normal_mask) {
-				DPRINTF
-				    ("%s: IRQ configured for multiple output types, using "
-				     "critical\n", __func__);
+				pr_debug("%s: IRQ configured for multiple output types, using critical\n",
+					__func__);
 			}
 
 			src->output = OPENPIC_OUTPUT_CINT;
@@ -564,9 +546,8 @@ static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
 			for (i = 0; i < opp->nb_cpus; i++) {
 				int n_ci = IDR_CI0_SHIFT - i;
 
-				if (src->idr & (1UL << n_ci)) {
+				if (src->idr & (1UL << n_ci))
 					src->destmask |= 1UL << i;
-				}
 			}
 		} else {
 			src->output = OPENPIC_OUTPUT_INT;
@@ -578,20 +559,21 @@ static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
 	}
 }
 
-static inline void write_IRQreg_ilr(OpenPICState * opp, int n_IRQ, uint32_t val)
+static inline void write_IRQreg_ilr(struct openpic *opp, int n_IRQ,
+				    uint32_t val)
 {
 	if (opp->flags & OPENPIC_FLAG_ILR) {
-		IRQSource *src = &opp->src[n_IRQ];
+		struct irq_source *src = &opp->src[n_IRQ];
 
 		src->output = inttgt_to_output(val & ILR_INTTGT_MASK);
-		DPRINTF("Set ILR %d to 0x%08x, output %d\n", n_IRQ, src->idr,
+		pr_debug("Set ILR %d to 0x%08x, output %d\n", n_IRQ, src->idr,
 			src->output);
 
 		/* TODO: on MPIC v4.0 only, set nomask for non-INT */
 	}
 }
 
-static inline void write_IRQreg_ivpr(OpenPICState * opp, int n_IRQ,
+static inline void write_IRQreg_ivpr(struct openpic *opp, int n_IRQ,
 				     uint32_t val)
 {
 	uint32_t mask;
@@ -613,7 +595,7 @@ static inline void write_IRQreg_ivpr(OpenPICState * opp, int n_IRQ,
 	switch (opp->src[n_IRQ].type) {
 	case IRQ_TYPE_NORMAL:
 		opp->src[n_IRQ].level =
-		    ! !(opp->src[n_IRQ].ivpr & IVPR_SENSE_MASK);
+		    !!(opp->src[n_IRQ].ivpr & IVPR_SENSE_MASK);
 		break;
 
 	case IRQ_TYPE_FSLINT:
@@ -626,11 +608,11 @@ static inline void write_IRQreg_ivpr(OpenPICState * opp, int n_IRQ,
 	}
 
 	openpic_update_irq(opp, n_IRQ);
-	DPRINTF("Set IVPR %d to 0x%08x -> 0x%08x\n", n_IRQ, val,
+	pr_debug("Set IVPR %d to 0x%08x -> 0x%08x\n", n_IRQ, val,
 		opp->src[n_IRQ].ivpr);
 }
 
-static void openpic_gcr_write(OpenPICState * opp, uint64_t val)
+static void openpic_gcr_write(struct openpic *opp, uint64_t val)
 {
 	bool mpic_proxy = false;
 
@@ -643,27 +625,26 @@ static void openpic_gcr_write(OpenPICState * opp, uint64_t val)
 	opp->gcr |= val & opp->mpic_mode_mask;
 
 	/* Set external proxy mode */
-	if ((val & opp->mpic_mode_mask) == GCR_MODE_PROXY) {
+	if ((val & opp->mpic_mode_mask) == GCR_MODE_PROXY)
 		mpic_proxy = true;
-	}
 
 	ppce500_set_mpic_proxy(mpic_proxy);
 }
 
-static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_gbl_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned len)
 {
-	OpenPICState *opp = opaque;
-	IRQDest *dst;
+	struct openpic *opp = opaque;
+	struct irq_dest *dst;
 	int idx;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
 		__func__, addr, val);
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return;
-	}
+
 	switch (addr) {
-	case 0x00:		/* Block Revision Register1 (BRR1) is Readonly */
+	case 0x00:	/* Block Revision Register1 (BRR1) is Readonly */
 		break;
 	case 0x40:
 	case 0x50:
@@ -685,16 +666,14 @@ static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
 	case 0x1090:		/* PIR */
 		for (idx = 0; idx < opp->nb_cpus; idx++) {
 			if ((val & (1 << idx)) && !(opp->pir & (1 << idx))) {
-				DPRINTF
-				    ("Raise OpenPIC RESET output for CPU %d\n",
-				     idx);
+				pr_debug("Raise OpenPIC RESET output for CPU %d\n",
+					idx);
 				dst = &opp->dst[idx];
 				qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_RESET]);
-			} else if (!(val & (1 << idx))
-				   && (opp->pir & (1 << idx))) {
-				DPRINTF
-				    ("Lower OpenPIC RESET output for CPU %d\n",
-				     idx);
+			} else if (!(val & (1 << idx)) &&
+				   (opp->pir & (1 << idx))) {
+				pr_debug("Lower OpenPIC RESET output for CPU %d\n",
+					idx);
 				dst = &opp->dst[idx];
 				qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_RESET]);
 			}
@@ -704,13 +683,12 @@ static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
 	case 0x10A0:		/* IPI_IVPR */
 	case 0x10B0:
 	case 0x10C0:
-	case 0x10D0:
-		{
-			int idx;
-			idx = (addr - 0x10A0) >> 4;
-			write_IRQreg_ivpr(opp, opp->irq_ipi0 + idx, val);
-		}
+	case 0x10D0: {
+		int idx;
+		idx = (addr - 0x10A0) >> 4;
+		write_IRQreg_ivpr(opp, opp->irq_ipi0 + idx, val);
 		break;
+	}
 	case 0x10E0:		/* SPVE */
 		opp->spve = val & opp->vector_mask;
 		break;
@@ -719,16 +697,16 @@ static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
 	}
 }
 
-static uint64_t openpic_gbl_read(void *opaque, hwaddr addr, unsigned len)
+static uint64_t openpic_gbl_read(void *opaque, gpa_t addr, unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	uint32_t retval;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
 	retval = 0xFFFFFFFF;
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return retval;
-	}
+
 	switch (addr) {
 	case 0x1000:		/* FRR */
 		retval = opp->frr;
@@ -772,24 +750,23 @@ static uint64_t openpic_gbl_read(void *opaque, hwaddr addr, unsigned len)
 	default:
 		break;
 	}
-	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
 	return retval;
 }
 
-static void openpic_tmr_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_tmr_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	int idx;
 
 	addr += 0x10f0;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
 		__func__, addr, val);
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return;
-	}
 
 	if (addr == 0x10f0) {
 		/* TFRR */
@@ -806,9 +783,9 @@ static void openpic_tmr_write(void *opaque, hwaddr addr, uint64_t val,
 	case 0x10:		/* TBCR */
 		if ((opp->timers[idx].tccr & TCCR_TOG) != 0 &&
 		    (val & TBCR_CI) == 0 &&
-		    (opp->timers[idx].tbcr & TBCR_CI) != 0) {
+		    (opp->timers[idx].tbcr & TBCR_CI) != 0)
 			opp->timers[idx].tccr &= ~TCCR_TOG;
-		}
+
 		opp->timers[idx].tbcr = val;
 		break;
 	case 0x20:		/* TVPR */
@@ -820,16 +797,16 @@ static void openpic_tmr_write(void *opaque, hwaddr addr, uint64_t val,
 	}
 }
 
-static uint64_t openpic_tmr_read(void *opaque, hwaddr addr, unsigned len)
+static uint64_t openpic_tmr_read(void *opaque, gpa_t addr, unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	uint32_t retval = -1;
 	int idx;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
-	if (addr & 0xF) {
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	if (addr & 0xF)
 		goto out;
-	}
+
 	idx = (addr >> 6) & 0x3;
 	if (addr == 0x0) {
 		/* TFRR */
@@ -852,18 +829,18 @@ static uint64_t openpic_tmr_read(void *opaque, hwaddr addr, unsigned len)
 	}
 
 out:
-	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
 	return retval;
 }
 
-static void openpic_src_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_src_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	int idx;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
 		__func__, addr, val);
 
 	addr = addr & 0xffff;
@@ -884,11 +861,11 @@ static void openpic_src_write(void *opaque, hwaddr addr, uint64_t val,
 
 static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	uint32_t retval;
 	int idx;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
 	retval = 0xFFFFFFFF;
 
 	addr = addr & 0xffff;
@@ -906,22 +883,21 @@ static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
 		break;
 	}
 
-	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
 	return retval;
 }
 
-static void openpic_msi_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_msi_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned size)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	int idx = opp->irq_msi;
 	int srs, ibs;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
 		__func__, addr, val);
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return;
-	}
 
 	switch (addr) {
 	case MSIIR_OFFSET:
@@ -937,16 +913,15 @@ static void openpic_msi_write(void *opaque, hwaddr addr, uint64_t val,
 	}
 }
 
-static uint64_t openpic_msi_read(void *opaque, hwaddr addr, unsigned size)
+static uint64_t openpic_msi_read(void *opaque, gpa_t addr, unsigned size)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	uint64_t r = 0;
 	int i, srs;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
-	if (addr & 0xF) {
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	if (addr & 0xF)
 		return -1;
-	}
 
 	srs = addr >> 4;
 
@@ -965,53 +940,51 @@ static uint64_t openpic_msi_read(void *opaque, hwaddr addr, unsigned size)
 		openpic_set_irq(opp, opp->irq_msi + srs, 0);
 		break;
 	case 0x120:		/* MSISR */
-		for (i = 0; i < MAX_MSI; i++) {
+		for (i = 0; i < MAX_MSI; i++)
 			r |= (opp->msi[i].msir ? 1 : 0) << i;
-		}
 		break;
 	}
 
 	return r;
 }
 
-static uint64_t openpic_summary_read(void *opaque, hwaddr addr, unsigned size)
+static uint64_t openpic_summary_read(void *opaque, gpa_t addr, unsigned size)
 {
 	uint64_t r = 0;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
 
 	/* TODO: EISR/EIMR */
 
 	return r;
 }
 
-static void openpic_summary_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_summary_write(void *opaque, gpa_t addr, uint64_t val,
 				  unsigned size)
 {
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
 		__func__, addr, val);
 
 	/* TODO: EISR/EIMR */
 }
 
-static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
+static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
 				       uint32_t val, int idx)
 {
-	OpenPICState *opp = opaque;
-	IRQSource *src;
-	IRQDest *dst;
+	struct openpic *opp = opaque;
+	struct irq_source *src;
+	struct irq_dest *dst;
 	int s_IRQ, n_IRQ;
 
-	DPRINTF("%s: cpu %d addr %#" HWADDR_PRIx " <= 0x%08x\n", __func__, idx,
+	pr_debug("%s: cpu %d addr %#" HWADDR_PRIx " <= 0x%08x\n", __func__, idx,
 		addr, val);
 
-	if (idx < 0) {
+	if (idx < 0)
 		return;
-	}
 
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return;
-	}
+
 	dst = &opp->dst[idx];
 	addr &= 0xFF0;
 	switch (addr) {
@@ -1028,17 +1001,16 @@ static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
 	case 0x80:		/* CTPR */
 		dst->ctpr = val & 0x0000000F;
 
-		DPRINTF("%s: set CPU %d ctpr to %d, raised %d servicing %d\n",
+		pr_debug("%s: set CPU %d ctpr to %d, raised %d servicing %d\n",
 			__func__, idx, dst->ctpr, dst->raised.priority,
 			dst->servicing.priority);
 
 		if (dst->raised.priority <= dst->ctpr) {
-			DPRINTF
-			    ("%s: Lower OpenPIC INT output cpu %d due to ctpr\n",
-			     __func__, idx);
+			pr_debug("%s: Lower OpenPIC INT output cpu %d due to ctpr\n",
+				__func__, idx);
 			qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
 		} else if (dst->raised.priority > dst->servicing.priority) {
-			DPRINTF("%s: Raise OpenPIC INT output cpu %d irq %d\n",
+			pr_debug("%s: Raise OpenPIC INT output cpu %d irq %d\n",
 				__func__, idx, dst->raised.next);
 			qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_INT]);
 		}
@@ -1051,11 +1023,11 @@ static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
 		/* Read-only register */
 		break;
 	case 0xB0:		/* EOI */
-		DPRINTF("EOI\n");
+		pr_debug("EOI\n");
 		s_IRQ = IRQ_get_next(opp, &dst->servicing);
 
 		if (s_IRQ < 0) {
-			DPRINTF("%s: EOI with no interrupt in service\n",
+			pr_debug("%s: EOI with no interrupt in service\n",
 				__func__);
 			break;
 		}
@@ -1069,7 +1041,7 @@ static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
 		if (n_IRQ != -1 &&
 		    (s_IRQ == -1 ||
 		     IVPR_PRIORITY(src->ivpr) > dst->servicing.priority)) {
-			DPRINTF("Raise OpenPIC INT output cpu %d irq %d\n",
+			pr_debug("Raise OpenPIC INT output cpu %d irq %d\n",
 				idx, n_IRQ);
 			qemu_irq_raise(opp->dst[idx].irqs[OPENPIC_OUTPUT_INT]);
 		}
@@ -1079,32 +1051,32 @@ static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
 	}
 }
 
-static void openpic_cpu_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_cpu_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned len)
 {
 	openpic_cpu_write_internal(opaque, addr, val, (addr & 0x1f000) >> 12);
 }
 
-static uint32_t openpic_iack(OpenPICState * opp, IRQDest * dst, int cpu)
+static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
+			     int cpu)
 {
-	IRQSource *src;
+	struct irq_source *src;
 	int retval, irq;
 
-	DPRINTF("Lower OpenPIC INT output\n");
+	pr_debug("Lower OpenPIC INT output\n");
 	qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
 
 	irq = IRQ_get_next(opp, &dst->raised);
-	DPRINTF("IACK: irq=%d\n", irq);
+	pr_debug("IACK: irq=%d\n", irq);
 
-	if (irq == -1) {
+	if (irq == -1)
 		/* No more interrupt pending */
 		return opp->spve;
-	}
 
 	src = &opp->src[irq];
 	if (!(src->ivpr & IVPR_ACTIVITY_MASK) ||
 	    !(IVPR_PRIORITY(src->ivpr) > dst->ctpr)) {
-		fprintf(stderr, "%s: bad raised IRQ %d ctpr %d ivpr 0x%08x\n",
+		pr_err("%s: bad raised IRQ %d ctpr %d ivpr 0x%08x\n",
 			__func__, irq, dst->ctpr, src->ivpr);
 		openpic_update_irq(opp, irq);
 		retval = opp->spve;
@@ -1135,22 +1107,21 @@ static uint32_t openpic_iack(OpenPICState * opp, IRQDest * dst, int cpu)
 	return retval;
 }
 
-static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx)
+static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx)
 {
-	OpenPICState *opp = opaque;
-	IRQDest *dst;
+	struct openpic *opp = opaque;
+	struct irq_dest *dst;
 	uint32_t retval;
 
-	DPRINTF("%s: cpu %d addr %#" HWADDR_PRIx "\n", __func__, idx, addr);
+	pr_debug("%s: cpu %d addr %#" HWADDR_PRIx "\n", __func__, idx, addr);
 	retval = 0xFFFFFFFF;
 
-	if (idx < 0) {
+	if (idx < 0)
 		return retval;
-	}
 
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return retval;
-	}
+
 	dst = &opp->dst[idx];
 	addr &= 0xFF0;
 	switch (addr) {
@@ -1169,54 +1140,54 @@ static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx)
 	default:
 		break;
 	}
-	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
 	return retval;
 }
 
-static uint64_t openpic_cpu_read(void *opaque, hwaddr addr, unsigned len)
+static uint64_t openpic_cpu_read(void *opaque, gpa_t addr, unsigned len)
 {
 	return openpic_cpu_read_internal(opaque, addr, (addr & 0x1f000) >> 12);
 }
 
-static const MemoryRegionOps openpic_glb_ops_be = {
+static const struct kvm_io_device_ops openpic_glb_ops_be = {
 	.write = openpic_gbl_write,
 	.read = openpic_gbl_read,
 };
 
-static const MemoryRegionOps openpic_tmr_ops_be = {
+static const struct kvm_io_device_ops openpic_tmr_ops_be = {
 	.write = openpic_tmr_write,
 	.read = openpic_tmr_read,
 };
 
-static const MemoryRegionOps openpic_cpu_ops_be = {
+static const struct kvm_io_device_ops openpic_cpu_ops_be = {
 	.write = openpic_cpu_write,
 	.read = openpic_cpu_read,
 };
 
-static const MemoryRegionOps openpic_src_ops_be = {
+static const struct kvm_io_device_ops openpic_src_ops_be = {
 	.write = openpic_src_write,
 	.read = openpic_src_read,
 };
 
-static const MemoryRegionOps openpic_msi_ops_be = {
+static const struct kvm_io_device_ops openpic_msi_ops_be = {
 	.read = openpic_msi_read,
 	.write = openpic_msi_write,
 };
 
-static const MemoryRegionOps openpic_summary_ops_be = {
+static const struct kvm_io_device_ops openpic_summary_ops_be = {
 	.read = openpic_summary_read,
 	.write = openpic_summary_write,
 };
 
-typedef struct MemReg {
+struct mem_reg {
 	const char *name;
-	MemoryRegionOps const *ops;
-	hwaddr start_addr;
-	ram_addr_t size;
-} MemReg;
+	const struct kvm_io_device_ops *ops;
+	gpa_t start_addr;
+	int size;
+};
 
-static void fsl_common_init(OpenPICState * opp)
+static void fsl_common_init(struct openpic *opp)
 {
 	int i;
 	int virq = MAX_SRC;
@@ -1239,9 +1210,8 @@ static void fsl_common_init(OpenPICState * opp)
 	opp->irq_msi = 224;
 
 	msi_supported = true;
-	for (i = 0; i < opp->fsl->max_ext; i++) {
+	for (i = 0; i < opp->fsl->max_ext; i++)
 		opp->src[i].level = false;
-	}
 
 	/* Internal interrupts, including message and MSI */
 	for (i = 16; i < MAX_SRC; i++) {
@@ -1256,7 +1226,8 @@ static void fsl_common_init(OpenPICState * opp)
 	}
 }
 
-static void map_list(OpenPICState * opp, const MemReg * list, int *count)
+static void map_list(struct openpic *opp, const struct mem_reg *list,
+		     int *count)
 {
 	while (list->name) {
 		assert(*count < ARRAY_SIZE(opp->sub_io_mem));
@@ -1272,12 +1243,12 @@ static void map_list(OpenPICState * opp, const MemReg * list, int *count)
 	}
 }
 
-static int openpic_init(SysBusDevice * dev)
+static int openpic_init(SysBusDevice *dev)
 {
-	OpenPICState *opp = FROM_SYSBUS(typeof(*opp), dev);
+	struct openpic *opp = FROM_SYSBUS(typeof(*opp), dev);
 	int i, j;
 	int list_count = 0;
-	static const MemReg list_le[] = {
+	static const struct mem_reg list_le[] = {
 		{"glb", &openpic_glb_ops_le,
 		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
 		{"tmr", &openpic_tmr_ops_le,
@@ -1288,7 +1259,7 @@ static int openpic_init(SysBusDevice * dev)
 		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
 		{NULL}
 	};
-	static const MemReg list_be[] = {
+	static const struct mem_reg list_be[] = {
 		{"glb", &openpic_glb_ops_be,
 		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
 		{"tmr", &openpic_tmr_ops_be,
@@ -1299,7 +1270,7 @@ static int openpic_init(SysBusDevice * dev)
 		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
 		{NULL}
 	};
-	static const MemReg list_fsl[] = {
+	static const struct mem_reg list_fsl[] = {
 		{"msi", &openpic_msi_ops_be,
 		 OPENPIC_MSI_REG_START, OPENPIC_MSI_REG_SIZE},
 		{"summary", &openpic_summary_ops_be,
-- 
1.7.9.5




* [RFC PATCH v2 5/6] kvm/ppc/mpic: in-kernel MPIC emulation
  2013-04-01 22:47   ` Scott Wood
@ 2013-04-01 22:47     ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-01 22:47 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus, Scott Wood

Hook the MPIC code up to the KVM interfaces, add locking, etc.

TODO: irqfd support, split up into multiple patches, KVM_IRQ_LINE
support

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
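Note on the attribute conventions in mpic.txt: the sketch below is only
an illustration for userspace callers, not code from this series. It
shows how an IRQ number for KVM_DEV_MPIC_GRP_IRQ_ACTIVE could be derived
from an IVPR offset, and how a value for KVM_DEV_MPIC_BASE_ADDR could be
checked for natural alignment. The helper names and the MPIC_* macros are
hypothetical; only the 256 KiB size and the divide-by-32 rule come from
the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Size of the MPIC register space (OPENPIC_REG_SIZE in this patch). */
#define MPIC_REG_SIZE   0x40000u

/* Spacing between per-source register blocks, per mpic.txt. */
#define MPIC_SRC_STRIDE 32u

/*
 * Hypothetical helper: IRQ number to pass as "attr", given the byte
 * offset of the source's IVPR from EIVPR0.
 */
static inline uint32_t mpic_irq_from_ivpr_offset(uint32_t ivpr_offset)
{
	/* An IVPR always sits at the start of a 32-byte source block. */
	assert(ivpr_offset % MPIC_SRC_STRIDE == 0);
	return ivpr_offset / MPIC_SRC_STRIDE;
}

/*
 * Hypothetical helper: true if base is naturally aligned to the
 * register space size.  Zero passes, since it disables the mapping.
 */
static inline bool mpic_base_addr_ok(uint64_t base)
{
	return (base & (MPIC_REG_SIZE - 1)) == 0;
}
```

For example, an IVPR at offset 0x200 from EIVPR0 corresponds to IRQ 16,
and a base of 0xfff40000 is aligned while 0xfff41000 is not.
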
 Documentation/virtual/kvm/devices/mpic.txt |   37 ++
 arch/powerpc/include/asm/kvm_host.h        |    8 +-
 arch/powerpc/include/asm/kvm_ppc.h         |    4 +
 arch/powerpc/kvm/Kconfig                   |    5 +
 arch/powerpc/kvm/Makefile                  |    2 +
 arch/powerpc/kvm/booke.c                   |   10 +-
 arch/powerpc/kvm/mpic.c                    |  816 +++++++++++++++++++++-------
 arch/powerpc/kvm/powerpc.c                 |   12 +-
 include/linux/kvm_host.h                   |    2 +
 include/uapi/linux/kvm.h                   |    9 +
 virt/kvm/kvm_main.c                        |    9 +
 11 files changed, 713 insertions(+), 201 deletions(-)
 create mode 100644 Documentation/virtual/kvm/devices/mpic.txt
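
Note on epr_flags: the booke.c hunk in this patch checks
KVMPPC_EPR_USER before KVMPPC_EPR_KERNEL, so the userspace path wins
when both bits are set. A standalone sketch of that decision follows.
The three flag values are copied from the kvm_host.h hunk; the enum and
function names are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Values as defined in the kvm_host.h hunk of this patch. */
#define KVMPPC_EPR_NONE   0 /* EPR not supported */
#define KVMPPC_EPR_USER   1 /* exit to userspace to fill EPR */
#define KVMPPC_EPR_KERNEL 2 /* in-kernel irqchip */

/* Hypothetical result type, not part of the patch. */
enum epr_action {
	EPR_DO_NOTHING,
	EPR_EXIT_TO_USER,	/* stands in for KVM_REQ_EPR_EXIT */
	EPR_ASK_MPIC,		/* stands in for kvmppc_mpic_set_epr() */
};

/* Mirrors the if/else-if order used on external interrupt delivery. */
static inline enum epr_action epr_pick_action(uint8_t epr_flags)
{
	/* Only the two known bits are expected to be set. */
	assert(epr_flags <= (KVMPPC_EPR_USER | KVMPPC_EPR_KERNEL));

	if (epr_flags & KVMPPC_EPR_USER)
		return EPR_EXIT_TO_USER;
	if (epr_flags & KVMPPC_EPR_KERNEL)
		return EPR_ASK_MPIC;
	return EPR_DO_NOTHING;
}
```

With both bits set (value 3) the sketch returns EPR_EXIT_TO_USER, which
matches the "USER takes precedence" comment in kvm_host.h.
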

diff --git a/Documentation/virtual/kvm/devices/mpic.txt b/Documentation/virtual/kvm/devices/mpic.txt
new file mode 100644
index 0000000..79e000a
--- /dev/null
+++ b/Documentation/virtual/kvm/devices/mpic.txt
@@ -0,0 +1,37 @@
+MPIC interrupt controller
+=========================
+
+Device types supported:
+  KVM_DEV_TYPE_FSL_MPIC_20     Freescale MPIC v2.0
+  KVM_DEV_TYPE_FSL_MPIC_42     Freescale MPIC v4.2
+
+Only one MPIC instance, of any type, may be instantiated.  The created
+MPIC will act as the system interrupt controller, connecting to each
+vcpu's interrupt inputs.
+
+Groups:
+  KVM_DEV_MPIC_GRP_MISC
+  Attributes:
+    KVM_DEV_MPIC_BASE_ADDR (rw, 64-bit)
+      Base address of the 256 KiB MPIC register space.  Must be
+      naturally aligned.  A value of zero disables the mapping.
+      Reset value is zero.
+
+  KVM_DEV_MPIC_GRP_REGISTER (rw, 32-bit)
+    Access an MPIC register, as if the access were made from the guest.
+    "attr" is the byte offset into the MPIC register space.  Accesses
+    must be 4-byte aligned.
+
+    MSIs may be signaled by using this attribute group to write
+    to the relevant MSIIR.
+
+  KVM_DEV_MPIC_GRP_IRQ_ACTIVE (rw, 32-bit)
+    IRQ input line for each standard openpic source.  0 is inactive and 1
+    is active, regardless of interrupt sense.
+
+    For edge-triggered interrupts:  Writing 1 is considered an activating
+    edge, and writing 0 is ignored.  Reading returns 1 if a previously
+    signaled edge has not been acknowledged, and 0 otherwise.
+
+    "attr" is the IRQ number.  IRQ numbers for standard sources are the
+    byte offset of the relevant IVPR from EIVPR0, divided by 32.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index e0caae2..6713327 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -359,6 +359,11 @@ struct kvmppc_slb {
 #define KVMPPC_BOOKE_MAX_IAC	4
 #define KVMPPC_BOOKE_MAX_DAC	2
 
+/* KVMPPC_EPR_USER takes precedence over KVMPPC_EPR_KERNEL */
+#define KVMPPC_EPR_NONE		0 /* EPR not supported */
+#define KVMPPC_EPR_USER		1 /* exit to userspace to fill EPR */
+#define KVMPPC_EPR_KERNEL	2 /* in-kernel irqchip */
+
 struct kvmppc_booke_debug_reg {
 	u32 dbcr0;
 	u32 dbcr1;
@@ -525,7 +530,7 @@ struct kvm_vcpu_arch {
 	u8 sane;
 	u8 cpu_type;
 	u8 hcall_needed;
-	u8 epr_enabled;
+	u8 epr_flags; /* KVMPPC_EPR_xxx */
 	u8 epr_needed;
 
 	u32 cpr0_cfgaddr; /* holds the last set cpr0_cfgaddr */
@@ -595,5 +600,6 @@ struct kvm_vcpu_arch {
 #define KVM_MMIO_REG_FQPR	0x0060
 
 #define __KVM_HAVE_ARCH_WQP
+#define __KVM_HAVE_CREATE_DEVICE
 
 #endif /* __POWERPC_KVM_HOST_H__ */
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index f44932c..20b2a5e 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -164,6 +164,8 @@ extern int kvmppc_prepare_to_enter(struct kvm_vcpu *vcpu);
 
 extern int kvm_vm_ioctl_get_htab_fd(struct kvm *kvm, struct kvm_get_htab_fd *);
 
+int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu, struct kvm_interrupt *irq);
+
 /*
  * Cuts out inst bits with ordering according to spec.
  * That means the leftmost bit is zero. All given bits are included.
@@ -270,6 +272,8 @@ static inline void kvmppc_set_epr(struct kvm_vcpu *vcpu, u32 epr)
 #endif
 }
 
+void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu);
+
 int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
 			      struct kvm_config_tlb *cfg);
 int kvm_vcpu_ioctl_dirty_tlb(struct kvm_vcpu *vcpu,
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 63c67ec..a87139b 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -151,6 +151,11 @@ config KVM_E500MC
 
 	  If unsure, say N.
 
+config KVM_MPIC
+	bool "KVM in-kernel MPIC emulation"
+	depends on KVM
+
+
 source drivers/vhost/Kconfig
 
 endif # VIRTUALIZATION
diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
index b772ede..4a2277a 100644
--- a/arch/powerpc/kvm/Makefile
+++ b/arch/powerpc/kvm/Makefile
@@ -103,6 +103,8 @@ kvm-book3s_32-objs := \
 	book3s_32_mmu.o
 kvm-objs-$(CONFIG_KVM_BOOK3S_32) := $(kvm-book3s_32-objs)
 
+kvm-objs-$(CONFIG_KVM_MPIC) += mpic.o
+
 kvm-objs := $(kvm-objs-m) $(kvm-objs-y)
 
 obj-$(CONFIG_KVM_440) += kvm.o
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 58057d6..cddc6b3 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -346,7 +346,7 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu *vcpu,
 		keep_irq = true;
 	}
 
-	if ((priority == BOOKE_IRQPRIO_EXTERNAL) && vcpu->arch.epr_enabled)
+	if ((priority == BOOKE_IRQPRIO_EXTERNAL) && vcpu->arch.epr_flags)
 		update_epr = true;
 
 	switch (priority) {
@@ -427,8 +427,12 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu *vcpu,
 			set_guest_esr(vcpu, vcpu->arch.queued_esr);
 		if (update_dear == true)
 			set_guest_dear(vcpu, vcpu->arch.queued_dear);
-		if (update_epr == true)
-			kvm_make_request(KVM_REQ_EPR_EXIT, vcpu);
+		if (update_epr == true) {
+			if (vcpu->arch.epr_flags & KVMPPC_EPR_USER)
+				kvm_make_request(KVM_REQ_EPR_EXIT, vcpu);
+			else if (vcpu->arch.epr_flags & KVMPPC_EPR_KERNEL)
+				kvmppc_mpic_set_epr(vcpu);
+		}
 
 		new_msr &= msr_mask;
 #if defined(CONFIG_64BIT)
diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index 1df67ae..9aace50 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -23,6 +23,19 @@
  * THE SOFTWARE.
  */
 
+#include <linux/slab.h>
+#include <linux/mutex.h>
+#include <linux/kvm_host.h>
+#include <linux/errno.h>
+#include <linux/fs.h>
+#include <linux/anon_inodes.h>
+#include <asm/uaccess.h>
+#include <asm/mpic.h>
+#include <asm/kvm_para.h>
+#include <asm/kvm_host.h>
+#include <asm/kvm_ppc.h>
+#include "iodev.h"
+
 #define MAX_CPU     32
 #define MAX_SRC     256
 #define MAX_TMR     4
@@ -36,6 +49,7 @@
 #define OPENPIC_FLAG_ILR          (2 << 0)
 
 /* OpenPIC address map */
+#define OPENPIC_REG_SIZE             0x40000
 #define OPENPIC_GLB_REG_START        0x0
 #define OPENPIC_GLB_REG_SIZE         0x10F0
 #define OPENPIC_TMR_REG_START        0x10F0
@@ -89,6 +103,7 @@ static struct fsl_mpic_info fsl_mpic_42 = {
 #define ILR_INTTGT_INT    0x00
 #define ILR_INTTGT_CINT   0x01	/* critical */
 #define ILR_INTTGT_MCP    0x02	/* machine check */
+#define NUM_OUTPUTS       3
 
 #define MSIIR_OFFSET       0x140
 #define MSIIR_SRS_SHIFT    29
@@ -98,18 +113,14 @@ static struct fsl_mpic_info fsl_mpic_42 = {
 
 static int get_current_cpu(void)
 {
-	CPUState *cpu_single_cpu;
-
-	if (!cpu_single_env)
-		return -1;
-
-	cpu_single_cpu = ENV_GET_CPU(cpu_single_env);
-	return cpu_single_cpu->cpu_index;
+	struct kvm_vcpu *vcpu = current->thread.kvm_vcpu;
+	return vcpu ? vcpu->vcpu_id : -1;
 }
 
-static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx);
-static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
-				       uint32_t val, int idx);
+static int openpic_cpu_write_internal(void *opaque, gpa_t addr,
+				      u32 val, int idx);
+static int openpic_cpu_read_internal(void *opaque, gpa_t addr,
+				     u32 *ptr, int idx);
 
 enum irq_type {
 	IRQ_TYPE_NORMAL = 0,
@@ -131,7 +142,7 @@ struct irq_source {
 	uint32_t idr;		/* IRQ destination register */
 	uint32_t destmask;	/* bitmap of CPU destinations */
 	int last_cpu;
-	int output;		/* IRQ level, e.g. OPENPIC_OUTPUT_INT */
+	int output;		/* IRQ level, e.g. ILR_INTTGT_INT */
 	int pending;		/* TRUE if IRQ is pending */
 	enum irq_type type;
 	bool level:1;		/* level-triggered */
@@ -158,16 +169,28 @@ struct irq_source {
 #define IDR_CI      0x40000000	/* critical interrupt */
 
 struct irq_dest {
+	struct kvm_vcpu *vcpu;
+
 	int32_t ctpr;		/* CPU current task priority */
 	struct irq_queue raised;
 	struct irq_queue servicing;
-	qemu_irq *irqs;
 
 	/* Count of IRQ sources asserting on non-INT outputs */
-	uint32_t outputs_active[OPENPIC_OUTPUT_NB];
+	uint32_t outputs_active[NUM_OUTPUTS];
 };
 
+struct openpic;
+
 struct openpic {
+	struct kvm *kvm;
+	struct kvm_io_device mmio;
+	struct list_head mmio_regions;
+	atomic_t users;
+	bool mmio_mapped;
+
+	gpa_t reg_base;
+	spinlock_t lock;
+
 	/* Behavior control */
 	struct fsl_mpic_info *fsl;
 	uint32_t model;
@@ -208,6 +231,47 @@ struct openpic {
 	uint32_t irq_msi;
 };
 
+
+static void mpic_irq_raise(struct openpic *opp, struct irq_dest *dst,
+			   int output)
+{
+	struct kvm_interrupt irq = {
+		.irq = KVM_INTERRUPT_SET_LEVEL,
+	};
+
+	if (!dst->vcpu) {
+		pr_debug("%s: destination cpu %d does not exist\n",
+			 __func__, dst - &opp->dst[0]);
+		return;
+	}
+
+	pr_debug("%s: cpu %d output %d\n", __func__, dst->vcpu->vcpu_id,
+		output);
+
+	if (output != ILR_INTTGT_INT)	/* TODO */
+		return;
+
+	kvm_vcpu_ioctl_interrupt(dst->vcpu, &irq);
+}
+
+static void mpic_irq_lower(struct openpic *opp, struct irq_dest *dst,
+			   int output)
+{
+	if (!dst->vcpu) {
+		pr_debug("%s: destination cpu %d does not exist\n",
+			 __func__, dst - &opp->dst[0]);
+		return;
+	}
+
+	pr_debug("%s: cpu %d output %d\n", __func__, dst->vcpu->vcpu_id,
+		output);
+
+	if (output != ILR_INTTGT_INT)	/* TODO */
+		return;
+
+	kvmppc_core_dequeue_external(dst->vcpu);
+}
+
 static inline void IRQ_setbit(struct irq_queue *q, int n_IRQ)
 {
 	set_bit(n_IRQ, q->queue);
@@ -268,7 +332,7 @@ static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 	pr_debug("%s: IRQ %d active %d was %d\n",
 		__func__, n_IRQ, active, was_active);
 
-	if (src->output != OPENPIC_OUTPUT_INT) {
+	if (src->output != ILR_INTTGT_INT) {
 		pr_debug("%s: output %d irq %d active %d was %d count %d\n",
 			__func__, src->output, n_IRQ, active, was_active,
 			dst->outputs_active[src->output]);
@@ -282,14 +346,14 @@ static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 			    dst->outputs_active[src->output]++ == 0) {
 				pr_debug("%s: Raise OpenPIC output %d cpu %d irq %d\n",
 					__func__, src->output, n_CPU, n_IRQ);
-				qemu_irq_raise(dst->irqs[src->output]);
+				mpic_irq_raise(opp, dst, src->output);
 			}
 		} else {
 			if (was_active &&
 			    --dst->outputs_active[src->output] == 0) {
 				pr_debug("%s: Lower OpenPIC output %d cpu %d irq %d\n",
 					__func__, src->output, n_CPU, n_IRQ);
-				qemu_irq_lower(dst->irqs[src->output]);
+				mpic_irq_lower(opp, dst, src->output);
 			}
 		}
 
@@ -322,8 +386,7 @@ static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 		} else {
 			pr_debug("%s: Raise OpenPIC INT output cpu %d irq %d/%d\n",
 				__func__, n_CPU, n_IRQ, dst->raised.next);
-			qemu_irq_raise(opp->dst[n_CPU].
-				       irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_raise(opp, dst, ILR_INTTGT_INT);
 		}
 	} else {
 		IRQ_get_next(opp, &dst->servicing);
@@ -338,8 +401,7 @@ static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 			pr_debug("%s: IRQ %d inactive, current prio %d/%d, CPU %d\n",
 				__func__, n_IRQ, dst->ctpr,
 				dst->servicing.priority, n_CPU);
-			qemu_irq_lower(opp->dst[n_CPU].
-				       irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_lower(opp, dst, ILR_INTTGT_INT);
 		}
 	}
 }
@@ -415,8 +477,8 @@ static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 	struct irq_source *src;
 
 	if (n_IRQ >= MAX_IRQ) {
-		pr_err("%s: IRQ %d out of range\n", __func__, n_IRQ);
-		abort();
+		WARN_ONCE(1, "%s: IRQ %d out of range\n", __func__, n_IRQ);
+		return;
 	}
 
 	src = &opp->src[n_IRQ];
@@ -433,7 +495,7 @@ static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 			openpic_update_irq(opp, n_IRQ);
 		}
 
-		if (src->output != OPENPIC_OUTPUT_INT) {
+		if (src->output != ILR_INTTGT_INT) {
 			/* Edge-triggered interrupts shouldn't be used
 			 * with non-INT delivery, but just in case,
 			 * try to make it do something sane rather than
@@ -446,15 +508,13 @@ static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 	}
 }
 
-static void openpic_reset(DeviceState *d)
+static void openpic_reset(struct openpic *opp)
 {
-	struct openpic *opp = FROM_SYSBUS(typeof(*opp), SYS_BUS_DEVICE(d));
 	int i;
 
 	opp->gcr = GCR_RESET;
 	/* Initialise controller registers */
 	opp->frr = ((opp->nb_irqs - 1) << FRR_NIRQ_SHIFT) |
-	    ((opp->nb_cpus - 1) << FRR_NCPU_SHIFT) |
 	    (opp->vid << FRR_VID_SHIFT);
 
 	opp->pir = 0;
@@ -504,7 +564,7 @@ static inline uint32_t read_IRQreg_idr(struct openpic *opp, int n_IRQ)
 static inline uint32_t read_IRQreg_ilr(struct openpic *opp, int n_IRQ)
 {
 	if (opp->flags & OPENPIC_FLAG_ILR)
-		return output_to_inttgt(opp->src[n_IRQ].output);
+		return opp->src[n_IRQ].output;
 
 	return 0xffffffff;
 }
@@ -539,7 +599,7 @@ static inline void write_IRQreg_idr(struct openpic *opp, int n_IRQ,
 					__func__);
 			}
 
-			src->output = OPENPIC_OUTPUT_CINT;
+			src->output = ILR_INTTGT_CINT;
 			src->nomask = true;
 			src->destmask = 0;
 
@@ -550,7 +610,7 @@ static inline void write_IRQreg_idr(struct openpic *opp, int n_IRQ,
 					src->destmask |= 1UL << i;
 			}
 		} else {
-			src->output = OPENPIC_OUTPUT_INT;
+			src->output = ILR_INTTGT_INT;
 			src->nomask = false;
 			src->destmask = src->idr & normal_mask;
 		}
@@ -565,7 +625,7 @@ static inline void write_IRQreg_ilr(struct openpic *opp, int n_IRQ,
 	if (opp->flags & OPENPIC_FLAG_ILR) {
 		struct irq_source *src = &opp->src[n_IRQ];
 
-		src->output = inttgt_to_output(val & ILR_INTTGT_MASK);
+		src->output = val & ILR_INTTGT_MASK;
 		pr_debug("Set ILR %d to 0x%08x, output %d\n", n_IRQ, src->idr,
 			src->output);
 
@@ -614,34 +674,22 @@ static inline void write_IRQreg_ivpr(struct openpic *opp, int n_IRQ,
 
 static void openpic_gcr_write(struct openpic *opp, uint64_t val)
 {
-	bool mpic_proxy = false;
-
 	if (val & GCR_RESET) {
-		openpic_reset(&opp->busdev.qdev);
+		openpic_reset(opp);
 		return;
 	}
 
 	opp->gcr &= ~opp->mpic_mode_mask;
 	opp->gcr |= val & opp->mpic_mode_mask;
-
-	/* Set external proxy mode */
-	if ((val & opp->mpic_mode_mask) == GCR_MODE_PROXY)
-		mpic_proxy = true;
-
-	ppce500_set_mpic_proxy(mpic_proxy);
 }
 
-static void openpic_gbl_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned len)
+static int openpic_gbl_write(void *opaque, gpa_t addr, u32 val)
 {
 	struct openpic *opp = opaque;
-	struct irq_dest *dst;
-	int idx;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
-		__func__, addr, val);
+	pr_debug("%s: addr %#llx <= %08x\n", __func__, addr, val);
 	if (addr & 0xF)
-		return;
+		return 0;
 
 	switch (addr) {
 	case 0x00:	/* Block Revision Register1 (BRR1) is Readonly */
@@ -664,22 +712,11 @@ static void openpic_gbl_write(void *opaque, gpa_t addr, uint64_t val,
 	case 0x1080:		/* VIR */
 		break;
 	case 0x1090:		/* PIR */
-		for (idx = 0; idx < opp->nb_cpus; idx++) {
-			if ((val & (1 << idx)) && !(opp->pir & (1 << idx))) {
-				pr_debug("Raise OpenPIC RESET output for CPU %d\n",
-					idx);
-				dst = &opp->dst[idx];
-				qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_RESET]);
-			} else if (!(val & (1 << idx)) &&
-				   (opp->pir & (1 << idx))) {
-				pr_debug("Lower OpenPIC RESET output for CPU %d\n",
-					idx);
-				dst = &opp->dst[idx];
-				qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_RESET]);
-			}
-		}
-		opp->pir = val;
-		break;
+		/*
+		 * This register is used to reset a CPU core --
+		 * let userspace handle it.
+		 */
+		return 1;
 	case 0x10A0:		/* IPI_IVPR */
 	case 0x10B0:
 	case 0x10C0:
@@ -695,21 +732,24 @@ static void openpic_gbl_write(void *opaque, gpa_t addr, uint64_t val,
 	default:
 		break;
 	}
+
+	return 0;
 }
 
-static uint64_t openpic_gbl_read(void *opaque, gpa_t addr, unsigned len)
+static int openpic_gbl_read(void *opaque, gpa_t addr, u32 *ptr)
 {
 	struct openpic *opp = opaque;
-	uint32_t retval;
+	u32 retval;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 	retval = 0xFFFFFFFF;
 	if (addr & 0xF)
-		return retval;
+		goto out;
 
 	switch (addr) {
 	case 0x1000:		/* FRR */
 		retval = opp->frr;
+		retval |= (opp->nb_cpus - 1) << FRR_NCPU_SHIFT;
 		break;
 	case 0x1020:		/* GCR */
 		retval = opp->gcr;
@@ -731,8 +771,8 @@ static uint64_t openpic_gbl_read(void *opaque, gpa_t addr, unsigned len)
 	case 0x90:
 	case 0xA0:
 	case 0xB0:
-		retval =
-		    openpic_cpu_read_internal(opp, addr, get_current_cpu());
+		openpic_cpu_read_internal(opp, addr, &retval,
+					  get_current_cpu());
 		break;
 	case 0x10A0:		/* IPI_IVPR */
 	case 0x10B0:
@@ -750,28 +790,28 @@ static uint64_t openpic_gbl_read(void *opaque, gpa_t addr, unsigned len)
 	default:
 		break;
 	}
-	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
-	return retval;
+out:
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
+	*ptr = retval;
+	return 0;
 }
 
-static void openpic_tmr_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned len)
+static int openpic_tmr_write(void *opaque, gpa_t addr, u32 val)
 {
 	struct openpic *opp = opaque;
 	int idx;
 
 	addr += 0x10f0;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
-		__func__, addr, val);
+	pr_debug("%s: addr %#llx <= %08x\n", __func__, addr, val);
 	if (addr & 0xF)
-		return;
+		return 0;
 
 	if (addr == 0x10f0) {
 		/* TFRR */
 		opp->tfrr = val;
-		return;
+		return 0;
 	}
 
 	idx = (addr >> 6) & 0x3;
@@ -795,15 +835,17 @@ static void openpic_tmr_write(void *opaque, gpa_t addr, uint64_t val,
 		write_IRQreg_idr(opp, opp->irq_tim0 + idx, val);
 		break;
 	}
+
+	return 0;
 }
 
-static uint64_t openpic_tmr_read(void *opaque, gpa_t addr, unsigned len)
+static int openpic_tmr_read(void *opaque, gpa_t addr, u32 *ptr)
 {
 	struct openpic *opp = opaque;
 	uint32_t retval = -1;
 	int idx;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 	if (addr & 0xF)
 		goto out;
 
@@ -813,6 +855,7 @@ static uint64_t openpic_tmr_read(void *opaque, gpa_t addr, unsigned len)
 		retval = opp->tfrr;
 		goto out;
 	}
+
 	switch (addr & 0x30) {
 	case 0x00:		/* TCCR */
 		retval = opp->timers[idx].tccr;
@@ -830,18 +873,16 @@ static uint64_t openpic_tmr_read(void *opaque, gpa_t addr, unsigned len)
 
 out:
 	pr_debug("%s: => 0x%08x\n", __func__, retval);
-
-	return retval;
+	*ptr = retval;
+	return 0;
 }
 
-static void openpic_src_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned len)
+static int openpic_src_write(void *opaque, gpa_t addr, u32 val)
 {
 	struct openpic *opp = opaque;
 	int idx;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
-		__func__, addr, val);
+	pr_debug("%s: addr %#llx <= %08x\n", __func__, addr, val);
 
 	addr = addr & 0xffff;
 	idx = addr >> 5;
@@ -857,15 +898,17 @@ static void openpic_src_write(void *opaque, gpa_t addr, uint64_t val,
 		write_IRQreg_ilr(opp, idx, val);
 		break;
 	}
+
+	return 0;
 }
 
-static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
+static int openpic_src_read(void *opaque, gpa_t addr, u32 *ptr)
 {
 	struct openpic *opp = opaque;
 	uint32_t retval;
 	int idx;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 	retval = 0xFFFFFFFF;
 
 	addr = addr & 0xffff;
@@ -884,20 +927,19 @@ static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
 	}
 
 	pr_debug("%s: => 0x%08x\n", __func__, retval);
-	return retval;
+	*ptr = retval;
+	return 0;
 }
 
-static void openpic_msi_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned size)
+static int openpic_msi_write(void *opaque, gpa_t addr, u32 val)
 {
 	struct openpic *opp = opaque;
 	int idx = opp->irq_msi;
 	int srs, ibs;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
-		__func__, addr, val);
+	pr_debug("%s: addr %#llx <= 0x%08x\n", __func__, addr, val);
 	if (addr & 0xF)
-		return;
+		return 0;
 
 	switch (addr) {
 	case MSIIR_OFFSET:
@@ -911,17 +953,19 @@ static void openpic_msi_write(void *opaque, gpa_t addr, uint64_t val,
 		/* most registers are read-only, thus ignored */
 		break;
 	}
+
+	return 0;
 }
 
-static uint64_t openpic_msi_read(void *opaque, gpa_t addr, unsigned size)
+static int openpic_msi_read(void *opaque, gpa_t addr, u32 *ptr)
 {
 	struct openpic *opp = opaque;
-	uint64_t r = 0;
+	uint32_t r = 0;
 	int i, srs;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 	if (addr & 0xF)
-		return -1;
+		return 1;
 
 	srs = addr >> 4;
 
@@ -945,45 +989,47 @@ static uint64_t openpic_msi_read(void *opaque, gpa_t addr, unsigned size)
 		break;
 	}
 
-	return r;
+	pr_debug("%s: => 0x%08x\n", __func__, r);
+	*ptr = r;
+	return 0;
 }
 
-static uint64_t openpic_summary_read(void *opaque, gpa_t addr, unsigned size)
+static int openpic_summary_read(void *opaque, gpa_t addr, u32 *ptr)
 {
-	uint64_t r = 0;
+	uint32_t r = 0;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 
 	/* TODO: EISR/EIMR */
 
-	return r;
+	*ptr = r;
+	return 0;
 }
 
-static void openpic_summary_write(void *opaque, gpa_t addr, uint64_t val,
-				  unsigned size)
+static int openpic_summary_write(void *opaque, gpa_t addr, u32 val)
 {
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
-		__func__, addr, val);
+	pr_debug("%s: addr %#llx <= 0x%08x\n", __func__, addr, val);
 
 	/* TODO: EISR/EIMR */
+	return 0;
 }
 
-static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
-				       uint32_t val, int idx)
+static int openpic_cpu_write_internal(void *opaque, gpa_t addr,
+				      u32 val, int idx)
 {
 	struct openpic *opp = opaque;
 	struct irq_source *src;
 	struct irq_dest *dst;
 	int s_IRQ, n_IRQ;
 
-	pr_debug("%s: cpu %d addr %#" HWADDR_PRIx " <= 0x%08x\n", __func__, idx,
+	pr_debug("%s: cpu %d addr %#llx <= 0x%08x\n", __func__, idx,
 		addr, val);
 
 	if (idx < 0)
-		return;
+		return 0;
 
 	if (addr & 0xF)
-		return;
+		return 0;
 
 	dst = &opp->dst[idx];
 	addr &= 0xFF0;
@@ -1008,11 +1054,11 @@ static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
 		if (dst->raised.priority <= dst->ctpr) {
 			pr_debug("%s: Lower OpenPIC INT output cpu %d due to ctpr\n",
 				__func__, idx);
-			qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_lower(opp, dst, ILR_INTTGT_INT);
 		} else if (dst->raised.priority > dst->servicing.priority) {
 			pr_debug("%s: Raise OpenPIC INT output cpu %d irq %d\n",
 				__func__, idx, dst->raised.next);
-			qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_raise(opp, dst, ILR_INTTGT_INT);
 		}
 
 		break;
@@ -1043,18 +1089,22 @@ static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
 		     IVPR_PRIORITY(src->ivpr) > dst->servicing.priority)) {
 			pr_debug("Raise OpenPIC INT output cpu %d irq %d\n",
 				idx, n_IRQ);
-			qemu_irq_raise(opp->dst[idx].irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_raise(opp, dst, ILR_INTTGT_INT);
 		}
 		break;
 	default:
 		break;
 	}
+
+	return 0;
 }
 
-static void openpic_cpu_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned len)
+static int openpic_cpu_write(void *opaque, gpa_t addr, u32 val)
 {
-	openpic_cpu_write_internal(opaque, addr, val, (addr & 0x1f000) >> 12);
+	struct openpic *opp = opaque;
+
+	return openpic_cpu_write_internal(opp, addr, val,
+					 (addr & 0x1f000) >> 12);
 }
 
 static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
@@ -1064,7 +1114,7 @@ static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
 	int retval, irq;
 
 	pr_debug("Lower OpenPIC INT output\n");
-	qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
+	mpic_irq_lower(opp, dst, ILR_INTTGT_INT);
 
 	irq = IRQ_get_next(opp, &dst->raised);
 	pr_debug("IACK: irq=%d\n", irq);
@@ -1107,20 +1157,35 @@ static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
 	return retval;
 }
 
-static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx)
+void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu)
+{
+	struct openpic *opp = vcpu->arch.irqchip_priv;
+	int cpu = vcpu->vcpu_id;
+	unsigned long flags;
+
+	spin_lock_irqsave(&opp->lock, flags);
+
+	if ((opp->gcr & opp->mpic_mode_mask) == GCR_MODE_PROXY)
+		kvmppc_set_epr(vcpu, openpic_iack(opp, &opp->dst[cpu], cpu));
+
+	spin_unlock_irqrestore(&opp->lock, flags);
+}
+
+static int openpic_cpu_read_internal(void *opaque, gpa_t addr,
+				     u32 *ptr, int idx)
 {
 	struct openpic *opp = opaque;
 	struct irq_dest *dst;
 	uint32_t retval;
 
-	pr_debug("%s: cpu %d addr %#" HWADDR_PRIx "\n", __func__, idx, addr);
+	pr_debug("%s: cpu %d addr %#llx\n", __func__, idx, addr);
 	retval = 0xFFFFFFFF;
 
 	if (idx < 0)
-		return retval;
+		goto out;
 
 	if (addr & 0xF)
-		return retval;
+		goto out;
 
 	dst = &opp->dst[idx];
 	addr &= 0xFF0;
@@ -1142,49 +1207,67 @@ static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx)
 	}
 	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
-	return retval;
+out:
+	*ptr = retval;
+	return 0;
 }
 
-static uint64_t openpic_cpu_read(void *opaque, gpa_t addr, unsigned len)
+static int openpic_cpu_read(void *opaque, gpa_t addr, u32 *ptr)
 {
-	return openpic_cpu_read_internal(opaque, addr, (addr & 0x1f000) >> 12);
+	struct openpic *opp = opaque;
+
+	return openpic_cpu_read_internal(opp, addr, ptr,
+					 (addr & 0x1f000) >> 12);
 }
 
-static const struct kvm_io_device_ops openpic_glb_ops_be = {
+struct mem_reg {
+	struct list_head list;
+	int (*read)(void *opaque, gpa_t addr, u32 *ptr);
+	int (*write)(void *opaque, gpa_t addr, u32 val);
+	gpa_t start_addr;
+	int size;
+};
+
+static struct mem_reg openpic_gbl_mmio = {
 	.write = openpic_gbl_write,
 	.read = openpic_gbl_read,
+	.start_addr = OPENPIC_GLB_REG_START,
+	.size = OPENPIC_GLB_REG_SIZE,
 };
 
-static const struct kvm_io_device_ops openpic_tmr_ops_be = {
+static struct mem_reg openpic_tmr_mmio = {
 	.write = openpic_tmr_write,
 	.read = openpic_tmr_read,
+	.start_addr = OPENPIC_TMR_REG_START,
+	.size = OPENPIC_TMR_REG_SIZE,
 };
 
-static const struct kvm_io_device_ops openpic_cpu_ops_be = {
+static struct mem_reg openpic_cpu_mmio = {
 	.write = openpic_cpu_write,
 	.read = openpic_cpu_read,
+	.start_addr = OPENPIC_CPU_REG_START,
+	.size = OPENPIC_CPU_REG_SIZE,
 };
 
-static const struct kvm_io_device_ops openpic_src_ops_be = {
+static struct mem_reg openpic_src_mmio = {
 	.write = openpic_src_write,
 	.read = openpic_src_read,
+	.start_addr = OPENPIC_SRC_REG_START,
+	.size = OPENPIC_SRC_REG_SIZE,
 };
 
-static const struct kvm_io_device_ops openpic_msi_ops_be = {
+static struct mem_reg openpic_msi_mmio = {
 	.read = openpic_msi_read,
 	.write = openpic_msi_write,
+	.start_addr = OPENPIC_MSI_REG_START,
+	.size = OPENPIC_MSI_REG_SIZE,
 };
 
-static const struct kvm_io_device_ops openpic_summary_ops_be = {
+static struct mem_reg openpic_summary_mmio = {
 	.read = openpic_summary_read,
 	.write = openpic_summary_write,
-};
-
-struct mem_reg {
-	const char *name;
-	const struct kvm_io_device_ops *ops;
-	gpa_t start_addr;
-	int size;
+	.start_addr = OPENPIC_SUMMARY_REG_START,
+	.size = OPENPIC_SUMMARY_REG_SIZE,
 };
 
 static void fsl_common_init(struct openpic *opp)
@@ -1192,6 +1275,9 @@ static void fsl_common_init(struct openpic *opp)
 	int i;
 	int virq = MAX_SRC;
 
+	list_add(&openpic_msi_mmio.list, &opp->mmio_regions);
+	list_add(&openpic_summary_mmio.list, &opp->mmio_regions);
+
 	opp->vid = VID_REVISION_1_2;
 	opp->vir = VIR_GENERIC;
 	opp->vector_mask = 0xFFFF;
@@ -1205,11 +1291,10 @@ static void fsl_common_init(struct openpic *opp)
 	opp->irq_tim0 = virq;
 	virq += MAX_TMR;
 
-	assert(virq <= MAX_IRQ);
+	BUG_ON(virq > MAX_IRQ);
 
 	opp->irq_msi = 224;
 
-	msi_supported = true;
 	for (i = 0; i < opp->fsl->max_ext; i++)
 		opp->src[i].level = false;
 
@@ -1226,63 +1311,406 @@ static void fsl_common_init(struct openpic *opp)
 	}
 }
 
-static void map_list(struct openpic *opp, const struct mem_reg *list,
-		     int *count)
+static int kvm_mpic_read_internal(struct openpic *opp, gpa_t addr, u32 *ptr)
 {
-	while (list->name) {
-		assert(*count < ARRAY_SIZE(opp->sub_io_mem));
+	struct list_head *node;
 
-		memory_region_init_io(&opp->sub_io_mem[*count], list->ops, opp,
-				      list->name, list->size);
+	list_for_each(node, &opp->mmio_regions) {
+		struct mem_reg *mr = list_entry(node, struct mem_reg, list);
 
-		memory_region_add_subregion(&opp->mem, list->start_addr,
-					    &opp->sub_io_mem[*count]);
+		if (mr->start_addr > addr || addr >= mr->start_addr + mr->size)
+			continue;
 
-		(*count)++;
-		list++;
+		return mr->read(opp, addr - mr->start_addr, ptr);
 	}
+
+	return 1;
 }
 
-static int openpic_init(SysBusDevice *dev)
+static int kvm_mpic_write_internal(struct openpic *opp, gpa_t addr, u32 val)
 {
-	struct openpic *opp = FROM_SYSBUS(typeof(*opp), dev);
-	int i, j;
-	int list_count = 0;
-	static const struct mem_reg list_le[] = {
-		{"glb", &openpic_glb_ops_le,
-		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
-		{"tmr", &openpic_tmr_ops_le,
-		 OPENPIC_TMR_REG_START, OPENPIC_TMR_REG_SIZE},
-		{"src", &openpic_src_ops_le,
-		 OPENPIC_SRC_REG_START, OPENPIC_SRC_REG_SIZE},
-		{"cpu", &openpic_cpu_ops_le,
-		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
-		{NULL}
-	};
-	static const struct mem_reg list_be[] = {
-		{"glb", &openpic_glb_ops_be,
-		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
-		{"tmr", &openpic_tmr_ops_be,
-		 OPENPIC_TMR_REG_START, OPENPIC_TMR_REG_SIZE},
-		{"src", &openpic_src_ops_be,
-		 OPENPIC_SRC_REG_START, OPENPIC_SRC_REG_SIZE},
-		{"cpu", &openpic_cpu_ops_be,
-		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
-		{NULL}
-	};
-	static const struct mem_reg list_fsl[] = {
-		{"msi", &openpic_msi_ops_be,
-		 OPENPIC_MSI_REG_START, OPENPIC_MSI_REG_SIZE},
-		{"summary", &openpic_summary_ops_be,
-		 OPENPIC_SUMMARY_REG_START, OPENPIC_SUMMARY_REG_SIZE},
-		{NULL}
-	};
+	struct list_head *node;
 
-	memory_region_init(&opp->mem, "openpic", 0x40000);
+	list_for_each(node, &opp->mmio_regions) {
+		struct mem_reg *mr = list_entry(node, struct mem_reg, list);
 
-	switch (opp->model) {
-	case OPENPIC_MODEL_FSL_MPIC_20:
+		if (mr->start_addr > addr || addr >= mr->start_addr + mr->size)
+			continue;
+
+		return mr->write(opp, addr - mr->start_addr, val);
+	}
+
+	return 1;
+}
+
+static int kvm_mpic_read(struct kvm_io_device *this, gpa_t addr,
+			 int len, void *ptr)
+{
+	struct openpic *opp = container_of(this, struct openpic, mmio);
+	int ret;
+
+	/*
+	 * Technically only 32-bit accesses are allowed, but be nice to
+	 * people dumping registers a byte at a time -- it works in real
+	 * hardware (reads only, not writes).
+	 */
+	if (len == 4) {
+		if (addr & 3) {
+			pr_debug("%s: bad alignment %llx/%d\n",
+				 __func__, addr, len);
+			return -EINVAL;
+		}
+
+		spin_lock_irq(&opp->lock);
+		ret = kvm_mpic_read_internal(opp, addr - opp->reg_base, ptr);
+		spin_unlock_irq(&opp->lock);
+
+		pr_debug("%s: addr %llx ret %d len 4 val %x\n",
+			 __func__, addr, ret, *(const u32 *)ptr);
+	} else if (len == 1) {
+		union {
+			u32 val;
+			u8 bytes[4];
+		} u;
+
+		spin_lock_irq(&opp->lock);
+		ret = kvm_mpic_read_internal(opp, addr - opp->reg_base, &u.val);
+		spin_unlock_irq(&opp->lock);
+
+		*(u8 *)ptr = u.bytes[addr & 3];
+
+		pr_debug("%s: addr %llx ret %d len 1 val %x\n",
+			 __func__, addr, ret, *(const u8 *)ptr);
+	} else {
+		pr_debug("%s: bad length %d\n", __func__, len);
+		return -EINVAL;
+	}
+
+	return ret;
+}
+
+static int kvm_mpic_write(struct kvm_io_device *this, gpa_t addr,
+			  int len, const void *ptr)
+{
+	struct openpic *opp = container_of(this, struct openpic, mmio);
+	int ret;
+
+	if (len != 4) {
+		pr_debug("%s: bad length %d\n", __func__, len);
+		return -EOPNOTSUPP;
+	}
+	if (addr & 3) {
+		pr_debug("%s: bad alignment %llx/%d\n", __func__, addr, len);
+		return -EOPNOTSUPP;
+	}
+
+	spin_lock_irq(&opp->lock);
+	ret = kvm_mpic_write_internal(opp, addr - opp->reg_base,
+				      *(const u32 *)ptr);
+	spin_unlock_irq(&opp->lock);
+
+	pr_debug("%s: addr %llx ret %d val %x\n",
+		 __func__, addr, ret, *(const u32 *)ptr);
+
+	return ret;
+}
+
+static void kvm_mpic_dtor(struct kvm_io_device *this)
+{
+	struct openpic *opp = container_of(this, struct openpic, mmio);
+
+	opp->mmio_mapped = false;
+}
+
+static const struct kvm_io_device_ops mpic_mmio_ops = {
+	.read = kvm_mpic_read,
+	.write = kvm_mpic_write,
+	.destructor = kvm_mpic_dtor,
+};
+
+static void map_mmio(struct openpic *opp)
+{
+	BUG_ON(opp->mmio_mapped);
+	opp->mmio_mapped = true;
+
+	kvm_iodevice_init(&opp->mmio, &mpic_mmio_ops);
+
+	kvm_io_bus_register_dev(opp->kvm, KVM_MMIO_BUS,
+				opp->reg_base, OPENPIC_REG_SIZE,
+				&opp->mmio);
+}
+
+static void unmap_mmio(struct openpic *opp)
+{
+	if (!opp->mmio_mapped)
+		return;
+	opp->mmio_mapped = false;
+	kvm_io_bus_unregister_dev(opp->kvm, KVM_MMIO_BUS, &opp->mmio);
+}
+
+static int set_base_addr(struct openpic *opp, struct kvm_device_attr *attr)
+{
+	u64 base;
+
+	if (copy_from_user(&base, (u64 __user *)(long)attr->addr, sizeof(u64)))
+		return -EFAULT;
+
+	if (base & 0x3ffff) {
+		pr_debug("kvm mpic %s: KVM_DEV_MPIC_BASE_ADDR %08llx not aligned\n",
+			 __func__, base);
+		return -EINVAL;
+	}
+
+	if (base == opp->reg_base)
+		return 0;
+
+	mutex_lock(&opp->kvm->slots_lock);
+
+	unmap_mmio(opp);
+	opp->reg_base = base;
+
+	pr_debug("kvm mpic %s: KVM_DEV_MPIC_BASE_ADDR %08llx\n",
+		 __func__, base);
+
+	if (base == 0)
+		goto out;
+
+	map_mmio(opp);
+
+out:
+	mutex_unlock(&opp->kvm->slots_lock);
+	return 0;
+}
+
+#define ATTR_SET		0
+#define ATTR_GET		1
+
+static int access_reg(struct openpic *opp, gpa_t addr, u32 *val, int type)
+{
+	int ret;
+
+	if (addr & 3)
+		return -ENXIO;
+
+	if (type == ATTR_SET)
+		ret = kvm_mpic_write_internal(opp, addr, *val);
+	else
+		ret = kvm_mpic_read_internal(opp, addr, val);
+
+	pr_debug("%s: type %d addr %llx val %x\n", __func__, type, addr, *val);
+
+	return ret;
+}
+
+static int mpic_set_attr(struct openpic *opp, struct kvm_device_attr *attr)
+{
+	u32 attr32;
+
+	switch (attr->group) {
+	case KVM_DEV_MPIC_GRP_MISC:
+		switch (attr->attr) {
+		case KVM_DEV_MPIC_BASE_ADDR:
+			return set_base_addr(opp, attr);
+		}
+
+		break;
+
+	case KVM_DEV_MPIC_GRP_REGISTER:
+		if (copy_from_user(&attr32, (u32 __user *)(long)attr->addr,
+				   sizeof(u32)))
+			return -EFAULT;
+
+		return access_reg(opp, attr->attr, &attr32, ATTR_SET);
+
+	case KVM_DEV_MPIC_GRP_IRQ_ACTIVE:
+		if (attr->attr > MAX_SRC)
+			return -EINVAL;
+
+		if (copy_from_user(&attr32, (u32 __user *)(long)attr->addr,
+				   sizeof(u32)))
+			return -EFAULT;
+
+		if (attr32 != 0 && attr32 != 1)
+			return -EINVAL;
+
+		spin_lock_irq(&opp->lock);
+		openpic_set_irq(opp, attr->attr, attr32);
+		spin_unlock_irq(&opp->lock);
+		return 0;
+	}
+
+	return -ENXIO;
+}
+
+static int mpic_get_attr(struct openpic *opp, struct kvm_device_attr *attr)
+{
+	u64 attr64;
+	u32 attr32;
+	int ret;
+
+	switch (attr->group) {
+	case KVM_DEV_MPIC_GRP_MISC:
+		switch (attr->attr) {
+		case KVM_DEV_MPIC_BASE_ADDR:
+			mutex_lock(&opp->kvm->slots_lock);
+			attr64 = opp->reg_base;
+			mutex_unlock(&opp->kvm->slots_lock);
+
+			if (copy_to_user((u64 __user *)(long)attr->addr,
+					 &attr64, sizeof(u64)))
+				return -EFAULT;
+
+			return 0;
+		}
+
+		break;
+
+	case KVM_DEV_MPIC_GRP_REGISTER:
+		ret = access_reg(opp, attr->attr, &attr32, ATTR_GET);
+		if (ret)
+			return ret;
+
+		if (copy_to_user((u32 __user *)(long)attr->addr, &attr32,
+				 sizeof(u32)))
+			return -EFAULT;
+
+		return 0;
+
+	case KVM_DEV_MPIC_GRP_IRQ_ACTIVE:
+		if (attr->attr > MAX_SRC)
+			return -EINVAL;
+
+		attr32 = opp->src[attr->attr].pending;
+
+		if (copy_to_user((u32 __user *)(long)attr->addr, &attr32,
+				 sizeof(u32)))
+			return -EFAULT;
+
+		return 0;
+	}
+
+	return -ENXIO;
+}
+
+static int mpic_has_attr(struct openpic *opp, struct kvm_device_attr *attr)
+{
+	switch (attr->group) {
+	case KVM_DEV_MPIC_GRP_MISC:
+		switch (attr->attr) {
+		case KVM_DEV_MPIC_BASE_ADDR:
+			return 0;
+		}
+
+		break;
+
+	case KVM_DEV_MPIC_GRP_REGISTER:
+		return 0;
+
+	case KVM_DEV_MPIC_GRP_IRQ_ACTIVE:
+		if (attr->attr > MAX_SRC)
+			break;
+
+		return 0;
+	}
+
+	return -ENXIO;
+}
+
+static long kvm_mpic_ioctl(struct file *filp, unsigned int ioctl,
+			   unsigned long arg)
+{
+	struct openpic *opp = filp->private_data;
+	struct kvm_device_attr attr;
+	int (*accessor)(struct openpic *opp, struct kvm_device_attr *attr);
+
+	switch (ioctl) {
+	case KVM_SET_DEVICE_ATTR:
+		accessor = mpic_set_attr;
+		break;
+	case KVM_GET_DEVICE_ATTR:
+		accessor = mpic_get_attr;
+		break;
+	case KVM_HAS_DEVICE_ATTR:
+		accessor = mpic_has_attr;
+		break;
 	default:
+		return -ENOTTY;
+	}
+
+	if (copy_from_user(&attr, (void __user *)arg, sizeof(attr)))
+		return -EFAULT;
+
+	return accessor(opp, &attr);
+}
+
+static void mpic_destroy(struct openpic *opp)
+{
+	if (opp->mmio_mapped) {
+		/*
+		 * Normally we get unmapped by kvm_io_bus_destroy(),
+		 * which happens before the VCPUs release their references.
+		 *
+		 * Thus, we should only get here if no VCPUs took a reference
+		 * to us in the first place.
+		 */
+		WARN_ON(opp->nb_cpus != 0);
+		unmap_mmio(opp);
+	}
+
+	kfree(opp);
+}
+
+void mpic_put(void *priv)
+{
+	struct openpic *opp = priv;
+
+	if (atomic_dec_and_test(&opp->users))
+		mpic_destroy(opp);
+}
+
+static int kvm_mpic_release(struct inode *inode, struct file *filp)
+{
+	struct openpic *opp = filp->private_data;
+	struct kvm *kvm = opp->kvm;
+
+	mpic_put(opp);
+	kvm_put_kvm(kvm);
+	return 0;
+}
+
+static const struct file_operations kvm_mpic_fops = {
+	.unlocked_ioctl = kvm_mpic_ioctl,
+	.release = kvm_mpic_release,
+};
+
+int kvm_create_mpic(struct kvm *kvm, u32 type)
+{
+	struct openpic *opp;
+	int ret, fd;
+
+	opp = kzalloc(sizeof(struct openpic), GFP_KERNEL);
+	if (!opp)
+		return -ENOMEM;
+
+	fd = anon_inode_getfd("kvm-mpic", &kvm_mpic_fops, opp, O_RDWR);
+	if (fd < 0) {
+		ret = fd;
+		goto err;
+	}
+
+	opp->kvm = kvm;
+	opp->model = type;
+	atomic_set(&opp->users, 1);
+	spin_lock_init(&opp->lock);
+
+	INIT_LIST_HEAD(&opp->mmio_regions);
+	list_add(&openpic_gbl_mmio.list, &opp->mmio_regions);
+	list_add(&openpic_tmr_mmio.list, &opp->mmio_regions);
+	list_add(&openpic_src_mmio.list, &opp->mmio_regions);
+	list_add(&openpic_cpu_mmio.list, &opp->mmio_regions);
+
+	switch (opp->model) {
+	case KVM_DEV_TYPE_FSL_MPIC_20:
 		opp->fsl = &fsl_mpic_20;
 		opp->brr1 = 0x00400200;
 		opp->flags |= OPENPIC_FLAG_IDR_CRIT;
@@ -1290,12 +1718,10 @@ static int openpic_init(SysBusDevice *dev)
 		opp->mpic_mode_mask = GCR_MODE_MIXED;
 
 		fsl_common_init(opp);
-		map_list(opp, list_be, &list_count);
-		map_list(opp, list_fsl, &list_count);
 
 		break;
 
-	case OPENPIC_MODEL_FSL_MPIC_42:
+	case KVM_DEV_TYPE_FSL_MPIC_42:
 		opp->fsl = &fsl_mpic_42;
 		opp->brr1 = 0x00400402;
 		opp->flags |= OPENPIC_FLAG_ILR;
@@ -1303,11 +1729,19 @@ static int openpic_init(SysBusDevice *dev)
 		opp->mpic_mode_mask = GCR_MODE_PROXY;
 
 		fsl_common_init(opp);
-		map_list(opp, list_be, &list_count);
-		map_list(opp, list_fsl, &list_count);
 
 		break;
+
+	default:
+		ret = -ENODEV;
+		goto err;
 	}
 
-	return 0;
+	openpic_reset(opp);
+	kvm_get_kvm(kvm);
+	return fd;
+
+err:
+	kfree(opp);
+	return ret;
 }
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index bdfa526..5656d12 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -317,6 +317,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_ENABLE_CAP:
 	case KVM_CAP_ONE_REG:
 	case KVM_CAP_IOEVENTFD:
+	case KVM_CAP_DEVICE_CTRL:
 		r = 1;
 		break;
 #ifndef CONFIG_KVM_BOOK3S_64_HV
@@ -776,7 +777,10 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 		break;
 	case KVM_CAP_PPC_EPR:
 		r = 0;
-		vcpu->arch.epr_enabled = cap->args[0];
+		if (cap->args[0])
+			vcpu->arch.epr_flags |= KVMPPC_EPR_USER;
+		else
+			vcpu->arch.epr_flags &= ~KVMPPC_EPR_USER;
 		break;
 #ifdef CONFIG_BOOKE
 	case KVM_CAP_PPC_BOOKE_WATCHDOG:
@@ -922,6 +926,7 @@ static int kvm_vm_ioctl_get_pvinfo(struct kvm_ppc_pvinfo *pvinfo)
 long kvm_arch_vm_ioctl(struct file *filp,
                        unsigned int ioctl, unsigned long arg)
 {
+	struct kvm *kvm __maybe_unused = filp->private_data;
 	void __user *argp = (void __user *)arg;
 	long r;
 
@@ -940,7 +945,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 #ifdef CONFIG_PPC_BOOK3S_64
 	case KVM_CREATE_SPAPR_TCE: {
 		struct kvm_create_spapr_tce create_tce;
-		struct kvm *kvm = filp->private_data;
 
 		r = -EFAULT;
 		if (copy_from_user(&create_tce, argp, sizeof(create_tce)))
@@ -952,7 +956,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 
 #ifdef CONFIG_KVM_BOOK3S_64_HV
 	case KVM_ALLOCATE_RMA: {
-		struct kvm *kvm = filp->private_data;
 		struct kvm_allocate_rma rma;
 
 		r = kvm_vm_ioctl_allocate_rma(kvm, &rma);
@@ -962,7 +965,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 	}
 
 	case KVM_PPC_ALLOCATE_HTAB: {
-		struct kvm *kvm = filp->private_data;
 		u32 htab_order;
 
 		r = -EFAULT;
@@ -979,7 +981,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 	}
 
 	case KVM_PPC_GET_HTAB_FD: {
-		struct kvm *kvm = filp->private_data;
 		struct kvm_get_htab_fd ghf;
 
 		r = -EFAULT;
@@ -992,7 +993,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 
 #ifdef CONFIG_PPC_BOOK3S_64
 	case KVM_PPC_GET_SMMU_INFO: {
-		struct kvm *kvm = filp->private_data;
 		struct kvm_ppc_smmu_info info;
 
 		memset(&info, 0, sizeof(info));
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1c0be23..852a3a1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1084,6 +1084,8 @@ static inline bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
 	return true;
 }
 
+int kvm_create_mpic(struct kvm *kvm, u32 type);
+
 #endif /* CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT */
 #else
 static inline void __guest_enter(void) { return; }
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 20ce2d2..d8f44ef 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -927,6 +927,15 @@ struct kvm_device_attr {
 	__u64	addr;		/* userspace address of attr data */
 };
 
+#define KVM_DEV_TYPE_FSL_MPIC_20	1
+#define KVM_DEV_TYPE_FSL_MPIC_42	2
+
+#define KVM_DEV_MPIC_GRP_MISC		1
+#define   KVM_DEV_MPIC_BASE_ADDR	0	/* 64-bit */
+
+#define KVM_DEV_MPIC_GRP_REGISTER	2	/* 32-bit */
+#define KVM_DEV_MPIC_GRP_IRQ_ACTIVE	3	/* 32-bit */
+
 /* ioctl for vm fd */
 #define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xe0, struct kvm_create_device)
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ed033c0..e325f5d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2164,6 +2164,15 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
 	bool test = cd->flags & KVM_CREATE_DEVICE_TEST;
 
 	switch (cd->type) {
+#ifdef CONFIG_KVM_MPIC
+	case KVM_DEV_TYPE_FSL_MPIC_20:
+	case KVM_DEV_TYPE_FSL_MPIC_42: {
+		if (test)
+			return 0;
+
+		return kvm_create_mpic(kvm, cd->type);
+	}
+#endif
 	default:
 		return -ENODEV;
 	}
-- 
1.7.9.5

* [RFC PATCH v2 5/6] kvm/ppc/mpic: in-kernel MPIC emulation
@ 2013-04-01 22:47     ` Scott Wood
From: Scott Wood @ 2013-04-01 22:47 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus, Scott Wood

Hook the MPIC code up to the KVM interfaces, add locking, etc.

TODO: irqfd support, split up into multiple patches, KVM_IRQ_LINE
support

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
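
For anyone driving this from userspace, the address constraints described
in mpic.txt can be checked before issuing KVM_SET_DEVICE_ATTR.  A minimal
sketch, assuming the 256 KiB size and 4-byte alignment stated in the
documentation; the helper names and MPIC_REG_SPACE_SIZE are hypothetical
and not part of this patch, and the upper bound on register offsets is an
assumption drawn from the "256 KiB register space" wording rather than a
check the kernel code is shown to perform:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Size of the MPIC register space per mpic.txt (256 KiB). Hypothetical name. */
#define MPIC_REG_SPACE_SIZE 0x40000ULL

/* Natural alignment only makes sense for a power-of-two size. */
static_assert((MPIC_REG_SPACE_SIZE & (MPIC_REG_SPACE_SIZE - 1)) == 0,
	      "register space size must be a power of two");

/*
 * KVM_DEV_MPIC_BASE_ADDR: the base must be naturally aligned to the
 * register space size.  Zero is aligned too; it disables the mapping.
 */
static bool mpic_base_addr_valid(uint64_t base)
{
	return (base & (MPIC_REG_SPACE_SIZE - 1)) == 0;
}

/*
 * KVM_DEV_MPIC_GRP_REGISTER: "attr" is a byte offset into the register
 * space and must be 4-byte aligned.  The range check is an assumption.
 */
static bool mpic_reg_offset_valid(uint64_t offset)
{
	return (offset & 3) == 0 && offset < MPIC_REG_SPACE_SIZE;
}
```

A base of zero passes the alignment check, which matches the documented
meaning of zero as "mapping disabled".
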
 Documentation/virtual/kvm/devices/mpic.txt |   37 ++
 arch/powerpc/include/asm/kvm_host.h        |    8 +-
 arch/powerpc/include/asm/kvm_ppc.h         |    4 +
 arch/powerpc/kvm/Kconfig                   |    5 +
 arch/powerpc/kvm/Makefile                  |    2 +
 arch/powerpc/kvm/booke.c                   |   10 +-
 arch/powerpc/kvm/mpic.c                    |  816 +++++++++++++++++++++-------
 arch/powerpc/kvm/powerpc.c                 |   12 +-
 include/linux/kvm_host.h                   |    2 +
 include/uapi/linux/kvm.h                   |    9 +
 virt/kvm/kvm_main.c                        |    9 +
 11 files changed, 713 insertions(+), 201 deletions(-)
 create mode 100644 Documentation/virtual/kvm/devices/mpic.txt
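
The IRQ numbering rule for KVM_DEV_MPIC_GRP_IRQ_ACTIVE (byte offset of the
IVPR from EIVPR0, divided by 32) is easy to get off by one register, so
here is a small worked sketch.  The function name, the stride macro and
the EIVPR0 offset of 0x10000 used in the example are hypothetical; the
real EIVPR0 offset should be taken from the MPIC register map, not from
this snippet:

```c
#include <assert.h>
#include <stdint.h>

/* Bytes between consecutive IVPRs, per mpic.txt. Hypothetical name. */
#define MPIC_IVPR_STRIDE 32u

/*
 * Convert the register-space offset of an IVPR into the IRQ number used
 * as "attr".  Returns -1 if the offset lies below EIVPR0 or does not land
 * on an IVPR boundary.
 */
static int mpic_irq_from_ivpr(uint32_t ivpr_off, uint32_t eivpr0_off)
{
	uint32_t delta;

	if (ivpr_off < eivpr0_off)
		return -1;

	delta = ivpr_off - eivpr0_off;
	if (delta % MPIC_IVPR_STRIDE)
		return -1;

	return (int)(delta / MPIC_IVPR_STRIDE);
}

/* Worked values; 0x10000 is an illustrative EIVPR0 offset only. */
static void mpic_irq_example(void)
{
	const uint32_t eivpr0 = 0x10000;

	assert(mpic_irq_from_ivpr(eivpr0, eivpr0) == 0);
	assert(mpic_irq_from_ivpr(eivpr0 + 0x20, eivpr0) == 1);
	assert(mpic_irq_from_ivpr(eivpr0 + 0x200, eivpr0) == 16);
	assert(mpic_irq_from_ivpr(eivpr0 + 0x10, eivpr0) == -1);
}
```

Offsets that fall between two IVPRs are rejected rather than rounded down,
so a caller that passes some other per-source register by mistake gets an
error instead of a neighbouring IRQ.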

diff --git a/Documentation/virtual/kvm/devices/mpic.txt b/Documentation/virtual/kvm/devices/mpic.txt
new file mode 100644
index 0000000..79e000a
--- /dev/null
+++ b/Documentation/virtual/kvm/devices/mpic.txt
@@ -0,0 +1,37 @@
+MPIC interrupt controller
+=========================
+
+Device types supported:
+  KVM_DEV_TYPE_FSL_MPIC_20     Freescale MPIC v2.0
+  KVM_DEV_TYPE_FSL_MPIC_42     Freescale MPIC v4.2
+
+Only one MPIC instance, of any type, may be instantiated.  The created
+MPIC will act as the system interrupt controller, connecting to each
+vcpu's interrupt inputs.
+
+Groups:
+  KVM_DEV_MPIC_GRP_MISC
+  Attributes:
+    KVM_DEV_MPIC_BASE_ADDR (rw, 64-bit)
+      Base address of the 256 KiB MPIC register space.  Must be
+      naturally aligned.  A value of zero disables the mapping.
+      Reset value is zero.
+
+  KVM_DEV_MPIC_GRP_REGISTER (rw, 32-bit)
+    Access an MPIC register, as if the access were made from the guest. 
+    "attr" is the byte offset into the MPIC register space.  Accesses
+    must be 4-byte aligned.
+
+    MSIs may be signaled by using this attribute group to write
+    to the relevant MSIIR.
+
+  KVM_DEV_MPIC_GRP_IRQ_ACTIVE (rw, 32-bit)
+    IRQ input line for each standard openpic source.  0 is inactive and 1
+    is active, regardless of interrupt sense.
+
+    For edge-triggered interrupts:  Writing 1 is considered an activating
+    edge, and writing 0 is ignored.  Reading returns 1 if a previously
+    signaled edge has not been acknowledged, and 0 otherwise.
+
+    "attr" is the IRQ number.  IRQ numbers for standard sources are the
+    byte offset of the relevant IVPR from EIVPR0, divided by 32.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index e0caae2..6713327 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -359,6 +359,11 @@ struct kvmppc_slb {
 #define KVMPPC_BOOKE_MAX_IAC	4
 #define KVMPPC_BOOKE_MAX_DAC	2
 
+/* KVMPPC_EPR_USER takes precedence over KVMPPC_EPR_KERNEL */
+#define KVMPPC_EPR_NONE		0 /* EPR not supported */
+#define KVMPPC_EPR_USER		1 /* exit to userspace to fill EPR */
+#define KVMPPC_EPR_KERNEL	2 /* in-kernel irqchip */
+
 struct kvmppc_booke_debug_reg {
 	u32 dbcr0;
 	u32 dbcr1;
@@ -525,7 +530,7 @@ struct kvm_vcpu_arch {
 	u8 sane;
 	u8 cpu_type;
 	u8 hcall_needed;
-	u8 epr_enabled;
+	u8 epr_flags; /* KVMPPC_EPR_xxx */
 	u8 epr_needed;
 
 	u32 cpr0_cfgaddr; /* holds the last set cpr0_cfgaddr */
@@ -595,5 +600,6 @@ struct kvm_vcpu_arch {
 #define KVM_MMIO_REG_FQPR	0x0060
 
 #define __KVM_HAVE_ARCH_WQP
+#define __KVM_HAVE_CREATE_DEVICE
 
 #endif /* __POWERPC_KVM_HOST_H__ */
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index f44932c..20b2a5e 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -164,6 +164,8 @@ extern int kvmppc_prepare_to_enter(struct kvm_vcpu *vcpu);
 
 extern int kvm_vm_ioctl_get_htab_fd(struct kvm *kvm, struct kvm_get_htab_fd *);
 
+int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu, struct kvm_interrupt *irq);
+
 /*
  * Cuts out inst bits with ordering according to spec.
  * That means the leftmost bit is zero. All given bits are included.
@@ -270,6 +272,8 @@ static inline void kvmppc_set_epr(struct kvm_vcpu *vcpu, u32 epr)
 #endif
 }
 
+void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu);
+
 int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
 			      struct kvm_config_tlb *cfg);
 int kvm_vcpu_ioctl_dirty_tlb(struct kvm_vcpu *vcpu,
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 63c67ec..a87139b 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -151,6 +151,11 @@ config KVM_E500MC
 
 	  If unsure, say N.
 
+config KVM_MPIC
+	bool "KVM in-kernel MPIC emulation"
+	depends on KVM
+
+
 source drivers/vhost/Kconfig
 
 endif # VIRTUALIZATION
diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
index b772ede..4a2277a 100644
--- a/arch/powerpc/kvm/Makefile
+++ b/arch/powerpc/kvm/Makefile
@@ -103,6 +103,8 @@ kvm-book3s_32-objs := \
 	book3s_32_mmu.o
 kvm-objs-$(CONFIG_KVM_BOOK3S_32) := $(kvm-book3s_32-objs)
 
+kvm-objs-$(CONFIG_KVM_MPIC) += mpic.o
+
 kvm-objs := $(kvm-objs-m) $(kvm-objs-y)
 
 obj-$(CONFIG_KVM_440) += kvm.o
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 58057d6..cddc6b3 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -346,7 +346,7 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu *vcpu,
 		keep_irq = true;
 	}
 
-	if ((priority == BOOKE_IRQPRIO_EXTERNAL) && vcpu->arch.epr_enabled)
+	if ((priority == BOOKE_IRQPRIO_EXTERNAL) && vcpu->arch.epr_flags)
 		update_epr = true;
 
 	switch (priority) {
@@ -427,8 +427,12 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu *vcpu,
 			set_guest_esr(vcpu, vcpu->arch.queued_esr);
 		if (update_dear = true)
 			set_guest_dear(vcpu, vcpu->arch.queued_dear);
-		if (update_epr = true)
-			kvm_make_request(KVM_REQ_EPR_EXIT, vcpu);
+		if (update_epr = true) {
+			if (vcpu->arch.epr_flags & KVMPPC_EPR_USER)
+				kvm_make_request(KVM_REQ_EPR_EXIT, vcpu);
+			else if (vcpu->arch.epr_flags & KVMPPC_EPR_KERNEL)
+				kvmppc_mpic_set_epr(vcpu);
+		}
 
 		new_msr &= msr_mask;
 #if defined(CONFIG_64BIT)
diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index 1df67ae..9aace50 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -23,6 +23,19 @@
  * THE SOFTWARE.
  */
 
+#include <linux/slab.h>
+#include <linux/mutex.h>
+#include <linux/kvm_host.h>
+#include <linux/errno.h>
+#include <linux/fs.h>
+#include <linux/anon_inodes.h>
+#include <asm/uaccess.h>
+#include <asm/mpic.h>
+#include <asm/kvm_para.h>
+#include <asm/kvm_host.h>
+#include <asm/kvm_ppc.h>
+#include "iodev.h"
+
 #define MAX_CPU     32
 #define MAX_SRC     256
 #define MAX_TMR     4
@@ -36,6 +49,7 @@
 #define OPENPIC_FLAG_ILR          (2 << 0)
 
 /* OpenPIC address map */
+#define OPENPIC_REG_SIZE             0x40000
 #define OPENPIC_GLB_REG_START        0x0
 #define OPENPIC_GLB_REG_SIZE         0x10F0
 #define OPENPIC_TMR_REG_START        0x10F0
@@ -89,6 +103,7 @@ static struct fsl_mpic_info fsl_mpic_42 = {
 #define ILR_INTTGT_INT    0x00
 #define ILR_INTTGT_CINT   0x01	/* critical */
 #define ILR_INTTGT_MCP    0x02	/* machine check */
+#define NUM_OUTPUTS       3
 
 #define MSIIR_OFFSET       0x140
 #define MSIIR_SRS_SHIFT    29
@@ -98,18 +113,14 @@ static struct fsl_mpic_info fsl_mpic_42 = {
 
 static int get_current_cpu(void)
 {
-	CPUState *cpu_single_cpu;
-
-	if (!cpu_single_env)
-		return -1;
-
-	cpu_single_cpu = ENV_GET_CPU(cpu_single_env);
-	return cpu_single_cpu->cpu_index;
+	struct kvm_vcpu *vcpu = current->thread.kvm_vcpu;
+	return vcpu ? vcpu->vcpu_id : -1;
 }
 
-static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx);
-static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
-				       uint32_t val, int idx);
+static int openpic_cpu_write_internal(void *opaque, gpa_t addr,
+				      u32 val, int idx);
+static int openpic_cpu_read_internal(void *opaque, gpa_t addr,
+				     u32 *ptr, int idx);
 
 enum irq_type {
 	IRQ_TYPE_NORMAL = 0,
@@ -131,7 +142,7 @@ struct irq_source {
 	uint32_t idr;		/* IRQ destination register */
 	uint32_t destmask;	/* bitmap of CPU destinations */
 	int last_cpu;
-	int output;		/* IRQ level, e.g. OPENPIC_OUTPUT_INT */
+	int output;		/* IRQ level, e.g. ILR_INTTGT_INT */
 	int pending;		/* TRUE if IRQ is pending */
 	enum irq_type type;
 	bool level:1;		/* level-triggered */
@@ -158,16 +169,28 @@ struct irq_source {
 #define IDR_CI      0x40000000	/* critical interrupt */
 
 struct irq_dest {
+	struct kvm_vcpu *vcpu;
+
 	int32_t ctpr;		/* CPU current task priority */
 	struct irq_queue raised;
 	struct irq_queue servicing;
-	qemu_irq *irqs;
 
 	/* Count of IRQ sources asserting on non-INT outputs */
-	uint32_t outputs_active[OPENPIC_OUTPUT_NB];
+	uint32_t outputs_active[NUM_OUTPUTS];
 };
 
+struct openpic;
+
 struct openpic {
+	struct kvm *kvm;
+	struct kvm_io_device mmio;
+	struct list_head mmio_regions;
+	atomic_t users;
+	bool mmio_mapped;
+
+	gpa_t reg_base;
+	spinlock_t lock;
+
 	/* Behavior control */
 	struct fsl_mpic_info *fsl;
 	uint32_t model;
@@ -208,6 +231,47 @@ struct openpic {
 	uint32_t irq_msi;
 };
 
+
+static void mpic_irq_raise(struct openpic *opp, struct irq_dest *dst,
+			   int output)
+{
+	struct kvm_interrupt irq = {
+		.irq = KVM_INTERRUPT_SET_LEVEL,
+	};
+
+	if (!dst->vcpu) {
+		pr_debug("%s: destination cpu %d does not exist\n",
+			 __func__, (int)(dst - &opp->dst[0]));
+		return;
+	}
+
+	pr_debug("%s: cpu %d output %d\n", __func__, dst->vcpu->vcpu_id,
+		output);
+
+	if (output != ILR_INTTGT_INT)	/* TODO */
+		return;
+
+	kvm_vcpu_ioctl_interrupt(dst->vcpu, &irq);
+}
+
+static void mpic_irq_lower(struct openpic *opp, struct irq_dest *dst,
+			   int output)
+{
+	if (!dst->vcpu) {
+		pr_debug("%s: destination cpu %d does not exist\n",
+			 __func__, (int)(dst - &opp->dst[0]));
+		return;
+	}
+
+	pr_debug("%s: cpu %d output %d\n", __func__, dst->vcpu->vcpu_id,
+		output);
+
+	if (output != ILR_INTTGT_INT)	/* TODO */
+		return;
+
+	kvmppc_core_dequeue_external(dst->vcpu);
+}
+
 static inline void IRQ_setbit(struct irq_queue *q, int n_IRQ)
 {
 	set_bit(n_IRQ, q->queue);
@@ -268,7 +332,7 @@ static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 	pr_debug("%s: IRQ %d active %d was %d\n",
 		__func__, n_IRQ, active, was_active);
 
-	if (src->output != OPENPIC_OUTPUT_INT) {
+	if (src->output != ILR_INTTGT_INT) {
 		pr_debug("%s: output %d irq %d active %d was %d count %d\n",
 			__func__, src->output, n_IRQ, active, was_active,
 			dst->outputs_active[src->output]);
@@ -282,14 +346,14 @@ static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 			    dst->outputs_active[src->output]++ == 0) {
 				pr_debug("%s: Raise OpenPIC output %d cpu %d irq %d\n",
 					__func__, src->output, n_CPU, n_IRQ);
-				qemu_irq_raise(dst->irqs[src->output]);
+				mpic_irq_raise(opp, dst, src->output);
 			}
 		} else {
 			if (was_active &&
 			    --dst->outputs_active[src->output] == 0) {
 				pr_debug("%s: Lower OpenPIC output %d cpu %d irq %d\n",
 					__func__, src->output, n_CPU, n_IRQ);
-				qemu_irq_lower(dst->irqs[src->output]);
+				mpic_irq_lower(opp, dst, src->output);
 			}
 		}
 
@@ -322,8 +386,7 @@ static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 		} else {
 			pr_debug("%s: Raise OpenPIC INT output cpu %d irq %d/%d\n",
 				__func__, n_CPU, n_IRQ, dst->raised.next);
-			qemu_irq_raise(opp->dst[n_CPU].
-				       irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_raise(opp, dst, ILR_INTTGT_INT);
 		}
 	} else {
 		IRQ_get_next(opp, &dst->servicing);
@@ -338,8 +401,7 @@ static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 			pr_debug("%s: IRQ %d inactive, current prio %d/%d, CPU %d\n",
 				__func__, n_IRQ, dst->ctpr,
 				dst->servicing.priority, n_CPU);
-			qemu_irq_lower(opp->dst[n_CPU].
-				       irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_lower(opp, dst, ILR_INTTGT_INT);
 		}
 	}
 }
@@ -415,8 +477,8 @@ static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 	struct irq_source *src;
 
 	if (n_IRQ >= MAX_IRQ) {
-		pr_err("%s: IRQ %d out of range\n", __func__, n_IRQ);
-		abort();
+		WARN_ONCE(1, "%s: IRQ %d out of range\n", __func__, n_IRQ);
+		return;
 	}
 
 	src = &opp->src[n_IRQ];
@@ -433,7 +495,7 @@ static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 			openpic_update_irq(opp, n_IRQ);
 		}
 
-		if (src->output != OPENPIC_OUTPUT_INT) {
+		if (src->output != ILR_INTTGT_INT) {
 			/* Edge-triggered interrupts shouldn't be used
 			 * with non-INT delivery, but just in case,
 			 * try to make it do something sane rather than
@@ -446,15 +508,13 @@ static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 	}
 }
 
-static void openpic_reset(DeviceState *d)
+static void openpic_reset(struct openpic *opp)
 {
-	struct openpic *opp = FROM_SYSBUS(typeof(*opp), SYS_BUS_DEVICE(d));
 	int i;
 
 	opp->gcr = GCR_RESET;
 	/* Initialise controller registers */
 	opp->frr = ((opp->nb_irqs - 1) << FRR_NIRQ_SHIFT) |
-	    ((opp->nb_cpus - 1) << FRR_NCPU_SHIFT) |
 	    (opp->vid << FRR_VID_SHIFT);
 
 	opp->pir = 0;
@@ -504,7 +564,7 @@ static inline uint32_t read_IRQreg_idr(struct openpic *opp, int n_IRQ)
 static inline uint32_t read_IRQreg_ilr(struct openpic *opp, int n_IRQ)
 {
 	if (opp->flags & OPENPIC_FLAG_ILR)
-		return output_to_inttgt(opp->src[n_IRQ].output);
+		return opp->src[n_IRQ].output;
 
 	return 0xffffffff;
 }
@@ -539,7 +599,7 @@ static inline void write_IRQreg_idr(struct openpic *opp, int n_IRQ,
 					__func__);
 			}
 
-			src->output = OPENPIC_OUTPUT_CINT;
+			src->output = ILR_INTTGT_CINT;
 			src->nomask = true;
 			src->destmask = 0;
 
@@ -550,7 +610,7 @@ static inline void write_IRQreg_idr(struct openpic *opp, int n_IRQ,
 					src->destmask |= 1UL << i;
 			}
 		} else {
-			src->output = OPENPIC_OUTPUT_INT;
+			src->output = ILR_INTTGT_INT;
 			src->nomask = false;
 			src->destmask = src->idr & normal_mask;
 		}
@@ -565,7 +625,7 @@ static inline void write_IRQreg_ilr(struct openpic *opp, int n_IRQ,
 	if (opp->flags & OPENPIC_FLAG_ILR) {
 		struct irq_source *src = &opp->src[n_IRQ];
 
-		src->output = inttgt_to_output(val & ILR_INTTGT_MASK);
+		src->output = val & ILR_INTTGT_MASK;
 		pr_debug("Set ILR %d to 0x%08x, output %d\n", n_IRQ, src->idr,
 			src->output);
 
@@ -614,34 +674,22 @@ static inline void write_IRQreg_ivpr(struct openpic *opp, int n_IRQ,
 
 static void openpic_gcr_write(struct openpic *opp, uint64_t val)
 {
-	bool mpic_proxy = false;
-
 	if (val & GCR_RESET) {
-		openpic_reset(&opp->busdev.qdev);
+		openpic_reset(opp);
 		return;
 	}
 
 	opp->gcr &= ~opp->mpic_mode_mask;
 	opp->gcr |= val & opp->mpic_mode_mask;
-
-	/* Set external proxy mode */
-	if ((val & opp->mpic_mode_mask) == GCR_MODE_PROXY)
-		mpic_proxy = true;
-
-	ppce500_set_mpic_proxy(mpic_proxy);
 }
 
-static void openpic_gbl_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned len)
+static int openpic_gbl_write(void *opaque, gpa_t addr, u32 val)
 {
 	struct openpic *opp = opaque;
-	struct irq_dest *dst;
-	int idx;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
-		__func__, addr, val);
+	pr_debug("%s: addr %#llx <= %08x\n", __func__, addr, val);
 	if (addr & 0xF)
-		return;
+		return 0;
 
 	switch (addr) {
 	case 0x00:	/* Block Revision Register1 (BRR1) is Readonly */
@@ -664,22 +712,11 @@ static void openpic_gbl_write(void *opaque, gpa_t addr, uint64_t val,
 	case 0x1080:		/* VIR */
 		break;
 	case 0x1090:		/* PIR */
-		for (idx = 0; idx < opp->nb_cpus; idx++) {
-			if ((val & (1 << idx)) && !(opp->pir & (1 << idx))) {
-				pr_debug("Raise OpenPIC RESET output for CPU %d\n",
-					idx);
-				dst = &opp->dst[idx];
-				qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_RESET]);
-			} else if (!(val & (1 << idx)) &&
-				   (opp->pir & (1 << idx))) {
-				pr_debug("Lower OpenPIC RESET output for CPU %d\n",
-					idx);
-				dst = &opp->dst[idx];
-				qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_RESET]);
-			}
-		}
-		opp->pir = val;
-		break;
+		/*
+		 * This register is used to reset a CPU core --
+		 * let userspace handle it.
+		 */
+		return 1;
 	case 0x10A0:		/* IPI_IVPR */
 	case 0x10B0:
 	case 0x10C0:
@@ -695,21 +732,24 @@ static void openpic_gbl_write(void *opaque, gpa_t addr, uint64_t val,
 	default:
 		break;
 	}
+
+	return 0;
 }
 
-static uint64_t openpic_gbl_read(void *opaque, gpa_t addr, unsigned len)
+static int openpic_gbl_read(void *opaque, gpa_t addr, u32 *ptr)
 {
 	struct openpic *opp = opaque;
-	uint32_t retval;
+	u32 retval;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 	retval = 0xFFFFFFFF;
 	if (addr & 0xF)
-		return retval;
+		goto out;
 
 	switch (addr) {
 	case 0x1000:		/* FRR */
 		retval = opp->frr;
+		retval |= (opp->nb_cpus - 1) << FRR_NCPU_SHIFT;
 		break;
 	case 0x1020:		/* GCR */
 		retval = opp->gcr;
@@ -731,8 +771,8 @@ static uint64_t openpic_gbl_read(void *opaque, gpa_t addr, unsigned len)
 	case 0x90:
 	case 0xA0:
 	case 0xB0:
-		retval =
-		    openpic_cpu_read_internal(opp, addr, get_current_cpu());
+		retval = openpic_cpu_read_internal(opp, addr,
+			&retval, get_current_cpu());
 		break;
 	case 0x10A0:		/* IPI_IVPR */
 	case 0x10B0:
@@ -750,28 +790,28 @@ static uint64_t openpic_gbl_read(void *opaque, gpa_t addr, unsigned len)
 	default:
 		break;
 	}
-	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
-	return retval;
+out:
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
+	*ptr = retval;
+	return 0;
 }
 
-static void openpic_tmr_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned len)
+static int openpic_tmr_write(void *opaque, gpa_t addr, u32 val)
 {
 	struct openpic *opp = opaque;
 	int idx;
 
 	addr += 0x10f0;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
-		__func__, addr, val);
+	pr_debug("%s: addr %#llx <= %08x\n", __func__, addr, val);
 	if (addr & 0xF)
-		return;
+		return 0;
 
 	if (addr == 0x10f0) {
 		/* TFRR */
 		opp->tfrr = val;
-		return;
+		return 0;
 	}
 
 	idx = (addr >> 6) & 0x3;
@@ -795,15 +835,17 @@ static void openpic_tmr_write(void *opaque, gpa_t addr, uint64_t val,
 		write_IRQreg_idr(opp, opp->irq_tim0 + idx, val);
 		break;
 	}
+
+	return 0;
 }
 
-static uint64_t openpic_tmr_read(void *opaque, gpa_t addr, unsigned len)
+static int openpic_tmr_read(void *opaque, gpa_t addr, u32 *ptr)
 {
 	struct openpic *opp = opaque;
 	uint32_t retval = -1;
 	int idx;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 	if (addr & 0xF)
 		goto out;
 
@@ -813,6 +855,7 @@ static uint64_t openpic_tmr_read(void *opaque, gpa_t addr, unsigned len)
 		retval = opp->tfrr;
 		goto out;
 	}
+
 	switch (addr & 0x30) {
 	case 0x00:		/* TCCR */
 		retval = opp->timers[idx].tccr;
@@ -830,18 +873,16 @@ static uint64_t openpic_tmr_read(void *opaque, gpa_t addr, unsigned len)
 
 out:
 	pr_debug("%s: => 0x%08x\n", __func__, retval);
-
-	return retval;
+	*ptr = retval;
+	return 0;
 }
 
-static void openpic_src_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned len)
+static int openpic_src_write(void *opaque, gpa_t addr, u32 val)
 {
 	struct openpic *opp = opaque;
 	int idx;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
-		__func__, addr, val);
+	pr_debug("%s: addr %#llx <= %08x\n", __func__, addr, val);
 
 	addr = addr & 0xffff;
 	idx = addr >> 5;
@@ -857,15 +898,17 @@ static void openpic_src_write(void *opaque, gpa_t addr, uint64_t val,
 		write_IRQreg_ilr(opp, idx, val);
 		break;
 	}
+
+	return 0;
 }
 
-static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
+static int openpic_src_read(void *opaque, gpa_t addr, u32 *ptr)
 {
 	struct openpic *opp = opaque;
 	uint32_t retval;
 	int idx;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 	retval = 0xFFFFFFFF;
 
 	addr = addr & 0xffff;
@@ -884,20 +927,19 @@ static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
 	}
 
 	pr_debug("%s: => 0x%08x\n", __func__, retval);
-	return retval;
+	*ptr = retval;
+	return 0;
 }
 
-static void openpic_msi_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned size)
+static int openpic_msi_write(void *opaque, gpa_t addr, u32 val)
 {
 	struct openpic *opp = opaque;
 	int idx = opp->irq_msi;
 	int srs, ibs;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
-		__func__, addr, val);
+	pr_debug("%s: addr %#llx <= 0x%08x\n", __func__, addr, val);
 	if (addr & 0xF)
-		return;
+		return 0;
 
 	switch (addr) {
 	case MSIIR_OFFSET:
@@ -911,17 +953,19 @@ static void openpic_msi_write(void *opaque, gpa_t addr, uint64_t val,
 		/* most registers are read-only, thus ignored */
 		break;
 	}
+
+	return 0;
 }
 
-static uint64_t openpic_msi_read(void *opaque, gpa_t addr, unsigned size)
+static int openpic_msi_read(void *opaque, gpa_t addr, u32 *ptr)
 {
 	struct openpic *opp = opaque;
-	uint64_t r = 0;
+	uint32_t r = 0;
 	int i, srs;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 	if (addr & 0xF)
-		return -1;
+		return 1;
 
 	srs = addr >> 4;
 
@@ -945,45 +989,47 @@ static uint64_t openpic_msi_read(void *opaque, gpa_t addr, unsigned size)
 		break;
 	}
 
-	return r;
+	pr_debug("%s: => 0x%08x\n", __func__, r);
+	*ptr = r;
+	return 0;
 }
 
-static uint64_t openpic_summary_read(void *opaque, gpa_t addr, unsigned size)
+static int openpic_summary_read(void *opaque, gpa_t addr, u32 *ptr)
 {
-	uint64_t r = 0;
+	uint32_t r = 0;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 
 	/* TODO: EISR/EIMR */
 
-	return r;
+	*ptr = r;
+	return 0;
 }
 
-static void openpic_summary_write(void *opaque, gpa_t addr, uint64_t val,
-				  unsigned size)
+static int openpic_summary_write(void *opaque, gpa_t addr, u32 val)
 {
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
-		__func__, addr, val);
+	pr_debug("%s: addr %#llx <= 0x%08x\n", __func__, addr, val);
 
 	/* TODO: EISR/EIMR */
+	return 0;
 }
 
-static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
-				       uint32_t val, int idx)
+static int openpic_cpu_write_internal(void *opaque, gpa_t addr,
+				      u32 val, int idx)
 {
 	struct openpic *opp = opaque;
 	struct irq_source *src;
 	struct irq_dest *dst;
 	int s_IRQ, n_IRQ;
 
-	pr_debug("%s: cpu %d addr %#" HWADDR_PRIx " <= 0x%08x\n", __func__, idx,
+	pr_debug("%s: cpu %d addr %#llx <= 0x%08x\n", __func__, idx,
 		addr, val);
 
 	if (idx < 0)
-		return;
+		return 0;
 
 	if (addr & 0xF)
-		return;
+		return 0;
 
 	dst = &opp->dst[idx];
 	addr &= 0xFF0;
@@ -1008,11 +1054,11 @@ static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
 		if (dst->raised.priority <= dst->ctpr) {
 			pr_debug("%s: Lower OpenPIC INT output cpu %d due to ctpr\n",
 				__func__, idx);
-			qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_lower(opp, dst, ILR_INTTGT_INT);
 		} else if (dst->raised.priority > dst->servicing.priority) {
 			pr_debug("%s: Raise OpenPIC INT output cpu %d irq %d\n",
 				__func__, idx, dst->raised.next);
-			qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_raise(opp, dst, ILR_INTTGT_INT);
 		}
 
 		break;
@@ -1043,18 +1089,22 @@ static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
 		     IVPR_PRIORITY(src->ivpr) > dst->servicing.priority)) {
 			pr_debug("Raise OpenPIC INT output cpu %d irq %d\n",
 				idx, n_IRQ);
-			qemu_irq_raise(opp->dst[idx].irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_raise(opp, dst, ILR_INTTGT_INT);
 		}
 		break;
 	default:
 		break;
 	}
+
+	return 0;
 }
 
-static void openpic_cpu_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned len)
+static int openpic_cpu_write(void *opaque, gpa_t addr, u32 val)
 {
-	openpic_cpu_write_internal(opaque, addr, val, (addr & 0x1f000) >> 12);
+	struct openpic *opp = opaque;
+
+	return openpic_cpu_write_internal(opp, addr, val,
+					 (addr & 0x1f000) >> 12);
 }
 
 static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
@@ -1064,7 +1114,7 @@ static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
 	int retval, irq;
 
 	pr_debug("Lower OpenPIC INT output\n");
-	qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
+	mpic_irq_lower(opp, dst, ILR_INTTGT_INT);
 
 	irq = IRQ_get_next(opp, &dst->raised);
 	pr_debug("IACK: irq=%d\n", irq);
@@ -1107,20 +1157,35 @@ static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
 	return retval;
 }
 
-static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx)
+void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu)
+{
+	struct openpic *opp = vcpu->arch.irqchip_priv;
+	int cpu = vcpu->vcpu_id;
+	unsigned long flags;
+
+	spin_lock_irqsave(&opp->lock, flags);
+
+	if ((opp->gcr & opp->mpic_mode_mask) == GCR_MODE_PROXY)
+		kvmppc_set_epr(vcpu, openpic_iack(opp, &opp->dst[cpu], cpu));
+
+	spin_unlock_irqrestore(&opp->lock, flags);
+}
+
+static int openpic_cpu_read_internal(void *opaque, gpa_t addr,
+				     u32 *ptr, int idx)
 {
 	struct openpic *opp = opaque;
 	struct irq_dest *dst;
 	uint32_t retval;
 
-	pr_debug("%s: cpu %d addr %#" HWADDR_PRIx "\n", __func__, idx, addr);
+	pr_debug("%s: cpu %d addr %#llx\n", __func__, idx, addr);
 	retval = 0xFFFFFFFF;
 
 	if (idx < 0)
-		return retval;
+		goto out;
 
 	if (addr & 0xF)
-		return retval;
+		goto out;
 
 	dst = &opp->dst[idx];
 	addr &= 0xFF0;
@@ -1142,49 +1207,67 @@ static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx)
 	}
 	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
-	return retval;
+out:
+	*ptr = retval;
+	return 0;
 }
 
-static uint64_t openpic_cpu_read(void *opaque, gpa_t addr, unsigned len)
+static int openpic_cpu_read(void *opaque, gpa_t addr, u32 *ptr)
 {
-	return openpic_cpu_read_internal(opaque, addr, (addr & 0x1f000) >> 12);
+	struct openpic *opp = opaque;
+
+	return openpic_cpu_read_internal(opp, addr, ptr,
+					 (addr & 0x1f000) >> 12);
 }
 
-static const struct kvm_io_device_ops openpic_glb_ops_be = {
+struct mem_reg {
+	struct list_head list;
+	int (*read)(void *opaque, gpa_t addr, u32 *ptr);
+	int (*write)(void *opaque, gpa_t addr, u32 val);
+	gpa_t start_addr;
+	int size;
+};
+
+static struct mem_reg openpic_gbl_mmio = {
 	.write = openpic_gbl_write,
 	.read = openpic_gbl_read,
+	.start_addr = OPENPIC_GLB_REG_START,
+	.size = OPENPIC_GLB_REG_SIZE,
 };
 
-static const struct kvm_io_device_ops openpic_tmr_ops_be = {
+static struct mem_reg openpic_tmr_mmio = {
 	.write = openpic_tmr_write,
 	.read = openpic_tmr_read,
+	.start_addr = OPENPIC_TMR_REG_START,
+	.size = OPENPIC_TMR_REG_SIZE,
 };
 
-static const struct kvm_io_device_ops openpic_cpu_ops_be = {
+static struct mem_reg openpic_cpu_mmio = {
 	.write = openpic_cpu_write,
 	.read = openpic_cpu_read,
+	.start_addr = OPENPIC_CPU_REG_START,
+	.size = OPENPIC_CPU_REG_SIZE,
 };
 
-static const struct kvm_io_device_ops openpic_src_ops_be = {
+static struct mem_reg openpic_src_mmio = {
 	.write = openpic_src_write,
 	.read = openpic_src_read,
+	.start_addr = OPENPIC_SRC_REG_START,
+	.size = OPENPIC_SRC_REG_SIZE,
 };
 
-static const struct kvm_io_device_ops openpic_msi_ops_be = {
+static struct mem_reg openpic_msi_mmio = {
 	.read = openpic_msi_read,
 	.write = openpic_msi_write,
+	.start_addr = OPENPIC_MSI_REG_START,
+	.size = OPENPIC_MSI_REG_SIZE,
 };
 
-static const struct kvm_io_device_ops openpic_summary_ops_be = {
+static struct mem_reg openpic_summary_mmio = {
 	.read = openpic_summary_read,
 	.write = openpic_summary_write,
-};
-
-struct mem_reg {
-	const char *name;
-	const struct kvm_io_device_ops *ops;
-	gpa_t start_addr;
-	int size;
+	.start_addr = OPENPIC_SUMMARY_REG_START,
+	.size = OPENPIC_SUMMARY_REG_SIZE,
 };
 
 static void fsl_common_init(struct openpic *opp)
@@ -1192,6 +1275,9 @@ static void fsl_common_init(struct openpic *opp)
 	int i;
 	int virq = MAX_SRC;
 
+	list_add(&openpic_msi_mmio.list, &opp->mmio_regions);
+	list_add(&openpic_summary_mmio.list, &opp->mmio_regions);
+
 	opp->vid = VID_REVISION_1_2;
 	opp->vir = VIR_GENERIC;
 	opp->vector_mask = 0xFFFF;
@@ -1205,11 +1291,10 @@ static void fsl_common_init(struct openpic *opp)
 	opp->irq_tim0 = virq;
 	virq += MAX_TMR;
 
-	assert(virq <= MAX_IRQ);
+	BUG_ON(virq > MAX_IRQ);
 
 	opp->irq_msi = 224;
 
-	msi_supported = true;
 	for (i = 0; i < opp->fsl->max_ext; i++)
 		opp->src[i].level = false;
 
@@ -1226,63 +1311,406 @@ static void fsl_common_init(struct openpic *opp)
 	}
 }
 
-static void map_list(struct openpic *opp, const struct mem_reg *list,
-		     int *count)
+static int kvm_mpic_read_internal(struct openpic *opp, gpa_t addr, u32 *ptr)
 {
-	while (list->name) {
-		assert(*count < ARRAY_SIZE(opp->sub_io_mem));
+	struct list_head *node;
 
-		memory_region_init_io(&opp->sub_io_mem[*count], list->ops, opp,
-				      list->name, list->size);
+	list_for_each(node, &opp->mmio_regions) {
+		struct mem_reg *mr = list_entry(node, struct mem_reg, list);
 
-		memory_region_add_subregion(&opp->mem, list->start_addr,
-					    &opp->sub_io_mem[*count]);
+		if (mr->start_addr > addr || addr >= mr->start_addr + mr->size)
+			continue;
 
-		(*count)++;
-		list++;
+		return mr->read(opp, addr - mr->start_addr, ptr);
 	}
+
+	return 1;
 }
 
-static int openpic_init(SysBusDevice *dev)
+static int kvm_mpic_write_internal(struct openpic *opp, gpa_t addr, u32 val)
 {
-	struct openpic *opp = FROM_SYSBUS(typeof(*opp), dev);
-	int i, j;
-	int list_count = 0;
-	static const struct mem_reg list_le[] = {
-		{"glb", &openpic_glb_ops_le,
-		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
-		{"tmr", &openpic_tmr_ops_le,
-		 OPENPIC_TMR_REG_START, OPENPIC_TMR_REG_SIZE},
-		{"src", &openpic_src_ops_le,
-		 OPENPIC_SRC_REG_START, OPENPIC_SRC_REG_SIZE},
-		{"cpu", &openpic_cpu_ops_le,
-		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
-		{NULL}
-	};
-	static const struct mem_reg list_be[] = {
-		{"glb", &openpic_glb_ops_be,
-		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
-		{"tmr", &openpic_tmr_ops_be,
-		 OPENPIC_TMR_REG_START, OPENPIC_TMR_REG_SIZE},
-		{"src", &openpic_src_ops_be,
-		 OPENPIC_SRC_REG_START, OPENPIC_SRC_REG_SIZE},
-		{"cpu", &openpic_cpu_ops_be,
-		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
-		{NULL}
-	};
-	static const struct mem_reg list_fsl[] = {
-		{"msi", &openpic_msi_ops_be,
-		 OPENPIC_MSI_REG_START, OPENPIC_MSI_REG_SIZE},
-		{"summary", &openpic_summary_ops_be,
-		 OPENPIC_SUMMARY_REG_START, OPENPIC_SUMMARY_REG_SIZE},
-		{NULL}
-	};
+	struct list_head *node;
 
-	memory_region_init(&opp->mem, "openpic", 0x40000);
+	list_for_each(node, &opp->mmio_regions) {
+		struct mem_reg *mr = list_entry(node, struct mem_reg, list);
 
-	switch (opp->model) {
-	case OPENPIC_MODEL_FSL_MPIC_20:
+		if (mr->start_addr > addr || addr >= mr->start_addr + mr->size)
+			continue;
+
+		return mr->write(opp, addr - mr->start_addr, val);
+	}
+
+	return 1;
+}
+
+static int kvm_mpic_read(struct kvm_io_device *this, gpa_t addr,
+			 int len, void *ptr)
+{
+	struct openpic *opp = container_of(this, struct openpic, mmio);
+	int ret;
+
+	/*
+	 * Technically only 32-bit accesses are allowed, but be nice to
+	 * people dumping registers a byte at a time -- it works in real
+	 * hardware (reads only, not writes).
+	 */
+	if (len == 4) {
+		if (addr & 3) {
+			pr_debug("%s: bad alignment %llx/%d\n",
+				 __func__, addr, len);
+			return -EINVAL;
+		}
+
+		spin_lock_irq(&opp->lock);
+		ret = kvm_mpic_read_internal(opp, addr - opp->reg_base, ptr);
+		spin_unlock_irq(&opp->lock);
+
+		pr_debug("%s: addr %llx ret %d len 4 val %x\n",
+			 __func__, addr, ret, *(const u32 *)ptr);
+	} else if (len == 1) {
+		union {
+			u32 val;
+			u8 bytes[4];
+		} u;
+
+		spin_lock_irq(&opp->lock);
+		ret = kvm_mpic_read_internal(opp, addr - opp->reg_base, &u.val);
+		spin_unlock_irq(&opp->lock);
+
+		*(u8 *)ptr = u.bytes[addr & 3];
+
+		pr_debug("%s: addr %llx ret %d len 1 val %x\n",
+			 __func__, addr, ret, *(const u8 *)ptr);
+	} else {
+		pr_debug("%s: bad length %d\n", __func__, len);
+		return -EINVAL;
+	}
+
+	return ret;
+}
+
+static int kvm_mpic_write(struct kvm_io_device *this, gpa_t addr,
+			  int len, const void *ptr)
+{
+	struct openpic *opp = container_of(this, struct openpic, mmio);
+	int ret;
+
+	if (len != 4) {
+		pr_debug("%s: bad length %d\n", __func__, len);
+		return -EOPNOTSUPP;
+	}
+	if (addr & 3) {
+		pr_debug("%s: bad alignment %llx/%d\n", __func__, addr, len);
+		return -EOPNOTSUPP;
+	}
+
+	spin_lock_irq(&opp->lock);
+	ret = kvm_mpic_write_internal(opp, addr - opp->reg_base,
+				      *(const u32 *)ptr);
+	spin_unlock_irq(&opp->lock);
+
+	pr_debug("%s: addr %llx ret %d val %x\n",
+		 __func__, addr, ret, *(const u32 *)ptr);
+
+	return ret;
+}
+
+static void kvm_mpic_dtor(struct kvm_io_device *this)
+{
+	struct openpic *opp = container_of(this, struct openpic, mmio);
+
+	opp->mmio_mapped = false;
+}
+
+static const struct kvm_io_device_ops mpic_mmio_ops = {
+	.read = kvm_mpic_read,
+	.write = kvm_mpic_write,
+	.destructor = kvm_mpic_dtor,
+};
+
+static void map_mmio(struct openpic *opp)
+{
+	BUG_ON(opp->mmio_mapped);
+	opp->mmio_mapped = true;
+
+	kvm_iodevice_init(&opp->mmio, &mpic_mmio_ops);
+
+	kvm_io_bus_register_dev(opp->kvm, KVM_MMIO_BUS,
+				opp->reg_base, OPENPIC_REG_SIZE,
+				&opp->mmio);
+}
+
+static void unmap_mmio(struct openpic *opp)
+{
+	if (opp->mmio_mapped) {
+		opp->mmio_mapped = false;
+		kvm_io_bus_unregister_dev(opp->kvm, KVM_MMIO_BUS, &opp->mmio);
+	}
+}
+
+static int set_base_addr(struct openpic *opp, struct kvm_device_attr *attr)
+{
+	u64 base;
+
+	if (copy_from_user(&base, (u64 __user *)(long)attr->addr, sizeof(u64)))
+		return -EFAULT;
+
+	if (base & 0x3ffff) {
+		pr_debug("kvm mpic %s: KVM_DEV_MPIC_BASE_ADDR %08llx not aligned\n",
+			 __func__, base);
+		return -EINVAL;
+	}
+
+	if (base == opp->reg_base)
+		return 0;
+
+	mutex_lock(&opp->kvm->slots_lock);
+
+	unmap_mmio(opp);
+	opp->reg_base = base;
+
+	pr_debug("kvm mpic %s: KVM_DEV_MPIC_BASE_ADDR %08llx\n",
+		 __func__, base);
+
+	if (base == 0)
+		goto out;
+
+	map_mmio(opp);
+
+out:
+	mutex_unlock(&opp->kvm->slots_lock);
+	return 0;
+}
+
+#define ATTR_SET		0
+#define ATTR_GET		1
+
+static int access_reg(struct openpic *opp, gpa_t addr, u32 *val, int type)
+{
+	int ret;
+
+	if (addr & 3)
+		return -ENXIO;
+
+	if (type == ATTR_SET)
+		ret = kvm_mpic_write_internal(opp, addr, *val);
+	else
+		ret = kvm_mpic_read_internal(opp, addr, val);
+
+	pr_debug("%s: type %d addr %llx val %x\n", __func__, type, addr, *val);
+
+	return ret;
+}
+
+static int mpic_set_attr(struct openpic *opp, struct kvm_device_attr *attr)
+{
+	u32 attr32;
+
+	switch (attr->group) {
+	case KVM_DEV_MPIC_GRP_MISC:
+		switch (attr->attr) {
+		case KVM_DEV_MPIC_BASE_ADDR:
+			return set_base_addr(opp, attr);
+		}
+
+		break;
+
+	case KVM_DEV_MPIC_GRP_REGISTER:
+		if (copy_from_user(&attr32, (u32 __user *)(long)attr->addr,
+				   sizeof(u32)))
+			return -EFAULT;
+
+		return access_reg(opp, attr->attr, &attr32, ATTR_SET);
+
+	case KVM_DEV_MPIC_GRP_IRQ_ACTIVE:
+		if (attr->attr > MAX_SRC)
+			return -EINVAL;
+
+		if (copy_from_user(&attr32, (u32 __user *)(long)attr->addr,
+				   sizeof(u32)))
+			return -EFAULT;
+
+		if (attr32 != 0 && attr32 != 1)
+			return -EINVAL;
+
+		spin_lock_irq(&opp->lock);
+		openpic_set_irq(opp, attr->attr, attr32);
+		spin_unlock_irq(&opp->lock);
+		return 0;
+	}
+
+	return -ENXIO;
+}
+
+static int mpic_get_attr(struct openpic *opp, struct kvm_device_attr *attr)
+{
+	u64 attr64;
+	u32 attr32;
+	int ret;
+
+	switch (attr->group) {
+	case KVM_DEV_MPIC_GRP_MISC:
+		switch (attr->attr) {
+		case KVM_DEV_MPIC_BASE_ADDR:
+			mutex_lock(&opp->kvm->slots_lock);
+			attr64 = opp->reg_base;
+			mutex_unlock(&opp->kvm->slots_lock);
+
+			if (copy_to_user((u64 __user *)(long)attr->addr,
+					 &attr64, sizeof(u64)))
+				return -EFAULT;
+
+			return 0;
+		}
+
+		break;
+
+	case KVM_DEV_MPIC_GRP_REGISTER:
+		ret = access_reg(opp, attr->attr, &attr32, ATTR_GET);
+		if (ret)
+			return ret;
+
+		if (copy_to_user((u32 __user *)(long)attr->addr, &attr32,
+				 sizeof(u32)))
+			return -EFAULT;
+
+		return 0;
+
+	case KVM_DEV_MPIC_GRP_IRQ_ACTIVE:
+		if (attr->attr > MAX_SRC)
+			return -EINVAL;
+
+		attr32 = opp->src[attr->attr].pending;
+
+		if (copy_to_user((u32 __user *)(long)attr->addr, &attr32,
+				 sizeof(u32)))
+			return -EFAULT;
+
+		return 0;
+	}
+
+	return -ENXIO;
+}
+
+static int mpic_has_attr(struct openpic *opp, struct kvm_device_attr *attr)
+{
+	switch (attr->group) {
+	case KVM_DEV_MPIC_GRP_MISC:
+		switch (attr->attr) {
+		case KVM_DEV_MPIC_BASE_ADDR:
+			return 0;
+		}
+
+		break;
+
+	case KVM_DEV_MPIC_GRP_REGISTER:
+		return 0;
+
+	case KVM_DEV_MPIC_GRP_IRQ_ACTIVE:
+		if (attr->attr > MAX_SRC)
+			break;
+
+		return 0;
+	}
+
+	return -ENXIO;
+}
+
+static long kvm_mpic_ioctl(struct file *filp, unsigned int ioctl,
+			   unsigned long arg)
+{
+	struct openpic *opp = filp->private_data;
+	struct kvm_device_attr attr;
+	int (*accessor)(struct openpic *opp, struct kvm_device_attr *attr);
+
+	switch (ioctl) {
+	case KVM_SET_DEVICE_ATTR:
+		accessor = mpic_set_attr;
+		break;
+	case KVM_GET_DEVICE_ATTR:
+		accessor = mpic_get_attr;
+		break;
+	case KVM_HAS_DEVICE_ATTR:
+		accessor = mpic_has_attr;
+		break;
 	default:
+		return -ENOTTY;
+	}
+
+	if (copy_from_user(&attr, (void __user *)arg, sizeof(attr)))
+		return -EFAULT;
+
+	return accessor(opp, &attr);
+}
+
+static void mpic_destroy(struct openpic *opp)
+{
+	if (opp->mmio_mapped) {
+		/*
+		 * Normally we get unmapped by kvm_io_bus_destroy(),
+		 * which happens before the VCPUs release their references.
+		 *
+		 * Thus, we should only get here if no VCPUs took a reference
+		 * to us in the first place.
+		 */
+		WARN_ON(opp->nb_cpus != 0);
+		unmap_mmio(opp);
+	}
+
+	kfree(opp);
+}
+
+void mpic_put(void *priv)
+{
+	struct openpic *opp = priv;
+
+	if (atomic_dec_and_test(&opp->users))
+		mpic_destroy(opp);
+}
+
+static int kvm_mpic_release(struct inode *inode, struct file *filp)
+{
+	struct openpic *opp = filp->private_data;
+	struct kvm *kvm = opp->kvm;
+
+	mpic_put(opp);
+	kvm_put_kvm(kvm);
+	return 0;
+}
+
+static const struct file_operations kvm_mpic_fops = {
+	.unlocked_ioctl = kvm_mpic_ioctl,
+	.release = kvm_mpic_release,
+};
+
+int kvm_create_mpic(struct kvm *kvm, u32 type)
+{
+	struct openpic *opp;
+	int ret, fd;
+
+	opp = kzalloc(sizeof(struct openpic), GFP_KERNEL);
+	if (!opp)
+		return -ENOMEM;
+
+	fd = anon_inode_getfd("kvm-mpic", &kvm_mpic_fops, opp, O_RDWR);
+	if (fd < 0) {
+		ret = fd;
+		goto err;
+	}
+
+	opp->kvm = kvm;
+	opp->model = type;
+	atomic_set(&opp->users, 1);
+	spin_lock_init(&opp->lock);
+
+	INIT_LIST_HEAD(&opp->mmio_regions);
+	list_add(&openpic_gbl_mmio.list, &opp->mmio_regions);
+	list_add(&openpic_tmr_mmio.list, &opp->mmio_regions);
+	list_add(&openpic_src_mmio.list, &opp->mmio_regions);
+	list_add(&openpic_cpu_mmio.list, &opp->mmio_regions);
+
+	switch (opp->model) {
+	case KVM_DEV_TYPE_FSL_MPIC_20:
 		opp->fsl = &fsl_mpic_20;
 		opp->brr1 = 0x00400200;
 		opp->flags |= OPENPIC_FLAG_IDR_CRIT;
@@ -1290,12 +1718,10 @@ static int openpic_init(SysBusDevice *dev)
 		opp->mpic_mode_mask = GCR_MODE_MIXED;
 
 		fsl_common_init(opp);
-		map_list(opp, list_be, &list_count);
-		map_list(opp, list_fsl, &list_count);
 
 		break;
 
-	case OPENPIC_MODEL_FSL_MPIC_42:
+	case KVM_DEV_TYPE_FSL_MPIC_42:
 		opp->fsl = &fsl_mpic_42;
 		opp->brr1 = 0x00400402;
 		opp->flags |= OPENPIC_FLAG_ILR;
@@ -1303,11 +1729,19 @@ static int openpic_init(SysBusDevice *dev)
 		opp->mpic_mode_mask = GCR_MODE_PROXY;
 
 		fsl_common_init(opp);
-		map_list(opp, list_be, &list_count);
-		map_list(opp, list_fsl, &list_count);
 
 		break;
+
+	default:
+		ret = -ENODEV;
+		goto err;
 	}
 
-	return 0;
+	openpic_reset(opp);
+	kvm_get_kvm(kvm);
+	return fd;
+
+err:
+	kfree(opp);
+	return ret;
 }
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index bdfa526..5656d12 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -317,6 +317,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_ENABLE_CAP:
 	case KVM_CAP_ONE_REG:
 	case KVM_CAP_IOEVENTFD:
+	case KVM_CAP_DEVICE_CTRL:
 		r = 1;
 		break;
 #ifndef CONFIG_KVM_BOOK3S_64_HV
@@ -776,7 +777,10 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 		break;
 	case KVM_CAP_PPC_EPR:
 		r = 0;
-		vcpu->arch.epr_enabled = cap->args[0];
+		if (cap->args[0])
+			vcpu->arch.epr_flags |= KVMPPC_EPR_USER;
+		else
+			vcpu->arch.epr_flags &= ~KVMPPC_EPR_USER;
 		break;
 #ifdef CONFIG_BOOKE
 	case KVM_CAP_PPC_BOOKE_WATCHDOG:
@@ -922,6 +926,7 @@ static int kvm_vm_ioctl_get_pvinfo(struct kvm_ppc_pvinfo *pvinfo)
 long kvm_arch_vm_ioctl(struct file *filp,
                        unsigned int ioctl, unsigned long arg)
 {
+	struct kvm *kvm __maybe_unused = filp->private_data;
 	void __user *argp = (void __user *)arg;
 	long r;
 
@@ -940,7 +945,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 #ifdef CONFIG_PPC_BOOK3S_64
 	case KVM_CREATE_SPAPR_TCE: {
 		struct kvm_create_spapr_tce create_tce;
-		struct kvm *kvm = filp->private_data;
 
 		r = -EFAULT;
 		if (copy_from_user(&create_tce, argp, sizeof(create_tce)))
@@ -952,7 +956,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 
 #ifdef CONFIG_KVM_BOOK3S_64_HV
 	case KVM_ALLOCATE_RMA: {
-		struct kvm *kvm = filp->private_data;
 		struct kvm_allocate_rma rma;
 
 		r = kvm_vm_ioctl_allocate_rma(kvm, &rma);
@@ -962,7 +965,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 	}
 
 	case KVM_PPC_ALLOCATE_HTAB: {
-		struct kvm *kvm = filp->private_data;
 		u32 htab_order;
 
 		r = -EFAULT;
@@ -979,7 +981,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 	}
 
 	case KVM_PPC_GET_HTAB_FD: {
-		struct kvm *kvm = filp->private_data;
 		struct kvm_get_htab_fd ghf;
 
 		r = -EFAULT;
@@ -992,7 +993,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 
 #ifdef CONFIG_PPC_BOOK3S_64
 	case KVM_PPC_GET_SMMU_INFO: {
-		struct kvm *kvm = filp->private_data;
 		struct kvm_ppc_smmu_info info;
 
 		memset(&info, 0, sizeof(info));
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1c0be23..852a3a1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1084,6 +1084,8 @@ static inline bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
 	return true;
 }
 
+int kvm_create_mpic(struct kvm *kvm, u32 type);
+
 #endif /* CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT */
 #else
 static inline void __guest_enter(void) { return; }
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 20ce2d2..d8f44ef 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -927,6 +927,15 @@ struct kvm_device_attr {
 	__u64	addr;		/* userspace address of attr data */
 };
 
+#define KVM_DEV_TYPE_FSL_MPIC_20	1
+#define KVM_DEV_TYPE_FSL_MPIC_42	2
+
+#define KVM_DEV_MPIC_GRP_MISC		1
+#define   KVM_DEV_MPIC_BASE_ADDR	0	/* 64-bit */
+
+#define KVM_DEV_MPIC_GRP_REGISTER	2	/* 32-bit */
+#define KVM_DEV_MPIC_GRP_IRQ_ACTIVE	3	/* 32-bit */
+
 /* ioctl for vm fd */
 #define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xe0, struct kvm_create_device)
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ed033c0..e325f5d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2164,6 +2164,15 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
 	bool test = cd->flags & KVM_CREATE_DEVICE_TEST;
 
 	switch (cd->type) {
+#ifdef CONFIG_KVM_MPIC
+	case KVM_DEV_TYPE_FSL_MPIC_20:
+	case KVM_DEV_TYPE_FSL_MPIC_42: {
+		if (test)
+			return 0;
+
+		return kvm_create_mpic(kvm, cd->type);
+	}
+#endif
 	default:
 		return -ENODEV;
 	}
-- 
1.7.9.5



^ permalink raw reply related	[flat|nested] 261+ messages in thread

* [RFC PATCH v2 6/6] kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC
  2013-04-01 22:47   ` Scott Wood
@ 2013-04-01 22:47     ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-01 22:47 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus, Scott Wood

Enabling this capability connects the vcpu to the designated in-kernel
MPIC.  Using explicit connections between vcpus and irqchips allows
for flexibility, but the main benefit at the moment is that it
simplifies the code -- KVM doesn't need vm-global state to remember
which MPIC object is associated with this vm, and it doesn't need to
care about ordering between irqchip creation and vcpu creation.
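
As a rough illustration of the connection rules described above, here is a
minimal userspace model (not the kernel code itself). The names mpic_model,
mpic_model_connect and the MAX_CPU value are assumptions made for this
sketch; only the order of checks and the error codes follow the patch below.

```c
#include <errno.h>
#include <stddef.h>

#define MAX_CPU 32	/* assumed bound, chosen only for this sketch */

struct vcpu;

struct mpic_model {
	struct vcpu *dst[MAX_CPU];	/* vcpu attached to each MPIC CPU slot */
	unsigned int nb_cpus;		/* highest connected slot + 1 */
	unsigned int users;		/* references held by connected vcpus */
};

struct vcpu {
	struct mpic_model *mpic;	/* NULL until the vcpu is connected */
};

/* Returns 0 on success or a negative errno, mirroring the patch's checks. */
int mpic_model_connect(struct mpic_model *m, struct vcpu *v, unsigned int cpu)
{
	if (cpu >= MAX_CPU)
		return -EPERM;		/* CPU number out of range */
	if (m->dst[cpu])
		return -EEXIST;		/* slot already has a vcpu */
	if (v->mpic)
		return -EBUSY;		/* vcpu already has an irqchip */

	m->dst[cpu] = v;
	if (cpu + 1 > m->nb_cpus)
		m->nb_cpus = cpu + 1;
	v->mpic = m;
	m->users++;			/* dropped again when the vcpu is freed */
	return 0;
}
```

Because each vcpu carries its own pointer to the MPIC object, nothing in this
model depends on whether the MPIC or the vcpu was created first.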

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
 Documentation/virtual/kvm/api.txt   |    8 ++++++
 arch/powerpc/include/asm/kvm_host.h |   10 ++++---
 arch/powerpc/include/asm/kvm_ppc.h  |    2 ++
 arch/powerpc/kvm/booke.c            |    4 ++-
 arch/powerpc/kvm/mpic.c             |   49 +++++++++++++++++++++++++++++++----
 arch/powerpc/kvm/powerpc.c          |   25 +++++++++++++++---
 include/uapi/linux/kvm.h            |    1 +
 7 files changed, 86 insertions(+), 13 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 77328aa..38f9b6d 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2728,3 +2728,11 @@ to receive the topmost interrupt vector.
 When disabled (args[0] == 0), behavior is as if this facility is unsupported.
 
 When this capability is enabled, KVM_EXIT_EPR can occur.
+
+6.6 KVM_CAP_IRQ_MPIC
+
+Architectures: ppc
+Parameters: args[0] is the MPIC device fd
+            args[1] is the MPIC CPU number for this vcpu
+
+This capability connects the vcpu to an in-kernel MPIC device.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 6713327..2a2e235 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -375,8 +375,10 @@ struct kvmppc_booke_debug_reg {
 	u64 dac[KVMPPC_BOOKE_MAX_DAC];
 };
 
-#define KVMPPC_IRQCHIP_NONE	0
-#define KVMPPC_IRQCHIP_MPIC	1
+#define KVMPPC_IRQ_DEFAULT	0
+#define KVMPPC_IRQ_MPIC		1
+
+struct openpic;
 
 struct kvm_vcpu_arch {
 	ulong host_stack;
@@ -557,8 +559,8 @@ struct kvm_vcpu_arch {
 	unsigned long magic_page_pa; /* phys addr to map the magic page to */
 	unsigned long magic_page_ea; /* effect. addr to map the magic page to */
 
-	int irqchip_type;
-	void *irqchip_priv;
+	int irq_type;		/* one of KVMPPC_IRQ_* */
+	struct openpic *mpic;	/* KVMPPC_IRQ_MPIC */
 
 #ifdef CONFIG_KVM_BOOK3S_64_HV
 	struct kvm_vcpu_arch_shared shregs;
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 20b2a5e..2cc18a4 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -273,6 +273,8 @@ static inline void kvmppc_set_epr(struct kvm_vcpu *vcpu, u32 epr)
 }
 
 void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu);
+int kvmppc_mpic_connect_vcpu(struct file *mpic_filp, struct kvm_vcpu *vcpu,
+			     u32 cpu);
 
 int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
 			      struct kvm_config_tlb *cfg);
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index cddc6b3..7d00222 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -430,8 +430,10 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu *vcpu,
 		if (update_epr == true) {
 			if (vcpu->arch.epr_flags & KVMPPC_EPR_USER)
 				kvm_make_request(KVM_REQ_EPR_EXIT, vcpu);
-			else if (vcpu->arch.epr_flags & KVMPPC_EPR_KERNEL)
+			else if (vcpu->arch.epr_flags & KVMPPC_EPR_KERNEL) {
+				BUG_ON(vcpu->arch.irq_type != KVMPPC_IRQ_MPIC);
 				kvmppc_mpic_set_epr(vcpu);
+			}
 		}
 
 		new_msr &= msr_mask;
diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index 9aace50..b790f47 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -1159,7 +1159,7 @@ static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
 
 void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu)
 {
-	struct openpic *opp = vcpu->arch.irqchip_priv;
+	struct openpic *opp = vcpu->arch.mpic;
 	int cpu = vcpu->vcpu_id;
 	unsigned long flags;
 
@@ -1442,10 +1442,10 @@ static void map_mmio(struct openpic *opp)
 
 static void unmap_mmio(struct openpic *opp)
 {
-	BUG_ON(opp->mmio_mapped);
-	opp->mmio_mapped = false;
-
-	kvm_io_bus_unregister_dev(opp->kvm, KVM_MMIO_BUS, &opp->mmio);
+	if (opp->mmio_mapped) {
+		opp->mmio_mapped = false;
+		kvm_io_bus_unregister_dev(opp->kvm, KVM_MMIO_BUS, &opp->mmio);
+	}
 }
 
 static int set_base_addr(struct openpic *opp, struct kvm_device_attr *attr)
@@ -1683,6 +1683,45 @@ static const struct file_operations kvm_mpic_fops = {
 	.release = kvm_mpic_release,
 };
 
+int kvmppc_mpic_connect_vcpu(struct file *mpic_filp, struct kvm_vcpu *vcpu,
+			     u32 cpu)
+{
+	struct openpic *opp = mpic_filp->private_data;
+	int ret = 0;
+
+	if (mpic_filp->f_op != &kvm_mpic_fops)
+		return -EPERM;
+	if (opp->kvm != vcpu->kvm)
+		return -EPERM;
+	if (cpu >= MAX_CPU)
+		return -EPERM;
+
+	spin_lock_irq(&opp->lock);
+
+	if (opp->dst[cpu].vcpu) {
+		ret = -EEXIST;
+		goto out;
+	}
+	if (vcpu->arch.irq_type) {
+		ret = -EBUSY;
+		goto out;
+	}
+
+	opp->dst[cpu].vcpu = vcpu;
+	opp->nb_cpus = max(opp->nb_cpus, cpu + 1);
+
+	vcpu->arch.mpic = opp;
+	vcpu->arch.irq_type = KVMPPC_IRQ_MPIC;
+	atomic_inc(&opp->users);
+
+	if (opp->mpic_mode_mask == GCR_MODE_PROXY)
+		vcpu->arch.epr_flags |= KVMPPC_EPR_KERNEL;
+
+out:
+	spin_unlock_irq(&opp->lock);
+	return ret;
+}
+
 int kvm_create_mpic(struct kvm *kvm, u32 type)
 {
 	struct openpic *opp;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 5656d12..9f67edd 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -25,6 +25,7 @@
 #include <linux/hrtimer.h>
 #include <linux/fs.h>
 #include <linux/slab.h>
+#include <linux/file.h>
 #include <asm/cputable.h>
 #include <asm/uaccess.h>
 #include <asm/kvm_ppc.h>
@@ -327,6 +328,9 @@ int kvm_dev_ioctl_check_extension(long ext)
 #if defined(CONFIG_KVM_E500V2) || defined(CONFIG_KVM_E500MC)
 	case KVM_CAP_SW_TLB:
 #endif
+#ifdef CONFIG_KVM_MPIC
+	case KVM_CAP_IRQ_MPIC:
+#endif
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
@@ -461,9 +465,9 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
 
 	kvmppc_remove_vcpu_debugfs(vcpu);
 
-	switch (vcpu->arch.irqchip_type) {
-	case KVMPPC_IRQCHIP_MPIC:
-		mpic_put(vcpu->arch.irqchip_priv);
+	switch (vcpu->arch.irq_type) {
+	case KVMPPC_IRQ_MPIC:
+		mpic_put(vcpu->arch.mpic);
 		break;
 	}
 
@@ -801,6 +805,21 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 		break;
 	}
 #endif
+#ifdef CONFIG_KVM_MPIC
+	case KVM_CAP_IRQ_MPIC: {
+		struct file *filp;
+
+		r = -EBADF;
+		filp = fget(cap->args[0]);
+		if (!filp)
+			break;
+
+		r = kvmppc_mpic_connect_vcpu(filp, vcpu, cap->args[1]);
+
+		fput(filp);
+		break;
+	}
+#endif
 	default:
 		r = -EINVAL;
 		break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index d8f44ef..22fce7b 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -669,6 +669,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_ARM_PSCI 87
 #define KVM_CAP_ARM_SET_DEVICE_ADDR 88
 #define KVM_CAP_DEVICE_CTRL 89
+#define KVM_CAP_IRQ_MPIC 90
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v2 1/6] kvm: add device control API
  2013-04-01 22:47     ` Scott Wood
@ 2013-04-02  6:59       ` tiejun.chen
  -1 siblings, 0 replies; 261+ messages in thread
From: tiejun.chen @ 2013-04-02  6:59 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm, paulus

On 04/02/2013 06:47 AM, Scott Wood wrote:
> Currently, devices that are emulated inside KVM are configured in a
> hardcoded manner based on an assumption that any given architecture
> only has one way to do it.  If there's any need to access device state,
> it is done through inflexible one-purpose-only IOCTLs (e.g.
> KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> cumbersome and depletes a limited numberspace.
>
> This API provides a mechanism to instantiate a device of a certain
> type, returning an ID that can be used to set/get attributes of the
> device.  Attributes may include configuration parameters (e.g.
> register base address), device state, operational commands, etc.  It
> is similar to the ONE_REG API, except that it acts on devices rather
> than vcpus.
>
> Both device types and individual attributes can be tested without having
> to create the device or get/set the attribute, without the need for
> separately managing enumerated capabilities.
>
> Signed-off-by: Scott Wood <scottwood@freescale.com>
> ---
>   Documentation/virtual/kvm/api.txt        |   70 ++++++++++++++++++++++++++++++
>   Documentation/virtual/kvm/devices/README |    1 +
>   arch/powerpc/include/asm/kvm_host.h      |    6 +++
>   arch/powerpc/include/asm/kvm_ppc.h       |    2 +
>   arch/powerpc/kvm/powerpc.c               |    7 +++
>   include/uapi/linux/kvm.h                 |   27 ++++++++++++
>   virt/kvm/kvm_main.c                      |   31 +++++++++++++
>   7 files changed, 144 insertions(+)
>   create mode 100644 Documentation/virtual/kvm/devices/README
>
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 976eb65..77328aa 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2173,6 +2173,76 @@ header; first `n_valid' valid entries with contents from the data
>   written, then `n_invalid' invalid entries, invalidating any previously
>   valid entries found.
>
> +4.79 KVM_CREATE_DEVICE
> +
> +Capability: KVM_CAP_DEVICE_CTRL
> +Type: vm ioctl
> +Parameters: struct kvm_create_device (in/out)
> +Returns: 0 on success, -1 on error
> +Errors:
> +  ENODEV: The device type is unknown or unsupported
> +  EEXIST: Device already created, and this type of device may not
> +          be instantiated multiple times
> +  ENOSPC: Too many devices have been created
> +
> +  Other error conditions may be defined by individual device types.
> +
> +Creates an emulated device in the kernel.  The file descriptor returned
> +in fd can be used with KVM_SET/GET/HAS_DEVICE_ATTR.
> +
> +If the KVM_CREATE_DEVICE_TEST flag is set, only test whether the
> +device type is supported (not necessarily whether it can be created
> +in the current vm).
> +
> +Individual devices should not define flags.  Attributes should be used
> +for specifying any behavior that is not implied by the device type
> +number.
> +
> +struct kvm_create_device {
> +	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
> +	__u32	fd;	/* out: device handle */
> +	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
> +};
> +
> +4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR
> +
> +Capability: KVM_CAP_DEVICE_CTRL
> +Type: device ioctl
> +Parameters: struct kvm_device_attr
> +Returns: 0 on success, -1 on error
> +Errors:
> +  ENXIO:  The group or attribute is unknown/unsupported for this device
> +  EPERM:  The attribute cannot (currently) be accessed this way
> +          (e.g. read-only attribute, or attribute that only makes
> +          sense when the device is in a different state)
> +
> +  Other error conditions may be defined by individual device types.
> +
> +Gets/sets a specified piece of device configuration and/or state.  The
> +semantics are device-specific.  See individual device documentation in
> +the "devices" directory.  As with ONE_REG, the size of the data
> +transferred is defined by the particular attribute.
> +
> +struct kvm_device_attr {
> +	__u32	flags;		/* no flags currently defined */
> +	__u32	group;		/* device-defined */
> +	__u64	attr;		/* group-defined */
> +	__u64	addr;		/* userspace address of attr data */
> +};
> +
> +4.81 KVM_HAS_DEVICE_ATTR
> +
> +Capability: KVM_CAP_DEVICE_CTRL
> +Type: device ioctl
> +Parameters: struct kvm_device_attr
> +Returns: 0 on success, -1 on error
> +Errors:
> +  ENXIO:  The group or attribute is unknown/unsupported for this device
> +
> +Tests whether a device supports a particular attribute.  A successful
> +return indicates the attribute is implemented.  It does not necessarily
> +indicate that the attribute can be read or written in the device's
> +current state.  "addr" is ignored.
>
>   4.77 KVM_ARM_VCPU_INIT
>
> diff --git a/Documentation/virtual/kvm/devices/README b/Documentation/virtual/kvm/devices/README
> new file mode 100644
> index 0000000..34a6983
> --- /dev/null
> +++ b/Documentation/virtual/kvm/devices/README
> @@ -0,0 +1 @@
> +This directory contains specific device bindings for KVM_CAP_DEVICE_CTRL.
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index e34f8fe..e0caae2 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -370,6 +370,9 @@ struct kvmppc_booke_debug_reg {
>   	u64 dac[KVMPPC_BOOKE_MAX_DAC];
>   };
>
> +#define KVMPPC_IRQCHIP_NONE	0
> +#define KVMPPC_IRQCHIP_MPIC	1
> +
>   struct kvm_vcpu_arch {
>   	ulong host_stack;
>   	u32 host_pid;
> @@ -549,6 +552,9 @@ struct kvm_vcpu_arch {
>   	unsigned long magic_page_pa; /* phys addr to map the magic page to */
>   	unsigned long magic_page_ea; /* effect. addr to map the magic page to */
>
> +	int irqchip_type;
> +	void *irqchip_priv;
> +
>   #ifdef CONFIG_KVM_BOOK3S_64_HV
>   	struct kvm_vcpu_arch_shared shregs;
>
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index f589307..f44932c 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -323,4 +323,6 @@ static inline ulong kvmppc_get_ea_indexed(struct kvm_vcpu *vcpu, int ra, int rb)
>   	return ea;
>   }
>
> +void mpic_put(void *priv);
> +
>   #endif /* __POWERPC_KVM_PPC_H__ */
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 16b4595..bdfa526 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -459,6 +459,13 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
>   	tasklet_kill(&vcpu->arch.tasklet);
>
>   	kvmppc_remove_vcpu_debugfs(vcpu);
> +
> +	switch (vcpu->arch.irqchip_type) {
> +	case KVMPPC_IRQCHIP_MPIC:
> +		mpic_put(vcpu->arch.irqchip_priv);
> +		break;
> +	}
> +
>   	kvmppc_core_vcpu_free(vcpu);
>   }
>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 74d0ff3..20ce2d2 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info {
>   #define KVM_CAP_PPC_EPR 86
>   #define KVM_CAP_ARM_PSCI 87
>   #define KVM_CAP_ARM_SET_DEVICE_ADDR 88
> +#define KVM_CAP_DEVICE_CTRL 89
>
>   #ifdef KVM_CAP_IRQ_ROUTING
>
> @@ -909,6 +910,32 @@ struct kvm_s390_ucas_mapping {
>   #define KVM_ARM_SET_DEVICE_ADDR	  _IOW(KVMIO,  0xab, struct kvm_arm_device_addr)
>
>   /*
> + * Device control API, available with KVM_CAP_DEVICE_CTRL
> + */
> +#define KVM_CREATE_DEVICE_TEST		1
> +
> +struct kvm_create_device {
> +	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
> +	__u32	fd;	/* out: device handle */
> +	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
> +};
> +
> +struct kvm_device_attr {
> +	__u32	flags;		/* no flags currently defined */
> +	__u32	group;		/* device-defined */
> +	__u64	attr;		/* group-defined */
> +	__u64	addr;		/* userspace address of attr data */
> +};
> +
> +/* ioctl for vm fd */
> +#define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xe0, struct kvm_create_device)
> +
> +/* ioctls for fds returned by KVM_CREATE_DEVICE */
> +#define KVM_SET_DEVICE_ATTR	  _IOW(KVMIO,  0xe1, struct kvm_device_attr)
> +#define KVM_GET_DEVICE_ATTR	  _IOW(KVMIO,  0xe2, struct kvm_device_attr)
> +#define KVM_HAS_DEVICE_ATTR	  _IOW(KVMIO,  0xe3, struct kvm_device_attr)
> +
> +/*
>    * ioctls for vcpu fds
>    */
>   #define KVM_RUN                   _IO(KVMIO,   0x80)
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ff71541..ed033c0 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2158,6 +2158,17 @@ out:
>   }
>   #endif
>
> +static int kvm_ioctl_create_device(struct kvm *kvm,
> +				   struct kvm_create_device *cd)
> +{
> +	bool test = cd->flags & KVM_CREATE_DEVICE_TEST;
> +
> +	switch (cd->type) {
> +	default:
> +		return -ENODEV;
> +	}

Even after applying patch 5, this still seems to be missing something like:

	if (test)
		WARN_ON_ONCE(!cd->type);

	return -ENODEV;

Tiejun
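
For reference, the probe semantics under discussion can be modelled in plain
userspace C. This is only a sketch: model_create_device, the fake fd numbering
and the constant names are assumptions, with values copied from the uapi
additions in this series. It shows that an unknown type yields -ENODEV whether
or not the test flag is set, while a known type with the flag set returns 0
without creating anything.

```c
#include <errno.h>
#include <stdint.h>

#define CREATE_DEVICE_TEST	1u	/* stands in for KVM_CREATE_DEVICE_TEST */
#define DEV_TYPE_FSL_MPIC_20	1u	/* values mirror the uapi header */
#define DEV_TYPE_FSL_MPIC_42	2u

struct create_device {
	uint32_t type;	/* in: device type */
	uint32_t fd;	/* out: device handle */
	uint32_t flags;	/* in: CREATE_DEVICE_xxx */
};

static uint32_t next_fake_fd = 100;	/* hypothetical handle allocator */

/* Returns 0 on success or a negative errno. */
int model_create_device(struct create_device *cd)
{
	int test = cd->flags & CREATE_DEVICE_TEST;

	switch (cd->type) {
	case DEV_TYPE_FSL_MPIC_20:
	case DEV_TYPE_FSL_MPIC_42:
		if (test)
			return 0;	/* supported, but nothing is created */
		cd->fd = next_fake_fd++;
		return 0;
	default:
		return -ENODEV;		/* unknown type, with or without test */
	}
}
```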

> +}
> +
>   static long kvm_vm_ioctl(struct file *filp,
>   			   unsigned int ioctl, unsigned long arg)
>   {
> @@ -2272,6 +2283,26 @@ static long kvm_vm_ioctl(struct file *filp,
>   		break;
>   	}
>   #endif
> +	case KVM_CREATE_DEVICE: {
> +		struct kvm_create_device cd;
> +
> +		r = -EFAULT;
> +		if (copy_from_user(&cd, argp, sizeof(cd)))
> +			goto out;
> +
> +		mutex_lock(&kvm->lock);
> +		r = kvm_ioctl_create_device(kvm, &cd);
> +		mutex_unlock(&kvm->lock);
> +		if (r)
> +			goto out;
> +
> +		r = -EFAULT;
> +		if (copy_to_user(argp, &cd, sizeof(cd)))
> +			goto out;
> +
> +		r = 0;
> +		break;
> +	}
>   	default:
>   		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
>   		if (r == -ENOTTY)
>
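
The attribute calls quoted above can likewise be sketched as a small
userspace model. Everything here is an assumption made for illustration:
model_dev, the model_*_attr helpers, and the single 64-bit attribute, which
loosely mirrors KVM_DEV_MPIC_GRP_MISC / KVM_DEV_MPIC_BASE_ADDR from later in
the series. The point is that "addr" is a pointer to the data, the size is
implied by the attribute, and an unknown group or attribute gives -ENXIO.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

struct device_attr {	/* same fields as struct kvm_device_attr */
	uint32_t flags;	/* no flags currently defined */
	uint32_t group;	/* device-defined */
	uint64_t attr;	/* group-defined */
	uint64_t addr;	/* address of the attribute data */
};

#define GRP_MISC	1u	/* mirrors KVM_DEV_MPIC_GRP_MISC */
#define BASE_ADDR	0u	/* mirrors KVM_DEV_MPIC_BASE_ADDR, 64-bit */

struct model_dev {
	uint64_t base_addr;
};

/* "has" only checks that the attribute exists; addr is ignored. */
int model_has_attr(const struct device_attr *a)
{
	if (a->group == GRP_MISC && a->attr == BASE_ADDR)
		return 0;
	return -ENXIO;
}

int model_set_attr(struct model_dev *d, const struct device_attr *a)
{
	int r = model_has_attr(a);

	if (r)
		return r;
	/* The attribute, not the caller, defines the transfer size. */
	memcpy(&d->base_addr, (const void *)(uintptr_t)a->addr,
	       sizeof(d->base_addr));
	return 0;
}

int model_get_attr(const struct model_dev *d, const struct device_attr *a)
{
	int r = model_has_attr(a);

	if (r)
		return r;
	memcpy((void *)(uintptr_t)a->addr, &d->base_addr,
	       sizeof(d->base_addr));
	return 0;
}
```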

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v2 1/6] kvm: add device control API
@ 2013-04-02  6:59       ` tiejun.chen
  0 siblings, 0 replies; 261+ messages in thread
From: tiejun.chen @ 2013-04-02  6:59 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm, paulus

On 04/02/2013 06:47 AM, Scott Wood wrote:
> Currently, devices that are emulated inside KVM are configured in a
> hardcoded manner based on an assumption that any given architecture
> only has one way to do it.  If there's any need to access device state,
> it is done through inflexible one-purpose-only IOCTLs (e.g.
> KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> cumbersome and depletes a limited numberspace.
>
> This API provides a mechanism to instantiate a device of a certain
> type, returning an ID that can be used to set/get attributes of the
> device.  Attributes may include configuration parameters (e.g.
> register base address), device state, operational commands, etc.  It
> is similar to the ONE_REG API, except that it acts on devices rather
> than vcpus.
>
> Both device types and individual attributes can be tested without having
> to create the device or get/set the attribute, without the need for
> separately managing enumerated capabilities.
>
> Signed-off-by: Scott Wood <scottwood@freescale.com>
> ---
>   Documentation/virtual/kvm/api.txt        |   70 ++++++++++++++++++++++++++++++
>   Documentation/virtual/kvm/devices/README |    1 +
>   arch/powerpc/include/asm/kvm_host.h      |    6 +++
>   arch/powerpc/include/asm/kvm_ppc.h       |    2 +
>   arch/powerpc/kvm/powerpc.c               |    7 +++
>   include/uapi/linux/kvm.h                 |   27 ++++++++++++
>   virt/kvm/kvm_main.c                      |   31 +++++++++++++
>   7 files changed, 144 insertions(+)
>   create mode 100644 Documentation/virtual/kvm/devices/README
>
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 976eb65..77328aa 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2173,6 +2173,76 @@ header; first `n_valid' valid entries with contents from the data
>   written, then `n_invalid' invalid entries, invalidating any previously
>   valid entries found.
>
> +4.79 KVM_CREATE_DEVICE
> +
> +Capability: KVM_CAP_DEVICE_CTRL
> +Type: vm ioctl
> +Parameters: struct kvm_create_device (in/out)
> +Returns: 0 on success, -1 on error
> +Errors:
> +  ENODEV: The device type is unknown or unsupported
> +  EEXIST: Device already created, and this type of device may not
> +          be instantiated multiple times
> +  ENOSPC: Too many devices have been created
> +
> +  Other error conditions may be defined by individual device types.
> +
> +Creates an emulated device in the kernel.  The file descriptor returned
> +in fd can be used with KVM_SET/GET/HAS_DEVICE_ATTR.
> +
> +If the KVM_CREATE_DEVICE_TEST flag is set, only test whether the
> +device type is supported (not necessarily whether it can be created
> +in the current vm).
> +
> +Individual devices should not define flags.  Attributes should be used
> +for specifying any behavior that is not implied by the device type
> +number.
> +
> +struct kvm_create_device {
> +	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
> +	__u32	fd;	/* out: device handle */
> +	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
> +};
> +
> +4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR
> +
> +Capability: KVM_CAP_DEVICE_CTRL
> +Type: device ioctl
> +Parameters: struct kvm_device_attr
> +Returns: 0 on success, -1 on error
> +Errors:
> +  ENXIO:  The group or attribute is unknown/unsupported for this device
> +  EPERM:  The attribute cannot (currently) be accessed this way
> +          (e.g. read-only attribute, or attribute that only makes
> +          sense when the device is in a different state)
> +
> +  Other error conditions may be defined by individual device types.
> +
> +Gets/sets a specified piece of device configuration and/or state.  The
> +semantics are device-specific.  See individual device documentation in
> +the "devices" directory.  As with ONE_REG, the size of the data
> +transferred is defined by the particular attribute.
> +
> +struct kvm_device_attr {
> +	__u32	flags;		/* no flags currently defined */
> +	__u32	group;		/* device-defined */
> +	__u64	attr;		/* group-defined */
> +	__u64	addr;		/* userspace address of attr data */
> +};
> +
> +4.81 KVM_HAS_DEVICE_ATTR
> +
> +Capability: KVM_CAP_DEVICE_CTRL
> +Type: device ioctl
> +Parameters: struct kvm_device_attr
> +Returns: 0 on success, -1 on error
> +Errors:
> +  ENXIO:  The group or attribute is unknown/unsupported for this device
> +
> +Tests whether a device supports a particular attribute.  A successful
> +return indicates the attribute is implemented.  It does not necessarily
> +indicate that the attribute can be read or written in the device's
> +current state.  "addr" is ignored.
>
>   4.77 KVM_ARM_VCPU_INIT
>
> diff --git a/Documentation/virtual/kvm/devices/README b/Documentation/virtual/kvm/devices/README
> new file mode 100644
> index 0000000..34a6983
> --- /dev/null
> +++ b/Documentation/virtual/kvm/devices/README
> @@ -0,0 +1 @@
> +This directory contains specific device bindings for KVM_CAP_DEVICE_CTRL.
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index e34f8fe..e0caae2 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -370,6 +370,9 @@ struct kvmppc_booke_debug_reg {
>   	u64 dac[KVMPPC_BOOKE_MAX_DAC];
>   };
>
> +#define KVMPPC_IRQCHIP_NONE	0
> +#define KVMPPC_IRQCHIP_MPIC	1
> +
>   struct kvm_vcpu_arch {
>   	ulong host_stack;
>   	u32 host_pid;
> @@ -549,6 +552,9 @@ struct kvm_vcpu_arch {
>   	unsigned long magic_page_pa; /* phys addr to map the magic page to */
>   	unsigned long magic_page_ea; /* effect. addr to map the magic page to */
>
> +	int irqchip_type;
> +	void *irqchip_priv;
> +
>   #ifdef CONFIG_KVM_BOOK3S_64_HV
>   	struct kvm_vcpu_arch_shared shregs;
>
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index f589307..f44932c 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -323,4 +323,6 @@ static inline ulong kvmppc_get_ea_indexed(struct kvm_vcpu *vcpu, int ra, int rb)
>   	return ea;
>   }
>
> +void mpic_put(void *priv);
> +
>   #endif /* __POWERPC_KVM_PPC_H__ */
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 16b4595..bdfa526 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -459,6 +459,13 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
>   	tasklet_kill(&vcpu->arch.tasklet);
>
>   	kvmppc_remove_vcpu_debugfs(vcpu);
> +
> +	switch (vcpu->arch.irqchip_type) {
> +	case KVMPPC_IRQCHIP_MPIC:
> +		mpic_put(vcpu->arch.irqchip_priv);
> +		break;
> +	}
> +
>   	kvmppc_core_vcpu_free(vcpu);
>   }
>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 74d0ff3..20ce2d2 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info {
>   #define KVM_CAP_PPC_EPR 86
>   #define KVM_CAP_ARM_PSCI 87
>   #define KVM_CAP_ARM_SET_DEVICE_ADDR 88
> +#define KVM_CAP_DEVICE_CTRL 89
>
>   #ifdef KVM_CAP_IRQ_ROUTING
>
> @@ -909,6 +910,32 @@ struct kvm_s390_ucas_mapping {
>   #define KVM_ARM_SET_DEVICE_ADDR	  _IOW(KVMIO,  0xab, struct kvm_arm_device_addr)
>
>   /*
> + * Device control API, available with KVM_CAP_DEVICE_CTRL
> + */
> +#define KVM_CREATE_DEVICE_TEST		1
> +
> +struct kvm_create_device {
> +	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
> +	__u32	fd;	/* out: device handle */
> +	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
> +};
> +
> +struct kvm_device_attr {
> +	__u32	flags;		/* no flags currently defined */
> +	__u32	group;		/* device-defined */
> +	__u64	attr;		/* group-defined */
> +	__u64	addr;		/* userspace address of attr data */
> +};
> +
> +/* ioctl for vm fd */
> +#define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xe0, struct kvm_create_device)
> +
> +/* ioctls for fds returned by KVM_CREATE_DEVICE */
> +#define KVM_SET_DEVICE_ATTR	  _IOW(KVMIO,  0xe1, struct kvm_device_attr)
> +#define KVM_GET_DEVICE_ATTR	  _IOW(KVMIO,  0xe2, struct kvm_device_attr)
> +#define KVM_HAS_DEVICE_ATTR	  _IOW(KVMIO,  0xe3, struct kvm_device_attr)
> +
> +/*
>    * ioctls for vcpu fds
>    */
>   #define KVM_RUN                   _IO(KVMIO,   0x80)
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ff71541..ed033c0 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2158,6 +2158,17 @@ out:
>   }
>   #endif
>
> +static int kvm_ioctl_create_device(struct kvm *kvm,
> +				   struct kvm_create_device *cd)
> +{
> +	bool test = cd->flags & KVM_CREATE_DEVICE_TEST;
> +
> +	switch (cd->type) {
> +	default:
> +		return -ENODEV;
> +	}

Even after applying patch 5, it looks like this is still missing something like:

	if (test)
		WARN_ON_ONCE(!cd->type);

	return -ENODEV;

Tiejun

> +}
> +
>   static long kvm_vm_ioctl(struct file *filp,
>   			   unsigned int ioctl, unsigned long arg)
>   {
> @@ -2272,6 +2283,26 @@ static long kvm_vm_ioctl(struct file *filp,
>   		break;
>   	}
>   #endif
> +	case KVM_CREATE_DEVICE: {
> +		struct kvm_create_device cd;
> +
> +		r = -EFAULT;
> +		if (copy_from_user(&cd, argp, sizeof(cd)))
> +			goto out;
> +
> +		mutex_lock(&kvm->lock);
> +		r = kvm_ioctl_create_device(kvm, &cd);
> +		mutex_unlock(&kvm->lock);
> +		if (r)
> +			goto out;
> +
> +		r = -EFAULT;
> +		if (copy_to_user(argp, &cd, sizeof(cd)))
> +			goto out;
> +
> +		r = 0;
> +		break;
> +	}
>   	default:
>   		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
>   		if (r == -ENOTTY)
>


^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v2 1/6] kvm: add device control API
  2013-04-01 22:47     ` Scott Wood
@ 2013-04-03  1:02       ` Paul Mackerras
  -1 siblings, 0 replies; 261+ messages in thread
From: Paul Mackerras @ 2013-04-03  1:02 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm

On Mon, Apr 01, 2013 at 05:47:48PM -0500, Scott Wood wrote:
> Currently, devices that are emulated inside KVM are configured in a
> hardcoded manner based on an assumption that any given architecture
> only has one way to do it.  If there's any need to access device state,
> it is done through inflexible one-purpose-only IOCTLs (e.g.
> KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> cumbersome and depletes a limited numberspace.
> 
> This API provides a mechanism to instantiate a device of a certain
> type, returning an ID that can be used to set/get attributes of the
> device.  Attributes may include configuration parameters (e.g.
> register base address), device state, operational commands, etc.  It
> is similar to the ONE_REG API, except that it acts on devices rather
> than vcpus.
> 
> Both device types and individual attributes can be tested without having
> to create the device or get/set the attribute, without the need for
> separately managing enumerated capabilities.
> 
> Signed-off-by: Scott Wood <scottwood@freescale.com>

Some comments below...

> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 976eb65..77328aa 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2173,6 +2173,76 @@ header; first `n_valid' valid entries with contents from the data
>  written, then `n_invalid' invalid entries, invalidating any previously
>  valid entries found.
>  
> +4.79 KVM_CREATE_DEVICE
> +
> +Capability: KVM_CAP_DEVICE_CTRL

I notice this patch doesn't add this capability; you add it in a later
patch.  Since this patch adds the KVM_CREATE_DEVICE ioctl, it probably
should add the KVM_CAP_DEVICE_CTRL capability too.


> +Type: vm ioctl
> +Parameters: struct kvm_create_device (in/out)
> +Returns: 0 on success, -1 on error
> +Errors:
> +  ENODEV: The device type is unknown or unsupported
> +  EEXIST: Device already created, and this type of device may not
> +          be instantiated multiple times
> +  ENOSPC: Too many devices have been created

Is this still a possible error code?

> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -370,6 +370,9 @@ struct kvmppc_booke_debug_reg {
>  	u64 dac[KVMPPC_BOOKE_MAX_DAC];
>  };
>  
> +#define KVMPPC_IRQCHIP_NONE	0
> +#define KVMPPC_IRQCHIP_MPIC	1

This define should go in the patch that adds the MPIC device.

>  struct kvm_vcpu_arch {
>  	ulong host_stack;
>  	u32 host_pid;
> @@ -549,6 +552,9 @@ struct kvm_vcpu_arch {
>  	unsigned long magic_page_pa; /* phys addr to map the magic page to */
>  	unsigned long magic_page_ea; /* effect. addr to map the magic page to */
>  
> +	int irqchip_type;
> +	void *irqchip_priv;

Since you add this (irqchip_priv) only to remove it in a later patch
and replace it by a device-specific pointer, why bother adding it
here?  And why not give irqchip_type the name it ultimately ends up
with?

> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 16b4595..bdfa526 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -459,6 +459,13 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
>  	tasklet_kill(&vcpu->arch.tasklet);
>  
>  	kvmppc_remove_vcpu_debugfs(vcpu);
> +
> +	switch (vcpu->arch.irqchip_type) {
> +	case KVMPPC_IRQCHIP_MPIC:
> +		mpic_put(vcpu->arch.irqchip_priv);
> +		break;
> +	}

This is going to break bisection, since you don't define mpic_put() in
this patch.

> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 74d0ff3..20ce2d2 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info {
>  #define KVM_CAP_PPC_EPR 86
>  #define KVM_CAP_ARM_PSCI 87
>  #define KVM_CAP_ARM_SET_DEVICE_ADDR 88
> +#define KVM_CAP_DEVICE_CTRL 89
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> @@ -909,6 +910,32 @@ struct kvm_s390_ucas_mapping {
>  #define KVM_ARM_SET_DEVICE_ADDR	  _IOW(KVMIO,  0xab, struct kvm_arm_device_addr)
>  
>  /*
> + * Device control API, available with KVM_CAP_DEVICE_CTRL
> + */
> +#define KVM_CREATE_DEVICE_TEST		1
> +
> +struct kvm_create_device {
> +	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
> +	__u32	fd;	/* out: device handle */
> +	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
> +};
> +
> +struct kvm_device_attr {
> +	__u32	flags;		/* no flags currently defined */
> +	__u32	group;		/* device-defined */
> +	__u64	attr;		/* group-defined */
> +	__u64	addr;		/* userspace address of attr data */
> +};
> +
> +/* ioctl for vm fd */
> +#define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xe0, struct kvm_create_device)

This define should go with the other VM ioctls, otherwise the next
person to add a VM ioctl will probably miss it and reuse the 0xe0
code.

Paul.

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v2 1/6] kvm: add device control API
  2013-04-03  1:02       ` Paul Mackerras
@ 2013-04-03  1:19         ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-03  1:19 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Alexander Graf, kvm-ppc, kvm

On 04/02/2013 08:02:39 PM, Paul Mackerras wrote:
> On Mon, Apr 01, 2013 at 05:47:48PM -0500, Scott Wood wrote:
> > Currently, devices that are emulated inside KVM are configured in a
> > hardcoded manner based on an assumption that any given architecture
> > only has one way to do it.  If there's any need to access device state,
> > it is done through inflexible one-purpose-only IOCTLs (e.g.
> > KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> > cumbersome and depletes a limited numberspace.
> >
> > This API provides a mechanism to instantiate a device of a certain
> > type, returning an ID that can be used to set/get attributes of the
> > device.  Attributes may include configuration parameters (e.g.
> > register base address), device state, operational commands, etc.  It
> > is similar to the ONE_REG API, except that it acts on devices rather
> > than vcpus.
> >
> > Both device types and individual attributes can be tested without having
> > to create the device or get/set the attribute, without the need for
> > separately managing enumerated capabilities.
> >
> > Signed-off-by: Scott Wood <scottwood@freescale.com>
> 
> Some comments below...
> 
> > diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> > index 976eb65..77328aa 100644
> > --- a/Documentation/virtual/kvm/api.txt
> > +++ b/Documentation/virtual/kvm/api.txt
> > @@ -2173,6 +2173,76 @@ header; first `n_valid' valid entries with contents from the data
> >  written, then `n_invalid' invalid entries, invalidating any previously
> >  valid entries found.
> >
> > +4.79 KVM_CREATE_DEVICE
> > +
> > +Capability: KVM_CAP_DEVICE_CTRL
> 
> I notice this patch doesn't add this capability;

Yes, it does (see below).

> you add it in a later patch.

Maybe you're thinking of KVM_CAP_IRQ_MPIC?

> > +Type: vm ioctl
> > +Parameters: struct kvm_create_device (in/out)
> > +Returns: 0 on success, -1 on error
> > +Errors:
> > +  ENODEV: The device type is unknown or unsupported
> > +  EEXIST: Device already created, and this type of device may not
> > +          be instantiated multiple times
> > +  ENOSPC: Too many devices have been created
> 
> Is this still a possible error code?

If you mean ENOSPC, probably not -- it'd be replaced with whatever  
errors can come out of creating a file descriptor.

> > --- a/arch/powerpc/include/asm/kvm_host.h
> > +++ b/arch/powerpc/include/asm/kvm_host.h
> > @@ -370,6 +370,9 @@ struct kvmppc_booke_debug_reg {
> >  	u64 dac[KVMPPC_BOOKE_MAX_DAC];
> >  };
> >
> > +#define KVMPPC_IRQCHIP_NONE	0
> > +#define KVMPPC_IRQCHIP_MPIC	1
> 
> This define should go in the patch that adds the MPIC device.
> 
> >  struct kvm_vcpu_arch {
> >  	ulong host_stack;
> >  	u32 host_pid;
> > @@ -549,6 +552,9 @@ struct kvm_vcpu_arch {
> >  	unsigned long magic_page_pa; /* phys addr to map the magic page to */
> >  	unsigned long magic_page_ea; /* effect. addr to map the magic page to */
> >
> > +	int irqchip_type;
> > +	void *irqchip_priv;
> 
> Since you add this (irqchip_priv) only to remove it in a later patch
> and replace it by a device-specific pointer, why bother adding it
> here?  And why not give irqchip_type the name it ultimately ends up
> with?

Oops... These were patch shuffling accidents and will be removed from  
the next iteration.

> > diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> > index 16b4595..bdfa526 100644
> > --- a/arch/powerpc/kvm/powerpc.c
> > +++ b/arch/powerpc/kvm/powerpc.c
> > @@ -459,6 +459,13 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
> >  	tasklet_kill(&vcpu->arch.tasklet);
> >
> >  	kvmppc_remove_vcpu_debugfs(vcpu);
> > +
> > +	switch (vcpu->arch.irqchip_type) {
> > +	case KVMPPC_IRQCHIP_MPIC:
> > +		mpic_put(vcpu->arch.irqchip_priv);
> > +		break;
> > +	}
> 
> This is going to break bisection, since you don't define mpic_put() in
> this patch.

Sigh.  Something got messed up; I'll try to sort it out and resubmit.

> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 74d0ff3..20ce2d2 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info {
> >  #define KVM_CAP_PPC_EPR 86
> >  #define KVM_CAP_ARM_PSCI 87
> >  #define KVM_CAP_ARM_SET_DEVICE_ADDR 88
> > +#define KVM_CAP_DEVICE_CTRL 89

See, here's the capability. :-)

> >  /*
> > + * Device control API, available with KVM_CAP_DEVICE_CTRL
> > + */
> > +#define KVM_CREATE_DEVICE_TEST		1
> > +
> > +struct kvm_create_device {
> > +	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
> > +	__u32	fd;	/* out: device handle */
> > +	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
> > +};
> > +
> > +struct kvm_device_attr {
> > +	__u32	flags;		/* no flags currently defined */
> > +	__u32	group;		/* device-defined */
> > +	__u64	attr;		/* group-defined */
> > +	__u64	addr;		/* userspace address of attr data */
> > +};
> > +
> > +/* ioctl for vm fd */
> > +#define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xe0, struct kvm_create_device)
> 
> This define should go with the other VM ioctls, otherwise the next
> person to add a VM ioctl will probably miss it and reuse the 0xe0
> code.

That's actually why I moved it to a new section, with device control  
ioctls getting their own range, as the legacy "device model" and some  
other things did.  0xe0 is not the next ioctl that would be used for  
either vm or vcpu.  The ioctl numbering is actually already a mess,  
with care sometimes being taken to keep vcpu and vm ioctls from
overlapping, while in other places overlapping does happen.  I'm not sure
what exactly I should do here.
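
The collision concern can be checked mechanically. The sketch below is a userspace illustration, assuming a Linux system where <linux/ioctl.h> provides the `_IO*` and `_IOC_*` macros. The `demo_*` structs copy the field layout from the patch, 0xAE is the KVMIO magic from linux/kvm.h, and `DEMO_FUTURE_VM_IOCTL` is a hypothetical later addition invented to show the reuse case. The bit layout of an ioctl command differs between architectures, so the code only goes through the macros.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <linux/ioctl.h>

#define DEMO_KVMIO 0xAE	/* same magic as KVMIO in linux/kvm.h */

/* Field layouts copied from the patch, under demo names. */
struct demo_create_device { uint32_t type, fd, flags; };
struct demo_device_attr { uint32_t flags, group; uint64_t attr, addr; };

#define DEMO_RUN		_IO(DEMO_KVMIO, 0x80)
#define DEMO_CREATE_DEVICE	_IOWR(DEMO_KVMIO, 0xe0, struct demo_create_device)
#define DEMO_SET_DEVICE_ATTR	_IOW(DEMO_KVMIO, 0xe1, struct demo_device_attr)

/* Hypothetical future VM ioctl that accidentally reuses number 0xe0. */
#define DEMO_FUTURE_VM_IOCTL	_IOW(DEMO_KVMIO, 0xe0, uint64_t)

/* True if two commands share both the magic type and the sequence number. */
static bool demo_same_slot(unsigned long a, unsigned long b)
{
	return _IOC_TYPE(a) == _IOC_TYPE(b) && _IOC_NR(a) == _IOC_NR(b);
}

/* Sanity check that the demo macros carry the numbers used in the patch. */
static void demo_check_patch_numbers(void)
{
	assert(_IOC_NR(DEMO_CREATE_DEVICE) == 0xe0);
	assert(_IOC_NR(DEMO_SET_DEVICE_ATTR) == 0xe1);
	assert(_IOC_NR(DEMO_RUN) == 0x80);
}
```

Note that `DEMO_FUTURE_VM_IOCTL` and `DEMO_CREATE_DEVICE` still encode to different full command values, because direction and size are part of the encoding, yet `demo_same_slot()` reports them as sharing a number. That shared number is the kind of reuse the review comment warns about.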

-Scott

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v2 1/6] kvm: add device control API
       [not found]       ` <1364923807.24520.2@snotra>
@ 2013-04-03  1:28           ` tiejun.chen
  0 siblings, 0 replies; 261+ messages in thread
From: tiejun.chen @ 2013-04-03  1:28 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm, paulus

On 04/03/2013 01:30 AM, Scott Wood wrote:
> On 04/02/2013 01:59:57 AM, tiejun.chen wrote:
>> On 04/02/2013 06:47 AM, Scott Wood wrote:
>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>> index ff71541..ed033c0 100644
>>> --- a/virt/kvm/kvm_main.c
>>> +++ b/virt/kvm/kvm_main.c
>>> @@ -2158,6 +2158,17 @@ out:
>>>   }
>>>   #endif
>>>
>>> +static int kvm_ioctl_create_device(struct kvm *kvm,
>>> +                   struct kvm_create_device *cd)
>>> +{
>>> +    bool test = cd->flags & KVM_CREATE_DEVICE_TEST;
>>> +
>>> +    switch (cd->type) {
>>> +    default:
>>> +        return -ENODEV;
>>> +    }
>>
>> Even after applying patch 5, it looks like this is still missing something like:
>>
>>     if (test)
>>         WARN_ON_ONCE(!cd->type);
>
> Why?  How does userspace passing in a bad type value mean the kernel needs to
> report internal badness, why is a value of zero worse than any other bad value,
> and why only when the test flag is set?

I just mean we need to do something here, since it looks like the 'test'
variable is defined but unused, right? But please correct this as you see fit :)

And if userspace can't guarantee that cd->type is never zero, we should also
return -ENODEV after that switch().

Tiejun

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v2 1/6] kvm: add device control API
       [not found]           ` <1364952853.8690.3@snotra>
@ 2013-04-03  1:42               ` tiejun.chen
  0 siblings, 0 replies; 261+ messages in thread
From: tiejun.chen @ 2013-04-03  1:42 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm, paulus

On 04/03/2013 09:34 AM, Scott Wood wrote:
> On 04/02/2013 08:28:01 PM, tiejun.chen wrote:
>> On 04/03/2013 01:30 AM, Scott Wood wrote:
>>> On 04/02/2013 01:59:57 AM, tiejun.chen wrote:
>>>> On 04/02/2013 06:47 AM, Scott Wood wrote:
>>>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>>>> index ff71541..ed033c0 100644
>>>>> --- a/virt/kvm/kvm_main.c
>>>>> +++ b/virt/kvm/kvm_main.c
>>>>> @@ -2158,6 +2158,17 @@ out:
>>>>>   }
>>>>>   #endif
>>>>>
>>>>> +static int kvm_ioctl_create_device(struct kvm *kvm,
>>>>> +                   struct kvm_create_device *cd)
>>>>> +{
>>>>> +    bool test = cd->flags & KVM_CREATE_DEVICE_TEST;
>>>>> +
>>>>> +    switch (cd->type) {
>>>>> +    default:
>>>>> +        return -ENODEV;
>>>>> +    }
>>>>
>>>> Even after applying patch 5, it looks like this is still missing something like:
>>>>
>>>>     if (test)
>>>>         WARN_ON_ONCE(!cd->type);
>>>
>>> Why?  How does userspace passing in a bad type value mean the kernel needs to
>>> report internal badness, why is a value of zero worse than any other bad value,
>>> and why only when the test flag is set?
>>
>> I just mean we need to do something here, since it looks like the 'test'
>> variable is defined but unused, right? But please correct this as you see fit :)
>
> Yes, it's unused in this patch, but is used after patch 5 is applied.  I didn't
> think it was worth adding a temporary unused annotation, since this part of the
> kernel doesn't use -Werror.

Yes, it's acceptable in the !-Werror case if, as you said, we shouldn't warn here.

Tiejun
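
For readers following this exchange, the sketch below models how a `test` flag like the one in kvm_ioctl_create_device() can be consumed once a real device type is handled. It is a userspace model and not the code from patch 5: `TOY_DEV_TYPE_EXAMPLE`, `toy_create_device()`, `toy_devices_created`, and the placeholder handle value 42 are all invented for illustration. The parts taken from the thread are the flag value (KVM_CREATE_DEVICE_TEST is 1), the struct layout, and the -ENODEV default.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CREATE_DEVICE_TEST	1u	/* same value as KVM_CREATE_DEVICE_TEST */
#define TOY_DEV_TYPE_EXAMPLE	1u	/* hypothetical device type */

/* Same field layout as struct kvm_create_device in the patch. */
struct toy_create_device {
	uint32_t type;	/* in */
	uint32_t fd;	/* out: device handle */
	uint32_t flags;	/* in */
};

static int toy_devices_created;	/* counts real instantiations only */

static int toy_create_device(struct toy_create_device *cd)
{
	bool test;

	assert(cd != NULL);
	test = cd->flags & TOY_CREATE_DEVICE_TEST;

	switch (cd->type) {
	case TOY_DEV_TYPE_EXAMPLE:
		if (test)
			return 0;	/* type is supported; create nothing */
		toy_devices_created++;
		cd->fd = 42;		/* placeholder handle */
		return 0;
	default:
		return -ENODEV;		/* any unknown type, zero included */
	}
}
```

With the flag set, a supported type returns 0 and leaves `fd` and the counter untouched. A type of zero falls through to the `default` case like any other unknown value, so no separate check after the switch is needed in this model.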

^ permalink raw reply	[flat|nested] 261+ messages in thread

* [RFC PATCH v3 0/6] device control and in-kernel MPIC
  2013-04-01 22:47   ` Scott Wood
@ 2013-04-03  1:57     ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-03  1:57 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus

Fixed some patch shuffling errors and some minor issues.

Scott Wood (6):
  kvm: add device control API
  kvm/ppc/mpic: import hw/openpic.c from QEMU
  kvm/ppc/mpic: remove some obviously unneeded code
  kvm/ppc/mpic: adapt to kernel style and environment
  kvm/ppc/mpic: in-kernel MPIC emulation
  kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC

 Documentation/virtual/kvm/api.txt          |   78 ++
 Documentation/virtual/kvm/devices/README   |    1 +
 Documentation/virtual/kvm/devices/mpic.txt |   37 +
 arch/powerpc/include/asm/kvm_host.h        |   16 +-
 arch/powerpc/include/asm/kvm_ppc.h         |    9 +
 arch/powerpc/kvm/Kconfig                   |    5 +
 arch/powerpc/kvm/Makefile                  |    2 +
 arch/powerpc/kvm/booke.c                   |   12 +-
 arch/powerpc/kvm/mpic.c                    | 1784 ++++++++++++++++++++++++++++
 arch/powerpc/kvm/powerpc.c                 |   38 +-
 include/linux/kvm_host.h                   |    2 +
 include/uapi/linux/kvm.h                   |   37 +
 virt/kvm/kvm_main.c                        |   40 +
 13 files changed, 2051 insertions(+), 10 deletions(-)
 create mode 100644 Documentation/virtual/kvm/devices/README
 create mode 100644 Documentation/virtual/kvm/devices/mpic.txt
 create mode 100644 arch/powerpc/kvm/mpic.c

-- 
1.7.9.5

^ permalink raw reply	[flat|nested] 261+ messages in thread

* [RFC PATCH v3 1/6] kvm: add device control API
  2013-04-03  1:57     ` Scott Wood
@ 2013-04-03  1:57       ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-03  1:57 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus, Scott Wood

Currently, devices that are emulated inside KVM are configured in a
hardcoded manner based on an assumption that any given architecture
only has one way to do it.  If there's any need to access device state,
it is done through inflexible one-purpose-only IOCTLs (e.g.
KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
cumbersome and depletes a limited numberspace.

This API provides a mechanism to instantiate a device of a certain
type, returning a file descriptor that can be used to set/get attributes of the
device.  Attributes may include configuration parameters (e.g.
register base address), device state, operational commands, etc.  It
is similar to the ONE_REG API, except that it acts on devices rather
than vcpus.

Both device types and individual attributes can be tested without
actually creating the device or getting/setting the attribute, so there
is no need to manage separately enumerated capabilities.

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
v3: remove some changes that were merged into this patch by accident,
and fix the error documentation for KVM_CREATE_DEVICE.

NOTE: I had some difficulty figuring out what ioctl numbers I should
assign...  it seems that at one point care was taken to keep vcpu and
vm ioctls separate, but some overlap exists now (despite not exhausting
the ioctl space).  Some of that was my fault, but not all of it. :-)
I moved to a new ioctl range for device control -- please let me know
if there's something else you'd prefer I do.
---
 Documentation/virtual/kvm/api.txt        |   70 ++++++++++++++++++++++++++++++
 Documentation/virtual/kvm/devices/README |    1 +
 include/uapi/linux/kvm.h                 |   27 ++++++++++++
 virt/kvm/kvm_main.c                      |   31 +++++++++++++
 4 files changed, 129 insertions(+)
 create mode 100644 Documentation/virtual/kvm/devices/README
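
A minimal userspace sketch of the KVM_CREATE_DEVICE flow documented in
the api.txt hunk below, assuming the struct layout and ioctl number
proposed in this patch.  The demo_* names are hypothetical local mirrors,
defined here only so the example builds without the patched
<linux/kvm.h>; the device type value is left to the caller, since this
patch defines no device types yet.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Hypothetical local mirrors of the definitions proposed in this patch. */
#define DEMO_KVMIO               0xAE
#define DEMO_CREATE_DEVICE_TEST  1

struct demo_create_device {
	uint32_t type;   /* in: device type */
	uint32_t fd;     /* out: device handle */
	uint32_t flags;  /* in: DEMO_CREATE_DEVICE_xxx */
};

#define DEMO_CREATE_DEVICE \
	_IOWR(DEMO_KVMIO, 0xe0, struct demo_create_device)

/*
 * Ask whether a device type is supported.  The TEST flag is set, so no
 * device is created.  Returns 0 if supported, -errno otherwise.
 */
int demo_probe_device_type(int vm_fd, uint32_t type)
{
	struct demo_create_device cd = {
		.type = type,
		.flags = DEMO_CREATE_DEVICE_TEST,
	};

	if (ioctl(vm_fd, DEMO_CREATE_DEVICE, &cd) < 0)
		return -errno;
	return 0;
}

/* Create the device.  Returns the device fd on success, -errno on error. */
int demo_create_device(int vm_fd, uint32_t type)
{
	struct demo_create_device cd = { .type = type };

	if (ioctl(vm_fd, DEMO_CREATE_DEVICE, &cd) < 0)
		return -errno;
	return (int)cd.fd;
}
```

A caller would typically probe with the TEST flag first and call
demo_create_device() only when the probe returns 0.  With this patch
alone, every type yields -ENODEV.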

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 976eb65..d52f3f9 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2173,6 +2173,76 @@ header; first `n_valid' valid entries with contents from the data
 written, then `n_invalid' invalid entries, invalidating any previously
 valid entries found.
 
+4.79 KVM_CREATE_DEVICE
+
+Capability: KVM_CAP_DEVICE_CTRL
+Type: vm ioctl
+Parameters: struct kvm_create_device (in/out)
+Returns: 0 on success, -1 on error
+Errors:
+  ENODEV: The device type is unknown or unsupported
+  EEXIST: Device already created, and this type of device may not
+          be instantiated multiple times
+
+  Other error conditions may be defined by individual device types or
+  have their standard meanings.
+
+Creates an emulated device in the kernel.  The file descriptor returned
+in fd can be used with KVM_SET/GET/HAS_DEVICE_ATTR.
+
+If the KVM_CREATE_DEVICE_TEST flag is set, only test whether the
+device type is supported (not necessarily whether it can be created
+in the current vm).
+
+Individual devices should not define flags.  Attributes should be used
+for specifying any behavior that is not implied by the device type
+number.
+
+struct kvm_create_device {
+	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
+	__u32	fd;	/* out: device handle */
+	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
+};
+
+4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR
+
+Capability: KVM_CAP_DEVICE_CTRL
+Type: device ioctl
+Parameters: struct kvm_device_attr
+Returns: 0 on success, -1 on error
+Errors:
+  ENXIO:  The group or attribute is unknown/unsupported for this device
+  EPERM:  The attribute cannot (currently) be accessed this way
+          (e.g. read-only attribute, or attribute that only makes
+          sense when the device is in a different state)
+
+  Other error conditions may be defined by individual device types.
+
+Gets/sets a specified piece of device configuration and/or state.  The
+semantics are device-specific.  See individual device documentation in
+the "devices" directory.  As with ONE_REG, the size of the data
+transferred is defined by the particular attribute.
+
+struct kvm_device_attr {
+	__u32	flags;		/* no flags currently defined */
+	__u32	group;		/* device-defined */
+	__u64	attr;		/* group-defined */
+	__u64	addr;		/* userspace address of attr data */
+};
+
+4.81 KVM_HAS_DEVICE_ATTR
+
+Capability: KVM_CAP_DEVICE_CTRL
+Type: device ioctl
+Parameters: struct kvm_device_attr
+Returns: 0 on success, -1 on error
+Errors:
+  ENXIO:  The group or attribute is unknown/unsupported for this device
+
+Tests whether a device supports a particular attribute.  A successful
+return indicates the attribute is implemented.  It does not necessarily
+indicate that the attribute can be read or written in the device's
+current state.  "addr" is ignored.
 
 4.77 KVM_ARM_VCPU_INIT
 
diff --git a/Documentation/virtual/kvm/devices/README b/Documentation/virtual/kvm/devices/README
new file mode 100644
index 0000000..34a6983
--- /dev/null
+++ b/Documentation/virtual/kvm/devices/README
@@ -0,0 +1 @@
+This directory contains specific device bindings for KVM_CAP_DEVICE_CTRL.
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 74d0ff3..20ce2d2 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_EPR 86
 #define KVM_CAP_ARM_PSCI 87
 #define KVM_CAP_ARM_SET_DEVICE_ADDR 88
+#define KVM_CAP_DEVICE_CTRL 89
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -909,6 +910,32 @@ struct kvm_s390_ucas_mapping {
 #define KVM_ARM_SET_DEVICE_ADDR	  _IOW(KVMIO,  0xab, struct kvm_arm_device_addr)
 
 /*
+ * Device control API, available with KVM_CAP_DEVICE_CTRL
+ */
+#define KVM_CREATE_DEVICE_TEST		1
+
+struct kvm_create_device {
+	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
+	__u32	fd;	/* out: device handle */
+	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
+};
+
+struct kvm_device_attr {
+	__u32	flags;		/* no flags currently defined */
+	__u32	group;		/* device-defined */
+	__u64	attr;		/* group-defined */
+	__u64	addr;		/* userspace address of attr data */
+};
+
+/* ioctl for vm fd */
+#define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xe0, struct kvm_create_device)
+
+/* ioctls for fds returned by KVM_CREATE_DEVICE */
+#define KVM_SET_DEVICE_ATTR	  _IOW(KVMIO,  0xe1, struct kvm_device_attr)
+#define KVM_GET_DEVICE_ATTR	  _IOW(KVMIO,  0xe2, struct kvm_device_attr)
+#define KVM_HAS_DEVICE_ATTR	  _IOW(KVMIO,  0xe3, struct kvm_device_attr)
+
+/*
  * ioctls for vcpu fds
  */
 #define KVM_RUN                   _IO(KVMIO,   0x80)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ff71541..ed033c0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2158,6 +2158,17 @@ out:
 }
 #endif
 
+static int kvm_ioctl_create_device(struct kvm *kvm,
+				   struct kvm_create_device *cd)
+{
+	bool test = cd->flags & KVM_CREATE_DEVICE_TEST;
+
+	switch (cd->type) {
+	default:
+		return -ENODEV;
+	}
+}
+
 static long kvm_vm_ioctl(struct file *filp,
 			   unsigned int ioctl, unsigned long arg)
 {
@@ -2272,6 +2283,26 @@ static long kvm_vm_ioctl(struct file *filp,
 		break;
 	}
 #endif
+	case KVM_CREATE_DEVICE: {
+		struct kvm_create_device cd;
+
+		r = -EFAULT;
+		if (copy_from_user(&cd, argp, sizeof(cd)))
+			goto out;
+
+		mutex_lock(&kvm->lock);
+		r = kvm_ioctl_create_device(kvm, &cd);
+		mutex_unlock(&kvm->lock);
+		if (r)
+			goto out;
+
+		r = -EFAULT;
+		if (copy_to_user(argp, &cd, sizeof(cd)))
+			goto out;
+
+		r = 0;
+		break;
+	}
 	default:
 		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
 		if (r == -ENOTTY)
-- 
1.7.9.5



^ permalink raw reply related	[flat|nested] 261+ messages in thread
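
The attribute ioctls described in sections 4.80 and 4.81 of the patch
above can be exercised from userspace roughly as follows.  This is a
sketch under assumptions: the demo_* names are hypothetical mirrors of
the proposed definitions, and the 32-bit payload is made up for
illustration, since the real size is defined by each attribute.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Hypothetical local mirrors of the definitions proposed in the patch. */
#define DEMO_KVMIO 0xAE

struct demo_device_attr {
	uint32_t flags;  /* no flags currently defined */
	uint32_t group;  /* device-defined */
	uint64_t attr;   /* group-defined */
	uint64_t addr;   /* userspace address of attr data */
};

#define DEMO_SET_DEVICE_ATTR _IOW(DEMO_KVMIO, 0xe1, struct demo_device_attr)
#define DEMO_GET_DEVICE_ATTR _IOW(DEMO_KVMIO, 0xe2, struct demo_device_attr)
#define DEMO_HAS_DEVICE_ATTR _IOW(DEMO_KVMIO, 0xe3, struct demo_device_attr)

/* Build a request; 'data' points at the buffer the kernel reads or fills. */
struct demo_device_attr demo_make_attr(uint32_t group, uint64_t attr,
				       void *data)
{
	struct demo_device_attr da = {
		.flags = 0,
		.group = group,
		.attr = attr,
		.addr = (uint64_t)(uintptr_t)data,
	};

	return da;
}

/*
 * Write a 32-bit attribute, checking first that the device implements it.
 * Returns 0 on success, -errno on error.
 */
int demo_set_attr_u32(int dev_fd, uint32_t group, uint64_t attr, uint32_t val)
{
	struct demo_device_attr da = demo_make_attr(group, attr, &val);

	if (ioctl(dev_fd, DEMO_HAS_DEVICE_ATTR, &da) < 0)
		return -errno;	/* ENXIO: unknown group or attribute */
	if (ioctl(dev_fd, DEMO_SET_DEVICE_ATTR, &da) < 0)
		return -errno;	/* e.g. EPERM in the wrong device state */
	return 0;
}
```

The HAS check only says the attribute is implemented; the SET call can
still fail with EPERM if the attribute is not writable in the device's
current state.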

* [RFC PATCH v3 2/6] kvm/ppc/mpic: import hw/openpic.c from QEMU
  2013-04-03  1:57     ` Scott Wood
@ 2013-04-03  1:57       ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-03  1:57 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus, Scott Wood

This is QEMU's hw/openpic.c from commit
abd8d4a4d6dfea7ddea72f095f993e1de941614e ("Update version for
1.4.0-rc0"), run through Lindent with no other changes to ease merging
future changes between Linux and QEMU.  Remaining style issues
(including those introduced by Lindent) will be fixed in a later patch.

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
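As a reading aid for IRQ_check() in the diff below: the function walks
every set bit in a queue and records the IRQ with the highest IVPR
priority.  A simplified standalone sketch of that scan, assuming a
single 64-bit bitmap and a plain priority array in place of the
OpenPICState fields (all demo_* names are hypothetical):

```c
#include <stdint.h>

#define DEMO_MAX_IRQ 64

struct demo_queue {
	uint64_t pending;	/* bit n set => IRQ n is queued */
	int next;		/* winner of the last scan, -1 if none */
	int priority;		/* priority of the winner, -1 if none */
};

/*
 * Scan all queued IRQs and remember the one with the highest priority.
 * The strict '>' keeps the lowest-numbered IRQ among equal priorities,
 * matching the ascending bit scan in IRQ_check().
 */
void demo_irq_check(struct demo_queue *q, const int prio[DEMO_MAX_IRQ])
{
	int irq;
	int next = -1;
	int priority = -1;

	for (irq = 0; irq < DEMO_MAX_IRQ; irq++) {
		if (!(q->pending & (UINT64_C(1) << irq)))
			continue;

		if (prio[irq] > priority) {
			next = irq;
			priority = prio[irq];
		}
	}

	q->next = next;
	q->priority = priority;
}
```

An empty queue leaves both next and priority at -1, which is what the
callers in the diff test for.
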
 arch/powerpc/kvm/mpic.c | 1686 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 1686 insertions(+)
 create mode 100644 arch/powerpc/kvm/mpic.c
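
IRQ_local_pipe() in the diff below keeps a per-output count
(outputs_active) for non-INT outputs, so a line such as critical
interrupt or machine check is raised by the first asserting source and
lowered only when the last one goes away.  A stripped-down sketch of
that counting, with a hypothetical demo_output struct standing in for
IRQDest and a level field standing in for qemu_irq_raise() and
qemu_irq_lower():

```c
struct demo_output {
	unsigned int active_count;	/* sources currently asserting */
	int level;			/* 1 = raised, 0 = lowered */
};

/*
 * Apply one source's state change to a shared output line.
 * 'active' is the source's new state, 'was_active' its previous one.
 */
void demo_output_update(struct demo_output *out, int active, int was_active)
{
	if (active) {
		/* Count only inactive -> active transitions. */
		if (!was_active && out->active_count++ == 0)
			out->level = 1;	/* first asserter raises the line */
	} else {
		/* Count only active -> inactive transitions. */
		if (was_active && --out->active_count == 0)
			out->level = 0;	/* last asserter lowers the line */
	}
}
```

Calls that do not change a source's state (active with was_active set,
or inactive with was_active clear) leave the count untouched, so the
function is safe to call on every register update.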

diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
new file mode 100644
index 0000000..57655b9
--- /dev/null
+++ b/arch/powerpc/kvm/mpic.c
@@ -0,0 +1,1686 @@
+/*
+ * OpenPIC emulation
+ *
+ * Copyright (c) 2004 Jocelyn Mayer
+ *               2011 Alexander Graf
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+/*
+ *
+ * Based on OpenPic implementations:
+ * - Intel GW80314 I/O companion chip developer's manual
+ * - Motorola MPC8245 & MPC8540 user manuals.
+ * - Motorola MCP750 (aka Raven) programmer manual.
+ * - Motorola Harrier programmer manuel
+ *
+ * Serial interrupts, as implemented in Raven chipset are not supported yet.
+ *
+ */
+#include "hw.h"
+#include "ppc/mac.h"
+#include "pci/pci.h"
+#include "openpic.h"
+#include "sysbus.h"
+#include "pci/msi.h"
+#include "qemu/bitops.h"
+#include "ppc.h"
+
+//#define DEBUG_OPENPIC
+
+#ifdef DEBUG_OPENPIC
+static const int debug_openpic = 1;
+#else
+static const int debug_openpic = 0;
+#endif
+
+#define DPRINTF(fmt, ...) do { \
+        if (debug_openpic) { \
+            printf(fmt , ## __VA_ARGS__); \
+        } \
+    } while (0)
+
+#define MAX_CPU     32
+#define MAX_SRC     256
+#define MAX_TMR     4
+#define MAX_IPI     4
+#define MAX_MSI     8
+#define MAX_IRQ     (MAX_SRC + MAX_IPI + MAX_TMR)
+#define VID         0x03	/* MPIC version ID */
+
+/* OpenPIC capability flags */
+#define OPENPIC_FLAG_IDR_CRIT     (1 << 0)
+#define OPENPIC_FLAG_ILR          (2 << 0)
+
+/* OpenPIC address map */
+#define OPENPIC_GLB_REG_START        0x0
+#define OPENPIC_GLB_REG_SIZE         0x10F0
+#define OPENPIC_TMR_REG_START        0x10F0
+#define OPENPIC_TMR_REG_SIZE         0x220
+#define OPENPIC_MSI_REG_START        0x1600
+#define OPENPIC_MSI_REG_SIZE         0x200
+#define OPENPIC_SUMMARY_REG_START   0x3800
+#define OPENPIC_SUMMARY_REG_SIZE    0x800
+#define OPENPIC_SRC_REG_START        0x10000
+#define OPENPIC_SRC_REG_SIZE         (MAX_SRC * 0x20)
+#define OPENPIC_CPU_REG_START        0x20000
+#define OPENPIC_CPU_REG_SIZE         0x100 + ((MAX_CPU - 1) * 0x1000)
+
+/* Raven */
+#define RAVEN_MAX_CPU      2
+#define RAVEN_MAX_EXT     48
+#define RAVEN_MAX_IRQ     64
+#define RAVEN_MAX_TMR      MAX_TMR
+#define RAVEN_MAX_IPI      MAX_IPI
+
+/* Interrupt definitions */
+#define RAVEN_FE_IRQ     (RAVEN_MAX_EXT)	/* Internal functional IRQ */
+#define RAVEN_ERR_IRQ    (RAVEN_MAX_EXT + 1)	/* Error IRQ */
+#define RAVEN_TMR_IRQ    (RAVEN_MAX_EXT + 2)	/* First timer IRQ */
+#define RAVEN_IPI_IRQ    (RAVEN_TMR_IRQ + RAVEN_MAX_TMR)	/* First IPI IRQ */
+/* First doorbell IRQ */
+#define RAVEN_DBL_IRQ    (RAVEN_IPI_IRQ + (RAVEN_MAX_CPU * RAVEN_MAX_IPI))
+
+typedef struct FslMpicInfo {
+	int max_ext;
+} FslMpicInfo;
+
+static FslMpicInfo fsl_mpic_20 = {
+	.max_ext = 12,
+};
+
+static FslMpicInfo fsl_mpic_42 = {
+	.max_ext = 12,
+};
+
+#define FRR_NIRQ_SHIFT    16
+#define FRR_NCPU_SHIFT     8
+#define FRR_VID_SHIFT      0
+
+#define VID_REVISION_1_2   2
+#define VID_REVISION_1_3   3
+
+#define VIR_GENERIC      0x00000000	/* Generic Vendor ID */
+
+#define GCR_RESET        0x80000000
+#define GCR_MODE_PASS    0x00000000
+#define GCR_MODE_MIXED   0x20000000
+#define GCR_MODE_PROXY   0x60000000
+
+#define TBCR_CI           0x80000000	/* count inhibit */
+#define TCCR_TOG          0x80000000	/* toggles when decrement to zero */
+
+#define IDR_EP_SHIFT      31
+#define IDR_EP_MASK       (1 << IDR_EP_SHIFT)
+#define IDR_CI0_SHIFT     30
+#define IDR_CI1_SHIFT     29
+#define IDR_P1_SHIFT      1
+#define IDR_P0_SHIFT      0
+
+#define ILR_INTTGT_MASK   0x000000ff
+#define ILR_INTTGT_INT    0x00
+#define ILR_INTTGT_CINT   0x01	/* critical */
+#define ILR_INTTGT_MCP    0x02	/* machine check */
+
+/* The currently supported INTTGT values happen to be the same as QEMU's
+ * openpic output codes, but don't depend on this.  The output codes
+ * could change (unlikely, but...) or support could be added for
+ * more INTTGT values.
+ */
+static const int inttgt_output[][2] = {
+	{ILR_INTTGT_INT, OPENPIC_OUTPUT_INT},
+	{ILR_INTTGT_CINT, OPENPIC_OUTPUT_CINT},
+	{ILR_INTTGT_MCP, OPENPIC_OUTPUT_MCK},
+};
+
+static int inttgt_to_output(int inttgt)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(inttgt_output); i++) {
+		if (inttgt_output[i][0] == inttgt) {
+			return inttgt_output[i][1];
+		}
+	}
+
+	fprintf(stderr, "%s: unsupported inttgt %d\n", __func__, inttgt);
+	return OPENPIC_OUTPUT_INT;
+}
+
+static int output_to_inttgt(int output)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(inttgt_output); i++) {
+		if (inttgt_output[i][1] == output) {
+			return inttgt_output[i][0];
+		}
+	}
+
+	abort();
+}
+
+#define MSIIR_OFFSET       0x140
+#define MSIIR_SRS_SHIFT    29
+#define MSIIR_SRS_MASK     (0x7 << MSIIR_SRS_SHIFT)
+#define MSIIR_IBS_SHIFT    24
+#define MSIIR_IBS_MASK     (0x1f << MSIIR_IBS_SHIFT)
+
+static int get_current_cpu(void)
+{
+	CPUState *cpu_single_cpu;
+
+	if (!cpu_single_env) {
+		return -1;
+	}
+
+	cpu_single_cpu = ENV_GET_CPU(cpu_single_env);
+	return cpu_single_cpu->cpu_index;
+}
+
+static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx);
+static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
+				       uint32_t val, int idx);
+
+typedef enum IRQType {
+	IRQ_TYPE_NORMAL = 0,
+	IRQ_TYPE_FSLINT,	/* FSL internal interrupt -- level only */
+	IRQ_TYPE_FSLSPECIAL,	/* FSL timer/IPI interrupt, edge, no polarity */
+} IRQType;
+
+typedef struct IRQQueue {
+	/* Round up to the nearest 64 IRQs so that the queue length
+	 * won't change when moving between 32 and 64 bit hosts.
+	 */
+	unsigned long queue[BITS_TO_LONGS((MAX_IRQ + 63) & ~63)];
+	int next;
+	int priority;
+} IRQQueue;
+
+typedef struct IRQSource {
+	uint32_t ivpr;		/* IRQ vector/priority register */
+	uint32_t idr;		/* IRQ destination register */
+	uint32_t destmask;	/* bitmap of CPU destinations */
+	int last_cpu;
+	int output;		/* IRQ level, e.g. OPENPIC_OUTPUT_INT */
+	int pending;		/* TRUE if IRQ is pending */
+	IRQType type;
+	bool level:1;		/* level-triggered */
+	bool nomask:1;		/* critical interrupts ignore mask on some FSL MPICs */
+} IRQSource;
+
+#define IVPR_MASK_SHIFT       31
+#define IVPR_MASK_MASK        (1 << IVPR_MASK_SHIFT)
+#define IVPR_ACTIVITY_SHIFT   30
+#define IVPR_ACTIVITY_MASK    (1 << IVPR_ACTIVITY_SHIFT)
+#define IVPR_MODE_SHIFT       29
+#define IVPR_MODE_MASK        (1 << IVPR_MODE_SHIFT)
+#define IVPR_POLARITY_SHIFT   23
+#define IVPR_POLARITY_MASK    (1 << IVPR_POLARITY_SHIFT)
+#define IVPR_SENSE_SHIFT      22
+#define IVPR_SENSE_MASK       (1 << IVPR_SENSE_SHIFT)
+
+#define IVPR_PRIORITY_MASK     (0xF << 16)
+#define IVPR_PRIORITY(_ivprr_) ((int)(((_ivprr_) & IVPR_PRIORITY_MASK) >> 16))
+#define IVPR_VECTOR(opp, _ivprr_) ((_ivprr_) & (opp)->vector_mask)
+
+/* IDR[EP/CI] are only for FSL MPIC prior to v4.0 */
+#define IDR_EP      0x80000000	/* external pin */
+#define IDR_CI      0x40000000	/* critical interrupt */
+
+typedef struct IRQDest {
+	int32_t ctpr;		/* CPU current task priority */
+	IRQQueue raised;
+	IRQQueue servicing;
+	qemu_irq *irqs;
+
+	/* Count of IRQ sources asserting on non-INT outputs */
+	uint32_t outputs_active[OPENPIC_OUTPUT_NB];
+} IRQDest;
+
+typedef struct OpenPICState {
+	SysBusDevice busdev;
+	MemoryRegion mem;
+
+	/* Behavior control */
+	FslMpicInfo *fsl;
+	uint32_t model;
+	uint32_t flags;
+	uint32_t nb_irqs;
+	uint32_t vid;
+	uint32_t vir;		/* Vendor identification register */
+	uint32_t vector_mask;
+	uint32_t tfrr_reset;
+	uint32_t ivpr_reset;
+	uint32_t idr_reset;
+	uint32_t brr1;
+	uint32_t mpic_mode_mask;
+
+	/* Sub-regions */
+	MemoryRegion sub_io_mem[6];
+
+	/* Global registers */
+	uint32_t frr;		/* Feature reporting register */
+	uint32_t gcr;		/* Global configuration register  */
+	uint32_t pir;		/* Processor initialization register */
+	uint32_t spve;		/* Spurious vector register */
+	uint32_t tfrr;		/* Timer frequency reporting register */
+	/* Source registers */
+	IRQSource src[MAX_IRQ];
+	/* Local registers per output pin */
+	IRQDest dst[MAX_CPU];
+	uint32_t nb_cpus;
+	/* Timer registers */
+	struct {
+		uint32_t tccr;	/* Global timer current count register */
+		uint32_t tbcr;	/* Global timer base count register */
+	} timers[MAX_TMR];
+	/* Shared MSI registers */
+	struct {
+		uint32_t msir;	/* Shared Message Signaled Interrupt Register */
+	} msi[MAX_MSI];
+	uint32_t max_irq;
+	uint32_t irq_ipi0;
+	uint32_t irq_tim0;
+	uint32_t irq_msi;
+} OpenPICState;
+
+static inline void IRQ_setbit(IRQQueue * q, int n_IRQ)
+{
+	set_bit(n_IRQ, q->queue);
+}
+
+static inline void IRQ_resetbit(IRQQueue * q, int n_IRQ)
+{
+	clear_bit(n_IRQ, q->queue);
+}
+
+static inline int IRQ_testbit(IRQQueue * q, int n_IRQ)
+{
+	return test_bit(n_IRQ, q->queue);
+}
+
+static void IRQ_check(OpenPICState * opp, IRQQueue * q)
+{
+	int irq = -1;
+	int next = -1;
+	int priority = -1;
+
+	for (;;) {
+		irq = find_next_bit(q->queue, opp->max_irq, irq + 1);
+		if (irq == opp->max_irq) {
+			break;
+		}
+
+		DPRINTF("IRQ_check: irq %d set ivpr_pr=%d pr=%d\n",
+			irq, IVPR_PRIORITY(opp->src[irq].ivpr), priority);
+
+		if (IVPR_PRIORITY(opp->src[irq].ivpr) > priority) {
+			next = irq;
+			priority = IVPR_PRIORITY(opp->src[irq].ivpr);
+		}
+	}
+
+	q->next = next;
+	q->priority = priority;
+}
+
+static int IRQ_get_next(OpenPICState * opp, IRQQueue * q)
+{
+	/* XXX: optimize */
+	IRQ_check(opp, q);
+
+	return q->next;
+}
+
+static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
+			   bool active, bool was_active)
+{
+	IRQDest *dst;
+	IRQSource *src;
+	int priority;
+
+	dst = &opp->dst[n_CPU];
+	src = &opp->src[n_IRQ];
+
+	DPRINTF("%s: IRQ %d active %d was %d\n",
+		__func__, n_IRQ, active, was_active);
+
+	if (src->output != OPENPIC_OUTPUT_INT) {
+		DPRINTF("%s: output %d irq %d active %d was %d count %d\n",
+			__func__, src->output, n_IRQ, active, was_active,
+			dst->outputs_active[src->output]);
+
+		/* On Freescale MPIC, critical interrupts ignore priority,
+		 * IACK, EOI, etc.  Before MPIC v4.1 they also ignore
+		 * masking.
+		 */
+		if (active) {
+			if (!was_active
+			    && dst->outputs_active[src->output]++ == 0) {
+				DPRINTF
+				    ("%s: Raise OpenPIC output %d cpu %d irq %d\n",
+				     __func__, src->output, n_CPU, n_IRQ);
+				qemu_irq_raise(dst->irqs[src->output]);
+			}
+		} else {
+			if (was_active
+			    && --dst->outputs_active[src->output] == 0) {
+				DPRINTF
+				    ("%s: Lower OpenPIC output %d cpu %d irq %d\n",
+				     __func__, src->output, n_CPU, n_IRQ);
+				qemu_irq_lower(dst->irqs[src->output]);
+			}
+		}
+
+		return;
+	}
+
+	priority = IVPR_PRIORITY(src->ivpr);
+
+	/* Even if the interrupt doesn't have enough priority,
+	 * it is still raised, in case ctpr is lowered later.
+	 */
+	if (active) {
+		IRQ_setbit(&dst->raised, n_IRQ);
+	} else {
+		IRQ_resetbit(&dst->raised, n_IRQ);
+	}
+
+	IRQ_check(opp, &dst->raised);
+
+	if (active && priority <= dst->ctpr) {
+		DPRINTF
+		    ("%s: IRQ %d priority %d too low for ctpr %d on CPU %d\n",
+		     __func__, n_IRQ, priority, dst->ctpr, n_CPU);
+		active = 0;
+	}
+
+	if (active) {
+		if (IRQ_get_next(opp, &dst->servicing) >= 0 &&
+		    priority <= dst->servicing.priority) {
+			DPRINTF
+			    ("%s: IRQ %d is hidden by servicing IRQ %d on CPU %d\n",
+			     __func__, n_IRQ, dst->servicing.next, n_CPU);
+		} else {
+			DPRINTF
+			    ("%s: Raise OpenPIC INT output cpu %d irq %d/%d\n",
+			     __func__, n_CPU, n_IRQ, dst->raised.next);
+			qemu_irq_raise(opp->dst[n_CPU].
+				       irqs[OPENPIC_OUTPUT_INT]);
+		}
+	} else {
+		IRQ_get_next(opp, &dst->servicing);
+		if (dst->raised.priority > dst->ctpr &&
+		    dst->raised.priority > dst->servicing.priority) {
+			DPRINTF
+			    ("%s: IRQ %d inactive, IRQ %d prio %d above %d/%d, CPU %d\n",
+			     __func__, n_IRQ, dst->raised.next,
+			     dst->raised.priority, dst->ctpr,
+			     dst->servicing.priority, n_CPU);
+			/* IRQ line stays asserted */
+		} else {
+			DPRINTF
+			    ("%s: IRQ %d inactive, current prio %d/%d, CPU %d\n",
+			     __func__, n_IRQ, dst->ctpr,
+			     dst->servicing.priority, n_CPU);
+			qemu_irq_lower(opp->dst[n_CPU].
+				       irqs[OPENPIC_OUTPUT_INT]);
+		}
+	}
+}
+
+/* update pic state because registers for n_IRQ have changed value */
+static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
+{
+	IRQSource *src;
+	bool active, was_active;
+	int i;
+
+	src = &opp->src[n_IRQ];
+	active = src->pending;
+
+	if ((src->ivpr & IVPR_MASK_MASK) && !src->nomask) {
+		/* Interrupt source is disabled */
+		DPRINTF("%s: IRQ %d is disabled\n", __func__, n_IRQ);
+		active = false;
+	}
+
+	was_active = ! !(src->ivpr & IVPR_ACTIVITY_MASK);
+
+	/*
+	 * We don't have a similar check for already-active because
+	 * ctpr may have changed and we need to withdraw the interrupt.
+	 */
+	if (!active && !was_active) {
+		DPRINTF("%s: IRQ %d is already inactive\n", __func__, n_IRQ);
+		return;
+	}
+
+	if (active) {
+		src->ivpr |= IVPR_ACTIVITY_MASK;
+	} else {
+		src->ivpr &= ~IVPR_ACTIVITY_MASK;
+	}
+
+	if (src->destmask == 0) {
+		/* No target */
+		DPRINTF("%s: IRQ %d has no target\n", __func__, n_IRQ);
+		return;
+	}
+
+	if (src->destmask == (1 << src->last_cpu)) {
+		/* Only one CPU is allowed to receive this IRQ */
+		IRQ_local_pipe(opp, src->last_cpu, n_IRQ, active, was_active);
+	} else if (!(src->ivpr & IVPR_MODE_MASK)) {
+		/* Directed delivery mode */
+		for (i = 0; i < opp->nb_cpus; i++) {
+			if (src->destmask & (1 << i)) {
+				IRQ_local_pipe(opp, i, n_IRQ, active,
+					       was_active);
+			}
+		}
+	} else {
+		/* Distributed delivery mode */
+		for (i = src->last_cpu + 1; i != src->last_cpu; i++) {
+			if (i == opp->nb_cpus) {
+				i = 0;
+			}
+			if (src->destmask & (1 << i)) {
+				IRQ_local_pipe(opp, i, n_IRQ, active,
+					       was_active);
+				src->last_cpu = i;
+				break;
+			}
+		}
+	}
+}
+
+static void openpic_set_irq(void *opaque, int n_IRQ, int level)
+{
+	OpenPICState *opp = opaque;
+	IRQSource *src;
+
+	if (n_IRQ >= MAX_IRQ) {
+		fprintf(stderr, "%s: IRQ %d out of range\n", __func__, n_IRQ);
+		abort();
+	}
+
+	src = &opp->src[n_IRQ];
+	DPRINTF("openpic: set irq %d = %d ivpr=0x%08x\n",
+		n_IRQ, level, src->ivpr);
+	if (src->level) {
+		/* level-sensitive irq */
+		src->pending = level;
+		openpic_update_irq(opp, n_IRQ);
+	} else {
+		/* edge-sensitive irq */
+		if (level) {
+			src->pending = 1;
+			openpic_update_irq(opp, n_IRQ);
+		}
+
+		if (src->output != OPENPIC_OUTPUT_INT) {
+			/* Edge-triggered interrupts shouldn't be used
+			 * with non-INT delivery, but just in case,
+			 * try to make it do something sane rather than
+			 * cause an interrupt storm.  This is close to
+			 * what you'd probably see happen in real hardware.
+			 */
+			src->pending = 0;
+			openpic_update_irq(opp, n_IRQ);
+		}
+	}
+}
+
+static void openpic_reset(DeviceState * d)
+{
+	OpenPICState *opp = FROM_SYSBUS(typeof(*opp), SYS_BUS_DEVICE(d));
+	int i;
+
+	opp->gcr = GCR_RESET;
+	/* Initialise controller registers */
+	opp->frr = ((opp->nb_irqs - 1) << FRR_NIRQ_SHIFT) |
+	    ((opp->nb_cpus - 1) << FRR_NCPU_SHIFT) |
+	    (opp->vid << FRR_VID_SHIFT);
+
+	opp->pir = 0;
+	opp->spve = -1 & opp->vector_mask;
+	opp->tfrr = opp->tfrr_reset;
+	/* Initialise IRQ sources */
+	for (i = 0; i < opp->max_irq; i++) {
+		opp->src[i].ivpr = opp->ivpr_reset;
+		opp->src[i].idr = opp->idr_reset;
+
+		switch (opp->src[i].type) {
+		case IRQ_TYPE_NORMAL:
+			opp->src[i].level =
+			    ! !(opp->ivpr_reset & IVPR_SENSE_MASK);
+			break;
+
+		case IRQ_TYPE_FSLINT:
+			opp->src[i].ivpr |= IVPR_POLARITY_MASK;
+			break;
+
+		case IRQ_TYPE_FSLSPECIAL:
+			break;
+		}
+	}
+	/* Initialise IRQ destinations */
+	for (i = 0; i < MAX_CPU; i++) {
+		opp->dst[i].ctpr = 15;
+		memset(&opp->dst[i].raised, 0, sizeof(IRQQueue));
+		opp->dst[i].raised.next = -1;
+		memset(&opp->dst[i].servicing, 0, sizeof(IRQQueue));
+		opp->dst[i].servicing.next = -1;
+	}
+	/* Initialise timers */
+	for (i = 0; i < MAX_TMR; i++) {
+		opp->timers[i].tccr = 0;
+		opp->timers[i].tbcr = TBCR_CI;
+	}
+	/* Go out of RESET state */
+	opp->gcr = 0;
+}
+
+static inline uint32_t read_IRQreg_idr(OpenPICState * opp, int n_IRQ)
+{
+	return opp->src[n_IRQ].idr;
+}
+
+static inline uint32_t read_IRQreg_ilr(OpenPICState * opp, int n_IRQ)
+{
+	if (opp->flags & OPENPIC_FLAG_ILR) {
+		return output_to_inttgt(opp->src[n_IRQ].output);
+	}
+
+	return 0xffffffff;
+}
+
+static inline uint32_t read_IRQreg_ivpr(OpenPICState * opp, int n_IRQ)
+{
+	return opp->src[n_IRQ].ivpr;
+}
+
+static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
+{
+	IRQSource *src = &opp->src[n_IRQ];
+	uint32_t normal_mask = (1UL << opp->nb_cpus) - 1;
+	uint32_t crit_mask = 0;
+	uint32_t mask = normal_mask;
+	int crit_shift = IDR_EP_SHIFT - opp->nb_cpus;
+	int i;
+
+	if (opp->flags & OPENPIC_FLAG_IDR_CRIT) {
+		crit_mask = mask << crit_shift;
+		mask |= crit_mask | IDR_EP;
+	}
+
+	src->idr = val & mask;
+	DPRINTF("Set IDR %d to 0x%08x\n", n_IRQ, src->idr);
+
+	if (opp->flags & OPENPIC_FLAG_IDR_CRIT) {
+		if (src->idr & crit_mask) {
+			if (src->idr & normal_mask) {
+				DPRINTF
+				    ("%s: IRQ configured for multiple output types, using "
+				     "critical\n", __func__);
+			}
+
+			src->output = OPENPIC_OUTPUT_CINT;
+			src->nomask = true;
+			src->destmask = 0;
+
+			for (i = 0; i < opp->nb_cpus; i++) {
+				int n_ci = IDR_CI0_SHIFT - i;
+
+				if (src->idr & (1UL << n_ci)) {
+					src->destmask |= 1UL << i;
+				}
+			}
+		} else {
+			src->output = OPENPIC_OUTPUT_INT;
+			src->nomask = false;
+			src->destmask = src->idr & normal_mask;
+		}
+	} else {
+		src->destmask = src->idr;
+	}
+}
+
+static inline void write_IRQreg_ilr(OpenPICState * opp, int n_IRQ, uint32_t val)
+{
+	if (opp->flags & OPENPIC_FLAG_ILR) {
+		IRQSource *src = &opp->src[n_IRQ];
+
+		src->output = inttgt_to_output(val & ILR_INTTGT_MASK);
+		DPRINTF("Set ILR %d to 0x%08x, output %d\n", n_IRQ, val,
+			src->output);
+
+		/* TODO: on MPIC v4.0 only, set nomask for non-INT */
+	}
+}
+
+static inline void write_IRQreg_ivpr(OpenPICState * opp, int n_IRQ,
+				     uint32_t val)
+{
+	uint32_t mask;
+
+	/* NOTE when implementing newer FSL MPIC models: starting with v4.0,
+	 * the polarity bit is read-only on internal interrupts.
+	 */
+	mask = IVPR_MASK_MASK | IVPR_PRIORITY_MASK | IVPR_SENSE_MASK |
+	    IVPR_POLARITY_MASK | opp->vector_mask;
+
+	/* ACTIVITY bit is read-only */
+	opp->src[n_IRQ].ivpr =
+	    (opp->src[n_IRQ].ivpr & IVPR_ACTIVITY_MASK) | (val & mask);
+
+	/* For FSL internal interrupts, the sense bit is reserved and zero,
+	 * and the interrupt is always level-triggered.  Timers and IPIs
+	 * have no sense or polarity bits, and are edge-triggered.
+	 */
+	switch (opp->src[n_IRQ].type) {
+	case IRQ_TYPE_NORMAL:
+		opp->src[n_IRQ].level =
+		    ! !(opp->src[n_IRQ].ivpr & IVPR_SENSE_MASK);
+		break;
+
+	case IRQ_TYPE_FSLINT:
+		opp->src[n_IRQ].ivpr &= ~IVPR_SENSE_MASK;
+		break;
+
+	case IRQ_TYPE_FSLSPECIAL:
+		opp->src[n_IRQ].ivpr &= ~(IVPR_POLARITY_MASK | IVPR_SENSE_MASK);
+		break;
+	}
+
+	openpic_update_irq(opp, n_IRQ);
+	DPRINTF("Set IVPR %d to 0x%08x -> 0x%08x\n", n_IRQ, val,
+		opp->src[n_IRQ].ivpr);
+}
+
+static void openpic_gcr_write(OpenPICState * opp, uint64_t val)
+{
+	bool mpic_proxy = false;
+
+	if (val & GCR_RESET) {
+		openpic_reset(&opp->busdev.qdev);
+		return;
+	}
+
+	opp->gcr &= ~opp->mpic_mode_mask;
+	opp->gcr |= val & opp->mpic_mode_mask;
+
+	/* Set external proxy mode */
+	if ((val & opp->mpic_mode_mask) == GCR_MODE_PROXY) {
+		mpic_proxy = true;
+	}
+
+	ppce500_set_mpic_proxy(mpic_proxy);
+}
+
+static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned len)
+{
+	OpenPICState *opp = opaque;
+	IRQDest *dst;
+	int idx;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+		__func__, addr, val);
+	if (addr & 0xF) {
+		return;
+	}
+	switch (addr) {
+	case 0x00:		/* Block Revision Register1 (BRR1) is Readonly */
+		break;
+	case 0x40:
+	case 0x50:
+	case 0x60:
+	case 0x70:
+	case 0x80:
+	case 0x90:
+	case 0xA0:
+	case 0xB0:
+		openpic_cpu_write_internal(opp, addr, val, get_current_cpu());
+		break;
+	case 0x1000:		/* FRR */
+		break;
+	case 0x1020:		/* GCR */
+		openpic_gcr_write(opp, val);
+		break;
+	case 0x1080:		/* VIR */
+		break;
+	case 0x1090:		/* PIR */
+		for (idx = 0; idx < opp->nb_cpus; idx++) {
+			if ((val & (1 << idx)) && !(opp->pir & (1 << idx))) {
+				DPRINTF
+				    ("Raise OpenPIC RESET output for CPU %d\n",
+				     idx);
+				dst = &opp->dst[idx];
+				qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_RESET]);
+			} else if (!(val & (1 << idx))
+				   && (opp->pir & (1 << idx))) {
+				DPRINTF
+				    ("Lower OpenPIC RESET output for CPU %d\n",
+				     idx);
+				dst = &opp->dst[idx];
+				qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_RESET]);
+			}
+		}
+		opp->pir = val;
+		break;
+	case 0x10A0:		/* IPI_IVPR */
+	case 0x10B0:
+	case 0x10C0:
+	case 0x10D0:
+		{
+			int idx;
+			idx = (addr - 0x10A0) >> 4;
+			write_IRQreg_ivpr(opp, opp->irq_ipi0 + idx, val);
+		}
+		break;
+	case 0x10E0:		/* SPVE */
+		opp->spve = val & opp->vector_mask;
+		break;
+	default:
+		break;
+	}
+}
+
+static uint64_t openpic_gbl_read(void *opaque, hwaddr addr, unsigned len)
+{
+	OpenPICState *opp = opaque;
+	uint32_t retval;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	retval = 0xFFFFFFFF;
+	if (addr & 0xF) {
+		return retval;
+	}
+	switch (addr) {
+	case 0x1000:		/* FRR */
+		retval = opp->frr;
+		break;
+	case 0x1020:		/* GCR */
+		retval = opp->gcr;
+		break;
+	case 0x1080:		/* VIR */
+		retval = opp->vir;
+		break;
+	case 0x1090:		/* PIR */
+		retval = 0x00000000;
+		break;
+	case 0x00:		/* Block Revision Register1 (BRR1) */
+		retval = opp->brr1;
+		break;
+	case 0x40:
+	case 0x50:
+	case 0x60:
+	case 0x70:
+	case 0x80:
+	case 0x90:
+	case 0xA0:
+	case 0xB0:
+		retval =
+		    openpic_cpu_read_internal(opp, addr, get_current_cpu());
+		break;
+	case 0x10A0:		/* IPI_IVPR */
+	case 0x10B0:
+	case 0x10C0:
+	case 0x10D0:
+		{
+			int idx;
+			idx = (addr - 0x10A0) >> 4;
+			retval = read_IRQreg_ivpr(opp, opp->irq_ipi0 + idx);
+		}
+		break;
+	case 0x10E0:		/* SPVE */
+		retval = opp->spve;
+		break;
+	default:
+		break;
+	}
+	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+
+	return retval;
+}
+
+static void openpic_tmr_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned len)
+{
+	OpenPICState *opp = opaque;
+	int idx;
+
+	addr += 0x10f0;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+		__func__, addr, val);
+	if (addr & 0xF) {
+		return;
+	}
+
+	if (addr == 0x10f0) {
+		/* TFRR */
+		opp->tfrr = val;
+		return;
+	}
+
+	idx = (addr >> 6) & 0x3;
+	addr = addr & 0x30;
+
+	switch (addr & 0x30) {
+	case 0x00:		/* TCCR */
+		break;
+	case 0x10:		/* TBCR */
+		if ((opp->timers[idx].tccr & TCCR_TOG) != 0 &&
+		    (val & TBCR_CI) == 0 &&
+		    (opp->timers[idx].tbcr & TBCR_CI) != 0) {
+			opp->timers[idx].tccr &= ~TCCR_TOG;
+		}
+		opp->timers[idx].tbcr = val;
+		break;
+	case 0x20:		/* TVPR */
+		write_IRQreg_ivpr(opp, opp->irq_tim0 + idx, val);
+		break;
+	case 0x30:		/* TDR */
+		write_IRQreg_idr(opp, opp->irq_tim0 + idx, val);
+		break;
+	}
+}
+
+static uint64_t openpic_tmr_read(void *opaque, hwaddr addr, unsigned len)
+{
+	OpenPICState *opp = opaque;
+	uint32_t retval = -1;
+	int idx;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	if (addr & 0xF) {
+		goto out;
+	}
+	idx = (addr >> 6) & 0x3;
+	if (addr == 0x0) {
+		/* TFRR */
+		retval = opp->tfrr;
+		goto out;
+	}
+	switch (addr & 0x30) {
+	case 0x00:		/* TCCR */
+		retval = opp->timers[idx].tccr;
+		break;
+	case 0x10:		/* TBCR */
+		retval = opp->timers[idx].tbcr;
+		break;
+	case 0x20:		/* TIPV */
+		retval = read_IRQreg_ivpr(opp, opp->irq_tim0 + idx);
+		break;
+	case 0x30:		/* TIDE (TIDR) */
+		retval = read_IRQreg_idr(opp, opp->irq_tim0 + idx);
+		break;
+	}
+
+out:
+	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+
+	return retval;
+}
+
+static void openpic_src_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned len)
+{
+	OpenPICState *opp = opaque;
+	int idx;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+		__func__, addr, val);
+
+	addr = addr & 0xffff;
+	idx = addr >> 5;
+
+	switch (addr & 0x1f) {
+	case 0x00:
+		write_IRQreg_ivpr(opp, idx, val);
+		break;
+	case 0x10:
+		write_IRQreg_idr(opp, idx, val);
+		break;
+	case 0x18:
+		write_IRQreg_ilr(opp, idx, val);
+		break;
+	}
+}
+
+static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
+{
+	OpenPICState *opp = opaque;
+	uint32_t retval;
+	int idx;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	retval = 0xFFFFFFFF;
+
+	addr = addr & 0xffff;
+	idx = addr >> 5;
+
+	switch (addr & 0x1f) {
+	case 0x00:
+		retval = read_IRQreg_ivpr(opp, idx);
+		break;
+	case 0x10:
+		retval = read_IRQreg_idr(opp, idx);
+		break;
+	case 0x18:
+		retval = read_IRQreg_ilr(opp, idx);
+		break;
+	}
+
+	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	return retval;
+}
+
+static void openpic_msi_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned size)
+{
+	OpenPICState *opp = opaque;
+	int idx = opp->irq_msi;
+	int srs, ibs;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
+		__func__, addr, val);
+	if (addr & 0xF) {
+		return;
+	}
+
+	switch (addr) {
+	case MSIIR_OFFSET:
+		srs = val >> MSIIR_SRS_SHIFT;
+		idx += srs;
+		ibs = (val & MSIIR_IBS_MASK) >> MSIIR_IBS_SHIFT;
+		opp->msi[srs].msir |= 1 << ibs;
+		openpic_set_irq(opp, idx, 1);
+		break;
+	default:
+		/* most registers are read-only, thus ignored */
+		break;
+	}
+}
+
+static uint64_t openpic_msi_read(void *opaque, hwaddr addr, unsigned size)
+{
+	OpenPICState *opp = opaque;
+	uint64_t r = 0;
+	int i, srs;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	if (addr & 0xF) {
+		return -1;
+	}
+
+	srs = addr >> 4;
+
+	switch (addr) {
+	case 0x00:
+	case 0x10:
+	case 0x20:
+	case 0x30:
+	case 0x40:
+	case 0x50:
+	case 0x60:
+	case 0x70:		/* MSIRs */
+		r = opp->msi[srs].msir;
+		/* Clear on read */
+		opp->msi[srs].msir = 0;
+		openpic_set_irq(opp, opp->irq_msi + srs, 0);
+		break;
+	case 0x120:		/* MSISR */
+		for (i = 0; i < MAX_MSI; i++) {
+			r |= (opp->msi[i].msir ? 1 : 0) << i;
+		}
+		break;
+	}
+
+	return r;
+}
+
+static uint64_t openpic_summary_read(void *opaque, hwaddr addr, unsigned size)
+{
+	uint64_t r = 0;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+
+	/* TODO: EISR/EIMR */
+
+	return r;
+}
+
+static void openpic_summary_write(void *opaque, hwaddr addr, uint64_t val,
+				  unsigned size)
+{
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
+		__func__, addr, val);
+
+	/* TODO: EISR/EIMR */
+}
+
+static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
+				       uint32_t val, int idx)
+{
+	OpenPICState *opp = opaque;
+	IRQSource *src;
+	IRQDest *dst;
+	int s_IRQ, n_IRQ;
+
+	DPRINTF("%s: cpu %d addr %#" HWADDR_PRIx " <= 0x%08x\n", __func__, idx,
+		addr, val);
+
+	if (idx < 0) {
+		return;
+	}
+
+	if (addr & 0xF) {
+		return;
+	}
+	dst = &opp->dst[idx];
+	addr &= 0xFF0;
+	switch (addr) {
+	case 0x40:		/* IPIDR */
+	case 0x50:
+	case 0x60:
+	case 0x70:
+		idx = (addr - 0x40) >> 4;
+		/* we still use IDE as the mask of CPUs to deliver the IPI to. */
+		opp->src[opp->irq_ipi0 + idx].destmask |= val;
+		openpic_set_irq(opp, opp->irq_ipi0 + idx, 1);
+		openpic_set_irq(opp, opp->irq_ipi0 + idx, 0);
+		break;
+	case 0x80:		/* CTPR */
+		dst->ctpr = val & 0x0000000F;
+
+		DPRINTF("%s: set CPU %d ctpr to %d, raised %d servicing %d\n",
+			__func__, idx, dst->ctpr, dst->raised.priority,
+			dst->servicing.priority);
+
+		if (dst->raised.priority <= dst->ctpr) {
+			DPRINTF
+			    ("%s: Lower OpenPIC INT output cpu %d due to ctpr\n",
+			     __func__, idx);
+			qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
+		} else if (dst->raised.priority > dst->servicing.priority) {
+			DPRINTF("%s: Raise OpenPIC INT output cpu %d irq %d\n",
+				__func__, idx, dst->raised.next);
+			qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_INT]);
+		}
+
+		break;
+	case 0x90:		/* WHOAMI */
+		/* Read-only register */
+		break;
+	case 0xA0:		/* IACK */
+		/* Read-only register */
+		break;
+	case 0xB0:		/* EOI */
+		DPRINTF("EOI\n");
+		s_IRQ = IRQ_get_next(opp, &dst->servicing);
+
+		if (s_IRQ < 0) {
+			DPRINTF("%s: EOI with no interrupt in service\n",
+				__func__);
+			break;
+		}
+
+		IRQ_resetbit(&dst->servicing, s_IRQ);
+		/* Set up next servicing IRQ */
+		s_IRQ = IRQ_get_next(opp, &dst->servicing);
+		/* Check queued interrupts. */
+		n_IRQ = IRQ_get_next(opp, &dst->raised);
+		src = &opp->src[n_IRQ];
+		if (n_IRQ != -1 &&
+		    (s_IRQ == -1 ||
+		     IVPR_PRIORITY(src->ivpr) > dst->servicing.priority)) {
+			DPRINTF("Raise OpenPIC INT output cpu %d irq %d\n",
+				idx, n_IRQ);
+			qemu_irq_raise(opp->dst[idx].irqs[OPENPIC_OUTPUT_INT]);
+		}
+		break;
+	default:
+		break;
+	}
+}
+
+static void openpic_cpu_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned len)
+{
+	openpic_cpu_write_internal(opaque, addr, val, (addr & 0x1f000) >> 12);
+}
+
+static uint32_t openpic_iack(OpenPICState * opp, IRQDest * dst, int cpu)
+{
+	IRQSource *src;
+	int retval, irq;
+
+	DPRINTF("Lower OpenPIC INT output\n");
+	qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
+
+	irq = IRQ_get_next(opp, &dst->raised);
+	DPRINTF("IACK: irq=%d\n", irq);
+
+	if (irq == -1) {
+		/* No more interrupt pending */
+		return opp->spve;
+	}
+
+	src = &opp->src[irq];
+	if (!(src->ivpr & IVPR_ACTIVITY_MASK) ||
+	    !(IVPR_PRIORITY(src->ivpr) > dst->ctpr)) {
+		fprintf(stderr, "%s: bad raised IRQ %d ctpr %d ivpr 0x%08x\n",
+			__func__, irq, dst->ctpr, src->ivpr);
+		openpic_update_irq(opp, irq);
+		retval = opp->spve;
+	} else {
+		/* IRQ enter servicing state */
+		IRQ_setbit(&dst->servicing, irq);
+		retval = IVPR_VECTOR(opp, src->ivpr);
+	}
+
+	if (!src->level) {
+		/* edge-sensitive IRQ */
+		src->ivpr &= ~IVPR_ACTIVITY_MASK;
+		src->pending = 0;
+		IRQ_resetbit(&dst->raised, irq);
+	}
+
+	if ((irq >= opp->irq_ipi0) && (irq < (opp->irq_ipi0 + MAX_IPI))) {
+		src->destmask &= ~(1 << cpu);
+		if (src->destmask && !src->level) {
+			/* trigger on CPUs that didn't know about it yet */
+			openpic_set_irq(opp, irq, 1);
+			openpic_set_irq(opp, irq, 0);
+			/* if all CPUs knew about it, set active bit again */
+			src->ivpr |= IVPR_ACTIVITY_MASK;
+		}
+	}
+
+	return retval;
+}
+
+static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx)
+{
+	OpenPICState *opp = opaque;
+	IRQDest *dst;
+	uint32_t retval;
+
+	DPRINTF("%s: cpu %d addr %#" HWADDR_PRIx "\n", __func__, idx, addr);
+	retval = 0xFFFFFFFF;
+
+	if (idx < 0) {
+		return retval;
+	}
+
+	if (addr & 0xF) {
+		return retval;
+	}
+	dst = &opp->dst[idx];
+	addr &= 0xFF0;
+	switch (addr) {
+	case 0x80:		/* CTPR */
+		retval = dst->ctpr;
+		break;
+	case 0x90:		/* WHOAMI */
+		retval = idx;
+		break;
+	case 0xA0:		/* IACK */
+		retval = openpic_iack(opp, dst, idx);
+		break;
+	case 0xB0:		/* EOI */
+		retval = 0;
+		break;
+	default:
+		break;
+	}
+	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+
+	return retval;
+}
+
+static uint64_t openpic_cpu_read(void *opaque, hwaddr addr, unsigned len)
+{
+	return openpic_cpu_read_internal(opaque, addr, (addr & 0x1f000) >> 12);
+}
+
+static const MemoryRegionOps openpic_glb_ops_le = {
+	.write = openpic_gbl_write,
+	.read = openpic_gbl_read,
+	.endianness = DEVICE_LITTLE_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_glb_ops_be = {
+	.write = openpic_gbl_write,
+	.read = openpic_gbl_read,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_tmr_ops_le = {
+	.write = openpic_tmr_write,
+	.read = openpic_tmr_read,
+	.endianness = DEVICE_LITTLE_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_tmr_ops_be = {
+	.write = openpic_tmr_write,
+	.read = openpic_tmr_read,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_cpu_ops_le = {
+	.write = openpic_cpu_write,
+	.read = openpic_cpu_read,
+	.endianness = DEVICE_LITTLE_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_cpu_ops_be = {
+	.write = openpic_cpu_write,
+	.read = openpic_cpu_read,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_src_ops_le = {
+	.write = openpic_src_write,
+	.read = openpic_src_read,
+	.endianness = DEVICE_LITTLE_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_src_ops_be = {
+	.write = openpic_src_write,
+	.read = openpic_src_read,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_msi_ops_be = {
+	.read = openpic_msi_read,
+	.write = openpic_msi_write,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_summary_ops_be = {
+	.read = openpic_summary_read,
+	.write = openpic_summary_write,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static void openpic_save_IRQ_queue(QEMUFile * f, IRQQueue * q)
+{
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(q->queue); i++) {
+		/* Always put the lower half of a 64-bit long first, in case we
+		 * restore on a 32-bit host.  The least significant bits correspond
+		 * to lower IRQ numbers in the bitmap.
+		 */
+		qemu_put_be32(f, (uint32_t) q->queue[i]);
+#if LONG_MAX > 0x7FFFFFFF
+		qemu_put_be32(f, (uint32_t) (q->queue[i] >> 32));
+#endif
+	}
+
+	qemu_put_sbe32s(f, &q->next);
+	qemu_put_sbe32s(f, &q->priority);
+}
+
+static void openpic_save(QEMUFile * f, void *opaque)
+{
+	OpenPICState *opp = (OpenPICState *) opaque;
+	unsigned int i;
+
+	qemu_put_be32s(f, &opp->gcr);
+	qemu_put_be32s(f, &opp->vir);
+	qemu_put_be32s(f, &opp->pir);
+	qemu_put_be32s(f, &opp->spve);
+	qemu_put_be32s(f, &opp->tfrr);
+
+	qemu_put_be32s(f, &opp->nb_cpus);
+
+	for (i = 0; i < opp->nb_cpus; i++) {
+		qemu_put_sbe32s(f, &opp->dst[i].ctpr);
+		openpic_save_IRQ_queue(f, &opp->dst[i].raised);
+		openpic_save_IRQ_queue(f, &opp->dst[i].servicing);
+		qemu_put_buffer(f, (uint8_t *) & opp->dst[i].outputs_active,
+				sizeof(opp->dst[i].outputs_active));
+	}
+
+	for (i = 0; i < MAX_TMR; i++) {
+		qemu_put_be32s(f, &opp->timers[i].tccr);
+		qemu_put_be32s(f, &opp->timers[i].tbcr);
+	}
+
+	for (i = 0; i < opp->max_irq; i++) {
+		qemu_put_be32s(f, &opp->src[i].ivpr);
+		qemu_put_be32s(f, &opp->src[i].idr);
+		qemu_put_be32s(f, &opp->src[i].destmask);
+		qemu_put_sbe32s(f, &opp->src[i].last_cpu);
+		qemu_put_sbe32s(f, &opp->src[i].pending);
+	}
+}
+
+static void openpic_load_IRQ_queue(QEMUFile * f, IRQQueue * q)
+{
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(q->queue); i++) {
+		unsigned long val;
+
+		val = qemu_get_be32(f);
+#if LONG_MAX > 0x7FFFFFFF
+		/* second word is the high half, matching the save order */
+		val |= (unsigned long)qemu_get_be32(f) << 32;
+#endif
+
+		q->queue[i] = val;
+	}
+
+	qemu_get_sbe32s(f, &q->next);
+	qemu_get_sbe32s(f, &q->priority);
+}
+
+static int openpic_load(QEMUFile * f, void *opaque, int version_id)
+{
+	OpenPICState *opp = (OpenPICState *) opaque;
+	unsigned int i;
+
+	if (version_id != 1) {
+		return -EINVAL;
+	}
+
+	qemu_get_be32s(f, &opp->gcr);
+	qemu_get_be32s(f, &opp->vir);
+	qemu_get_be32s(f, &opp->pir);
+	qemu_get_be32s(f, &opp->spve);
+	qemu_get_be32s(f, &opp->tfrr);
+
+	qemu_get_be32s(f, &opp->nb_cpus);
+
+	for (i = 0; i < opp->nb_cpus; i++) {
+		qemu_get_sbe32s(f, &opp->dst[i].ctpr);
+		openpic_load_IRQ_queue(f, &opp->dst[i].raised);
+		openpic_load_IRQ_queue(f, &opp->dst[i].servicing);
+		qemu_get_buffer(f, (uint8_t *) & opp->dst[i].outputs_active,
+				sizeof(opp->dst[i].outputs_active));
+	}
+
+	for (i = 0; i < MAX_TMR; i++) {
+		qemu_get_be32s(f, &opp->timers[i].tccr);
+		qemu_get_be32s(f, &opp->timers[i].tbcr);
+	}
+
+	for (i = 0; i < opp->max_irq; i++) {
+		uint32_t val;
+
+		val = qemu_get_be32(f);
+		write_IRQreg_idr(opp, i, val);
+		val = qemu_get_be32(f);
+		write_IRQreg_ivpr(opp, i, val);
+
+		qemu_get_be32s(f, &opp->src[i].ivpr);
+		qemu_get_be32s(f, &opp->src[i].idr);
+		qemu_get_be32s(f, &opp->src[i].destmask);
+		qemu_get_sbe32s(f, &opp->src[i].last_cpu);
+		qemu_get_sbe32s(f, &opp->src[i].pending);
+	}
+
+	return 0;
+}
+
+typedef struct MemReg {
+	const char *name;
+	MemoryRegionOps const *ops;
+	hwaddr start_addr;
+	ram_addr_t size;
+} MemReg;
+
+static void fsl_common_init(OpenPICState * opp)
+{
+	int i;
+	int virq = MAX_SRC;
+
+	opp->vid = VID_REVISION_1_2;
+	opp->vir = VIR_GENERIC;
+	opp->vector_mask = 0xFFFF;
+	opp->tfrr_reset = 0;
+	opp->ivpr_reset = IVPR_MASK_MASK;
+	opp->idr_reset = 1 << 0;
+	opp->max_irq = MAX_IRQ;
+
+	opp->irq_ipi0 = virq;
+	virq += MAX_IPI;
+	opp->irq_tim0 = virq;
+	virq += MAX_TMR;
+
+	assert(virq <= MAX_IRQ);
+
+	opp->irq_msi = 224;
+
+	msi_supported = true;
+	for (i = 0; i < opp->fsl->max_ext; i++) {
+		opp->src[i].level = false;
+	}
+
+	/* Internal interrupts, including message and MSI */
+	for (i = 16; i < MAX_SRC; i++) {
+		opp->src[i].type = IRQ_TYPE_FSLINT;
+		opp->src[i].level = true;
+	}
+
+	/* timers and IPIs */
+	for (i = MAX_SRC; i < virq; i++) {
+		opp->src[i].type = IRQ_TYPE_FSLSPECIAL;
+		opp->src[i].level = false;
+	}
+}
+
+static void map_list(OpenPICState * opp, const MemReg * list, int *count)
+{
+	while (list->name) {
+		assert(*count < ARRAY_SIZE(opp->sub_io_mem));
+
+		memory_region_init_io(&opp->sub_io_mem[*count], list->ops, opp,
+				      list->name, list->size);
+
+		memory_region_add_subregion(&opp->mem, list->start_addr,
+					    &opp->sub_io_mem[*count]);
+
+		(*count)++;
+		list++;
+	}
+}
+
+static int openpic_init(SysBusDevice * dev)
+{
+	OpenPICState *opp = FROM_SYSBUS(typeof(*opp), dev);
+	int i, j;
+	int list_count = 0;
+	static const MemReg list_le[] = {
+		{"glb", &openpic_glb_ops_le,
+		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
+		{"tmr", &openpic_tmr_ops_le,
+		 OPENPIC_TMR_REG_START, OPENPIC_TMR_REG_SIZE},
+		{"src", &openpic_src_ops_le,
+		 OPENPIC_SRC_REG_START, OPENPIC_SRC_REG_SIZE},
+		{"cpu", &openpic_cpu_ops_le,
+		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
+		{NULL}
+	};
+	static const MemReg list_be[] = {
+		{"glb", &openpic_glb_ops_be,
+		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
+		{"tmr", &openpic_tmr_ops_be,
+		 OPENPIC_TMR_REG_START, OPENPIC_TMR_REG_SIZE},
+		{"src", &openpic_src_ops_be,
+		 OPENPIC_SRC_REG_START, OPENPIC_SRC_REG_SIZE},
+		{"cpu", &openpic_cpu_ops_be,
+		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
+		{NULL}
+	};
+	static const MemReg list_fsl[] = {
+		{"msi", &openpic_msi_ops_be,
+		 OPENPIC_MSI_REG_START, OPENPIC_MSI_REG_SIZE},
+		{"summary", &openpic_summary_ops_be,
+		 OPENPIC_SUMMARY_REG_START, OPENPIC_SUMMARY_REG_SIZE},
+		{NULL}
+	};
+
+	memory_region_init(&opp->mem, "openpic", 0x40000);
+
+	switch (opp->model) {
+	case OPENPIC_MODEL_FSL_MPIC_20:
+	default:
+		opp->fsl = &fsl_mpic_20;
+		opp->brr1 = 0x00400200;
+		opp->flags |= OPENPIC_FLAG_IDR_CRIT;
+		opp->nb_irqs = 80;
+		opp->mpic_mode_mask = GCR_MODE_MIXED;
+
+		fsl_common_init(opp);
+		map_list(opp, list_be, &list_count);
+		map_list(opp, list_fsl, &list_count);
+
+		break;
+
+	case OPENPIC_MODEL_FSL_MPIC_42:
+		opp->fsl = &fsl_mpic_42;
+		opp->brr1 = 0x00400402;
+		opp->flags |= OPENPIC_FLAG_ILR;
+		opp->nb_irqs = 196;
+		opp->mpic_mode_mask = GCR_MODE_PROXY;
+
+		fsl_common_init(opp);
+		map_list(opp, list_be, &list_count);
+		map_list(opp, list_fsl, &list_count);
+
+		break;
+
+	case OPENPIC_MODEL_RAVEN:
+		opp->nb_irqs = RAVEN_MAX_EXT;
+		opp->vid = VID_REVISION_1_3;
+		opp->vir = VIR_GENERIC;
+		opp->vector_mask = 0xFF;
+		opp->tfrr_reset = 4160000;
+		opp->ivpr_reset = IVPR_MASK_MASK | IVPR_MODE_MASK;
+		opp->idr_reset = 0;
+		opp->max_irq = RAVEN_MAX_IRQ;
+		opp->irq_ipi0 = RAVEN_IPI_IRQ;
+		opp->irq_tim0 = RAVEN_TMR_IRQ;
+		opp->brr1 = -1;
+		opp->mpic_mode_mask = GCR_MODE_MIXED;
+
+		/* Only UP supported today */
+		if (opp->nb_cpus != 1) {
+			return -EINVAL;
+		}
+
+		map_list(opp, list_le, &list_count);
+		break;
+	}
+
+	for (i = 0; i < opp->nb_cpus; i++) {
+		opp->dst[i].irqs = g_new(qemu_irq, OPENPIC_OUTPUT_NB);
+		for (j = 0; j < OPENPIC_OUTPUT_NB; j++) {
+			sysbus_init_irq(dev, &opp->dst[i].irqs[j]);
+		}
+	}
+
+	register_savevm(&opp->busdev.qdev, "openpic", 0, 2,
+			openpic_save, openpic_load, opp);
+
+	sysbus_init_mmio(dev, &opp->mem);
+	qdev_init_gpio_in(&dev->qdev, openpic_set_irq, opp->max_irq);
+
+	return 0;
+}
+
+static Property openpic_properties[] = {
+	DEFINE_PROP_UINT32("model", OpenPICState, model,
+			   OPENPIC_MODEL_FSL_MPIC_20),
+	DEFINE_PROP_UINT32("nb_cpus", OpenPICState, nb_cpus, 1),
+	DEFINE_PROP_END_OF_LIST(),
+};
+
+static void openpic_class_init(ObjectClass * klass, void *data)
+{
+	DeviceClass *dc = DEVICE_CLASS(klass);
+	SysBusDeviceClass *k = SYS_BUS_DEVICE_CLASS(klass);
+
+	k->init = openpic_init;
+	dc->props = openpic_properties;
+	dc->reset = openpic_reset;
+}
+
+static const TypeInfo openpic_info = {
+	.name = "openpic",
+	.parent = TYPE_SYS_BUS_DEVICE,
+	.instance_size = sizeof(OpenPICState),
+	.class_init = openpic_class_init,
+};
+
+static void openpic_register_types(void)
+{
+	type_register_static(&openpic_info);
+}
+
+type_init(openpic_register_types)
-- 
1.7.9.5
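
Reading aid for openpic_save_IRQ_queue() and openpic_load_IRQ_queue()
in the patch above: the comment in the save path says the low half of
each 64-bit bitmap word is written first, so a loader has to treat the
first word it reads as the low half.  A minimal standalone sketch of
that ordering follows.  It assumes a hypothetical struct word_stream in
place of QEMUFile and uint64_t in place of the LONG_MAX-dependent
unsigned long; none of these helper names exist in QEMU or the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for QEMUFile: a flat array of 32-bit words. */
struct word_stream {
	uint32_t words[8];
	unsigned int pos;
};

static void put_be32(struct word_stream *s, uint32_t v)
{
	assert(s->pos < 8);
	s->words[s->pos++] = v;
}

static uint32_t get_be32(struct word_stream *s)
{
	assert(s->pos < 8);
	return s->words[s->pos++];
}

/* Save one 64-bit bitmap word: low half first, then high half. */
static void save_queue_word(struct word_stream *s, uint64_t w)
{
	put_be32(s, (uint32_t)w);
	put_be32(s, (uint32_t)(w >> 32));
}

/* Load mirrors the save order: the first word read is the low half. */
static uint64_t load_queue_word(struct word_stream *s)
{
	uint64_t w = get_be32(s);

	w |= (uint64_t)get_be32(s) << 32;
	return w;
}
```

Keeping the low half first means a 32-bit host can read just the first
word of each pair and still see the lower IRQ numbers in the expected
bit positions.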

^ permalink raw reply related	[flat|nested] 261+ messages in thread

* [RFC PATCH v3 2/6] kvm/ppc/mpic: import hw/openpic.c from QEMU
@ 2013-04-03  1:57       ` Scott Wood
  0 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-03  1:57 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus, Scott Wood

This is QEMU's hw/openpic.c from commit
abd8d4a4d6dfea7ddea72f095f993e1de941614e ("Update version for
1.4.0-rc0"), run through Lindent with no other changes to ease merging
future changes between Linux and QEMU.  Remaining style issues
(including those introduced by Lindent) will be fixed in a later patch.

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
 arch/powerpc/kvm/mpic.c | 1686 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 1686 insertions(+)
 create mode 100644 arch/powerpc/kvm/mpic.c
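
Reading aid, not part of the patch: IRQ_check() in the imported file
walks a queue's bitmap and records the pending IRQ with the highest
IVPR priority.  The comparison is strict, so on a tie the lowest IRQ
number wins.  The sketch below illustrates that selection under
simplifying assumptions: a single 32-bit word instead of the real
bitmap, a plain priority array instead of IVPR_PRIORITY(), and a scan
that starts from priority -1.  struct irq_queue, NIRQ and irq_check()
are hypothetical names for illustration only.

```c
#include <assert.h>
#include <stdint.h>

#define NIRQ 8

/* Hypothetical, simplified queue: one bit per IRQ in a single word. */
struct irq_queue {
	uint32_t pending;
	int next;	/* selected IRQ, -1 when nothing is pending */
	int priority;	/* priority of the selected IRQ, -1 when none */
};

/* Select the highest-priority pending IRQ; ties keep the lowest number. */
static void irq_check(struct irq_queue *q, const int prio[NIRQ])
{
	int irq, next = -1, priority = -1;

	for (irq = 0; irq < NIRQ; irq++) {
		if (!(q->pending & (1u << irq)))
			continue;

		if (prio[irq] > priority) {
			next = irq;
			priority = prio[irq];
		}
	}

	q->next = next;
	q->priority = priority;
}
```

IRQ_local_pipe() then compares the selected priority against the
destination's ctpr and the servicing queue's priority to decide whether
the INT output is raised.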

diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
new file mode 100644
index 0000000..57655b9
--- /dev/null
+++ b/arch/powerpc/kvm/mpic.c
@@ -0,0 +1,1686 @@
+/*
+ * OpenPIC emulation
+ *
+ * Copyright (c) 2004 Jocelyn Mayer
+ *               2011 Alexander Graf
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+/*
+ *
+ * Based on OpenPic implementations:
+ * - Intel GW80314 I/O companion chip developer's manual
+ * - Motorola MPC8245 & MPC8540 user manuals.
+ * - Motorola MCP750 (aka Raven) programmer manual.
+ * - Motorola Harrier programmer manuel
+ *
+ * Serial interrupts, as implemented in Raven chipset are not supported yet.
+ *
+ */
+#include "hw.h"
+#include "ppc/mac.h"
+#include "pci/pci.h"
+#include "openpic.h"
+#include "sysbus.h"
+#include "pci/msi.h"
+#include "qemu/bitops.h"
+#include "ppc.h"
+
+//#define DEBUG_OPENPIC
+
+#ifdef DEBUG_OPENPIC
+static const int debug_openpic = 1;
+#else
+static const int debug_openpic = 0;
+#endif
+
+#define DPRINTF(fmt, ...) do { \
+        if (debug_openpic) { \
+            printf(fmt , ## __VA_ARGS__); \
+        } \
+    } while (0)
+
+#define MAX_CPU     32
+#define MAX_SRC     256
+#define MAX_TMR     4
+#define MAX_IPI     4
+#define MAX_MSI     8
+#define MAX_IRQ     (MAX_SRC + MAX_IPI + MAX_TMR)
+#define VID         0x03	/* MPIC version ID */
+
+/* OpenPIC capability flags */
+#define OPENPIC_FLAG_IDR_CRIT     (1 << 0)
+#define OPENPIC_FLAG_ILR          (2 << 0)
+
+/* OpenPIC address map */
+#define OPENPIC_GLB_REG_START        0x0
+#define OPENPIC_GLB_REG_SIZE         0x10F0
+#define OPENPIC_TMR_REG_START        0x10F0
+#define OPENPIC_TMR_REG_SIZE         0x220
+#define OPENPIC_MSI_REG_START        0x1600
+#define OPENPIC_MSI_REG_SIZE         0x200
+#define OPENPIC_SUMMARY_REG_START   0x3800
+#define OPENPIC_SUMMARY_REG_SIZE    0x800
+#define OPENPIC_SRC_REG_START        0x10000
+#define OPENPIC_SRC_REG_SIZE         (MAX_SRC * 0x20)
+#define OPENPIC_CPU_REG_START        0x20000
+#define OPENPIC_CPU_REG_SIZE         0x100 + ((MAX_CPU - 1) * 0x1000)
+
+/* Raven */
+#define RAVEN_MAX_CPU      2
+#define RAVEN_MAX_EXT     48
+#define RAVEN_MAX_IRQ     64
+#define RAVEN_MAX_TMR      MAX_TMR
+#define RAVEN_MAX_IPI      MAX_IPI
+
+/* Interrupt definitions */
+#define RAVEN_FE_IRQ     (RAVEN_MAX_EXT)	/* Internal functional IRQ */
+#define RAVEN_ERR_IRQ    (RAVEN_MAX_EXT + 1)	/* Error IRQ */
+#define RAVEN_TMR_IRQ    (RAVEN_MAX_EXT + 2)	/* First timer IRQ */
+#define RAVEN_IPI_IRQ    (RAVEN_TMR_IRQ + RAVEN_MAX_TMR)	/* First IPI IRQ */
+/* First doorbell IRQ */
+#define RAVEN_DBL_IRQ    (RAVEN_IPI_IRQ + (RAVEN_MAX_CPU * RAVEN_MAX_IPI))
+
+typedef struct FslMpicInfo {
+	int max_ext;
+} FslMpicInfo;
+
+static FslMpicInfo fsl_mpic_20 = {
+	.max_ext = 12,
+};
+
+static FslMpicInfo fsl_mpic_42 = {
+	.max_ext = 12,
+};
+
+#define FRR_NIRQ_SHIFT    16
+#define FRR_NCPU_SHIFT     8
+#define FRR_VID_SHIFT      0
+
+#define VID_REVISION_1_2   2
+#define VID_REVISION_1_3   3
+
+#define VIR_GENERIC      0x00000000	/* Generic Vendor ID */
+
+#define GCR_RESET        0x80000000
+#define GCR_MODE_PASS    0x00000000
+#define GCR_MODE_MIXED   0x20000000
+#define GCR_MODE_PROXY   0x60000000
+
+#define TBCR_CI           0x80000000	/* count inhibit */
+#define TCCR_TOG          0x80000000	/* toggles when decrement to zero */
+
+#define IDR_EP_SHIFT      31
+#define IDR_EP_MASK       (1 << IDR_EP_SHIFT)
+#define IDR_CI0_SHIFT     30
+#define IDR_CI1_SHIFT     29
+#define IDR_P1_SHIFT      1
+#define IDR_P0_SHIFT      0
+
+#define ILR_INTTGT_MASK   0x000000ff
+#define ILR_INTTGT_INT    0x00
+#define ILR_INTTGT_CINT   0x01	/* critical */
+#define ILR_INTTGT_MCP    0x02	/* machine check */
+
+/* The currently supported INTTGT values happen to be the same as QEMU's
+ * openpic output codes, but don't depend on this.  The output codes
+ * could change (unlikely, but...) or support could be added for
+ * more INTTGT values.
+ */
+static const int inttgt_output[][2] = {
+	{ILR_INTTGT_INT, OPENPIC_OUTPUT_INT},
+	{ILR_INTTGT_CINT, OPENPIC_OUTPUT_CINT},
+	{ILR_INTTGT_MCP, OPENPIC_OUTPUT_MCK},
+};
+
+static int inttgt_to_output(int inttgt)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(inttgt_output); i++) {
+		if (inttgt_output[i][0] == inttgt) {
+			return inttgt_output[i][1];
+		}
+	}
+
+	fprintf(stderr, "%s: unsupported inttgt %d\n", __func__, inttgt);
+	return OPENPIC_OUTPUT_INT;
+}
+
+static int output_to_inttgt(int output)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(inttgt_output); i++) {
+		if (inttgt_output[i][1] == output) {
+			return inttgt_output[i][0];
+		}
+	}
+
+	abort();
+}
+
+#define MSIIR_OFFSET       0x140
+#define MSIIR_SRS_SHIFT    29
+#define MSIIR_SRS_MASK     (0x7 << MSIIR_SRS_SHIFT)
+#define MSIIR_IBS_SHIFT    24
+#define MSIIR_IBS_MASK     (0x1f << MSIIR_IBS_SHIFT)
+
+static int get_current_cpu(void)
+{
+	CPUState *cpu_single_cpu;
+
+	if (!cpu_single_env) {
+		return -1;
+	}
+
+	cpu_single_cpu = ENV_GET_CPU(cpu_single_env);
+	return cpu_single_cpu->cpu_index;
+}
+
+static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx);
+static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
+				       uint32_t val, int idx);
+
+typedef enum IRQType {
+	IRQ_TYPE_NORMAL = 0,
+	IRQ_TYPE_FSLINT,	/* FSL internal interrupt -- level only */
+	IRQ_TYPE_FSLSPECIAL,	/* FSL timer/IPI interrupt, edge, no polarity */
+} IRQType;
+
+typedef struct IRQQueue {
+	/* Round up to the nearest 64 IRQs so that the queue length
+	 * won't change when moving between 32 and 64 bit hosts.
+	 */
+	unsigned long queue[BITS_TO_LONGS((MAX_IRQ + 63) & ~63)];
+	int next;
+	int priority;
+} IRQQueue;
+
+typedef struct IRQSource {
+	uint32_t ivpr;		/* IRQ vector/priority register */
+	uint32_t idr;		/* IRQ destination register */
+	uint32_t destmask;	/* bitmap of CPU destinations */
+	int last_cpu;
+	int output;		/* IRQ level, e.g. OPENPIC_OUTPUT_INT */
+	int pending;		/* TRUE if IRQ is pending */
+	IRQType type;
+	bool level:1;		/* level-triggered */
+	bool nomask:1;		/* critical interrupts ignore mask on some FSL MPICs */
+} IRQSource;
+
+#define IVPR_MASK_SHIFT       31
+#define IVPR_MASK_MASK        (1 << IVPR_MASK_SHIFT)
+#define IVPR_ACTIVITY_SHIFT   30
+#define IVPR_ACTIVITY_MASK    (1 << IVPR_ACTIVITY_SHIFT)
+#define IVPR_MODE_SHIFT       29
+#define IVPR_MODE_MASK        (1 << IVPR_MODE_SHIFT)
+#define IVPR_POLARITY_SHIFT   23
+#define IVPR_POLARITY_MASK    (1 << IVPR_POLARITY_SHIFT)
+#define IVPR_SENSE_SHIFT      22
+#define IVPR_SENSE_MASK       (1 << IVPR_SENSE_SHIFT)
+
+#define IVPR_PRIORITY_MASK     (0xF << 16)
+#define IVPR_PRIORITY(_ivprr_) ((int)(((_ivprr_) & IVPR_PRIORITY_MASK) >> 16))
+#define IVPR_VECTOR(opp, _ivprr_) ((_ivprr_) & (opp)->vector_mask)
+
+/* IDR[EP/CI] are only for FSL MPIC prior to v4.0 */
+#define IDR_EP      0x80000000	/* external pin */
+#define IDR_CI      0x40000000	/* critical interrupt */
+
+typedef struct IRQDest {
+	int32_t ctpr;		/* CPU current task priority */
+	IRQQueue raised;
+	IRQQueue servicing;
+	qemu_irq *irqs;
+
+	/* Count of IRQ sources asserting on non-INT outputs */
+	uint32_t outputs_active[OPENPIC_OUTPUT_NB];
+} IRQDest;
+
+typedef struct OpenPICState {
+	SysBusDevice busdev;
+	MemoryRegion mem;
+
+	/* Behavior control */
+	FslMpicInfo *fsl;
+	uint32_t model;
+	uint32_t flags;
+	uint32_t nb_irqs;
+	uint32_t vid;
+	uint32_t vir;		/* Vendor identification register */
+	uint32_t vector_mask;
+	uint32_t tfrr_reset;
+	uint32_t ivpr_reset;
+	uint32_t idr_reset;
+	uint32_t brr1;
+	uint32_t mpic_mode_mask;
+
+	/* Sub-regions */
+	MemoryRegion sub_io_mem[6];
+
+	/* Global registers */
+	uint32_t frr;		/* Feature reporting register */
+	uint32_t gcr;		/* Global configuration register  */
+	uint32_t pir;		/* Processor initialization register */
+	uint32_t spve;		/* Spurious vector register */
+	uint32_t tfrr;		/* Timer frequency reporting register */
+	/* Source registers */
+	IRQSource src[MAX_IRQ];
+	/* Local registers per output pin */
+	IRQDest dst[MAX_CPU];
+	uint32_t nb_cpus;
+	/* Timer registers */
+	struct {
+		uint32_t tccr;	/* Global timer current count register */
+		uint32_t tbcr;	/* Global timer base count register */
+	} timers[MAX_TMR];
+	/* Shared MSI registers */
+	struct {
+		uint32_t msir;	/* Shared Message Signaled Interrupt Register */
+	} msi[MAX_MSI];
+	uint32_t max_irq;
+	uint32_t irq_ipi0;
+	uint32_t irq_tim0;
+	uint32_t irq_msi;
+} OpenPICState;
+
+static inline void IRQ_setbit(IRQQueue * q, int n_IRQ)
+{
+	set_bit(n_IRQ, q->queue);
+}
+
+static inline void IRQ_resetbit(IRQQueue * q, int n_IRQ)
+{
+	clear_bit(n_IRQ, q->queue);
+}
+
+static inline int IRQ_testbit(IRQQueue * q, int n_IRQ)
+{
+	return test_bit(n_IRQ, q->queue);
+}
+
+static void IRQ_check(OpenPICState * opp, IRQQueue * q)
+{
+	int irq = -1;
+	int next = -1;
+	int priority = -1;
+
+	for (;;) {
+		irq = find_next_bit(q->queue, opp->max_irq, irq + 1);
+		if (irq == opp->max_irq) {
+			break;
+		}
+
+		DPRINTF("IRQ_check: irq %d set ivpr_pr=%d pr=%d\n",
+			irq, IVPR_PRIORITY(opp->src[irq].ivpr), priority);
+
+		if (IVPR_PRIORITY(opp->src[irq].ivpr) > priority) {
+			next = irq;
+			priority = IVPR_PRIORITY(opp->src[irq].ivpr);
+		}
+	}
+
+	q->next = next;
+	q->priority = priority;
+}
+
+static int IRQ_get_next(OpenPICState * opp, IRQQueue * q)
+{
+	/* XXX: optimize */
+	IRQ_check(opp, q);
+
+	return q->next;
+}
+
+static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
+			   bool active, bool was_active)
+{
+	IRQDest *dst;
+	IRQSource *src;
+	int priority;
+
+	dst = &opp->dst[n_CPU];
+	src = &opp->src[n_IRQ];
+
+	DPRINTF("%s: IRQ %d active %d was %d\n",
+		__func__, n_IRQ, active, was_active);
+
+	if (src->output != OPENPIC_OUTPUT_INT) {
+		DPRINTF("%s: output %d irq %d active %d was %d count %d\n",
+			__func__, src->output, n_IRQ, active, was_active,
+			dst->outputs_active[src->output]);
+
+		/* On Freescale MPIC, critical interrupts ignore priority,
+		 * IACK, EOI, etc.  Before MPIC v4.1 they also ignore
+		 * masking.
+		 */
+		if (active) {
+			if (!was_active
+			    && dst->outputs_active[src->output]++ == 0) {
+				DPRINTF
+				    ("%s: Raise OpenPIC output %d cpu %d irq %d\n",
+				     __func__, src->output, n_CPU, n_IRQ);
+				qemu_irq_raise(dst->irqs[src->output]);
+			}
+		} else {
+			if (was_active
+			    && --dst->outputs_active[src->output] == 0) {
+				DPRINTF
+				    ("%s: Lower OpenPIC output %d cpu %d irq %d\n",
+				     __func__, src->output, n_CPU, n_IRQ);
+				qemu_irq_lower(dst->irqs[src->output]);
+			}
+		}
+
+		return;
+	}
+
+	priority = IVPR_PRIORITY(src->ivpr);
+
+	/* Even if the interrupt doesn't have enough priority,
+	 * it is still raised, in case ctpr is lowered later.
+	 */
+	if (active) {
+		IRQ_setbit(&dst->raised, n_IRQ);
+	} else {
+		IRQ_resetbit(&dst->raised, n_IRQ);
+	}
+
+	IRQ_check(opp, &dst->raised);
+
+	if (active && priority <= dst->ctpr) {
+		DPRINTF
+		    ("%s: IRQ %d priority %d too low for ctpr %d on CPU %d\n",
+		     __func__, n_IRQ, priority, dst->ctpr, n_CPU);
+		active = 0;
+	}
+
+	if (active) {
+		if (IRQ_get_next(opp, &dst->servicing) >= 0 &&
+		    priority <= dst->servicing.priority) {
+			DPRINTF
+			    ("%s: IRQ %d is hidden by servicing IRQ %d on CPU %d\n",
+			     __func__, n_IRQ, dst->servicing.next, n_CPU);
+		} else {
+			DPRINTF
+			    ("%s: Raise OpenPIC INT output cpu %d irq %d/%d\n",
+			     __func__, n_CPU, n_IRQ, dst->raised.next);
+			qemu_irq_raise(opp->dst[n_CPU].
+				       irqs[OPENPIC_OUTPUT_INT]);
+		}
+	} else {
+		IRQ_get_next(opp, &dst->servicing);
+		if (dst->raised.priority > dst->ctpr &&
+		    dst->raised.priority > dst->servicing.priority) {
+			DPRINTF
+			    ("%s: IRQ %d inactive, IRQ %d prio %d above %d/%d, CPU %d\n",
+			     __func__, n_IRQ, dst->raised.next,
+			     dst->raised.priority, dst->ctpr,
+			     dst->servicing.priority, n_CPU);
+			/* IRQ line stays asserted */
+		} else {
+			DPRINTF
+			    ("%s: IRQ %d inactive, current prio %d/%d, CPU %d\n",
+			     __func__, n_IRQ, dst->ctpr,
+			     dst->servicing.priority, n_CPU);
+			qemu_irq_lower(opp->dst[n_CPU].
+				       irqs[OPENPIC_OUTPUT_INT]);
+		}
+	}
+}
+
+/* update pic state because registers for n_IRQ have changed value */
+static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
+{
+	IRQSource *src;
+	bool active, was_active;
+	int i;
+
+	src = &opp->src[n_IRQ];
+	active = src->pending;
+
+	if ((src->ivpr & IVPR_MASK_MASK) && !src->nomask) {
+		/* Interrupt source is disabled */
+		DPRINTF("%s: IRQ %d is disabled\n", __func__, n_IRQ);
+		active = false;
+	}
+
+	was_active = ! !(src->ivpr & IVPR_ACTIVITY_MASK);
+
+	/*
+	 * We don't have a similar check for already-active because
+	 * ctpr may have changed and we need to withdraw the interrupt.
+	 */
+	if (!active && !was_active) {
+		DPRINTF("%s: IRQ %d is already inactive\n", __func__, n_IRQ);
+		return;
+	}
+
+	if (active) {
+		src->ivpr |= IVPR_ACTIVITY_MASK;
+	} else {
+		src->ivpr &= ~IVPR_ACTIVITY_MASK;
+	}
+
+	if (src->destmask == 0) {
+		/* No target */
+		DPRINTF("%s: IRQ %d has no target\n", __func__, n_IRQ);
+		return;
+	}
+
+	if (src->destmask == (1 << src->last_cpu)) {
+		/* Only one CPU is allowed to receive this IRQ */
+		IRQ_local_pipe(opp, src->last_cpu, n_IRQ, active, was_active);
+	} else if (!(src->ivpr & IVPR_MODE_MASK)) {
+		/* Directed delivery mode */
+		for (i = 0; i < opp->nb_cpus; i++) {
+			if (src->destmask & (1 << i)) {
+				IRQ_local_pipe(opp, i, n_IRQ, active,
+					       was_active);
+			}
+		}
+	} else {
+		/* Distributed delivery mode */
+		for (i = src->last_cpu + 1; i != src->last_cpu; i++) {
+			if (i == opp->nb_cpus) {
+				i = 0;
+			}
+			if (src->destmask & (1 << i)) {
+				IRQ_local_pipe(opp, i, n_IRQ, active,
+					       was_active);
+				src->last_cpu = i;
+				break;
+			}
+		}
+	}
+}
+
+static void openpic_set_irq(void *opaque, int n_IRQ, int level)
+{
+	OpenPICState *opp = opaque;
+	IRQSource *src;
+
+	if (n_IRQ >= MAX_IRQ) {
+		fprintf(stderr, "%s: IRQ %d out of range\n", __func__, n_IRQ);
+		abort();
+	}
+
+	src = &opp->src[n_IRQ];
+	DPRINTF("openpic: set irq %d = %d ivpr=0x%08x\n",
+		n_IRQ, level, src->ivpr);
+	if (src->level) {
+		/* level-sensitive irq */
+		src->pending = level;
+		openpic_update_irq(opp, n_IRQ);
+	} else {
+		/* edge-sensitive irq */
+		if (level) {
+			src->pending = 1;
+			openpic_update_irq(opp, n_IRQ);
+		}
+
+		if (src->output != OPENPIC_OUTPUT_INT) {
+			/* Edge-triggered interrupts shouldn't be used
+			 * with non-INT delivery, but just in case,
+			 * try to make it do something sane rather than
+			 * cause an interrupt storm.  This is close to
+			 * what you'd probably see happen in real hardware.
+			 */
+			src->pending = 0;
+			openpic_update_irq(opp, n_IRQ);
+		}
+	}
+}
+
+static void openpic_reset(DeviceState * d)
+{
+	OpenPICState *opp = FROM_SYSBUS(typeof(*opp), SYS_BUS_DEVICE(d));
+	int i;
+
+	opp->gcr = GCR_RESET;
+	/* Initialise controller registers */
+	opp->frr = ((opp->nb_irqs - 1) << FRR_NIRQ_SHIFT) |
+	    ((opp->nb_cpus - 1) << FRR_NCPU_SHIFT) |
+	    (opp->vid << FRR_VID_SHIFT);
+
+	opp->pir = 0;
+	opp->spve = -1 & opp->vector_mask;
+	opp->tfrr = opp->tfrr_reset;
+	/* Initialise IRQ sources */
+	for (i = 0; i < opp->max_irq; i++) {
+		opp->src[i].ivpr = opp->ivpr_reset;
+		opp->src[i].idr = opp->idr_reset;
+
+		switch (opp->src[i].type) {
+		case IRQ_TYPE_NORMAL:
+			opp->src[i].level =
+			    ! !(opp->ivpr_reset & IVPR_SENSE_MASK);
+			break;
+
+		case IRQ_TYPE_FSLINT:
+			opp->src[i].ivpr |= IVPR_POLARITY_MASK;
+			break;
+
+		case IRQ_TYPE_FSLSPECIAL:
+			break;
+		}
+	}
+	/* Initialise IRQ destinations */
+	for (i = 0; i < MAX_CPU; i++) {
+		opp->dst[i].ctpr = 15;
+		memset(&opp->dst[i].raised, 0, sizeof(IRQQueue));
+		opp->dst[i].raised.next = -1;
+		memset(&opp->dst[i].servicing, 0, sizeof(IRQQueue));
+		opp->dst[i].servicing.next = -1;
+	}
+	/* Initialise timers */
+	for (i = 0; i < MAX_TMR; i++) {
+		opp->timers[i].tccr = 0;
+		opp->timers[i].tbcr = TBCR_CI;
+	}
+	/* Go out of RESET state */
+	opp->gcr = 0;
+}
+
+static inline uint32_t read_IRQreg_idr(OpenPICState * opp, int n_IRQ)
+{
+	return opp->src[n_IRQ].idr;
+}
+
+static inline uint32_t read_IRQreg_ilr(OpenPICState * opp, int n_IRQ)
+{
+	if (opp->flags & OPENPIC_FLAG_ILR) {
+		return output_to_inttgt(opp->src[n_IRQ].output);
+	}
+
+	return 0xffffffff;
+}
+
+static inline uint32_t read_IRQreg_ivpr(OpenPICState * opp, int n_IRQ)
+{
+	return opp->src[n_IRQ].ivpr;
+}
+
+static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
+{
+	IRQSource *src = &opp->src[n_IRQ];
+	uint32_t normal_mask = (1UL << opp->nb_cpus) - 1;
+	uint32_t crit_mask = 0;
+	uint32_t mask = normal_mask;
+	int crit_shift = IDR_EP_SHIFT - opp->nb_cpus;
+	int i;
+
+	if (opp->flags & OPENPIC_FLAG_IDR_CRIT) {
+		crit_mask = mask << crit_shift;
+		mask |= crit_mask | IDR_EP;
+	}
+
+	src->idr = val & mask;
+	DPRINTF("Set IDR %d to 0x%08x\n", n_IRQ, src->idr);
+
+	if (opp->flags & OPENPIC_FLAG_IDR_CRIT) {
+		if (src->idr & crit_mask) {
+			if (src->idr & normal_mask) {
+				DPRINTF
+				    ("%s: IRQ configured for multiple output types, using "
+				     "critical\n", __func__);
+			}
+
+			src->output = OPENPIC_OUTPUT_CINT;
+			src->nomask = true;
+			src->destmask = 0;
+
+			for (i = 0; i < opp->nb_cpus; i++) {
+				int n_ci = IDR_CI0_SHIFT - i;
+
+				if (src->idr & (1UL << n_ci)) {
+					src->destmask |= 1UL << i;
+				}
+			}
+		} else {
+			src->output = OPENPIC_OUTPUT_INT;
+			src->nomask = false;
+			src->destmask = src->idr & normal_mask;
+		}
+	} else {
+		src->destmask = src->idr;
+	}
+}
+
+static inline void write_IRQreg_ilr(OpenPICState * opp, int n_IRQ, uint32_t val)
+{
+	if (opp->flags & OPENPIC_FLAG_ILR) {
+		IRQSource *src = &opp->src[n_IRQ];
+
+		src->output = inttgt_to_output(val & ILR_INTTGT_MASK);
+		DPRINTF("Set ILR %d to 0x%08x, output %d\n", n_IRQ, src->idr,
+			src->output);
+
+		/* TODO: on MPIC v4.0 only, set nomask for non-INT */
+	}
+}
+
+static inline void write_IRQreg_ivpr(OpenPICState * opp, int n_IRQ,
+				     uint32_t val)
+{
+	uint32_t mask;
+
+	/* NOTE when implementing newer FSL MPIC models: starting with v4.0,
+	 * the polarity bit is read-only on internal interrupts.
+	 */
+	mask = IVPR_MASK_MASK | IVPR_PRIORITY_MASK | IVPR_SENSE_MASK |
+	    IVPR_POLARITY_MASK | opp->vector_mask;
+
+	/* ACTIVITY bit is read-only */
+	opp->src[n_IRQ].ivpr =
+	    (opp->src[n_IRQ].ivpr & IVPR_ACTIVITY_MASK) | (val & mask);
+
+	/* For FSL internal interrupts, The sense bit is reserved and zero,
+	 * and the interrupt is always level-triggered.  Timers and IPIs
+	 * have no sense or polarity bits, and are edge-triggered.
+	 */
+	switch (opp->src[n_IRQ].type) {
+	case IRQ_TYPE_NORMAL:
+		opp->src[n_IRQ].level =
+		    ! !(opp->src[n_IRQ].ivpr & IVPR_SENSE_MASK);
+		break;
+
+	case IRQ_TYPE_FSLINT:
+		opp->src[n_IRQ].ivpr &= ~IVPR_SENSE_MASK;
+		break;
+
+	case IRQ_TYPE_FSLSPECIAL:
+		opp->src[n_IRQ].ivpr &= ~(IVPR_POLARITY_MASK | IVPR_SENSE_MASK);
+		break;
+	}
+
+	openpic_update_irq(opp, n_IRQ);
+	DPRINTF("Set IVPR %d to 0x%08x -> 0x%08x\n", n_IRQ, val,
+		opp->src[n_IRQ].ivpr);
+}
+
+static void openpic_gcr_write(OpenPICState * opp, uint64_t val)
+{
+	bool mpic_proxy = false;
+
+	if (val & GCR_RESET) {
+		openpic_reset(&opp->busdev.qdev);
+		return;
+	}
+
+	opp->gcr &= ~opp->mpic_mode_mask;
+	opp->gcr |= val & opp->mpic_mode_mask;
+
+	/* Set external proxy mode */
+	if ((val & opp->mpic_mode_mask) == GCR_MODE_PROXY) {
+		mpic_proxy = true;
+	}
+
+	ppce500_set_mpic_proxy(mpic_proxy);
+}
+
+static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned len)
+{
+	OpenPICState *opp = opaque;
+	IRQDest *dst;
+	int idx;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+		__func__, addr, val);
+	if (addr & 0xF) {
+		return;
+	}
+	switch (addr) {
+	case 0x00:		/* Block Revision Register1 (BRR1) is Readonly */
+		break;
+	case 0x40:
+	case 0x50:
+	case 0x60:
+	case 0x70:
+	case 0x80:
+	case 0x90:
+	case 0xA0:
+	case 0xB0:
+		openpic_cpu_write_internal(opp, addr, val, get_current_cpu());
+		break;
+	case 0x1000:		/* FRR */
+		break;
+	case 0x1020:		/* GCR */
+		openpic_gcr_write(opp, val);
+		break;
+	case 0x1080:		/* VIR */
+		break;
+	case 0x1090:		/* PIR */
+		for (idx = 0; idx < opp->nb_cpus; idx++) {
+			if ((val & (1 << idx)) && !(opp->pir & (1 << idx))) {
+				DPRINTF
+				    ("Raise OpenPIC RESET output for CPU %d\n",
+				     idx);
+				dst = &opp->dst[idx];
+				qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_RESET]);
+			} else if (!(val & (1 << idx))
+				   && (opp->pir & (1 << idx))) {
+				DPRINTF
+				    ("Lower OpenPIC RESET output for CPU %d\n",
+				     idx);
+				dst = &opp->dst[idx];
+				qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_RESET]);
+			}
+		}
+		opp->pir = val;
+		break;
+	case 0x10A0:		/* IPI_IVPR */
+	case 0x10B0:
+	case 0x10C0:
+	case 0x10D0:
+		{
+			int idx;
+			idx = (addr - 0x10A0) >> 4;
+			write_IRQreg_ivpr(opp, opp->irq_ipi0 + idx, val);
+		}
+		break;
+	case 0x10E0:		/* SPVE */
+		opp->spve = val & opp->vector_mask;
+		break;
+	default:
+		break;
+	}
+}
+
+static uint64_t openpic_gbl_read(void *opaque, hwaddr addr, unsigned len)
+{
+	OpenPICState *opp = opaque;
+	uint32_t retval;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	retval = 0xFFFFFFFF;
+	if (addr & 0xF) {
+		return retval;
+	}
+	switch (addr) {
+	case 0x1000:		/* FRR */
+		retval = opp->frr;
+		break;
+	case 0x1020:		/* GCR */
+		retval = opp->gcr;
+		break;
+	case 0x1080:		/* VIR */
+		retval = opp->vir;
+		break;
+	case 0x1090:		/* PIR */
+		retval = 0x00000000;
+		break;
+	case 0x00:		/* Block Revision Register1 (BRR1) */
+		retval = opp->brr1;
+		break;
+	case 0x40:
+	case 0x50:
+	case 0x60:
+	case 0x70:
+	case 0x80:
+	case 0x90:
+	case 0xA0:
+	case 0xB0:
+		retval =
+		    openpic_cpu_read_internal(opp, addr, get_current_cpu());
+		break;
+	case 0x10A0:		/* IPI_IVPR */
+	case 0x10B0:
+	case 0x10C0:
+	case 0x10D0:
+		{
+			int idx;
+			idx = (addr - 0x10A0) >> 4;
+			retval = read_IRQreg_ivpr(opp, opp->irq_ipi0 + idx);
+		}
+		break;
+	case 0x10E0:		/* SPVE */
+		retval = opp->spve;
+		break;
+	default:
+		break;
+	}
+	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+
+	return retval;
+}
+
+static void openpic_tmr_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned len)
+{
+	OpenPICState *opp = opaque;
+	int idx;
+
+	addr += 0x10f0;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+		__func__, addr, val);
+	if (addr & 0xF) {
+		return;
+	}
+
+	if (addr == 0x10f0) {
+		/* TFRR */
+		opp->tfrr = val;
+		return;
+	}
+
+	idx = (addr >> 6) & 0x3;
+	addr = addr & 0x30;
+
+	switch (addr & 0x30) {
+	case 0x00:		/* TCCR */
+		break;
+	case 0x10:		/* TBCR */
+		if ((opp->timers[idx].tccr & TCCR_TOG) != 0 &&
+		    (val & TBCR_CI) == 0 &&
+		    (opp->timers[idx].tbcr & TBCR_CI) != 0) {
+			opp->timers[idx].tccr &= ~TCCR_TOG;
+		}
+		opp->timers[idx].tbcr = val;
+		break;
+	case 0x20:		/* TVPR */
+		write_IRQreg_ivpr(opp, opp->irq_tim0 + idx, val);
+		break;
+	case 0x30:		/* TDR */
+		write_IRQreg_idr(opp, opp->irq_tim0 + idx, val);
+		break;
+	}
+}
+
+static uint64_t openpic_tmr_read(void *opaque, hwaddr addr, unsigned len)
+{
+	OpenPICState *opp = opaque;
+	uint32_t retval = -1;
+	int idx;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	if (addr & 0xF) {
+		goto out;
+	}
+	idx = (addr >> 6) & 0x3;
+	if (addr == 0x0) {
+		/* TFRR */
+		retval = opp->tfrr;
+		goto out;
+	}
+	switch (addr & 0x30) {
+	case 0x00:		/* TCCR */
+		retval = opp->timers[idx].tccr;
+		break;
+	case 0x10:		/* TBCR */
+		retval = opp->timers[idx].tbcr;
+		break;
+	case 0x20:		/* TIPV */
+		retval = read_IRQreg_ivpr(opp, opp->irq_tim0 + idx);
+		break;
+	case 0x30:		/* TIDE (TIDR) */
+		retval = read_IRQreg_idr(opp, opp->irq_tim0 + idx);
+		break;
+	}
+
+out:
+	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+
+	return retval;
+}
+
+static void openpic_src_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned len)
+{
+	OpenPICState *opp = opaque;
+	int idx;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+		__func__, addr, val);
+
+	addr = addr & 0xffff;
+	idx = addr >> 5;
+
+	switch (addr & 0x1f) {
+	case 0x00:
+		write_IRQreg_ivpr(opp, idx, val);
+		break;
+	case 0x10:
+		write_IRQreg_idr(opp, idx, val);
+		break;
+	case 0x18:
+		write_IRQreg_ilr(opp, idx, val);
+		break;
+	}
+}
+
+static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
+{
+	OpenPICState *opp = opaque;
+	uint32_t retval;
+	int idx;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	retval = 0xFFFFFFFF;
+
+	addr = addr & 0xffff;
+	idx = addr >> 5;
+
+	switch (addr & 0x1f) {
+	case 0x00:
+		retval = read_IRQreg_ivpr(opp, idx);
+		break;
+	case 0x10:
+		retval = read_IRQreg_idr(opp, idx);
+		break;
+	case 0x18:
+		retval = read_IRQreg_ilr(opp, idx);
+		break;
+	}
+
+	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	return retval;
+}
+
+static void openpic_msi_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned size)
+{
+	OpenPICState *opp = opaque;
+	int idx = opp->irq_msi;
+	int srs, ibs;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
+		__func__, addr, val);
+	if (addr & 0xF) {
+		return;
+	}
+
+	switch (addr) {
+	case MSIIR_OFFSET:
+		srs = val >> MSIIR_SRS_SHIFT;
+		idx += srs;
+		ibs = (val & MSIIR_IBS_MASK) >> MSIIR_IBS_SHIFT;
+		opp->msi[srs].msir |= 1 << ibs;
+		openpic_set_irq(opp, idx, 1);
+		break;
+	default:
+		/* most registers are read-only, thus ignored */
+		break;
+	}
+}
+
+static uint64_t openpic_msi_read(void *opaque, hwaddr addr, unsigned size)
+{
+	OpenPICState *opp = opaque;
+	uint64_t r = 0;
+	int i, srs;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	if (addr & 0xF) {
+		return -1;
+	}
+
+	srs = addr >> 4;
+
+	switch (addr) {
+	case 0x00:
+	case 0x10:
+	case 0x20:
+	case 0x30:
+	case 0x40:
+	case 0x50:
+	case 0x60:
+	case 0x70:		/* MSIRs */
+		r = opp->msi[srs].msir;
+		/* Clear on read */
+		opp->msi[srs].msir = 0;
+		openpic_set_irq(opp, opp->irq_msi + srs, 0);
+		break;
+	case 0x120:		/* MSISR */
+		for (i = 0; i < MAX_MSI; i++) {
+			r |= (opp->msi[i].msir ? 1 : 0) << i;
+		}
+		break;
+	}
+
+	return r;
+}
+
+static uint64_t openpic_summary_read(void *opaque, hwaddr addr, unsigned size)
+{
+	uint64_t r = 0;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+
+	/* TODO: EISR/EIMR */
+
+	return r;
+}
+
+static void openpic_summary_write(void *opaque, hwaddr addr, uint64_t val,
+				  unsigned size)
+{
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
+		__func__, addr, val);
+
+	/* TODO: EISR/EIMR */
+}
+
+static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
+				       uint32_t val, int idx)
+{
+	OpenPICState *opp = opaque;
+	IRQSource *src;
+	IRQDest *dst;
+	int s_IRQ, n_IRQ;
+
+	DPRINTF("%s: cpu %d addr %#" HWADDR_PRIx " <= 0x%08x\n", __func__, idx,
+		addr, val);
+
+	if (idx < 0) {
+		return;
+	}
+
+	if (addr & 0xF) {
+		return;
+	}
+	dst = &opp->dst[idx];
+	addr &= 0xFF0;
+	switch (addr) {
+	case 0x40:		/* IPIDR */
+	case 0x50:
+	case 0x60:
+	case 0x70:
+		idx = (addr - 0x40) >> 4;
+		/* we use IDE as mask which CPUs to deliver the IPI to still. */
+		opp->src[opp->irq_ipi0 + idx].destmask |= val;
+		openpic_set_irq(opp, opp->irq_ipi0 + idx, 1);
+		openpic_set_irq(opp, opp->irq_ipi0 + idx, 0);
+		break;
+	case 0x80:		/* CTPR */
+		dst->ctpr = val & 0x0000000F;
+
+		DPRINTF("%s: set CPU %d ctpr to %d, raised %d servicing %d\n",
+			__func__, idx, dst->ctpr, dst->raised.priority,
+			dst->servicing.priority);
+
+		if (dst->raised.priority <= dst->ctpr) {
+			DPRINTF
+			    ("%s: Lower OpenPIC INT output cpu %d due to ctpr\n",
+			     __func__, idx);
+			qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
+		} else if (dst->raised.priority > dst->servicing.priority) {
+			DPRINTF("%s: Raise OpenPIC INT output cpu %d irq %d\n",
+				__func__, idx, dst->raised.next);
+			qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_INT]);
+		}
+
+		break;
+	case 0x90:		/* WHOAMI */
+		/* Read-only register */
+		break;
+	case 0xA0:		/* IACK */
+		/* Read-only register */
+		break;
+	case 0xB0:		/* EOI */
+		DPRINTF("EOI\n");
+		s_IRQ = IRQ_get_next(opp, &dst->servicing);
+
+		if (s_IRQ < 0) {
+			DPRINTF("%s: EOI with no interrupt in service\n",
+				__func__);
+			break;
+		}
+
+		IRQ_resetbit(&dst->servicing, s_IRQ);
+		/* Set up next servicing IRQ */
+		s_IRQ = IRQ_get_next(opp, &dst->servicing);
+		/* Check queued interrupts. */
+		n_IRQ = IRQ_get_next(opp, &dst->raised);
+		src = &opp->src[n_IRQ];
+		if (n_IRQ != -1 &&
+		    (s_IRQ == -1 ||
+		     IVPR_PRIORITY(src->ivpr) > dst->servicing.priority)) {
+			DPRINTF("Raise OpenPIC INT output cpu %d irq %d\n",
+				idx, n_IRQ);
+			qemu_irq_raise(opp->dst[idx].irqs[OPENPIC_OUTPUT_INT]);
+		}
+		break;
+	default:
+		break;
+	}
+}
+
+static void openpic_cpu_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned len)
+{
+	openpic_cpu_write_internal(opaque, addr, val, (addr & 0x1f000) >> 12);
+}
+
+static uint32_t openpic_iack(OpenPICState * opp, IRQDest * dst, int cpu)
+{
+	IRQSource *src;
+	int retval, irq;
+
+	DPRINTF("Lower OpenPIC INT output\n");
+	qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
+
+	irq = IRQ_get_next(opp, &dst->raised);
+	DPRINTF("IACK: irq=%d\n", irq);
+
+	if (irq == -1) {
+		/* No more interrupt pending */
+		return opp->spve;
+	}
+
+	src = &opp->src[irq];
+	if (!(src->ivpr & IVPR_ACTIVITY_MASK) ||
+	    !(IVPR_PRIORITY(src->ivpr) > dst->ctpr)) {
+		fprintf(stderr, "%s: bad raised IRQ %d ctpr %d ivpr 0x%08x\n",
+			__func__, irq, dst->ctpr, src->ivpr);
+		openpic_update_irq(opp, irq);
+		retval = opp->spve;
+	} else {
+		/* IRQ enter servicing state */
+		IRQ_setbit(&dst->servicing, irq);
+		retval = IVPR_VECTOR(opp, src->ivpr);
+	}
+
+	if (!src->level) {
+		/* edge-sensitive IRQ */
+		src->ivpr &= ~IVPR_ACTIVITY_MASK;
+		src->pending = 0;
+		IRQ_resetbit(&dst->raised, irq);
+	}
+
+	if ((irq >= opp->irq_ipi0) && (irq < (opp->irq_ipi0 + MAX_IPI))) {
+		src->destmask &= ~(1 << cpu);
+		if (src->destmask && !src->level) {
+			/* trigger on CPUs that didn't know about it yet */
+			openpic_set_irq(opp, irq, 1);
+			openpic_set_irq(opp, irq, 0);
+			/* if all CPUs knew about it, set active bit again */
+			src->ivpr |= IVPR_ACTIVITY_MASK;
+		}
+	}
+
+	return retval;
+}
+
+static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx)
+{
+	OpenPICState *opp = opaque;
+	IRQDest *dst;
+	uint32_t retval;
+
+	DPRINTF("%s: cpu %d addr %#" HWADDR_PRIx "\n", __func__, idx, addr);
+	retval = 0xFFFFFFFF;
+
+	if (idx < 0) {
+		return retval;
+	}
+
+	if (addr & 0xF) {
+		return retval;
+	}
+	dst = &opp->dst[idx];
+	addr &= 0xFF0;
+	switch (addr) {
+	case 0x80:		/* CTPR */
+		retval = dst->ctpr;
+		break;
+	case 0x90:		/* WHOAMI */
+		retval = idx;
+		break;
+	case 0xA0:		/* IACK */
+		retval = openpic_iack(opp, dst, idx);
+		break;
+	case 0xB0:		/* EOI */
+		retval = 0;
+		break;
+	default:
+		break;
+	}
+	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+
+	return retval;
+}
+
+static uint64_t openpic_cpu_read(void *opaque, hwaddr addr, unsigned len)
+{
+	return openpic_cpu_read_internal(opaque, addr, (addr & 0x1f000) >> 12);
+}
+
+static const MemoryRegionOps openpic_glb_ops_le = {
+	.write = openpic_gbl_write,
+	.read = openpic_gbl_read,
+	.endianness = DEVICE_LITTLE_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_glb_ops_be = {
+	.write = openpic_gbl_write,
+	.read = openpic_gbl_read,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_tmr_ops_le = {
+	.write = openpic_tmr_write,
+	.read = openpic_tmr_read,
+	.endianness = DEVICE_LITTLE_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_tmr_ops_be = {
+	.write = openpic_tmr_write,
+	.read = openpic_tmr_read,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_cpu_ops_le = {
+	.write = openpic_cpu_write,
+	.read = openpic_cpu_read,
+	.endianness = DEVICE_LITTLE_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_cpu_ops_be = {
+	.write = openpic_cpu_write,
+	.read = openpic_cpu_read,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_src_ops_le = {
+	.write = openpic_src_write,
+	.read = openpic_src_read,
+	.endianness = DEVICE_LITTLE_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_src_ops_be = {
+	.write = openpic_src_write,
+	.read = openpic_src_read,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_msi_ops_be = {
+	.read = openpic_msi_read,
+	.write = openpic_msi_write,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_summary_ops_be = {
+	.read = openpic_summary_read,
+	.write = openpic_summary_write,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static void openpic_save_IRQ_queue(QEMUFile * f, IRQQueue * q)
+{
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(q->queue); i++) {
+		/* Always put the lower half of a 64-bit long first, in case we
+		 * restore on a 32-bit host.  The least significant bits correspond
+		 * to lower IRQ numbers in the bitmap.
+		 */
+		qemu_put_be32(f, (uint32_t) q->queue[i]);
+#if LONG_MAX > 0x7FFFFFFF
+		qemu_put_be32(f, (uint32_t) (q->queue[i] >> 32));
+#endif
+	}
+
+	qemu_put_sbe32s(f, &q->next);
+	qemu_put_sbe32s(f, &q->priority);
+}
+
+static void openpic_save(QEMUFile * f, void *opaque)
+{
+	OpenPICState *opp = (OpenPICState *) opaque;
+	unsigned int i;
+
+	qemu_put_be32s(f, &opp->gcr);
+	qemu_put_be32s(f, &opp->vir);
+	qemu_put_be32s(f, &opp->pir);
+	qemu_put_be32s(f, &opp->spve);
+	qemu_put_be32s(f, &opp->tfrr);
+
+	qemu_put_be32s(f, &opp->nb_cpus);
+
+	for (i = 0; i < opp->nb_cpus; i++) {
+		qemu_put_sbe32s(f, &opp->dst[i].ctpr);
+		openpic_save_IRQ_queue(f, &opp->dst[i].raised);
+		openpic_save_IRQ_queue(f, &opp->dst[i].servicing);
+		qemu_put_buffer(f, (uint8_t *) & opp->dst[i].outputs_active,
+				sizeof(opp->dst[i].outputs_active));
+	}
+
+	for (i = 0; i < MAX_TMR; i++) {
+		qemu_put_be32s(f, &opp->timers[i].tccr);
+		qemu_put_be32s(f, &opp->timers[i].tbcr);
+	}
+
+	for (i = 0; i < opp->max_irq; i++) {
+		qemu_put_be32s(f, &opp->src[i].ivpr);
+		qemu_put_be32s(f, &opp->src[i].idr);
+		qemu_get_be32s(f, &opp->src[i].destmask);
+		qemu_put_sbe32s(f, &opp->src[i].last_cpu);
+		qemu_put_sbe32s(f, &opp->src[i].pending);
+	}
+}
+
+static void openpic_load_IRQ_queue(QEMUFile * f, IRQQueue * q)
+{
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(q->queue); i++) {
+		unsigned long val;
+
+		val = qemu_get_be32(f);
+#if LONG_MAX > 0x7FFFFFFF
+		val <<= 32;
+		val |= qemu_get_be32(f);
+#endif
+
+		q->queue[i] = val;
+	}
+
+	qemu_get_sbe32s(f, &q->next);
+	qemu_get_sbe32s(f, &q->priority);
+}
+
+static int openpic_load(QEMUFile * f, void *opaque, int version_id)
+{
+	OpenPICState *opp = (OpenPICState *) opaque;
+	unsigned int i;
+
+	if (version_id != 1) {
+		return -EINVAL;
+	}
+
+	qemu_get_be32s(f, &opp->gcr);
+	qemu_get_be32s(f, &opp->vir);
+	qemu_get_be32s(f, &opp->pir);
+	qemu_get_be32s(f, &opp->spve);
+	qemu_get_be32s(f, &opp->tfrr);
+
+	qemu_get_be32s(f, &opp->nb_cpus);
+
+	for (i = 0; i < opp->nb_cpus; i++) {
+		qemu_get_sbe32s(f, &opp->dst[i].ctpr);
+		openpic_load_IRQ_queue(f, &opp->dst[i].raised);
+		openpic_load_IRQ_queue(f, &opp->dst[i].servicing);
+		qemu_get_buffer(f, (uint8_t *) & opp->dst[i].outputs_active,
+				sizeof(opp->dst[i].outputs_active));
+	}
+
+	for (i = 0; i < MAX_TMR; i++) {
+		qemu_get_be32s(f, &opp->timers[i].tccr);
+		qemu_get_be32s(f, &opp->timers[i].tbcr);
+	}
+
+	for (i = 0; i < opp->max_irq; i++) {
+		uint32_t val;
+
+		val = qemu_get_be32(f);
+		write_IRQreg_idr(opp, i, val);
+		val = qemu_get_be32(f);
+		write_IRQreg_ivpr(opp, i, val);
+
+		qemu_get_be32s(f, &opp->src[i].ivpr);
+		qemu_get_be32s(f, &opp->src[i].idr);
+		qemu_get_be32s(f, &opp->src[i].destmask);
+		qemu_get_sbe32s(f, &opp->src[i].last_cpu);
+		qemu_get_sbe32s(f, &opp->src[i].pending);
+	}
+
+	return 0;
+}
+
+typedef struct MemReg {
+	const char *name;
+	MemoryRegionOps const *ops;
+	hwaddr start_addr;
+	ram_addr_t size;
+} MemReg;
+
+static void fsl_common_init(OpenPICState * opp)
+{
+	int i;
+	int virq = MAX_SRC;
+
+	opp->vid = VID_REVISION_1_2;
+	opp->vir = VIR_GENERIC;
+	opp->vector_mask = 0xFFFF;
+	opp->tfrr_reset = 0;
+	opp->ivpr_reset = IVPR_MASK_MASK;
+	opp->idr_reset = 1 << 0;
+	opp->max_irq = MAX_IRQ;
+
+	opp->irq_ipi0 = virq;
+	virq += MAX_IPI;
+	opp->irq_tim0 = virq;
+	virq += MAX_TMR;
+
+	assert(virq <= MAX_IRQ);
+
+	opp->irq_msi = 224;
+
+	msi_supported = true;
+	for (i = 0; i < opp->fsl->max_ext; i++) {
+		opp->src[i].level = false;
+	}
+
+	/* Internal interrupts, including message and MSI */
+	for (i = 16; i < MAX_SRC; i++) {
+		opp->src[i].type = IRQ_TYPE_FSLINT;
+		opp->src[i].level = true;
+	}
+
+	/* timers and IPIs */
+	for (i = MAX_SRC; i < virq; i++) {
+		opp->src[i].type = IRQ_TYPE_FSLSPECIAL;
+		opp->src[i].level = false;
+	}
+}
+
+static void map_list(OpenPICState * opp, const MemReg * list, int *count)
+{
+	while (list->name) {
+		assert(*count < ARRAY_SIZE(opp->sub_io_mem));
+
+		memory_region_init_io(&opp->sub_io_mem[*count], list->ops, opp,
+				      list->name, list->size);
+
+		memory_region_add_subregion(&opp->mem, list->start_addr,
+					    &opp->sub_io_mem[*count]);
+
+		(*count)++;
+		list++;
+	}
+}
+
+static int openpic_init(SysBusDevice * dev)
+{
+	OpenPICState *opp = FROM_SYSBUS(typeof(*opp), dev);
+	int i, j;
+	int list_count = 0;
+	static const MemReg list_le[] = {
+		{"glb", &openpic_glb_ops_le,
+		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
+		{"tmr", &openpic_tmr_ops_le,
+		 OPENPIC_TMR_REG_START, OPENPIC_TMR_REG_SIZE},
+		{"src", &openpic_src_ops_le,
+		 OPENPIC_SRC_REG_START, OPENPIC_SRC_REG_SIZE},
+		{"cpu", &openpic_cpu_ops_le,
+		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
+		{NULL}
+	};
+	static const MemReg list_be[] = {
+		{"glb", &openpic_glb_ops_be,
+		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
+		{"tmr", &openpic_tmr_ops_be,
+		 OPENPIC_TMR_REG_START, OPENPIC_TMR_REG_SIZE},
+		{"src", &openpic_src_ops_be,
+		 OPENPIC_SRC_REG_START, OPENPIC_SRC_REG_SIZE},
+		{"cpu", &openpic_cpu_ops_be,
+		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
+		{NULL}
+	};
+	static const MemReg list_fsl[] = {
+		{"msi", &openpic_msi_ops_be,
+		 OPENPIC_MSI_REG_START, OPENPIC_MSI_REG_SIZE},
+		{"summary", &openpic_summary_ops_be,
+		 OPENPIC_SUMMARY_REG_START, OPENPIC_SUMMARY_REG_SIZE},
+		{NULL}
+	};
+
+	memory_region_init(&opp->mem, "openpic", 0x40000);
+
+	switch (opp->model) {
+	case OPENPIC_MODEL_FSL_MPIC_20:
+	default:
+		opp->fsl = &fsl_mpic_20;
+		opp->brr1 = 0x00400200;
+		opp->flags |= OPENPIC_FLAG_IDR_CRIT;
+		opp->nb_irqs = 80;
+		opp->mpic_mode_mask = GCR_MODE_MIXED;
+
+		fsl_common_init(opp);
+		map_list(opp, list_be, &list_count);
+		map_list(opp, list_fsl, &list_count);
+
+		break;
+
+	case OPENPIC_MODEL_FSL_MPIC_42:
+		opp->fsl = &fsl_mpic_42;
+		opp->brr1 = 0x00400402;
+		opp->flags |= OPENPIC_FLAG_ILR;
+		opp->nb_irqs = 196;
+		opp->mpic_mode_mask = GCR_MODE_PROXY;
+
+		fsl_common_init(opp);
+		map_list(opp, list_be, &list_count);
+		map_list(opp, list_fsl, &list_count);
+
+		break;
+
+	case OPENPIC_MODEL_RAVEN:
+		opp->nb_irqs = RAVEN_MAX_EXT;
+		opp->vid = VID_REVISION_1_3;
+		opp->vir = VIR_GENERIC;
+		opp->vector_mask = 0xFF;
+		opp->tfrr_reset = 4160000;
+		opp->ivpr_reset = IVPR_MASK_MASK | IVPR_MODE_MASK;
+		opp->idr_reset = 0;
+		opp->max_irq = RAVEN_MAX_IRQ;
+		opp->irq_ipi0 = RAVEN_IPI_IRQ;
+		opp->irq_tim0 = RAVEN_TMR_IRQ;
+		opp->brr1 = -1;
+		opp->mpic_mode_mask = GCR_MODE_MIXED;
+
+		/* Only UP supported today */
+		if (opp->nb_cpus != 1) {
+			return -EINVAL;
+		}
+
+		map_list(opp, list_le, &list_count);
+		break;
+	}
+
+	for (i = 0; i < opp->nb_cpus; i++) {
+		opp->dst[i].irqs = g_new(qemu_irq, OPENPIC_OUTPUT_NB);
+		for (j = 0; j < OPENPIC_OUTPUT_NB; j++) {
+			sysbus_init_irq(dev, &opp->dst[i].irqs[j]);
+		}
+	}
+
+	register_savevm(&opp->busdev.qdev, "openpic", 0, 2,
+			openpic_save, openpic_load, opp);
+
+	sysbus_init_mmio(dev, &opp->mem);
+	qdev_init_gpio_in(&dev->qdev, openpic_set_irq, opp->max_irq);
+
+	return 0;
+}
+
+static Property openpic_properties[] = {
+	DEFINE_PROP_UINT32("model", OpenPICState, model,
+			   OPENPIC_MODEL_FSL_MPIC_20),
+	DEFINE_PROP_UINT32("nb_cpus", OpenPICState, nb_cpus, 1),
+	DEFINE_PROP_END_OF_LIST(),
+};
+
+static void openpic_class_init(ObjectClass * klass, void *data)
+{
+	DeviceClass *dc = DEVICE_CLASS(klass);
+	SysBusDeviceClass *k = SYS_BUS_DEVICE_CLASS(klass);
+
+	k->init = openpic_init;
+	dc->props = openpic_properties;
+	dc->reset = openpic_reset;
+}
+
+static const TypeInfo openpic_info = {
+	.name = "openpic",
+	.parent = TYPE_SYS_BUS_DEVICE,
+	.instance_size = sizeof(OpenPICState),
+	.class_init = openpic_class_init,
+};
+
+static void openpic_register_types(void)
+{
+	type_register_static(&openpic_info);
+}
+
+type_init(openpic_register_types)
-- 
1.7.9.5




* [RFC PATCH v3 3/6] kvm/ppc/mpic: remove some obviously unneeded code
  2013-04-03  1:57     ` Scott Wood
@ 2013-04-03  1:57       ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-03  1:57 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus, Scott Wood

Remove some parts of the code that are obviously QEMU or Raven specific
before fixing style issues, to reduce the style issues that need to be
fixed.

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
 arch/powerpc/kvm/mpic.c |  344 -----------------------------------------------
 1 file changed, 344 deletions(-)

diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index 57655b9..d6d70a4 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -22,39 +22,6 @@
  * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
  * THE SOFTWARE.
  */
-/*
- *
- * Based on OpenPic implementations:
- * - Intel GW80314 I/O companion chip developer's manual
- * - Motorola MPC8245 & MPC8540 user manuals.
- * - Motorola MCP750 (aka Raven) programmer manual.
- * - Motorola Harrier programmer manuel
- *
- * Serial interrupts, as implemented in Raven chipset are not supported yet.
- *
- */
-#include "hw.h"
-#include "ppc/mac.h"
-#include "pci/pci.h"
-#include "openpic.h"
-#include "sysbus.h"
-#include "pci/msi.h"
-#include "qemu/bitops.h"
-#include "ppc.h"
-
-//#define DEBUG_OPENPIC
-
-#ifdef DEBUG_OPENPIC
-static const int debug_openpic = 1;
-#else
-static const int debug_openpic = 0;
-#endif
-
-#define DPRINTF(fmt, ...) do { \
-        if (debug_openpic) { \
-            printf(fmt , ## __VA_ARGS__); \
-        } \
-    } while (0)
 
 #define MAX_CPU     32
 #define MAX_SRC     256
@@ -82,21 +49,6 @@ static const int debug_openpic = 0;
 #define OPENPIC_CPU_REG_START        0x20000
 #define OPENPIC_CPU_REG_SIZE         0x100 + ((MAX_CPU - 1) * 0x1000)
 
-/* Raven */
-#define RAVEN_MAX_CPU      2
-#define RAVEN_MAX_EXT     48
-#define RAVEN_MAX_IRQ     64
-#define RAVEN_MAX_TMR      MAX_TMR
-#define RAVEN_MAX_IPI      MAX_IPI
-
-/* Interrupt definitions */
-#define RAVEN_FE_IRQ     (RAVEN_MAX_EXT)	/* Internal functional IRQ */
-#define RAVEN_ERR_IRQ    (RAVEN_MAX_EXT + 1)	/* Error IRQ */
-#define RAVEN_TMR_IRQ    (RAVEN_MAX_EXT + 2)	/* First timer IRQ */
-#define RAVEN_IPI_IRQ    (RAVEN_TMR_IRQ + RAVEN_MAX_TMR)	/* First IPI IRQ */
-/* First doorbell IRQ */
-#define RAVEN_DBL_IRQ    (RAVEN_IPI_IRQ + (RAVEN_MAX_CPU * RAVEN_MAX_IPI))
-
 typedef struct FslMpicInfo {
 	int max_ext;
 } FslMpicInfo;
@@ -138,44 +90,6 @@ static FslMpicInfo fsl_mpic_42 = {
 #define ILR_INTTGT_CINT   0x01	/* critical */
 #define ILR_INTTGT_MCP    0x02	/* machine check */
 
-/* The currently supported INTTGT values happen to be the same as QEMU's
- * openpic output codes, but don't depend on this.  The output codes
- * could change (unlikely, but...) or support could be added for
- * more INTTGT values.
- */
-static const int inttgt_output[][2] = {
-	{ILR_INTTGT_INT, OPENPIC_OUTPUT_INT},
-	{ILR_INTTGT_CINT, OPENPIC_OUTPUT_CINT},
-	{ILR_INTTGT_MCP, OPENPIC_OUTPUT_MCK},
-};
-
-static int inttgt_to_output(int inttgt)
-{
-	int i;
-
-	for (i = 0; i < ARRAY_SIZE(inttgt_output); i++) {
-		if (inttgt_output[i][0] == inttgt) {
-			return inttgt_output[i][1];
-		}
-	}
-
-	fprintf(stderr, "%s: unsupported inttgt %d\n", __func__, inttgt);
-	return OPENPIC_OUTPUT_INT;
-}
-
-static int output_to_inttgt(int output)
-{
-	int i;
-
-	for (i = 0; i < ARRAY_SIZE(inttgt_output); i++) {
-		if (inttgt_output[i][1] == output) {
-			return inttgt_output[i][0];
-		}
-	}
-
-	abort();
-}
-
 #define MSIIR_OFFSET       0x140
 #define MSIIR_SRS_SHIFT    29
 #define MSIIR_SRS_MASK     (0x7 << MSIIR_SRS_SHIFT)
@@ -1265,228 +1179,36 @@ static uint64_t openpic_cpu_read(void *opaque, hwaddr addr, unsigned len)
 	return openpic_cpu_read_internal(opaque, addr, (addr & 0x1f000) >> 12);
 }
 
-static const MemoryRegionOps openpic_glb_ops_le = {
-	.write = openpic_gbl_write,
-	.read = openpic_gbl_read,
-	.endianness = DEVICE_LITTLE_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
-};
-
 static const MemoryRegionOps openpic_glb_ops_be = {
 	.write = openpic_gbl_write,
 	.read = openpic_gbl_read,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
-};
-
-static const MemoryRegionOps openpic_tmr_ops_le = {
-	.write = openpic_tmr_write,
-	.read = openpic_tmr_read,
-	.endianness = DEVICE_LITTLE_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
 static const MemoryRegionOps openpic_tmr_ops_be = {
 	.write = openpic_tmr_write,
 	.read = openpic_tmr_read,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
-};
-
-static const MemoryRegionOps openpic_cpu_ops_le = {
-	.write = openpic_cpu_write,
-	.read = openpic_cpu_read,
-	.endianness = DEVICE_LITTLE_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
 static const MemoryRegionOps openpic_cpu_ops_be = {
 	.write = openpic_cpu_write,
 	.read = openpic_cpu_read,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
-};
-
-static const MemoryRegionOps openpic_src_ops_le = {
-	.write = openpic_src_write,
-	.read = openpic_src_read,
-	.endianness = DEVICE_LITTLE_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
 static const MemoryRegionOps openpic_src_ops_be = {
 	.write = openpic_src_write,
 	.read = openpic_src_read,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
 static const MemoryRegionOps openpic_msi_ops_be = {
 	.read = openpic_msi_read,
 	.write = openpic_msi_write,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
 static const MemoryRegionOps openpic_summary_ops_be = {
 	.read = openpic_summary_read,
 	.write = openpic_summary_write,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
-static void openpic_save_IRQ_queue(QEMUFile * f, IRQQueue * q)
-{
-	unsigned int i;
-
-	for (i = 0; i < ARRAY_SIZE(q->queue); i++) {
-		/* Always put the lower half of a 64-bit long first, in case we
-		 * restore on a 32-bit host.  The least significant bits correspond
-		 * to lower IRQ numbers in the bitmap.
-		 */
-		qemu_put_be32(f, (uint32_t) q->queue[i]);
-#if LONG_MAX > 0x7FFFFFFF
-		qemu_put_be32(f, (uint32_t) (q->queue[i] >> 32));
-#endif
-	}
-
-	qemu_put_sbe32s(f, &q->next);
-	qemu_put_sbe32s(f, &q->priority);
-}
-
-static void openpic_save(QEMUFile * f, void *opaque)
-{
-	OpenPICState *opp = (OpenPICState *) opaque;
-	unsigned int i;
-
-	qemu_put_be32s(f, &opp->gcr);
-	qemu_put_be32s(f, &opp->vir);
-	qemu_put_be32s(f, &opp->pir);
-	qemu_put_be32s(f, &opp->spve);
-	qemu_put_be32s(f, &opp->tfrr);
-
-	qemu_put_be32s(f, &opp->nb_cpus);
-
-	for (i = 0; i < opp->nb_cpus; i++) {
-		qemu_put_sbe32s(f, &opp->dst[i].ctpr);
-		openpic_save_IRQ_queue(f, &opp->dst[i].raised);
-		openpic_save_IRQ_queue(f, &opp->dst[i].servicing);
-		qemu_put_buffer(f, (uint8_t *) & opp->dst[i].outputs_active,
-				sizeof(opp->dst[i].outputs_active));
-	}
-
-	for (i = 0; i < MAX_TMR; i++) {
-		qemu_put_be32s(f, &opp->timers[i].tccr);
-		qemu_put_be32s(f, &opp->timers[i].tbcr);
-	}
-
-	for (i = 0; i < opp->max_irq; i++) {
-		qemu_put_be32s(f, &opp->src[i].ivpr);
-		qemu_put_be32s(f, &opp->src[i].idr);
-		qemu_get_be32s(f, &opp->src[i].destmask);
-		qemu_put_sbe32s(f, &opp->src[i].last_cpu);
-		qemu_put_sbe32s(f, &opp->src[i].pending);
-	}
-}
-
-static void openpic_load_IRQ_queue(QEMUFile * f, IRQQueue * q)
-{
-	unsigned int i;
-
-	for (i = 0; i < ARRAY_SIZE(q->queue); i++) {
-		unsigned long val;
-
-		val = qemu_get_be32(f);
-#if LONG_MAX > 0x7FFFFFFF
-		val <<= 32;
-		val |= qemu_get_be32(f);
-#endif
-
-		q->queue[i] = val;
-	}
-
-	qemu_get_sbe32s(f, &q->next);
-	qemu_get_sbe32s(f, &q->priority);
-}
-
-static int openpic_load(QEMUFile * f, void *opaque, int version_id)
-{
-	OpenPICState *opp = (OpenPICState *) opaque;
-	unsigned int i;
-
-	if (version_id != 1) {
-		return -EINVAL;
-	}
-
-	qemu_get_be32s(f, &opp->gcr);
-	qemu_get_be32s(f, &opp->vir);
-	qemu_get_be32s(f, &opp->pir);
-	qemu_get_be32s(f, &opp->spve);
-	qemu_get_be32s(f, &opp->tfrr);
-
-	qemu_get_be32s(f, &opp->nb_cpus);
-
-	for (i = 0; i < opp->nb_cpus; i++) {
-		qemu_get_sbe32s(f, &opp->dst[i].ctpr);
-		openpic_load_IRQ_queue(f, &opp->dst[i].raised);
-		openpic_load_IRQ_queue(f, &opp->dst[i].servicing);
-		qemu_get_buffer(f, (uint8_t *) & opp->dst[i].outputs_active,
-				sizeof(opp->dst[i].outputs_active));
-	}
-
-	for (i = 0; i < MAX_TMR; i++) {
-		qemu_get_be32s(f, &opp->timers[i].tccr);
-		qemu_get_be32s(f, &opp->timers[i].tbcr);
-	}
-
-	for (i = 0; i < opp->max_irq; i++) {
-		uint32_t val;
-
-		val = qemu_get_be32(f);
-		write_IRQreg_idr(opp, i, val);
-		val = qemu_get_be32(f);
-		write_IRQreg_ivpr(opp, i, val);
-
-		qemu_get_be32s(f, &opp->src[i].ivpr);
-		qemu_get_be32s(f, &opp->src[i].idr);
-		qemu_get_be32s(f, &opp->src[i].destmask);
-		qemu_get_sbe32s(f, &opp->src[i].last_cpu);
-		qemu_get_sbe32s(f, &opp->src[i].pending);
-	}
-
-	return 0;
-}
-
 typedef struct MemReg {
 	const char *name;
 	MemoryRegionOps const *ops;
@@ -1614,73 +1336,7 @@ static int openpic_init(SysBusDevice * dev)
 		map_list(opp, list_fsl, &list_count);
 
 		break;
-
-	case OPENPIC_MODEL_RAVEN:
-		opp->nb_irqs = RAVEN_MAX_EXT;
-		opp->vid = VID_REVISION_1_3;
-		opp->vir = VIR_GENERIC;
-		opp->vector_mask = 0xFF;
-		opp->tfrr_reset = 4160000;
-		opp->ivpr_reset = IVPR_MASK_MASK | IVPR_MODE_MASK;
-		opp->idr_reset = 0;
-		opp->max_irq = RAVEN_MAX_IRQ;
-		opp->irq_ipi0 = RAVEN_IPI_IRQ;
-		opp->irq_tim0 = RAVEN_TMR_IRQ;
-		opp->brr1 = -1;
-		opp->mpic_mode_mask = GCR_MODE_MIXED;
-
-		/* Only UP supported today */
-		if (opp->nb_cpus != 1) {
-			return -EINVAL;
-		}
-
-		map_list(opp, list_le, &list_count);
-		break;
-	}
-
-	for (i = 0; i < opp->nb_cpus; i++) {
-		opp->dst[i].irqs = g_new(qemu_irq, OPENPIC_OUTPUT_NB);
-		for (j = 0; j < OPENPIC_OUTPUT_NB; j++) {
-			sysbus_init_irq(dev, &opp->dst[i].irqs[j]);
-		}
 	}
 
-	register_savevm(&opp->busdev.qdev, "openpic", 0, 2,
-			openpic_save, openpic_load, opp);
-
-	sysbus_init_mmio(dev, &opp->mem);
-	qdev_init_gpio_in(&dev->qdev, openpic_set_irq, opp->max_irq);
-
 	return 0;
 }
-
-static Property openpic_properties[] = {
-	DEFINE_PROP_UINT32("model", OpenPICState, model,
-			   OPENPIC_MODEL_FSL_MPIC_20),
-	DEFINE_PROP_UINT32("nb_cpus", OpenPICState, nb_cpus, 1),
-	DEFINE_PROP_END_OF_LIST(),
-};
-
-static void openpic_class_init(ObjectClass * klass, void *data)
-{
-	DeviceClass *dc = DEVICE_CLASS(klass);
-	SysBusDeviceClass *k = SYS_BUS_DEVICE_CLASS(klass);
-
-	k->init = openpic_init;
-	dc->props = openpic_properties;
-	dc->reset = openpic_reset;
-}
-
-static const TypeInfo openpic_info = {
-	.name = "openpic",
-	.parent = TYPE_SYS_BUS_DEVICE,
-	.instance_size = sizeof(OpenPICState),
-	.class_init = openpic_class_init,
-};
-
-static void openpic_register_types(void)
-{
-	type_register_static(&openpic_info);
-}
-
-type_init(openpic_register_types)
-- 
1.7.9.5


* [RFC PATCH v3 3/6] kvm/ppc/mpic: remove some obviously unneeded code
@ 2013-04-03  1:57       ` Scott Wood
  0 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-03  1:57 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus, Scott Wood

Remove some parts of the code that are obviously QEMU or Raven specific
before fixing style issues, to reduce the style issues that need to be
fixed.

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
 arch/powerpc/kvm/mpic.c |  344 -----------------------------------------------
 1 file changed, 344 deletions(-)

diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index 57655b9..d6d70a4 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -22,39 +22,6 @@
  * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
  * THE SOFTWARE.
  */
-/*
- *
- * Based on OpenPic implementations:
- * - Intel GW80314 I/O companion chip developer's manual
- * - Motorola MPC8245 & MPC8540 user manuals.
- * - Motorola MCP750 (aka Raven) programmer manual.
- * - Motorola Harrier programmer manuel
- *
- * Serial interrupts, as implemented in Raven chipset are not supported yet.
- *
- */
-#include "hw.h"
-#include "ppc/mac.h"
-#include "pci/pci.h"
-#include "openpic.h"
-#include "sysbus.h"
-#include "pci/msi.h"
-#include "qemu/bitops.h"
-#include "ppc.h"
-
-//#define DEBUG_OPENPIC
-
-#ifdef DEBUG_OPENPIC
-static const int debug_openpic = 1;
-#else
-static const int debug_openpic = 0;
-#endif
-
-#define DPRINTF(fmt, ...) do { \
-        if (debug_openpic) { \
-            printf(fmt , ## __VA_ARGS__); \
-        } \
-    } while (0)
 
 #define MAX_CPU     32
 #define MAX_SRC     256
@@ -82,21 +49,6 @@ static const int debug_openpic = 0;
 #define OPENPIC_CPU_REG_START        0x20000
 #define OPENPIC_CPU_REG_SIZE         0x100 + ((MAX_CPU - 1) * 0x1000)
 
-/* Raven */
-#define RAVEN_MAX_CPU      2
-#define RAVEN_MAX_EXT     48
-#define RAVEN_MAX_IRQ     64
-#define RAVEN_MAX_TMR      MAX_TMR
-#define RAVEN_MAX_IPI      MAX_IPI
-
-/* Interrupt definitions */
-#define RAVEN_FE_IRQ     (RAVEN_MAX_EXT)	/* Internal functional IRQ */
-#define RAVEN_ERR_IRQ    (RAVEN_MAX_EXT + 1)	/* Error IRQ */
-#define RAVEN_TMR_IRQ    (RAVEN_MAX_EXT + 2)	/* First timer IRQ */
-#define RAVEN_IPI_IRQ    (RAVEN_TMR_IRQ + RAVEN_MAX_TMR)	/* First IPI IRQ */
-/* First doorbell IRQ */
-#define RAVEN_DBL_IRQ    (RAVEN_IPI_IRQ + (RAVEN_MAX_CPU * RAVEN_MAX_IPI))
-
 typedef struct FslMpicInfo {
 	int max_ext;
 } FslMpicInfo;
@@ -138,44 +90,6 @@ static FslMpicInfo fsl_mpic_42 = {
 #define ILR_INTTGT_CINT   0x01	/* critical */
 #define ILR_INTTGT_MCP    0x02	/* machine check */
 
-/* The currently supported INTTGT values happen to be the same as QEMU's
- * openpic output codes, but don't depend on this.  The output codes
- * could change (unlikely, but...) or support could be added for
- * more INTTGT values.
- */
-static const int inttgt_output[][2] = {
-	{ILR_INTTGT_INT, OPENPIC_OUTPUT_INT},
-	{ILR_INTTGT_CINT, OPENPIC_OUTPUT_CINT},
-	{ILR_INTTGT_MCP, OPENPIC_OUTPUT_MCK},
-};
-
-static int inttgt_to_output(int inttgt)
-{
-	int i;
-
-	for (i = 0; i < ARRAY_SIZE(inttgt_output); i++) {
-		if (inttgt_output[i][0] == inttgt) {
-			return inttgt_output[i][1];
-		}
-	}
-
-	fprintf(stderr, "%s: unsupported inttgt %d\n", __func__, inttgt);
-	return OPENPIC_OUTPUT_INT;
-}
-
-static int output_to_inttgt(int output)
-{
-	int i;
-
-	for (i = 0; i < ARRAY_SIZE(inttgt_output); i++) {
-		if (inttgt_output[i][1] == output) {
-			return inttgt_output[i][0];
-		}
-	}
-
-	abort();
-}
-
 #define MSIIR_OFFSET       0x140
 #define MSIIR_SRS_SHIFT    29
 #define MSIIR_SRS_MASK     (0x7 << MSIIR_SRS_SHIFT)
@@ -1265,228 +1179,36 @@ static uint64_t openpic_cpu_read(void *opaque, hwaddr addr, unsigned len)
 	return openpic_cpu_read_internal(opaque, addr, (addr & 0x1f000) >> 12);
 }
 
-static const MemoryRegionOps openpic_glb_ops_le = {
-	.write = openpic_gbl_write,
-	.read = openpic_gbl_read,
-	.endianness = DEVICE_LITTLE_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
-};
-
 static const MemoryRegionOps openpic_glb_ops_be = {
 	.write = openpic_gbl_write,
 	.read = openpic_gbl_read,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
-};
-
-static const MemoryRegionOps openpic_tmr_ops_le = {
-	.write = openpic_tmr_write,
-	.read = openpic_tmr_read,
-	.endianness = DEVICE_LITTLE_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
 static const MemoryRegionOps openpic_tmr_ops_be = {
 	.write = openpic_tmr_write,
 	.read = openpic_tmr_read,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
-};
-
-static const MemoryRegionOps openpic_cpu_ops_le = {
-	.write = openpic_cpu_write,
-	.read = openpic_cpu_read,
-	.endianness = DEVICE_LITTLE_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
 static const MemoryRegionOps openpic_cpu_ops_be = {
 	.write = openpic_cpu_write,
 	.read = openpic_cpu_read,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
-};
-
-static const MemoryRegionOps openpic_src_ops_le = {
-	.write = openpic_src_write,
-	.read = openpic_src_read,
-	.endianness = DEVICE_LITTLE_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
 static const MemoryRegionOps openpic_src_ops_be = {
 	.write = openpic_src_write,
 	.read = openpic_src_read,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
 static const MemoryRegionOps openpic_msi_ops_be = {
 	.read = openpic_msi_read,
 	.write = openpic_msi_write,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
 static const MemoryRegionOps openpic_summary_ops_be = {
 	.read = openpic_summary_read,
 	.write = openpic_summary_write,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
-static void openpic_save_IRQ_queue(QEMUFile * f, IRQQueue * q)
-{
-	unsigned int i;
-
-	for (i = 0; i < ARRAY_SIZE(q->queue); i++) {
-		/* Always put the lower half of a 64-bit long first, in case we
-		 * restore on a 32-bit host.  The least significant bits correspond
-		 * to lower IRQ numbers in the bitmap.
-		 */
-		qemu_put_be32(f, (uint32_t) q->queue[i]);
-#if LONG_MAX > 0x7FFFFFFF
-		qemu_put_be32(f, (uint32_t) (q->queue[i] >> 32));
-#endif
-	}
-
-	qemu_put_sbe32s(f, &q->next);
-	qemu_put_sbe32s(f, &q->priority);
-}
-
-static void openpic_save(QEMUFile * f, void *opaque)
-{
-	OpenPICState *opp = (OpenPICState *) opaque;
-	unsigned int i;
-
-	qemu_put_be32s(f, &opp->gcr);
-	qemu_put_be32s(f, &opp->vir);
-	qemu_put_be32s(f, &opp->pir);
-	qemu_put_be32s(f, &opp->spve);
-	qemu_put_be32s(f, &opp->tfrr);
-
-	qemu_put_be32s(f, &opp->nb_cpus);
-
-	for (i = 0; i < opp->nb_cpus; i++) {
-		qemu_put_sbe32s(f, &opp->dst[i].ctpr);
-		openpic_save_IRQ_queue(f, &opp->dst[i].raised);
-		openpic_save_IRQ_queue(f, &opp->dst[i].servicing);
-		qemu_put_buffer(f, (uint8_t *) & opp->dst[i].outputs_active,
-				sizeof(opp->dst[i].outputs_active));
-	}
-
-	for (i = 0; i < MAX_TMR; i++) {
-		qemu_put_be32s(f, &opp->timers[i].tccr);
-		qemu_put_be32s(f, &opp->timers[i].tbcr);
-	}
-
-	for (i = 0; i < opp->max_irq; i++) {
-		qemu_put_be32s(f, &opp->src[i].ivpr);
-		qemu_put_be32s(f, &opp->src[i].idr);
-		qemu_get_be32s(f, &opp->src[i].destmask);
-		qemu_put_sbe32s(f, &opp->src[i].last_cpu);
-		qemu_put_sbe32s(f, &opp->src[i].pending);
-	}
-}
-
-static void openpic_load_IRQ_queue(QEMUFile * f, IRQQueue * q)
-{
-	unsigned int i;
-
-	for (i = 0; i < ARRAY_SIZE(q->queue); i++) {
-		unsigned long val;
-
-		val = qemu_get_be32(f);
-#if LONG_MAX > 0x7FFFFFFF
-		val <<= 32;
-		val |= qemu_get_be32(f);
-#endif
-
-		q->queue[i] = val;
-	}
-
-	qemu_get_sbe32s(f, &q->next);
-	qemu_get_sbe32s(f, &q->priority);
-}
-
-static int openpic_load(QEMUFile * f, void *opaque, int version_id)
-{
-	OpenPICState *opp = (OpenPICState *) opaque;
-	unsigned int i;
-
-	if (version_id != 1) {
-		return -EINVAL;
-	}
-
-	qemu_get_be32s(f, &opp->gcr);
-	qemu_get_be32s(f, &opp->vir);
-	qemu_get_be32s(f, &opp->pir);
-	qemu_get_be32s(f, &opp->spve);
-	qemu_get_be32s(f, &opp->tfrr);
-
-	qemu_get_be32s(f, &opp->nb_cpus);
-
-	for (i = 0; i < opp->nb_cpus; i++) {
-		qemu_get_sbe32s(f, &opp->dst[i].ctpr);
-		openpic_load_IRQ_queue(f, &opp->dst[i].raised);
-		openpic_load_IRQ_queue(f, &opp->dst[i].servicing);
-		qemu_get_buffer(f, (uint8_t *) & opp->dst[i].outputs_active,
-				sizeof(opp->dst[i].outputs_active));
-	}
-
-	for (i = 0; i < MAX_TMR; i++) {
-		qemu_get_be32s(f, &opp->timers[i].tccr);
-		qemu_get_be32s(f, &opp->timers[i].tbcr);
-	}
-
-	for (i = 0; i < opp->max_irq; i++) {
-		uint32_t val;
-
-		val = qemu_get_be32(f);
-		write_IRQreg_idr(opp, i, val);
-		val = qemu_get_be32(f);
-		write_IRQreg_ivpr(opp, i, val);
-
-		qemu_get_be32s(f, &opp->src[i].ivpr);
-		qemu_get_be32s(f, &opp->src[i].idr);
-		qemu_get_be32s(f, &opp->src[i].destmask);
-		qemu_get_sbe32s(f, &opp->src[i].last_cpu);
-		qemu_get_sbe32s(f, &opp->src[i].pending);
-	}
-
-	return 0;
-}
-
 typedef struct MemReg {
 	const char *name;
 	MemoryRegionOps const *ops;
@@ -1614,73 +1336,7 @@ static int openpic_init(SysBusDevice * dev)
 		map_list(opp, list_fsl, &list_count);
 
 		break;
-
-	case OPENPIC_MODEL_RAVEN:
-		opp->nb_irqs = RAVEN_MAX_EXT;
-		opp->vid = VID_REVISION_1_3;
-		opp->vir = VIR_GENERIC;
-		opp->vector_mask = 0xFF;
-		opp->tfrr_reset = 4160000;
-		opp->ivpr_reset = IVPR_MASK_MASK | IVPR_MODE_MASK;
-		opp->idr_reset = 0;
-		opp->max_irq = RAVEN_MAX_IRQ;
-		opp->irq_ipi0 = RAVEN_IPI_IRQ;
-		opp->irq_tim0 = RAVEN_TMR_IRQ;
-		opp->brr1 = -1;
-		opp->mpic_mode_mask = GCR_MODE_MIXED;
-
-		/* Only UP supported today */
-		if (opp->nb_cpus != 1) {
-			return -EINVAL;
-		}
-
-		map_list(opp, list_le, &list_count);
-		break;
-	}
-
-	for (i = 0; i < opp->nb_cpus; i++) {
-		opp->dst[i].irqs = g_new(qemu_irq, OPENPIC_OUTPUT_NB);
-		for (j = 0; j < OPENPIC_OUTPUT_NB; j++) {
-			sysbus_init_irq(dev, &opp->dst[i].irqs[j]);
-		}
 	}
 
-	register_savevm(&opp->busdev.qdev, "openpic", 0, 2,
-			openpic_save, openpic_load, opp);
-
-	sysbus_init_mmio(dev, &opp->mem);
-	qdev_init_gpio_in(&dev->qdev, openpic_set_irq, opp->max_irq);
-
 	return 0;
 }
-
-static Property openpic_properties[] = {
-	DEFINE_PROP_UINT32("model", OpenPICState, model,
-			   OPENPIC_MODEL_FSL_MPIC_20),
-	DEFINE_PROP_UINT32("nb_cpus", OpenPICState, nb_cpus, 1),
-	DEFINE_PROP_END_OF_LIST(),
-};
-
-static void openpic_class_init(ObjectClass * klass, void *data)
-{
-	DeviceClass *dc = DEVICE_CLASS(klass);
-	SysBusDeviceClass *k = SYS_BUS_DEVICE_CLASS(klass);
-
-	k->init = openpic_init;
-	dc->props = openpic_properties;
-	dc->reset = openpic_reset;
-}
-
-static const TypeInfo openpic_info = {
-	.name = "openpic",
-	.parent = TYPE_SYS_BUS_DEVICE,
-	.instance_size = sizeof(OpenPICState),
-	.class_init = openpic_class_init,
-};
-
-static void openpic_register_types(void)
-{
-	type_register_static(&openpic_info);
-}
-
-type_init(openpic_register_types)
-- 
1.7.9.5


* [RFC PATCH v3 4/6] kvm/ppc/mpic: adapt to kernel style and environment
  2013-04-03  1:57     ` Scott Wood
@ 2013-04-03  1:57       ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-03  1:57 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus, Scott Wood

Remove braces that Linux style doesn't permit, remove the space after
'*' that Lindent added, keep error/debug strings contiguous, etc.

Substitute kernel type names (e.g. hwaddr -> gpa_t, typedef'd structs ->
plain struct tags) and debug prints (DPRINTF -> pr_debug), etc.
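
As a rough illustration of the conventions listed above (not code from
this patch), the sketch below shows a plain struct tag instead of a
typedef, '*' attached to the identifier, no braces around a
single-statement branch, and a format string kept on one line. The
demo_* names and DEMO_MAX_IRQ are hypothetical, and pr_debug() is
replaced by a userspace stand-in so the fragment builds outside the
kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Userspace stand-in for the kernel's pr_debug(); assumption for illustration. */
#define pr_debug(...) fprintf(stderr, __VA_ARGS__)

#define DEMO_MAX_IRQ 64		/* hypothetical queue size */

/* Plain struct tag rather than "typedef struct ... DemoIrqQueue". */
struct demo_irq_queue {
	bool pending[DEMO_MAX_IRQ];
};

/* '*' sits next to the name: "*q", not "* q" as Lindent produced. */
static int demo_first_pending(const struct demo_irq_queue *q)
{
	int i;

	assert(q != NULL);

	for (i = 0; i < DEMO_MAX_IRQ; i++) {
		/* Single-statement branch: no braces. */
		if (q->pending[i])
			return i;
	}

	/* The format string stays contiguous so it remains greppable. */
	pr_debug("%s: no pending IRQ in a queue of %d entries\n",
		 __func__, DEMO_MAX_IRQ);
	return -1;
}
```

Calling demo_first_pending() on an all-clear queue returns -1 and
prints the debug line; once an entry is set it returns that entry's
index.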

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
 arch/powerpc/kvm/mpic.c |  445 ++++++++++++++++++++++-------------------------
 1 file changed, 208 insertions(+), 237 deletions(-)

diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index d6d70a4..1df67ae 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -42,22 +42,22 @@
 #define OPENPIC_TMR_REG_SIZE         0x220
 #define OPENPIC_MSI_REG_START        0x1600
 #define OPENPIC_MSI_REG_SIZE         0x200
-#define OPENPIC_SUMMARY_REG_START   0x3800
-#define OPENPIC_SUMMARY_REG_SIZE    0x800
+#define OPENPIC_SUMMARY_REG_START    0x3800
+#define OPENPIC_SUMMARY_REG_SIZE     0x800
 #define OPENPIC_SRC_REG_START        0x10000
 #define OPENPIC_SRC_REG_SIZE         (MAX_SRC * 0x20)
 #define OPENPIC_CPU_REG_START        0x20000
-#define OPENPIC_CPU_REG_SIZE         0x100 + ((MAX_CPU - 1) * 0x1000)
+#define OPENPIC_CPU_REG_SIZE         (0x100 + ((MAX_CPU - 1) * 0x1000))
 
-typedef struct FslMpicInfo {
+struct fsl_mpic_info {
 	int max_ext;
-} FslMpicInfo;
+};
 
-static FslMpicInfo fsl_mpic_20 = {
+static struct fsl_mpic_info fsl_mpic_20 = {
 	.max_ext = 12,
 };
 
-static FslMpicInfo fsl_mpic_42 = {
+static struct fsl_mpic_info fsl_mpic_42 = {
 	.max_ext = 12,
 };
 
@@ -100,44 +100,43 @@ static int get_current_cpu(void)
 {
 	CPUState *cpu_single_cpu;
 
-	if (!cpu_single_env) {
+	if (!cpu_single_env)
 		return -1;
-	}
 
 	cpu_single_cpu = ENV_GET_CPU(cpu_single_env);
 	return cpu_single_cpu->cpu_index;
 }
 
-static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx);
-static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
+static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx);
+static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
 				       uint32_t val, int idx);
 
-typedef enum IRQType {
+enum irq_type {
 	IRQ_TYPE_NORMAL = 0,
 	IRQ_TYPE_FSLINT,	/* FSL internal interrupt -- level only */
 	IRQ_TYPE_FSLSPECIAL,	/* FSL timer/IPI interrupt, edge, no polarity */
-} IRQType;
+};
 
-typedef struct IRQQueue {
+struct irq_queue {
 	/* Round up to the nearest 64 IRQs so that the queue length
 	 * won't change when moving between 32 and 64 bit hosts.
 	 */
 	unsigned long queue[BITS_TO_LONGS((MAX_IRQ + 63) & ~63)];
 	int next;
 	int priority;
-} IRQQueue;
+};
 
-typedef struct IRQSource {
+struct irq_source {
 	uint32_t ivpr;		/* IRQ vector/priority register */
 	uint32_t idr;		/* IRQ destination register */
 	uint32_t destmask;	/* bitmap of CPU destinations */
 	int last_cpu;
 	int output;		/* IRQ level, e.g. OPENPIC_OUTPUT_INT */
 	int pending;		/* TRUE if IRQ is pending */
-	IRQType type;
+	enum irq_type type;
 	bool level:1;		/* level-triggered */
-	bool nomask:1;		/* critical interrupts ignore mask on some FSL MPICs */
-} IRQSource;
+	bool nomask:1;	/* critical interrupts ignore mask on some FSL MPICs */
+};
 
 #define IVPR_MASK_SHIFT       31
 #define IVPR_MASK_MASK        (1 << IVPR_MASK_SHIFT)
@@ -158,22 +157,19 @@ typedef struct IRQSource {
 #define IDR_EP      0x80000000	/* external pin */
 #define IDR_CI      0x40000000	/* critical interrupt */
 
-typedef struct IRQDest {
+struct irq_dest {
 	int32_t ctpr;		/* CPU current task priority */
-	IRQQueue raised;
-	IRQQueue servicing;
+	struct irq_queue raised;
+	struct irq_queue servicing;
 	qemu_irq *irqs;
 
 	/* Count of IRQ sources asserting on non-INT outputs */
 	uint32_t outputs_active[OPENPIC_OUTPUT_NB];
-} IRQDest;
-
-typedef struct OpenPICState {
-	SysBusDevice busdev;
-	MemoryRegion mem;
+};
 
+struct openpic {
 	/* Behavior control */
-	FslMpicInfo *fsl;
+	struct fsl_mpic_info *fsl;
 	uint32_t model;
 	uint32_t flags;
 	uint32_t nb_irqs;
@@ -186,9 +182,6 @@ typedef struct OpenPICState {
 	uint32_t brr1;
 	uint32_t mpic_mode_mask;
 
-	/* Sub-regions */
-	MemoryRegion sub_io_mem[6];
-
 	/* Global registers */
 	uint32_t frr;		/* Feature reporting register */
 	uint32_t gcr;		/* Global configuration register  */
@@ -196,9 +189,9 @@ typedef struct OpenPICState {
 	uint32_t spve;		/* Spurious vector register */
 	uint32_t tfrr;		/* Timer frequency reporting register */
 	/* Source registers */
-	IRQSource src[MAX_IRQ];
+	struct irq_source src[MAX_IRQ];
 	/* Local registers per output pin */
-	IRQDest dst[MAX_CPU];
+	struct irq_dest dst[MAX_CPU];
 	uint32_t nb_cpus;
 	/* Timer registers */
 	struct {
@@ -213,24 +206,24 @@ typedef struct OpenPICState {
 	uint32_t irq_ipi0;
 	uint32_t irq_tim0;
 	uint32_t irq_msi;
-} OpenPICState;
+};
 
-static inline void IRQ_setbit(IRQQueue * q, int n_IRQ)
+static inline void IRQ_setbit(struct irq_queue *q, int n_IRQ)
 {
 	set_bit(n_IRQ, q->queue);
 }
 
-static inline void IRQ_resetbit(IRQQueue * q, int n_IRQ)
+static inline void IRQ_resetbit(struct irq_queue *q, int n_IRQ)
 {
 	clear_bit(n_IRQ, q->queue);
 }
 
-static inline int IRQ_testbit(IRQQueue * q, int n_IRQ)
+static inline int IRQ_testbit(struct irq_queue *q, int n_IRQ)
 {
 	return test_bit(n_IRQ, q->queue);
 }
 
-static void IRQ_check(OpenPICState * opp, IRQQueue * q)
+static void IRQ_check(struct openpic *opp, struct irq_queue *q)
 {
 	int irq = -1;
 	int next = -1;
@@ -238,11 +231,10 @@ static void IRQ_check(OpenPICState * opp, IRQQueue * q)
 
 	for (;;) {
 		irq = find_next_bit(q->queue, opp->max_irq, irq + 1);
-		if (irq == opp->max_irq) {
+		if (irq == opp->max_irq)
 			break;
-		}
 
-		DPRINTF("IRQ_check: irq %d set ivpr_pr=%d pr=%d\n",
+		pr_debug("IRQ_check: irq %d set ivpr_pr=%d pr=%d\n",
 			irq, IVPR_PRIORITY(opp->src[irq].ivpr), priority);
 
 		if (IVPR_PRIORITY(opp->src[irq].ivpr) > priority) {
@@ -255,7 +247,7 @@ static void IRQ_check(OpenPICState * opp, IRQQueue * q)
 	q->priority = priority;
 }
 
-static int IRQ_get_next(OpenPICState * opp, IRQQueue * q)
+static int IRQ_get_next(struct openpic *opp, struct irq_queue *q)
 {
 	/* XXX: optimize */
 	IRQ_check(opp, q);
@@ -263,21 +255,21 @@ static int IRQ_get_next(OpenPICState * opp, IRQQueue * q)
 	return q->next;
 }
 
-static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
+static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 			   bool active, bool was_active)
 {
-	IRQDest *dst;
-	IRQSource *src;
+	struct irq_dest *dst;
+	struct irq_source *src;
 	int priority;
 
 	dst = &opp->dst[n_CPU];
 	src = &opp->src[n_IRQ];
 
-	DPRINTF("%s: IRQ %d active %d was %d\n",
+	pr_debug("%s: IRQ %d active %d was %d\n",
 		__func__, n_IRQ, active, was_active);
 
 	if (src->output != OPENPIC_OUTPUT_INT) {
-		DPRINTF("%s: output %d irq %d active %d was %d count %d\n",
+		pr_debug("%s: output %d irq %d active %d was %d count %d\n",
 			__func__, src->output, n_IRQ, active, was_active,
 			dst->outputs_active[src->output]);
 
@@ -286,19 +278,17 @@ static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
 		 * masking.
 		 */
 		if (active) {
-			if (!was_active
-			    && dst->outputs_active[src->output]++ == 0) {
-				DPRINTF
-				    ("%s: Raise OpenPIC output %d cpu %d irq %d\n",
-				     __func__, src->output, n_CPU, n_IRQ);
+			if (!was_active &&
+			    dst->outputs_active[src->output]++ == 0) {
+				pr_debug("%s: Raise OpenPIC output %d cpu %d irq %d\n",
+					__func__, src->output, n_CPU, n_IRQ);
 				qemu_irq_raise(dst->irqs[src->output]);
 			}
 		} else {
-			if (was_active
-			    && --dst->outputs_active[src->output] == 0) {
-				DPRINTF
-				    ("%s: Lower OpenPIC output %d cpu %d irq %d\n",
-				     __func__, src->output, n_CPU, n_IRQ);
+			if (was_active &&
+			    --dst->outputs_active[src->output] == 0) {
+				pr_debug("%s: Lower OpenPIC output %d cpu %d irq %d\n",
+					__func__, src->output, n_CPU, n_IRQ);
 				qemu_irq_lower(dst->irqs[src->output]);
 			}
 		}
@@ -311,31 +301,27 @@ static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
 	/* Even if the interrupt doesn't have enough priority,
 	 * it is still raised, in case ctpr is lowered later.
 	 */
-	if (active) {
+	if (active)
 		IRQ_setbit(&dst->raised, n_IRQ);
-	} else {
+	else
 		IRQ_resetbit(&dst->raised, n_IRQ);
-	}
 
 	IRQ_check(opp, &dst->raised);
 
 	if (active && priority <= dst->ctpr) {
-		DPRINTF
-		    ("%s: IRQ %d priority %d too low for ctpr %d on CPU %d\n",
-		     __func__, n_IRQ, priority, dst->ctpr, n_CPU);
+		pr_debug("%s: IRQ %d priority %d too low for ctpr %d on CPU %d\n",
+			__func__, n_IRQ, priority, dst->ctpr, n_CPU);
 		active = 0;
 	}
 
 	if (active) {
 		if (IRQ_get_next(opp, &dst->servicing) >= 0 &&
 		    priority <= dst->servicing.priority) {
-			DPRINTF
-			    ("%s: IRQ %d is hidden by servicing IRQ %d on CPU %d\n",
-			     __func__, n_IRQ, dst->servicing.next, n_CPU);
+			pr_debug("%s: IRQ %d is hidden by servicing IRQ %d on CPU %d\n",
+				__func__, n_IRQ, dst->servicing.next, n_CPU);
 		} else {
-			DPRINTF
-			    ("%s: Raise OpenPIC INT output cpu %d irq %d/%d\n",
-			     __func__, n_CPU, n_IRQ, dst->raised.next);
+			pr_debug("%s: Raise OpenPIC INT output cpu %d irq %d/%d\n",
+				__func__, n_CPU, n_IRQ, dst->raised.next);
 			qemu_irq_raise(opp->dst[n_CPU].
 				       irqs[OPENPIC_OUTPUT_INT]);
 		}
@@ -343,17 +329,15 @@ static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
 		IRQ_get_next(opp, &dst->servicing);
 		if (dst->raised.priority > dst->ctpr &&
 		    dst->raised.priority > dst->servicing.priority) {
-			DPRINTF
-			    ("%s: IRQ %d inactive, IRQ %d prio %d above %d/%d, CPU %d\n",
-			     __func__, n_IRQ, dst->raised.next,
-			     dst->raised.priority, dst->ctpr,
-			     dst->servicing.priority, n_CPU);
+			pr_debug("%s: IRQ %d inactive, IRQ %d prio %d above %d/%d, CPU %d\n",
+				__func__, n_IRQ, dst->raised.next,
+				dst->raised.priority, dst->ctpr,
+				dst->servicing.priority, n_CPU);
 			/* IRQ line stays asserted */
 		} else {
-			DPRINTF
-			    ("%s: IRQ %d inactive, current prio %d/%d, CPU %d\n",
-			     __func__, n_IRQ, dst->ctpr,
-			     dst->servicing.priority, n_CPU);
+			pr_debug("%s: IRQ %d inactive, current prio %d/%d, CPU %d\n",
+				__func__, n_IRQ, dst->ctpr,
+				dst->servicing.priority, n_CPU);
 			qemu_irq_lower(opp->dst[n_CPU].
 				       irqs[OPENPIC_OUTPUT_INT]);
 		}
@@ -361,9 +345,9 @@ static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
 }
 
 /* update pic state because registers for n_IRQ have changed value */
-static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
+static void openpic_update_irq(struct openpic *opp, int n_IRQ)
 {
-	IRQSource *src;
+	struct irq_source *src;
 	bool active, was_active;
 	int i;
 
@@ -372,30 +356,29 @@ static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
 
 	if ((src->ivpr & IVPR_MASK_MASK) && !src->nomask) {
 		/* Interrupt source is disabled */
-		DPRINTF("%s: IRQ %d is disabled\n", __func__, n_IRQ);
+		pr_debug("%s: IRQ %d is disabled\n", __func__, n_IRQ);
 		active = false;
 	}
 
-	was_active = ! !(src->ivpr & IVPR_ACTIVITY_MASK);
+	was_active = !!(src->ivpr & IVPR_ACTIVITY_MASK);
 
 	/*
 	 * We don't have a similar check for already-active because
 	 * ctpr may have changed and we need to withdraw the interrupt.
 	 */
 	if (!active && !was_active) {
-		DPRINTF("%s: IRQ %d is already inactive\n", __func__, n_IRQ);
+		pr_debug("%s: IRQ %d is already inactive\n", __func__, n_IRQ);
 		return;
 	}
 
-	if (active) {
+	if (active)
 		src->ivpr |= IVPR_ACTIVITY_MASK;
-	} else {
+	else
 		src->ivpr &= ~IVPR_ACTIVITY_MASK;
-	}
 
 	if (src->destmask == 0) {
 		/* No target */
-		DPRINTF("%s: IRQ %d has no target\n", __func__, n_IRQ);
+		pr_debug("%s: IRQ %d has no target\n", __func__, n_IRQ);
 		return;
 	}
 
@@ -413,9 +396,9 @@ static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
 	} else {
 		/* Distributed delivery mode */
 		for (i = src->last_cpu + 1; i != src->last_cpu; i++) {
-			if (i == opp->nb_cpus) {
+			if (i == opp->nb_cpus)
 				i = 0;
-			}
+
 			if (src->destmask & (1 << i)) {
 				IRQ_local_pipe(opp, i, n_IRQ, active,
 					       was_active);
@@ -428,16 +411,16 @@ static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
 
 static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 {
-	OpenPICState *opp = opaque;
-	IRQSource *src;
+	struct openpic *opp = opaque;
+	struct irq_source *src;
 
 	if (n_IRQ >= MAX_IRQ) {
-		fprintf(stderr, "%s: IRQ %d out of range\n", __func__, n_IRQ);
+		pr_err("%s: IRQ %d out of range\n", __func__, n_IRQ);
 		abort();
 	}
 
 	src = &opp->src[n_IRQ];
-	DPRINTF("openpic: set irq %d = %d ivpr=0x%08x\n",
+	pr_debug("openpic: set irq %d = %d ivpr=0x%08x\n",
 		n_IRQ, level, src->ivpr);
 	if (src->level) {
 		/* level-sensitive irq */
@@ -463,9 +446,9 @@ static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 	}
 }
 
-static void openpic_reset(DeviceState * d)
+static void openpic_reset(DeviceState *d)
 {
-	OpenPICState *opp = FROM_SYSBUS(typeof(*opp), SYS_BUS_DEVICE(d));
+	struct openpic *opp = FROM_SYSBUS(typeof(*opp), SYS_BUS_DEVICE(d));
 	int i;
 
 	opp->gcr = GCR_RESET;
@@ -485,7 +468,7 @@ static void openpic_reset(DeviceState * d)
 		switch (opp->src[i].type) {
 		case IRQ_TYPE_NORMAL:
 			opp->src[i].level =
-			    ! !(opp->ivpr_reset & IVPR_SENSE_MASK);
+			    !!(opp->ivpr_reset & IVPR_SENSE_MASK);
 			break;
 
 		case IRQ_TYPE_FSLINT:
@@ -499,9 +482,9 @@ static void openpic_reset(DeviceState * d)
 	/* Initialise IRQ destinations */
 	for (i = 0; i < MAX_CPU; i++) {
 		opp->dst[i].ctpr = 15;
-		memset(&opp->dst[i].raised, 0, sizeof(IRQQueue));
+		memset(&opp->dst[i].raised, 0, sizeof(struct irq_queue));
 		opp->dst[i].raised.next = -1;
-		memset(&opp->dst[i].servicing, 0, sizeof(IRQQueue));
+		memset(&opp->dst[i].servicing, 0, sizeof(struct irq_queue));
 		opp->dst[i].servicing.next = -1;
 	}
 	/* Initialise timers */
@@ -513,28 +496,28 @@ static void openpic_reset(DeviceState * d)
 	opp->gcr = 0;
 }
 
-static inline uint32_t read_IRQreg_idr(OpenPICState * opp, int n_IRQ)
+static inline uint32_t read_IRQreg_idr(struct openpic *opp, int n_IRQ)
 {
 	return opp->src[n_IRQ].idr;
 }
 
-static inline uint32_t read_IRQreg_ilr(OpenPICState * opp, int n_IRQ)
+static inline uint32_t read_IRQreg_ilr(struct openpic *opp, int n_IRQ)
 {
-	if (opp->flags & OPENPIC_FLAG_ILR) {
+	if (opp->flags & OPENPIC_FLAG_ILR)
 		return output_to_inttgt(opp->src[n_IRQ].output);
-	}
 
 	return 0xffffffff;
 }
 
-static inline uint32_t read_IRQreg_ivpr(OpenPICState * opp, int n_IRQ)
+static inline uint32_t read_IRQreg_ivpr(struct openpic *opp, int n_IRQ)
 {
 	return opp->src[n_IRQ].ivpr;
 }
 
-static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
+static inline void write_IRQreg_idr(struct openpic *opp, int n_IRQ,
+				    uint32_t val)
 {
-	IRQSource *src = &opp->src[n_IRQ];
+	struct irq_source *src = &opp->src[n_IRQ];
 	uint32_t normal_mask = (1UL << opp->nb_cpus) - 1;
 	uint32_t crit_mask = 0;
 	uint32_t mask = normal_mask;
@@ -547,14 +530,13 @@ static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
 	}
 
 	src->idr = val & mask;
-	DPRINTF("Set IDR %d to 0x%08x\n", n_IRQ, src->idr);
+	pr_debug("Set IDR %d to 0x%08x\n", n_IRQ, src->idr);
 
 	if (opp->flags & OPENPIC_FLAG_IDR_CRIT) {
 		if (src->idr & crit_mask) {
 			if (src->idr & normal_mask) {
-				DPRINTF
-				    ("%s: IRQ configured for multiple output types, using "
-				     "critical\n", __func__);
+				pr_debug("%s: IRQ configured for multiple output types, using critical\n",
+					__func__);
 			}
 
 			src->output = OPENPIC_OUTPUT_CINT;
@@ -564,9 +546,8 @@ static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
 			for (i = 0; i < opp->nb_cpus; i++) {
 				int n_ci = IDR_CI0_SHIFT - i;
 
-				if (src->idr & (1UL << n_ci)) {
+				if (src->idr & (1UL << n_ci))
 					src->destmask |= 1UL << i;
-				}
 			}
 		} else {
 			src->output = OPENPIC_OUTPUT_INT;
@@ -578,20 +559,21 @@ static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
 	}
 }
 
-static inline void write_IRQreg_ilr(OpenPICState * opp, int n_IRQ, uint32_t val)
+static inline void write_IRQreg_ilr(struct openpic *opp, int n_IRQ,
+				    uint32_t val)
 {
 	if (opp->flags & OPENPIC_FLAG_ILR) {
-		IRQSource *src = &opp->src[n_IRQ];
+		struct irq_source *src = &opp->src[n_IRQ];
 
 		src->output = inttgt_to_output(val & ILR_INTTGT_MASK);
-		DPRINTF("Set ILR %d to 0x%08x, output %d\n", n_IRQ, src->idr,
+		pr_debug("Set ILR %d to 0x%08x, output %d\n", n_IRQ, src->idr,
 			src->output);
 
 		/* TODO: on MPIC v4.0 only, set nomask for non-INT */
 	}
 }
 
-static inline void write_IRQreg_ivpr(OpenPICState * opp, int n_IRQ,
+static inline void write_IRQreg_ivpr(struct openpic *opp, int n_IRQ,
 				     uint32_t val)
 {
 	uint32_t mask;
@@ -613,7 +595,7 @@ static inline void write_IRQreg_ivpr(OpenPICState * opp, int n_IRQ,
 	switch (opp->src[n_IRQ].type) {
 	case IRQ_TYPE_NORMAL:
 		opp->src[n_IRQ].level =
-		    ! !(opp->src[n_IRQ].ivpr & IVPR_SENSE_MASK);
+		    !!(opp->src[n_IRQ].ivpr & IVPR_SENSE_MASK);
 		break;
 
 	case IRQ_TYPE_FSLINT:
@@ -626,11 +608,11 @@ static inline void write_IRQreg_ivpr(OpenPICState * opp, int n_IRQ,
 	}
 
 	openpic_update_irq(opp, n_IRQ);
-	DPRINTF("Set IVPR %d to 0x%08x -> 0x%08x\n", n_IRQ, val,
+	pr_debug("Set IVPR %d to 0x%08x -> 0x%08x\n", n_IRQ, val,
 		opp->src[n_IRQ].ivpr);
 }
 
-static void openpic_gcr_write(OpenPICState * opp, uint64_t val)
+static void openpic_gcr_write(struct openpic *opp, uint64_t val)
 {
 	bool mpic_proxy = false;
 
@@ -643,27 +625,26 @@ static void openpic_gcr_write(OpenPICState * opp, uint64_t val)
 	opp->gcr |= val & opp->mpic_mode_mask;
 
 	/* Set external proxy mode */
-	if ((val & opp->mpic_mode_mask) == GCR_MODE_PROXY) {
+	if ((val & opp->mpic_mode_mask) == GCR_MODE_PROXY)
 		mpic_proxy = true;
-	}
 
 	ppce500_set_mpic_proxy(mpic_proxy);
 }
 
-static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_gbl_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned len)
 {
-	OpenPICState *opp = opaque;
-	IRQDest *dst;
+	struct openpic *opp = opaque;
+	struct irq_dest *dst;
 	int idx;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
 		__func__, addr, val);
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return;
-	}
+
 	switch (addr) {
-	case 0x00:		/* Block Revision Register1 (BRR1) is Readonly */
+	case 0x00:	/* Block Revision Register1 (BRR1) is Readonly */
 		break;
 	case 0x40:
 	case 0x50:
@@ -685,16 +666,14 @@ static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
 	case 0x1090:		/* PIR */
 		for (idx = 0; idx < opp->nb_cpus; idx++) {
 			if ((val & (1 << idx)) && !(opp->pir & (1 << idx))) {
-				DPRINTF
-				    ("Raise OpenPIC RESET output for CPU %d\n",
-				     idx);
+				pr_debug("Raise OpenPIC RESET output for CPU %d\n",
+					idx);
 				dst = &opp->dst[idx];
 				qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_RESET]);
-			} else if (!(val & (1 << idx))
-				   && (opp->pir & (1 << idx))) {
-				DPRINTF
-				    ("Lower OpenPIC RESET output for CPU %d\n",
-				     idx);
+			} else if (!(val & (1 << idx)) &&
+				   (opp->pir & (1 << idx))) {
+				pr_debug("Lower OpenPIC RESET output for CPU %d\n",
+					idx);
 				dst = &opp->dst[idx];
 				qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_RESET]);
 			}
@@ -704,13 +683,12 @@ static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
 	case 0x10A0:		/* IPI_IVPR */
 	case 0x10B0:
 	case 0x10C0:
-	case 0x10D0:
-		{
-			int idx;
-			idx = (addr - 0x10A0) >> 4;
-			write_IRQreg_ivpr(opp, opp->irq_ipi0 + idx, val);
-		}
+	case 0x10D0: {
+		int idx;
+		idx = (addr - 0x10A0) >> 4;
+		write_IRQreg_ivpr(opp, opp->irq_ipi0 + idx, val);
 		break;
+	}
 	case 0x10E0:		/* SPVE */
 		opp->spve = val & opp->vector_mask;
 		break;
@@ -719,16 +697,16 @@ static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
 	}
 }
 
-static uint64_t openpic_gbl_read(void *opaque, hwaddr addr, unsigned len)
+static uint64_t openpic_gbl_read(void *opaque, gpa_t addr, unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	uint32_t retval;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
 	retval = 0xFFFFFFFF;
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return retval;
-	}
+
 	switch (addr) {
 	case 0x1000:		/* FRR */
 		retval = opp->frr;
@@ -772,24 +750,23 @@ static uint64_t openpic_gbl_read(void *opaque, hwaddr addr, unsigned len)
 	default:
 		break;
 	}
-	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
 	return retval;
 }
 
-static void openpic_tmr_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_tmr_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	int idx;
 
 	addr += 0x10f0;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
 		__func__, addr, val);
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return;
-	}
 
 	if (addr == 0x10f0) {
 		/* TFRR */
@@ -806,9 +783,9 @@ static void openpic_tmr_write(void *opaque, hwaddr addr, uint64_t val,
 	case 0x10:		/* TBCR */
 		if ((opp->timers[idx].tccr & TCCR_TOG) != 0 &&
 		    (val & TBCR_CI) == 0 &&
-		    (opp->timers[idx].tbcr & TBCR_CI) != 0) {
+		    (opp->timers[idx].tbcr & TBCR_CI) != 0)
 			opp->timers[idx].tccr &= ~TCCR_TOG;
-		}
+
 		opp->timers[idx].tbcr = val;
 		break;
 	case 0x20:		/* TVPR */
@@ -820,16 +797,16 @@ static void openpic_tmr_write(void *opaque, hwaddr addr, uint64_t val,
 	}
 }
 
-static uint64_t openpic_tmr_read(void *opaque, hwaddr addr, unsigned len)
+static uint64_t openpic_tmr_read(void *opaque, gpa_t addr, unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	uint32_t retval = -1;
 	int idx;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
-	if (addr & 0xF) {
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	if (addr & 0xF)
 		goto out;
-	}
+
 	idx = (addr >> 6) & 0x3;
 	if (addr == 0x0) {
 		/* TFRR */
@@ -852,18 +829,18 @@ static uint64_t openpic_tmr_read(void *opaque, hwaddr addr, unsigned len)
 	}
 
 out:
-	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
 	return retval;
 }
 
-static void openpic_src_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_src_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	int idx;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
 		__func__, addr, val);
 
 	addr = addr & 0xffff;
@@ -884,11 +861,11 @@ static void openpic_src_write(void *opaque, hwaddr addr, uint64_t val,
 
 static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	uint32_t retval;
 	int idx;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
 	retval = 0xFFFFFFFF;
 
 	addr = addr & 0xffff;
@@ -906,22 +883,21 @@ static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
 		break;
 	}
 
-	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
 	return retval;
 }
 
-static void openpic_msi_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_msi_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned size)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	int idx = opp->irq_msi;
 	int srs, ibs;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
 		__func__, addr, val);
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return;
-	}
 
 	switch (addr) {
 	case MSIIR_OFFSET:
@@ -937,16 +913,15 @@ static void openpic_msi_write(void *opaque, hwaddr addr, uint64_t val,
 	}
 }
 
-static uint64_t openpic_msi_read(void *opaque, hwaddr addr, unsigned size)
+static uint64_t openpic_msi_read(void *opaque, gpa_t addr, unsigned size)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	uint64_t r = 0;
 	int i, srs;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
-	if (addr & 0xF) {
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	if (addr & 0xF)
 		return -1;
-	}
 
 	srs = addr >> 4;
 
@@ -965,53 +940,51 @@ static uint64_t openpic_msi_read(void *opaque, hwaddr addr, unsigned size)
 		openpic_set_irq(opp, opp->irq_msi + srs, 0);
 		break;
 	case 0x120:		/* MSISR */
-		for (i = 0; i < MAX_MSI; i++) {
+		for (i = 0; i < MAX_MSI; i++)
 			r |= (opp->msi[i].msir ? 1 : 0) << i;
-		}
 		break;
 	}
 
 	return r;
 }
 
-static uint64_t openpic_summary_read(void *opaque, hwaddr addr, unsigned size)
+static uint64_t openpic_summary_read(void *opaque, gpa_t addr, unsigned size)
 {
 	uint64_t r = 0;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
 
 	/* TODO: EISR/EIMR */
 
 	return r;
 }
 
-static void openpic_summary_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_summary_write(void *opaque, gpa_t addr, uint64_t val,
 				  unsigned size)
 {
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
 		__func__, addr, val);
 
 	/* TODO: EISR/EIMR */
 }
 
-static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
+static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
 				       uint32_t val, int idx)
 {
-	OpenPICState *opp = opaque;
-	IRQSource *src;
-	IRQDest *dst;
+	struct openpic *opp = opaque;
+	struct irq_source *src;
+	struct irq_dest *dst;
 	int s_IRQ, n_IRQ;
 
-	DPRINTF("%s: cpu %d addr %#" HWADDR_PRIx " <= 0x%08x\n", __func__, idx,
+	pr_debug("%s: cpu %d addr %#" HWADDR_PRIx " <= 0x%08x\n", __func__, idx,
 		addr, val);
 
-	if (idx < 0) {
+	if (idx < 0)
 		return;
-	}
 
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return;
-	}
+
 	dst = &opp->dst[idx];
 	addr &= 0xFF0;
 	switch (addr) {
@@ -1028,17 +1001,16 @@ static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
 	case 0x80:		/* CTPR */
 		dst->ctpr = val & 0x0000000F;
 
-		DPRINTF("%s: set CPU %d ctpr to %d, raised %d servicing %d\n",
+		pr_debug("%s: set CPU %d ctpr to %d, raised %d servicing %d\n",
 			__func__, idx, dst->ctpr, dst->raised.priority,
 			dst->servicing.priority);
 
 		if (dst->raised.priority <= dst->ctpr) {
-			DPRINTF
-			    ("%s: Lower OpenPIC INT output cpu %d due to ctpr\n",
-			     __func__, idx);
+			pr_debug("%s: Lower OpenPIC INT output cpu %d due to ctpr\n",
+				__func__, idx);
 			qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
 		} else if (dst->raised.priority > dst->servicing.priority) {
-			DPRINTF("%s: Raise OpenPIC INT output cpu %d irq %d\n",
+			pr_debug("%s: Raise OpenPIC INT output cpu %d irq %d\n",
 				__func__, idx, dst->raised.next);
 			qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_INT]);
 		}
@@ -1051,11 +1023,11 @@ static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
 		/* Read-only register */
 		break;
 	case 0xB0:		/* EOI */
-		DPRINTF("EOI\n");
+		pr_debug("EOI\n");
 		s_IRQ = IRQ_get_next(opp, &dst->servicing);
 
 		if (s_IRQ < 0) {
-			DPRINTF("%s: EOI with no interrupt in service\n",
+			pr_debug("%s: EOI with no interrupt in service\n",
 				__func__);
 			break;
 		}
@@ -1069,7 +1041,7 @@ static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
 		if (n_IRQ != -1 &&
 		    (s_IRQ == -1 ||
 		     IVPR_PRIORITY(src->ivpr) > dst->servicing.priority)) {
-			DPRINTF("Raise OpenPIC INT output cpu %d irq %d\n",
+			pr_debug("Raise OpenPIC INT output cpu %d irq %d\n",
 				idx, n_IRQ);
 			qemu_irq_raise(opp->dst[idx].irqs[OPENPIC_OUTPUT_INT]);
 		}
@@ -1079,32 +1051,32 @@ static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
 	}
 }
 
-static void openpic_cpu_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_cpu_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned len)
 {
 	openpic_cpu_write_internal(opaque, addr, val, (addr & 0x1f000) >> 12);
 }
 
-static uint32_t openpic_iack(OpenPICState * opp, IRQDest * dst, int cpu)
+static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
+			     int cpu)
 {
-	IRQSource *src;
+	struct irq_source *src;
 	int retval, irq;
 
-	DPRINTF("Lower OpenPIC INT output\n");
+	pr_debug("Lower OpenPIC INT output\n");
 	qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
 
 	irq = IRQ_get_next(opp, &dst->raised);
-	DPRINTF("IACK: irq=%d\n", irq);
+	pr_debug("IACK: irq=%d\n", irq);
 
-	if (irq == -1) {
+	if (irq == -1)
 		/* No more interrupt pending */
 		return opp->spve;
-	}
 
 	src = &opp->src[irq];
 	if (!(src->ivpr & IVPR_ACTIVITY_MASK) ||
 	    !(IVPR_PRIORITY(src->ivpr) > dst->ctpr)) {
-		fprintf(stderr, "%s: bad raised IRQ %d ctpr %d ivpr 0x%08x\n",
+		pr_err("%s: bad raised IRQ %d ctpr %d ivpr 0x%08x\n",
 			__func__, irq, dst->ctpr, src->ivpr);
 		openpic_update_irq(opp, irq);
 		retval = opp->spve;
@@ -1135,22 +1107,21 @@ static uint32_t openpic_iack(OpenPICState * opp, IRQDest * dst, int cpu)
 	return retval;
 }
 
-static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx)
+static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx)
 {
-	OpenPICState *opp = opaque;
-	IRQDest *dst;
+	struct openpic *opp = opaque;
+	struct irq_dest *dst;
 	uint32_t retval;
 
-	DPRINTF("%s: cpu %d addr %#" HWADDR_PRIx "\n", __func__, idx, addr);
+	pr_debug("%s: cpu %d addr %#" HWADDR_PRIx "\n", __func__, idx, addr);
 	retval = 0xFFFFFFFF;
 
-	if (idx < 0) {
+	if (idx < 0)
 		return retval;
-	}
 
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return retval;
-	}
+
 	dst = &opp->dst[idx];
 	addr &= 0xFF0;
 	switch (addr) {
@@ -1169,54 +1140,54 @@ static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx)
 	default:
 		break;
 	}
-	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
 	return retval;
 }
 
-static uint64_t openpic_cpu_read(void *opaque, hwaddr addr, unsigned len)
+static uint64_t openpic_cpu_read(void *opaque, gpa_t addr, unsigned len)
 {
 	return openpic_cpu_read_internal(opaque, addr, (addr & 0x1f000) >> 12);
 }
 
-static const MemoryRegionOps openpic_glb_ops_be = {
+static const struct kvm_io_device_ops openpic_glb_ops_be = {
 	.write = openpic_gbl_write,
 	.read = openpic_gbl_read,
 };
 
-static const MemoryRegionOps openpic_tmr_ops_be = {
+static const struct kvm_io_device_ops openpic_tmr_ops_be = {
 	.write = openpic_tmr_write,
 	.read = openpic_tmr_read,
 };
 
-static const MemoryRegionOps openpic_cpu_ops_be = {
+static const struct kvm_io_device_ops openpic_cpu_ops_be = {
 	.write = openpic_cpu_write,
 	.read = openpic_cpu_read,
 };
 
-static const MemoryRegionOps openpic_src_ops_be = {
+static const struct kvm_io_device_ops openpic_src_ops_be = {
 	.write = openpic_src_write,
 	.read = openpic_src_read,
 };
 
-static const MemoryRegionOps openpic_msi_ops_be = {
+static const struct kvm_io_device_ops openpic_msi_ops_be = {
 	.read = openpic_msi_read,
 	.write = openpic_msi_write,
 };
 
-static const MemoryRegionOps openpic_summary_ops_be = {
+static const struct kvm_io_device_ops openpic_summary_ops_be = {
 	.read = openpic_summary_read,
 	.write = openpic_summary_write,
 };
 
-typedef struct MemReg {
+struct mem_reg {
 	const char *name;
-	MemoryRegionOps const *ops;
-	hwaddr start_addr;
-	ram_addr_t size;
-} MemReg;
+	const struct kvm_io_device_ops *ops;
+	gpa_t start_addr;
+	int size;
+};
 
-static void fsl_common_init(OpenPICState * opp)
+static void fsl_common_init(struct openpic *opp)
 {
 	int i;
 	int virq = MAX_SRC;
@@ -1239,9 +1210,8 @@ static void fsl_common_init(OpenPICState * opp)
 	opp->irq_msi = 224;
 
 	msi_supported = true;
-	for (i = 0; i < opp->fsl->max_ext; i++) {
+	for (i = 0; i < opp->fsl->max_ext; i++)
 		opp->src[i].level = false;
-	}
 
 	/* Internal interrupts, including message and MSI */
 	for (i = 16; i < MAX_SRC; i++) {
@@ -1256,7 +1226,8 @@ static void fsl_common_init(OpenPICState * opp)
 	}
 }
 
-static void map_list(OpenPICState * opp, const MemReg * list, int *count)
+static void map_list(struct openpic *opp, const struct mem_reg *list,
+		     int *count)
 {
 	while (list->name) {
 		assert(*count < ARRAY_SIZE(opp->sub_io_mem));
@@ -1272,12 +1243,12 @@ static void map_list(OpenPICState * opp, const MemReg * list, int *count)
 	}
 }
 
-static int openpic_init(SysBusDevice * dev)
+static int openpic_init(SysBusDevice *dev)
 {
-	OpenPICState *opp = FROM_SYSBUS(typeof(*opp), dev);
+	struct openpic *opp = FROM_SYSBUS(typeof(*opp), dev);
 	int i, j;
 	int list_count = 0;
-	static const MemReg list_le[] = {
+	static const struct mem_reg list_le[] = {
 		{"glb", &openpic_glb_ops_le,
 		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
 		{"tmr", &openpic_tmr_ops_le,
@@ -1288,7 +1259,7 @@ static int openpic_init(SysBusDevice * dev)
 		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
 		{NULL}
 	};
-	static const MemReg list_be[] = {
+	static const struct mem_reg list_be[] = {
 		{"glb", &openpic_glb_ops_be,
 		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
 		{"tmr", &openpic_tmr_ops_be,
@@ -1299,7 +1270,7 @@ static int openpic_init(SysBusDevice * dev)
 		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
 		{NULL}
 	};
-	static const MemReg list_fsl[] = {
+	static const struct mem_reg list_fsl[] = {
 		{"msi", &openpic_msi_ops_be,
 		 OPENPIC_MSI_REG_START, OPENPIC_MSI_REG_SIZE},
 		{"summary", &openpic_summary_ops_be,
-- 
1.7.9.5




* [RFC PATCH v3 4/6] kvm/ppc/mpic: adapt to kernel style and environment
@ 2013-04-03  1:57       ` Scott Wood
From: Scott Wood @ 2013-04-03  1:57 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus, Scott Wood

Remove braces that Linux style doesn't permit, remove space after
'*' that Lindent added, keep error/debug strings contiguous, etc.

Substitute type names, debug prints, etc.

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
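For readers skimming the diff, the conversions listed above can be
sketched on a toy example. This is a minimal userspace illustration
and not part of the patch: the demo_* names and DEMO_MAX_IRQ are
hypothetical, the queue is simplified to an array of priorities rather
than a bitmap, and pr_debug() is replaced by an fprintf() stand-in.

```c
#include <assert.h>
#include <stdio.h>

/* Userspace stand-in for the kernel's pr_debug(); assumed for this sketch. */
#define pr_debug(...) fprintf(stderr, __VA_ARGS__)

#define DEMO_MAX_IRQ 8	/* hypothetical queue size */

/* Plain struct, replacing the "typedef struct Foo { ... } Foo;" idiom. */
struct demo_irq_queue {
	int prio[DEMO_MAX_IRQ];	/* priority of each pending IRQ, -1 if idle */
	int next;
	int priority;
};

static void demo_irq_queue_init(struct demo_irq_queue *q)
{
	int i;

	/* Single-statement loop body: no braces. */
	for (i = 0; i < DEMO_MAX_IRQ; i++)
		q->prio[i] = -1;

	q->next = -1;
	q->priority = -1;
}

static void demo_irq_set(struct demo_irq_queue *q, int irq, int prio)
{
	assert(irq >= 0 && irq < DEMO_MAX_IRQ);
	q->prio[irq] = prio;
}

/* Record the pending IRQ with the highest priority; ties keep the lowest IRQ. */
static void demo_irq_check(struct demo_irq_queue *q)
{
	int next = -1;
	int priority = -1;
	int irq;

	for (irq = 0; irq < DEMO_MAX_IRQ; irq++) {
		if (q->prio[irq] < 0)
			continue;

		/* Format string kept on one line so it stays greppable. */
		pr_debug("%s: irq %d pr=%d best=%d\n", __func__, irq, q->prio[irq], priority);

		if (q->prio[irq] > priority) {
			next = irq;
			priority = q->prio[irq];
		}
	}

	q->next = next;
	q->priority = priority;
}
```

A caller would run demo_irq_queue_init(), mark IRQs pending with
demo_irq_set(), and then read q->next and q->priority after
demo_irq_check(). The pointer declarations use "type *name" with no
space after the '*', which is the spacing this patch restores.
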
 arch/powerpc/kvm/mpic.c |  445 ++++++++++++++++++++++-------------------------
 1 file changed, 208 insertions(+), 237 deletions(-)
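
One of the changes below wraps the OPENPIC_CPU_REG_SIZE expansion in
parentheses. A minimal sketch of why that matters, not part of the
patch: the macro names here are hypothetical, and MAX_CPU is assumed
to be 32 purely for the arithmetic.

```c
#include <assert.h>

#define MAX_CPU 32	/* assumed value, for illustration only */

/* Unparenthesized expansion, as before the patch. */
#define CPU_REG_SIZE_OLD 0x100 + ((MAX_CPU - 1) * 0x1000)
/* Parenthesized expansion, as after the patch. */
#define CPU_REG_SIZE_NEW (0x100 + ((MAX_CPU - 1) * 0x1000))

/* Only the 0x100 term is doubled: 2 * 0x100 + 0x1f000. */
static_assert(2 * CPU_REG_SIZE_OLD == 0x1f200,
	      "old form binds the multiplier to the first term");
/* The whole size is doubled: 2 * 0x1f100. */
static_assert(2 * CPU_REG_SIZE_NEW == 0x3e200,
	      "new form scales the whole size");

static unsigned int doubled_old(void)
{
	return 2 * CPU_REG_SIZE_OLD;
}

static unsigned int doubled_new(void)
{
	return 2 * CPU_REG_SIZE_NEW;
}
```

The macro is only used as a bare table initializer in the code shown
here, so the old form happened to work; the parentheses guard against
any later use inside a larger expression.
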

diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index d6d70a4..1df67ae 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -42,22 +42,22 @@
 #define OPENPIC_TMR_REG_SIZE         0x220
 #define OPENPIC_MSI_REG_START        0x1600
 #define OPENPIC_MSI_REG_SIZE         0x200
-#define OPENPIC_SUMMARY_REG_START   0x3800
-#define OPENPIC_SUMMARY_REG_SIZE    0x800
+#define OPENPIC_SUMMARY_REG_START    0x3800
+#define OPENPIC_SUMMARY_REG_SIZE     0x800
 #define OPENPIC_SRC_REG_START        0x10000
 #define OPENPIC_SRC_REG_SIZE         (MAX_SRC * 0x20)
 #define OPENPIC_CPU_REG_START        0x20000
-#define OPENPIC_CPU_REG_SIZE         0x100 + ((MAX_CPU - 1) * 0x1000)
+#define OPENPIC_CPU_REG_SIZE         (0x100 + ((MAX_CPU - 1) * 0x1000))
 
-typedef struct FslMpicInfo {
+struct fsl_mpic_info {
 	int max_ext;
-} FslMpicInfo;
+};
 
-static FslMpicInfo fsl_mpic_20 = {
+static struct fsl_mpic_info fsl_mpic_20 = {
 	.max_ext = 12,
 };
 
-static FslMpicInfo fsl_mpic_42 = {
+static struct fsl_mpic_info fsl_mpic_42 = {
 	.max_ext = 12,
 };
 
@@ -100,44 +100,43 @@ static int get_current_cpu(void)
 {
 	CPUState *cpu_single_cpu;
 
-	if (!cpu_single_env) {
+	if (!cpu_single_env)
 		return -1;
-	}
 
 	cpu_single_cpu = ENV_GET_CPU(cpu_single_env);
 	return cpu_single_cpu->cpu_index;
 }
 
-static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx);
-static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
+static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx);
+static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
 				       uint32_t val, int idx);
 
-typedef enum IRQType {
+enum irq_type {
 	IRQ_TYPE_NORMAL = 0,
 	IRQ_TYPE_FSLINT,	/* FSL internal interrupt -- level only */
 	IRQ_TYPE_FSLSPECIAL,	/* FSL timer/IPI interrupt, edge, no polarity */
-} IRQType;
+};
 
-typedef struct IRQQueue {
+struct irq_queue {
 	/* Round up to the nearest 64 IRQs so that the queue length
 	 * won't change when moving between 32 and 64 bit hosts.
 	 */
 	unsigned long queue[BITS_TO_LONGS((MAX_IRQ + 63) & ~63)];
 	int next;
 	int priority;
-} IRQQueue;
+};
 
-typedef struct IRQSource {
+struct irq_source {
 	uint32_t ivpr;		/* IRQ vector/priority register */
 	uint32_t idr;		/* IRQ destination register */
 	uint32_t destmask;	/* bitmap of CPU destinations */
 	int last_cpu;
 	int output;		/* IRQ level, e.g. OPENPIC_OUTPUT_INT */
 	int pending;		/* TRUE if IRQ is pending */
-	IRQType type;
+	enum irq_type type;
 	bool level:1;		/* level-triggered */
-	bool nomask:1;		/* critical interrupts ignore mask on some FSL MPICs */
-} IRQSource;
+	bool nomask:1;	/* critical interrupts ignore mask on some FSL MPICs */
+};
 
 #define IVPR_MASK_SHIFT       31
 #define IVPR_MASK_MASK        (1 << IVPR_MASK_SHIFT)
@@ -158,22 +157,19 @@ typedef struct IRQSource {
 #define IDR_EP      0x80000000	/* external pin */
 #define IDR_CI      0x40000000	/* critical interrupt */
 
-typedef struct IRQDest {
+struct irq_dest {
 	int32_t ctpr;		/* CPU current task priority */
-	IRQQueue raised;
-	IRQQueue servicing;
+	struct irq_queue raised;
+	struct irq_queue servicing;
 	qemu_irq *irqs;
 
 	/* Count of IRQ sources asserting on non-INT outputs */
 	uint32_t outputs_active[OPENPIC_OUTPUT_NB];
-} IRQDest;
-
-typedef struct OpenPICState {
-	SysBusDevice busdev;
-	MemoryRegion mem;
+};
 
+struct openpic {
 	/* Behavior control */
-	FslMpicInfo *fsl;
+	struct fsl_mpic_info *fsl;
 	uint32_t model;
 	uint32_t flags;
 	uint32_t nb_irqs;
@@ -186,9 +182,6 @@ typedef struct OpenPICState {
 	uint32_t brr1;
 	uint32_t mpic_mode_mask;
 
-	/* Sub-regions */
-	MemoryRegion sub_io_mem[6];
-
 	/* Global registers */
 	uint32_t frr;		/* Feature reporting register */
 	uint32_t gcr;		/* Global configuration register  */
@@ -196,9 +189,9 @@ typedef struct OpenPICState {
 	uint32_t spve;		/* Spurious vector register */
 	uint32_t tfrr;		/* Timer frequency reporting register */
 	/* Source registers */
-	IRQSource src[MAX_IRQ];
+	struct irq_source src[MAX_IRQ];
 	/* Local registers per output pin */
-	IRQDest dst[MAX_CPU];
+	struct irq_dest dst[MAX_CPU];
 	uint32_t nb_cpus;
 	/* Timer registers */
 	struct {
@@ -213,24 +206,24 @@ typedef struct OpenPICState {
 	uint32_t irq_ipi0;
 	uint32_t irq_tim0;
 	uint32_t irq_msi;
-} OpenPICState;
+};
 
-static inline void IRQ_setbit(IRQQueue * q, int n_IRQ)
+static inline void IRQ_setbit(struct irq_queue *q, int n_IRQ)
 {
 	set_bit(n_IRQ, q->queue);
 }
 
-static inline void IRQ_resetbit(IRQQueue * q, int n_IRQ)
+static inline void IRQ_resetbit(struct irq_queue *q, int n_IRQ)
 {
 	clear_bit(n_IRQ, q->queue);
 }
 
-static inline int IRQ_testbit(IRQQueue * q, int n_IRQ)
+static inline int IRQ_testbit(struct irq_queue *q, int n_IRQ)
 {
 	return test_bit(n_IRQ, q->queue);
 }
 
-static void IRQ_check(OpenPICState * opp, IRQQueue * q)
+static void IRQ_check(struct openpic *opp, struct irq_queue *q)
 {
 	int irq = -1;
 	int next = -1;
@@ -238,11 +231,10 @@ static void IRQ_check(OpenPICState * opp, IRQQueue * q)
 
 	for (;;) {
 		irq = find_next_bit(q->queue, opp->max_irq, irq + 1);
-		if (irq == opp->max_irq) {
+		if (irq == opp->max_irq)
 			break;
-		}
 
-		DPRINTF("IRQ_check: irq %d set ivpr_pr=%d pr=%d\n",
+		pr_debug("IRQ_check: irq %d set ivpr_pr=%d pr=%d\n",
 			irq, IVPR_PRIORITY(opp->src[irq].ivpr), priority);
 
 		if (IVPR_PRIORITY(opp->src[irq].ivpr) > priority) {
@@ -255,7 +247,7 @@ static void IRQ_check(OpenPICState * opp, IRQQueue * q)
 	q->priority = priority;
 }
 
-static int IRQ_get_next(OpenPICState * opp, IRQQueue * q)
+static int IRQ_get_next(struct openpic *opp, struct irq_queue *q)
 {
 	/* XXX: optimize */
 	IRQ_check(opp, q);
@@ -263,21 +255,21 @@ static int IRQ_get_next(OpenPICState * opp, IRQQueue * q)
 	return q->next;
 }
 
-static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
+static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 			   bool active, bool was_active)
 {
-	IRQDest *dst;
-	IRQSource *src;
+	struct irq_dest *dst;
+	struct irq_source *src;
 	int priority;
 
 	dst = &opp->dst[n_CPU];
 	src = &opp->src[n_IRQ];
 
-	DPRINTF("%s: IRQ %d active %d was %d\n",
+	pr_debug("%s: IRQ %d active %d was %d\n",
 		__func__, n_IRQ, active, was_active);
 
 	if (src->output != OPENPIC_OUTPUT_INT) {
-		DPRINTF("%s: output %d irq %d active %d was %d count %d\n",
+		pr_debug("%s: output %d irq %d active %d was %d count %d\n",
 			__func__, src->output, n_IRQ, active, was_active,
 			dst->outputs_active[src->output]);
 
@@ -286,19 +278,17 @@ static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
 		 * masking.
 		 */
 		if (active) {
-			if (!was_active
-			    && dst->outputs_active[src->output]++ == 0) {
-				DPRINTF
-				    ("%s: Raise OpenPIC output %d cpu %d irq %d\n",
-				     __func__, src->output, n_CPU, n_IRQ);
+			if (!was_active &&
+			    dst->outputs_active[src->output]++ == 0) {
+				pr_debug("%s: Raise OpenPIC output %d cpu %d irq %d\n",
+					__func__, src->output, n_CPU, n_IRQ);
 				qemu_irq_raise(dst->irqs[src->output]);
 			}
 		} else {
-			if (was_active
-			    && --dst->outputs_active[src->output] == 0) {
-				DPRINTF
-				    ("%s: Lower OpenPIC output %d cpu %d irq %d\n",
-				     __func__, src->output, n_CPU, n_IRQ);
+			if (was_active &&
+			    --dst->outputs_active[src->output] == 0) {
+				pr_debug("%s: Lower OpenPIC output %d cpu %d irq %d\n",
+					__func__, src->output, n_CPU, n_IRQ);
 				qemu_irq_lower(dst->irqs[src->output]);
 			}
 		}
@@ -311,31 +301,27 @@ static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
 	/* Even if the interrupt doesn't have enough priority,
 	 * it is still raised, in case ctpr is lowered later.
 	 */
-	if (active) {
+	if (active)
 		IRQ_setbit(&dst->raised, n_IRQ);
-	} else {
+	else
 		IRQ_resetbit(&dst->raised, n_IRQ);
-	}
 
 	IRQ_check(opp, &dst->raised);
 
 	if (active && priority <= dst->ctpr) {
-		DPRINTF
-		    ("%s: IRQ %d priority %d too low for ctpr %d on CPU %d\n",
-		     __func__, n_IRQ, priority, dst->ctpr, n_CPU);
+		pr_debug("%s: IRQ %d priority %d too low for ctpr %d on CPU %d\n",
+			__func__, n_IRQ, priority, dst->ctpr, n_CPU);
 		active = 0;
 	}
 
 	if (active) {
 		if (IRQ_get_next(opp, &dst->servicing) >= 0 &&
 		    priority <= dst->servicing.priority) {
-			DPRINTF
-			    ("%s: IRQ %d is hidden by servicing IRQ %d on CPU %d\n",
-			     __func__, n_IRQ, dst->servicing.next, n_CPU);
+			pr_debug("%s: IRQ %d is hidden by servicing IRQ %d on CPU %d\n",
+				__func__, n_IRQ, dst->servicing.next, n_CPU);
 		} else {
-			DPRINTF
-			    ("%s: Raise OpenPIC INT output cpu %d irq %d/%d\n",
-			     __func__, n_CPU, n_IRQ, dst->raised.next);
+			pr_debug("%s: Raise OpenPIC INT output cpu %d irq %d/%d\n",
+				__func__, n_CPU, n_IRQ, dst->raised.next);
 			qemu_irq_raise(opp->dst[n_CPU].
 				       irqs[OPENPIC_OUTPUT_INT]);
 		}
@@ -343,17 +329,15 @@ static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
 		IRQ_get_next(opp, &dst->servicing);
 		if (dst->raised.priority > dst->ctpr &&
 		    dst->raised.priority > dst->servicing.priority) {
-			DPRINTF
-			    ("%s: IRQ %d inactive, IRQ %d prio %d above %d/%d, CPU %d\n",
-			     __func__, n_IRQ, dst->raised.next,
-			     dst->raised.priority, dst->ctpr,
-			     dst->servicing.priority, n_CPU);
+			pr_debug("%s: IRQ %d inactive, IRQ %d prio %d above %d/%d, CPU %d\n",
+				__func__, n_IRQ, dst->raised.next,
+				dst->raised.priority, dst->ctpr,
+				dst->servicing.priority, n_CPU);
 			/* IRQ line stays asserted */
 		} else {
-			DPRINTF
-			    ("%s: IRQ %d inactive, current prio %d/%d, CPU %d\n",
-			     __func__, n_IRQ, dst->ctpr,
-			     dst->servicing.priority, n_CPU);
+			pr_debug("%s: IRQ %d inactive, current prio %d/%d, CPU %d\n",
+				__func__, n_IRQ, dst->ctpr,
+				dst->servicing.priority, n_CPU);
 			qemu_irq_lower(opp->dst[n_CPU].
 				       irqs[OPENPIC_OUTPUT_INT]);
 		}
@@ -361,9 +345,9 @@ static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
 }
 
 /* update pic state because registers for n_IRQ have changed value */
-static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
+static void openpic_update_irq(struct openpic *opp, int n_IRQ)
 {
-	IRQSource *src;
+	struct irq_source *src;
 	bool active, was_active;
 	int i;
 
@@ -372,30 +356,29 @@ static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
 
 	if ((src->ivpr & IVPR_MASK_MASK) && !src->nomask) {
 		/* Interrupt source is disabled */
-		DPRINTF("%s: IRQ %d is disabled\n", __func__, n_IRQ);
+		pr_debug("%s: IRQ %d is disabled\n", __func__, n_IRQ);
 		active = false;
 	}
 
-	was_active = ! !(src->ivpr & IVPR_ACTIVITY_MASK);
+	was_active = !!(src->ivpr & IVPR_ACTIVITY_MASK);
 
 	/*
 	 * We don't have a similar check for already-active because
 	 * ctpr may have changed and we need to withdraw the interrupt.
 	 */
 	if (!active && !was_active) {
-		DPRINTF("%s: IRQ %d is already inactive\n", __func__, n_IRQ);
+		pr_debug("%s: IRQ %d is already inactive\n", __func__, n_IRQ);
 		return;
 	}
 
-	if (active) {
+	if (active)
 		src->ivpr |= IVPR_ACTIVITY_MASK;
-	} else {
+	else
 		src->ivpr &= ~IVPR_ACTIVITY_MASK;
-	}
 
 	if (src->destmask == 0) {
 		/* No target */
-		DPRINTF("%s: IRQ %d has no target\n", __func__, n_IRQ);
+		pr_debug("%s: IRQ %d has no target\n", __func__, n_IRQ);
 		return;
 	}
 
@@ -413,9 +396,9 @@ static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
 	} else {
 		/* Distributed delivery mode */
 		for (i = src->last_cpu + 1; i != src->last_cpu; i++) {
-			if (i == opp->nb_cpus) {
+			if (i == opp->nb_cpus)
 				i = 0;
-			}
+
 			if (src->destmask & (1 << i)) {
 				IRQ_local_pipe(opp, i, n_IRQ, active,
 					       was_active);
@@ -428,16 +411,16 @@ static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
 
 static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 {
-	OpenPICState *opp = opaque;
-	IRQSource *src;
+	struct openpic *opp = opaque;
+	struct irq_source *src;
 
 	if (n_IRQ >= MAX_IRQ) {
-		fprintf(stderr, "%s: IRQ %d out of range\n", __func__, n_IRQ);
+		pr_err("%s: IRQ %d out of range\n", __func__, n_IRQ);
 		abort();
 	}
 
 	src = &opp->src[n_IRQ];
-	DPRINTF("openpic: set irq %d = %d ivpr=0x%08x\n",
+	pr_debug("openpic: set irq %d = %d ivpr=0x%08x\n",
 		n_IRQ, level, src->ivpr);
 	if (src->level) {
 		/* level-sensitive irq */
@@ -463,9 +446,9 @@ static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 	}
 }
 
-static void openpic_reset(DeviceState * d)
+static void openpic_reset(DeviceState *d)
 {
-	OpenPICState *opp = FROM_SYSBUS(typeof(*opp), SYS_BUS_DEVICE(d));
+	struct openpic *opp = FROM_SYSBUS(typeof(*opp), SYS_BUS_DEVICE(d));
 	int i;
 
 	opp->gcr = GCR_RESET;
@@ -485,7 +468,7 @@ static void openpic_reset(DeviceState * d)
 		switch (opp->src[i].type) {
 		case IRQ_TYPE_NORMAL:
 			opp->src[i].level =
-			    ! !(opp->ivpr_reset & IVPR_SENSE_MASK);
+			    !!(opp->ivpr_reset & IVPR_SENSE_MASK);
 			break;
 
 		case IRQ_TYPE_FSLINT:
@@ -499,9 +482,9 @@ static void openpic_reset(DeviceState * d)
 	/* Initialise IRQ destinations */
 	for (i = 0; i < MAX_CPU; i++) {
 		opp->dst[i].ctpr = 15;
-		memset(&opp->dst[i].raised, 0, sizeof(IRQQueue));
+		memset(&opp->dst[i].raised, 0, sizeof(struct irq_queue));
 		opp->dst[i].raised.next = -1;
-		memset(&opp->dst[i].servicing, 0, sizeof(IRQQueue));
+		memset(&opp->dst[i].servicing, 0, sizeof(struct irq_queue));
 		opp->dst[i].servicing.next = -1;
 	}
 	/* Initialise timers */
@@ -513,28 +496,28 @@ static void openpic_reset(DeviceState * d)
 	opp->gcr = 0;
 }
 
-static inline uint32_t read_IRQreg_idr(OpenPICState * opp, int n_IRQ)
+static inline uint32_t read_IRQreg_idr(struct openpic *opp, int n_IRQ)
 {
 	return opp->src[n_IRQ].idr;
 }
 
-static inline uint32_t read_IRQreg_ilr(OpenPICState * opp, int n_IRQ)
+static inline uint32_t read_IRQreg_ilr(struct openpic *opp, int n_IRQ)
 {
-	if (opp->flags & OPENPIC_FLAG_ILR) {
+	if (opp->flags & OPENPIC_FLAG_ILR)
 		return output_to_inttgt(opp->src[n_IRQ].output);
-	}
 
 	return 0xffffffff;
 }
 
-static inline uint32_t read_IRQreg_ivpr(OpenPICState * opp, int n_IRQ)
+static inline uint32_t read_IRQreg_ivpr(struct openpic *opp, int n_IRQ)
 {
 	return opp->src[n_IRQ].ivpr;
 }
 
-static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
+static inline void write_IRQreg_idr(struct openpic *opp, int n_IRQ,
+				    uint32_t val)
 {
-	IRQSource *src = &opp->src[n_IRQ];
+	struct irq_source *src = &opp->src[n_IRQ];
 	uint32_t normal_mask = (1UL << opp->nb_cpus) - 1;
 	uint32_t crit_mask = 0;
 	uint32_t mask = normal_mask;
@@ -547,14 +530,13 @@ static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
 	}
 
 	src->idr = val & mask;
-	DPRINTF("Set IDR %d to 0x%08x\n", n_IRQ, src->idr);
+	pr_debug("Set IDR %d to 0x%08x\n", n_IRQ, src->idr);
 
 	if (opp->flags & OPENPIC_FLAG_IDR_CRIT) {
 		if (src->idr & crit_mask) {
 			if (src->idr & normal_mask) {
-				DPRINTF
-				    ("%s: IRQ configured for multiple output types, using "
-				     "critical\n", __func__);
+				pr_debug("%s: IRQ configured for multiple output types, using critical\n",
+					__func__);
 			}
 
 			src->output = OPENPIC_OUTPUT_CINT;
@@ -564,9 +546,8 @@ static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
 			for (i = 0; i < opp->nb_cpus; i++) {
 				int n_ci = IDR_CI0_SHIFT - i;
 
-				if (src->idr & (1UL << n_ci)) {
+				if (src->idr & (1UL << n_ci))
 					src->destmask |= 1UL << i;
-				}
 			}
 		} else {
 			src->output = OPENPIC_OUTPUT_INT;
@@ -578,20 +559,21 @@ static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
 	}
 }
 
-static inline void write_IRQreg_ilr(OpenPICState * opp, int n_IRQ, uint32_t val)
+static inline void write_IRQreg_ilr(struct openpic *opp, int n_IRQ,
+				    uint32_t val)
 {
 	if (opp->flags & OPENPIC_FLAG_ILR) {
-		IRQSource *src = &opp->src[n_IRQ];
+		struct irq_source *src = &opp->src[n_IRQ];
 
 		src->output = inttgt_to_output(val & ILR_INTTGT_MASK);
-		DPRINTF("Set ILR %d to 0x%08x, output %d\n", n_IRQ, src->idr,
+		pr_debug("Set ILR %d to 0x%08x, output %d\n", n_IRQ, src->idr,
 			src->output);
 
 		/* TODO: on MPIC v4.0 only, set nomask for non-INT */
 	}
 }
 
-static inline void write_IRQreg_ivpr(OpenPICState * opp, int n_IRQ,
+static inline void write_IRQreg_ivpr(struct openpic *opp, int n_IRQ,
 				     uint32_t val)
 {
 	uint32_t mask;
@@ -613,7 +595,7 @@ static inline void write_IRQreg_ivpr(OpenPICState * opp, int n_IRQ,
 	switch (opp->src[n_IRQ].type) {
 	case IRQ_TYPE_NORMAL:
 		opp->src[n_IRQ].level =
-		    ! !(opp->src[n_IRQ].ivpr & IVPR_SENSE_MASK);
+		    !!(opp->src[n_IRQ].ivpr & IVPR_SENSE_MASK);
 		break;
 
 	case IRQ_TYPE_FSLINT:
@@ -626,11 +608,11 @@ static inline void write_IRQreg_ivpr(OpenPICState * opp, int n_IRQ,
 	}
 
 	openpic_update_irq(opp, n_IRQ);
-	DPRINTF("Set IVPR %d to 0x%08x -> 0x%08x\n", n_IRQ, val,
+	pr_debug("Set IVPR %d to 0x%08x -> 0x%08x\n", n_IRQ, val,
 		opp->src[n_IRQ].ivpr);
 }
 
-static void openpic_gcr_write(OpenPICState * opp, uint64_t val)
+static void openpic_gcr_write(struct openpic *opp, uint64_t val)
 {
 	bool mpic_proxy = false;
 
@@ -643,27 +625,26 @@ static void openpic_gcr_write(OpenPICState * opp, uint64_t val)
 	opp->gcr |= val & opp->mpic_mode_mask;
 
 	/* Set external proxy mode */
-	if ((val & opp->mpic_mode_mask) == GCR_MODE_PROXY) {
+	if ((val & opp->mpic_mode_mask) == GCR_MODE_PROXY)
 		mpic_proxy = true;
-	}
 
 	ppce500_set_mpic_proxy(mpic_proxy);
 }
 
-static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_gbl_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned len)
 {
-	OpenPICState *opp = opaque;
-	IRQDest *dst;
+	struct openpic *opp = opaque;
+	struct irq_dest *dst;
 	int idx;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
 		__func__, addr, val);
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return;
-	}
+
 	switch (addr) {
-	case 0x00:		/* Block Revision Register1 (BRR1) is Readonly */
+	case 0x00:	/* Block Revision Register1 (BRR1) is Readonly */
 		break;
 	case 0x40:
 	case 0x50:
@@ -685,16 +666,14 @@ static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
 	case 0x1090:		/* PIR */
 		for (idx = 0; idx < opp->nb_cpus; idx++) {
 			if ((val & (1 << idx)) && !(opp->pir & (1 << idx))) {
-				DPRINTF
-				    ("Raise OpenPIC RESET output for CPU %d\n",
-				     idx);
+				pr_debug("Raise OpenPIC RESET output for CPU %d\n",
+					idx);
 				dst = &opp->dst[idx];
 				qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_RESET]);
-			} else if (!(val & (1 << idx))
-				   && (opp->pir & (1 << idx))) {
-				DPRINTF
-				    ("Lower OpenPIC RESET output for CPU %d\n",
-				     idx);
+			} else if (!(val & (1 << idx)) &&
+				   (opp->pir & (1 << idx))) {
+				pr_debug("Lower OpenPIC RESET output for CPU %d\n",
+					idx);
 				dst = &opp->dst[idx];
 				qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_RESET]);
 			}
@@ -704,13 +683,12 @@ static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
 	case 0x10A0:		/* IPI_IVPR */
 	case 0x10B0:
 	case 0x10C0:
-	case 0x10D0:
-		{
-			int idx;
-			idx = (addr - 0x10A0) >> 4;
-			write_IRQreg_ivpr(opp, opp->irq_ipi0 + idx, val);
-		}
+	case 0x10D0: {
+		int idx;
+		idx = (addr - 0x10A0) >> 4;
+		write_IRQreg_ivpr(opp, opp->irq_ipi0 + idx, val);
 		break;
+	}
 	case 0x10E0:		/* SPVE */
 		opp->spve = val & opp->vector_mask;
 		break;
@@ -719,16 +697,16 @@ static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
 	}
 }
 
-static uint64_t openpic_gbl_read(void *opaque, hwaddr addr, unsigned len)
+static uint64_t openpic_gbl_read(void *opaque, gpa_t addr, unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	uint32_t retval;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
 	retval = 0xFFFFFFFF;
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return retval;
-	}
+
 	switch (addr) {
 	case 0x1000:		/* FRR */
 		retval = opp->frr;
@@ -772,24 +750,23 @@ static uint64_t openpic_gbl_read(void *opaque, hwaddr addr, unsigned len)
 	default:
 		break;
 	}
-	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
 	return retval;
 }
 
-static void openpic_tmr_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_tmr_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	int idx;
 
 	addr += 0x10f0;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
 		__func__, addr, val);
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return;
-	}
 
 	if (addr == 0x10f0) {
 		/* TFRR */
@@ -806,9 +783,9 @@ static void openpic_tmr_write(void *opaque, hwaddr addr, uint64_t val,
 	case 0x10:		/* TBCR */
 		if ((opp->timers[idx].tccr & TCCR_TOG) != 0 &&
 		    (val & TBCR_CI) == 0 &&
-		    (opp->timers[idx].tbcr & TBCR_CI) != 0) {
+		    (opp->timers[idx].tbcr & TBCR_CI) != 0)
 			opp->timers[idx].tccr &= ~TCCR_TOG;
-		}
+
 		opp->timers[idx].tbcr = val;
 		break;
 	case 0x20:		/* TVPR */
@@ -820,16 +797,16 @@ static void openpic_tmr_write(void *opaque, hwaddr addr, uint64_t val,
 	}
 }
 
-static uint64_t openpic_tmr_read(void *opaque, hwaddr addr, unsigned len)
+static uint64_t openpic_tmr_read(void *opaque, gpa_t addr, unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	uint32_t retval = -1;
 	int idx;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
-	if (addr & 0xF) {
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	if (addr & 0xF)
 		goto out;
-	}
+
 	idx = (addr >> 6) & 0x3;
 	if (addr == 0x0) {
 		/* TFRR */
@@ -852,18 +829,18 @@ static uint64_t openpic_tmr_read(void *opaque, hwaddr addr, unsigned len)
 	}
 
 out:
-	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
 	return retval;
 }
 
-static void openpic_src_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_src_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	int idx;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
 		__func__, addr, val);
 
 	addr = addr & 0xffff;
@@ -884,11 +861,11 @@ static void openpic_src_write(void *opaque, hwaddr addr, uint64_t val,
 
 static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	uint32_t retval;
 	int idx;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
 	retval = 0xFFFFFFFF;
 
 	addr = addr & 0xffff;
@@ -906,22 +883,21 @@ static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
 		break;
 	}
 
-	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
 	return retval;
 }
 
-static void openpic_msi_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_msi_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned size)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	int idx = opp->irq_msi;
 	int srs, ibs;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
 		__func__, addr, val);
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return;
-	}
 
 	switch (addr) {
 	case MSIIR_OFFSET:
@@ -937,16 +913,15 @@ static void openpic_msi_write(void *opaque, hwaddr addr, uint64_t val,
 	}
 }
 
-static uint64_t openpic_msi_read(void *opaque, hwaddr addr, unsigned size)
+static uint64_t openpic_msi_read(void *opaque, gpa_t addr, unsigned size)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	uint64_t r = 0;
 	int i, srs;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
-	if (addr & 0xF) {
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	if (addr & 0xF)
 		return -1;
-	}
 
 	srs = addr >> 4;
 
@@ -965,53 +940,51 @@ static uint64_t openpic_msi_read(void *opaque, hwaddr addr, unsigned size)
 		openpic_set_irq(opp, opp->irq_msi + srs, 0);
 		break;
 	case 0x120:		/* MSISR */
-		for (i = 0; i < MAX_MSI; i++) {
+		for (i = 0; i < MAX_MSI; i++)
 			r |= (opp->msi[i].msir ? 1 : 0) << i;
-		}
 		break;
 	}
 
 	return r;
 }
 
-static uint64_t openpic_summary_read(void *opaque, hwaddr addr, unsigned size)
+static uint64_t openpic_summary_read(void *opaque, gpa_t addr, unsigned size)
 {
 	uint64_t r = 0;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
 
 	/* TODO: EISR/EIMR */
 
 	return r;
 }
 
-static void openpic_summary_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_summary_write(void *opaque, gpa_t addr, uint64_t val,
 				  unsigned size)
 {
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
 		__func__, addr, val);
 
 	/* TODO: EISR/EIMR */
 }
 
-static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
+static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
 				       uint32_t val, int idx)
 {
-	OpenPICState *opp = opaque;
-	IRQSource *src;
-	IRQDest *dst;
+	struct openpic *opp = opaque;
+	struct irq_source *src;
+	struct irq_dest *dst;
 	int s_IRQ, n_IRQ;
 
-	DPRINTF("%s: cpu %d addr %#" HWADDR_PRIx " <= 0x%08x\n", __func__, idx,
+	pr_debug("%s: cpu %d addr %#" HWADDR_PRIx " <= 0x%08x\n", __func__, idx,
 		addr, val);
 
-	if (idx < 0) {
+	if (idx < 0)
 		return;
-	}
 
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return;
-	}
+
 	dst = &opp->dst[idx];
 	addr &= 0xFF0;
 	switch (addr) {
@@ -1028,17 +1001,16 @@ static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
 	case 0x80:		/* CTPR */
 		dst->ctpr = val & 0x0000000F;
 
-		DPRINTF("%s: set CPU %d ctpr to %d, raised %d servicing %d\n",
+		pr_debug("%s: set CPU %d ctpr to %d, raised %d servicing %d\n",
 			__func__, idx, dst->ctpr, dst->raised.priority,
 			dst->servicing.priority);
 
 		if (dst->raised.priority <= dst->ctpr) {
-			DPRINTF
-			    ("%s: Lower OpenPIC INT output cpu %d due to ctpr\n",
-			     __func__, idx);
+			pr_debug("%s: Lower OpenPIC INT output cpu %d due to ctpr\n",
+				__func__, idx);
 			qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
 		} else if (dst->raised.priority > dst->servicing.priority) {
-			DPRINTF("%s: Raise OpenPIC INT output cpu %d irq %d\n",
+			pr_debug("%s: Raise OpenPIC INT output cpu %d irq %d\n",
 				__func__, idx, dst->raised.next);
 			qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_INT]);
 		}
@@ -1051,11 +1023,11 @@ static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
 		/* Read-only register */
 		break;
 	case 0xB0:		/* EOI */
-		DPRINTF("EOI\n");
+		pr_debug("EOI\n");
 		s_IRQ = IRQ_get_next(opp, &dst->servicing);
 
 		if (s_IRQ < 0) {
-			DPRINTF("%s: EOI with no interrupt in service\n",
+			pr_debug("%s: EOI with no interrupt in service\n",
 				__func__);
 			break;
 		}
@@ -1069,7 +1041,7 @@ static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
 		if (n_IRQ != -1 &&
 		    (s_IRQ == -1 ||
 		     IVPR_PRIORITY(src->ivpr) > dst->servicing.priority)) {
-			DPRINTF("Raise OpenPIC INT output cpu %d irq %d\n",
+			pr_debug("Raise OpenPIC INT output cpu %d irq %d\n",
 				idx, n_IRQ);
 			qemu_irq_raise(opp->dst[idx].irqs[OPENPIC_OUTPUT_INT]);
 		}
@@ -1079,32 +1051,32 @@ static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
 	}
 }
 
-static void openpic_cpu_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_cpu_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned len)
 {
 	openpic_cpu_write_internal(opaque, addr, val, (addr & 0x1f000) >> 12);
 }
 
-static uint32_t openpic_iack(OpenPICState * opp, IRQDest * dst, int cpu)
+static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
+			     int cpu)
 {
-	IRQSource *src;
+	struct irq_source *src;
 	int retval, irq;
 
-	DPRINTF("Lower OpenPIC INT output\n");
+	pr_debug("Lower OpenPIC INT output\n");
 	qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
 
 	irq = IRQ_get_next(opp, &dst->raised);
-	DPRINTF("IACK: irq=%d\n", irq);
+	pr_debug("IACK: irq=%d\n", irq);
 
-	if (irq == -1) {
+	if (irq == -1)
 		/* No more interrupt pending */
 		return opp->spve;
-	}
 
 	src = &opp->src[irq];
 	if (!(src->ivpr & IVPR_ACTIVITY_MASK) ||
 	    !(IVPR_PRIORITY(src->ivpr) > dst->ctpr)) {
-		fprintf(stderr, "%s: bad raised IRQ %d ctpr %d ivpr 0x%08x\n",
+		pr_err("%s: bad raised IRQ %d ctpr %d ivpr 0x%08x\n",
 			__func__, irq, dst->ctpr, src->ivpr);
 		openpic_update_irq(opp, irq);
 		retval = opp->spve;
@@ -1135,22 +1107,21 @@ static uint32_t openpic_iack(OpenPICState * opp, IRQDest * dst, int cpu)
 	return retval;
 }
 
-static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx)
+static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx)
 {
-	OpenPICState *opp = opaque;
-	IRQDest *dst;
+	struct openpic *opp = opaque;
+	struct irq_dest *dst;
 	uint32_t retval;
 
-	DPRINTF("%s: cpu %d addr %#" HWADDR_PRIx "\n", __func__, idx, addr);
+	pr_debug("%s: cpu %d addr %#" HWADDR_PRIx "\n", __func__, idx, addr);
 	retval = 0xFFFFFFFF;
 
-	if (idx < 0) {
+	if (idx < 0)
 		return retval;
-	}
 
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return retval;
-	}
+
 	dst = &opp->dst[idx];
 	addr &= 0xFF0;
 	switch (addr) {
@@ -1169,54 +1140,54 @@ static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx)
 	default:
 		break;
 	}
-	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
 	return retval;
 }
 
-static uint64_t openpic_cpu_read(void *opaque, hwaddr addr, unsigned len)
+static uint64_t openpic_cpu_read(void *opaque, gpa_t addr, unsigned len)
 {
 	return openpic_cpu_read_internal(opaque, addr, (addr & 0x1f000) >> 12);
 }
 
-static const MemoryRegionOps openpic_glb_ops_be = {
+static const struct kvm_io_device_ops openpic_glb_ops_be = {
 	.write = openpic_gbl_write,
 	.read = openpic_gbl_read,
 };
 
-static const MemoryRegionOps openpic_tmr_ops_be = {
+static const struct kvm_io_device_ops openpic_tmr_ops_be = {
 	.write = openpic_tmr_write,
 	.read = openpic_tmr_read,
 };
 
-static const MemoryRegionOps openpic_cpu_ops_be = {
+static const struct kvm_io_device_ops openpic_cpu_ops_be = {
 	.write = openpic_cpu_write,
 	.read = openpic_cpu_read,
 };
 
-static const MemoryRegionOps openpic_src_ops_be = {
+static const struct kvm_io_device_ops openpic_src_ops_be = {
 	.write = openpic_src_write,
 	.read = openpic_src_read,
 };
 
-static const MemoryRegionOps openpic_msi_ops_be = {
+static const struct kvm_io_device_ops openpic_msi_ops_be = {
 	.read = openpic_msi_read,
 	.write = openpic_msi_write,
 };
 
-static const MemoryRegionOps openpic_summary_ops_be = {
+static const struct kvm_io_device_ops openpic_summary_ops_be = {
 	.read = openpic_summary_read,
 	.write = openpic_summary_write,
 };
 
-typedef struct MemReg {
+struct mem_reg {
 	const char *name;
-	MemoryRegionOps const *ops;
-	hwaddr start_addr;
-	ram_addr_t size;
-} MemReg;
+	const struct kvm_io_device_ops *ops;
+	gpa_t start_addr;
+	int size;
+};
 
-static void fsl_common_init(OpenPICState * opp)
+static void fsl_common_init(struct openpic *opp)
 {
 	int i;
 	int virq = MAX_SRC;
@@ -1239,9 +1210,8 @@ static void fsl_common_init(OpenPICState * opp)
 	opp->irq_msi = 224;
 
 	msi_supported = true;
-	for (i = 0; i < opp->fsl->max_ext; i++) {
+	for (i = 0; i < opp->fsl->max_ext; i++)
 		opp->src[i].level = false;
-	}
 
 	/* Internal interrupts, including message and MSI */
 	for (i = 16; i < MAX_SRC; i++) {
@@ -1256,7 +1226,8 @@ static void fsl_common_init(OpenPICState * opp)
 	}
 }
 
-static void map_list(OpenPICState * opp, const MemReg * list, int *count)
+static void map_list(struct openpic *opp, const struct mem_reg *list,
+		     int *count)
 {
 	while (list->name) {
 		assert(*count < ARRAY_SIZE(opp->sub_io_mem));
@@ -1272,12 +1243,12 @@ static void map_list(OpenPICState * opp, const MemReg * list, int *count)
 	}
 }
 
-static int openpic_init(SysBusDevice * dev)
+static int openpic_init(SysBusDevice *dev)
 {
-	OpenPICState *opp = FROM_SYSBUS(typeof(*opp), dev);
+	struct openpic *opp = FROM_SYSBUS(typeof(*opp), dev);
 	int i, j;
 	int list_count = 0;
-	static const MemReg list_le[] = {
+	static const struct mem_reg list_le[] = {
 		{"glb", &openpic_glb_ops_le,
 		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
 		{"tmr", &openpic_tmr_ops_le,
@@ -1288,7 +1259,7 @@ static int openpic_init(SysBusDevice * dev)
 		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
 		{NULL}
 	};
-	static const MemReg list_be[] = {
+	static const struct mem_reg list_be[] = {
 		{"glb", &openpic_glb_ops_be,
 		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
 		{"tmr", &openpic_tmr_ops_be,
@@ -1299,7 +1270,7 @@ static int openpic_init(SysBusDevice * dev)
 		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
 		{NULL}
 	};
-	static const MemReg list_fsl[] = {
+	static const struct mem_reg list_fsl[] = {
 		{"msi", &openpic_msi_ops_be,
 		 OPENPIC_MSI_REG_START, OPENPIC_MSI_REG_SIZE},
 		{"summary", &openpic_summary_ops_be,
-- 
1.7.9.5

* [RFC PATCH v3 5/6] kvm/ppc/mpic: in-kernel MPIC emulation
  2013-04-03  1:57     ` Scott Wood
@ 2013-04-03  1:57       ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-03  1:57 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus, Scott Wood

Hook the MPIC code up to the KVM interfaces, add locking, etc.

TODO: irqfd support, split up into multiple patches, KVM_IRQ_LINE
support

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
v3: mpic_put -> kvmppc_mpic_put
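
As a reading aid for the mpic.txt text added by this patch, here is a rough
userspace-side sketch of the three numeric rules it states: the base address
must be naturally aligned to the 256 KiB register space (zero disables the
mapping), register accesses must be 4-byte aligned, and the IRQ number of a
standard source is the byte offset of its IVPR from EIVPR0 divided by 32.
The macro and helper names are made up for illustration; they are not part
of this patch or of any KVM header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Size of the MPIC register space: 256 KiB, as stated in mpic.txt. */
#define MPIC_REG_SIZE		0x40000ULL

/* Byte spacing between consecutive IVPRs: 32, as stated in mpic.txt. */
#define MPIC_IVPR_STRIDE	32u

/*
 * Hypothetical helper: true if 'base' is naturally aligned and so
 * acceptable as a KVM_DEV_MPIC_BASE_ADDR value.  Zero passes the
 * check; per mpic.txt it means "mapping disabled".
 */
static bool mpic_base_addr_ok(uint64_t base)
{
	return (base & (MPIC_REG_SIZE - 1)) == 0;
}

/*
 * Hypothetical helper: true if 'attr' is a 4-byte aligned byte offset,
 * as KVM_DEV_MPIC_GRP_REGISTER requires.
 */
static bool mpic_reg_attr_ok(uint64_t attr)
{
	return (attr & 3) == 0;
}

/*
 * Hypothetical helper: IRQ number ("attr" for
 * KVM_DEV_MPIC_GRP_IRQ_ACTIVE) of a standard source, given the byte
 * offset of its IVPR from EIVPR0.
 */
static uint32_t mpic_irq_from_ivpr_offset(uint32_t off)
{
	/* IVPRs sit on 32-byte boundaries; anything else is a caller bug. */
	assert(off % MPIC_IVPR_STRIDE == 0);
	return off / MPIC_IVPR_STRIDE;
}
```

For example, an IVPR at byte offset 0x40 from EIVPR0 corresponds to IRQ 2,
and a base of 0xfff40000 passes the alignment check while 0xfff41000 does
not.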

 Documentation/virtual/kvm/devices/mpic.txt |   37 ++
 arch/powerpc/include/asm/kvm_host.h        |    8 +-
 arch/powerpc/include/asm/kvm_ppc.h         |    7 +
 arch/powerpc/kvm/Kconfig                   |    5 +
 arch/powerpc/kvm/Makefile                  |    2 +
 arch/powerpc/kvm/booke.c                   |   10 +-
 arch/powerpc/kvm/mpic.c                    |  814 +++++++++++++++++++++-------
 arch/powerpc/kvm/powerpc.c                 |   12 +-
 include/linux/kvm_host.h                   |    2 +
 include/uapi/linux/kvm.h                   |    9 +
 virt/kvm/kvm_main.c                        |    9 +
 11 files changed, 714 insertions(+), 201 deletions(-)
 create mode 100644 Documentation/virtual/kvm/devices/mpic.txt

diff --git a/Documentation/virtual/kvm/devices/mpic.txt b/Documentation/virtual/kvm/devices/mpic.txt
new file mode 100644
index 0000000..79e000a
--- /dev/null
+++ b/Documentation/virtual/kvm/devices/mpic.txt
@@ -0,0 +1,37 @@
+MPIC interrupt controller
+=========================
+
+Device types supported:
+  KVM_DEV_TYPE_FSL_MPIC_20     Freescale MPIC v2.0
+  KVM_DEV_TYPE_FSL_MPIC_42     Freescale MPIC v4.2
+
+Only one MPIC instance, of any type, may be instantiated.  The created
+MPIC will act as the system interrupt controller, connecting to each
+vcpu's interrupt inputs.
+
+Groups:
+  KVM_DEV_MPIC_GRP_MISC
+  Attributes:
+    KVM_DEV_MPIC_BASE_ADDR (rw, 64-bit)
+      Base address of the 256 KiB MPIC register space.  Must be
+      naturally aligned.  A value of zero disables the mapping.
+      Reset value is zero.
+
+  KVM_DEV_MPIC_GRP_REGISTER (rw, 32-bit)
+    Access an MPIC register, as if the access were made from the guest.
+    "attr" is the byte offset into the MPIC register space.  Accesses
+    must be 4-byte aligned.
+
+    MSIs may be signaled by using this attribute group to write
+    to the relevant MSIIR.
+
+  KVM_DEV_MPIC_GRP_IRQ_ACTIVE (rw, 32-bit)
+    IRQ input line for each standard openpic source.  0 is inactive and 1
+    is active, regardless of interrupt sense.
+
+    For edge-triggered interrupts:  Writing 1 is considered an activating
+    edge, and writing 0 is ignored.  Reading returns 1 if a previously
+    signaled edge has not been acknowledged, and 0 otherwise.
+
+    "attr" is the IRQ number.  IRQ numbers for standard sources are the
+    byte offset of the relevant IVPR from EIVPR0, divided by 32.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index e34f8fe..7e7aef9 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -359,6 +359,11 @@ struct kvmppc_slb {
 #define KVMPPC_BOOKE_MAX_IAC	4
 #define KVMPPC_BOOKE_MAX_DAC	2
 
+/* KVMPPC_EPR_USER takes precedence over KVMPPC_EPR_KERNEL */
+#define KVMPPC_EPR_NONE		0 /* EPR not supported */
+#define KVMPPC_EPR_USER		1 /* exit to userspace to fill EPR */
+#define KVMPPC_EPR_KERNEL	2 /* in-kernel irqchip */
+
 struct kvmppc_booke_debug_reg {
 	u32 dbcr0;
 	u32 dbcr1;
@@ -522,7 +527,7 @@ struct kvm_vcpu_arch {
 	u8 sane;
 	u8 cpu_type;
 	u8 hcall_needed;
-	u8 epr_enabled;
+	u8 epr_flags; /* KVMPPC_EPR_xxx */
 	u8 epr_needed;
 
 	u32 cpr0_cfgaddr; /* holds the last set cpr0_cfgaddr */
@@ -589,5 +594,6 @@ struct kvm_vcpu_arch {
 #define KVM_MMIO_REG_FQPR	0x0060
 
 #define __KVM_HAVE_ARCH_WQP
+#define __KVM_HAVE_CREATE_DEVICE
 
 #endif /* __POWERPC_KVM_HOST_H__ */
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index f589307..3b63b97 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -164,6 +164,8 @@ extern int kvmppc_prepare_to_enter(struct kvm_vcpu *vcpu);
 
 extern int kvm_vm_ioctl_get_htab_fd(struct kvm *kvm, struct kvm_get_htab_fd *);
 
+int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu, struct kvm_interrupt *irq);
+
 /*
  * Cuts out inst bits with ordering according to spec.
  * That means the leftmost bit is zero. All given bits are included.
@@ -245,6 +247,9 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id, union kvmppc_one_reg *);
 
 void kvmppc_set_pid(struct kvm_vcpu *vcpu, u32 pid);
 
+struct openpic;
+void kvmppc_mpic_put(struct openpic *opp);
+
 #ifdef CONFIG_KVM_BOOK3S_64_HV
 static inline void kvmppc_set_xics_phys(int cpu, unsigned long addr)
 {
@@ -270,6 +275,8 @@ static inline void kvmppc_set_epr(struct kvm_vcpu *vcpu, u32 epr)
 #endif
 }
 
+void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu);
+
 int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
 			      struct kvm_config_tlb *cfg);
 int kvm_vcpu_ioctl_dirty_tlb(struct kvm_vcpu *vcpu,
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 63c67ec..a87139b 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -151,6 +151,11 @@ config KVM_E500MC
 
 	  If unsure, say N.
 
+config KVM_MPIC
+	bool "KVM in-kernel MPIC emulation"
+	depends on KVM
+
+
 source drivers/vhost/Kconfig
 
 endif # VIRTUALIZATION
diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
index b772ede..4a2277a 100644
--- a/arch/powerpc/kvm/Makefile
+++ b/arch/powerpc/kvm/Makefile
@@ -103,6 +103,8 @@ kvm-book3s_32-objs := \
 	book3s_32_mmu.o
 kvm-objs-$(CONFIG_KVM_BOOK3S_32) := $(kvm-book3s_32-objs)
 
+kvm-objs-$(CONFIG_KVM_MPIC) += mpic.o
+
 kvm-objs := $(kvm-objs-m) $(kvm-objs-y)
 
 obj-$(CONFIG_KVM_440) += kvm.o
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 58057d6..cddc6b3 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -346,7 +346,7 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu *vcpu,
 		keep_irq = true;
 	}
 
-	if ((priority == BOOKE_IRQPRIO_EXTERNAL) && vcpu->arch.epr_enabled)
+	if ((priority == BOOKE_IRQPRIO_EXTERNAL) && vcpu->arch.epr_flags)
 		update_epr = true;
 
 	switch (priority) {
@@ -427,8 +427,12 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu *vcpu,
 			set_guest_esr(vcpu, vcpu->arch.queued_esr);
 		if (update_dear == true)
 			set_guest_dear(vcpu, vcpu->arch.queued_dear);
-		if (update_epr == true)
-			kvm_make_request(KVM_REQ_EPR_EXIT, vcpu);
+		if (update_epr == true) {
+			if (vcpu->arch.epr_flags & KVMPPC_EPR_USER)
+				kvm_make_request(KVM_REQ_EPR_EXIT, vcpu);
+			else if (vcpu->arch.epr_flags & KVMPPC_EPR_KERNEL)
+				kvmppc_mpic_set_epr(vcpu);
+		}
 
 		new_msr &= msr_mask;
 #if defined(CONFIG_64BIT)
diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index 1df67ae..8cda2fa 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -23,6 +23,19 @@
  * THE SOFTWARE.
  */
 
+#include <linux/slab.h>
+#include <linux/mutex.h>
+#include <linux/kvm_host.h>
+#include <linux/errno.h>
+#include <linux/fs.h>
+#include <linux/anon_inodes.h>
+#include <asm/uaccess.h>
+#include <asm/mpic.h>
+#include <asm/kvm_para.h>
+#include <asm/kvm_host.h>
+#include <asm/kvm_ppc.h>
+#include "iodev.h"
+
 #define MAX_CPU     32
 #define MAX_SRC     256
 #define MAX_TMR     4
@@ -36,6 +49,7 @@
 #define OPENPIC_FLAG_ILR          (2 << 0)
 
 /* OpenPIC address map */
+#define OPENPIC_REG_SIZE             0x40000
 #define OPENPIC_GLB_REG_START        0x0
 #define OPENPIC_GLB_REG_SIZE         0x10F0
 #define OPENPIC_TMR_REG_START        0x10F0
@@ -89,6 +103,7 @@ static struct fsl_mpic_info fsl_mpic_42 = {
 #define ILR_INTTGT_INT    0x00
 #define ILR_INTTGT_CINT   0x01	/* critical */
 #define ILR_INTTGT_MCP    0x02	/* machine check */
+#define NUM_OUTPUTS       3
 
 #define MSIIR_OFFSET       0x140
 #define MSIIR_SRS_SHIFT    29
@@ -98,18 +113,14 @@ static struct fsl_mpic_info fsl_mpic_42 = {
 
 static int get_current_cpu(void)
 {
-	CPUState *cpu_single_cpu;
-
-	if (!cpu_single_env)
-		return -1;
-
-	cpu_single_cpu = ENV_GET_CPU(cpu_single_env);
-	return cpu_single_cpu->cpu_index;
+	struct kvm_vcpu *vcpu = current->thread.kvm_vcpu;
+	return vcpu ? vcpu->vcpu_id : -1;
 }
 
-static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx);
-static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
-				       uint32_t val, int idx);
+static int openpic_cpu_write_internal(void *opaque, gpa_t addr,
+				      u32 val, int idx);
+static int openpic_cpu_read_internal(void *opaque, gpa_t addr,
+				     u32 *ptr, int idx);
 
 enum irq_type {
 	IRQ_TYPE_NORMAL = 0,
@@ -131,7 +142,7 @@ struct irq_source {
 	uint32_t idr;		/* IRQ destination register */
 	uint32_t destmask;	/* bitmap of CPU destinations */
 	int last_cpu;
-	int output;		/* IRQ level, e.g. OPENPIC_OUTPUT_INT */
+	int output;		/* IRQ level, e.g. ILR_INTTGT_INT */
 	int pending;		/* TRUE if IRQ is pending */
 	enum irq_type type;
 	bool level:1;		/* level-triggered */
@@ -158,16 +169,28 @@ struct irq_source {
 #define IDR_CI      0x40000000	/* critical interrupt */
 
 struct irq_dest {
+	struct kvm_vcpu *vcpu;
+
 	int32_t ctpr;		/* CPU current task priority */
 	struct irq_queue raised;
 	struct irq_queue servicing;
-	qemu_irq *irqs;
 
 	/* Count of IRQ sources asserting on non-INT outputs */
-	uint32_t outputs_active[OPENPIC_OUTPUT_NB];
+	uint32_t outputs_active[NUM_OUTPUTS];
 };
 
+struct openpic;
+
 struct openpic {
+	struct kvm *kvm;
+	struct kvm_io_device mmio;
+	struct list_head mmio_regions;
+	atomic_t users;
+	bool mmio_mapped;
+
+	gpa_t reg_base;
+	spinlock_t lock;
+
 	/* Behavior control */
 	struct fsl_mpic_info *fsl;
 	uint32_t model;
@@ -208,6 +231,47 @@ struct openpic {
 	uint32_t irq_msi;
 };
 
+
+static void mpic_irq_raise(struct openpic *opp, struct irq_dest *dst,
+			   int output)
+{
+	struct kvm_interrupt irq = {
+		.irq = KVM_INTERRUPT_SET_LEVEL,
+	};
+
+	if (!dst->vcpu) {
+		pr_debug("%s: destination cpu %d does not exist\n",
+			 __func__, dst - &opp->dst[0]);
+		return;
+	}
+
+	pr_debug("%s: cpu %d output %d\n", __func__, dst->vcpu->vcpu_id,
+		output);
+
+	if (output != ILR_INTTGT_INT)	/* TODO */
+		return;
+
+	kvm_vcpu_ioctl_interrupt(dst->vcpu, &irq);
+}
+
+static void mpic_irq_lower(struct openpic *opp, struct irq_dest *dst,
+			   int output)
+{
+	if (!dst->vcpu) {
+		pr_debug("%s: destination cpu %d does not exist\n",
+			 __func__, dst - &opp->dst[0]);
+		return;
+	}
+
+	pr_debug("%s: cpu %d output %d\n", __func__, dst->vcpu->vcpu_id,
+		output);
+
+	if (output != ILR_INTTGT_INT)	/* TODO */
+		return;
+
+	kvmppc_core_dequeue_external(dst->vcpu);
+}
+
 static inline void IRQ_setbit(struct irq_queue *q, int n_IRQ)
 {
 	set_bit(n_IRQ, q->queue);
@@ -268,7 +332,7 @@ static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 	pr_debug("%s: IRQ %d active %d was %d\n",
 		__func__, n_IRQ, active, was_active);
 
-	if (src->output != OPENPIC_OUTPUT_INT) {
+	if (src->output != ILR_INTTGT_INT) {
 		pr_debug("%s: output %d irq %d active %d was %d count %d\n",
 			__func__, src->output, n_IRQ, active, was_active,
 			dst->outputs_active[src->output]);
@@ -282,14 +346,14 @@ static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 			    dst->outputs_active[src->output]++ == 0) {
 				pr_debug("%s: Raise OpenPIC output %d cpu %d irq %d\n",
 					__func__, src->output, n_CPU, n_IRQ);
-				qemu_irq_raise(dst->irqs[src->output]);
+				mpic_irq_raise(opp, dst, src->output);
 			}
 		} else {
 			if (was_active &&
 			    --dst->outputs_active[src->output] == 0) {
 				pr_debug("%s: Lower OpenPIC output %d cpu %d irq %d\n",
 					__func__, src->output, n_CPU, n_IRQ);
-				qemu_irq_lower(dst->irqs[src->output]);
+				mpic_irq_lower(opp, dst, src->output);
 			}
 		}
 
@@ -322,8 +386,7 @@ static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 		} else {
 			pr_debug("%s: Raise OpenPIC INT output cpu %d irq %d/%d\n",
 				__func__, n_CPU, n_IRQ, dst->raised.next);
-			qemu_irq_raise(opp->dst[n_CPU].
-				       irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_raise(opp, dst, ILR_INTTGT_INT);
 		}
 	} else {
 		IRQ_get_next(opp, &dst->servicing);
@@ -338,8 +401,7 @@ static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 			pr_debug("%s: IRQ %d inactive, current prio %d/%d, CPU %d\n",
 				__func__, n_IRQ, dst->ctpr,
 				dst->servicing.priority, n_CPU);
-			qemu_irq_lower(opp->dst[n_CPU].
-				       irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_lower(opp, dst, ILR_INTTGT_INT);
 		}
 	}
 }
@@ -415,8 +477,8 @@ static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 	struct irq_source *src;
 
 	if (n_IRQ >= MAX_IRQ) {
-		pr_err("%s: IRQ %d out of range\n", __func__, n_IRQ);
-		abort();
+		WARN_ONCE(1, "%s: IRQ %d out of range\n", __func__, n_IRQ);
+		return;
 	}
 
 	src = &opp->src[n_IRQ];
@@ -433,7 +495,7 @@ static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 			openpic_update_irq(opp, n_IRQ);
 		}
 
-		if (src->output != OPENPIC_OUTPUT_INT) {
+		if (src->output != ILR_INTTGT_INT) {
 			/* Edge-triggered interrupts shouldn't be used
 			 * with non-INT delivery, but just in case,
 			 * try to make it do something sane rather than
@@ -446,15 +508,13 @@ static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 	}
 }
 
-static void openpic_reset(DeviceState *d)
+static void openpic_reset(struct openpic *opp)
 {
-	struct openpic *opp = FROM_SYSBUS(typeof(*opp), SYS_BUS_DEVICE(d));
 	int i;
 
 	opp->gcr = GCR_RESET;
 	/* Initialise controller registers */
 	opp->frr = ((opp->nb_irqs - 1) << FRR_NIRQ_SHIFT) |
-	    ((opp->nb_cpus - 1) << FRR_NCPU_SHIFT) |
 	    (opp->vid << FRR_VID_SHIFT);
 
 	opp->pir = 0;
@@ -504,7 +564,7 @@ static inline uint32_t read_IRQreg_idr(struct openpic *opp, int n_IRQ)
 static inline uint32_t read_IRQreg_ilr(struct openpic *opp, int n_IRQ)
 {
 	if (opp->flags & OPENPIC_FLAG_ILR)
-		return output_to_inttgt(opp->src[n_IRQ].output);
+		return opp->src[n_IRQ].output;
 
 	return 0xffffffff;
 }
@@ -539,7 +599,7 @@ static inline void write_IRQreg_idr(struct openpic *opp, int n_IRQ,
 					__func__);
 			}
 
-			src->output = OPENPIC_OUTPUT_CINT;
+			src->output = ILR_INTTGT_CINT;
 			src->nomask = true;
 			src->destmask = 0;
 
@@ -550,7 +610,7 @@ static inline void write_IRQreg_idr(struct openpic *opp, int n_IRQ,
 					src->destmask |= 1UL << i;
 			}
 		} else {
-			src->output = OPENPIC_OUTPUT_INT;
+			src->output = ILR_INTTGT_INT;
 			src->nomask = false;
 			src->destmask = src->idr & normal_mask;
 		}
@@ -565,7 +625,7 @@ static inline void write_IRQreg_ilr(struct openpic *opp, int n_IRQ,
 	if (opp->flags & OPENPIC_FLAG_ILR) {
 		struct irq_source *src = &opp->src[n_IRQ];
 
-		src->output = inttgt_to_output(val & ILR_INTTGT_MASK);
+		src->output = val & ILR_INTTGT_MASK;
 		pr_debug("Set ILR %d to 0x%08x, output %d\n", n_IRQ, src->idr,
 			src->output);
 
@@ -614,34 +674,22 @@ static inline void write_IRQreg_ivpr(struct openpic *opp, int n_IRQ,
 
 static void openpic_gcr_write(struct openpic *opp, uint64_t val)
 {
-	bool mpic_proxy = false;
-
 	if (val & GCR_RESET) {
-		openpic_reset(&opp->busdev.qdev);
+		openpic_reset(opp);
 		return;
 	}
 
 	opp->gcr &= ~opp->mpic_mode_mask;
 	opp->gcr |= val & opp->mpic_mode_mask;
-
-	/* Set external proxy mode */
-	if ((val & opp->mpic_mode_mask) == GCR_MODE_PROXY)
-		mpic_proxy = true;
-
-	ppce500_set_mpic_proxy(mpic_proxy);
 }
 
-static void openpic_gbl_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned len)
+static int openpic_gbl_write(void *opaque, gpa_t addr, u32 val)
 {
 	struct openpic *opp = opaque;
-	struct irq_dest *dst;
-	int idx;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
-		__func__, addr, val);
+	pr_debug("%s: addr %#llx <= %08x\n", __func__, addr, val);
 	if (addr & 0xF)
-		return;
+		return 0;
 
 	switch (addr) {
 	case 0x00:	/* Block Revision Register1 (BRR1) is Readonly */
@@ -664,22 +712,11 @@ static void openpic_gbl_write(void *opaque, gpa_t addr, uint64_t val,
 	case 0x1080:		/* VIR */
 		break;
 	case 0x1090:		/* PIR */
-		for (idx = 0; idx < opp->nb_cpus; idx++) {
-			if ((val & (1 << idx)) && !(opp->pir & (1 << idx))) {
-				pr_debug("Raise OpenPIC RESET output for CPU %d\n",
-					idx);
-				dst = &opp->dst[idx];
-				qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_RESET]);
-			} else if (!(val & (1 << idx)) &&
-				   (opp->pir & (1 << idx))) {
-				pr_debug("Lower OpenPIC RESET output for CPU %d\n",
-					idx);
-				dst = &opp->dst[idx];
-				qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_RESET]);
-			}
-		}
-		opp->pir = val;
-		break;
+		/*
+		 * This register is used to reset a CPU core --
+		 * let userspace handle it.
+		 */
+		return 1;
 	case 0x10A0:		/* IPI_IVPR */
 	case 0x10B0:
 	case 0x10C0:
@@ -695,21 +732,24 @@ static void openpic_gbl_write(void *opaque, gpa_t addr, uint64_t val,
 	default:
 		break;
 	}
+
+	return 0;
 }
 
-static uint64_t openpic_gbl_read(void *opaque, gpa_t addr, unsigned len)
+static int openpic_gbl_read(void *opaque, gpa_t addr, u32 *ptr)
 {
 	struct openpic *opp = opaque;
-	uint32_t retval;
+	u32 retval;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 	retval = 0xFFFFFFFF;
 	if (addr & 0xF)
-		return retval;
+		goto out;
 
 	switch (addr) {
 	case 0x1000:		/* FRR */
 		retval = opp->frr;
+		retval |= (opp->nb_cpus - 1) << FRR_NCPU_SHIFT;
 		break;
 	case 0x1020:		/* GCR */
 		retval = opp->gcr;
@@ -731,8 +771,8 @@ static uint64_t openpic_gbl_read(void *opaque, gpa_t addr, unsigned len)
 	case 0x90:
 	case 0xA0:
 	case 0xB0:
-		retval =
-		    openpic_cpu_read_internal(opp, addr, get_current_cpu());
+		openpic_cpu_read_internal(opp, addr,
+			&retval, get_current_cpu());
 		break;
 	case 0x10A0:		/* IPI_IVPR */
 	case 0x10B0:
@@ -750,28 +790,28 @@ static uint64_t openpic_gbl_read(void *opaque, gpa_t addr, unsigned len)
 	default:
 		break;
 	}
-	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
-	return retval;
+out:
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
+	*ptr = retval;
+	return 0;
 }
 
-static void openpic_tmr_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned len)
+static int openpic_tmr_write(void *opaque, gpa_t addr, u32 val)
 {
 	struct openpic *opp = opaque;
 	int idx;
 
 	addr += 0x10f0;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
-		__func__, addr, val);
+	pr_debug("%s: addr %#llx <= %08x\n", __func__, addr, val);
 	if (addr & 0xF)
-		return;
+		return 0;
 
 	if (addr == 0x10f0) {
 		/* TFRR */
 		opp->tfrr = val;
-		return;
+		return 0;
 	}
 
 	idx = (addr >> 6) & 0x3;
@@ -795,15 +835,17 @@ static void openpic_tmr_write(void *opaque, gpa_t addr, uint64_t val,
 		write_IRQreg_idr(opp, opp->irq_tim0 + idx, val);
 		break;
 	}
+
+	return 0;
 }
 
-static uint64_t openpic_tmr_read(void *opaque, gpa_t addr, unsigned len)
+static int openpic_tmr_read(void *opaque, gpa_t addr, u32 *ptr)
 {
 	struct openpic *opp = opaque;
 	uint32_t retval = -1;
 	int idx;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 	if (addr & 0xF)
 		goto out;
 
@@ -813,6 +855,7 @@ static uint64_t openpic_tmr_read(void *opaque, gpa_t addr, unsigned len)
 		retval = opp->tfrr;
 		goto out;
 	}
+
 	switch (addr & 0x30) {
 	case 0x00:		/* TCCR */
 		retval = opp->timers[idx].tccr;
@@ -830,18 +873,16 @@ static uint64_t openpic_tmr_read(void *opaque, gpa_t addr, unsigned len)
 
 out:
 	pr_debug("%s: => 0x%08x\n", __func__, retval);
-
-	return retval;
+	*ptr = retval;
+	return 0;
 }
 
-static void openpic_src_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned len)
+static int openpic_src_write(void *opaque, gpa_t addr, u32 val)
 {
 	struct openpic *opp = opaque;
 	int idx;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
-		__func__, addr, val);
+	pr_debug("%s: addr %#llx <= %08x\n", __func__, addr, val);
 
 	addr = addr & 0xffff;
 	idx = addr >> 5;
@@ -857,15 +898,17 @@ static void openpic_src_write(void *opaque, gpa_t addr, uint64_t val,
 		write_IRQreg_ilr(opp, idx, val);
 		break;
 	}
+
+	return 0;
 }
 
-static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
+static int openpic_src_read(void *opaque, gpa_t addr, u32 *ptr)
 {
 	struct openpic *opp = opaque;
 	uint32_t retval;
 	int idx;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 	retval = 0xFFFFFFFF;
 
 	addr = addr & 0xffff;
@@ -884,20 +927,19 @@ static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
 	}
 
 	pr_debug("%s: => 0x%08x\n", __func__, retval);
-	return retval;
+	*ptr = retval;
+	return 0;
 }
 
-static void openpic_msi_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned size)
+static int openpic_msi_write(void *opaque, gpa_t addr, u32 val)
 {
 	struct openpic *opp = opaque;
 	int idx = opp->irq_msi;
 	int srs, ibs;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
-		__func__, addr, val);
+	pr_debug("%s: addr %#llx <= 0x%08x\n", __func__, addr, val);
 	if (addr & 0xF)
-		return;
+		return 0;
 
 	switch (addr) {
 	case MSIIR_OFFSET:
@@ -911,17 +953,19 @@ static void openpic_msi_write(void *opaque, gpa_t addr, uint64_t val,
 		/* most registers are read-only, thus ignored */
 		break;
 	}
+
+	return 0;
 }
 
-static uint64_t openpic_msi_read(void *opaque, gpa_t addr, unsigned size)
+static int openpic_msi_read(void *opaque, gpa_t addr, u32 *ptr)
 {
 	struct openpic *opp = opaque;
-	uint64_t r = 0;
+	uint32_t r = 0;
 	int i, srs;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 	if (addr & 0xF)
-		return -1;
+		return 1;
 
 	srs = addr >> 4;
 
@@ -945,45 +989,47 @@ static uint64_t openpic_msi_read(void *opaque, gpa_t addr, unsigned size)
 		break;
 	}
 
-	return r;
+	pr_debug("%s: => 0x%08x\n", __func__, r);
+	*ptr = r;
+	return 0;
 }
 
-static uint64_t openpic_summary_read(void *opaque, gpa_t addr, unsigned size)
+static int openpic_summary_read(void *opaque, gpa_t addr, u32 *ptr)
 {
-	uint64_t r = 0;
+	uint32_t r = 0;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 
 	/* TODO: EISR/EIMR */
 
-	return r;
+	*ptr = r;
+	return 0;
 }
 
-static void openpic_summary_write(void *opaque, gpa_t addr, uint64_t val,
-				  unsigned size)
+static int openpic_summary_write(void *opaque, gpa_t addr, u32 val)
 {
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
-		__func__, addr, val);
+	pr_debug("%s: addr %#llx <= 0x%08x\n", __func__, addr, val);
 
 	/* TODO: EISR/EIMR */
+	return 0;
 }
 
-static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
-				       uint32_t val, int idx)
+static int openpic_cpu_write_internal(void *opaque, gpa_t addr,
+				      u32 val, int idx)
 {
 	struct openpic *opp = opaque;
 	struct irq_source *src;
 	struct irq_dest *dst;
 	int s_IRQ, n_IRQ;
 
-	pr_debug("%s: cpu %d addr %#" HWADDR_PRIx " <= 0x%08x\n", __func__, idx,
+	pr_debug("%s: cpu %d addr %#llx <= 0x%08x\n", __func__, idx,
 		addr, val);
 
 	if (idx < 0)
-		return;
+		return 0;
 
 	if (addr & 0xF)
-		return;
+		return 0;
 
 	dst = &opp->dst[idx];
 	addr &= 0xFF0;
@@ -1008,11 +1054,11 @@ static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
 		if (dst->raised.priority <= dst->ctpr) {
 			pr_debug("%s: Lower OpenPIC INT output cpu %d due to ctpr\n",
 				__func__, idx);
-			qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_lower(opp, dst, ILR_INTTGT_INT);
 		} else if (dst->raised.priority > dst->servicing.priority) {
 			pr_debug("%s: Raise OpenPIC INT output cpu %d irq %d\n",
 				__func__, idx, dst->raised.next);
-			qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_raise(opp, dst, ILR_INTTGT_INT);
 		}
 
 		break;
@@ -1043,18 +1089,22 @@ static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
 		     IVPR_PRIORITY(src->ivpr) > dst->servicing.priority)) {
 			pr_debug("Raise OpenPIC INT output cpu %d irq %d\n",
 				idx, n_IRQ);
-			qemu_irq_raise(opp->dst[idx].irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_raise(opp, dst, ILR_INTTGT_INT);
 		}
 		break;
 	default:
 		break;
 	}
+
+	return 0;
 }
 
-static void openpic_cpu_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned len)
+static int openpic_cpu_write(void *opaque, gpa_t addr, u32 val)
 {
-	openpic_cpu_write_internal(opaque, addr, val, (addr & 0x1f000) >> 12);
+	struct openpic *opp = opaque;
+
+	return openpic_cpu_write_internal(opp, addr, val,
+					 (addr & 0x1f000) >> 12);
 }
 
 static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
@@ -1064,7 +1114,7 @@ static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
 	int retval, irq;
 
 	pr_debug("Lower OpenPIC INT output\n");
-	qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
+	mpic_irq_lower(opp, dst, ILR_INTTGT_INT);
 
 	irq = IRQ_get_next(opp, &dst->raised);
 	pr_debug("IACK: irq=%d\n", irq);
@@ -1107,20 +1157,35 @@ static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
 	return retval;
 }
 
-static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx)
+void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu)
+{
+	struct openpic *opp = vcpu->arch.irqchip_priv;
+	int cpu = vcpu->vcpu_id;
+	unsigned long flags;
+
+	spin_lock_irqsave(&opp->lock, flags);
+
+	if ((opp->gcr & opp->mpic_mode_mask) == GCR_MODE_PROXY)
+		kvmppc_set_epr(vcpu, openpic_iack(opp, &opp->dst[cpu], cpu));
+
+	spin_unlock_irqrestore(&opp->lock, flags);
+}
+
+static int openpic_cpu_read_internal(void *opaque, gpa_t addr,
+				     u32 *ptr, int idx)
 {
 	struct openpic *opp = opaque;
 	struct irq_dest *dst;
 	uint32_t retval;
 
-	pr_debug("%s: cpu %d addr %#" HWADDR_PRIx "\n", __func__, idx, addr);
+	pr_debug("%s: cpu %d addr %#llx\n", __func__, idx, addr);
 	retval = 0xFFFFFFFF;
 
 	if (idx < 0)
-		return retval;
+		goto out;
 
 	if (addr & 0xF)
-		return retval;
+		goto out;
 
 	dst = &opp->dst[idx];
 	addr &= 0xFF0;
@@ -1142,49 +1207,67 @@ static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx)
 	}
 	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
-	return retval;
+out:
+	*ptr = retval;
+	return 0;
 }
 
-static uint64_t openpic_cpu_read(void *opaque, gpa_t addr, unsigned len)
+static int openpic_cpu_read(void *opaque, gpa_t addr, u32 *ptr)
 {
-	return openpic_cpu_read_internal(opaque, addr, (addr & 0x1f000) >> 12);
+	struct openpic *opp = opaque;
+
+	return openpic_cpu_read_internal(opp, addr, ptr,
+					 (addr & 0x1f000) >> 12);
 }
 
-static const struct kvm_io_device_ops openpic_glb_ops_be = {
+struct mem_reg {
+	struct list_head list;
+	int (*read)(void *opaque, gpa_t addr, u32 *ptr);
+	int (*write)(void *opaque, gpa_t addr, u32 val);
+	gpa_t start_addr;
+	int size;
+};
+
+static struct mem_reg openpic_gbl_mmio = {
 	.write = openpic_gbl_write,
 	.read = openpic_gbl_read,
+	.start_addr = OPENPIC_GLB_REG_START,
+	.size = OPENPIC_GLB_REG_SIZE,
 };
 
-static const struct kvm_io_device_ops openpic_tmr_ops_be = {
+static struct mem_reg openpic_tmr_mmio = {
 	.write = openpic_tmr_write,
 	.read = openpic_tmr_read,
+	.start_addr = OPENPIC_TMR_REG_START,
+	.size = OPENPIC_TMR_REG_SIZE,
 };
 
-static const struct kvm_io_device_ops openpic_cpu_ops_be = {
+static struct mem_reg openpic_cpu_mmio = {
 	.write = openpic_cpu_write,
 	.read = openpic_cpu_read,
+	.start_addr = OPENPIC_CPU_REG_START,
+	.size = OPENPIC_CPU_REG_SIZE,
 };
 
-static const struct kvm_io_device_ops openpic_src_ops_be = {
+static struct mem_reg openpic_src_mmio = {
 	.write = openpic_src_write,
 	.read = openpic_src_read,
+	.start_addr = OPENPIC_SRC_REG_START,
+	.size = OPENPIC_SRC_REG_SIZE,
 };
 
-static const struct kvm_io_device_ops openpic_msi_ops_be = {
+static struct mem_reg openpic_msi_mmio = {
 	.read = openpic_msi_read,
 	.write = openpic_msi_write,
+	.start_addr = OPENPIC_MSI_REG_START,
+	.size = OPENPIC_MSI_REG_SIZE,
 };
 
-static const struct kvm_io_device_ops openpic_summary_ops_be = {
+static struct mem_reg openpic_summary_mmio = {
 	.read = openpic_summary_read,
 	.write = openpic_summary_write,
-};
-
-struct mem_reg {
-	const char *name;
-	const struct kvm_io_device_ops *ops;
-	gpa_t start_addr;
-	int size;
+	.start_addr = OPENPIC_SUMMARY_REG_START,
+	.size = OPENPIC_SUMMARY_REG_SIZE,
 };
 
 static void fsl_common_init(struct openpic *opp)
@@ -1192,6 +1275,9 @@ static void fsl_common_init(struct openpic *opp)
 	int i;
 	int virq = MAX_SRC;
 
+	list_add(&openpic_msi_mmio.list, &opp->mmio_regions);
+	list_add(&openpic_summary_mmio.list, &opp->mmio_regions);
+
 	opp->vid = VID_REVISION_1_2;
 	opp->vir = VIR_GENERIC;
 	opp->vector_mask = 0xFFFF;
@@ -1205,11 +1291,10 @@ static void fsl_common_init(struct openpic *opp)
 	opp->irq_tim0 = virq;
 	virq += MAX_TMR;
 
-	assert(virq <= MAX_IRQ);
+	BUG_ON(virq > MAX_IRQ);
 
 	opp->irq_msi = 224;
 
-	msi_supported = true;
 	for (i = 0; i < opp->fsl->max_ext; i++)
 		opp->src[i].level = false;
 
@@ -1226,63 +1311,404 @@ static void fsl_common_init(struct openpic *opp)
 	}
 }
 
-static void map_list(struct openpic *opp, const struct mem_reg *list,
-		     int *count)
+static int kvm_mpic_read_internal(struct openpic *opp, gpa_t addr, u32 *ptr)
 {
-	while (list->name) {
-		assert(*count < ARRAY_SIZE(opp->sub_io_mem));
+	struct list_head *node;
 
-		memory_region_init_io(&opp->sub_io_mem[*count], list->ops, opp,
-				      list->name, list->size);
+	list_for_each(node, &opp->mmio_regions) {
+		struct mem_reg *mr = list_entry(node, struct mem_reg, list);
 
-		memory_region_add_subregion(&opp->mem, list->start_addr,
-					    &opp->sub_io_mem[*count]);
+		if (mr->start_addr > addr || addr >= mr->start_addr + mr->size)
+			continue;
 
-		(*count)++;
-		list++;
+		return mr->read(opp, addr - mr->start_addr, ptr);
 	}
+
+	return 1;
 }
 
-static int openpic_init(SysBusDevice *dev)
+static int kvm_mpic_write_internal(struct openpic *opp, gpa_t addr, u32 val)
 {
-	struct openpic *opp = FROM_SYSBUS(typeof(*opp), dev);
-	int i, j;
-	int list_count = 0;
-	static const struct mem_reg list_le[] = {
-		{"glb", &openpic_glb_ops_le,
-		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
-		{"tmr", &openpic_tmr_ops_le,
-		 OPENPIC_TMR_REG_START, OPENPIC_TMR_REG_SIZE},
-		{"src", &openpic_src_ops_le,
-		 OPENPIC_SRC_REG_START, OPENPIC_SRC_REG_SIZE},
-		{"cpu", &openpic_cpu_ops_le,
-		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
-		{NULL}
-	};
-	static const struct mem_reg list_be[] = {
-		{"glb", &openpic_glb_ops_be,
-		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
-		{"tmr", &openpic_tmr_ops_be,
-		 OPENPIC_TMR_REG_START, OPENPIC_TMR_REG_SIZE},
-		{"src", &openpic_src_ops_be,
-		 OPENPIC_SRC_REG_START, OPENPIC_SRC_REG_SIZE},
-		{"cpu", &openpic_cpu_ops_be,
-		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
-		{NULL}
-	};
-	static const struct mem_reg list_fsl[] = {
-		{"msi", &openpic_msi_ops_be,
-		 OPENPIC_MSI_REG_START, OPENPIC_MSI_REG_SIZE},
-		{"summary", &openpic_summary_ops_be,
-		 OPENPIC_SUMMARY_REG_START, OPENPIC_SUMMARY_REG_SIZE},
-		{NULL}
-	};
+	struct list_head *node;
 
-	memory_region_init(&opp->mem, "openpic", 0x40000);
+	list_for_each(node, &opp->mmio_regions) {
+		struct mem_reg *mr = list_entry(node, struct mem_reg, list);
 
-	switch (opp->model) {
-	case OPENPIC_MODEL_FSL_MPIC_20:
+		if (mr->start_addr > addr || addr >= mr->start_addr + mr->size)
+			continue;
+
+		return mr->write(opp, addr - mr->start_addr, val);
+	}
+
+	return 1;
+}
+
+static int kvm_mpic_read(struct kvm_io_device *this, gpa_t addr,
+			 int len, void *ptr)
+{
+	struct openpic *opp = container_of(this, struct openpic, mmio);
+	int ret;
+
+	/*
+	 * Technically only 32-bit accesses are allowed, but be nice to
+	 * people dumping registers a byte at a time -- it works in real
+	 * hardware (reads only, not writes).
+	 */
+	if (len == 4) {
+		if (addr & 3) {
+			pr_debug("%s: bad alignment %llx/%d\n",
+				 __func__, addr, len);
+			return -EINVAL;
+		}
+
+		spin_lock_irq(&opp->lock);
+		ret = kvm_mpic_read_internal(opp, addr - opp->reg_base, ptr);
+		spin_unlock_irq(&opp->lock);
+
+		pr_debug("%s: addr %llx ret %d len 4 val %x\n",
+			 __func__, addr, ret, *(const u32 *)ptr);
+	} else if (len == 1) {
+		union {
+			u32 val;
+			u8 bytes[4];
+		} u;
+
+		spin_lock_irq(&opp->lock);
+		ret = kvm_mpic_read_internal(opp, addr - opp->reg_base, &u.val);
+		spin_unlock_irq(&opp->lock);
+
+		*(u8 *)ptr = u.bytes[addr & 3];
+
+		pr_debug("%s: addr %llx ret %d len 1 val %x\n",
+			 __func__, addr, ret, *(const u8 *)ptr);
+	} else {
+		pr_debug("%s: bad length %d\n", __func__, len);
+		return -EINVAL;
+	}
+
+	return ret;
+}
+
+static int kvm_mpic_write(struct kvm_io_device *this, gpa_t addr,
+			  int len, const void *ptr)
+{
+	struct openpic *opp = container_of(this, struct openpic, mmio);
+	int ret;
+
+	if (len != 4) {
+		pr_debug("%s: bad length %d\n", __func__, len);
+		return -EOPNOTSUPP;
+	}
+	if (addr & 3) {
+		pr_debug("%s: bad alignment %llx/%d\n", __func__, addr, len);
+		return -EOPNOTSUPP;
+	}
+
+	spin_lock_irq(&opp->lock);
+	ret = kvm_mpic_write_internal(opp, addr - opp->reg_base,
+				      *(const u32 *)ptr);
+	spin_unlock_irq(&opp->lock);
+
+	pr_debug("%s: addr %llx ret %d val %x\n",
+		 __func__, addr, ret, *(const u32 *)ptr);
+
+	return ret;
+}
+
+static void kvm_mpic_dtor(struct kvm_io_device *this)
+{
+	struct openpic *opp = container_of(this, struct openpic, mmio);
+
+	opp->mmio_mapped = false;
+}
+
+static const struct kvm_io_device_ops mpic_mmio_ops = {
+	.read = kvm_mpic_read,
+	.write = kvm_mpic_write,
+	.destructor = kvm_mpic_dtor,
+};
+
+static void map_mmio(struct openpic *opp)
+{
+	BUG_ON(opp->mmio_mapped);
+	opp->mmio_mapped = true;
+
+	kvm_iodevice_init(&opp->mmio, &mpic_mmio_ops);
+
+	kvm_io_bus_register_dev(opp->kvm, KVM_MMIO_BUS,
+				opp->reg_base, OPENPIC_REG_SIZE,
+				&opp->mmio);
+}
+
+static void unmap_mmio(struct openpic *opp)
+{
+	if (opp->mmio_mapped) {
+		opp->mmio_mapped = false;
+		kvm_io_bus_unregister_dev(opp->kvm, KVM_MMIO_BUS, &opp->mmio);
+	}
+}
+
+static int set_base_addr(struct openpic *opp, struct kvm_device_attr *attr)
+{
+	u64 base;
+
+	if (copy_from_user(&base, (u64 __user *)(long)attr->addr, sizeof(u64)))
+		return -EFAULT;
+
+	if (base & 0x3ffff) {
+		pr_debug("kvm mpic %s: KVM_DEV_MPIC_BASE_ADDR %08llx not aligned\n",
+			 __func__, base);
+		return -EINVAL;
+	}
+
+	if (base == opp->reg_base)
+		return 0;
+
+	mutex_lock(&opp->kvm->slots_lock);
+
+	unmap_mmio(opp);
+	opp->reg_base = base;
+
+	pr_debug("kvm mpic %s: KVM_DEV_MPIC_BASE_ADDR %08llx\n",
+		 __func__, base);
+
+	if (base == 0)
+		goto out;
+
+	map_mmio(opp);
+
+out:
+	mutex_unlock(&opp->kvm->slots_lock);
+	return 0;
+}
+
+#define ATTR_SET		0
+#define ATTR_GET		1
+
+static int access_reg(struct openpic *opp, gpa_t addr, u32 *val, int type)
+{
+	int ret;
+
+	if (addr & 3)
+		return -ENXIO;
+
+	if (type == ATTR_SET)
+		ret = kvm_mpic_write_internal(opp, addr, *val);
+	else
+		ret = kvm_mpic_read_internal(opp, addr, val);
+
+	pr_debug("%s: type %d addr %llx val %x\n", __func__, type, addr, *val);
+
+	return ret;
+}
+
+static int mpic_set_attr(struct openpic *opp, struct kvm_device_attr *attr)
+{
+	u32 attr32;
+
+	switch (attr->group) {
+	case KVM_DEV_MPIC_GRP_MISC:
+		switch (attr->attr) {
+		case KVM_DEV_MPIC_BASE_ADDR:
+			return set_base_addr(opp, attr);
+		}
+
+		break;
+
+	case KVM_DEV_MPIC_GRP_REGISTER:
+		if (copy_from_user(&attr32, (u32 __user *)(long)attr->addr,
+				   sizeof(u32)))
+			return -EFAULT;
+
+		return access_reg(opp, attr->attr, &attr32, ATTR_SET);
+
+	case KVM_DEV_MPIC_GRP_IRQ_ACTIVE:
+		if (attr->attr > MAX_SRC)
+			return -EINVAL;
+
+		if (copy_from_user(&attr32, (u32 __user *)(long)attr->addr,
+				   sizeof(u32)))
+			return -EFAULT;
+
+		if (attr32 != 0 && attr32 != 1)
+			return -EINVAL;
+
+		spin_lock_irq(&opp->lock);
+		openpic_set_irq(opp, attr->attr, attr32);
+		spin_unlock_irq(&opp->lock);
+		return 0;
+	}
+
+	return -ENXIO;
+}
+
+static int mpic_get_attr(struct openpic *opp, struct kvm_device_attr *attr)
+{
+	u64 attr64;
+	u32 attr32;
+	int ret;
+
+	switch (attr->group) {
+	case KVM_DEV_MPIC_GRP_MISC:
+		switch (attr->attr) {
+		case KVM_DEV_MPIC_BASE_ADDR:
+			mutex_lock(&opp->kvm->slots_lock);
+			attr64 = opp->reg_base;
+			mutex_unlock(&opp->kvm->slots_lock);
+
+			if (copy_to_user((u64 __user *)(long)attr->addr,
+					 &attr64, sizeof(u64)))
+				return -EFAULT;
+
+			return 0;
+		}
+
+		break;
+
+	case KVM_DEV_MPIC_GRP_REGISTER:
+		ret = access_reg(opp, attr->attr, &attr32, ATTR_GET);
+		if (ret)
+			return ret;
+
+		if (copy_to_user((u32 __user *)(long)attr->addr, &attr32,
+				 sizeof(u32)))
+			return -EFAULT;
+
+		return 0;
+
+	case KVM_DEV_MPIC_GRP_IRQ_ACTIVE:
+		if (attr->attr > MAX_SRC)
+			return -EINVAL;
+
+		attr32 = opp->src[attr->attr].pending;
+
+		if (copy_to_user((u32 __user *)(long)attr->addr, &attr32,
+				 sizeof(u32)))
+			return -EFAULT;
+
+		return 0;
+	}
+
+	return -ENXIO;
+}
+
+static int mpic_has_attr(struct openpic *opp, struct kvm_device_attr *attr)
+{
+	switch (attr->group) {
+	case KVM_DEV_MPIC_GRP_MISC:
+		switch (attr->attr) {
+		case KVM_DEV_MPIC_BASE_ADDR:
+			return 0;
+		}
+
+		break;
+
+	case KVM_DEV_MPIC_GRP_REGISTER:
+		return 0;
+
+	case KVM_DEV_MPIC_GRP_IRQ_ACTIVE:
+		if (attr->attr > MAX_SRC)
+			break;
+
+		return 0;
+	}
+
+	return -ENXIO;
+}
+
+static long kvm_mpic_ioctl(struct file *filp, unsigned int ioctl,
+			   unsigned long arg)
+{
+	struct openpic *opp = filp->private_data;
+	struct kvm_device_attr attr;
+	int (*accessor)(struct openpic *opp, struct kvm_device_attr *attr);
+
+	switch (ioctl) {
+	case KVM_SET_DEVICE_ATTR:
+		accessor = mpic_set_attr;
+		break;
+	case KVM_GET_DEVICE_ATTR:
+		accessor = mpic_get_attr;
+		break;
+	case KVM_HAS_DEVICE_ATTR:
+		accessor = mpic_has_attr;
+		break;
 	default:
+		return -ENOTTY;
+	}
+
+	if (copy_from_user(&attr, (void __user *)arg, sizeof(attr)))
+		return -EFAULT;
+
+	return accessor(opp, &attr);
+}
+
+static void mpic_destroy(struct openpic *opp)
+{
+	if (opp->mmio_mapped) {
+		/*
+		 * Normally we get unmapped by kvm_io_bus_destroy(),
+		 * which happens before the VCPUs release their references.
+		 *
+		 * Thus, we should only get here if no VCPUs took a reference
+		 * to us in the first place.
+		 */
+		WARN_ON(opp->nb_cpus != 0);
+		unmap_mmio(opp);
+	}
+
+	kfree(opp);
+}
+
+void kvmppc_mpic_put(struct openpic *opp)
+{
+	if (atomic_dec_and_test(&opp->users))
+		mpic_destroy(opp);
+}
+
+static int kvm_mpic_release(struct inode *inode, struct file *filp)
+{
+	struct openpic *opp = filp->private_data;
+	struct kvm *kvm = opp->kvm;
+
+	kvmppc_mpic_put(opp);
+	kvm_put_kvm(kvm);
+	return 0;
+}
+
+static const struct file_operations kvm_mpic_fops = {
+	.unlocked_ioctl = kvm_mpic_ioctl,
+	.release = kvm_mpic_release,
+};
+
+int kvm_create_mpic(struct kvm *kvm, u32 type)
+{
+	struct openpic *opp;
+	int ret, fd;
+
+	opp = kzalloc(sizeof(struct openpic), GFP_KERNEL);
+	if (!opp)
+		return -ENOMEM;
+
+	fd = anon_inode_getfd("kvm-mpic", &kvm_mpic_fops, opp, O_RDWR);
+	if (fd < 0) {
+		ret = fd;
+		goto err;
+	}
+
+	opp->kvm = kvm;
+	opp->model = type;
+	atomic_set(&opp->users, 1);
+	spin_lock_init(&opp->lock);
+
+	INIT_LIST_HEAD(&opp->mmio_regions);
+	list_add(&openpic_gbl_mmio.list, &opp->mmio_regions);
+	list_add(&openpic_tmr_mmio.list, &opp->mmio_regions);
+	list_add(&openpic_src_mmio.list, &opp->mmio_regions);
+	list_add(&openpic_cpu_mmio.list, &opp->mmio_regions);
+
+	switch (opp->model) {
+	case KVM_DEV_TYPE_FSL_MPIC_20:
 		opp->fsl = &fsl_mpic_20;
 		opp->brr1 = 0x00400200;
 		opp->flags |= OPENPIC_FLAG_IDR_CRIT;
@@ -1290,12 +1716,10 @@ static int openpic_init(SysBusDevice *dev)
 		opp->mpic_mode_mask = GCR_MODE_MIXED;
 
 		fsl_common_init(opp);
-		map_list(opp, list_be, &list_count);
-		map_list(opp, list_fsl, &list_count);
 
 		break;
 
-	case OPENPIC_MODEL_FSL_MPIC_42:
+	case KVM_DEV_TYPE_FSL_MPIC_42:
 		opp->fsl = &fsl_mpic_42;
 		opp->brr1 = 0x00400402;
 		opp->flags |= OPENPIC_FLAG_ILR;
@@ -1303,11 +1727,19 @@ static int openpic_init(SysBusDevice *dev)
 		opp->mpic_mode_mask = GCR_MODE_PROXY;
 
 		fsl_common_init(opp);
-		map_list(opp, list_be, &list_count);
-		map_list(opp, list_fsl, &list_count);
 
 		break;
+
+	default:
+		ret = -ENODEV;
+		goto err;
 	}
 
-	return 0;
+	openpic_reset(opp);
+	kvm_get_kvm(kvm);
+	return fd;
+
+err:
+	kfree(opp);
+	return ret;
 }
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 16b4595..c9a2972 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -317,6 +317,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_ENABLE_CAP:
 	case KVM_CAP_ONE_REG:
 	case KVM_CAP_IOEVENTFD:
+	case KVM_CAP_DEVICE_CTRL:
 		r = 1;
 		break;
 #ifndef CONFIG_KVM_BOOK3S_64_HV
@@ -769,7 +770,10 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 		break;
 	case KVM_CAP_PPC_EPR:
 		r = 0;
-		vcpu->arch.epr_enabled = cap->args[0];
+		if (cap->args[0])
+			vcpu->arch.epr_flags |= KVMPPC_EPR_USER;
+		else
+			vcpu->arch.epr_flags &= ~KVMPPC_EPR_USER;
 		break;
 #ifdef CONFIG_BOOKE
 	case KVM_CAP_PPC_BOOKE_WATCHDOG:
@@ -915,6 +919,7 @@ static int kvm_vm_ioctl_get_pvinfo(struct kvm_ppc_pvinfo *pvinfo)
 long kvm_arch_vm_ioctl(struct file *filp,
                        unsigned int ioctl, unsigned long arg)
 {
+	struct kvm *kvm __maybe_unused = filp->private_data;
 	void __user *argp = (void __user *)arg;
 	long r;
 
@@ -933,7 +938,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 #ifdef CONFIG_PPC_BOOK3S_64
 	case KVM_CREATE_SPAPR_TCE: {
 		struct kvm_create_spapr_tce create_tce;
-		struct kvm *kvm = filp->private_data;
 
 		r = -EFAULT;
 		if (copy_from_user(&create_tce, argp, sizeof(create_tce)))
@@ -945,7 +949,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 
 #ifdef CONFIG_KVM_BOOK3S_64_HV
 	case KVM_ALLOCATE_RMA: {
-		struct kvm *kvm = filp->private_data;
 		struct kvm_allocate_rma rma;
 
 		r = kvm_vm_ioctl_allocate_rma(kvm, &rma);
@@ -955,7 +958,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 	}
 
 	case KVM_PPC_ALLOCATE_HTAB: {
-		struct kvm *kvm = filp->private_data;
 		u32 htab_order;
 
 		r = -EFAULT;
@@ -972,7 +974,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 	}
 
 	case KVM_PPC_GET_HTAB_FD: {
-		struct kvm *kvm = filp->private_data;
 		struct kvm_get_htab_fd ghf;
 
 		r = -EFAULT;
@@ -985,7 +986,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 
 #ifdef CONFIG_PPC_BOOK3S_64
 	case KVM_PPC_GET_SMMU_INFO: {
-		struct kvm *kvm = filp->private_data;
 		struct kvm_ppc_smmu_info info;
 
 		memset(&info, 0, sizeof(info));
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1c0be23..852a3a1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1084,6 +1084,8 @@ static inline bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
 	return true;
 }
 
+int kvm_create_mpic(struct kvm *kvm, u32 type);
+
 #endif /* CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT */
 #else
 static inline void __guest_enter(void) { return; }
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 20ce2d2..d8f44ef 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -927,6 +927,15 @@ struct kvm_device_attr {
 	__u64	addr;		/* userspace address of attr data */
 };
 
+#define KVM_DEV_TYPE_FSL_MPIC_20	1
+#define KVM_DEV_TYPE_FSL_MPIC_42	2
+
+#define KVM_DEV_MPIC_GRP_MISC		1
+#define   KVM_DEV_MPIC_BASE_ADDR	0	/* 64-bit */
+
+#define KVM_DEV_MPIC_GRP_REGISTER	2	/* 32-bit */
+#define KVM_DEV_MPIC_GRP_IRQ_ACTIVE	3	/* 32-bit */
+
 /* ioctl for vm fd */
 #define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xe0, struct kvm_create_device)
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ed033c0..e325f5d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2164,6 +2164,15 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
 	bool test = cd->flags & KVM_CREATE_DEVICE_TEST;
 
 	switch (cd->type) {
+#ifdef CONFIG_KVM_MPIC
+	case KVM_DEV_TYPE_FSL_MPIC_20:
+	case KVM_DEV_TYPE_FSL_MPIC_42: {
+		if (test)
+			return 0;
+
+		return kvm_create_mpic(kvm, cd->type);
+	}
+#endif
 	default:
 		return -ENODEV;
 	}
-- 
1.7.9.5
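
The address dispatch in kvm_mpic_read_internal() above can be tried out in userspace with a sketch like the following. It is only an illustration under stated assumptions: it uses a plain array instead of the kernel's list_head, and the names struct region, region_read() and demo_read() are hypothetical and do not appear in the patch. As in the patch, the callback receives an offset relative to the start of its region, and a return value of 1 means that no region claimed the address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the patch's struct mem_reg. */
struct region {
	uint64_t start;		/* offset of the region in the register space */
	uint64_t size;		/* length of the region in bytes */
	int (*read)(void *opaque, uint64_t offset, uint32_t *val);
};

/*
 * Walk the regions and hand the access to the first one that contains
 * addr.  Returns the callback's result, or 1 if no region matches.
 */
static inline int region_read(const struct region *regs, size_t n,
			      void *opaque, uint64_t addr, uint32_t *val)
{
	size_t i;

	assert(regs != NULL || n == 0);

	for (i = 0; i < n; i++) {
		const struct region *r = &regs[i];

		if (addr < r->start || addr >= r->start + r->size)
			continue;

		/* The callback sees a region-relative offset. */
		return r->read(opaque, addr - r->start, val);
	}

	return 1;
}

/* Demo callback: reports the region-relative offset as the value read. */
static inline int demo_read(void *opaque, uint64_t offset, uint32_t *val)
{
	(void)opaque;
	*val = (uint32_t)offset;
	return 0;
}
```

With two demo regions at 0x0000 (size 0x1000) and 0x2000 (size 0x100), a read at 0x2010 reaches the second callback with offset 0x10, while reads at 0x1800 or 0x2100 fall through and return 1. The region sizes here are made-up demo values, not the real OPENPIC_*_REG_SIZE constants.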




* [RFC PATCH v3 5/6] kvm/ppc/mpic: in-kernel MPIC emulation
@ 2013-04-03  1:57       ` Scott Wood
From: Scott Wood @ 2013-04-03  1:57 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus, Scott Wood

Hook the MPIC code up to the KVM interfaces, add locking, etc.

TODO: irqfd support, split up into multiple patches, KVM_IRQ_LINE
support

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
v3: mpic_put -> kvmppc_mpic_put
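
For anyone driving this from userspace, the two numeric rules stated in mpic.txt below (a naturally aligned base address for the 256 KiB register space, and IRQ numbers equal to the IVPR byte offset from EIVPR0 divided by 32) can be sketched as follows. This is a minimal illustration only: the helper names and macros are hypothetical, made up for this note, and are not part of the patch or of any KVM header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MPIC_REG_SPACE_SIZE	0x40000u	/* 256 KiB register space */
#define MPIC_IVPR_STRIDE	32u		/* bytes between consecutive IVPRs */

/*
 * Hypothetical helper: true if base would pass the natural-alignment
 * check for KVM_DEV_MPIC_BASE_ADDR.  Zero passes too; per mpic.txt it
 * disables the mapping.
 */
static inline bool mpic_base_addr_ok(uint64_t base)
{
	return (base & (MPIC_REG_SPACE_SIZE - 1)) == 0;
}

/*
 * Hypothetical helper: "attr" value for KVM_DEV_MPIC_GRP_IRQ_ACTIVE,
 * given the byte offset of the source's IVPR relative to EIVPR0.
 */
static inline uint32_t mpic_irq_from_ivpr_offset(uint32_t ivpr_offset)
{
	assert(ivpr_offset % MPIC_IVPR_STRIDE == 0);
	return ivpr_offset / MPIC_IVPR_STRIDE;
}
```

For example, an IVPR located 0x40 bytes past EIVPR0 corresponds to IRQ number 2, and a base of 0xe0041000 would be rejected because it is not a multiple of 0x40000.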

 Documentation/virtual/kvm/devices/mpic.txt |   37 ++
 arch/powerpc/include/asm/kvm_host.h        |    8 +-
 arch/powerpc/include/asm/kvm_ppc.h         |    7 +
 arch/powerpc/kvm/Kconfig                   |    5 +
 arch/powerpc/kvm/Makefile                  |    2 +
 arch/powerpc/kvm/booke.c                   |   10 +-
 arch/powerpc/kvm/mpic.c                    |  814 +++++++++++++++++++++-------
 arch/powerpc/kvm/powerpc.c                 |   12 +-
 include/linux/kvm_host.h                   |    2 +
 include/uapi/linux/kvm.h                   |    9 +
 virt/kvm/kvm_main.c                        |    9 +
 11 files changed, 714 insertions(+), 201 deletions(-)
 create mode 100644 Documentation/virtual/kvm/devices/mpic.txt

diff --git a/Documentation/virtual/kvm/devices/mpic.txt b/Documentation/virtual/kvm/devices/mpic.txt
new file mode 100644
index 0000000..79e000a
--- /dev/null
+++ b/Documentation/virtual/kvm/devices/mpic.txt
@@ -0,0 +1,37 @@
+MPIC interrupt controller
+=========================
+
+Device types supported:
+  KVM_DEV_TYPE_FSL_MPIC_20     Freescale MPIC v2.0
+  KVM_DEV_TYPE_FSL_MPIC_42     Freescale MPIC v4.2
+
+Only one MPIC instance, of any type, may be instantiated.  The created
+MPIC will act as the system interrupt controller, connecting to each
+vcpu's interrupt inputs.
+
+Groups:
+  KVM_DEV_MPIC_GRP_MISC
+  Attributes:
+    KVM_DEV_MPIC_BASE_ADDR (rw, 64-bit)
+      Base address of the 256 KiB MPIC register space.  Must be
+      naturally aligned.  A value of zero disables the mapping.
+      Reset value is zero.
+
+  KVM_DEV_MPIC_GRP_REGISTER (rw, 32-bit)
+    Access an MPIC register, as if the access were made from the guest. 
+    "attr" is the byte offset into the MPIC register space.  Accesses
+    must be 4-byte aligned.
+
+    MSIs may be signaled by using this attribute group to write
+    to the relevant MSIIR.
+
+  KVM_DEV_MPIC_GRP_IRQ_ACTIVE (rw, 32-bit)
+    IRQ input line for each standard openpic source.  0 is inactive and 1
+    is active, regardless of interrupt sense.
+
+    For edge-triggered interrupts:  Writing 1 is considered an activating
+    edge, and writing 0 is ignored.  Reading returns 1 if a previously
+    signaled edge has not been acknowledged, and 0 otherwise.
+
+    "attr" is the IRQ number.  IRQ numbers for standard sources are the
+    byte offset of the relevant IVPR from EIVPR0, divided by 32.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index e34f8fe..7e7aef9 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -359,6 +359,11 @@ struct kvmppc_slb {
 #define KVMPPC_BOOKE_MAX_IAC	4
 #define KVMPPC_BOOKE_MAX_DAC	2
 
+/* KVMPPC_EPR_USER takes precedence over KVMPPC_EPR_KERNEL */
+#define KVMPPC_EPR_NONE		0 /* EPR not supported */
+#define KVMPPC_EPR_USER		1 /* exit to userspace to fill EPR */
+#define KVMPPC_EPR_KERNEL	2 /* in-kernel irqchip */
+
 struct kvmppc_booke_debug_reg {
 	u32 dbcr0;
 	u32 dbcr1;
@@ -522,7 +527,7 @@ struct kvm_vcpu_arch {
 	u8 sane;
 	u8 cpu_type;
 	u8 hcall_needed;
-	u8 epr_enabled;
+	u8 epr_flags; /* KVMPPC_EPR_xxx */
 	u8 epr_needed;
 
 	u32 cpr0_cfgaddr; /* holds the last set cpr0_cfgaddr */
@@ -589,5 +594,6 @@ struct kvm_vcpu_arch {
 #define KVM_MMIO_REG_FQPR	0x0060
 
 #define __KVM_HAVE_ARCH_WQP
+#define __KVM_HAVE_CREATE_DEVICE
 
 #endif /* __POWERPC_KVM_HOST_H__ */
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index f589307..3b63b97 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -164,6 +164,8 @@ extern int kvmppc_prepare_to_enter(struct kvm_vcpu *vcpu);
 
 extern int kvm_vm_ioctl_get_htab_fd(struct kvm *kvm, struct kvm_get_htab_fd *);
 
+int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu, struct kvm_interrupt *irq);
+
 /*
  * Cuts out inst bits with ordering according to spec.
  * That means the leftmost bit is zero. All given bits are included.
@@ -245,6 +247,9 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id, union kvmppc_one_reg *);
 
 void kvmppc_set_pid(struct kvm_vcpu *vcpu, u32 pid);
 
+struct openpic;
+void kvmppc_mpic_put(struct openpic *opp);
+
 #ifdef CONFIG_KVM_BOOK3S_64_HV
 static inline void kvmppc_set_xics_phys(int cpu, unsigned long addr)
 {
@@ -270,6 +275,8 @@ static inline void kvmppc_set_epr(struct kvm_vcpu *vcpu, u32 epr)
 #endif
 }
 
+void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu);
+
 int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
 			      struct kvm_config_tlb *cfg);
 int kvm_vcpu_ioctl_dirty_tlb(struct kvm_vcpu *vcpu,
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 63c67ec..a87139b 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -151,6 +151,11 @@ config KVM_E500MC
 
 	  If unsure, say N.
 
+config KVM_MPIC
+	bool "KVM in-kernel MPIC emulation"
+	depends on KVM
+
+
 source drivers/vhost/Kconfig
 
 endif # VIRTUALIZATION
diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
index b772ede..4a2277a 100644
--- a/arch/powerpc/kvm/Makefile
+++ b/arch/powerpc/kvm/Makefile
@@ -103,6 +103,8 @@ kvm-book3s_32-objs := \
 	book3s_32_mmu.o
 kvm-objs-$(CONFIG_KVM_BOOK3S_32) := $(kvm-book3s_32-objs)
 
+kvm-objs-$(CONFIG_KVM_MPIC) += mpic.o
+
 kvm-objs := $(kvm-objs-m) $(kvm-objs-y)
 
 obj-$(CONFIG_KVM_440) += kvm.o
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 58057d6..cddc6b3 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -346,7 +346,7 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu *vcpu,
 		keep_irq = true;
 	}
 
-	if ((priority == BOOKE_IRQPRIO_EXTERNAL) && vcpu->arch.epr_enabled)
+	if ((priority == BOOKE_IRQPRIO_EXTERNAL) && vcpu->arch.epr_flags)
 		update_epr = true;
 
 	switch (priority) {
@@ -427,8 +427,12 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu *vcpu,
 			set_guest_esr(vcpu, vcpu->arch.queued_esr);
 		if (update_dear == true)
 			set_guest_dear(vcpu, vcpu->arch.queued_dear);
-		if (update_epr == true)
-			kvm_make_request(KVM_REQ_EPR_EXIT, vcpu);
+		if (update_epr == true) {
+			if (vcpu->arch.epr_flags & KVMPPC_EPR_USER)
+				kvm_make_request(KVM_REQ_EPR_EXIT, vcpu);
+			else if (vcpu->arch.epr_flags & KVMPPC_EPR_KERNEL)
+				kvmppc_mpic_set_epr(vcpu);
+		}
 
 		new_msr &= msr_mask;
 #if defined(CONFIG_64BIT)
diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index 1df67ae..8cda2fa 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -23,6 +23,19 @@
  * THE SOFTWARE.
  */
 
+#include <linux/slab.h>
+#include <linux/mutex.h>
+#include <linux/kvm_host.h>
+#include <linux/errno.h>
+#include <linux/fs.h>
+#include <linux/anon_inodes.h>
+#include <asm/uaccess.h>
+#include <asm/mpic.h>
+#include <asm/kvm_para.h>
+#include <asm/kvm_host.h>
+#include <asm/kvm_ppc.h>
+#include "iodev.h"
+
 #define MAX_CPU     32
 #define MAX_SRC     256
 #define MAX_TMR     4
@@ -36,6 +49,7 @@
 #define OPENPIC_FLAG_ILR          (2 << 0)
 
 /* OpenPIC address map */
+#define OPENPIC_REG_SIZE             0x40000
 #define OPENPIC_GLB_REG_START        0x0
 #define OPENPIC_GLB_REG_SIZE         0x10F0
 #define OPENPIC_TMR_REG_START        0x10F0
@@ -89,6 +103,7 @@ static struct fsl_mpic_info fsl_mpic_42 = {
 #define ILR_INTTGT_INT    0x00
 #define ILR_INTTGT_CINT   0x01	/* critical */
 #define ILR_INTTGT_MCP    0x02	/* machine check */
+#define NUM_OUTPUTS       3
 
 #define MSIIR_OFFSET       0x140
 #define MSIIR_SRS_SHIFT    29
@@ -98,18 +113,14 @@ static struct fsl_mpic_info fsl_mpic_42 = {
 
 static int get_current_cpu(void)
 {
-	CPUState *cpu_single_cpu;
-
-	if (!cpu_single_env)
-		return -1;
-
-	cpu_single_cpu = ENV_GET_CPU(cpu_single_env);
-	return cpu_single_cpu->cpu_index;
+	struct kvm_vcpu *vcpu = current->thread.kvm_vcpu;
+	return vcpu ? vcpu->vcpu_id : -1;
 }
 
-static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx);
-static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
-				       uint32_t val, int idx);
+static int openpic_cpu_write_internal(void *opaque, gpa_t addr,
+				      u32 val, int idx);
+static int openpic_cpu_read_internal(void *opaque, gpa_t addr,
+				     u32 *ptr, int idx);
 
 enum irq_type {
 	IRQ_TYPE_NORMAL = 0,
@@ -131,7 +142,7 @@ struct irq_source {
 	uint32_t idr;		/* IRQ destination register */
 	uint32_t destmask;	/* bitmap of CPU destinations */
 	int last_cpu;
-	int output;		/* IRQ level, e.g. OPENPIC_OUTPUT_INT */
+	int output;		/* IRQ level, e.g. ILR_INTTGT_INT */
 	int pending;		/* TRUE if IRQ is pending */
 	enum irq_type type;
 	bool level:1;		/* level-triggered */
@@ -158,16 +169,28 @@ struct irq_source {
 #define IDR_CI      0x40000000	/* critical interrupt */
 
 struct irq_dest {
+	struct kvm_vcpu *vcpu;
+
 	int32_t ctpr;		/* CPU current task priority */
 	struct irq_queue raised;
 	struct irq_queue servicing;
-	qemu_irq *irqs;
 
 	/* Count of IRQ sources asserting on non-INT outputs */
-	uint32_t outputs_active[OPENPIC_OUTPUT_NB];
+	uint32_t outputs_active[NUM_OUTPUTS];
 };
 
+struct openpic;
+
 struct openpic {
+	struct kvm *kvm;
+	struct kvm_io_device mmio;
+	struct list_head mmio_regions;
+	atomic_t users;
+	bool mmio_mapped;
+
+	gpa_t reg_base;
+	spinlock_t lock;
+
 	/* Behavior control */
 	struct fsl_mpic_info *fsl;
 	uint32_t model;
@@ -208,6 +231,47 @@ struct openpic {
 	uint32_t irq_msi;
 };
 
+
+static void mpic_irq_raise(struct openpic *opp, struct irq_dest *dst,
+			   int output)
+{
+	struct kvm_interrupt irq = {
+		.irq = KVM_INTERRUPT_SET_LEVEL,
+	};
+
+	if (!dst->vcpu) {
+		pr_debug("%s: destination cpu %d does not exist\n",
+			 __func__, (int)(dst - &opp->dst[0]));
+		return;
+	}
+
+	pr_debug("%s: cpu %d output %d\n", __func__, dst->vcpu->vcpu_id,
+		output);
+
+	if (output != ILR_INTTGT_INT)	/* TODO */
+		return;
+
+	kvm_vcpu_ioctl_interrupt(dst->vcpu, &irq);
+}
+
+static void mpic_irq_lower(struct openpic *opp, struct irq_dest *dst,
+			   int output)
+{
+	if (!dst->vcpu) {
+		pr_debug("%s: destination cpu %d does not exist\n",
+			 __func__, (int)(dst - &opp->dst[0]));
+		return;
+	}
+
+	pr_debug("%s: cpu %d output %d\n", __func__, dst->vcpu->vcpu_id,
+		output);
+
+	if (output != ILR_INTTGT_INT)	/* TODO */
+		return;
+
+	kvmppc_core_dequeue_external(dst->vcpu);
+}
+
 static inline void IRQ_setbit(struct irq_queue *q, int n_IRQ)
 {
 	set_bit(n_IRQ, q->queue);
@@ -268,7 +332,7 @@ static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 	pr_debug("%s: IRQ %d active %d was %d\n",
 		__func__, n_IRQ, active, was_active);
 
-	if (src->output != OPENPIC_OUTPUT_INT) {
+	if (src->output != ILR_INTTGT_INT) {
 		pr_debug("%s: output %d irq %d active %d was %d count %d\n",
 			__func__, src->output, n_IRQ, active, was_active,
 			dst->outputs_active[src->output]);
@@ -282,14 +346,14 @@ static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 			    dst->outputs_active[src->output]++ == 0) {
 				pr_debug("%s: Raise OpenPIC output %d cpu %d irq %d\n",
 					__func__, src->output, n_CPU, n_IRQ);
-				qemu_irq_raise(dst->irqs[src->output]);
+				mpic_irq_raise(opp, dst, src->output);
 			}
 		} else {
 			if (was_active &&
 			    --dst->outputs_active[src->output] == 0) {
 				pr_debug("%s: Lower OpenPIC output %d cpu %d irq %d\n",
 					__func__, src->output, n_CPU, n_IRQ);
-				qemu_irq_lower(dst->irqs[src->output]);
+				mpic_irq_lower(opp, dst, src->output);
 			}
 		}
 
@@ -322,8 +386,7 @@ static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 		} else {
 			pr_debug("%s: Raise OpenPIC INT output cpu %d irq %d/%d\n",
 				__func__, n_CPU, n_IRQ, dst->raised.next);
-			qemu_irq_raise(opp->dst[n_CPU].
-				       irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_raise(opp, dst, ILR_INTTGT_INT);
 		}
 	} else {
 		IRQ_get_next(opp, &dst->servicing);
@@ -338,8 +401,7 @@ static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 			pr_debug("%s: IRQ %d inactive, current prio %d/%d, CPU %d\n",
 				__func__, n_IRQ, dst->ctpr,
 				dst->servicing.priority, n_CPU);
-			qemu_irq_lower(opp->dst[n_CPU].
-				       irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_lower(opp, dst, ILR_INTTGT_INT);
 		}
 	}
 }
@@ -415,8 +477,8 @@ static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 	struct irq_source *src;
 
 	if (n_IRQ >= MAX_IRQ) {
-		pr_err("%s: IRQ %d out of range\n", __func__, n_IRQ);
-		abort();
+		WARN_ONCE(1, "%s: IRQ %d out of range\n", __func__, n_IRQ);
+		return;
 	}
 
 	src = &opp->src[n_IRQ];
@@ -433,7 +495,7 @@ static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 			openpic_update_irq(opp, n_IRQ);
 		}
 
-		if (src->output != OPENPIC_OUTPUT_INT) {
+		if (src->output != ILR_INTTGT_INT) {
 			/* Edge-triggered interrupts shouldn't be used
 			 * with non-INT delivery, but just in case,
 			 * try to make it do something sane rather than
@@ -446,15 +508,13 @@ static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 	}
 }
 
-static void openpic_reset(DeviceState *d)
+static void openpic_reset(struct openpic *opp)
 {
-	struct openpic *opp = FROM_SYSBUS(typeof(*opp), SYS_BUS_DEVICE(d));
 	int i;
 
 	opp->gcr = GCR_RESET;
 	/* Initialise controller registers */
 	opp->frr = ((opp->nb_irqs - 1) << FRR_NIRQ_SHIFT) |
-	    ((opp->nb_cpus - 1) << FRR_NCPU_SHIFT) |
 	    (opp->vid << FRR_VID_SHIFT);
 
 	opp->pir = 0;
@@ -504,7 +564,7 @@ static inline uint32_t read_IRQreg_idr(struct openpic *opp, int n_IRQ)
 static inline uint32_t read_IRQreg_ilr(struct openpic *opp, int n_IRQ)
 {
 	if (opp->flags & OPENPIC_FLAG_ILR)
-		return output_to_inttgt(opp->src[n_IRQ].output);
+		return opp->src[n_IRQ].output;
 
 	return 0xffffffff;
 }
@@ -539,7 +599,7 @@ static inline void write_IRQreg_idr(struct openpic *opp, int n_IRQ,
 					__func__);
 			}
 
-			src->output = OPENPIC_OUTPUT_CINT;
+			src->output = ILR_INTTGT_CINT;
 			src->nomask = true;
 			src->destmask = 0;
 
@@ -550,7 +610,7 @@ static inline void write_IRQreg_idr(struct openpic *opp, int n_IRQ,
 					src->destmask |= 1UL << i;
 			}
 		} else {
-			src->output = OPENPIC_OUTPUT_INT;
+			src->output = ILR_INTTGT_INT;
 			src->nomask = false;
 			src->destmask = src->idr & normal_mask;
 		}
@@ -565,7 +625,7 @@ static inline void write_IRQreg_ilr(struct openpic *opp, int n_IRQ,
 	if (opp->flags & OPENPIC_FLAG_ILR) {
 		struct irq_source *src = &opp->src[n_IRQ];
 
-		src->output = inttgt_to_output(val & ILR_INTTGT_MASK);
+		src->output = val & ILR_INTTGT_MASK;
 		pr_debug("Set ILR %d to 0x%08x, output %d\n", n_IRQ, src->idr,
 			src->output);
 
@@ -614,34 +674,22 @@ static inline void write_IRQreg_ivpr(struct openpic *opp, int n_IRQ,
 
 static void openpic_gcr_write(struct openpic *opp, uint64_t val)
 {
-	bool mpic_proxy = false;
-
 	if (val & GCR_RESET) {
-		openpic_reset(&opp->busdev.qdev);
+		openpic_reset(opp);
 		return;
 	}
 
 	opp->gcr &= ~opp->mpic_mode_mask;
 	opp->gcr |= val & opp->mpic_mode_mask;
-
-	/* Set external proxy mode */
-	if ((val & opp->mpic_mode_mask) == GCR_MODE_PROXY)
-		mpic_proxy = true;
-
-	ppce500_set_mpic_proxy(mpic_proxy);
 }
 
-static void openpic_gbl_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned len)
+static int openpic_gbl_write(void *opaque, gpa_t addr, u32 val)
 {
 	struct openpic *opp = opaque;
-	struct irq_dest *dst;
-	int idx;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
-		__func__, addr, val);
+	pr_debug("%s: addr %#llx <= %08x\n", __func__, addr, val);
 	if (addr & 0xF)
-		return;
+		return 0;
 
 	switch (addr) {
 	case 0x00:	/* Block Revision Register1 (BRR1) is Readonly */
@@ -664,22 +712,11 @@ static void openpic_gbl_write(void *opaque, gpa_t addr, uint64_t val,
 	case 0x1080:		/* VIR */
 		break;
 	case 0x1090:		/* PIR */
-		for (idx = 0; idx < opp->nb_cpus; idx++) {
-			if ((val & (1 << idx)) && !(opp->pir & (1 << idx))) {
-				pr_debug("Raise OpenPIC RESET output for CPU %d\n",
-					idx);
-				dst = &opp->dst[idx];
-				qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_RESET]);
-			} else if (!(val & (1 << idx)) &&
-				   (opp->pir & (1 << idx))) {
-				pr_debug("Lower OpenPIC RESET output for CPU %d\n",
-					idx);
-				dst = &opp->dst[idx];
-				qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_RESET]);
-			}
-		}
-		opp->pir = val;
-		break;
+		/*
+		 * This register is used to reset a CPU core --
+		 * let userspace handle it.
+		 */
+		return 1;
 	case 0x10A0:		/* IPI_IVPR */
 	case 0x10B0:
 	case 0x10C0:
@@ -695,21 +732,24 @@ static void openpic_gbl_write(void *opaque, gpa_t addr, uint64_t val,
 	default:
 		break;
 	}
+
+	return 0;
 }
 
-static uint64_t openpic_gbl_read(void *opaque, gpa_t addr, unsigned len)
+static int openpic_gbl_read(void *opaque, gpa_t addr, u32 *ptr)
 {
 	struct openpic *opp = opaque;
-	uint32_t retval;
+	u32 retval;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 	retval = 0xFFFFFFFF;
 	if (addr & 0xF)
-		return retval;
+		goto out;
 
 	switch (addr) {
 	case 0x1000:		/* FRR */
 		retval = opp->frr;
+		retval |= (opp->nb_cpus - 1) << FRR_NCPU_SHIFT;
 		break;
 	case 0x1020:		/* GCR */
 		retval = opp->gcr;
@@ -731,8 +771,8 @@ static uint64_t openpic_gbl_read(void *opaque, gpa_t addr, unsigned len)
 	case 0x90:
 	case 0xA0:
 	case 0xB0:
-		retval =
-		    openpic_cpu_read_internal(opp, addr, get_current_cpu());
+		openpic_cpu_read_internal(opp, addr,
+			&retval, get_current_cpu());
 		break;
 	case 0x10A0:		/* IPI_IVPR */
 	case 0x10B0:
@@ -750,28 +790,28 @@ static uint64_t openpic_gbl_read(void *opaque, gpa_t addr, unsigned len)
 	default:
 		break;
 	}
-	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
-	return retval;
+out:
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
+	*ptr = retval;
+	return 0;
 }
 
-static void openpic_tmr_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned len)
+static int openpic_tmr_write(void *opaque, gpa_t addr, u32 val)
 {
 	struct openpic *opp = opaque;
 	int idx;
 
 	addr += 0x10f0;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
-		__func__, addr, val);
+	pr_debug("%s: addr %#llx <= %08x\n", __func__, addr, val);
 	if (addr & 0xF)
-		return;
+		return 0;
 
 	if (addr == 0x10f0) {
 		/* TFRR */
 		opp->tfrr = val;
-		return;
+		return 0;
 	}
 
 	idx = (addr >> 6) & 0x3;
@@ -795,15 +835,17 @@ static void openpic_tmr_write(void *opaque, gpa_t addr, uint64_t val,
 		write_IRQreg_idr(opp, opp->irq_tim0 + idx, val);
 		break;
 	}
+
+	return 0;
 }
 
-static uint64_t openpic_tmr_read(void *opaque, gpa_t addr, unsigned len)
+static int openpic_tmr_read(void *opaque, gpa_t addr, u32 *ptr)
 {
 	struct openpic *opp = opaque;
 	uint32_t retval = -1;
 	int idx;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 	if (addr & 0xF)
 		goto out;
 
@@ -813,6 +855,7 @@ static uint64_t openpic_tmr_read(void *opaque, gpa_t addr, unsigned len)
 		retval = opp->tfrr;
 		goto out;
 	}
+
 	switch (addr & 0x30) {
 	case 0x00:		/* TCCR */
 		retval = opp->timers[idx].tccr;
@@ -830,18 +873,16 @@ static uint64_t openpic_tmr_read(void *opaque, gpa_t addr, unsigned len)
 
 out:
 	pr_debug("%s: => 0x%08x\n", __func__, retval);
-
-	return retval;
+	*ptr = retval;
+	return 0;
 }
 
-static void openpic_src_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned len)
+static int openpic_src_write(void *opaque, gpa_t addr, u32 val)
 {
 	struct openpic *opp = opaque;
 	int idx;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
-		__func__, addr, val);
+	pr_debug("%s: addr %#llx <= %08x\n", __func__, addr, val);
 
 	addr = addr & 0xffff;
 	idx = addr >> 5;
@@ -857,15 +898,17 @@ static void openpic_src_write(void *opaque, gpa_t addr, uint64_t val,
 		write_IRQreg_ilr(opp, idx, val);
 		break;
 	}
+
+	return 0;
 }
 
-static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
+static int openpic_src_read(void *opaque, gpa_t addr, u32 *ptr)
 {
 	struct openpic *opp = opaque;
 	uint32_t retval;
 	int idx;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 	retval = 0xFFFFFFFF;
 
 	addr = addr & 0xffff;
@@ -884,20 +927,19 @@ static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
 	}
 
 	pr_debug("%s: => 0x%08x\n", __func__, retval);
-	return retval;
+	*ptr = retval;
+	return 0;
 }
 
-static void openpic_msi_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned size)
+static int openpic_msi_write(void *opaque, gpa_t addr, u32 val)
 {
 	struct openpic *opp = opaque;
 	int idx = opp->irq_msi;
 	int srs, ibs;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
-		__func__, addr, val);
+	pr_debug("%s: addr %#llx <= 0x%08x\n", __func__, addr, val);
 	if (addr & 0xF)
-		return;
+		return 0;
 
 	switch (addr) {
 	case MSIIR_OFFSET:
@@ -911,17 +953,19 @@ static void openpic_msi_write(void *opaque, gpa_t addr, uint64_t val,
 		/* most registers are read-only, thus ignored */
 		break;
 	}
+
+	return 0;
 }
 
-static uint64_t openpic_msi_read(void *opaque, gpa_t addr, unsigned size)
+static int openpic_msi_read(void *opaque, gpa_t addr, u32 *ptr)
 {
 	struct openpic *opp = opaque;
-	uint64_t r = 0;
+	uint32_t r = 0;
 	int i, srs;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 	if (addr & 0xF)
-		return -1;
+		return 1;
 
 	srs = addr >> 4;
 
@@ -945,45 +989,47 @@ static uint64_t openpic_msi_read(void *opaque, gpa_t addr, unsigned size)
 		break;
 	}
 
-	return r;
+	pr_debug("%s: => 0x%08x\n", __func__, r);
+	*ptr = r;
+	return 0;
 }
 
-static uint64_t openpic_summary_read(void *opaque, gpa_t addr, unsigned size)
+static int openpic_summary_read(void *opaque, gpa_t addr, u32 *ptr)
 {
-	uint64_t r = 0;
+	uint32_t r = 0;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 
 	/* TODO: EISR/EIMR */
 
-	return r;
+	*ptr = r;
+	return 0;
 }
 
-static void openpic_summary_write(void *opaque, gpa_t addr, uint64_t val,
-				  unsigned size)
+static int openpic_summary_write(void *opaque, gpa_t addr, u32 val)
 {
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
-		__func__, addr, val);
+	pr_debug("%s: addr %#llx <= 0x%08x\n", __func__, addr, val);
 
 	/* TODO: EISR/EIMR */
+	return 0;
 }
 
-static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
-				       uint32_t val, int idx)
+static int openpic_cpu_write_internal(void *opaque, gpa_t addr,
+				      u32 val, int idx)
 {
 	struct openpic *opp = opaque;
 	struct irq_source *src;
 	struct irq_dest *dst;
 	int s_IRQ, n_IRQ;
 
-	pr_debug("%s: cpu %d addr %#" HWADDR_PRIx " <= 0x%08x\n", __func__, idx,
+	pr_debug("%s: cpu %d addr %#llx <= 0x%08x\n", __func__, idx,
 		addr, val);
 
 	if (idx < 0)
-		return;
+		return 0;
 
 	if (addr & 0xF)
-		return;
+		return 0;
 
 	dst = &opp->dst[idx];
 	addr &= 0xFF0;
@@ -1008,11 +1054,11 @@ static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
 		if (dst->raised.priority <= dst->ctpr) {
 			pr_debug("%s: Lower OpenPIC INT output cpu %d due to ctpr\n",
 				__func__, idx);
-			qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_lower(opp, dst, ILR_INTTGT_INT);
 		} else if (dst->raised.priority > dst->servicing.priority) {
 			pr_debug("%s: Raise OpenPIC INT output cpu %d irq %d\n",
 				__func__, idx, dst->raised.next);
-			qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_raise(opp, dst, ILR_INTTGT_INT);
 		}
 
 		break;
@@ -1043,18 +1089,22 @@ static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
 		     IVPR_PRIORITY(src->ivpr) > dst->servicing.priority)) {
 			pr_debug("Raise OpenPIC INT output cpu %d irq %d\n",
 				idx, n_IRQ);
-			qemu_irq_raise(opp->dst[idx].irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_raise(opp, dst, ILR_INTTGT_INT);
 		}
 		break;
 	default:
 		break;
 	}
+
+	return 0;
 }
 
-static void openpic_cpu_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned len)
+static int openpic_cpu_write(void *opaque, gpa_t addr, u32 val)
 {
-	openpic_cpu_write_internal(opaque, addr, val, (addr & 0x1f000) >> 12);
+	struct openpic *opp = opaque;
+
+	return openpic_cpu_write_internal(opp, addr, val,
+					 (addr & 0x1f000) >> 12);
 }
 
 static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
@@ -1064,7 +1114,7 @@ static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
 	int retval, irq;
 
 	pr_debug("Lower OpenPIC INT output\n");
-	qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
+	mpic_irq_lower(opp, dst, ILR_INTTGT_INT);
 
 	irq = IRQ_get_next(opp, &dst->raised);
 	pr_debug("IACK: irq=%d\n", irq);
@@ -1107,20 +1157,35 @@ static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
 	return retval;
 }
 
-static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx)
+void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu)
+{
+	struct openpic *opp = vcpu->arch.irqchip_priv;
+	int cpu = vcpu->vcpu_id;
+	unsigned long flags;
+
+	spin_lock_irqsave(&opp->lock, flags);
+
+	if ((opp->gcr & opp->mpic_mode_mask) == GCR_MODE_PROXY)
+		kvmppc_set_epr(vcpu, openpic_iack(opp, &opp->dst[cpu], cpu));
+
+	spin_unlock_irqrestore(&opp->lock, flags);
+}
+
+static int openpic_cpu_read_internal(void *opaque, gpa_t addr,
+				     u32 *ptr, int idx)
 {
 	struct openpic *opp = opaque;
 	struct irq_dest *dst;
 	uint32_t retval;
 
-	pr_debug("%s: cpu %d addr %#" HWADDR_PRIx "\n", __func__, idx, addr);
+	pr_debug("%s: cpu %d addr %#llx\n", __func__, idx, addr);
 	retval = 0xFFFFFFFF;
 
 	if (idx < 0)
-		return retval;
+		goto out;
 
 	if (addr & 0xF)
-		return retval;
+		goto out;
 
 	dst = &opp->dst[idx];
 	addr &= 0xFF0;
@@ -1142,49 +1207,67 @@ static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx)
 	}
 	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
-	return retval;
+out:
+	*ptr = retval;
+	return 0;
 }
 
-static uint64_t openpic_cpu_read(void *opaque, gpa_t addr, unsigned len)
+static int openpic_cpu_read(void *opaque, gpa_t addr, u32 *ptr)
 {
-	return openpic_cpu_read_internal(opaque, addr, (addr & 0x1f000) >> 12);
+	struct openpic *opp = opaque;
+
+	return openpic_cpu_read_internal(opp, addr, ptr,
+					 (addr & 0x1f000) >> 12);
 }
 
-static const struct kvm_io_device_ops openpic_glb_ops_be = {
+struct mem_reg {
+	struct list_head list;
+	int (*read)(void *opaque, gpa_t addr, u32 *ptr);
+	int (*write)(void *opaque, gpa_t addr, u32 val);
+	gpa_t start_addr;
+	int size;
+};
+
+static struct mem_reg openpic_gbl_mmio = {
 	.write = openpic_gbl_write,
 	.read = openpic_gbl_read,
+	.start_addr = OPENPIC_GLB_REG_START,
+	.size = OPENPIC_GLB_REG_SIZE,
 };
 
-static const struct kvm_io_device_ops openpic_tmr_ops_be = {
+static struct mem_reg openpic_tmr_mmio = {
 	.write = openpic_tmr_write,
 	.read = openpic_tmr_read,
+	.start_addr = OPENPIC_TMR_REG_START,
+	.size = OPENPIC_TMR_REG_SIZE,
 };
 
-static const struct kvm_io_device_ops openpic_cpu_ops_be = {
+static struct mem_reg openpic_cpu_mmio = {
 	.write = openpic_cpu_write,
 	.read = openpic_cpu_read,
+	.start_addr = OPENPIC_CPU_REG_START,
+	.size = OPENPIC_CPU_REG_SIZE,
 };
 
-static const struct kvm_io_device_ops openpic_src_ops_be = {
+static struct mem_reg openpic_src_mmio = {
 	.write = openpic_src_write,
 	.read = openpic_src_read,
+	.start_addr = OPENPIC_SRC_REG_START,
+	.size = OPENPIC_SRC_REG_SIZE,
 };
 
-static const struct kvm_io_device_ops openpic_msi_ops_be = {
+static struct mem_reg openpic_msi_mmio = {
 	.read = openpic_msi_read,
 	.write = openpic_msi_write,
+	.start_addr = OPENPIC_MSI_REG_START,
+	.size = OPENPIC_MSI_REG_SIZE,
 };
 
-static const struct kvm_io_device_ops openpic_summary_ops_be = {
+static struct mem_reg openpic_summary_mmio = {
 	.read = openpic_summary_read,
 	.write = openpic_summary_write,
-};
-
-struct mem_reg {
-	const char *name;
-	const struct kvm_io_device_ops *ops;
-	gpa_t start_addr;
-	int size;
+	.start_addr = OPENPIC_SUMMARY_REG_START,
+	.size = OPENPIC_SUMMARY_REG_SIZE,
 };
 
 static void fsl_common_init(struct openpic *opp)
@@ -1192,6 +1275,9 @@ static void fsl_common_init(struct openpic *opp)
 	int i;
 	int virq = MAX_SRC;
 
+	list_add(&openpic_msi_mmio.list, &opp->mmio_regions);
+	list_add(&openpic_summary_mmio.list, &opp->mmio_regions);
+
 	opp->vid = VID_REVISION_1_2;
 	opp->vir = VIR_GENERIC;
 	opp->vector_mask = 0xFFFF;
@@ -1205,11 +1291,10 @@ static void fsl_common_init(struct openpic *opp)
 	opp->irq_tim0 = virq;
 	virq += MAX_TMR;
 
-	assert(virq <= MAX_IRQ);
+	BUG_ON(virq > MAX_IRQ);
 
 	opp->irq_msi = 224;
 
-	msi_supported = true;
 	for (i = 0; i < opp->fsl->max_ext; i++)
 		opp->src[i].level = false;
 
@@ -1226,63 +1311,404 @@ static void fsl_common_init(struct openpic *opp)
 	}
 }
 
-static void map_list(struct openpic *opp, const struct mem_reg *list,
-		     int *count)
+static int kvm_mpic_read_internal(struct openpic *opp, gpa_t addr, u32 *ptr)
 {
-	while (list->name) {
-		assert(*count < ARRAY_SIZE(opp->sub_io_mem));
+	struct list_head *node;
 
-		memory_region_init_io(&opp->sub_io_mem[*count], list->ops, opp,
-				      list->name, list->size);
+	list_for_each(node, &opp->mmio_regions) {
+		struct mem_reg *mr = list_entry(node, struct mem_reg, list);
 
-		memory_region_add_subregion(&opp->mem, list->start_addr,
-					    &opp->sub_io_mem[*count]);
+		if (mr->start_addr > addr || addr >= mr->start_addr + mr->size)
+			continue;
 
-		(*count)++;
-		list++;
+		return mr->read(opp, addr - mr->start_addr, ptr);
 	}
+
+	return 1;
 }
 
-static int openpic_init(SysBusDevice *dev)
+static int kvm_mpic_write_internal(struct openpic *opp, gpa_t addr, u32 val)
 {
-	struct openpic *opp = FROM_SYSBUS(typeof(*opp), dev);
-	int i, j;
-	int list_count = 0;
-	static const struct mem_reg list_le[] = {
-		{"glb", &openpic_glb_ops_le,
-		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
-		{"tmr", &openpic_tmr_ops_le,
-		 OPENPIC_TMR_REG_START, OPENPIC_TMR_REG_SIZE},
-		{"src", &openpic_src_ops_le,
-		 OPENPIC_SRC_REG_START, OPENPIC_SRC_REG_SIZE},
-		{"cpu", &openpic_cpu_ops_le,
-		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
-		{NULL}
-	};
-	static const struct mem_reg list_be[] = {
-		{"glb", &openpic_glb_ops_be,
-		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
-		{"tmr", &openpic_tmr_ops_be,
-		 OPENPIC_TMR_REG_START, OPENPIC_TMR_REG_SIZE},
-		{"src", &openpic_src_ops_be,
-		 OPENPIC_SRC_REG_START, OPENPIC_SRC_REG_SIZE},
-		{"cpu", &openpic_cpu_ops_be,
-		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
-		{NULL}
-	};
-	static const struct mem_reg list_fsl[] = {
-		{"msi", &openpic_msi_ops_be,
-		 OPENPIC_MSI_REG_START, OPENPIC_MSI_REG_SIZE},
-		{"summary", &openpic_summary_ops_be,
-		 OPENPIC_SUMMARY_REG_START, OPENPIC_SUMMARY_REG_SIZE},
-		{NULL}
-	};
+	struct list_head *node;
 
-	memory_region_init(&opp->mem, "openpic", 0x40000);
+	list_for_each(node, &opp->mmio_regions) {
+		struct mem_reg *mr = list_entry(node, struct mem_reg, list);
 
-	switch (opp->model) {
-	case OPENPIC_MODEL_FSL_MPIC_20:
+		if (mr->start_addr > addr || addr >= mr->start_addr + mr->size)
+			continue;
+
+		return mr->write(opp, addr - mr->start_addr, val);
+	}
+
+	return 1;
+}
+
+static int kvm_mpic_read(struct kvm_io_device *this, gpa_t addr,
+			 int len, void *ptr)
+{
+	struct openpic *opp = container_of(this, struct openpic, mmio);
+	int ret;
+
+	/*
+	 * Technically only 32-bit accesses are allowed, but be nice to
+	 * people dumping registers a byte at a time -- it works in real
+	 * hardware (reads only, not writes).
+	 */
+	if (len == 4) {
+		if (addr & 3) {
+			pr_debug("%s: bad alignment %llx/%d\n",
+				 __func__, addr, len);
+			return -EINVAL;
+		}
+
+		spin_lock_irq(&opp->lock);
+		ret = kvm_mpic_read_internal(opp, addr - opp->reg_base, ptr);
+		spin_unlock_irq(&opp->lock);
+
+		pr_debug("%s: addr %llx ret %d len 4 val %x\n",
+			 __func__, addr, ret, *(const u32 *)ptr);
+	} else if (len == 1) {
+		union {
+			u32 val;
+			u8 bytes[4];
+		} u;
+
+		spin_lock_irq(&opp->lock);
+		ret = kvm_mpic_read_internal(opp, addr - opp->reg_base, &u.val);
+		spin_unlock_irq(&opp->lock);
+
+		*(u8 *)ptr = u.bytes[addr & 3];
+
+		pr_debug("%s: addr %llx ret %d len 1 val %x\n",
+			 __func__, addr, ret, *(const u8 *)ptr);
+	} else {
+		pr_debug("%s: bad length %d\n", __func__, len);
+		return -EINVAL;
+	}
+
+	return ret;
+}
+
+static int kvm_mpic_write(struct kvm_io_device *this, gpa_t addr,
+			  int len, const void *ptr)
+{
+	struct openpic *opp = container_of(this, struct openpic, mmio);
+	int ret;
+
+	if (len != 4) {
+		pr_debug("%s: bad length %d\n", __func__, len);
+		return -EOPNOTSUPP;
+	}
+	if (addr & 3) {
+		pr_debug("%s: bad alignment %llx/%d\n", __func__, addr, len);
+		return -EOPNOTSUPP;
+	}
+
+	spin_lock_irq(&opp->lock);
+	ret = kvm_mpic_write_internal(opp, addr - opp->reg_base,
+				      *(const u32 *)ptr);
+	spin_unlock_irq(&opp->lock);
+
+	pr_debug("%s: addr %llx ret %d val %x\n",
+		 __func__, addr, ret, *(const u32 *)ptr);
+
+	return ret;
+}
+
+static void kvm_mpic_dtor(struct kvm_io_device *this)
+{
+	struct openpic *opp = container_of(this, struct openpic, mmio);
+
+	opp->mmio_mapped = false;
+}
+
+static const struct kvm_io_device_ops mpic_mmio_ops = {
+	.read = kvm_mpic_read,
+	.write = kvm_mpic_write,
+	.destructor = kvm_mpic_dtor,
+};
+
+static void map_mmio(struct openpic *opp)
+{
+	BUG_ON(opp->mmio_mapped);
+	opp->mmio_mapped = true;
+
+	kvm_iodevice_init(&opp->mmio, &mpic_mmio_ops);
+
+	kvm_io_bus_register_dev(opp->kvm, KVM_MMIO_BUS,
+				opp->reg_base, OPENPIC_REG_SIZE,
+				&opp->mmio);
+}
+
+static void unmap_mmio(struct openpic *opp)
+{
+	if (opp->mmio_mapped) {
+		opp->mmio_mapped = false;
+		kvm_io_bus_unregister_dev(opp->kvm, KVM_MMIO_BUS, &opp->mmio);
+	}
+}
+
+static int set_base_addr(struct openpic *opp, struct kvm_device_attr *attr)
+{
+	u64 base;
+
+	if (copy_from_user(&base, (u64 __user *)(long)attr->addr, sizeof(u64)))
+		return -EFAULT;
+
+	if (base & 0x3ffff) {
+		pr_debug("kvm mpic %s: KVM_DEV_MPIC_BASE_ADDR %08llx not aligned\n",
+			 __func__, base);
+		return -EINVAL;
+	}
+
+	if (base == opp->reg_base)
+		return 0;
+
+	mutex_lock(&opp->kvm->slots_lock);
+
+	unmap_mmio(opp);
+	opp->reg_base = base;
+
+	pr_debug("kvm mpic %s: KVM_DEV_MPIC_BASE_ADDR %08llx\n",
+		 __func__, base);
+
+	if (base == 0)
+		goto out;
+
+	map_mmio(opp);
+
+out:
+	mutex_unlock(&opp->kvm->slots_lock);
+	return 0;
+}
+
+#define ATTR_SET		0
+#define ATTR_GET		1
+
+static int access_reg(struct openpic *opp, gpa_t addr, u32 *val, int type)
+{
+	int ret;
+
+	if (addr & 3)
+		return -ENXIO;
+
+	if (type == ATTR_SET)
+		ret = kvm_mpic_write_internal(opp, addr, *val);
+	else
+		ret = kvm_mpic_read_internal(opp, addr, val);
+
+	pr_debug("%s: type %d addr %llx val %x\n", __func__, type, addr, *val);
+
+	return ret;
+}
+
+static int mpic_set_attr(struct openpic *opp, struct kvm_device_attr *attr)
+{
+	u32 attr32;
+
+	switch (attr->group) {
+	case KVM_DEV_MPIC_GRP_MISC:
+		switch (attr->attr) {
+		case KVM_DEV_MPIC_BASE_ADDR:
+			return set_base_addr(opp, attr);
+		}
+
+		break;
+
+	case KVM_DEV_MPIC_GRP_REGISTER:
+		if (copy_from_user(&attr32, (u32 __user *)(long)attr->addr,
+				   sizeof(u32)))
+			return -EFAULT;
+
+		return access_reg(opp, attr->attr, &attr32, ATTR_SET);
+
+	case KVM_DEV_MPIC_GRP_IRQ_ACTIVE:
+		if (attr->attr > MAX_SRC)
+			return -EINVAL;
+
+		if (copy_from_user(&attr32, (u32 __user *)(long)attr->addr,
+				   sizeof(u32)))
+			return -EFAULT;
+
+		if (attr32 != 0 && attr32 != 1)
+			return -EINVAL;
+
+		spin_lock_irq(&opp->lock);
+		openpic_set_irq(opp, attr->attr, attr32);
+		spin_unlock_irq(&opp->lock);
+		return 0;
+	}
+
+	return -ENXIO;
+}
+
+static int mpic_get_attr(struct openpic *opp, struct kvm_device_attr *attr)
+{
+	u64 attr64;
+	u32 attr32;
+	int ret;
+
+	switch (attr->group) {
+	case KVM_DEV_MPIC_GRP_MISC:
+		switch (attr->attr) {
+		case KVM_DEV_MPIC_BASE_ADDR:
+			mutex_lock(&opp->kvm->slots_lock);
+			attr64 = opp->reg_base;
+			mutex_unlock(&opp->kvm->slots_lock);
+
+			if (copy_to_user((u64 __user *)(long)attr->addr,
+					 &attr64, sizeof(u64)))
+				return -EFAULT;
+
+			return 0;
+		}
+
+		break;
+
+	case KVM_DEV_MPIC_GRP_REGISTER:
+		ret = access_reg(opp, attr->attr, &attr32, ATTR_GET);
+		if (ret)
+			return ret;
+
+		if (copy_to_user((u32 __user *)(long)attr->addr, &attr32,
+				 sizeof(u32)))
+			return -EFAULT;
+
+		return 0;
+
+	case KVM_DEV_MPIC_GRP_IRQ_ACTIVE:
+		if (attr->attr > MAX_SRC)
+			return -EINVAL;
+
+		attr32 = opp->src[attr->attr].pending;
+
+		if (copy_to_user((u32 __user *)(long)attr->addr, &attr32,
+				 sizeof(u32)))
+			return -EFAULT;
+
+		return 0;
+	}
+
+	return -ENXIO;
+}
+
+static int mpic_has_attr(struct openpic *opp, struct kvm_device_attr *attr)
+{
+	switch (attr->group) {
+	case KVM_DEV_MPIC_GRP_MISC:
+		switch (attr->attr) {
+		case KVM_DEV_MPIC_BASE_ADDR:
+			return 0;
+		}
+
+		break;
+
+	case KVM_DEV_MPIC_GRP_REGISTER:
+		return 0;
+
+	case KVM_DEV_MPIC_GRP_IRQ_ACTIVE:
+		if (attr->attr > MAX_SRC)
+			break;
+
+		return 0;
+	}
+
+	return -ENXIO;
+}
+
+static long kvm_mpic_ioctl(struct file *filp, unsigned int ioctl,
+			   unsigned long arg)
+{
+	struct openpic *opp = filp->private_data;
+	struct kvm_device_attr attr;
+	int (*accessor)(struct openpic *opp, struct kvm_device_attr *attr);
+
+	switch (ioctl) {
+	case KVM_SET_DEVICE_ATTR:
+		accessor = mpic_set_attr;
+		break;
+	case KVM_GET_DEVICE_ATTR:
+		accessor = mpic_get_attr;
+		break;
+	case KVM_HAS_DEVICE_ATTR:
+		accessor = mpic_has_attr;
+		break;
 	default:
+		return -ENOTTY;
+	}
+
+	if (copy_from_user(&attr, (void __user *)arg, sizeof(attr)))
+		return -EFAULT;
+
+	return accessor(opp, &attr);
+}
+
+static void mpic_destroy(struct openpic *opp)
+{
+	if (opp->mmio_mapped) {
+		/*
+		 * Normally we get unmapped by kvm_io_bus_destroy(),
+		 * which happens before the VCPUs release their references.
+		 *
+		 * Thus, we should only get here if no VCPUs took a reference
+		 * to us in the first place.
+		 */
+		WARN_ON(opp->nb_cpus != 0);
+		unmap_mmio(opp);
+	}
+
+	kfree(opp);
+}
+
+void kvmppc_mpic_put(struct openpic *opp)
+{
+	if (atomic_dec_and_test(&opp->users))
+		mpic_destroy(opp);
+}
+
+static int kvm_mpic_release(struct inode *inode, struct file *filp)
+{
+	struct openpic *opp = filp->private_data;
+	struct kvm *kvm = opp->kvm;
+
+	kvmppc_mpic_put(opp);
+	kvm_put_kvm(kvm);
+	return 0;
+}
+
+static const struct file_operations kvm_mpic_fops = {
+	.unlocked_ioctl = kvm_mpic_ioctl,
+	.release = kvm_mpic_release,
+};
+
+int kvm_create_mpic(struct kvm *kvm, u32 type)
+{
+	struct openpic *opp;
+	int ret, fd;
+
+	opp = kzalloc(sizeof(struct openpic), GFP_KERNEL);
+	if (!opp)
+		return -ENOMEM;
+
+	fd = anon_inode_getfd("kvm-mpic", &kvm_mpic_fops, opp, O_RDWR);
+	if (fd < 0) {
+		ret = fd;
+		goto err;
+	}
+
+	opp->kvm = kvm;
+	opp->model = type;
+	atomic_set(&opp->users, 1);
+	spin_lock_init(&opp->lock);
+
+	INIT_LIST_HEAD(&opp->mmio_regions);
+	list_add(&openpic_gbl_mmio.list, &opp->mmio_regions);
+	list_add(&openpic_tmr_mmio.list, &opp->mmio_regions);
+	list_add(&openpic_src_mmio.list, &opp->mmio_regions);
+	list_add(&openpic_cpu_mmio.list, &opp->mmio_regions);
+
+	switch (opp->model) {
+	case KVM_DEV_TYPE_FSL_MPIC_20:
 		opp->fsl = &fsl_mpic_20;
 		opp->brr1 = 0x00400200;
 		opp->flags |= OPENPIC_FLAG_IDR_CRIT;
@@ -1290,12 +1716,10 @@ static int openpic_init(SysBusDevice *dev)
 		opp->mpic_mode_mask = GCR_MODE_MIXED;
 
 		fsl_common_init(opp);
-		map_list(opp, list_be, &list_count);
-		map_list(opp, list_fsl, &list_count);
 
 		break;
 
-	case OPENPIC_MODEL_FSL_MPIC_42:
+	case KVM_DEV_TYPE_FSL_MPIC_42:
 		opp->fsl = &fsl_mpic_42;
 		opp->brr1 = 0x00400402;
 		opp->flags |= OPENPIC_FLAG_ILR;
@@ -1303,11 +1727,19 @@ static int openpic_init(SysBusDevice *dev)
 		opp->mpic_mode_mask = GCR_MODE_PROXY;
 
 		fsl_common_init(opp);
-		map_list(opp, list_be, &list_count);
-		map_list(opp, list_fsl, &list_count);
 
 		break;
+
+	default:
+		ret = -ENODEV;
+		goto err;
 	}
 
-	return 0;
+	openpic_reset(opp);
+	kvm_get_kvm(kvm);
+	return fd;
+
+err:
+	kfree(opp);
+	return ret;
 }
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 16b4595..c9a2972 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -317,6 +317,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_ENABLE_CAP:
 	case KVM_CAP_ONE_REG:
 	case KVM_CAP_IOEVENTFD:
+	case KVM_CAP_DEVICE_CTRL:
 		r = 1;
 		break;
 #ifndef CONFIG_KVM_BOOK3S_64_HV
@@ -769,7 +770,10 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 		break;
 	case KVM_CAP_PPC_EPR:
 		r = 0;
-		vcpu->arch.epr_enabled = cap->args[0];
+		if (cap->args[0])
+			vcpu->arch.epr_flags |= KVMPPC_EPR_USER;
+		else
+			vcpu->arch.epr_flags &= ~KVMPPC_EPR_USER;
 		break;
 #ifdef CONFIG_BOOKE
 	case KVM_CAP_PPC_BOOKE_WATCHDOG:
@@ -915,6 +919,7 @@ static int kvm_vm_ioctl_get_pvinfo(struct kvm_ppc_pvinfo *pvinfo)
 long kvm_arch_vm_ioctl(struct file *filp,
                        unsigned int ioctl, unsigned long arg)
 {
+	struct kvm *kvm __maybe_unused = filp->private_data;
 	void __user *argp = (void __user *)arg;
 	long r;
 
@@ -933,7 +938,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 #ifdef CONFIG_PPC_BOOK3S_64
 	case KVM_CREATE_SPAPR_TCE: {
 		struct kvm_create_spapr_tce create_tce;
-		struct kvm *kvm = filp->private_data;
 
 		r = -EFAULT;
 		if (copy_from_user(&create_tce, argp, sizeof(create_tce)))
@@ -945,7 +949,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 
 #ifdef CONFIG_KVM_BOOK3S_64_HV
 	case KVM_ALLOCATE_RMA: {
-		struct kvm *kvm = filp->private_data;
 		struct kvm_allocate_rma rma;
 
 		r = kvm_vm_ioctl_allocate_rma(kvm, &rma);
@@ -955,7 +958,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 	}
 
 	case KVM_PPC_ALLOCATE_HTAB: {
-		struct kvm *kvm = filp->private_data;
 		u32 htab_order;
 
 		r = -EFAULT;
@@ -972,7 +974,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 	}
 
 	case KVM_PPC_GET_HTAB_FD: {
-		struct kvm *kvm = filp->private_data;
 		struct kvm_get_htab_fd ghf;
 
 		r = -EFAULT;
@@ -985,7 +986,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 
 #ifdef CONFIG_PPC_BOOK3S_64
 	case KVM_PPC_GET_SMMU_INFO: {
-		struct kvm *kvm = filp->private_data;
 		struct kvm_ppc_smmu_info info;
 
 		memset(&info, 0, sizeof(info));
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1c0be23..852a3a1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1084,6 +1084,8 @@ static inline bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
 	return true;
 }
 
+int kvm_create_mpic(struct kvm *kvm, u32 type);
+
 #endif /* CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT */
 #else
 static inline void __guest_enter(void) { return; }
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 20ce2d2..d8f44ef 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -927,6 +927,15 @@ struct kvm_device_attr {
 	__u64	addr;		/* userspace address of attr data */
 };
 
+#define KVM_DEV_TYPE_FSL_MPIC_20	1
+#define KVM_DEV_TYPE_FSL_MPIC_42	2
+
+#define KVM_DEV_MPIC_GRP_MISC		1
+#define   KVM_DEV_MPIC_BASE_ADDR	0	/* 64-bit */
+
+#define KVM_DEV_MPIC_GRP_REGISTER	2	/* 32-bit */
+#define KVM_DEV_MPIC_GRP_IRQ_ACTIVE	3	/* 32-bit */
+
 /* ioctl for vm fd */
 #define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xe0, struct kvm_create_device)
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ed033c0..e325f5d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2164,6 +2164,15 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
 	bool test = cd->flags & KVM_CREATE_DEVICE_TEST;
 
 	switch (cd->type) {
+#ifdef CONFIG_KVM_MPIC
+	case KVM_DEV_TYPE_FSL_MPIC_20:
+	case KVM_DEV_TYPE_FSL_MPIC_42: {
+		if (test)
+			return 0;
+
+		return kvm_create_mpic(kvm, cd->type);
+	}
+#endif
 	default:
 		return -ENODEV;
 	}
-- 
1.7.9.5
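
For anyone reading along, the address routing done by
kvm_mpic_read_internal() in this patch can be modelled outside the kernel.
What follows is only a sketch under stated assumptions, not the patch's
code: struct region, read_echo_offset() and dispatch_read() are
hypothetical names made up for this illustration, a plain array stands in
for the patch's linked list of struct mem_reg, and the return value 1
mirrors the patch's convention for "no region claimed this address".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the patch's struct mem_reg (illustration only). */
struct region {
	uint64_t start;		/* first offset covered by the region */
	uint64_t size;		/* length of the region in bytes */
	int (*read)(uint64_t off, uint32_t *val);
};

/* Toy handler: reports the region-relative offset it was called with. */
int read_echo_offset(uint64_t off, uint32_t *val)
{
	*val = (uint32_t)off;
	return 0;
}

/*
 * Walk the regions and hand the access to the first one containing addr.
 * The handler sees an offset relative to the start of its own region.
 * Returns the handler's result, or 1 if no region matched.
 */
int dispatch_read(const struct region *regs, size_t n, uint64_t addr,
		  uint32_t *val)
{
	size_t i;

	assert(regs != NULL || n == 0);

	for (i = 0; i < n; i++) {
		const struct region *r = &regs[i];

		/* Written as a subtraction so start + size cannot overflow. */
		if (addr < r->start || addr - r->start >= r->size)
			continue;

		return r->read(addr - r->start, val);
	}

	return 1;
}
```

With two regions at 0x1000 (size 0x100) and 0x20000 (size 0x1000), a read
of 0x1010 reaches the first handler with offset 0x10, while 0x1100 falls
in the gap between them and is reported as unhandled.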




* [RFC PATCH v3 6/6] kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC
From: Scott Wood @ 2013-04-03  1:57 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus, Scott Wood

Enabling this capability connects the vcpu to the designated in-kernel
MPIC.  Using explicit connections between vcpus and irqchips allows
for flexibility, but the main benefit at the moment is that it
simplifies the code -- KVM doesn't need vm-global state to remember
which MPIC object is associated with this vm, and it doesn't need to
care about ordering between irqchip creation and vcpu creation.

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
 Documentation/virtual/kvm/api.txt   |    8 ++++++
 arch/powerpc/include/asm/kvm_host.h |    8 ++++++
 arch/powerpc/include/asm/kvm_ppc.h  |    2 ++
 arch/powerpc/kvm/booke.c            |    4 ++-
 arch/powerpc/kvm/mpic.c             |   49 +++++++++++++++++++++++++++++++----
 arch/powerpc/kvm/powerpc.c          |   26 +++++++++++++++++++
 include/uapi/linux/kvm.h            |    1 +
 7 files changed, 92 insertions(+), 6 deletions(-)
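
From the userspace side, connecting a vcpu as described above might look
roughly like the sketch below. It assumes a <linux/kvm.h> that already
defines KVM_CAP_IRQ_MPIC (this patch adds it; the numeric value is not
shown here) and that the capability is enabled through KVM_ENABLE_CAP on
the vcpu fd, with args[0] and args[1] as documented in the api.txt hunk.
The helper names mpic_connect_cap_init() and mpic_connect_vcpu() are
hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical helper: fill in the request for KVM_CAP_IRQ_MPIC. */
void mpic_connect_cap_init(struct kvm_enable_cap *cap, int mpic_fd,
			   uint32_t mpic_cpu)
{
	assert(cap != NULL);

	memset(cap, 0, sizeof(*cap));
	cap->cap = KVM_CAP_IRQ_MPIC;
	cap->args[0] = (uint64_t)mpic_fd;	/* MPIC device fd */
	cap->args[1] = mpic_cpu;		/* MPIC CPU number for this vcpu */
}

/*
 * Hypothetical helper: issue the request on a vcpu fd.
 * Returns the ioctl() result, i.e. -1 with errno set on failure.
 */
int mpic_connect_vcpu(int vcpu_fd, int mpic_fd, uint32_t mpic_cpu)
{
	struct kvm_enable_cap cap;

	mpic_connect_cap_init(&cap, mpic_fd, mpic_cpu);
	return ioctl(vcpu_fd, KVM_ENABLE_CAP, &cap);
}
```

Because the connection is made per vcpu, the MPIC device and the vcpus can
be created in either order, and the call is simply repeated for each vcpu
with its own CPU number.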

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index d52f3f9..4c326ae 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2728,3 +2728,11 @@ to receive the topmost interrupt vector.
 When disabled (args[0] == 0), behavior is as if this facility is unsupported.
 
 When this capability is enabled, KVM_EXIT_EPR can occur.
+
+6.6 KVM_CAP_IRQ_MPIC
+
+Architectures: ppc
+Parameters: args[0] is the MPIC device fd
+            args[1] is the MPIC CPU number for this vcpu
+
+This capability connects the vcpu to an in-kernel MPIC device.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 7e7aef9..2a2e235 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -375,6 +375,11 @@ struct kvmppc_booke_debug_reg {
 	u64 dac[KVMPPC_BOOKE_MAX_DAC];
 };
 
+#define KVMPPC_IRQ_DEFAULT	0
+#define KVMPPC_IRQ_MPIC		1
+
+struct openpic;
+
 struct kvm_vcpu_arch {
 	ulong host_stack;
 	u32 host_pid;
@@ -554,6 +559,9 @@ struct kvm_vcpu_arch {
 	unsigned long magic_page_pa; /* phys addr to map the magic page to */
 	unsigned long magic_page_ea; /* effect. addr to map the magic page to */
 
+	int irq_type;		/* one of KVMPPC_IRQ_* */
+	struct openpic *mpic;	/* KVMPPC_IRQ_MPIC */
+
 #ifdef CONFIG_KVM_BOOK3S_64_HV
 	struct kvm_vcpu_arch_shared shregs;
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 3b63b97..f54707f 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -276,6 +276,8 @@ static inline void kvmppc_set_epr(struct kvm_vcpu *vcpu, u32 epr)
 }
 
 void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu);
+int kvmppc_mpic_connect_vcpu(struct file *mpic_filp, struct kvm_vcpu *vcpu,
+			     u32 cpu);
 
 int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
 			      struct kvm_config_tlb *cfg);
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index cddc6b3..7d00222 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -430,8 +430,10 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu *vcpu,
 		if (update_epr == true) {
 			if (vcpu->arch.epr_flags & KVMPPC_EPR_USER)
 				kvm_make_request(KVM_REQ_EPR_EXIT, vcpu);
-			else if (vcpu->arch.epr_flags & KVMPPC_EPR_KERNEL)
+			else if (vcpu->arch.epr_flags & KVMPPC_EPR_KERNEL) {
+				BUG_ON(vcpu->arch.irq_type != KVMPPC_IRQ_MPIC);
 				kvmppc_mpic_set_epr(vcpu);
+			}
 		}
 
 		new_msr &= msr_mask;
diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index 8cda2fa..caffe3b 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -1159,7 +1159,7 @@ static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
 
 void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu)
 {
-	struct openpic *opp = vcpu->arch.irqchip_priv;
+	struct openpic *opp = vcpu->arch.mpic;
 	int cpu = vcpu->vcpu_id;
 	unsigned long flags;
 
@@ -1442,10 +1442,10 @@ static void map_mmio(struct openpic *opp)
 
 static void unmap_mmio(struct openpic *opp)
 {
-	BUG_ON(opp->mmio_mapped);
-	opp->mmio_mapped = false;
-
-	kvm_io_bus_unregister_dev(opp->kvm, KVM_MMIO_BUS, &opp->mmio);
+	if (opp->mmio_mapped) {
+		opp->mmio_mapped = false;
+		kvm_io_bus_unregister_dev(opp->kvm, KVM_MMIO_BUS, &opp->mmio);
+	}
 }
 
 static int set_base_addr(struct openpic *opp, struct kvm_device_attr *attr)
@@ -1681,6 +1681,45 @@ static const struct file_operations kvm_mpic_fops = {
 	.release = kvm_mpic_release,
 };
 
+int kvmppc_mpic_connect_vcpu(struct file *mpic_filp, struct kvm_vcpu *vcpu,
+			     u32 cpu)
+{
+	struct openpic *opp = mpic_filp->private_data;
+	int ret = 0;
+
+	if (mpic_filp->f_op != &kvm_mpic_fops)
+		return -EPERM;
+	if (opp->kvm != vcpu->kvm)
+		return -EPERM;
+	if (cpu >= MAX_CPU)
+		return -EPERM;
+
+	spin_lock_irq(&opp->lock);
+
+	if (opp->dst[cpu].vcpu) {
+		ret = -EEXIST;
+		goto out;
+	}
+	if (vcpu->arch.irq_type) {
+		ret = -EBUSY;
+		goto out;
+	}
+
+	opp->dst[cpu].vcpu = vcpu;
+	opp->nb_cpus = max(opp->nb_cpus, cpu + 1);
+
+	vcpu->arch.mpic = opp;
+	vcpu->arch.irq_type = KVMPPC_IRQ_MPIC;
+	atomic_inc(&opp->users);
+
+	if (opp->mpic_mode_mask == GCR_MODE_PROXY)
+		vcpu->arch.epr_flags |= KVMPPC_EPR_KERNEL;
+
+out:
+	spin_unlock_irq(&opp->lock);
+	return ret;
+}
+
 int kvm_create_mpic(struct kvm *kvm, u32 type)
 {
 	struct openpic *opp;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index c9a2972..290a905 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -25,6 +25,7 @@
 #include <linux/hrtimer.h>
 #include <linux/fs.h>
 #include <linux/slab.h>
+#include <linux/file.h>
 #include <asm/cputable.h>
 #include <asm/uaccess.h>
 #include <asm/kvm_ppc.h>
@@ -327,6 +328,9 @@ int kvm_dev_ioctl_check_extension(long ext)
 #if defined(CONFIG_KVM_E500V2) || defined(CONFIG_KVM_E500MC)
 	case KVM_CAP_SW_TLB:
 #endif
+#ifdef CONFIG_KVM_MPIC
+	case KVM_CAP_IRQ_MPIC:
+#endif
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
@@ -460,6 +464,13 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
 	tasklet_kill(&vcpu->arch.tasklet);
 
 	kvmppc_remove_vcpu_debugfs(vcpu);
+
+	switch (vcpu->arch.irq_type) {
+	case KVMPPC_IRQ_MPIC:
+		kvmppc_mpic_put(vcpu->arch.mpic);
+		break;
+	}
+
 	kvmppc_core_vcpu_free(vcpu);
 }
 
@@ -794,6 +805,21 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 		break;
 	}
 #endif
+#ifdef CONFIG_KVM_MPIC
+	case KVM_CAP_IRQ_MPIC: {
+		struct file *filp;
+
+		r = -EBADF;
+		filp = fget(cap->args[0]);
+		if (!filp)
+			break;
+
+		r = kvmppc_mpic_connect_vcpu(filp, vcpu, cap->args[1]);
+
+		fput(filp);
+		break;
+	}
+#endif
 	default:
 		r = -EINVAL;
 		break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index d8f44ef..22fce7b 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -669,6 +669,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_ARM_PSCI 87
 #define KVM_CAP_ARM_SET_DEVICE_ADDR 88
 #define KVM_CAP_DEVICE_CTRL 89
+#define KVM_CAP_IRQ_MPIC 90
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
1.7.9.5
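
For readers following along, here is a rough userspace-side sketch of the
request this capability expects.  It is not part of the patch.  The struct
below is a local stand-in for struct kvm_enable_cap so the snippet builds
without kernel headers, and the helper name is made up for illustration.
Only the capability number (90) and the meaning of args[0] and args[1] come
from the patch itself.

```c
#include <stdint.h>
#include <string.h>

/* Value taken from the include/uapi/linux/kvm.h hunk in this patch. */
#define SKETCH_KVM_CAP_IRQ_MPIC 90

/*
 * Local stand-in for struct kvm_enable_cap (assumption: real code would
 * include <linux/kvm.h> and use the kernel's definition instead).
 */
struct sketch_enable_cap {
	uint32_t cap;
	uint32_t flags;
	uint64_t args[4];
	uint8_t pad[64];
};

/* Hypothetical helper: prepare the request that ties one vcpu to an MPIC. */
static void sketch_fill_mpic_cap(struct sketch_enable_cap *req,
				 int mpic_fd, uint32_t mpic_cpu)
{
	memset(req, 0, sizeof(*req));
	req->cap = SKETCH_KVM_CAP_IRQ_MPIC;
	req->args[0] = (uint64_t)mpic_fd;	/* MPIC device fd */
	req->args[1] = mpic_cpu;		/* MPIC CPU number for this vcpu */
}
```

Real userspace would presumably hand the filled-in structure to
KVM_ENABLE_CAP on the vcpu fd, once per vcpu, after creating the MPIC
device.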

^ permalink raw reply related	[flat|nested] 261+ messages in thread

* [RFC PATCH v3 6/6] kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC
@ 2013-04-03  1:57       ` Scott Wood
  0 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-03  1:57 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus, Scott Wood

Enabling this capability connects the vcpu to the designated in-kernel
MPIC.  Using explicit connections between vcpus and irqchips allows
for flexibility, but the main benefit at the moment is that it
simplifies the code -- KVM doesn't need vm-global state to remember
which MPIC object is associated with this vm, and it doesn't need to
care about ordering between irqchip creation and vcpu creation.

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
 Documentation/virtual/kvm/api.txt   |    8 ++++++
 arch/powerpc/include/asm/kvm_host.h |    8 ++++++
 arch/powerpc/include/asm/kvm_ppc.h  |    2 ++
 arch/powerpc/kvm/booke.c            |    4 ++-
 arch/powerpc/kvm/mpic.c             |   49 +++++++++++++++++++++++++++++++----
 arch/powerpc/kvm/powerpc.c          |   26 +++++++++++++++++++
 include/uapi/linux/kvm.h            |    1 +
 7 files changed, 92 insertions(+), 6 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index d52f3f9..4c326ae 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2728,3 +2728,11 @@ to receive the topmost interrupt vector.
 When disabled (args[0] == 0), behavior is as if this facility is unsupported.
 
 When this capability is enabled, KVM_EXIT_EPR can occur.
+
+6.6 KVM_CAP_IRQ_MPIC
+
+Architectures: ppc
+Parameters: args[0] is the MPIC device fd
+            args[1] is the MPIC CPU number for this vcpu
+
+This capability connects the vcpu to an in-kernel MPIC device.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 7e7aef9..2a2e235 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -375,6 +375,11 @@ struct kvmppc_booke_debug_reg {
 	u64 dac[KVMPPC_BOOKE_MAX_DAC];
 };
 
+#define KVMPPC_IRQ_DEFAULT	0
+#define KVMPPC_IRQ_MPIC		1
+
+struct openpic;
+
 struct kvm_vcpu_arch {
 	ulong host_stack;
 	u32 host_pid;
@@ -554,6 +559,9 @@ struct kvm_vcpu_arch {
 	unsigned long magic_page_pa; /* phys addr to map the magic page to */
 	unsigned long magic_page_ea; /* effect. addr to map the magic page to */
 
+	int irq_type;		/* one of KVMPPC_IRQ_* */
+	struct openpic *mpic;	/* KVMPPC_IRQ_MPIC */
+
 #ifdef CONFIG_KVM_BOOK3S_64_HV
 	struct kvm_vcpu_arch_shared shregs;
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 3b63b97..f54707f 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -276,6 +276,8 @@ static inline void kvmppc_set_epr(struct kvm_vcpu *vcpu, u32 epr)
 }
 
 void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu);
+int kvmppc_mpic_connect_vcpu(struct file *mpic_filp, struct kvm_vcpu *vcpu,
+			     u32 cpu);
 
 int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
 			      struct kvm_config_tlb *cfg);
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index cddc6b3..7d00222 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -430,8 +430,10 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu *vcpu,
 		if (update_epr == true) {
 			if (vcpu->arch.epr_flags & KVMPPC_EPR_USER)
 				kvm_make_request(KVM_REQ_EPR_EXIT, vcpu);
-			else if (vcpu->arch.epr_flags & KVMPPC_EPR_KERNEL)
+			else if (vcpu->arch.epr_flags & KVMPPC_EPR_KERNEL) {
+				BUG_ON(vcpu->arch.irq_type != KVMPPC_IRQ_MPIC);
 				kvmppc_mpic_set_epr(vcpu);
+			}
 		}
 
 		new_msr &= msr_mask;
diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index 8cda2fa..caffe3b 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -1159,7 +1159,7 @@ static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
 
 void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu)
 {
-	struct openpic *opp = vcpu->arch.irqchip_priv;
+	struct openpic *opp = vcpu->arch.mpic;
 	int cpu = vcpu->vcpu_id;
 	unsigned long flags;
 
@@ -1442,10 +1442,10 @@ static void map_mmio(struct openpic *opp)
 
 static void unmap_mmio(struct openpic *opp)
 {
-	BUG_ON(opp->mmio_mapped);
-	opp->mmio_mapped = false;
-
-	kvm_io_bus_unregister_dev(opp->kvm, KVM_MMIO_BUS, &opp->mmio);
+	if (opp->mmio_mapped) {
+		opp->mmio_mapped = false;
+		kvm_io_bus_unregister_dev(opp->kvm, KVM_MMIO_BUS, &opp->mmio);
+	}
 }
 
 static int set_base_addr(struct openpic *opp, struct kvm_device_attr *attr)
@@ -1681,6 +1681,45 @@ static const struct file_operations kvm_mpic_fops = {
 	.release = kvm_mpic_release,
 };
 
+int kvmppc_mpic_connect_vcpu(struct file *mpic_filp, struct kvm_vcpu *vcpu,
+			     u32 cpu)
+{
+	struct openpic *opp = mpic_filp->private_data;
+	int ret = 0;
+
+	if (mpic_filp->f_op != &kvm_mpic_fops)
+		return -EPERM;
+	if (opp->kvm != vcpu->kvm)
+		return -EPERM;
+	if (cpu >= MAX_CPU)
+		return -EPERM;
+
+	spin_lock_irq(&opp->lock);
+
+	if (opp->dst[cpu].vcpu) {
+		ret = -EEXIST;
+		goto out;
+	}
+	if (vcpu->arch.irq_type) {
+		ret = -EBUSY;
+		goto out;
+	}
+
+	opp->dst[cpu].vcpu = vcpu;
+	opp->nb_cpus = max(opp->nb_cpus, cpu + 1);
+
+	vcpu->arch.mpic = opp;
+	vcpu->arch.irq_type = KVMPPC_IRQ_MPIC;
+	atomic_inc(&opp->users);
+
+	if (opp->mpic_mode_mask == GCR_MODE_PROXY)
+		vcpu->arch.epr_flags |= KVMPPC_EPR_KERNEL;
+
+out:
+	spin_unlock_irq(&opp->lock);
+	return ret;
+}
+
 int kvm_create_mpic(struct kvm *kvm, u32 type)
 {
 	struct openpic *opp;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index c9a2972..290a905 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -25,6 +25,7 @@
 #include <linux/hrtimer.h>
 #include <linux/fs.h>
 #include <linux/slab.h>
+#include <linux/file.h>
 #include <asm/cputable.h>
 #include <asm/uaccess.h>
 #include <asm/kvm_ppc.h>
@@ -327,6 +328,9 @@ int kvm_dev_ioctl_check_extension(long ext)
 #if defined(CONFIG_KVM_E500V2) || defined(CONFIG_KVM_E500MC)
 	case KVM_CAP_SW_TLB:
 #endif
+#ifdef CONFIG_KVM_MPIC
+	case KVM_CAP_IRQ_MPIC:
+#endif
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
@@ -460,6 +464,13 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
 	tasklet_kill(&vcpu->arch.tasklet);
 
 	kvmppc_remove_vcpu_debugfs(vcpu);
+
+	switch (vcpu->arch.irq_type) {
+	case KVMPPC_IRQ_MPIC:
+		kvmppc_mpic_put(vcpu->arch.mpic);
+		break;
+	}
+
 	kvmppc_core_vcpu_free(vcpu);
 }
 
@@ -794,6 +805,21 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 		break;
 	}
 #endif
+#ifdef CONFIG_KVM_MPIC
+	case KVM_CAP_IRQ_MPIC: {
+		struct file *filp;
+
+		r = -EBADF;
+		filp = fget(cap->args[0]);
+		if (!filp)
+			break;
+
+		r = kvmppc_mpic_connect_vcpu(filp, vcpu, cap->args[1]);
+
+		fput(filp);
+		break;
+	}
+#endif
 	default:
 		r = -EINVAL;
 		break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index d8f44ef..22fce7b 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -669,6 +669,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_ARM_PSCI 87
 #define KVM_CAP_ARM_SET_DEVICE_ADDR 88
 #define KVM_CAP_DEVICE_CTRL 89
+#define KVM_CAP_IRQ_MPIC 90
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
1.7.9.5



^ permalink raw reply related	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v2 1/6] kvm: add device control API
  2013-04-03  1:19         ` Scott Wood
@ 2013-04-03  2:17           ` Paul Mackerras
  -1 siblings, 0 replies; 261+ messages in thread
From: Paul Mackerras @ 2013-04-03  2:17 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm

On Tue, Apr 02, 2013 at 08:19:56PM -0500, Scott Wood wrote:
> On 04/02/2013 08:02:39 PM, Paul Mackerras wrote:
> >On Mon, Apr 01, 2013 at 05:47:48PM -0500, Scott Wood wrote:
> >> +4.79 KVM_CREATE_DEVICE
> >> +
> >> +Capability: KVM_CAP_DEVICE_CTRL
> >
> >I notice this patch doesn't add this capability;
> 
> Yes, it does (see below).
> 
> >you add it in a later patch.
> 
> Maybe you're thinking of KVM_CAP_IRQ_MPIC?

No, I was referring to the addition to kvm_dev_ioctl_check_extension()
of a KVM_CAP_DEVICE_CTRL case.  Since this patch adds the code to handle
KVM_CREATE_DEVICE ioctl it should also add the code to return 1 if
userspace queries the KVM_CAP_DEVICE_CTRL capability.

> >> +/* ioctl for vm fd */
> >> +#define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xe0, struct
> >kvm_create_device)
> >
> >This define should go with the other VM ioctls, otherwise the next
> >person to add a VM ioctl will probably miss it and reuse the 0xe0
> >code.
> 
> That's actually why I moved it to a new section, with device control
> ioctls getting their own range, as the legacy "device model" and
> some other things did.  0xe0 is not the next ioctl that would be
> used for either vm or vcpu.  The ioctl numbering is actually already
> a mess, with sometimes care being taken to keep vcpu and vm ioctls
> from overlapping, but on other places overlapping does happen.  I'm
> not sure what exactly I should do here.

Well, even if you are using a new range, I still think that
KVM_CREATE_DEVICE, being a VM ioctl, should go next to the other VM
ioctls.  I guess it's ultimately up to the maintainers.

Paul.
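
To make the first point concrete, this is roughly the shape of the missing
piece, written as a standalone model rather than as kernel code.  The
function name is invented for illustration, and the capability number is
the one from the kvm.h hunk.  The real handler is
kvm_dev_ioctl_check_extension() and has many more cases.

```c
/* Capability number as it appears in the include/uapi/linux/kvm.h hunk. */
#define SKETCH_KVM_CAP_DEVICE_CTRL 89

/*
 * Simplified model of a check-extension handler: report 1 for a
 * capability whose ioctls are wired up in the same patch, 0 otherwise.
 */
static int sketch_check_extension(long ext)
{
	switch (ext) {
	case SKETCH_KVM_CAP_DEVICE_CTRL:
		return 1;
	default:
		return 0;
	}
}
```

With something like this in the same patch, userspace that probes the
capability before calling KVM_CREATE_DEVICE gets a consistent answer at
every point in the series.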

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v2 1/6] kvm: add device control API
@ 2013-04-03  2:17           ` Paul Mackerras
  0 siblings, 0 replies; 261+ messages in thread
From: Paul Mackerras @ 2013-04-03  2:17 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm

On Tue, Apr 02, 2013 at 08:19:56PM -0500, Scott Wood wrote:
> On 04/02/2013 08:02:39 PM, Paul Mackerras wrote:
> >On Mon, Apr 01, 2013 at 05:47:48PM -0500, Scott Wood wrote:
> >> +4.79 KVM_CREATE_DEVICE
> >> +
> >> +Capability: KVM_CAP_DEVICE_CTRL
> >
> >I notice this patch doesn't add this capability;
> 
> Yes, it does (see below).
> 
> >you add it in a later patch.
> 
> Maybe you're thinking of KVM_CAP_IRQ_MPIC?

No, I was referring to the addition to kvm_dev_ioctl_check_extension()
of a KVM_CAP_DEVICE_CTRL case.  Since this patch adds the code to handle
KVM_CREATE_DEVICE ioctl it should also add the code to return 1 if
userspace queries the KVM_CAP_DEVICE_CTRL capability.

> >> +/* ioctl for vm fd */
> >> +#define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xe0, struct
> >kvm_create_device)
> >
> >This define should go with the other VM ioctls, otherwise the next
> >person to add a VM ioctl will probably miss it and reuse the 0xe0
> >code.
> 
> That's actually why I moved it to a new section, with device control
> ioctls getting their own range, as the legacy "device model" and
> some other things did.  0xe0 is not the next ioctl that would be
> used for either vm or vcpu.  The ioctl numbering is actually already
> a mess, with sometimes care being taken to keep vcpu and vm ioctls
> from overlapping, but on other places overlapping does happen.  I'm
> not sure what exactly I should do here.

Well, even if you are using a new range, I still think that
KVM_CREATE_DEVICE, being a VM ioctl, should go next to the other VM
ioctls.  I guess it's ultimately up to the maintainers.

Paul.

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v2 1/6] kvm: add device control API
  2013-04-03  2:17           ` Paul Mackerras
@ 2013-04-03 13:22             ` Alexander Graf
  -1 siblings, 0 replies; 261+ messages in thread
From: Alexander Graf @ 2013-04-03 13:22 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Scott Wood, kvm-ppc, kvm


On 03.04.2013, at 04:17, Paul Mackerras wrote:

> On Tue, Apr 02, 2013 at 08:19:56PM -0500, Scott Wood wrote:
>> On 04/02/2013 08:02:39 PM, Paul Mackerras wrote:
>>> On Mon, Apr 01, 2013 at 05:47:48PM -0500, Scott Wood wrote:
>>>> +4.79 KVM_CREATE_DEVICE
>>>> +
>>>> +Capability: KVM_CAP_DEVICE_CTRL
>>> 
>>> I notice this patch doesn't add this capability;
>> 
>> Yes, it does (see below).
>> 
>>> you add it in a later patch.
>> 
>> Maybe you're thinking of KVM_CAP_IRQ_MPIC?
> 
> No, I was referring to the addition to kvm_dev_ioctl_check_extension()
> of a KVM_CAP_DEVICE_CTRL case.  Since this patch adds the code to handle
> KVM_CREATE_DEVICE ioctl it should also add the code to return 1 if
> userspace queries the KVM_CAP_DEVICE_CTRL capability.
> 
>>>> +/* ioctl for vm fd */
>>>> +#define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xe0, struct
>>> kvm_create_device)
>>> 
>>> This define should go with the other VM ioctls, otherwise the next
>>> person to add a VM ioctl will probably miss it and reuse the 0xe0
>>> code.
>> 
>> That's actually why I moved it to a new section, with device control
>> ioctls getting their own range, as the legacy "device model" and
>> some other things did.  0xe0 is not the next ioctl that would be
>> used for either vm or vcpu.  The ioctl numbering is actually already
>> a mess, with sometimes care being taken to keep vcpu and vm ioctls
>> from overlapping, but on other places overlapping does happen.  I'm
>> not sure what exactly I should do here.
> 
> Well, even if you are using a new range, I still think that
> KVM_CREATE_DEVICE, being a VM ioctl, should go next to the other VM
> ioctls.  I guess it's ultimately up to the maintainers.

I agree. Things get confusing for VM ioctls otherwise.


Alex
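
As background for the numbering discussion, a small sketch of what the
_IOWR() encoding of KVM_CREATE_DEVICE contains.  The struct is a local copy
of the one in the patch using fixed-width types, the SKETCH_ names are
invented here, and 0xAE is assumed to be the KVMIO type byte from
<linux/kvm.h>.

```c
#include <stdint.h>
#include <linux/ioctl.h>

#define SKETCH_KVMIO 0xAE	/* assumed value of KVMIO */

/* Local copy of struct kvm_create_device from the patch. */
struct sketch_create_device {
	uint32_t type;	/* in: KVM_DEV_TYPE_xxx */
	uint32_t fd;	/* out: device handle */
	uint32_t flags;	/* in: KVM_CREATE_DEVICE_xxx */
};

#define SKETCH_KVM_CREATE_DEVICE \
	_IOWR(SKETCH_KVMIO, 0xe0, struct sketch_create_device)

/* Extract the sequence number, the part a later author might reuse. */
static unsigned int sketch_ioctl_nr(unsigned long req)
{
	return _IOC_NR(req);
}
```

The sequence number (0xe0) is only one field of the encoded request, next
to the direction, the argument size and the type byte.  A reader scanning
the header for the next free number mostly looks at that one field, which
is why where the define is placed matters.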


^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v2 1/6] kvm: add device control API
@ 2013-04-03 13:22             ` Alexander Graf
  0 siblings, 0 replies; 261+ messages in thread
From: Alexander Graf @ 2013-04-03 13:22 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Scott Wood, kvm-ppc, kvm


On 03.04.2013, at 04:17, Paul Mackerras wrote:

> On Tue, Apr 02, 2013 at 08:19:56PM -0500, Scott Wood wrote:
>> On 04/02/2013 08:02:39 PM, Paul Mackerras wrote:
>>> On Mon, Apr 01, 2013 at 05:47:48PM -0500, Scott Wood wrote:
>>>> +4.79 KVM_CREATE_DEVICE
>>>> +
>>>> +Capability: KVM_CAP_DEVICE_CTRL
>>> 
>>> I notice this patch doesn't add this capability;
>> 
>> Yes, it does (see below).
>> 
>>> you add it in a later patch.
>> 
>> Maybe you're thinking of KVM_CAP_IRQ_MPIC?
> 
> No, I was referring to the addition to kvm_dev_ioctl_check_extension()
> of a KVM_CAP_DEVICE_CTRL case.  Since this patch adds the code to handle
> KVM_CREATE_DEVICE ioctl it should also add the code to return 1 if
> userspace queries the KVM_CAP_DEVICE_CTRL capability.
> 
>>>> +/* ioctl for vm fd */
>>>> +#define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xe0, struct
>>> kvm_create_device)
>>> 
>>> This define should go with the other VM ioctls, otherwise the next
>>> person to add a VM ioctl will probably miss it and reuse the 0xe0
>>> code.
>> 
>> That's actually why I moved it to a new section, with device control
>> ioctls getting their own range, as the legacy "device model" and
>> some other things did.  0xe0 is not the next ioctl that would be
>> used for either vm or vcpu.  The ioctl numbering is actually already
>> a mess, with sometimes care being taken to keep vcpu and vm ioctls
>> from overlapping, but on other places overlapping does happen.  I'm
>> not sure what exactly I should do here.
> 
> Well, even if you are using a new range, I still think that
> KVM_CREATE_DEVICE, being a VM ioctl, should go next to the other VM
> ioctls.  I guess it's ultimately up to the maintainers.

I agree. Things get confusing for VM ioctls otherwise.


Alex


^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v3 1/6] kvm: add device control API
  2013-04-03  1:57       ` Scott Wood
@ 2013-04-03 15:13         ` Alexander Graf
  -1 siblings, 0 replies; 261+ messages in thread
From: Alexander Graf @ 2013-04-03 15:13 UTC (permalink / raw)
  To: Scott Wood; +Cc: kvm-ppc, kvm, paulus


On 03.04.2013, at 03:57, Scott Wood wrote:

> Currently, devices that are emulated inside KVM are configured in a
> hardcoded manner based on an assumption that any given architecture
> only has one way to do it.  If there's any need to access device state,
> it is done through inflexible one-purpose-only IOCTLs (e.g.
> KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> cumbersome and depletes a limited numberspace.
> 
> This API provides a mechanism to instantiate a device of a certain
> type, returning an ID that can be used to set/get attributes of the
> device.  Attributes may include configuration parameters (e.g.
> register base address), device state, operational commands, etc.  It
> is similar to the ONE_REG API, except that it acts on devices rather
> than vcpus.
> 
> Both device types and individual attributes can be tested without having
> to create the device or get/set the attribute, without the need for
> separately managing enumerated capabilities.
> 
> Signed-off-by: Scott Wood <scottwood@freescale.com>
> ---
> v3: remove some changes that were merged into this patch by accident,
> and fix the error documentation for KVM_CREATE_DEVICE.
> 
> NOTE: I had some difficulty figuring out what ioctl numbers I should
> assign...  it seems that at one point care was taken to keep vcpu and
> vm ioctls separate, but some overlap exists now (despite not exhausting
> the ioctl space).  Some of that was my fault, but not all of it. :-)
> I moved to a new ioctl range for device control -- please let me know
> if there's something else you'd prefer I do.
> ---
> Documentation/virtual/kvm/api.txt        |   70 ++++++++++++++++++++++++++++++
> Documentation/virtual/kvm/devices/README |    1 +
> include/uapi/linux/kvm.h                 |   27 ++++++++++++
> virt/kvm/kvm_main.c                      |   31 +++++++++++++
> 4 files changed, 129 insertions(+)
> create mode 100644 Documentation/virtual/kvm/devices/README
> 
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 976eb65..d52f3f9 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2173,6 +2173,76 @@ header; first `n_valid' valid entries with contents from the data
> written, then `n_invalid' invalid entries, invalidating any previously
> valid entries found.
> 
> +4.79 KVM_CREATE_DEVICE
> +
> +Capability: KVM_CAP_DEVICE_CTRL
> +Type: vm ioctl
> +Parameters: struct kvm_create_device (in/out)
> +Returns: 0 on success, -1 on error
> +Errors:
> +  ENODEV: The device type is unknown or unsupported
> +  EEXIST: Device already created, and this type of device may not
> +          be instantiated multiple times
> +
> +  Other error conditions may be defined by individual device types or
> +  have their standard meanings.
> +
> +Creates an emulated device in the kernel.  The file descriptor returned
> +in fd can be used with KVM_SET/GET/HAS_DEVICE_ATTR.
> +
> +If the KVM_CREATE_DEVICE_TEST flag is set, only test whether the
> +device type is supported (not necessarily whether it can be created
> +in the current vm).
> +
> +Individual devices should not define flags.  Attributes should be used
> +for specifying any behavior that is not implied by the device type
> +number.
> +
> +struct kvm_create_device {
> +	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
> +	__u32	fd;	/* out: device handle */
> +	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
> +};
> +
> +4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR
> +
> +Capability: KVM_CAP_DEVICE_CTRL
> +Type: device ioctl
> +Parameters: struct kvm_device_attr
> +Returns: 0 on success, -1 on error
> +Errors:
> +  ENXIO:  The group or attribute is unknown/unsupported for this device
> +  EPERM:  The attribute cannot (currently) be accessed this way
> +          (e.g. read-only attribute, or attribute that only makes
> +          sense when the device is in a different state)
> +
> +  Other error conditions may be defined by individual device types.
> +
> +Gets/sets a specified piece of device configuration and/or state.  The
> +semantics are device-specific.  See individual device documentation in
> +the "devices" directory.  As with ONE_REG, the size of the data
> +transferred is defined by the particular attribute.
> +
> +struct kvm_device_attr {
> +	__u32	flags;		/* no flags currently defined */
> +	__u32	group;		/* device-defined */
> +	__u64	attr;		/* group-defined */
> +	__u64	addr;		/* userspace address of attr data */
> +};
> +
> +4.81 KVM_HAS_DEVICE_ATTR
> +
> +Capability: KVM_CAP_DEVICE_CTRL
> +Type: device ioctl
> +Parameters: struct kvm_device_attr
> +Returns: 0 on success, -1 on error
> +Errors:
> +  ENXIO:  The group or attribute is unknown/unsupported for this device
> +
> +Tests whether a device supports a particular attribute.  A successful
> +return indicates the attribute is implemented.  It does not necessarily
> +indicate that the attribute can be read or written in the device's
> +current state.  "addr" is ignored.
> 
> 4.77 KVM_ARM_VCPU_INIT
> 
> diff --git a/Documentation/virtual/kvm/devices/README b/Documentation/virtual/kvm/devices/README
> new file mode 100644
> index 0000000..34a6983
> --- /dev/null
> +++ b/Documentation/virtual/kvm/devices/README
> @@ -0,0 +1 @@
> +This directory contains specific device bindings for KVM_CAP_DEVICE_CTRL.
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 74d0ff3..20ce2d2 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info {
> #define KVM_CAP_PPC_EPR 86
> #define KVM_CAP_ARM_PSCI 87
> #define KVM_CAP_ARM_SET_DEVICE_ADDR 88
> +#define KVM_CAP_DEVICE_CTRL 89
> 
> #ifdef KVM_CAP_IRQ_ROUTING
> 
> @@ -909,6 +910,32 @@ struct kvm_s390_ucas_mapping {
> #define KVM_ARM_SET_DEVICE_ADDR	  _IOW(KVMIO,  0xab, struct kvm_arm_device_addr)
> 
> /*
> + * Device control API, available with KVM_CAP_DEVICE_CTRL
> + */
> +#define KVM_CREATE_DEVICE_TEST		1
> +
> +struct kvm_create_device {
> +	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
> +	__u32	fd;	/* out: device handle */
> +	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
> +};
> +
> +struct kvm_device_attr {
> +	__u32	flags;		/* no flags currently defined */
> +	__u32	group;		/* device-defined */
> +	__u64	attr;		/* group-defined */
> +	__u64	addr;		/* userspace address of attr data */
> +};

Please move these above the ioctl number definitions, where all the other structs already are.


Alex
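
For anyone skimming the quoted documentation, a minimal sketch of how the
two structures might be filled in from userspace.  The structs are local
copies of the ones in the patch, the helper names are invented, and the
type, group and attribute numbers a caller passes are placeholders, since
those are defined per device.

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_KVM_CREATE_DEVICE_TEST 1	/* value from the patch */

/* Local copies of the structs from the patch, with fixed-width types. */
struct sketch_create_device {
	uint32_t type;
	uint32_t fd;
	uint32_t flags;
};

struct sketch_device_attr {
	uint32_t flags;
	uint32_t group;
	uint64_t attr;
	uint64_t addr;
};

/* Build a probe request: asks whether `type` is supported, creates nothing. */
static struct sketch_create_device sketch_probe_request(uint32_t type)
{
	struct sketch_create_device cd;

	memset(&cd, 0, sizeof(cd));
	cd.type = type;
	cd.flags = SKETCH_KVM_CREATE_DEVICE_TEST;
	return cd;
}

/* Describe one attribute; addr carries a userspace pointer to the data. */
static struct sketch_device_attr sketch_attr_request(uint32_t group,
						     uint64_t attr,
						     void *data)
{
	struct sketch_device_attr da;

	memset(&da, 0, sizeof(da));
	da.group = group;
	da.attr = attr;
	da.addr = (uint64_t)(uintptr_t)data;
	return da;
}
```

The first structure would go to KVM_CREATE_DEVICE on the vm fd and the
second to KVM_SET_DEVICE_ATTR, KVM_GET_DEVICE_ATTR or KVM_HAS_DEVICE_ATTR
on the returned device fd.  As with ONE_REG, the size of the data behind
addr is defined by the attribute.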

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v3 1/6] kvm: add device control API
@ 2013-04-03 15:13         ` Alexander Graf
  0 siblings, 0 replies; 261+ messages in thread
From: Alexander Graf @ 2013-04-03 15:13 UTC (permalink / raw)
  To: Scott Wood; +Cc: kvm-ppc, kvm, paulus


On 03.04.2013, at 03:57, Scott Wood wrote:

> Currently, devices that are emulated inside KVM are configured in a
> hardcoded manner based on an assumption that any given architecture
> only has one way to do it.  If there's any need to access device state,
> it is done through inflexible one-purpose-only IOCTLs (e.g.
> KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> cumbersome and depletes a limited numberspace.
> 
> This API provides a mechanism to instantiate a device of a certain
> type, returning an ID that can be used to set/get attributes of the
> device.  Attributes may include configuration parameters (e.g.
> register base address), device state, operational commands, etc.  It
> is similar to the ONE_REG API, except that it acts on devices rather
> than vcpus.
> 
> Both device types and individual attributes can be tested without having
> to create the device or get/set the attribute, without the need for
> separately managing enumerated capabilities.
> 
> Signed-off-by: Scott Wood <scottwood@freescale.com>
> ---
> v3: remove some changes that were merged into this patch by accident,
> and fix the error documentation for KVM_CREATE_DEVICE.
> 
> NOTE: I had some difficulty figuring out what ioctl numbers I should
> assign...  it seems that at one point care was taken to keep vcpu and
> vm ioctls separate, but some overlap exists now (despite not exhausting
> the ioctl space).  Some of that was my fault, but not all of it. :-)
> I moved to a new ioctl range for device control -- please let me know
> if there's something else you'd prefer I do.
> ---
> Documentation/virtual/kvm/api.txt        |   70 ++++++++++++++++++++++++++++++
> Documentation/virtual/kvm/devices/README |    1 +
> include/uapi/linux/kvm.h                 |   27 ++++++++++++
> virt/kvm/kvm_main.c                      |   31 +++++++++++++
> 4 files changed, 129 insertions(+)
> create mode 100644 Documentation/virtual/kvm/devices/README
> 
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 976eb65..d52f3f9 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2173,6 +2173,76 @@ header; first `n_valid' valid entries with contents from the data
> written, then `n_invalid' invalid entries, invalidating any previously
> valid entries found.
> 
> +4.79 KVM_CREATE_DEVICE
> +
> +Capability: KVM_CAP_DEVICE_CTRL
> +Type: vm ioctl
> +Parameters: struct kvm_create_device (in/out)
> +Returns: 0 on success, -1 on error
> +Errors:
> +  ENODEV: The device type is unknown or unsupported
> +  EEXIST: Device already created, and this type of device may not
> +          be instantiated multiple times
> +
> +  Other error conditions may be defined by individual device types or
> +  have their standard meanings.
> +
> +Creates an emulated device in the kernel.  The file descriptor returned
> +in fd can be used with KVM_SET/GET/HAS_DEVICE_ATTR.
> +
> +If the KVM_CREATE_DEVICE_TEST flag is set, only test whether the
> +device type is supported (not necessarily whether it can be created
> +in the current vm).
> +
> +Individual devices should not define flags.  Attributes should be used
> +for specifying any behavior that is not implied by the device type
> +number.
> +
> +struct kvm_create_device {
> +	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
> +	__u32	fd;	/* out: device handle */
> +	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
> +};
> +
> +4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR
> +
> +Capability: KVM_CAP_DEVICE_CTRL
> +Type: device ioctl
> +Parameters: struct kvm_device_attr
> +Returns: 0 on success, -1 on error
> +Errors:
> +  ENXIO:  The group or attribute is unknown/unsupported for this device
> +  EPERM:  The attribute cannot (currently) be accessed this way
> +          (e.g. read-only attribute, or attribute that only makes
> +          sense when the device is in a different state)
> +
> +  Other error conditions may be defined by individual device types.
> +
> +Gets/sets a specified piece of device configuration and/or state.  The
> +semantics are device-specific.  See individual device documentation in
> +the "devices" directory.  As with ONE_REG, the size of the data
> +transferred is defined by the particular attribute.
> +
> +struct kvm_device_attr {
> +	__u32	flags;		/* no flags currently defined */
> +	__u32	group;		/* device-defined */
> +	__u64	attr;		/* group-defined */
> +	__u64	addr;		/* userspace address of attr data */
> +};
> +
> +4.81 KVM_HAS_DEVICE_ATTR
> +
> +Capability: KVM_CAP_DEVICE_CTRL
> +Type: device ioctl
> +Parameters: struct kvm_device_attr
> +Returns: 0 on success, -1 on error
> +Errors:
> +  ENXIO:  The group or attribute is unknown/unsupported for this device
> +
> +Tests whether a device supports a particular attribute.  A successful
> +return indicates the attribute is implemented.  It does not necessarily
> +indicate that the attribute can be read or written in the device's
> +current state.  "addr" is ignored.
> 
> 4.77 KVM_ARM_VCPU_INIT
> 
> diff --git a/Documentation/virtual/kvm/devices/README b/Documentation/virtual/kvm/devices/README
> new file mode 100644
> index 0000000..34a6983
> --- /dev/null
> +++ b/Documentation/virtual/kvm/devices/README
> @@ -0,0 +1 @@
> +This directory contains specific device bindings for KVM_CAP_DEVICE_CTRL.
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 74d0ff3..20ce2d2 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info {
> #define KVM_CAP_PPC_EPR 86
> #define KVM_CAP_ARM_PSCI 87
> #define KVM_CAP_ARM_SET_DEVICE_ADDR 88
> +#define KVM_CAP_DEVICE_CTRL 89
> 
> #ifdef KVM_CAP_IRQ_ROUTING
> 
> @@ -909,6 +910,32 @@ struct kvm_s390_ucas_mapping {
> #define KVM_ARM_SET_DEVICE_ADDR	  _IOW(KVMIO,  0xab, struct kvm_arm_device_addr)
> 
> /*
> + * Device control API, available with KVM_CAP_DEVICE_CTRL
> + */
> +#define KVM_CREATE_DEVICE_TEST		1
> +
> +struct kvm_create_device {
> +	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
> +	__u32	fd;	/* out: device handle */
> +	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
> +};
> +
> +struct kvm_device_attr {
> +	__u32	flags;		/* no flags currently defined */
> +	__u32	group;		/* device-defined */
> +	__u64	attr;		/* group-defined */
> +	__u64	addr;		/* userspace address of attr data */
> +};

Please move these above the ioctl number definitions, where all the other structs already are.


Alex


^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v3 5/6] kvm/ppc/mpic: in-kernel MPIC emulation
  2013-04-03  1:57       ` Scott Wood
@ 2013-04-03 15:55         ` Gleb Natapov
  -1 siblings, 0 replies; 261+ messages in thread
From: Gleb Natapov @ 2013-04-03 15:55 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm, paulus

On Tue, Apr 02, 2013 at 08:57:52PM -0500, Scott Wood wrote:
> Hook the MPIC code up to the KVM interfaces, add locking, etc.
> 
> TODO: irqfd support, split up into multiple patches, KVM_IRQ_LINE
> support
> 
> Signed-off-by: Scott Wood <scottwood@freescale.com>
> ---
[skip]

> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 20ce2d2..d8f44ef 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -927,6 +927,15 @@ struct kvm_device_attr {
>  	__u64	addr;		/* userspace address of attr data */
>  };
>  
> +#define KVM_DEV_TYPE_FSL_MPIC_20	1
> +#define KVM_DEV_TYPE_FSL_MPIC_42	2
> +
> +#define KVM_DEV_MPIC_GRP_MISC		1
> +#define   KVM_DEV_MPIC_BASE_ADDR	0	/* 64-bit */
> +
> +#define KVM_DEV_MPIC_GRP_REGISTER	2	/* 32-bit */
> +#define KVM_DEV_MPIC_GRP_IRQ_ACTIVE	3	/* 32-bit */
Why not put them in arch specific header?

> +
>  /* ioctl for vm fd */
>  #define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xe0, struct kvm_create_device)
>  
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ed033c0..e325f5d 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2164,6 +2164,15 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
>  	bool test = cd->flags & KVM_CREATE_DEVICE_TEST;
>  
>  	switch (cd->type) {
> +#ifdef CONFIG_KVM_MPIC
> +	case KVM_DEV_TYPE_FSL_MPIC_20:
> +	case KVM_DEV_TYPE_FSL_MPIC_42: {
> +		if (test)
> +			return 0;
> +
> +		return kvm_create_mpic(kvm, cd->type);
> +	}
> +#endif
>  	default:
>  		return -ENODEV;
>  	}
> -- 
> 1.7.9.5
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
			Gleb.

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v3 5/6] kvm/ppc/mpic: in-kernel MPIC emulation
  2013-04-03  1:57       ` Scott Wood
@ 2013-04-03 16:19         ` Alexander Graf
  -1 siblings, 0 replies; 261+ messages in thread
From: Alexander Graf @ 2013-04-03 16:19 UTC (permalink / raw)
  To: Scott Wood; +Cc: kvm-ppc, kvm, paulus


On 03.04.2013, at 03:57, Scott Wood wrote:

> Hook the MPIC code up to the KVM interfaces, add locking, etc.
> 
> TODO: irqfd support, split up into multiple patches, KVM_IRQ_LINE
> support
> 
> Signed-off-by: Scott Wood <scottwood@freescale.com>
> ---
> v3: mpic_put -> kvmppc_mpic_put
> 
> 

[...]

> +void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu);
> +
> int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
> 			      struct kvm_config_tlb *cfg);
> int kvm_vcpu_ioctl_dirty_tlb(struct kvm_vcpu *vcpu,
> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
> index 63c67ec..a87139b 100644
> --- a/arch/powerpc/kvm/Kconfig
> +++ b/arch/powerpc/kvm/Kconfig
> @@ -151,6 +151,11 @@ config KVM_E500MC
> 
> 	  If unsure, say N.
> 
> +config KVM_MPIC
> +	bool "KVM in-kernel MPIC emulation"
> +	depends on KVM

This should probably depend on FSL KVM for now, until someone adds support for other MPIC revisions.

> +
> +
> source drivers/vhost/Kconfig
> 
> endif # VIRTUALIZATION
> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
> index b772ede..4a2277a 100644
> --- a/arch/powerpc/kvm/Makefile
> +++ b/arch/powerpc/kvm/Makefile
> @@ -103,6 +103,8 @@ kvm-book3s_32-objs := \
> 	book3s_32_mmu.o
> kvm-objs-$(CONFIG_KVM_BOOK3S_32) := $(kvm-book3s_32-objs)
> 
> +kvm-objs-$(CONFIG_KVM_MPIC) += mpic.o
> +
> kvm-objs := $(kvm-objs-m) $(kvm-objs-y)
> 
> obj-$(CONFIG_KVM_440) += kvm.o
> 

[...]

> struct irq_dest {
> +	struct kvm_vcpu *vcpu;
> +
> 	int32_t ctpr;		/* CPU current task priority */
> 	struct irq_queue raised;
> 	struct irq_queue servicing;
> -	qemu_irq *irqs;
> 
> 	/* Count of IRQ sources asserting on non-INT outputs */
> -	uint32_t outputs_active[OPENPIC_OUTPUT_NB];
> +	uint32_t outputs_active[NUM_OUTPUTS];
> };
> 
> +struct openpic;

Isn't this superfluous?

> +
> struct openpic {
> +	struct kvm *kvm;
> +	struct kvm_io_device mmio;
> +	struct list_head mmio_regions;
> +	atomic_t users;
> +	bool mmio_mapped;
> +
> +	gpa_t reg_base;
> +	spinlock_t lock;
> +
> 	/* Behavior control */
> 	struct fsl_mpic_info *fsl;
> 	uint32_t model;
> @@ -208,6 +231,47 @@ struct openpic {
> 	uint32_t irq_msi;
> };
> 
> 

[...]

> -static uint64_t openpic_gbl_read(void *opaque, gpa_t addr, unsigned len)
> +static int openpic_gbl_read(void *opaque, gpa_t addr, u32 *ptr)
> {
> 	struct openpic *opp = opaque;
> -	uint32_t retval;
> +	u32 retval;
> 
> -	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
> +	pr_debug("%s: addr %#llx\n", __func__, addr);
> 	retval = 0xFFFFFFFF;
> 	if (addr & 0xF)
> -		return retval;
> +		goto out;
> 
> 	switch (addr) {
> 	case 0x1000:		/* FRR */
> 		retval = opp->frr;
> +		retval |= (opp->nb_cpus - 1) << FRR_NCPU_SHIFT;
> 		break;
> 	case 0x1020:		/* GCR */
> 		retval = opp->gcr;
> @@ -731,8 +771,8 @@ static uint64_t openpic_gbl_read(void *opaque, gpa_t addr, unsigned len)
> 	case 0x90:
> 	case 0xA0:
> 	case 0xB0:
> -		retval =
> -		    openpic_cpu_read_internal(opp, addr, get_current_cpu());
> +		retval = openpic_cpu_read_internal(opp, addr,
> +			&retval, get_current_cpu());

This looks bogus. You're passing &retval and overwrite it with the return value right after the function returns?

> 		break;
> 	case 0x10A0:		/* IPI_IVPR */
> 	case 0x10B0:
> @@ -750,28 +790,28 @@ static uint64_t openpic_gbl_read(void *opaque, gpa_t addr, unsigned len)
> 	default:
> 		break;
> 	}
> -	pr_debug("%s: => 0x%08x\n", __func__, retval);
> 
> -	return retval;
> +out:
> +	pr_debug("%s: => 0x%08x\n", __func__, retval);
> +	*ptr = retval;
> +	return 0;
> }
> 

[...]

> 
> +static int kvm_mpic_read(struct kvm_io_device *this, gpa_t addr,
> +			 int len, void *ptr)
> +{
> +	struct openpic *opp = container_of(this, struct openpic, mmio);
> +	int ret;
> +
> +	/*
> +	 * Technically only 32-bit accesses are allowed, but be nice to
> +	 * people dumping registers a byte at a time -- it works in real
> +	 * hardware (reads only, not writes).

Do 16-bit accesses work in real hardware?

> +	 */
> +	if (len == 4) {
> +		if (addr & 3) {
> +			pr_debug("%s: bad alignment %llx/%d\n",
> +				 __func__, addr, len);
> +			return -EINVAL;
> +		}

if (addr & (len - 1))

Then the read_internal call can be shared between the different access sizes, no?
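
Roughly what I have in mind, as a userspace sketch rather than a drop-in replacement. Assumptions: read_internal_sketch() is a hypothetical stand-in for kvm_mpic_read_internal() with made-up register contents, locking and the reg_base subtraction are left out, and 16-bit accesses are still rejected because their hardware behavior is the open question above:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for kvm_mpic_read_internal(): always a full,
 * naturally aligned 32-bit register read. */
static int read_internal_sketch(uint64_t offset, uint32_t *val)
{
	assert((offset & 3) == 0);
	*val = 0xA1B2C3D4u ^ (uint32_t)offset;	/* made-up register contents */
	return 0;
}

/* One alignment check and one internal read for both access sizes. */
static int mpic_read_sketch(uint64_t addr, int len, void *ptr)
{
	uint32_t val;
	int ret;

	if (len != 1 && len != 4)
		return -EINVAL;		/* 16-bit deliberately left out */
	if (addr & (uint64_t)(len - 1))
		return -EINVAL;		/* natural alignment for this size */

	ret = read_internal_sketch(addr & ~(uint64_t)3, &val);
	if (ret)
		return ret;

	/* Copy len bytes starting at the byte offset within the word,
	 * in host byte order, like the union in the patch. */
	memcpy(ptr, (const uint8_t *)&val + (addr & 3), (size_t)len);
	return 0;
}
```

With len == 1 the check is always satisfied, so only the 32-bit case can fail on alignment, and the byte case simply picks its byte out of the same word.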

> +
> +		spin_lock_irq(&opp->lock);
> +		ret = kvm_mpic_read_internal(opp, addr - opp->reg_base, ptr);
> +		spin_unlock_irq(&opp->lock);
> +
> +		pr_debug("%s: addr %llx ret %d len 4 val %x\n",
> +			 __func__, addr, ret, *(const u32 *)ptr);
> +	} else if (len == 1) {
> +		union {
> +			u32 val;
> +			u8 bytes[4];
> +		} u;
> +
> +		spin_lock_irq(&opp->lock);
> +		ret = kvm_mpic_read_internal(opp, addr - opp->reg_base, &u.val);
> +		spin_unlock_irq(&opp->lock);
> +
> +		*(u8 *)ptr = u.bytes[addr & 3];
> +
> +		pr_debug("%s: addr %llx ret %d len 1 val %x\n",
> +			 __func__, addr, ret, *(const u8 *)ptr);
> +	} else {
> +		pr_debug("%s: bad length %d\n", __func__, len);
> +		return -EINVAL;
> +	}
> +
> +	return ret;
> +}
> +

[...]

> 
> +static int mpic_set_attr(struct openpic *opp, struct kvm_device_attr *attr)
> +{
> +	u32 attr32;
> +
> +	switch (attr->group) {
> +	case KVM_DEV_MPIC_GRP_MISC:
> +		switch (attr->attr) {
> +		case KVM_DEV_MPIC_BASE_ADDR:
> +			return set_base_addr(opp, attr);
> +		}
> +
> +		break;
> +
> +	case KVM_DEV_MPIC_GRP_REGISTER:
> +		if (copy_from_user(&attr32, (u32 __user *)(long)attr->addr,
> +				   sizeof(u32)))

get_user?

> +			return -EFAULT;
> +
> +		return access_reg(opp, attr->attr, &attr32, ATTR_SET);
> +
> +	case KVM_DEV_MPIC_GRP_IRQ_ACTIVE:
> +		if (attr->attr > MAX_SRC)
> +			return -EINVAL;
> +
> +		if (copy_from_user(&attr32, (u32 __user *)(long)attr->addr,
> +				   sizeof(u32)))

same here

> +			return -EFAULT;
> +
> +		if (attr32 != 0 && attr32 != 1)
> +			return -EINVAL;
> +
> +		spin_lock_irq(&opp->lock);
> +		openpic_set_irq(opp, attr->attr, attr32);
> +		spin_unlock_irq(&opp->lock);
> +		return 0;
> +	}
> +
> +	return -ENXIO;
> +}
> +
> +static int mpic_get_attr(struct openpic *opp, struct kvm_device_attr *attr)
> +{
> +	u64 attr64;
> +	u32 attr32;
> +	int ret;
> +
> +	switch (attr->group) {
> +	case KVM_DEV_MPIC_GRP_MISC:
> +		switch (attr->attr) {
> +		case KVM_DEV_MPIC_BASE_ADDR:
> +			mutex_lock(&opp->kvm->slots_lock);
> +			attr64 = opp->reg_base;
> +			mutex_unlock(&opp->kvm->slots_lock);
> +
> +			if (copy_to_user((u64 __user *)(long)attr->addr,
> +					 &attr64, sizeof(u64)))

u64 is tricky with put_user on 32-bit hosts, so copy_to_user makes sense here.

> +				return -EFAULT;
> +
> +			return 0;
> +		}
> +
> +		break;
> +
> +	case KVM_DEV_MPIC_GRP_REGISTER:
> +		ret = access_reg(opp, attr->attr, &attr32, ATTR_GET);
> +		if (ret)
> +			return ret;
> +
> +		if (copy_to_user((u32 __user *)(long)attr->addr, &attr32,
> +				 sizeof(u32)))

put_user

> +			return -EFAULT;
> +
> +		return 0;
> +
> +	case KVM_DEV_MPIC_GRP_IRQ_ACTIVE:
> +		if (attr->attr > MAX_SRC)
> +			return -EINVAL;
> +
> +		attr32 = opp->src[attr->attr].pending;

Isn't this missing a lock?

> +
> +		if (copy_to_user((u32 __user *)(long)attr->addr, &attr32,
> +				 sizeof(u32)))
> +			return -EFAULT;
> +
> +		return 0;
> +	}
> +
> +	return -ENXIO;
> +}


Alex

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v2 1/6] kvm: add device control API
  2013-04-03 13:22             ` Alexander Graf
@ 2013-04-03 17:37               ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-03 17:37 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Paul Mackerras, kvm-ppc, kvm

On 04/03/2013 08:22:37 AM, Alexander Graf wrote:
> 
> On 03.04.2013, at 04:17, Paul Mackerras wrote:
> 
> > On Tue, Apr 02, 2013 at 08:19:56PM -0500, Scott Wood wrote:
> >> On 04/02/2013 08:02:39 PM, Paul Mackerras wrote:
> >>> On Mon, Apr 01, 2013 at 05:47:48PM -0500, Scott Wood wrote:
> >>>> +4.79 KVM_CREATE_DEVICE
> >>>> +
> >>>> +Capability: KVM_CAP_DEVICE_CTRL
> >>>
> >>> I notice this patch doesn't add this capability;
> >>
> >> Yes, it does (see below).
> >>
> >>> you add it in a later patch.
> >>
> >> Maybe you're thinking of KVM_CAP_IRQ_MPIC?
> >
> > No, I was referring to the addition to  
> kvm_dev_ioctl_check_extension()
> > of a KVM_CAP_DEVICE_CTRL case.  Since this patch adds the code to  
> handle
> > KVM_CREATE_DEVICE ioctl it should also add the code to return 1 if
> > userspace queries the KVM_CAP_DEVICE_CTRL capability.
> >
> >>>> +/* ioctl for vm fd */
> >>>> +#define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xe0, struct
> >>> kvm_create_device)
> >>>
> >>> This define should go with the other VM ioctls, otherwise the next
> >>> person to add a VM ioctl will probably miss it and reuse the 0xe0
> >>> code.
> >>
> >> That's actually why I moved it to a new section, with device  
> control
> >> ioctls getting their own range, as the legacy "device model" and
> >> some other things did.  0xe0 is not the next ioctl that would be
> >> used for either vm or vcpu.  The ioctl numbering is actually  
> already
> >> a mess, with sometimes care being taken to keep vcpu and vm ioctls
> >> from overlapping, but on other places overlapping does happen.  I'm
> >> not sure what exactly I should do here.
> >
> > Well, even if you are using a new range, I still think that
> > KVM_CREATE_DEVICE, being a VM ioctl, should go next to the other VM
> > ioctls.  I guess it's ultimately up to the maintainers.
> 
> I agree. Things get confusing for VM ioctls otherwise.

Things are already confusing. :-)

I can move KVM_CREATE_DEVICE back with the other VM ioctls, but what  
number should it get?  The last VM ioctl is 0xab (which is also a VCPU  
ioctl).  Should I use 0xac (which is also a VCPU ioctl)?  Or should I  
try to avoid a conflict, as was sometimes done in the past -- in which  
case, which number should I use?
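
For reference, the number in question is only one field of the full command value: the _IOWR() encoding also packs in the direction, the type byte, and the argument size. A small userspace sketch of how the proposed 0xe0 decodes, assuming Linux's <linux/ioctl.h> macros; the SKETCH_ names are hypothetical and exist only to avoid clashing with the real kvm.h, and the struct layout is the one from the patch:

```c
#include <assert.h>
#include <stdint.h>
#include <linux/ioctl.h>

#define SKETCH_KVMIO 0xAE	/* KVM's ioctl type byte */

/* Same layout as struct kvm_create_device in the patch. */
struct sketch_create_device {
	uint32_t type;
	uint32_t fd;
	uint32_t flags;
};

#define SKETCH_KVM_CREATE_DEVICE \
	_IOWR(SKETCH_KVMIO, 0xe0, struct sketch_create_device)

/* Decode the individual fields of an ioctl command value. */
static unsigned int ioctl_nr(unsigned long cmd)   { return _IOC_NR(cmd); }
static unsigned int ioctl_type(unsigned long cmd) { return _IOC_TYPE(cmd); }
static unsigned int ioctl_size(unsigned long cmd) { return _IOC_SIZE(cmd); }

/* Sanity check: the sequence number survives the encoding unchanged. */
static void check_create_device_nr(void)
{
	assert(ioctl_nr(SKETCH_KVM_CREATE_DEVICE) == 0xe0);
}
```

So whichever range is picked, the choice only affects the nr field; the rest of the encoding stays the same.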

-Scott

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v2 1/6] kvm: add device control API
  2013-04-03 17:37               ` Scott Wood
@ 2013-04-03 17:39                 ` Alexander Graf
  -1 siblings, 0 replies; 261+ messages in thread
From: Alexander Graf @ 2013-04-03 17:39 UTC (permalink / raw)
  To: Scott Wood
  Cc: Paul Mackerras, kvm-ppc, kvm@vger.kernel.org list, Gleb Natapov,
	Marcelo Tosatti


On 03.04.2013, at 19:37, Scott Wood wrote:

> On 04/03/2013 08:22:37 AM, Alexander Graf wrote:
>> On 03.04.2013, at 04:17, Paul Mackerras wrote:
>> > On Tue, Apr 02, 2013 at 08:19:56PM -0500, Scott Wood wrote:
>> >> On 04/02/2013 08:02:39 PM, Paul Mackerras wrote:
>> >>> On Mon, Apr 01, 2013 at 05:47:48PM -0500, Scott Wood wrote:
>> >>>> +4.79 KVM_CREATE_DEVICE
>> >>>> +
>> >>>> +Capability: KVM_CAP_DEVICE_CTRL
>> >>>
>> >>> I notice this patch doesn't add this capability;
>> >>
>> >> Yes, it does (see below).
>> >>
>> >>> you add it in a later patch.
>> >>
>> >> Maybe you're thinking of KVM_CAP_IRQ_MPIC?
>> >
>> > No, I was referring to the addition to kvm_dev_ioctl_check_extension()
>> > of a KVM_CAP_DEVICE_CTRL case.  Since this patch adds the code to handle
>> > KVM_CREATE_DEVICE ioctl it should also add the code to return 1 if
>> > userspace queries the KVM_CAP_DEVICE_CTRL capability.
>> >
>> >>>> +/* ioctl for vm fd */
>> >>>> +#define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xe0, struct
>> >>> kvm_create_device)
>> >>>
>> >>> This define should go with the other VM ioctls, otherwise the next
>> >>> person to add a VM ioctl will probably miss it and reuse the 0xe0
>> >>> code.
>> >>
>> >> That's actually why I moved it to a new section, with device control
>> >> ioctls getting their own range, as the legacy "device model" and
>> >> some other things did.  0xe0 is not the next ioctl that would be
>> >> used for either vm or vcpu.  The ioctl numbering is actually already
>> >> a mess, with sometimes care being taken to keep vcpu and vm ioctls
>> >> from overlapping, but on other places overlapping does happen.  I'm
>> >> not sure what exactly I should do here.
>> >
>> > Well, even if you are using a new range, I still think that
>> > KVM_CREATE_DEVICE, being a VM ioctl, should go next to the other VM
>> > ioctls.  I guess it's ultimately up to the maintainers.
>> I agree. Things get confusing for VM ioctls otherwise.
> 
> Things are already confusing. :-)
> 
> I can move KVM_CREATE_DEVICE back with the other VM ioctls, but what number should it get?  The last VM ioctl is 0xab (which is also a VCPU ioctl).  Should I use 0xac (which is also a VCPU ioctl)?  Or should I try to avoid a conflict, as was sometimes done in the past -- in which case, which number should I use?

Gleb, Marcelo?


Alex

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v2 1/6] kvm: add device control API
@ 2013-04-03 17:39                 ` Alexander Graf
  0 siblings, 0 replies; 261+ messages in thread
From: Alexander Graf @ 2013-04-03 17:39 UTC (permalink / raw)
  To: Scott Wood
  Cc: Paul Mackerras, kvm-ppc, kvm@vger.kernel.org list, Gleb Natapov,
	Marcelo Tosatti


On 03.04.2013, at 19:37, Scott Wood wrote:

> On 04/03/2013 08:22:37 AM, Alexander Graf wrote:
>> On 03.04.2013, at 04:17, Paul Mackerras wrote:
>> > On Tue, Apr 02, 2013 at 08:19:56PM -0500, Scott Wood wrote:
>> >> On 04/02/2013 08:02:39 PM, Paul Mackerras wrote:
>> >>> On Mon, Apr 01, 2013 at 05:47:48PM -0500, Scott Wood wrote:
>> >>>> +4.79 KVM_CREATE_DEVICE
>> >>>> +
>> >>>> +Capability: KVM_CAP_DEVICE_CTRL
>> >>>
>> >>> I notice this patch doesn't add this capability;
>> >>
>> >> Yes, it does (see below).
>> >>
>> >>> you add it in a later patch.
>> >>
>> >> Maybe you're thinking of KVM_CAP_IRQ_MPIC?
>> >
>> > No, I was referring to the addition to kvm_dev_ioctl_check_extension()
>> > of a KVM_CAP_DEVICE_CTRL case.  Since this patch adds the code to handle
>> > KVM_CREATE_DEVICE ioctl it should also add the code to return 1 if
>> > userspace queries the KVM_CAP_DEVICE_CTRL capability.
>> >
>> >>>> +/* ioctl for vm fd */
>> >>>> +#define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xe0, struct
>> >>> kvm_create_device)
>> >>>
>> >>> This define should go with the other VM ioctls, otherwise the next
>> >>> person to add a VM ioctl will probably miss it and reuse the 0xe0
>> >>> code.
>> >>
>> >> That's actually why I moved it to a new section, with device control
>> >> ioctls getting their own range, as the legacy "device model" and
>> >> some other things did.  0xe0 is not the next ioctl that would be
>> >> used for either vm or vcpu.  The ioctl numbering is actually already
>> >> a mess, with sometimes care being taken to keep vcpu and vm ioctls
>> >> from overlapping, but on other places overlapping does happen.  I'm
>> >> not sure what exactly I should do here.
>> >
>> > Well, even if you are using a new range, I still think that
>> > KVM_CREATE_DEVICE, being a VM ioctl, should go next to the other VM
>> > ioctls.  I guess it's ultimately up to the maintainers.
>> I agree. Things get confusing for VM ioctls otherwise.
> 
> Things are already confusing. :-)
> 
> I can move KVM_CREATE_DEVICE back with the other VM ioctls, but what number should it get?  The last VM ioctl is 0xab (which is also a VCPU ioctl).  Should I use 0xac (which is also a VCPU ioctl)?  Or should I try to avoid a conflict, as was sometimes done in the past -- in which case, which number should I use?

Gleb, Marcelo?


Alex
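
For readers following the numbering question above: an ioctl request code is built from a direction, an argument size, a type byte and a number, so two ioctls that share a number need not share a request code. The sketch below illustrates that with the Linux _IOWR()/_IOW() macros. DEMO_KVMIO, struct demo_create_device and DEMO_OTHER_E0 are illustrative assumptions, not definitions taken from the patch.

```c
#include <linux/ioctl.h>
#include <stdint.h>

/* Assumed stand-in for the KVM ioctl type byte; illustration only. */
#define DEMO_KVMIO 0xAE

/* Hypothetical argument layout, used only to give _IOWR() a size. */
struct demo_create_device {
	uint32_t type;
	uint32_t fd;
	uint32_t flags;
};

/* Two hypothetical ioctls that share number 0xe0. */
#define DEMO_CREATE_DEVICE _IOWR(DEMO_KVMIO, 0xe0, struct demo_create_device)
#define DEMO_OTHER_E0      _IOW(DEMO_KVMIO, 0xe0, uint64_t)

/* Returns 1 if both request codes carry the same ioctl number. */
static int demo_same_nr(unsigned long a, unsigned long b)
{
	return _IOC_NR(a) == _IOC_NR(b);
}

/* Returns 1 only if direction, size, type and number all match. */
static int demo_same_request(unsigned long a, unsigned long b)
{
	return a == b;
}
```

Here demo_same_nr() reports a shared number while demo_same_request() reports distinct codes, because direction and size differ. A shared number is therefore not automatically a functional collision, although it can still confuse anyone reading the header.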


^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v3 5/6] kvm/ppc/mpic: in-kernel MPIC emulation
  2013-04-03 15:55         ` Gleb Natapov
@ 2013-04-03 20:58           ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-03 20:58 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Alexander Graf, kvm-ppc, kvm, paulus

On 04/03/2013 10:55:27 AM, Gleb Natapov wrote:
> On Tue, Apr 02, 2013 at 08:57:52PM -0500, Scott Wood wrote:
> > Hook the MPIC code up to the KVM interfaces, add locking, etc.
> >
> > TODO: irqfd support, split up into multiple patches, KVM_IRQ_LINE
> > support
> >
> > Signed-off-by: Scott Wood <scottwood@freescale.com>
> > ---
> [skip]
> 
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 20ce2d2..d8f44ef 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -927,6 +927,15 @@ struct kvm_device_attr {
> >  	__u64	addr;		/* userspace address of attr data */
> >  };
> >
> > +#define KVM_DEV_TYPE_FSL_MPIC_20	1
> > +#define KVM_DEV_TYPE_FSL_MPIC_42	2
> > +
> > +#define KVM_DEV_MPIC_GRP_MISC		1
> > +#define   KVM_DEV_MPIC_BASE_ADDR	0	/* 64-bit */
> > +
> > +#define KVM_DEV_MPIC_GRP_REGISTER	2	/* 32-bit */
> > +#define KVM_DEV_MPIC_GRP_IRQ_ACTIVE	3	/* 32-bit */
> Why not put them in arch specific header?

KVM_DEV_TYPE_* is not an arch-specific enumeration -- this was  
discussed last time around.

KVM_DEV_MPIC_* could go elsewhere if you want to avoid cluttering the  
main kvm.h.  The arch header would be OK, since the non-arch header  
includes the arch header, and thus userspace wouldn't see which header
it lives in -- if there later is a need for MPIC (or whatever other
device follows MPIC's example) on another architecture, it could be  
moved without breaking anything.  Or, we could just have a header for  
each device type.

-Scott
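
As a rough illustration of how userspace might consume the group/attribute constants quoted above, the sketch below fills in an attribute descriptor for the MPIC base address. The struct layout is an assumption: only the addr field appears in the quoted hunk, and flags, group and attr are guessed for illustration. The DEMO_ names mirror the quoted values but are local to this example.

```c
#include <stdint.h>

/* Values mirror the quoted patch; the DEMO_ prefix marks them as local copies. */
#define DEMO_MPIC_GRP_MISC  1
#define DEMO_MPIC_BASE_ADDR 0	/* 64-bit attribute */

/* Assumed layout; only 'addr' is visible in the quoted hunk. */
struct demo_device_attr {
	uint32_t flags;
	uint32_t group;
	uint64_t attr;
	uint64_t addr;	/* userspace address of attr data */
};

/*
 * Build a descriptor that points at a caller-owned 64-bit base address.
 * Handing it to the device fd (a set-attribute ioctl) is left out, since
 * that needs a real KVM device.
 */
static struct demo_device_attr demo_base_addr_attr(const uint64_t *base)
{
	struct demo_device_attr a = {
		.flags = 0,
		.group = DEMO_MPIC_GRP_MISC,
		.attr  = DEMO_MPIC_BASE_ADDR,
		.addr  = (uint64_t)(uintptr_t)base,
	};

	return a;
}
```

Because the descriptor carries only numbers and a pointer, which header defines the constants makes no difference to the binary interface. That is consistent with the point that they could later be moved without breaking anything.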

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v2 1/6] kvm: add device control API
  2013-04-03  2:17           ` Paul Mackerras
@ 2013-04-03 21:03             ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-03 21:03 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Alexander Graf, kvm-ppc, kvm

On 04/02/2013 09:17:58 PM, Paul Mackerras wrote:
> On Tue, Apr 02, 2013 at 08:19:56PM -0500, Scott Wood wrote:
> > On 04/02/2013 08:02:39 PM, Paul Mackerras wrote:
> > >On Mon, Apr 01, 2013 at 05:47:48PM -0500, Scott Wood wrote:
> > >> +4.79 KVM_CREATE_DEVICE
> > >> +
> > >> +Capability: KVM_CAP_DEVICE_CTRL
> > >
> > >I notice this patch doesn't add this capability;
> >
> > Yes, it does (see below).
> >
> > >you add it in a later patch.
> >
> > Maybe you're thinking of KVM_CAP_IRQ_MPIC?
> 
> No, I was referring to the addition to kvm_dev_ioctl_check_extension()
> of a KVM_CAP_DEVICE_CTRL case.  Since this patch adds the code to  
> handle
> KVM_CREATE_DEVICE ioctl it should also add the code to return 1 if
> userspace queries the KVM_CAP_DEVICE_CTRL capability.

Ah.  In that case we should probably recognize the capability in  
generic code rather than ppc code.

-Scott
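
A minimal sketch of the split being suggested, assuming a generic handler that answers for capabilities implemented in common code and defers everything else to an architecture hook. All names and values here (demo_generic_check_extension, demo_arch_check_extension, the DEMO_CAP_ constants) are hypothetical and do not come from the patch.

```c
/* Hypothetical capability numbers, for illustration only. */
enum demo_cap {
	DEMO_CAP_DEVICE_CTRL = 1,	/* implemented by generic code */
	DEMO_CAP_IRQ_MPIC    = 2,	/* implemented by arch code */
};

/* Arch hook: knows only about arch-specific capabilities. */
static int demo_arch_check_extension(long cap)
{
	switch (cap) {
	case DEMO_CAP_IRQ_MPIC:
		return 1;
	default:
		return 0;
	}
}

/* Generic entry point: handles what common code provides, then defers. */
static int demo_generic_check_extension(long cap)
{
	switch (cap) {
	case DEMO_CAP_DEVICE_CTRL:
		return 1;
	default:
		break;
	}

	return demo_arch_check_extension(cap);
}
```

With this arrangement, the patch that adds the generic ioctl can also advertise the capability, and no architecture has to repeat the case.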

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v3 5/6] kvm/ppc/mpic: in-kernel MPIC emulation
  2013-04-03 16:19         ` Alexander Graf
@ 2013-04-03 21:38           ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-03 21:38 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus

On 04/03/2013 11:19:42 AM, Alexander Graf wrote:
> 
> On 03.04.2013, at 03:57, Scott Wood wrote:
> 
> > Hook the MPIC code up to the KVM interfaces, add locking, etc.
> >
> > TODO: irqfd support, split up into multiple patches, KVM_IRQ_LINE
> > support
> >
> > Signed-off-by: Scott Wood <scottwood@freescale.com>
> > ---
> > v3: mpic_put -> kvmppc_mpic_put
> >
> >
> 
> [...]
> 
> > +void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu);
> > +
> > int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
> > 			      struct kvm_config_tlb *cfg);
> > int kvm_vcpu_ioctl_dirty_tlb(struct kvm_vcpu *vcpu,
> > diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
> > index 63c67ec..a87139b 100644
> > --- a/arch/powerpc/kvm/Kconfig
> > +++ b/arch/powerpc/kvm/Kconfig
> > @@ -151,6 +151,11 @@ config KVM_E500MC
> >
> > 	  If unsure, say N.
> >
> > +config KVM_MPIC
> > +	bool "KVM in-kernel MPIC emulation"
> > +	depends on KVM
> 
> This should probably depend on FSL KVM for now, until someone adds  
> support for other MPIC revisions.

I don't see a symbol specifically for "FSL KVM".  What part of the MPIC  
code depends on booke or any FSL-specific code?

> > struct irq_dest {
> > +	struct kvm_vcpu *vcpu;
> > +
> > 	int32_t ctpr;		/* CPU current task priority */
> > 	struct irq_queue raised;
> > 	struct irq_queue servicing;
> > -	qemu_irq *irqs;
> >
> > 	/* Count of IRQ sources asserting on non-INT outputs */
> > -	uint32_t outputs_active[OPENPIC_OUTPUT_NB];
> > +	uint32_t outputs_active[NUM_OUTPUTS];
> > };
> >
> > +struct openpic;
> 
> Isn't this superfluous?

Yes, will remove.  Probably a leftover from when there was other stuff  
in between that referenced it.

> > @@ -731,8 +771,8 @@ static uint64_t openpic_gbl_read(void *opaque,  
> gpa_t addr, unsigned len)
> > 	case 0x90:
> > 	case 0xA0:
> > 	case 0xB0:
> > -		retval =
> > -		    openpic_cpu_read_internal(opp, addr,  
> get_current_cpu());
> > +		retval = openpic_cpu_read_internal(opp, addr,
> > +			&retval, get_current_cpu());
> 
> This looks bogus. You're passing &retval and overwrite it with the  
> return value right after the function returns?

Yeah, will fix.  Thanks for spotting it.

> > +static int kvm_mpic_read(struct kvm_io_device *this, gpa_t addr,
> > +			 int len, void *ptr)
> > +{
> > +	struct openpic *opp = container_of(this, struct openpic, mmio);
> > +	int ret;
> > +
> > +	/*
> > +	 * Technically only 32-bit accesses are allowed, but be nice to
> > +	 * people dumping registers a byte at a time -- it works in real
> > +	 * hardware (reads only, not writes).
> 
> Do 16-bit accesses work in real hardware?

Probably, though according to the documentation only 32-bit accesses  
should be used.  As the comment says, I'm just being nice to hexdumps  
and such, which are unlikely to use 16-bit accesses.

> > +	 */
> > +	if (len == 4) {
> > +		if (addr & 3) {
> > +			pr_debug("%s: bad alignment %llx/%d\n",
> > +				 __func__, addr, len);
> > +			return -EINVAL;
> > +		}
> 
> if (addr & (len - 1))
> 
> Then the read_internal call can be shared between the different  
> access sizes, no?

Not as is, because the read_internal call passes a different pointer in  
the two different cases.  I originally tried to write this with more in  
common between the two cases, and it got a bit messy.  I'll give it  
another shot, though.
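
One possible shape for a shared path, sketched in plain C outside the kernel: always read the containing 32-bit word into a local, then copy out the requested bytes. This is not the code from the patch. demo_read_internal() is a hypothetical stand-in for the register read, locking is omitted, and byte selection follows host byte order, as the union in the quoted code does.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical register read: returns a pattern derived from the offset. */
static int demo_read_internal(uint64_t offset, uint32_t *val)
{
	*val = 0xA0B0C0D0u ^ (uint32_t)offset;
	return 0;
}

/* Handles 1- and 4-byte reads through a single internal call. */
static int demo_mpic_read(uint64_t addr, int len, void *ptr)
{
	uint32_t val;
	int ret;

	if (len != 4 && len != 1)
		return -EINVAL;

	/* One alignment test covers both sizes: the mask is 3 or 0. */
	if (addr & (uint64_t)(len - 1))
		return -EINVAL;

	/* Always fetch the aligned 32-bit word that contains addr. */
	ret = demo_read_internal(addr & ~(uint64_t)3, &val);
	if (ret)
		return ret;

	/* Copy out the whole word, or the one byte selected by addr & 3. */
	memcpy(ptr, (const uint8_t *)&val + (addr & 3), (size_t)len);
	return 0;
}
```

Reading into a local and copying out sidesteps the "different pointer" problem, at the cost of one extra copy on the 32-bit path.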

> > +		spin_lock_irq(&opp->lock);
> > +		ret = kvm_mpic_read_internal(opp, addr - opp->reg_base,  
> ptr);
> > +		spin_unlock_irq(&opp->lock);
> > +
> > +		pr_debug("%s: addr %llx ret %d len 4 val %x\n",
> > +			 __func__, addr, ret, *(const u32 *)ptr);
> > +	} else if (len == 1) {
> > +		union {
> > +			u32 val;
> > +			u8 bytes[4];
> > +		} u;
> > +
> > +		spin_lock_irq(&opp->lock);
> > +		ret = kvm_mpic_read_internal(opp, addr - opp->reg_base,  
> &u.val);
> > +		spin_unlock_irq(&opp->lock);
> > +
> > +		*(u8 *)ptr = u.bytes[addr & 3];
> > +
> > +		pr_debug("%s: addr %llx ret %d len 1 val %x\n",
> > +			 __func__, addr, ret, *(const u8 *)ptr);
> > +	} else {
> > +		pr_debug("%s: bad length %d\n", __func__, len);
> > +		return -EINVAL;
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> 
> [...]
> 
> >
> > +static int mpic_set_attr(struct openpic *opp, struct  
> kvm_device_attr *attr)
> > +{
> > +	u32 attr32;
> > +
> > +	switch (attr->group) {
> > +	case KVM_DEV_MPIC_GRP_MISC:
> > +		switch (attr->attr) {
> > +		case KVM_DEV_MPIC_BASE_ADDR:
> > +			return set_base_addr(opp, attr);
> > +		}
> > +
> > +		break;
> > +
> > +	case KVM_DEV_MPIC_GRP_REGISTER:
> > +		if (copy_from_user(&attr32, (u32 __user  
> *)(long)attr->addr,
> > +				   sizeof(u32)))
> 
> get_user?

OK.

> > +	switch (attr->group) {
> > +	case KVM_DEV_MPIC_GRP_MISC:
> > +		switch (attr->attr) {
> > +		case KVM_DEV_MPIC_BASE_ADDR:
> > +			mutex_lock(&opp->kvm->slots_lock);
> > +			attr64 = opp->reg_base;
> > +			mutex_unlock(&opp->kvm->slots_lock);
> > +
> > +			if (copy_to_user((u64 __user *)(long)attr->addr,
> > +					 &attr64, sizeof(u64)))
> 
> u64 is tricky with put_user on 32bit hosts, so here copy_to_user  
> makes sense

What are the issues with put_user?  It looks like it's supported with a  
pair of "stw" instructions.

> > +	case KVM_DEV_MPIC_GRP_IRQ_ACTIVE:
> > +		if (attr->attr > MAX_SRC)
> > +			return -EINVAL;
> > +
> > +		attr32 = opp->src[attr->attr].pending;
> 
> Isn't this missing a lock?

I don't see why it needs one.  If the pending status changes during the  
ioctl, it's undefined which state you'll read back, and a lock wouldn't  
change that (you could end up taking the lock before or after the  
change gets made).

reg_base above was a different situation -- it's 64-bit, so we can't  
read it atomically on 32-bit.

-Scott

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v3 5/6] kvm/ppc/mpic: in-kernel MPIC emulation
  2013-04-03 21:38           ` Scott Wood
@ 2013-04-03 21:58             ` Alexander Graf
  -1 siblings, 0 replies; 261+ messages in thread
From: Alexander Graf @ 2013-04-03 21:58 UTC (permalink / raw)
  To: Scott Wood
  Cc: <kvm-ppc@vger.kernel.org>, <kvm@vger.kernel.org>,
	<paulus@samba.org>



On 03.04.2013 at 23:38, Scott Wood <scottwood@freescale.com> wrote:

> On 04/03/2013 11:19:42 AM, Alexander Graf wrote:
>> On 03.04.2013, at 03:57, Scott Wood wrote:
>> > Hook the MPIC code up to the KVM interfaces, add locking, etc.
>> >
>> > TODO: irqfd support, split up into multiple patches, KVM_IRQ_LINE
>> > support
>> >
>> > Signed-off-by: Scott Wood <scottwood@freescale.com>
>> > ---
>> > v3: mpic_put -> kvmppc_mpic_put
>> >
>> >
>> [...]
>> > +void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu);
>> > +
>> > int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
>> >                  struct kvm_config_tlb *cfg);
>> > int kvm_vcpu_ioctl_dirty_tlb(struct kvm_vcpu *vcpu,
>> > diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
>> > index 63c67ec..a87139b 100644
>> > --- a/arch/powerpc/kvm/Kconfig
>> > +++ b/arch/powerpc/kvm/Kconfig
>> > @@ -151,6 +151,11 @@ config KVM_E500MC
>> >
>> >      If unsure, say N.
>> >
>> > +config KVM_MPIC
>> > +    bool "KVM in-kernel MPIC emulation"
>> > +    depends on KVM
>> This should probably depend on FSL KVM for now, until someone adds support for other MPIC revisions.
> 
> I don't see a symbol specifically for "FSL KVM".  What part of the MPIC code depends on booke or any FSL-specific code?

You support only FSL MPIC device IDs :). So if someone on book3s comes along and sees this, they'd think "yes, I want an in-kernel MPIC", enable the option, and waste space.

> 
>> > struct irq_dest {
>> > +    struct kvm_vcpu *vcpu;
>> > +
>> >    int32_t ctpr;        /* CPU current task priority */
>> >    struct irq_queue raised;
>> >    struct irq_queue servicing;
>> > -    qemu_irq *irqs;
>> >
>> >    /* Count of IRQ sources asserting on non-INT outputs */
>> > -    uint32_t outputs_active[OPENPIC_OUTPUT_NB];
>> > +    uint32_t outputs_active[NUM_OUTPUTS];
>> > };
>> >
>> > +struct openpic;
>> Isn't this superfluous?
> 
> Yes, will remove.  Probably a leftover from when there was other stuff in between that referenced it.
> 
>> > @@ -731,8 +771,8 @@ static uint64_t openpic_gbl_read(void *opaque, gpa_t addr, unsigned len)
>> >    case 0x90:
>> >    case 0xA0:
>> >    case 0xB0:
>> > -        retval =
>> > -            openpic_cpu_read_internal(opp, addr, get_current_cpu());
>> > +        retval = openpic_cpu_read_internal(opp, addr,
>> > +            &retval, get_current_cpu());
>> This looks bogus. You're passing &retval and overwrite it with the return value right after the function returns?
> 
> Yeah, will fix.  Thanks for spotting it.
> 
>> > +static int kvm_mpic_read(struct kvm_io_device *this, gpa_t addr,
>> > +             int len, void *ptr)
>> > +{
>> > +    struct openpic *opp = container_of(this, struct openpic, mmio);
>> > +    int ret;
>> > +
>> > +    /*
>> > +     * Technically only 32-bit accesses are allowed, but be nice to
>> > +     * people dumping registers a byte at a time -- it works in real
>> > +     * hardware (reads only, not writes).
>> Do 16-bit accesses work in real hardware?
> 
> Probably, though according to the documentation only 32-bit accesses should be used.  As the comment says, I'm just being nice to hexdumps and such, which are unlikely to use 16-bit accesses.
> 
>> > +     */
>> > +    if (len == 4) {
>> > +        if (addr & 3) {
>> > +            pr_debug("%s: bad alignment %llx/%d\n",
>> > +                 __func__, addr, len);
>> > +            return -EINVAL;
>> > +        }
>> if (addr & (len - 1))
>> Then the read_internal call can be shared between the different access sizes, no?
> 
> Not as is, because the read_internal call passes a different pointer in the two different cases.  I originally tried to write this with more in common between the two cases, and it got a bit messy.  I'll give it another shot, though.

Yeah, but don't waste too much effort on it :). It's not that important.

> 
>> > +        spin_lock_irq(&opp->lock);
>> > +        ret = kvm_mpic_read_internal(opp, addr - opp->reg_base, ptr);
>> > +        spin_unlock_irq(&opp->lock);
>> > +
>> > +        pr_debug("%s: addr %llx ret %d len 4 val %x\n",
>> > +             __func__, addr, ret, *(const u32 *)ptr);
>> > +    } else if (len == 1) {
>> > +        union {
>> > +            u32 val;
>> > +            u8 bytes[4];
>> > +        } u;
>> > +
>> > +        spin_lock_irq(&opp->lock);
>> > +        ret = kvm_mpic_read_internal(opp, addr - opp->reg_base, &u.val);
>> > +        spin_unlock_irq(&opp->lock);
>> > +
>> > +        *(u8 *)ptr = u.bytes[addr & 3];
>> > +
>> > +        pr_debug("%s: addr %llx ret %d len 1 val %x\n",
>> > +             __func__, addr, ret, *(const u8 *)ptr);
>> > +    } else {
>> > +        pr_debug("%s: bad length %d\n", __func__, len);
>> > +        return -EINVAL;
>> > +    }
>> > +
>> > +    return ret;
>> > +}
>> > +
>> [...]
>> >
>> > +static int mpic_set_attr(struct openpic *opp, struct kvm_device_attr *attr)
>> > +{
>> > +    u32 attr32;
>> > +
>> > +    switch (attr->group) {
>> > +    case KVM_DEV_MPIC_GRP_MISC:
>> > +        switch (attr->attr) {
>> > +        case KVM_DEV_MPIC_BASE_ADDR:
>> > +            return set_base_addr(opp, attr);
>> > +        }
>> > +
>> > +        break;
>> > +
>> > +    case KVM_DEV_MPIC_GRP_REGISTER:
>> > +        if (copy_from_user(&attr32, (u32 __user *)(long)attr->addr,
>> > +                   sizeof(u32)))
>> get_user?
> 
> OK.
> 
>> > +    switch (attr->group) {
>> > +    case KVM_DEV_MPIC_GRP_MISC:
>> > +        switch (attr->attr) {
>> > +        case KVM_DEV_MPIC_BASE_ADDR:
>> > +            mutex_lock(&opp->kvm->slots_lock);
>> > +            attr64 = opp->reg_base;
>> > +            mutex_unlock(&opp->kvm->slots_lock);
>> > +
>> > +            if (copy_to_user((u64 __user *)(long)attr->addr,
>> > +                     &attr64, sizeof(u64)))
>> u64 is tricky with put_user on 32bit hosts, so here copy_to_user makes sense
> 
> What are the issues with put_user?  It looks like it's supported with a pair of "stw" instructions.

Oh? Last time I tried to use get/put_user for one_reg it failed on ppc32. So maybe the u64 support is new?

> 
>> > +    case KVM_DEV_MPIC_GRP_IRQ_ACTIVE:
>> > +        if (attr->attr > MAX_SRC)
>> > +            return -EINVAL;
>> > +
>> > +        attr32 = opp->src[attr->attr].pending;
>> Isn't this missing a lock?
> 
> I don't see why it needs one.  If the pending status changes during the ioctl, it's undefined which state you'll read back, and a lock wouldn't change that (you could end up taking the lock before or after the change gets made).
> 
> reg_base above was a different situation -- it's 64-bit, so we can't read it atomically on 32-bit.

OK, so this relies on 32-bit read accesses being atomic and on stale values being acceptable. That works for me, but it deserves a comment.


Alex
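
A sketch of what such a comment might look like, written as standalone C. struct demo_irq_src and demo_read_pending() are hypothetical names, and the volatile cast is only a userspace approximation of a single-access read. Nothing here is taken from a later revision of the patch.

```c
#include <stdint.h>

/* Hypothetical, trimmed-down interrupt source state. */
struct demo_irq_src {
	uint32_t pending;
};

static uint32_t demo_read_pending(const struct demo_irq_src *src)
{
	/*
	 * No lock is taken here.  The field is a naturally aligned 32-bit
	 * value, which is assumed to be read in a single access, so the
	 * caller sees either the old or the new state and never a mix.
	 * A lock would not make the answer any fresher: the state could
	 * still change right after the lock is dropped.
	 *
	 * A 64-bit field such as the base address is different on 32-bit
	 * hosts, where the read may be split in two and so needs a lock.
	 */
	return *(const volatile uint32_t *)&src->pending;
}
```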

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v3 5/6] kvm/ppc/mpic: in-kernel MPIC emulation
  2013-04-03 21:58             ` Alexander Graf
@ 2013-04-03 22:07               ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-03 22:07 UTC (permalink / raw)
  To: Alexander Graf
  Cc: <kvm-ppc@vger.kernel.org>, <kvm@vger.kernel.org>,
	<paulus@samba.org>

On 04/03/2013 04:58:56 PM, Alexander Graf wrote:
> 
> 
> On 03.04.2013 at 23:38, Scott Wood <scottwood@freescale.com> wrote:
> 
> > On 04/03/2013 11:19:42 AM, Alexander Graf wrote:
> >> On 03.04.2013, at 03:57, Scott Wood wrote:
> >> > Hook the MPIC code up to the KVM interfaces, add locking, etc.
> >> >
> >> > TODO: irqfd support, split up into multiple patches, KVM_IRQ_LINE
> >> > support
> >> >
> >> > Signed-off-by: Scott Wood <scottwood@freescale.com>
> >> > ---
> >> > v3: mpic_put -> kvmppc_mpic_put
> >> >
> >> >
> >> [...]
> >> > +void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu);
> >> > +
> >> > int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
> >> >                  struct kvm_config_tlb *cfg);
> >> > int kvm_vcpu_ioctl_dirty_tlb(struct kvm_vcpu *vcpu,
> >> > diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
> >> > index 63c67ec..a87139b 100644
> >> > --- a/arch/powerpc/kvm/Kconfig
> >> > +++ b/arch/powerpc/kvm/Kconfig
> >> > @@ -151,6 +151,11 @@ config KVM_E500MC
> >> >
> >> >      If unsure, say N.
> >> >
> >> > +config KVM_MPIC
> >> > +    bool "KVM in-kernel MPIC emulation"
> >> > +    depends on KVM
> >> This should probably depend on FSL KVM for now, until someone adds  
> support for other MPIC revisions.
> >
> > I don't see a symbol specifically for "FSL KVM".  What part of the  
> MPIC code depends on booke or any FSL-specific code?
> 
> You support only FSL mpic device IDs :). So if someone on book3s goes  
> along and sees this, he'd think "yes, I want an in-kernel MPIC",  
> enables the option and wastes space.

"Would this waste space" is not generally the criteria for kconfig  
dependencies.  Who is the kernel to get in the way of someone that  
wants an FSL MPIC on a 4xx VM? :-)

And again, there's no symbol for FSL KVM -- I'd have to use a list that  
could get out of date.  And it would reduce build testing in  
allyesconfig-type configs.

> >> > +    switch (attr->group) {
> >> > +    case KVM_DEV_MPIC_GRP_MISC:
> >> > +        switch (attr->attr) {
> >> > +        case KVM_DEV_MPIC_BASE_ADDR:
> >> > +            mutex_lock(&opp->kvm->slots_lock);
> >> > +            attr64 = opp->reg_base;
> >> > +            mutex_unlock(&opp->kvm->slots_lock);
> >> > +
> >> > +            if (copy_to_user((u64 __user *)(long)attr->addr,
> >> > +                     &attr64, sizeof(u64)))
> >> u64 is tricky with put_user on 32bit hosts, so here copy_to_user  
> makes sense
> >
> > What are the issues with put_user?  It looks like it's supported  
> with a pair of "stw" instructions.
> 
> Oh? Last time I tried to use get/put_user for one_reg it failed on  
> ppc32. So maybe the u64 support is new?

Not new according to git -- though I haven't tried to use it yet; maybe  
it's broken.

> >> > +    case KVM_DEV_MPIC_GRP_IRQ_ACTIVE:
> >> > +        if (attr->attr > MAX_SRC)
> >> > +            return -EINVAL;
> >> > +
> >> > +        attr32 = opp->src[attr->attr].pending;
> >> Isn't this missing a lock?
> >
> > I don't see why it needs one.  If the pending status changes during  
> the ioctl, it's undefined which state you'll read back, and a lock  
> wouldn't change that (you could end up taking the lock before or  
> after the change gets made).
> >
> > reg_base above was a different situation -- it's 64-bit, so we  
> can't read it atomically on 32-bit.
> 
> Ok, so this relies on 32bit read accesses being atomic and stale  
> values ok.

Not just 32-bit, but only two possible values, with one bitflip between  
them...  Even if GCC does the read a byte at a time it'd be OK.
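
A minimal userspace sketch of the distinction drawn above, not the MPIC code itself. It simulates, single-threaded, a writer landing in the middle of a split read. The helper names (torn_read64, bytewise_read_flag, torn_read_demo) are made up for illustration. A 64-bit value fetched as two 32-bit halves can come back as a value that was never stored, while a flag that only ever holds 0 or 1 reads back as one of those two values even when fetched a byte at a time.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simulate reading a 64-bit value as two 32-bit halves (as a 32-bit host
 * would), with a writer storing new_val between the two half reads.
 */
static uint64_t torn_read64(uint64_t old_val, uint64_t new_val)
{
	uint64_t shared = old_val;
	uint32_t lo, hi;

	lo = (uint32_t)shared;			/* first half read */
	shared = new_val;			/* "concurrent" write lands here */
	hi = (uint32_t)(shared >> 32);		/* second half read */

	return ((uint64_t)hi << 32) | lo;
}

/*
 * Simulate reading a 32-bit flag one byte at a time, with the writer
 * landing after "split" bytes have already been read (split in 0..4).
 */
static uint32_t bytewise_read_flag(uint32_t old_val, uint32_t new_val,
				   int split)
{
	uint32_t shared = old_val;
	uint32_t result = 0;
	int i;

	for (i = 0; i < 4; i++) {
		if (i == split)
			shared = new_val;
		result |= ((shared >> (8 * i)) & 0xffu) << (8 * i);
	}

	return result;
}

/* Usage: the 64-bit read tears, the 0/1 flag never does. */
static void torn_read_demo(void)
{
	int split;

	/* Neither the old nor the new value comes back. */
	assert(torn_read64(0x00000000ffffffffULL, 0x0000000100000000ULL) ==
	       0x00000001ffffffffULL);

	for (split = 0; split <= 4; split++) {
		uint32_t r = bytewise_read_flag(0, 1, split);

		assert(r == 0 || r == 1);
	}
}
```

The flag case is safe only because the two legal values differ in a single bit. A field with several bits changing at once would not get the same guarantee from a byte-wise read.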

> That works for me, but deserves a comment.

If I'm going to change it, might as well just put the lock in to be  
consistent.

-Scott

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v3 5/6] kvm/ppc/mpic: in-kernel MPIC emulation
  2013-04-03 22:07               ` Scott Wood
@ 2013-04-03 22:12                 ` Alexander Graf
  -1 siblings, 0 replies; 261+ messages in thread
From: Alexander Graf @ 2013-04-03 22:12 UTC (permalink / raw)
  To: Scott Wood
  Cc: <kvm-ppc@vger.kernel.org>, <kvm@vger.kernel.org>,
	<paulus@samba.org>



On 04.04.2013 at 00:07, Scott Wood <scottwood@freescale.com> wrote:

> On 04/03/2013 04:58:56 PM, Alexander Graf wrote:
>> On 03.04.2013 at 23:38, Scott Wood <scottwood@freescale.com> wrote:
>> > On 04/03/2013 11:19:42 AM, Alexander Graf wrote:
>> >> On 03.04.2013, at 03:57, Scott Wood wrote:
>> >> > Hook the MPIC code up to the KVM interfaces, add locking, etc.
>> >> >
>> >> > TODO: irqfd support, split up into multiple patches, KVM_IRQ_LINE
>> >> > support
>> >> >
>> >> > Signed-off-by: Scott Wood <scottwood@freescale.com>
>> >> > ---
>> >> > v3: mpic_put -> kvmppc_mpic_put
>> >> >
>> >> >
>> >> [...]
>> >> > +void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu);
>> >> > +
>> >> > int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
>> >> >                  struct kvm_config_tlb *cfg);
>> >> > int kvm_vcpu_ioctl_dirty_tlb(struct kvm_vcpu *vcpu,
>> >> > diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
>> >> > index 63c67ec..a87139b 100644
>> >> > --- a/arch/powerpc/kvm/Kconfig
>> >> > +++ b/arch/powerpc/kvm/Kconfig
>> >> > @@ -151,6 +151,11 @@ config KVM_E500MC
>> >> >
>> >> >      If unsure, say N.
>> >> >
>> >> > +config KVM_MPIC
>> >> > +    bool "KVM in-kernel MPIC emulation"
>> >> > +    depends on KVM
>> >> This should probably depend on FSL KVM for now, until someone adds support for other MPIC revisions.
>> >
>> > I don't see a symbol specifically for "FSL KVM".  What part of the MPIC code depends on booke or any FSL-specific code?
>> You support only FSL mpic device IDs :). So if someone on book3s goes along and sees this, he'd think "yes, I want an in-kernel MPIC", enables the option and wastes space.
> 
> "Would this waste space" is not generally the criteria for kconfig dependencies.  Who is the kernel to get in the way of someone that wants an FSL MPIC on a 4xx VM? :-)
> 
> And again, there's no symbol for FSL KVM -- I'd have to use a list that could get out of date.  And it would reduce build testing in allyesconfig-type configs.

Ok, please indicate compatibility limitations in the Kconfig description at least then.

> 
>> >> > +    switch (attr->group) {
>> >> > +    case KVM_DEV_MPIC_GRP_MISC:
>> >> > +        switch (attr->attr) {
>> >> > +        case KVM_DEV_MPIC_BASE_ADDR:
>> >> > +            mutex_lock(&opp->kvm->slots_lock);
>> >> > +            attr64 = opp->reg_base;
>> >> > +            mutex_unlock(&opp->kvm->slots_lock);
>> >> > +
>> >> > +            if (copy_to_user((u64 __user *)(long)attr->addr,
>> >> > +                     &attr64, sizeof(u64)))
>> >> u64 is tricky with put_user on 32bit hosts, so here copy_to_user makes sense
>> >
>> > What are the issues with put_user?  It looks like it's supported with a pair of "stw" instructions.
>> Oh? Last time I tried to use get/put_user for one_reg it failed on ppc32. So maybe the u64 support is new?
> 
> Not new according to git -- though I haven't tried to use it yet; maybe it's broken.
> 
>> >> > +    case KVM_DEV_MPIC_GRP_IRQ_ACTIVE:
>> >> > +        if (attr->attr > MAX_SRC)
>> >> > +            return -EINVAL;
>> >> > +
>> >> > +        attr32 = opp->src[attr->attr].pending;
>> >> Isn't this missing a lock?
>> >
>> > I don't see why it needs one.  If the pending status changes during the ioctl, it's undefined which state you'll read back, and a lock wouldn't change that (you could end up taking the lock before or after the change gets made).
>> >
>> > reg_base above was a different situation -- it's 64-bit, so we can't read it atomically on 32-bit.
>> Ok, so this relies on 32bit read accesses being atomic and stale values ok.
> 
> Not just 32-bit, but only two possible values, with one bitflip between them...  Even if GCC does the read a byte at a time it'd be OK.
> 
>> That works for me, but deserves a comment.
> 
> If I'm going to change it, might as well just put the lock in to be consistent.

Either way works, just want to make sure whoever reads the code knows that things were done on purpose :)

Alex

> 
> -Scott

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v3 5/6] kvm/ppc/mpic: in-kernel MPIC emulation
  2013-04-03 22:12                 ` Alexander Graf
@ 2013-04-03 22:54                   ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-03 22:54 UTC (permalink / raw)
  To: Alexander Graf
  Cc: <kvm-ppc@vger.kernel.org>, <kvm@vger.kernel.org>,
	<paulus@samba.org>

On 04/03/2013 05:12:06 PM, Alexander Graf wrote:
> 
> 
> On 04.04.2013 at 00:07, Scott Wood <scottwood@freescale.com> wrote:
> 
> > On 04/03/2013 04:58:56 PM, Alexander Graf wrote:
> >> On 03.04.2013 at 23:38, Scott Wood  
> <scottwood@freescale.com> wrote:
> >> > On 04/03/2013 11:19:42 AM, Alexander Graf wrote:
> >> >> On 03.04.2013, at 03:57, Scott Wood wrote:
> >> >> > Hook the MPIC code up to the KVM interfaces, add locking, etc.
> >> >> >
> >> >> > TODO: irqfd support, split up into multiple patches,  
> KVM_IRQ_LINE
> >> >> > support
> >> >> >
> >> >> > Signed-off-by: Scott Wood <scottwood@freescale.com>
> >> >> > ---
> >> >> > v3: mpic_put -> kvmppc_mpic_put
> >> >> >
> >> >> >
> >> >> [...]
> >> >> > +void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu);
> >> >> > +
> >> >> > int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
> >> >> >                  struct kvm_config_tlb *cfg);
> >> >> > int kvm_vcpu_ioctl_dirty_tlb(struct kvm_vcpu *vcpu,
> >> >> > diff --git a/arch/powerpc/kvm/Kconfig  
> b/arch/powerpc/kvm/Kconfig
> >> >> > index 63c67ec..a87139b 100644
> >> >> > --- a/arch/powerpc/kvm/Kconfig
> >> >> > +++ b/arch/powerpc/kvm/Kconfig
> >> >> > @@ -151,6 +151,11 @@ config KVM_E500MC
> >> >> >
> >> >> >      If unsure, say N.
> >> >> >
> >> >> > +config KVM_MPIC
> >> >> > +    bool "KVM in-kernel MPIC emulation"
> >> >> > +    depends on KVM
> >> >> This should probably depend on FSL KVM for now, until someone  
> adds support for other MPIC revisions.
> >> >
> >> > I don't see a symbol specifically for "FSL KVM".  What part of  
> the MPIC code depends on booke or any FSL-specific code?
> >> You support only FSL mpic device IDs :). So if someone on book3s  
> goes along and sees this, he'd think "yes, I want an in-kernel MPIC",  
> enables the option and wastes space.
> >
> > "Would this waste space" is not generally the criteria for kconfig  
> dependencies.  Who is the kernel to get in the way of someone that  
> wants an FSL MPIC on a 4xx VM? :-)
> >
> > And again, there's no symbol for FSL KVM -- I'd have to use a list  
> that could get out of date.  And it would reduce build testing in  
> allyesconfig-type configs.
> 
> Ok, please indicate compatibility limitations in the Kconfig  
> description at least then.

OK -- not really a "compatibility" limitation so much as what models  
are supported.

Note that mpc86xx has a 74xx-derived core, but also has an FSL MPIC...

Is 74xx/e600 supported by book3s_pr?  Can't tell from the kconfig text.  
:-)

> >> >> > +    case KVM_DEV_MPIC_GRP_IRQ_ACTIVE:
> >> >> > +        if (attr->attr > MAX_SRC)
> >> >> > +            return -EINVAL;
> >> >> > +
> >> >> > +        attr32 = opp->src[attr->attr].pending;
> >> >> Isn't this missing a lock?
> >> >
> >> > I don't see why it needs one.  If the pending status changes  
> during the ioctl, it's undefined which state you'll read back, and a  
> lock wouldn't change that (you could end up taking the lock before or  
> after the change gets made).
> >> >
> >> > reg_base above was a different situation -- it's 64-bit, so we  
> can't read it atomically on 32-bit.
> >> Ok, so this relies on 32bit read accesses being atomic and stale  
> values ok.
> >
> > Not just 32-bit, but only two possible values, with one bitflip  
> between them...  Even if GCC does the read a byte at a time it'd be  
> OK.
> >
> >> That works for me, but deserves a comment.
> >
> > If I'm going to change it, might as well just put the lock in to be  
> consistent.
> 
> Either way works, just want to make sure whoever reads the code knows  
> that things were done on purpose :)

OTOH, it looks like access_reg() *was* missing a lock. :-P

-Scott

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v3 5/6] kvm/ppc/mpic: in-kernel MPIC emulation
  2013-04-03 22:07               ` Scott Wood
  (?)
@ 2013-04-03 23:23                 ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-03 23:23 UTC (permalink / raw)
  To: Scott Wood
  Cc: Alexander Graf, <kvm-ppc@vger.kernel.org>,
	<kvm@vger.kernel.org>, <paulus@samba.org>,
	linuxppc-dev

On 04/03/2013 05:07:30 PM, Scott Wood wrote:
> On 04/03/2013 04:58:56 PM, Alexander Graf wrote:
>> 
>> 
>> On 03.04.2013 at 23:38, Scott Wood <scottwood@freescale.com> wrote:
>> 
>> > On 04/03/2013 11:19:42 AM, Alexander Graf wrote:
>> >> On 03.04.2013, at 03:57, Scott Wood wrote:
>> >> > +    switch (attr->group) {
>> >> > +    case KVM_DEV_MPIC_GRP_MISC:
>> >> > +        switch (attr->attr) {
>> >> > +        case KVM_DEV_MPIC_BASE_ADDR:
>> >> > +            mutex_lock(&opp->kvm->slots_lock);
>> >> > +            attr64 = opp->reg_base;
>> >> > +            mutex_unlock(&opp->kvm->slots_lock);
>> >> > +
>> >> > +            if (copy_to_user((u64 __user *)(long)attr->addr,
>> >> > +                     &attr64, sizeof(u64)))
>> >> u64 is tricky with put_user on 32bit hosts, so here copy_to_user  
>> makes sense
>> >
>> > What are the issues with put_user?  It looks like it's supported  
>> with a pair of "stw" instructions.
>> 
>> Oh? Last time I tried to use get/put_user for one_reg it failed on  
>> ppc32. So maybe the u64 support is new?
> 
> Not new according to git -- though I haven't tried to use it yet;  
> maybe it's broken.

Yeah, it's broken. :-P

__get_user_size() looks OK, but __get_user_check/nocheck() goes through  
an intermediary "unsigned long __gu_val".

There's a separate __get_user64_nocheck() that uses "long long", but no  
"check" variant, no "put", and it's only available in 32-bit builds.   
And it's not used anywhere (barring ungreppable token-pasting magic).   
Sigh.
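
A minimal userspace sketch of the failure mode described above, not the powerpc uaccess macros themselves. It assumes a 32-bit host, where "unsigned long" is 32 bits wide, and models that with a hypothetical ulong32_t typedef so the effect is visible on any build machine. The helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for "unsigned long" on a 32-bit host (assumption of this sketch). */
typedef uint32_t ulong32_t;

/*
 * Models a fetch that is routed through a 32-bit temporary before being
 * assigned to the 64-bit destination: the upper half is silently dropped.
 */
static uint64_t fetch_via_ulong32(uint64_t user_val)
{
	ulong32_t tmp = (ulong32_t)user_val;	/* keeps only the low 32 bits */

	return tmp;
}

/*
 * Models a byte-wise copy in the style of copy_from_user(): the full
 * 64 bits arrive intact regardless of the host word size.
 */
static uint64_t fetch_via_copy(uint64_t user_val)
{
	uint64_t val;

	memcpy(&val, &user_val, sizeof(val));
	return val;
}

/* Usage: the same input gives different results through the two paths. */
static void fetch_demo(void)
{
	assert(fetch_via_ulong32(0x123456789abcdef0ULL) == 0x9abcdef0ULL);
	assert(fetch_via_copy(0x123456789abcdef0ULL) ==
	       0x123456789abcdef0ULL);
}
```

This is why the byte-wise copy is the conservative choice for a u64 attribute on a 32-bit host: it does not depend on how wide the accessor's temporary is.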

-Scott

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v3 5/6] kvm/ppc/mpic: in-kernel MPIC emulation
  2013-04-03 20:58           ` Scott Wood
@ 2013-04-04  5:59             ` Gleb Natapov
  -1 siblings, 0 replies; 261+ messages in thread
From: Gleb Natapov @ 2013-04-04  5:59 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm, paulus

On Wed, Apr 03, 2013 at 03:58:04PM -0500, Scott Wood wrote:
> On 04/03/2013 10:55:27 AM, Gleb Natapov wrote:
> >On Tue, Apr 02, 2013 at 08:57:52PM -0500, Scott Wood wrote:
> >> Hook the MPIC code up to the KVM interfaces, add locking, etc.
> >>
> >> TODO: irqfd support, split up into multiple patches, KVM_IRQ_LINE
> >> support
> >>
> >> Signed-off-by: Scott Wood <scottwood@freescale.com>
> >> ---
> >[skip]
> >
> >> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >> index 20ce2d2..d8f44ef 100644
> >> --- a/include/uapi/linux/kvm.h
> >> +++ b/include/uapi/linux/kvm.h
> >> @@ -927,6 +927,15 @@ struct kvm_device_attr {
> >>  	__u64	addr;		/* userspace address of attr data */
> >>  };
> >>
> >> +#define KVM_DEV_TYPE_FSL_MPIC_20	1
> >> +#define KVM_DEV_TYPE_FSL_MPIC_42	2
> >> +
> >> +#define KVM_DEV_MPIC_GRP_MISC		1
> >> +#define   KVM_DEV_MPIC_BASE_ADDR	0	/* 64-bit */
> >> +
> >> +#define KVM_DEV_MPIC_GRP_REGISTER	2	/* 32-bit */
> >> +#define KVM_DEV_MPIC_GRP_IRQ_ACTIVE	3	/* 32-bit */
> >Why not put them in arch specific header?
> 
> KVM_DEV_TYPE_* is not an arch-specific enumeration -- this was
> discussed last time around.
> 
Yes, I am talking about KVM_DEV_MPIC_* only. KVM_DEV_TYPE_* is used by
common code, so it should stay here.

> KVM_DEV_MPIC_* could go elsewhere if you want to avoid cluttering
> the main kvm.h.  The arch header would be OK, since the non-arch
> header includes the arch header, and thus it wouldn't be visible to
> userspace where it is -- if there later is a need for MPIC (or
> whatever other device follows MPIC's example) on another
> architecture, it could be moved without breaking anything.  Or, we
> could just have a header for each device type.
> 
If a device is used by more than one arch, it will move into virt/kvm
and will have its own header, like ioapic.

--
			Gleb.

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v3 5/6] kvm/ppc/mpic: in-kernel MPIC emulation
  2013-04-03 22:54                   ` Scott Wood
@ 2013-04-04  9:42                     ` Alexander Graf
  -1 siblings, 0 replies; 261+ messages in thread
From: Alexander Graf @ 2013-04-04  9:42 UTC (permalink / raw)
  To: Scott Wood
  Cc: <kvm-ppc@vger.kernel.org>, <kvm@vger.kernel.org>,
	<paulus@samba.org>


On 04.04.2013, at 00:54, Scott Wood wrote:

> On 04/03/2013 05:12:06 PM, Alexander Graf wrote:
>> On 04.04.2013, at 00:07, Scott Wood <scottwood@freescale.com> wrote:
>> > On 04/03/2013 04:58:56 PM, Alexander Graf wrote:
>> >> On 03.04.2013, at 23:38, Scott Wood <scottwood@freescale.com> wrote:
>> >> > On 04/03/2013 11:19:42 AM, Alexander Graf wrote:
>> >> >> On 03.04.2013, at 03:57, Scott Wood wrote:
>> >> >> > Hook the MPIC code up to the KVM interfaces, add locking, etc.
>> >> >> >
>> >> >> > TODO: irqfd support, split up into multiple patches, KVM_IRQ_LINE
>> >> >> > support
>> >> >> >
>> >> >> > Signed-off-by: Scott Wood <scottwood@freescale.com>
>> >> >> > ---
>> >> >> > v3: mpic_put -> kvmppc_mpic_put
>> >> >> >
>> >> >> >
>> >> >> [...]
>> >> >> > +void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu);
>> >> >> > +
>> >> >> > int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
>> >> >> >                  struct kvm_config_tlb *cfg);
>> >> >> > int kvm_vcpu_ioctl_dirty_tlb(struct kvm_vcpu *vcpu,
>> >> >> > diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
>> >> >> > index 63c67ec..a87139b 100644
>> >> >> > --- a/arch/powerpc/kvm/Kconfig
>> >> >> > +++ b/arch/powerpc/kvm/Kconfig
>> >> >> > @@ -151,6 +151,11 @@ config KVM_E500MC
>> >> >> >
>> >> >> >      If unsure, say N.
>> >> >> >
>> >> >> > +config KVM_MPIC
>> >> >> > +    bool "KVM in-kernel MPIC emulation"
>> >> >> > +    depends on KVM
>> >> >> This should probably depend on FSL KVM for now, until someone adds support for other MPIC revisions.
>> >> >
>> >> > I don't see a symbol specifically for "FSL KVM".  What part of the MPIC code depends on booke or any FSL-specific code?
>> >> You support only FSL mpic device IDs :). So if someone on book3s goes along and sees this, he'd think "yes, I want an in-kernel MPIC", enables the option and wastes space.
>> >
>> > "Would this waste space" is not generally the criteria for kconfig dependencies.  Who is the kernel to get in the way of someone that wants an FSL MPIC on a 4xx VM? :-)
>> >
>> > And again, there's no symbol for FSL KVM -- I'd have to use a list that could get out of date.  And it would reduce build testing in allyesconfig-type configs.
>> Ok, please indicate compatibility limitations in the Kconfig description at least then.
> 
> OK -- not really a "compatibility" limitation so much as what models are supported.
> 
> Note that mpc86xx has a 74xx-derived core, but also has an FSL MPIC...
> 
> Is 74xx/e600 supported by book3s_pr?  Can't tell from the kconfig text. :-)

On book3s_pr we don't have a good compatibility check mechanism in place. That's really suboptimal.

What I'm saying is that Kconfig should say "In-kernel emulation of FSL MPIC 2.0 and FSL MPIC 4.2 interrupt controllers. Say Y here if you plan to run KVM on an FSL system". That'd be in line with what you can actually enable using the ioctls and leaves the decision whether that's a good thing to the user.

Code-wise there shouldn't be any dependency on host or guest architecture of course.


Alex

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v2 1/6] kvm: add device control API
  2013-04-03 17:39                 ` Alexander Graf
@ 2013-04-04  9:58                   ` Gleb Natapov
  -1 siblings, 0 replies; 261+ messages in thread
From: Gleb Natapov @ 2013-04-04  9:58 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Scott Wood, Paul Mackerras, kvm-ppc, kvm@vger.kernel.org list,
	Marcelo Tosatti

On Wed, Apr 03, 2013 at 07:39:52PM +0200, Alexander Graf wrote:
> 
> On 03.04.2013, at 19:37, Scott Wood wrote:
> 
> > On 04/03/2013 08:22:37 AM, Alexander Graf wrote:
> >> On 03.04.2013, at 04:17, Paul Mackerras wrote:
> >> > On Tue, Apr 02, 2013 at 08:19:56PM -0500, Scott Wood wrote:
> >> >> On 04/02/2013 08:02:39 PM, Paul Mackerras wrote:
> >> >>> On Mon, Apr 01, 2013 at 05:47:48PM -0500, Scott Wood wrote:
> >> >>>> +4.79 KVM_CREATE_DEVICE
> >> >>>> +
> >> >>>> +Capability: KVM_CAP_DEVICE_CTRL
> >> >>>
> >> >>> I notice this patch doesn't add this capability;
> >> >>
> >> >> Yes, it does (see below).
> >> >>
> >> >>> you add it in a later patch.
> >> >>
> >> >> Maybe you're thinking of KVM_CAP_IRQ_MPIC?
> >> >
> >> > No, I was referring to the addition to kvm_dev_ioctl_check_extension()
> >> > of a KVM_CAP_DEVICE_CTRL case.  Since this patch adds the code to handle
> >> > KVM_CREATE_DEVICE ioctl it should also add the code to return 1 if
> >> > userspace queries the KVM_CAP_DEVICE_CTRL capability.
> >> >
> >> >>>> +/* ioctl for vm fd */
> >> >>>> +#define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xe0, struct
> >> >>> kvm_create_device)
> >> >>>
> >> >>> This define should go with the other VM ioctls, otherwise the next
> >> >>> person to add a VM ioctl will probably miss it and reuse the 0xe0
> >> >>> code.
> >> >>
> >> >> That's actually why I moved it to a new section, with device control
> >> >> ioctls getting their own range, as the legacy "device model" and
> >> >> some other things did.  0xe0 is not the next ioctl that would be
> >> >> used for either vm or vcpu.  The ioctl numbering is actually already
> >> >> a mess, with sometimes care being taken to keep vcpu and vm ioctls
> >> >> from overlapping, but on other places overlapping does happen.  I'm
> >> >> not sure what exactly I should do here.
> >> >
> >> > Well, even if you are using a new range, I still think that
> >> > KVM_CREATE_DEVICE, being a VM ioctl, should go next to the other VM
> >> > ioctls.  I guess it's ultimately up to the maintainers.
> >> I agree. Things get confusing for VM ioctls otherwise.
> > 
> > Things are already confusing. :-)
> > 
> > I can move KVM_CREATE_DEVICE back with the other VM ioctls, but what number should it get?  The last VM ioctl is 0xab (which is also a VCPU ioctl).  Should I use 0xac (which is also a VCPU ioctl)?  Or should I try to avoid a conflict, as was sometimes done in the past -- in which case, which number should I use?
> 
> Gleb, Marcelo?
> 
> 
Yes, ioctl number assignments are a bit of a mess :( There are 14
numbers that are used twice. Some of them are used twice for the same
type of fd but with different direction bits, and one, 0x9b, is used
twice with the same IO direction and even the same parameter size (I
just checked)! Fortunately the second use is only on IA64, which is
almost dead.

Although ioctl numbers for different types of fds should not conflict,
let's try to make them unique. Scott, please put KVM_CREATE_DEVICE near
the device model ioctls and give it a unique number.
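
For reference, a small sketch of why a reused number only collides in some cases. An ioctl request code packs the direction, the payload size, the type byte and the sequence number together, so two definitions sharing a number still differ when direction or size differs. The payload structs and SKETCH_* names below are hypothetical; only the type byte 0xAE matches KVMIO:

```c
#include <stdint.h>
#include <linux/ioctl.h>	/* _IOR, _IOW, _IOC_NR */

#define SKETCH_KVMIO 0xAE	/* same value as KVMIO in <linux/kvm.h> */

/* Hypothetical payloads, only here to give the requests different sizes. */
struct sketch_small { uint32_t a; };
struct sketch_large { uint64_t a, b; };

/* Same sequence number, different direction and payload size. */
#define SKETCH_IOCTL_A	_IOW(SKETCH_KVMIO, 0x9b, struct sketch_small)
#define SKETCH_IOCTL_B	_IOR(SKETCH_KVMIO, 0x9b, struct sketch_large)
/* Same number, direction and size as A: the codes are identical. */
#define SKETCH_IOCTL_C	_IOW(SKETCH_KVMIO, 0x9b, struct sketch_small)

/* True if both requests share the low-order sequence number. */
static int same_nr(unsigned long x, unsigned long y)
{
	return _IOC_NR(x) == _IOC_NR(y);
}

/* True if the full request codes are indistinguishable. */
static int same_request(unsigned long x, unsigned long y)
{
	return x == y;
}
```

A and B share a number but remain distinct request codes. A and C are identical, so only the type of fd they are issued on can tell them apart, which is the 0x9b situation described above.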

--
			Gleb.

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v3 1/6] kvm: add device control API
  2013-04-03  1:57       ` Scott Wood
@ 2013-04-04 10:41         ` Gleb Natapov
  -1 siblings, 0 replies; 261+ messages in thread
From: Gleb Natapov @ 2013-04-04 10:41 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm, paulus

On Tue, Apr 02, 2013 at 08:57:48PM -0500, Scott Wood wrote:
> Currently, devices that are emulated inside KVM are configured in a
> hardcoded manner based on an assumption that any given architecture
> only has one way to do it.  If there's any need to access device state,
> it is done through inflexible one-purpose-only IOCTLs (e.g.
> KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> cumbersome and depletes a limited numberspace.
> 
> This API provides a mechanism to instantiate a device of a certain
> type, returning an ID that can be used to set/get attributes of the
> device.  Attributes may include configuration parameters (e.g.
> register base address), device state, operational commands, etc.  It
> is similar to the ONE_REG API, except that it acts on devices rather
> than vcpus.
> 
> Both device types and individual attributes can be tested without having
> to create the device or get/set the attribute, without the need for
> separately managing enumerated capabilities.
> 
> Signed-off-by: Scott Wood <scottwood@freescale.com>
> ---
> v3: remove some changes that were merged into this patch by accident,
> and fix the error documentation for KVM_CREATE_DEVICE.
> 
> NOTE: I had some difficulty figuring out what ioctl numbers I should
> assign...  it seems that at one point care was taken to keep vcpu and
> vm ioctls separate, but some overlap exists now (despite not exhausting
> the ioctl space).  Some of that was my fault, but not all of it. :-)
> I moved to a new ioctl range for device control -- please let me know
> if there's something else you'd prefer I do.
> ---
>  Documentation/virtual/kvm/api.txt        |   70 ++++++++++++++++++++++++++++++
>  Documentation/virtual/kvm/devices/README |    1 +
>  include/uapi/linux/kvm.h                 |   27 ++++++++++++
>  virt/kvm/kvm_main.c                      |   31 +++++++++++++
>  4 files changed, 129 insertions(+)
>  create mode 100644 Documentation/virtual/kvm/devices/README
> 
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 976eb65..d52f3f9 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2173,6 +2173,76 @@ header; first `n_valid' valid entries with contents from the data
>  written, then `n_invalid' invalid entries, invalidating any previously
>  valid entries found.
>  
> +4.79 KVM_CREATE_DEVICE
> +
> +Capability: KVM_CAP_DEVICE_CTRL
> +Type: vm ioctl
> +Parameters: struct kvm_create_device (in/out)
> +Returns: 0 on success, -1 on error
> +Errors:
> +  ENODEV: The device type is unknown or unsupported
> +  EEXIST: Device already created, and this type of device may not
> +          be instantiated multiple times
> +
> +  Other error conditions may be defined by individual device types or
> +  have their standard meanings.
> +
> +Creates an emulated device in the kernel.  The file descriptor returned
> +in fd can be used with KVM_SET/GET/HAS_DEVICE_ATTR.
> +
> +If the KVM_CREATE_DEVICE_TEST flag is set, only test whether the
> +device type is supported (not necessarily whether it can be created
> +in the current vm).
> +
> +Individual devices should not define flags.  Attributes should be used
> +for specifying any behavior that is not implied by the device type
> +number.
> +
> +struct kvm_create_device {
> +	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
> +	__u32	fd;	/* out: device handle */
> +	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
> +};
> +
> +4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR
> +
> +Capability: KVM_CAP_DEVICE_CTRL
> +Type: device ioctl
> +Parameters: struct kvm_device_attr
> +Returns: 0 on success, -1 on error
> +Errors:
> +  ENXIO:  The group or attribute is unknown/unsupported for this device
> +  EPERM:  The attribute cannot (currently) be accessed this way
> +          (e.g. read-only attribute, or attribute that only makes
> +          sense when the device is in a different state)
> +
> +  Other error conditions may be defined by individual device types.
> +
> +Gets/sets a specified piece of device configuration and/or state.  The
> +semantics are device-specific.  See individual device documentation in
> +the "devices" directory.  As with ONE_REG, the size of the data
> +transferred is defined by the particular attribute.
> +
> +struct kvm_device_attr {
> +	__u32	flags;		/* no flags currently defined */
> +	__u32	group;		/* device-defined */
> +	__u64	attr;		/* group-defined */
> +	__u64	addr;		/* userspace address of attr data */
> +};
> +
Now that each device has its own fd, is there an advantage to enforcing
a common interface between different devices? If we do so, though, why
not handle file creation, ioctls and file descriptor lifetime in common
code? Common code would have a "struct kvm_device" with "struct
kvm_device_arch" and "struct kvm_device_ops" members. Instead of
kvm_mpic_ioctl there would be a kvm_device_ioctl which dispatches ioctls
to a device using kvm_device->ops->(set|get|has)_attr pointers.
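
A rough userspace sketch of that dispatch, just to make the shape concrete. All sketch_* and toy_* names are hypothetical, the attr struct only mirrors the layout of struct kvm_device_attr from the patch, and returning -ENXIO for a missing callback is an assumption borrowed from the error list in the documentation:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the layout of struct kvm_device_attr from the patch. */
struct sketch_device_attr {
	uint32_t flags;
	uint32_t group;
	uint64_t attr;
	uint64_t addr;
};

struct sketch_device;

typedef int (*sketch_attr_fn)(struct sketch_device *dev,
			      struct sketch_device_attr *attr);

/* Hypothetical per-device-type callbacks. */
struct sketch_device_ops {
	sketch_attr_fn set_attr;
	sketch_attr_fn get_attr;
	sketch_attr_fn has_attr;
};

struct sketch_device {
	const struct sketch_device_ops *ops;
	void *priv;		/* device-specific state, e.g. the MPIC */
};

enum sketch_cmd { SKETCH_SET_ATTR, SKETCH_GET_ATTR, SKETCH_HAS_ATTR };

/* Common entry point: one handler shared by every device type. */
static int sketch_device_ioctl(struct sketch_device *dev, enum sketch_cmd cmd,
			       struct sketch_device_attr *attr)
{
	sketch_attr_fn fn = NULL;

	switch (cmd) {
	case SKETCH_SET_ATTR:
		fn = dev->ops->set_attr;
		break;
	case SKETCH_GET_ATTR:
		fn = dev->ops->get_attr;
		break;
	case SKETCH_HAS_ATTR:
		fn = dev->ops->has_attr;
		break;
	}

	/* Assumption: an unimplemented callback reads as "unsupported". */
	return fn ? fn(dev, attr) : -ENXIO;
}

/* Toy backend that recognizes a single attribute: group 1, attr 0. */
static int toy_mpic_has_attr(struct sketch_device *dev,
			     struct sketch_device_attr *attr)
{
	(void)dev;
	return (attr->group == 1 && attr->attr == 0) ? 0 : -ENXIO;
}

static const struct sketch_device_ops toy_mpic_ops = {
	.has_attr = toy_mpic_has_attr,
};
```

With this split, a device type only supplies the ops table, while fd creation and lifetime would live next to the common handler.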

--
			Gleb.

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v3 6/6] kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC
  2013-04-03  1:57       ` Scott Wood
@ 2013-04-04 12:54         ` Alexander Graf
  -1 siblings, 0 replies; 261+ messages in thread
From: Alexander Graf @ 2013-04-04 12:54 UTC (permalink / raw)
  To: Scott Wood; +Cc: kvm-ppc, kvm, paulus


On 03.04.2013, at 03:57, Scott Wood wrote:

> Enabling this capability connects the vcpu to the designated in-kernel
> MPIC.  Using explicit connections between vcpus and irqchips allows
> for flexibility, but the main benefit at the moment is that it
> simplifies the code -- KVM doesn't need vm-global state to remember
> which MPIC object is associated with this vm, and it doesn't need to
> care about ordering between irqchip creation and vcpu creation.
> 
> Signed-off-by: Scott Wood <scottwood@freescale.com>
> ---
> Documentation/virtual/kvm/api.txt   |    8 ++++++
> arch/powerpc/include/asm/kvm_host.h |    8 ++++++
> arch/powerpc/include/asm/kvm_ppc.h  |    2 ++
> arch/powerpc/kvm/booke.c            |    4 ++-
> arch/powerpc/kvm/mpic.c             |   49 +++++++++++++++++++++++++++++++----
> arch/powerpc/kvm/powerpc.c          |   26 +++++++++++++++++++
> include/uapi/linux/kvm.h            |    1 +
> 7 files changed, 92 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index d52f3f9..4c326ae 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2728,3 +2728,11 @@ to receive the topmost interrupt vector.
> When disabled (args[0] == 0), behavior is as if this facility is unsupported.
> 
> When this capability is enabled, KVM_EXIT_EPR can occur.
> +
> +6.6 KVM_CAP_IRQ_MPIC
> +
> +Architectures: ppc
> +Parameters: args[0] is the MPIC device fd
> +            args[1] is the MPIC CPU number for this vcpu
> +
> +This capability connects the vcpu to an in-kernel MPIC device.
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 7e7aef9..2a2e235 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -375,6 +375,11 @@ struct kvmppc_booke_debug_reg {
> 	u64 dac[KVMPPC_BOOKE_MAX_DAC];
> };
> 
> +#define KVMPPC_IRQ_DEFAULT	0
> +#define KVMPPC_IRQ_MPIC		1
> +
> +struct openpic;
> +
> struct kvm_vcpu_arch {
> 	ulong host_stack;
> 	u32 host_pid;
> @@ -554,6 +559,9 @@ struct kvm_vcpu_arch {
> 	unsigned long magic_page_pa; /* phys addr to map the magic page to */
> 	unsigned long magic_page_ea; /* effect. addr to map the magic page to */
> 
> +	int irq_type;		/* one of KVMPPC_IRQ_* */
> +	struct openpic *mpic;	/* KVMPPC_IRQ_MPIC */
> +
> #ifdef CONFIG_KVM_BOOK3S_64_HV
> 	struct kvm_vcpu_arch_shared shregs;
> 
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index 3b63b97..f54707f 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -276,6 +276,8 @@ static inline void kvmppc_set_epr(struct kvm_vcpu *vcpu, u32 epr)
> }
> 
> void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu);
> +int kvmppc_mpic_connect_vcpu(struct file *mpic_filp, struct kvm_vcpu *vcpu,
> +			     u32 cpu);
> 
> int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
> 			      struct kvm_config_tlb *cfg);
> diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
> index cddc6b3..7d00222 100644
> --- a/arch/powerpc/kvm/booke.c
> +++ b/arch/powerpc/kvm/booke.c
> @@ -430,8 +430,10 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu *vcpu,
> 		if (update_epr == true) {
> 			if (vcpu->arch.epr_flags & KVMPPC_EPR_USER)
> 				kvm_make_request(KVM_REQ_EPR_EXIT, vcpu);
> -			else if (vcpu->arch.epr_flags & KVMPPC_EPR_KERNEL)
> +			else if (vcpu->arch.epr_flags & KVMPPC_EPR_KERNEL) {
> +				BUG_ON(vcpu->arch.irq_type != KVMPPC_IRQ_MPIC);
> 				kvmppc_mpic_set_epr(vcpu);
> +			}
> 		}
> 
> 		new_msr &= msr_mask;
> diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
> index 8cda2fa..caffe3b 100644
> --- a/arch/powerpc/kvm/mpic.c
> +++ b/arch/powerpc/kvm/mpic.c
> @@ -1159,7 +1159,7 @@ static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
> 
> void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu)
> {
> -	struct openpic *opp = vcpu->arch.irqchip_priv;
> +	struct openpic *opp = vcpu->arch.mpic;
> 	int cpu = vcpu->vcpu_id;
> 	unsigned long flags;
> 
> @@ -1442,10 +1442,10 @@ static void map_mmio(struct openpic *opp)
> 
> static void unmap_mmio(struct openpic *opp)
> {
> -	BUG_ON(opp->mmio_mapped);
> -	opp->mmio_mapped = false;
> -
> -	kvm_io_bus_unregister_dev(opp->kvm, KVM_MMIO_BUS, &opp->mmio);
> +	if (opp->mmio_mapped) {
> +		opp->mmio_mapped = false;
> +		kvm_io_bus_unregister_dev(opp->kvm, KVM_MMIO_BUS, &opp->mmio);
> +	}
> }
> 
> static int set_base_addr(struct openpic *opp, struct kvm_device_attr *attr)
> @@ -1681,6 +1681,45 @@ static const struct file_operations kvm_mpic_fops = {
> 	.release = kvm_mpic_release,
> };
> 
> +int kvmppc_mpic_connect_vcpu(struct file *mpic_filp, struct kvm_vcpu *vcpu,
> +			     u32 cpu)
> +{
> +	struct openpic *opp = mpic_filp->private_data;
> +	int ret = 0;
> +
> +	if (mpic_filp->f_op != &kvm_mpic_fops)
> +		return -EPERM;
> +	if (opp->kvm != vcpu->kvm)
> +		return -EPERM;
> +	if (cpu < 0 || cpu >= MAX_CPU)
> +		return -EPERM;
> +
> +	spin_lock_irq(&opp->lock);
> +
> +	if (opp->dst[cpu].vcpu) {
> +		ret = -EEXIST;
> +		goto out;
> +	}
> +	if (vcpu->arch.irq_type) {
> +		ret = -EBUSY;
> +		goto out;
> +	}
> +
> +	opp->dst[cpu].vcpu = vcpu;
> +	opp->nb_cpus = max(opp->nb_cpus, cpu + 1);
> +
> +	vcpu->arch.mpic = opp;
> +	vcpu->arch.irq_type = KVMPPC_IRQ_MPIC;
> +	atomic_inc(&opp->users);
> +
> +	if (opp->mpic_mode_mask == GCR_MODE_PROXY)

Shouldn't this be an &?

> +		vcpu->arch.epr_flags |= KVMPPC_EPR_KERNEL;
> +
> +out:
> +	spin_unlock_irq(&opp->lock);
> +	return ret;
> +}
> +
> int kvm_create_mpic(struct kvm *kvm, u32 type)
> {
> 	struct openpic *opp;
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index c9a2972..290a905 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -25,6 +25,7 @@
> #include <linux/hrtimer.h>
> #include <linux/fs.h>
> #include <linux/slab.h>
> +#include <linux/file.h>
> #include <asm/cputable.h>
> #include <asm/uaccess.h>
> #include <asm/kvm_ppc.h>
> @@ -327,6 +328,9 @@ int kvm_dev_ioctl_check_extension(long ext)
> #if defined(CONFIG_KVM_E500V2) || defined(CONFIG_KVM_E500MC)
> 	case KVM_CAP_SW_TLB:
> #endif
> +#ifdef CONFIG_KVM_MPIC
> +	case KVM_CAP_IRQ_MPIC:
> +#endif
> 		r = 1;
> 		break;
> 	case KVM_CAP_COALESCED_MMIO:
> @@ -460,6 +464,13 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
> 	tasklet_kill(&vcpu->arch.tasklet);
> 
> 	kvmppc_remove_vcpu_debugfs(vcpu);
> +
> +	switch (vcpu->arch.irq_type) {
> +	case KVMPPC_IRQ_MPIC:
> +		kvmppc_mpic_put(vcpu->arch.mpic);

This doesn't tell the MPIC that this exact CPU is getting killed. What if we hotplug remove just a single CPU? Don't we have to deregister the CPU with the MPIC?


Alex

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v3 6/6] kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC
  2013-04-04 12:54         ` Alexander Graf
@ 2013-04-04 18:41           ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-04 18:41 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus

On 04/04/2013 07:54:20 AM, Alexander Graf wrote:
> 
> On 03.04.2013, at 03:57, Scott Wood wrote:
> 
> > +	if (opp->mpic_mode_mask == GCR_MODE_PROXY)
> 
> Shouldn't this be an &?

The way the mode field was originally documented was a two-bit field,  
where 0b11 was external proxy, and 0b10 was reserved.  If we use & it  
would have to be:

	if ((opp->mpic_mode_mask & GCR_MODE_PROXY) == GCR_MODE_PROXY)
		...

Simply testing "opp->mpic_mode_mask & GCR_MODE_PROXY" would return true  
in the case of GCR_MODE_MIXED.

In MPIC 4.3 external proxy is defined as a separate bit (GCR[CI]) that  
is ignored if the mixed-mode bit (GCR[M]) is not set, which makes it a  
bit more legitimate to view it as a bitmap.  Still, I doubt we'll see  
new mode bits.
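
To make the overlap concrete, here is a minimal sketch of the three
possible tests.  The numeric encodings are assumptions derived from the
two-bit layout described above (0b00 pass-through, 0b01 mixed, 0b11
external proxy); they are not quoted anywhere in this thread, and the
helper names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumed encodings of the two-bit GCR mode field:
 * 0b00 pass-through, 0b01 mixed, 0b11 external proxy (0b10 reserved).
 */
#define GCR_MODE_PASS	0x00000000u
#define GCR_MODE_MIXED	0x20000000u
#define GCR_MODE_PROXY	0x60000000u

/* Exact comparison, as written in the patch. */
static bool is_proxy_eq(uint32_t mode)
{
	return mode == GCR_MODE_PROXY;
}

/* Plain mask test: also true for mixed mode, because the bits overlap. */
static bool is_proxy_and(uint32_t mode)
{
	return (mode & GCR_MODE_PROXY) != 0;
}

/* Mask-and-compare: matches only when both mode bits are set. */
static bool is_proxy_masked(uint32_t mode)
{
	return (mode & GCR_MODE_PROXY) == GCR_MODE_PROXY;
}
```

With these values, is_proxy_and() accepts GCR_MODE_MIXED while the
other two helpers reject it, which is the failure mode a bare & would
introduce.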

> > @@ -460,6 +464,13 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
> > 	tasklet_kill(&vcpu->arch.tasklet);
> >
> > 	kvmppc_remove_vcpu_debugfs(vcpu);
> > +
> > +	switch (vcpu->arch.irq_type) {
> > +	case KVMPPC_IRQ_MPIC:
> > +		kvmppc_mpic_put(vcpu->arch.mpic);
> 
> This doesn't tell the MPIC that this exact CPU is getting killed.  
> What if we hotplug remove just a single CPU? Don't we have to  
> deregister the CPU with the MPIC?

If we ever support hot vcpu removal, yes.  We'd probably need some MPIC  
code changes to accommodate that, and we wouldn't currently have a way  
to test it, so I'd rather make it obviously not supported for now.

-Scott

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v3 6/6] kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC
  2013-04-04 18:41           ` Scott Wood
@ 2013-04-04 22:30             ` Alexander Graf
  -1 siblings, 0 replies; 261+ messages in thread
From: Alexander Graf @ 2013-04-04 22:30 UTC (permalink / raw)
  To: Scott Wood
  Cc: <kvm-ppc@vger.kernel.org>, <kvm@vger.kernel.org>,
	<paulus@samba.org>



On 04.04.2013 at 20:41, Scott Wood <scottwood@freescale.com> wrote:

> On 04/04/2013 07:54:20 AM, Alexander Graf wrote:
>> On 03.04.2013, at 03:57, Scott Wood wrote:
>> > +    if (opp->mpic_mode_mask == GCR_MODE_PROXY)
>> Shouldn't this be an &?
> 
> The way the mode field was originally documented was a two-bit field, where 0b11 was external proxy, and 0b10 was reserved.  If we use & it would have to be:
> 
>    if ((opp->mpic_mode_mask & GCR_MODE_PROXY) == GCR_MODE_PROXY)
>        ...
> 
> Simply testing "opp->mpic_mode_mask & GCR_MODE_PROXY" would return true in the case of GCR_MODE_MIXED.
> 
> In MPIC 4.3 external proxy is defined as a separate bit (GCR[CI]) that is ignored if the mixed-mode bit (GCR[M]) is not set, which makes it a bit more legitimate to view it as a bitmap.  Still, I doubt we'll see new mode bits.

Ok, please add a comment about this here then :).

> 
>> > @@ -460,6 +464,13 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
>> >    tasklet_kill(&vcpu->arch.tasklet);
>> >
>> >    kvmppc_remove_vcpu_debugfs(vcpu);
>> > +
>> > +    switch (vcpu->arch.irq_type) {
>> > +    case KVMPPC_IRQ_MPIC:
>> > +        kvmppc_mpic_put(vcpu->arch.mpic);
>> This doesn't tell the MPIC that this exact CPU is getting killed. What if we hotplug remove just a single CPU? Don't we have to deregister the CPU with the MPIC?
> 
> If we ever support hot vcpu removal, yes.  We'd probably need some MPIC code changes to accommodate that, and we wouldn't currently have a way to test it, so I'd rather make it obviously not supported for now.

Is there any way to break heavily if user space attempts this?

Alex

> 
> -Scott

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v3 6/6] kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC
  2013-04-04 22:30             ` Alexander Graf
@ 2013-04-04 22:35               ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-04 22:35 UTC (permalink / raw)
  To: Alexander Graf
  Cc: <kvm-ppc@vger.kernel.org>, <kvm@vger.kernel.org>,
	<paulus@samba.org>

On 04/04/2013 05:30:05 PM, Alexander Graf wrote:
> 
> 
> On 04.04.2013 at 20:41, Scott Wood <scottwood@freescale.com> wrote:
> 
> > On 04/04/2013 07:54:20 AM, Alexander Graf wrote:
> >> On 03.04.2013, at 03:57, Scott Wood wrote:
> >> > +    if (opp->mpic_mode_mask == GCR_MODE_PROXY)
> >> Shouldn't this be an &?
> >
> > The way the mode field was originally documented was a two-bit
> > field, where 0b11 was external proxy, and 0b10 was reserved.  If we
> > use & it would have to be:
> >
> >    if ((opp->mpic_mode_mask & GCR_MODE_PROXY) == GCR_MODE_PROXY)
> >        ...
> >
> > Simply testing "opp->mpic_mode_mask & GCR_MODE_PROXY" would return
> > true in the case of GCR_MODE_MIXED.
> >
> > In MPIC 4.3 external proxy is defined as a separate bit (GCR[CI])
> > that is ignored if the mixed-mode bit (GCR[M]) is not set, which
> > makes it a bit more legitimate to view it as a bitmap.  Still, I
> > doubt we'll see new mode bits.
> 
> Ok, please add a comment about this here then :).

What sort of comment would you like?  Or do you want me to use the "(x  
& y) == y" version?

> >> > @@ -460,6 +464,13 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
> >> >    tasklet_kill(&vcpu->arch.tasklet);
> >> >
> >> >    kvmppc_remove_vcpu_debugfs(vcpu);
> >> > +
> >> > +    switch (vcpu->arch.irq_type) {
> >> > +    case KVMPPC_IRQ_MPIC:
> >> > +        kvmppc_mpic_put(vcpu->arch.mpic);
> >> This doesn't tell the MPIC that this exact CPU is getting killed.
> >> What if we hotplug remove just a single CPU? Don't we have to
> >> deregister the CPU with the MPIC?
> >
> > If we ever support hot vcpu removal, yes.  We'd probably need some
> > MPIC code changes to accommodate that, and we wouldn't currently have
> > a way to test it, so I'd rather make it obviously not supported for
> > now.
> 
> Is there any way to break heavily if user space attempts this?

Is there any way for userspace to request this currently?  They can  
close the vcpu fd, but the vcpu won't actually be destroyed until the  
vm goes down.

-Scott

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v3 5/6] kvm/ppc/mpic: in-kernel MPIC emulation
  2013-04-04  5:59             ` Gleb Natapov
@ 2013-04-04 23:33               ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-04 23:33 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Alexander Graf, kvm-ppc, kvm, paulus

On 04/04/2013 12:59:02 AM, Gleb Natapov wrote:
> On Wed, Apr 03, 2013 at 03:58:04PM -0500, Scott Wood wrote:
> > KVM_DEV_MPIC_* could go elsewhere if you want to avoid cluttering
> > the main kvm.h.  The arch header would be OK, since the non-arch
> > header includes the arch header, and thus it wouldn't be visible to
> > userspace where it is -- if there later is a need for MPIC (or
> > whatever other device follows MPIC's example) on another
> > architecture, it could be moved without breaking anything.  Or, we
> > could just have a header for each device type.
> >
> If a device will be used by more than one arch it will move into
> virt/kvm and will have its own header, like ioapic.

virt/kvm/ioapic.h is not uapi.  The ioapic uapi component (e.g. struct  
kvm_ioapic_state) is duplicated between x86 and ia64, which is the sort  
of thing I'd like to avoid.  I'm OK with putting it in the PPC header  
if, upon a later need for multi-architecture support, it could move  
into either the main uapi header or a separate uapi header that the  
main uapi header includes (i.e. no userspace-visible change in which  
header needs to be included).

-Scott

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v3 1/6] kvm: add device control API
  2013-04-04 10:41         ` Gleb Natapov
@ 2013-04-04 23:47           ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-04 23:47 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Alexander Graf, kvm-ppc, kvm, paulus

On 04/04/2013 05:41:35 AM, Gleb Natapov wrote:
> On Tue, Apr 02, 2013 at 08:57:48PM -0500, Scott Wood wrote:
> > +struct kvm_device_attr {
> > +	__u32	flags;		/* no flags currently defined */
> > +	__u32	group;		/* device-defined */
> > +	__u64	attr;		/* group-defined */
> > +	__u64	addr;		/* userspace address of attr data */
> > +};
> > +
> Since now each device has its own fd is it an advantage to enforce
> common interface between different devices?

I think so, even if only to avoid repeating the various pains  
surrounding adding ioctls.  Not necessarily "enforce", just enable.  If  
a device has some sort of command that does not fit neatly into the  
"set or get" model, it could still add a new ioctl.

> If we do so though why not handle file creation, ioctl and file
> descriptor lifetime in the common code. Common code will have
> "struct kvm_device" with "struct kvm_device_arch" and "struct
> kvm_device_ops" members. Instead of kvm_mpic_ioctl there will be
> kvm_device_ioctl which will despatch ioctls to a device using
> kvm_device->ops->(set|get|has)_attr pointers.

So make it more like the pre-fd version, except for the actual fd  
usage?  It would make destruction a bit simpler (assuming there's no  
need for vcpu destruction code to access a device).  Hopefully nobody  
asks me to change it back again, though. :-)
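
For reference, the dispatch being proposed could look roughly like the
following userspace sketch.  Only struct kvm_device_attr comes from the
patch (with __u32/__u64 swapped for stdint types); struct kvm_device,
struct kvm_device_ops, the DEV_* command numbers, the error codes and
the toy device are all hypothetical names chosen for illustration, and
attr->addr is treated as a plain pointer rather than a userspace
address.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the struct quoted above, using fixed-width userspace types. */
struct kvm_device_attr {
	uint32_t flags;		/* no flags currently defined */
	uint32_t group;		/* device-defined */
	uint64_t attr;		/* group-defined */
	uint64_t addr;		/* address of attr data */
};

struct kvm_device;

/* Hypothetical per-device callbacks; not part of the posted patch. */
struct kvm_device_ops {
	int (*set_attr)(struct kvm_device *dev, struct kvm_device_attr *a);
	int (*get_attr)(struct kvm_device *dev, struct kvm_device_attr *a);
	int (*has_attr)(struct kvm_device *dev, struct kvm_device_attr *a);
};

struct kvm_device {
	const struct kvm_device_ops *ops;
	void *priv;		/* device-specific state */
};

/* Hypothetical command numbers standing in for real ioctl numbers. */
enum { DEV_SET_ATTR, DEV_GET_ATTR, DEV_HAS_ATTR };

/* Common entry point: pick the callback, validate, then dispatch. */
static int kvm_device_ioctl(struct kvm_device *dev, int cmd,
			    struct kvm_device_attr *a)
{
	int (*fn)(struct kvm_device *, struct kvm_device_attr *);

	switch (cmd) {
	case DEV_SET_ATTR: fn = dev->ops->set_attr; break;
	case DEV_GET_ATTR: fn = dev->ops->get_attr; break;
	case DEV_HAS_ATTR: fn = dev->ops->has_attr; break;
	default: return -ENOTTY;
	}
	if (!fn)
		return -EPERM;
	if (a->flags)		/* no flags are defined yet */
		return -EINVAL;
	return fn(dev, a);
}

/* Toy device: a single 64-bit register at group 0, attr 0. */
struct toy_dev { uint64_t reg; };

static int toy_has_attr(struct kvm_device *dev, struct kvm_device_attr *a)
{
	(void)dev;
	return (a->group == 0 && a->attr == 0) ? 0 : -ENXIO;
}

static int toy_set_attr(struct kvm_device *dev, struct kvm_device_attr *a)
{
	struct toy_dev *t = dev->priv;

	if (toy_has_attr(dev, a))
		return -ENXIO;
	t->reg = *(const uint64_t *)(uintptr_t)a->addr;
	return 0;
}

static int toy_get_attr(struct kvm_device *dev, struct kvm_device_attr *a)
{
	struct toy_dev *t = dev->priv;

	if (toy_has_attr(dev, a))
		return -ENXIO;
	*(uint64_t *)(uintptr_t)a->addr = t->reg;
	return 0;
}

static const struct kvm_device_ops toy_ops = {
	.set_attr = toy_set_attr,
	.get_attr = toy_get_attr,
	.has_attr = toy_has_attr,
};
```

A device with a command that does not fit the set-or-get model would
still need its own ioctl path next to this dispatcher.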

-Scott

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v3 1/6] kvm: add device control API
  2013-04-04 10:41         ` Gleb Natapov
@ 2013-04-05  1:02           ` Paul Mackerras
  -1 siblings, 0 replies; 261+ messages in thread
From: Paul Mackerras @ 2013-04-05  1:02 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Scott Wood, Alexander Graf, kvm-ppc, kvm

On Thu, Apr 04, 2013 at 01:41:35PM +0300, Gleb Natapov wrote:

> Since now each device has its own fd is it an advantage to enforce
> common interface between different devices? If we do so though why
> not handle file creation, ioctl and file descriptor lifetime in the
> common code. Common code will have "struct kvm_device" with "struct
> kvm_device_arch" and "struct kvm_device_ops" members. Instead of
> kvm_mpic_ioctl there will be kvm_device_ioctl which will despatch ioctls
> to a device using kvm_device->ops->(set|get|has)_attr pointers.

I thought about making the same request, but when I looked at it, the
amount of code that could be made common in this way is pretty tiny,
and doing that involves a bit of extra complexity, so I thought that
on the whole it wouldn't be worthwhile.

Paul.

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v3 6/6] kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC
  2013-04-04 22:35               ` Scott Wood
@ 2013-04-05  6:09                 ` Alexander Graf
  -1 siblings, 0 replies; 261+ messages in thread
From: Alexander Graf @ 2013-04-05  6:09 UTC (permalink / raw)
  To: Scott Wood
  Cc: <kvm-ppc@vger.kernel.org>, <kvm@vger.kernel.org>,
	<paulus@samba.org>



On 05.04.2013 at 00:35, Scott Wood <scottwood@freescale.com> wrote:

> On 04/04/2013 05:30:05 PM, Alexander Graf wrote:
>> On 04.04.2013 at 20:41, Scott Wood <scottwood@freescale.com> wrote:
>> > On 04/04/2013 07:54:20 AM, Alexander Graf wrote:
>> >> On 03.04.2013, at 03:57, Scott Wood wrote:
>> >> > +    if (opp->mpic_mode_mask == GCR_MODE_PROXY)
>> >> Shouldn't this be an &?
>> >
>> > The way the mode field was originally documented was a two-bit field, where 0b11 was external proxy, and 0b10 was reserved.  If we use & it would have to be:
>> >
>> >    if ((opp->mpic_mode_mask & GCR_MODE_PROXY) == GCR_MODE_PROXY)
>> >        ...
>> >
>> > Simply testing "opp->mpic_mode_mask & GCR_MODE_PROXY" would return true in the case of GCR_MODE_MIXED.
>> >
>> > In MPIC 4.3 external proxy is defined as a separate bit (GCR[CI]) that is ignored if the mixed-mode bit (GCR[M]) is not set, which makes it a bit more legitimate to view it as a bitmap.  Still, I doubt we'll see new mode bits.
>> Ok, please add a comment about this here then :).
> 
> What sort of comment would you like?  Or do you want me to use the "(x & y) == y" version?

/* This might need to be changed if GCR gets extended */

> 
>> >> > @@ -460,6 +464,13 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
>> >> >    tasklet_kill(&vcpu->arch.tasklet);
>> >> >
>> >> >    kvmppc_remove_vcpu_debugfs(vcpu);
>> >> > +
>> >> > +    switch (vcpu->arch.irq_type) {
>> >> > +    case KVMPPC_IRQ_MPIC:
>> >> > +        kvmppc_mpic_put(vcpu->arch.mpic);
>> >> This doesn't tell the MPIC that this exact CPU is getting killed. What if we hotplug remove just a single CPU? Don't we have to deregister the CPU with the MPIC?
>> >
>> > If we ever support hot vcpu removal, yes.  We'd probably need some MPIC code changes to accommodate that, and we wouldn't currently have a way to test it, so I'd rather make it obviously not supported for now.
>> Is there any way to break heavily if user space attempts this?
> 
> Is there any way for userspace to request this currently?  They can close the vcpu fd, but the vcpu won't actually be destroyed until the vm goes down.

Are you sure? X86 does CPU hotplug today, so there has to be something :)

Alex

> 
> -Scott
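
The difference between the three ways of testing the mode field discussed above can be sketched in plain C. The MODE_* values below are illustrative two-bit encodings assumed for this sketch (0b11 for external proxy, as described in the quoted text); they are not the GCR register constants used by the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative two-bit mode encodings, assumed for this sketch only. */
#define MODE_PASS   0x0u	/* pass-through */
#define MODE_MIXED  0x1u	/* mixed mode */
#define MODE_PROXY  0x3u	/* external proxy: both bits set */

/* Exact comparison, the form used in the patch. */
static bool is_proxy_exact(uint32_t mode)
{
	return mode == MODE_PROXY;
}

/* Bare mask test: true whenever any PROXY bit is set, so MIXED matches too. */
static bool is_proxy_bare_and(uint32_t mode)
{
	return (mode & MODE_PROXY) != 0;
}

/* Masked comparison, the "(x & y) == y" form. */
static bool is_proxy_masked(uint32_t mode)
{
	return (mode & MODE_PROXY) == MODE_PROXY;
}

/* Usage: the predicates disagree only for mixed mode. */
static void mode_check_demo(void)
{
	assert(is_proxy_exact(MODE_PROXY));
	assert(!is_proxy_exact(MODE_MIXED));
	assert(is_proxy_bare_and(MODE_MIXED));	/* the false positive */
	assert(!is_proxy_masked(MODE_MIXED));
}
```

If further mode bits were ever defined, the exact comparison would stop matching values that carry extra bits, which is what the suggested comment warns about.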

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v3 6/6] kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC
  2013-04-05  6:09                 ` Alexander Graf
@ 2013-04-05 17:11                   ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-05 17:11 UTC (permalink / raw)
  To: Alexander Graf
  Cc: <kvm-ppc@vger.kernel.org>, <kvm@vger.kernel.org>,
	<paulus@samba.org>

On 04/05/2013 01:09:50 AM, Alexander Graf wrote:
> 
> 
> On 05.04.2013 at 00:35, Scott Wood <scottwood@freescale.com> wrote:
> 
> > On 04/04/2013 05:30:05 PM, Alexander Graf wrote:
> >> On 04.04.2013 at 20:41, Scott Wood <scottwood@freescale.com> wrote:
> >> > On 04/04/2013 07:54:20 AM, Alexander Graf wrote:
> >> >> On 03.04.2013, at 03:57, Scott Wood wrote:
> >> >> > @@ -460,6 +464,13 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
> >> >> >    tasklet_kill(&vcpu->arch.tasklet);
> >> >> >
> >> >> >    kvmppc_remove_vcpu_debugfs(vcpu);
> >> >> > +
> >> >> > +    switch (vcpu->arch.irq_type) {
> >> >> > +    case KVMPPC_IRQ_MPIC:
> >> >> > +        kvmppc_mpic_put(vcpu->arch.mpic);
> >> >> This doesn't tell the MPIC that this exact CPU is getting killed. What if we hotplug remove just a single CPU? Don't we have to deregister the CPU with the MPIC?
> >> >
> >> > If we ever support hot vcpu removal, yes.  We'd probably need some MPIC code changes to accommodate that, and we wouldn't currently have a way to test it, so I'd rather make it obviously not supported for now.
> >> Is there any way to break heavily if user space attempts this?
> >
> > Is there any way for userspace to request this currently?  They can close the vcpu fd, but the vcpu won't actually be destroyed until the vm goes down.
> 
> Are you sure? X86 does CPU hotplug today, so there has to be something :)

Hot add, or hot remove?

Can you give me any hint on where to look?

kvm_cpu_hotplug() appears to deal with hotplug of *physical* cpus --  
and is currently a no-op on powerpc.  Other than that, grepping isn't  
turning up much.

-Scott

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v3 1/6] kvm: add device control API
  2013-04-03  1:57       ` Scott Wood
@ 2013-04-08  5:33         ` Paul Mackerras
  -1 siblings, 0 replies; 261+ messages in thread
From: Paul Mackerras @ 2013-04-08  5:33 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm

On Tue, Apr 02, 2013 at 08:57:48PM -0500, Scott Wood wrote:

[snip]

> +static int kvm_ioctl_create_device(struct kvm *kvm,
> +				   struct kvm_create_device *cd)
> +{
> +	bool test = cd->flags & KVM_CREATE_DEVICE_TEST;
> +
> +	switch (cd->type) {
> +	default:
> +		return -ENODEV;
> +	}
> +}

This gives a compile error saying "error: unused variable `test'",
which is fatal since this gets compiled under arch/powerpc/kvm, and we
treat all warnings as errors there.

This still gives a compile error at the end of your series if you try
to compile with CONFIG_KVM_MPIC=n.

Paul.

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v3 5/6] kvm/ppc/mpic: in-kernel MPIC emulation
  2013-04-03  1:57       ` Scott Wood
@ 2013-04-08  6:30         ` Paul Mackerras
  -1 siblings, 0 replies; 261+ messages in thread
From: Paul Mackerras @ 2013-04-08  6:30 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm

On Tue, Apr 02, 2013 at 08:57:52PM -0500, Scott Wood wrote:
> Hook the MPIC code up to the KVM interfaces, add locking, etc.

[snip]

> @@ -2164,6 +2164,15 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
>  	bool test = cd->flags & KVM_CREATE_DEVICE_TEST;
>  
>  	switch (cd->type) {
> +#ifdef CONFIG_KVM_MPIC
> +	case KVM_DEV_TYPE_FSL_MPIC_20:
> +	case KVM_DEV_TYPE_FSL_MPIC_42: {
> +		if (test)
> +			return 0;
> +
> +		return kvm_create_mpic(kvm, cd->type);
> +	}
> +#endif

I think this needs to be more like:

#ifdef CONFIG_KVM_MPIC
	case KVM_DEV_TYPE_FSL_MPIC_20:
	case KVM_DEV_TYPE_FSL_MPIC_42: {
		int fd;

		if (test)
			return 0;

		fd = kvm_create_mpic(kvm, cd->type);
		if (fd < 0)
			return fd;
		cd->fd = fd;
		return 0;
	}
#endif

Paul.
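
The shape of the suggested fix can be tried out in userspace with a small mock. Everything prefixed sketch_ or SKETCH_ below is a hypothetical stand-in (including the stub that pretends to create an MPIC and the fd value it returns); only the control flow mirrors the snippet above: the fd travels back through the struct, and the return value is just a status.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SKETCH_CREATE_DEVICE_TEST 1u	/* hypothetical flag value */
#define SKETCH_DEV_TYPE_MPIC      1u	/* hypothetical device type */

struct sketch_create_device {
	uint32_t type;	/* in: device type */
	uint32_t fd;	/* out: device file descriptor */
	uint32_t flags;	/* in: SKETCH_CREATE_DEVICE_TEST to probe only */
};

/* Hypothetical stand-in for kvm_create_mpic(): returns an fd or -errno. */
static int sketch_create_mpic(uint32_t type)
{
	(void)type;
	return 7;	/* pretend fd 7 was allocated */
}

static int sketch_ioctl_create_device(struct sketch_create_device *cd)
{
	int fd;

	switch (cd->type) {
	case SKETCH_DEV_TYPE_MPIC:
		if (cd->flags & SKETCH_CREATE_DEVICE_TEST)
			return 0;	/* type is supported; create nothing */

		fd = sketch_create_mpic(cd->type);
		if (fd < 0)
			return fd;	/* propagate -errno */
		cd->fd = (uint32_t)fd;	/* fd goes back through the struct */
		return 0;		/* return value is only a status */
	default:
		return -ENODEV;
	}
}

/* Usage: probe first, then create, then try an unknown type. */
static void create_device_demo(void)
{
	struct sketch_create_device cd = {
		.type = SKETCH_DEV_TYPE_MPIC,
		.flags = SKETCH_CREATE_DEVICE_TEST,
	};

	assert(sketch_ioctl_create_device(&cd) == 0);
	assert(cd.fd == 0);		/* probing creates nothing */

	cd.flags = 0;
	assert(sketch_ioctl_create_device(&cd) == 0);
	assert(cd.fd == 7);

	cd.type = 99;
	assert(sketch_ioctl_create_device(&cd) == -ENODEV);
}
```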

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v3 1/6] kvm: add device control API
  2013-04-04 23:47           ` Scott Wood
@ 2013-04-08 10:34             ` Gleb Natapov
  -1 siblings, 0 replies; 261+ messages in thread
From: Gleb Natapov @ 2013-04-08 10:34 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm, paulus

On Thu, Apr 04, 2013 at 06:47:45PM -0500, Scott Wood wrote:
> On 04/04/2013 05:41:35 AM, Gleb Natapov wrote:
> >On Tue, Apr 02, 2013 at 08:57:48PM -0500, Scott Wood wrote:
> >> +struct kvm_device_attr {
> >> +	__u32	flags;		/* no flags currently defined */
> >> +	__u32	group;		/* device-defined */
> >> +	__u64	attr;		/* group-defined */
> >> +	__u64	addr;		/* userspace address of attr data */
> >> +};
> >> +
> >Since now each device has its own fd is it an advantage to enforce
> >common interface between different devices?
> 
> I think so, even if only to avoid repeating the various pains
> surrounding adding ioctls.  Not necessarily "enforce", just enable.
> If a device has some sort of command that does not fit neatly into
> the "set or get" model, it could still add a new ioctl.
> 
Makes sense.

> >If we do so though why not handle file creation, ioctl and file
> >descriptor lifetime in the
> >common code. Common code will have "struct kvm_device" with "struct
> >kvm_device_arch" and "struct kvm_device_ops" members. Instead of
> >kvm_mpic_ioctl there will be kvm_device_ioctl which will despatch
> >ioctls
> >to a device using kvm_device->ops->(set|get|has)_attr pointers.
> 
> So make it more like the pre-fd version, except for the actual fd
> usage?  It would make destruction a bit simpler (assuming there's no
> need for vcpu destruction code to access a device).  Hopefully
> nobody asks me to change it back again, though. :-)
> 

--
			Gleb.
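
The ops-table dispatch proposed in the quoted text can be sketched as a userspace mock. All names here (sketch_device, sketch_device_ops, the one-register example device, the choice of -EPERM for a missing handler) are assumptions made for illustration and are not taken from the patch; the attribute data is a plain pointer rather than a userspace address to keep the sketch self-contained.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_device;

struct sketch_device_attr {
	uint32_t group;
	uint64_t attr;
	uint64_t *data;	/* simplification: plain pointer, not a user address */
};

struct sketch_device_ops {
	int (*set_attr)(struct sketch_device *dev, struct sketch_device_attr *a);
	int (*get_attr)(struct sketch_device *dev, struct sketch_device_attr *a);
	int (*has_attr)(struct sketch_device *dev, struct sketch_device_attr *a);
};

struct sketch_device {
	const struct sketch_device_ops *ops;
	void *priv;	/* device-specific state */
};

enum sketch_cmd { SKETCH_SET_ATTR, SKETCH_GET_ATTR, SKETCH_HAS_ATTR };

/* Common entry point: pick the handler from the ops table and call it. */
static int sketch_device_ioctl(struct sketch_device *dev, enum sketch_cmd cmd,
			       struct sketch_device_attr *a)
{
	int (*handler)(struct sketch_device *, struct sketch_device_attr *) = NULL;

	switch (cmd) {
	case SKETCH_SET_ATTR: handler = dev->ops->set_attr; break;
	case SKETCH_GET_ATTR: handler = dev->ops->get_attr; break;
	case SKETCH_HAS_ATTR: handler = dev->ops->has_attr; break;
	}
	return handler ? handler(dev, a) : -EPERM;	/* sketch's choice of errno */
}

/* Example device: one 64-bit register at group 0, attr 0. */
static int reg_has(struct sketch_device *dev, struct sketch_device_attr *a)
{
	(void)dev;
	return (a->group == 0 && a->attr == 0) ? 0 : -ENXIO;
}

static int reg_set(struct sketch_device *dev, struct sketch_device_attr *a)
{
	int ret = reg_has(dev, a);

	if (ret == 0)
		*(uint64_t *)dev->priv = *a->data;
	return ret;
}

static int reg_get(struct sketch_device *dev, struct sketch_device_attr *a)
{
	int ret = reg_has(dev, a);

	if (ret == 0)
		*a->data = *(uint64_t *)dev->priv;
	return ret;
}

static const struct sketch_device_ops reg_ops = {
	.set_attr = reg_set,
	.get_attr = reg_get,
	.has_attr = reg_has,
};

/* Usage: probe, write, then read back through the common entry point. */
static void dispatch_demo(void)
{
	uint64_t reg = 0, in = 42, out = 0;
	struct sketch_device dev = { .ops = &reg_ops, .priv = &reg };
	struct sketch_device_attr a = { .group = 0, .attr = 0, .data = &in };

	assert(sketch_device_ioctl(&dev, SKETCH_HAS_ATTR, &a) == 0);
	assert(sketch_device_ioctl(&dev, SKETCH_SET_ATTR, &a) == 0);
	a.data = &out;
	assert(sketch_device_ioctl(&dev, SKETCH_GET_ATTR, &a) == 0);
	assert(out == 42);
}
```

With this layout a device only supplies the three handlers; file creation and fd lifetime would live next to the common entry point.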

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v3 1/6] kvm: add device control API
  2013-04-05  1:02           ` Paul Mackerras
@ 2013-04-08 10:37             ` Gleb Natapov
  -1 siblings, 0 replies; 261+ messages in thread
From: Gleb Natapov @ 2013-04-08 10:37 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Scott Wood, Alexander Graf, kvm-ppc, kvm

On Fri, Apr 05, 2013 at 12:02:06PM +1100, Paul Mackerras wrote:
> On Thu, Apr 04, 2013 at 01:41:35PM +0300, Gleb Natapov wrote:
> 
> > Since now each device has its own fd is it an advantage to enforce
> > common interface between different devices? If we do so though why
> > not handle file creation, ioctl and file descriptor lifetime in the
> > common code. Common code will have "struct kvm_device" with "struct
> > kvm_device_arch" and "struct kvm_device_ops" members. Instead of
> > kvm_mpic_ioctl there will be kvm_device_ioctl which will despatch ioctls
> > to a device using kvm_device->ops->(set|get|has)_attr pointers.
> 
> I thought about making the same request, but when I looked at it, the
> amount of code that could be made common in this way is pretty tiny,
> and doing that involves a bit of extra complexity, so I thought that
> on the whole it wouldn't be worthwhile.
> 
The value of doing so is not only in making some code common, but also
in moving fd lifetime management into the common code, where it can be
debugged once for all potential users. I also expect the amount of
shared code to grow as the interface is used by more architectures.

--
			Gleb.

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v3 5/6] kvm/ppc/mpic: in-kernel MPIC emulation
  2013-04-04 23:33               ` Scott Wood
@ 2013-04-08 10:39                 ` Gleb Natapov
  -1 siblings, 0 replies; 261+ messages in thread
From: Gleb Natapov @ 2013-04-08 10:39 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm, paulus

On Thu, Apr 04, 2013 at 06:33:38PM -0500, Scott Wood wrote:
> On 04/04/2013 12:59:02 AM, Gleb Natapov wrote:
> >On Wed, Apr 03, 2013 at 03:58:04PM -0500, Scott Wood wrote:
> >> KVM_DEV_MPIC_* could go elsewhere if you want to avoid cluttering
> >> the main kvm.h.  The arch header would be OK, since the non-arch
> >> header includes the arch header, and thus it wouldn't be visible to
> >> userspace where it is -- if there later is a need for MPIC (or
> >> whatever other device follows MPIC's example) on another
> >> architecture, it could be moved without breaking anything.  Or, we
> >> could just have a header for each device type.
> >>
> >If device will be used by more then one arch it will move into
> >virt/kvm
> >and will have its own header, like ioapic.
> 
> virt/kvm/ioapic.h is not uapi.  The ioapic uapi component (e.g.
> struct kvm_ioapic_state) is duplicated between x86 and ia64, which
> is the sort of thing I'd like to avoid.  I'm OK with putting it in
> the PPC header if, upon a later need for multi-architecture support,
> it could move into either the main uapi header or a separate uapi
> header that the main uapi header includes (i.e. no userspace-visible
> change in which header needs to be included).
> 
Agreed, it makes sense to have a separate uapi header for a device that
is used by more than one arch.

--
			Gleb.

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v3 5/6] kvm/ppc/mpic: in-kernel MPIC emulation
  2013-04-08  6:30         ` Paul Mackerras
@ 2013-04-09  0:49           ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-09  0:49 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Alexander Graf, kvm-ppc, kvm

On 04/08/2013 01:30:42 AM, Paul Mackerras wrote:
> On Tue, Apr 02, 2013 at 08:57:52PM -0500, Scott Wood wrote:
> > Hook the MPIC code up to the KVM interfaces, add locking, etc.
> 
> [snip]
> 
> > @@ -2164,6 +2164,15 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
> >  	bool test = cd->flags & KVM_CREATE_DEVICE_TEST;
> >
> >  	switch (cd->type) {
> > +#ifdef CONFIG_KVM_MPIC
> > +	case KVM_DEV_TYPE_FSL_MPIC_20:
> > +	case KVM_DEV_TYPE_FSL_MPIC_42: {
> > +		if (test)
> > +			return 0;
> > +
> > +		return kvm_create_mpic(kvm, cd->type);
> > +	}
> > +#endif
> 
> I think this needs to be more like:
> 
> #ifdef CONFIG_KVM_MPIC
> 	case KVM_DEV_TYPE_FSL_MPIC_20:
> 	case KVM_DEV_TYPE_FSL_MPIC_42: {
> 		int fd;
> 
> 		if (test)
> 			return 0;
> 
> 		fd = kvm_create_mpic(kvm, cd->type);
> 		if (fd < 0)
> 			return fd;
> 		cd->fd = fd;
> 		return 0;
> 	}
> #endif

Right, thanks for spotting.  It didn't show up in my testing because I  
did the same thing on the QEMU side.

-Scott

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [RFC PATCH v3 1/6] kvm: add device control API
  2013-04-08  5:33         ` Paul Mackerras
@ 2013-04-09  0:50           ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-09  0:50 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Alexander Graf, kvm-ppc, kvm

On 04/08/2013 12:33:13 AM, Paul Mackerras wrote:
> On Tue, Apr 02, 2013 at 08:57:48PM -0500, Scott Wood wrote:
> 
> [snip]
> 
> > +static int kvm_ioctl_create_device(struct kvm *kvm,
> > +				   struct kvm_create_device *cd)
> > +{
> > +	bool test = cd->flags & KVM_CREATE_DEVICE_TEST;
> > +
> > +	switch (cd->type) {
> > +	default:
> > +		return -ENODEV;
> > +	}
> > +}
> 
> This gives a compile error saying "error: unused variable `test'",
> which is fatal since this gets compiled under arch/powerpc/kvm, and we
> treat all warnings as errors there.
> 
> This still gives a compile error at the end of your series if you try
> to compile with CONFIG_KVM_MPIC=n.

Ah, right.  Will mark as __maybe_unused.

-Scott
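
For reference, the effect of the annotation can be reproduced outside the kernel. The sketch below defines a userspace stand-in for __maybe_unused using the GCC/Clang unused attribute; the struct, the SKETCH_* constants and the SKETCH_HAVE_MPIC switch are hypothetical and only mimic the shape of the quoted function.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's __maybe_unused (GCC/Clang attribute). */
#ifndef __maybe_unused
#define __maybe_unused __attribute__((__unused__))
#endif

#define SKETCH_CREATE_DEVICE_TEST 1u	/* hypothetical flag value */
#define SKETCH_DEV_TYPE_MPIC      1u	/* hypothetical device type */

struct sketch_create_device {
	uint32_t type;
	uint32_t fd;
	uint32_t flags;
};

static int sketch_create_device(struct sketch_create_device *cd)
{
	/* Read only inside the #ifdef below, so it may end up unused. */
	bool test __maybe_unused = cd->flags & SKETCH_CREATE_DEVICE_TEST;

	switch (cd->type) {
#ifdef SKETCH_HAVE_MPIC
	case SKETCH_DEV_TYPE_MPIC:
		return test ? 0 : -ENOSYS;	/* real creation elided */
#endif
	default:
		return -ENODEV;
	}
}

/* Usage: with SKETCH_HAVE_MPIC undefined, every type is rejected. */
static void maybe_unused_demo(void)
{
	struct sketch_create_device cd = {
		.type = SKETCH_DEV_TYPE_MPIC,
		.flags = SKETCH_CREATE_DEVICE_TEST,
	};

	assert(sketch_create_device(&cd) == -ENODEV);
}
```

Without the annotation, building this with -Wall -Werror and SKETCH_HAVE_MPIC undefined should fail on the unused variable, which is the situation described above.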

^ permalink raw reply	[flat|nested] 261+ messages in thread

* [PATCH v4 0/6] device-control and in-kernel MPIC
  2013-04-03  1:57     ` Scott Wood
@ 2013-04-13  0:08       ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-13  0:08 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus

Scott Wood (6):
  kvm: add device control API
  kvm/ppc/mpic: import hw/openpic.c from QEMU
  kvm/ppc/mpic: remove some obviously unneeded code
  kvm/ppc/mpic: adapt to kernel style and environment
  kvm/ppc/mpic: in-kernel MPIC emulation
  kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC

 Documentation/virtual/kvm/api.txt          |   78 ++
 Documentation/virtual/kvm/devices/README   |    1 +
 Documentation/virtual/kvm/devices/mpic.txt |   37 +
 arch/powerpc/include/asm/kvm_host.h        |   17 +-
 arch/powerpc/include/asm/kvm_ppc.h         |    9 +
 arch/powerpc/include/uapi/asm/kvm.h        |    7 +
 arch/powerpc/kvm/Kconfig                   |    9 +
 arch/powerpc/kvm/Makefile                  |    2 +
 arch/powerpc/kvm/booke.c                   |   12 +-
 arch/powerpc/kvm/mpic.c                    | 1758 ++++++++++++++++++++++++++++
 arch/powerpc/kvm/powerpc.c                 |   42 +-
 include/linux/kvm_host.h                   |   37 +
 include/uapi/linux/kvm.h                   |   31 +
 virt/kvm/kvm_main.c                        |  135 +++
 14 files changed, 2165 insertions(+), 10 deletions(-)
 create mode 100644 Documentation/virtual/kvm/devices/README
 create mode 100644 Documentation/virtual/kvm/devices/mpic.txt
 create mode 100644 arch/powerpc/kvm/mpic.c

-- 
1.7.10.4



^ permalink raw reply	[flat|nested] 261+ messages in thread

* [PATCH v4 1/6] kvm: add device control API
  2013-04-13  0:08       ` Scott Wood
@ 2013-04-13  0:08         ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-13  0:08 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus, Scott Wood

Currently, devices that are emulated inside KVM are configured in a
hardcoded manner based on an assumption that any given architecture
only has one way to do it.  If there's any need to access device state,
it is done through inflexible one-purpose-only IOCTLs (e.g.
KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
cumbersome and depletes a limited numberspace.

This API provides a mechanism to instantiate a device of a certain
type, returning an ID that can be used to set/get attributes of the
device.  Attributes may include configuration parameters (e.g.
register base address), device state, operational commands, etc.  It
is similar to the ONE_REG API, except that it acts on devices rather
than vcpus.

Both device types and individual attributes can be tested without having
to create the device or get/set the attribute, and without the need to
separately manage enumerated capabilities.

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
v4:
 - Move some boilerplate back into generic code, as requested by Gleb.
   File descriptor management and reference counting are no longer the
   concern of the device implementation.

 - Don't hold kvm->lock during create.  The original reasons
   for doing so have vanished as far as MPIC is concerned, and
   this avoids needing to answer the question of whether to
   hold the lock during destroy as well.

   Paul, you may need to acquire the lock yourself in kvm_create_xics()
   to protect the -EEXIST check.

v3: remove some changes that were merged into this patch by accident,
and fix the error documentation for KVM_CREATE_DEVICE.
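
For illustration only, the flow described in the commit message might
look like this from userspace.  This is a minimal sketch, not part of
the patch: the helper names (probe_device_type, create_device,
device_has_attr, device_set_u32) are hypothetical, it assumes a
<linux/kvm.h> that already carries the definitions added here, and the
32-bit payload in device_set_u32 is an assumption, since the size of
the data is defined by each attribute.

```c
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Hypothetical helper: ask whether a device type is supported without
 * instantiating it.  Returns 0 if supported, -1 with errno set otherwise.
 */
int probe_device_type(int vm_fd, uint32_t type)
{
	struct kvm_create_device cd = {
		.type = type,
		.flags = KVM_CREATE_DEVICE_TEST,
	};

	return ioctl(vm_fd, KVM_CREATE_DEVICE, &cd);
}

/*
 * Hypothetical helper: create the device on a vm fd.  Returns the new
 * device fd, or -1 with errno set.
 */
int create_device(int vm_fd, uint32_t type)
{
	struct kvm_create_device cd = { .type = type };

	if (ioctl(vm_fd, KVM_CREATE_DEVICE, &cd) < 0)
		return -1;

	return (int)cd.fd;
}

/* Hypothetical helper: test for an attribute; "addr" is ignored here. */
int device_has_attr(int dev_fd, uint32_t group, uint64_t attr)
{
	struct kvm_device_attr a = { .group = group, .attr = attr };

	return ioctl(dev_fd, KVM_HAS_DEVICE_ATTR, &a);
}

/*
 * Hypothetical helper: set an attribute whose payload is assumed to be
 * a single 32-bit value.  Real attributes define their own size.
 */
int device_set_u32(int dev_fd, uint32_t group, uint64_t attr, uint32_t val)
{
	struct kvm_device_attr a = {
		.group = group,
		.attr = attr,
		.addr = (uint64_t)(uintptr_t)&val,
	};

	return ioctl(dev_fd, KVM_SET_DEVICE_ATTR, &a);
}
```

A caller would typically call probe_device_type() first and fall back
to a userspace device model when it fails with ENODEV.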
---
 Documentation/virtual/kvm/api.txt        |   70 ++++++++++++++++
 Documentation/virtual/kvm/devices/README |    1 +
 include/linux/kvm_host.h                 |   35 ++++++++
 include/uapi/linux/kvm.h                 |   27 +++++++
 virt/kvm/kvm_main.c                      |  129 ++++++++++++++++++++++++++++++
 5 files changed, 262 insertions(+)
 create mode 100644 Documentation/virtual/kvm/devices/README

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 976eb65..d52f3f9 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2173,6 +2173,76 @@ header; first `n_valid' valid entries with contents from the data
 written, then `n_invalid' invalid entries, invalidating any previously
 valid entries found.
 
+4.79 KVM_CREATE_DEVICE
+
+Capability: KVM_CAP_DEVICE_CTRL
+Type: vm ioctl
+Parameters: struct kvm_create_device (in/out)
+Returns: 0 on success, -1 on error
+Errors:
+  ENODEV: The device type is unknown or unsupported
+  EEXIST: Device already created, and this type of device may not
+          be instantiated multiple times
+
+  Other error conditions may be defined by individual device types or
+  have their standard meanings.
+
+Creates an emulated device in the kernel.  The file descriptor returned
+in fd can be used with KVM_SET/GET/HAS_DEVICE_ATTR.
+
+If the KVM_CREATE_DEVICE_TEST flag is set, only test whether the
+device type is supported (not necessarily whether it can be created
+in the current vm).
+
+Individual devices should not define flags.  Attributes should be used
+for specifying any behavior that is not implied by the device type
+number.
+
+struct kvm_create_device {
+	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
+	__u32	fd;	/* out: device handle */
+	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
+};
+
+4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR
+
+Capability: KVM_CAP_DEVICE_CTRL
+Type: device ioctl
+Parameters: struct kvm_device_attr
+Returns: 0 on success, -1 on error
+Errors:
+  ENXIO:  The group or attribute is unknown/unsupported for this device
+  EPERM:  The attribute cannot (currently) be accessed this way
+          (e.g. read-only attribute, or attribute that only makes
+          sense when the device is in a different state)
+
+  Other error conditions may be defined by individual device types.
+
+Gets/sets a specified piece of device configuration and/or state.  The
+semantics are device-specific.  See individual device documentation in
+the "devices" directory.  As with ONE_REG, the size of the data
+transferred is defined by the particular attribute.
+
+struct kvm_device_attr {
+	__u32	flags;		/* no flags currently defined */
+	__u32	group;		/* device-defined */
+	__u64	attr;		/* group-defined */
+	__u64	addr;		/* userspace address of attr data */
+};
+
+4.81 KVM_HAS_DEVICE_ATTR
+
+Capability: KVM_CAP_DEVICE_CTRL
+Type: device ioctl
+Parameters: struct kvm_device_attr
+Returns: 0 on success, -1 on error
+Errors:
+  ENXIO:  The group or attribute is unknown/unsupported for this device
+
+Tests whether a device supports a particular attribute.  A successful
+return indicates the attribute is implemented.  It does not necessarily
+indicate that the attribute can be read or written in the device's
+current state.  "addr" is ignored.
 
 4.77 KVM_ARM_VCPU_INIT
 
diff --git a/Documentation/virtual/kvm/devices/README b/Documentation/virtual/kvm/devices/README
new file mode 100644
index 0000000..34a6983
--- /dev/null
+++ b/Documentation/virtual/kvm/devices/README
@@ -0,0 +1 @@
+This directory contains specific device bindings for KVM_CAP_DEVICE_CTRL.
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 20d77d2..8fce9bc 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1063,6 +1063,41 @@ static inline bool kvm_check_request(int req, struct kvm_vcpu *vcpu)
 
 extern bool kvm_rebooting;
 
+struct kvm_device_ops;
+
+struct kvm_device {
+	struct kvm_device_ops *ops;
+	struct kvm *kvm;
+	atomic_t users;
+	void *private;
+};
+
+/* create, destroy, and name are mandatory */
+struct kvm_device_ops {
+	const char *name;
+	int (*create)(struct kvm_device *dev, u32 type);
+
+	/*
+	 * Destroy is responsible for freeing dev.
+	 *
+	 * Destroy may be called before or after destructors are called
+	 * on emulated I/O regions, depending on whether a reference is
+	 * held by a vcpu or other kvm component that gets destroyed
+	 * after the emulated I/O.
+	 */
+	void (*destroy)(struct kvm_device *dev);
+
+	int (*set_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
+	int (*get_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
+	int (*has_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
+	long (*ioctl)(struct kvm_device *dev, unsigned int ioctl,
+		      unsigned long arg);
+};
+
+void kvm_device_get(struct kvm_device *dev);
+void kvm_device_put(struct kvm_device *dev);
+struct kvm_device *kvm_device_from_filp(struct file *filp);
+
 #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
 
 static inline void kvm_vcpu_set_in_spin_loop(struct kvm_vcpu *vcpu, bool val)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 74d0ff3..20ce2d2 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_EPR 86
 #define KVM_CAP_ARM_PSCI 87
 #define KVM_CAP_ARM_SET_DEVICE_ADDR 88
+#define KVM_CAP_DEVICE_CTRL 89
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -909,6 +910,32 @@ struct kvm_s390_ucas_mapping {
 #define KVM_ARM_SET_DEVICE_ADDR	  _IOW(KVMIO,  0xab, struct kvm_arm_device_addr)
 
 /*
+ * Device control API, available with KVM_CAP_DEVICE_CTRL
+ */
+#define KVM_CREATE_DEVICE_TEST		1
+
+struct kvm_create_device {
+	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
+	__u32	fd;	/* out: device handle */
+	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
+};
+
+struct kvm_device_attr {
+	__u32	flags;		/* no flags currently defined */
+	__u32	group;		/* device-defined */
+	__u64	attr;		/* group-defined */
+	__u64	addr;		/* userspace address of attr data */
+};
+
+/* ioctl for vm fd */
+#define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xe0, struct kvm_create_device)
+
+/* ioctls for fds returned by KVM_CREATE_DEVICE */
+#define KVM_SET_DEVICE_ATTR	  _IOW(KVMIO,  0xe1, struct kvm_device_attr)
+#define KVM_GET_DEVICE_ATTR	  _IOW(KVMIO,  0xe2, struct kvm_device_attr)
+#define KVM_HAS_DEVICE_ATTR	  _IOW(KVMIO,  0xe3, struct kvm_device_attr)
+
+/*
  * ioctls for vcpu fds
  */
 #define KVM_RUN                   _IO(KVMIO,   0x80)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5cc53c9..e2b18af 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2158,6 +2158,117 @@ out:
 }
 #endif
 
+static int kvm_device_ioctl_attr(struct kvm_device *dev,
+				 int (*accessor)(struct kvm_device *dev,
+						 struct kvm_device_attr *attr),
+				 unsigned long arg)
+{
+	struct kvm_device_attr attr;
+
+	if (!accessor)
+		return -EPERM;
+
+	if (copy_from_user(&attr, (void __user *)arg, sizeof(attr)))
+		return -EFAULT;
+
+	return accessor(dev, &attr);
+}
+
+static long kvm_device_ioctl(struct file *filp, unsigned int ioctl,
+			     unsigned long arg)
+{
+	struct kvm_device *dev = filp->private_data;
+
+	switch (ioctl) {
+	case KVM_SET_DEVICE_ATTR:
+		return kvm_device_ioctl_attr(dev, dev->ops->set_attr, arg);
+	case KVM_GET_DEVICE_ATTR:
+		return kvm_device_ioctl_attr(dev, dev->ops->get_attr, arg);
+	case KVM_HAS_DEVICE_ATTR:
+		return kvm_device_ioctl_attr(dev, dev->ops->has_attr, arg);
+	default:
+		if (dev->ops->ioctl)
+			return dev->ops->ioctl(dev, ioctl, arg);
+
+		return -ENOTTY;
+	}
+}
+
+void kvm_device_get(struct kvm_device *dev)
+{
+	atomic_inc(&dev->users);
+}
+
+void kvm_device_put(struct kvm_device *dev)
+{
+	if (atomic_dec_and_test(&dev->users))
+		dev->ops->destroy(dev);
+}
+
+static int kvm_device_release(struct inode *inode, struct file *filp)
+{
+	struct kvm_device *dev = filp->private_data;
+	struct kvm *kvm = dev->kvm;
+
+	kvm_device_put(dev);
+	kvm_put_kvm(kvm);
+	return 0;
+}
+
+static const struct file_operations kvm_device_fops = {
+	.unlocked_ioctl = kvm_device_ioctl,
+	.release = kvm_device_release,
+};
+
+struct kvm_device *kvm_device_from_filp(struct file *filp)
+{
+	if (filp->f_op != &kvm_device_fops)
+		return NULL;
+
+	return filp->private_data;
+}
+
+static int kvm_ioctl_create_device(struct kvm *kvm,
+				   struct kvm_create_device *cd)
+{
+	struct kvm_device_ops *ops = NULL;
+	struct kvm_device *dev;
+	bool test = cd->flags & KVM_CREATE_DEVICE_TEST;
+	int ret;
+
+	switch (cd->type) {
+	default:
+		return -ENODEV;
+	}
+
+	if (test)
+		return 0;
+
+	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+	if (!dev)
+		return -ENOMEM;
+
+	dev->ops = ops;
+	dev->kvm = kvm;
+	atomic_set(&dev->users, 1);
+
+	ret = ops->create(dev, cd->type);
+	if (ret < 0) {
+		kfree(dev);
+		return ret;
+	}
+
+	ret = anon_inode_getfd(ops->name, &kvm_device_fops, dev, O_RDWR);
+	if (ret < 0) {
+		ops->destroy(dev);
+		return ret;
+	}
+
+	kvm_get_kvm(kvm);
+	cd->fd = ret;
+	return 0;
+}
+
 static long kvm_vm_ioctl(struct file *filp,
 			   unsigned int ioctl, unsigned long arg)
 {
@@ -2272,6 +2383,24 @@ static long kvm_vm_ioctl(struct file *filp,
 		break;
 	}
 #endif
+	case KVM_CREATE_DEVICE: {
+		struct kvm_create_device cd;
+
+		r = -EFAULT;
+		if (copy_from_user(&cd, argp, sizeof(cd)))
+			goto out;
+
+		r = kvm_ioctl_create_device(kvm, &cd);
+		if (r)
+			goto out;
+
+		r = -EFAULT;
+		if (copy_to_user(argp, &cd, sizeof(cd)))
+			goto out;
+
+		r = 0;
+		break;
+	}
 	default:
 		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
 		if (r == -ENOTTY)
-- 
1.7.10.4


* [PATCH v4 1/6] kvm: add device control API
@ 2013-04-13  0:08         ` Scott Wood
  0 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-13  0:08 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus, Scott Wood

Currently, devices that are emulated inside KVM are configured in a
hardcoded manner based on an assumption that any given architecture
only has one way to do it.  If there's any need to access device state,
it is done through inflexible one-purpose-only IOCTLs (e.g.
KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
cumbersome and depletes a limited numberspace.

This API provides a mechanism to instantiate a device of a certain
type, returning an ID that can be used to set/get attributes of the
device.  Attributes may include configuration parameters (e.g.
register base address), device state, operational commands, etc.  It
is similar to the ONE_REG API, except that it acts on devices rather
than vcpus.

Both device types and individual attributes can be tested without having
to create the device or get/set the attribute, and without the need to
separately manage enumerated capabilities.

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
v4:
 - Move some boilerplate back into generic code, as requested by Gleb.
   File descriptor management and reference counting are no longer the
   concern of the device implementation.

 - Don't hold kvm->lock during create.  The original reasons
   for doing so have vanished as far as MPIC is concerned, and
   this avoids needing to answer the question of whether to
   hold the lock during destroy as well.

   Paul, you may need to acquire the lock yourself in kvm_create_xics()
   to protect the -EEXIST check.

v3: remove some changes that were merged into this patch by accident,
and fix the error documentation for KVM_CREATE_DEVICE.
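
As a rough picture of the reference counting that v4 moves into generic
code, the get/put pairing can be modelled in plain C11.  This is a
userspace sketch under stated assumptions, not the kernel code: the
toy_device type and its helpers are hypothetical, C11 atomics stand in
for atomic_t, and a "destroyed" flag stands in for the ops->destroy()
callback that would free the device.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct kvm_device. */
struct toy_device {
	atomic_int users;
	bool destroyed;	/* set where real code would call ops->destroy() */
};

/* The creator holds the first reference, as the fd does in the patch. */
void toy_device_init(struct toy_device *dev)
{
	atomic_init(&dev->users, 1);
	dev->destroyed = false;
}

/* Take an extra reference, e.g. on behalf of a vcpu. */
void toy_device_get(struct toy_device *dev)
{
	atomic_fetch_add(&dev->users, 1);
}

/*
 * Drop a reference.  atomic_fetch_sub() returns the previous value, so
 * a previous value of 1 means this was the last reference.  Returns
 * true if the device was destroyed by this call.
 */
bool toy_device_put(struct toy_device *dev)
{
	if (atomic_fetch_sub(&dev->users, 1) == 1) {
		dev->destroyed = true;
		return true;
	}

	return false;
}
```

With this shape, whichever holder drops the last reference triggers
destruction, so the device implementation does not need to know whether
the fd or another holder goes away first.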
---
 Documentation/virtual/kvm/api.txt        |   70 ++++++++++++++++
 Documentation/virtual/kvm/devices/README |    1 +
 include/linux/kvm_host.h                 |   35 ++++++++
 include/uapi/linux/kvm.h                 |   27 +++++++
 virt/kvm/kvm_main.c                      |  129 ++++++++++++++++++++++++++++++
 5 files changed, 262 insertions(+)
 create mode 100644 Documentation/virtual/kvm/devices/README

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 976eb65..d52f3f9 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2173,6 +2173,76 @@ header; first `n_valid' valid entries with contents from the data
 written, then `n_invalid' invalid entries, invalidating any previously
 valid entries found.
 
+4.79 KVM_CREATE_DEVICE
+
+Capability: KVM_CAP_DEVICE_CTRL
+Type: vm ioctl
+Parameters: struct kvm_create_device (in/out)
+Returns: 0 on success, -1 on error
+Errors:
+  ENODEV: The device type is unknown or unsupported
+  EEXIST: Device already created, and this type of device may not
+          be instantiated multiple times
+
+  Other error conditions may be defined by individual device types or
+  have their standard meanings.
+
+Creates an emulated device in the kernel.  The file descriptor returned
+in fd can be used with KVM_SET/GET/HAS_DEVICE_ATTR.
+
+If the KVM_CREATE_DEVICE_TEST flag is set, only test whether the
+device type is supported (not necessarily whether it can be created
+in the current vm).
+
+Individual devices should not define flags.  Attributes should be used
+for specifying any behavior that is not implied by the device type
+number.
+
+struct kvm_create_device {
+	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
+	__u32	fd;	/* out: device handle */
+	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
+};
+
+4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR
+
+Capability: KVM_CAP_DEVICE_CTRL
+Type: device ioctl
+Parameters: struct kvm_device_attr
+Returns: 0 on success, -1 on error
+Errors:
+  ENXIO:  The group or attribute is unknown/unsupported for this device
+  EPERM:  The attribute cannot (currently) be accessed this way
+          (e.g. read-only attribute, or attribute that only makes
+          sense when the device is in a different state)
+
+  Other error conditions may be defined by individual device types.
+
+Gets/sets a specified piece of device configuration and/or state.  The
+semantics are device-specific.  See individual device documentation in
+the "devices" directory.  As with ONE_REG, the size of the data
+transferred is defined by the particular attribute.
+
+struct kvm_device_attr {
+	__u32	flags;		/* no flags currently defined */
+	__u32	group;		/* device-defined */
+	__u64	attr;		/* group-defined */
+	__u64	addr;		/* userspace address of attr data */
+};
+
+4.81 KVM_HAS_DEVICE_ATTR
+
+Capability: KVM_CAP_DEVICE_CTRL
+Type: device ioctl
+Parameters: struct kvm_device_attr
+Returns: 0 on success, -1 on error
+Errors:
+  ENXIO:  The group or attribute is unknown/unsupported for this device
+
+Tests whether a device supports a particular attribute.  A successful
+return indicates the attribute is implemented.  It does not necessarily
+indicate that the attribute can be read or written in the device's
+current state.  "addr" is ignored.
 
 4.77 KVM_ARM_VCPU_INIT
 
diff --git a/Documentation/virtual/kvm/devices/README b/Documentation/virtual/kvm/devices/README
new file mode 100644
index 0000000..34a6983
--- /dev/null
+++ b/Documentation/virtual/kvm/devices/README
@@ -0,0 +1 @@
+This directory contains specific device bindings for KVM_CAP_DEVICE_CTRL.
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 20d77d2..8fce9bc 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1063,6 +1063,41 @@ static inline bool kvm_check_request(int req, struct kvm_vcpu *vcpu)
 
 extern bool kvm_rebooting;
 
+struct kvm_device_ops;
+
+struct kvm_device {
+	struct kvm_device_ops *ops;
+	struct kvm *kvm;
+	atomic_t users;
+	void *private;
+};
+
+/* create, destroy, and name are mandatory */
+struct kvm_device_ops {
+	const char *name;
+	int (*create)(struct kvm_device *dev, u32 type);
+
+	/*
+	 * Destroy is responsible for freeing dev.
+	 *
+	 * Destroy may be called before or after destructors are called
+	 * on emulated I/O regions, depending on whether a reference is
+	 * held by a vcpu or other kvm component that gets destroyed
+	 * after the emulated I/O.
+	 */
+	void (*destroy)(struct kvm_device *dev);
+
+	int (*set_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
+	int (*get_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
+	int (*has_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
+	long (*ioctl)(struct kvm_device *dev, unsigned int ioctl,
+		      unsigned long arg);
+};
+
+void kvm_device_get(struct kvm_device *dev);
+void kvm_device_put(struct kvm_device *dev);
+struct kvm_device *kvm_device_from_filp(struct file *filp);
+
 #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
 
 static inline void kvm_vcpu_set_in_spin_loop(struct kvm_vcpu *vcpu, bool val)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 74d0ff3..20ce2d2 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_EPR 86
 #define KVM_CAP_ARM_PSCI 87
 #define KVM_CAP_ARM_SET_DEVICE_ADDR 88
+#define KVM_CAP_DEVICE_CTRL 89
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -909,6 +910,32 @@ struct kvm_s390_ucas_mapping {
 #define KVM_ARM_SET_DEVICE_ADDR	  _IOW(KVMIO,  0xab, struct kvm_arm_device_addr)
 
 /*
+ * Device control API, available with KVM_CAP_DEVICE_CTRL
+ */
+#define KVM_CREATE_DEVICE_TEST		1
+
+struct kvm_create_device {
+	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
+	__u32	fd;	/* out: device handle */
+	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
+};
+
+struct kvm_device_attr {
+	__u32	flags;		/* no flags currently defined */
+	__u32	group;		/* device-defined */
+	__u64	attr;		/* group-defined */
+	__u64	addr;		/* userspace address of attr data */
+};
+
+/* ioctl for vm fd */
+#define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xe0, struct kvm_create_device)
+
+/* ioctls for fds returned by KVM_CREATE_DEVICE */
+#define KVM_SET_DEVICE_ATTR	  _IOW(KVMIO,  0xe1, struct kvm_device_attr)
+#define KVM_GET_DEVICE_ATTR	  _IOW(KVMIO,  0xe2, struct kvm_device_attr)
+#define KVM_HAS_DEVICE_ATTR	  _IOW(KVMIO,  0xe3, struct kvm_device_attr)
+
+/*
  * ioctls for vcpu fds
  */
 #define KVM_RUN                   _IO(KVMIO,   0x80)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5cc53c9..e2b18af 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2158,6 +2158,117 @@ out:
 }
 #endif
 
+static int kvm_device_ioctl_attr(struct kvm_device *dev,
+				 int (*accessor)(struct kvm_device *dev,
+						 struct kvm_device_attr *attr),
+				 unsigned long arg)
+{
+	struct kvm_device_attr attr;
+
+	if (!accessor)
+		return -EPERM;
+
+	if (copy_from_user(&attr, (void __user *)arg, sizeof(attr)))
+		return -EFAULT;
+
+	return accessor(dev, &attr);
+}
+
+static long kvm_device_ioctl(struct file *filp, unsigned int ioctl,
+			     unsigned long arg)
+{
+	struct kvm_device *dev = filp->private_data;
+
+	switch (ioctl) {
+	case KVM_SET_DEVICE_ATTR:
+		return kvm_device_ioctl_attr(dev, dev->ops->set_attr, arg);
+	case KVM_GET_DEVICE_ATTR:
+		return kvm_device_ioctl_attr(dev, dev->ops->get_attr, arg);
+	case KVM_HAS_DEVICE_ATTR:
+		return kvm_device_ioctl_attr(dev, dev->ops->has_attr, arg);
+	default:
+		if (dev->ops->ioctl)
+			return dev->ops->ioctl(dev, ioctl, arg);
+
+		return -ENOTTY;
+	}
+}
+
+void kvm_device_get(struct kvm_device *dev)
+{
+	atomic_inc(&dev->users);
+}
+
+void kvm_device_put(struct kvm_device *dev)
+{
+	if (atomic_dec_and_test(&dev->users))
+		dev->ops->destroy(dev);
+}
+
+static int kvm_device_release(struct inode *inode, struct file *filp)
+{
+	struct kvm_device *dev = filp->private_data;
+	struct kvm *kvm = dev->kvm;
+
+	kvm_device_put(dev);
+	kvm_put_kvm(kvm);
+	return 0;
+}
+
+static const struct file_operations kvm_device_fops = {
+	.unlocked_ioctl = kvm_device_ioctl,
+	.release = kvm_device_release,
+};
+
+struct kvm_device *kvm_device_from_filp(struct file *filp)
+{
+	if (filp->f_op != &kvm_device_fops)
+		return NULL;
+
+	return filp->private_data;
+}
+
+static int kvm_ioctl_create_device(struct kvm *kvm,
+				   struct kvm_create_device *cd)
+{
+	struct kvm_device_ops *ops = NULL;
+	struct kvm_device *dev;
+	bool test = cd->flags & KVM_CREATE_DEVICE_TEST;
+	int ret;
+
+	switch (cd->type) {
+	default:
+		return -ENODEV;
+	}
+
+	if (test)
+		return 0;
+
+	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+	if (!dev)
+		return -ENOMEM;
+
+	dev->ops = ops;
+	dev->kvm = kvm;
+	atomic_set(&dev->users, 1);
+
+	ret = ops->create(dev, cd->type);
+	if (ret < 0) {
+		kfree(dev);
+		return ret;
+	}
+
+	ret = anon_inode_getfd(ops->name, &kvm_device_fops, dev, O_RDWR);
+	if (ret < 0) {
+		ops->destroy(dev);
+		return ret;
+	}
+
+	kvm_get_kvm(kvm);
+	cd->fd = ret;
+	return 0;
+}
+
 static long kvm_vm_ioctl(struct file *filp,
 			   unsigned int ioctl, unsigned long arg)
 {
@@ -2272,6 +2383,24 @@ static long kvm_vm_ioctl(struct file *filp,
 		break;
 	}
 #endif
+	case KVM_CREATE_DEVICE: {
+		struct kvm_create_device cd;
+
+		r = -EFAULT;
+		if (copy_from_user(&cd, argp, sizeof(cd)))
+			goto out;
+
+		r = kvm_ioctl_create_device(kvm, &cd);
+		if (r)
+			goto out;
+
+		r = -EFAULT;
+		if (copy_to_user(argp, &cd, sizeof(cd)))
+			goto out;
+
+		r = 0;
+		break;
+	}
 	default:
 		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
 		if (r == -ENOTTY)
-- 
1.7.10.4




* [PATCH v4 2/6] kvm/ppc/mpic: import hw/openpic.c from QEMU
  2013-04-13  0:08       ` Scott Wood
@ 2013-04-13  0:08         ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-13  0:08 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus, Scott Wood

This is QEMU's hw/openpic.c from commit
abd8d4a4d6dfea7ddea72f095f993e1de941614e ("Update version for
1.4.0-rc0"), run through Lindent with no other changes to ease merging
future changes between Linux and QEMU.  Remaining style issues
(including those introduced by Lindent) will be fixed in a later patch.

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
 arch/powerpc/kvm/mpic.c | 1686 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 1686 insertions(+)
 create mode 100644 arch/powerpc/kvm/mpic.c

diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
new file mode 100644
index 0000000..57655b9
--- /dev/null
+++ b/arch/powerpc/kvm/mpic.c
@@ -0,0 +1,1686 @@
+/*
+ * OpenPIC emulation
+ *
+ * Copyright (c) 2004 Jocelyn Mayer
+ *               2011 Alexander Graf
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+/*
+ *
+ * Based on OpenPic implementations:
+ * - Intel GW80314 I/O companion chip developer's manual
+ * - Motorola MPC8245 & MPC8540 user manuals.
+ * - Motorola MCP750 (aka Raven) programmer manual.
+ * - Motorola Harrier programmer manuel
+ *
+ * Serial interrupts, as implemented in Raven chipset are not supported yet.
+ *
+ */
+#include "hw.h"
+#include "ppc/mac.h"
+#include "pci/pci.h"
+#include "openpic.h"
+#include "sysbus.h"
+#include "pci/msi.h"
+#include "qemu/bitops.h"
+#include "ppc.h"
+
+//#define DEBUG_OPENPIC
+
+#ifdef DEBUG_OPENPIC
+static const int debug_openpic = 1;
+#else
+static const int debug_openpic = 0;
+#endif
+
+#define DPRINTF(fmt, ...) do { \
+        if (debug_openpic) { \
+            printf(fmt , ## __VA_ARGS__); \
+        } \
+    } while (0)
+
+#define MAX_CPU     32
+#define MAX_SRC     256
+#define MAX_TMR     4
+#define MAX_IPI     4
+#define MAX_MSI     8
+#define MAX_IRQ     (MAX_SRC + MAX_IPI + MAX_TMR)
+#define VID         0x03	/* MPIC version ID */
+
+/* OpenPIC capability flags */
+#define OPENPIC_FLAG_IDR_CRIT     (1 << 0)
+#define OPENPIC_FLAG_ILR          (2 << 0)
+
+/* OpenPIC address map */
+#define OPENPIC_GLB_REG_START        0x0
+#define OPENPIC_GLB_REG_SIZE         0x10F0
+#define OPENPIC_TMR_REG_START        0x10F0
+#define OPENPIC_TMR_REG_SIZE         0x220
+#define OPENPIC_MSI_REG_START        0x1600
+#define OPENPIC_MSI_REG_SIZE         0x200
+#define OPENPIC_SUMMARY_REG_START   0x3800
+#define OPENPIC_SUMMARY_REG_SIZE    0x800
+#define OPENPIC_SRC_REG_START        0x10000
+#define OPENPIC_SRC_REG_SIZE         (MAX_SRC * 0x20)
+#define OPENPIC_CPU_REG_START        0x20000
+#define OPENPIC_CPU_REG_SIZE         0x100 + ((MAX_CPU - 1) * 0x1000)
+
+/* Raven */
+#define RAVEN_MAX_CPU      2
+#define RAVEN_MAX_EXT     48
+#define RAVEN_MAX_IRQ     64
+#define RAVEN_MAX_TMR      MAX_TMR
+#define RAVEN_MAX_IPI      MAX_IPI
+
+/* Interrupt definitions */
+#define RAVEN_FE_IRQ     (RAVEN_MAX_EXT)	/* Internal functional IRQ */
+#define RAVEN_ERR_IRQ    (RAVEN_MAX_EXT + 1)	/* Error IRQ */
+#define RAVEN_TMR_IRQ    (RAVEN_MAX_EXT + 2)	/* First timer IRQ */
+#define RAVEN_IPI_IRQ    (RAVEN_TMR_IRQ + RAVEN_MAX_TMR)	/* First IPI IRQ */
+/* First doorbell IRQ */
+#define RAVEN_DBL_IRQ    (RAVEN_IPI_IRQ + (RAVEN_MAX_CPU * RAVEN_MAX_IPI))
+
+typedef struct FslMpicInfo {
+	int max_ext;
+} FslMpicInfo;
+
+static FslMpicInfo fsl_mpic_20 = {
+	.max_ext = 12,
+};
+
+static FslMpicInfo fsl_mpic_42 = {
+	.max_ext = 12,
+};
+
+#define FRR_NIRQ_SHIFT    16
+#define FRR_NCPU_SHIFT     8
+#define FRR_VID_SHIFT      0
+
+#define VID_REVISION_1_2   2
+#define VID_REVISION_1_3   3
+
+#define VIR_GENERIC      0x00000000	/* Generic Vendor ID */
+
+#define GCR_RESET        0x80000000
+#define GCR_MODE_PASS    0x00000000
+#define GCR_MODE_MIXED   0x20000000
+#define GCR_MODE_PROXY   0x60000000
+
+#define TBCR_CI           0x80000000	/* count inhibit */
+#define TCCR_TOG          0x80000000	/* toggles when decrement to zero */
+
+#define IDR_EP_SHIFT      31
+#define IDR_EP_MASK       (1 << IDR_EP_SHIFT)
+#define IDR_CI0_SHIFT     30
+#define IDR_CI1_SHIFT     29
+#define IDR_P1_SHIFT      1
+#define IDR_P0_SHIFT      0
+
+#define ILR_INTTGT_MASK   0x000000ff
+#define ILR_INTTGT_INT    0x00
+#define ILR_INTTGT_CINT   0x01	/* critical */
+#define ILR_INTTGT_MCP    0x02	/* machine check */
+
+/* The currently supported INTTGT values happen to be the same as QEMU's
+ * openpic output codes, but don't depend on this.  The output codes
+ * could change (unlikely, but...) or support could be added for
+ * more INTTGT values.
+ */
+static const int inttgt_output[][2] = {
+	{ILR_INTTGT_INT, OPENPIC_OUTPUT_INT},
+	{ILR_INTTGT_CINT, OPENPIC_OUTPUT_CINT},
+	{ILR_INTTGT_MCP, OPENPIC_OUTPUT_MCK},
+};
+
+static int inttgt_to_output(int inttgt)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(inttgt_output); i++) {
+		if (inttgt_output[i][0] == inttgt) {
+			return inttgt_output[i][1];
+		}
+	}
+
+	fprintf(stderr, "%s: unsupported inttgt %d\n", __func__, inttgt);
+	return OPENPIC_OUTPUT_INT;
+}
+
+static int output_to_inttgt(int output)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(inttgt_output); i++) {
+		if (inttgt_output[i][1] == output) {
+			return inttgt_output[i][0];
+		}
+	}
+
+	abort();
+}
+
+#define MSIIR_OFFSET       0x140
+#define MSIIR_SRS_SHIFT    29
+#define MSIIR_SRS_MASK     (0x7 << MSIIR_SRS_SHIFT)
+#define MSIIR_IBS_SHIFT    24
+#define MSIIR_IBS_MASK     (0x1f << MSIIR_IBS_SHIFT)
+
+static int get_current_cpu(void)
+{
+	CPUState *cpu_single_cpu;
+
+	if (!cpu_single_env) {
+		return -1;
+	}
+
+	cpu_single_cpu = ENV_GET_CPU(cpu_single_env);
+	return cpu_single_cpu->cpu_index;
+}
+
+static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx);
+static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
+				       uint32_t val, int idx);
+
+typedef enum IRQType {
+	IRQ_TYPE_NORMAL = 0,
+	IRQ_TYPE_FSLINT,	/* FSL internal interrupt -- level only */
+	IRQ_TYPE_FSLSPECIAL,	/* FSL timer/IPI interrupt, edge, no polarity */
+} IRQType;
+
+typedef struct IRQQueue {
+	/* Round up to the nearest 64 IRQs so that the queue length
+	 * won't change when moving between 32 and 64 bit hosts.
+	 */
+	unsigned long queue[BITS_TO_LONGS((MAX_IRQ + 63) & ~63)];
+	int next;
+	int priority;
+} IRQQueue;
+
+typedef struct IRQSource {
+	uint32_t ivpr;		/* IRQ vector/priority register */
+	uint32_t idr;		/* IRQ destination register */
+	uint32_t destmask;	/* bitmap of CPU destinations */
+	int last_cpu;
+	int output;		/* IRQ level, e.g. OPENPIC_OUTPUT_INT */
+	int pending;		/* TRUE if IRQ is pending */
+	IRQType type;
+	bool level:1;		/* level-triggered */
+	bool nomask:1;		/* critical interrupts ignore mask on some FSL MPICs */
+} IRQSource;
+
+#define IVPR_MASK_SHIFT       31
+#define IVPR_MASK_MASK        (1 << IVPR_MASK_SHIFT)
+#define IVPR_ACTIVITY_SHIFT   30
+#define IVPR_ACTIVITY_MASK    (1 << IVPR_ACTIVITY_SHIFT)
+#define IVPR_MODE_SHIFT       29
+#define IVPR_MODE_MASK        (1 << IVPR_MODE_SHIFT)
+#define IVPR_POLARITY_SHIFT   23
+#define IVPR_POLARITY_MASK    (1 << IVPR_POLARITY_SHIFT)
+#define IVPR_SENSE_SHIFT      22
+#define IVPR_SENSE_MASK       (1 << IVPR_SENSE_SHIFT)
+
+#define IVPR_PRIORITY_MASK     (0xF << 16)
+#define IVPR_PRIORITY(_ivprr_) ((int)(((_ivprr_) & IVPR_PRIORITY_MASK) >> 16))
+#define IVPR_VECTOR(opp, _ivprr_) ((_ivprr_) & (opp)->vector_mask)
+
+/* IDR[EP/CI] are only for FSL MPIC prior to v4.0 */
+#define IDR_EP      0x80000000	/* external pin */
+#define IDR_CI      0x40000000	/* critical interrupt */
+
+typedef struct IRQDest {
+	int32_t ctpr;		/* CPU current task priority */
+	IRQQueue raised;
+	IRQQueue servicing;
+	qemu_irq *irqs;
+
+	/* Count of IRQ sources asserting on non-INT outputs */
+	uint32_t outputs_active[OPENPIC_OUTPUT_NB];
+} IRQDest;
+
+typedef struct OpenPICState {
+	SysBusDevice busdev;
+	MemoryRegion mem;
+
+	/* Behavior control */
+	FslMpicInfo *fsl;
+	uint32_t model;
+	uint32_t flags;
+	uint32_t nb_irqs;
+	uint32_t vid;
+	uint32_t vir;		/* Vendor identification register */
+	uint32_t vector_mask;
+	uint32_t tfrr_reset;
+	uint32_t ivpr_reset;
+	uint32_t idr_reset;
+	uint32_t brr1;
+	uint32_t mpic_mode_mask;
+
+	/* Sub-regions */
+	MemoryRegion sub_io_mem[6];
+
+	/* Global registers */
+	uint32_t frr;		/* Feature reporting register */
+	uint32_t gcr;		/* Global configuration register  */
+	uint32_t pir;		/* Processor initialization register */
+	uint32_t spve;		/* Spurious vector register */
+	uint32_t tfrr;		/* Timer frequency reporting register */
+	/* Source registers */
+	IRQSource src[MAX_IRQ];
+	/* Local registers per output pin */
+	IRQDest dst[MAX_CPU];
+	uint32_t nb_cpus;
+	/* Timer registers */
+	struct {
+		uint32_t tccr;	/* Global timer current count register */
+		uint32_t tbcr;	/* Global timer base count register */
+	} timers[MAX_TMR];
+	/* Shared MSI registers */
+	struct {
+		uint32_t msir;	/* Shared Message Signaled Interrupt Register */
+	} msi[MAX_MSI];
+	uint32_t max_irq;
+	uint32_t irq_ipi0;
+	uint32_t irq_tim0;
+	uint32_t irq_msi;
+} OpenPICState;
+
+static inline void IRQ_setbit(IRQQueue * q, int n_IRQ)
+{
+	set_bit(n_IRQ, q->queue);
+}
+
+static inline void IRQ_resetbit(IRQQueue * q, int n_IRQ)
+{
+	clear_bit(n_IRQ, q->queue);
+}
+
+static inline int IRQ_testbit(IRQQueue * q, int n_IRQ)
+{
+	return test_bit(n_IRQ, q->queue);
+}
+
+static void IRQ_check(OpenPICState * opp, IRQQueue * q)
+{
+	int irq = -1;
+	int next = -1;
+	int priority = -1;
+
+	for (;;) {
+		irq = find_next_bit(q->queue, opp->max_irq, irq + 1);
+		if (irq == opp->max_irq) {
+			break;
+		}
+
+		DPRINTF("IRQ_check: irq %d set ivpr_pr=%d pr=%d\n",
+			irq, IVPR_PRIORITY(opp->src[irq].ivpr), priority);
+
+		if (IVPR_PRIORITY(opp->src[irq].ivpr) > priority) {
+			next = irq;
+			priority = IVPR_PRIORITY(opp->src[irq].ivpr);
+		}
+	}
+
+	q->next = next;
+	q->priority = priority;
+}
+
+static int IRQ_get_next(OpenPICState * opp, IRQQueue * q)
+{
+	/* XXX: optimize */
+	IRQ_check(opp, q);
+
+	return q->next;
+}
+
+static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
+			   bool active, bool was_active)
+{
+	IRQDest *dst;
+	IRQSource *src;
+	int priority;
+
+	dst = &opp->dst[n_CPU];
+	src = &opp->src[n_IRQ];
+
+	DPRINTF("%s: IRQ %d active %d was %d\n",
+		__func__, n_IRQ, active, was_active);
+
+	if (src->output != OPENPIC_OUTPUT_INT) {
+		DPRINTF("%s: output %d irq %d active %d was %d count %d\n",
+			__func__, src->output, n_IRQ, active, was_active,
+			dst->outputs_active[src->output]);
+
+		/* On Freescale MPIC, critical interrupts ignore priority,
+		 * IACK, EOI, etc.  Before MPIC v4.1 they also ignore
+		 * masking.
+		 */
+		if (active) {
+			if (!was_active
+			    && dst->outputs_active[src->output]++ == 0) {
+				DPRINTF
+				    ("%s: Raise OpenPIC output %d cpu %d irq %d\n",
+				     __func__, src->output, n_CPU, n_IRQ);
+				qemu_irq_raise(dst->irqs[src->output]);
+			}
+		} else {
+			if (was_active
+			    && --dst->outputs_active[src->output] == 0) {
+				DPRINTF
+				    ("%s: Lower OpenPIC output %d cpu %d irq %d\n",
+				     __func__, src->output, n_CPU, n_IRQ);
+				qemu_irq_lower(dst->irqs[src->output]);
+			}
+		}
+
+		return;
+	}
+
+	priority = IVPR_PRIORITY(src->ivpr);
+
+	/* Even if the interrupt doesn't have enough priority,
+	 * it is still raised, in case ctpr is lowered later.
+	 */
+	if (active) {
+		IRQ_setbit(&dst->raised, n_IRQ);
+	} else {
+		IRQ_resetbit(&dst->raised, n_IRQ);
+	}
+
+	IRQ_check(opp, &dst->raised);
+
+	if (active && priority <= dst->ctpr) {
+		DPRINTF
+		    ("%s: IRQ %d priority %d too low for ctpr %d on CPU %d\n",
+		     __func__, n_IRQ, priority, dst->ctpr, n_CPU);
+		active = 0;
+	}
+
+	if (active) {
+		if (IRQ_get_next(opp, &dst->servicing) >= 0 &&
+		    priority <= dst->servicing.priority) {
+			DPRINTF
+			    ("%s: IRQ %d is hidden by servicing IRQ %d on CPU %d\n",
+			     __func__, n_IRQ, dst->servicing.next, n_CPU);
+		} else {
+			DPRINTF
+			    ("%s: Raise OpenPIC INT output cpu %d irq %d/%d\n",
+			     __func__, n_CPU, n_IRQ, dst->raised.next);
+			qemu_irq_raise(opp->dst[n_CPU].
+				       irqs[OPENPIC_OUTPUT_INT]);
+		}
+	} else {
+		IRQ_get_next(opp, &dst->servicing);
+		if (dst->raised.priority > dst->ctpr &&
+		    dst->raised.priority > dst->servicing.priority) {
+			DPRINTF
+			    ("%s: IRQ %d inactive, IRQ %d prio %d above %d/%d, CPU %d\n",
+			     __func__, n_IRQ, dst->raised.next,
+			     dst->raised.priority, dst->ctpr,
+			     dst->servicing.priority, n_CPU);
+			/* IRQ line stays asserted */
+		} else {
+			DPRINTF
+			    ("%s: IRQ %d inactive, current prio %d/%d, CPU %d\n",
+			     __func__, n_IRQ, dst->ctpr,
+			     dst->servicing.priority, n_CPU);
+			qemu_irq_lower(opp->dst[n_CPU].
+				       irqs[OPENPIC_OUTPUT_INT]);
+		}
+	}
+}
+
+/* update pic state because registers for n_IRQ have changed value */
+static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
+{
+	IRQSource *src;
+	bool active, was_active;
+	int i;
+
+	src = &opp->src[n_IRQ];
+	active = src->pending;
+
+	if ((src->ivpr & IVPR_MASK_MASK) && !src->nomask) {
+		/* Interrupt source is disabled */
+		DPRINTF("%s: IRQ %d is disabled\n", __func__, n_IRQ);
+		active = false;
+	}
+
+	was_active = !!(src->ivpr & IVPR_ACTIVITY_MASK);
+
+	/*
+	 * We don't have a similar check for already-active because
+	 * ctpr may have changed and we need to withdraw the interrupt.
+	 */
+	if (!active && !was_active) {
+		DPRINTF("%s: IRQ %d is already inactive\n", __func__, n_IRQ);
+		return;
+	}
+
+	if (active) {
+		src->ivpr |= IVPR_ACTIVITY_MASK;
+	} else {
+		src->ivpr &= ~IVPR_ACTIVITY_MASK;
+	}
+
+	if (src->destmask == 0) {
+		/* No target */
+		DPRINTF("%s: IRQ %d has no target\n", __func__, n_IRQ);
+		return;
+	}
+
+	if (src->destmask == (1 << src->last_cpu)) {
+		/* Only one CPU is allowed to receive this IRQ */
+		IRQ_local_pipe(opp, src->last_cpu, n_IRQ, active, was_active);
+	} else if (!(src->ivpr & IVPR_MODE_MASK)) {
+		/* Directed delivery mode */
+		for (i = 0; i < opp->nb_cpus; i++) {
+			if (src->destmask & (1 << i)) {
+				IRQ_local_pipe(opp, i, n_IRQ, active,
+					       was_active);
+			}
+		}
+	} else {
+		/* Distributed delivery mode */
+		for (i = src->last_cpu + 1; i != src->last_cpu; i++) {
+			if (i == opp->nb_cpus) {
+				i = 0;
+			}
+			if (src->destmask & (1 << i)) {
+				IRQ_local_pipe(opp, i, n_IRQ, active,
+					       was_active);
+				src->last_cpu = i;
+				break;
+			}
+		}
+	}
+}
+
+static void openpic_set_irq(void *opaque, int n_IRQ, int level)
+{
+	OpenPICState *opp = opaque;
+	IRQSource *src;
+
+	if (n_IRQ >= MAX_IRQ) {
+		fprintf(stderr, "%s: IRQ %d out of range\n", __func__, n_IRQ);
+		abort();
+	}
+
+	src = &opp->src[n_IRQ];
+	DPRINTF("openpic: set irq %d = %d ivpr=0x%08x\n",
+		n_IRQ, level, src->ivpr);
+	if (src->level) {
+		/* level-sensitive irq */
+		src->pending = level;
+		openpic_update_irq(opp, n_IRQ);
+	} else {
+		/* edge-sensitive irq */
+		if (level) {
+			src->pending = 1;
+			openpic_update_irq(opp, n_IRQ);
+		}
+
+		if (src->output != OPENPIC_OUTPUT_INT) {
+			/* Edge-triggered interrupts shouldn't be used
+			 * with non-INT delivery, but just in case,
+			 * try to make it do something sane rather than
+			 * cause an interrupt storm.  This is close to
+			 * what you'd probably see happen in real hardware.
+			 */
+			src->pending = 0;
+			openpic_update_irq(opp, n_IRQ);
+		}
+	}
+}
+
+static void openpic_reset(DeviceState * d)
+{
+	OpenPICState *opp = FROM_SYSBUS(typeof(*opp), SYS_BUS_DEVICE(d));
+	int i;
+
+	opp->gcr = GCR_RESET;
+	/* Initialise controller registers */
+	opp->frr = ((opp->nb_irqs - 1) << FRR_NIRQ_SHIFT) |
+	    ((opp->nb_cpus - 1) << FRR_NCPU_SHIFT) |
+	    (opp->vid << FRR_VID_SHIFT);
+
+	opp->pir = 0;
+	opp->spve = -1 & opp->vector_mask;
+	opp->tfrr = opp->tfrr_reset;
+	/* Initialise IRQ sources */
+	for (i = 0; i < opp->max_irq; i++) {
+		opp->src[i].ivpr = opp->ivpr_reset;
+		opp->src[i].idr = opp->idr_reset;
+
+		switch (opp->src[i].type) {
+		case IRQ_TYPE_NORMAL:
+			opp->src[i].level =
+			    !!(opp->ivpr_reset & IVPR_SENSE_MASK);
+			break;
+
+		case IRQ_TYPE_FSLINT:
+			opp->src[i].ivpr |= IVPR_POLARITY_MASK;
+			break;
+
+		case IRQ_TYPE_FSLSPECIAL:
+			break;
+		}
+	}
+	/* Initialise IRQ destinations */
+	for (i = 0; i < MAX_CPU; i++) {
+		opp->dst[i].ctpr = 15;
+		memset(&opp->dst[i].raised, 0, sizeof(IRQQueue));
+		opp->dst[i].raised.next = -1;
+		memset(&opp->dst[i].servicing, 0, sizeof(IRQQueue));
+		opp->dst[i].servicing.next = -1;
+	}
+	/* Initialise timers */
+	for (i = 0; i < MAX_TMR; i++) {
+		opp->timers[i].tccr = 0;
+		opp->timers[i].tbcr = TBCR_CI;
+	}
+	/* Go out of RESET state */
+	opp->gcr = 0;
+}
+
+static inline uint32_t read_IRQreg_idr(OpenPICState * opp, int n_IRQ)
+{
+	return opp->src[n_IRQ].idr;
+}
+
+static inline uint32_t read_IRQreg_ilr(OpenPICState * opp, int n_IRQ)
+{
+	if (opp->flags & OPENPIC_FLAG_ILR) {
+		return output_to_inttgt(opp->src[n_IRQ].output);
+	}
+
+	return 0xffffffff;
+}
+
+static inline uint32_t read_IRQreg_ivpr(OpenPICState * opp, int n_IRQ)
+{
+	return opp->src[n_IRQ].ivpr;
+}
+
+static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
+{
+	IRQSource *src = &opp->src[n_IRQ];
+	uint32_t normal_mask = (1UL << opp->nb_cpus) - 1;
+	uint32_t crit_mask = 0;
+	uint32_t mask = normal_mask;
+	int crit_shift = IDR_EP_SHIFT - opp->nb_cpus;
+	int i;
+
+	if (opp->flags & OPENPIC_FLAG_IDR_CRIT) {
+		crit_mask = mask << crit_shift;
+		mask |= crit_mask | IDR_EP;
+	}
+
+	src->idr = val & mask;
+	DPRINTF("Set IDR %d to 0x%08x\n", n_IRQ, src->idr);
+
+	if (opp->flags & OPENPIC_FLAG_IDR_CRIT) {
+		if (src->idr & crit_mask) {
+			if (src->idr & normal_mask) {
+				DPRINTF
+				    ("%s: IRQ configured for multiple output types, using "
+				     "critical\n", __func__);
+			}
+
+			src->output = OPENPIC_OUTPUT_CINT;
+			src->nomask = true;
+			src->destmask = 0;
+
+			for (i = 0; i < opp->nb_cpus; i++) {
+				int n_ci = IDR_CI0_SHIFT - i;
+
+				if (src->idr & (1UL << n_ci)) {
+					src->destmask |= 1UL << i;
+				}
+			}
+		} else {
+			src->output = OPENPIC_OUTPUT_INT;
+			src->nomask = false;
+			src->destmask = src->idr & normal_mask;
+		}
+	} else {
+		src->destmask = src->idr;
+	}
+}
+
+static inline void write_IRQreg_ilr(OpenPICState * opp, int n_IRQ, uint32_t val)
+{
+	if (opp->flags & OPENPIC_FLAG_ILR) {
+		IRQSource *src = &opp->src[n_IRQ];
+
+		src->output = inttgt_to_output(val & ILR_INTTGT_MASK);
+		DPRINTF("Set ILR %d to 0x%08x, output %d\n", n_IRQ, val,
+			src->output);
+
+		/* TODO: on MPIC v4.0 only, set nomask for non-INT */
+	}
+}
+
+static inline void write_IRQreg_ivpr(OpenPICState * opp, int n_IRQ,
+				     uint32_t val)
+{
+	uint32_t mask;
+
+	/* NOTE when implementing newer FSL MPIC models: starting with v4.0,
+	 * the polarity bit is read-only on internal interrupts.
+	 */
+	mask = IVPR_MASK_MASK | IVPR_PRIORITY_MASK | IVPR_SENSE_MASK |
+	    IVPR_POLARITY_MASK | opp->vector_mask;
+
+	/* ACTIVITY bit is read-only */
+	opp->src[n_IRQ].ivpr =
+	    (opp->src[n_IRQ].ivpr & IVPR_ACTIVITY_MASK) | (val & mask);
+
+	/* For FSL internal interrupts, the sense bit is reserved and zero,
+	 * and the interrupt is always level-triggered.  Timers and IPIs
+	 * have no sense or polarity bits, and are edge-triggered.
+	 */
+	switch (opp->src[n_IRQ].type) {
+	case IRQ_TYPE_NORMAL:
+		opp->src[n_IRQ].level =
+		    !!(opp->src[n_IRQ].ivpr & IVPR_SENSE_MASK);
+		break;
+
+	case IRQ_TYPE_FSLINT:
+		opp->src[n_IRQ].ivpr &= ~IVPR_SENSE_MASK;
+		break;
+
+	case IRQ_TYPE_FSLSPECIAL:
+		opp->src[n_IRQ].ivpr &= ~(IVPR_POLARITY_MASK | IVPR_SENSE_MASK);
+		break;
+	}
+
+	openpic_update_irq(opp, n_IRQ);
+	DPRINTF("Set IVPR %d to 0x%08x -> 0x%08x\n", n_IRQ, val,
+		opp->src[n_IRQ].ivpr);
+}
+
+static void openpic_gcr_write(OpenPICState * opp, uint64_t val)
+{
+	bool mpic_proxy = false;
+
+	if (val & GCR_RESET) {
+		openpic_reset(&opp->busdev.qdev);
+		return;
+	}
+
+	opp->gcr &= ~opp->mpic_mode_mask;
+	opp->gcr |= val & opp->mpic_mode_mask;
+
+	/* Set external proxy mode */
+	if ((val & opp->mpic_mode_mask) == GCR_MODE_PROXY) {
+		mpic_proxy = true;
+	}
+
+	ppce500_set_mpic_proxy(mpic_proxy);
+}
+
+static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned len)
+{
+	OpenPICState *opp = opaque;
+	IRQDest *dst;
+	int idx;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+		__func__, addr, val);
+	if (addr & 0xF) {
+		return;
+	}
+	switch (addr) {
+	case 0x00:		/* Block Revision Register1 (BRR1) is read-only */
+		break;
+	case 0x40:
+	case 0x50:
+	case 0x60:
+	case 0x70:
+	case 0x80:
+	case 0x90:
+	case 0xA0:
+	case 0xB0:
+		openpic_cpu_write_internal(opp, addr, val, get_current_cpu());
+		break;
+	case 0x1000:		/* FRR */
+		break;
+	case 0x1020:		/* GCR */
+		openpic_gcr_write(opp, val);
+		break;
+	case 0x1080:		/* VIR */
+		break;
+	case 0x1090:		/* PIR */
+		for (idx = 0; idx < opp->nb_cpus; idx++) {
+			if ((val & (1 << idx)) && !(opp->pir & (1 << idx))) {
+				DPRINTF
+				    ("Raise OpenPIC RESET output for CPU %d\n",
+				     idx);
+				dst = &opp->dst[idx];
+				qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_RESET]);
+			} else if (!(val & (1 << idx))
+				   && (opp->pir & (1 << idx))) {
+				DPRINTF
+				    ("Lower OpenPIC RESET output for CPU %d\n",
+				     idx);
+				dst = &opp->dst[idx];
+				qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_RESET]);
+			}
+		}
+		opp->pir = val;
+		break;
+	case 0x10A0:		/* IPI_IVPR */
+	case 0x10B0:
+	case 0x10C0:
+	case 0x10D0:
+		{
+			int idx;
+			idx = (addr - 0x10A0) >> 4;
+			write_IRQreg_ivpr(opp, opp->irq_ipi0 + idx, val);
+		}
+		break;
+	case 0x10E0:		/* SPVE */
+		opp->spve = val & opp->vector_mask;
+		break;
+	default:
+		break;
+	}
+}
+
+static uint64_t openpic_gbl_read(void *opaque, hwaddr addr, unsigned len)
+{
+	OpenPICState *opp = opaque;
+	uint32_t retval;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	retval = 0xFFFFFFFF;
+	if (addr & 0xF) {
+		return retval;
+	}
+	switch (addr) {
+	case 0x1000:		/* FRR */
+		retval = opp->frr;
+		break;
+	case 0x1020:		/* GCR */
+		retval = opp->gcr;
+		break;
+	case 0x1080:		/* VIR */
+		retval = opp->vir;
+		break;
+	case 0x1090:		/* PIR */
+		retval = 0x00000000;
+		break;
+	case 0x00:		/* Block Revision Register1 (BRR1) */
+		retval = opp->brr1;
+		break;
+	case 0x40:
+	case 0x50:
+	case 0x60:
+	case 0x70:
+	case 0x80:
+	case 0x90:
+	case 0xA0:
+	case 0xB0:
+		retval =
+		    openpic_cpu_read_internal(opp, addr, get_current_cpu());
+		break;
+	case 0x10A0:		/* IPI_IVPR */
+	case 0x10B0:
+	case 0x10C0:
+	case 0x10D0:
+		{
+			int idx;
+			idx = (addr - 0x10A0) >> 4;
+			retval = read_IRQreg_ivpr(opp, opp->irq_ipi0 + idx);
+		}
+		break;
+	case 0x10E0:		/* SPVE */
+		retval = opp->spve;
+		break;
+	default:
+		break;
+	}
+	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+
+	return retval;
+}
+
+static void openpic_tmr_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned len)
+{
+	OpenPICState *opp = opaque;
+	int idx;
+
+	addr += 0x10f0;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+		__func__, addr, val);
+	if (addr & 0xF) {
+		return;
+	}
+
+	if (addr == 0x10f0) {
+		/* TFRR */
+		opp->tfrr = val;
+		return;
+	}
+
+	idx = (addr >> 6) & 0x3;
+	addr = addr & 0x30;
+
+	switch (addr & 0x30) {
+	case 0x00:		/* TCCR */
+		break;
+	case 0x10:		/* TBCR */
+		if ((opp->timers[idx].tccr & TCCR_TOG) != 0 &&
+		    (val & TBCR_CI) == 0 &&
+		    (opp->timers[idx].tbcr & TBCR_CI) != 0) {
+			opp->timers[idx].tccr &= ~TCCR_TOG;
+		}
+		opp->timers[idx].tbcr = val;
+		break;
+	case 0x20:		/* TVPR */
+		write_IRQreg_ivpr(opp, opp->irq_tim0 + idx, val);
+		break;
+	case 0x30:		/* TDR */
+		write_IRQreg_idr(opp, opp->irq_tim0 + idx, val);
+		break;
+	}
+}
+
+static uint64_t openpic_tmr_read(void *opaque, hwaddr addr, unsigned len)
+{
+	OpenPICState *opp = opaque;
+	uint32_t retval = -1;
+	int idx;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	if (addr & 0xF) {
+		goto out;
+	}
+	idx = (addr >> 6) & 0x3;
+	if (addr == 0x0) {
+		/* TFRR */
+		retval = opp->tfrr;
+		goto out;
+	}
+	switch (addr & 0x30) {
+	case 0x00:		/* TCCR */
+		retval = opp->timers[idx].tccr;
+		break;
+	case 0x10:		/* TBCR */
+		retval = opp->timers[idx].tbcr;
+		break;
+	case 0x20:		/* TVPR */
+		retval = read_IRQreg_ivpr(opp, opp->irq_tim0 + idx);
+		break;
+	case 0x30:		/* TDR */
+		retval = read_IRQreg_idr(opp, opp->irq_tim0 + idx);
+		break;
+	}
+
+out:
+	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+
+	return retval;
+}
+
+static void openpic_src_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned len)
+{
+	OpenPICState *opp = opaque;
+	int idx;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+		__func__, addr, val);
+
+	addr = addr & 0xffff;
+	idx = addr >> 5;
+
+	switch (addr & 0x1f) {
+	case 0x00:
+		write_IRQreg_ivpr(opp, idx, val);
+		break;
+	case 0x10:
+		write_IRQreg_idr(opp, idx, val);
+		break;
+	case 0x18:
+		write_IRQreg_ilr(opp, idx, val);
+		break;
+	}
+}
+
+static uint64_t openpic_src_read(void *opaque, hwaddr addr, unsigned len)
+{
+	OpenPICState *opp = opaque;
+	uint32_t retval;
+	int idx;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	retval = 0xFFFFFFFF;
+
+	addr = addr & 0xffff;
+	idx = addr >> 5;
+
+	switch (addr & 0x1f) {
+	case 0x00:
+		retval = read_IRQreg_ivpr(opp, idx);
+		break;
+	case 0x10:
+		retval = read_IRQreg_idr(opp, idx);
+		break;
+	case 0x18:
+		retval = read_IRQreg_ilr(opp, idx);
+		break;
+	}
+
+	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	return retval;
+}
+
+static void openpic_msi_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned size)
+{
+	OpenPICState *opp = opaque;
+	int idx = opp->irq_msi;
+	int srs, ibs;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
+		__func__, addr, val);
+	if (addr & 0xF) {
+		return;
+	}
+
+	switch (addr) {
+	case MSIIR_OFFSET:
+		srs = val >> MSIIR_SRS_SHIFT;
+		idx += srs;
+		ibs = (val & MSIIR_IBS_MASK) >> MSIIR_IBS_SHIFT;
+		opp->msi[srs].msir |= 1 << ibs;
+		openpic_set_irq(opp, idx, 1);
+		break;
+	default:
+		/* most registers are read-only, thus ignored */
+		break;
+	}
+}
+
+static uint64_t openpic_msi_read(void *opaque, hwaddr addr, unsigned size)
+{
+	OpenPICState *opp = opaque;
+	uint64_t r = 0;
+	int i, srs;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	if (addr & 0xF) {
+		return -1;
+	}
+
+	srs = addr >> 4;
+
+	switch (addr) {
+	case 0x00:
+	case 0x10:
+	case 0x20:
+	case 0x30:
+	case 0x40:
+	case 0x50:
+	case 0x60:
+	case 0x70:		/* MSIRs */
+		r = opp->msi[srs].msir;
+		/* Clear on read */
+		opp->msi[srs].msir = 0;
+		openpic_set_irq(opp, opp->irq_msi + srs, 0);
+		break;
+	case 0x120:		/* MSISR */
+		for (i = 0; i < MAX_MSI; i++) {
+			r |= (opp->msi[i].msir ? 1 : 0) << i;
+		}
+		break;
+	}
+
+	return r;
+}
+
+static uint64_t openpic_summary_read(void *opaque, hwaddr addr, unsigned size)
+{
+	uint64_t r = 0;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+
+	/* TODO: EISR/EIMR */
+
+	return r;
+}
+
+static void openpic_summary_write(void *opaque, hwaddr addr, uint64_t val,
+				  unsigned size)
+{
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
+		__func__, addr, val);
+
+	/* TODO: EISR/EIMR */
+}
+
+static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
+				       uint32_t val, int idx)
+{
+	OpenPICState *opp = opaque;
+	IRQSource *src;
+	IRQDest *dst;
+	int s_IRQ, n_IRQ;
+
+	DPRINTF("%s: cpu %d addr %#" HWADDR_PRIx " <= 0x%08x\n", __func__, idx,
+		addr, val);
+
+	if (idx < 0) {
+		return;
+	}
+
+	if (addr & 0xF) {
+		return;
+	}
+	dst = &opp->dst[idx];
+	addr &= 0xFF0;
+	switch (addr) {
+	case 0x40:		/* IPIDR */
+	case 0x50:
+	case 0x60:
+	case 0x70:
+		idx = (addr - 0x40) >> 4;
+		/* destmask tracks which CPUs the IPI must still be delivered to */
+		opp->src[opp->irq_ipi0 + idx].destmask |= val;
+		openpic_set_irq(opp, opp->irq_ipi0 + idx, 1);
+		openpic_set_irq(opp, opp->irq_ipi0 + idx, 0);
+		break;
+	case 0x80:		/* CTPR */
+		dst->ctpr = val & 0x0000000F;
+
+		DPRINTF("%s: set CPU %d ctpr to %d, raised %d servicing %d\n",
+			__func__, idx, dst->ctpr, dst->raised.priority,
+			dst->servicing.priority);
+
+		if (dst->raised.priority <= dst->ctpr) {
+			DPRINTF
+			    ("%s: Lower OpenPIC INT output cpu %d due to ctpr\n",
+			     __func__, idx);
+			qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
+		} else if (dst->raised.priority > dst->servicing.priority) {
+			DPRINTF("%s: Raise OpenPIC INT output cpu %d irq %d\n",
+				__func__, idx, dst->raised.next);
+			qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_INT]);
+		}
+
+		break;
+	case 0x90:		/* WHOAMI */
+		/* Read-only register */
+		break;
+	case 0xA0:		/* IACK */
+		/* Read-only register */
+		break;
+	case 0xB0:		/* EOI */
+		DPRINTF("EOI\n");
+		s_IRQ = IRQ_get_next(opp, &dst->servicing);
+
+		if (s_IRQ < 0) {
+			DPRINTF("%s: EOI with no interrupt in service\n",
+				__func__);
+			break;
+		}
+
+		IRQ_resetbit(&dst->servicing, s_IRQ);
+		/* Set up next servicing IRQ */
+		s_IRQ = IRQ_get_next(opp, &dst->servicing);
+		/* Check queued interrupts. */
+		n_IRQ = IRQ_get_next(opp, &dst->raised);
+		src = &opp->src[n_IRQ];
+		if (n_IRQ != -1 &&
+		    (s_IRQ == -1 ||
+		     IVPR_PRIORITY(src->ivpr) > dst->servicing.priority)) {
+			DPRINTF("Raise OpenPIC INT output cpu %d irq %d\n",
+				idx, n_IRQ);
+			qemu_irq_raise(opp->dst[idx].irqs[OPENPIC_OUTPUT_INT]);
+		}
+		break;
+	default:
+		break;
+	}
+}
+
+static void openpic_cpu_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned len)
+{
+	openpic_cpu_write_internal(opaque, addr, val, (addr & 0x1f000) >> 12);
+}
+
+static uint32_t openpic_iack(OpenPICState * opp, IRQDest * dst, int cpu)
+{
+	IRQSource *src;
+	int retval, irq;
+
+	DPRINTF("Lower OpenPIC INT output\n");
+	qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
+
+	irq = IRQ_get_next(opp, &dst->raised);
+	DPRINTF("IACK: irq=%d\n", irq);
+
+	if (irq == -1) {
+		/* No more interrupt pending */
+		return opp->spve;
+	}
+
+	src = &opp->src[irq];
+	if (!(src->ivpr & IVPR_ACTIVITY_MASK) ||
+	    !(IVPR_PRIORITY(src->ivpr) > dst->ctpr)) {
+		fprintf(stderr, "%s: bad raised IRQ %d ctpr %d ivpr 0x%08x\n",
+			__func__, irq, dst->ctpr, src->ivpr);
+		openpic_update_irq(opp, irq);
+		retval = opp->spve;
+	} else {
+		/* IRQ enter servicing state */
+		IRQ_setbit(&dst->servicing, irq);
+		retval = IVPR_VECTOR(opp, src->ivpr);
+	}
+
+	if (!src->level) {
+		/* edge-sensitive IRQ */
+		src->ivpr &= ~IVPR_ACTIVITY_MASK;
+		src->pending = 0;
+		IRQ_resetbit(&dst->raised, irq);
+	}
+
+	if ((irq >= opp->irq_ipi0) && (irq < (opp->irq_ipi0 + MAX_IPI))) {
+		src->destmask &= ~(1 << cpu);
+		if (src->destmask && !src->level) {
+			/* trigger on CPUs that didn't know about it yet */
+			openpic_set_irq(opp, irq, 1);
+			openpic_set_irq(opp, irq, 0);
+			/* if all CPUs knew about it, set active bit again */
+			src->ivpr |= IVPR_ACTIVITY_MASK;
+		}
+	}
+
+	return retval;
+}
+
+static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx)
+{
+	OpenPICState *opp = opaque;
+	IRQDest *dst;
+	uint32_t retval;
+
+	DPRINTF("%s: cpu %d addr %#" HWADDR_PRIx "\n", __func__, idx, addr);
+	retval = 0xFFFFFFFF;
+
+	if (idx < 0) {
+		return retval;
+	}
+
+	if (addr & 0xF) {
+		return retval;
+	}
+	dst = &opp->dst[idx];
+	addr &= 0xFF0;
+	switch (addr) {
+	case 0x80:		/* CTPR */
+		retval = dst->ctpr;
+		break;
+	case 0x90:		/* WHOAMI */
+		retval = idx;
+		break;
+	case 0xA0:		/* IACK */
+		retval = openpic_iack(opp, dst, idx);
+		break;
+	case 0xB0:		/* EOI */
+		retval = 0;
+		break;
+	default:
+		break;
+	}
+	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+
+	return retval;
+}
+
+static uint64_t openpic_cpu_read(void *opaque, hwaddr addr, unsigned len)
+{
+	return openpic_cpu_read_internal(opaque, addr, (addr & 0x1f000) >> 12);
+}
+
+static const MemoryRegionOps openpic_glb_ops_le = {
+	.write = openpic_gbl_write,
+	.read = openpic_gbl_read,
+	.endianness = DEVICE_LITTLE_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_glb_ops_be = {
+	.write = openpic_gbl_write,
+	.read = openpic_gbl_read,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_tmr_ops_le = {
+	.write = openpic_tmr_write,
+	.read = openpic_tmr_read,
+	.endianness = DEVICE_LITTLE_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_tmr_ops_be = {
+	.write = openpic_tmr_write,
+	.read = openpic_tmr_read,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_cpu_ops_le = {
+	.write = openpic_cpu_write,
+	.read = openpic_cpu_read,
+	.endianness = DEVICE_LITTLE_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_cpu_ops_be = {
+	.write = openpic_cpu_write,
+	.read = openpic_cpu_read,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_src_ops_le = {
+	.write = openpic_src_write,
+	.read = openpic_src_read,
+	.endianness = DEVICE_LITTLE_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_src_ops_be = {
+	.write = openpic_src_write,
+	.read = openpic_src_read,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_msi_ops_be = {
+	.read = openpic_msi_read,
+	.write = openpic_msi_write,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_summary_ops_be = {
+	.read = openpic_summary_read,
+	.write = openpic_summary_write,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static void openpic_save_IRQ_queue(QEMUFile * f, IRQQueue * q)
+{
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(q->queue); i++) {
+		/* Always put the lower half of a 64-bit long first, in case we
+		 * restore on a 32-bit host.  The least significant bits correspond
+		 * to lower IRQ numbers in the bitmap.
+		 */
+		qemu_put_be32(f, (uint32_t) q->queue[i]);
+#if LONG_MAX > 0x7FFFFFFF
+		qemu_put_be32(f, (uint32_t) (q->queue[i] >> 32));
+#endif
+	}
+
+	qemu_put_sbe32s(f, &q->next);
+	qemu_put_sbe32s(f, &q->priority);
+}
+
+static void openpic_save(QEMUFile * f, void *opaque)
+{
+	OpenPICState *opp = (OpenPICState *) opaque;
+	unsigned int i;
+
+	qemu_put_be32s(f, &opp->gcr);
+	qemu_put_be32s(f, &opp->vir);
+	qemu_put_be32s(f, &opp->pir);
+	qemu_put_be32s(f, &opp->spve);
+	qemu_put_be32s(f, &opp->tfrr);
+
+	qemu_put_be32s(f, &opp->nb_cpus);
+
+	for (i = 0; i < opp->nb_cpus; i++) {
+		qemu_put_sbe32s(f, &opp->dst[i].ctpr);
+		openpic_save_IRQ_queue(f, &opp->dst[i].raised);
+		openpic_save_IRQ_queue(f, &opp->dst[i].servicing);
+		qemu_put_buffer(f, (uint8_t *) & opp->dst[i].outputs_active,
+				sizeof(opp->dst[i].outputs_active));
+	}
+
+	for (i = 0; i < MAX_TMR; i++) {
+		qemu_put_be32s(f, &opp->timers[i].tccr);
+		qemu_put_be32s(f, &opp->timers[i].tbcr);
+	}
+
+	for (i = 0; i < opp->max_irq; i++) {
+		qemu_put_be32s(f, &opp->src[i].ivpr);
+		qemu_put_be32s(f, &opp->src[i].idr);
+		qemu_put_be32s(f, &opp->src[i].destmask);
+		qemu_put_sbe32s(f, &opp->src[i].last_cpu);
+		qemu_put_sbe32s(f, &opp->src[i].pending);
+	}
+}
+
+static void openpic_load_IRQ_queue(QEMUFile * f, IRQQueue * q)
+{
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(q->queue); i++) {
+		unsigned long val;
+
+		val = qemu_get_be32(f);
+#if LONG_MAX > 0x7FFFFFFF
+		/* The lower half was saved first; the upper half follows. */
+		val |= (unsigned long)qemu_get_be32(f) << 32;
+#endif
+
+		q->queue[i] = val;
+	}
+
+	qemu_get_sbe32s(f, &q->next);
+	qemu_get_sbe32s(f, &q->priority);
+}
+
+static int openpic_load(QEMUFile * f, void *opaque, int version_id)
+{
+	OpenPICState *opp = (OpenPICState *) opaque;
+	unsigned int i;
+
+	if (version_id != 1) {
+		return -EINVAL;
+	}
+
+	qemu_get_be32s(f, &opp->gcr);
+	qemu_get_be32s(f, &opp->vir);
+	qemu_get_be32s(f, &opp->pir);
+	qemu_get_be32s(f, &opp->spve);
+	qemu_get_be32s(f, &opp->tfrr);
+
+	qemu_get_be32s(f, &opp->nb_cpus);
+
+	for (i = 0; i < opp->nb_cpus; i++) {
+		qemu_get_sbe32s(f, &opp->dst[i].ctpr);
+		openpic_load_IRQ_queue(f, &opp->dst[i].raised);
+		openpic_load_IRQ_queue(f, &opp->dst[i].servicing);
+		qemu_get_buffer(f, (uint8_t *) & opp->dst[i].outputs_active,
+				sizeof(opp->dst[i].outputs_active));
+	}
+
+	for (i = 0; i < MAX_TMR; i++) {
+		qemu_get_be32s(f, &opp->timers[i].tccr);
+		qemu_get_be32s(f, &opp->timers[i].tbcr);
+	}
+
+	for (i = 0; i < opp->max_irq; i++) {
+		uint32_t val;
+
+		val = qemu_get_be32(f);
+		write_IRQreg_idr(opp, i, val);
+		val = qemu_get_be32(f);
+		write_IRQreg_ivpr(opp, i, val);
+
+		qemu_get_be32s(f, &opp->src[i].ivpr);
+		qemu_get_be32s(f, &opp->src[i].idr);
+		qemu_get_be32s(f, &opp->src[i].destmask);
+		qemu_get_sbe32s(f, &opp->src[i].last_cpu);
+		qemu_get_sbe32s(f, &opp->src[i].pending);
+	}
+
+	return 0;
+}
+
+typedef struct MemReg {
+	const char *name;
+	MemoryRegionOps const *ops;
+	hwaddr start_addr;
+	ram_addr_t size;
+} MemReg;
+
+static void fsl_common_init(OpenPICState * opp)
+{
+	int i;
+	int virq = MAX_SRC;
+
+	opp->vid = VID_REVISION_1_2;
+	opp->vir = VIR_GENERIC;
+	opp->vector_mask = 0xFFFF;
+	opp->tfrr_reset = 0;
+	opp->ivpr_reset = IVPR_MASK_MASK;
+	opp->idr_reset = 1 << 0;
+	opp->max_irq = MAX_IRQ;
+
+	opp->irq_ipi0 = virq;
+	virq += MAX_IPI;
+	opp->irq_tim0 = virq;
+	virq += MAX_TMR;
+
+	assert(virq <= MAX_IRQ);
+
+	opp->irq_msi = 224;
+
+	msi_supported = true;
+	for (i = 0; i < opp->fsl->max_ext; i++) {
+		opp->src[i].level = false;
+	}
+
+	/* Internal interrupts, including message and MSI */
+	for (i = 16; i < MAX_SRC; i++) {
+		opp->src[i].type = IRQ_TYPE_FSLINT;
+		opp->src[i].level = true;
+	}
+
+	/* timers and IPIs */
+	for (i = MAX_SRC; i < virq; i++) {
+		opp->src[i].type = IRQ_TYPE_FSLSPECIAL;
+		opp->src[i].level = false;
+	}
+}
+
+static void map_list(OpenPICState * opp, const MemReg * list, int *count)
+{
+	while (list->name) {
+		assert(*count < ARRAY_SIZE(opp->sub_io_mem));
+
+		memory_region_init_io(&opp->sub_io_mem[*count], list->ops, opp,
+				      list->name, list->size);
+
+		memory_region_add_subregion(&opp->mem, list->start_addr,
+					    &opp->sub_io_mem[*count]);
+
+		(*count)++;
+		list++;
+	}
+}
+
+static int openpic_init(SysBusDevice * dev)
+{
+	OpenPICState *opp = FROM_SYSBUS(typeof(*opp), dev);
+	int i, j;
+	int list_count = 0;
+	static const MemReg list_le[] = {
+		{"glb", &openpic_glb_ops_le,
+		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
+		{"tmr", &openpic_tmr_ops_le,
+		 OPENPIC_TMR_REG_START, OPENPIC_TMR_REG_SIZE},
+		{"src", &openpic_src_ops_le,
+		 OPENPIC_SRC_REG_START, OPENPIC_SRC_REG_SIZE},
+		{"cpu", &openpic_cpu_ops_le,
+		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
+		{NULL}
+	};
+	static const MemReg list_be[] = {
+		{"glb", &openpic_glb_ops_be,
+		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
+		{"tmr", &openpic_tmr_ops_be,
+		 OPENPIC_TMR_REG_START, OPENPIC_TMR_REG_SIZE},
+		{"src", &openpic_src_ops_be,
+		 OPENPIC_SRC_REG_START, OPENPIC_SRC_REG_SIZE},
+		{"cpu", &openpic_cpu_ops_be,
+		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
+		{NULL}
+	};
+	static const MemReg list_fsl[] = {
+		{"msi", &openpic_msi_ops_be,
+		 OPENPIC_MSI_REG_START, OPENPIC_MSI_REG_SIZE},
+		{"summary", &openpic_summary_ops_be,
+		 OPENPIC_SUMMARY_REG_START, OPENPIC_SUMMARY_REG_SIZE},
+		{NULL}
+	};
+
+	memory_region_init(&opp->mem, "openpic", 0x40000);
+
+	switch (opp->model) {
+	case OPENPIC_MODEL_FSL_MPIC_20:
+	default:
+		opp->fsl = &fsl_mpic_20;
+		opp->brr1 = 0x00400200;
+		opp->flags |= OPENPIC_FLAG_IDR_CRIT;
+		opp->nb_irqs = 80;
+		opp->mpic_mode_mask = GCR_MODE_MIXED;
+
+		fsl_common_init(opp);
+		map_list(opp, list_be, &list_count);
+		map_list(opp, list_fsl, &list_count);
+
+		break;
+
+	case OPENPIC_MODEL_FSL_MPIC_42:
+		opp->fsl = &fsl_mpic_42;
+		opp->brr1 = 0x00400402;
+		opp->flags |= OPENPIC_FLAG_ILR;
+		opp->nb_irqs = 196;
+		opp->mpic_mode_mask = GCR_MODE_PROXY;
+
+		fsl_common_init(opp);
+		map_list(opp, list_be, &list_count);
+		map_list(opp, list_fsl, &list_count);
+
+		break;
+
+	case OPENPIC_MODEL_RAVEN:
+		opp->nb_irqs = RAVEN_MAX_EXT;
+		opp->vid = VID_REVISION_1_3;
+		opp->vir = VIR_GENERIC;
+		opp->vector_mask = 0xFF;
+		opp->tfrr_reset = 4160000;
+		opp->ivpr_reset = IVPR_MASK_MASK | IVPR_MODE_MASK;
+		opp->idr_reset = 0;
+		opp->max_irq = RAVEN_MAX_IRQ;
+		opp->irq_ipi0 = RAVEN_IPI_IRQ;
+		opp->irq_tim0 = RAVEN_TMR_IRQ;
+		opp->brr1 = -1;
+		opp->mpic_mode_mask = GCR_MODE_MIXED;
+
+		/* Only UP supported today */
+		if (opp->nb_cpus != 1) {
+			return -EINVAL;
+		}
+
+		map_list(opp, list_le, &list_count);
+		break;
+	}
+
+	for (i = 0; i < opp->nb_cpus; i++) {
+		opp->dst[i].irqs = g_new(qemu_irq, OPENPIC_OUTPUT_NB);
+		for (j = 0; j < OPENPIC_OUTPUT_NB; j++) {
+			sysbus_init_irq(dev, &opp->dst[i].irqs[j]);
+		}
+	}
+
+	register_savevm(&opp->busdev.qdev, "openpic", 0, 2,
+			openpic_save, openpic_load, opp);
+
+	sysbus_init_mmio(dev, &opp->mem);
+	qdev_init_gpio_in(&dev->qdev, openpic_set_irq, opp->max_irq);
+
+	return 0;
+}
+
+static Property openpic_properties[] = {
+	DEFINE_PROP_UINT32("model", OpenPICState, model,
+			   OPENPIC_MODEL_FSL_MPIC_20),
+	DEFINE_PROP_UINT32("nb_cpus", OpenPICState, nb_cpus, 1),
+	DEFINE_PROP_END_OF_LIST(),
+};
+
+static void openpic_class_init(ObjectClass * klass, void *data)
+{
+	DeviceClass *dc = DEVICE_CLASS(klass);
+	SysBusDeviceClass *k = SYS_BUS_DEVICE_CLASS(klass);
+
+	k->init = openpic_init;
+	dc->props = openpic_properties;
+	dc->reset = openpic_reset;
+}
+
+static const TypeInfo openpic_info = {
+	.name = "openpic",
+	.parent = TYPE_SYS_BUS_DEVICE,
+	.instance_size = sizeof(OpenPICState),
+	.class_init = openpic_class_init,
+};
+
+static void openpic_register_types(void)
+{
+	type_register_static(&openpic_info);
+}
+
+type_init(openpic_register_types)
-- 
1.7.10.4


* [PATCH v4 2/6] kvm/ppc/mpic: import hw/openpic.c from QEMU
@ 2013-04-13  0:08         ` Scott Wood
  0 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-13  0:08 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus, Scott Wood

This is QEMU's hw/openpic.c from commit
abd8d4a4d6dfea7ddea72f095f993e1de941614e ("Update version for
1.4.0-rc0"), run through Lindent with no other changes to ease merging
future changes between Linux and QEMU.  Remaining style issues
(including those introduced by Lindent) will be fixed in a later patch.

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
 arch/powerpc/kvm/mpic.c | 1686 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 1686 insertions(+)
 create mode 100644 arch/powerpc/kvm/mpic.c

diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
new file mode 100644
index 0000000..57655b9
--- /dev/null
+++ b/arch/powerpc/kvm/mpic.c
@@ -0,0 +1,1686 @@
+/*
+ * OpenPIC emulation
+ *
+ * Copyright (c) 2004 Jocelyn Mayer
+ *               2011 Alexander Graf
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+/*
+ *
+ * Based on OpenPic implementations:
+ * - Intel GW80314 I/O companion chip developer's manual
+ * - Motorola MPC8245 & MPC8540 user manuals.
+ * - Motorola MCP750 (aka Raven) programmer manual.
+ * - Motorola Harrier programmer manuel
+ *
+ * Serial interrupts, as implemented in Raven chipset are not supported yet.
+ *
+ */
+#include "hw.h"
+#include "ppc/mac.h"
+#include "pci/pci.h"
+#include "openpic.h"
+#include "sysbus.h"
+#include "pci/msi.h"
+#include "qemu/bitops.h"
+#include "ppc.h"
+
+//#define DEBUG_OPENPIC
+
+#ifdef DEBUG_OPENPIC
+static const int debug_openpic = 1;
+#else
+static const int debug_openpic = 0;
+#endif
+
+#define DPRINTF(fmt, ...) do { \
+        if (debug_openpic) { \
+            printf(fmt , ## __VA_ARGS__); \
+        } \
+    } while (0)
+
+#define MAX_CPU     32
+#define MAX_SRC     256
+#define MAX_TMR     4
+#define MAX_IPI     4
+#define MAX_MSI     8
+#define MAX_IRQ     (MAX_SRC + MAX_IPI + MAX_TMR)
+#define VID         0x03	/* MPIC version ID */
+
+/* OpenPIC capability flags */
+#define OPENPIC_FLAG_IDR_CRIT     (1 << 0)
+#define OPENPIC_FLAG_ILR          (2 << 0)
+
+/* OpenPIC address map */
+#define OPENPIC_GLB_REG_START        0x0
+#define OPENPIC_GLB_REG_SIZE         0x10F0
+#define OPENPIC_TMR_REG_START        0x10F0
+#define OPENPIC_TMR_REG_SIZE         0x220
+#define OPENPIC_MSI_REG_START        0x1600
+#define OPENPIC_MSI_REG_SIZE         0x200
+#define OPENPIC_SUMMARY_REG_START   0x3800
+#define OPENPIC_SUMMARY_REG_SIZE    0x800
+#define OPENPIC_SRC_REG_START        0x10000
+#define OPENPIC_SRC_REG_SIZE         (MAX_SRC * 0x20)
+#define OPENPIC_CPU_REG_START        0x20000
+#define OPENPIC_CPU_REG_SIZE         0x100 + ((MAX_CPU - 1) * 0x1000)
+
+/* Raven */
+#define RAVEN_MAX_CPU      2
+#define RAVEN_MAX_EXT     48
+#define RAVEN_MAX_IRQ     64
+#define RAVEN_MAX_TMR      MAX_TMR
+#define RAVEN_MAX_IPI      MAX_IPI
+
+/* Interrupt definitions */
+#define RAVEN_FE_IRQ     (RAVEN_MAX_EXT)	/* Internal functional IRQ */
+#define RAVEN_ERR_IRQ    (RAVEN_MAX_EXT + 1)	/* Error IRQ */
+#define RAVEN_TMR_IRQ    (RAVEN_MAX_EXT + 2)	/* First timer IRQ */
+#define RAVEN_IPI_IRQ    (RAVEN_TMR_IRQ + RAVEN_MAX_TMR)	/* First IPI IRQ */
+/* First doorbell IRQ */
+#define RAVEN_DBL_IRQ    (RAVEN_IPI_IRQ + (RAVEN_MAX_CPU * RAVEN_MAX_IPI))
+
+typedef struct FslMpicInfo {
+	int max_ext;
+} FslMpicInfo;
+
+static FslMpicInfo fsl_mpic_20 = {
+	.max_ext = 12,
+};
+
+static FslMpicInfo fsl_mpic_42 = {
+	.max_ext = 12,
+};
+
+#define FRR_NIRQ_SHIFT    16
+#define FRR_NCPU_SHIFT     8
+#define FRR_VID_SHIFT      0
+
+#define VID_REVISION_1_2   2
+#define VID_REVISION_1_3   3
+
+#define VIR_GENERIC      0x00000000	/* Generic Vendor ID */
+
+#define GCR_RESET        0x80000000
+#define GCR_MODE_PASS    0x00000000
+#define GCR_MODE_MIXED   0x20000000
+#define GCR_MODE_PROXY   0x60000000
+
+#define TBCR_CI           0x80000000	/* count inhibit */
+#define TCCR_TOG          0x80000000	/* toggles when decrement to zero */
+
+#define IDR_EP_SHIFT      31
+#define IDR_EP_MASK       (1 << IDR_EP_SHIFT)
+#define IDR_CI0_SHIFT     30
+#define IDR_CI1_SHIFT     29
+#define IDR_P1_SHIFT      1
+#define IDR_P0_SHIFT      0
+
+#define ILR_INTTGT_MASK   0x000000ff
+#define ILR_INTTGT_INT    0x00
+#define ILR_INTTGT_CINT   0x01	/* critical */
+#define ILR_INTTGT_MCP    0x02	/* machine check */
+
+/* The currently supported INTTGT values happen to be the same as QEMU's
+ * openpic output codes, but don't depend on this.  The output codes
+ * could change (unlikely, but...) or support could be added for
+ * more INTTGT values.
+ */
+static const int inttgt_output[][2] = {
+	{ILR_INTTGT_INT, OPENPIC_OUTPUT_INT},
+	{ILR_INTTGT_CINT, OPENPIC_OUTPUT_CINT},
+	{ILR_INTTGT_MCP, OPENPIC_OUTPUT_MCK},
+};
+
+static int inttgt_to_output(int inttgt)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(inttgt_output); i++) {
+		if (inttgt_output[i][0] == inttgt) {
+			return inttgt_output[i][1];
+		}
+	}
+
+	fprintf(stderr, "%s: unsupported inttgt %d\n", __func__, inttgt);
+	return OPENPIC_OUTPUT_INT;
+}
+
+static int output_to_inttgt(int output)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(inttgt_output); i++) {
+		if (inttgt_output[i][1] == output) {
+			return inttgt_output[i][0];
+		}
+	}
+
+	abort();
+}
+
+#define MSIIR_OFFSET       0x140
+#define MSIIR_SRS_SHIFT    29
+#define MSIIR_SRS_MASK     (0x7 << MSIIR_SRS_SHIFT)
+#define MSIIR_IBS_SHIFT    24
+#define MSIIR_IBS_MASK     (0x1f << MSIIR_IBS_SHIFT)
+
+static int get_current_cpu(void)
+{
+	CPUState *cpu_single_cpu;
+
+	if (!cpu_single_env) {
+		return -1;
+	}
+
+	cpu_single_cpu = ENV_GET_CPU(cpu_single_env);
+	return cpu_single_cpu->cpu_index;
+}
+
+static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx);
+static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
+				       uint32_t val, int idx);
+
+typedef enum IRQType {
+	IRQ_TYPE_NORMAL = 0,
+	IRQ_TYPE_FSLINT,	/* FSL internal interrupt -- level only */
+	IRQ_TYPE_FSLSPECIAL,	/* FSL timer/IPI interrupt, edge, no polarity */
+} IRQType;
+
+typedef struct IRQQueue {
+	/* Round up to the nearest 64 IRQs so that the queue length
+	 * won't change when moving between 32 and 64 bit hosts.
+	 */
+	unsigned long queue[BITS_TO_LONGS((MAX_IRQ + 63) & ~63)];
+	int next;
+	int priority;
+} IRQQueue;
+
+typedef struct IRQSource {
+	uint32_t ivpr;		/* IRQ vector/priority register */
+	uint32_t idr;		/* IRQ destination register */
+	uint32_t destmask;	/* bitmap of CPU destinations */
+	int last_cpu;
+	int output;		/* IRQ level, e.g. OPENPIC_OUTPUT_INT */
+	int pending;		/* TRUE if IRQ is pending */
+	IRQType type;
+	bool level:1;		/* level-triggered */
+	bool nomask:1;		/* critical interrupts ignore mask on some FSL MPICs */
+} IRQSource;
+
+#define IVPR_MASK_SHIFT       31
+#define IVPR_MASK_MASK        (1 << IVPR_MASK_SHIFT)
+#define IVPR_ACTIVITY_SHIFT   30
+#define IVPR_ACTIVITY_MASK    (1 << IVPR_ACTIVITY_SHIFT)
+#define IVPR_MODE_SHIFT       29
+#define IVPR_MODE_MASK        (1 << IVPR_MODE_SHIFT)
+#define IVPR_POLARITY_SHIFT   23
+#define IVPR_POLARITY_MASK    (1 << IVPR_POLARITY_SHIFT)
+#define IVPR_SENSE_SHIFT      22
+#define IVPR_SENSE_MASK       (1 << IVPR_SENSE_SHIFT)
+
+#define IVPR_PRIORITY_MASK     (0xF << 16)
+#define IVPR_PRIORITY(_ivprr_) ((int)(((_ivprr_) & IVPR_PRIORITY_MASK) >> 16))
+#define IVPR_VECTOR(opp, _ivprr_) ((_ivprr_) & (opp)->vector_mask)
+
+/* IDR[EP/CI] are only for FSL MPIC prior to v4.0 */
+#define IDR_EP      0x80000000	/* external pin */
+#define IDR_CI      0x40000000	/* critical interrupt */
+
+typedef struct IRQDest {
+	int32_t ctpr;		/* CPU current task priority */
+	IRQQueue raised;
+	IRQQueue servicing;
+	qemu_irq *irqs;
+
+	/* Count of IRQ sources asserting on non-INT outputs */
+	uint32_t outputs_active[OPENPIC_OUTPUT_NB];
+} IRQDest;
+
+typedef struct OpenPICState {
+	SysBusDevice busdev;
+	MemoryRegion mem;
+
+	/* Behavior control */
+	FslMpicInfo *fsl;
+	uint32_t model;
+	uint32_t flags;
+	uint32_t nb_irqs;
+	uint32_t vid;
+	uint32_t vir;		/* Vendor identification register */
+	uint32_t vector_mask;
+	uint32_t tfrr_reset;
+	uint32_t ivpr_reset;
+	uint32_t idr_reset;
+	uint32_t brr1;
+	uint32_t mpic_mode_mask;
+
+	/* Sub-regions */
+	MemoryRegion sub_io_mem[6];
+
+	/* Global registers */
+	uint32_t frr;		/* Feature reporting register */
+	uint32_t gcr;		/* Global configuration register  */
+	uint32_t pir;		/* Processor initialization register */
+	uint32_t spve;		/* Spurious vector register */
+	uint32_t tfrr;		/* Timer frequency reporting register */
+	/* Source registers */
+	IRQSource src[MAX_IRQ];
+	/* Local registers per output pin */
+	IRQDest dst[MAX_CPU];
+	uint32_t nb_cpus;
+	/* Timer registers */
+	struct {
+		uint32_t tccr;	/* Global timer current count register */
+		uint32_t tbcr;	/* Global timer base count register */
+	} timers[MAX_TMR];
+	/* Shared MSI registers */
+	struct {
+		uint32_t msir;	/* Shared Message Signaled Interrupt Register */
+	} msi[MAX_MSI];
+	uint32_t max_irq;
+	uint32_t irq_ipi0;
+	uint32_t irq_tim0;
+	uint32_t irq_msi;
+} OpenPICState;
+
+static inline void IRQ_setbit(IRQQueue * q, int n_IRQ)
+{
+	set_bit(n_IRQ, q->queue);
+}
+
+static inline void IRQ_resetbit(IRQQueue * q, int n_IRQ)
+{
+	clear_bit(n_IRQ, q->queue);
+}
+
+static inline int IRQ_testbit(IRQQueue * q, int n_IRQ)
+{
+	return test_bit(n_IRQ, q->queue);
+}
+
+static void IRQ_check(OpenPICState * opp, IRQQueue * q)
+{
+	int irq = -1;
+	int next = -1;
+	int priority = -1;
+
+	for (;;) {
+		irq = find_next_bit(q->queue, opp->max_irq, irq + 1);
+		if (irq == opp->max_irq) {
+			break;
+		}
+
+		DPRINTF("IRQ_check: irq %d set ivpr_pr=%d pr=%d\n",
+			irq, IVPR_PRIORITY(opp->src[irq].ivpr), priority);
+
+		if (IVPR_PRIORITY(opp->src[irq].ivpr) > priority) {
+			next = irq;
+			priority = IVPR_PRIORITY(opp->src[irq].ivpr);
+		}
+	}
+
+	q->next = next;
+	q->priority = priority;
+}
+
+static int IRQ_get_next(OpenPICState * opp, IRQQueue * q)
+{
+	/* XXX: optimize */
+	IRQ_check(opp, q);
+
+	return q->next;
+}
+
+static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
+			   bool active, bool was_active)
+{
+	IRQDest *dst;
+	IRQSource *src;
+	int priority;
+
+	dst = &opp->dst[n_CPU];
+	src = &opp->src[n_IRQ];
+
+	DPRINTF("%s: IRQ %d active %d was %d\n",
+		__func__, n_IRQ, active, was_active);
+
+	if (src->output != OPENPIC_OUTPUT_INT) {
+		DPRINTF("%s: output %d irq %d active %d was %d count %d\n",
+			__func__, src->output, n_IRQ, active, was_active,
+			dst->outputs_active[src->output]);
+
+		/* On Freescale MPIC, critical interrupts ignore priority,
+		 * IACK, EOI, etc.  Before MPIC v4.1 they also ignore
+		 * masking.
+		 */
+		if (active) {
+			if (!was_active
+			    && dst->outputs_active[src->output]++ == 0) {
+				DPRINTF
+				    ("%s: Raise OpenPIC output %d cpu %d irq %d\n",
+				     __func__, src->output, n_CPU, n_IRQ);
+				qemu_irq_raise(dst->irqs[src->output]);
+			}
+		} else {
+			if (was_active
+			    && --dst->outputs_active[src->output] == 0) {
+				DPRINTF
+				    ("%s: Lower OpenPIC output %d cpu %d irq %d\n",
+				     __func__, src->output, n_CPU, n_IRQ);
+				qemu_irq_lower(dst->irqs[src->output]);
+			}
+		}
+
+		return;
+	}
+
+	priority = IVPR_PRIORITY(src->ivpr);
+
+	/* Even if the interrupt doesn't have enough priority,
+	 * it is still raised, in case ctpr is lowered later.
+	 */
+	if (active) {
+		IRQ_setbit(&dst->raised, n_IRQ);
+	} else {
+		IRQ_resetbit(&dst->raised, n_IRQ);
+	}
+
+	IRQ_check(opp, &dst->raised);
+
+	if (active && priority <= dst->ctpr) {
+		DPRINTF
+		    ("%s: IRQ %d priority %d too low for ctpr %d on CPU %d\n",
+		     __func__, n_IRQ, priority, dst->ctpr, n_CPU);
+		active = 0;
+	}
+
+	if (active) {
+		if (IRQ_get_next(opp, &dst->servicing) >= 0 &&
+		    priority <= dst->servicing.priority) {
+			DPRINTF
+			    ("%s: IRQ %d is hidden by servicing IRQ %d on CPU %d\n",
+			     __func__, n_IRQ, dst->servicing.next, n_CPU);
+		} else {
+			DPRINTF
+			    ("%s: Raise OpenPIC INT output cpu %d irq %d/%d\n",
+			     __func__, n_CPU, n_IRQ, dst->raised.next);
+			qemu_irq_raise(opp->dst[n_CPU].
+				       irqs[OPENPIC_OUTPUT_INT]);
+		}
+	} else {
+		IRQ_get_next(opp, &dst->servicing);
+		if (dst->raised.priority > dst->ctpr &&
+		    dst->raised.priority > dst->servicing.priority) {
+			DPRINTF
+			    ("%s: IRQ %d inactive, IRQ %d prio %d above %d/%d, CPU %d\n",
+			     __func__, n_IRQ, dst->raised.next,
+			     dst->raised.priority, dst->ctpr,
+			     dst->servicing.priority, n_CPU);
+			/* IRQ line stays asserted */
+		} else {
+			DPRINTF
+			    ("%s: IRQ %d inactive, current prio %d/%d, CPU %d\n",
+			     __func__, n_IRQ, dst->ctpr,
+			     dst->servicing.priority, n_CPU);
+			qemu_irq_lower(opp->dst[n_CPU].
+				       irqs[OPENPIC_OUTPUT_INT]);
+		}
+	}
+}
+
+/* update pic state because registers for n_IRQ have changed value */
+static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
+{
+	IRQSource *src;
+	bool active, was_active;
+	int i;
+
+	src = &opp->src[n_IRQ];
+	active = src->pending;
+
+	if ((src->ivpr & IVPR_MASK_MASK) && !src->nomask) {
+		/* Interrupt source is disabled */
+		DPRINTF("%s: IRQ %d is disabled\n", __func__, n_IRQ);
+		active = false;
+	}
+
+	was_active = ! !(src->ivpr & IVPR_ACTIVITY_MASK);
+
+	/*
+	 * We don't have a similar check for already-active because
+	 * ctpr may have changed and we need to withdraw the interrupt.
+	 */
+	if (!active && !was_active) {
+		DPRINTF("%s: IRQ %d is already inactive\n", __func__, n_IRQ);
+		return;
+	}
+
+	if (active) {
+		src->ivpr |= IVPR_ACTIVITY_MASK;
+	} else {
+		src->ivpr &= ~IVPR_ACTIVITY_MASK;
+	}
+
+	if (src->destmask == 0) {
+		/* No target */
+		DPRINTF("%s: IRQ %d has no target\n", __func__, n_IRQ);
+		return;
+	}
+
+	if (src->destmask == (1 << src->last_cpu)) {
+		/* Only one CPU is allowed to receive this IRQ */
+		IRQ_local_pipe(opp, src->last_cpu, n_IRQ, active, was_active);
+	} else if (!(src->ivpr & IVPR_MODE_MASK)) {
+		/* Directed delivery mode */
+		for (i = 0; i < opp->nb_cpus; i++) {
+			if (src->destmask & (1 << i)) {
+				IRQ_local_pipe(opp, i, n_IRQ, active,
+					       was_active);
+			}
+		}
+	} else {
+		/* Distributed delivery mode */
+		for (i = src->last_cpu + 1; i != src->last_cpu; i++) {
+			if (i == opp->nb_cpus) {
+				i = 0;
+			}
+			if (src->destmask & (1 << i)) {
+				IRQ_local_pipe(opp, i, n_IRQ, active,
+					       was_active);
+				src->last_cpu = i;
+				break;
+			}
+		}
+	}
+}
+
+static void openpic_set_irq(void *opaque, int n_IRQ, int level)
+{
+	OpenPICState *opp = opaque;
+	IRQSource *src;
+
+	if (n_IRQ >= MAX_IRQ) {
+		fprintf(stderr, "%s: IRQ %d out of range\n", __func__, n_IRQ);
+		abort();
+	}
+
+	src = &opp->src[n_IRQ];
+	DPRINTF("openpic: set irq %d = %d ivpr=0x%08x\n",
+		n_IRQ, level, src->ivpr);
+	if (src->level) {
+		/* level-sensitive irq */
+		src->pending = level;
+		openpic_update_irq(opp, n_IRQ);
+	} else {
+		/* edge-sensitive irq */
+		if (level) {
+			src->pending = 1;
+			openpic_update_irq(opp, n_IRQ);
+		}
+
+		if (src->output != OPENPIC_OUTPUT_INT) {
+			/* Edge-triggered interrupts shouldn't be used
+			 * with non-INT delivery, but just in case,
+			 * try to make it do something sane rather than
+			 * cause an interrupt storm.  This is close to
+			 * what you'd probably see happen in real hardware.
+			 */
+			src->pending = 0;
+			openpic_update_irq(opp, n_IRQ);
+		}
+	}
+}
+
+static void openpic_reset(DeviceState * d)
+{
+	OpenPICState *opp = FROM_SYSBUS(typeof(*opp), SYS_BUS_DEVICE(d));
+	int i;
+
+	opp->gcr = GCR_RESET;
+	/* Initialise controller registers */
+	opp->frr = ((opp->nb_irqs - 1) << FRR_NIRQ_SHIFT) |
+	    ((opp->nb_cpus - 1) << FRR_NCPU_SHIFT) |
+	    (opp->vid << FRR_VID_SHIFT);
+
+	opp->pir = 0;
+	opp->spve = -1 & opp->vector_mask;
+	opp->tfrr = opp->tfrr_reset;
+	/* Initialise IRQ sources */
+	for (i = 0; i < opp->max_irq; i++) {
+		opp->src[i].ivpr = opp->ivpr_reset;
+		opp->src[i].idr = opp->idr_reset;
+
+		switch (opp->src[i].type) {
+		case IRQ_TYPE_NORMAL:
+			opp->src[i].level =
+			    ! !(opp->ivpr_reset & IVPR_SENSE_MASK);
+			break;
+
+		case IRQ_TYPE_FSLINT:
+			opp->src[i].ivpr |= IVPR_POLARITY_MASK;
+			break;
+
+		case IRQ_TYPE_FSLSPECIAL:
+			break;
+		}
+	}
+	/* Initialise IRQ destinations */
+	for (i = 0; i < MAX_CPU; i++) {
+		opp->dst[i].ctpr = 15;
+		memset(&opp->dst[i].raised, 0, sizeof(IRQQueue));
+		opp->dst[i].raised.next = -1;
+		memset(&opp->dst[i].servicing, 0, sizeof(IRQQueue));
+		opp->dst[i].servicing.next = -1;
+	}
+	/* Initialise timers */
+	for (i = 0; i < MAX_TMR; i++) {
+		opp->timers[i].tccr = 0;
+		opp->timers[i].tbcr = TBCR_CI;
+	}
+	/* Go out of RESET state */
+	opp->gcr = 0;
+}
+
+static inline uint32_t read_IRQreg_idr(OpenPICState * opp, int n_IRQ)
+{
+	return opp->src[n_IRQ].idr;
+}
+
+static inline uint32_t read_IRQreg_ilr(OpenPICState * opp, int n_IRQ)
+{
+	if (opp->flags & OPENPIC_FLAG_ILR) {
+		return output_to_inttgt(opp->src[n_IRQ].output);
+	}
+
+	return 0xffffffff;
+}
+
+static inline uint32_t read_IRQreg_ivpr(OpenPICState * opp, int n_IRQ)
+{
+	return opp->src[n_IRQ].ivpr;
+}
+
+static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
+{
+	IRQSource *src = &opp->src[n_IRQ];
+	uint32_t normal_mask = (1UL << opp->nb_cpus) - 1;
+	uint32_t crit_mask = 0;
+	uint32_t mask = normal_mask;
+	int crit_shift = IDR_EP_SHIFT - opp->nb_cpus;
+	int i;
+
+	if (opp->flags & OPENPIC_FLAG_IDR_CRIT) {
+		crit_mask = mask << crit_shift;
+		mask |= crit_mask | IDR_EP;
+	}
+
+	src->idr = val & mask;
+	DPRINTF("Set IDR %d to 0x%08x\n", n_IRQ, src->idr);
+
+	if (opp->flags & OPENPIC_FLAG_IDR_CRIT) {
+		if (src->idr & crit_mask) {
+			if (src->idr & normal_mask) {
+				DPRINTF
+				    ("%s: IRQ configured for multiple output types, using "
+				     "critical\n", __func__);
+			}
+
+			src->output = OPENPIC_OUTPUT_CINT;
+			src->nomask = true;
+			src->destmask = 0;
+
+			for (i = 0; i < opp->nb_cpus; i++) {
+				int n_ci = IDR_CI0_SHIFT - i;
+
+				if (src->idr & (1UL << n_ci)) {
+					src->destmask |= 1UL << i;
+				}
+			}
+		} else {
+			src->output = OPENPIC_OUTPUT_INT;
+			src->nomask = false;
+			src->destmask = src->idr & normal_mask;
+		}
+	} else {
+		src->destmask = src->idr;
+	}
+}
+
+static inline void write_IRQreg_ilr(OpenPICState * opp, int n_IRQ, uint32_t val)
+{
+	if (opp->flags & OPENPIC_FLAG_ILR) {
+		IRQSource *src = &opp->src[n_IRQ];
+
+		src->output = inttgt_to_output(val & ILR_INTTGT_MASK);
+		DPRINTF("Set ILR %d to 0x%08x, output %d\n", n_IRQ, src->idr,
+			src->output);
+
+		/* TODO: on MPIC v4.0 only, set nomask for non-INT */
+	}
+}
+
+static inline void write_IRQreg_ivpr(OpenPICState * opp, int n_IRQ,
+				     uint32_t val)
+{
+	uint32_t mask;
+
+	/* NOTE when implementing newer FSL MPIC models: starting with v4.0,
+	 * the polarity bit is read-only on internal interrupts.
+	 */
+	mask = IVPR_MASK_MASK | IVPR_PRIORITY_MASK | IVPR_SENSE_MASK |
+	    IVPR_POLARITY_MASK | opp->vector_mask;
+
+	/* ACTIVITY bit is read-only */
+	opp->src[n_IRQ].ivpr =
+	    (opp->src[n_IRQ].ivpr & IVPR_ACTIVITY_MASK) | (val & mask);
+
+	/* For FSL internal interrupts, The sense bit is reserved and zero,
+	 * and the interrupt is always level-triggered.  Timers and IPIs
+	 * have no sense or polarity bits, and are edge-triggered.
+	 */
+	switch (opp->src[n_IRQ].type) {
+	case IRQ_TYPE_NORMAL:
+		opp->src[n_IRQ].level =
+		    ! !(opp->src[n_IRQ].ivpr & IVPR_SENSE_MASK);
+		break;
+
+	case IRQ_TYPE_FSLINT:
+		opp->src[n_IRQ].ivpr &= ~IVPR_SENSE_MASK;
+		break;
+
+	case IRQ_TYPE_FSLSPECIAL:
+		opp->src[n_IRQ].ivpr &= ~(IVPR_POLARITY_MASK | IVPR_SENSE_MASK);
+		break;
+	}
+
+	openpic_update_irq(opp, n_IRQ);
+	DPRINTF("Set IVPR %d to 0x%08x -> 0x%08x\n", n_IRQ, val,
+		opp->src[n_IRQ].ivpr);
+}
+
+static void openpic_gcr_write(OpenPICState * opp, uint64_t val)
+{
+	bool mpic_proxy = false;
+
+	if (val & GCR_RESET) {
+		openpic_reset(&opp->busdev.qdev);
+		return;
+	}
+
+	opp->gcr &= ~opp->mpic_mode_mask;
+	opp->gcr |= val & opp->mpic_mode_mask;
+
+	/* Set external proxy mode */
+	if ((val & opp->mpic_mode_mask) == GCR_MODE_PROXY) {
+		mpic_proxy = true;
+	}
+
+	ppce500_set_mpic_proxy(mpic_proxy);
+}
+
+static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned len)
+{
+	OpenPICState *opp = opaque;
+	IRQDest *dst;
+	int idx;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+		__func__, addr, val);
+	if (addr & 0xF) {
+		return;
+	}
+	switch (addr) {
+	case 0x00:		/* Block Revision Register1 (BRR1) is Readonly */
+		break;
+	case 0x40:
+	case 0x50:
+	case 0x60:
+	case 0x70:
+	case 0x80:
+	case 0x90:
+	case 0xA0:
+	case 0xB0:
+		openpic_cpu_write_internal(opp, addr, val, get_current_cpu());
+		break;
+	case 0x1000:		/* FRR */
+		break;
+	case 0x1020:		/* GCR */
+		openpic_gcr_write(opp, val);
+		break;
+	case 0x1080:		/* VIR */
+		break;
+	case 0x1090:		/* PIR */
+		for (idx = 0; idx < opp->nb_cpus; idx++) {
+			if ((val & (1 << idx)) && !(opp->pir & (1 << idx))) {
+				DPRINTF
+				    ("Raise OpenPIC RESET output for CPU %d\n",
+				     idx);
+				dst = &opp->dst[idx];
+				qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_RESET]);
+			} else if (!(val & (1 << idx))
+				   && (opp->pir & (1 << idx))) {
+				DPRINTF
+				    ("Lower OpenPIC RESET output for CPU %d\n",
+				     idx);
+				dst = &opp->dst[idx];
+				qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_RESET]);
+			}
+		}
+		opp->pir = val;
+		break;
+	case 0x10A0:		/* IPI_IVPR */
+	case 0x10B0:
+	case 0x10C0:
+	case 0x10D0:
+		{
+			int idx;
+			idx = (addr - 0x10A0) >> 4;
+			write_IRQreg_ivpr(opp, opp->irq_ipi0 + idx, val);
+		}
+		break;
+	case 0x10E0:		/* SPVE */
+		opp->spve = val & opp->vector_mask;
+		break;
+	default:
+		break;
+	}
+}
+
+static uint64_t openpic_gbl_read(void *opaque, hwaddr addr, unsigned len)
+{
+	OpenPICState *opp = opaque;
+	uint32_t retval;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	retval = 0xFFFFFFFF;
+	if (addr & 0xF) {
+		return retval;
+	}
+	switch (addr) {
+	case 0x1000:		/* FRR */
+		retval = opp->frr;
+		break;
+	case 0x1020:		/* GCR */
+		retval = opp->gcr;
+		break;
+	case 0x1080:		/* VIR */
+		retval = opp->vir;
+		break;
+	case 0x1090:		/* PIR */
+		retval = 0x00000000;
+		break;
+	case 0x00:		/* Block Revision Register1 (BRR1) */
+		retval = opp->brr1;
+		break;
+	case 0x40:
+	case 0x50:
+	case 0x60:
+	case 0x70:
+	case 0x80:
+	case 0x90:
+	case 0xA0:
+	case 0xB0:
+		retval =
+		    openpic_cpu_read_internal(opp, addr, get_current_cpu());
+		break;
+	case 0x10A0:		/* IPI_IVPR */
+	case 0x10B0:
+	case 0x10C0:
+	case 0x10D0:
+		{
+			int idx;
+			idx = (addr - 0x10A0) >> 4;
+			retval = read_IRQreg_ivpr(opp, opp->irq_ipi0 + idx);
+		}
+		break;
+	case 0x10E0:		/* SPVE */
+		retval = opp->spve;
+		break;
+	default:
+		break;
+	}
+	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+
+	return retval;
+}
+
+static void openpic_tmr_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned len)
+{
+	OpenPICState *opp = opaque;
+	int idx;
+
+	addr += 0x10f0;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+		__func__, addr, val);
+	if (addr & 0xF) {
+		return;
+	}
+
+	if (addr == 0x10f0) {
+		/* TFRR */
+		opp->tfrr = val;
+		return;
+	}
+
+	idx = (addr >> 6) & 0x3;
+	addr = addr & 0x30;
+
+	switch (addr & 0x30) {
+	case 0x00:		/* TCCR */
+		break;
+	case 0x10:		/* TBCR */
+		if ((opp->timers[idx].tccr & TCCR_TOG) != 0 &&
+		    (val & TBCR_CI) == 0 &&
+		    (opp->timers[idx].tbcr & TBCR_CI) != 0) {
+			opp->timers[idx].tccr &= ~TCCR_TOG;
+		}
+		opp->timers[idx].tbcr = val;
+		break;
+	case 0x20:		/* TVPR */
+		write_IRQreg_ivpr(opp, opp->irq_tim0 + idx, val);
+		break;
+	case 0x30:		/* TDR */
+		write_IRQreg_idr(opp, opp->irq_tim0 + idx, val);
+		break;
+	}
+}
+
+static uint64_t openpic_tmr_read(void *opaque, hwaddr addr, unsigned len)
+{
+	OpenPICState *opp = opaque;
+	uint32_t retval = -1;
+	int idx;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	if (addr & 0xF) {
+		goto out;
+	}
+	idx = (addr >> 6) & 0x3;
+	if (addr == 0x0) {
+		/* TFRR */
+		retval = opp->tfrr;
+		goto out;
+	}
+	switch (addr & 0x30) {
+	case 0x00:		/* TCCR */
+		retval = opp->timers[idx].tccr;
+		break;
+	case 0x10:		/* TBCR */
+		retval = opp->timers[idx].tbcr;
+		break;
+	case 0x20:		/* TIPV */
+		retval = read_IRQreg_ivpr(opp, opp->irq_tim0 + idx);
+		break;
+	case 0x30:		/* TIDE (TIDR) */
+		retval = read_IRQreg_idr(opp, opp->irq_tim0 + idx);
+		break;
+	}
+
+out:
+	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+
+	return retval;
+}
+
+static void openpic_src_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned len)
+{
+	OpenPICState *opp = opaque;
+	int idx;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+		__func__, addr, val);
+
+	addr = addr & 0xffff;
+	idx = addr >> 5;
+
+	switch (addr & 0x1f) {
+	case 0x00:
+		write_IRQreg_ivpr(opp, idx, val);
+		break;
+	case 0x10:
+		write_IRQreg_idr(opp, idx, val);
+		break;
+	case 0x18:
+		write_IRQreg_ilr(opp, idx, val);
+		break;
+	}
+}
+
+static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
+{
+	OpenPICState *opp = opaque;
+	uint32_t retval;
+	int idx;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	retval = 0xFFFFFFFF;
+
+	addr = addr & 0xffff;
+	idx = addr >> 5;
+
+	switch (addr & 0x1f) {
+	case 0x00:
+		retval = read_IRQreg_ivpr(opp, idx);
+		break;
+	case 0x10:
+		retval = read_IRQreg_idr(opp, idx);
+		break;
+	case 0x18:
+		retval = read_IRQreg_ilr(opp, idx);
+		break;
+	}
+
+	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	return retval;
+}
+
+static void openpic_msi_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned size)
+{
+	OpenPICState *opp = opaque;
+	int idx = opp->irq_msi;
+	int srs, ibs;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
+		__func__, addr, val);
+	if (addr & 0xF) {
+		return;
+	}
+
+	switch (addr) {
+	case MSIIR_OFFSET:
+		srs = val >> MSIIR_SRS_SHIFT;
+		idx += srs;
+		ibs = (val & MSIIR_IBS_MASK) >> MSIIR_IBS_SHIFT;
+		opp->msi[srs].msir |= 1 << ibs;
+		openpic_set_irq(opp, idx, 1);
+		break;
+	default:
+		/* most registers are read-only, thus ignored */
+		break;
+	}
+}
+
+static uint64_t openpic_msi_read(void *opaque, hwaddr addr, unsigned size)
+{
+	OpenPICState *opp = opaque;
+	uint64_t r = 0;
+	int i, srs;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	if (addr & 0xF) {
+		return -1;
+	}
+
+	srs = addr >> 4;
+
+	switch (addr) {
+	case 0x00:
+	case 0x10:
+	case 0x20:
+	case 0x30:
+	case 0x40:
+	case 0x50:
+	case 0x60:
+	case 0x70:		/* MSIRs */
+		r = opp->msi[srs].msir;
+		/* Clear on read */
+		opp->msi[srs].msir = 0;
+		openpic_set_irq(opp, opp->irq_msi + srs, 0);
+		break;
+	case 0x120:		/* MSISR */
+		for (i = 0; i < MAX_MSI; i++) {
+			r |= (opp->msi[i].msir ? 1 : 0) << i;
+		}
+		break;
+	}
+
+	return r;
+}
+
+static uint64_t openpic_summary_read(void *opaque, hwaddr addr, unsigned size)
+{
+	uint64_t r = 0;
+
+	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+
+	/* TODO: EISR/EIMR */
+
+	return r;
+}
+
+static void openpic_summary_write(void *opaque, hwaddr addr, uint64_t val,
+				  unsigned size)
+{
+	DPRINTF("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
+		__func__, addr, val);
+
+	/* TODO: EISR/EIMR */
+}
+
+static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
+				       uint32_t val, int idx)
+{
+	OpenPICState *opp = opaque;
+	IRQSource *src;
+	IRQDest *dst;
+	int s_IRQ, n_IRQ;
+
+	DPRINTF("%s: cpu %d addr %#" HWADDR_PRIx " <= 0x%08x\n", __func__, idx,
+		addr, val);
+
+	if (idx < 0) {
+		return;
+	}
+
+	if (addr & 0xF) {
+		return;
+	}
+	dst = &opp->dst[idx];
+	addr &= 0xFF0;
+	switch (addr) {
+	case 0x40:		/* IPIDR */
+	case 0x50:
+	case 0x60:
+	case 0x70:
+		idx = (addr - 0x40) >> 4;
+		/* we use IDE as mask which CPUs to deliver the IPI to still. */
+		opp->src[opp->irq_ipi0 + idx].destmask |= val;
+		openpic_set_irq(opp, opp->irq_ipi0 + idx, 1);
+		openpic_set_irq(opp, opp->irq_ipi0 + idx, 0);
+		break;
+	case 0x80:		/* CTPR */
+		dst->ctpr = val & 0x0000000F;
+
+		DPRINTF("%s: set CPU %d ctpr to %d, raised %d servicing %d\n",
+			__func__, idx, dst->ctpr, dst->raised.priority,
+			dst->servicing.priority);
+
+		if (dst->raised.priority <= dst->ctpr) {
+			DPRINTF
+			    ("%s: Lower OpenPIC INT output cpu %d due to ctpr\n",
+			     __func__, idx);
+			qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
+		} else if (dst->raised.priority > dst->servicing.priority) {
+			DPRINTF("%s: Raise OpenPIC INT output cpu %d irq %d\n",
+				__func__, idx, dst->raised.next);
+			qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_INT]);
+		}
+
+		break;
+	case 0x90:		/* WHOAMI */
+		/* Read-only register */
+		break;
+	case 0xA0:		/* IACK */
+		/* Read-only register */
+		break;
+	case 0xB0:		/* EOI */
+		DPRINTF("EOI\n");
+		s_IRQ = IRQ_get_next(opp, &dst->servicing);
+
+		if (s_IRQ < 0) {
+			DPRINTF("%s: EOI with no interrupt in service\n",
+				__func__);
+			break;
+		}
+
+		IRQ_resetbit(&dst->servicing, s_IRQ);
+		/* Set up next servicing IRQ */
+		s_IRQ = IRQ_get_next(opp, &dst->servicing);
+		/* Check queued interrupts. */
+		n_IRQ = IRQ_get_next(opp, &dst->raised);
+		src = &opp->src[n_IRQ];
+		if (n_IRQ != -1 &&
+		    (s_IRQ == -1 ||
+		     IVPR_PRIORITY(src->ivpr) > dst->servicing.priority)) {
+			DPRINTF("Raise OpenPIC INT output cpu %d irq %d\n",
+				idx, n_IRQ);
+			qemu_irq_raise(opp->dst[idx].irqs[OPENPIC_OUTPUT_INT]);
+		}
+		break;
+	default:
+		break;
+	}
+}
+
+static void openpic_cpu_write(void *opaque, hwaddr addr, uint64_t val,
+			      unsigned len)
+{
+	openpic_cpu_write_internal(opaque, addr, val, (addr & 0x1f000) >> 12);
+}
+
+static uint32_t openpic_iack(OpenPICState * opp, IRQDest * dst, int cpu)
+{
+	IRQSource *src;
+	int retval, irq;
+
+	DPRINTF("Lower OpenPIC INT output\n");
+	qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
+
+	irq = IRQ_get_next(opp, &dst->raised);
+	DPRINTF("IACK: irq=%d\n", irq);
+
+	if (irq == -1) {
+		/* No more interrupt pending */
+		return opp->spve;
+	}
+
+	src = &opp->src[irq];
+	if (!(src->ivpr & IVPR_ACTIVITY_MASK) ||
+	    !(IVPR_PRIORITY(src->ivpr) > dst->ctpr)) {
+		fprintf(stderr, "%s: bad raised IRQ %d ctpr %d ivpr 0x%08x\n",
+			__func__, irq, dst->ctpr, src->ivpr);
+		openpic_update_irq(opp, irq);
+		retval = opp->spve;
+	} else {
+		/* IRQ enter servicing state */
+		IRQ_setbit(&dst->servicing, irq);
+		retval = IVPR_VECTOR(opp, src->ivpr);
+	}
+
+	if (!src->level) {
+		/* edge-sensitive IRQ */
+		src->ivpr &= ~IVPR_ACTIVITY_MASK;
+		src->pending = 0;
+		IRQ_resetbit(&dst->raised, irq);
+	}
+
+	if ((irq >= opp->irq_ipi0) && (irq < (opp->irq_ipi0 + MAX_IPI))) {
+		src->destmask &= ~(1 << cpu);
+		if (src->destmask && !src->level) {
+			/* trigger on CPUs that didn't know about it yet */
+			openpic_set_irq(opp, irq, 1);
+			openpic_set_irq(opp, irq, 0);
+			/* if all CPUs knew about it, set active bit again */
+			src->ivpr |= IVPR_ACTIVITY_MASK;
+		}
+	}
+
+	return retval;
+}
+
+static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx)
+{
+	OpenPICState *opp = opaque;
+	IRQDest *dst;
+	uint32_t retval;
+
+	DPRINTF("%s: cpu %d addr %#" HWADDR_PRIx "\n", __func__, idx, addr);
+	retval = 0xFFFFFFFF;
+
+	if (idx < 0) {
+		return retval;
+	}
+
+	if (addr & 0xF) {
+		return retval;
+	}
+	dst = &opp->dst[idx];
+	addr &= 0xFF0;
+	switch (addr) {
+	case 0x80:		/* CTPR */
+		retval = dst->ctpr;
+		break;
+	case 0x90:		/* WHOAMI */
+		retval = idx;
+		break;
+	case 0xA0:		/* IACK */
+		retval = openpic_iack(opp, dst, idx);
+		break;
+	case 0xB0:		/* EOI */
+		retval = 0;
+		break;
+	default:
+		break;
+	}
+	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+
+	return retval;
+}
+
+static uint64_t openpic_cpu_read(void *opaque, hwaddr addr, unsigned len)
+{
+	return openpic_cpu_read_internal(opaque, addr, (addr & 0x1f000) >> 12);
+}
+
+static const MemoryRegionOps openpic_glb_ops_le = {
+	.write = openpic_gbl_write,
+	.read = openpic_gbl_read,
+	.endianness = DEVICE_LITTLE_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_glb_ops_be = {
+	.write = openpic_gbl_write,
+	.read = openpic_gbl_read,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_tmr_ops_le = {
+	.write = openpic_tmr_write,
+	.read = openpic_tmr_read,
+	.endianness = DEVICE_LITTLE_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_tmr_ops_be = {
+	.write = openpic_tmr_write,
+	.read = openpic_tmr_read,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_cpu_ops_le = {
+	.write = openpic_cpu_write,
+	.read = openpic_cpu_read,
+	.endianness = DEVICE_LITTLE_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_cpu_ops_be = {
+	.write = openpic_cpu_write,
+	.read = openpic_cpu_read,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_src_ops_le = {
+	.write = openpic_src_write,
+	.read = openpic_src_read,
+	.endianness = DEVICE_LITTLE_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_src_ops_be = {
+	.write = openpic_src_write,
+	.read = openpic_src_read,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_msi_ops_be = {
+	.read = openpic_msi_read,
+	.write = openpic_msi_write,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static const MemoryRegionOps openpic_summary_ops_be = {
+	.read = openpic_summary_read,
+	.write = openpic_summary_write,
+	.endianness = DEVICE_BIG_ENDIAN,
+	.impl = {
+		 .min_access_size = 4,
+		 .max_access_size = 4,
+		 },
+};
+
+static void openpic_save_IRQ_queue(QEMUFile * f, IRQQueue * q)
+{
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(q->queue); i++) {
+		/* Always put the lower half of a 64-bit long first, in case we
+		 * restore on a 32-bit host.  The least significant bits correspond
+		 * to lower IRQ numbers in the bitmap.
+		 */
+		qemu_put_be32(f, (uint32_t) q->queue[i]);
+#if LONG_MAX > 0x7FFFFFFF
+		qemu_put_be32(f, (uint32_t) (q->queue[i] >> 32));
+#endif
+	}
+
+	qemu_put_sbe32s(f, &q->next);
+	qemu_put_sbe32s(f, &q->priority);
+}
+
+static void openpic_save(QEMUFile * f, void *opaque)
+{
+	OpenPICState *opp = (OpenPICState *) opaque;
+	unsigned int i;
+
+	qemu_put_be32s(f, &opp->gcr);
+	qemu_put_be32s(f, &opp->vir);
+	qemu_put_be32s(f, &opp->pir);
+	qemu_put_be32s(f, &opp->spve);
+	qemu_put_be32s(f, &opp->tfrr);
+
+	qemu_put_be32s(f, &opp->nb_cpus);
+
+	for (i = 0; i < opp->nb_cpus; i++) {
+		qemu_put_sbe32s(f, &opp->dst[i].ctpr);
+		openpic_save_IRQ_queue(f, &opp->dst[i].raised);
+		openpic_save_IRQ_queue(f, &opp->dst[i].servicing);
+		qemu_put_buffer(f, (uint8_t *) & opp->dst[i].outputs_active,
+				sizeof(opp->dst[i].outputs_active));
+	}
+
+	for (i = 0; i < MAX_TMR; i++) {
+		qemu_put_be32s(f, &opp->timers[i].tccr);
+		qemu_put_be32s(f, &opp->timers[i].tbcr);
+	}
+
+	for (i = 0; i < opp->max_irq; i++) {
+		qemu_put_be32s(f, &opp->src[i].ivpr);
+		qemu_put_be32s(f, &opp->src[i].idr);
+		qemu_get_be32s(f, &opp->src[i].destmask);
+		qemu_put_sbe32s(f, &opp->src[i].last_cpu);
+		qemu_put_sbe32s(f, &opp->src[i].pending);
+	}
+}
+
+static void openpic_load_IRQ_queue(QEMUFile * f, IRQQueue * q)
+{
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(q->queue); i++) {
+		unsigned long val;
+
+		val = qemu_get_be32(f);
+#if LONG_MAX > 0x7FFFFFFF
+		val <<= 32;
+		val |= qemu_get_be32(f);
+#endif
+
+		q->queue[i] = val;
+	}
+
+	qemu_get_sbe32s(f, &q->next);
+	qemu_get_sbe32s(f, &q->priority);
+}
+
+static int openpic_load(QEMUFile * f, void *opaque, int version_id)
+{
+	OpenPICState *opp = (OpenPICState *) opaque;
+	unsigned int i;
+
+	if (version_id != 1) {
+		return -EINVAL;
+	}
+
+	qemu_get_be32s(f, &opp->gcr);
+	qemu_get_be32s(f, &opp->vir);
+	qemu_get_be32s(f, &opp->pir);
+	qemu_get_be32s(f, &opp->spve);
+	qemu_get_be32s(f, &opp->tfrr);
+
+	qemu_get_be32s(f, &opp->nb_cpus);
+
+	for (i = 0; i < opp->nb_cpus; i++) {
+		qemu_get_sbe32s(f, &opp->dst[i].ctpr);
+		openpic_load_IRQ_queue(f, &opp->dst[i].raised);
+		openpic_load_IRQ_queue(f, &opp->dst[i].servicing);
+		qemu_get_buffer(f, (uint8_t *) & opp->dst[i].outputs_active,
+				sizeof(opp->dst[i].outputs_active));
+	}
+
+	for (i = 0; i < MAX_TMR; i++) {
+		qemu_get_be32s(f, &opp->timers[i].tccr);
+		qemu_get_be32s(f, &opp->timers[i].tbcr);
+	}
+
+	for (i = 0; i < opp->max_irq; i++) {
+		uint32_t val;
+
+		val = qemu_get_be32(f);
+		write_IRQreg_idr(opp, i, val);
+		val = qemu_get_be32(f);
+		write_IRQreg_ivpr(opp, i, val);
+
+		qemu_get_be32s(f, &opp->src[i].ivpr);
+		qemu_get_be32s(f, &opp->src[i].idr);
+		qemu_get_be32s(f, &opp->src[i].destmask);
+		qemu_get_sbe32s(f, &opp->src[i].last_cpu);
+		qemu_get_sbe32s(f, &opp->src[i].pending);
+	}
+
+	return 0;
+}
+
+typedef struct MemReg {
+	const char *name;
+	MemoryRegionOps const *ops;
+	hwaddr start_addr;
+	ram_addr_t size;
+} MemReg;
+
+static void fsl_common_init(OpenPICState * opp)
+{
+	int i;
+	int virq = MAX_SRC;
+
+	opp->vid = VID_REVISION_1_2;
+	opp->vir = VIR_GENERIC;
+	opp->vector_mask = 0xFFFF;
+	opp->tfrr_reset = 0;
+	opp->ivpr_reset = IVPR_MASK_MASK;
+	opp->idr_reset = 1 << 0;
+	opp->max_irq = MAX_IRQ;
+
+	opp->irq_ipi0 = virq;
+	virq += MAX_IPI;
+	opp->irq_tim0 = virq;
+	virq += MAX_TMR;
+
+	assert(virq <= MAX_IRQ);
+
+	opp->irq_msi = 224;
+
+	msi_supported = true;
+	for (i = 0; i < opp->fsl->max_ext; i++) {
+		opp->src[i].level = false;
+	}
+
+	/* Internal interrupts, including message and MSI */
+	for (i = 16; i < MAX_SRC; i++) {
+		opp->src[i].type = IRQ_TYPE_FSLINT;
+		opp->src[i].level = true;
+	}
+
+	/* timers and IPIs */
+	for (i = MAX_SRC; i < virq; i++) {
+		opp->src[i].type = IRQ_TYPE_FSLSPECIAL;
+		opp->src[i].level = false;
+	}
+}
+
+static void map_list(OpenPICState * opp, const MemReg * list, int *count)
+{
+	while (list->name) {
+		assert(*count < ARRAY_SIZE(opp->sub_io_mem));
+
+		memory_region_init_io(&opp->sub_io_mem[*count], list->ops, opp,
+				      list->name, list->size);
+
+		memory_region_add_subregion(&opp->mem, list->start_addr,
+					    &opp->sub_io_mem[*count]);
+
+		(*count)++;
+		list++;
+	}
+}
+
+static int openpic_init(SysBusDevice * dev)
+{
+	OpenPICState *opp = FROM_SYSBUS(typeof(*opp), dev);
+	int i, j;
+	int list_count = 0;
+	static const MemReg list_le[] = {
+		{"glb", &openpic_glb_ops_le,
+		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
+		{"tmr", &openpic_tmr_ops_le,
+		 OPENPIC_TMR_REG_START, OPENPIC_TMR_REG_SIZE},
+		{"src", &openpic_src_ops_le,
+		 OPENPIC_SRC_REG_START, OPENPIC_SRC_REG_SIZE},
+		{"cpu", &openpic_cpu_ops_le,
+		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
+		{NULL}
+	};
+	static const MemReg list_be[] = {
+		{"glb", &openpic_glb_ops_be,
+		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
+		{"tmr", &openpic_tmr_ops_be,
+		 OPENPIC_TMR_REG_START, OPENPIC_TMR_REG_SIZE},
+		{"src", &openpic_src_ops_be,
+		 OPENPIC_SRC_REG_START, OPENPIC_SRC_REG_SIZE},
+		{"cpu", &openpic_cpu_ops_be,
+		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
+		{NULL}
+	};
+	static const MemReg list_fsl[] = {
+		{"msi", &openpic_msi_ops_be,
+		 OPENPIC_MSI_REG_START, OPENPIC_MSI_REG_SIZE},
+		{"summary", &openpic_summary_ops_be,
+		 OPENPIC_SUMMARY_REG_START, OPENPIC_SUMMARY_REG_SIZE},
+		{NULL}
+	};
+
+	memory_region_init(&opp->mem, "openpic", 0x40000);
+
+	switch (opp->model) {
+	case OPENPIC_MODEL_FSL_MPIC_20:
+	default:
+		opp->fsl = &fsl_mpic_20;
+		opp->brr1 = 0x00400200;
+		opp->flags |= OPENPIC_FLAG_IDR_CRIT;
+		opp->nb_irqs = 80;
+		opp->mpic_mode_mask = GCR_MODE_MIXED;
+
+		fsl_common_init(opp);
+		map_list(opp, list_be, &list_count);
+		map_list(opp, list_fsl, &list_count);
+
+		break;
+
+	case OPENPIC_MODEL_FSL_MPIC_42:
+		opp->fsl = &fsl_mpic_42;
+		opp->brr1 = 0x00400402;
+		opp->flags |= OPENPIC_FLAG_ILR;
+		opp->nb_irqs = 196;
+		opp->mpic_mode_mask = GCR_MODE_PROXY;
+
+		fsl_common_init(opp);
+		map_list(opp, list_be, &list_count);
+		map_list(opp, list_fsl, &list_count);
+
+		break;
+
+	case OPENPIC_MODEL_RAVEN:
+		opp->nb_irqs = RAVEN_MAX_EXT;
+		opp->vid = VID_REVISION_1_3;
+		opp->vir = VIR_GENERIC;
+		opp->vector_mask = 0xFF;
+		opp->tfrr_reset = 4160000;
+		opp->ivpr_reset = IVPR_MASK_MASK | IVPR_MODE_MASK;
+		opp->idr_reset = 0;
+		opp->max_irq = RAVEN_MAX_IRQ;
+		opp->irq_ipi0 = RAVEN_IPI_IRQ;
+		opp->irq_tim0 = RAVEN_TMR_IRQ;
+		opp->brr1 = -1;
+		opp->mpic_mode_mask = GCR_MODE_MIXED;
+
+		/* Only UP supported today */
+		if (opp->nb_cpus != 1) {
+			return -EINVAL;
+		}
+
+		map_list(opp, list_le, &list_count);
+		break;
+	}
+
+	for (i = 0; i < opp->nb_cpus; i++) {
+		opp->dst[i].irqs = g_new(qemu_irq, OPENPIC_OUTPUT_NB);
+		for (j = 0; j < OPENPIC_OUTPUT_NB; j++) {
+			sysbus_init_irq(dev, &opp->dst[i].irqs[j]);
+		}
+	}
+
+	register_savevm(&opp->busdev.qdev, "openpic", 0, 2,
+			openpic_save, openpic_load, opp);
+
+	sysbus_init_mmio(dev, &opp->mem);
+	qdev_init_gpio_in(&dev->qdev, openpic_set_irq, opp->max_irq);
+
+	return 0;
+}
+
+static Property openpic_properties[] = {
+	DEFINE_PROP_UINT32("model", OpenPICState, model,
+			   OPENPIC_MODEL_FSL_MPIC_20),
+	DEFINE_PROP_UINT32("nb_cpus", OpenPICState, nb_cpus, 1),
+	DEFINE_PROP_END_OF_LIST(),
+};
+
+static void openpic_class_init(ObjectClass * klass, void *data)
+{
+	DeviceClass *dc = DEVICE_CLASS(klass);
+	SysBusDeviceClass *k = SYS_BUS_DEVICE_CLASS(klass);
+
+	k->init = openpic_init;
+	dc->props = openpic_properties;
+	dc->reset = openpic_reset;
+}
+
+static const TypeInfo openpic_info = {
+	.name = "openpic",
+	.parent = TYPE_SYS_BUS_DEVICE,
+	.instance_size = sizeof(OpenPICState),
+	.class_init = openpic_class_init,
+};
+
+static void openpic_register_types(void)
+{
+	type_register_static(&openpic_info);
+}
+
+type_init(openpic_register_types)
-- 
1.7.10.4



^ permalink raw reply related	[flat|nested] 261+ messages in thread
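
For readers following the per-CPU register handlers in the patch above, the decision made at the end of the EOI case can be sketched in isolation. This is only a minimal illustration under stated assumptions, not code from the patch: the helper name should_raise_after_eoi and its parameters are hypothetical, and the priorities are passed in directly rather than looked up through IRQ_get_next() and IVPR_PRIORITY().

```c
#include <stdbool.h>

/*
 * Hypothetical helper, not part of the patch: decide whether the INT
 * output should be raised again after an EOI.
 *
 * s_irq:          next IRQ still in service, or -1 if none
 * n_irq:          highest-priority pending IRQ, or -1 if none
 * n_prio:         priority of n_irq (ignored when n_irq is -1)
 * servicing_prio: priority of the IRQ still in service
 */
static bool should_raise_after_eoi(int s_irq, int n_irq,
				   int n_prio, int servicing_prio)
{
	if (n_irq == -1)
		return false;	/* nothing pending, nothing to raise */

	/* Comparison, not assignment: "s_irq = -1" would always be true. */
	return s_irq == -1 || n_prio > servicing_prio;
}
```

With nothing left in service, any pending interrupt is raised; otherwise only a pending interrupt whose priority exceeds the one still in service is raised.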

* [PATCH v4 3/6] kvm/ppc/mpic: remove some obviously unneeded code
  2013-04-13  0:08       ` Scott Wood
@ 2013-04-13  0:08         ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-13  0:08 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus, Scott Wood

Remove some parts of the code that are obviously QEMU or Raven specific
before fixing style issues, to reduce the style issues that need to be
fixed.

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
 arch/powerpc/kvm/mpic.c |  344 -----------------------------------------------
 1 file changed, 344 deletions(-)

diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index 57655b9..d6d70a4 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -22,39 +22,6 @@
  * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
  * THE SOFTWARE.
  */
-/*
- *
- * Based on OpenPic implementations:
- * - Intel GW80314 I/O companion chip developer's manual
- * - Motorola MPC8245 & MPC8540 user manuals.
- * - Motorola MCP750 (aka Raven) programmer manual.
- * - Motorola Harrier programmer manuel
- *
- * Serial interrupts, as implemented in Raven chipset are not supported yet.
- *
- */
-#include "hw.h"
-#include "ppc/mac.h"
-#include "pci/pci.h"
-#include "openpic.h"
-#include "sysbus.h"
-#include "pci/msi.h"
-#include "qemu/bitops.h"
-#include "ppc.h"
-
-//#define DEBUG_OPENPIC
-
-#ifdef DEBUG_OPENPIC
-static const int debug_openpic = 1;
-#else
-static const int debug_openpic = 0;
-#endif
-
-#define DPRINTF(fmt, ...) do { \
-        if (debug_openpic) { \
-            printf(fmt , ## __VA_ARGS__); \
-        } \
-    } while (0)
 
 #define MAX_CPU     32
 #define MAX_SRC     256
@@ -82,21 +49,6 @@ static const int debug_openpic = 0;
 #define OPENPIC_CPU_REG_START        0x20000
 #define OPENPIC_CPU_REG_SIZE         0x100 + ((MAX_CPU - 1) * 0x1000)
 
-/* Raven */
-#define RAVEN_MAX_CPU      2
-#define RAVEN_MAX_EXT     48
-#define RAVEN_MAX_IRQ     64
-#define RAVEN_MAX_TMR      MAX_TMR
-#define RAVEN_MAX_IPI      MAX_IPI
-
-/* Interrupt definitions */
-#define RAVEN_FE_IRQ     (RAVEN_MAX_EXT)	/* Internal functional IRQ */
-#define RAVEN_ERR_IRQ    (RAVEN_MAX_EXT + 1)	/* Error IRQ */
-#define RAVEN_TMR_IRQ    (RAVEN_MAX_EXT + 2)	/* First timer IRQ */
-#define RAVEN_IPI_IRQ    (RAVEN_TMR_IRQ + RAVEN_MAX_TMR)	/* First IPI IRQ */
-/* First doorbell IRQ */
-#define RAVEN_DBL_IRQ    (RAVEN_IPI_IRQ + (RAVEN_MAX_CPU * RAVEN_MAX_IPI))
-
 typedef struct FslMpicInfo {
 	int max_ext;
 } FslMpicInfo;
@@ -138,44 +90,6 @@ static FslMpicInfo fsl_mpic_42 = {
 #define ILR_INTTGT_CINT   0x01	/* critical */
 #define ILR_INTTGT_MCP    0x02	/* machine check */
 
-/* The currently supported INTTGT values happen to be the same as QEMU's
- * openpic output codes, but don't depend on this.  The output codes
- * could change (unlikely, but...) or support could be added for
- * more INTTGT values.
- */
-static const int inttgt_output[][2] = {
-	{ILR_INTTGT_INT, OPENPIC_OUTPUT_INT},
-	{ILR_INTTGT_CINT, OPENPIC_OUTPUT_CINT},
-	{ILR_INTTGT_MCP, OPENPIC_OUTPUT_MCK},
-};
-
-static int inttgt_to_output(int inttgt)
-{
-	int i;
-
-	for (i = 0; i < ARRAY_SIZE(inttgt_output); i++) {
-		if (inttgt_output[i][0] == inttgt) {
-			return inttgt_output[i][1];
-		}
-	}
-
-	fprintf(stderr, "%s: unsupported inttgt %d\n", __func__, inttgt);
-	return OPENPIC_OUTPUT_INT;
-}
-
-static int output_to_inttgt(int output)
-{
-	int i;
-
-	for (i = 0; i < ARRAY_SIZE(inttgt_output); i++) {
-		if (inttgt_output[i][1] == output) {
-			return inttgt_output[i][0];
-		}
-	}
-
-	abort();
-}
-
 #define MSIIR_OFFSET       0x140
 #define MSIIR_SRS_SHIFT    29
 #define MSIIR_SRS_MASK     (0x7 << MSIIR_SRS_SHIFT)
@@ -1265,228 +1179,36 @@ static uint64_t openpic_cpu_read(void *opaque, hwaddr addr, unsigned len)
 	return openpic_cpu_read_internal(opaque, addr, (addr & 0x1f000) >> 12);
 }
 
-static const MemoryRegionOps openpic_glb_ops_le = {
-	.write = openpic_gbl_write,
-	.read = openpic_gbl_read,
-	.endianness = DEVICE_LITTLE_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
-};
-
 static const MemoryRegionOps openpic_glb_ops_be = {
 	.write = openpic_gbl_write,
 	.read = openpic_gbl_read,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
-};
-
-static const MemoryRegionOps openpic_tmr_ops_le = {
-	.write = openpic_tmr_write,
-	.read = openpic_tmr_read,
-	.endianness = DEVICE_LITTLE_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
 static const MemoryRegionOps openpic_tmr_ops_be = {
 	.write = openpic_tmr_write,
 	.read = openpic_tmr_read,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
-};
-
-static const MemoryRegionOps openpic_cpu_ops_le = {
-	.write = openpic_cpu_write,
-	.read = openpic_cpu_read,
-	.endianness = DEVICE_LITTLE_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
 static const MemoryRegionOps openpic_cpu_ops_be = {
 	.write = openpic_cpu_write,
 	.read = openpic_cpu_read,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
-};
-
-static const MemoryRegionOps openpic_src_ops_le = {
-	.write = openpic_src_write,
-	.read = openpic_src_read,
-	.endianness = DEVICE_LITTLE_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
 static const MemoryRegionOps openpic_src_ops_be = {
 	.write = openpic_src_write,
 	.read = openpic_src_read,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
 static const MemoryRegionOps openpic_msi_ops_be = {
 	.read = openpic_msi_read,
 	.write = openpic_msi_write,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
 static const MemoryRegionOps openpic_summary_ops_be = {
 	.read = openpic_summary_read,
 	.write = openpic_summary_write,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
-static void openpic_save_IRQ_queue(QEMUFile * f, IRQQueue * q)
-{
-	unsigned int i;
-
-	for (i = 0; i < ARRAY_SIZE(q->queue); i++) {
-		/* Always put the lower half of a 64-bit long first, in case we
-		 * restore on a 32-bit host.  The least significant bits correspond
-		 * to lower IRQ numbers in the bitmap.
-		 */
-		qemu_put_be32(f, (uint32_t) q->queue[i]);
-#if LONG_MAX > 0x7FFFFFFF
-		qemu_put_be32(f, (uint32_t) (q->queue[i] >> 32));
-#endif
-	}
-
-	qemu_put_sbe32s(f, &q->next);
-	qemu_put_sbe32s(f, &q->priority);
-}
-
-static void openpic_save(QEMUFile * f, void *opaque)
-{
-	OpenPICState *opp = (OpenPICState *) opaque;
-	unsigned int i;
-
-	qemu_put_be32s(f, &opp->gcr);
-	qemu_put_be32s(f, &opp->vir);
-	qemu_put_be32s(f, &opp->pir);
-	qemu_put_be32s(f, &opp->spve);
-	qemu_put_be32s(f, &opp->tfrr);
-
-	qemu_put_be32s(f, &opp->nb_cpus);
-
-	for (i = 0; i < opp->nb_cpus; i++) {
-		qemu_put_sbe32s(f, &opp->dst[i].ctpr);
-		openpic_save_IRQ_queue(f, &opp->dst[i].raised);
-		openpic_save_IRQ_queue(f, &opp->dst[i].servicing);
-		qemu_put_buffer(f, (uint8_t *) & opp->dst[i].outputs_active,
-				sizeof(opp->dst[i].outputs_active));
-	}
-
-	for (i = 0; i < MAX_TMR; i++) {
-		qemu_put_be32s(f, &opp->timers[i].tccr);
-		qemu_put_be32s(f, &opp->timers[i].tbcr);
-	}
-
-	for (i = 0; i < opp->max_irq; i++) {
-		qemu_put_be32s(f, &opp->src[i].ivpr);
-		qemu_put_be32s(f, &opp->src[i].idr);
-		qemu_get_be32s(f, &opp->src[i].destmask);
-		qemu_put_sbe32s(f, &opp->src[i].last_cpu);
-		qemu_put_sbe32s(f, &opp->src[i].pending);
-	}
-}
-
-static void openpic_load_IRQ_queue(QEMUFile * f, IRQQueue * q)
-{
-	unsigned int i;
-
-	for (i = 0; i < ARRAY_SIZE(q->queue); i++) {
-		unsigned long val;
-
-		val = qemu_get_be32(f);
-#if LONG_MAX > 0x7FFFFFFF
-		val <<= 32;
-		val |= qemu_get_be32(f);
-#endif
-
-		q->queue[i] = val;
-	}
-
-	qemu_get_sbe32s(f, &q->next);
-	qemu_get_sbe32s(f, &q->priority);
-}
-
-static int openpic_load(QEMUFile * f, void *opaque, int version_id)
-{
-	OpenPICState *opp = (OpenPICState *) opaque;
-	unsigned int i;
-
-	if (version_id != 1) {
-		return -EINVAL;
-	}
-
-	qemu_get_be32s(f, &opp->gcr);
-	qemu_get_be32s(f, &opp->vir);
-	qemu_get_be32s(f, &opp->pir);
-	qemu_get_be32s(f, &opp->spve);
-	qemu_get_be32s(f, &opp->tfrr);
-
-	qemu_get_be32s(f, &opp->nb_cpus);
-
-	for (i = 0; i < opp->nb_cpus; i++) {
-		qemu_get_sbe32s(f, &opp->dst[i].ctpr);
-		openpic_load_IRQ_queue(f, &opp->dst[i].raised);
-		openpic_load_IRQ_queue(f, &opp->dst[i].servicing);
-		qemu_get_buffer(f, (uint8_t *) & opp->dst[i].outputs_active,
-				sizeof(opp->dst[i].outputs_active));
-	}
-
-	for (i = 0; i < MAX_TMR; i++) {
-		qemu_get_be32s(f, &opp->timers[i].tccr);
-		qemu_get_be32s(f, &opp->timers[i].tbcr);
-	}
-
-	for (i = 0; i < opp->max_irq; i++) {
-		uint32_t val;
-
-		val = qemu_get_be32(f);
-		write_IRQreg_idr(opp, i, val);
-		val = qemu_get_be32(f);
-		write_IRQreg_ivpr(opp, i, val);
-
-		qemu_get_be32s(f, &opp->src[i].ivpr);
-		qemu_get_be32s(f, &opp->src[i].idr);
-		qemu_get_be32s(f, &opp->src[i].destmask);
-		qemu_get_sbe32s(f, &opp->src[i].last_cpu);
-		qemu_get_sbe32s(f, &opp->src[i].pending);
-	}
-
-	return 0;
-}
-
 typedef struct MemReg {
 	const char *name;
 	MemoryRegionOps const *ops;
@@ -1614,73 +1336,7 @@ static int openpic_init(SysBusDevice * dev)
 		map_list(opp, list_fsl, &list_count);
 
 		break;
-
-	case OPENPIC_MODEL_RAVEN:
-		opp->nb_irqs = RAVEN_MAX_EXT;
-		opp->vid = VID_REVISION_1_3;
-		opp->vir = VIR_GENERIC;
-		opp->vector_mask = 0xFF;
-		opp->tfrr_reset = 4160000;
-		opp->ivpr_reset = IVPR_MASK_MASK | IVPR_MODE_MASK;
-		opp->idr_reset = 0;
-		opp->max_irq = RAVEN_MAX_IRQ;
-		opp->irq_ipi0 = RAVEN_IPI_IRQ;
-		opp->irq_tim0 = RAVEN_TMR_IRQ;
-		opp->brr1 = -1;
-		opp->mpic_mode_mask = GCR_MODE_MIXED;
-
-		/* Only UP supported today */
-		if (opp->nb_cpus != 1) {
-			return -EINVAL;
-		}
-
-		map_list(opp, list_le, &list_count);
-		break;
-	}
-
-	for (i = 0; i < opp->nb_cpus; i++) {
-		opp->dst[i].irqs = g_new(qemu_irq, OPENPIC_OUTPUT_NB);
-		for (j = 0; j < OPENPIC_OUTPUT_NB; j++) {
-			sysbus_init_irq(dev, &opp->dst[i].irqs[j]);
-		}
 	}
 
-	register_savevm(&opp->busdev.qdev, "openpic", 0, 2,
-			openpic_save, openpic_load, opp);
-
-	sysbus_init_mmio(dev, &opp->mem);
-	qdev_init_gpio_in(&dev->qdev, openpic_set_irq, opp->max_irq);
-
 	return 0;
 }
-
-static Property openpic_properties[] = {
-	DEFINE_PROP_UINT32("model", OpenPICState, model,
-			   OPENPIC_MODEL_FSL_MPIC_20),
-	DEFINE_PROP_UINT32("nb_cpus", OpenPICState, nb_cpus, 1),
-	DEFINE_PROP_END_OF_LIST(),
-};
-
-static void openpic_class_init(ObjectClass * klass, void *data)
-{
-	DeviceClass *dc = DEVICE_CLASS(klass);
-	SysBusDeviceClass *k = SYS_BUS_DEVICE_CLASS(klass);
-
-	k->init = openpic_init;
-	dc->props = openpic_properties;
-	dc->reset = openpic_reset;
-}
-
-static const TypeInfo openpic_info = {
-	.name = "openpic",
-	.parent = TYPE_SYS_BUS_DEVICE,
-	.instance_size = sizeof(OpenPICState),
-	.class_init = openpic_class_init,
-};
-
-static void openpic_register_types(void)
-{
-	type_register_static(&openpic_info);
-}
-
-type_init(openpic_register_types)
-- 
1.7.10.4



^ permalink raw reply related	[flat|nested] 261+ messages in thread

* [PATCH v4 3/6] kvm/ppc/mpic: remove some obviously unneeded code
@ 2013-04-13  0:08         ` Scott Wood
  0 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-13  0:08 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus, Scott Wood

Remove some parts of the code that are obviously QEMU or Raven specific
before fixing style issues, to reduce the style issues that need to be
fixed.

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
 arch/powerpc/kvm/mpic.c |  344 -----------------------------------------------
 1 file changed, 344 deletions(-)

diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index 57655b9..d6d70a4 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -22,39 +22,6 @@
  * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
  * THE SOFTWARE.
  */
-/*
- *
- * Based on OpenPic implementations:
- * - Intel GW80314 I/O companion chip developer's manual
- * - Motorola MPC8245 & MPC8540 user manuals.
- * - Motorola MCP750 (aka Raven) programmer manual.
- * - Motorola Harrier programmer manuel
- *
- * Serial interrupts, as implemented in Raven chipset are not supported yet.
- *
- */
-#include "hw.h"
-#include "ppc/mac.h"
-#include "pci/pci.h"
-#include "openpic.h"
-#include "sysbus.h"
-#include "pci/msi.h"
-#include "qemu/bitops.h"
-#include "ppc.h"
-
-//#define DEBUG_OPENPIC
-
-#ifdef DEBUG_OPENPIC
-static const int debug_openpic = 1;
-#else
-static const int debug_openpic = 0;
-#endif
-
-#define DPRINTF(fmt, ...) do { \
-        if (debug_openpic) { \
-            printf(fmt , ## __VA_ARGS__); \
-        } \
-    } while (0)
 
 #define MAX_CPU     32
 #define MAX_SRC     256
@@ -82,21 +49,6 @@ static const int debug_openpic = 0;
 #define OPENPIC_CPU_REG_START        0x20000
 #define OPENPIC_CPU_REG_SIZE         0x100 + ((MAX_CPU - 1) * 0x1000)
 
-/* Raven */
-#define RAVEN_MAX_CPU      2
-#define RAVEN_MAX_EXT     48
-#define RAVEN_MAX_IRQ     64
-#define RAVEN_MAX_TMR      MAX_TMR
-#define RAVEN_MAX_IPI      MAX_IPI
-
-/* Interrupt definitions */
-#define RAVEN_FE_IRQ     (RAVEN_MAX_EXT)	/* Internal functional IRQ */
-#define RAVEN_ERR_IRQ    (RAVEN_MAX_EXT + 1)	/* Error IRQ */
-#define RAVEN_TMR_IRQ    (RAVEN_MAX_EXT + 2)	/* First timer IRQ */
-#define RAVEN_IPI_IRQ    (RAVEN_TMR_IRQ + RAVEN_MAX_TMR)	/* First IPI IRQ */
-/* First doorbell IRQ */
-#define RAVEN_DBL_IRQ    (RAVEN_IPI_IRQ + (RAVEN_MAX_CPU * RAVEN_MAX_IPI))
-
 typedef struct FslMpicInfo {
 	int max_ext;
 } FslMpicInfo;
@@ -138,44 +90,6 @@ static FslMpicInfo fsl_mpic_42 = {
 #define ILR_INTTGT_CINT   0x01	/* critical */
 #define ILR_INTTGT_MCP    0x02	/* machine check */
 
-/* The currently supported INTTGT values happen to be the same as QEMU's
- * openpic output codes, but don't depend on this.  The output codes
- * could change (unlikely, but...) or support could be added for
- * more INTTGT values.
- */
-static const int inttgt_output[][2] = {
-	{ILR_INTTGT_INT, OPENPIC_OUTPUT_INT},
-	{ILR_INTTGT_CINT, OPENPIC_OUTPUT_CINT},
-	{ILR_INTTGT_MCP, OPENPIC_OUTPUT_MCK},
-};
-
-static int inttgt_to_output(int inttgt)
-{
-	int i;
-
-	for (i = 0; i < ARRAY_SIZE(inttgt_output); i++) {
-		if (inttgt_output[i][0] == inttgt) {
-			return inttgt_output[i][1];
-		}
-	}
-
-	fprintf(stderr, "%s: unsupported inttgt %d\n", __func__, inttgt);
-	return OPENPIC_OUTPUT_INT;
-}
-
-static int output_to_inttgt(int output)
-{
-	int i;
-
-	for (i = 0; i < ARRAY_SIZE(inttgt_output); i++) {
-		if (inttgt_output[i][1] == output) {
-			return inttgt_output[i][0];
-		}
-	}
-
-	abort();
-}
-
 #define MSIIR_OFFSET       0x140
 #define MSIIR_SRS_SHIFT    29
 #define MSIIR_SRS_MASK     (0x7 << MSIIR_SRS_SHIFT)
@@ -1265,228 +1179,36 @@ static uint64_t openpic_cpu_read(void *opaque, hwaddr addr, unsigned len)
 	return openpic_cpu_read_internal(opaque, addr, (addr & 0x1f000) >> 12);
 }
 
-static const MemoryRegionOps openpic_glb_ops_le = {
-	.write = openpic_gbl_write,
-	.read = openpic_gbl_read,
-	.endianness = DEVICE_LITTLE_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
-};
-
 static const MemoryRegionOps openpic_glb_ops_be = {
 	.write = openpic_gbl_write,
 	.read = openpic_gbl_read,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
-};
-
-static const MemoryRegionOps openpic_tmr_ops_le = {
-	.write = openpic_tmr_write,
-	.read = openpic_tmr_read,
-	.endianness = DEVICE_LITTLE_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
 static const MemoryRegionOps openpic_tmr_ops_be = {
 	.write = openpic_tmr_write,
 	.read = openpic_tmr_read,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
-};
-
-static const MemoryRegionOps openpic_cpu_ops_le = {
-	.write = openpic_cpu_write,
-	.read = openpic_cpu_read,
-	.endianness = DEVICE_LITTLE_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
 static const MemoryRegionOps openpic_cpu_ops_be = {
 	.write = openpic_cpu_write,
 	.read = openpic_cpu_read,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
-};
-
-static const MemoryRegionOps openpic_src_ops_le = {
-	.write = openpic_src_write,
-	.read = openpic_src_read,
-	.endianness = DEVICE_LITTLE_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
 static const MemoryRegionOps openpic_src_ops_be = {
 	.write = openpic_src_write,
 	.read = openpic_src_read,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
 static const MemoryRegionOps openpic_msi_ops_be = {
 	.read = openpic_msi_read,
 	.write = openpic_msi_write,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
 static const MemoryRegionOps openpic_summary_ops_be = {
 	.read = openpic_summary_read,
 	.write = openpic_summary_write,
-	.endianness = DEVICE_BIG_ENDIAN,
-	.impl = {
-		 .min_access_size = 4,
-		 .max_access_size = 4,
-		 },
 };
 
-static void openpic_save_IRQ_queue(QEMUFile * f, IRQQueue * q)
-{
-	unsigned int i;
-
-	for (i = 0; i < ARRAY_SIZE(q->queue); i++) {
-		/* Always put the lower half of a 64-bit long first, in case we
-		 * restore on a 32-bit host.  The least significant bits correspond
-		 * to lower IRQ numbers in the bitmap.
-		 */
-		qemu_put_be32(f, (uint32_t) q->queue[i]);
-#if LONG_MAX > 0x7FFFFFFF
-		qemu_put_be32(f, (uint32_t) (q->queue[i] >> 32));
-#endif
-	}
-
-	qemu_put_sbe32s(f, &q->next);
-	qemu_put_sbe32s(f, &q->priority);
-}
-
-static void openpic_save(QEMUFile * f, void *opaque)
-{
-	OpenPICState *opp = (OpenPICState *) opaque;
-	unsigned int i;
-
-	qemu_put_be32s(f, &opp->gcr);
-	qemu_put_be32s(f, &opp->vir);
-	qemu_put_be32s(f, &opp->pir);
-	qemu_put_be32s(f, &opp->spve);
-	qemu_put_be32s(f, &opp->tfrr);
-
-	qemu_put_be32s(f, &opp->nb_cpus);
-
-	for (i = 0; i < opp->nb_cpus; i++) {
-		qemu_put_sbe32s(f, &opp->dst[i].ctpr);
-		openpic_save_IRQ_queue(f, &opp->dst[i].raised);
-		openpic_save_IRQ_queue(f, &opp->dst[i].servicing);
-		qemu_put_buffer(f, (uint8_t *) & opp->dst[i].outputs_active,
-				sizeof(opp->dst[i].outputs_active));
-	}
-
-	for (i = 0; i < MAX_TMR; i++) {
-		qemu_put_be32s(f, &opp->timers[i].tccr);
-		qemu_put_be32s(f, &opp->timers[i].tbcr);
-	}
-
-	for (i = 0; i < opp->max_irq; i++) {
-		qemu_put_be32s(f, &opp->src[i].ivpr);
-		qemu_put_be32s(f, &opp->src[i].idr);
-		qemu_get_be32s(f, &opp->src[i].destmask);
-		qemu_put_sbe32s(f, &opp->src[i].last_cpu);
-		qemu_put_sbe32s(f, &opp->src[i].pending);
-	}
-}
-
-static void openpic_load_IRQ_queue(QEMUFile * f, IRQQueue * q)
-{
-	unsigned int i;
-
-	for (i = 0; i < ARRAY_SIZE(q->queue); i++) {
-		unsigned long val;
-
-		val = qemu_get_be32(f);
-#if LONG_MAX > 0x7FFFFFFF
-		val <<= 32;
-		val |= qemu_get_be32(f);
-#endif
-
-		q->queue[i] = val;
-	}
-
-	qemu_get_sbe32s(f, &q->next);
-	qemu_get_sbe32s(f, &q->priority);
-}
-
-static int openpic_load(QEMUFile * f, void *opaque, int version_id)
-{
-	OpenPICState *opp = (OpenPICState *) opaque;
-	unsigned int i;
-
-	if (version_id != 1) {
-		return -EINVAL;
-	}
-
-	qemu_get_be32s(f, &opp->gcr);
-	qemu_get_be32s(f, &opp->vir);
-	qemu_get_be32s(f, &opp->pir);
-	qemu_get_be32s(f, &opp->spve);
-	qemu_get_be32s(f, &opp->tfrr);
-
-	qemu_get_be32s(f, &opp->nb_cpus);
-
-	for (i = 0; i < opp->nb_cpus; i++) {
-		qemu_get_sbe32s(f, &opp->dst[i].ctpr);
-		openpic_load_IRQ_queue(f, &opp->dst[i].raised);
-		openpic_load_IRQ_queue(f, &opp->dst[i].servicing);
-		qemu_get_buffer(f, (uint8_t *) & opp->dst[i].outputs_active,
-				sizeof(opp->dst[i].outputs_active));
-	}
-
-	for (i = 0; i < MAX_TMR; i++) {
-		qemu_get_be32s(f, &opp->timers[i].tccr);
-		qemu_get_be32s(f, &opp->timers[i].tbcr);
-	}
-
-	for (i = 0; i < opp->max_irq; i++) {
-		uint32_t val;
-
-		val = qemu_get_be32(f);
-		write_IRQreg_idr(opp, i, val);
-		val = qemu_get_be32(f);
-		write_IRQreg_ivpr(opp, i, val);
-
-		qemu_get_be32s(f, &opp->src[i].ivpr);
-		qemu_get_be32s(f, &opp->src[i].idr);
-		qemu_get_be32s(f, &opp->src[i].destmask);
-		qemu_get_sbe32s(f, &opp->src[i].last_cpu);
-		qemu_get_sbe32s(f, &opp->src[i].pending);
-	}
-
-	return 0;
-}
-
 typedef struct MemReg {
 	const char *name;
 	MemoryRegionOps const *ops;
@@ -1614,73 +1336,7 @@ static int openpic_init(SysBusDevice * dev)
 		map_list(opp, list_fsl, &list_count);
 
 		break;
-
-	case OPENPIC_MODEL_RAVEN:
-		opp->nb_irqs = RAVEN_MAX_EXT;
-		opp->vid = VID_REVISION_1_3;
-		opp->vir = VIR_GENERIC;
-		opp->vector_mask = 0xFF;
-		opp->tfrr_reset = 4160000;
-		opp->ivpr_reset = IVPR_MASK_MASK | IVPR_MODE_MASK;
-		opp->idr_reset = 0;
-		opp->max_irq = RAVEN_MAX_IRQ;
-		opp->irq_ipi0 = RAVEN_IPI_IRQ;
-		opp->irq_tim0 = RAVEN_TMR_IRQ;
-		opp->brr1 = -1;
-		opp->mpic_mode_mask = GCR_MODE_MIXED;
-
-		/* Only UP supported today */
-		if (opp->nb_cpus != 1) {
-			return -EINVAL;
-		}
-
-		map_list(opp, list_le, &list_count);
-		break;
-	}
-
-	for (i = 0; i < opp->nb_cpus; i++) {
-		opp->dst[i].irqs = g_new(qemu_irq, OPENPIC_OUTPUT_NB);
-		for (j = 0; j < OPENPIC_OUTPUT_NB; j++) {
-			sysbus_init_irq(dev, &opp->dst[i].irqs[j]);
-		}
 	}
 
-	register_savevm(&opp->busdev.qdev, "openpic", 0, 2,
-			openpic_save, openpic_load, opp);
-
-	sysbus_init_mmio(dev, &opp->mem);
-	qdev_init_gpio_in(&dev->qdev, openpic_set_irq, opp->max_irq);
-
 	return 0;
 }
-
-static Property openpic_properties[] = {
-	DEFINE_PROP_UINT32("model", OpenPICState, model,
-			   OPENPIC_MODEL_FSL_MPIC_20),
-	DEFINE_PROP_UINT32("nb_cpus", OpenPICState, nb_cpus, 1),
-	DEFINE_PROP_END_OF_LIST(),
-};
-
-static void openpic_class_init(ObjectClass * klass, void *data)
-{
-	DeviceClass *dc = DEVICE_CLASS(klass);
-	SysBusDeviceClass *k = SYS_BUS_DEVICE_CLASS(klass);
-
-	k->init = openpic_init;
-	dc->props = openpic_properties;
-	dc->reset = openpic_reset;
-}
-
-static const TypeInfo openpic_info = {
-	.name = "openpic",
-	.parent = TYPE_SYS_BUS_DEVICE,
-	.instance_size = sizeof(OpenPICState),
-	.class_init = openpic_class_init,
-};
-
-static void openpic_register_types(void)
-{
-	type_register_static(&openpic_info);
-}
-
-type_init(openpic_register_types)
-- 
1.7.10.4




* [PATCH v4 4/6] kvm/ppc/mpic: adapt to kernel style and environment
From: Scott Wood @ 2013-04-13  0:08 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus, Scott Wood

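Remove braces around single-statement blocks, which Linux style
doesn't permit; remove the space after '*' that Lindent added; keep
error/debug strings contiguous instead of splitting them across
lines; parenthesize a multi-term macro body.

Substitute kernel equivalents for QEMU type names (hwaddr -> gpa_t,
typedef'd structs -> plain struct tags) and debug prints
(DPRINTF -> pr_debug, fprintf(stderr, ...) -> pr_err).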

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
 arch/powerpc/kvm/mpic.c |  445 ++++++++++++++++++++++-------------------------
 1 file changed, 208 insertions(+), 237 deletions(-)
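
Reviewer note, not part of the patch: a minimal userspace sketch of the
conversions listed above. Every DEMO_/demo_ name is hypothetical, and
DEMO_MAX_CPU is an assumed value chosen only to make the arithmetic easy
to follow. It shows why the macro body gains outer parentheses, plus the
plain struct tag, '*' placement and brace-less single-statement style.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed value for illustration; not taken from the patch. */
#define DEMO_MAX_CPU 4

/* Unparenthesized body: a surrounding operator binds to 0x100 only. */
#define DEMO_REG_SIZE_OLD 0x100 + ((DEMO_MAX_CPU - 1) * 0x1000)

/* Parenthesized body: the macro expands as one value everywhere. */
#define DEMO_REG_SIZE_NEW (0x100 + ((DEMO_MAX_CPU - 1) * 0x1000))

/* Kernel style: a plain struct tag instead of a CamelCase typedef. */
struct demo_queue {
	int next;
	int priority;
};

/* '*' sits next to the name; single-statement branches take no braces. */
static inline int demo_queue_next(const struct demo_queue *q)
{
	if (!q)
		return -1;

	return q->next;
}

/* Doubling each macro exposes the precedence difference. */
static inline long demo_double_old(void)
{
	return 2 * DEMO_REG_SIZE_OLD;	/* 2 * 0x100 + 0x3000 */
}

static inline long demo_double_new(void)
{
	return 2 * DEMO_REG_SIZE_NEW;	/* 2 * (0x100 + 0x3000) */
}
```

With the assumed DEMO_MAX_CPU of 4, demo_double_old() yields 0x3200 and
demo_double_new() yields 0x6200; only the latter is twice the region
size. The style changes do not alter behaviour.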

diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index d6d70a4..1df67ae 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -42,22 +42,22 @@
 #define OPENPIC_TMR_REG_SIZE         0x220
 #define OPENPIC_MSI_REG_START        0x1600
 #define OPENPIC_MSI_REG_SIZE         0x200
-#define OPENPIC_SUMMARY_REG_START   0x3800
-#define OPENPIC_SUMMARY_REG_SIZE    0x800
+#define OPENPIC_SUMMARY_REG_START    0x3800
+#define OPENPIC_SUMMARY_REG_SIZE     0x800
 #define OPENPIC_SRC_REG_START        0x10000
 #define OPENPIC_SRC_REG_SIZE         (MAX_SRC * 0x20)
 #define OPENPIC_CPU_REG_START        0x20000
-#define OPENPIC_CPU_REG_SIZE         0x100 + ((MAX_CPU - 1) * 0x1000)
+#define OPENPIC_CPU_REG_SIZE         (0x100 + ((MAX_CPU - 1) * 0x1000))
 
-typedef struct FslMpicInfo {
+struct fsl_mpic_info {
 	int max_ext;
-} FslMpicInfo;
+};
 
-static FslMpicInfo fsl_mpic_20 = {
+static struct fsl_mpic_info fsl_mpic_20 = {
 	.max_ext = 12,
 };
 
-static FslMpicInfo fsl_mpic_42 = {
+static struct fsl_mpic_info fsl_mpic_42 = {
 	.max_ext = 12,
 };
 
@@ -100,44 +100,43 @@ static int get_current_cpu(void)
 {
 	CPUState *cpu_single_cpu;
 
-	if (!cpu_single_env) {
+	if (!cpu_single_env)
 		return -1;
-	}
 
 	cpu_single_cpu = ENV_GET_CPU(cpu_single_env);
 	return cpu_single_cpu->cpu_index;
 }
 
-static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx);
-static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
+static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx);
+static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
 				       uint32_t val, int idx);
 
-typedef enum IRQType {
+enum irq_type {
 	IRQ_TYPE_NORMAL = 0,
 	IRQ_TYPE_FSLINT,	/* FSL internal interrupt -- level only */
 	IRQ_TYPE_FSLSPECIAL,	/* FSL timer/IPI interrupt, edge, no polarity */
-} IRQType;
+};
 
-typedef struct IRQQueue {
+struct irq_queue {
 	/* Round up to the nearest 64 IRQs so that the queue length
 	 * won't change when moving between 32 and 64 bit hosts.
 	 */
 	unsigned long queue[BITS_TO_LONGS((MAX_IRQ + 63) & ~63)];
 	int next;
 	int priority;
-} IRQQueue;
+};
 
-typedef struct IRQSource {
+struct irq_source {
 	uint32_t ivpr;		/* IRQ vector/priority register */
 	uint32_t idr;		/* IRQ destination register */
 	uint32_t destmask;	/* bitmap of CPU destinations */
 	int last_cpu;
 	int output;		/* IRQ level, e.g. OPENPIC_OUTPUT_INT */
 	int pending;		/* TRUE if IRQ is pending */
-	IRQType type;
+	enum irq_type type;
 	bool level:1;		/* level-triggered */
-	bool nomask:1;		/* critical interrupts ignore mask on some FSL MPICs */
-} IRQSource;
+	bool nomask:1;	/* critical interrupts ignore mask on some FSL MPICs */
+};
 
 #define IVPR_MASK_SHIFT       31
 #define IVPR_MASK_MASK        (1 << IVPR_MASK_SHIFT)
@@ -158,22 +157,19 @@ typedef struct IRQSource {
 #define IDR_EP      0x80000000	/* external pin */
 #define IDR_CI      0x40000000	/* critical interrupt */
 
-typedef struct IRQDest {
+struct irq_dest {
 	int32_t ctpr;		/* CPU current task priority */
-	IRQQueue raised;
-	IRQQueue servicing;
+	struct irq_queue raised;
+	struct irq_queue servicing;
 	qemu_irq *irqs;
 
 	/* Count of IRQ sources asserting on non-INT outputs */
 	uint32_t outputs_active[OPENPIC_OUTPUT_NB];
-} IRQDest;
-
-typedef struct OpenPICState {
-	SysBusDevice busdev;
-	MemoryRegion mem;
+};
 
+struct openpic {
 	/* Behavior control */
-	FslMpicInfo *fsl;
+	struct fsl_mpic_info *fsl;
 	uint32_t model;
 	uint32_t flags;
 	uint32_t nb_irqs;
@@ -186,9 +182,6 @@ typedef struct OpenPICState {
 	uint32_t brr1;
 	uint32_t mpic_mode_mask;
 
-	/* Sub-regions */
-	MemoryRegion sub_io_mem[6];
-
 	/* Global registers */
 	uint32_t frr;		/* Feature reporting register */
 	uint32_t gcr;		/* Global configuration register  */
@@ -196,9 +189,9 @@ typedef struct OpenPICState {
 	uint32_t spve;		/* Spurious vector register */
 	uint32_t tfrr;		/* Timer frequency reporting register */
 	/* Source registers */
-	IRQSource src[MAX_IRQ];
+	struct irq_source src[MAX_IRQ];
 	/* Local registers per output pin */
-	IRQDest dst[MAX_CPU];
+	struct irq_dest dst[MAX_CPU];
 	uint32_t nb_cpus;
 	/* Timer registers */
 	struct {
@@ -213,24 +206,24 @@ typedef struct OpenPICState {
 	uint32_t irq_ipi0;
 	uint32_t irq_tim0;
 	uint32_t irq_msi;
-} OpenPICState;
+};
 
-static inline void IRQ_setbit(IRQQueue * q, int n_IRQ)
+static inline void IRQ_setbit(struct irq_queue *q, int n_IRQ)
 {
 	set_bit(n_IRQ, q->queue);
 }
 
-static inline void IRQ_resetbit(IRQQueue * q, int n_IRQ)
+static inline void IRQ_resetbit(struct irq_queue *q, int n_IRQ)
 {
 	clear_bit(n_IRQ, q->queue);
 }
 
-static inline int IRQ_testbit(IRQQueue * q, int n_IRQ)
+static inline int IRQ_testbit(struct irq_queue *q, int n_IRQ)
 {
 	return test_bit(n_IRQ, q->queue);
 }
 
-static void IRQ_check(OpenPICState * opp, IRQQueue * q)
+static void IRQ_check(struct openpic *opp, struct irq_queue *q)
 {
 	int irq = -1;
 	int next = -1;
@@ -238,11 +231,10 @@ static void IRQ_check(OpenPICState * opp, IRQQueue * q)
 
 	for (;;) {
 		irq = find_next_bit(q->queue, opp->max_irq, irq + 1);
-		if (irq == opp->max_irq) {
+		if (irq == opp->max_irq)
 			break;
-		}
 
-		DPRINTF("IRQ_check: irq %d set ivpr_pr=%d pr=%d\n",
+		pr_debug("IRQ_check: irq %d set ivpr_pr=%d pr=%d\n",
 			irq, IVPR_PRIORITY(opp->src[irq].ivpr), priority);
 
 		if (IVPR_PRIORITY(opp->src[irq].ivpr) > priority) {
@@ -255,7 +247,7 @@ static void IRQ_check(OpenPICState * opp, IRQQueue * q)
 	q->priority = priority;
 }
 
-static int IRQ_get_next(OpenPICState * opp, IRQQueue * q)
+static int IRQ_get_next(struct openpic *opp, struct irq_queue *q)
 {
 	/* XXX: optimize */
 	IRQ_check(opp, q);
@@ -263,21 +255,21 @@ static int IRQ_get_next(OpenPICState * opp, IRQQueue * q)
 	return q->next;
 }
 
-static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
+static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 			   bool active, bool was_active)
 {
-	IRQDest *dst;
-	IRQSource *src;
+	struct irq_dest *dst;
+	struct irq_source *src;
 	int priority;
 
 	dst = &opp->dst[n_CPU];
 	src = &opp->src[n_IRQ];
 
-	DPRINTF("%s: IRQ %d active %d was %d\n",
+	pr_debug("%s: IRQ %d active %d was %d\n",
 		__func__, n_IRQ, active, was_active);
 
 	if (src->output != OPENPIC_OUTPUT_INT) {
-		DPRINTF("%s: output %d irq %d active %d was %d count %d\n",
+		pr_debug("%s: output %d irq %d active %d was %d count %d\n",
 			__func__, src->output, n_IRQ, active, was_active,
 			dst->outputs_active[src->output]);
 
@@ -286,19 +278,17 @@ static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
 		 * masking.
 		 */
 		if (active) {
-			if (!was_active
-			    && dst->outputs_active[src->output]++ == 0) {
-				DPRINTF
-				    ("%s: Raise OpenPIC output %d cpu %d irq %d\n",
-				     __func__, src->output, n_CPU, n_IRQ);
+			if (!was_active &&
+			    dst->outputs_active[src->output]++ == 0) {
+				pr_debug("%s: Raise OpenPIC output %d cpu %d irq %d\n",
+					__func__, src->output, n_CPU, n_IRQ);
 				qemu_irq_raise(dst->irqs[src->output]);
 			}
 		} else {
-			if (was_active
-			    && --dst->outputs_active[src->output] == 0) {
-				DPRINTF
-				    ("%s: Lower OpenPIC output %d cpu %d irq %d\n",
-				     __func__, src->output, n_CPU, n_IRQ);
+			if (was_active &&
+			    --dst->outputs_active[src->output] == 0) {
+				pr_debug("%s: Lower OpenPIC output %d cpu %d irq %d\n",
+					__func__, src->output, n_CPU, n_IRQ);
 				qemu_irq_lower(dst->irqs[src->output]);
 			}
 		}
@@ -311,31 +301,27 @@ static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
 	/* Even if the interrupt doesn't have enough priority,
 	 * it is still raised, in case ctpr is lowered later.
 	 */
-	if (active) {
+	if (active)
 		IRQ_setbit(&dst->raised, n_IRQ);
-	} else {
+	else
 		IRQ_resetbit(&dst->raised, n_IRQ);
-	}
 
 	IRQ_check(opp, &dst->raised);
 
 	if (active && priority <= dst->ctpr) {
-		DPRINTF
-		    ("%s: IRQ %d priority %d too low for ctpr %d on CPU %d\n",
-		     __func__, n_IRQ, priority, dst->ctpr, n_CPU);
+		pr_debug("%s: IRQ %d priority %d too low for ctpr %d on CPU %d\n",
+			__func__, n_IRQ, priority, dst->ctpr, n_CPU);
 		active = 0;
 	}
 
 	if (active) {
 		if (IRQ_get_next(opp, &dst->servicing) >= 0 &&
 		    priority <= dst->servicing.priority) {
-			DPRINTF
-			    ("%s: IRQ %d is hidden by servicing IRQ %d on CPU %d\n",
-			     __func__, n_IRQ, dst->servicing.next, n_CPU);
+			pr_debug("%s: IRQ %d is hidden by servicing IRQ %d on CPU %d\n",
+				__func__, n_IRQ, dst->servicing.next, n_CPU);
 		} else {
-			DPRINTF
-			    ("%s: Raise OpenPIC INT output cpu %d irq %d/%d\n",
-			     __func__, n_CPU, n_IRQ, dst->raised.next);
+			pr_debug("%s: Raise OpenPIC INT output cpu %d irq %d/%d\n",
+				__func__, n_CPU, n_IRQ, dst->raised.next);
 			qemu_irq_raise(opp->dst[n_CPU].
 				       irqs[OPENPIC_OUTPUT_INT]);
 		}
@@ -343,17 +329,15 @@ static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
 		IRQ_get_next(opp, &dst->servicing);
 		if (dst->raised.priority > dst->ctpr &&
 		    dst->raised.priority > dst->servicing.priority) {
-			DPRINTF
-			    ("%s: IRQ %d inactive, IRQ %d prio %d above %d/%d, CPU %d\n",
-			     __func__, n_IRQ, dst->raised.next,
-			     dst->raised.priority, dst->ctpr,
-			     dst->servicing.priority, n_CPU);
+			pr_debug("%s: IRQ %d inactive, IRQ %d prio %d above %d/%d, CPU %d\n",
+				__func__, n_IRQ, dst->raised.next,
+				dst->raised.priority, dst->ctpr,
+				dst->servicing.priority, n_CPU);
 			/* IRQ line stays asserted */
 		} else {
-			DPRINTF
-			    ("%s: IRQ %d inactive, current prio %d/%d, CPU %d\n",
-			     __func__, n_IRQ, dst->ctpr,
-			     dst->servicing.priority, n_CPU);
+			pr_debug("%s: IRQ %d inactive, current prio %d/%d, CPU %d\n",
+				__func__, n_IRQ, dst->ctpr,
+				dst->servicing.priority, n_CPU);
 			qemu_irq_lower(opp->dst[n_CPU].
 				       irqs[OPENPIC_OUTPUT_INT]);
 		}
@@ -361,9 +345,9 @@ static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
 }
 
 /* update pic state because registers for n_IRQ have changed value */
-static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
+static void openpic_update_irq(struct openpic *opp, int n_IRQ)
 {
-	IRQSource *src;
+	struct irq_source *src;
 	bool active, was_active;
 	int i;
 
@@ -372,30 +356,29 @@ static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
 
 	if ((src->ivpr & IVPR_MASK_MASK) && !src->nomask) {
 		/* Interrupt source is disabled */
-		DPRINTF("%s: IRQ %d is disabled\n", __func__, n_IRQ);
+		pr_debug("%s: IRQ %d is disabled\n", __func__, n_IRQ);
 		active = false;
 	}
 
-	was_active = ! !(src->ivpr & IVPR_ACTIVITY_MASK);
+	was_active = !!(src->ivpr & IVPR_ACTIVITY_MASK);
 
 	/*
 	 * We don't have a similar check for already-active because
 	 * ctpr may have changed and we need to withdraw the interrupt.
 	 */
 	if (!active && !was_active) {
-		DPRINTF("%s: IRQ %d is already inactive\n", __func__, n_IRQ);
+		pr_debug("%s: IRQ %d is already inactive\n", __func__, n_IRQ);
 		return;
 	}
 
-	if (active) {
+	if (active)
 		src->ivpr |= IVPR_ACTIVITY_MASK;
-	} else {
+	else
 		src->ivpr &= ~IVPR_ACTIVITY_MASK;
-	}
 
 	if (src->destmask == 0) {
 		/* No target */
-		DPRINTF("%s: IRQ %d has no target\n", __func__, n_IRQ);
+		pr_debug("%s: IRQ %d has no target\n", __func__, n_IRQ);
 		return;
 	}
 
@@ -413,9 +396,9 @@ static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
 	} else {
 		/* Distributed delivery mode */
 		for (i = src->last_cpu + 1; i != src->last_cpu; i++) {
-			if (i == opp->nb_cpus) {
+			if (i == opp->nb_cpus)
 				i = 0;
-			}
+
 			if (src->destmask & (1 << i)) {
 				IRQ_local_pipe(opp, i, n_IRQ, active,
 					       was_active);
@@ -428,16 +411,16 @@ static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
 
 static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 {
-	OpenPICState *opp = opaque;
-	IRQSource *src;
+	struct openpic *opp = opaque;
+	struct irq_source *src;
 
 	if (n_IRQ >= MAX_IRQ) {
-		fprintf(stderr, "%s: IRQ %d out of range\n", __func__, n_IRQ);
+		pr_err("%s: IRQ %d out of range\n", __func__, n_IRQ);
 		abort();
 	}
 
 	src = &opp->src[n_IRQ];
-	DPRINTF("openpic: set irq %d = %d ivpr=0x%08x\n",
+	pr_debug("openpic: set irq %d = %d ivpr=0x%08x\n",
 		n_IRQ, level, src->ivpr);
 	if (src->level) {
 		/* level-sensitive irq */
@@ -463,9 +446,9 @@ static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 	}
 }
 
-static void openpic_reset(DeviceState * d)
+static void openpic_reset(DeviceState *d)
 {
-	OpenPICState *opp = FROM_SYSBUS(typeof(*opp), SYS_BUS_DEVICE(d));
+	struct openpic *opp = FROM_SYSBUS(typeof(*opp), SYS_BUS_DEVICE(d));
 	int i;
 
 	opp->gcr = GCR_RESET;
@@ -485,7 +468,7 @@ static void openpic_reset(DeviceState * d)
 		switch (opp->src[i].type) {
 		case IRQ_TYPE_NORMAL:
 			opp->src[i].level =
-			    ! !(opp->ivpr_reset & IVPR_SENSE_MASK);
+			    !!(opp->ivpr_reset & IVPR_SENSE_MASK);
 			break;
 
 		case IRQ_TYPE_FSLINT:
@@ -499,9 +482,9 @@ static void openpic_reset(DeviceState * d)
 	/* Initialise IRQ destinations */
 	for (i = 0; i < MAX_CPU; i++) {
 		opp->dst[i].ctpr = 15;
-		memset(&opp->dst[i].raised, 0, sizeof(IRQQueue));
+		memset(&opp->dst[i].raised, 0, sizeof(struct irq_queue));
 		opp->dst[i].raised.next = -1;
-		memset(&opp->dst[i].servicing, 0, sizeof(IRQQueue));
+		memset(&opp->dst[i].servicing, 0, sizeof(struct irq_queue));
 		opp->dst[i].servicing.next = -1;
 	}
 	/* Initialise timers */
@@ -513,28 +496,28 @@ static void openpic_reset(DeviceState * d)
 	opp->gcr = 0;
 }
 
-static inline uint32_t read_IRQreg_idr(OpenPICState * opp, int n_IRQ)
+static inline uint32_t read_IRQreg_idr(struct openpic *opp, int n_IRQ)
 {
 	return opp->src[n_IRQ].idr;
 }
 
-static inline uint32_t read_IRQreg_ilr(OpenPICState * opp, int n_IRQ)
+static inline uint32_t read_IRQreg_ilr(struct openpic *opp, int n_IRQ)
 {
-	if (opp->flags & OPENPIC_FLAG_ILR) {
+	if (opp->flags & OPENPIC_FLAG_ILR)
 		return output_to_inttgt(opp->src[n_IRQ].output);
-	}
 
 	return 0xffffffff;
 }
 
-static inline uint32_t read_IRQreg_ivpr(OpenPICState * opp, int n_IRQ)
+static inline uint32_t read_IRQreg_ivpr(struct openpic *opp, int n_IRQ)
 {
 	return opp->src[n_IRQ].ivpr;
 }
 
-static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
+static inline void write_IRQreg_idr(struct openpic *opp, int n_IRQ,
+				    uint32_t val)
 {
-	IRQSource *src = &opp->src[n_IRQ];
+	struct irq_source *src = &opp->src[n_IRQ];
 	uint32_t normal_mask = (1UL << opp->nb_cpus) - 1;
 	uint32_t crit_mask = 0;
 	uint32_t mask = normal_mask;
@@ -547,14 +530,13 @@ static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
 	}
 
 	src->idr = val & mask;
-	DPRINTF("Set IDR %d to 0x%08x\n", n_IRQ, src->idr);
+	pr_debug("Set IDR %d to 0x%08x\n", n_IRQ, src->idr);
 
 	if (opp->flags & OPENPIC_FLAG_IDR_CRIT) {
 		if (src->idr & crit_mask) {
 			if (src->idr & normal_mask) {
-				DPRINTF
-				    ("%s: IRQ configured for multiple output types, using "
-				     "critical\n", __func__);
+				pr_debug("%s: IRQ configured for multiple output types, using critical\n",
+					__func__);
 			}
 
 			src->output = OPENPIC_OUTPUT_CINT;
@@ -564,9 +546,8 @@ static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
 			for (i = 0; i < opp->nb_cpus; i++) {
 				int n_ci = IDR_CI0_SHIFT - i;
 
-				if (src->idr & (1UL << n_ci)) {
+				if (src->idr & (1UL << n_ci))
 					src->destmask |= 1UL << i;
-				}
 			}
 		} else {
 			src->output = OPENPIC_OUTPUT_INT;
@@ -578,20 +559,21 @@ static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
 	}
 }
 
-static inline void write_IRQreg_ilr(OpenPICState * opp, int n_IRQ, uint32_t val)
+static inline void write_IRQreg_ilr(struct openpic *opp, int n_IRQ,
+				    uint32_t val)
 {
 	if (opp->flags & OPENPIC_FLAG_ILR) {
-		IRQSource *src = &opp->src[n_IRQ];
+		struct irq_source *src = &opp->src[n_IRQ];
 
 		src->output = inttgt_to_output(val & ILR_INTTGT_MASK);
-		DPRINTF("Set ILR %d to 0x%08x, output %d\n", n_IRQ, src->idr,
+		pr_debug("Set ILR %d to 0x%08x, output %d\n", n_IRQ, src->idr,
 			src->output);
 
 		/* TODO: on MPIC v4.0 only, set nomask for non-INT */
 	}
 }
 
-static inline void write_IRQreg_ivpr(OpenPICState * opp, int n_IRQ,
+static inline void write_IRQreg_ivpr(struct openpic *opp, int n_IRQ,
 				     uint32_t val)
 {
 	uint32_t mask;
@@ -613,7 +595,7 @@ static inline void write_IRQreg_ivpr(OpenPICState * opp, int n_IRQ,
 	switch (opp->src[n_IRQ].type) {
 	case IRQ_TYPE_NORMAL:
 		opp->src[n_IRQ].level =
-		    ! !(opp->src[n_IRQ].ivpr & IVPR_SENSE_MASK);
+		    !!(opp->src[n_IRQ].ivpr & IVPR_SENSE_MASK);
 		break;
 
 	case IRQ_TYPE_FSLINT:
@@ -626,11 +608,11 @@ static inline void write_IRQreg_ivpr(OpenPICState * opp, int n_IRQ,
 	}
 
 	openpic_update_irq(opp, n_IRQ);
-	DPRINTF("Set IVPR %d to 0x%08x -> 0x%08x\n", n_IRQ, val,
+	pr_debug("Set IVPR %d to 0x%08x -> 0x%08x\n", n_IRQ, val,
 		opp->src[n_IRQ].ivpr);
 }
 
-static void openpic_gcr_write(OpenPICState * opp, uint64_t val)
+static void openpic_gcr_write(struct openpic *opp, uint64_t val)
 {
 	bool mpic_proxy = false;
 
@@ -643,27 +625,26 @@ static void openpic_gcr_write(OpenPICState * opp, uint64_t val)
 	opp->gcr |= val & opp->mpic_mode_mask;
 
 	/* Set external proxy mode */
-	if ((val & opp->mpic_mode_mask) == GCR_MODE_PROXY) {
+	if ((val & opp->mpic_mode_mask) == GCR_MODE_PROXY)
 		mpic_proxy = true;
-	}
 
 	ppce500_set_mpic_proxy(mpic_proxy);
 }
 
-static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_gbl_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned len)
 {
-	OpenPICState *opp = opaque;
-	IRQDest *dst;
+	struct openpic *opp = opaque;
+	struct irq_dest *dst;
 	int idx;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
 		__func__, addr, val);
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return;
-	}
+
 	switch (addr) {
-	case 0x00:		/* Block Revision Register1 (BRR1) is Readonly */
+	case 0x00:	/* Block Revision Register1 (BRR1) is Readonly */
 		break;
 	case 0x40:
 	case 0x50:
@@ -685,16 +666,14 @@ static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
 	case 0x1090:		/* PIR */
 		for (idx = 0; idx < opp->nb_cpus; idx++) {
 			if ((val & (1 << idx)) && !(opp->pir & (1 << idx))) {
-				DPRINTF
-				    ("Raise OpenPIC RESET output for CPU %d\n",
-				     idx);
+				pr_debug("Raise OpenPIC RESET output for CPU %d\n",
+					idx);
 				dst = &opp->dst[idx];
 				qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_RESET]);
-			} else if (!(val & (1 << idx))
-				   && (opp->pir & (1 << idx))) {
-				DPRINTF
-				    ("Lower OpenPIC RESET output for CPU %d\n",
-				     idx);
+			} else if (!(val & (1 << idx)) &&
+				   (opp->pir & (1 << idx))) {
+				pr_debug("Lower OpenPIC RESET output for CPU %d\n",
+					idx);
 				dst = &opp->dst[idx];
 				qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_RESET]);
 			}
@@ -704,13 +683,12 @@ static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
 	case 0x10A0:		/* IPI_IVPR */
 	case 0x10B0:
 	case 0x10C0:
-	case 0x10D0:
-		{
-			int idx;
-			idx = (addr - 0x10A0) >> 4;
-			write_IRQreg_ivpr(opp, opp->irq_ipi0 + idx, val);
-		}
+	case 0x10D0: {
+		int idx;
+		idx = (addr - 0x10A0) >> 4;
+		write_IRQreg_ivpr(opp, opp->irq_ipi0 + idx, val);
 		break;
+	}
 	case 0x10E0:		/* SPVE */
 		opp->spve = val & opp->vector_mask;
 		break;
@@ -719,16 +697,16 @@ static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
 	}
 }
 
-static uint64_t openpic_gbl_read(void *opaque, hwaddr addr, unsigned len)
+static uint64_t openpic_gbl_read(void *opaque, gpa_t addr, unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	uint32_t retval;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
 	retval = 0xFFFFFFFF;
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return retval;
-	}
+
 	switch (addr) {
 	case 0x1000:		/* FRR */
 		retval = opp->frr;
@@ -772,24 +750,23 @@ static uint64_t openpic_gbl_read(void *opaque, hwaddr addr, unsigned len)
 	default:
 		break;
 	}
-	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
 	return retval;
 }
 
-static void openpic_tmr_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_tmr_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	int idx;
 
 	addr += 0x10f0;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
 		__func__, addr, val);
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return;
-	}
 
 	if (addr == 0x10f0) {
 		/* TFRR */
@@ -806,9 +783,9 @@ static void openpic_tmr_write(void *opaque, hwaddr addr, uint64_t val,
 	case 0x10:		/* TBCR */
 		if ((opp->timers[idx].tccr & TCCR_TOG) != 0 &&
 		    (val & TBCR_CI) == 0 &&
-		    (opp->timers[idx].tbcr & TBCR_CI) != 0) {
+		    (opp->timers[idx].tbcr & TBCR_CI) != 0)
 			opp->timers[idx].tccr &= ~TCCR_TOG;
-		}
+
 		opp->timers[idx].tbcr = val;
 		break;
 	case 0x20:		/* TVPR */
@@ -820,16 +797,16 @@ static void openpic_tmr_write(void *opaque, hwaddr addr, uint64_t val,
 	}
 }
 
-static uint64_t openpic_tmr_read(void *opaque, hwaddr addr, unsigned len)
+static uint64_t openpic_tmr_read(void *opaque, gpa_t addr, unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	uint32_t retval = -1;
 	int idx;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
-	if (addr & 0xF) {
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	if (addr & 0xF)
 		goto out;
-	}
+
 	idx = (addr >> 6) & 0x3;
 	if (addr == 0x0) {
 		/* TFRR */
@@ -852,18 +829,18 @@ static uint64_t openpic_tmr_read(void *opaque, hwaddr addr, unsigned len)
 	}
 
 out:
-	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
 	return retval;
 }
 
-static void openpic_src_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_src_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	int idx;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
 		__func__, addr, val);
 
 	addr = addr & 0xffff;
@@ -884,11 +861,11 @@ static void openpic_src_write(void *opaque, hwaddr addr, uint64_t val,
 
 static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	uint32_t retval;
 	int idx;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
 	retval = 0xFFFFFFFF;
 
 	addr = addr & 0xffff;
@@ -906,22 +883,21 @@ static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
 		break;
 	}
 
-	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
 	return retval;
 }
 
-static void openpic_msi_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_msi_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned size)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	int idx = opp->irq_msi;
 	int srs, ibs;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
 		__func__, addr, val);
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return;
-	}
 
 	switch (addr) {
 	case MSIIR_OFFSET:
@@ -937,16 +913,15 @@ static void openpic_msi_write(void *opaque, hwaddr addr, uint64_t val,
 	}
 }
 
-static uint64_t openpic_msi_read(void *opaque, hwaddr addr, unsigned size)
+static uint64_t openpic_msi_read(void *opaque, gpa_t addr, unsigned size)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	uint64_t r = 0;
 	int i, srs;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
-	if (addr & 0xF) {
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	if (addr & 0xF)
 		return -1;
-	}
 
 	srs = addr >> 4;
 
@@ -965,53 +940,51 @@ static uint64_t openpic_msi_read(void *opaque, hwaddr addr, unsigned size)
 		openpic_set_irq(opp, opp->irq_msi + srs, 0);
 		break;
 	case 0x120:		/* MSISR */
-		for (i = 0; i < MAX_MSI; i++) {
+		for (i = 0; i < MAX_MSI; i++)
 			r |= (opp->msi[i].msir ? 1 : 0) << i;
-		}
 		break;
 	}
 
 	return r;
 }
 
-static uint64_t openpic_summary_read(void *opaque, hwaddr addr, unsigned size)
+static uint64_t openpic_summary_read(void *opaque, gpa_t addr, unsigned size)
 {
 	uint64_t r = 0;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
 
 	/* TODO: EISR/EIMR */
 
 	return r;
 }
 
-static void openpic_summary_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_summary_write(void *opaque, gpa_t addr, uint64_t val,
 				  unsigned size)
 {
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
 		__func__, addr, val);
 
 	/* TODO: EISR/EIMR */
 }
 
-static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
+static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
 				       uint32_t val, int idx)
 {
-	OpenPICState *opp = opaque;
-	IRQSource *src;
-	IRQDest *dst;
+	struct openpic *opp = opaque;
+	struct irq_source *src;
+	struct irq_dest *dst;
 	int s_IRQ, n_IRQ;
 
-	DPRINTF("%s: cpu %d addr %#" HWADDR_PRIx " <= 0x%08x\n", __func__, idx,
+	pr_debug("%s: cpu %d addr %#" HWADDR_PRIx " <= 0x%08x\n", __func__, idx,
 		addr, val);
 
-	if (idx < 0) {
+	if (idx < 0)
 		return;
-	}
 
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return;
-	}
+
 	dst = &opp->dst[idx];
 	addr &= 0xFF0;
 	switch (addr) {
@@ -1028,17 +1001,16 @@ static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
 	case 0x80:		/* CTPR */
 		dst->ctpr = val & 0x0000000F;
 
-		DPRINTF("%s: set CPU %d ctpr to %d, raised %d servicing %d\n",
+		pr_debug("%s: set CPU %d ctpr to %d, raised %d servicing %d\n",
 			__func__, idx, dst->ctpr, dst->raised.priority,
 			dst->servicing.priority);
 
 		if (dst->raised.priority <= dst->ctpr) {
-			DPRINTF
-			    ("%s: Lower OpenPIC INT output cpu %d due to ctpr\n",
-			     __func__, idx);
+			pr_debug("%s: Lower OpenPIC INT output cpu %d due to ctpr\n",
+				__func__, idx);
 			qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
 		} else if (dst->raised.priority > dst->servicing.priority) {
-			DPRINTF("%s: Raise OpenPIC INT output cpu %d irq %d\n",
+			pr_debug("%s: Raise OpenPIC INT output cpu %d irq %d\n",
 				__func__, idx, dst->raised.next);
 			qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_INT]);
 		}
@@ -1051,11 +1023,11 @@ static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
 		/* Read-only register */
 		break;
 	case 0xB0:		/* EOI */
-		DPRINTF("EOI\n");
+		pr_debug("EOI\n");
 		s_IRQ = IRQ_get_next(opp, &dst->servicing);
 
 		if (s_IRQ < 0) {
-			DPRINTF("%s: EOI with no interrupt in service\n",
+			pr_debug("%s: EOI with no interrupt in service\n",
 				__func__);
 			break;
 		}
@@ -1069,7 +1041,7 @@ static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
 		if (n_IRQ != -1 &&
 		    (s_IRQ == -1 ||
 		     IVPR_PRIORITY(src->ivpr) > dst->servicing.priority)) {
-			DPRINTF("Raise OpenPIC INT output cpu %d irq %d\n",
+			pr_debug("Raise OpenPIC INT output cpu %d irq %d\n",
 				idx, n_IRQ);
 			qemu_irq_raise(opp->dst[idx].irqs[OPENPIC_OUTPUT_INT]);
 		}
@@ -1079,32 +1051,32 @@ static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
 	}
 }
 
-static void openpic_cpu_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_cpu_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned len)
 {
 	openpic_cpu_write_internal(opaque, addr, val, (addr & 0x1f000) >> 12);
 }
 
-static uint32_t openpic_iack(OpenPICState * opp, IRQDest * dst, int cpu)
+static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
+			     int cpu)
 {
-	IRQSource *src;
+	struct irq_source *src;
 	int retval, irq;
 
-	DPRINTF("Lower OpenPIC INT output\n");
+	pr_debug("Lower OpenPIC INT output\n");
 	qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
 
 	irq = IRQ_get_next(opp, &dst->raised);
-	DPRINTF("IACK: irq=%d\n", irq);
+	pr_debug("IACK: irq=%d\n", irq);
 
-	if (irq == -1) {
+	if (irq == -1)
 		/* No more interrupt pending */
 		return opp->spve;
-	}
 
 	src = &opp->src[irq];
 	if (!(src->ivpr & IVPR_ACTIVITY_MASK) ||
 	    !(IVPR_PRIORITY(src->ivpr) > dst->ctpr)) {
-		fprintf(stderr, "%s: bad raised IRQ %d ctpr %d ivpr 0x%08x\n",
+		pr_err("%s: bad raised IRQ %d ctpr %d ivpr 0x%08x\n",
 			__func__, irq, dst->ctpr, src->ivpr);
 		openpic_update_irq(opp, irq);
 		retval = opp->spve;
@@ -1135,22 +1107,21 @@ static uint32_t openpic_iack(OpenPICState * opp, IRQDest * dst, int cpu)
 	return retval;
 }
 
-static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx)
+static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx)
 {
-	OpenPICState *opp = opaque;
-	IRQDest *dst;
+	struct openpic *opp = opaque;
+	struct irq_dest *dst;
 	uint32_t retval;
 
-	DPRINTF("%s: cpu %d addr %#" HWADDR_PRIx "\n", __func__, idx, addr);
+	pr_debug("%s: cpu %d addr %#" HWADDR_PRIx "\n", __func__, idx, addr);
 	retval = 0xFFFFFFFF;
 
-	if (idx < 0) {
+	if (idx < 0)
 		return retval;
-	}
 
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return retval;
-	}
+
 	dst = &opp->dst[idx];
 	addr &= 0xFF0;
 	switch (addr) {
@@ -1169,54 +1140,54 @@ static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx)
 	default:
 		break;
 	}
-	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
 	return retval;
 }
 
-static uint64_t openpic_cpu_read(void *opaque, hwaddr addr, unsigned len)
+static uint64_t openpic_cpu_read(void *opaque, gpa_t addr, unsigned len)
 {
 	return openpic_cpu_read_internal(opaque, addr, (addr & 0x1f000) >> 12);
 }
 
-static const MemoryRegionOps openpic_glb_ops_be = {
+static const struct kvm_io_device_ops openpic_glb_ops_be = {
 	.write = openpic_gbl_write,
 	.read = openpic_gbl_read,
 };
 
-static const MemoryRegionOps openpic_tmr_ops_be = {
+static const struct kvm_io_device_ops openpic_tmr_ops_be = {
 	.write = openpic_tmr_write,
 	.read = openpic_tmr_read,
 };
 
-static const MemoryRegionOps openpic_cpu_ops_be = {
+static const struct kvm_io_device_ops openpic_cpu_ops_be = {
 	.write = openpic_cpu_write,
 	.read = openpic_cpu_read,
 };
 
-static const MemoryRegionOps openpic_src_ops_be = {
+static const struct kvm_io_device_ops openpic_src_ops_be = {
 	.write = openpic_src_write,
 	.read = openpic_src_read,
 };
 
-static const MemoryRegionOps openpic_msi_ops_be = {
+static const struct kvm_io_device_ops openpic_msi_ops_be = {
 	.read = openpic_msi_read,
 	.write = openpic_msi_write,
 };
 
-static const MemoryRegionOps openpic_summary_ops_be = {
+static const struct kvm_io_device_ops openpic_summary_ops_be = {
 	.read = openpic_summary_read,
 	.write = openpic_summary_write,
 };
 
-typedef struct MemReg {
+struct mem_reg {
 	const char *name;
-	MemoryRegionOps const *ops;
-	hwaddr start_addr;
-	ram_addr_t size;
-} MemReg;
+	const struct kvm_io_device_ops *ops;
+	gpa_t start_addr;
+	int size;
+};
 
-static void fsl_common_init(OpenPICState * opp)
+static void fsl_common_init(struct openpic *opp)
 {
 	int i;
 	int virq = MAX_SRC;
@@ -1239,9 +1210,8 @@ static void fsl_common_init(OpenPICState * opp)
 	opp->irq_msi = 224;
 
 	msi_supported = true;
-	for (i = 0; i < opp->fsl->max_ext; i++) {
+	for (i = 0; i < opp->fsl->max_ext; i++)
 		opp->src[i].level = false;
-	}
 
 	/* Internal interrupts, including message and MSI */
 	for (i = 16; i < MAX_SRC; i++) {
@@ -1256,7 +1226,8 @@ static void fsl_common_init(OpenPICState * opp)
 	}
 }
 
-static void map_list(OpenPICState * opp, const MemReg * list, int *count)
+static void map_list(struct openpic *opp, const struct mem_reg *list,
+		     int *count)
 {
 	while (list->name) {
 		assert(*count < ARRAY_SIZE(opp->sub_io_mem));
@@ -1272,12 +1243,12 @@ static void map_list(OpenPICState * opp, const MemReg * list, int *count)
 	}
 }
 
-static int openpic_init(SysBusDevice * dev)
+static int openpic_init(SysBusDevice *dev)
 {
-	OpenPICState *opp = FROM_SYSBUS(typeof(*opp), dev);
+	struct openpic *opp = FROM_SYSBUS(typeof(*opp), dev);
 	int i, j;
 	int list_count = 0;
-	static const MemReg list_le[] = {
+	static const struct mem_reg list_le[] = {
 		{"glb", &openpic_glb_ops_le,
 		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
 		{"tmr", &openpic_tmr_ops_le,
@@ -1288,7 +1259,7 @@ static int openpic_init(SysBusDevice * dev)
 		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
 		{NULL}
 	};
-	static const MemReg list_be[] = {
+	static const struct mem_reg list_be[] = {
 		{"glb", &openpic_glb_ops_be,
 		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
 		{"tmr", &openpic_tmr_ops_be,
@@ -1299,7 +1270,7 @@ static int openpic_init(SysBusDevice * dev)
 		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
 		{NULL}
 	};
-	static const MemReg list_fsl[] = {
+	static const struct mem_reg list_fsl[] = {
 		{"msi", &openpic_msi_ops_be,
 		 OPENPIC_MSI_REG_START, OPENPIC_MSI_REG_SIZE},
 		{"summary", &openpic_summary_ops_be,
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 261+ messages in thread

* [PATCH v4 4/6] kvm/ppc/mpic: adapt to kernel style and environment
@ 2013-04-13  0:08         ` Scott Wood
  0 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-13  0:08 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus, Scott Wood

Remove braces that Linux style doesn't permit, remove space after
'*' that Lindent added, keep error/debug strings contiguous, etc.

Substitute type names, debug prints, etc.

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
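Note (illustrative only, not part of the patch): the conversions listed
in the commit message can be sketched in a small stand-alone userspace
example. It assumes hypothetical demo_* names and a fixed 64-entry
queue; it is loosely modelled on IRQ_check()/IRQ_get_next() and is not
the code in mpic.c. It shows a plain struct tag in place of a typedef,
'*' attached to the parameter name, and no braces around
single-statement bodies.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MAX_IRQ 64	/* hypothetical size, not the patch's MAX_IRQ */

/* Kernel style: a plain struct tag instead of "typedef struct IRQQueue". */
struct demo_irq_queue {
	uint64_t pending;	/* one bit per IRQ */
	int prio[DEMO_MAX_IRQ];	/* priority of each IRQ */
	int next;		/* highest-priority pending IRQ, or -1 */
	int priority;		/* priority of that IRQ, or -1 */
};

/* '*' binds to the parameter name: "struct demo_irq_queue *q", not "* q". */
static void demo_irq_set(struct demo_irq_queue *q, int irq, int prio)
{
	assert(irq >= 0 && irq < DEMO_MAX_IRQ);
	q->pending |= UINT64_C(1) << irq;
	q->prio[irq] = prio;
}

/* Scan for the highest-priority pending IRQ; the first one wins a tie. */
static void demo_irq_check(struct demo_irq_queue *q)
{
	int irq;
	int next = -1;
	int priority = -1;

	for (irq = 0; irq < DEMO_MAX_IRQ; irq++) {
		/* Single-statement bodies take no braces. */
		if (!(q->pending & (UINT64_C(1) << irq)))
			continue;

		if (q->prio[irq] > priority) {
			next = irq;
			priority = q->prio[irq];
		}
	}

	q->next = next;
	q->priority = priority;
}

/* Return the next IRQ to deliver, or -1 if none is pending. */
static int demo_irq_get_next(struct demo_irq_queue *q)
{
	demo_irq_check(q);
	return q->next;
}
```

A caller would zero-initialise a struct demo_irq_queue, mark sources
with demo_irq_set(), and then ask demo_irq_get_next() which one to
deliver.
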
 arch/powerpc/kvm/mpic.c |  445 ++++++++++++++++++++++-------------------------
 1 file changed, 208 insertions(+), 237 deletions(-)

diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index d6d70a4..1df67ae 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -42,22 +42,22 @@
 #define OPENPIC_TMR_REG_SIZE         0x220
 #define OPENPIC_MSI_REG_START        0x1600
 #define OPENPIC_MSI_REG_SIZE         0x200
-#define OPENPIC_SUMMARY_REG_START   0x3800
-#define OPENPIC_SUMMARY_REG_SIZE    0x800
+#define OPENPIC_SUMMARY_REG_START    0x3800
+#define OPENPIC_SUMMARY_REG_SIZE     0x800
 #define OPENPIC_SRC_REG_START        0x10000
 #define OPENPIC_SRC_REG_SIZE         (MAX_SRC * 0x20)
 #define OPENPIC_CPU_REG_START        0x20000
-#define OPENPIC_CPU_REG_SIZE         0x100 + ((MAX_CPU - 1) * 0x1000)
+#define OPENPIC_CPU_REG_SIZE         (0x100 + ((MAX_CPU - 1) * 0x1000))
 
-typedef struct FslMpicInfo {
+struct fsl_mpic_info {
 	int max_ext;
-} FslMpicInfo;
+};
 
-static FslMpicInfo fsl_mpic_20 = {
+static struct fsl_mpic_info fsl_mpic_20 = {
 	.max_ext = 12,
 };
 
-static FslMpicInfo fsl_mpic_42 = {
+static struct fsl_mpic_info fsl_mpic_42 = {
 	.max_ext = 12,
 };
 
@@ -100,44 +100,43 @@ static int get_current_cpu(void)
 {
 	CPUState *cpu_single_cpu;
 
-	if (!cpu_single_env) {
+	if (!cpu_single_env)
 		return -1;
-	}
 
 	cpu_single_cpu = ENV_GET_CPU(cpu_single_env);
 	return cpu_single_cpu->cpu_index;
 }
 
-static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx);
-static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
+static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx);
+static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
 				       uint32_t val, int idx);
 
-typedef enum IRQType {
+enum irq_type {
 	IRQ_TYPE_NORMAL = 0,
 	IRQ_TYPE_FSLINT,	/* FSL internal interrupt -- level only */
 	IRQ_TYPE_FSLSPECIAL,	/* FSL timer/IPI interrupt, edge, no polarity */
-} IRQType;
+};
 
-typedef struct IRQQueue {
+struct irq_queue {
 	/* Round up to the nearest 64 IRQs so that the queue length
 	 * won't change when moving between 32 and 64 bit hosts.
 	 */
 	unsigned long queue[BITS_TO_LONGS((MAX_IRQ + 63) & ~63)];
 	int next;
 	int priority;
-} IRQQueue;
+};
 
-typedef struct IRQSource {
+struct irq_source {
 	uint32_t ivpr;		/* IRQ vector/priority register */
 	uint32_t idr;		/* IRQ destination register */
 	uint32_t destmask;	/* bitmap of CPU destinations */
 	int last_cpu;
 	int output;		/* IRQ level, e.g. OPENPIC_OUTPUT_INT */
 	int pending;		/* TRUE if IRQ is pending */
-	IRQType type;
+	enum irq_type type;
 	bool level:1;		/* level-triggered */
-	bool nomask:1;		/* critical interrupts ignore mask on some FSL MPICs */
-} IRQSource;
+	bool nomask:1;	/* critical interrupts ignore mask on some FSL MPICs */
+};
 
 #define IVPR_MASK_SHIFT       31
 #define IVPR_MASK_MASK        (1 << IVPR_MASK_SHIFT)
@@ -158,22 +157,19 @@ typedef struct IRQSource {
 #define IDR_EP      0x80000000	/* external pin */
 #define IDR_CI      0x40000000	/* critical interrupt */
 
-typedef struct IRQDest {
+struct irq_dest {
 	int32_t ctpr;		/* CPU current task priority */
-	IRQQueue raised;
-	IRQQueue servicing;
+	struct irq_queue raised;
+	struct irq_queue servicing;
 	qemu_irq *irqs;
 
 	/* Count of IRQ sources asserting on non-INT outputs */
 	uint32_t outputs_active[OPENPIC_OUTPUT_NB];
-} IRQDest;
-
-typedef struct OpenPICState {
-	SysBusDevice busdev;
-	MemoryRegion mem;
+};
 
+struct openpic {
 	/* Behavior control */
-	FslMpicInfo *fsl;
+	struct fsl_mpic_info *fsl;
 	uint32_t model;
 	uint32_t flags;
 	uint32_t nb_irqs;
@@ -186,9 +182,6 @@ typedef struct OpenPICState {
 	uint32_t brr1;
 	uint32_t mpic_mode_mask;
 
-	/* Sub-regions */
-	MemoryRegion sub_io_mem[6];
-
 	/* Global registers */
 	uint32_t frr;		/* Feature reporting register */
 	uint32_t gcr;		/* Global configuration register  */
@@ -196,9 +189,9 @@ typedef struct OpenPICState {
 	uint32_t spve;		/* Spurious vector register */
 	uint32_t tfrr;		/* Timer frequency reporting register */
 	/* Source registers */
-	IRQSource src[MAX_IRQ];
+	struct irq_source src[MAX_IRQ];
 	/* Local registers per output pin */
-	IRQDest dst[MAX_CPU];
+	struct irq_dest dst[MAX_CPU];
 	uint32_t nb_cpus;
 	/* Timer registers */
 	struct {
@@ -213,24 +206,24 @@ typedef struct OpenPICState {
 	uint32_t irq_ipi0;
 	uint32_t irq_tim0;
 	uint32_t irq_msi;
-} OpenPICState;
+};
 
-static inline void IRQ_setbit(IRQQueue * q, int n_IRQ)
+static inline void IRQ_setbit(struct irq_queue *q, int n_IRQ)
 {
 	set_bit(n_IRQ, q->queue);
 }
 
-static inline void IRQ_resetbit(IRQQueue * q, int n_IRQ)
+static inline void IRQ_resetbit(struct irq_queue *q, int n_IRQ)
 {
 	clear_bit(n_IRQ, q->queue);
 }
 
-static inline int IRQ_testbit(IRQQueue * q, int n_IRQ)
+static inline int IRQ_testbit(struct irq_queue *q, int n_IRQ)
 {
 	return test_bit(n_IRQ, q->queue);
 }
 
-static void IRQ_check(OpenPICState * opp, IRQQueue * q)
+static void IRQ_check(struct openpic *opp, struct irq_queue *q)
 {
 	int irq = -1;
 	int next = -1;
@@ -238,11 +231,10 @@ static void IRQ_check(OpenPICState * opp, IRQQueue * q)
 
 	for (;;) {
 		irq = find_next_bit(q->queue, opp->max_irq, irq + 1);
-		if (irq == opp->max_irq) {
+		if (irq == opp->max_irq)
 			break;
-		}
 
-		DPRINTF("IRQ_check: irq %d set ivpr_pr=%d pr=%d\n",
+		pr_debug("IRQ_check: irq %d set ivpr_pr=%d pr=%d\n",
 			irq, IVPR_PRIORITY(opp->src[irq].ivpr), priority);
 
 		if (IVPR_PRIORITY(opp->src[irq].ivpr) > priority) {
@@ -255,7 +247,7 @@ static void IRQ_check(OpenPICState * opp, IRQQueue * q)
 	q->priority = priority;
 }
 
-static int IRQ_get_next(OpenPICState * opp, IRQQueue * q)
+static int IRQ_get_next(struct openpic *opp, struct irq_queue *q)
 {
 	/* XXX: optimize */
 	IRQ_check(opp, q);
@@ -263,21 +255,21 @@ static int IRQ_get_next(OpenPICState * opp, IRQQueue * q)
 	return q->next;
 }
 
-static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
+static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 			   bool active, bool was_active)
 {
-	IRQDest *dst;
-	IRQSource *src;
+	struct irq_dest *dst;
+	struct irq_source *src;
 	int priority;
 
 	dst = &opp->dst[n_CPU];
 	src = &opp->src[n_IRQ];
 
-	DPRINTF("%s: IRQ %d active %d was %d\n",
+	pr_debug("%s: IRQ %d active %d was %d\n",
 		__func__, n_IRQ, active, was_active);
 
 	if (src->output != OPENPIC_OUTPUT_INT) {
-		DPRINTF("%s: output %d irq %d active %d was %d count %d\n",
+		pr_debug("%s: output %d irq %d active %d was %d count %d\n",
 			__func__, src->output, n_IRQ, active, was_active,
 			dst->outputs_active[src->output]);
 
@@ -286,19 +278,17 @@ static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
 		 * masking.
 		 */
 		if (active) {
-			if (!was_active
-			    && dst->outputs_active[src->output]++ == 0) {
-				DPRINTF
-				    ("%s: Raise OpenPIC output %d cpu %d irq %d\n",
-				     __func__, src->output, n_CPU, n_IRQ);
+			if (!was_active &&
+			    dst->outputs_active[src->output]++ == 0) {
+				pr_debug("%s: Raise OpenPIC output %d cpu %d irq %d\n",
+					__func__, src->output, n_CPU, n_IRQ);
 				qemu_irq_raise(dst->irqs[src->output]);
 			}
 		} else {
-			if (was_active
-			    && --dst->outputs_active[src->output] == 0) {
-				DPRINTF
-				    ("%s: Lower OpenPIC output %d cpu %d irq %d\n",
-				     __func__, src->output, n_CPU, n_IRQ);
+			if (was_active &&
+			    --dst->outputs_active[src->output] == 0) {
+				pr_debug("%s: Lower OpenPIC output %d cpu %d irq %d\n",
+					__func__, src->output, n_CPU, n_IRQ);
 				qemu_irq_lower(dst->irqs[src->output]);
 			}
 		}
@@ -311,31 +301,27 @@ static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
 	/* Even if the interrupt doesn't have enough priority,
 	 * it is still raised, in case ctpr is lowered later.
 	 */
-	if (active) {
+	if (active)
 		IRQ_setbit(&dst->raised, n_IRQ);
-	} else {
+	else
 		IRQ_resetbit(&dst->raised, n_IRQ);
-	}
 
 	IRQ_check(opp, &dst->raised);
 
 	if (active && priority <= dst->ctpr) {
-		DPRINTF
-		    ("%s: IRQ %d priority %d too low for ctpr %d on CPU %d\n",
-		     __func__, n_IRQ, priority, dst->ctpr, n_CPU);
+		pr_debug("%s: IRQ %d priority %d too low for ctpr %d on CPU %d\n",
+			__func__, n_IRQ, priority, dst->ctpr, n_CPU);
 		active = 0;
 	}
 
 	if (active) {
 		if (IRQ_get_next(opp, &dst->servicing) >= 0 &&
 		    priority <= dst->servicing.priority) {
-			DPRINTF
-			    ("%s: IRQ %d is hidden by servicing IRQ %d on CPU %d\n",
-			     __func__, n_IRQ, dst->servicing.next, n_CPU);
+			pr_debug("%s: IRQ %d is hidden by servicing IRQ %d on CPU %d\n",
+				__func__, n_IRQ, dst->servicing.next, n_CPU);
 		} else {
-			DPRINTF
-			    ("%s: Raise OpenPIC INT output cpu %d irq %d/%d\n",
-			     __func__, n_CPU, n_IRQ, dst->raised.next);
+			pr_debug("%s: Raise OpenPIC INT output cpu %d irq %d/%d\n",
+				__func__, n_CPU, n_IRQ, dst->raised.next);
 			qemu_irq_raise(opp->dst[n_CPU].
 				       irqs[OPENPIC_OUTPUT_INT]);
 		}
@@ -343,17 +329,15 @@ static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
 		IRQ_get_next(opp, &dst->servicing);
 		if (dst->raised.priority > dst->ctpr &&
 		    dst->raised.priority > dst->servicing.priority) {
-			DPRINTF
-			    ("%s: IRQ %d inactive, IRQ %d prio %d above %d/%d, CPU %d\n",
-			     __func__, n_IRQ, dst->raised.next,
-			     dst->raised.priority, dst->ctpr,
-			     dst->servicing.priority, n_CPU);
+			pr_debug("%s: IRQ %d inactive, IRQ %d prio %d above %d/%d, CPU %d\n",
+				__func__, n_IRQ, dst->raised.next,
+				dst->raised.priority, dst->ctpr,
+				dst->servicing.priority, n_CPU);
 			/* IRQ line stays asserted */
 		} else {
-			DPRINTF
-			    ("%s: IRQ %d inactive, current prio %d/%d, CPU %d\n",
-			     __func__, n_IRQ, dst->ctpr,
-			     dst->servicing.priority, n_CPU);
+			pr_debug("%s: IRQ %d inactive, current prio %d/%d, CPU %d\n",
+				__func__, n_IRQ, dst->ctpr,
+				dst->servicing.priority, n_CPU);
 			qemu_irq_lower(opp->dst[n_CPU].
 				       irqs[OPENPIC_OUTPUT_INT]);
 		}
@@ -361,9 +345,9 @@ static void IRQ_local_pipe(OpenPICState * opp, int n_CPU, int n_IRQ,
 }
 
 /* update pic state because registers for n_IRQ have changed value */
-static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
+static void openpic_update_irq(struct openpic *opp, int n_IRQ)
 {
-	IRQSource *src;
+	struct irq_source *src;
 	bool active, was_active;
 	int i;
 
@@ -372,30 +356,29 @@ static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
 
 	if ((src->ivpr & IVPR_MASK_MASK) && !src->nomask) {
 		/* Interrupt source is disabled */
-		DPRINTF("%s: IRQ %d is disabled\n", __func__, n_IRQ);
+		pr_debug("%s: IRQ %d is disabled\n", __func__, n_IRQ);
 		active = false;
 	}
 
-	was_active = ! !(src->ivpr & IVPR_ACTIVITY_MASK);
+	was_active = !!(src->ivpr & IVPR_ACTIVITY_MASK);
 
 	/*
 	 * We don't have a similar check for already-active because
 	 * ctpr may have changed and we need to withdraw the interrupt.
 	 */
 	if (!active && !was_active) {
-		DPRINTF("%s: IRQ %d is already inactive\n", __func__, n_IRQ);
+		pr_debug("%s: IRQ %d is already inactive\n", __func__, n_IRQ);
 		return;
 	}
 
-	if (active) {
+	if (active)
 		src->ivpr |= IVPR_ACTIVITY_MASK;
-	} else {
+	else
 		src->ivpr &= ~IVPR_ACTIVITY_MASK;
-	}
 
 	if (src->destmask == 0) {
 		/* No target */
-		DPRINTF("%s: IRQ %d has no target\n", __func__, n_IRQ);
+		pr_debug("%s: IRQ %d has no target\n", __func__, n_IRQ);
 		return;
 	}
 
@@ -413,9 +396,9 @@ static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
 	} else {
 		/* Distributed delivery mode */
 		for (i = src->last_cpu + 1; i != src->last_cpu; i++) {
-			if (i == opp->nb_cpus) {
+			if (i == opp->nb_cpus)
 				i = 0;
-			}
+
 			if (src->destmask & (1 << i)) {
 				IRQ_local_pipe(opp, i, n_IRQ, active,
 					       was_active);
@@ -428,16 +411,16 @@ static void openpic_update_irq(OpenPICState * opp, int n_IRQ)
 
 static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 {
-	OpenPICState *opp = opaque;
-	IRQSource *src;
+	struct openpic *opp = opaque;
+	struct irq_source *src;
 
 	if (n_IRQ >= MAX_IRQ) {
-		fprintf(stderr, "%s: IRQ %d out of range\n", __func__, n_IRQ);
+		pr_err("%s: IRQ %d out of range\n", __func__, n_IRQ);
 		abort();
 	}
 
 	src = &opp->src[n_IRQ];
-	DPRINTF("openpic: set irq %d = %d ivpr=0x%08x\n",
+	pr_debug("openpic: set irq %d = %d ivpr=0x%08x\n",
 		n_IRQ, level, src->ivpr);
 	if (src->level) {
 		/* level-sensitive irq */
@@ -463,9 +446,9 @@ static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 	}
 }
 
-static void openpic_reset(DeviceState * d)
+static void openpic_reset(DeviceState *d)
 {
-	OpenPICState *opp = FROM_SYSBUS(typeof(*opp), SYS_BUS_DEVICE(d));
+	struct openpic *opp = FROM_SYSBUS(typeof(*opp), SYS_BUS_DEVICE(d));
 	int i;
 
 	opp->gcr = GCR_RESET;
@@ -485,7 +468,7 @@ static void openpic_reset(DeviceState * d)
 		switch (opp->src[i].type) {
 		case IRQ_TYPE_NORMAL:
 			opp->src[i].level =
-			    ! !(opp->ivpr_reset & IVPR_SENSE_MASK);
+			    !!(opp->ivpr_reset & IVPR_SENSE_MASK);
 			break;
 
 		case IRQ_TYPE_FSLINT:
@@ -499,9 +482,9 @@ static void openpic_reset(DeviceState * d)
 	/* Initialise IRQ destinations */
 	for (i = 0; i < MAX_CPU; i++) {
 		opp->dst[i].ctpr = 15;
-		memset(&opp->dst[i].raised, 0, sizeof(IRQQueue));
+		memset(&opp->dst[i].raised, 0, sizeof(struct irq_queue));
 		opp->dst[i].raised.next = -1;
-		memset(&opp->dst[i].servicing, 0, sizeof(IRQQueue));
+		memset(&opp->dst[i].servicing, 0, sizeof(struct irq_queue));
 		opp->dst[i].servicing.next = -1;
 	}
 	/* Initialise timers */
@@ -513,28 +496,28 @@ static void openpic_reset(DeviceState * d)
 	opp->gcr = 0;
 }
 
-static inline uint32_t read_IRQreg_idr(OpenPICState * opp, int n_IRQ)
+static inline uint32_t read_IRQreg_idr(struct openpic *opp, int n_IRQ)
 {
 	return opp->src[n_IRQ].idr;
 }
 
-static inline uint32_t read_IRQreg_ilr(OpenPICState * opp, int n_IRQ)
+static inline uint32_t read_IRQreg_ilr(struct openpic *opp, int n_IRQ)
 {
-	if (opp->flags & OPENPIC_FLAG_ILR) {
+	if (opp->flags & OPENPIC_FLAG_ILR)
 		return output_to_inttgt(opp->src[n_IRQ].output);
-	}
 
 	return 0xffffffff;
 }
 
-static inline uint32_t read_IRQreg_ivpr(OpenPICState * opp, int n_IRQ)
+static inline uint32_t read_IRQreg_ivpr(struct openpic *opp, int n_IRQ)
 {
 	return opp->src[n_IRQ].ivpr;
 }
 
-static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
+static inline void write_IRQreg_idr(struct openpic *opp, int n_IRQ,
+				    uint32_t val)
 {
-	IRQSource *src = &opp->src[n_IRQ];
+	struct irq_source *src = &opp->src[n_IRQ];
 	uint32_t normal_mask = (1UL << opp->nb_cpus) - 1;
 	uint32_t crit_mask = 0;
 	uint32_t mask = normal_mask;
@@ -547,14 +530,13 @@ static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
 	}
 
 	src->idr = val & mask;
-	DPRINTF("Set IDR %d to 0x%08x\n", n_IRQ, src->idr);
+	pr_debug("Set IDR %d to 0x%08x\n", n_IRQ, src->idr);
 
 	if (opp->flags & OPENPIC_FLAG_IDR_CRIT) {
 		if (src->idr & crit_mask) {
 			if (src->idr & normal_mask) {
-				DPRINTF
-				    ("%s: IRQ configured for multiple output types, using "
-				     "critical\n", __func__);
+				pr_debug("%s: IRQ configured for multiple output types, using critical\n",
+					__func__);
 			}
 
 			src->output = OPENPIC_OUTPUT_CINT;
@@ -564,9 +546,8 @@ static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
 			for (i = 0; i < opp->nb_cpus; i++) {
 				int n_ci = IDR_CI0_SHIFT - i;
 
-				if (src->idr & (1UL << n_ci)) {
+				if (src->idr & (1UL << n_ci))
 					src->destmask |= 1UL << i;
-				}
 			}
 		} else {
 			src->output = OPENPIC_OUTPUT_INT;
@@ -578,20 +559,21 @@ static inline void write_IRQreg_idr(OpenPICState * opp, int n_IRQ, uint32_t val)
 	}
 }
 
-static inline void write_IRQreg_ilr(OpenPICState * opp, int n_IRQ, uint32_t val)
+static inline void write_IRQreg_ilr(struct openpic *opp, int n_IRQ,
+				    uint32_t val)
 {
 	if (opp->flags & OPENPIC_FLAG_ILR) {
-		IRQSource *src = &opp->src[n_IRQ];
+		struct irq_source *src = &opp->src[n_IRQ];
 
 		src->output = inttgt_to_output(val & ILR_INTTGT_MASK);
-		DPRINTF("Set ILR %d to 0x%08x, output %d\n", n_IRQ, src->idr,
+		pr_debug("Set ILR %d to 0x%08x, output %d\n", n_IRQ, src->idr,
 			src->output);
 
 		/* TODO: on MPIC v4.0 only, set nomask for non-INT */
 	}
 }
 
-static inline void write_IRQreg_ivpr(OpenPICState * opp, int n_IRQ,
+static inline void write_IRQreg_ivpr(struct openpic *opp, int n_IRQ,
 				     uint32_t val)
 {
 	uint32_t mask;
@@ -613,7 +595,7 @@ static inline void write_IRQreg_ivpr(OpenPICState * opp, int n_IRQ,
 	switch (opp->src[n_IRQ].type) {
 	case IRQ_TYPE_NORMAL:
 		opp->src[n_IRQ].level =
-		    ! !(opp->src[n_IRQ].ivpr & IVPR_SENSE_MASK);
+		    !!(opp->src[n_IRQ].ivpr & IVPR_SENSE_MASK);
 		break;
 
 	case IRQ_TYPE_FSLINT:
@@ -626,11 +608,11 @@ static inline void write_IRQreg_ivpr(OpenPICState * opp, int n_IRQ,
 	}
 
 	openpic_update_irq(opp, n_IRQ);
-	DPRINTF("Set IVPR %d to 0x%08x -> 0x%08x\n", n_IRQ, val,
+	pr_debug("Set IVPR %d to 0x%08x -> 0x%08x\n", n_IRQ, val,
 		opp->src[n_IRQ].ivpr);
 }
 
-static void openpic_gcr_write(OpenPICState * opp, uint64_t val)
+static void openpic_gcr_write(struct openpic *opp, uint64_t val)
 {
 	bool mpic_proxy = false;
 
@@ -643,27 +625,26 @@ static void openpic_gcr_write(OpenPICState * opp, uint64_t val)
 	opp->gcr |= val & opp->mpic_mode_mask;
 
 	/* Set external proxy mode */
-	if ((val & opp->mpic_mode_mask) == GCR_MODE_PROXY) {
+	if ((val & opp->mpic_mode_mask) == GCR_MODE_PROXY)
 		mpic_proxy = true;
-	}
 
 	ppce500_set_mpic_proxy(mpic_proxy);
 }
 
-static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_gbl_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned len)
 {
-	OpenPICState *opp = opaque;
-	IRQDest *dst;
+	struct openpic *opp = opaque;
+	struct irq_dest *dst;
 	int idx;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
 		__func__, addr, val);
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return;
-	}
+
 	switch (addr) {
-	case 0x00:		/* Block Revision Register1 (BRR1) is Readonly */
+	case 0x00:	/* Block Revision Register1 (BRR1) is Readonly */
 		break;
 	case 0x40:
 	case 0x50:
@@ -685,16 +666,14 @@ static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
 	case 0x1090:		/* PIR */
 		for (idx = 0; idx < opp->nb_cpus; idx++) {
 			if ((val & (1 << idx)) && !(opp->pir & (1 << idx))) {
-				DPRINTF
-				    ("Raise OpenPIC RESET output for CPU %d\n",
-				     idx);
+				pr_debug("Raise OpenPIC RESET output for CPU %d\n",
+					idx);
 				dst = &opp->dst[idx];
 				qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_RESET]);
-			} else if (!(val & (1 << idx))
-				   && (opp->pir & (1 << idx))) {
-				DPRINTF
-				    ("Lower OpenPIC RESET output for CPU %d\n",
-				     idx);
+			} else if (!(val & (1 << idx)) &&
+				   (opp->pir & (1 << idx))) {
+				pr_debug("Lower OpenPIC RESET output for CPU %d\n",
+					idx);
 				dst = &opp->dst[idx];
 				qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_RESET]);
 			}
@@ -704,13 +683,12 @@ static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
 	case 0x10A0:		/* IPI_IVPR */
 	case 0x10B0:
 	case 0x10C0:
-	case 0x10D0:
-		{
-			int idx;
-			idx = (addr - 0x10A0) >> 4;
-			write_IRQreg_ivpr(opp, opp->irq_ipi0 + idx, val);
-		}
+	case 0x10D0: {
+		int idx;
+		idx = (addr - 0x10A0) >> 4;
+		write_IRQreg_ivpr(opp, opp->irq_ipi0 + idx, val);
 		break;
+	}
 	case 0x10E0:		/* SPVE */
 		opp->spve = val & opp->vector_mask;
 		break;
@@ -719,16 +697,16 @@ static void openpic_gbl_write(void *opaque, hwaddr addr, uint64_t val,
 	}
 }
 
-static uint64_t openpic_gbl_read(void *opaque, hwaddr addr, unsigned len)
+static uint64_t openpic_gbl_read(void *opaque, gpa_t addr, unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	uint32_t retval;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
 	retval = 0xFFFFFFFF;
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return retval;
-	}
+
 	switch (addr) {
 	case 0x1000:		/* FRR */
 		retval = opp->frr;
@@ -772,24 +750,23 @@ static uint64_t openpic_gbl_read(void *opaque, hwaddr addr, unsigned len)
 	default:
 		break;
 	}
-	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
 	return retval;
 }
 
-static void openpic_tmr_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_tmr_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	int idx;
 
 	addr += 0x10f0;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
 		__func__, addr, val);
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return;
-	}
 
 	if (addr == 0x10f0) {
 		/* TFRR */
@@ -806,9 +783,9 @@ static void openpic_tmr_write(void *opaque, hwaddr addr, uint64_t val,
 	case 0x10:		/* TBCR */
 		if ((opp->timers[idx].tccr & TCCR_TOG) != 0 &&
 		    (val & TBCR_CI) == 0 &&
-		    (opp->timers[idx].tbcr & TBCR_CI) != 0) {
+		    (opp->timers[idx].tbcr & TBCR_CI) != 0)
 			opp->timers[idx].tccr &= ~TCCR_TOG;
-		}
+
 		opp->timers[idx].tbcr = val;
 		break;
 	case 0x20:		/* TVPR */
@@ -820,16 +797,16 @@ static void openpic_tmr_write(void *opaque, hwaddr addr, uint64_t val,
 	}
 }
 
-static uint64_t openpic_tmr_read(void *opaque, hwaddr addr, unsigned len)
+static uint64_t openpic_tmr_read(void *opaque, gpa_t addr, unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	uint32_t retval = -1;
 	int idx;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
-	if (addr & 0xF) {
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	if (addr & 0xF)
 		goto out;
-	}
+
 	idx = (addr >> 6) & 0x3;
 	if (addr == 0x0) {
 		/* TFRR */
@@ -852,18 +829,18 @@ static uint64_t openpic_tmr_read(void *opaque, hwaddr addr, unsigned len)
 	}
 
 out:
-	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
 	return retval;
 }
 
-static void openpic_src_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_src_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	int idx;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
 		__func__, addr, val);
 
 	addr = addr & 0xffff;
@@ -884,11 +861,11 @@ static void openpic_src_write(void *opaque, hwaddr addr, uint64_t val,
 
 static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	uint32_t retval;
 	int idx;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
 	retval = 0xFFFFFFFF;
 
 	addr = addr & 0xffff;
@@ -906,22 +883,21 @@ static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
 		break;
 	}
 
-	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
 	return retval;
 }
 
-static void openpic_msi_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_msi_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned size)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	int idx = opp->irq_msi;
 	int srs, ibs;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
 		__func__, addr, val);
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return;
-	}
 
 	switch (addr) {
 	case MSIIR_OFFSET:
@@ -937,16 +913,15 @@ static void openpic_msi_write(void *opaque, hwaddr addr, uint64_t val,
 	}
 }
 
-static uint64_t openpic_msi_read(void *opaque, hwaddr addr, unsigned size)
+static uint64_t openpic_msi_read(void *opaque, gpa_t addr, unsigned size)
 {
-	OpenPICState *opp = opaque;
+	struct openpic *opp = opaque;
 	uint64_t r = 0;
 	int i, srs;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
-	if (addr & 0xF) {
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	if (addr & 0xF)
 		return -1;
-	}
 
 	srs = addr >> 4;
 
@@ -965,53 +940,51 @@ static uint64_t openpic_msi_read(void *opaque, hwaddr addr, unsigned size)
 		openpic_set_irq(opp, opp->irq_msi + srs, 0);
 		break;
 	case 0x120:		/* MSISR */
-		for (i = 0; i < MAX_MSI; i++) {
+		for (i = 0; i < MAX_MSI; i++)
 			r |= (opp->msi[i].msir ? 1 : 0) << i;
-		}
 		break;
 	}
 
 	return r;
 }
 
-static uint64_t openpic_summary_read(void *opaque, hwaddr addr, unsigned size)
+static uint64_t openpic_summary_read(void *opaque, gpa_t addr, unsigned size)
 {
 	uint64_t r = 0;
 
-	DPRINTF("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
 
 	/* TODO: EISR/EIMR */
 
 	return r;
 }
 
-static void openpic_summary_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_summary_write(void *opaque, gpa_t addr, uint64_t val,
 				  unsigned size)
 {
-	DPRINTF("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
+	pr_debug("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
 		__func__, addr, val);
 
 	/* TODO: EISR/EIMR */
 }
 
-static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
+static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
 				       uint32_t val, int idx)
 {
-	OpenPICState *opp = opaque;
-	IRQSource *src;
-	IRQDest *dst;
+	struct openpic *opp = opaque;
+	struct irq_source *src;
+	struct irq_dest *dst;
 	int s_IRQ, n_IRQ;
 
-	DPRINTF("%s: cpu %d addr %#" HWADDR_PRIx " <= 0x%08x\n", __func__, idx,
+	pr_debug("%s: cpu %d addr %#" HWADDR_PRIx " <= 0x%08x\n", __func__, idx,
 		addr, val);
 
-	if (idx < 0) {
+	if (idx < 0)
 		return;
-	}
 
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return;
-	}
+
 	dst = &opp->dst[idx];
 	addr &= 0xFF0;
 	switch (addr) {
@@ -1028,17 +1001,16 @@ static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
 	case 0x80:		/* CTPR */
 		dst->ctpr = val & 0x0000000F;
 
-		DPRINTF("%s: set CPU %d ctpr to %d, raised %d servicing %d\n",
+		pr_debug("%s: set CPU %d ctpr to %d, raised %d servicing %d\n",
 			__func__, idx, dst->ctpr, dst->raised.priority,
 			dst->servicing.priority);
 
 		if (dst->raised.priority <= dst->ctpr) {
-			DPRINTF
-			    ("%s: Lower OpenPIC INT output cpu %d due to ctpr\n",
-			     __func__, idx);
+			pr_debug("%s: Lower OpenPIC INT output cpu %d due to ctpr\n",
+				__func__, idx);
 			qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
 		} else if (dst->raised.priority > dst->servicing.priority) {
-			DPRINTF("%s: Raise OpenPIC INT output cpu %d irq %d\n",
+			pr_debug("%s: Raise OpenPIC INT output cpu %d irq %d\n",
 				__func__, idx, dst->raised.next);
 			qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_INT]);
 		}
@@ -1051,11 +1023,11 @@ static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
 		/* Read-only register */
 		break;
 	case 0xB0:		/* EOI */
-		DPRINTF("EOI\n");
+		pr_debug("EOI\n");
 		s_IRQ = IRQ_get_next(opp, &dst->servicing);
 
 		if (s_IRQ < 0) {
-			DPRINTF("%s: EOI with no interrupt in service\n",
+			pr_debug("%s: EOI with no interrupt in service\n",
 				__func__);
 			break;
 		}
@@ -1069,7 +1041,7 @@ static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
 		if (n_IRQ != -1 &&
 		    (s_IRQ == -1 ||
 		     IVPR_PRIORITY(src->ivpr) > dst->servicing.priority)) {
-			DPRINTF("Raise OpenPIC INT output cpu %d irq %d\n",
+			pr_debug("Raise OpenPIC INT output cpu %d irq %d\n",
 				idx, n_IRQ);
 			qemu_irq_raise(opp->dst[idx].irqs[OPENPIC_OUTPUT_INT]);
 		}
@@ -1079,32 +1051,32 @@ static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
 	}
 }
 
-static void openpic_cpu_write(void *opaque, hwaddr addr, uint64_t val,
+static void openpic_cpu_write(void *opaque, gpa_t addr, uint64_t val,
 			      unsigned len)
 {
 	openpic_cpu_write_internal(opaque, addr, val, (addr & 0x1f000) >> 12);
 }
 
-static uint32_t openpic_iack(OpenPICState * opp, IRQDest * dst, int cpu)
+static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
+			     int cpu)
 {
-	IRQSource *src;
+	struct irq_source *src;
 	int retval, irq;
 
-	DPRINTF("Lower OpenPIC INT output\n");
+	pr_debug("Lower OpenPIC INT output\n");
 	qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
 
 	irq = IRQ_get_next(opp, &dst->raised);
-	DPRINTF("IACK: irq=%d\n", irq);
+	pr_debug("IACK: irq=%d\n", irq);
 
-	if (irq == -1) {
+	if (irq == -1)
 		/* No more interrupt pending */
 		return opp->spve;
-	}
 
 	src = &opp->src[irq];
 	if (!(src->ivpr & IVPR_ACTIVITY_MASK) ||
 	    !(IVPR_PRIORITY(src->ivpr) > dst->ctpr)) {
-		fprintf(stderr, "%s: bad raised IRQ %d ctpr %d ivpr 0x%08x\n",
+		pr_err("%s: bad raised IRQ %d ctpr %d ivpr 0x%08x\n",
 			__func__, irq, dst->ctpr, src->ivpr);
 		openpic_update_irq(opp, irq);
 		retval = opp->spve;
@@ -1135,22 +1107,21 @@ static uint32_t openpic_iack(OpenPICState * opp, IRQDest * dst, int cpu)
 	return retval;
 }
 
-static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx)
+static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx)
 {
-	OpenPICState *opp = opaque;
-	IRQDest *dst;
+	struct openpic *opp = opaque;
+	struct irq_dest *dst;
 	uint32_t retval;
 
-	DPRINTF("%s: cpu %d addr %#" HWADDR_PRIx "\n", __func__, idx, addr);
+	pr_debug("%s: cpu %d addr %#" HWADDR_PRIx "\n", __func__, idx, addr);
 	retval = 0xFFFFFFFF;
 
-	if (idx < 0) {
+	if (idx < 0)
 		return retval;
-	}
 
-	if (addr & 0xF) {
+	if (addr & 0xF)
 		return retval;
-	}
+
 	dst = &opp->dst[idx];
 	addr &= 0xFF0;
 	switch (addr) {
@@ -1169,54 +1140,54 @@ static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx)
 	default:
 		break;
 	}
-	DPRINTF("%s: => 0x%08x\n", __func__, retval);
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
 	return retval;
 }
 
-static uint64_t openpic_cpu_read(void *opaque, hwaddr addr, unsigned len)
+static uint64_t openpic_cpu_read(void *opaque, gpa_t addr, unsigned len)
 {
 	return openpic_cpu_read_internal(opaque, addr, (addr & 0x1f000) >> 12);
 }
 
-static const MemoryRegionOps openpic_glb_ops_be = {
+static const struct kvm_io_device_ops openpic_glb_ops_be = {
 	.write = openpic_gbl_write,
 	.read = openpic_gbl_read,
 };
 
-static const MemoryRegionOps openpic_tmr_ops_be = {
+static const struct kvm_io_device_ops openpic_tmr_ops_be = {
 	.write = openpic_tmr_write,
 	.read = openpic_tmr_read,
 };
 
-static const MemoryRegionOps openpic_cpu_ops_be = {
+static const struct kvm_io_device_ops openpic_cpu_ops_be = {
 	.write = openpic_cpu_write,
 	.read = openpic_cpu_read,
 };
 
-static const MemoryRegionOps openpic_src_ops_be = {
+static const struct kvm_io_device_ops openpic_src_ops_be = {
 	.write = openpic_src_write,
 	.read = openpic_src_read,
 };
 
-static const MemoryRegionOps openpic_msi_ops_be = {
+static const struct kvm_io_device_ops openpic_msi_ops_be = {
 	.read = openpic_msi_read,
 	.write = openpic_msi_write,
 };
 
-static const MemoryRegionOps openpic_summary_ops_be = {
+static const struct kvm_io_device_ops openpic_summary_ops_be = {
 	.read = openpic_summary_read,
 	.write = openpic_summary_write,
 };
 
-typedef struct MemReg {
+struct mem_reg {
 	const char *name;
-	MemoryRegionOps const *ops;
-	hwaddr start_addr;
-	ram_addr_t size;
-} MemReg;
+	const struct kvm_io_device_ops *ops;
+	gpa_t start_addr;
+	int size;
+};
 
-static void fsl_common_init(OpenPICState * opp)
+static void fsl_common_init(struct openpic *opp)
 {
 	int i;
 	int virq = MAX_SRC;
@@ -1239,9 +1210,8 @@ static void fsl_common_init(OpenPICState * opp)
 	opp->irq_msi = 224;
 
 	msi_supported = true;
-	for (i = 0; i < opp->fsl->max_ext; i++) {
+	for (i = 0; i < opp->fsl->max_ext; i++)
 		opp->src[i].level = false;
-	}
 
 	/* Internal interrupts, including message and MSI */
 	for (i = 16; i < MAX_SRC; i++) {
@@ -1256,7 +1226,8 @@ static void fsl_common_init(OpenPICState * opp)
 	}
 }
 
-static void map_list(OpenPICState * opp, const MemReg * list, int *count)
+static void map_list(struct openpic *opp, const struct mem_reg *list,
+		     int *count)
 {
 	while (list->name) {
 		assert(*count < ARRAY_SIZE(opp->sub_io_mem));
@@ -1272,12 +1243,12 @@ static void map_list(OpenPICState * opp, const MemReg * list, int *count)
 	}
 }
 
-static int openpic_init(SysBusDevice * dev)
+static int openpic_init(SysBusDevice *dev)
 {
-	OpenPICState *opp = FROM_SYSBUS(typeof(*opp), dev);
+	struct openpic *opp = FROM_SYSBUS(typeof(*opp), dev);
 	int i, j;
 	int list_count = 0;
-	static const MemReg list_le[] = {
+	static const struct mem_reg list_le[] = {
 		{"glb", &openpic_glb_ops_le,
 		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
 		{"tmr", &openpic_tmr_ops_le,
@@ -1288,7 +1259,7 @@ static int openpic_init(SysBusDevice * dev)
 		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
 		{NULL}
 	};
-	static const MemReg list_be[] = {
+	static const struct mem_reg list_be[] = {
 		{"glb", &openpic_glb_ops_be,
 		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
 		{"tmr", &openpic_tmr_ops_be,
@@ -1299,7 +1270,7 @@ static int openpic_init(SysBusDevice * dev)
 		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
 		{NULL}
 	};
-	static const MemReg list_fsl[] = {
+	static const struct mem_reg list_fsl[] = {
 		{"msi", &openpic_msi_ops_be,
 		 OPENPIC_MSI_REG_START, OPENPIC_MSI_REG_SIZE},
 		{"summary", &openpic_summary_ops_be,
-- 
1.7.10.4




* [PATCH v4 5/6] kvm/ppc/mpic: in-kernel MPIC emulation
  2013-04-13  0:08       ` Scott Wood
@ 2013-04-13  0:08         ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-13  0:08 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus, Scott Wood

Hook the MPIC code up to the KVM interfaces, add locking, etc.

TODO: irqfd and KVM_IRQ_LINE support

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
v4:
- update for changes in device control api
- use put/get_user where possible
- add missing locks (one of which was actually required, and another to
  avoid confusion)
- fix return value handling with openpic_cpu_*_internal()
- add kconfig help text
- move some vcpu-related code into the KVM_CAP_IRQ_MPIC patch
- return proper error values rather than "1" on MMIO errors.
- simplify kvm_mpic_read
---
 Documentation/virtual/kvm/devices/mpic.txt |   37 ++
 arch/powerpc/include/asm/kvm_host.h        |    8 +-
 arch/powerpc/include/asm/kvm_ppc.h         |    7 +
 arch/powerpc/include/uapi/asm/kvm.h        |    7 +
 arch/powerpc/kvm/Kconfig                   |    9 +
 arch/powerpc/kvm/Makefile                  |    2 +
 arch/powerpc/kvm/booke.c                   |    8 +-
 arch/powerpc/kvm/mpic.c                    |  757 +++++++++++++++++++++-------
 arch/powerpc/kvm/powerpc.c                 |   12 +-
 include/linux/kvm_host.h                   |    2 +
 include/uapi/linux/kvm.h                   |    3 +
 virt/kvm/kvm_main.c                        |    6 +
 12 files changed, 658 insertions(+), 200 deletions(-)
 create mode 100644 Documentation/virtual/kvm/devices/mpic.txt

diff --git a/Documentation/virtual/kvm/devices/mpic.txt b/Documentation/virtual/kvm/devices/mpic.txt
new file mode 100644
index 0000000..ce98e32
--- /dev/null
+++ b/Documentation/virtual/kvm/devices/mpic.txt
@@ -0,0 +1,37 @@
+MPIC interrupt controller
+=========================
+
+Device types supported:
+  KVM_DEV_TYPE_FSL_MPIC_20     Freescale MPIC v2.0
+  KVM_DEV_TYPE_FSL_MPIC_42     Freescale MPIC v4.2
+
+Only one MPIC instance, of any type, may be instantiated.  The created
+MPIC will act as the system interrupt controller, connecting to each
+vcpu's interrupt inputs.
+
+Groups:
+  KVM_DEV_MPIC_GRP_MISC
+  Attributes:
+    KVM_DEV_MPIC_BASE_ADDR (rw, 64-bit)
+      Base address of the 256 KiB MPIC register space.  Must be
+      naturally aligned.  A value of zero disables the mapping.
+      Reset value is zero.
+
+  KVM_DEV_MPIC_GRP_REGISTER (rw, 32-bit)
+    Access an MPIC register, as if the access were made from the guest.
+    "attr" is the byte offset into the MPIC register space.  Accesses
+    must be 4-byte aligned.
+
+    MSIs may be signaled by using this attribute group to write
+    to the relevant MSIIR.
+
+  KVM_DEV_MPIC_GRP_IRQ_ACTIVE (rw, 32-bit)
+    IRQ input line for each standard openpic source.  0 is inactive and 1
+    is active, regardless of interrupt sense.
+
+    For edge-triggered interrupts:  Writing 1 is considered an activating
+    edge, and writing 0 is ignored.  Reading returns 1 if a previously
+    signaled edge has not been acknowledged, and 0 otherwise.
+
+    "attr" is the IRQ number.  IRQ numbers for standard sources are the
+    byte offset of the relevant IVPR from EIVPR0, divided by 32.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index e34f8fe..7e7aef9 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -359,6 +359,11 @@ struct kvmppc_slb {
 #define KVMPPC_BOOKE_MAX_IAC	4
 #define KVMPPC_BOOKE_MAX_DAC	2
 
+/* KVMPPC_EPR_USER takes precedence over KVMPPC_EPR_KERNEL */
+#define KVMPPC_EPR_NONE		0 /* EPR not supported */
+#define KVMPPC_EPR_USER		1 /* exit to userspace to fill EPR */
+#define KVMPPC_EPR_KERNEL	2 /* in-kernel irqchip */
+
 struct kvmppc_booke_debug_reg {
 	u32 dbcr0;
 	u32 dbcr1;
@@ -522,7 +527,7 @@ struct kvm_vcpu_arch {
 	u8 sane;
 	u8 cpu_type;
 	u8 hcall_needed;
-	u8 epr_enabled;
+	u8 epr_flags; /* KVMPPC_EPR_xxx */
 	u8 epr_needed;
 
 	u32 cpr0_cfgaddr; /* holds the last set cpr0_cfgaddr */
@@ -589,5 +594,6 @@ struct kvm_vcpu_arch {
 #define KVM_MMIO_REG_FQPR	0x0060
 
 #define __KVM_HAVE_ARCH_WQP
+#define __KVM_HAVE_CREATE_DEVICE
 
 #endif /* __POWERPC_KVM_HOST_H__ */
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index f589307..3b63b97 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -164,6 +164,8 @@ extern int kvmppc_prepare_to_enter(struct kvm_vcpu *vcpu);
 
 extern int kvm_vm_ioctl_get_htab_fd(struct kvm *kvm, struct kvm_get_htab_fd *);
 
+int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu, struct kvm_interrupt *irq);
+
 /*
  * Cuts out inst bits with ordering according to spec.
  * That means the leftmost bit is zero. All given bits are included.
@@ -245,6 +247,9 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id, union kvmppc_one_reg *);
 
 void kvmppc_set_pid(struct kvm_vcpu *vcpu, u32 pid);
 
+struct openpic;
+void kvmppc_mpic_put(struct openpic *opp);
+
 #ifdef CONFIG_KVM_BOOK3S_64_HV
 static inline void kvmppc_set_xics_phys(int cpu, unsigned long addr)
 {
@@ -270,6 +275,8 @@ static inline void kvmppc_set_epr(struct kvm_vcpu *vcpu, u32 epr)
 #endif
 }
 
+void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu);
+
 int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
 			      struct kvm_config_tlb *cfg);
 int kvm_vcpu_ioctl_dirty_tlb(struct kvm_vcpu *vcpu,
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index c2ff99c..36be2fe 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -426,4 +426,11 @@ struct kvm_get_htab_header {
 /* Debugging: Special instruction for software breakpoint */
 #define KVM_REG_PPC_DEBUG_INST	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x8b)
 
+/* Device control API: PPC-specific devices */
+#define KVM_DEV_MPIC_GRP_MISC		1
+#define   KVM_DEV_MPIC_BASE_ADDR	0	/* 64-bit */
+
+#define KVM_DEV_MPIC_GRP_REGISTER	2	/* 32-bit */
+#define KVM_DEV_MPIC_GRP_IRQ_ACTIVE	3	/* 32-bit */
+
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 63c67ec..938a729 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -151,6 +151,15 @@ config KVM_E500MC
 
 	  If unsure, say N.
 
+config KVM_MPIC
+	bool "KVM in-kernel MPIC emulation"
+	depends on KVM
+	help
+	  Enable support for emulating MPIC devices inside the
+	  host kernel, rather than relying on userspace to emulate.
+	  Currently, support is limited to certain versions of
+	  Freescale's MPIC implementation.
+
 source drivers/vhost/Kconfig
 
 endif # VIRTUALIZATION
diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
index b772ede..4a2277a 100644
--- a/arch/powerpc/kvm/Makefile
+++ b/arch/powerpc/kvm/Makefile
@@ -103,6 +103,8 @@ kvm-book3s_32-objs := \
 	book3s_32_mmu.o
 kvm-objs-$(CONFIG_KVM_BOOK3S_32) := $(kvm-book3s_32-objs)
 
+kvm-objs-$(CONFIG_KVM_MPIC) += mpic.o
+
 kvm-objs := $(kvm-objs-m) $(kvm-objs-y)
 
 obj-$(CONFIG_KVM_440) += kvm.o
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index a49a68a..cff53d4 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -346,7 +346,7 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu *vcpu,
 		keep_irq = true;
 	}
 
-	if ((priority == BOOKE_IRQPRIO_EXTERNAL) && vcpu->arch.epr_enabled)
+	if ((priority == BOOKE_IRQPRIO_EXTERNAL) && vcpu->arch.epr_flags)
 		update_epr = true;
 
 	switch (priority) {
@@ -427,8 +427,10 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu *vcpu,
 			set_guest_esr(vcpu, vcpu->arch.queued_esr);
 		if (update_dear == true)
 			set_guest_dear(vcpu, vcpu->arch.queued_dear);
-		if (update_epr == true)
-			kvm_make_request(KVM_REQ_EPR_EXIT, vcpu);
+		if (update_epr == true) {
+			if (vcpu->arch.epr_flags & KVMPPC_EPR_USER)
+				kvm_make_request(KVM_REQ_EPR_EXIT, vcpu);
+		}
 
 		new_msr &= msr_mask;
 #if defined(CONFIG_64BIT)
diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index 1df67ae..8ed4072 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -23,6 +23,19 @@
  * THE SOFTWARE.
  */
 
+#include <linux/slab.h>
+#include <linux/mutex.h>
+#include <linux/kvm_host.h>
+#include <linux/errno.h>
+#include <linux/fs.h>
+#include <linux/anon_inodes.h>
+#include <asm/uaccess.h>
+#include <asm/mpic.h>
+#include <asm/kvm_para.h>
+#include <asm/kvm_host.h>
+#include <asm/kvm_ppc.h>
+#include "iodev.h"
+
 #define MAX_CPU     32
 #define MAX_SRC     256
 #define MAX_TMR     4
@@ -36,6 +49,7 @@
 #define OPENPIC_FLAG_ILR          (2 << 0)
 
 /* OpenPIC address map */
+#define OPENPIC_REG_SIZE             0x40000
 #define OPENPIC_GLB_REG_START        0x0
 #define OPENPIC_GLB_REG_SIZE         0x10F0
 #define OPENPIC_TMR_REG_START        0x10F0
@@ -89,6 +103,7 @@ static struct fsl_mpic_info fsl_mpic_42 = {
 #define ILR_INTTGT_INT    0x00
 #define ILR_INTTGT_CINT   0x01	/* critical */
 #define ILR_INTTGT_MCP    0x02	/* machine check */
+#define NUM_OUTPUTS       3
 
 #define MSIIR_OFFSET       0x140
 #define MSIIR_SRS_SHIFT    29
@@ -98,18 +113,14 @@ static struct fsl_mpic_info fsl_mpic_42 = {
 
 static int get_current_cpu(void)
 {
-	CPUState *cpu_single_cpu;
-
-	if (!cpu_single_env)
-		return -1;
-
-	cpu_single_cpu = ENV_GET_CPU(cpu_single_env);
-	return cpu_single_cpu->cpu_index;
+	struct kvm_vcpu *vcpu = current->thread.kvm_vcpu;
+	return vcpu ? vcpu->vcpu_id : -1;
 }
 
-static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx);
-static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
-				       uint32_t val, int idx);
+static int openpic_cpu_write_internal(void *opaque, gpa_t addr,
+				      u32 val, int idx);
+static int openpic_cpu_read_internal(void *opaque, gpa_t addr,
+				     u32 *ptr, int idx);
 
 enum irq_type {
 	IRQ_TYPE_NORMAL = 0,
@@ -131,7 +142,7 @@ struct irq_source {
 	uint32_t idr;		/* IRQ destination register */
 	uint32_t destmask;	/* bitmap of CPU destinations */
 	int last_cpu;
-	int output;		/* IRQ level, e.g. OPENPIC_OUTPUT_INT */
+	int output;		/* IRQ level, e.g. ILR_INTTGT_INT */
 	int pending;		/* TRUE if IRQ is pending */
 	enum irq_type type;
 	bool level:1;		/* level-triggered */
@@ -158,16 +169,27 @@ struct irq_source {
 #define IDR_CI      0x40000000	/* critical interrupt */
 
 struct irq_dest {
+	struct kvm_vcpu *vcpu;
+
 	int32_t ctpr;		/* CPU current task priority */
 	struct irq_queue raised;
 	struct irq_queue servicing;
-	qemu_irq *irqs;
 
 	/* Count of IRQ sources asserting on non-INT outputs */
-	uint32_t outputs_active[OPENPIC_OUTPUT_NB];
+	uint32_t outputs_active[NUM_OUTPUTS];
 };
 
 struct openpic {
+	struct kvm *kvm;
+	struct kvm_device *dev;
+	struct kvm_io_device mmio;
+	struct list_head mmio_regions;
+	atomic_t users;
+	bool mmio_mapped;
+
+	gpa_t reg_base;
+	spinlock_t lock;
+
 	/* Behavior control */
 	struct fsl_mpic_info *fsl;
 	uint32_t model;
@@ -208,6 +230,47 @@ struct openpic {
 	uint32_t irq_msi;
 };
 
+
+static void mpic_irq_raise(struct openpic *opp, struct irq_dest *dst,
+			   int output)
+{
+	struct kvm_interrupt irq = {
+		.irq = KVM_INTERRUPT_SET_LEVEL,
+	};
+
+	if (!dst->vcpu) {
+		pr_debug("%s: destination cpu %d does not exist\n",
+			 __func__, (int)(dst - &opp->dst[0]));
+		return;
+	}
+
+	pr_debug("%s: cpu %d output %d\n", __func__, dst->vcpu->vcpu_id,
+		output);
+
+	if (output != ILR_INTTGT_INT)	/* TODO */
+		return;
+
+	kvm_vcpu_ioctl_interrupt(dst->vcpu, &irq);
+}
+
+static void mpic_irq_lower(struct openpic *opp, struct irq_dest *dst,
+			   int output)
+{
+	if (!dst->vcpu) {
+		pr_debug("%s: destination cpu %d does not exist\n",
+			 __func__, (int)(dst - &opp->dst[0]));
+		return;
+	}
+
+	pr_debug("%s: cpu %d output %d\n", __func__, dst->vcpu->vcpu_id,
+		output);
+
+	if (output != ILR_INTTGT_INT)	/* TODO */
+		return;
+
+	kvmppc_core_dequeue_external(dst->vcpu);
+}
+
 static inline void IRQ_setbit(struct irq_queue *q, int n_IRQ)
 {
 	set_bit(n_IRQ, q->queue);
@@ -268,7 +331,7 @@ static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 	pr_debug("%s: IRQ %d active %d was %d\n",
 		__func__, n_IRQ, active, was_active);
 
-	if (src->output != OPENPIC_OUTPUT_INT) {
+	if (src->output != ILR_INTTGT_INT) {
 		pr_debug("%s: output %d irq %d active %d was %d count %d\n",
 			__func__, src->output, n_IRQ, active, was_active,
 			dst->outputs_active[src->output]);
@@ -282,14 +345,14 @@ static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 			    dst->outputs_active[src->output]++ == 0) {
 				pr_debug("%s: Raise OpenPIC output %d cpu %d irq %d\n",
 					__func__, src->output, n_CPU, n_IRQ);
-				qemu_irq_raise(dst->irqs[src->output]);
+				mpic_irq_raise(opp, dst, src->output);
 			}
 		} else {
 			if (was_active &&
 			    --dst->outputs_active[src->output] == 0) {
 				pr_debug("%s: Lower OpenPIC output %d cpu %d irq %d\n",
 					__func__, src->output, n_CPU, n_IRQ);
-				qemu_irq_lower(dst->irqs[src->output]);
+				mpic_irq_lower(opp, dst, src->output);
 			}
 		}
 
@@ -322,8 +385,7 @@ static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 		} else {
 			pr_debug("%s: Raise OpenPIC INT output cpu %d irq %d/%d\n",
 				__func__, n_CPU, n_IRQ, dst->raised.next);
-			qemu_irq_raise(opp->dst[n_CPU].
-				       irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_raise(opp, dst, ILR_INTTGT_INT);
 		}
 	} else {
 		IRQ_get_next(opp, &dst->servicing);
@@ -338,8 +400,7 @@ static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 			pr_debug("%s: IRQ %d inactive, current prio %d/%d, CPU %d\n",
 				__func__, n_IRQ, dst->ctpr,
 				dst->servicing.priority, n_CPU);
-			qemu_irq_lower(opp->dst[n_CPU].
-				       irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_lower(opp, dst, ILR_INTTGT_INT);
 		}
 	}
 }
@@ -415,8 +476,8 @@ static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 	struct irq_source *src;
 
 	if (n_IRQ >= MAX_IRQ) {
-		pr_err("%s: IRQ %d out of range\n", __func__, n_IRQ);
-		abort();
+		WARN_ONCE(1, "%s: IRQ %d out of range\n", __func__, n_IRQ);
+		return;
 	}
 
 	src = &opp->src[n_IRQ];
@@ -433,7 +494,7 @@ static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 			openpic_update_irq(opp, n_IRQ);
 		}
 
-		if (src->output != OPENPIC_OUTPUT_INT) {
+		if (src->output != ILR_INTTGT_INT) {
 			/* Edge-triggered interrupts shouldn't be used
 			 * with non-INT delivery, but just in case,
 			 * try to make it do something sane rather than
@@ -446,15 +507,13 @@ static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 	}
 }
 
-static void openpic_reset(DeviceState *d)
+static void openpic_reset(struct openpic *opp)
 {
-	struct openpic *opp = FROM_SYSBUS(typeof(*opp), SYS_BUS_DEVICE(d));
 	int i;
 
 	opp->gcr = GCR_RESET;
 	/* Initialise controller registers */
 	opp->frr = ((opp->nb_irqs - 1) << FRR_NIRQ_SHIFT) |
-	    ((opp->nb_cpus - 1) << FRR_NCPU_SHIFT) |
 	    (opp->vid << FRR_VID_SHIFT);
 
 	opp->pir = 0;
@@ -504,7 +563,7 @@ static inline uint32_t read_IRQreg_idr(struct openpic *opp, int n_IRQ)
 static inline uint32_t read_IRQreg_ilr(struct openpic *opp, int n_IRQ)
 {
 	if (opp->flags & OPENPIC_FLAG_ILR)
-		return output_to_inttgt(opp->src[n_IRQ].output);
+		return opp->src[n_IRQ].output;
 
 	return 0xffffffff;
 }
@@ -539,7 +598,7 @@ static inline void write_IRQreg_idr(struct openpic *opp, int n_IRQ,
 					__func__);
 			}
 
-			src->output = OPENPIC_OUTPUT_CINT;
+			src->output = ILR_INTTGT_CINT;
 			src->nomask = true;
 			src->destmask = 0;
 
@@ -550,7 +609,7 @@ static inline void write_IRQreg_idr(struct openpic *opp, int n_IRQ,
 					src->destmask |= 1UL << i;
 			}
 		} else {
-			src->output = OPENPIC_OUTPUT_INT;
+			src->output = ILR_INTTGT_INT;
 			src->nomask = false;
 			src->destmask = src->idr & normal_mask;
 		}
@@ -565,7 +624,7 @@ static inline void write_IRQreg_ilr(struct openpic *opp, int n_IRQ,
 	if (opp->flags & OPENPIC_FLAG_ILR) {
 		struct irq_source *src = &opp->src[n_IRQ];
 
-		src->output = inttgt_to_output(val & ILR_INTTGT_MASK);
+		src->output = val & ILR_INTTGT_MASK;
 		pr_debug("Set ILR %d to 0x%08x, output %d\n", n_IRQ, src->idr,
 			src->output);
 
@@ -614,34 +673,23 @@ static inline void write_IRQreg_ivpr(struct openpic *opp, int n_IRQ,
 
 static void openpic_gcr_write(struct openpic *opp, uint64_t val)
 {
-	bool mpic_proxy = false;
-
 	if (val & GCR_RESET) {
-		openpic_reset(&opp->busdev.qdev);
+		openpic_reset(opp);
 		return;
 	}
 
 	opp->gcr &= ~opp->mpic_mode_mask;
 	opp->gcr |= val & opp->mpic_mode_mask;
-
-	/* Set external proxy mode */
-	if ((val & opp->mpic_mode_mask) == GCR_MODE_PROXY)
-		mpic_proxy = true;
-
-	ppce500_set_mpic_proxy(mpic_proxy);
 }
 
-static void openpic_gbl_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned len)
+static int openpic_gbl_write(void *opaque, gpa_t addr, u32 val)
 {
 	struct openpic *opp = opaque;
-	struct irq_dest *dst;
-	int idx;
+	int err = 0;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
-		__func__, addr, val);
+	pr_debug("%s: addr %#llx <= %08x\n", __func__, addr, val);
 	if (addr & 0xF)
-		return;
+		return 0;
 
 	switch (addr) {
 	case 0x00:	/* Block Revision Register1 (BRR1) is Readonly */
@@ -654,7 +702,8 @@ static void openpic_gbl_write(void *opaque, gpa_t addr, uint64_t val,
 	case 0x90:
 	case 0xA0:
 	case 0xB0:
-		openpic_cpu_write_internal(opp, addr, val, get_current_cpu());
+		err = openpic_cpu_write_internal(opp, addr, val,
+						 get_current_cpu());
 		break;
 	case 0x1000:		/* FRR */
 		break;
@@ -664,21 +713,11 @@ static void openpic_gbl_write(void *opaque, gpa_t addr, uint64_t val,
 	case 0x1080:		/* VIR */
 		break;
 	case 0x1090:		/* PIR */
-		for (idx = 0; idx < opp->nb_cpus; idx++) {
-			if ((val & (1 << idx)) && !(opp->pir & (1 << idx))) {
-				pr_debug("Raise OpenPIC RESET output for CPU %d\n",
-					idx);
-				dst = &opp->dst[idx];
-				qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_RESET]);
-			} else if (!(val & (1 << idx)) &&
-				   (opp->pir & (1 << idx))) {
-				pr_debug("Lower OpenPIC RESET output for CPU %d\n",
-					idx);
-				dst = &opp->dst[idx];
-				qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_RESET]);
-			}
-		}
-		opp->pir = val;
+		/*
+		 * This register is used to reset a CPU core --
+		 * let userspace handle it.
+		 */
+		err = -ENXIO;
 		break;
 	case 0x10A0:		/* IPI_IVPR */
 	case 0x10B0:
@@ -695,21 +734,25 @@ static void openpic_gbl_write(void *opaque, gpa_t addr, uint64_t val,
 	default:
 		break;
 	}
+
+	return err;
 }
 
-static uint64_t openpic_gbl_read(void *opaque, gpa_t addr, unsigned len)
+static int openpic_gbl_read(void *opaque, gpa_t addr, u32 *ptr)
 {
 	struct openpic *opp = opaque;
-	uint32_t retval;
+	u32 retval;
+	int err = 0;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 	retval = 0xFFFFFFFF;
 	if (addr & 0xF)
-		return retval;
+		goto out;
 
 	switch (addr) {
 	case 0x1000:		/* FRR */
 		retval = opp->frr;
+		retval |= (opp->nb_cpus - 1) << FRR_NCPU_SHIFT;
 		break;
 	case 0x1020:		/* GCR */
 		retval = opp->gcr;
@@ -731,8 +774,8 @@ static uint64_t openpic_gbl_read(void *opaque, gpa_t addr, unsigned len)
 	case 0x90:
 	case 0xA0:
 	case 0xB0:
-		retval =
-		    openpic_cpu_read_internal(opp, addr, get_current_cpu());
+		err = openpic_cpu_read_internal(opp, addr,
+			&retval, get_current_cpu());
 		break;
 	case 0x10A0:		/* IPI_IVPR */
 	case 0x10B0:
@@ -750,28 +793,28 @@ static uint64_t openpic_gbl_read(void *opaque, gpa_t addr, unsigned len)
 	default:
 		break;
 	}
-	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
-	return retval;
+out:
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
+	*ptr = retval;
+	return err;
 }
 
-static void openpic_tmr_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned len)
+static int openpic_tmr_write(void *opaque, gpa_t addr, u32 val)
 {
 	struct openpic *opp = opaque;
 	int idx;
 
 	addr += 0x10f0;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
-		__func__, addr, val);
+	pr_debug("%s: addr %#llx <= %08x\n", __func__, addr, val);
 	if (addr & 0xF)
-		return;
+		return 0;
 
 	if (addr == 0x10f0) {
 		/* TFRR */
 		opp->tfrr = val;
-		return;
+		return 0;
 	}
 
 	idx = (addr >> 6) & 0x3;
@@ -795,15 +838,17 @@ static void openpic_tmr_write(void *opaque, gpa_t addr, uint64_t val,
 		write_IRQreg_idr(opp, opp->irq_tim0 + idx, val);
 		break;
 	}
+
+	return 0;
 }
 
-static uint64_t openpic_tmr_read(void *opaque, gpa_t addr, unsigned len)
+static int openpic_tmr_read(void *opaque, gpa_t addr, u32 *ptr)
 {
 	struct openpic *opp = opaque;
 	uint32_t retval = -1;
 	int idx;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 	if (addr & 0xF)
 		goto out;
 
@@ -813,6 +858,7 @@ static uint64_t openpic_tmr_read(void *opaque, gpa_t addr, unsigned len)
 		retval = opp->tfrr;
 		goto out;
 	}
+
 	switch (addr & 0x30) {
 	case 0x00:		/* TCCR */
 		retval = opp->timers[idx].tccr;
@@ -830,18 +876,16 @@ static uint64_t openpic_tmr_read(void *opaque, gpa_t addr, unsigned len)
 
 out:
 	pr_debug("%s: => 0x%08x\n", __func__, retval);
-
-	return retval;
+	*ptr = retval;
+	return 0;
 }
 
-static void openpic_src_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned len)
+static int openpic_src_write(void *opaque, gpa_t addr, u32 val)
 {
 	struct openpic *opp = opaque;
 	int idx;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
-		__func__, addr, val);
+	pr_debug("%s: addr %#llx <= %08x\n", __func__, addr, val);
 
 	addr = addr & 0xffff;
 	idx = addr >> 5;
@@ -857,15 +901,17 @@ static void openpic_src_write(void *opaque, gpa_t addr, uint64_t val,
 		write_IRQreg_ilr(opp, idx, val);
 		break;
 	}
+
+	return 0;
 }
 
-static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
+static int openpic_src_read(void *opaque, gpa_t addr, u32 *ptr)
 {
 	struct openpic *opp = opaque;
 	uint32_t retval;
 	int idx;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 	retval = 0xFFFFFFFF;
 
 	addr = addr & 0xffff;
@@ -884,20 +930,19 @@ static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
 	}
 
 	pr_debug("%s: => 0x%08x\n", __func__, retval);
-	return retval;
+	*ptr = retval;
+	return 0;
 }
 
-static void openpic_msi_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned size)
+static int openpic_msi_write(void *opaque, gpa_t addr, u32 val)
 {
 	struct openpic *opp = opaque;
 	int idx = opp->irq_msi;
 	int srs, ibs;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
-		__func__, addr, val);
+	pr_debug("%s: addr %#llx <= 0x%08x\n", __func__, addr, val);
 	if (addr & 0xF)
-		return;
+		return 0;
 
 	switch (addr) {
 	case MSIIR_OFFSET:
@@ -911,17 +956,19 @@ static void openpic_msi_write(void *opaque, gpa_t addr, uint64_t val,
 		/* most registers are read-only, thus ignored */
 		break;
 	}
+
+	return 0;
 }
 
-static uint64_t openpic_msi_read(void *opaque, gpa_t addr, unsigned size)
+static int openpic_msi_read(void *opaque, gpa_t addr, u32 *ptr)
 {
 	struct openpic *opp = opaque;
-	uint64_t r = 0;
+	uint32_t r = 0;
 	int i, srs;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 	if (addr & 0xF)
-		return -1;
+		return -ENXIO;
 
 	srs = addr >> 4;
 
@@ -945,45 +992,47 @@ static uint64_t openpic_msi_read(void *opaque, gpa_t addr, unsigned size)
 		break;
 	}
 
-	return r;
+	pr_debug("%s: => 0x%08x\n", __func__, r);
+	*ptr = r;
+	return 0;
 }
 
-static uint64_t openpic_summary_read(void *opaque, gpa_t addr, unsigned size)
+static int openpic_summary_read(void *opaque, gpa_t addr, u32 *ptr)
 {
-	uint64_t r = 0;
+	uint32_t r = 0;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 
 	/* TODO: EISR/EIMR */
 
-	return r;
+	*ptr = r;
+	return 0;
 }
 
-static void openpic_summary_write(void *opaque, gpa_t addr, uint64_t val,
-				  unsigned size)
+static int openpic_summary_write(void *opaque, gpa_t addr, u32 val)
 {
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
-		__func__, addr, val);
+	pr_debug("%s: addr %#llx <= 0x%08x\n", __func__, addr, val);
 
 	/* TODO: EISR/EIMR */
+	return 0;
 }
 
-static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
-				       uint32_t val, int idx)
+static int openpic_cpu_write_internal(void *opaque, gpa_t addr,
+				      u32 val, int idx)
 {
 	struct openpic *opp = opaque;
 	struct irq_source *src;
 	struct irq_dest *dst;
 	int s_IRQ, n_IRQ;
 
-	pr_debug("%s: cpu %d addr %#" HWADDR_PRIx " <= 0x%08x\n", __func__, idx,
+	pr_debug("%s: cpu %d addr %#llx <= 0x%08x\n", __func__, idx,
 		addr, val);
 
 	if (idx < 0)
-		return;
+		return 0;
 
 	if (addr & 0xF)
-		return;
+		return 0;
 
 	dst = &opp->dst[idx];
 	addr &= 0xFF0;
@@ -1008,11 +1057,11 @@ static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
 		if (dst->raised.priority <= dst->ctpr) {
 			pr_debug("%s: Lower OpenPIC INT output cpu %d due to ctpr\n",
 				__func__, idx);
-			qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_lower(opp, dst, ILR_INTTGT_INT);
 		} else if (dst->raised.priority > dst->servicing.priority) {
 			pr_debug("%s: Raise OpenPIC INT output cpu %d irq %d\n",
 				__func__, idx, dst->raised.next);
-			qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_raise(opp, dst, ILR_INTTGT_INT);
 		}
 
 		break;
@@ -1043,18 +1092,22 @@ static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
 		     IVPR_PRIORITY(src->ivpr) > dst->servicing.priority)) {
 			pr_debug("Raise OpenPIC INT output cpu %d irq %d\n",
 				idx, n_IRQ);
-			qemu_irq_raise(opp->dst[idx].irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_raise(opp, dst, ILR_INTTGT_INT);
 		}
 		break;
 	default:
 		break;
 	}
+
+	return 0;
 }
 
-static void openpic_cpu_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned len)
+static int openpic_cpu_write(void *opaque, gpa_t addr, u32 val)
 {
-	openpic_cpu_write_internal(opaque, addr, val, (addr & 0x1f000) >> 12);
+	struct openpic *opp = opaque;
+
+	return openpic_cpu_write_internal(opp, addr, val,
+					 (addr & 0x1f000) >> 12);
 }
 
 static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
@@ -1064,7 +1117,7 @@ static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
 	int retval, irq;
 
 	pr_debug("Lower OpenPIC INT output\n");
-	qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
+	mpic_irq_lower(opp, dst, ILR_INTTGT_INT);
 
 	irq = IRQ_get_next(opp, &dst->raised);
 	pr_debug("IACK: irq=%d\n", irq);
@@ -1107,20 +1160,21 @@ static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
 	return retval;
 }
 
-static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx)
+static int openpic_cpu_read_internal(void *opaque, gpa_t addr,
+				     u32 *ptr, int idx)
 {
 	struct openpic *opp = opaque;
 	struct irq_dest *dst;
 	uint32_t retval;
 
-	pr_debug("%s: cpu %d addr %#" HWADDR_PRIx "\n", __func__, idx, addr);
+	pr_debug("%s: cpu %d addr %#llx\n", __func__, idx, addr);
 	retval = 0xFFFFFFFF;
 
 	if (idx < 0)
-		return retval;
+		goto out;
 
 	if (addr & 0xF)
-		return retval;
+		goto out;
 
 	dst = &opp->dst[idx];
 	addr &= 0xFF0;
@@ -1142,49 +1196,67 @@ static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx)
 	}
 	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
-	return retval;
+out:
+	*ptr = retval;
+	return 0;
 }
 
-static uint64_t openpic_cpu_read(void *opaque, gpa_t addr, unsigned len)
+static int openpic_cpu_read(void *opaque, gpa_t addr, u32 *ptr)
 {
-	return openpic_cpu_read_internal(opaque, addr, (addr & 0x1f000) >> 12);
+	struct openpic *opp = opaque;
+
+	return openpic_cpu_read_internal(opp, addr, ptr,
+					 (addr & 0x1f000) >> 12);
 }
 
-static const struct kvm_io_device_ops openpic_glb_ops_be = {
+struct mem_reg {
+	struct list_head list;
+	int (*read)(void *opaque, gpa_t addr, u32 *ptr);
+	int (*write)(void *opaque, gpa_t addr, u32 val);
+	gpa_t start_addr;
+	int size;
+};
+
+static struct mem_reg openpic_gbl_mmio = {
 	.write = openpic_gbl_write,
 	.read = openpic_gbl_read,
+	.start_addr = OPENPIC_GLB_REG_START,
+	.size = OPENPIC_GLB_REG_SIZE,
 };
 
-static const struct kvm_io_device_ops openpic_tmr_ops_be = {
+static struct mem_reg openpic_tmr_mmio = {
 	.write = openpic_tmr_write,
 	.read = openpic_tmr_read,
+	.start_addr = OPENPIC_TMR_REG_START,
+	.size = OPENPIC_TMR_REG_SIZE,
 };
 
-static const struct kvm_io_device_ops openpic_cpu_ops_be = {
+static struct mem_reg openpic_cpu_mmio = {
 	.write = openpic_cpu_write,
 	.read = openpic_cpu_read,
+	.start_addr = OPENPIC_CPU_REG_START,
+	.size = OPENPIC_CPU_REG_SIZE,
 };
 
-static const struct kvm_io_device_ops openpic_src_ops_be = {
+static struct mem_reg openpic_src_mmio = {
 	.write = openpic_src_write,
 	.read = openpic_src_read,
+	.start_addr = OPENPIC_SRC_REG_START,
+	.size = OPENPIC_SRC_REG_SIZE,
 };
 
-static const struct kvm_io_device_ops openpic_msi_ops_be = {
+static struct mem_reg openpic_msi_mmio = {
 	.read = openpic_msi_read,
 	.write = openpic_msi_write,
+	.start_addr = OPENPIC_MSI_REG_START,
+	.size = OPENPIC_MSI_REG_SIZE,
 };
 
-static const struct kvm_io_device_ops openpic_summary_ops_be = {
+static struct mem_reg openpic_summary_mmio = {
 	.read = openpic_summary_read,
 	.write = openpic_summary_write,
-};
-
-struct mem_reg {
-	const char *name;
-	const struct kvm_io_device_ops *ops;
-	gpa_t start_addr;
-	int size;
+	.start_addr = OPENPIC_SUMMARY_REG_START,
+	.size = OPENPIC_SUMMARY_REG_SIZE,
 };
 
 static void fsl_common_init(struct openpic *opp)
@@ -1192,6 +1264,9 @@ static void fsl_common_init(struct openpic *opp)
 	int i;
 	int virq = MAX_SRC;
 
+	list_add(&openpic_msi_mmio.list, &opp->mmio_regions);
+	list_add(&openpic_summary_mmio.list, &opp->mmio_regions);
+
 	opp->vid = VID_REVISION_1_2;
 	opp->vir = VIR_GENERIC;
 	opp->vector_mask = 0xFFFF;
@@ -1205,11 +1280,10 @@ static void fsl_common_init(struct openpic *opp)
 	opp->irq_tim0 = virq;
 	virq += MAX_TMR;
 
-	assert(virq <= MAX_IRQ);
+	BUG_ON(virq > MAX_IRQ);
 
 	opp->irq_msi = 224;
 
-	msi_supported = true;
 	for (i = 0; i < opp->fsl->max_ext; i++)
 		opp->src[i].level = false;
 
@@ -1226,63 +1300,352 @@ static void fsl_common_init(struct openpic *opp)
 	}
 }
 
-static void map_list(struct openpic *opp, const struct mem_reg *list,
-		     int *count)
+static int kvm_mpic_read_internal(struct openpic *opp, gpa_t addr, u32 *ptr)
 {
-	while (list->name) {
-		assert(*count < ARRAY_SIZE(opp->sub_io_mem));
+	struct list_head *node;
 
-		memory_region_init_io(&opp->sub_io_mem[*count], list->ops, opp,
-				      list->name, list->size);
+	list_for_each(node, &opp->mmio_regions) {
+		struct mem_reg *mr = list_entry(node, struct mem_reg, list);
 
-		memory_region_add_subregion(&opp->mem, list->start_addr,
-					    &opp->sub_io_mem[*count]);
+		if (mr->start_addr > addr || addr >= mr->start_addr + mr->size)
+			continue;
 
-		(*count)++;
-		list++;
+		return mr->read(opp, addr - mr->start_addr, ptr);
 	}
+
+	return -ENXIO;
 }
 
-static int openpic_init(SysBusDevice *dev)
+static int kvm_mpic_write_internal(struct openpic *opp, gpa_t addr, u32 val)
 {
-	struct openpic *opp = FROM_SYSBUS(typeof(*opp), dev);
-	int i, j;
-	int list_count = 0;
-	static const struct mem_reg list_le[] = {
-		{"glb", &openpic_glb_ops_le,
-		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
-		{"tmr", &openpic_tmr_ops_le,
-		 OPENPIC_TMR_REG_START, OPENPIC_TMR_REG_SIZE},
-		{"src", &openpic_src_ops_le,
-		 OPENPIC_SRC_REG_START, OPENPIC_SRC_REG_SIZE},
-		{"cpu", &openpic_cpu_ops_le,
-		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
-		{NULL}
-	};
-	static const struct mem_reg list_be[] = {
-		{"glb", &openpic_glb_ops_be,
-		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
-		{"tmr", &openpic_tmr_ops_be,
-		 OPENPIC_TMR_REG_START, OPENPIC_TMR_REG_SIZE},
-		{"src", &openpic_src_ops_be,
-		 OPENPIC_SRC_REG_START, OPENPIC_SRC_REG_SIZE},
-		{"cpu", &openpic_cpu_ops_be,
-		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
-		{NULL}
-	};
-	static const struct mem_reg list_fsl[] = {
-		{"msi", &openpic_msi_ops_be,
-		 OPENPIC_MSI_REG_START, OPENPIC_MSI_REG_SIZE},
-		{"summary", &openpic_summary_ops_be,
-		 OPENPIC_SUMMARY_REG_START, OPENPIC_SUMMARY_REG_SIZE},
-		{NULL}
-	};
+	struct list_head *node;
+
+	list_for_each(node, &opp->mmio_regions) {
+		struct mem_reg *mr = list_entry(node, struct mem_reg, list);
+
+		if (mr->start_addr > addr || addr >= mr->start_addr + mr->size)
+			continue;
 
-	memory_region_init(&opp->mem, "openpic", 0x40000);
+		return mr->write(opp, addr - mr->start_addr, val);
+	}
+
+	return -ENXIO;
+}
+
+static int kvm_mpic_read(struct kvm_io_device *this, gpa_t addr,
+			 int len, void *ptr)
+{
+	struct openpic *opp = container_of(this, struct openpic, mmio);
+	int ret;
+	union {
+		u32 val;
+		u8 bytes[4];
+	} u;
+
+	if (addr & (len - 1)) {
+		pr_debug("%s: bad alignment %llx/%d\n",
+			 __func__, addr, len);
+		return -EINVAL;
+	}
+
+	spin_lock_irq(&opp->lock);
+	ret = kvm_mpic_read_internal(opp, addr - opp->reg_base, &u.val);
+	spin_unlock_irq(&opp->lock);
+
+	/*
+	 * Technically only 32-bit accesses are allowed, but be nice to
+	 * people dumping registers a byte at a time -- it works in real
+	 * hardware (reads only, not writes).
+	 */
+	if (len == 4) {
+		*(u32 *)ptr = u.val;
+		pr_debug("%s: addr %llx ret %d len 4 val %x\n",
+			 __func__, addr, ret, u.val);
+	} else if (len == 1) {
+		*(u8 *)ptr = u.bytes[addr & 3];
+		pr_debug("%s: addr %llx ret %d len 1 val %x\n",
+			 __func__, addr, ret, u.bytes[addr & 3]);
+	} else {
+		pr_debug("%s: bad length %d\n", __func__, len);
+		return -EINVAL;
+	}
+
+	return ret;
+}
+
+static int kvm_mpic_write(struct kvm_io_device *this, gpa_t addr,
+			  int len, const void *ptr)
+{
+	struct openpic *opp = container_of(this, struct openpic, mmio);
+	int ret;
+
+	if (len != 4) {
+		pr_debug("%s: bad length %d\n", __func__, len);
+		return -EOPNOTSUPP;
+	}
+	if (addr & 3) {
+		pr_debug("%s: bad alignment %llx/%d\n", __func__, addr, len);
+		return -EOPNOTSUPP;
+	}
+
+	spin_lock_irq(&opp->lock);
+	ret = kvm_mpic_write_internal(opp, addr - opp->reg_base,
+				      *(const u32 *)ptr);
+	spin_unlock_irq(&opp->lock);
+
+	pr_debug("%s: addr %llx ret %d val %x\n",
+		 __func__, addr, ret, *(const u32 *)ptr);
+
+	return ret;
+}
+
+static void kvm_mpic_dtor(struct kvm_io_device *this)
+{
+	struct openpic *opp = container_of(this, struct openpic, mmio);
+
+	opp->mmio_mapped = false;
+}
+
+static const struct kvm_io_device_ops mpic_mmio_ops = {
+	.read = kvm_mpic_read,
+	.write = kvm_mpic_write,
+	.destructor = kvm_mpic_dtor,
+};
+
+static void map_mmio(struct openpic *opp)
+{
+	BUG_ON(opp->mmio_mapped);
+	opp->mmio_mapped = true;
+
+	kvm_iodevice_init(&opp->mmio, &mpic_mmio_ops);
+
+	kvm_io_bus_register_dev(opp->kvm, KVM_MMIO_BUS,
+				opp->reg_base, OPENPIC_REG_SIZE,
+				&opp->mmio);
+}
+
+static void unmap_mmio(struct openpic *opp)
+{
+	BUG_ON(opp->mmio_mapped);
+	opp->mmio_mapped = false;
+
+	kvm_io_bus_unregister_dev(opp->kvm, KVM_MMIO_BUS, &opp->mmio);
+}
+
+static int set_base_addr(struct openpic *opp, struct kvm_device_attr *attr)
+{
+	u64 base;
+
+	if (copy_from_user(&base, (u64 __user *)(long)attr->addr, sizeof(u64)))
+		return -EFAULT;
+
+	if (base & 0x3ffff) {
+		pr_debug("kvm mpic %s: KVM_DEV_MPIC_BASE_ADDR %08llx not aligned\n",
+			 __func__, base);
+		return -EINVAL;
+	}
+
+	if (base == opp->reg_base)
+		return 0;
+
+	mutex_lock(&opp->kvm->slots_lock);
+
+	unmap_mmio(opp);
+	opp->reg_base = base;
+
+	pr_debug("kvm mpic %s: KVM_DEV_MPIC_BASE_ADDR %08llx\n",
+		 __func__, base);
+
+	if (base == 0)
+		goto out;
+
+	map_mmio(opp);
+
+out:
+	mutex_unlock(&opp->kvm->slots_lock);
+	return 0;
+}
+
+#define ATTR_SET		0
+#define ATTR_GET		1
+
+static int access_reg(struct openpic *opp, gpa_t addr, u32 *val, int type)
+{
+	int ret;
+
+	if (addr & 3)
+		return -ENXIO;
+
+	spin_lock_irq(&opp->lock);
+
+	if (type == ATTR_SET)
+		ret = kvm_mpic_write_internal(opp, addr, *val);
+	else
+		ret = kvm_mpic_read_internal(opp, addr, val);
+
+	spin_unlock_irq(&opp->lock);
+
+	pr_debug("%s: type %d addr %llx val %x\n", __func__, type, addr, *val);
+
+	return ret;
+}
+
+static int mpic_set_attr(struct kvm_device *dev, struct kvm_device_attr *attr)
+{
+	struct openpic *opp = dev->private;
+	u32 attr32;
+
+	switch (attr->group) {
+	case KVM_DEV_MPIC_GRP_MISC:
+		switch (attr->attr) {
+		case KVM_DEV_MPIC_BASE_ADDR:
+			return set_base_addr(opp, attr);
+		}
+
+		break;
+
+	case KVM_DEV_MPIC_GRP_REGISTER:
+		if (get_user(attr32, (u32 __user *)(long)attr->addr))
+			return -EFAULT;
+
+		return access_reg(opp, attr->attr, &attr32, ATTR_SET);
+
+	case KVM_DEV_MPIC_GRP_IRQ_ACTIVE:
+		if (attr->attr > MAX_SRC)
+			return -EINVAL;
+
+		if (get_user(attr32, (u32 __user *)(long)attr->addr))
+			return -EFAULT;
+
+		if (attr32 != 0 && attr32 != 1)
+			return -EINVAL;
+
+		spin_lock_irq(&opp->lock);
+		openpic_set_irq(opp, attr->attr, attr32);
+		spin_unlock_irq(&opp->lock);
+		return 0;
+	}
+
+	return -ENXIO;
+}
+
+static int mpic_get_attr(struct kvm_device *dev, struct kvm_device_attr *attr)
+{
+	struct openpic *opp = dev->private;
+	u64 attr64;
+	u32 attr32;
+	int ret;
+
+	switch (attr->group) {
+	case KVM_DEV_MPIC_GRP_MISC:
+		switch (attr->attr) {
+		case KVM_DEV_MPIC_BASE_ADDR:
+			mutex_lock(&opp->kvm->slots_lock);
+			attr64 = opp->reg_base;
+			mutex_unlock(&opp->kvm->slots_lock);
+
+			if (copy_to_user((u64 __user *)(long)attr->addr,
+					 &attr64, sizeof(u64)))
+				return -EFAULT;
+
+			return 0;
+		}
+
+		break;
+
+	case KVM_DEV_MPIC_GRP_REGISTER:
+		ret = access_reg(opp, attr->attr, &attr32, ATTR_GET);
+		if (ret)
+			return ret;
+
+		if (put_user(attr32, (u32 __user *)(long)attr->addr))
+			return -EFAULT;
+
+		return 0;
+
+	case KVM_DEV_MPIC_GRP_IRQ_ACTIVE:
+		if (attr->attr > MAX_SRC)
+			return -EINVAL;
+
+		spin_lock_irq(&opp->lock);
+		attr32 = opp->src[attr->attr].pending;
+		spin_unlock_irq(&opp->lock);
+
+		if (put_user(attr32, (u32 __user *)(long)attr->addr))
+			return -EFAULT;
+
+		return 0;
+	}
+
+	return -ENXIO;
+}
+
+static int mpic_has_attr(struct kvm_device *dev, struct kvm_device_attr *attr)
+{
+	switch (attr->group) {
+	case KVM_DEV_MPIC_GRP_MISC:
+		switch (attr->attr) {
+		case KVM_DEV_MPIC_BASE_ADDR:
+			return 0;
+		}
+
+		break;
+
+	case KVM_DEV_MPIC_GRP_REGISTER:
+		return 0;
+
+	case KVM_DEV_MPIC_GRP_IRQ_ACTIVE:
+		if (attr->attr > MAX_SRC)
+			break;
+
+		return 0;
+	}
+
+	return -ENXIO;
+}
+
+static void mpic_destroy(struct kvm_device *dev)
+{
+	struct openpic *opp = dev->private;
+
+	if (opp->mmio_mapped) {
+		/*
+		 * Normally we get unmapped by kvm_io_bus_destroy(),
+		 * which happens before the VCPUs release their references.
+		 *
+		 * Thus, we should only get here if no VCPUs took a reference
+		 * to us in the first place.
+		 */
+		WARN_ON(opp->nb_cpus != 0);
+		unmap_mmio(opp);
+	}
+
+	kfree(opp);
+}
+
+static int mpic_create(struct kvm_device *dev, u32 type)
+{
+	struct openpic *opp;
+	int ret;
+
+	opp = kzalloc(sizeof(struct openpic), GFP_KERNEL);
+	if (!opp)
+		return -ENOMEM;
+
+	dev->private = opp;
+	opp->kvm = dev->kvm;
+	opp->dev = dev;
+	opp->model = type;
+	spin_lock_init(&opp->lock);
+
+	INIT_LIST_HEAD(&opp->mmio_regions);
+	list_add(&openpic_gbl_mmio.list, &opp->mmio_regions);
+	list_add(&openpic_tmr_mmio.list, &opp->mmio_regions);
+	list_add(&openpic_src_mmio.list, &opp->mmio_regions);
+	list_add(&openpic_cpu_mmio.list, &opp->mmio_regions);
 
 	switch (opp->model) {
-	case OPENPIC_MODEL_FSL_MPIC_20:
-	default:
+	case KVM_DEV_TYPE_FSL_MPIC_20:
 		opp->fsl = &fsl_mpic_20;
 		opp->brr1 = 0x00400200;
 		opp->flags |= OPENPIC_FLAG_IDR_CRIT;
@@ -1290,12 +1653,10 @@ static int openpic_init(SysBusDevice *dev)
 		opp->mpic_mode_mask = GCR_MODE_MIXED;
 
 		fsl_common_init(opp);
-		map_list(opp, list_be, &list_count);
-		map_list(opp, list_fsl, &list_count);
 
 		break;
 
-	case OPENPIC_MODEL_FSL_MPIC_42:
+	case KVM_DEV_TYPE_FSL_MPIC_42:
 		opp->fsl = &fsl_mpic_42;
 		opp->brr1 = 0x00400402;
 		opp->flags |= OPENPIC_FLAG_ILR;
@@ -1303,11 +1664,27 @@ static int openpic_init(SysBusDevice *dev)
 		opp->mpic_mode_mask = GCR_MODE_PROXY;
 
 		fsl_common_init(opp);
-		map_list(opp, list_be, &list_count);
-		map_list(opp, list_fsl, &list_count);
 
 		break;
+
+	default:
+		ret = -ENODEV;
+		goto err;
 	}
 
+	openpic_reset(opp);
 	return 0;
+
+err:
+	kfree(opp);
+	return ret;
 }
+
+struct kvm_device_ops kvm_mpic_ops = {
+	.name = "kvm-mpic",
+	.create = mpic_create,
+	.destroy = mpic_destroy,
+	.set_attr = mpic_set_attr,
+	.get_attr = mpic_get_attr,
+	.has_attr = mpic_has_attr,
+};
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 16b4595..c9a2972 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -317,6 +317,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_ENABLE_CAP:
 	case KVM_CAP_ONE_REG:
 	case KVM_CAP_IOEVENTFD:
+	case KVM_CAP_DEVICE_CTRL:
 		r = 1;
 		break;
 #ifndef CONFIG_KVM_BOOK3S_64_HV
@@ -769,7 +770,10 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 		break;
 	case KVM_CAP_PPC_EPR:
 		r = 0;
-		vcpu->arch.epr_enabled = cap->args[0];
+		if (cap->args[0])
+			vcpu->arch.epr_flags |= KVMPPC_EPR_USER;
+		else
+			vcpu->arch.epr_flags &= ~KVMPPC_EPR_USER;
 		break;
 #ifdef CONFIG_BOOKE
 	case KVM_CAP_PPC_BOOKE_WATCHDOG:
@@ -915,6 +919,7 @@ static int kvm_vm_ioctl_get_pvinfo(struct kvm_ppc_pvinfo *pvinfo)
 long kvm_arch_vm_ioctl(struct file *filp,
                        unsigned int ioctl, unsigned long arg)
 {
+	struct kvm *kvm __maybe_unused = filp->private_data;
 	void __user *argp = (void __user *)arg;
 	long r;
 
@@ -933,7 +938,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 #ifdef CONFIG_PPC_BOOK3S_64
 	case KVM_CREATE_SPAPR_TCE: {
 		struct kvm_create_spapr_tce create_tce;
-		struct kvm *kvm = filp->private_data;
 
 		r = -EFAULT;
 		if (copy_from_user(&create_tce, argp, sizeof(create_tce)))
@@ -945,7 +949,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 
 #ifdef CONFIG_KVM_BOOK3S_64_HV
 	case KVM_ALLOCATE_RMA: {
-		struct kvm *kvm = filp->private_data;
 		struct kvm_allocate_rma rma;
 
 		r = kvm_vm_ioctl_allocate_rma(kvm, &rma);
@@ -955,7 +958,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 	}
 
 	case KVM_PPC_ALLOCATE_HTAB: {
-		struct kvm *kvm = filp->private_data;
 		u32 htab_order;
 
 		r = -EFAULT;
@@ -972,7 +974,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 	}
 
 	case KVM_PPC_GET_HTAB_FD: {
-		struct kvm *kvm = filp->private_data;
 		struct kvm_get_htab_fd ghf;
 
 		r = -EFAULT;
@@ -985,7 +986,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 
 #ifdef CONFIG_PPC_BOOK3S_64
 	case KVM_PPC_GET_SMMU_INFO: {
-		struct kvm *kvm = filp->private_data;
 		struct kvm_ppc_smmu_info info;
 
 		memset(&info, 0, sizeof(info));
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8fce9bc..abc2f26 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1098,6 +1098,8 @@ void kvm_device_get(struct kvm_device *dev);
 void kvm_device_put(struct kvm_device *dev);
 struct kvm_device *kvm_device_from_filp(struct file *filp);
 
+extern struct kvm_device_ops kvm_mpic_ops;
+
 #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
 
 static inline void kvm_vcpu_set_in_spin_loop(struct kvm_vcpu *vcpu, bool val)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 20ce2d2..76963ec 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -927,6 +927,9 @@ struct kvm_device_attr {
 	__u64	addr;		/* userspace address of attr data */
 };
 
+#define KVM_DEV_TYPE_FSL_MPIC_20	1
+#define KVM_DEV_TYPE_FSL_MPIC_42	2
+
 /* ioctl for vm fd */
 #define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xe0, struct kvm_create_device)
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e2b18af..9f64438 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2237,6 +2237,12 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
 	int ret;
 
 	switch (cd->type) {
+#ifdef CONFIG_KVM_MPIC
+	case KVM_DEV_TYPE_FSL_MPIC_20:
+	case KVM_DEV_TYPE_FSL_MPIC_42:
+		ops = &kvm_mpic_ops;
+		break;
+#endif
 	default:
 		return -ENODEV;
 	}
-- 
1.7.10.4
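
Editorial note: the region lookup that kvm_mpic_read_internal() performs in
the patch above can be sketched in plain userspace C. This is only an
illustration under stated assumptions: struct region, dispatch_read() and the
two dummy handlers are hypothetical names, a fixed array stands in for the
kernel's linked list of struct mem_reg, and the timer region size is an
assumed value rather than one quoted in this message.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the patch's struct mem_reg. */
struct region {
	int (*read)(uint64_t offset, uint32_t *val);
	uint64_t start;
	uint64_t size;
};

/* Dummy handlers: each returns a recognisable tag ORed with the offset. */
static int glb_read(uint64_t offset, uint32_t *val)
{
	*val = 0x10000000u | (uint32_t)offset;
	return 0;
}

static int tmr_read(uint64_t offset, uint32_t *val)
{
	*val = 0x20000000u | (uint32_t)offset;
	return 0;
}

/*
 * The first entry mirrors OPENPIC_GLB_REG_START/SIZE and the second starts
 * at OPENPIC_TMR_REG_START; the 0x220 timer size is an assumption.
 */
static const struct region regions[] = {
	{ glb_read, 0x0,    0x10F0 },
	{ tmr_read, 0x10F0, 0x220  },
};

/*
 * Find the region containing addr and call its handler with a
 * region-relative offset; return -ENXIO if no region claims the address.
 */
static int dispatch_read(uint64_t addr, uint32_t *val)
{
	size_t i;

	for (i = 0; i < sizeof(regions) / sizeof(regions[0]); i++) {
		const struct region *r = &regions[i];

		if (addr < r->start || addr >= r->start + r->size)
			continue;

		return r->read(addr - r->start, val);
	}

	return -ENXIO;
}
```

The handlers see offsets relative to their own region, which is why the
per-region functions in the patch can keep masking addr as they did in QEMU.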




* [PATCH v4 5/6] kvm/ppc/mpic: in-kernel MPIC emulation
@ 2013-04-13  0:08         ` Scott Wood
From: Scott Wood @ 2013-04-13  0:08 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus, Scott Wood

Hook the MPIC code up to the KVM interfaces, add locking, etc.

TODO: irqfd and KVM_IRQ_LINE support

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
v4:
- update for changes in device control api
- use put/get_user where possible
- add missing locks (one that was actually required, and another to
  avoid confusion)
- fix return value handling with openpic_cpu_*_internal()
- add kconfig help text
- move some vcpu-related code into the KVM_CAP_IRQ_MPIC patch
- return proper error values rather than "1" on MMIO errors.
- simplify kvm_mpic_read
---
 Documentation/virtual/kvm/devices/mpic.txt |   37 ++
 arch/powerpc/include/asm/kvm_host.h        |    8 +-
 arch/powerpc/include/asm/kvm_ppc.h         |    7 +
 arch/powerpc/include/uapi/asm/kvm.h        |    7 +
 arch/powerpc/kvm/Kconfig                   |    9 +
 arch/powerpc/kvm/Makefile                  |    2 +
 arch/powerpc/kvm/booke.c                   |    8 +-
 arch/powerpc/kvm/mpic.c                    |  757 +++++++++++++++++++++-------
 arch/powerpc/kvm/powerpc.c                 |   12 +-
 include/linux/kvm_host.h                   |    2 +
 include/uapi/linux/kvm.h                   |    3 +
 virt/kvm/kvm_main.c                        |    6 +
 12 files changed, 658 insertions(+), 200 deletions(-)
 create mode 100644 Documentation/virtual/kvm/devices/mpic.txt
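
Editorial note: two rules stated in the mpic.txt added by this patch (the base
address must be naturally aligned to the 256 KiB register space, and an IRQ
number is the byte offset of its IVPR from EIVPR0 divided by 32) can be
sketched as small helpers. This is a minimal sketch, not part of the patch or
of the KVM API: the macro and function names are hypothetical, and the offsets
used to exercise them are made-up example values.

```c
#include <stdbool.h>
#include <stdint.h>

#define MPIC_REG_SPACE_SIZE	0x40000ULL	/* 256 KiB, per mpic.txt */
#define MPIC_IVPR_STRIDE	32		/* bytes between consecutive IVPRs */

/*
 * Hypothetical helper: true if base is naturally aligned to the register
 * space size. Zero passes as well; mpic.txt says zero disables the mapping.
 */
static bool mpic_base_is_valid(uint64_t base)
{
	return (base & (MPIC_REG_SPACE_SIZE - 1)) == 0;
}

/*
 * Hypothetical helper: IRQ number of a standard source, computed from the
 * byte offset of its IVPR and the byte offset of EIVPR0.
 */
static uint32_t mpic_irq_from_ivpr(uint64_t ivpr_offset, uint64_t eivpr0_offset)
{
	return (uint32_t)((ivpr_offset - eivpr0_offset) / MPIC_IVPR_STRIDE);
}
```

With these helpers, a base of 0xe0040000 is accepted and 0xe0041000 is
rejected, matching the (base & 0x3ffff) check in set_base_addr().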

diff --git a/Documentation/virtual/kvm/devices/mpic.txt b/Documentation/virtual/kvm/devices/mpic.txt
new file mode 100644
index 0000000..ce98e32
--- /dev/null
+++ b/Documentation/virtual/kvm/devices/mpic.txt
@@ -0,0 +1,37 @@
+MPIC interrupt controller
+=========================
+
+Device types supported:
+  KVM_DEV_TYPE_FSL_MPIC_20     Freescale MPIC v2.0
+  KVM_DEV_TYPE_FSL_MPIC_42     Freescale MPIC v4.2
+
+Only one MPIC instance, of any type, may be instantiated.  The created
+MPIC will act as the system interrupt controller, connecting to each
+vcpu's interrupt inputs.
+
+Groups:
+  KVM_DEV_MPIC_GRP_MISC
+  Attributes:
+    KVM_DEV_MPIC_BASE_ADDR (rw, 64-bit)
+      Base address of the 256 KiB MPIC register space.  Must be
+      naturally aligned.  A value of zero disables the mapping.
+      Reset value is zero.
+
+  KVM_DEV_MPIC_GRP_REGISTER (rw, 32-bit)
+    Access an MPIC register, as if the access were made from the guest.
+    "attr" is the byte offset into the MPIC register space.  Accesses
+    must be 4-byte aligned.
+
+    MSIs may be signaled by using this attribute group to write
+    to the relevant MSIIR.
+
+  KVM_DEV_MPIC_GRP_IRQ_ACTIVE (rw, 32-bit)
+    IRQ input line for each standard openpic source.  0 is inactive and 1
+    is active, regardless of interrupt sense.
+
+    For edge-triggered interrupts:  Writing 1 is considered an activating
+    edge, and writing 0 is ignored.  Reading returns 1 if a previously
+    signaled edge has not been acknowledged, and 0 otherwise.
+
+    "attr" is the IRQ number.  IRQ numbers for standard sources are the
+    byte offset of the relevant IVPR from EIVPR0, divided by 32.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index e34f8fe..7e7aef9 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -359,6 +359,11 @@ struct kvmppc_slb {
 #define KVMPPC_BOOKE_MAX_IAC	4
 #define KVMPPC_BOOKE_MAX_DAC	2
 
+/* KVMPPC_EPR_USER takes precedence over KVMPPC_EPR_KERNEL */
+#define KVMPPC_EPR_NONE		0 /* EPR not supported */
+#define KVMPPC_EPR_USER		1 /* exit to userspace to fill EPR */
+#define KVMPPC_EPR_KERNEL	2 /* in-kernel irqchip */
+
 struct kvmppc_booke_debug_reg {
 	u32 dbcr0;
 	u32 dbcr1;
@@ -522,7 +527,7 @@ struct kvm_vcpu_arch {
 	u8 sane;
 	u8 cpu_type;
 	u8 hcall_needed;
-	u8 epr_enabled;
+	u8 epr_flags; /* KVMPPC_EPR_xxx */
 	u8 epr_needed;
 
 	u32 cpr0_cfgaddr; /* holds the last set cpr0_cfgaddr */
@@ -589,5 +594,6 @@ struct kvm_vcpu_arch {
 #define KVM_MMIO_REG_FQPR	0x0060
 
 #define __KVM_HAVE_ARCH_WQP
+#define __KVM_HAVE_CREATE_DEVICE
 
 #endif /* __POWERPC_KVM_HOST_H__ */
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index f589307..3b63b97 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -164,6 +164,8 @@ extern int kvmppc_prepare_to_enter(struct kvm_vcpu *vcpu);
 
 extern int kvm_vm_ioctl_get_htab_fd(struct kvm *kvm, struct kvm_get_htab_fd *);
 
+int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu, struct kvm_interrupt *irq);
+
 /*
  * Cuts out inst bits with ordering according to spec.
  * That means the leftmost bit is zero. All given bits are included.
@@ -245,6 +247,9 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id, union kvmppc_one_reg *);
 
 void kvmppc_set_pid(struct kvm_vcpu *vcpu, u32 pid);
 
+struct openpic;
+void kvmppc_mpic_put(struct openpic *opp);
+
 #ifdef CONFIG_KVM_BOOK3S_64_HV
 static inline void kvmppc_set_xics_phys(int cpu, unsigned long addr)
 {
@@ -270,6 +275,8 @@ static inline void kvmppc_set_epr(struct kvm_vcpu *vcpu, u32 epr)
 #endif
 }
 
+void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu);
+
 int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
 			      struct kvm_config_tlb *cfg);
 int kvm_vcpu_ioctl_dirty_tlb(struct kvm_vcpu *vcpu,
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index c2ff99c..36be2fe 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -426,4 +426,11 @@ struct kvm_get_htab_header {
 /* Debugging: Special instruction for software breakpoint */
 #define KVM_REG_PPC_DEBUG_INST	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x8b)
 
+/* Device control API: PPC-specific devices */
+#define KVM_DEV_MPIC_GRP_MISC		1
+#define   KVM_DEV_MPIC_BASE_ADDR	0	/* 64-bit */
+
+#define KVM_DEV_MPIC_GRP_REGISTER	2	/* 32-bit */
+#define KVM_DEV_MPIC_GRP_IRQ_ACTIVE	3	/* 32-bit */
+
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 63c67ec..938a729 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -151,6 +151,15 @@ config KVM_E500MC
 
 	  If unsure, say N.
 
+config KVM_MPIC
+	bool "KVM in-kernel MPIC emulation"
+	depends on KVM
+	help
+	  Enable support for emulating MPIC devices inside the
+	  host kernel, rather than relying on userspace to emulate.
+	  Currently, support is limited to certain versions of
+	  Freescale's MPIC implementation.
+
 source drivers/vhost/Kconfig
 
 endif # VIRTUALIZATION
diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
index b772ede..4a2277a 100644
--- a/arch/powerpc/kvm/Makefile
+++ b/arch/powerpc/kvm/Makefile
@@ -103,6 +103,8 @@ kvm-book3s_32-objs := \
 	book3s_32_mmu.o
 kvm-objs-$(CONFIG_KVM_BOOK3S_32) := $(kvm-book3s_32-objs)
 
+kvm-objs-$(CONFIG_KVM_MPIC) += mpic.o
+
 kvm-objs := $(kvm-objs-m) $(kvm-objs-y)
 
 obj-$(CONFIG_KVM_440) += kvm.o
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index a49a68a..cff53d4 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -346,7 +346,7 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu *vcpu,
 		keep_irq = true;
 	}
 
-	if ((priority == BOOKE_IRQPRIO_EXTERNAL) && vcpu->arch.epr_enabled)
+	if ((priority == BOOKE_IRQPRIO_EXTERNAL) && vcpu->arch.epr_flags)
 		update_epr = true;
 
 	switch (priority) {
@@ -427,8 +427,10 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu *vcpu,
 			set_guest_esr(vcpu, vcpu->arch.queued_esr);
 		if (update_dear == true)
 			set_guest_dear(vcpu, vcpu->arch.queued_dear);
-		if (update_epr == true)
-			kvm_make_request(KVM_REQ_EPR_EXIT, vcpu);
+		if (update_epr == true) {
+			if (vcpu->arch.epr_flags & KVMPPC_EPR_USER)
+				kvm_make_request(KVM_REQ_EPR_EXIT, vcpu);
+		}
 
 		new_msr &= msr_mask;
 #if defined(CONFIG_64BIT)
diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index 1df67ae..8ed4072 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -23,6 +23,19 @@
  * THE SOFTWARE.
  */
 
+#include <linux/slab.h>
+#include <linux/mutex.h>
+#include <linux/kvm_host.h>
+#include <linux/errno.h>
+#include <linux/fs.h>
+#include <linux/anon_inodes.h>
+#include <asm/uaccess.h>
+#include <asm/mpic.h>
+#include <asm/kvm_para.h>
+#include <asm/kvm_host.h>
+#include <asm/kvm_ppc.h>
+#include "iodev.h"
+
 #define MAX_CPU     32
 #define MAX_SRC     256
 #define MAX_TMR     4
@@ -36,6 +49,7 @@
 #define OPENPIC_FLAG_ILR          (2 << 0)
 
 /* OpenPIC address map */
+#define OPENPIC_REG_SIZE             0x40000
 #define OPENPIC_GLB_REG_START        0x0
 #define OPENPIC_GLB_REG_SIZE         0x10F0
 #define OPENPIC_TMR_REG_START        0x10F0
@@ -89,6 +103,7 @@ static struct fsl_mpic_info fsl_mpic_42 = {
 #define ILR_INTTGT_INT    0x00
 #define ILR_INTTGT_CINT   0x01	/* critical */
 #define ILR_INTTGT_MCP    0x02	/* machine check */
+#define NUM_OUTPUTS       3
 
 #define MSIIR_OFFSET       0x140
 #define MSIIR_SRS_SHIFT    29
@@ -98,18 +113,14 @@ static struct fsl_mpic_info fsl_mpic_42 = {
 
 static int get_current_cpu(void)
 {
-	CPUState *cpu_single_cpu;
-
-	if (!cpu_single_env)
-		return -1;
-
-	cpu_single_cpu = ENV_GET_CPU(cpu_single_env);
-	return cpu_single_cpu->cpu_index;
+	struct kvm_vcpu *vcpu = current->thread.kvm_vcpu;
+	return vcpu ? vcpu->vcpu_id : -1;
 }
 
-static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx);
-static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
-				       uint32_t val, int idx);
+static int openpic_cpu_write_internal(void *opaque, gpa_t addr,
+				      u32 val, int idx);
+static int openpic_cpu_read_internal(void *opaque, gpa_t addr,
+				     u32 *ptr, int idx);
 
 enum irq_type {
 	IRQ_TYPE_NORMAL = 0,
@@ -131,7 +142,7 @@ struct irq_source {
 	uint32_t idr;		/* IRQ destination register */
 	uint32_t destmask;	/* bitmap of CPU destinations */
 	int last_cpu;
-	int output;		/* IRQ level, e.g. OPENPIC_OUTPUT_INT */
+	int output;		/* IRQ level, e.g. ILR_INTTGT_INT */
 	int pending;		/* TRUE if IRQ is pending */
 	enum irq_type type;
 	bool level:1;		/* level-triggered */
@@ -158,16 +169,27 @@ struct irq_source {
 #define IDR_CI      0x40000000	/* critical interrupt */
 
 struct irq_dest {
+	struct kvm_vcpu *vcpu;
+
 	int32_t ctpr;		/* CPU current task priority */
 	struct irq_queue raised;
 	struct irq_queue servicing;
-	qemu_irq *irqs;
 
 	/* Count of IRQ sources asserting on non-INT outputs */
-	uint32_t outputs_active[OPENPIC_OUTPUT_NB];
+	uint32_t outputs_active[NUM_OUTPUTS];
 };
 
 struct openpic {
+	struct kvm *kvm;
+	struct kvm_device *dev;
+	struct kvm_io_device mmio;
+	struct list_head mmio_regions;
+	atomic_t users;
+	bool mmio_mapped;
+
+	gpa_t reg_base;
+	spinlock_t lock;
+
 	/* Behavior control */
 	struct fsl_mpic_info *fsl;
 	uint32_t model;
@@ -208,6 +230,47 @@ struct openpic {
 	uint32_t irq_msi;
 };
 
+
+static void mpic_irq_raise(struct openpic *opp, struct irq_dest *dst,
+			   int output)
+{
+	struct kvm_interrupt irq = {
+		.irq = KVM_INTERRUPT_SET_LEVEL,
+	};
+
+	if (!dst->vcpu) {
+		pr_debug("%s: destination cpu %d does not exist\n",
+			 __func__, dst - &opp->dst[0]);
+		return;
+	}
+
+	pr_debug("%s: cpu %d output %d\n", __func__, dst->vcpu->vcpu_id,
+		output);
+
+	if (output != ILR_INTTGT_INT)	/* TODO */
+		return;
+
+	kvm_vcpu_ioctl_interrupt(dst->vcpu, &irq);
+}
+
+static void mpic_irq_lower(struct openpic *opp, struct irq_dest *dst,
+			   int output)
+{
+	if (!dst->vcpu) {
+		pr_debug("%s: destination cpu %d does not exist\n",
+			 __func__, dst - &opp->dst[0]);
+		return;
+	}
+
+	pr_debug("%s: cpu %d output %d\n", __func__, dst->vcpu->vcpu_id,
+		output);
+
+	if (output != ILR_INTTGT_INT)	/* TODO */
+		return;
+
+	kvmppc_core_dequeue_external(dst->vcpu);
+}
+
 static inline void IRQ_setbit(struct irq_queue *q, int n_IRQ)
 {
 	set_bit(n_IRQ, q->queue);
@@ -268,7 +331,7 @@ static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 	pr_debug("%s: IRQ %d active %d was %d\n",
 		__func__, n_IRQ, active, was_active);
 
-	if (src->output != OPENPIC_OUTPUT_INT) {
+	if (src->output != ILR_INTTGT_INT) {
 		pr_debug("%s: output %d irq %d active %d was %d count %d\n",
 			__func__, src->output, n_IRQ, active, was_active,
 			dst->outputs_active[src->output]);
@@ -282,14 +345,14 @@ static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 			    dst->outputs_active[src->output]++ == 0) {
 				pr_debug("%s: Raise OpenPIC output %d cpu %d irq %d\n",
 					__func__, src->output, n_CPU, n_IRQ);
-				qemu_irq_raise(dst->irqs[src->output]);
+				mpic_irq_raise(opp, dst, src->output);
 			}
 		} else {
 			if (was_active &&
 			    --dst->outputs_active[src->output] == 0) {
 				pr_debug("%s: Lower OpenPIC output %d cpu %d irq %d\n",
 					__func__, src->output, n_CPU, n_IRQ);
-				qemu_irq_lower(dst->irqs[src->output]);
+				mpic_irq_lower(opp, dst, src->output);
 			}
 		}
 
@@ -322,8 +385,7 @@ static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 		} else {
 			pr_debug("%s: Raise OpenPIC INT output cpu %d irq %d/%d\n",
 				__func__, n_CPU, n_IRQ, dst->raised.next);
-			qemu_irq_raise(opp->dst[n_CPU].
-				       irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_raise(opp, dst, ILR_INTTGT_INT);
 		}
 	} else {
 		IRQ_get_next(opp, &dst->servicing);
@@ -338,8 +400,7 @@ static void IRQ_local_pipe(struct openpic *opp, int n_CPU, int n_IRQ,
 			pr_debug("%s: IRQ %d inactive, current prio %d/%d, CPU %d\n",
 				__func__, n_IRQ, dst->ctpr,
 				dst->servicing.priority, n_CPU);
-			qemu_irq_lower(opp->dst[n_CPU].
-				       irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_lower(opp, dst, ILR_INTTGT_INT);
 		}
 	}
 }
@@ -415,8 +476,8 @@ static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 	struct irq_source *src;
 
 	if (n_IRQ >= MAX_IRQ) {
-		pr_err("%s: IRQ %d out of range\n", __func__, n_IRQ);
-		abort();
+		WARN_ONCE(1, "%s: IRQ %d out of range\n", __func__, n_IRQ);
+		return;
 	}
 
 	src = &opp->src[n_IRQ];
@@ -433,7 +494,7 @@ static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 			openpic_update_irq(opp, n_IRQ);
 		}
 
-		if (src->output != OPENPIC_OUTPUT_INT) {
+		if (src->output != ILR_INTTGT_INT) {
 			/* Edge-triggered interrupts shouldn't be used
 			 * with non-INT delivery, but just in case,
 			 * try to make it do something sane rather than
@@ -446,15 +507,13 @@ static void openpic_set_irq(void *opaque, int n_IRQ, int level)
 	}
 }
 
-static void openpic_reset(DeviceState *d)
+static void openpic_reset(struct openpic *opp)
 {
-	struct openpic *opp = FROM_SYSBUS(typeof(*opp), SYS_BUS_DEVICE(d));
 	int i;
 
 	opp->gcr = GCR_RESET;
 	/* Initialise controller registers */
 	opp->frr = ((opp->nb_irqs - 1) << FRR_NIRQ_SHIFT) |
-	    ((opp->nb_cpus - 1) << FRR_NCPU_SHIFT) |
 	    (opp->vid << FRR_VID_SHIFT);
 
 	opp->pir = 0;
@@ -504,7 +563,7 @@ static inline uint32_t read_IRQreg_idr(struct openpic *opp, int n_IRQ)
 static inline uint32_t read_IRQreg_ilr(struct openpic *opp, int n_IRQ)
 {
 	if (opp->flags & OPENPIC_FLAG_ILR)
-		return output_to_inttgt(opp->src[n_IRQ].output);
+		return opp->src[n_IRQ].output;
 
 	return 0xffffffff;
 }
@@ -539,7 +598,7 @@ static inline void write_IRQreg_idr(struct openpic *opp, int n_IRQ,
 					__func__);
 			}
 
-			src->output = OPENPIC_OUTPUT_CINT;
+			src->output = ILR_INTTGT_CINT;
 			src->nomask = true;
 			src->destmask = 0;
 
@@ -550,7 +609,7 @@ static inline void write_IRQreg_idr(struct openpic *opp, int n_IRQ,
 					src->destmask |= 1UL << i;
 			}
 		} else {
-			src->output = OPENPIC_OUTPUT_INT;
+			src->output = ILR_INTTGT_INT;
 			src->nomask = false;
 			src->destmask = src->idr & normal_mask;
 		}
@@ -565,7 +624,7 @@ static inline void write_IRQreg_ilr(struct openpic *opp, int n_IRQ,
 	if (opp->flags & OPENPIC_FLAG_ILR) {
 		struct irq_source *src = &opp->src[n_IRQ];
 
-		src->output = inttgt_to_output(val & ILR_INTTGT_MASK);
+		src->output = val & ILR_INTTGT_MASK;
 		pr_debug("Set ILR %d to 0x%08x, output %d\n", n_IRQ, src->idr,
 			src->output);
 
@@ -614,34 +673,23 @@ static inline void write_IRQreg_ivpr(struct openpic *opp, int n_IRQ,
 
 static void openpic_gcr_write(struct openpic *opp, uint64_t val)
 {
-	bool mpic_proxy = false;
-
 	if (val & GCR_RESET) {
-		openpic_reset(&opp->busdev.qdev);
+		openpic_reset(opp);
 		return;
 	}
 
 	opp->gcr &= ~opp->mpic_mode_mask;
 	opp->gcr |= val & opp->mpic_mode_mask;
-
-	/* Set external proxy mode */
-	if ((val & opp->mpic_mode_mask) == GCR_MODE_PROXY)
-		mpic_proxy = true;
-
-	ppce500_set_mpic_proxy(mpic_proxy);
 }
 
-static void openpic_gbl_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned len)
+static int openpic_gbl_write(void *opaque, gpa_t addr, u32 val)
 {
 	struct openpic *opp = opaque;
-	struct irq_dest *dst;
-	int idx;
+	int err = 0;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
-		__func__, addr, val);
+	pr_debug("%s: addr %#llx <= %08x\n", __func__, addr, val);
 	if (addr & 0xF)
-		return;
+		return 0;
 
 	switch (addr) {
 	case 0x00:	/* Block Revision Register1 (BRR1) is Readonly */
@@ -654,7 +702,8 @@ static void openpic_gbl_write(void *opaque, gpa_t addr, uint64_t val,
 	case 0x90:
 	case 0xA0:
 	case 0xB0:
-		openpic_cpu_write_internal(opp, addr, val, get_current_cpu());
+		err = openpic_cpu_write_internal(opp, addr, val,
+						 get_current_cpu());
 		break;
 	case 0x1000:		/* FRR */
 		break;
@@ -664,21 +713,11 @@ static void openpic_gbl_write(void *opaque, gpa_t addr, uint64_t val,
 	case 0x1080:		/* VIR */
 		break;
 	case 0x1090:		/* PIR */
-		for (idx = 0; idx < opp->nb_cpus; idx++) {
-			if ((val & (1 << idx)) && !(opp->pir & (1 << idx))) {
-				pr_debug("Raise OpenPIC RESET output for CPU %d\n",
-					idx);
-				dst = &opp->dst[idx];
-				qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_RESET]);
-			} else if (!(val & (1 << idx)) &&
-				   (opp->pir & (1 << idx))) {
-				pr_debug("Lower OpenPIC RESET output for CPU %d\n",
-					idx);
-				dst = &opp->dst[idx];
-				qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_RESET]);
-			}
-		}
-		opp->pir = val;
+		/*
+		 * This register is used to reset a CPU core --
+		 * let userspace handle it.
+		 */
+		err = -ENXIO;
 		break;
 	case 0x10A0:		/* IPI_IVPR */
 	case 0x10B0:
@@ -695,21 +734,25 @@ static void openpic_gbl_write(void *opaque, gpa_t addr, uint64_t val,
 	default:
 		break;
 	}
+
+	return err;
 }
 
-static uint64_t openpic_gbl_read(void *opaque, gpa_t addr, unsigned len)
+static int openpic_gbl_read(void *opaque, gpa_t addr, u32 *ptr)
 {
 	struct openpic *opp = opaque;
-	uint32_t retval;
+	u32 retval;
+	int err = 0;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 	retval = 0xFFFFFFFF;
 	if (addr & 0xF)
-		return retval;
+		goto out;
 
 	switch (addr) {
 	case 0x1000:		/* FRR */
 		retval = opp->frr;
+		retval |= (opp->nb_cpus - 1) << FRR_NCPU_SHIFT;
 		break;
 	case 0x1020:		/* GCR */
 		retval = opp->gcr;
@@ -731,8 +774,8 @@ static uint64_t openpic_gbl_read(void *opaque, gpa_t addr, unsigned len)
 	case 0x90:
 	case 0xA0:
 	case 0xB0:
-		retval =
-		    openpic_cpu_read_internal(opp, addr, get_current_cpu());
+		err = openpic_cpu_read_internal(opp, addr,
+			&retval, get_current_cpu());
 		break;
 	case 0x10A0:		/* IPI_IVPR */
 	case 0x10B0:
@@ -750,28 +793,28 @@ static uint64_t openpic_gbl_read(void *opaque, gpa_t addr, unsigned len)
 	default:
 		break;
 	}
-	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
-	return retval;
+out:
+	pr_debug("%s: => 0x%08x\n", __func__, retval);
+	*ptr = retval;
+	return err;
 }
 
-static void openpic_tmr_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned len)
+static int openpic_tmr_write(void *opaque, gpa_t addr, u32 val)
 {
 	struct openpic *opp = opaque;
 	int idx;
 
 	addr += 0x10f0;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
-		__func__, addr, val);
+	pr_debug("%s: addr %#llx <= %08x\n", __func__, addr, val);
 	if (addr & 0xF)
-		return;
+		return 0;
 
 	if (addr == 0x10f0) {
 		/* TFRR */
 		opp->tfrr = val;
-		return;
+		return 0;
 	}
 
 	idx = (addr >> 6) & 0x3;
@@ -795,15 +838,17 @@ static void openpic_tmr_write(void *opaque, gpa_t addr, uint64_t val,
 		write_IRQreg_idr(opp, opp->irq_tim0 + idx, val);
 		break;
 	}
+
+	return 0;
 }
 
-static uint64_t openpic_tmr_read(void *opaque, gpa_t addr, unsigned len)
+static int openpic_tmr_read(void *opaque, gpa_t addr, u32 *ptr)
 {
 	struct openpic *opp = opaque;
 	uint32_t retval = -1;
 	int idx;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 	if (addr & 0xF)
 		goto out;
 
@@ -813,6 +858,7 @@ static uint64_t openpic_tmr_read(void *opaque, gpa_t addr, unsigned len)
 		retval = opp->tfrr;
 		goto out;
 	}
+
 	switch (addr & 0x30) {
 	case 0x00:		/* TCCR */
 		retval = opp->timers[idx].tccr;
@@ -830,18 +876,16 @@ static uint64_t openpic_tmr_read(void *opaque, gpa_t addr, unsigned len)
 
 out:
 	pr_debug("%s: => 0x%08x\n", __func__, retval);
-
-	return retval;
+	*ptr = retval;
+	return 0;
 }
 
-static void openpic_src_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned len)
+static int openpic_src_write(void *opaque, gpa_t addr, u32 val)
 {
 	struct openpic *opp = opaque;
 	int idx;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= %08" PRIx64 "\n",
-		__func__, addr, val);
+	pr_debug("%s: addr %#llx <= %08x\n", __func__, addr, val);
 
 	addr = addr & 0xffff;
 	idx = addr >> 5;
@@ -857,15 +901,17 @@ static void openpic_src_write(void *opaque, gpa_t addr, uint64_t val,
 		write_IRQreg_ilr(opp, idx, val);
 		break;
 	}
+
+	return 0;
 }
 
-static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
+static int openpic_src_read(void *opaque, gpa_t addr, u32 *ptr)
 {
 	struct openpic *opp = opaque;
 	uint32_t retval;
 	int idx;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 	retval = 0xFFFFFFFF;
 
 	addr = addr & 0xffff;
@@ -884,20 +930,19 @@ static uint64_t openpic_src_read(void *opaque, uint64_t addr, unsigned len)
 	}
 
 	pr_debug("%s: => 0x%08x\n", __func__, retval);
-	return retval;
+	*ptr = retval;
+	return 0;
 }
 
-static void openpic_msi_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned size)
+static int openpic_msi_write(void *opaque, gpa_t addr, u32 val)
 {
 	struct openpic *opp = opaque;
 	int idx = opp->irq_msi;
 	int srs, ibs;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
-		__func__, addr, val);
+	pr_debug("%s: addr %#llx <= 0x%08x\n", __func__, addr, val);
 	if (addr & 0xF)
-		return;
+		return 0;
 
 	switch (addr) {
 	case MSIIR_OFFSET:
@@ -911,17 +956,19 @@ static void openpic_msi_write(void *opaque, gpa_t addr, uint64_t val,
 		/* most registers are read-only, thus ignored */
 		break;
 	}
+
+	return 0;
 }
 
-static uint64_t openpic_msi_read(void *opaque, gpa_t addr, unsigned size)
+static int openpic_msi_read(void *opaque, gpa_t addr, u32 *ptr)
 {
 	struct openpic *opp = opaque;
-	uint64_t r = 0;
+	uint32_t r = 0;
 	int i, srs;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 	if (addr & 0xF)
-		return -1;
+		return -ENXIO;
 
 	srs = addr >> 4;
 
@@ -945,45 +992,47 @@ static uint64_t openpic_msi_read(void *opaque, gpa_t addr, unsigned size)
 		break;
 	}
 
-	return r;
+	pr_debug("%s: => 0x%08x\n", __func__, r);
+	*ptr = r;
+	return 0;
 }
 
-static uint64_t openpic_summary_read(void *opaque, gpa_t addr, unsigned size)
+static int openpic_summary_read(void *opaque, gpa_t addr, u32 *ptr)
 {
-	uint64_t r = 0;
+	uint32_t r = 0;
 
-	pr_debug("%s: addr %#" HWADDR_PRIx "\n", __func__, addr);
+	pr_debug("%s: addr %#llx\n", __func__, addr);
 
 	/* TODO: EISR/EIMR */
 
-	return r;
+	*ptr = r;
+	return 0;
 }
 
-static void openpic_summary_write(void *opaque, gpa_t addr, uint64_t val,
-				  unsigned size)
+static int openpic_summary_write(void *opaque, gpa_t addr, u32 val)
 {
-	pr_debug("%s: addr %#" HWADDR_PRIx " <= 0x%08" PRIx64 "\n",
-		__func__, addr, val);
+	pr_debug("%s: addr %#llx <= 0x%08x\n", __func__, addr, val);
 
 	/* TODO: EISR/EIMR */
+	return 0;
 }
 
-static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
-				       uint32_t val, int idx)
+static int openpic_cpu_write_internal(void *opaque, gpa_t addr,
+				      u32 val, int idx)
 {
 	struct openpic *opp = opaque;
 	struct irq_source *src;
 	struct irq_dest *dst;
 	int s_IRQ, n_IRQ;
 
-	pr_debug("%s: cpu %d addr %#" HWADDR_PRIx " <= 0x%08x\n", __func__, idx,
+	pr_debug("%s: cpu %d addr %#llx <= 0x%08x\n", __func__, idx,
 		addr, val);
 
 	if (idx < 0)
-		return;
+		return 0;
 
 	if (addr & 0xF)
-		return;
+		return 0;
 
 	dst = &opp->dst[idx];
 	addr &= 0xFF0;
@@ -1008,11 +1057,11 @@ static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
 		if (dst->raised.priority <= dst->ctpr) {
 			pr_debug("%s: Lower OpenPIC INT output cpu %d due to ctpr\n",
 				__func__, idx);
-			qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_lower(opp, dst, ILR_INTTGT_INT);
 		} else if (dst->raised.priority > dst->servicing.priority) {
 			pr_debug("%s: Raise OpenPIC INT output cpu %d irq %d\n",
 				__func__, idx, dst->raised.next);
-			qemu_irq_raise(dst->irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_raise(opp, dst, ILR_INTTGT_INT);
 		}
 
 		break;
@@ -1043,18 +1092,22 @@ static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
 		     IVPR_PRIORITY(src->ivpr) > dst->servicing.priority)) {
 			pr_debug("Raise OpenPIC INT output cpu %d irq %d\n",
 				idx, n_IRQ);
-			qemu_irq_raise(opp->dst[idx].irqs[OPENPIC_OUTPUT_INT]);
+			mpic_irq_raise(opp, dst, ILR_INTTGT_INT);
 		}
 		break;
 	default:
 		break;
 	}
+
+	return 0;
 }
 
-static void openpic_cpu_write(void *opaque, gpa_t addr, uint64_t val,
-			      unsigned len)
+static int openpic_cpu_write(void *opaque, gpa_t addr, u32 val)
 {
-	openpic_cpu_write_internal(opaque, addr, val, (addr & 0x1f000) >> 12);
+	struct openpic *opp = opaque;
+
+	return openpic_cpu_write_internal(opp, addr, val,
+					 (addr & 0x1f000) >> 12);
 }
 
 static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
@@ -1064,7 +1117,7 @@ static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
 	int retval, irq;
 
 	pr_debug("Lower OpenPIC INT output\n");
-	qemu_irq_lower(dst->irqs[OPENPIC_OUTPUT_INT]);
+	mpic_irq_lower(opp, dst, ILR_INTTGT_INT);
 
 	irq = IRQ_get_next(opp, &dst->raised);
 	pr_debug("IACK: irq=%d\n", irq);
@@ -1107,20 +1160,21 @@ static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
 	return retval;
 }
 
-static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx)
+static int openpic_cpu_read_internal(void *opaque, gpa_t addr,
+				     u32 *ptr, int idx)
 {
 	struct openpic *opp = opaque;
 	struct irq_dest *dst;
 	uint32_t retval;
 
-	pr_debug("%s: cpu %d addr %#" HWADDR_PRIx "\n", __func__, idx, addr);
+	pr_debug("%s: cpu %d addr %#llx\n", __func__, idx, addr);
 	retval = 0xFFFFFFFF;
 
 	if (idx < 0)
-		return retval;
+		goto out;
 
 	if (addr & 0xF)
-		return retval;
+		goto out;
 
 	dst = &opp->dst[idx];
 	addr &= 0xFF0;
@@ -1142,49 +1196,67 @@ static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx)
 	}
 	pr_debug("%s: => 0x%08x\n", __func__, retval);
 
-	return retval;
+out:
+	*ptr = retval;
+	return 0;
 }
 
-static uint64_t openpic_cpu_read(void *opaque, gpa_t addr, unsigned len)
+static int openpic_cpu_read(void *opaque, gpa_t addr, u32 *ptr)
 {
-	return openpic_cpu_read_internal(opaque, addr, (addr & 0x1f000) >> 12);
+	struct openpic *opp = opaque;
+
+	return openpic_cpu_read_internal(opp, addr, ptr,
+					 (addr & 0x1f000) >> 12);
 }
 
-static const struct kvm_io_device_ops openpic_glb_ops_be = {
+struct mem_reg {
+	struct list_head list;
+	int (*read)(void *opaque, gpa_t addr, u32 *ptr);
+	int (*write)(void *opaque, gpa_t addr, u32 val);
+	gpa_t start_addr;
+	int size;
+};
+
+static struct mem_reg openpic_gbl_mmio = {
 	.write = openpic_gbl_write,
 	.read = openpic_gbl_read,
+	.start_addr = OPENPIC_GLB_REG_START,
+	.size = OPENPIC_GLB_REG_SIZE,
 };
 
-static const struct kvm_io_device_ops openpic_tmr_ops_be = {
+static struct mem_reg openpic_tmr_mmio = {
 	.write = openpic_tmr_write,
 	.read = openpic_tmr_read,
+	.start_addr = OPENPIC_TMR_REG_START,
+	.size = OPENPIC_TMR_REG_SIZE,
 };
 
-static const struct kvm_io_device_ops openpic_cpu_ops_be = {
+static struct mem_reg openpic_cpu_mmio = {
 	.write = openpic_cpu_write,
 	.read = openpic_cpu_read,
+	.start_addr = OPENPIC_CPU_REG_START,
+	.size = OPENPIC_CPU_REG_SIZE,
 };
 
-static const struct kvm_io_device_ops openpic_src_ops_be = {
+static struct mem_reg openpic_src_mmio = {
 	.write = openpic_src_write,
 	.read = openpic_src_read,
+	.start_addr = OPENPIC_SRC_REG_START,
+	.size = OPENPIC_SRC_REG_SIZE,
 };
 
-static const struct kvm_io_device_ops openpic_msi_ops_be = {
+static struct mem_reg openpic_msi_mmio = {
 	.read = openpic_msi_read,
 	.write = openpic_msi_write,
+	.start_addr = OPENPIC_MSI_REG_START,
+	.size = OPENPIC_MSI_REG_SIZE,
 };
 
-static const struct kvm_io_device_ops openpic_summary_ops_be = {
+static struct mem_reg openpic_summary_mmio = {
 	.read = openpic_summary_read,
 	.write = openpic_summary_write,
-};
-
-struct mem_reg {
-	const char *name;
-	const struct kvm_io_device_ops *ops;
-	gpa_t start_addr;
-	int size;
+	.start_addr = OPENPIC_SUMMARY_REG_START,
+	.size = OPENPIC_SUMMARY_REG_SIZE,
 };
 
 static void fsl_common_init(struct openpic *opp)
@@ -1192,6 +1264,9 @@ static void fsl_common_init(struct openpic *opp)
 	int i;
 	int virq = MAX_SRC;
 
+	list_add(&openpic_msi_mmio.list, &opp->mmio_regions);
+	list_add(&openpic_summary_mmio.list, &opp->mmio_regions);
+
 	opp->vid = VID_REVISION_1_2;
 	opp->vir = VIR_GENERIC;
 	opp->vector_mask = 0xFFFF;
@@ -1205,11 +1280,10 @@ static void fsl_common_init(struct openpic *opp)
 	opp->irq_tim0 = virq;
 	virq += MAX_TMR;
 
-	assert(virq <= MAX_IRQ);
+	BUG_ON(virq > MAX_IRQ);
 
 	opp->irq_msi = 224;
 
-	msi_supported = true;
 	for (i = 0; i < opp->fsl->max_ext; i++)
 		opp->src[i].level = false;
 
@@ -1226,63 +1300,352 @@ static void fsl_common_init(struct openpic *opp)
 	}
 }
 
-static void map_list(struct openpic *opp, const struct mem_reg *list,
-		     int *count)
+static int kvm_mpic_read_internal(struct openpic *opp, gpa_t addr, u32 *ptr)
 {
-	while (list->name) {
-		assert(*count < ARRAY_SIZE(opp->sub_io_mem));
+	struct list_head *node;
 
-		memory_region_init_io(&opp->sub_io_mem[*count], list->ops, opp,
-				      list->name, list->size);
+	list_for_each(node, &opp->mmio_regions) {
+		struct mem_reg *mr = list_entry(node, struct mem_reg, list);
 
-		memory_region_add_subregion(&opp->mem, list->start_addr,
-					    &opp->sub_io_mem[*count]);
+		if (mr->start_addr > addr || addr >= mr->start_addr + mr->size)
+			continue;
 
-		(*count)++;
-		list++;
+		return mr->read(opp, addr - mr->start_addr, ptr);
 	}
+
+	return -ENXIO;
 }
 
-static int openpic_init(SysBusDevice *dev)
+static int kvm_mpic_write_internal(struct openpic *opp, gpa_t addr, u32 val)
 {
-	struct openpic *opp = FROM_SYSBUS(typeof(*opp), dev);
-	int i, j;
-	int list_count = 0;
-	static const struct mem_reg list_le[] = {
-		{"glb", &openpic_glb_ops_le,
-		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
-		{"tmr", &openpic_tmr_ops_le,
-		 OPENPIC_TMR_REG_START, OPENPIC_TMR_REG_SIZE},
-		{"src", &openpic_src_ops_le,
-		 OPENPIC_SRC_REG_START, OPENPIC_SRC_REG_SIZE},
-		{"cpu", &openpic_cpu_ops_le,
-		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
-		{NULL}
-	};
-	static const struct mem_reg list_be[] = {
-		{"glb", &openpic_glb_ops_be,
-		 OPENPIC_GLB_REG_START, OPENPIC_GLB_REG_SIZE},
-		{"tmr", &openpic_tmr_ops_be,
-		 OPENPIC_TMR_REG_START, OPENPIC_TMR_REG_SIZE},
-		{"src", &openpic_src_ops_be,
-		 OPENPIC_SRC_REG_START, OPENPIC_SRC_REG_SIZE},
-		{"cpu", &openpic_cpu_ops_be,
-		 OPENPIC_CPU_REG_START, OPENPIC_CPU_REG_SIZE},
-		{NULL}
-	};
-	static const struct mem_reg list_fsl[] = {
-		{"msi", &openpic_msi_ops_be,
-		 OPENPIC_MSI_REG_START, OPENPIC_MSI_REG_SIZE},
-		{"summary", &openpic_summary_ops_be,
-		 OPENPIC_SUMMARY_REG_START, OPENPIC_SUMMARY_REG_SIZE},
-		{NULL}
-	};
+	struct list_head *node;
+
+	list_for_each(node, &opp->mmio_regions) {
+		struct mem_reg *mr = list_entry(node, struct mem_reg, list);
+
+		if (mr->start_addr > addr || addr >= mr->start_addr + mr->size)
+			continue;
 
-	memory_region_init(&opp->mem, "openpic", 0x40000);
+		return mr->write(opp, addr - mr->start_addr, val);
+	}
+
+	return -ENXIO;
+}
+
+static int kvm_mpic_read(struct kvm_io_device *this, gpa_t addr,
+			 int len, void *ptr)
+{
+	struct openpic *opp = container_of(this, struct openpic, mmio);
+	int ret;
+	union {
+		u32 val;
+		u8 bytes[4];
+	} u;
+
+	if (addr & (len - 1)) {
+		pr_debug("%s: bad alignment %llx/%d\n",
+			 __func__, addr, len);
+		return -EINVAL;
+	}
+
+	spin_lock_irq(&opp->lock);
+	ret = kvm_mpic_read_internal(opp, addr - opp->reg_base, &u.val);
+	spin_unlock_irq(&opp->lock);
+
+	/*
+	 * Technically only 32-bit accesses are allowed, but be nice to
+	 * people dumping registers a byte at a time -- it works in real
+	 * hardware (reads only, not writes).
+	 */
+	if (len == 4) {
+		*(u32 *)ptr = u.val;
+		pr_debug("%s: addr %llx ret %d len 4 val %x\n",
+			 __func__, addr, ret, u.val);
+	} else if (len == 1) {
+		*(u8 *)ptr = u.bytes[addr & 3];
+		pr_debug("%s: addr %llx ret %d len 1 val %x\n",
+			 __func__, addr, ret, u.bytes[addr & 3]);
+	} else {
+		pr_debug("%s: bad length %d\n", __func__, len);
+		return -EINVAL;
+	}
+
+	return ret;
+}
+
+static int kvm_mpic_write(struct kvm_io_device *this, gpa_t addr,
+			  int len, const void *ptr)
+{
+	struct openpic *opp = container_of(this, struct openpic, mmio);
+	int ret;
+
+	if (len != 4) {
+		pr_debug("%s: bad length %d\n", __func__, len);
+		return -EOPNOTSUPP;
+	}
+	if (addr & 3) {
+		pr_debug("%s: bad alignment %llx/%d\n", __func__, addr, len);
+		return -EOPNOTSUPP;
+	}
+
+	spin_lock_irq(&opp->lock);
+	ret = kvm_mpic_write_internal(opp, addr - opp->reg_base,
+				      *(const u32 *)ptr);
+	spin_unlock_irq(&opp->lock);
+
+	pr_debug("%s: addr %llx ret %d val %x\n",
+		 __func__, addr, ret, *(const u32 *)ptr);
+
+	return ret;
+}
+
+static void kvm_mpic_dtor(struct kvm_io_device *this)
+{
+	struct openpic *opp = container_of(this, struct openpic, mmio);
+
+	opp->mmio_mapped = false;
+}
+
+static const struct kvm_io_device_ops mpic_mmio_ops = {
+	.read = kvm_mpic_read,
+	.write = kvm_mpic_write,
+	.destructor = kvm_mpic_dtor,
+};
+
+static void map_mmio(struct openpic *opp)
+{
+	BUG_ON(opp->mmio_mapped);
+	opp->mmio_mapped = true;
+
+	kvm_iodevice_init(&opp->mmio, &mpic_mmio_ops);
+
+	kvm_io_bus_register_dev(opp->kvm, KVM_MMIO_BUS,
+				opp->reg_base, OPENPIC_REG_SIZE,
+				&opp->mmio);
+}
+
+static void unmap_mmio(struct openpic *opp)
+{
+	if (opp->mmio_mapped) {
+		opp->mmio_mapped = false;
+		kvm_io_bus_unregister_dev(opp->kvm, KVM_MMIO_BUS, &opp->mmio);
+	}
+}
+
+static int set_base_addr(struct openpic *opp, struct kvm_device_attr *attr)
+{
+	u64 base;
+
+	if (copy_from_user(&base, (u64 __user *)(long)attr->addr, sizeof(u64)))
+		return -EFAULT;
+
+	if (base & 0x3ffff) {
+		pr_debug("kvm mpic %s: KVM_DEV_MPIC_BASE_ADDR %08llx not aligned\n",
+			 __func__, base);
+		return -EINVAL;
+	}
+
+	if (base == opp->reg_base)
+		return 0;
+
+	mutex_lock(&opp->kvm->slots_lock);
+
+	unmap_mmio(opp);
+	opp->reg_base = base;
+
+	pr_debug("kvm mpic %s: KVM_DEV_MPIC_BASE_ADDR %08llx\n",
+		 __func__, base);
+
+	if (base == 0)
+		goto out;
+
+	map_mmio(opp);
+
+out:
+	mutex_unlock(&opp->kvm->slots_lock);
+	return 0;
+}
+
+#define ATTR_SET		0
+#define ATTR_GET		1
+
+static int access_reg(struct openpic *opp, gpa_t addr, u32 *val, int type)
+{
+	int ret;
+
+	if (addr & 3)
+		return -ENXIO;
+
+	spin_lock_irq(&opp->lock);
+
+	if (type == ATTR_SET)
+		ret = kvm_mpic_write_internal(opp, addr, *val);
+	else
+		ret = kvm_mpic_read_internal(opp, addr, val);
+
+	spin_unlock_irq(&opp->lock);
+
+	pr_debug("%s: type %d addr %llx val %x\n", __func__, type, addr, *val);
+
+	return ret;
+}
+
+static int mpic_set_attr(struct kvm_device *dev, struct kvm_device_attr *attr)
+{
+	struct openpic *opp = dev->private;
+	u32 attr32;
+
+	switch (attr->group) {
+	case KVM_DEV_MPIC_GRP_MISC:
+		switch (attr->attr) {
+		case KVM_DEV_MPIC_BASE_ADDR:
+			return set_base_addr(opp, attr);
+		}
+
+		break;
+
+	case KVM_DEV_MPIC_GRP_REGISTER:
+		if (get_user(attr32, (u32 __user *)(long)attr->addr))
+			return -EFAULT;
+
+		return access_reg(opp, attr->attr, &attr32, ATTR_SET);
+
+	case KVM_DEV_MPIC_GRP_IRQ_ACTIVE:
+		if (attr->attr > MAX_SRC)
+			return -EINVAL;
+
+		if (get_user(attr32, (u32 __user *)(long)attr->addr))
+			return -EFAULT;
+
+		if (attr32 != 0 && attr32 != 1)
+			return -EINVAL;
+
+		spin_lock_irq(&opp->lock);
+		openpic_set_irq(opp, attr->attr, attr32);
+		spin_unlock_irq(&opp->lock);
+		return 0;
+	}
+
+	return -ENXIO;
+}
+
+static int mpic_get_attr(struct kvm_device *dev, struct kvm_device_attr *attr)
+{
+	struct openpic *opp = dev->private;
+	u64 attr64;
+	u32 attr32;
+	int ret;
+
+	switch (attr->group) {
+	case KVM_DEV_MPIC_GRP_MISC:
+		switch (attr->attr) {
+		case KVM_DEV_MPIC_BASE_ADDR:
+			mutex_lock(&opp->kvm->slots_lock);
+			attr64 = opp->reg_base;
+			mutex_unlock(&opp->kvm->slots_lock);
+
+			if (copy_to_user((u64 __user *)(long)attr->addr,
+					 &attr64, sizeof(u64)))
+				return -EFAULT;
+
+			return 0;
+		}
+
+		break;
+
+	case KVM_DEV_MPIC_GRP_REGISTER:
+		ret = access_reg(opp, attr->attr, &attr32, ATTR_GET);
+		if (ret)
+			return ret;
+
+		if (put_user(attr32, (u32 __user *)(long)attr->addr))
+			return -EFAULT;
+
+		return 0;
+
+	case KVM_DEV_MPIC_GRP_IRQ_ACTIVE:
+		if (attr->attr > MAX_SRC)
+			return -EINVAL;
+
+		spin_lock_irq(&opp->lock);
+		attr32 = opp->src[attr->attr].pending;
+		spin_unlock_irq(&opp->lock);
+
+		if (put_user(attr32, (u32 __user *)(long)attr->addr))
+			return -EFAULT;
+
+		return 0;
+	}
+
+	return -ENXIO;
+}
+
+static int mpic_has_attr(struct kvm_device *dev, struct kvm_device_attr *attr)
+{
+	switch (attr->group) {
+	case KVM_DEV_MPIC_GRP_MISC:
+		switch (attr->attr) {
+		case KVM_DEV_MPIC_BASE_ADDR:
+			return 0;
+		}
+
+		break;
+
+	case KVM_DEV_MPIC_GRP_REGISTER:
+		return 0;
+
+	case KVM_DEV_MPIC_GRP_IRQ_ACTIVE:
+		if (attr->attr > MAX_SRC)
+			break;
+
+		return 0;
+	}
+
+	return -ENXIO;
+}
+
+static void mpic_destroy(struct kvm_device *dev)
+{
+	struct openpic *opp = dev->private;
+
+	if (opp->mmio_mapped) {
+		/*
+		 * Normally we get unmapped by kvm_io_bus_destroy(),
+		 * which happens before the VCPUs release their references.
+		 *
+		 * Thus, we should only get here if no VCPUs took a reference
+		 * to us in the first place.
+		 */
+		WARN_ON(opp->nb_cpus != 0);
+		unmap_mmio(opp);
+	}
+
+	kfree(opp);
+}
+
+static int mpic_create(struct kvm_device *dev, u32 type)
+{
+	struct openpic *opp;
+	int ret;
+
+	opp = kzalloc(sizeof(struct openpic), GFP_KERNEL);
+	if (!opp)
+		return -ENOMEM;
+
+	dev->private = opp;
+	opp->kvm = dev->kvm;
+	opp->dev = dev;
+	opp->model = type;
+	spin_lock_init(&opp->lock);
+
+	INIT_LIST_HEAD(&opp->mmio_regions);
+	list_add(&openpic_gbl_mmio.list, &opp->mmio_regions);
+	list_add(&openpic_tmr_mmio.list, &opp->mmio_regions);
+	list_add(&openpic_src_mmio.list, &opp->mmio_regions);
+	list_add(&openpic_cpu_mmio.list, &opp->mmio_regions);
 
 	switch (opp->model) {
-	case OPENPIC_MODEL_FSL_MPIC_20:
-	default:
+	case KVM_DEV_TYPE_FSL_MPIC_20:
 		opp->fsl = &fsl_mpic_20;
 		opp->brr1 = 0x00400200;
 		opp->flags |= OPENPIC_FLAG_IDR_CRIT;
@@ -1290,12 +1653,10 @@ static int openpic_init(SysBusDevice *dev)
 		opp->mpic_mode_mask = GCR_MODE_MIXED;
 
 		fsl_common_init(opp);
-		map_list(opp, list_be, &list_count);
-		map_list(opp, list_fsl, &list_count);
 
 		break;
 
-	case OPENPIC_MODEL_FSL_MPIC_42:
+	case KVM_DEV_TYPE_FSL_MPIC_42:
 		opp->fsl = &fsl_mpic_42;
 		opp->brr1 = 0x00400402;
 		opp->flags |= OPENPIC_FLAG_ILR;
@@ -1303,11 +1664,27 @@ static int openpic_init(SysBusDevice *dev)
 		opp->mpic_mode_mask = GCR_MODE_PROXY;
 
 		fsl_common_init(opp);
-		map_list(opp, list_be, &list_count);
-		map_list(opp, list_fsl, &list_count);
 
 		break;
+
+	default:
+		ret = -ENODEV;
+		goto err;
 	}
 
+	openpic_reset(opp);
 	return 0;
+
+err:
+	kfree(opp);
+	return ret;
 }
+
+struct kvm_device_ops kvm_mpic_ops = {
+	.name = "kvm-mpic",
+	.create = mpic_create,
+	.destroy = mpic_destroy,
+	.set_attr = mpic_set_attr,
+	.get_attr = mpic_get_attr,
+	.has_attr = mpic_has_attr,
+};
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 16b4595..c9a2972 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -317,6 +317,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_ENABLE_CAP:
 	case KVM_CAP_ONE_REG:
 	case KVM_CAP_IOEVENTFD:
+	case KVM_CAP_DEVICE_CTRL:
 		r = 1;
 		break;
 #ifndef CONFIG_KVM_BOOK3S_64_HV
@@ -769,7 +770,10 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 		break;
 	case KVM_CAP_PPC_EPR:
 		r = 0;
-		vcpu->arch.epr_enabled = cap->args[0];
+		if (cap->args[0])
+			vcpu->arch.epr_flags |= KVMPPC_EPR_USER;
+		else
+			vcpu->arch.epr_flags &= ~KVMPPC_EPR_USER;
 		break;
 #ifdef CONFIG_BOOKE
 	case KVM_CAP_PPC_BOOKE_WATCHDOG:
@@ -915,6 +919,7 @@ static int kvm_vm_ioctl_get_pvinfo(struct kvm_ppc_pvinfo *pvinfo)
 long kvm_arch_vm_ioctl(struct file *filp,
                        unsigned int ioctl, unsigned long arg)
 {
+	struct kvm *kvm __maybe_unused = filp->private_data;
 	void __user *argp = (void __user *)arg;
 	long r;
 
@@ -933,7 +938,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 #ifdef CONFIG_PPC_BOOK3S_64
 	case KVM_CREATE_SPAPR_TCE: {
 		struct kvm_create_spapr_tce create_tce;
-		struct kvm *kvm = filp->private_data;
 
 		r = -EFAULT;
 		if (copy_from_user(&create_tce, argp, sizeof(create_tce)))
@@ -945,7 +949,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 
 #ifdef CONFIG_KVM_BOOK3S_64_HV
 	case KVM_ALLOCATE_RMA: {
-		struct kvm *kvm = filp->private_data;
 		struct kvm_allocate_rma rma;
 
 		r = kvm_vm_ioctl_allocate_rma(kvm, &rma);
@@ -955,7 +958,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 	}
 
 	case KVM_PPC_ALLOCATE_HTAB: {
-		struct kvm *kvm = filp->private_data;
 		u32 htab_order;
 
 		r = -EFAULT;
@@ -972,7 +974,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 	}
 
 	case KVM_PPC_GET_HTAB_FD: {
-		struct kvm *kvm = filp->private_data;
 		struct kvm_get_htab_fd ghf;
 
 		r = -EFAULT;
@@ -985,7 +986,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
 
 #ifdef CONFIG_PPC_BOOK3S_64
 	case KVM_PPC_GET_SMMU_INFO: {
-		struct kvm *kvm = filp->private_data;
 		struct kvm_ppc_smmu_info info;
 
 		memset(&info, 0, sizeof(info));
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8fce9bc..abc2f26 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1098,6 +1098,8 @@ void kvm_device_get(struct kvm_device *dev);
 void kvm_device_put(struct kvm_device *dev);
 struct kvm_device *kvm_device_from_filp(struct file *filp);
 
+extern struct kvm_device_ops kvm_mpic_ops;
+
 #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
 
 static inline void kvm_vcpu_set_in_spin_loop(struct kvm_vcpu *vcpu, bool val)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 20ce2d2..76963ec 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -927,6 +927,9 @@ struct kvm_device_attr {
 	__u64	addr;		/* userspace address of attr data */
 };
 
+#define KVM_DEV_TYPE_FSL_MPIC_20	1
+#define KVM_DEV_TYPE_FSL_MPIC_42	2
+
 /* ioctl for vm fd */
 #define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xe0, struct kvm_create_device)
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e2b18af..9f64438 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2237,6 +2237,12 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
 	int ret;
 
 	switch (cd->type) {
+#ifdef CONFIG_KVM_MPIC
+	case KVM_DEV_TYPE_FSL_MPIC_20:
+	case KVM_DEV_TYPE_FSL_MPIC_42:
+		ops = &kvm_mpic_ops;
+		break;
+#endif
 	default:
 		return -ENODEV;
 	}
-- 
1.7.10.4
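
For readers following along from userspace, here is a minimal sketch of how the device types added above might be probed and instantiated. It assumes kernel headers new enough to carry KVM_CREATE_DEVICE and KVM_DEV_TYPE_FSL_MPIC_20; the helper names mpic_type_supported() and create_mpic() are made up for illustration and are not part of this series.

```c
#include <linux/kvm.h>
#include <string.h>
#include <sys/ioctl.h>

/* Hypothetical helper: returns 1 if the kernel knows this device type. */
static int mpic_type_supported(int vm_fd, unsigned int type)
{
	struct kvm_create_device cd;

	memset(&cd, 0, sizeof(cd));
	cd.type = type;
	cd.flags = KVM_CREATE_DEVICE_TEST;	/* probe only, nothing is created */

	return ioctl(vm_fd, KVM_CREATE_DEVICE, &cd) == 0;
}

/* Hypothetical helper: returns the new device fd, or -1 with errno set. */
static int create_mpic(int vm_fd, unsigned int type)
{
	struct kvm_create_device cd;

	memset(&cd, 0, sizeof(cd));
	cd.type = type;

	if (ioctl(vm_fd, KVM_CREATE_DEVICE, &cd) < 0)
		return -1;

	return (int)cd.fd;
}
```

A caller would presumably try KVM_DEV_TYPE_FSL_MPIC_42 or KVM_DEV_TYPE_FSL_MPIC_20 depending on the machine model, and treat ENODEV as "no in-kernel MPIC available".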




* [PATCH v4 6/6] kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC
  2013-04-13  0:08       ` Scott Wood
@ 2013-04-13  0:08         ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-13  0:08 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm, paulus, Scott Wood

Enabling this capability connects the vcpu to the designated in-kernel
MPIC.  Using explicit connections between vcpus and irqchips allows
for flexibility, but the main benefit at the moment is that it
simplifies the code -- KVM doesn't need vm-global state to remember
which MPIC object is associated with this vm, and it doesn't need to
care about ordering between irqchip creation and vcpu creation.

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
v4:
- add vcpu->arch.irq_cpu_id, instead of incorrectly using vcpu_id
  now that the numberspaces are separate
- pass the vcpu to mpic code when disconnecting from the vcpu
- minor bugfixes on error paths
---
 Documentation/virtual/kvm/api.txt   |    8 ++++
 arch/powerpc/include/asm/kvm_host.h |    9 ++++
 arch/powerpc/include/asm/kvm_ppc.h  |    4 +-
 arch/powerpc/kvm/booke.c            |    4 ++
 arch/powerpc/kvm/mpic.c             |   82 ++++++++++++++++++++++++++++++++---
 arch/powerpc/kvm/powerpc.c          |   30 +++++++++++++
 include/uapi/linux/kvm.h            |    1 +
 7 files changed, 130 insertions(+), 8 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index d52f3f9..4c326ae 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2728,3 +2728,11 @@ to receive the topmost interrupt vector.
 When disabled (args[0] == 0), behavior is as if this facility is unsupported.
 
 When this capability is enabled, KVM_EXIT_EPR can occur.
+
+6.6 KVM_CAP_IRQ_MPIC
+
+Architectures: ppc
+Parameters: args[0] is the MPIC device fd
+            args[1] is the MPIC CPU number for this vcpu
+
+This capability connects the vcpu to an in-kernel MPIC device.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 7e7aef9..36368c9 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -375,6 +375,11 @@ struct kvmppc_booke_debug_reg {
 	u64 dac[KVMPPC_BOOKE_MAX_DAC];
 };
 
+#define KVMPPC_IRQ_DEFAULT	0
+#define KVMPPC_IRQ_MPIC		1
+
+struct openpic;
+
 struct kvm_vcpu_arch {
 	ulong host_stack;
 	u32 host_pid;
@@ -554,6 +559,10 @@ struct kvm_vcpu_arch {
 	unsigned long magic_page_pa; /* phys addr to map the magic page to */
 	unsigned long magic_page_ea; /* effect. addr to map the magic page to */
 
+	int irq_type;		/* one of KVMPPC_IRQ_* */
+	int irq_cpu_id;
+	struct openpic *mpic;	/* KVMPPC_IRQ_MPIC */
+
 #ifdef CONFIG_KVM_BOOK3S_64_HV
 	struct kvm_vcpu_arch_shared shregs;
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 3b63b97..b25d475 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -248,7 +248,6 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id, union kvmppc_one_reg *);
 void kvmppc_set_pid(struct kvm_vcpu *vcpu, u32 pid);
 
 struct openpic;
-void kvmppc_mpic_put(struct openpic *opp);
 
 #ifdef CONFIG_KVM_BOOK3S_64_HV
 static inline void kvmppc_set_xics_phys(int cpu, unsigned long addr)
@@ -276,6 +275,9 @@ static inline void kvmppc_set_epr(struct kvm_vcpu *vcpu, u32 epr)
 }
 
 void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu);
+int kvmppc_mpic_connect_vcpu(struct kvm_device *dev, struct kvm_vcpu *vcpu,
+			     u32 cpu);
+void kvmppc_mpic_disconnect_vcpu(struct openpic *opp, struct kvm_vcpu *vcpu);
 
 int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
 			      struct kvm_config_tlb *cfg);
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index cff53d4..0097912 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -430,6 +430,10 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu *vcpu,
 		if (update_epr == true) {
 			if (vcpu->arch.epr_flags & KVMPPC_EPR_USER)
 				kvm_make_request(KVM_REQ_EPR_EXIT, vcpu);
+			else if (vcpu->arch.epr_flags & KVMPPC_EPR_KERNEL) {
+				BUG_ON(vcpu->arch.irq_type != KVMPPC_IRQ_MPIC);
+				kvmppc_mpic_set_epr(vcpu);
+			}
 		}
 
 		new_msr &= msr_mask;
diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index 8ed4072..a231e0c 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -114,7 +114,7 @@ static struct fsl_mpic_info fsl_mpic_42 = {
 static int get_current_cpu(void)
 {
 	struct kvm_vcpu *vcpu = current->thread.kvm_vcpu;
-	return vcpu ? vcpu->vcpu_id : -1;
+	return vcpu ? vcpu->arch.irq_cpu_id : -1;
 }
 
 static int openpic_cpu_write_internal(void *opaque, gpa_t addr,
@@ -244,7 +244,7 @@ static void mpic_irq_raise(struct openpic *opp, struct irq_dest *dst,
 		return;
 	}
 
-	pr_debug("%s: cpu %d output %d\n", __func__, dst->vcpu->vcpu_id,
+	pr_debug("%s: cpu %d output %d\n", __func__, dst->vcpu->arch.irq_cpu_id,
 		output);
 
 	if (output != ILR_INTTGT_INT)	/* TODO */
@@ -262,7 +262,7 @@ static void mpic_irq_lower(struct openpic *opp, struct irq_dest *dst,
 		return;
 	}
 
-	pr_debug("%s: cpu %d output %d\n", __func__, dst->vcpu->vcpu_id,
+	pr_debug("%s: cpu %d output %d\n", __func__, dst->vcpu->arch.irq_cpu_id,
 		output);
 
 	if (output != ILR_INTTGT_INT)	/* TODO */
@@ -1160,6 +1160,20 @@ static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
 	return retval;
 }
 
+void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu)
+{
+	struct openpic *opp = vcpu->arch.mpic;
+	int cpu = vcpu->arch.irq_cpu_id;
+	unsigned long flags;
+
+	spin_lock_irqsave(&opp->lock, flags);
+
+	if ((opp->gcr & opp->mpic_mode_mask) == GCR_MODE_PROXY)
+		kvmppc_set_epr(vcpu, openpic_iack(opp, &opp->dst[cpu], cpu));
+
+	spin_unlock_irqrestore(&opp->lock, flags);
+}
+
 static int openpic_cpu_read_internal(void *opaque, gpa_t addr,
 				     u32 *ptr, int idx)
 {
@@ -1426,10 +1440,10 @@ static void map_mmio(struct openpic *opp)
 
 static void unmap_mmio(struct openpic *opp)
 {
-	BUG_ON(opp->mmio_mapped);
-	opp->mmio_mapped = false;
-
-	kvm_io_bus_unregister_dev(opp->kvm, KVM_MMIO_BUS, &opp->mmio);
+	if (opp->mmio_mapped) {
+		opp->mmio_mapped = false;
+		kvm_io_bus_unregister_dev(opp->kvm, KVM_MMIO_BUS, &opp->mmio);
+	}
 }
 
 static int set_base_addr(struct openpic *opp, struct kvm_device_attr *attr)
@@ -1688,3 +1702,57 @@ struct kvm_device_ops kvm_mpic_ops = {
 	.get_attr = mpic_get_attr,
 	.has_attr = mpic_has_attr,
 };
+
+int kvmppc_mpic_connect_vcpu(struct kvm_device *dev, struct kvm_vcpu *vcpu,
+			     u32 cpu)
+{
+	struct openpic *opp = dev->private;
+	int ret = 0;
+
+	if (dev->ops != &kvm_mpic_ops)
+		return -EPERM;
+	if (opp->kvm != vcpu->kvm)
+		return -EPERM;
+	if (cpu >= MAX_CPU)
+		return -EPERM;
+
+	spin_lock_irq(&opp->lock);
+
+	if (opp->dst[cpu].vcpu) {
+		ret = -EEXIST;
+		goto out;
+	}
+	if (vcpu->arch.irq_type) {
+		ret = -EBUSY;
+		goto out;
+	}
+
+	opp->dst[cpu].vcpu = vcpu;
+	opp->nb_cpus = max(opp->nb_cpus, cpu + 1);
+
+	vcpu->arch.mpic = opp;
+	vcpu->arch.irq_cpu_id = cpu;
+	vcpu->arch.irq_type = KVMPPC_IRQ_MPIC;
+
+	/* This might need to be changed if GCR gets extended */
+	if (opp->mpic_mode_mask == GCR_MODE_PROXY)
+		vcpu->arch.epr_flags |= KVMPPC_EPR_KERNEL;
+
+	kvm_device_get(dev);
+out:
+	spin_unlock_irq(&opp->lock);
+	return ret;
+}
+
+/*
+ * This should only happen immediately before the mpic is destroyed,
+ * so we shouldn't need to worry about anything still trying to
+ * access the vcpu pointer.
+ */
+void kvmppc_mpic_disconnect_vcpu(struct openpic *opp, struct kvm_vcpu *vcpu)
+{
+	BUG_ON(!opp->dst[vcpu->arch.irq_cpu_id].vcpu);
+
+	opp->dst[vcpu->arch.irq_cpu_id].vcpu = NULL;
+	kvm_device_put(opp->dev);
+}
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index c9a2972..1d3888b 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -25,6 +25,7 @@
 #include <linux/hrtimer.h>
 #include <linux/fs.h>
 #include <linux/slab.h>
+#include <linux/file.h>
 #include <asm/cputable.h>
 #include <asm/uaccess.h>
 #include <asm/kvm_ppc.h>
@@ -327,6 +328,9 @@ int kvm_dev_ioctl_check_extension(long ext)
 #if defined(CONFIG_KVM_E500V2) || defined(CONFIG_KVM_E500MC)
 	case KVM_CAP_SW_TLB:
 #endif
+#ifdef CONFIG_KVM_MPIC
+	case KVM_CAP_IRQ_MPIC:
+#endif
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
@@ -460,6 +464,13 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
 	tasklet_kill(&vcpu->arch.tasklet);
 
 	kvmppc_remove_vcpu_debugfs(vcpu);
+
+	switch (vcpu->arch.irq_type) {
+	case KVMPPC_IRQ_MPIC:
+		kvmppc_mpic_disconnect_vcpu(vcpu->arch.mpic, vcpu);
+		break;
+	}
+
 	kvmppc_core_vcpu_free(vcpu);
 }
 
@@ -794,6 +805,25 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 		break;
 	}
 #endif
+#ifdef CONFIG_KVM_MPIC
+	case KVM_CAP_IRQ_MPIC: {
+		struct file *filp;
+		struct kvm_device *dev;
+
+		r = -EBADF;
+		filp = fget(cap->args[0]);
+		if (!filp)
+			break;
+
+		r = -EPERM;
+		dev = kvm_device_from_filp(filp);
+		if (dev)
+			r = kvmppc_mpic_connect_vcpu(dev, vcpu, cap->args[1]);
+
+		fput(filp);
+		break;
+	}
+#endif
 	default:
 		r = -EINVAL;
 		break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 76963ec..5b3248a 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -669,6 +669,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_ARM_PSCI 87
 #define KVM_CAP_ARM_SET_DEVICE_ADDR 88
 #define KVM_CAP_DEVICE_CTRL 89
+#define KVM_CAP_IRQ_MPIC 90
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
1.7.10.4
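
As a rough illustration of the userspace side of this capability, the sketch below fills in a KVM_ENABLE_CAP request the way section 6.6 of api.txt describes it (args[0] is the MPIC device fd, args[1] is the MPIC CPU number). The helper names are hypothetical, and the fallback define simply reuses the value 90 from this patch in case the installed headers predate it.

```c
#include <linux/kvm.h>
#include <string.h>
#include <sys/ioctl.h>

#ifndef KVM_CAP_IRQ_MPIC
#define KVM_CAP_IRQ_MPIC 90	/* value taken from this patch */
#endif

/* Hypothetical helper: build the request described in api.txt 6.6. */
static void mpic_fill_enable_cap(struct kvm_enable_cap *cap,
				 int mpic_fd, unsigned int mpic_cpu)
{
	memset(cap, 0, sizeof(*cap));
	cap->cap = KVM_CAP_IRQ_MPIC;
	cap->args[0] = mpic_fd;		/* MPIC device fd */
	cap->args[1] = mpic_cpu;	/* MPIC CPU number for this vcpu */
}

/* Hypothetical helper: returns 0 on success, -1 with errno set on failure. */
static int mpic_connect_vcpu(int vcpu_fd, int mpic_fd, unsigned int mpic_cpu)
{
	struct kvm_enable_cap cap;

	mpic_fill_enable_cap(&cap, mpic_fd, mpic_cpu);
	return ioctl(vcpu_fd, KVM_ENABLE_CAP, &cap);
}
```

Going by the code in this patch, a caller should expect EBADF for an invalid fd, EPERM when the fd is not an MPIC device of the same vm or the CPU number is out of range, EEXIST when that MPIC CPU slot is already taken, and EBUSY when the vcpu is already connected to an irqchip.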




* Re: [PATCH v4 6/6] kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC
  2013-04-13  0:08         ` Scott Wood
@ 2013-04-15  5:23           ` Paul Mackerras
  -1 siblings, 0 replies; 261+ messages in thread
From: Paul Mackerras @ 2013-04-15  5:23 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm

On Fri, Apr 12, 2013 at 07:08:47PM -0500, Scott Wood wrote:
> Enabling this capability connects the vcpu to the designated in-kernel
> MPIC.  Using explicit connections between vcpus and irqchips allows
> for flexibility, but the main benefit at the moment is that it
> simplifies the code -- KVM doesn't need vm-global state to remember
> which MPIC object is associated with this vm, and it doesn't need to
> care about ordering between irqchip creation and vcpu creation.

This fails to link with CONFIG_KVM_MPIC=n with the following error:

arch/powerpc/kvm/built-in.o: In function `.kvm_arch_vcpu_free':
(.text+0x94f0): undefined reference to `.kvmppc_mpic_disconnect_vcpu'

Paul.


* Re: [PATCH v4 6/6] kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC
  2013-04-15  5:23           ` Paul Mackerras
@ 2013-04-15 17:52             ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-15 17:52 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Alexander Graf, kvm-ppc, kvm

On 04/15/2013 12:23:28 AM, Paul Mackerras wrote:
> On Fri, Apr 12, 2013 at 07:08:47PM -0500, Scott Wood wrote:
> > Enabling this capability connects the vcpu to the designated in-kernel
> > MPIC.  Using explicit connections between vcpus and irqchips allows
> > for flexibility, but the main benefit at the moment is that it
> > simplifies the code -- KVM doesn't need vm-global state to remember
> > which MPIC object is associated with this vm, and it doesn't need to
> > care about ordering between irqchip creation and vcpu creation.
> 
> This fails to link with CONFIG_KVM_MPIC=n with the following error:
> 
> arch/powerpc/kvm/built-in.o: In function `.kvm_arch_vcpu_free':
> (.text+0x94f0): undefined reference to `.kvmppc_mpic_disconnect_vcpu'

Sigh, forgot the ifdef again. :-(

-Scott


* Re: [PATCH v4 6/6] kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC
  2013-04-15 17:52             ` Scott Wood
@ 2013-04-16  3:59               ` Paul Mackerras
  -1 siblings, 0 replies; 261+ messages in thread
From: Paul Mackerras @ 2013-04-16  3:59 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm

On Mon, Apr 15, 2013 at 12:52:27PM -0500, Scott Wood wrote:
> On 04/15/2013 12:23:28 AM, Paul Mackerras wrote:
> >On Fri, Apr 12, 2013 at 07:08:47PM -0500, Scott Wood wrote:
> >> Enabling this capability connects the vcpu to the designated in-kernel
> >> MPIC.  Using explicit connections between vcpus and irqchips allows
> >> for flexibility, but the main benefit at the moment is that it
> >> simplifies the code -- KVM doesn't need vm-global state to remember
> >> which MPIC object is associated with this vm, and it doesn't need to
> >> care about ordering between irqchip creation and vcpu creation.
> >
> >This fails to link with CONFIG_KVM_MPIC=n with the following error:
> >
> >arch/powerpc/kvm/built-in.o: In function `.kvm_arch_vcpu_free':
> >(.text+0x94f0): undefined reference to `.kvmppc_mpic_disconnect_vcpu'
> 
> Sigh, forgot the ifdef again. :-(

Or you could define an empty static inline function in the
CONFIG_KVM_MPIC=n case, which looks a little nicer at the point of
use.

Paul.
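
The empty-stub approach suggested above could look roughly like the following. This is a userspace-compilable sketch, not the code that was eventually merged: the two struct definitions are placeholders for the real kernel types, and vcpu_free_sketch() is a hypothetical stand-in for the call site in kvm_arch_vcpu_free().

```c
/* Placeholder types so the sketch builds outside the kernel tree. */
struct openpic { int unused; };
struct kvm_vcpu { struct openpic *mpic; };

#ifdef CONFIG_KVM_MPIC
void kvmppc_mpic_disconnect_vcpu(struct openpic *opp, struct kvm_vcpu *vcpu);
#else
/* Empty stub: the call compiles away when MPIC support is not built. */
static inline void kvmppc_mpic_disconnect_vcpu(struct openpic *opp,
					       struct kvm_vcpu *vcpu)
{
}
#endif

/* Hypothetical call site: no #ifdef is needed at the point of use. */
static void vcpu_free_sketch(struct kvm_vcpu *vcpu)
{
	kvmppc_mpic_disconnect_vcpu(vcpu->mpic, vcpu);
}
```

The trade-off is that the conditional lives once in the header next to the prototype, instead of being repeated around every caller.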


* Re: [PATCH v4 1/6] kvm: add device control API
  2013-04-13  0:08         ` Scott Wood
@ 2013-04-25  9:43           ` Gleb Natapov
  -1 siblings, 0 replies; 261+ messages in thread
From: Gleb Natapov @ 2013-04-25  9:43 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm, paulus

On Fri, Apr 12, 2013 at 07:08:42PM -0500, Scott Wood wrote:
> Currently, devices that are emulated inside KVM are configured in a
> hardcoded manner based on an assumption that any given architecture
> only has one way to do it.  If there's any need to access device state,
> it is done through inflexible one-purpose-only IOCTLs (e.g.
> KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> cumbersome and depletes a limited numberspace.
> 
> This API provides a mechanism to instantiate a device of a certain
> type, returning an ID that can be used to set/get attributes of the
> device.  Attributes may include configuration parameters (e.g.
> register base address), device state, operational commands, etc.  It
> is similar to the ONE_REG API, except that it acts on devices rather
> than vcpus.
> 
> Both device types and individual attributes can be tested without having
> to create the device or get/set the attribute, without the need for
> separately managing enumerated capabilities.
> 
> Signed-off-by: Scott Wood <scottwood@freescale.com>
> ---
> v4:
>  - Move some boilerplate back into generic code, as requested by Gleb.
>    File descriptor management and reference counting is no longer the
>    concern of the device implementation.
> 
>  - Don't hold kvm->lock during create.  The original reasons
>    for doing so have vanished as far as MPIC is concerned, and
>    this avoids needing to answer the question of whether to
>    hold the lock during destroy as well.
> 
>    Paul, you may need to acquire the lock yourself in kvm_create_xics()
>    to protect the -EEXIST check.
> 
> v3: remove some changes that were merged into this patch by accident,
> and fix the error documentation for KVM_CREATE_DEVICE.
> ---
>  Documentation/virtual/kvm/api.txt        |   70 ++++++++++++++++
>  Documentation/virtual/kvm/devices/README |    1 +
>  include/linux/kvm_host.h                 |   35 ++++++++
>  include/uapi/linux/kvm.h                 |   27 +++++++
>  virt/kvm/kvm_main.c                      |  129 ++++++++++++++++++++++++++++++
>  5 files changed, 262 insertions(+)
>  create mode 100644 Documentation/virtual/kvm/devices/README
> 
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 976eb65..d52f3f9 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2173,6 +2173,76 @@ header; first `n_valid' valid entries with contents from the data
>  written, then `n_invalid' invalid entries, invalidating any previously
>  valid entries found.
>  
> +4.79 KVM_CREATE_DEVICE
> +
> +Capability: KVM_CAP_DEVICE_CTRL
> +Type: vm ioctl
> +Parameters: struct kvm_create_device (in/out)
> +Returns: 0 on success, -1 on error
> +Errors:
> +  ENODEV: The device type is unknown or unsupported
> +  EEXIST: Device already created, and this type of device may not
> +          be instantiated multiple times
> +
> +  Other error conditions may be defined by individual device types or
> +  have their standard meanings.
> +
> +Creates an emulated device in the kernel.  The file descriptor returned
> +in fd can be used with KVM_SET/GET/HAS_DEVICE_ATTR.
> +
> +If the KVM_CREATE_DEVICE_TEST flag is set, only test whether the
> +device type is supported (not necessarily whether it can be created
> +in the current vm).
> +
> +Individual devices should not define flags.  Attributes should be used
> +for specifying any behavior that is not implied by the device type
> +number.
> +
> +struct kvm_create_device {
> +	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
> +	__u32	fd;	/* out: device handle */
> +	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
> +};
Should we add a __u32 padding field here to make the struct size a multiple of u64?
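
To make the size question concrete, here is a small sketch comparing the layout as posted with a padded variant. Both struct names and the pad field are hypothetical; fixed-width stdint types stand in for __u32 so it builds anywhere.

```c
#include <stddef.h>
#include <stdint.h>

/* Layout as posted: three 32-bit fields. */
struct create_device_as_posted {
	uint32_t type;
	uint32_t fd;
	uint32_t flags;
};

/* Hypothetical variant with an explicit trailing pad field. */
struct create_device_padded {
	uint32_t type;
	uint32_t fd;
	uint32_t flags;
	uint32_t pad;
};

/* Returns 1 if n is a whole number of 64-bit words. */
static int size_is_u64_multiple(size_t n)
{
	return n % sizeof(uint64_t) == 0;
}
```

On common ABIs the posted layout is 12 bytes and the padded one is 16, so only the padded variant is a whole number of 64-bit words.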

> +
> +4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR
> +
> +Capability: KVM_CAP_DEVICE_CTRL
> +Type: device ioctl
> +Parameters: struct kvm_device_attr
> +Returns: 0 on success, -1 on error
> +Errors:
> +  ENXIO:  The group or attribute is unknown/unsupported for this device
> +  EPERM:  The attribute cannot (currently) be accessed this way
> +          (e.g. read-only attribute, or attribute that only makes
> +          sense when the device is in a different state)
> +
> +  Other error conditions may be defined by individual device types.
> +
> +Gets/sets a specified piece of device configuration and/or state.  The
> +semantics are device-specific.  See individual device documentation in
> +the "devices" directory.  As with ONE_REG, the size of the data
> +transferred is defined by the particular attribute.
> +
> +struct kvm_device_attr {
> +	__u32	flags;		/* no flags currently defined */
> +	__u32	group;		/* device-defined */
> +	__u64	attr;		/* group-defined */
> +	__u64	addr;		/* userspace address of attr data */
> +};
> +
> +4.81 KVM_HAS_DEVICE_ATTR
> +
> +Capability: KVM_CAP_DEVICE_CTRL
> +Type: device ioctl
> +Parameters: struct kvm_device_attr
> +Returns: 0 on success, -1 on error
> +Errors:
> +  ENXIO:  The group or attribute is unknown/unsupported for this device
> +
> +Tests whether a device supports a particular attribute.  A successful
> +return indicates the attribute is implemented.  It does not necessarily
> +indicate that the attribute can be read or written in the device's
> +current state.  "addr" is ignored.
>  
>  4.77 KVM_ARM_VCPU_INIT
>  
> diff --git a/Documentation/virtual/kvm/devices/README b/Documentation/virtual/kvm/devices/README
> new file mode 100644
> index 0000000..34a6983
> --- /dev/null
> +++ b/Documentation/virtual/kvm/devices/README
> @@ -0,0 +1 @@
> +This directory contains specific device bindings for KVM_CAP_DEVICE_CTRL.
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 20d77d2..8fce9bc 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1063,6 +1063,41 @@ static inline bool kvm_check_request(int req, struct kvm_vcpu *vcpu)
>  
>  extern bool kvm_rebooting;
>  
> +struct kvm_device_ops;
> +
> +struct kvm_device {
> +	struct kvm_device_ops *ops;
> +	struct kvm *kvm;
> +	atomic_t users;
> +	void *private;
> +};
> +
> +/* create, destroy, and name are mandatory */
> +struct kvm_device_ops {
> +	const char *name;
> +	int (*create)(struct kvm_device *dev, u32 type);
> +
> +	/*
> +	 * Destroy is responsible for freeing dev.
> +	 *
> +	 * Destroy may be called before or after destructors are called
> +	 * on emulated I/O regions, depending on whether a reference is
> +	 * held by a vcpu or other kvm component that gets destroyed
> +	 * after the emulated I/O.
> +	 */
> +	void (*destroy)(struct kvm_device *dev);
> +
> +	int (*set_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
> +	int (*get_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
> +	int (*has_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
> +	long (*ioctl)(struct kvm_device *dev, unsigned int ioctl,
> +		      unsigned long arg);
> +};
> +
> +void kvm_device_get(struct kvm_device *dev);
> +void kvm_device_put(struct kvm_device *dev);
> +struct kvm_device *kvm_device_from_filp(struct file *filp);
> +
>  #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
>  
>  static inline void kvm_vcpu_set_in_spin_loop(struct kvm_vcpu *vcpu, bool val)
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 74d0ff3..20ce2d2 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info {
>  #define KVM_CAP_PPC_EPR 86
>  #define KVM_CAP_ARM_PSCI 87
>  #define KVM_CAP_ARM_SET_DEVICE_ADDR 88
> +#define KVM_CAP_DEVICE_CTRL 89
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> @@ -909,6 +910,32 @@ struct kvm_s390_ucas_mapping {
>  #define KVM_ARM_SET_DEVICE_ADDR	  _IOW(KVMIO,  0xab, struct kvm_arm_device_addr)
>  
>  /*
> + * Device control API, available with KVM_CAP_DEVICE_CTRL
> + */
> +#define KVM_CREATE_DEVICE_TEST		1
> +
> +struct kvm_create_device {
> +	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
> +	__u32	fd;	/* out: device handle */
> +	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
> +};
> +
> +struct kvm_device_attr {
> +	__u32	flags;		/* no flags currently defined */
> +	__u32	group;		/* device-defined */
> +	__u64	attr;		/* group-defined */
> +	__u64	addr;		/* userspace address of attr data */
> +};
Please move the struct definitions and the KVM_CREATE_DEVICE_TEST define
out of the ioctl definition block.

> +
> +/* ioctl for vm fd */
> +#define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xe0, struct kvm_create_device)
> +
> +/* ioctls for fds returned by KVM_CREATE_DEVICE */
> +#define KVM_SET_DEVICE_ATTR	  _IOW(KVMIO,  0xe1, struct kvm_device_attr)
> +#define KVM_GET_DEVICE_ATTR	  _IOW(KVMIO,  0xe2, struct kvm_device_attr)
> +#define KVM_HAS_DEVICE_ATTR	  _IOW(KVMIO,  0xe3, struct kvm_device_attr)
> +
> +/*
>   * ioctls for vcpu fds
>   */
>  #define KVM_RUN                   _IO(KVMIO,   0x80)
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 5cc53c9..e2b18af 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2158,6 +2158,117 @@ out:
>  }
>  #endif
>  
> +static int kvm_device_ioctl_attr(struct kvm_device *dev,
> +				 int (*accessor)(struct kvm_device *dev,
> +						 struct kvm_device_attr *attr),
> +				 unsigned long arg)
> +{
> +	struct kvm_device_attr attr;
> +
> +	if (!accessor)
> +		return -EPERM;
> +
> +	if (copy_from_user(&attr, (void __user *)arg, sizeof(attr)))
> +		return -EFAULT;
> +
> +	return accessor(dev, &attr);
> +}
> +
> +static long kvm_device_ioctl(struct file *filp, unsigned int ioctl,
> +			     unsigned long arg)
> +{
> +	struct kvm_device *dev = filp->private_data;
> +
> +	switch (ioctl) {
> +	case KVM_SET_DEVICE_ATTR:
> +		return kvm_device_ioctl_attr(dev, dev->ops->set_attr, arg);
> +	case KVM_GET_DEVICE_ATTR:
> +		return kvm_device_ioctl_attr(dev, dev->ops->get_attr, arg);
> +	case KVM_HAS_DEVICE_ATTR:
> +		return kvm_device_ioctl_attr(dev, dev->ops->has_attr, arg);
> +	default:
> +		if (dev->ops->ioctl)
> +			return dev->ops->ioctl(dev, ioctl, arg);
> +
> +		return -ENOTTY;
> +	}
> +}
> +
> +void kvm_device_get(struct kvm_device *dev)
> +{
> +	atomic_inc(&dev->users);
> +}
> +
> +void kvm_device_put(struct kvm_device *dev)
> +{
> +	if (atomic_dec_and_test(&dev->users))
> +		dev->ops->destroy(dev);
> +}
> +
> +static int kvm_device_release(struct inode *inode, struct file *filp)
> +{
> +	struct kvm_device *dev = filp->private_data;
> +	struct kvm *kvm = dev->kvm;
> +
> +	kvm_device_put(dev);
> +	kvm_put_kvm(kvm);
We should put kvm only when users drops to zero; otherwise kvm can be
freed while something still holds a reference to the device. Why not
make kvm_device_put() do it?

> +	return 0;
> +}
> +
> +static const struct file_operations kvm_device_fops = {
> +	.unlocked_ioctl = kvm_device_ioctl,
> +	.release = kvm_device_release,
> +};
> +
> +struct kvm_device *kvm_device_from_filp(struct file *filp)
> +{
> +	if (filp->f_op != &kvm_device_fops)
> +		return NULL;
> +
> +	return filp->private_data;
> +}
> +
> +static int kvm_ioctl_create_device(struct kvm *kvm,
> +				   struct kvm_create_device *cd)
> +{
> +	struct kvm_device_ops *ops = NULL;
> +	struct kvm_device *dev;
> +	bool test = cd->flags & KVM_CREATE_DEVICE_TEST;
> +	int ret;
> +
> +	switch (cd->type) {
> +	default:
> +		return -ENODEV;
> +	}
> +
> +	if (test)
> +		return 0;
> +
> +	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> +	if (!dev)
> +		return -ENOMEM;
> +
> +	dev->ops = ops;
> +	dev->kvm = kvm;
> +	atomic_set(&dev->users, 1);
> +
> +	ret = ops->create(dev, cd->type);
> +	if (ret < 0) {
> +		kfree(dev);
> +		return ret;
> +	}
> +
> +	ret = anon_inode_getfd(ops->name, &kvm_device_fops, dev, O_RDWR);
> +	if (ret < 0) {
> +		ops->destroy(dev);
> +		return ret;
> +	}
> +
> +	kvm_get_kvm(kvm);
> +	cd->fd = ret;
> +	return 0;
> +}
> +
>  static long kvm_vm_ioctl(struct file *filp,
>  			   unsigned int ioctl, unsigned long arg)
>  {
> @@ -2272,6 +2383,24 @@ static long kvm_vm_ioctl(struct file *filp,
>  		break;
>  	}
>  #endif
> +	case KVM_CREATE_DEVICE: {
> +		struct kvm_create_device cd;
> +
> +		r = -EFAULT;
> +		if (copy_from_user(&cd, argp, sizeof(cd)))
> +			goto out;
> +
> +		r = kvm_ioctl_create_device(kvm, &cd);
> +		if (r)
> +			goto out;
> +
> +		r = -EFAULT;
> +		if (copy_to_user(argp, &cd, sizeof(cd)))
> +			goto out;
> +
> +		r = 0;
> +		break;
> +	}
>  	default:
>  		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
>  		if (r == -ENOTTY)
> -- 
> 1.7.10.4
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
			Gleb.

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [PATCH v4 1/6] kvm: add device control API
  2013-04-25  9:43           ` Gleb Natapov
@ 2013-04-25 10:47             ` Alexander Graf
  -1 siblings, 0 replies; 261+ messages in thread
From: Alexander Graf @ 2013-04-25 10:47 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Scott Wood, kvm-ppc, kvm, paulus


On 25.04.2013, at 11:43, Gleb Natapov wrote:

> On Fri, Apr 12, 2013 at 07:08:42PM -0500, Scott Wood wrote:
>> Currently, devices that are emulated inside KVM are configured in a
>> hardcoded manner based on an assumption that any given architecture
>> only has one way to do it.  If there's any need to access device state,
>> it is done through inflexible one-purpose-only IOCTLs (e.g.
>> KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
>> cumbersome and depletes a limited numberspace.
>> 
>> This API provides a mechanism to instantiate a device of a certain
>> type, returning an ID that can be used to set/get attributes of the
>> device.  Attributes may include configuration parameters (e.g.
>> register base address), device state, operational commands, etc.  It
>> is similar to the ONE_REG API, except that it acts on devices rather
>> than vcpus.
>> 
>> Both device types and individual attributes can be tested without having
>> to create the device or get/set the attribute, without the need for
>> separately managing enumerated capabilities.
>> 
>> Signed-off-by: Scott Wood <scottwood@freescale.com>
>> ---
>> v4:
>> - Move some boilerplate back into generic code, as requested by Gleb.
>>   File descriptor management and reference counting is no longer the
>>   concern of the device implementation.
>> 
>> - Don't hold kvm->lock during create.  The original reasons
>>   for doing so have vanished as far as MPIC is concerned, and
>>   this avoids needing to answer the question of whether to
>>   hold the lock during destroy as well.
>> 
>>   Paul, you may need to acquire the lock yourself in kvm_create_xics()
>>   to protect the -EEXIST check.
>> 
>> v3: remove some changes that were merged into this patch by accident,
>> and fix the error documentation for KVM_CREATE_DEVICE.
>> ---
>> Documentation/virtual/kvm/api.txt        |   70 ++++++++++++++++
>> Documentation/virtual/kvm/devices/README |    1 +
>> include/linux/kvm_host.h                 |   35 ++++++++
>> include/uapi/linux/kvm.h                 |   27 +++++++
>> virt/kvm/kvm_main.c                      |  129 ++++++++++++++++++++++++++++++
>> 5 files changed, 262 insertions(+)
>> create mode 100644 Documentation/virtual/kvm/devices/README
>> 
>> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
>> index 976eb65..d52f3f9 100644
>> --- a/Documentation/virtual/kvm/api.txt
>> +++ b/Documentation/virtual/kvm/api.txt
>> @@ -2173,6 +2173,76 @@ header; first `n_valid' valid entries with contents from the data
>> written, then `n_invalid' invalid entries, invalidating any previously
>> valid entries found.
>> 
>> +4.79 KVM_CREATE_DEVICE
>> +
>> +Capability: KVM_CAP_DEVICE_CTRL
>> +Type: vm ioctl
>> +Parameters: struct kvm_create_device (in/out)
>> +Returns: 0 on success, -1 on error
>> +Errors:
>> +  ENODEV: The device type is unknown or unsupported
>> +  EEXIST: Device already created, and this type of device may not
>> +          be instantiated multiple times
>> +
>> +  Other error conditions may be defined by individual device types or
>> +  have their standard meanings.
>> +
>> +Creates an emulated device in the kernel.  The file descriptor returned
>> +in fd can be used with KVM_SET/GET/HAS_DEVICE_ATTR.
>> +
>> +If the KVM_CREATE_DEVICE_TEST flag is set, only test whether the
>> +device type is supported (not necessarily whether it can be created
>> +in the current vm).
>> +
>> +Individual devices should not define flags.  Attributes should be used
>> +for specifying any behavior that is not implied by the device type
>> +number.
>> +
>> +struct kvm_create_device {
>> +	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
>> +	__u32	fd;	/* out: device handle */
>> +	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
>> +};
> Should we add __u32 padding here to make the struct size a multiple of u64?

Do you know of any arch that pads structs to u64 boundaries? x86_64 doesn't and ppc64 doesn't either.
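
To make the layout question concrete, here is a minimal userspace sketch, not part of the patch. It assumes uint32_t/uint64_t as stand-ins for the kernel's __u32/__u64, and the *_model struct names and check_layouts() are made up for illustration. On the common Linux ABIs, three 32-bit fields give a 12-byte struct with no tail padding, and the attr struct already has its 64-bit members at naturally aligned offsets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace mirror of struct kvm_create_device. */
struct create_device_model {
	uint32_t type;	/* in */
	uint32_t fd;	/* out */
	uint32_t flags;	/* in */
};

/* Hypothetical userspace mirror of struct kvm_device_attr. */
struct device_attr_model {
	uint32_t flags;
	uint32_t group;
	uint64_t attr;
	uint64_t addr;
};

/*
 * Assert the layout expected on common Linux ABIs: the compiler only
 * pads a struct up to the alignment of its widest member, so three
 * 32-bit fields are not rounded up to a multiple of 8 bytes.
 */
void check_layouts(void)
{
	assert(sizeof(struct create_device_model) == 12);
	assert(offsetof(struct create_device_model, flags) == 8);

	/* The two leading 32-bit fields keep attr at offset 8. */
	assert(offsetof(struct device_attr_model, attr) == 8);
	assert(offsetof(struct device_attr_model, addr) == 16);
	assert(sizeof(struct device_attr_model) == 24);
}
```

An explicit pad field would make the 12-byte struct 16 bytes everywhere, but as sketched above the current layout should not differ between 32-bit and 64-bit userspace either way.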

> 
>> +
>> +4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR
>> +
>> +Capability: KVM_CAP_DEVICE_CTRL
>> +Type: device ioctl
>> +Parameters: struct kvm_device_attr
>> +Returns: 0 on success, -1 on error
>> +Errors:
>> +  ENXIO:  The group or attribute is unknown/unsupported for this device
>> +  EPERM:  The attribute cannot (currently) be accessed this way
>> +          (e.g. read-only attribute, or attribute that only makes
>> +          sense when the device is in a different state)
>> +
>> +  Other error conditions may be defined by individual device types.
>> +
>> +Gets/sets a specified piece of device configuration and/or state.  The
>> +semantics are device-specific.  See individual device documentation in
>> +the "devices" directory.  As with ONE_REG, the size of the data
>> +transferred is defined by the particular attribute.
>> +
>> +struct kvm_device_attr {
>> +	__u32	flags;		/* no flags currently defined */
>> +	__u32	group;		/* device-defined */
>> +	__u64	attr;		/* group-defined */
>> +	__u64	addr;		/* userspace address of attr data */
>> +};
>> +
>> +4.81 KVM_HAS_DEVICE_ATTR
>> +
>> +Capability: KVM_CAP_DEVICE_CTRL
>> +Type: device ioctl
>> +Parameters: struct kvm_device_attr
>> +Returns: 0 on success, -1 on error
>> +Errors:
>> +  ENXIO:  The group or attribute is unknown/unsupported for this device
>> +
>> +Tests whether a device supports a particular attribute.  A successful
>> +return indicates the attribute is implemented.  It does not necessarily
>> +indicate that the attribute can be read or written in the device's
>> +current state.  "addr" is ignored.
>> 
>> 4.77 KVM_ARM_VCPU_INIT
>> 
>> diff --git a/Documentation/virtual/kvm/devices/README b/Documentation/virtual/kvm/devices/README
>> new file mode 100644
>> index 0000000..34a6983
>> --- /dev/null
>> +++ b/Documentation/virtual/kvm/devices/README
>> @@ -0,0 +1 @@
>> +This directory contains specific device bindings for KVM_CAP_DEVICE_CTRL.
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index 20d77d2..8fce9bc 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -1063,6 +1063,41 @@ static inline bool kvm_check_request(int req, struct kvm_vcpu *vcpu)
>> 
>> extern bool kvm_rebooting;
>> 
>> +struct kvm_device_ops;
>> +
>> +struct kvm_device {
>> +	struct kvm_device_ops *ops;
>> +	struct kvm *kvm;
>> +	atomic_t users;
>> +	void *private;
>> +};
>> +
>> +/* create, destroy, and name are mandatory */
>> +struct kvm_device_ops {
>> +	const char *name;
>> +	int (*create)(struct kvm_device *dev, u32 type);
>> +
>> +	/*
>> +	 * Destroy is responsible for freeing dev.
>> +	 *
>> +	 * Destroy may be called before or after destructors are called
>> +	 * on emulated I/O regions, depending on whether a reference is
>> +	 * held by a vcpu or other kvm component that gets destroyed
>> +	 * after the emulated I/O.
>> +	 */
>> +	void (*destroy)(struct kvm_device *dev);
>> +
>> +	int (*set_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
>> +	int (*get_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
>> +	int (*has_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
>> +	long (*ioctl)(struct kvm_device *dev, unsigned int ioctl,
>> +		      unsigned long arg);
>> +};
>> +
>> +void kvm_device_get(struct kvm_device *dev);
>> +void kvm_device_put(struct kvm_device *dev);
>> +struct kvm_device *kvm_device_from_filp(struct file *filp);
>> +
>> #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
>> 
>> static inline void kvm_vcpu_set_in_spin_loop(struct kvm_vcpu *vcpu, bool val)
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index 74d0ff3..20ce2d2 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info {
>> #define KVM_CAP_PPC_EPR 86
>> #define KVM_CAP_ARM_PSCI 87
>> #define KVM_CAP_ARM_SET_DEVICE_ADDR 88
>> +#define KVM_CAP_DEVICE_CTRL 89
>> 
>> #ifdef KVM_CAP_IRQ_ROUTING
>> 
>> @@ -909,6 +910,32 @@ struct kvm_s390_ucas_mapping {
>> #define KVM_ARM_SET_DEVICE_ADDR	  _IOW(KVMIO,  0xab, struct kvm_arm_device_addr)
>> 
>> /*
>> + * Device control API, available with KVM_CAP_DEVICE_CTRL
>> + */
>> +#define KVM_CREATE_DEVICE_TEST		1
>> +
>> +struct kvm_create_device {
>> +	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
>> +	__u32	fd;	/* out: device handle */
>> +	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
>> +};
>> +
>> +struct kvm_device_attr {
>> +	__u32	flags;		/* no flags currently defined */
>> +	__u32	group;		/* device-defined */
>> +	__u64	attr;		/* group-defined */
>> +	__u64	addr;		/* userspace address of attr data */
>> +};
> Please move the struct definitions and the KVM_CREATE_DEVICE_TEST define
> out of the ioctl definition block.

Let me change that in my tree...

> 
>> +
>> +/* ioctl for vm fd */
>> +#define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xe0, struct kvm_create_device)
>> +
>> +/* ioctls for fds returned by KVM_CREATE_DEVICE */
>> +#define KVM_SET_DEVICE_ATTR	  _IOW(KVMIO,  0xe1, struct kvm_device_attr)
>> +#define KVM_GET_DEVICE_ATTR	  _IOW(KVMIO,  0xe2, struct kvm_device_attr)
>> +#define KVM_HAS_DEVICE_ATTR	  _IOW(KVMIO,  0xe3, struct kvm_device_attr)
>> +
>> +/*
>>  * ioctls for vcpu fds
>>  */
>> #define KVM_RUN                   _IO(KVMIO,   0x80)
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 5cc53c9..e2b18af 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -2158,6 +2158,117 @@ out:
>> }
>> #endif
>> 
>> +static int kvm_device_ioctl_attr(struct kvm_device *dev,
>> +				 int (*accessor)(struct kvm_device *dev,
>> +						 struct kvm_device_attr *attr),
>> +				 unsigned long arg)
>> +{
>> +	struct kvm_device_attr attr;
>> +
>> +	if (!accessor)
>> +		return -EPERM;
>> +
>> +	if (copy_from_user(&attr, (void __user *)arg, sizeof(attr)))
>> +		return -EFAULT;
>> +
>> +	return accessor(dev, &attr);
>> +}
>> +
>> +static long kvm_device_ioctl(struct file *filp, unsigned int ioctl,
>> +			     unsigned long arg)
>> +{
>> +	struct kvm_device *dev = filp->private_data;
>> +
>> +	switch (ioctl) {
>> +	case KVM_SET_DEVICE_ATTR:
>> +		return kvm_device_ioctl_attr(dev, dev->ops->set_attr, arg);
>> +	case KVM_GET_DEVICE_ATTR:
>> +		return kvm_device_ioctl_attr(dev, dev->ops->get_attr, arg);
>> +	case KVM_HAS_DEVICE_ATTR:
>> +		return kvm_device_ioctl_attr(dev, dev->ops->has_attr, arg);
>> +	default:
>> +		if (dev->ops->ioctl)
>> +			return dev->ops->ioctl(dev, ioctl, arg);
>> +
>> +		return -ENOTTY;
>> +	}
>> +}
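
The dispatch above can be modelled in userspace. This is only a sketch under stated assumptions: every mock_* and demo_* name and the command numbers are hypothetical stand-ins, not kernel definitions, and the copy_from_user() step is left out. It shows the two fallbacks in the quoted code: a missing accessor yields -EPERM, and an unknown command with no device-specific ioctl handler yields -ENOTTY.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct mock_attr { unsigned int group; unsigned long long attr; };
struct mock_dev;

struct mock_ops {
	int (*set_attr)(struct mock_dev *dev, struct mock_attr *attr);
	int (*get_attr)(struct mock_dev *dev, struct mock_attr *attr);
	int (*has_attr)(struct mock_dev *dev, struct mock_attr *attr);
	long (*ioctl)(struct mock_dev *dev, unsigned int cmd,
		      struct mock_attr *attr);
};

struct mock_dev { const struct mock_ops *ops; };

/* Made-up command numbers, for illustration only. */
enum { MOCK_SET_ATTR = 1, MOCK_GET_ATTR, MOCK_HAS_ATTR };

/* Models kvm_device_ioctl_attr(): an absent accessor means -EPERM. */
static int mock_ioctl_attr(struct mock_dev *dev,
			   int (*accessor)(struct mock_dev *dev,
					   struct mock_attr *attr),
			   struct mock_attr *attr)
{
	if (!accessor)
		return -EPERM;

	return accessor(dev, attr);
}

/*
 * Models kvm_device_ioctl(): the three attribute commands go to their
 * accessors, anything else goes to the optional per-device handler.
 */
long mock_device_ioctl(struct mock_dev *dev, unsigned int cmd,
		       struct mock_attr *attr)
{
	switch (cmd) {
	case MOCK_SET_ATTR:
		return mock_ioctl_attr(dev, dev->ops->set_attr, attr);
	case MOCK_GET_ATTR:
		return mock_ioctl_attr(dev, dev->ops->get_attr, attr);
	case MOCK_HAS_ATTR:
		return mock_ioctl_attr(dev, dev->ops->has_attr, attr);
	default:
		if (dev->ops->ioctl)
			return dev->ops->ioctl(dev, cmd, attr);

		return -ENOTTY;
	}
}

/* Example device: only group 0 exists, and only has_attr is provided. */
static int demo_has_attr(struct mock_dev *dev, struct mock_attr *attr)
{
	(void)dev;
	return attr->group == 0 ? 0 : -ENXIO;
}

const struct mock_ops demo_ops = { .has_attr = demo_has_attr };
```

With demo_ops, a MOCK_HAS_ATTR query for group 0 succeeds, a query for any other group returns -ENXIO, and MOCK_SET_ATTR returns -EPERM because no set_attr accessor is installed.
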
>> +
>> +void kvm_device_get(struct kvm_device *dev)
>> +{
>> +	atomic_inc(&dev->users);
>> +}
>> +
>> +void kvm_device_put(struct kvm_device *dev)
>> +{
>> +	if (atomic_dec_and_test(&dev->users))
>> +		dev->ops->destroy(dev);
>> +}
>> +
>> +static int kvm_device_release(struct inode *inode, struct file *filp)
>> +{
>> +	struct kvm_device *dev = filp->private_data;
>> +	struct kvm *kvm = dev->kvm;
>> +
>> +	kvm_device_put(dev);
>> +	kvm_put_kvm(kvm);
> We may put kvm only if users goes to zero, otherwise kvm can be
> freed while something holds a reference to a device. Why not make
> kvm_device_put() do it?

Nice catch. I'll change the patch so it does the kvm_put_kvm inside kvm_device_put's destroy branch.
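
A minimal userspace model of that ordering, not the actual follow-up patch. The model_* names are made up for illustration, and plain ints stand in for atomic_t and the kvm refcount:

```c
#include <stdbool.h>

/* Hypothetical model types; plain ints stand in for atomic refcounts. */
struct model_kvm { int refs; bool freed; };
struct model_dev { struct model_kvm *kvm; int users; bool destroyed; };

static void model_put_kvm(struct model_kvm *kvm)
{
	if (--kvm->refs == 0)
		kvm->freed = true;
}

void model_device_get(struct model_dev *dev)
{
	dev->users++;
}

/*
 * The VM reference is dropped only together with the last device
 * reference, so the VM cannot be freed while a device user remains.
 */
void model_device_put(struct model_dev *dev)
{
	if (--dev->users == 0) {
		struct model_kvm *kvm = dev->kvm;

		dev->destroyed = true;	/* stands in for ops->destroy(dev) */
		model_put_kvm(kvm);
	}
}

/* release() only drops the file's own device reference. */
void model_device_release(struct model_dev *dev)
{
	model_device_put(dev);
}
```

If something else (a vcpu, say) still holds a device reference when the fd is closed, the release call leaves both the device and the VM alive; the final model_device_put() then tears down the device and drops the VM reference.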


Alex


^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [PATCH v4 1/6] kvm: add device control API
  2013-04-25 10:47             ` Alexander Graf
@ 2013-04-25 12:07               ` Gleb Natapov
  -1 siblings, 0 replies; 261+ messages in thread
From: Gleb Natapov @ 2013-04-25 12:07 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Scott Wood, kvm-ppc, kvm, paulus

On Thu, Apr 25, 2013 at 12:47:39PM +0200, Alexander Graf wrote:
> 
> On 25.04.2013, at 11:43, Gleb Natapov wrote:
> 
> > On Fri, Apr 12, 2013 at 07:08:42PM -0500, Scott Wood wrote:
> >> Currently, devices that are emulated inside KVM are configured in a
> >> hardcoded manner based on an assumption that any given architecture
> >> only has one way to do it.  If there's any need to access device state,
> >> it is done through inflexible one-purpose-only IOCTLs (e.g.
> >> KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> >> cumbersome and depletes a limited numberspace.
> >> 
> >> This API provides a mechanism to instantiate a device of a certain
> >> type, returning an ID that can be used to set/get attributes of the
> >> device.  Attributes may include configuration parameters (e.g.
> >> register base address), device state, operational commands, etc.  It
> >> is similar to the ONE_REG API, except that it acts on devices rather
> >> than vcpus.
> >> 
> >> Both device types and individual attributes can be tested without having
> >> to create the device or get/set the attribute, without the need for
> >> separately managing enumerated capabilities.
> >> 
> >> Signed-off-by: Scott Wood <scottwood@freescale.com>
> >> ---
> >> v4:
> >> - Move some boilerplate back into generic code, as requested by Gleb.
> >>   File descriptor management and reference counting is no longer the
> >>   concern of the device implementation.
> >> 
> >> - Don't hold kvm->lock during create.  The original reasons
> >>   for doing so have vanished as far as MPIC is concerned, and
> >>   this avoids needing to answer the question of whether to
> >>   hold the lock during destroy as well.
> >> 
> >>   Paul, you may need to acquire the lock yourself in kvm_create_xics()
> >>   to protect the -EEXIST check.
> >> 
> >> v3: remove some changes that were merged into this patch by accident,
> >> and fix the error documentation for KVM_CREATE_DEVICE.
> >> ---
> >> Documentation/virtual/kvm/api.txt        |   70 ++++++++++++++++
> >> Documentation/virtual/kvm/devices/README |    1 +
> >> include/linux/kvm_host.h                 |   35 ++++++++
> >> include/uapi/linux/kvm.h                 |   27 +++++++
> >> virt/kvm/kvm_main.c                      |  129 ++++++++++++++++++++++++++++++
> >> 5 files changed, 262 insertions(+)
> >> create mode 100644 Documentation/virtual/kvm/devices/README
> >> 
> >> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> >> index 976eb65..d52f3f9 100644
> >> --- a/Documentation/virtual/kvm/api.txt
> >> +++ b/Documentation/virtual/kvm/api.txt
> >> @@ -2173,6 +2173,76 @@ header; first `n_valid' valid entries with contents from the data
> >> written, then `n_invalid' invalid entries, invalidating any previously
> >> valid entries found.
> >> 
> >> +4.79 KVM_CREATE_DEVICE
> >> +
> >> +Capability: KVM_CAP_DEVICE_CTRL
> >> +Type: vm ioctl
> >> +Parameters: struct kvm_create_device (in/out)
> >> +Returns: 0 on success, -1 on error
> >> +Errors:
> >> +  ENODEV: The device type is unknown or unsupported
> >> +  EEXIST: Device already created, and this type of device may not
> >> +          be instantiated multiple times
> >> +
> >> +  Other error conditions may be defined by individual device types or
> >> +  have their standard meanings.
> >> +
> >> +Creates an emulated device in the kernel.  The file descriptor returned
> >> +in fd can be used with KVM_SET/GET/HAS_DEVICE_ATTR.
> >> +
> >> +If the KVM_CREATE_DEVICE_TEST flag is set, only test whether the
> >> +device type is supported (not necessarily whether it can be created
> >> +in the current vm).
> >> +
> >> +Individual devices should not define flags.  Attributes should be used
> >> +for specifying any behavior that is not implied by the device type
> >> +number.
> >> +
> >> +struct kvm_create_device {
> >> +	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
> >> +	__u32	fd;	/* out: device handle */
> >> +	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
> >> +};
> > Should we add __u32 padding here to make struct size multiple of u64?
> 
> Do you know of any arch that pads structs to u64 boundaries? x86_64 doesn't and ppc64 doesn't either.
> 
Not really. I just noticed that we pad some structures to that effect.
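
The layout in question is easy to check in userspace. A small sketch, assuming the fixed-width <stdint.h> types as stand-ins for __u32/__u64; the demo_* struct names are made up so they do not clash with the uapi header:

```c
#include <stddef.h>
#include <stdint.h>

/* Same field layout as the proposed struct kvm_create_device. */
struct demo_create_device {
	uint32_t type;
	uint32_t fd;
	uint32_t flags;
};

/* Same field layout as the proposed struct kvm_device_attr. */
struct demo_device_attr {
	uint32_t flags;
	uint32_t group;
	uint64_t attr;
	uint64_t addr;
};

/* Three 32-bit members need only 4-byte alignment: no tail padding. */
_Static_assert(sizeof(struct demo_create_device) == 12,
	       "no tail padding expected");

/* The two leading 32-bit members fill the slot ahead of attr. */
_Static_assert(offsetof(struct demo_device_attr, attr) == 8,
	       "attr expected at offset 8");
_Static_assert(sizeof(struct demo_device_attr) == 24,
	       "no padding expected");

size_t demo_create_device_size(void)
{
	return sizeof(struct demo_create_device);
}
```

So the create struct is 12 bytes, which is not a multiple of 8. Whether to add an explicit pad word is then a matter of convention rather than something these layouts require.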

> > 
> >> +
> >> +4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR
> >> +
> >> +Capability: KVM_CAP_DEVICE_CTRL
> >> +Type: device ioctl
> >> +Parameters: struct kvm_device_attr
> >> +Returns: 0 on success, -1 on error
> >> +Errors:
> >> +  ENXIO:  The group or attribute is unknown/unsupported for this device
> >> +  EPERM:  The attribute cannot (currently) be accessed this way
> >> +          (e.g. read-only attribute, or attribute that only makes
> >> +          sense when the device is in a different state)
> >> +
> >> +  Other error conditions may be defined by individual device types.
> >> +
> >> +Gets/sets a specified piece of device configuration and/or state.  The
> >> +semantics are device-specific.  See individual device documentation in
> >> +the "devices" directory.  As with ONE_REG, the size of the data
> >> +transferred is defined by the particular attribute.
> >> +
> >> +struct kvm_device_attr {
> >> +	__u32	flags;		/* no flags currently defined */
> >> +	__u32	group;		/* device-defined */
> >> +	__u64	attr;		/* group-defined */
> >> +	__u64	addr;		/* userspace address of attr data */
> >> +};
> >> +
> >> +4.81 KVM_HAS_DEVICE_ATTR
> >> +
> >> +Capability: KVM_CAP_DEVICE_CTRL
> >> +Type: device ioctl
> >> +Parameters: struct kvm_device_attr
> >> +Returns: 0 on success, -1 on error
> >> +Errors:
> >> +  ENXIO:  The group or attribute is unknown/unsupported for this device
> >> +
> >> +Tests whether a device supports a particular attribute.  A successful
> >> +return indicates the attribute is implemented.  It does not necessarily
> >> +indicate that the attribute can be read or written in the device's
> >> +current state.  "addr" is ignored.
> >> 
> >> 4.77 KVM_ARM_VCPU_INIT
> >> 
> >> diff --git a/Documentation/virtual/kvm/devices/README b/Documentation/virtual/kvm/devices/README
> >> new file mode 100644
> >> index 0000000..34a6983
> >> --- /dev/null
> >> +++ b/Documentation/virtual/kvm/devices/README
> >> @@ -0,0 +1 @@
> >> +This directory contains specific device bindings for KVM_CAP_DEVICE_CTRL.
> >> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> >> index 20d77d2..8fce9bc 100644
> >> --- a/include/linux/kvm_host.h
> >> +++ b/include/linux/kvm_host.h
> >> @@ -1063,6 +1063,41 @@ static inline bool kvm_check_request(int req, struct kvm_vcpu *vcpu)
> >> 
> >> extern bool kvm_rebooting;
> >> 
> >> +struct kvm_device_ops;
> >> +
> >> +struct kvm_device {
> >> +	struct kvm_device_ops *ops;
> >> +	struct kvm *kvm;
> >> +	atomic_t users;
> >> +	void *private;
> >> +};
> >> +
> >> +/* create, destroy, and name are mandatory */
> >> +struct kvm_device_ops {
> >> +	const char *name;
> >> +	int (*create)(struct kvm_device *dev, u32 type);
> >> +
> >> +	/*
> >> +	 * Destroy is responsible for freeing dev.
> >> +	 *
> >> +	 * Destroy may be called before or after destructors are called
> >> +	 * on emulated I/O regions, depending on whether a reference is
> >> +	 * held by a vcpu or other kvm component that gets destroyed
> >> +	 * after the emulated I/O.
> >> +	 */
> >> +	void (*destroy)(struct kvm_device *dev);
> >> +
> >> +	int (*set_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
> >> +	int (*get_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
> >> +	int (*has_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
> >> +	long (*ioctl)(struct kvm_device *dev, unsigned int ioctl,
> >> +		      unsigned long arg);
> >> +};
> >> +
> >> +void kvm_device_get(struct kvm_device *dev);
> >> +void kvm_device_put(struct kvm_device *dev);
> >> +struct kvm_device *kvm_device_from_filp(struct file *filp);
> >> +
> >> #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
> >> 
> >> static inline void kvm_vcpu_set_in_spin_loop(struct kvm_vcpu *vcpu, bool val)
> >> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >> index 74d0ff3..20ce2d2 100644
> >> --- a/include/uapi/linux/kvm.h
> >> +++ b/include/uapi/linux/kvm.h
> >> @@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info {
> >> #define KVM_CAP_PPC_EPR 86
> >> #define KVM_CAP_ARM_PSCI 87
> >> #define KVM_CAP_ARM_SET_DEVICE_ADDR 88
> >> +#define KVM_CAP_DEVICE_CTRL 89
> >> 
> >> #ifdef KVM_CAP_IRQ_ROUTING
> >> 
> >> @@ -909,6 +910,32 @@ struct kvm_s390_ucas_mapping {
> >> #define KVM_ARM_SET_DEVICE_ADDR	  _IOW(KVMIO,  0xab, struct kvm_arm_device_addr)
> >> 
> >> /*
> >> + * Device control API, available with KVM_CAP_DEVICE_CTRL
> >> + */
> >> +#define KVM_CREATE_DEVICE_TEST		1
> >> +
> >> +struct kvm_create_device {
> >> +	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
> >> +	__u32	fd;	/* out: device handle */
> >> +	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
> >> +};
> >> +
> >> +struct kvm_device_attr {
> >> +	__u32	flags;		/* no flags currently defined */
> >> +	__u32	group;		/* device-defined */
> >> +	__u64	attr;		/* group-defined */
> >> +	__u64	addr;		/* userspace address of attr data */
> >> +};
> > Please move struct definitions and KVM_CREATE_DEVICE_TEST define out
> > from ioctl definition block.
> 
> Let me change that in my tree...
> 
So are you sending this via your tree and I should not apply it directly?

--
			Gleb.

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [PATCH v4 1/6] kvm: add device control API
@ 2013-04-25 12:07               ` Gleb Natapov
  0 siblings, 0 replies; 261+ messages in thread
From: Gleb Natapov @ 2013-04-25 12:07 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Scott Wood, kvm-ppc, kvm, paulus

On Thu, Apr 25, 2013 at 12:47:39PM +0200, Alexander Graf wrote:
> 
> On 25.04.2013, at 11:43, Gleb Natapov wrote:
> 
> > On Fri, Apr 12, 2013 at 07:08:42PM -0500, Scott Wood wrote:
> >> Currently, devices that are emulated inside KVM are configured in a
> >> hardcoded manner based on an assumption that any given architecture
> >> only has one way to do it.  If there's any need to access device state,
> >> it is done through inflexible one-purpose-only IOCTLs (e.g.
> >> KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> >> cumbersome and depletes a limited numberspace.
> >> 
> >> This API provides a mechanism to instantiate a device of a certain
> >> type, returning an ID that can be used to set/get attributes of the
> >> device.  Attributes may include configuration parameters (e.g.
> >> register base address), device state, operational commands, etc.  It
> >> is similar to the ONE_REG API, except that it acts on devices rather
> >> than vcpus.
> >> 
> >> Both device types and individual attributes can be tested without having
> >> to create the device or get/set the attribute, without the need for
> >> separately managing enumerated capabilities.
> >> 
> >> Signed-off-by: Scott Wood <scottwood@freescale.com>
> >> ---
> >> v4:
> >> - Move some boilerplate back into generic code, as requested by Gleb.
> >>   File descriptor management and reference counting is no longer the
> >>   concern of the device implementation.
> >> 
> >> - Don't hold kvm->lock during create.  The original reasons
> >>   for doing so have vanished as for as MPIC is concerned, and
> >>   this avoids needing to answer the question of whether to
> >>   hold the lock during destroy as well.
> >> 
> >>   Paul, you may need to acquire the lock yourself in kvm_create_xics()
> >>   to protect the -EEXIST check.
> >> 
> >> v3: remove some changes that were merged into this patch by accident,
> >> and fix the error documentation for KVM_CREATE_DEVICE.
> >> ---
> >> Documentation/virtual/kvm/api.txt        |   70 ++++++++++++++++
> >> Documentation/virtual/kvm/devices/README |    1 +
> >> include/linux/kvm_host.h                 |   35 ++++++++
> >> include/uapi/linux/kvm.h                 |   27 +++++++
> >> virt/kvm/kvm_main.c                      |  129 ++++++++++++++++++++++++++++++
> >> 5 files changed, 262 insertions(+)
> >> create mode 100644 Documentation/virtual/kvm/devices/README
> >> 
> >> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> >> index 976eb65..d52f3f9 100644
> >> --- a/Documentation/virtual/kvm/api.txt
> >> +++ b/Documentation/virtual/kvm/api.txt
> >> @@ -2173,6 +2173,76 @@ header; first `n_valid' valid entries with contents from the data
> >> written, then `n_invalid' invalid entries, invalidating any previously
> >> valid entries found.
> >> 
> >> +4.79 KVM_CREATE_DEVICE
> >> +
> >> +Capability: KVM_CAP_DEVICE_CTRL
> >> +Type: vm ioctl
> >> +Parameters: struct kvm_create_device (in/out)
> >> +Returns: 0 on success, -1 on error
> >> +Errors:
> >> +  ENODEV: The device type is unknown or unsupported
> >> +  EEXIST: Device already created, and this type of device may not
> >> +          be instantiated multiple times
> >> +
> >> +  Other error conditions may be defined by individual device types or
> >> +  have their standard meanings.
> >> +
> >> +Creates an emulated device in the kernel.  The file descriptor returned
> >> +in fd can be used with KVM_SET/GET/HAS_DEVICE_ATTR.
> >> +
> >> +If the KVM_CREATE_DEVICE_TEST flag is set, only test whether the
> >> +device type is supported (not necessarily whether it can be created
> >> +in the current vm).
> >> +
> >> +Individual devices should not define flags.  Attributes should be used
> >> +for specifying any behavior that is not implied by the device type
> >> +number.
> >> +
> >> +struct kvm_create_device {
> >> +	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
> >> +	__u32	fd;	/* out: device handle */
> >> +	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
> >> +};
> > Should we add __u32 padding here to make struct size multiple of u64?
> 
> Do you know of any arch that pads structs to u64 boundaries? x86_64 doesn't and ppc64 doesn't either.
> 
Not really. I just notices that we pad some structures to that effect.

> > 
> >> +
> >> +4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR
> >> +
> >> +Capability: KVM_CAP_DEVICE_CTRL
> >> +Type: device ioctl
> >> +Parameters: struct kvm_device_attr
> >> +Returns: 0 on success, -1 on error
> >> +Errors:
> >> +  ENXIO:  The group or attribute is unknown/unsupported for this device
> >> +  EPERM:  The attribute cannot (currently) be accessed this way
> >> +          (e.g. read-only attribute, or attribute that only makes
> >> +          sense when the device is in a different state)
> >> +
> >> +  Other error conditions may be defined by individual device types.
> >> +
> >> +Gets/sets a specified piece of device configuration and/or state.  The
> >> +semantics are device-specific.  See individual device documentation in
> >> +the "devices" directory.  As with ONE_REG, the size of the data
> >> +transferred is defined by the particular attribute.
> >> +
> >> +struct kvm_device_attr {
> >> +	__u32	flags;		/* no flags currently defined */
> >> +	__u32	group;		/* device-defined */
> >> +	__u64	attr;		/* group-defined */
> >> +	__u64	addr;		/* userspace address of attr data */
> >> +};
> >> +
> >> +4.81 KVM_HAS_DEVICE_ATTR
> >> +
> >> +Capability: KVM_CAP_DEVICE_CTRL
> >> +Type: device ioctl
> >> +Parameters: struct kvm_device_attr
> >> +Returns: 0 on success, -1 on error
> >> +Errors:
> >> +  ENXIO:  The group or attribute is unknown/unsupported for this device
> >> +
> >> +Tests whether a device supports a particular attribute.  A successful
> >> +return indicates the attribute is implemented.  It does not necessarily
> >> +indicate that the attribute can be read or written in the device's
> >> +current state.  "addr" is ignored.
> >> 
> >> 4.77 KVM_ARM_VCPU_INIT
> >> 
> >> diff --git a/Documentation/virtual/kvm/devices/README b/Documentation/virtual/kvm/devices/README
> >> new file mode 100644
> >> index 0000000..34a6983
> >> --- /dev/null
> >> +++ b/Documentation/virtual/kvm/devices/README
> >> @@ -0,0 +1 @@
> >> +This directory contains specific device bindings for KVM_CAP_DEVICE_CTRL.
> >> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> >> index 20d77d2..8fce9bc 100644
> >> --- a/include/linux/kvm_host.h
> >> +++ b/include/linux/kvm_host.h
> >> @@ -1063,6 +1063,41 @@ static inline bool kvm_check_request(int req, struct kvm_vcpu *vcpu)
> >> 
> >> extern bool kvm_rebooting;
> >> 
> >> +struct kvm_device_ops;
> >> +
> >> +struct kvm_device {
> >> +	struct kvm_device_ops *ops;
> >> +	struct kvm *kvm;
> >> +	atomic_t users;
> >> +	void *private;
> >> +};
> >> +
> >> +/* create, destroy, and name are mandatory */
> >> +struct kvm_device_ops {
> >> +	const char *name;
> >> +	int (*create)(struct kvm_device *dev, u32 type);
> >> +
> >> +	/*
> >> +	 * Destroy is responsible for freeing dev.
> >> +	 *
> >> +	 * Destroy may be called before or after destructors are called
> >> +	 * on emulated I/O regions, depending on whether a reference is
> >> +	 * held by a vcpu or other kvm component that gets destroyed
> >> +	 * after the emulated I/O.
> >> +	 */
> >> +	void (*destroy)(struct kvm_device *dev);
> >> +
> >> +	int (*set_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
> >> +	int (*get_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
> >> +	int (*has_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
> >> +	long (*ioctl)(struct kvm_device *dev, unsigned int ioctl,
> >> +		      unsigned long arg);
> >> +};
> >> +
> >> +void kvm_device_get(struct kvm_device *dev);
> >> +void kvm_device_put(struct kvm_device *dev);
> >> +struct kvm_device *kvm_device_from_filp(struct file *filp);
> >> +
> >> #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
> >> 
> >> static inline void kvm_vcpu_set_in_spin_loop(struct kvm_vcpu *vcpu, bool val)
> >> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >> index 74d0ff3..20ce2d2 100644
> >> --- a/include/uapi/linux/kvm.h
> >> +++ b/include/uapi/linux/kvm.h
> >> @@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info {
> >> #define KVM_CAP_PPC_EPR 86
> >> #define KVM_CAP_ARM_PSCI 87
> >> #define KVM_CAP_ARM_SET_DEVICE_ADDR 88
> >> +#define KVM_CAP_DEVICE_CTRL 89
> >> 
> >> #ifdef KVM_CAP_IRQ_ROUTING
> >> 
> >> @@ -909,6 +910,32 @@ struct kvm_s390_ucas_mapping {
> >> #define KVM_ARM_SET_DEVICE_ADDR	  _IOW(KVMIO,  0xab, struct kvm_arm_device_addr)
> >> 
> >> /*
> >> + * Device control API, available with KVM_CAP_DEVICE_CTRL
> >> + */
> >> +#define KVM_CREATE_DEVICE_TEST		1
> >> +
> >> +struct kvm_create_device {
> >> +	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
> >> +	__u32	fd;	/* out: device handle */
> >> +	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
> >> +};
> >> +
> >> +struct kvm_device_attr {
> >> +	__u32	flags;		/* no flags currently defined */
> >> +	__u32	group;		/* device-defined */
> >> +	__u64	attr;		/* group-defined */
> >> +	__u64	addr;		/* userspace address of attr data */
> >> +};
> > Please move struct definitions and KVM_CREATE_DEVICE_TEST define out
> > from ioctl definition block.
> 
> Let me change that in my tree...
> 
So are you sending this via your tree and I should not apply it directly?

--
			Gleb.

^ permalink raw reply	[flat|nested] 261+ messages in thread
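
For readers following the API under review, the KVM_CREATE_DEVICE_TEST flag quoted above can be exercised from userspace roughly as follows. This is a minimal sketch, not code from the patchset: it assumes the KVM_CREATE_DEVICE ioctl and struct kvm_create_device as exposed by a mainline <linux/kvm.h>, and probe_device_type() is a hypothetical helper name.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Ask KVM whether a device type is supported without creating it.
 * vm_fd is assumed to be a VM file descriptor obtained via KVM_CREATE_VM.
 * Returns 0 if the type is supported, or a negative errno value
 * (for example -ENODEV for an unknown type).
 */
int probe_device_type(int vm_fd, uint32_t type)
{
	struct kvm_create_device cd;

	memset(&cd, 0, sizeof(cd));
	cd.type = type;				/* in: KVM_DEV_TYPE_xxx */
	cd.flags = KVM_CREATE_DEVICE_TEST;	/* test only, nothing is created */

	if (ioctl(vm_fd, KVM_CREATE_DEVICE, &cd) < 0)
		return -errno;
	return 0;
}
```

Without the test flag the same call would create the device and return its handle in cd.fd, which is then the target of the KVM_SET/GET/HAS_DEVICE_ATTR ioctls.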

* Re: [PATCH v4 1/6] kvm: add device control API
  2013-04-25 12:07               ` Gleb Natapov
@ 2013-04-25 13:45                 ` Alexander Graf
  -1 siblings, 0 replies; 261+ messages in thread
From: Alexander Graf @ 2013-04-25 13:45 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Scott Wood, kvm-ppc, kvm, paulus


On 25.04.2013, at 14:07, Gleb Natapov wrote:

> On Thu, Apr 25, 2013 at 12:47:39PM +0200, Alexander Graf wrote:
>> 
>> On 25.04.2013, at 11:43, Gleb Natapov wrote:
>> 
>>> On Fri, Apr 12, 2013 at 07:08:42PM -0500, Scott Wood wrote:
>>>> Currently, devices that are emulated inside KVM are configured in a
>>>> hardcoded manner based on an assumption that any given architecture
>>>> only has one way to do it.  If there's any need to access device state,
>>>> it is done through inflexible one-purpose-only IOCTLs (e.g.
>>>> KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
>>>> cumbersome and depletes a limited numberspace.
>>>> 
>>>> This API provides a mechanism to instantiate a device of a certain
>>>> type, returning an ID that can be used to set/get attributes of the
>>>> device.  Attributes may include configuration parameters (e.g.
>>>> register base address), device state, operational commands, etc.  It
>>>> is similar to the ONE_REG API, except that it acts on devices rather
>>>> than vcpus.
>>>> 
>>>> Both device types and individual attributes can be tested without having
>>>> to create the device or get/set the attribute, without the need for
>>>> separately managing enumerated capabilities.
>>>> 
>>>> Signed-off-by: Scott Wood <scottwood@freescale.com>
>>>> ---
>>>> v4:
>>>> - Move some boilerplate back into generic code, as requested by Gleb.
>>>>  File descriptor management and reference counting is no longer the
>>>>  concern of the device implementation.
>>>> 
>>>> - Don't hold kvm->lock during create.  The original reasons
>>>>  for doing so have vanished as far as MPIC is concerned, and
>>>>  this avoids needing to answer the question of whether to
>>>>  hold the lock during destroy as well.
>>>> 
>>>>  Paul, you may need to acquire the lock yourself in kvm_create_xics()
>>>>  to protect the -EEXIST check.
>>>> 
>>>> v3: remove some changes that were merged into this patch by accident,
>>>> and fix the error documentation for KVM_CREATE_DEVICE.
>>>> ---
>>>> Documentation/virtual/kvm/api.txt        |   70 ++++++++++++++++
>>>> Documentation/virtual/kvm/devices/README |    1 +
>>>> include/linux/kvm_host.h                 |   35 ++++++++
>>>> include/uapi/linux/kvm.h                 |   27 +++++++
>>>> virt/kvm/kvm_main.c                      |  129 ++++++++++++++++++++++++++++++
>>>> 5 files changed, 262 insertions(+)
>>>> create mode 100644 Documentation/virtual/kvm/devices/README
>>>> 
>>>> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
>>>> index 976eb65..d52f3f9 100644
>>>> --- a/Documentation/virtual/kvm/api.txt
>>>> +++ b/Documentation/virtual/kvm/api.txt
>>>> @@ -2173,6 +2173,76 @@ header; first `n_valid' valid entries with contents from the data
>>>> written, then `n_invalid' invalid entries, invalidating any previously
>>>> valid entries found.
>>>> 
>>>> +4.79 KVM_CREATE_DEVICE
>>>> +
>>>> +Capability: KVM_CAP_DEVICE_CTRL
>>>> +Type: vm ioctl
>>>> +Parameters: struct kvm_create_device (in/out)
>>>> +Returns: 0 on success, -1 on error
>>>> +Errors:
>>>> +  ENODEV: The device type is unknown or unsupported
>>>> +  EEXIST: Device already created, and this type of device may not
>>>> +          be instantiated multiple times
>>>> +
>>>> +  Other error conditions may be defined by individual device types or
>>>> +  have their standard meanings.
>>>> +
>>>> +Creates an emulated device in the kernel.  The file descriptor returned
>>>> +in fd can be used with KVM_SET/GET/HAS_DEVICE_ATTR.
>>>> +
>>>> +If the KVM_CREATE_DEVICE_TEST flag is set, only test whether the
>>>> +device type is supported (not necessarily whether it can be created
>>>> +in the current vm).
>>>> +
>>>> +Individual devices should not define flags.  Attributes should be used
>>>> +for specifying any behavior that is not implied by the device type
>>>> +number.
>>>> +
>>>> +struct kvm_create_device {
>>>> +	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
>>>> +	__u32	fd;	/* out: device handle */
>>>> +	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
>>>> +};
>>> Should we add __u32 padding here to make struct size multiple of u64?
>> 
>> Do you know of any arch that pads structs to u64 boundaries? x86_64 doesn't and ppc64 doesn't either.
>> 
> Not really. I just noticed that we pad some structures to that effect.

I don't think we really need to :).

> 
>>> 
>>>> +
>>>> +4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR
>>>> +
>>>> +Capability: KVM_CAP_DEVICE_CTRL
>>>> +Type: device ioctl
>>>> +Parameters: struct kvm_device_attr
>>>> +Returns: 0 on success, -1 on error
>>>> +Errors:
>>>> +  ENXIO:  The group or attribute is unknown/unsupported for this device
>>>> +  EPERM:  The attribute cannot (currently) be accessed this way
>>>> +          (e.g. read-only attribute, or attribute that only makes
>>>> +          sense when the device is in a different state)
>>>> +
>>>> +  Other error conditions may be defined by individual device types.
>>>> +
>>>> +Gets/sets a specified piece of device configuration and/or state.  The
>>>> +semantics are device-specific.  See individual device documentation in
>>>> +the "devices" directory.  As with ONE_REG, the size of the data
>>>> +transferred is defined by the particular attribute.
>>>> +
>>>> +struct kvm_device_attr {
>>>> +	__u32	flags;		/* no flags currently defined */
>>>> +	__u32	group;		/* device-defined */
>>>> +	__u64	attr;		/* group-defined */
>>>> +	__u64	addr;		/* userspace address of attr data */
>>>> +};
>>>> +
>>>> +4.81 KVM_HAS_DEVICE_ATTR
>>>> +
>>>> +Capability: KVM_CAP_DEVICE_CTRL
>>>> +Type: device ioctl
>>>> +Parameters: struct kvm_device_attr
>>>> +Returns: 0 on success, -1 on error
>>>> +Errors:
>>>> +  ENXIO:  The group or attribute is unknown/unsupported for this device
>>>> +
>>>> +Tests whether a device supports a particular attribute.  A successful
>>>> +return indicates the attribute is implemented.  It does not necessarily
>>>> +indicate that the attribute can be read or written in the device's
>>>> +current state.  "addr" is ignored.
>>>> 
>>>> 4.77 KVM_ARM_VCPU_INIT
>>>> 
>>>> diff --git a/Documentation/virtual/kvm/devices/README b/Documentation/virtual/kvm/devices/README
>>>> new file mode 100644
>>>> index 0000000..34a6983
>>>> --- /dev/null
>>>> +++ b/Documentation/virtual/kvm/devices/README
>>>> @@ -0,0 +1 @@
>>>> +This directory contains specific device bindings for KVM_CAP_DEVICE_CTRL.
>>>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>>>> index 20d77d2..8fce9bc 100644
>>>> --- a/include/linux/kvm_host.h
>>>> +++ b/include/linux/kvm_host.h
>>>> @@ -1063,6 +1063,41 @@ static inline bool kvm_check_request(int req, struct kvm_vcpu *vcpu)
>>>> 
>>>> extern bool kvm_rebooting;
>>>> 
>>>> +struct kvm_device_ops;
>>>> +
>>>> +struct kvm_device {
>>>> +	struct kvm_device_ops *ops;
>>>> +	struct kvm *kvm;
>>>> +	atomic_t users;
>>>> +	void *private;
>>>> +};
>>>> +
>>>> +/* create, destroy, and name are mandatory */
>>>> +struct kvm_device_ops {
>>>> +	const char *name;
>>>> +	int (*create)(struct kvm_device *dev, u32 type);
>>>> +
>>>> +	/*
>>>> +	 * Destroy is responsible for freeing dev.
>>>> +	 *
>>>> +	 * Destroy may be called before or after destructors are called
>>>> +	 * on emulated I/O regions, depending on whether a reference is
>>>> +	 * held by a vcpu or other kvm component that gets destroyed
>>>> +	 * after the emulated I/O.
>>>> +	 */
>>>> +	void (*destroy)(struct kvm_device *dev);
>>>> +
>>>> +	int (*set_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
>>>> +	int (*get_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
>>>> +	int (*has_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
>>>> +	long (*ioctl)(struct kvm_device *dev, unsigned int ioctl,
>>>> +		      unsigned long arg);
>>>> +};
>>>> +
>>>> +void kvm_device_get(struct kvm_device *dev);
>>>> +void kvm_device_put(struct kvm_device *dev);
>>>> +struct kvm_device *kvm_device_from_filp(struct file *filp);
>>>> +
>>>> #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
>>>> 
>>>> static inline void kvm_vcpu_set_in_spin_loop(struct kvm_vcpu *vcpu, bool val)
>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>>> index 74d0ff3..20ce2d2 100644
>>>> --- a/include/uapi/linux/kvm.h
>>>> +++ b/include/uapi/linux/kvm.h
>>>> @@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info {
>>>> #define KVM_CAP_PPC_EPR 86
>>>> #define KVM_CAP_ARM_PSCI 87
>>>> #define KVM_CAP_ARM_SET_DEVICE_ADDR 88
>>>> +#define KVM_CAP_DEVICE_CTRL 89
>>>> 
>>>> #ifdef KVM_CAP_IRQ_ROUTING
>>>> 
>>>> @@ -909,6 +910,32 @@ struct kvm_s390_ucas_mapping {
>>>> #define KVM_ARM_SET_DEVICE_ADDR	  _IOW(KVMIO,  0xab, struct kvm_arm_device_addr)
>>>> 
>>>> /*
>>>> + * Device control API, available with KVM_CAP_DEVICE_CTRL
>>>> + */
>>>> +#define KVM_CREATE_DEVICE_TEST		1
>>>> +
>>>> +struct kvm_create_device {
>>>> +	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
>>>> +	__u32	fd;	/* out: device handle */
>>>> +	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
>>>> +};
>>>> +
>>>> +struct kvm_device_attr {
>>>> +	__u32	flags;		/* no flags currently defined */
>>>> +	__u32	group;		/* device-defined */
>>>> +	__u64	attr;		/* group-defined */
>>>> +	__u64	addr;		/* userspace address of attr data */
>>>> +};
>>> Please move struct definitions and KVM_CREATE_DEVICE_TEST define out
>>> from ioctl definition block.
>> 
>> Let me change that in my tree...
>> 
> So are you sending this via your tree and I should not apply it directly?

I was hoping to have things ready very soon for you to just pull...


Alex

^ permalink raw reply	[flat|nested] 261+ messages in thread
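
The padding question above can be checked mechanically. The sketch below mirrors the two uapi structs under local, illustrative names (create_device_layout and device_attr_layout are not kernel identifiers) and asserts their sizes. It assumes a common ABI where uint32_t has 4-byte alignment; under that assumption the three-member struct is 12 bytes with no implicit tail padding, and the attr struct has no internal holes.

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirror of struct kvm_create_device as quoted in the patch. */
struct create_device_layout {
	uint32_t type;
	uint32_t fd;
	uint32_t flags;
};

/* Local mirror of struct kvm_device_attr as quoted in the patch. */
struct device_attr_layout {
	uint32_t flags;
	uint32_t group;
	uint64_t attr;
	uint64_t addr;
};

/* Only 32-bit members, so the compiler has no reason to pad to 8 bytes. */
_Static_assert(sizeof(struct create_device_layout) == 12,
	       "three u32 fields, no tail padding");

/* The two leading u32 fields already place the u64 fields at offset 8. */
_Static_assert(offsetof(struct device_attr_layout, attr) == 8,
	       "no hole before attr");
_Static_assert(sizeof(struct device_attr_layout) == 24,
	       "no tail padding");

size_t create_device_size(void)
{
	return sizeof(struct create_device_layout);
}

size_t device_attr_size(void)
{
	return sizeof(struct device_attr_layout);
}
```

Because neither size depends on the alignment of a 64-bit type, the layouts should be the same for 32-bit and 64-bit userspace, which is consistent with the reply that explicit padding is not needed here.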

* Re: [PATCH v4 1/6] kvm: add device control API
  2013-04-25 13:45                 ` Alexander Graf
@ 2013-04-25 13:51                   ` Gleb Natapov
  -1 siblings, 0 replies; 261+ messages in thread
From: Gleb Natapov @ 2013-04-25 13:51 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Scott Wood, kvm-ppc, kvm, paulus

On Thu, Apr 25, 2013 at 03:45:14PM +0200, Alexander Graf wrote:
> >>> Please move struct definitions and KVM_CREATE_DEVICE_TEST define out
> >>> from ioctl definition block.
> >> 
> >> Let me change that in my tree...
> >> 
> > So are you sending this via your tree and I should not apply it directly?
> 
> I was hoping to have things ready very soon for you to just pull...
> 
Makes sense, since there are PPC patches that depend on this one. The 3.10 merge window
will very likely open next week, though...

--
			Gleb.

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [PATCH v4 1/6] kvm: add device control API
  2013-04-25 10:47             ` Alexander Graf
@ 2013-04-25 16:51               ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-25 16:51 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Gleb Natapov, kvm-ppc, kvm, paulus

On 04/25/2013 05:47:39 AM, Alexander Graf wrote:
> 
> On 25.04.2013, at 11:43, Gleb Natapov wrote:
> 
> >> +void kvm_device_put(struct kvm_device *dev)
> >> +{
> >> +	if (atomic_dec_and_test(&dev->users))
> >> +		dev->ops->destroy(dev);
> >> +}
> >> +
> >> +static int kvm_device_release(struct inode *inode, struct file  
> *filp)
> >> +{
> >> +	struct kvm_device *dev = filp->private_data;
> >> +	struct kvm *kvm = dev->kvm;
> >> +
> >> +	kvm_device_put(dev);
> >> +	kvm_put_kvm(kvm);
> > We may put kvm only if users goes to zero, otherwise kvm can be
> > freed while something holds a reference to a device. Why not make
> > kvm_device_put() do it?
> 
> Nice catch. I'll change the patch so it does the kvm_put_kvm inside  
> kvm_device_put's destroy branch.

No, please don't.  The KVM reference being "put" here is associated  
with the file descriptor, not with the MPIC object.  If you make that  
change I think you'll have circular references and thus a memory leak,  
because the vcpus can hold a reference to the MPIC object.

-Scott

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [PATCH v4 1/6] kvm: add device control API
  2013-04-25 16:51               ` Scott Wood
@ 2013-04-25 18:22                 ` Gleb Natapov
  -1 siblings, 0 replies; 261+ messages in thread
From: Gleb Natapov @ 2013-04-25 18:22 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm, paulus

On Thu, Apr 25, 2013 at 11:51:08AM -0500, Scott Wood wrote:
> On 04/25/2013 05:47:39 AM, Alexander Graf wrote:
> >
> >On 25.04.2013, at 11:43, Gleb Natapov wrote:
> >
> >>> +void kvm_device_put(struct kvm_device *dev)
> >>> +{
> >>> +	if (atomic_dec_and_test(&dev->users))
> >>> +		dev->ops->destroy(dev);
> >>> +}
> >>> +
> >>> +static int kvm_device_release(struct inode *inode, struct file
> >*filp)
> >>> +{
> >>> +	struct kvm_device *dev = filp->private_data;
> >>> +	struct kvm *kvm = dev->kvm;
> >>> +
> >>> +	kvm_device_put(dev);
> >>> +	kvm_put_kvm(kvm);
> >> We may put kvm only if users goes to zero, otherwise kvm can be
> >> freed while something holds a reference to a device. Why not make
> >> kvm_device_put() do it?
> >
> >Nice catch. I'll change the patch so it does the kvm_put_kvm
> >inside kvm_device_put's destroy branch.
> 
> No, please don't.  The KVM reference being "put" here is associated
> with the file descriptor, not with the MPIC object.
Is it so? Device holds a pointer to kvm, so it increments kvm reference
to make sure the pointer is valid. What prevents kvm from being destroyed
while device is still in use in current code?
 

>                                                    If you make
> that change I think you'll have circular references and thus a
> memory leak, because the vcpus can hold a reference to the MPIC
> object.
> 
How can a circular reference be created?

--
			Gleb.

^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: [PATCH v4 1/6] kvm: add device control API
  2013-04-25 18:22                 ` Gleb Natapov
@ 2013-04-25 18:59                   ` Scott Wood
  -1 siblings, 0 replies; 261+ messages in thread
From: Scott Wood @ 2013-04-25 18:59 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Alexander Graf, kvm-ppc, kvm, paulus

On 04/25/2013 01:22:04 PM, Gleb Natapov wrote:
> On Thu, Apr 25, 2013 at 11:51:08AM -0500, Scott Wood wrote:
> > On 04/25/2013 05:47:39 AM, Alexander Graf wrote:
> > >
> > >On 25.04.2013, at 11:43, Gleb Natapov wrote:
> > >
> > >>> +void kvm_device_put(struct kvm_device *dev)
> > >>> +{
> > >>> +	if (atomic_dec_and_test(&dev->users))
> > >>> +		dev->ops->destroy(dev);
> > >>> +}
> > >>> +
> > >>> +static int kvm_device_release(struct inode *inode, struct file
> > >*filp)
> > >>> +{
> > >>> +	struct kvm_device *dev = filp->private_data;
> > >>> +	struct kvm *kvm = dev->kvm;
> > >>> +
> > >>> +	kvm_device_put(dev);
> > >>> +	kvm_put_kvm(kvm);
> > >> We may put kvm only if users goes to zero, otherwise kvm can be
> > >> freed while something holds a reference to a device. Why not make
> > >> kvm_device_put() do it?
> > >
> > >Nice catch. I'll change the patch so it does the kvm_put_kvm
> > >inside kvm_device_put's destroy branch.
> >
> > No, please don't.  The KVM reference being "put" here is associated
> > with the file descriptor, not with the MPIC object.
> Is it so? Device holds a pointer to kvm, so it increments kvm reference
> to make sure the pointer is valid. What prevents kvm from being destroyed
> while device is still in use in current code?

Where will that kvm pointer be used, after all the file descriptors go  
away and the vcpus stop running?  mmio_mapped guards against unmapping  
the MMIO if it's already been unmapped due to KVM destruction.  We  
don't have any timers or other delayed work.

Well, I do see one place, that Alex added -- the NULLing out of  
dev->kvm->arch.mpic, which didn't exist in my patchset.

> > that change I think you'll have circular references and thus a
> > memory leak, because the vcpus can hold a reference to the MPIC
> > object.
> >
> How can a circular reference be created?

The MPIC holds a reference on KVM, a vcpu holds a reference on the MPIC, and
the vcpu is not destroyed until KVM is destroyed.

-Scott

^ permalink raw reply	[flat|nested] 261+ messages in thread
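
The cycle described in the message above can be illustrated with a small userspace model. This is a toy sketch of the argument only, not kernel code: struct model, kvm_put(), dev_put() and teardown_frees_everything() are hypothetical names, and the model assumes a single vcpu that holds one device reference until the VM itself is destroyed.

```c
#include <stdbool.h>

/* Toy refcount model; all names are illustrative, not kernel API. */
struct model {
	int kvm_refs;		/* references on the VM */
	int dev_refs;		/* references on the device */
	bool vcpu_holds_dev;	/* vcpu keeps a device reference */
	bool dev_holds_kvm;	/* true: VM ref is dropped in destroy */
	bool kvm_destroyed;
	bool dev_destroyed;
};

static void dev_put(struct model *m);

static void kvm_put(struct model *m)
{
	if (--m->kvm_refs == 0) {
		m->kvm_destroyed = true;
		/* vcpus go away with the VM and drop their device ref */
		if (m->vcpu_holds_dev) {
			m->vcpu_holds_dev = false;
			dev_put(m);
		}
	}
}

static void dev_put(struct model *m)
{
	if (--m->dev_refs == 0) {
		m->dev_destroyed = true;
		if (m->dev_holds_kvm)
			kvm_put(m);	/* the proposed change */
	}
}

/* Returns true if VM and device are both freed once all fds are closed. */
bool teardown_frees_everything(bool dev_holds_kvm)
{
	struct model m = {
		.kvm_refs = 1,		/* VM fd */
		.dev_refs = 1,		/* device fd */
		.dev_holds_kvm = dev_holds_kvm,
	};

	m.kvm_refs++;			/* taken when the device is created */
	m.dev_refs++;			/* vcpu connects to the device */
	m.vcpu_holds_dev = true;

	/* Close the device fd. */
	dev_put(&m);
	if (!dev_holds_kvm)
		kvm_put(&m);		/* VM ref tied to the fd, as in the patch */

	/* Close the VM fd. */
	kvm_put(&m);

	return m.kvm_destroyed && m.dev_destroyed;
}
```

In this model teardown_frees_everything(false) ends with both objects destroyed, while teardown_frees_everything(true) leaves one VM reference and one device reference outstanding, each waiting on the other. Whether the real code needs the device to pin the VM for other reasons is the open question in this thread.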

* Re: [PATCH v4 1/6] kvm: add device control API
  2013-04-25 18:59                   ` Scott Wood
@ 2013-04-26  9:53                     ` Gleb Natapov
  -1 siblings, 0 replies; 261+ messages in thread
From: Gleb Natapov @ 2013-04-26  9:53 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm, paulus

On Thu, Apr 25, 2013 at 01:59:20PM -0500, Scott Wood wrote:
> On 04/25/2013 01:22:04 PM, Gleb Natapov wrote:
> >On Thu, Apr 25, 2013 at 11:51:08AM -0500, Scott Wood wrote:
> >> On 04/25/2013 05:47:39 AM, Alexander Graf wrote:
> >> >
> >> >On 25.04.2013, at 11:43, Gleb Natapov wrote:
> >> >
> >> >>> +void kvm_device_put(struct kvm_device *dev)
> >> >>> +{
> >> >>> +	if (atomic_dec_and_test(&dev->users))
> >> >>> +		dev->ops->destroy(dev);
> >> >>> +}
> >> >>> +
> >> >>> +static int kvm_device_release(struct inode *inode, struct file
> >> >*filp)
> >> >>> +{
> >> >>> +	struct kvm_device *dev = filp->private_data;
> >> >>> +	struct kvm *kvm = dev->kvm;
> >> >>> +
> >> >>> +	kvm_device_put(dev);
> >> >>> +	kvm_put_kvm(kvm);
> >> >> We may put kvm only if users goes to zero, otherwise kvm can be
> >> >> freed while something holds a reference to a device. Why not make
> >> >> kvm_device_put() do it?
> >> >
> >> >Nice catch. I'll change the patch so it does the kvm_put_kvm
> >> >inside kvm_device_put's destroy branch.
> >>
> >> No, please don't.  The KVM reference being "put" here is associated
> >> with the file descriptor, not with the MPIC object.
> >Is it so? Device holds a pointer to kvm, so it increments kvm
> >reference
> >to make sure the pointer is valid. What prevents kvm from been
> >destroyed
> >while device is still in use in current code?
> 
> Where will that kvm pointer be used, after all the file descriptors
> go away and the vcpus stop running?  mmio_mapped guards against
> unmapping the MMIO if it's already been unmapped due to KVM
> destruction.  We don't have any timers or other delayed work.
> 
MPIC does not, but a timer device will have one.

> Well, I do see one place, that Alex added -- the NULLing out of
> dev->kvm->arch.mpic, which didn't exist in my patchset.
> 
> >> that change I think you'll have circular references and thus a
> >> memory leak, because the vcpus can hold a reference to the MPIC
> >> object.
> >>
> >How circular reference can be created?
> 
> MPIC holds reference on KVM, vcpu holds reference on MPIC, and vcpu
> is not destroyed until KVM is destroyed.
> 
Yes, you are right. So we need to think about how to fix it in a
different way. What about holding all devices in a kvm->devices[] array
and destroying them during kvm destruction, like we do for vcpus?

--
			Gleb.
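
The suggestion above (the VM owns its devices and tears them down when it
is destroyed) can be sketched in userspace C as follows. This is a minimal
model under stated assumptions and not the patch referred to later in the
thread: struct vm, struct device, MAX_DEVICES, vm_add_device(),
device_fd_release() and devices_die_with_vm() are all hypothetical names,
and the device's back-pointer to the VM is deliberately uncounted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEVICES 4	/* arbitrary bound chosen for this sketch */

struct vm;

struct device {
	struct vm *vm;		/* plain back-pointer, not a counted reference */
	bool destroyed;
};

struct vm {
	int users;
	bool destroyed;
	struct device *devices[MAX_DEVICES];
	size_t nr_devices;
};

/* Register a device with its VM; the VM now owns the device's lifetime. */
static int vm_add_device(struct vm *vm, struct device *dev)
{
	if (vm->nr_devices == MAX_DEVICES)
		return -1;
	dev->vm = vm;
	dev->destroyed = false;
	vm->devices[vm->nr_devices++] = dev;
	return 0;
}

/* Devices are destroyed here, alongside the rest of the VM state. */
static void vm_destroy(struct vm *vm)
{
	for (size_t i = 0; i < vm->nr_devices; i++) {
		vm->devices[i]->destroyed = true;
		vm->devices[i] = NULL;
	}
	vm->nr_devices = 0;
	vm->destroyed = true;
}

static void vm_get(struct vm *vm)
{
	vm->users++;
}

static void vm_put(struct vm *vm)
{
	assert(vm->users > 0);
	if (--vm->users == 0)
		vm_destroy(vm);
}

/* Closing a device fd drops only the VM reference that the fd took. */
static void device_fd_release(struct device *dev)
{
	vm_put(dev->vm);
}

/* Returns true if devices outlive their fds and die together with the VM. */
static bool devices_die_with_vm(void)
{
	struct vm vm = { .users = 1 };	/* reference held by the VM fd */
	struct device a, b;
	bool alive_after_fd_close;

	if (vm_add_device(&vm, &a) || vm_add_device(&vm, &b))
		return false;
	vm_get(&vm);		/* one VM reference per device fd */
	vm_get(&vm);

	device_fd_release(&a);
	device_fd_release(&b);
	alive_after_fd_close = !a.destroyed && !b.destroyed;

	vm_put(&vm);		/* close the VM fd: last reference */

	return alive_after_fd_close && vm.destroyed &&
	       a.destroyed && b.destroyed;
}
```

Because no device holds a counted reference on the VM in this model, the
reference cycle cannot form, and anything that still points at a device
(such as a vcpu) stays valid until the VM itself is torn down.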

* Re: [PATCH v4 1/6] kvm: add device control API
  2013-04-26  9:53                     ` Gleb Natapov
@ 2013-04-26  9:55                       ` Alexander Graf
  -1 siblings, 0 replies; 261+ messages in thread
From: Alexander Graf @ 2013-04-26  9:55 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Scott Wood, kvm-ppc, kvm, paulus


On 26.04.2013, at 11:53, Gleb Natapov wrote:

> On Thu, Apr 25, 2013 at 01:59:20PM -0500, Scott Wood wrote:
>> On 04/25/2013 01:22:04 PM, Gleb Natapov wrote:
>>> On Thu, Apr 25, 2013 at 11:51:08AM -0500, Scott Wood wrote:
>>>> On 04/25/2013 05:47:39 AM, Alexander Graf wrote:
>>>>> 
>>>>> On 25.04.2013, at 11:43, Gleb Natapov wrote:
>>>>> 
>>>>>>> +void kvm_device_put(struct kvm_device *dev)
>>>>>>> +{
>>>>>>> +	if (atomic_dec_and_test(&dev->users))
>>>>>>> +		dev->ops->destroy(dev);
>>>>>>> +}
>>>>>>> +
>>>>>>> +static int kvm_device_release(struct inode *inode, struct file
>>>>> *filp)
>>>>>>> +{
>>>>>>> +	struct kvm_device *dev = filp->private_data;
>>>>>>> +	struct kvm *kvm = dev->kvm;
>>>>>>> +
>>>>>>> +	kvm_device_put(dev);
>>>>>>> +	kvm_put_kvm(kvm);
>>>>>> We may put kvm only if users goes to zero, otherwise kvm can be
>>>>>> freed while something holds a reference to a device. Why not make
>>>>>> kvm_device_put() do it?
>>>>> 
>>>>> Nice catch. I'll change the patch so it does the kvm_put_kvm
>>>>> inside kvm_device_put's destroy branch.
>>>> 
>>>> No, please don't.  The KVM reference being "put" here is associated
>>>> with the file descriptor, not with the MPIC object.
>>> Is it so? Device holds a pointer to kvm, so it increments kvm
>>> reference
>>> to make sure the pointer is valid. What prevents kvm from been
>>> destroyed
>>> while device is still in use in current code?
>> 
>> Where will that kvm pointer be used, after all the file descriptors
>> go away and the vcpus stop running?  mmio_mapped guards against
>> unmapping the MMIO if it's already been unmapped due to KVM
>> destruction.  We don't have any timers or other delayed work.
>> 
> MPIC does not, but timer device will have one.
> 
>> Well, I do see one place, that Alex added -- the NULLing out of
>> dev->kvm->arch.mpic, which didn't exist in my patchset.
>> 
>>>> that change I think you'll have circular references and thus a
>>>> memory leak, because the vcpus can hold a reference to the MPIC
>>>> object.
>>>> 
>>> How circular reference can be created?
>> 
>> MPIC holds reference on KVM, vcpu holds reference on MPIC, and vcpu
>> is not destroyed until KVM is destroyed.
>> 
> Yes, you are right. So we need to think about how to fix it in a
> different way. What about holding all devices in kvm->devices[] array
> and destroy them during kvm destruction, like we do for vcpus?

You should really look at your patches in LIFO order :). A patch doing that was already sent by Scott last night and is in v4 of my patch set.


Alex

* Re: [PATCH v4 1/6] kvm: add device control API
  2013-04-26  9:55                       ` Alexander Graf
@ 2013-04-26  9:57                         ` Gleb Natapov
  -1 siblings, 0 replies; 261+ messages in thread
From: Gleb Natapov @ 2013-04-26  9:57 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Scott Wood, kvm-ppc, kvm, paulus

On Fri, Apr 26, 2013 at 11:55:27AM +0200, Alexander Graf wrote:
> 
> On 26.04.2013, at 11:53, Gleb Natapov wrote:
> 
> > On Thu, Apr 25, 2013 at 01:59:20PM -0500, Scott Wood wrote:
> >> On 04/25/2013 01:22:04 PM, Gleb Natapov wrote:
> >>> On Thu, Apr 25, 2013 at 11:51:08AM -0500, Scott Wood wrote:
> >>>> On 04/25/2013 05:47:39 AM, Alexander Graf wrote:
> >>>>> 
> >>>>> On 25.04.2013, at 11:43, Gleb Natapov wrote:
> >>>>> 
> >>>>>>> +void kvm_device_put(struct kvm_device *dev)
> >>>>>>> +{
> >>>>>>> +	if (atomic_dec_and_test(&dev->users))
> >>>>>>> +		dev->ops->destroy(dev);
> >>>>>>> +}
> >>>>>>> +
> >>>>>>> +static int kvm_device_release(struct inode *inode, struct file
> >>>>> *filp)
> >>>>>>> +{
> >>>>>>> +	struct kvm_device *dev = filp->private_data;
> >>>>>>> +	struct kvm *kvm = dev->kvm;
> >>>>>>> +
> >>>>>>> +	kvm_device_put(dev);
> >>>>>>> +	kvm_put_kvm(kvm);
> >>>>>> We may put kvm only if users goes to zero, otherwise kvm can be
> >>>>>> freed while something holds a reference to a device. Why not make
> >>>>>> kvm_device_put() do it?
> >>>>> 
> >>>>> Nice catch. I'll change the patch so it does the kvm_put_kvm
> >>>>> inside kvm_device_put's destroy branch.
> >>>> 
> >>>> No, please don't.  The KVM reference being "put" here is associated
> >>>> with the file descriptor, not with the MPIC object.
> >>> Is it so? Device holds a pointer to kvm, so it increments kvm
> >>> reference
> >>> to make sure the pointer is valid. What prevents kvm from been
> >>> destroyed
> >>> while device is still in use in current code?
> >> 
> >> Where will that kvm pointer be used, after all the file descriptors
> >> go away and the vcpus stop running?  mmio_mapped guards against
> >> unmapping the MMIO if it's already been unmapped due to KVM
> >> destruction.  We don't have any timers or other delayed work.
> >> 
> > MPIC does not, but timer device will have one.
> > 
> >> Well, I do see one place, that Alex added -- the NULLing out of
> >> dev->kvm->arch.mpic, which didn't exist in my patchset.
> >> 
> >>>> that change I think you'll have circular references and thus a
> >>>> memory leak, because the vcpus can hold a reference to the MPIC
> >>>> object.
> >>>> 
> >>> How circular reference can be created?
> >> 
> >> MPIC holds reference on KVM, vcpu holds reference on MPIC, and vcpu
> >> is not destroyed until KVM is destroyed.
> >> 
> > Yes, you are right. So we need to think about how to fix it in a
> > different way. What about holding all devices in kvm->devices[] array
> > and destroy them during kvm destruction, like we do for vcpus?
> 
> You should really look at your patches in LIFO order :). A patch doing that was already sent by Scott last night and is in v4 of my patch set.
> 
> 
I tried! This causes starvation for some patches. I need a better algorithm :)

--
			Gleb.

end of thread, other threads:[~2013-04-26  9:57 UTC | newest]

Thread overview: 261+ messages
2013-02-14  5:49 [RFC PATCH 0/6] kvm/ppc/mpic: in-kernel irqchip Scott Wood
2013-02-14  5:49 ` Scott Wood
2013-02-14  5:49 ` [RFC PATCH 1/6] kvm: add device control API Scott Wood
2013-02-14  5:49   ` Scott Wood
2013-02-18 12:21   ` Gleb Natapov
2013-02-18 12:21     ` Gleb Natapov
2013-02-18 23:01     ` Scott Wood
2013-02-18 23:01       ` Scott Wood
2013-02-19  0:43       ` Christoffer Dall
2013-02-19  0:43         ` Christoffer Dall
2013-02-19 12:24       ` Gleb Natapov
2013-02-19 12:24         ` Gleb Natapov
2013-02-19 15:51         ` Christoffer Dall
2013-02-19 15:51           ` Christoffer Dall
2013-02-19 21:16         ` Scott Wood
2013-02-19 21:16           ` Scott Wood
2013-02-20 13:09           ` Gleb Natapov
2013-02-20 13:09             ` Gleb Natapov
2013-02-20 21:28             ` Marcelo Tosatti
2013-02-20 21:28               ` Marcelo Tosatti
2013-02-20 22:44               ` Marcelo Tosatti
2013-02-20 22:44                 ` Marcelo Tosatti
2013-02-20 23:53               ` Scott Wood
2013-02-20 23:53                 ` Scott Wood
2013-02-21  0:14                 ` Marcelo Tosatti
2013-02-21  0:14                   ` Marcelo Tosatti
2013-02-21  1:28                   ` Scott Wood
2013-02-21  1:28                     ` Scott Wood
2013-02-21  6:39                     ` Gleb Natapov
2013-02-21  6:39                       ` Gleb Natapov
2013-02-21 23:03                     ` Marcelo Tosatti
2013-02-21 23:03                       ` Marcelo Tosatti
2013-02-22  2:00                       ` Scott Wood
2013-02-22  2:00                         ` Scott Wood
2013-02-23 15:04                         ` Marcelo Tosatti
2013-02-23 15:04                           ` Marcelo Tosatti
2013-02-26  0:27                           ` Scott Wood
2013-02-26  0:27                             ` Scott Wood
2013-02-21  2:05             ` Scott Wood
2013-02-21  2:05               ` Scott Wood
2013-02-21  8:22               ` Gleb Natapov
2013-02-21  8:22                 ` Gleb Natapov
2013-02-22  2:17                 ` Scott Wood
2013-02-22  2:17                   ` Scott Wood
2013-02-24 15:46                   ` Gleb Natapov
2013-02-24 15:46                     ` Gleb Natapov
2013-02-25 15:23                     ` Alexander Graf
2013-02-25 15:23                       ` Alexander Graf
2013-02-26  2:38                     ` Scott Wood
2013-02-26  2:38                       ` Scott Wood
2013-02-20 21:17       ` Marcelo Tosatti
2013-02-20 21:17         ` Marcelo Tosatti
2013-02-20 23:20         ` Scott Wood
2013-02-20 23:20           ` Scott Wood
2013-02-21  0:01           ` Marcelo Tosatti
2013-02-21  0:01             ` Marcelo Tosatti
2013-02-21  0:33             ` Scott Wood
2013-02-21  0:33               ` Scott Wood
2013-02-25  1:11     ` Paul Mackerras
2013-02-25  1:11       ` Paul Mackerras
2013-02-25 13:09       ` Gleb Natapov
2013-02-25 13:09         ` Gleb Natapov
2013-02-25 15:29         ` Alexander Graf
2013-02-25 15:29           ` Alexander Graf
2013-02-19  0:44   ` Christoffer Dall
2013-02-19  0:44     ` Christoffer Dall
2013-02-19  0:53     ` Scott Wood
2013-02-19  0:53       ` Scott Wood
2013-02-19  5:50       ` Christoffer Dall
2013-02-19  5:50         ` Christoffer Dall
2013-02-19 12:45         ` Gleb Natapov
2013-02-19 12:45           ` Gleb Natapov
2013-02-19 20:16         ` Scott Wood
2013-02-19 20:16           ` Scott Wood
2013-02-20  2:16           ` Christoffer Dall
2013-02-20  2:16             ` Christoffer Dall
2013-02-24 13:12             ` Marc Zyngier
2013-02-24 13:12               ` Marc Zyngier
2013-03-06  0:59   ` Paul Mackerras
2013-03-06  0:59     ` Paul Mackerras
2013-03-06  1:20     ` Scott Wood
2013-03-06  1:20       ` Scott Wood
2013-03-06  2:48   ` Benjamin Herrenschmidt
2013-03-06  2:48     ` Benjamin Herrenschmidt
2013-03-06  3:36     ` Scott Wood
2013-03-06  3:36       ` Scott Wood
2013-03-06  4:28       ` Benjamin Herrenschmidt
2013-03-06  4:28         ` Benjamin Herrenschmidt
2013-03-06 10:18     ` Gleb Natapov
2013-03-06 10:18       ` Gleb Natapov
2013-02-14  5:49 ` [RFC PATCH 2/6] kvm/ppc: add a notifier chain for vcpu creation/destruction Scott Wood
2013-02-14  5:49   ` Scott Wood
2013-02-14  5:49 ` [RFC PATCH 3/6] kvm/ppc/mpic: import hw/openpic.c from QEMU Scott Wood
2013-02-14  5:49   ` Scott Wood
2013-02-14  5:49 ` [RFC PATCH 4/6] kvm/ppc/mpic: remove some obviously unneeded code Scott Wood
2013-02-14  5:49   ` Scott Wood
2013-02-14  5:49 ` [RFC PATCH 5/6] kvm/ppc/mpic: adapt to kernel style and environment Scott Wood
2013-02-14  5:49   ` Scott Wood
2013-02-14  5:49 ` [RFC PATCH 6/6] kvm/ppc/mpic: in-kernel MPIC emulation Scott Wood
2013-02-14  5:49   ` Scott Wood
2013-03-21  8:28   ` Alexander Graf
2013-03-21  8:28     ` Alexander Graf
2013-03-21 14:43     ` Scott Wood
2013-03-21 14:43       ` Scott Wood
2013-03-21 14:52       ` Alexander Graf
2013-03-21 14:52         ` Alexander Graf
2013-02-18 12:04 ` [RFC PATCH 0/6] kvm/ppc/mpic: in-kernel irqchip Gleb Natapov
2013-02-18 12:04   ` Gleb Natapov
2013-02-18 23:05   ` Scott Wood
2013-02-18 23:05     ` Scott Wood
2013-04-01 22:47 ` [RFC PATCH v2 0/6] device control and in-kernel MPIC Scott Wood
2013-04-01 22:47   ` Scott Wood
2013-04-01 22:47   ` [RFC PATCH v2 1/6] kvm: add device control API Scott Wood
2013-04-01 22:47     ` Scott Wood
2013-04-02  6:59     ` tiejun.chen
2013-04-02  6:59       ` tiejun.chen
     [not found]       ` <1364923807.24520.2@snotra>
2013-04-03  1:28         ` tiejun.chen
2013-04-03  1:28           ` tiejun.chen
     [not found]           ` <1364952853.8690.3@snotra>
2013-04-03  1:42             ` tiejun.chen
2013-04-03  1:42               ` tiejun.chen
2013-04-03  1:02     ` Paul Mackerras
2013-04-03  1:02       ` Paul Mackerras
2013-04-03  1:19       ` Scott Wood
2013-04-03  1:19         ` Scott Wood
2013-04-03  2:17         ` Paul Mackerras
2013-04-03  2:17           ` Paul Mackerras
2013-04-03 13:22           ` Alexander Graf
2013-04-03 13:22             ` Alexander Graf
2013-04-03 17:37             ` Scott Wood
2013-04-03 17:37               ` Scott Wood
2013-04-03 17:39               ` Alexander Graf
2013-04-03 17:39                 ` Alexander Graf
2013-04-04  9:58                 ` Gleb Natapov
2013-04-04  9:58                   ` Gleb Natapov
2013-04-03 21:03           ` Scott Wood
2013-04-03 21:03             ` Scott Wood
2013-04-01 22:47   ` [RFC PATCH v2 2/6] kvm/ppc/mpic: import hw/openpic.c from QEMU Scott Wood
2013-04-01 22:47     ` Scott Wood
2013-04-01 22:47   ` [RFC PATCH v2 3/6] kvm/ppc/mpic: remove some obviously unneeded code Scott Wood
2013-04-01 22:47     ` Scott Wood
2013-04-01 22:47   ` [RFC PATCH v2 4/6] kvm/ppc/mpic: adapt to kernel style and environment Scott Wood
2013-04-01 22:47     ` Scott Wood
2013-04-01 22:47   ` [RFC PATCH v2 5/6] kvm/ppc/mpic: in-kernel MPIC emulation Scott Wood
2013-04-01 22:47     ` Scott Wood
2013-04-01 22:47   ` [RFC PATCH v2 6/6] kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC Scott Wood
2013-04-01 22:47     ` Scott Wood
2013-04-03  1:57   ` [RFC PATCH v3 0/6] device control and in-kernel MPIC Scott Wood
2013-04-03  1:57     ` Scott Wood
2013-04-03  1:57     ` [RFC PATCH v3 1/6] kvm: add device control API Scott Wood
2013-04-03  1:57       ` Scott Wood
2013-04-03 15:13       ` Alexander Graf
2013-04-03 15:13         ` Alexander Graf
2013-04-04 10:41       ` Gleb Natapov
2013-04-04 10:41         ` Gleb Natapov
2013-04-04 23:47         ` Scott Wood
2013-04-04 23:47           ` Scott Wood
2013-04-08 10:34           ` Gleb Natapov
2013-04-08 10:34             ` Gleb Natapov
2013-04-05  1:02         ` Paul Mackerras
2013-04-05  1:02           ` Paul Mackerras
2013-04-08 10:37           ` Gleb Natapov
2013-04-08 10:37             ` Gleb Natapov
2013-04-08  5:33       ` Paul Mackerras
2013-04-08  5:33         ` Paul Mackerras
2013-04-09  0:50         ` Scott Wood
2013-04-09  0:50           ` Scott Wood
2013-04-03  1:57     ` [RFC PATCH v3 2/6] kvm/ppc/mpic: import hw/openpic.c from QEMU Scott Wood
2013-04-03  1:57       ` Scott Wood
2013-04-03  1:57     ` [RFC PATCH v3 3/6] kvm/ppc/mpic: remove some obviously unneeded code Scott Wood
2013-04-03  1:57       ` Scott Wood
2013-04-03  1:57     ` [RFC PATCH v3 4/6] kvm/ppc/mpic: adapt to kernel style and environment Scott Wood
2013-04-03  1:57       ` Scott Wood
2013-04-03  1:57     ` [RFC PATCH v3 5/6] kvm/ppc/mpic: in-kernel MPIC emulation Scott Wood
2013-04-03  1:57       ` Scott Wood
2013-04-03 15:55       ` Gleb Natapov
2013-04-03 15:55         ` Gleb Natapov
2013-04-03 20:58         ` Scott Wood
2013-04-03 20:58           ` Scott Wood
2013-04-04  5:59           ` Gleb Natapov
2013-04-04  5:59             ` Gleb Natapov
2013-04-04 23:33             ` Scott Wood
2013-04-04 23:33               ` Scott Wood
2013-04-08 10:39               ` Gleb Natapov
2013-04-08 10:39                 ` Gleb Natapov
2013-04-03 16:19       ` Alexander Graf
2013-04-03 16:19         ` Alexander Graf
2013-04-03 21:38         ` Scott Wood
2013-04-03 21:38           ` Scott Wood
2013-04-03 21:58           ` Alexander Graf
2013-04-03 21:58             ` Alexander Graf
2013-04-03 22:07             ` Scott Wood
2013-04-03 22:07               ` Scott Wood
2013-04-03 22:12               ` Alexander Graf
2013-04-03 22:12                 ` Alexander Graf
2013-04-03 22:54                 ` Scott Wood
2013-04-03 22:54                   ` Scott Wood
2013-04-04  9:42                   ` Alexander Graf
2013-04-04  9:42                     ` Alexander Graf
2013-04-03 23:23               ` Scott Wood
2013-04-03 23:23                 ` Scott Wood
2013-04-03 23:23                 ` Scott Wood
2013-04-08  6:30       ` Paul Mackerras
2013-04-08  6:30         ` Paul Mackerras
2013-04-09  0:49         ` Scott Wood
2013-04-09  0:49           ` Scott Wood
2013-04-03  1:57     ` [RFC PATCH v3 6/6] kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC Scott Wood
2013-04-03  1:57       ` Scott Wood
2013-04-04 12:54       ` Alexander Graf
2013-04-04 12:54         ` Alexander Graf
2013-04-04 18:41         ` Scott Wood
2013-04-04 18:41           ` Scott Wood
2013-04-04 22:30           ` Alexander Graf
2013-04-04 22:30             ` Alexander Graf
2013-04-04 22:35             ` Scott Wood
2013-04-04 22:35               ` Scott Wood
2013-04-05  6:09               ` Alexander Graf
2013-04-05  6:09                 ` Alexander Graf
2013-04-05 17:11                 ` Scott Wood
2013-04-05 17:11                   ` Scott Wood
2013-04-13  0:08     ` [PATCH v4 0/6] device-control and in-kernel MPIC Scott Wood
2013-04-13  0:08       ` Scott Wood
2013-04-13  0:08       ` [PATCH v4 1/6] kvm: add device control API Scott Wood
2013-04-13  0:08         ` Scott Wood
2013-04-25  9:43         ` Gleb Natapov
2013-04-25  9:43           ` Gleb Natapov
2013-04-25 10:47           ` Alexander Graf
2013-04-25 10:47             ` Alexander Graf
2013-04-25 12:07             ` Gleb Natapov
2013-04-25 12:07               ` Gleb Natapov
2013-04-25 13:45               ` Alexander Graf
2013-04-25 13:45                 ` Alexander Graf
2013-04-25 13:51                 ` Gleb Natapov
2013-04-25 13:51                   ` Gleb Natapov
2013-04-25 16:51             ` Scott Wood
2013-04-25 16:51               ` Scott Wood
2013-04-25 18:22               ` Gleb Natapov
2013-04-25 18:22                 ` Gleb Natapov
2013-04-25 18:59                 ` Scott Wood
2013-04-25 18:59                   ` Scott Wood
2013-04-26  9:53                   ` Gleb Natapov
2013-04-26  9:53                     ` Gleb Natapov
2013-04-26  9:55                     ` Alexander Graf
2013-04-26  9:55                       ` Alexander Graf
2013-04-26  9:57                       ` Gleb Natapov
2013-04-26  9:57                         ` Gleb Natapov
2013-04-13  0:08       ` [PATCH v4 2/6] kvm/ppc/mpic: import hw/openpic.c from QEMU Scott Wood
2013-04-13  0:08         ` Scott Wood
2013-04-13  0:08       ` [PATCH v4 3/6] kvm/ppc/mpic: remove some obviously unneeded code Scott Wood
2013-04-13  0:08         ` Scott Wood
2013-04-13  0:08       ` [PATCH v4 4/6] kvm/ppc/mpic: adapt to kernel style and environment Scott Wood
2013-04-13  0:08         ` Scott Wood
2013-04-13  0:08       ` [PATCH v4 5/6] kvm/ppc/mpic: in-kernel MPIC emulation Scott Wood
2013-04-13  0:08         ` Scott Wood
2013-04-13  0:08       ` [PATCH v4 6/6] kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC Scott Wood
2013-04-13  0:08         ` Scott Wood
2013-04-15  5:23         ` Paul Mackerras
2013-04-15  5:23           ` Paul Mackerras
2013-04-15 17:52           ` Scott Wood
2013-04-15 17:52             ` Scott Wood
2013-04-16  3:59             ` Paul Mackerras
2013-04-16  3:59               ` Paul Mackerras
