* [PATCH v5 0/4] kvm: level irqfd and new eoifd
@ 2012-07-16 20:33 Alex Williamson
  2012-07-16 20:33 ` [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts Alex Williamson
                   ` (5 more replies)
  0 siblings, 6 replies; 96+ messages in thread
From: Alex Williamson @ 2012-07-16 20:33 UTC
  To: avi, mst; +Cc: gleb, kvm, linux-kernel, jan.kiszka

v5:
 - irqfds now have a one-to-one mapping with eoifds to prevent users
   from consuming all of kernel memory by repeatedly creating eoifds
   from a single irqfd.
 - implement a kvm_clear_irq() which does a test_and_clear_bit of
   the irq_state, updating the pic/ioapic only if the state changes and
   letting the caller know whether anything was done.  I added this onto
   the end as it's essentially an optimization on the previous design.
   It's hard to tell if there's an actual performance benefit to this.
 - dropped eoifd gsi support patch as it was only an FYI.

Thanks,

Alex

---

Alex Williamson (4):
      kvm: Convert eoifd to use kvm_clear_irq
      kvm: Create kvm_clear_irq()
      kvm: KVM_EOIFD, an eventfd for EOIs
      kvm: Extend irqfd to support level interrupts


 Documentation/virtual/kvm/api.txt |   28 +++
 arch/x86/kvm/x86.c                |    3 
 include/linux/kvm.h               |   18 ++
 include/linux/kvm_host.h          |   16 ++
 virt/kvm/eventfd.c                |  333 +++++++++++++++++++++++++++++++++++++
 virt/kvm/irq_comm.c               |   78 +++++++++
 virt/kvm/kvm_main.c               |   11 +
 7 files changed, 483 insertions(+), 4 deletions(-)


* [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts
  2012-07-16 20:33 [PATCH v5 0/4] kvm: level irqfd and new eoifd Alex Williamson
@ 2012-07-16 20:33 ` Alex Williamson
  2012-07-17 21:26   ` Michael S. Tsirkin
  2012-07-18 10:41   ` Michael S. Tsirkin
  2012-07-16 20:33 ` [PATCH v5 2/4] kvm: KVM_EOIFD, an eventfd for EOIs Alex Williamson
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 96+ messages in thread
From: Alex Williamson @ 2012-07-16 20:33 UTC
  To: avi, mst; +Cc: gleb, kvm, linux-kernel, jan.kiszka

In order to inject a level interrupt from an external source using an
irqfd, we need to allocate a new irq_source_id.  This allows us to
assert and (later) de-assert an interrupt line independently from
users of KVM_IRQ_LINE and avoid lost interrupts.

We also add what may appear to be a bit of excessive infrastructure
around an object for storing this irq_source_id.  However, notice
that we only provide a way to assert the interrupt here.  A follow-on
interface will make use of the same irq_source_id to allow de-assert.
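
As an illustration of why a separate irq_source_id avoids lost
interrupts, here is a minimal userspace sketch, not kernel code.  It
assumes the per-pin state is a bitmap with one bit per source ID and
that the line level is the logical OR of those bits.  The name
line_update() and the source IDs used in the note below are
hypothetical, and the plain bit operations here are not atomic,
unlike the kernel's set_bit()/clear_bit().

```c
#include <stdbool.h>

/*
 * Hypothetical model of a level-triggered pin shared by several
 * sources.  Each source owns one bit in *irq_state; the line is
 * asserted while any bit is set.
 */
static bool line_update(unsigned long *irq_state, int source_id, bool level)
{
	unsigned long mask = 1UL << source_id;

	if (level)
		*irq_state |= mask;	/* this source asserts */
	else
		*irq_state &= ~mask;	/* this source de-asserts */

	/* Resulting line level: logical OR of all sources */
	return *irq_state != 0;
}
```

With source 0 standing in for KVM_IRQ_LINE users and source 2 for a
level irqfd, de-asserting source 0 while source 2 is still asserted
leaves the line high, so neither user can clear the other's interrupt.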

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 Documentation/virtual/kvm/api.txt |    6 ++
 arch/x86/kvm/x86.c                |    1 
 include/linux/kvm.h               |    3 +
 virt/kvm/eventfd.c                |  114 ++++++++++++++++++++++++++++++++++++-
 4 files changed, 120 insertions(+), 4 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 100acde..c7267d5 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1981,6 +1981,12 @@ the guest using the specified gsi pin.  The irqfd is removed using
 the KVM_IRQFD_FLAG_DEASSIGN flag, specifying both kvm_irqfd.fd
 and kvm_irqfd.gsi.
 
+The KVM_IRQFD_FLAG_LEVEL flag indicates the gsi input is for a level
+triggered interrupt.  In this case a new irqchip input is allocated
+which is logically OR'd with other inputs allowing multiple sources
+to independently assert level interrupts.  The KVM_IRQFD_FLAG_LEVEL
+is only necessary on setup, teardown is identical to that above.
+KVM_IRQFD_FLAG_LEVEL support is indicated by KVM_CAP_IRQFD_LEVEL.
 
 5. The kvm_run structure
 ------------------------
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a01a424..80bed07 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2148,6 +2148,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_GET_TSC_KHZ:
 	case KVM_CAP_PCI_2_3:
 	case KVM_CAP_KVMCLOCK_CTRL:
+	case KVM_CAP_IRQFD_LEVEL:
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 2ce09aa..b2e6e4f 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -618,6 +618,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_GET_SMMU_INFO 78
 #define KVM_CAP_S390_COW 79
 #define KVM_CAP_PPC_ALLOC_HTAB 80
+#define KVM_CAP_IRQFD_LEVEL 81
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -683,6 +684,8 @@ struct kvm_xen_hvm_config {
 #endif
 
 #define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
+/* Available with KVM_CAP_IRQFD_LEVEL */
+#define KVM_IRQFD_FLAG_LEVEL (1 << 1)
 
 struct kvm_irqfd {
 	__u32 fd;
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 7d7e2aa..ecdbfea 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -36,6 +36,68 @@
 #include "iodev.h"
 
 /*
+ * An irq_source_id can be created from KVM_IRQFD for level interrupt
+ * injections and shared with other interfaces for EOI or de-assert.
+ * Create an object with reference counting to make it easy to use.
+ */
+struct _irq_source {
+	int id; /* the IRQ source ID */
+	bool level_asserted; /* Track assertion state and protect with lock */
+	spinlock_t lock;     /* to avoid unnecessary re-assert/spurious eoi. */
+	struct kvm *kvm;
+	struct kref kref;
+};
+
+static void _irq_source_release(struct kref *kref)
+{
+	struct _irq_source *source;
+
+	source = container_of(kref, struct _irq_source, kref);
+
+	/* This also de-asserts */
+	kvm_free_irq_source_id(source->kvm, source->id);
+	kfree(source);
+}
+
+static void _irq_source_put(struct _irq_source *source)
+{
+	if (source)
+		kref_put(&source->kref, _irq_source_release);
+}
+
+static struct _irq_source *__attribute__ ((used)) /* white lie for now */
+_irq_source_get(struct _irq_source *source)
+{
+	if (source)
+		kref_get(&source->kref);
+
+	return source;
+}
+
+static struct _irq_source *_irq_source_alloc(struct kvm *kvm)
+{
+	struct _irq_source *source;
+	int id;
+
+	source = kzalloc(sizeof(*source), GFP_KERNEL);
+	if (!source)
+		return ERR_PTR(-ENOMEM);
+
+	id = kvm_request_irq_source_id(kvm);
+	if (id < 0) {
+		kfree(source);
+		return ERR_PTR(id);
+	}
+
+	kref_init(&source->kref);
+	spin_lock_init(&source->lock);
+	source->kvm = kvm;
+	source->id = id;
+
+	return source;
+}
+
+/*
  * --------------------------------------------------------------------
  * irqfd: Allows an fd to be used to inject an interrupt to the guest
  *
@@ -52,6 +114,8 @@ struct _irqfd {
 	/* Used for level IRQ fast-path */
 	int gsi;
 	struct work_struct inject;
+	/* IRQ source ID for level triggered irqfds */
+	struct _irq_source *source;
 	/* Used for setup/shutdown */
 	struct eventfd_ctx *eventfd;
 	struct list_head list;
@@ -62,7 +126,7 @@ struct _irqfd {
 static struct workqueue_struct *irqfd_cleanup_wq;
 
 static void
-irqfd_inject(struct work_struct *work)
+irqfd_inject_edge(struct work_struct *work)
 {
 	struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
 	struct kvm *kvm = irqfd->kvm;
@@ -71,6 +135,29 @@ irqfd_inject(struct work_struct *work)
 	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
 }
 
+static void
+irqfd_inject_level(struct work_struct *work)
+{
+	struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
+
+	/*
+	 * Inject an interrupt only if not already asserted.
+	 *
+	 * We can safely ignore the kvm_set_irq return value here.  If
+	 * masked, the irr bit is still set and will eventually be serviced.
+	 * This interface does not guarantee immediate injection.  If
+	 * coalesced, an eoi will be coming where we can de-assert and
+	 * re-inject if necessary.  NB, if you need to know if an interrupt
+	 * was coalesced, this interface is not for you.
+	 */
+	spin_lock(&irqfd->source->lock);
+	if (!irqfd->source->level_asserted) {
+		kvm_set_irq(irqfd->kvm, irqfd->source->id, irqfd->gsi, 1);
+		irqfd->source->level_asserted = true;
+	}
+	spin_unlock(&irqfd->source->lock);
+}
+
 /*
  * Race-free decouple logic (ordering is critical)
  */
@@ -96,6 +183,9 @@ irqfd_shutdown(struct work_struct *work)
 	 * It is now safe to release the object's resources
 	 */
 	eventfd_ctx_put(irqfd->eventfd);
+
+	_irq_source_put(irqfd->source);
+
 	kfree(irqfd);
 }
 
@@ -202,6 +292,7 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
 {
 	struct kvm_irq_routing_table *irq_rt;
 	struct _irqfd *irqfd, *tmp;
+	struct _irq_source *source = NULL;
 	struct file *file = NULL;
 	struct eventfd_ctx *eventfd = NULL;
 	int ret;
@@ -214,7 +305,19 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
 	irqfd->kvm = kvm;
 	irqfd->gsi = args->gsi;
 	INIT_LIST_HEAD(&irqfd->list);
-	INIT_WORK(&irqfd->inject, irqfd_inject);
+
+	if (args->flags & KVM_IRQFD_FLAG_LEVEL) {
+		source = _irq_source_alloc(kvm);
+		if (IS_ERR(source)) {
+			ret = PTR_ERR(source);
+			goto fail;
+		}
+
+		irqfd->source = source;
+		INIT_WORK(&irqfd->inject, irqfd_inject_level);
+	} else
+		INIT_WORK(&irqfd->inject, irqfd_inject_edge);
+
 	INIT_WORK(&irqfd->shutdown, irqfd_shutdown);
 
 	file = eventfd_fget(args->fd);
@@ -276,10 +379,13 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
 	return 0;
 
 fail:
+	if (source && !IS_ERR(source))
+		_irq_source_put(source);
+
 	if (eventfd && !IS_ERR(eventfd))
 		eventfd_ctx_put(eventfd);
 
-	if (!IS_ERR(file))
+	if (file && !IS_ERR(file))
 		fput(file);
 
 	kfree(irqfd);
@@ -340,7 +446,7 @@ kvm_irqfd_deassign(struct kvm *kvm, struct kvm_irqfd *args)
 int
 kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args)
 {
-	if (args->flags & ~KVM_IRQFD_FLAG_DEASSIGN)
+	if (args->flags & ~(KVM_IRQFD_FLAG_DEASSIGN | KVM_IRQFD_FLAG_LEVEL))
 		return -EINVAL;
 
 	if (args->flags & KVM_IRQFD_FLAG_DEASSIGN)



* [PATCH v5 2/4] kvm: KVM_EOIFD, an eventfd for EOIs
  2012-07-16 20:33 [PATCH v5 0/4] kvm: level irqfd and new eoifd Alex Williamson
  2012-07-16 20:33 ` [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts Alex Williamson
@ 2012-07-16 20:33 ` Alex Williamson
  2012-07-17 10:21   ` Michael S. Tsirkin
  2012-07-16 20:34 ` [PATCH v5 3/4] kvm: Create kvm_clear_irq() Alex Williamson
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 96+ messages in thread
From: Alex Williamson @ 2012-07-16 20:33 UTC
  To: avi, mst; +Cc: gleb, kvm, linux-kernel, jan.kiszka

This new ioctl enables an eventfd to be triggered when an EOI is
written for a specified irqchip pin.  The first user of this will
be external device assignment through VFIO, using a level irqfd
for asserting a PCI INTx interrupt and this interface for de-assert
and notification once the interrupt is serviced.

Here we make use of the reference counting of the _irq_source
object, allowing us to share it with an irqfd and clean up regardless
of the release order.
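
The shared-ownership idea can be sketched in userspace as follows.
This is an assumption-laden model, not the patch's code: struct
irq_source, its plain int counter and the released flag are
hypothetical stand-ins for struct _irq_source and its kref, and the
counter is not atomic the way a kref is.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct _irq_source with a plain refcount. */
struct irq_source {
	int id;
	int refs;
	int *released;	/* set to 1 when the last reference is dropped */
};

/* Allocate with one reference held by the caller (the irqfd). */
static struct irq_source *irq_source_alloc(int id, int *released)
{
	struct irq_source *s = calloc(1, sizeof(*s));

	if (!s)
		return NULL;
	s->id = id;
	s->refs = 1;
	s->released = released;
	return s;
}

/* Take an extra reference (what the eoifd would do). */
static struct irq_source *irq_source_get(struct irq_source *s)
{
	if (s)
		s->refs++;
	return s;
}

/* Drop a reference; free the object when the last holder lets go. */
static void irq_source_put(struct irq_source *s)
{
	if (s && --s->refs == 0) {
		*s->released = 1;
		free(s);
	}
}
```

Whichever holder calls irq_source_put() last triggers the release, so
the irqfd and the eoifd can be torn down in either order.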

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 Documentation/virtual/kvm/api.txt |   22 +++
 arch/x86/kvm/x86.c                |    2 
 include/linux/kvm.h               |   15 ++
 include/linux/kvm_host.h          |   13 ++
 virt/kvm/eventfd.c                |  239 +++++++++++++++++++++++++++++++++++++
 virt/kvm/kvm_main.c               |   11 ++
 6 files changed, 300 insertions(+), 2 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index c7267d5..9761f78 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1988,6 +1988,28 @@ to independently assert level interrupts.  The KVM_IRQFD_FLAG_LEVEL
 is only necessary on setup, teardown is identical to that above.
 KVM_IRQFD_FLAG_LEVEL support is indicated by KVM_CAP_IRQFD_LEVEL.
 
+4.77 KVM_EOIFD
+
+Capability: KVM_CAP_EOIFD
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_eoifd (in)
+Returns: 0 on success, -1 on error
+
+KVM_EOIFD allows userspace to receive interrupt EOI notification
+through an eventfd.  kvm_eoifd.fd specifies the eventfd used for
+notification.  KVM_EOIFD_FLAG_DEASSIGN is used to de-assign an eoifd
+once assigned.  KVM_EOIFD also requires additional bits set in
+kvm_eoifd.flags to bind to the proper interrupt line.  The
+KVM_EOIFD_FLAG_LEVEL_IRQFD indicates that kvm_eoifd.irqfd is provided
+and is an irqfd for a level triggered interrupt (configured from
+KVM_IRQFD using KVM_IRQFD_FLAG_LEVEL).  The EOI notification is bound
+to the same GSI and irqchip input as the irqfd.  Both kvm_eoifd.irqfd
+and KVM_EOIFD_FLAG_LEVEL_IRQFD must be specified both on assignment
+and de-assignment of KVM_EOIFD.  A level irqfd may only be bound to
+a single eoifd.  KVM_CAP_EOIFD_LEVEL_IRQFD indicates support of
+KVM_EOIFD_FLAG_LEVEL_IRQFD.
+
 5. The kvm_run structure
 ------------------------
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 80bed07..cc47e31 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2149,6 +2149,8 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_PCI_2_3:
 	case KVM_CAP_KVMCLOCK_CTRL:
 	case KVM_CAP_IRQFD_LEVEL:
+	case KVM_CAP_EOIFD:
+	case KVM_CAP_EOIFD_LEVEL_IRQFD:
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index b2e6e4f..5ca887d 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -619,6 +619,8 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_S390_COW 79
 #define KVM_CAP_PPC_ALLOC_HTAB 80
 #define KVM_CAP_IRQFD_LEVEL 81
+#define KVM_CAP_EOIFD 82
+#define KVM_CAP_EOIFD_LEVEL_IRQFD 83
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -694,6 +696,17 @@ struct kvm_irqfd {
 	__u8  pad[20];
 };
 
+#define KVM_EOIFD_FLAG_DEASSIGN (1 << 0)
+/* Available with KVM_CAP_EOIFD_LEVEL_IRQFD */
+#define KVM_EOIFD_FLAG_LEVEL_IRQFD (1 << 1)
+
+struct kvm_eoifd {
+	__u32 fd;
+	__u32 flags;
+	__u32 irqfd;
+	__u8 pad[20];
+};
+
 struct kvm_clock_data {
 	__u64 clock;
 	__u32 flags;
@@ -834,6 +847,8 @@ struct kvm_s390_ucas_mapping {
 #define KVM_PPC_GET_SMMU_INFO	  _IOR(KVMIO,  0xa6, struct kvm_ppc_smmu_info)
 /* Available with KVM_CAP_PPC_ALLOC_HTAB */
 #define KVM_PPC_ALLOCATE_HTAB	  _IOWR(KVMIO, 0xa7, __u32)
+/* Available with KVM_CAP_EOIFD */
+#define KVM_EOIFD                 _IOW(KVMIO,  0xa8, struct kvm_eoifd)
 
 /*
  * ioctls for vcpu fds
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ae3b426..a7661c0 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -285,6 +285,10 @@ struct kvm {
 		struct list_head  items;
 	} irqfds;
 	struct list_head ioeventfds;
+	struct {
+		struct mutex lock;
+		struct list_head items;
+	} eoifds;
 #endif
 	struct kvm_vm_stat stat;
 	struct kvm_arch arch;
@@ -828,6 +832,8 @@ int kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args);
 void kvm_irqfd_release(struct kvm *kvm);
 void kvm_irq_routing_update(struct kvm *, struct kvm_irq_routing_table *);
 int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args);
+int kvm_eoifd(struct kvm *kvm, struct kvm_eoifd *args);
+void kvm_eoifd_release(struct kvm *kvm);
 
 #else
 
@@ -853,6 +859,13 @@ static inline int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
 	return -ENOSYS;
 }
 
+static inline int kvm_eoifd(struct kvm *kvm, struct kvm_eoifd *args)
+{
+	return -ENOSYS;
+}
+
+static inline void kvm_eoifd_release(struct kvm *kvm) {}
+
 #endif /* CONFIG_HAVE_KVM_EVENTFD */
 
 #ifdef CONFIG_KVM_APIC_ARCHITECTURE
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index ecdbfea..1f9412a 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -65,8 +65,7 @@ static void _irq_source_put(struct _irq_source *source)
 		kref_put(&source->kref, _irq_source_release);
 }
 
-static struct _irq_source *__attribute__ ((used)) /* white lie for now */
-_irq_source_get(struct _irq_source *source)
+static struct _irq_source *_irq_source_get(struct _irq_source *source)
 {
 	if (source)
 		kref_get(&source->kref);
@@ -123,6 +122,39 @@ struct _irqfd {
 	struct work_struct shutdown;
 };
 
+static struct _irqfd *_irqfd_fdget_lock(struct kvm *kvm, int fd)
+{
+	struct eventfd_ctx *eventfd;
+	struct _irqfd *tmp, *irqfd = NULL;
+
+	eventfd = eventfd_ctx_fdget(fd);
+	if (IS_ERR(eventfd))
+		return (struct _irqfd *)eventfd;
+
+	spin_lock_irq(&kvm->irqfds.lock);
+
+	list_for_each_entry(tmp, &kvm->irqfds.items, list) {
+		if (tmp->eventfd == eventfd) {
+			irqfd = tmp;
+			break;
+		}
+	}
+
+	if (!irqfd) {
+		spin_unlock_irq(&kvm->irqfds.lock);
+		eventfd_ctx_put(eventfd);
+		return ERR_PTR(-ENODEV);
+	}
+
+	return irqfd;
+}
+
+static void _irqfd_put_unlock(struct _irqfd *irqfd)
+{
+	eventfd_ctx_put(irqfd->eventfd);
+	spin_unlock_irq(&irqfd->kvm->irqfds.lock);
+}
+
 static struct workqueue_struct *irqfd_cleanup_wq;
 
 static void
@@ -398,6 +430,8 @@ kvm_eventfd_init(struct kvm *kvm)
 	spin_lock_init(&kvm->irqfds.lock);
 	INIT_LIST_HEAD(&kvm->irqfds.items);
 	INIT_LIST_HEAD(&kvm->ioeventfds);
+	mutex_init(&kvm->eoifds.lock);
+	INIT_LIST_HEAD(&kvm->eoifds.items);
 }
 
 /*
@@ -764,3 +798,204 @@ kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
 
 	return kvm_assign_ioeventfd(kvm, args);
 }
+
+/*
+ * --------------------------------------------------------------------
+ *  eoifd: Translate KVM APIC/IOAPIC EOI into eventfd signal.
+ *
+ *  userspace can register with an eventfd for receiving
+ *  notification when an EOI occurs.
+ * --------------------------------------------------------------------
+ */
+
+struct _eoifd {
+	/* eventfd triggered on EOI */
+	struct eventfd_ctx *eventfd;
+	/* irq source ID de-asserted on EOI */
+	struct _irq_source *source;
+	struct kvm *kvm;
+	struct kvm_irq_ack_notifier notifier;
+	/* reference to irqfd eventfd for de-assign matching */
+	struct eventfd_ctx *level_irqfd;
+	struct list_head list;
+};
+
+static void eoifd_event(struct kvm_irq_ack_notifier *notifier)
+{
+	struct _eoifd *eoifd;
+
+	eoifd = container_of(notifier, struct _eoifd, notifier);
+
+	/*
+	 * Ack notifier is per GSI, which may be shared with others.
+	 * Only de-assert and send EOI if our source ID is asserted.
+	 * User needs to re-assert if device still requires service.
+	 */
+	spin_lock(&eoifd->source->lock);
+	if (eoifd->source->level_asserted) {
+		kvm_set_irq(eoifd->kvm,
+			    eoifd->source->id, eoifd->notifier.gsi, 0);
+		eoifd->source->level_asserted = false;
+		eventfd_signal(eoifd->eventfd, 1);
+	}
+	spin_unlock(&eoifd->source->lock);
+}
+
+static int kvm_assign_eoifd(struct kvm *kvm, struct kvm_eoifd *args)
+{
+	struct eventfd_ctx *level_irqfd = NULL, *eventfd = NULL;
+	struct _eoifd *eoifd = NULL, *tmp;
+	struct _irq_source *source = NULL;
+	unsigned gsi;
+	int ret;
+
+	eventfd = eventfd_ctx_fdget(args->fd);
+	if (IS_ERR(eventfd)) {
+		ret = PTR_ERR(eventfd);
+		goto fail;
+	}
+
+	eoifd = kzalloc(sizeof(*eoifd), GFP_KERNEL);
+	if (!eoifd) {
+		ret = -ENOMEM;
+		goto fail;
+	}
+
+	if (args->flags & KVM_EOIFD_FLAG_LEVEL_IRQFD) {
+		struct _irqfd *irqfd = _irqfd_fdget_lock(kvm, args->irqfd);
+		if (IS_ERR(irqfd)) {
+			ret = PTR_ERR(irqfd);
+			goto fail;
+		}
+
+		gsi = irqfd->gsi;
+		level_irqfd = eventfd_ctx_get(irqfd->eventfd);
+		source = _irq_source_get(irqfd->source);
+		_irqfd_put_unlock(irqfd);
+		if (!source) {
+			ret = -EINVAL;
+			goto fail;
+		}
+	} else {
+		ret = -EINVAL;
+		goto fail;
+	}
+
+	INIT_LIST_HEAD(&eoifd->list);
+	eoifd->kvm = kvm;
+	eoifd->eventfd = eventfd;
+	eoifd->source = source;
+	eoifd->level_irqfd = level_irqfd;
+	eoifd->notifier.gsi = gsi;
+	eoifd->notifier.irq_acked = eoifd_event;
+
+	mutex_lock(&kvm->eoifds.lock);
+
+	/*
+	 * Enforce a one-to-one relationship between irqfd and eoifd so
+	 * that this interface can't be used to consume all kernel memory.
+	 * NB. single eventfd can still be used by multiple eoifds.
+	 */
+	list_for_each_entry(tmp, &kvm->eoifds.items, list) {
+		if (tmp->level_irqfd == eoifd->level_irqfd) {
+			mutex_unlock(&kvm->eoifds.lock);
+			ret = -EBUSY;
+			goto fail;
+		}
+	}
+
+	list_add_tail(&eoifd->list, &kvm->eoifds.items);
+	kvm_register_irq_ack_notifier(kvm, &eoifd->notifier);
+
+	mutex_unlock(&kvm->eoifds.lock);
+
+	return 0;
+
+fail:
+	if (eventfd && !IS_ERR(eventfd))
+		eventfd_ctx_put(eventfd);
+	kfree(eoifd);
+	if (level_irqfd)
+		eventfd_ctx_put(level_irqfd);
+	_irq_source_put(source);
+	return ret;
+}
+
+static void eoifd_destroy(struct kvm *kvm, struct _eoifd *eoifd)
+{
+	list_del(&eoifd->list);
+	kvm_unregister_irq_ack_notifier(kvm, &eoifd->notifier);
+	_irq_source_put(eoifd->source);
+	eventfd_ctx_put(eoifd->eventfd);
+	eventfd_ctx_put(eoifd->level_irqfd);
+	kfree(eoifd);
+}
+
+void kvm_eoifd_release(struct kvm *kvm)
+{
+	struct _eoifd *tmp, *eoifd;
+
+	mutex_lock(&kvm->eoifds.lock);
+
+	list_for_each_entry_safe(eoifd, tmp, &kvm->eoifds.items, list)
+		eoifd_destroy(kvm, eoifd);
+
+	mutex_unlock(&kvm->eoifds.lock);
+}
+
+static int kvm_deassign_eoifd(struct kvm *kvm, struct kvm_eoifd *args)
+{
+	struct eventfd_ctx *eventfd = NULL, *level_irqfd = NULL;
+	struct _eoifd *eoifd;
+	int ret = -ENOENT;
+
+	eventfd = eventfd_ctx_fdget(args->fd);
+	if (IS_ERR(eventfd)) {
+		ret = PTR_ERR(eventfd);
+		goto fail;
+	}
+
+	if (args->flags & KVM_EOIFD_FLAG_LEVEL_IRQFD) {
+		level_irqfd = eventfd_ctx_fdget(args->irqfd);
+		if (IS_ERR(level_irqfd)) {
+			ret = PTR_ERR(level_irqfd);
+			goto fail;
+		}
+	} else {
+		ret = -EINVAL;
+		goto fail;
+	}
+
+	mutex_lock(&kvm->eoifds.lock);
+
+	list_for_each_entry(eoifd, &kvm->eoifds.items, list) {
+		if (eoifd->eventfd == eventfd &&
+		    eoifd->level_irqfd == level_irqfd) {
+			eoifd_destroy(kvm, eoifd);
+			ret = 0;
+			break;
+		}
+	}
+
+	mutex_unlock(&kvm->eoifds.lock);
+
+fail:
+	if (eventfd && !IS_ERR(eventfd))
+		eventfd_ctx_put(eventfd);
+	if (level_irqfd && !IS_ERR(level_irqfd))
+		eventfd_ctx_put(level_irqfd);
+
+	return ret;
+}
+
+int kvm_eoifd(struct kvm *kvm, struct kvm_eoifd *args)
+{
+	if (args->flags & ~(KVM_EOIFD_FLAG_DEASSIGN |
+			    KVM_EOIFD_FLAG_LEVEL_IRQFD))
+		return -EINVAL;
+
+	if (args->flags & KVM_EOIFD_FLAG_DEASSIGN)
+		return kvm_deassign_eoifd(kvm, args);
+
+	return kvm_assign_eoifd(kvm, args);
+}
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b4ad14cc..5b41df1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -620,6 +620,8 @@ static int kvm_vm_release(struct inode *inode, struct file *filp)
 
 	kvm_irqfd_release(kvm);
 
+	kvm_eoifd_release(kvm);
+
 	kvm_put_kvm(kvm);
 	return 0;
 }
@@ -2093,6 +2095,15 @@ static long kvm_vm_ioctl(struct file *filp,
 		break;
 	}
 #endif
+	case KVM_EOIFD: {
+		struct kvm_eoifd data;
+
+		r = -EFAULT;
+		if (copy_from_user(&data, argp, sizeof data))
+			goto out;
+		r = kvm_eoifd(kvm, &data);
+		break;
+	}
 	default:
 		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
 		if (r == -ENOTTY)



* [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-16 20:33 [PATCH v5 0/4] kvm: level irqfd and new eoifd Alex Williamson
  2012-07-16 20:33 ` [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts Alex Williamson
  2012-07-16 20:33 ` [PATCH v5 2/4] kvm: KVM_EOIFD, an eventfd for EOIs Alex Williamson
@ 2012-07-16 20:34 ` Alex Williamson
  2012-07-17  0:51   ` Michael S. Tsirkin
                     ` (3 more replies)
  2012-07-16 20:34 ` [PATCH v5 4/4] kvm: Convert eoifd to use kvm_clear_irq Alex Williamson
                   ` (2 subsequent siblings)
  5 siblings, 4 replies; 96+ messages in thread
From: Alex Williamson @ 2012-07-16 20:34 UTC
  To: avi, mst; +Cc: gleb, kvm, linux-kernel, jan.kiszka

This is an alternative to kvm_set_irq(,,,0) which returns the previous
assertion state of the interrupt and does nothing if it isn't changed.
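
A rough userspace model of the semantics described above, under the
assumption that each pin keeps one state bit per source ID.  The
struct chip, its updates counter and both function names are
hypothetical, and the bit manipulation is not atomic, unlike the
kernel's test_and_clear_bit().

```c
/* Hypothetical model of one irqchip pin. */
struct chip {
	unsigned long irq_state;	/* one bit per source ID */
	int level;			/* last level pushed to the chip */
	int updates;			/* number of times the chip was touched */
};

/* Clear this source's bit and report whether it had been set. */
static int clear_line_state(unsigned long *irq_state, int source_id)
{
	unsigned long mask = 1UL << source_id;
	int was_set = (*irq_state & mask) != 0;

	*irq_state &= ~mask;
	return was_set;
}

/*
 * Returns 1 if the source had the line asserted (now cleared),
 * 0 if there was nothing to do.  The chip is only updated when
 * this source's bit actually changed.
 */
static int chip_clear(struct chip *c, int source_id)
{
	int was_set = clear_line_state(&c->irq_state, source_id);

	if (was_set) {
		c->level = c->irq_state != 0;
		c->updates++;
	}
	return was_set;
}
```

Calling chip_clear() twice for the same source touches the chip only
once; the second call returns 0 and leaves the chip state alone.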

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 include/linux/kvm_host.h |    3 ++
 virt/kvm/irq_comm.c      |   78 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 81 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a7661c0..6c168f1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -219,6 +219,8 @@ struct kvm_kernel_irq_routing_entry {
 	u32 type;
 	int (*set)(struct kvm_kernel_irq_routing_entry *e,
 		   struct kvm *kvm, int irq_source_id, int level);
+	int (*clear)(struct kvm_kernel_irq_routing_entry *e,
+		     struct kvm *kvm, int irq_source_id);
 	union {
 		struct {
 			unsigned irqchip;
@@ -629,6 +631,7 @@ void kvm_get_intr_delivery_bitmask(struct kvm_ioapic *ioapic,
 				   unsigned long *deliver_bitmask);
 #endif
 int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level);
+int kvm_clear_irq(struct kvm *kvm, int irq_source_id, u32 irq);
 int kvm_set_msi(struct kvm_kernel_irq_routing_entry *irq_entry, struct kvm *kvm,
 		int irq_source_id, int level);
 void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 5afb431..76e8f22 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -68,6 +68,42 @@ static int kvm_set_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
 	return kvm_ioapic_set_irq(ioapic, e->irqchip.pin, level);
 }
 
+static inline int kvm_clear_irq_line_state(unsigned long *irq_state,
+					    int irq_source_id)
+{
+	return !!test_and_clear_bit(irq_source_id, irq_state);
+}
+
+static int kvm_clear_pic_irq(struct kvm_kernel_irq_routing_entry *e,
+			     struct kvm *kvm, int irq_source_id)
+{
+#ifdef CONFIG_X86
+	struct kvm_pic *pic = pic_irqchip(kvm);
+	int level = kvm_clear_irq_line_state(&pic->irq_states[e->irqchip.pin],
+					     irq_source_id);
+	if (level)
+		kvm_pic_set_irq(pic, e->irqchip.pin,
+				!!pic->irq_states[e->irqchip.pin]);
+	return level;
+#else
+	return -1;
+#endif
+}
+
+static int kvm_clear_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
+				struct kvm *kvm, int irq_source_id)
+{
+	struct kvm_ioapic *ioapic = kvm->arch.vioapic;
+	int level;
+
+	level = kvm_clear_irq_line_state(&ioapic->irq_states[e->irqchip.pin],
+					 irq_source_id);
+	if (level)
+		kvm_ioapic_set_irq(ioapic, e->irqchip.pin,
+				   !!ioapic->irq_states[e->irqchip.pin]);
+	return level;
+}
+
 inline static bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq)
 {
 #ifdef CONFIG_IA64
@@ -190,6 +226,45 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level)
 	return ret;
 }
 
+/*
+ * Return value:
+ *  < 0   Error
+ *  = 0   Interrupt was not set, did nothing
+ *  > 0   Interrupt was pending, cleared
+ */
+int kvm_clear_irq(struct kvm *kvm, int irq_source_id, u32 irq)
+{
+	struct kvm_kernel_irq_routing_entry *e, irq_set[KVM_NR_IRQCHIPS];
+	int ret = -EINVAL, i = 0;
+	struct kvm_irq_routing_table *irq_rt;
+	struct hlist_node *n;
+
+	/* Not possible to detect if the guest uses the PIC or the
+	 * IOAPIC.  So clear the bit in both. The guest will ignore
+	 * writes to the unused one.
+	 */
+	rcu_read_lock();
+	irq_rt = rcu_dereference(kvm->irq_routing);
+	if (irq < irq_rt->nr_rt_entries)
+		hlist_for_each_entry(e, n, &irq_rt->map[irq], link)
+			irq_set[i++] = *e;
+	rcu_read_unlock();
+
+	while (i--) {
+		int r = -EINVAL;
+
+		if (irq_set[i].clear)
+			r = irq_set[i].clear(&irq_set[i], kvm, irq_source_id);
+
+		if (r < 0)
+			continue;
+
+		ret = r + ((ret < 0) ? 0 : ret);
+	}
+
+	return ret;
+}
+
 void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
 {
 	struct kvm_irq_ack_notifier *kian;
@@ -344,16 +419,19 @@ static int setup_routing_entry(struct kvm_irq_routing_table *rt,
 		switch (ue->u.irqchip.irqchip) {
 		case KVM_IRQCHIP_PIC_MASTER:
 			e->set = kvm_set_pic_irq;
+			e->clear = kvm_clear_pic_irq;
 			max_pin = 16;
 			break;
 		case KVM_IRQCHIP_PIC_SLAVE:
 			e->set = kvm_set_pic_irq;
+			e->clear = kvm_clear_pic_irq;
 			max_pin = 16;
 			delta = 8;
 			break;
 		case KVM_IRQCHIP_IOAPIC:
 			max_pin = KVM_IOAPIC_NUM_PINS;
 			e->set = kvm_set_ioapic_irq;
+			e->clear = kvm_clear_ioapic_irq;
 			break;
 		default:
 			goto out;



* [PATCH v5 4/4] kvm: Convert eoifd to use kvm_clear_irq
  2012-07-16 20:33 [PATCH v5 0/4] kvm: level irqfd and new eoifd Alex Williamson
                   ` (2 preceding siblings ...)
  2012-07-16 20:34 ` [PATCH v5 3/4] kvm: Create kvm_clear_irq() Alex Williamson
@ 2012-07-16 20:34 ` Alex Williamson
  2012-07-18 10:43 ` [PATCH v5 0/4] kvm: level irqfd and new eoifd Michael S. Tsirkin
  2012-07-19 16:59 ` Michael S. Tsirkin
  5 siblings, 0 replies; 96+ messages in thread
From: Alex Williamson @ 2012-07-16 20:34 UTC
  To: avi, mst; +Cc: gleb, kvm, linux-kernel, jan.kiszka

We can drop any kind of serialization on the injection side as we
expect spurious injections to be both rare and safe.  On the EOI
side, this continues to filter out both the pic/ioapic work and
the eventfd signaling if our source ID has not set the interrupt.
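
The filtering on the EOI side can be pictured with the sketch below.
It is a userspace approximation under stated assumptions: struct
eoi_listener, its signals counter (standing in for the eventfd) and
the function names are all hypothetical, and nothing here is atomic.

```c
/* Hypothetical listener bound to one source ID on a shared pin. */
struct eoi_listener {
	int source_id;
	unsigned long *irq_state;	/* shared per-pin state, one bit per source */
	unsigned int signals;		/* stands in for the eventfd counter */
};

/* Clear this source's bit; return 1 if it had been set, else 0. */
static int clear_source(unsigned long *irq_state, int source_id)
{
	unsigned long mask = 1UL << source_id;
	int was_set = (*irq_state & mask) != 0;

	*irq_state &= ~mask;
	return was_set;
}

/* Called for every EOI on the pin, whoever asserted it. */
static void on_eoi(struct eoi_listener *l)
{
	/* Notify only if our own source had the line asserted */
	if (clear_source(l->irq_state, l->source_id) > 0)
		l->signals++;
}
```

Asserting the same source twice before an EOI sets an already-set bit,
so a spurious injection is harmless in this model, and a listener
whose source never asserted the line is never signaled.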

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 virt/kvm/eventfd.c |   24 ++++--------------------
 1 file changed, 4 insertions(+), 20 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 1f9412a..164b4c0 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -42,8 +42,6 @@
  */
 struct _irq_source {
 	int id; /* the IRQ source ID */
-	bool level_asserted; /* Track assertion state and protect with lock */
-	spinlock_t lock;     /* to avoid unnecessary re-assert/spurious eoi. */
 	struct kvm *kvm;
 	struct kref kref;
 };
@@ -89,7 +87,6 @@ static struct _irq_source *_irq_source_alloc(struct kvm *kvm)
 	}
 
 	kref_init(&source->kref);
-	spin_lock_init(&source->lock);
 	source->kvm = kvm;
 	source->id = id;
 
@@ -173,8 +170,6 @@ irqfd_inject_level(struct work_struct *work)
 	struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
 
 	/*
-	 * Inject an interrupt only if not already asserted.
-	 *
 	 * We can safely ignore the kvm_set_irq return value here.  If
 	 * masked, the irr bit is still set and will eventually be serviced.
 	 * This interface does not guarantee immediate injection.  If
@@ -182,12 +177,7 @@ irqfd_inject_level(struct work_struct *work)
 	 * re-inject if necessary.  NB, if you need to know if an interrupt
 	 * was coalesced, this interface is not for you.
 	 */
-	spin_lock(&irqfd->source->lock);
-	if (!irqfd->source->level_asserted) {
-		kvm_set_irq(irqfd->kvm, irqfd->source->id, irqfd->gsi, 1);
-		irqfd->source->level_asserted = true;
-	}
-	spin_unlock(&irqfd->source->lock);
+	kvm_set_irq(irqfd->kvm, irqfd->source->id, irqfd->gsi, 1);
 }
 
 /*
@@ -828,17 +818,11 @@ static void eoifd_event(struct kvm_irq_ack_notifier *notifier)
 
 	/*
 	 * Ack notifier is per GSI, which may be shared with others.
-	 * Only de-assert and send EOI if our source ID is asserted.
-	 * User needs to re-assert if device still requires service.
+	 * Only send EOI if pending from our source ID.  User needs to
+	 * re-assert if device still requires service.
 	 */
-	spin_lock(&eoifd->source->lock);
-	if (eoifd->source->level_asserted) {
-		kvm_set_irq(eoifd->kvm,
-			    eoifd->source->id, eoifd->notifier.gsi, 0);
-		eoifd->source->level_asserted = false;
+	if (kvm_clear_irq(eoifd->kvm, eoifd->source->id, notifier->gsi) > 0)
 		eventfd_signal(eoifd->eventfd, 1);
-	}
-	spin_unlock(&eoifd->source->lock);
 }
 
 static int kvm_assign_eoifd(struct kvm *kvm, struct kvm_eoifd *args)



* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-16 20:34 ` [PATCH v5 3/4] kvm: Create kvm_clear_irq() Alex Williamson
@ 2012-07-17  0:51   ` Michael S. Tsirkin
  2012-07-17  2:42     ` Alex Williamson
  2012-07-17  0:55   ` Michael S. Tsirkin
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-17  0:51 UTC
  To: Alex Williamson; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Mon, Jul 16, 2012 at 02:34:03PM -0600, Alex Williamson wrote:
> This is an alternative to kvm_set_irq(,,,0) which returns the previous
> assertion state of the interrupt and does nothing if it isn't changed.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
> 
>  include/linux/kvm_host.h |    3 ++
>  virt/kvm/irq_comm.c      |   78 ++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 81 insertions(+)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index a7661c0..6c168f1 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -219,6 +219,8 @@ struct kvm_kernel_irq_routing_entry {
>  	u32 type;
>  	int (*set)(struct kvm_kernel_irq_routing_entry *e,
>  		   struct kvm *kvm, int irq_source_id, int level);
> +	int (*clear)(struct kvm_kernel_irq_routing_entry *e,
> +		     struct kvm *kvm, int irq_source_id);
>  	union {
>  		struct {
>  			unsigned irqchip;
> @@ -629,6 +631,7 @@ void kvm_get_intr_delivery_bitmask(struct kvm_ioapic *ioapic,
>  				   unsigned long *deliver_bitmask);
>  #endif
>  int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level);
> +int kvm_clear_irq(struct kvm *kvm, int irq_source_id, u32 irq);
>  int kvm_set_msi(struct kvm_kernel_irq_routing_entry *irq_entry, struct kvm *kvm,
>  		int irq_source_id, int level);
>  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
> diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> index 5afb431..76e8f22 100644
> --- a/virt/kvm/irq_comm.c
> +++ b/virt/kvm/irq_comm.c
> @@ -68,6 +68,42 @@ static int kvm_set_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
>  	return kvm_ioapic_set_irq(ioapic, e->irqchip.pin, level);
>  }
>  
> +static inline int kvm_clear_irq_line_state(unsigned long *irq_state,
> +					    int irq_source_id)
> +{
> +	return !!test_and_clear_bit(irq_source_id, irq_state);
> +}
> +
> +static int kvm_clear_pic_irq(struct kvm_kernel_irq_routing_entry *e,
> +			     struct kvm *kvm, int irq_source_id)
> +{
> +#ifdef CONFIG_X86
> +	struct kvm_pic *pic = pic_irqchip(kvm);
> +	int level = kvm_clear_irq_line_state(&pic->irq_states[e->irqchip.pin],
> +					     irq_source_id);
> +	if (level)
> +		kvm_pic_set_irq(pic, e->irqchip.pin,
> +				!!pic->irq_states[e->irqchip.pin]);
> +	return level;
> +#else
> +	return -1;
> +#endif

What does this ifdef exclude exactly?

> +}
> +
> +static int kvm_clear_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
> +				struct kvm *kvm, int irq_source_id)
> +{
> +	struct kvm_ioapic *ioapic = kvm->arch.vioapic;
> +	int level;
> +
> +	level = kvm_clear_irq_line_state(&ioapic->irq_states[e->irqchip.pin],
> +					 irq_source_id);
> +	if (level)
> +		kvm_ioapic_set_irq(ioapic, e->irqchip.pin,
> +				   !!ioapic->irq_states[e->irqchip.pin]);
> +	return level;
> +}
> +
>  inline static bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq)
>  {
>  #ifdef CONFIG_IA64
> @@ -190,6 +226,45 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level)
>  	return ret;
>  }
>  
> +/*
> + * Return value:
> + *  < 0   Error
> + *  = 0   Interrupt was not set, did nothing
> + *  > 0   Interrupt was pending, cleared
> + */
> +int kvm_clear_irq(struct kvm *kvm, int irq_source_id, u32 irq)
> +{
> +	struct kvm_kernel_irq_routing_entry *e, irq_set[KVM_NR_IRQCHIPS];
> +	int ret = -EINVAL, i = 0;
> +	struct kvm_irq_routing_table *irq_rt;
> +	struct hlist_node *n;
> +
> +	/* Not possible to detect if the guest uses the PIC or the
> +	 * IOAPIC.  So clear the bit in both. The guest will ignore
> +	 * writes to the unused one.
> +	 */
> +	rcu_read_lock();
> +	irq_rt = rcu_dereference(kvm->irq_routing);
> +	if (irq < irq_rt->nr_rt_entries)
> +		hlist_for_each_entry(e, n, &irq_rt->map[irq], link)
> +			irq_set[i++] = *e;
> +	rcu_read_unlock();
> +
> +	while (i--) {
> +		int r = -EINVAL;
> +
> +		if (irq_set[i].clear)
> +			r = irq_set[i].clear(&irq_set[i], kvm, irq_source_id);
> +
> +		if (r < 0)
> +			continue;
> +
> +		ret = r + ((ret < 0) ? 0 : ret);
> +	}
> +
> +	return ret;
> +}
> +
>  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
>  {
>  	struct kvm_irq_ack_notifier *kian;
> @@ -344,16 +419,19 @@ static int setup_routing_entry(struct kvm_irq_routing_table *rt,
>  		switch (ue->u.irqchip.irqchip) {
>  		case KVM_IRQCHIP_PIC_MASTER:
>  			e->set = kvm_set_pic_irq;
> +			e->clear = kvm_clear_pic_irq;
>  			max_pin = 16;
>  			break;
>  		case KVM_IRQCHIP_PIC_SLAVE:
>  			e->set = kvm_set_pic_irq;
> +			e->clear = kvm_clear_pic_irq;
>  			max_pin = 16;
>  			delta = 8;
>  			break;
>  		case KVM_IRQCHIP_IOAPIC:
>  			max_pin = KVM_IOAPIC_NUM_PINS;
>  			e->set = kvm_set_ioapic_irq;
> +			e->clear = kvm_clear_ioapic_irq;
>  			break;
>  		default:
>  			goto out;

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-16 20:34 ` [PATCH v5 3/4] kvm: Create kvm_clear_irq() Alex Williamson
  2012-07-17  0:51   ` Michael S. Tsirkin
@ 2012-07-17  0:55   ` Michael S. Tsirkin
  2012-07-17 10:14   ` Michael S. Tsirkin
  2012-07-17 10:18   ` Michael S. Tsirkin
  3 siblings, 0 replies; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-17  0:55 UTC (permalink / raw)
  To: Alex Williamson; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Mon, Jul 16, 2012 at 02:34:03PM -0600, Alex Williamson wrote:
> This is an alternative to kvm_set_irq(,,,0) which returns the previous
> assertion state of the interrupt and does nothing if it isn't changed.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
> 
>  include/linux/kvm_host.h |    3 ++
>  virt/kvm/irq_comm.c      |   78 ++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 81 insertions(+)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index a7661c0..6c168f1 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -219,6 +219,8 @@ struct kvm_kernel_irq_routing_entry {
>  	u32 type;
>  	int (*set)(struct kvm_kernel_irq_routing_entry *e,
>  		   struct kvm *kvm, int irq_source_id, int level);
> +	int (*clear)(struct kvm_kernel_irq_routing_entry *e,
> +		     struct kvm *kvm, int irq_source_id);
>  	union {
>  		struct {
>  			unsigned irqchip;
> @@ -629,6 +631,7 @@ void kvm_get_intr_delivery_bitmask(struct kvm_ioapic *ioapic,
>  				   unsigned long *deliver_bitmask);
>  #endif
>  int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level);
> +int kvm_clear_irq(struct kvm *kvm, int irq_source_id, u32 irq);
>  int kvm_set_msi(struct kvm_kernel_irq_routing_entry *irq_entry, struct kvm *kvm,
>  		int irq_source_id, int level);
>  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
> diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> index 5afb431..76e8f22 100644
> --- a/virt/kvm/irq_comm.c
> +++ b/virt/kvm/irq_comm.c
> @@ -68,6 +68,42 @@ static int kvm_set_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
>  	return kvm_ioapic_set_irq(ioapic, e->irqchip.pin, level);
>  }
>  
> +static inline int kvm_clear_irq_line_state(unsigned long *irq_state,
> +					    int irq_source_id)
> +{
> +	return !!test_and_clear_bit(irq_source_id, irq_state);
> +}
> +
> +static int kvm_clear_pic_irq(struct kvm_kernel_irq_routing_entry *e,
> +			     struct kvm *kvm, int irq_source_id)
> +{
> +#ifdef CONFIG_X86
> +	struct kvm_pic *pic = pic_irqchip(kvm);
> +	int level = kvm_clear_irq_line_state(&pic->irq_states[e->irqchip.pin],
> +					     irq_source_id);
> +	if (level)
> +		kvm_pic_set_irq(pic, e->irqchip.pin,
> +				!!pic->irq_states[e->irqchip.pin]);

This is a bit tricky: add a comment explaining the logic?

> +	return level;
> +#else
> +	return -1;
> +#endif
> +}
> +
> +static int kvm_clear_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
> +				struct kvm *kvm, int irq_source_id)
> +{
> +	struct kvm_ioapic *ioapic = kvm->arch.vioapic;
> +	int level;
> +
> +	level = kvm_clear_irq_line_state(&ioapic->irq_states[e->irqchip.pin],
> +					 irq_source_id);
> +	if (level)
> +		kvm_ioapic_set_irq(ioapic, e->irqchip.pin,
> +				   !!ioapic->irq_states[e->irqchip.pin]);

This is a bit tricky: add a comment explaining the logic?

> +	return level;
> +}
> +
>  inline static bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq)
>  {
>  #ifdef CONFIG_IA64
> @@ -190,6 +226,45 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level)
>  	return ret;
>  }
>  
> +/*
> + * Return value:
> + *  < 0   Error
> + *  = 0   Interrupt was not set, did nothing
> + *  > 0   Interrupt was pending, cleared
> + */
> +int kvm_clear_irq(struct kvm *kvm, int irq_source_id, u32 irq)
> +{
> +	struct kvm_kernel_irq_routing_entry *e, irq_set[KVM_NR_IRQCHIPS];
> +	int ret = -EINVAL, i = 0;
> +	struct kvm_irq_routing_table *irq_rt;
> +	struct hlist_node *n;
> +
> +	/* Not possible to detect if the guest uses the PIC or the
> +	 * IOAPIC.  So clear the bit in both. The guest will ignore
> +	 * writes to the unused one.
> +	 */
> +	rcu_read_lock();
> +	irq_rt = rcu_dereference(kvm->irq_routing);
> +	if (irq < irq_rt->nr_rt_entries)
> +		hlist_for_each_entry(e, n, &irq_rt->map[irq], link)
> +			irq_set[i++] = *e;
> +	rcu_read_unlock();
> +
> +	while (i--) {
> +		int r = -EINVAL;
> +
> +		if (irq_set[i].clear)
> +			r = irq_set[i].clear(&irq_set[i], kvm, irq_source_id);
> +
> +		if (r < 0)
> +			continue;
> +
> +		ret = r + ((ret < 0) ? 0 : ret);
> +	}
> +
> +	return ret;
> +}
> +
>  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
>  {
>  	struct kvm_irq_ack_notifier *kian;
> @@ -344,16 +419,19 @@ static int setup_routing_entry(struct kvm_irq_routing_table *rt,
>  		switch (ue->u.irqchip.irqchip) {
>  		case KVM_IRQCHIP_PIC_MASTER:
>  			e->set = kvm_set_pic_irq;
> +			e->clear = kvm_clear_pic_irq;
>  			max_pin = 16;
>  			break;
>  		case KVM_IRQCHIP_PIC_SLAVE:
>  			e->set = kvm_set_pic_irq;
> +			e->clear = kvm_clear_pic_irq;
>  			max_pin = 16;
>  			delta = 8;
>  			break;
>  		case KVM_IRQCHIP_IOAPIC:
>  			max_pin = KVM_IOAPIC_NUM_PINS;
>  			e->set = kvm_set_ioapic_irq;
> +			e->clear = kvm_clear_ioapic_irq;
>  			break;
>  		default:
>  			goto out;

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-17  0:51   ` Michael S. Tsirkin
@ 2012-07-17  2:42     ` Alex Williamson
  0 siblings, 0 replies; 96+ messages in thread
From: Alex Williamson @ 2012-07-17  2:42 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, 2012-07-17 at 03:51 +0300, Michael S. Tsirkin wrote:
> On Mon, Jul 16, 2012 at 02:34:03PM -0600, Alex Williamson wrote:
> > This is an alternative to kvm_set_irq(,,,0) which returns the previous
> > assertion state of the interrupt and does nothing if it isn't changed.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> > 
> >  include/linux/kvm_host.h |    3 ++
> >  virt/kvm/irq_comm.c      |   78 ++++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 81 insertions(+)
> > 
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index a7661c0..6c168f1 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -219,6 +219,8 @@ struct kvm_kernel_irq_routing_entry {
> >  	u32 type;
> >  	int (*set)(struct kvm_kernel_irq_routing_entry *e,
> >  		   struct kvm *kvm, int irq_source_id, int level);
> > +	int (*clear)(struct kvm_kernel_irq_routing_entry *e,
> > +		     struct kvm *kvm, int irq_source_id);
> >  	union {
> >  		struct {
> >  			unsigned irqchip;
> > @@ -629,6 +631,7 @@ void kvm_get_intr_delivery_bitmask(struct kvm_ioapic *ioapic,
> >  				   unsigned long *deliver_bitmask);
> >  #endif
> >  int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level);
> > +int kvm_clear_irq(struct kvm *kvm, int irq_source_id, u32 irq);
> >  int kvm_set_msi(struct kvm_kernel_irq_routing_entry *irq_entry, struct kvm *kvm,
> >  		int irq_source_id, int level);
> >  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
> > diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> > index 5afb431..76e8f22 100644
> > --- a/virt/kvm/irq_comm.c
> > +++ b/virt/kvm/irq_comm.c
> > @@ -68,6 +68,42 @@ static int kvm_set_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
> >  	return kvm_ioapic_set_irq(ioapic, e->irqchip.pin, level);
> >  }
> >  
> > +static inline int kvm_clear_irq_line_state(unsigned long *irq_state,
> > +					    int irq_source_id)
> > +{
> > +	return !!test_and_clear_bit(irq_source_id, irq_state);
> > +}
> > +
> > +static int kvm_clear_pic_irq(struct kvm_kernel_irq_routing_entry *e,
> > +			     struct kvm *kvm, int irq_source_id)
> > +{
> > +#ifdef CONFIG_X86
> > +	struct kvm_pic *pic = pic_irqchip(kvm);
> > +	int level = kvm_clear_irq_line_state(&pic->irq_states[e->irqchip.pin],
> > +					     irq_source_id);
> > +	if (level)
> > +		kvm_pic_set_irq(pic, e->irqchip.pin,
> > +				!!pic->irq_states[e->irqchip.pin]);
> > +	return level;
> > +#else
> > +	return -1;
> > +#endif
> 
> What does this ifdef exclude exactly?

There's no PIC on ia64.  Not that it works, but I figured consistency with
kvm_set_pic_irq would make more sense whenever someone goes through and
cleans out the code.  Thanks,

Alex

> > +}
> > +
> > +static int kvm_clear_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
> > +				struct kvm *kvm, int irq_source_id)
> > +{
> > +	struct kvm_ioapic *ioapic = kvm->arch.vioapic;
> > +	int level;
> > +
> > +	level = kvm_clear_irq_line_state(&ioapic->irq_states[e->irqchip.pin],
> > +					 irq_source_id);
> > +	if (level)
> > +		kvm_ioapic_set_irq(ioapic, e->irqchip.pin,
> > +				   !!ioapic->irq_states[e->irqchip.pin]);
> > +	return level;
> > +}
> > +
> >  inline static bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq)
> >  {
> >  #ifdef CONFIG_IA64
> > @@ -190,6 +226,45 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level)
> >  	return ret;
> >  }
> >  
> > +/*
> > + * Return value:
> > + *  < 0   Error
> > + *  = 0   Interrupt was not set, did nothing
> > + *  > 0   Interrupt was pending, cleared
> > + */
> > +int kvm_clear_irq(struct kvm *kvm, int irq_source_id, u32 irq)
> > +{
> > +	struct kvm_kernel_irq_routing_entry *e, irq_set[KVM_NR_IRQCHIPS];
> > +	int ret = -EINVAL, i = 0;
> > +	struct kvm_irq_routing_table *irq_rt;
> > +	struct hlist_node *n;
> > +
> > +	/* Not possible to detect if the guest uses the PIC or the
> > +	 * IOAPIC.  So clear the bit in both. The guest will ignore
> > +	 * writes to the unused one.
> > +	 */
> > +	rcu_read_lock();
> > +	irq_rt = rcu_dereference(kvm->irq_routing);
> > +	if (irq < irq_rt->nr_rt_entries)
> > +		hlist_for_each_entry(e, n, &irq_rt->map[irq], link)
> > +			irq_set[i++] = *e;
> > +	rcu_read_unlock();
> > +
> > +	while (i--) {
> > +		int r = -EINVAL;
> > +
> > +		if (irq_set[i].clear)
> > +			r = irq_set[i].clear(&irq_set[i], kvm, irq_source_id);
> > +
> > +		if (r < 0)
> > +			continue;
> > +
> > +		ret = r + ((ret < 0) ? 0 : ret);
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> >  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
> >  {
> >  	struct kvm_irq_ack_notifier *kian;
> > @@ -344,16 +419,19 @@ static int setup_routing_entry(struct kvm_irq_routing_table *rt,
> >  		switch (ue->u.irqchip.irqchip) {
> >  		case KVM_IRQCHIP_PIC_MASTER:
> >  			e->set = kvm_set_pic_irq;
> > +			e->clear = kvm_clear_pic_irq;
> >  			max_pin = 16;
> >  			break;
> >  		case KVM_IRQCHIP_PIC_SLAVE:
> >  			e->set = kvm_set_pic_irq;
> > +			e->clear = kvm_clear_pic_irq;
> >  			max_pin = 16;
> >  			delta = 8;
> >  			break;
> >  		case KVM_IRQCHIP_IOAPIC:
> >  			max_pin = KVM_IOAPIC_NUM_PINS;
> >  			e->set = kvm_set_ioapic_irq;
> > +			e->clear = kvm_clear_ioapic_irq;
> >  			break;
> >  		default:
> >  			goto out;

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-16 20:34 ` [PATCH v5 3/4] kvm: Create kvm_clear_irq() Alex Williamson
  2012-07-17  0:51   ` Michael S. Tsirkin
  2012-07-17  0:55   ` Michael S. Tsirkin
@ 2012-07-17 10:14   ` Michael S. Tsirkin
  2012-07-17 13:56     ` Alex Williamson
  2012-07-17 10:18   ` Michael S. Tsirkin
  3 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-17 10:14 UTC (permalink / raw)
  To: Alex Williamson; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Mon, Jul 16, 2012 at 02:34:03PM -0600, Alex Williamson wrote:
> This is an alternative to kvm_set_irq(,,,0) which returns the previous
> assertion state of the interrupt and does nothing if it isn't changed.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
> 
>  include/linux/kvm_host.h |    3 ++
>  virt/kvm/irq_comm.c      |   78 ++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 81 insertions(+)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index a7661c0..6c168f1 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -219,6 +219,8 @@ struct kvm_kernel_irq_routing_entry {
>  	u32 type;
>  	int (*set)(struct kvm_kernel_irq_routing_entry *e,
>  		   struct kvm *kvm, int irq_source_id, int level);
> +	int (*clear)(struct kvm_kernel_irq_routing_entry *e,
> +		     struct kvm *kvm, int irq_source_id);
>  	union {
>  		struct {
>  			unsigned irqchip;
> @@ -629,6 +631,7 @@ void kvm_get_intr_delivery_bitmask(struct kvm_ioapic *ioapic,
>  				   unsigned long *deliver_bitmask);
>  #endif
>  int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level);
> +int kvm_clear_irq(struct kvm *kvm, int irq_source_id, u32 irq);
>  int kvm_set_msi(struct kvm_kernel_irq_routing_entry *irq_entry, struct kvm *kvm,
>  		int irq_source_id, int level);
>  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
> diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> index 5afb431..76e8f22 100644
> --- a/virt/kvm/irq_comm.c
> +++ b/virt/kvm/irq_comm.c
> @@ -68,6 +68,42 @@ static int kvm_set_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
>  	return kvm_ioapic_set_irq(ioapic, e->irqchip.pin, level);
>  }
>  
> +static inline int kvm_clear_irq_line_state(unsigned long *irq_state,
> +					    int irq_source_id)
> +{
> +	return !!test_and_clear_bit(irq_source_id, irq_state);
> +}
> +
> +static int kvm_clear_pic_irq(struct kvm_kernel_irq_routing_entry *e,
> +			     struct kvm *kvm, int irq_source_id)
> +{
> +#ifdef CONFIG_X86
> +	struct kvm_pic *pic = pic_irqchip(kvm);
> +	int level = kvm_clear_irq_line_state(&pic->irq_states[e->irqchip.pin],
> +					     irq_source_id);
> +	if (level)
> +		kvm_pic_set_irq(pic, e->irqchip.pin,
> +				!!pic->irq_states[e->irqchip.pin]);
> +	return level;

I think I begin to understand: if (level) checks it was previously set,
and then we clear if needed? I think it's worthwhile to rename
level to orig_level and rewrite as:

	if (orig_level && !pic->irq_states[e->irqchip.pin])
		kvm_pic_set_irq(pic, e->irqchip.pin, 0);

This both makes the logic clear without need for comments and
saves some cycles on pic in case nothing actually changed.

> +#else
> +	return -1;
> +#endif
> +}
> +
> +static int kvm_clear_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
> +				struct kvm *kvm, int irq_source_id)
> +{
> +	struct kvm_ioapic *ioapic = kvm->arch.vioapic;
> +	int level;
> +
> +	level = kvm_clear_irq_line_state(&ioapic->irq_states[e->irqchip.pin],
> +					 irq_source_id);
> +	if (level)
> +		kvm_ioapic_set_irq(ioapic, e->irqchip.pin,
> +				   !!ioapic->irq_states[e->irqchip.pin]);
> +	return level;
> +}
> +
>  inline static bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq)
>  {
>  #ifdef CONFIG_IA64
> @@ -190,6 +226,45 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level)
>  	return ret;
>  }
>  
> +/*
> + * Return value:
> + *  < 0   Error
> + *  = 0   Interrupt was not set, did nothing
> + *  > 0   Interrupt was pending, cleared
> + */
> +int kvm_clear_irq(struct kvm *kvm, int irq_source_id, u32 irq)
> +{
> +	struct kvm_kernel_irq_routing_entry *e, irq_set[KVM_NR_IRQCHIPS];
> +	int ret = -EINVAL, i = 0;
> +	struct kvm_irq_routing_table *irq_rt;
> +	struct hlist_node *n;
> +
> +	/* Not possible to detect if the guest uses the PIC or the
> +	 * IOAPIC.  So clear the bit in both. The guest will ignore
> +	 * writes to the unused one.
> +	 */
> +	rcu_read_lock();
> +	irq_rt = rcu_dereference(kvm->irq_routing);
> +	if (irq < irq_rt->nr_rt_entries)
> +		hlist_for_each_entry(e, n, &irq_rt->map[irq], link)
> +			irq_set[i++] = *e;
> +	rcu_read_unlock();
> +
> +	while (i--) {
> +		int r = -EINVAL;
> +
> +		if (irq_set[i].clear)
> +			r = irq_set[i].clear(&irq_set[i], kvm, irq_source_id);
> +
> +		if (r < 0)
> +			continue;
> +
> +		ret = r + ((ret < 0) ? 0 : ret);
> +	}
> +
> +	return ret;
> +}
> +
>  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
>  {
>  	struct kvm_irq_ack_notifier *kian;
> @@ -344,16 +419,19 @@ static int setup_routing_entry(struct kvm_irq_routing_table *rt,
>  		switch (ue->u.irqchip.irqchip) {
>  		case KVM_IRQCHIP_PIC_MASTER:
>  			e->set = kvm_set_pic_irq;
> +			e->clear = kvm_clear_pic_irq;
>  			max_pin = 16;
>  			break;
>  		case KVM_IRQCHIP_PIC_SLAVE:
>  			e->set = kvm_set_pic_irq;
> +			e->clear = kvm_clear_pic_irq;
>  			max_pin = 16;
>  			delta = 8;
>  			break;
>  		case KVM_IRQCHIP_IOAPIC:
>  			max_pin = KVM_IOAPIC_NUM_PINS;
>  			e->set = kvm_set_ioapic_irq;
> +			e->clear = kvm_clear_ioapic_irq;
>  			break;
>  		default:
>  			goto out;

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-16 20:34 ` [PATCH v5 3/4] kvm: Create kvm_clear_irq() Alex Williamson
                     ` (2 preceding siblings ...)
  2012-07-17 10:14   ` Michael S. Tsirkin
@ 2012-07-17 10:18   ` Michael S. Tsirkin
  3 siblings, 0 replies; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-17 10:18 UTC (permalink / raw)
  To: Alex Williamson; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka, mtosatti

On Mon, Jul 16, 2012 at 02:34:03PM -0600, Alex Williamson wrote:
> This is an alternative to kvm_set_irq(,,,0) which returns the previous
> assertion state of the interrupt and does nothing if it isn't changed.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
> 
>  include/linux/kvm_host.h |    3 ++
>  virt/kvm/irq_comm.c      |   78 ++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 81 insertions(+)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index a7661c0..6c168f1 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -219,6 +219,8 @@ struct kvm_kernel_irq_routing_entry {
>  	u32 type;
>  	int (*set)(struct kvm_kernel_irq_routing_entry *e,
>  		   struct kvm *kvm, int irq_source_id, int level);
> +	int (*clear)(struct kvm_kernel_irq_routing_entry *e,
> +		     struct kvm *kvm, int irq_source_id);
>  	union {
>  		struct {
>  			unsigned irqchip;
> @@ -629,6 +631,7 @@ void kvm_get_intr_delivery_bitmask(struct kvm_ioapic *ioapic,
>  				   unsigned long *deliver_bitmask);
>  #endif
>  int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level);
> +int kvm_clear_irq(struct kvm *kvm, int irq_source_id, u32 irq);
>  int kvm_set_msi(struct kvm_kernel_irq_routing_entry *irq_entry, struct kvm *kvm,
>  		int irq_source_id, int level);
>  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
> diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> index 5afb431..76e8f22 100644
> --- a/virt/kvm/irq_comm.c
> +++ b/virt/kvm/irq_comm.c
> @@ -68,6 +68,42 @@ static int kvm_set_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
>  	return kvm_ioapic_set_irq(ioapic, e->irqchip.pin, level);
>  }
>  
> +static inline int kvm_clear_irq_line_state(unsigned long *irq_state,
> +					    int irq_source_id)
> +{
> +	return !!test_and_clear_bit(irq_source_id, irq_state);
> +}
> +
> +static int kvm_clear_pic_irq(struct kvm_kernel_irq_routing_entry *e,
> +			     struct kvm *kvm, int irq_source_id)
> +{
> +#ifdef CONFIG_X86
> +	struct kvm_pic *pic = pic_irqchip(kvm);
> +	int level = kvm_clear_irq_line_state(&pic->irq_states[e->irqchip.pin],
> +					     irq_source_id);
> +	if (level)
> +		kvm_pic_set_irq(pic, e->irqchip.pin,
> +				!!pic->irq_states[e->irqchip.pin]);
> +	return level;
> +#else
> +	return -1;
> +#endif
> +}
> +
> +static int kvm_clear_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
> +				struct kvm *kvm, int irq_source_id)
> +{
> +	struct kvm_ioapic *ioapic = kvm->arch.vioapic;
> +	int level;
> +
> +	level = kvm_clear_irq_line_state(&ioapic->irq_states[e->irqchip.pin],
> +					 irq_source_id);
> +	if (level)
> +		kvm_ioapic_set_irq(ioapic, e->irqchip.pin,
> +				   !!ioapic->irq_states[e->irqchip.pin]);
> +	return level;
> +}
> +
>  inline static bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq)
>  {
>  #ifdef CONFIG_IA64
> @@ -190,6 +226,45 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level)
>  	return ret;
>  }
>  
> +/*
> + * Return value:
> + *  < 0   Error
> + *  = 0   Interrupt was not set, did nothing
> + *  > 0   Interrupt was pending, cleared
> + */
> +int kvm_clear_irq(struct kvm *kvm, int irq_source_id, u32 irq)
> +{
> +	struct kvm_kernel_irq_routing_entry *e, irq_set[KVM_NR_IRQCHIPS];
> +	int ret = -EINVAL, i = 0;
> +	struct kvm_irq_routing_table *irq_rt;
> +	struct hlist_node *n;
> +
> +	/* Not possible to detect if the guest uses the PIC or the
> +	 * IOAPIC.  So clear the bit in both. The guest will ignore
> +	 * writes to the unused one.
> +	 */
> +	rcu_read_lock();
> +	irq_rt = rcu_dereference(kvm->irq_routing);
> +	if (irq < irq_rt->nr_rt_entries)
> +		hlist_for_each_entry(e, n, &irq_rt->map[irq], link)
> +			irq_set[i++] = *e;
> +	rcu_read_unlock();
> +
> +	while (i--) {
> +		int r = -EINVAL;
> +
> +		if (irq_set[i].clear)

I would normally suggest if (likely()) here but recently Marcelo
started pushing back against these tags. Maybe add in a separate patch
so it's easier to ignore ...

> +			r = irq_set[i].clear(&irq_set[i], kvm, irq_source_id);
> +
> +		if (r < 0)
> +			continue;
> +
> +		ret = r + ((ret < 0) ? 0 : ret);
> +	}
> +
> +	return ret;
> +}
> +
>  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
>  {
>  	struct kvm_irq_ack_notifier *kian;
> @@ -344,16 +419,19 @@ static int setup_routing_entry(struct kvm_irq_routing_table *rt,
>  		switch (ue->u.irqchip.irqchip) {
>  		case KVM_IRQCHIP_PIC_MASTER:
>  			e->set = kvm_set_pic_irq;
> +			e->clear = kvm_clear_pic_irq;
>  			max_pin = 16;
>  			break;
>  		case KVM_IRQCHIP_PIC_SLAVE:
>  			e->set = kvm_set_pic_irq;
> +			e->clear = kvm_clear_pic_irq;
>  			max_pin = 16;
>  			delta = 8;
>  			break;
>  		case KVM_IRQCHIP_IOAPIC:
>  			max_pin = KVM_IOAPIC_NUM_PINS;
>  			e->set = kvm_set_ioapic_irq;
> +			e->clear = kvm_clear_ioapic_irq;
>  			break;
>  		default:
>  			goto out;

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 2/4] kvm: KVM_EOIFD, an eventfd for EOIs
  2012-07-16 20:33 ` [PATCH v5 2/4] kvm: KVM_EOIFD, an eventfd for EOIs Alex Williamson
@ 2012-07-17 10:21   ` Michael S. Tsirkin
  2012-07-17 13:59     ` Alex Williamson
  0 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-17 10:21 UTC (permalink / raw)
  To: Alex Williamson; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Mon, Jul 16, 2012 at 02:33:55PM -0600, Alex Williamson wrote:
> +	if (args->flags & KVM_EOIFD_FLAG_LEVEL_IRQFD) {
> +		struct _irqfd *irqfd = _irqfd_fdget_lock(kvm, args->irqfd);
> +		if (IS_ERR(irqfd)) {
> +			ret = PTR_ERR(irqfd);
> +			goto fail;
> +		}
> +
> +		gsi = irqfd->gsi;
> +		level_irqfd = eventfd_ctx_get(irqfd->eventfd);
> +		source = _irq_source_get(irqfd->source);
> +		_irqfd_put_unlock(irqfd);
> +		if (!source) {
> +			ret = -EINVAL;
> +			goto fail;
> +		}
> +	} else {
> +		ret = -EINVAL;
> +		goto fail;
> +	}
> +
> +	INIT_LIST_HEAD(&eoifd->list);
> +	eoifd->kvm = kvm;
> +	eoifd->eventfd = eventfd;
> +	eoifd->source = source;
> +	eoifd->level_irqfd = level_irqfd;
> +	eoifd->notifier.gsi = gsi;
> +	eoifd->notifier.irq_acked = eoifd_event;

OK so this means eoifd keeps a reference to the irqfd.
And since this is the case, can't we drop the reference counting
around source ids now? Everything is referenced through irqfd.

-- 
MST

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-17 10:14   ` Michael S. Tsirkin
@ 2012-07-17 13:56     ` Alex Williamson
  2012-07-17 14:08       ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Alex Williamson @ 2012-07-17 13:56 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, 2012-07-17 at 13:14 +0300, Michael S. Tsirkin wrote:
> On Mon, Jul 16, 2012 at 02:34:03PM -0600, Alex Williamson wrote:
> > This is an alternative to kvm_set_irq(,,,0) which returns the previous
> > assertion state of the interrupt and does nothing if it isn't changed.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> > 
> >  include/linux/kvm_host.h |    3 ++
> >  virt/kvm/irq_comm.c      |   78 ++++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 81 insertions(+)
> > 
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index a7661c0..6c168f1 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -219,6 +219,8 @@ struct kvm_kernel_irq_routing_entry {
> >  	u32 type;
> >  	int (*set)(struct kvm_kernel_irq_routing_entry *e,
> >  		   struct kvm *kvm, int irq_source_id, int level);
> > +	int (*clear)(struct kvm_kernel_irq_routing_entry *e,
> > +		     struct kvm *kvm, int irq_source_id);
> >  	union {
> >  		struct {
> >  			unsigned irqchip;
> > @@ -629,6 +631,7 @@ void kvm_get_intr_delivery_bitmask(struct kvm_ioapic *ioapic,
> >  				   unsigned long *deliver_bitmask);
> >  #endif
> >  int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level);
> > +int kvm_clear_irq(struct kvm *kvm, int irq_source_id, u32 irq);
> >  int kvm_set_msi(struct kvm_kernel_irq_routing_entry *irq_entry, struct kvm *kvm,
> >  		int irq_source_id, int level);
> >  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
> > diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> > index 5afb431..76e8f22 100644
> > --- a/virt/kvm/irq_comm.c
> > +++ b/virt/kvm/irq_comm.c
> > @@ -68,6 +68,42 @@ static int kvm_set_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
> >  	return kvm_ioapic_set_irq(ioapic, e->irqchip.pin, level);
> >  }
> >  
> > +static inline int kvm_clear_irq_line_state(unsigned long *irq_state,
> > +					    int irq_source_id)
> > +{
> > +	return !!test_and_clear_bit(irq_source_id, irq_state);
> > +}
> > +
> > +static int kvm_clear_pic_irq(struct kvm_kernel_irq_routing_entry *e,
> > +			     struct kvm *kvm, int irq_source_id)
> > +{
> > +#ifdef CONFIG_X86
> > +	struct kvm_pic *pic = pic_irqchip(kvm);
> > +	int level = kvm_clear_irq_line_state(&pic->irq_states[e->irqchip.pin],
> > +					     irq_source_id);
> > +	if (level)
> > +		kvm_pic_set_irq(pic, e->irqchip.pin,
> > +				!!pic->irq_states[e->irqchip.pin]);
> > +	return level;
> 
> I think I begin to understand: if (level) checks it was previously set,
> and then we clear if needed?

It's actually very simple: if we change anything in irq_states, we then
update via the chip-specific set_irq function.

>  I think it's worthwhile to rename
> level to orig_level and rewrite as:
> 
> 	if (orig_level && !pic->irq_states[e->irqchip.pin])
> 		kvm_pic_set_irq(pic, e->irqchip.pin, 0);
> 
> This both makes the logic clear without need for comments and
> saves some cycles on pic in case nothing actually changed.

That may work, but it's not actually the same thing.  kvm_set_irq(,,,0)
will clear the bit and call kvm_pic_set_irq with the new irq_states
value, whether it's 0 or 1.  The optimization I make is to only call
kvm_pic_set_irq if we've "changed" irq_states.  You're taking that one
step further to "changed and is now 0".  I don't know if that's correct
behavior.

> > +#else
> > +	return -1;
> > +#endif
> > +}
> > +
> > +static int kvm_clear_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
> > +				struct kvm *kvm, int irq_source_id)
> > +{
> > +	struct kvm_ioapic *ioapic = kvm->arch.vioapic;
> > +	int level;
> > +
> > +	level = kvm_clear_irq_line_state(&ioapic->irq_states[e->irqchip.pin],
> > +					 irq_source_id);
> > +	if (level)
> > +		kvm_ioapic_set_irq(ioapic, e->irqchip.pin,
> > +				   !!ioapic->irq_states[e->irqchip.pin]);
> > +	return level;
> > +}
> > +
> >  inline static bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq)
> >  {
> >  #ifdef CONFIG_IA64
> > @@ -190,6 +226,45 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level)
> >  	return ret;
> >  }
> >  
> > +/*
> > + * Return value:
> > + *  < 0   Error
> > + *  = 0   Interrupt was not set, did nothing
> > + *  > 0   Interrupt was pending, cleared
> > + */
> > +int kvm_clear_irq(struct kvm *kvm, int irq_source_id, u32 irq)
> > +{
> > +	struct kvm_kernel_irq_routing_entry *e, irq_set[KVM_NR_IRQCHIPS];
> > +	int ret = -EINVAL, i = 0;
> > +	struct kvm_irq_routing_table *irq_rt;
> > +	struct hlist_node *n;
> > +
> > +	/* Not possible to detect if the guest uses the PIC or the
> > +	 * IOAPIC.  So clear the bit in both. The guest will ignore
> > +	 * writes to the unused one.
> > +	 */
> > +	rcu_read_lock();
> > +	irq_rt = rcu_dereference(kvm->irq_routing);
> > +	if (irq < irq_rt->nr_rt_entries)
> > +		hlist_for_each_entry(e, n, &irq_rt->map[irq], link)
> > +			irq_set[i++] = *e;
> > +	rcu_read_unlock();
> > +
> > +	while (i--) {
> > +		int r = -EINVAL;
> > +
> > +		if (irq_set[i].clear)
> > +			r = irq_set[i].clear(&irq_set[i], kvm, irq_source_id);
> > +
> > +		if (r < 0)
> > +			continue;
> > +
> > +		ret = r + ((ret < 0) ? 0 : ret);
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> >  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
> >  {
> >  	struct kvm_irq_ack_notifier *kian;
> > @@ -344,16 +419,19 @@ static int setup_routing_entry(struct kvm_irq_routing_table *rt,
> >  		switch (ue->u.irqchip.irqchip) {
> >  		case KVM_IRQCHIP_PIC_MASTER:
> >  			e->set = kvm_set_pic_irq;
> > +			e->clear = kvm_clear_pic_irq;
> >  			max_pin = 16;
> >  			break;
> >  		case KVM_IRQCHIP_PIC_SLAVE:
> >  			e->set = kvm_set_pic_irq;
> > +			e->clear = kvm_clear_pic_irq;
> >  			max_pin = 16;
> >  			delta = 8;
> >  			break;
> >  		case KVM_IRQCHIP_IOAPIC:
> >  			max_pin = KVM_IOAPIC_NUM_PINS;
> >  			e->set = kvm_set_ioapic_irq;
> > +			e->clear = kvm_clear_ioapic_irq;
> >  			break;
> >  		default:
> >  			goto out;
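
The return-value convention documented above kvm_clear_irq() can be
modelled in isolation.  This is a minimal userspace sketch, not kernel
code: combine_clear_results() is a hypothetical name, and the array of
per-chip results stands in for the values returned by each routing
entry's ->clear() callback.  It only reproduces the accumulation done
in the while (i--) loop of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical helper: fold per-chip ->clear() results the same way the
 * posted kvm_clear_irq() loop does.  Negative results are skipped; the
 * first non-negative result replaces the initial -EINVAL, and later ones
 * are added to it.
 */
static int combine_clear_results(const int *results, size_t n)
{
	int ret = -EINVAL;

	while (n--) {
		int r = results[n];

		if (r < 0)
			continue;

		ret = r + ((ret < 0) ? 0 : ret);
	}

	return ret;
}
```

With this model, a result set where every chip fails stays at -EINVAL,
a set where at least one chip reports "nothing pending" yields 0, and
the total is the number of chips that actually had the source asserted.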




^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 2/4] kvm: KVM_EOIFD, an eventfd for EOIs
  2012-07-17 10:21   ` Michael S. Tsirkin
@ 2012-07-17 13:59     ` Alex Williamson
  2012-07-17 14:10       ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Alex Williamson @ 2012-07-17 13:59 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, 2012-07-17 at 13:21 +0300, Michael S. Tsirkin wrote:
> On Mon, Jul 16, 2012 at 02:33:55PM -0600, Alex Williamson wrote:
> > +	if (args->flags & KVM_EOIFD_FLAG_LEVEL_IRQFD) {
> > +		struct _irqfd *irqfd = _irqfd_fdget_lock(kvm, args->irqfd);
> > +		if (IS_ERR(irqfd)) {
> > +			ret = PTR_ERR(irqfd);
> > +			goto fail;
> > +		}
> > +
> > +		gsi = irqfd->gsi;
> > +		level_irqfd = eventfd_ctx_get(irqfd->eventfd);
> > +		source = _irq_source_get(irqfd->source);
> > +		_irqfd_put_unlock(irqfd);
> > +		if (!source) {
> > +			ret = -EINVAL;
> > +			goto fail;
> > +		}
> > +	} else {
> > +		ret = -EINVAL;
> > +		goto fail;
> > +	}
> > +
> > +	INIT_LIST_HEAD(&eoifd->list);
> > +	eoifd->kvm = kvm;
> > +	eoifd->eventfd = eventfd;
> > +	eoifd->source = source;
> > +	eoifd->level_irqfd = level_irqfd;
> > +	eoifd->notifier.gsi = gsi;
> > +	eoifd->notifier.irq_acked = eoifd_event;
> 
> OK so this means eoifd keeps a reference to the irqfd.
> And since this is the case, can't we drop the reference counting
> around source ids now? Everything is referenced through irqfd.

Holding a reference and using it as a reference count are not the same
thing.  What if another module holds a reference to this eventfd?  How
do we do anything on release?


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-17 13:56     ` Alex Williamson
@ 2012-07-17 14:08       ` Michael S. Tsirkin
  2012-07-17 14:21         ` Alex Williamson
  0 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-17 14:08 UTC (permalink / raw)
  To: Alex Williamson; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, Jul 17, 2012 at 07:56:09AM -0600, Alex Williamson wrote:
> On Tue, 2012-07-17 at 13:14 +0300, Michael S. Tsirkin wrote:
> > On Mon, Jul 16, 2012 at 02:34:03PM -0600, Alex Williamson wrote:
> > > This is an alternative to kvm_set_irq(,,,0) which returns the previous
> > > assertion state of the interrupt and does nothing if it isn't changed.
> > > 
> > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > ---
> > > 
> > >  include/linux/kvm_host.h |    3 ++
> > >  virt/kvm/irq_comm.c      |   78 ++++++++++++++++++++++++++++++++++++++++++++++
> > >  2 files changed, 81 insertions(+)
> > > 
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index a7661c0..6c168f1 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -219,6 +219,8 @@ struct kvm_kernel_irq_routing_entry {
> > >  	u32 type;
> > >  	int (*set)(struct kvm_kernel_irq_routing_entry *e,
> > >  		   struct kvm *kvm, int irq_source_id, int level);
> > > +	int (*clear)(struct kvm_kernel_irq_routing_entry *e,
> > > +		     struct kvm *kvm, int irq_source_id);
> > >  	union {
> > >  		struct {
> > >  			unsigned irqchip;
> > > @@ -629,6 +631,7 @@ void kvm_get_intr_delivery_bitmask(struct kvm_ioapic *ioapic,
> > >  				   unsigned long *deliver_bitmask);
> > >  #endif
> > >  int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level);
> > > +int kvm_clear_irq(struct kvm *kvm, int irq_source_id, u32 irq);
> > >  int kvm_set_msi(struct kvm_kernel_irq_routing_entry *irq_entry, struct kvm *kvm,
> > >  		int irq_source_id, int level);
> > >  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
> > > diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> > > index 5afb431..76e8f22 100644
> > > --- a/virt/kvm/irq_comm.c
> > > +++ b/virt/kvm/irq_comm.c
> > > @@ -68,6 +68,42 @@ static int kvm_set_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
> > >  	return kvm_ioapic_set_irq(ioapic, e->irqchip.pin, level);
> > >  }
> > >  
> > > +static inline int kvm_clear_irq_line_state(unsigned long *irq_state,
> > > +					    int irq_source_id)
> > > +{
> > > +	return !!test_and_clear_bit(irq_source_id, irq_state);
> > > +}
> > > +
> > > +static int kvm_clear_pic_irq(struct kvm_kernel_irq_routing_entry *e,
> > > +			     struct kvm *kvm, int irq_source_id)
> > > +{
> > > +#ifdef CONFIG_X86
> > > +	struct kvm_pic *pic = pic_irqchip(kvm);
> > > +	int level = kvm_clear_irq_line_state(&pic->irq_states[e->irqchip.pin],
> > > +					     irq_source_id);
> > > +	if (level)
> > > +		kvm_pic_set_irq(pic, e->irqchip.pin,
> > > +				!!pic->irq_states[e->irqchip.pin]);
> > > +	return level;
> > 
> > I think I begin to understand: if (level) checks it was previously set,
> > and then we clear if needed?
> 
> It's actually very simple, if we change anything in irq_states, then
> update via the chip specific set_irq function.
> 
> >  I think it's worthwhile to rename
> > level to orig_level and rewrite as:
> > 
> > 	if (orig_level && !pic->irq_states[e->irqchip.pin])
> > 		kvm_pic_set_irq(pic, e->irqchip.pin, 0);
> > 
> > This both makes the logic clear without need for comments and
> > saves some cycles on pic in case nothing actually changed.
> 
> That may work, but it's not actually the same thing.  kvm_set_irq(,,,0)
> will clear the bit and call kvm_pic_set_irq with the new irq_states
> value, whether it's 0 or 1.  The optimization I make is to only call
> kvm_pic_set_irq if we've "changed" irq_states.  You're taking that one
> step further to "changed and is now 0".  I don't know if that's correct
> behavior.

If not then I don't understand. You clear a bit
in a word. You never change it to 1, do you?

But this brings another question:

static inline int kvm_irq_line_state(unsigned long *irq_state,
                                     int irq_source_id, int level)
{
        /* Logical OR for level trig interrupt */
        if (level)
                set_bit(irq_source_id, irq_state);
        else
                clear_bit(irq_source_id, irq_state);


^^^^^^^^^^^
above uses locked instructions

        return !!(*irq_state);


above doesn't

}


why the inconsistency?

> > > +#else
> > > +	return -1;
> > > +#endif
> > > +}
> > > +
> > > +static int kvm_clear_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
> > > +				struct kvm *kvm, int irq_source_id)
> > > +{
> > > +	struct kvm_ioapic *ioapic = kvm->arch.vioapic;
> > > +	int level;
> > > +
> > > +	level = kvm_clear_irq_line_state(&ioapic->irq_states[e->irqchip.pin],
> > > +					 irq_source_id);
> > > +	if (level)
> > > +		kvm_ioapic_set_irq(ioapic, e->irqchip.pin,
> > > +				   !!ioapic->irq_states[e->irqchip.pin]);
> > > +	return level;
> > > +}
> > > +
> > >  inline static bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq)
> > >  {
> > >  #ifdef CONFIG_IA64
> > > @@ -190,6 +226,45 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level)
> > >  	return ret;
> > >  }
> > >  
> > > +/*
> > > + * Return value:
> > > + *  < 0   Error
> > > + *  = 0   Interrupt was not set, did nothing
> > > + *  > 0   Interrupt was pending, cleared
> > > + */
> > > +int kvm_clear_irq(struct kvm *kvm, int irq_source_id, u32 irq)
> > > +{
> > > +	struct kvm_kernel_irq_routing_entry *e, irq_set[KVM_NR_IRQCHIPS];
> > > +	int ret = -EINVAL, i = 0;
> > > +	struct kvm_irq_routing_table *irq_rt;
> > > +	struct hlist_node *n;
> > > +
> > > +	/* Not possible to detect if the guest uses the PIC or the
> > > +	 * IOAPIC.  So clear the bit in both. The guest will ignore
> > > +	 * writes to the unused one.
> > > +	 */
> > > +	rcu_read_lock();
> > > +	irq_rt = rcu_dereference(kvm->irq_routing);
> > > +	if (irq < irq_rt->nr_rt_entries)
> > > +		hlist_for_each_entry(e, n, &irq_rt->map[irq], link)
> > > +			irq_set[i++] = *e;
> > > +	rcu_read_unlock();
> > > +
> > > +	while (i--) {
> > > +		int r = -EINVAL;
> > > +
> > > +		if (irq_set[i].clear)
> > > +			r = irq_set[i].clear(&irq_set[i], kvm, irq_source_id);
> > > +
> > > +		if (r < 0)
> > > +			continue;
> > > +
> > > +		ret = r + ((ret < 0) ? 0 : ret);
> > > +	}
> > > +
> > > +	return ret;
> > > +}
> > > +
> > >  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
> > >  {
> > >  	struct kvm_irq_ack_notifier *kian;
> > > @@ -344,16 +419,19 @@ static int setup_routing_entry(struct kvm_irq_routing_table *rt,
> > >  		switch (ue->u.irqchip.irqchip) {
> > >  		case KVM_IRQCHIP_PIC_MASTER:
> > >  			e->set = kvm_set_pic_irq;
> > > +			e->clear = kvm_clear_pic_irq;
> > >  			max_pin = 16;
> > >  			break;
> > >  		case KVM_IRQCHIP_PIC_SLAVE:
> > >  			e->set = kvm_set_pic_irq;
> > > +			e->clear = kvm_clear_pic_irq;
> > >  			max_pin = 16;
> > >  			delta = 8;
> > >  			break;
> > >  		case KVM_IRQCHIP_IOAPIC:
> > >  			max_pin = KVM_IOAPIC_NUM_PINS;
> > >  			e->set = kvm_set_ioapic_irq;
> > > +			e->clear = kvm_clear_ioapic_irq;
> > >  			break;
> > >  		default:
> > >  			goto out;
> 
> 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 2/4] kvm: KVM_EOIFD, an eventfd for EOIs
  2012-07-17 13:59     ` Alex Williamson
@ 2012-07-17 14:10       ` Michael S. Tsirkin
  2012-07-17 14:29         ` Alex Williamson
  0 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-17 14:10 UTC (permalink / raw)
  To: Alex Williamson; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, Jul 17, 2012 at 07:59:16AM -0600, Alex Williamson wrote:
> On Tue, 2012-07-17 at 13:21 +0300, Michael S. Tsirkin wrote:
> > On Mon, Jul 16, 2012 at 02:33:55PM -0600, Alex Williamson wrote:
> > > +	if (args->flags & KVM_EOIFD_FLAG_LEVEL_IRQFD) {
> > > +		struct _irqfd *irqfd = _irqfd_fdget_lock(kvm, args->irqfd);
> > > +		if (IS_ERR(irqfd)) {
> > > +			ret = PTR_ERR(irqfd);
> > > +			goto fail;
> > > +		}
> > > +
> > > +		gsi = irqfd->gsi;
> > > +		level_irqfd = eventfd_ctx_get(irqfd->eventfd);
> > > +		source = _irq_source_get(irqfd->source);
> > > +		_irqfd_put_unlock(irqfd);
> > > +		if (!source) {
> > > +			ret = -EINVAL;
> > > +			goto fail;
> > > +		}
> > > +	} else {
> > > +		ret = -EINVAL;
> > > +		goto fail;
> > > +	}
> > > +
> > > +	INIT_LIST_HEAD(&eoifd->list);
> > > +	eoifd->kvm = kvm;
> > > +	eoifd->eventfd = eventfd;
> > > +	eoifd->source = source;
> > > +	eoifd->level_irqfd = level_irqfd;
> > > +	eoifd->notifier.gsi = gsi;
> > > +	eoifd->notifier.irq_acked = eoifd_event;
> > 
> > OK so this means eoifd keeps a reference to the irqfd.
> > And since this is the case, can't we drop the reference counting
> > around source ids now? Everything is referenced through irqfd.
> 
> Holding a reference and using it as a reference count are not the same
> thing.  What if another module holds a reference to this eventfd?  How
> do we do anything on release?

We don't as there is no release, and using kref on source id does not
help: it just never gets invoked.

-- 
MST


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-17 14:08       ` Michael S. Tsirkin
@ 2012-07-17 14:21         ` Alex Williamson
  2012-07-17 14:53           ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Alex Williamson @ 2012-07-17 14:21 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, 2012-07-17 at 17:08 +0300, Michael S. Tsirkin wrote:
> On Tue, Jul 17, 2012 at 07:56:09AM -0600, Alex Williamson wrote:
> > On Tue, 2012-07-17 at 13:14 +0300, Michael S. Tsirkin wrote:
> > > On Mon, Jul 16, 2012 at 02:34:03PM -0600, Alex Williamson wrote:
> > > > This is an alternative to kvm_set_irq(,,,0) which returns the previous
> > > > assertion state of the interrupt and does nothing if it isn't changed.
> > > > 
> > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > > ---
> > > > 
> > > >  include/linux/kvm_host.h |    3 ++
> > > >  virt/kvm/irq_comm.c      |   78 ++++++++++++++++++++++++++++++++++++++++++++++
> > > >  2 files changed, 81 insertions(+)
> > > > 
> > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > index a7661c0..6c168f1 100644
> > > > --- a/include/linux/kvm_host.h
> > > > +++ b/include/linux/kvm_host.h
> > > > @@ -219,6 +219,8 @@ struct kvm_kernel_irq_routing_entry {
> > > >  	u32 type;
> > > >  	int (*set)(struct kvm_kernel_irq_routing_entry *e,
> > > >  		   struct kvm *kvm, int irq_source_id, int level);
> > > > +	int (*clear)(struct kvm_kernel_irq_routing_entry *e,
> > > > +		     struct kvm *kvm, int irq_source_id);
> > > >  	union {
> > > >  		struct {
> > > >  			unsigned irqchip;
> > > > @@ -629,6 +631,7 @@ void kvm_get_intr_delivery_bitmask(struct kvm_ioapic *ioapic,
> > > >  				   unsigned long *deliver_bitmask);
> > > >  #endif
> > > >  int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level);
> > > > +int kvm_clear_irq(struct kvm *kvm, int irq_source_id, u32 irq);
> > > >  int kvm_set_msi(struct kvm_kernel_irq_routing_entry *irq_entry, struct kvm *kvm,
> > > >  		int irq_source_id, int level);
> > > >  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
> > > > diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> > > > index 5afb431..76e8f22 100644
> > > > --- a/virt/kvm/irq_comm.c
> > > > +++ b/virt/kvm/irq_comm.c
> > > > @@ -68,6 +68,42 @@ static int kvm_set_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
> > > >  	return kvm_ioapic_set_irq(ioapic, e->irqchip.pin, level);
> > > >  }
> > > >  
> > > > +static inline int kvm_clear_irq_line_state(unsigned long *irq_state,
> > > > +					    int irq_source_id)
> > > > +{
> > > > +	return !!test_and_clear_bit(irq_source_id, irq_state);
> > > > +}
> > > > +
> > > > +static int kvm_clear_pic_irq(struct kvm_kernel_irq_routing_entry *e,
> > > > +			     struct kvm *kvm, int irq_source_id)
> > > > +{
> > > > +#ifdef CONFIG_X86
> > > > +	struct kvm_pic *pic = pic_irqchip(kvm);
> > > > +	int level = kvm_clear_irq_line_state(&pic->irq_states[e->irqchip.pin],
> > > > +					     irq_source_id);
> > > > +	if (level)
> > > > +		kvm_pic_set_irq(pic, e->irqchip.pin,
> > > > +				!!pic->irq_states[e->irqchip.pin]);
> > > > +	return level;
> > > 
> > > I think I begin to understand: if (level) checks it was previously set,
> > > and then we clear if needed?
> > 
> > It's actually very simple, if we change anything in irq_states, then
> > update via the chip specific set_irq function.
> > 
> > >  I think it's worthwhile to rename
> > > level to orig_level and rewrite as:
> > > 
> > > 	if (orig_level && !pic->irq_states[e->irqchip.pin])
> > > 		kvm_pic_set_irq(pic, e->irqchip.pin, 0);
> > > 
> > > This both makes the logic clear without need for comments and
> > > saves some cycles on pic in case nothing actually changed.
> > 
> > That may work, but it's not actually the same thing.  kvm_set_irq(,,,0)
> > will clear the bit and call kvm_pic_set_irq with the new irq_states
> > value, whether it's 0 or 1.  The optimization I make is to only call
> > kvm_pic_set_irq if we've "changed" irq_states.  You're taking that one
> > step further to "changed and is now 0".  I don't know if that's correct
> > behavior.
> 
> If not then I don't understand. You clear a bit
> in a word. You never change it to 1, do you?

Correct, but kvm_set_irq(,,,0) may call kvm_pic_set_irq(,,1) if other
source IDs are still asserting the interrupt.  Your proposal assumes
that unless irq_states is also 0 we don't need to call kvm_pic_set_irq,
and I don't know if that's correct.
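
To make the three behaviours under discussion concrete, here is a
minimal userspace model (not kernel code, and not atomic).  All names
(struct pin_model, model_set_irq, model_clear_irq,
model_clear_irq_if_low) are hypothetical; chip_set_irq() merely counts
calls and stands in for kvm_pic_set_irq()/kvm_ioapic_set_irq().

```c
#include <assert.h>

/* One pin: a word of per-source assertion bits plus call bookkeeping. */
struct pin_model {
	unsigned long irq_state;	/* bit N set => source N asserts the line */
	int chip_updates;		/* number of chip_set_irq() calls so far */
	int last_level;			/* level passed on the most recent call */
};

/* Stand-in for the chip-specific set_irq function. */
static void chip_set_irq(struct pin_model *p, int level)
{
	p->chip_updates++;
	p->last_level = level;
}

/* Like kvm_set_irq(,,,level): always push the OR of all sources to the chip. */
static int model_set_irq(struct pin_model *p, int source, int level)
{
	if (level)
		p->irq_state |= 1UL << source;
	else
		p->irq_state &= ~(1UL << source);

	chip_set_irq(p, !!p->irq_state);
	return !!p->irq_state;
}

/* Like the posted patch: update the chip only if this source's bit changed. */
static int model_clear_irq(struct pin_model *p, int source)
{
	unsigned long mask = 1UL << source;
	int was_set = !!(p->irq_state & mask);

	p->irq_state &= ~mask;
	if (was_set)
		chip_set_irq(p, !!p->irq_state);
	return was_set;
}

/* Like the suggested rewrite: update only if the line dropped to 0. */
static int model_clear_irq_if_low(struct pin_model *p, int source)
{
	unsigned long mask = 1UL << source;
	int was_set = !!(p->irq_state & mask);

	p->irq_state &= ~mask;
	if (was_set && !p->irq_state)
		chip_set_irq(p, 0);
	return was_set;
}
```

With two sources asserting the same pin, clearing one of them makes
model_clear_irq() call the chip with level 1, while
model_clear_irq_if_low() makes no call at all.  Whether the chip needs
that level-1 call is exactly the open question.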

> 
> But this brings another question:
> 
> static inline int kvm_irq_line_state(unsigned long *irq_state,
>                                      int irq_source_id, int level)
> {
>         /* Logical OR for level trig interrupt */
>         if (level)
>                 set_bit(irq_source_id, irq_state);
>         else
>                 clear_bit(irq_source_id, irq_state);
> 
> 
> ^^^^^^^^^^^
> above uses locked instructions
> 
>         return !!(*irq_state);
> 
> 
> above doesn't
> 
> }
> 
> 
> why the inconsistency?

Note that set/clear_bit are not locked instructions, but atomic
instructions and it could be argued that reading the value is also
atomic.  At least that was my guess when I stumbled across the same
yesterday.  IMHO, we're going off into the weeds again with these last
two patches.  It may be a valid optimization, but it really has no
bearing on the meat of the series (and afaict, no significant
performance difference either).




^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 2/4] kvm: KVM_EOIFD, an eventfd for EOIs
  2012-07-17 14:10       ` Michael S. Tsirkin
@ 2012-07-17 14:29         ` Alex Williamson
  2012-07-17 14:42           ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Alex Williamson @ 2012-07-17 14:29 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, 2012-07-17 at 17:10 +0300, Michael S. Tsirkin wrote:
> On Tue, Jul 17, 2012 at 07:59:16AM -0600, Alex Williamson wrote:
> > On Tue, 2012-07-17 at 13:21 +0300, Michael S. Tsirkin wrote:
> > > On Mon, Jul 16, 2012 at 02:33:55PM -0600, Alex Williamson wrote:
> > > > +	if (args->flags & KVM_EOIFD_FLAG_LEVEL_IRQFD) {
> > > > +		struct _irqfd *irqfd = _irqfd_fdget_lock(kvm, args->irqfd);
> > > > +		if (IS_ERR(irqfd)) {
> > > > +			ret = PTR_ERR(irqfd);
> > > > +			goto fail;
> > > > +		}
> > > > +
> > > > +		gsi = irqfd->gsi;
> > > > +		level_irqfd = eventfd_ctx_get(irqfd->eventfd);
> > > > +		source = _irq_source_get(irqfd->source);
> > > > +		_irqfd_put_unlock(irqfd);
> > > > +		if (!source) {
> > > > +			ret = -EINVAL;
> > > > +			goto fail;
> > > > +		}
> > > > +	} else {
> > > > +		ret = -EINVAL;
> > > > +		goto fail;
> > > > +	}
> > > > +
> > > > +	INIT_LIST_HEAD(&eoifd->list);
> > > > +	eoifd->kvm = kvm;
> > > > +	eoifd->eventfd = eventfd;
> > > > +	eoifd->source = source;
> > > > +	eoifd->level_irqfd = level_irqfd;
> > > > +	eoifd->notifier.gsi = gsi;
> > > > +	eoifd->notifier.irq_acked = eoifd_event;
> > > 
> > > OK so this means eoifd keeps a reference to the irqfd.
> > > And since this is the case, can't we drop the reference counting
> > > around source ids now? Everything is referenced through irqfd.
> > 
> > Holding a reference and using it as a reference count are not the same
> > thing.  What if another module holds a reference to this eventfd?  How
> > do we do anything on release?
> 
> We don't as there is no release, and using kref on source id does not
> help: it just never gets invoked.

Please work out how you think it should work and let me know, I don't
see it.  We have an irq source id that needs to be allocated by irqfd
and returned when it's unused.  It becomes unused when neither irqfd nor
eoifd are making use of it.  irqfd and eoifd may be closed in any order.
Use of the source id is what we're reference counting, which is why it's
in struct _irq_source.  How can I use an eventfd reference for the same?
I don't know when it's unused.  I don't know who else holds a reference
to it...  Doesn't make sense to me.  Feels like attempting to squat on
someone else's object.
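
The lifetime rule described above (the source id is handed back only
once neither irqfd nor eoifd uses it, whichever closes first) can be
sketched in userspace as follows.  This is an illustration under
assumptions: struct irq_source_model and the source_*() helpers are
hypothetical names, the counter is a plain int rather than a kref, and
"releasing the id" is modelled by setting a flag.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the series' struct _irq_source. */
struct irq_source_model {
	int id;			/* the irq source id being tracked */
	int refs;		/* users of the id (irqfd, eoifd) */
	int *released;		/* set to 1 when the id is handed back */
};

/* Allocate the source with one reference, held by the creator (irqfd). */
static struct irq_source_model *source_alloc(int id, int *released)
{
	struct irq_source_model *s = malloc(sizeof(*s));

	if (!s)
		return NULL;

	s->id = id;
	s->refs = 1;
	s->released = released;
	*released = 0;
	return s;
}

/* Take an additional reference, e.g. when an eoifd binds to the irqfd. */
static struct irq_source_model *source_get(struct irq_source_model *s)
{
	s->refs++;
	return s;
}

/* Drop a reference; the last user to leave returns the id. */
static void source_put(struct irq_source_model *s)
{
	if (--s->refs == 0) {
		*s->released = 1;
		free(s);
	}
}
```

The order of the two source_put() calls does not matter: the id is
released by whichever user goes away last.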




^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 2/4] kvm: KVM_EOIFD, an eventfd for EOIs
  2012-07-17 14:29         ` Alex Williamson
@ 2012-07-17 14:42           ` Michael S. Tsirkin
  2012-07-17 14:57             ` Alex Williamson
  0 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-17 14:42 UTC (permalink / raw)
  To: Alex Williamson; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, Jul 17, 2012 at 08:29:43AM -0600, Alex Williamson wrote:
> On Tue, 2012-07-17 at 17:10 +0300, Michael S. Tsirkin wrote:
> > On Tue, Jul 17, 2012 at 07:59:16AM -0600, Alex Williamson wrote:
> > > On Tue, 2012-07-17 at 13:21 +0300, Michael S. Tsirkin wrote:
> > > > On Mon, Jul 16, 2012 at 02:33:55PM -0600, Alex Williamson wrote:
> > > > > +	if (args->flags & KVM_EOIFD_FLAG_LEVEL_IRQFD) {
> > > > > +		struct _irqfd *irqfd = _irqfd_fdget_lock(kvm, args->irqfd);
> > > > > +		if (IS_ERR(irqfd)) {
> > > > > +			ret = PTR_ERR(irqfd);
> > > > > +			goto fail;
> > > > > +		}
> > > > > +
> > > > > +		gsi = irqfd->gsi;
> > > > > +		level_irqfd = eventfd_ctx_get(irqfd->eventfd);
> > > > > +		source = _irq_source_get(irqfd->source);
> > > > > +		_irqfd_put_unlock(irqfd);
> > > > > +		if (!source) {
> > > > > +			ret = -EINVAL;
> > > > > +			goto fail;
> > > > > +		}
> > > > > +	} else {
> > > > > +		ret = -EINVAL;
> > > > > +		goto fail;
> > > > > +	}
> > > > > +
> > > > > +	INIT_LIST_HEAD(&eoifd->list);
> > > > > +	eoifd->kvm = kvm;
> > > > > +	eoifd->eventfd = eventfd;
> > > > > +	eoifd->source = source;
> > > > > +	eoifd->level_irqfd = level_irqfd;
> > > > > +	eoifd->notifier.gsi = gsi;
> > > > > +	eoifd->notifier.irq_acked = eoifd_event;
> > > > 
> > > > OK so this means eoifd keeps a reference to the irqfd.
> > > > And since this is the case, can't we drop the reference counting
> > > > around source ids now? Everything is referenced through irqfd.
> > > 
> > > Holding a reference and using it as a reference count are not the same
> > > thing.  What if another module holds a reference to this eventfd?  How
> > > do we do anything on release?
> > 
> > We don't as there is no release, and using kref on source id does not
> > help: it just never gets invoked.
> 
> Please work out how you think it should work and let me know, I don't
> see it.  We have an irq source id that needs to be allocated by irqfd
> and returned when it's unused.  It becomes unused when neither irqfd nor
> eoifd are making use of it.  irqfd and eoifd may be closed in any order.
> Use of the source id is what we're reference counting, which is why it's
> in struct _irq_source.  How can I use an eventfd reference for the same?
> I don't know when it's unused.  I don't know who else holds a reference
> to it...  Doesn't make sense to me.  Feels like attempting to squat on
> someone else's object.
> 
> 

eoifd should prevent irqfd from being released.  It already keeps
a reference to it so it prevents irqfd from going away by userspace
closing the fd.  But it can still be released with deassign.
An easy solution is to fail deassign of irqfd if there is
eoifd bound to it.
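
As a rough illustration of that suggestion, a userspace model might
look like the following.  Everything here is an assumption made for the
sketch: struct irqfd_model, the helper names, and the choice of -EBUSY
and -ENOENT as error codes are hypothetical and not taken from the
series.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model of an irqfd that tracks eoifds bound to it. */
struct irqfd_model {
	int assigned;		/* non-zero while the irqfd is active */
	int bound_eoifds;	/* number of eoifds currently bound */
};

/* Bind an eoifd; only possible while the irqfd is assigned. */
static int eoifd_bind(struct irqfd_model *irqfd)
{
	if (!irqfd->assigned)
		return -ENOENT;

	irqfd->bound_eoifds++;
	return 0;
}

/* Unbind an eoifd that was previously bound. */
static void eoifd_unbind(struct irqfd_model *irqfd)
{
	irqfd->bound_eoifds--;
}

/* Deassign fails while any eoifd still depends on this irqfd. */
static int irqfd_deassign(struct irqfd_model *irqfd)
{
	if (!irqfd->assigned)
		return -ENOENT;
	if (irqfd->bound_eoifds > 0)
		return -EBUSY;

	irqfd->assigned = 0;
	return 0;
}
```

In this model userspace has to tear down the eoifd first; only then
does deassigning the irqfd succeed.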

-- 
MST

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-17 14:21         ` Alex Williamson
@ 2012-07-17 14:53           ` Michael S. Tsirkin
  2012-07-17 15:20             ` Alex Williamson
  0 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-17 14:53 UTC (permalink / raw)
  To: Alex Williamson; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, Jul 17, 2012 at 08:21:51AM -0600, Alex Williamson wrote:
> On Tue, 2012-07-17 at 17:08 +0300, Michael S. Tsirkin wrote:
> > On Tue, Jul 17, 2012 at 07:56:09AM -0600, Alex Williamson wrote:
> > > On Tue, 2012-07-17 at 13:14 +0300, Michael S. Tsirkin wrote:
> > > > On Mon, Jul 16, 2012 at 02:34:03PM -0600, Alex Williamson wrote:
> > > > > This is an alternative to kvm_set_irq(,,,0) which returns the previous
> > > > > assertion state of the interrupt and does nothing if it isn't changed.
> > > > > 
> > > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > > > ---
> > > > > 
> > > > >  include/linux/kvm_host.h |    3 ++
> > > > >  virt/kvm/irq_comm.c      |   78 ++++++++++++++++++++++++++++++++++++++++++++++
> > > > >  2 files changed, 81 insertions(+)
> > > > > 
> > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > > index a7661c0..6c168f1 100644
> > > > > --- a/include/linux/kvm_host.h
> > > > > +++ b/include/linux/kvm_host.h
> > > > > @@ -219,6 +219,8 @@ struct kvm_kernel_irq_routing_entry {
> > > > >  	u32 type;
> > > > >  	int (*set)(struct kvm_kernel_irq_routing_entry *e,
> > > > >  		   struct kvm *kvm, int irq_source_id, int level);
> > > > > +	int (*clear)(struct kvm_kernel_irq_routing_entry *e,
> > > > > +		     struct kvm *kvm, int irq_source_id);
> > > > >  	union {
> > > > >  		struct {
> > > > >  			unsigned irqchip;
> > > > > @@ -629,6 +631,7 @@ void kvm_get_intr_delivery_bitmask(struct kvm_ioapic *ioapic,
> > > > >  				   unsigned long *deliver_bitmask);
> > > > >  #endif
> > > > >  int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level);
> > > > > +int kvm_clear_irq(struct kvm *kvm, int irq_source_id, u32 irq);
> > > > >  int kvm_set_msi(struct kvm_kernel_irq_routing_entry *irq_entry, struct kvm *kvm,
> > > > >  		int irq_source_id, int level);
> > > > >  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
> > > > > diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> > > > > index 5afb431..76e8f22 100644
> > > > > --- a/virt/kvm/irq_comm.c
> > > > > +++ b/virt/kvm/irq_comm.c
> > > > > @@ -68,6 +68,42 @@ static int kvm_set_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
> > > > >  	return kvm_ioapic_set_irq(ioapic, e->irqchip.pin, level);
> > > > >  }
> > > > >  
> > > > > +static inline int kvm_clear_irq_line_state(unsigned long *irq_state,
> > > > > +					    int irq_source_id)
> > > > > +{
> > > > > +	return !!test_and_clear_bit(irq_source_id, irq_state);
> > > > > +}
> > > > > +
> > > > > +static int kvm_clear_pic_irq(struct kvm_kernel_irq_routing_entry *e,
> > > > > +			     struct kvm *kvm, int irq_source_id)
> > > > > +{
> > > > > +#ifdef CONFIG_X86
> > > > > +	struct kvm_pic *pic = pic_irqchip(kvm);
> > > > > +	int level = kvm_clear_irq_line_state(&pic->irq_states[e->irqchip.pin],
> > > > > +					     irq_source_id);
> > > > > +	if (level)
> > > > > +		kvm_pic_set_irq(pic, e->irqchip.pin,
> > > > > +				!!pic->irq_states[e->irqchip.pin]);
> > > > > +	return level;
> > > > 
> > > > I think I begin to understand: if (level) checks it was previously set,
> > > > and then we clear if needed?
> > > 
> > > It's actually very simple, if we change anything in irq_states, then
> > > update via the chip specific set_irq function.
> > > 
> > > >  I think it's worthwhile to rename
> > > > level to orig_level and rewrite as:
> > > > 
> > > > 	if (orig_level && !pic->irq_states[e->irqchip.pin])
> > > > 		kvm_pic_set_irq(pic, e->irqchip.pin, 0);
> > > > 
> > > > This both makes the logic clear without need for comments and
> > > > saves some cycles on pic in case nothing actually changed.
> > > 
> > > That may work, but it's not actually the same thing.  kvm_set_irq(,,,0)
> > > will clear the bit and call kvm_pic_set_irq with the new irq_states
> > > value, whether it's 0 or 1.  The optimization I make is to only call
> > > kvm_pic_set_irq if we've "changed" irq_states.  You're taking that one
> > > step further to "changed and is now 0".  I don't know if that's correct
> > > behavior.
> > 
> > If not then I don't understand. You clear a bit
> > in a word. You never change it to 1, do you?
> 
> Correct, but kvm_set_irq(,,,0) may call kvm_pic_set_irq(,,1) if other
> source IDs are still asserting the interrupt.  Your proposal assumes
> that unless irq_states is also 0 we don't need to call kvm_pic_set_irq,
> and I don't know if that's correct.

Well you are asked to clear some id and level was 1. So we know
interrupt was asserted. Either we clear it or we don't. No?

> > 
> > But this brings another question:
> > 
> > static inline int kvm_irq_line_state(unsigned long *irq_state,
> >                                      int irq_source_id, int level)
> > {
> >         /* Logical OR for level trig interrupt */
> >         if (level)
> >                 set_bit(irq_source_id, irq_state);
> >         else
> >                 clear_bit(irq_source_id, irq_state);
> > 
> > 
> > ^^^^^^^^^^^
> > above uses locked instructions
> > 
> >         return !!(*irq_state);
> > 
> > 
> > above doesn't
> > 
> > }
> > 
> > 
> > why the inconsistency?
> 
> Note that set/clear_bit are not locked instructions,

On x86 they are:
static __always_inline void
set_bit(unsigned int nr, volatile unsigned long *addr)
{
        if (IS_IMMEDIATE(nr)) {
                asm volatile(LOCK_PREFIX "orb %1,%0"
                        : CONST_MASK_ADDR(nr, addr)
                        : "iq" ((u8)CONST_MASK(nr))
                        : "memory");
        } else {
                asm volatile(LOCK_PREFIX "bts %1,%0"
                        : BITOP_ADDR(addr) : "Ir" (nr) : "memory");
        }
}

> but atomic
> instructions and it could be argued that reading the value is also
> atomic.  At least that was my guess when I stumbled across the same
> yesterday.  IMHO, we're going off into the weeds again with these last
> two patches.  It may be a valid optimization, but it really has no
> bearing on the meat of the series (and afaict, no significant
> performance difference either).

For me it's not a performance thing. IMO the code is cleaner without this
locking: we add a lock but only use it in some cases, so the rules become
really complex.  And the current code looks buggy; if it is, we need to fix
it somehow.
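
To illustrate the distinction being argued about, here is a userspace
sketch using C11 <stdatomic.h> as a stand-in for the kernel bitops.
The model_*() names are hypothetical.  The point it shows: the bit
update is an atomic read-modify-write, while reading the whole word
afterwards is a separate step, so a test-and-clear that reports the old
bit from the same operation and a later load of the word answer
different questions.

```c
#include <assert.h>
#include <stdatomic.h>

/* Atomic read-modify-write: assert one source's bit. */
static void model_set_bit(int nr, atomic_ulong *word)
{
	atomic_fetch_or(word, 1UL << nr);
}

/* Atomic read-modify-write: deassert one source's bit. */
static void model_clear_bit(int nr, atomic_ulong *word)
{
	atomic_fetch_and(word, ~(1UL << nr));
}

/* Clear the bit and report whether it was set, in one atomic step. */
static int model_test_and_clear_bit(int nr, atomic_ulong *word)
{
	unsigned long mask = 1UL << nr;
	unsigned long old = atomic_fetch_and(word, ~mask);

	return (old & mask) != 0;
}

/* Separate load of the whole word: is any source still asserting? */
static int model_line_state(atomic_ulong *word)
{
	return atomic_load(word) != 0;
}
```

Between model_clear_bit() and model_line_state() another thread may
change the word, so the value returned describes the line at the time
of the load, not at the time of the clear.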

-- 
MST

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 2/4] kvm: KVM_EOIFD, an eventfd for EOIs
  2012-07-17 14:42           ` Michael S. Tsirkin
@ 2012-07-17 14:57             ` Alex Williamson
  2012-07-17 15:13               ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Alex Williamson @ 2012-07-17 14:57 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, 2012-07-17 at 17:42 +0300, Michael S. Tsirkin wrote:
> On Tue, Jul 17, 2012 at 08:29:43AM -0600, Alex Williamson wrote:
> > On Tue, 2012-07-17 at 17:10 +0300, Michael S. Tsirkin wrote:
> > > On Tue, Jul 17, 2012 at 07:59:16AM -0600, Alex Williamson wrote:
> > > > On Tue, 2012-07-17 at 13:21 +0300, Michael S. Tsirkin wrote:
> > > > > On Mon, Jul 16, 2012 at 02:33:55PM -0600, Alex Williamson wrote:
> > > > > > +	if (args->flags & KVM_EOIFD_FLAG_LEVEL_IRQFD) {
> > > > > > +		struct _irqfd *irqfd = _irqfd_fdget_lock(kvm, args->irqfd);
> > > > > > +		if (IS_ERR(irqfd)) {
> > > > > > +			ret = PTR_ERR(irqfd);
> > > > > > +			goto fail;
> > > > > > +		}
> > > > > > +
> > > > > > +		gsi = irqfd->gsi;
> > > > > > +		level_irqfd = eventfd_ctx_get(irqfd->eventfd);
> > > > > > +		source = _irq_source_get(irqfd->source);
> > > > > > +		_irqfd_put_unlock(irqfd);
> > > > > > +		if (!source) {
> > > > > > +			ret = -EINVAL;
> > > > > > +			goto fail;
> > > > > > +		}
> > > > > > +	} else {
> > > > > > +		ret = -EINVAL;
> > > > > > +		goto fail;
> > > > > > +	}
> > > > > > +
> > > > > > +	INIT_LIST_HEAD(&eoifd->list);
> > > > > > +	eoifd->kvm = kvm;
> > > > > > +	eoifd->eventfd = eventfd;
> > > > > > +	eoifd->source = source;
> > > > > > +	eoifd->level_irqfd = level_irqfd;
> > > > > > +	eoifd->notifier.gsi = gsi;
> > > > > > +	eoifd->notifier.irq_acked = eoifd_event;
> > > > > 
> > > > > OK so this means eoifd keeps a reference to the irqfd.
> > > > > And since this is the case, can't we drop the reference counting
> > > > > around source ids now? Everything is referenced through irqfd.
> > > > 
> > > > Holding a reference and using it as a reference count are not the same
> > > > thing.  What if another module holds a reference to this eventfd?  How
> > > > do we do anything on release?
> > > 
> > > We don't as there is no release, and using kref on source id does not
> > > help: it just never gets invoked.
> > 
> > Please work out how you think it should work and let me know, I don't
> > see it.  We have an irq source id that needs to be allocated by irqfd
> > and returned when it's unused.  It becomes unused when neither irqfd nor
> > eoifd are making use of it.  irqfd and eoifd may be closed in any order.
> > Use of the source id is what we're reference counting, which is why it's
> > in struct _irq_source.  How can I use an eventfd reference for the same?
> > I don't know when it's unused.  I don't know who else holds a reference
> > to it...  Doesn't make sense to me.  Feels like attempting to squat on
> > someone else's object.
> > 
> > 
> 
> eoifd should prevent irqfd from being released.

Why?  Note that this is actually quite difficult too.  We can't fail a
release, nobody checks close(3p) return.  Blocking a release is likely
to cause all sorts of problems, so what you mean is that irqfd should
linger around until there are no references to it... but that's exactly
what struct _irq_source is for: it holds the bits that we care about
references to and is released automatically when there are none.

>   It already keeps
> a reference to it so it prevents irqfd from going away by userspace
> closing the fd.

Wrong, eoifd holds a reference to the eventfd for the irqfd, so it
prevents the fd from going away, not the irqfd.

>   But it can still be released with deassign.
> An easy solution is to fail deassign of irqfd if there is
> eoifd bound to it.

I don't know why we would impose such a bizarre usage model when
reference counting on struct _irq_source seems to handle this nicely
already.
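
For illustration, a rough userspace sketch of the lifetime rule argued for
here: the irq source id is handed back only when the last user drops it,
regardless of whether the irqfd or the eoifd goes away first.  struct
irq_source, irq_source_alloc(), irq_source_get() and irq_source_put() are
hypothetical stand-ins for struct _irq_source and its helpers, and the
"released" pointer is only a hook to make the final put observable; the
kernel code would use kref and free the source id instead.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical model of struct _irq_source: the state shared by the
 * irqfd and the eoifd, plus a reference count. */
struct irq_source {
	atomic_int refs;
	int id;		/* irq source id to give back on the final put */
	int *released;	/* observation hook: set to id on release */
};

/* Allocate a source holding one reference (the irqfd's). */
static struct irq_source *irq_source_alloc(int id, int *released)
{
	struct irq_source *s = malloc(sizeof(*s));

	if (!s)
		return NULL;
	atomic_init(&s->refs, 1);
	s->id = id;
	s->released = released;
	return s;
}

/* Take an additional reference (e.g. for an eoifd). */
static struct irq_source *irq_source_get(struct irq_source *s)
{
	if (s)
		atomic_fetch_add(&s->refs, 1);
	return s;
}

/* Drop a reference; the last one out releases the source id. */
static void irq_source_put(struct irq_source *s)
{
	int old;

	if (!s)
		return;
	old = atomic_fetch_sub(&s->refs, 1);
	assert(old > 0);
	if (old == 1) {
		*s->released = s->id;	/* stand-in for freeing the id */
		free(s);
	}
}
```

Neither holder needs to know about the other or about who else holds the
eventfd; each just puts its own reference when it is torn down.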



* Re: [PATCH v5 2/4] kvm: KVM_EOIFD, an eventfd for EOIs
  2012-07-17 14:57             ` Alex Williamson
@ 2012-07-17 15:13               ` Michael S. Tsirkin
  2012-07-17 15:41                 ` Alex Williamson
  0 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-17 15:13 UTC (permalink / raw)
  To: Alex Williamson; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, Jul 17, 2012 at 08:57:04AM -0600, Alex Williamson wrote:
> On Tue, 2012-07-17 at 17:42 +0300, Michael S. Tsirkin wrote:
> > On Tue, Jul 17, 2012 at 08:29:43AM -0600, Alex Williamson wrote:
> > > On Tue, 2012-07-17 at 17:10 +0300, Michael S. Tsirkin wrote:
> > > > On Tue, Jul 17, 2012 at 07:59:16AM -0600, Alex Williamson wrote:
> > > > > On Tue, 2012-07-17 at 13:21 +0300, Michael S. Tsirkin wrote:
> > > > > > On Mon, Jul 16, 2012 at 02:33:55PM -0600, Alex Williamson wrote:
> > > > > > > +	if (args->flags & KVM_EOIFD_FLAG_LEVEL_IRQFD) {
> > > > > > > +		struct _irqfd *irqfd = _irqfd_fdget_lock(kvm, args->irqfd);
> > > > > > > +		if (IS_ERR(irqfd)) {
> > > > > > > +			ret = PTR_ERR(irqfd);
> > > > > > > +			goto fail;
> > > > > > > +		}
> > > > > > > +
> > > > > > > +		gsi = irqfd->gsi;
> > > > > > > +		level_irqfd = eventfd_ctx_get(irqfd->eventfd);
> > > > > > > +		source = _irq_source_get(irqfd->source);
> > > > > > > +		_irqfd_put_unlock(irqfd);
> > > > > > > +		if (!source) {
> > > > > > > +			ret = -EINVAL;
> > > > > > > +			goto fail;
> > > > > > > +		}
> > > > > > > +	} else {
> > > > > > > +		ret = -EINVAL;
> > > > > > > +		goto fail;
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	INIT_LIST_HEAD(&eoifd->list);
> > > > > > > +	eoifd->kvm = kvm;
> > > > > > > +	eoifd->eventfd = eventfd;
> > > > > > > +	eoifd->source = source;
> > > > > > > +	eoifd->level_irqfd = level_irqfd;
> > > > > > > +	eoifd->notifier.gsi = gsi;
> > > > > > > +	eoifd->notifier.irq_acked = eoifd_event;
> > > > > > 
> > > > > > OK so this means eoifd keeps a reference to the irqfd.
> > > > > > And since this is the case, can't we drop the reference counting
> > > > > > around source ids now? Everything is referenced through irqfd.
> > > > > 
> > > > > Holding a reference and using it as a reference count are not the same
> > > > > thing.  What if another module holds a reference to this eventfd?  How
> > > > > do we do anything on release?
> > > > 
> > > > We don't as there is no release, and using kref on source id does not
> > > > help: it just never gets invoked.
> > > 
> > > Please work out how you think it should work and let me know, I don't
> > > see it.  We have an irq source id that needs to be allocated by irqfd
> > > and returned when it's unused.  It becomes unused when neither irqfd nor
> > > eoifd are making use of it.  irqfd and eoifd may be closed in any order.
> > > Use of the source id is what we're reference counting, which is why it's
> > > in struct _irq_source.  How can I use an eventfd reference for the same?
> > > I don't know when it's unused.  I don't know who else holds a reference
> > > to it...  Doesn't make sense to me.  Feels like attempting to squat on
> > > someone else's object.
> > > 
> > > 
> > 
> > eoifd should prevent irqfd from being released.
> 
> Why?  Note that this is actually quite difficult too.  We can't fail a
> release, nobody checks close(3p) return.  Blocking a release is likely
> to cause all sorts of problems, so what you mean is that irqfd should
> linger around until there are no references to it... but that's exactly
> what struct _irq_source is for, is to hold the bits that we care about
> references to and automatically release it when there are none.

No no. You *already* prevent it. You take a reference to the eventfd
context.

> >   It already keeps
> > a reference to it so it prevents irqfd from going away by userspace
> > closing the fd.
> 
> Wrong, eoifd holds a reference to the eventfd for the irqfd, so it
> prevents the fd from going away, not the irqfd.

When the fd is not going away, an ioctl is the only other way for
it to go away.

> >   But it can still be released with deassign.
> > An easy solution is to fail deassign of irqfd if there is
> > eoifd bound to it.
> 
> I don't know why we would impose such a bizarre usage model when
> reference counting on struct _irq_source seems to handle this nicely
> already.

Well eventfd gets an irqfd. What does it mean if said irqfd gets
deassigned, and potentially assigned an unrelated interrupt?
I think what I would expect is for it to handle the new interrupt.
This is hard to implement so let us fail this.

-- 
MST


* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-17 14:53           ` Michael S. Tsirkin
@ 2012-07-17 15:20             ` Alex Williamson
  2012-07-17 15:36               ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Alex Williamson @ 2012-07-17 15:20 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, 2012-07-17 at 17:53 +0300, Michael S. Tsirkin wrote:
> On Tue, Jul 17, 2012 at 08:21:51AM -0600, Alex Williamson wrote:
> > On Tue, 2012-07-17 at 17:08 +0300, Michael S. Tsirkin wrote:
> > > On Tue, Jul 17, 2012 at 07:56:09AM -0600, Alex Williamson wrote:
> > > > On Tue, 2012-07-17 at 13:14 +0300, Michael S. Tsirkin wrote:
> > > > > On Mon, Jul 16, 2012 at 02:34:03PM -0600, Alex Williamson wrote:
> > > > > > This is an alternative to kvm_set_irq(,,,0) which returns the previous
> > > > > > assertion state of the interrupt and does nothing if it isn't changed.
> > > > > > 
> > > > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > > > > ---
> > > > > > 
> > > > > >  include/linux/kvm_host.h |    3 ++
> > > > > >  virt/kvm/irq_comm.c      |   78 ++++++++++++++++++++++++++++++++++++++++++++++
> > > > > >  2 files changed, 81 insertions(+)
> > > > > > 
> > > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > > > index a7661c0..6c168f1 100644
> > > > > > --- a/include/linux/kvm_host.h
> > > > > > +++ b/include/linux/kvm_host.h
> > > > > > @@ -219,6 +219,8 @@ struct kvm_kernel_irq_routing_entry {
> > > > > >  	u32 type;
> > > > > >  	int (*set)(struct kvm_kernel_irq_routing_entry *e,
> > > > > >  		   struct kvm *kvm, int irq_source_id, int level);
> > > > > > +	int (*clear)(struct kvm_kernel_irq_routing_entry *e,
> > > > > > +		     struct kvm *kvm, int irq_source_id);
> > > > > >  	union {
> > > > > >  		struct {
> > > > > >  			unsigned irqchip;
> > > > > > @@ -629,6 +631,7 @@ void kvm_get_intr_delivery_bitmask(struct kvm_ioapic *ioapic,
> > > > > >  				   unsigned long *deliver_bitmask);
> > > > > >  #endif
> > > > > >  int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level);
> > > > > > +int kvm_clear_irq(struct kvm *kvm, int irq_source_id, u32 irq);
> > > > > >  int kvm_set_msi(struct kvm_kernel_irq_routing_entry *irq_entry, struct kvm *kvm,
> > > > > >  		int irq_source_id, int level);
> > > > > >  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
> > > > > > diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> > > > > > index 5afb431..76e8f22 100644
> > > > > > --- a/virt/kvm/irq_comm.c
> > > > > > +++ b/virt/kvm/irq_comm.c
> > > > > > @@ -68,6 +68,42 @@ static int kvm_set_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
> > > > > >  	return kvm_ioapic_set_irq(ioapic, e->irqchip.pin, level);
> > > > > >  }
> > > > > >  
> > > > > > +static inline int kvm_clear_irq_line_state(unsigned long *irq_state,
> > > > > > +					    int irq_source_id)
> > > > > > +{
> > > > > > +	return !!test_and_clear_bit(irq_source_id, irq_state);
> > > > > > +}
> > > > > > +
> > > > > > +static int kvm_clear_pic_irq(struct kvm_kernel_irq_routing_entry *e,
> > > > > > +			     struct kvm *kvm, int irq_source_id)
> > > > > > +{
> > > > > > +#ifdef CONFIG_X86
> > > > > > +	struct kvm_pic *pic = pic_irqchip(kvm);
> > > > > > +	int level = kvm_clear_irq_line_state(&pic->irq_states[e->irqchip.pin],
> > > > > > +					     irq_source_id);
> > > > > > +	if (level)
> > > > > > +		kvm_pic_set_irq(pic, e->irqchip.pin,
> > > > > > +				!!pic->irq_states[e->irqchip.pin]);
> > > > > > +	return level;
> > > > > 
> > > > > I think I begin to understand: if (level) checks it was previously set,
> > > > > and then we clear if needed?
> > > > 
> > > > It's actually very simple, if we change anything in irq_states, then
> > > > update via the chip specific set_irq function.
> > > > 
> > > > >  I think it's worthwhile to rename
> > > > > level to orig_level and rewrite as:
> > > > > 
> > > > > 	if (orig_level && !pic->irq_states[e->irqchip.pin])
> > > > > 		kvm_pic_set_irq(pic, e->irqchip.pin, 0);
> > > > > 
> > > > > This both makes the logic clear without need for comments and
> > > > > saves some cycles on pic in case nothing actually changed.
> > > > 
> > > > That may work, but it's not actually the same thing.  kvm_set_irq(,,,0)
> > > > will clear the bit and call kvm_pic_set_irq with the new irq_states
> > > > value, whether it's 0 or 1.  The optimization I make is to only call
> > > > kvm_pic_set_irq if we've "changed" irq_states.  You're taking that one
> > > > step further to "changed and is now 0".  I don't know if that's correct
> > > > behavior.
> > > 
> > > If not then I don't understand. You clear a bit
> > > in a word. You never change it to 1, do you?
> > 
> > Correct, but kvm_set_irq(,,,0) may call kvm_pic_set_irq(,,1) if other
> > source IDs are still asserting the interrupt.  Your proposal assumes
> > that unless irq_states is also 0 we don't need to call kvm_pic_set_irq,
> > and I don't know if that's correct.
> 
> Well you are asked to clear some id and level was 1. So we know
> interrupt was asserted. Either we clear it or we don't. No?
> 
> > > 
> > > But this brings another question:
> > > 
> > > static inline int kvm_irq_line_state(unsigned long *irq_state,
> > >                                      int irq_source_id, int level)
> > > {
> > >         /* Logical OR for level trig interrupt */
> > >         if (level)
> > >                 set_bit(irq_source_id, irq_state);
> > >         else
> > >                 clear_bit(irq_source_id, irq_state);
> > > 
> > > 
> > > ^^^^^^^^^^^
> > > above uses locked instructions
> > > 
> > >         return !!(*irq_state);
> > > 
> > > 
> > > above doesn't
> > > 
> > > }
> > > 
> > > 
> > > > why the inconsistency?
> > 
> > Note that set/clear_bit are not locked instructions,
> 
> On x86 they are:
> static __always_inline void
> set_bit(unsigned int nr, volatile unsigned long *addr)
> {
>         if (IS_IMMEDIATE(nr)) {
>                 asm volatile(LOCK_PREFIX "orb %1,%0"
>                         : CONST_MASK_ADDR(nr, addr)
>                         : "iq" ((u8)CONST_MASK(nr))
>                         : "memory");
>         } else {
>                 asm volatile(LOCK_PREFIX "bts %1,%0"
>                         : BITOP_ADDR(addr) : "Ir" (nr) : "memory");
>         }
> }
> 
> > but atomic
> > instructions and it could be argued that reading the value is also
> > atomic.  At least that was my guess when I stumbled across the same
> > yesterday.  IMHO, we're going off into the weeds again with these last
> > two patches.  It may be a valid optimization, but it really has no
> > bearing on the meat of the series (and afaict, no significant
> > performance difference either).
> 
> For me it's not a performance thing. IMO code is cleaner without this locking:
> we add a lock but only use it in some cases, so the rules become really
> complex.

Seriously?

        spin_lock(&irqfd->source->lock);
        if (!irqfd->source->level_asserted) {
                kvm_set_irq(irqfd->kvm, irqfd->source->id, irqfd->gsi, 1);
                irqfd->source->level_asserted = true;
        }
        spin_unlock(&irqfd->source->lock);

...

        spin_lock(&eoifd->source->lock);
        if (eoifd->source->level_asserted) {
                kvm_set_irq(eoifd->kvm,
                            eoifd->source->id, eoifd->notifier.gsi, 0);
                eoifd->source->level_asserted = false;
                eventfd_signal(eoifd->eventfd, 1);
        }
        spin_unlock(&eoifd->source->lock);


Locking doesn't get much more straightforward than that.

>   And the current code looks buggy; if it is, we need to fix it somehow.


Which to me seems to indicate this should be handled as a separate
effort.




* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-17 15:20             ` Alex Williamson
@ 2012-07-17 15:36               ` Michael S. Tsirkin
  2012-07-17 15:51                 ` Alex Williamson
  0 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-17 15:36 UTC (permalink / raw)
  To: Alex Williamson; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, Jul 17, 2012 at 09:20:11AM -0600, Alex Williamson wrote:
> On Tue, 2012-07-17 at 17:53 +0300, Michael S. Tsirkin wrote:
> > On Tue, Jul 17, 2012 at 08:21:51AM -0600, Alex Williamson wrote:
> > > On Tue, 2012-07-17 at 17:08 +0300, Michael S. Tsirkin wrote:
> > > > On Tue, Jul 17, 2012 at 07:56:09AM -0600, Alex Williamson wrote:
> > > > > On Tue, 2012-07-17 at 13:14 +0300, Michael S. Tsirkin wrote:
> > > > > > On Mon, Jul 16, 2012 at 02:34:03PM -0600, Alex Williamson wrote:
> > > > > > > This is an alternative to kvm_set_irq(,,,0) which returns the previous
> > > > > > > assertion state of the interrupt and does nothing if it isn't changed.
> > > > > > > 
> > > > > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > > > > > ---
> > > > > > > 
> > > > > > >  include/linux/kvm_host.h |    3 ++
> > > > > > >  virt/kvm/irq_comm.c      |   78 ++++++++++++++++++++++++++++++++++++++++++++++
> > > > > > >  2 files changed, 81 insertions(+)
> > > > > > > 
> > > > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > > > > index a7661c0..6c168f1 100644
> > > > > > > --- a/include/linux/kvm_host.h
> > > > > > > +++ b/include/linux/kvm_host.h
> > > > > > > @@ -219,6 +219,8 @@ struct kvm_kernel_irq_routing_entry {
> > > > > > >  	u32 type;
> > > > > > >  	int (*set)(struct kvm_kernel_irq_routing_entry *e,
> > > > > > >  		   struct kvm *kvm, int irq_source_id, int level);
> > > > > > > +	int (*clear)(struct kvm_kernel_irq_routing_entry *e,
> > > > > > > +		     struct kvm *kvm, int irq_source_id);
> > > > > > >  	union {
> > > > > > >  		struct {
> > > > > > >  			unsigned irqchip;
> > > > > > > @@ -629,6 +631,7 @@ void kvm_get_intr_delivery_bitmask(struct kvm_ioapic *ioapic,
> > > > > > >  				   unsigned long *deliver_bitmask);
> > > > > > >  #endif
> > > > > > >  int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level);
> > > > > > > +int kvm_clear_irq(struct kvm *kvm, int irq_source_id, u32 irq);
> > > > > > >  int kvm_set_msi(struct kvm_kernel_irq_routing_entry *irq_entry, struct kvm *kvm,
> > > > > > >  		int irq_source_id, int level);
> > > > > > >  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
> > > > > > > diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> > > > > > > index 5afb431..76e8f22 100644
> > > > > > > --- a/virt/kvm/irq_comm.c
> > > > > > > +++ b/virt/kvm/irq_comm.c
> > > > > > > @@ -68,6 +68,42 @@ static int kvm_set_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
> > > > > > >  	return kvm_ioapic_set_irq(ioapic, e->irqchip.pin, level);
> > > > > > >  }
> > > > > > >  
> > > > > > > +static inline int kvm_clear_irq_line_state(unsigned long *irq_state,
> > > > > > > +					    int irq_source_id)
> > > > > > > +{
> > > > > > > +	return !!test_and_clear_bit(irq_source_id, irq_state);
> > > > > > > +}
> > > > > > > +
> > > > > > > +static int kvm_clear_pic_irq(struct kvm_kernel_irq_routing_entry *e,
> > > > > > > +			     struct kvm *kvm, int irq_source_id)
> > > > > > > +{
> > > > > > > +#ifdef CONFIG_X86
> > > > > > > +	struct kvm_pic *pic = pic_irqchip(kvm);
> > > > > > > +	int level = kvm_clear_irq_line_state(&pic->irq_states[e->irqchip.pin],
> > > > > > > +					     irq_source_id);
> > > > > > > +	if (level)
> > > > > > > +		kvm_pic_set_irq(pic, e->irqchip.pin,
> > > > > > > +				!!pic->irq_states[e->irqchip.pin]);
> > > > > > > +	return level;
> > > > > > 
> > > > > > I think I begin to understand: if (level) checks it was previously set,
> > > > > > and then we clear if needed?
> > > > > 
> > > > > It's actually very simple, if we change anything in irq_states, then
> > > > > update via the chip specific set_irq function.
> > > > > 
> > > > > >  I think it's worthwhile to rename
> > > > > > level to orig_level and rewrite as:
> > > > > > 
> > > > > > 	if (orig_level && !pic->irq_states[e->irqchip.pin])
> > > > > > 		kvm_pic_set_irq(pic, e->irqchip.pin, 0);
> > > > > > 
> > > > > > This both makes the logic clear without need for comments and
> > > > > > saves some cycles on pic in case nothing actually changed.
> > > > > 
> > > > > That may work, but it's not actually the same thing.  kvm_set_irq(,,,0)
> > > > > will clear the bit and call kvm_pic_set_irq with the new irq_states
> > > > > value, whether it's 0 or 1.  The optimization I make is to only call
> > > > > kvm_pic_set_irq if we've "changed" irq_states.  You're taking that one
> > > > > step further to "changed and is now 0".  I don't know if that's correct
> > > > > behavior.
> > > > 
> > > > If not then I don't understand. You clear a bit
> > > > in a word. You never change it to 1, do you?
> > > 
> > > Correct, but kvm_set_irq(,,,0) may call kvm_pic_set_irq(,,1) if other
> > > source IDs are still asserting the interrupt.  Your proposal assumes
> > > that unless irq_states is also 0 we don't need to call kvm_pic_set_irq,
> > > and I don't know if that's correct.
> > 
> > Well you are asked to clear some id and level was 1. So we know
> > interrupt was asserted. Either we clear it or we don't. No?
> > 
> > > > 
> > > > But this brings another question:
> > > > 
> > > > static inline int kvm_irq_line_state(unsigned long *irq_state,
> > > >                                      int irq_source_id, int level)
> > > > {
> > > >         /* Logical OR for level trig interrupt */
> > > >         if (level)
> > > >                 set_bit(irq_source_id, irq_state);
> > > >         else
> > > >                 clear_bit(irq_source_id, irq_state);
> > > > 
> > > > 
> > > > ^^^^^^^^^^^
> > > > above uses locked instructions
> > > > 
> > > >         return !!(*irq_state);
> > > > 
> > > > 
> > > > above doesn't
> > > > 
> > > > }
> > > > 
> > > > 
> > > > > why the inconsistency?
> > > 
> > > Note that set/clear_bit are not locked instructions,
> > 
> > On x86 they are:
> > static __always_inline void
> > set_bit(unsigned int nr, volatile unsigned long *addr)
> > {
> >         if (IS_IMMEDIATE(nr)) {
> >                 asm volatile(LOCK_PREFIX "orb %1,%0"
> >                         : CONST_MASK_ADDR(nr, addr)
> >                         : "iq" ((u8)CONST_MASK(nr))
> >                         : "memory");
> >         } else {
> >                 asm volatile(LOCK_PREFIX "bts %1,%0"
> >                         : BITOP_ADDR(addr) : "Ir" (nr) : "memory");
> >         }
> > }
> > 
> > > but atomic
> > > instructions and it could be argued that reading the value is also
> > > atomic.  At least that was my guess when I stumbled across the same
> > > yesterday.  IMHO, we're going off into the weeds again with these last
> > > two patches.  It may be a valid optimization, but it really has no
> > > bearing on the meat of the series (and afaict, no significant
> > > performance difference either).
> > 
> > For me it's not a performance thing. IMO code is cleaner without this locking:
> > we add a lock but only use it in some cases, so the rules become really
> > complex.
> 
> Seriously?
> 
>         spin_lock(&irqfd->source->lock);
>         if (!irqfd->source->level_asserted) {
>                 kvm_set_irq(irqfd->kvm, irqfd->source->id, irqfd->gsi, 1);
>                 irqfd->source->level_asserted = true;
>         }
>         spin_unlock(&irqfd->source->lock);
> 
> ...
> 
>         spin_lock(&eoifd->source->lock);
>         if (eoifd->source->level_asserted) {
>                 kvm_set_irq(eoifd->kvm,
>                             eoifd->source->id, eoifd->notifier.gsi, 0);
>                 eoifd->source->level_asserted = false;
>                 eventfd_signal(eoifd->eventfd, 1);
>         }
>         spin_unlock(&eoifd->source->lock);
> 
> 
> Locking doesn't get much more straightforward than that

Don't look at it in isolation. You are now calling kvm_set_irq
from under a spinlock. You are saying it is always safe but
this seems far from obvious. kvm_set_irq used to be
unsafe from an atomic context.

> >   And the current code looks buggy; if it is, we need to fix it somehow.
> 
> 
> Which to me seems to indicate this should be handled as a separate
> effort.

A separate patchset, sure. But likely a prerequisite: we still need to
look at all the code. Let's not copy bugs; we need to fix them.

-- 
MST


* Re: [PATCH v5 2/4] kvm: KVM_EOIFD, an eventfd for EOIs
  2012-07-17 15:13               ` Michael S. Tsirkin
@ 2012-07-17 15:41                 ` Alex Williamson
  2012-07-17 15:53                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Alex Williamson @ 2012-07-17 15:41 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, 2012-07-17 at 18:13 +0300, Michael S. Tsirkin wrote:
> On Tue, Jul 17, 2012 at 08:57:04AM -0600, Alex Williamson wrote:
> > On Tue, 2012-07-17 at 17:42 +0300, Michael S. Tsirkin wrote:
> > > On Tue, Jul 17, 2012 at 08:29:43AM -0600, Alex Williamson wrote:
> > > > On Tue, 2012-07-17 at 17:10 +0300, Michael S. Tsirkin wrote:
> > > > > On Tue, Jul 17, 2012 at 07:59:16AM -0600, Alex Williamson wrote:
> > > > > > On Tue, 2012-07-17 at 13:21 +0300, Michael S. Tsirkin wrote:
> > > > > > > On Mon, Jul 16, 2012 at 02:33:55PM -0600, Alex Williamson wrote:
> > > > > > > > +	if (args->flags & KVM_EOIFD_FLAG_LEVEL_IRQFD) {
> > > > > > > > +		struct _irqfd *irqfd = _irqfd_fdget_lock(kvm, args->irqfd);
> > > > > > > > +		if (IS_ERR(irqfd)) {
> > > > > > > > +			ret = PTR_ERR(irqfd);
> > > > > > > > +			goto fail;
> > > > > > > > +		}
> > > > > > > > +
> > > > > > > > +		gsi = irqfd->gsi;
> > > > > > > > +		level_irqfd = eventfd_ctx_get(irqfd->eventfd);
> > > > > > > > +		source = _irq_source_get(irqfd->source);
> > > > > > > > +		_irqfd_put_unlock(irqfd);
> > > > > > > > +		if (!source) {
> > > > > > > > +			ret = -EINVAL;
> > > > > > > > +			goto fail;
> > > > > > > > +		}
> > > > > > > > +	} else {
> > > > > > > > +		ret = -EINVAL;
> > > > > > > > +		goto fail;
> > > > > > > > +	}
> > > > > > > > +
> > > > > > > > +	INIT_LIST_HEAD(&eoifd->list);
> > > > > > > > +	eoifd->kvm = kvm;
> > > > > > > > +	eoifd->eventfd = eventfd;
> > > > > > > > +	eoifd->source = source;
> > > > > > > > +	eoifd->level_irqfd = level_irqfd;
> > > > > > > > +	eoifd->notifier.gsi = gsi;
> > > > > > > > +	eoifd->notifier.irq_acked = eoifd_event;
> > > > > > > 
> > > > > > > OK so this means eoifd keeps a reference to the irqfd.
> > > > > > > And since this is the case, can't we drop the reference counting
> > > > > > > around source ids now? Everything is referenced through irqfd.
> > > > > > 
> > > > > > Holding a reference and using it as a reference count are not the same
> > > > > > thing.  What if another module holds a reference to this eventfd?  How
> > > > > > do we do anything on release?
> > > > > 
> > > > > We don't as there is no release, and using kref on source id does not
> > > > > help: it just never gets invoked.
> > > > 
> > > > Please work out how you think it should work and let me know, I don't
> > > > see it.  We have an irq source id that needs to be allocated by irqfd
> > > > and returned when it's unused.  It becomes unused when neither irqfd nor
> > > > eoifd are making use of it.  irqfd and eoifd may be closed in any order.
> > > > Use of the source id is what we're reference counting, which is why it's
> > > > in struct _irq_source.  How can I use an eventfd reference for the same?
> > > > I don't know when it's unused.  I don't know who else holds a reference
> > > > to it...  Doesn't make sense to me.  Feels like attempting to squat on
> > > > someone else's object.
> > > > 
> > > > 
> > > 
> > > eoifd should prevent irqfd from being released.
> > 
> > Why?  Note that this is actually quite difficult too.  We can't fail a
> > release, nobody checks close(3p) return.  Blocking a release is likely
> > to cause all sorts of problems, so what you mean is that irqfd should
> > linger around until there are no references to it... but that's exactly
> > what struct _irq_source is for, is to hold the bits that we care about
> > references to and automatically release it when there are none.
> 
> No no. You *already* prevent it. You take a reference to the eventfd
> context.

Right, which keeps the fd from going away, not the struct _irqfd.

> > >   It already keeps
> > > a reference to it so it prevents irqfd from going away by userspace
> > > closing the fd.
> > 
> > Wrong, eoifd holds a reference to the eventfd for the irqfd, so it
> > prevents the fd from going away, not the irqfd.
> 
> > When the fd is not going away, an ioctl is the only other way for
> > it to go away.

It doesn't do any good to fail the ioctl if close(fd) allows it.

> > >   But it can still be released with deassign.
> > > An easy solution is to fail deassign of irqfd if there is
> > > eoifd bound to it.
> > 
> > I don't know why we would impose such a bizarre usage model when
> > reference counting on struct _irq_source seems to handle this nicely
> > already.
> 
> Well eventfd gets an irqfd. What does it mean if said irqfd gets
> deassigned, and potentially assigned an unrelated interrupt?
> I think what I would expect is for it to handle the new interrupt.
> This is hard to implement so let us fail this.

Ah, so an actual problem, let's solve this.  Why wouldn't we just search
the list of eoifds and see if this level_irqfd is already used?  If we
find it and it's compatible, we can get a reference to the _irq_source
and "re-attach" the irqfd.  If it's not compatible, fail the KVM_IRQFD.
If the KVM_IRQFD is for an edge irqfd, I think we let it go.
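
Something along these lines, as a standalone sketch of the lookup only.
struct eoifd here is a cut-down hypothetical model, irqfd_assign_check() and
the ASSIGN_* results are made-up names, and "compatible" is assumed to mean
"same GSI", which is a guess that would need to be pinned down:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, cut-down model of an eoifd list entry. */
struct eoifd {
	const void *level_irqfd;	/* eventfd context of the bound irqfd */
	int gsi;
	const struct eoifd *next;
};

enum assign_result { ASSIGN_NEW, ASSIGN_REATTACH, ASSIGN_REJECT };

/* Decide what KVM_IRQFD should do when an eventfd is (re)assigned:
 * - no eoifd uses this eventfd: plain new assignment
 * - edge irqfd: let it go, even if an eoifd still references the fd
 * - level irqfd, same gsi (assumed meaning of "compatible"): re-attach
 * - level irqfd, different gsi: fail the assignment */
static enum assign_result irqfd_assign_check(const struct eoifd *eoifds,
					     const void *eventfd,
					     int gsi, bool level)
{
	const struct eoifd *e;

	assert(eventfd != NULL);
	for (e = eoifds; e; e = e->next) {
		if (e->level_irqfd != eventfd)
			continue;
		if (!level)
			return ASSIGN_NEW;
		return e->gsi == gsi ? ASSIGN_REATTACH : ASSIGN_REJECT;
	}
	return ASSIGN_NEW;
}
```

With the one-to-one irqfd/eoifd mapping from v5 there is at most one match,
so the first hit decides.  On ASSIGN_REATTACH the caller would take a
reference on the existing _irq_source rather than allocating a new source id.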



* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-17 15:36               ` Michael S. Tsirkin
@ 2012-07-17 15:51                 ` Alex Williamson
  2012-07-17 15:57                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Alex Williamson @ 2012-07-17 15:51 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, 2012-07-17 at 18:36 +0300, Michael S. Tsirkin wrote:
> On Tue, Jul 17, 2012 at 09:20:11AM -0600, Alex Williamson wrote:
> > On Tue, 2012-07-17 at 17:53 +0300, Michael S. Tsirkin wrote:
> > > On Tue, Jul 17, 2012 at 08:21:51AM -0600, Alex Williamson wrote:
> > > > On Tue, 2012-07-17 at 17:08 +0300, Michael S. Tsirkin wrote:
> > > > > On Tue, Jul 17, 2012 at 07:56:09AM -0600, Alex Williamson wrote:
> > > > > > On Tue, 2012-07-17 at 13:14 +0300, Michael S. Tsirkin wrote:
> > > > > > > On Mon, Jul 16, 2012 at 02:34:03PM -0600, Alex Williamson wrote:
> > > > > > > > This is an alternative to kvm_set_irq(,,,0) which returns the previous
> > > > > > > > assertion state of the interrupt and does nothing if it isn't changed.
> > > > > > > > 
> > > > > > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > > > > > > ---
> > > > > > > > 
> > > > > > > >  include/linux/kvm_host.h |    3 ++
> > > > > > > >  virt/kvm/irq_comm.c      |   78 ++++++++++++++++++++++++++++++++++++++++++++++
> > > > > > > >  2 files changed, 81 insertions(+)
> > > > > > > > 
> > > > > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > > > > > index a7661c0..6c168f1 100644
> > > > > > > > --- a/include/linux/kvm_host.h
> > > > > > > > +++ b/include/linux/kvm_host.h
> > > > > > > > @@ -219,6 +219,8 @@ struct kvm_kernel_irq_routing_entry {
> > > > > > > >  	u32 type;
> > > > > > > >  	int (*set)(struct kvm_kernel_irq_routing_entry *e,
> > > > > > > >  		   struct kvm *kvm, int irq_source_id, int level);
> > > > > > > > +	int (*clear)(struct kvm_kernel_irq_routing_entry *e,
> > > > > > > > +		     struct kvm *kvm, int irq_source_id);
> > > > > > > >  	union {
> > > > > > > >  		struct {
> > > > > > > >  			unsigned irqchip;
> > > > > > > > @@ -629,6 +631,7 @@ void kvm_get_intr_delivery_bitmask(struct kvm_ioapic *ioapic,
> > > > > > > >  				   unsigned long *deliver_bitmask);
> > > > > > > >  #endif
> > > > > > > >  int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level);
> > > > > > > > +int kvm_clear_irq(struct kvm *kvm, int irq_source_id, u32 irq);
> > > > > > > >  int kvm_set_msi(struct kvm_kernel_irq_routing_entry *irq_entry, struct kvm *kvm,
> > > > > > > >  		int irq_source_id, int level);
> > > > > > > >  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
> > > > > > > > diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> > > > > > > > index 5afb431..76e8f22 100644
> > > > > > > > --- a/virt/kvm/irq_comm.c
> > > > > > > > +++ b/virt/kvm/irq_comm.c
> > > > > > > > @@ -68,6 +68,42 @@ static int kvm_set_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
> > > > > > > >  	return kvm_ioapic_set_irq(ioapic, e->irqchip.pin, level);
> > > > > > > >  }
> > > > > > > >  
> > > > > > > > +static inline int kvm_clear_irq_line_state(unsigned long *irq_state,
> > > > > > > > +					    int irq_source_id)
> > > > > > > > +{
> > > > > > > > +	return !!test_and_clear_bit(irq_source_id, irq_state);
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +static int kvm_clear_pic_irq(struct kvm_kernel_irq_routing_entry *e,
> > > > > > > > +			     struct kvm *kvm, int irq_source_id)
> > > > > > > > +{
> > > > > > > > +#ifdef CONFIG_X86
> > > > > > > > +	struct kvm_pic *pic = pic_irqchip(kvm);
> > > > > > > > +	int level = kvm_clear_irq_line_state(&pic->irq_states[e->irqchip.pin],
> > > > > > > > +					     irq_source_id);
> > > > > > > > +	if (level)
> > > > > > > > +		kvm_pic_set_irq(pic, e->irqchip.pin,
> > > > > > > > +				!!pic->irq_states[e->irqchip.pin]);
> > > > > > > > +	return level;
> > > > > > > 
> > > > > > > I think I begin to understand: if (level) checks it was previously set,
> > > > > > > and then we clear if needed?
> > > > > > 
> > > > > > It's actually very simple, if we change anything in irq_states, then
> > > > > > update via the chip specific set_irq function.
> > > > > > 
> > > > > > >  I think it's worthwhile to rename
> > > > > > > level to orig_level and rewrite as:
> > > > > > > 
> > > > > > > 	if (orig_level && !pic->irq_states[e->irqchip.pin])
> > > > > > > 		kvm_pic_set_irq(pic, e->irqchip.pin, 0);
> > > > > > > 
> > > > > > > This both makes the logic clear without need for comments and
> > > > > > > saves some cycles on pic in case nothing actually changed.
> > > > > > 
> > > > > > That may work, but it's not actually the same thing.  kvm_set_irq(,,,0)
> > > > > > will clear the bit and call kvm_pic_set_irq with the new irq_states
> > > > > > value, whether it's 0 or 1.  The optimization I make is to only call
> > > > > > kvm_pic_set_irq if we've "changed" irq_states.  You're taking that one
> > > > > > step further to "changed and is now 0".  I don't know if that's correct
> > > > > > behavior.
> > > > > 
> > > > > If not then I don't understand. You clear a bit
> > > > > in a word. You never change it to 1, do you?
> > > > 
> > > > Correct, but kvm_set_irq(,,,0) may call kvm_pic_set_irq(,,1) if other
> > > > source IDs are still asserting the interrupt.  Your proposal assumes
> > > > that unless irq_states is also 0 we don't need to call kvm_pic_set_irq,
> > > > and I don't know if that's correct.
> > > 
> > > Well you are asked to clear some id and level was 1. So we know
> > > interrupt was asserted. Either we clear it or we don't. No?
> > > 
> > > > > 
> > > > > But this brings another question:
> > > > > 
> > > > > static inline int kvm_irq_line_state(unsigned long *irq_state,
> > > > >                                      int irq_source_id, int level)
> > > > > {
> > > > >         /* Logical OR for level trig interrupt */
> > > > >         if (level)
> > > > >                 set_bit(irq_source_id, irq_state);
> > > > >         else
> > > > >                 clear_bit(irq_source_id, irq_state);
> > > > > 
> > > > > 
> > > > > ^^^^^^^^^^^
> > > > > above uses locked instructions
> > > > > 
> > > > >         return !!(*irq_state);
> > > > > 
> > > > > 
> > > > > above doesn't
> > > > > 
> > > > > }
> > > > > 
> > > > > 
> > > > > why the inconsistency?
> > > > 
> > > > Note that set/clear_bit are not locked instructions,
> > > 
> > > On x86 they are:
> > > static __always_inline void
> > > set_bit(unsigned int nr, volatile unsigned long *addr)
> > > {
> > >         if (IS_IMMEDIATE(nr)) {
> > >                 asm volatile(LOCK_PREFIX "orb %1,%0"
> > >                         : CONST_MASK_ADDR(nr, addr)
> > >                         : "iq" ((u8)CONST_MASK(nr))
> > >                         : "memory");
> > >         } else {
> > >                 asm volatile(LOCK_PREFIX "bts %1,%0"
> > >                         : BITOP_ADDR(addr) : "Ir" (nr) : "memory");
> > >         }
> > > }
> > > 
> > > > but atomic
> > > > instructions and it could be argued that reading the value is also
> > > > atomic.  At least that was my guess when I stumbled across the same
> > > > yesterday.  IMHO, we're going off into the weeds again with these last
> > > > two patches.  It may be a valid optimization, but it really has no
> > > > bearing on the meat of the series (and afaict, no significant
> > > > performance difference either).
> > > 
> > > For me it's not a performance thing. IMO code is cleaner without this locking:
> > > we add a lock but only use it in some cases, so the rules become really
> > > complex.
> > 
> > Seriously?
> > 
> >         spin_lock(&irqfd->source->lock);
> >         if (!irqfd->source->level_asserted) {
> >                 kvm_set_irq(irqfd->kvm, irqfd->source->id, irqfd->gsi, 1);
> >                 irqfd->source->level_asserted = true;
> >         }
> >         spin_unlock(&irqfd->source->lock);
> > 
> > ...
> > 
> >         spin_lock(&eoifd->source->lock);
> >         if (eoifd->source->level_asserted) {
> >                 kvm_set_irq(eoifd->kvm,
> >                             eoifd->source->id, eoifd->notifier.gsi, 0);
> >                 eoifd->source->level_asserted = false;
> >                 eventfd_signal(eoifd->eventfd, 1);
> >         }
> >         spin_unlock(&eoifd->source->lock);
> > 
> > 
> > Locking doesn't get much more straightforward than that
> 
> Don't look at it in isolation. You are now calling kvm_set_irq
> from under a spinlock. You are saying it is always safe but
> this seems far from obvious. kvm_set_irq used to be
> unsafe from an atomic context.

Device assignment has been calling kvm_set_irq from atomic context for
quite a long time.

> > >   And current code looks buggy if yes we need to fix it somehow.
> > 
> > 
> > Which to me seems to indicate this should be handled as a separate
> > effort.
> 
> A separate patchset, sure. But likely a prerequisite: we still need to
> look at all the code. Let's not copy bugs, need to fix them.

This looks tangential to me unless you can come up with an actual reason
the above spinlock usage is incorrect or insufficient.
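
The kvm_clear_irq() disagreement quoted above is easier to follow with a
small user-space model. The sketch below is not kernel code: pin_model,
chip_set_irq(), model_test_and_clear(), clear_irq_as_posted() and
clear_irq_as_suggested() are hypothetical names, and the test-and-clear
helper is deliberately non-atomic, unlike the kernel's test_and_clear_bit().
It only shows when each variant would touch the interrupt chip.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one pin: a word of per-source assertion bits plus
 * bookkeeping that stands in for calls to kvm_pic_set_irq(). */
struct pin_model {
    unsigned long irq_state;  /* bit N set => source ID N asserts the line */
    int chip_updates;         /* number of times the "chip" was updated */
    int last_level;           /* level passed on the most recent update */
};

static void chip_set_irq(struct pin_model *p, int level)
{
    p->chip_updates++;
    p->last_level = level;
}

/* Non-atomic stand-in for test_and_clear_bit(); fine for a single thread. */
static bool model_test_and_clear(unsigned long *word, int bit)
{
    unsigned long mask = 1UL << bit;
    bool was_set = (*word & mask) != 0;

    *word &= ~mask;
    return was_set;
}

/* As posted in the patch: update the chip whenever this source's bit
 * changed, passing the OR of the remaining sources (may still be 1). */
static int clear_irq_as_posted(struct pin_model *p, int source_id)
{
    int level = model_test_and_clear(&p->irq_state, source_id);

    if (level)
        chip_set_irq(p, !!p->irq_state);
    return level;
}

/* As suggested in review: update the chip only if the bit changed and no
 * other source still asserts the line. */
static int clear_irq_as_suggested(struct pin_model *p, int source_id)
{
    int orig_level = model_test_and_clear(&p->irq_state, source_id);

    if (orig_level && !p->irq_state)
        chip_set_irq(p, 0);
    return orig_level;
}
```

With two sources asserting the same pin, clearing one of them updates the
chip with level 1 in the posted variant and skips the update in the
suggested variant. Whether that skipped update matters to the pic/ioapic is
exactly the open question in the exchange above.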


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 2/4] kvm: KVM_EOIFD, an eventfd for EOIs
  2012-07-17 15:41                 ` Alex Williamson
@ 2012-07-17 15:53                   ` Michael S. Tsirkin
  2012-07-17 16:06                     ` Alex Williamson
  0 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-17 15:53 UTC (permalink / raw)
  To: Alex Williamson; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, Jul 17, 2012 at 09:41:09AM -0600, Alex Williamson wrote:
> On Tue, 2012-07-17 at 18:13 +0300, Michael S. Tsirkin wrote:
> > On Tue, Jul 17, 2012 at 08:57:04AM -0600, Alex Williamson wrote:
> > > On Tue, 2012-07-17 at 17:42 +0300, Michael S. Tsirkin wrote:
> > > > On Tue, Jul 17, 2012 at 08:29:43AM -0600, Alex Williamson wrote:
> > > > > On Tue, 2012-07-17 at 17:10 +0300, Michael S. Tsirkin wrote:
> > > > > > On Tue, Jul 17, 2012 at 07:59:16AM -0600, Alex Williamson wrote:
> > > > > > > On Tue, 2012-07-17 at 13:21 +0300, Michael S. Tsirkin wrote:
> > > > > > > > On Mon, Jul 16, 2012 at 02:33:55PM -0600, Alex Williamson wrote:
> > > > > > > > > +	if (args->flags & KVM_EOIFD_FLAG_LEVEL_IRQFD) {
> > > > > > > > > +		struct _irqfd *irqfd = _irqfd_fdget_lock(kvm, args->irqfd);
> > > > > > > > > +		if (IS_ERR(irqfd)) {
> > > > > > > > > +			ret = PTR_ERR(irqfd);
> > > > > > > > > +			goto fail;
> > > > > > > > > +		}
> > > > > > > > > +
> > > > > > > > > +		gsi = irqfd->gsi;
> > > > > > > > > +		level_irqfd = eventfd_ctx_get(irqfd->eventfd);
> > > > > > > > > +		source = _irq_source_get(irqfd->source);
> > > > > > > > > +		_irqfd_put_unlock(irqfd);
> > > > > > > > > +		if (!source) {
> > > > > > > > > +			ret = -EINVAL;
> > > > > > > > > +			goto fail;
> > > > > > > > > +		}
> > > > > > > > > +	} else {
> > > > > > > > > +		ret = -EINVAL;
> > > > > > > > > +		goto fail;
> > > > > > > > > +	}
> > > > > > > > > +
> > > > > > > > > +	INIT_LIST_HEAD(&eoifd->list);
> > > > > > > > > +	eoifd->kvm = kvm;
> > > > > > > > > +	eoifd->eventfd = eventfd;
> > > > > > > > > +	eoifd->source = source;
> > > > > > > > > +	eoifd->level_irqfd = level_irqfd;
> > > > > > > > > +	eoifd->notifier.gsi = gsi;
> > > > > > > > > +	eoifd->notifier.irq_acked = eoifd_event;
> > > > > > > > 
> > > > > > > > OK so this means eoifd keeps a reference to the irqfd.
> > > > > > > > And since this is the case, can't we drop the reference counting
> > > > > > > > around source ids now? Everything is referenced through irqfd.
> > > > > > > 
> > > > > > > Holding a reference and using it as a reference count are not the same
> > > > > > > thing.  What if another module holds a reference to this eventfd?  How
> > > > > > > do we do anything on release?
> > > > > > 
> > > > > > We don't as there is no release, and using kref on source id does not
> > > > > > help: it just never gets invoked.
> > > > > 
> > > > > Please work out how you think it should work and let me know, I don't
> > > > > see it.  We have an irq source id that needs to be allocated by irqfd
> > > > > and returned when it's unused.  It becomes unused when neither irqfd nor
> > > > > eoifd are making use of it.  irqfd and eoifd may be closed in any order.
> > > > > Use of the source id is what we're reference counting, which is why it's
> > > > > in struct _irq_source.  How can I use an eventfd reference for the same?
> > > > > I don't know when it's unused.  I don't know who else holds a reference
> > > > > to it...  Doesn't make sense to me.  Feels like attempting to squat on
> > > > > someone else's object.
> > > > > 
> > > > > 
> > > > 
> > > > eoifd should prevent irqfd from being released.
> > > 
> > > Why?  Note that this is actually quite difficult too.  We can't fail a
> > > release, nobody checks close(3p) return.  Blocking a release is likely
> > > to cause all sorts of problems, so what you mean is that irqfd should
> > > linger around until there are no references to it... but that's exactly
> > > what struct _irq_source is for, is to hold the bits that we care about
> > > references to and automatically release it when there are none.
> > 
> > No no. You *already* prevent it. You take a reference to the eventfd
> > context.
> 
> Right, which keeps the fd from going away, not the struct _irqfd.

_irqfd too.

> > > >   It already keeps
> > > > a reference to it so it prevents irqfd from going away by userspace
> > > > closing the fd.
> > > 
> > > Wrong, eoifd holds a reference to the eventfd for the irqfd, so it
> > > prevents the fd from going away, not the irqfd.
> > 
> > When the fd is not going away an ioctl is the only other way for
> > it to go away.
> 
> It doesn't do any good to fail the ioctl if close(fd) allows it.

allows what? It does nothing.

> > > >   But it can still be released with deassign.
> > > > An easy solution is to fail deassign of irqfd if there is
> > > > eoifd bound to it.
> > > 
> > > I don't know why we would impose such a bizarre usage model when
> > > reference counting on struct _irq_source seems to handle this nicely
> > > already.
> > 
> > Well eventfd gets an irqfd. What does it mean if said irqfd gets
> > deassigned, and potentially assigned an unrelated interrupt?
> > I think what I would expect is for it to handle the new interrupt.
> > This is hard to implement so let us fail this.
> 
> Ah, so an actual problem, let's solve this.  Why wouldn't we just search
> the list of eoifds and see if this level_irqfd is already used?  If we
> find it and it's compatible, we can get a reference to the _irq_source
> and "re-attach" the irqfd.  If it's not compatible, fail the KVM_IRQFD.
> If the KVM_IRQFD is for an edge irqfd, I think we let it go.

This is just confusing. Userspace has no idea that you are reusing fds
behind the scenes. assign is not the problem, deassign is.
So fail *that*.

-- 
MST
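
The lifetime rule argued over in the quoted text (the irq source ID is
returned only when neither the irqfd nor the eoifd uses it, and the two may
be closed in any order) can be sketched as a plain reference count. This is
a single-threaded user-space illustration, not the patch's implementation:
irq_source_model, source_get() and source_put() are hypothetical names, and
the released flag merely stands in for handing the source ID back.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct _irq_source. */
struct irq_source_model {
    int id;         /* the irq source ID being shared */
    int refs;       /* number of users (irqfd, eoifd) */
    bool released;  /* set once the last user drops its reference */
};

/* Take a reference; fails if there is no source or it is already gone. */
static struct irq_source_model *source_get(struct irq_source_model *s)
{
    if (s == NULL || s->refs == 0)
        return NULL;
    s->refs++;
    return s;
}

/* Drop a reference; the last put "returns" the source ID. */
static void source_put(struct irq_source_model *s)
{
    assert(s->refs > 0);
    if (--s->refs == 0)
        s->released = true;
}
```

In this model the irqfd starts with refs == 1, the eoifd takes a second
reference with source_get(), and the ID is released only after both have
called source_put(), regardless of which one goes first.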

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-17 15:51                 ` Alex Williamson
@ 2012-07-17 15:57                   ` Michael S. Tsirkin
  2012-07-17 16:01                     ` Gleb Natapov
  2012-07-17 16:08                     ` Alex Williamson
  0 siblings, 2 replies; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-17 15:57 UTC (permalink / raw)
  To: Alex Williamson; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, Jul 17, 2012 at 09:51:41AM -0600, Alex Williamson wrote:
> On Tue, 2012-07-17 at 18:36 +0300, Michael S. Tsirkin wrote:
> > On Tue, Jul 17, 2012 at 09:20:11AM -0600, Alex Williamson wrote:
> > > On Tue, 2012-07-17 at 17:53 +0300, Michael S. Tsirkin wrote:
> > > > On Tue, Jul 17, 2012 at 08:21:51AM -0600, Alex Williamson wrote:
> > > > > On Tue, 2012-07-17 at 17:08 +0300, Michael S. Tsirkin wrote:
> > > > > > On Tue, Jul 17, 2012 at 07:56:09AM -0600, Alex Williamson wrote:
> > > > > > > On Tue, 2012-07-17 at 13:14 +0300, Michael S. Tsirkin wrote:
> > > > > > > > On Mon, Jul 16, 2012 at 02:34:03PM -0600, Alex Williamson wrote:
> > > > > > > > > This is an alternative to kvm_set_irq(,,,0) which returns the previous
> > > > > > > > > assertion state of the interrupt and does nothing if it isn't changed.
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > > > > > > > ---
> > > > > > > > > 
> > > > > > > > >  include/linux/kvm_host.h |    3 ++
> > > > > > > > >  virt/kvm/irq_comm.c      |   78 ++++++++++++++++++++++++++++++++++++++++++++++
> > > > > > > > >  2 files changed, 81 insertions(+)
> > > > > > > > > 
> > > > > > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > > > > > > index a7661c0..6c168f1 100644
> > > > > > > > > --- a/include/linux/kvm_host.h
> > > > > > > > > +++ b/include/linux/kvm_host.h
> > > > > > > > > @@ -219,6 +219,8 @@ struct kvm_kernel_irq_routing_entry {
> > > > > > > > >  	u32 type;
> > > > > > > > >  	int (*set)(struct kvm_kernel_irq_routing_entry *e,
> > > > > > > > >  		   struct kvm *kvm, int irq_source_id, int level);
> > > > > > > > > +	int (*clear)(struct kvm_kernel_irq_routing_entry *e,
> > > > > > > > > +		     struct kvm *kvm, int irq_source_id);
> > > > > > > > >  	union {
> > > > > > > > >  		struct {
> > > > > > > > >  			unsigned irqchip;
> > > > > > > > > @@ -629,6 +631,7 @@ void kvm_get_intr_delivery_bitmask(struct kvm_ioapic *ioapic,
> > > > > > > > >  				   unsigned long *deliver_bitmask);
> > > > > > > > >  #endif
> > > > > > > > >  int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level);
> > > > > > > > > +int kvm_clear_irq(struct kvm *kvm, int irq_source_id, u32 irq);
> > > > > > > > >  int kvm_set_msi(struct kvm_kernel_irq_routing_entry *irq_entry, struct kvm *kvm,
> > > > > > > > >  		int irq_source_id, int level);
> > > > > > > > >  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
> > > > > > > > > diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> > > > > > > > > index 5afb431..76e8f22 100644
> > > > > > > > > --- a/virt/kvm/irq_comm.c
> > > > > > > > > +++ b/virt/kvm/irq_comm.c
> > > > > > > > > @@ -68,6 +68,42 @@ static int kvm_set_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
> > > > > > > > >  	return kvm_ioapic_set_irq(ioapic, e->irqchip.pin, level);
> > > > > > > > >  }
> > > > > > > > >  
> > > > > > > > > +static inline int kvm_clear_irq_line_state(unsigned long *irq_state,
> > > > > > > > > +					    int irq_source_id)
> > > > > > > > > +{
> > > > > > > > > +	return !!test_and_clear_bit(irq_source_id, irq_state);
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > > +static int kvm_clear_pic_irq(struct kvm_kernel_irq_routing_entry *e,
> > > > > > > > > +			     struct kvm *kvm, int irq_source_id)
> > > > > > > > > +{
> > > > > > > > > +#ifdef CONFIG_X86
> > > > > > > > > +	struct kvm_pic *pic = pic_irqchip(kvm);
> > > > > > > > > +	int level = kvm_clear_irq_line_state(&pic->irq_states[e->irqchip.pin],
> > > > > > > > > +					     irq_source_id);
> > > > > > > > > +	if (level)
> > > > > > > > > +		kvm_pic_set_irq(pic, e->irqchip.pin,
> > > > > > > > > +				!!pic->irq_states[e->irqchip.pin]);
> > > > > > > > > +	return level;
> > > > > > > > 
> > > > > > > > I think I begin to understand: if (level) checks it was previously set,
> > > > > > > > and then we clear if needed?
> > > > > > > 
> > > > > > > It's actually very simple, if we change anything in irq_states, then
> > > > > > > update via the chip specific set_irq function.
> > > > > > > 
> > > > > > > >  I think it's worthwhile to rename
> > > > > > > > level to orig_level and rewrite as:
> > > > > > > > 
> > > > > > > > 	if (orig_level && !pic->irq_states[e->irqchip.pin])
> > > > > > > > 		kvm_pic_set_irq(pic, e->irqchip.pin, 0);
> > > > > > > > 
> > > > > > > > This both makes the logic clear without need for comments and
> > > > > > > > saves some cycles on pic in case nothing actually changed.
> > > > > > > 
> > > > > > > That may work, but it's not actually the same thing.  kvm_set_irq(,,,0)
> > > > > > > will clear the bit and call kvm_pic_set_irq with the new irq_states
> > > > > > > value, whether it's 0 or 1.  The optimization I make is to only call
> > > > > > > kvm_pic_set_irq if we've "changed" irq_states.  You're taking that one
> > > > > > > step further to "changed and is now 0".  I don't know if that's correct
> > > > > > > behavior.
> > > > > > 
> > > > > > If not then I don't understand. You clear a bit
> > > > > > in a word. You never change it to 1, do you?
> > > > > 
> > > > > Correct, but kvm_set_irq(,,,0) may call kvm_pic_set_irq(,,1) if other
> > > > > source IDs are still asserting the interrupt.  Your proposal assumes
> > > > > that unless irq_states is also 0 we don't need to call kvm_pic_set_irq,
> > > > > and I don't know if that's correct.
> > > > 
> > > > Well you are asked to clear some id and level was 1. So we know
> > > > interrupt was asserted. Either we clear it or we don't. No?
> > > > 
> > > > > > 
> > > > > > But this brings another question:
> > > > > > 
> > > > > > static inline int kvm_irq_line_state(unsigned long *irq_state,
> > > > > >                                      int irq_source_id, int level)
> > > > > > {
> > > > > >         /* Logical OR for level trig interrupt */
> > > > > >         if (level)
> > > > > >                 set_bit(irq_source_id, irq_state);
> > > > > >         else
> > > > > >                 clear_bit(irq_source_id, irq_state);
> > > > > > 
> > > > > > 
> > > > > > ^^^^^^^^^^^
> > > > > > above uses locked instructions
> > > > > > 
> > > > > >         return !!(*irq_state);
> > > > > > 
> > > > > > 
> > > > > > above doesn't
> > > > > > 
> > > > > > }
> > > > > > 
> > > > > > 
> > > > > > > why the inconsistency?
> > > > > 
> > > > > Note that set/clear_bit are not locked instructions,
> > > > 
> > > > On x86 they are:
> > > > static __always_inline void
> > > > set_bit(unsigned int nr, volatile unsigned long *addr)
> > > > {
> > > >         if (IS_IMMEDIATE(nr)) {
> > > >                 asm volatile(LOCK_PREFIX "orb %1,%0"
> > > >                         : CONST_MASK_ADDR(nr, addr)
> > > >                         : "iq" ((u8)CONST_MASK(nr))
> > > >                         : "memory");
> > > >         } else {
> > > >                 asm volatile(LOCK_PREFIX "bts %1,%0"
> > > >                         : BITOP_ADDR(addr) : "Ir" (nr) : "memory");
> > > >         }
> > > > }
> > > > 
> > > > > but atomic
> > > > > instructions and it could be argued that reading the value is also
> > > > > atomic.  At least that was my guess when I stumbled across the same
> > > > > yesterday.  IMHO, we're going off into the weeds again with these last
> > > > > two patches.  It may be a valid optimization, but it really has no
> > > > > bearing on the meat of the series (and afaict, no significant
> > > > > performance difference either).
> > > > 
> > > > For me it's not a performance thing. IMO code is cleaner without this locking:
> > > > we add a lock but only use it in some cases, so the rules become really
> > > > complex.
> > > 
> > > Seriously?
> > > 
> > >         spin_lock(&irqfd->source->lock);
> > >         if (!irqfd->source->level_asserted) {
> > >                 kvm_set_irq(irqfd->kvm, irqfd->source->id, irqfd->gsi, 1);
> > >                 irqfd->source->level_asserted = true;
> > >         }
> > >         spin_unlock(&irqfd->source->lock);
> > > 
> > > ...
> > > 
> > >         spin_lock(&eoifd->source->lock);
> > >         if (eoifd->source->level_asserted) {
> > >                 kvm_set_irq(eoifd->kvm,
> > >                             eoifd->source->id, eoifd->notifier.gsi, 0);
> > >                 eoifd->source->level_asserted = false;
> > >                 eventfd_signal(eoifd->eventfd, 1);
> > >         }
> > >         spin_unlock(&eoifd->source->lock);
> > > 
> > > 
> > > Locking doesn't get much more straightforward than that
> > 
> > Don't look at it in isolation. You are now calling kvm_set_irq
> > from under a spinlock. You are saying it is always safe but
> > this seems far from obvious. kvm_set_irq used to be
> > unsafe from an atomic context.
> 
> Device assignment has been calling kvm_set_irq from atomic context for
> quite a long time.

Only for MSI. That's an exception (and it's also a messy one).

> > > >   And current code looks buggy if yes we need to fix it somehow.
> > > 
> > > 
> > > Which to me seems to indicate this should be handled as a separate
> > > effort.
> > 
> > A separate patchset, sure. But likely a prerequisite: we still need to
> > look at all the code. Let's not copy bugs, need to fix them.
> 
> This looks tangential to me unless you can come up with an actual reason
> the above spinlock usage is incorrect or insufficient.

You copy the same pattern that seems racy. So you double the
amount of code that would need to be fixed.

-- 
MST
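
The inconsistency raised about kvm_irq_line_state() is that the bit is set
or cleared with an atomic operation while the resulting level comes from a
separate, plain read of the same word. Whether that is an actual race is not
settled in this thread. Purely as an illustration, the C11 sketch below
derives the level from the value returned by the same atomic
read-modify-write. line_state_single_rmw() is a hypothetical name, and this
is user-space code using <stdatomic.h>, not the kernel bitops API.

```c
#include <assert.h>
#include <stdatomic.h>

/* Set or clear this source's bit and return the resulting line level
 * (logical OR of all sources) without a second load of the word. */
static int line_state_single_rmw(atomic_ulong *irq_state, int source_id,
                                 int level)
{
    unsigned long mask = 1UL << source_id;

    if (level) {
        atomic_fetch_or(irq_state, mask);
        return 1;  /* our own bit is set, so the OR is 1 */
    }
    /* fetch_and returns the old word; drop our bit to get the new one. */
    return (atomic_fetch_and(irq_state, ~mask) & ~mask) != 0;
}
```

This only makes the returned level consistent with the bit operation itself.
It does not order the later chip update against other sources changing the
word concurrently, which would still need a lock or a similar scheme.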

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-17 15:57                   ` Michael S. Tsirkin
@ 2012-07-17 16:01                     ` Gleb Natapov
  2012-07-17 16:08                     ` Alex Williamson
  1 sibling, 0 replies; 96+ messages in thread
From: Gleb Natapov @ 2012-07-17 16:01 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Alex Williamson, avi, kvm, linux-kernel, jan.kiszka

On Tue, Jul 17, 2012 at 06:57:01PM +0300, Michael S. Tsirkin wrote:
> On Tue, Jul 17, 2012 at 09:51:41AM -0600, Alex Williamson wrote:
> > On Tue, 2012-07-17 at 18:36 +0300, Michael S. Tsirkin wrote:
> > > On Tue, Jul 17, 2012 at 09:20:11AM -0600, Alex Williamson wrote:
> > > > On Tue, 2012-07-17 at 17:53 +0300, Michael S. Tsirkin wrote:
> > > > > On Tue, Jul 17, 2012 at 08:21:51AM -0600, Alex Williamson wrote:
> > > > > > On Tue, 2012-07-17 at 17:08 +0300, Michael S. Tsirkin wrote:
> > > > > > > On Tue, Jul 17, 2012 at 07:56:09AM -0600, Alex Williamson wrote:
> > > > > > > > On Tue, 2012-07-17 at 13:14 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > On Mon, Jul 16, 2012 at 02:34:03PM -0600, Alex Williamson wrote:
> > > > > > > > > > This is an alternative to kvm_set_irq(,,,0) which returns the previous
> > > > > > > > > > assertion state of the interrupt and does nothing if it isn't changed.
> > > > > > > > > > 
> > > > > > > > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > > > > > > > > ---
> > > > > > > > > > 
> > > > > > > > > >  include/linux/kvm_host.h |    3 ++
> > > > > > > > > >  virt/kvm/irq_comm.c      |   78 ++++++++++++++++++++++++++++++++++++++++++++++
> > > > > > > > > >  2 files changed, 81 insertions(+)
> > > > > > > > > > 
> > > > > > > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > > > > > > > index a7661c0..6c168f1 100644
> > > > > > > > > > --- a/include/linux/kvm_host.h
> > > > > > > > > > +++ b/include/linux/kvm_host.h
> > > > > > > > > > @@ -219,6 +219,8 @@ struct kvm_kernel_irq_routing_entry {
> > > > > > > > > >  	u32 type;
> > > > > > > > > >  	int (*set)(struct kvm_kernel_irq_routing_entry *e,
> > > > > > > > > >  		   struct kvm *kvm, int irq_source_id, int level);
> > > > > > > > > > +	int (*clear)(struct kvm_kernel_irq_routing_entry *e,
> > > > > > > > > > +		     struct kvm *kvm, int irq_source_id);
> > > > > > > > > >  	union {
> > > > > > > > > >  		struct {
> > > > > > > > > >  			unsigned irqchip;
> > > > > > > > > > @@ -629,6 +631,7 @@ void kvm_get_intr_delivery_bitmask(struct kvm_ioapic *ioapic,
> > > > > > > > > >  				   unsigned long *deliver_bitmask);
> > > > > > > > > >  #endif
> > > > > > > > > >  int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level);
> > > > > > > > > > +int kvm_clear_irq(struct kvm *kvm, int irq_source_id, u32 irq);
> > > > > > > > > >  int kvm_set_msi(struct kvm_kernel_irq_routing_entry *irq_entry, struct kvm *kvm,
> > > > > > > > > >  		int irq_source_id, int level);
> > > > > > > > > >  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
> > > > > > > > > > diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> > > > > > > > > > index 5afb431..76e8f22 100644
> > > > > > > > > > --- a/virt/kvm/irq_comm.c
> > > > > > > > > > +++ b/virt/kvm/irq_comm.c
> > > > > > > > > > @@ -68,6 +68,42 @@ static int kvm_set_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
> > > > > > > > > >  	return kvm_ioapic_set_irq(ioapic, e->irqchip.pin, level);
> > > > > > > > > >  }
> > > > > > > > > >  
> > > > > > > > > > +static inline int kvm_clear_irq_line_state(unsigned long *irq_state,
> > > > > > > > > > +					    int irq_source_id)
> > > > > > > > > > +{
> > > > > > > > > > +	return !!test_and_clear_bit(irq_source_id, irq_state);
> > > > > > > > > > +}
> > > > > > > > > > +
> > > > > > > > > > +static int kvm_clear_pic_irq(struct kvm_kernel_irq_routing_entry *e,
> > > > > > > > > > +			     struct kvm *kvm, int irq_source_id)
> > > > > > > > > > +{
> > > > > > > > > > +#ifdef CONFIG_X86
> > > > > > > > > > +	struct kvm_pic *pic = pic_irqchip(kvm);
> > > > > > > > > > +	int level = kvm_clear_irq_line_state(&pic->irq_states[e->irqchip.pin],
> > > > > > > > > > +					     irq_source_id);
> > > > > > > > > > +	if (level)
> > > > > > > > > > +		kvm_pic_set_irq(pic, e->irqchip.pin,
> > > > > > > > > > +				!!pic->irq_states[e->irqchip.pin]);
> > > > > > > > > > +	return level;
> > > > > > > > > 
> > > > > > > > > I think I begin to understand: if (level) checks it was previously set,
> > > > > > > > > and then we clear if needed?
> > > > > > > > 
> > > > > > > > It's actually very simple, if we change anything in irq_states, then
> > > > > > > > update via the chip specific set_irq function.
> > > > > > > > 
> > > > > > > > >  I think it's worthwhile to rename
> > > > > > > > > level to orig_level and rewrite as:
> > > > > > > > > 
> > > > > > > > > 	if (orig_level && !pic->irq_states[e->irqchip.pin])
> > > > > > > > > 		kvm_pic_set_irq(pic, e->irqchip.pin, 0);
> > > > > > > > > 
> > > > > > > > > This both makes the logic clear without need for comments and
> > > > > > > > > saves some cycles on pic in case nothing actually changed.
> > > > > > > > 
> > > > > > > > That may work, but it's not actually the same thing.  kvm_set_irq(,,,0)
> > > > > > > > will clear the bit and call kvm_pic_set_irq with the new irq_states
> > > > > > > > value, whether it's 0 or 1.  The optimization I make is to only call
> > > > > > > > kvm_pic_set_irq if we've "changed" irq_states.  You're taking that one
> > > > > > > > step further to "changed and is now 0".  I don't know if that's correct
> > > > > > > > behavior.
> > > > > > > 
> > > > > > > If not then I don't understand. You clear a bit
> > > > > > > in a word. You never change it to 1, do you?
> > > > > > 
> > > > > > Correct, but kvm_set_irq(,,,0) may call kvm_pic_set_irq(,,1) if other
> > > > > > source IDs are still asserting the interrupt.  Your proposal assumes
> > > > > > that unless irq_states is also 0 we don't need to call kvm_pic_set_irq,
> > > > > > and I don't know if that's correct.
> > > > > 
> > > > > Well you are asked to clear some id and level was 1. So we know
> > > > > interrupt was asserted. Either we clear it or we don't. No?
> > > > > 
> > > > > > > 
> > > > > > > But this brings another question:
> > > > > > > 
> > > > > > > static inline int kvm_irq_line_state(unsigned long *irq_state,
> > > > > > >                                      int irq_source_id, int level)
> > > > > > > {
> > > > > > >         /* Logical OR for level trig interrupt */
> > > > > > >         if (level)
> > > > > > >                 set_bit(irq_source_id, irq_state);
> > > > > > >         else
> > > > > > >                 clear_bit(irq_source_id, irq_state);
> > > > > > > 
> > > > > > > 
> > > > > > > ^^^^^^^^^^^
> > > > > > > above uses locked instructions
> > > > > > > 
> > > > > > >         return !!(*irq_state);
> > > > > > > 
> > > > > > > 
> > > > > > > above doesn't
> > > > > > > 
> > > > > > > }
> > > > > > > 
> > > > > > > 
> > > > > > > > why the inconsistency?
> > > > > > 
> > > > > > Note that set/clear_bit are not locked instructions,
> > > > > 
> > > > > On x86 they are:
> > > > > static __always_inline void
> > > > > set_bit(unsigned int nr, volatile unsigned long *addr)
> > > > > {
> > > > >         if (IS_IMMEDIATE(nr)) {
> > > > >                 asm volatile(LOCK_PREFIX "orb %1,%0"
> > > > >                         : CONST_MASK_ADDR(nr, addr)
> > > > >                         : "iq" ((u8)CONST_MASK(nr))
> > > > >                         : "memory");
> > > > >         } else {
> > > > >                 asm volatile(LOCK_PREFIX "bts %1,%0"
> > > > >                         : BITOP_ADDR(addr) : "Ir" (nr) : "memory");
> > > > >         }
> > > > > }
> > > > > 
> > > > > > but atomic
> > > > > > instructions and it could be argued that reading the value is also
> > > > > > atomic.  At least that was my guess when I stumbled across the same
> > > > > > yesterday.  IMHO, we're going off into the weeds again with these last
> > > > > > two patches.  It may be a valid optimization, but it really has no
> > > > > > bearing on the meat of the series (and afaict, no significant
> > > > > > performance difference either).
> > > > > 
> > > > > For me it's not a performance thing. IMO code is cleaner without this locking:
> > > > > we add a lock but only use it in some cases, so the rules become really
> > > > > complex.
> > > > 
> > > > Seriously?
> > > > 
> > > >         spin_lock(&irqfd->source->lock);
> > > >         if (!irqfd->source->level_asserted) {
> > > >                 kvm_set_irq(irqfd->kvm, irqfd->source->id, irqfd->gsi, 1);
> > > >                 irqfd->source->level_asserted = true;
> > > >         }
> > > >         spin_unlock(&irqfd->source->lock);
> > > > 
> > > > ...
> > > > 
> > > >         spin_lock(&eoifd->source->lock);
> > > >         if (eoifd->source->level_asserted) {
> > > >                 kvm_set_irq(eoifd->kvm,
> > > >                             eoifd->source->id, eoifd->notifier.gsi, 0);
> > > >                 eoifd->source->level_asserted = false;
> > > >                 eventfd_signal(eoifd->eventfd, 1);
> > > >         }
> > > >         spin_unlock(&eoifd->source->lock);
> > > > 
> > > > 
> > > > Locking doesn't get much more straightforward than that
> > > 
> > > Don't look at it in isolation. You are now calling kvm_set_irq
> > > from under a spinlock. You are saying it is always safe but
> > > this seems far from obvious. kvm_set_irq used to be
> > > unsafe from an atomic context.
> > 
> > Device assignment has been calling kvm_set_irq from atomic context for
> > quite a long time.
> 
> Only for MSI. That's an exception (and it's also a messy one).
> 
ioapic/pic used to use mutexes for locking. But this is no longer the
case. See 46a47b1e for instance. I wasn't able to find the reason for
the commit.

> > > > >   And current code looks buggy if yes we need to fix it somehow.
> > > > 
> > > > 
> > > > Which to me seems to indicate this should be handled as a separate
> > > > effort.
> > > 
> > > A separate patchset, sure. But likely a prerequisite: we still need to
> > > look at all the code. Let's not copy bugs, need to fix them.
> > 
> > This looks tangential to me unless you can come up with an actual reason
> > the above spinlock usage is incorrect or insufficient.
> 
> You copy the same pattern that seems racy. So you double the
> amount of code that would need to be fixed.
> 
> -- 
> MST

--
			Gleb.


* Re: [PATCH v5 2/4] kvm: KVM_EOIFD, an eventfd for EOIs
  2012-07-17 15:53                   ` Michael S. Tsirkin
@ 2012-07-17 16:06                     ` Alex Williamson
  2012-07-17 16:19                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Alex Williamson @ 2012-07-17 16:06 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, 2012-07-17 at 18:53 +0300, Michael S. Tsirkin wrote:
> On Tue, Jul 17, 2012 at 09:41:09AM -0600, Alex Williamson wrote:
> > On Tue, 2012-07-17 at 18:13 +0300, Michael S. Tsirkin wrote:
> > > On Tue, Jul 17, 2012 at 08:57:04AM -0600, Alex Williamson wrote:
> > > > On Tue, 2012-07-17 at 17:42 +0300, Michael S. Tsirkin wrote:
> > > > > On Tue, Jul 17, 2012 at 08:29:43AM -0600, Alex Williamson wrote:
> > > > > > On Tue, 2012-07-17 at 17:10 +0300, Michael S. Tsirkin wrote:
> > > > > > > On Tue, Jul 17, 2012 at 07:59:16AM -0600, Alex Williamson wrote:
> > > > > > > > On Tue, 2012-07-17 at 13:21 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > On Mon, Jul 16, 2012 at 02:33:55PM -0600, Alex Williamson wrote:
> > > > > > > > > > +	if (args->flags & KVM_EOIFD_FLAG_LEVEL_IRQFD) {
> > > > > > > > > > +		struct _irqfd *irqfd = _irqfd_fdget_lock(kvm, args->irqfd);
> > > > > > > > > > +		if (IS_ERR(irqfd)) {
> > > > > > > > > > +			ret = PTR_ERR(irqfd);
> > > > > > > > > > +			goto fail;
> > > > > > > > > > +		}
> > > > > > > > > > +
> > > > > > > > > > +		gsi = irqfd->gsi;
> > > > > > > > > > +		level_irqfd = eventfd_ctx_get(irqfd->eventfd);
> > > > > > > > > > +		source = _irq_source_get(irqfd->source);
> > > > > > > > > > +		_irqfd_put_unlock(irqfd);
> > > > > > > > > > +		if (!source) {
> > > > > > > > > > +			ret = -EINVAL;
> > > > > > > > > > +			goto fail;
> > > > > > > > > > +		}
> > > > > > > > > > +	} else {
> > > > > > > > > > +		ret = -EINVAL;
> > > > > > > > > > +		goto fail;
> > > > > > > > > > +	}
> > > > > > > > > > +
> > > > > > > > > > +	INIT_LIST_HEAD(&eoifd->list);
> > > > > > > > > > +	eoifd->kvm = kvm;
> > > > > > > > > > +	eoifd->eventfd = eventfd;
> > > > > > > > > > +	eoifd->source = source;
> > > > > > > > > > +	eoifd->level_irqfd = level_irqfd;
> > > > > > > > > > +	eoifd->notifier.gsi = gsi;
> > > > > > > > > > +	eoifd->notifier.irq_acked = eoifd_event;
> > > > > > > > > 
> > > > > > > > > OK so this means eoifd keeps a reference to the irqfd.
> > > > > > > > > And since this is the case, can't we drop the reference counting
> > > > > > > > > around source ids now? Everything is referenced through irqfd.
> > > > > > > > 
> > > > > > > > Holding a reference and using it as a reference count are not the same
> > > > > > > > thing.  What if another module holds a reference to this eventfd?  How
> > > > > > > > do we do anything on release?
> > > > > > > 
> > > > > > > We don't as there is no release, and using kref on source id does not
> > > > > > > help: it just never gets invoked.
> > > > > > 
> > > > > > Please work out how you think it should work and let me know, I don't
> > > > > > see it.  We have an irq source id that needs to be allocated by irqfd
> > > > > > and returned when it's unused.  It becomes unused when neither irqfd nor
> > > > > > eoifd are making use of it.  irqfd and eoifd may be closed in any order.
> > > > > > Use of the source id is what we're reference counting, which is why it's
> > > > > > in struct _irq_source.  How can I use an eventfd reference for the same?
> > > > > > I don't know when it's unused.  I don't know who else holds a reference
> > > > > > to it...  Doesn't make sense to me.  Feels like attempting to squat on
> > > > > > someone else's object.
> > > > > > 
> > > > > > 
> > > > > 
> > > > > eoifd should prevent irqfd from being released.
> > > > 
> > > > Why?  Note that this is actually quite difficult too.  We can't fail a
> > > > release, nobody checks close(3p) return.  Blocking a release is likely
> > > > to cause all sorts of problems, so what you mean is that irqfd should
> > > > linger around until there are no references to it... but that's exactly
> > > > what struct _irq_source is for, is to hold the bits that we care about
> > > > references to and automatically release it when there are none.
> > > 
> > > No no. You *already* prevent it. You take a reference to the eventfd
> > > context.
> > 
> > Right, which keeps the fd from going away, not the struct _irqfd.
> 
> _irqfd too.


How so?


> > > > >   It already keeps
> > > > > a reference to it so it prevents irqfd from going away by userspace
> > > > > closing the fd.
> > > > 
> > > > Wrong, eoifd holds a reference to the eventfd for the irqfd, so it
> > > > prevents the fd from going away, not the irqfd.
> > > 
> > > When the fd is not going away an ioctl is the only other way for
> > > it to go away.
> > 
> > It doesn't do any good to fail the ioctl if close(fd) allows it.
> 
> allows what? It does nothing.
> 
> > > > >   But it can still be released with deassign.
> > > > > An easy solution is to fail deassign of irqfd if there is
> > > > > eoifd bound to it.
> > > > 
> > > > I don't know why we would impose such a bizarre usage model when
> > > > reference counting on struct _irq_source seems to handle this nicely
> > > > already.
> > > 
> > > Well eventfd gets an irqfd. What does it mean if said irqfd gets
> > > deassigned, and potentially assigned an unrelated interrupt?
> > > I think what I would expect is for it to handle the new interrupt.
> > > This is hard to implement so let us fail this.
> > 
> > Ah, so an actual problem, let's solve this.  Why wouldn't we just search
> > the list of eoifds and see if this level_irqfd is already used?  If we
> > find it and it's compatible, we can get a reference to the _irq_source
> > and "re-attach" the irqfd.  If it's not compatible, fail the KVM_IRQFD.
> > If the KVM_IRQFD is for an edge irqfd, I think we let it go.
> 
> This is just confusing. Userspace has no idea that you are reusing fds
> behind the scenes. assign is not the problem, deassign is.
> So fail *that*.
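
For reference, the lifetime rule quoted above (the irq source id is
returned only once neither irqfd nor eoifd uses it, and the two may be
closed in any order) can be sketched in userspace C. This is only a
model under stated assumptions: struct source_model and the
source_init/source_get/source_put helpers are hypothetical names, and
a plain counter stands in for whatever the patch actually uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a shared irq source; not the kernel structure. */
struct source_model {
	int id;        /* irq source id being tracked */
	int refs;      /* number of current users (irqfd, eoifd) */
	bool released; /* set once the id has been handed back */
};

/* The first user (modelled on irqfd) allocates the source with one reference. */
static void source_init(struct source_model *s, int id)
{
	s->id = id;
	s->refs = 1;
	s->released = false;
}

/* A second user (modelled on eoifd) takes a reference; NULL if already gone. */
static struct source_model *source_get(struct source_model *s)
{
	if (s->released)
		return NULL;
	s->refs++;
	return s;
}

/* Drop one reference; the id is returned only when the last user goes away. */
static void source_put(struct source_model *s)
{
	assert(s->refs > 0);
	if (--s->refs == 0)
		s->released = true;
}
```

In this model it does not matter which user calls source_put() first:
the id stays allocated until the second put, which is the property the
quoted text asks of the reference count.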



* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-17 15:57                   ` Michael S. Tsirkin
  2012-07-17 16:01                     ` Gleb Natapov
@ 2012-07-17 16:08                     ` Alex Williamson
  2012-07-17 16:14                       ` Michael S. Tsirkin
  2012-07-17 16:36                       ` Michael S. Tsirkin
  1 sibling, 2 replies; 96+ messages in thread
From: Alex Williamson @ 2012-07-17 16:08 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, 2012-07-17 at 18:57 +0300, Michael S. Tsirkin wrote:
> On Tue, Jul 17, 2012 at 09:51:41AM -0600, Alex Williamson wrote:
> > On Tue, 2012-07-17 at 18:36 +0300, Michael S. Tsirkin wrote:
> > > On Tue, Jul 17, 2012 at 09:20:11AM -0600, Alex Williamson wrote:
> > > > On Tue, 2012-07-17 at 17:53 +0300, Michael S. Tsirkin wrote:
> > > > > On Tue, Jul 17, 2012 at 08:21:51AM -0600, Alex Williamson wrote:
> > > > > > On Tue, 2012-07-17 at 17:08 +0300, Michael S. Tsirkin wrote:
> > > > > > > On Tue, Jul 17, 2012 at 07:56:09AM -0600, Alex Williamson wrote:
> > > > > > > > On Tue, 2012-07-17 at 13:14 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > On Mon, Jul 16, 2012 at 02:34:03PM -0600, Alex Williamson wrote:
> > > > > > > > > > This is an alternative to kvm_set_irq(,,,0) which returns the previous
> > > > > > > > > > assertion state of the interrupt and does nothing if it isn't changed.
> > > > > > > > > > 
> > > > > > > > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > > > > > > > > ---
> > > > > > > > > > 
> > > > > > > > > >  include/linux/kvm_host.h |    3 ++
> > > > > > > > > >  virt/kvm/irq_comm.c      |   78 ++++++++++++++++++++++++++++++++++++++++++++++
> > > > > > > > > >  2 files changed, 81 insertions(+)
> > > > > > > > > > 
> > > > > > > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > > > > > > > index a7661c0..6c168f1 100644
> > > > > > > > > > --- a/include/linux/kvm_host.h
> > > > > > > > > > +++ b/include/linux/kvm_host.h
> > > > > > > > > > @@ -219,6 +219,8 @@ struct kvm_kernel_irq_routing_entry {
> > > > > > > > > >  	u32 type;
> > > > > > > > > >  	int (*set)(struct kvm_kernel_irq_routing_entry *e,
> > > > > > > > > >  		   struct kvm *kvm, int irq_source_id, int level);
> > > > > > > > > > +	int (*clear)(struct kvm_kernel_irq_routing_entry *e,
> > > > > > > > > > +		     struct kvm *kvm, int irq_source_id);
> > > > > > > > > >  	union {
> > > > > > > > > >  		struct {
> > > > > > > > > >  			unsigned irqchip;
> > > > > > > > > > @@ -629,6 +631,7 @@ void kvm_get_intr_delivery_bitmask(struct kvm_ioapic *ioapic,
> > > > > > > > > >  				   unsigned long *deliver_bitmask);
> > > > > > > > > >  #endif
> > > > > > > > > >  int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level);
> > > > > > > > > > +int kvm_clear_irq(struct kvm *kvm, int irq_source_id, u32 irq);
> > > > > > > > > >  int kvm_set_msi(struct kvm_kernel_irq_routing_entry *irq_entry, struct kvm *kvm,
> > > > > > > > > >  		int irq_source_id, int level);
> > > > > > > > > >  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
> > > > > > > > > > diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> > > > > > > > > > index 5afb431..76e8f22 100644
> > > > > > > > > > --- a/virt/kvm/irq_comm.c
> > > > > > > > > > +++ b/virt/kvm/irq_comm.c
> > > > > > > > > > @@ -68,6 +68,42 @@ static int kvm_set_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
> > > > > > > > > >  	return kvm_ioapic_set_irq(ioapic, e->irqchip.pin, level);
> > > > > > > > > >  }
> > > > > > > > > >  
> > > > > > > > > > +static inline int kvm_clear_irq_line_state(unsigned long *irq_state,
> > > > > > > > > > +					    int irq_source_id)
> > > > > > > > > > +{
> > > > > > > > > > +	return !!test_and_clear_bit(irq_source_id, irq_state);
> > > > > > > > > > +}
> > > > > > > > > > +
> > > > > > > > > > +static int kvm_clear_pic_irq(struct kvm_kernel_irq_routing_entry *e,
> > > > > > > > > > +			     struct kvm *kvm, int irq_source_id)
> > > > > > > > > > +{
> > > > > > > > > > +#ifdef CONFIG_X86
> > > > > > > > > > +	struct kvm_pic *pic = pic_irqchip(kvm);
> > > > > > > > > > +	int level = kvm_clear_irq_line_state(&pic->irq_states[e->irqchip.pin],
> > > > > > > > > > +					     irq_source_id);
> > > > > > > > > > +	if (level)
> > > > > > > > > > +		kvm_pic_set_irq(pic, e->irqchip.pin,
> > > > > > > > > > +				!!pic->irq_states[e->irqchip.pin]);
> > > > > > > > > > +	return level;
> > > > > > > > > 
> > > > > > > > > I think I begin to understand: if (level) checks it was previously set,
> > > > > > > > > and then we clear if needed?
> > > > > > > > 
> > > > > > > > It's actually very simple, if we change anything in irq_states, then
> > > > > > > > update via the chip specific set_irq function.
> > > > > > > > 
> > > > > > > > >  I think it's worthwhile to rename
> > > > > > > > > level to orig_level and rewrite as:
> > > > > > > > > 
> > > > > > > > > 	if (orig_level && !pic->irq_states[e->irqchip.pin])
> > > > > > > > > 		kvm_pic_set_irq(pic, e->irqchip.pin, 0);
> > > > > > > > > 
> > > > > > > > > This both makes the logic clear without need for comments and
> > > > > > > > > saves some cycles on pic in case nothing actually changed.
> > > > > > > > 
> > > > > > > > That may work, but it's not actually the same thing.  kvm_set_irq(,,,0)
> > > > > > > > will clear the bit and call kvm_pic_set_irq with the new irq_states
> > > > > > > > value, whether it's 0 or 1.  The optimization I make is to only call
> > > > > > > > kvm_pic_set_irq if we've "changed" irq_states.  You're taking that one
> > > > > > > > step further to "changed and is now 0".  I don't know if that's correct
> > > > > > > > behavior.
> > > > > > > 
> > > > > > > If not then I don't understand. You clear a bit
> > > > > > > in a word. You never change it to 1, do you?
> > > > > > 
> > > > > > Correct, but kvm_set_irq(,,,0) may call kvm_pic_set_irq(,,1) if other
> > > > > > source IDs are still asserting the interrupt.  Your proposal assumes
> > > > > > that unless irq_states is also 0 we don't need to call kvm_pic_set_irq,
> > > > > > and I don't know if that's correct.
> > > > > 
> > > > > Well you are asked to clear some id and level was 1. So we know
> > > > > interrupt was asserted. Either we clear it or we don't. No?
> > > > > 
> > > > > > > 
> > > > > > > But this brings another question:
> > > > > > > 
> > > > > > > static inline int kvm_irq_line_state(unsigned long *irq_state,
> > > > > > >                                      int irq_source_id, int level)
> > > > > > > {
> > > > > > >         /* Logical OR for level trig interrupt */
> > > > > > >         if (level)
> > > > > > >                 set_bit(irq_source_id, irq_state);
> > > > > > >         else
> > > > > > >                 clear_bit(irq_source_id, irq_state);
> > > > > > > 
> > > > > > > 
> > > > > > > ^^^^^^^^^^^
> > > > > > > above uses locked instructions
> > > > > > > 
> > > > > > >         return !!(*irq_state);
> > > > > > > 
> > > > > > > 
> > > > > > > above doesn't
> > > > > > > 
> > > > > > > }
> > > > > > > 
> > > > > > > 
> > > > > > > why the inconsistency?
> > > > > > 
> > > > > > Note that set/clear_bit are not locked instructions,
> > > > > 
> > > > > On x86 they are:
> > > > > static __always_inline void
> > > > > set_bit(unsigned int nr, volatile unsigned long *addr)
> > > > > {
> > > > >         if (IS_IMMEDIATE(nr)) {
> > > > >                 asm volatile(LOCK_PREFIX "orb %1,%0"
> > > > >                         : CONST_MASK_ADDR(nr, addr)
> > > > >                         : "iq" ((u8)CONST_MASK(nr))
> > > > >                         : "memory");
> > > > >         } else {
> > > > >                 asm volatile(LOCK_PREFIX "bts %1,%0"
> > > > >                         : BITOP_ADDR(addr) : "Ir" (nr) : "memory");
> > > > >         }
> > > > > }
> > > > > 
> > > > > > but atomic
> > > > > > instructions and it could be argued that reading the value is also
> > > > > > atomic.  At least that was my guess when I stumbled across the same
> > > > > > yesterday.  IMHO, we're going off into the weeds again with these last
> > > > > > two patches.  It may be a valid optimization, but it really has no
> > > > > > bearing on the meat of the series (and afaict, no significant
> > > > > > performance difference either).
> > > > > 
> > > > > For me it's not a performance thing. IMO code is cleaner without this locking:
> > > > > we add a lock but only use it in some cases, so the rules become really
> > > > > complex.
> > > > 
> > > > Seriously?
> > > > 
> > > >         spin_lock(&irqfd->source->lock);
> > > >         if (!irqfd->source->level_asserted) {
> > > >                 kvm_set_irq(irqfd->kvm, irqfd->source->id, irqfd->gsi, 1);
> > > >                 irqfd->source->level_asserted = true;
> > > >         }
> > > >         spin_unlock(&irqfd->source->lock);
> > > > 
> > > > ...
> > > > 
> > > >         spin_lock(&eoifd->source->lock);
> > > >         if (eoifd->source->level_asserted) {
> > > >                 kvm_set_irq(eoifd->kvm,
> > > >                             eoifd->source->id, eoifd->notifier.gsi, 0);
> > > >                 eoifd->source->level_asserted = false;
> > > >                 eventfd_signal(eoifd->eventfd, 1);
> > > >         }
> > > >         spin_unlock(&eoifd->source->lock);
> > > > 
> > > > 
> > > > Locking doesn't get much more straightforward than that
> > > 
> > > Don't look at it in isolation. You are now calling kvm_set_irq
> > > from under a spinlock. You are saying it is always safe but
> > > this seems far from obvious. kvm_set_irq used to be
> > > unsafe from an atomic context.
> > 
> > Device assignment has been calling kvm_set_irq from atomic context for
> > quite a long time.
> 
> Only for MSI. That's an exception (and it's also a messy one).

Nope, I see past code that used it for INTx as well.

> > > > >   And current code looks buggy if yes we need to fix it somehow.
> > > > 
> > > > 
> > > > Which to me seems to indicate this should be handled as a separate
> > > > effort.
> > > 
> > > A separate patchset, sure. But likely a prerequisite: we still need to
> > > look at all the code. Let's not copy bugs, need to fix them.
> > 
> > This looks tangential to me unless you can come up with an actual reason
> > the above spinlock usage is incorrect or insufficient.
> 
> You copy the same pattern that seems racy. So you double the
> amount of code that would need to be fixed.


_Seems_ racy, or _is_ racy?  Please identify the race.
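
For reference, the two behaviours discussed earlier in this exchange
(update the chip whenever this source's bit changed, versus update it
only when the whole word drops to zero) can be compared with a small
userspace sketch. It is an illustration only: struct fake_chip,
chip_set_irq, clear_if_changed and clear_if_last are hypothetical
names, and the test-and-clear here is not atomic.

```c
#include <assert.h>

/* Hypothetical stand-in for an irqchip: counts updates, keeps the last level. */
struct fake_chip {
	int updates;
	int last_level;
};

static void chip_set_irq(struct fake_chip *c, int level)
{
	c->updates++;
	c->last_level = level;
}

/* Clear one source bit and report whether it was set (non-atomic model). */
static int test_and_clear(unsigned long *state, int source_id)
{
	unsigned long mask = 1UL << source_id;
	int was_set = !!(*state & mask);

	*state &= ~mask;
	return was_set;
}

/* Variant A: update the chip whenever our bit changed, with the new OR. */
static int clear_if_changed(struct fake_chip *c, unsigned long *state,
			    int source_id)
{
	int was_set = test_and_clear(state, source_id);

	if (was_set)
		chip_set_irq(c, !!*state);
	return was_set;
}

/* Variant B: update the chip only when no other source still asserts. */
static int clear_if_last(struct fake_chip *c, unsigned long *state,
			 int source_id)
{
	int orig_level = test_and_clear(state, source_id);

	if (orig_level && !*state)
		chip_set_irq(c, 0);
	return orig_level;
}
```

With two sources asserting, variant A still calls the chip (with
level 1) when the first source is cleared, while variant B stays
silent until the last source is cleared. Whether the extra call
matters to the pic/ioapic is exactly the open question in the thread.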



* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-17 16:08                     ` Alex Williamson
@ 2012-07-17 16:14                       ` Michael S. Tsirkin
  2012-07-17 16:17                         ` Alex Williamson
  2012-07-18  6:27                         ` Gleb Natapov
  2012-07-17 16:36                       ` Michael S. Tsirkin
  1 sibling, 2 replies; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-17 16:14 UTC (permalink / raw)
  To: Alex Williamson; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, Jul 17, 2012 at 10:08:21AM -0600, Alex Williamson wrote:
> On Tue, 2012-07-17 at 18:57 +0300, Michael S. Tsirkin wrote:
> > On Tue, Jul 17, 2012 at 09:51:41AM -0600, Alex Williamson wrote:
> > > On Tue, 2012-07-17 at 18:36 +0300, Michael S. Tsirkin wrote:
> > > > On Tue, Jul 17, 2012 at 09:20:11AM -0600, Alex Williamson wrote:
> > > > > On Tue, 2012-07-17 at 17:53 +0300, Michael S. Tsirkin wrote:
> > > > > > On Tue, Jul 17, 2012 at 08:21:51AM -0600, Alex Williamson wrote:
> > > > > > > On Tue, 2012-07-17 at 17:08 +0300, Michael S. Tsirkin wrote:
> > > > > > > > On Tue, Jul 17, 2012 at 07:56:09AM -0600, Alex Williamson wrote:
> > > > > > > > > On Tue, 2012-07-17 at 13:14 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > > On Mon, Jul 16, 2012 at 02:34:03PM -0600, Alex Williamson wrote:
> > > > > > > > > > > This is an alternative to kvm_set_irq(,,,0) which returns the previous
> > > > > > > > > > > assertion state of the interrupt and does nothing if it isn't changed.
> > > > > > > > > > > 
> > > > > > > > > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > > > > > > > > > ---
> > > > > > > > > > > 
> > > > > > > > > > >  include/linux/kvm_host.h |    3 ++
> > > > > > > > > > >  virt/kvm/irq_comm.c      |   78 ++++++++++++++++++++++++++++++++++++++++++++++
> > > > > > > > > > >  2 files changed, 81 insertions(+)
> > > > > > > > > > > 
> > > > > > > > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > > > > > > > > index a7661c0..6c168f1 100644
> > > > > > > > > > > --- a/include/linux/kvm_host.h
> > > > > > > > > > > +++ b/include/linux/kvm_host.h
> > > > > > > > > > > @@ -219,6 +219,8 @@ struct kvm_kernel_irq_routing_entry {
> > > > > > > > > > >  	u32 type;
> > > > > > > > > > >  	int (*set)(struct kvm_kernel_irq_routing_entry *e,
> > > > > > > > > > >  		   struct kvm *kvm, int irq_source_id, int level);
> > > > > > > > > > > +	int (*clear)(struct kvm_kernel_irq_routing_entry *e,
> > > > > > > > > > > +		     struct kvm *kvm, int irq_source_id);
> > > > > > > > > > >  	union {
> > > > > > > > > > >  		struct {
> > > > > > > > > > >  			unsigned irqchip;
> > > > > > > > > > > @@ -629,6 +631,7 @@ void kvm_get_intr_delivery_bitmask(struct kvm_ioapic *ioapic,
> > > > > > > > > > >  				   unsigned long *deliver_bitmask);
> > > > > > > > > > >  #endif
> > > > > > > > > > >  int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level);
> > > > > > > > > > > +int kvm_clear_irq(struct kvm *kvm, int irq_source_id, u32 irq);
> > > > > > > > > > >  int kvm_set_msi(struct kvm_kernel_irq_routing_entry *irq_entry, struct kvm *kvm,
> > > > > > > > > > >  		int irq_source_id, int level);
> > > > > > > > > > >  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
> > > > > > > > > > > diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> > > > > > > > > > > index 5afb431..76e8f22 100644
> > > > > > > > > > > --- a/virt/kvm/irq_comm.c
> > > > > > > > > > > +++ b/virt/kvm/irq_comm.c
> > > > > > > > > > > @@ -68,6 +68,42 @@ static int kvm_set_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
> > > > > > > > > > >  	return kvm_ioapic_set_irq(ioapic, e->irqchip.pin, level);
> > > > > > > > > > >  }
> > > > > > > > > > >  
> > > > > > > > > > > +static inline int kvm_clear_irq_line_state(unsigned long *irq_state,
> > > > > > > > > > > +					    int irq_source_id)
> > > > > > > > > > > +{
> > > > > > > > > > > +	return !!test_and_clear_bit(irq_source_id, irq_state);
> > > > > > > > > > > +}
> > > > > > > > > > > +
> > > > > > > > > > > +static int kvm_clear_pic_irq(struct kvm_kernel_irq_routing_entry *e,
> > > > > > > > > > > +			     struct kvm *kvm, int irq_source_id)
> > > > > > > > > > > +{
> > > > > > > > > > > +#ifdef CONFIG_X86
> > > > > > > > > > > +	struct kvm_pic *pic = pic_irqchip(kvm);
> > > > > > > > > > > +	int level = kvm_clear_irq_line_state(&pic->irq_states[e->irqchip.pin],
> > > > > > > > > > > +					     irq_source_id);
> > > > > > > > > > > +	if (level)
> > > > > > > > > > > +		kvm_pic_set_irq(pic, e->irqchip.pin,
> > > > > > > > > > > +				!!pic->irq_states[e->irqchip.pin]);
> > > > > > > > > > > +	return level;
> > > > > > > > > > 
> > > > > > > > > > I think I begin to understand: if (level) checks it was previously set,
> > > > > > > > > > and then we clear if needed?
> > > > > > > > > 
> > > > > > > > > It's actually very simple, if we change anything in irq_states, then
> > > > > > > > > update via the chip specific set_irq function.
> > > > > > > > > 
> > > > > > > > > >  I think it's worthwhile to rename
> > > > > > > > > > level to orig_level and rewrite as:
> > > > > > > > > > 
> > > > > > > > > > 	if (orig_level && !pic->irq_states[e->irqchip.pin])
> > > > > > > > > > 		kvm_pic_set_irq(pic, e->irqchip.pin, 0);
> > > > > > > > > > 
> > > > > > > > > > This both makes the logic clear without need for comments and
> > > > > > > > > > saves some cycles on pic in case nothing actually changed.
> > > > > > > > > 
> > > > > > > > > That may work, but it's not actually the same thing.  kvm_set_irq(,,,0)
> > > > > > > > > will clear the bit and call kvm_pic_set_irq with the new irq_states
> > > > > > > > > value, whether it's 0 or 1.  The optimization I make is to only call
> > > > > > > > > kvm_pic_set_irq if we've "changed" irq_states.  You're taking that one
> > > > > > > > > step further to "changed and is now 0".  I don't know if that's correct
> > > > > > > > > behavior.
> > > > > > > > 
> > > > > > > > If not then I don't understand. You clear a bit
> > > > > > > > in a word. You never change it to 1, do you?
> > > > > > > 
> > > > > > > Correct, but kvm_set_irq(,,,0) may call kvm_pic_set_irq(,,1) if other
> > > > > > > source IDs are still asserting the interrupt.  Your proposal assumes
> > > > > > > that unless irq_states is also 0 we don't need to call kvm_pic_set_irq,
> > > > > > > and I don't know if that's correct.
> > > > > > 
> > > > > > Well you are asked to clear some id and level was 1. So we know
> > > > > > interrupt was asserted. Either we clear it or we don't. No?
> > > > > > 
> > > > > > > > 
> > > > > > > > But this brings another question:
> > > > > > > > 
> > > > > > > > static inline int kvm_irq_line_state(unsigned long *irq_state,
> > > > > > > >                                      int irq_source_id, int level)
> > > > > > > > {
> > > > > > > >         /* Logical OR for level trig interrupt */
> > > > > > > >         if (level)
> > > > > > > >                 set_bit(irq_source_id, irq_state);
> > > > > > > >         else
> > > > > > > >                 clear_bit(irq_source_id, irq_state);
> > > > > > > > 
> > > > > > > > 
> > > > > > > > ^^^^^^^^^^^
> > > > > > > > above uses locked instructions
> > > > > > > > 
> > > > > > > >         return !!(*irq_state);
> > > > > > > > 
> > > > > > > > 
> > > > > > > > above doesn't
> > > > > > > > 
> > > > > > > > }
> > > > > > > > 
> > > > > > > > 
> > > > > > > > why the inconsistency?
> > > > > > > 
> > > > > > > Note that set/clear_bit are not locked instructions,
> > > > > > 
> > > > > > On x86 they are:
> > > > > > static __always_inline void
> > > > > > set_bit(unsigned int nr, volatile unsigned long *addr)
> > > > > > {
> > > > > >         if (IS_IMMEDIATE(nr)) {
> > > > > >                 asm volatile(LOCK_PREFIX "orb %1,%0"
> > > > > >                         : CONST_MASK_ADDR(nr, addr)
> > > > > >                         : "iq" ((u8)CONST_MASK(nr))
> > > > > >                         : "memory");
> > > > > >         } else {
> > > > > >                 asm volatile(LOCK_PREFIX "bts %1,%0"
> > > > > >                         : BITOP_ADDR(addr) : "Ir" (nr) : "memory");
> > > > > >         }
> > > > > > }
> > > > > > 
> > > > > > > but atomic
> > > > > > > instructions and it could be argued that reading the value is also
> > > > > > > atomic.  At least that was my guess when I stumbled across the same
> > > > > > > yesterday.  IMHO, we're going off into the weeds again with these last
> > > > > > > two patches.  It may be a valid optimization, but it really has no
> > > > > > > bearing on the meat of the series (and afaict, no significant
> > > > > > > performance difference either).
> > > > > > 
> > > > > > For me it's not a performance thing. IMO code is cleaner without this locking:
> > > > > > we add a lock but only use it in some cases, so the rules become really
> > > > > > complex.
> > > > > 
> > > > > Seriously?
> > > > > 
> > > > >         spin_lock(&irqfd->source->lock);
> > > > >         if (!irqfd->source->level_asserted) {
> > > > >                 kvm_set_irq(irqfd->kvm, irqfd->source->id, irqfd->gsi, 1);
> > > > >                 irqfd->source->level_asserted = true;
> > > > >         }
> > > > >         spin_unlock(&irqfd->source->lock);
> > > > > 
> > > > > ...
> > > > > 
> > > > >         spin_lock(&eoifd->source->lock);
> > > > >         if (eoifd->source->level_asserted) {
> > > > >                 kvm_set_irq(eoifd->kvm,
> > > > >                             eoifd->source->id, eoifd->notifier.gsi, 0);
> > > > >                 eoifd->source->level_asserted = false;
> > > > >                 eventfd_signal(eoifd->eventfd, 1);
> > > > >         }
> > > > >         spin_unlock(&eoifd->source->lock);
> > > > > 
> > > > > 
> > > > > Locking doesn't get much more straightforward than that
> > > > 
> > > > Don't look at it in isolation. You are now calling kvm_set_irq
> > > > from under a spinlock. You are saying it is always safe but
> > > > this seems far from obvious. kvm_set_irq used to be
> > > > unsafe from an atomic context.
> > > 
> > > Device assignment has been calling kvm_set_irq from atomic context for
> > > quite a long time.
> > 
> > Only for MSI. That's an exception (and it's also a messy one).
> 
> Nope, I see past code that used it for INTx as well.
> 
> > > > > >   And current code looks buggy if yes we need to fix it somehow.
> > > > > 
> > > > > 
> > > > > Which to me seems to indicate this should be handled as a separate
> > > > > effort.
> > > > 
> > > > A separate patchset, sure. But likely a prerequisite: we still need to
> > > > look at all the code. Let's not copy bugs, need to fix them.
> > > 
> > > This looks tangential to me unless you can come up with an actual reason
> > > the above spinlock usage is incorrect or insufficient.
> > 
> > You copy the same pattern that seems racy. So you double the
> > amount of code that would need to be fixed.
> 
> 
> _Seems_ racy, or _is_ racy?  Please identify the race.

Look at this:

static inline int kvm_irq_line_state(unsigned long *irq_state,
                                     int irq_source_id, int level)
{
        /* Logical OR for level trig interrupt */
        if (level)
                set_bit(irq_source_id, irq_state);
        else
                clear_bit(irq_source_id, irq_state);

        return !!(*irq_state);
}


Now:
If another CPU changes some other bit after the atomic change,
it looks like !!(*irq_state) might return a stale value.

CPU 0 clears bit 0. CPU 1 sets bit 1. CPU 1 sets level to 1.
If CPU 0 sees a stale value now, it will return 0 here
and the interrupt will get cleared even though bit 1 is still set.


Maybe this is not a problem. But in that case IMO it needs
a comment explaining why and why it's not a problem in
your code.
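
The interleaving above can be replayed step by step in a
single-threaded userspace model. This is only a sketch under stated
assumptions: struct model and the helper names are hypothetical, the
bit updates stand in for set_bit/clear_bit, and it assumes the sampled
level is handed to the chip later with no lock held across both steps.
Whether the real code paths allow this ordering is the open question.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model, not kernel code. */
struct model {
	unsigned long irq_state; /* one bit per irq source id */
	int line;                /* level last handed to the irqchip */
};

/* Stand-in for the atomic set_bit/clear_bit on one source bit. */
static void update_bit(struct model *m, int source_id, int level)
{
	assert(source_id >= 0 && source_id < 8 * (int)sizeof(m->irq_state));
	if (level)
		m->irq_state |= 1UL << source_id;
	else
		m->irq_state &= ~(1UL << source_id);
}

/* Separate, plain read of the whole word: the "logical OR". */
static int sample_level(const struct model *m)
{
	return !!m->irq_state;
}

/* Replay one interleaving; returns true if line and irq_state disagree. */
static bool replay_interleaving(void)
{
	struct model m = { .irq_state = 1UL << 0, .line = 1 };
	int level0, level1;

	update_bit(&m, 0, 0);      /* CPU 0 clears bit 0 */
	level0 = sample_level(&m); /* CPU 0 samples the word: 0 */
	update_bit(&m, 1, 1);      /* CPU 1 sets bit 1 */
	level1 = sample_level(&m); /* CPU 1 samples the word: 1 */
	m.line = level1;           /* CPU 1 sets level to 1 */
	m.line = level0;           /* CPU 0 applies its older sample */

	return m.line != sample_level(&m);
}
```

In this replay the line ends up low while bit 1 is still set, which is
the outcome described above.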

-- 
MST


* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-17 16:14                       ` Michael S. Tsirkin
@ 2012-07-17 16:17                         ` Alex Williamson
  2012-07-17 16:21                           ` Michael S. Tsirkin
  2012-07-18  6:27                         ` Gleb Natapov
  1 sibling, 1 reply; 96+ messages in thread
From: Alex Williamson @ 2012-07-17 16:17 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, 2012-07-17 at 19:14 +0300, Michael S. Tsirkin wrote:
> On Tue, Jul 17, 2012 at 10:08:21AM -0600, Alex Williamson wrote:
> > On Tue, 2012-07-17 at 18:57 +0300, Michael S. Tsirkin wrote:
> > > On Tue, Jul 17, 2012 at 09:51:41AM -0600, Alex Williamson wrote:
> > > > On Tue, 2012-07-17 at 18:36 +0300, Michael S. Tsirkin wrote:
> > > > > On Tue, Jul 17, 2012 at 09:20:11AM -0600, Alex Williamson wrote:
> > > > > > On Tue, 2012-07-17 at 17:53 +0300, Michael S. Tsirkin wrote:
> > > > > > > On Tue, Jul 17, 2012 at 08:21:51AM -0600, Alex Williamson wrote:
> > > > > > > > On Tue, 2012-07-17 at 17:08 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > On Tue, Jul 17, 2012 at 07:56:09AM -0600, Alex Williamson wrote:
> > > > > > > > > > On Tue, 2012-07-17 at 13:14 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > > > On Mon, Jul 16, 2012 at 02:34:03PM -0600, Alex Williamson wrote:
> > > > > > > > > > > > This is an alternative to kvm_set_irq(,,,0) which returns the previous
> > > > > > > > > > > > assertion state of the interrupt and does nothing if it isn't changed.
> > > > > > > > > > > > 
> > > > > > > > > > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > > > > > > > > > > ---
> > > > > > > > > > > > 
> > > > > > > > > > > >  include/linux/kvm_host.h |    3 ++
> > > > > > > > > > > >  virt/kvm/irq_comm.c      |   78 ++++++++++++++++++++++++++++++++++++++++++++++
> > > > > > > > > > > >  2 files changed, 81 insertions(+)
> > > > > > > > > > > > 
> > > > > > > > > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > > > > > > > > > index a7661c0..6c168f1 100644
> > > > > > > > > > > > --- a/include/linux/kvm_host.h
> > > > > > > > > > > > +++ b/include/linux/kvm_host.h
> > > > > > > > > > > > @@ -219,6 +219,8 @@ struct kvm_kernel_irq_routing_entry {
> > > > > > > > > > > >  	u32 type;
> > > > > > > > > > > >  	int (*set)(struct kvm_kernel_irq_routing_entry *e,
> > > > > > > > > > > >  		   struct kvm *kvm, int irq_source_id, int level);
> > > > > > > > > > > > +	int (*clear)(struct kvm_kernel_irq_routing_entry *e,
> > > > > > > > > > > > +		     struct kvm *kvm, int irq_source_id);
> > > > > > > > > > > >  	union {
> > > > > > > > > > > >  		struct {
> > > > > > > > > > > >  			unsigned irqchip;
> > > > > > > > > > > > @@ -629,6 +631,7 @@ void kvm_get_intr_delivery_bitmask(struct kvm_ioapic *ioapic,
> > > > > > > > > > > >  				   unsigned long *deliver_bitmask);
> > > > > > > > > > > >  #endif
> > > > > > > > > > > >  int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level);
> > > > > > > > > > > > +int kvm_clear_irq(struct kvm *kvm, int irq_source_id, u32 irq);
> > > > > > > > > > > >  int kvm_set_msi(struct kvm_kernel_irq_routing_entry *irq_entry, struct kvm *kvm,
> > > > > > > > > > > >  		int irq_source_id, int level);
> > > > > > > > > > > >  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
> > > > > > > > > > > > diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> > > > > > > > > > > > index 5afb431..76e8f22 100644
> > > > > > > > > > > > --- a/virt/kvm/irq_comm.c
> > > > > > > > > > > > +++ b/virt/kvm/irq_comm.c
> > > > > > > > > > > > @@ -68,6 +68,42 @@ static int kvm_set_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
> > > > > > > > > > > >  	return kvm_ioapic_set_irq(ioapic, e->irqchip.pin, level);
> > > > > > > > > > > >  }
> > > > > > > > > > > >  
> > > > > > > > > > > > +static inline int kvm_clear_irq_line_state(unsigned long *irq_state,
> > > > > > > > > > > > +					    int irq_source_id)
> > > > > > > > > > > > +{
> > > > > > > > > > > > +	return !!test_and_clear_bit(irq_source_id, irq_state);
> > > > > > > > > > > > +}
> > > > > > > > > > > > +
> > > > > > > > > > > > +static int kvm_clear_pic_irq(struct kvm_kernel_irq_routing_entry *e,
> > > > > > > > > > > > +			     struct kvm *kvm, int irq_source_id)
> > > > > > > > > > > > +{
> > > > > > > > > > > > +#ifdef CONFIG_X86
> > > > > > > > > > > > +	struct kvm_pic *pic = pic_irqchip(kvm);
> > > > > > > > > > > > +	int level = kvm_clear_irq_line_state(&pic->irq_states[e->irqchip.pin],
> > > > > > > > > > > > +					     irq_source_id);
> > > > > > > > > > > > +	if (level)
> > > > > > > > > > > > +		kvm_pic_set_irq(pic, e->irqchip.pin,
> > > > > > > > > > > > +				!!pic->irq_states[e->irqchip.pin]);
> > > > > > > > > > > > +	return level;
> > > > > > > > > > > 
> > > > > > > > > > > I think I begin to understand: if (level) checks it was previously set,
> > > > > > > > > > > and then we clear if needed?
> > > > > > > > > > 
> > > > > > > > > > It's actually very simple, if we change anything in irq_states, then
> > > > > > > > > > update via the chip specific set_irq function.
> > > > > > > > > > 
> > > > > > > > > > >  I think it's worthwhile to rename
> > > > > > > > > > > level to orig_level and rewrite as:
> > > > > > > > > > > 
> > > > > > > > > > > 	if (orig_level && !pic->irq_states[e->irqchip.pin])
> > > > > > > > > > > 		kvm_pic_set_irq(pic, e->irqchip.pin, 0);
> > > > > > > > > > > 
> > > > > > > > > > > This both makes the logic clear without need for comments and
> > > > > > > > > > > saves some cycles on pic in case nothing actually changed.
> > > > > > > > > > 
> > > > > > > > > > That may work, but it's not actually the same thing.  kvm_set_irq(,,,0)
> > > > > > > > > > will clear the bit and call kvm_pic_set_irq with the new irq_states
> > > > > > > > > > value, whether it's 0 or 1.  The optimization I make is to only call
> > > > > > > > > > kvm_pic_set_irq if we've "changed" irq_states.  You're taking that one
> > > > > > > > > > step further to "changed and is now 0".  I don't know if that's correct
> > > > > > > > > > behavior.
> > > > > > > > > 
> > > > > > > > > If not then I don't understand. You clear a bit
> > > > > > > > > in a word. You never change it to 1, do you?
> > > > > > > > 
> > > > > > > > Correct, but kvm_set_irq(,,,0) may call kvm_pic_set_irq(,,1) if other
> > > > > > > > source IDs are still asserting the interrupt.  Your proposal assumes
> > > > > > > > that unless irq_states is also 0 we don't need to call kvm_pic_set_irq,
> > > > > > > > and I don't know if that's correct.
> > > > > > > 
> > > > > > > Well you are asked to clear some id and level was 1. So we know
> > > > > > > interrupt was asserted. Either we clear it or we don't. No?
> > > > > > > 
> > > > > > > > > 
> > > > > > > > > But this brings another question:
> > > > > > > > > 
> > > > > > > > > static inline int kvm_irq_line_state(unsigned long *irq_state,
> > > > > > > > >                                      int irq_source_id, int level)
> > > > > > > > > {
> > > > > > > > >         /* Logical OR for level trig interrupt */
> > > > > > > > >         if (level)
> > > > > > > > >                 set_bit(irq_source_id, irq_state);
> > > > > > > > >         else
> > > > > > > > >                 clear_bit(irq_source_id, irq_state);
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > ^^^^^^^^^^^
> > > > > > > > > above uses locked instructions
> > > > > > > > > 
> > > > > > > > >         return !!(*irq_state);
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > above doesn't
> > > > > > > > > 
> > > > > > > > > }
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > why the inconsistency?
> > > > > > > > 
> > > > > > > > Note that set/clear_bit are not locked instructions,
> > > > > > > 
> > > > > > > On x86 they are:
> > > > > > > static __always_inline void
> > > > > > > set_bit(unsigned int nr, volatile unsigned long *addr)
> > > > > > > {
> > > > > > >         if (IS_IMMEDIATE(nr)) {
> > > > > > >                 asm volatile(LOCK_PREFIX "orb %1,%0"
> > > > > > >                         : CONST_MASK_ADDR(nr, addr)
> > > > > > >                         : "iq" ((u8)CONST_MASK(nr))
> > > > > > >                         : "memory");
> > > > > > >         } else {
> > > > > > >                 asm volatile(LOCK_PREFIX "bts %1,%0"
> > > > > > >                         : BITOP_ADDR(addr) : "Ir" (nr) : "memory");
> > > > > > >         }
> > > > > > > }
> > > > > > > 
> > > > > > > > but atomic
> > > > > > > > instructions and it could be argued that reading the value is also
> > > > > > > > atomic.  At least that was my guess when I stumbled across the same
> > > > > > > > yesterday.  IMHO, we're going off into the weeds again with these last
> > > > > > > > two patches.  It may be a valid optimization, but it really has no
> > > > > > > > bearing on the meat of the series (and afaict, no significant
> > > > > > > > performance difference either).
> > > > > > > 
> > > > > > > For me it's not a performance thing. IMO code is cleaner without this locking:
> > > > > > > we add a lock but only use it in some cases, so the rules become really
> > > > > > > complex.
> > > > > > 
> > > > > > Seriously?
> > > > > > 
> > > > > >         spin_lock(&irqfd->source->lock);
> > > > > >         if (!irqfd->source->level_asserted) {
> > > > > >                 kvm_set_irq(irqfd->kvm, irqfd->source->id, irqfd->gsi, 1);
> > > > > >                 irqfd->source->level_asserted = true;
> > > > > >         }
> > > > > >         spin_unlock(&irqfd->source->lock);
> > > > > > 
> > > > > > ...
> > > > > > 
> > > > > >         spin_lock(&eoifd->source->lock);
> > > > > >         if (eoifd->source->level_asserted) {
> > > > > >                 kvm_set_irq(eoifd->kvm,
> > > > > >                             eoifd->source->id, eoifd->notifier.gsi, 0);
> > > > > >                 eoifd->source->level_asserted = false;
> > > > > >                 eventfd_signal(eoifd->eventfd, 1);
> > > > > >         }
> > > > > >         spin_unlock(&eoifd->source->lock);
> > > > > > 
> > > > > > 
> > > > > > Locking doesn't get much more straightforward than that
> > > > > 
> > > > > Don't look at it in isolation. You are now calling kvm_set_irq
> > > > > from under a spinlock. You are saying it is always safe but
> > > > > this seems far from obvious. kvm_set_irq used to be
> > > > > unsafe from an atomic context.
> > > > 
> > > > Device assignment has been calling kvm_set_irq from atomic context for
> > > > quite a long time.
> > > 
> > > Only for MSI. That's an exception (and it's also a messy one).
> > 
> > Nope, I see past code that used it for INTx as well.
> > 
> > > > > > >   And current code looks buggy; if so, we need to fix it somehow.
> > > > > > 
> > > > > > 
> > > > > > Which to me seems to indicate this should be handled as a separate
> > > > > > effort.
> > > > > 
> > > > > A separate patchset, sure. But likely a prerequisite: we still need to
> > > > > look at all the code. Let's not copy bugs, need to fix them.
> > > > 
> > > > This looks tangential to me unless you can come up with an actual reason
> > > > the above spinlock usage is incorrect or insufficient.
> > > 
> > > You copy the same pattern that seems racy. So you double the
> > > amount of code that would need to be fixed.
> > 
> > 
> > _Seems_ racy, or _is_ racy?  Please identify the race.
> 
> Look at this:
> 
> static inline int kvm_irq_line_state(unsigned long *irq_state,
>                                      int irq_source_id, int level)
> {
>         /* Logical OR for level trig interrupt */
>         if (level)
>                 set_bit(irq_source_id, irq_state);
>         else
>                 clear_bit(irq_source_id, irq_state);
> 
>         return !!(*irq_state);
> }
> 
> 
> Now:
> If other CPU changes some other bit after the atomic change,
> it looks like !!(*irq_state) might return a stale value.
> 
> CPU 0 clears bit 0. CPU 1 sets bit 1. CPU 1 sets level to 1.
> If CPU 0 sees a stale value now it will return 0 here
> and interrupt will get cleared.
> 
> 
> Maybe this is not a problem. But in that case IMO it needs
> a comment explaining why and why it's not a problem in
> your code.

So you want to close the door on anything that uses kvm_set_irq until
this gets fixed... that's insane.


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 2/4] kvm: KVM_EOIFD, an eventfd for EOIs
  2012-07-17 16:06                     ` Alex Williamson
@ 2012-07-17 16:19                       ` Michael S. Tsirkin
  2012-07-17 16:52                         ` Alex Williamson
  0 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-17 16:19 UTC (permalink / raw)
  To: Alex Williamson; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, Jul 17, 2012 at 10:06:01AM -0600, Alex Williamson wrote:
> On Tue, 2012-07-17 at 18:53 +0300, Michael S. Tsirkin wrote:
> > On Tue, Jul 17, 2012 at 09:41:09AM -0600, Alex Williamson wrote:
> > > On Tue, 2012-07-17 at 18:13 +0300, Michael S. Tsirkin wrote:
> > > > On Tue, Jul 17, 2012 at 08:57:04AM -0600, Alex Williamson wrote:
> > > > > On Tue, 2012-07-17 at 17:42 +0300, Michael S. Tsirkin wrote:
> > > > > > On Tue, Jul 17, 2012 at 08:29:43AM -0600, Alex Williamson wrote:
> > > > > > > On Tue, 2012-07-17 at 17:10 +0300, Michael S. Tsirkin wrote:
> > > > > > > > On Tue, Jul 17, 2012 at 07:59:16AM -0600, Alex Williamson wrote:
> > > > > > > > > On Tue, 2012-07-17 at 13:21 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > > On Mon, Jul 16, 2012 at 02:33:55PM -0600, Alex Williamson wrote:
> > > > > > > > > > > +	if (args->flags & KVM_EOIFD_FLAG_LEVEL_IRQFD) {
> > > > > > > > > > > +		struct _irqfd *irqfd = _irqfd_fdget_lock(kvm, args->irqfd);
> > > > > > > > > > > +		if (IS_ERR(irqfd)) {
> > > > > > > > > > > +			ret = PTR_ERR(irqfd);
> > > > > > > > > > > +			goto fail;
> > > > > > > > > > > +		}
> > > > > > > > > > > +
> > > > > > > > > > > +		gsi = irqfd->gsi;
> > > > > > > > > > > +		level_irqfd = eventfd_ctx_get(irqfd->eventfd);
> > > > > > > > > > > +		source = _irq_source_get(irqfd->source);
> > > > > > > > > > > +		_irqfd_put_unlock(irqfd);
> > > > > > > > > > > +		if (!source) {
> > > > > > > > > > > +			ret = -EINVAL;
> > > > > > > > > > > +			goto fail;
> > > > > > > > > > > +		}
> > > > > > > > > > > +	} else {
> > > > > > > > > > > +		ret = -EINVAL;
> > > > > > > > > > > +		goto fail;
> > > > > > > > > > > +	}
> > > > > > > > > > > +
> > > > > > > > > > > +	INIT_LIST_HEAD(&eoifd->list);
> > > > > > > > > > > +	eoifd->kvm = kvm;
> > > > > > > > > > > +	eoifd->eventfd = eventfd;
> > > > > > > > > > > +	eoifd->source = source;
> > > > > > > > > > > +	eoifd->level_irqfd = level_irqfd;
> > > > > > > > > > > +	eoifd->notifier.gsi = gsi;
> > > > > > > > > > > +	eoifd->notifier.irq_acked = eoifd_event;
> > > > > > > > > > 
> > > > > > > > > > OK so this means eoifd keeps a reference to the irqfd.
> > > > > > > > > > And since this is the case, can't we drop the reference counting
> > > > > > > > > > around source ids now? Everything is referenced through irqfd.
> > > > > > > > > 
> > > > > > > > > Holding a reference and using it as a reference count are not the same
> > > > > > > > > thing.  What if another module holds a reference to this eventfd?  How
> > > > > > > > > do we do anything on release?
> > > > > > > > 
> > > > > > > > We don't as there is no release, and using kref on source id does not
> > > > > > > > help: it just never gets invoked.
> > > > > > > 
> > > > > > > Please work out how you think it should work and let me know, I don't
> > > > > > > see it.  We have an irq source id that needs to be allocated by irqfd
> > > > > > > and returned when it's unused.  It becomes unused when neither irqfd nor
> > > > > > > eoifd are making use of it.  irqfd and eoifd may be closed in any order.
> > > > > > > Use of the source id is what we're reference counting, which is why it's
> > > > > > > in struct _irq_source.  How can I use an eventfd reference for the same?
> > > > > > > I don't know when it's unused.  I don't know who else holds a reference
> > > > > > > to it...  Doesn't make sense to me.  Feels like attempting to squat on
> > > > > > > someone else's object.
> > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > > eoifd should prevent irqfd from being released.
> > > > > 
> > > > > Why?  Note that this is actually quite difficult too.  We can't fail a
> > > > > release, nobody checks close(3p) return.  Blocking a release is likely
> > > > > to cause all sorts of problems, so what you mean is that irqfd should
> > > > > linger around until there are no references to it... but that's exactly
> > > > > what struct _irq_source is for, is to hold the bits that we care about
> > > > > references to and automatically release it when there are none.
> > > > 
> > > > No no. You *already* prevent it. You take a reference to the eventfd
> > > > context.
> > > 
> > > Right, which keeps the fd from going away, not the struct _irqfd.
> > 
> > _irqfd too.
> 
> 
> How so?

Normally irqfd_wakeup is called with POLLHUP and calls irqfd_deactivate.
If you get a ctx reference this does not happen.

> > > > > >   It already keeps
> > > > > > a reference to it so it prevents irqfd from going away by userspace
> > > > > > closing the fd.
> > > > > 
> > > > > Wrong, eoifd holds a reference to the eventfd for the irqfd, so it
> > > > > prevents the fd from going away, not the irqfd.
> > > > 
> > > > When the fd is not going away an ioctl is the only other way for
> > > > it to go away.
> > > 
> > > It doesn't do any good to fail the ioctl if close(fd) allows it.
> > 
> > allows what? It does nothing.
> > 
> > > > > >   But it can still be released with deassign.
> > > > > > An easy solution is to fail deassign of irqfd if there is
> > > > > > eoifd bound to it.
> > > > > 
> > > > > I don't know why we would impose such a bizarre usage model when
> > > > > reference counting on struct _irq_source seems to handle this nicely
> > > > > already.
> > > > 
> > > > Well eventfd gets an irqfd. What does it mean if said irqfd gets
> > > > deassigned, and potentially assigned an unrelated interrupt?
> > > > I think what I would expect is for it to handle the new interrupt.
> > > > This is hard to implement so let us fail this.
> > > 
> > > Ah, so an actual problem, let's solve this.  Why wouldn't we just search
> > > the list of eoifds and see if this level_irqfd is already used?  If we
> > > find it and it's compatible, we can get a reference to the _irq_source
> > > and "re-attach" the irqfd.  If it's not compatible, fail the KVM_IRQFD.
> > > If the KVM_IRQFD is for an edge irqfd, I think we let it go.
> > 
> > This is just confusing. Userspace has no idea that you are reusing fds
> > behind the scenes. assign is not the problem, deassign is.
> > So fail *that*.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-17 16:17                         ` Alex Williamson
@ 2012-07-17 16:21                           ` Michael S. Tsirkin
  2012-07-17 16:45                             ` Alex Williamson
  0 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-17 16:21 UTC (permalink / raw)
  To: Alex Williamson; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, Jul 17, 2012 at 10:17:03AM -0600, Alex Williamson wrote:
> > > > > > > >   And current code looks buggy; if so, we need to fix it somehow.
> > > > > > > 
> > > > > > > 
> > > > > > > Which to me seems to indicate this should be handled as a separate
> > > > > > > effort.
> > > > > > 
> > > > > > A separate patchset, sure. But likely a prerequisite: we still need to
> > > > > > look at all the code. Let's not copy bugs, need to fix them.
> > > > > 
> > > > > This looks tangential to me unless you can come up with an actual reason
> > > > > the above spinlock usage is incorrect or insufficient.
> > > > 
> > > > You copy the same pattern that seems racy. So you double the
> > > > amount of code that would need to be fixed.
> > > 
> > > 
> > > _Seems_ racy, or _is_ racy?  Please identify the race.
> > 
> > Look at this:
> > 
> > static inline int kvm_irq_line_state(unsigned long *irq_state,
> >                                      int irq_source_id, int level)
> > {
> >         /* Logical OR for level trig interrupt */
> >         if (level)
> >                 set_bit(irq_source_id, irq_state);
> >         else
> >                 clear_bit(irq_source_id, irq_state);
> > 
> >         return !!(*irq_state);
> > }
> > 
> > 
> > Now:
> > If other CPU changes some other bit after the atomic change,
> > it looks like !!(*irq_state) might return a stale value.
> > 
> > CPU 0 clears bit 0. CPU 1 sets bit 1. CPU 1 sets level to 1.
> > If CPU 0 sees a stale value now it will return 0 here
> > and interrupt will get cleared.
> > 
> > 
> > Maybe this is not a problem. But in that case IMO it needs
> > a comment explaining why and why it's not a problem in
> > your code.
> 
> So you want to close the door on anything that uses kvm_set_irq until
> this gets fixed... that's insane.

What does kvm_set_irq use have to do with it?  You posted this patch:

+static int kvm_clear_pic_irq(struct kvm_kernel_irq_routing_entry *e,
+                            struct kvm *kvm, int irq_source_id)
+{
+#ifdef CONFIG_X86
+       struct kvm_pic *pic = pic_irqchip(kvm);
+       int level = kvm_clear_irq_line_state(&pic->irq_states[e->irqchip.pin],
+                                            irq_source_id);
+       if (level)
+               kvm_pic_set_irq(pic, e->irqchip.pin,
+                               !!pic->irq_states[e->irqchip.pin]);
+       return level;
+#else
+       return -1;
+#endif
+}
+

it seems racy in the same way.
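
For the clear path specifically, here is a rough sketch of what a
single-operation version could look like. It again uses C11 atomics and
hypothetical names (clear_result, clear_source); it is not the patch's
code. One fetch-and reports both whether this source had been asserting
the line and whether any other source still is, so the level is not
re-read from irq_states after the fact.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical result type: both answers come from the same atomic operation. */
struct clear_result {
        bool was_asserted;      /* this source ID had its bit set before the clear */
        bool line_level;        /* some other source ID still asserts the line */
};

static inline struct clear_result clear_source(_Atomic unsigned long *state,
                                               int source_id)
{
        unsigned long mask, old;
        struct clear_result r;

        assert(source_id >= 0 &&
               source_id < (int)(sizeof(unsigned long) * CHAR_BIT));
        mask = 1UL << source_id;

        /* Clear our bit and capture the word as it was at that instant. */
        old = atomic_fetch_and(state, ~mask);

        r.was_asserted = (old & mask) != 0;
        r.line_level = (old & ~mask) != 0;
        return r;
}
```

A caller would touch the pic only when was_asserted is set, passing
line_level as the new level. That removes the separate read, but the
order in which concurrent callers reach the pic is still not guaranteed
by this alone.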

-- 
MST

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-17 16:08                     ` Alex Williamson
  2012-07-17 16:14                       ` Michael S. Tsirkin
@ 2012-07-17 16:36                       ` Michael S. Tsirkin
  2012-07-17 17:09                         ` Gleb Natapov
  1 sibling, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-17 16:36 UTC (permalink / raw)
  To: Alex Williamson; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, Jul 17, 2012 at 10:08:21AM -0600, Alex Williamson wrote:
> On Tue, 2012-07-17 at 18:57 +0300, Michael S. Tsirkin wrote:
> > On Tue, Jul 17, 2012 at 09:51:41AM -0600, Alex Williamson wrote:
> > > On Tue, 2012-07-17 at 18:36 +0300, Michael S. Tsirkin wrote:
> > > > On Tue, Jul 17, 2012 at 09:20:11AM -0600, Alex Williamson wrote:
> > > > > On Tue, 2012-07-17 at 17:53 +0300, Michael S. Tsirkin wrote:
> > > > > > On Tue, Jul 17, 2012 at 08:21:51AM -0600, Alex Williamson wrote:
> > > > > > > On Tue, 2012-07-17 at 17:08 +0300, Michael S. Tsirkin wrote:
> > > > > > > > On Tue, Jul 17, 2012 at 07:56:09AM -0600, Alex Williamson wrote:
> > > > > > > > > On Tue, 2012-07-17 at 13:14 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > > On Mon, Jul 16, 2012 at 02:34:03PM -0600, Alex Williamson wrote:
> > > > > > > > > > > This is an alternative to kvm_set_irq(,,,0) which returns the previous
> > > > > > > > > > > assertion state of the interrupt and does nothing if it isn't changed.
> > > > > > > > > > > 
> > > > > > > > > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > > > > > > > > > ---
> > > > > > > > > > > 
> > > > > > > > > > >  include/linux/kvm_host.h |    3 ++
> > > > > > > > > > >  virt/kvm/irq_comm.c      |   78 ++++++++++++++++++++++++++++++++++++++++++++++
> > > > > > > > > > >  2 files changed, 81 insertions(+)
> > > > > > > > > > > 
> > > > > > > > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > > > > > > > > index a7661c0..6c168f1 100644
> > > > > > > > > > > --- a/include/linux/kvm_host.h
> > > > > > > > > > > +++ b/include/linux/kvm_host.h
> > > > > > > > > > > @@ -219,6 +219,8 @@ struct kvm_kernel_irq_routing_entry {
> > > > > > > > > > >  	u32 type;
> > > > > > > > > > >  	int (*set)(struct kvm_kernel_irq_routing_entry *e,
> > > > > > > > > > >  		   struct kvm *kvm, int irq_source_id, int level);
> > > > > > > > > > > +	int (*clear)(struct kvm_kernel_irq_routing_entry *e,
> > > > > > > > > > > +		     struct kvm *kvm, int irq_source_id);
> > > > > > > > > > >  	union {
> > > > > > > > > > >  		struct {
> > > > > > > > > > >  			unsigned irqchip;
> > > > > > > > > > > @@ -629,6 +631,7 @@ void kvm_get_intr_delivery_bitmask(struct kvm_ioapic *ioapic,
> > > > > > > > > > >  				   unsigned long *deliver_bitmask);
> > > > > > > > > > >  #endif
> > > > > > > > > > >  int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level);
> > > > > > > > > > > +int kvm_clear_irq(struct kvm *kvm, int irq_source_id, u32 irq);
> > > > > > > > > > >  int kvm_set_msi(struct kvm_kernel_irq_routing_entry *irq_entry, struct kvm *kvm,
> > > > > > > > > > >  		int irq_source_id, int level);
> > > > > > > > > > >  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
> > > > > > > > > > > diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> > > > > > > > > > > index 5afb431..76e8f22 100644
> > > > > > > > > > > --- a/virt/kvm/irq_comm.c
> > > > > > > > > > > +++ b/virt/kvm/irq_comm.c
> > > > > > > > > > > @@ -68,6 +68,42 @@ static int kvm_set_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
> > > > > > > > > > >  	return kvm_ioapic_set_irq(ioapic, e->irqchip.pin, level);
> > > > > > > > > > >  }
> > > > > > > > > > >  
> > > > > > > > > > > +static inline int kvm_clear_irq_line_state(unsigned long *irq_state,
> > > > > > > > > > > +					    int irq_source_id)
> > > > > > > > > > > +{
> > > > > > > > > > > +	return !!test_and_clear_bit(irq_source_id, irq_state);
> > > > > > > > > > > +}
> > > > > > > > > > > +
> > > > > > > > > > > +static int kvm_clear_pic_irq(struct kvm_kernel_irq_routing_entry *e,
> > > > > > > > > > > +			     struct kvm *kvm, int irq_source_id)
> > > > > > > > > > > +{
> > > > > > > > > > > +#ifdef CONFIG_X86
> > > > > > > > > > > +	struct kvm_pic *pic = pic_irqchip(kvm);
> > > > > > > > > > > +	int level = kvm_clear_irq_line_state(&pic->irq_states[e->irqchip.pin],
> > > > > > > > > > > +					     irq_source_id);
> > > > > > > > > > > +	if (level)
> > > > > > > > > > > +		kvm_pic_set_irq(pic, e->irqchip.pin,
> > > > > > > > > > > +				!!pic->irq_states[e->irqchip.pin]);
> > > > > > > > > > > +	return level;
> > > > > > > > > > 
> > > > > > > > > > I think I begin to understand: if (level) checks it was previously set,
> > > > > > > > > > and then we clear if needed?
> > > > > > > > > 
> > > > > > > > > It's actually very simple, if we change anything in irq_states, then
> > > > > > > > > update via the chip specific set_irq function.
> > > > > > > > > 
> > > > > > > > > >  I think it's worthwhile to rename
> > > > > > > > > > level to orig_level and rewrite as:
> > > > > > > > > > 
> > > > > > > > > > 	if (orig_level && !pic->irq_states[e->irqchip.pin])
> > > > > > > > > > 		kvm_pic_set_irq(pic, e->irqchip.pin, 0);
> > > > > > > > > > 
> > > > > > > > > > This both makes the logic clear without need for comments and
> > > > > > > > > > saves some cycles on pic in case nothing actually changed.
> > > > > > > > > 
> > > > > > > > > That may work, but it's not actually the same thing.  kvm_set_irq(,,,0)
> > > > > > > > > will clear the bit and call kvm_pic_set_irq with the new irq_states
> > > > > > > > > value, whether it's 0 or 1.  The optimization I make is to only call
> > > > > > > > > kvm_pic_set_irq if we've "changed" irq_states.  You're taking that one
> > > > > > > > > step further to "changed and is now 0".  I don't know if that's correct
> > > > > > > > > behavior.
> > > > > > > > 
> > > > > > > > If not then I don't understand. You clear a bit
> > > > > > > > in a word. You never change it to 1, do you?
> > > > > > > 
> > > > > > > Correct, but kvm_set_irq(,,,0) may call kvm_pic_set_irq(,,1) if other
> > > > > > > source IDs are still asserting the interrupt.  Your proposal assumes
> > > > > > > that unless irq_states is also 0 we don't need to call kvm_pic_set_irq,
> > > > > > > and I don't know if that's correct.
> > > > > > 
> > > > > > Well you are asked to clear some id and level was 1. So we know
> > > > > > interrupt was asserted. Either we clear it or we don't. No?
> > > > > > 
> > > > > > > > 
> > > > > > > > But this brings another question:
> > > > > > > > 
> > > > > > > > static inline int kvm_irq_line_state(unsigned long *irq_state,
> > > > > > > >                                      int irq_source_id, int level)
> > > > > > > > {
> > > > > > > >         /* Logical OR for level trig interrupt */
> > > > > > > >         if (level)
> > > > > > > >                 set_bit(irq_source_id, irq_state);
> > > > > > > >         else
> > > > > > > >                 clear_bit(irq_source_id, irq_state);
> > > > > > > > 
> > > > > > > > 
> > > > > > > > ^^^^^^^^^^^
> > > > > > > > above uses locked instructions
> > > > > > > > 
> > > > > > > >         return !!(*irq_state);
> > > > > > > > 
> > > > > > > > 
> > > > > > > > above doesn't
> > > > > > > > 
> > > > > > > > }
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > why the inconsistency?
> > > > > > > 
> > > > > > > Note that set/clear_bit are not locked instructions,
> > > > > > 
> > > > > > On x86 they are:
> > > > > > static __always_inline void
> > > > > > set_bit(unsigned int nr, volatile unsigned long *addr)
> > > > > > {
> > > > > >         if (IS_IMMEDIATE(nr)) {
> > > > > >                 asm volatile(LOCK_PREFIX "orb %1,%0"
> > > > > >                         : CONST_MASK_ADDR(nr, addr)
> > > > > >                         : "iq" ((u8)CONST_MASK(nr))
> > > > > >                         : "memory");
> > > > > >         } else {
> > > > > >                 asm volatile(LOCK_PREFIX "bts %1,%0"
> > > > > >                         : BITOP_ADDR(addr) : "Ir" (nr) : "memory");
> > > > > >         }
> > > > > > }
> > > > > > 
> > > > > > > but atomic
> > > > > > > instructions and it could be argued that reading the value is also
> > > > > > > atomic.  At least that was my guess when I stumbled across the same
> > > > > > > yesterday.  IMHO, we're going off into the weeds again with these last
> > > > > > > two patches.  It may be a valid optimization, but it really has no
> > > > > > > bearing on the meat of the series (and afaict, no significant
> > > > > > > performance difference either).
> > > > > > 
> > > > > > For me it's not a performance thing. IMO code is cleaner without this locking:
> > > > > > we add a lock but only use it in some cases, so the rules become really
> > > > > > complex.
> > > > > 
> > > > > Seriously?
> > > > > 
> > > > >         spin_lock(&irqfd->source->lock);
> > > > >         if (!irqfd->source->level_asserted) {
> > > > >                 kvm_set_irq(irqfd->kvm, irqfd->source->id, irqfd->gsi, 1);
> > > > >                 irqfd->source->level_asserted = true;
> > > > >         }
> > > > >         spin_unlock(&irqfd->source->lock);
> > > > > 
> > > > > ...
> > > > > 
> > > > >         spin_lock(&eoifd->source->lock);
> > > > >         if (eoifd->source->level_asserted) {
> > > > >                 kvm_set_irq(eoifd->kvm,
> > > > >                             eoifd->source->id, eoifd->notifier.gsi, 0);
> > > > >                 eoifd->source->level_asserted = false;
> > > > >                 eventfd_signal(eoifd->eventfd, 1);
> > > > >         }
> > > > >         spin_unlock(&eoifd->source->lock);
> > > > > 
> > > > > 
> > > > > Locking doesn't get much more straightforward than that
> > > > 
> > > > Don't look at it in isolation. You are now calling kvm_set_irq
> > > > from under a spinlock. You are saying it is always safe but
> > > > this seems far from obvious. kvm_set_irq used to be
> > > > unsafe from an atomic context.
> > > 
> > > Device assignment has been calling kvm_set_irq from atomic context for
> > > quite a long time.
> > 
> > Only for MSI. That's an exception (and it's also a messy one).
> 
> Nope, I see past code that used it for INTx as well.

While this looks like it will not crash, it scans all vcpus under a
spinlock, which is a problem for big VMs.
Again, yes, we have such uses now, but we are looking for ways
to fix them, not to add more.


> > > > > >   And current code looks buggy; if so, we need to fix it somehow.
> > > > > 
> > > > > 
> > > > > Which to me seems to indicate this should be handled as a separate
> > > > > effort.
> > > > 
> > > > A separate patchset, sure. But likely a prerequisite: we still need to
> > > > look at all the code. Let's not copy bugs, need to fix them.
> > > 
> > > This looks tangential to me unless you can come up with an actual reason
> > > the above spinlock usage is incorrect or insufficient.
> > 
> > You copy the same pattern that seems racy. So you double the
> > amount of code that would need to be fixed.
> 
> 
> _Seems_ racy, or _is_ racy?  Please identify the race.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-17 16:21                           ` Michael S. Tsirkin
@ 2012-07-17 16:45                             ` Alex Williamson
  2012-07-17 18:55                               ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Alex Williamson @ 2012-07-17 16:45 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, 2012-07-17 at 19:21 +0300, Michael S. Tsirkin wrote:
> On Tue, Jul 17, 2012 at 10:17:03AM -0600, Alex Williamson wrote:
> > > > > > > > >   And current code looks buggy if yes we need to fix it somehow.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Which to me seems to indicate this should be handled as a separate
> > > > > > > > effort.
> > > > > > > 
> > > > > > > A separate patchset, sure. But likely a prerequisite: we still need to
> > > > > > > look at all the code. Let's not copy bugs, need to fix them.
> > > > > > 
> > > > > > This looks tangential to me unless you can come up with an actual reason
> > > > > > the above spinlock usage is incorrect or insufficient.
> > > > > 
> > > > > You copy the same pattern that seems racy. So you double the
> > > > > > amount of code that would need to be fixed.
> > > > 
> > > > 
> > > > _Seems_ racy, or _is_ racy?  Please identify the race.
> > > 
> > > Look at this:
> > > 
> > > static inline int kvm_irq_line_state(unsigned long *irq_state,
> > >                                      int irq_source_id, int level)
> > > {
> > >         /* Logical OR for level trig interrupt */
> > >         if (level)
> > >                 set_bit(irq_source_id, irq_state);
> > >         else
> > >                 clear_bit(irq_source_id, irq_state);
> > > 
> > >         return !!(*irq_state);
> > > }
> > > 
> > > 
> > > Now:
> > > If other CPU changes some other bit after the atomic change,
> > > it looks like !!(*irq_state) might return a stale value.
> > > 
> > > CPU 0 clears bit 0. CPU 1 sets bit 1. CPU 1 sets level to 1.
> > > If CPU 0 sees a stale value now it will return 0 here
> > > and interrupt will get cleared.
> > > 
> > > 
> > > Maybe this is not a problem. But in that case IMO it needs
> > > a comment explaining why and why it's not a problem in
> > > your code.
> > 
> > So you want to close the door on anything that uses kvm_set_irq until
> > this gets fixed... that's insane.
> 
> What does kvm_set_irq use have to do with it?  You posted this patch:
> 
> +static int kvm_clear_pic_irq(struct kvm_kernel_irq_routing_entry *e,
> +                            struct kvm *kvm, int irq_source_id)
> +{
> +#ifdef CONFIG_X86
> +       struct kvm_pic *pic = pic_irqchip(kvm);
> +       int level = kvm_clear_irq_line_state(&pic->irq_states[e->irqchip.pin],
> +                                            irq_source_id);
> +       if (level)
> +               kvm_pic_set_irq(pic, e->irqchip.pin,
> +                               !!pic->irq_states[e->irqchip.pin]);
> +       return level;
> +#else
> +       return -1;
> +#endif
> +}
> +
> 
> it seems racy in the same way.

Now you're just misrepresenting how we got here, which was:

> > > > > > IMHO, we're going off into the weeds again with these last
> > > > > > two patches.  It may be a valid optimization, but it really has no
> > > > > > bearing on the meat of the series (and afaict, no significant
> > > > > > performance difference either).
> > > > > 
> > > > > For me it's not a performance thing. IMO code is cleaner without this locking:
> > > > > we add a lock but only use it in some cases, so the rules become really
> > > > > complex.

So I'm happy to drop the last 2 patches, which were done at your request
anyway, but you've failed to show how the locking in patches 1&2 is
messy, inconsistent, or complex and now you're asking to block all
progress.  Those patches are just users of kvm_set_irq.
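
For reference, the locking pattern in patches 1&2 can be modelled in user
space roughly as follows.  This is only a sketch under stated assumptions:
struct model_source, model_irqfd_inject() and model_eoi() are hypothetical
names, an atomic_flag stands in for source->lock, and plain counters stand
in for kvm_set_irq() and eventfd_signal().  It is not the kernel code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the shared per-source state. */
struct model_source {
	atomic_flag lock;      /* plays the role of source->lock */
	bool level_asserted;   /* same meaning as in the patch */
	int line_level;        /* stands in for the irqchip input */
	int set_irq_calls;     /* counts modelled kvm_set_irq() calls */
	int eoi_signals;       /* counts modelled eventfd_signal() calls */
};

static void model_lock(struct model_source *s)
{
	while (atomic_flag_test_and_set_explicit(&s->lock,
						 memory_order_acquire))
		; /* spin until the flag was previously clear */
}

static void model_unlock(struct model_source *s)
{
	atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

/* irqfd side: assert the line only on a 0 -> 1 transition. */
static void model_irqfd_inject(struct model_source *s)
{
	model_lock(s);
	if (!s->level_asserted) {
		s->line_level = 1;
		s->set_irq_calls++;
		s->level_asserted = true;
	}
	model_unlock(s);
}

/* EOI side: de-assert and notify only if this source had asserted. */
static bool model_eoi(struct model_source *s)
{
	bool signalled = false;

	model_lock(s);
	if (s->level_asserted) {
		s->line_level = 0;
		s->set_irq_calls++;
		s->level_asserted = false;
		s->eoi_signals++;
		signalled = true;
	}
	model_unlock(s);
	return signalled;
}
```

In this model a second inject while the level is already asserted does
nothing, and an EOI with nothing asserted neither touches the line nor
signals, which is all the lock is meant to guarantee.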



* Re: [PATCH v5 2/4] kvm: KVM_EOIFD, an eventfd for EOIs
  2012-07-17 16:19                       ` Michael S. Tsirkin
@ 2012-07-17 16:52                         ` Alex Williamson
  2012-07-17 18:58                           ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Alex Williamson @ 2012-07-17 16:52 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, 2012-07-17 at 19:19 +0300, Michael S. Tsirkin wrote:
> On Tue, Jul 17, 2012 at 10:06:01AM -0600, Alex Williamson wrote:
> > On Tue, 2012-07-17 at 18:53 +0300, Michael S. Tsirkin wrote:
> > > On Tue, Jul 17, 2012 at 09:41:09AM -0600, Alex Williamson wrote:
> > > > On Tue, 2012-07-17 at 18:13 +0300, Michael S. Tsirkin wrote:
> > > > > On Tue, Jul 17, 2012 at 08:57:04AM -0600, Alex Williamson wrote:
> > > > > > On Tue, 2012-07-17 at 17:42 +0300, Michael S. Tsirkin wrote:
> > > > > > > On Tue, Jul 17, 2012 at 08:29:43AM -0600, Alex Williamson wrote:
> > > > > > > > On Tue, 2012-07-17 at 17:10 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > On Tue, Jul 17, 2012 at 07:59:16AM -0600, Alex Williamson wrote:
> > > > > > > > > > On Tue, 2012-07-17 at 13:21 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > > > On Mon, Jul 16, 2012 at 02:33:55PM -0600, Alex Williamson wrote:
> > > > > > > > > > > > +	if (args->flags & KVM_EOIFD_FLAG_LEVEL_IRQFD) {
> > > > > > > > > > > > +		struct _irqfd *irqfd = _irqfd_fdget_lock(kvm, args->irqfd);
> > > > > > > > > > > > +		if (IS_ERR(irqfd)) {
> > > > > > > > > > > > +			ret = PTR_ERR(irqfd);
> > > > > > > > > > > > +			goto fail;
> > > > > > > > > > > > +		}
> > > > > > > > > > > > +
> > > > > > > > > > > > +		gsi = irqfd->gsi;
> > > > > > > > > > > > +		level_irqfd = eventfd_ctx_get(irqfd->eventfd);
> > > > > > > > > > > > +		source = _irq_source_get(irqfd->source);
> > > > > > > > > > > > +		_irqfd_put_unlock(irqfd);
> > > > > > > > > > > > +		if (!source) {
> > > > > > > > > > > > +			ret = -EINVAL;
> > > > > > > > > > > > +			goto fail;
> > > > > > > > > > > > +		}
> > > > > > > > > > > > +	} else {
> > > > > > > > > > > > +		ret = -EINVAL;
> > > > > > > > > > > > +		goto fail;
> > > > > > > > > > > > +	}
> > > > > > > > > > > > +
> > > > > > > > > > > > +	INIT_LIST_HEAD(&eoifd->list);
> > > > > > > > > > > > +	eoifd->kvm = kvm;
> > > > > > > > > > > > +	eoifd->eventfd = eventfd;
> > > > > > > > > > > > +	eoifd->source = source;
> > > > > > > > > > > > +	eoifd->level_irqfd = level_irqfd;
> > > > > > > > > > > > +	eoifd->notifier.gsi = gsi;
> > > > > > > > > > > > +	eoifd->notifier.irq_acked = eoifd_event;
> > > > > > > > > > > 
> > > > > > > > > > > OK so this means eoifd keeps a reference to the irqfd.
> > > > > > > > > > > And since this is the case, can't we drop the reference counting
> > > > > > > > > > > around source ids now? Everything is referenced through irqfd.
> > > > > > > > > > 
> > > > > > > > > > Holding a reference and using it as a reference count are not the same
> > > > > > > > > > thing.  What if another module holds a reference to this eventfd?  How
> > > > > > > > > > do we do anything on release?
> > > > > > > > > 
> > > > > > > > > We don't as there is no release, and using kref on source id does not
> > > > > > > > > help: it just never gets invoked.
> > > > > > > > 
> > > > > > > > Please work out how you think it should work and let me know, I don't
> > > > > > > > see it.  We have an irq source id that needs to be allocated by irqfd
> > > > > > > > and returned when it's unused.  It becomes unused when neither irqfd nor
> > > > > > > > eoifd are making use of it.  irqfd and eoifd may be closed in any order.
> > > > > > > > Use of the source id is what we're reference counting, which is why it's
> > > > > > > > in struct _irq_source.  How can I use an eventfd reference for the same?
> > > > > > > > I don't know when it's unused.  I don't know who else holds a reference
> > > > > > > > to it...  Doesn't make sense to me.  Feels like attempting to squat on
> > > > > > > > someone else's object.
> > > > > > > > 
> > > > > > > > 
> > > > > > > 
> > > > > > > eoifd should prevent irqfd from being released.
> > > > > > 
> > > > > > Why?  Note that this is actually quite difficult too.  We can't fail a
> > > > > > release, nobody checks close(3p) return.  Blocking a release is likely
> > > > > > to cause all sorts of problems, so what you mean is that irqfd should
> > > > > > linger around until there are no references to it... but that's exactly
> > > > > > what struct _irq_source is for, is to hold the bits that we care about
> > > > > > references to and automatically release it when there are none.
> > > > > 
> > > > > No no. You *already* prevent it. You take a reference to the eventfd
> > > > > context.
> > > > 
> > > > Right, which keeps the fd from going away, not the struct _irqfd.
> > > 
> > > _irqfd too.
> > 
> > 
> > How so?
> 
> Normally irqfd_wakeup is called with POLLHUP and calls irqfd_deactivate.
> If you get a ctx reference this does not happen.

I think you're mistaken.  wake_up_poll(,POLLHUP) is called from
eventfd_release (file_operations.release), not from ctx reference
release.
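
To illustrate the distinction, here is a small user-space model.  It is a
sketch only: struct model_eventfd_ctx and the model_*() helpers are
hypothetical names, and the booleans stand in for the POLLHUP wakeup and
for freeing the context.

```c
#include <stdbool.h>

/* Hypothetical model: one eventfd context shared by a file and others. */
struct model_eventfd_ctx {
	int refs;          /* context references (file + extra holders) */
	bool hup_seen;     /* set when waiters are woken with POLLHUP */
	bool freed;        /* set when the last context reference is dropped */
};

static void model_ctx_get(struct model_eventfd_ctx *ctx)
{
	ctx->refs++;
}

static void model_ctx_put(struct model_eventfd_ctx *ctx)
{
	if (--ctx->refs == 0)
		ctx->freed = true;
}

/* Models file_operations.release: wake pollers, then drop the file's ref. */
static void model_file_release(struct model_eventfd_ctx *ctx)
{
	ctx->hup_seen = true;   /* this is where irqfd_wakeup would see POLLHUP */
	model_ctx_put(ctx);
}
```

With an extra context reference held, the release still delivers the
hangup; the reference only delays freeing the context.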

> > > > > > >   It already keeps
> > > > > > > a reference to it so it prevents irqfd from going away by userspace
> > > > > > > closing the fd.
> > > > > > 
> > > > > > Wrong, eoifd holds a reference to the eventfd for the irqfd, so it
> > > > > > prevents the fd from going away, not the irqfd.
> > > > > 
> > > > > When the fd is not going away an ioctl is the only other way for
> > > > > it to go away.
> > > > 
> > > > It doesn't do any good to fail the ioctl if close(fd) allows it.
> > > 
> > > allows what? It does nothing.
> > > 
> > > > > > >   But it can still be released with deassign.
> > > > > > > An easy solution is to fail deassign of irqfd if there is
> > > > > > > eoifd bound to it.
> > > > > > 
> > > > > > I don't know why we would impose such a bizarre usage model when
> > > > > > reference counting on struct _irq_source seems to handle this nicely
> > > > > > already.
> > > > > 
> > > > > Well eventfd gets an irqfd. What does it mean if said irqfd gets
> > > > > deassigned, and potentially assigned an unrelated interrupt?
> > > > > I think what I would expect is for it to handle the new interrupt.
> > > > > This is hard to implement so let us fail this.
> > > > 
> > > > Ah, so an actual problem, let's solve this.  Why wouldn't we just search
> > > > the list of eoifds and see if this level_irqfd is already used?  If we
> > > > find it and it's compatible, we can get a reference to the _irq_source
> > > > and "re-attach" the irqfd.  If it's not compatible, fail the KVM_IRQFD.
> > > > If the KVM_IRQFD is for an edge irqfd, I think we let it go.
> > > 
> > > This is just confusing. Userspace has no idea that you are reusing fds
> > > behind the scenes. assign is not the problem, deassign is.
> > > So fail *that*.





* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-17 16:36                       ` Michael S. Tsirkin
@ 2012-07-17 17:09                         ` Gleb Natapov
  0 siblings, 0 replies; 96+ messages in thread
From: Gleb Natapov @ 2012-07-17 17:09 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Alex Williamson, avi, kvm, linux-kernel, jan.kiszka

On Tue, Jul 17, 2012 at 07:36:49PM +0300, Michael S. Tsirkin wrote:
> On Tue, Jul 17, 2012 at 10:08:21AM -0600, Alex Williamson wrote:
> > On Tue, 2012-07-17 at 18:57 +0300, Michael S. Tsirkin wrote:
> > > On Tue, Jul 17, 2012 at 09:51:41AM -0600, Alex Williamson wrote:
> > > > On Tue, 2012-07-17 at 18:36 +0300, Michael S. Tsirkin wrote:
> > > > > On Tue, Jul 17, 2012 at 09:20:11AM -0600, Alex Williamson wrote:
> > > > > > On Tue, 2012-07-17 at 17:53 +0300, Michael S. Tsirkin wrote:
> > > > > > > On Tue, Jul 17, 2012 at 08:21:51AM -0600, Alex Williamson wrote:
> > > > > > > > On Tue, 2012-07-17 at 17:08 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > On Tue, Jul 17, 2012 at 07:56:09AM -0600, Alex Williamson wrote:
> > > > > > > > > > On Tue, 2012-07-17 at 13:14 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > > > On Mon, Jul 16, 2012 at 02:34:03PM -0600, Alex Williamson wrote:
> > > > > > > > > > > > This is an alternative to kvm_set_irq(,,,0) which returns the previous
> > > > > > > > > > > > assertion state of the interrupt and does nothing if it isn't changed.
> > > > > > > > > > > > 
> > > > > > > > > > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > > > > > > > > > > ---
> > > > > > > > > > > > 
> > > > > > > > > > > >  include/linux/kvm_host.h |    3 ++
> > > > > > > > > > > >  virt/kvm/irq_comm.c      |   78 ++++++++++++++++++++++++++++++++++++++++++++++
> > > > > > > > > > > >  2 files changed, 81 insertions(+)
> > > > > > > > > > > > 
> > > > > > > > > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > > > > > > > > > index a7661c0..6c168f1 100644
> > > > > > > > > > > > --- a/include/linux/kvm_host.h
> > > > > > > > > > > > +++ b/include/linux/kvm_host.h
> > > > > > > > > > > > @@ -219,6 +219,8 @@ struct kvm_kernel_irq_routing_entry {
> > > > > > > > > > > >  	u32 type;
> > > > > > > > > > > >  	int (*set)(struct kvm_kernel_irq_routing_entry *e,
> > > > > > > > > > > >  		   struct kvm *kvm, int irq_source_id, int level);
> > > > > > > > > > > > +	int (*clear)(struct kvm_kernel_irq_routing_entry *e,
> > > > > > > > > > > > +		     struct kvm *kvm, int irq_source_id);
> > > > > > > > > > > >  	union {
> > > > > > > > > > > >  		struct {
> > > > > > > > > > > >  			unsigned irqchip;
> > > > > > > > > > > > @@ -629,6 +631,7 @@ void kvm_get_intr_delivery_bitmask(struct kvm_ioapic *ioapic,
> > > > > > > > > > > >  				   unsigned long *deliver_bitmask);
> > > > > > > > > > > >  #endif
> > > > > > > > > > > >  int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level);
> > > > > > > > > > > > +int kvm_clear_irq(struct kvm *kvm, int irq_source_id, u32 irq);
> > > > > > > > > > > >  int kvm_set_msi(struct kvm_kernel_irq_routing_entry *irq_entry, struct kvm *kvm,
> > > > > > > > > > > >  		int irq_source_id, int level);
> > > > > > > > > > > >  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
> > > > > > > > > > > > diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> > > > > > > > > > > > index 5afb431..76e8f22 100644
> > > > > > > > > > > > --- a/virt/kvm/irq_comm.c
> > > > > > > > > > > > +++ b/virt/kvm/irq_comm.c
> > > > > > > > > > > > @@ -68,6 +68,42 @@ static int kvm_set_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
> > > > > > > > > > > >  	return kvm_ioapic_set_irq(ioapic, e->irqchip.pin, level);
> > > > > > > > > > > >  }
> > > > > > > > > > > >  
> > > > > > > > > > > > +static inline int kvm_clear_irq_line_state(unsigned long *irq_state,
> > > > > > > > > > > > +					    int irq_source_id)
> > > > > > > > > > > > +{
> > > > > > > > > > > > +	return !!test_and_clear_bit(irq_source_id, irq_state);
> > > > > > > > > > > > +}
> > > > > > > > > > > > +
> > > > > > > > > > > > +static int kvm_clear_pic_irq(struct kvm_kernel_irq_routing_entry *e,
> > > > > > > > > > > > +			     struct kvm *kvm, int irq_source_id)
> > > > > > > > > > > > +{
> > > > > > > > > > > > +#ifdef CONFIG_X86
> > > > > > > > > > > > +	struct kvm_pic *pic = pic_irqchip(kvm);
> > > > > > > > > > > > +	int level = kvm_clear_irq_line_state(&pic->irq_states[e->irqchip.pin],
> > > > > > > > > > > > +					     irq_source_id);
> > > > > > > > > > > > +	if (level)
> > > > > > > > > > > > +		kvm_pic_set_irq(pic, e->irqchip.pin,
> > > > > > > > > > > > +				!!pic->irq_states[e->irqchip.pin]);
> > > > > > > > > > > > +	return level;
> > > > > > > > > > > 
> > > > > > > > > > > I think I begin to understand: if (level) checks it was previously set,
> > > > > > > > > > > and then we clear if needed?
> > > > > > > > > > 
> > > > > > > > > > It's actually very simple, if we change anything in irq_states, then
> > > > > > > > > > update via the chip specific set_irq function.
> > > > > > > > > > 
> > > > > > > > > > >  I think it's worthwhile to rename
> > > > > > > > > > > level to orig_level and rewrite as:
> > > > > > > > > > > 
> > > > > > > > > > > 	if (orig_level && !pic->irq_states[e->irqchip.pin])
> > > > > > > > > > > 		kvm_pic_set_irq(pic, e->irqchip.pin, 0);
> > > > > > > > > > > 
> > > > > > > > > > > This both makes the logic clear without need for comments and
> > > > > > > > > > > saves some cycles on pic in case nothing actually changed.
> > > > > > > > > > 
> > > > > > > > > > That may work, but it's not actually the same thing.  kvm_set_irq(,,,0)
> > > > > > > > > > will clear the bit and call kvm_pic_set_irq with the new irq_states
> > > > > > > > > > value, whether it's 0 or 1.  The optimization I make is to only call
> > > > > > > > > > kvm_pic_set_irq if we've "changed" irq_states.  You're taking that one
> > > > > > > > > > step further to "changed and is now 0".  I don't know if that's correct
> > > > > > > > > > behavior.
> > > > > > > > > 
> > > > > > > > > If not then I don't understand. You clear a bit
> > > > > > > > > in a word. You never change it to 1, do you?
> > > > > > > > 
> > > > > > > > Correct, but kvm_set_irq(,,,0) may call kvm_pic_set_irq(,,1) if other
> > > > > > > > source IDs are still asserting the interrupt.  Your proposal assumes
> > > > > > > > that unless irq_states is also 0 we don't need to call kvm_pic_set_irq,
> > > > > > > > and I don't know if that's correct.
> > > > > > > 
> > > > > > > Well you are asked to clear some id and level was 1. So we know
> > > > > > > interrupt was asserted. Either we clear it or we don't. No?
> > > > > > > 
> > > > > > > > > 
> > > > > > > > > But this brings another question:
> > > > > > > > > 
> > > > > > > > > static inline int kvm_irq_line_state(unsigned long *irq_state,
> > > > > > > > >                                      int irq_source_id, int level)
> > > > > > > > > {
> > > > > > > > >         /* Logical OR for level trig interrupt */
> > > > > > > > >         if (level)
> > > > > > > > >                 set_bit(irq_source_id, irq_state);
> > > > > > > > >         else
> > > > > > > > >                 clear_bit(irq_source_id, irq_state);
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > ^^^^^^^^^^^
> > > > > > > > > above uses locked instructions
> > > > > > > > > 
> > > > > > > > >         return !!(*irq_state);
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > above doesn't
> > > > > > > > > 
> > > > > > > > > }
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > why the inconsistency?
> > > > > > > > 
> > > > > > > > Note that set/clear_bit are not locked instructions,
> > > > > > > 
> > > > > > > On x86 they are:
> > > > > > > static __always_inline void
> > > > > > > set_bit(unsigned int nr, volatile unsigned long *addr)
> > > > > > > {
> > > > > > >         if (IS_IMMEDIATE(nr)) {
> > > > > > >                 asm volatile(LOCK_PREFIX "orb %1,%0"
> > > > > > >                         : CONST_MASK_ADDR(nr, addr)
> > > > > > >                         : "iq" ((u8)CONST_MASK(nr))
> > > > > > >                         : "memory");
> > > > > > >         } else {
> > > > > > >                 asm volatile(LOCK_PREFIX "bts %1,%0"
> > > > > > >                         : BITOP_ADDR(addr) : "Ir" (nr) : "memory");
> > > > > > >         }
> > > > > > > }
> > > > > > > 
> > > > > > > > but atomic
> > > > > > > > instructions and it could be argued that reading the value is also
> > > > > > > > atomic.  At least that was my guess when I stumbled across the same
> > > > > > > > yesterday.  IMHO, we're going off into the weeds again with these last
> > > > > > > > two patches.  It may be a valid optimization, but it really has no
> > > > > > > > bearing on the meat of the series (and afaict, no significant
> > > > > > > > performance difference either).
> > > > > > > 
> > > > > > > For me it's not a performance thing. IMO code is cleaner without this locking:
> > > > > > > we add a lock but only use it in some cases, so the rules become really
> > > > > > > complex.
> > > > > > 
> > > > > > Seriously?
> > > > > > 
> > > > > >         spin_lock(&irqfd->source->lock);
> > > > > >         if (!irqfd->source->level_asserted) {
> > > > > >                 kvm_set_irq(irqfd->kvm, irqfd->source->id, irqfd->gsi, 1);
> > > > > >                 irqfd->source->level_asserted = true;
> > > > > >         }
> > > > > >         spin_unlock(&irqfd->source->lock);
> > > > > > 
> > > > > > ...
> > > > > > 
> > > > > >         spin_lock(&eoifd->source->lock);
> > > > > >         if (eoifd->source->level_asserted) {
> > > > > >                 kvm_set_irq(eoifd->kvm,
> > > > > >                             eoifd->source->id, eoifd->notifier.gsi, 0);
> > > > > >                 eoifd->source->level_asserted = false;
> > > > > >                 eventfd_signal(eoifd->eventfd, 1);
> > > > > >         }
> > > > > >         spin_unlock(&eoifd->source->lock);
> > > > > > 
> > > > > > 
> > > > > > Locking doesn't get much more straightforward than that
> > > > > 
> > > > > Don't look at it in isolation. You are now calling kvm_set_irq
> > > > > from under a spinlock. You are saying it is always safe but
> > > > > this seems far from obvious. kvm_set_irq used to be
> > > > > unsafe from an atomic context.
> > > > 
> > > > Device assignment has been calling kvm_set_irq from atomic context for
> > > > quite a long time.
> > > 
> > > Only for MSI. That's an exception (and it's also a messy one).
> > 
> > Nope, I see past code that used it for INTx as well.
> 
> While this looks like it will not crash, this scans all vcpus under a
> spinlock. A problem for big VMs.
> Again, yes we have such uses now but we are looking for ways
> to fix them and not be adding more.
> 
> 
Same as with MSI.

--
			Gleb.


* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-17 16:45                             ` Alex Williamson
@ 2012-07-17 18:55                               ` Michael S. Tsirkin
  2012-07-17 19:51                                 ` Alex Williamson
  0 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-17 18:55 UTC (permalink / raw)
  To: Alex Williamson; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, Jul 17, 2012 at 10:45:52AM -0600, Alex Williamson wrote:
> On Tue, 2012-07-17 at 19:21 +0300, Michael S. Tsirkin wrote:
> > On Tue, Jul 17, 2012 at 10:17:03AM -0600, Alex Williamson wrote:
> > > > > > > > > >   And current code looks buggy if yes we need to fix it somehow.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Which to me seems to indicate this should be handled as a separate
> > > > > > > > > effort.
> > > > > > > > 
> > > > > > > > A separate patchset, sure. But likely a prerequisite: we still need to
> > > > > > > > look at all the code. Let's not copy bugs, need to fix them.
> > > > > > > 
> > > > > > > This looks tangential to me unless you can come up with an actual reason
> > > > > > > the above spinlock usage is incorrect or insufficient.
> > > > > > 
> > > > > > You copy the same pattern that seems racy. So you double the
> > > > > > > amount of code that would need to be fixed.
> > > > > 
> > > > > 
> > > > > _Seems_ racy, or _is_ racy?  Please identify the race.
> > > > 
> > > > Look at this:
> > > > 
> > > > static inline int kvm_irq_line_state(unsigned long *irq_state,
> > > >                                      int irq_source_id, int level)
> > > > {
> > > >         /* Logical OR for level trig interrupt */
> > > >         if (level)
> > > >                 set_bit(irq_source_id, irq_state);
> > > >         else
> > > >                 clear_bit(irq_source_id, irq_state);
> > > > 
> > > >         return !!(*irq_state);
> > > > }
> > > > 
> > > > 
> > > > Now:
> > > > If other CPU changes some other bit after the atomic change,
> > > > it looks like !!(*irq_state) might return a stale value.
> > > > 
> > > > CPU 0 clears bit 0. CPU 1 sets bit 1. CPU 1 sets level to 1.
> > > > If CPU 0 sees a stale value now it will return 0 here
> > > > and interrupt will get cleared.
> > > > 
> > > > 
> > > > Maybe this is not a problem. But in that case IMO it needs
> > > > a comment explaining why and why it's not a problem in
> > > > your code.
> > > 
> > > So you want to close the door on anything that uses kvm_set_irq until
> > > this gets fixed... that's insane.
> > 
> > What does kvm_set_irq use have to do with it?  You posted this patch:
> > 
> > +static int kvm_clear_pic_irq(struct kvm_kernel_irq_routing_entry *e,
> > +                            struct kvm *kvm, int irq_source_id)
> > +{
> > +#ifdef CONFIG_X86
> > +       struct kvm_pic *pic = pic_irqchip(kvm);
> > +       int level = kvm_clear_irq_line_state(&pic->irq_states[e->irqchip.pin],
> > +                                            irq_source_id);
> > +       if (level)
> > +               kvm_pic_set_irq(pic, e->irqchip.pin,
> > +                               !!pic->irq_states[e->irqchip.pin]);
> > +       return level;
> > +#else
> > +       return -1;
> > +#endif
> > +}
> > +
> > 
> > it seems racy in the same way.
> 
> Now you're just misrepresenting how we got here, which was:
> 
> > > > > > > IMHO, we're going off into the weeds again with these last
> > > > > > > two patches.  It may be a valid optimization, but it really has no
> > > > > > > bearing on the meat of the series (and afaict, no significant
> > > > > > > performance difference either).
> > > > > > 
> > > > > > For me it's not a performance thing. IMO code is cleaner without this locking:
> > > > > > we add a lock but only use it in some cases, so the rules become really
> > > > > > complex.
> 
> So I'm happy to drop the last 2 patches, which were done at your request
> anyway, but you've failed to show how the locking in patches 1&2 is
> messy, inconsistent, or complex and now you're asking to block all
> progress.

I'm asking for bugs to get fixed and not duplicated. Adding more bugs is
not progress. Or maybe there is no bug, in which case let's work out why
and add a comment.
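
One possible shape for such a fix, as a user-space sketch only:
model_irq_line_state() below is a hypothetical name, and C11 atomics stand
in for the kernel bitops.  The returned level is computed from the value
seen by the atomic operation itself rather than from a separate read
afterwards.  Note that this only addresses the read of irq_state; it does
not by itself order the later call into the pic/ioapic between two CPUs.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

/*
 * Hypothetical model of kvm_irq_line_state(): update one source's bit
 * and report the resulting line level using the word observed by the
 * same atomic read-modify-write.
 */
static int model_irq_line_state(atomic_ulong *irq_state,
				int irq_source_id, int level)
{
	unsigned long mask, old, now;

	assert(irq_source_id >= 0 &&
	       irq_source_id < (int)(sizeof(unsigned long) * CHAR_BIT));
	mask = 1UL << irq_source_id;

	if (level) {
		old = atomic_fetch_or(irq_state, mask);
		now = old | mask;
	} else {
		old = atomic_fetch_and(irq_state, ~mask);
		now = old & ~mask;
	}

	/* Logical OR over all sources, as seen by this operation. */
	return now != 0;
}
```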

>  Those patches are just users of kvm_set_irq.


Well, these add calls to kvm_set_irq, which scans all vcpus under a
spinlock. In the past Avi also thought this was not a good idea.
Maybe things have changed.

-- 
MST


* Re: [PATCH v5 2/4] kvm: KVM_EOIFD, an eventfd for EOIs
  2012-07-17 16:52                         ` Alex Williamson
@ 2012-07-17 18:58                           ` Michael S. Tsirkin
  2012-07-17 20:03                             ` Alex Williamson
  0 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-17 18:58 UTC (permalink / raw)
  To: Alex Williamson; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, Jul 17, 2012 at 10:52:16AM -0600, Alex Williamson wrote:
> On Tue, 2012-07-17 at 19:19 +0300, Michael S. Tsirkin wrote:
> > On Tue, Jul 17, 2012 at 10:06:01AM -0600, Alex Williamson wrote:
> > > On Tue, 2012-07-17 at 18:53 +0300, Michael S. Tsirkin wrote:
> > > > On Tue, Jul 17, 2012 at 09:41:09AM -0600, Alex Williamson wrote:
> > > > > On Tue, 2012-07-17 at 18:13 +0300, Michael S. Tsirkin wrote:
> > > > > > On Tue, Jul 17, 2012 at 08:57:04AM -0600, Alex Williamson wrote:
> > > > > > > On Tue, 2012-07-17 at 17:42 +0300, Michael S. Tsirkin wrote:
> > > > > > > > On Tue, Jul 17, 2012 at 08:29:43AM -0600, Alex Williamson wrote:
> > > > > > > > > On Tue, 2012-07-17 at 17:10 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > > On Tue, Jul 17, 2012 at 07:59:16AM -0600, Alex Williamson wrote:
> > > > > > > > > > > On Tue, 2012-07-17 at 13:21 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > > > > On Mon, Jul 16, 2012 at 02:33:55PM -0600, Alex Williamson wrote:
> > > > > > > > > > > > > +	if (args->flags & KVM_EOIFD_FLAG_LEVEL_IRQFD) {
> > > > > > > > > > > > > +		struct _irqfd *irqfd = _irqfd_fdget_lock(kvm, args->irqfd);
> > > > > > > > > > > > > +		if (IS_ERR(irqfd)) {
> > > > > > > > > > > > > +			ret = PTR_ERR(irqfd);
> > > > > > > > > > > > > +			goto fail;
> > > > > > > > > > > > > +		}
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +		gsi = irqfd->gsi;
> > > > > > > > > > > > > +		level_irqfd = eventfd_ctx_get(irqfd->eventfd);
> > > > > > > > > > > > > +		source = _irq_source_get(irqfd->source);
> > > > > > > > > > > > > +		_irqfd_put_unlock(irqfd);
> > > > > > > > > > > > > +		if (!source) {
> > > > > > > > > > > > > +			ret = -EINVAL;
> > > > > > > > > > > > > +			goto fail;
> > > > > > > > > > > > > +		}
> > > > > > > > > > > > > +	} else {
> > > > > > > > > > > > > +		ret = -EINVAL;
> > > > > > > > > > > > > +		goto fail;
> > > > > > > > > > > > > +	}
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +	INIT_LIST_HEAD(&eoifd->list);
> > > > > > > > > > > > > +	eoifd->kvm = kvm;
> > > > > > > > > > > > > +	eoifd->eventfd = eventfd;
> > > > > > > > > > > > > +	eoifd->source = source;
> > > > > > > > > > > > > +	eoifd->level_irqfd = level_irqfd;
> > > > > > > > > > > > > +	eoifd->notifier.gsi = gsi;
> > > > > > > > > > > > > +	eoifd->notifier.irq_acked = eoifd_event;
> > > > > > > > > > > > 
> > > > > > > > > > > > OK so this means eoifd keeps a reference to the irqfd.
> > > > > > > > > > > > And since this is the case, can't we drop the reference counting
> > > > > > > > > > > > around source ids now? Everything is referenced through irqfd.
> > > > > > > > > > > 
> > > > > > > > > > > Holding a reference and using it as a reference count are not the same
> > > > > > > > > > > thing.  What if another module holds a reference to this eventfd?  How
> > > > > > > > > > > do we do anything on release?
> > > > > > > > > > 
> > > > > > > > > > We don't as there is no release, and using kref on source id does not
> > > > > > > > > > help: it just never gets invoked.
> > > > > > > > > 
> > > > > > > > > Please work out how you think it should work and let me know, I don't
> > > > > > > > > see it.  We have an irq source id that needs to be allocated by irqfd
> > > > > > > > > and returned when it's unused.  It becomes unused when neither irqfd nor
> > > > > > > > > eoifd are making use of it.  irqfd and eoifd may be closed in any order.
> > > > > > > > > Use of the source id is what we're reference counting, which is why it's
> > > > > > > > > in struct _irq_source.  How can I use an eventfd reference for the same?
> > > > > > > > > I don't know when it's unused.  I don't know who else holds a reference
> > > > > > > > > to it...  Doesn't make sense to me.  Feels like attempting to squat on
> > > > > > > > > someone else's object.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > eoifd should prevent irqfd from being released.
> > > > > > > 
> > > > > > > Why?  Note that this is actually quite difficult too.  We can't fail a
> > > > > > > release, nobody checks close(3p) return.  Blocking a release is likely
> > > > > > > to cause all sorts of problems, so what you mean is that irqfd should
> > > > > > > linger around until there are no references to it... but that's exactly
> > > > > > > what struct _irq_source is for, is to hold the bits that we care about
> > > > > > > references to and automatically release it when there are none.
> > > > > > 
> > > > > > No no. You *already* prevent it. You take a reference to the eventfd
> > > > > > context.
> > > > > 
> > > > > Right, which keeps the fd from going away, not the struct _irqfd.
> > > > 
> > > > _irqfd too.
> > > 
> > > 
> > > How so?
> > 
> > Normally irqfd_wakeup is called with POLLHUP and calls irqfd_deactivate.
> > If you get a ctx reference this does not happen.
> 
> I think you're mistaken.  wake_up_poll(,POLLHUP) is called from
> eventfd_release (file_operations.release), not from ctx reference
> release.

True, I was wrong. So close has the same bug as deassign. To fix it,
how about having eoifd hold a reference to the irqfd instead of the
eventfd context?

> > > > > > > >   It already keeps
> > > > > > > > a reference to it so it prevents irqfd from going away by userspace
> > > > > > > > closing the fd.
> > > > > > > 
> > > > > > > Wrong, eoifd holds a reference to the eventfd for the irqfd, so it
> > > > > > > prevents the fd from going away, not the irqfd.
> > > > > > 
> > > > > > When the fd is not going away an ioctl is the only other way for
> > > > > > it to go away.
> > > > > 
> > > > > It doesn't do any good to fail the ioctl if close(fd) allows it.
> > > > 
> > > > allows what? It does nothing.
> > > > 
> > > > > > > >   But it can still be released with deassign.
> > > > > > > > An easy solution is to fail deassign of irqfd if there is
> > > > > > > > eoifd bound to it.
> > > > > > > 
> > > > > > > I don't know why we would impose such a bizarre usage model when
> > > > > > > reference counting on struct _irq_source seems to handle this nicely
> > > > > > > already.
> > > > > > 
> > > > > > Well eventfd gets an irqfd. What does it mean if said irqfd gets
> > > > > > deassigned, and potentially assigned an unrelated interrupt?
> > > > > > I think what I would expect is for it to handle the new interrupt.
> > > > > > This is hard to implement so let us fail this.
> > > > > 
> > > > > Ah, so an actual problem, let's solve this.  Why wouldn't we just search
> > > > > the list of eoifds and see if this level_irqfd is already used?  If we
> > > > > find it and it's compatible, we can get a reference to the _irq_source
> > > > > and "re-attach" the irqfd.  If it's not compatible, fail the KVM_IRQFD.
> > > > > If the KVM_IRQFD is for an edge irqfd, I think we let it go.
> > > > 
> > > > This is just confusing. Userspace has no idea that you are reusing fds
> > > > behind the scenes. assign is not the problem, deassign is.
> > > > So fail *that*.
> 
> 


* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-17 18:55                               ` Michael S. Tsirkin
@ 2012-07-17 19:51                                 ` Alex Williamson
  2012-07-17 21:05                                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Alex Williamson @ 2012-07-17 19:51 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, 2012-07-17 at 21:55 +0300, Michael S. Tsirkin wrote:
> On Tue, Jul 17, 2012 at 10:45:52AM -0600, Alex Williamson wrote:
> > On Tue, 2012-07-17 at 19:21 +0300, Michael S. Tsirkin wrote:
> > > On Tue, Jul 17, 2012 at 10:17:03AM -0600, Alex Williamson wrote:
> > > > > > > > > > >   And current code looks buggy if yes we need to fix it somehow.
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Which to me seems to indicate this should be handled as a separate
> > > > > > > > > > effort.
> > > > > > > > > 
> > > > > > > > > A separate patchset, sure. But likely a prerequisite: we still need to
> > > > > > > > > look at all the code. Let's not copy bugs, need to fix them.
> > > > > > > > 
> > > > > > > > This looks tangential to me unless you can come up with an actual reason
> > > > > > > > the above spinlock usage is incorrect or insufficient.
> > > > > > > 
> > > > > > > You copy the same pattern that seems racy. So you double the
> > > > > > > amount of code that would need to be fixed.
> > > > > > 
> > > > > > 
> > > > > > _Seems_ racy, or _is_ racy?  Please identify the race.
> > > > > 
> > > > > Look at this:
> > > > > 
> > > > > static inline int kvm_irq_line_state(unsigned long *irq_state,
> > > > >                                      int irq_source_id, int level)
> > > > > {
> > > > >         /* Logical OR for level trig interrupt */
> > > > >         if (level)
> > > > >                 set_bit(irq_source_id, irq_state);
> > > > >         else
> > > > >                 clear_bit(irq_source_id, irq_state);
> > > > > 
> > > > >         return !!(*irq_state);
> > > > > }
> > > > > 
> > > > > 
> > > > > Now:
> > > > > If other CPU changes some other bit after the atomic change,
> > > > > it looks like !!(*irq_state) might return a stale value.
> > > > > 
> > > > > CPU 0 clears bit 0. CPU 1 sets bit 1. CPU 1 sets level to 1.
> > > > > If CPU 0 sees a stale value now it will return 0 here
> > > > > and interrupt will get cleared.
> > > > > 
> > > > > 
> > > > > Maybe this is not a problem. But in that case IMO it needs
> > > > > a comment explaining why and why it's not a problem in
> > > > > your code.
> > > > 
> > > > So you want to close the door on anything that uses kvm_set_irq until
> > > > this gets fixed... that's insane.
> > > 
> > > What does kvm_set_irq use have to do with it?  You posted this patch:
> > > 
> > > +static int kvm_clear_pic_irq(struct kvm_kernel_irq_routing_entry *e,
> > > +                            struct kvm *kvm, int irq_source_id)
> > > +{
> > > +#ifdef CONFIG_X86
> > > +       struct kvm_pic *pic = pic_irqchip(kvm);
> > > > +       int level = kvm_clear_irq_line_state(&pic->irq_states[e->irqchip.pin],
> > > +                                            irq_source_id);
> > > +       if (level)
> > > +               kvm_pic_set_irq(pic, e->irqchip.pin,
> > > +                               !!pic->irq_states[e->irqchip.pin]);
> > > +       return level;
> > > +#else
> > > +       return -1;
> > > +#endif
> > > +}
> > > +
> > > 
> > > it seems racy in the same way.
> > 
> > Now you're just misrepresenting how we got here, which was:
> > 
> > > > > > > > IMHO, we're going off into the weeds again with these last
> > > > > > > > two patches.  It may be a valid optimization, but it really has no
> > > > > > > > bearing on the meat of the series (and afaict, no significant
> > > > > > > > performance difference either).
> > > > > > > 
> > > > > > > For me it's not a performance thing. IMO code is cleaner without this locking:
> > > > > > > we add a lock but only use it in some cases, so the rules become really
> > > > > > > complex.
> > 
> > So I'm happy to drop the last 2 patches, which were done at your request
> > anyway, but you've failed to show how the locking in patches 1&2 is
> > messy, inconsistent, or complex and now you're asking to block all
> > progress.
> 
> I'm asking for bugs to get fixed and not duplicated. Adding more bugs is
> not progress. Or maybe there is no bug. Let's see why and add a comment.
> 
> >  Those patches are just users of kvm_set_irq.
> 
> 
> Well these add calls to kvm_set_irq which scans all vcpus under
> spinlock. In the past Avi thought this is not a good idea too.
> Maybe things changed.

We can drop the spinlock if we don't care about spurious EOIs, which is
only a theoretical scalability problem anyway.  We're talking about
level interrupts here, how scalable do we need to be?
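
For reference, the race question quoted above turns on the window between
the atomic bit update and the separate read of *irq_state.  A minimal
userspace sketch, assuming C11 atomics as a stand-in for the kernel's
set_bit()/clear_bit() (all function names below are hypothetical, not KVM
code), contrasts the two shapes.  It does not settle whether the kernel code
is actually racy; that still depends on which locks the callers hold and on
how the returned level is delivered to the pic/ioapic.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

/*
 * Userspace model only: an atomic_ulong stands in for the kernel's
 * unsigned long irq_state bitmap, one bit per irq source id.
 */
static unsigned long source_mask(int source_id)
{
	/* Shifting by an out-of-range count is undefined, so check first. */
	assert(source_id >= 0 &&
	       source_id < (int)(sizeof(unsigned long) * CHAR_BIT));
	return 1UL << source_id;
}

/* Two-step shape: atomic bit update, then a separate read of the word. */
static int line_state_two_step(atomic_ulong *irq_state, int source_id, int level)
{
	unsigned long mask = source_mask(source_id);

	if (level)
		atomic_fetch_or(irq_state, mask);
	else
		atomic_fetch_and(irq_state, ~mask);

	/*
	 * Other threads may change other bits between the update and this
	 * load, so the result need not describe the state this update left.
	 */
	return !!atomic_load(irq_state);
}

/* One-step shape: derive the result from the value the update itself saw. */
static int line_state_one_step(atomic_ulong *irq_state, int source_id, int level)
{
	unsigned long mask = source_mask(source_id);
	unsigned long old;

	if (level) {
		old = atomic_fetch_or(irq_state, mask);
		return !!(old | mask);
	}
	old = atomic_fetch_and(irq_state, ~mask);
	return !!(old & ~mask);
}
```

The one-step variant only makes the returned level consistent with the
update that produced it.  Ordering between two callers that then program the
interrupt controller would still need a lock or some other serialization.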





* Re: [PATCH v5 2/4] kvm: KVM_EOIFD, an eventfd for EOIs
  2012-07-17 18:58                           ` Michael S. Tsirkin
@ 2012-07-17 20:03                             ` Alex Williamson
  2012-07-17 21:23                               ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Alex Williamson @ 2012-07-17 20:03 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, 2012-07-17 at 21:58 +0300, Michael S. Tsirkin wrote:
> On Tue, Jul 17, 2012 at 10:52:16AM -0600, Alex Williamson wrote:
> > On Tue, 2012-07-17 at 19:19 +0300, Michael S. Tsirkin wrote:
> > > On Tue, Jul 17, 2012 at 10:06:01AM -0600, Alex Williamson wrote:
> > > > On Tue, 2012-07-17 at 18:53 +0300, Michael S. Tsirkin wrote:
> > > > > On Tue, Jul 17, 2012 at 09:41:09AM -0600, Alex Williamson wrote:
> > > > > > On Tue, 2012-07-17 at 18:13 +0300, Michael S. Tsirkin wrote:
> > > > > > > On Tue, Jul 17, 2012 at 08:57:04AM -0600, Alex Williamson wrote:
> > > > > > > > On Tue, 2012-07-17 at 17:42 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > On Tue, Jul 17, 2012 at 08:29:43AM -0600, Alex Williamson wrote:
> > > > > > > > > > On Tue, 2012-07-17 at 17:10 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > > > On Tue, Jul 17, 2012 at 07:59:16AM -0600, Alex Williamson wrote:
> > > > > > > > > > > > On Tue, 2012-07-17 at 13:21 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > > > > > On Mon, Jul 16, 2012 at 02:33:55PM -0600, Alex Williamson wrote:
> > > > > > > > > > > > > > +	if (args->flags & KVM_EOIFD_FLAG_LEVEL_IRQFD) {
> > > > > > > > > > > > > > +		struct _irqfd *irqfd = _irqfd_fdget_lock(kvm, args->irqfd);
> > > > > > > > > > > > > > +		if (IS_ERR(irqfd)) {
> > > > > > > > > > > > > > +			ret = PTR_ERR(irqfd);
> > > > > > > > > > > > > > +			goto fail;
> > > > > > > > > > > > > > +		}
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > +		gsi = irqfd->gsi;
> > > > > > > > > > > > > > +		level_irqfd = eventfd_ctx_get(irqfd->eventfd);
> > > > > > > > > > > > > > +		source = _irq_source_get(irqfd->source);
> > > > > > > > > > > > > > +		_irqfd_put_unlock(irqfd);
> > > > > > > > > > > > > > +		if (!source) {
> > > > > > > > > > > > > > +			ret = -EINVAL;
> > > > > > > > > > > > > > +			goto fail;
> > > > > > > > > > > > > > +		}
> > > > > > > > > > > > > > +	} else {
> > > > > > > > > > > > > > +		ret = -EINVAL;
> > > > > > > > > > > > > > +		goto fail;
> > > > > > > > > > > > > > +	}
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > +	INIT_LIST_HEAD(&eoifd->list);
> > > > > > > > > > > > > > +	eoifd->kvm = kvm;
> > > > > > > > > > > > > > +	eoifd->eventfd = eventfd;
> > > > > > > > > > > > > > +	eoifd->source = source;
> > > > > > > > > > > > > > +	eoifd->level_irqfd = level_irqfd;
> > > > > > > > > > > > > > +	eoifd->notifier.gsi = gsi;
> > > > > > > > > > > > > > +	eoifd->notifier.irq_acked = eoifd_event;
> > > > > > > > > > > > > 
> > > > > > > > > > > > > OK so this means eoifd keeps a reference to the irqfd.
> > > > > > > > > > > > > And since this is the case, can't we drop the reference counting
> > > > > > > > > > > > > around source ids now? Everything is referenced through irqfd.
> > > > > > > > > > > > 
> > > > > > > > > > > > Holding a reference and using it as a reference count are not the same
> > > > > > > > > > > > thing.  What if another module holds a reference to this eventfd?  How
> > > > > > > > > > > > do we do anything on release?
> > > > > > > > > > > 
> > > > > > > > > > > We don't as there is no release, and using kref on source id does not
> > > > > > > > > > > help: it just never gets invoked.
> > > > > > > > > > 
> > > > > > > > > > Please work out how you think it should work and let me know, I don't
> > > > > > > > > > see it.  We have an irq source id that needs to be allocated by irqfd
> > > > > > > > > > and returned when it's unused.  It becomes unused when neither irqfd nor
> > > > > > > > > > eoifd are making use of it.  irqfd and eoifd may be closed in any order.
> > > > > > > > > > Use of the source id is what we're reference counting, which is why it's
> > > > > > > > > > in struct _irq_source.  How can I use an eventfd reference for the same?
> > > > > > > > > > I don't know when it's unused.  I don't know who else holds a reference
> > > > > > > > > > to it...  Doesn't make sense to me.  Feels like attempting to squat on
> > > > > > > > > > someone else's object.
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > eoifd should prevent irqfd from being released.
> > > > > > > > 
> > > > > > > > Why?  Note that this is actually quite difficult too.  We can't fail a
> > > > > > > > release, nobody checks close(3p) return.  Blocking a release is likely
> > > > > > > > to cause all sorts of problems, so what you mean is that irqfd should
> > > > > > > > linger around until there are no references to it... but that's exactly
> > > > > > > > what struct _irq_source is for, is to hold the bits that we care about
> > > > > > > > references to and automatically release it when there are none.
> > > > > > > 
> > > > > > > No no. You *already* prevent it. You take a reference to the eventfd
> > > > > > > context.
> > > > > > 
> > > > > > Right, which keeps the fd from going away, not the struct _irqfd.
> > > > > 
> > > > > _irqfd too.
> > > > 
> > > > 
> > > > How so?
> > > 
> > > Normally irqfd_wakeup is called with POLLHUP and calls irqfd_deactivate.
> > > If you get a ctx reference this does not happen.
> > 
> > I think you're mistaken.  wake_up_poll(,POLLHUP) is called from
> > eventfd_release (file_operations.release), not from ctx reference
> > release.
> 
> True. I was wrong. So close has the same bug as deassign. To fix,
> how about eoifd will hold a reference to the irqfd instead of the
> eventfd context?

What does it mean to hold a reference to the irqfd?  What state of
functionality is an irqfd that has been closed/de-assigned but is still
attached to an eoifd?  It can't continue to fire interrupts into the
guest.  I don't think close or de-assign have a bug, assign has a bug
that it can allow re-assignment using an in-use eventfd.  I think I'd
rather fix that.
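
The source id lifetime rule quoted earlier in this message (allocated by
irqfd, returned once neither irqfd nor eoifd uses it, with the two closed in
any order) can be modelled in a few lines.  This is a userspace sketch under
stated assumptions: struct irq_source_model and its helpers are hypothetical
and are not the patch's struct _irq_source, and the "released" flag merely
stands in for handing the id back to KVM.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model of a reference-counted irq source id. */
struct irq_source_model {
	int id;
	int refs;
	int *released;	/* set to 1 when the id is handed back */
};

/* The creator (irqfd in the thread) starts with one reference. */
static struct irq_source_model *source_new(int id, int *released)
{
	struct irq_source_model *s = malloc(sizeof(*s));

	if (!s)
		return NULL;
	s->id = id;
	s->refs = 1;
	s->released = released;
	*released = 0;
	return s;
}

/* A second user (eoifd in the thread) takes its own reference. */
static struct irq_source_model *source_get(struct irq_source_model *s)
{
	s->refs++;
	return s;
}

/* Whoever drops the last reference returns the id, in either order. */
static void source_put(struct irq_source_model *s)
{
	assert(s->refs > 0);
	if (--s->refs == 0) {
		*s->released = 1;
		free(s);
	}
}
```

Dropping the irqfd reference first leaves the id allocated until the eoifd
reference goes away too, and the reverse order behaves the same way.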



* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-17 19:51                                 ` Alex Williamson
@ 2012-07-17 21:05                                   ` Michael S. Tsirkin
  2012-07-17 22:01                                     ` Alex Williamson
  0 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-17 21:05 UTC (permalink / raw)
  To: Alex Williamson; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, Jul 17, 2012 at 01:51:27PM -0600, Alex Williamson wrote:
> On Tue, 2012-07-17 at 21:55 +0300, Michael S. Tsirkin wrote:
> > On Tue, Jul 17, 2012 at 10:45:52AM -0600, Alex Williamson wrote:
> > > On Tue, 2012-07-17 at 19:21 +0300, Michael S. Tsirkin wrote:
> > > > On Tue, Jul 17, 2012 at 10:17:03AM -0600, Alex Williamson wrote:
> > > > > > > > > > > >   And current code looks buggy if yes we need to fix it somehow.
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > Which to me seems to indicate this should be handled as a separate
> > > > > > > > > > > effort.
> > > > > > > > > > 
> > > > > > > > > > A separate patchset, sure. But likely a prerequisite: we still need to
> > > > > > > > > > look at all the code. Let's not copy bugs, need to fix them.
> > > > > > > > > 
> > > > > > > > > This looks tangential to me unless you can come up with an actual reason
> > > > > > > > > the above spinlock usage is incorrect or insufficient.
> > > > > > > > 
> > > > > > > > You copy the same pattern that seems racy. So you double the
> > > > > > > > amount of code that would need to be fixed.
> > > > > > > 
> > > > > > > 
> > > > > > > _Seems_ racy, or _is_ racy?  Please identify the race.
> > > > > > 
> > > > > > Look at this:
> > > > > > 
> > > > > > static inline int kvm_irq_line_state(unsigned long *irq_state,
> > > > > >                                      int irq_source_id, int level)
> > > > > > {
> > > > > >         /* Logical OR for level trig interrupt */
> > > > > >         if (level)
> > > > > >                 set_bit(irq_source_id, irq_state);
> > > > > >         else
> > > > > >                 clear_bit(irq_source_id, irq_state);
> > > > > > 
> > > > > >         return !!(*irq_state);
> > > > > > }
> > > > > > 
> > > > > > 
> > > > > > Now:
> > > > > > If other CPU changes some other bit after the atomic change,
> > > > > > it looks like !!(*irq_state) might return a stale value.
> > > > > > 
> > > > > > CPU 0 clears bit 0. CPU 1 sets bit 1. CPU 1 sets level to 1.
> > > > > > If CPU 0 sees a stale value now it will return 0 here
> > > > > > and interrupt will get cleared.
> > > > > > 
> > > > > > 
> > > > > > Maybe this is not a problem. But in that case IMO it needs
> > > > > > a comment explaining why and why it's not a problem in
> > > > > > your code.
> > > > > 
> > > > > So you want to close the door on anything that uses kvm_set_irq until
> > > > > this gets fixed... that's insane.
> > > > 
> > > > What does kvm_set_irq use have to do with it?  You posted this patch:
> > > > 
> > > > +static int kvm_clear_pic_irq(struct kvm_kernel_irq_routing_entry *e,
> > > > +                            struct kvm *kvm, int irq_source_id)
> > > > +{
> > > > +#ifdef CONFIG_X86
> > > > +       struct kvm_pic *pic = pic_irqchip(kvm);
> > > > > +       int level = kvm_clear_irq_line_state(&pic->irq_states[e->irqchip.pin],
> > > > +                                            irq_source_id);
> > > > +       if (level)
> > > > +               kvm_pic_set_irq(pic, e->irqchip.pin,
> > > > +                               !!pic->irq_states[e->irqchip.pin]);
> > > > +       return level;
> > > > +#else
> > > > +       return -1;
> > > > +#endif
> > > > +}
> > > > +
> > > > 
> > > > it seems racy in the same way.
> > > 
> > > Now you're just misrepresenting how we got here, which was:
> > > 
> > > > > > > > > IMHO, we're going off into the weeds again with these last
> > > > > > > > > two patches.  It may be a valid optimization, but it really has no
> > > > > > > > > bearing on the meat of the series (and afaict, no significant
> > > > > > > > > performance difference either).
> > > > > > > > 
> > > > > > > > For me it's not a performance thing. IMO code is cleaner without this locking:
> > > > > > > > we add a lock but only use it in some cases, so the rules become really
> > > > > > > > complex.
> > > 
> > > So I'm happy to drop the last 2 patches, which were done at your request
> > > anyway, but you've failed to show how the locking in patches 1&2 is
> > > messy, inconsistent, or complex and now you're asking to block all
> > > progress.
> > 
> > I'm asking for bugs to get fixed and not duplicated. Adding more bugs is
> > not progress. Or maybe there is no bug. Let's see why and add a comment.
> > 
> > >  Those patches are just users of kvm_set_irq.
> > 
> > 
> > Well these add calls to kvm_set_irq which scans all vcpus under
> > spinlock. In the past Avi thought this is not a good idea too.
> > Maybe things changed.
> 
> We can drop the spinlock if we don't care about spurious EOIs, which is
> only a theoretical scalability problem anyway.

Not theoretical at all IMO. We see the problem with virtio with old
guests today.

> We're talking about
> level interrupts here, how scalable do we need to be?
> 

The reason we are moving them into kernel at all is for speed, no?

-- 
MST


* Re: [PATCH v5 2/4] kvm: KVM_EOIFD, an eventfd for EOIs
  2012-07-17 20:03                             ` Alex Williamson
@ 2012-07-17 21:23                               ` Michael S. Tsirkin
  2012-07-17 22:09                                 ` Alex Williamson
  0 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-17 21:23 UTC (permalink / raw)
  To: Alex Williamson; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, Jul 17, 2012 at 02:03:05PM -0600, Alex Williamson wrote:
> On Tue, 2012-07-17 at 21:58 +0300, Michael S. Tsirkin wrote:
> > On Tue, Jul 17, 2012 at 10:52:16AM -0600, Alex Williamson wrote:
> > > On Tue, 2012-07-17 at 19:19 +0300, Michael S. Tsirkin wrote:
> > > > On Tue, Jul 17, 2012 at 10:06:01AM -0600, Alex Williamson wrote:
> > > > > On Tue, 2012-07-17 at 18:53 +0300, Michael S. Tsirkin wrote:
> > > > > > On Tue, Jul 17, 2012 at 09:41:09AM -0600, Alex Williamson wrote:
> > > > > > > On Tue, 2012-07-17 at 18:13 +0300, Michael S. Tsirkin wrote:
> > > > > > > > On Tue, Jul 17, 2012 at 08:57:04AM -0600, Alex Williamson wrote:
> > > > > > > > > On Tue, 2012-07-17 at 17:42 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > > On Tue, Jul 17, 2012 at 08:29:43AM -0600, Alex Williamson wrote:
> > > > > > > > > > > On Tue, 2012-07-17 at 17:10 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > > > > On Tue, Jul 17, 2012 at 07:59:16AM -0600, Alex Williamson wrote:
> > > > > > > > > > > > > On Tue, 2012-07-17 at 13:21 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > > > > > > On Mon, Jul 16, 2012 at 02:33:55PM -0600, Alex Williamson wrote:
> > > > > > > > > > > > > > > +	if (args->flags & KVM_EOIFD_FLAG_LEVEL_IRQFD) {
> > > > > > > > > > > > > > > +		struct _irqfd *irqfd = _irqfd_fdget_lock(kvm, args->irqfd);
> > > > > > > > > > > > > > > +		if (IS_ERR(irqfd)) {
> > > > > > > > > > > > > > > +			ret = PTR_ERR(irqfd);
> > > > > > > > > > > > > > > +			goto fail;
> > > > > > > > > > > > > > > +		}
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +		gsi = irqfd->gsi;
> > > > > > > > > > > > > > > +		level_irqfd = eventfd_ctx_get(irqfd->eventfd);
> > > > > > > > > > > > > > > +		source = _irq_source_get(irqfd->source);
> > > > > > > > > > > > > > > +		_irqfd_put_unlock(irqfd);
> > > > > > > > > > > > > > > +		if (!source) {
> > > > > > > > > > > > > > > +			ret = -EINVAL;
> > > > > > > > > > > > > > > +			goto fail;
> > > > > > > > > > > > > > > +		}
> > > > > > > > > > > > > > > +	} else {
> > > > > > > > > > > > > > > +		ret = -EINVAL;
> > > > > > > > > > > > > > > +		goto fail;
> > > > > > > > > > > > > > > +	}
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +	INIT_LIST_HEAD(&eoifd->list);
> > > > > > > > > > > > > > > +	eoifd->kvm = kvm;
> > > > > > > > > > > > > > > +	eoifd->eventfd = eventfd;
> > > > > > > > > > > > > > > +	eoifd->source = source;
> > > > > > > > > > > > > > > +	eoifd->level_irqfd = level_irqfd;
> > > > > > > > > > > > > > > +	eoifd->notifier.gsi = gsi;
> > > > > > > > > > > > > > > +	eoifd->notifier.irq_acked = eoifd_event;
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > OK so this means eoifd keeps a reference to the irqfd.
> > > > > > > > > > > > > > And since this is the case, can't we drop the reference counting
> > > > > > > > > > > > > > around source ids now? Everything is referenced through irqfd.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Holding a reference and using it as a reference count are not the same
> > > > > > > > > > > > > thing.  What if another module holds a reference to this eventfd?  How
> > > > > > > > > > > > > do we do anything on release?
> > > > > > > > > > > > 
> > > > > > > > > > > > We don't as there is no release, and using kref on source id does not
> > > > > > > > > > > > help: it just never gets invoked.
> > > > > > > > > > > 
> > > > > > > > > > > Please work out how you think it should work and let me know, I don't
> > > > > > > > > > > see it.  We have an irq source id that needs to be allocated by irqfd
> > > > > > > > > > > and returned when it's unused.  It becomes unused when neither irqfd nor
> > > > > > > > > > > eoifd are making use of it.  irqfd and eoifd may be closed in any order.
> > > > > > > > > > > Use of the source id is what we're reference counting, which is why it's
> > > > > > > > > > > in struct _irq_source.  How can I use an eventfd reference for the same?
> > > > > > > > > > > I don't know when it's unused.  I don't know who else holds a reference
> > > > > > > > > > > to it...  Doesn't make sense to me.  Feels like attempting to squat on
> > > > > > > > > > > someone else's object.
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > eoifd should prevent irqfd from being released.
> > > > > > > > > 
> > > > > > > > > Why?  Note that this is actually quite difficult too.  We can't fail a
> > > > > > > > > release, nobody checks close(3p) return.  Blocking a release is likely
> > > > > > > > > to cause all sorts of problems, so what you mean is that irqfd should
> > > > > > > > > linger around until there are no references to it... but that's exactly
> > > > > > > > > what struct _irq_source is for, is to hold the bits that we care about
> > > > > > > > > references to and automatically release it when there are none.
> > > > > > > > 
> > > > > > > > No no. You *already* prevent it. You take a reference to the eventfd
> > > > > > > > context.
> > > > > > > 
> > > > > > > Right, which keeps the fd from going away, not the struct _irqfd.
> > > > > > 
> > > > > > _irqfd too.
> > > > > 
> > > > > 
> > > > > How so?
> > > > 
> > > > Normally irqfd_wakeup is called with POLLHUP and calls irqfd_deactivate.
> > > > If you get a ctx reference this does not happen.
> > > 
> > > I think you're mistaken.  wake_up_poll(,POLLHUP) is called from
> > > eventfd_release (file_operations.release), not from ctx reference
> > > release.
> > 
> > True. I was wrong. So close has the same bug as deassign. To fix,
> > how about eoifd will hold a reference to the irqfd instead of the
> > eventfd context?
> 
> What does it mean to hold a reference to the irqfd?

I meant file *reference: eventfd_fget. But there are other options see
below.

> What state of functionality is an irqfd that has been
> closed/de-assigned but is still attached to an eoifd?  It can't
> continue to fire interrupts into the guest.
>
> I don't think close or de-assign have a bug, assign has a bug that it
> can allow re-assignment using an in-use eventfd.  I think I'd rather
> fix that.

Let me show you that the bug is in deassign:
	assign irqfd for fd=1
	assign for eoifd fd=2, irqfd=1
	deassign irqfd 1

At this point eoifd has no meaning and there is also no way to deassign
it, so the bug already triggered.

I can see two ways out:
1. easy way - fail deassign
2. elegant way - shut down eoifd on irqfd deassign too

I'm fine with both approaches.
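
The two ways out listed above can be sketched as a small userspace model.
Everything here is an assumption made for illustration: the struct and
function names are hypothetical, and -EBUSY is an assumed error code that
the thread does not specify.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct eoifd_model;

/* Hypothetical one-to-one pairing of an irqfd with an eoifd. */
struct irqfd_model {
	int fd;
	int assigned;
	struct eoifd_model *bound_eoifd;
};

struct eoifd_model {
	int fd;
	int active;
	struct irqfd_model *irqfd;
};

/* Option 1 (easy): refuse to deassign while an eoifd is still bound. */
static int irqfd_deassign_strict(struct irqfd_model *irqfd)
{
	if (irqfd->bound_eoifd)
		return -EBUSY;	/* assumed error code */
	irqfd->assigned = 0;
	return 0;
}

/* Option 2 (elegant): shut the bound eoifd down along with the irqfd. */
static int irqfd_deassign_cascade(struct irqfd_model *irqfd)
{
	struct eoifd_model *eoifd = irqfd->bound_eoifd;

	if (eoifd) {
		/* The pairing is expected to be symmetric. */
		assert(eoifd->irqfd == irqfd);
		eoifd->active = 0;
		eoifd->irqfd = NULL;
		irqfd->bound_eoifd = NULL;
	}
	irqfd->assigned = 0;
	return 0;
}
```

With option 1 userspace has to tear down the eoifd before the irqfd.  With
option 2 a single deassign leaves neither object half-alive.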

-- 
MST


* Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts
  2012-07-16 20:33 ` [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts Alex Williamson
@ 2012-07-17 21:26   ` Michael S. Tsirkin
  2012-07-17 21:57     ` Alex Williamson
  2012-07-18 10:41   ` Michael S. Tsirkin
  1 sibling, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-17 21:26 UTC (permalink / raw)
  To: Alex Williamson; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Mon, Jul 16, 2012 at 02:33:47PM -0600, Alex Williamson wrote:
> @@ -96,6 +183,9 @@ irqfd_shutdown(struct work_struct *work)
>  	 * It is now safe to release the object's resources
>  	 */
>  	eventfd_ctx_put(irqfd->eventfd);
> +
> +	_irq_source_put(irqfd->source);
> +
>  	kfree(irqfd);
>  }
>  

Hang on, this is a bug I think. This is done asynchronously.  So this
means that I can assign MAX number of irqfds, then close one, and now
assign will fail because deassign did not get freed.

Maybe we can solve this by flushing wq before assign?
Looks a bit fragile but may be enough - need to document well.
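
A toy model of the problem and of the proposed flush, as a sketch only.  The
pool size, the names, and the bitmap representation are all assumptions;
pool_flush() stands in for flushing the workqueue that runs irqfd_shutdown.

```c
#include <assert.h>
#include <errno.h>

#define MAX_SOURCE_IDS 4	/* arbitrary size for the model */

struct source_pool {
	unsigned long in_use;		/* bitmap of allocated ids */
	unsigned long pending_free;	/* ids queued for deferred release */
};

/* Allocate the lowest free id, or fail when the pool looks full. */
static int pool_alloc(struct source_pool *p)
{
	for (int id = 0; id < MAX_SOURCE_IDS; id++) {
		if (!(p->in_use & (1UL << id))) {
			p->in_use |= 1UL << id;
			return id;
		}
	}
	return -EBUSY;
}

/* Close path: the id is only queued here, it is not yet reusable. */
static void pool_free_deferred(struct source_pool *p, int id)
{
	assert(id >= 0 && id < MAX_SOURCE_IDS);
	p->pending_free |= 1UL << id;
}

/* Stand-in for the deferred work completing. */
static void pool_flush(struct source_pool *p)
{
	p->in_use &= ~p->pending_free;
	p->pending_free = 0;
}

/* Assign path with the proposed fix: flush first, then allocate. */
static int pool_alloc_flushing(struct source_pool *p)
{
	pool_flush(p);
	return pool_alloc(p);
}
```

Without the flush, an allocation made right after a close can fail even
though an id is about to become free.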

-- 
MST


* Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts
  2012-07-17 21:26   ` Michael S. Tsirkin
@ 2012-07-17 21:57     ` Alex Williamson
  2012-07-17 22:00       ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Alex Williamson @ 2012-07-17 21:57 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Wed, 2012-07-18 at 00:26 +0300, Michael S. Tsirkin wrote:
> On Mon, Jul 16, 2012 at 02:33:47PM -0600, Alex Williamson wrote:
> > @@ -96,6 +183,9 @@ irqfd_shutdown(struct work_struct *work)
> >  	 * It is now safe to release the object's resources
> >  	 */
> >  	eventfd_ctx_put(irqfd->eventfd);
> > +
> > +	_irq_source_put(irqfd->source);
> > +
> >  	kfree(irqfd);
> >  }
> >  
> 
> Hang on, this is a bug I think. This is done asynchronously.  So this
> means that I can assign MAX number of irqfds, then close one, and now
                         ^^^^^^^^^^^^^^^^^^^^ What is this?
Do you mean max irq source ids?

> assign will fail because deassign did not get freed.
> 
> Maybe we can solve this by flushing wq before assign?
> Looks a bit fragile but may be enough - need to document well.
> 





* Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts
  2012-07-17 21:57     ` Alex Williamson
@ 2012-07-17 22:00       ` Michael S. Tsirkin
  2012-07-17 22:16         ` Alex Williamson
  0 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-17 22:00 UTC (permalink / raw)
  To: Alex Williamson; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, Jul 17, 2012 at 03:57:41PM -0600, Alex Williamson wrote:
> On Wed, 2012-07-18 at 00:26 +0300, Michael S. Tsirkin wrote:
> > On Mon, Jul 16, 2012 at 02:33:47PM -0600, Alex Williamson wrote:
> > > @@ -96,6 +183,9 @@ irqfd_shutdown(struct work_struct *work)
> > >  	 * It is now safe to release the object's resources
> > >  	 */
> > >  	eventfd_ctx_put(irqfd->eventfd);
> > > +
> > > +	_irq_source_put(irqfd->source);
> > > +
> > >  	kfree(irqfd);
> > >  }
> > >  
> > 
> > Hang on, this is a bug I think. This is done asynchronously.  So this
> > means that I can assign MAX number of irqfds, then close one, and now
>                          ^^^^^^^^^^^^^^^^^^^^ What is this?
> Do you mean max irq source ids?

Yes, this is what I meant. Sorry about being unclear.

> > assign will fail because deassign did not get freed.
> > 
> > Maybe we can solve this by flushing wq before assign?
> > Looks a bit fragile but may be enough - need to document well.
> > 
> 
> 


* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-17 21:05                                   ` Michael S. Tsirkin
@ 2012-07-17 22:01                                     ` Alex Williamson
  2012-07-17 22:05                                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Alex Williamson @ 2012-07-17 22:01 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Wed, 2012-07-18 at 00:05 +0300, Michael S. Tsirkin wrote:
> On Tue, Jul 17, 2012 at 01:51:27PM -0600, Alex Williamson wrote:
> > On Tue, 2012-07-17 at 21:55 +0300, Michael S. Tsirkin wrote:
> > > On Tue, Jul 17, 2012 at 10:45:52AM -0600, Alex Williamson wrote:
> > > > On Tue, 2012-07-17 at 19:21 +0300, Michael S. Tsirkin wrote:
> > > > > On Tue, Jul 17, 2012 at 10:17:03AM -0600, Alex Williamson wrote:
> > > > > > > > > > > > >   And current code looks buggy if yes we need to fix it somehow.
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > Which to me seems to indicate this should be handled as a separate
> > > > > > > > > > > > effort.
> > > > > > > > > > > 
> > > > > > > > > > > A separate patchset, sure. But likely a prerequisite: we still need to
> > > > > > > > > > > look at all the code. Let's not copy bugs, need to fix them.
> > > > > > > > > > 
> > > > > > > > > > This looks tangential to me unless you can come up with an actual reason
> > > > > > > > > > the above spinlock usage is incorrect or insufficient.
> > > > > > > > > 
> > > > > > > > > You copy the same pattern that seems racy. So you double the
> > > > > > > > > amount of code that would need to be fixed.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > _Seems_ racy, or _is_ racy?  Please identify the race.
> > > > > > > 
> > > > > > > Look at this:
> > > > > > > 
> > > > > > > static inline int kvm_irq_line_state(unsigned long *irq_state,
> > > > > > >                                      int irq_source_id, int level)
> > > > > > > {
> > > > > > >         /* Logical OR for level trig interrupt */
> > > > > > >         if (level)
> > > > > > >                 set_bit(irq_source_id, irq_state);
> > > > > > >         else
> > > > > > >                 clear_bit(irq_source_id, irq_state);
> > > > > > > 
> > > > > > >         return !!(*irq_state);
> > > > > > > }
> > > > > > > 
> > > > > > > 
> > > > > > > Now:
> > > > > > > If other CPU changes some other bit after the atomic change,
> > > > > > > it looks like !!(*irq_state) might return a stale value.
> > > > > > > 
> > > > > > > CPU 0 clears bit 0. CPU 1 sets bit 1. CPU 1 sets level to 1.
> > > > > > > If CPU 0 sees a stale value now it will return 0 here
> > > > > > > and interrupt will get cleared.
> > > > > > > 
> > > > > > > 
> > > > > > > Maybe this is not a problem. But in that case IMO it needs
> > > > > > > a comment explaining why and why it's not a problem in
> > > > > > > your code.
> > > > > > 
> > > > > > So you want to close the door on anything that uses kvm_set_irq until
> > > > > > this gets fixed... that's insane.
> > > > > 
> > > > > What does kvm_set_irq use have to do with it?  You posted this patch:
> > > > > 
> > > > > +static int kvm_clear_pic_irq(struct kvm_kernel_irq_routing_entry *e,
> > > > > +                            struct kvm *kvm, int irq_source_id)
> > > > > +{
> > > > > +#ifdef CONFIG_X86
> > > > > +       struct kvm_pic *pic = pic_irqchip(kvm);
> > > > > > +       int level = kvm_clear_irq_line_state(&pic->irq_states[e->irqchip.pin],
> > > > > > +                                            irq_source_id);
> > > > > +       if (level)
> > > > > +               kvm_pic_set_irq(pic, e->irqchip.pin,
> > > > > +                               !!pic->irq_states[e->irqchip.pin]);
> > > > > +       return level;
> > > > > +#else
> > > > > +       return -1;
> > > > > +#endif
> > > > > +}
> > > > > +
> > > > > 
> > > > > it seems racy in the same way.
> > > > 
> > > > Now you're just misrepresenting how we got here, which was:
> > > > 
> > > > > > > > > > IMHO, we're going off into the weeds again with these last
> > > > > > > > > > two patches.  It may be a valid optimization, but it really has no
> > > > > > > > > > bearing on the meat of the series (and afaict, no significant
> > > > > > > > > > performance difference either).
> > > > > > > > > 
> > > > > > > > > For me it's not a performance thing. IMO code is cleaner without this locking:
> > > > > > > > > we add a lock but only use it in some cases, so the rules become really
> > > > > > > > > complex.
> > > > 
> > > > So I'm happy to drop the last 2 patches, which were done at your request
> > > > anyway, but you've failed to show how the locking in patches 1&2 is
> > > > messy, inconsistent, or complex and now you're asking to block all
> > > > progress.
> > > 
> > > I'm asking for bugs to get fixed and not duplicated. Adding more bugs is
> > > not progress. Or maybe there is no bug. Let's see why and add a comment.
> > > 
> > > >  Those patches are just users of kvm_set_irq.
> > > 
> > > 
> > > Well these add calls to kvm_set_irq which scans all vcpus under
> > > spinlock. In the past Avi thought this is not a good idea too.
> > > Maybe things changed.
> > 
> > We can drop the spinlock if we don't care about spurious EOIs, which is
> > only a theoretical scalability problem anyway.
> 
> Not theoretical at all IMO. We see the problem with virtio with old
> guests today.

And how are you injecting level interrupts with virtio today w/o this
interface?

> > We're talking about
> > level interrupts here, how scalable do we need to be?
> > 
> 
> The reason we are moving them into kernel at all is for speed, no?

Come on, if we take that approach why aren't we writing all of this in
assembly for speed?!  All I'm suggesting is there's a limit to return on
investment at some point.  Maybe it's here.


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-17 22:01                                     ` Alex Williamson
@ 2012-07-17 22:05                                       ` Michael S. Tsirkin
  2012-07-17 22:22                                         ` Alex Williamson
  0 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-17 22:05 UTC (permalink / raw)
  To: Alex Williamson; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, Jul 17, 2012 at 04:01:40PM -0600, Alex Williamson wrote:
> On Wed, 2012-07-18 at 00:05 +0300, Michael S. Tsirkin wrote:
> > On Tue, Jul 17, 2012 at 01:51:27PM -0600, Alex Williamson wrote:
> > > On Tue, 2012-07-17 at 21:55 +0300, Michael S. Tsirkin wrote:
> > > > On Tue, Jul 17, 2012 at 10:45:52AM -0600, Alex Williamson wrote:
> > > > > On Tue, 2012-07-17 at 19:21 +0300, Michael S. Tsirkin wrote:
> > > > > > On Tue, Jul 17, 2012 at 10:17:03AM -0600, Alex Williamson wrote:
> > > > > > > > > > > > > >   And current code looks buggy if yes we need to fix it somehow.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Which to me seems to indicate this should be handled as a separate
> > > > > > > > > > > > > effort.
> > > > > > > > > > > > 
> > > > > > > > > > > > A separate patchset, sure. But likely a prerequisite: we still need to
> > > > > > > > > > > > look at all the code. Let's not copy bugs, need to fix them.
> > > > > > > > > > > 
> > > > > > > > > > > This looks tangential to me unless you can come up with an actual reason
> > > > > > > > > > > the above spinlock usage is incorrect or insufficient.
> > > > > > > > > > 
> > > > > > > > > > You copy the same pattern that seems racy. So you double the
> > > > > > > > > > amount of code that would need to be fixed.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > _Seems_ racy, or _is_ racy?  Please identify the race.
> > > > > > > > 
> > > > > > > > Look at this:
> > > > > > > > 
> > > > > > > > static inline int kvm_irq_line_state(unsigned long *irq_state,
> > > > > > > >                                      int irq_source_id, int level)
> > > > > > > > {
> > > > > > > >         /* Logical OR for level trig interrupt */
> > > > > > > >         if (level)
> > > > > > > >                 set_bit(irq_source_id, irq_state);
> > > > > > > >         else
> > > > > > > >                 clear_bit(irq_source_id, irq_state);
> > > > > > > > 
> > > > > > > >         return !!(*irq_state);
> > > > > > > > }
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Now:
> > > > > > > > If other CPU changes some other bit after the atomic change,
> > > > > > > > it looks like !!(*irq_state) might return a stale value.
> > > > > > > > 
> > > > > > > > CPU 0 clears bit 0. CPU 1 sets bit 1. CPU 1 sets level to 1.
> > > > > > > > If CPU 0 sees a stale value now it will return 0 here
> > > > > > > > and interrupt will get cleared.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Maybe this is not a problem. But in that case IMO it needs
> > > > > > > > a comment explaining why and why it's not a problem in
> > > > > > > > your code.
> > > > > > > 
> > > > > > > So you want to close the door on anything that uses kvm_set_irq until
> > > > > > > this gets fixed... that's insane.
> > > > > > 
> > > > > > What does kvm_set_irq use have to do with it?  You posted this patch:
> > > > > > 
> > > > > > +static int kvm_clear_pic_irq(struct kvm_kernel_irq_routing_entry *e,
> > > > > > +                            struct kvm *kvm, int irq_source_id)
> > > > > > +{
> > > > > > +#ifdef CONFIG_X86
> > > > > > +       struct kvm_pic *pic = pic_irqchip(kvm);
> > > > > > > +       int level = kvm_clear_irq_line_state(&pic->irq_states[e->irqchip.pin],
> > > > > > > +                                            irq_source_id);
> > > > > > +       if (level)
> > > > > > +               kvm_pic_set_irq(pic, e->irqchip.pin,
> > > > > > +                               !!pic->irq_states[e->irqchip.pin]);
> > > > > > +       return level;
> > > > > > +#else
> > > > > > +       return -1;
> > > > > > +#endif
> > > > > > +}
> > > > > > +
> > > > > > 
> > > > > > it seems racy in the same way.
> > > > > 
> > > > > Now you're just misrepresenting how we got here, which was:
> > > > > 
> > > > > > > > > > > IMHO, we're going off into the weeds again with these last
> > > > > > > > > > > two patches.  It may be a valid optimization, but it really has no
> > > > > > > > > > > bearing on the meat of the series (and afaict, no significant
> > > > > > > > > > > performance difference either).
> > > > > > > > > > 
> > > > > > > > > > For me it's not a performance thing. IMO code is cleaner without this locking:
> > > > > > > > > > we add a lock but only use it in some cases, so the rules become really
> > > > > > > > > > complex.
> > > > > 
> > > > > So I'm happy to drop the last 2 patches, which were done at your request
> > > > > anyway, but you've failed to show how the locking in patches 1&2 is
> > > > > messy, inconsistent, or complex and now you're asking to block all
> > > > > progress.
> > > > 
> > > > I'm asking for bugs to get fixed and not duplicated. Adding more bugs is
> > > > not progress. Or maybe there is no bug. Let's see why and add a comment.
> > > > 
> > > > >  Those patches are just users of kvm_set_irq.
> > > > 
> > > > 
> > > > Well these add calls to kvm_set_irq which scans all vcpus under
> > > > spinlock. In the past Avi thought this is not a good idea too.
> > > > Maybe things changed.
> > > 
> > > We can drop the spinlock if we don't care about spurious EOIs, which is
> > > only a theoretical scalability problem anyway.
> > 
> > Not theoretical at all IMO. We see the problem with virtio with old
> > guests today.
> 
> And how are you injecting level interrupts with virtio today w/o this
> interface?

Not well at all. Bad performance with interrupt sharing.

> > > We're talking about
> > > level interrupts here, how scalable do we need to be?
> > > 
> > 
> > The reason we are moving them into kernel at all is for speed, no?
> 
> Come on, if we take that approach why aren't we writing all of this in
> assembly for speed?!  All I'm suggesting is there's a limit to return on
> investment at some point.  Maybe it's here.

Well I am just warning about known problems: don't wake up (typically)
many eoifds on a single interrupt or you will have scalability
problems that we already see with emulated devices.

-- 
MST

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 2/4] kvm: KVM_EOIFD, an eventfd for EOIs
  2012-07-17 21:23                               ` Michael S. Tsirkin
@ 2012-07-17 22:09                                 ` Alex Williamson
  2012-07-17 22:24                                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Alex Williamson @ 2012-07-17 22:09 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Wed, 2012-07-18 at 00:23 +0300, Michael S. Tsirkin wrote:
> On Tue, Jul 17, 2012 at 02:03:05PM -0600, Alex Williamson wrote:
> > On Tue, 2012-07-17 at 21:58 +0300, Michael S. Tsirkin wrote:
> > > On Tue, Jul 17, 2012 at 10:52:16AM -0600, Alex Williamson wrote:
> > > > On Tue, 2012-07-17 at 19:19 +0300, Michael S. Tsirkin wrote:
> > > > > On Tue, Jul 17, 2012 at 10:06:01AM -0600, Alex Williamson wrote:
> > > > > > On Tue, 2012-07-17 at 18:53 +0300, Michael S. Tsirkin wrote:
> > > > > > > On Tue, Jul 17, 2012 at 09:41:09AM -0600, Alex Williamson wrote:
> > > > > > > > On Tue, 2012-07-17 at 18:13 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > On Tue, Jul 17, 2012 at 08:57:04AM -0600, Alex Williamson wrote:
> > > > > > > > > > On Tue, 2012-07-17 at 17:42 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > > > On Tue, Jul 17, 2012 at 08:29:43AM -0600, Alex Williamson wrote:
> > > > > > > > > > > > On Tue, 2012-07-17 at 17:10 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > > > > > On Tue, Jul 17, 2012 at 07:59:16AM -0600, Alex Williamson wrote:
> > > > > > > > > > > > > > On Tue, 2012-07-17 at 13:21 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > > > > > > > On Mon, Jul 16, 2012 at 02:33:55PM -0600, Alex Williamson wrote:
> > > > > > > > > > > > > > > > +	if (args->flags & KVM_EOIFD_FLAG_LEVEL_IRQFD) {
> > > > > > > > > > > > > > > > +		struct _irqfd *irqfd = _irqfd_fdget_lock(kvm, args->irqfd);
> > > > > > > > > > > > > > > > +		if (IS_ERR(irqfd)) {
> > > > > > > > > > > > > > > > +			ret = PTR_ERR(irqfd);
> > > > > > > > > > > > > > > > +			goto fail;
> > > > > > > > > > > > > > > > +		}
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > +		gsi = irqfd->gsi;
> > > > > > > > > > > > > > > > +		level_irqfd = eventfd_ctx_get(irqfd->eventfd);
> > > > > > > > > > > > > > > > +		source = _irq_source_get(irqfd->source);
> > > > > > > > > > > > > > > > +		_irqfd_put_unlock(irqfd);
> > > > > > > > > > > > > > > > +		if (!source) {
> > > > > > > > > > > > > > > > +			ret = -EINVAL;
> > > > > > > > > > > > > > > > +			goto fail;
> > > > > > > > > > > > > > > > +		}
> > > > > > > > > > > > > > > > +	} else {
> > > > > > > > > > > > > > > > +		ret = -EINVAL;
> > > > > > > > > > > > > > > > +		goto fail;
> > > > > > > > > > > > > > > > +	}
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > +	INIT_LIST_HEAD(&eoifd->list);
> > > > > > > > > > > > > > > > +	eoifd->kvm = kvm;
> > > > > > > > > > > > > > > > +	eoifd->eventfd = eventfd;
> > > > > > > > > > > > > > > > +	eoifd->source = source;
> > > > > > > > > > > > > > > > +	eoifd->level_irqfd = level_irqfd;
> > > > > > > > > > > > > > > > +	eoifd->notifier.gsi = gsi;
> > > > > > > > > > > > > > > > +	eoifd->notifier.irq_acked = eoifd_event;
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > OK so this means eoifd keeps a reference to the irqfd.
> > > > > > > > > > > > > > > And since this is the case, can't we drop the reference counting
> > > > > > > > > > > > > > > around source ids now? Everything is referenced through irqfd.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Holding a reference and using it as a reference count are not the same
> > > > > > > > > > > > > > thing.  What if another module holds a reference to this eventfd?  How
> > > > > > > > > > > > > > do we do anything on release?
> > > > > > > > > > > > > 
> > > > > > > > > > > > > We don't as there is no release, and using kref on source id does not
> > > > > > > > > > > > > help: it just never gets invoked.
> > > > > > > > > > > > 
> > > > > > > > > > > > Please work out how you think it should work and let me know, I don't
> > > > > > > > > > > > see it.  We have an irq source id that needs to be allocated by irqfd
> > > > > > > > > > > > and returned when it's unused.  It becomes unused when neither irqfd nor
> > > > > > > > > > > > eoifd are making use of it.  irqfd and eoifd may be closed in any order.
> > > > > > > > > > > > Use of the source id is what we're reference counting, which is why it's
> > > > > > > > > > > > in struct _irq_source.  How can I use an eventfd reference for the same?
> > > > > > > > > > > > I don't know when it's unused.  I don't know who else holds a reference
> > > > > > > > > > > > to it...  Doesn't make sense to me.  Feels like attempting to squat on
> > > > > > > > > > > > someone else's object.
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > eoifd should prevent irqfd from being released.
> > > > > > > > > > 
> > > > > > > > > > Why?  Note that this is actually quite difficult too.  We can't fail a
> > > > > > > > > > release, nobody checks close(3p) return.  Blocking a release is likely
> > > > > > > > > > to cause all sorts of problems, so what you mean is that irqfd should
> > > > > > > > > > linger around until there are no references to it... but that's exactly
> > > > > > > > > > what struct _irq_source is for, is to hold the bits that we care about
> > > > > > > > > > references to and automatically release it when there are none.
> > > > > > > > > 
> > > > > > > > > No no. You *already* prevent it. You take a reference to the eventfd
> > > > > > > > > context.
> > > > > > > > 
> > > > > > > > Right, which keeps the fd from going away, not the struct _irqfd.
> > > > > > > 
> > > > > > > _irqfd too.
> > > > > > 
> > > > > > 
> > > > > > How so?
> > > > > 
> > > > > Normally irqfd_wakeup is called with POLLHUP and calls irqfd_deactivate.
> > > > > If you get a ctx reference this does not happen.
> > > > 
> > > > I think you're mistaken.  wake_up_poll(,POLLHUP) is called from
> > > > eventfd_release (file_operations.release), not from ctx reference
> > > > release.
> > > 
> > > True. I was wrong. so close has the same bug as deassign. To fix,
> > > how about eoifd will hold a reference to the irqfd instead of the
> > > eventfd context?
> > 
> > What does it mean to hold a reference to the irqfd?
> 
> I meant file *reference: eventfd_fget. But there are other options see
> below.

That's no better than the eventfd context we already hold.

> > What state of functionality is an irqfd that has been
> > closed/de-assigned but is still attached to an eoifd?  It can't
> > continue to fire interrupts into the guest.
> >
> > I don't think close or de-assign have a bug, assign has a bug that it
> > can allow re-assignment using an in-use eventfd.  I think I'd rather
> > fix that.
> 
> Let me show you that the bug is in deassign:
> 	assign irqfd for fd=1
> 	assign for eoifd fd=2, irqfd=1
> 	deassign irqfd 1
> 
> At this point eoifd has no meaning and there is also no way to deassign
> it,

Yes, there is.  This is exactly why I hold a reference to the eventfd
ctx.  It can still be deassigned by passing irqfd=1, we'll do an
eventfd_ctx_get and match it to that stored.

>  so the bug already triggered.
> 
> I can see two ways out:
> 1. easy way - fail deassign

Then close() and deassign are not the same.

> 2. elegant way - shut down eoifd on irqfd deassign too

Sorry, I've always been told it's a bad idea to have one interface kill
another from inside the kernel.

Given that your assertion above is incorrect, I still stand by fixing
assign.



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts
  2012-07-17 22:00       ` Michael S. Tsirkin
@ 2012-07-17 22:16         ` Alex Williamson
  2012-07-17 22:28           ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Alex Williamson @ 2012-07-17 22:16 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Wed, 2012-07-18 at 01:00 +0300, Michael S. Tsirkin wrote:
> On Tue, Jul 17, 2012 at 03:57:41PM -0600, Alex Williamson wrote:
> > On Wed, 2012-07-18 at 00:26 +0300, Michael S. Tsirkin wrote:
> > > On Mon, Jul 16, 2012 at 02:33:47PM -0600, Alex Williamson wrote:
> > > > @@ -96,6 +183,9 @@ irqfd_shutdown(struct work_struct *work)
> > > >  	 * It is now safe to release the object's resources
> > > >  	 */
> > > >  	eventfd_ctx_put(irqfd->eventfd);
> > > > +
> > > > +	_irq_source_put(irqfd->source);
> > > > +
> > > >  	kfree(irqfd);
> > > >  }
> > > >  
> > > 
> > > Hang on, this is a bug I think. This is done asynchronously.  So this
> > > means that I can assign MAX number of irqfds, then close one, and now
> >                          ^^^^^^^^^^^^^^^^^^^^ What is this?
> > Do you mean max irq source ids?
> 
> Yes, this is what I meant. Sorry about being unclear.
> 
> > > assign will fail because deassign did not get freed.
> > > 
> > > Maybe we can solve this by flushing wq before assign?
> > > Looks a bit fragile but may be enough - need to document well.

Why is this fragile?  We could even make it part of a retry so we don't
call it unless we need to.
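
For illustration only, the retry idea could look roughly like the
userspace sketch below.  Everything in it is hypothetical (struct
source_pool, pool_flush(), MAX_SOURCE_IDS and friends are stand-ins, not
kvm code): releases are deferred the way irqfd_shutdown defers them, and
the allocator only pays for a flush when the first attempt fails.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_SOURCE_IDS 4	/* hypothetical limit, for illustration only */

/* Hypothetical stand-in for the pool of irq source ids. */
struct source_pool {
	bool in_use[MAX_SOURCE_IDS];
	bool pending_free[MAX_SOURCE_IDS];	/* released, work not yet run */
};

/* Stand-in for the deferred shutdown work: queue an id for release. */
static void pool_release_deferred(struct source_pool *p, int id)
{
	assert(id >= 0 && id < MAX_SOURCE_IDS);
	p->pending_free[id] = true;
}

/* Stand-in for flushing the workqueue; returns how many ids were freed. */
static int pool_flush(struct source_pool *p)
{
	int freed = 0;

	for (int i = 0; i < MAX_SOURCE_IDS; i++) {
		if (p->pending_free[i]) {
			p->pending_free[i] = false;
			p->in_use[i] = false;
			freed++;
		}
	}
	return freed;
}

/* Take the lowest free id, or return -1 if none is free. */
static int pool_try_alloc(struct source_pool *p)
{
	for (int i = 0; i < MAX_SOURCE_IDS; i++) {
		if (!p->in_use[i]) {
			p->in_use[i] = true;
			return i;
		}
	}
	return -1;
}

/* Allocate an id; flush and retry once only if the first try fails. */
static int pool_alloc(struct source_pool *p)
{
	int id = pool_try_alloc(p);

	if (id < 0 && pool_flush(p) > 0)
		id = pool_try_alloc(p);
	return id;
}
```

The common path never touches the flush, so the cost only shows up when
the ids are actually exhausted.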



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-17 22:05                                       ` Michael S. Tsirkin
@ 2012-07-17 22:22                                         ` Alex Williamson
  2012-07-17 22:31                                           ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Alex Williamson @ 2012-07-17 22:22 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Wed, 2012-07-18 at 01:05 +0300, Michael S. Tsirkin wrote:
> On Tue, Jul 17, 2012 at 04:01:40PM -0600, Alex Williamson wrote:
> > On Wed, 2012-07-18 at 00:05 +0300, Michael S. Tsirkin wrote:
> > > On Tue, Jul 17, 2012 at 01:51:27PM -0600, Alex Williamson wrote:
> > > > On Tue, 2012-07-17 at 21:55 +0300, Michael S. Tsirkin wrote:
> > > > > On Tue, Jul 17, 2012 at 10:45:52AM -0600, Alex Williamson wrote:
> > > > > > On Tue, 2012-07-17 at 19:21 +0300, Michael S. Tsirkin wrote:
> > > > > > > On Tue, Jul 17, 2012 at 10:17:03AM -0600, Alex Williamson wrote:
> > > > > > > > > > > > > > >   And current code looks buggy if yes we need to fix it somehow.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Which to me seems to indicate this should be handled as a separate
> > > > > > > > > > > > > > effort.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > A separate patchset, sure. But likely a prerequisite: we still need to
> > > > > > > > > > > > > look at all the code. Let's not copy bugs, need to fix them.
> > > > > > > > > > > > 
> > > > > > > > > > > > This looks tangential to me unless you can come up with an actual reason
> > > > > > > > > > > > the above spinlock usage is incorrect or insufficient.
> > > > > > > > > > > 
> > > > > > > > > > > You copy the same pattern that seems racy. So you double the
> > > > > > > > > > > amount of code that would need to be fixed.
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > _Seems_ racy, or _is_ racy?  Please identify the race.
> > > > > > > > > 
> > > > > > > > > Look at this:
> > > > > > > > > 
> > > > > > > > > static inline int kvm_irq_line_state(unsigned long *irq_state,
> > > > > > > > >                                      int irq_source_id, int level)
> > > > > > > > > {
> > > > > > > > >         /* Logical OR for level trig interrupt */
> > > > > > > > >         if (level)
> > > > > > > > >                 set_bit(irq_source_id, irq_state);
> > > > > > > > >         else
> > > > > > > > >                 clear_bit(irq_source_id, irq_state);
> > > > > > > > > 
> > > > > > > > >         return !!(*irq_state);
> > > > > > > > > }
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Now:
> > > > > > > > > If other CPU changes some other bit after the atomic change,
> > > > > > > > > it looks like !!(*irq_state) might return a stale value.
> > > > > > > > > 
> > > > > > > > > CPU 0 clears bit 0. CPU 1 sets bit 1. CPU 1 sets level to 1.
> > > > > > > > > If CPU 0 sees a stale value now it will return 0 here
> > > > > > > > > and interrupt will get cleared.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Maybe this is not a problem. But in that case IMO it needs
> > > > > > > > > a comment explaining why and why it's not a problem in
> > > > > > > > > your code.
> > > > > > > > 
> > > > > > > > So you want to close the door on anything that uses kvm_set_irq until
> > > > > > > > this gets fixed... that's insane.
> > > > > > > 
> > > > > > > What does kvm_set_irq use have to do with it?  You posted this patch:
> > > > > > > 
> > > > > > > +static int kvm_clear_pic_irq(struct kvm_kernel_irq_routing_entry *e,
> > > > > > > +                            struct kvm *kvm, int irq_source_id)
> > > > > > > +{
> > > > > > > +#ifdef CONFIG_X86
> > > > > > > +       struct kvm_pic *pic = pic_irqchip(kvm);
> > > > > > > > +       int level = kvm_clear_irq_line_state(&pic->irq_states[e->irqchip.pin],
> > > > > > > > +                                            irq_source_id);
> > > > > > > +       if (level)
> > > > > > > +               kvm_pic_set_irq(pic, e->irqchip.pin,
> > > > > > > +                               !!pic->irq_states[e->irqchip.pin]);
> > > > > > > +       return level;
> > > > > > > +#else
> > > > > > > +       return -1;
> > > > > > > +#endif
> > > > > > > +}
> > > > > > > +
> > > > > > > 
> > > > > > > it seems racy in the same way.
> > > > > > 
> > > > > > Now you're just misrepresenting how we got here, which was:
> > > > > > 
> > > > > > > > > > > > IMHO, we're going off into the weeds again with these last
> > > > > > > > > > > > two patches.  It may be a valid optimization, but it really has no
> > > > > > > > > > > > bearing on the meat of the series (and afaict, no significant
> > > > > > > > > > > > performance difference either).
> > > > > > > > > > > 
> > > > > > > > > > > For me it's not a performance thing. IMO code is cleaner without this locking:
> > > > > > > > > > > we add a lock but only use it in some cases, so the rules become really
> > > > > > > > > > > complex.
> > > > > > 
> > > > > > So I'm happy to drop the last 2 patches, which were done at your request
> > > > > > anyway, but you've failed to show how the locking in patches 1&2 is
> > > > > > messy, inconsistent, or complex and now you're asking to block all
> > > > > > progress.
> > > > > 
> > > > > I'm asking for bugs to get fixed and not duplicated. Adding more bugs is
> > > > > not progress. Or maybe there is no bug. Let's see why and add a comment.
> > > > > 
> > > > > >  Those patches are just users of kvm_set_irq.
> > > > > 
> > > > > 
> > > > > Well these add calls to kvm_set_irq which scans all vcpus under
> > > > > spinlock. In the past Avi thought this is not a good idea too.
> > > > > Maybe things changed.
> > > > 
> > > > We can drop the spinlock if we don't care about spurious EOIs, which is
> > > > only a theoretical scalability problem anyway.
> > > 
> > > Not theoretical at all IMO. We see the problem with virtio with old
> > > guests today.
> > 
> > And how are you injecting level interrupts with virtio today w/o this
> > interface?
> 
> Not well at all. Bad performance with interrupt sharing.
> 
> > > > We're talking about
> > > > level interrupts here, how scalable do we need to be?
> > > > 
> > > 
> > > The reason we are moving them into kernel at all is for speed, no?
> > 
> > Come on, if we take that approach why aren't we writing all of this in
> > assembly for speed?!  All I'm suggesting is there's a limit to return on
> > investment at some point.  Maybe it's here.
> 
> Well I am just warning about known problems: don't wake up (typically)
> many eoifds on a single interrupt or you will have scalability
> problems that we already see with emulated devices.

Well, that's why we don't want to bounce back to userspace.  All we need
to do in VFIO for each callback is note that the interrupt wasn't
previously masked and do nothing.  So we're talking about acquiring a
spinlock and a few data references.  Sure, I'd like to avoid doing that,
but not if it means blocking this patch series until some unknown number
of patches fixes all the tangential problems you find.
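
For reference on the stale-read concern quoted above for
kvm_irq_line_state(): one way to avoid a separate re-read of the word is
to derive the level from the value the atomic operation itself returns.
The sketch below is userspace C11 atomics, not kernel bitops, and
irq_state_t and line_state_update() are hypothetical names.  It only
removes the second load; it does not by itself order the later update of
the pic/ioapic, which would still need whatever locking the caller uses.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the per-pin irq_state word. */
typedef atomic_ulong irq_state_t;

/*
 * Set or clear one source's bit and return the resulting line level,
 * computed from the value returned by the same atomic read-modify-write
 * rather than from a separate load afterwards.
 */
static bool line_state_update(irq_state_t *state, int source_id, bool level)
{
	unsigned long mask, old, new;

	assert(source_id >= 0 &&
	       (size_t)source_id < sizeof(unsigned long) * CHAR_BIT);
	mask = 1UL << source_id;

	if (level) {
		old = atomic_fetch_or(state, mask);
		new = old | mask;
	} else {
		old = atomic_fetch_and(state, ~mask);
		new = old & ~mask;
	}

	/* Logical OR across all sources, as of this operation. */
	return new != 0;
}
```

With two sources asserting the line, clearing one of them still reports
the line as high, and only clearing the last one reports it low.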


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 2/4] kvm: KVM_EOIFD, an eventfd for EOIs
  2012-07-17 22:09                                 ` Alex Williamson
@ 2012-07-17 22:24                                   ` Michael S. Tsirkin
  2012-07-18  2:44                                     ` Alex Williamson
  0 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-17 22:24 UTC (permalink / raw)
  To: Alex Williamson; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, Jul 17, 2012 at 04:09:25PM -0600, Alex Williamson wrote:
> On Wed, 2012-07-18 at 00:23 +0300, Michael S. Tsirkin wrote:
> > On Tue, Jul 17, 2012 at 02:03:05PM -0600, Alex Williamson wrote:
> > > On Tue, 2012-07-17 at 21:58 +0300, Michael S. Tsirkin wrote:
> > > > On Tue, Jul 17, 2012 at 10:52:16AM -0600, Alex Williamson wrote:
> > > > > On Tue, 2012-07-17 at 19:19 +0300, Michael S. Tsirkin wrote:
> > > > > > On Tue, Jul 17, 2012 at 10:06:01AM -0600, Alex Williamson wrote:
> > > > > > > On Tue, 2012-07-17 at 18:53 +0300, Michael S. Tsirkin wrote:
> > > > > > > > On Tue, Jul 17, 2012 at 09:41:09AM -0600, Alex Williamson wrote:
> > > > > > > > > On Tue, 2012-07-17 at 18:13 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > > On Tue, Jul 17, 2012 at 08:57:04AM -0600, Alex Williamson wrote:
> > > > > > > > > > > On Tue, 2012-07-17 at 17:42 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > > > > On Tue, Jul 17, 2012 at 08:29:43AM -0600, Alex Williamson wrote:
> > > > > > > > > > > > > On Tue, 2012-07-17 at 17:10 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > > > > > > On Tue, Jul 17, 2012 at 07:59:16AM -0600, Alex Williamson wrote:
> > > > > > > > > > > > > > > On Tue, 2012-07-17 at 13:21 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > > > > > > > > On Mon, Jul 16, 2012 at 02:33:55PM -0600, Alex Williamson wrote:
> > > > > > > > > > > > > > > > > +	if (args->flags & KVM_EOIFD_FLAG_LEVEL_IRQFD) {
> > > > > > > > > > > > > > > > > +		struct _irqfd *irqfd = _irqfd_fdget_lock(kvm, args->irqfd);
> > > > > > > > > > > > > > > > > +		if (IS_ERR(irqfd)) {
> > > > > > > > > > > > > > > > > +			ret = PTR_ERR(irqfd);
> > > > > > > > > > > > > > > > > +			goto fail;
> > > > > > > > > > > > > > > > > +		}
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > +		gsi = irqfd->gsi;
> > > > > > > > > > > > > > > > > +		level_irqfd = eventfd_ctx_get(irqfd->eventfd);
> > > > > > > > > > > > > > > > > +		source = _irq_source_get(irqfd->source);
> > > > > > > > > > > > > > > > > +		_irqfd_put_unlock(irqfd);
> > > > > > > > > > > > > > > > > +		if (!source) {
> > > > > > > > > > > > > > > > > +			ret = -EINVAL;
> > > > > > > > > > > > > > > > > +			goto fail;
> > > > > > > > > > > > > > > > > +		}
> > > > > > > > > > > > > > > > > +	} else {
> > > > > > > > > > > > > > > > > +		ret = -EINVAL;
> > > > > > > > > > > > > > > > > +		goto fail;
> > > > > > > > > > > > > > > > > +	}
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > +	INIT_LIST_HEAD(&eoifd->list);
> > > > > > > > > > > > > > > > > +	eoifd->kvm = kvm;
> > > > > > > > > > > > > > > > > +	eoifd->eventfd = eventfd;
> > > > > > > > > > > > > > > > > +	eoifd->source = source;
> > > > > > > > > > > > > > > > > +	eoifd->level_irqfd = level_irqfd;
> > > > > > > > > > > > > > > > > +	eoifd->notifier.gsi = gsi;
> > > > > > > > > > > > > > > > > +	eoifd->notifier.irq_acked = eoifd_event;
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > OK so this means eoifd keeps a reference to the irqfd.
> > > > > > > > > > > > > > > > And since this is the case, can't we drop the reference counting
> > > > > > > > > > > > > > > > around source ids now? Everything is referenced through irqfd.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Holding a reference and using it as a reference count are not the same
> > > > > > > > > > > > > > > thing.  What if another module holds a reference to this eventfd?  How
> > > > > > > > > > > > > > > do we do anything on release?
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > We don't as there is no release, and using kref on source id does not
> > > > > > > > > > > > > > help: it just never gets invoked.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Please work out how you think it should work and let me know, I don't
> > > > > > > > > > > > > see it.  We have an irq source id that needs to be allocated by irqfd
> > > > > > > > > > > > > and returned when it's unused.  It becomes unused when neither irqfd nor
> > > > > > > > > > > > > eoifd are making use of it.  irqfd and eoifd may be closed in any order.
> > > > > > > > > > > > > Use of the source id is what we're reference counting, which is why it's
> > > > > > > > > > > > > in struct _irq_source.  How can I use an eventfd reference for the same?
> > > > > > > > > > > > > I don't know when it's unused.  I don't know who else holds a reference
> > > > > > > > > > > > > to it...  Doesn't make sense to me.  Feels like attempting to squat on
> > > > > > > > > > > > > someone else's object.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > eoifd should prevent irqfd from being released.
> > > > > > > > > > > 
> > > > > > > > > > > Why?  Note that this is actually quite difficult too.  We can't fail a
> > > > > > > > > > > release, nobody checks close(3p) return.  Blocking a release is likely
> > > > > > > > > > > to cause all sorts of problems, so what you mean is that irqfd should
> > > > > > > > > > > linger around until there are no references to it... but that's exactly
> > > > > > > > > > > what struct _irq_source is for, is to hold the bits that we care about
> > > > > > > > > > > references to and automatically release it when there are none.
> > > > > > > > > > 
> > > > > > > > > > No no. You *already* prevent it. You take a reference to the eventfd
> > > > > > > > > > context.
> > > > > > > > > 
> > > > > > > > > Right, which keeps the fd from going away, not the struct _irqfd.
> > > > > > > > 
> > > > > > > > _irqfd too.
> > > > > > > 
> > > > > > > 
> > > > > > > How so?
> > > > > > 
> > > > > > Normally irqfd_wakeup is called with POLLHUP and calls irqfd_deactivate.
> > > > > > If you get a ctx reference this does not happen.
> > > > > 
> > > > > I think you're mistaken.  wake_up_poll(,POLLHUP) is called from
> > > > > eventfd_release (file_operations.release), not from ctx reference
> > > > > release.
> > > > 
> > > > True. I was wrong. so close has the same bug as deassign. To fix,
> > > > how about eoifd will hold a reference to the irqfd instead of the
> > > > eventfd context?
> > > 
> > > What does it mean to hold a reference to the irqfd?
> > 
> > I meant file *reference: eventfd_fget. But there are other options see
> > below.
> 
> That's no better than the eventfd context we already hold.

It means POLLHUP is not invoked until eoifd is closed.

> > > What state of functionality is an irqfd that has been
> > > closed/de-assigned but is still attached to an eoifd?  It can't
> > > continue to fire interrupts into the guest.
> > >
> > > I don't think close or de-assign have a bug, assign has a bug that it
> > > can allow re-assignment using an in-use eventfd.  I think I'd rather
> > > fix that.
> > 
> > Let me show you that the bug is in deassign:
> > 	assign irqfd for fd=1
> > 	assign for eoifd fd=2, irqfd=1
> > 	deassign irqfd 1
> > 
> > At this point eoifd has no meaning and there is also no way to deassign
> > it,
> 
> Yes, there is.  This is exactly why I hold a reference to the eventfd
> ctx.  It can still be deassigned by passing irqfd=1, we'll do an
> eventfd_ctx_get and match it to that stored.

OK.
What if instead we close irqfd 1?

> >  so the bug already triggered.
> >
> > I can see two ways out:
> > 1. easy way - fail deassign
> 
> Then close() and deassign are not the same.
> 
> > 2. elegant way - shut down eoifd on irqfd deassign too
> 
> Sorry, I've always been told it's a bad idea to have one interface kill
> another from inside the kernel.

Not kill, merely deassign.

> Given that your assertion above is incorrect, I still stand by fixing
> assign.

OK, but then you would also need to protect against someone binding
an irqfd that is not level to the same GSI.

Also if we go ahead with fixing assign - I do not think we need
to rebind to the same source id - just failing assign
of this irqfd with EBUSY should be enough.
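
A minimal user-space sketch of the "fail assign with EBUSY" idea, under
stated assumptions: the table, the entry struct and all four function
names are invented for illustration and are not the code in
virt/kvm/eventfd.c. An eventfd context stays "in use" as long as either
the irqfd or the eoifd is still bound to it, and assign refuses to reuse
it until both are gone.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_SOURCES 8 /* hypothetical table size for the model */

/* One slot per eventfd context bound as an irqfd, an eoifd, or both. */
struct source_entry {
	const void *ctx; /* eventfd context; NULL means the slot is free */
	int irqfd_bound;
	int eoifd_bound;
};

struct source_table {
	struct source_entry e[MAX_SOURCES];
};

static struct source_entry *find_entry(struct source_table *t, const void *ctx)
{
	for (int i = 0; i < MAX_SOURCES; i++)
		if (t->e[i].ctx == ctx)
			return &t->e[i];
	return NULL;
}

/* Free the slot once neither side uses it any more. */
static void put_entry(struct source_entry *e)
{
	if (!e->irqfd_bound && !e->eoifd_bound)
		e->ctx = NULL;
}

static int irqfd_assign(struct source_table *t, const void *ctx)
{
	struct source_entry *e;

	assert(ctx != NULL);
	if (find_entry(t, ctx))
		return -EBUSY; /* still in use, possibly only by an eoifd */
	e = find_entry(t, NULL); /* first free slot */
	if (!e)
		return -ENOMEM;
	e->ctx = ctx;
	e->irqfd_bound = 1;
	return 0;
}

static int irqfd_deassign(struct source_table *t, const void *ctx)
{
	struct source_entry *e = find_entry(t, ctx);

	if (!e || !e->irqfd_bound)
		return -ENOENT;
	e->irqfd_bound = 0;
	put_entry(e);
	return 0;
}

static int eoifd_assign(struct source_table *t, const void *ctx)
{
	struct source_entry *e = find_entry(t, ctx);

	if (!e || !e->irqfd_bound)
		return -ENOENT;
	if (e->eoifd_bound)
		return -EBUSY; /* one eoifd per irqfd */
	e->eoifd_bound = 1;
	return 0;
}

static int eoifd_deassign(struct source_table *t, const void *ctx)
{
	struct source_entry *e = find_entry(t, ctx);

	if (!e || !e->eoifd_bound)
		return -ENOENT;
	e->eoifd_bound = 0;
	put_entry(e);
	return 0;
}
```

With this model, "assign irqfd, assign eoifd, deassign irqfd, assign
irqfd again" fails the last step with -EBUSY until the eoifd is
deassigned as well.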

-- 
MST

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts
  2012-07-17 22:16         ` Alex Williamson
@ 2012-07-17 22:28           ` Michael S. Tsirkin
  0 siblings, 0 replies; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-17 22:28 UTC (permalink / raw)
  To: Alex Williamson; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, Jul 17, 2012 at 04:16:04PM -0600, Alex Williamson wrote:
> On Wed, 2012-07-18 at 01:00 +0300, Michael S. Tsirkin wrote:
> > On Tue, Jul 17, 2012 at 03:57:41PM -0600, Alex Williamson wrote:
> > > On Wed, 2012-07-18 at 00:26 +0300, Michael S. Tsirkin wrote:
> > > > On Mon, Jul 16, 2012 at 02:33:47PM -0600, Alex Williamson wrote:
> > > > > @@ -96,6 +183,9 @@ irqfd_shutdown(struct work_struct *work)
> > > > >  	 * It is now safe to release the object's resources
> > > > >  	 */
> > > > >  	eventfd_ctx_put(irqfd->eventfd);
> > > > > +
> > > > > +	_irq_source_put(irqfd->source);
> > > > > +
> > > > >  	kfree(irqfd);
> > > > >  }
> > > > >  
> > > > 
> > > > Hang on, this is a bug I think. This is done asynchronously.  So this
> > > > means that I can assign MAX number of irqfds, then close one, and now
> > >                          ^^^^^^^^^^^^^^^^^^^^ What is this?
> > > Do you mean max irq source ids?
> > 
> > Yes, this is what I meant. Sorry about being unclear.
> > 
> > > > assign will fail because deassign did not get freed.
> > > > 
> > > > Maybe we can solve this by flushing wq before assign?
> > > > Looks a bit fragile but may be enough - need to document well.
> 
> Why is this fragile?  We could even make it part of a retry so we don't
> call it unless we need to.
> 

It just ties assign and deassign together. Maybe that's OK, but please
add a comment explaining the whole design.
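
A rough user-space sketch of the flush-on-retry idea discussed above,
under stated assumptions: source_pool and every function below are
invented names standing in for the irq source id allocator and the
deferred irqfd cleanup, and MAX_SOURCE_IDS is an arbitrary limit. The
point is only that assign flushes pending releases when, and only when,
its first allocation attempt fails.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_SOURCE_IDS 4 /* hypothetical limit for the model */

struct source_pool {
	bool in_use[MAX_SOURCE_IDS];          /* ids currently allocated */
	bool pending_release[MAX_SOURCE_IDS]; /* ids queued for deferred release */
};

/* Deassign path: the id is only queued, it is not free yet. */
static void source_release_deferred(struct source_pool *p, int id)
{
	assert(id >= 0 && id < MAX_SOURCE_IDS);
	p->pending_release[id] = true;
}

/* Stand-in for flushing the cleanup workqueue: finish every queued release. */
static void source_flush_releases(struct source_pool *p)
{
	for (int i = 0; i < MAX_SOURCE_IDS; i++) {
		if (p->pending_release[i]) {
			p->pending_release[i] = false;
			p->in_use[i] = false;
		}
	}
}

/* Single allocation attempt; returns the id or -1 if none is free. */
static int source_try_alloc(struct source_pool *p)
{
	for (int i = 0; i < MAX_SOURCE_IDS; i++) {
		if (!p->in_use[i]) {
			p->in_use[i] = true;
			return i;
		}
	}
	return -1;
}

/* Assign path: flush and retry once, only if the first attempt fails. */
static int source_alloc(struct source_pool *p)
{
	int id = source_try_alloc(p);

	if (id < 0) {
		source_flush_releases(p);
		id = source_try_alloc(p);
	}
	return id;
}
```

Keeping the flush on the failure path means the common assign case never
waits on deassign, which is the retry variant suggested in the quoted
reply.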

-- 
MST

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-17 22:22                                         ` Alex Williamson
@ 2012-07-17 22:31                                           ` Michael S. Tsirkin
  0 siblings, 0 replies; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-17 22:31 UTC (permalink / raw)
  To: Alex Williamson; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, Jul 17, 2012 at 04:22:10PM -0600, Alex Williamson wrote:
> On Wed, 2012-07-18 at 01:05 +0300, Michael S. Tsirkin wrote:
> > On Tue, Jul 17, 2012 at 04:01:40PM -0600, Alex Williamson wrote:
> > > On Wed, 2012-07-18 at 00:05 +0300, Michael S. Tsirkin wrote:
> > > > On Tue, Jul 17, 2012 at 01:51:27PM -0600, Alex Williamson wrote:
> > > > > On Tue, 2012-07-17 at 21:55 +0300, Michael S. Tsirkin wrote:
> > > > > > On Tue, Jul 17, 2012 at 10:45:52AM -0600, Alex Williamson wrote:
> > > > > > > On Tue, 2012-07-17 at 19:21 +0300, Michael S. Tsirkin wrote:
> > > > > > > > On Tue, Jul 17, 2012 at 10:17:03AM -0600, Alex Williamson wrote:
> > > > > > > > > > > > > > > >   And current code looks buggy if yes we need to fix it somehow.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Which to me seems to indicate this should be handled as a separate
> > > > > > > > > > > > > > > effort.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > A separate patchset, sure. But likely a prerequisite: we still need to
> > > > > > > > > > > > > > look at all the code. Let's not copy bugs, need to fix them.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > This looks tangential to me unless you can come up with an actual reason
> > > > > > > > > > > > > the above spinlock usage is incorrect or insufficient.
> > > > > > > > > > > > 
> > > > > > > > > > > > You copy the same pattern that seems racy. So you double the
> > > > > > > > > > > > > amount of code that would need to be fixed.
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > _Seems_ racy, or _is_ racy?  Please identify the race.
> > > > > > > > > > 
> > > > > > > > > > Look at this:
> > > > > > > > > > 
> > > > > > > > > > static inline int kvm_irq_line_state(unsigned long *irq_state,
> > > > > > > > > >                                      int irq_source_id, int level)
> > > > > > > > > > {
> > > > > > > > > >         /* Logical OR for level trig interrupt */
> > > > > > > > > >         if (level)
> > > > > > > > > >                 set_bit(irq_source_id, irq_state);
> > > > > > > > > >         else
> > > > > > > > > >                 clear_bit(irq_source_id, irq_state);
> > > > > > > > > > 
> > > > > > > > > >         return !!(*irq_state);
> > > > > > > > > > }
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Now:
> > > > > > > > > > If other CPU changes some other bit after the atomic change,
> > > > > > > > > > it looks like !!(*irq_state) might return a stale value.
> > > > > > > > > > 
> > > > > > > > > > CPU 0 clears bit 0. CPU 1 sets bit 1. CPU 1 sets level to 1.
> > > > > > > > > > If CPU 0 sees a stale value now it will return 0 here
> > > > > > > > > > and interrupt will get cleared.
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Maybe this is not a problem. But in that case IMO it needs
> > > > > > > > > > a comment explaining why and why it's not a problem in
> > > > > > > > > > your code.
> > > > > > > > > 
> > > > > > > > > So you want to close the door on anything that uses kvm_set_irq until
> > > > > > > > > this gets fixed... that's insane.
> > > > > > > > 
> > > > > > > > What does kvm_set_irq use have to do with it?  You posted this patch:
> > > > > > > > 
> > > > > > > > +static int kvm_clear_pic_irq(struct kvm_kernel_irq_routing_entry *e,
> > > > > > > > +                            struct kvm *kvm, int irq_source_id)
> > > > > > > > +{
> > > > > > > > +#ifdef CONFIG_X86
> > > > > > > > +       struct kvm_pic *pic = pic_irqchip(kvm);
> > > > > > > > +       int level =
> > > > > > > > kvm_clear_irq_line_state(&pic->irq_states[e->irqchip.pin],
> > > > > > > > +                                            irq_source_id);
> > > > > > > > +       if (level)
> > > > > > > > +               kvm_pic_set_irq(pic, e->irqchip.pin,
> > > > > > > > +                               !!pic->irq_states[e->irqchip.pin]);
> > > > > > > > +       return level;
> > > > > > > > +#else
> > > > > > > > +       return -1;
> > > > > > > > +#endif
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > 
> > > > > > > > it seems racy in the same way.
> > > > > > > 
> > > > > > > Now you're just misrepresenting how we got here, which was:
> > > > > > > 
> > > > > > > > > > > > > IMHO, we're going off into the weeds again with these last
> > > > > > > > > > > > > two patches.  It may be a valid optimization, but it really has no
> > > > > > > > > > > > > bearing on the meat of the series (and afaict, no significant
> > > > > > > > > > > > > performance difference either).
> > > > > > > > > > > > 
> > > > > > > > > > > > For me it's not a performance thing. IMO code is cleaner without this locking:
> > > > > > > > > > > > we add a lock but only use it in some cases, so the rules become really
> > > > > > > > > > > > complex.
> > > > > > > 
> > > > > > > So I'm happy to drop the last 2 patches, which were done at your request
> > > > > > > anyway, but you've failed to show how the locking in patches 1&2 is
> > > > > > > messy, inconsistent, or complex and now you're asking to block all
> > > > > > > progress.
> > > > > > 
> > > > > > I'm asking for bugs to get fixed and not duplicated. Adding more bugs is
> > > > > > not progress. Or maybe there is no bug. Let's see why and add a comment.
> > > > > > 
> > > > > > >  Those patches are just users of kvm_set_irq.
> > > > > > 
> > > > > > 
> > > > > > Well these add calls to kvm_set_irq which scans all vcpus under
> > > > > > spinlock. In the past Avi thought this is not a good idea too.
> > > > > > Maybe things changed.
> > > > > 
> > > > > We can drop the spinlock if we don't care about spurious EOIs, which is
> > > > > only a theoretical scalability problem anyway.
> > > > 
> > > > Not theoretical at all IMO. We see the problem with virtio with old
> > > > guests today.
> > > 
> > > And how are you injecting level interrupts with virtio today w/o this
> > > interface?
> > 
> > Not well at all. Bad performance with interrupt sharing.
> > 
> > > > > We're talking about
> > > > > level interrupts here, how scalable do we need to be?
> > > > > 
> > > > 
> > > > The reason we are moving them into kernel at all is for speed, no?
> > > 
> > > Come on, if we take that approach why aren't we writing all of this in
> > > assembly for speed?!  All I'm suggesting is there's a limit to return on
> > > investment at some point.  Maybe it's here.
> > 
> > Well I am just warning about known problems: don't wake up (typically)
> > many eoifds on a single interrupt or you will have scalability
> > problems that we already see with emulated devices.
> 
> Well, that's why we don't want to bounce back to userspace.  All we need
> to do in VFIO for each callback is note that the interrupt wasn't
> previously masked and do nothing.  So we're talking about acquiring a
> spinlock and a few data references.  Sure, I'd like to avoid doing that,
> but not if it means blocking this patch series until some unknown number
> of patches fixes all the tangential problems you find.

I'd like Avi's take on whether kvm_set_irq under a spinlock here
is OK. He nacked this in the past.

Meanwhile fixing the tangential problems is time well spent too.
If there are races in kvm irq handling it seems more important
to fix them than add an optimization of level interrupts.

-- 
MST

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 2/4] kvm: KVM_EOIFD, an eventfd for EOIs
  2012-07-17 22:24                                   ` Michael S. Tsirkin
@ 2012-07-18  2:44                                     ` Alex Williamson
  2012-07-18 10:31                                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Alex Williamson @ 2012-07-18  2:44 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Wed, 2012-07-18 at 01:24 +0300, Michael S. Tsirkin wrote:
> On Tue, Jul 17, 2012 at 04:09:25PM -0600, Alex Williamson wrote:
> > On Wed, 2012-07-18 at 00:23 +0300, Michael S. Tsirkin wrote:
> > > On Tue, Jul 17, 2012 at 02:03:05PM -0600, Alex Williamson wrote:
> > > > On Tue, 2012-07-17 at 21:58 +0300, Michael S. Tsirkin wrote:
> > > > > On Tue, Jul 17, 2012 at 10:52:16AM -0600, Alex Williamson wrote:
> > > > > > On Tue, 2012-07-17 at 19:19 +0300, Michael S. Tsirkin wrote:
> > > > > > > On Tue, Jul 17, 2012 at 10:06:01AM -0600, Alex Williamson wrote:
> > > > > > > > On Tue, 2012-07-17 at 18:53 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > On Tue, Jul 17, 2012 at 09:41:09AM -0600, Alex Williamson wrote:
> > > > > > > > > > On Tue, 2012-07-17 at 18:13 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > > > On Tue, Jul 17, 2012 at 08:57:04AM -0600, Alex Williamson wrote:
> > > > > > > > > > > > On Tue, 2012-07-17 at 17:42 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > > > > > On Tue, Jul 17, 2012 at 08:29:43AM -0600, Alex Williamson wrote:
> > > > > > > > > > > > > > On Tue, 2012-07-17 at 17:10 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > > > > > > > On Tue, Jul 17, 2012 at 07:59:16AM -0600, Alex Williamson wrote:
> > > > > > > > > > > > > > > > On Tue, 2012-07-17 at 13:21 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > > > > > > > > > On Mon, Jul 16, 2012 at 02:33:55PM -0600, Alex Williamson wrote:
> > > > > > > > > > > > > > > > > > +	if (args->flags & KVM_EOIFD_FLAG_LEVEL_IRQFD) {
> > > > > > > > > > > > > > > > > > +		struct _irqfd *irqfd = _irqfd_fdget_lock(kvm, args->irqfd);
> > > > > > > > > > > > > > > > > > +		if (IS_ERR(irqfd)) {
> > > > > > > > > > > > > > > > > > +			ret = PTR_ERR(irqfd);
> > > > > > > > > > > > > > > > > > +			goto fail;
> > > > > > > > > > > > > > > > > > +		}
> > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > +		gsi = irqfd->gsi;
> > > > > > > > > > > > > > > > > > +		level_irqfd = eventfd_ctx_get(irqfd->eventfd);
> > > > > > > > > > > > > > > > > > +		source = _irq_source_get(irqfd->source);
> > > > > > > > > > > > > > > > > > +		_irqfd_put_unlock(irqfd);
> > > > > > > > > > > > > > > > > > +		if (!source) {
> > > > > > > > > > > > > > > > > > +			ret = -EINVAL;
> > > > > > > > > > > > > > > > > > +			goto fail;
> > > > > > > > > > > > > > > > > > +		}
> > > > > > > > > > > > > > > > > > +	} else {
> > > > > > > > > > > > > > > > > > +		ret = -EINVAL;
> > > > > > > > > > > > > > > > > > +		goto fail;
> > > > > > > > > > > > > > > > > > +	}
> > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > +	INIT_LIST_HEAD(&eoifd->list);
> > > > > > > > > > > > > > > > > > +	eoifd->kvm = kvm;
> > > > > > > > > > > > > > > > > > +	eoifd->eventfd = eventfd;
> > > > > > > > > > > > > > > > > > +	eoifd->source = source;
> > > > > > > > > > > > > > > > > > +	eoifd->level_irqfd = level_irqfd;
> > > > > > > > > > > > > > > > > > +	eoifd->notifier.gsi = gsi;
> > > > > > > > > > > > > > > > > > +	eoifd->notifier.irq_acked = eoifd_event;
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > OK so this means eoifd keeps a reference to the irqfd.
> > > > > > > > > > > > > > > > > And since this is the case, can't we drop the reference counting
> > > > > > > > > > > > > > > > > around source ids now? Everything is referenced through irqfd.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Holding a reference and using it as a reference count are not the same
> > > > > > > > > > > > > > > > thing.  What if another module holds a reference to this eventfd?  How
> > > > > > > > > > > > > > > > do we do anything on release?
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > We don't as there is no release, and using kref on source id does not
> > > > > > > > > > > > > > > help: it just never gets invoked.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Please work out how you think it should work and let me know, I don't
> > > > > > > > > > > > > > see it.  We have an irq source id that needs to be allocated by irqfd
> > > > > > > > > > > > > > and returned when it's unused.  It becomes unused when neither irqfd nor
> > > > > > > > > > > > > > eoifd are making use of it.  irqfd and eoifd may be closed in any order.
> > > > > > > > > > > > > > Use of the source id is what we're reference counting, which is why it's
> > > > > > > > > > > > > > in struct _irq_source.  How can I use an eventfd reference for the same?
> > > > > > > > > > > > > > I don't know when it's unused.  I don't know who else holds a reference
> > > > > > > > > > > > > > to it...  Doesn't make sense to me.  Feels like attempting to squat on
> > > > > > > > > > > > > > someone else's object.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > eoifd should prevent irqfd from being released.
> > > > > > > > > > > > 
> > > > > > > > > > > > Why?  Note that this is actually quite difficult too.  We can't fail a
> > > > > > > > > > > > release, nobody checks close(3p) return.  Blocking a release is likely
> > > > > > > > > > > > to cause all sorts of problems, so what you mean is that irqfd should
> > > > > > > > > > > > linger around until there are no references to it... but that's exactly
> > > > > > > > > > > > what struct _irq_source is for, is to hold the bits that we care about
> > > > > > > > > > > > references to and automatically release it when there are none.
> > > > > > > > > > > 
> > > > > > > > > > > No no. You *already* prevent it. You take a reference to the eventfd
> > > > > > > > > > > context.
> > > > > > > > > > 
> > > > > > > > > > Right, which keeps the fd from going away, not the struct _irqfd.
> > > > > > > > > 
> > > > > > > > > _irqfd too.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > How so?
> > > > > > > 
> > > > > > > Normally irqfd_wakeup is called with POLLHUP and calls irqfd_deactivate.
> > > > > > > If you get a ctx reference this does not happen.
> > > > > > 
> > > > > > I think you're mistaken.  wake_up_poll(,POLLHUP) is called from
> > > > > > eventfd_release (file_operations.release), not from ctx reference
> > > > > > release.
> > > > > 
> > > > > True. I was wrong. so close has the same bug as deassign. To fix,
> > > > > how about eoifd will hold a reference to the irqfd instead of the
> > > > > eventfd context?
> > > > 
> > > > What does it mean to hold a reference to the irqfd?
> > > 
> > > I meant file *reference: eventfd_fget. But there are other options see
> > > below.
> > 
> > That's no better than the eventfd context we already hold.
> 
> It means POLLHUP is not invoked until eoifd is closed.
> 
> > > > What state of functionality is an irqfd that has been
> > > > closed/de-assigned but is still attached to an eoifd?  It can't
> > > > continue to fire interrupts into the guest.
> > > >
> > > > I don't think close or de-assign have a bug, assign has a bug that it
> > > > can allow re-assignment using an in-use eventfd.  I think I'd rather
> > > > fix that.
> > > 
> > > Let me show you that the bug is in deassign:
> > > 	assign irqfd for fd=1
> > > 	assign for eoifd fd=2, irqfd=1
> > > 	deassign irqfd 1
> > > 
> > > At this point eoifd has no meaning and there is also no way to deassign
> > > it,
> > 
> > Yes, there is.  This is exactly why I hold a reference to the eventfd
> > ctx.  It can still be deassigned by passing irqfd=1, we'll do an
> > eventfd_ctx_get and match it to that stored.
> 
> OK.
> What if instead we close irqfd 1?

Then the user isn't reading directions very well because the API clearly
indicates to pass the irqfd on both assign and de-assign of the eoifd.
However, it will still get de-assigned if they close the eoifd.

> > >  so the bug already triggered.
> > >
> > > I can see two ways out:
> > > 1. easy way - fail deassign
> > 
> > Then close() and deassign are not the same.
> > 
> > > 2. elegant way - shut down eoifd on irqfd deassign too
> > 
> > Sorry, I've always been told it's a bad idea to have one interface kill
> > another from inside the kernel.
> 
> Not kill merely deassign.

That's what I mean.  Unintended consequences should not be designed in.

> > Given that your assertion above is incorrect, I still stand by fixing
> > assign.
> 
> OK, but then you also would need to protect against someone binding
> an irqfd that is not level to same GSI.
> 
> Also if we go ahead with fixing assign - I do not think we need
> to rebind to the same source id - just failing assign
> of this irqfd with EBUSY should be enough.
> 




^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-17 16:14                       ` Michael S. Tsirkin
  2012-07-17 16:17                         ` Alex Williamson
@ 2012-07-18  6:27                         ` Gleb Natapov
  2012-07-18 10:20                           ` Michael S. Tsirkin
  2012-07-18 21:55                           ` Michael S. Tsirkin
  1 sibling, 2 replies; 96+ messages in thread
From: Gleb Natapov @ 2012-07-18  6:27 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Alex Williamson, avi, kvm, linux-kernel, jan.kiszka

On Tue, Jul 17, 2012 at 07:14:52PM +0300, Michael S. Tsirkin wrote:
> > _Seems_ racy, or _is_ racy?  Please identify the race.
> 
> Look at this:
> 
> static inline int kvm_irq_line_state(unsigned long *irq_state,
>                                      int irq_source_id, int level)
> {
>         /* Logical OR for level trig interrupt */
>         if (level)
>                 set_bit(irq_source_id, irq_state);
>         else
>                 clear_bit(irq_source_id, irq_state);
> 
>         return !!(*irq_state);
> }
> 
> 
> Now:
> If other CPU changes some other bit after the atomic change,
> it looks like !!(*irq_state) might return a stale value.
> 
> CPU 0 clears bit 0. CPU 1 sets bit 1. CPU 1 sets level to 1.
> If CPU 0 sees a stale value now it will return 0 here
> and interrupt will get cleared.
> 
This is unlikely to happen on x86, especially since the bit is set with a
serialized instruction. But there is actually a race here.
CPU 0 clears bit 0. CPU 0 reads irq_state as 0. CPU 1 sets level to 1.
CPU 1 calls kvm_ioapic_set_irq(1). CPU 0 calls kvm_ioapic_set_irq(0).
Now the ioapic thinks the level is 0 but irq_state is not 0.
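
The interleaving above can be replayed sequentially in a small
user-space model. This is only a sketch under stated assumptions:
pin_model, line_state, the two ioapic_set_irq_* helpers and replay are
invented names, and the model has no real locking; it just orders the
steps by hand to show where the stale level comes from and why reading
irq_state at delivery time avoids it.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one pin: asserting sources and the level the ioapic last saw. */
struct pin_model {
	unsigned long irq_state; /* one bit per irq source id */
	int ioapic_level;        /* level most recently delivered to the ioapic */
};

/* Mirrors the existing helper: update one bit, return the OR of all sources. */
static int line_state(struct pin_model *m, int source_id, int level)
{
	assert(source_id >= 0 && source_id < (int)(8 * sizeof(unsigned long)));
	if (level)
		m->irq_state |= 1UL << source_id;
	else
		m->irq_state &= ~(1UL << source_id);
	return !!m->irq_state;
}

/* Old scheme: the caller passes a level it computed earlier. */
static void ioapic_set_irq_old(struct pin_model *m, int level)
{
	m->ioapic_level = level;
}

/* New scheme: the level is re-read from irq_state at delivery time. */
static void ioapic_set_irq_new(struct pin_model *m)
{
	m->ioapic_level = !!m->irq_state;
}

/* Replays the interleaving; returns true if ioapic level and irq_state disagree. */
static bool replay(bool use_new)
{
	struct pin_model m = { .irq_state = 1UL << 0, .ioapic_level = 1 };

	int cpu0_level = line_state(&m, 0, 0); /* CPU 0 clears bit 0, sees 0 */
	int cpu1_level = line_state(&m, 1, 1); /* CPU 1 sets bit 1, sees 1 */

	if (use_new) {
		ioapic_set_irq_new(&m); /* CPU 1 delivers first */
		ioapic_set_irq_new(&m); /* CPU 0 delivers second */
	} else {
		ioapic_set_irq_old(&m, cpu1_level); /* CPU 1 delivers 1 */
		ioapic_set_irq_old(&m, cpu0_level); /* CPU 0 delivers stale 0 */
	}
	return m.ioapic_level != !!m.irq_state;
}
```

In this model replay(false) ends with the ioapic level at 0 while bit 1
is still set, and replay(true) ends with the two in agreement.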

This untested and un-compiled patch should fix it.

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ef91d79..e22c78b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -825,7 +825,7 @@ int kvm_read_guest_page_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 void kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault);
 bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl);
 
-int kvm_pic_set_irq(void *opaque, int irq, int level);
+int kvm_pic_set_irq(void *opaque, int irq);
 
 void kvm_inject_nmi(struct kvm_vcpu *vcpu);
 
diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
index 81cf4fa..0d6988f 100644
--- a/arch/x86/kvm/i8259.c
+++ b/arch/x86/kvm/i8259.c
@@ -188,12 +188,13 @@ void kvm_pic_update_irq(struct kvm_pic *s)
 	pic_unlock(s);
 }
 
-int kvm_pic_set_irq(void *opaque, int irq, int level)
+int kvm_pic_set_irq(void *opaque, int irq)
 {
 	struct kvm_pic *s = opaque;
-	int ret = -1;
+	int ret = -1, level;
 
 	pic_lock(s);
+	level = !!s->irq_states[irq];
 	if (irq >= 0 && irq < PIC_NUM_PINS) {
 		ret = pic_set_irq1(&s->pics[irq >> 3], irq & 7, level);
 		pic_update_irq(s);
diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
index 26fd54d..6ad6a6b 100644
--- a/virt/kvm/ioapic.c
+++ b/virt/kvm/ioapic.c
@@ -191,14 +191,15 @@ static int ioapic_deliver(struct kvm_ioapic *ioapic, int irq)
 	return kvm_irq_delivery_to_apic(ioapic->kvm, NULL, &irqe);
 }
 
-int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level)
+int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq)
 {
 	u32 old_irr;
 	u32 mask = 1 << irq;
 	union kvm_ioapic_redirect_entry entry;
-	int ret = 1;
+	int ret = 1, level;
 
 	spin_lock(&ioapic->lock);
+	level = !!ioapic->irq_states[irq];
 	old_irr = ioapic->irr;
 	if (irq >= 0 && irq < IOAPIC_NUM_PINS) {
 		entry = ioapic->redirtbl[irq];
diff --git a/virt/kvm/ioapic.h b/virt/kvm/ioapic.h
index 32872a0..65894dd 100644
--- a/virt/kvm/ioapic.h
+++ b/virt/kvm/ioapic.h
@@ -74,7 +74,7 @@ void kvm_ioapic_update_eoi(struct kvm *kvm, int vector, int trigger_mode);
 bool kvm_ioapic_handles_vector(struct kvm *kvm, int vector);
 int kvm_ioapic_init(struct kvm *kvm);
 void kvm_ioapic_destroy(struct kvm *kvm);
-int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level);
+int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq);
 void kvm_ioapic_reset(struct kvm_ioapic *ioapic);
 int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 		struct kvm_lapic_irq *irq);
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index a6a0365..db0ccef 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -33,7 +33,7 @@
 
 #include "ioapic.h"
 
-static inline int kvm_irq_line_state(unsigned long *irq_state,
+static inline void kvm_irq_line_state(unsigned long *irq_state,
 				     int irq_source_id, int level)
 {
 	/* Logical OR for level trig interrupt */
@@ -41,8 +41,6 @@ static inline int kvm_irq_line_state(unsigned long *irq_state,
 		set_bit(irq_source_id, irq_state);
 	else
 		clear_bit(irq_source_id, irq_state);
-
-	return !!(*irq_state);
 }
 
 static int kvm_set_pic_irq(struct kvm_kernel_irq_routing_entry *e,
@@ -50,9 +48,9 @@ static int kvm_set_pic_irq(struct kvm_kernel_irq_routing_entry *e,
 {
 #ifdef CONFIG_X86
 	struct kvm_pic *pic = pic_irqchip(kvm);
-	level = kvm_irq_line_state(&pic->irq_states[e->irqchip.pin],
+	kvm_irq_line_state(&pic->irq_states[e->irqchip.pin],
 				   irq_source_id, level);
-	return kvm_pic_set_irq(pic, e->irqchip.pin, level);
+	return kvm_pic_set_irq(pic, e->irqchip.pin);
 #else
 	return -1;
 #endif
@@ -62,10 +60,10 @@ static int kvm_set_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
 			      struct kvm *kvm, int irq_source_id, int level)
 {
 	struct kvm_ioapic *ioapic = kvm->arch.vioapic;
-	level = kvm_irq_line_state(&ioapic->irq_states[e->irqchip.pin],
+	kvm_irq_line_state(&ioapic->irq_states[e->irqchip.pin],
 				   irq_source_id, level);
 
-	return kvm_ioapic_set_irq(ioapic, e->irqchip.pin, level);
+	return kvm_ioapic_set_irq(ioapic, e->irqchip.pin);
 }
 
 inline static bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq)

--
			Gleb.

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-18  6:27                         ` Gleb Natapov
@ 2012-07-18 10:20                           ` Michael S. Tsirkin
  2012-07-18 10:27                             ` Gleb Natapov
  2012-07-18 21:55                           ` Michael S. Tsirkin
  1 sibling, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-18 10:20 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Alex Williamson, avi, kvm, linux-kernel, jan.kiszka

On Wed, Jul 18, 2012 at 09:27:42AM +0300, Gleb Natapov wrote:
> On Tue, Jul 17, 2012 at 07:14:52PM +0300, Michael S. Tsirkin wrote:
> > > _Seems_ racy, or _is_ racy?  Please identify the race.
> > 
> > Look at this:
> > 
> > static inline int kvm_irq_line_state(unsigned long *irq_state,
> >                                      int irq_source_id, int level)
> > {
> >         /* Logical OR for level trig interrupt */
> >         if (level)
> >                 set_bit(irq_source_id, irq_state);
> >         else
> >                 clear_bit(irq_source_id, irq_state);
> > 
> >         return !!(*irq_state);
> > }
> > 
> > 
> > Now:
> > If other CPU changes some other bit after the atomic change,
> > it looks like !!(*irq_state) might return a stale value.
> > 
> > CPU 0 clears bit 0. CPU 1 sets bit 1. CPU 1 sets level to 1.
> > If CPU 0 sees a stale value now it will return 0 here
> > and interrupt will get cleared.
> > 
> This is unlikely to happen on x86, especially since the bit is set with a
> serialized instruction.

Probably. But it does make me a bit uneasy.  Why don't we pass
irq_source_id to kvm_pic_set_irq/kvm_ioapic_set_irq, and move
kvm_irq_line_state to under pic_lock/ioapic_lock?  We can then use
__set_bit/__clear_bit in kvm_irq_line_state, making the ordering simpler
and saving an atomic op in the process.
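
A minimal user-space sketch of that arrangement, under stated
assumptions: chip_model, chip_lock, chip_unlock and chip_set_irq are
invented names, the atomic_flag spinlock merely stands in for
pic_lock/ioapic->lock, and NUM_PINS is arbitrary. The bit update and the
level computation happen inside one critical section, so plain
(non-atomic) bit operations are enough.

```c
#include <assert.h>
#include <stdatomic.h>

#define NUM_PINS 24 /* hypothetical pin count for the model */

struct chip_model {
	atomic_flag lock;                   /* stands in for the chip lock */
	unsigned long irq_states[NUM_PINS]; /* one bit per irq source id */
	int level[NUM_PINS];                /* level the chip currently sees */
};

static void chip_lock(struct chip_model *c)
{
	while (atomic_flag_test_and_set_explicit(&c->lock, memory_order_acquire))
		; /* spin until the flag was previously clear */
}

static void chip_unlock(struct chip_model *c)
{
	atomic_flag_clear_explicit(&c->lock, memory_order_release);
}

/* Takes the source id, updates its bit and derives the level under the lock. */
static int chip_set_irq(struct chip_model *c, int pin, int source_id, int level)
{
	int ret;

	assert(source_id >= 0 && source_id < (int)(8 * sizeof(unsigned long)));
	if (pin < 0 || pin >= NUM_PINS)
		return -1;

	chip_lock(c);
	if (level)
		c->irq_states[pin] |= 1UL << source_id;    /* like __set_bit */
	else
		c->irq_states[pin] &= ~(1UL << source_id); /* like __clear_bit */
	c->level[pin] = !!c->irq_states[pin];
	ret = c->level[pin];
	chip_unlock(c);

	return ret;
}
```

Because no caller ever computes the level outside the lock, there is no
window in which a stale value can be delivered.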

> But there is actually a race here.
> CPU 0 clears bit 0. CPU 0 reads irq_state as 0. CPU 1 sets level to 1.
> CPU 1 calls kvm_ioapic_set_irq(1). CPU 0 calls kvm_ioapic_set_irq(0).
> Now the ioapic thinks the level is 0 but irq_state is not 0.
> 
> This untested and un-compiled patch should fix it.
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index ef91d79..e22c78b 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -825,7 +825,7 @@ int kvm_read_guest_page_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>  void kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault);
>  bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl);
>  
> -int kvm_pic_set_irq(void *opaque, int irq, int level);
> +int kvm_pic_set_irq(void *opaque, int irq);
>  
>  void kvm_inject_nmi(struct kvm_vcpu *vcpu);
>  
> diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
> index 81cf4fa..0d6988f 100644
> --- a/arch/x86/kvm/i8259.c
> +++ b/arch/x86/kvm/i8259.c
> @@ -188,12 +188,13 @@ void kvm_pic_update_irq(struct kvm_pic *s)
>  	pic_unlock(s);
>  }
>  
> -int kvm_pic_set_irq(void *opaque, int irq, int level)
> +int kvm_pic_set_irq(void *opaque, int irq)
>  {
>  	struct kvm_pic *s = opaque;
> -	int ret = -1;
> +	int ret = -1, level;
>  
>  	pic_lock(s);
> +	level = !!s->irq_states[irq];
>  	if (irq >= 0 && irq < PIC_NUM_PINS) {
>  		ret = pic_set_irq1(&s->pics[irq >> 3], irq & 7, level);
>  		pic_update_irq(s);
> diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
> index 26fd54d..6ad6a6b 100644
> --- a/virt/kvm/ioapic.c
> +++ b/virt/kvm/ioapic.c
> @@ -191,14 +191,15 @@ static int ioapic_deliver(struct kvm_ioapic *ioapic, int irq)
>  	return kvm_irq_delivery_to_apic(ioapic->kvm, NULL, &irqe);
>  }
>  
> -int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level)
> +int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq)
>  {
>  	u32 old_irr;
>  	u32 mask = 1 << irq;
>  	union kvm_ioapic_redirect_entry entry;
> -	int ret = 1;
> +	int ret = 1, level;
>  
>  	spin_lock(&ioapic->lock);
> +	level = !!ioapic->irq_states[irq];
>  	old_irr = ioapic->irr;
>  	if (irq >= 0 && irq < IOAPIC_NUM_PINS) {
>  		entry = ioapic->redirtbl[irq];
> diff --git a/virt/kvm/ioapic.h b/virt/kvm/ioapic.h
> index 32872a0..65894dd 100644
> --- a/virt/kvm/ioapic.h
> +++ b/virt/kvm/ioapic.h
> @@ -74,7 +74,7 @@ void kvm_ioapic_update_eoi(struct kvm *kvm, int vector, int trigger_mode);
>  bool kvm_ioapic_handles_vector(struct kvm *kvm, int vector);
>  int kvm_ioapic_init(struct kvm *kvm);
>  void kvm_ioapic_destroy(struct kvm *kvm);
> -int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level);
> +int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq);
>  void kvm_ioapic_reset(struct kvm_ioapic *ioapic);
>  int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
>  		struct kvm_lapic_irq *irq);
> diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> index a6a0365..db0ccef 100644
> --- a/virt/kvm/irq_comm.c
> +++ b/virt/kvm/irq_comm.c
> @@ -33,7 +33,7 @@
>  
>  #include "ioapic.h"
>  
> -static inline int kvm_irq_line_state(unsigned long *irq_state,
> +static inline void kvm_irq_line_state(unsigned long *irq_state,
>  				     int irq_source_id, int level)
>  {
>  	/* Logical OR for level trig interrupt */
> @@ -41,8 +41,6 @@ static inline int kvm_irq_line_state(unsigned long *irq_state,
>  		set_bit(irq_source_id, irq_state);
>  	else
>  		clear_bit(irq_source_id, irq_state);
> -
> -	return !!(*irq_state);
>  }
>  
>  static int kvm_set_pic_irq(struct kvm_kernel_irq_routing_entry *e,
> @@ -50,9 +48,9 @@ static int kvm_set_pic_irq(struct kvm_kernel_irq_routing_entry *e,
>  {
>  #ifdef CONFIG_X86
>  	struct kvm_pic *pic = pic_irqchip(kvm);
> -	level = kvm_irq_line_state(&pic->irq_states[e->irqchip.pin],
> +	kvm_irq_line_state(&pic->irq_states[e->irqchip.pin],
>  				   irq_source_id, level);
> -	return kvm_pic_set_irq(pic, e->irqchip.pin, level);
> +	return kvm_pic_set_irq(pic, e->irqchip.pin);
>  #else
>  	return -1;
>  #endif
> @@ -62,10 +60,10 @@ static int kvm_set_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
>  			      struct kvm *kvm, int irq_source_id, int level)
>  {
>  	struct kvm_ioapic *ioapic = kvm->arch.vioapic;
> -	level = kvm_irq_line_state(&ioapic->irq_states[e->irqchip.pin],
> +	kvm_irq_line_state(&ioapic->irq_states[e->irqchip.pin],
>  				   irq_source_id, level);
>  
> -	return kvm_ioapic_set_irq(ioapic, e->irqchip.pin, level);
> +	return kvm_ioapic_set_irq(ioapic, e->irqchip.pin);
>  }
>  
>  inline static bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq)
> --
> 			Gleb.

* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-18 10:20                           ` Michael S. Tsirkin
@ 2012-07-18 10:27                             ` Gleb Natapov
  2012-07-18 10:33                               ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Gleb Natapov @ 2012-07-18 10:27 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Alex Williamson, avi, kvm, linux-kernel, jan.kiszka

On Wed, Jul 18, 2012 at 01:20:29PM +0300, Michael S. Tsirkin wrote:
> On Wed, Jul 18, 2012 at 09:27:42AM +0300, Gleb Natapov wrote:
> > On Tue, Jul 17, 2012 at 07:14:52PM +0300, Michael S. Tsirkin wrote:
> > > > _Seems_ racy, or _is_ racy?  Please identify the race.
> > > 
> > > Look at this:
> > > 
> > > static inline int kvm_irq_line_state(unsigned long *irq_state,
> > >                                      int irq_source_id, int level)
> > > {
> > >         /* Logical OR for level trig interrupt */
> > >         if (level)
> > >                 set_bit(irq_source_id, irq_state);
> > >         else
> > >                 clear_bit(irq_source_id, irq_state);
> > > 
> > >         return !!(*irq_state);
> > > }
> > > 
> > > 
> > > Now:
> > > If other CPU changes some other bit after the atomic change,
> > > it looks like !!(*irq_state) might return a stale value.
> > > 
> > > CPU 0 clears bit 0. CPU 1 sets bit 1. CPU 1 sets level to 1.
> > > If CPU 0 sees a stale value now it will return 0 here
> > > and interrupt will get cleared.
> > > 
> > This will hardly happen on x86 especially since bit is set with
> > serialized instruction.
> 
> Probably. But it does make me a bit uneasy.  Why don't we pass
> irq_source_id to kvm_pic_set_irq/kvm_ioapic_set_irq, and move
> kvm_irq_line_state to under pic_lock/ioapic_lock?  We can then use
> __set_bit/__clear_bit in kvm_irq_line_state, making the ordering simpler
> and saving an atomic op in the process.
> 
With my patch I do not see why we can't change them to the unlocked
variants without moving them anywhere. The only requirement is not to
use an RMW sequence to set/clear bits. The ordering of the sets does not
matter. The ordering of the reads does.
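
To make the read-ordering point concrete, here is a minimal userspace
sketch in C11. It is not the kernel code: struct chip_model, its toy
spinlock and the helpers source_update() and chip_set_irq() are names
assumed for illustration only. Each source flips its own bit atomically,
and the line level is derived from the whole word only while the chip
lock is held, so the level handed to the chip reflects every update
visible at that point.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the pic/ioapic state; not a KVM structure. */
struct chip_model {
	atomic_flag lock;        /* toy spinlock standing in for the chip lock */
	atomic_ulong irq_state;  /* one bit per irq source id */
	int line_level;          /* level last presented to the chip */
};

static void chip_model_init(struct chip_model *c)
{
	atomic_flag_clear(&c->lock);
	atomic_init(&c->irq_state, 0UL);
	c->line_level = 0;
}

/* Writer side: atomically set or clear this source's bit, no lock taken. */
static void source_update(struct chip_model *c, int source_id, int level)
{
	unsigned long mask = 1UL << source_id;

	if (level)
		atomic_fetch_or(&c->irq_state, mask);
	else
		atomic_fetch_and(&c->irq_state, ~mask);
}

/* Reader side: compute the logical OR of all sources under the lock. */
static int chip_set_irq(struct chip_model *c)
{
	int level;

	while (atomic_flag_test_and_set_explicit(&c->lock, memory_order_acquire))
		; /* spin until the lock is free */
	level = atomic_load(&c->irq_state) != 0;
	c->line_level = level;
	atomic_flag_clear_explicit(&c->lock, memory_order_release);

	return level;
}
```

In this model the order in which two sources update their bits does not
change the final word; only the point at which the word is read decides
what the chip sees.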

--
			Gleb.

* Re: [PATCH v5 2/4] kvm: KVM_EOIFD, an eventfd for EOIs
  2012-07-18  2:44                                     ` Alex Williamson
@ 2012-07-18 10:31                                       ` Michael S. Tsirkin
  0 siblings, 0 replies; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-18 10:31 UTC (permalink / raw)
  To: Alex Williamson; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, Jul 17, 2012 at 08:44:04PM -0600, Alex Williamson wrote:
> On Wed, 2012-07-18 at 01:24 +0300, Michael S. Tsirkin wrote:
> > On Tue, Jul 17, 2012 at 04:09:25PM -0600, Alex Williamson wrote:
> > > On Wed, 2012-07-18 at 00:23 +0300, Michael S. Tsirkin wrote:
> > > > On Tue, Jul 17, 2012 at 02:03:05PM -0600, Alex Williamson wrote:
> > > > > On Tue, 2012-07-17 at 21:58 +0300, Michael S. Tsirkin wrote:
> > > > > > On Tue, Jul 17, 2012 at 10:52:16AM -0600, Alex Williamson wrote:
> > > > > > > On Tue, 2012-07-17 at 19:19 +0300, Michael S. Tsirkin wrote:
> > > > > > > > On Tue, Jul 17, 2012 at 10:06:01AM -0600, Alex Williamson wrote:
> > > > > > > > > On Tue, 2012-07-17 at 18:53 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > > On Tue, Jul 17, 2012 at 09:41:09AM -0600, Alex Williamson wrote:
> > > > > > > > > > > On Tue, 2012-07-17 at 18:13 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > > > > On Tue, Jul 17, 2012 at 08:57:04AM -0600, Alex Williamson wrote:
> > > > > > > > > > > > > On Tue, 2012-07-17 at 17:42 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > > > > > > On Tue, Jul 17, 2012 at 08:29:43AM -0600, Alex Williamson wrote:
> > > > > > > > > > > > > > > On Tue, 2012-07-17 at 17:10 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > > > > > > > > On Tue, Jul 17, 2012 at 07:59:16AM -0600, Alex Williamson wrote:
> > > > > > > > > > > > > > > > > On Tue, 2012-07-17 at 13:21 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > > > > > > > > > > On Mon, Jul 16, 2012 at 02:33:55PM -0600, Alex Williamson wrote:
> > > > > > > > > > > > > > > > > > > +	if (args->flags & KVM_EOIFD_FLAG_LEVEL_IRQFD) {
> > > > > > > > > > > > > > > > > > > +		struct _irqfd *irqfd = _irqfd_fdget_lock(kvm, args->irqfd);
> > > > > > > > > > > > > > > > > > > +		if (IS_ERR(irqfd)) {
> > > > > > > > > > > > > > > > > > > +			ret = PTR_ERR(irqfd);
> > > > > > > > > > > > > > > > > > > +			goto fail;
> > > > > > > > > > > > > > > > > > > +		}
> > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > +		gsi = irqfd->gsi;
> > > > > > > > > > > > > > > > > > > +		level_irqfd = eventfd_ctx_get(irqfd->eventfd);
> > > > > > > > > > > > > > > > > > > +		source = _irq_source_get(irqfd->source);
> > > > > > > > > > > > > > > > > > > +		_irqfd_put_unlock(irqfd);
> > > > > > > > > > > > > > > > > > > +		if (!source) {
> > > > > > > > > > > > > > > > > > > +			ret = -EINVAL;
> > > > > > > > > > > > > > > > > > > +			goto fail;
> > > > > > > > > > > > > > > > > > > +		}
> > > > > > > > > > > > > > > > > > > +	} else {
> > > > > > > > > > > > > > > > > > > +		ret = -EINVAL;
> > > > > > > > > > > > > > > > > > > +		goto fail;
> > > > > > > > > > > > > > > > > > > +	}
> > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > +	INIT_LIST_HEAD(&eoifd->list);
> > > > > > > > > > > > > > > > > > > +	eoifd->kvm = kvm;
> > > > > > > > > > > > > > > > > > > +	eoifd->eventfd = eventfd;
> > > > > > > > > > > > > > > > > > > +	eoifd->source = source;
> > > > > > > > > > > > > > > > > > > +	eoifd->level_irqfd = level_irqfd;
> > > > > > > > > > > > > > > > > > > +	eoifd->notifier.gsi = gsi;
> > > > > > > > > > > > > > > > > > > +	eoifd->notifier.irq_acked = eoifd_event;
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > OK so this means eoifd keeps a reference to the irqfd.
> > > > > > > > > > > > > > > > > > And since this is the case, can't we drop the reference counting
> > > > > > > > > > > > > > > > > > around source ids now? Everything is referenced through irqfd.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Holding a reference and using it as a reference count are not the same
> > > > > > > > > > > > > > > > > thing.  What if another module holds a reference to this eventfd?  How
> > > > > > > > > > > > > > > > > do we do anything on release?
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > We don't as there is no release, and using kref on source id does not
> > > > > > > > > > > > > > > > help: it just never gets invoked.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Please work out how you think it should work and let me know, I don't
> > > > > > > > > > > > > > > see it.  We have an irq source id that needs to be allocated by irqfd
> > > > > > > > > > > > > > > and returned when it's unused.  It becomes unused when neither irqfd nor
> > > > > > > > > > > > > > > eoifd are making use of it.  irqfd and eoifd may be closed in any order.
> > > > > > > > > > > > > > > Use of the source id is what we're reference counting, which is why it's
> > > > > > > > > > > > > > > in struct _irq_source.  How can I use an eventfd reference for the same?
> > > > > > > > > > > > > > > I don't know when it's unused.  I don't know who else holds a reference
> > > > > > > > > > > > > > > to it...  Doesn't make sense to me.  Feels like attempting to squat on
> > > > > > > > > > > > > > > someone else's object.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > eoifd should prevent irqfd from being released.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Why?  Note that this is actually quite difficult too.  We can't fail a
> > > > > > > > > > > > > release, nobody checks close(3p) return.  Blocking a release is likely
> > > > > > > > > > > > > to cause all sorts of problems, so what you mean is that irqfd should
> > > > > > > > > > > > > linger around until there are no references to it... but that's exactly
> > > > > > > > > > > > > what struct _irq_source is for, is to hold the bits that we care about
> > > > > > > > > > > > > references to and automatically release it when there are none.
> > > > > > > > > > > > 
> > > > > > > > > > > > No no. You *already* prevent it. You take a reference to the eventfd
> > > > > > > > > > > > context.
> > > > > > > > > > > 
> > > > > > > > > > > Right, which keeps the fd from going away, not the struct _irqfd.
> > > > > > > > > > 
> > > > > > > > > > _irqfd too.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > How so?
> > > > > > > > 
> > > > > > > > Normally irqfd_wakeup is called with POLLHUP and calls irqfd_deactivate.
> > > > > > > > If you get a ctx reference this does not happen.
> > > > > > > 
> > > > > > > I think you're mistaken.  wake_up_poll(,POLLHUP) is called from
> > > > > > > eventfd_release (file_operations.release), not from ctx reference
> > > > > > > release.
> > > > > > 
> > > > > > True. I was wrong. so close has the same bug as deassign. To fix,
> > > > > > how about eoifd will hold a reference to the irqfd instead of the
> > > > > > eventfd context?
> > > > > 
> > > > > What does it mean to hold a reference to the irqfd?
> > > > 
> > > > I meant file *reference: eventfd_fget. But there are other options see
> > > > below.
> > > 
> > > That's no better than the eventfd context we already hold.
> > 
> > It means POLLHUP is not invoked until eoifd is closed.
> > 
> > > > > What state of functionality is an irqfd that has been
> > > > > closed/de-assigned but is still attached to an eoifd?  It can't
> > > > > continue to fire interrupts into the guest.
> > > > >
> > > > > I don't think close or de-assign have a bug, assign has a bug that it
> > > > > can allow re-assignment using an in-use eventfd.  I think I'd rather
> > > > > fix that.
> > > > 
> > > > Let me show you that the bug is in deassign:
> > > > 	assign irqfd for fd=1
> > > > 	assign for eoifd fd=2, irqfd=1
> > > > 	deassign irqfd 1
> > > > 
> > > > At this point eoifd has no meaning and there is also no way to deassign
> > > > it,
> > > 
> > > Yes, there is.  This is exactly why I hold a reference to the eventfd
> > > ctx.  It can still be deassigned by passing irqfd=1, we'll do an
> > > eventfd_ctx_get and match it to that stored.
> > 
> > OK.
> > What if instead we close irqfd 1?
> 
> Then the user isn't reading directions very well because the API clearly
> indicates to pass the irqfd on both assign and de-assign of the eoifd.
> However, it will still get de-assigned if they close the eoifd.

Well, you are hanging on to the source id. This is an undocumented
side effect, so the following can fail:

assign irqfd
assign eoifd
deassign irqfd
assign irqfd2
close eoifd


Simply put, the source id should stay alive only while the irqfd is around.
Instead of hanging on to it from eoifd with reference counting,
you should simply deactivate eoifd when irqfd goes away.
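
A toy userspace model of that lifetime rule follows, as a sketch under
stated assumptions rather than a proposed implementation. All names
(model_vm, model_irqfd, model_eoifd and the model_* helpers) are
hypothetical, the pool of source ids is shrunk to one entry to make
exhaustion visible, and locking is left out entirely. The irqfd owns the
source id; de-assigning it deactivates the bound eoifd and frees the id
immediately.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NUM_SOURCE_IDS 1  /* deliberately tiny; not KVM's real limit */

struct model_vm {
	bool id_in_use[MODEL_NUM_SOURCE_IDS];
};

struct model_eoifd;

struct model_irqfd {
	struct model_vm *vm;
	int source_id;
	struct model_eoifd *eoifd;  /* at most one, as in the v5 one-to-one rule */
};

struct model_eoifd {
	struct model_irqfd *irqfd;
	bool active;
};

/* Returns 0 on success, -1 when no source id is free. */
static int model_irqfd_assign(struct model_vm *vm, struct model_irqfd *irqfd)
{
	int id;

	for (id = 0; id < MODEL_NUM_SOURCE_IDS; id++) {
		if (!vm->id_in_use[id]) {
			vm->id_in_use[id] = true;
			irqfd->vm = vm;
			irqfd->source_id = id;
			irqfd->eoifd = NULL;
			return 0;
		}
	}
	return -1;
}

/* Returns 0 on success, -1 if the irqfd already has an eoifd bound. */
static int model_eoifd_assign(struct model_eoifd *eoifd,
			      struct model_irqfd *irqfd)
{
	if (irqfd->eoifd)
		return -1;
	irqfd->eoifd = eoifd;
	eoifd->irqfd = irqfd;
	eoifd->active = true;
	return 0;
}

/* The irqfd going away deactivates its eoifd and frees the id at once. */
static void model_irqfd_deassign(struct model_irqfd *irqfd)
{
	if (irqfd->eoifd) {
		irqfd->eoifd->active = false;
		irqfd->eoifd->irqfd = NULL;
		irqfd->eoifd = NULL;
	}
	irqfd->vm->id_in_use[irqfd->source_id] = false;
}

/* Closing an eoifd only unlinks it; the id belongs to the irqfd. */
static void model_eoifd_close(struct model_eoifd *eoifd)
{
	if (eoifd->irqfd)
		eoifd->irqfd->eoifd = NULL;
	eoifd->irqfd = NULL;
	eoifd->active = false;
}
```

Run against the five-step sequence above, the second irqfd assignment
succeeds in this model because the id was returned at de-assign time
instead of being held until the eoifd is closed.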

> > > >  so the bug already triggered.
> > > >
> > > > I can see two ways out:
> > > > 1. easy way - fail deassign
> > > 
> > > Then close() and deassign are not the same.
> > > 
> > > > 2. elegant way - shut down eoifd on irqfd deassign too
> > > 
> > > Sorry, I've always been told it's a bad idea to have one interface kill
> > > another from inside the kernel.
> > 
> > Not kill merely deassign.
> 
> That's what I mean.  Unintended consequences should not be designed in.

But the source id is internal to kvm; users do not know about it.

So what is unintended here? You bind eoifd to irqfd. This means:
give me an indication of eoi when I send this interrupt.
Now you deassign or close irqfd. You will not get any more
eoi indications. All that is needed is fixing a bug: eoifd still
hangs on to the source id, so attempts to create a new level irqfd
can fail.

> > > Given that your assertion above is incorrect, I still stand by fixing
> > > assign.
> > 
> > OK, but then you also would need to protect against someone binding
> > an irqfd that is not level to same GSI.
> > 
> > Also if we go ahead with fixing assign - I do not think we need
> > to rebind to the same source id - just failing assign
> > of this irqfd with EBUSY should be enough.
> > 
> 



* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-18 10:27                             ` Gleb Natapov
@ 2012-07-18 10:33                               ` Michael S. Tsirkin
  2012-07-18 10:36                                 ` Gleb Natapov
  0 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-18 10:33 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Alex Williamson, avi, kvm, linux-kernel, jan.kiszka

On Wed, Jul 18, 2012 at 01:27:39PM +0300, Gleb Natapov wrote:
> On Wed, Jul 18, 2012 at 01:20:29PM +0300, Michael S. Tsirkin wrote:
> > On Wed, Jul 18, 2012 at 09:27:42AM +0300, Gleb Natapov wrote:
> > > On Tue, Jul 17, 2012 at 07:14:52PM +0300, Michael S. Tsirkin wrote:
> > > > > _Seems_ racy, or _is_ racy?  Please identify the race.
> > > > 
> > > > Look at this:
> > > > 
> > > > static inline int kvm_irq_line_state(unsigned long *irq_state,
> > > >                                      int irq_source_id, int level)
> > > > {
> > > >         /* Logical OR for level trig interrupt */
> > > >         if (level)
> > > >                 set_bit(irq_source_id, irq_state);
> > > >         else
> > > >                 clear_bit(irq_source_id, irq_state);
> > > > 
> > > >         return !!(*irq_state);
> > > > }
> > > > 
> > > > 
> > > > Now:
> > > > If other CPU changes some other bit after the atomic change,
> > > > it looks like !!(*irq_state) might return a stale value.
> > > > 
> > > > CPU 0 clears bit 0. CPU 1 sets bit 1. CPU 1 sets level to 1.
> > > > If CPU 0 sees a stale value now it will return 0 here
> > > > and interrupt will get cleared.
> > > > 
> > > This will hardly happen on x86 especially since bit is set with
> > > serialized instruction.
> > 
> > Probably. But it does make me a bit uneasy.  Why don't we pass
> > irq_source_id to kvm_pic_set_irq/kvm_ioapic_set_irq, and move
> > kvm_irq_line_state to under pic_lock/ioapic_lock?  We can then use
> > __set_bit/__clear_bit in kvm_irq_line_state, making the ordering simpler
> > and saving an atomic op in the process.
> > 
> With my patch I do not see why we can't change them to unlocked variant
> without moving them anywhere. The only requirement is to not use RMW
> sequence to set/clear bits. The ordering of setting does not matter. The
> ordering of reading is.

You want to use __set_bit/__clear_bit on the same word
from multiple CPUs, without locking?
Why won't this lose information?

In any case, it seems simpler and safer to do accesses under lock
than rely on specific use.
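
To spell out the concern, here is a hand-interleaved sketch in plain C.
It is a model, not kernel code: step_load() and step_store() are
hypothetical helpers that split a non-atomic "*word |= mask" into its
load and store halves so that two CPUs can be interleaved
deterministically. The second function shows the same two updates when
each one completes before the next starts, which is what holding a lock
would guarantee.

```c
/*
 * A plain "*word |= mask" is a load, a modify and a store.  These
 * helpers expose the load and store so the steps can be interleaved.
 */
static unsigned long step_load(const unsigned long *word)
{
	return *word;
}

static void step_store(unsigned long *word, unsigned long value)
{
	*word = value;
}

/* CPU 0 clears bit 0 while CPU 1 sets bit 1, both without any lock. */
static unsigned long unlocked_interleaving(void)
{
	unsigned long irq_state = 1UL << 0;  /* source 0 asserted */
	unsigned long cpu0, cpu1;

	cpu0 = step_load(&irq_state);                /* CPU 0 sees 0x1 */
	cpu1 = step_load(&irq_state);                /* CPU 1 sees 0x1 */
	step_store(&irq_state, cpu1 | (1UL << 1));   /* CPU 1 writes 0x3 */
	step_store(&irq_state, cpu0 & ~(1UL << 0));  /* CPU 0 writes 0x0 */

	return irq_state;  /* bit 1 has been lost */
}

/* The same two updates, each finished before the next one begins. */
static unsigned long locked_ordering(void)
{
	unsigned long irq_state = 1UL << 0;

	irq_state |= 1UL << 1;     /* CPU 1, holding the lock */
	irq_state &= ~(1UL << 0);  /* CPU 0, holding the lock */

	return irq_state;  /* bit 1 survives */
}
```

Whether the unlocked kernel helpers can actually interleave like this
depends on how the architecture implements them; the sketch only shows
what is lost if they do.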

> --
> 			Gleb.

* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-18 10:33                               ` Michael S. Tsirkin
@ 2012-07-18 10:36                                 ` Gleb Natapov
  2012-07-18 10:51                                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Gleb Natapov @ 2012-07-18 10:36 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Alex Williamson, avi, kvm, linux-kernel, jan.kiszka

On Wed, Jul 18, 2012 at 01:33:35PM +0300, Michael S. Tsirkin wrote:
> On Wed, Jul 18, 2012 at 01:27:39PM +0300, Gleb Natapov wrote:
> > On Wed, Jul 18, 2012 at 01:20:29PM +0300, Michael S. Tsirkin wrote:
> > > On Wed, Jul 18, 2012 at 09:27:42AM +0300, Gleb Natapov wrote:
> > > > On Tue, Jul 17, 2012 at 07:14:52PM +0300, Michael S. Tsirkin wrote:
> > > > > > _Seems_ racy, or _is_ racy?  Please identify the race.
> > > > > 
> > > > > Look at this:
> > > > > 
> > > > > static inline int kvm_irq_line_state(unsigned long *irq_state,
> > > > >                                      int irq_source_id, int level)
> > > > > {
> > > > >         /* Logical OR for level trig interrupt */
> > > > >         if (level)
> > > > >                 set_bit(irq_source_id, irq_state);
> > > > >         else
> > > > >                 clear_bit(irq_source_id, irq_state);
> > > > > 
> > > > >         return !!(*irq_state);
> > > > > }
> > > > > 
> > > > > 
> > > > > Now:
> > > > > If other CPU changes some other bit after the atomic change,
> > > > > it looks like !!(*irq_state) might return a stale value.
> > > > > 
> > > > > CPU 0 clears bit 0. CPU 1 sets bit 1. CPU 1 sets level to 1.
> > > > > If CPU 0 sees a stale value now it will return 0 here
> > > > > and interrupt will get cleared.
> > > > > 
> > > > This will hardly happen on x86 especially since bit is set with
> > > > serialized instruction.
> > > 
> > > Probably. But it does make me a bit uneasy.  Why don't we pass
> > > irq_source_id to kvm_pic_set_irq/kvm_ioapic_set_irq, and move
> > > kvm_irq_line_state to under pic_lock/ioapic_lock?  We can then use
> > > __set_bit/__clear_bit in kvm_irq_line_state, making the ordering simpler
> > > and saving an atomic op in the process.
> > > 
> > With my patch I do not see why we can't change them to unlocked variant
> > without moving them anywhere. The only requirement is to not use RMW
> > sequence to set/clear bits. The ordering of setting does not matter. The
> > ordering of reading is.
> 
> You want to use __set_bit/__clear_bit on the same word
> from multiple CPUs, without locking?
> Why won't this lose information?
Because it is not RMW. If it is then yes, you can't do that.
> 
> In any case, it seems simpler and safer to do accesses under lock
> than rely on specific use.
> 
> > --
> > 			Gleb.

--
			Gleb.

* Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts
  2012-07-16 20:33 ` [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts Alex Williamson
  2012-07-17 21:26   ` Michael S. Tsirkin
@ 2012-07-18 10:41   ` Michael S. Tsirkin
  2012-07-18 10:44     ` Gleb Natapov
  1 sibling, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-18 10:41 UTC (permalink / raw)
  To: Alex Williamson; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Mon, Jul 16, 2012 at 02:33:47PM -0600, Alex Williamson wrote:
> In order to inject a level interrupt from an external source using an
> irqfd, we need to allocate a new irq_source_id.  This allows us to
> assert and (later) de-assert an interrupt line independently from
> users of KVM_IRQ_LINE and avoid lost interrupts.
> 
> We also add what may appear like a bit of excessive infrastructure
> around an object for storing this irq_source_id.  However, notice
> that we only provide a way to assert the interrupt here.  A follow-on
> interface will make use of the same irq_source_id to allow de-assert.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
> 
>  Documentation/virtual/kvm/api.txt |    6 ++
>  arch/x86/kvm/x86.c                |    1 
>  include/linux/kvm.h               |    3 +
>  virt/kvm/eventfd.c                |  114 ++++++++++++++++++++++++++++++++++++-
>  4 files changed, 120 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 100acde..c7267d5 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -1981,6 +1981,12 @@ the guest using the specified gsi pin.  The irqfd is removed using
>  the KVM_IRQFD_FLAG_DEASSIGN flag, specifying both kvm_irqfd.fd
>  and kvm_irqfd.gsi.
>  
> +The KVM_IRQFD_FLAG_LEVEL flag indicates the gsi input is for a level
> +triggered interrupt.  In this case a new irqchip input is allocated
> +which is logically OR'd with other inputs allowing multiple sources
> +to independently assert level interrupts.  The KVM_IRQFD_FLAG_LEVEL
> +is only necessary on setup, teardown is identical to that above.
> +KVM_IRQFD_FLAG_LEVEL support is indicated by KVM_CAP_IRQFD_LEVEL.
>  
>  5. The kvm_run structure
>  ------------------------
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a01a424..80bed07 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2148,6 +2148,7 @@ int kvm_dev_ioctl_check_extension(long ext)
>  	case KVM_CAP_GET_TSC_KHZ:
>  	case KVM_CAP_PCI_2_3:
>  	case KVM_CAP_KVMCLOCK_CTRL:
> +	case KVM_CAP_IRQFD_LEVEL:
>  		r = 1;
>  		break;
>  	case KVM_CAP_COALESCED_MMIO:
> diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> index 2ce09aa..b2e6e4f 100644
> --- a/include/linux/kvm.h
> +++ b/include/linux/kvm.h
> @@ -618,6 +618,7 @@ struct kvm_ppc_smmu_info {
>  #define KVM_CAP_PPC_GET_SMMU_INFO 78
>  #define KVM_CAP_S390_COW 79
>  #define KVM_CAP_PPC_ALLOC_HTAB 80
> +#define KVM_CAP_IRQFD_LEVEL 81
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> @@ -683,6 +684,8 @@ struct kvm_xen_hvm_config {
>  #endif
>  
>  #define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
> +/* Available with KVM_CAP_IRQFD_LEVEL */
> +#define KVM_IRQFD_FLAG_LEVEL (1 << 1)
>  
>  struct kvm_irqfd {
>  	__u32 fd;
> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> index 7d7e2aa..ecdbfea 100644
> --- a/virt/kvm/eventfd.c
> +++ b/virt/kvm/eventfd.c
> @@ -36,6 +36,68 @@
>  #include "iodev.h"
>  
>  /*
> + * An irq_source_id can be created from KVM_IRQFD for level interrupt
> + * injections and shared with other interfaces for EOI or de-assert.
> + * Create an object with reference counting to make it easy to use.
> + */
> +struct _irq_source {
> +	int id; /* the IRQ source ID */
> +	bool level_asserted; /* Track assertion state and protect with lock */
> +	spinlock_t lock;     /* to avoid unnecessary re-assert/spurious eoi. */
> +	struct kvm *kvm;
> +	struct kref kref;
> +};
> +
> +static void _irq_source_release(struct kref *kref)
> +{
> +	struct _irq_source *source;
> +
> +	source = container_of(kref, struct _irq_source, kref);
> +
> +	/* This also de-asserts */
> +	kvm_free_irq_source_id(source->kvm, source->id);
> +	kfree(source);
> +}
> +
> +static void _irq_source_put(struct _irq_source *source)
> +{
> +	if (source)
> +		kref_put(&source->kref, _irq_source_release);
> +}
> +
> +static struct _irq_source *__attribute__ ((used)) /* white lie for now */
> +_irq_source_get(struct _irq_source *source)
> +{
> +	if (source)
> +		kref_get(&source->kref);
> +
> +	return source;
> +}
> +
> +static struct _irq_source *_irq_source_alloc(struct kvm *kvm)
> +{
> +	struct _irq_source *source;
> +	int id;
> +
> +	source = kzalloc(sizeof(*source), GFP_KERNEL);
> +	if (!source)
> +		return ERR_PTR(-ENOMEM);
> +
> +	id = kvm_request_irq_source_id(kvm);
> +	if (id < 0) {
> +		kfree(source);
> +		return ERR_PTR(id);
> +	}
> +
> +	kref_init(&source->kref);
> +	spin_lock_init(&source->lock);
> +	source->kvm = kvm;
> +	source->id = id;
> +
> +	return source;
> +}
> +
> +/*
>   * --------------------------------------------------------------------
>   * irqfd: Allows an fd to be used to inject an interrupt to the guest
>   *
> @@ -52,6 +114,8 @@ struct _irqfd {
>  	/* Used for level IRQ fast-path */
>  	int gsi;
>  	struct work_struct inject;
> +	/* IRQ source ID for level triggered irqfds */
> +	struct _irq_source *source;
>  	/* Used for setup/shutdown */
>  	struct eventfd_ctx *eventfd;
>  	struct list_head list;
> @@ -62,7 +126,7 @@ struct _irqfd {
>  static struct workqueue_struct *irqfd_cleanup_wq;
>  
>  static void
> -irqfd_inject(struct work_struct *work)
> +irqfd_inject_edge(struct work_struct *work)
>  {
>  	struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
>  	struct kvm *kvm = irqfd->kvm;
> @@ -71,6 +135,29 @@ irqfd_inject(struct work_struct *work)
>  	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
>  }
>  
> +static void
> +irqfd_inject_level(struct work_struct *work)
> +{
> +	struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
> +
> +	/*
> +	 * Inject an interrupt only if not already asserted.
> +	 *
> +	 * We can safely ignore the kvm_set_irq return value here.  If
> +	 * masked, the irr bit is still set and will eventually be serviced.
> +	 * This interface does not guarantee immediate injection.  If
> +	 * coalesced, an eoi will be coming where we can de-assert and
> +	 * re-inject if necessary.  NB, if you need to know if an interrupt
> +	 * was coalesced, this interface is not for you.
> +	 */
> +	spin_lock(&irqfd->source->lock);
> +	if (!irqfd->source->level_asserted) {
> +		kvm_set_irq(irqfd->kvm, irqfd->source->id, irqfd->gsi, 1);
> +		irqfd->source->level_asserted = true;
> +	}
> +	spin_unlock(&irqfd->source->lock);
> +}
> +

So, as was discussed, calling kvm_set_irq under a spinlock is bad for
scalability with multiple VCPUs.  Why do we need a spinlock simply to
protect level_asserted?  Let's use an atomic test-and-set/test-and-clear
and the problem goes away.
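
Roughly what I have in mind, as a userspace sketch with C11 atomics
rather than the kernel bitops. struct irq_source_model and the helpers
are hypothetical names, and the two counters merely stand in for the
calls that would reach kvm_set_irq(). Only the caller that flips the
flag gets to touch the line, so a repeated assert or a repeated
de-assert becomes a no-op without any lock being held.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of the per-source state; not the patch's struct. */
struct irq_source_model {
	atomic_bool level_asserted;
	int injections;  /* stand-in for kvm_set_irq(kvm, id, gsi, 1) calls */
	int deasserts;   /* stand-in for kvm_set_irq(kvm, id, gsi, 0) calls */
};

static void irq_source_model_init(struct irq_source_model *s)
{
	atomic_init(&s->level_asserted, false);
	s->injections = 0;
	s->deasserts = 0;
}

/* Returns true only for the caller that changed the flag from 0 to 1. */
static bool source_assert(struct irq_source_model *s)
{
	if (atomic_exchange(&s->level_asserted, true))
		return false;  /* already asserted, nothing to inject */

	s->injections++;  /* plain counter: fine for a single-threaded demo */
	return true;
}

/* Returns true only for the caller that changed the flag from 1 to 0. */
static bool source_deassert(struct irq_source_model *s)
{
	if (!atomic_exchange(&s->level_asserted, false))
		return false;  /* was not asserted, so no eoi to signal */

	s->deasserts++;
	return true;
}
```

One caveat with this shape: the flag flip and the line update are no
longer a single step, so an assert and a de-assert racing on two CPUs
could apply their line updates in the opposite order to their flag
updates. Whether that matters here is exactly what needs checking.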

>  /*
>   * Race-free decouple logic (ordering is critical)
>   */
> @@ -96,6 +183,9 @@ irqfd_shutdown(struct work_struct *work)
>  	 * It is now safe to release the object's resources
>  	 */
>  	eventfd_ctx_put(irqfd->eventfd);
> +
> +	_irq_source_put(irqfd->source);
> +
>  	kfree(irqfd);
>  }
>  
> @@ -202,6 +292,7 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
>  {
>  	struct kvm_irq_routing_table *irq_rt;
>  	struct _irqfd *irqfd, *tmp;
> +	struct _irq_source *source = NULL;
>  	struct file *file = NULL;
>  	struct eventfd_ctx *eventfd = NULL;
>  	int ret;
> @@ -214,7 +305,19 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
>  	irqfd->kvm = kvm;
>  	irqfd->gsi = args->gsi;
>  	INIT_LIST_HEAD(&irqfd->list);
> -	INIT_WORK(&irqfd->inject, irqfd_inject);
> +
> +	if (args->flags & KVM_IRQFD_FLAG_LEVEL) {
> +		source = _irq_source_alloc(kvm);
> +		if (IS_ERR(source)) {
> +			ret = PTR_ERR(source);
> +			goto fail;
> +		}
> +
> +		irqfd->source = source;
> +		INIT_WORK(&irqfd->inject, irqfd_inject_level);
> +	} else
> +		INIT_WORK(&irqfd->inject, irqfd_inject_edge);
> +
>  	INIT_WORK(&irqfd->shutdown, irqfd_shutdown);
>  
>  	file = eventfd_fget(args->fd);
> @@ -276,10 +379,13 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
>  	return 0;
>  
>  fail:
> +	if (source && !IS_ERR(source))
> +		_irq_source_put(source);
> +
>  	if (eventfd && !IS_ERR(eventfd))
>  		eventfd_ctx_put(eventfd);
>  
> -	if (!IS_ERR(file))
> +	if (file && !IS_ERR(file))
>  		fput(file);
>  
>  	kfree(irqfd);
> @@ -340,7 +446,7 @@ kvm_irqfd_deassign(struct kvm *kvm, struct kvm_irqfd *args)
>  int
>  kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args)
>  {
> -	if (args->flags & ~KVM_IRQFD_FLAG_DEASSIGN)
> +	if (args->flags & ~(KVM_IRQFD_FLAG_DEASSIGN | KVM_IRQFD_FLAG_LEVEL))
>  		return -EINVAL;
>  
>  	if (args->flags & KVM_IRQFD_FLAG_DEASSIGN)

* Re: [PATCH v5 0/4] kvm: level irqfd and new eoifd
  2012-07-16 20:33 [PATCH v5 0/4] kvm: level irqfd and new eoifd Alex Williamson
                   ` (3 preceding siblings ...)
  2012-07-16 20:34 ` [PATCH v5 4/4] kvm: Convert eoifd to use kvm_clear_irq Alex Williamson
@ 2012-07-18 10:43 ` Michael S. Tsirkin
  2012-07-19 16:59 ` Michael S. Tsirkin
  5 siblings, 0 replies; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-18 10:43 UTC (permalink / raw)
  To: Alex Williamson; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Mon, Jul 16, 2012 at 02:33:38PM -0600, Alex Williamson wrote:
> v5:
>  - irqfds now have a one-to-one mapping with eoifds to prevent users
>    from consuming all of kernel memory by repeatedly creating eoifds
>    from a single irqfd.
>  - implement a kvm_clear_irq() which does a test_and_clear_bit of
>    the irq_state, only updating the pic/ioapic if changes and allowing
>    the caller to know if anything was done.  I added this onto the end
>    as it's essentially an optimization on the previous design.  It's
>    hard to tell if there's an actual performance benefit to this.

I have to agree with this, but we need to avoid invoking kvm_set_irq in
atomic context, without introducing spurious eois.

Can the bool + spinlock that the previous patch has be replaced by an atomic?

>  - dropped eoifd gsi support patch as it was only an FYI.
> 
> Thanks,
> 
> Alex
> 
> ---
> 
> Alex Williamson (4):
>       kvm: Convert eoifd to use kvm_clear_irq
>       kvm: Create kvm_clear_irq()
>       kvm: KVM_EOIFD, an eventfd for EOIs
>       kvm: Extend irqfd to support level interrupts
> 
> 
>  Documentation/virtual/kvm/api.txt |   28 +++
>  arch/x86/kvm/x86.c                |    3 
>  include/linux/kvm.h               |   18 ++
>  include/linux/kvm_host.h          |   16 ++
>  virt/kvm/eventfd.c                |  333 +++++++++++++++++++++++++++++++++++++
>  virt/kvm/irq_comm.c               |   78 +++++++++
>  virt/kvm/kvm_main.c               |   11 +
>  7 files changed, 483 insertions(+), 4 deletions(-)

* Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts
  2012-07-18 10:41   ` Michael S. Tsirkin
@ 2012-07-18 10:44     ` Gleb Natapov
  2012-07-18 10:48       ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Gleb Natapov @ 2012-07-18 10:44 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Alex Williamson, avi, kvm, linux-kernel, jan.kiszka

On Wed, Jul 18, 2012 at 01:41:14PM +0300, Michael S. Tsirkin wrote:
> On Mon, Jul 16, 2012 at 02:33:47PM -0600, Alex Williamson wrote:
> > In order to inject a level interrupt from an external source using an
> > irqfd, we need to allocate a new irq_source_id.  This allows us to
> > assert and (later) de-assert an interrupt line independently from
> > users of KVM_IRQ_LINE and avoid lost interrupts.
> > 
> > We also add what may appear like a bit of excessive infrastructure
> > around an object for storing this irq_source_id.  However, notice
> > that we only provide a way to assert the interrupt here.  A follow-on
> > interface will make use of the same irq_source_id to allow de-assert.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> > 
> >  Documentation/virtual/kvm/api.txt |    6 ++
> >  arch/x86/kvm/x86.c                |    1 
> >  include/linux/kvm.h               |    3 +
> >  virt/kvm/eventfd.c                |  114 ++++++++++++++++++++++++++++++++++++-
> >  4 files changed, 120 insertions(+), 4 deletions(-)
> > 
> > diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> > index 100acde..c7267d5 100644
> > --- a/Documentation/virtual/kvm/api.txt
> > +++ b/Documentation/virtual/kvm/api.txt
> > @@ -1981,6 +1981,12 @@ the guest using the specified gsi pin.  The irqfd is removed using
> >  the KVM_IRQFD_FLAG_DEASSIGN flag, specifying both kvm_irqfd.fd
> >  and kvm_irqfd.gsi.
> >  
> > +The KVM_IRQFD_FLAG_LEVEL flag indicates the gsi input is for a level
> > +triggered interrupt.  In this case a new irqchip input is allocated
> > +which is logically OR'd with other inputs allowing multiple sources
> > +to independently assert level interrupts.  The KVM_IRQFD_FLAG_LEVEL
> > +is only necessary on setup, teardown is identical to that above.
> > +KVM_IRQFD_FLAG_LEVEL support is indicated by KVM_CAP_IRQFD_LEVEL.
> >  
> >  5. The kvm_run structure
> >  ------------------------
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index a01a424..80bed07 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -2148,6 +2148,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> >  	case KVM_CAP_GET_TSC_KHZ:
> >  	case KVM_CAP_PCI_2_3:
> >  	case KVM_CAP_KVMCLOCK_CTRL:
> > +	case KVM_CAP_IRQFD_LEVEL:
> >  		r = 1;
> >  		break;
> >  	case KVM_CAP_COALESCED_MMIO:
> > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > index 2ce09aa..b2e6e4f 100644
> > --- a/include/linux/kvm.h
> > +++ b/include/linux/kvm.h
> > @@ -618,6 +618,7 @@ struct kvm_ppc_smmu_info {
> >  #define KVM_CAP_PPC_GET_SMMU_INFO 78
> >  #define KVM_CAP_S390_COW 79
> >  #define KVM_CAP_PPC_ALLOC_HTAB 80
> > +#define KVM_CAP_IRQFD_LEVEL 81
> >  
> >  #ifdef KVM_CAP_IRQ_ROUTING
> >  
> > @@ -683,6 +684,8 @@ struct kvm_xen_hvm_config {
> >  #endif
> >  
> >  #define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
> > +/* Available with KVM_CAP_IRQFD_LEVEL */
> > +#define KVM_IRQFD_FLAG_LEVEL (1 << 1)
> >  
> >  struct kvm_irqfd {
> >  	__u32 fd;
> > diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> > index 7d7e2aa..ecdbfea 100644
> > --- a/virt/kvm/eventfd.c
> > +++ b/virt/kvm/eventfd.c
> > @@ -36,6 +36,68 @@
> >  #include "iodev.h"
> >  
> >  /*
> > + * An irq_source_id can be created from KVM_IRQFD for level interrupt
> > + * injections and shared with other interfaces for EOI or de-assert.
> > + * Create an object with reference counting to make it easy to use.
> > + */
> > +struct _irq_source {
> > +	int id; /* the IRQ source ID */
> > +	bool level_asserted; /* Track assertion state and protect with lock */
> > +	spinlock_t lock;     /* to avoid unnecessary re-assert/spurious eoi. */
> > +	struct kvm *kvm;
> > +	struct kref kref;
> > +};
> > +
> > +static void _irq_source_release(struct kref *kref)
> > +{
> > +	struct _irq_source *source;
> > +
> > +	source = container_of(kref, struct _irq_source, kref);
> > +
> > +	/* This also de-asserts */
> > +	kvm_free_irq_source_id(source->kvm, source->id);
> > +	kfree(source);
> > +}
> > +
> > +static void _irq_source_put(struct _irq_source *source)
> > +{
> > +	if (source)
> > +		kref_put(&source->kref, _irq_source_release);
> > +}
> > +
> > +static struct _irq_source *__attribute__ ((used)) /* white lie for now */
> > +_irq_source_get(struct _irq_source *source)
> > +{
> > +	if (source)
> > +		kref_get(&source->kref);
> > +
> > +	return source;
> > +}
> > +
> > +static struct _irq_source *_irq_source_alloc(struct kvm *kvm)
> > +{
> > +	struct _irq_source *source;
> > +	int id;
> > +
> > +	source = kzalloc(sizeof(*source), GFP_KERNEL);
> > +	if (!source)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	id = kvm_request_irq_source_id(kvm);
> > +	if (id < 0) {
> > +		kfree(source);
> > +		return ERR_PTR(id);
> > +	}
> > +
> > +	kref_init(&source->kref);
> > +	spin_lock_init(&source->lock);
> > +	source->kvm = kvm;
> > +	source->id = id;
> > +
> > +	return source;
> > +}
> > +
> > +/*
> >   * --------------------------------------------------------------------
> >   * irqfd: Allows an fd to be used to inject an interrupt to the guest
> >   *
> > @@ -52,6 +114,8 @@ struct _irqfd {
> >  	/* Used for level IRQ fast-path */
> >  	int gsi;
> >  	struct work_struct inject;
> > +	/* IRQ source ID for level triggered irqfds */
> > +	struct _irq_source *source;
> >  	/* Used for setup/shutdown */
> >  	struct eventfd_ctx *eventfd;
> >  	struct list_head list;
> > @@ -62,7 +126,7 @@ struct _irqfd {
> >  static struct workqueue_struct *irqfd_cleanup_wq;
> >  
> >  static void
> > -irqfd_inject(struct work_struct *work)
> > +irqfd_inject_edge(struct work_struct *work)
> >  {
> >  	struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
> >  	struct kvm *kvm = irqfd->kvm;
> > @@ -71,6 +135,29 @@ irqfd_inject(struct work_struct *work)
> >  	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
> >  }
> >  
> > +static void
> > +irqfd_inject_level(struct work_struct *work)
> > +{
> > +	struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
> > +
> > +	/*
> > +	 * Inject an interrupt only if not already asserted.
> > +	 *
> > +	 * We can safely ignore the kvm_set_irq return value here.  If
> > +	 * masked, the irr bit is still set and will eventually be serviced.
> > +	 * This interface does not guarantee immediate injection.  If
> > +	 * coalesced, an eoi will be coming where we can de-assert and
> > +	 * re-inject if necessary.  NB, if you need to know if an interrupt
> > +	 * was coalesced, this interface is not for you.
> > +	 */
> > +	spin_lock(&irqfd->source->lock);
> > +	if (!irqfd->source->level_asserted) {
> > +		kvm_set_irq(irqfd->kvm, irqfd->source->id, irqfd->gsi, 1);
> > +		irqfd->source->level_asserted = true;
> > +	}
> > +	spin_unlock(&irqfd->source->lock);
> > +}
> > +
> 
> So as was discussed kvm_set_irq under spinlock is bad for scalability
> with multiple VCPUs.  Why do we need a spinlock simply to protect
> level_asserted?  Let's use an atomic test and set/test and clear and the
> problem goes away.
> 
The sad reality is that for level interrupts we already scan all vcpus
under a spinlock.

--
			Gleb.
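
For readers following along, the "atomic test and set/test and clear"
idea from the quoted review comment can be sketched in userspace C11 as
below. This is only an illustration under stated assumptions, not code
from the patch: struct irq_source_sketch, sketch_assert_level() and
sketch_deassert_level() are hypothetical names, and the counter merely
stands in for the point where kernel code would call kvm_set_irq().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct _irq_source; not the patch's layout. */
struct irq_source_sketch {
	atomic_bool level_asserted;	/* replaces the bool + spinlock pair */
	atomic_int inject_count;	/* counts would-be kvm_set_irq(..., 1) calls */
};

/*
 * Assert the line only on a 0 -> 1 transition of level_asserted.
 * atomic_exchange() returns the previous value, so exactly one of
 * several concurrent callers observes "false" and performs the inject.
 */
static bool sketch_assert_level(struct irq_source_sketch *s)
{
	if (atomic_exchange(&s->level_asserted, true))
		return false;	/* already asserted, nothing to do */

	atomic_fetch_add(&s->inject_count, 1);	/* placeholder for the inject */
	return true;
}

/*
 * Test-and-clear counterpart. Returns true if the line was asserted,
 * i.e. the caller is the one that should de-assert it.
 */
static bool sketch_deassert_level(struct irq_source_sketch *s)
{
	return atomic_exchange(&s->level_asserted, false);
}
```

Note that in this form the flag flip and the injection are no longer
one critical section, so ordering against a concurrent de-assert would
need separate thought. The sketch also says nothing about the locking
inside the ioapic mentioned in the reply above.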

* Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts
  2012-07-18 10:44     ` Gleb Natapov
@ 2012-07-18 10:48       ` Michael S. Tsirkin
  2012-07-18 10:49         ` Gleb Natapov
  0 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-18 10:48 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Alex Williamson, avi, kvm, linux-kernel, jan.kiszka

On Wed, Jul 18, 2012 at 01:44:29PM +0300, Gleb Natapov wrote:
> On Wed, Jul 18, 2012 at 01:41:14PM +0300, Michael S. Tsirkin wrote:
> > On Mon, Jul 16, 2012 at 02:33:47PM -0600, Alex Williamson wrote:
> > > In order to inject a level interrupt from an external source using an
> > > irqfd, we need to allocate a new irq_source_id.  This allows us to
> > > assert and (later) de-assert an interrupt line independently from
> > > users of KVM_IRQ_LINE and avoid lost interrupts.
> > > 
> > > We also add what may appear like a bit of excessive infrastructure
> > > around an object for storing this irq_source_id.  However, notice
> > > that we only provide a way to assert the interrupt here.  A follow-on
> > > interface will make use of the same irq_source_id to allow de-assert.
> > > 
> > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > ---
> > > 
> > >  Documentation/virtual/kvm/api.txt |    6 ++
> > >  arch/x86/kvm/x86.c                |    1 
> > >  include/linux/kvm.h               |    3 +
> > >  virt/kvm/eventfd.c                |  114 ++++++++++++++++++++++++++++++++++++-
> > >  4 files changed, 120 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> > > index 100acde..c7267d5 100644
> > > --- a/Documentation/virtual/kvm/api.txt
> > > +++ b/Documentation/virtual/kvm/api.txt
> > > @@ -1981,6 +1981,12 @@ the guest using the specified gsi pin.  The irqfd is removed using
> > >  the KVM_IRQFD_FLAG_DEASSIGN flag, specifying both kvm_irqfd.fd
> > >  and kvm_irqfd.gsi.
> > >  
> > > +The KVM_IRQFD_FLAG_LEVEL flag indicates the gsi input is for a level
> > > +triggered interrupt.  In this case a new irqchip input is allocated
> > > +which is logically OR'd with other inputs allowing multiple sources
> > > +to independently assert level interrupts.  The KVM_IRQFD_FLAG_LEVEL
> > > +is only necessary on setup, teardown is identical to that above.
> > > +KVM_IRQFD_FLAG_LEVEL support is indicated by KVM_CAP_IRQFD_LEVEL.
> > >  
> > >  5. The kvm_run structure
> > >  ------------------------
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index a01a424..80bed07 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -2148,6 +2148,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> > >  	case KVM_CAP_GET_TSC_KHZ:
> > >  	case KVM_CAP_PCI_2_3:
> > >  	case KVM_CAP_KVMCLOCK_CTRL:
> > > +	case KVM_CAP_IRQFD_LEVEL:
> > >  		r = 1;
> > >  		break;
> > >  	case KVM_CAP_COALESCED_MMIO:
> > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > > index 2ce09aa..b2e6e4f 100644
> > > --- a/include/linux/kvm.h
> > > +++ b/include/linux/kvm.h
> > > @@ -618,6 +618,7 @@ struct kvm_ppc_smmu_info {
> > >  #define KVM_CAP_PPC_GET_SMMU_INFO 78
> > >  #define KVM_CAP_S390_COW 79
> > >  #define KVM_CAP_PPC_ALLOC_HTAB 80
> > > +#define KVM_CAP_IRQFD_LEVEL 81
> > >  
> > >  #ifdef KVM_CAP_IRQ_ROUTING
> > >  
> > > @@ -683,6 +684,8 @@ struct kvm_xen_hvm_config {
> > >  #endif
> > >  
> > >  #define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
> > > +/* Available with KVM_CAP_IRQFD_LEVEL */
> > > +#define KVM_IRQFD_FLAG_LEVEL (1 << 1)
> > >  
> > >  struct kvm_irqfd {
> > >  	__u32 fd;
> > > diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> > > index 7d7e2aa..ecdbfea 100644
> > > --- a/virt/kvm/eventfd.c
> > > +++ b/virt/kvm/eventfd.c
> > > @@ -36,6 +36,68 @@
> > >  #include "iodev.h"
> > >  
> > >  /*
> > > + * An irq_source_id can be created from KVM_IRQFD for level interrupt
> > > + * injections and shared with other interfaces for EOI or de-assert.
> > > + * Create an object with reference counting to make it easy to use.
> > > + */
> > > +struct _irq_source {
> > > +	int id; /* the IRQ source ID */
> > > +	bool level_asserted; /* Track assertion state and protect with lock */
> > > +	spinlock_t lock;     /* to avoid unnecessary re-assert/spurious eoi. */
> > > +	struct kvm *kvm;
> > > +	struct kref kref;
> > > +};
> > > +
> > > +static void _irq_source_release(struct kref *kref)
> > > +{
> > > +	struct _irq_source *source;
> > > +
> > > +	source = container_of(kref, struct _irq_source, kref);
> > > +
> > > +	/* This also de-asserts */
> > > +	kvm_free_irq_source_id(source->kvm, source->id);
> > > +	kfree(source);
> > > +}
> > > +
> > > +static void _irq_source_put(struct _irq_source *source)
> > > +{
> > > +	if (source)
> > > +		kref_put(&source->kref, _irq_source_release);
> > > +}
> > > +
> > > +static struct _irq_source *__attribute__ ((used)) /* white lie for now */
> > > +_irq_source_get(struct _irq_source *source)
> > > +{
> > > +	if (source)
> > > +		kref_get(&source->kref);
> > > +
> > > +	return source;
> > > +}
> > > +
> > > +static struct _irq_source *_irq_source_alloc(struct kvm *kvm)
> > > +{
> > > +	struct _irq_source *source;
> > > +	int id;
> > > +
> > > +	source = kzalloc(sizeof(*source), GFP_KERNEL);
> > > +	if (!source)
> > > +		return ERR_PTR(-ENOMEM);
> > > +
> > > +	id = kvm_request_irq_source_id(kvm);
> > > +	if (id < 0) {
> > > +		kfree(source);
> > > +		return ERR_PTR(id);
> > > +	}
> > > +
> > > +	kref_init(&source->kref);
> > > +	spin_lock_init(&source->lock);
> > > +	source->kvm = kvm;
> > > +	source->id = id;
> > > +
> > > +	return source;
> > > +}
> > > +
> > > +/*
> > >   * --------------------------------------------------------------------
> > >   * irqfd: Allows an fd to be used to inject an interrupt to the guest
> > >   *
> > > @@ -52,6 +114,8 @@ struct _irqfd {
> > >  	/* Used for level IRQ fast-path */
> > >  	int gsi;
> > >  	struct work_struct inject;
> > > +	/* IRQ source ID for level triggered irqfds */
> > > +	struct _irq_source *source;
> > >  	/* Used for setup/shutdown */
> > >  	struct eventfd_ctx *eventfd;
> > >  	struct list_head list;
> > > @@ -62,7 +126,7 @@ struct _irqfd {
> > >  static struct workqueue_struct *irqfd_cleanup_wq;
> > >  
> > >  static void
> > > -irqfd_inject(struct work_struct *work)
> > > +irqfd_inject_edge(struct work_struct *work)
> > >  {
> > >  	struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
> > >  	struct kvm *kvm = irqfd->kvm;
> > > @@ -71,6 +135,29 @@ irqfd_inject(struct work_struct *work)
> > >  	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
> > >  }
> > >  
> > > +static void
> > > +irqfd_inject_level(struct work_struct *work)
> > > +{
> > > +	struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
> > > +
> > > +	/*
> > > +	 * Inject an interrupt only if not already asserted.
> > > +	 *
> > > +	 * We can safely ignore the kvm_set_irq return value here.  If
> > > +	 * masked, the irr bit is still set and will eventually be serviced.
> > > +	 * This interface does not guarantee immediate injection.  If
> > > +	 * coalesced, an eoi will be coming where we can de-assert and
> > > +	 * re-inject if necessary.  NB, if you need to know if an interrupt
> > > +	 * was coalesced, this interface is not for you.
> > > +	 */
> > > +	spin_lock(&irqfd->source->lock);
> > > +	if (!irqfd->source->level_asserted) {
> > > +		kvm_set_irq(irqfd->kvm, irqfd->source->id, irqfd->gsi, 1);
> > > +		irqfd->source->level_asserted = true;
> > > +	}
> > > +	spin_unlock(&irqfd->source->lock);
> > > +}
> > > +
> > 
> > So as was discussed kvm_set_irq under spinlock is bad for scalability
> > with multiple VCPUs.  Why do we need a spinlock simply to protect
> > level_asserted?  Let's use an atomic test and set/test and clear and the
> > problem goes away.
> > 
> The sad reality is that for level interrupts we already scan all vcpus
> under a spinlock.

Where?

> --
> 			Gleb.

* Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts
  2012-07-18 10:48       ` Michael S. Tsirkin
@ 2012-07-18 10:49         ` Gleb Natapov
  2012-07-18 10:53           ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Gleb Natapov @ 2012-07-18 10:49 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Alex Williamson, avi, kvm, linux-kernel, jan.kiszka

On Wed, Jul 18, 2012 at 01:48:44PM +0300, Michael S. Tsirkin wrote:
> On Wed, Jul 18, 2012 at 01:44:29PM +0300, Gleb Natapov wrote:
> > On Wed, Jul 18, 2012 at 01:41:14PM +0300, Michael S. Tsirkin wrote:
> > > On Mon, Jul 16, 2012 at 02:33:47PM -0600, Alex Williamson wrote:
> > > > In order to inject a level interrupt from an external source using an
> > > > irqfd, we need to allocate a new irq_source_id.  This allows us to
> > > > assert and (later) de-assert an interrupt line independently from
> > > > users of KVM_IRQ_LINE and avoid lost interrupts.
> > > > 
> > > > We also add what may appear like a bit of excessive infrastructure
> > > > around an object for storing this irq_source_id.  However, notice
> > > > that we only provide a way to assert the interrupt here.  A follow-on
> > > > interface will make use of the same irq_source_id to allow de-assert.
> > > > 
> > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > > ---
> > > > 
> > > >  Documentation/virtual/kvm/api.txt |    6 ++
> > > >  arch/x86/kvm/x86.c                |    1 
> > > >  include/linux/kvm.h               |    3 +
> > > >  virt/kvm/eventfd.c                |  114 ++++++++++++++++++++++++++++++++++++-
> > > >  4 files changed, 120 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> > > > index 100acde..c7267d5 100644
> > > > --- a/Documentation/virtual/kvm/api.txt
> > > > +++ b/Documentation/virtual/kvm/api.txt
> > > > @@ -1981,6 +1981,12 @@ the guest using the specified gsi pin.  The irqfd is removed using
> > > >  the KVM_IRQFD_FLAG_DEASSIGN flag, specifying both kvm_irqfd.fd
> > > >  and kvm_irqfd.gsi.
> > > >  
> > > > +The KVM_IRQFD_FLAG_LEVEL flag indicates the gsi input is for a level
> > > > +triggered interrupt.  In this case a new irqchip input is allocated
> > > > +which is logically OR'd with other inputs allowing multiple sources
> > > > +to independently assert level interrupts.  The KVM_IRQFD_FLAG_LEVEL
> > > > +is only necessary on setup, teardown is identical to that above.
> > > > +KVM_IRQFD_FLAG_LEVEL support is indicated by KVM_CAP_IRQFD_LEVEL.
> > > >  
> > > >  5. The kvm_run structure
> > > >  ------------------------
> > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > index a01a424..80bed07 100644
> > > > --- a/arch/x86/kvm/x86.c
> > > > +++ b/arch/x86/kvm/x86.c
> > > > @@ -2148,6 +2148,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> > > >  	case KVM_CAP_GET_TSC_KHZ:
> > > >  	case KVM_CAP_PCI_2_3:
> > > >  	case KVM_CAP_KVMCLOCK_CTRL:
> > > > +	case KVM_CAP_IRQFD_LEVEL:
> > > >  		r = 1;
> > > >  		break;
> > > >  	case KVM_CAP_COALESCED_MMIO:
> > > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > > > index 2ce09aa..b2e6e4f 100644
> > > > --- a/include/linux/kvm.h
> > > > +++ b/include/linux/kvm.h
> > > > @@ -618,6 +618,7 @@ struct kvm_ppc_smmu_info {
> > > >  #define KVM_CAP_PPC_GET_SMMU_INFO 78
> > > >  #define KVM_CAP_S390_COW 79
> > > >  #define KVM_CAP_PPC_ALLOC_HTAB 80
> > > > +#define KVM_CAP_IRQFD_LEVEL 81
> > > >  
> > > >  #ifdef KVM_CAP_IRQ_ROUTING
> > > >  
> > > > @@ -683,6 +684,8 @@ struct kvm_xen_hvm_config {
> > > >  #endif
> > > >  
> > > >  #define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
> > > > +/* Available with KVM_CAP_IRQFD_LEVEL */
> > > > +#define KVM_IRQFD_FLAG_LEVEL (1 << 1)
> > > >  
> > > >  struct kvm_irqfd {
> > > >  	__u32 fd;
> > > > diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> > > > index 7d7e2aa..ecdbfea 100644
> > > > --- a/virt/kvm/eventfd.c
> > > > +++ b/virt/kvm/eventfd.c
> > > > @@ -36,6 +36,68 @@
> > > >  #include "iodev.h"
> > > >  
> > > >  /*
> > > > + * An irq_source_id can be created from KVM_IRQFD for level interrupt
> > > > + * injections and shared with other interfaces for EOI or de-assert.
> > > > + * Create an object with reference counting to make it easy to use.
> > > > + */
> > > > +struct _irq_source {
> > > > +	int id; /* the IRQ source ID */
> > > > +	bool level_asserted; /* Track assertion state and protect with lock */
> > > > +	spinlock_t lock;     /* to avoid unnecessary re-assert/spurious eoi. */
> > > > +	struct kvm *kvm;
> > > > +	struct kref kref;
> > > > +};
> > > > +
> > > > +static void _irq_source_release(struct kref *kref)
> > > > +{
> > > > +	struct _irq_source *source;
> > > > +
> > > > +	source = container_of(kref, struct _irq_source, kref);
> > > > +
> > > > +	/* This also de-asserts */
> > > > +	kvm_free_irq_source_id(source->kvm, source->id);
> > > > +	kfree(source);
> > > > +}
> > > > +
> > > > +static void _irq_source_put(struct _irq_source *source)
> > > > +{
> > > > +	if (source)
> > > > +		kref_put(&source->kref, _irq_source_release);
> > > > +}
> > > > +
> > > > +static struct _irq_source *__attribute__ ((used)) /* white lie for now */
> > > > +_irq_source_get(struct _irq_source *source)
> > > > +{
> > > > +	if (source)
> > > > +		kref_get(&source->kref);
> > > > +
> > > > +	return source;
> > > > +}
> > > > +
> > > > +static struct _irq_source *_irq_source_alloc(struct kvm *kvm)
> > > > +{
> > > > +	struct _irq_source *source;
> > > > +	int id;
> > > > +
> > > > +	source = kzalloc(sizeof(*source), GFP_KERNEL);
> > > > +	if (!source)
> > > > +		return ERR_PTR(-ENOMEM);
> > > > +
> > > > +	id = kvm_request_irq_source_id(kvm);
> > > > +	if (id < 0) {
> > > > +		kfree(source);
> > > > +		return ERR_PTR(id);
> > > > +	}
> > > > +
> > > > +	kref_init(&source->kref);
> > > > +	spin_lock_init(&source->lock);
> > > > +	source->kvm = kvm;
> > > > +	source->id = id;
> > > > +
> > > > +	return source;
> > > > +}
> > > > +
> > > > +/*
> > > >   * --------------------------------------------------------------------
> > > >   * irqfd: Allows an fd to be used to inject an interrupt to the guest
> > > >   *
> > > > @@ -52,6 +114,8 @@ struct _irqfd {
> > > >  	/* Used for level IRQ fast-path */
> > > >  	int gsi;
> > > >  	struct work_struct inject;
> > > > +	/* IRQ source ID for level triggered irqfds */
> > > > +	struct _irq_source *source;
> > > >  	/* Used for setup/shutdown */
> > > >  	struct eventfd_ctx *eventfd;
> > > >  	struct list_head list;
> > > > @@ -62,7 +126,7 @@ struct _irqfd {
> > > >  static struct workqueue_struct *irqfd_cleanup_wq;
> > > >  
> > > >  static void
> > > > -irqfd_inject(struct work_struct *work)
> > > > +irqfd_inject_edge(struct work_struct *work)
> > > >  {
> > > >  	struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
> > > >  	struct kvm *kvm = irqfd->kvm;
> > > > @@ -71,6 +135,29 @@ irqfd_inject(struct work_struct *work)
> > > >  	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
> > > >  }
> > > >  
> > > > +static void
> > > > +irqfd_inject_level(struct work_struct *work)
> > > > +{
> > > > +	struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
> > > > +
> > > > +	/*
> > > > +	 * Inject an interrupt only if not already asserted.
> > > > +	 *
> > > > +	 * We can safely ignore the kvm_set_irq return value here.  If
> > > > +	 * masked, the irr bit is still set and will eventually be serviced.
> > > > +	 * This interface does not guarantee immediate injection.  If
> > > > +	 * coalesced, an eoi will be coming where we can de-assert and
> > > > +	 * re-inject if necessary.  NB, if you need to know if an interrupt
> > > > +	 * was coalesced, this interface is not for you.
> > > > +	 */
> > > > +	spin_lock(&irqfd->source->lock);
> > > > +	if (!irqfd->source->level_asserted) {
> > > > +		kvm_set_irq(irqfd->kvm, irqfd->source->id, irqfd->gsi, 1);
> > > > +		irqfd->source->level_asserted = true;
> > > > +	}
> > > > +	spin_unlock(&irqfd->source->lock);
> > > > +}
> > > > +
> > > 
> > > So as was discussed kvm_set_irq under spinlock is bad for scalability
> > > with multiple VCPUs.  Why do we need a spinlock simply to protect
> > > level_asserted?  Let's use an atomic test and set/test and clear and the
> > > problem goes away.
> > > 
> > The sad reality is that for level interrupts we already scan all vcpus
> > under a spinlock.
> 
> Where?
> 
ioapic

--
			Gleb.

* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-18 10:36                                 ` Gleb Natapov
@ 2012-07-18 10:51                                   ` Michael S. Tsirkin
  2012-07-18 10:53                                     ` Gleb Natapov
  0 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-18 10:51 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Alex Williamson, avi, kvm, linux-kernel, jan.kiszka

On Wed, Jul 18, 2012 at 01:36:08PM +0300, Gleb Natapov wrote:
> On Wed, Jul 18, 2012 at 01:33:35PM +0300, Michael S. Tsirkin wrote:
> > On Wed, Jul 18, 2012 at 01:27:39PM +0300, Gleb Natapov wrote:
> > > On Wed, Jul 18, 2012 at 01:20:29PM +0300, Michael S. Tsirkin wrote:
> > > > On Wed, Jul 18, 2012 at 09:27:42AM +0300, Gleb Natapov wrote:
> > > > > On Tue, Jul 17, 2012 at 07:14:52PM +0300, Michael S. Tsirkin wrote:
> > > > > > > _Seems_ racy, or _is_ racy?  Please identify the race.
> > > > > > 
> > > > > > Look at this:
> > > > > > 
> > > > > > static inline int kvm_irq_line_state(unsigned long *irq_state,
> > > > > >                                      int irq_source_id, int level)
> > > > > > {
> > > > > >         /* Logical OR for level trig interrupt */
> > > > > >         if (level)
> > > > > >                 set_bit(irq_source_id, irq_state);
> > > > > >         else
> > > > > >                 clear_bit(irq_source_id, irq_state);
> > > > > > 
> > > > > >         return !!(*irq_state);
> > > > > > }
> > > > > > 
> > > > > > 
> > > > > > Now:
> > > > > > If other CPU changes some other bit after the atomic change,
> > > > > > it looks like !!(*irq_state) might return a stale value.
> > > > > > 
> > > > > > CPU 0 clears bit 0. CPU 1 sets bit 1. CPU 1 sets level to 1.
> > > > > > If CPU 0 sees a stale value now it will return 0 here
> > > > > > and interrupt will get cleared.
> > > > > > 
> > > > > This will hardly happen on x86 especially since bit is set with
> > > > > serialized instruction.
> > > > 
> > > > Probably. But it does make me a bit uneasy.  Why don't we pass
> > > > irq_source_id to kvm_pic_set_irq/kvm_ioapic_set_irq, and move
> > > > kvm_irq_line_state to under pic_lock/ioapic_lock?  We can then use
> > > > __set_bit/__clear_bit in kvm_irq_line_state, making the ordering simpler
> > > > and saving an atomic op in the process.
> > > > 
> > > With my patch I do not see why we can't change them to unlocked variant
> > > without moving them anywhere. The only requirement is to not use RMW
> > > sequence to set/clear bits. The ordering of setting does not matter. The
> > > ordering of reading is.
> > 
> > You want to use __set_bit/__clear_bit on the same word
> > from multiple CPUs, without locking?
> > Why won't this lose information?
> Because it is not RMW. If it is then yes, you can't do that.

Are you saying __set_bit does not do a RMW on x86? Interesting.
I think it's probably not a good idea to rely on this.
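
To make the concern concrete, here is a small userspace C11 sketch of
how two non-atomic bit sets on the same word can lose an update, next
to the atomic equivalent. It is a deterministic replay of one unlucky
interleaving rather than a real two-CPU race, and both function names
are hypothetical. In the kernel, __set_bit()/__clear_bit() are the
non-atomic variants and set_bit()/clear_bit() the atomic ones; the
sketch models the former with plain loads and stores and the latter
with atomic_fetch_or().

```c
#include <stdatomic.h>

/*
 * Replay of a bad interleaving of two plain read-modify-write bit
 * sets: both "CPUs" load the word before either stores it back.
 */
static unsigned long sketch_plain_set_interleaved(unsigned long word,
						  int bit_a, int bit_b)
{
	unsigned long cpu0 = word;	/* CPU 0: load */
	unsigned long cpu1 = word;	/* CPU 1: load */

	cpu0 |= 1UL << bit_a;		/* CPU 0: modify */
	cpu1 |= 1UL << bit_b;		/* CPU 1: modify */

	word = cpu0;			/* CPU 0: store */
	word = cpu1;			/* CPU 1: store, overwrites CPU 0's bit */

	return word;
}

/*
 * The same two updates as atomic read-modify-write operations: each
 * OR is indivisible, so neither bit can be lost whatever the order.
 */
static unsigned long sketch_atomic_set(unsigned long initial,
				       int bit_a, int bit_b)
{
	atomic_ulong word;

	atomic_init(&word, initial);
	atomic_fetch_or(&word, 1UL << bit_a);
	atomic_fetch_or(&word, 1UL << bit_b);

	return atomic_load(&word);
}
```

Starting from zero and setting bits 0 and 1, the plain version ends up
with only bit 1 set, while the atomic version keeps both. Taking a lock
around the plain version, as suggested further up the thread, would
close the same window.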

> > 
> > In any case, it seems simpler and safer to do accesses under lock
> > than rely on specific use.
> > 
> > > --
> > > 			Gleb.
> 
> --
> 			Gleb.

* Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts
  2012-07-18 10:49         ` Gleb Natapov
@ 2012-07-18 10:53           ` Michael S. Tsirkin
  2012-07-18 10:55             ` Gleb Natapov
  0 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-18 10:53 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Alex Williamson, avi, kvm, linux-kernel, jan.kiszka

On Wed, Jul 18, 2012 at 01:49:06PM +0300, Gleb Natapov wrote:
> On Wed, Jul 18, 2012 at 01:48:44PM +0300, Michael S. Tsirkin wrote:
> > On Wed, Jul 18, 2012 at 01:44:29PM +0300, Gleb Natapov wrote:
> > > On Wed, Jul 18, 2012 at 01:41:14PM +0300, Michael S. Tsirkin wrote:
> > > > On Mon, Jul 16, 2012 at 02:33:47PM -0600, Alex Williamson wrote:
> > > > > In order to inject a level interrupt from an external source using an
> > > > > irqfd, we need to allocate a new irq_source_id.  This allows us to
> > > > > assert and (later) de-assert an interrupt line independently from
> > > > > users of KVM_IRQ_LINE and avoid lost interrupts.
> > > > > 
> > > > > We also add what may appear like a bit of excessive infrastructure
> > > > > around an object for storing this irq_source_id.  However, notice
> > > > > that we only provide a way to assert the interrupt here.  A follow-on
> > > > > interface will make use of the same irq_source_id to allow de-assert.
> > > > > 
> > > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > > > ---
> > > > > 
> > > > >  Documentation/virtual/kvm/api.txt |    6 ++
> > > > >  arch/x86/kvm/x86.c                |    1 
> > > > >  include/linux/kvm.h               |    3 +
> > > > >  virt/kvm/eventfd.c                |  114 ++++++++++++++++++++++++++++++++++++-
> > > > >  4 files changed, 120 insertions(+), 4 deletions(-)
> > > > > 
> > > > > diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> > > > > index 100acde..c7267d5 100644
> > > > > --- a/Documentation/virtual/kvm/api.txt
> > > > > +++ b/Documentation/virtual/kvm/api.txt
> > > > > @@ -1981,6 +1981,12 @@ the guest using the specified gsi pin.  The irqfd is removed using
> > > > >  the KVM_IRQFD_FLAG_DEASSIGN flag, specifying both kvm_irqfd.fd
> > > > >  and kvm_irqfd.gsi.
> > > > >  
> > > > > +The KVM_IRQFD_FLAG_LEVEL flag indicates the gsi input is for a level
> > > > > +triggered interrupt.  In this case a new irqchip input is allocated
> > > > > +which is logically OR'd with other inputs allowing multiple sources
> > > > > +to independently assert level interrupts.  The KVM_IRQFD_FLAG_LEVEL
> > > > > +is only necessary on setup, teardown is identical to that above.
> > > > > +KVM_IRQFD_FLAG_LEVEL support is indicated by KVM_CAP_IRQFD_LEVEL.
> > > > >  
> > > > >  5. The kvm_run structure
> > > > >  ------------------------
> > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > index a01a424..80bed07 100644
> > > > > --- a/arch/x86/kvm/x86.c
> > > > > +++ b/arch/x86/kvm/x86.c
> > > > > @@ -2148,6 +2148,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> > > > >  	case KVM_CAP_GET_TSC_KHZ:
> > > > >  	case KVM_CAP_PCI_2_3:
> > > > >  	case KVM_CAP_KVMCLOCK_CTRL:
> > > > > +	case KVM_CAP_IRQFD_LEVEL:
> > > > >  		r = 1;
> > > > >  		break;
> > > > >  	case KVM_CAP_COALESCED_MMIO:
> > > > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > > > > index 2ce09aa..b2e6e4f 100644
> > > > > --- a/include/linux/kvm.h
> > > > > +++ b/include/linux/kvm.h
> > > > > @@ -618,6 +618,7 @@ struct kvm_ppc_smmu_info {
> > > > >  #define KVM_CAP_PPC_GET_SMMU_INFO 78
> > > > >  #define KVM_CAP_S390_COW 79
> > > > >  #define KVM_CAP_PPC_ALLOC_HTAB 80
> > > > > +#define KVM_CAP_IRQFD_LEVEL 81
> > > > >  
> > > > >  #ifdef KVM_CAP_IRQ_ROUTING
> > > > >  
> > > > > @@ -683,6 +684,8 @@ struct kvm_xen_hvm_config {
> > > > >  #endif
> > > > >  
> > > > >  #define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
> > > > > +/* Available with KVM_CAP_IRQFD_LEVEL */
> > > > > +#define KVM_IRQFD_FLAG_LEVEL (1 << 1)
> > > > >  
> > > > >  struct kvm_irqfd {
> > > > >  	__u32 fd;
> > > > > diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> > > > > index 7d7e2aa..ecdbfea 100644
> > > > > --- a/virt/kvm/eventfd.c
> > > > > +++ b/virt/kvm/eventfd.c
> > > > > @@ -36,6 +36,68 @@
> > > > >  #include "iodev.h"
> > > > >  
> > > > >  /*
> > > > > + * An irq_source_id can be created from KVM_IRQFD for level interrupt
> > > > > + * injections and shared with other interfaces for EOI or de-assert.
> > > > > + * Create an object with reference counting to make it easy to use.
> > > > > + */
> > > > > +struct _irq_source {
> > > > > +	int id; /* the IRQ source ID */
> > > > > +	bool level_asserted; /* Track assertion state and protect with lock */
> > > > > +	spinlock_t lock;     /* to avoid unnecessary re-assert/spurious eoi. */
> > > > > +	struct kvm *kvm;
> > > > > +	struct kref kref;
> > > > > +};
> > > > > +
> > > > > +static void _irq_source_release(struct kref *kref)
> > > > > +{
> > > > > +	struct _irq_source *source;
> > > > > +
> > > > > +	source = container_of(kref, struct _irq_source, kref);
> > > > > +
> > > > > +	/* This also de-asserts */
> > > > > +	kvm_free_irq_source_id(source->kvm, source->id);
> > > > > +	kfree(source);
> > > > > +}
> > > > > +
> > > > > +static void _irq_source_put(struct _irq_source *source)
> > > > > +{
> > > > > +	if (source)
> > > > > +		kref_put(&source->kref, _irq_source_release);
> > > > > +}
> > > > > +
> > > > > +static struct _irq_source *__attribute__ ((used)) /* white lie for now */
> > > > > +_irq_source_get(struct _irq_source *source)
> > > > > +{
> > > > > +	if (source)
> > > > > +		kref_get(&source->kref);
> > > > > +
> > > > > +	return source;
> > > > > +}
> > > > > +
> > > > > +static struct _irq_source *_irq_source_alloc(struct kvm *kvm)
> > > > > +{
> > > > > +	struct _irq_source *source;
> > > > > +	int id;
> > > > > +
> > > > > +	source = kzalloc(sizeof(*source), GFP_KERNEL);
> > > > > +	if (!source)
> > > > > +		return ERR_PTR(-ENOMEM);
> > > > > +
> > > > > +	id = kvm_request_irq_source_id(kvm);
> > > > > +	if (id < 0) {
> > > > > +		kfree(source);
> > > > > +		return ERR_PTR(id);
> > > > > +	}
> > > > > +
> > > > > +	kref_init(&source->kref);
> > > > > +	spin_lock_init(&source->lock);
> > > > > +	source->kvm = kvm;
> > > > > +	source->id = id;
> > > > > +
> > > > > +	return source;
> > > > > +}
> > > > > +
> > > > > +/*
> > > > >   * --------------------------------------------------------------------
> > > > >   * irqfd: Allows an fd to be used to inject an interrupt to the guest
> > > > >   *
> > > > > @@ -52,6 +114,8 @@ struct _irqfd {
> > > > >  	/* Used for level IRQ fast-path */
> > > > >  	int gsi;
> > > > >  	struct work_struct inject;
> > > > > +	/* IRQ source ID for level triggered irqfds */
> > > > > +	struct _irq_source *source;
> > > > >  	/* Used for setup/shutdown */
> > > > >  	struct eventfd_ctx *eventfd;
> > > > >  	struct list_head list;
> > > > > @@ -62,7 +126,7 @@ struct _irqfd {
> > > > >  static struct workqueue_struct *irqfd_cleanup_wq;
> > > > >  
> > > > >  static void
> > > > > -irqfd_inject(struct work_struct *work)
> > > > > +irqfd_inject_edge(struct work_struct *work)
> > > > >  {
> > > > >  	struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
> > > > >  	struct kvm *kvm = irqfd->kvm;
> > > > > @@ -71,6 +135,29 @@ irqfd_inject(struct work_struct *work)
> > > > >  	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
> > > > >  }
> > > > >  
> > > > > +static void
> > > > > +irqfd_inject_level(struct work_struct *work)
> > > > > +{
> > > > > +	struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
> > > > > +
> > > > > +	/*
> > > > > +	 * Inject an interrupt only if not already asserted.
> > > > > +	 *
> > > > > +	 * We can safely ignore the kvm_set_irq return value here.  If
> > > > > +	 * masked, the irr bit is still set and will eventually be serviced.
> > > > > +	 * This interface does not guarantee immediate injection.  If
> > > > > +	 * coalesced, an eoi will be coming where we can de-assert and
> > > > > +	 * re-inject if necessary.  NB, if you need to know if an interrupt
> > > > > +	 * was coalesced, this interface is not for you.
> > > > > +	 */
> > > > > +	spin_lock(&irqfd->source->lock);
> > > > > +	if (!irqfd->source->level_asserted) {
> > > > > +		kvm_set_irq(irqfd->kvm, irqfd->source->id, irqfd->gsi, 1);
> > > > > +		irqfd->source->level_asserted = true;
> > > > > +	}
> > > > > +	spin_unlock(&irqfd->source->lock);
> > > > > +}
> > > > > +
> > > > 
> > > > So as was discussed kvm_set_irq under spinlock is bad for scalability
> > > > with multiple VCPUs.  Why do we need a spinlock simply to protect
> > > > level_asserted?  Let's use an atomic test and set/test and clear and the
> > > > problem goes away.
> > > > 
> > > That sad reality is that for level interrupt we already scan all vcpus
> > > under spinlock.
> > 
> > Where?
> > 
> ioapic

$ grep kvm_for_each_vcpu virt/kvm/ioapic.c
$

?

> --
> 			Gleb.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-18 10:51                                   ` Michael S. Tsirkin
@ 2012-07-18 10:53                                     ` Gleb Natapov
  2012-07-18 11:08                                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Gleb Natapov @ 2012-07-18 10:53 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Alex Williamson, avi, kvm, linux-kernel, jan.kiszka

On Wed, Jul 18, 2012 at 01:51:05PM +0300, Michael S. Tsirkin wrote:
> On Wed, Jul 18, 2012 at 01:36:08PM +0300, Gleb Natapov wrote:
> > On Wed, Jul 18, 2012 at 01:33:35PM +0300, Michael S. Tsirkin wrote:
> > > On Wed, Jul 18, 2012 at 01:27:39PM +0300, Gleb Natapov wrote:
> > > > On Wed, Jul 18, 2012 at 01:20:29PM +0300, Michael S. Tsirkin wrote:
> > > > > On Wed, Jul 18, 2012 at 09:27:42AM +0300, Gleb Natapov wrote:
> > > > > > On Tue, Jul 17, 2012 at 07:14:52PM +0300, Michael S. Tsirkin wrote:
> > > > > > > > _Seems_ racy, or _is_ racy?  Please identify the race.
> > > > > > > 
> > > > > > > Look at this:
> > > > > > > 
> > > > > > > static inline int kvm_irq_line_state(unsigned long *irq_state,
> > > > > > >                                      int irq_source_id, int level)
> > > > > > > {
> > > > > > >         /* Logical OR for level trig interrupt */
> > > > > > >         if (level)
> > > > > > >                 set_bit(irq_source_id, irq_state);
> > > > > > >         else
> > > > > > >                 clear_bit(irq_source_id, irq_state);
> > > > > > > 
> > > > > > >         return !!(*irq_state);
> > > > > > > }
> > > > > > > 
> > > > > > > 
> > > > > > > Now:
> > > > > > > If other CPU changes some other bit after the atomic change,
> > > > > > > it looks like !!(*irq_state) might return a stale value.
> > > > > > > 
> > > > > > > CPU 0 clears bit 0. CPU 1 sets bit 1. CPU 1 sets level to 1.
> > > > > > > If CPU 0 sees a stale value now it will return 0 here
> > > > > > > and interrupt will get cleared.
> > > > > > > 
> > > > > > This will hardly happen on x86 especially since bit is set with
> > > > > > serialized instruction.
> > > > > 
> > > > > Probably. But it does make me a bit uneasy.  Why don't we pass
> > > > > irq_source_id to kvm_pic_set_irq/kvm_ioapic_set_irq, and move
> > > > > kvm_irq_line_state to under pic_lock/ioapic_lock?  We can then use
> > > > > __set_bit/__clear_bit in kvm_irq_line_state, making the ordering simpler
> > > > > and saving an atomic op in the process.
> > > > > 
> > > > With my patch I do not see why we can't change them to unlocked variant
> > > > without moving them anywhere. The only requirement is to not use RMW
> > > > sequence to set/clear bits. The ordering of setting does not matter. The
> > > > ordering of reading is.
> > > 
> > > You want to use __set_bit/__clear_bit on the same word
> > > from multiple CPUs, without locking?
> > > Why won't this lose information?
> > Because it is not RMW. If it is then yes, you can't do that.
> 
> You are saying __set_bit does not do RMW on x86? Interesting.
I think it doesn't.

> It's probably not a good idea to rely on this I think.
> 
The code is not in arch/x86, so probably not. Although it is used only on
x86 (and ia64, which has broken kvm anyway).

> > > 
> > > In any case, it seems simpler and safer to do accesses under lock
> > > than rely on specific use.
> > > 
> > > > --
> > > > 			Gleb.
> > 
> > --
> > 			Gleb.

--
			Gleb.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts
  2012-07-18 10:53           ` Michael S. Tsirkin
@ 2012-07-18 10:55             ` Gleb Natapov
  2012-07-18 11:22               ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Gleb Natapov @ 2012-07-18 10:55 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Alex Williamson, avi, kvm, linux-kernel, jan.kiszka

On Wed, Jul 18, 2012 at 01:53:11PM +0300, Michael S. Tsirkin wrote:
> On Wed, Jul 18, 2012 at 01:49:06PM +0300, Gleb Natapov wrote:
> > On Wed, Jul 18, 2012 at 01:48:44PM +0300, Michael S. Tsirkin wrote:
> > > On Wed, Jul 18, 2012 at 01:44:29PM +0300, Gleb Natapov wrote:
> > > > On Wed, Jul 18, 2012 at 01:41:14PM +0300, Michael S. Tsirkin wrote:
> > > > > On Mon, Jul 16, 2012 at 02:33:47PM -0600, Alex Williamson wrote:
> > > > > > In order to inject a level interrupt from an external source using an
> > > > > > irqfd, we need to allocate a new irq_source_id.  This allows us to
> > > > > > assert and (later) de-assert an interrupt line independently from
> > > > > > users of KVM_IRQ_LINE and avoid lost interrupts.
> > > > > > 
> > > > > > We also add what may appear like a bit of excessive infrastructure
> > > > > > around an object for storing this irq_source_id.  However, notice
> > > > > > that we only provide a way to assert the interrupt here.  A follow-on
> > > > > > interface will make use of the same irq_source_id to allow de-assert.
> > > > > > 
> > > > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > > > > ---
> > > > > > 
> > > > > >  Documentation/virtual/kvm/api.txt |    6 ++
> > > > > >  arch/x86/kvm/x86.c                |    1 
> > > > > >  include/linux/kvm.h               |    3 +
> > > > > >  virt/kvm/eventfd.c                |  114 ++++++++++++++++++++++++++++++++++++-
> > > > > >  4 files changed, 120 insertions(+), 4 deletions(-)
> > > > > > 
> > > > > > diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> > > > > > index 100acde..c7267d5 100644
> > > > > > --- a/Documentation/virtual/kvm/api.txt
> > > > > > +++ b/Documentation/virtual/kvm/api.txt
> > > > > > @@ -1981,6 +1981,12 @@ the guest using the specified gsi pin.  The irqfd is removed using
> > > > > >  the KVM_IRQFD_FLAG_DEASSIGN flag, specifying both kvm_irqfd.fd
> > > > > >  and kvm_irqfd.gsi.
> > > > > >  
> > > > > > +The KVM_IRQFD_FLAG_LEVEL flag indicates the gsi input is for a level
> > > > > > +triggered interrupt.  In this case a new irqchip input is allocated
> > > > > > +which is logically OR'd with other inputs allowing multiple sources
> > > > > > +to independently assert level interrupts.  The KVM_IRQFD_FLAG_LEVEL
> > > > > > +is only necessary on setup, teardown is identical to that above.
> > > > > > +KVM_IRQFD_FLAG_LEVEL support is indicated by KVM_CAP_IRQFD_LEVEL.
> > > > > >  
> > > > > >  5. The kvm_run structure
> > > > > >  ------------------------
> > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > > index a01a424..80bed07 100644
> > > > > > --- a/arch/x86/kvm/x86.c
> > > > > > +++ b/arch/x86/kvm/x86.c
> > > > > > @@ -2148,6 +2148,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> > > > > >  	case KVM_CAP_GET_TSC_KHZ:
> > > > > >  	case KVM_CAP_PCI_2_3:
> > > > > >  	case KVM_CAP_KVMCLOCK_CTRL:
> > > > > > +	case KVM_CAP_IRQFD_LEVEL:
> > > > > >  		r = 1;
> > > > > >  		break;
> > > > > >  	case KVM_CAP_COALESCED_MMIO:
> > > > > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > > > > > index 2ce09aa..b2e6e4f 100644
> > > > > > --- a/include/linux/kvm.h
> > > > > > +++ b/include/linux/kvm.h
> > > > > > @@ -618,6 +618,7 @@ struct kvm_ppc_smmu_info {
> > > > > >  #define KVM_CAP_PPC_GET_SMMU_INFO 78
> > > > > >  #define KVM_CAP_S390_COW 79
> > > > > >  #define KVM_CAP_PPC_ALLOC_HTAB 80
> > > > > > +#define KVM_CAP_IRQFD_LEVEL 81
> > > > > >  
> > > > > >  #ifdef KVM_CAP_IRQ_ROUTING
> > > > > >  
> > > > > > @@ -683,6 +684,8 @@ struct kvm_xen_hvm_config {
> > > > > >  #endif
> > > > > >  
> > > > > >  #define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
> > > > > > +/* Available with KVM_CAP_IRQFD_LEVEL */
> > > > > > +#define KVM_IRQFD_FLAG_LEVEL (1 << 1)
> > > > > >  
> > > > > >  struct kvm_irqfd {
> > > > > >  	__u32 fd;
> > > > > > diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> > > > > > index 7d7e2aa..ecdbfea 100644
> > > > > > --- a/virt/kvm/eventfd.c
> > > > > > +++ b/virt/kvm/eventfd.c
> > > > > > @@ -36,6 +36,68 @@
> > > > > >  #include "iodev.h"
> > > > > >  
> > > > > >  /*
> > > > > > + * An irq_source_id can be created from KVM_IRQFD for level interrupt
> > > > > > + * injections and shared with other interfaces for EOI or de-assert.
> > > > > > + * Create an object with reference counting to make it easy to use.
> > > > > > + */
> > > > > > +struct _irq_source {
> > > > > > +	int id; /* the IRQ source ID */
> > > > > > +	bool level_asserted; /* Track assertion state and protect with lock */
> > > > > > +	spinlock_t lock;     /* to avoid unnecessary re-assert/spurious eoi. */
> > > > > > +	struct kvm *kvm;
> > > > > > +	struct kref kref;
> > > > > > +};
> > > > > > +
> > > > > > +static void _irq_source_release(struct kref *kref)
> > > > > > +{
> > > > > > +	struct _irq_source *source;
> > > > > > +
> > > > > > +	source = container_of(kref, struct _irq_source, kref);
> > > > > > +
> > > > > > +	/* This also de-asserts */
> > > > > > +	kvm_free_irq_source_id(source->kvm, source->id);
> > > > > > +	kfree(source);
> > > > > > +}
> > > > > > +
> > > > > > +static void _irq_source_put(struct _irq_source *source)
> > > > > > +{
> > > > > > +	if (source)
> > > > > > +		kref_put(&source->kref, _irq_source_release);
> > > > > > +}
> > > > > > +
> > > > > > +static struct _irq_source *__attribute__ ((used)) /* white lie for now */
> > > > > > +_irq_source_get(struct _irq_source *source)
> > > > > > +{
> > > > > > +	if (source)
> > > > > > +		kref_get(&source->kref);
> > > > > > +
> > > > > > +	return source;
> > > > > > +}
> > > > > > +
> > > > > > +static struct _irq_source *_irq_source_alloc(struct kvm *kvm)
> > > > > > +{
> > > > > > +	struct _irq_source *source;
> > > > > > +	int id;
> > > > > > +
> > > > > > +	source = kzalloc(sizeof(*source), GFP_KERNEL);
> > > > > > +	if (!source)
> > > > > > +		return ERR_PTR(-ENOMEM);
> > > > > > +
> > > > > > +	id = kvm_request_irq_source_id(kvm);
> > > > > > +	if (id < 0) {
> > > > > > +		kfree(source);
> > > > > > +		return ERR_PTR(id);
> > > > > > +	}
> > > > > > +
> > > > > > +	kref_init(&source->kref);
> > > > > > +	spin_lock_init(&source->lock);
> > > > > > +	source->kvm = kvm;
> > > > > > +	source->id = id;
> > > > > > +
> > > > > > +	return source;
> > > > > > +}
> > > > > > +
> > > > > > +/*
> > > > > >   * --------------------------------------------------------------------
> > > > > >   * irqfd: Allows an fd to be used to inject an interrupt to the guest
> > > > > >   *
> > > > > > @@ -52,6 +114,8 @@ struct _irqfd {
> > > > > >  	/* Used for level IRQ fast-path */
> > > > > >  	int gsi;
> > > > > >  	struct work_struct inject;
> > > > > > +	/* IRQ source ID for level triggered irqfds */
> > > > > > +	struct _irq_source *source;
> > > > > >  	/* Used for setup/shutdown */
> > > > > >  	struct eventfd_ctx *eventfd;
> > > > > >  	struct list_head list;
> > > > > > @@ -62,7 +126,7 @@ struct _irqfd {
> > > > > >  static struct workqueue_struct *irqfd_cleanup_wq;
> > > > > >  
> > > > > >  static void
> > > > > > -irqfd_inject(struct work_struct *work)
> > > > > > +irqfd_inject_edge(struct work_struct *work)
> > > > > >  {
> > > > > >  	struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
> > > > > >  	struct kvm *kvm = irqfd->kvm;
> > > > > > @@ -71,6 +135,29 @@ irqfd_inject(struct work_struct *work)
> > > > > >  	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
> > > > > >  }
> > > > > >  
> > > > > > +static void
> > > > > > +irqfd_inject_level(struct work_struct *work)
> > > > > > +{
> > > > > > +	struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * Inject an interrupt only if not already asserted.
> > > > > > +	 *
> > > > > > +	 * We can safely ignore the kvm_set_irq return value here.  If
> > > > > > +	 * masked, the irr bit is still set and will eventually be serviced.
> > > > > > +	 * This interface does not guarantee immediate injection.  If
> > > > > > +	 * coalesced, an eoi will be coming where we can de-assert and
> > > > > > +	 * re-inject if necessary.  NB, if you need to know if an interrupt
> > > > > > +	 * was coalesced, this interface is not for you.
> > > > > > +	 */
> > > > > > +	spin_lock(&irqfd->source->lock);
> > > > > > +	if (!irqfd->source->level_asserted) {
> > > > > > +		kvm_set_irq(irqfd->kvm, irqfd->source->id, irqfd->gsi, 1);
> > > > > > +		irqfd->source->level_asserted = true;
> > > > > > +	}
> > > > > > +	spin_unlock(&irqfd->source->lock);
> > > > > > +}
> > > > > > +
> > > > > 
> > > > > So as was discussed kvm_set_irq under spinlock is bad for scalability
> > > > > with multiple VCPUs.  Why do we need a spinlock simply to protect
> > > > > level_asserted?  Let's use an atomic test and set/test and clear and the
> > > > > problem goes away.
> > > > > 
> > > > That sad reality is that for level interrupt we already scan all vcpus
> > > > under spinlock.
> > > 
> > > Where?
> > > 
> > ioapic
> 
> $ grep kvm_for_each_vcpu virt/kvm/ioapic.c
> $
> 
> ?
> 

Come on, Michael. You can do better than grep and actually look at what
the code does. The code that loops over all vcpus while delivering an irq is
in kvm_irq_delivery_to_apic(). Now grep for that.

--
			Gleb.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-18 10:53                                     ` Gleb Natapov
@ 2012-07-18 11:08                                       ` Michael S. Tsirkin
  2012-07-18 11:50                                         ` Gleb Natapov
  0 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-18 11:08 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Alex Williamson, avi, kvm, linux-kernel, jan.kiszka

On Wed, Jul 18, 2012 at 01:53:15PM +0300, Gleb Natapov wrote:
> On Wed, Jul 18, 2012 at 01:51:05PM +0300, Michael S. Tsirkin wrote:
> > On Wed, Jul 18, 2012 at 01:36:08PM +0300, Gleb Natapov wrote:
> > > On Wed, Jul 18, 2012 at 01:33:35PM +0300, Michael S. Tsirkin wrote:
> > > > On Wed, Jul 18, 2012 at 01:27:39PM +0300, Gleb Natapov wrote:
> > > > > On Wed, Jul 18, 2012 at 01:20:29PM +0300, Michael S. Tsirkin wrote:
> > > > > > On Wed, Jul 18, 2012 at 09:27:42AM +0300, Gleb Natapov wrote:
> > > > > > > On Tue, Jul 17, 2012 at 07:14:52PM +0300, Michael S. Tsirkin wrote:
> > > > > > > > > _Seems_ racy, or _is_ racy?  Please identify the race.
> > > > > > > > 
> > > > > > > > Look at this:
> > > > > > > > 
> > > > > > > > static inline int kvm_irq_line_state(unsigned long *irq_state,
> > > > > > > >                                      int irq_source_id, int level)
> > > > > > > > {
> > > > > > > >         /* Logical OR for level trig interrupt */
> > > > > > > >         if (level)
> > > > > > > >                 set_bit(irq_source_id, irq_state);
> > > > > > > >         else
> > > > > > > >                 clear_bit(irq_source_id, irq_state);
> > > > > > > > 
> > > > > > > >         return !!(*irq_state);
> > > > > > > > }
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Now:
> > > > > > > > If other CPU changes some other bit after the atomic change,
> > > > > > > > it looks like !!(*irq_state) might return a stale value.
> > > > > > > > 
> > > > > > > > CPU 0 clears bit 0. CPU 1 sets bit 1. CPU 1 sets level to 1.
> > > > > > > > If CPU 0 sees a stale value now it will return 0 here
> > > > > > > > and interrupt will get cleared.
> > > > > > > > 
> > > > > > > This will hardly happen on x86 especially since bit is set with
> > > > > > > serialized instruction.
> > > > > > 
> > > > > > Probably. But it does make me a bit uneasy.  Why don't we pass
> > > > > > irq_source_id to kvm_pic_set_irq/kvm_ioapic_set_irq, and move
> > > > > > kvm_irq_line_state to under pic_lock/ioapic_lock?  We can then use
> > > > > > __set_bit/__clear_bit in kvm_irq_line_state, making the ordering simpler
> > > > > > and saving an atomic op in the process.
> > > > > > 
> > > > > With my patch I do not see why we can't change them to unlocked variant
> > > > > without moving them anywhere. The only requirement is to not use RMW
> > > > > sequence to set/clear bits. The ordering of setting does not matter. The
> > > > > ordering of reading is.
> > > > 
> > > > You want to use __set_bit/__clear_bit on the same word
> > > > from multiple CPUs, without locking?
> > > > Why won't this lose information?
> > > Because it is not RMW. If it is then yes, you can't do that.
> > 
> > You are saying __set_bit does not do RMW on x86? Interesting.
> I think it doesn't.

Anywhere I can read about this?

> > It's probably not a good idea to rely on this I think.
> > 
> The code is not in arch/x86, so probably not. Although it is used only on
> x86 (and ia64, which has broken kvm anyway).

Yes but exactly the reverse is documented.

/**
 * __set_bit - Set a bit in memory
 * @nr: the bit to set
 * @addr: the address to start counting from
 *
 * Unlike set_bit(), this function is non-atomic and may be reordered.


>>>> please note the lines below

 * If it's called on the same region of memory simultaneously, the effect
 * may be that only one operation succeeds.
>>>> until here

 */
static inline void __set_bit(int nr, volatile unsigned long *addr)
{
        asm volatile("bts %1,%0" : ADDR : "Ir" (nr) : "memory");
}
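
To make the documented caveat concrete, here is a minimal userspace sketch
(not kernel code, and not a model of x86 specifically).  It replays in a
single thread the interleaving in which two CPUs do a plain load/modify/store
on the same word, then contrasts it with an atomic read-modify-write.  The
names racy_rmw_interleaved() and atomic_rmw() are made up for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

#define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

/*
 * Replay the bad interleaving of two CPUs doing a plain (non-atomic)
 * load/OR/store on the same word: both load before either stores, so
 * the second store overwrites the bit set by the first CPU.
 */
unsigned long racy_rmw_interleaved(unsigned long word,
                                   unsigned int bit_a, unsigned int bit_b)
{
        unsigned long seen_a, seen_b;

        assert(bit_a < WORD_BITS && bit_b < WORD_BITS);

        seen_a = word;                  /* CPU A: load */
        seen_b = word;                  /* CPU B: load the same old value */
        word = seen_a | (1UL << bit_a); /* CPU A: modify and store */
        word = seen_b | (1UL << bit_b); /* CPU B: modify and store, drops bit_a */

        return word;
}

/*
 * Atomic read-modify-write: each update is applied to the current value
 * of the word, so neither bit can be lost whatever the interleaving.
 */
unsigned long atomic_rmw(unsigned long initial,
                         unsigned int bit_a, unsigned int bit_b)
{
        atomic_ulong word;

        assert(bit_a < WORD_BITS && bit_b < WORD_BITS);

        atomic_init(&word, initial);
        atomic_fetch_or(&word, 1UL << bit_a);
        atomic_fetch_or(&word, 1UL << bit_b);

        return atomic_load(&word);
}
```

Starting from 0 with bits 0 and 1, the replayed interleaving ends with only
bit 1 set, while the atomic variant ends with both bits set.  That is the
"only one operation succeeds" outcome the comment warns about.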




> > > > 
> > > > In any case, it seems simpler and safer to do accesses under lock
> > > > than rely on specific use.
> > > > 
> > > > > --
> > > > > 			Gleb.
> > > 
> > > --
> > > 			Gleb.
> 
> --
> 			Gleb.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts
  2012-07-18 10:55             ` Gleb Natapov
@ 2012-07-18 11:22               ` Michael S. Tsirkin
  2012-07-18 11:39                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-18 11:22 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Alex Williamson, avi, kvm, linux-kernel, jan.kiszka

On Wed, Jul 18, 2012 at 01:55:30PM +0300, Gleb Natapov wrote:
> On Wed, Jul 18, 2012 at 01:53:11PM +0300, Michael S. Tsirkin wrote:
> > On Wed, Jul 18, 2012 at 01:49:06PM +0300, Gleb Natapov wrote:
> > > On Wed, Jul 18, 2012 at 01:48:44PM +0300, Michael S. Tsirkin wrote:
> > > > On Wed, Jul 18, 2012 at 01:44:29PM +0300, Gleb Natapov wrote:
> > > > > On Wed, Jul 18, 2012 at 01:41:14PM +0300, Michael S. Tsirkin wrote:
> > > > > > On Mon, Jul 16, 2012 at 02:33:47PM -0600, Alex Williamson wrote:
> > > > > > > In order to inject a level interrupt from an external source using an
> > > > > > > irqfd, we need to allocate a new irq_source_id.  This allows us to
> > > > > > > assert and (later) de-assert an interrupt line independently from
> > > > > > > users of KVM_IRQ_LINE and avoid lost interrupts.
> > > > > > > 
> > > > > > > We also add what may appear like a bit of excessive infrastructure
> > > > > > > around an object for storing this irq_source_id.  However, notice
> > > > > > > that we only provide a way to assert the interrupt here.  A follow-on
> > > > > > > interface will make use of the same irq_source_id to allow de-assert.
> > > > > > > 
> > > > > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > > > > > ---
> > > > > > > 
> > > > > > >  Documentation/virtual/kvm/api.txt |    6 ++
> > > > > > >  arch/x86/kvm/x86.c                |    1 
> > > > > > >  include/linux/kvm.h               |    3 +
> > > > > > >  virt/kvm/eventfd.c                |  114 ++++++++++++++++++++++++++++++++++++-
> > > > > > >  4 files changed, 120 insertions(+), 4 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> > > > > > > index 100acde..c7267d5 100644
> > > > > > > --- a/Documentation/virtual/kvm/api.txt
> > > > > > > +++ b/Documentation/virtual/kvm/api.txt
> > > > > > > @@ -1981,6 +1981,12 @@ the guest using the specified gsi pin.  The irqfd is removed using
> > > > > > >  the KVM_IRQFD_FLAG_DEASSIGN flag, specifying both kvm_irqfd.fd
> > > > > > >  and kvm_irqfd.gsi.
> > > > > > >  
> > > > > > > +The KVM_IRQFD_FLAG_LEVEL flag indicates the gsi input is for a level
> > > > > > > +triggered interrupt.  In this case a new irqchip input is allocated
> > > > > > > +which is logically OR'd with other inputs allowing multiple sources
> > > > > > > +to independently assert level interrupts.  The KVM_IRQFD_FLAG_LEVEL
> > > > > > > +is only necessary on setup, teardown is identical to that above.
> > > > > > > +KVM_IRQFD_FLAG_LEVEL support is indicated by KVM_CAP_IRQFD_LEVEL.
> > > > > > >  
> > > > > > >  5. The kvm_run structure
> > > > > > >  ------------------------
> > > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > > > index a01a424..80bed07 100644
> > > > > > > --- a/arch/x86/kvm/x86.c
> > > > > > > +++ b/arch/x86/kvm/x86.c
> > > > > > > @@ -2148,6 +2148,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> > > > > > >  	case KVM_CAP_GET_TSC_KHZ:
> > > > > > >  	case KVM_CAP_PCI_2_3:
> > > > > > >  	case KVM_CAP_KVMCLOCK_CTRL:
> > > > > > > +	case KVM_CAP_IRQFD_LEVEL:
> > > > > > >  		r = 1;
> > > > > > >  		break;
> > > > > > >  	case KVM_CAP_COALESCED_MMIO:
> > > > > > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > > > > > > index 2ce09aa..b2e6e4f 100644
> > > > > > > --- a/include/linux/kvm.h
> > > > > > > +++ b/include/linux/kvm.h
> > > > > > > @@ -618,6 +618,7 @@ struct kvm_ppc_smmu_info {
> > > > > > >  #define KVM_CAP_PPC_GET_SMMU_INFO 78
> > > > > > >  #define KVM_CAP_S390_COW 79
> > > > > > >  #define KVM_CAP_PPC_ALLOC_HTAB 80
> > > > > > > +#define KVM_CAP_IRQFD_LEVEL 81
> > > > > > >  
> > > > > > >  #ifdef KVM_CAP_IRQ_ROUTING
> > > > > > >  
> > > > > > > @@ -683,6 +684,8 @@ struct kvm_xen_hvm_config {
> > > > > > >  #endif
> > > > > > >  
> > > > > > >  #define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
> > > > > > > +/* Available with KVM_CAP_IRQFD_LEVEL */
> > > > > > > +#define KVM_IRQFD_FLAG_LEVEL (1 << 1)
> > > > > > >  
> > > > > > >  struct kvm_irqfd {
> > > > > > >  	__u32 fd;
> > > > > > > diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> > > > > > > index 7d7e2aa..ecdbfea 100644
> > > > > > > --- a/virt/kvm/eventfd.c
> > > > > > > +++ b/virt/kvm/eventfd.c
> > > > > > > @@ -36,6 +36,68 @@
> > > > > > >  #include "iodev.h"
> > > > > > >  
> > > > > > >  /*
> > > > > > > + * An irq_source_id can be created from KVM_IRQFD for level interrupt
> > > > > > > + * injections and shared with other interfaces for EOI or de-assert.
> > > > > > > + * Create an object with reference counting to make it easy to use.
> > > > > > > + */
> > > > > > > +struct _irq_source {
> > > > > > > +	int id; /* the IRQ source ID */
> > > > > > > +	bool level_asserted; /* Track assertion state and protect with lock */
> > > > > > > +	spinlock_t lock;     /* to avoid unnecessary re-assert/spurious eoi. */
> > > > > > > +	struct kvm *kvm;
> > > > > > > +	struct kref kref;
> > > > > > > +};
> > > > > > > +
> > > > > > > +static void _irq_source_release(struct kref *kref)
> > > > > > > +{
> > > > > > > +	struct _irq_source *source;
> > > > > > > +
> > > > > > > +	source = container_of(kref, struct _irq_source, kref);
> > > > > > > +
> > > > > > > +	/* This also de-asserts */
> > > > > > > +	kvm_free_irq_source_id(source->kvm, source->id);
> > > > > > > +	kfree(source);
> > > > > > > +}
> > > > > > > +
> > > > > > > +static void _irq_source_put(struct _irq_source *source)
> > > > > > > +{
> > > > > > > +	if (source)
> > > > > > > +		kref_put(&source->kref, _irq_source_release);
> > > > > > > +}
> > > > > > > +
> > > > > > > +static struct _irq_source *__attribute__ ((used)) /* white lie for now */
> > > > > > > +_irq_source_get(struct _irq_source *source)
> > > > > > > +{
> > > > > > > +	if (source)
> > > > > > > +		kref_get(&source->kref);
> > > > > > > +
> > > > > > > +	return source;
> > > > > > > +}
> > > > > > > +
> > > > > > > +static struct _irq_source *_irq_source_alloc(struct kvm *kvm)
> > > > > > > +{
> > > > > > > +	struct _irq_source *source;
> > > > > > > +	int id;
> > > > > > > +
> > > > > > > +	source = kzalloc(sizeof(*source), GFP_KERNEL);
> > > > > > > +	if (!source)
> > > > > > > +		return ERR_PTR(-ENOMEM);
> > > > > > > +
> > > > > > > +	id = kvm_request_irq_source_id(kvm);
> > > > > > > +	if (id < 0) {
> > > > > > > +		kfree(source);
> > > > > > > +		return ERR_PTR(id);
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	kref_init(&source->kref);
> > > > > > > +	spin_lock_init(&source->lock);
> > > > > > > +	source->kvm = kvm;
> > > > > > > +	source->id = id;
> > > > > > > +
> > > > > > > +	return source;
> > > > > > > +}
> > > > > > > +
> > > > > > > +/*
> > > > > > >   * --------------------------------------------------------------------
> > > > > > >   * irqfd: Allows an fd to be used to inject an interrupt to the guest
> > > > > > >   *
> > > > > > > @@ -52,6 +114,8 @@ struct _irqfd {
> > > > > > >  	/* Used for level IRQ fast-path */
> > > > > > >  	int gsi;
> > > > > > >  	struct work_struct inject;
> > > > > > > +	/* IRQ source ID for level triggered irqfds */
> > > > > > > +	struct _irq_source *source;
> > > > > > >  	/* Used for setup/shutdown */
> > > > > > >  	struct eventfd_ctx *eventfd;
> > > > > > >  	struct list_head list;
> > > > > > > @@ -62,7 +126,7 @@ struct _irqfd {
> > > > > > >  static struct workqueue_struct *irqfd_cleanup_wq;
> > > > > > >  
> > > > > > >  static void
> > > > > > > -irqfd_inject(struct work_struct *work)
> > > > > > > +irqfd_inject_edge(struct work_struct *work)
> > > > > > >  {
> > > > > > >  	struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
> > > > > > >  	struct kvm *kvm = irqfd->kvm;
> > > > > > > @@ -71,6 +135,29 @@ irqfd_inject(struct work_struct *work)
> > > > > > >  	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
> > > > > > >  }
> > > > > > >  
> > > > > > > +static void
> > > > > > > +irqfd_inject_level(struct work_struct *work)
> > > > > > > +{
> > > > > > > +	struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
> > > > > > > +
> > > > > > > +	/*
> > > > > > > +	 * Inject an interrupt only if not already asserted.
> > > > > > > +	 *
> > > > > > > +	 * We can safely ignore the kvm_set_irq return value here.  If
> > > > > > > +	 * masked, the irr bit is still set and will eventually be serviced.
> > > > > > > +	 * This interface does not guarantee immediate injection.  If
> > > > > > > +	 * coalesced, an eoi will be coming where we can de-assert and
> > > > > > > +	 * re-inject if necessary.  NB, if you need to know if an interrupt
> > > > > > > +	 * was coalesced, this interface is not for you.
> > > > > > > +	 */
> > > > > > > +	spin_lock(&irqfd->source->lock);
> > > > > > > +	if (!irqfd->source->level_asserted) {
> > > > > > > +		kvm_set_irq(irqfd->kvm, irqfd->source->id, irqfd->gsi, 1);
> > > > > > > +		irqfd->source->level_asserted = true;
> > > > > > > +	}
> > > > > > > +	spin_unlock(&irqfd->source->lock);
> > > > > > > +}
> > > > > > > +
> > > > > > 
> > > > > > So as was discussed kvm_set_irq under spinlock is bad for scalability
> > > > > > with multiple VCPUs.  Why do we need a spinlock simply to protect
> > > > > > level_asserted?  Let's use an atomic test and set/test and clear and the
> > > > > > problem goes away.
> > > > > > 
> > > > > That sad reality is that for level interrupt we already scan all vcpus
> > > > > under spinlock.
> > > > 
> > > > Where?
> > > > 
> > > ioapic
> > 
> > $ grep kvm_for_each_vcpu virt/kvm/ioapic.c
> > $
> > 
> > ?
> > 
> 
> Come on, Michael. You can do better than grep and actually look at what
> the code does. The code that loops over all vcpus while delivering an irq is
> in kvm_irq_delivery_to_apic(). Now grep for that.

Hmm, I see, it's actually done for edge if injected from ioapic too,
right?

So set_irq does a linear scan, and for each matching CPU it calls
kvm_irq_delivery_to_apic which is another scan?
So it's actually N^2 worst case for a broadcast?

> --
> 			Gleb.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts
  2012-07-18 11:22               ` Michael S. Tsirkin
@ 2012-07-18 11:39                 ` Michael S. Tsirkin
  2012-07-18 11:48                   ` Gleb Natapov
  0 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-18 11:39 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Alex Williamson, avi, kvm, linux-kernel, jan.kiszka

On Wed, Jul 18, 2012 at 02:22:19PM +0300, Michael S. Tsirkin wrote:
> > > > > > > So as was discussed kvm_set_irq under spinlock is bad for scalability
> > > > > > > with multiple VCPUs.  Why do we need a spinlock simply to protect
> > > > > > > level_asserted?  Let's use an atomic test and set/test and clear and the
> > > > > > > problem goes away.
> > > > > > > 
> > > > > > That sad reality is that for level interrupt we already scan all vcpus
> > > > > > under spinlock.
> > > > > 
> > > > > Where?
> > > > > 
> > > > ioapic
> > > 
> > > $ grep kvm_for_each_vcpu virt/kvm/ioapic.c
> > > $
> > > 
> > > ?
> > > 
> > 
> > Come on Michael. You can do better than grep and actually look at what
> > code does. The code that loops over all vcpus while delivering an irq is
> > in kvm_irq_delivery_to_apic(). Now grep for that.
> 
> Hmm, I see, it's actually done for edge if injected from ioapic too,
> right?
> 
> So set_irq does a linear scan, and for each matching CPU it calls
> kvm_irq_delivery_to_apic which is another scan?
> So it's actually N^2 worst case for a broadcast?

No it isn't, I misread the code.


Anyway, maybe not trivially but this looks fixable to me: we could drop
the ioapic lock before calling kvm_irq_delivery_to_apic.

> > --
> > 			Gleb.


* Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts
  2012-07-18 11:39                 ` Michael S. Tsirkin
@ 2012-07-18 11:48                   ` Gleb Natapov
  2012-07-18 12:07                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Gleb Natapov @ 2012-07-18 11:48 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Alex Williamson, avi, kvm, linux-kernel, jan.kiszka

On Wed, Jul 18, 2012 at 02:39:10PM +0300, Michael S. Tsirkin wrote:
> On Wed, Jul 18, 2012 at 02:22:19PM +0300, Michael S. Tsirkin wrote:
> > > > > > > > So as was discussed kvm_set_irq under spinlock is bad for scalability
> > > > > > > > with multiple VCPUs.  Why do we need a spinlock simply to protect
> > > > > > > > level_asserted?  Let's use an atomic test and set/test and clear and the
> > > > > > > > problem goes away.
> > > > > > > > 
> > > > > > > That sad reality is that for level interrupt we already scan all vcpus
> > > > > > > under spinlock.
> > > > > > 
> > > > > > Where?
> > > > > > 
> > > > > ioapic
> > > > 
> > > > $ grep kvm_for_each_vcpu virt/kvm/ioapic.c
> > > > $
> > > > 
> > > > ?
> > > > 
> > > 
> > > Come on Michael. You can do better than grep and actually look at what
> > > code does. The code that loops over all vcpus while delivering an irq is
> > > in kvm_irq_delivery_to_apic(). Now grep for that.
> > 
> > Hmm, I see, it's actually done for edge if injected from ioapic too,
> > right?
> > 
> > So set_irq does a linear scan, and for each matching CPU it calls
> > kvm_irq_delivery_to_apic which is another scan?
> > So it's actually N^2 worst case for a broadcast?
> 
> No it isn't, I misread the code.
> 
> 
> Anyway, maybe not trivially but this looks fixable to me: we could drop
> the ioapic lock before calling kvm_irq_delivery_to_apic.
> 
Maybe, maybe not. Just saying "let's drop the lock whenever we don't feel
like holding one" does not cut it. Back to the original point though: the
current situation is that calling kvm_set_irq() under a spinlock is no worse
for scalability than calling it without one.

--
			Gleb.


* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-18 11:08                                       ` Michael S. Tsirkin
@ 2012-07-18 11:50                                         ` Gleb Natapov
  0 siblings, 0 replies; 96+ messages in thread
From: Gleb Natapov @ 2012-07-18 11:50 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Alex Williamson, avi, kvm, linux-kernel, jan.kiszka

On Wed, Jul 18, 2012 at 02:08:43PM +0300, Michael S. Tsirkin wrote:
> On Wed, Jul 18, 2012 at 01:53:15PM +0300, Gleb Natapov wrote:
> > On Wed, Jul 18, 2012 at 01:51:05PM +0300, Michael S. Tsirkin wrote:
> > > On Wed, Jul 18, 2012 at 01:36:08PM +0300, Gleb Natapov wrote:
> > > > On Wed, Jul 18, 2012 at 01:33:35PM +0300, Michael S. Tsirkin wrote:
> > > > > On Wed, Jul 18, 2012 at 01:27:39PM +0300, Gleb Natapov wrote:
> > > > > > On Wed, Jul 18, 2012 at 01:20:29PM +0300, Michael S. Tsirkin wrote:
> > > > > > > On Wed, Jul 18, 2012 at 09:27:42AM +0300, Gleb Natapov wrote:
> > > > > > > > On Tue, Jul 17, 2012 at 07:14:52PM +0300, Michael S. Tsirkin wrote:
> > > > > > > > > > _Seems_ racy, or _is_ racy?  Please identify the race.
> > > > > > > > > 
> > > > > > > > > Look at this:
> > > > > > > > > 
> > > > > > > > > static inline int kvm_irq_line_state(unsigned long *irq_state,
> > > > > > > > >                                      int irq_source_id, int level)
> > > > > > > > > {
> > > > > > > > >         /* Logical OR for level trig interrupt */
> > > > > > > > >         if (level)
> > > > > > > > >                 set_bit(irq_source_id, irq_state);
> > > > > > > > >         else
> > > > > > > > >                 clear_bit(irq_source_id, irq_state);
> > > > > > > > > 
> > > > > > > > >         return !!(*irq_state);
> > > > > > > > > }
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Now:
> > > > > > > > > If other CPU changes some other bit after the atomic change,
> > > > > > > > > it looks like !!(*irq_state) might return a stale value.
> > > > > > > > > 
> > > > > > > > > CPU 0 clears bit 0. CPU 1 sets bit 1. CPU 1 sets level to 1.
> > > > > > > > > If CPU 0 sees a stale value now it will return 0 here
> > > > > > > > > and interrupt will get cleared.
> > > > > > > > > 
> > > > > > > > This will hardly happen on x86 especially since bit is set with
> > > > > > > > serialized instruction.
> > > > > > > 
> > > > > > > Probably. But it does make me a bit uneasy.  Why don't we pass
> > > > > > > irq_source_id to kvm_pic_set_irq/kvm_ioapic_set_irq, and move
> > > > > > > kvm_irq_line_state to under pic_lock/ioapic_lock?  We can then use
> > > > > > > __set_bit/__clear_bit in kvm_irq_line_state, making the ordering simpler
> > > > > > > and saving an atomic op in the process.
> > > > > > > 
> > > > > > With my patch I do not see why we can't change them to unlocked variant
> > > > > > without moving them anywhere. The only requirement is to not use RMW
> > > > > > sequence to set/clear bits. The ordering of setting does not matter. The
> > > > > > ordering of reading is.
> > > > > 
> > > > > You want to use __set_bit/__clear_bit on the same word
> > > > > from multiple CPUs, without locking?
> > > > > Why won't this lose information?
> > > > Because it is not RMW. If it is then yes, you can't do that.
> > > 
> > > You are saying __set_bit does not do RMW on x86? Interesting.
> > I think it doesn't.
> 
> Anywhere I can read about this?
> 
Well actually SDM says LOCK prefix is needed, so yes we cannot use
__set_bit/__clear_bit without moving it under lock.
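
To illustrate why the unlocked variants can lose information when used on
the same word from several CPUs, here is a minimal userspace sketch in C11.
It is not kernel code: struct rmw, rmw_load(), rmw_store_set() and the two
*_set_two_bits() helpers are hypothetical names, and the interleaving is
written out by hand to model an unlocked read-modify-write.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Hypothetical model: a non-atomic bit set is a load, an OR and a
 * store.  Splitting it in two lets us interleave two "CPUs" by hand.
 */
struct rmw {
        unsigned long loaded;   /* value seen by the load step */
};

static void rmw_load(struct rmw *op, const unsigned long *word)
{
        op->loaded = *word;
}

static void rmw_store_set(const struct rmw *op, unsigned long *word, int nr)
{
        assert(nr >= 0 && nr < (int)(8 * sizeof(unsigned long)));
        *word = op->loaded | (1UL << nr);
}

/* Both CPUs load before either stores, so one update is overwritten. */
static unsigned long racy_set_two_bits(void)
{
        unsigned long word = 0;
        struct rmw cpu0, cpu1;

        rmw_load(&cpu0, &word);
        rmw_load(&cpu1, &word);
        rmw_store_set(&cpu0, &word, 0);  /* word is now 0x1 */
        rmw_store_set(&cpu1, &word, 1);  /* word is now 0x2, bit 0 lost */
        return word;
}

/* An atomic OR is indivisible, so no interleaving can drop a bit. */
static unsigned long atomic_set_two_bits(void)
{
        atomic_ulong word;

        atomic_init(&word, 0);
        atomic_fetch_or(&word, 1UL << 0);
        atomic_fetch_or(&word, 1UL << 1);
        return atomic_load(&word);
}
```

In this model racy_set_two_bits() ends with only bit 1 set, while
atomic_set_two_bits() keeps both bits, which matches the "only one
operation succeeds" warning in the __set_bit() comment quoted below.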

> > > It's probably not a good idea to rely on this I think.
> > > 
> > The code is no in arch/x86 so probably no. Although it is used only on
> > x86 (and ia64 which has broken kvm anyway).
> 
> Yes but exactly the reverse is documented.
> 
> /**
>  * __set_bit - Set a bit in memory
>  * @nr: the bit to set
>  * @addr: the address to start counting from
>  *
>  * Unlike set_bit(), this function is non-atomic and may be reordered.
> 
> 
> >>>> pls note the below
> 
>  * If it's called on the same region of memory simultaneously, the effect
>  * may be that only one operation succeeds.
> >>>> until here
> 
>  */
> static inline void __set_bit(int nr, volatile unsigned long *addr)
> {
>         asm volatile("bts %1,%0" : ADDR : "Ir" (nr) : "memory");
> }
> 
> 
> 
> 
> > > > > 
> > > > > In any case, it seems simpler and safer to do accesses under lock
> > > > > than rely on specific use.
> > > > > 
> > > > > > --
> > > > > > 			Gleb.
> > > > 
> > > > --
> > > > 			Gleb.
> > 
> > --
> > 			Gleb.

--
			Gleb.


* Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts
  2012-07-18 11:48                   ` Gleb Natapov
@ 2012-07-18 12:07                     ` Michael S. Tsirkin
  2012-07-18 14:47                       ` Alex Williamson
  0 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-18 12:07 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Alex Williamson, avi, kvm, linux-kernel, jan.kiszka

On Wed, Jul 18, 2012 at 02:48:44PM +0300, Gleb Natapov wrote:
> On Wed, Jul 18, 2012 at 02:39:10PM +0300, Michael S. Tsirkin wrote:
> > On Wed, Jul 18, 2012 at 02:22:19PM +0300, Michael S. Tsirkin wrote:
> > > > > > > > > So as was discussed kvm_set_irq under spinlock is bad for scalability
> > > > > > > > > with multiple VCPUs.  Why do we need a spinlock simply to protect
> > > > > > > > > level_asserted?  Let's use an atomic test and set/test and clear and the
> > > > > > > > > problem goes away.
> > > > > > > > > 
> > > > > > > > That sad reality is that for level interrupt we already scan all vcpus
> > > > > > > > under spinlock.
> > > > > > > 
> > > > > > > Where?
> > > > > > > 
> > > > > > ioapic
> > > > > 
> > > > > $ grep kvm_for_each_vcpu virt/kvm/ioapic.c
> > > > > $
> > > > > 
> > > > > ?
> > > > > 
> > > > 
> > > > Come on Michael. You can do better than grep and actually look at what
> > > > code does. The code that loops over all vcpus while delivering an irq is
> > > > in kvm_irq_delivery_to_apic(). Now grep for that.
> > > 
> > > Hmm, I see, it's actually done for edge if injected from ioapic too,
> > > right?
> > > 
> > > So set_irq does a linear scan, and for each matching CPU it calls
> > > kvm_irq_delivery_to_apic which is another scan?
> > > So it's actually N^2 worst case for a broadcast?
> > 
> > No it isn't, I misread the code.
> > 
> > 
> > Anyway, maybe not trivially but this looks fixable to me: we could drop
> > the ioapic lock before calling kvm_irq_delivery_to_apic.
> > 
> May be, may be not. Just saying "lets drop lock whenever we don't feel
> like holding one" does not cut it.

One thing we do is set remote_irr if interrupt was injected.
I agree these things are tricky.

One other question:

static int ioapic_service(struct kvm_ioapic *ioapic, unsigned int idx)
{
        union kvm_ioapic_redirect_entry *pent;
        int injected = -1;

        pent = &ioapic->redirtbl[idx];

        if (!pent->fields.mask) {
                injected = ioapic_deliver(ioapic, idx);
                if (injected && pent->fields.trig_mode == IOAPIC_LEVEL_TRIG)
                        pent->fields.remote_irr = 1;
        }

        return injected;
}


This if (injected) looks a bit strange, since ioapic_deliver returns
-1 if there are no matching destinations. Should it be if (injected > 0)?
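
The difference between the two checks can be seen in a small standalone
model. This is only a sketch: struct entry and service() are hypothetical
stand-ins for the redirection entry and ioapic_service(), and the result of
ioapic_deliver() is passed in as a plain argument.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the fields ioapic_service() looks at. */
struct entry {
        bool mask;
        bool level_trig;
        bool remote_irr;
};

/*
 * Model of ioapic_service().  deliver_result plays the role of the
 * ioapic_deliver() return value; strict selects the proposed
 * "injected > 0" test instead of the existing "injected" test.
 */
static int service(struct entry *e, int deliver_result, bool strict)
{
        int injected = -1;

        assert(e);
        if (!e->mask) {
                injected = deliver_result;
                if ((strict ? injected > 0 : injected != 0) && e->level_trig)
                        e->remote_irr = true;
        }
        return injected;
}
```

With deliver_result == -1 the existing test sets remote_irr, because -1 is
nonzero and therefore true in C, while the stricter test leaves it clear.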



> Back to original point though current
> situation is that calling kvm_set_irq() under spinlock is not worse for
> scalability than calling it not under one.

Yes. Still the specific use can just use an atomic flag,
lock+bool is not needed, and we won't need to undo it later.

> --
> 			Gleb.


* Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts
  2012-07-18 12:07                     ` Michael S. Tsirkin
@ 2012-07-18 14:47                       ` Alex Williamson
  2012-07-18 15:38                         ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Alex Williamson @ 2012-07-18 14:47 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Gleb Natapov, avi, kvm, linux-kernel, jan.kiszka

On Wed, 2012-07-18 at 15:07 +0300, Michael S. Tsirkin wrote:
> On Wed, Jul 18, 2012 at 02:48:44PM +0300, Gleb Natapov wrote:
> > On Wed, Jul 18, 2012 at 02:39:10PM +0300, Michael S. Tsirkin wrote:
> > > On Wed, Jul 18, 2012 at 02:22:19PM +0300, Michael S. Tsirkin wrote:
> > > > > > > > > > So as was discussed kvm_set_irq under spinlock is bad for scalability
> > > > > > > > > > with multiple VCPUs.  Why do we need a spinlock simply to protect
> > > > > > > > > > level_asserted?  Let's use an atomic test and set/test and clear and the
> > > > > > > > > > problem goes away.
> > > > > > > > > > 
> > > > > > > > > That sad reality is that for level interrupt we already scan all vcpus
> > > > > > > > > under spinlock.
> > > > > > > > 
> > > > > > > > Where?
> > > > > > > > 
> > > > > > > ioapic
> > > > > > 
> > > > > > $ grep kvm_for_each_vcpu virt/kvm/ioapic.c
> > > > > > $
> > > > > > 
> > > > > > ?
> > > > > > 
> > > > > 
> > > > > Come on Michael. You can do better than grep and actually look at what
> > > > > code does. The code that loops over all vcpus while delivering an irq is
> > > > > in kvm_irq_delivery_to_apic(). Now grep for that.
> > > > 
> > > > Hmm, I see, it's actually done for edge if injected from ioapic too,
> > > > right?
> > > > 
> > > > So set_irq does a linear scan, and for each matching CPU it calls
> > > > kvm_irq_delivery_to_apic which is another scan?
> > > > So it's actually N^2 worst case for a broadcast?
> > > 
> > > No it isn't, I misread the code.
> > > 
> > > 
> > > Anyway, maybe not trivially but this looks fixable to me: we could drop
> > > the ioapic lock before calling kvm_irq_delivery_to_apic.
> > > 
> > May be, may be not. Just saying "lets drop lock whenever we don't feel
> > like holding one" does not cut it.
> 
> One thing we do is set remote_irr if interrupt was injected.
> I agree these things are tricky.
> 
> One other question:
> 
> static int ioapic_service(struct kvm_ioapic *ioapic, unsigned int idx)
> {
>         union kvm_ioapic_redirect_entry *pent;
>         int injected = -1;
> 
>         pent = &ioapic->redirtbl[idx];
> 
>         if (!pent->fields.mask) {
>                 injected = ioapic_deliver(ioapic, idx);
>                 if (injected && pent->fields.trig_mode == IOAPIC_LEVEL_TRIG)
>                         pent->fields.remote_irr = 1;
>         }
> 
>         return injected;
> }
> 
> 
> This if (injected) looks a bit strange since ioapic_deliver returns
> -1 if no matching destinations. Should be if (injected > 0)?
> 
> 
> 
> > Back to original point though current
> > situation is that calling kvm_set_irq() under spinlock is not worse for
> > scalability than calling it not under one.
> 
> Yes. Still the specific use can just use an atomic flag,
> lock+bool is not needed, and we won't need to undo it later.


Actually, no, replacing it with an atomic is racy.

CPU0 (inject)                       CPU1 (EOI)
atomic_cmpxchg(&asserted, 0, 1)
                                    atomic_cmpxchg(&asserted, 1, 0)
                                    kvm_set_irq(0)
kvm_set_irq(1)
                                    eventfd_signal

The interrupt is now stuck on until another interrupt is injected.
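
The interleaving above can be replayed deterministically in userspace. This
is a sketch only, using C11 atomics in place of atomic_cmpxchg(); struct
model, inject_claim(), eoi_claim(), set_line() and replay_race() are
hypothetical names, and set_line() merely records the last level that would
have been passed to kvm_set_irq().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct model {
        atomic_int asserted;    /* the proposed atomic flag */
        int line;               /* last level given to kvm_set_irq() */
};

/* Inject side: flip the flag 0 -> 1; true means "go raise the line". */
static bool inject_claim(struct model *m)
{
        int expected = 0;

        return atomic_compare_exchange_strong(&m->asserted, &expected, 1);
}

/* EOI side: flip the flag 1 -> 0; true means "go lower the line". */
static bool eoi_claim(struct model *m)
{
        int expected = 1;

        return atomic_compare_exchange_strong(&m->asserted, &expected, 0);
}

static void set_line(struct model *m, int level)
{
        m->line = level;
}

/* Replay the four steps in the order shown in the diagram. */
static void replay_race(struct model *m)
{
        bool raise, lower;

        assert(m);
        atomic_init(&m->asserted, 0);
        m->line = 0;

        raise = inject_claim(m);        /* CPU0 */
        lower = eoi_claim(m);           /* CPU1 */
        if (lower)
                set_line(m, 0);         /* CPU1 */
        if (raise)
                set_line(m, 1);         /* CPU0 */
}
```

After replay_race() the flag reads 0 but the line is left at 1, so a later
EOI finds nothing to clear until the next injection sets the flag again.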





* Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts
  2012-07-18 14:47                       ` Alex Williamson
@ 2012-07-18 15:38                         ` Michael S. Tsirkin
  2012-07-18 15:48                           ` Alex Williamson
  0 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-18 15:38 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Gleb Natapov, avi, kvm, linux-kernel, jan.kiszka

On Wed, Jul 18, 2012 at 08:47:23AM -0600, Alex Williamson wrote:
> On Wed, 2012-07-18 at 15:07 +0300, Michael S. Tsirkin wrote:
> > On Wed, Jul 18, 2012 at 02:48:44PM +0300, Gleb Natapov wrote:
> > > On Wed, Jul 18, 2012 at 02:39:10PM +0300, Michael S. Tsirkin wrote:
> > > > On Wed, Jul 18, 2012 at 02:22:19PM +0300, Michael S. Tsirkin wrote:
> > > > > > > > > > > So as was discussed kvm_set_irq under spinlock is bad for scalability
> > > > > > > > > > > with multiple VCPUs.  Why do we need a spinlock simply to protect
> > > > > > > > > > > level_asserted?  Let's use an atomic test and set/test and clear and the
> > > > > > > > > > > problem goes away.
> > > > > > > > > > > 
> > > > > > > > > > That sad reality is that for level interrupt we already scan all vcpus
> > > > > > > > > > under spinlock.
> > > > > > > > > 
> > > > > > > > > Where?
> > > > > > > > > 
> > > > > > > > ioapic
> > > > > > > 
> > > > > > > $ grep kvm_for_each_vcpu virt/kvm/ioapic.c
> > > > > > > $
> > > > > > > 
> > > > > > > ?
> > > > > > > 
> > > > > > 
> > > > > > Come on Michael. You can do better than grep and actually look at what
> > > > > > code does. The code that loops over all vcpus while delivering an irq is
> > > > > > in kvm_irq_delivery_to_apic(). Now grep for that.
> > > > > 
> > > > > Hmm, I see, it's actually done for edge if injected from ioapic too,
> > > > > right?
> > > > > 
> > > > > So set_irq does a linear scan, and for each matching CPU it calls
> > > > > kvm_irq_delivery_to_apic which is another scan?
> > > > > So it's actually N^2 worst case for a broadcast?
> > > > 
> > > > No it isn't, I misread the code.
> > > > 
> > > > 
> > > > Anyway, maybe not trivially but this looks fixable to me: we could drop
> > > > the ioapic lock before calling kvm_irq_delivery_to_apic.
> > > > 
> > > May be, may be not. Just saying "lets drop lock whenever we don't feel
> > > like holding one" does not cut it.
> > 
> > One thing we do is set remote_irr if interrupt was injected.
> > I agree these things are tricky.
> > 
> > One other question:
> > 
> > static int ioapic_service(struct kvm_ioapic *ioapic, unsigned int idx)
> > {
> >         union kvm_ioapic_redirect_entry *pent;
> >         int injected = -1;
> > 
> >         pent = &ioapic->redirtbl[idx];
> > 
> >         if (!pent->fields.mask) {
> >                 injected = ioapic_deliver(ioapic, idx);
> >                 if (injected && pent->fields.trig_mode == IOAPIC_LEVEL_TRIG)
> >                         pent->fields.remote_irr = 1;
> >         }
> > 
> >         return injected;
> > }
> > 
> > 
> > This if (injected) looks a bit strange since ioapic_deliver returns
> > -1 if no matching destinations. Should be if (injected > 0)?
> > 
> > 
> > 
> > > Back to original point though current
> > > situation is that calling kvm_set_irq() under spinlock is not worse for
> > > scalability than calling it not under one.
> > 
> > Yes. Still the specific use can just use an atomic flag,
> > lock+bool is not needed, and we won't need to undo it later.
> 
> 
> Actually, no, replacing it with an atomic is racy.
> 
> CPU0 (inject)                       CPU1 (EOI)
> atomic_cmpxchg(&asserted, 0, 1)
>                                     atomic_cmpxchg(&asserted, 1, 0)
>                                     kvm_set_irq(0)
> kvm_set_irq(1)
>                                     eventfd_signal
> 
> The interrupt is now stuck on until another interrupt is injected.
> 

Well EOI somehow happened here before interrupt so it's a bug somewhere
else?


-- 
MST


* Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts
  2012-07-18 15:38                         ` Michael S. Tsirkin
@ 2012-07-18 15:48                           ` Alex Williamson
  2012-07-18 15:58                             ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Alex Williamson @ 2012-07-18 15:48 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Gleb Natapov, avi, kvm, linux-kernel, jan.kiszka

On Wed, 2012-07-18 at 18:38 +0300, Michael S. Tsirkin wrote:
> On Wed, Jul 18, 2012 at 08:47:23AM -0600, Alex Williamson wrote:
> > On Wed, 2012-07-18 at 15:07 +0300, Michael S. Tsirkin wrote:
> > > On Wed, Jul 18, 2012 at 02:48:44PM +0300, Gleb Natapov wrote:
> > > > On Wed, Jul 18, 2012 at 02:39:10PM +0300, Michael S. Tsirkin wrote:
> > > > > On Wed, Jul 18, 2012 at 02:22:19PM +0300, Michael S. Tsirkin wrote:
> > > > > > > > > > > > So as was discussed kvm_set_irq under spinlock is bad for scalability
> > > > > > > > > > > > with multiple VCPUs.  Why do we need a spinlock simply to protect
> > > > > > > > > > > > level_asserted?  Let's use an atomic test and set/test and clear and the
> > > > > > > > > > > > problem goes away.
> > > > > > > > > > > > 
> > > > > > > > > > > That sad reality is that for level interrupt we already scan all vcpus
> > > > > > > > > > > under spinlock.
> > > > > > > > > > 
> > > > > > > > > > Where?
> > > > > > > > > > 
> > > > > > > > > ioapic
> > > > > > > > 
> > > > > > > > $ grep kvm_for_each_vcpu virt/kvm/ioapic.c
> > > > > > > > $
> > > > > > > > 
> > > > > > > > ?
> > > > > > > > 
> > > > > > > 
> > > > > > > Come on Michael. You can do better than grep and actually look at what
> > > > > > > code does. The code that loops over all vcpus while delivering an irq is
> > > > > > > in kvm_irq_delivery_to_apic(). Now grep for that.
> > > > > > 
> > > > > > Hmm, I see, it's actually done for edge if injected from ioapic too,
> > > > > > right?
> > > > > > 
> > > > > > So set_irq does a linear scan, and for each matching CPU it calls
> > > > > > kvm_irq_delivery_to_apic which is another scan?
> > > > > > So it's actually N^2 worst case for a broadcast?
> > > > > 
> > > > > No it isn't, I misread the code.
> > > > > 
> > > > > 
> > > > > Anyway, maybe not trivially but this looks fixable to me: we could drop
> > > > > the ioapic lock before calling kvm_irq_delivery_to_apic.
> > > > > 
> > > > May be, may be not. Just saying "lets drop lock whenever we don't feel
> > > > like holding one" does not cut it.
> > > 
> > > One thing we do is set remote_irr if interrupt was injected.
> > > I agree these things are tricky.
> > > 
> > > One other question:
> > > 
> > > static int ioapic_service(struct kvm_ioapic *ioapic, unsigned int idx)
> > > {
> > >         union kvm_ioapic_redirect_entry *pent;
> > >         int injected = -1;
> > > 
> > >         pent = &ioapic->redirtbl[idx];
> > > 
> > >         if (!pent->fields.mask) {
> > >                 injected = ioapic_deliver(ioapic, idx);
> > >                 if (injected && pent->fields.trig_mode == IOAPIC_LEVEL_TRIG)
> > >                         pent->fields.remote_irr = 1;
> > >         }
> > > 
> > >         return injected;
> > > }
> > > 
> > > 
> > > This if (injected) looks a bit strange since ioapic_deliver returns
> > > -1 if no matching destinations. Should be if (injected > 0)?
> > > 
> > > 
> > > 
> > > > Back to original point though current
> > > > situation is that calling kvm_set_irq() under spinlock is not worse for
> > > > scalability than calling it not under one.
> > > 
> > > Yes. Still the specific use can just use an atomic flag,
> > > lock+bool is not needed, and we won't need to undo it later.
> > 
> > 
> > Actually, no, replacing it with an atomic is racy.
> > 
> > CPU0 (inject)                       CPU1 (EOI)
> > atomic_cmpxchg(&asserted, 0, 1)
> >                                     atomic_cmpxchg(&asserted, 1, 0)
> >                                     kvm_set_irq(0)
> > kvm_set_irq(1)
> >                                     eventfd_signal
> > 
> > The interrupt is now stuck on until another interrupt is injected.
> > 
> 
> Well EOI somehow happened here before interrupt so it's a bug somewhere
> else?

Interrupts can be shared.  We also can't guarantee that the guest won't
write a bogus EOI to the ioapic.  The irq ack notifier doesn't filter on
irq source id... I'm not sure it can.



* Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts
  2012-07-18 15:48                           ` Alex Williamson
@ 2012-07-18 15:58                             ` Michael S. Tsirkin
  2012-07-18 18:42                               ` Marcelo Tosatti
  0 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-18 15:58 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Gleb Natapov, avi, kvm, linux-kernel, jan.kiszka

On Wed, Jul 18, 2012 at 09:48:01AM -0600, Alex Williamson wrote:
> On Wed, 2012-07-18 at 18:38 +0300, Michael S. Tsirkin wrote:
> > On Wed, Jul 18, 2012 at 08:47:23AM -0600, Alex Williamson wrote:
> > > On Wed, 2012-07-18 at 15:07 +0300, Michael S. Tsirkin wrote:
> > > > On Wed, Jul 18, 2012 at 02:48:44PM +0300, Gleb Natapov wrote:
> > > > > On Wed, Jul 18, 2012 at 02:39:10PM +0300, Michael S. Tsirkin wrote:
> > > > > > On Wed, Jul 18, 2012 at 02:22:19PM +0300, Michael S. Tsirkin wrote:
> > > > > > > > > > > > > So as was discussed kvm_set_irq under spinlock is bad for scalability
> > > > > > > > > > > > > with multiple VCPUs.  Why do we need a spinlock simply to protect
> > > > > > > > > > > > > level_asserted?  Let's use an atomic test and set/test and clear and the
> > > > > > > > > > > > > problem goes away.
> > > > > > > > > > > > > 
> > > > > > > > > > > > That sad reality is that for level interrupt we already scan all vcpus
> > > > > > > > > > > > under spinlock.
> > > > > > > > > > > 
> > > > > > > > > > > Where?
> > > > > > > > > > > 
> > > > > > > > > > ioapic
> > > > > > > > > 
> > > > > > > > > $ grep kvm_for_each_vcpu virt/kvm/ioapic.c
> > > > > > > > > $
> > > > > > > > > 
> > > > > > > > > ?
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Come on Michael. You can do better than grep and actually look at what
> > > > > > > > code does. The code that loops over all vcpus while delivering an irq is
> > > > > > > > in kvm_irq_delivery_to_apic(). Now grep for that.
> > > > > > > 
> > > > > > > Hmm, I see, it's actually done for edge if injected from ioapic too,
> > > > > > > right?
> > > > > > > 
> > > > > > > So set_irq does a linear scan, and for each matching CPU it calls
> > > > > > > kvm_irq_delivery_to_apic which is another scan?
> > > > > > > So it's actually N^2 worst case for a broadcast?
> > > > > > 
> > > > > > No it isn't, I misread the code.
> > > > > > 
> > > > > > 
> > > > > > Anyway, maybe not trivially but this looks fixable to me: we could drop
> > > > > > the ioapic lock before calling kvm_irq_delivery_to_apic.
> > > > > > 
> > > > > May be, may be not. Just saying "lets drop lock whenever we don't feel
> > > > > like holding one" does not cut it.
> > > > 
> > > > One thing we do is set remote_irr if interrupt was injected.
> > > > I agree these things are tricky.
> > > > 
> > > > One other question:
> > > > 
> > > > static int ioapic_service(struct kvm_ioapic *ioapic, unsigned int idx)
> > > > {
> > > >         union kvm_ioapic_redirect_entry *pent;
> > > >         int injected = -1;
> > > > 
> > > >         pent = &ioapic->redirtbl[idx];
> > > > 
> > > >         if (!pent->fields.mask) {
> > > >                 injected = ioapic_deliver(ioapic, idx);
> > > >                 if (injected && pent->fields.trig_mode == IOAPIC_LEVEL_TRIG)
> > > >                         pent->fields.remote_irr = 1;
> > > >         }
> > > > 
> > > >         return injected;
> > > > }
> > > > 
> > > > 
> > > > This if (injected) looks a bit strange since ioapic_deliver returns
> > > > -1 if no matching destinations. Should be if (injected > 0)?
> > > > 
> > > > 
> > > > 
> > > > > Back to original point though current
> > > > > situation is that calling kvm_set_irq() under spinlock is not worse for
> > > > > scalability than calling it not under one.
> > > > 
> > > > Yes. Still the specific use can just use an atomic flag,
> > > > lock+bool is not needed, and we won't need to undo it later.
> > > 
> > > 
> > > Actually, no, replacing it with an atomic is racy.
> > > 
> > > CPU0 (inject)                       CPU1 (EOI)
> > > atomic_cmpxchg(&asserted, 0, 1)
> > >                                     atomic_cmpxchg(&asserted, 1, 0)
> > >                                     kvm_set_irq(0)
> > > kvm_set_irq(1)
> > >                                     eventfd_signal
> > > 
> > > The interrupt is now stuck on until another interrupt is injected.
> > > 
> > 
> > Well EOI somehow happened here before interrupt so it's a bug somewhere
> > else?
> 
> Interrupts can be shared.  We also can't guarantee that the guest won't
> write a bogus EOI to the ioapic.  The irq ack notifier doesn't filter on
> irq source id... I'm not sure it can.

I guess if Avi OKs adding another kvm_set_irq under spinlock that's
the best we can do for now.

If not, maybe we can teach kvm_set_irq to return an indication
of the previous status. Specifically kvm_irq_line_state
could do test_and_set/test_and_clear and if already set/clear
we return 0 immediately.
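
A rough userspace sketch of that idea, using C11 atomics in place of
test_and_set_bit/test_and_clear_bit. The name line_state_update() and its
"changed" return value are assumptions made for illustration, not the
kernel interface:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Set or clear one source's bit in the shared line state and report
 * whether this call actually changed it.  A caller could skip the
 * pic/ioapic update when the function returns false.
 */
static bool line_state_update(atomic_ulong *irq_state, int source_id,
                              int level)
{
        unsigned long bit, old;

        assert(source_id >= 0 &&
               source_id < (int)(8 * sizeof(unsigned long)));
        bit = 1UL << source_id;

        if (level)
                old = atomic_fetch_or(irq_state, bit);   /* test and set */
        else
                old = atomic_fetch_and(irq_state, ~bit); /* test and clear */

        /* Changed only if the previous bit differs from the new level. */
        return ((old & bit) != 0) != (level != 0);
}
```

Asserting an already asserted source, or clearing an already clear one,
returns false here, which corresponds to the "return 0 immediately" case.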

-- 
MST


* Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts
  2012-07-18 15:58                             ` Michael S. Tsirkin
@ 2012-07-18 18:42                               ` Marcelo Tosatti
  2012-07-18 19:00                                 ` Gleb Natapov
  2012-07-18 19:07                                 ` Alex Williamson
  0 siblings, 2 replies; 96+ messages in thread
From: Marcelo Tosatti @ 2012-07-18 18:42 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alex Williamson, Gleb Natapov, avi, kvm, linux-kernel, jan.kiszka

On Wed, Jul 18, 2012 at 06:58:24PM +0300, Michael S. Tsirkin wrote:
> > > > > > Back to original point though current
> > > > > > situation is that calling kvm_set_irq() under spinlock is not worse for
> > > > > > scalability than calling it not under one.
> > > > > 
> > > > > Yes. Still the specific use can just use an atomic flag,
> > > > > lock+bool is not needed, and we won't need to undo it later.
> > > > 
> > > > 
> > > > Actually, no, replacing it with an atomic is racy.
> > > > 
> > > > CPU0 (inject)                       CPU1 (EOI)
> > > > atomic_cmpxchg(&asserted, 0, 1)
> > > >                                     atomic_cmpxchg(&asserted, 1, 0)
> > > >                                     kvm_set_irq(0)
> > > > kvm_set_irq(1)
> > > >                                     eventfd_signal
> > > > 
> > > > The interrupt is now stuck on until another interrupt is injected.
> > > > 
> > > 
> > > Well EOI somehow happened here before interrupt so it's a bug somewhere
> > > else?
> > 
> > Interrupts can be shared.  We also can't guarantee that the guest won't
> > write a bogus EOI to the ioapic.  The irq ack notifier doesn't filter on
> > irq source id... I'm not sure it can.
> 
> I guess if Avi OKs adding another kvm_set_irq under spinlock that's
> the best we can do for now.

Why can't a mutex be used instead of a spinlock again?




* Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts
  2012-07-18 18:42                               ` Marcelo Tosatti
@ 2012-07-18 19:00                                 ` Gleb Natapov
  2012-07-18 19:07                                 ` Alex Williamson
  1 sibling, 0 replies; 96+ messages in thread
From: Gleb Natapov @ 2012-07-18 19:00 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Michael S. Tsirkin, Alex Williamson, avi, kvm, linux-kernel, jan.kiszka

On Wed, Jul 18, 2012 at 03:42:09PM -0300, Marcelo Tosatti wrote:
> On Wed, Jul 18, 2012 at 06:58:24PM +0300, Michael S. Tsirkin wrote:
> > > > > > > Back to original point though current
> > > > > > > situation is that calling kvm_set_irq() under spinlock is not worse for
> > > > > > > scalability than calling it not under one.
> > > > > > 
> > > > > > Yes. Still the specific use can just use an atomic flag,
> > > > > > lock+bool is not needed, and we won't need to undo it later.
> > > > > 
> > > > > 
> > > > > Actually, no, replacing it with an atomic is racy.
> > > > > 
> > > > > CPU0 (inject)                       CPU1 (EOI)
> > > > > atomic_cmpxchg(&asserted, 0, 1)
> > > > >                                     atomic_cmpxchg(&asserted, 1, 0)
> > > > >                                     kvm_set_irq(0)
> > > > > kvm_set_irq(1)
> > > > >                                     eventfd_signal
> > > > > 
> > > > > The interrupt is now stuck on until another interrupt is injected.
> > > > > 
> > > > 
> > > > Well EOI somehow happened here before interrupt so it's a bug somewhere
> > > > else?
> > > 
> > > Interrupts can be shared.  We also can't guarantee that the guest won't
> > > write a bogus EOI to the ioapic.  The irq ack notifier doesn't filter on
> > > irq source id... I'm not sure it can.
> > 
> > I guess if Avi OKs adding another kvm_set_irq under spinlock that's
> > the best we can do for now.
> 
> Why can't a mutex be used instead of a spinlock again?
> 
Why was it changed in the first place? The commit says that the function is
called from unsleepable context, but gives no stack trace.

--
			Gleb.


* Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts
  2012-07-18 18:42                               ` Marcelo Tosatti
  2012-07-18 19:00                                 ` Gleb Natapov
@ 2012-07-18 19:07                                 ` Alex Williamson
  2012-07-18 19:13                                   ` Alex Williamson
  1 sibling, 1 reply; 96+ messages in thread
From: Alex Williamson @ 2012-07-18 19:07 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Michael S. Tsirkin, Gleb Natapov, avi, kvm, linux-kernel, jan.kiszka

On Wed, 2012-07-18 at 15:42 -0300, Marcelo Tosatti wrote:
> On Wed, Jul 18, 2012 at 06:58:24PM +0300, Michael S. Tsirkin wrote:
> > > > > > > Back to original point though current
> > > > > > > situation is that calling kvm_set_irq() under spinlock is not worse for
> > > > > > > scalability than calling it not under one.
> > > > > > 
> > > > > > Yes. Still the specific use can just use an atomic flag,
> > > > > > lock+bool is not needed, and we won't need to undo it later.
> > > > > 
> > > > > 
> > > > > Actually, no, replacing it with an atomic is racy.
> > > > > 
> > > > > CPU0 (inject)                       CPU1 (EOI)
> > > > > atomic_cmpxchg(&asserted, 0, 1)
> > > > >                                     atomic_cmpxchg(&asserted, 1, 0)
> > > > >                                     kvm_set_irq(0)
> > > > > kvm_set_irq(1)
> > > > >                                     eventfd_signal
> > > > > 
> > > > > The interrupt is now stuck on until another interrupt is injected.
> > > > > 
> > > > 
> > > > Well EOI somehow happened here before interrupt so it's a bug somewhere
> > > > else?
> > > 
> > > Interrupts can be shared.  We also can't guarantee that the guest won't
> > > write a bogus EOI to the ioapic.  The irq ack notifier doesn't filter on
> > > irq source id... I'm not sure it can.
> > 
> > I guess if Avi OKs adding another kvm_set_irq under spinlock that's
> > the best we can do for now.
> 
> Why can't a mutex be used instead of a spinlock again?

eventfd_signal calls the inject function from atomic context.



* Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts
  2012-07-18 19:07                                 ` Alex Williamson
@ 2012-07-18 19:13                                   ` Alex Williamson
  2012-07-18 19:16                                     ` Michael S. Tsirkin
  2012-07-18 20:28                                     ` Alex Williamson
  0 siblings, 2 replies; 96+ messages in thread
From: Alex Williamson @ 2012-07-18 19:13 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Michael S. Tsirkin, Gleb Natapov, avi, kvm, linux-kernel, jan.kiszka

On Wed, 2012-07-18 at 13:07 -0600, Alex Williamson wrote:
> On Wed, 2012-07-18 at 15:42 -0300, Marcelo Tosatti wrote:
> > On Wed, Jul 18, 2012 at 06:58:24PM +0300, Michael S. Tsirkin wrote:
> > > > > > > > Back to original point though current
> > > > > > > > situation is that calling kvm_set_irq() under spinlock is not worse for
> > > > > > > > scalability than calling it not under one.
> > > > > > > 
> > > > > > > Yes. Still the specific use can just use an atomic flag,
> > > > > > > lock+bool is not needed, and we won't need to undo it later.
> > > > > > 
> > > > > > 
> > > > > > Actually, no, replacing it with an atomic is racy.
> > > > > > 
> > > > > > CPU0 (inject)                       CPU1 (EOI)
> > > > > > atomic_cmpxchg(&asserted, 0, 1)
> > > > > >                                     atomic_cmpxchg(&asserted, 1, 0)
> > > > > >                                     kvm_set_irq(0)
> > > > > > kvm_set_irq(1)
> > > > > >                                     eventfd_signal
> > > > > > 
> > > > > > The interrupt is now stuck on until another interrupt is injected.
> > > > > > 
> > > > > 
> > > > > Well EOI somehow happened here before interrupt so it's a bug somewhere
> > > > > else?
> > > > 
> > > > Interrupts can be shared.  We also can't guarantee that the guest won't
> > > > write a bogus EOI to the ioapic.  The irq ack notifier doesn't filter on
> > > > irq source id... I'm not sure it can.
> > > 
> > > I guess if Avi OKs adding another kvm_set_irq under spinlock that's
> > > the best we can do for now.
> > 
> > Why can't a mutex be used instead of a spinlock again?
> 
> eventfd_signal calls the inject function from atomic context.

Actually, that's called from a workq.  I'll have to switch it back and
turn on lockdep to remember why I couldn't sleep there.



* Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts
  2012-07-18 19:13                                   ` Alex Williamson
@ 2012-07-18 19:16                                     ` Michael S. Tsirkin
  2012-07-18 20:28                                     ` Alex Williamson
  1 sibling, 0 replies; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-18 19:16 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Marcelo Tosatti, Gleb Natapov, avi, kvm, linux-kernel, jan.kiszka

On Wed, Jul 18, 2012 at 01:13:06PM -0600, Alex Williamson wrote:
> On Wed, 2012-07-18 at 13:07 -0600, Alex Williamson wrote:
> > On Wed, 2012-07-18 at 15:42 -0300, Marcelo Tosatti wrote:
> > > On Wed, Jul 18, 2012 at 06:58:24PM +0300, Michael S. Tsirkin wrote:
> > > > > > > > > Back to original point though current
> > > > > > > > > situation is that calling kvm_set_irq() under spinlock is not worse for
> > > > > > > > > scalability than calling it not under one.
> > > > > > > > 
> > > > > > > > Yes. Still the specific use can just use an atomic flag,
> > > > > > > > lock+bool is not needed, and we won't need to undo it later.
> > > > > > > 
> > > > > > > 
> > > > > > > Actually, no, replacing it with an atomic is racy.
> > > > > > > 
> > > > > > > CPU0 (inject)                       CPU1 (EOI)
> > > > > > > atomic_cmpxchg(&asserted, 0, 1)
> > > > > > >                                     atomic_cmpxchg(&asserted, 1, 0)
> > > > > > >                                     kvm_set_irq(0)
> > > > > > > kvm_set_irq(1)
> > > > > > >                                     eventfd_signal
> > > > > > > 
> > > > > > > The interrupt is now stuck on until another interrupt is injected.
> > > > > > > 
> > > > > > 
> > > > > > Well EOI somehow happened here before interrupt so it's a bug somewhere
> > > > > > else?
> > > > > 
> > > > > Interrupts can be shared.  We also can't guarantee that the guest won't
> > > > > write a bogus EOI to the ioapic.  The irq ack notifier doesn't filter on
> > > > > irq source id... I'm not sure it can.
> > > > 
> > > > I guess if Avi OKs adding another kvm_set_irq under spinlock that's
> > > > the best we can do for now.
> > > 
> > > Why can't a mutex be used instead of a spinlock again?
> > 
> > eventfd_signal calls the inject function from atomic context.
> 
> Actually, that's called from a workq.  I'll have to switch it back and
> turn on lockdep to remember why I couldn't sleep there.


I'll try to fix kvm_set_irq so it returns 0 if level was already 0.
Then you do not need extra state.

-- 
MST


* Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts
  2012-07-18 19:13                                   ` Alex Williamson
  2012-07-18 19:16                                     ` Michael S. Tsirkin
@ 2012-07-18 20:28                                     ` Alex Williamson
  2012-07-18 21:23                                       ` Marcelo Tosatti
  1 sibling, 1 reply; 96+ messages in thread
From: Alex Williamson @ 2012-07-18 20:28 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Michael S. Tsirkin, Gleb Natapov, avi, kvm, linux-kernel, jan.kiszka

On Wed, 2012-07-18 at 13:13 -0600, Alex Williamson wrote:
> On Wed, 2012-07-18 at 13:07 -0600, Alex Williamson wrote:
> > On Wed, 2012-07-18 at 15:42 -0300, Marcelo Tosatti wrote:
> > > On Wed, Jul 18, 2012 at 06:58:24PM +0300, Michael S. Tsirkin wrote:
> > > > > > > > > Back to original point though current
> > > > > > > > > situation is that calling kvm_set_irq() under spinlock is not worse for
> > > > > > > > > scalability than calling it not under one.
> > > > > > > > 
> > > > > > > > Yes. Still the specific use can just use an atomic flag,
> > > > > > > > lock+bool is not needed, and we won't need to undo it later.
> > > > > > > 
> > > > > > > 
> > > > > > > Actually, no, replacing it with an atomic is racy.
> > > > > > > 
> > > > > > > CPU0 (inject)                       CPU1 (EOI)
> > > > > > > atomic_cmpxchg(&asserted, 0, 1)
> > > > > > >                                     atomic_cmpxchg(&asserted, 1, 0)
> > > > > > >                                     kvm_set_irq(0)
> > > > > > > kvm_set_irq(1)
> > > > > > >                                     eventfd_signal
> > > > > > > 
> > > > > > > The interrupt is now stuck on until another interrupt is injected.
> > > > > > > 
> > > > > > 
> > > > > > Well EOI somehow happened here before interrupt so it's a bug somewhere
> > > > > > else?
> > > > > 
> > > > > Interrupts can be shared.  We also can't guarantee that the guest won't
> > > > > write a bogus EOI to the ioapic.  The irq ack notifier doesn't filter on
> > > > > irq source id... I'm not sure it can.
> > > > 
> > > > I guess if Avi OKs adding another kvm_set_irq under spinlock that's
> > > > the best we can do for now.
> > > 
> > > Why can't a mutex be used instead of a spinlock again?
> > 
> > eventfd_signal calls the inject function from atomic context.
> 
> Actually, that's called from a workq.  I'll have to switch it back and
> turn on lockdep to remember why I couldn't sleep there.

switching to a mutex results in:

BUG: sleeping function called from invalid context at kernel/mutex.c:269
in_atomic(): 1, irqs_disabled(): 0, pid: 30025, name: qemu-system-x86
INFO: lockdep is turned off.
Pid: 30025, comm: qemu-system-x86 Not tainted 3.5.0-rc4+ #109
Call Trace:
 [<ffffffff81088425>] __might_sleep+0xf5/0x130
 [<ffffffff81564c6f>] mutex_lock_nested+0x2f/0x60
 [<ffffffffa07db7d5>] eoifd_event+0x25/0x70 [kvm]
 [<ffffffffa07daea4>] kvm_notify_acked_irq+0xa4/0x140 [kvm]
 [<ffffffffa07dae2a>] ? kvm_notify_acked_irq+0x2a/0x140 [kvm]
 [<ffffffffa07d9bb4>] kvm_ioapic_update_eoi+0x84/0xf0 [kvm]
 [<ffffffffa0806c43>] apic_set_eoi+0x123/0x130 [kvm]
 [<ffffffffa0806fd8>] apic_reg_write+0x388/0x670 [kvm]
 [<ffffffffa07eb03c>] ? vcpu_enter_guest+0x32c/0x740 [kvm]
 [<ffffffffa0807481>] kvm_lapic_set_eoi+0x21/0x30 [kvm]
 [<ffffffffa04ba3f9>] handle_apic_access+0x69/0x80 [kvm_intel]
 [<ffffffffa04ba02a>] vmx_handle_exit+0xaa/0x260 [kvm_intel]




* Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts
  2012-07-18 20:28                                     ` Alex Williamson
@ 2012-07-18 21:23                                       ` Marcelo Tosatti
  2012-07-18 21:30                                         ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Marcelo Tosatti @ 2012-07-18 21:23 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Michael S. Tsirkin, Gleb Natapov, avi, kvm, linux-kernel, jan.kiszka

On Wed, Jul 18, 2012 at 02:28:34PM -0600, Alex Williamson wrote:
> > turn on lockdep to remember why I couldn't sleep there.
> 
> switching to a mutex results in:
> 
> BUG: sleeping function called from invalid context at kernel/mutex.c:269
> in_atomic(): 1, irqs_disabled(): 0, pid: 30025, name: qemu-system-x86
> INFO: lockdep is turned off.
> Pid: 30025, comm: qemu-system-x86 Not tainted 3.5.0-rc4+ #109
> Call Trace:
>  [<ffffffff81088425>] __might_sleep+0xf5/0x130
>  [<ffffffff81564c6f>] mutex_lock_nested+0x2f/0x60
>  [<ffffffffa07db7d5>] eoifd_event+0x25/0x70 [kvm]
>  [<ffffffffa07daea4>] kvm_notify_acked_irq+0xa4/0x140 [kvm]
>  [<ffffffffa07dae2a>] ? kvm_notify_acked_irq+0x2a/0x140 [kvm]
>  [<ffffffffa07d9bb4>] kvm_ioapic_update_eoi+0x84/0xf0 [kvm]
>  [<ffffffffa0806c43>] apic_set_eoi+0x123/0x130 [kvm]
>  [<ffffffffa0806fd8>] apic_reg_write+0x388/0x670 [kvm]
>  [<ffffffffa07eb03c>] ? vcpu_enter_guest+0x32c/0x740 [kvm]
>  [<ffffffffa0807481>] kvm_lapic_set_eoi+0x21/0x30 [kvm]
>  [<ffffffffa04ba3f9>] handle_apic_access+0x69/0x80 [kvm_intel]
>  [<ffffffffa04ba02a>] vmx_handle_exit+0xaa/0x260 [kvm_intel]

It's RCU from ack notifiers, OK.




* Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts
  2012-07-18 21:23                                       ` Marcelo Tosatti
@ 2012-07-18 21:30                                         ` Michael S. Tsirkin
  0 siblings, 0 replies; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-18 21:30 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Alex Williamson, Gleb Natapov, avi, kvm, linux-kernel, jan.kiszka

On Wed, Jul 18, 2012 at 06:23:34PM -0300, Marcelo Tosatti wrote:
> On Wed, Jul 18, 2012 at 02:28:34PM -0600, Alex Williamson wrote:
> > > turn on lockdep to remember why I couldn't sleep there.
> > 
> > switching to a mutex results in:
> > 
> > BUG: sleeping function called from invalid context at kernel/mutex.c:269
> > in_atomic(): 1, irqs_disabled(): 0, pid: 30025, name: qemu-system-x86
> > INFO: lockdep is turned off.
> > Pid: 30025, comm: qemu-system-x86 Not tainted 3.5.0-rc4+ #109
> > Call Trace:
> >  [<ffffffff81088425>] __might_sleep+0xf5/0x130
> >  [<ffffffff81564c6f>] mutex_lock_nested+0x2f/0x60
> >  [<ffffffffa07db7d5>] eoifd_event+0x25/0x70 [kvm]
> >  [<ffffffffa07daea4>] kvm_notify_acked_irq+0xa4/0x140 [kvm]
> >  [<ffffffffa07dae2a>] ? kvm_notify_acked_irq+0x2a/0x140 [kvm]
> >  [<ffffffffa07d9bb4>] kvm_ioapic_update_eoi+0x84/0xf0 [kvm]
> >  [<ffffffffa0806c43>] apic_set_eoi+0x123/0x130 [kvm]
> >  [<ffffffffa0806fd8>] apic_reg_write+0x388/0x670 [kvm]
> >  [<ffffffffa07eb03c>] ? vcpu_enter_guest+0x32c/0x740 [kvm]
> >  [<ffffffffa0807481>] kvm_lapic_set_eoi+0x21/0x30 [kvm]
> >  [<ffffffffa04ba3f9>] handle_apic_access+0x69/0x80 [kvm_intel]
> >  [<ffffffffa04ba02a>] vmx_handle_exit+0xaa/0x260 [kvm_intel]
> 
> It's RCU from ack notifiers, OK.

I'm testing a patch that moves all bitmap handling to under
pic/ioapic lock. After this we can teach kvm_set_irq to report
when a bit that is cleared/set is already clear/set, without
races.

And then no tracking will be necessary in irqfd - we can just
call kvm_set_irq(..., 0) and look at the return status.
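
As a rough illustration of that idea (a userspace sketch only, not the
patch being tested): a pthread mutex stands in for the pic/ioapic lock,
and irq_line_model and model_set_irq() are hypothetical names made up
for this example. The bit update, the level recomputation and the
"did anything change" answer all happen under one lock, so the caller
can act on the return value.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model of one interrupt line; not a KVM structure. */
struct irq_line_model {
	pthread_mutex_t lock;     /* stands in for the pic/ioapic lock */
	unsigned long irq_state;  /* one bit per irq source id */
	int level;                /* level last delivered to the "ioapic" */
};

void model_init(struct irq_line_model *l)
{
	pthread_mutex_init(&l->lock, NULL);
	l->irq_state = 0;
	l->level = 0;
}

/*
 * Set or clear one source's bit and update the line level, all under
 * the lock. Returns true if this call changed the source's bit, false
 * if the bit was already in the requested state.
 */
bool model_set_irq(struct irq_line_model *l, unsigned int source_id, int level)
{
	unsigned long mask;
	bool changed;

	assert(source_id < 8 * sizeof(unsigned long));
	mask = 1UL << source_id;

	pthread_mutex_lock(&l->lock);
	changed = !!(l->irq_state & mask) != !!level;
	if (level)
		l->irq_state |= mask;
	else
		l->irq_state &= ~mask;
	l->level = !!l->irq_state;  /* logical OR of all sources */
	pthread_mutex_unlock(&l->lock);

	return changed;
}
```

With something of this shape, an EOI handler would only signal the
eventfd when model_set_irq(..., 0) reports that it actually cleared the
bit.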

-- 
MST


* Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()
  2012-07-18  6:27                         ` Gleb Natapov
  2012-07-18 10:20                           ` Michael S. Tsirkin
@ 2012-07-18 21:55                           ` Michael S. Tsirkin
  1 sibling, 0 replies; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-18 21:55 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Alex Williamson, avi, kvm, linux-kernel, jan.kiszka

On Wed, Jul 18, 2012 at 09:27:42AM +0300, Gleb Natapov wrote:
> On Tue, Jul 17, 2012 at 07:14:52PM +0300, Michael S. Tsirkin wrote:
> > > _Seems_ racy, or _is_ racy?  Please identify the race.
> > 
> > Look at this:
> > 
> > static inline int kvm_irq_line_state(unsigned long *irq_state,
> >                                      int irq_source_id, int level)
> > {
> >         /* Logical OR for level trig interrupt */
> >         if (level)
> >                 set_bit(irq_source_id, irq_state);
> >         else
> >                 clear_bit(irq_source_id, irq_state);
> > 
> >         return !!(*irq_state);
> > }
> > 
> > 
> > Now:
> > If other CPU changes some other bit after the atomic change,
> > it looks like !!(*irq_state) might return a stale value.
> > 
> > CPU 0 clears bit 0. CPU 1 sets bit 1. CPU 1 sets level to 1.
> > If CPU 0 sees a stale value now it will return 0 here
> > and interrupt will get cleared.
> > 
> This will hardly happen on x86 especially since bit is set with
> serialized instruction. But there is actually a race here.
> CPU 0 clears bit 0. CPU 0 read irq_state as 0. CPU 1 sets level to 1.
> CPU 1 calls kvm_ioapic_set_irq(1). CPU 0 calls kvm_ioapic_set_irq(0).
> Now ioapic thinks the level is 0 but irq_state is not 0.
> 
> This untested and un-compiled patch should fix it.

Getting rid of atomics completely makes me more comfortable,
and by moving all bitmap handling to under pic/ioapic lock
we can do just that.
I just tested and posted a patch that fixes the race in this way.
Could you take a look pls?
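
For reference, the interleaving Gleb describes can be replayed
deterministically in userspace. This is only a sketch: line_model and
the model_* helpers are hypothetical stand-ins for the bitmap update
and the ioapic call, and the two "CPUs" are simply run one step at a
time in the order quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: shared source bitmap plus the level the ioapic saw last. */
struct line_model {
	unsigned long irq_state;
	int ioapic_level;
};

/* First half of the update: change one source bit and sample the OR. */
int model_line_state(struct line_model *l, unsigned int source_id, int level)
{
	assert(source_id < 8 * sizeof(unsigned long));
	if (level)
		l->irq_state |= 1UL << source_id;
	else
		l->irq_state &= ~(1UL << source_id);
	return !!l->irq_state;
}

/* Second half: deliver the previously sampled level to the ioapic model. */
void model_ioapic_set_irq(struct line_model *l, int level)
{
	l->ioapic_level = level;
}

/* Replays the interleaving; returns true if the line ends up inconsistent. */
bool replay_race(void)
{
	struct line_model l = { 0x1UL, 1 };  /* source 0 currently asserted */
	int cpu0_level, cpu1_level;

	cpu0_level = model_line_state(&l, 0, 0);  /* CPU 0 clears bit 0, reads 0 */
	cpu1_level = model_line_state(&l, 1, 1);  /* CPU 1 sets bit 1, reads 1 */
	model_ioapic_set_irq(&l, cpu1_level);     /* CPU 1 delivers level 1 */
	model_ioapic_set_irq(&l, cpu0_level);     /* CPU 0 delivers stale level 0 */

	return l.irq_state != 0 && l.ioapic_level == 0;
}
```

The two halves are not atomic with respect to each other, which is the
gap that doing both under one lock closes.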

-- 
MST


* Re: [PATCH v5 0/4] kvm: level irqfd and new eoifd
  2012-07-16 20:33 [PATCH v5 0/4] kvm: level irqfd and new eoifd Alex Williamson
                   ` (4 preceding siblings ...)
  2012-07-18 10:43 ` [PATCH v5 0/4] kvm: level irqfd and new eoifd Michael S. Tsirkin
@ 2012-07-19 16:59 ` Michael S. Tsirkin
  2012-07-19 17:29   ` Alex Williamson
  5 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-19 16:59 UTC (permalink / raw)
  To: Alex Williamson; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Mon, Jul 16, 2012 at 02:33:38PM -0600, Alex Williamson wrote:
> v5:
>  - irqfds now have a one-to-one mapping with eoifds to prevent users
>    from consuming all of kernel memory by repeatedly creating eoifds
>    from a single irqfd.
>  - implement a kvm_clear_irq() which does a test_and_clear_bit of
>    the irq_state, only updating the pic/ioapic if changes and allowing
>    the caller to know if anything was done.  I added this onto the end
>    as it's essentially an optimization on the previous design.  It's
>    hard to tell if there's an actual performance benefit to this.
>  - dropped eoifd gsi support patch as it was only an FYI.
> 
> Thanks,
> 
> Alex


So 3/4, 4/4 are racy and I think you convinced me it's best to drop them for
now. I hope the fact that we already scan all vcpus under spinlock for
level interrupts is enough to justify adding a lock here.

To summarize issues still outstanding with 1/2, 2/2:
- source id lingering after irqfd was destroyed/deassigned
  prevents assigning a new irqfd
- if same irqfd is deassigned and re-assigned, this
  seems to succeed but does not give any more EOIs
- document that user needs to re-inject interrupts
  injected by level IRQFD after migration as they are cleared

Hope this helps!

> ---
> 
> Alex Williamson (4):
>       kvm: Convert eoifd to use kvm_clear_irq
>       kvm: Create kvm_clear_irq()
>       kvm: KVM_EOIFD, an eventfd for EOIs
>       kvm: Extend irqfd to support level interrupts
> 
> 
>  Documentation/virtual/kvm/api.txt |   28 +++
>  arch/x86/kvm/x86.c                |    3 
>  include/linux/kvm.h               |   18 ++
>  include/linux/kvm_host.h          |   16 ++
>  virt/kvm/eventfd.c                |  333 +++++++++++++++++++++++++++++++++++++
>  virt/kvm/irq_comm.c               |   78 +++++++++
>  virt/kvm/kvm_main.c               |   11 +
>  7 files changed, 483 insertions(+), 4 deletions(-)


* Re: [PATCH v5 0/4] kvm: level irqfd and new eoifd
  2012-07-19 16:59 ` Michael S. Tsirkin
@ 2012-07-19 17:29   ` Alex Williamson
  2012-07-19 17:45     ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Alex Williamson @ 2012-07-19 17:29 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Thu, 2012-07-19 at 19:59 +0300, Michael S. Tsirkin wrote:
> On Mon, Jul 16, 2012 at 02:33:38PM -0600, Alex Williamson wrote:
> > v5:
> >  - irqfds now have a one-to-one mapping with eoifds to prevent users
> >    from consuming all of kernel memory by repeatedly creating eoifds
> >    from a single irqfd.
> >  - implement a kvm_clear_irq() which does a test_and_clear_bit of
> >    the irq_state, only updating the pic/ioapic if changes and allowing
> >    the caller to know if anything was done.  I added this onto the end
> >    as it's essentially an optimization on the previous design.  It's
> >    hard to tell if there's an actual performance benefit to this.
> >  - dropped eoifd gsi support patch as it was only an FYI.
> > 
> > Thanks,
> > 
> > Alex
> 
> 
> So 3/4, 4/4 are racy and I think you convinced me it's best to drop them for
> now. I hope the fact that we already scan all vcpus under spinlock for
> level interrupts is enough to justify adding a lock here.
> 
> To summarize issues still outstanding with 1/2, 2/2:
(a)
> - source id lingering after irqfd was destroyed/deassigned
>   prevents assigning a new irqfd
(b)
> - if same irqfd is deassigned and re-assigned, this
>   seems to succeed but does not give any more EOIs
(c)
> - document that user needs to re-inject interrupts
>   injected by level IRQFD after migration as they are cleared
> 
> Hope this helps!

Thanks, I'm refining and testing a re-write.  One thing I also noticed
is that we don't do anything when the eoifd is closed.  We'll cleanup
when kvm is closed, but that can leave a lot of stray eoifds, and
therefore used irq_source_ids tied up.  So, I think I need to pull in a
lot of the irqfd code just to be able to catch the POLLHUP and do
cleanup.  Fixing (a) is a simple flush, so I already added that.  To
solve (b), I think that saving the irqfd eventfd ctx was a bad idea.
The new api I will propose to solve it is that kvm_irqfd returns a token
(or key) when used as a level irqfd (actually the irq source id, but the
user shouldn't care what it is).  We pass that into eoifd instead of the
irqfd.  That means that if the irqfd is closed and re-configured, the
user will get a new key and should have no expectation that it's tied to
the previous eoifd.  I'll add a comment for (c).  Thanks,

Alex



* Re: [PATCH v5 0/4] kvm: level irqfd and new eoifd
  2012-07-19 17:29   ` Alex Williamson
@ 2012-07-19 17:45     ` Michael S. Tsirkin
  2012-07-19 18:48       ` Alex Williamson
  0 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-19 17:45 UTC (permalink / raw)
  To: Alex Williamson; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Thu, Jul 19, 2012 at 11:29:38AM -0600, Alex Williamson wrote:
> On Thu, 2012-07-19 at 19:59 +0300, Michael S. Tsirkin wrote:
> > On Mon, Jul 16, 2012 at 02:33:38PM -0600, Alex Williamson wrote:
> > > v5:
> > >  - irqfds now have a one-to-one mapping with eoifds to prevent users
> > >    from consuming all of kernel memory by repeatedly creating eoifds
> > >    from a single irqfd.
> > >  - implement a kvm_clear_irq() which does a test_and_clear_bit of
> > >    the irq_state, only updating the pic/ioapic if changes and allowing
> > >    the caller to know if anything was done.  I added this onto the end
> > >    as it's essentially an optimization on the previous design.  It's
> > >    hard to tell if there's an actual performance benefit to this.
> > >  - dropped eoifd gsi support patch as it was only an FYI.
> > > 
> > > Thanks,
> > > 
> > > Alex
> > 
> > 
> > So 3/4, 4/4 are racy and I think you convinced me it's best to drop them for
> > now. I hope the fact that we already scan all vcpus under spinlock for
> > level interrupts is enough to justify adding a lock here.
> > 
> > To summarize issues still outstanding with 1/2, 2/2:
> (a)
> > - source id lingering after irqfd was destroyed/deassigned
> >   prevents assigning a new irqfd
> (b)
> > - if same irqfd is deassigned and re-assigned, this
> >   seems to succeed but does not give any more EOIs
> (c)
> > - document that user needs to re-inject interrupts
> >   injected by level IRQFD after migration as they are cleared
> > 
> > Hope this helps!
> 
> Thanks, I'm refining and testing a re-write.  One thing I also noticed
> is that we don't do anything when the eoifd is closed.  We'll cleanup
> when kvm is closed, but that can leave a lot of stray eoifds, and
> therefore used irq_source_ids tied up.  So, I think I need to pull in a
> lot of the irqfd code just to be able to catch the POLLHUP and do
> cleanup.

I don't think it's worth it. With ioeventfd we have the same issue
and we don't care: userspace should just DEASSIGN before close.
With irqfd we committed to support cleanup by close but
it happens kind of naturally since we poll irqfd anyway.

It's there for irqfd for historical reasons.

> Fixing (a) is a simple flush, so I already added that.  To
> solve (b), I think that saving the irqfd eventfd ctx was a bad idea.

I actually think we should just fix it. Scan eoifds when closing/opening
irqfds and bind/unbind source id.

> The new api I will propose to solve it is that kvm_irqfd returns a token
> (or key) when used as a level irqfd (actually the irq source id, but the
> user shouldn't care what it is).  We pass that into eoifd instead of the
> irqfd.  That means that if the irqfd is closed and re-configured, the
> user will get a new key and should have no expectation that it's tied to
> the previous eoifd.  I'll add a comment for (c).  Thanks,
> 
> Alex

Hmm, another API rewrite, when I felt it is finally stabilizing. Maybe
it's the right thing to do but it does feel like we change userspace ABI
just because we have run into an implementation difficulty.

Pls note I'm offline next week so won't have time to review soon.

-- 
MST


* Re: [PATCH v5 0/4] kvm: level irqfd and new eoifd
  2012-07-19 17:45     ` Michael S. Tsirkin
@ 2012-07-19 18:48       ` Alex Williamson
  2012-07-20 10:07         ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Alex Williamson @ 2012-07-19 18:48 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Thu, 2012-07-19 at 20:45 +0300, Michael S. Tsirkin wrote:
> On Thu, Jul 19, 2012 at 11:29:38AM -0600, Alex Williamson wrote:
> > On Thu, 2012-07-19 at 19:59 +0300, Michael S. Tsirkin wrote:
> > > On Mon, Jul 16, 2012 at 02:33:38PM -0600, Alex Williamson wrote:
> > > > v5:
> > > >  - irqfds now have a one-to-one mapping with eoifds to prevent users
> > > >    from consuming all of kernel memory by repeatedly creating eoifds
> > > >    from a single irqfd.
> > > >  - implement a kvm_clear_irq() which does a test_and_clear_bit of
> > > >    the irq_state, only updating the pic/ioapic if changes and allowing
> > > >    the caller to know if anything was done.  I added this onto the end
> > > >    as it's essentially an optimization on the previous design.  It's
> > > >    hard to tell if there's an actual performance benefit to this.
> > > >  - dropped eoifd gsi support patch as it was only an FYI.
> > > > 
> > > > Thanks,
> > > > 
> > > > Alex
> > > 
> > > 
> > > So 3/4, 4/4 are racy and I think you convinced me it's best to drop them for
> > > now. I hope the fact that we already scan all vcpus under spinlock for
> > > level interrupts is enough to justify adding a lock here.
> > > 
> > > To summarize issues still outstanding with 1/2, 2/2:
> > (a)
> > > - source id lingering after irqfd was destroyed/deassigned
> > >   prevents assigning a new irqfd
> > (b)
> > > - if same irqfd is deassigned and re-assigned, this
> > >   seems to succeed but does not give any more EOIs
> > (c)
> > > - document that user needs to re-inject interrupts
> > >   injected by level IRQFD after migration as they are cleared
> > > 
> > > Hope this helps!
> > 
> > Thanks, I'm refining and testing a re-write.  One thing I also noticed
> > is that we don't do anything when the eoifd is closed.  We'll cleanup
> > when kvm is closed, but that can leave a lot of stray eoifds, and
> > therefore used irq_source_ids tied up.  So, I think I need to pull in a
> > lot of the irqfd code just to be able to catch the POLLHUP and do
> > cleanup.
> 
> I don't think it's worth it. With ioeventfd we have the same issue
> and we don't care: userspace should just DEASSIGN before close.
> With irqfd we committed to support cleanup by close but
> it happens kind of naturally since we poll irqfd anyway.
> 
> It's there for irqfd for historical reasons.

You're not dealing with such a limited resource for ioeventfds though.
It's pretty easily conceivable we could run out of irq source IDs.
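
To illustrate with a toy model (a sketch under stated assumptions, not
the kernel's allocator): assume source IDs come from a bitmap one
unsigned long wide, and treat the model_* names as hypothetical. If IDs
are never handed back, the pool is empty after that many requests.

```c
#include <assert.h>
#include <limits.h>

/* Assumed pool size for this sketch: one bit per id in an unsigned long. */
#define MODEL_MAX_SOURCE_IDS (sizeof(unsigned long) * CHAR_BIT)

/* Returns the lowest free id and marks it used, or -1 if the pool is empty. */
int model_request_source_id(unsigned long *bitmap)
{
	unsigned int id;

	for (id = 0; id < MODEL_MAX_SOURCE_IDS; id++) {
		if (!(*bitmap & (1UL << id))) {
			*bitmap |= 1UL << id;
			return (int)id;
		}
	}
	return -1;
}

/* Returns an id to the pool so a later request can reuse it. */
void model_free_source_id(unsigned long *bitmap, int id)
{
	assert(id >= 0 && (unsigned int)id < MODEL_MAX_SOURCE_IDS);
	*bitmap &= ~(1UL << id);
}

/* Counts how many ids can be handed out if none are ever returned. */
int model_count_until_exhausted(void)
{
	unsigned long bitmap = 0;
	int count = 0;

	while (model_request_source_id(&bitmap) >= 0)
		count++;
	return count;
}
```

Every stray eoifd that is closed without cleanup holds one of those
bits until the VM itself goes away.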

> > Fixing (a) is a simple flush, so I already added that.  To
> > solve (b), I think that saving the irqfd eventfd ctx was a bad idea.
> 
> I actually think we should just fix it. Scan eoifds when closing/opening
> irqfds and bind/unbind source id.

Hmm,  IMHO we had no business holding onto an eventfd ctx.  That was an
ugly implementation detail forced by the desire to allow the same
eventfd to be used in multiple eoifds.  The fallout from that leaves a
lasting tie between the eoifd and the future use of that eventfd.  I can
imagine the scenario you present is just one of the glitches and I
really don't want to have one interface disable another.

> > The new api I will propose to solve it is that kvm_irqfd returns a token
> > (or key) when used as a level irqfd (actually the irq source id, but the
> > user shouldn't care what it is).  We pass that into eoifd instead of the
> > irqfd.  That means that if the irqfd is closed and re-configured, the
> > user will get a new key and should have no expectation that it's tied to
> > the previous eoifd.  I'll add a comment for (c).  Thanks,
> > 
> > Alex
> 
> Hmm, another API rewrite, when I felt it is finally stabilizing. Maybe
> it's the right thing to do but it does feel like we change userspace ABI
> just because we have run into an implementation difficulty.
> 
> Pls note I'm offline next week so won't have time to review soon.

We could return the key in the struct kvm_irqfd if it adds anything, but
I felt returning the key was preferable and is compatible with the
existing ABI.  Thanks,

Alex



* Re: [PATCH v5 0/4] kvm: level irqfd and new eoifd
  2012-07-19 18:48       ` Alex Williamson
@ 2012-07-20 10:07         ` Michael S. Tsirkin
  2012-07-22 15:09           ` Gleb Natapov
  0 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2012-07-20 10:07 UTC (permalink / raw)
  To: Alex Williamson; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Thu, Jul 19, 2012 at 12:48:07PM -0600, Alex Williamson wrote:
> On Thu, 2012-07-19 at 20:45 +0300, Michael S. Tsirkin wrote:
> > On Thu, Jul 19, 2012 at 11:29:38AM -0600, Alex Williamson wrote:
> > > On Thu, 2012-07-19 at 19:59 +0300, Michael S. Tsirkin wrote:
> > > > On Mon, Jul 16, 2012 at 02:33:38PM -0600, Alex Williamson wrote:
> > > > > v5:
> > > > >  - irqfds now have a one-to-one mapping with eoifds to prevent users
> > > > >    from consuming all of kernel memory by repeatedly creating eoifds
> > > > >    from a single irqfd.
> > > > >  - implement a kvm_clear_irq() which does a test_and_clear_bit of
> > > > >    the irq_state, only updating the pic/ioapic if changes and allowing
> > > > >    the caller to know if anything was done.  I added this onto the end
> > > > >    as it's essentially an optimization on the previous design.  It's
> > > > >    hard to tell if there's an actual performance benefit to this.
> > > > >  - dropped eoifd gsi support patch as it was only an FYI.
> > > > > 
> > > > > Thanks,
> > > > > 
> > > > > Alex
> > > > 
> > > > 
> > > > So 3/4, 4/4 are racy and I think you convinced me it's best to drop them for
> > > > now. I hope the fact that we already scan all vcpus under spinlock for
> > > > level interrupts is enough to justify adding a lock here.
> > > > 
> > > > To summarize issues still outstanding with 1/2, 2/2:
> > > (a)
> > > > - source id lingering after irqfd was destroyed/deassigned
> > > >   prevents assigning a new irqfd
> > > (b)
> > > > - if same irqfd is deassigned and re-assigned, this
> > > >   seems to succeed but does not give any more EOIs
> > > (c)
> > > > - document that user needs to re-inject interrupts
> > > >   injected by level IRQFD after migration as they are cleared
> > > > 
> > > > Hope this helps!
> > > 
> > > Thanks, I'm refining and testing a re-write.  One thing I also noticed
> > > is that we don't do anything when the eoifd is closed.  We'll cleanup
> > > when kvm is closed, but that can leave a lot of stray eoifds, and
> > > therefore used irq_source_ids tied up.  So, I think I need to pull in a
> > > lot of the irqfd code just to be able to catch the POLLHUP and do
> > > cleanup.
> > 
> > I don't think it's worth it. With ioeventfd we have the same issue
> > and we don't care: userspace should just DEASSIGN before close.
> > With irqfd we committed to support cleanup by close but
> > it happens kind of naturally since we poll irqfd anyway.
> > 
> > It's there for irqfd for historical reasons.
> 
> You're not dealing with such a limited resource for ioeventfds though.
> It's pretty easily conceivable we could run out of irq source IDs.

Running out of fds is also very conceivable.  Not deassigning
before close is a userspace bug anyway.
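
To make the resource argument concrete, here is a toy userspace model of
a small ID pool. It is only a sketch, not the kernel's implementation:
the pool size and every toy_* name are hypothetical, picked for
illustration. It compares a userspace that deassigns before close with
one that just closes and so never returns the ID.

```c
#include <assert.h>

/* Hypothetical pool size, chosen small for illustration only. */
#define TOY_NR_SOURCE_IDS 8

/* Bit n set means toy source ID n is in use. */
static unsigned long toy_source_id_bitmap;

/* Hand out the lowest free ID, or -1 when the pool is exhausted. */
static int toy_request_source_id(void)
{
	for (int id = 0; id < TOY_NR_SOURCE_IDS; id++) {
		if (!(toy_source_id_bitmap & (1UL << id))) {
			toy_source_id_bitmap |= 1UL << id;
			return id;
		}
	}
	return -1;
}

/* Return an ID to the pool; models an explicit DEASSIGN. */
static void toy_free_source_id(int id)
{
	assert(id >= 0 && id < TOY_NR_SOURCE_IDS);
	toy_source_id_bitmap &= ~(1UL << id);
}

/*
 * Model a userspace that assigns 'rounds' times.  If 'deassign' is
 * zero it just closes the fd, so the toy ID is never returned.
 * Returns how many assignments succeeded.
 */
static int toy_run(int rounds, int deassign)
{
	int succeeded = 0;

	toy_source_id_bitmap = 0;
	for (int i = 0; i < rounds; i++) {
		int id = toy_request_source_id();

		if (id < 0)
			continue;	/* pool exhausted, assignment fails */
		succeeded++;
		if (deassign)
			toy_free_source_id(id);
	}
	return succeeded;
}
```

In this model toy_run(100, 1) succeeds 100 times, while toy_run(100, 0)
stops succeeding after TOY_NR_SOURCE_IDS assignments.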

> > > Fixing (a) is a simple flush, so I already added that.  To
> > > solve (b), I think that saving the irqfd eventfd ctx was a bad idea.
> > 
> > I actually think we should just fix it. Scan eoifds when closing/opening
> > irqfds and bind/unbind source id.
> 
> Hmm,  IMHO we had no business holding onto an eventfd ctx.  That was an
> ugly implementation detail forced by the desire to allow the same
> eventfd to be used in multiple eoifds.  The fallout from that leaves a
> lasting tie between the eoifd and the future use of that eventfd.  I can
> imagine the scenario you present is just one of the glitches and I
> really don't want to have one interface disable another.

Looks like this disabling is inherent in what we want eoifd to do.  You
bind irqfd and eoifd.  If the irqfd is deassigned, the eoifd will not get
any more events; it is disabled.  Whether it keeps the pointer to the
source id internally or not does not matter to the user.

> > > The new api I will propose to solve it is that kvm_irqfd returns a token
> > > (or key) when used as a level irqfd (actually the irq source id, but the
> > > user shouldn't care what it is).  We pass that into eoifd instead of the
> > > irqfd.  That means that if the irqfd is closed and re-configured, the
> > > user will get a new key and should have no expectation that it's tied to
> > > the previous eoifd.  I'll add a comment for (c).  Thanks,
> > > 
> > > Alex
> > 
> > Hmm, another API rewrite, when I felt it is finally stabilizing. Maybe
> > it's the right thing to do but it does feel like we change userspace ABI
> > just because we have run into an implementation difficulty.
> > 
> > Pls note I'm offline next week so won't have time to review soon.
> 
> We could return the key in the struct kvm_irqfd if it adds anything, but
> I felt returning the key was preferable and is compatible with the
> existing ABI.  Thanks,
> 
> Alex

You say it is preferable, but I wonder what it buys users compared to
using the fd directly; it is certainly more work for userspace to keep
track of it.

-- 
MST


* Re: [PATCH v5 0/4] kvm: level irqfd and new eoifd
  2012-07-20 10:07         ` Michael S. Tsirkin
@ 2012-07-22 15:09           ` Gleb Natapov
  0 siblings, 0 replies; 96+ messages in thread
From: Gleb Natapov @ 2012-07-22 15:09 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Alex Williamson, avi, kvm, linux-kernel, jan.kiszka

On Fri, Jul 20, 2012 at 01:07:32PM +0300, Michael S. Tsirkin wrote:
> On Thu, Jul 19, 2012 at 12:48:07PM -0600, Alex Williamson wrote:
> > On Thu, 2012-07-19 at 20:45 +0300, Michael S. Tsirkin wrote:
> > > On Thu, Jul 19, 2012 at 11:29:38AM -0600, Alex Williamson wrote:
> > > > On Thu, 2012-07-19 at 19:59 +0300, Michael S. Tsirkin wrote:
> > > > > On Mon, Jul 16, 2012 at 02:33:38PM -0600, Alex Williamson wrote:
> > > > > > v5:
> > > > > >  - irqfds now have a one-to-one mapping with eoifds to prevent users
> > > > > >    from consuming all of kernel memory by repeatedly creating eoifds
> > > > > >    from a single irqfd.
> > > > > >  - implement a kvm_clear_irq() which does a test_and_clear_bit of
> > > > > >    the irq_state, only updating the pic/ioapic if changes and allowing
> > > > > >    the caller to know if anything was done.  I added this onto the end
> > > > > >    as it's essentially an optimization on the previous design.  It's
> > > > > >    hard to tell if there's an actual performance benefit to this.
> > > > > >  - dropped eoifd gsi support patch as it was only an FYI.
> > > > > > 
> > > > > > Thanks,
> > > > > > 
> > > > > > Alex
> > > > > 
> > > > > 
> > > > > So 3/4, 4/4 are racy and I think you convinced me it's best to drop it for
> > > > > now. I hope that fact that we already scan all vcpus under spinlock for
> > > > > level interrupts is enough to justify adding a lock here.
> > > > > 
> > > > > To summarize issues still outstanding with 1/2, 2/2:
> > > > (a)
> > > > > - source id lingering after irqfd was destroyed/deassigned
> > > > >   prevents assigning a new irqfd
> > > > (b)
> > > > > - if same irqfd is deassigned and re-assigned, this
> > > > >   seems to succeed but does not give any more EOIs
> > > > (c)
> > > > > - document that user needs to re-inject interrupts
> > > > >   injected by level IRQFD after migration as they are cleared
> > > > > 
> > > > > Hope this helps!
> > > > 
> > > > Thanks, I'm refining and testing a re-write.  One thing I also noticed
> > > > is that we don't do anything when the eoifd is closed.  We'll cleanup
> > > > when kvm is closed, but that can leave a lot of stray eoifds, and
> > > > therefore used irq_source_ids tied up.  So, I think I need to pull in a
> > > > lot of the irqfd code just to be able to catch the POLLHUP and do
> > > > cleanup.
> > > 
> > > I don't think it's worth it. With ioeventfd we have the same issue
> > > and we don't care: userspace should just DEASSIGN before close.
> > > With irqfd we committed to support cleanup by close but
> > > it happens kind of naturally since we poll irqfd anyway.
> > > 
> > > It's there for irqfd for historical reasons.
> > 
> > You're not dealing with such a limited resource for ioeventfds though.
> > It's pretty easily conceivable we could run out of irq source IDs.
> 
> Running out of fds is also very conceivable.  Not deassigning
> before close is a userspace bug anyway.
> 
Close should free all resources allocated by an fd. What if the code that
closes the fd has no idea what cleanup should be done (e.g. the fd was
passed over a unix socket)? Heck, the code may not even have permission
to call the ioctl to deassign.
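
A minimal sketch of the release-on-last-close behaviour argued for here,
as a toy userspace model rather than kernel code. The struct and all
toy_* names are hypothetical. The point it shows is that the last holder
of a reference can free the resource without knowing anything about it
and without issuing any ioctl.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a kernel file object shared by several fds. */
struct toy_file {
	int refcount;				/* number of open references */
	void (*release)(struct toy_file *);	/* run when refcount hits 0 */
	int *resource_in_use;			/* e.g. a claimed source ID */
};

/* Release hook: drop whatever the file had claimed. */
static void toy_eoifd_release(struct toy_file *f)
{
	*f->resource_in_use = 0;
}

/* Model dup() or SCM_RIGHTS passing: one more reference to the file. */
static void toy_get(struct toy_file *f)
{
	f->refcount++;
}

/* Model close(): the last reference triggers the release hook. */
static void toy_close(struct toy_file *f)
{
	assert(f->refcount > 0);
	if (--f->refcount == 0 && f->release != NULL)
		f->release(f);
}

/* Returns 1 if the resource is freed only after both holders close. */
static int toy_demo(void)
{
	int in_use = 1;
	struct toy_file f = { 1, toy_eoifd_release, &in_use };

	toy_get(&f);		/* fd passed to a second process */
	toy_close(&f);		/* creator closes; resource still held */
	if (!in_use)
		return 0;
	toy_close(&f);		/* receiver closes, knowing nothing else */
	return !in_use;
}
```

Here toy_demo() returns 1: the resource survives the creator's close and
is freed by the receiver's plain close.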

--
			Gleb.



Thread overview: 96+ messages
2012-07-16 20:33 [PATCH v5 0/4] kvm: level irqfd and new eoifd Alex Williamson
2012-07-16 20:33 ` [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts Alex Williamson
2012-07-17 21:26   ` Michael S. Tsirkin
2012-07-17 21:57     ` Alex Williamson
2012-07-17 22:00       ` Michael S. Tsirkin
2012-07-17 22:16         ` Alex Williamson
2012-07-17 22:28           ` Michael S. Tsirkin
2012-07-18 10:41   ` Michael S. Tsirkin
2012-07-18 10:44     ` Gleb Natapov
2012-07-18 10:48       ` Michael S. Tsirkin
2012-07-18 10:49         ` Gleb Natapov
2012-07-18 10:53           ` Michael S. Tsirkin
2012-07-18 10:55             ` Gleb Natapov
2012-07-18 11:22               ` Michael S. Tsirkin
2012-07-18 11:39                 ` Michael S. Tsirkin
2012-07-18 11:48                   ` Gleb Natapov
2012-07-18 12:07                     ` Michael S. Tsirkin
2012-07-18 14:47                       ` Alex Williamson
2012-07-18 15:38                         ` Michael S. Tsirkin
2012-07-18 15:48                           ` Alex Williamson
2012-07-18 15:58                             ` Michael S. Tsirkin
2012-07-18 18:42                               ` Marcelo Tosatti
2012-07-18 19:00                                 ` Gleb Natapov
2012-07-18 19:07                                 ` Alex Williamson
2012-07-18 19:13                                   ` Alex Williamson
2012-07-18 19:16                                     ` Michael S. Tsirkin
2012-07-18 20:28                                     ` Alex Williamson
2012-07-18 21:23                                       ` Marcelo Tosatti
2012-07-18 21:30                                         ` Michael S. Tsirkin
2012-07-16 20:33 ` [PATCH v5 2/4] kvm: KVM_EOIFD, an eventfd for EOIs Alex Williamson
2012-07-17 10:21   ` Michael S. Tsirkin
2012-07-17 13:59     ` Alex Williamson
2012-07-17 14:10       ` Michael S. Tsirkin
2012-07-17 14:29         ` Alex Williamson
2012-07-17 14:42           ` Michael S. Tsirkin
2012-07-17 14:57             ` Alex Williamson
2012-07-17 15:13               ` Michael S. Tsirkin
2012-07-17 15:41                 ` Alex Williamson
2012-07-17 15:53                   ` Michael S. Tsirkin
2012-07-17 16:06                     ` Alex Williamson
2012-07-17 16:19                       ` Michael S. Tsirkin
2012-07-17 16:52                         ` Alex Williamson
2012-07-17 18:58                           ` Michael S. Tsirkin
2012-07-17 20:03                             ` Alex Williamson
2012-07-17 21:23                               ` Michael S. Tsirkin
2012-07-17 22:09                                 ` Alex Williamson
2012-07-17 22:24                                   ` Michael S. Tsirkin
2012-07-18  2:44                                     ` Alex Williamson
2012-07-18 10:31                                       ` Michael S. Tsirkin
2012-07-16 20:34 ` [PATCH v5 3/4] kvm: Create kvm_clear_irq() Alex Williamson
2012-07-17  0:51   ` Michael S. Tsirkin
2012-07-17  2:42     ` Alex Williamson
2012-07-17  0:55   ` Michael S. Tsirkin
2012-07-17 10:14   ` Michael S. Tsirkin
2012-07-17 13:56     ` Alex Williamson
2012-07-17 14:08       ` Michael S. Tsirkin
2012-07-17 14:21         ` Alex Williamson
2012-07-17 14:53           ` Michael S. Tsirkin
2012-07-17 15:20             ` Alex Williamson
2012-07-17 15:36               ` Michael S. Tsirkin
2012-07-17 15:51                 ` Alex Williamson
2012-07-17 15:57                   ` Michael S. Tsirkin
2012-07-17 16:01                     ` Gleb Natapov
2012-07-17 16:08                     ` Alex Williamson
2012-07-17 16:14                       ` Michael S. Tsirkin
2012-07-17 16:17                         ` Alex Williamson
2012-07-17 16:21                           ` Michael S. Tsirkin
2012-07-17 16:45                             ` Alex Williamson
2012-07-17 18:55                               ` Michael S. Tsirkin
2012-07-17 19:51                                 ` Alex Williamson
2012-07-17 21:05                                   ` Michael S. Tsirkin
2012-07-17 22:01                                     ` Alex Williamson
2012-07-17 22:05                                       ` Michael S. Tsirkin
2012-07-17 22:22                                         ` Alex Williamson
2012-07-17 22:31                                           ` Michael S. Tsirkin
2012-07-18  6:27                         ` Gleb Natapov
2012-07-18 10:20                           ` Michael S. Tsirkin
2012-07-18 10:27                             ` Gleb Natapov
2012-07-18 10:33                               ` Michael S. Tsirkin
2012-07-18 10:36                                 ` Gleb Natapov
2012-07-18 10:51                                   ` Michael S. Tsirkin
2012-07-18 10:53                                     ` Gleb Natapov
2012-07-18 11:08                                       ` Michael S. Tsirkin
2012-07-18 11:50                                         ` Gleb Natapov
2012-07-18 21:55                           ` Michael S. Tsirkin
2012-07-17 16:36                       ` Michael S. Tsirkin
2012-07-17 17:09                         ` Gleb Natapov
2012-07-17 10:18   ` Michael S. Tsirkin
2012-07-16 20:34 ` [PATCH v5 4/4] kvm: Convert eoifd to use kvm_clear_irq Alex Williamson
2012-07-18 10:43 ` [PATCH v5 0/4] kvm: level irqfd and new eoifd Michael S. Tsirkin
2012-07-19 16:59 ` Michael S. Tsirkin
2012-07-19 17:29   ` Alex Williamson
2012-07-19 17:45     ` Michael S. Tsirkin
2012-07-19 18:48       ` Alex Williamson
2012-07-20 10:07         ` Michael S. Tsirkin
2012-07-22 15:09           ` Gleb Natapov
