* [RFC v3 0/5] Introduce MMIO/PIO dispatch file descriptors (ioregionfd)
From: Elena Afanasova @ 2021-02-21 12:04 UTC
  To: kvm
  Cc: stefanha, jag.raman, elena.ufimtseva, pbonzini, jasowang, mst,
	cohuck, john.levon, Elena Afanasova

This patchset introduces a KVM dispatch mechanism that handles MMIO/PIO
accesses over file descriptors without returning from ioctl(KVM_RUN).
This allows device emulation to run in a task separate from the vCPU
task.

This is achieved through a KVM vm ioctl for registering MMIO/PIO
regions and a wire protocol that KVM uses to communicate with the task
handling an MMIO/PIO access.
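
To make the flow concrete, here is a minimal userspace sketch (not part
of the series; error handling is omitted, vm_fd is an already-created
VM file descriptor, and the guest physical address and the
emulate_read()/emulate_write() device-model hooks are hypothetical):

    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <linux/kvm.h>
    #include <linux/ioregion.h>

    /* hypothetical device-model hooks */
    extern __u64 emulate_read(__u64 offset, unsigned int size);
    extern void emulate_write(__u64 offset, __u64 data, unsigned int size);

    static void register_region(int vm_fd, int sv[2])
    {
            struct kvm_ioregion region;

            /* one end goes to KVM, the other to the emulation task */
            socketpair(AF_UNIX, SOCK_STREAM, 0, sv);

            memset(&region, 0, sizeof(region));
            region.guest_paddr = 0xfe000000; /* hypothetical MMIO address */
            region.memory_size = 0x1000;
            region.user_data   = 0x1234;     /* token echoed back in each cmd */
            region.read_fd     = sv[0];      /* KVM reads replies from here */
            region.write_fd    = sv[0];      /* KVM writes commands to here */
            ioctl(vm_fd, KVM_SET_IOREGION, &region);
    }

    /* runs in a task separate from the vCPU task */
    static void serve_ioregion(int fd)
    {
            struct ioregionfd_cmd cmd;
            struct ioregionfd_resp resp;

            for (;;) {
                    /* blocks until the guest touches the region */
                    read(fd, &cmd, sizeof(cmd));
                    memset(&resp, 0, sizeof(resp));
                    if (cmd.cmd == IOREGIONFD_CMD_READ)
                            resp.data = emulate_read(cmd.offset,
                                                     1 << cmd.size_exponent);
                    else
                            emulate_write(cmd.offset, cmd.data,
                                          1 << cmd.size_exponent);
                    if (cmd.resp)
                            write(fd, &resp, sizeof(resp));
            }
    }

Here serve_ioregion() would be passed sv[1], the end of the socketpair
not registered with KVM.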

TODOs:
* Implement KVM_EXIT_IOREGIONFD_FAILURE
* Add non-x86 arch support
* Add kvm-unittests
* Flush waiters if ioregion is deleted

v3:
 - add FAST_MMIO bus support
 - add KVM_IOREGION_DEASSIGN flag
 - rename kvm_ioregion read/write file descriptors
 - split ioregionfd signal handling support into two patches
 - move ioregion_interrupted flag to ioregion_ctx
 - reorder ioregion_ctx fields
 - rework complete_ioregion operations 
 - add signal handling support for crossing a page boundary case
 - change wire protocol license
 - fix ioregionfd state machine
 - remove ioregionfd_cmd info and drop appropriate macros
 - add comment on ioregionfd cmds/replies serialization
 - drop kvm_io_bus_finish/prepare()

Elena Afanasova (5):
  KVM: add initial support for KVM_SET_IOREGION
  KVM: x86: add support for ioregionfd signal handling
  KVM: implement wire protocol
  KVM: add ioregionfd context
  KVM: enforce NR_IOBUS_DEVS limit if kmemcg is disabled

 arch/x86/kvm/Kconfig          |   1 +
 arch/x86/kvm/Makefile         |   1 +
 arch/x86/kvm/vmx/vmx.c        |  40 ++-
 arch/x86/kvm/x86.c            | 273 +++++++++++++++++-
 include/linux/kvm_host.h      |  28 ++
 include/uapi/linux/ioregion.h |  30 ++
 include/uapi/linux/kvm.h      |  25 ++
 virt/kvm/Kconfig              |   3 +
 virt/kvm/eventfd.c            |  25 ++
 virt/kvm/eventfd.h            |  14 +
 virt/kvm/ioregion.c           | 529 ++++++++++++++++++++++++++++++++++
 virt/kvm/ioregion.h           |  15 +
 virt/kvm/kvm_main.c           |  36 ++-
 13 files changed, 996 insertions(+), 24 deletions(-)
 create mode 100644 include/uapi/linux/ioregion.h
 create mode 100644 virt/kvm/eventfd.h
 create mode 100644 virt/kvm/ioregion.c
 create mode 100644 virt/kvm/ioregion.h

-- 
2.25.1



* [RFC v3 1/5] KVM: add initial support for KVM_SET_IOREGION
From: Elena Afanasova @ 2021-02-21 12:04 UTC
  To: kvm
  Cc: stefanha, jag.raman, elena.ufimtseva, pbonzini, jasowang, mst,
	cohuck, john.levon, Elena Afanasova

This vm ioctl adds or removes an ioregionfd MMIO/PIO region. Guest
read and write accesses are dispatched through the given ioregionfd
instead of returning from ioctl(KVM_RUN).
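
For example, assigning and then deassigning a PIO region could look
like this (a sketch; vm_fd is the VM file descriptor and rfd/wfd are
assumed to be pre-created pipe or socket descriptors):

    struct kvm_ioregion pio = {
            .guest_paddr = 0x3f8,          /* hypothetical I/O port */
            .memory_size = 8,
            .read_fd     = rfd,
            .write_fd    = wfd,
            .flags       = KVM_IOREGION_PIO,
    };

    ioctl(vm_fd, KVM_SET_IOREGION, &pio);  /* assign */

    pio.flags |= KVM_IOREGION_DEASSIGN;
    ioctl(vm_fd, KVM_SET_IOREGION, &pio);  /* remove; matched by paddr/size */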

Signed-off-by: Elena Afanasova <eafanasova@gmail.com>
---
v3:
 - add FAST_MMIO bus support
 - add KVM_IOREGION_DEASSIGN flag
 - rename kvm_ioregion read/write file descriptors

 arch/x86/kvm/Kconfig     |   1 +
 arch/x86/kvm/Makefile    |   1 +
 arch/x86/kvm/x86.c       |   1 +
 include/linux/kvm_host.h |  18 +++
 include/uapi/linux/kvm.h |  25 ++++
 virt/kvm/Kconfig         |   3 +
 virt/kvm/eventfd.c       |  25 ++++
 virt/kvm/eventfd.h       |  14 +++
 virt/kvm/ioregion.c      | 265 +++++++++++++++++++++++++++++++++++++++
 virt/kvm/ioregion.h      |  15 +++
 virt/kvm/kvm_main.c      |  11 ++
 11 files changed, 379 insertions(+)
 create mode 100644 virt/kvm/eventfd.h
 create mode 100644 virt/kvm/ioregion.c
 create mode 100644 virt/kvm/ioregion.h

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index f92dfd8ef10d..b914ef375199 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -33,6 +33,7 @@ config KVM
 	select HAVE_KVM_IRQ_BYPASS
 	select HAVE_KVM_IRQ_ROUTING
 	select HAVE_KVM_EVENTFD
+	select KVM_IOREGION
 	select KVM_ASYNC_PF
 	select USER_RETURN_NOTIFIER
 	select KVM_MMIO
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index b804444e16d4..b3b17dc9f7d4 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -12,6 +12,7 @@ KVM := ../../../virt/kvm
 kvm-y			+= $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
 				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
 kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(KVM)/async_pf.o
+kvm-$(CONFIG_KVM_IOREGION)	+= $(KVM)/ioregion.o
 
 kvm-y			+= x86.o emulate.o i8259.o irq.o lapic.o \
 			   i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e545a8a613b1..ddb28f5ca252 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3739,6 +3739,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_X86_USER_SPACE_MSR:
 	case KVM_CAP_X86_MSR_FILTER:
 	case KVM_CAP_ENFORCE_PV_FEATURE_CPUID:
+	case KVM_CAP_IOREGIONFD:
 		r = 1;
 		break;
 	case KVM_CAP_SYNC_REGS:
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7f2e2a09ebbd..f35f0976f5cf 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -470,6 +470,11 @@ struct kvm {
 		struct mutex      resampler_lock;
 	} irqfds;
 	struct list_head ioeventfds;
+#endif
+#ifdef CONFIG_KVM_IOREGION
+	struct list_head ioregions_fast_mmio;
+	struct list_head ioregions_mmio;
+	struct list_head ioregions_pio;
 #endif
 	struct kvm_vm_stat stat;
 	struct kvm_arch arch;
@@ -1262,6 +1267,19 @@ static inline int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
 
 #endif /* CONFIG_HAVE_KVM_EVENTFD */
 
+#ifdef CONFIG_KVM_IOREGION
+void kvm_ioregionfd_init(struct kvm *kvm);
+int kvm_ioregionfd(struct kvm *kvm, struct kvm_ioregion *args);
+
+#else
+
+static inline void kvm_ioregionfd_init(struct kvm *kvm) {}
+static inline int kvm_ioregionfd(struct kvm *kvm, struct kvm_ioregion *args)
+{
+	return -ENOSYS;
+}
+#endif
+
 void kvm_arch_irq_routing_update(struct kvm *kvm);
 
 static inline void kvm_make_request(int req, struct kvm_vcpu *vcpu)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index ca41220b40b8..a1b1a60571f8 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -732,6 +732,29 @@ struct kvm_ioeventfd {
 	__u8  pad[36];
 };
 
+enum {
+	kvm_ioregion_flag_nr_pio,
+	kvm_ioregion_flag_nr_posted_writes,
+	kvm_ioregion_flag_nr_deassign,
+	kvm_ioregion_flag_nr_max,
+};
+
+#define KVM_IOREGION_PIO (1 << kvm_ioregion_flag_nr_pio)
+#define KVM_IOREGION_POSTED_WRITES (1 << kvm_ioregion_flag_nr_posted_writes)
+#define KVM_IOREGION_DEASSIGN (1 << kvm_ioregion_flag_nr_deassign)
+
+#define KVM_IOREGION_VALID_FLAG_MASK ((1 << kvm_ioregion_flag_nr_max) - 1)
+
+struct kvm_ioregion {
+	__u64 guest_paddr; /* guest physical address */
+	__u64 memory_size; /* bytes */
+	__u64 user_data;
+	__s32 read_fd;
+	__s32 write_fd;
+	__u32 flags;
+	__u8  pad[28];
+};
+
 #define KVM_X86_DISABLE_EXITS_MWAIT          (1 << 0)
 #define KVM_X86_DISABLE_EXITS_HLT            (1 << 1)
 #define KVM_X86_DISABLE_EXITS_PAUSE          (1 << 2)
@@ -1053,6 +1076,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_X86_USER_SPACE_MSR 188
 #define KVM_CAP_X86_MSR_FILTER 189
 #define KVM_CAP_ENFORCE_PV_FEATURE_CPUID 190
+#define KVM_CAP_IOREGIONFD 191
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1308,6 +1332,7 @@ struct kvm_vfio_spapr_tce {
 					struct kvm_userspace_memory_region)
 #define KVM_SET_TSS_ADDR          _IO(KVMIO,   0x47)
 #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO,  0x48, __u64)
+#define KVM_SET_IOREGION          _IOW(KVMIO,  0x49, struct kvm_ioregion)
 
 /* enable ucontrol for s390 */
 struct kvm_s390_ucas_mapping {
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 1c37ccd5d402..5e6620bbf000 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -17,6 +17,9 @@ config HAVE_KVM_EVENTFD
        bool
        select EVENTFD
 
+config KVM_IOREGION
+       bool
+
 config KVM_MMIO
        bool
 
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index c2323c27a28b..aadb73903f8b 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -27,6 +27,7 @@
 #include <trace/events/kvm.h>
 
 #include <kvm/iodev.h>
+#include "ioregion.h"
 
 #ifdef CONFIG_HAVE_KVM_IRQFD
 
@@ -755,6 +756,23 @@ static const struct kvm_io_device_ops ioeventfd_ops = {
 	.destructor = ioeventfd_destructor,
 };
 
+#ifdef CONFIG_KVM_IOREGION
+/* assumes kvm->slots_lock held */
+bool kvm_eventfd_collides(struct kvm *kvm, int bus_idx,
+			  u64 start, u64 size)
+{
+	struct _ioeventfd *_p;
+
+	list_for_each_entry(_p, &kvm->ioeventfds, list)
+		if (_p->bus_idx == bus_idx &&
+		    overlap(start, size, _p->addr,
+			    !_p->length ? 8 : _p->length))
+			return true;
+
+	return false;
+}
+#endif
+
 /* assumes kvm->slots_lock held */
 static bool
 ioeventfd_check_collision(struct kvm *kvm, struct _ioeventfd *p)
@@ -770,6 +788,13 @@ ioeventfd_check_collision(struct kvm *kvm, struct _ioeventfd *p)
 		       _p->datamatch == p->datamatch))))
 			return true;
 
+#ifdef CONFIG_KVM_IOREGION
+	if (p->bus_idx == KVM_MMIO_BUS || p->bus_idx == KVM_PIO_BUS)
+		if (kvm_ioregion_collides(kvm, p->bus_idx, p->addr,
+					  !p->length ? 8 : p->length))
+			return true;
+#endif
+
 	return false;
 }
 
diff --git a/virt/kvm/eventfd.h b/virt/kvm/eventfd.h
new file mode 100644
index 000000000000..73a621eebae3
--- /dev/null
+++ b/virt/kvm/eventfd.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __KVM_EVENTFD_H__
+#define __KVM_EVENTFD_H__
+
+#ifdef CONFIG_KVM_IOREGION
+bool kvm_eventfd_collides(struct kvm *kvm, int bus_idx, u64 start, u64 size);
+#else
+static inline bool
+kvm_eventfd_collides(struct kvm *kvm, int bus_idx, u64 start, u64 size)
+{
+	return false;
+}
+#endif
+#endif
diff --git a/virt/kvm/ioregion.c b/virt/kvm/ioregion.c
new file mode 100644
index 000000000000..e09ef3e2c9d7
--- /dev/null
+++ b/virt/kvm/ioregion.c
@@ -0,0 +1,265 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/kvm_host.h>
+#include <linux/fs.h>
+#include <kvm/iodev.h>
+#include "eventfd.h"
+
+void
+kvm_ioregionfd_init(struct kvm *kvm)
+{
+	INIT_LIST_HEAD(&kvm->ioregions_fast_mmio);
+	INIT_LIST_HEAD(&kvm->ioregions_mmio);
+	INIT_LIST_HEAD(&kvm->ioregions_pio);
+}
+
+struct ioregion {
+	struct list_head     list;
+	u64                  paddr;  /* guest physical address */
+	u64                  size;   /* size in bytes */
+	struct file         *rf;
+	struct file         *wf;
+	u64                  user_data; /* opaque token used by userspace */
+	struct kvm_io_device dev;
+	bool                 posted_writes;
+};
+
+static inline struct ioregion *
+to_ioregion(struct kvm_io_device *dev)
+{
+	return container_of(dev, struct ioregion, dev);
+}
+
+/* assumes kvm->slots_lock held */
+static void
+ioregion_release(struct ioregion *p)
+{
+	if (p->rf)
+		fput(p->rf);
+	fput(p->wf);
+	list_del(&p->list);
+	kfree(p);
+}
+
+static int
+ioregion_read(struct kvm_vcpu *vcpu, struct kvm_io_device *this, gpa_t addr,
+	      int len, void *val)
+{
+	return -EOPNOTSUPP;
+}
+
+static int
+ioregion_write(struct kvm_vcpu *vcpu, struct kvm_io_device *this, gpa_t addr,
+		int len, const void *val)
+{
+	return -EOPNOTSUPP;
+}
+
+/*
+ * This function is called as KVM is completely shutting down.  We do not
+ * need to worry about locking; just nuke anything we have as quickly as possible.
+ */
+static void
+ioregion_destructor(struct kvm_io_device *this)
+{
+	struct ioregion *p = to_ioregion(this);
+
+	ioregion_release(p);
+}
+
+static const struct kvm_io_device_ops ioregion_ops = {
+	.read       = ioregion_read,
+	.write      = ioregion_write,
+	.destructor = ioregion_destructor,
+};
+
+static inline struct list_head *
+get_ioregion_list(struct kvm *kvm, enum kvm_bus bus_idx)
+{
+	if (bus_idx == KVM_FAST_MMIO_BUS)
+		return &kvm->ioregions_fast_mmio;
+	if (bus_idx == KVM_MMIO_BUS)
+		return &kvm->ioregions_mmio;
+	/* bus_idx == KVM_PIO_BUS; return explicitly instead of falling
+	 * off the end of a non-void function
+	 */
+	return &kvm->ioregions_pio;
+}
+
+/* returns true if the two inclusive [start, start + size - 1] ranges overlap */
+inline bool
+overlap(u64 start1, u64 size1, u64 start2, u64 size2)
+{
+	u64 end1 = start1 + size1 - 1;
+	u64 end2 = start2 + size2 - 1;
+
+	return !(end1 < start2 || start1 > end2);
+}
+
+/* assumes kvm->slots_lock held */
+bool
+kvm_ioregion_collides(struct kvm *kvm, int bus_idx,
+		      u64 start, u64 size)
+{
+	struct ioregion *p;
+	struct list_head *ioregions = get_ioregion_list(kvm, bus_idx);
+
+	list_for_each_entry(p, ioregions, list)
+		if (overlap(start, size, p->paddr, !p->size ? 8 : p->size))
+			return true;
+
+	return false;
+}
+
+/* assumes kvm->slots_lock held */
+static bool
+ioregion_collision(struct kvm *kvm, struct ioregion *p, enum kvm_bus bus_idx)
+{
+	if (kvm_ioregion_collides(kvm, bus_idx, p->paddr, !p->size ? 8 : p->size) ||
+	    kvm_eventfd_collides(kvm, bus_idx, p->paddr, !p->size ? 8 : p->size))
+		return true;
+
+	return false;
+}
+
+static enum kvm_bus
+get_bus_from_flags(__u32 flags)
+{
+	if (flags & KVM_IOREGION_PIO)
+		return KVM_PIO_BUS;
+	return KVM_MMIO_BUS;
+}
+
+int
+kvm_set_ioregion_idx(struct kvm *kvm, struct kvm_ioregion *args, enum kvm_bus bus_idx)
+{
+	struct ioregion *p;
+	struct file *rfile = NULL, *wfile;
+	int ret = 0;
+
+	wfile = fget(args->write_fd);
+	if (!wfile)
+		return -EBADF;
+	if (args->memory_size) {
+		rfile = fget(args->read_fd);
+		if (!rfile) {
+			fput(wfile);
+			return -EBADF;
+		}
+	}
+	p = kzalloc(sizeof(*p), GFP_KERNEL_ACCOUNT);
+	if (!p) {
+		ret = -ENOMEM;
+		goto fail;
+	}
+
+	INIT_LIST_HEAD(&p->list);
+	p->paddr = args->guest_paddr;
+	p->size = args->memory_size;
+	p->user_data = args->user_data;
+	p->rf = rfile;
+	p->wf = wfile;
+	p->posted_writes = args->flags & KVM_IOREGION_POSTED_WRITES;
+
+	mutex_lock(&kvm->slots_lock);
+
+	if (ioregion_collision(kvm, p, bus_idx)) {
+		ret = -EEXIST;
+		goto unlock_fail;
+	}
+	kvm_iodevice_init(&p->dev, &ioregion_ops);
+	ret = kvm_io_bus_register_dev(kvm, bus_idx, p->paddr, p->size,
+				      &p->dev);
+	if (ret < 0)
+		goto unlock_fail;
+	list_add_tail(&p->list, get_ioregion_list(kvm, bus_idx));
+
+	mutex_unlock(&kvm->slots_lock);
+
+	return 0;
+
+unlock_fail:
+	mutex_unlock(&kvm->slots_lock);
+	kfree(p);
+fail:
+	if (rfile)
+		fput(rfile);
+	fput(wfile);
+
+	return ret;
+}
+
+static int
+kvm_rm_ioregion_idx(struct kvm *kvm, struct kvm_ioregion *args, enum kvm_bus bus_idx)
+{
+	struct ioregion *p, *tmp;
+	int ret = -ENOENT;
+
+	struct list_head *ioregions = get_ioregion_list(kvm, bus_idx);
+
+	mutex_lock(&kvm->slots_lock);
+
+	list_for_each_entry_safe(p, tmp, ioregions, list) {
+		if (p->paddr == args->guest_paddr  &&
+		    p->size == args->memory_size) {
+			kvm_io_bus_unregister_dev(kvm, bus_idx, &p->dev);
+			ioregion_release(p);
+			ret = 0;
+			break;
+		}
+	}
+
+	mutex_unlock(&kvm->slots_lock);
+
+	return ret;
+}
+
+static int
+kvm_set_ioregion(struct kvm *kvm, struct kvm_ioregion *args)
+{
+	int ret;
+
+	enum kvm_bus bus_idx = get_bus_from_flags(args->flags);
+
+	/* check for range overflow */
+	if (args->guest_paddr + args->memory_size < args->guest_paddr)
+		return -EINVAL;
+	/* If size is ignored, only posted writes are allowed */
+	if (!args->memory_size && !(args->flags & KVM_IOREGION_POSTED_WRITES))
+		return -EINVAL;
+
+	ret = kvm_set_ioregion_idx(kvm, args, bus_idx);
+	if (ret)
+		return ret;
+
+	/* If size is ignored, MMIO is also put on a FAST_MMIO bus */
+	if (!args->memory_size && bus_idx == KVM_MMIO_BUS)
+		ret = kvm_set_ioregion_idx(kvm, args, KVM_FAST_MMIO_BUS);
+	if (ret) {
+		kvm_rm_ioregion_idx(kvm, args, bus_idx);
+		return ret;
+	}
+
+	return 0;
+}
+
+static int
+kvm_rm_ioregion(struct kvm *kvm, struct kvm_ioregion *args)
+{
+	enum kvm_bus bus_idx = get_bus_from_flags(args->flags);
+	int ret = kvm_rm_ioregion_idx(kvm, args, bus_idx);
+
+	if (!args->memory_size && bus_idx == KVM_MMIO_BUS)
+		kvm_rm_ioregion_idx(kvm, args, KVM_FAST_MMIO_BUS);
+
+	return ret;
+}
+
+int
+kvm_ioregionfd(struct kvm *kvm, struct kvm_ioregion *args)
+{
+	if (args->flags & ~KVM_IOREGION_VALID_FLAG_MASK)
+		return -EINVAL;
+
+	if (args->flags & KVM_IOREGION_DEASSIGN)
+		return kvm_rm_ioregion(kvm, args);
+
+	return kvm_set_ioregion(kvm, args);
+}
diff --git a/virt/kvm/ioregion.h b/virt/kvm/ioregion.h
new file mode 100644
index 000000000000..23ffa812ec7a
--- /dev/null
+++ b/virt/kvm/ioregion.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __KVM_IOREGION_H__
+#define __KVM_IOREGION_H__
+
+#ifdef CONFIG_KVM_IOREGION
+inline bool overlap(u64 start1, u64 size1, u64 start2, u64 size2);
+bool kvm_ioregion_collides(struct kvm *kvm, int bus_idx, u64 start, u64 size);
+#else
+static inline bool
+kvm_ioregion_collides(struct kvm *kvm, int bus_idx, u64 start, u64 size)
+{
+	return false;
+}
+#endif
+#endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2541a17ff1c4..88b92fc3da51 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -747,6 +747,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
 	mmgrab(current->mm);
 	kvm->mm = current->mm;
 	kvm_eventfd_init(kvm);
+	kvm_ioregionfd_init(kvm);
 	mutex_init(&kvm->lock);
 	mutex_init(&kvm->irq_lock);
 	mutex_init(&kvm->slots_lock);
@@ -3708,6 +3709,16 @@ static long kvm_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
 		break;
 	}
+	case KVM_SET_IOREGION: {
+		struct kvm_ioregion data;
+
+		r = -EFAULT;
+		if (copy_from_user(&data, argp, sizeof(data)))
+			goto out;
+
+		r = kvm_ioregionfd(kvm, &data);
+		break;
+	}
 	case KVM_GET_DIRTY_LOG: {
 		struct kvm_dirty_log log;
 
-- 
2.25.1



* [RFC v3 2/5] KVM: x86: add support for ioregionfd signal handling
From: Elena Afanasova @ 2021-02-21 12:04 UTC
  To: kvm
  Cc: stefanha, jag.raman, elena.ufimtseva, pbonzini, jasowang, mst,
	cohuck, john.levon, Elena Afanasova

The vCPU thread may receive a signal during ioregionfd communication.
In that case ioctl(KVM_RUN) needs to return to userspace, and a
subsequent ioctl(KVM_RUN) must resume the interrupted ioregionfd
access.
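
From userspace this looks like a normal interrupted ioctl; a sketch of
the vCPU thread loop under that assumption (vcpu_fd is assumed to be
the vCPU file descriptor):

    /* An ioregionfd access interrupted by a signal surfaces as EINTR
     * with exit_reason KVM_EXIT_INTR; re-entering KVM_RUN resumes the
     * access from the saved ioregion context.
     */
    for (;;) {
            if (ioctl(vcpu_fd, KVM_RUN, 0) < 0) {
                    if (errno == EINTR)
                            continue; /* handle the signal, then resume */
                    break;            /* real error */
            }
            /* dispatch run->exit_reason as usual */
    }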

Signed-off-by: Elena Afanasova <eafanasova@gmail.com>
---
v3:
 - add FAST_MMIO bus support
 - move ioregion_interrupted flag to ioregion_ctx
 - reorder ioregion_ctx fields
 - rework complete_ioregion operations 
 - add signal handling support for crossing a page boundary case
 - make kvm_arch_vcpu_ioctl_run() return -EINTR in case ioregionfd
   is interrupted

 arch/x86/kvm/vmx/vmx.c   |  40 +++++-
 arch/x86/kvm/x86.c       | 272 +++++++++++++++++++++++++++++++++++++--
 include/linux/kvm_host.h |  10 ++
 virt/kvm/kvm_main.c      |  16 ++-
 4 files changed, 317 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 47b8357b9751..39db31afd27e 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5357,19 +5357,51 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
 	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
 }
 
+#ifdef CONFIG_KVM_IOREGION
+static int complete_ioregion_fast_mmio(struct kvm_vcpu *vcpu)
+{
+	int ret, idx;
+
+	idx = srcu_read_lock(&vcpu->kvm->srcu);
+	ret = kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS,
+			       vcpu->ioregion_ctx.addr, 0, NULL);
+	if (ret) {
+		ret = kvm_mmu_page_fault(vcpu, vcpu->ioregion_ctx.addr,
+					 PFERR_RSVD_MASK, NULL, 0);
+		srcu_read_unlock(&vcpu->kvm->srcu, idx);
+		return ret;
+	}
+
+	srcu_read_unlock(&vcpu->kvm->srcu, idx);
+	return kvm_skip_emulated_instruction(vcpu);
+}
+#endif
+
 static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
 {
 	gpa_t gpa;
+	int ret;
 
 	/*
 	 * A nested guest cannot optimize MMIO vmexits, because we have an
 	 * nGPA here instead of the required GPA.
 	 */
 	gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
-	if (!is_guest_mode(vcpu) &&
-	    !kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
-		trace_kvm_fast_mmio(gpa);
-		return kvm_skip_emulated_instruction(vcpu);
+	if (!is_guest_mode(vcpu)) {
+		ret = kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL);
+		if (!ret) {
+			trace_kvm_fast_mmio(gpa);
+			return kvm_skip_emulated_instruction(vcpu);
+		}
+
+#ifdef CONFIG_KVM_IOREGION
+		if (unlikely(vcpu->ioregion_ctx.is_interrupted && ret == -EINTR)) {
+			vcpu->run->exit_reason = KVM_EXIT_INTR;
+			vcpu->arch.complete_userspace_io = complete_ioregion_fast_mmio;
+			++vcpu->stat.signal_exits;
+			return ret;
+		}
+#endif
 	}
 
 	return kvm_mmu_page_fault(vcpu, gpa, PFERR_RSVD_MASK, NULL, 0);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ddb28f5ca252..07a538f02e3b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5799,19 +5799,33 @@ static int vcpu_mmio_write(struct kvm_vcpu *vcpu, gpa_t addr, int len,
 {
 	int handled = 0;
 	int n;
+	int ret = 0;
+	bool is_apic;
 
 	do {
 		n = min(len, 8);
-		if (!(lapic_in_kernel(vcpu) &&
-		      !kvm_iodevice_write(vcpu, &vcpu->arch.apic->dev, addr, n, v))
-		    && kvm_io_bus_write(vcpu, KVM_MMIO_BUS, addr, n, v))
-			break;
+		is_apic = lapic_in_kernel(vcpu) &&
+			  !kvm_iodevice_write(vcpu, &vcpu->arch.apic->dev,
+					      addr, n, v);
+		if (!is_apic) {
+			ret = kvm_io_bus_write(vcpu, KVM_MMIO_BUS,
+					       addr, n, v);
+			if (ret)
+				break;
+		}
 		handled += n;
 		addr += n;
 		len -= n;
 		v += n;
 	} while (len);
 
+#ifdef CONFIG_KVM_IOREGION
+	if (ret == -EINTR) {
+		vcpu->run->exit_reason = KVM_EXIT_INTR;
+		++vcpu->stat.signal_exits;
+	}
+#endif
+
 	return handled;
 }
 
@@ -5819,14 +5833,20 @@ static int vcpu_mmio_read(struct kvm_vcpu *vcpu, gpa_t addr, int len, void *v)
 {
 	int handled = 0;
 	int n;
+	int ret = 0;
+	bool is_apic;
 
 	do {
 		n = min(len, 8);
-		if (!(lapic_in_kernel(vcpu) &&
-		      !kvm_iodevice_read(vcpu, &vcpu->arch.apic->dev,
-					 addr, n, v))
-		    && kvm_io_bus_read(vcpu, KVM_MMIO_BUS, addr, n, v))
-			break;
+		is_apic = lapic_in_kernel(vcpu) &&
+			  !kvm_iodevice_read(vcpu, &vcpu->arch.apic->dev,
+					     addr, n, v);
+		if (!is_apic) {
+			ret = kvm_io_bus_read(vcpu, KVM_MMIO_BUS,
+					      addr, n, v);
+			if (ret)
+				break;
+		}
 		trace_kvm_mmio(KVM_TRACE_MMIO_READ, n, addr, v);
 		handled += n;
 		addr += n;
@@ -5834,6 +5854,13 @@ static int vcpu_mmio_read(struct kvm_vcpu *vcpu, gpa_t addr, int len, void *v)
 		v += n;
 	} while (len);
 
+#ifdef CONFIG_KVM_IOREGION
+	if (ret == -EINTR) {
+		vcpu->run->exit_reason = KVM_EXIT_INTR;
+		++vcpu->stat.signal_exits;
+	}
+#endif
+
 	return handled;
 }
 
@@ -6229,6 +6256,13 @@ static int emulator_read_write_onepage(unsigned long addr, void *val,
 	if (!ret && ops->read_write_emulate(vcpu, gpa, val, bytes))
 		return X86EMUL_CONTINUE;
 
+#ifdef CONFIG_KVM_IOREGION
+	/* crossing a page boundary case is interrupted */
+	if (vcpu->ioregion_ctx.is_interrupted &&
+	    vcpu->run->exit_reason == KVM_EXIT_INTR)
+		goto out;
+#endif
+
 	/*
 	 * Is this MMIO handled locally?
 	 */
@@ -6236,6 +6270,7 @@ static int emulator_read_write_onepage(unsigned long addr, void *val,
 	if (handled == bytes)
 		return X86EMUL_CONTINUE;
 
+out:
 	gpa += handled;
 	bytes -= handled;
 	val += handled;
@@ -6294,6 +6329,12 @@ static int emulator_read_write(struct x86_emulate_ctxt *ctxt,
 	vcpu->mmio_needed = 1;
 	vcpu->mmio_cur_fragment = 0;
 
+#ifdef CONFIG_KVM_IOREGION
+	if (vcpu->ioregion_ctx.is_interrupted &&
+	    vcpu->run->exit_reason == KVM_EXIT_INTR)
+		return (vcpu->ioregion_ctx.in) ? X86EMUL_IO_NEEDED : X86EMUL_CONTINUE;
+#endif
+
 	vcpu->run->mmio.len = min(8u, vcpu->mmio_fragments[0].len);
 	vcpu->run->mmio.is_write = vcpu->mmio_is_write = ops->write;
 	vcpu->run->exit_reason = KVM_EXIT_MMIO;
@@ -6411,16 +6452,22 @@ static int kernel_pio(struct kvm_vcpu *vcpu, void *pd)
 
 	for (i = 0; i < vcpu->arch.pio.count; i++) {
 		if (vcpu->arch.pio.in)
-			r = kvm_io_bus_read(vcpu, KVM_PIO_BUS, vcpu->arch.pio.port,
+			r = kvm_io_bus_read(vcpu, KVM_PIO_BUS,
+					    vcpu->arch.pio.port,
 					    vcpu->arch.pio.size, pd);
 		else
 			r = kvm_io_bus_write(vcpu, KVM_PIO_BUS,
-					     vcpu->arch.pio.port, vcpu->arch.pio.size,
-					     pd);
+					     vcpu->arch.pio.port,
+					     vcpu->arch.pio.size, pd);
 		if (r)
 			break;
 		pd += vcpu->arch.pio.size;
 	}
+#ifdef CONFIG_KVM_IOREGION
+	if (vcpu->ioregion_ctx.is_interrupted && r == -EINTR)
+		vcpu->ioregion_ctx.pio = i;
+#endif
+
 	return r;
 }
 
@@ -6428,16 +6475,27 @@ static int emulator_pio_in_out(struct kvm_vcpu *vcpu, int size,
 			       unsigned short port, void *val,
 			       unsigned int count, bool in)
 {
+	int ret = 0;
+
 	vcpu->arch.pio.port = port;
 	vcpu->arch.pio.in = in;
 	vcpu->arch.pio.count  = count;
 	vcpu->arch.pio.size = size;
 
-	if (!kernel_pio(vcpu, vcpu->arch.pio_data)) {
+	ret = kernel_pio(vcpu, vcpu->arch.pio_data);
+	if (!ret) {
 		vcpu->arch.pio.count = 0;
 		return 1;
 	}
 
+#ifdef CONFIG_KVM_IOREGION
+	if (ret == -EINTR) {
+		vcpu->run->exit_reason = KVM_EXIT_INTR;
+		++vcpu->stat.signal_exits;
+		return 0;
+	}
+#endif
+
 	vcpu->run->exit_reason = KVM_EXIT_IO;
 	vcpu->run->io.direction = in ? KVM_EXIT_IO_IN : KVM_EXIT_IO_OUT;
 	vcpu->run->io.size = size;
@@ -7141,6 +7199,10 @@ static bool retry_instruction(struct x86_emulate_ctxt *ctxt,
 
 static int complete_emulated_mmio(struct kvm_vcpu *vcpu);
 static int complete_emulated_pio(struct kvm_vcpu *vcpu);
+#ifdef CONFIG_KVM_IOREGION
+static int complete_ioregion_io(struct kvm_vcpu *vcpu);
+static int complete_ioregion_fast_pio(struct kvm_vcpu *vcpu);
+#endif
 
 static void kvm_smm_changed(struct kvm_vcpu *vcpu)
 {
@@ -7405,6 +7467,14 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 		r = 1;
 		if (inject_emulated_exception(vcpu))
 			return r;
+#ifdef CONFIG_KVM_IOREGION
+	} else if (vcpu->ioregion_ctx.is_interrupted &&
+		   vcpu->run->exit_reason == KVM_EXIT_INTR) {
+		if (vcpu->ioregion_ctx.in)
+			writeback = false;
+		vcpu->arch.complete_userspace_io = complete_ioregion_io;
+		r = 0;
+#endif
 	} else if (vcpu->arch.pio.count) {
 		if (!vcpu->arch.pio.in) {
 			/* FIXME: return into emulator if single-stepping.  */
@@ -7501,6 +7571,12 @@ static int kvm_fast_pio_out(struct kvm_vcpu *vcpu, int size,
 		vcpu->arch.complete_userspace_io =
 			complete_fast_pio_out_port_0x7e;
 		kvm_skip_emulated_instruction(vcpu);
+#ifdef CONFIG_KVM_IOREGION
+	} else if (vcpu->ioregion_ctx.is_interrupted &&
+		   vcpu->run->exit_reason == KVM_EXIT_INTR) {
+		vcpu->arch.pio.linear_rip = kvm_get_linear_rip(vcpu);
+		vcpu->arch.complete_userspace_io = complete_ioregion_fast_pio;
+#endif
 	} else {
 		vcpu->arch.pio.linear_rip = kvm_get_linear_rip(vcpu);
 		vcpu->arch.complete_userspace_io = complete_fast_pio_out;
@@ -7549,6 +7625,14 @@ static int kvm_fast_pio_in(struct kvm_vcpu *vcpu, int size,
 	}
 
 	vcpu->arch.pio.linear_rip = kvm_get_linear_rip(vcpu);
+
+#ifdef CONFIG_KVM_IOREGION
+	if (vcpu->ioregion_ctx.is_interrupted &&
+	    vcpu->run->exit_reason == KVM_EXIT_INTR) {
+		vcpu->arch.complete_userspace_io = complete_ioregion_fast_pio;
+		return 0;
+	}
+#endif
 	vcpu->arch.complete_userspace_io = complete_fast_pio_in;
 
 	return 0;
@@ -9204,6 +9288,162 @@ static int complete_emulated_mmio(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
+#ifdef CONFIG_KVM_IOREGION
+static int complete_ioregion_access(struct kvm_vcpu *vcpu, u8 bus, gpa_t addr,
+				    int len, void *val)
+{
+	if (vcpu->ioregion_ctx.in)
+		return kvm_io_bus_read(vcpu, bus, addr, len, val);
+	else
+		return kvm_io_bus_write(vcpu, bus, addr, len, val);
+}
+
+static int complete_ioregion_mmio(struct kvm_vcpu *vcpu)
+{
+	struct kvm_mmio_fragment *frag;
+	int idx, ret, i, n;
+
+	idx = srcu_read_lock(&vcpu->kvm->srcu);
+	for (i = vcpu->mmio_cur_fragment; i < vcpu->mmio_nr_fragments; i++) {
+		frag = &vcpu->mmio_fragments[i];
+		do {
+			n = min(8u, frag->len);
+			ret = complete_ioregion_access(vcpu, KVM_MMIO_BUS,
+						       frag->gpa, n, frag->data);
+			if (ret < 0)
+				goto do_exit;
+			frag->len -= n;
+			frag->data += n;
+			frag->gpa += n;
+		} while (frag->len);
+		vcpu->mmio_cur_fragment++;
+	}
+
+	vcpu->mmio_needed = 0;
+	if (!vcpu->ioregion_ctx.in) {
+		ret = 1;
+		goto out;
+	}
+
+	vcpu->mmio_read_completed = 1;
+	ret = kvm_emulate_instruction(vcpu, EMULTYPE_NO_DECODE);
+	goto out;
+
+do_exit:
+	if (ret != -EOPNOTSUPP) {
+		vcpu->arch.complete_userspace_io = complete_ioregion_mmio;
+		goto out;
+	}
+
+	/* if ioregion is removed KVM needs to return with KVM_EXIT_MMIO */
+	vcpu->run->exit_reason = KVM_EXIT_MMIO;
+	vcpu->run->mmio.phys_addr = frag->gpa;
+	if (!vcpu->ioregion_ctx.in)
+		memcpy(vcpu->run->mmio.data, frag->data, n);
+	vcpu->run->mmio.len = n;
+	vcpu->run->mmio.is_write = !vcpu->ioregion_ctx.in;
+	vcpu->arch.complete_userspace_io = complete_emulated_mmio;
+	ret = 0;
+out:
+	srcu_read_unlock(&vcpu->kvm->srcu, idx);
+	return ret;
+}
+
+static int complete_ioregion_pio(struct kvm_vcpu *vcpu)
+{
+	int i, idx, ret;
+	unsigned long off;
+
+	idx = srcu_read_lock(&vcpu->kvm->srcu);
+
+	for (i = vcpu->ioregion_ctx.pio; i < vcpu->arch.pio.count; i++) {
+		ret = complete_ioregion_access(vcpu, KVM_PIO_BUS, vcpu->arch.pio.port,
+					       vcpu->arch.pio.size,
+					       vcpu->ioregion_ctx.val);
+		if (ret < 0)
+			goto do_exit;
+		vcpu->ioregion_ctx.val += vcpu->arch.pio.size;
+	}
+
+	ret = 1;
+	if (vcpu->ioregion_ctx.in)
+		ret = kvm_emulate_instruction(vcpu, EMULTYPE_NO_DECODE);
+	vcpu->arch.pio.count = 0;
+	goto out;
+
+do_exit:
+	if (ret != -EOPNOTSUPP) {
+		vcpu->ioregion_ctx.pio = i;
+		vcpu->arch.complete_userspace_io = complete_ioregion_pio;
+		goto out;
+	}
+
+	/* if ioregion is removed KVM needs to return with KVM_EXIT_IO */
+	off = vcpu->ioregion_ctx.val - vcpu->arch.pio_data;
+	vcpu->run->exit_reason = KVM_EXIT_IO;
+	vcpu->run->io.direction = vcpu->ioregion_ctx.in ? KVM_EXIT_IO_IN : KVM_EXIT_IO_OUT;
+	vcpu->run->io.size = vcpu->arch.pio.size;
+	vcpu->run->io.data_offset = KVM_PIO_PAGE_OFFSET * PAGE_SIZE + off;
+	vcpu->run->io.count = vcpu->arch.pio.count - i;
+	vcpu->run->io.port = vcpu->arch.pio.port;
+	if (vcpu->ioregion_ctx.in)
+		vcpu->arch.complete_userspace_io = complete_emulated_pio;
+	else
+		vcpu->arch.pio.count = 0;
+	ret = 0;
+out:
+	srcu_read_unlock(&vcpu->kvm->srcu, idx);
+	return ret;
+}
+
+static int complete_ioregion_fast_pio(struct kvm_vcpu *vcpu)
+{
+	int idx, ret;
+	u64 val;
+
+	idx = srcu_read_lock(&vcpu->kvm->srcu);
+	ret = complete_ioregion_access(vcpu, KVM_PIO_BUS, vcpu->arch.pio.port,
+				       vcpu->arch.pio.size, vcpu->ioregion_ctx.val);
+	if (ret < 0)
+		goto do_exit;
+	srcu_read_unlock(&vcpu->kvm->srcu, idx);
+
+	if (vcpu->ioregion_ctx.in) {
+		memcpy(&val, vcpu->ioregion_ctx.val, vcpu->ioregion_ctx.len);
+		kvm_rax_write(vcpu, val);
+	}
+	vcpu->arch.pio.count = 0;
+	return kvm_skip_emulated_instruction(vcpu);
+
+do_exit:
+	if (ret != -EOPNOTSUPP) {
+		vcpu->arch.complete_userspace_io = complete_ioregion_fast_pio;
+		goto out;
+	}
+
+	vcpu->run->exit_reason = KVM_EXIT_IO;
+	vcpu->run->io.direction = vcpu->ioregion_ctx.in ? KVM_EXIT_IO_IN : KVM_EXIT_IO_OUT;
+	vcpu->run->io.size = vcpu->arch.pio.size;
+	vcpu->run->io.data_offset = KVM_PIO_PAGE_OFFSET * PAGE_SIZE;
+	vcpu->run->io.count = 1;
+	vcpu->run->io.port = vcpu->arch.pio.port;
+	vcpu->arch.complete_userspace_io = vcpu->ioregion_ctx.in ?
+			complete_fast_pio_in : complete_fast_pio_out;
+	ret = 0;
+out:
+	srcu_read_unlock(&vcpu->kvm->srcu, idx);
+	return ret;
+}
+
+static int complete_ioregion_io(struct kvm_vcpu *vcpu)
+{
+	if (vcpu->mmio_needed)
+		return complete_ioregion_mmio(vcpu);
+	if (vcpu->arch.pio.count)
+		return complete_ioregion_pio(vcpu);
+
+	/* neither MMIO nor PIO is pending; nothing to complete, resume guest */
+	return 1;
+}
+#endif /* CONFIG_KVM_IOREGION */
+
 static void kvm_save_current_fpu(struct fpu *fpu)
 {
 	/*
@@ -9309,6 +9549,12 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 	else
 		r = vcpu_run(vcpu);
 
+#ifdef CONFIG_KVM_IOREGION
+	if (vcpu->ioregion_ctx.is_interrupted &&
+	    vcpu->run->exit_reason == KVM_EXIT_INTR)
+		r = -EINTR;
+#endif
+
 out:
 	kvm_put_guest_fpu(vcpu);
 	if (kvm_run->kvm_valid_regs)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f35f0976f5cf..84f07597d131 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -318,6 +318,16 @@ struct kvm_vcpu {
 #endif
 	bool preempted;
 	bool ready;
+#ifdef CONFIG_KVM_IOREGION
+	struct {
+		u64 addr;
+		void *val;
+		int pio;
+		u8 state; /* SEND_CMD/GET_REPLY */
+		bool in;
+		bool is_interrupted;
+	} ioregion_ctx;
+#endif
 	struct kvm_vcpu_arch arch;
 };
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 88b92fc3da51..df387857f51f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4193,6 +4193,7 @@ static int __kvm_io_bus_write(struct kvm_vcpu *vcpu, struct kvm_io_bus *bus,
 			      struct kvm_io_range *range, const void *val)
 {
 	int idx;
+	int ret = 0;
 
 	idx = kvm_io_bus_get_first_dev(bus, range->addr, range->len);
 	if (idx < 0)
@@ -4200,9 +4201,12 @@ static int __kvm_io_bus_write(struct kvm_vcpu *vcpu, struct kvm_io_bus *bus,
 
 	while (idx < bus->dev_count &&
 		kvm_io_bus_cmp(range, &bus->range[idx]) == 0) {
-		if (!kvm_iodevice_write(vcpu, bus->range[idx].dev, range->addr,
-					range->len, val))
+		ret = kvm_iodevice_write(vcpu, bus->range[idx].dev, range->addr,
+					 range->len, val);
+		if (!ret)
 			return idx;
+		if (ret < 0 && ret != -EOPNOTSUPP)
+			return ret;
 		idx++;
 	}
 
@@ -4264,6 +4268,7 @@ static int __kvm_io_bus_read(struct kvm_vcpu *vcpu, struct kvm_io_bus *bus,
 			     struct kvm_io_range *range, void *val)
 {
 	int idx;
+	int ret = 0;
 
 	idx = kvm_io_bus_get_first_dev(bus, range->addr, range->len);
 	if (idx < 0)
@@ -4271,9 +4276,12 @@ static int __kvm_io_bus_read(struct kvm_vcpu *vcpu, struct kvm_io_bus *bus,
 
 	while (idx < bus->dev_count &&
 		kvm_io_bus_cmp(range, &bus->range[idx]) == 0) {
-		if (!kvm_iodevice_read(vcpu, bus->range[idx].dev, range->addr,
-				       range->len, val))
+		ret = kvm_iodevice_read(vcpu, bus->range[idx].dev, range->addr,
+					range->len, val);
+		if (!ret)
 			return idx;
+		if (ret < 0 && ret != -EOPNOTSUPP)
+			return ret;
 		idx++;
 	}
 
-- 
2.25.1



* [RFC v3 3/5] KVM: implement wire protocol
From: Elena Afanasova @ 2021-02-21 12:04 UTC
  To: kvm
  Cc: stefanha, jag.raman, elena.ufimtseva, pbonzini, jasowang, mst,
	cohuck, john.levon, Elena Afanasova

Add ioregionfd blocking read/write operations.
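
To illustrate the encoding, a 4-byte guest write of 0xdeadbeef at
offset 8 into a region is packed by pack_cmd() roughly as follows (a
sketch; region stands for the kernel's struct ioregion for that range):

    struct ioregionfd_cmd cmd = {
            .cmd           = IOREGIONFD_CMD_WRITE,
            .size_exponent = IOREGIONFD_SIZE_32BIT, /* len 4 */
            .resp          = 1,                 /* 0 for posted writes */
            .user_data     = region->user_data, /* token from registration */
            .offset        = 8,                 /* addr - region->paddr */
            .data          = 0xdeadbeef,
    };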

Signed-off-by: Elena Afanasova <eafanasova@gmail.com>
---
v3:
 - change wire protocol license
 - remove ioregionfd_cmd info and drop appropriate macros
 - fix ioregionfd state machine
 - add sizeless ioregions support
 - drop redundant check in ioregion_read/write()

 include/uapi/linux/ioregion.h |  30 +++++++
 virt/kvm/ioregion.c           | 162 +++++++++++++++++++++++++++++++++-
 2 files changed, 190 insertions(+), 2 deletions(-)
 create mode 100644 include/uapi/linux/ioregion.h

diff --git a/include/uapi/linux/ioregion.h b/include/uapi/linux/ioregion.h
new file mode 100644
index 000000000000..58f9b5ba6186
--- /dev/null
+++ b/include/uapi/linux/ioregion.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: ((GPL-2.0-only WITH Linux-syscall-note) OR BSD-3-Clause) */
+#ifndef _UAPI_LINUX_IOREGION_H
+#define _UAPI_LINUX_IOREGION_H
+
+/* Wire protocol */
+
+struct ioregionfd_cmd {
+	__u8 cmd;
+	__u8 size_exponent : 4;
+	__u8 resp : 1;
+	__u8 padding[6];
+	__u64 user_data;
+	__u64 offset;
+	__u64 data;
+};
+
+struct ioregionfd_resp {
+	__u64 data;
+	__u8 pad[24];
+};
+
+#define IOREGIONFD_CMD_READ    0
+#define IOREGIONFD_CMD_WRITE   1
+
+#define IOREGIONFD_SIZE_8BIT   0
+#define IOREGIONFD_SIZE_16BIT  1
+#define IOREGIONFD_SIZE_32BIT  2
+#define IOREGIONFD_SIZE_64BIT  3
+
+#endif
diff --git a/virt/kvm/ioregion.c b/virt/kvm/ioregion.c
index e09ef3e2c9d7..1e1c7772d274 100644
--- a/virt/kvm/ioregion.c
+++ b/virt/kvm/ioregion.c
@@ -3,6 +3,7 @@
 #include <linux/fs.h>
 #include <kvm/iodev.h>
 #include "eventfd.h"
+#include <uapi/linux/ioregion.h>
 
 void
 kvm_ioregionfd_init(struct kvm *kvm)
@@ -40,18 +41,175 @@ ioregion_release(struct ioregion *p)
 	kfree(p);
 }
 
+static bool
+pack_cmd(struct ioregionfd_cmd *cmd, u64 offset, u64 len, u8 opt, u8 resp,
+	 u64 user_data, const void *val)
+{
+	switch (len) {
+	case 0:
+		break;
+	case 1:
+		cmd->size_exponent = IOREGIONFD_SIZE_8BIT;
+		break;
+	case 2:
+		cmd->size_exponent = IOREGIONFD_SIZE_16BIT;
+		break;
+	case 4:
+		cmd->size_exponent = IOREGIONFD_SIZE_32BIT;
+		break;
+	case 8:
+		cmd->size_exponent = IOREGIONFD_SIZE_64BIT;
+		break;
+	default:
+		return false;
+	}
+
+	if (val)
+		memcpy(&cmd->data, val, len);
+	cmd->user_data = user_data;
+	cmd->offset = offset;
+	cmd->cmd = opt;
+	cmd->resp = resp;
+
+	return true;
+}
+
+enum {
+	SEND_CMD,
+	GET_REPLY,
+	COMPLETE
+};
+
+static void
+ioregion_save_ctx(struct kvm_vcpu *vcpu, bool in, gpa_t addr, u8 state, void *val)
+{
+	vcpu->ioregion_ctx.is_interrupted = true;
+	vcpu->ioregion_ctx.val = val;
+	vcpu->ioregion_ctx.state = state;
+	vcpu->ioregion_ctx.addr = addr;
+	vcpu->ioregion_ctx.in = in;
+}
+
 static int
 ioregion_read(struct kvm_vcpu *vcpu, struct kvm_io_device *this, gpa_t addr,
 	      int len, void *val)
 {
-	return -EOPNOTSUPP;
+	struct ioregion *p = to_ioregion(this);
+	union {
+		struct ioregionfd_cmd cmd;
+		struct ioregionfd_resp resp;
+	} buf;
+	int ret = 0;
+	int state = SEND_CMD;
+
+	if (unlikely(vcpu->ioregion_ctx.is_interrupted)) {
+		vcpu->ioregion_ctx.is_interrupted = false;
+
+		switch (vcpu->ioregion_ctx.state) {
+		case SEND_CMD:
+			goto send_cmd;
+		case GET_REPLY:
+			goto get_repl;
+		default:
+			return -EINVAL;
+		}
+	}
+
+send_cmd:
+	memset(&buf, 0, sizeof(buf));
+	if (!pack_cmd(&buf.cmd, addr - p->paddr, len, IOREGIONFD_CMD_READ,
+		      1, p->user_data, NULL))
+		return -EOPNOTSUPP;
+
+	ret = kernel_write(p->wf, &buf.cmd, sizeof(buf.cmd), 0);
+	state = (ret == sizeof(buf.cmd)) ? GET_REPLY : SEND_CMD;
+	if (signal_pending(current) && state == SEND_CMD) {
+		ioregion_save_ctx(vcpu, 1, addr, state, val);
+		return -EINTR;
+	}
+	if (ret != sizeof(buf.cmd)) {
+		ret = (ret < 0) ? ret : -EIO;
+		return (ret == -EAGAIN || ret == -EWOULDBLOCK) ? -EINVAL : ret;
+	}
+	if (!p->rf)
+		return 0;
+
+get_repl:
+	memset(&buf, 0, sizeof(buf));
+	ret = kernel_read(p->rf, &buf.resp, sizeof(buf.resp), 0);
+	state = (ret == sizeof(buf.resp)) ? COMPLETE : GET_REPLY;
+	if (signal_pending(current) && state == GET_REPLY) {
+		ioregion_save_ctx(vcpu, 1, addr, state, val);
+		return -EINTR;
+	}
+	if (ret != sizeof(buf.resp)) {
+		ret = (ret < 0) ? ret : -EIO;
+		return (ret == -EAGAIN || ret == -EWOULDBLOCK) ? -EINVAL : ret;
+	}
+
+	memcpy(val, &buf.resp.data, len);
+
+	return 0;
 }
 
 static int
 ioregion_write(struct kvm_vcpu *vcpu, struct kvm_io_device *this, gpa_t addr,
 		int len, const void *val)
 {
-	return -EOPNOTSUPP;
+	struct ioregion *p = to_ioregion(this);
+	union {
+		struct ioregionfd_cmd cmd;
+		struct ioregionfd_resp resp;
+	} buf;
+	int ret = 0;
+	int state = SEND_CMD;
+
+	if (unlikely(vcpu->ioregion_ctx.is_interrupted)) {
+		vcpu->ioregion_ctx.is_interrupted = false;
+
+		switch (vcpu->ioregion_ctx.state) {
+		case SEND_CMD:
+			goto send_cmd;
+		case GET_REPLY:
+			goto get_repl;
+		default:
+			return -EINVAL;
+		}
+	}
+
+send_cmd:
+	memset(&buf, 0, sizeof(buf));
+	if (!pack_cmd(&buf.cmd, addr - p->paddr, len, IOREGIONFD_CMD_WRITE,
+		      p->posted_writes ? 0 : 1, p->user_data, val))
+		return -EOPNOTSUPP;
+
+	ret = kernel_write(p->wf, &buf.cmd, sizeof(buf.cmd), 0);
+	state = (ret == sizeof(buf.cmd)) ? GET_REPLY : SEND_CMD;
+	if (signal_pending(current) && state == SEND_CMD) {
+		ioregion_save_ctx(vcpu, 0, addr, state, (void *)val);
+		return -EINTR;
+	}
+	if (ret != sizeof(buf.cmd)) {
+		ret = (ret < 0) ? ret : -EIO;
+		return (ret == -EAGAIN || ret == -EWOULDBLOCK) ? -EINVAL : ret;
+	}
+
+get_repl:
+	if (!p->posted_writes) {
+		memset(&buf, 0, sizeof(buf));
+		ret = kernel_read(p->rf, &buf.resp, sizeof(buf.resp), 0);
+		state = (ret == sizeof(buf.resp)) ? COMPLETE : GET_REPLY;
+		if (signal_pending(current) && state == GET_REPLY) {
+			ioregion_save_ctx(vcpu, 0, addr, state, (void *)val);
+			return -EINTR;
+		}
+		if (ret != sizeof(buf.resp)) {
+			ret = (ret < 0) ? ret : -EIO;
+			return (ret == -EAGAIN || ret == -EWOULDBLOCK) ? -EINVAL : ret;
+		}
+	}
+
+	return 0;
 }
 
 /*
-- 
2.25.1



* [RFC v3 4/5] KVM: add ioregionfd context
From: Elena Afanasova @ 2021-02-21 12:04 UTC
  To: kvm
  Cc: stefanha, jag.raman, elena.ufimtseva, pbonzini, jasowang, mst,
	cohuck, john.levon, Elena Afanasova

Add support for serializing ioregionfd cmds/replies so that a struct
ioregionfd_resp is always consumed by the vCPU thread that sent the
corresponding struct ioregionfd_cmd.
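
Concretely, two regions registered with the same read_fd share one
kernel-side context (matched by the rfd's inode), so their replies are
serialized; a sketch of such a setup (GPA_A/GPA_B, rfd/wfd and vm_fd
are hypothetical):

    struct kvm_ioregion a = {
            .guest_paddr = GPA_A, .memory_size = 8,
            .read_fd = rfd, .write_fd = wfd,
    };
    struct kvm_ioregion b = {
            .guest_paddr = GPA_B, .memory_size = 8,
            .read_fd = rfd, .write_fd = wfd,
    };

    ioctl(vm_fd, KVM_SET_IOREGION, &a);
    ioctl(vm_fd, KVM_SET_IOREGION, &b); /* shares a's wait queue/busy flag */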

Signed-off-by: Elena Afanasova <eafanasova@gmail.com>
---
v3:
 - add comment
 - drop kvm_io_bus_finish/prepare()

 virt/kvm/ioregion.c | 164 ++++++++++++++++++++++++++++++++++++--------
 1 file changed, 135 insertions(+), 29 deletions(-)

diff --git a/virt/kvm/ioregion.c b/virt/kvm/ioregion.c
index 1e1c7772d274..d53e3d1cd2ff 100644
--- a/virt/kvm/ioregion.c
+++ b/virt/kvm/ioregion.c
@@ -1,10 +1,39 @@
 // SPDX-License-Identifier: GPL-2.0-only
 #include <linux/kvm_host.h>
-#include <linux/fs.h>
+#include <linux/wait.h>
 #include <kvm/iodev.h>
 #include "eventfd.h"
 #include <uapi/linux/ioregion.h>
 
+/* ioregions that share the same rfd are serialized so that only one vCPU
+ * thread sends a struct ioregionfd_cmd to userspace at a time. This
+ * ensures that the struct ioregionfd_resp received from userspace will
+ * be processed by the one and only vCPU thread that sent it.
+ *
+ * A waitqueue is used to wake up waiting vCPU threads in order. Most of
+ * the time the waitqueue is unused and the lock is not contended.
+ * For best performance userspace should set up ioregionfds so that there
+ * is no contention (e.g. dedicated ioregionfds for queue doorbell
+ * registers on multi-queue devices).
+ */
+struct ioregionfd {
+	wait_queue_head_t	  wq;
+	struct file		 *rf;
+	struct kref		  kref;
+	bool			  busy;
+};
+
+struct ioregion {
+	struct list_head	  list;
+	u64			  paddr;   /* guest physical address */
+	u64			  size;    /* size in bytes */
+	struct file		 *wf;
+	u64			  user_data; /* opaque token used by userspace */
+	struct kvm_io_device	  dev;
+	bool			  posted_writes;
+	struct ioregionfd	 *ctx;
+};
+
 void
 kvm_ioregionfd_init(struct kvm *kvm)
 {
@@ -13,29 +42,28 @@ kvm_ioregionfd_init(struct kvm *kvm)
 	INIT_LIST_HEAD(&kvm->ioregions_pio);
 }
 
-struct ioregion {
-	struct list_head     list;
-	u64                  paddr;  /* guest physical address */
-	u64                  size;   /* size in bytes */
-	struct file         *rf;
-	struct file         *wf;
-	u64                  user_data; /* opaque token used by userspace */
-	struct kvm_io_device dev;
-	bool                 posted_writes;
-};
-
 static inline struct ioregion *
 to_ioregion(struct kvm_io_device *dev)
 {
 	return container_of(dev, struct ioregion, dev);
 }
 
+/* assumes kvm->slots_lock held */
+static void ctx_free(struct kref *kref)
+{
+	struct ioregionfd *ctx = container_of(kref, struct ioregionfd, kref);
+
+	kfree(ctx);
+}
+
 /* assumes kvm->slots_lock held */
 static void
 ioregion_release(struct ioregion *p)
 {
-	if (p->rf)
-		fput(p->rf);
+	if (p->ctx) {
+		fput(p->ctx->rf);
+		kref_put(&p->ctx->kref, ctx_free);
+	}
 	fput(p->wf);
 	list_del(&p->list);
 	kfree(p);
@@ -90,6 +118,30 @@ ioregion_save_ctx(struct kvm_vcpu *vcpu, bool in, gpa_t addr, u8 state, void *va
 	vcpu->ioregion_ctx.in = in;
 }
 
+static inline void
+ioregion_lock_ctx(struct ioregionfd *ctx)
+{
+	if (!ctx)
+		return;
+
+	spin_lock(&ctx->wq.lock);
+	wait_event_interruptible_exclusive_locked(ctx->wq, !ctx->busy);
+	ctx->busy = true;
+	spin_unlock(&ctx->wq.lock);
+}
+
+static inline void
+ioregion_unlock_ctx(struct ioregionfd *ctx)
+{
+	if (!ctx)
+		return;
+
+	spin_lock(&ctx->wq.lock);
+	ctx->busy = false;
+	wake_up_locked(&ctx->wq);
+	spin_unlock(&ctx->wq.lock);
+}
+
 static int
 ioregion_read(struct kvm_vcpu *vcpu, struct kvm_io_device *this, gpa_t addr,
 	      int len, void *val)
@@ -115,11 +167,15 @@ ioregion_read(struct kvm_vcpu *vcpu, struct kvm_io_device *this, gpa_t addr,
 		}
 	}
 
+	ioregion_lock_ctx(p->ctx);
+
 send_cmd:
 	memset(&buf, 0, sizeof(buf));
 	if (!pack_cmd(&buf.cmd, addr - p->paddr, len, IOREGIONFD_CMD_READ,
-		      1, p->user_data, NULL))
-		return -EOPNOTSUPP;
+		      1, p->user_data, NULL)) {
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
 
 	ret = kernel_write(p->wf, &buf.cmd, sizeof(buf.cmd), 0);
 	state = (ret == sizeof(buf.cmd)) ? GET_REPLY : SEND_CMD;
@@ -129,14 +185,15 @@ ioregion_read(struct kvm_vcpu *vcpu, struct kvm_io_device *this, gpa_t addr,
 	}
 	if (ret != sizeof(buf.cmd)) {
 		ret = (ret < 0) ? ret : -EIO;
-		return (ret == -EAGAIN || ret == -EWOULDBLOCK) ? -EINVAL : ret;
+		ret = (ret == -EAGAIN || ret == -EWOULDBLOCK) ? -EINVAL : ret;
+		goto out;
 	}
-	if (!p->rf)
+	if (!p->ctx)
 		return 0;
 
 get_repl:
 	memset(&buf, 0, sizeof(buf));
-	ret = kernel_read(p->rf, &buf.resp, sizeof(buf.resp), 0);
+	ret = kernel_read(p->ctx->rf, &buf.resp, sizeof(buf.resp), 0);
 	state = (ret == sizeof(buf.resp)) ? COMPLETE : GET_REPLY;
 	if (signal_pending(current) && state == GET_REPLY) {
 		ioregion_save_ctx(vcpu, 1, addr, state, val);
@@ -144,12 +201,17 @@ ioregion_read(struct kvm_vcpu *vcpu, struct kvm_io_device *this, gpa_t addr,
 	}
 	if (ret != sizeof(buf.resp)) {
 		ret = (ret < 0) ? ret : -EIO;
-		return (ret == -EAGAIN || ret == -EWOULDBLOCK) ? -EINVAL : ret;
+		ret = (ret == -EAGAIN || ret == -EWOULDBLOCK) ? -EINVAL : ret;
+		goto out;
 	}
 
 	memcpy(val, &buf.resp.data, len);
+	ret = 0;
 
-	return 0;
+out:
+	ioregion_unlock_ctx(p->ctx);
+
+	return ret;
 }
 
 static int
@@ -177,11 +239,15 @@ ioregion_write(struct kvm_vcpu *vcpu, struct kvm_io_device *this, gpa_t addr,
 		}
 	}
 
+	ioregion_lock_ctx(p->ctx);
+
 send_cmd:
 	memset(&buf, 0, sizeof(buf));
 	if (!pack_cmd(&buf.cmd, addr - p->paddr, len, IOREGIONFD_CMD_WRITE,
-		      p->posted_writes ? 0 : 1, p->user_data, val))
-		return -EOPNOTSUPP;
+		      p->posted_writes ? 0 : 1, p->user_data, val)) {
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
 
 	ret = kernel_write(p->wf, &buf.cmd, sizeof(buf.cmd), 0);
 	state = (ret == sizeof(buf.cmd)) ? GET_REPLY : SEND_CMD;
@@ -191,13 +257,14 @@ ioregion_write(struct kvm_vcpu *vcpu, struct kvm_io_device *this, gpa_t addr,
 	}
 	if (ret != sizeof(buf.cmd)) {
 		ret = (ret < 0) ? ret : -EIO;
-		return (ret == -EAGAIN || ret == -EWOULDBLOCK) ? -EINVAL : ret;
+		ret = (ret == -EAGAIN || ret == -EWOULDBLOCK) ? -EINVAL : ret;
+		goto out;
 	}
 
 get_repl:
 	if (!p->posted_writes) {
 		memset(&buf, 0, sizeof(buf));
-		ret = kernel_read(p->rf, &buf.resp, sizeof(buf.resp), 0);
+		ret = kernel_read(p->ctx->rf, &buf.resp, sizeof(buf.resp), 0);
 		state = (ret == sizeof(buf.resp)) ? COMPLETE : GET_REPLY;
 		if (signal_pending(current) && state == GET_REPLY) {
 			ioregion_save_ctx(vcpu, 0, addr, state, (void *)val);
@@ -205,11 +272,16 @@ ioregion_write(struct kvm_vcpu *vcpu, struct kvm_io_device *this, gpa_t addr,
 		}
 		if (ret != sizeof(buf.resp)) {
 			ret = (ret < 0) ? ret : -EIO;
-			return (ret == -EAGAIN || ret == -EWOULDBLOCK) ? -EINVAL : ret;
+			ret = (ret == -EAGAIN || ret == -EWOULDBLOCK) ? -EINVAL : ret;
+			goto out;
 		}
 	}
+	ret = 0;
 
-	return 0;
+out:
+	ioregion_unlock_ctx(p->ctx);
+
+	return ret;
 }
 
 /*
@@ -285,6 +357,33 @@ get_bus_from_flags(__u32 flags)
 	return KVM_MMIO_BUS;
 }
 
+/* assumes kvm->slots_lock held */
+static bool
+ioregion_get_ctx(struct kvm *kvm, struct ioregion *p, struct file *rf, int bus_idx)
+{
+	struct ioregion *_p;
+	struct list_head *ioregions;
+
+	ioregions = get_ioregion_list(kvm, bus_idx);
+	list_for_each_entry(_p, ioregions, list)
+		/* posted-write-only ioregions have no ctx */
+		if (_p->ctx &&
+		    file_inode(_p->ctx->rf)->i_ino == file_inode(rf)->i_ino) {
+			p->ctx = _p->ctx;
+			kref_get(&p->ctx->kref);
+			return true;
+		}
+
+	p->ctx = kzalloc(sizeof(*p->ctx), GFP_KERNEL_ACCOUNT);
+	if (!p->ctx)
+		return false;
+
+	p->ctx->rf = rf;
+	p->ctx->busy = false;
+	init_waitqueue_head(&p->ctx->wq);
+	kref_init(&p->ctx->kref); /* creator holds the first reference */
+
+	return true;
+}
+
 int
 kvm_set_ioregion_idx(struct kvm *kvm, struct kvm_ioregion *args, enum kvm_bus bus_idx)
 {
@@ -309,11 +408,10 @@ kvm_set_ioregion_idx(struct kvm *kvm, struct kvm_ioregion *args, enum kvm_bus bu
 	}
 
 	INIT_LIST_HEAD(&p->list);
+	p->wf = wfile;
 	p->paddr = args->guest_paddr;
 	p->size = args->memory_size;
 	p->user_data = args->user_data;
-	p->rf = rfile;
-	p->wf = wfile;
 	p->posted_writes = args->flags & KVM_IOREGION_POSTED_WRITES;
 
 	mutex_lock(&kvm->slots_lock);
@@ -322,6 +420,12 @@ kvm_set_ioregion_idx(struct kvm *kvm, struct kvm_ioregion *args, enum kvm_bus bu
 		ret = -EEXIST;
 		goto unlock_fail;
 	}
+
+	if (rfile && !ioregion_get_ctx(kvm, p, rfile, bus_idx)) {
+		ret = -ENOMEM;
+		goto unlock_fail;
+	}
+
 	kvm_iodevice_init(&p->dev, &ioregion_ops);
 	ret = kvm_io_bus_register_dev(kvm, bus_idx, p->paddr, p->size,
 				      &p->dev);
@@ -335,6 +439,8 @@ kvm_set_ioregion_idx(struct kvm *kvm, struct kvm_ioregion *args, enum kvm_bus bu
 
 unlock_fail:
 	mutex_unlock(&kvm->slots_lock);
+	if (p->ctx)
+		kref_put(&p->ctx->kref, ctx_free);
 	kfree(p);
 fail:
 	if (rfile)
-- 
2.25.1



* [RFC v3 5/5] KVM: enforce NR_IOBUS_DEVS limit if kmemcg is disabled
From: Elena Afanasova @ 2021-02-21 12:04 UTC
  To: kvm
  Cc: stefanha, jag.raman, elena.ufimtseva, pbonzini, jasowang, mst,
	cohuck, john.levon, Elena Afanasova

ioregionfd relies on kmemcg to limit the amount of kernel memory that
userspace can consume. Enforce the hardcoded NR_IOBUS_DEVS limit in
case kmemcg is disabled.

Signed-off-by: Elena Afanasova <eafanasova@gmail.com>
---
 virt/kvm/kvm_main.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index df387857f51f..99a828153afd 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4320,9 +4320,12 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
 	if (!bus)
 		return -ENOMEM;
 
-	/* exclude ioeventfd which is limited by maximum fd */
-	if (bus->dev_count - bus->ioeventfd_count > NR_IOBUS_DEVS - 1)
-		return -ENOSPC;
+	/* enforce hard limit if kmemcg is disabled and
+	 * exclude ioeventfd which is limited by maximum fd
+	 */
+	if (!memcg_kmem_enabled())
+		if (bus->dev_count - bus->ioeventfd_count > NR_IOBUS_DEVS - 1)
+			return -ENOSPC;
 
 	new_bus = kmalloc(struct_size(bus, range, bus->dev_count + 1),
 			  GFP_KERNEL_ACCOUNT);
-- 
2.25.1



* Re: [RFC v3 0/5] Introduce MMIO/PIO dispatch file descriptors (ioregionfd)
From: Paolo Bonzini @ 2021-02-21 17:06 UTC
  To: Elena Afanasova, kvm
  Cc: stefanha, jag.raman, elena.ufimtseva, jasowang, mst, cohuck, john.levon

On 21/02/21 13:04, Elena Afanasova wrote:
> This patchset introduces a KVM dispatch mechanism which can be used
> for handling MMIO/PIO accesses over file descriptors without returning
> from ioctl(KVM_RUN). This allows device emulation to run in another task
> separate from the vCPU task.
> 
> This is achieved through KVM vm ioctl for registering MMIO/PIO regions and
> a wire protocol that KVM uses to communicate with a task handling an
> MMIO/PIO access.
> 
> TODOs:
> * Implement KVM_EXIT_IOREGIONFD_FAILURE
> * Add non-x86 arch support
> * Add kvm-unittests
> * Flush waiters if ioregion is deleted

Hi Elena,

as a quick thing that jumped at me before starting the review, you 
should add a test for the new API in tools/testing/selftests/kvm, as 
well as documentation.  Ideally, patch 4 would also add a testcase that 
fails before and passes afterwards.

Also, does this work already with io_uring?

Paolo


* Re: [RFC v3 0/5] Introduce MMIO/PIO dispatch file descriptors (ioregionfd)
From: Elena Afanasova @ 2021-02-22 16:40 UTC
  To: Paolo Bonzini, kvm
  Cc: stefanha, jag.raman, elena.ufimtseva, jasowang, mst, cohuck, john.levon

On Sun, 2021-02-21 at 18:06 +0100, Paolo Bonzini wrote:
> On 21/02/21 13:04, Elena Afanasova wrote:
> > This patchset introduces a KVM dispatch mechanism which can be used
> > for handling MMIO/PIO accesses over file descriptors without
> > returning
> > from ioctl(KVM_RUN). This allows device emulation to run in another
> > task
> > separate from the vCPU task.
> > 
> > This is achieved through KVM vm ioctl for registering MMIO/PIO
> > regions and
> > a wire protocol that KVM uses to communicate with a task handling
> > an
> > MMIO/PIO access.
> > 
> > TODOs:
> > * Implement KVM_EXIT_IOREGIONFD_FAILURE
> > * Add non-x86 arch support
> > * Add kvm-unittests
> > * Flush waiters if ioregion is deleted
> 
> Hi Elena,
> 

Hi Paolo,

Thank you for your answer.

> as a quick thing that jumped at me before starting the review, you 
> should add a test for the new API in tools/testing/selftests/kvm, as 
> well as documentation.  Ideally, patch 4 would also add a testcase
> that 
> fails before and passes afterwards.
> 
Ok

> Also, does this work already with io_uring?
> 
I have a few kvm-unittests and QEMU testdev patch for testing base
functionality. I haven't tried to run them with io_uring (only run with
socket and pipes). Will do.

> Paolo
> 
> > v3:
> >   - add FAST_MMIO bus support
> >   - add KVM_IOREGION_DEASSIGN flag
> >   - rename kvm_ioregion read/write file descriptors
> >   - split ioregionfd signal handling support into two patches
> >   - move ioregion_interrupted flag to ioregion_ctx
> >   - reorder ioregion_ctx fields
> >   - rework complete_ioregion operations
> >   - add signal handling support for crossing a page boundary case
> >   - change wire protocol license
> >   - fix ioregionfd state machine
> >   - remove ioregionfd_cmd info and drop appropriate macros
> >   - add comment on ioregionfd cmds/replies serialization
> >   - drop kvm_io_bus_finish/prepare()
> > 
> > Elena Afanasova (5):
> >    KVM: add initial support for KVM_SET_IOREGION
> >    KVM: x86: add support for ioregionfd signal handling
> >    KVM: implement wire protocol
> >    KVM: add ioregionfd context
> >    KVM: enforce NR_IOBUS_DEVS limit if kmemcg is disabled
> > 
> >   arch/x86/kvm/Kconfig          |   1 +
> >   arch/x86/kvm/Makefile         |   1 +
> >   arch/x86/kvm/vmx/vmx.c        |  40 ++-
> >   arch/x86/kvm/x86.c            | 273 +++++++++++++++++-
> >   include/linux/kvm_host.h      |  28 ++
> >   include/uapi/linux/ioregion.h |  30 ++
> >   include/uapi/linux/kvm.h      |  25 ++
> >   virt/kvm/Kconfig              |   3 +
> >   virt/kvm/eventfd.c            |  25 ++
> >   virt/kvm/eventfd.h            |  14 +
> >   virt/kvm/ioregion.c           | 529
> > ++++++++++++++++++++++++++++++++++
> >   virt/kvm/ioregion.h           |  15 +
> >   virt/kvm/kvm_main.c           |  36 ++-
> >   13 files changed, 996 insertions(+), 24 deletions(-)
> >   create mode 100644 include/uapi/linux/ioregion.h
> >   create mode 100644 virt/kvm/eventfd.h
> >   create mode 100644 virt/kvm/ioregion.c
> >   create mode 100644 virt/kvm/ioregion.h
> > 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v3 1/5] KVM: add initial support for KVM_SET_IOREGION
  2021-02-21 12:04 ` [RFC v3 1/5] KVM: add initial support for KVM_SET_IOREGION Elena Afanasova
@ 2021-02-24 10:06   ` Stefan Hajnoczi
  2021-03-05 13:09   ` Cornelia Huck
  2021-03-09  5:26   ` Jason Wang
  2 siblings, 0 replies; 30+ messages in thread
From: Stefan Hajnoczi @ 2021-02-24 10:06 UTC (permalink / raw)
  To: Elena Afanasova
  Cc: kvm, jag.raman, elena.ufimtseva, pbonzini, jasowang, mst, cohuck,
	john.levon

[-- Attachment #1: Type: text/plain, Size: 1132 bytes --]

On Sun, Feb 21, 2021 at 03:04:37PM +0300, Elena Afanasova wrote:
> diff --git a/virt/kvm/ioregion.c b/virt/kvm/ioregion.c
> new file mode 100644
> index 000000000000..e09ef3e2c9d7
> --- /dev/null
> +++ b/virt/kvm/ioregion.c
> @@ -0,0 +1,265 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +#include <linux/kvm_host.h>
> +#include <linux/fs.h>
> +#include <kvm/iodev.h>
> +#include "eventfd.h"
> +
> +void
> +kvm_ioregionfd_init(struct kvm *kvm)
> +{
> +	INIT_LIST_HEAD(&kvm->ioregions_fast_mmio);
> +	INIT_LIST_HEAD(&kvm->ioregions_mmio);
> +	INIT_LIST_HEAD(&kvm->ioregions_pio);
> +}
> +
> +struct ioregion {
> +	struct list_head     list;

Linux struct list_head gives no clue about which list this belongs to.
You can help readers by adding a comment:

/* struct kvm ioregions_fast_mmio/ioregions_mmio/ioregions_pio */

> +	u64                  paddr;  /* guest physical address */
> +	u64                  size;   /* size in bytes */
> +	struct file         *rf;

+	struct file         *rf;     /* responses are read from this */

> +	struct file         *wf;

+	struct file         *wf;     /* commands are written to this */

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v3 2/5] KVM: x86: add support for ioregionfd signal handling
  2021-02-21 12:04 ` [RFC v3 2/5] KVM: x86: add support for ioregionfd signal handling Elena Afanasova
@ 2021-02-24 10:42   ` Stefan Hajnoczi
  2021-03-09  5:51   ` Jason Wang
  1 sibling, 0 replies; 30+ messages in thread
From: Stefan Hajnoczi @ 2021-02-24 10:42 UTC (permalink / raw)
  To: Elena Afanasova
  Cc: kvm, jag.raman, elena.ufimtseva, pbonzini, jasowang, mst, cohuck,
	john.levon

[-- Attachment #1: Type: text/plain, Size: 3045 bytes --]

On Sun, Feb 21, 2021 at 03:04:38PM +0300, Elena Afanasova wrote:
> The vCPU thread may receive a signal during ioregionfd communication,
> ioctl(KVM_RUN) needs to return to userspace and then ioctl(KVM_RUN)
> must resume ioregionfd.
> 
> Signed-off-by: Elena Afanasova <eafanasova@gmail.com>

Functionally what this patch does makes sense to me. I'm not familiar
enough with the arch/x86/kvm mmio/pio code to say whether it's possible
to unify mmio/pio/ioregionfd state somehow so that this code can be
simplified.

> +static int complete_ioregion_io(struct kvm_vcpu *vcpu)
> +{
> +	if (vcpu->mmio_needed)
> +		return complete_ioregion_mmio(vcpu);
> +	if (vcpu->arch.pio.count)
> +		return complete_ioregion_pio(vcpu);

Can this be written as:

  if ... {
  } else if ... {
  } else {
      BUG();
  }

?

In other words, I'm not sure we can ever get here without either
vcpu->mmio_needed or vcpu->arch.pio.count set.
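
For illustration, that restructuring might look like this (a sketch,
assuming we can indeed never get here without one of the two flags set):

static int complete_ioregion_io(struct kvm_vcpu *vcpu)
{
        if (vcpu->mmio_needed)
                return complete_ioregion_mmio(vcpu);
        else if (vcpu->arch.pio.count)
                return complete_ioregion_pio(vcpu);

        /* should be unreachable: no pending MMIO or PIO state */
        BUG();
}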

> +}
> +#endif /* CONFIG_KVM_IOREGION */
> +
>  static void kvm_save_current_fpu(struct fpu *fpu)
>  {
>  	/*
> @@ -9309,6 +9549,12 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>  	else
>  		r = vcpu_run(vcpu);
>  
> +#ifdef CONFIG_KVM_IOREGION
> +	if (vcpu->ioregion_ctx.is_interrupted &&
> +	    vcpu->run->exit_reason == KVM_EXIT_INTR)
> +		r = -EINTR;
> +#endif

Userspace can write to vcpu->run->exit_reason memory so its value cannot
be trusted. It's not obvious to me what happens when is_interrupted ==
true if userspace has corrupted vcpu->run->exit_reason. Since I can't
easily tell if it's safe, I suggest finding another way to perform this
check that does not rely on vcpu->run->exit_reason. Is just checking
vcpu->ioregion_ctx.is_interrupted enough?

(The same applies to other instances of vcpu->run->exit_reason in this
patch.)

> +
>  out:
>  	kvm_put_guest_fpu(vcpu);
>  	if (kvm_run->kvm_valid_regs)
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index f35f0976f5cf..84f07597d131 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -318,6 +318,16 @@ struct kvm_vcpu {
>  #endif
>  	bool preempted;
>  	bool ready;
> +#ifdef CONFIG_KVM_IOREGION
> +	struct {

A comment would be nice to explain the purpose of this struct:

/*
 * MMIO/PIO state kept when a signal interrupts ioregionfd. When
 * ioctl(KVM_RUN) is invoked again we resume from this state.
 */

> +		u64 addr;
> +		void *val;
> +		int pio;

"int pio" could be a boolean indicating whether this is pio or mmio.
Calling it cur_pio or pio_offset might be clearer.

> +		u8 state; /* SEND_CMD/GET_REPLY */
> +		bool in;

/* true - read, false - write */

> +		bool is_interrupted;
> +	} ioregion_ctx;
> +#endif

The ioregion_ctx fields are not set in this patch. Either they are
unused or a later patch will set them. You can help reviewers by noting
this in the commit description. That way they won't need to worry about
whether there are unused fields that were accidentally left in the code.
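
Putting the suggestions above together, the struct might read like this
(a sketch; the fields are the patch's, the comments are only suggestions):

#ifdef CONFIG_KVM_IOREGION
        /*
         * MMIO/PIO state kept when a signal interrupts ioregionfd. When
         * ioctl(KVM_RUN) is invoked again we resume from this state.
         */
        struct {
                u64 addr;
                void *val;
                int pio;                /* could be a bool: PIO vs. MMIO */
                u8 state;               /* SEND_CMD/GET_REPLY */
                bool in;                /* true - read, false - write */
                bool is_interrupted;
        } ioregion_ctx;
#endif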

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v3 3/5] KVM: implement wire protocol
  2021-02-21 12:04 ` [RFC v3 3/5] KVM: implement wire protocol Elena Afanasova
@ 2021-02-24 11:02   ` Stefan Hajnoczi
  2021-03-09  6:19   ` Jason Wang
  1 sibling, 0 replies; 30+ messages in thread
From: Stefan Hajnoczi @ 2021-02-24 11:02 UTC (permalink / raw)
  To: Elena Afanasova
  Cc: kvm, jag.raman, elena.ufimtseva, pbonzini, jasowang, mst, cohuck,
	john.levon

[-- Attachment #1: Type: text/plain, Size: 1034 bytes --]

On Sun, Feb 21, 2021 at 03:04:39PM +0300, Elena Afanasova wrote:
> +static bool
> +pack_cmd(struct ioregionfd_cmd *cmd, u64 offset, u64 len, u8 opt, u8 resp,
> +	 u64 user_data, const void *val)
> +{
> +	switch (len) {
> +	case 0:
> +		break;

The 0 case might be non-obvious. A comment would be nice:

/* FAST_MMIO does not specify a length */

> +	case 1:
> +		cmd->size_exponent = IOREGIONFD_SIZE_8BIT;
> +		break;
> +	case 2:
> +		cmd->size_exponent = IOREGIONFD_SIZE_16BIT;
> +		break;
> +	case 4:
> +		cmd->size_exponent = IOREGIONFD_SIZE_32BIT;
> +		break;
> +	case 8:
> +		cmd->size_exponent = IOREGIONFD_SIZE_64BIT;
> +		break;
> +	default:
> +		return false;
> +	}
> +
> +	if (val)
> +		memcpy(&cmd->data, val, len);
> +	cmd->user_data = user_data;
> +	cmd->offset = offset;
> +	cmd->cmd = opt;
> +	cmd->resp = resp;
> +
> +	return true;
> +}
> +
> +enum {

A comment would help explain why the enum is needed:

/* A way to remember our state when interrupted by a signal */

> +	SEND_CMD,
> +	GET_REPLY,
> +	COMPLETE
> +};
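
As an aside, with these comments in place the userspace side of the
protocol boils down to a loop like the following (a sketch only;
handle_read()/handle_write() are hypothetical device callbacks, fd is
assumed to be a socket usable in both directions, and error handling is
omitted):

        struct ioregionfd_cmd cmd;
        struct ioregionfd_resp resp = {};

        while (read(fd, &cmd, sizeof(cmd)) == sizeof(cmd)) {
                size_t len = 1u << cmd.size_exponent;

                if (cmd.cmd == IOREGIONFD_CMD_READ)
                        resp.data = handle_read(cmd.offset, len);
                else
                        handle_write(cmd.offset, cmd.data, len);

                if (cmd.resp)
                        write(fd, &resp, sizeof(resp));
        }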

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v3 4/5] KVM: add ioregionfd context
  2021-02-21 12:04 ` [RFC v3 4/5] KVM: add ioregionfd context Elena Afanasova
@ 2021-02-24 11:27   ` Stefan Hajnoczi
  2021-03-09  7:54   ` Jason Wang
  1 sibling, 0 replies; 30+ messages in thread
From: Stefan Hajnoczi @ 2021-02-24 11:27 UTC (permalink / raw)
  To: Elena Afanasova
  Cc: kvm, jag.raman, elena.ufimtseva, pbonzini, jasowang, mst, cohuck,
	john.levon

[-- Attachment #1: Type: text/plain, Size: 3096 bytes --]

On Sun, Feb 21, 2021 at 03:04:40PM +0300, Elena Afanasova wrote:
> +/* ioregions that share the same rfd are serialized so that only one vCPU
> + * thread sends a struct ioregionfd_cmd to userspace at a time. This
> + * ensures that the struct ioregionfd_resp received from userspace will
> + * be processed by the one and only vCPU thread that sent it.
> + *
> + * A waitqueue is used to wake up waiting vCPU threads in order. Most of
> + * the time the waitqueue is unused and the lock is not contended.
> + * For best performance userspace should set up ioregionfds so that there
> + * is no contention (e.g. dedicated ioregionfds for queue doorbell
> + * registers on multi-queue devices).
> + */
> +struct ioregionfd {
> +	wait_queue_head_t	  wq;
> +	struct file		 *rf;
> +	struct kref		  kref;
> +	bool			  busy;
> +};
> +
> +struct ioregion {
> +	struct list_head	  list;
> +	u64			  paddr;   /* guest physical address */
> +	u64			  size;    /* size in bytes */
> +	struct file		 *wf;
> +	u64			  user_data; /* opaque token used by userspace */
> +	struct kvm_io_device	  dev;
> +	bool			  posted_writes;
> +	struct ioregionfd	 *ctx;

The name "ctx" is a little confusion since there is also an ioregion_ctx
in struct kvm_vcpu now. Maybe s/struct ioregionfd/struct
ioregion_resp_file/ and s/ctx/resp_file/?

> @@ -90,6 +118,30 @@ ioregion_save_ctx(struct kvm_vcpu *vcpu, bool in, gpa_t addr, u8 state, void *va
>  	vcpu->ioregion_ctx.in = in;
>  }
>  
> +static inline void
> +ioregion_lock_ctx(struct ioregionfd *ctx)
> +{
> +	if (!ctx)
> +		return;
> +
> +	spin_lock(&ctx->wq.lock);
> +	wait_event_interruptible_exclusive_locked(ctx->wq, !ctx->busy);

What happens if this function call is interrupted by a signal?

> @@ -129,14 +185,15 @@ ioregion_read(struct kvm_vcpu *vcpu, struct kvm_io_device *this, gpa_t addr,
>  	}
>  	if (ret != sizeof(buf.cmd)) {
>  		ret = (ret < 0) ? ret : -EIO;
> -		return (ret == -EAGAIN || ret == -EWOULDBLOCK) ? -EINVAL : ret;
> +		ret = (ret == -EAGAIN || ret == -EWOULDBLOCK) ? -EINVAL : ret;
> +		goto out;
>  	}
> -	if (!p->rf)
> +	if (!p->ctx)
>  		return 0;

I think this should be goto out so that ioregion_unlock_ctx() gets
called.
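
I.e. something like (a sketch; ret is set to 0 so the success path is
preserved):

        if (!p->ctx) {
                ret = 0;
                goto out;
        }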

> @@ -322,6 +420,12 @@ kvm_set_ioregion_idx(struct kvm *kvm, struct kvm_ioregion *args, enum kvm_bus bu
>  		ret = -EEXIST;
>  		goto unlock_fail;
>  	}
> +
> +	if (rfile && !ioregion_get_ctx(kvm, p, rfile, bus_idx)) {
> +		ret = -ENOMEM;
> +		goto unlock_fail;
> +	}
> +
>  	kvm_iodevice_init(&p->dev, &ioregion_ops);
>  	ret = kvm_io_bus_register_dev(kvm, bus_idx, p->paddr, p->size,
>  				      &p->dev);
> @@ -335,6 +439,8 @@ kvm_set_ioregion_idx(struct kvm *kvm, struct kvm_ioregion *args, enum kvm_bus bu
>  
>  unlock_fail:
>  	mutex_unlock(&kvm->slots_lock);
> +	if (p->ctx)
> +		kref_put(&p->ctx->kref, ctx_free);

Please move the kref_put() before mutex_unlock(&kvm->slots_lock) since
ioregion_get_ctx() depends on kvm->slots_lock to avoid race conditions
with struct ioregionfds that are being created/destroyed.
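
So the error path would become something like (sketch):

unlock_fail:
        if (p->ctx)
                kref_put(&p->ctx->kref, ctx_free);
        mutex_unlock(&kvm->slots_lock);
        kfree(p);
fail:
        if (rfile)
                fput(rfile);
        fput(wfile);

        return ret;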

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v3 0/5] Introduce MMIO/PIO dispatch file descriptors (ioregionfd)
  2021-02-21 12:04 [RFC v3 0/5] Introduce MMIO/PIO dispatch file descriptors (ioregionfd) Elena Afanasova
                   ` (5 preceding siblings ...)
  2021-02-21 17:06 ` [RFC v3 0/5] Introduce MMIO/PIO dispatch file descriptors (ioregionfd) Paolo Bonzini
@ 2021-02-24 11:34 ` Stefan Hajnoczi
  6 siblings, 0 replies; 30+ messages in thread
From: Stefan Hajnoczi @ 2021-02-24 11:34 UTC (permalink / raw)
  To: Elena Afanasova
  Cc: kvm, jag.raman, elena.ufimtseva, pbonzini, jasowang, mst, cohuck,
	john.levon

[-- Attachment #1: Type: text/plain, Size: 930 bytes --]

On Sun, Feb 21, 2021 at 03:04:36PM +0300, Elena Afanasova wrote:
> This patchset introduces a KVM dispatch mechanism which can be used 
> for handling MMIO/PIO accesses over file descriptors without returning 
> from ioctl(KVM_RUN). This allows device emulation to run in another task 
> separate from the vCPU task.
> 
> This is achieved through KVM vm ioctl for registering MMIO/PIO regions and 
> a wire protocol that KVM uses to communicate with a task handling an 
> MMIO/PIO access.
> 
> TODOs:
> * Implement KVM_EXIT_IOREGIONFD_FAILURE
> * Add non-x86 arch support
> * Add kvm-unittests
> * Flush waiters if ioregion is deleted
 * Add ioctl docs to api.rst
 * Add wire protocol docs to <linux/ioregionfd.h>

Great, looks like userspace can really start trying out ioregionfd now -
most features are implemented. I will do a deeper review of the state
machine when you send the next revision.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v3 1/5] KVM: add initial support for KVM_SET_IOREGION
  2021-02-21 12:04 ` [RFC v3 1/5] KVM: add initial support for KVM_SET_IOREGION Elena Afanasova
  2021-02-24 10:06   ` Stefan Hajnoczi
@ 2021-03-05 13:09   ` Cornelia Huck
  2021-03-09  5:26   ` Jason Wang
  2 siblings, 0 replies; 30+ messages in thread
From: Cornelia Huck @ 2021-03-05 13:09 UTC (permalink / raw)
  To: Elena Afanasova
  Cc: kvm, stefanha, jag.raman, elena.ufimtseva, pbonzini, jasowang,
	mst, john.levon

On Sun, 21 Feb 2021 15:04:37 +0300
Elena Afanasova <eafanasova@gmail.com> wrote:

> This vm ioctl adds or removes an ioregionfd MMIO/PIO region. Guest
> read and write accesses are dispatched through the given ioregionfd
> instead of returning from ioctl(KVM_RUN).
> 
> Signed-off-by: Elena Afanasova <eafanasova@gmail.com>
> ---
> v3:
>  - add FAST_MMIO bus support
>  - add KVM_IOREGION_DEASSIGN flag
>  - rename kvm_ioregion read/write file descriptors
> 
>  arch/x86/kvm/Kconfig     |   1 +
>  arch/x86/kvm/Makefile    |   1 +
>  arch/x86/kvm/x86.c       |   1 +
>  include/linux/kvm_host.h |  18 +++
>  include/uapi/linux/kvm.h |  25 ++++
>  virt/kvm/Kconfig         |   3 +
>  virt/kvm/eventfd.c       |  25 ++++
>  virt/kvm/eventfd.h       |  14 +++
>  virt/kvm/ioregion.c      | 265 +++++++++++++++++++++++++++++++++++++++
>  virt/kvm/ioregion.h      |  15 +++
>  virt/kvm/kvm_main.c      |  11 ++
>  11 files changed, 379 insertions(+)
>  create mode 100644 virt/kvm/eventfd.h
>  create mode 100644 virt/kvm/ioregion.c
>  create mode 100644 virt/kvm/ioregion.h
> 

(...)

> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> index c2323c27a28b..aadb73903f8b 100644
> --- a/virt/kvm/eventfd.c
> +++ b/virt/kvm/eventfd.c
> @@ -27,6 +27,7 @@
>  #include <trace/events/kvm.h>
>  
>  #include <kvm/iodev.h>
> +#include "ioregion.h"
>  
>  #ifdef CONFIG_HAVE_KVM_IRQFD
>  
> @@ -755,6 +756,23 @@ static const struct kvm_io_device_ops ioeventfd_ops = {
>  	.destructor = ioeventfd_destructor,
>  };
>  
> +#ifdef CONFIG_KVM_IOREGION
> +/* assumes kvm->slots_lock held */
> +bool kvm_eventfd_collides(struct kvm *kvm, int bus_idx,
> +			  u64 start, u64 size)
> +{
> +	struct _ioeventfd *_p;
> +
> +	list_for_each_entry(_p, &kvm->ioeventfds, list)
> +		if (_p->bus_idx == bus_idx &&
> +		    overlap(start, size, _p->addr,
> +			    !_p->length ? 8 : _p->length))

I'm wondering whether we should enable a bus-specific overlap()
function; that would make it possible to wire up weird things like ccw.
But not really needed right now, and can be easily added when it
becomes relevant.

> +			return true;
> +
> +	return false;
> +}
> +#endif
> +
>  /* assumes kvm->slots_lock held */
>  static bool
>  ioeventfd_check_collision(struct kvm *kvm, struct _ioeventfd *p)
> @@ -770,6 +788,13 @@ ioeventfd_check_collision(struct kvm *kvm, struct _ioeventfd *p)
>  		       _p->datamatch == p->datamatch))))
>  			return true;
>  
> +#ifdef CONFIG_KVM_IOREGION

This might benefit from a comment explaining why you only check these two
(especially as you also check FAST_MMIO during ioregionfd setup).

> +	if (p->bus_idx == KVM_MMIO_BUS || p->bus_idx == KVM_PIO_BUS)
> +		if (kvm_ioregion_collides(kvm, p->bus_idx, p->addr,
> +					  !p->length ? 8 : p->length))
> +			return true;
> +#endif
> +
>  	return false;
>  }
>  

(...)


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v3 1/5] KVM: add initial support for KVM_SET_IOREGION
  2021-02-21 12:04 ` [RFC v3 1/5] KVM: add initial support for KVM_SET_IOREGION Elena Afanasova
  2021-02-24 10:06   ` Stefan Hajnoczi
  2021-03-05 13:09   ` Cornelia Huck
@ 2021-03-09  5:26   ` Jason Wang
  2021-03-22  9:57     ` Stefan Hajnoczi
  2 siblings, 1 reply; 30+ messages in thread
From: Jason Wang @ 2021-03-09  5:26 UTC (permalink / raw)
  To: Elena Afanasova, kvm
  Cc: stefanha, jag.raman, elena.ufimtseva, pbonzini, mst, cohuck, john.levon


On 2021/2/21 8:04 下午, Elena Afanasova wrote:
> This vm ioctl adds or removes an ioregionfd MMIO/PIO region. Guest
> read and write accesses are dispatched through the given ioregionfd
> instead of returning from ioctl(KVM_RUN).
>
> Signed-off-by: Elena Afanasova <eafanasova@gmail.com>
> ---
> v3:
>   - add FAST_MMIO bus support
>   - add KVM_IOREGION_DEASSIGN flag
>   - rename kvm_ioregion read/write file descriptors
>
>   arch/x86/kvm/Kconfig     |   1 +
>   arch/x86/kvm/Makefile    |   1 +
>   arch/x86/kvm/x86.c       |   1 +
>   include/linux/kvm_host.h |  18 +++
>   include/uapi/linux/kvm.h |  25 ++++
>   virt/kvm/Kconfig         |   3 +
>   virt/kvm/eventfd.c       |  25 ++++
>   virt/kvm/eventfd.h       |  14 +++
>   virt/kvm/ioregion.c      | 265 +++++++++++++++++++++++++++++++++++++++
>   virt/kvm/ioregion.h      |  15 +++
>   virt/kvm/kvm_main.c      |  11 ++
>   11 files changed, 379 insertions(+)
>   create mode 100644 virt/kvm/eventfd.h
>   create mode 100644 virt/kvm/ioregion.c
>   create mode 100644 virt/kvm/ioregion.h
>
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index f92dfd8ef10d..b914ef375199 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -33,6 +33,7 @@ config KVM
>   	select HAVE_KVM_IRQ_BYPASS
>   	select HAVE_KVM_IRQ_ROUTING
>   	select HAVE_KVM_EVENTFD
> +	select KVM_IOREGION
>   	select KVM_ASYNC_PF
>   	select USER_RETURN_NOTIFIER
>   	select KVM_MMIO
> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index b804444e16d4..b3b17dc9f7d4 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -12,6 +12,7 @@ KVM := ../../../virt/kvm
>   kvm-y			+= $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
>   				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
>   kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(KVM)/async_pf.o
> +kvm-$(CONFIG_KVM_IOREGION)	+= $(KVM)/ioregion.o
>   
>   kvm-y			+= x86.o emulate.o i8259.o irq.o lapic.o \
>   			   i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e545a8a613b1..ddb28f5ca252 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3739,6 +3739,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>   	case KVM_CAP_X86_USER_SPACE_MSR:
>   	case KVM_CAP_X86_MSR_FILTER:
>   	case KVM_CAP_ENFORCE_PV_FEATURE_CPUID:
> +	case KVM_CAP_IOREGIONFD:
>   		r = 1;
>   		break;
>   	case KVM_CAP_SYNC_REGS:
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 7f2e2a09ebbd..f35f0976f5cf 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -470,6 +470,11 @@ struct kvm {
>   		struct mutex      resampler_lock;
>   	} irqfds;
>   	struct list_head ioeventfds;
> +#endif
> +#ifdef CONFIG_KVM_IOREGION
> +	struct list_head ioregions_fast_mmio;
> +	struct list_head ioregions_mmio;
> +	struct list_head ioregions_pio;
>   #endif
>   	struct kvm_vm_stat stat;
>   	struct kvm_arch arch;
> @@ -1262,6 +1267,19 @@ static inline int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
>   
>   #endif /* CONFIG_HAVE_KVM_EVENTFD */
>   
> +#ifdef CONFIG_KVM_IOREGION
> +void kvm_ioregionfd_init(struct kvm *kvm);
> +int kvm_ioregionfd(struct kvm *kvm, struct kvm_ioregion *args);
> +
> +#else
> +
> +static inline void kvm_ioregionfd_init(struct kvm *kvm) {}
> +static inline int kvm_ioregionfd(struct kvm *kvm, struct kvm_ioregion *args)
> +{
> +	return -ENOSYS;
> +}
> +#endif
> +
>   void kvm_arch_irq_routing_update(struct kvm *kvm);
>   
>   static inline void kvm_make_request(int req, struct kvm_vcpu *vcpu)
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index ca41220b40b8..a1b1a60571f8 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -732,6 +732,29 @@ struct kvm_ioeventfd {
>   	__u8  pad[36];
>   };
>   
> +enum {
> +	kvm_ioregion_flag_nr_pio,
> +	kvm_ioregion_flag_nr_posted_writes,
> +	kvm_ioregion_flag_nr_deassign,
> +	kvm_ioregion_flag_nr_max,
> +};
> +
> +#define KVM_IOREGION_PIO (1 << kvm_ioregion_flag_nr_pio)
> +#define KVM_IOREGION_POSTED_WRITES (1 << kvm_ioregion_flag_nr_posted_writes)
> +#define KVM_IOREGION_DEASSIGN (1 << kvm_ioregion_flag_nr_deassign)
> +
> +#define KVM_IOREGION_VALID_FLAG_MASK ((1 << kvm_ioregion_flag_nr_max) - 1)
> +
> +struct kvm_ioregion {
> +	__u64 guest_paddr; /* guest physical address */
> +	__u64 memory_size; /* bytes */
> +	__u64 user_data;
> +	__s32 read_fd;
> +	__s32 write_fd;
> +	__u32 flags;
> +	__u8  pad[28];
> +};
> +
>   #define KVM_X86_DISABLE_EXITS_MWAIT          (1 << 0)
>   #define KVM_X86_DISABLE_EXITS_HLT            (1 << 1)
>   #define KVM_X86_DISABLE_EXITS_PAUSE          (1 << 2)
> @@ -1053,6 +1076,7 @@ struct kvm_ppc_resize_hpt {
>   #define KVM_CAP_X86_USER_SPACE_MSR 188
>   #define KVM_CAP_X86_MSR_FILTER 189
>   #define KVM_CAP_ENFORCE_PV_FEATURE_CPUID 190
> +#define KVM_CAP_IOREGIONFD 191
>   
>   #ifdef KVM_CAP_IRQ_ROUTING
>   
> @@ -1308,6 +1332,7 @@ struct kvm_vfio_spapr_tce {
>   					struct kvm_userspace_memory_region)
>   #define KVM_SET_TSS_ADDR          _IO(KVMIO,   0x47)
>   #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO,  0x48, __u64)
> +#define KVM_SET_IOREGION          _IOW(KVMIO,  0x49, struct kvm_ioregion)


I wonder how we could extend ioregionfd in the future. Do we need 
something like a handshake or a version field here?
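
For reference, userspace can at least gate its use of the ioctl on the
capability; anything beyond that would presumably have to ride on the
flags and pad fields of struct kvm_ioregion. A sketch of the userspace
probe (vm_fd is assumed to be the VM file descriptor):

        if (ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_IOREGIONFD) <= 0)
                return -1; /* fall back to KVM_EXIT_MMIO/KVM_EXIT_IO handling */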


>   
>   /* enable ucontrol for s390 */
>   struct kvm_s390_ucas_mapping {
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 1c37ccd5d402..5e6620bbf000 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -17,6 +17,9 @@ config HAVE_KVM_EVENTFD
>          bool
>          select EVENTFD
>   
> +config KVM_IOREGION
> +       bool
> +
>   config KVM_MMIO
>          bool
>   
> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> index c2323c27a28b..aadb73903f8b 100644
> --- a/virt/kvm/eventfd.c
> +++ b/virt/kvm/eventfd.c
> @@ -27,6 +27,7 @@
>   #include <trace/events/kvm.h>
>   
>   #include <kvm/iodev.h>
> +#include "ioregion.h"
>   
>   #ifdef CONFIG_HAVE_KVM_IRQFD
>   
> @@ -755,6 +756,23 @@ static const struct kvm_io_device_ops ioeventfd_ops = {
>   	.destructor = ioeventfd_destructor,
>   };
>   
> +#ifdef CONFIG_KVM_IOREGION
> +/* assumes kvm->slots_lock held */
> +bool kvm_eventfd_collides(struct kvm *kvm, int bus_idx,
> +			  u64 start, u64 size)
> +{
> +	struct _ioeventfd *_p;
> +
> +	list_for_each_entry(_p, &kvm->ioeventfds, list)
> +		if (_p->bus_idx == bus_idx &&
> +		    overlap(start, size, _p->addr,
> +			    !_p->length ? 8 : _p->length))
> +			return true;
> +
> +	return false;
> +}
> +#endif
> +
>   /* assumes kvm->slots_lock held */
>   static bool
>   ioeventfd_check_collision(struct kvm *kvm, struct _ioeventfd *p)
> @@ -770,6 +788,13 @@ ioeventfd_check_collision(struct kvm *kvm, struct _ioeventfd *p)
>   		       _p->datamatch == p->datamatch))))
>   			return true;
>   
> +#ifdef CONFIG_KVM_IOREGION
> +	if (p->bus_idx == KVM_MMIO_BUS || p->bus_idx == KVM_PIO_BUS)
> +		if (kvm_ioregion_collides(kvm, p->bus_idx, p->addr,
> +					  !p->length ? 8 : p->length))
> +			return true;
> +#endif
> +
>   	return false;
>   }
>   
> diff --git a/virt/kvm/eventfd.h b/virt/kvm/eventfd.h
> new file mode 100644
> index 000000000000..73a621eebae3
> --- /dev/null
> +++ b/virt/kvm/eventfd.h
> @@ -0,0 +1,14 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +#ifndef __KVM_EVENTFD_H__
> +#define __KVM_EVENTFD_H__
> +
> +#ifdef CONFIG_KVM_IOREGION
> +bool kvm_eventfd_collides(struct kvm *kvm, int bus_idx, u64 start, u64 size);
> +#else
> +static inline bool
> +kvm_eventfd_collides(struct kvm *kvm, int bus_idx, u64 start, u64 size)
> +{
> +	return false;
> +}
> +#endif
> +#endif
> diff --git a/virt/kvm/ioregion.c b/virt/kvm/ioregion.c
> new file mode 100644
> index 000000000000..e09ef3e2c9d7
> --- /dev/null
> +++ b/virt/kvm/ioregion.c
> @@ -0,0 +1,265 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +#include <linux/kvm_host.h>
> +#include <linux/fs.h>
> +#include <kvm/iodev.h>
> +#include "eventfd.h"
> +
> +void
> +kvm_ioregionfd_init(struct kvm *kvm)
> +{
> +	INIT_LIST_HEAD(&kvm->ioregions_fast_mmio);
> +	INIT_LIST_HEAD(&kvm->ioregions_mmio);
> +	INIT_LIST_HEAD(&kvm->ioregions_pio);
> +}
> +
> +struct ioregion {
> +	struct list_head     list;
> +	u64                  paddr;  /* guest physical address */
> +	u64                  size;   /* size in bytes */
> +	struct file         *rf;
> +	struct file         *wf;
> +	u64                  user_data; /* opaque token used by userspace */
> +	struct kvm_io_device dev;
> +	bool                 posted_writes;
> +};
> +
> +static inline struct ioregion *
> +to_ioregion(struct kvm_io_device *dev)
> +{
> +	return container_of(dev, struct ioregion, dev);
> +}
> +
> +/* assumes kvm->slots_lock held */
> +static void
> +ioregion_release(struct ioregion *p)
> +{
> +	if (p->rf)
> +		fput(p->rf);
> +	fput(p->wf);
> +	list_del(&p->list);
> +	kfree(p);
> +}
> +
> +static int
> +ioregion_read(struct kvm_vcpu *vcpu, struct kvm_io_device *this, gpa_t addr,
> +	      int len, void *val)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static int
> +ioregion_write(struct kvm_vcpu *vcpu, struct kvm_io_device *this, gpa_t addr,
> +		int len, const void *val)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +/*
> + * This function is called as KVM is completely shutting down.  We do not
> + * need to worry about locking; just nuke anything we have as quickly as possible
> + */
> +static void
> +ioregion_destructor(struct kvm_io_device *this)
> +{
> +	struct ioregion *p = to_ioregion(this);
> +
> +	ioregion_release(p);
> +}
> +
> +static const struct kvm_io_device_ops ioregion_ops = {
> +	.read       = ioregion_read,
> +	.write      = ioregion_write,
> +	.destructor = ioregion_destructor,
> +};
> +
> +static inline struct list_head *
> +get_ioregion_list(struct kvm *kvm, enum kvm_bus bus_idx)
> +{
> +	if (bus_idx == KVM_FAST_MMIO_BUS)
> +		return &kvm->ioregions_fast_mmio;
> +	if (bus_idx == KVM_MMIO_BUS)
> +		return &kvm->ioregions_mmio;
> +	if (bus_idx == KVM_PIO_BUS)
> +		return &kvm->ioregions_pio;
> +}
> +
> +/* check for not overlapping case and reverse */


This check is much stricter than what is done in 
ioeventfd_check_collision(). Any reason for that?



> +inline bool
> +overlap(u64 start1, u64 size1, u64 start2, u64 size2)
> +{
> +	u64 end1 = start1 + size1 - 1;
> +	u64 end2 = start2 + size2 - 1;
> +
> +	return !(end1 < start2 || start1 >= end2);
> +}
> +
> +/* assumes kvm->slots_lock held */
> +bool
> +kvm_ioregion_collides(struct kvm *kvm, int bus_idx,
> +		      u64 start, u64 size)
> +{
> +	struct ioregion *p;
> +	struct list_head *ioregions = get_ioregion_list(kvm, bus_idx);
> +
> +	list_for_each_entry(p, ioregions, list)
> +		if (overlap(start, size, p->paddr, !p->size ? 8 : p->size))
> +			return true;
> +
> +	return false;
> +}
> +
> +/* assumes kvm->slots_lock held */
> +static bool
> +ioregion_collision(struct kvm *kvm, struct ioregion *p, enum kvm_bus bus_idx)
> +{
> +	if (kvm_ioregion_collides(kvm, bus_idx, p->paddr, !p->size ? 8 : p->size) ||
> +	    kvm_eventfd_collides(kvm, bus_idx, p->paddr, !p->size ? 8 : p->size))
> +		return true;
> +
> +	return false;
> +}
> +
> +static enum kvm_bus
> +get_bus_from_flags(__u32 flags)
> +{
> +	if (flags & KVM_IOREGION_PIO)
> +		return KVM_PIO_BUS;
> +	return KVM_MMIO_BUS;
> +}
> +
> +int
> +kvm_set_ioregion_idx(struct kvm *kvm, struct kvm_ioregion *args, enum kvm_bus bus_idx)
> +{
> +	struct ioregion *p;
> +	struct file *rfile = NULL, *wfile;
> +	int ret = 0;
> +
> +	wfile = fget(args->write_fd);
> +	if (!wfile)
> +		return -EBADF;
> +	if (args->memory_size) {
> +		rfile = fget(args->read_fd);
> +		if (!rfile) {
> +			fput(wfile);
> +			return -EBADF;
> +		}
> +	}
> +	p = kzalloc(sizeof(*p), GFP_KERNEL_ACCOUNT);
> +	if (!p) {
> +		ret = -ENOMEM;
> +		goto fail;
> +	}
> +
> +	INIT_LIST_HEAD(&p->list);
> +	p->paddr = args->guest_paddr;
> +	p->size = args->memory_size;
> +	p->user_data = args->user_data;
> +	p->rf = rfile;
> +	p->wf = wfile;
> +	p->posted_writes = args->flags & KVM_IOREGION_POSTED_WRITES;
> +
> +	mutex_lock(&kvm->slots_lock);
> +
> +	if (ioregion_collision(kvm, p, bus_idx)) {
> +		ret = -EEXIST;
> +		goto unlock_fail;
> +	}
> +	kvm_iodevice_init(&p->dev, &ioregion_ops);
> +	ret = kvm_io_bus_register_dev(kvm, bus_idx, p->paddr, p->size,
> +				      &p->dev);
> +	if (ret < 0)
> +		goto unlock_fail;
> +	list_add_tail(&p->list, get_ioregion_list(kvm, bus_idx));
> +
> +	mutex_unlock(&kvm->slots_lock);
> +
> +	return 0;
> +
> +unlock_fail:
> +	mutex_unlock(&kvm->slots_lock);
> +	kfree(p);
> +fail:
> +	if (rfile)
> +		fput(rfile);
> +	fput(wfile);
> +
> +	return ret;
> +}
> +
> +static int
> +kvm_rm_ioregion_idx(struct kvm *kvm, struct kvm_ioregion *args, enum kvm_bus bus_idx)
> +{
> +	struct ioregion *p, *tmp;
> +	int ret = -ENOENT;
> +
> +	struct list_head *ioregions = get_ioregion_list(kvm, bus_idx);
> +
> +	mutex_lock(&kvm->slots_lock);
> +
> +	list_for_each_entry_safe(p, tmp, ioregions, list) {
> +		if (p->paddr == args->guest_paddr  &&
> +		    p->size == args->memory_size) {
> +			kvm_io_bus_unregister_dev(kvm, bus_idx, &p->dev);
> +			ioregion_release(p);
> +			ret = 0;
> +			break;
> +		}
> +	}
> +
> +	mutex_unlock(&kvm->slots_lock);
> +
> +	return ret;
> +}
> +
> +static int
> +kvm_set_ioregion(struct kvm *kvm, struct kvm_ioregion *args)
> +{
> +	int ret;
> +
> +	enum kvm_bus bus_idx = get_bus_from_flags(args->flags);
> +
> +	/* check for range overflow */
> +	if (args->guest_paddr + args->memory_size < args->guest_paddr)
> +		return -EINVAL;
> +	/* If size is ignored only posted writes are allowed */
> +	if (!args->memory_size && !(args->flags & KVM_IOREGION_POSTED_WRITES))


We don't have flags like KVM_IOREGION_POSTED_WRITES for ioeventfd. Is 
this a must?


> +		return -EINVAL;
> +
> +	ret = kvm_set_ioregion_idx(kvm, args, bus_idx);
> +	if (ret)
> +		return ret;
> +
> +	/* If size is ignored, MMIO is also put on a FAST_MMIO bus */
> +	if (!args->memory_size && bus_idx == KVM_MMIO_BUS)
> +		ret = kvm_set_ioregion_idx(kvm, args, KVM_FAST_MMIO_BUS);
> +	if (ret) {


The check is duplicated if it wasn't a FAST_MMIO_BUS. Let's check only 
for FAST_MMIO.
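
I.e. something like (sketch):

        /* If size is ignored, MMIO is also put on a FAST_MMIO bus */
        if (!args->memory_size && bus_idx == KVM_MMIO_BUS) {
                ret = kvm_set_ioregion_idx(kvm, args, KVM_FAST_MMIO_BUS);
                if (ret) {
                        kvm_rm_ioregion_idx(kvm, args, bus_idx);
                        return ret;
                }
        }

        return 0;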


> +		kvm_rm_ioregion_idx(kvm, args, bus_idx);
> +		return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +static int
> +kvm_rm_ioregion(struct kvm *kvm, struct kvm_ioregion *args)
> +{


Since we use "kvm_set_ioregion", would it be better to use "kvm_unset_ioregion", 
or just use assign/deassign as ioeventfd does?

Thanks


> +	enum kvm_bus bus_idx = get_bus_from_flags(args->flags);
> +	int ret = kvm_rm_ioregion_idx(kvm, args, bus_idx);
> +
> +	if (!args->memory_size && bus_idx == KVM_MMIO_BUS)
> +		kvm_rm_ioregion_idx(kvm, args, KVM_FAST_MMIO_BUS);
> +
> +	return ret;
> +}
> +
> +int
> +kvm_ioregionfd(struct kvm *kvm, struct kvm_ioregion *args)
> +{
> +	if (args->flags & ~KVM_IOREGION_VALID_FLAG_MASK)
> +		return -EINVAL;
> +
> +	if (args->flags & KVM_IOREGION_DEASSIGN)
> +		return kvm_rm_ioregion(kvm, args);
> +
> +	return kvm_set_ioregion(kvm, args);
> +}
> diff --git a/virt/kvm/ioregion.h b/virt/kvm/ioregion.h
> new file mode 100644
> index 000000000000..23ffa812ec7a
> --- /dev/null
> +++ b/virt/kvm/ioregion.h
> @@ -0,0 +1,15 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +#ifndef __KVM_IOREGION_H__
> +#define __KVM_IOREGION_H__
> +
> +#ifdef CONFIG_KVM_IOREGION
> +inline bool overlap(u64 start1, u64 size1, u64 start2, u64 size2);
> +bool kvm_ioregion_collides(struct kvm *kvm, int bus_idx, u64 start, u64 size);
> +#else
> +static inline bool
> +kvm_ioregion_collides(struct kvm *kvm, int bus_idx, u64 start, u64 size)
> +{
> +	return false;
> +}
> +#endif
> +#endif
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 2541a17ff1c4..88b92fc3da51 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -747,6 +747,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
>   	mmgrab(current->mm);
>   	kvm->mm = current->mm;
>   	kvm_eventfd_init(kvm);
> +	kvm_ioregionfd_init(kvm);
>   	mutex_init(&kvm->lock);
>   	mutex_init(&kvm->irq_lock);
>   	mutex_init(&kvm->slots_lock);
> @@ -3708,6 +3709,16 @@ static long kvm_vm_ioctl(struct file *filp,
>   		r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
>   		break;
>   	}
> +	case KVM_SET_IOREGION: {
> +		struct kvm_ioregion data;
> +
> +		r = -EFAULT;
> +		if (copy_from_user(&data, argp, sizeof(data)))
> +			goto out;
> +
> +		r = kvm_ioregionfd(kvm, &data);
> +		break;
> +	}
>   	case KVM_GET_DIRTY_LOG: {
>   		struct kvm_dirty_log log;
>   


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v3 2/5] KVM: x86: add support for ioregionfd signal handling
  2021-02-21 12:04 ` [RFC v3 2/5] KVM: x86: add support for ioregionfd signal handling Elena Afanasova
  2021-02-24 10:42   ` Stefan Hajnoczi
@ 2021-03-09  5:51   ` Jason Wang
  2021-03-17 14:19     ` Elena Afanasova
  1 sibling, 1 reply; 30+ messages in thread
From: Jason Wang @ 2021-03-09  5:51 UTC (permalink / raw)
  To: Elena Afanasova, kvm
  Cc: stefanha, jag.raman, elena.ufimtseva, pbonzini, mst, cohuck, john.levon


On 2021/2/21 8:04 下午, Elena Afanasova wrote:
> The vCPU thread may receive a signal during ioregionfd communication,
> ioctl(KVM_RUN) needs to return to userspace and then ioctl(KVM_RUN)
> must resume ioregionfd.


After a glance at the patch, I wonder whether we can split it into two:

1) a sleepable iodevice, which is not supported currently, probably with a 
new cap?
2) the ioregionfd-specific code (I wonder if there is any)

Then the sleepable iodevice could be reused by future features.


>
> Signed-off-by: Elena Afanasova <eafanasova@gmail.com>
> ---
> v3:
>   - add FAST_MMIO bus support
>   - move ioregion_interrupted flag to ioregion_ctx
>   - reorder ioregion_ctx fields
>   - rework complete_ioregion operations
>   - add signal handling support for crossing a page boundary case
>   - fix kvm_arch_vcpu_ioctl_run() should return -EINTR in case ioregionfd
>     is interrupted
>
>   arch/x86/kvm/vmx/vmx.c   |  40 +++++-
>   arch/x86/kvm/x86.c       | 272 +++++++++++++++++++++++++++++++++++++--
>   include/linux/kvm_host.h |  10 ++
>   virt/kvm/kvm_main.c      |  16 ++-
>   4 files changed, 317 insertions(+), 21 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 47b8357b9751..39db31afd27e 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -5357,19 +5357,51 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
>   	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
>   }
>   
> +#ifdef CONFIG_KVM_IOREGION
> +static int complete_ioregion_fast_mmio(struct kvm_vcpu *vcpu)
> +{
> +	int ret, idx;
> +
> +	idx = srcu_read_lock(&vcpu->kvm->srcu);
> +	ret = kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS,
> +			       vcpu->ioregion_ctx.addr, 0, NULL);
> +	if (ret) {
> +		ret = kvm_mmu_page_fault(vcpu, vcpu->ioregion_ctx.addr,
> +					 PFERR_RSVD_MASK, NULL, 0);
> +		srcu_read_unlock(&vcpu->kvm->srcu, idx);
> +		return ret;
> +	}
> +
> +	srcu_read_unlock(&vcpu->kvm->srcu, idx);
> +	return kvm_skip_emulated_instruction(vcpu);
> +}
> +#endif
> +
>   static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
>   {
>   	gpa_t gpa;
> +	int ret;
>   
>   	/*
>   	 * A nested guest cannot optimize MMIO vmexits, because we have an
>   	 * nGPA here instead of the required GPA.
>   	 */
>   	gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
> -	if (!is_guest_mode(vcpu) &&
> -	    !kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
> -		trace_kvm_fast_mmio(gpa);
> -		return kvm_skip_emulated_instruction(vcpu);
> +	if (!is_guest_mode(vcpu)) {
> +		ret = kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL);
> +		if (!ret) {
> +			trace_kvm_fast_mmio(gpa);
> +			return kvm_skip_emulated_instruction(vcpu);
> +		}
> +
> +#ifdef CONFIG_KVM_IOREGION
> +		if (unlikely(vcpu->ioregion_ctx.is_interrupted && ret == -EINTR)) {


So the question remains: EINTR looks wrong, since it implies the syscall 
can't be restarted. Note that the syscall in question isn't KVM_RUN but the 
kernel_read()/kernel_write() you want to do on the ioregion fd.

Also, do we need to treat EINTR and ERESTARTSYS differently, given that 
EINTR means the kernel_read()/kernel_write() can't be resumed?
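
For example, the interrupted paths could return -ERESTARTSYS instead, so
that the restart machinery can kick in when no signal handler runs (a
sketch of the alternative, not what the patch does):

        if (signal_pending(current) && state == SEND_CMD) {
                ioregion_save_ctx(vcpu, 1, addr, state, val);
                return -ERESTARTSYS;    /* restartable, unlike -EINTR */
        }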

Thanks


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v3 3/5] KVM: implement wire protocol
  2021-02-21 12:04 ` [RFC v3 3/5] KVM: implement wire protocol Elena Afanasova
  2021-02-24 11:02   ` Stefan Hajnoczi
@ 2021-03-09  6:19   ` Jason Wang
  2021-03-17 13:08     ` Elena Afanasova
  1 sibling, 1 reply; 30+ messages in thread
From: Jason Wang @ 2021-03-09  6:19 UTC (permalink / raw)
  To: Elena Afanasova, kvm
  Cc: stefanha, jag.raman, elena.ufimtseva, pbonzini, mst, cohuck, john.levon


On 2021/2/21 8:04 下午, Elena Afanasova wrote:
> Add ioregionfd blocking read/write operations.
>
> Signed-off-by: Elena Afanasova <eafanasova@gmail.com>
> ---
> v3:
>   - change wire protocol license
>   - remove ioregionfd_cmd info and drop appropriate macros
>   - fix ioregionfd state machine
>   - add sizeless ioregions support
>   - drop redundant check in ioregion_read/write()
>
>   include/uapi/linux/ioregion.h |  30 +++++++
>   virt/kvm/ioregion.c           | 162 +++++++++++++++++++++++++++++++++-
>   2 files changed, 190 insertions(+), 2 deletions(-)
>   create mode 100644 include/uapi/linux/ioregion.h
>
> diff --git a/include/uapi/linux/ioregion.h b/include/uapi/linux/ioregion.h
> new file mode 100644
> index 000000000000..58f9b5ba6186
> --- /dev/null
> +++ b/include/uapi/linux/ioregion.h
> @@ -0,0 +1,30 @@
> +/* SPDX-License-Identifier: ((GPL-2.0-only WITH Linux-syscall-note) OR BSD-3-Clause) */
> +#ifndef _UAPI_LINUX_IOREGION_H
> +#define _UAPI_LINUX_IOREGION_H
> +
> +/* Wire protocol */
> +
> +struct ioregionfd_cmd {
> +	__u8 cmd;
> +	__u8 size_exponent : 4;
> +	__u8 resp : 1;
> +	__u8 padding[6];
> +	__u64 user_data;
> +	__u64 offset;
> +	__u64 data;
> +};


Sorry if I've asked this before. Do we need an id for each 
request/response? E.g. an ioregionfd could be used by multiple vCPUs, 
and a vCPU needs a way to tell which response belongs to it.
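
One hypothetical way to do that would be an explicit request id echoed
back in the response (a sketch only, not part of the proposed UAPI, and
it would change the ABI layout):

struct ioregionfd_cmd {
        __u8 cmd;
        __u8 size_exponent : 4;
        __u8 resp : 1;
        __u8 padding[6];
        __u64 user_data;
        __u64 offset;
        __u64 data;
        __u64 req_id;   /* hypothetical: copied into ioregionfd_resp */
};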


> +
> +struct ioregionfd_resp {
> +	__u64 data;
> +	__u8 pad[24];
> +};
> +
> +#define IOREGIONFD_CMD_READ    0
> +#define IOREGIONFD_CMD_WRITE   1
> +
> +#define IOREGIONFD_SIZE_8BIT   0
> +#define IOREGIONFD_SIZE_16BIT  1
> +#define IOREGIONFD_SIZE_32BIT  2
> +#define IOREGIONFD_SIZE_64BIT  3
> +
> +#endif
> diff --git a/virt/kvm/ioregion.c b/virt/kvm/ioregion.c
> index e09ef3e2c9d7..1e1c7772d274 100644
> --- a/virt/kvm/ioregion.c
> +++ b/virt/kvm/ioregion.c
> @@ -3,6 +3,7 @@
>   #include <linux/fs.h>
>   #include <kvm/iodev.h>
>   #include "eventfd.h"
> +#include <uapi/linux/ioregion.h>
>   
>   void
>   kvm_ioregionfd_init(struct kvm *kvm)
> @@ -40,18 +41,175 @@ ioregion_release(struct ioregion *p)
>   	kfree(p);
>   }
>   
> +static bool
> +pack_cmd(struct ioregionfd_cmd *cmd, u64 offset, u64 len, u8 opt, u8 resp,
> +	 u64 user_data, const void *val)
> +{
> +	switch (len) {
> +	case 0:
> +		break;
> +	case 1:
> +		cmd->size_exponent = IOREGIONFD_SIZE_8BIT;
> +		break;
> +	case 2:
> +		cmd->size_exponent = IOREGIONFD_SIZE_16BIT;
> +		break;
> +	case 4:
> +		cmd->size_exponent = IOREGIONFD_SIZE_32BIT;
> +		break;
> +	case 8:
> +		cmd->size_exponent = IOREGIONFD_SIZE_64BIT;
> +		break;
> +	default:
> +		return false;
> +	}
> +
> +	if (val)
> +		memcpy(&cmd->data, val, len);
> +	cmd->user_data = user_data;
> +	cmd->offset = offset;
> +	cmd->cmd = opt;
> +	cmd->resp = resp;
> +
> +	return true;
> +}
> +
> +enum {
> +	SEND_CMD,
> +	GET_REPLY,
> +	COMPLETE
> +};
> +
> +static void
> +ioregion_save_ctx(struct kvm_vcpu *vcpu, bool in, gpa_t addr, u8 state, void *val)
> +{
> +	vcpu->ioregion_ctx.is_interrupted = true;
> +	vcpu->ioregion_ctx.val = val;
> +	vcpu->ioregion_ctx.state = state;
> +	vcpu->ioregion_ctx.addr = addr;
> +	vcpu->ioregion_ctx.in = in;
> +}
> +
>   static int
>   ioregion_read(struct kvm_vcpu *vcpu, struct kvm_io_device *this, gpa_t addr,
>   	      int len, void *val)
>   {
> -	return -EOPNOTSUPP;
> +	struct ioregion *p = to_ioregion(this);
> +	union {
> +		struct ioregionfd_cmd cmd;
> +		struct ioregionfd_resp resp;
> +	} buf;
> +	int ret = 0;
> +	int state = SEND_CMD;
> +
> +	if (unlikely(vcpu->ioregion_ctx.is_interrupted)) {
> +		vcpu->ioregion_ctx.is_interrupted = false;
> +
> +		switch (vcpu->ioregion_ctx.state) {
> +		case SEND_CMD:
> +			goto send_cmd;
> +		case GET_REPLY:
> +			goto get_repl;
> +		default:
> +			return -EINVAL;
> +		}
> +	}
> +
> +send_cmd:
> +	memset(&buf, 0, sizeof(buf));
> +	if (!pack_cmd(&buf.cmd, addr - p->paddr, len, IOREGIONFD_CMD_READ,
> +		      1, p->user_data, NULL))
> +		return -EOPNOTSUPP;
> +
> +	ret = kernel_write(p->wf, &buf.cmd, sizeof(buf.cmd), 0);
> +	state = (ret == sizeof(buf.cmd)) ? GET_REPLY : SEND_CMD;
> +	if (signal_pending(current) && state == SEND_CMD) {


Can the signal be delivered after a successful kernel_write()? If yes, 
is there any side effect if we redo the write here?


> +		ioregion_save_ctx(vcpu, 1, addr, state, val);
> +		return -EINTR;
> +	}
> +	if (ret != sizeof(buf.cmd)) {
> +		ret = (ret < 0) ? ret : -EIO;
> +		return (ret == -EAGAIN || ret == -EWOULDBLOCK) ? -EINVAL : ret;
> +	}
> +	if (!p->rf)
> +		return 0;
> +
> +get_repl:
> +	memset(&buf, 0, sizeof(buf));
> +	ret = kernel_read(p->rf, &buf.resp, sizeof(buf.resp), 0);
> +	state = (ret == sizeof(buf.resp)) ? COMPLETE : GET_REPLY;
> +	if (signal_pending(current) && state == GET_REPLY) {
> +		ioregion_save_ctx(vcpu, 1, addr, state, val);
> +		return -EINTR;
> +	}
> +	if (ret != sizeof(buf.resp)) {
> +		ret = (ret < 0) ? ret : -EIO;
> +		return (ret == -EAGAIN || ret == -EWOULDBLOCK) ? -EINVAL : ret;
> +	}


We may need to unify the duplicated code here with the send_cmd path.
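
E.g. a small helper could capture the repeated error mapping (name
hypothetical):

static int ioregion_map_io_err(int ret, int expected)
{
        if (ret == expected)
                return 0;
        ret = (ret < 0) ? ret : -EIO;
        return (ret == -EAGAIN || ret == -EWOULDBLOCK) ? -EINVAL : ret;
}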


> +
> +	memcpy(val, &buf.resp.data, len);
> +
> +	return 0;
>   }
>   
>   static int
>   ioregion_write(struct kvm_vcpu *vcpu, struct kvm_io_device *this, gpa_t addr,
>   		int len, const void *val)
>   {
> -	return -EOPNOTSUPP;
> +	struct ioregion *p = to_ioregion(this);
> +	union {
> +		struct ioregionfd_cmd cmd;
> +		struct ioregionfd_resp resp;
> +	} buf;
> +	int ret = 0;
> +	int state = SEND_CMD;
> +
> +	if (unlikely(vcpu->ioregion_ctx.is_interrupted)) {
> +		vcpu->ioregion_ctx.is_interrupted = false;
> +
> +		switch (vcpu->ioregion_ctx.state) {
> +		case SEND_CMD:
> +			goto send_cmd;
> +		case GET_REPLY:
> +			goto get_repl;
> +		default:
> +			return -EINVAL;
> +		}
> +	}
> +
> +send_cmd:
> +	memset(&buf, 0, sizeof(buf));
> +	if (!pack_cmd(&buf.cmd, addr - p->paddr, len, IOREGIONFD_CMD_WRITE,
> +		      p->posted_writes ? 0 : 1, p->user_data, val))
> +		return -EOPNOTSUPP;
> +
> +	ret = kernel_write(p->wf, &buf.cmd, sizeof(buf.cmd), 0);
> +	state = (ret == sizeof(buf.cmd)) ? GET_REPLY : SEND_CMD;
> +	if (signal_pending(current) && state == SEND_CMD) {
> +		ioregion_save_ctx(vcpu, 0, addr, state, (void *)val);
> +		return -EINTR;
> +	}
> +	if (ret != sizeof(buf.cmd)) {
> +		ret = (ret < 0) ? ret : -EIO;
> +		return (ret == -EAGAIN || ret == -EWOULDBLOCK) ? -EINVAL : ret;
> +	}
> +
> +get_repl:
> +	if (!p->posted_writes) {
> +		memset(&buf, 0, sizeof(buf));
> +		ret = kernel_read(p->rf, &buf.resp, sizeof(buf.resp), 0);
> +		state = (ret == sizeof(buf.resp)) ? COMPLETE : GET_REPLY;
> +		if (signal_pending(current) && state == GET_REPLY) {
> +			ioregion_save_ctx(vcpu, 0, addr, state, (void *)val);
> +			return -EINTR;
> +		}
> +		if (ret != sizeof(buf.resp)) {
> +			ret = (ret < 0) ? ret : -EIO;
> +			return (ret == -EAGAIN || ret == -EWOULDBLOCK) ? -EINVAL : ret;
> +		}
> +	}
> +
> +	return 0;


It looks to me like there is even more opportunity to unify this code with ioregion_read().

Thanks


>   }
>   
>   /*


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v3 4/5] KVM: add ioregionfd context
  2021-02-21 12:04 ` [RFC v3 4/5] KVM: add ioregionfd context Elena Afanasova
  2021-02-24 11:27   ` Stefan Hajnoczi
@ 2021-03-09  7:54   ` Jason Wang
  2021-03-09  8:01     ` Paolo Bonzini
  1 sibling, 1 reply; 30+ messages in thread
From: Jason Wang @ 2021-03-09  7:54 UTC (permalink / raw)
  To: Elena Afanasova, kvm
  Cc: stefanha, jag.raman, elena.ufimtseva, pbonzini, mst, cohuck, john.levon


On 2021/2/21 8:04 下午, Elena Afanasova wrote:
> Add support for ioregionfd cmds/replies serialization.


Let's be more verbose here, e.g. why and how this needs serialization.


>
> Signed-off-by: Elena Afanasova <eafanasova@gmail.com>
> ---
> v3:
>   - add comment
>   - drop kvm_io_bus_finish/prepare()
>
>   virt/kvm/ioregion.c | 164 ++++++++++++++++++++++++++++++++++++--------
>   1 file changed, 135 insertions(+), 29 deletions(-)
>
> diff --git a/virt/kvm/ioregion.c b/virt/kvm/ioregion.c
> index 1e1c7772d274..d53e3d1cd2ff 100644
> --- a/virt/kvm/ioregion.c
> +++ b/virt/kvm/ioregion.c
> @@ -1,10 +1,39 @@
>   // SPDX-License-Identifier: GPL-2.0-only
>   #include <linux/kvm_host.h>
> -#include <linux/fs.h>
> +#include <linux/wait.h>
>   #include <kvm/iodev.h>
>   #include "eventfd.h"
>   #include <uapi/linux/ioregion.h>
>   
> +/* ioregions that share the same rfd are serialized so that only one vCPU
> + * thread sends a struct ioregionfd_cmd to userspace at a time. This
> + * ensures that the struct ioregionfd_resp received from userspace will
> + * be processed by the one and only vCPU thread that sent it.
> + *
> + * A waitqueue is used to wake up waiting vCPU threads in order. Most of
> + * the time the waitqueue is unused and the lock is not contended.
> + * For best performance userspace should set up ioregionfds so that there
> + * is no contention (e.g. dedicated ioregionfds for queue doorbell
> + * registers on multi-queue devices).
> + */
> +struct ioregionfd {
> +	wait_queue_head_t	  wq;


If we can generalize this at kvm iodevice level, that would be better.


> +	struct file		 *rf;
> +	struct kref		  kref;
> +	bool			  busy;
> +};
> +
> +struct ioregion {
> +	struct list_head	  list;
> +	u64			  paddr;   /* guest physical address */
> +	u64			  size;    /* size in bytes */
> +	struct file		 *wf;
> +	u64			  user_data; /* opaque token used by userspace */
> +	struct kvm_io_device	  dev;
> +	bool			  posted_writes;
> +	struct ioregionfd	 *ctx;
> +};
> +
>   void
>   kvm_ioregionfd_init(struct kvm *kvm)
>   {
> @@ -13,29 +42,28 @@ kvm_ioregionfd_init(struct kvm *kvm)
>   	INIT_LIST_HEAD(&kvm->ioregions_pio);
>   }
>   
> -struct ioregion {
> -	struct list_head     list;
> -	u64                  paddr;  /* guest physical address */
> -	u64                  size;   /* size in bytes */
> -	struct file         *rf;
> -	struct file         *wf;
> -	u64                  user_data; /* opaque token used by userspace */
> -	struct kvm_io_device dev;
> -	bool                 posted_writes;
> -};


It's better to squash this patch into the implementation of the ioregion 
fd. (We don't want a patch that fixes a bug in a previous patch, and 
this will make review easier.)


> -
>   static inline struct ioregion *
>   to_ioregion(struct kvm_io_device *dev)
>   {
>   	return container_of(dev, struct ioregion, dev);
>   }
>   
> +/* assumes kvm->slots_lock held */
> +static void ctx_free(struct kref *kref)
> +{
> +	struct ioregionfd *ctx = container_of(kref, struct ioregionfd, kref);
> +
> +	kfree(ctx);
> +}
> +
>   /* assumes kvm->slots_lock held */
>   static void
>   ioregion_release(struct ioregion *p)
>   {
> -	if (p->rf)
> -		fput(p->rf);
> +	if (p->ctx) {
> +		fput(p->ctx->rf);
> +		kref_put(&p->ctx->kref, ctx_free);
> +	}
>   	fput(p->wf);
>   	list_del(&p->list);
>   	kfree(p);
> @@ -90,6 +118,30 @@ ioregion_save_ctx(struct kvm_vcpu *vcpu, bool in, gpa_t addr, u8 state, void *va
>   	vcpu->ioregion_ctx.in = in;
>   }
>   
> +static inline void
> +ioregion_lock_ctx(struct ioregionfd *ctx)
> +{
> +	if (!ctx)
> +		return;
> +
> +	spin_lock(&ctx->wq.lock);
> +	wait_event_interruptible_exclusive_locked(ctx->wq, !ctx->busy);


Any reason that a simple mutex_lock_interruptible() can't work here?
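
For example (a sketch, assuming struct ioregionfd carried a struct mutex
instead of the waitqueue/busy pair; callers would then have to handle the
-EINTR return):

static inline int ioregion_lock_ctx(struct ioregionfd *ctx)
{
        if (!ctx)
                return 0;
        return mutex_lock_interruptible(&ctx->lock);
}

static inline void ioregion_unlock_ctx(struct ioregionfd *ctx)
{
        if (ctx)
                mutex_unlock(&ctx->lock);
}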

Thanks


> +	ctx->busy = true;
> +	spin_unlock(&ctx->wq.lock);
> +}
> +
> +static inline void
> +ioregion_unlock_ctx(struct ioregionfd *ctx)
> +{
> +	if (!ctx)
> +		return;
> +
> +	spin_lock(&ctx->wq.lock);
> +	ctx->busy = false;
> +	wake_up_locked(&ctx->wq);
> +	spin_unlock(&ctx->wq.lock);
> +}
> +
>   static int
>   ioregion_read(struct kvm_vcpu *vcpu, struct kvm_io_device *this, gpa_t addr,
>   	      int len, void *val)
> @@ -115,11 +167,15 @@ ioregion_read(struct kvm_vcpu *vcpu, struct kvm_io_device *this, gpa_t addr,
>   		}
>   	}
>   
> +	ioregion_lock_ctx(p->ctx);
> +
>   send_cmd:
>   	memset(&buf, 0, sizeof(buf));
>   	if (!pack_cmd(&buf.cmd, addr - p->paddr, len, IOREGIONFD_CMD_READ,
> -		      1, p->user_data, NULL))
> -		return -EOPNOTSUPP;
> +		      1, p->user_data, NULL)) {
> +		ret = -EOPNOTSUPP;
> +		goto out;
> +	}
>   
>   	ret = kernel_write(p->wf, &buf.cmd, sizeof(buf.cmd), 0);
>   	state = (ret == sizeof(buf.cmd)) ? GET_REPLY : SEND_CMD;
> @@ -129,14 +185,15 @@ ioregion_read(struct kvm_vcpu *vcpu, struct kvm_io_device *this, gpa_t addr,
>   	}
>   	if (ret != sizeof(buf.cmd)) {
>   		ret = (ret < 0) ? ret : -EIO;
> -		return (ret == -EAGAIN || ret == -EWOULDBLOCK) ? -EINVAL : ret;
> +		ret = (ret == -EAGAIN || ret == -EWOULDBLOCK) ? -EINVAL : ret;
> +		goto out;
>   	}
> -	if (!p->rf)
> +	if (!p->ctx)
>   		return 0;
>   
>   get_repl:
>   	memset(&buf, 0, sizeof(buf));
> -	ret = kernel_read(p->rf, &buf.resp, sizeof(buf.resp), 0);
> +	ret = kernel_read(p->ctx->rf, &buf.resp, sizeof(buf.resp), 0);
>   	state = (ret == sizeof(buf.resp)) ? COMPLETE : GET_REPLY;
>   	if (signal_pending(current) && state == GET_REPLY) {
>   		ioregion_save_ctx(vcpu, 1, addr, state, val);
> @@ -144,12 +201,17 @@ ioregion_read(struct kvm_vcpu *vcpu, struct kvm_io_device *this, gpa_t addr,
>   	}
>   	if (ret != sizeof(buf.resp)) {
>   		ret = (ret < 0) ? ret : -EIO;
> -		return (ret == -EAGAIN || ret == -EWOULDBLOCK) ? -EINVAL : ret;
> +		ret = (ret == -EAGAIN || ret == -EWOULDBLOCK) ? -EINVAL : ret;
> +		goto out;
>   	}
>   
>   	memcpy(val, &buf.resp.data, len);
> +	ret = 0;
>   
> -	return 0;
> +out:
> +	ioregion_unlock_ctx(p->ctx);
> +
> +	return ret;
>   }
>   
>   static int
> @@ -177,11 +239,15 @@ ioregion_write(struct kvm_vcpu *vcpu, struct kvm_io_device *this, gpa_t addr,
>   		}
>   	}
>   
> +	ioregion_lock_ctx(p->ctx);
> +
>   send_cmd:
>   	memset(&buf, 0, sizeof(buf));
>   	if (!pack_cmd(&buf.cmd, addr - p->paddr, len, IOREGIONFD_CMD_WRITE,
> -		      p->posted_writes ? 0 : 1, p->user_data, val))
> -		return -EOPNOTSUPP;
> +		      p->posted_writes ? 0 : 1, p->user_data, val)) {
> +		ret = -EOPNOTSUPP;
> +		goto out;
> +	}
>   
>   	ret = kernel_write(p->wf, &buf.cmd, sizeof(buf.cmd), 0);
>   	state = (ret == sizeof(buf.cmd)) ? GET_REPLY : SEND_CMD;
> @@ -191,13 +257,14 @@ ioregion_write(struct kvm_vcpu *vcpu, struct kvm_io_device *this, gpa_t addr,
>   	}
>   	if (ret != sizeof(buf.cmd)) {
>   		ret = (ret < 0) ? ret : -EIO;
> -		return (ret == -EAGAIN || ret == -EWOULDBLOCK) ? -EINVAL : ret;
> +		ret = (ret == -EAGAIN || ret == -EWOULDBLOCK) ? -EINVAL : ret;
> +		goto out;
>   	}
>   
>   get_repl:
>   	if (!p->posted_writes) {
>   		memset(&buf, 0, sizeof(buf));
> -		ret = kernel_read(p->rf, &buf.resp, sizeof(buf.resp), 0);
> +		ret = kernel_read(p->ctx->rf, &buf.resp, sizeof(buf.resp), 0);
>   		state = (ret == sizeof(buf.resp)) ? COMPLETE : GET_REPLY;
>   		if (signal_pending(current) && state == GET_REPLY) {
>   			ioregion_save_ctx(vcpu, 0, addr, state, (void *)val);
> @@ -205,11 +272,16 @@ ioregion_write(struct kvm_vcpu *vcpu, struct kvm_io_device *this, gpa_t addr,
>   		}
>   		if (ret != sizeof(buf.resp)) {
>   			ret = (ret < 0) ? ret : -EIO;
> -			return (ret == -EAGAIN || ret == -EWOULDBLOCK) ? -EINVAL : ret;
> +			ret = (ret == -EAGAIN || ret == -EWOULDBLOCK) ? -EINVAL : ret;
> +			goto out;
>   		}
>   	}
> +	ret = 0;
>   
> -	return 0;
> +out:
> +	ioregion_unlock_ctx(p->ctx);
> +
> +	return ret;
>   }
>   
>   /*
> @@ -285,6 +357,33 @@ get_bus_from_flags(__u32 flags)
>   	return KVM_MMIO_BUS;
>   }
>   
> +/* assumes kvm->slots_lock held */
> +static bool
> +ioregion_get_ctx(struct kvm *kvm, struct ioregion *p, struct file *rf, int bus_idx)
> +{
> +	struct ioregion *_p;
> +	struct list_head *ioregions;
> +
> +	ioregions = get_ioregion_list(kvm, bus_idx);
> +	list_for_each_entry(_p, ioregions, list)
> +		if (file_inode(_p->ctx->rf)->i_ino == file_inode(rf)->i_ino) {
> +			p->ctx = _p->ctx;
> +			kref_get(&p->ctx->kref);
> +			return true;
> +		}
> +
> +	p->ctx = kzalloc(sizeof(*p->ctx), GFP_KERNEL_ACCOUNT);
> +	if (!p->ctx)
> +		return false;
> +
> +	p->ctx->rf = rf;
> +	p->ctx->busy = false;
> +	init_waitqueue_head(&p->ctx->wq);
> +	kref_get(&p->ctx->kref);
> +
> +	return true;
> +}
> +
>   int
>   kvm_set_ioregion_idx(struct kvm *kvm, struct kvm_ioregion *args, enum kvm_bus bus_idx)
>   {
> @@ -309,11 +408,10 @@ kvm_set_ioregion_idx(struct kvm *kvm, struct kvm_ioregion *args, enum kvm_bus bu
>   	}
>   
>   	INIT_LIST_HEAD(&p->list);
> +	p->wf = wfile;
>   	p->paddr = args->guest_paddr;
>   	p->size = args->memory_size;
>   	p->user_data = args->user_data;
> -	p->rf = rfile;
> -	p->wf = wfile;
>   	p->posted_writes = args->flags & KVM_IOREGION_POSTED_WRITES;
>   
>   	mutex_lock(&kvm->slots_lock);
> @@ -322,6 +420,12 @@ kvm_set_ioregion_idx(struct kvm *kvm, struct kvm_ioregion *args, enum kvm_bus bu
>   		ret = -EEXIST;
>   		goto unlock_fail;
>   	}
> +
> +	if (rfile && !ioregion_get_ctx(kvm, p, rfile, bus_idx)) {
> +		ret = -ENOMEM;
> +		goto unlock_fail;
> +	}
> +
>   	kvm_iodevice_init(&p->dev, &ioregion_ops);
>   	ret = kvm_io_bus_register_dev(kvm, bus_idx, p->paddr, p->size,
>   				      &p->dev);
> @@ -335,6 +439,8 @@ kvm_set_ioregion_idx(struct kvm *kvm, struct kvm_ioregion *args, enum kvm_bus bu
>   
>   unlock_fail:
>   	mutex_unlock(&kvm->slots_lock);
> +	if (p->ctx)
> +		kref_put(&p->ctx->kref, ctx_free);
>   	kfree(p);
>   fail:
>   	if (rfile)


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v3 4/5] KVM: add ioregionfd context
  2021-03-09  7:54   ` Jason Wang
@ 2021-03-09  8:01     ` Paolo Bonzini
  2021-03-10 13:20       ` Elena Afanasova
  0 siblings, 1 reply; 30+ messages in thread
From: Paolo Bonzini @ 2021-03-09  8:01 UTC (permalink / raw)
  To: Jason Wang, Elena Afanasova, kvm
  Cc: stefanha, jag.raman, elena.ufimtseva, mst, cohuck, john.levon

On 09/03/21 08:54, Jason Wang wrote:
>>
>> +        return;
>> +
>> +    spin_lock(&ctx->wq.lock);
>> +    wait_event_interruptible_exclusive_locked(ctx->wq, !ctx->busy);
> 
> 
> Any reason that a simple mutex_lock_interruptible() can't work here?

Or alternatively why can't the callers just take the spinlock.

Paolo

> 
> Thanks
> 
> 
>> +    ctx->busy = true;
>> +    spin_unlock(&ctx->wq.lock); 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v3 4/5] KVM: add ioregionfd context
  2021-03-09  8:01     ` Paolo Bonzini
@ 2021-03-10 13:20       ` Elena Afanasova
  2021-03-10 14:11         ` Paolo Bonzini
  0 siblings, 1 reply; 30+ messages in thread
From: Elena Afanasova @ 2021-03-10 13:20 UTC (permalink / raw)
  To: Paolo Bonzini, Jason Wang, kvm
  Cc: stefanha, jag.raman, elena.ufimtseva, mst, cohuck, john.levon

On Tue, 2021-03-09 at 09:01 +0100, Paolo Bonzini wrote:
> On 09/03/21 08:54, Jason Wang wrote:
> > > +        return;
> > > +
> > > +    spin_lock(&ctx->wq.lock);
> > > +    wait_event_interruptible_exclusive_locked(ctx->wq, !ctx-
> > > >busy);
> > 
> > Any reason that a simple mutex_lock_interruptible() can't work
> > here?
> 
> Or alternatively why can't the callers just take the spinlock.
> 
I'm not sure I understand your question. Do you mean why locked version
of wait_event() is used?

> Paolo
> 
> > Thanks
> > 
> > 
> > > +    ctx->busy = true;
> > > +    spin_unlock(&ctx->wq.lock); 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v3 4/5] KVM: add ioregionfd context
  2021-03-10 13:20       ` Elena Afanasova
@ 2021-03-10 14:11         ` Paolo Bonzini
  2021-03-10 16:41           ` Elena Afanasova
  0 siblings, 1 reply; 30+ messages in thread
From: Paolo Bonzini @ 2021-03-10 14:11 UTC (permalink / raw)
  To: Elena Afanasova, Jason Wang, kvm
  Cc: stefanha, jag.raman, elena.ufimtseva, mst, cohuck, john.levon

On 10/03/21 14:20, Elena Afanasova wrote:
> On Tue, 2021-03-09 at 09:01 +0100, Paolo Bonzini wrote:
>> On 09/03/21 08:54, Jason Wang wrote:
>>>> +        return;
>>>> +
>>>> +    spin_lock(&ctx->wq.lock);
>>>> +    wait_event_interruptible_exclusive_locked(ctx->wq, !ctx-
>>>>> busy);
>>>
>>> Any reason that a simple mutex_lock_interruptible() can't work
>>> here?
>>
>> Or alternatively why can't the callers just take the spinlock.
>>
> I'm not sure I understand your question. Do you mean why locked version
> of wait_event() is used?

No, I mean why do you need to use ctx->busy and wait_event, instead of 
operating directly on the spinlock or on a mutex.

Paolo


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v3 4/5] KVM: add ioregionfd context
  2021-03-10 14:11         ` Paolo Bonzini
@ 2021-03-10 16:41           ` Elena Afanasova
       [not found]             ` <6ff79d0b-3b6a-73d3-ffbd-e4af9758735f@redhat.com>
  0 siblings, 1 reply; 30+ messages in thread
From: Elena Afanasova @ 2021-03-10 16:41 UTC (permalink / raw)
  To: Paolo Bonzini, Jason Wang, kvm
  Cc: stefanha, jag.raman, elena.ufimtseva, mst, cohuck, john.levon

On Wed, 2021-03-10 at 15:11 +0100, Paolo Bonzini wrote:
> On 10/03/21 14:20, Elena Afanasova wrote:
> > On Tue, 2021-03-09 at 09:01 +0100, Paolo Bonzini wrote:
> > > On 09/03/21 08:54, Jason Wang wrote:
> > > > > +        return;
> > > > > +
> > > > > +    spin_lock(&ctx->wq.lock);
> > > > > +    wait_event_interruptible_exclusive_locked(ctx->wq, !ctx-
> > > > > > busy);
> > > > 
> > > > Any reason that a simple mutex_lock_interruptible() can't work
> > > > here?
> > > 
> > > Or alternatively why can't the callers just take the spinlock.
> > > 
> > I'm not sure I understand your question. Do you mean why locked
> > version
> > of wait_event() is used?
> 
> No, I mean why do you need to use ctx->busy and wait_event, instead
> of 
> operating directly on the spinlock or on a mutex.
> 
When ioregionfd communication is interrupted by a signal ioctl(KVM_RUN)
has to return to userspace. I'm not sure it's ok to do that with the
spinlock/mutex being held.

> Paolo
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v3 4/5] KVM: add ioregionfd context
       [not found]             ` <6ff79d0b-3b6a-73d3-ffbd-e4af9758735f@redhat.com>
@ 2021-03-17 10:46               ` Elena Afanasova
  2021-03-26  6:47                 ` Jason Wang
  0 siblings, 1 reply; 30+ messages in thread
From: Elena Afanasova @ 2021-03-17 10:46 UTC (permalink / raw)
  To: Jason Wang, Paolo Bonzini, kvm
  Cc: stefanha, jag.raman, elena.ufimtseva, mst, cohuck, john.levon

On Thu, 2021-03-11 at 11:04 +0800, Jason Wang wrote:
> 
> On 2021/3/11 12:41 AM, Elena Afanasova wrote:
> > On Wed, 2021-03-10 at 15:11 +0100, Paolo Bonzini wrote:
> > > On 10/03/21 14:20, Elena Afanasova wrote:
> > > > On Tue, 2021-03-09 at 09:01 +0100, Paolo Bonzini wrote:
> > > > > On 09/03/21 08:54, Jason Wang wrote:
> > > > > > > +        return;
> > > > > > > +
> > > > > > > +    spin_lock(&ctx->wq.lock);
> > > > > > > +    wait_event_interruptible_exclusive_locked(ctx->wq,
> > > > > > > !ctx-
> > > > > > > > busy);
> > > > > > 
> > > > > > Any reason that a simple mutex_lock_interruptible() can't
> > > > > > work
> > > > > > here?
> > > > > 
> > > > > Or alternatively why can't the callers just take the
> > > > > spinlock.
> > > > > 
> > > > 
> > > > I'm not sure I understand your question. Do you mean why locked
> > > > version
> > > > of wait_event() is used?
> > > 
> > > No, I mean why do you need to use ctx->busy and wait_event,
> > > instead
> > > of 
> > > operating directly on the spinlock or on a mutex.
> > > 
> > 
> > When ioregionfd communication is interrupted by a signal
> > ioctl(KVM_RUN)
> > has to return to userspace. I'm not sure it's ok to do that with
> > the
> > spinlock/mutex being held.
> 
> 
> So you don't answer my question. Why you can't use
> mutex_lock_interruptible() here?
> 
> It looks can do exactly what you want here.
> 
> Thanks
> 
I think a mutex could work here. I used it for the first implementation.
But is it ok to hold a mutex across kernel->user transitions? Is that a
correct pattern in this case?
If the ioctl returns to userspace and then the ioregionfd is deleted or
the vcpu fd is closed, the mutex will still need to be unlocked. But I
think it’s a bit clearer to use wake_up in e.g. kvm_vcpu_release
instead of mutex_unlock. Paolo, could you please also comment on this?

> 
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v3 3/5] KVM: implement wire protocol
  2021-03-09  6:19   ` Jason Wang
@ 2021-03-17 13:08     ` Elena Afanasova
  2021-03-26  6:21       ` Jason Wang
  0 siblings, 1 reply; 30+ messages in thread
From: Elena Afanasova @ 2021-03-17 13:08 UTC (permalink / raw)
  To: Jason Wang, kvm
  Cc: stefanha, jag.raman, elena.ufimtseva, pbonzini, mst, cohuck, john.levon

On Tue, 2021-03-09 at 14:19 +0800, Jason Wang wrote:
> On 2021/2/21 8:04 PM, Elena Afanasova wrote:
> > Add ioregionfd blocking read/write operations.
> > 
> > Signed-off-by: Elena Afanasova <eafanasova@gmail.com>
> > ---
> > v3:
> >   - change wire protocol license
> >   - remove ioregionfd_cmd info and drop appropriate macros
> >   - fix ioregionfd state machine
> >   - add sizeless ioregions support
> >   - drop redundant check in ioregion_read/write()
> > 
> >   include/uapi/linux/ioregion.h |  30 +++++++
> >   virt/kvm/ioregion.c           | 162
> > +++++++++++++++++++++++++++++++++-
> >   2 files changed, 190 insertions(+), 2 deletions(-)
> >   create mode 100644 include/uapi/linux/ioregion.h
> > 
> > diff --git a/include/uapi/linux/ioregion.h
> > b/include/uapi/linux/ioregion.h
> > new file mode 100644
> > index 000000000000..58f9b5ba6186
> > --- /dev/null
> > +++ b/include/uapi/linux/ioregion.h
> > @@ -0,0 +1,30 @@
> > +/* SPDX-License-Identifier: ((GPL-2.0-only WITH Linux-syscall-
> > note) OR BSD-3-Clause) */
> > +#ifndef _UAPI_LINUX_IOREGION_H
> > +#define _UAPI_LINUX_IOREGION_H
> > +
> > +/* Wire protocol */
> > +
> > +struct ioregionfd_cmd {
> > +	__u8 cmd;
> > +	__u8 size_exponent : 4;
> > +	__u8 resp : 1;
> > +	__u8 padding[6];
> > +	__u64 user_data;
> > +	__u64 offset;
> > +	__u64 data;
> > +};
> 
> Sorry if I've asked this before. Do we need a id for each 
> request/response? E.g an ioregion fd could be used by multiple
> vCPUS. 
> VCPU needs to have a way to find which request belongs to itself or
> not?
> 
I don’t think the id is necessary here since all requests/responses are
serialized.
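
Since everything is serialized, a single-threaded userspace handler
over this protocol could look roughly like this (just a sketch: the fd
setup and the toy device model are assumed, and short reads/writes are
not handled):

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <linux/ioregion.h>

    /* cmd_fd/resp_fd are the userspace ends of the fds registered
     * with KVM_SET_IOREGION (hypothetical names) */
    static void handle_region(int cmd_fd, int resp_fd)
    {
        struct ioregionfd_cmd cmd;
        struct ioregionfd_resp resp;
        uint64_t reg = 0; /* toy single-register device */

        while (read(cmd_fd, &cmd, sizeof(cmd)) == sizeof(cmd)) {
            if (cmd.cmd == IOREGIONFD_CMD_WRITE)
                reg = cmd.data; /* offset/size checks omitted */

            if (!cmd.resp)
                continue; /* posted write: no reply expected */

            memset(&resp, 0, sizeof(resp));
            if (cmd.cmd == IOREGIONFD_CMD_READ)
                resp.data = reg;
            write(resp_fd, &resp, sizeof(resp));
        }
    }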

> 
> > +
> > +struct ioregionfd_resp {
> > +	__u64 data;
> > +	__u8 pad[24];
> > +};
> > +
> > +#define IOREGIONFD_CMD_READ    0
> > +#define IOREGIONFD_CMD_WRITE   1
> > +
> > +#define IOREGIONFD_SIZE_8BIT   0
> > +#define IOREGIONFD_SIZE_16BIT  1
> > +#define IOREGIONFD_SIZE_32BIT  2
> > +#define IOREGIONFD_SIZE_64BIT  3
> > +
> > +#endif
> > diff --git a/virt/kvm/ioregion.c b/virt/kvm/ioregion.c
> > index e09ef3e2c9d7..1e1c7772d274 100644
> > --- a/virt/kvm/ioregion.c
> > +++ b/virt/kvm/ioregion.c
> > @@ -3,6 +3,7 @@
> >   #include <linux/fs.h>
> >   #include <kvm/iodev.h>
> >   #include "eventfd.h"
> > +#include <uapi/linux/ioregion.h>
> >   
> >   void
> >   kvm_ioregionfd_init(struct kvm *kvm)
> > @@ -40,18 +41,175 @@ ioregion_release(struct ioregion *p)
> >   	kfree(p);
> >   }
> >   
> > +static bool
> > +pack_cmd(struct ioregionfd_cmd *cmd, u64 offset, u64 len, u8 opt,
> > u8 resp,
> > +	 u64 user_data, const void *val)
> > +{
> > +	switch (len) {
> > +	case 0:
> > +		break;
> > +	case 1:
> > +		cmd->size_exponent = IOREGIONFD_SIZE_8BIT;
> > +		break;
> > +	case 2:
> > +		cmd->size_exponent = IOREGIONFD_SIZE_16BIT;
> > +		break;
> > +	case 4:
> > +		cmd->size_exponent = IOREGIONFD_SIZE_32BIT;
> > +		break;
> > +	case 8:
> > +		cmd->size_exponent = IOREGIONFD_SIZE_64BIT;
> > +		break;
> > +	default:
> > +		return false;
> > +	}
> > +
> > +	if (val)
> > +		memcpy(&cmd->data, val, len);
> > +	cmd->user_data = user_data;
> > +	cmd->offset = offset;
> > +	cmd->cmd = opt;
> > +	cmd->resp = resp;
> > +
> > +	return true;
> > +}
> > +
> > +enum {
> > +	SEND_CMD,
> > +	GET_REPLY,
> > +	COMPLETE
> > +};
> > +
> > +static void
> > +ioregion_save_ctx(struct kvm_vcpu *vcpu, bool in, gpa_t addr, u8
> > state, void *val)
> > +{
> > +	vcpu->ioregion_ctx.is_interrupted = true;
> > +	vcpu->ioregion_ctx.val = val;
> > +	vcpu->ioregion_ctx.state = state;
> > +	vcpu->ioregion_ctx.addr = addr;
> > +	vcpu->ioregion_ctx.in = in;
> > +}
> > +
> >   static int
> >   ioregion_read(struct kvm_vcpu *vcpu, struct kvm_io_device *this,
> > gpa_t addr,
> >   	      int len, void *val)
> >   {
> > -	return -EOPNOTSUPP;
> > +	struct ioregion *p = to_ioregion(this);
> > +	union {
> > +		struct ioregionfd_cmd cmd;
> > +		struct ioregionfd_resp resp;
> > +	} buf;
> > +	int ret = 0;
> > +	int state = SEND_CMD;
> > +
> > +	if (unlikely(vcpu->ioregion_ctx.is_interrupted)) {
> > +		vcpu->ioregion_ctx.is_interrupted = false;
> > +
> > +		switch (vcpu->ioregion_ctx.state) {
> > +		case SEND_CMD:
> > +			goto send_cmd;
> > +		case GET_REPLY:
> > +			goto get_repl;
> > +		default:
> > +			return -EINVAL;
> > +		}
> > +	}
> > +
> > +send_cmd:
> > +	memset(&buf, 0, sizeof(buf));
> > +	if (!pack_cmd(&buf.cmd, addr - p->paddr, len,
> > IOREGIONFD_CMD_READ,
> > +		      1, p->user_data, NULL))
> > +		return -EOPNOTSUPP;
> > +
> > +	ret = kernel_write(p->wf, &buf.cmd, sizeof(buf.cmd), 0);
> > +	state = (ret == sizeof(buf.cmd)) ? GET_REPLY : SEND_CMD;
> > +	if (signal_pending(current) && state == SEND_CMD) {
> 
> Can the signal be delivered after a success of kernel_write()? If
> yes, 
> is there any side effect if we want to redo the write here?
> 
Yes, in this case the signal should be handled by KVM. There can be a
side effect when the ioregion fd is broken and there is a pending signal
that isn’t related to ioregionfd. But this doesn’t seem to be an issue,
since it can be handled later in the ioregionfd complete operations.

> 
> > +		ioregion_save_ctx(vcpu, 1, addr, state, val);
> > +		return -EINTR;
> > +	}
> > +	if (ret != sizeof(buf.cmd)) {
> > +		ret = (ret < 0) ? ret : -EIO;
> > +		return (ret == -EAGAIN || ret == -EWOULDBLOCK) ?
> > -EINVAL : ret;
> > +	}
> > +	if (!p->rf)
> > +		return 0;
> > +
> > +get_repl:
> > +	memset(&buf, 0, sizeof(buf));
> > +	ret = kernel_read(p->rf, &buf.resp, sizeof(buf.resp), 0);
> > +	state = (ret == sizeof(buf.resp)) ? COMPLETE : GET_REPLY;
> > +	if (signal_pending(current) && state == GET_REPLY) {
> > +		ioregion_save_ctx(vcpu, 1, addr, state, val);
> > +		return -EINTR;
> > +	}
> > +	if (ret != sizeof(buf.resp)) {
> > +		ret = (ret < 0) ? ret : -EIO;
> > +		return (ret == -EAGAIN || ret == -EWOULDBLOCK) ?
> > -EINVAL : ret;
> > +	}
> 
> We may need to unify the duplicated codes here with send_cmd.
> 
Yes, will refactor.

Thank you
> 
> > +
> > +	memcpy(val, &buf.resp.data, len);
> > +
> > +	return 0;
> >   }
> >   
> >   static int
> >   ioregion_write(struct kvm_vcpu *vcpu, struct kvm_io_device *this,
> > gpa_t addr,
> >   		int len, const void *val)
> >   {
> > -	return -EOPNOTSUPP;
> > +	struct ioregion *p = to_ioregion(this);
> > +	union {
> > +		struct ioregionfd_cmd cmd;
> > +		struct ioregionfd_resp resp;
> > +	} buf;
> > +	int ret = 0;
> > +	int state = SEND_CMD;
> > +
> > +	if (unlikely(vcpu->ioregion_ctx.is_interrupted)) {
> > +		vcpu->ioregion_ctx.is_interrupted = false;
> > +
> > +		switch (vcpu->ioregion_ctx.state) {
> > +		case SEND_CMD:
> > +			goto send_cmd;
> > +		case GET_REPLY:
> > +			goto get_repl;
> > +		default:
> > +			return -EINVAL;
> > +		}
> > +	}
> > +
> > +send_cmd:
> > +	memset(&buf, 0, sizeof(buf));
> > +	if (!pack_cmd(&buf.cmd, addr - p->paddr, len,
> > IOREGIONFD_CMD_WRITE,
> > +		      p->posted_writes ? 0 : 1, p->user_data, val))
> > +		return -EOPNOTSUPP;
> > +
> > +	ret = kernel_write(p->wf, &buf.cmd, sizeof(buf.cmd), 0);
> > +	state = (ret == sizeof(buf.cmd)) ? GET_REPLY : SEND_CMD;
> > +	if (signal_pending(current) && state == SEND_CMD) {
> > +		ioregion_save_ctx(vcpu, 0, addr, state, (void *)val);
> > +		return -EINTR;
> > +	}
> > +	if (ret != sizeof(buf.cmd)) {
> > +		ret = (ret < 0) ? ret : -EIO;
> > +		return (ret == -EAGAIN || ret == -EWOULDBLOCK) ?
> > -EINVAL : ret;
> > +	}
> > +
> > +get_repl:
> > +	if (!p->posted_writes) {
> > +		memset(&buf, 0, sizeof(buf));
> > +		ret = kernel_read(p->rf, &buf.resp, sizeof(buf.resp),
> > 0);
> > +		state = (ret == sizeof(buf.resp)) ? COMPLETE :
> > GET_REPLY;
> > +		if (signal_pending(current) && state == GET_REPLY) {
> > +			ioregion_save_ctx(vcpu, 0, addr, state, (void
> > *)val);
> > +			return -EINTR;
> > +		}
> > +		if (ret != sizeof(buf.resp)) {
> > +			ret = (ret < 0) ? ret : -EIO;
> > +			return (ret == -EAGAIN || ret == -EWOULDBLOCK)
> > ? -EINVAL : ret;
> > +		}
> > +	}
> > +
> > +	return 0;
> 
> It looks to me we had more chance to unify the code with
> ioregion_read().
> 
> Thanks
> 
> 
> >   }
> >   
> >   /*


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v3 2/5] KVM: x86: add support for ioregionfd signal handling
  2021-03-09  5:51   ` Jason Wang
@ 2021-03-17 14:19     ` Elena Afanasova
  2021-03-26  6:00       ` Jason Wang
  0 siblings, 1 reply; 30+ messages in thread
From: Elena Afanasova @ 2021-03-17 14:19 UTC (permalink / raw)
  To: Jason Wang, kvm
  Cc: stefanha, jag.raman, elena.ufimtseva, pbonzini, mst, cohuck, john.levon

On Tue, 2021-03-09 at 13:51 +0800, Jason Wang wrote:
> On 2021/2/21 8:04 PM, Elena Afanasova wrote:
> > The vCPU thread may receive a signal during ioregionfd
> > communication,
> > ioctl(KVM_RUN) needs to return to userspace and then ioctl(KVM_RUN)
> > must resume ioregionfd.
> 
> After a glance at the patch, I wonder can we split the patch into
> two?
> 
> 1) sleepable iodevice which is not supported currently, probably with
> a 
> new cap?
> 2) ioregionfd specific codes (I wonder if it has any)
> 
> Then the sleepable iodevice could be reused by future features.
> 
Do you have any other possible use cases in mind? Could you please
describe your idea in more detail?

> 
> > Signed-off-by: Elena Afanasova <eafanasova@gmail.com>
> > ---
> > v3:
> >   - add FAST_MMIO bus support
> >   - move ioregion_interrupted flag to ioregion_ctx
> >   - reorder ioregion_ctx fields
> >   - rework complete_ioregion operations
> >   - add signal handling support for crossing a page boundary case
> >   - fix kvm_arch_vcpu_ioctl_run() should return -EINTR in case
> > ioregionfd
> >     is interrupted
> > 
> >   arch/x86/kvm/vmx/vmx.c   |  40 +++++-
> >   arch/x86/kvm/x86.c       | 272
> > +++++++++++++++++++++++++++++++++++++--
> >   include/linux/kvm_host.h |  10 ++
> >   virt/kvm/kvm_main.c      |  16 ++-
> >   4 files changed, 317 insertions(+), 21 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 47b8357b9751..39db31afd27e 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -5357,19 +5357,51 @@ static int handle_ept_violation(struct
> > kvm_vcpu *vcpu)
> >   	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> >   }
> >   
> > +#ifdef CONFIG_KVM_IOREGION
> > +static int complete_ioregion_fast_mmio(struct kvm_vcpu *vcpu)
> > +{
> > +	int ret, idx;
> > +
> > +	idx = srcu_read_lock(&vcpu->kvm->srcu);
> > +	ret = kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS,
> > +			       vcpu->ioregion_ctx.addr, 0, NULL);
> > +	if (ret) {
> > +		ret = kvm_mmu_page_fault(vcpu, vcpu->ioregion_ctx.addr,
> > +					 PFERR_RSVD_MASK, NULL, 0);
> > +		srcu_read_unlock(&vcpu->kvm->srcu, idx);
> > +		return ret;
> > +	}
> > +
> > +	srcu_read_unlock(&vcpu->kvm->srcu, idx);
> > +	return kvm_skip_emulated_instruction(vcpu);
> > +}
> > +#endif
> > +
> >   static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
> >   {
> >   	gpa_t gpa;
> > +	int ret;
> >   
> >   	/*
> >   	 * A nested guest cannot optimize MMIO vmexits, because we have
> > an
> >   	 * nGPA here instead of the required GPA.
> >   	 */
> >   	gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
> > -	if (!is_guest_mode(vcpu) &&
> > -	    !kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
> > -		trace_kvm_fast_mmio(gpa);
> > -		return kvm_skip_emulated_instruction(vcpu);
> > +	if (!is_guest_mode(vcpu)) {
> > +		ret = kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0,
> > NULL);
> > +		if (!ret) {
> > +			trace_kvm_fast_mmio(gpa);
> > +			return kvm_skip_emulated_instruction(vcpu);
> > +		}
> > +
> > +#ifdef CONFIG_KVM_IOREGION
> > +		if (unlikely(vcpu->ioregion_ctx.is_interrupted && ret
> > == -EINTR)) {
> 
> So the question still stands: EINTR looks wrong, which means the
> syscall can't be restarted. Note that the syscall here doesn't mean
> KVM_RUN but actually the kernel_read|write() you want to do with the
> ioregion fd.
> 
> Also, do we need to treat EINTR and ERESTARTSYS differently, since
> EINTR means the kernel_read()|write() can't be resumed?
> 
> Thanks
> 
I don’t mind replacing EINTR with ERESTARTSYS. I think in this case
there is no more need to process EINTR for ioregionfd. Also it seems
that the QEMU code doesn’t support ERESTARTSYS handling. Can something
like (run_ret == -EINTR || run_ret == -EAGAIN || run_ret ==
-ERESTARTSYS) in kvm_cpu_exec help in this case?
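
Concretely, that would just extend the errnos the loop already retries
on; a sketch of only that branch (paraphrased, not actual QEMU code):

    run_ret = kvm_vcpu_ioctl(cpu, KVM_RUN, 0);
    if (run_ret == -EINTR || run_ret == -EAGAIN ||
        run_ret == -ERESTARTSYS) {
        /* interrupted by a signal: leave kvm_cpu_exec so the main
         * loop re-enters KVM_RUN and the ioregionfd access resumes */
        ret = EXCP_INTERRUPT;
        break;
    }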

Thank you



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v3 1/5] KVM: add initial support for KVM_SET_IOREGION
  2021-03-09  5:26   ` Jason Wang
@ 2021-03-22  9:57     ` Stefan Hajnoczi
  0 siblings, 0 replies; 30+ messages in thread
From: Stefan Hajnoczi @ 2021-03-22  9:57 UTC (permalink / raw)
  To: Jason Wang
  Cc: Elena Afanasova, kvm, stefanha, jag.raman, elena.ufimtseva,
	pbonzini, mst, cohuck, john.levon

On Tue, Mar 09, 2021 at 01:26:48PM +0800, Jason Wang wrote:
> On 2021/2/21 8:04 PM, Elena Afanasova wrote:
> > @@ -1308,6 +1332,7 @@ struct kvm_vfio_spapr_tce {
> >   					struct kvm_userspace_memory_region)
> >   #define KVM_SET_TSS_ADDR          _IO(KVMIO,   0x47)
> >   #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO,  0x48, __u64)
> > +#define KVM_SET_IOREGION          _IOW(KVMIO,  0x49, struct kvm_ioregion)
> 
> 
> I wonder how could we extend ioregion fd in the future? Do we need something
> like handshake or version here?

The struct kvm_ioregion->flags field can be used to enable optional
features, and KVM capabilities can be used to test for their presence.
This might be enough.

A different approach to extensibility is a sizeof(struct kvm_ioregion)
field for arbitrary extensions to the struct.
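
For illustration, the size-based variant could look roughly like this
(the layout and field names here are hypothetical, not the series'
actual uapi):

    struct kvm_ioregion {
        __u32 struct_size;  /* sizeof() as compiled into userspace */
        __u32 flags;
        __u64 guest_paddr;
        __u64 memory_size;
        __u64 user_data;
        __s32 rfd;          /* read fd, or -1 */
        __s32 wfd;          /* write fd */
        /* new fields are appended over time; the kernel treats
         * anything beyond struct_size as zero for older userspace */
    };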

> > +static int
> > +kvm_set_ioregion(struct kvm *kvm, struct kvm_ioregion *args)
> > +{
> > +	int ret;
> > +
> > +	enum kvm_bus bus_idx = get_bus_from_flags(args->flags);
> > +
> > +	/* check for range overflow */
> > +	if (args->guest_paddr + args->memory_size < args->guest_paddr)
> > +		return -EINVAL;
> > +	/* If size is ignored only posted writes are allowed */
> > +	if (!args->memory_size && !(args->flags & KVM_IOREGION_POSTED_WRITES))
> 
> 
> We don't have flags like KVM_IOREGION_POSTED_WRITES for ioeventfd. Is this a
> must?

There is no way to trigger a FAST_MMIO read. Guest accesses, including
memory loads, appear as writes.

Stefan


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v3 2/5] KVM: x86: add support for ioregionfd signal handling
  2021-03-17 14:19     ` Elena Afanasova
@ 2021-03-26  6:00       ` Jason Wang
  0 siblings, 0 replies; 30+ messages in thread
From: Jason Wang @ 2021-03-26  6:00 UTC (permalink / raw)
  To: Elena Afanasova, kvm
  Cc: stefanha, jag.raman, elena.ufimtseva, pbonzini, mst, cohuck, john.levon


On 2021/3/17 10:19 PM, Elena Afanasova wrote:
> On Tue, 2021-03-09 at 13:51 +0800, Jason Wang wrote:
>> On 2021/2/21 8:04 PM, Elena Afanasova wrote:
>>> The vCPU thread may receive a signal during ioregionfd
>>> communication,
>>> ioctl(KVM_RUN) needs to return to userspace and then ioctl(KVM_RUN)
>>> must resume ioregionfd.
>> After a glance at the patch, I wonder can we split the patch into
>> two?
>>
>> 1) sleepable iodevice which is not supported currently, probably with
>> a
>> new cap?
>> 2) ioregionfd specific codes (I wonder if it has any)
>>
>> Then the sleepable iodevice could be reused by future features.
>>
> Do you have any other possible use cases in mind?


I don't, but the iodevice is a general infrastructure; having a
dedicated patch for it helps a lot, e.g. for reviewing.


> Could you please
> describe your idea in more detail?


So it's something like:

patch 1: introduce the sleepable iodevice infrastructure
patch 2: add ioregionfd on top


>
>>> Signed-off-by: Elena Afanasova <eafanasova@gmail.com>
>>> ---
>>> v3:
>>>    - add FAST_MMIO bus support
>>>    - move ioregion_interrupted flag to ioregion_ctx
>>>    - reorder ioregion_ctx fields
>>>    - rework complete_ioregion operations
>>>    - add signal handling support for crossing a page boundary case
>>>    - fix kvm_arch_vcpu_ioctl_run() should return -EINTR in case
>>> ioregionfd
>>>      is interrupted
>>>
>>>    arch/x86/kvm/vmx/vmx.c   |  40 +++++-
>>>    arch/x86/kvm/x86.c       | 272
>>> +++++++++++++++++++++++++++++++++++++--
>>>    include/linux/kvm_host.h |  10 ++
>>>    virt/kvm/kvm_main.c      |  16 ++-
>>>    4 files changed, 317 insertions(+), 21 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>>> index 47b8357b9751..39db31afd27e 100644
>>> --- a/arch/x86/kvm/vmx/vmx.c
>>> +++ b/arch/x86/kvm/vmx/vmx.c
>>> @@ -5357,19 +5357,51 @@ static int handle_ept_violation(struct
>>> kvm_vcpu *vcpu)
>>>    	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
>>>    }
>>>    
>>> +#ifdef CONFIG_KVM_IOREGION
>>> +static int complete_ioregion_fast_mmio(struct kvm_vcpu *vcpu)
>>> +{
>>> +	int ret, idx;
>>> +
>>> +	idx = srcu_read_lock(&vcpu->kvm->srcu);
>>> +	ret = kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS,
>>> +			       vcpu->ioregion_ctx.addr, 0, NULL);
>>> +	if (ret) {
>>> +		ret = kvm_mmu_page_fault(vcpu, vcpu->ioregion_ctx.addr,
>>> +					 PFERR_RSVD_MASK, NULL, 0);
>>> +		srcu_read_unlock(&vcpu->kvm->srcu, idx);
>>> +		return ret;
>>> +	}
>>> +
>>> +	srcu_read_unlock(&vcpu->kvm->srcu, idx);
>>> +	return kvm_skip_emulated_instruction(vcpu);
>>> +}
>>> +#endif
>>> +
>>>    static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
>>>    {
>>>    	gpa_t gpa;
>>> +	int ret;
>>>    
>>>    	/*
>>>    	 * A nested guest cannot optimize MMIO vmexits, because we have
>>> an
>>>    	 * nGPA here instead of the required GPA.
>>>    	 */
>>>    	gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
>>> -	if (!is_guest_mode(vcpu) &&
>>> -	    !kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
>>> -		trace_kvm_fast_mmio(gpa);
>>> -		return kvm_skip_emulated_instruction(vcpu);
>>> +	if (!is_guest_mode(vcpu)) {
>>> +		ret = kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0,
>>> NULL);
>>> +		if (!ret) {
>>> +			trace_kvm_fast_mmio(gpa);
>>> +			return kvm_skip_emulated_instruction(vcpu);
>>> +		}
>>> +
>>> +#ifdef CONFIG_KVM_IOREGION
>>> +		if (unlikely(vcpu->ioregion_ctx.is_interrupted && ret
>>> == -EINTR)) {
>> So the question still stands: EINTR looks wrong, which means the
>> syscall can't be restarted. Note that the syscall here doesn't mean
>> KVM_RUN but actually the kernel_read|write() you want to do with the
>> ioregion fd.
>>
>> Also, do we need to treat EINTR and ERESTARTSYS differently, since
>> EINTR means the kernel_read()|write() can't be resumed?
>>
>> Thanks
>>
> I don’t mind replacing EINTR with ERESTARTSYS. I think in this case
> there is no more need to process EINTR for ioregionfd. Also it seems
> that the QEMU code doesn’t support ERESTARTSYS handling.


So ERESTARTSYS is something that should not be seen by userspace. When
SA_RESTART is not set for the signal, the kernel will return -EINTR.

My understanding is that you want to implement a mandated restart of the
read()/write() syscall, which should be fine. The problem is that not
every read()/write() can be restarted (i.e. one that doesn't return
-ERESTARTSYS). In that case we probably need to fail the KVM_RUN ioctl.
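
On the kernel side that would mean telling the two cases apart at the
point of the blocking I/O, something like this in the GET_REPLY step (a
sketch reusing this series' ioregion_save_ctx(), not the actual patch
code):

    ret = kernel_read(p->ctx->rf, &buf.resp, sizeof(buf.resp), 0);
    if (ret == -ERESTARTSYS) {
        /* restartable: save state so the next KVM_RUN resumes from
         * GET_REPLY, then let KVM_RUN return -EINTR to userspace */
        ioregion_save_ctx(vcpu, 1, addr, GET_REPLY, val);
        return -EINTR;
    }
    if (ret < 0)
        return ret; /* not restartable: fail the vCPU access */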


> Can something
> like (run_ret == -EINTR || run_ret == -EAGAIN || run_ret ==
> -ERESTARTSYS) in kvm_cpu_exec help in this case?


I can't git grep kvm_cpu_exec in the source.

Thanks


>
> Thank you
>
>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v3 3/5] KVM: implement wire protocol
  2021-03-17 13:08     ` Elena Afanasova
@ 2021-03-26  6:21       ` Jason Wang
  2021-03-29 16:17         ` Stefan Hajnoczi
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Wang @ 2021-03-26  6:21 UTC (permalink / raw)
  To: Elena Afanasova, kvm
  Cc: stefanha, jag.raman, elena.ufimtseva, pbonzini, mst, cohuck, john.levon


On 2021/3/17 9:08 PM, Elena Afanasova wrote:
> On Tue, 2021-03-09 at 14:19 +0800, Jason Wang wrote:
>> On 2021/2/21 8:04 PM, Elena Afanasova wrote:
>>> Add ioregionfd blocking read/write operations.
>>>
>>> Signed-off-by: Elena Afanasova <eafanasova@gmail.com>
>>> ---
>>> v3:
>>>    - change wire protocol license
>>>    - remove ioregionfd_cmd info and drop appropriate macros
>>>    - fix ioregionfd state machine
>>>    - add sizeless ioregions support
>>>    - drop redundant check in ioregion_read/write()
>>>
>>>    include/uapi/linux/ioregion.h |  30 +++++++
>>>    virt/kvm/ioregion.c           | 162
>>> +++++++++++++++++++++++++++++++++-
>>>    2 files changed, 190 insertions(+), 2 deletions(-)
>>>    create mode 100644 include/uapi/linux/ioregion.h
>>>
>>> diff --git a/include/uapi/linux/ioregion.h
>>> b/include/uapi/linux/ioregion.h
>>> new file mode 100644
>>> index 000000000000..58f9b5ba6186
>>> --- /dev/null
>>> +++ b/include/uapi/linux/ioregion.h
>>> @@ -0,0 +1,30 @@
>>> +/* SPDX-License-Identifier: ((GPL-2.0-only WITH Linux-syscall-
>>> note) OR BSD-3-Clause) */
>>> +#ifndef _UAPI_LINUX_IOREGION_H
>>> +#define _UAPI_LINUX_IOREGION_H
>>> +
>>> +/* Wire protocol */
>>> +
>>> +struct ioregionfd_cmd {
>>> +	__u8 cmd;
>>> +	__u8 size_exponent : 4;
>>> +	__u8 resp : 1;
>>> +	__u8 padding[6];
>>> +	__u64 user_data;
>>> +	__u64 offset;
>>> +	__u64 data;
>>> +};
>> Sorry if I've asked this before. Do we need a id for each
>> request/response? E.g an ioregion fd could be used by multiple
>> vCPUS.
>> VCPU needs to have a way to find which request belongs to itself or
>> not?
>>
> I don’t think the id is necessary here since all requests/responses are
> serialized.


It's probably fine for the first version, but it will degrade
performance, e.g. if the ioregionfd is used for multiple queues (e.g. a
doorbell). The design should have the capability to allow extensions
like this.

Thanks


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v3 4/5] KVM: add ioregionfd context
  2021-03-17 10:46               ` Elena Afanasova
@ 2021-03-26  6:47                 ` Jason Wang
  0 siblings, 0 replies; 30+ messages in thread
From: Jason Wang @ 2021-03-26  6:47 UTC (permalink / raw)
  To: Elena Afanasova, Paolo Bonzini, kvm
  Cc: stefanha, jag.raman, elena.ufimtseva, mst, cohuck, john.levon


On 2021/3/17 6:46 PM, Elena Afanasova wrote:
> On Thu, 2021-03-11 at 11:04 +0800, Jason Wang wrote:
>> On 2021/3/11 12:41 AM, Elena Afanasova wrote:
>>> On Wed, 2021-03-10 at 15:11 +0100, Paolo Bonzini wrote:
>>>> On 10/03/21 14:20, Elena Afanasova wrote:
>>>>> On Tue, 2021-03-09 at 09:01 +0100, Paolo Bonzini wrote:
>>>>>> On 09/03/21 08:54, Jason Wang wrote:
>>>>>>>> +        return;
>>>>>>>> +
>>>>>>>> +    spin_lock(&ctx->wq.lock);
>>>>>>>> +    wait_event_interruptible_exclusive_locked(ctx->wq,
>>>>>>>> !ctx-
>>>>>>>>> busy);
>>>>>>> Any reason that a simple mutex_lock_interruptible() can't
>>>>>>> work
>>>>>>> here?
>>>>>> Or alternatively why can't the callers just take the
>>>>>> spinlock.
>>>>>>
>>>>> I'm not sure I understand your question. Do you mean why locked
>>>>> version
>>>>> of wait_event() is used?
>>>> No, I mean why do you need to use ctx->busy and wait_event,
>>>> instead
>>>> of
>>>> operating directly on the spinlock or on a mutex.
>>>>
>>> When ioregionfd communication is interrupted by a signal
>>> ioctl(KVM_RUN)
>>> has to return to userspace. I'm not sure it's ok to do that with
>>> the
>>> spinlock/mutex being held.
>>
>> So you don't answer my question. Why you can't use
>> mutex_lock_interruptible() here?
>>
>> It looks can do exactly what you want here.
>>
>> Thanks
>>
> I think a mutex could work here. I used it for the first implementation.
> But is it ok to hold a mutex across kernel->user transitions? Is that a
> correct pattern in this case?


I may be missing something, but aren't the semantics of the following
two the same?

A:
     spin_lock(&ctx->wq.lock);
     wait_event_interruptible_exclusive_locked(ctx->wq, !ctx->busy);
     ctx->busy = true;
     spin_unlock(&ctx->wq.lock);

B:
     mutex_lock_interruptible(&ctx->lock);


> If the ioctl returns to userspace and then the ioregionfd is deleted or
> the vcpu fd is closed, the mutex will still need to be unlocked.


It's something natural; is it any different from your current code that
uses spinlock + waitqueue? And we can know whether or not we're
interrupted by a signal (I notice you don't check the return value of
wait_event_interruptible_exclusive_locked() and set ctx->busy to true
unconditionally, which is probably wrong).
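
In other words, the snippet above would need to become something like
this (a sketch of the fixed version):

    spin_lock(&ctx->wq.lock);
    ret = wait_event_interruptible_exclusive_locked(ctx->wq, !ctx->busy);
    if (ret) {
        /* -ERESTARTSYS: a signal arrived while waiting */
        spin_unlock(&ctx->wq.lock);
        return ret;
    }
    ctx->busy = true;
    spin_unlock(&ctx->wq.lock);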


>   But I
> think it’s a bit clearer to use wake_up in e.g. kvm_vcpu_release
> instead of mutex_unlock.


I am not sure I understand here (I don't see how it is done in this patch).

Thanks


>   Paolo, could you please also comment on this?
>
>>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC v3 3/5] KVM: implement wire protocol
  2021-03-26  6:21       ` Jason Wang
@ 2021-03-29 16:17         ` Stefan Hajnoczi
  0 siblings, 0 replies; 30+ messages in thread
From: Stefan Hajnoczi @ 2021-03-29 16:17 UTC (permalink / raw)
  To: Jason Wang
  Cc: Elena Afanasova, kvm, jag.raman, elena.ufimtseva, pbonzini, mst,
	cohuck, john.levon

On Fri, Mar 26, 2021 at 02:21:29PM +0800, Jason Wang wrote:
> 
> On 2021/3/17 9:08 PM, Elena Afanasova wrote:
> > On Tue, 2021-03-09 at 14:19 +0800, Jason Wang wrote:
> > On 2021/2/21 8:04 PM, Elena Afanasova wrote:
> > > > Add ioregionfd blocking read/write operations.
> > > > 
> > > > Signed-off-by: Elena Afanasova <eafanasova@gmail.com>
> > > > ---
> > > > v3:
> > > >    - change wire protocol license
> > > >    - remove ioregionfd_cmd info and drop appropriate macros
> > > >    - fix ioregionfd state machine
> > > >    - add sizeless ioregions support
> > > >    - drop redundant check in ioregion_read/write()
> > > > 
> > > >    include/uapi/linux/ioregion.h |  30 +++++++
> > > >    virt/kvm/ioregion.c           | 162
> > > > +++++++++++++++++++++++++++++++++-
> > > >    2 files changed, 190 insertions(+), 2 deletions(-)
> > > >    create mode 100644 include/uapi/linux/ioregion.h
> > > > 
> > > > diff --git a/include/uapi/linux/ioregion.h
> > > > b/include/uapi/linux/ioregion.h
> > > > new file mode 100644
> > > > index 000000000000..58f9b5ba6186
> > > > --- /dev/null
> > > > +++ b/include/uapi/linux/ioregion.h
> > > > @@ -0,0 +1,30 @@
> > > > +/* SPDX-License-Identifier: ((GPL-2.0-only WITH Linux-syscall-
> > > > note) OR BSD-3-Clause) */
> > > > +#ifndef _UAPI_LINUX_IOREGION_H
> > > > +#define _UAPI_LINUX_IOREGION_H
> > > > +
> > > > +/* Wire protocol */
> > > > +
> > > > +struct ioregionfd_cmd {
> > > > +	__u8 cmd;
> > > > +	__u8 size_exponent : 4;
> > > > +	__u8 resp : 1;
> > > > +	__u8 padding[6];
> > > > +	__u64 user_data;
> > > > +	__u64 offset;
> > > > +	__u64 data;
> > > > +};
> > > Sorry if I've asked this before. Do we need a id for each
> > > request/response? E.g an ioregion fd could be used by multiple
> > > vCPUS.
> > > VCPU needs to have a way to find which request belongs to itself or
> > > not?
> > > 
> > I don’t think the id is necessary here since all requests/responses are
> > serialized.
> 
> 
> It's probably fine for the first version, but it will degrade
> performance, e.g. if the ioregionfd is used for multiple queues (e.g. a
> doorbell). The design should have the capability to allow extensions
> like this.

If there is only one doorbell register and one ioregionfd then trying to
process multiple queues in userspace is going to be slow. Adding cmd IDs
doesn't fix this because userspace won't be able to destribute cmds to
multiple queue threads efficiently.

Multiple queues should either use multiple doorbell registers or
ioregionfd needs something like datamatch to dispatch the MMIO/PIO
access to the appropriate queue's ioregionfd. In both cases cmd IDs
aren't necessary.
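
For instance, per-queue doorbells can already be expressed as separate
ioregions, each with its own fd and user_data (a sketch; the struct
field names are paraphrased from this series and the doorbell_base /
queue_fd plumbing is assumed):

    for (i = 0; i < nr_queues; i++) {
        struct kvm_ioregion region = {
            .guest_paddr = doorbell_base + i * DOORBELL_STRIDE,
            .memory_size = 0,  /* sizeless: posted writes only */
            .user_data   = i,  /* tells the handler which queue */
            .rfd         = -1,
            .wfd         = queue_fd[i],
            .flags       = KVM_IOREGION_POSTED_WRITES,
        };

        if (ioctl(vm_fd, KVM_SET_IOREGION, &region) < 0)
            err(1, "KVM_SET_IOREGION");
    }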

Can you think of a case where cmd IDs are needed?

Stefan


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2021-03-29 16:18 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-21 12:04 [RFC v3 0/5] Introduce MMIO/PIO dispatch file descriptors (ioregionfd) Elena Afanasova
2021-02-21 12:04 ` [RFC v3 1/5] KVM: add initial support for KVM_SET_IOREGION Elena Afanasova
2021-02-24 10:06   ` Stefan Hajnoczi
2021-03-05 13:09   ` Cornelia Huck
2021-03-09  5:26   ` Jason Wang
2021-03-22  9:57     ` Stefan Hajnoczi
2021-02-21 12:04 ` [RFC v3 2/5] KVM: x86: add support for ioregionfd signal handling Elena Afanasova
2021-02-24 10:42   ` Stefan Hajnoczi
2021-03-09  5:51   ` Jason Wang
2021-03-17 14:19     ` Elena Afanasova
2021-03-26  6:00       ` Jason Wang
2021-02-21 12:04 ` [RFC v3 3/5] KVM: implement wire protocol Elena Afanasova
2021-02-24 11:02   ` Stefan Hajnoczi
2021-03-09  6:19   ` Jason Wang
2021-03-17 13:08     ` Elena Afanasova
2021-03-26  6:21       ` Jason Wang
2021-03-29 16:17         ` Stefan Hajnoczi
2021-02-21 12:04 ` [RFC v3 4/5] KVM: add ioregionfd context Elena Afanasova
2021-02-24 11:27   ` Stefan Hajnoczi
2021-03-09  7:54   ` Jason Wang
2021-03-09  8:01     ` Paolo Bonzini
2021-03-10 13:20       ` Elena Afanasova
2021-03-10 14:11         ` Paolo Bonzini
2021-03-10 16:41           ` Elena Afanasova
     [not found]             ` <6ff79d0b-3b6a-73d3-ffbd-e4af9758735f@redhat.com>
2021-03-17 10:46               ` Elena Afanasova
2021-03-26  6:47                 ` Jason Wang
2021-02-21 12:04 ` [RFC v3 5/5] KVM: enforce NR_IOBUS_DEVS limit if kmemcg is disabled Elena Afanasova
2021-02-21 17:06 ` [RFC v3 0/5] Introduce MMIO/PIO dispatch file descriptors (ioregionfd) Paolo Bonzini
2021-02-22 16:40   ` Elena Afanasova
2021-02-24 11:34 ` Stefan Hajnoczi
